Some notes on the format of table files used here.
(See README.tables for what to do with them.)
The format is derived from stuff in the console driver of the
Linux kernel (as are the guts of the chartrans machinery).
THAT DOES NOT MEAN that anything here is Linux specific - it isn't.
[Note that the format may change, this is still somewhat experimental.]
There are four kinds of lines:
Summary example:
# This line is a comment, the next line is a directive
O Brand new Charset!
0x41 U+0041 U+0391
U+00cd:I'
Description:
a) comment lines start with a '#' character.
(trailing comments are allowed on some of the other lines, if in doubt
check the examples..)
b) directives:
start with a keyword which may be abbreviated to one letter (first
letter must be capitalized), followed by space and a value.
Currently recognized:
OptionName
The name under which this should appear on the O)ptions screen
in the list for Display Character Set
MIMEName
The name for this charset in MIME syntax (one word with digits
and some other non-letters allowed, should be IANA registered)
Default
If "Y[es]" or "1", this is the default (fallback) translation table,
it will be used for Unicode -> 8bit (or 7bit) translation if no
translation is found in the specific table.
FallBack
Whether to use the default table if no translation is found in
this table. Normally fallback is used, "FallBack NO" or "FallBack 0"
disables it (actually, other values than "FallBack Y[es]" or
"FallBack 1" disable it).
RawOrEnc
a number which flags some special property (encoding) for this
charset [see utf8_uni.tbl for example, see UCDefs.h for details].
Codepage number (IBM specific)
used by OS/2 font-switching code.
c) character translation definitions:
they look like
0x41 U+0041 U+0391 ...
and are used for "forward" translation (mapping this charset to Unicode)
AS WELL AS "back" translation (mapping Unicodes to an 8-bit
[incl. 7-bit ASCII] code).
For the "forward" direction, only the first Unicode is used; for
"back" translation, all listed Unicodes are mapped to the byte (i.e.
code point) on the left.
The above example line would tell the chartrans mechanism:
"For this charset, code position 65 [hex 0x41] contains Unicode
U+0041 (LATIN CAPITAL LETTER A). For translation of Unicodes to
this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."
[Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
will (probably) not be used by Lynx. It doesn't hurt to list those,
too, for completeness.]
Some other forms are also accepted:
* Syntax accepted:
* <fontpos> <unicode> <unicode> ...
* <fontpos> <unicode range> <unicode range> ...
* <fontpos> idem
* <range> idem
* <range> <unicode range>
*
* where <unicode range> ::= <unicode>-<unicode>
* and <unicode> ::= U+<h><h><h><h>
* and <h> ::= <hexadecimal digit>
*
[Note that <fontpos> _without_ targets assumed notdefined,
so tables from ftp.unicode.org need no patching.]
d) string replacement definitions:
They look like
U+00cd:I'
which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
ACUTE" with the string (consisting of two character) I' (if no other
translation is available)." Please note that replacement definitions
in certain charset table will override ones from the Default table.
Note that everything after the ':' is currently taken VERBATIM, so
careful with trailing blanks etc. Please use <C replace> syntax below
when you need trailing spaces.
* Syntax accepted:
* <unicode> :<replace>
* <unicode range> :<replace>
* <unicode> "<C replace>"
* <unicode range> "<C replace>"
*
* where <unicode range> ::= <unicode>-<unicode>
* and <unicode> ::= U+<h><h><h><h>
* and <h> ::= <hexadecimal digit>
* and <replace> any string not containing '\n' or '\0', taken verbatim
* and <C replace> any string, with backslash having the usual C meaning.
Motivation:
- It is an extension of the format already in use for Linux (kernel,
kbd package), those files can be used with some minimal editing.
- It is easy to convert Unicode tables for other charsets, as they
are commonly found on ftp sites etc., to this format - the right
sed command should do 99% of the work.
- The format is independent of details of other parts of the Lynx code,
unlike the "old" LYCharsets.c mechanism. The tables don't have to
be changed in synch when e.g., new entities are added to the entities.h.
Note: the Default "7bit approximation" table can be used for
case-insensitive search for non-ascii letters if no upper/lower case
information provided by other means, e.g., locale. It is assumed that
upper/lower case letters have their "7bit approximation" images
in def7_uni.tbl matched case-insensitively.