diff options
Diffstat (limited to 'src/chrtrans/README.format')
-rw-r--r-- | src/chrtrans/README.format | 103 |
1 files changed, 103 insertions, 0 deletions
diff --git a/src/chrtrans/README.format b/src/chrtrans/README.format new file mode 100644 index 00000000..e5eca7bf --- /dev/null +++ b/src/chrtrans/README.format @@ -0,0 +1,103 @@ +Some notes on the format of table files used here. +(See README.tables for what to do with them.) + +The format is derived from stuff in the console driver of the +Linux kernel (as are the guts of the chartrans machinery). +THAT DOES NOT MEAN that anything here is Linux specific - it isn't. + +[Note that the format may change, this is still somewhat experimental.] + +There are four kinds of lines: + +Summary example: + + # This line is a comment, the next line is a directive + OBrand new Charset! + 0x41 U+0041 U+0391 + U+00cd:I' + +Description: + +a) comment lines start with a '#' character. + (trailing comments are allowed on some of the other lines, if in doubt + check the examples..) + +b) directives: + start with a special character, currently recognized are the letters: + + O The name under which this should appear on the O)ptions screen + M The name for this charset in MIME syntax (one word with digits + and some other non-letters allowed, should be IANA registered) + R a number which flags some special property (encoding) for this + charset [see utf8.uni for example, see UCDefs.h for details]. + +c) character translation definitions: + they look like + + 0x41 U+0041 U+0391 ... + + and are used for "forward" translation (mapping this charset to Unicode) + AS WELL AS "back" translation (mapping Unicodes to an 8-bit + [incl. 7-bit ASCII] code). + + For the "forward" direction, only the first Unicode is used; for + "back" translation, all listed Unicodes are mapped to the byte (i.e. + code point) on the left. + + The above example line would tell the chartrans mechanism: + "For this charset, code position 65 [hex 0x41] contains Unicode + U+0041 (LATIN CAPITAL LETTER A). For translation of Unicodes to + this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL + LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)." + + [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations + will (probably) not be used by Lynx. It doesn't hurt to list those, + too, for completeness.] + + Some other forms are also accepted: + + * Syntax accepted: + * <fontpos> <unicode> <unicode> ... + * <fontpos> <unicode range> <unicode range> ... + * <fontpos> idem + * <range> idem + * <range> <unicode range> + * + * where <range> ::= <fontpos>-<fontpos> + * and <unicode> ::= U+<h><h><h><h> + * and <h> ::= <hexadecimal digit> + +d) string replacement definitions: + + They look like + + U+00cd:I' + + which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH + ACUTE" with the string (consisting of two character) I' (if no other + translation is available)." + + Note that everything after the ':' is currently taken VERBATIM, so + careful with trailing blanks etc. + + * Syntax accepted: + * <unicode> :<replace> + * <unicode range> :<replace> + * + * where <unicode range> ::= <unicode>-<unicode> + * and <unicode> ::= U+<h><h><h><h> + * and <h> ::= <hexadecimal digit> + * and <replace> any string not containing '\n' or '\0' + +Motivation: + +- It is an extention of the format already in use for Linux (kernel, + kbd package), those files can be used with some minimal editing. + +- It is easy to convert Unicode tables for other charsets, as they + are commonly found on ftp sites etc., to this format - the right + sed command should do 99% of the work. + +- The format is independent of details of other parts of the Lynx code, + unlike the "old" LYCharsets.c mechanism. The tables don't have to + be changed in synch when e.g. new entities are added to the HTMLDTD. |