Diffstat (limited to 'src/chrtrans/README.format')
-rw-r--r-- | src/chrtrans/README.format | 138 |
1 files changed, 138 insertions, 0 deletions
diff --git a/src/chrtrans/README.format b/src/chrtrans/README.format
new file mode 100644
index 00000000..7437b503
--- /dev/null
+++ b/src/chrtrans/README.format
@@ -0,0 +1,138 @@
+Some notes on the format of table files used here.
+(See README.tables for what to do with them.)
+
+The format is derived from stuff in the console driver of the
+Linux kernel (as are the guts of the chartrans machinery).
+THAT DOES NOT MEAN that anything here is Linux specific - it isn't.
+
+[Note that the format may change, this is still somewhat experimental.]
+
+There are four kinds of lines:
+
+Summary example:
+
+   # This line is a comment, the next line is a directive
+   O Brand new Charset!
+   0x41 U+0041 U+0391
+   U+00cd:I'
+
+Description:
+
+a) comment lines start with a '#' character.
+   (trailing comments are allowed on some of the other lines, if in doubt
+   check the examples..)
+
+b) directives:
+   start with a keyword which may be abbreviated to one letter (first
+   letter must be capitalized), followed by space and a value.
+   Currently recognized:
+
+   OptionName
+        The name under which this should appear on the O)ptions screen
+        in the list for Display Character Set
+   MIMEName
+        The name for this charset in MIME syntax (one word with digits
+        and some other non-letters allowed, should be IANA registered)
+   Default
+        If "Y[es]" or "1", this is the default (fallback) translation table,
+        it will be used for Unicode -> 8bit (or 7bit) translation if no
+        translation is found in the specific table.
+   FallBack
+        Whether to use the default table if no translation is found in
+        this table. Normally fallback is used, "FallBack NO" or "FallBack 0"
+        disables it (actually, other values than "FallBack Y[es]" or
+        "FallBack 1" disable it).
+
+   RawOrEnc
+        a number which flags some special property (encoding) for this
+        charset [see utf8_uni.tbl for example, see UCDefs.h for details].
+
+   Codepage number (IBM specific)
+        used by OS/2 font-switching code.
+
+c) character translation definitions:
+   they look like
+
+       0x41 U+0041 U+0391 ...
+
+   and are used for "forward" translation (mapping this charset to Unicode)
+   AS WELL AS "back" translation (mapping Unicodes to an 8-bit
+   [incl. 7-bit ASCII] code).
+
+   For the "forward" direction, only the first Unicode is used; for
+   "back" translation, all listed Unicodes are mapped to the byte (i.e.
+   code point) on the left.
+
+   The above example line would tell the chartrans mechanism:
+   "For this charset, code position 65 [hex 0x41] contains Unicode
+   U+0041 (LATIN CAPITAL LETTER A). For translation of Unicodes to
+   this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
+   LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."
+
+   [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
+   will (probably) not be used by Lynx. It doesn't hurt to list those,
+   too, for completeness.]
+
+   Some other forms are also accepted:
+
+ * Syntax accepted:
+ *      <fontpos>       <unicode> <unicode> ...
+ *      <fontpos>       <unicode range> <unicode range> ...
+ *      <fontpos>       idem
+ *      <range>         idem
+ *      <range>         <unicode range>
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ *
+   [Note that <fontpos> _without_ targets assumed notdefined,
+   so tables from ftp.unicode.org need no patching.]
+
+
+d) string replacement definitions:
+
+   They look like
+
+       U+00cd:I'
+
+   which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
+   ACUTE" with the string (consisting of two character) I' (if no other
+   translation is available)." Please note that replacement definitions
+   in certain charset table will override ones from the Default table.
+
+   Note that everything after the ':' is currently taken VERBATIM, so
+   careful with trailing blanks etc. Please use <C replace> syntax below
+   when you need trailing spaces.
+
+ * Syntax accepted:
+ *      <unicode>       :<replace>
+ *      <unicode range> :<replace>
+ *      <unicode>       "<C replace>"
+ *      <unicode range> "<C replace>"
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ * and <replace> any string not containing '\n' or '\0', taken verbatim
+ * and <C replace> any string, with backslash having the usual C meaning.
+
+Motivation:
+
+- It is an extension of the format already in use for Linux (kernel,
+  kbd package), those files can be used with some minimal editing.
+
+- It is easy to convert Unicode tables for other charsets, as they
+  are commonly found on ftp sites etc., to this format - the right
+  sed command should do 99% of the work.
+
+- The format is independent of details of other parts of the Lynx code,
+  unlike the "old" LYCharsets.c mechanism. The tables don't have to
+  be changed in synch when e.g., new entities are added to the entities.h.
+
+
+Note: the Default "7bit approximation" table can be used for
+case-insensitive search for non-ascii letters if no upper/lower case
+information provided by other means, e.g., locale. It is assumed that
+upper/lower case letters have their "7bit approximation" images
+in def7_uni.tbl matched case-insensitively.
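
The four line kinds described in the README added by this diff can be illustrated with a small reader. This is only a rough Python sketch, not the table compiler that Lynx itself uses: the names `parse_line` and `build_tables` are hypothetical, only the simplest forms are handled (a single hex `<fontpos>`, single `<unicode>` targets, and the `:<replace>` form), and the range, `idem`, and `"<C replace>"` forms are left out on purpose.

```python
import re

# Matches one <unicode> token as described in the README: U+ and four hex digits.
UNI = re.compile(r"U\+([0-9A-Fa-f]{4})")


def parse_line(line):
    """Classify one table line; returns a (kind, payload) tuple (hypothetical API)."""
    # Keep trailing blanks: text after ':' in a replacement is taken verbatim.
    text = line.rstrip("\n").lstrip()
    if not text.strip():
        return ("blank", None)
    if text.startswith("#"):
        return ("comment", text[1:].strip())
    if text.startswith("U+"):
        # Only the simple "<unicode> :<replace>" form is handled in this sketch.
        m = re.match(r"U\+([0-9A-Fa-f]{4})\s*:(.*)$", text)
        if m:
            return ("replacement", (int(m.group(1), 16), m.group(2)))
        return ("unknown", text)
    if text[:2].lower() == "0x":
        # Assumption: a trailing '#' comment may follow a translation line.
        fields = text.split("#", 1)[0].split()
        byte = int(fields[0], 16)
        targets = []
        for field in fields[1:]:
            m = UNI.fullmatch(field)
            if m:
                targets.append(int(m.group(1), 16))
        return ("translation", (byte, targets))
    if text[0].isupper():
        # Directive: capitalized keyword (possibly one letter), a space, a value.
        keyword, _, value = text.partition(" ")
        return ("directive", (keyword, value.strip()))
    return ("unknown", text)


def build_tables(lines):
    """Collect forward, back, replacement and directive maps from table lines."""
    forward, back, replace, directives = {}, {}, {}, {}
    for line in lines:
        kind, payload = parse_line(line)
        if kind == "translation":
            byte, targets = payload
            if targets:
                forward[byte] = targets[0]  # forward: only the first Unicode
            for codepoint in targets:
                back[codepoint] = byte      # back: every listed Unicode
        elif kind == "replacement":
            codepoint, string = payload
            replace[codepoint] = string
        elif kind == "directive":
            keyword, value = payload
            directives[keyword] = value
    return forward, back, replace, directives


example = [
    "# This line is a comment, the next line is a directive",
    "O Brand new Charset!",
    "0x41 U+0041 U+0391",
    "U+00cd:I'",
]
forward, back, replace, directives = build_tables(example)
```

Fed the README's own summary example, the sketch maps byte 0x41 forward to U+0041 only, maps both U+0041 and U+0391 back to 0x41, and records the two-character replacement string for U+00cd.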