about summary refs log tree commit diff stats
path: root/src/chrtrans/README.format
diff options
context:
space:
mode:
Diffstat (limited to 'src/chrtrans/README.format')
-rw-r--r--src/chrtrans/README.format103
1 files changed, 103 insertions, 0 deletions
diff --git a/src/chrtrans/README.format b/src/chrtrans/README.format
new file mode 100644
index 00000000..e5eca7bf
--- /dev/null
+++ b/src/chrtrans/README.format
@@ -0,0 +1,103 @@
+Some notes on the format of table files used here.
+(See README.tables for what to do with them.)
+
+The format is derived from stuff in the console driver of the
+Linux kernel (as are the guts of the chartrans machinery).
+THAT DOES NOT MEAN that anything here is Linux specific - it isn't.
+
+[Note that the format may change, this is still somewhat experimental.]
+
+There are four kinds of lines:
+
+Summary example:
+
+  # This line is a comment, the next line is a directive
+  OBrand new Charset!
+  0x41    U+0041 U+0391
+  U+00cd:I'
+
+Description:
+
+a) comment lines start with a '#' character.
+   (trailing comments are allowed on some of the other lines, if in doubt
+   check the examples..)
+
+b) directives:
+   start with a special character, currently recognized are the letters:
+ 
+    O   The name under which this should appear on the O)ptions screen
+    M   The name for this charset in MIME syntax (one word with digits
+        and some other non-letters allowed, should be IANA registered)
+    R   a number which flags some special property (encoding) for this
+        charset [see utf8.uni for example, see UCDefs.h for details].
+
+c) character translation definitions:
+   they look like
+
+   0x41    U+0041 U+0391 ...
+
+   and are used for "forward" translation (mapping this charset to Unicode)
+   AS WELL AS "back" translation (mapping Unicodes to an 8-bit 
+   [incl. 7-bit ASCII] code).
+
+   For the "forward" direction, only the first Unicode is used; for
+   "back" translation, all listed Unicodes are mapped to the byte (i.e.
+   code point) on the left.
+
+   The above example line would tell the chartrans mechanism:
+   "For this charset, code position 65 [hex 0x41] contains Unicode
+    U+0041 (LATIN CAPITAL LETTER A).  For translation of Unicodes to
+    this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL 
+    LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."
+
+  [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
+   will (probably) not be used by Lynx.  It doesn't hurt to list those,
+   too, for completeness.]
+
+   Some other forms are also accepted:
+
+ * Syntax accepted:
+ *	<fontpos>	<unicode> <unicode> ...
+ *	<fontpos>	<unicode range> <unicode range> ...
+ *	<fontpos>	idem
+ *	<range>		idem
+ *	<range>		<unicode range>
+ *
+ * where <range> ::= <fontpos>-<fontpos>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+
+d) string replacement definitions:
+
+  They look like
+
+  U+00cd:I'
+
+  which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH 
+  ACUTE" with the string (consisting of two character) I' (if no other
+  translation is available)."
+
+  Note that everything after the ':' is currently taken VERBATIM, so
+  careful with trailing blanks etc.
+
+ * Syntax accepted:
+ *      <unicode>	:<replace>
+ *      <unicode range>	:<replace>
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ * and <replace> any string not containing '\n' or '\0'
+
+Motivation:
+
+- It is an extention of the format already in use for Linux (kernel,
+  kbd package), those files can be used with some minimal editing.
+
+- It is easy to convert Unicode tables for other charsets, as they
+  are commonly found on ftp sites etc., to this format - the right
+  sed command should do 99% of the work.
+
+- The format is independent of details of other parts of the Lynx code,
+  unlike the "old" LYCharsets.c mechanism.  The tables don't have to
+  be changed in synch when e.g. new entities are added to the HTMLDTD.