Diffstat (limited to 'src/chrtrans/README.format')
-rw-r--r-- src/chrtrans/README.format 138
1 files changed, 138 insertions, 0 deletions
diff --git a/src/chrtrans/README.format b/src/chrtrans/README.format
new file mode 100644
index 00000000..7437b503
--- /dev/null
+++ b/src/chrtrans/README.format
@@ -0,0 +1,138 @@
+Some notes on the format of table files used here.
+(See README.tables for what to do with them.)
+
+The format is derived from stuff in the console driver of the
+Linux kernel (as are the guts of the chartrans machinery).
+THAT DOES NOT MEAN that anything here is Linux specific - it isn't.
+
+[Note that the format may change; this is still somewhat experimental.]
+
+There are four kinds of lines:
+
+Summary example:
+
+  # This line is a comment, the next line is a directive
+  O Brand new Charset!
+  0x41    U+0041 U+0391
+  U+00cd:I'
+
+Description:
+
+a) comment lines start with a '#' character.
+   (trailing comments are allowed on some of the other lines, if in doubt
+   check the examples..)
+
+b) directives:
+   start with a keyword which may be abbreviated to one letter (first
+   letter must be capitalized), followed by space and a value.
+   Currently recognized:
+
+    OptionName
+	The name under which this should appear on the O)ptions screen
+	in the list for Display Character Set
+    MIMEName
+	The name for this charset in MIME syntax (one word with digits
+	and some other non-letters allowed, should be IANA registered)
+    Default
+	If "Y[es]" or "1", this is the default (fallback) translation table,
+	it will be used for Unicode -> 8bit (or 7bit) translation if no
+	translation is found in the specific table.
+    FallBack
+	Whether to use the default table if no translation is found in
+	this table.  Normally fallback is used, "FallBack NO" or "FallBack 0"
+	disables it (actually, other values than "FallBack Y[es]" or
+	"FallBack 1" disable it).
+
+    RawOrEnc
+	a number which flags some special property (encoding) for this
+	charset [see utf8_uni.tbl for example, see UCDefs.h for details].
+
+    Codepage number (IBM specific)
+	used by OS/2 font-switching code.
+
+c) character translation definitions:
+   they look like
+
+   0x41    U+0041 U+0391 ...
+
+   and are used for "forward" translation (mapping this charset to Unicode)
+   AS WELL AS "back" translation (mapping Unicodes to an 8-bit
+   [incl. 7-bit ASCII] code).
+
+   For the "forward" direction, only the first Unicode is used; for
+   "back" translation, all listed Unicodes are mapped to the byte (i.e.
+   code point) on the left.
+
+   The above example line would tell the chartrans mechanism:
+   "For this charset, code position 65 [hex 0x41] contains Unicode
+    U+0041 (LATIN CAPITAL LETTER A).  For translation of Unicodes to
+    this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
+    LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."
+
+  [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
+   will (probably) not be used by Lynx.  It doesn't hurt to list those,
+   too, for completeness.]
+
+   Some other forms are also accepted:
+
+ * Syntax accepted:
+ *	<fontpos>	<unicode> <unicode> ...
+ *	<fontpos>	<unicode range> <unicode range> ...
+ *	<fontpos>	idem
+ *	<range>		idem
+ *	<range>		<unicode range>
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ *
+  [Note that a <fontpos> _without_ targets is assumed to be undefined,
+  so tables from ftp.unicode.org need no patching.]
+
+
+d) string replacement definitions:
+
+  They look like
+
+  U+00cd:I'
+
+  which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
+  ACUTE) with the string (consisting of two characters) I' (if no other
+  translation is available)."  Please note that replacement definitions
+  in a specific charset table override those from the Default table.
+
+  Note that everything after the ':' is currently taken VERBATIM, so
+  be careful with trailing blanks etc.  Please use the <C replace> syntax
+  below when you need trailing spaces.
+
+ * Syntax accepted:
+ *      <unicode>	:<replace>
+ *      <unicode range>	:<replace>
+ *      <unicode>	"<C replace>"
+ *      <unicode range>	"<C replace>"
+ *
+ * where <unicode range> ::= <unicode>-<unicode>
+ * and <unicode> ::= U+<h><h><h><h>
+ * and <h> ::= <hexadecimal digit>
+ * and <replace> any string not containing '\n' or '\0', taken verbatim
+ * and <C replace> any string, with backslash having the usual C meaning.
+
+Motivation:
+
+- It is an extension of the format already in use for Linux (kernel,
+  kbd package); those files can be used with some minimal editing.
+
+- It is easy to convert Unicode tables for other charsets, as they
+  are commonly found on ftp sites etc., to this format - the right
+  sed command should do 99% of the work.
+
+- The format is independent of details of other parts of the Lynx code,
+  unlike the "old" LYCharsets.c mechanism.  The tables don't have to
+  be changed in sync when, e.g., new entities are added to entities.h.
+
+
+Note: the Default "7bit approximation" table can be used for
+case-insensitive search for non-ASCII letters if no upper/lower case
+information is provided by other means, e.g., locale.  It is assumed that
+upper/lower case letters have their "7bit approximation" images
+in def7_uni.tbl matched case-insensitively.
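
Illustration (not part of the patch above): the forward/back rule from
section c) and the verbatim rule from section d) of the README can be
sketched in a few lines of Python.  This is only a rough sketch and not
the code Lynx itself uses; all function names are hypothetical.  It
assumes that a trailing '#' comment is allowed on translation lines and
that the first byte listed for a Unicode wins in the back table.  It
ignores directives, <range> on the left-hand side, "idem", and the
"<C replace>" form.

```python
import re

# <unicode> or <unicode range>, as defined in the README grammar.
_UNI = re.compile(r"U\+([0-9A-Fa-f]{4})(?:-U\+([0-9A-Fa-f]{4}))?")


def parse_unicodes(tokens):
    """Expand tokens such as 'U+0041' or 'U+0041-U+0043' into code points."""
    out = []
    for tok in tokens:
        m = _UNI.fullmatch(tok)
        if m is None:
            raise ValueError(f"not a <unicode> or <unicode range>: {tok!r}")
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        out.extend(range(first, last + 1))
    return out


def parse_translation_line(line):
    """Parse '<fontpos> <unicode> ...' into (byte, [code points])."""
    # Assumption: anything after '#' on such a line is a trailing comment.
    fields = line.split("#", 1)[0].split()
    return int(fields[0], 0), parse_unicodes(fields[1:])


def parse_replacement_line(line):
    """Parse '<unicode>:<replace>'; text after ':' is kept verbatim."""
    target, sep, replace = line.rstrip("\n").partition(":")
    if not sep:
        raise ValueError("replacement line without ':'")
    return parse_unicodes([target.strip()]), replace


def build_tables(lines):
    """Return (forward, back, replacements) dictionaries for a table."""
    forward, back, replacements = {}, {}, {}
    for raw in lines:
        line = raw.lstrip()
        if line.startswith("U+"):
            cps, text = parse_replacement_line(line)
            for cp in cps:
                replacements[cp] = text
        elif line[:1].isdigit():
            byte, cps = parse_translation_line(line)
            if cps:
                forward[byte] = cps[0]      # forward: first Unicode only
            for cp in cps:
                back.setdefault(cp, byte)   # back: every listed Unicode
        # Comments, directives and blank lines are skipped in this sketch.
    return forward, back, replacements
```

Feeding the four lines of the summary example to build_tables() maps
byte 0x41 forward to U+0041, maps both U+0041 and U+0391 back to 0x41,
and records the two-character string I' as the replacement for U+00cd.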