src/chrtrans/README.format



Some notes on the format of table files used here.
(See README.tables for what to do with them.)

The format is derived from stuff in the console driver of the
Linux kernel (as are the guts of the chartrans machinery).
THAT DOES NOT MEAN that anything here is Linux specific - it isn't.

[Note that the format may change; this is still somewhat experimental.]

There are four kinds of lines: comments, directives, character
translation definitions, and string replacement definitions.

Summary example:

  # This line is a comment, the next line is a directive
  O Brand new Charset!
  0x41    U+0041 U+0391
  U+00cd:I'
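
As a rough illustration of the four line kinds, here is a minimal Python
sketch that tells them apart by their first non-blank character.  The
function name classify() is hypothetical, and the rules it applies
(ignoring leading blanks, treating any capital letter as the start of a
directive) are assumptions made for this sketch, not a description of the
actual table compiler.

```python
def classify(line):
    """Guess the kind of a table line from its first non-blank character.

    Assumption for this sketch: leading whitespace is ignored.
    """
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith("#"):
        return "comment"
    if text.startswith("U+"):
        # Must be tested before the directive rule, since 'U' is a capital.
        return "replacement"
    if text[0].isdigit():
        return "translation"
    if text[0].isupper():
        return "directive"
    return "unknown"


if __name__ == "__main__":
    for sample in ("# a comment", "O Brand new Charset!",
                   "0x41    U+0041 U+0391", "U+00cd:I'"):
        print(classify(sample), "<-", sample)
```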

Description:

a) comment lines start with a '#' character.
   (Trailing comments are allowed on some of the other lines; if in doubt,
   check the examples.)

b) directives:
   start with a keyword which may be abbreviated to one letter (first
   letter must be capitalized), followed by space and a value.
   Currently recognized:

    OptionName
	The name under which this should appear on the O)ptions screen
	in the list for Display Character Set
    MIMEName
	The name for this charset in MIME syntax (one word with digits
	and some other non-letters allowed, should be IANA registered)
    Default
	If "Y[es]" or "1", this is the default (fallback) translation table,
	it will be used for Unicode -> 8bit (or 7bit) translation if no
	translation is found in the specific table.
    FallBack
	Whether to use the default table if no translation is found in
	this table.  Fallback is normally used; "FallBack NO" or "FallBack 0"
	disables it (in fact, any value other than "FallBack Y[es]" or
	"FallBack 1" disables it).

    RawOrEnc
	A number that flags some special property (encoding) of this
	charset [see utf8_uni.tbl for an example and UCDefs.h for details].

    Codepage number (IBM specific)
	used by OS/2 font-switching code.
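
The directive rules above can be sketched in Python as follows.  This is
only an illustration: the names DIRECTIVES, parse_directive(), is_yes()
and fallback_enabled() are hypothetical, "Y[es]" is read here as "Y" or
"Yes" in any letter case (an assumption), and the codepage directive is
left out because its exact keyword is not spelled out above.

```python
# One-letter abbreviation -> full keyword, as listed in this README.
DIRECTIVES = {
    "O": "OptionName",
    "M": "MIMEName",
    "D": "Default",
    "F": "FallBack",
    "R": "RawOrEnc",
}


def parse_directive(line):
    """Return (keyword, value) for a directive line, or None otherwise."""
    keyword, sep, value = line.partition(" ")
    if not sep or not keyword:
        return None
    name = DIRECTIVES.get(keyword[0])  # the first letter must be a capital
    if name is None:
        return None
    # Accept either the one-letter form or the full keyword (assumption:
    # letters after the first are compared case-insensitively).
    if len(keyword) > 1 and keyword.lower() != name.lower():
        return None
    return name, value.strip()


def is_yes(value):
    """True for "Y", "Yes" or "1" (letter case ignored: an assumption)."""
    return value.strip().lower() in ("y", "yes", "1")


def fallback_enabled(value):
    """Fallback is on when the directive is absent (value is None)."""
    return True if value is None else is_yes(value)
```

For example, parse_directive("O Brand new Charset!") yields the pair
("OptionName", "Brand new Charset!"), and fallback_enabled("NO") is False.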

c) character translation definitions:
   they look like

   0x41    U+0041 U+0391 ...

   and are used for "forward" translation (mapping this charset to Unicode)
   AS WELL AS "back" translation (mapping Unicodes to an 8-bit
   [incl. 7-bit ASCII] code).

   For the "forward" direction, only the first Unicode is used; for
   "back" translation, all listed Unicodes are mapped to the byte (i.e.
   code point) on the left.

   The above example line would tell the chartrans mechanism:
   "For this charset, code position 65 [hex 0x41] contains Unicode
    U+0041 (LATIN CAPITAL LETTER A).  For translation of Unicodes to
    this charset, use byte value 65 [hex 0x41] for U+0041 (LATIN CAPITAL
    LETTER A) as well as for U+0391 (GREEK CAPITAL LETTER ALPHA)."

  [Note that for bytes in the ASCII range 0x00-0x7F, the forward translations
   will (probably) not be used by Lynx.  It doesn't hurt to list those,
   too, for completeness.]
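
The two directions can be pictured as two lookup tables built from the
same line.  The Python sketch below handles only the simple form shown
above; add_translation() and the dictionaries forward and back are
hypothetical names, and stripping a trailing '#' comment is an assumption
of this sketch.

```python
import re

UNICODE = re.compile(r"U\+([0-9A-Fa-f]{4})")


def add_translation(line, forward, back):
    """Apply one '<fontpos> <unicode> ...' line to both tables."""
    fields = line.split("#", 1)[0].split()  # drop a trailing comment
    if len(fields) < 2:
        return  # no targets: the position stays undefined
    byte = int(fields[0], 0)  # accepts "0x41" as well as "65"
    codes = []
    for field in fields[1:]:
        match = UNICODE.fullmatch(field)
        if match is None:
            raise ValueError("not a plain Unicode value: " + field)
        codes.append(int(match.group(1), 16))
    forward[byte] = codes[0]  # forward: only the first Unicode is used
    for code in codes:
        back[code] = byte     # back: every listed Unicode maps to the byte


if __name__ == "__main__":
    forward, back = {}, {}
    add_translation("0x41    U+0041 U+0391", forward, back)
    print(forward, back)
```

With the example line, forward maps 0x41 to U+0041 only, while back maps
both U+0041 and U+0391 to byte 0x41.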

   Some other forms are also accepted:

 * Syntax accepted:
 *	<fontpos>	<unicode> <unicode> ...
 *	<fontpos>	<unicode range> <unicode range> ...
 *	<fontpos>	idem
 *	<range>		idem
 *	<range>		<unicode range>
 *
 * where <unicode range> ::= <unicode>-<unicode>
 * and <unicode> ::= U+<h><h><h><h>
 * and <h> ::= <hexadecimal digit>
 *
  [Note that a <fontpos> _without_ targets is assumed to be not defined,
  so tables from ftp.unicode.org need no patching.]
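
A possible reading of the accepted forms, sketched in Python.  The
semantics are not spelled out above, so the following are assumptions of
this sketch: "idem" maps each position to the Unicode with the same
numeric value, a <range> paired with a <unicode range> is mapped
one-to-one (both ranges must have the same length), and a single
<fontpos> receives every Unicode in each listed range.  The name expand()
and its helpers are hypothetical.

```python
import re

UNICODE = re.compile(r"U\+([0-9A-Fa-f]{4})")


def _unicode(token):
    match = UNICODE.fullmatch(token)
    if match is None:
        raise ValueError("bad Unicode value: " + token)
    return int(match.group(1), 16)


def _span(token, convert):
    """Return (first, last) for 'X' or 'X-Y'."""
    low, sep, high = token.partition("-")
    first = convert(low)
    return first, (convert(high) if sep else first)


def expand(line):
    """Expand one translation line into (fontpos, [unicodes]) pairs."""
    fields = line.split("#", 1)[0].split()
    if len(fields) < 2:
        return []  # a position without targets is left undefined
    fp_low, fp_high = _span(fields[0], lambda text: int(text, 0))
    targets = fields[1:]
    if targets == ["idem"]:
        return [(fp, [fp]) for fp in range(fp_low, fp_high + 1)]
    if fp_low != fp_high:
        # <range> <unicode range>: assumed to pair up one-to-one.
        if len(targets) != 1:
            raise ValueError("a position range takes exactly one target")
        u_low, u_high = _span(targets[0], _unicode)
        if u_high - u_low != fp_high - fp_low:
            raise ValueError("ranges differ in length")
        return [(fp_low + i, [u_low + i])
                for i in range(fp_high - fp_low + 1)]
    codes = []
    for token in targets:
        u_low, u_high = _span(token, _unicode)
        codes.extend(range(u_low, u_high + 1))
    return [(fp_low, codes)]
```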


d) string replacement definitions:

  They look like

  U+00cd:I'

  which would mean "Replace Unicode U+00cd (LATIN CAPITAL LETTER I WITH
  ACUTE) with the string (consisting of two characters) I' (if no other
  translation is available)."  Please note that replacement definitions
  in a specific charset table override those from the Default table.

  Note that everything after the ':' is currently taken VERBATIM, so
  be careful with trailing blanks etc.  Please use the <C replace> syntax
  below when you need trailing spaces.

 * Syntax accepted:
 *      <unicode>	:<replace>
 *      <unicode range>	:<replace>
 *      <unicode>	"<C replace>"
 *      <unicode range>	"<C replace>"
 *
 * where <unicode range> ::= <unicode>-<unicode>
 * and <unicode> ::= U+<h><h><h><h>
 * and <h> ::= <hexadecimal digit>
 * and <replace> any string not containing '\n' or '\0', taken verbatim
 * and <C replace> any string, with backslash having the usual C meaning.
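
The two replacement syntaxes can be sketched in Python like this.  The
names parse_replacement() and unescape_c() are hypothetical, and the
sketch makes two assumptions: the quoted form ends at the first unescaped
double quote, and only a small subset of C escapes (\n, \t, \\ and \") is
decoded, with any other escaped character passed through unchanged.

```python
import re

LINE = re.compile(
    r'(U\+[0-9A-Fa-f]{4})(?:-(U\+[0-9A-Fa-f]{4}))?[ \t]*(:|")(.*)',
    re.DOTALL)
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def unescape_c(text):
    """Decode a quoted <C replace> body up to its closing quote."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            break  # closing quote ends the string
        if ch == "\\" and i + 1 < len(text):
            i += 1
            out.append(SIMPLE_ESCAPES.get(text[i], text[i]))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def parse_replacement(line):
    """Return {unicode: replacement} for one line, or None if not one."""
    match = LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    first = int(match.group(1)[2:], 16)
    last = int(match.group(2)[2:], 16) if match.group(2) else first
    if match.group(3) == ":":
        text = match.group(4)  # verbatim, trailing blanks included
    else:
        text = unescape_c(match.group(4))
    return {code: text for code in range(first, last + 1)}
```

Note how the verbatim form keeps a trailing blank as part of the
replacement, whereas the quoted form makes the intended blanks explicit.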

Motivation:

- It is an extension of the format already in use for Linux (kernel,
  kbd package); those files can be used with some minimal editing.

- It is easy to convert Unicode tables for other charsets, as they
  are commonly found on ftp sites etc., to this format - the right
  sed command should do 99% of the work.

- The format is independent of details of other parts of the Lynx code,
  unlike the "old" LYCharsets.c mechanism.  The tables don't have to
  be changed in sync when, e.g., new entities are added to entities.h.


Note: the Default "7bit approximation" table can be used for
case-insensitive search for non-ASCII letters if no upper/lower case
information is provided by other means, e.g., the locale.  It is assumed
that the upper- and lower-case forms of a letter have "7bit approximation"
images in def7_uni.tbl that match case-insensitively.