Mike Brown (mike@hyperreal.com)
-------------------------------

Summary
=======
This document describes peculiarities in the way MS-DOS handles character
sets and provides instructions on how to activate different character sets
that are ISO-8859 compliant.  This is primarily of utility to people who
will be using Lynx on a remote UNIX or VMS system via an MS-DOS based
terminal program.


General Information
===================
Lynx comes with built-in translation tables to map the 8-bit character codes
or ISO-8859-x character entities coming in from an HTML document to their
equivalent codes, where possible, for various character sets, including some
that are not quite the same as ISO-8859-x.  The translations supported as of
the 09-02-96 Lynx2-6 code include:
        "ISO Latin 1         " (ISO-8859-1)
        "ISO Latin 2         " (ISO-8859-2)
        "Other ISO Latin     "
        "DEC Multinational   "
        "IBM PC character set" (CP 437, standard for US)
        "IBM PC codepage 850 " (ISO-8859-1, but see below!)
        "Macintosh (8 bit)   "
        "NeXT character set  "
        "KOI8-R character set"
        "Chinese             "
        "Japanese (EUC)      "
        "Japanese (SJIS)     "
        "Korean              "
        "Taipei (Big5)       "
        "7 bit approximations"

Under ideal conditions, when using Lynx through a system that displays one 
of these character sets, selecting the appropriate character set in the Lynx 
options will ensure proper display of all characters one might encounter in 
HTML documents.

Note that all points of the connection between the display at your end and 
Lynx at the remote end must be 8-bit clean.  If the high bit is being 
stripped at any point in between, the only character set you can use 
(effectively) in Lynx will be "7 bit approximations".  More on that later.
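
To make the effect of a stripped high bit concrete, here is a minimal Python
sketch.  It is an illustration only: the helper name strip_high_bit is
hypothetical, and it simply masks each byte to 7 bits the way a connection
that is not 8-bit clean would.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a link that is not 8-bit clean by clearing bit 7 of each byte."""
    return bytes(b & 0x7F for b in data)

# "café" as ISO-8859-1 bytes: the e-acute is the single byte 0xE9.
sent = "café".encode("latin-1")

# After stripping, 0xE9 becomes 0x69, which is a plain ASCII "i".
received = strip_high_bit(sent)
print(received)  # b'cafi'
```

Any code above 127 is silently turned into an unrelated 7-bit code, which is
why only "7 bit approximations" is usable over such a link.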


MS-DOS character set weirdness
==============================
MS-DOS uses a bass-ackwards character set: in the upper half of the code
range (128-255), many of the characters you would expect have been replaced
by pseudo-graphic line and box-drawing characters, and almost all of the
international characters are mapped to nonstandard numbers.  It also
contains some Greek letters.
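
As a rough way to see this for yourself, the following Python sketch counts
what lives in codes 128-255 of a codepage.  It assumes Python's bundled
"cp437" and "cp850" codecs are a fair stand-in for the DOS codepages; the
function name classify_upper_half and the category labels are made up for
this example.

```python
import unicodedata

def classify_upper_half(codepage: str = "cp437") -> dict:
    """Count character kinds in codes 128-255 of the given codepage."""
    counts = {"box drawing": 0, "greek": 0, "latin letter": 0, "other": 0}
    for code in range(0x80, 0x100):
        char = bytes([code]).decode(codepage)
        name = unicodedata.name(char, "")
        if name.startswith("BOX DRAWINGS"):
            counts["box drawing"] += 1
        elif name.startswith("GREEK"):
            counts["greek"] += 1
        elif name.startswith("LATIN"):
            counts["latin letter"] += 1
        else:
            counts["other"] += 1
    return counts

print(classify_upper_half("cp437"))
print(classify_upper_half("cp850"))
```

Comparing the two results shows that codepage 850 gives up some of the
box-drawing slots in exchange for more accented Latin letters.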

Further confusing matters, there is more than one MS-DOS character set. 
The character sets are referred to as "codepages," each of which has a
unique number.  IBM PCs and compatibles come with one hardware-based
default codepage and a keyboard to match.  In the US market the hardware
codepage is 437.  PCs destined for other regions of the world often have a
different default codepage which contains characters for other languages
and keyboards.  Under MS-DOS, one can load different codepages into memory
and use one of them instead of the hardware default.

If you are using Lynx through an MS-DOS based terminal program or telnet 
client, you should use the "IBM PC character set" in Lynx.  I believe this 
was written with codepage 437 in mind.  [ what about console displays for a 
PC-based UNIX?  what about DOSLynx?  I don't know! ]  Also, the Windows
font "Terminal" is nearly the same as codepage 437.

Check your display by accessing Martin Ramsch's ISO-8859-1 table
(iso8859-1.html in the Lynx distribution's test subdirectory).

Ramsch's table describes each entity and shows examples of each.  It should
be immediately obvious whether or not you are seeing what you are supposed
to.  If you see box and line-drawing characters, mismatched letters and so
on, either the wrong character set is selected in Lynx or your connection
is not passing all 8 bits.  Ensure that all points of your connection are
8-bit clean:

	On any remote UNIX systems you must pass through, do 
		'stty cs8 -istrip' or 'stty pass8'.  'stty -a' should list
		your settings.
	On any remote VMS systems, do 'set terminal /eightbit'.
	Make sure your terminal program or telnet client is not filtering
		8-bit data.  Note: Procomm for DOS has a confusing "Use 7 bit
		or 8 bit ANSI" setting -- this has to do with ANSI sequences.
		If set to 8 bit, some 8-bit character sequences, including
		those passed by Lynx as well as those which are for your
		terminal type (vt100, etc.) will be processed by Procomm as
		ANSI screen control codes and will most likely result in a
		garbled display.  Set it to 7 bit.
	If going through a dialup terminal server, you may have to set the 
		terminal server itself to pass 8 bit data.  How to do this
		varies with the make of the server, and in some cases only a
		system admin in charge of the box will have the authorization
		to do that.
	SLIP or PPP connections should already be 8-bit clean.


Displaying true ISO-8859-1 under MS-DOS
=======================================

Since there are apparently no ISO-8859-1 EGA/VGA soft fonts (I looked) and
since such fonts tend to cause problems when switching video modes, the
next-best alternative is to use MS-DOS 5/6's international codepage
feature.  I'm fuzzy on the why-how-wherefores, but it works great if you
do it like this:

        In your config.sys, add a line to make codepage switching possible:
                devicehigh=c:\dos\display.sys con=(ega,437,1)

	This loads the display driver.  437 is the codepage supported by my 
	hardware.  Check your MS-DOS documentation and help screens for 
	more info on what these things do.

        In your autoexec.bat, add lines to load the IBM OEM ISO-Latin1 
	character set from the ega.cpi collection and switch over to it:
                mode con cp prep=((850) c:\dos\ega.cpi)
                mode con cp sel=850

Note that the codepage 850 in ega.cpi is IBM/Microsoft's ISO-Latin1,
which, although it contains all the right characters, does *not* map them
to the standard numbers as per ISO-8859-1, and it still preserves some of
the pseudo-graphic characters.  If you run Procomm for DOS (or just about
any other application), you'll see that some of the line-drawing
characters in the title screen and on the dialing/help menus appear as
international letters.  There's no way around this. 

Once you are using codepage 850, you've still got the problem of the
characters being mapped to the wrong numbers.  For example, if Lynx sends
your terminal the ISO-8859-1 code for a middle dot (183), you'll see
something other than a middle dot -- an upper-right box-corner (regular
codepage 437) or an A with a grave accent (codepage 850).  There are two
possible remedies:

	1. If using a terminal program like Procomm, use its Translation Table
	to process incoming characters.  On my slow 286, even with a speedy
	screen driver (nansi or nnansi.sys) installed, this results in a
	slight (20%) slowdown in the screen write time.  If you still want to
	give it a try, I found a set of translation tables for ISO-8859-1 ->
	IBM CP 850 for Procomm and Qmodem in the SimTel archives at:
		http://oak.oakland.edu:8080/SimTel/msdos/modem/xlate.zip

	2. Have Lynx do the work for you.  I used the information in xlate.zip
	to create a Lynx character set for codepage 850.  Select it via the
	'o'ptions menu when running Lynx, and save the choice in your .lynxrc
	file.
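
Both remedies boil down to the same 256-entry lookup table.  The Python
sketch below shows the idea; it is not the contents of xlate.zip or of the
Lynx table, just a reconstruction that assumes Python's "latin-1", "cp437"
and "cp850" codecs match the real character sets.  The names MIDDLE_DOT,
build_table and TABLE are hypothetical.

```python
MIDDLE_DOT = 0xB7  # ISO-8859-1 code (183) for the middle dot

# What the untranslated byte looks like under each DOS codepage.
as_cp437 = bytes([MIDDLE_DOT]).decode("cp437")  # a box-drawing corner
as_cp850 = bytes([MIDDLE_DOT]).decode("cp850")  # capital A with grave

def build_table(target: str = "cp850") -> bytes:
    """Return a 256-byte table mapping ISO-8859-1 codes to `target` codes."""
    table = bytearray(256)
    for code in range(256):
        char = bytes([code]).decode("latin-1")
        # Characters missing from the target codepage become "?".
        table[code] = char.encode(target, errors="replace")[0]
    return bytes(table)

TABLE = build_table()

# Translate incoming ISO-8859-1 bytes before they reach the screen.
incoming = "café · naïve".encode("latin-1")
translated = incoming.translate(TABLE)
print(translated.decode("cp850"))
```

Building the same table with target "cp437" would leave many entries as "?",
since that codepage lacks a good part of the ISO-8859-1 repertoire.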

There is another option.  There are actually ISO-8859 compliant codepages
available at:
		ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/
		ftp://nic.funet.fi/pub/doc/charsets/

as part of Kosta Kostis' free ISOCP collection.  You have to use a custom
keyboard driver (supplied), and sacrificing all of the pseudo-graphic
characters may make your terminal program (and many other DOS applications)
look rather ugly, but at least no translations will be necessary --
ISO-8859-[1,2] data received will appear on screen exactly as it should
with the Lynx "ISO Latin" character sets selected.