lib/chagashi0/README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120

# Chagashi: a Nim implementation of the WHATWG encoding standard

Chagashi is a Nim text encoding/decoding library in compliance with the WHATWG
standards for [Chawan](https://sr.ht/~bptato/chawan).

## Minimal example

First, include it in your nimble file:

```
requires "chagashi"
```

Note: following code uses the (very) high-level interface, which is rather
inefficient. Lower level interfaces are normally faster.

```Nim
# Makeshift iconv.
# Usage: nim r whatever.nim -f fromCharset -t toCharset <infile.txt >outfile.txt
import std/os, chagashi/[encoder, decoder, charset]

var fromCharset = CHARSET_UTF_8
var toCharset = CHARSET_UTF_8
for i in 1..paramCount():
  case paramStr(i)
  of "-f": fromCharset = getCharset(paramStr(i + 1))
  of "-t": toCharset = getCharset(paramStr(i + 1))
  else: assert false, "wrong parameter"
assert fromCharset != CHARSET_UNKNOWN and toCharset != CHARSET_UNKNOWN
let ins = stdin.readAll()
let insDecoded = ins.decodeAll(fromCharset)
if toCharset == CHARSET_UTF_8: # insDecoded is already UTF-8, nothing to do
  stdout.write(insDecoded)
else:
  stdout.write(insDecoded.encodeAll(toCharset))
```

## Q&A

Q: What encodings does Chagashi support?

A: All the ones you can find on
[https://encoding.spec.whatwg.org/](https://encoding.spec.whatwg.org/), no
more and no less.

Q: What is the intermediate format?

A: UTF-8, because it is the native encoding of Nim. In general, you can just
take whatever non-UTF-8 string you want to decode, pass it to the decoder, and
use the result immediately.

Q: What API should I use?

For decoding: the TextDecoderContext.decode() iterator provides a fairly
high-level API that does no unnecessary copying, and I recommend using that
where you can.

You may also use `decodeAll` when performance is less of a concern and/or you
need the output to be in a string, or reach to `decodercore` directly if you
really need the best performance. (In the latter case I recommend you study the
`decoder` module first, because it's very easy to get it wrong.)

For encoding: sorry, at the moment you need to use `encodercore` or stick with
the (non-optimal) `encodeAll`. I'll see if I can add an in-between API in the
future.

Q: Is it correct?

A: To my knowledge, yes. However, testing is still somewhat inadequate: many
single-byte encodings are not covered yet, and we do not have fuzzing either.

Q: Is it fast?

A: Not really, I have done very little optimization because it's not necessary
for my use case.

If you need better performance, feel free to complain in the
[tickets](https://todo.sr.ht/~bptato/chawan) with a specific input and I may
look into it. Patches are welcome, too.

Q: How do I decode UTF-8?

A: Like any other character set. Obviously, it won't be "decoded", just
validated, because the target charset is UTF-8 as well.

Previously, the API did not have a way to return views into the input data, so
we had a separate UTF-8 validator API. This turned out to be very annoying to
use, so the two APIs have been unified.

Q: How do I encode UTF-8?

A: You have to make sure that the UTF-8 you are passing to the encoder is at
least valid *WTF-8*. The encoder will convert surrogate codepoints to
replacement characters, but it *does not* validate the input byte stream.

To validate your input, you can run `validateUtf8()` from `std/unicode`, or
validateUTF8Surr from `chagashi/decoder`.

Q: Why no UTF-16 encoder?

A: It's not specified in the encoding standard, and I don't need one. Maybe try
std/encodings.

Q: Why replace your previous character decoding library?

A: Because it didn't work.

## Thanks

To the standard authors for writing a detailed, easy to implement specification.

Chagashi's multibyte test files (test/data.tar.xz) were borrowed from Henri
Sivonen's excellent [encoding_rs](https://github.com/hsivonen/encoding_rs)
library. His [writeup](https://hsivonen.fi/encoding_rs/) on compressing the
encoding data was also very helpful, and Chagashi applies similar
techniques.

## License

Chagashi is dedicated to the public domain. See the UNLICENSE file for details.