-rw-r--r-- | NEWS | 68
-rw-r--r-- | README.md | 37
-rw-r--r-- | doc/.index.md | 16
-rw-r--r-- | doc/manual.md | 252
-rwxr-xr-x | gendoc.sh | 13
5 files changed, 371 insertions, 15 deletions
diff --git a/NEWS b/NEWS new file mode 100644 index 00000000..a94e35dd --- /dev/null +++ b/NEWS @@ -0,0 +1,68 @@ +0.14 (2024.02.07) + +* The "bag of pointers" interface design has been dropped +* Tag and attribute names are now treated as interned strings (a user-defined + "Atom" type) +* Support processing of embedded SVG/MathML elements +* Chakasu has been made an optional dependency +* std/streams no longer used by htmlparser; now it supports chunked parsing + instead +* All tokenizer + tree builder tests passed in html5lib-tests + +Rough migration guide from the previous API: + +Users of minidom + +* nodeType is no longer supported, use the of operator to distinguish + between node types. +* minidom now only supports UTF-8; if you need support for other + charsets, use minidom_cs. +* localName is now an MAtom; to get the stringified local name, use + localNameStr. +* attrs is now a seq of Attribute tuples. Linear search this seq to find + specific attributes. +* minidom now contains several MAtom fields; to convert these to + strings, call document.atomToStr(atom). + +Users of htmlparser + +* The NodeType enum has been removed. Either copy-paste the enum + definition from a previous version, or (more efficient) use the `of` + operator to distinguish between types. +* Use of an AtomFactory is now required for consumers of htmlparser. The + easiest fix is to copy-paste the implementation found in minidom. +* Your DOM builder should be generic over a Handle and an Atom. Example: + `DOMBuilder[Node, MAtom]` +* You no longer have to copy function pointers into your DOM builder. +* It is recommended to add `include chame/htmlparseriface` to your DOM + builder module. See the htmlparseriface documentation for details. + +Switching to the new interface: + +* Add `Impl` to the name of all your procedure implementations. +* If you included chame/htmlparseriface, replace all parameters of your + procedures containing `DOMBuilder[MyHandle]` with `MyDOMBuilder`. 
+* setCharacterSet -> setEncodingImpl that takes a string label. +* getLocalNameImpl now must return an Atom. getTagType is no longer used. +* insertBeforeImpl must take an `Option[Handle]` +* addAttrsIfMissingImpl is now mandatory, and must take a `Table[Atom, string]` +* getNamespaceImpl is now mandatory. +* getDocumentImpl must be implemented, and must return the Handle of the + document. +* tagTypeToAtomImpl, atomToTagTypeImpl, strToAtomImpl must all be implemented. +* createHTMLElementImpl must be implemented, and must return the handle of a new + `<html>` element. +* createElement -> createElementForTokenImpl, the signature has changed + significantly. localName is the 2-in-1 replacement for both tagType and + localName. Also, you probably have to convert attributes from htmlAttrs & + xmlAttrs to your own internal representation. +* finish is no longer called at the end of parsing. Call it yourself. + +Event loop changes: + +* parseHTML is now split into three parts: initHTML5Parser, parseChunk, finish. +* If you implement scripting and/or character sets other than UTF-8, see + doc/manual.md for handling parseChunk's result. Otherwise, it is safe to + discard it. +* Do not forget to call finish after having parsed the entire document (first + for the parser, then for your own DOM builder). diff --git a/README.md b/README.md index a4e3a967..082f3c87 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,5 @@ # Chame: an HTML5 parser library in Nim -## WARNING - -This library is still in beta stage. The API **will** undergo significant -changes before the 1.0 release. - ## Usage Include Chame in your project using either Nimble or as a git submodule. @@ -22,17 +17,22 @@ Note: only Nim 1.6.10+ is supported. * Almost full compliance with the WHATWG standard. (Except for the few missing features listed in the following section.) -* Passes all tokenizer and tree builder tests in html5lib-tests.[^1] -* Includes a minimal DOM implementation. 
-* No mandatory dependencies other than the Nim standard library. -* Optional character encoding support (see minidom_enc). -* String interning support for tag and attribute names. -* Support for chunked parsing. -* document.write (WIP) +* Passes all tokenizer and tree builder tests in html5lib-tests[^1] +* Includes a minimal DOM implementation +* No mandatory dependencies other than the Nim standard library +* Optional character encoding support (see minidom_enc) +* String interning support for tag and attribute names +* Support for chunked parsing +* document.write (no actual implementation here, but it's possible to implement + it on top of Chame) [^1]: Except for tree builder tests requiring JavaScript and xmlViolation tokenizer tests. +## Manual + +There is a manual available at [doc/manual.md](doc/manual.md). + ## To-do Some parts of the specification have not been implemented yet. These are: @@ -41,10 +41,8 @@ Some parts of the specification have not been implemented yet. These are: Support for this feature is planned. -Other, non-standard-related tasks (in no particular order): +Other, non-standard-related tasks: -* Document minidom/minidom_enc -* Document the new interface (also explain what `Atom` does etc.) * Optimize inefficient parts of the library ## Bugs, feedback, etc. @@ -71,6 +69,15 @@ an example of a complete DOM implementation that depends on Chame. If you implement a DOM library based on Chame, please notify me, so that I can redirect users to it in this section. +### I read the manual, but it's too complex, I don't understand anything, help + +Just call minidom.parseHTML on an std/streams.Stream object and forget about +everything else. Chances are this is enough for whatever you want to do. + +### How do I implement speculative parsing? + +No idea. Let me know if you figure something out. + ### How do you pronounce Chame? It is an acronym of "**Cha**wan HT**M**L (aitch-tee-e**m-e**l)." 
Accordingly, it is diff --git a/doc/.index.md b/doc/.index.md new file mode 100644 index 00000000..7e6e6c33 --- /dev/null +++ b/doc/.index.md @@ -0,0 +1,16 @@ +# Documentation of Chame, an HTML5 parsing library written in Nim. + +Index of the (stable) public API: + +* [minidom](minidom.html): minimal DOM module and a high-level interface to the + HTML parser. +* [minidom_cs](minidom_cs.html): minidom with support for non-UTF-8 character + sets. +* [htmlparser](htmlparser.html): low-level interface to the HTML parser. +* [htmlparseriface](htmlparseriface.html): forward declarations for the HTML + parser +* [tags](tags.html): enum definitions for Chame (tags, namespaces, etc.) +* [tokstate](tokstate.html): tokenizer states for the fragment parsing + algorithm + +Also of interest: the Chame [manual](manual.html). diff --git a/doc/manual.md b/doc/manual.md new file mode 100644 index 00000000..333d305d --- /dev/null +++ b/doc/manual.md @@ -0,0 +1,252 @@ +# Using Chame + +Chame is divided into two parts: a low-level API ([htmlparser](htmlparser.html)) +and a high-level API ([minidom](minidom.html), [minidom_cs](minidom_cs.html)). +The high-level APIs build on top of htmlparser, and are easier to use. However, +they give consumers less control than htmlparser. + +Here we describe both APIs. + +## Basic concepts + +### Standards + +Chame implements HTML5 parsing as described in the +[Parsing HTML documents](https://html.spec.whatwg.org/multipage/parsing.html) +section of the WHATWG's living standard. Note that this document may change at +any time, and newer additions might take some time to implement in Chame. + +Users of the low-level API are encouraged to consult the appropriate sections +of the standard while implementing hooks provided by htmlparser. 
+ +### String interning + +To achieve O(1) comparisons of certain categories of strings (tag and attribute +names) and a lower memory footprint, Chame uses +[string interning](https://en.wikipedia.org/wiki/String_interning). This means +that while minidom users will only have to call the appropriate conversion +functions on Document.factory for converting the output to string, consumers +of htmlparser must implement string interning themselves (be that through +MAtomFactory or a custom solution). + +### String validation + +Note that (as per standard) the tokenization stage strips out all NUL +characters, so strings from the parser can be safely converted to cstrings. + +htmlparser itself does no UTF-8 validation; it is up to the DOM builder to +validate the input. Non-ASCII characters are treated as opaque characters, +so parsing of ASCII-compatible character sets should just work with the caveat +that the strings from htmlparser will not necessarily be valid UTF-8. This is +not a problem in minidom, since it abstracts over this difficulty. + +## High-level API (minidom, minidom_cs) + +minidom has two main entry points: `parseHTML` and `parseHTMLFragment`. For +parsing documents, `parseHTML` should be used; `parseHTMLFragment` is for parsing +incomplete document fragments. + +e.g. in a browser, the `innerHTML` setter would use `parseHTMLFragment`, while +`DOMParser.parseFromString` would use `parseHTML`. + +The input stream must be passed as a `Stream` object from `std/streams`. Both +parseHTML and parseHTMLFragment return only when the input stream has been +completely consumed. For chunked parsing, you must use the +low-level htmlparser API instead. + +minidom (and minidom_cs) implements string interning using `MAtomFactory`, and +interned strings in minidom are represented using `MAtom`s. Every `MAtom` is +guaranteed to be a valid UTF-8 string. To convert a `MAtom` into a Nim string, +use the MAtomFactory.atomToStr function. 
+ +The output is a DOM tree, with the root node being a `Document`. The root +Document node also contains a `MAtomFactory` instance, which can be used to +convert `MAtom`s back to strings (through `atomToStr`). + +Strings returned from minidom are guaranteed to be valid UTF-8. Note however +that minidom only understands UTF-8 documents. For parsing documents with +character sets other than UTF-8, minidom_cs must be used. The `parseHTML` +function of minidom_cs is also able to BOM sniff, interpret meta charset +tags and optionally retry parsing of documents with a predefined list of +character sets (using the companion character decoding library Chakasu). + +## Low-level API (htmlparser) + +### Functions and procedures + +htmlparser has three defined procedures: `initHTML5Parser`, `parseChunk`, and +`finish`. A `getInsertionPoint` function is available as well. + +#### initHTML5Parser + +```nim +# Signature +proc initHTML5Parser[Handle, Atom](dombuilder: DOMBuilder[Handle, Atom], + opts: HTML5ParserOpts[Handle, Atom]): HTML5Parser[Handle, Atom] +``` + +The `initHTML5Parser` procedure requires a user-defined DOMBuilder object +derived from the `DOMBuilder[Handle, Atom]` generic object reference. + +To implement all interfaces necessary for htmlparser, please include +[htmlparseriface](htmlparseriface.html) in your DOM builder module; it contains +forward-declarations for all procedures that `HTML5Parser` depends on. Feel +free to study/copy [minidom](minidom.html)'s implementations. + +The return value is an `HTML5Parser[Handle, Atom]` object. Note that this is +a rather large object that is passed by value; if possible, avoid copying it +at all. + +#### parseChunk + +```nim +# Signature +proc parseChunk[Handle, Atom](parser: var HTML5Parser[Handle, Atom], + inputBuf: openArray[char], reprocess = false): ParseResult +``` + +`parseChunk` consumes all data passed in `inputBuf`. During this, the +appropriate functions (`createElementImpl`, etc.) 
will be called by the parser. + +`parseChunk` returns a `ParseResult`, which is one of the following values: + +* `PRES_CONTINUE`: the caller should continue with parsing the next chunk of + data when it is available. (It's also fine to delay the next + call by processing something different first.) +* `PRES_STOP`: parsing was stopped by your setEncodingImpl implementation. The + caller is expected to restart parsing from the beginning using a **new** + `HTML5Parser` object. WARNING: do *not* re-use the current HTML5Parser for + this. +* `PRES_SCRIPT`: a `</script>` end tag has been encountered, which immediately + suspended parsing. In the next `parseChunk` call, the caller is expected to + pass the **same** buffer (`inputBuf`) as in the current one. For details, + see below. + +Special care is required in implementations that have scripting support. The +HTML5 standard requires the parser to be re-entrant for supporting the +`document.write` JavaScript function; therefore the parser suspends itself upon +encountering a `</script>` end tag, returning a `PRES_SCRIPT` `ParseResult`. + +At this point, implementations have two options. + +##### Option 1: Continue parsing the current buffer + +If any of the following holds: + +* your implementation does not support `document.write`, or +* no `document.write` call has been issued by the script, or +* parsing of all buffers passed by `document.write` calls has finished, + +then you can simply resume parsing the current buffer by calling `parseChunk` +again with an openArray that uses the same backing buffer, except starting +from `parser.getInsertionPoint()`. `minidom`, which pretends to support +scripting in test cases, but does not actually execute scripts, has an example +of this: + +```nim +var buffer: array[4096, char] +while true: + let n = inputStream.readData(addr buffer[0], buffer.len) + if n == 0: break + # res can be PRES_CONTINUE or PRES_SCRIPT. 
PRES_STOP is only returned + # on charset switching, and minidom does not support that. + var res = parser.parseChunk(buffer.toOpenArray(0, n - 1)) + # Important: we must repeat parseChunk with the same contents for the script + # end tag result, with reprocess = true. + # + # (This is only relevant for calls where scripting = true; with scripting = + # false, PRES_SCRIPT would never be returned.) + var ip = 0 + while res == PRES_SCRIPT and (ip += parser.getInsertionPoint(); ip != n): + res = parser.parseChunk(buffer.toOpenArray(ip, n - 1)) +parser.finish() +``` + +Note the while loop; `parseChunk` may return `PRES_SCRIPT` multiple times +for a single buffer, as one buffer can contain several scripts. + +Also note that `minidom` does not handle `PRES_STOP`, since it does not support +character encodings other than UTF-8. For an implementation that *does* handle `PRES_STOP`, see +`minidom_cs`. + +##### Option 2: Parse buffers passed by `document.write` + +Per standard, it is possible to insert buffers into the stream from scripts +using the `document.write` function. + +It is possible to implement this, but it is somewhat too involved to give a +detailed explanation of it here. Please refer to Chawan's implementation in +html/chadombuilder and html/dom. + +#### finish + +After having parsed all chunks of your document with `parseChunk`, you **must** +call the `finish` function. This is necessary because the parser may still have +some non-flushed characters in an internal buffer. For example, when the parser receives +the string `&gt`, it is not clear whether the character reference refers to a +"greater than" sign, or a longer character reference like `&gtrsim;`; `finish` +confirms that the reference is indeed a `>` sign. Also, the parser has to +execute certain actions on encountering the `EOF` token, which only `finish` can +produce. + +`finish` must never be called twice, and any `parseChunk` call after `finish` +is invalid. 
+ +### Generic parameters + +`initHTML5Parser` takes two generic parameters: `Handle` and `Atom`. + +`Handle` is conceptually a unique pointer to a node in the document. A naive +single-threaded implementation (like minidom) may simply implement this as +a Nim `ref` to an object. However, this is not mandatory; since `Handle` is +a generic parameter, any type is accepted. For example, multi-processing +implementations that use message passing might instead prefer to use an +integer ID that refers to an object owned by a different thread. + +Similarly, `Atom` is a unique pointer to a string. This means that +`DOMBuilder.strToAtom` must always return the same Atom for every string whose +contents are equivalent. Additionally, `atomToTagType` and `tagTypeToAtom` must +operate as if `TagType` values were equivalent to the contents of its +stringifier. (i.e. `tagTypeToAtom(tagType) == strToAtom($tagType)` for all tag +types except `TAG_UNKNOWN`, which is never passed to `tagTypeToAtom`.) + +Note that htmlparser does not *require* an `atomToStr` procedure, so it is not +even necessary to store interned strings in a format compatible with the Nim +string type. (Obviously, some way to stringify atoms is required for most use +cases, but it need not be exposed.) + +## Example + +A simple example of minidom: dumps all text on a page. + +```Nim +# Compile with nim c -d:ssl +# List strings found on the target website. 
+import std/httpclient +import std/os +import std/strutils +import chame/minidom + +if paramCount() != 1: + echo "Usage: " & paramStr(0) & " [URL]" + quit(1) +let client = newHttpClient() +let res = client.get(paramStr(1)) +let document = parseHTML(res.bodyStream) +var stack = @[Node(document)] +while stack.len > 0: + let node = stack.pop() + if node of Text: + let s = Text(node).data.strip() + if s != "": + echo s + for i in countdown(node.childList.high, 0): + stack.add(node.childList[i]) +``` + +For more advanced usage of minidom, please study the tests/tree.nim and +tests/shared/tree_common.nim which together are a runner of html5lib-tests. + +For an example implementation of [htmlparseriface](htmlparseriface.html), please +check the source code of [minidom](minidom.html) (and of +[minidom_cs](minidom_cs.html), if you need non-UTF-8 support). diff --git a/gendoc.sh b/gendoc.sh index 922539b4..02175778 100755 --- a/gendoc.sh +++ b/gendoc.sh @@ -10,3 +10,16 @@ do if test "$f" = "chame/htmlparseriface.nim" -e 's/theindex.html/index.html/g' \ ".obj/doc/$(basename "$f" .nim).html" done +makehtml() { + printf '<!DOCTYPE html> +<head> +<meta name=viewport content="width=device-width, initial-scale=1"> +<title>%s</title> +</head> +<body> +' "$2" + cat "$1" | pandoc + printf '</body>\n' +} +makehtml doc/manual.md "Chame manual" > .obj/doc/manual.html +makehtml doc/.index.md "Chame documentation" > .obj/doc/index.html |
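
Reviewer note on the string interning contract in doc/manual.md above: the manual requires that `strToAtom` return the same Atom for every string with equal contents. A minimal, self-contained sketch of that idea follows. It is not Chame's `MAtomFactory`; the names `SketchAtom` and `SketchAtomFactory` are hypothetical and exist only for illustration, and the sketch uses nothing beyond `std/tables`.

```nim
import std/tables

type
  # Hypothetical types for illustration only; not part of Chame.
  SketchAtom = distinct int

  SketchAtomFactory = ref object
    strMap: Table[string, SketchAtom]  # string -> atom lookup
    atomMap: seq[string]               # atom (used as index) -> string

# Atoms compare as plain integers, which is what makes comparison O(1).
proc `==`(a, b: SketchAtom): bool {.borrow.}

proc strToAtom(factory: SketchAtomFactory; s: string): SketchAtom =
  # Return the existing atom if this string has been interned before.
  if s in factory.strMap:
    return factory.strMap[s]
  # Otherwise allocate the next index and remember both directions.
  result = SketchAtom(factory.atomMap.len)
  factory.atomMap.add(s)
  factory.strMap[s] = result

proc atomToStr(factory: SketchAtomFactory; a: SketchAtom): string =
  factory.atomMap[int(a)]

let factory = SketchAtomFactory()
let a = factory.strToAtom("div")
let b = factory.strToAtom("div")
let c = factory.strToAtom("span")
doAssert a == b                          # equal strings share one atom
doAssert a != c                          # different strings get different atoms
doAssert factory.atomToStr(c) == "span"  # round-trip back to a string
```

A real DOM builder would additionally have to satisfy the `tagTypeToAtom`/`atomToTagType` requirement from the manual, for instance by pre-interning all tag names in enum order; that part is left out of this sketch.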