author: bptato <nincsnevem662@gmail.com> 2024-02-07 21:15:36 +0100
committer: bptato <nincsnevem662@gmail.com> 2024-02-07 21:38:50 +0100
commit: 04f76fc803cdbab4ef12b8686a076eccbe8aaed3
tree: 55f951ecab1f03f7861c36b997e060e45b647fc0
parent: f6018b45b8d3d41e31d5ab003813504333c42eb0
download: chawan-04f76fc803cdbab4ef12b8686a076eccbe8aaed3.tar.gz
Update docs
-rw-r--r-- NEWS 68
-rw-r--r-- README.md 37
-rw-r--r-- doc/.index.md 16
-rw-r--r-- doc/manual.md 252
-rwxr-xr-x gendoc.sh 13
5 files changed, 371 insertions, 15 deletions
diff --git a/NEWS b/NEWS
new file mode 100644
index 00000000..a94e35dd
--- /dev/null
+++ b/NEWS
@@ -0,0 +1,68 @@
+0.14 (2024.02.07)
+
+* The "bag of pointers" interface design has been dropped
+* Tag and attribute names are now treated as interned strings (a user-defined
+  "Atom" type)
+* Support processing of embedded SVG/MathML elements
+* Chakasu has been made an optional dependency
+* std/streams no longer used by htmlparser; now it supports chunked parsing
+  instead
+* All tokenizer + tree builder tests passed in html5lib-tests
+
+Rough migration guide from the previous API:
+
+Users of minidom
+
+* nodeType is no longer supported, use the of operator to distinguish
+  between node types.
+* minidom now only supports UTF-8; if you need support for other
+  charsets, use minidom_cs.
+* localName is now an MAtom; to get the stringified local name, use
+  localNameStr.
+* attrs is now a seq of Attribute tuples. Linear search this seq to find
+  specific attributes.
+* minidom now contains several MAtom fields; to convert these to
+  strings, call document.atomToStr(atom).
+
+Users of htmlparser
+
+* The NodeType enum has been removed. Either copy-paste the enum
+  definition from a previous version, or (more efficient) use the `of`
+  operator to distinguish between types.
+* Use of an AtomFactory is now required for consumers of htmlparser. The
+  easiest fix is to copy-paste the implementation found in minidom.
+* Your DOM builder should be generic over a Handle and an Atom. Example:
+  `DOMBuilder[Node, MAtom]`
+* You no longer have to copy function pointers into your DOM builder.
+* It is recommended to add `include chame/htmlparseriface` to your DOM
+  builder module. See the htmlparseriface documentation for details.
+
+Switching to the new interface:
+
+* Add `Impl` to the name of all your procedure implementations.
+* If you included chame/htmlparseriface, replace all parameters of your
+  procedures containing `DOMBuilder[MyHandle]` with `MyDOMBuilder`.
+* setCharacterSet -> setEncodingImpl that takes a string label.
+* getLocalNameImpl now must return an Atom. getTagType is no longer used.
+* insertBeforeImpl must take an `Option[Handle]`
+* addAttrsIfMissingImpl is now mandatory, and must take a `Table[Atom, string]`
+* getNamespaceImpl is now mandatory.
+* getDocumentImpl must be implemented, and must return the Handle of the
+  document.
+* tagTypeToAtomImpl, atomToTagTypeImpl, strToAtomImpl must all be implemented.
+* createHTMLElementImpl must be implemented, and must return the handle of a new
+  `<html>` element.
+* createElement -> createElementForTokenImpl, the signature has changed
+  significantly. localName is the 2-in-1 replacement for both tagType and
+  localName. Also, you probably have to convert attributes from htmlAttrs &
+  xmlAttrs to your own internal representation.
+* finish is no longer called at the end of parsing. Call it yourself.
+
+Event loop changes:
+
+* parseHTML is now split into three parts: initHTML5Parser, parseChunk, finish.
+* If you implement scripting and/or character sets other than UTF-8, see
+  doc/manual.md for handling parseChunk's result. Otherwise, it is safe to
+  discard it.
+* Do not forget to call finish after having parsed the entire document (first
+  for the parser, then for your own DOM builder).
diff --git a/README.md b/README.md
index a4e3a967..082f3c87 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,5 @@
 # Chame: an HTML5 parser library in Nim
 
-## WARNING
-
-This library is still in beta stage. The API **will** undergo significant
-changes before the 1.0 release.
-
 ## Usage
 
 Include Chame in your project using either Nimble or as a git submodule.
@@ -22,17 +17,22 @@ Note: only Nim 1.6.10+ is supported.
 
 * Almost full compliance with the WHATWG standard. (Except for the few missing
   features listed in the following section.)
-* Passes all tokenizer and tree builder tests in html5lib-tests.[^1]
-* Includes a minimal DOM implementation.
-* No mandatory dependencies other than the Nim standard library.
-* Optional character encoding support (see minidom_enc).
-* String interning support for tag and attribute names.
-* Support for chunked parsing.
-* document.write (WIP)
+* Passes all tokenizer and tree builder tests in html5lib-tests[^1]
+* Includes a minimal DOM implementation
+* No mandatory dependencies other than the Nim standard library
+* Optional character encoding support (see minidom_cs)
+* String interning support for tag and attribute names
+* Support for chunked parsing
+* document.write (no actual implementation here, but it's possible to implement
+  it on top of Chame)
 
 [^1]: Except for tree builder tests requiring JavaScript and xmlViolation
 tokenizer tests.
 
+## Manual
+
+There is a manual available at [doc/manual.md](doc/manual.md).
+
 ## To-do
 
 Some parts of the specification have not been implemented yet. These are:
@@ -41,10 +41,8 @@ Some parts of the specification have not been implemented yet. These are:
 
 Support for this feature is planned.
 
-Other, non-standard-related tasks (in no particular order):
+Other, non-standard-related tasks:
 
-* Document minidom/minidom_enc
-* Document the new interface (also explain what `Atom` does etc.)
 * Optimize inefficient parts of the library
 
 ## Bugs, feedback, etc.
@@ -71,6 +69,15 @@ an example of a complete DOM implementation that depends on Chame.
 If you implement a DOM library based on Chame, please notify me, so that I
 can redirect users to it in this section.
 
+### I read the manual, but it's too complex, I don't understand anything, help
+
+Just call minidom.parseHTML on a Stream object from std/streams and forget about
+everything else. Chances are this is enough for whatever you want to do.
+
+### How do I implement speculative parsing?
+
+No idea. Let me know if you figure something out.
+
 ### How do you pronounce Chame?
 
 It is an acronym of "**Cha**wan HT**M**L (aitch-tee-e**m-e**l)." Accordingly, it is
diff --git a/doc/.index.md b/doc/.index.md
new file mode 100644
index 00000000..7e6e6c33
--- /dev/null
+++ b/doc/.index.md
@@ -0,0 +1,16 @@
+# Documentation of Chame, an HTML5 parsing library written in Nim.
+
+Index of the (stable) public API:
+
+* [minidom](minidom.html): minimal DOM module and a high-level interface to the
+  HTML parser.
+* [minidom_cs](minidom_cs.html): minidom with support for non-UTF-8 character
+  sets.
+* [htmlparser](htmlparser.html): low-level interface to the HTML parser.
+* [htmlparseriface](htmlparseriface.html): forward declarations for the HTML
+  parser
+* [tags](tags.html): enum definitions for Chame (tags, namespaces, etc.)
+* [tokstate](tokstate.html): tokenizer states for the fragment parsing
+  algorithm
+
+Also of interest: the Chame [manual](manual.html).
diff --git a/doc/manual.md b/doc/manual.md
new file mode 100644
index 00000000..333d305d
--- /dev/null
+++ b/doc/manual.md
@@ -0,0 +1,252 @@
+# Using Chame
+
+Chame is divided into two parts: a low-level API ([htmlparser](htmlparser.html))
+and a high-level API ([minidom](minidom.html), [minidom_cs](minidom_cs.html)).
+The high-level APIs build on top of htmlparser, and are easier to use. However,
+they give consumers less control than htmlparser.
+
+Here we describe both APIs.
+
+## Basic concepts
+
+### Standards
+
+Chame implements HTML5 parsing as described in the
+[Parsing HTML documents](https://html.spec.whatwg.org/multipage/parsing.html)
+section of the WHATWG's living standard. Note that this document may change at
+any time, and newer additions might take some time to implement in Chame.
+
+Users of the low-level API are encouraged to consult the appropriate sections
+of the standard while implementing hooks provided by htmlparser.
+
+### String interning
+
+To achieve O(1) comparisons of certain categories of strings (tag and attribute
+names) and a lower memory footprint, Chame uses
+[string interning](https://en.wikipedia.org/wiki/String_interning). This means
+that while minidom users will only have to call the appropriate conversion
+functions on Document.factory for converting the output to string, consumers
+of htmlparser must implement string interning themselves (be that through
+MAtomFactory or a custom solution).
+
+### String validation
+
+Note that (as per standard) the tokenization stage strips out all NUL
+characters, so strings from the parser can be safely converted to cstrings.
+
+htmlparser itself does no UTF-8 validation; it is up to the DOM builder to
+validate the input. Non-ASCII characters are treated as opaque characters,
+so parsing of ASCII-compatible character sets should just work with the caveat
+that the strings from htmlparser will not necessarily be valid UTF-8. This is
+not a problem in minidom, since it abstracts over this difficulty.
+
+## High-level API (minidom, minidom_cs)
+
+minidom has two main entry points: `parseHTML` and `parseHTMLFragment`. For
+parsing documents, `parseHTML` should be used; `parseHTMLFragment` is for parsing
+incomplete document fragments.
+
+e.g. in a browser, the `innerHTML` setter would use `parseHTMLFragment`, while
+`DOMParser.parseFromString` would use `parseHTML`.
+
+The input stream must be passed as a `Stream` object from `std/streams`. Both
+`parseHTML` and `parseHTMLFragment` return only when the input stream has been
+completely consumed. For chunked parsing, you must use the low-level
+htmlparser API instead.
+
+minidom (and minidom_cs) implements string interning using `MAtomFactory`, and
+interned strings in minidom are represented using `MAtom`s. Every `MAtom` is
+guaranteed to be a valid UTF-8 string. To convert a `MAtom` into a Nim string,
+use the MAtomFactory.atomToStr function.
+
+The output is a DOM tree, with the root node being a `Document`. The root
+Document node also contains a `MAtomFactory` instance, which can be used to
+convert `MAtom`s back to strings (through `atomToStr`).
+
+Strings returned from minidom are guaranteed to be valid UTF-8. Note however
+that minidom only understands UTF-8 documents. For parsing documents with
+character sets other than UTF-8, minidom_cs must be used. The `parseHTML`
+function of minidom_cs is also able to BOM sniff, interpret meta charset
+tags and optionally retry parsing of documents with a predefined list of
+character sets (using the companion character decoding library Chakasu).
+
+## Low-level API (htmlparser)
+
+### Functions and procedures
+
+htmlparser has three defined procedures: `initHTML5Parser`, `parseChunk`, and
+`finish`. A `getInsertionPoint` function is available as well.
+
+#### initHTML5Parser
+
+```nim
+# Signature
+proc initHTML5Parser[Handle, Atom](dombuilder: DOMBuilder[Handle, Atom],
+    opts: HTML5ParserOpts[Handle, Atom]): HTML5Parser[Handle, Atom]
+```
+
+The `initHTML5Parser` procedure requires a user-defined DOMBuilder object
+derived from the `DOMBuilder[Handle, Atom]` generic object reference.
+
+To implement all interfaces necessary for htmlparser, please include
+[htmlparseriface](htmlparseriface.html) in your DOM builder module; it contains
+forward-declarations for all procedures that `HTML5Parser` depends on. Feel
+free to study/copy [minidom](minidom.html)'s implementations.
+
+The return value is an `HTML5Parser[Handle, Atom]` object. Note that this is
+a rather large object that is passed by value; if possible, avoid copying it
+at all.
+
+#### parseChunk
+
+```nim
+# Signature
+proc parseChunk[Handle, Atom](parser: var HTML5Parser[Handle, Atom],
+    inputBuf: openArray[char], reprocess = false): ParseResult
+```
+
+`parseChunk` consumes all data passed in `inputBuf`. During this, the
+appropriate functions (`createElementImpl`, etc.) will be called by the parser.
+
+`parseChunk` returns a `ParseResult`, which is one of the following values:
+
+* `PRES_CONTINUE`: the caller should continue with parsing the next chunk of
+  data when it is available. (It's also fine to delay the next call by
+  processing something different first.)
+* `PRES_STOP`: parsing was stopped by your setEncodingImpl implementation. The
+  caller is expected to restart parsing from the beginning using a **new**
+  `HTML5Parser` object. WARNING: do *not* re-use the current HTML5Parser for
+  this.
+* `PRES_SCRIPT`: a `</script>` end tag has been encountered, which immediately
+  suspended parsing. In the next `parseChunk` call, the caller is expected to
+  pass the **same** buffer (`inputBuf`) as in the current one. For details,
+  see below.
+
+Special care is required in implementations that have scripting support. The
+HTML5 standard requires the parser to be re-entrant for supporting the
+`document.write` JavaScript function; therefore the parser suspends itself upon
+encountering a `</script>` end tag, returning a `PRES_SCRIPT` `ParseResult`.
+
+At this point, implementations have two options.
+
+##### Option 1: Continue parsing the current buffer
+
+If any of the following is true:
+
+* your implementation does not support `document.write`, or
+* no `document.write` call has been issued by the script, or
+* parsing of all buffers passed by `document.write` calls has finished,
+
+then you can simply resume parsing the current buffer by calling `parseChunk`
+again with an openArray that uses the same backing buffer, except starting
+from `parser.getInsertionPoint()`. `minidom`, which pretends to support
+scripting in test cases, but does not actually execute scripts, has an example
+of this:
+
+```nim
+var buffer: array[4096, char]
+while true:
+  let n = inputStream.readData(addr buffer[0], buffer.len)
+  if n == 0: break
+  # res can be PRES_CONTINUE or PRES_SCRIPT. PRES_STOP is only returned
+  # on charset switching, and minidom does not support that.
+  var res = parser.parseChunk(buffer.toOpenArray(0, n - 1))
+  # Important: we must repeat parseChunk with the same contents for the script
+  # end tag result, with reprocess = true.
+  #
+  # (This is only relevant for calls where scripting = true; with scripting =
+  # false, PRES_SCRIPT would never be returned.)
+  var ip = 0
+  while res == PRES_SCRIPT and (ip += parser.getInsertionPoint(); ip != n):
+    res = parser.parseChunk(buffer.toOpenArray(ip, n - 1))
+parser.finish()
+```
+
+Note the while loop; `parseChunk` may return `PRES_SCRIPT` multiple times
+for a single buffer, as one buffer can contain several scripts.
+
+Also note that `minidom` does not handle `PRES_STOP`, since it does not support
+character encodings other than UTF-8. For an implementation that *does* handle
+`PRES_STOP`, see `minidom_cs`.
+
+##### Option 2: Parse buffers passed by `document.write`
+
+Per standard, it is possible to insert buffers into the stream from scripts
+using the `document.write` function.
+
+It is possible to implement this, but it is somewhat too involved to give a
+detailed explanation of it here. Please refer to Chawan's implementation in
+html/chadombuilder and html/dom.
+
+#### finish
+
+After having parsed all chunks of your document with `parseChunk`, you **must**
+call the `finish` function. This is necessary because the parser may still have
+some non-flushed characters in an internal buffer. e.g. when the parser receives
+the string `&gt`, it is not clear whether the character reference refers to a
+"greater than" sign, or a longer character reference like `&gtrsim;`; `finish`
+confirms that the reference is indeed a "greater than" sign. Also, the parser
+has to execute certain actions on encountering the `EOF` token, which only
+`finish` can produce.
+
+`finish` must never be called twice, and any `parseChunk` call after `finish`
+is invalid.
+
+### Generic parameters
+
+`initHTML5Parser` takes two generic parameters: `Handle` and `Atom`.
+
+`Handle` is conceptually a unique pointer to a node in the document. A naive
+single-threaded implementation (like minidom) may simply implement this as
+a Nim `ref` to an object. However, this is not mandatory; since `Handle` is
+a generic parameter, any type is accepted. For example, multi-processing
+implementations that use message passing might instead prefer to use an
+integer ID that refers to an object owned by a different thread.
+
+Similarly, `Atom` is a unique pointer to a string. This means that
+`DOMBuilder.strToAtom` must always return the same Atom for every string whose
+contents are equivalent. Additionally, `atomToTagType` and `tagTypeToAtom` must
+operate as if each `TagType` value were equivalent to its stringified form
+(i.e. `tagTypeToAtom(tagType) == strToAtom($tagType)` for all tag types except
+`TAG_UNKNOWN`, which is never passed to `tagTypeToAtom`).
+
+Note that htmlparser does not *require* an `atomToStr` procedure, so it is not
+even necessary to store interned strings in a format compatible with the Nim
+string type. (Obviously, some way to stringify atoms is required for most use
+cases, but it need not be exposed.)
+
+## Example
+
+A simple example of minidom: dumps all text on a page.
+
+```Nim
+# Compile with nim c -d:ssl
+# List strings found on the target website.
+import std/httpclient
+import std/os
+import std/strutils
+import chame/minidom
+
+if paramCount() != 1:
+  echo "Usage: " & paramStr(0) & " [URL]"
+  quit(1)
+let client = newHttpClient()
+let res = client.get(paramStr(1))
+let document = parseHTML(res.bodyStream)
+var stack = @[Node(document)]
+while stack.len > 0:
+  let node = stack.pop()
+  if node of Text:
+    let s = Text(node).data.strip()
+    if s != "":
+      echo s
+  for i in countdown(node.childList.high, 0):
+    stack.add(node.childList[i])
+```
+
+For more advanced usage of minidom, please study tests/tree.nim and
+tests/shared/tree_common.nim, which together form a runner for html5lib-tests.
+
+For an example implementation of [htmlparseriface](htmlparseriface.html), please
+check the source code of [minidom](minidom.html) (and of
+[minidom_cs](minidom_cs.html), if you need non-UTF-8 support).
diff --git a/gendoc.sh b/gendoc.sh
index 922539b4..02175778 100755
--- a/gendoc.sh
+++ b/gendoc.sh
@@ -10,3 +10,16 @@ do	if test "$f" = "chame/htmlparseriface.nim"
           -e 's/theindex.html/index.html/g' \
           ".obj/doc/$(basename "$f" .nim).html"
 done
+makehtml() {
+	printf '<!DOCTYPE html>
+<head>
+<meta name=viewport content="width=device-width, initial-scale=1">
+<title>%s</title>
+</head>
+<body>
+' "$2"
+	cat "$1" | pandoc
+	printf '</body>\n'
+}
+makehtml doc/manual.md "Chame manual" > .obj/doc/manual.html
+makehtml doc/.index.md "Chame documentation" > .obj/doc/index.html
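
Note on the "String interning" section that this commit adds to doc/manual.md: the sketch below shows, in plain Nim, the guarantee an atom factory has to provide (equal strings always map to the same atom). `Atom`, `AtomFactory`, and both procs are hypothetical stand-ins built only on std/tables. They are not Chame's `MAtom` or `MAtomFactory`, and a real implementation would probably differ.

```nim
import std/tables

type
  Atom = distinct int          # hypothetical atom: an index into `strs`
  AtomFactory = ref object     # hypothetical factory, not Chame's MAtomFactory
    strs: seq[string]          # atom -> string
    ids: Table[string, Atom]   # string -> atom

# Comparing two atoms is a plain integer comparison.
proc `==`(a, b: Atom): bool {.borrow.}

proc strToAtom(factory: AtomFactory, s: string): Atom =
  # Return the existing atom for `s`, or allocate a new one.
  if s in factory.ids:
    return factory.ids[s]
  result = Atom(factory.strs.len)
  factory.strs.add(s)
  factory.ids[s] = result

proc atomToStr(factory: AtomFactory, atom: Atom): string =
  # Look up the string that an atom was created from.
  factory.strs[int(atom)]

let factory = AtomFactory()
let a = factory.strToAtom("div")
let b = factory.strToAtom("div")
let c = factory.strToAtom("span")
doAssert a == b                          # equal strings share one atom
doAssert a != c                          # different strings get different atoms
doAssert factory.atomToStr(c) == "span"
```

Comparing two atoms is then a single integer comparison, which is the O(1) comparison the manual mentions. Each distinct string is stored once, no matter how often it appears in a document.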