diff options
Diffstat (limited to 'doc/pegdocs.txt')
-rw-r--r-- | doc/pegdocs.txt | 230 |
1 files changed, 230 insertions, 0 deletions
diff --git a/doc/pegdocs.txt b/doc/pegdocs.txt new file mode 100644 index 000000000..0a8fd8187 --- /dev/null +++ b/doc/pegdocs.txt @@ -0,0 +1,230 @@ +PEG syntax and semantics +======================== + +A PEG (Parsing expression grammar) is a simple deterministic grammar, that can +be directly used for parsing. The current implementation has been designed as +a more powerful replacement for regular expressions. UTF-8 is supported. + +The notation used for a PEG is similar to that of EBNF: + +=============== ============================================================ +notation meaning +=============== ============================================================ +``A / ... / Z`` Ordered choice: Apply expressions `A`, ..., `Z`, in this + order, to the text ahead, until one of them succeeds and + possibly consumes some text. Indicate success if one of + expressions succeeded. Otherwise, do not consume any text + and indicate failure. +``A ... Z`` Sequence: Apply expressions `A`, ..., `Z`, in this order, + to consume consecutive portions of the text ahead, as long + as they succeed. Indicate success if all succeeded. + Otherwise, do not consume any text and indicate failure. + The sequence's precedence is higher than that of ordered + choice: ``A B / C`` means ``(A B) / Z`` and + not ``A (B / Z)``. +``(E)`` Grouping: Parenthesis can be used to change + operator priority. +``{E}`` Capture: Apply expression `E` and store the substring + that matched `E` into a *capture* that can be accessed + after the matching process. +``{}`` Empty capture: Delete the last capture. No character + is consumed. +``$i`` Back reference to the ``i``th capture. ``i`` counts forwards + from 1 or backwards (last capture to first) from ^1. +``$`` Anchor: Matches at the end of the input. No character + is consumed. Same as ``!.``. +``^`` Anchor: Matches at the start of the input. No character + is consumed. +``&E`` And predicate: Indicate success if expression `E` matches + the text ahead; otherwise indicate failure. Do not consume + any text. +``!E`` Not predicate: Indicate failure if expression E matches the + text ahead; otherwise indicate success. Do not consume any + text. +``E+`` One or more: Apply expression `E` repeatedly to match + the text ahead, as long as it succeeds. Consume the matched + text (if any) and indicate success if there was at least + one match. Otherwise, indicate failure. +``E*`` Zero or more: Apply expression `E` repeatedly to match + the text ahead, as long as it succeeds. Consume the matched + text (if any). Always indicate success. +``E?`` Zero or one: If expression `E` matches the text ahead, + consume it. Always indicate success. +``[s]`` Character class: If the character ahead appears in the + string `s`, consume it and indicate success. Otherwise, + indicate failure. +``[a-b]`` Character range: If the character ahead is one from the + range `a` through `b`, consume it and indicate success. + Otherwise, indicate failure. +``'s'`` String: If the text ahead is the string `s`, consume it + and indicate success. Otherwise, indicate failure. +``i's'`` String match ignoring case. +``y's'`` String match ignoring style. +``v's'`` Verbatim string match: Use this to override a global + ``\i`` or ``\y`` modifier. +``i$j`` String match ignoring case for back reference. +``y$j`` String match ignoring style for back reference. +``v$j`` Verbatim string match for back reference. +``.`` Any character: If there is a character ahead, consume it + and indicate success. Otherwise, (that is, at the end of + input) indicate failure. +``_`` Any Unicode character: If there is a UTF-8 character + ahead, consume it and indicate success. Otherwise, indicate + failure. +``@E`` Search: Shorthand for ``(!E .)* E``. (Search loop for the + pattern `E`.) +``{@} E`` Captured Search: Shorthand for ``{(!E .)*} E``. (Search + loop for the pattern `E`.) Everything until and excluding + `E` is captured. +``@@ E`` Same as ``{@} E``. +``A <- E`` Rule: Bind the expression `E` to the *nonterminal symbol* + `A`. **Left recursive rules are not possible and crash the + matching engine.** +``\identifier`` Built-in macro for a longer expression. +``\ddd`` Character with decimal code *ddd*. +``\"``, etc. Literal ``"``, etc. +=============== ============================================================ + + +Built-in macros +--------------- + +============== ============================================================ +macro meaning +============== ============================================================ +``\d`` any decimal digit: ``[0-9]`` +``\D`` any character that is not a decimal digit: ``[^0-9]`` +``\s`` any whitespace character: ``[ \9-\13]`` +``\S`` any character that is not a whitespace character: + ``[^ \9-\13]`` +``\w`` any "word" character: ``[a-zA-Z0-9_]`` +``\W`` any "non-word" character: ``[^a-zA-Z0-9_]`` +``\a`` same as ``[a-zA-Z]`` +``\A`` same as ``[^a-zA-Z]`` +``\n`` any newline combination: ``\10 / \13\10 / \13`` +``\i`` ignore case for matching; use this at the start of the PEG +``\y`` ignore style for matching; use this at the start of the PEG +``\skip`` pat skip pattern *pat* before trying to match other tokens; + this is useful for whitespace skipping, for example: + ``\skip(\s*) {\ident} ':' {\ident}`` matches key value + pairs ignoring whitespace around the ``':'``. +``\ident`` a standard ASCII identifier: ``[a-zA-Z_][a-zA-Z_0-9]*`` +``\letter`` any Unicode letter +``\upper`` any Unicode uppercase letter +``\lower`` any Unicode lowercase letter +``\title`` any Unicode title letter +``\white`` any Unicode whitespace character +============== ============================================================ + +A backslash followed by a letter is a built-in macro, otherwise it +is used for ordinary escaping: + +============== ============================================================ +notation meaning +============== ============================================================ +``\\`` a single backslash +``\*`` same as ``'*'`` +``\t`` not a tabulator, but an (unknown) built-in +============== ============================================================ + + +Supported PEG grammar +--------------------- + +The PEG parser implements this grammar (written in PEG syntax): + + # Example grammar of PEG in PEG syntax. + # Comments start with '#'. + # First symbol is the start symbol. + + grammar <- rule* / expr + + identifier <- [A-Za-z][A-Za-z0-9_]* + charsetchar <- "\\" . / [^\]] + charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]" + stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" / + "'" ("\\" . / [^'])* "'") + builtin <- "\\" identifier / [^\13\10] + + comment <- '#' @ \n + ig <- (\s / comment)* # things to ignore + + rule <- identifier \s* "<-" expr ig + identNoArrow <- identifier !(\s* "<-") + prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@' + literal <- ig identifier? '$' '^'? [0-9]+ / '$' / '^' / + ig identNoArrow / + ig charset / + ig stringlit / + ig builtin / + ig '.' / + ig '_' / + (ig "(" expr ig ")") / + (ig "{" expr? ig "}") + postfixOpr <- ig '?' / ig '*' / ig '+' + primary <- prefixOpr* (literal postfixOpr*) + + # Concatenation has higher priority than choice: + # ``a b / c`` means ``(a b) / c`` + + seqExpr <- primary+ + expr <- seqExpr (ig "/" expr)* + + +**Note**: As a special syntactic extension if the whole PEG is only a single +expression, identifiers are not interpreted as non-terminals, but are +interpreted as verbatim string: + + ```nim + abc =~ peg"abc" # is true + ``` + +So it is not necessary to write ``peg" 'abc' "`` in the above example. + + +Examples +-------- + +Check if `s` matches Nim's "while" keyword: + + ```nim + s =~ peg" y'while'" + ``` + +Exchange (key, val)-pairs: + + ```nim + "key: val; key2: val2".replacef(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1") + ``` + +Determine the ``#include``'ed files of a C file: + + ```nim + for line in lines("myfile.c"): + if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws + comment <- '/*' @ '*/' / '//' .* + ws <- (comment / \s+)* """: + echo matches[0] + ``` + +PEG vs regular expression +------------------------- +As a regular expression ``\[.*\]`` matches the longest possible text between +``'['`` and ``']'``. As a PEG it never matches anything, because a PEG is +deterministic: ``.*`` consumes the rest of the input, so ``\]`` never matches. +As a PEG this needs to be written as: ``\[ ( !\] . )* \]`` (or ``\[ @ \]``). + +Note that the regular expression does not behave as intended either: in the +example ``*`` should not be greedy, so ``\[.*?\]`` should be used instead. + + +PEG construction +---------------- +There are two ways to construct a PEG in Nim code: +(1) Parsing a string into an AST which consists of `Peg` nodes with the + `peg` proc. +(2) Constructing the AST directly with proc calls. This method does not + support constructing rules, only simple expressions and is not as + convenient. Its only advantage is that it does not pull in the whole PEG + parser into your executable. + |