summary refs log tree commit diff stats
path: root/tests/compile/tinheritref.nim
Commit message (Collapse)AuthorAgeFilesLines
* fixes #554, fixes #179Araq2013-08-311-0/+27
n40' href='#n40'>40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222
PEG syntax and semantics
========================

A PEG (Parsing expression grammar) is a simple deterministic grammar, that can
be directly used for parsing. The current implementation has been designed as
a more powerful replacement for regular expressions. UTF-8 is supported.

The notation used for a PEG is similar to that of EBNF:

===============    ============================================================
notation           meaning
===============    ============================================================
``A / ... / Z``    Ordered choice: Apply expressions `A`, ..., `Z`, in this
                   order, to the text ahead, until one of them succeeds and
                   possibly consumes some text. Indicate success if one of
                   expressions succeeded. Otherwise do not consume any text
                   and indicate failure. 
``A ... Z``        Sequence: Apply expressions `A`, ..., `Z`, in this order,
                   to consume consecutive portions of the text ahead, as long
                   as they succeed. Indicate success if all succeeded.
                   Otherwise do not consume any text and indicate failure.
                   The sequence's precedence is higher than that of ordered
                   choice: ``A B / C`` means ``(A B) / Z`` and
                   not ``A (B / Z)``.
``(E)``            Grouping: Parenthesis can be used to change
                   operator priority.
``{E}``            Capture: Apply expression `E` and store the substring
                   that matched `E` into a *capture* that can be accessed
                   after the matching process.
``$i``             Back reference to the ``i``th capture. ``i`` counts from 1. 
``$``              Anchor: Matches at the end of the input. No character 
                   is consumed. Same as ``!.``. 
``^``              Anchor: Matches at the start of the input. No character 
                   is consumed. 
``&E``             And predicate: Indicate success if expression `E` matches
                   the text ahead; otherwise indicate failure. Do not consume
                   any text.
``!E``             Not predicate: Indicate failure if expression E matches the
                   text ahead; otherwise indicate success. Do not consume any
                   text.
``E+``             One or more: Apply expression `E` repeatedly to match
                   the text ahead, as long as it succeeds. Consume the matched
                   text (if any) and indicate success if there was at least
                   one match. Otherwise indicate failure. 
``E*``             Zero or more: Apply expression `E` repeatedly to match
                   the text ahead, as long as it succeeds. Consume the matched
                   text (if any). Always indicate success. 
``E?``             Zero or one: If expression `E` matches the text ahead,
                   consume it. Always indicate success. 
``[s]``            Character class: If the character ahead appears in the
                   string `s`, consume it and indicate success. Otherwise
                   indicate failure. 
``[a-b]``          Character range: If the character ahead is one from the
                   range `a` through `b`, consume it and indicate success.
                   Otherwise indicate failure.
``'s'``            String: If the text ahead is the string `s`, consume it
                   and indicate success. Otherwise indicate failure.
``i's'``           String match ignoring case.
``y's'``           String match ignoring style.
``v's'``           Verbatim string match: Use this to override a global
                   ``\i`` or ``\y`` modifier.
``i$j``            String match ignoring case for back reference.
``y$j``            String match ignoring style for back reference.
``v$j``            Verbatim string match for back reference.
``.``              Any character: If there is a character ahead, consume it
                   and indicate success. Otherwise (that is, at the end of
                   input) indicate failure.
``_``              Any Unicode character: If there is an UTF-8 character
                   ahead, consume it and indicate success. Otherwise indicate
                   failure.
``@E``             Search: Shorthand for ``(!E .)* E``. (Search loop for the
                   pattern `E`.)
``{@} E``          Captured Search: Shorthand for ``{(!E .)*} E``. (Search 
                   loop for the pattern `E`.) Everything until and exluding
                   `E` is captured.
``@@ E``           Same as ``{@} E``.
``A <- E``         Rule: Bind the expression `E` to the *nonterminal symbol*
                   `A`. **Left recursive rules are not possible and crash the
                   matching engine.**
``\identifier``    Built-in macro for a longer expression.
``\ddd``           Character with decimal code *ddd*.
``\"``, etc        Literal ``"``, etc. 
===============    ============================================================


Built-in macros
---------------

==============     ============================================================
macro              meaning
==============     ============================================================
``\d``             any decimal digit: ``[0-9]``
``\D``             any character that is not a decimal digit: ``[^0-9]``
``\s``             any whitespace character: ``[ \9-\13]``
``\S``             any character that is not a whitespace character:
                   ``[^ \9-\13]``
``\w``             any "word" character: ``[a-zA-Z0-9_]``
``\W``             any "non-word" character: ``[^a-zA-Z0-9_]``
``\a``             same as ``[a-zA-Z]``
``\A``             same as ``[^a-zA-Z]``
``\n``             any newline combination: ``\10 / \13\10 / \13``
``\i``             ignore case for matching; use this at the start of the PEG
``\y``             ignore style for matching; use this at the start of the PEG
``\skip`` pat      skip pattern *pat* before trying to match other tokens; 
                   this is useful for whitespace skipping, for example:
                   ``\skip(\s*) {\ident} ':' {\ident}`` matches key value 
                   pairs ignoring whitespace around the ``':'``.
``\ident``         a standard ASCII identifier: ``[a-zA-Z_][a-zA-Z_0-9]*``
``\letter``        any Unicode letter
``\upper``         any Unicode uppercase letter
``\lower``         any Unicode lowercase letter
``\title``         any Unicode title letter
``\white``         any Unicode whitespace character
==============     ============================================================

A backslash followed by a letter is a built-in macro, otherwise it
is used for ordinary escaping:

==============     ============================================================
notation           meaning
==============     ============================================================
``\\``             a single backslash
``\*``             same as ``'*'``
``\t``             not a tabulator, but an (unknown) built-in
==============     ============================================================


Supported PEG grammar
---------------------

The PEG parser implements this grammar (written in PEG syntax)::

  # Example grammar of PEG in PEG syntax.
  # Comments start with '#'.
  # First symbol is the start symbol.
  
  grammar <- rule* / expr
  
  identifier <- [A-Za-z][A-Za-z0-9_]*
  charsetchar <- "\\" . / [^\]]
  charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]"
  stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" /
                            "'" ("\\" . / [^'])* "'")
  builtin <- "\\" identifier / [^\13\10]
  
  comment <- '#' @ \n
  ig <- (\s / comment)* # things to ignore
  
  rule <- identifier \s* "<-" expr ig
  identNoArrow <- identifier !(\s* "<-")
  prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@'
  literal <- ig identifier? '$' [0-9]+ / '$' / '^' /
             ig identNoArrow / 
             ig charset / 
             ig stringlit / 
             ig builtin / 
             ig '.' / 
             ig '_' / 
             (ig "(" expr ig ")")
  postfixOpr <- ig '?' / ig '*' / ig '+'
  primary <- prefixOpr* (literal postfixOpr*)
  
  # Concatenation has higher priority than choice:
  # ``a b / c`` means ``(a b) / c``
  
  seqExpr <- primary+
  expr <- seqExpr (ig "/" expr)*


**Note**: As a special syntactic extension if the whole PEG is only a single 
expression, identifiers are not interpreted as non-terminals, but are 
interpreted as verbatim string:

.. code-block:: nimrod
  abc =~ peg"abc" # is true

So it is not necessary to write ``peg" 'abc' "`` in the above example.


Examples
--------

Check if `s` matches Nimrod's "while" keyword:

.. code-block:: nimrod
  s =~ peg" y'while'"

Exchange (key, val)-pairs: 

.. code-block:: nimrod
  "key: val; key2: val2".replace(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1")

Determine the ``#include``'ed files of a C file:

.. code-block:: nimrod
  for line in lines("myfile.c"):
    if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws
                     comment <- '/*' @ '*/' / '//' .*
                     ws <- (comment / \s+)* """:
      echo matches[0]

PEG vs regular expression
-------------------------
As a regular expression ``\[.*\]`` matches the longest possible text between
``'['`` and ``']'``. As a PEG it never matches anything, because a PEG is
deterministic: ``.*`` consumes the rest of the input, so ``\]`` never matches.
As a PEG this needs to be written as: ``\[ ( !\] . )* \]`` (or ``\[ @ \]``).

Note that the regular expression does not behave as intended either: in the
example ``*`` should not be greedy, so ``\[.*?\]`` should be used instead.


PEG construction
----------------
There are two ways to construct a PEG in Nimrod code:
(1) Parsing a string into an AST which consists of `TPeg` nodes with the
    `peg` proc.
(2) Constructing the AST directly with proc calls. This method does not
    support constructing rules, only simple expressions and is not as
    convenient. Its only advantage is that it does not pull in the whole PEG
    parser into your executable.