doc/pegdocs.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222

PEG syntax and semantics
========================

A PEG (Parsing expression grammar) is a simple deterministic grammar, that can
be directly used for parsing. The current implementation has been designed as
a more powerful replacement for regular expressions. UTF-8 is supported.

The notation used for a PEG is similar to that of EBNF:

===============    ============================================================
notation           meaning
===============    ============================================================
``A / ... / Z``    Ordered choice: Apply expressions `A`, ..., `Z`, in this
                   order, to the text ahead, until one of them succeeds and
                   possibly consumes some text. Indicate success if one of
                   expressions succeeded. Otherwise do not consume any text
                   and indicate failure. 
``A ... Z``        Sequence: Apply expressions `A`, ..., `Z`, in this order,
                   to consume consecutive portions of the text ahead, as long
                   as they succeed. Indicate success if all succeeded.
                   Otherwise do not consume any text and indicate failure.
                   The sequence's precedence is higher than that of ordered
                   choice: ``A B / C`` means ``(A B) / Z`` and
                   not ``A (B / Z)``.
``(E)``            Grouping: Parenthesis can be used to change
                   operator priority.
``{E}``            Capture: Apply expression `E` and store the substring
                   that matched `E` into a *capture* that can be accessed
                   after the matching process.
``$i``             Back reference to the ``i``th capture. ``i`` counts from 1. 
``$``              Anchor: Matches at the end of the input. No character 
                   is consumed. Same as ``!.``. 
``^``              Anchor: Matches at the start of the input. No character 
                   is consumed. 
``&E``             And predicate: Indicate success if expression `E` matches
                   the text ahead; otherwise indicate failure. Do not consume
                   any text.
``!E``             Not predicate: Indicate failure if expression E matches the
                   text ahead; otherwise indicate success. Do not consume any
                   text.
``E+``             One or more: Apply expression `E` repeatedly to match
                   the text ahead, as long as it succeeds. Consume the matched
                   text (if any) and indicate success if there was at least
                   one match. Otherwise indicate failure. 
``E*``             Zero or more: Apply expression `E` repeatedly to match
                   the text ahead, as long as it succeeds. Consume the matched
                   text (if any). Always indicate success. 
``E?``             Zero or one: If expression `E` matches the text ahead,
                   consume it. Always indicate success. 
``[s]``            Character class: If the character ahead appears in the
                   string `s`, consume it and indicate success. Otherwise
                   indicate failure. 
``[a-b]``          Character range: If the character ahead is one from the
                   range `a` through `b`, consume it and indicate success.
                   Otherwise indicate failure.
``'s'``            String: If the text ahead is the string `s`, consume it
                   and indicate success. Otherwise indicate failure.
``i's'``           String match ignoring case.
``y's'``           String match ignoring style.
``v's'``           Verbatim string match: Use this to override a global
                   ``\i`` or ``\y`` modifier.
``i$j``            String match ignoring case for back reference.
``y$j``            String match ignoring style for back reference.
``v$j``            Verbatim string match for back reference.
``.``              Any character: If there is a character ahead, consume it
                   and indicate success. Otherwise (that is, at the end of
                   input) indicate failure.
``_``              Any Unicode character: If there is an UTF-8 character
                   ahead, consume it and indicate success. Otherwise indicate
                   failure.
``@E``             Search: Shorthand for ``(!E .)* E``. (Search loop for the
                   pattern `E`.)
``{@} E``          Captured Search: Shorthand for ``{(!E .)*} E``. (Search 
                   loop for the pattern `E`.) Everything until and exluding
                   `E` is captured.
``@@ E``           Same as ``{@} E``.
``A <- E``         Rule: Bind the expression `E` to the *nonterminal symbol*
                   `A`. **Left recursive rules are not possible and crash the
                   matching engine.**
``\identifier``    Built-in macro for a longer expression.
``\ddd``           Character with decimal code *ddd*.
``\"``, etc        Literal ``"``, etc. 
===============    ============================================================


Built-in macros
---------------

==============     ============================================================
macro              meaning
==============     ============================================================
``\d``             any decimal digit: ``[0-9]``
``\D``             any character that is not a decimal digit: ``[^0-9]``
``\s``             any whitespace character: ``[ \9-\13]``
``\S``             any character that is not a whitespace character:
                   ``[^ \9-\13]``
``\w``             any "word" character: ``[a-zA-Z0-9_]``
``\W``             any "non-word" character: ``[^a-zA-Z0-9_]``
``\a``             same as ``[a-zA-Z]``
``\A``             same as ``[^a-zA-Z]``
``\n``             any newline combination: ``\10 / \13\10 / \13``
``\i``             ignore case for matching; use this at the start of the PEG
``\y``             ignore style for matching; use this at the start of the PEG
``\skip`` pat      skip pattern *pat* before trying to match other tokens; 
                   this is useful for whitespace skipping, for example:
                   ``\skip(\s*) {\ident} ':' {\ident}`` matches key value 
                   pairs ignoring whitespace around the ``':'``.
``\ident``         a standard ASCII identifier: ``[a-zA-Z_][a-zA-Z_0-9]*``
``\letter``        any Unicode letter
``\upper``         any Unicode uppercase letter
``\lower``         any Unicode lowercase letter
``\title``         any Unicode title letter
``\white``         any Unicode whitespace character
==============     ============================================================

A backslash followed by a letter is a built-in macro, otherwise it
is used for ordinary escaping:

==============     ============================================================
notation           meaning
==============     ============================================================
``\\``             a single backslash
``\*``             same as ``'*'``
``\t``             not a tabulator, but an (unknown) built-in
==============     ============================================================


Supported PEG grammar
---------------------

The PEG parser implements this grammar (written in PEG syntax)::

  # Example grammar of PEG in PEG syntax.
  # Comments start with '#'.
  # First symbol is the start symbol.
  
  grammar <- rule* / expr
  
  identifier <- [A-Za-z][A-Za-z0-9_]*
  charsetchar <- "\\" . / [^\]]
  charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]"
  stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" /
                            "'" ("\\" . / [^'])* "'")
  builtin <- "\\" identifier / [^\13\10]
  
  comment <- '#' @ \n
  ig <- (\s / comment)* # things to ignore
  
  rule <- identifier \s* "<-" expr ig
  identNoArrow <- identifier !(\s* "<-")
  prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@'
  literal <- ig identifier? '$' [0-9]+ / '$' / '^' /
             ig identNoArrow / 
             ig charset / 
             ig stringlit / 
             ig builtin / 
             ig '.' / 
             ig '_' / 
             (ig "(" expr ig ")")
  postfixOpr <- ig '?' / ig '*' / ig '+'
  primary <- prefixOpr* (literal postfixOpr*)
  
  # Concatenation has higher priority than choice:
  # ``a b / c`` means ``(a b) / c``
  
  seqExpr <- primary+
  expr <- seqExpr (ig "/" expr)*


**Note**: As a special syntactic extension if the whole PEG is only a single 
expression, identifiers are not interpreted as non-terminals, but are 
interpreted as verbatim string:

.. code-block:: nim
  abc =~ peg"abc" # is true

So it is not necessary to write ``peg" 'abc' "`` in the above example.


Examples
--------

Check if `s` matches Nim's "while" keyword:

.. code-block:: nim
  s =~ peg" y'while'"

Exchange (key, val)-pairs: 

.. code-block:: nim
  "key: val; key2: val2".replace(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1")

Determine the ``#include``'ed files of a C file:

.. code-block:: nim
  for line in lines("myfile.c"):
    if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws
                     comment <- '/*' @ '*/' / '//' .*
                     ws <- (comment / \s+)* """:
      echo matches[0]

PEG vs regular expression
-------------------------
As a regular expression ``\[.*\]`` matches the longest possible text between
``'['`` and ``']'``. As a PEG it never matches anything, because a PEG is
deterministic: ``.*`` consumes the rest of the input, so ``\]`` never matches.
As a PEG this needs to be written as: ``\[ ( !\] . )* \]`` (or ``\[ @ \]``).

Note that the regular expression does not behave as intended either: in the
example ``*`` should not be greedy, so ``\[.*?\]`` should be used instead.


PEG construction
----------------
There are two ways to construct a PEG in Nim code:
(1) Parsing a string into an AST which consists of `Peg` nodes with the
    `peg` proc.
(2) Constructing the AST directly with proc calls. This method does not
    support constructing rules, only simple expressions and is not as
    convenient. Its only advantage is that it does not pull in the whole PEG
    parser into your executable.
version 0.9.0
=============

- ``=`` should be overloadable; requires specialization for ``=``
- fix remaining generics bugs
- fix remaining closure bugs:
  - fix evals.nim with closures
  - deactivate lambda lifting for JS backend
  - Test capture of for loop vars; test generics;
  - test constant closures

- GC: marker procs for native Nimrod GC and Boehm GC; precise stack marking;
  escape analysis for string/seq seems to be easy to do too;
  even further write barrier specialization
- dead code elim for JS backend; 'of' operator for JS backend
- const ptr/ref
- unsigned ints and bignums; requires abstract integer literal type: 
  use tyInt+node for that
- implement the high level optimizer
- change overloading resolution
- implement proper coroutines
- implement ``partial`` pragma for partial evaluation
- we need to support iteration of 2 different data structures in parallel
- make exceptions compatible with C++ exceptions
- 'const' objects including case objects
- change how comments are part of the AST
- optional indentation for 'case' statement; hm, keep in mind other syntax
  changes that people want; may turn out to be a bad idea
- activate more thread tests
- optimize method dispatchers

Bugs
----
- bug: generic assign still buggy
  - special case the generic assign that needs to care about case objects

- bug: returning a tyVar does not mean it is save to return it:
  proc forward[T](x: var T): var T = result = x
  proc p(): var int = 
    var x: int
    # reject this call via alias analysis:
    result = forward(x)

- bug: stress testing basic method example (eval example) 
  without ``-d:release`` leaks memory?
- bug: temp2.nim triggers weird compiler and except.nim bug
- bug: negative array indexes fail to index check
- bug: object {.pure, final.} does not work again!
- bug: tsortdev does not run with native GC?


version 0.9.XX
==============

- implicit ref/ptr->var conversion; the compiler may store an object
  implicitly on the heap for write barrier efficiency; better: 
  proc specialization in the code gen
- shared memory heap: ``shared ref`` etc. The only hard part in the GC is to
  "stop the world". However, it may be worthwhile to generate explicit 
  (or implicit) syncGC() calls in loops. Automatic loop injection seems
  troublesome, but maybe we can come up with a simple heuristic. (All procs
  that `new` shared memory are syncGC() candidates...)
- EcmaScript needs a new and better code gen: simply adapt the C code gen to it
- tlastmod returns wrong results on BSD (Linux, MacOS X: works)
- nested tuple unpacking; tuple unpacking in non-var-context
- 'nimrod def': does not always work?
- test branch coverage
- checked exceptions
- make pegs support a compile-time option and make c2nim use regexes instead
  per default?
- fix implicit generic routines
- improve docgen to use the semantic pass
- 'export' feature (requires improved docgen)
- think about ``{:}.toTable[int, string]()``
- mocking support with ``tyProxy`` that does:
  o.p(x) --> p(o, x) -->  myMacro(o, p, x)
  
  This is really the opposite of ``tyExpr``:
  * For parameter ``tyExpr`` any argument matches.
  * Argument ``tyProxy`` matches any parameter.
  
- nice idea:

  p(a, b): 
    echo a
    echo b
  
  is the same as:
  
  p(a, b, proc() =
    echo a
    echo b)

Library
-------

- wrappers for poppler; libharu
- suffix trees
- locale support; i18n module
- bignums

- pdcurses bindings

- for system:
  proc `@` [T](a: openArray[T]): seq[T] = 
    newSeq(result, a.len)
    for i in 0..a.len-1: result[i] = a[i]
    
  --> ensure @[] calls the array version!


Low priority
------------

- ``with proc `+`(x, y: T): T`` for generic code
- new feature: ``distinct T with operations``
- find a way for easy constructors and destructors; (destructors are much more
  important than constructors)
- code generated for type information is wasteful
- resizing of strings/sequences could take into account the memory that
  is allocated
- timeout for locks
- compilation cache:
  - adapt thread var emulation to care about the new merge operation
  - check for interface changes; if only the implemenation changes, no
    need to recompile clients; er ... what about templates, macros or anything
    that has inlining semantics?
- codegen should use "NIM_CAST" macro and respect aliasing rules for GCC
- warning for implicit openArray -> varargs conversion
- implement explicit varargs; **but** ``len(varargs)`` problem remains! 
  --> solve by implicit conversion from varargs to openarray
- implement closures that support nesting > 1


Version 2
=========

- language change: inheritance should only work with reference types, so that
  the ``type`` field is not needed for objects! --> zero overhead aggregation
  BETTER: ``of`` and safe object conversions only work with ref objects. Same
  for multi methods.
  
- explicit nil types?
  * nil seq[int]
  * nil string
  * nil ref int
  * nil ptr THallo
  * nil proc 

- better for backwards compatibility: default nilable, but ``not nil``
  notation:
  
  type
    PWindow = ref TWindow not nil
    
  The problem with ``nil`` is that the language currently relies on it for
  implicit initialization. Initialization is different from assignment. The
  issues can "easily" dealt with by ensuring:
  
    var x = myProc() # checks myProc() initializes every pointer explicitely

- guards for the 'case' statement; generalized case statement;
  a guard looks like: 
  
    case x
    of nkStmtList if x.value == 0: 
  
  a generalized case statement looks like:
    
    case x with `=~`
      
- rethink the syntax: distinction between expr and stmt is unfortunate; 
  indentation handling is quite complex too; problem with exception handling
  is that often the scope of ``try`` is wrong and apart from that ``try`` is
  a full blown statement; a ``try`` expression might be a good idea to make
  error handling more light-weight
  people also want ``inc a; inc b``