diff options
Diffstat (limited to 'doc/manual/lexing.txt')
-rw-r--r-- | doc/manual/lexing.txt | 417 |
1 files changed, 0 insertions, 417 deletions
diff --git a/doc/manual/lexing.txt b/doc/manual/lexing.txt deleted file mode 100644 index 7ffd5eb1c..000000000 --- a/doc/manual/lexing.txt +++ /dev/null @@ -1,417 +0,0 @@ -Lexical Analysis -================ - -Encoding --------- - -All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other -encodings are not supported. Any of the standard platform line termination -sequences can be used - the Unix form using ASCII LF (linefeed), the Windows -form using the ASCII sequence CR LF (return followed by linefeed), or the old -Macintosh form using the ASCII CR (return) character. All of these forms can be -used equally, regardless of platform. - - -Indentation ------------ - -Nim's standard grammar describes an `indentation sensitive`:idx: language. -This means that all the control structures are recognized by indentation. -Indentation consists only of spaces; tabulators are not allowed. - -The indentation handling is implemented as follows: The lexer annotates the -following token with the preceding number of spaces; indentation is not -a separate token. This trick allows parsing of Nim with only 1 token of -lookahead. - -The parser uses a stack of indentation levels: the stack consists of integers -counting the spaces. The indentation information is queried at strategic -places in the parser but ignored otherwise: The pseudo terminal ``IND{>}`` -denotes an indentation that consists of more spaces than the entry at the top -of the stack; ``IND{=}`` an indentation that has the same number of spaces. ``DED`` -is another pseudo terminal that describes the *action* of popping a value -from the stack, ``IND{>}`` then implies to push onto the stack. - -With this notation we can now easily define the core of the grammar: A block of -statements (simplified example):: - - ifStmt = 'if' expr ':' stmt - (IND{=} 'elif' expr ':' stmt)* - (IND{=} 'else' ':' stmt)? - - simpleStmt = ifStmt / ... - - stmt = IND{>} stmt ^+ IND{=} DED # list of statements - / simpleStmt # or a simple statement - - - -Comments --------- - -Comments start anywhere outside a string or character literal with the -hash character ``#``. -Comments consist of a concatenation of `comment pieces`:idx:. A comment piece -starts with ``#`` and runs until the end of the line. The end of line characters -belong to the piece. If the next line only consists of a comment piece with -no other tokens between it and the preceding one, it does not start a new -comment: - - -.. code-block:: nim - i = 0 # This is a single comment over multiple lines. - # The scanner merges these two pieces. - # The comment continues here. - - -`Documentation comments`:idx: are comments that start with two ``##``. -Documentation comments are tokens; they are only allowed at certain places in -the input file as they belong to the syntax tree! - - -Multiline comments ------------------- - -Starting with version 0.13.0 of the language Nim supports multiline comments. -They look like: - -.. code-block:: nim - #[Comment here. - Multiple lines - are not a problem.]# - -Multiline comments support nesting: - -.. code-block:: nim - #[ #[ Multiline comment in already - commented out code. ]# - proc p[T](x: T) = discard - ]# - -Multiline documentation comments also exist and support nesting too: - -.. code-block:: nim - proc foo = - ##[Long documentation comment - here. - ]## - - -Identifiers & Keywords ----------------------- - -Identifiers in Nim can be any string of letters, digits -and underscores, beginning with a letter. Two immediate following -underscores ``__`` are not allowed:: - - letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff' - digit ::= '0'..'9' - IDENTIFIER ::= letter ( ['_'] (letter | digit) )* - -Currently any Unicode character with an ordinal value > 127 (non ASCII) is -classified as a ``letter`` and may thus be part of an identifier but later -versions of the language may assign some Unicode characters to belong to the -operator characters instead. - -The following keywords are reserved and cannot be used as identifiers: - -.. code-block:: nim - :file: ../keywords.txt - -Some keywords are unused; they are reserved for future developments of the -language. - - -Identifier equality -------------------- - -Two identifiers are considered equal if the following algorithm returns true: - -.. code-block:: nim - proc sameIdentifier(a, b: string): bool = - a[0] == b[0] and - a.replace(re"_|–", "").toLower == b.replace(re"_|–", "").toLower - -That means only the first letters are compared in a case sensitive manner. Other -letters are compared case insensitively and underscores and en-dash (Unicode -point U+2013) are ignored. - -This rather unorthodox way to do identifier comparisons is called -`partial case insensitivity`:idx: and has some advantages over the conventional -case sensitivity: - -It allows programmers to mostly use their own preferred -spelling style, be it humpStyle, snake_style or dash–style and libraries written -by different programmers cannot use incompatible conventions. -A Nim-aware editor or IDE can show the identifiers as preferred. -Another advantage is that it frees the programmer from remembering -the exact spelling of an identifier. The exception with respect to the first -letter allows common code like ``var foo: Foo`` to be parsed unambiguously. - -Historically, Nim was a fully `style-insensitive`:idx: language. This meant that -it was not case-sensitive and underscores were ignored and there was no even a -distinction between ``foo`` and ``Foo``. - - -String literals ---------------- - -Terminal symbol in the grammar: ``STR_LIT``. - -String literals can be delimited by matching double quotes, and can -contain the following `escape sequences`:idx:\ : - -================== =================================================== - Escape sequence Meaning -================== =================================================== - ``\n`` `newline`:idx: - ``\r``, ``\c`` `carriage return`:idx: - ``\l`` `line feed`:idx: - ``\f`` `form feed`:idx: - ``\t`` `tabulator`:idx: - ``\v`` `vertical tabulator`:idx: - ``\\`` `backslash`:idx: - ``\"`` `quotation mark`:idx: - ``\'`` `apostrophe`:idx: - ``\`` '0'..'9'+ `character with decimal value d`:idx:; - all decimal digits directly - following are used for the character - ``\a`` `alert`:idx: - ``\b`` `backspace`:idx: - ``\e`` `escape`:idx: `[ESC]`:idx: - ``\x`` HH `character with hex value HH`:idx:; - exactly two hex digits are allowed -================== =================================================== - - -Strings in Nim may contain any 8-bit value, even embedded zeros. However -some operations may interpret the first binary zero as a terminator. - - -Triple quoted string literals ------------------------------ - -Terminal symbol in the grammar: ``TRIPLESTR_LIT``. - -String literals can also be delimited by three double quotes -``"""`` ... ``"""``. -Literals in this form may run for several lines, may contain ``"`` and do not -interpret any escape sequences. -For convenience, when the opening ``"""`` is followed by a newline (there may -be whitespace between the opening ``"""`` and the newline), -the newline (and the preceding whitespace) is not included in the string. The -ending of the string literal is defined by the pattern ``"""[^"]``, so this: - -.. code-block:: nim - """"long string within quotes"""" - -Produces:: - - "long string within quotes" - - -Raw string literals -------------------- - -Terminal symbol in the grammar: ``RSTR_LIT``. - -There are also raw string literals that are preceded with the -letter ``r`` (or ``R``) and are delimited by matching double quotes (just -like ordinary string literals) and do not interpret the escape sequences. -This is especially convenient for regular expressions or Windows paths: - -.. code-block:: nim - - var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab - -To produce a single ``"`` within a raw string literal, it has to be doubled: - -.. code-block:: nim - - r"a""b" - -Produces:: - - a"b - -``r""""`` is not possible with this notation, because the three leading -quotes introduce a triple quoted string literal. ``r"""`` is the same -as ``"""`` since triple quoted string literals do not interpret escape -sequences either. - - -Generalized raw string literals -------------------------------- - -Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``, -``GENERALIZED_TRIPLESTR_LIT``. - -The construct ``identifier"string literal"`` (without whitespace between the -identifier and the opening quotation mark) is a -generalized raw string literal. It is a shortcut for the construct -``identifier(r"string literal")``, so it denotes a procedure call with a -raw string literal as its only argument. Generalized raw string literals -are especially convenient for embedding mini languages directly into Nim -(for example regular expressions). - -The construct ``identifier"""string literal"""`` exists too. It is a shortcut -for ``identifier("""string literal""")``. - - -Character literals ------------------- - -Character literals are enclosed in single quotes ``''`` and can contain the -same escape sequences as strings - with one exception: `newline`:idx: (``\n``) -is not allowed as it may be wider than one character (often it is the pair -CR/LF for example). Here are the valid `escape sequences`:idx: for character -literals: - -================== =================================================== - Escape sequence Meaning -================== =================================================== - ``\r``, ``\c`` `carriage return`:idx: - ``\l`` `line feed`:idx: - ``\f`` `form feed`:idx: - ``\t`` `tabulator`:idx: - ``\v`` `vertical tabulator`:idx: - ``\\`` `backslash`:idx: - ``\"`` `quotation mark`:idx: - ``\'`` `apostrophe`:idx: - ``\`` '0'..'9'+ `character with decimal value d`:idx:; - all decimal digits directly - following are used for the character - ``\a`` `alert`:idx: - ``\b`` `backspace`:idx: - ``\e`` `escape`:idx: `[ESC]`:idx: - ``\x`` HH `character with hex value HH`:idx:; - exactly two hex digits are allowed -================== =================================================== - -A character is not an Unicode character but a single byte. The reason for this -is efficiency: for the overwhelming majority of use-cases, the resulting -programs will still handle UTF-8 properly as UTF-8 was specially designed for -this. Another reason is that Nim can thus support ``array[char, int]`` or -``set[char]`` efficiently as many algorithms rely on this feature. The `Rune` -type is used for Unicode characters, it can represent any Unicode character. -``Rune`` is declared in the `unicode module <unicode.html>`_. - - -Numerical constants -------------------- - -Numerical constants are of a single type and have the form:: - - hexdigit = digit | 'A'..'F' | 'a'..'f' - octdigit = '0'..'7' - bindigit = '0'..'1' - HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )* - DEC_LIT = digit ( ['_'] digit )* - OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )* - BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )* - - INT_LIT = HEX_LIT - | DEC_LIT - | OCT_LIT - | BIN_LIT - - INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8' - INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16' - INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32' - INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64' - - UINT_LIT = INT_LIT ['\''] ('u' | 'U') - UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8' - UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16' - UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32' - UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64' - - exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )* - FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent) - FLOAT32_SUFFIX = ('f' | 'F') ['32'] - FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX - | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX - FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D' - FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX - | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX - - -As can be seen in the productions, numerical constants can contain underscores -for readability. Integer and floating point literals may be given in decimal (no -prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal -(prefix ``0x``) notation. - -There exists a literal for each numerical type that is -defined. The suffix starting with an apostrophe ('\'') is called a -`type suffix`:idx:. Literals without a type suffix are of the type ``int``, -unless the literal contains a dot or ``E|e`` in which case it is of -type ``float``. For notational convenience the apostrophe of a type suffix -is optional if it is not ambiguous (only hexadecimal floating point literals -with a type suffix can be ambiguous). - - -The type suffixes are: - -================= ========================= - Type Suffix Resulting type of literal -================= ========================= - ``'i8`` int8 - ``'i16`` int16 - ``'i32`` int32 - ``'i64`` int64 - ``'u`` uint - ``'u8`` uint8 - ``'u16`` uint16 - ``'u32`` uint32 - ``'u64`` uint64 - ``'f`` float32 - ``'d`` float64 - ``'f32`` float32 - ``'f64`` float64 - ``'f128`` float128 -================= ========================= - -Floating point literals may also be in binary, octal or hexadecimal -notation: -``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64`` -is approximately 1.72826e35 according to the IEEE floating point standard. - -Literals are bounds checked so that they fit the datatype. Non base-10 -literals are used mainly for flags and bit pattern representations, therefore -bounds checking is done on bit width, not value range. If the literal fits in -the bit width of the datatype, it is accepted. -Hence: 0b10000000'u8 == 0x80'u8 == 128, but, 0b10000000'i8 == 0x80'i8 == -1 -instead of causing an overflow error. - -Operators ---------- - -Nim allows user defined operators. An operator is any combination of the -following characters:: - - = + - * / < > - @ $ ~ & % | - ! ? ^ . : \ - -These keywords are also operators: -``and or not xor shl shr div mod in notin is isnot of``. - -`=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they -are used for other notational purposes. - -``*:`` is as a special case treated as the two tokens `*`:tok: and `:`:tok: -(to support ``var v*: T``). - - -Other tokens ------------- - -The following strings denote other tokens:: - - ` ( ) { } [ ] , ; [. .] {. .} (. .) - - -The `slice`:idx: operator `..`:tok: takes precedence over other tokens that -contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok: -and not the two tokens `{.`:tok:, `.}`:tok:. - |