diff options
Diffstat (limited to 'doc/spec.txt')
-rwxr-xr-x | doc/spec.txt | 1297 |
1 files changed, 1297 insertions, 0 deletions
diff --git a/doc/spec.txt b/doc/spec.txt new file mode 100755 index 000000000..3bad06e97 --- /dev/null +++ b/doc/spec.txt @@ -0,0 +1,1297 @@ +==================== +Nimrod Specification +==================== + +:Author: Andreas Rumpf + +.. contents:: + + +About this document +=================== + +This document describes the lexis, the syntax, and the semantics of Nimrod. +However, this is only a first draft. Some parts need to be more precise, +features may be added to the language, etc. + +The language constructs are explained using an extended BNF, in +which ``(a)*`` means 0 or more ``a``'s, ``a+`` means 1 or more ``a``'s, and +``(a)?`` means an optional *a*; an alternative spelling for optional parts is +``[a]``. The ``|`` symbol is used to mark alternatives +and has the lowest precedence. Parentheses may be used to group elements. +Non-terminals are in lowercase, terminal symbols (including keywords) are in +UPPERCASE. An example:: + + if_stmt ::= IF expr COLON stmts (ELIF expr COLON stmts)* [ELSE stmts] + + +Definitions +=========== + +The following defintions are the same as their counterparts in the +specification of the Modula-3 programming language. + +A Nimrod program specifies a computation that acts on a sequence of digital +components called `locations`. A variable is a set of locations that +represents a mathematical value according to a convention determined by the +variable's *type*. If a value can be represented by some variable of type +``T``, then we say that the value is a *member* of ``T`` and ``T`` *contains* +the value. + +An *identifier* is a symbol declared as a name for a variable, type, +procedure, etc. The region of the program over which a declaration applies is +called the *scope* of the declaration. Scopes can be nested. The meaning of an +identifier is determined by the smallest enclosing scope in which the +identifier is declared. + +An expression specifies a computation that produces a value or variable. +Expressions that produce variables are called `designators`. A designator +can denote either a variable or the value of that variable, depending on +the context. Some designators are *readonly*, which means that they cannot +be used in contexts that might change the value of the variable. A +designator that is not readonly is called *writable*. Expressions whose +values can be determined statically are called *constant expressions*; +they are never designators. + +A `static error` is an error that the implementation must detect before +program execution. Violations of the language definition are static +errors unless they are explicitly classified as runtime errors. + +A `checked runtime error` is an error that the implementation must detect +and report at runtime. The method for reporting such errors is via *raising +exceptions*. However, an implementation may provide a means to disable these +runtime checks. See the section *pragmas* for details. + +An `unchecked runtime error` is an error that is not guaranteed to be +detected, and can cause the subsequent behavior of the computation to +be arbitrary. Unchecked runtime errors cannot occur if only `safe` +language features are used. + + +Lexical Analysis +================ + +Indentation +----------- + +Nimrod's standard grammar describes an `indentation sensitive` language. +This means that all the control structures are recognized by the indentation. +Indentation consists only of spaces; tabulators are not allowed. + +The terminals ``IND`` (indentation), ``DED`` (dedentation) and ``SAD`` +(same indentation) are generated by the scanner, denoting an indentation. +Using tabulators for the indentation is not allowed. + +These terminals are only generated for *logical lines*, i.e. not for an empty +line and not for a line with only whitespace or comments. + +The parser and the scanner communicate over a stack which indentation terminal +should be generated: The stack consists of integers counting the spaces. The +stack is initialized with a zero on its top. The scanner reads from the stack: +If the current indentation token consists of more spaces than the entry at the +top of the stack, a ``IND`` token is generated, else if it consists of the same +number of spaces, a ``SAD`` token is generated. If it consists of fewer spaces, +a ``DED`` token is generated for any item on the stack that is greater than the +current. These items are then popped from the stack by the scanner. At the end +of the file, a ``DED`` token is generated for each number remaining on the +stack that is larger than zero. + +Because the grammar contains some optional ``IND`` tokens, the scanner cannot +push new indentation levels. This has to be done by the parser. The symbol +``IND_PUSH`` indicates that the ``IND`` token should be pushed onto the stack +by the parser. + +An Example how this works:: + + if_stmt ::= IF expr COLON stmts (ELIF expr COLON stmts)* (ELSE stmts)? + stmts ::= IND_PUSH (stmt [SAD])+ DED | stmt [SAD] + + if expr0: + stmt1 # would be valid because, SAD is not generated any longer! + + if expr0: + stmt1 # scanner: IND; parser pushes 2 onto the stack + if expr3: stmt5 # DED; ... SAD eaten by stmt5 + else: stmt6 # + if expr4: stmt7 + if expr: + stmt1 # scanner: IND; the parser pushes 2 onto the stack + stmt2 # scanner: SAD (because indentation is 2) + elif expr2: # scanner: DED (because indentation is 0) + stmts3 # scanner: IND; the parser pushes 2 onto the stack + else: # scanner: DED; + stmt4 # scanner: IND; the parser pushes 2 onto the stack + # scanner generates 1 DED, because end of file and 1 item on stack + + + +Identifiers & Keywords +---------------------- + +`Identifiers` in Nimrod can be any string of letters, digits +and underscores, beginning with a letter. Two immediate following +underscores ``__`` are not allowed:: + + letter ::= 'A'..'Z' | 'a'..'z' + digit ::= '0'..'9' + IDENTIFIER ::= letter ( ['_'] letter | digit )* + +The following `keywords` are reserved and cannot be used as identifiers:: + + ${keywords} + +Some keywords are unused; they are reserved for future developments of the +language. + +Nimrod is a `style-insensitive` language. This means that it is not +case-sensitive and even underscores are ignored: +**type** is a reserved word, and so is **TYPE** or **T_Y_P_E**. The idea behind +this is, that this allows programmers to use their own prefered spelling style. +Editors can show the identifiers as configured. + + +Literal strings +--------------- + +`Literal strings` can be delimited by matching double quotes, and can contain +the following `escape sequences`: + +================== ================================== + Escape sequence Meaning +================== ================================== + ``\n`` `newline` + ``\r`` `carriage return` + ``\l`` `line feed` + ``\f`` `form feed` + ``\t`` `tabulator` + ``\v`` `vertical tabulator` + ``\\`` `backslash` + ``\"`` `quotation mark` + ``\'`` `apostrophe` + ``\y`` `character with number y` + ``\a`` `alert` + ``\b`` `backspace` + ``\e`` `escape` `[ESC]` +================== ================================== + + +Strings in Nimrod may contain any 8-bit value, except embedded zeros, +which are not allowed for compability with `C`. + +Literal strings can also be delimited by three double squotes +``"""`` ... ``"""``. +Literals in this form may run for several lines, may contain ``"`` and do not +interpret any escape sequences. +For convenience, when the opening ``"""`` is immediately +followed by a newline, the newline is not included in the string. +`Raw string literals` are preceded with the letter ``r`` (or ``R``) +and are delimited by matching double quotes (just like ordinary string +literals) and do not interpret the escape sequences. + + +Literal characters +------------------ + +Character literals are enclosed in single quotes ``''`` and can contain the +same escape sequences as strings - with one exception: ``\n`` is not allowed +as it may be wider than one character (often it is the pair CR/LF for example). + + +Numerical constants +------------------- + +`Numerical constants` are of a single type and have the form:: + + hexdigit ::= digit | 'A'..'F' | 'a'..'f' + octdigit ::= '0'..'7' + bindigit ::= '0'..'1' + INT_LIT ::= digit ( ['_'] digit )* + | '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )* + | '0o' octdigit ( ['_'] octdigit )* + | '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )* + + INT8_LIT ::= INT_LIT '\'' ('i' | 'I' ) '8' + INT16_LIT ::= INT_LIT '\'' ('i' | 'I' ) '16' + INT32_LIT ::= INT_LIT '\'' ('i' | 'I' ) '32' + INT64_LIT ::= INT_LIT '\'' ('i' | 'I' ) '64' + + exponent ::= ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )* + FLOAT_LIT ::= digit (['_'] digit)* ('.' (['_'] digit)* [exponent] |exponent) + FLOAT32_LIT ::= ( FLOAT_LIT | INT_LIT ) '\'' ('f' | 'F') '32' + FLOAT64_LIT ::= ( FLOAT_LIT | INT_LIT ) '\'' ('f' | 'F') '64' + + +As can be seen in the productions, numerical constants can contain unterscores +for readability. Integer and floating point literals may be given in decimal (no +prefix), binary (prefix ``0b``), octal (prefix ``0o``) and +hexadecimal (prefix ``0x``) notation. + +There exists a literal for each numerical type that are +defined. The suffix starting with an apostophe ('\'') is called a +`type suffix`. Literals without a type prefix are of the type ``int``, unless +the literal contains a dot or an ``E`` in which case it is of type ``float``. + +The following table specifies type suffixes: + +================= ========================= + Type Suffix Resulting type of literal +================= ========================= + ``'i8`` int8 + ``'i16`` int16 + ``'i32`` int32 + ``'i64`` int64 + ``'f32`` float32 + ``'f64`` float64 +================= ========================= + +Floating point literals may also be in binary, octal or hexadecimal +notation: +``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64`` +is approximately 1.72826e35 according to the IEEE floating point standard. + + + + +Comments +-------- + +`Comments` start anywhere outside a string with the hash character ``#``. +Comments run until the end of the line. Comments are tokens; they are only +allowed at certain places in the input file as they belong to the syntax. This +is essential for performing correct source-to-source transformations or +documentation generators. + + +Other tokens +------------ + +The following strings denote other tokens:: + + ( ) { } [ ] , ; [. .] {. .} (. .) + : = ^ .. ` + +``..`` takes precedence over other tokens that contain a dot: ``{..}`` are the +three tokens ``{``, ``..``, ``}`` and not the two tokens ``{.``, ``.}``. + +In Nimrod one can define his own operators. An `operator` is any +combination of the following characters that are not listed above:: + + + - * / < > + = @ $ ~ & % + ! ? ^ . | + +These keywords are also operators: +``and or not xor shl shr div mod in notin is isnot``. + + +Syntax +====== + +This section lists Nimrod's standard syntax in ENBF. How the parser receives +indentation tokens is already described in the Lexical Analysis section. + +Nimrod allows user-definable operators. +Binary operators have 8 different levels of precedence. For user-defined +operators, the precedence depends on the first character the operator consists +of. All binary operators are left-associative. + +================ ============================================== ================== =============== +Precedence level Operators First characters Terminal symbol +================ ============================================== ================== =============== + 7 (highest) ``$`` OP7 + 6 ``* / div mod shl shr %`` ``* % \ /`` OP6 + 5 ``+ -`` ``+ ~ |`` OP5 + 4 ``&`` ``&`` OP4 + 3 ``== <= < >= > != in not_in is isnot`` ``= < > !`` OP3 + 2 ``and`` OP2 + 1 ``or xor`` OP1 + 0 (lowest) ``? @ ^ ` : .`` OP0 +================ ============================================== ================== =============== + + +The grammar's start symbol is ``module``. The grammar is LL(1) and therefore +not ambigious. + +.. include:: grammar.txt + :literal: + + + +Semantics +========= + +Constants +--------- + +Constants are symbols which are bound to a value. The constant's value +cannot change. The compiler must be able to evaluate the expression in a +constant declaration at compile time. This means that most of the functions in +the runtime library cannot be used in a constant declaration. + +Operators such as ``+, -, *, /, not, and, or, div, mod`` and the procedures +``addr, ord, chr, sizeof, trunc, round, frac, odd, abs`` can be used, however. +An implementation may restrict the usage of ``addr`` in constant expressions. + + +Types +----- + +All expressions have a type which is known at compile time. Thus Nimrod is +statically typed. One can declare new types, which is in +essence defining an identifier that can be used to denote this custom type +when declaring variables further in the source code. + +These are the major type classes: + +* ordinal types (consist of integer, bool, character, enumeration + (and subranges thereof) types) +* floating point types +* string type +* structured types +* reference (pointer) type +* procedural type + + +Ordinal types +~~~~~~~~~~~~~ +Ordinal types have the following characteristics: + +- Ordinal types are countable and ordered. This property allows + the operation of functions as Inc, Ord, Dec on ordinal types to be defined. +- Ordinal values have a smallest possible value. Trying to count farther + down than the smallest value gives a checked runtime or static error. +- Ordinal values have a largest possible value. Trying to count farther + than the largest value gives a checked runtime or static error. + +Signed integers, bool, characters and enumeration types (and subrange of these +types) belong to ordinal types. Unsigned integer types are special in the way +that over- and underflows generate no errors, but wrap around. + +Pre-defined numerical types +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +These integer types are pre-defined: + +``int`` + the generic signed integer type; its size is platform dependant + (a compiler should choose the processor's fastest integer type) + this type should be used in general. An integer literal that has no type + suffix is of this type. + +intXX + an implementation may define additional signed integer types + of XX bits using this naming scheme (example: int16 is a 16 bit wide integer). + The current implementation supports ``int8``, ``int16``, ``int32``, ``int64``. + Literals of these types have the suffix 'iXX. + + +There are no unsigned integer types, only *unsigned operations* that treat their +arguments as unsigned. Unsigned operations all wrap around; they may not lead to +over- or underflow errors. + +The following floating point types are pre-defined: + +``float`` + the generic floating point type; its size is platform dependant + (a compiler should choose the processor's fastest floating point type) + this type should be used in general + +floatXX + an implementation may define additional floating point types of XX bits using + this naming scheme (example: float64 is a 64 bit wide float). The current + implementation supports ``float32`` and ``float64``. Literals of these types + have the suffix 'fXX. + +Automatic type conversion in expressions where different kinds +of integer types are used is performed. However, if the type conversion +loses information, the `EConvertError` exception is raised. An implementation +may detect certain cases of the convert error at compile time. + +Automatic type conversion in expressions with different kinds +of floating point types is performed: The smaller type is +converted to the larger. Arithmetic performed on floating point types +follows the IEEE standard. + + +Boolean type +~~~~~~~~~~~~ +The boolean type is named ``bool`` in Nimrod and can be one of the two +pre-defined values ``true`` and ``false``. Conditions in while, +if, elif, when statements need to be of type bool. + +This condition should hold:: + + ord(false) == 0 and ord(true) == 1 + +The operators ``not, and, or, xor`` are defined for the bool type. +The ``and`` and ``or`` operators perform short-cut evaluation. +Example:: + + while p != nil and p.name != "xyz": + # p.name is not evaluated if p == nil + p = p.next + + +The size of the bool type is implementation-dependant, typically it is +one byte. + + +Character type +~~~~~~~~~~~~~~ +The character type is named ``char`` in Nimrod and uses the platform's +native encoding. Thus on nearly every platform its size is one byte due to +the popular UTF-8 encoding. +Character literals are enclosed in single quotes ``''``. + +.. Note:: For platform-independant character handling is the ``encoding`` + standard module. + + +Enumeration types +~~~~~~~~~~~~~~~~~ +Enumeration types define a new type whose values consist only of the ones +specified. +The values are ordered by the order in enum's declaration. Example:: + + type + TDirection = enum + north, east, south, west + + +Now the following holds:: + + ord(north) == 0 + ord(east) == 1 + ord(south) == 2 + ord(west) == 3 + +Thus, north < east < south < west. The comparison operators can be used +with enumeration types. + +An implemenation should store enumeration types with the minimal number of +bytes required for the particular enum, unless efficiency would be affected by +doing so. + +For better interfacing to other programming languages, the fields of enum +types can be assigned an explicit ordinal value. However, the ordinal values +have to be in ascending order appropriately. A field whose ordinal value that +is not explicitly given, gets the value of the previous field + 1. + +An explicit ordered enum can have *wholes*:: + + type + TTokenType = enum + a = 2, b = 4, c = 89 # wholes are valid + +However, it is then not an ordinal anymore, so it is not possible to use these +enums as an index type for arrays. The procedures ``inc``, ``dec``, ``succ`` +and ``pred`` are not available for them. + + +Subrange types +~~~~~~~~~~~~~~ +A subrange type is a range of values from an ordinal type (the host type). +To define a subrange type, one must specify it's limiting values: the highest +and lowest value of the type:: + + type + TSubrange = range[0..5] + + +``TSubrange`` is a subrange of an integer which can only hold the values 0 +to 5. Assigning an other value to a variable of type ``TSubrange`` is a +checked runtime error (or static error if it can be statically +determined). Assignments from the base type to one of its subrange types +(and vice versa) are allowed. + +An implemenation should give it the same size as its base type. + + +String type +~~~~~~~~~~~ +All string literals are of the type string. A string in Nimrod is very +similar to a sequence of characters. However, strings in Nimrod both are +zero-terminated and have a length field. One can retrieve the length with the +builtin ``length`` procedure; the length never counts the terminating zero. +The assignment operator for strings always copies the string. + +Strings are compared by their lexicographical order. All comparison operators +are available. String can be indexed like arrays (lower bound is 0). Unlike +arrays, they can be used in case statements:: + + case paramStr(i) + of "-v": incl(options, optVerbose) + of "-h", "-?": incl(options, optHelp) + else: write(stdout, "invalid command line option!\n") + + +Structured types +~~~~~~~~~~~~~~~~ +A variable of a structured type can hold multiple values at the same time. +Stuctured types can be nested to unlimited levels. Arrays, sequences, records, +objects and sets belong to the structured types. + +Array type +~~~~~~~~~~ +Arrays are a homogenous type, meaning that each element in the array has the +same type. Arrays always have a fixed length which is specified at compile time +(except for open arrays). They can be indexed by any ordinal type. A parameter +may leave out the index type in the declaration making it an +*open array*. An open array ``A`` is always indexed by integers from 0 to +``length(A)-1``. + +A sequence may be passed to a parameter that is of type *open array*, but +not to a multi-dimensional open array, because it is impossible to do so in an +efficient manner. + +An array expression may be constructed by the array constructor ``[]``. +A constructed array is assignment compatible to a sequence. + +Example:: + + type + TIntArray = array[0..5, int] + var + x: TIntArray + x = [1, 2, 3, 4, 5, 6] # this is the array constructor + +The lower bound of an array may be received by the built-in proc +``low()``, the higher bound by ``high()``. The length may be +received by ``length()``. + +Arrays are always bounds checked (at compile-time or at runtime). An +implementation may provide a means to disable these checks. + + +Sequence type +~~~~~~~~~~~~~ +Sequences are similar to arrays but of dynamic length which may change +during runtime (like strings). A sequence ``S`` is always indexed by integers +from 0 to ``length(S)-1`` and its bounds are checked. Sequences can also be +constructed by the array constructor ``[]``. + + +Record and object types +~~~~~~~~~~~~~~~~~~~~~~~ +A variable of a record or object type is a heterogenous storage container. +A record or object defines various named *fields* of a type. The assignment +operator for records and objects always copies the whole record/object. The +constructor ``[]`` can be used to initialize records/objects. A field may +be given a default value. Fields with default values do not have to be listed +in a record construction, all other fields have to be listed. +:: + + type + TPerson = record # type representing a person + name: string # a person consists of a name + age: int = 30 # and an age which default value is 30 + + var + person: TPerson + person = (name: "Peter") # person.age is its default value (30) + +An implementation may align or even reorder the fields for best access +performance. The alignment may be specified with the `align` +pragma. If an alignment is specified the compiler shall not reorder the fields. + +The difference between records and objects is that objects allow inheritance. +Objects have access to their type at runtime, so that the ``is`` operator +can be used to determine the object's type. Assignment from an object to its +parents' object leads to a static or runtime error (the +`EInvalidObjectAssignment` exception is raised). +:: + + type + TPerson = object + name: string + age: int + + TStudent = object of TPerson # a student is a person + id: int # with an id field + + var + student: TStudent + person: TPerson + student = (name: "Peter", age: 89, id: 3) + person = (name: "Mary", age: 17) + assert(student is TStudent) # is true + person = student # this is an error; person has no storage for id. + + +Set type +~~~~~~~~ +The `set type` models the mathematical notion of a set. The set's +basetype can only be an ordinal type. The reason is that sets are implemented +as bit vectors. Sets are designed for high performance computing. + +.. Note:: The sets module can be used for sets of other types. + +Sets can be constructed via the set constructor: ``{}`` is the empty set. The +empty set is type combatible with any special set type. The constructor +can also be used to include elements (and ranges of elements) in the set:: + + {'a'..'z', '0'..'9'} # This constructs a set that conains the + # letters from 'a' to 'z' and the digits + # from '0' to '9' + +These operations are supported by sets: + +================== ========================================================== +operation meaning +================== ========================================================== + A + B union of two sets + A * B intersection of two sets + A - B difference of two sets (A without B's elements) + A == B set equality + A <= B subset relation (A is subset of B or equal to B) + A < B strong subset relation (A is a real subset of B) + e in A set membership (A contains element e) + A >< B symmetric set difference (= (A - B) + (B - A)) + card(A) the cardinality of A (number of elements in A) + incl(A, elem) same as A = A + {elem}, but may be faster + excl(A, elem) same as A = A - {elem}, but may be faster +================== ========================================================== + +Reference type +~~~~~~~~~~~~~~ +References (similiar to `pointers` in other programming languages) are a way to +introduce many-to-one relationships. This means different references can point +to and modify the same location in memory. References should be used sparingly +in a program. They are only needed for constructing graphs. + +Nimrod distinguishes between *traced* and *untraced* references. Untraced +references are also called `pointers`. The difference between them is that +traced references are garbage collected, untraced are not. Thus untraced +references are *unsafe*. However for certain low-level operations (accessing +the hardware) untraced references are unavoidable. + +Traced references are declared with the **ref** keyword, untraced references +are declared with the **ptr** keyword. + +The ``^`` operator can be used to derefer a reference, the ``addr`` procedure +returns the address of an item. An address is always an untraced reference. +Thus the usage of ``addr`` is an *unsafe* feature. + +The ``.`` (access a record field operator) and ``[]`` (array/string/sequence +index operator) operators perform implicit dereferencing operations for +reference types:: + + type + PNode = ref TNode + TNode = record + le, ri: PNode + data: int + + var + n: PNode + new(n) + n.data = 9 # no need to write n^.data + +As can be seen by the example, reference types are the only types that can be +used in *implicit forward declarations*: TNode may be used before it is +defined, because only a refence to it is needed. + +To allocate a new traced object, the built-in procedure ``new`` has to be used. +To deal with untraced memory, the procedures ``alloc``, ``dealloc`` and +``realloc`` can be used. The documentation of the system module contains +further information. + +Special care has to be taken if an untraced object contains traced objects like +traced references, strings or sequences: In order to free everything properly, +the built-in procedure ``finalize`` has to be called before freeing the +untraced memory manually! + +.. XXX finalizers for traced objects + +Procedural type +~~~~~~~~~~~~~~~ +A procedural type is internally a pointer to procedure. Thus ``nil`` is an +allowed value for variables of a procedural type. Nimrod uses procedural types +to achieve `functional` programming techniques. Dynamic dispatch for OOP +constructs can also be implemented with procedural types. + +Example:: + + type + TCallback = proc (x: int) {.cdecl.} + + proc printItem(x: Int) = ... + + proc forEach(c: TCallback) = + ... + + forEach(printItem) # this will NOT work because calling conventions differ + +A subtle issue with procedural types is that the calling convention of the +procedure influences the type compability: Procedural types are only compatible +if they have the same calling convention. + +Altough a Nimrod implementation may provide additional calling conventions +the following shall always exist: + +``cdecl`` + is used for interfacing with C; indicates that a proc shall + use the same calling convention as the C compiler. + +``inline`` + indicates that the caller of the proc should not call it, + but rather inline its code in place for improved efficiency. Note that + this is only a hint for the compiler: It may completely ignore it and + it may inline procedures that are not marked as ``inline``. + +``closure`` + indicates that the procedure expects a context, a `closure` that needs + to be passed to the procedure. + + +Statements +---------- +Nimrod uses the common statement/expression paradigma: Statements do not +produce a value in contrast to expressions. Call expressions are statements. +If the called procedure returns a value, it is not a valid statement +as statements do not produce values. To evaluate an expression for +side-effects and throwing its value away, one can use the ``discard`` +statement. + +Statements are separated into `simple statements` and `complex statements`. +Simple statements are statements that cannot contain other statements, like +assignments, calls or the ``return`` statement; complex statements can +contain other statements. To avoid the `dangling else problem`, complex +statements always have to be intended:: + + XXX + + + +Discard statement +~~~~~~~~~~~~~~~~~ + +Syntax:: + + discard_stmt ::= discard expr + +The `discard` statement evaluates its expression for side-effects and throws +the expression's resulting value away. If the expression has no side-effects, +this shall generate at least a warning from the compiler. + + +Var statement +~~~~~~~~~~~~~ + +Syntax:: + + varlist ::= identlist [asgn_opr expr doc] + var_section ::= var varlist + | var INDENT(x > I[-1]) PUSH(x) varlist + { INDENT(x) varlist } POP + +`Var` statements simply declare new local and global variables and may +initialize them. A comma seperated list of variables can be used to specify +variables of the same type:: + + var + a: int = 0 + x, y, z: int + +However, an initializer is not allowed for such a list as its semantic +would be ambigious in some cases. If an initializer is given the type +can be omitted: The variable is of the same type as the initializing +expression. + + +If statement +~~~~~~~~~~~~ + +Syntax:: + + if_stmt ::= PUSH(x = I[-1]) if expr ":" stmts + { INDENT(x) elif expr ":" stmts } + [ INDENT(x) else ":" stmts ] + POP + +The `if` statement is a simple way to make a branch in the control flow: +The expression after the keyword ``if`` is evaluated, if it is true +the corresponding statements after the ``:`` are executed. Otherwise +the expression after the ``elif`` is evaluated (if there is an +``elif`` branch), if it is true the corresponding statements after +the ``:`` are executed. This goes on until the last ``elif``. If all +conditions fail, the ``else`` part is executed. If there is no ``else`` +part, execution continues with the statement after the ``if`` statement. + + +Case statement +~~~~~~~~~~~~~~ + +Syntax:: + + case_stmt ::= PUSH(I[-1]) case expr + INDENT(x>=I[-1]) of vallist ":" stmts + { INDENT(x) of vallist ":" stmts } + { INDENT(x) elif expr ":" stmts } + [ INDENT(x) else ":" stmts ] + POP + +The `case` statement is similar to the if statement, but it represents +a multi-branch selection. The expression after the keyword ``case`` is +evaluated and if its value is in a *vallist* the corresponding statements +(after the ``of`` keyword) are executed. If the value is no given *vallist* +the ``else`` part is executed. If there is no ``else`` part and not all +possible values that ``expr`` can hold occur in a ``vallist``, a static +error shall be given. This holds only for expressions of ordinal types. +If the expression is not of an ordinal type, and no ``else`` part is +given, control just passes after the ``case`` statement. + +To suppress the static error in the ordinal case the programmer needs +to write an ``else`` part with a ``nil`` statement. + + +When statement +~~~~~~~~~~~~~~ + +Syntax:: + + when_stmt ::= PUSH(x=I[-1]) when expr ":" stmts + { INDENT(x) elif expr ":" stmts } + [ INDENT(x) else ":" stmts ] + POP + +The `when` statement is almost identical to the ``if`` statement with some +exceptions: + +* Each ``expr`` has to be a constant expression of type ``bool``. +* The statements do not open a new scope if they introduce new identifiers. +* The statements that belong to the expression that evaluated to true are + translated by the compiler, the other statements are not checked for + syntax or semantics at all! This holds also for any ``expr`` coming + after the expression that evaluated to true. + +The ``when`` statement enables conditional compilation techniques. As +a special syntatic extension, the ``when `` construct is also available +within ``record`` or ``object`` definitions. + + +Raise statement +~~~~~~~~~~~~~~~ + +Syntax:: + + raise_stmt ::= raise [qualified_identifier [comma expr]] + +Apart from built-in operations like array indexing, memory allocation, etc. +the ``raise`` statement is the only way to raise an exception. The +identifier has to be the name of a previously declared exception. A +comma followed by an expression may follow; the expression must be of type +``string`` or ``cstring``; this is an error message that can be extracted +with the `getCurrentExceptionMsg` procedure in the module ``system``. + +If no exception name is given, the current exception is `re-raised`. The +`ENoExceptionToReraise` exception is raised if there is no exception to +re-raise. It follows that the ``raise`` statement *always* raises an +exception. + + +Try statement +~~~~~~~~~~~~~ + +Syntax:: + + try_stmt ::= PUSH(x=I[-1]) try ":" stmts + { INDENT(x) except exceptlist ":" stmts } + [ INDENT(x) except ":" stmts ] + [ INDENT(x) finally ":" stmts ] + POP + +The statements after the ``try`` are executed in sequential order unless +an exception ``e`` is raised. If the exception type of ``e`` matches any +of the list ``exceptlist`` the corresponding statements are executed. +The statements following the ``except`` clauses are called +`exception handlers`. + +The empty ``except`` clause is executed if there is an exception that is +in no list. It is similiar to an ``else`` clause in ``if`` statements. + +If there is a ``finally`` clause given, it is always executed after the +exception handlers. + +The exception is *consumed* in an exception handler. However, an +exception handler may raise another exception. If the exception is not +handled, it is propagated through the call stack. This means that often +the rest of the procedure - that is not within a ``finally`` clause - +is not executed (if an exception occurs). + + +Block statement +~~~~~~~~~~~~~~~ + +Syntax:: + + block_stmt ::= block [IDENTIFIER] ":" stmts + +The block statement is a means to group statements to a (named) `block`. +Inside the block, the ``break`` statement is allowed to leave the block +immediately. A ``break`` statement can contain a name of a surrounding +block to specify which block is to leave. + + +Break statement +~~~~~~~~~~~~~~~ + +Syntax:: + + break_stmt ::= break [IDENTIFIER] + +The break statement is used to leave a block immediately. If ``IDENTIFIER`` +is given, it is the name of the enclosing block that is to leave. If it is +absent, the innermost block is leaved. + + +While statement +~~~~~~~~~~~~~~~ + +Syntax:: + + while_stmt ::= WHILE expr COLON stmts + +The `while` statement is executed until the ``expr`` evaluates to false. +Endless loops are no error. ``while`` statements open an `implicit block`, +so that they can be aborted by a ``break`` statement. + + +For statement & iterators +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Syntax:: + + for_stmt ::= PUSH(x=I[-1]) for exprlist in expr [".." expr] ":" stmts + POP + +The `for` statement is an abstract mechanism to iterate over the elements +of a container. It relies on an *iterator* to do so. Like ``while`` +statements, ``for`` statements open an `implicit block`, so that they +can be aborted by a ``break`` statement. + +XXX + + + +Assembler statement +~~~~~~~~~~~~~~~~~~~ +Syntax:: + + asm_stmt ::= asm [CHAR_LITERAL] STRING_LITERAL + +The direct embedding of assembler code into Nimrod code is supported by the +unsafe ``asm`` statement. Identifiers in the assembler code that refer to +Nimrod identifiers shall be enclosed in a special character which can be +specified right after the ``asm`` keyword. The default special character is +``'!'``. An implementation does not need to support the assembler statement, +giving a static error if it encounters one. + + +Modules +------- +Nimrod supports splitting a program into pieces by a module concept. Modules +make separate compilation possible. Each module needs to be in its own file. +Modules consist of an interface and an implementation section. The interface +section lists the symbols that can be imported from other modules. Thus modules +enable `information hiding`. The interface section may not contain any +code that is executable. This means that only the headers of procedures can +appear in the interface. A module may gain access to symbols of another module +by the `import` statement. Recursive module dependancies are allowed, but +slightly subtle. + +The algorithm for compiling modules is: + +- Compile the whole module as usual, following import statements recursively +- if we have a cycle only import the already parsed symbols (in the interface + of course); if an unknown identifier occurs then abort + +This is best illustrated by an example:: + + # Module A + interface + type + T1 = int + import B # the compiler starts parsing B + + implementation + + proc main() = + var i = p(3) # works because B has been parsed completely here + + main() + + + # Module B + interface + import A # A is not parsed here! Only the already known symbols + # of A are imported here. + + proc p(x: A.T1): A.T1 # this works because the compiler has already + # added T1 to A's interface symbol table + + implementation + + proc p(x: A.T1): A.T1 = return x + 1 + + +Scope rules +----------- +Identifiers are valid from the point of their declaration until the end of +the block in which the declaration occurred. The range where the identifier +is known is the `scope` of the identifier. The exact scope of an identifier +depends on the way it was declared. + +Block scope +~~~~~~~~~~~ +The *scope* of a variable declared in the declaration part of a block +is valid from the point of declaration until the end of the block. If a +block contains a second block, in which the identifier is redeclared, +then inside this block, the second declaration will be valid. Upon +leaving the inner block, the first declaration is valid again. An +identifier cannot be redefined in the same block, except if valid for +procedure or iterator overloading purposes. + + +Record or object scope +~~~~~~~~~~~~~~~~~~~~~~ +The field identifiers inside a record or object definition are valid in the +following places: + +* To the end of the record definition +* Field designators of a variable of the given record type. +* In all descendent types of the object type. + +Module scope +~~~~~~~~~~~~ +All identifiers in the interface part of a module are valid from the point of +declaration, until the end of the module. Furthermore, the identifiers are +known in other modules that import the module. Identifiers from indirectly +dependent modules are *not* available. Identifiers declared in the +implementation part of a module are valid from the point of declaration to +the end of the module. The ``system`` module is automatically imported in +all other modules. + +If a module imports an identifier by two different modules, +each occurance of the identifier has to be qualified, unless it is an +overloaded procedure or iterator in which case the overloading +resolution takes place:: + + # Module A + interface + var x: string + + # Module B + interface + var x: int + + # Module C + import A, B, io + write(stdout, x) # error: x is ambigious + write(sdtout, A.x) # no error: qualifier used + + +Messages +======== + +A Nimrod compiler has to emit different kinds of messages: `hint`, +`warning`, and `error` messages. `errors` have to be emitted if the compiler +encounters any static errors. If and when the other two message kinds are +emitted is not specified, unless a message is requested with the +``hint`` or ``warning`` pragma. + +Pragmas +======= +Pragmas are Nimrod's method to give the compiler additional information/ +commands without introducing a massive number of new keywords. Pragmas are +processed on the fly during parsing. Pragmas are always enclosed in the +special ``{.`` and ``.}`` curly brackets. There are a number of pragmas +that a compiler has to process; a compiler may define additional pragmas +not specified here. + + +define pragma +------------- +The `define` pragma defines a conditional symbol. This symbol may only be +used in other pragmas and in the ``defined`` expression and not in ordinary +Nimrod source code. The conditional symbols go into a special symbol table. +The compiler shall define the target processor and the target operating +system as conditional symbols. See `Annex A <XXX>`_ for a list of specified +processors and operating systems. The Syntax of the define pragma is:: + + define_pragma ::= curlydot_le "define" colon IDENTIFIER curlydot_ri + + +undef pragma +------------ +The `undef` pragma the counterpart to the define pragma. It undefines a +conditional symbol. Syntax:: + + undef_pragma ::= curlydot_le "undef" colon IDENTIFIER curlydot_ri + + +error pragma +------------ +The `error` pragma is used to make the compiler output an error message with +the given content. Compilation may abort after an error (or not). Syntax:: + + error_pragma ::= curlydot_le "error" colon STRING_LITERAL curlydot_ri + + +fatal pragma +------------ +The `fatal` pragma is used to make the compiler output an error message with +the given content. In contrast to the ``error`` pragma, compilation +is guaranteed to be aborted by this pragma. Syntax:: + + fatal_pragma ::= curlydot_le "fatal" colon STRING_LITERAL curlydot_ri + + +warning pragma +-------------- +The `warning` pragma is used to make the compiler output a warning message with +the given content. Compilation continues after the warning. Syntax:: + + warning_pragma ::= curlydot_le "warning" colon STRING_LITERAL curlydot_ri + + +hint pragma +----------- +The `hint` pragma is used to make the compiler output a hint message with +the given content. Compilation continues after the hint. Syntax:: + + + hint_pragma ::= curlydot_le "hint" colon STRING_LITERAL curlydot_ri + + + +compilation option pragmas +-------------------------- +The listed pragmas here can be used to override the code generation options +for a section of code. +:: + + "{." pragma: val {pragma: val} ".}" + + +An implementation should provide at least the following possible options (it can +add various others). If an implementation does not recognize the option, a +warning shall be given to the user. + +=============== =============== ============================================ +pragma allowed values description +=============== =============== ============================================ +checks on|off Turns the code generation for all runtime + checks on or off. +bound_checks on|off Turns the code generation for array bound + checks on or off. +overflow_checks on|off Turns the code generation for over- or + underflow checks on or off. +nil_checks on|off Turns the code generation for nil pointer + checks on or off. +assertions on|off Turns the code generation for assertions + on or off. +warnings on|off Turns the warning messages of the compiler + on or off. +hints on|off Turns the hint messages of the compiler + on or off. +optimization none|speed|size Optimize the code for speed or size, or + disable optimization. For non-optimizing + compilers this option has no effect. + Neverless they must parse it properly. +callconv cdecl|... Specifies the default calling convention for + all procedures (and procedure types) that + follow. +=============== =============== ============================================ + +Example:: + + {.checks: off, optimization: speed.} + # compile without runtime checks and optimize for speed + + +push and pop pragmas +-------------------- +The push/pop pragmas are very similar to the option directive, +but are used to override the settings temporarily. Example:: + + {.push checks: off.} + # compile this section without runtime checks as it is + # speed critical + # ... some code ... + {.pop.} # restore old settings + + +Annex A: List of conditional symbols +==================================== + +``posix`` is defined on any POSIX compatible operating system. + ++----------------------------+-----------------+ +| Operating System | Symbols | ++============================+=================+ +| AIX | aix, posix | ++----------------------------+-----------------+ +| Compaq Tru64 UNIX | tru64, posix | ++----------------------------+-----------------+ +| Digital UNIX | tru64, posix | ++----------------------------+-----------------+ +| OSF/1 | tru64, posix | ++----------------------------+-----------------+ +| FreeBSD | freebsd, | +| | posix, | +| | bsd | ++----------------------------+-----------------+ +| GNU/Linux | linux, | +| | posix | ++----------------------------+-----------------+ +| HP-UX | hpux, | +| | posix | ++----------------------------+-----------------+ +| Irix | irix, | +| | posix | ++----------------------------+-----------------+ +| MacOS X | macosx, | +| | posix | ++----------------------------+-----------------+ +| NetBSD | netbsd, | +| | posix, | +| | bsd | ++----------------------------+-----------------+ +| OpenBSD | openbsd, | +| | posix, | +| | bsd | ++----------------------------+-----------------+ +| Solaris | solaris, | +| | posix | ++----------------------------+-----------------+ +| Windows (all variants) | windows | ++----------------------------+-----------------+ + + ++----------------------------+-----------------+ +| Processor | Symbols | ++============================+=================+ +| Compaq Alpha | alpha | ++----------------------------+-----------------+ +| HP Precision Architecture | hppa | ++----------------------------+-----------------+ +| INTEL x86 | x86 | ++----------------------------+-----------------+ +| AMD/INTEL x86 64bit | x86_64, | +| | amd64 | ++----------------------------+-----------------+ +| MIPS RISC | mips | ++----------------------------+-----------------+ +| IBM Power PC | powerpc | ++----------------------------+-----------------+ +| SPARC | sparc | ++----------------------------+-----------------+ +| MicroSPARC | sparc | ++----------------------------+-----------------+ +| UltraSPARC | sparc | ++----------------------------+-----------------+ + +On targets for 16, 32 or 64 bit processors the symbols ``cpu16``, ``cpu32`` +or ``cpu64`` shall be defined respectively. On little endian machines the +symbol ``little_endian`` and on big endian ones the symbol ``big_endian`` +are defined. |