diff options
Diffstat (limited to 'doc/intern.txt')
-rwxr-xr-x | doc/intern.txt | 417 |
1 files changed, 0 insertions, 417 deletions
diff --git a/doc/intern.txt b/doc/intern.txt deleted file mode 100755 index c45765b23..000000000 --- a/doc/intern.txt +++ /dev/null @@ -1,417 +0,0 @@ -========================================= - Internals of the Nimrod Compiler -========================================= - - -:Author: Andreas Rumpf -:Version: |nimrodversion| - -.. contents:: - - Abstraction is layering ignorance on top of reality. -- unknown - - -Directory structure -=================== - -The Nimrod project's directory structure is: - -============ ============================================== -Path Purpose -============ ============================================== -``bin`` generated binary files -``build`` generated C code for the installation -``nim`` Pascal sources of the Nimrod compiler; this - should be modified, not the Nimrod version in - ``rod``! -``rod`` Nimrod sources of the Nimrod compiler; - automatically generated from the Pascal - version -``data`` data files that are used for generating source - code -``doc`` the documentation lives here; it is a bunch of - reStructuredText files -``dist`` additional packages for the distribution -``config`` configuration files for Nimrod -``lib`` the Nimrod library lives here; ``rod`` depends - on it! -``web`` website of Nimrod; generated by ``koch.py`` - from the ``*.txt`` and ``*.tmpl`` files -``obj`` generated ``*.obj`` files -============ ============================================== - - -Bootstrapping the compiler -========================== - -The compiler is written in a subset of Pascal with special annotations so -that it can be translated to Nimrod code automatically. This conversion is -done by Nimrod itself via the undocumented ``boot`` command. Thus both Nimrod -and Free Pascal can compile the Nimrod compiler. However, the Pascal version -has no garbage collector and leaks memory! So the Pascal version should only -be used for bootstrapping. - -Requirements for bootstrapping: - -- Python (should work with version 1.5 or higher) (optional) -- supported C compiler - -Compiling the compiler is a simple matter of running:: - - koch.py boot - -For a release version use:: - - koch.py boot -d:release - -The ``koch.py`` script is Nimrod's maintainance script. It is a replacement for -make and shell scripting with the advantage that it is much more portable. - -If you don't have Python, there is a ``boot`` Nimrod program which does roughly -the same:: - - nimrod cc boot.nim - ./boot [-d:release] - - -Pascal annotations -================== -There are some annotations that the Pascal sources use so that they can -be converted to Nimrod automatically: - -``{@discard} <expr>`` - Tells the compiler that a ``discard`` statement is needed for Nimrod - here. - -``{@cast}typ(expr)`` - Tells the compiler that the Pascal conversion is a ``cast`` in Nimrod. - -``{@emit <code>}`` - Emits ``<code>``. The code fragment needs to be in Pascal syntax. - -``{@ignore} <codeA> {@emit <codeB>}`` - Ignores ``<codeA>`` and instead emits ``<codeB>`` which needs to be in - Pascal syntax. An empty ``{@emit}`` is possible too (it then only closes - the ``<codeA>`` part). - -``record {@tuple}`` - Is used to tell the compiler that the record type should be transformed - to a Nimrod tuple type. - -``^ {@ptr}`` - Is used to tell the compiler that the pointer type should be transformed - to a Nimrod ``ptr`` type. The default is a ``ref`` type. - -``'a' + ''`` - The idiom ``+''`` is used to tell the compiler that it is a string - literal and not a character literal. (Pascal does not distinguish between - character literals and string literals of length 1.) - -``+{&}`` - This tells the compiler that Pascal's ``+`` here is a string concatenation - and thus should be converted to ``&``. Note that this is not needed if - any of the operands is a string literal because the compiler then can - figure this out by itself. - -``{@set}['a', 'b', 'c']`` - Tells the compiler that Pascal's ``[]`` constructor is a set and not an - array. This is only needed if the compiler cannot figure this out for - itself. - - -Porting to new platforms -======================== - -Porting Nimrod to a new architecture is pretty easy, since C is the most -portable programming language (within certain limits) and Nimrod generates -C code, porting the code generator is not necessary. - -POSIX-compliant systems on conventional hardware are usually pretty easy to -port: Add the platform to ``platform`` (if it is not already listed there), -check that the OS, System modules work and recompile Nimrod. - -The only case where things aren't as easy is when the garbage -collector needs some assembler tweaking to work. The standard -version of the GC uses C's ``setjmp`` function to store all registers -on the hardware stack. It may be necessary that the new platform needs to -replace this generic code by some assembler code. - - -Runtime type information -======================== - -*Runtime type information* (RTTI) is needed for several aspects of the Nimrod -programming language: - -Garbage collection - The most important reason for RTTI. Generating - traversal procedures produces bigger code and is likely to be slower on - modern hardware as dynamic procedure binding is hard to predict. - -Complex assignments - Sequences and strings are implemented as - pointers to resizeable buffers, but Nimrod requires copying for - assignments. Apart from RTTI the compiler could generate copy procedures - for any type that needs one. However, this would make the code bigger and - the RTTI is likely already there for the GC. - -We already know the type information as a graph in the compiler. -Thus we need to serialize this graph as RTTI for C code generation. -Look at the file ``lib/system/hti.nim`` for more information. - - -The Garbage Collector -===================== - -Introduction ------------- - -I use the term *cell* here to refer to everything that is traced -(sequences, refs, strings). -This section describes how the new GC works. - -The basic algorithm is *Deferrent Reference Counting* with cycle detection. -References in the stack are not counted for better performance and easier C -code generation. - -Each cell has a header consisting of a RC and a pointer to its type -descriptor. However the program does not know about these, so they are placed at -negative offsets. In the GC code the type ``PCell`` denotes a pointer -decremented by the right offset, so that the header can be accessed easily. It -is extremely important that ``pointer`` is not confused with a ``PCell`` -as this would lead to a memory corruption. - - -The CellSet data structure --------------------------- - -The GC depends on an extremely efficient datastructure for storing a -set of pointers - this is called a ``TCellSet`` in the source code. -Inserting, deleting and searching are done in constant time. However, -modifying a ``TCellSet`` during traversation leads to undefined behaviour. - -.. code-block:: Nimrod - type - TCellSet # hidden - - proc CellSetInit(s: var TCellSet) # initialize a new set - proc CellSetDeinit(s: var TCellSet) # empty the set and free its memory - proc incl(s: var TCellSet, elem: PCell) # include an element - proc excl(s: var TCellSet, elem: PCell) # exclude an element - - proc `in`(elem: PCell, s: TCellSet): bool # tests membership - - iterator elements(s: TCellSet): (elem: PCell) - - -All the operations have to perform efficiently. Because a Cellset can -become huge a hash table alone is not suitable for this. - -We use a mixture of bitset and hash table for this. The hash table maps *pages* -to a page descriptor. The page descriptor contains a bit for any possible cell -address within this page. So including a cell is done as follows: - -- Find the page descriptor for the page the cell belongs to. -- Set the appropriate bit in the page descriptor indicating that the - cell points to the start of a memory block. - -Removing a cell is analogous - the bit has to be set to zero. -Single page descriptors are never deleted from the hash table. This is not -needed as the data structures needs to be rebuilt periodically anyway. - -Complete traversal is done in this way:: - - for each page decriptor d: - for each bit in d: - if bit == 1: - traverse the pointer belonging to this bit - - -Further complications ---------------------- - -In Nimrod the compiler cannot always know if a reference -is stored on the stack or not. This is caused by var parameters. -Consider this example: - -.. code-block:: Nimrod - proc setRef(r: var ref TNode) = - new(r) - - proc usage = - var - r: ref TNode - setRef(r) # here we should not update the reference counts, because - # r is on the stack - setRef(r.left) # here we should update the refcounts! - -We have to decide at runtime whether the reference is on the stack or not. -The generated code looks roughly like this: - -.. code-block:: C - void setref(TNode** ref) { - unsureAsgnRef(ref, newObj(TNode_TI, sizeof(TNode))) - } - void usage(void) { - setRef(&r) - setRef(&r->left) - } - -Note that for systems with a continous stack (which most systems have) -the check whether the ref is on the stack is very cheap (only two -comparisons). - - -The compiler's architecture -=========================== - -Nimrod uses the classic compiler architecture: A scanner feds tokens to a -parser. The parser builds a syntax tree that is used by the code generator. -This syntax tree is the interface between the parser and the code generator. -It is essential to understand most of the compiler's code. - -In order to compile Nimrod correctly, type-checking has to be seperated from -parsing. Otherwise generics would not work. - -.. include:: filelist.txt - - -The syntax tree ---------------- -The synax tree consists of nodes which may have an arbitrary number of -children. Types and symbols are represented by other nodes, because they -may contain cycles. The AST changes its shape after semantic checking. This -is needed to make life easier for the code generators. See the "ast" module -for the type definitions. The `macros <macros.html>`_ module contains many -examples how the AST represents each syntactic structure. - - -How the RTL is compiled -======================= - -The ``system`` module contains the part of the RTL which needs support by -compiler magic (and the stuff that needs to be in it because the spec -says so). The C code generator generates the C code for it just like any other -module. However, calls to some procedures like ``addInt`` are inserted by -the CCG. Therefore the module ``magicsys`` contains a table (``compilerprocs``) -with all symbols that are marked as ``compilerproc``. ``compilerprocs`` are -needed by the code generator. A ``magic`` proc is not the same as a -``compilerproc``: A ``magic`` is a proc that needs compiler magic for its -semantic checking, a ``compilerproc`` is a proc that is used by the code -generator. - - -Debugging Nimrod's memory management -==================================== - -The following paragraphs are mostly a reminder for myself. Things to keep -in mind: - -* Segmentation faults can have multiple reasons: One that is frequently - forgotten is that *stack overflow* can trigger one! -* If an assertion in Nimrod's memory manager or GC fails, the stack trace - keeps allocating memory! Thus a stack overflow may happen, hiding the - real issue. -* What seem to be C code generation problems is often a bug resulting from - not producing prototypes, so that some types default to ``cint``. Testing - without the ``-w`` option helps! - - -Generation of dynamic link libraries -==================================== - -Generation of dynamic link libraries or shared libraries is not difficult; the -underlying C compiler already does all the hard work for us. The problem is the -common runtime library, especially the memory manager. Note that Borland's -Delphi had exactly the same problem. The workaround is to not link the GC with -the Dll and provide an extra runtime dll that needs to be initialized. - - -Code generation for closures -============================ - -Example code: - -.. code-block:: nimrod - proc add(x: int): proc (y: int): int {.closure.} = - return lambda (y: int): int = - return x + y - - var add2 = add(2) - echo add2(5) #OUT 7 - -This should produce roughly this code: - -.. code-block:: nimrod - type - PClosure = ref object - fn: proc (x: int, c: PClosure): int - x: int # data - - proc wasLambda(y: int, c: PClosure): int = - return y + c.x - - proc add(x: int): PClosure = - var c: PClosure - new(c) - c.x = x - c.fn = wasLambda - - var add2 = add(2) - echo add2.fn(5, add2) - - -Beware of nesting: - -.. code-block:: nimrod - proc add(x: int): proc (y: int): proc (z: int): int {.closure.} {.closure.} = - return lamba (y: int): proc (z: int): int {.closure.} = - return lambda (z: int): int = - return x + y + z - - var add24 = add(2)(4) - echo add24(5) #OUT 11 - -This should produce roughly this code: - -.. code-block:: nimrod - type - PClosure1 = ref object - fn: proc (x: int, c: PClosure1): int - x: int # data - - PClosure2 = ref object - fn: proc (x: int, c: PClosure2): int - y: int - c1: PClosure1 - - - proc innerLambda(z: int, c2: PClosure2): int = - return c2.c1.x + c2.y + z - - proc outerLambda1(y: int, c1: PClosure1): PClosure2 = - new(result) - result.c1 = c1 - result.y = y - result.fn = innerLambda - - proc add(x: int): PClosure1 = - new(result) - result.x = x - result.fn = outerLambda - - var tmp = add(2) - var tmp2 = tmp.fn(4, tmp) - var add24 = tmp2.fn(4, tmp2) - echo add24(5) - - -Accumulator ------------ - -.. code-block:: nimrod - proc GetAccumulator(start: int): proc (): int {.closure} = - var i = start - return lambda: int = - inc i - return i |