document how the incremental compilation scheme could work

author: Andreas Rumpf <rumpf_a@web.de> 2018-06-01 22:11:32 +0200
committer: Andreas Rumpf <rumpf_a@web.de> 2018-06-01 22:11:32 +0200
commit: cae19738562f14fbb76004748bed8d2f337d6f0b (patch)
tree: a2d965f68a1e0d2d5617b74166c7798bc77d69f0
parent: 61fb83ecbb4c691c03d500f6c71499e59a67cef2 (diff)
download: Nim-cae19738562f14fbb76004748bed8d2f337d6f0b.tar.gz
4 files changed, 110 insertions, 51 deletions
diff --git a/compiler/lineinfos.nim b/compiler/lineinfos.nim
index 0384fda26..cad1fe6aa 100644
--- a/compiler/lineinfos.nim
+++ b/compiler/lineinfos.nim
@@ -246,10 +246,10 @@ const trackPosInvalidFileIdx* = FileIndex(-2) # special marker so that no sugges
                                    # are produced within comments and string literals
 
 type
-  MsgConfig* = object
+  MsgConfig* = object ## does not need to be stored in the incremental cache
     trackPos*: TLineInfo
-    trackPosAttached*: bool ## whether the tracking position was attached to some
-                            ## close token.
+    trackPosAttached*: bool ## whether the tracking position was attached to
+                            ## some close token.
 
     errorOutputs*: TErrorOutputs
     msgContext*: seq[TLineInfo]
diff --git a/compiler/modulegraphs.nim b/compiler/modulegraphs.nim
index 7c9837f54..02307ca9f 100644
--- a/compiler/modulegraphs.nim
+++ b/compiler/modulegraphs.nim
@@ -47,7 +47,7 @@ type
     doStopCompile*: proc(): bool {.closure.}
     usageSym*: PSym # for nimsuggest
     owners*: seq[PSym]
-    methods*: seq[tuple[methods: TSymSeq, dispatcher: PSym]]
+    methods*: seq[tuple[methods: TSymSeq, dispatcher: PSym]] # needs serialization!
     systemModule*: PSym
     sysTypes*: array[TTypeKind, PType]
     compilerprocs*: TStrTable
diff --git a/compiler/options.nim b/compiler/options.nim
index cb4f1e885..044461b55 100644
--- a/compiler/options.nim
+++ b/compiler/options.nim
@@ -156,24 +156,27 @@ type
     version*: int
   Suggestions* = seq[Suggest]
 
-  ConfigRef* = ref object ## eventually all global configuration should be moved here
-    target*: Target
+  ConfigRef* = ref object ## every global configuration
+                          ## fields marked with '*' are subject to
+                          ## the incremental compilation mechanisms
+                          ## (+) means "part of the dependency"
+    target*: Target       # (+)
     linesCompiled*: int  # all lines that have been compiled
-    options*: TOptions
-    globalOptions*: TGlobalOptions
+    options*: TOptions    # (+)
+    globalOptions*: TGlobalOptions # (+)
     m*: MsgConfig
     evalTemplateCounter*: int
     evalMacroCounter*: int
     exitcode*: int8
     cmd*: TCommands  # the command
-    selectedGC*: TGCMode       # the selected GC
+    selectedGC*: TGCMode       # the selected GC (+)
     verbosity*: int            # how verbose the compiler is
     numberOfProcessors*: int   # number of processors
     evalExpr*: string          # expression for idetools --eval
     lastCmdTime*: float        # when caas is enabled, we measure each command
     symbolFiles*: SymbolFilesOption
 
-    cppDefines*: HashSet[string]
+    cppDefines*: HashSet[string] # (*)
     headerFile*: string
     features*: set[Feature]
     arguments*: string ## the arguments to be passed to the program that
@@ -220,13 +223,13 @@ type
     cLinkedLibs*: seq[string]  # libraries to link
 
     externalToLink*: seq[string]  # files to link in addition to the file
-                                  # we compiled
+                                  # we compiled (*)
     linkOptionsCmd*: string
     compileOptionsCmd*: seq[string]
-    linkOptions*: string
-    compileOptions*: string
+    linkOptions*: string          # (*)
+    compileOptions*: string       # (*)
     ccompilerpath*: string
-    toCompile*: CfileList
+    toCompile*: CfileList         # (*)
     suggestionResultHook*: proc (result: Suggest) {.closure.}
     suggestVersion*: int
     suggestMaxResults*: int
diff --git a/doc/intern.txt b/doc/intern.txt
index dadb0eb05..a4545583e 100644
--- a/doc/intern.txt
+++ b/doc/intern.txt
@@ -38,10 +38,6 @@ Path           Purpose
 Bootstrapping the compiler
 ==========================
 
-As of version 0.8.5 the compiler is maintained in Nim. (The first versions
-have been implemented in Object Pascal.) The Python-based build system has
-been rewritten in Nim too.
-
 Compiling the compiler is a simple matter of running::
 
   nim c koch.nim
@@ -202,16 +198,86 @@ Compilation cache
 =================
 
 The implementation of the compilation cache is tricky: There are lots
-of issues to be solved for the front- and backend. In the following
-sections *global* means *shared between modules* or *property of the whole
-program*.
+of issues to be solved for the front- and backend.
+
+
+General approach: AST replay
+----------------------------
+
+We store the AST that results from a module's successful semantic check in
+a SQLite database. Plenty of features require a subsequence of the module's
+statements to be re-applied, for example:
+
+.. code-block:: nim
+  {.compile: "foo.c".} # even if the module is loaded from the DB,
+                       # "foo.c" needs to be compiled/linked.
+
+The solution is to **re-play** the module's top level statements.
+This solves the problem without having to special case the logic
+that fills the internal seqs which are affected by the pragmas.
+
+In fact, this describes how the AST should be stored in the database,
+as a "shallow" tree. Let's assume we compile module ``m`` with the
+following contents:
+
+.. code-block:: nim
+  import strutils
+
+  var x*: int = 90
+  {.compile: "foo.c".}
+  proc p = echo "p"
+  proc q = echo "q"
+  static:
+    echo "static"
+
+Conceptually this is the AST we store for the module:
+
+.. code-block:: nim
+  import strutils
+
+  var x*
+  {.compile: "foo.c".}
+  proc p
+  proc q
+  static:
+    echo "static"
+
+The symbol's ``ast`` field is loaded lazily, on demand. This is where most
+savings come from: only the shallow outer AST is reconstructed immediately.
+
+It is also important that the replay involves the ``import`` statement so
+that the dependencies are resolved properly.
+
+
+Shared global compiletime state
+-------------------------------
+
+Nim allows ``.global, compiletime`` variables that can be filled by macro
+invocations across different modules. This feature breaks modularity in a
+severe way. Plenty of different solutions have been proposed:
+
+- Restrict the types of global compiletime variables to ``Set[T]`` or
+  similar unordered, only-growable collections so that we can track
+  the module's write effects to these variables and reapply the changes
+  in a different order.
+- In every module compilation, reset the variable to its default value.
+- Provide a restrictive API that can load/save the compiletime state to
+  a file.
+
+(These solutions are not mutually exclusive.)
+
+Since we adopt the "replay the top level statements" idea, the natural
+solution to this problem is to emit pseudo top level statements that
+reflect the mutations done to the global variable.
 
 
-Frontend issues
----------------
 
 Methods and type converters
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------
+
+In the following
+sections *global* means *shared between modules* or *property of the whole
+program*.
 
 Nim contains language features that are *global*. The best example for that
 are multi methods: Introducing a new method with the same name and some
@@ -238,20 +304,17 @@ If in the above example module ``B`` is re-compiled, but ``A`` is not then
 ``B`` needs to be aware of ``toBool`` even though  ``toBool`` is not referenced
 in ``B`` *explicitly*.
 
-Both the multi method and the type converter problems are solved by storing
-them in special sections in the ROD file that are loaded *unconditionally*
-when the ROD file is read.
+Both the multi method and the type converter problems are solved by the
+AST replay implementation.
+
 
 Generics
 ~~~~~~~~
 
-If we generate an instance of a generic, we'd like to re-use that
-instance if possible across module boundaries. However, this is not
-possible if the compilation cache is enabled. So we give up then and use
-the caching of generics only per module, not per project. This means that
-``--symbolFiles:on`` hurts a bit for efficiency. A better solution would
-be to persist the instantiations in a global cache per project. This might be
-implemented in later versions.
+We cache generic instantiations and need to ensure this caching works
+well with the incremental compilation feature. Since the cache is
+attached to the ``PSym`` data structure, it should work without any
+special logic.
 
 
 Backend issues
@@ -259,13 +322,10 @@ Backend issues
 
 - Init procs must not be "forgotten" to be called.
 - Files must not be "forgotten" to be linked.
-- Anything that is contained in ``nim__dat.c`` is shared between modules
-  implicitly.
 - Method dispatchers are global.
 - DLL loading via ``dlsym`` is global.
 - Emulated thread vars are global.
 
-
 However the biggest problem is that dead code elimination breaks modularity!
 To see why, consider this scenario: The module ``G`` (for example the huge
 Gtk2 module...) is compiled with dead code elimination turned on. So none
@@ -274,25 +334,21 @@ of ``G``'s procs is generated at all.
 Then module ``B`` is compiled that requires ``G.P1``. Ok, no problem,
 ``G.P1`` is loaded from the symbol file and ``G.c`` now contains ``G.P1``.
 
-Then module ``A`` (that depends onto ``B`` and ``G``) is compiled and ``B``
+Then module ``A`` (that depends on ``B`` and ``G``) is compiled and ``B``
 and ``G`` are left unchanged. ``A`` requires ``G.P2``.
 
 So now ``G.c`` MUST contain both ``P1`` and ``P2``, but we haven't even
 loaded ``P1`` from the symbol file, nor do we want to because we then quickly
-would restore large parts of the whole program. But we also don't want to
-store ``P1`` in ``B.c`` because that would mean to store every symbol where
-it is referred from which ultimately means the main module and putting
-everything in a single C file.
+would restore large parts of the whole program.
 
-There is however another solution: The old file ``G.c`` containing ``P1`` is
-**merged** with the new file ``G.c`` containing ``P2``. This is the solution
-that is implemented in the C code generator (have a look at the ``ccgmerge``
-module). The merging may lead to *cruft* (aka dead code) in generated C code
-which can only be removed by recompiling a project with the compilation cache
-turned off. Nevertheless the merge solution is way superior to the
-cheap solution "turn off dead code elimination if the compilation cache is
-turned on".
+Solution
+~~~~~~~~
 
+The backend must have some logic so that if the currently processed module
+is from the compilation cache, the ``ast`` field is not accessed. Instead
+the generated C(++) for the symbol's body needs to be cached too and
+inserted back into the produced C file. This approach seems to deal with
+all the problems outlined above.
 
 
 Debugging Nim's memory management
@@ -317,7 +373,7 @@ Introduction
 
 I use the term *cell* here to refer to everything that is traced
 (sequences, refs, strings).
-This section describes how the new GC works.
+This section describes how the GC works.
 
 The basic algorithm is *Deferred Reference Counting* with cycle detection.
 References on the stack are not counted for better performance and easier C
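
The "AST replay" idea documented in the ``doc/intern.txt`` hunk can be sketched as a small standalone model. This is only an illustration under stated assumptions, not compiler code: ``StmtKind``, ``ShallowStmt``, ``ReplayState`` and ``replay`` are hypothetical names, and neither the compiler's real AST types nor the SQLite storage are modeled. The sketch shows that walking the cached top-level statements of module ``m`` is enough to rebuild the side effects (imports to resolve, C files to compile) while declaration bodies stay unloaded.

```nim
type
  StmtKind = enum
    skImport, skCompilePragma, skDecl, skStatic

  ShallowStmt = object      # hypothetical: one cached top-level statement
    kind: StmtKind
    arg: string             # module name, C file, or symbol name

  ReplayState = object      # hypothetical: state rebuilt by the replay
    imports: seq[string]    # dependencies that must be resolved again
    toCompile: seq[string]  # C files that must be compiled/linked again
    lazyDecls: seq[string]  # symbols whose bodies are loaded on demand
    staticRuns: int         # number of static blocks that were re-run

proc replay(stmts: openArray[ShallowStmt]): ReplayState =
  ## Re-applies every top-level statement in its original order.
  for s in stmts:
    case s.kind
    of skImport: result.imports.add s.arg
    of skCompilePragma: result.toCompile.add s.arg
    of skDecl: result.lazyDecls.add s.arg   # body is NOT loaded here
    of skStatic: inc result.staticRuns

# Shallow form of the example module ``m`` from the documentation.
let cachedM = [
  ShallowStmt(kind: skImport, arg: "strutils"),
  ShallowStmt(kind: skDecl, arg: "x"),
  ShallowStmt(kind: skCompilePragma, arg: "foo.c"),
  ShallowStmt(kind: skDecl, arg: "p"),
  ShallowStmt(kind: skDecl, arg: "q"),
  ShallowStmt(kind: skStatic, arg: "echo \"static\"")
]

let st = replay(cachedM)
echo st.toCompile
```

Because the replay is a plain loop over statements, a pragma such as ``compile`` needs no special-case logic when the module comes from the cache. The same loop could also consume pseudo top-level statements that record mutations of global compile-time variables, which is the direction the documentation proposes.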