# The Mu computer's level-2 language, also called Mu. # http://akkartik.name/post/mu-2019-2 # # To run: # $ ./translate_subx init.linux [0-9]*.subx apps/mu.subx # == Goals # 1. Be memory safe. It should be impossible to corrupt the heap, or to create # a bad pointer. (Requires strong type safety.) # 2. Do as little as possible to achieve goal 1. The translator should be # implementable in machine code. # - minimize impedance mismatch between source language and SubX target # (e.g. programmer manages registers manually) # - checks over syntax # (e.g. programmer's register allocation is checked) # - runtime checks to avoid complex static analysis # (e.g. array indexing always checks bounds) # == Language description # A program is a sequence of function definitions. # # Function example: # fn foo n: int -> result/eax: int { # ... # } # # Functions consist of a name, optional inputs, optional outputs and a block. # # Function inputs and outputs are variables. All variables have a type and # storage specifier. They can be placed either in memory (on the stack) or in # one of 6 named registers. # eax ecx edx ebx esi edi # Variables in registers must be primitive 32-bit types. # Variables not explicitly placed in a register are on the stack. # # Function inputs are always passed in memory (on the stack), while outputs # are always returned in registers. # # Blocks mostly consist of statements. # # Statements mostly consist of a name, optional inputs and optional outputs. # # Statement inputs are variables or literals. Variables need to specify type # (and storage) the first time they're mentioned but not later. # # Statement outputs, like function outputs, must be variables in registers. # # Statement names must be either primitives or user-defined functions. # # Primitives can write to any register. # User-defined functions only write to hard-coded registers. Outputs of each # call must have the same registers as in the function definition. # # There are some other statement types: # - blocks. Multiple statements surrounded by '{...}' and optionally # prefixed with a label name and ':' # - { # ... # } # - foo: { # ... # } # # - variable definitions on the stack. E.g.: # - var foo: int # - var bar: (array int 3) # There's no initializer; variables are automatically initialized. # The type of a local variable is either word-length (4 bytes) or starts with 'ref'. # # - variables definitions in a register. E.g.: # - var foo/eax: int <- add bar 1 # The initializer is mandatory and must be a valid instruction that writes # a single output to the right register. In practice registers will # usually be either initialized by primitives or copied from eax. # - var eax: int <- foo bar quux # var floo/ecx: int <- copy eax # # Still todo: # global variables # heap allocations (planned name: 'handle') # user-defined types: 'type' for structs, 'choice' for unions # short-lived 'address' type for efficiently writing inside nested structs # # We don't have 'handle' types yet, but we try to distinguish 'ref', 'handle' # and 'address' in comments. Their definitions are in layer 50, but really you # can ignore the distinctions on a first reading of this program. # # Formal types: # A program is a linked list of functions # A function contains: # name: (handle array byte) # inouts: linked list of vars <-- 'inouts' is more precise than 'inputs' # data: (handle var) # next: (handle list) # outputs: linked list of vars # data: (handle var) # next: (handle list) # body: (handle block) # A var-type contains: # name: (handle array byte) # type: (handle tree type-id) # # A statement can be: # tag 0: a block # tag 1: a simple statement (stmt1) # tag 2: a variable defined on the stack # tag 3: a variable defined in a register # # A block contains: # tag: 0 # statements: (handle list stmt) # name: (handle array byte) -- starting with '$' # # A regular statement contains: # tag: 1 # operation: (handle array byte) # inouts: (handle list operand) # outputs: (handle list var) # # A variable defined on the stack contains: # tag: 2 # name: (handle array byte) # type: (handle tree type-id) # # A variable defined in a register contains: # tag: 3 # name: (handle array byte) # type: (handle tree type-id) # reg: (handle array byte) # == Translation: managing the stack # Now that we know what the language looks like in the large, let's think # about how translation happens from the bottom up. One crucial piece of the # puzzle is how Mu will clean up variables defined on the stack for you. # # Assume that we maintain a 'functions' list while parsing source code. And a # 'primitives' list is a global constant. Both these contain enough information # to perform type-checking on function calls or primitive statements, respectively. # # Defining variables pushes them on a stack with the current block depth and # enough information about their location (stack offset or register). # Starting a block increments the current block id. # Each statement now has enough information to emit code for it. # Ending a block is where the magic happens: # pop all variables at the current block depth # emit code to restore all register variables introduced at the current depth # emit code to clean up all stack variables at the current depth (just increment esp) # decrement the current block depth # # Formal types: # live-vars: stack of vars # var: # name: (handle array byte) # type: (handle tree type-id) # block: int # stack-offset: int (added to ebp) # register: (handle array byte) # either usual register names # or '*' to indicate any register # At most one of stack-offset or register-index must be non-zero. # A register of '*' designates a variable _template_. Only legal in formal # parameters for primitives. # == Translating a single function call # This one's easy. Assuming we've already checked things, we just drop the # outputs (which use hard-coded registers) and emit inputs in a standard format. # # out1, out2, out3, ... <- name inout1, inout2, inout3, ... # => # (subx-name inout1 inout2 inout3) # # Formal types: # functions: linked list of info # name: (handle array byte) # inouts: linked list of vars # outputs: linked list of vars # body: block (singleton linked list) # subx-name: (handle array byte) # == Translating a single primitive instruction # A second crucial piece of the puzzle is how Mu converts fairly regular # primitives with their uniform syntax to SubX instructions with their gnarly # x86 details. # # Mu instructions have inputs and outputs. Primitives can have up to 2 of # them. # SubX instructions have rm32 and r32 operands. # The translation between them covers almost all the possibilities. # Instructions with 1 inout may turn into ones with 1 rm32 # (e.g. incrementing a var on the stack) # Instructions with 1 output may turn into ones with 1 rm32 # (e.g. incrementing a var in a register) # 1 inout and 1 output may turn into 1 rm32 and 1 r32 # (e.g. adding a var to a reg) # 2 inouts may turn into 1 rm32 and 1 r32 # (e.g. adding a reg to a var) # 1 inout and 1 literal may turn into 1 rm32 and 1 imm32 # (e.g. adding a constant to a var) # 1 output and 1 literal may turn into 1 rm32 and 1 imm32 # (e.g. adding a constant to a reg) # 2 outputs to hardcoded registers and 1 inout may turn into 1 rm32 # (special-case: divide edx:eax by a var or reg) # Observations: # We always emit rm32. It may be the first inout or the first output. # We may emit r32 or imm32 or neither. # When we emit r32 it may come from first inout or second inout or first output. # # Accordingly, the formal data structure for a primitive looks like this: # primitives: linked list of info # name: (handle array byte) # mu-inouts: linked list of vars to check # mu-outputs: linked list of vars to check; at most a singleton # subx-name: (handle array byte) # subx-rm32: enum arg-location # subx-r32: enum arg-location # subx-imm32: enum arg-location # subx-disp32: enum arg-location # output-is-write-only: boolean # arg-location: enum # 0 means none # 1 means first inout # 2 means second inout # 3 means first output # == Translating a block # Emit block name if necessary # Emit '{' # When you encounter a statement, emit it as above # When you encounter a variable declaration # emit any code needed for it (bzeros) # push it on the var stack # update register dict if necessary # When you encounter '}' # While popping variables off the var stack until block id changes # Emit code needed to clean up the stack # either increment esp # or pop into appropriate register # The rest is straightforward. == data Program: _Program-functions: # (handle function) 0/imm32 _Program-types: # (handle typeinfo) 0/imm32 # Some constants for simulating the data structures described above. # Many constants here come with a type in a comment. # # Sometimes the type is of the value at that offset for the given type. For # example, if you start at a function record and move forward Function-inouts # bytes, you'll find a (handle list var). # # At other times, the type is of the constant itself. For example, the type of # the constant Function-size is (addr int). To get the size of a function, # look in *Function-size. Function-name: # (handle array byte) 0/imm32 Function-subx-name: # (handle array byte) 4/imm32 Function-inouts: # (handle list var) 8/imm32 Function-outputs: # (handle list var) 0xc/imm32 Function-body: # (handle block) 0x10/imm32 Function-next: # (handle function) 0x14/imm32 Function-size: # (addr int) 0x18/imm32/24 Primitive-name: # (handle array byte) 0/imm32 Primitive-inouts: # (handle list var) 4/i
Run all these from the top-level `mu/` directory.
### Some tools for Mu's build process
These are built automatically.
* `enumerate`: list numeric files in current directory, optionally `--until`
some prefix.
### Miscellaneous odds and ends
These are built lazily.
* `browse_trace`: debugging tool. See `browse_trace.readme.md` for details.
* `linkify`: inserts hyperlinks from variables to definitions in Mu's html
sources. Hacky; just see the number of tests. Invoked by `update_html`.
* `treeshake_all`: rebuild SubX binaries without tests and unused functions.
Hacky; just helps estimate the code needed to perform various tasks.
```
tools/treeshake_all
```
### Notes to self: constraints on the tools/ directory
* Don't overwhelm the initial view of the project with lots of crap in the
root directory.
* Directories go up top in the github view, so too many sub-directories are
also overwhelming.
* Don't increase increase build time too much; everything in `tools/` shouldn't
be automatically built.
* stuff needed all the time is built from root directory.
* `tools/` contains many independent things; don't make it hard to see
boundaries. Ideally just one source file per tool. If not, give related
files similar name prefixes.