Diffstat (limited to 'mu.md')
-rw-r--r-- | mu.md | 892 |
1 files changed, 382 insertions, 510 deletions
diff --git a/mu.md b/mu.md index e8ec2dfa..1eeb9842 100644 --- a/mu.md +++ b/mu.md @@ -1,594 +1,559 @@ -# Mu Syntax +# Mu reference -Here are two valid statements in Mu: +Mu programs are sequences of `fn` and `type` definitions. + +## Functions + +Define functions with the `fn` keyword. For example: ``` -increment x -y <- increment + fn foo arg1: int, arg2: int -> result/eax: boolean ``` -Understanding when to use one vs the other is the critical idea in Mu. In -short, the former increments a value in memory, while the latter increments a -value in a register. +Functions contain `{}` blocks, `var` declarations, primitive statements and +calls to other functions. Primitive statements and function calls look +similar: -Most languages start from some syntax and do what it takes to implement it. -Mu, however, is designed as a safe way to program in [a regular subset of -32-bit x86 machine code](subx.md), _satisficing_ rather than optimizing for a -clean syntax. To keep the mapping to machine code lightweight, Mu exclusively -uses statements. Most statements map to a single instruction of machine code. +``` + out1, out2, out3, ... <- operation inout1, inout2, inout3, ... +``` -Since the x86 instruction set restricts how many memory locations an instruction -can use, Mu makes registers explicit as well. Variables must be explicitly -mapped to specific registers; otherwise they live in memory. While you have to -do your own register allocation, Mu will helpfully point out when you get it -wrong. +They can take any number of inouts and outputs, including 0. Statements +with 0 outputs also drop the `<-`. -Statements consist of 3 parts: the operation, optional _inouts_ and optional -_outputs_. Outputs come before the operation name and `<-`. +Inouts can be either variables in memory, variables in registers, or +constants. Outputs are always variables in registers. -Outputs are always registers; memory locations that need to be modified are -passed in by reference in inouts. 
+Inouts in memory can be either inputs or outputs (if they're addresses being +written to). Hence the name. -So Mu programmers need to make two new categories of decisions: whether to -define variables in registers or memory, and whether to put variables to the -left or right. There's always exactly one way to write any given operation. In -return for this overhead you get a lightweight and future-proof stack. And Mu -will provide good error messages to support you. +Primitives can often write to arbitrary output registers. User-defined +functions, however, require rigidly specified output registers. -Further down, this page enumerates all available primitives in Mu, and [a -separate page](http://akkartik.github.io/mu/html/mu_instructions.html) -describes how each primitive is translated to machine code. There is also a -useful list of pre-defined functions (implemented in unsafe machine code) in [400.mu](http://akkartik.github.io/mu/html/400.mu.html) -and [vocabulary.md](vocabulary.md). +## Variables, registers and memory -## Functions and calls +Declare local variables in a function using the `var` keyword. -Zooming out from single statements, here's a complete sample program in Mu -that runs in Linux: +You can declare local variables in either registers or memory (the stack). So +a `var` statement has two forms: + - `var x/eax: int <- copy 0` + - `var x: int` -<img alt='ex2.mu' src='html/ex2.mu.png' width='400px'> +Variables in registers must be initialized. Variables on the stack are +implicitly zeroed out. -Mu programs are lists of functions. Each function has the following form: +Variables can be in six 32-bit _general-purpose_ registers of the x86 processor. + - eax + - ebx + - ecx + - edx + - esi ('s' often a mnemonic for 'source') + - edi ('d' often a mnemonic for 'destination') -``` -fn _name_ _inout_ ... -> _output_ ... { - _statement_ - _statement_ - ... 
-} -``` +You can store several types in these registers: + - int + - boolean + - (addr T) (address into memory) + - byte (uses only 8 bits) + - code-point (Unicode) + - grapheme (code-point encoded in UTF-8) -Each function has a header line, and some number of statements, each on a -separate line. Headers describe inouts and outputs. Inouts can't be registers, -and outputs _must_ be registers (specified using metadata after a `/`). -Outputs can't take names. +There's one 32-bit type you _cannot_ store in these registers: + - float -The above program also demonstrates a function call (to the function `do-add`). -Function calls look the same as primitive statements: they can return (multiple) -outputs in registers, and modify inouts passed in by reference. In addition, -there's one more constraint: output registers must match the function header. -For example: +It instead uses eight separate 32-bit registers: xmm0, xmm1, ..., xmm7 -``` -fn f -> _/eax: int { - ... -} -fn g { - a/eax <- f # ok - a/ebx <- f # wrong; `a` must be in register `eax` -} -``` +Types that require more than 32 bits (4 bytes) cannot be stored in registers: + - (array T) + - (handle T) + - (stream T) + - slice + - any compound types you define using the `type` keyword -You can exit a function at any time with the `return` instruction. Give it the -right number of arguments, and it'll assign them respectively to the function's -outputs before jumping back to the caller. +`T` here can be any type, including combinations of types. For example: + - (array int) -- an array of ints + - (addr int) -- an address to an int + - (handle int) -- a handle to an int + - (addr handle int) -- an address to a handle to int + - (addr array handle int) -- an address to an array of handles to ints + - ...and so on. -Mu encloses multi-word types in parentheses, and types can get quite expressive. -For example, you read `main`'s inout type as "an address to an array of -addresses to arrays of bytes." 
Since addresses to arrays of bytes are almost -always strings in Mu, you'll quickly learn to mentally shorten this type to -"an address to an array of strings". +Other miscellaneous restrictions: + - `byte` variables must be either in registers or on the heap, never local + variables on the stack. + - `addr` variables can never "escape" a function either by being returned or + by being written to a memory location. When you need that sort of thing, + use a `handle` instead. -Mu currently has no way to name magic constants. Instead, document integer -literals using metadata after a `/`. For example: +## Primitive statement types -``` -var x/eax: int <- copy 3/margin-left -``` +These usually operate on variables with 32-bit types, with some restrictions +noted below. Most instructions with multiple args require types to match. -Here we use metadata in two ways: to specify a register for the variable `x` -(checked), and to give a name to the constant `3` (unchecked; purely for -documentation). +Notation in this section: + - `var/reg` indicates a variable in a register + - `var/xreg` indicates a variable in a floating-point register + - `var` without a `reg` indicates either a variable on the stack or + dereferencing a variable in a (non-floating-point) register: `*var/reg` + - `n` indicates a literal integer. There are no floating-point literals. -Variables can't currently accept unchecked metadata for documentation. -(Perhaps this should change.) +### Moving values around -The function `main` is special. It's where Mu programs start executing. It has -a different signature depending on whether a Mu program requires Linux or can -run without an OS. On Linux, the signature looks like this: +These instructions work with variables of any 32-bit type except `addr` and +`float`. 
``` -fn main args: (addr array addr array byte) -> _/ebx: int + var/reg <- copy var2/reg2 + copy-to var1, var2/reg + var/reg <- copy var2 + var/reg <- copy n + copy-to var, n ``` -It takes an array of strings and returns a status code to Linux in register -`ebx`. - -Without an OS, the signature looks like this: +Byte variables have their own instructions: ``` -fn main screen: (addr screen), keyboard: (addr keyboard), data-disk: (addr disk) + var/reg <- copy-byte var2/reg2 + var/reg <- copy-byte *var2/reg2 # var2 must have type (addr byte) + copy-byte-to *var1/reg1, var2/reg2 # var1 must have type (addr byte) ``` -A screen and keyboard are explicitly passed in. The goal is for all hardware -dependencies to always be explicit. However there are currently gaps: - * The mouse is accessed implicitly - * The screen argument only supports text-mode graphics. Pixel graphics rely - on implicit access to the screen. - * The Mu computer has two disks, and the disk containing Mu code is not - accessible. - -## Blocks +Floating point instructions can be copied as well, but only to floating-point +registers `xmm_`. -Blocks are useful for grouping related statements. They're delimited by `{` -and `}`, each alone on a line. +``` + var/xreg <- copy var2/xreg2 + copy-to var1, var2/xreg + var/xreg <- copy var2 + var/xreg <- copy *var2/reg2 # var2 must have type (addr byte) and live in a general-purpose register +``` -Blocks can nest: +There's no way to copy a literal to a floating-point register. However, +there's a few ways to convert non-float values in general-purpose registers. ``` -{ - _statements_ - { - _more statements_ - } -} + var/xreg <- convert var2/reg2 + var/xreg <- convert var2 + var/xreg <- convert *var2/reg2 ``` -Blocks can be named (with the name ending in a `:` on the same line as the -`{`): +Correspondingly, there are ways to convert floats into integers. 
``` -$name: { - _statements_ -} + var/reg <- convert var2/xreg2 + var/reg <- convert var2 + var/reg <- convert *var2/reg2 + + var/reg <- truncate var2/xreg2 + var/reg <- truncate var2 + var/reg <- truncate *var2/reg2 ``` -Further down we'll see primitive statements for skipping or repeating blocks. -Besides control flow, the other use for blocks is... +### Comparing values + +Work with variables of any 32-bit type. `addr` variables can only be compared +to 0. -## Local variables + compare var1, var2/reg + compare var1/reg, var2 + compare var/eax, n + compare var, n -Functions can define new variables at any time with the keyword `var`. There -are two variants of the `var` statement, for defining variables in registers -or memory. +Floating-point numbers cannot be compared to literals, and the register must +come first. + + compare var1/xreg1, var2/xreg2 + compare var1/xreg1, var2 + +### Branches + +Immediately after a `compare` instruction you can branch on its result. For +example: ``` -var name: type -var name/reg: type <- ... + break-if-= ``` -Variables on the stack are never initialized. (They're always implicitly -zeroed out.) Variables in registers are always initialized. +This instruction will jump to after the enclosing `{}` block if the previous +`compare` detected equality. Here's the list of conditional and unconditional +`break` instructions: -Register variables can go in 6 integer registers (`eax`, `ebx`, `ecx`, `edx`, -`esi`, `edi`) or 8 floating-point registers (`xmm0`, `xmm1`, `xmm2`, `xmm3`, -`xmm4`, `xmm5`, `xmm6`, `xmm7`). - -Defining a variable in a register either clobbers the previous variable (if it -was defined in the same block) or shadows it temporarily (if it was defined in -an outer block). +``` + break + break-if-= + break-if-!= + break-if-< + break-if-> + break-if-<= + break-if->= +``` -Variables exist from their definition until the end of their containing block. 
-Register variables may also die earlier if their register is clobbered by a -new variable. +Similarly, you can jump back to the start of the enclosing `{}` block with +`loop`. Here's the list of `loop` instructions. -Variables on the stack can be of many types (but not `byte`). Integer registers -can only contain 32-bit values: `int`, `byte`, `boolean`, `(addr ...)`. Floating-point -registers can only contain values of type `float`. +``` + loop + loop-if-= + loop-if-!= + loop-if-< + loop-if-> + loop-if-<= + loop-if->= +``` -## Integer primitives +Additionally, there are special variants for comparing `addr` and `float` +values, which results in the following comprehensive list of jumps: -Here is the list of arithmetic primitive operations supported by Mu. The name -`n` indicates a literal integer rather than a variable, and `var/reg` indicates -a variable in a register, though that's not always valid Mu syntax. +``` + break + break-if-= + break-if-!= + break-if-< break-if-addr< break-if-float< + break-if-> break-if-addr> break-if-float> + break-if-<= break-if-addr<= break-if-float<= + break-if->= break-if-addr>= break-if-float>= + loop + loop-if-= + loop-if-!= + loop-if-< loop-if-addr< loop-if-float< + loop-if-> loop-if-addr> loop-if-float> + loop-if-<= loop-if-addr<= loop-if-float<= + loop-if->= loop-if-addr>= loop-if-float>= ``` -var/reg <- increment -increment var -var/reg <- decrement -decrement var -var1/reg1 <- add var2/reg2 -var/reg <- add var2 -add-to var1, var2/reg -var/reg <- add n -add-to var, n +One final property all these jump instructions share: they can take an +optional block name to jump to. For example: + +``` + a: { + ... + break a #----------| + ... # | + } # <--| -var1/reg1 <- subtract var2/reg2 -var/reg <- subtract var2 -subtract-from var1, var2/reg -var/reg <- subtract n -subtract-from var, n -var1/reg1 <- negate -negate var + a: { # <--| + ... # | + b: { # | + ... # | + loop a #----------| + ... + } + ... 
+ } +``` -var/reg <- copy var2/reg2 -copy-to var1, var2/reg -var/reg <- copy var2 -var/reg <- copy n -copy-to var, n +However, there's no way to jump to a block that doesn't contain the `loop` or +`break` instruction. -compare var1, var2/reg -compare var1/reg, var2 -compare var/eax, n -compare var, n +### Integer arithmetic -var/reg <- shift-left n -var/reg <- shift-right n -var/reg <- shift-right-signed n -shift-left var, n -shift-right var, n -shift-right-signed var, n +These instructions require variables of non-`addr`, non-float types. -var/reg <- multiply var2 +Add: +``` + var1/reg1 <- add var2/reg2 + var/reg <- add var2 + add-to var1, var2/reg # var1 += var2 + var/reg <- add n + add-to var, n ``` -Bitwise operations: +Subtract: +``` + var1/reg1 <- subtract var2/reg2 + var/reg <- subtract var2 + subtract-from var1, var2/reg # var1 -= var2 + var/reg <- subtract n + subtract-from var, n ``` -var1/reg1 <- and var2/reg2 -var/reg <- and var2 -and-with var1, var2/reg -var/reg <- and n -and-with var, n -var1/reg1 <- or var2/reg2 -var/reg <- or var2 -or-with var1, var2/reg -var/reg <- or n -or-with var, n +Add one: +``` + var/reg <- increment + increment var +``` -var1/reg1 <- not -not var +Subtract one: +``` + var/reg <- decrement + decrement var +``` -var1/reg1 <- xor var2/reg2 -var/reg <- xor var2 -xor-with var1, var2/reg -var/reg <- xor n -xor-with var, n +Multiply: +``` + var/reg <- multiply var2 ``` -Any statement above that takes a variable in memory can be replaced with a -dereference (`*`) of an address variable (of type `(addr ...)`) in a register. -You can't dereference variables in memory. You have to load them into a -register first. +The result of a multiply must be a register. -Excluding dereferences, the above statements must operate on non-address -values with primitive types: `int`, `boolean` or `byte`. (Booleans are really -just `int`s, and Mu assumes any value but `0` is true.) You can copy addresses -to int variables, but not the other way around. 
+Negate: +``` + var1/reg1 <- negate + negate var +``` -## Floating-point primitives +### Floating-point arithmetic -These instructions may use the floating-point registers `xmm0` ... `xmm7` -(denoted by `/xreg2` or `/xrm32`). They also use integer values on occasion -(`/rm32` and `/r32`). +Operations on `float` variables include a few we've seen before and some new +ones. Notice here that we mostly use floating-point registers `xmm_`, but +still use the general-purpose registers when dereferencing variables of type +`(addr float)`. ``` -var/xreg <- add var2/xreg2 -var/xreg <- add var2 -var/xreg <- add *var2/reg2 + var/xreg <- add var2/xreg2 + var/xreg <- add var2 + var/xreg <- add *var2/reg2 -var/xreg <- subtract var2/xreg2 -var/xreg <- subtract var2 -var/xreg <- subtract *var2/reg2 + var/xreg <- subtract var2/xreg2 + var/xreg <- subtract var2 + var/xreg <- subtract *var2/reg2 -var/xreg <- multiply var2/xreg2 -var/xreg <- multiply var2 -var/xreg <- multiply *var2/reg2 + var/xreg <- multiply var2/xreg2 + var/xreg <- multiply var2 + var/xreg <- multiply *var2/reg2 -var/xreg <- divide var2/xreg2 -var/xreg <- divide var2 -var/xreg <- divide *var2/reg2 + var/xreg <- divide var2/xreg2 + var/xreg <- divide var2 + var/xreg <- divide *var2/reg2 -var/xreg <- reciprocal var2/xreg2 -var/xreg <- reciprocal var2 -var/xreg <- reciprocal *var2/reg2 + var/xreg <- reciprocal var2/xreg2 + var/xreg <- reciprocal var2 + var/xreg <- reciprocal *var2/reg2 -var/xreg <- square-root var2/xreg2 -var/xreg <- square-root var2 -var/xreg <- square-root *var2/reg2 + var/xreg <- square-root var2/xreg2 + var/xreg <- square-root var2 + var/xreg <- square-root *var2/reg2 -var/xreg <- inverse-square-root var2/xreg2 -var/xreg <- inverse-square-root var2 -var/xreg <- inverse-square-root *var2/reg2 + var/xreg <- inverse-square-root var2/xreg2 + var/xreg <- inverse-square-root var2 + var/xreg <- inverse-square-root *var2/reg2 -var/xreg <- min var2/xreg2 -var/xreg <- min var2 -var/xreg <- min *var2/reg2 + 
var/xreg <- min var2/xreg2 + var/xreg <- min var2 + var/xreg <- min *var2/reg2 -var/xreg <- max var2/xreg2 -var/xreg <- max var2 -var/xreg <- max *var2/reg2 + var/xreg <- max var2/xreg2 + var/xreg <- max var2 + var/xreg <- max *var2/reg2 ``` -Remember, when these instructions use indirect mode, they still use an integer -register. Floating-point registers can't hold addresses. - Two instructions in the above list are approximate. According to the Intel manual, `reciprocal` and `inverse-square-root` [go off the rails around the -fourth decimal place](x86_approx.md). If you need more precision, use `divide` -separately. +fourth decimal place](linux/x86_approx.md). If you need more precision, use +`divide` separately. -Most instructions operate exclusively on integer or floating-point operands. -The only exceptions are the instructions for converting between integers and -floating-point numbers. - -``` -var/xreg <- convert var2/reg2 -var/xreg <- convert var2 -var/xreg <- convert *var2/reg2 +### Bitwise boolean operations -var/reg <- convert var2/xreg2 -var/reg <- convert var2 -var/reg <- convert *var2/reg2 +These require variables of non-`addr`, non-float types. -var/reg <- truncate var2/xreg2 -var/reg <- truncate var2 -var/reg <- truncate *var2/reg2 +And: +``` + var1/reg1 <- and var2/reg2 + var/reg <- and var2 + and-with var1, var2/reg + var/reg <- and n + and-with var, n ``` -There are no instructions accepting floating-point literals. To obtain integer -literals in floating-point registers, copy them to general-purpose registers -and then convert them to floating-point. - -The floating-point instructions above always write to registers. The only -instructions that can write floats to memory are `copy` instructions. 
- +Or: ``` -var/xreg <- copy var2/xreg2 -copy-to var1, var2/xreg -var/xreg <- copy var2 -var/xreg <- copy *var2/reg2 + var1/reg1 <- or var2/reg2 + var/reg <- or var2 + or-with var1, var2/reg + var/reg <- or n + or-with var, n ``` -Finally, there are floating-point comparisons. They must always put a register -on the left-hand side: - +Not: ``` -compare var1/xreg1, var2/xreg2 -compare var1/xreg1, var2 + var1/reg1 <- not + not var ``` -## Operating on individual bytes +Xor: +``` + var1/reg1 <- xor var2/reg2 + var/reg <- xor var2 + xor-with var1, var2/reg + var/reg <- xor n + xor-with var, n +``` -A special case is variables of type `byte`. Mu is a 32-bit platform so for the -most part only supports types that are multiples of 32 bits. However, we do -want to support strings in ASCII and UTF-8, which will be arrays of 8-bit -bytes. +### Shifts -Since most x86 instructions implicitly load 32 bits at a time from memory, -variables of type 'byte' are only allowed in registers, not on the stack. Here -are the possible statements for reading bytes to/from memory: +Shifts require variables of non-`addr`, non-float types. ``` -var/reg <- copy-byte var2/reg2 # var: byte -var/reg <- copy-byte *var2/reg2 # var: byte -copy-byte-to *var1/reg1, var2/reg2 # var1: (addr byte) + var/reg <- shift-left n + var/reg <- shift-right n + var/reg <- shift-right-signed n + shift-left var, n + shift-right var, n + shift-right-signed var, n ``` -In addition, variables of type 'byte' are restricted to (the lowest bytes of) -just 4 registers: `eax`, `ecx`, `edx` and `ebx`. As always, this is due to -constraints of the x86 instruction set. +Shifting bits left always inserts zeros on the right. +Shifting bits right inserts zeros on the left by default. +A _signed_ shift right duplicates the leftmost bit, thereby preserving the +sign of an integer. + +## More complex instructions on more complex types -## Primitive jumps +These instructions work with any type `T`. 
As before we use `/reg` here to +indicate when a variable must live in a register. We also include type +constraints after a `:`. -There are two kinds of jumps, both with many variations: `break` and `loop`. -`break` instructions jump to the end of the containing block. `loop` instructions -jump to the beginning of the containing block. +### Addresses and handles -All jumps can take an optional label starting with '$': +You can compute the address of any variable in memory (never in registers): ``` -loop $foo + var/reg: (addr T) <- address var2: T ``` -This instruction jumps to the beginning of the block called $foo. The corresponding -`break` jumps to the end of the block. Either jump statement must lie somewhere -inside such a block. Jumps are only legal to containing blocks. (Use named -blocks with restraint; jumps to places far away can get confusing.) +As mentioned up top, `addr` variables can never escape the function where +they're computed. You can't store them on the heap, or in compound types. -There are two unconditional jumps: +To manage long-lived addresses, _allocate_ them on the heap. ``` -loop -loop label -break -break label + allocate var: (addr handle T) # var can be in either register or memory ``` -The remaining jump instructions are all conditional. Conditional jumps rely on -the result of the most recently executed `compare` instruction. (To keep -programs easy to read, keep `compare` instructions close to the jump that uses -them.) +Handles can be copied and stored without restriction. However, they're too +large to fit in a register. You also can't access their payload directly, you +have to first convert them into a short-lived `addr` using _lookup_. ``` -break-if-= -break-if-= label -break-if-!= -break-if-!= label + var y/eax: (addr T) <- lookup x: (handle T) ``` -Inequalities are similar, but have additional variants for addresses and floats. +The output of `lookup` is always returned in register `eax`. 
Many other +function calls do the same thing. In practice, this means `eax` ends up being +a temporary location used to store lots of variables in quick succession. -``` -break-if-< -break-if-< label -break-if-> -break-if-> label -break-if-<= -break-if-<= label -break-if->= -break-if->= label - -break-if-addr< -break-if-addr< label -break-if-addr> -break-if-addr> label -break-if-addr<= -break-if-addr<= label -break-if-addr>= -break-if-addr>= label +Since handles are large compound types, there's a special helper for comparing +them: -break-if-float< -break-if-float< label -break-if-float> -break-if-float> label -break-if-float<= -break-if-float<= label -break-if-float>= -break-if-float>= label ``` - -Similarly, conditional loops: - + var/eax: boolean <- handle-equal? var1: (handle T), var2: (handle T) ``` -loop-if-= -loop-if-= label -loop-if-!= -loop-if-!= label - -loop-if-< -loop-if-< label -loop-if-> -loop-if-> label -loop-if-<= -loop-if-<= label -loop-if->= -loop-if->= label -loop-if-addr< -loop-if-addr< label -loop-if-addr> -loop-if-addr> label -loop-if-addr<= -loop-if-addr<= label -loop-if-addr>= -loop-if-addr>= label +### Arrays -loop-if-float< -loop-if-float< label -loop-if-float> -loop-if-float> label -loop-if-float<= -loop-if-float<= label -loop-if-float>= -loop-if-float>= label +Arrays are declared in two ways: + 1. On the stack with a literal size: ``` - -## Addresses - -Passing objects by reference requires the `address` operation, which returns -an object of type `addr`. - + var x: (array int 3) ``` -var/reg: (addr T) <- address var2: T + 2. On the heap with a potentially variable size. For example: ``` + var x: (handle array int) + var x-ah/eax: (addr handle array int) <- address x + populate x-ah, 8 +``` + The `8` here can also be an int in a register or memory. -Here `var2` can't live in a register. 
- -## Array operations - -Here's an example definition of a fixed-length array: +You can compute the length of an array, though you'll need an `addr` to do so: ``` -var x: (array int 3) + var/reg: int <- length arr/reg: (addr array T) ``` -The length (here `3`) must be an integer literal. We'll show how to create -dynamically-sized arrays further down. - -Arrays can be large; to avoid copying them around on every function call -you'll usually want to manage `addr`s to them. Here's an example computing the -address of an array. +To read from or write to an array, use `index` to first obtain an address to +read from or modify: ``` -var n/eax: (addr array int) <- address x + var/reg: (addr T) <- index arr/reg: (addr array T), n + var/reg: (addr T) <- index arr: (array T len), n ``` -Addresses to arrays don't include the array length in their type. However, you -can obtain the length of an array like this: +Like our notation of `n`, `len` here is required to be a literal. + +The index requested can also be a variable in a register, with one caveat: ``` -var/reg: int <- length arr/reg: (addr array T) + var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int + var/reg: (addr T) <- index arr: (array T len), idx/reg: int ``` -To operate on elements of an array, use the `index` statement: +The caveat: the size of T must be 1, 2, 4 or 8 bytes. For other sizes of T +you'll need to split up the work, performing a `compute-offset` before the +`index`. 
``` -var/reg: (addr T) <- index arr/reg: (addr array T), n -var/reg: (addr T) <- index arr: (array T len), n + var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int # arr can be in reg or mem + var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int # arr can be in reg or mem ``` -The index can also be a variable in a register, with a caveat: +The result of a `compute-offset` statement can be passed to `index`: ``` -var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int -var/reg: (addr T) <- index arr: (array T len), idx/reg: int + var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T) ``` -The caveat: the size of T must be 1, 2, 4 or 8 bytes. The x86 instruction set -has complex addressing modes that can index into an array in a single instruction -in these situations. +### Stream operations -For other sizes of T you'll need to split up the work, performing a `compute-offset` -before the `index`. +A common use for arrays is as buffers. Save a few items to a scratch space and +then process them. This pattern is so common (we use it in files) that there's +special support for it with a built-in type: `stream`. + +Streams are like arrays in many ways. You can initialize them with a length on +the stack: ``` -var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int # arr can be in reg or mem -var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int # arr can be in reg or mem + var x: (stream int 3) ``` -The `compute-offset` statement returns a value of type `(offset T)` after -performing any necessary bounds checking. 
Now the offset can be passed to -`index` as usual: - +You can also populate them on the heap: ``` -var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T) + var x: (handle stream int) + var x-ah/eax: (addr handle stream int) <- address x + populate-stream x-ah, 8 ``` -## Stream operations +However, streams don't provide random access with an `index` instruction. +Instead, you write to them sequentially, and read back what you wrote. -A common use for arrays is as buffers. Save a few items to a scratch space and -then process them. This pattern is so common (we use it in files) that there's -special support for it with a built-in type: `stream`. +``` + read-from-stream s: (addr stream T), out: (addr T) + write-to-stream s: (addr stream T), in: (addr T) +``` -Streams are like arrays in many ways. You can initialize them with a length: +Streams of bytes are particularly common for managing Unicode text, and there +are a few functions to help with them: ``` -var x: (stream int 3) + write s: (addr stream byte), u: (addr array byte) # write u to s, abort if full + overflow?/eax: boolean <- try-write s: (addr stream byte), u: (addr array byte) + write-stream dest: (addr stream byte), src: (addr stream byte) + # bytes + append-byte s: (addr stream byte), var: int # write lower byte of var + var/eax: byte <- read-byte s: (addr stream byte) + # 32-bit graphemes encoded in UTF-8 + write-grapheme out: (addr stream byte), g: grapheme + g/eax: grapheme <- read-grapheme in: (addr stream byte) ``` -However, streams don't provide random access with an `index` instruction. -Instead, you write to them sequentially, and read back what you wrote. +You can check if a stream is empty or full: ``` -read-from-stream s: (addr stream T), out: (addr T) -write-to-stream s: (addr stream T), in: (addr T) -var/eax: boolean <- stream-empty? s: (addr stream) -var/eax: boolean <- stream-full? s: (addr stream) + var/eax: boolean <- stream-empty? 
s: (addr stream) + var/eax: boolean <- stream-full? s: (addr stream) ``` You can clear streams: ``` -clear-stream f: (addr stream _) + clear-stream f: (addr stream T) ``` -You can also rewind them to reread what's been written: +You can also rewind them to reread their contents: ``` -rewind-stream f: (addr stream _) + rewind-stream f: (addr stream T) ``` ## Compound types @@ -610,7 +575,7 @@ which is described below). They also can't currently include `array`, `stream` or `byte` types. Since arrays and streams carry their size with them, supporting them in compound types complicates variable initialization. Instead of defining them inline in a type definition, define a `handle` to them. Bytes -shouldn't be used for anything but utf-8 strings. +shouldn't be used for anything but UTF-8 strings. To access within a compound type, use the `get` instruction. There are two forms. You need either a variable of the type itself (say `T`) in memory, or a @@ -630,108 +595,15 @@ var a/eax: (addr int) <- get p, x var a/eax: (addr int) <- get p, y ``` -You can clear arbitrary types using the `clear-object` function: +You can clear compound types using the `clear-object` function: ``` clear-object var: (addr T) ``` -Don't clear arrays or streams using `clear-object`; doing so will irreversibly -make their length 0 as well. - -You can shallow-copy arbitrary types using the `copy-object` function: +You can shallow-copy compound types using the `copy-object` function: ``` copy-object src: (addr T), dest: (addr T) ``` -## Handles for safe access to the heap - -We've seen the `addr` type, but it's intended to be short-lived. `addr` values -should never escape from functions. Function outputs can't be `addr`s, -function inouts can't include `addr` in their payload type. Finally, you can't -save `addr` values inside compound `type`s. 
To do that you need a "fat -pointer" called a `handle` that is safe to keep around for extended periods -and ensures it's used safely without corrupting the heap and causing security -issues or hard-to-debug misbehavior. - -To actually _use_ a `handle`, we have to turn it into an `addr` first using -the `lookup` statement. - -``` -var y/reg: (addr T) <- lookup x: (handle T) -``` - -Now operate on `y` as usual, safe in the knowledge that you can later recover -any writes to its payload from `x`. - -It's illegal to continue to use an `addr` after a function that reclaims heap -memory. You have to repeat the lookup from the `handle`. (Luckily Mu doesn't -implement reclamation yet.) - -Having two kinds of addresses takes some getting used to. Do we pass in -variables by value, by `addr` or by `handle`? In inputs or outputs? Here are 3 -rules of thumb: - - * Functions that need to look at the payload should accept an `(addr ...)` - where possible. - * Functions that need to treat a handle as a value, without looking at its - payload, should accept a `(handle ...)`. Helpers that save handles into - data structures are a common example. - * Functions that need to allocate memory should accept an `(addr handle ...)`. - -Try to avoid mixing these use cases. - -If you have a variable `src` of type `(handle ...)`, you can save it inside a -compound type like this (provided the types match): - -``` -var dest/reg: (addr handle T_f) <- get var: (addr T), f -copy-handle src, dest -``` - -Or this: - -``` -var dest/reg: (addr handle T) <- index arr: (addr array handle T), n -copy-handle src, dest -``` - -To create handles to non-array types, use `allocate`: - -``` -var x: (addr handle T) -... initialize x ... -allocate x -``` - -To create handles to array types (of potentially dynamic size), use `populate`: - -``` -var x: (addr handle array T) -... initialize x ... 
-populate x, 3 # array of 3 T's -``` - -## Seams - -I said at the start that most instructions map 1:1 to x86 machine code. To -enforce type- and memory-safety, I was forced to carve out a few exceptions: - -* the `index` instruction on arrays, for bounds-checking -* the `length` instruction on arrays, for translating the array size in bytes - into the number of elements. -* the `lookup` instruction on handles, for validating fat-pointer metadata -* `var` instructions, to initialize memory -* byte copies, to initialize memory - -If you're curious, [the compiler summary page](http://akkartik.github.io/mu/html/mu_instructions.html) -has the complete nitty-gritty on how each instruction is implemented. Including -the above exceptions. - -## Conclusion - -Anything not allowed here is forbidden, even if the compiler doesn't currently -detect and complain about it. Please [contact me](mailto:ak@akkartik.com) or -[report issues](https://github.com/akkartik/mu/issues) when you encounter a -missing or misleading error message. |
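For reference, a minimal untested sketch that combines several statements documented in the added text (`var`, `address`, `index`, `compare`, `break-if->=`, `loop`). The function name `sum-demo` and all variable names are hypothetical and do not appear in mu.md, and the snippet has not been run through the Mu translator.

```
fn sum-demo {
  # array on the stack; its elements start out zeroed
  var arr: (array int 3)
  # index needs an address to the array in a register
  var arr-addr/esi: (addr array int) <- address arr
  var sum/eax: int <- copy 0
  var i/ecx: int <- copy 0
  {
    # leave the block once i reaches the array length
    compare i, 3
    break-if->=
    # obtain an address to element i, then read through it
    var elem/edx: (addr int) <- index arr-addr, i
    sum <- add *elem
    i <- increment
    loop
  }
}
```

Here `index` can take `i` in a register directly because `int` is 4 bytes; per the text above, element types whose size is not 1, 2, 4 or 8 bytes would need a `compute-offset` first.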