author | Kartik Agaram <vc@akkartik.com> | 2020-07-05 15:28:37 -0700 |
---|---|---|
committer | Kartik Agaram <vc@akkartik.com> | 2020-07-06 01:05:10 -0700 |
commit | 9a524793ee01ce47f3963768559a0d6c348631c5 (patch) | |
tree | aa1045424609f1a14a7a8e5bfaa1517101acd894 /subx.md | |
parent | 3ccb2c83280e22ad5b7f7c47d7bd95748b969521 (diff) | |
download | mu-9a524793ee01ce47f3963768559a0d6c348631c5.tar.gz |
6618 - new docs
Diffstat (limited to 'subx.md')
-rw-r--r-- | subx.md | 162 |
1 files changed, 162 insertions, 0 deletions
diff --git a/subx.md b/subx.md
new file mode 100644
index 00000000..b1ab38bc
--- /dev/null
+++ b/subx.md
@@ -0,0 +1,162 @@
+## SubX
+
+SubX is a notation for a subset of x86 machine code. [The Mu translator](http://akkartik.github.io/mu/html/apps/mu.subx.html)
+is implemented in SubX and also emits SubX code.
+
+Here's an example program in SubX that adds 1 and 1 and returns the result to
+the parent shell process:
+
+  ```sh
+  == code
+  Entry:
+    # ebx = 1
+    bb/copy-to-ebx 1/imm32
+    # increment ebx
+    43/increment-ebx
+    # exit(ebx)
+    e8/call syscall_exit/disp32
+  ```
+
+## The syntax of SubX instructions
+
+Just like in regular machine code, SubX programs consist mostly of instructions,
+which are basically sequences of numbers (always in hex). Instructions consist
+of words separated by whitespace. Words may be _opcodes_ (defining the
+operation being performed) or _arguments_ (specifying the data the operation
+acts on). Any word can have extra _metadata_ attached to it after `/`. Some
+metadata is required (like the `/imm32` and `/disp32` above), but unrecognized
+metadata is silently skipped so you can attach comments to words (like the
+instruction names `/copy-to-ebx` and `/increment-ebx` above).
+
+What do all these numbers mean? SubX supports a small subset of the 32-bit x86
+instruction set that likely runs on your computer. (Think of the name as short
+for "sub-x86".) The instruction set contains instructions like `89/copy`,
+`01/add`, `3d/compare` and `51/push-ecx` that operate on registers and
+byte-addressable memory. For a complete list of supported instructions, run
+`bootstrap help opcodes`.
+
+Instructions operate on the following registers:
+
+- Six general-purpose 32-bit registers: `0/eax`, `1/ecx`, `2/edx`, `3/ebx`,
+  `6/esi` and `7/edi`.
+- Two additional 32-bit registers: `4/esp` and `5/ebp`. (I suggest you only
+  use these to manage the call stack.)
+
+(SubX doesn't support floating-point registers yet. Intel processors support
+an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them.
+There are also _many_ more instructions that SubX will never support.)
+
+While SubX doesn't provide the usual mnemonics for opcodes, it _does_ provide
+error-checking. If you miss an argument or accidentally add an extra argument,
+you'll get a nice error. SubX won't arbitrarily interpret bytes of data as
+instructions or vice versa.
+
+It's worth distinguishing between an instruction's arguments and its _operands_.
+Arguments are provided directly in instructions. Operands are pieces of data
+in registers or memory that are operated on by instructions.
+
+Intel processors typically operate on no more than two operands, and at most
+one of them (the 'reg/mem' operand) can access memory. The reg/mem operand is
+specified by an expression in one of these forms:
+
+  - `%reg`: operate on just a register, not memory
+  - `*reg`: look up memory with the address in some register
+  - `*(reg + disp)`: add a constant to the address in some register
+  - `*(base + (index << scale) + disp)` where `base` and `index` are registers,
+    and `scale` and `disp` are 2- and 32-bit constants respectively.
+
+Under the hood, SubX turns expressions of these forms into multiple arguments
+with metadata, in fairly complex ways. See [the doc on bare SubX](subx_bare.md).
+
+That covers the complexities of the reg/mem operand. The second operand is
+simpler. It comes from exactly one of the following argument types:
+
+  - `/r32`
+  - displacement: `/disp8` or `/disp32`
+  - immediate: `/imm8` or `/imm32`
+
+Putting all this together, here's an example that adds the integer in `eax` to
+the one at address `edx`:
+
+  ```
+  01/add *edx 0/r32/eax
+  ```
+
+## The syntax of SubX programs
+
+SubX programs map to the same ELF binaries that a conventional Linux system
+uses. Linux ELF binaries consist of a series of _segments_. In particular, they
+distinguish between code and data. Correspondingly, SubX programs consist of a
+series of segments, each starting with a header line: `==` followed by a name
+and an approximate starting address.
+
+All code must lie in a segment called 'code'.
+
+Segments can be added to after their first mention.
+
+```sh
+== code 0x09000000 # first mention requires starting address
+...A...
+
+== data 0x0a000000
+...B...
+
+== code # no address necessary when adding
+...C...
+```
+
+The `code` segment now contains the instructions of `A` as well as `C`.
+
+Within the `code` segment, each line contains a comment, label or instruction.
+Comments start with a `#` and are ignored. Labels should always be the first
+word on a line, and they end with a `:`.
+
+Instructions can refer to labels in displacement or immediate arguments, and
+they'll obtain a value based on the address of the label: immediate arguments
+will contain the address directly, while displacement arguments will contain
+the difference between that address and the address of the next instruction.
+The latter is mostly useful for `jump` and `call` instructions.
+
+Functions are defined using labels. By convention, labels internal to functions
+(that must only be jumped to) start with a `$`. Any other labels must only be
+called, never jumped to. All labels must be unique.
+
+Functions are called using the following syntax:
+```
+(func arg1 arg2 ...)
+```
+
+Function arguments must be either literals (integers or strings) or a reg/mem
+operand using the syntax in the previous section.
+
+A special label is `Entry`, which can be used to specify/override the entry
+point of the program. It doesn't have to be unique, and the latest definition
+will override earlier ones.
+
+(The `Entry` label, along with duplicate segment headers, allows programs to
+be built up incrementally out of multiple [_layers_](http://akkartik.name/post/wart-layers).)
+
+Another special pair of labels is the block delimiters `{` and `}`. They can
+be nested, and jump instructions can take arguments `loop` or `break` that
+jump to the enclosing `{` and `}` respectively.
+
+The data segment consists of labels as before and byte values. Referring to
+data labels in either `code` segment instructions or `data` segment values
+yields their address.
+
+Automatic tests are an important part of SubX, and there's a simple mechanism
+to provide a test harness: all functions that start with `test-` are called in
+turn by a special, auto-generated function called `run-tests`. How you choose
+to call it is up to you.
+
+I try to keep things simple so that there's less work to do when implementing
+SubX in SubX. But there _is_ one convenience: instructions can provide a
+string literal surrounded by quotes (`"`) in an `imm32` argument. SubX will
+transparently copy it to the `data` segment and replace it with its address.
+Strings are the only place where a SubX word is allowed to contain spaces.
+
+That should be enough information for writing SubX programs. The `apps/`
+directory provides some fodder for practice in the `apps/ex*.subx` files,
+giving a more gradual introduction to SubX features. In particular, you should
+work through `apps/factorial4.subx`, which demonstrates all the above ideas in
+concert.
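
Reviewer's sketch (not part of the commit): the segment rules in the new doc (a header line starting with `==`, a starting address required on first mention, and later mentions appending to the same segment) can be modeled in a few lines of Python. This is only an illustration under stated assumptions, not how the SubX translator is implemented: the function name `merge_segments` is hypothetical, comments are stripped naively at the first `#` (which would mangle a string literal containing `#`), and the placeholder lines `A1`, `B1`, `C1` stand in for real instructions and data.

```python
def merge_segments(source):
    """Group the lines of a SubX-like program by segment name.

    Hypothetical helper for illustration only. Returns a dict mapping
    segment name -> {"address": int, "lines": [str]}.
    """
    segments = {}
    current = None
    for raw in source.splitlines():
        # Naive comment stripping: everything after '#' is dropped.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        words = line.split()
        if words[0] == "==":
            if len(words) < 2:
                raise ValueError("segment header needs a name")
            name = words[1]
            if name not in segments:
                # First mention: the doc says a starting address is required.
                if len(words) < 3:
                    raise ValueError(f"first mention of {name!r} needs an address")
                segments[name] = {"address": int(words[2], 16), "lines": []}
            # Later mentions simply switch back to the existing segment.
            current = segments[name]
        elif current is None:
            raise ValueError("content before any segment header")
        else:
            current["lines"].append(line)
    return segments


program = """
== code 0x09000000  # first mention requires starting address
A1
A2

== data 0x0a000000
B1

== code  # no address necessary when adding
C1
"""

segments = merge_segments(program)
print(segments["code"]["lines"])  # ['A1', 'A2', 'C1']
```

The second `== code` header appends to the existing segment rather than starting a new one, which is the property the doc relies on for building programs out of layers.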