author: Kartik Agaram <vc@akkartik.com> 2020-07-05 15:28:37 -0700
committer: Kartik Agaram <vc@akkartik.com> 2020-07-06 01:05:10 -0700
commit: 9a524793ee01ce47f3963768559a0d6c348631c5
tree: aa1045424609f1a14a7a8e5bfaa1517101acd894 /subx.md
parent: 3ccb2c83280e22ad5b7f7c47d7bd95748b969521
6618 - new docs
Diffstat (limited to 'subx.md')
-rw-r--r-- subx.md 162
1 files changed, 162 insertions, 0 deletions
diff --git a/subx.md b/subx.md
new file mode 100644
index 00000000..b1ab38bc
--- /dev/null
+++ b/subx.md
@@ -0,0 +1,162 @@
+## SubX
+
+SubX is a notation for a subset of x86 machine code. [The Mu translator](http://akkartik.github.io/mu/html/apps/mu.subx.html)
+is implemented in SubX and also emits SubX code.
+
+Here's an example program in SubX that adds 1 and 1 and returns the result to
+the parent shell process:
+
+  ```sh
+  == code
+  Entry:
+    # ebx = 1
+    bb/copy-to-ebx  1/imm32
+    # increment ebx
+    43/increment-ebx
+    # exit(ebx)
+    e8/call  syscall_exit/disp32
+  ```
+
+## The syntax of SubX instructions
+
+Just like in regular machine code, SubX programs consist mostly of instructions,
+which are basically sequences of numbers (always in hex). Instructions consist
+of words separated by whitespace. Words may be _opcodes_ (defining the
+operation being performed) or _arguments_ (specifying the data the operation
+acts on). Any word can have extra _metadata_ attached to it after `/`. Some
+metadata is required (like the `/imm32` and `/disp32` above), but unrecognized
+metadata is silently skipped so you can attach comments to words (like the
+instruction names `/copy-to-ebx` and `/increment-ebx` above).
+
+What do all these numbers mean? SubX supports a small subset of the 32-bit x86
+instruction set that likely runs on your computer. (Think of the name as short
+for "sub-x86".) The instruction set contains instructions like `89/copy`,
+`01/add`, `3d/compare` and `51/push-ecx` which modify registers and a byte-addressable
+memory. For a complete list of supported instructions, run `bootstrap help
+opcodes`.
+
+Instructions operate on the following registers:
+
+- Six general-purpose 32-bit registers: `0/eax`, `1/ecx`, `2/edx`, `3/ebx`,
+  `6/esi` and `7/edi`.
+- Two additional 32-bit registers: `4/esp` and `5/ebp`. (I suggest you only
+  use these to manage the call stack.)
+
+(SubX doesn't support floating-point registers yet. Intel processors also
+support 16-bit and 64-bit modes, which SubX will never support. There are
+also _many_ more instructions that SubX will never support.)
+
+While SubX doesn't provide the usual mnemonics for opcodes, it _does_ provide
+error-checking. If you miss an argument or accidentally add an extra argument,
+you'll get a nice error. SubX won't arbitrarily interpret bytes of data as
+instructions or vice versa.
+
+It's worth distinguishing between an instruction's arguments and its _operands_.
+Arguments are provided directly in instructions. Operands are pieces of data
+in registers or memory that are operated on by instructions.
+
+Intel processors typically operate on no more than two operands, and at most
+one of them (the 'reg/mem' operand) can access memory. The address of the
+reg/mem operand is constructed by expressions of one of these forms:
+
+  - `%reg`: operate on just a register, not memory
+  - `*reg`: look up memory with the address in some register
+  - `*(reg + disp)`: add a constant to the address in some register
+  - `*(base + (index << scale) + disp)` where `base` and `index` are registers,
+    and `scale` and `disp` are 2- and 32-bit constants respectively.
+
+Under the hood, SubX turns expressions of these forms into multiple arguments
+with metadata in some complex ways. See [the doc on bare SubX](subx_bare.md).
+
+That covers the complexities of the reg/mem operand. The second operand is
+simpler. It comes from exactly one of the following argument types:
+
+  - `/r32`
+  - displacement: `/disp8` or `/disp32`
+  - immediate: `/imm8` or `/imm32`
+
+Putting all this together, here's an example that adds the integer in `eax` to
+the one at address `edx`:
+
+  ```
+  01/add *edx 0/r32/eax
+  ```
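+
+As a sketch of how the four reg/mem forms listed earlier look in practice,
+here is the same `01/add` instruction applied to each of them. The registers,
+scale and displacement are arbitrary choices for illustration, and the
+expressions are written without internal spaces:
+
+  ```
+  01/add %edx 0/r32/eax              # edx += eax (register only, no memory access)
+  01/add *edx 0/r32/eax              # *edx += eax
+  01/add *(edx+4) 0/r32/eax          # *(edx + 4) += eax
+  01/add *(edx+ecx<<2+4) 0/r32/eax   # *(edx + (ecx << 2) + 4) += eax
+  ```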
+
+## The syntax of SubX programs
+
+SubX programs map to the same ELF binaries that a conventional Linux system
+uses. Linux ELF binaries consist of a series of _segments_. In particular, they
+distinguish between code and data. Correspondingly, SubX programs consist of a
+series of segments, each starting with a header line: `==` followed by a name
+and approximate starting address.
+
+All code must lie in a segment called 'code'.
+
+A segment can be extended by repeating its header later in the program:
+
+```sh
+== code 0x09000000  # first mention requires starting address
+...A...
+
+== data 0x0a000000
+...B...
+
+== code             # no address necessary when adding
+...C...
+```
+
+The `code` segment now contains the instructions of `A` as well as `C`.
+
+Within the `code` segment, each line contains a comment, label or instruction.
+Comments start with a `#` and are ignored. Labels should always be the first
+word on a line, and they end with a `:`.
+
+Instructions can refer to labels in displacement or immediate arguments, and
+they'll obtain a value based on the address of the label: immediate arguments
+will contain the address directly, while displacement arguments will contain
+the difference between the address and the address of the next instruction.
+The latter is mostly useful for `jump` and `call` instructions.
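+
+A minimal sketch of both kinds of label reference follows. The labels `$spin`
+and `Counter` are hypothetical names chosen for illustration, and the fragment
+assumes both segments were already declared with starting addresses elsewhere:
+
+  ```
+  == code
+  $spin:
+    # immediate argument: replaced by the address of Counter
+    b8/copy-to-eax  Counter/imm32
+    # displacement argument: replaced by the relative offset back to $spin
+    eb/jump  $spin/disp8
+
+  == data
+  Counter:
+    0/imm32
+  ```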
+
+Functions are defined using labels. By convention, labels internal to functions
+(that must only be jumped to) start with a `$`. Any other labels must only be
+called, never jumped to. All labels must be unique.
+
+Functions are called using the following syntax:
+```
+(func arg1 arg2 ...)
+```
+
+Function arguments must be either literals (integers or strings) or a reg/mem
+operand using the syntax in the previous section.
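+
+As an illustration, a call to a hypothetical function `foo` (not a real
+library function) that mixes each kind of argument might look like this:
+
+```
+(foo 1 "abc" %eax *(ebp+8))
+```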
+
+A special label is `Entry`, which can be used to specify/override the entry
+point of the program. It doesn't have to be unique, and the latest definition
+will override earlier ones.
+
+(The `Entry` label, along with duplicate segment headers, allows programs to
+be built up incrementally out of multiple [_layers_](http://akkartik.name/post/wart-layers).)
+
+Another special pair of labels is the block delimiters `{` and `}`. They can
+be nested, and jump instructions can take arguments `loop` or `break` that
+jump to the enclosing `{` and `}` respectively.
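+
+As a sketch, here is a loop that sums the integers 1 to 10 into `ebx` using
+block delimiters. The registers and opcodes are chosen only for illustration,
+and the fragment assumes `ebx` starts at 0 and `ecx` at 1:
+
+```
+{
+  # if (ecx > 10) break
+  81 7/subop/compare  %ecx  0xa/imm32
+  7f/jump-if->  break/disp8
+  # ebx += ecx
+  01/add  %ebx  1/r32/ecx
+  # ++ecx
+  41/increment-ecx
+  eb/jump  loop/disp8
+}
+```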
+
+The data segment consists of labels as before and byte values. Referring to
+data labels in either `code` segment instructions or `data` segment values
+yields their address.
+
+Automated tests are an important part of SubX, and there's a simple mechanism
+to provide a test harness: all functions that start with `test-` are called in
+turn by a special, auto-generated function called `run-tests`. How you choose
+to call it is up to you.
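+
+For instance, a test might look like the following sketch. The name
+`test-increment` is made up, and the sketch assumes a helper called
+`check-ints-equal` that takes two integers and a failure message; check the
+library you're building against before relying on it:
+
+```
+test-increment:
+  # eax = 1
+  b8/copy-to-eax  1/imm32
+  # increment eax
+  40/increment-eax
+  # report a failure unless eax is now 2
+  (check-ints-equal %eax 2 "F - test-increment")
+  c3/return
+```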
+
+I try to keep things simple so that there's less work to do when implementing
+SubX in SubX. But there _is_ one convenience: instructions can provide a
+string literal surrounded by quotes (`"`) in an `imm32` argument. SubX will
+transparently copy it to the `data` segment and replace it with its address.
+Strings are the only place where a SubX word is allowed to contain spaces.
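+
+For example, this sketch pushes the address of a string onto the stack (the
+string contents are arbitrary):
+
+```
+68/push  "hello world"/imm32
+```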
+
+That should be enough information for writing SubX programs. The `apps/`
+directory provides some fodder for practice in the `apps/ex*.subx` files,
+giving a more gradual introduction to SubX features. In particular, you should
+work through `apps/factorial4.subx`, which demonstrates all the above ideas in
+concert.