## SubX: A minimalist assembly language for a subset of the x86 ISA
SubX is a simple, minimalist stack for programming your computer.
```sh
$ git clone https://github.com/akkartik/mu
$ cd mu/subx
$ ./subx # print out a help message
```
SubX is designed:
* to explore ways to turn arbitrary manual tests into reproducible automated tests,
* to be easy to implement in itself, and
* to help learn and teach the x86 instruction set.
It requires a Unix-like environment with a C++ compiler (Linux or BSD or Mac
OS). Running `subx` will transparently compile it as necessary.
[![Build Status](https://api.travis-ci.org/akkartik/mu.svg?branch=master)](https://travis-ci.org/akkartik/mu)
You can generate native ELF binaries with it that run on a bare Linux
kernel. No other dependencies needed.
```sh
$ ./subx translate examples/ex1.subx -o examples/ex1
$ ./examples/ex1 # only on Linux
$ echo $?
42
```
You can run the generated binaries on an interpreter/VM for better error
messages.
```sh
$ ./subx run examples/ex1 # on Linux or BSD or OS X
$ echo $?
42
```
Emulated runs generate a trace that permits [time-travel debugging](https://github.com/akkartik/mu/blob/master/browse_trace/Readme.md).
```sh
$ ./subx --debug translate examples/factorial.subx -o examples/factorial
saving address->label information to 'labels'
saving address->source information to 'source_lines'
$ ./subx --debug --trace run examples/factorial
saving trace to 'last_run'
$ ../browse_trace/browse_trace last_run # text-mode debugger UI
```
You can write tests for your assembly programs. The entire stack is thoroughly
covered by automated tests. SubX's tagline: tests before syntax.
```sh
$ ./subx test
$ ./subx run apps/factorial test
```
You can use SubX to translate itself. For example, running natively on Linux:
```sh
# generate translator phases using C++
$ ./subx translate 0*.subx apps/subx-common.subx apps/hex.subx -o hex
$ ./subx translate 0*.subx apps/subx-common.subx apps/survey.subx -o survey
$ ./subx translate 0*.subx apps/subx-common.subx apps/pack.subx -o pack
$ ./subx translate 0*.subx apps/subx-common.subx apps/assort.subx -o assort
$ ./subx translate 0*.subx apps/subx-common.subx apps/dquotes.subx -o dquotes
$ ./subx translate 0*.subx apps/subx-common.subx apps/tests.subx -o tests
$ chmod +x hex survey pack assort dquotes tests
# use the new translator phases to translate subx programs
$ cat examples/ex1.subx |./tests |./dquotes |./assort |./pack |./survey |./hex > a.elf
$ chmod +x a.elf
$ ./a.elf
$ echo $?
42
```
(For running emulated on other platforms, see the `translate` script. You'll
need 16GB RAM for translating some of the larger programs.)
You can use it to learn about the x86 processor that (almost certainly) runs
your computer. (See below.)
You can read its tiny zero-dependency internals and understand how they work.
You can hack on it, and its thorough tests will raise the alarm when you break
something.
Eventually you will be able to program in higher-level notations. But you'll
always have tests as guardrails and traces for inspecting runs. The entire
stack will always be designed for others to comprehend. You'll always be
empowered to understand how things work, and change what doesn't work for you.
You'll always be expected to make small changes during upgrades.
## What it looks like
Here is the first example we ran above, a program that just returns 42:
```sh
bb/copy-to-EBX 0x2a/imm32 # 42 in hex
b8/copy-to-EAX 1/imm32/exit
cd/syscall 0x80/imm8
```
Every line contains at most one instruction. Instructions consist of words
separated by whitespace. Words may be _opcodes_ (defining the operation being
performed) or _arguments_ (specifying the data the operation acts on). Any
word can have extra _metadata_ attached to it after `/`. Some metadata is
required (like the `/imm32` and `/imm8` above), but unrecognized metadata is
silently skipped so you can attach comments to words (like the instruction
name `/copy-to-EAX` above, or the `/exit` operand).
SubX doesn't provide much syntax (there aren't even the usual mnemonics for
opcodes), but it _does_ provide error-checking. If you miss an operand or
accidentally add an extra operand you'll get a nice error. SubX won't arbitrarily
interpret bytes of data as instructions or vice versa.
So much for syntax. What do all these numbers actually _mean_? SubX supports a
small subset of the 32-bit x86 instruction set that likely runs on your
computer. (Think of the name as short for "sub-x86".) Instructions operate on
a few registers:
* Six general-purpose 32-bit registers: EAX, EBX, ECX, EDX, ESI and EDI
* Two additional 32-bit registers: ESP and EBP (I suggest you only use these to
manage the call stack.)
* Four 1-bit _flag_ registers for conditional branching:
- zero/equal flag ZF
- sign flag SF
- overflow flag OF
- carry flag CF
SubX programs consist of instructions like `89/copy`, `01/add`, `3d/compare`
and `52/push-ECX` which modify these registers as well as a byte-addressable
memory. For a complete list of supported instructions, run `subx help opcodes`.
(SubX doesn't support floating-point registers yet. Intel processors support
an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them.
There are other flags. SubX will never support them. There are also _many_
more instructions that SubX will never support.)
It's worth distinguishing between an instruction's _operands_ and its _arguments_.
Arguments are provided directly in instructions. Operands are pieces of data
in register or memory that are operated on by instructions. Intel processors
determine operands from arguments in fairly complex ways.
## Lengthy interlude: How x86 instructions compute operands
The [Intel processor manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf)
is the final source of truth on the x86 instruction set, but it can be
forbidding to make sense of, so here's a quick orientation. You will need
familiarity with binary numbers, and maybe a few other things. Email [me](mailto:mu@akkartik.com)
any time if something isn't clear. I love explaining this stuff for as long as
it takes. The bad news is that it takes some getting used to. The good news is
that internalizing the next 500 words will give you a significantly deeper
understanding of your computer.
Most instructions operate on an operand in register or memory ('reg/mem'), and
a second operand in a register. The register operand is specified fairly
directly using the 3-bit `/r32` argument:
- 0 means register `EAX`
- 1 means register `ECX`
- 2 means register `EDX`
- 3 means register `EBX`
- 4 means register `ESP`
- 5 means register `EBP`
- 6 means register `ESI`
- 7 means register `EDI`
The reg/mem operand, however, gets complex. It can be specified by 1-7
arguments, each ranging in size from 2 bits to 4 bytes.
The key argument that's always present for reg/mem operands is `/mod`, the
_addressing mode_. This is a 2-bit argument that can take 4 possible values,
and it determines what other arguments are required, and how to interpret
them.
* If `/mod` is `3`: the operand is in the register described by the 3-bit
`/rm32` argument similarly to `/r32` above.
* If `/mod` is `0`: the operand is in the address provided in the register
described by `/rm32`. That's `*rm32` in C syntax.
* If `/mod` is `1`: the operand is in the address provided by adding the
register in `/rm32` with the (1-byte) displacement. That's `*(rm32 + /disp8)`
in C syntax.
* If `/mod` is `2`: the operand is in the address provided by adding the
register in `/rm32` with the (4-byte) displacement. That's `*(/rm32 +
/disp32)` in C syntax.
In the last three cases, one exception occurs when the `/rm32` argument
contains `4`. Rather than encoding register `ESP`, it means the address is
provided by three _whole new_ arguments (`/base`, `/index` and `/scale`) in a
_totally_ different way:
```
reg/mem = *(/base + /index * (2 ^ /scale))
```
(There are a couple more exceptions ☹; see [Table 2-2](modrm.pdf) and [Table 2-3](sib.pdf)
of the Intel manual for the complete story.)
Phew, that was a lot to take in. Some examples to work through as you reread
and digest it:
1. To read directly from the EAX register, `/mod` must be `3` (direct mode),
and `/rm32` must be `0`. There must be no `/base`, `/index` or `/scale`
arguments.
1. To read from `*EAX` (in C syntax), `/mod` must be `0` (indirect mode), and
the `/rm32` argument must be `0`. There must be no `/base`, `/index` or
`/scale` arguments.
1. To read from `*(EAX+4)`, `/mod` must be `1` (indirect + disp8 mode),
`/rm32` must be `0`, there must be no SIB byte, and there must be a single
displacement byte containing `4`.
1. To read from `*(EAX+ECX+4)`, one approach would be to set `/mod` to `1` as
above, `/rm32` to `4` (SIB byte next), `/base` to `0`, `/index` to `1`
(ECX) and a single displacement byte to `4`. (What should the `scale` bits
be? Can you think of another approach?)
1. To read from `*(EAX+ECX+1000)`, one approach would be:
- `/mod`: `2` (indirect + disp32)
- `/rm32`: `4` (`/base`, `/index` and `/scale` arguments required)
- `/base`: `0` (EAX)
- `/index`: `1` (ECX)
- `/disp32`: 4 bytes containing `1000`
## Putting it all together
Here's a more meaty example:
This program sums the first 10 natural numbers. By convention I use horizontal
tabstops to help read instructions, dots to help follow the long lines,
comments before groups of instructions to describe their high-level purpose,
and comments at the end of complex instructions to state the low-level
operation they perform. Numbers are always in hexadecimal (base 16) and must
start with a digit ('0'..'9'); use the '0x' prefix when a number starts with a
letter ('a'..'f'). I tend to also include it as a reminder when numbers look
like decimal numbers.
Try running this example now:
```sh
$ ./subx translate examples/ex3.subx -o examples/ex3
$ ./subx run examples/ex3
$ echo $?
55
```
If you're on Linux you can also run it natively:
```sh
$ ./examples/ex3
$ echo $?
55
```
Use it now to follow along for a more complete tour of SubX syntax.
## The syntax of SubX programs
SubX programs map to the same ELF binaries that a conventional Linux system
uses. Linux ELF binaries consist of a series of _segments_. In particular, they
distinguish between code and data. Correspondingly, SubX programs consist of a
series of segments, each starting with a header line: `==` followed by a name
and approximate starting address.
All code must lie in a segment called 'code'. Execution begins at the start of
the `code` segment by default.
You can reuse segment names:
```
== code
...A...
== data
...B...
== code
...C...
```
The `code` segment now contains the instructions of `A` as well as `C`.
Within the `code` segment, each line contains a comment, label or instruction.
Comments start with a `#` and are ignored. Labels should always be the first
word on a line, and they end with a `:`.
Instruction arguments must specify their type, from:
- `/mod`
- `/rm32`
- `/r32`
- `/subop` (sometimes the `/r32` bits in an instruction are used as an extra opcode)
- displacement: `/disp8` or `/disp32`
- immediate: `/imm8` or `/imm32`
Different instructions (opcodes) require different arguments. SubX will
validate each instruction in your programs, and raise an error anytime you
miss or spuriously add an argument.
I recommend you order arguments consistently in your programs. SubX allows
arguments in any order, but only because that's simplest to explain/implement.
Switching order from instruction to instruction is likely to add to the
reader's burden. Here's the order I've been using after opcodes:
```
|<--------- reg/mem --------->| |<- reg/mem? ->|
/subop /mod /rm32 /base /index /scale /r32 /displacement /immediate
```
Instructions can refer to labels in displacement or immediate arguments, and
they'll obtain a value based on the address of the label: immediate arguments
will contain the address directly, while displacement arguments will contain
the difference between the address and the address of the current instruction.
The latter is mostly useful for `jump` and `call` instructions.
Functions are defined using labels. By convention, labels internal to functions
(that must only be jumped to) start with a `$`. Any other labels must only be
called, never jumped to. All labels must be unique.
A special label is `Entry`, which can be used to specify/override the entry
point of the program. It doesn't have to be unique, and the latest definition
will override earlier ones.
(The `Entry` label, along with duplicate segment headers, allows programs to
be built up incrementally out of multiple [_layers_](http://akkartik.name/post/wart-layers).)
The data segment consists of labels as before and byte values. Referring to
data labels in either `code` segment instructions or `data` segment values
(using the `imm32` metadata either way) yields their address.
Automatic tests are an important part of SubX, and there's a simple mechanism
to provide a test harness: all functions that start with `test-` are called in
turn by a special, auto-generated function called `run-tests`. How you choose
to call it is up to you.
I try to keep things simple so that there's less work to do when I eventually
implement SubX in SubX. But there _is_ one convenience: instructions can
provide a string literal surrounded by quotes (`"`) in an `imm32` argument.
SubX will transparently copy it to the `data` segment and replace it with its
address. Strings are the only place where a SubX word is allowed to contain
spaces.
That should be enough information for writing SubX programs. The `examples/`
directory provides some fodder for practice, giving a more gradual introduction
to SubX features. This repo includes the binary for all examples. At any
commit, an example's binary should be identical bit for bit with the result of
translating the corresponding `.subx` file. The binary should also be natively
runnable on a Linux system running on Intel x86 processors, either 32- or
64-bit. If either of these invariants is broken it's a bug on my part.
## Roadmap and status
* Self-hosting. (✓)
`tests |dquotes |assort |pack |survey |hex`
* A script to package SubX together with a minimal Linux kernel image
(compiled from source, of course).
* Testable, dependency-injected vocabulary of primitives
- Streams: `read()`, `write()`. (✓)
- `exit()` (✓)
- Sockets
- Files
- Concurrency, and a framework for testing blocking code
* Higher-level notations. Like programming languages, but with thinner
implementations that you can -- and are expected to! -- modify.
- syntax for addressing modes: `%reg`, `*reg`, `*(reg+disp)`,
`*(reg+reg+disp)`, `*(reg+reg< -o