diff options
-rw-r--r-- | README.md | 788 | ||||
-rw-r--r-- | apps/README.md | 6 | ||||
-rwxr-xr-x | apps/ex2 | bin | 214 -> 219 bytes | |||
-rw-r--r-- | apps/ex2.2.mu | 19 | ||||
-rw-r--r-- | apps/ex2.mu | 2 | ||||
-rw-r--r-- | apps/ex2.subx | 10 | ||||
-rw-r--r-- | bootstrap.md | 17 | ||||
-rw-r--r-- | debugging.md | 122 | ||||
-rw-r--r-- | editor.md | 19 | ||||
-rw-r--r-- | html/apps/ex2.2.mu.html | 81 | ||||
-rw-r--r-- | mu.md | 456 | ||||
-rw-r--r-- | mu_summary | 259 | ||||
-rw-r--r-- | subx.md | 162 | ||||
-rw-r--r-- | subx_bare.md (renamed from subx_addressing_modes.md) | 55 | ||||
-rwxr-xr-x | test_apps | 15 | ||||
-rw-r--r-- | vocabulary.md | 208 |
16 files changed, 1127 insertions, 1092 deletions
diff --git a/README.md b/README.md index aad6253d..c0740f89 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ Running the code you want to run, and nothing else. ```sh $ git clone https://github.com/akkartik/mu $ cd mu - $ ./translate_mu apps/ex2.mu # emits a.elf + $ ./translate_mu apps/ex2.mu # emit a.elf $ ./a.elf # adds 3 and 4 $ echo $? 7 @@ -18,10 +18,13 @@ Running the code you want to run, and nothing else. [![Build Status](https://api.travis-ci.org/akkartik/mu.svg?branch=master)](https://travis-ci.org/akkartik/mu) -Mu requires just a Unix-like OS and nothing else. The Mu translator is built -up from machine code. You can also bootstrap it from C++. Both C++ and -self-hosted versions emit identical binaries. The generated binaries require -just a Unix-like kernel and nothing else. (There are more details in [this paper](http://akkartik.name/akkartik-convivial-20200607.pdf).) +Rather than start from some syntax and introduce layers of translation to +implement it, Mu starts from the processor's instruction set and tries to get +to _some_ safe and clear syntax with as few layers of translation as possible. +The emphasis is on internal consistency at any point in time rather than +compatibility with the past. ([More details.](http://akkartik.name/akkartik-convivial-20200607.pdf)) + +Currently Mu requires a 32-bit x86 processor and Linux kernel. ## Goals @@ -56,668 +59,84 @@ In priority order: For now it's a thin veneer over machine code. I'm working on memory safety before expressive syntax. -## Source Language - -Mu's main source language is [still being built](http://akkartik.name/post/mu-2019-2). -When completed, it will be type- and memory-safe. At the moment it performs no -checks. Here's the program we translated above: - -<img alt='ex2.mu' src='html/ex2.mu.png'> - -There are no expressions; functions consist of only statements that operate on -variables. Most statements in Mu translate to [a single machine-code instruction](http://akkartik.github.io/mu/html/mu_instructions.html). -Variables reside in memory by default. Programs must specify registers when -they want to use them. Functions must return results in registers. Execution -begins at the function `main`, which always returns its result in register -`ebx`. [The paper](http://akkartik.name/akkartik-convivial-20200607.pdf) has -more details, and there's a [summary](mu_summary) of all supported instructions. - -## SubX - -Mu is written in [a notation for a subset of x86 machine code called SubX](http://akkartik.name/post/mu-2019-1). -Here's a SubX program (`apps/ex1.subx`) that returns 42: - - ```sh - bb/copy-to-ebx 0x2a/imm32 # 42 in hex - b8/copy-to-eax 1/imm32/exit - cd/syscall 0x80/imm8 - ``` - -You can generate tiny zero-dependency ELF binaries from SubX that run on Linux. - - ```sh - $ ./bootstrap translate init.linux apps/ex1.subx -o apps/ex1 # on Linux or BSD or Mac - $ ./apps/ex1 # only on Linux - $ echo $? - 42 - ``` - -(Running `bootstrap` requires a C++ compiler, transparently invoking it as -necessary.) - -You can run the generated binaries on an interpreter/VM for better error -messages. - - ```sh - $ ./bootstrap run apps/ex1 # on Linux or BSD or Mac - $ echo $? - 42 - ``` - -Emulated runs can generate a trace that permits [time-travel debugging](https://github.com/akkartik/mu/blob/master/tools/browse_trace.readme.md). +## Toolchain - ```sh - $ ./bootstrap --debug translate init.linux apps/factorial.subx -o apps/factorial - saving address->label information to 'labels' - saving address->source information to 'source_lines' +The Mu stack consists of: +- the Mu type-safe language; +- SubX, an unsafe notation for a subset of x86 machine code; and +- _bare_ SubX, a more rudimentary form of SubX without certain syntax sugar. - $ ./bootstrap --trace run apps/factorial - saving trace to 'last_run' +All Mu programs get translated through these layers, and the translators are +mostly built out of lower levels. The translator from Mu to SubX is written in +SubX, and the translator from SubX to bare SubX is built in bare SubX. - $ tools/browse_trace last_run # text-mode debugger UI - ``` +To translate Mu programs through these layers into tiny zero-dependency ELF +binaries for Linux, use `translate_mu`. -You can write tests for your programs. The entire stack is thoroughly covered -by automated tests. SubX's tagline: tests before syntax. +Mu programs can be run in emulated mode to emit traces, which permit time-travel +debugging. ([More details.](debugging.md)) - ```sh - $ ./bootstrap test - $ ./bootstrap run apps/factorial test - ``` +### incomplete tools -You can use SubX to translate itself. For example, running natively on Linux: +The Mu translator is still a work in progress; not all incorrect programs +result in good error messages. - ```sh - # generate translator phases using the C++ translator - $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/hex.subx -o hex - $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/survey.subx -o survey - $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/pack.subx -o pack - $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/assort.subx -o assort - $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/dquotes.subx -o dquotes - $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/tests.subx -o tests - $ chmod +x hex survey pack assort dquotes tests - - # use the generated translator phases to translate SubX programs - $ cat init.linux apps/ex1.subx |./tests |./dquotes |./assort |./pack |./survey |./hex > a.elf - $ chmod +x a.elf - $ ./a.elf - $ echo $? - 42 - - # or, automating the above steps - $ ./translate_subx init.linux apps/ex1.subx - $ ./a.elf - $ echo $? - 42 - ``` - -Or, running in a VM on other platforms (much slower): - - ```sh - $ ./translate_subx_emulated init.linux apps/ex1.subx # generates identical a.elf to above - $ ./bootstrap run a.elf - $ echo $? - 42 - ``` - -You can package up SubX binaries with the minimal hobbyist OS [Soso](https://github.com/ozkl/soso) -and run them on Qemu. (Requires graphics and sudo access. Currently doesn't -work on a cloud server.) +Once generated, ELF binaries can be packaged up with a Linux kernel into a +bootable disk image: ```sh + $ ./translate_mu apps/ex2.mu # emit a.elf # dependencies - $ sudo apt install build-essential util-linux nasm xorriso # maybe also dosfstools and mtools - # package up a "hello world" program with a third-party kernel into mu_soso.iso - # requires sudo - $ tools/iso/soso init.soso apps/ex6.subx - # try it out - $ qemu-system-i386 -cdrom mu_soso.iso - ``` - -You can also package up SubX binaries with a Linux kernel and run them on -either Qemu or [a cloud server that supports custom images](http://akkartik.name/post/iso-on-linode). -(Takes 12 minutes with 2GB RAM. Requires 12 million LoC of C for the Linux -kernel; that number will gradually go down.) - - ```sh $ sudo apt install build-essential flex bison wget libelf-dev libssl-dev xorriso - $ tools/iso/linux init.linux apps/ex6.subx + $ tools/iso/linux a.elf $ qemu-system-x86_64 -m 256M -cdrom mu_linux.iso -boot d ``` -## The syntax of SubX instructions +The disk image also runs on [any cloud server that supports custom images](http://akkartik.name/post/iso-on-linode). -Here is the above SubX example again: +Mu also runs on the minimal hobbyist OS [Soso](https://github.com/ozkl/soso). +(Requires graphics and sudo access. Currently doesn't work on a cloud server.) ```sh - bb/copy-to-ebx 0x2a/imm32 # 42 in hex - b8/copy-to-eax 1/imm32/exit - cd/syscall 0x80/imm8 - ``` - -Every line contains at most one instruction. Instructions consist of words -separated by whitespace. Words may be _opcodes_ (defining the operation being -performed) or _arguments_ (specifying the data the operation acts on). Any -word can have extra _metadata_ attached to it after `/`. Some metadata is -required (like the `/imm32` and `/imm8` above), but unrecognized metadata is -silently skipped so you can attach comments to words (like the instruction -name `/copy-to-eax` above, or the `/exit` argument). - -What do all these numbers mean? SubX supports a small subset of the 32-bit x86 -instruction set that likely runs on your computer. (Think of the name as short -for "sub-x86".) The instruction set contains instructions like `89/copy`, -`01/add`, `3d/compare` and `51/push-ecx` which modify registers and a byte-addressable -memory. For a complete list of supported instructions, run `bootstrap help opcodes`. - -The registers instructions operate on are as follows: - -- Six general-purpose 32-bit registers: `0/eax`, `1/ebx`, `2/ecx`, `3/edx`, - `6/esi` and `7/edi`. -- Two additional 32-bit registers: `4/esp` and `5/ebp`. (I suggest you only - use these to manage the call stack.) - -(SubX doesn't support floating-point registers yet. Intel processors support -an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them. -There are also _many_ more instructions that SubX will never support.) - -While SubX doesn't provide the usual mnemonics for opcodes, it _does_ provide -error-checking. If you miss an argument or accidentally add an extra argument, -you'll get a nice error. SubX won't arbitrarily interpret bytes of data as -instructions or vice versa. - -It's worth distinguishing between an instruction's arguments and its _operands_. -Arguments are provided directly in instructions. Operands are pieces of data -in register or memory that are operated on by instructions. - -Intel processors typically operate on no more than two operands, and at most -one of them (the 'reg/mem' operand) can access memory. The address of the -reg/mem operand is constructed by expressions of one of these forms: - - - `%reg`: operate on just a register, not memory - - `*reg`: look up memory with the address in some register - - `*(reg + disp)`: add a constant to the address in some register - - `*(base + (index << scale) + disp)` where `base` and `index` are registers, - and `scale` and `disp` are 2- and 32-bit constants respectively. - -Under the hood, SubX turns expressions of these forms into multiple arguments -with metadata in some complex ways. See [subx\_addressing\_modes.md](subx_addressing_modes.md). - -That covers the complexities of the reg/mem operand. The second operand is -simpler. It comes from exactly one of the following argument types: - - - `/r32` - - displacement: `/disp8` or `/disp32` - - immediate: `/imm8` or `/imm32` - -Putting all this together, here's an example that adds the integer in `eax` to -the one at address `edx`: - - ``` - 01/add %edx 0/r32/eax + $ ./translate_mu apps/ex2.mu # emit a.elf + # dependencies + $ sudo apt install build-essential util-linux nasm xorriso # maybe also dosfstools and mtools + $ tools/iso/soso a.elf # requires sudo + $ qemu-system-i386 -cdrom mu_soso.iso ``` -## The syntax of SubX programs - -SubX programs map to the same ELF binaries that a conventional Linux system -uses. Linux ELF binaries consist of a series of _segments_. In particular, they -distinguish between code and data. Correspondingly, SubX programs consist of a -series of segments, each starting with a header line: `==` followed by a name -and approximate starting address. - -All code must lie in a segment called 'code'. - -Segments can be added to. - -```sh -== code 0x09000000 # first mention requires starting address -...A... - -== data 0x0a000000 -...B... - -== code # no address necessary when adding -...C... -``` - -The `code` segment now contains the instructions of `A` as well as `C`. - -Within the `code` segment, each line contains a comment, label or instruction. -Comments start with a `#` and are ignored. Labels should always be the first -word on a line, and they end with a `:`. - -Instructions can refer to labels in displacement or immediate arguments, and -they'll obtain a value based on the address of the label: immediate arguments -will contain the address directly, while displacement arguments will contain -the difference between the address and the address of the current instruction. -The latter is mostly useful for `jump` and `call` instructions. - -Functions are defined using labels. By convention, labels internal to functions -(that must only be jumped to) start with a `$`. Any other labels must only be -called, never jumped to. All labels must be unique. - -Functions are called using the following syntax: -``` -(func arg1 arg2 ...) -``` - -Function arguments must be either literals (integers or strings) or a reg/mem -operand using the syntax in the previous section. - -A special label is `Entry`, which can be used to specify/override the entry -point of the program. It doesn't have to be unique, and the latest definition -will override earlier ones. - -(The `Entry` label, along with duplicate segment headers, allows programs to -be built up incrementally out of multiple [_layers_](http://akkartik.name/post/wart-layers).) - -Another special pair of labels are the block delimiters `{` and `}`. They can -be nested, and jump instructions can take arguments `loop` or `break` that -jump to the enclosing `{` and `}` respectively. +## Syntax -The data segment consists of labels as before and byte values. Referring to -data labels in either `code` segment instructions or `data` segment values -yields their address. +The entire stack shares certain properties and conventions. Programs consist +of functions and functions consist of statements, each performing a single +operation. Operands to statements are always variables or constants. You can't +say `a + b*c`, you have to break it up into two operations. Variables can live +in memory or in registers. Registers must be explicitly specified. There are +some shared lexical rules; comments always start with '#', and numbers are +always written in hex. -Automatic tests are an important part of SubX, and there's a simple mechanism -to provide a test harness: all functions that start with `test-` are called in -turn by a special, auto-generated function called `run-tests`. How you choose -to call it is up to you. +Here's an example program in Mu: -I try to keep things simple so that there's less work to do when implementing -SubX in SubX. But there _is_ one convenience: instructions can provide a -string literal surrounded by quotes (`"`) in an `imm32` argument. SubX will -transparently copy it to the `data` segment and replace it with its address. -Strings are the only place where a SubX word is allowed to contain spaces. - -That should be enough information for writing SubX programs. The `apps/` -directory provides some fodder for practice in the `apps/ex*.subx` files, -giving a more gradual introduction to SubX features. In particular, you should -work through `apps/factorial4.subx`, which demonstrates all the above ideas in -concert. - -This repo includes binaries for all examples. At any commit, an example's -binary should be identical bit for bit with the result of translating the -corresponding `.subx` file. The binary should also be natively runnable on a -Linux system running on Intel x86 processors, either 32- or 64-bit. If either -of these invariants is broken it's a bug on my part. - -## Running - -`bootstrap` currently has the following sub-commands: - -- `bootstrap help`: some helpful documentation to have at your fingertips. - -- `bootstrap test`: runs all automated tests. - -- `bootstrap translate <input files> -o <output ELF binary>`: translates `.subx` - files into an executable ELF binary. - -- `bootstrap run <ELF binary> <args>`: simulates running the ELF binaries emitted - by `bootstrap translate`. Useful for testing and debugging. - - Remember, not all 32-bit Linux binaries are guaranteed to run. I'm not - building general infrastructure here for all of the x86 instruction set. - SubX is about programming with a small, regular subset of 32-bit x86. - -## Getting your editor set up - -If you've read this far, it's time to set up your editor. Mu is really -intended to be read interactively rather than on a browser. - -There is rudimentary syntax highlighting support for Mu and SubX files for -various editors. Look for your editor in `mu.*` and `subx.*`, and follow the -instructions within. - -The Vim files are most developed. In particular, I recommend some optional -setup in subx.vim to use multiple colors for comments. - -If you use [Exuberant Ctags](http://ctags.sourceforge.net) for jumping easily -from names to their definitions in your editor, copy the contents of `exuberant_ctags_rc` -into your `.ctags` file. - -## A few hints for debugging - -Writing programs in SubX is surprisingly pleasant and addictive. Reading -programs is a work in progress, and hopefully the extensive unit tests help. -However, _debugging_ programs is where one really faces up to the low-level -nature of SubX. Even the smallest modifications need testing to make sure they -work. In my experience, there is no modification so small that I get it working -on the first attempt. And when it doesn't work, there are no clear error -messages. Machine code is too simple-minded for that. You can't use a debugger, -since SubX's simplistic ELF binaries contain no debugging information. So -debugging requires returning to basics and practicing with a new, more -rudimentary but hopefully still workable toolkit: - -- Start by nailing down a concrete set of steps for reproducibly obtaining the - error or erroneous behavior. - -- If possible, turn the steps into a failing test. It's not always possible, - but SubX's primary goal is to keep improving the variety of tests one can - write. - -- Start running the single failing test alone. This involves modifying the top - of the program (or the final `.subx` file passed in to `bootstrap translate`) by - replacing the call to `run-tests` with a call to the appropriate `test-` - function. - -- Generate a trace for the failing test while running your program in emulated - mode (`bootstrap run`): - ``` - $ ./bootstrap translate input.subx -o binary - $ ./bootstrap --trace run binary arg1 arg2 2>trace - ``` - The ability to generate a trace is the essential reason for the existence of - `bootstrap run` mode. It gives far better visibility into program internals than - running natively. - -- As a further refinement, it is possible to render label names in the trace - by adding a second flag to the `bootstrap translate` command: - ``` - $ ./bootstrap --debug translate input.subx -o binary - $ ./bootstrap --trace run binary arg1 arg2 2>trace - ``` - `bootstrap --debug translate` emits a mapping from label to address in a file - called `labels`. `bootstrap --trace run` reads in the `labels` file if - it exists and prints out any matching label name as it traces each instruction - executed. - - Here's a sample of what a trace looks like, with a few boxes highlighted: +<img alt='ex2.mu' src='html/ex2.mu.png'> - <img alt='trace example' src='html/trace.png'> +[More details on Mu syntax →](mu.md) - Each of the green boxes shows the trace emitted for a single instruction. - It starts with a line of the form `run: inst: ___` followed by the opcode - for the instruction, the state of registers before the instruction executes, - and various other facts deduced during execution. Some instructions first - print a matching label. In the above screenshot, the red boxes show that - address `0x0900005e` maps to label `$loop` and presumably marks the start of - some loop. Function names get similar `run: == label` lines. +Here's an example program in SubX: -- One trick when emitting traces with labels: - ``` - $ grep label trace - ``` - This is useful for quickly showing you the control flow for the run, and the - function executing when the error occurred. I find it useful to start with - this information, only looking at the complete trace after I've gotten - oriented on the control flow. Did it get to the loop I just modified? How - many times did it go through the loop? - -- Once you have SubX displaying labels in traces, it's a short step to modify - the program to insert more labels just to gain more insight. For example, - consider the following function: - - <img alt='control example -- before' src='html/control0.png'> - - This function contains a series of jump instructions. If a trace shows - `is-hex-lowercase-byte?` being encountered, and then `$is-hex-lowercase-byte?:end` - being encountered, it's still ambiguous what happened. Did we hit an early - exit, or did we execute all the way through? To clarify this, add temporary - labels after each jump: - - <img alt='control example -- after' src='html/control1.png'> - - Now the trace should have a lot more detail on which of these labels was - reached, and precisely when the exit was taken. - -- If you find yourself wondering, "when did the contents of this memory - address change?", `bootstrap run` has some rudimentary support for _watch - points_. Just insert a label starting with `$watch-` before an instruction - that writes to the address, and its value will start getting dumped to the - trace after every instruction thereafter. - -- Once we have a sense for precisely which instructions we want to look at, - it's time to look at the trace as a whole. Key is the state of registers - before each instruction. If a function is receiving bad arguments it becomes - natural to inspect what values were pushed on the stack before calling it, - tracing back further from there, and so on. - - I occasionally want to see the precise state of the stack segment, in which - case I uncomment a commented-out call to `dump_stack()` in the `vm.cc` - layer. It makes the trace a lot more verbose and a lot less dense, necessitating - a lot more scrolling around, so I keep it turned off most of the time. - -- If the trace seems overwhelming, try [browsing it](https://github.com/akkartik/mu/blob/master/tools/browse_trace.readme.md) - in the 'time-travel debugger'. - -- Don't be afraid to slice and dice the trace using Unix tools. For example, - say you have a SubX binary that dies while running tests. You can see what - test it's segfaulting at by compiling it with debug information using - `./translate_subx_debug`, and then running: - - ``` - ./bootstrap --trace --dump run a.elf test 2>&1 |grep 'label test' + ```sh + == code + Entry: + # ebx = 1 + bb/copy-to-ebx 1/imm32 + # increment ebx + 43/increment-ebx + # exit(ebx) + e8/call syscall_exit/disp32 ``` - Just read out the last test printed out before the segfault. - -Hopefully these hints are enough to get you started. The main thing to -remember is to not be afraid of modifying the sources. A good debugging -session gets into a nice rhythm of generating a trace, staring at it for a -while, modifying the sources, regenerating the trace, and so on. Email -[me](mailto:mu@akkartik.com) if you'd like another pair of eyes to stare at a -trace, or if you have questions or complaints. - -## Reference documentation on available primitives - -### Data Structures - -- Kernel strings: null-terminated regions of memory. Unsafe and to be avoided, - but needed for interacting with the kernel. - -- Arrays: length-prefixed regions of memory containing multiple elements of a - single type. Contents are preceded by 4 bytes (32 bits) containing the - `length` of the array in bytes. - -- Slices: a pair of 32-bit addresses denoting a [half-open](https://en.wikipedia.org/wiki/Interval_(mathematics)) - \[`start`, `end`) interval to live memory with a consistent lifetime. +[More details on SubX syntax →](subx.md) - Invariant: `start` <= `end` - -- Streams: strings prefixed by 32-bit `write` and `read` indexes that the next - write or read goes to, respectively. - - - offset 0: write index - - offset 4: read index - - offset 8: length of array (in bytes) - - offset 12: start of array data - - Invariant: 0 <= `read` <= `write` <= `length` - -- File descriptors (fd): Low-level 32-bit integers that the kernel uses to - track files opened by the program. - -- File: 32-bit value containing either a fd or an address to a stream (fake - file). - -- Buffered files (buffered-file): Contain a file descriptor and a stream for - buffering reads/writes. Each `buffered-file` must exclusively perform either - reads or writes. - -### 'system calls' - -As I said at the top, a primary design goal of SubX (and Mu more broadly) is -to explore ways to turn arbitrary manual tests into reproducible automated -tests. SubX aims for this goal by baking testable interfaces deep into the -stack, at the OS syscall level. The idea is that every syscall that interacts -with hardware (and so the environment) should be *dependency injected* so that -it's possible to insert fake hardware in tests. - -But those are big goals. Here are the syscalls I have so far: - -- `write`: takes two arguments, a file `f` and an address to array `s`. - - Comparing this interface with the Unix `write()` syscall shows two benefits: - - 1. SubX can handle 'fake' file descriptors in tests. - - 1. `write()` accepts buffer and its length in separate arguments, which - requires callers to manage the two separately and so can be error-prone. - SubX's wrapper keeps the two together to increase the chances that we - never accidentally go out of array bounds. - -- `read`: takes two arguments, a file `f` and an address to stream `s`. Reads - as much data from `f` as can fit in (the free space of) `s`. - - Like with `write()`, this wrapper around the Unix `read()` syscall adds the - ability to handle 'fake' file descriptors in tests, and reduces the chances - of clobbering outside array bounds. - - One bit of weirdness here: in tests we do a redundant copy from one stream - to another. See [the comments before the implementation](http://akkartik.github.io/mu/html/060read.subx.html) - for a discussion of alternative interfaces. - -- `stop`: takes two arguments: - - `ed` is an address to an _exit descriptor_. Exit descriptors allow us to - `exit()` the program in production, but return to the test harness within - tests. That allows tests to make assertions about when `exit()` is called. - - `value` is the status code to `exit()` with. - - For more details on exit descriptors and how to create one, see [the - comments before the implementation](http://akkartik.github.io/mu/html/059stop.subx.html). - -- `new-segment` - - Allocates a whole new segment of memory for the program, discontiguous with - both existing code and data (heap) segments. Just a more opinionated form of - [`mmap`](http://man7.org/linux/man-pages/man2/mmap.2.html). - -- `allocate`: takes two arguments, an address to allocation-descriptor `ad` - and an integer `n` - - Allocates a contiguous range of memory that is guaranteed to be exclusively - available to the caller. Returns the starting address to the range in `eax`. - - An allocation descriptor tracks allocated vs available addresses in some - contiguous range of memory. The int specifies the number of bytes to allocate. - - Explicitly passing in an allocation descriptor allows for nested memory - management, where a sub-system gets a chunk of memory and further parcels it - out to individual allocations. Particularly helpful for (surprise) tests. - -- ... _(to be continued)_ - -I will continue to import syscalls over time from [the old Mu VM in the parent -directory](https://github.com/akkartik/mu), which has experimented with -interfaces for the screen, keyboard, mouse, disk and network. - -### primitives built atop system calls - -_(Compound arguments are usually passed in by reference. Where the results are -compound objects that don't fit in a register, the caller usually passes in -allocated memory for it.)_ - -#### assertions for tests -- `check-ints-equal`: fails current test if given ints aren't equal -- `check-stream-equal`: fails current test if stream doesn't match string -- `check-next-stream-line-equal`: fails current test if next line of stream - until newline doesn't match string - -#### error handling -- `error`: takes three arguments, an exit-descriptor, a file and a string (message) - - Prints out the message to the file and then exits using the provided - exit-descriptor. - -- `error-byte`: like `error` but takes an extra byte value that it prints out - at the end of the message. - -#### predicates -- `kernel-string-equal?`: compares a kernel string with a string -- `string-equal?`: compares two strings -- `stream-data-equal?`: compares a stream with a string -- `next-stream-line-equal?`: compares with string the next line in a stream, from - `read` index to newline - -- `slice-empty?`: checks if the `start` and `end` of a slice are equal -- `slice-equal?`: compares a slice with a string -- `slice-starts-with?`: compares the start of a slice with a string -- `slice-ends-with?`: compares the end of a slice with a string - -#### writing to disk -- `write`: string -> file - - Can also be used to cat a string into a stream. - - Will abort the entire program if destination is a stream and doesn't have - enough room. -- `write-stream`: stream -> file - - Can also be used to cat one stream into another. - - Will abort the entire program if destination is a stream and doesn't have - enough room. -- `write-slice`: slice -> stream - - Will abort the entire program if there isn't enough room in the - destination stream. -- `append-byte`: int -> stream - - Will abort the entire program if there isn't enough room in the - destination stream. -- `append-byte-hex`: int -> stream - - textual representation in hex, no '0x' prefix - - Will abort the entire program if there isn't enough room in the - destination stream. -- `print-int32`: int -> stream - - textual representation in hex, including '0x' prefix - - Will abort the entire program if there isn't enough room in the - destination stream. -- `write-buffered`: string -> buffered-file -- `write-slice-buffered`: slice -> buffered-file -- `flush`: buffered-file -- `write-byte-buffered`: int -> buffered-file -- `print-byte-buffered`: int -> buffered-file - - textual representation in hex, no '0x' prefix -- `print-int32-buffered`: int -> buffered-file - - textual representation in hex, including '0x' prefix - -#### reading from disk -- `read`: file -> stream - - Can also be used to cat one stream into another. - - Will silently stop reading when destination runs out of space. -- `read-byte-buffered`: buffered-file -> byte -- `read-line-buffered`: buffered-file -> stream - - Will abort the entire program if there isn't enough room. - -#### non-IO operations on streams -- `new-stream`: allocates space for a stream of `n` elements, each occupying - `b` bytes. - - Will abort the entire program if `n*b` requires more than 32 bits. -- `clear-stream`: resets everything in the stream to `0` (except its `length`). -- `rewind-stream`: resets the read index of the stream to `0` without modifying - its contents. - -#### reading/writing hex representations of integers -- `is-hex-int?`: takes a slice argument, returns boolean result in `eax` -- `parse-hex-int`: takes a slice argument, returns int result in `eax` -- `is-hex-digit?`: takes a 32-bit word containing a single byte, returns - boolean result in `eax`. -- `from-hex-char`: takes a hexadecimal digit character in `eax`, returns its - numeric value in `eax` -- `to-hex-char`: takes a single-digit numeric value in `eax`, returns its - corresponding hexadecimal character in `eax` - -#### tokenization - -from a stream: -- `next-token`: stream, delimiter byte -> slice -- `skip-chars-matching`: stream, delimiter byte -- `skip-chars-not-matching`: stream, delimiter byte - -from a slice: -- `next-token-from-slice`: start, end, delimiter byte -> slice - - Given a slice and a delimiter byte, returns a new slice inside the input - that ends at the delimiter byte. - -- `skip-chars-matching-in-slice`: curr, end, delimiter byte -> new-curr (in `eax`) -- `skip-chars-not-matching-in-slice`: curr, end, delimiter byte -> new-curr (in `eax`) - -## Resources - -- [Single-page cheatsheet for the x86 ISA](https://net.cs.uni-bonn.de/fileadmin/user_upload/plohmann/x86_opcode_structure_and_instruction_overview.pdf) - (pdf; [cached local copy](https://github.com/akkartik/mu/blob/master/cheatsheet.pdf)) -- [Concise reference for the x86 ISA](https://c9x.me/x86) -- [Intel processor manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf) (pdf) -- [“Bootstrapping a compiler from nothing”](http://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html) by Edmund Grumley-Evans. -- [“Creating tiny ELF executables”](https://www.muppetlabs.com/~breadbox/software/tiny/teensy.html) by Brian Raiter. -- [StoneKnifeForth](https://github.com/kragen/stoneknifeforth) by [Kragen Sitaker](http://canonical.org/~kragen). - -## Other forks +## Forks Forks of Mu are encouraged. If you don't like something about this repo, feel free to make a fork. If you show it to me, I'll link to it here, so others can @@ -726,64 +145,34 @@ use it. I might even pull your changes into this repo! - [mu-normie](https://git.sr.ht/~akkartik/mu-normie): with a more standard build system and C++ modules. -## Conclusion - -The hypothesis of Mu and SubX is that designing the entire system to be -testable from day 1 and from the ground up would radically impact the culture -of the eco-system in a way that no bolted-on tool or service at higher levels -can replicate: - -- Tests would make it easier to write programs that can be easily understood - by newcomers. - -- More broad-based understanding would lead to more forks. +## Desiderata -- Tests would make it easy to share code across forks. Copy the tests over, - and then copy code over and polish it until the tests pass. Manual work, but - tractable and without major risks. - -- The community would gain a diversified portfolio of forks for each program, - a “wavefront” of possible combinations of features and alternative - implementations of features. Application writers who wrote thorough tests - for their apps (something they just can’t do today) would be able to bounce - around between forks more easily without getting locked in to a single one - as currently happens. - -- There would be a stronger culture of reviewing the code for programs you use - or libraries you depend on. [More eyeballs would make more bugs shallow.](https://en.wikipedia.org/wiki/Linus%27s_Law) - -To falsify these hypotheses, here's a roadmap of the next few planned features: +If you're still reading, here are some more things to check out: -- Testable, dependency-injected vocabulary of primitives - - Streams: `read()`, `write()`. (✓) - - `exit()` (✓) - - Client-like non-blocking socket/file primitives: `load`, `save` - - Concurrency, and a framework for testing blocking code - - Server-like blocking socket/file primitives +- The references on [Mu](mu.md) and [SubX](subx.md) syntax, and also [bare + SubX](subx_bare.md) without any syntax sugar. -- Gradually streamline the bundled kernel, stripping away code we don't need. +- [How to get your text editor set up for Mu and SubX programs.](editor.md) ---- +- [Some tips for debugging Mu and SubX programs.](debugging.md) -If you're still reading, here are some more things to check out: +- [Shared vocabulary of data types and functions shared by Mu programs.](vocabulary.md) + Mu programs can transparently call low-level functions written in SubX. -a) Try running the tests: `./test_apps` +- [A summary](mu_instructions) of how the Mu compiler translates instructions + to SubX. ([colorized version](http://akkartik.github.io/mu/html/mu_instructions.html)) -b) There's a handy [summary](mu_instructions) of how the Mu compiler translates -instructions to SubX. +- [Some starter exercises for learning SubX](https://github.com/akkartik/mu/pulls) + (labelled `hello`). Feel free to [ping me](mailto:ak@akkartik.com) with any questions. -c) Check out the online help on SubX. Starting point: `./bootstrap` +- [Commandline reference for the bootstrap C++ program.](bootstrap.md) -d) Familiarize yourself with the list of opcodes supported in SubX: `./bootstrap -help opcodes`. (It's also [in this repo](https://github.com/akkartik/mu/blob/master/subx_opcodes).) -[Here](https://lobste.rs/s/qglfdp/subx_minimalist_assembly_language_for#c_o9ddqk) -are some tips on my setup for quickly finding the right opcode for any -situation from within Vim. +- The [list of x86 opcodes](subx_opcodes) supported in SubX: `./bootstrap + help opcodes`. -e) Try working on [some starter SubX exercises](https://github.com/akkartik/mu/pulls) -(labelled `hello`). +- [Some details on the unconventional organization of this project.](http://akkartik.name/post/four-repos) -f) SubX comes with some useful [syntax sugar](http://akkartik.name/post/mu-2019-1). +- Previous prototypes: [mu0](https://github.com/akkartik/mu0), [mu1](https://github.com/akkartik/mu1). ## Credits @@ -795,36 +184,15 @@ Mu builds on many ideas that have come before, especially: and [Richard Gabriel](http://dreamsongs.net/Files/PatternsOfSoftware.pdf) for the intellectual tools for reasoning about the higher order design of a codebase; -- Unix and C for showing us how to co-evolve language and OS, and for teaching - the (much maligned, misunderstood and underestimated) value of concise - *implementation* in addition to a clean interface; -- Donald Knuth's [literate programming](http://www.literateprogramming.com/knuthweb.pdf) - for liberating "code for humans to read" from the tyranny of compiler order; - [David Parnas](http://www.cs.umd.edu/class/spring2003/cmsc838p/Design/criteria.pdf) and others for highlighting the value of separating concerns and stepwise refinement; -- [Lisp](http://www.paulgraham.com/rootsoflisp.html) for showing the power of - dynamic languages, late binding and providing the right primitives *a la - carte*, especially lisp macros; - The folklore of debugging by print and the trace facility in many lisp systems; - Automated tests for showing the value of developing programs inside an elaborate harness; -- [Python doctest](http://docs.python.org/2/library/doctest.html) for - exemplifying interactive documentation that doubles as tests; -- [ReStructuredText](https://en.wikipedia.org/wiki/ReStructuredText) - and [its antecedents](https://en.wikipedia.org/wiki/Setext) for showing that - markup can be clean; -- BDD for challenging us all to write tests at a higher level; -- JavaScript and CSS for demonstrating the power of a DOM for complex - structured documents; -- Rust for demonstrating that a system-programming language can be safe; -- Forth for demonstrating that ergonomics don't require grammar; and - [Minimal Linux Live](http://minimal.linux-bg.org) for teaching how to create a bootable disk image. -- [Soso](https://github.com/ozkl/soso), a tiny hackable OS. - -## Coda - -- [Some details on the unconventional organization of this project.](http://akkartik.name/post/four-repos) -- Previous prototypes: [mu0](https://github.com/akkartik/mu0), [mu1](https://github.com/akkartik/mu1). +- [“Bootstrapping a compiler from nothing”](http://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html) by Edmund Grumley-Evans. +- [“Creating tiny ELF executables”](https://www.muppetlabs.com/~breadbox/software/tiny/teensy.html) by Brian Raiter. +- [StoneKnifeForth](https://github.com/kragen/stoneknifeforth) by [Kragen Sitaker](http://canonical.org/~kragen). diff --git a/apps/README.md b/apps/README.md index 1e1a5685..c246b44b 100644 --- a/apps/README.md +++ b/apps/README.md @@ -9,3 +9,9 @@ Some apps written in SubX and Mu, in 3 categories: * More ambitious translator for a memory-safe language (in progress): `mu` * Miscellaneous test programs. + +All SubX apps include binaries. At any commit, an example's binary should be +identical bit for bit with the result of translating the corresponding `.subx` +file. The binary should also be natively runnable on a Linux system running on +Intel x86 processors, either 32- or 64-bit. If either of these invariants is +broken, it's a bug. diff --git a/apps/ex2 b/apps/ex2 index e18440e1..0917c0aa 100755 --- a/apps/ex2 +++ b/apps/ex2 Binary files differdiff --git a/apps/ex2.2.mu b/apps/ex2.2.mu deleted file mode 100644 index b177e770..00000000 --- a/apps/ex2.2.mu +++ /dev/null @@ -1,19 +0,0 @@ -# Increment a number, and return the result in the exit code. -# -# To run: -# $ ./translate_mu apps/ex2.2.mu -# $ ./a.elf -# Expected result: -# $ echo $? -# 7 - -fn main -> result/ebx: int { - result <- foo -} - -fn foo -> result/ebx: int { - var n: int - copy-to n, 3 - increment n - result <- copy n -} diff --git a/apps/ex2.mu b/apps/ex2.mu index fc20aa2e..c3a73874 100644 --- a/apps/ex2.mu +++ b/apps/ex2.mu @@ -1,4 +1,4 @@ -# Add two numbers, and return the result in the exit code. +# Add 3 and 4, and return the result in the exit code. # # To run: # $ ./translate_mu apps/ex2.mu diff --git a/apps/ex2.subx b/apps/ex2.subx index 11c04432..14007329 100644 --- a/apps/ex2.subx +++ b/apps/ex2.subx @@ -1,4 +1,4 @@ -# Add 1 and 1, and return the result in the exit code. +# Add 3 and 4, and return the result in the exit code. # # To run: # $ ./bootstrap translate init.linux apps/ex2.subx -o apps/ex2 @@ -10,10 +10,10 @@ == code Entry: -# ebx = 1 -bb/copy-to-ebx 1/imm32 -# increment ebx -43/increment-ebx +# ebx = 3 +bb/copy-to-ebx 3/imm32 +# add 4 to ebx +81 0/subop/add 3/mod/direct 3/rm32/ebx 4/imm32 # exit(ebx) e8/call syscall_exit/disp32 diff --git a/bootstrap.md b/bootstrap.md new file mode 100644 index 00000000..ca9320e5 --- /dev/null +++ b/bootstrap.md @@ -0,0 +1,17 @@ +## Running + +`bootstrap` currently has the following sub-commands: + +- `bootstrap help`: some helpful documentation to have at your fingertips. + +- `bootstrap test`: runs all automated tests. + +- `bootstrap translate <input files> -o <output ELF binary>`: translates `.subx` + files into an executable ELF binary. + +- `bootstrap run <ELF binary> <args>`: simulates running the ELF binaries emitted + by `bootstrap translate`. Useful for testing and debugging. + + Remember, not all 32-bit Linux binaries are guaranteed to run. I'm not + building general infrastructure here for all of the x86 instruction set. + SubX is about programming with a small, regular subset of 32-bit x86. diff --git a/debugging.md b/debugging.md new file mode 100644 index 00000000..ef4cf474 --- /dev/null +++ b/debugging.md @@ -0,0 +1,122 @@ +## A few hints for debugging + +Writing programs in SubX is surprisingly pleasant and addictive. Reading +programs is a work in progress, and hopefully the extensive unit tests help. +However, _debugging_ programs is where one really faces up to the low-level +nature of SubX. Even the smallest modifications need testing to make sure they +work. In my experience, there is no modification so small that I get it working +on the first attempt. And when it doesn't work, there are no clear error +messages. Machine code is too simple-minded for that. You can't use a debugger, +since SubX's simplistic ELF binaries contain no debugging information. So +debugging requires returning to basics and practicing with a new, more +rudimentary but hopefully still workable toolkit: + +- Start by nailing down a concrete set of steps for reproducibly obtaining the + error or erroneous behavior. + +- If possible, turn the steps into a failing test. It's not always possible, + but SubX's primary goal is to keep improving the variety of tests one can + write. + +- Start running the single failing test alone. This involves modifying the top + of the program (or the final `.subx` file passed in to `bootstrap translate`) by + replacing the call to `run-tests` with a call to the appropriate `test-` + function. + +- Generate a trace for the failing test while running your program in emulated + mode (`bootstrap run`): + ``` + $ ./bootstrap translate input.subx -o binary + $ ./bootstrap --trace run binary arg1 arg2 2>trace + ``` + The ability to generate a trace is the essential reason for the existence of + `bootstrap run` mode. It gives far better visibility into program internals than + running natively. + +- As a further refinement, it is possible to render label names in the trace + by adding a second flag to the `bootstrap translate` command: + ``` + $ ./bootstrap --debug translate input.subx -o binary + $ ./bootstrap --trace run binary arg1 arg2 2>trace + ``` + `bootstrap --debug translate` emits a mapping from label to address in a file + called `labels`. `bootstrap --trace run` reads in the `labels` file if + it exists and prints out any matching label name as it traces each instruction + executed. + + Here's a sample of what a trace looks like, with a few boxes highlighted: + + <img alt='trace example' src='html/trace.png'> + + Each of the green boxes shows the trace emitted for a single instruction. + It starts with a line of the form `run: inst: ___` followed by the opcode + for the instruction, the state of registers before the instruction executes, + and various other facts deduced during execution. Some instructions first + print a matching label. In the above screenshot, the red boxes show that + address `0x0900005e` maps to label `$loop` and presumably marks the start of + some loop. Function names get similar `run: == label` lines. + +- One trick when emitting traces with labels: + ``` + $ grep label trace + ``` + This is useful for quickly showing you the control flow for the run, and the + function executing when the error occurred. I find it useful to start with + this information, only looking at the complete trace after I've gotten + oriented on the control flow. Did it get to the loop I just modified? How + many times did it go through the loop? + +- Once you have SubX displaying labels in traces, it's a short step to modify + the program to insert more labels just to gain more insight. For example, + consider the following function: + + <img alt='control example -- before' src='html/control0.png'> + + This function contains a series of jump instructions. If a trace shows + `is-hex-lowercase-byte?` being encountered, and then `$is-hex-lowercase-byte?:end` + being encountered, it's still ambiguous what happened. Did we hit an early + exit, or did we execute all the way through? To clarify this, add temporary + labels after each jump: + + <img alt='control example -- after' src='html/control1.png'> + + Now the trace should have a lot more detail on which of these labels was + reached, and precisely when the exit was taken. + +- If you find yourself wondering, "when did the contents of this memory + address change?", `bootstrap run` has some rudimentary support for _watch + points_. Just insert a label starting with `$watch-` before an instruction + that writes to the address, and its value will start getting dumped to the + trace after every instruction thereafter. + +- Once we have a sense for precisely which instructions we want to look at, + it's time to look at the trace as a whole. Key is the state of registers + before each instruction. If a function is receiving bad arguments it becomes + natural to inspect what values were pushed on the stack before calling it, + tracing back further from there, and so on. + + I occasionally want to see the precise state of the stack segment, in which + case I uncomment a commented-out call to `dump_stack()` in the `vm.cc` + layer. It makes the trace a lot more verbose and a lot less dense, necessitating + a lot more scrolling around, so I keep it turned off most of the time. + +- If the trace seems overwhelming, try [browsing it](https://github.com/akkartik/mu/blob/master/tools/browse_trace.readme.md) + in the 'time-travel debugger'. + +- Don't be afraid to slice and dice the trace using Unix tools. For example, + say you have a SubX binary that dies while running tests. You can see what + test it's segfaulting at by compiling it with debug information using + `./translate_subx_debug`, and then running: + + ``` + ./bootstrap --trace --dump run a.elf test 2>&1 |grep 'label test' + ``` + + Just read out the last test printed out before the segfault. + +Hopefully these hints are enough to get you started. The main thing to +remember is to not be afraid of modifying the sources. A good debugging +session gets into a nice rhythm of generating a trace, staring at it for a +while, modifying the sources, regenerating the trace, and so on. Email +[me](mailto:mu@akkartik.com) if you'd like another pair of eyes to stare at a +trace, or if you have questions or complaints. diff --git a/editor.md b/editor.md new file mode 100644 index 00000000..65f2badb --- /dev/null +++ b/editor.md @@ -0,0 +1,19 @@ +## Getting your editor set up + +If you've read this far, it's time to set up your editor. Mu is really +intended to be read interactively rather than on a browser. + +There is rudimentary syntax highlighting support for Mu and SubX files for +various editors. Look for your editor in `mu.*` and `subx.*`, and follow the +instructions within. + +The Vim files are most developed. In particular, I recommend some optional +setup in subx.vim to use multiple colors for comments. + +If you use [Exuberant Ctags](http://ctags.sourceforge.net) for jumping easily +from names to their definitions in your editor, copy the contents of `exuberant_ctags_rc` +into your `.ctags` file. + +[Here](https://lobste.rs/s/qglfdp/subx_minimalist_assembly_language_for#c_o9ddqk) +are some tips on my setup for quickly finding the right opcode for any +situation from within Vim. diff --git a/html/apps/ex2.2.mu.html b/html/apps/ex2.2.mu.html deleted file mode 100644 index 455305a6..00000000 --- a/html/apps/ex2.2.mu.html +++ /dev/null @@ -1,81 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> -<html> -<head> -<meta http-equiv="content-type" content="text/html; charset=UTF-8"> -<title>Mu - apps/ex2.2.mu</title> -<meta name="Generator" content="Vim/8.1"> -<meta name="plugin-version" content="vim8.1_v1"> -<meta name="syntax" content="none"> -<meta name="settings" content="number_lines,use_css,pre_wrap,no_foldcolumn,expand_tabs,line_ids,prevent_copy="> -<meta name="colorscheme" content="minimal-light"> -<style type="text/css"> -<!-- -pre { white-space: pre-wrap; font-family: monospace; color: #000000; background-color: #c6c6c6; } -body { font-size:12pt; font-family: monospace; color: #000000; background-color: #c6c6c6; } -a { color:inherit; } -* { font-size:12pt; font-size: 1em; } -.LineNr { } -.muFunction { color: #af5f00; text-decoration: underline; } -.SpecialChar { color: #d70000; } -.Comment { color: #005faf; } -.Constant { color: #008787; } -.Delimiter { color: #c000c0; } -.PreProc { color: #c000c0; } ---> -</style> - -<script type='text/javascript'> -<!-- - -/* function to open any folds containing a jumped-to line before jumping to it */ -function JumpToLine() -{ - var lineNum; - lineNum = window.location.hash; - lineNum = lineNum.substr(1); /* strip off '#' */ - - if (lineNum.indexOf('L') == -1) { - lineNum = 'L'+lineNum; - } - var lineElem = document.getElementById(lineNum); - /* Always jump to new location even if the line was hidden inside a fold, or - * we corrected the raw number to a line ID. - */ - if (lineElem) { - lineElem.scrollIntoView(true); - } - return true; -} -if ('onhashchange' in window) { - window.onhashchange = JumpToLine; -} - ---> -</script> -</head> -<body onload='JumpToLine();'> -<a href='https://github.com/akkartik/mu/blob/master/apps/ex2.2.mu'>https://github.com/akkartik/mu/blob/master/apps/ex2.2.mu</a> -<pre id='vimCodeElement'> -<span id="L1" class="LineNr"> 1 </span><span class="Comment"># Increment a number, and return the result in the exit code.</span> -<span id="L2" class="LineNr"> 2 </span><span class="Comment">#</span> -<span id="L3" class="LineNr"> 3 </span><span class="Comment"># To run:</span> -<span id="L4" class="LineNr"> 4 </span><span class="Comment"># $ ./translate_mu apps/ex2.2.mu</span> -<span id="L5" class="LineNr"> 5 </span><span class="Comment"># $ ./a.elf</span> -<span id="L6" class="LineNr"> 6 </span><span class="Comment"># Expected result:</span> -<span id="L7" class="LineNr"> 7 </span><span class="Comment"># $ echo $?</span> -<span id="L8" class="LineNr"> 8 </span><span class="Comment"># 7</span> -<span id="L9" class="LineNr"> 9 </span> -<span id="L10" class="LineNr">10 </span><span class="PreProc">fn</span> <span class="muFunction">main</span><span class="PreProc"> -> </span>result/<span class="Constant">ebx</span>: int <span class="Delimiter">{</span> -<span id="L11" class="LineNr">11 </span> result <span class="SpecialChar"><-</span> <a href='ex2.2.mu.html#L14'>foo</a> -<span id="L12" class="LineNr">12 </span><span class="Delimiter">}</span> -<span id="L13" class="LineNr">13 </span> -<span id="L14" class="LineNr">14 </span><span class="PreProc">fn</span> <span class="muFunction"><a href='ex2.2.mu.html#L14'>foo</a></span><span class="PreProc"> -> </span>result/<span class="Constant">ebx</span>: int <span class="Delimiter">{</span> -<span id="L15" class="LineNr">15 </span> <span class="PreProc">var</span> n: int -<span id="L16" class="LineNr">16 </span> copy-to n, <span class="Constant">3</span> -<span id="L17" class="LineNr">17 </span> increment n -<span id="L18" class="LineNr">18 </span> result <span class="SpecialChar"><-</span> copy n -<span id="L19" class="LineNr">19 </span><span class="Delimiter">}</span> -</pre> -</body> -</html> -<!-- vim: set foldmethod=manual : --> diff --git a/mu.md b/mu.md new file mode 100644 index 00000000..b093f0e0 --- /dev/null +++ b/mu.md @@ -0,0 +1,456 @@ +# Mu Syntax + +Here are two valid statements in Mu: + +``` +increment x +y <- increment +``` + +Understanding when to use one vs the other is the critical idea in Mu. In +short, the former increments a value in memory, while the latter increments a +value in a register. + +Most languages start from some syntax and do what it takes to implement it. +Mu, however, is designed as a safe way to program in [a regular subset of +32-bit x86 machine code](subx.md), _satisficing_ rather than optimizing for a +clean syntax. To keep the mapping to machine code lightweight, Mu exclusively +uses statements and most statements map to a single instruction of machine +code. + +Since the x86 instruction set restricts how many memory locations an instruction +can use, Mu makes registers explicit as well. Variables must be explicitly +mapped to registers; otherwise they live in memory. + +Statements consist of 3 parts: the operation, optional _inouts_ and optional +_outputs_. Outputs come before the operation name and `<-`. + +Outputs are always registers; memory locations that need to be modified are +passed in by reference using the `addr` type described below. + +So Mu programmers need to make two new categories of decisions: whether to +define variables in registers or memory, and whether to put variables to the +left or right. There's always exactly one way to write any given operation. In +return for this overhead you get a lightweight and future-proof stack. And Mu +will provide good error message to support you. + +Further down, this page enumerates all available primitives in Mu, and [a +separate page](http://akkartik.github.io/mu/html/mu_instructions.html) +describes how each primitive is translated to machine code. + +## Functions and calls + +Zooming out from single statements, here's a complete sample program in Mu: + +<img alt='ex2.mu' src='html/ex2.mu.png'> + +Mu programs are lists of functions. Each function has the following form: + + ``` + fn _name_ inout ... -> output ... { + _statement_ + _statement_ + ... + } + ``` + +Each function has a header line, and some number of statements, each on a +separate line. Headers describe inouts and outputs. Inouts can't be registers, +and outputs _must_ be registers. + +The above program also demonstrates a function call (to the function `do-add`). +Function calls look the same as primitive statements: they can return (multiple) +outputs in registers, and modify inouts passed in by reference. In addition, +there's one more constraint: output registers must match the function header. +For example: + + ``` + fn f -> x/eax: int { + ... + } + fn g { + a/eax <- f # ok + a/ebx <- f # wrong + } + ``` + +The function `main` is special; it is where the program starts running. It +must always return a single int in register `ebx` (as the exit status of the +process). It can also optionally accept an array of strings as input (from the +shell command-line). To be precise, `main` must have one of the following +two signatures: + +- `fn main -> x/ebx: int` +- `fn main args: (addr array (addr array byte)) -> x/ebx: int` + +(The name of the output is flexible.) + +## Blocks + +Blocks are useful for grouping related statements. They're delimited by `{` +and `}`, both each alone on a line. + +Blocks can nest: + + ``` + { + _statements_ + { + _more statements_ + } + } + ``` + +Blocks can be named (with the name ending in a `:` on the same line as the +`{`): + + ``` + $name: { + _statements_ + } + ``` + +Further down we'll see primitive statements for skipping or repeating blocks. +Besides control flow, the other use for blocks is... + +## Local variables + +Functions can define new variables at any time with the keyword `var`. There +are two variants of the `var` statement, for defining variables in registers +or memory. + + ``` + var name: type + var name/reg: type <- ... + ``` + +Variables on the stack are never initialized. (They're always implicitly +zeroed them out.) Variables in registers are always initialized. + +Variables exist from their definition until the end of their containing block. +Register variables may also die earlier if their register is clobbered by a +new variable. + +## Arithmetic primitives + +Here is the list of arithmetic primitive operations supported by Mu. The name +`n` indicates a literal integer rather than a variable, and `var/reg` indicates +a variable in a register. + + ``` + var/reg <- increment + increment var + var/reg <- decrement + decrement var + var1/reg1 <- add var2/reg2 + var/reg <- add var2 + add-to var1, var2/reg + var/reg <- add n + add-to var, n + + var1/reg1 <- sub var2/reg2 + var/reg <- sub var2 + sub-from var1, var2/reg + var/reg <- sub n + sub-from var, n + + var1/reg1 <- and var2/reg2 + var/reg <- and var2 + and-with var1, var2/reg + var/reg <- and n + and-with var, n + + var1/reg1 <- or var2/reg2 + var/reg <- or var2 + or-with var1, var2/reg + var/reg <- or n + or-with var, n + + var1/reg1 <- xor var2/reg2 + var/reg <- xor var2 + xor-with var1, var2/reg + var/reg <- xor n + xor-with var, n + + var/reg <- copy var2/reg2 + copy-to var1, var2/reg + var/reg <- copy var2 + var/reg <- copy n + copy-to var, n + + compare var1, var2/reg + compare var1/reg, var2 + compare var/eax, n + compare var, n + + var/reg <- multiply var2 + ``` + +Any statement above that takes a variable in memory can be replaced with a +dereference (`*`) of an address variable (of type `(addr ...)`) in a register. +(Types can have multiple words, and are wrapped in `()` when they do.) But you +can't dereference variables in memory. You have to load them into a register +first. + +Excluding dereferences, the above statements must operate on non-address +primitive types: `int` or `boolean`. (Booleans are really just `int`s, and Mu +assumes any value but `0` is true.) + +## Operating on individual bytes + +A special-case is variables of type 'byte'. Mu is a 32-bit platform so for the +most part only supports types that are multiples of 32 bits. However, we do +want to support strings in ASCII and UTF-8, which will be arrays of bytes. + +Since most x86 instructions implicitly load 32 bits at a time from memory, +variables of type 'byte' are only allowed in registers, not on the stack. Here +are the possible statements for reading bytes to/from memory: + + ``` + var/reg <- copy-byte var2/reg2 # var: byte, var2: byte + var/reg <- copy-byte *var2/reg2 # var: byte, var2: (addr byte) + copy-byte-to *var1/reg1, var2/reg2 # var1: (addr byte), var2: byte + ``` + +In addition, variables of type 'byte' are restricted to (the lowest bytes of) +just 4 registers: eax, ecx, edx and ebx. + +## Primitive jumps + +There are two kinds of jumps, both with many variations: `break` and `loop`. +`break` instructions jump to the end of the containing block. `loop` instructions +jump to the beginning of the containing block. + +All jumps can take an optional label starting with '$': + + ``` + loop $foo + ``` + +This instruction jumps to the beginning of the block called $foo. The corresponding +`break` jumps to the end of the block. Either jump statement must lie somewhere +inside such a block. Jumps are only legal to containing blocks. (Use named +blocks with restraint; jumps to places far away can get confusing.) + +There are two unconditional jumps: + + ``` + loop + loop label + break + break label + ``` + +The remaining jump instructions are all conditional. Conditional jumps rely on +the result of the most recently executed `compare` instruction. (To keep +programs easy to read, keep compare instructions close to the jump that uses +them.) + + ``` + break-if-= + break-if-= label + break-if-!= + break-if-!= label + ``` + +Inequalities are similar, but have unsigned and signed variants. For simplicity, +always use signed integers; use the unsigned variants only to compare addresses. + + ``` + break-if-< + break-if-< label + break-if-> + break-if-> label + break-if-<= + break-if-<= label + break-if->= + break-if->= label + + break-if-addr< + break-if-addr< label + break-if-addr> + break-if-addr> label + break-if-addr<= + break-if-addr<= label + break-if-addr>= + break-if-addr>= label + ``` + +Similarly, conditional loops: + + ``` + loop-if-= + loop-if-= label + loop-if-!= + loop-if-!= label + + loop-if-< + loop-if-< label + loop-if-> + loop-if-> label + loop-if-<= + loop-if-<= label + loop-if->= + loop-if->= label + + loop-if-addr< + loop-if-addr< label + loop-if-addr> + loop-if-addr> label + loop-if-addr<= + loop-if-addr<= label + loop-if-addr>= + loop-if-addr>= label + ``` + +## Addresses + +Passing objects by reference requires the `address` operation, which returns +an object of type `addr`. + + ``` + var/reg: (addr T) <- address var2: T + ``` + +Here `var2` can't live in a register. + +## Array operations + +Mu arrays are size-prefixed so that operations on them can check bounds as +necessary at run-time. The `length` statement returns the number of elements +in an array. + + ``` + var/reg: int <- length arr/reg: (addr array T) + ``` + +The `index` statement takes an `addr` to an `array` and returns an `addr` to +one of its elements, that can be read from or written to. + + ``` + var/reg: (addr T) <- index arr/reg: (addr array T), n + var/reg: (addr T) <- index arr: (array T sz), n + ``` + +The index can also be a variable in a register, with a caveat: + + ``` + var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int + var/reg: (addr T) <- index arr: (array T sz), idx/reg: int + ``` + +The caveat: the size of T must be 1, 2, 4 or 8 bytes. The x86 instruction set +has complex addressing modes that can index into an array in a single instruction +in these situations. + +For types in general you'll need to split up the work, performing a `compute-offset` +before the `index`. + + ``` + var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int # arr can be in reg or mem + var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int # arr can be in reg or mem + ``` + +The `compute-offset` statement returns a value of type `(offset T)` after +performing any necessary bounds checking. Now the offset can be passed to +`index` as usual: + + ``` + var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T) + ``` + +## Compound types + +Primitive types can be combined together using the `type` keyword. For +example: + + ``` + type point { + x: int + y: int + } + ``` + +Mu programs are currently sequences of `fn` and `type` definitions. + +To access within a compound type, use the `get` instruction. There are two +forms. You need either a variable of the type itself (say `T`) in memory, or a +variable of type `(addr T)` in a register. + + ``` + var/reg: (addr T_f) <- get var/reg: (addr T), f + var/reg: (addr T_f) <- get var: T, f + ``` + +The `f` here is the field name from the `type` definition, and `T_f` must +match the type of `f` in the `type` definition. For example, some legal +instructions for the definition of `point` above: + + ``` + var a/eax: (addr int) <- get p, x + var a/eax: (addr int) <- get p, y + ``` + +## Handles for safe access to the heap + +We've seen the `addr` type, but it's intended to be short-lived. In particular, +you can't save `addr` values inside compound `type`s. To do that you need a +"fat pointer" called a `handle` that is safe to keep around for extended +periods and ensures it's used safely without corrupting the heap and causing +security issues or hard-to-debug misbehavior. + +To actually _use_ a `handle`, we have to turn it into an `addr` first using +the `lookup` statement. + + ``` + var y/reg: (addr T) <- lookup x + ``` + +Now operate on the `addr` as usual, safe in the knowledge that you can later +recover any writes to its payload from `x`. + +It's illegal to continue to use this `addr` after a function that reclaims +heap memory. You have to repeat the lookup from the `handle`. (Luckily Mu +doesn't implement reclamation yet.) + +Having two kinds of addresses takes some getting used to. Do we pass in +variables by value, by `addr` or by `handle`? In inputs or outputs? Here are 3 +rules: + + * Functions that need to look at the payload should accept an `(addr ...)`. + * Functions that need to treat a handle as a value, without looking at its + payload, should accept a `(handle ...)`. Helpers that save handles into data + structures are a common example. + * Functions that need to allocate memory should accept an `(addr handle + ...)`. + +Try to avoid mixing these use cases. + +You can copy handles to another variable on the stack like this: + + ``` + var x: (handle T) + # ..some code initializing x.. + var y/eax: (addr handle T) <- address ... + copy-handle x, y + ``` + +You can also save handles inside compound types like this: + + ``` + var y/reg: (addr handle T_f) <- get var: (addr T), f + copy-handle-to *y, x + ``` + +Or this: + + ``` + var y/reg: (addr handle T) <- index arr: (addr array handle T), n + copy-handle-to *y, x + ``` + +## Conclusion + +Anything not allowed here is forbidden. At least until you modify mu.subx. +Please [contact me](mailto:ak@akkartik.com) or [report issues](https://github.com/akkartik/mu/issues) +when you encounter a missing or misleading error message. diff --git a/mu_summary b/mu_summary deleted file mode 100644 index 286f8286..00000000 --- a/mu_summary +++ /dev/null @@ -1,259 +0,0 @@ -Mu programs are lists of functions. Each function has the following form: - - fn _name_ _inouts_with_types_ -> _outputs_with_types_ { - _instructions_ - } - -Each function has a header line, and some number of instructions, each on a -separate line. - -Instructions may be primitives or function calls. Either way, all instructions -have one of the following forms: - - # defining variables - var _name_: _type_ - var _name_/_register_: _type_ - - # doing things with variables - _operation_ _inouts_ - _outputs_ <- _operation_ _inouts_ - -Instructions and functions may have inouts and outputs. Both inouts and -outputs are variables. - -As seen above, variables can be defined to live in a register, like this: - - n/eax - -Variables not assigned a register live in the stack. - -Function inouts must always be on the stack, and outputs must always be in -registers. A function call must always write to the exact registers its -definition requires. For example: - - fn foo -> x/eax: int { - ... - } - fn main { - a/eax <- foo # ok - a/ebx <- foo # wrong - } - -Primitive inouts may be on the stack or in registers, but outputs must always -be in registers. - -Functions can contain nested blocks inside { and }. Variables defined in a -block don't exist outside it. - - { - _instructions_ - { - _more instructions_ - } - } - -Blocks can be named like so: - - $name: { - _instructions_ - } - -## Primitive instructions - -Primitive instructions currently supported in Mu ('n' indicates a literal -integer rather than a variable, and 'var/reg' indicates a variable in a -register): - - var/reg <- increment - increment var - var/reg <- decrement - decrement var - var1/reg1 <- add var2/reg2 - var/reg <- add var2 - add-to var1, var2/reg - var/reg <- add n - add-to var, n - - var1/reg1 <- sub var2/reg2 - var/reg <- sub var2 - sub-from var1, var2/reg - var/reg <- sub n - sub-from var, n - - var1/reg1 <- and var2/reg2 - var/reg <- and var2 - and-with var1, var2/reg - var/reg <- and n - and-with var, n - - var1/reg1 <- or var2/reg2 - var/reg <- or var2 - or-with var1, var2/reg - var/reg <- or n - or-with var, n - - var1/reg1 <- xor var2/reg2 - var/reg <- xor var2 - xor-with var1, var2/reg - var/reg <- xor n - xor-with var, n - - var/reg <- copy var2/reg2 - copy-to var1, var2/reg - var/reg <- copy var2 - var/reg <- copy n - copy-to var, n - - compare var1, var2/reg - compare var1/reg, var2 - compare var/eax, n - compare var, n - - var/reg <- multiply var2 - -Notice that there are no primitive instructions operating on two variables in -memory. That's a restriction of the underlying x86 processor. - -Any instruction above that takes a variable in memory can be replaced with a -dereference (`*`) of an address variable in a register. But you can't dereference -variables in memory. - -## Byte operations - -A special-case is variables of type 'byte'. Mu is a 32-bit platform so for the -most part only supports types that are multiples of 32 bits. However, we do -want to support strings in ASCII and UTF-8, which will be arrays of bytes. - -Since most x86 instructions implicitly load 32 bits at a time from memory, -variables of type 'byte' are only allowed in registers, not on the stack. Here -are the possible instructions for reading bytes to/from memory: - - var/reg <- copy-byte var2/reg2 # var: byte, var2: byte - var/reg <- copy-byte *var2/reg2 # var: byte, var2: (addr byte) - copy-byte-to *var1/reg1, var2/reg2 # var1: (addr byte), var2: byte - -In addition, variables of type 'byte' are restricted to (the lowest bytes of) -just 4 registers: eax, ecx, edx and ebx. - -## Primitive jump instructions - -There are two kinds of jumps, both with many variations: `break` and `loop`. -`break` instructions jump to the end of the containing block. `loop` instructions -jump to the beginning of the containing block. - -Jumps can take an optional label starting with '$': - - loop $foo - -This instruction jumps to the beginning of the block called $foo. It must lie -somewhere inside such a block. Jumps are only legal to containing blocks. Use -named blocks with restraint; jumps to places far away can get confusing. - -There are two unconditional jumps: - - loop - loop label - break - break label - -The remaining jump instructions are all conditional. Conditional jumps rely on -the result of the most recently executed `compare` instruction. (To keep -programs easy to read, keep compare instructions close to the jump that uses -them.) - - break-if-= - break-if-= label - break-if-!= - break-if-!= label - -Inequalities are similar, but have unsigned and signed variants. We assume -unsigned variants are only ever used to compare addresses. - - break-if-< - break-if-< label - break-if-> - break-if-> label - break-if-<= - break-if-<= label - break-if->= - break-if->= label - - break-if-addr< - break-if-addr< label - break-if-addr> - break-if-addr> label - break-if-addr<= - break-if-addr<= label - break-if-addr>= - break-if-addr>= label - -Similarly, conditional loops: - - loop-if-= - loop-if-= label - loop-if-!= - loop-if-!= label - - loop-if-< - loop-if-< label - loop-if-> - loop-if-> label - loop-if-<= - loop-if-<= label - loop-if->= - loop-if->= label - - loop-if-addr< - loop-if-addr< label - loop-if-addr> - loop-if-addr> label - loop-if-addr<= - loop-if-addr<= label - loop-if-addr>= - loop-if-addr>= label - -## Address operations - - var/reg: (addr T) <- address var: T # var must be in mem (on the stack) - -## Array operations - - var/reg: int <- length arr/reg: (addr array T) - var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int - var/reg: (addr T) <- index arr: (array T sz), idx/reg: int - var/reg: (addr T) <- index arr/reg: (addr array T), n - var/reg: (addr T) <- index arr: (array T sz), n - - var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int # arr can be in reg or mem - var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int # arr can be in reg or mem - var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T) - -## User-defined types - - var/reg: (addr T_f) <- get var/reg: (addr T), f - where record (product) type T has elements a, b, c, ... of types T_a, T_b, T_c, ... - var/reg: (addr T_f) <- get var: T, f - -## Handles for safe access to the heap - -Say we created a handle like this on the stack (it can't be in a register) - var x: (handle T) - allocate Heap, T, x - -You can copy handles to another variable on the stack like this: - var y: (handle T) - copy-handle-to y, x - -You can also save handles inside other user-defined types like this: - var y/reg: (addr handle T_f) <- get var: (addr T), f - copy-handle-to *y, x - -Or this: - var y/reg: (addr handle T) <- index arr: (addr array handle T), n - copy-handle-to *y, x - -Handles can be converted into addresses like this: - var y/reg: (addr T) <- lookup x - -It's illegal to continue to use this addr after a function that reclaims heap -memory. You have to repeat the lookup. diff --git a/subx.md b/subx.md new file mode 100644 index 00000000..b1ab38bc --- /dev/null +++ b/subx.md @@ -0,0 +1,162 @@ +## SubX + +SubX is a notation for a subset of x86 machine code. [The Mu translator](http://akkartik.github.io/mu/html/apps/mu.subx.html) +is implemented in SubX and also emits SubX code. + +Here's an example program in SubX that adds 1 and 1 and returns the result to +the parent shell process: + + ```sh + == code + Entry: + # ebx = 1 + bb/copy-to-ebx 1/imm32 + # increment ebx + 43/increment-ebx + # exit(ebx) + e8/call syscall_exit/disp32 + ``` + +## The syntax of SubX instructions + +Just like in regular machine code, SubX programs consist mostly of instructions, +which are basically sequences of numbers (always in hex). Instructions consist +of words separated by whitespace. Words may be _opcodes_ (defining the +operation being performed) or _arguments_ (specifying the data the operation +acts on). Any word can have extra _metadata_ attached to it after `/`. Some +metadata is required (like the `/imm32` and `/imm8` above), but unrecognized +metadata is silently skipped so you can attach comments to words (like the +instruction name `/copy-to-eax` above, or the `/exit` argument). + +What do all these numbers mean? SubX supports a small subset of the 32-bit x86 +instruction set that likely runs on your computer. (Think of the name as short +for "sub-x86".) The instruction set contains instructions like `89/copy`, +`01/add`, `3d/compare` and `51/push-ecx` which modify registers and a byte-addressable +memory. For a complete list of supported instructions, run `bootstrap help +opcodes`. + +The registers instructions operate on are as follows: + +- Six general-purpose 32-bit registers: `0/eax`, `1/ebx`, `2/ecx`, `3/edx`, + `6/esi` and `7/edi`. +- Two additional 32-bit registers: `4/esp` and `5/ebp`. (I suggest you only + use these to manage the call stack.) + +(SubX doesn't support floating-point registers yet. Intel processors support +an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them. +There are also _many_ more instructions that SubX will never support.) + +While SubX doesn't provide the usual mnemonics for opcodes, it _does_ provide +error-checking. If you miss an argument or accidentally add an extra argument, +you'll get a nice error. SubX won't arbitrarily interpret bytes of data as +instructions or vice versa. + +It's worth distinguishing between an instruction's arguments and its _operands_. +Arguments are provided directly in instructions. Operands are pieces of data +in register or memory that are operated on by instructions. + +Intel processors typically operate on no more than two operands, and at most +one of them (the 'reg/mem' operand) can access memory. The address of the +reg/mem operand is constructed by expressions of one of these forms: + + - `%reg`: operate on just a register, not memory + - `*reg`: look up memory with the address in some register + - `*(reg + disp)`: add a constant to the address in some register + - `*(base + (index << scale) + disp)` where `base` and `index` are registers, + and `scale` and `disp` are 2- and 32-bit constants respectively. + +Under the hood, SubX turns expressions of these forms into multiple arguments +with metadata in some complex ways. See [the doc on bare SubX](subx_bare.md). + +That covers the complexities of the reg/mem operand. The second operand is +simpler. It comes from exactly one of the following argument types: + + - `/r32` + - displacement: `/disp8` or `/disp32` + - immediate: `/imm8` or `/imm32` + +Putting all this together, here's an example that adds the integer in `eax` to +the one at address `edx`: + + ``` + 01/add %edx 0/r32/eax + ``` + +## The syntax of SubX programs + +SubX programs map to the same ELF binaries that a conventional Linux system +uses. Linux ELF binaries consist of a series of _segments_. In particular, they +distinguish between code and data. Correspondingly, SubX programs consist of a +series of segments, each starting with a header line: `==` followed by a name +and approximate starting address. + +All code must lie in a segment called 'code'. + +Segments can be added to. + +```sh +== code 0x09000000 # first mention requires starting address +...A... + +== data 0x0a000000 +...B... + +== code # no address necessary when adding +...C... +``` + +The `code` segment now contains the instructions of `A` as well as `C`. + +Within the `code` segment, each line contains a comment, label or instruction. +Comments start with a `#` and are ignored. Labels should always be the first +word on a line, and they end with a `:`. + +Instructions can refer to labels in displacement or immediate arguments, and +they'll obtain a value based on the address of the label: immediate arguments +will contain the address directly, while displacement arguments will contain +the difference between the address and the address of the current instruction. +The latter is mostly useful for `jump` and `call` instructions. + +Functions are defined using labels. By convention, labels internal to functions +(that must only be jumped to) start with a `$`. Any other labels must only be +called, never jumped to. All labels must be unique. + +Functions are called using the following syntax: +``` +(func arg1 arg2 ...) +``` + +Function arguments must be either literals (integers or strings) or a reg/mem +operand using the syntax in the previous section. + +A special label is `Entry`, which can be used to specify/override the entry +point of the program. It doesn't have to be unique, and the latest definition +will override earlier ones. + +(The `Entry` label, along with duplicate segment headers, allows programs to +be built up incrementally out of multiple [_layers_](http://akkartik.name/post/wart-layers).) + +Another special pair of labels are the block delimiters `{` and `}`. They can +be nested, and jump instructions can take arguments `loop` or `break` that +jump to the enclosing `{` and `}` respectively. + +The data segment consists of labels as before and byte values. Referring to +data labels in either `code` segment instructions or `data` segment values +yields their address. + +Automatic tests are an important part of SubX, and there's a simple mechanism +to provide a test harness: all functions that start with `test-` are called in +turn by a special, auto-generated function called `run-tests`. How you choose +to call it is up to you. + +I try to keep things simple so that there's less work to do when implementing +SubX in SubX. But there _is_ one convenience: instructions can provide a +string literal surrounded by quotes (`"`) in an `imm32` argument. SubX will +transparently copy it to the `data` segment and replace it with its address. +Strings are the only place where a SubX word is allowed to contain spaces. + +That should be enough information for writing SubX programs. The `apps/` +directory provides some fodder for practice in the `apps/ex*.subx` files, +giving a more gradual introduction to SubX features. In particular, you should +work through `apps/factorial4.subx`, which demonstrates all the above ideas in +concert. diff --git a/subx_addressing_modes.md b/subx_bare.md index 52455510..b2429ed2 100644 --- a/subx_addressing_modes.md +++ b/subx_bare.md @@ -1,7 +1,7 @@ -[The Readme](Readme.md) describes SubX notation with some details hidden -behind _syntax sugar_ -- local rewrite rules that make programming in SubX -less error-prone. However, much low-level SubX (before the syntax sugar is -implemented) is written without syntax sugar. This document describes some +[The SubX documentation](subx.md) describes SubX notation with some details +hidden behind _syntax sugar_ -- local rewrite rules that make programming in +SubX less error-prone. However, much low-level SubX (before the syntax sugar +is implemented) is written without syntax sugar. This document describes some details of the syntax sugar: how the reg/mem operand is translated into arguments. @@ -123,3 +123,50 @@ $ echo $? These details should now be enough information for reading and modifying low-level SubX programs. + +## Translating SubX programs + +This repo includes two translators for bare SubX. The first is [the bootstrap +translator](bootstrap.md) implemented in C++. In addition, you can use SubX to +translate itself. For example, running natively on Linux: + + ```sh + # generate translator phases using the C++ translator + $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/hex.subx -o hex + $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/survey.subx -o survey + $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/pack.subx -o pack + $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/assort.subx -o assort + $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/dquotes.subx -o dquotes + $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/tests.subx -o tests + $ chmod +x hex survey pack assort dquotes tests + + # use the generated translator phases to translate SubX programs + $ cat init.linux apps/ex1.subx |./tests |./dquotes |./assort |./pack |./survey |./hex > a.elf + $ chmod +x a.elf + $ ./a.elf + $ echo $? + 42 + + # or, automating the above steps + $ ./translate_subx init.linux apps/ex1.subx + $ ./a.elf + $ echo $? + 42 + ``` + +Or, running in a VM on other platforms (much slower): + + ```sh + $ ./translate_subx_emulated init.linux apps/ex1.subx # generates identical a.elf to above + $ ./bootstrap run a.elf + $ echo $? + 42 + ``` + +## Resources + +- [Single-page cheatsheet for the x86 ISA](https://net.cs.uni-bonn.de/fileadmin/user_upload/plohmann/x86_opcode_structure_and_instruction_overview.pdf) + (pdf; [cached local copy](https://github.com/akkartik/mu/blob/master/cheatsheet.pdf)) +- [Concise reference for the x86 ISA](https://c9x.me/x86) +- [Concise reference for the x86 ISA #2](http://ref.x86asm.net/coder32.html) +- [Intel processor manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf) (pdf) diff --git a/test_apps b/test_apps index 325266e1..bcd61745 100755 --- a/test_apps +++ b/test_apps @@ -47,11 +47,11 @@ echo ex2 test "$1" = 'record' || git diff --exit-code apps/ex2 test $EMULATED && { ./bootstrap run apps/ex2 || ret=$? - test $ret -eq 2 # 1 + 1 + test $ret -eq 7 # 3 + 4 } test $NATIVE && { apps/ex2 || ret=$? - test $ret -eq 2 # 1 + 1 + test $ret -eq 7 # 3 + 4 } echo ex3 @@ -365,17 +365,6 @@ test $NATIVE && { test $ret -eq 7 } -echo ex2.2.mu -./translate_mu apps/ex2.2.mu -test $EMULATED && { - ./bootstrap run a.elf || ret=$? - test $ret -eq 4 -} -test $NATIVE && { - ./a.elf || ret=$? - test $ret -eq 4 -} - echo ex3.mu ./translate_mu apps/ex3.mu test $EMULATED && { diff --git a/vocabulary.md b/vocabulary.md new file mode 100644 index 00000000..426ed590 --- /dev/null +++ b/vocabulary.md @@ -0,0 +1,208 @@ +## Reference documentation on available primitives + +### Data Structures + +- Kernel strings: null-terminated regions of memory. Unsafe and to be avoided, + but needed for interacting with the kernel. + +- Arrays: length-prefixed regions of memory containing multiple elements of a + single type. Contents are preceded by 4 bytes (32 bits) containing the + `length` of the array in bytes. + +- Slices: a pair of 32-bit addresses denoting a [half-open](https://en.wikipedia.org/wiki/Interval_(mathematics)) + \[`start`, `end`) interval to live memory with a consistent lifetime. + + Invariant: `start` <= `end` + +- Streams: strings prefixed by 32-bit `write` and `read` indexes that the next + write or read goes to, respectively. + + - offset 0: write index + - offset 4: read index + - offset 8: length of array (in bytes) + - offset 12: start of array data + + Invariant: 0 <= `read` <= `write` <= `length` + +- File descriptors (fd): Low-level 32-bit integers that the kernel uses to + track files opened by the program. + +- File: 32-bit value containing either a fd or an address to a stream (fake + file). + +- Buffered files (buffered-file): Contain a file descriptor and a stream for + buffering reads/writes. Each `buffered-file` must exclusively perform either + reads or writes. + +### 'system calls' + +As I said at the top, a primary design goal of SubX (and Mu more broadly) is +to explore ways to turn arbitrary manual tests into reproducible automated +tests. SubX aims for this goal by baking testable interfaces deep into the +stack, at the OS syscall level. The idea is that every syscall that interacts +with hardware (and so the environment) should be *dependency injected* so that +it's possible to insert fake hardware in tests. + +But those are big goals. Here are the syscalls I have so far: + +- `write`: takes two arguments, a file `f` and an address to array `s`. + + Comparing this interface with the Unix `write()` syscall shows two benefits: + + 1. SubX can handle 'fake' file descriptors in tests. + + 1. `write()` accepts buffer and its length in separate arguments, which + requires callers to manage the two separately and so can be error-prone. + SubX's wrapper keeps the two together to increase the chances that we + never accidentally go out of array bounds. + +- `read`: takes two arguments, a file `f` and an address to stream `s`. Reads + as much data from `f` as can fit in (the free space of) `s`. + + Like with `write()`, this wrapper around the Unix `read()` syscall adds the + ability to handle 'fake' file descriptors in tests, and reduces the chances + of clobbering outside array bounds. + + One bit of weirdness here: in tests we do a redundant copy from one stream + to another. See [the comments before the implementation](http://akkartik.github.io/mu/html/060read.subx.html) + for a discussion of alternative interfaces. + +- `stop`: takes two arguments: + - `ed` is an address to an _exit descriptor_. Exit descriptors allow us to + `exit()` the program in production, but return to the test harness within + tests. That allows tests to make assertions about when `exit()` is called. + - `value` is the status code to `exit()` with. + + For more details on exit descriptors and how to create one, see [the + comments before the implementation](http://akkartik.github.io/mu/html/059stop.subx.html). + +- `new-segment` + + Allocates a whole new segment of memory for the program, discontiguous with + both existing code and data (heap) segments. Just a more opinionated form of + [`mmap`](http://man7.org/linux/man-pages/man2/mmap.2.html). + +- `allocate`: takes two arguments, an address to allocation-descriptor `ad` + and an integer `n` + + Allocates a contiguous range of memory that is guaranteed to be exclusively + available to the caller. Returns the starting address to the range in `eax`. + + An allocation descriptor tracks allocated vs available addresses in some + contiguous range of memory. The int specifies the number of bytes to allocate. + + Explicitly passing in an allocation descriptor allows for nested memory + management, where a sub-system gets a chunk of memory and further parcels it + out to individual allocations. Particularly helpful for (surprise) tests. + +- ... _(to be continued)_ + +I will continue to import syscalls over time from [the old Mu VM in the parent +directory](https://github.com/akkartik/mu), which has experimented with +interfaces for the screen, keyboard, mouse, disk and network. + +### primitives built atop system calls + +_(Compound arguments are usually passed in by reference. Where the results are +compound objects that don't fit in a register, the caller usually passes in +allocated memory for it.)_ + +#### assertions for tests +- `check-ints-equal`: fails current test if given ints aren't equal +- `check-stream-equal`: fails current test if stream doesn't match string +- `check-next-stream-line-equal`: fails current test if next line of stream + until newline doesn't match string + +#### error handling +- `error`: takes three arguments, an exit-descriptor, a file and a string (message) + + Prints out the message to the file and then exits using the provided + exit-descriptor. + +- `error-byte`: like `error` but takes an extra byte value that it prints out + at the end of the message. + +#### predicates +- `kernel-string-equal?`: compares a kernel string with a string +- `string-equal?`: compares two strings +- `stream-data-equal?`: compares a stream with a string +- `next-stream-line-equal?`: compares with string the next line in a stream, from + `read` index to newline + +- `slice-empty?`: checks if the `start` and `end` of a slice are equal +- `slice-equal?`: compares a slice with a string +- `slice-starts-with?`: compares the start of a slice with a string +- `slice-ends-with?`: compares the end of a slice with a string + +#### writing to disk +- `write`: string -> file + - Can also be used to cat a string into a stream. + - Will abort the entire program if destination is a stream and doesn't have + enough room. +- `write-stream`: stream -> file + - Can also be used to cat one stream into another. + - Will abort the entire program if destination is a stream and doesn't have + enough room. +- `write-slice`: slice -> stream + - Will abort the entire program if there isn't enough room in the + destination stream. +- `append-byte`: int -> stream + - Will abort the entire program if there isn't enough room in the + destination stream. +- `append-byte-hex`: int -> stream + - textual representation in hex, no '0x' prefix + - Will abort the entire program if there isn't enough room in the + destination stream. +- `print-int32`: int -> stream + - textual representation in hex, including '0x' prefix + - Will abort the entire program if there isn't enough room in the + destination stream. +- `write-buffered`: string -> buffered-file +- `write-slice-buffered`: slice -> buffered-file +- `flush`: buffered-file +- `write-byte-buffered`: int -> buffered-file +- `print-byte-buffered`: int -> buffered-file + - textual representation in hex, no '0x' prefix +- `print-int32-buffered`: int -> buffered-file + - textual representation in hex, including '0x' prefix + +#### reading from disk +- `read`: file -> stream + - Can also be used to cat one stream into another. + - Will silently stop reading when destination runs out of space. +- `read-byte-buffered`: buffered-file -> byte +- `read-line-buffered`: buffered-file -> stream + - Will abort the entire program if there isn't enough room. + +#### non-IO operations on streams +- `new-stream`: allocates space for a stream of `n` elements, each occupying + `b` bytes. + - Will abort the entire program if `n*b` requires more than 32 bits. +- `clear-stream`: resets everything in the stream to `0` (except its `length`). +- `rewind-stream`: resets the read index of the stream to `0` without modifying + its contents. + +#### reading/writing hex representations of integers +- `is-hex-int?`: takes a slice argument, returns boolean result in `eax` +- `parse-hex-int`: takes a slice argument, returns int result in `eax` +- `is-hex-digit?`: takes a 32-bit word containing a single byte, returns + boolean result in `eax`. +- `from-hex-char`: takes a hexadecimal digit character in `eax`, returns its + numeric value in `eax` +- `to-hex-char`: takes a single-digit numeric value in `eax`, returns its + corresponding hexadecimal character in `eax` + +#### tokenization + +from a stream: +- `next-token`: stream, delimiter byte -> slice +- `skip-chars-matching`: stream, delimiter byte +- `skip-chars-not-matching`: stream, delimiter byte + +from a slice: +- `next-token-from-slice`: start, end, delimiter byte -> slice + - Given a slice and a delimiter byte, returns a new slice inside the input + that ends at the delimiter byte. + +- `skip-chars-matching-in-slice`: curr, end, delimiter byte -> new-curr (in `eax`) +- `skip-chars-not-matching-in-slice`: curr, end, delimiter byte -> new-curr (in `eax`) |