Diffstat (limited to 'subx')
-rw-r--r-- | subx/Readme.md | 162 |
-rw-r--r-- | subx/html/ex1.png | bin 140501 -> 0 bytes |
2 files changed, 89 insertions, 73 deletions
diff --git a/subx/Readme.md b/subx/Readme.md
index b909f20c..67a7e5e1 100644
--- a/subx/Readme.md
+++ b/subx/Readme.md
@@ -8,8 +8,7 @@ C++ compiler and runtime.)
 
 ## Thin layer of abstraction over machine code, isn't that just an assembler?
 
-Assemblers try to hide the precise instructions emitted from the programmer.
-Consider these instructions in Assembly language:
+Compare some code in Assembly:
 
 ```
 add EBX, ECX
@@ -17,70 +16,84 @@ copy EBX, 0
 copy ECX, 1
 ```
 
-Here are the same instructions in SubX, just a list of numbers (opcodes and
-operands) with metadata 'comments' after a `/`:
+..with the same instructions in SubX:
 
 ```
 01/add 3/mod/direct 3/rm32/ebx 1/r32/ecx
-bb/copy 0/imm32
-b9/copy 1/imm32
+bb/copy-EBX 0/imm32
+b9/copy-ECX 1/imm32
 ```
 
-Notice that a single instruction, say 'copy', maps to multiple opcodes.
-That's just the tip of the iceberg of complexity that Assembly languages deal
-with.
-
-SubX doesn't shield the programmer from these details. Words always contain
-the actual bits or bytes for machine code. But they also can contain metadata
-after slashes, and SubX will run cross-checks and give good error messages
-when there's a discrepancy between code and metadata.
-
-## But why not use an assembler?
-
-The long-term goal is to make programming in machine language ergonomic enough
-that I (or someone else) can build a compiler for a high-level language in it.
-That is, building a compiler without needing a compiler, anywhere among its
-prerequisites.
-
-Assemblers today are complex enough that they're built in a high-level
-language, and need a compiler to build. They also tend to be designed to fit
-into a larger toolchain, to be a back-end for a compiler. Their output is in
-turn often passed to other tools like a linker. The formats that all these
-tools use to talk to each other have grown increasingly complex in the face of
-decades of evolution, usage and backwards-compatibility constraints. All these
-considerations add to the burden of the assembler developer. Building the
-assembler in a high-level language helps face up to them.
-
-Assemblers _do_ often accept a far simpler language, just a file format
-really, variously called 'flat' or 'binary', which gives the programmer
-complete control over the precise bytes in an executable. SubX is basically
-trying to be a more ergonomic flat assembler that will one day be bootstrapped
-from machine code.
-
-## Why in the world?
-
-1. It seems wrong-headed that our computers look polished but are plagued by
-   foundational problems of security and reliability. I'd like to learn to
-   walk before I try to run. The plan: start out using the computer only to
-   check my program for errors rather than to hide low-level details. Force
-   myself to think about security by living with raw machine code for a while.
-   Reintroduce high level languages (HLLs) only after confidence is regained
-   in the foundations (and when the foundations are ergonomic enough to
-   support developing a compiler in them). Delegate only when I can verify
-   with confidence.
-
-2. The software in our computers has grown incomprehensible. Nobody
-   understands it all, not even experts. Even simple programs written by a
-   single author require lots of time for others to comprehend. Compilers are
-   a prime example, growing so complex that programmers have to choose to
-   either program them or use them. I think they may also contribute to the
-   incomprehensibility of the stack above them. I'd like to explore how much
-   of a HLL I can build without a monolithic optimizing compiler, and see if
-   deconstructing the work of the compiler can make the stack as a whole more
-   comprehensible to others.
-
-3. I want to learn about the internals of the infrastructure we all rely on in
-   our lives.
+Assembly is pretty low-level, but SubX makes Assembly look like the gleaming
+chrome of the Starship Enterprise. Opcodes for instructions are explicit, as
+are addressing modes and the precise bit fields used to encode them. There is
+no portability. Only a subset of x86 is supported, so there's no backwards
+compatibility either, zero interoperability with existing libraries. Only
+statically linked libraries are supported, so the kernel will inefficiently
+juggle multiple copies of the same libraries in RAM.
+
+In exchange for these drawbacks, SubX will hopefully be simpler to implement.
+Ideally in itself.
+
+I'm also hoping that SubX will be simpler to program in, that it will fit a
+programmer's head better in spite of the lack of syntax. Modern Assembly
+supports 50+ years of accretions in the x86 ISA and 40+ years of accumulated
+cruft in the toolchain (standard library, ELF format, binutils, linker,
+loader).
+
+You may say I just don't understand the toolchain well enough. And that's the
+point. I tried, and I failed. Each package above has only a piece of the
+puzzle. Learning each of the above tools takes time; figuring out how they all
+work together is not a well-supported activity.
+
+My hypothesis is that _it's easier to understand a coherent system written in
+machine code than an incoherent system in a high-level language._ To test this
+hypothesis, I plan to take a hatchet to [anything I don't understand](https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence),
+but to take full ownership of what's left. Not just how it runs, but the
+experience of programming with it. A few basic mechanisms can hopefully be put
+together into a more self-explanatory system:
+
+a) Metadata. In the example above, words after a slash (`/`) act as metadata
+that doesn't make it into the final binary. Metadata can act as comments for
+readers, or as directives for tools operating on SubX code. Programmers will
+be encouraged to create new tools of their own.
+
+b) Checks. While SubX doesn't provide syntax, it tries to provide good
+guardrails for invalid programs. Metadata specifies which field of an instruction
+each operand is intended for. Missing operands are caught before they can
+silently mislead instruction decoding. Instructions with unexpected operand
+types are immediately flagged. SubX includes an emulator for a subset of x86,
+which provides better error messages than native execution for certain kinds
+of bad binaries.
+
+c) A test harness. SubX includes automated tests from the start, and the
+entire stack is designed to be easy to test. We will provide wrappers for OS
+syscalls that allow fakes to be _dependency-injected_ in, expanding the kinds
+of tests that can be written. See [the earlier Mu interpreter](https://github.com/akkartik/mu#readme)
+for more details.
+
+d) Traces of execution. Writing good error messages for a compiler is a hard
+problem, and it can add complexity. We'd like to keep things ergonomic with a
+minimum of code, so we will provide a _trace browser_ that allows programmers
+to scan the trace of events emitted by SubX leading up to an error message,
+drilling down into details as needed. Traces will also be available in tests,
+enabling testing for cross-cutting concerns like performance, race conditions,
+precise error messages displayed on screen, and so on. The effect is again to
+expand the kinds of tests that can be written. [More details.](http://akkartik.name/about)
+
+e) Incremental construction. SubX programs are translated into monolithic ELF
+binaries, but you will be able to build just a subset of their code (denominated
+in _layers_), and get a running program that passes all its automated tests.
+[More details.](https://akkartik.name/post/wart-layers)
+
+It seems wrong-headed that our computers look polished but are plagued by
+foundational problems of security and reliability. I'd like to learn to walk
+before I try to run. The plan: start out using the computer only to check my
+program for errors rather than to hide low-level details. Force myself to
+think about security by living with raw machine code for a while. Reintroduce
+high level languages (HLLs) only after confidence is regained in the foundations
+(and when the foundations are ergonomic enough to support developing a
+compiler in them). Delegate only when I can verify with confidence.
 
 ## Running
 
@@ -107,31 +120,34 @@ Running `subx` will transparently compile it as necessary.
 
 Putting them together, build and run one of the example programs:
 
-<img alt='examples/ex1.1.subx' src='html/ex1.png'>
+<img alt='apps/factorial.subx' src='../html/subx/factorial.png'>
 
 ```
-$ ./subx translate examples/ex1.1.subx examples/ex1
-$ ./subx run examples/ex1
+$ ./subx translate apps/factorial.subx apps/factorial
+$ ./subx run apps/factorial # returns the factorial of 5
+$ echo $?
+120
 ```
 
-If you're running on Linux, `ex1` will also be runnable directly:
+If you're running on Linux, `factorial` will also be runnable directly:
 
 ```
-$ examples/ex1
+$ apps/factorial
 ```
 
-There are a few such example programs in the examples/ directory. At any
-commit an example's binary should be identical bit for bit with the output of
-translating the .subx file. The binary should also be natively runnable on a
-32-bit Linux system. If either of these invariants is broken it's a bug on my
-part. The binary should also be runnable on a 64-bit Linux system. I can't
-guarantee it, but I'd appreciate hearing if it doesn't run.
+The `examples/` directory shows some simpler programs giving a more gradual
+introduction to SubX features. The repo includes the binary for all examples.
+At any commit an example's binary should be identical bit for bit with the
+result of translating the .subx file. The binary should also be natively
+runnable on a 32-bit Linux system. If either of these invariants is broken
+it's a bug on my part. The binary should also be runnable on a 64-bit Linux
+system. I can't guarantee it, but I'd appreciate hearing if it doesn't run.
 
 However, not all 32-bit Linux binaries are guaranteed to be runnable by `subx`.
 I'm not building general infrastructure here for all of the x86 ISA and ELF
 format. SubX is about programming with a small, regular subset of 32-bit x86:
 
-* Only instructions that operate on the 32-bit E\*X registers. (No
+* Only instructions that operate on the 32-bit integer E\*X registers. (No
   floating-point yet.)
 * Only instructions that assume a flat address space; no instructions that use
   segment registers.
diff --git a/subx/html/ex1.png b/subx/html/ex1.png
deleted file mode 100644
index c491c471..00000000
--- a/subx/html/ex1.png
+++ /dev/null
Binary files differ
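
Note on the Readme hunk above: the new text says that words after a slash are metadata that never reaches the binary, and that metadata names the instruction field each operand is meant for. The following is only a rough Python sketch of that idea, not SubX's translator. The function names (`strip_metadata`, `field_of`, `encode`) are hypothetical, reading every value as hex is an assumption, and only the three fields seen in the example (`mod`, `rm32`, `r32`) plus `imm32` are handled. The ModR/M packing (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0) is standard x86.

```python
# Illustrative sketch only; the helper names below are hypothetical and this
# is not how the SubX translator itself is implemented.

def strip_metadata(word: str) -> str:
    """Return the part of a word before the first '/', i.e. the raw value."""
    return word.split("/", 1)[0]


def field_of(word: str) -> str:
    """Return the first metadata tag after the value, or '' if there is none."""
    parts = word.split("/")
    return parts[1] if len(parts) > 1 else ""


def encode(line: str) -> bytes:
    """Encode one instruction line: opcode, optional ModR/M byte, optional imm32.

    Assumption: every value is written in hex, and the first word is a
    one-byte opcode.
    """
    words = line.split()
    out = bytearray([int(strip_metadata(words[0]), 16)])
    modrm = {}        # collected ModR/M bit fields, keyed by metadata tag
    immediate = b""   # little-endian 32-bit immediate, if any
    for word in words[1:]:
        value = int(strip_metadata(word), 16)
        tag = field_of(word)
        if tag in ("mod", "rm32", "r32"):
            modrm[tag] = value
        elif tag == "imm32":
            immediate = value.to_bytes(4, "little")
        else:
            raise ValueError(f"operand {word!r} has no recognized field tag")
    if modrm:
        # A check in the spirit of the Readme: a partially specified ModR/M
        # byte is reported instead of being silently encoded.
        missing = {"mod", "rm32", "r32"} - modrm.keys()
        if missing:
            raise ValueError(f"missing operands: {sorted(missing)}")
        out.append(modrm["mod"] << 6 | modrm["r32"] << 3 | modrm["rm32"])
    return bytes(out + immediate)


if __name__ == "__main__":
    for line in ("01/add 3/mod/direct 3/rm32/ebx 1/r32/ecx",
                 "bb/copy-EBX 0/imm32",
                 "b9/copy-ECX 1/imm32"):
        print(encode(line).hex(" "))
```

Under these assumptions the first line packs into the two bytes `01 cb`, and each `copy` line becomes its opcode followed by a 4-byte little-endian immediate. Dropping the `1/r32/ecx` word raises an error rather than producing a shorter instruction.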