# Mu Syntax

Here are two valid statements in Mu:

```
increment x
y <- increment
```

Understanding when to use one vs the other is the critical idea in Mu. In
short, the former increments a value in memory, while the latter increments a
value in a register.
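
As a minimal sketch of the difference (the function name `example` and both variable definitions are illustrative, and use the `var` syntax described further down):

```
fn example {
  var x: int                 # x lives in memory (on the stack)
  increment x                # memory operand: passed as an inout, on the right

  var y/eax: int <- copy 0   # y lives in register eax
  y <- increment             # register operand: written as an output, on the left
}
```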

Most languages start from some syntax and do what it takes to implement it.
Mu, however, is designed as a safe[1] way to program in [a regular subset of
32-bit x86 machine code](subx.md), _satisficing_ rather than optimizing for a
clean syntax. To keep the mapping to machine code lightweight, Mu exclusively
uses statements. Most statements map to a single instruction of machine code.

[1] While it's designed to be memory-safe, and already performs many safety
checks, the Mu compiler is still a work in progress and can currently corrupt
memory just like C can. I estimate that it'll currently point out 90% of the
mistakes you make.

Since the x86 instruction set restricts how many memory locations an instruction
can use, Mu makes registers explicit as well. Variables must be explicitly
mapped to specific registers; otherwise they live in memory. While you have to
do your own register allocation, Mu will helpfully point out[2] when you get it
wrong.

[2] Again, there are some known issues here at the moment. I estimate that
it'll currently catch 95% of register allocation errors.

Statements consist of 3 parts: the operation, optional _inouts_ and optional
_outputs_. Outputs come before the operation name and `<-`.

Outputs are always registers; memory locations that need to be modified are
passed in by reference in inouts.

So Mu programmers need to make two new categories of decisions: whether to
define variables in registers or memory, and whether to put variables to the
left or right. There's always exactly one way to write any given operation. In
return for this overhead you get a lightweight and future-proof stack. And Mu
will provide good error messages to support you.

Further down, this page enumerates all available primitives in Mu, and [a
separate page](http://akkartik.github.io/mu/html/mu_instructions.html)
describes how each primitive is translated to machine code.

## Functions and calls

Zooming out from single statements, here's a complete sample program in Mu:

<img alt='ex2.mu' src='html/ex2.mu.png'>

Mu programs are lists of functions. Each function has the following form:

```
fn _name_ _inout_ ... -> _output_ ... {
  _statement_
  _statement_
  ...
}
```

Each function has a header line, and some number of statements, each on a
separate line. Headers describe inouts and outputs. Inouts can't be registers,
and outputs _must_ be registers. In the above example, the outputs of both
`do-add` and `main` have type `int` and are available in register `ebx` at the
end of the respective calls.

The above program also demonstrates a function call (to the function `do-add`).
Function calls look the same as primitive statements: they can return (multiple)
outputs in registers, and modify inouts passed in by reference. In addition,
there's one more constraint: output registers must match the function header.
For example:

```
fn f -> x/eax: int {
  ...
}
fn g {
  a/eax <- f  # ok
  a/ebx <- f  # wrong; `a` must be in register `eax`
}
```

You can exit a function at any time with the `return` instruction. Give it the
right number of arguments, and it'll assign them respectively to the function's
outputs before jumping back to the caller.
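
Here is a hedged sketch of an early `return`. The function name `larger-of` is hypothetical, and the sketch assumes `return` accepts a variable in memory as its argument:

```
fn larger-of a: int, b: int -> result/eax: int {
  var tmp/ecx: int <- copy a   # compare needs at least one operand in a register
  compare tmp, b
  {
    break-if-<                 # a < b: skip to the end of this block
    return a                   # otherwise a is the answer; exit right away
  }
  return b
}
```

Either `return` copies its single argument into `eax`, the register named in the header, before jumping back to the caller.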

The function `main` is special; it is where the program starts running. It
must always return a single int in register `ebx` (as the exit status of the
process). It can also optionally accept an array of strings as input (from the
shell command-line). To be precise, `main` must have one of the following
two signatures:

- `fn main -> x/ebx: int`
- `fn main args: (addr array (addr array byte)) -> x/ebx: int`

(The names of the inout and output are flexible.)

Mu encloses multi-word types in parentheses, and types can get quite expressive.
For example, you read `main`'s inout type as "an address to an array of
addresses to arrays of bytes." Since addresses to arrays of bytes are almost
always strings in Mu, you'll quickly learn to mentally shorten this type to
"an address to an array of strings".

## Blocks

Blocks are useful for grouping related statements. They're delimited by `{`
and `}`, each alone on a line.

Blocks can nest:

```
{
  _statements_
  {
    _more statements_
  }
}
```

Blocks can be named (with the name ending in a `:` on the same line as the
`{`):

```
$name: {
  _statements_
}
```

Further down we'll see primitive statements for skipping or repeating blocks.
Besides control flow, the other use for blocks is...

## Local variables

Functions can define new variables at any time with the keyword `var`. There
are two variants of the `var` statement, for defining variables in registers
or memory.

```
var name: type
var name/reg: type <- ...
```

Variables on the stack never take an initializer; they're always implicitly
zeroed out. Variables in registers must always be initialized explicitly.

Register variables can go in 6 integer registers: `eax`, `ebx`, `ecx`, `edx`,
`esi` and `edi`. Floating-point values can go in 8 other registers: `xmm0`,
`xmm1`, `xmm2`, `xmm3`, `xmm4`, `xmm5`, `xmm6` and `xmm7`.

Defining a variable in a register either clobbers the previous variable (if it
was defined in the same block) or shadows it temporarily (if it was defined in
an outer block).

Variables exist from their definition until the end of their containing block.
Register variables may also die earlier if their register is clobbered by a
new variable.
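
A small sketch of these lifetime rules, with illustrative names throughout:

```
fn example {
  var a/eax: int <- copy 1
  {
    var b/eax: int <- copy 2   # inner block: b shadows a for now
    b <- increment
  }                            # b dies here; a is visible again
  a <- increment
  var c/eax: int <- copy 3     # same block as a: c clobbers a, so a dies here
  c <- increment
}
```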

Variables on the stack can be of many types (but not `byte`). Integer registers
can only contain 32-bit values: `int`, `byte`, `boolean`, `(addr ...)`. Floating-point
registers can only contain values of type `float`.

## Integer primitives

Here is the list of arithmetic primitive operations supported by Mu. The name
`n` indicates a literal integer rather than a variable, and `var/reg` indicates
a variable in a register, though that's not always valid Mu syntax.

```
var/reg <- increment
increment var
var/reg <- decrement
decrement var
var1/reg1 <- add var2/reg2
var/reg <- add var2
add-to var1, var2/reg
var/reg <- add n
add-to var, n

var1/reg1 <- sub var2/reg2
var/reg <- sub var2
sub-from var1, var2/reg
var/reg <- sub n
sub-from var, n

var1/reg1 <- and var2/reg2
var/reg <- and var2
and-with var1, var2/reg
var/reg <- and n
and-with var, n

var1/reg1 <- or var2/reg2
var/reg <- or var2
or-with var1, var2/reg
var/reg <- or n
or-with var, n

var1/reg1 <- xor var2/reg2
var/reg <- xor var2
xor-with var1, var2/reg
var/reg <- xor n
xor-with var, n

var1/reg1 <- negate
negate var

var/reg <- copy var2/reg2
copy-to var1, var2/reg
var/reg <- copy var2
var/reg <- copy n
copy-to var, n

compare var1, var2/reg
compare var1/reg, var2
compare var/eax, n
compare var, n

var/reg <- shift-left n
var/reg <- shift-right n
var/reg <- shift-right-signed n
shift-left var, n
shift-right var, n
shift-right-signed var, n

var/reg <- multiply var2
```

Any statement above that takes a variable in memory can be replaced with a
dereference (`*`) of an address variable (of type `(addr ...)`) in a register.
(Types can have multiple words, and are wrapped in `()` when they do.) But you
can't dereference variables in memory. You have to load them into a register
first.

Excluding dereferences, the above statements must operate on non-address
values with primitive types: `int`, `boolean` or `byte`. (Booleans are really
just `int`s, and Mu assumes any value but `0` is true.) You can copy addresses
to int variables, but not the other way around.
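
To illustrate the dereference rule, here is a minimal sketch. It borrows the `address` operation described further down, and the names are illustrative:

```
fn example {
  var n: int                          # in memory, implicitly zeroed
  var p/eax: (addr int) <- address n  # the address must be in a register to dereference it
  increment *p                        # stands in for `increment n`
  var m/ecx: int <- copy *p           # stands in for `copy n`; m is now 1
}
```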

## Floating-point primitives

These instructions may use the floating-point registers `xmm0` ... `xmm7`
(denoted by `/xreg2` or `/xrm32`). They also use integer values on occasion
(`/rm32` and `/r32`). They can't take literal floating-point values.

```
var/xreg <- add var2/xreg2
var/xreg <- add var2
var/xreg <- add *var2/reg2

var/xreg <- subtract var2/xreg2
var/xreg <- subtract var2
var/xreg <- subtract *var2/reg2

var/xreg <- multiply var2/xreg2
var/xreg <- multiply var2
var/xreg <- multiply *var2/reg2

var/xreg <- divide var2/xreg2
var/xreg <- divide var2
var/xreg <- divide *var2/reg2

var/xreg <- reciprocal var2/xreg2
var/xreg <- reciprocal var2
var/xreg <- reciprocal *var2/reg2

var/xreg <- square-root var2/xreg2
var/xreg <- square-root var2
var/xreg <- square-root *var2/reg2

var/xreg <- inverse-square-root var2/xreg2
var/xreg <- inverse-square-root var2
var/xreg <- inverse-square-root *var2/reg2

var/xreg <- min var2/xreg2
var/xreg <- min var2
var/xreg <- min *var2/reg2

var/xreg <- max var2/xreg2
var/xreg <- max var2
var/xreg <- max *var2/reg2
```

Remember, when these instructions use indirect mode, they still use an integer
register. Floating-point registers can't hold addresses.

Two instructions in the above list are approximate. According to the Intel
manual, `reciprocal` and `inverse-square-root` [go off the rails around the
fourth decimal place](x86_approx.md). If you need more precision, use `divide`
separately.

Most instructions operate exclusively on integer or floating-point operands.
The only exceptions are the instructions for converting between integers and
floating-point numbers.

```
var/xreg <- convert var2/reg2
var/xreg <- convert var2
var/xreg <- convert *var2/reg2

var/reg <- convert var2/xreg2
var/reg <- convert var2
var/reg <- convert *var2/reg2

var/reg <- truncate var2/xreg2
var/reg <- truncate var2
var/reg <- truncate *var2/reg2
```

There are no instructions accepting floating-point literals. To obtain integer
literals in floating-point registers, copy them to general-purpose registers
and then convert them to floating-point.
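
For example, a more precise alternative to `reciprocal` might be sketched as follows. The function name and its inout `x` are hypothetical:

```
fn precise-reciprocal x: float {
  var one/eax: int <- copy 1             # integer literal into a general-purpose register
  var result/xmm0: float <- convert one  # result is now 1.0
  result <- divide x                     # result is now 1.0 / x
}
```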

One pattern you may have noticed is that the floating-point instructions above
always write to registers. The only exceptions are `copy` instructions, which
can write to memory locations.

```
var/xreg <- copy var2/xreg2
copy-to var1, var2/xreg
var/xreg <- copy var2
var/xreg <- copy *var2/reg2
```

Floating-point comparisons always put a register on the left-hand side:

```
compare var1/xreg1, var2/xreg2
compare var1/xreg1, var2
```

## Operating on individual bytes

A special case is variables of type `byte`. Mu is a 32-bit platform, so for
the most part it only supports types that are multiples of 32 bits. However,
we do want to support strings in ASCII and UTF-8, which will be arrays of
8-bit bytes.

Since most x86 instructions implicitly load 32 bits at a time from memory,
variables of type `byte` are only allowed in registers, not on the stack. Here
are the possible statements for copying bytes to and from memory:

```
var/reg <- copy-byte var2/reg2      # var: byte, var2: byte
var/reg <- copy-byte *var2/reg2     # var: byte, var2: (addr byte)
copy-byte-to *var1/reg1, var2/reg2  # var1: (addr byte), var2: byte
```

In addition, variables of type `byte` are restricted to (the lowest bytes of)
just 4 registers: `eax`, `ecx`, `edx` and `ebx`. As always, this is due to
constraints of the x86 instruction set.
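
Putting these statements together, a hedged sketch of copying one byte between two memory locations. The function name is hypothetical, and the sketch assumes `copy` can load an address inout into a register:

```
fn copy-first-byte src: (addr byte), dest: (addr byte) {
  var s/esi: (addr byte) <- copy src   # addresses can go in any integer register
  var d/edi: (addr byte) <- copy dest
  var b/eax: byte <- copy-byte *s      # bytes only in eax, ecx, edx or ebx
  copy-byte-to *d, b
}
```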

## Primitive jumps

There are two kinds of jumps, both with many variations: `break` and `loop`.
`break` instructions jump to the end of the containing block. `loop` instructions
jump to the beginning of the containing block.

All jumps can take an optional label starting with `$`:

```
loop $foo
```

This instruction jumps to the beginning of the block called `$foo`. The
corresponding `break` jumps to the end of the block. Either jump statement
must lie somewhere inside such a block. Jumps are only legal to containing
blocks. (Use named blocks with restraint; jumps to places far away can get
confusing.)

There are two unconditional jumps, each with an optional label:

```
loop
loop label
break
break label
```

The remaining jump instructions are all conditional. Conditional jumps rely on
the result of the most recently executed `compare` instruction. (To keep
programs easy to read, keep compare instructions close to the jump that uses
them.)

```
break-if-=
break-if-= label
break-if-!=
break-if-!= label
```

Inequalities are similar, but have additional variants for addresses and floats.

```
break-if-<
break-if-< label
break-if->
break-if-> label
break-if-<=
break-if-<= label
break-if->=
break-if->= label

break-if-addr<
break-if-addr< label
break-if-addr>
break-if-addr> label
break-if-addr<=
break-if-addr<= label
break-if-addr>=
break-if-addr>= label

break-if-float<
break-if-float< label
break-if-float>
break-if-float> label
break-if-float<=
break-if-float<= label
break-if-float>=
break-if-float>= label
```

Similarly, conditional loops:

```
loop-if-=
loop-if-= label
loop-if-!=
loop-if-!= label

loop-if-<
loop-if-< label
loop-if->
loop-if-> label
loop-if-<=
loop-if-<= label
loop-if->=
loop-if->= label

loop-if-addr<
loop-if-addr< label
loop-if-addr>
loop-if-addr> label
loop-if-addr<=
loop-if-addr<= label
loop-if-addr>=
loop-if-addr>= label

loop-if-float<
loop-if-float< label
loop-if-float>
loop-if-float> label
loop-if-float<=
loop-if-float<= label
loop-if-float>=
loop-if-float>= label
```
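
As a minimal sketch of how `compare`, `break` and `loop` combine into a counting loop (the function name is illustrative):

```
fn count-to-ten {
  var i/eax: int <- copy 0
  {
    compare i, 10
    break-if->=      # i >= 10: jump to the end of the block
    i <- increment
    loop             # jump back to the beginning of the block
  }
}
```

The `compare` sits directly before the conditional jump that uses it, and the unconditional `loop` at the end is what makes the block repeat.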

## Addresses

Passing objects by reference requires the `address` operation, which returns
an object of type `addr`.

```
var/reg: (addr T) <- address var2: T
```

Here `var2` can't live in a register.

## Array operations

Mu arrays are size-prefixed so that operations on them can check bounds as
necessary at run-time. The `length` statement returns the number of elements
in an array.

```
var/reg: int <- length arr/reg: (addr array T)
```

The `index` statement takes an `addr` to an `array` and returns an `addr` to
one of its elements, which can then be read from or written to.

```
var/reg: (addr T) <- index arr/reg: (addr array T), n
var/reg: (addr T) <- index arr: (array T sz), n
```

The index can also be a variable in a register, with a caveat:

```
var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int
var/reg: (addr T) <- index arr: (array T sz), idx/reg: int
```

The caveat: the size of T must be 1, 2, 4 or 8 bytes. The x86 instruction set
has complex addressing modes that can index into an array in a single instruction
in these situations.

For types in general you'll need to split up the work, performing a `compute-offset`
before the `index`.

```
var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int     # arr can be in reg or mem
var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int         # arr can be in reg or mem
```

The `compute-offset` statement returns a value of type `(offset T)` after
performing any necessary bounds checking. Now the offset can be passed to
`index` as usual:

```
var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T)
```
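
A hedged sketch of both paths. The type `point3` and all function and variable names are hypothetical, and the sketch assumes `copy` can load an address inout into a register:

```
type point3 {
  x: int
  y: int
  z: int
}

fn example a: (addr array int), pts: (addr array point3) {
  var arr/esi: (addr array int) <- copy a
  var len/ecx: int <- length arr              # number of elements
  var first/eax: (addr int) <- index arr, 0   # literal index
  var i/edx: int <- copy 1
  var second/eax: (addr int) <- index arr, i  # int is 4 bytes, so a register index works
  var parr/edi: (addr array point3) <- copy pts
  var off/ebx: (offset point3) <- compute-offset parr, i  # point3 is 12 bytes
  var p/eax: (addr point3) <- index parr, off
}
```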

## Compound types

Primitive types can be combined together using the `type` keyword. For
example:

```
type point {
  x: int
  y: int
}
```

Mu programs are currently sequences of `fn` and `type` definitions.

Compound types can't include `addr` types for safety (use `handle` instead,
which is described below). They also can't currently include `array`, `stream`
or `byte` types. Since arrays and streams carry their size with them, supporting
them in compound types complicates variable initialization. Instead of
defining them inline in a type definition, define a `handle` to them. Bytes
shouldn't be used for anything but utf-8 strings.

To access within a compound type, use the `get` instruction. There are two
forms. You need either a variable of the type itself (say `T`) in memory, or a
variable of type `(addr T)` in a register.

```
var/reg: (addr T_f) <- get var/reg: (addr T), f
var/reg: (addr T_f) <- get var: T, f
```

The `f` here is the field name from the `type` definition, and its type `T_f`
must match the type of `f` in the `type` definition. For example, some legal
instructions for the definition of `point` above:

```
var a/eax: (addr int) <- get p, x
var a/eax: (addr int) <- get p, y
```
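
Since `get` returns an address, reading or writing a field takes a dereference. A minimal sketch, assuming the `point` type defined above and illustrative variable names:

```
fn example {
  var p: point                        # in memory, implicitly zeroed
  var px/eax: (addr int) <- get p, x
  copy-to *px, 3                      # p.x = 3
  var py/ecx: (addr int) <- get p, y
  copy-to *py, 4                      # p.y = 4
  var sum/edx: int <- copy *px
  sum <- add *py                      # sum is now 7
}
```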

## Handles for safe access to the heap

We've seen the `addr` type, but it's intended to be short-lived. `addr` values
should never escape from functions. In particular, you can't save `addr` values
inside compound `type`s. To do that you need a "fat pointer" called a `handle`.
A handle is safe to keep around for extended periods, and it ensures the
memory it points to is used safely, without corrupting the heap and causing
security issues or hard-to-debug misbehavior.

To actually _use_ a `handle`, we have to turn it into an `addr` first using
the `lookup` statement.

```
var y/reg: (addr T) <- lookup x
```

Now operate on the `addr` as usual, safe in the knowledge that you can later
recover any writes to its payload from `x`.

It's illegal to continue to use this `addr` after a function that reclaims
heap memory. You have to repeat the lookup from the `handle`. (Luckily Mu
doesn't implement reclamation yet.)

Having two kinds of addresses takes some getting used to. Do we pass in
variables by value, by `addr` or by `handle`? In inouts or outputs? Here are 3
rules of thumb:

  * Functions that need to look at the payload should accept an `(addr ...)`.
  * Functions that need to treat a handle as a value, without looking at its
  payload, should accept a `(handle ...)`. Helpers that save handles into data
  structures are a common example.
  * Functions that need to allocate memory should accept an `(addr handle
  ...)`.

Try to avoid mixing these use cases.

If you have a variable `src` of type `(handle ...)`, you can save it inside a
compound type like this (provided the types match):

```
var dest/reg: (addr handle T_f) <- get var: (addr T), f
copy-handle src, dest
```

Or this:

```
var dest/reg: (addr handle T) <- index arr: (addr array handle T), n
copy-handle src, dest
```

To create handles to non-array types, use `allocate`:

```
var x: (addr handle T)
... initialize x ...
allocate x
```
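
An end-to-end sketch that allocates a `point`, looks it up and writes a field. It assumes the `point` type defined above, and it assumes the output of `lookup` must go in `eax`; the names are illustrative:

```
fn example {
  var h: (handle point)                         # the handle itself lives on the stack
  var ah/ecx: (addr handle point) <- address h
  allocate ah                                   # h now refers to a new point on the heap
  var p/eax: (addr point) <- lookup h           # turn the handle into a short-lived addr
  var px/edx: (addr int) <- get p, x
  copy-to *px, 3
}
```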

To create handles to array types, use `populate`:

```
var x: (addr handle array T)
... initialize x ...
populate x, 3  # array of 3 T's
```

You can copy handles to another variable on the stack like this:

```
var x: (handle T)
# ..some code initializing x..
var y/eax: (addr handle T) <- address ...
copy-handle x, y
```

## Seams

I said at the start that most instructions map 1:1 to x86 machine code. To
enforce type- and memory-safety, I was forced to carve out a few exceptions:

* the `index` instruction on arrays, for bounds-checking (not yet implemented)
* the `length` instruction on arrays, for translating the array size in bytes
  into the number of elements.
* the `lookup` instruction on handles, for validating fat-pointer metadata
* `var` instructions, for initializing memory

If you're curious, [the compiler summary page](http://akkartik.github.io/mu/html/mu_instructions.html)
has the complete nitty-gritty on how each instruction is implemented,
including the above exceptions.

## Conclusion

Anything not allowed here is forbidden, even if the compiler doesn't currently
detect and complain about it. Please [contact me](mailto:ak@akkartik.com) or
[report issues](https://github.com/akkartik/mu/issues) when you encounter a
missing or misleading error message. Thank you for bearing with the dust! I'm
here for the long haul, and everything will be clean and checked in due time.