about summary refs log tree commit diff stats
path: root/mu.md
blob: cfecadf0d22a7da35ae67ad06b7d8832df49d730 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
# Mu Reference

Mu programs are sequences of `fn` and `type` definitions.

## Functions

Define functions with the `fn` keyword. For example:

```
  fn foo arg1: int, arg2: int -> _/eax: boolean
```

Functions contain `{}` blocks, `var` declarations, primitive statements and
calls to other functions. Only `{}` blocks can nest. Primitive statements and
function calls look similar:

```
  out1, out2, out3, ... <- operation inout1, inout2, inout3, ...
```

They can take any number of inouts and outputs, including 0. Statements
with 0 outputs also drop the `<-`.

Inouts can be either variables in memory, variables in registers, or
literal constants (either integers or string literals that get replaced with
constant addresses). Outputs are always variables in registers.

Inouts in memory can be either inputs or outputs (if they're addresses being
written to). Hence the name.

Primitives usually require their inouts to be in specific combinations of
memory and register. They can't handle more than one inout in memory.
User-defined functions are flexible.

Primitives can often write to arbitrary output registers. User-defined
functions, however, require rigidly specified output registers. Notice how the
definition of `foo` above writes to `eax`.

A function's header names its inouts, but not its outputs (hence the `_`
above).

All Mu programs must define at least one function: `main`. That's where they
begin executing instructions.

## Variables, registers and memory

Declare local variables in a function using the `var` keyword. As shown above,
all variable declarations should specify a type after a `:`.

You can declare local variables in either registers or memory (the _stack_
segment). So a `var` statement has two forms:
  - Living in a register, e.g. `var x/eax: int <- copy 0` defines `x` which
    lives in `eax`.
  - Living in memory, e.g. `var x: int` defines `x` on the stack.

Variables in registers must be initialized. Variables on the stack are
implicitly zeroed out.

Variables exist only within the `{}` block they're defined in. Space allocated
to them on the stack is reclaimed after execution leaves the block. Registers
restore whatever variable was using them in the outer block.

It is perfectly ok to reuse a register for a new variable. Even in a single
block (though you permanently lose the old variable then).

Variables can be in six 32-bit _general-purpose_ registers of the x86 processor.
  - eax
  - ebx
  - ecx
  - edx
  - esi ('s' often a mnemonic for 'source')
  - edi ('d' often a mnemonic for 'destination')

Most functions return results in `eax` by convention. In practice, it ends up
churning through variables pretty quickly.

You can store several types in these registers:
  - int
  - boolean
  - (addr T) (address into memory)
  - byte (uses only 8 bits)
  - code-point (Unicode)
  - grapheme (code-point encoded in UTF-8)

There's one 32-bit type you _cannot_ store in these registers:
  - float

It instead uses eight separate 32-bit registers: xmm0, xmm1, ..., xmm7

Types that require more than 32 bits (4 bytes) cannot be stored in registers:
  - (array T)
  - (handle T)
  - (stream T)
  - slice
  - any compound types you define using the `type` keyword

`T` here can be any type, including combinations of types. For example:
  - (array int) -- an array of ints
  - (addr int) -- an address to an int
  - (addr array byte) -- an address to an array of bytes, useful for Unicode
    text strings
  - (handle int) -- a handle to an int
  - (addr handle int) -- an address to a handle to int
  - (addr array handle int) -- an address to an array of handles to ints
  - ...and so on.

Other miscellaneous restrictions:
  - `byte` variables must be either in registers or on the heap, never local
    variables on the stack.
  - `addr` variables can never "escape" a function either by being returned or
    by being written to a memory location. When you need that sort of thing,
    use a `handle` instead.

## Operations on simple types

We'll now survey a long list of statement forms that operate on 32-bit types.
Most of these are primitives, but some are also implemented as functions
(which have slightly different rules as mentioned up top). Most instructions
with multiple args require types to match. Various operations have other
restrictions which we'll note below, using the following notation:
  - `var/reg` indicates a variable in some register. Where we require a
    variable in a specific register, we'll mention it explicitly. E.g.
    `var/eax`.
  - `var/xreg` indicates a variable in some floating-point register `xmm_`.
  - `var` without a `reg` indicates either a variable on the stack, or
    dereferencing a variable in a (non-floating-point) register: `*var/reg`.
  - `var: type` indicates a variable that must satisfy some type constraint.
  - `n` indicates a literal integer. There are no floating-point literals.

### Moving values around

These instructions work with variables of any 32-bit type except `byte` and
`float`.

```
  var/reg <- copy var2/reg2
  copy-to var1, var2/reg
  var/reg <- copy var2
  var/reg <- copy n
  copy-to var, n
```

Byte variables have their own instructions:

```
  var/reg <- copy-byte var2/reg2
  var/reg <- copy-byte *var2/reg2     # var2 must have type (addr byte)
  copy-byte-to *var1/reg1, var2/reg2  # var1 must have type (addr byte)
```

Floating point variables can be copied as well, but only to or from
floating-point registers `xmm_`.

```
  var/xreg <- copy var2/xreg2
  copy-to var1, var2/xreg
  var/xreg <- copy var2
  var/xreg <- copy *var2/reg2         # var2 must have type (addr byte) and live in a general-purpose register
```

There's no way to copy a literal to a floating-point register. However,
there's a few ways to convert non-float values in general-purpose registers.

```
  var/xreg <- convert var2/reg2
  var/xreg <- convert var2
  var/xreg <- convert *var2/reg2
```

Correspondingly, there are ways to convert floats into integers, with and
without rounding.

```
  var/reg <- convert var2/xreg2
  var/reg <- convert var2
  var/reg <- convert *var2/reg2

  var/reg <- truncate var2/xreg2
  var/reg <- truncate var2
  var/reg <- truncate *var2/reg2
```

Still, the absence of fractional literals is an annoyance. Mu provides some
helpers to mitigate it somewhat:

```
  result/xmm0 <- rational numerator: int, denominator: int
  fill-in-rational out: (addr float), numerator: int, denominator: int
```

These are functions, so the inouts have fewer restrictions while the outputs
have more. The inouts can be registers, or memory, or even literals. The
output for `rational` _must_ be in register `xmm0`.

### Comparing values

Work with variables of any 32-bit type. `addr` variables can only be compared
to 0.

```
  compare var1, var2/reg
  compare var1/reg, var2
  compare var/eax, n
  compare var, n
```

Floating-point numbers cannot be compared to literals, and the register must
come first.

```
  compare var1/xreg1, var2/xreg2
  compare var1/xreg1, var2
```

### Branches

Immediately after a `compare` instruction you can branch on its result. For
example:

```
  break-if-=
```

This instruction will jump to after the enclosing `{}` block if the previous
`compare` detected equality. Here's the list of conditional and unconditional
`break` instructions:

```
  break
  break-if-=
  break-if-!=
  break-if-<
  break-if->
  break-if-<=
  break-if->=
```

Similarly, you can jump back to the start of the enclosing `{}` block with
`loop`. Here's the list of `loop` instructions.

```
  loop
  loop-if-=
  loop-if-!=
  loop-if-<
  loop-if->
  loop-if-<=
  loop-if->=
```

Additionally, there are special variants for comparing `addr` and `float`
values, which results in the following comprehensive list of jumps:

```
  break
  break-if-=
  break-if-!=
  break-if-<    break-if-addr<    break-if-float<
  break-if->    break-if-addr>    break-if-float>
  break-if-<=   break-if-addr<=   break-if-float<=
  break-if->=   break-if-addr>=   break-if-float>=

  loop
  loop-if-=
  loop-if-!=
  loop-if-<     loop-if-addr<     loop-if-float<
  loop-if->     loop-if-addr>     loop-if-float>
  loop-if-<=    loop-if-addr<=    loop-if-float<=
  loop-if->=    loop-if-addr>=    loop-if-float>=
```

One final property all these jump instructions share: they can take an
optional block name to jump to. For example:

```
  a: {
    ...
    break a     #----------|
    ...               #    |
  }                   # <--|


  a: {                # <--|
    ...               #    |
    b: {              #    |
      ...             #    |
      loop a    #----------|
      ...
    }
    ...
  }
```

However, there's no way to jump to a block that doesn't contain the `loop` or
`break` instruction.

### Integer arithmetic

These instructions require variables of non-`addr`, non-`float` types.

Add:
```
  var1/reg1 <- add var2/reg2
  var1/reg <- add var2
  add-to var1, var2/reg                 # var1 += var2
  var/reg <- add n
  add-to var, n
```

Subtract:
```
  var1/reg1 <- subtract var2/reg2
  var1/reg <- subtract var2
  subtract-from var1, var2/reg          # var1 -= var2
  var/reg <- subtract n
  subtract-from var, n
```

Add one:
```
  var/reg <- increment
  increment var
```

Subtract one:
```
  var/reg <- decrement
  decrement var
```

Multiply:
```
  var1/reg <- multiply var2
```

The result of a multiply must be a register.

Negate:
```
  var/reg1 <- negate
  negate var
```

### Fractional arithmetic

Operations on `float` variables include a few we've seen before and some new
ones. Notice here that we mostly use floating-point registers `xmm_`, but
still use the general-purpose registers when dereferencing variables of type
`(addr float)`.

```
  var1/xreg <- add var2/xreg2
  var1/xreg <- add var2
  var1/xreg <- add *var2/reg2

  var1/xreg <- subtract var2/xreg2
  var1/xreg <- subtract var2
  var1/xreg <- subtract *var2/reg2

  var1/xreg <- multiply var2/xreg2
  var1/xreg <- multiply var2
  var1/xreg <- multiply *var2/reg2

  var1/xreg <- divide var2/xreg2
  var1/xreg <- divide var2
  var1/xreg <- divide *var2/reg2

  var1/xreg <- reciprocal var2/xreg2
  var1/xreg <- reciprocal var2
  var1/xreg <- reciprocal *var2/reg2

  var1/xreg <- square-root var2/xreg2
  var1/xreg <- square-root var2
  var1/xreg <- square-root *var2/reg2

  var1/xreg <- inverse-square-root var2/xreg2
  var1/xreg <- inverse-square-root var2
  var1/xreg <- inverse-square-root *var2/reg2

  var1/xreg <- min var2/xreg2
  var1/xreg <- min var2
  var1/xreg <- min *var2/reg2

  var1/xreg <- max var2/xreg2
  var1/xreg <- max var2
  var1/xreg <- max *var2/reg2
```

Two instructions in the above list are approximate. According to the Intel
manual, `reciprocal` and `inverse-square-root` [go off the rails around the
fourth decimal place](linux/x86_approx.md). If you need more precision, use
`divide` separately.

### Bitwise boolean operations

These require variables of non-`addr`, non-`float` types.

And:
```
  var1/reg1 <- and var2/reg2
  var1/reg <- and var2
  and-with var1, var2/reg
  var/reg <- and n
  and-with var, n
```

Or:
```
  var1/reg1 <- or var2/reg2
  var1/reg <- or var2
  or-with var1, var2/reg
  var/reg <- or n
  or-with var, n
```

Not:
```
  var1/reg1 <- not
  not var
```

Xor:
```
  var1/reg1 <- xor var2/reg2
  var1/reg <- xor var2
  xor-with var1, var2/reg
  var/reg <- xor n
  xor-with var, n
```

### Bitwise shifts

Shifts require variables of non-`addr`, non-`float` types.

```
  var/reg <- shift-left n
  var/reg <- shift-right n
  var/reg <- shift-right-signed n
  shift-left var, n
  shift-right var, n
  shift-right-signed var, n
```

Shifting bits left always inserts zeros on the right.
Shifting bits right inserts zeros on the left by default.
A _signed_ shift right duplicates the leftmost bit, thereby preserving the
sign of an integer.

## Operations on more complex types

These instructions work with any type `T`. As before we use `/reg` here to
indicate when a variable must live in a register. We also include type
constraints after a `:`.

### Addresses and handles

You can compute the address of any variable in memory (never in registers):

```
  var1/reg: (addr T) <- address var2: T
```

As mentioned up top, `addr` variables can never escape the function where
they're computed. You can't store them on the heap, or in compound types.
Think of them as short-lived things.

To manage long-lived addresses, _allocate_ them on the heap.

```
  allocate var: (addr handle T)       # var can be in either register or memory
```

Handles can be copied and stored without restriction. However, they're too
large to fit in a register. You also can't access their payload directly, you
have to first convert them into a short-lived `addr` using _lookup_.

```
  var1/eax: (addr T) <- lookup var2: (handle T)  # var2 in either register or memory
```

Since handles are large compound types, there's a special helper for comparing
them:

```
  var/eax: boolean <- handle-equal? var1: (handle T), var2: (handle T)
```

### Arrays

Arrays are declared in two ways:
  1. On the stack with a literal size:
```
  var x: (array int 3)
```
  2. On the heap with a potentially variable size. For example:
```
  var x: (handle array int)
  var x-ah/eax: (addr handle array int) <- address x
  populate x-ah, 8
```

The `8` here can also be an int in a register or memory. (The `-ah` is a
common variable naming convention and stands for "address of a handle".
Essential for allocating long-lived data on the heap.)

You can compute the length of an array, though you'll need an `addr` to do so:

```
  var/reg: int <- length arr/reg: (addr array T)
```

To read from or write to an array, use `index` to first obtain an address to
read from or modify:

```
  var/reg: (addr T) <- index arr/reg: (addr array T), n
  var/reg: (addr T) <- index arr: (array T len), n
```

Like our notation of `n`, `len` here is required to be a literal.

The index requested can also be a variable in a register, with one caveat:

```
  var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int
  var/reg: (addr T) <- index arr: (array T len), idx/reg: int
```

The caveat: the size of T must be 1, 2, 4 or 8 bytes. For other sizes of T
you'll need to split up the work, performing a `compute-offset` before the
`index`.

```
  var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int     # arr can be in reg or mem
  var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int         # arr can be in reg or mem
```

The result of a `compute-offset` statement can be passed to `index`:

```
  var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T)
```

### Streams

A common use for arrays is as buffers. Save a few items to a scratch space and
then process them. This pattern is so common (we use it in files) that there's
special support for it with a built-in type: `stream`.

Streams are like arrays in many ways. You can initialize them with a length on
the stack:

```
  var x: (stream int 3)
```

You can also populate them on the heap:
```
  var x: (handle stream int)
  var x-ah/eax: (addr handle stream int) <- address x
  populate-stream x-ah, 8
```

However, streams don't provide random access with an `index` instruction.
Instead, you write to them sequentially, and read back what you wrote.

```
  read-from-stream s: (addr stream T), out: (addr T)
  write-to-stream s: (addr stream T), in: (addr T)
```

Streams of bytes are particularly common for managing Unicode text, and there
are a few functions to help with them:

```
  write s: (addr stream byte), u: (addr array byte)  # write u to s, abort if full
  overflow?/eax: boolean <- try-write s: (addr stream byte), u: (addr array byte)
  write-stream dest: (addr stream byte), src: (addr stream byte)
  # bytes
  append-byte s: (addr stream byte), var: int  # write lower byte of var
  var/eax: byte <- read-byte s: (addr stream byte)
  # 32-bit graphemes encoded in UTF-8
  write-grapheme out: (addr stream byte), g: grapheme
  g/eax: grapheme <- read-grapheme in: (addr stream byte)
```

You can check if a stream is empty or full:

```
  var/eax: boolean <- stream-empty? s: (addr stream)
  var/eax: boolean <- stream-full? s: (addr stream)
```

You can clear streams:

```
  clear-stream f: (addr stream T)
```

You can also rewind them to reread their contents:

```
  rewind-stream f: (addr stream T)
```

## Compound types

Primitive types can be combined together using the `type` keyword. For
example:

```
type point {
  x: int
  y: int
}
```

Mu programs are sequences of just `fn` and `type` definitions.

Compound types can't include `addr` types for safety reasons (use `handle` instead,
which is described below). They also can't currently include `array`, `stream`
or `byte` types. Since arrays and streams carry their size with them, supporting
them in compound types complicates variable initialization. Instead of
defining them inline in a type definition, define a `handle` to them. Bytes
shouldn't be used for anything but UTF-8 strings.

To access within a compound type, use the `get` instruction. There are two
forms. You need either a variable of the type itself (say `T`) in memory, or a
variable of type `(addr T)` in a register.

```
var/reg: (addr T_f) <- get var/reg: (addr T), f
var/reg: (addr T_f) <- get var: T, f
```

The `f` here is the field name from the `type` definition, and its type `T_f`
must match the type of `f` in the `type` definition. For example, some legal
instructions for the definition of `point` above:

```
var a/eax: (addr int) <- get p, x
var a/eax: (addr int) <- get p, y
```

You can clear compound types using the `clear-object` function:

```
clear-object var: (addr T)
```

You can shallow-copy compound types using the `copy-object` function:

```
copy-object src: (addr T), dest: (addr T)
```