# How approximate is Intel's floating-point reciprocal instruction?

2020/10/03

Here's a test Mu program that computes 1/2 using `reciprocal` and prints out
the bits of the result, which should be the bits for 0.5:

  ```
  fn main -> r/ebx: int {
    var two/eax: int <- copy 2
    var half/xmm0: float <- convert two
    half <- reciprocal half
    var mem: float
    copy-to mem, half
    var out/eax: int <- reinterpret mem
    print-int32-hex 0, out
    print-string 0, "\n"
    r <- copy 0
  }
  ```

It gives different results when emulated and run natively:

  ```
  $ cd linux
  $ ./translate_debug x.mu  # debug mode = error checking
  $ bootstrap/bootstrap run a.elf
  0x3f000000  # correct
  $ ./a.elf
  0x3efff000  # wrong
  ```
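
To see what those two hex values mean as numbers, the bit patterns can be
decoded outside Mu. The following is a minimal C sketch; the helper name
`bits_to_float` is made up for illustration, and it assumes `float` is a
32-bit IEEE 754 value, as it is on x86.

  ```c
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Reinterpret a 32-bit pattern as a single-precision float.
     Assumes float is 32-bit IEEE 754 (true on x86). */
  static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);  /* well-defined, unlike a pointer cast */
    return f;
  }

  int main(void) {
    printf("%.13f\n", bits_to_float(0x3f000000u));  /* emulated result */
    printf("%.13f\n", bits_to_float(0x3efff000u));  /* native result */
    return 0;
  }
  ```

The first pattern is exactly 0.5. The second decodes to 0.4998779296875, which
is 0.5 - 2^-13.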

I spent some time digging into this before I realized it wasn't a bug in Mu,
just an artifact of the emulator not actually using the `reciprocal` instruction.
Here's a procedure you can follow along with to convince yourself.

Start with this program (good.c):

  ```c
  #include<stdio.h>
  int main(void) {
    int n = 2;
    float f = 1.0/n;
    printf("%f\n", f);
    return 0;
  }
  ```

It works as you'd expect (compiled without optimization, so that the division
is actually computed at run time):

  ```
  $ gcc good.c
  $ ./a.out
  0.500000
  ```

Let's look at its assembly:

  ```
  $ gcc -S good.c
  ```

The generated `good.s` has a lot of stuff that doesn't interest us, surrounding
these lines:

  ```asm
                        # destination
  movl      $2,         -8(%rbp)
  cvtsi2sd  -8(%rbp),   %xmm0
  movsd     .LC0(%rip), %xmm1
  divsd     %xmm0,      %xmm1
  movapd    %xmm1,      %xmm0
  ```

This fragment converts `2` into double-precision floating-point and then
divides 1.0 (the constant `.LC0`) by it, leaving the result in register `xmm0`.

There's a way to get gcc to emit the `rcpss` instruction using intrinsics, but
I don't know how to do it, so I'll modify the generated assembly directly:

  ```diff
        movl      $2,         -8(%rbp)
  <     cvtsi2sd  -8(%rbp),   %xmm0
  <     movsd     .LC0(%rip), %xmm1
  <     divsd     %xmm0,      %xmm1
  <     movapd    %xmm1,      %xmm0
  ---
  >     cvtsi2ss  -8(%rbp),   %xmm0
  >     rcpss     %xmm0,      %xmm0
  >     movss     %xmm0,      -4(%rbp)
  ```

Let's compare the result of both versions:

  ```
  $ gcc good.s
  $ ./a.out
  0.500000
  $ gcc good.modified.s
  $ ./a.out
  0.499878
  ```

Whoa!

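Reading the Intel manual more closely, I found that it only guarantees a
relative error for `rcpss` of at most `1.5*2^-12`. Here the result is off by
about 0.000122, a relative error of `2^-12`, which is within that bound. And
indeed 12 bits (about 3.6 decimal digits) puts us squarely in the fourth
decimal place.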
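
For reference, the intrinsic route mentioned earlier can be sketched as
follows. This is a hedged example rather than what was used above: it assumes
gcc or clang on x86-64 (where SSE is on by default) and uses `_mm_rcp_ss` from
`<xmmintrin.h>`. The helper names `approx_reciprocal` and `relative_error` are
made up for illustration. It prints the approximation and its relative error
next to the documented bound.

  ```c
  #include <stdio.h>
  #include <xmmintrin.h>  /* SSE intrinsics, including _mm_rcp_ss */

  /* Approximate 1/x via the _mm_rcp_ss intrinsic (expected to emit rcpss). */
  static float approx_reciprocal(float x) {
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
  }

  /* Relative error of the approximation against an exact division. */
  static double relative_error(float x) {
    double exact = 1.0 / x;
    double err = (approx_reciprocal(x) - exact) / exact;
    return err < 0 ? -err : err;
  }

  int main(void) {
    const double bound = 1.5 / 4096.0;  /* 1.5 * 2^-12 from the Intel manual */
    float x = 2.0f;
    printf("approx 1/%g = %f\n", x, approx_reciprocal(x));
    printf("relative error = %g (documented bound %g)\n",
           relative_error(x), bound);
    return 0;
  }
  ```

The exact digits may differ between processors, since the manual only bounds
the error. The comparison against the bound is the part that should hold
everywhere.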

Among the x86 instructions Mu supports, two are described in the Intel manual
as "approximate": `reciprocal` (`rcpss`) and `inverse-square-root` (`rsqrtss`).
Intel introduced these instructions as part of its SSE extension in 1999. When
it upgraded SSE to SSE2 (in 2000), most of its scalar[1] single-precision
floating-point instructions got upgraded to double-precision, but not these
two. So they seem to be an evolutionary dead-end.

[1] Thanks boulos for feedback: https://news.ycombinator.com/item?id=28501429#28507118