6941 - detective story for the day

author: Kartik Agaram <vc@akkartik.com> 2020-10-03 23:17:17 -0700
committer: Kartik Agaram <vc@akkartik.com> 2020-10-03 23:17:17 -0700
commit: faeeef34f549ad00ede2136ffabdee9368f02dbe (patch)
tree: b349bbf5a832ead2b6252a29ba771115fe1f897f
parent: 2f5a8496e64c063796ea5daf6381701d7cd873f1 (diff)
download: mu-faeeef34f549ad00ede2136ffabdee9368f02dbe.tar.gz
2 files changed, 119 insertions, 0 deletions
diff --git a/mu.md b/mu.md
index 6892df9d..c3b456bf 100644
--- a/mu.md
+++ b/mu.md
@@ -263,6 +263,11 @@ var/xreg <- max *var2/reg2
 Remember, when these instructions use indirect mode, they still use an integer
 register. Floating-point registers can't hold addresses.
 
+Two instructions in the above list are approximate. According to the Intel
+Manual, `reciprocal` and `inverse-square-root` [go off the rails around the
+fourth decimal place](x86_approx.md). If you need more precision, use `divide`
+separately.
+
 Most instructions operate exclusively on integer or floating-point operands.
 The only exceptions are the instructions for converting between integers and
 floating-point numbers.
diff --git a/x86_approx.md b/x86_approx.md
new file mode 100644
index 00000000..d26f1bdd
--- /dev/null
+++ b/x86_approx.md
@@ -0,0 +1,114 @@
+# How approximate is Intel's floating-point reciprocal instruction?
+
+2020/10/03
+
+Here's a test Mu program that prints out the bits for 0.5:
+
+  ```
+  fn main -> r/ebx: int {
+    var two/eax: int <- copy 2
+    var half/xmm0: float <- convert two
+    half <- reciprocal half
+    var mem: float
+    copy-to mem, half
+    var out/eax: int <- reinterpret mem
+    print-int32-hex 0, out
+    print-string 0, "\n"
+    r <- copy 0
+  }
+  ```
+
+It gives different results when emulated and run natively:
+
+  ```
+  $ ./translate_mu_debug x.mu  # debug mode = error checking
+  $ ./bootstrap run a.elf
+  0x3f000000
+  $ ./a.elf
+  0x3efff000
+  ```
+
+I spent some time digging into this before I realized it wasn't a bug in Mu,
+just an artifact of the emulator not actually using the `reciprocal` instruction.
+Here's a procedure you can follow along with to convince yourself.
+
+Start with this program (good.c):
+
+  ```c
+  #include<stdio.h>
+  int main(void) {
+    int n = 2;
+    float f = 1.0/n;
+    printf("%f\n", f);
+    return 0;
+  }
+  ```
+
+It works as you'd expect (compiling unoptimized to actually compute the
+division):
+
+  ```
+  $ gcc good.c
+  $ ./a.out
+  0.5
+  ```
+
+Let's look at its Assembly:
+
+  ```
+  $ gcc -S good.c
+  ```
+
+The generated `good.s` has a lot of stuff that doesn't interest us, surrounding
+these lines:
+
+  ```asm
+                        ; destination
+  movl      $2,         -8(%rbp)
+  cvtsi2sd  -8(%rbp),   %xmm0
+  movsd     .LC0(%rip), %xmm1
+  divsd     %xmm0,      %xmm1
+  movapd    %xmm1,      %xmm0
+  ```
+
+This fragment converts `2` into floating-point and then divides 1.0 (the
+constant `.LC0`) by it, leaving the result in register `xmm0`.
+
+There's a way to get gcc to emit the `rcpss` instruction using intrinsics, but
+I don't know how to do it, so I'll modify the generated Assembly directly:
+
+  ```diff
+        movl      $2,         -8(%rbp)
+  <     cvtsi2sd  -8(%rbp),   %xmm0
+  <     movsd     .LC0(%rip), %xmm1
+  <     divsd     %xmm0,      %xmm1
+  <     movapd    %xmm1,      %xmm0
+  ---
+  >     cvtsi2ss  -8(%rbp),   %xmm0
+  >     rcpss     %xmm0,      %xmm0
+  >     movss     %xmm0,      -4(%rbp)
+  ```
+
+Let's compare the result of both versions:
+
+  ```
+  $ gcc good.s
+  $ ./a.out
+  0.5
+  $ gcc good.modified.s
+  $ ./a.out
+  0.499878
+  ```
+
+Whoa!
+
+Reading the Intel manual more closely, it guarantees that the relative error
+of `rcpss` is less than `1.5*2^-12`, and indeed 12 bits puts us squarely in
+the fourth decimal place.
+
+Among the x86 instructions Mu supports, two are described in the Intel manual
+as "approximate": `reciprocal` (`rcpss`) and `inverse-square-root` (`rsqrtss`).
+Intel introduced these instructions as part of its SSE expansion in 1999. When
+it upgraded SSE to SSE2 (in 2000), most of its single-precision floating-point
+instructions got upgraded to double-precision &mdash; but not these two. So
+they seem to be an evolutionary dead-end.
author	Kartik Agaram <vc@akkartik.com>	2020-10-03 23:17:17 -0700
committer	Kartik Agaram <vc@akkartik.com>	2020-10-03 23:17:17 -0700
commit	faeeef34f549ad00ede2136ffabdee9368f02dbe (patch)
tree	b349bbf5a832ead2b6252a29ba771115fe1f897f
parent	2f5a8496e64c063796ea5daf6381701d7cd873f1 (diff)
download	mu-faeeef34f549ad00ede2136ffabdee9368f02dbe.tar.gz