about summary refs log tree commit diff stats
path: root/js/baba-yaga/experimental/COMPILER.md
diff options
context:
space:
mode:
Diffstat (limited to 'js/baba-yaga/experimental/COMPILER.md')
-rw-r--r--js/baba-yaga/experimental/COMPILER.md270
1 files changed, 270 insertions, 0 deletions
diff --git a/js/baba-yaga/experimental/COMPILER.md b/js/baba-yaga/experimental/COMPILER.md
new file mode 100644
index 0000000..d451334
--- /dev/null
+++ b/js/baba-yaga/experimental/COMPILER.md
@@ -0,0 +1,270 @@
+## Baba Yaga → JavaScript Compiler (Combinator-Based)
+
+This document describes a minimal, practical compiler from Baba Yaga to JavaScript using abstraction elimination into combinators. It reuses the existing front-end (`lexer.js`, `parser.js`) and targets a small JS runtime that implements SKI-family combinators and primitives consistent with `interpreter.js` semantics. The working implementation now lives under `experimental/`.
+
+### Goals
+- Compile any program accepted by the current `parser.js` and executed by `interpreter.js`.
+- Preserve semantics: numbers with `{ value, isFloat }`, pure and immutable data, partial application, namespacing, and “when” pattern matching.
+- Keep changes additive: new `experimental/compiler.js` and a small runtime; no required changes to lexer/parser/interpreter.
+- Output readable, runnable JS with a tiny explicit runtime.
+
+### Non-Goals (initially)
+- Advanced type-directed optimizations or codegen from declared types.
+- Mutual recursion via pure combinator fixpoints (we’ll use two-phase `let`+assignment for simplicity).
+- Aggressive size-optimizing bracket abstraction (we can add B/C/η later).
+
+## High-Level Architecture
+1. Front-end: reuse `createLexer` and `createParser` to produce AST.
+2. Core normalization pass (implemented in `experimental/compiler.js`):
+   - Desugar to a small core language (λ, application, identifiers, literals, primitives, data).
+   - Convert infix operators and keywords into primitive function calls.
+   - Normalize multi-arg functions to nested single-arg lambdas.
+   - Lower member access to `get key obj` form.
+   - Lower `when` to a chain of guarded branches with thunks.
+3. Abstraction elimination (bracket abstraction) or closure-based codegen (mode-dependent) in `experimental/compiler.js`: remove λs into SKI-family combinators, or emit JS closures directly when selected.
+4. Emission: generate JS module text that includes a tiny runtime and the compiled program.
+
+## Source Language Surface (from parser/interpreter)
+- Program-level: `TypeDeclaration`, `VariableDeclaration`, `FunctionDeclaration`, expressions.
+- Expressions: numbers, strings, booleans, identifiers, anonymous functions, application, unary/binary ops, lists, tables, member access, `ResultExpression` (Ok/Err), `WhenExpression` (patterns: type, result, list, table, wildcard), curried and typed functions.
+- Namespaces: `io`, `str`, `math`, list/table functions (append, map, reduce, etc.) via native functions.
+
+## Core IR (post-desugar)
+Use a minimal lambda calculus with primitives and data literals:
+- Term := `Var(name)` | `Lam(param, term)` | `App(term, term)` | `Const(name)` | `Lit(<number|string|boolean|list|table|result>)`.
+- No infix ops; all are `Const` applied to args.
+- Member access lowered to `App(App(Const('get'), key), obj)`.
+- `when` lowered to nested `cond` forms with thunks.
+
+### Desugaring Rules (selected)
+- Multi-arg function `f: (x: T, y: U) -> R -> body` → `Lam(x, Lam(y, body))` with type metadata discarded at runtime.
+- Untyped function `x y -> body` → `Lam(x, Lam(y, body))`.
+- Anonymous `(x y -> e)` → `Lam(x, Lam(y, e))`.
+- Unary minus: `-e` → `App(Const('neg'), e)`.
+- Binary `e1 op e2` → `App(App(Const(opName), e1), e2)` where `opName ∈ { add, sub, mul, div, mod, eq, neq, gt, lt, gte, lte, and, or, xor, concatDot }`.
+- Member `obj.prop` or `obj.0` → `App(App(Const('get'), keyLit), obj)`.
+- `Ok e` → `App(Const('Ok'), e)`; `Err e` → `App(Const('Err'), e)`.
+- Lists `[e1, ..., en]` → `Lit([...])` (JS arrays); tables `{ k: v, ... }` → `Lit({ ... })` or `Const('table')(...)` builder (see Runtime).
+- `when`:
+  - `when d1 ... dn is p1 ... pn then c1; ...; _ then cK` → nested `cond` calls:
+    - `cond (match([p1..pn], [d1..dn])) (thunk(() => c1)) (thunk(() => cond (...)))`.
+  - Pattern matchers compile to booleans with helpers and do not bind new variables except `ResultPattern` which binds its value.
+
+## Bracket Abstraction
+Eliminate lambdas into SKI-family combinators over the core term. Treat any `Var` not equal to the bound variable as free (terminal). Use the standard rules:
+- [x]x → I
+- [x]M where x ∉ FV(M) → K M
+- [x](M N) → S ([x]M) ([x]N)
+
+Notes:
+- Multi-arg λ: `[x][y]M` via nesting.
+- We can later add optimizations: η-reduction, B/C introduction, constant folding of K-consts, etc.
+
+## Emission Strategy
+Output a single JS module string with:
+1. Runtime prelude (combinators, curry helpers, primitives, matchers, namespaces).
+2. Top-level declarations:
+   - Functions: declare `let f;` for each function name first (supports mutual recursion), then `f = <compiled expression>;`.
+   - Variables: `const x = <compiled expression>;`.
+3. Exports and format: by default emit UMD for maximal portability (Node/CommonJS, Bun, browsers). Also support format selection via CLI.
+
+### Module formats, exports, and codegen modes
+- Default: UMD wrapper that exposes named exports under `module.exports` (CJS), `define` (AMD when available), and global `BabaYaga` namespace in browsers.
+- Optional formats: `--format esm|cjs|umd` to force a specific target. ESM emits `export` statements; CJS uses `module.exports = { ... }`.
+ - Codegen mode: `--mode ski|closure|hybrid` (default: `ski`).
+   - `ski`: full combinatorization after normalization.
+   - `closure`: emit JS closures with currying helpers (easiest parity path, useful for debugging and when SKI would bloat output).
+   - `hybrid`: use closures for complex constructs (e.g., deeply nested `when`, heavy pattern matching) and SKI for simple lambdas; implementation-defined heuristics.
+
+### Entry points and exports
+- Compiled module should export named bindings for all top-level declarations.
+- Additionally, export a single entry for executing a program:
+  - `run(host)` preferred: evaluates the compiled `Program` under the provided `host.io` and returns the last expression value.
+  - Fallback: `main(host)` if a user-defined `main` exists; parity harness will try `run` then `main`.
+- Host wiring: prefer a lightweight `run(host)` that creates the runtime namespaces based on `host`. Avoid global side-effects.
+
+### Recursion
+- Self-recursion: works via two-phase assignment (`let f; f = ...`), because references resolve at call time.
+- Mutual recursion: also works via predeclared `let` for all functions then assignments.
+- Optional: pure combinator fixpoint `Z` for self-recursion; not required initially.
+
+## Runtime Design (tiny and explicit)
+
+### Combinators and Helpers
+- `I(a) = a`
+- `K(a)(b) = a`
+- `S(f)(g)(x) = f(x)(g(x))`
+- Optional: `B(f)(g)(x) = f(g(x))`, `C(f)(x)(y) = f(y)(x)`, `Z(f) = (g => g(g))(g => f(x => g(g)(x)))` for CBV.
+- `curryN(n, f)` and arity-specific `curry2`, `curry3` for primitives.
+- `thunk(f)` returns `f` (we pass thunks to `cond`).
+- `cond(p)(t)(e) = p ? t() : e()`.
+
+### Numeric and Data Representation
+- Numbers are `{ value: number, isFloat: boolean }` to match interpreter.
+- Lists as JS arrays; tables as `{ type: 'Object', properties: Map }` built via a small `table(obj)` helper that constructs a `Map`. Access is always via `get(key)(obj)` in compiled code. We will NOT use a Proxy in compiled output; this keeps runtime simple and predictable while preserving interpreter semantics for member/index access.
+
+### Runtime namespace and symbol hygiene
+- Prefix all runtime symbols with `__rt` (e.g., `__rt.I`, `__rt.add`, `__rt.get`) to avoid collisions with user-defined names.
+- User symbols (top-level functions/vars) are emitted in their original names and exported by the module wrapper.
+
+### Primitives (curried)
+- Arithmetic: `add, sub, mul, div, mod, neg` (unwrap/rewrap numbers; div guards divide-by-zero like interpreter).
+- Compare: `eq, neq, gt, lt, gte, lte` (numeric unwrapping + generic string/boolean).
+- Logic: `and, or, xor` (booleanize inputs).
+- String: `concatDot` for `..` (String(lhs) + String(rhs)).
+- Result: `Ok, Err` → `{ type: 'Result', variant: 'Ok'|'Err', value }`.
+- Core ops: `get`, and helpers for lists/tables (append, prepend, concat, update, removeAt, slice, set, remove, merge, keys, values) mirroring interpreter. In compiled code, `get` returns `undefined` for missing keys/indices (including out-of-bounds array access) to match interpreter behavior.
+- Namespaces: `io`, `str`, `math` exposing same functions/signatures; `io` is parameterized by a passed-in host object for integration.
+- Matchers: `isType(expected, value)`, `matchList(pattern, value)`, `matchTable(pattern, value)`, `isOk`, `isErr`.
+
+### Pattern Matching Lowering
+- Each `when` case compiles to `cond(guard)(() => consequent)(() => next)`, ensuring short-circuiting and correct evaluation order.
+- Compound guards (e.g., multi-element list/table checks) are lazily folded using nested `cond` so that later predicates are not evaluated if earlier ones fail. This avoids unintended member/index access and preserves semantics.
+- Guards are pure booleans; consequents compiled normally; for `ResultPattern Ok x`, the consequent becomes a λ bound to the inner value and is applied on the success branch.
+- Switch micro-optimization: when there is exactly one discriminant and all patterns are literals with no bindings or type/list/table patterns, the compiler may emit a JavaScript `switch` on a normalized key (numbers unwrap to `.value`) for readability/perf. All other cases use the `cond`/if-else chain.
+
+## Mapping from AST → IR → JS (node-by-node)
+
+- Program: collect top-level function names; emit `let` declarations; then emit assignments/consts in order. Optionally return last expression for REPL mode.
+- TypeDeclaration: ignored at runtime; optionally preserved as comments.
+- VariableDeclaration: `const name = compile(expr)`.
+- FunctionDeclaration/AnonymousFunction: normalize to nested `Lam`, then bracket abstraction, then emit the combinator tree expression.
+- FunctionCall: normal application; arity handled at runtime (curried).
+- NumberLiteral: `{ value: n, isFloat }`. String/Boolean: JS literals.
+- UnaryExpression `-x`: `neg x`. BinaryExpression: map to primitive names.
+- MemberExpression: `get key obj` (curried).
+- ListLiteral/TableLiteral: JS arrays + `table` builder (to get `{type:'Object',properties:Map}`) where needed.
+- ResultExpression: `Ok v`/`Err v`.
+- WhenExpression: compiled to `cond` chain with matchers.
+
+## Example (sketch)
+
+Source:
+```baba
+add: (x: Number, y: Number) -> Number -> x + y;
+main: -> add 2 3;
+```
+
+IR (informal):
+```text
+add = Lam(x, Lam(y, App(App(Const('add'), Var(x)), Var(y))))
+main = Lam(_unit, App(App(Var(add), Lit({2})), Lit({3})))
+```
+
+After abstraction elimination (SKI form, schematic):
+```text
+add = S (K (S (K add) I)) (K (S (K ...) ...)) // produced by the standard rules
+```
+
+Emitted JS (abridged):
+```js
+// runtime: S,K,I,curry2, add, ...
+let add;
+add = /* SKI of λx.λy. add x y */;
+const main = /* SKI of λ_. add 2 3 */;
+export { add, main };
+```
+
+## Evaluation Strategy
+- We generate eager JS; to preserve lazy branches in `when`, consequents are wrapped in thunks and forced via `cond`.
+- Curried primitives plus normal JS evaluation replicates call-by-value with partial application like the interpreter.
+
+## Testing Strategy
+- Golden tests: compile and run existing `tests/*.test.js`-backed programs using both the interpreter and compiled output; assert observational equivalence of results/side-effects.
+- Parity harness:
+  - For each sample program (or inline snippet), run: `interpret(ast)` and `execute(compiledJS)` with same `host`.
+  - Compare returned value shapes; for numbers compare `.value` and `isFloat`; for lists/tables structural equality; for `Result` `variant` and inner value.
+  - Capture `io.out` writes from both executions and compare sequences.
+- CI entry: add a job that compiles a subset of fixtures and runs them; failures show diff of values and `io.out`.
+- Edge cases: negative numbers vs unary minus; division by zero; list/table bounds; member access; nested `when` with short-circuiting; partial application and closures; recursion and mutual recursion; Ok/Err binding; boolean/logical ops precedence.
+
+### Parity harness usage
+- A simple harness exists in `parity.js`:
+  - Runs interpreter and compiled module, normalizes results to plain JS via a deep-plain conversion, and compares `io.out` arrays.
+  - It currently imports the compiled module via a data URL and looks for `run(host)` or `main([host])` exports.
+  - Default compiler mode for parity is `closure` to reduce bring-up friction; once SKI is ready, switch to `--mode ski`.
+
+Examples:
+```sh
+node parity.js ../example.baba
+```
+or via package script:
+```sh
+bun run parity
+```
+
+## Status and Next Steps
+
+What’s working now (closure mode MVP):
+- Normalization covers literals, identifiers, anonymous and declared functions (curried), calls, unary minus, binary ops (+ - * / % .. and compare/logic), member access via `get`, `Ok`/`Err`, list/table literals, and a basic `when` to `Cond` lowering with thunks.
+- Emission produces an ESM module exporting `run(host)`, supporting recursion via let-first then assignment.
+- Runtime prelude implements: numbers with `{ value, isFloat }`, curry helpers, arithmetic/compare/logic, `concatDot`, `Ok/Err`, `get`, `cond`, `io` wiring, `table` builder, list HOFs (map/filter/reduce), and core `str.*` helpers.
+- Parity harness loads ESM via data URL and can compare interpreter vs compiled; simple programs pass.
+
+Actionable next steps (ordered):
+1) Runtime parity with interpreter
+- Implement missing list/table primitives: `append`, `prepend`, `concat`, `update`, `removeAt`, `slice`, `set`, `remove`, `merge`, `keys`, `values` with identical semantics (indices unwrap numbers; maps remain `Map`).
+- Implement `math.*` namespace mirroring `interpreter.js` (abs, sign, floor, ceil, round, trunc, min, max, clamp, pow, sqrt, exp, log, sin, cos, tan, asin, acos, atan, atan2, deg, rad, random, randomInt).
+- Consider `io.emit`/`io.listen` to support host event wiring if compiled code uses them.
+
+2) When/pattern matching completeness
+- Extend lowering to fully support: multi-discriminant patterns, `TypePattern`, `ListPattern`, `TablePattern`, and `ResultPattern` with bindings. Ensure every branch is thunked for short-circuit evaluation. Add switch micro-optimization for single discriminant with literal-only patterns (optional).
+
+3) Emission and module formats
+- Add named exports for top-level declarations (in addition to `run(host)`).
+- Implement `wrapModule` for `umd` and `cjs` formats. Keep `esm` as the default for parity.
+- Add `--pretty` and `--sourcemap inline|file` support per the spec.
+
+4) Codegen modes
+- Implement SKI mode with bracket abstraction (`I/K/S` at minimum) and corresponding runtime.
+- Implement hybrid heuristics: closures for complex `when`/deep terms, SKI for simple lambdas.
+
+5) CLI and DX
+- Integrate flags into `runner.js` or provide a `bin/compile` entry. Support `--runtime external:path` to reference an external runtime.
+
+6) Testing and CI
+- Run fixture programs under both interpreter and compiler; compare values and `io.out` via parity harness.
+- Add CI job to compile and run a representative subset; show diffs on failure.
+
+Notes/pitfalls to keep in mind
+- Numbers must preserve `{ value, isFloat }` across all operations and boundaries.
+- `get` must support arrays (numeric index) and tables (Map by key). Member access errors should match interpreter behavior.
+- `when` consequents must not be evaluated eagerly; use thunks and `__rt.cond`.
+- Keep runtime symbol hygiene: all helpers live under `__rt.*` to avoid user name collisions.
+
+## CLI
+- Extend `runner.js` (or add `bin/compile`) with:
+  - `--compile <input.baba>`: compile input file
+  - `-o, --out <out.js>`: output file path
+  - `--format <esm|cjs|umd>`: module format (default: `umd`)
+  - `--mode <ski|closure|hybrid>`: code generation mode (default: `ski`)
+  - `--runtime <inline|external:path>`: inline runtime by default; or reference external runtime
+  - `--sourcemap <inline|file|none>`: source map emission (default: `none` initially)
+  - `--pretty`: pretty-print emitted JS (default off)
+
+Example:
+```sh
+bun run runner.js --compile src/main.baba -o dist/main.js --format umd --sourcemap file
+```
+
+## Source maps (planned)
+- Nice-to-have; implement after parity. Approach:
+  - Thread source locations from `parser.js` into the core IR nodes.
+  - Generate mappings at IR application/literal/const boundaries rather than raw SKI nodes.
+  - Support `--sourcemap inline|file` to control emission.
+  - Mode-aware mapping: in `closure` mode, map directly at emitted closure expressions; in `ski`/`hybrid` modes, map at IR boundaries pre-elimination to keep maps stable.
+
+## Open Questions / TODO
+- TODO: Confirm semantics for pattern matching on nested member expressions inside patterns (limited in current parser).
+- TODO: Implement source maps per the plan once parity is achieved.
+
+## New files and changes
+- Added `experimental/compiler.js`: scaffold with `compile`, normalization and codegen stubs, emission and runtime prelude stubs, and a minimal developer CLI.
+- Added `experimental/parity.js`: parity harness described above.
+- Updated `package.json` scripts with `parity` command.
+- `runtime.js`: optional future extraction if we decide not to inline the prelude.
+
+## Minimalism and Parity
+This approach is intentionally minimal: no changes to existing front-end/runtime semantics, small targeted runtime, and direct syntactic lowering. The initial goal is parity with `interpreter.js`; once achieved, we can iterate on size and performance optimizations.
+
+