diff options
Diffstat (limited to 'js/baba-yaga/experimental/COMPILER.md')
-rw-r--r-- | js/baba-yaga/experimental/COMPILER.md | 270 |
1 files changed, 270 insertions, 0 deletions
diff --git a/js/baba-yaga/experimental/COMPILER.md b/js/baba-yaga/experimental/COMPILER.md new file mode 100644 index 0000000..d451334 --- /dev/null +++ b/js/baba-yaga/experimental/COMPILER.md @@ -0,0 +1,270 @@ +## Baba Yaga → JavaScript Compiler (Combinator-Based) + +This document describes a minimal, practical compiler from Baba Yaga to JavaScript using abstraction elimination into combinators. It reuses the existing front-end (`lexer.js`, `parser.js`) and targets a small JS runtime that implements SKI-family combinators and primitives consistent with `interpreter.js` semantics. The working implementation now lives under `experimental/`. + +### Goals +- Compile any program accepted by the current `parser.js` and executed by `interpreter.js`. +- Preserve semantics: numbers with `{ value, isFloat }`, pure and immutable data, partial application, namespacing, and “when” pattern matching. +- Keep changes additive: new `experimental/compiler.js` and a small runtime; no required changes to lexer/parser/interpreter. +- Output readable, runnable JS with a tiny explicit runtime. + +### Non-Goals (initially) +- Advanced type-directed optimizations or codegen from declared types. +- Mutual recursion via pure combinator fixpoints (we’ll use two-phase `let`+assignment for simplicity). +- Aggressive size-optimizing bracket abstraction (we can add B/C/η later). + +## High-Level Architecture +1. Front-end: reuse `createLexer` and `createParser` to produce AST. +2. Core normalization pass (implemented in `experimental/compiler.js`): + - Desugar to a small core language (λ, application, identifiers, literals, primitives, data). + - Convert infix operators and keywords into primitive function calls. + - Normalize multi-arg functions to nested single-arg lambdas. + - Lower member access to `get key obj` form. + - Lower `when` to a chain of guarded branches with thunks. +3. Abstraction elimination (bracket abstraction) or closure-based codegen (mode-dependent) in `experimental/compiler.js`: remove λs into SKI-family combinators, or emit JS closures directly when selected. +4. Emission: generate JS module text that includes a tiny runtime and the compiled program. + +## Source Language Surface (from parser/interpreter) +- Program-level: `TypeDeclaration`, `VariableDeclaration`, `FunctionDeclaration`, expressions. +- Expressions: numbers, strings, booleans, identifiers, anonymous functions, application, unary/binary ops, lists, tables, member access, `ResultExpression` (Ok/Err), `WhenExpression` (patterns: type, result, list, table, wildcard), curried and typed functions. +- Namespaces: `io`, `str`, `math`, list/table functions (append, map, reduce, etc.) via native functions. + +## Core IR (post-desugar) +Use a minimal lambda calculus with primitives and data literals: +- Term := `Var(name)` | `Lam(param, term)` | `App(term, term)` | `Const(name)` | `Lit(<number|string|boolean|list|table|result>)`. +- No infix ops; all are `Const` applied to args. +- Member access lowered to `App(App(Const('get'), key), obj)`. +- `when` lowered to nested `cond` forms with thunks. + +### Desugaring Rules (selected) +- Multi-arg function `f: (x: T, y: U) -> R -> body` → `Lam(x, Lam(y, body))` with type metadata discarded at runtime. +- Untyped function `x y -> body` → `Lam(x, Lam(y, body))`. +- Anonymous `(x y -> e)` → `Lam(x, Lam(y, e))`. +- Unary minus: `-e` → `App(Const('neg'), e)`. +- Binary `e1 op e2` → `App(App(Const(opName), e1), e2)` where `opName ∈ { add, sub, mul, div, mod, eq, neq, gt, lt, gte, lte, and, or, xor, concatDot }`. +- Member `obj.prop` or `obj.0` → `App(App(Const('get'), keyLit), obj)`. +- `Ok e` → `App(Const('Ok'), e)`; `Err e` → `App(Const('Err'), e)`. +- Lists `[e1, ..., en]` → `Lit([...])` (JS arrays); tables `{ k: v, ... }` → `Lit({ ... })` or `Const('table')(...)` builder (see Runtime). +- `when`: + - `when d1 ... dn is p1 ... pn then c1; ...; _ then cK` → nested `cond` calls: + - `cond (match([p1..pn], [d1..dn])) (thunk(() => c1)) (thunk(() => cond (...)))`. + - Pattern matchers compile to booleans with helpers and do not bind new variables except `ResultPattern` which binds its value. + +## Bracket Abstraction +Eliminate lambdas into SKI-family combinators over the core term. Treat any `Var` not equal to the bound variable as free (terminal). Use the standard rules: +- [x]x → I +- [x]M where x ∉ FV(M) → K M +- [x](M N) → S ([x]M) ([x]N) + +Notes: +- Multi-arg λ: `[x][y]M` via nesting. +- We can later add optimizations: η-reduction, B/C introduction, constant folding of K-consts, etc. + +## Emission Strategy +Output a single JS module string with: +1. Runtime prelude (combinators, curry helpers, primitives, matchers, namespaces). +2. Top-level declarations: + - Functions: declare `let f;` for each function name first (supports mutual recursion), then `f = <compiled expression>;`. + - Variables: `const x = <compiled expression>;`. +3. Exports and format: by default emit UMD for maximal portability (Node/CommonJS, Bun, browsers). Also support format selection via CLI. + +### Module formats, exports, and codegen modes +- Default: UMD wrapper that exposes named exports under `module.exports` (CJS), `define` (AMD when available), and global `BabaYaga` namespace in browsers. +- Optional formats: `--format esm|cjs|umd` to force a specific target. ESM emits `export` statements; CJS uses `module.exports = { ... }`. + - Codegen mode: `--mode ski|closure|hybrid` (default: `ski`). + - `ski`: full combinatorization after normalization. + - `closure`: emit JS closures with currying helpers (easiest parity path, useful for debugging and when SKI would bloat output). + - `hybrid`: use closures for complex constructs (e.g., deeply nested `when`, heavy pattern matching) and SKI for simple lambdas; implementation-defined heuristics. + +### Entry points and exports +- Compiled module should export named bindings for all top-level declarations. +- Additionally, export a single entry for executing a program: + - `run(host)` preferred: evaluates the compiled `Program` under the provided `host.io` and returns the last expression value. + - Fallback: `main(host)` if a user-defined `main` exists; parity harness will try `run` then `main`. +- Host wiring: prefer a lightweight `run(host)` that creates the runtime namespaces based on `host`. Avoid global side-effects. + +### Recursion +- Self-recursion: works via two-phase assignment (`let f; f = ...`), because references resolve at call time. +- Mutual recursion: also works via predeclared `let` for all functions then assignments. +- Optional: pure combinator fixpoint `Z` for self-recursion; not required initially. + +## Runtime Design (tiny and explicit) + +### Combinators and Helpers +- `I(a) = a` +- `K(a)(b) = a` +- `S(f)(g)(x) = f(x)(g(x))` +- Optional: `B(f)(g)(x) = f(g(x))`, `C(f)(x)(y) = f(y)(x)`, `Z(f) = (g => g(g))(g => f(x => g(g)(x)))` for CBV. +- `curryN(n, f)` and arity-specific `curry2`, `curry3` for primitives. +- `thunk(f)` returns `f` (we pass thunks to `cond`). +- `cond(p)(t)(e) = p ? t() : e()`. + +### Numeric and Data Representation +- Numbers are `{ value: number, isFloat: boolean }` to match interpreter. +- Lists as JS arrays; tables as `{ type: 'Object', properties: Map }` built via a small `table(obj)` helper that constructs a `Map`. Access is always via `get(key)(obj)` in compiled code. We will NOT use a Proxy in compiled output; this keeps runtime simple and predictable while preserving interpreter semantics for member/index access. + +### Runtime namespace and symbol hygiene +- Prefix all runtime symbols with `__rt` (e.g., `__rt.I`, `__rt.add`, `__rt.get`) to avoid collisions with user-defined names. +- User symbols (top-level functions/vars) are emitted in their original names and exported by the module wrapper. + +### Primitives (curried) +- Arithmetic: `add, sub, mul, div, mod, neg` (unwrap/rewrap numbers; div guards divide-by-zero like interpreter). +- Compare: `eq, neq, gt, lt, gte, lte` (numeric unwrapping + generic string/boolean). +- Logic: `and, or, xor` (booleanize inputs). +- String: `concatDot` for `..` (String(lhs) + String(rhs)). +- Result: `Ok, Err` → `{ type: 'Result', variant: 'Ok'|'Err', value }`. +- Core ops: `get`, and helpers for lists/tables (append, prepend, concat, update, removeAt, slice, set, remove, merge, keys, values) mirroring interpreter. In compiled code, `get` returns `undefined` for missing keys/indices (including out-of-bounds array access) to match interpreter behavior. +- Namespaces: `io`, `str`, `math` exposing same functions/signatures; `io` is parameterized by a passed-in host object for integration. +- Matchers: `isType(expected, value)`, `matchList(pattern, value)`, `matchTable(pattern, value)`, `isOk`, `isErr`. + +### Pattern Matching Lowering +- Each `when` case compiles to `cond(guard)(() => consequent)(() => next)`, ensuring short-circuiting and correct evaluation order. +- Compound guards (e.g., multi-element list/table checks) are lazily folded using nested `cond` so that later predicates are not evaluated if earlier ones fail. This avoids unintended member/index access and preserves semantics. +- Guards are pure booleans; consequents compiled normally; for `ResultPattern Ok x`, the consequent becomes a λ bound to the inner value and is applied on the success branch. +- Switch micro-optimization: when there is exactly one discriminant and all patterns are literals with no bindings or type/list/table patterns, the compiler may emit a JavaScript `switch` on a normalized key (numbers unwrap to `.value`) for readability/perf. All other cases use the `cond`/if-else chain. + +## Mapping from AST → IR → JS (node-by-node) + +- Program: collect top-level function names; emit `let` declarations; then emit assignments/consts in order. Optionally return last expression for REPL mode. +- TypeDeclaration: ignored at runtime; optionally preserved as comments. +- VariableDeclaration: `const name = compile(expr)`. +- FunctionDeclaration/AnonymousFunction: normalize to nested `Lam`, then bracket abstraction, then emit the combinator tree expression. +- FunctionCall: normal application; arity handled at runtime (curried). +- NumberLiteral: `{ value: n, isFloat }`. String/Boolean: JS literals. +- UnaryExpression `-x`: `neg x`. BinaryExpression: map to primitive names. +- MemberExpression: `get key obj` (curried). +- ListLiteral/TableLiteral: JS arrays + `table` builder (to get `{type:'Object',properties:Map}`) where needed. +- ResultExpression: `Ok v`/`Err v`. +- WhenExpression: compiled to `cond` chain with matchers. + +## Example (sketch) + +Source: +```baba +add: (x: Number, y: Number) -> Number -> x + y; +main: -> add 2 3; +``` + +IR (informal): +```text +add = Lam(x, Lam(y, App(App(Const('add'), Var(x)), Var(y)))) +main = Lam(_unit, App(App(Var(add), Lit({2})), Lit({3}))) +``` + +After abstraction elimination (SKI form, schematic): +```text +add = S (K (S (K add) I)) (K (S (K ...) ...)) // produced by the standard rules +``` + +Emitted JS (abridged): +```js +// runtime: S,K,I,curry2, add, ... +let add; +add = /* SKI of λx.λy. add x y */; +const main = /* SKI of λ_. add 2 3 */; +export { add, main }; +``` + +## Evaluation Strategy +- We generate eager JS; to preserve lazy branches in `when`, consequents are wrapped in thunks and forced via `cond`. +- Curried primitives plus normal JS evaluation replicates call-by-value with partial application like the interpreter. + +## Testing Strategy +- Golden tests: compile and run existing `tests/*.test.js`-backed programs using both the interpreter and compiled output; assert observational equivalence of results/side-effects. +- Parity harness: + - For each sample program (or inline snippet), run: `interpret(ast)` and `execute(compiledJS)` with same `host`. + - Compare returned value shapes; for numbers compare `.value` and `isFloat`; for lists/tables structural equality; for `Result` `variant` and inner value. + - Capture `io.out` writes from both executions and compare sequences. +- CI entry: add a job that compiles a subset of fixtures and runs them; failures show diff of values and `io.out`. +- Edge cases: negative numbers vs unary minus; division by zero; list/table bounds; member access; nested `when` with short-circuiting; partial application and closures; recursion and mutual recursion; Ok/Err binding; boolean/logical ops precedence. + +### Parity harness usage +- A simple harness exists in `parity.js`: + - Runs interpreter and compiled module, normalizes results to plain JS via a deep-plain conversion, and compares `io.out` arrays. + - It currently imports the compiled module via a data URL and looks for `run(host)` or `main([host])` exports. + - Default compiler mode for parity is `closure` to reduce bring-up friction; once SKI is ready, switch to `--mode ski`. + +Examples: +```sh +node parity.js ../example.baba +``` +or via package script: +```sh +bun run parity +``` + +## Status and Next Steps + +What’s working now (closure mode MVP): +- Normalization covers literals, identifiers, anonymous and declared functions (curried), calls, unary minus, binary ops (+ - * / % .. and compare/logic), member access via `get`, `Ok`/`Err`, list/table literals, and a basic `when` to `Cond` lowering with thunks. +- Emission produces an ESM module exporting `run(host)`, supporting recursion via let-first then assignment. +- Runtime prelude implements: numbers with `{ value, isFloat }`, curry helpers, arithmetic/compare/logic, `concatDot`, `Ok/Err`, `get`, `cond`, `io` wiring, `table` builder, list HOFs (map/filter/reduce), and core `str.*` helpers. +- Parity harness loads ESM via data URL and can compare interpreter vs compiled; simple programs pass. + +Actionable next steps (ordered): +1) Runtime parity with interpreter +- Implement missing list/table primitives: `append`, `prepend`, `concat`, `update`, `removeAt`, `slice`, `set`, `remove`, `merge`, `keys`, `values` with identical semantics (indices unwrap numbers; maps remain `Map`). +- Implement `math.*` namespace mirroring `interpreter.js` (abs, sign, floor, ceil, round, trunc, min, max, clamp, pow, sqrt, exp, log, sin, cos, tan, asin, acos, atan, atan2, deg, rad, random, randomInt). +- Consider `io.emit`/`io.listen` to support host event wiring if compiled code uses them. + +2) When/pattern matching completeness +- Extend lowering to fully support: multi-discriminant patterns, `TypePattern`, `ListPattern`, `TablePattern`, and `ResultPattern` with bindings. Ensure every branch is thunked for short-circuit evaluation. Add switch micro-optimization for single discriminant with literal-only patterns (optional). + +3) Emission and module formats +- Add named exports for top-level declarations (in addition to `run(host)`). +- Implement `wrapModule` for `umd` and `cjs` formats. Keep `esm` as the default for parity. +- Add `--pretty` and `--sourcemap inline|file` support per the spec. + +4) Codegen modes +- Implement SKI mode with bracket abstraction (`I/K/S` at minimum) and corresponding runtime. +- Implement hybrid heuristics: closures for complex `when`/deep terms, SKI for simple lambdas. + +5) CLI and DX +- Integrate flags into `runner.js` or provide a `bin/compile` entry. Support `--runtime external:path` to reference an external runtime. + +6) Testing and CI +- Run fixture programs under both interpreter and compiler; compare values and `io.out` via parity harness. +- Add CI job to compile and run a representative subset; show diffs on failure. + +Notes/pitfalls to keep in mind +- Numbers must preserve `{ value, isFloat }` across all operations and boundaries. +- `get` must support arrays (numeric index) and tables (Map by key). Member access errors should match interpreter behavior. +- `when` consequents must not be evaluated eagerly; use thunks and `__rt.cond`. +- Keep runtime symbol hygiene: all helpers live under `__rt.*` to avoid user name collisions. + +## CLI +- Extend `runner.js` (or add `bin/compile`) with: + - `--compile <input.baba>`: compile input file + - `-o, --out <out.js>`: output file path + - `--format <esm|cjs|umd>`: module format (default: `umd`) + - `--mode <ski|closure|hybrid>`: code generation mode (default: `ski`) + - `--runtime <inline|external:path>`: inline runtime by default; or reference external runtime + - `--sourcemap <inline|file|none>`: source map emission (default: `none` initially) + - `--pretty`: pretty-print emitted JS (default off) + +Example: +```sh +bun run runner.js --compile src/main.baba -o dist/main.js --format umd --sourcemap file +``` + +## Source maps (planned) +- Nice-to-have; implement after parity. Approach: + - Thread source locations from `parser.js` into the core IR nodes. + - Generate mappings at IR application/literal/const boundaries rather than raw SKI nodes. + - Support `--sourcemap inline|file` to control emission. + - Mode-aware mapping: in `closure` mode, map directly at emitted closure expressions; in `ski`/`hybrid` modes, map at IR boundaries pre-elimination to keep maps stable. + +## Open Questions / TODO +- TODO: Confirm semantics for pattern matching on nested member expressions inside patterns (limited in current parser). +- TODO: Implement source maps per the plan once parity is achieved. + +## New files and changes +- Added `experimental/compiler.js`: scaffold with `compile`, normalization and codegen stubs, emission and runtime prelude stubs, and a minimal developer CLI. +- Added `experimental/parity.js`: parity harness described above. +- Updated `package.json` scripts with `parity` command. +- `runtime.js`: optional future extraction if we decide not to inline the prelude. + +## Minimalism and Parity +This approach is intentionally minimal: no changes to existing front-end/runtime semantics, small targeted runtime, and direct syntactic lowering. The initial goal is parity with `interpreter.js`; once achieved, we can iterate on size and performance optimizations. + + |