Diffstat (limited to 'js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md')
-rw-r--r-- | js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md | 139 |
1 files changed, 139 insertions, 0 deletions
diff --git a/js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md b/js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md
new file mode 100644
index 0000000..4a2efe3
--- /dev/null
+++ b/js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md
@@ -0,0 +1,139 @@
+# Critical Lexer Bug Report
+
+## **Issue Summary**
+
+The optimized regex-based lexer (`src/core/lexer.js`) has a critical bug that causes it to **skip large portions of input files** and produce incorrect tokens, leading to runtime errors.
+
+## **Impact Assessment**
+
+- **Severity**: Critical - causes complete parsing failures
+- **Scope**: Context-dependent - works for simple cases, fails on complex files
+- **Test Coverage**: All 210 tests pass (suggests bug is triggered by specific patterns)
+- **Workaround**: Use `--legacy` flag to use the working legacy lexer
+
+## **Bug Symptoms**
+
+### **Observed Behavior:**
+1. **Content Skipping**: Lexer jumps from beginning to middle/end of file
+2. **Token Mangling**: Produces partial tokens (e.g., "esults" instead of "Results")
+3. **Line Number Issues**: First token appears at line 21 instead of line 1
+4. **Variable Name Errors**: Runtime "Undefined variable" errors for correctly defined variables
+
+### **Example Failure:**
+```bash
+# Works with legacy
+./build/baba-yaga life-final.baba --legacy # ✅ Success
+
+# Fails with optimized
+./build/baba-yaga life-final.baba # ❌ ParseError: Unexpected token: COLON (:)
+```
+
+## **Test Results**
+
+### **Lexer Compatibility Test:**
+- ✅ Individual identifier lexing works correctly
+- ✅ All 210 existing tests pass
+- ❌ Complex files fail completely
+
+### **Debug Output Comparison:**
+
+**Legacy Lexer (Working):**
+```
+Tokens generated: 160
+First token: IDENTIFIER = "var_with_underscore" (line 4, col 20)
+```
+
+**Optimized Lexer (Broken):**
+```
+Tokens generated: 82
+First token: IDENTIFIER = "esults" (line 21, col 12) # Wrong!
+```
+
+## **Technical Analysis**
+
+### **Suspected Root Causes:**
+
+1. **Regex Pattern Conflicts**: Token patterns may be interfering with each other
+2. **Multiline Comment Handling**: `/^\/\/.*$/m` regex may be consuming too much
+3. **Pattern Order Issues**: Longer patterns not matching before shorter ones
+4. **Position Tracking Bug**: `advance()` function may have off-by-one errors
+
+### **Key Differences from Legacy:**
+
+| Aspect | Legacy | Optimized | Issue |
+|--------|--------|-----------|--------|
+| **Method** | Character-by-character | Regex-based | Regex conflicts |
+| **Identifier Pattern** | `readWhile(isLetter)` | `/^[a-zA-Z_][a-zA-Z0-9_]*/` | Should be equivalent |
+| **Comment Handling** | Manual parsing | `/^\/\/.*$/m` | May over-consume |
+| **Error Recovery** | Graceful | Regex failures | May skip content |
+
+## **Attempted Fixes**
+
+### **What Was Tried:**
+1. ✅ Verified identifier regex patterns match legacy behavior
+2. ✅ Confirmed individual token patterns work correctly
+3. ❌ Root cause in pattern interaction not yet identified
+
+### **What Needs Investigation:**
+1. **Pattern Order**: Ensure longest patterns match first
+2. **Multiline Regex**: Check if comment regex consumes too much
+3. **Position Tracking**: Verify `advance()` function correctness
+4. **Error Handling**: Check regex failure recovery
+
+## **Performance Impact**
+
+- **Legacy Lexer**: Reliable, slightly slower character-by-character parsing
+- **Optimized Lexer**: When working, ~2-3x faster, but **completely broken** for many cases
+- **Net Impact**: Negative due to correctness failures
+
+## **Recommended Actions**
+
+### **Immediate (Done):**
+1. ✅ **Revert to legacy lexer by default** for reliability
+2. ✅ **Document the bug** for future investigation
+3. ✅ **Keep optimized lexer available** with explicit flag
+
+### **Future Investigation:**
+1. **Debug regex pattern interactions** in isolation
+2. **Add comprehensive lexer test suite** with problematic files
+3. **Consider hybrid approach** (regex for simple tokens, fallback for complex)
+4. **Profile memory usage** during lexing failures
+
+## **Workarounds**
+
+### **For Users:**
+```bash
+# Use legacy lexer (reliable)
+bun run index.js program.baba --legacy
+
+# Or configure engine
+const config = new BabaYagaConfig({ enableOptimizations: false });
+```
+
+### **For Development:**
+```bash
+# Test both lexers
+bun run build.js --target=macos-arm64 # Uses legacy by default now
+```
+
+## **Files Affected**
+
+- `src/core/lexer.js` - Broken optimized lexer
+- `src/legacy/lexer.js` - Working legacy lexer
+- `src/core/engine.js` - Now defaults to legacy lexer
+- `index.js` - Updated to use legacy by default
+
+## **Success Criteria for Fix**
+
+1. **All existing tests pass** ✅ (already working)
+2. **Complex files parse correctly** ❌ (currently broken)
+3. **Performance improvement maintained** ⚠️ (secondary to correctness)
+4. **No regressions in error messages** ⚠️ (needs verification)
+
+---
+
+**Status**: **REVERTED TO LEGACY** - Optimized lexer disabled by default until bug is resolved.
+
+**Priority**: High - affects core language functionality
+
+**Assigned**: Future investigation needed
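
Note on suspected root cause 2 (`/^\/\/.*$/m`): the report does not confirm that this pattern is the culprit, but the following minimal sketch shows how the `m` flag alone can make a comment regex match far from the current position. With `m`, `^` matches at the start of any line, so a search that begins at the current position can succeed on a later line. Whether `src/core/lexer.js` actually applies the pattern this way is an assumption. `sampleSource`, `matchAt`, and `COMMENT_STICKY` are hypothetical names introduced only for this illustration.

```javascript
// Hypothetical reproduction; this is NOT the code in src/core/lexer.js.

// Comment pattern quoted in the bug report (multiline flag).
const COMMENT_MULTILINE = /^\/\/.*$/m;

// Assumed alternative: a sticky pattern that can only match at lastIndex.
const COMMENT_STICKY = /\/\/[^\n]*/y;

// Hypothetical input: real code on line 1, a comment on line 2.
const sampleSource = 'x : 1;\n// note\ny : 2;';

// With `m`, `^` also matches right after a newline, so exec() scans forward
// and finds the comment on line 2 even though the lexer is at position 0.
const looseMatch = COMMENT_MULTILINE.exec(sampleSource);

// A lexer that trusts this match and advances past it would silently drop
// everything before the comment.
const skippedText = sampleSource.slice(0, looseMatch.index);

// Match a sticky regex at an exact position; returns the lexeme or null.
function matchAt(stickyRegex, text, position) {
  stickyRegex.lastIndex = position;
  const match = stickyRegex.exec(text);
  return match ? match[0] : null;
}

const atStart = matchAt(COMMENT_STICKY, sampleSource, 0);   // no comment here
const atComment = matchAt(COMMENT_STICKY, sampleSource, 7); // comment starts here

console.log({ index: looseMatch.index, skippedText, atStart, atComment });
```

If the optimized lexer matches its patterns against the remaining input, checking that every match has `index === 0` (or switching to sticky patterns as sketched above) would be one way to confirm or rule out this cause.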