Diffstat (limited to 'js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md')
-rw-r--r-- js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md 139
1 files changed, 139 insertions, 0 deletions
diff --git a/js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md b/js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md
new file mode 100644
index 0000000..4a2efe3
--- /dev/null
+++ b/js/baba-yaga/scratch/docs/LEXER_BUG_REPORT.md
@@ -0,0 +1,139 @@
+# Critical Lexer Bug Report
+
+## 🚨 **Issue Summary**
+
+The optimized regex-based lexer (`src/core/lexer.js`) has a critical bug that causes it to **skip large portions of input files** and produce incorrect tokens, leading to runtime errors.
+
+## 📊 **Impact Assessment**
+
+- **Severity**: Critical - causes complete parsing failures
+- **Scope**: Context-dependent - works for simple cases, fails on complex files
+- **Test Coverage**: All 210 tests pass, which suggests the bug is triggered by specific patterns that the existing suite does not exercise
+- **Workaround**: Use `--legacy` flag to use the working legacy lexer
+
+## ๐Ÿ” **Bug Symptoms**
+
+### **Observed Behavior:**
+1. **Content Skipping**: Lexer jumps from beginning to middle/end of file
+2. **Token Mangling**: Produces partial tokens (e.g., "esults" instead of "Results")  
+3. **Line Number Issues**: First token appears at line 21 instead of line 1
+4. **Variable Name Errors**: Runtime "Undefined variable" errors for correctly defined variables
+
+### **Example Failure:**
+```bash
+# Works with legacy
+./build/baba-yaga life-final.baba --legacy  # ✅ Success
+
+# Fails with optimized
+./build/baba-yaga life-final.baba           # ❌ ParseError: Unexpected token: COLON (:)
+```
+
+## 🧪 **Test Results**
+
+### **Lexer Compatibility Test:**
+- ✅ Individual identifier lexing works correctly
+- ✅ All 210 existing tests pass
+- ❌ Complex files fail completely
+
+### **Debug Output Comparison:**
+
+**Legacy Lexer (Working):**
+```
+Tokens generated: 160
+First token: IDENTIFIER = "var_with_underscore" (line 4, col 20)
+```
+
+**Optimized Lexer (Broken):**
+```
+Tokens generated: 82
+First token: IDENTIFIER = "esults" (line 21, col 12)  # ❌ Wrong!
+```
+
+## 🔬 **Technical Analysis**
+
+### **Suspected Root Causes:**
+
+1. **Regex Pattern Conflicts**: Token patterns may be interfering with each other
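+2. **Multiline Comment Handling**: `/^\/\/.*$/m` regex may be consuming too much; with the `m` flag, `^` matches at the start of every line, so the pattern can match a comment well past the current position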
+3. **Pattern Order Issues**: Longer patterns not matching before shorter ones
+4. **Position Tracking Bug**: `advance()` function may have off-by-one errors
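
The second suspected cause can be checked in isolation. The snippet below is a minimal sketch, not code from `src/core/lexer.js`: the `source` string is a made-up input, and it only demonstrates how JavaScript treats `^` when the `m` flag is set. Whether the optimized lexer actually consumes such a match is an assumption that still needs to be verified.

```javascript
// Hypothetical input: code on line 1, a comment on line 2.
const source = "results : 1;\n// a comment\nnext : 2;";

// With the `m` flag, `^` matches at the start of ANY line, so exec() finds
// the comment on line 2 even though the input starts with an identifier.
const multiline = /^\/\/.*$/m;
const ahead = multiline.exec(source);
console.log(ahead.index, JSON.stringify(ahead[0]));

// Without the `m` flag, `^` only matches at the very start of the input,
// so the same pattern reports that no comment starts here.
const anchored = /^\/\/.*/;
console.log(anchored.exec(source));

// If a lexer consumed through the end of a match with a non-zero index,
// this is the text it would silently drop.
const skipped = source.slice(0, ahead.index);
console.log(JSON.stringify(skipped));
```

If the lexer treats any successful match as starting at the current position, a non-zero `index` like this one would be consistent with both the content skipping and the wrong line numbers described above.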
+
+### **Key Differences from Legacy:**
+
+| Aspect | Legacy | Optimized | Issue |
+|--------|--------|-----------|--------|
+| **Method** | Character-by-character | Regex-based | Regex conflicts |
+| **Identifier Pattern** | `readWhile(isLetter)` | `/^[a-zA-Z_][a-zA-Z0-9_]*/` | Should be equivalent |
+| **Comment Handling** | Manual parsing | `/^\/\/.*$/m` | May over-consume |
+| **Error Recovery** | Graceful | Regex failures | May skip content |
+
+## 🛠 **Attempted Fixes**
+
+### **What Was Tried:**
+1. ✅ Verified identifier regex patterns match legacy behavior
+2. ✅ Confirmed individual token patterns work correctly
+3. ❌ Root cause in pattern interaction not yet identified
+
+### **What Needs Investigation:**
+1. **Pattern Order**: Ensure longest patterns match first
+2. **Multiline Regex**: Check if comment regex consumes too much
+3. **Position Tracking**: Verify `advance()` function correctness
+4. **Error Handling**: Check regex failure recovery
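
The investigation points above can be exercised with a small reference lexer. The following is a sketch under stated assumptions: the `PATTERNS` table, the token shape, and this `advance()` helper are all hypothetical and are not taken from `src/core/lexer.js`. It uses the sticky (`y`) flag to pin each match to the current position, which is a choice made for this sketch rather than a description of the existing implementation.

```javascript
// Hypothetical token table; order matters because the first match wins.
// Multi-character operators are listed before their single-character prefixes.
const PATTERNS = [
  ["ARROW", /->/y],
  ["MINUS", /-/y],
  ["IDENTIFIER", /[a-zA-Z_][a-zA-Z0-9_]*/y],
  ["NUMBER", /[0-9]+/y],
  ["COLON", /:/y],
  ["WHITESPACE", /[ \t\r\n]+/y],
];

// Advance an {index, line, column} position across `text`, one character at
// a time, so line and column stay in step with the characters consumed.
function advance(pos, text) {
  let { index, line, column } = pos;
  for (const ch of text) {
    index += ch.length;
    if (ch === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
  return { index, line, column };
}

function tokenize(source) {
  const tokens = [];
  let pos = { index: 0, line: 1, column: 1 };
  while (pos.index < source.length) {
    let matched = false;
    for (const [type, regex] of PATTERNS) {
      regex.lastIndex = pos.index; // sticky: the match must start exactly here
      const m = regex.exec(source);
      if (m === null) continue;
      if (type !== "WHITESPACE") {
        tokens.push({ type, value: m[0], line: pos.line, column: pos.column });
      }
      pos = advance(pos, m[0]);
      matched = true;
      break;
    }
    if (!matched) {
      // Fail loudly instead of skipping input when no pattern applies.
      throw new Error(`Unexpected character at line ${pos.line}, col ${pos.column}`);
    }
  }
  return tokens;
}

console.log(tokenize("a -> b\nc - 1"));
```

Because a sticky regex can only match at `lastIndex`, a pattern cannot silently jump ahead in the input, and an unmatched character raises an error instead of being skipped.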
+
+## 📈 **Performance Impact**
+
+- **Legacy Lexer**: Reliable, slightly slower character-by-character parsing
+- **Optimized Lexer**: When working, ~2-3x faster, but **completely broken** for many cases
+- **Net Impact**: Negative due to correctness failures
+
+## ✅ **Recommended Actions**
+
+### **Immediate (Done):**
+1. ✅ **Revert to legacy lexer by default** for reliability
+2. ✅ **Document the bug** for future investigation
+3. ✅ **Keep optimized lexer available** with explicit flag
+
+### **Future Investigation:**
+1. **Debug regex pattern interactions** in isolation
+2. **Add comprehensive lexer test suite** with problematic files
+3. **Consider hybrid approach** (regex for simple tokens, fallback for complex)
+4. **Profile memory usage** during lexing failures
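
A lexer test suite built around problematic files could start from a differential check that reports the first token where the two lexers disagree. The helper below is a sketch: `firstDivergence` and the `{type, value, line, column}` token shape are assumptions, and the two arrays are hand-written stand-ins for real lexer output.

```javascript
// Compare two token streams and return the index and tokens at the first
// point where they disagree, or null if the streams are identical.
// The {type, value, line, column} token shape is an assumption.
function firstDivergence(expected, actual) {
  const length = Math.max(expected.length, actual.length);
  for (let i = 0; i < length; i++) {
    const a = expected[i];
    const b = actual[i];
    const same =
      a !== undefined && b !== undefined &&
      a.type === b.type && a.value === b.value &&
      a.line === b.line && a.column === b.column;
    if (!same) return { index: i, expected: a ?? null, actual: b ?? null };
  }
  return null;
}

// Hand-written streams standing in for real lexer output.
const legacyTokens = [
  { type: "IDENTIFIER", value: "Results", line: 1, column: 1 },
  { type: "COLON", value: ":", line: 1, column: 9 },
];
const optimizedTokens = [
  { type: "IDENTIFIER", value: "esults", line: 1, column: 2 },
  { type: "COLON", value: ":", line: 1, column: 9 },
];

console.log(firstDivergence(legacyTokens, optimizedTokens));
```

Running such a check over each problematic file would narrow the failure down to the exact token, and therefore the source position, where the optimized lexer first goes wrong.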
+
+## 🔧 **Workarounds**
+
+### **For Users:**
+```bash
+# Use legacy lexer (reliable)
+bun run index.js program.baba --legacy
+
+# Or disable optimizations in the engine config (JavaScript, not a shell command):
+#   const config = new BabaYagaConfig({ enableOptimizations: false });
+```
+
+### **For Development:**
+```bash
+# Build the binary; it now uses the legacy lexer by default
+bun run build.js --target=macos-arm64
+```
+
+## ๐Ÿ“ **Files Affected**
+
+- `src/core/lexer.js` - Broken optimized lexer
+- `src/legacy/lexer.js` - Working legacy lexer  
+- `src/core/engine.js` - Now defaults to legacy lexer
+- `index.js` - Updated to use legacy by default
+
+## 🎯 **Success Criteria for Fix**
+
+1. **All existing tests pass** ✅ (already working)
+2. **Complex files parse correctly** ❌ (currently broken)
+3. **Performance improvement maintained** ⚠️ (secondary to correctness)
+4. **No regressions in error messages** ⚠️ (needs verification)
+
+---
+
+**Status**: **REVERTED TO LEGACY** - Optimized lexer disabled by default until bug is resolved.
+
+**Priority**: High - affects core language functionality
+
+**Assigned**: Future investigation needed