# Critical Lexer Bug Report

## 🚨 **Issue Summary**

The optimized regex-based lexer (`src/core/lexer.js`) has a critical bug that causes it to **skip large portions of input files** and produce incorrect tokens, leading to runtime errors.

## 📊 **Impact Assessment**

- **Severity**: Critical - causes complete parsing failures
- **Scope**: Context-dependent - works for simple cases, fails on complex files
- **Test Coverage**: All 210 tests pass (suggests bug is triggered by specific patterns)
- **Workaround**: Use `--legacy` flag to use the working legacy lexer

## 🔍 **Bug Symptoms**

### **Observed Behavior:**

1. **Content Skipping**: Lexer jumps from beginning to middle/end of file
2. **Token Mangling**: Produces partial tokens (e.g., "esults" instead of "Results")
3. **Line Number Issues**: First token appears at line 21 instead of line 1
4. **Variable Name Errors**: Runtime "Undefined variable" errors for correctly defined variables

### **Example Failure:**

```bash
# Works with legacy
./build/baba-yaga life-final.baba --legacy  # ✅ Success

# Fails with optimized
./build/baba-yaga life-final.baba           # ❌ ParseError: Unexpected token: COLON (:)
```

## 🧪 **Test Results**

### **Lexer Compatibility Test:**

- ✅ Individual identifier lexing works correctly
- ✅ All 210 existing tests pass
- ❌ Complex files fail completely

### **Debug Output Comparison:**

**Legacy Lexer (Working):**

```
Tokens generated: 160
First token: IDENTIFIER = "var_with_underscore" (line 4, col 20)
```

**Optimized Lexer (Broken):**

```
Tokens generated: 82
First token: IDENTIFIER = "esults" (line 21, col 12)  # ❌ Wrong!
```

## 🔬 **Technical Analysis**

### **Suspected Root Causes:**

1. **Regex Pattern Conflicts**: Token patterns may be interfering with each other
2. **Multiline Comment Handling**: `/^\/\/.*$/m` regex may be consuming too much
3. **Pattern Order Issues**: Longer patterns not matching before shorter ones
4. **Position Tracking Bug**: `advance()` function may have off-by-one errors

### **Key Differences from Legacy:**

| Aspect | Legacy | Optimized | Issue |
|--------|--------|-----------|-------|
| **Method** | Character-by-character | Regex-based | Regex conflicts |
| **Identifier Pattern** | `readWhile(isLetter)` | `/^[a-zA-Z_][a-zA-Z0-9_]*/` | Should be equivalent |
| **Comment Handling** | Manual parsing | `/^\/\/.*$/m` | May over-consume |
| **Error Recovery** | Graceful | Regex failures | May skip content |

## 🛠 **Attempted Fixes**

### **What Was Tried:**

1. ✅ Verified identifier regex patterns match legacy behavior
2. ✅ Confirmed individual token patterns work correctly
3. ❌ Root cause in pattern interaction not yet identified

### **What Needs Investigation:**

1. **Pattern Order**: Ensure longest patterns match first
2. **Multiline Regex**: Check if comment regex consumes too much
3. **Position Tracking**: Verify `advance()` function correctness
4. **Error Handling**: Check regex failure recovery

## 📈 **Performance Impact**

- **Legacy Lexer**: Reliable, slightly slower character-by-character parsing
- **Optimized Lexer**: When working, ~2-3x faster, but **completely broken** for many cases
- **Net Impact**: Negative due to correctness failures

## ✅ **Recommended Actions**

### **Immediate (Done):**

1. ✅ **Revert to legacy lexer by default** for reliability
2. ✅ **Document the bug** for future investigation
3. ✅ **Keep optimized lexer available** with explicit flag

### **Future Investigation:**

1. **Debug regex pattern interactions** in isolation
2. **Add comprehensive lexer test suite** with problematic files
3. **Consider hybrid approach** (regex for simple tokens, fallback for complex)
4. **Profile memory usage** during lexing failures

## 🔧 **Workarounds**

### **For Users:**

```bash
# Use legacy lexer (reliable)
bun run index.js program.baba --legacy
```

```javascript
// Or configure engine
const config = new BabaYagaConfig({ enableOptimizations: false });
```

### **For Development:**

```bash
# Test both lexers
bun run build.js --target=macos-arm64  # Uses legacy by default now
```

## 📁 **Files Affected**

- `src/core/lexer.js` - Broken optimized lexer
- `src/legacy/lexer.js` - Working legacy lexer
- `src/core/engine.js` - Now defaults to legacy lexer
- `index.js` - Updated to use legacy by default

## 🎯 **Success Criteria for Fix**

1. **All existing tests pass** ✅ (already working)
2. **Complex files parse correctly** ❌ (currently broken)
3. **Performance improvement maintained** ⚠️ (secondary to correctness)
4. **No regressions in error messages** ⚠️ (needs verification)

---

**Status**: **REVERTED TO LEGACY** - Optimized lexer disabled by default until bug is resolved.

**Priority**: High - affects core language functionality

**Assigned**: Future investigation needed
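
## **Appendix: Checking the Comment Regex in Isolation**

One way to start the "multiline regex" investigation is to test the comment pattern on its own. The sketch below is not taken from `src/core/lexer.js`; the input string, the `matchAt` helper, and the sticky pattern are all hypothetical and only illustrate a possible mechanism. It shows that with the `m` flag, `^` matches at the start of any line, so `/^\/\/.*$/m` can match a comment that sits well past the current position. If the lexer assumes every match starts at the current position, that would be consistent with the skipped content and truncated tokens described above, though this has not been confirmed as the root cause.

```javascript
// Hypothetical input: the comment is on the second line, not at position 0.
const source = 'x : 1;\n// a comment\nResults : 2;';

// Pattern as quoted in this report. With the `m` flag, `^` matches at the
// start of ANY line, so the match is not tied to the lexer's position.
const multilineComment = /^\/\/.*$/m;
const loose = multilineComment.exec(source);
console.log(loose.index, JSON.stringify(loose[0]));

// Possible alternative (an assumption, not the project's code): the sticky
// `y` flag only matches exactly at `lastIndex`, so no `^` or slicing is needed.
const stickyComment = /\/\/[^\n]*/y;

// Hypothetical helper: try a sticky pattern at one exact offset.
function matchAt(regex, input, pos) {
  regex.lastIndex = pos;
  return regex.exec(input);
}

// No comment starts at offset 0, so the sticky pattern refuses to match.
const atStart = matchAt(stickyComment, source, 0);
console.log(atStart);

// At the offset where the comment really begins, it matches as expected.
const atComment = matchAt(stickyComment, source, loose.index);
console.log(atComment.index, JSON.stringify(atComment[0]));
```

If the optimized lexer shows the same behaviour, either dropping the `m` flag or switching the token patterns to sticky matching would be worth testing against `life-final.baba`.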