# Critical Lexer Bug Report
## 🚨 **Issue Summary**
The optimized regex-based lexer (`src/core/lexer.js`) has a critical bug that causes it to **skip large portions of input files** and produce incorrect tokens, leading to runtime errors.
## 📊 **Impact Assessment**
- **Severity**: Critical - causes complete parsing failures
- **Scope**: Context-dependent - works for simple cases, fails on complex files
- **Test Coverage**: All 210 tests pass, which suggests the bug is triggered only by input patterns the suite does not cover
- **Workaround**: Use `--legacy` flag to use the working legacy lexer
## 🐛 **Bug Symptoms**
### **Observed Behavior:**
1. **Content Skipping**: Lexer jumps from beginning to middle/end of file
2. **Token Mangling**: Produces partial tokens (e.g., "esults" instead of "Results")
3. **Line Number Issues**: First token is reported at line 21, far past line 4 where the legacy lexer reports it
4. **Variable Name Errors**: Runtime "Undefined variable" errors for correctly defined variables
### **Example Failure:**
```bash
# Works with legacy
./build/baba-yaga life-final.baba --legacy   # ✅ Success

# Fails with optimized
./build/baba-yaga life-final.baba            # ❌ ParseError: Unexpected token: COLON (:)
```
## 🧪 **Test Results**
### **Lexer Compatibility Test:**
- ✅ Individual identifier lexing works correctly
- ✅ All 210 existing tests pass
- ❌ Complex files fail completely
### **Debug Output Comparison:**
**Legacy Lexer (Working):**
```
Tokens generated: 160
First token: IDENTIFIER = "var_with_underscore" (line 4, col 20)
```
**Optimized Lexer (Broken):**
```
Tokens generated: 82
First token: IDENTIFIER = "esults" (line 21, col 12)   # ❌ Wrong!
```
## 🔬 **Technical Analysis**
### **Suspected Root Causes:**
1. **Regex Pattern Conflicts**: Token patterns may be interfering with each other
2. **Multiline Comment Handling**: the `/^\/\/.*$/m` regex may be consuming too much input (see the sketch after this list)
3. **Pattern Order Issues**: Longer patterns not matching before shorter ones
4. **Position Tracking Bug**: `advance()` function may have off-by-one errors
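The second and third hypotheses can be exercised in isolation. The snippet below is a minimal, self-contained sketch, not code taken from `src/core/lexer.js`, and the sample lines are placeholders rather than necessarily valid Baba Yaga source; it only demonstrates how a `^...$/m` pattern matched against the remaining input can match far past the current position, which would produce exactly the skipping and token-mangling symptoms described above.

```js
// Minimal sketch (not the actual lexer code). With the /m flag, ^ anchors at
// the start of ANY line in the searched string, so a pattern applied to the
// remaining input can "find" a comment several lines ahead of the cursor.
const remainingInput = [
  'set results to 42',          // the lexer's cursor is at position 0
  '// a comment further down',
  'show results',
].join('\n');

const commentPattern = /^\/\/.*$/m;       // the pattern quoted in this report
const match = remainingInput.match(commentPattern);

console.log(match[0]);    // "// a comment further down"
console.log(match.index); // 18 -- the match starts on line 2, not at the cursor
// If the lexer treats this as "a comment begins at the cursor" and advances
// past match.index + match[0].length, everything before the comment is
// silently dropped. Skipping of this kind is consistent with partial tokens
// such as "esults" and with the first token appearing many lines late.
```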
### **Key Differences from Legacy:**
| Aspect | Legacy | Optimized | Issue |
|--------|--------|-----------|--------|
| **Method** | Character-by-character | Regex-based | Regex conflicts |
| **Identifier Pattern** | `readWhile(isLetter)` | `/^[a-zA-Z_][a-zA-Z0-9_]*/` | Should be equivalent (sketched below) |
| **Comment Handling** | Manual parsing | `/^\/\/.*$/m` | May over-consume |
| **Error Recovery** | Graceful | Regex failures | May skip content |
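To back the "should be equivalent" note above, a character-by-character identifier reader in the style the table attributes to the legacy lexer can be compared directly with the regex. The helpers below are illustrative sketches, not the real `readWhile`/`isLetter` functions from `src/legacy/lexer.js`.

```js
// Illustrative character-by-character identifier reader (legacy style).
// isIdentifierStart/isIdentifierPart are sketches, not the real legacy helpers.
function isIdentifierStart(ch) { return /[a-zA-Z_]/.test(ch); }
function isIdentifierPart(ch)  { return /[a-zA-Z0-9_]/.test(ch); }

function readIdentifier(input, pos) {
  if (!isIdentifierStart(input[pos])) return null;
  let end = pos + 1;
  while (end < input.length && isIdentifierPart(input[end])) end++;
  return { value: input.slice(pos, end), end };
}

// Both approaches accept the same identifiers when anchored at the same position:
const src = 'var_with_underscore is 42';
console.log(readIdentifier(src, 0).value);              // "var_with_underscore"
console.log(src.match(/^[a-zA-Z_][a-zA-Z0-9_]*/)[0]);   // "var_with_underscore"
```

This is consistent with the finding above that individual identifier lexing works; the divergence has to come from how and where the patterns are applied, not from the identifier pattern itself.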
## 🔄 **Attempted Fixes**
### **What Was Tried:**
1. ✅ Verified identifier regex patterns match legacy behavior
2. ✅ Confirmed individual token patterns work correctly
3. ❌ Root cause in pattern interaction not yet identified
### **What Needs Investigation:**
1. **Pattern Order**: Ensure longest patterns match first (a position-anchored alternative is sketched after this list)
2. **Multiline Regex**: Check if comment regex consumes too much
3. **Position Tracking**: Verify `advance()` function correctness
4. **Error Handling**: Check regex failure recovery
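If position anchoring turns out to be the underlying issue, one possible direction is to compile every token pattern with the sticky `y` flag and set `lastIndex` to the current cursor, so a pattern either matches exactly at the cursor or not at all. The sketch below is only an illustration of that idea; the token names and patterns are examples and do not reflect the real token table in `src/core/lexer.js`.

```js
// Position-anchored matching sketch (illustrative token table, not the real one).
const TOKEN_PATTERNS = [
  ['COMMENT',    /\/\/[^\n]*/y],        // no /m anchor needed: the sticky flag pins the match
  ['NUMBER',     /\d+(\.\d+)?/y],
  ['IDENTIFIER', /[a-zA-Z_][a-zA-Z0-9_]*/y],
  ['WHITESPACE', /[ \t\r\n]+/y],        // order still matters for overlapping patterns
];

function nextToken(input, pos) {
  for (const [type, pattern] of TOKEN_PATTERNS) {
    pattern.lastIndex = pos;            // sticky regex: the match must start exactly at pos
    const m = pattern.exec(input);
    if (m) return { type, value: m[0], end: pos + m[0].length };
  }
  throw new Error(`Unexpected character at position ${pos}: ${input[pos]}`);
}

// Because every match is pinned to the cursor, tokens can never be produced
// from skipped-over input.
let pos = 0;
const src = 'show results // print the grid';
while (pos < src.length) {
  const tok = nextToken(src, pos);
  if (tok.type !== 'WHITESPACE') console.log(tok.type, JSON.stringify(tok.value));
  pos = tok.end;
}
```

The same effect can be achieved with a single combined regex and named capture groups; the essential property is that each match is anchored to the current position, which rules out the skipping behavior by construction.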
## 📊 **Performance Impact**
- **Legacy Lexer**: Reliable, slightly slower character-by-character parsing
- **Optimized Lexer**: When working, ~2-3x faster, but **completely broken** for many cases
- **Net Impact**: Negative due to correctness failures
## ✅ **Recommended Actions**
### **Immediate (Done):**
1. ✅ **Revert to legacy lexer by default** for reliability
2. ✅ **Document the bug** for future investigation
3. ✅ **Keep optimized lexer available** behind an explicit flag
### **Future Investigation:**
1. **Debug regex pattern interactions** in isolation
2. **Add comprehensive lexer test suite** with problematic files (a differential-test sketch follows this list)
3. **Consider hybrid approach** (regex for simple tokens, fallback for complex)
4. **Profile memory usage** during lexing failures
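For item 2, a differential test that runs both lexers over the same corpus and diffs the token streams would catch this class of bug before the parser ever sees the tokens. The sketch below assumes both modules export a `tokenize(source)` function returning `{ type, value, line, col }` objects and that problematic programs live in an `./examples` directory; adjust the import paths, signature, and corpus location to the real code.

```js
// Differential lexer test sketch. Import paths, the tokenize() signature and
// the ./examples corpus directory are assumptions -- adapt to the real modules.
import { readFileSync, readdirSync } from 'node:fs';
import { tokenize as legacyTokenize } from '../src/legacy/lexer.js';
import { tokenize as optimizedTokenize } from '../src/core/lexer.js';

// Return the first index where the two token streams disagree, or null.
function firstDivergence(legacy, optimized) {
  const max = Math.max(legacy.length, optimized.length);
  for (let i = 0; i < max; i++) {
    const a = legacy[i];
    const b = optimized[i];
    if (!a || !b || a.type !== b.type || a.value !== b.value) {
      return { index: i, legacy: a, optimized: b };
    }
  }
  return null;
}

for (const file of readdirSync('./examples').filter((f) => f.endsWith('.baba'))) {
  const source = readFileSync(`./examples/${file}`, 'utf8');
  const diff = firstDivergence(legacyTokenize(source), optimizedTokenize(source));
  console.log(diff
    ? `${file}: diverges at token ${diff.index}: ${JSON.stringify(diff.legacy)} vs ${JSON.stringify(diff.optimized)}`
    : `${file}: token streams identical`);
}
```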
## 🔧 **Workarounds**
### **For Users:**
```bash
# Use legacy lexer (reliable)
bun run index.js program.baba --legacy
```

```js
// Or configure the engine programmatically
const config = new BabaYagaConfig({ enableOptimizations: false });
```
### **For Development:**
```bash
# Test both lexers
bun run build.js --target=macos-arm64 # Uses legacy by default now
```
## 📁 **Files Affected**
- `src/core/lexer.js` - Broken optimized lexer
- `src/legacy/lexer.js` - Working legacy lexer
- `src/core/engine.js` - Now defaults to legacy lexer
- `index.js` - Updated to use legacy by default
## 🎯 **Success Criteria for Fix**
1. **All existing tests pass** ✅ (already working)
2. **Complex files parse correctly** ❌ (currently broken)
3. **Performance improvement maintained** ⚠️ (secondary to correctness)
4. **No regressions in error messages** ⚠️ (needs verification)
---
**Status**: **REVERTED TO LEGACY** - Optimized lexer disabled by default until bug is resolved.
**Priority**: High - affects core language functionality
**Assigned**: Future investigation needed