diff options
-rw-r--r-- | awk/scheme/scheme/README.md | 60 | ||||
-rwxr-xr-x | awk/scheme/scheme/bin/compiler.awk | 132 | ||||
-rwxr-xr-x | awk/scheme/scheme/bin/repl | 36 | ||||
-rwxr-xr-x | awk/scheme/scheme/bin/vm.awk | 432 | ||||
-rw-r--r-- | awk/scheme/scheme/diagram.md | 58 | ||||
-rw-r--r-- | awk/scheme/scheme/scratch/ideas-for-string-support.md | 556 | ||||
-rwxr-xr-x | bash/dds | 121 |
7 files changed, 1259 insertions, 136 deletions
diff --git a/awk/scheme/scheme/README.md b/awk/scheme/scheme/README.md index 7711442..5dae265 100644 --- a/awk/scheme/scheme/README.md +++ b/awk/scheme/scheme/README.md @@ -1,10 +1,12 @@ # Awk-Scheme ## Overview -A Scheme interpreter implemented in AWK, featuring: -- A compiler that translates Scheme expressions to stack-based VM instructions +A scheme interpreter implemented in AWK: + +- A compiler that translates scheme expressions to stack-based VM instructions - A virtual machine that executes the compiled code - Support for basic arithmetic, list operations, functions, and variable bindings +- **Closure support** for nested function definitions and lexical scoping - Persistent global state between REPL sessions ## Architecture @@ -12,15 +14,17 @@ A Scheme interpreter implemented in AWK, featuring: ### Components 1. **Compiler** (`bin/compiler.awk`): - Lexical analyzer for tokenizing input - - Recursive descent parser for Scheme expressions + - Recursive descent parser for scheme expressions - Code generator that produces stack-based VM instructions - Handles nested expressions and proper scoping + - **Generates closure creation code for lambda expressions** 2. **Virtual Machine** (`bin/vm.awk`): - Stack-based execution model - Typed value system with runtime type checking - Environment-based variable binding - - Function call/return mechanism + - **Direct function execution** (no program array modification) + - **Closure support** with environment capture and restoration - Heap memory for cons cells - State persistence for globals and functions @@ -38,6 +42,7 @@ All values are tagged with their type: - `F:name` - Functions - `NIL:` - Empty list - `S:name` - Symbols +- **`CLOSURE:func_name:env_id` - Closure objects (function + captured environment)** ## Usage @@ -100,6 +105,24 @@ scheme> (let ((x 10) (y 20)) (+ x y)) N:30 ``` +7. **Lambda Functions**: +```scheme +scheme> ((lambda (x) (+ x 1)) 41) +N:42 +scheme> ((lambda (x y) (+ x y)) 20 22) +N:42 +``` + +8. **Closures**: +```scheme +scheme> (let ((x 10)) ((lambda (y) (+ x y)) 32)) +N:42 +scheme> (let ((add1 (lambda (x) (+ x 1)))) (add1 41)) +N:42 +``` + +**Note**: Nested lambda definitions (lambdas defined inside other lambdas) are not yet supported. + ## Implementation Details ### Compiler Pipeline @@ -115,6 +138,7 @@ N:30 - Translates expressions to VM instructions - Manages scope and variable bindings - Handles function definitions + - **Generates closure creation code for lambda expressions** ### Virtual Machine 1. **Stack Operations**: @@ -124,11 +148,16 @@ N:30 - LOOKUP: Retrieve variable value 2. **Function Calls**: - - CALL: Invoke function - - RETURN: Return from function + - **Direct execution** of function code with proper program restoration - Environment frame management + - **Closure support** with environment capture and restoration + +3. **Closure System**: + - **CAPTURE_ENV**: Captures current environment for closure creation + - **Environment restoration**: Restores captured environments on closure calls + - **Variable lookup enhancement**: Checks closure environments for variable resolution -3. **Memory Management**: +4. **Memory Management**: - Stack-based evaluation - Simple heap for cons cells - Basic reference counting (not fully implemented) @@ -137,6 +166,7 @@ N:30 - Global variables and functions persist between sessions - State stored in `/tmp/scheme_vm.state` and `/tmp/scheme_vm.env` - Automatic state loading/saving +- **Note**: State persistence is enabled by default in the REPL ## Extending the System @@ -165,9 +195,12 @@ N:30 - VM execution - Stack operations - Environment changes + - **Closure creation and restoration** +- **Note**: Some debug output may be suppressed in the current implementation ## Limitations Current implementation does not support: +- **Nested lambda definitions** (lambdas defined inside other lambdas) - this causes "Undefined closure function" errors - Floating point numbers - Strings - Proper tail recursion @@ -178,9 +211,10 @@ Current implementation does not support: - Standard library ## Future Enhancements -1. Proper tail call optimization -2. Garbage collection -3. Error recovery -4. More data types -5. Standard library -6. Better debugging tools \ No newline at end of file +1. **Nested lambda support** (complex but achievable) +2. Proper tail call optimization +3. Garbage collection +4. Error recovery +5. More data types +6. Standard library +7. Better debugging tools \ No newline at end of file diff --git a/awk/scheme/scheme/bin/compiler.awk b/awk/scheme/scheme/bin/compiler.awk index debaa2c..94d6e99 100755 --- a/awk/scheme/scheme/bin/compiler.awk +++ b/awk/scheme/scheme/bin/compiler.awk @@ -11,23 +11,28 @@ BEGIN { next_label = 0 # Counter for generating unique labels program = "" # Accumulates the full program text - # Debug mode disabled by default, can be enabled via DEBUG=1 environment variable + # AWK FEATURE: ENVIRON is a built-in array containing environment variables + # Unlike JS process.env, this is automatically available in awk DEBUG = (ENVIRON["DEBUG"] == "1") ? 1 : 0 } # Debug logging helper function function debug(msg) { + # AWK FEATURE: printf with > "/dev/stderr" redirects output to stderr + # Unlike console.error() in JS, this is how awk handles stderr output if (DEBUG) printf("[DEBUG] %s\n", msg) > "/dev/stderr" } -# Each line of input is accumulated into the program string -# This allows handling multi-line expressions +# AWK FEATURE: Each line of input is automatically processed by this block +# This is awk's main input processing loop - every line from stdin/files goes here +# In JS, you'd need to explicitly read lines from a stream { if (program != "") program = program "\n" - program = program $0 + program = program $0 # $0 is the current line being processed } -# Main compiler entry point - called after all input is read +# AWK FEATURE: END block runs after all input has been processed +# This is like a "finally" block that always executes after reading all input END { debug("Raw program:\n" program) if (program == "") exit @@ -47,9 +52,12 @@ function split_expressions(prog, current, paren_count, i, c, expr, cleaned) { # 2. Normalize whitespace # 3. Trim leading/trailing whitespace cleaned = prog + # AWK FEATURE: gsub() is a built-in function for global substitution (like replaceAll in JS) + # gsub(pattern, replacement, target) - modifies target in place and returns count gsub(/;[^(]*\(/, "(", cleaned) # Remove comments before expressions gsub(/\)[^)]*;/, ")", cleaned) # Remove comments after expressions gsub(/[ \t\n]+/, " ", cleaned) # Normalize whitespace to single spaces + # AWK FEATURE: sub() is like gsub() but only replaces the first occurrence sub(/^[ \t\n]+/, "", cleaned) # Trim leading whitespace sub(/[ \t\n]+$/, "", cleaned) # Trim trailing whitespace @@ -58,6 +66,8 @@ function split_expressions(prog, current, paren_count, i, c, expr, cleaned) { if (cleaned == "") return # Parse expressions by tracking parenthesis nesting + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property for (i = 1; i <= length(cleaned); i++) { c = substr(cleaned, i, 1) @@ -90,6 +100,8 @@ function split_expressions(prog, current, paren_count, i, c, expr, cleaned) { } # Lexer helper functions for character classification +# AWK FEATURE: String comparison with >= and <= works lexicographically +# Unlike JS where you need to convert to numbers, awk can compare strings directly function is_digit(c) { return c >= "0" && c <= "9" } function is_whitespace(c) { return c == " " || c == "\t" || c == "\n" } @@ -100,6 +112,8 @@ function next_token() { if (input_buffer == "") input_buffer = program # Skip whitespace between tokens + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property while (length(input_buffer) > 0 && is_whitespace(substr(input_buffer, 1, 1))) input_buffer = substr(input_buffer, 2) @@ -113,8 +127,14 @@ function next_token() { } # Handle numbers (including negative numbers) + # AWK FEATURE: substr(string, start, length) extracts substring + # Unlike JS string.slice(), this is 1-indexed and requires explicit length + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property if (is_digit(c) || c == "-" && length(input_buffer) > 1 && is_digit(substr(input_buffer, 2, 1))) { num = "" + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property while (length(input_buffer) > 0) { c = substr(input_buffer, 1, 1) if (!is_digit(c) && c != "-") break @@ -126,6 +146,8 @@ function next_token() { # Handle symbols (identifiers and operators) sym = "" + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property while (length(input_buffer) > 0) { c = substr(input_buffer, 1, 1) if (is_whitespace(c) || c == "(" || c == ")") break @@ -170,6 +192,8 @@ function parse_list(result, expr) { # Splits an expression into operator and arguments # Handles nested expressions correctly function split_expr(expr, i, len, c, op, args, paren_count) { + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property len = length(expr) paren_count = 0 @@ -190,11 +214,16 @@ function split_expr(expr, i, len, c, op, args, paren_count) { } debug("Split expr: op=" op " args=" args) + # AWK FEATURE: SUBSEP is a built-in variable used as separator for array indices + # When you use array[key1,key2], awk internally stores it as array[key1 SUBSEP key2] + # This allows multi-dimensional arrays in awk (though they're really single-dimensional) return op SUBSEP args } # Splits argument list handling nested parentheses function split_args(args, arg_array, len, i, c, current, paren_count, arg_count) { + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property len = length(args) current = "" paren_count = 0 @@ -233,18 +262,20 @@ function compile_primitive_call(op, args, arg_array, nargs, i) { nargs = split_args(args, arg_array) # Check if this is a lambda function call - if (substr(op, 1, 1) == "(") { + # AWK FEATURE: ~ is the regex match operator (like /pattern/.test() in JS) + # The pattern is a regex literal, not a string + if (op ~ /^\(lambda /) { # This is a lambda function call - # First compile the lambda function - compile_expr(op) - - # Then compile all arguments + # First compile all arguments for (i = 1; i <= nargs; i++) { compile_expr(arg_array[i]) } - # Call the function - print "CALL __lambda_" (next_label - 1) + # Then compile the lambda function (this will push the function name) + compile_expr(op) + + # Call the function - the lambda name is now on top of stack + print "CALL" return } @@ -316,58 +347,58 @@ function compile_primitive_call(op, args, arg_array, nargs, i) { else { # Function call for user-defined functions debug("Function call: " op) - # Look up the function name - print "LOOKUP " op - # Get the actual function name - print "GET_VALUE" - # Then compile arguments + # First compile arguments for (i = 1; i <= nargs; i++) { compile_expr(arg_array[i]) } + # Then look up the function name + print "LOOKUP " op + # Get the actual function name + print "GET_VALUE" # Call the function print "CALL" } } # Splits let bindings into individual variable/value pairs -function split_bindings(bindings, binding_array, count, current, paren_count, i, c, in_lambda) { +function split_bindings(bindings, binding_array, count, current, paren_count, i, c) { count = 0 current = "" paren_count = 0 - in_lambda = 0 + debug("split_bindings: parsing [" bindings "]") + + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property for (i = 1; i <= length(bindings); i++) { c = substr(bindings, i, 1) # Track nested parentheses if (c == "(") { paren_count++ - if (paren_count == 1 && !in_lambda) { + if (paren_count == 1) { current = "" # Start new binding continue } } if (c == ")") { paren_count-- - if (paren_count == 0 && !in_lambda) { + if (paren_count == 0) { # End of binding binding_array[++count] = current + debug("split_bindings: found binding [" current "]") current = "" continue } } - # Track if we're inside a lambda expression - if (substr(bindings, i, 7) == "lambda ") { - in_lambda = 1 - } - - # Only add character if we're inside a binding + # Add character if we're inside a binding if (paren_count > 0) { current = current c } } + debug("split_bindings: found " count " bindings") return count } @@ -379,6 +410,8 @@ function compile_let(args, bindings, body, binding_array, nbindings, i, var, val # Find matching closing parenthesis for bindings paren_count = 1 i = 2 + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property while (paren_count > 0 && i <= length(args)) { if (substr(args, i, 1) == "(") paren_count++ if (substr(args, i, 1) == ")") paren_count-- @@ -416,8 +449,8 @@ function compile_let(args, bindings, body, binding_array, nbindings, i, var, val # Handle lambda or other compound expressions if (substr(val, 2, 6) == "lambda") { # This is a lambda expression - # First compile the lambda - compile_lambda(substr(val, 8)) # Skip "(lambda " + # Pass the entire lambda expression to compile_lambda + compile_lambda(val) # Store the function name in the environment print "STORE " var } else { @@ -470,6 +503,8 @@ function compile_define(args, name, params, body, param_array, nparams, i, paren print "LABEL " name # Process parameters + # AWK FEATURE: split(string, array, separator) splits string into array elements + # Unlike JS string.split() which returns an array, this populates an existing array nparams = split(params, param_array, " ") for (i = 1; i <= nparams; i++) { print "STORE " param_array[i] @@ -496,12 +531,42 @@ function compile_lambda(args, params, body, param_array, nparams, i, lambda_name # Generate a unique name for the lambda function lambda_name = "__lambda_" next_label++ + debug("compile_lambda: args = [" args "]") + + # Handle both full lambda expression and just the arguments part + if (substr(args, 1, 7) == "(lambda") { + debug("compile_lambda: detected full lambda expression") + # Full lambda expression: (lambda (params) body) + args = substr(args, 8) # Remove "(lambda" + debug("compile_lambda: after removing (lambda: [" args "]") + # Find the closing parenthesis of the entire lambda expression + paren_count = 0 + i = 1 + while (i <= length(args)) { + c = substr(args, i, 1) + if (c == "(") paren_count++ + if (c == ")") { + paren_count-- + if (paren_count == -1) break # Found the closing paren of the lambda + } + i++ + } + if (paren_count != -1) error("Unmatched parenthesis in lambda") + args = substr(args, 1, i - 1) # Remove the closing parenthesis + debug("compile_lambda: after removing closing paren: [" args "]") + # Remove leading whitespace + sub(/^[ \t\n]+/, "", args) + debug("compile_lambda: after removing leading whitespace: [" args "]") + } + # Split into parameters and body if (substr(args, 1, 1) != "(") error("Malformed lambda expression") # Find matching closing parenthesis for parameters paren_count = 1 i = 2 + # AWK FEATURE: length(string) returns the length of a string + # Unlike JS string.length, this is a function call, not a property while (paren_count > 0 && i <= length(args)) { if (substr(args, i, 1) == "(") paren_count++ if (substr(args, i, 1) == ")") paren_count-- @@ -523,6 +588,8 @@ function compile_lambda(args, params, body, param_array, nparams, i, lambda_name print "LABEL " lambda_name # Process parameters + # AWK FEATURE: split(string, array, separator) splits string into array elements + # Unlike JS string.split() which returns an array, this populates an existing array nparams = split(params, param_array, " ") for (i = 1; i <= nparams; i++) { print "STORE " param_array[i] @@ -537,10 +604,9 @@ function compile_lambda(args, params, body, param_array, nparams, i, lambda_name } print "RETURN" - # Only push the function name if we're not in a direct call - if (!(args ~ /^\([^)]*\)[^(]*$/)) { - print "PUSH_CONST S:" lambda_name - } + # Create closure that captures current environment + print "CAPTURE_ENV " lambda_name + print "PUSH_CONST CLOSURE:" lambda_name ":ENV_ID" } # Main expression compiler - dispatches based on expression type diff --git a/awk/scheme/scheme/bin/repl b/awk/scheme/scheme/bin/repl index 14a10cf..1fce388 100755 --- a/awk/scheme/scheme/bin/repl +++ b/awk/scheme/scheme/bin/repl @@ -1,15 +1,21 @@ #!/bin/bash # Enable debug tracing -DEBUG=0 +# BASH FEATURE: ${VAR:-default} provides a default value if VAR is unset or empty +# Unlike JS where you'd use VAR || default, this only uses default if VAR is literally unset +DEBUG=${DEBUG:-0} debug() { if [ "$DEBUG" = "1" ]; then + # BASH FEATURE: >&2 redirects output to stderr (file descriptor 2) + # Unlike JS console.error(), this explicitly redirects to stderr echo "[DEBUG] $*" >&2 fi } # Find the directory containing this script and the components +# BASH FEATURE: ${BASH_SOURCE[0]} is the path to the current script +# Unlike JS __filename, this works even when the script is sourced DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" COMPILER="$DIR/compiler.awk" VM="$DIR/vm.awk" @@ -35,10 +41,11 @@ STATE_FILE="/tmp/scheme_vm.state" cleanup() { debug "Cleaning up temp dir: $TMPDIR" rm -rf "$TMPDIR" - if [ "$1" != "keep_state" ]; then - rm -f "$STATE_FILE" - rm -f "/tmp/scheme_vm.env" - fi + # Temporarily disable state file deletion for debugging + # if [ "$1" != "keep_state" ]; then + # rm -f "$STATE_FILE" + # rm -f "/tmp/scheme_vm.env" + # fi } trap "cleanup" EXIT @@ -48,10 +55,11 @@ ASM_FILE="$TMPDIR/output.asm" DEBUG_FILE="$TMPDIR/debug.out" # Initialize/clear state files at REPL start -if [ "$#" -eq 0 ]; then # Only for interactive mode - : > "/tmp/scheme_vm.state" - : > "/tmp/scheme_vm.env" -fi +# Temporarily disable state file clearing for debugging +# if [ "$#" -eq 0 ]; then # Only for interactive mode +# : > "/tmp/scheme_vm.state" +# : > "/tmp/scheme_vm.env" +# fi # Function to handle evaluation evaluate_expression() { @@ -77,8 +85,10 @@ evaluate_expression() { cat "$ASM_FILE" >&2 debug "Running VM..." - # Use persistent VM state - result=$(awk -v PERSIST=1 -f "$VM" "$ASM_FILE" 2>&1) + # Use persistent VM state and pass debug flag + # BASH FEATURE: -v var=value passes variables to awk + # Unlike JS where you'd use process.env, this sets awk variables directly + result=$(awk -v PERSIST=1 -v DEBUG="$DEBUG" -f "$VM" "$ASM_FILE" 2>&1) debug "VM output: $result" if [ -n "$result" ]; then echo "$result" @@ -130,8 +140,12 @@ while true; do fi # Count parentheses + # BASH FEATURE: $(command) is command substitution - runs command and captures output + # Unlike JS where you'd use require('child_process').execSync(), this is built-in open_parens=$(echo "$line" | tr -cd '(' | wc -c) close_parens=$(echo "$line" | tr -cd ')' | wc -c) + # BASH FEATURE: $((expression)) is arithmetic expansion + # Unlike JS where you'd use eval() or a math library, this evaluates arithmetic expressions paren_count=$((paren_count + open_parens - close_parens)) if [ -n "$current_input" ]; then diff --git a/awk/scheme/scheme/bin/vm.awk b/awk/scheme/scheme/bin/vm.awk index 4e7d2c7..ea15884 100755 --- a/awk/scheme/scheme/bin/vm.awk +++ b/awk/scheme/scheme/bin/vm.awk @@ -15,18 +15,31 @@ BEGIN { T_PAIR = "P" # Cons cells (pairs) T_FUNCTION = "F" # Function references T_NIL = "NIL" # Empty list marker + T_CLOSURE = "CLOSURE" # Closure objects (function + captured environment) # Virtual machine registers stack_ptr = 0 # Points to top of evaluation stack heap_ptr = 0 # Points to next free heap location pc = 0 # Program counter for instruction fetch - # Debug mode disabled by default, can be enabled via DEBUG=1 environment variable + # Original program storage for nested function definitions + # AWK FEATURE: delete array removes all elements from an array + # Unlike JS where you'd set array = [], this clears the array in place + delete original_program # Stores the original program before function calls + + # AWK FEATURE: ENVIRON is a built-in array containing environment variables DEBUG = (ENVIRON["DEBUG"] == "1") ? 1 : 0 # Environment for variable bindings env_size = 0 # Current size of environment stack + # Closure support - captured environments for nested lambdas + delete closure_envs # Maps env_id to captured environment arrays + delete closure_env_names # Variable names in captured environments + delete closure_env_vals # Variable values in captured environments + delete closure_env_sizes # Size of each captured environment + next_env_id = 1 # Counter for generating unique environment IDs + # Function table for storing defined functions delete func_def_names # Function names delete func_def_pc # Entry points @@ -36,29 +49,57 @@ BEGIN { # Call stack for function returns call_stack_ptr = 0 + # Global function registry - clear it first + delete FUNCTIONS # Maps function names to implementations + # State persistence configuration STATE_FILE = "/tmp/scheme_vm.state" + debug("STATE_FILE_PATH: " STATE_FILE) + debug("PERSIST_FLAG: " PERSIST) if (PERSIST) { debug("Loading state from: " STATE_FILE) + debug("LOADING_STATE: Attempting to read " STATE_FILE) + debug("LOADING_STATE: FUNCTIONS table size before loading: " length(FUNCTIONS)) + # AWK FEATURE: getline is awk's file reading function + # getline var < file reads one line from file into var, returns 1 on success, 0 on EOF, -1 on error + # Unlike JS where you'd use fs.readFileSync(), this reads line by line if ((getline line < STATE_FILE) >= 0) { # Check if file exists and is readable + debug("LOADING_STATE: File opened successfully, first line: " line) + # AWK FEATURE: do-while loop syntax - the condition is checked at the end do { + debug("LOADING_STATE: Processing line: " line) if (line ~ /^FUNC /) { # Parse and load function definition + # AWK FEATURE: sub() modifies the string in place and returns count of replacements sub(/^FUNC /, "", line) name = line sub(/ .*$/, "", name) code = line sub(/^[^ ]+ /, "", code) + # Read the rest of the function code (until next FUNC or end of file) + debug("LOADING_STATE: Reading function code for " name) + while ((getline line < STATE_FILE) > 0) { + debug("LOADING_STATE: Reading line: " line) + if (line ~ /^FUNC /) { + # Found next function, put the line back + debug("LOADING_STATE: Found next function, stopping") + break + } + code = code "\n" line + } + debug("Loaded function: " name) debug("Code: " code) - # Store function in function table - func_def_names[func_def_size] = name - func_def_code[func_def_size] = code - func_def_size++ + # Store function in FUNCTIONS table + FUNCTIONS[name] = code + debug("LOADED_FUNCTION: " name " with code length " length(code)) + debug("LOADED_FUNCTION: FUNCTIONS table size after loading " name ": " length(FUNCTIONS)) + debug("LOADED_FUNCTION: Checking if " name " is in table: " (name in FUNCTIONS)) } } while ((getline line < STATE_FILE) > 0) + # AWK FEATURE: close() closes a file handle close(STATE_FILE) } } @@ -68,8 +109,17 @@ BEGIN { delete func_env_vals # Variable values in function scope delete func_env_sizes # Size of each function's environment - # Global function registry - delete FUNCTIONS # Maps function names to implementations + # Register built-in functions first + debug("REGISTERING_BUILTINS: " length(FUNCTIONS) " functions before") + FUNCTIONS["+"] = "add" + FUNCTIONS["-"] = "subtract" + FUNCTIONS["*"] = "multiply" + FUNCTIONS["/"] = "divide" + FUNCTIONS["="] = "equals" + FUNCTIONS["<"] = "less_than" + FUNCTIONS[">"] = "greater_than" + FUNCTIONS["add1"] = "add_one" + debug("REGISTERING_BUILTINS: " length(FUNCTIONS) " functions after") # Environment persistence configuration ENV_STATE_FILE = "/tmp/scheme_vm.env" @@ -97,16 +147,6 @@ BEGIN { } } - # Register built-in functions - FUNCTIONS["+"] = "add" - FUNCTIONS["-"] = "subtract" - FUNCTIONS["*"] = "multiply" - FUNCTIONS["/"] = "divide" - FUNCTIONS["="] = "equals" - FUNCTIONS["<"] = "less_than" - FUNCTIONS[">"] = "greater_than" - FUNCTIONS["add1"] = "add_one" - # Track if VM halted normally (vs error) normal_exit = 0 } @@ -123,12 +163,17 @@ function makeValue(type, val) { } function getType(val) { + # AWK FEATURE: substr(string, start, length) extracts substring + # Unlike JS string.slice(), this is 1-indexed and requires explicit length + # AWK FEATURE: index(string, substring) returns position of substring (1-indexed) + # Unlike JS string.indexOf(), this returns 0 if not found and is 1-indexed type = substr(val, 1, index(val, ":") - 1) debug("Get type: " type " from " val) return type } function getValue(val) { + # AWK FEATURE: index() returns 1-indexed position, so we add 1 to get after the colon value = substr(val, index(val, ":") + 1) debug("Get value: " value " from " val) return value @@ -141,6 +186,77 @@ function isSymbol(val) { return getType(val) == T_SYMBOL } function isPair(val) { return getType(val) == T_PAIR } function isFunction(val) { return getType(val) == T_FUNCTION } function isNil(val) { return getType(val) == T_NIL } +function isClosure(val) { return getType(val) == T_CLOSURE } + +# Closure helper functions +function makeClosure(func_name, env_id) { + return T_CLOSURE ":" func_name ":" env_id +} + +function getClosureFunction(closure_val) { + # Extract function name from CLOSURE:func_name:env_id + return substr(closure_val, index(closure_val, ":") + 1, + index(substr(closure_val, index(closure_val, ":") + 1), ":") - 1) +} + +function getClosureEnvId(closure_val) { + # Extract env_id from CLOSURE:func_name:env_id + first_colon = index(closure_val, ":") + second_colon = index(substr(closure_val, first_colon + 1), ":") + first_colon + return substr(closure_val, second_colon + 1) +} + +# Environment capture for closures +function captureEnvironment(env_id, i) { + debug("Capturing environment with ID: " env_id) + closure_env_sizes[env_id] = env_size + + # Copy current environment to closure environment + for (i = 0; i < env_size; i++) { + # Only capture non-global variables + if (env_name[i] !~ /^__global_/) { + closure_env_names[env_id, i] = env_name[i] + closure_env_vals[env_id, i] = env_val[i] + debug("Captured: " env_name[i] " = " env_val[i]) + } + } + + debug("Captured environment size: " closure_env_sizes[env_id]) +} + +# VM instruction to capture environment +function vm_capture_env(func_name) { + debug("Capturing environment for function: " func_name) + env_id = next_env_id++ + captureEnvironment(env_id) + + # Replace the placeholder ENV_ID in the closure value + # Find the closure value on the stack and update it + if (stack_ptr > 0) { + closure_val = stack[stack_ptr] + if (closure_val ~ /^CLOSURE:/) { + # Replace ENV_ID with actual env_id + new_closure_val = "CLOSURE:" func_name ":" env_id + stack[stack_ptr] = new_closure_val + debug("Updated closure value: " new_closure_val) + } + } +} + +# Environment restoration for closures +function pushClosureEnvironment(env_id, i) { + debug("Pushing closure environment: " env_id) + if (env_id in closure_env_sizes) { + for (i = 0; i < closure_env_sizes[env_id]; i++) { + if ((env_id, i) in closure_env_names) { + env_name[env_size] = closure_env_names[env_id, i] + env_val[env_size] = closure_env_vals[env_id, i] + env_size++ + debug("Restored: " closure_env_names[env_id, i] " = " closure_env_vals[env_id, i]) + } + } + } +} # Stack operations function push(val) { @@ -279,6 +395,8 @@ function vm_less_than() { # Main instruction execution loop function execute(instr) { + # AWK FEATURE: split(string, array, separator) splits string into array elements + # Unlike JS string.split() which returns an array, this populates an existing array split(instr, parts, " ") op = parts[1] debug("Execute: " instr) @@ -298,6 +416,7 @@ function execute(instr) { if (stack_ptr < 2) error("SWAP requires two operands") val2 = pop() val1 = pop() + val1 = pop() push(val2) push(val1) } @@ -333,14 +452,20 @@ function execute(instr) { print peek() } else if (op == "HALT") { + debug("EXECUTING_HALT") normal_exit = 1 + debug("HALT: Stack pointer = " stack_ptr) if (stack_ptr > 0) { result = peek() + debug("HALT: Result = " result) + } else { + debug("HALT: Stack is empty") } if (PERSIST) { save_state() } if (result) { + debug("PRINTING_RESULT: " result) print result } exit(0) @@ -358,27 +483,40 @@ function execute(instr) { vm_define_function(parts[2], pc) } else if (op == "CALL") { - vm_call_function(parts[2]) + vm_call_function() } else if (op == "RETURN") { - vm_return() + debug("EXECUTING_RETURN") + # The call_stack_ptr is no longer used for return, so this instruction is effectively removed. + # The function execution itself handles the return. } else if (op == "GET_VALUE") { vm_get_value() } + else if (op == "CAPTURE_ENV") { + vm_capture_env(parts[2]) + } else { error("Unknown instruction: " op) } } # Load program instructions +# AWK FEATURE: Each line of input is automatically processed by this block +# NR is a built-in variable that contains the current record (line) number +# Unlike JS where you'd need to track line numbers manually { - program[NR-1] = $0 + program[NR-1] = $0 # $0 is the current line being processed + original_program[NR-1] = $0 # Keep a copy of the original program } -# Main execution loop +# AWK FEATURE: END block runs after all input has been processed +# This is like a "finally" block that always executes after reading all input END { + # AWK FEATURE: length(array) returns the number of elements in an array + # Unlike JS array.length, this is a function call, not a property while (pc < length(program)) { + # debug("EXECUTING_PC_" pc ": " program[pc]) execute(program[pc++]) } @@ -421,6 +559,8 @@ function vm_store(name) { val = peek() if (isSymbol(val)) { func_name = getValue(val) + # AWK FEATURE: ~ is the regex match operator (like /pattern/.test() in JS) + # The pattern is a regex literal, not a string if (func_name ~ /^__lambda_/) { # Store the function code under the new name FUNCTIONS[name] = FUNCTIONS[func_name] @@ -460,61 +600,161 @@ function vm_pop_env() { function vm_lookup(name, i, global_name, val) { debug("Looking up " name " in environment of size: " env_size) dump_env() - - # Check if it's a function (built-in or user-defined) - if (name in FUNCTIONS) { - debug("Found function: " name) - push(makeValue(T_SYMBOL, name)) - return - } - + # Try global name first, then local global_name = "__global_" name + debug("LOOKUP_ENV: Looking for " name " in environment of size " env_size) for (i = env_size - 1; i >= 0; i--) { + debug("LOOKUP_ENV: Checking index " i ": " env_name[i] " = " env_val[i]) if (env_name[i] == global_name || env_name[i] == name) { debug("Found " name " = " env_val[i] " at position " i) push(env_val[i]) return } } + + # Check if it's a function (built-in or user-defined) + debug("LOOKUP_CHECKING: " name " in FUNCTIONS table") + debug("FUNCTIONS_TABLE_SIZE: " length(FUNCTIONS)) + debug("FUNCTIONS_IN_TABLE:") + # AWK FEATURE: for (var in array) iterates over array keys + # Unlike JS for...in which includes inherited properties, awk arrays don't have inheritance + for (f in FUNCTIONS) { + debug(" " f) + } + # AWK FEATURE: 'in' operator checks if key exists in array + # Unlike JS where you'd use array.hasOwnProperty(key) or 'key' in array + if (name in FUNCTIONS) { + debug("Found function: " name) + push(makeValue(T_SYMBOL, name)) + return + } + debug("FUNCTION_LOOKUP_FAILED: " name) error("Undefined variable: " name) } # Function definition implementation function vm_define_function(name, start_pc) { debug("Defining function: " name " at " start_pc) + debug("call_stack_ptr: " call_stack_ptr) + + # Clear the from_define flag + for (i = env_size - 1; i >= 0; i--) { + if (env_name[i] == "from_define") { + debug("Clearing from_define flag") + env_size-- + break + } + } # Build function code code = "" i = start_pc - while (i < length(program) && program[i] != "RETURN") { - if (code != "") code = code "\n" - code = code program[i] - i++ + + # If we're inside a function call, we need to find the corresponding position + # in the original program. For now, use a simple heuristic. + if (call_stack_ptr > 0) { + debug("Nested function definition - using current instruction") + # Just read from the current program position + # AWK FEATURE: length(array) returns the number of elements in an array + # Unlike JS array.length, this is a function call, not a property + while (i < length(program) && program[i] != "RETURN") { + if (code != "") code = code "\n" + code = code program[i] + i++ + } + } else { + debug("Top-level function definition - using original program") + # Use original_program for top-level function definitions + # AWK FEATURE: length(array) returns the number of elements in an array + # Unlike JS array.length, this is a function call, not a property + while (i < length(original_program) && original_program[i] != "RETURN") { + if (code != "") code = code "\n" + code = code original_program[i] + i++ + } } code = code "\nRETURN" # Store function debug("Storing function: " name " = " code) FUNCTIONS[name] = code + debug("Function stored. Total functions: " length(FUNCTIONS)) + debug("FUNCTION_STORED: " name) pc = i + 1 } # Function call implementation -function vm_call_function(func_name, code_lines, j, saved_pc, saved_env_size, arg, param_name) { +function vm_call_function(code_lines, j, saved_pc, saved_env_size, arg, param_name) { + # Get function name from stack + func_name = pop() debug("Calling function: " func_name) + debug("CALLING_FUNCTION: " func_name) # If name is a symbol, get its value if (isSymbol(func_name)) { func_name = getValue(func_name) } - # Handle anonymous functions + # Check if it's a built-in function first + if (func_name in FUNCTIONS) { + built_in_name = FUNCTIONS[func_name] + if (built_in_name == "add") { + debug("Calling built-in function: add") + add() + return + } else if (built_in_name == "subtract") { + debug("Calling built-in function: subtract") + subtract() + return + } else if (built_in_name == "multiply") { + debug("Calling built-in function: multiply") + multiply() + return + } else if (built_in_name == "divide") { + debug("Calling built-in function: divide") + divide() + return + } else if (built_in_name == "equals") { + debug("Calling built-in function: equals") + equals() + return + } else if (built_in_name == "less_than") { + debug("Calling built-in function: less_than") + less_than() + return + } else if (built_in_name == "greater_than") { + debug("Calling built-in function: greater_than") + greater_than() + return + } else if (built_in_name == "add_one") { + debug("Calling built-in function: add_one") + add_one() + return + } + } + + # Handle anonymous functions and closures if (func_name ~ /^__lambda_/) { if (!(func_name in FUNCTIONS)) { error("Undefined lambda function: " func_name) } + } else if (isClosure(func_name)) { + # This is a closure call + debug("Calling closure: " func_name) + closure_func = getClosureFunction(func_name) + closure_env_id = getClosureEnvId(func_name) + + if (!(closure_func in FUNCTIONS)) { + error("Undefined closure function: " closure_func) + } + + # Restore the captured environment + pushClosureEnvironment(closure_env_id) + + # Use the closure's function name for the rest of the call + func_name = closure_func } else if (!(func_name in FUNCTIONS)) { error("Undefined function: " func_name) } @@ -522,69 +762,100 @@ function vm_call_function(func_name, code_lines, j, saved_pc, saved_env_size, ar saved_pc = pc saved_env_size = env_size - # Split function code into lines + # AWK FEATURE: split(string, array, separator) splits string into array elements + # Unlike JS string.split() which returns an array, this populates an existing array split(FUNCTIONS[func_name], code_lines, "\n") - # Add function code to program at current position - for (j in code_lines) { - program[pc + j - 1] = code_lines[j] - } - # Check if this is a parameterized function if (code_lines[1] ~ /^STORE /) { # This is a parameterized function (lambda) # Get parameter name from STORE instruction param_name = substr(code_lines[1], 7) debug("Found parameter name: " param_name) + debug("FUNCTION_PARAM: " param_name) # Get argument from stack arg = pop() debug("Function argument: " arg) + debug("FUNCTION_ARG: " arg) # Create new environment frame debug("Creating new environment frame at size: " env_size) env_name[env_size] = param_name env_val[env_size] = arg env_size++ + debug("FUNCTION_ENV_STORE: " param_name " = " arg " at index " (env_size-1)) + + # Execute function code directly, skipping STORE and POP_ENV instructions + for (j = 2; j <= length(code_lines); j++) { + if (code_lines[j] != "" && code_lines[j] != "POP_ENV") { + debug("Executing function instruction: " code_lines[j]) + execute(code_lines[j]) + } + } + + # Clean up parameter + vm_pop_env() + + # Return to caller + debug("Function completed, returning to PC: " saved_pc) + pc = saved_pc + return } else { # This is a built-in function or non-parameterized function debug("Calling non-parameterized function: " func_name) - } - - # Save return info and jump to function - call_stack[++call_stack_ptr] = saved_pc - env_stack[call_stack_ptr] = saved_env_size - - debug("Function found, jumping to PC: " pc " with env_size: " saved_env_size) - dump_env() -} - -# Function return implementation -function vm_return() { - if (call_stack_ptr > 0) { - # Save return value - ret_val = pop() - # Restore environment - while (env_size > env_stack[call_stack_ptr]) { - debug("Popping environment at size: " env_size) - vm_pop_env() + # Execute all function code directly + for (j in code_lines) { + if (code_lines[j] != "") { + debug("Executing function instruction: " code_lines[j]) + execute(code_lines[j]) + } } - # Restore program counter - pc = call_stack[call_stack_ptr--] - - # Push return value - push(ret_val) - - debug("Returned with value: " ret_val " and env_size: " env_size) + # Return to caller + debug("Function completed, returning to PC: " saved_pc) + pc = saved_pc + return } } +# Function return implementation - no longer needed with direct execution +# function vm_return() { +# debug("VM_RETURN: call_stack_ptr = " call_stack_ptr) +# if (call_stack_ptr > 0) { +# # Save return value +# ret_val = pop() +# debug("VM_RETURN: return value = " ret_val) +# +# # Restore environment +# while (env_size > env_stack[call_stack_ptr]) { +# debug("Popping environment at size: " env_size) +# vm_pop_env() +# } +# +# # Restore program counter +# pc = call_stack[call_stack_ptr--] +# debug("VM_RETURN: restored PC = " pc) +# +# # Restore the original program at the call position +# program[pc] = original_program_at_call[call_stack_ptr + 1] +# debug("Restored original program: " original_program_at_call[call_stack_ptr + 1]) +# +# # Push return value +# push(ret_val) +# debug("VM_RETURN: pushed return value " ret_val) +# +# debug("Returned with value: " ret_val " and env_size: " env_size) +# } +# } + # Debug helper to dump environment contents function dump_env( i) { debug("Environment dump:") for (i = 0; i < env_size; i++) { + # AWK FEATURE: sprintf() formats a string like printf but returns it instead of printing + # Unlike JS where you'd use template literals or String.format(), this is the awk way debug(sprintf(" %d: %s = %s", i, env_name[i], env_val[i])) } } @@ -601,11 +872,22 @@ function lookup_no_error(name, i) { # State persistence implementation function save_state() { + debug("SAVE_STATE_CALLED") debug("Saving state to: " STATE_FILE) - for (i = 0; i < func_def_size; i++) { - debug("Saving function: " func_def_names[i]) - print "FUNC " func_def_names[i] " " func_def_code[i] > STATE_FILE + # Save functions from FUNCTIONS table + debug("SAVE_STATE: Saving " length(FUNCTIONS) " functions") + for (func_name in FUNCTIONS) { + if (func_name !~ /^__lambda_/) { # Don't save lambda functions + debug("Saving function: " func_name) + debug("SAVE_STATE: About to write function " func_name) + debug("SAVE_STATE: Function code length: " length(FUNCTIONS[func_name])) + # AWK FEATURE: printf with > file redirects output to a file + # Unlike JS where you'd use fs.writeFileSync(), this redirects from stdout to file + printf "FUNC %s %s\n", func_name, FUNCTIONS[func_name] > STATE_FILE + debug("SAVE_STATE: Saved function " func_name " to " STATE_FILE) + } } + # AWK FEATURE: close() closes a file handle close(STATE_FILE) # Save environment state @@ -613,6 +895,8 @@ function save_state() { for (i = 0; i < env_size; i++) { if (env_name[i] ~ /^__global_/) { # Only save globals debug("Saving env var: " env_name[i] " = " env_val[i]) + # AWK FEATURE: print with > file redirects output to a file + # Unlike JS console.log() which always goes to stdout print "ENV " env_name[i] " " env_val[i] > ENV_STATE_FILE } } @@ -694,7 +978,7 @@ function add_one() { # Get value from top of stack function vm_get_value() { - val = peek() + val = pop() if (isSymbol(val)) { name = getValue(val) # If it's a function name, just push the name directly @@ -703,5 +987,11 @@ function vm_get_value() { } else { push(makeValue(T_SYMBOL, name)) } + } else if (isClosure(val)) { + # For closures, just push the closure value back + push(val) + } else { + # For other types, just push the value back + push(val) } } \ No newline at end of file diff --git a/awk/scheme/scheme/diagram.md b/awk/scheme/scheme/diagram.md index 4a719b4..689a44a 100644 --- a/awk/scheme/scheme/diagram.md +++ b/awk/scheme/scheme/diagram.md @@ -5,11 +5,12 @@ +----------------+ Scheme Code +----------------+ Assembly +----------------+ | | -----------------> | | -------------> | | | REPL | "(+ 1 2)" | Compiler | "PUSH_CONST | VM | -| (bin/repl) | | compiler.awk | N:1 | vm.awk | +| (bin/repl) | "(lambda (x) ...)" | compiler.awk | N:1 | vm.awk | | | | | PUSH_CONST | | | - Multi-line | | - Lexer | N:2 | - Stack-based | | - Debug mode | | - Parser | ADD | - Type system | | - File input | | - Code gen | HALT" | - Environment | +| | | - Closure gen | CAPTURE_ENV | - Direct exec | | | <----------------- | | <------------- | | | | Output: "N:3" | | Result | | +----------------+ +----------------+ +----------------+ @@ -27,7 +28,7 @@ Debug Flow (when DEBUG=1): | REPL | -----------------> | Compiler | -------------> | VM | | | [Input received] | | [Tokens found] | | | [Debug output] | | [Parsing tree] | [Stack ops] | [Stack state] | -| stderr | <----------------- | [Gen code] | <------------- | [Environment] | +| stderr | <----------------- | [Gen code] | [Closure info] | [Environment] | +----------------+ +----------------+ +----------------+ Execution Flow Example: @@ -40,10 +41,51 @@ Execution Flow Example: │ │ │ │ │ ADD │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ +Closure Flow Example: +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Input: │ │ Lexer/ │ │ Code Gen │ │ VM │ +│(let ((x 10))│ ------->│ Parser: │ ------> │ STORE x │ ------> │ Env: [x=10] │ +│ ((lambda │ │ (let │ │ LABEL λ │ │ │ +│ (y) (+ x │ │ ((x 10)) │ │ CAPTURE_ENV │ │ Closure: │ +│ y)) 32))) │ │ (lambda │ │ PUSH_CONST │ │ λ+env_id │ +│ │ │ (y) ...) │ │ CLOSURE:... │ │ │ +│ │ │ │ │ CALL │ │ Result: 42 │ +└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ + State Management: -┌─────────────┐ ┌─────────────┐ ┌─────────────┐ -│ Global │ │ Environment │ │ Function │ -│ Variables │ ------> │ Stack │ ------> │ Calls │ -│ (persist) │ │ (frames) │ │ (stack) │ -└─────────────┘ └─────────────┘ └─────────────┘ -``` \ No newline at end of file +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Global │ │ Environment │ │ Function │ │ Closure │ +│ Variables │ ------> │ Stack │ ------> │ Calls │ ------> │ Environment │ +│ (persist) │ │ (frames) │ │ (direct) │ │ (captured) │ +└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ + +Data Type System: +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Numbers │ │ Booleans │ │ Symbols │ │ Closures │ +│ N:value │ │ B:value │ │ S:name │ │CLOSURE:fn:env_id│ +└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ + │ │ │ │ + └───────────────────┼───────────────────┼───────────────────┘ + │ │ + ┌─────────────┐ ┌─────────────┐ + │ Pairs │ │ Nil │ + │ P:index │ │ NIL: │ + └─────────────┘ └─────────────┘ + +VM Instruction Set: +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ Stack Ops │ │ Environment │ │ Functions │ +│ PUSH_CONST val │ │ STORE name │ │ LABEL name │ +│ POP │ │ LOOKUP name │ │ CALL │ +│ ADD/SUB/MUL/DIV │ │ POP_ENV │ │ RETURN │ +│ GET_VALUE │ │ │ │ │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ + │ │ │ + └───────────────────────┼───────────────────────┘ + │ + ┌─────────────────┐ + │ Closures │ + │ CAPTURE_ENV fn │ + │ PUSH_CONST │ + │ CLOSURE:fn:env │ + └─────────────────┘ \ No newline at end of file diff --git a/awk/scheme/scheme/scratch/ideas-for-string-support.md b/awk/scheme/scheme/scratch/ideas-for-string-support.md new file mode 100644 index 0000000..0d81e5f --- /dev/null +++ b/awk/scheme/scheme/scratch/ideas-for-string-support.md @@ -0,0 +1,556 @@ +# String Support Implementation Plan for Awk-Scheme + +## Overview +This document outlines a comprehensive plan for adding string support to the Awk-Scheme interpreter. The implementation will follow the existing architecture patterns and maintain consistency with the current type system and execution model. + +## Current Architecture Analysis + +### Type System +- Current types: `N:`, `B:`, `S:`, `P:`, `F:`, `NIL:`, `CLOSURE:` +- All values use `type:value` format for runtime type checking +- Type checking via `getType()` and `getValue()` functions +- Type predicates: `isNumber()`, `isBoolean()`, `isSymbol()`, etc. + +### Compiler Pipeline +1. **Lexer** (`next_token()`): Handles numbers, symbols, parentheses +2. **Parser** (`parse_expr()`, `parse_list()`): Builds expression trees +3. **Code Generator**: Emits VM instructions for different expression types +4. **Expression Compiler** (`compile_expr()`): Dispatches based on expression type + +### VM Architecture +- Stack-based execution with typed values +- Environment-based variable binding +- Built-in function registry (`FUNCTIONS` table) +- Direct function execution (no program array modification) +- State persistence for globals and functions + +## String Implementation Plan + +### Phase 1: Core String Type System + +#### 1.1 VM Type System Extensions +**File: `bin/vm.awk`** + +```awk +# Add to BEGIN block +T_STRING = "STR" # String type tag + +# Add type predicate +function isString(val) { return getType(val) == T_STRING } + +# String value constructor +function makeString(val) { + return T_STRING ":" val +} +``` + +**Changes Required:** +- Add `T_STRING` constant in BEGIN block +- Add `isString()` predicate function +- Add `makeString()` constructor function +- Update type checking in existing operations where needed + +#### 1.2 String Storage Strategy +**Options:** +1. **Inline storage**: Store strings directly in the type:value format + - Pros: Simple, fast access + - Cons: Limited by awk string length, no escaping issues + +2. **Heap storage**: Store strings in heap like cons cells + - Pros: Unlimited length, consistent with existing patterns + - Cons: More complex, requires string management + +**Recommendation**: Start with inline storage for simplicity, can migrate to heap later. + +### Phase 2: Compiler Extensions + +#### 2.1 Lexer String Support +**File: `bin/compiler.awk`** + +**Current lexer handles:** +- Numbers: `-?[0-9]+` +- Symbols: `[a-zA-Z_][a-zA-Z0-9_]*` +- Parentheses: `(`, `)` + +**New string tokenization:** +```awk +# Add to next_token() function +# Handle string literals (double quotes) +if (c == "\"") { + str = "" + input_buffer = substr(input_buffer, 2) # Skip opening quote + + while (length(input_buffer) > 0) { + c = substr(input_buffer, 1, 1) + if (c == "\"") { + input_buffer = substr(input_buffer, 2) # Skip closing quote + break + } + if (c == "\\") { + # Handle escape sequences + input_buffer = substr(input_buffer, 2) + if (length(input_buffer) > 0) { + c = substr(input_buffer, 1, 1) + if (c == "n") str = str "\n" + else if (c == "t") str = str "\t" + else if (c == "\\") str = str "\\" + else if (c == "\"") str = str "\"" + else error("Invalid escape sequence: \\" c) + input_buffer = substr(input_buffer, 2) + } + } else { + str = str c + input_buffer = substr(input_buffer, 2) + } + } + return "\"" str "\"" # Return with quotes for identification +} +``` + +**Escape Sequences to Support:** +- `\"` - Literal double quote +- `\\` - Literal backslash +- `\n` - Newline +- `\t` - Tab +- `\r` - Carriage return + +#### 2.2 Parser String Support +**File: `bin/compiler.awk`** + +**Update `parse_expr()`:** +```awk +function parse_expr(token, result) { + token = next_token() + if (token == "EOF") return "" + + if (token == "(") { + result = parse_list() + debug("Parsed list: " result) + return result + } + + # Handle string literals + if (substr(token, 1, 1) == "\"") { + debug("Parsed string: " token) + return token + } + + debug("Parsed token: " token) + return token +} +``` + +#### 2.3 Code Generation for Strings +**File: `bin/compiler.awk`** + +**Add to `compile_expr()`:** +```awk +function compile_expr(expr, split_result, op, args) { + debug("Compiling expression: " expr) + + # Handle string literals + if (substr(expr, 1, 1) == "\"") { + compile_string(expr) + return + } + + # ... existing code for numbers, nil, variables, etc. +} + +function compile_string(str) { + debug("Compiling string: " str) + # Remove outer quotes and emit string constant + content = substr(str, 2, length(str) - 2) + print "PUSH_CONST STR:" content +} +``` + +### Phase 3: VM String Operations + +#### 3.1 String Built-in Functions +**File: `bin/vm.awk`** + +**Add to BEGIN block (built-in registration):** +```awk +# String operations +FUNCTIONS["string-length"] = "string_length" +FUNCTIONS["string-append"] = "string_append" +FUNCTIONS["string-ref"] = "string_ref" +FUNCTIONS["substring"] = "substring" +FUNCTIONS["string=?"] = "string_equal" +FUNCTIONS["string<?"] = "string_less_than" +FUNCTIONS["string>?"] = "string_greater_than" +FUNCTIONS["string-ci=?"] = "string_ci_equal" +FUNCTIONS["string-ci<?"] = "string_ci_less_than" +FUNCTIONS["string-ci>?"] = "string_ci_greater_than" +``` + +**Add to function call dispatch in `vm_call_function()`:** +```awk +} else if (built_in_name == "string_length") { + debug("Calling built-in function: string_length") + string_length() + return +} else if (built_in_name == "string_append") { + debug("Calling built-in function: string_append") + string_append() + return +# ... etc for other string functions +``` + +#### 3.2 String Function Implementations +**File: `bin/vm.awk`** + +```awk +# String length +function string_length() { + if (stack_ptr < 1) error("string-length requires one operand") + val = pop() + if (!isString(val)) error("string-length requires string operand") + str = getValue(val) + push(makeValue(T_NUMBER, length(str))) +} + +# String concatenation +function string_append() { + if (stack_ptr < 2) error("string-append requires at least two operands") + result = "" + # Pop all arguments and concatenate + while (stack_ptr > 0) { + val = pop() + if (!isString(val)) error("string-append requires string operands") + result = getValue(val) result + } + push(makeString(result)) +} + +# String character access (0-indexed) +function string_ref() { + if (stack_ptr < 2) error("string-ref requires two operands") + index_val = pop() + str_val = pop() + if (!isNumber(index_val)) error("string-ref index must be a number") + if (!isString(str_val)) error("string-ref requires string operand") + + str = getValue(str_val) + idx = getValue(index_val) + 1 # Convert to 1-indexed + if (idx < 1 || idx > length(str)) error("string-ref index out of bounds") + + char = substr(str, idx, 1) + push(makeString(char)) +} + +# Substring extraction +function substring() { + if (stack_ptr < 3) error("substring requires three operands") + end = pop() + start = pop() + str_val = pop() + + if (!isNumber(start) || !isNumber(end)) error("substring indices must be numbers") + if (!isString(str_val)) error("substring requires string operand") + + str = getValue(str_val) + start_idx = getValue(start) + 1 # Convert to 1-indexed + end_idx = getValue(end) + 1 + + if (start_idx < 1 || end_idx > length(str) || start_idx > end_idx) { + error("substring indices out of bounds") + } + + result = substr(str, start_idx, end_idx - start_idx + 1) + push(makeString(result)) +} + +# String comparison (case-sensitive) +function string_equal() { + if (stack_ptr < 2) error("string=? requires two operands") + val2 = pop() + val1 = pop() + if (!isString(val1) || !isString(val2)) error("string=? requires string operands") + result = (getValue(val1) == getValue(val2)) ? "1" : "0" + push(makeValue(T_BOOLEAN, result)) +} + +function string_less_than() { + if (stack_ptr < 2) error("string<? requires two operands") + val2 = pop() + val1 = pop() + if (!isString(val1) || !isString(val2)) error("string<? requires string operands") + result = (getValue(val1) < getValue(val2)) ? "1" : "0" + push(makeValue(T_BOOLEAN, result)) +} + +function string_greater_than() { + if (stack_ptr < 2) error("string>? requires two operands") + val2 = pop() + val1 = pop() + if (!isString(val1) || !isString(val2)) error("string>? requires string operands") + result = (getValue(val1) > getValue(val2)) ? "1" : "0" + push(makeValue(T_BOOLEAN, result)) +} + +# Case-insensitive string comparison +function string_ci_equal() { + if (stack_ptr < 2) error("string-ci=? requires two operands") + val2 = pop() + val1 = pop() + if (!isString(val1) || !isString(val2)) error("string-ci=? requires string operands") + # Convert to lowercase for comparison + str1 = tolower(getValue(val1)) + str2 = tolower(getValue(val2)) + result = (str1 == str2) ? "1" : "0" + push(makeValue(T_BOOLEAN, result)) +} + +function string_ci_less_than() { + if (stack_ptr < 2) error("string-ci<? requires two operands") + val2 = pop() + val1 = pop() + if (!isString(val1) || !isString(val2)) error("string-ci<? requires string operands") + str1 = tolower(getValue(val1)) + str2 = tolower(getValue(val2)) + result = (str1 < str2) ? "1" : "0" + push(makeValue(T_BOOLEAN, result)) +} + +function string_ci_greater_than() { + if (stack_ptr < 2) error("string-ci>? requires two operands") + val2 = pop() + val1 = pop() + if (!isString(val1) || !isString(val2)) error("string-ci>? requires string operands") + str1 = tolower(getValue(val1)) + str2 = tolower(getValue(val2)) + result = (str1 > str2) ? "1" : "0" + push(makeValue(T_BOOLEAN, result)) +} +``` + +### Phase 4: Enhanced String Operations + +#### 4.1 Additional String Functions +```awk +# String to number conversion +function string_to_number() { + if (stack_ptr < 1) error("string->number requires one operand") + val = pop() + if (!isString(val)) error("string->number requires string operand") + str = getValue(val) + if (str ~ /^-?[0-9]+$/) { + push(makeValue(T_NUMBER, str)) + } else { + push(makeValue(T_BOOLEAN, "0")) # Return false for invalid numbers + } +} + +# Number to string conversion +function number_to_string() { + if (stack_ptr < 1) error("number->string requires one operand") + val = pop() + if (!isNumber(val)) error("number->string requires number operand") + num = getValue(val) + push(makeString(num)) +} + +# String splitting +function string_split() { + if (stack_ptr < 2) error("string-split requires two operands") + delimiter = pop() + str_val = pop() + if (!isString(delimiter) || !isString(str_val)) { + error("string-split requires string operands") + } + + str = getValue(str_val) + delim = getValue(delimiter) + + # Use awk's split function + split(str, parts, delim) + result = "" + for (i = 1; i <= length(parts); i++) { + if (result != "") result = result " " + result = result "\"" parts[i] "\"" + } + push(makeString("(" result ")")) +} +``` + +#### 4.2 String Formatting +```awk +# String formatting (simple version) +function format() { + if (stack_ptr < 2) error("format requires at least two operands") + format_str = pop() + if (!isString(format_str)) error("format requires string format operand") + + # Simple implementation - replace ~a with arguments + fmt = getValue(format_str) + arg_count = 0 + + while (stack_ptr > 0) { + val = pop() + arg_count++ + # Replace ~a with the value + gsub(/~a/, val, fmt) + } + + push(makeString(fmt)) +} +``` + +### Phase 5: State Persistence for Strings + +#### 5.1 String State Management +**File: `bin/vm.awk`** + +**Update `save_state()` and state loading:** +- Strings stored in environment will be automatically persisted +- No special handling needed since strings use inline storage +- String literals in function definitions will be preserved + +**Considerations:** +- Long strings may impact state file size +- Escape sequences need proper handling in state files +- String encoding consistency across sessions + +### Phase 6: Testing and Validation + +#### 6.1 Test Cases +**Create test files:** + +```scheme +; Basic string literals +"hello world" +"" +"\"quoted\"" +"line1\nline2" + +; String operations +(string-length "hello") +(string-append "hello" " " "world") +(string-ref "hello" 0) +(substring "hello world" 0 4) + +; String comparisons +(string=? "hello" "hello") +(string=? "hello" "world") +(string<? "abc" "def") +(string>? "xyz" "abc") + +; Case-insensitive comparisons +(string-ci=? "Hello" "hello") +(string-ci<? "ABC" "def") + +; Conversions +(string->number "123") +(number->string 456) +(string->number "abc") + +; String in expressions +(define greeting "Hello") +(define name "World") +(string-append greeting " " name) + +; Strings in functions +(define (greet name) (string-append "Hello, " name "!")) +(greet "Alice") +``` + +#### 6.2 Edge Cases to Test +- Empty strings +- Strings with special characters +- Escape sequences +- Very long strings +- String operations on non-string types +- String comparisons with different encodings +- String persistence across REPL sessions + +### Phase 7: Performance and Optimization + +#### 7.1 Performance Considerations +- **String concatenation**: Current `string-append` is O(n²) - consider optimization +- **String storage**: Monitor memory usage with large strings +- **String comparisons**: Consider early termination for long strings +- **Escape sequence processing**: Optimize lexer for common cases + +#### 7.2 Potential Optimizations +```awk +# Optimized string concatenation (build in reverse) +function string_append_optimized() { + if (stack_ptr < 2) error("string-append requires at least two operands") + + # Count arguments and pre-allocate + arg_count = 0 + temp_stack[0] = stack_ptr + + # Pop all arguments to temp storage + while (stack_ptr > 0) { + val = pop() + if (!isString(val)) error("string-append requires string operands") + temp_stack[++arg_count] = getValue(val) + } + + # Concatenate in reverse order (more efficient) + result = "" + for (i = arg_count; i >= 1; i--) { + result = result temp_stack[i] + } + + push(makeString(result)) +} +``` + +### Phase 8: Documentation and Examples + +#### 8.1 Update README.md +- Add string data type to type system documentation +- Include string usage examples +- Document all string built-in functions +- Add string limitations and considerations + +#### 8.2 Update diagram.md +- Add string type to data type diagram +- Include string operations in VM instruction set +- Show string flow in execution examples + +### Implementation Order and Risk Assessment + +#### High Priority (Core Functionality) +1. **Type system extensions** - Low risk, foundational +2. **Lexer string support** - Medium risk, affects parsing +3. **Basic string operations** - Low risk, self-contained +4. **String literals in expressions** - Medium risk, integration + +#### Medium Priority (Enhanced Features) +5. **String comparisons** - Low risk, straightforward +6. **String conversions** - Low risk, utility functions +7. **State persistence** - Low risk, automatic + +#### Lower Priority (Advanced Features) +8. **String formatting** - Medium risk, complex parsing +9. **String splitting** - Low risk, utility function +10. **Performance optimizations** - Low risk, optional + +#### Risk Mitigation +- **Regression testing**: Ensure existing functionality unchanged +- **Incremental implementation**: Add features one at a time +- **Comprehensive testing**: Test all edge cases +- **Backward compatibility**: Maintain existing API + +### Success Criteria +1. String literals parse and compile correctly +2. Basic string operations work as expected +3. String state persists across REPL sessions +4. No regression in existing functionality +5. Performance remains acceptable +6. Documentation is complete and accurate + +### Future Enhancements +1. **Heap-based string storage** for very long strings +2. **String interning** for memory efficiency +3. **Unicode support** (complex, requires significant changes) +4. **Regular expression support** for string matching +5. **String formatting with more features** (like Common Lisp format) +6. **String streams** for efficient string building + +This plan provides a comprehensive roadmap for implementing string support while maintaining the existing architecture and ensuring backward compatibility. \ No newline at end of file diff --git a/bash/dds b/bash/dds new file mode 100755 index 0000000..f14a79b --- /dev/null +++ b/bash/dds @@ -0,0 +1,121 @@ +#!/bin/bash + +# Daydreaming System +# This script uses a sequence of LLM calls to refine an initial response. + +# --- Model Configuration --- +RESPONSE_MODEL="llama3:8b-instruct-q4_K_M" +CRITIC_MODEL="phi3:3.8b-mini-4k-instruct-q4_K_M" +REFINE_MODEL="llama3:8b-instruct-q4_K_M" + +# --- Defaults --- +DEFAULT_LOOPS=2 + +# --- Argument Validation --- +if [ "$#" -lt 1 ]; then + echo -e "\n\tDaydreaming" + echo -e "\tThis script uses a sequence of LLM calls to refine an initial response." + echo -e "\n\tUsage: $0 [-f <file_path>] \"<your prompt>\" [number_of_refinement_loops]" + echo -e "\n\tExample: $0 -f ./input.txt \"Please summarize this text file\" 2" + echo -e "\n\tIf number_of_refinement_loops is not provided, the program will default to $DEFAULT_LOOPS loops." + echo -e "\n\t-f <file_path> (optional): Append the contents of the file to the prompt." + echo -e "\n" + exit 1 +fi + +# --- Argument Parsing --- +FILE_PATH="" +while getopts "f:" opt; do + case $opt in + f) + FILE_PATH="$OPTARG" + ;; + *) + echo "Invalid option: -$OPTARG" >&2 + exit 1 + ;; + esac +done +shift $((OPTIND -1)) + +PROMPT="$1" +if [ -z "$2" ]; then + LOOPS=$DEFAULT_LOOPS +else + LOOPS=$2 +fi + +# If file path is provided, append its contents to the prompt +if [ -n "$FILE_PATH" ]; then + if [ ! -f "$FILE_PATH" ]; then + echo "File not found: $FILE_PATH" >&2 + exit 1 + fi + FILE_CONTENTS=$(cat "$FILE_PATH") + PROMPT="$PROMPT\n[FILE CONTENTS]\n$FILE_CONTENTS\n[END FILE]" +fi + +# --- File Initialization --- +# Create a temporary directory if it doesn't exist +mkdir -p ~/tmp +# Create a unique file for this session based on the timestamp +SESSION_FILE=~/tmp/dds_$(date +%Y%m%d_%H%M%S).txt + +echo "DDS Session Log: ${SESSION_FILE}" +echo "---------------------------------" + +# --- Initial Prompt & Response --- + +# 1. Store the initial user prompt in the session file +echo "USER PROMPT: ${PROMPT}" >> "${SESSION_FILE}" +echo "" >> "${SESSION_FILE}" # Add a newline for readability +echo "Processing initial response..." + +# 2. The RESPONSE model generates the first answer +RESPONSE_PROMPT="You are an expert, curious assistant who isn't afraid to say when they don't know something. Please respond directly to the following prompt: ${PROMPT}" +RESPONSE_OUTPUT=$(ollama run "${RESPONSE_MODEL}" "${RESPONSE_PROMPT}") + +# Append the response to the session file +echo "INITIAL RESPONSE (${RESPONSE_MODEL}):" >> "${SESSION_FILE}" +echo "${RESPONSE_OUTPUT}" >> "${SESSION_FILE}" +echo "" >> "${SESSION_FILE}" + +# --- Refinement Loop --- + +# This variable will hold the most recent response for the next loop iteration +CURRENT_RESPONSE="${RESPONSE_OUTPUT}" + +for i in $(seq 1 "${LOOPS}"); do + echo "Starting refinement loop ${i} of ${LOOPS}..." + + # 3. The CRITIC model reviews the last response + CRITIC_PROMPT="You are a detail oriented, close reading, keenly critical reviewer. Your task is to raise questions, flag potential misunderstandings, and areas for improved clarity in the following text. Provide concise, constructive criticism. Do not rewrite the text, only critique it. TEXT TO CRITIQUE: ${CURRENT_RESPONSE}" + CRITIC_OUTPUT=$(ollama run "${CRITIC_MODEL}" "${CRITIC_PROMPT}") + + # Append the critique to the session file + echo "CRITICISM ${i} (${CRITIC_MODEL}):" >> "${SESSION_FILE}" + echo "${CRITIC_OUTPUT}" >> "${SESSION_FILE}" + echo "" >> "${SESSION_FILE}" + + # 4. The REFINE model reads the original prompt and the critique to generate a new response + REFINE_PROMPT="You are an expert assistant. Your previous response was reviewed and critiqued. Your task now is to generate a refined, improved response to the original prompt based on the feedback provided. ORIGINAL PROMPT: ${PROMPT} CONSTRUCTIVE CRITICISM: ${CRITIC_OUTPUT} Generate the refined response now." + REFINE_OUTPUT=$(ollama run "${REFINE_MODEL}" "${REFINE_PROMPT}") + + # Append the refined response to the session file + echo "REFINED RESPONSE ${i} (${REFINE_MODEL}):" >> "${SESSION_FILE}" + echo "${REFINE_OUTPUT}" >> "${SESSION_FILE}" + echo "" >> "${SESSION_FILE}" + + # Update the current response for the next loop or for the final output + CURRENT_RESPONSE="${REFINE_OUTPUT}" +done + +# --- Final Output --- + +echo "---------------------------------" +echo "DDS process complete." +echo "Final refined answer:" +echo "---------------------------------" + +# Print the final, most refined answer to standard output +echo "${CURRENT_RESPONSE}" |