# RAG Knowledge Corpus

This directory contains the knowledge corpus for the RAG (Retrieval-Augmented Generation) system. The corpus is organized as a structured knowledge base that can be searched and used to augment AI responses with relevant context.

## 📁 Directory Structure

```
corpus/
├── README.md                 # This file
├── corpus_registry.txt       # Auto-generated registry of available topics
├── corpus_manager.sh         # Management script (in parent directory)
├── topic_template.md         # Template for new topics
├── .topic_keywords           # Topic keyword mappings
├── .file_processors          # File processing configurations
│
├── programming/              # Programming topics
│   ├── lil/                  # Lil programming language
│   │   └── guide.md
│   └── algorithms.txt
│
├── science/                  # Scientific topics
│   ├── physics.txt
│   └── biology.md
│
├── literature/               # Literary topics
├── general/                  # General knowledge
└── examples/                 # Example content
```

## 🔧 Management Tools

### Corpus Manager (`./corpus_manager.sh`)

The corpus manager provides utilities for managing the knowledge base:

```bash
# Update the corpus registry (run after adding new files)
./corpus_manager.sh update

# List all available topics
./corpus_manager.sh list

# Check if a topic exists
./corpus_manager.sh exists programming

# List files in a specific topic
./corpus_manager.sh files programming

# Create template files for a new topic
./corpus_manager.sh template newtopic

# Get corpus statistics
./corpus_manager.sh count science
```

### RAG Search (`./rag_search.sh`)

Search the corpus using efficient Unix tools:

```bash
# Search entire corpus
./rag_search.sh search "quantum physics"

# Search specific topic
./rag_search.sh search "lil programming" programming

# Get context around matches
./rag_search.sh context "variables" programming

# Extract relevant sections
./rag_search.sh extract "functions" programming

# Show corpus statistics
./rag_search.sh stats
```

## 📝 File Format Guidelines

### Supported Formats

- **`.txt`** - Plain text files
- **`.md`** - Markdown files (recommended)
- **`.html`** - HTML files

### Content Organization

1. **Use clear, descriptive headers** (`#`, `##`, `###`)
2. **Include examples and code blocks** where relevant
3. **Add cross-references** between related topics
4. **Use consistent formatting** and terminology
5. **Include practical applications** and use cases

### Markdown Template

```markdown
# Topic Name - Comprehensive Guide

## Introduction
[Brief overview of the topic]

## Core Concepts
### [Subtopic 1]
[Explanation and details]

### [Subtopic 2]
[Explanation and details]

## Examples
[Code examples, diagrams, practical applications]

## Best Practices
[Recommended approaches and common pitfalls]

## References
[Links to additional resources]
```

## ➕ Adding New Content

### Step 1: Create Topic Directory

```bash
# Create a new topic directory
mkdir -p corpus/newtopic

# Or use the template command
./corpus_manager.sh template newtopic
```

### Step 2: Add Content Files

```bash
# Create content files in your preferred format
vim corpus/newtopic/guide.md
vim corpus/newtopic/examples.txt
vim corpus/newtopic/reference.html
```

### Step 3: Update Registry

```bash
# Update the corpus registry to include new files
./corpus_manager.sh update

# Verify the topic is recognized
./corpus_manager.sh exists newtopic
./corpus_manager.sh files newtopic
```

### Step 4: Test Search

```bash
# Test that content is searchable
./rag_search.sh search "keyword" newtopic
./rag_search.sh context "concept" newtopic
```

## 🔍 Search Behavior

### Keyword Matching

- **Case-insensitive** search across all text files
- **Multi-word queries** supported
- **Partial matches** found within words
- **Context extraction** shows surrounding lines

### Topic Filtering

- **General search**: Searches entire corpus
- **Topic-specific**: Limited to specific directories
- **Hierarchical**: Supports subtopics (e.g., `science/physics`)

### Performance

- **Sub-second lookups** using Unix tools
- **Efficient grep/sed/awk** processing
- **Cached registry** for fast topic discovery
- **Minimal memory usage**

## 🔧 Advanced Configuration

### Custom Topic Keywords

Edit `corpus/.topic_keywords` to add custom topic detection:

```
newtopic|keyword1 keyword2 keyword3
```

### File Processors

Edit `corpus/.file_processors` to add support for new file types:

```
custom|processing_command
```

### Registry Customization

The `corpus_registry.txt` file can be manually edited for custom topic mappings:

```
topic|path/to/files|keywords|description
```

## 🎯 Integration with AI Systems

The corpus is designed to integrate with AI thinking mechanisms:

### Automatic RAG Detection

- **Query analysis** determines when corpus search is needed
- **Topic classification** matches queries to appropriate corpus sections
- **Confidence scoring** determines RAG vs direct response

### Context Injection

- **Relevant sections** extracted and formatted
- **Context length** managed to stay within token limits
- **Multiple sources** combined for comprehensive answers

### Fallback Strategy

- **Graceful degradation** when no relevant corpus found
- **Direct LLM response** when corpus search yields no results
- **Error handling** for missing or corrupted files

## 📊 Current Corpus Statistics

*Run `./rag_search.sh stats` to see current corpus statistics.*

## 🚀 Best Practices

1. **Keep files focused** - One topic per file when possible
2. **Use descriptive names** - File names should indicate content
3. **Regular updates** - Run `update` after adding new files
4. **Test searches** - Verify content is discoverable
5. **Cross-reference** - Link related topics when appropriate
6. **Version control** - Track changes to corpus files

## 🔄 Maintenance

### Regular Tasks

- Run `./corpus_manager.sh update` after adding files
- Test search functionality with new content
- Review and update outdated information
- Archive unused or deprecated topics

### Performance Monitoring

- Monitor search response times
- Check registry file size and complexity
- Validate file integrity periodically
- Clean up temporary search files

This corpus system provides a scalable, efficient foundation for knowledge-augmented AI responses while maintaining the flexibility to grow and adapt to new requirements.
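
To make the keyword matching, topic filtering, and registry format described above more concrete, here is a minimal sketch of how they could be built from `grep` and `awk`. It is not the actual code of `rag_search.sh` or `corpus_manager.sh`: the function names `search_corpus` and `registry_path` are hypothetical, the multi-word query is treated as a literal phrase (the real script may split words differently), and the recursive `--include` options assume GNU or BSD `grep`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; not the real rag_search.sh / corpus_manager.sh.

# search_corpus ROOT QUERY [TOPIC]
# Case-insensitive, literal-phrase search with two lines of context,
# restricted to the supported formats (.txt, .md, .html).
# TOPIC may be hierarchical, e.g. "science/physics".
search_corpus() {
    local root="$1" query="$2" topic="${3:-}"
    local dir="$root${topic:+/$topic}"
    if [ ! -d "$dir" ]; then
        echo "no such topic: $topic" >&2
        return 1
    fi
    grep -r -i -n -F -C 2 \
        --include='*.txt' --include='*.md' --include='*.html' \
        -- "$query" "$dir"
}

# registry_path REGISTRY_FILE TOPIC
# Print the path field for TOPIC, assuming the
# topic|path/to/files|keywords|description layout.
# Returns non-zero when the topic is not listed.
registry_path() {
    awk -F'|' -v t="$2" '
        $1 == t { print $2; found = 1; exit }
        END     { exit !found }
    ' "$1"
}
```

With these definitions sourced, `search_corpus corpus "quantum physics" science` would search only `corpus/science`, and `registry_path corpus/corpus_registry.txt science` would print the path registered for that topic. Because matching is substring-based, partial matches within words are found as well.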