Diffstat (limited to 'bash/talk-to-computer/corpus/README.md')
-rw-r--r-- | bash/talk-to-computer/corpus/README.md | 236 |
1 files changed, 236 insertions, 0 deletions
diff --git a/bash/talk-to-computer/corpus/README.md b/bash/talk-to-computer/corpus/README.md
new file mode 100644
index 0000000..d87af43
--- /dev/null
+++ b/bash/talk-to-computer/corpus/README.md
@@ -0,0 +1,236 @@
+# RAG Knowledge Corpus
+
+This directory contains the knowledge corpus for the RAG (Retrieval-Augmented Generation) system. The corpus is organized as a structured knowledge base that can be searched and used to augment AI responses with relevant context.
+
+## 📁 Directory Structure
+
+```
+corpus/
+├── README.md                 # This file
+├── corpus_registry.txt       # Auto-generated registry of available topics
+├── corpus_manager.sh         # Management script (in parent directory)
+├── topic_template.md         # Template for new topics
+├── .topic_keywords           # Topic keyword mappings
+├── .file_processors          # File processing configurations
+│
+├── programming/              # Programming topics
+│   ├── lil/                  # Lil programming language
+│   │   └── guide.md
+│   └── algorithms.txt
+│
+├── science/                  # Scientific topics
+│   ├── physics.txt
+│   └── biology.md
+│
+├── literature/               # Literary topics
+├── general/                  # General knowledge
+└── examples/                 # Example content
+```
+
+## 🔧 Management Tools
+
+### Corpus Manager (`./corpus_manager.sh`)
+
+The corpus manager provides utilities for managing the knowledge base:
+
+```bash
+# Update the corpus registry (run after adding new files)
+./corpus_manager.sh update
+
+# List all available topics
+./corpus_manager.sh list
+
+# Check if a topic exists
+./corpus_manager.sh exists programming
+
+# List files in a specific topic
+./corpus_manager.sh files programming
+
+# Create template files for a new topic
+./corpus_manager.sh template newtopic
+
+# Get corpus statistics
+./corpus_manager.sh count science
+```
+
+### RAG Search (`./rag_search.sh`)
+
+Search the corpus using efficient Unix tools:
+
+```bash
+# Search entire corpus
+./rag_search.sh search "quantum physics"
+
+# Search specific topic
+./rag_search.sh search "lil programming" programming
+
+# Get context around matches
+./rag_search.sh context "variables" programming
+
+# Extract relevant sections
+./rag_search.sh extract "functions" programming
+
+# Show corpus statistics
+./rag_search.sh stats
+```
+
+## 📝 File Format Guidelines
+
+### Supported Formats
+- **`.txt`** - Plain text files
+- **`.md`** - Markdown files (recommended)
+- **`.html`** - HTML files
+
+### Content Organization
+1. **Use clear, descriptive headers** (`#`, `##`, `###`)
+2. **Include examples and code blocks** where relevant
+3. **Add cross-references** between related topics
+4. **Use consistent formatting** and terminology
+5. **Include practical applications** and use cases
+
+### Markdown Template
+```markdown
+# Topic Name - Comprehensive Guide
+
+## Introduction
+[Brief overview of the topic]
+
+## Core Concepts
+### [Subtopic 1]
+[Explanation and details]
+
+### [Subtopic 2]
+[Explanation and details]
+
+## Examples
+[Code examples, diagrams, practical applications]
+
+## Best Practices
+[Recommended approaches and common pitfalls]
+
+## References
+[Links to additional resources]
+```
+
+## ➕ Adding New Content
+
+### Step 1: Create Topic Directory
+```bash
+# Create a new topic directory
+mkdir -p corpus/newtopic
+
+# Or use the template command
+./corpus_manager.sh template newtopic
+```
+
+### Step 2: Add Content Files
+```bash
+# Create content files in your preferred format
+vim corpus/newtopic/guide.md
+vim corpus/newtopic/examples.txt
+vim corpus/newtopic/reference.html
+```
+
+### Step 3: Update Registry
+```bash
+# Update the corpus registry to include new files
+./corpus_manager.sh update
+
+# Verify the topic is recognized
+./corpus_manager.sh exists newtopic
+./corpus_manager.sh files newtopic
+```
+
+### Step 4: Test Search
+```bash
+# Test that content is searchable
+./rag_search.sh search "keyword" newtopic
+./rag_search.sh context "concept" newtopic
+```
+
+## 🔍 Search Behavior
+
+### Keyword Matching
+- **Case-insensitive** search across all text files
+- **Multi-word queries** supported
+- **Partial matches** found within words
+- **Context extraction** shows surrounding lines
+
+### Topic Filtering
+- **General search**: Searches entire corpus
+- **Topic-specific**: Limited to specific directories
+- **Hierarchical**: Supports subtopics (e.g., `science/physics`)
+
+### Performance
+- **Sub-second lookups** using Unix tools
+- **Efficient grep/sed/awk** processing
+- **Cached registry** for fast topic discovery
+- **Minimal memory usage**
+
+## 🔧 Advanced Configuration
+
+### Custom Topic Keywords
+Edit `corpus/.topic_keywords` to add custom topic detection:
+```
+newtopic|keyword1 keyword2 keyword3
+```
+
+### File Processors
+Edit `corpus/.file_processors` to add support for new file types:
+```
+custom|processing_command
+```
+
+### Registry Customization
+The `corpus_registry.txt` file can be manually edited for custom topic mappings:
+```
+topic|path/to/files|keywords|description
+```
+
+## 🎯 Integration with AI Systems
+
+The corpus is designed to integrate with AI thinking mechanisms:
+
+### Automatic RAG Detection
+- **Query analysis** determines when corpus search is needed
+- **Topic classification** matches queries to appropriate corpus sections
+- **Confidence scoring** determines RAG vs direct response
+
+### Context Injection
+- **Relevant sections** extracted and formatted
+- **Context length** managed to stay within token limits
+- **Multiple sources** combined for comprehensive answers
+
+### Fallback Strategy
+- **Graceful degradation** when no relevant corpus found
+- **Direct LLM response** when corpus search yields no results
+- **Error handling** for missing or corrupted files
+
+## 📊 Current Corpus Statistics
+
+*Run `./rag_search.sh stats` to see current corpus statistics.*
+
+## 🚀 Best Practices
+
+1. **Keep files focused** - One topic per file when possible
+2. **Use descriptive names** - File names should indicate content
+3. **Regular updates** - Run `update` after adding new files
+4. **Test searches** - Verify content is discoverable
+5. **Cross-reference** - Link related topics when appropriate
+6. **Version control** - Track changes to corpus files
+
+## 🔄 Maintenance
+
+### Regular Tasks
+- Run `./corpus_manager.sh update` after adding files
+- Test search functionality with new content
+- Review and update outdated information
+- Archive unused or deprecated topics
+
+### Performance Monitoring
+- Monitor search response times
+- Check registry file size and complexity
+- Validate file integrity periodically
+- Clean up temporary search files
+
+This corpus system provides a scalable, efficient foundation for knowledge-augmented AI responses while maintaining the flexibility to grow and adapt to new requirements.
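
The README added in this diff describes a pipe-delimited registry (`topic|path/to/files|keywords|description`) and a case-insensitive, topic-scoped search with surrounding context, but the diff does not include `corpus_manager.sh` or `rag_search.sh` themselves. The following is only a minimal sketch of how such a lookup might work, not the scripts' actual code. The function names `topic_path` and `search_topic`, the throwaway demo corpus, and the choice of `grep -r -i -F -C 1` are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; NOT the real corpus_manager.sh / rag_search.sh logic.

# Build a throwaway corpus so the example is self-contained.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/science"
printf '%s\n' \
    '# Physics' \
    'Quantum physics studies matter at small scales.' \
    'Classical mechanics covers everyday motion.' \
    > "$demo_dir/science/physics.txt"

# One registry line in the documented layout:
#   topic|path/to/files|keywords|description
registry="$demo_dir/corpus_registry.txt"
printf '%s\n' "science|$demo_dir/science|physics biology|Scientific topics" \
    > "$registry"

# topic_path REGISTRY TOPIC
# Print the path field of the first registry line whose topic field
# equals TOPIC; print nothing if the topic is not registered.
topic_path() {
    awk -F'|' -v topic="$2" '$1 == topic { print $2; exit }' "$1"
}

# search_topic REGISTRY TOPIC QUERY
# Case-insensitive, fixed-string search limited to the topic's directory,
# with one line of context around each match.
search_topic() {
    local dir
    dir=$(topic_path "$1" "$2")
    if [ -z "$dir" ]; then
        echo "unknown topic: $2" >&2
        return 1
    fi
    grep -r -i -F -n -C 1 -- "$3" "$dir"
}

# Matches despite the different letter case in the query.
search_topic "$registry" science "QUANTUM"
```

Resolving the topic through the registry first keeps the search confined to one directory, which is one plausible reading of the "Topic-specific" and "Cached registry" bullets. Note that `-F` matches the query as a single literal phrase, so a real multi-word search that matches words independently would need a different strategy.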