Diffstat (limited to 'bash/talk-to-computer/corpus/README.md')
-rw-r--r-- bash/talk-to-computer/corpus/README.md | 236
1 files changed, 236 insertions, 0 deletions
diff --git a/bash/talk-to-computer/corpus/README.md b/bash/talk-to-computer/corpus/README.md
new file mode 100644
index 0000000..d87af43
--- /dev/null
+++ b/bash/talk-to-computer/corpus/README.md
@@ -0,0 +1,236 @@
+# RAG Knowledge Corpus
+
+This directory contains the knowledge corpus for the RAG (Retrieval-Augmented Generation) system. The corpus is organized as a structured knowledge base that can be searched and used to augment AI responses with relevant context.
+
+## 📁 Directory Structure
+
+```
+corpus/
+├── README.md              # This file
+├── corpus_registry.txt    # Auto-generated registry of available topics
+├── corpus_manager.sh      # Management script (in parent directory)
+├── topic_template.md      # Template for new topics
+├── .topic_keywords        # Topic keyword mappings
+├── .file_processors       # File processing configurations
+│
+├── programming/           # Programming topics
+│   ├── lil/              # Lil programming language
+│   │   └── guide.md
+│   └── algorithms.txt
+│
+├── science/              # Scientific topics
+│   ├── physics.txt
+│   └── biology.md
+│
+├── literature/           # Literary topics
+├── general/              # General knowledge
+└── examples/             # Example content
+```
+
+## 🔧 Management Tools
+
+### Corpus Manager (`./corpus_manager.sh`)
+
+The corpus manager provides utilities for managing the knowledge base. The script lives in the parent directory of `corpus/`, so run it from there:
+
+```bash
+# Update the corpus registry (run after adding new files)
+./corpus_manager.sh update
+
+# List all available topics
+./corpus_manager.sh list
+
+# Check if a topic exists
+./corpus_manager.sh exists programming
+
+# List files in a specific topic
+./corpus_manager.sh files programming
+
+# Create template files for a new topic
+./corpus_manager.sh template newtopic
+
+# Get statistics for a specific topic
+./corpus_manager.sh count science
+```
+
+### RAG Search (`./rag_search.sh`)
+
+Search the corpus with standard Unix tools (`grep`, `sed`, `awk`):
+
+```bash
+# Search entire corpus
+./rag_search.sh search "quantum physics"
+
+# Search specific topic
+./rag_search.sh search "lil programming" programming
+
+# Get context around matches
+./rag_search.sh context "variables" programming
+
+# Extract relevant sections
+./rag_search.sh extract "functions" programming
+
+# Show corpus statistics
+./rag_search.sh stats
+```
+
+## 📝 File Format Guidelines
+
+### Supported Formats
+- **`.txt`** - Plain text files
+- **`.md`** - Markdown files (recommended)
+- **`.html`** - HTML files
+
+### Content Organization
+1. **Use clear, descriptive headers** (`#`, `##`, `###`)
+2. **Include examples and code blocks** where relevant
+3. **Add cross-references** between related topics
+4. **Use consistent formatting** and terminology
+5. **Include practical applications** and use cases
+
+### Markdown Template
+```markdown
+# Topic Name - Comprehensive Guide
+
+## Introduction
+[Brief overview of the topic]
+
+## Core Concepts
+### [Subtopic 1]
+[Explanation and details]
+
+### [Subtopic 2]
+[Explanation and details]
+
+## Examples
+[Code examples, diagrams, practical applications]
+
+## Best Practices
+[Recommended approaches and common pitfalls]
+
+## References
+[Links to additional resources]
+```
+
+## ➕ Adding New Content
+
+### Step 1: Create Topic Directory
+```bash
+# Create a new topic directory
+mkdir -p corpus/newtopic
+
+# Or use the template command
+./corpus_manager.sh template newtopic
+```
+
+### Step 2: Add Content Files
+```bash
+# Create content files in your preferred format
+vim corpus/newtopic/guide.md
+vim corpus/newtopic/examples.txt
+vim corpus/newtopic/reference.html
+```
+
+### Step 3: Update Registry
+```bash
+# Update the corpus registry to include new files
+./corpus_manager.sh update
+
+# Verify the topic is recognized
+./corpus_manager.sh exists newtopic
+./corpus_manager.sh files newtopic
+```
+
+### Step 4: Test Search
+```bash
+# Test that content is searchable
+./rag_search.sh search "keyword" newtopic
+./rag_search.sh context "concept" newtopic
+```
+
+## 🔍 Search Behavior
+
+### Keyword Matching
+- **Case-insensitive** search across all text files
+- **Multi-word queries** supported
+- **Partial matches** found within words
+- **Context extraction** shows surrounding lines
+
+### Topic Filtering
+- **General search**: Searches entire corpus
+- **Topic-specific**: Limited to specific directories
+- **Hierarchical**: Supports subtopics (e.g., `science/physics`)
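+
+The matching and filtering rules above can be approximated with plain `grep`. This is a minimal sketch, not the actual `rag_search.sh` implementation: the function name `search_corpus` and the `CORPUS_DIR` variable are hypothetical, and it assumes a multi-word query is matched as a literal phrase.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper, not taken from rag_search.sh.
+# Usage: search_corpus QUERY [TOPIC]
+CORPUS_DIR="${CORPUS_DIR:-corpus}"
+
+search_corpus() {
+    local query="$1" topic="${2:-}"
+    # Without a topic, search the whole corpus; a topic may be nested,
+    # e.g. "programming/lil".
+    local dir="$CORPUS_DIR${topic:+/$topic}"
+    [ -d "$dir" ] || { echo "no such topic: $topic" >&2; return 1; }
+    # -r recurse, -i ignore case, -n show line numbers,
+    # -F literal string (so partial words match too)
+    grep -rinF --include='*.txt' --include='*.md' --include='*.html' \
+        -- "$query" "$dir"
+}
+```
+
+Adding `-C 2` to the `grep` call would print two lines of surrounding context for each match.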
+
+### Performance
+- **Sub-second lookups** using Unix tools
+- **Efficient grep/sed/awk** processing
+- **Cached registry** for fast topic discovery
+- **Minimal memory usage**
+
+## 🔧 Advanced Configuration
+
+### Custom Topic Keywords
+Edit `corpus/.topic_keywords` to add custom topic detection:
+```
+newtopic|keyword1 keyword2 keyword3
+```
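+
+A minimal sketch of how such a mapping could be used to pick a topic for a query. The `detect_topic` function is hypothetical, and the sketch assumes each line follows the `topic|keyword ...` layout shown above and that keywords are matched as whole, case-insensitive words.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper: print the first topic with a keyword that appears
+# as a whole word in the query.
+# Usage: detect_topic QUERY [KEYWORDS_FILE]
+detect_topic() {
+    local query="$1" file="${2:-corpus/.topic_keywords}"
+    local topic keywords kw lc_query
+    lc_query=$(printf '%s' "$query" | tr '[:upper:]' '[:lower:]')
+    # IFS='|' splits each line into the topic and its keyword list.
+    while IFS='|' read -r topic keywords; do
+        [ -n "$topic" ] || continue   # skip blank lines
+        for kw in $keywords; do       # keywords are space-separated
+            kw=$(printf '%s' "$kw" | tr '[:upper:]' '[:lower:]')
+            case " $lc_query " in
+                *" $kw "*) printf '%s\n' "$topic"; return 0 ;;
+            esac
+        done
+    done < "$file"
+    return 1                          # no topic matched
+}
+```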
+
+### File Processors
+Edit `corpus/.file_processors` to add support for new file types:
+```
+custom|processing_command
+```
+
+### Registry Customization
+The `corpus_registry.txt` file can be manually edited for custom topic mappings:
+```
+topic|path/to/files|keywords|description
+```
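+
+Because the fields are pipe-delimited, a registry line can be read with `awk`. The sketch below looks up the path for a topic; the `topic_path` function is hypothetical and assumes the four-field layout shown above.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper: print the path registered for a topic.
+# Usage: topic_path TOPIC [REGISTRY_FILE]
+topic_path() {
+    local wanted="$1" registry="${2:-corpus/corpus_registry.txt}"
+    # Field 1 is the topic and field 2 is the path; exit non-zero if
+    # the topic is not registered.
+    awk -F'|' -v t="$wanted" '
+        $1 == t { print $2; found = 1; exit }
+        END     { exit !found }
+    ' "$registry"
+}
+```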
+
+## 🎯 Integration with AI Systems
+
+The corpus is designed to integrate with AI thinking mechanisms:
+
+### Automatic RAG Detection
+- **Query analysis** determines when corpus search is needed
+- **Topic classification** matches queries to appropriate corpus sections
+- **Confidence scoring** decides between a RAG-augmented response and a direct response
+
+### Context Injection
+- **Relevant sections** extracted and formatted
+- **Context length** managed to stay within token limits
+- **Multiple sources** combined for comprehensive answers
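+
+One simple way to combine sources while capping context length is sketched below. The `build_context` function is hypothetical, and it uses a byte budget as a rough stand-in for a token limit, which is an assumption rather than how the system necessarily measures context.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper: join snippets with a separator, then cap the size.
+# Usage: build_context MAX_BYTES SNIPPET...
+build_context() {
+    local max_bytes="$1"; shift
+    local snippet
+    for snippet in "$@"; do
+        printf '%s\n---\n' "$snippet"
+    done | head -c "$max_bytes"   # keep only the first MAX_BYTES bytes
+}
+```
+
+Cutting at a byte boundary can split a line or a multi-byte character, so a real implementation would likely trim at a line or section boundary instead.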
+
+### Fallback Strategy
+- **Graceful degradation** when no relevant corpus content is found
+- **Direct LLM response** when corpus search yields no results
+- **Error handling** for missing or corrupted files
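+
+The fallback behavior can be sketched as a small wrapper that adds context only when a search returns something. The `build_prompt` function and the `SEARCH_CMD` variable are hypothetical, and the prompt layout is an assumption made for illustration.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical wrapper: SEARCH_CMD is overridable so the sketch can be
+# tried without the real script.
+SEARCH_CMD="${SEARCH_CMD:-./rag_search.sh search}"
+
+build_prompt() {
+    local question="$1" context
+    # A failing or missing search command is treated like "no results".
+    context=$($SEARCH_CMD "$question" 2>/dev/null) || context=""
+    if [ -n "$context" ]; then
+        printf 'Context:\n%s\n\nQuestion: %s\n' "$context" "$question"
+    else
+        printf 'Question: %s\n' "$question"   # direct response path
+    fi
+}
+```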
+
+## 📊 Current Corpus Statistics
+
+*Run `./rag_search.sh stats` to see current corpus statistics.*
+
+## 🚀 Best Practices
+
+1. **Keep files focused** - One topic per file when possible
+2. **Use descriptive names** - File names should indicate content
+3. **Regular updates** - Run `./corpus_manager.sh update` after adding new files
+4. **Test searches** - Verify content is discoverable
+5. **Cross-reference** - Link related topics when appropriate
+6. **Version control** - Track changes to corpus files
+
+## 🔄 Maintenance
+
+### Regular Tasks
+- Run `./corpus_manager.sh update` after adding files
+- Test search functionality with new content
+- Review and update outdated information
+- Archive unused or deprecated topics
+
+### Performance Monitoring
+- Monitor search response times
+- Check registry file size and complexity
+- Validate file integrity periodically
+- Clean up temporary search files
+
+This corpus system provides a foundation for knowledge-augmented AI responses and can grow as new topics and file types are added.