# RAG Knowledge Corpus

This directory contains the knowledge corpus for the RAG (Retrieval-Augmented Generation) system. The corpus is organized as a structured knowledge base that can be searched and used to augment AI responses with relevant context.

## 📁 Directory Structure

```
corpus/
├── README.md              # This file
├── corpus_registry.txt    # Auto-generated registry of available topics
├── corpus_manager.sh      # Management script (lives in the parent directory; listed here for reference)
├── topic_template.md      # Template for new topics
├── .topic_keywords        # Topic keyword mappings
├── .file_processors       # File processing configurations
│
├── programming/           # Programming topics
│   ├── lil/              # Lil programming language
│   │   └── guide.md
│   └── algorithms.txt
│
├── science/              # Scientific topics
│   ├── physics.txt
│   └── biology.md
│
├── literature/           # Literary topics
├── general/              # General knowledge
└── examples/             # Example content
```

## 🔧 Management Tools

### Corpus Manager (`./corpus_manager.sh`)

The corpus manager provides utilities for managing the knowledge base:

```bash
# Update the corpus registry (run after adding new files)
./corpus_manager.sh update

# List all available topics
./corpus_manager.sh list

# Check if a topic exists
./corpus_manager.sh exists programming

# List files in a specific topic
./corpus_manager.sh files programming

# Create template files for a new topic
./corpus_manager.sh template newtopic

# Count the files in a specific topic
./corpus_manager.sh count science
```

### RAG Search (`./rag_search.sh`)

Search the corpus using efficient Unix tools:

```bash
# Search entire corpus
./rag_search.sh search "quantum physics"

# Search specific topic
./rag_search.sh search "lil programming" programming

# Get context around matches
./rag_search.sh context "variables" programming

# Extract relevant sections
./rag_search.sh extract "functions" programming

# Show corpus statistics
./rag_search.sh stats
```

## 📝 File Format Guidelines

### Supported Formats
- **`.txt`** - Plain text files
- **`.md`** - Markdown files (recommended)
- **`.html`** - HTML files

### Content Organization
1. **Use clear, descriptive headers** (`#`, `##`, `###`)
2. **Include examples and code blocks** where relevant
3. **Add cross-references** between related topics
4. **Use consistent formatting** and terminology
5. **Include practical applications** and use cases

### Markdown Template
```markdown
# Topic Name - Comprehensive Guide

## Introduction
[Brief overview of the topic]

## Core Concepts
### [Subtopic 1]
[Explanation and details]

### [Subtopic 2]
[Explanation and details]

## Examples
[Code examples, diagrams, practical applications]

## Best Practices
[Recommended approaches and common pitfalls]

## References
[Links to additional resources]
```

## ➕ Adding New Content

### Step 1: Create Topic Directory
```bash
# Create a new topic directory
mkdir -p corpus/newtopic

# Or use the template command
./corpus_manager.sh template newtopic
```

### Step 2: Add Content Files
```bash
# Create content files in your preferred format
vim corpus/newtopic/guide.md
vim corpus/newtopic/examples.txt
vim corpus/newtopic/reference.html
```

### Step 3: Update Registry
```bash
# Update the corpus registry to include new files
./corpus_manager.sh update

# Verify the topic is recognized
./corpus_manager.sh exists newtopic
./corpus_manager.sh files newtopic
```

### Step 4: Test Search
```bash
# Test that content is searchable
./rag_search.sh search "keyword" newtopic
./rag_search.sh context "concept" newtopic
```

## 🔍 Search Behavior

### Keyword Matching
- **Case-insensitive** search across all text files
- **Multi-word queries** supported
- **Partial matches** found within words
- **Context extraction** shows surrounding lines
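
The matching rules above map fairly directly onto standard `grep` options. The following is a minimal sketch, not the actual `rag_search.sh` implementation: the function name `search_corpus` and its arguments are hypothetical, and it assumes a GNU or BSD `grep` that supports `--include`.

```bash
# Hypothetical helper illustrating the keyword-matching rules.
# Usage: search_corpus QUERY [DIR] [CONTEXT_LINES]
search_corpus() {
    local query="$1"
    local dir="${2:-corpus}"
    local context="${3:-2}"

    # -r  recurse into topic directories
    # -i  case-insensitive matching
    # -n  show line numbers
    # -F  treat the query as a literal string, so a multi-word query
    #     matches as a phrase and "variab" matches inside "Variables"
    # -C  print surrounding lines for context
    grep -rinF -C "$context" \
        --include='*.txt' --include='*.md' --include='*.html' \
        -- "$query" "$dir"
}

# Example (assumes a corpus/ directory exists):
#   search_corpus "variables" corpus/programming 3
```

Because `grep` exits non-zero when nothing matches, a caller can use the exit status to decide whether any context was found.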

### Topic Filtering
- **General search**: Searches entire corpus
- **Topic-specific**: Limited to specific directories
- **Hierarchical**: Supports subtopics (e.g., `science/physics`)
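
As an illustration of how a topic argument could be turned into a search path, here is a hedged sketch. The function `resolve_topic_path` is hypothetical. Whether a subtopic such as `science/physics` resolves to a directory or to a single file like `physics.txt` is an assumption based on the directory tree shown earlier, not confirmed behavior of the scripts.

```bash
# Hypothetical helper: map a topic name to a path inside the corpus.
# Usage: resolve_topic_path TOPIC [CORPUS_ROOT]
resolve_topic_path() {
    local topic="$1"
    local root="${2:-corpus}"
    local ext

    # No topic given: general search over the entire corpus.
    if [ -z "$topic" ]; then
        printf '%s\n' "$root"
        return 0
    fi

    # Refuse absolute paths and parent-directory references.
    case "$topic" in
        /*|*..*) echo "invalid topic: $topic" >&2; return 1 ;;
    esac

    # Topic or subtopic stored as a directory (e.g. programming/lil).
    if [ -d "$root/$topic" ]; then
        printf '%s\n' "$root/$topic"
        return 0
    fi

    # Subtopic stored as a single file (e.g. science/physics -> physics.txt).
    for ext in md txt html; do
        if [ -f "$root/$topic.$ext" ]; then
            printf '%s\n' "$root/$topic.$ext"
            return 0
        fi
    done

    echo "unknown topic: $topic" >&2
    return 1
}
```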

### Performance
- **Sub-second lookups** using Unix tools
- **Efficient grep/sed/awk** processing
- **Cached registry** for fast topic discovery
- **Minimal memory usage**

## 🔧 Advanced Configuration

### Custom Topic Keywords
Edit `corpus/.topic_keywords` to add custom topic detection:
```
newtopic|keyword1 keyword2 keyword3
```
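
To show how a `topic|keywords` file in this format could drive topic detection, here is a minimal sketch. The function `detect_topic` and its whole-word scoring rule are assumptions for illustration; the real detection logic may differ.

```bash
# Hypothetical helper: pick the topic whose keywords best match a query.
# Prints "<topic> <score>" and returns 0, or returns 1 if nothing matches.
# Usage: detect_topic QUERY [KEYWORDS_FILE]
detect_topic() {
    local keywords_file="${2:-corpus/.topic_keywords}"
    local query topic keywords kw score
    local best_topic="" best_score=0

    # Lowercase the query and turn punctuation into spaces.
    query=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c '[:alnum:]' ' ')

    while IFS='|' read -r topic keywords || [ -n "$topic" ]; do
        [ -z "$topic" ] && continue
        score=0
        # Keywords are space-separated; count whole-word hits in the query.
        for kw in $keywords; do
            kw=$(printf '%s' "$kw" | tr '[:upper:]' '[:lower:]')
            case " $query " in
                *" $kw "*) score=$((score + 1)) ;;
            esac
        done
        if [ "$score" -gt "$best_score" ]; then
            best_score=$score
            best_topic=$topic
        fi
    done < "$keywords_file"

    [ -n "$best_topic" ] && printf '%s %s\n' "$best_topic" "$best_score"
}
```

The score is only a crude hit count. A caller could compare it against a threshold of its own choosing before deciding to search the corpus.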

### File Processors
Edit `corpus/.file_processors` to add support for new file types:
```
custom|processing_command
```
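
One way such an entry could be consumed is sketched below. It assumes the left-hand field is a file extension and the right-hand field is a shell command that reads the file on standard input. Both are assumptions, and `process_file` is a hypothetical name. Since the command is executed as written, the configuration file should only contain trusted entries.

```bash
# Hypothetical helper: convert a corpus file to plain text using the
# command registered for its extension, falling back to cat.
# Usage: process_file FILE [PROCESSORS_FILE]
process_file() {
    local file="$1"
    local config="${2:-corpus/.file_processors}"
    local ext="${file##*.}"
    local cmd="" key value

    if [ -f "$config" ]; then
        # "read" puts everything after the first "|" into $value,
        # so the command itself may contain pipes.
        while IFS='|' read -r key value || [ -n "$key" ]; do
            if [ "$key" = "$ext" ]; then
                cmd="$value"
                break
            fi
        done < "$config"
    fi

    if [ -n "$cmd" ]; then
        sh -c "$cmd" < "$file"
    else
        cat -- "$file"
    fi
}

# Example entry that strips tags from HTML files (illustrative only):
#   html|sed -e "s/<[^>]*>//g"
```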

### Registry Customization
The `corpus_registry.txt` file can be manually edited for custom topic mappings:
```
topic|path/to/files|keywords|description
```
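
A registry line in this pipe-delimited format can be read with `awk`. The sketch below looks up the path for a topic. The function name `registry_lookup` is hypothetical, and skipping `#` comment lines is an assumption about the file rather than documented behavior.

```bash
# Hypothetical helper: print the path registered for a topic.
# Returns 1 if the topic is not in the registry.
# Usage: registry_lookup TOPIC [REGISTRY_FILE]
registry_lookup() {
    local topic="$1"
    local registry="${2:-corpus/corpus_registry.txt}"

    awk -F'|' -v t="$topic" '
        # Skip comment lines and lines without a path field.
        /^[[:space:]]*#/ || NF < 2 { next }
        # Field 1 is the topic, field 2 is the path.
        $1 == t { print $2; found = 1; exit }
        END { exit found ? 0 : 1 }
    ' "$registry"
}

# Example:
#   registry_lookup programming   # prints the registered path, if any
```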

## 🎯 Integration with AI Systems

The corpus is designed to integrate with AI thinking mechanisms:

### Automatic RAG Detection
- **Query analysis** determines when corpus search is needed
- **Topic classification** matches queries to appropriate corpus sections
- **Confidence scoring** determines RAG vs direct response

### Context Injection
- **Relevant sections** extracted and formatted
- **Context length** managed to stay within token limits
- **Multiple sources** combined for comprehensive answers
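
The idea of combining several sources under a length limit can be sketched as follows. This is an illustration only: `build_context` is a hypothetical function, and it approximates the token limit with a character budget, which is a simplifying assumption.

```bash
# Hypothetical helper: concatenate source files into one context block,
# stopping before the character budget would be exceeded.
# Usage: build_context MAX_CHARS FILE...
build_context() {
    local max_chars="$1"
    shift
    local used=0 file chunk len

    for file in "$@"; do
        [ -r "$file" ] || continue
        # Label each source so the model can tell the sections apart.
        chunk=$(printf '### Source: %s\n%s' "$file" "$(cat -- "$file")")
        len=${#chunk}
        # Stop at the first source that does not fit in the budget.
        if [ $((used + len)) -gt "$max_chars" ]; then
            break
        fi
        printf '%s\n\n' "$chunk"
        used=$((used + len + 2))
    done
}

# Example (file names taken from the directory tree above):
#   build_context 4000 corpus/science/physics.txt corpus/science/biology.md
```

Dropping whole sources keeps each included section intact. Truncating the last section instead would use the budget more fully at the cost of cutting text mid-thought.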

### Fallback Strategy
- **Graceful degradation** when no relevant corpus found
- **Direct LLM response** when corpus search yields no results
- **Error handling** for missing or corrupted files
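
The fallback behavior described above could look roughly like this. The function `answer_with_fallback` and the `MODE=` output lines are hypothetical, chosen only to make the two code paths visible. The actual hand-off to the LLM is out of scope here.

```bash
# Hypothetical helper: decide between a corpus-backed answer and a
# direct LLM response.
# Usage: answer_with_fallback QUERY [CORPUS_DIR]
answer_with_fallback() {
    local query="$1"
    local dir="${2:-corpus}"
    local context=""

    # A missing corpus directory is treated the same as "no results".
    if [ -d "$dir" ]; then
        # Errors from unreadable files are suppressed so that a damaged
        # corpus degrades to a direct answer instead of aborting.
        context=$(grep -rihF -- "$query" "$dir" 2>/dev/null | head -n 20)
    fi

    if [ -n "$context" ]; then
        printf 'MODE=rag\n%s\n' "$context"
    else
        printf 'MODE=direct\n'
    fi
}
```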

## 📊 Current Corpus Statistics

*Run `./rag_search.sh stats` to see current corpus statistics.*

## 🚀 Best Practices

1. **Keep files focused** - One topic per file when possible
2. **Use descriptive names** - File names should indicate content
3. **Regular updates** - Run `./corpus_manager.sh update` after adding new files
4. **Test searches** - Verify content is discoverable
5. **Cross-reference** - Link related topics when appropriate
6. **Version control** - Track changes to corpus files

## 🔄 Maintenance

### Regular Tasks
- Run `./corpus_manager.sh update` after adding files
- Test search functionality with new content
- Review and update outdated information
- Archive unused or deprecated topics

### Performance Monitoring
- Monitor search response times
- Check registry file size and complexity
- Validate file integrity periodically
- Clean up temporary search files
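
A periodic integrity check can be as simple as looking for empty corpus files. The sketch below is one possible starting point. `check_corpus` is a hypothetical function, and treating a zero-byte file as a problem is an assumption about what "integrity" means for this corpus.

```bash
# Hypothetical helper: report empty corpus files.
# Returns 0 if none are found, 1 otherwise.
# Usage: check_corpus [CORPUS_DIR]
check_corpus() {
    local dir="${1:-corpus}"
    local empties

    # -size 0 selects zero-byte files among the supported formats.
    empties=$(find "$dir" -type f \
        \( -name '*.md' -o -name '*.txt' -o -name '*.html' \) -size 0)

    if [ -n "$empties" ]; then
        printf '%s\n' "$empties" | sed 's/^/empty file: /'
        return 1
    fi
    return 0
}
```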

This corpus system provides a scalable, efficient foundation for knowledge-augmented AI responses while maintaining the flexibility to grow and adapt to new requirements.