File Crawling Technique¶
Systematic processing of many files without context overload
What is File Crawling?¶
File crawling is a technique for processing large numbers of files systematically using an external index and sequential processing. It solves the fundamental problem that AI cannot hold all files in context at once.
Core pattern: External checklist → Process one file → Mark complete → Repeat
The Problem It Solves¶
AI Limitations¶
AI assistants have critical limitations:
- Limited context window - Cannot hold 100+ files at once
- Attention degradation - Misses files in large lists
- Memory limitations - Forgets files between iterations
- False confidence - Thinks it remembers but doesn't
Real example: AI shown list of 100 files. AI processes first 20, then forgets about the rest. Returns saying "all done" but 80 files untouched.
Traditional Approach Fails¶
# BAD: Try to hold all files in context
files = [file1, file2, file3, ... file100]
for file in files: # AI will forget most of these
process(file)
What happens:
- AI loads all 100 filenames (1000+ tokens each iteration)
- Can only focus on ~20 files before attention degrades
- Forgets remaining 80 files
- Returns confidently saying "all done"
- Human discovers 80 files untouched
The Solution: External Index + Sequential Processing¶
Core Pattern¶
# 1. Generate external checklist
find . -name "*.md" > /tmp/checklist.txt
sed -i 's/^/[ ] /' /tmp/checklist.txt
# 2. Process loop - AI reads only ONE line at a time
while [ $(grep -c "^\[ \]" /tmp/checklist.txt) -gt 0 ]; do
# Get next uncompleted file (5-10 tokens)
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt | sed 's/\[ \] //')
# Process this ONE file completely
# - AI reads full file
# - AI makes all needed changes
# - AI verifies changes
# Mark complete (in-place edit)
sed -i "s|\[ \] $NEXT|[x] $NEXT|" /tmp/checklist.txt
done
# 3. Cleanup
rm /tmp/checklist.txt
Why It Works¶
Token Efficiency¶
Without file crawling: 100 iterations × 2,000 tokens = 200,000 tokens wasted
With file crawling: 100 iterations × 10 tokens = 1,000 tokens
Savings: 199,000 tokens (99.5% reduction)
Key Benefits¶
- No forgetting - Files tracked externally, not in AI memory
- Clear progress - Visual
[x]marks show what's done - Resumable - Can stop and restart without losing place
- Systematic - Guarantees every file processed exactly once
- Verifiable - Human can check progress anytime
When to Use File Crawling¶
✅ Use When:¶
- Processing 10+ files systematically
- Each file requires similar updates
- Need clear progress visibility
- Want resumability
- Working across multiple turns
✅ Common in DDD:¶
- Phase 1: Processing all documentation files
- Phase 3: Code reconnaissance across modules
- Phase 4: Implementing changes across files
- Phase 5: Testing all documented examples
Step-by-Step Guide¶
Step 1: Generate File Index¶
# Find all files to process
find . -type f \( -name "*.md" -o -name "*.py" \) \
! -path "*/.git/*" ! -path "*/.venv/*" \
> /tmp/files_to_process.txt
# Convert to checklist format
sed 's/^/[ ] /' /tmp/files_to_process.txt > /tmp/checklist.txt
# Show AI the checklist once
cat /tmp/checklist.txt
Step 2: Sequential Processing¶
# AI executes this pattern
while [ $(grep -c "^\[ \]" /tmp/checklist.txt) -gt 0 ]; do
# Get next (minimal tokens)
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt | sed 's/\[ \] //')
echo "Processing: $NEXT"
# AI reads this ONE file COMPLETELY
# AI makes ALL needed changes
# AI verifies changes worked
# Mark complete ONLY after full review
sed -i "s|\[ \] $NEXT|[x] $NEXT|" /tmp/checklist.txt
# Optional: Show progress every 10 files
if [ $((counter % 10)) -eq 0 ]; then
DONE=$(grep -c "^\[x\]" /tmp/checklist.txt)
TOTAL=$(wc -l < /tmp/checklist.txt)
echo "Progress: $DONE/$TOTAL files"
fi
counter=$((counter + 1))
done
Step 3: Verify and Cleanup¶
# Verify all files processed
REMAINING=$(grep -c "^\[ \]" /tmp/checklist.txt)
if [ $REMAINING -gt 0 ]; then
echo "WARNING: $REMAINING files not processed"
grep "^\[ \]" /tmp/checklist.txt
else
echo "✓ All files processed"
fi
# Cleanup
rm /tmp/checklist.txt /tmp/files_to_process.txt
Common Mistakes to Avoid¶
❌ Loading Entire Checklist into Context¶
# BAD: All 100 files in context
for file in $(cat /tmp/checklist.txt); do
# Won't work
done
# GOOD: Only next file
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt | sed 's/\[ \] //')
❌ Marking Complete Without Full Review¶
# BAD: Mark all complete after global replace
sed -i 's/old/new/g' docs/*.md
sed -i 's/^\[ \]/[x]/' /tmp/checklist.txt # All marked!
# GOOD: Mark after individual review
# Read file → Make changes → Verify → Mark complete
Why this matters: Global replacements miss files and context. Each file needs individual attention.
❌ Processing Multiple Files Per Iteration¶
# BAD: Process 5 at once
NEXT_5=$(grep -m5 "^\[ \]" /tmp/checklist.txt)
# GOOD: One at a time
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt)
Advanced Techniques¶
Filtered Crawling¶
# Only files mentioning "provider"
grep -rl "provider" docs/ | \
grep -v ".git" | \
sed 's/^/[ ] /' > /tmp/provider_docs.txt
Priority Ordering¶
# Manual priority
cat > /tmp/ordered.txt << 'EOF'
[ ] README.md # Highest priority
[ ] docs/USER_GUIDE.md # User-facing
[ ] docs/API.md # Developer-facing
EOF
# Or by size (smaller first)
find docs/ -name "*.md" -exec wc -l {} + | \
sort -n | awk '{print $2}' | sed 's/^/[ ] /' > /tmp/by_size.txt
Integration with DDD Process¶
File crawling is used throughout:
- Phase 1: Documentation file processing
- Phase 4: Code file implementation
- Phase 5: Testing documented examples
Tips for Success¶
For AI Assistants¶
- Always use file crawling for 10+ files
- Process one file at a time
- Read complete file before changes
- Mark complete honestly
- Show progress periodically
For Humans¶
- Check progress:
grep "^\[x\]" /tmp/checklist.txt | wc -l - Interrupt safely - resume from checklist
- Verify completion before proceeding
- Review checklist files
Quick Reference¶
# Standard Pattern
# 1. Generate index
find . -name "*.md" > /tmp/files.txt
sed 's/^/[ ] /' /tmp/files.txt > /tmp/checklist.txt
# 2. Process loop
while [ $(grep -c "^\[ \]" /tmp/checklist.txt) -gt 0 ]; do
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt | sed 's/\[ \] //')
# Process $NEXT completely
sed -i "s|\[ \] $NEXT|[x] $NEXT|" /tmp/checklist.txt
done
# 3. Cleanup
rm /tmp/checklist.txt /tmp/files.txt
Return to: Core Concepts | Main Index
Related: Context Poisoning | Retcon Writing