File Crawling Technique

Systematic processing of many files without context overload

What is File Crawling?

File crawling is a technique for processing large numbers of files systematically using an external index and sequential processing. It solves the fundamental problem that AI cannot hold all files in context at once.

Core pattern: External checklist → Process one file → Mark complete → Repeat

The Problem It Solves

AI Limitations

AI assistants have critical limitations:

Limited context window - Cannot hold 100+ files at once
Attention degradation - Misses files in large lists
Memory limitations - Forgets files between iterations
False confidence - Thinks it remembers but doesn't

Real example: An AI is shown a list of 100 files. It processes the first 20, then forgets about the rest, and reports "all done" with 80 files untouched.

Traditional Approach Fails

# BAD: Try to hold all files in context
files = [file1, file2, file3, ... file100]
for file in files:  # AI will forget most of these
  process(file)

What happens:

AI loads all 100 filenames (1000+ tokens each iteration)
Can only focus on ~20 files before attention degrades
Forgets remaining 80 files
Returns confidently saying "all done"
Human discovers 80 files untouched

The Solution: External Index + Sequential Processing

Core Pattern

# 1. Generate external checklist
find . -name "*.md" > /tmp/checklist.txt
sed -i 's/^/[ ] /' /tmp/checklist.txt

# 2. Process loop - AI reads only ONE line at a time
while grep -q "^\[ \]" /tmp/checklist.txt; do
  # Get next uncompleted file (5-10 tokens)
  N=$(grep -n -m1 "^\[ \]" /tmp/checklist.txt | cut -d: -f1)
  NEXT=$(sed -n "${N}s/^\[ \] //p" /tmp/checklist.txt)

  # Process this ONE file completely
  # - AI reads full file
  # - AI makes all needed changes
  # - AI verifies changes

  # Mark complete (in-place edit). Address the line by number rather than
  # by path: "." and other regex characters in $NEXT would otherwise be
  # able to match unintended lines.
  sed -i "${N}s/^\[ \]/[x]/" /tmp/checklist.txt
done

# 3. Cleanup
rm /tmp/checklist.txt

Why It Works

Token Efficiency

Without file crawling: 100 iterations × 2,000 tokens (full file list re-read each time) = 200,000 tokens spent on tracking alone

With file crawling: 100 iterations × 10 tokens = 1,000 tokens

Savings: 199,000 tokens (99.5% reduction)
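
The arithmetic above can be reproduced with a few lines of shell. The per-iteration figures (2,000 and 10 tokens) are the illustrative estimates from this section, not measurements, and the variable names are placeholders.

```shell
# Illustrative estimates taken from the text above, not measured values
iterations=100
tokens_full_list=2000   # re-reading the whole file list on each iteration
tokens_one_line=10      # reading only the next checklist line

without=$((iterations * tokens_full_list))
with=$((iterations * tokens_one_line))
saved=$((without - with))

# Shell arithmetic is integer-only, so awk computes the percentage
pct=$(LC_ALL=C awk -v s="$saved" -v w="$without" 'BEGIN { printf "%.1f", 100 * s / w }')

echo "Without crawling: $without tokens"
echo "With crawling:    $with tokens"
echo "Saved:            $saved tokens ($pct%)"
```

Changing `iterations` or either per-iteration estimate shows how the saving scales with the size of the file list.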

Key Benefits

No forgetting - Files tracked externally, not in AI memory
Clear progress - Visual [x] marks show what's done
Resumable - Can stop and restart without losing place
Systematic - Guarantees every file processed exactly once
Verifiable - Human can check progress anytime
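
As a rough illustration of the "Resumable" point, the sketch below processes two entries, stops, and then finds its place again using nothing but the checklist file. The temporary directory, the file names, and the `process_next` helper are made up for the demo.

```shell
# Hypothetical demo: all paths and the process_next helper are illustrative
workdir=$(mktemp -d)
checklist="$workdir/checklist.txt"
printf '[ ] %s\n' a.md b.md c.md d.md > "$checklist"

# Mark the first unchecked line as done and print its path
process_next() {
  local n path
  n=$(grep -n -m1 '^\[ \]' "$checklist" | cut -d: -f1)
  [ -n "$n" ] || return 1                      # nothing left to do
  path=$(sed -n "${n}s/^\[ \] //p" "$checklist")
  # ... real work on "$path" would happen here ...
  # Portable stand-in for "sed -i": write a temp file, then replace
  sed "${n}s/^\[ \]/[x]/" "$checklist" > "$checklist.tmp" \
    && mv "$checklist.tmp" "$checklist"
  echo "$path"
}

# First session: handle two files, then "get interrupted"
process_next
process_next

# Later session: nothing is kept in memory, the checklist says where to resume
resumed_at=$(grep -m1 '^\[ \]' "$checklist" | sed 's/^\[ \] //')
echo "Resuming at: $resumed_at"
```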

When to Use File Crawling

✅ Use When:

Processing 10+ files systematically
Each file requires similar updates
Need clear progress visibility
Want resumability
Working across multiple turns

✅ Common in DDD:

Phase 1: Processing all documentation files
Phase 3: Code reconnaissance across modules
Phase 4: Implementing changes across files
Phase 5: Testing all documented examples

Step-by-Step Guide

Step 1: Generate File Index

# Find all files to process
find . -type f \( -name "*.md" -o -name "*.py" \) \
  ! -path "*/.git/*" ! -path "*/.venv/*" \
  > /tmp/files_to_process.txt

# Convert to checklist format
sed 's/^/[ ] /' /tmp/files_to_process.txt > /tmp/checklist.txt

# Show AI the checklist once
cat /tmp/checklist.txt
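
One possible refinement of Step 1, sketched here on the assumption that the step might be rerun after an interruption: sort the index so the order is stable, and refuse to overwrite a checklist that already exists. The `build_checklist` function and the demo directory are hypothetical.

```shell
# Hypothetical helper: build a sorted checklist, but never overwrite progress
build_checklist() {
  local root="$1" out="$2"
  if [ -s "$out" ]; then
    echo "Checklist already exists, keeping it: $out" >&2
    return 0
  fi
  find "$root" -type f \( -name "*.md" -o -name "*.py" \) \
    ! -path "*/.git/*" ! -path "*/.venv/*" \
    | sort | sed 's/^/[ ] /' > "$out"
}

# Demo in a throwaway directory
demo=$(mktemp -d)
mkdir -p "$demo/docs" "$demo/.git"
touch "$demo/docs/b.md" "$demo/docs/a.md" "$demo/main.py" "$demo/.git/x.md"

build_checklist "$demo" "$demo/checklist.txt"
cat "$demo/checklist.txt"
```

A second call with the same output path leaves the file untouched, so existing `[x]` marks survive an accidental rerun.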

Step 2: Sequential Processing

# AI executes this pattern
counter=0
while grep -q "^\[ \]" /tmp/checklist.txt; do
  # Get next (minimal tokens)
  N=$(grep -n -m1 "^\[ \]" /tmp/checklist.txt | cut -d: -f1)
  NEXT=$(sed -n "${N}s/^\[ \] //p" /tmp/checklist.txt)

  echo "Processing: $NEXT"

  # AI reads this ONE file COMPLETELY
  # AI makes ALL needed changes
  # AI verifies changes worked

  # Mark complete ONLY after full review (by line number, so regex
  # characters in the path cannot match the wrong line)
  sed -i "${N}s/^\[ \]/[x]/" /tmp/checklist.txt

  # Optional: Show progress every 10 files
  counter=$((counter + 1))
  if [ $((counter % 10)) -eq 0 ]; then
    DONE=$(grep -c "^\[x\]" /tmp/checklist.txt)
    TOTAL=$(wc -l < /tmp/checklist.txt)
    echo "Progress: $DONE/$TOTAL files"
  fi
done
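
The loop above never terminates if a file cannot be processed and is therefore never marked. One way to guard against that, shown here only as a sketch, is a third marker for failures. The `[!]` marker, the `set_marker` helper, and the `process_file` function are assumptions made for this example, not part of the pattern as described.

```shell
# Hypothetical stand-in for the real per-file work; fails on one file
process_file() {
  case $1 in
    *broken*) return 1 ;;
    *)        return 0 ;;
  esac
}

# Replace the "[ ]" marker on line $1 of checklist $2 with $3
set_marker() {
  sed "${1}s/^\[ \]/$3/" "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}

checklist=$(mktemp)
printf '[ ] %s\n' one.md broken.md three.md > "$checklist"

while N=$(grep -n -m1 '^\[ \]' "$checklist" | cut -d: -f1) && [ -n "$N" ]; do
  NEXT=$(sed -n "${N}s/^\[ \] //p" "$checklist")
  if process_file "$NEXT"; then
    set_marker "$N" "$checklist" '[x]'
  else
    set_marker "$N" "$checklist" '[!]'   # flagged for a human, loop moves on
  fi
done

cat "$checklist"
```

Because failed entries no longer start with `[ ]`, the loop always makes progress, and `grep '^\[!\]'` lists what still needs attention.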

Step 3: Verify and Cleanup

# Verify all files processed
REMAINING=$(grep -c "^\[ \]" /tmp/checklist.txt)
if [ "$REMAINING" -gt 0 ]; then
  echo "WARNING: $REMAINING files not processed"
  grep "^\[ \]" /tmp/checklist.txt
else
  echo "✓ All files processed"
fi

# Cleanup
rm /tmp/checklist.txt /tmp/files_to_process.txt
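
A completed checklist only shows that the files listed at index time were handled. If files may have been added while the crawl ran, comparing against a fresh `find` before cleanup can catch them. The sketch below uses `comm` on sorted lists; the demo directory and file names are made up.

```shell
demo=$(mktemp -d)
touch "$demo/a.md" "$demo/b.md"

# Checklist as it looks after a finished crawl (kept outside the crawled tree)
checklist=$(mktemp)
find "$demo" -name "*.md" | sort | sed 's/^/[x] /' > "$checklist"

# A file that appeared after the index was generated
touch "$demo/late.md"

# Strip the markers to recover the indexed paths, then list paths that are
# on disk but absent from the checklist.
# comm -13 prints only the lines unique to its second input.
indexed=$(mktemp)
current=$(mktemp)
sed 's/^\[.\] //' "$checklist" | sort > "$indexed"
find "$demo" -name "*.md" | sort > "$current"
missed=$(comm -13 "$indexed" "$current")

if [ -n "$missed" ]; then
  echo "WARNING: files not in checklist:"
  echo "$missed"
fi
```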

Common Mistakes to Avoid

❌ Loading Entire Checklist into Context

# BAD: All 100 files in context
for file in $(cat /tmp/checklist.txt); do
  # Won't work: loads every entry at once, and word splitting breaks "[ ] path" apart
done

# GOOD: Only next file
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt | sed 's/\[ \] //')

❌ Marking Complete Without Full Review

# BAD: Mark all complete after global replace
sed -i 's/old/new/g' docs/*.md
sed -i 's/^\[ \]/[x]/' /tmp/checklist.txt  # All marked!

# GOOD: Mark after individual review
# Read file → Make changes → Verify → Mark complete

Why this matters: Global replacements miss files and context. Each file needs individual attention.
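
A sketch of the "verify, then mark" discipline for a simple text replacement: each file is edited on its own, checked with `grep`, and only then marked. The strings `old_name` and `new_name` and the demo files are placeholders, and a real review usually needs more than a `grep`.

```shell
demo=$(mktemp -d)
printf 'uses old_name here\n' > "$demo/a.md"
printf 'nothing to change\n'  > "$demo/b.md"
checklist="$demo/checklist.txt"
printf '[ ] %s\n' "$demo/a.md" "$demo/b.md" > "$checklist"

while N=$(grep -n -m1 '^\[ \]' "$checklist" | cut -d: -f1) && [ -n "$N" ]; do
  NEXT=$(sed -n "${N}s/^\[ \] //p" "$checklist")

  # Change this one file only
  sed 's/old_name/new_name/g' "$NEXT" > "$NEXT.tmp" && mv "$NEXT.tmp" "$NEXT"

  # Verify before marking: the old string must be gone from this file
  if grep -q 'old_name' "$NEXT"; then
    echo "Verification failed, stopping: $NEXT" >&2
    break
  fi

  # Only now mark the entry as done
  sed "${N}s/^\[ \]/[x]/" "$checklist" > "$checklist.tmp" \
    && mv "$checklist.tmp" "$checklist"
done
```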

❌ Processing Multiple Files Per Iteration

# BAD: Process 5 at once
NEXT_5=$(grep -m5 "^\[ \]" /tmp/checklist.txt)

# GOOD: One at a time
NEXT=$(grep -m1 "^\[ \]" /tmp/checklist.txt | sed 's/\[ \] //')

Advanced Techniques

Filtered Crawling

# Only files mentioning "provider"
grep -rl "provider" docs/ | \
  grep -v "/\.git/" | \
  sed 's/^/[ ] /' > /tmp/provider_docs.txt
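
A variant of the filter, sketched with the `--include` and `--exclude-dir` options of GNU grep (assumed to be available), which restrict the search by file name and skip `.git` directories without a second `grep`. The demo tree is invented for illustration.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/docs/.git"
printf 'configure the provider\n' > "$demo/docs/setup.md"
printf 'unrelated text\n'         > "$demo/docs/intro.md"
printf 'provider\n'               > "$demo/docs/.git/notes.md"
printf 'provider\n'               > "$demo/docs/data.json"

# Only Markdown files that mention "provider", skipping any .git directory
grep -rl --include='*.md' --exclude-dir=.git "provider" "$demo/docs" \
  | sort | sed 's/^/[ ] /' > "$demo/provider_docs.txt"

cat "$demo/provider_docs.txt"
```

Of the four demo files, only `setup.md` ends up in the checklist: `intro.md` lacks the term, `data.json` is not Markdown, and `notes.md` sits under `.git`.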

Priority Ordering

# Manual priority: README first, then user-facing, then developer-facing.
# Keep comments out of the heredoc, or they become part of the file paths.
cat > /tmp/ordered.txt << 'EOF'
[ ] README.md
[ ] docs/USER_GUIDE.md
[ ] docs/API.md
EOF

# Or by size (fewest lines first); one wc call per file avoids a "total" row
find docs/ -name "*.md" -exec wc -l {} \; | \
  sort -n | sed 's/^ *[0-9]* /[ ] /' > /tmp/by_size.txt
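
A hand-written priority list can also be combined with an automatic listing: put the high-priority files first, append everything else, and drop duplicates while keeping the first occurrence. This is a sketch; the file names are placeholders, and `awk '!seen[$0]++'` is just one common way to deduplicate without sorting.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/docs"
touch "$demo/README.md" "$demo/docs/API.md" \
      "$demo/docs/USER_GUIDE.md" "$demo/docs/faq.md"

{
  # Hand-picked priority files first
  printf '%s\n' "$demo/README.md" "$demo/docs/USER_GUIDE.md"
  # Then every Markdown file, in a stable order
  find "$demo" -name "*.md" | sort
} | awk '!seen[$0]++' | sed 's/^/[ ] /' > "$demo/checklist.txt"

cat "$demo/checklist.txt"
```

The priority files appear once, at the top, even though `find` lists them again.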

Integration with DDD Process

File crawling is used throughout:

Phase 1: Documentation file processing
Phase 4: Code file implementation
Phase 5: Testing documented examples

Tips for Success

For AI Assistants

Always use file crawling for 10+ files
Process one file at a time
Read complete file before changes
Mark complete honestly
Show progress periodically

For Humans

Check progress: grep "^\[x\]" /tmp/checklist.txt | wc -l
Interrupt safely - resume from checklist
Verify completion before proceeding
Review checklist files
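
For a slightly richer progress check than counting `[x]` lines, a small function can report done, total, and percentage on one line. The `progress` name and the sample checklist are illustrative.

```shell
# Hypothetical helper: print "done/total (percent%)" for a checklist file
progress() {
  local file="$1" done_count total
  done_count=$(grep -c '^\[x\]' "$file")
  total=$(grep -c '^\[[ x]\]' "$file")
  if [ "$total" -eq 0 ]; then
    echo "0/0 (checklist is empty)"
    return 0
  fi
  # Integer division: the percentage is rounded down
  echo "$done_count/$total ($((100 * done_count / total))%)"
}

sample=$(mktemp)
printf '%s\n' '[x] a.md' '[x] b.md' '[ ] c.md' '[ ] d.md' > "$sample"
progress "$sample"
```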

Quick Reference

# Standard Pattern

# 1. Generate index
find . -name "*.md" > /tmp/files.txt
sed 's/^/[ ] /' /tmp/files.txt > /tmp/checklist.txt

# 2. Process loop
while grep -q "^\[ \]" /tmp/checklist.txt; do
  N=$(grep -n -m1 "^\[ \]" /tmp/checklist.txt | cut -d: -f1)
  NEXT=$(sed -n "${N}s/^\[ \] //p" /tmp/checklist.txt)
  # Process $NEXT completely
  # Mark by line number so regex characters in the path cannot mismatch
  sed -i "${N}s/^\[ \]/[x]/" /tmp/checklist.txt
done

# 3. Cleanup
rm /tmp/checklist.txt /tmp/files.txt

Related: Context Poisoning | Retcon Writing

See Also: Phase 1 | Phase 4