External Knowledge Integration - Quick Reference

One-page reference for developers implementing external knowledge integration

Architecture at a Glance

PROJECT MEMORY (SQLite) → CACHED DOCS (Files) → NEO4J (Optional)
     Priority: 1              Priority: 2         Priority: 3
    <10ms query             <20ms query         <50ms query (if needed)

Implementation Checklist

✅ Phase 1: File Cache (Start Here - 1 Day)

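# 1. Create the package directories and the cache class
mkdir -p src/amplihack/external_knowledge/sources tests/test_external_knowledge
touch src/amplihack/external_knowledge/cache.py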

# 2. Create Python docs fetcher
touch src/amplihack/external_knowledge/sources/python_docs.py

# 3. Write tests
touch tests/test_external_knowledge/test_cache.py

# 4. Test with real data
python -m pytest tests/test_external_knowledge/test_cache.py -v

⏳ Phase 2: Memory Integration (Week 2)

# 1. Create retriever
touch src/amplihack/external_knowledge/retriever.py

# 2. Integrate with memory manager
# Modify: src/amplihack/memory/manager.py (add external knowledge queries)

# 3. Update agent context builder
# Modify: Agent invocation code to include external knowledge

# 4. Test integration
python -m pytest tests/test_external_knowledge/test_integration.py -v

⏳ Phase 3: Multiple Sources (Week 3)

# Add more fetchers
touch src/amplihack/external_knowledge/sources/ms_learn.py
touch src/amplihack/external_knowledge/sources/stackoverflow.py
touch src/amplihack/external_knowledge/sources/mdn.py

⏳ Phase 4: Neo4j (Optional - Only If Needed)

# Only add if:
# - File cache queries consistently >100ms
# - Need complex relationship queries
# - Have >10k documents

touch src/amplihack/external_knowledge/neo4j_schema.py
touch src/amplihack/external_knowledge/code_linker.py

Code Snippets

1. Basic Cache Usage

from amplihack.external_knowledge import ExternalKnowledgeCache

cache = ExternalKnowledgeCache()

# Store
data = {"title": "asyncio.run", "description": "..."}
cache.set("python_docs", "asyncio.run", data, version="3.12")

# Retrieve
cached = cache.get("python_docs", "asyncio.run", version="3.12", max_age_days=30)
if cached:
    print(cached["data"])

2. Fetch and Cache Documentation

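from amplihack.external_knowledge import ExternalKnowledgeCache
from amplihack.external_knowledge.sources import PythonDocsFetcher

cache = ExternalKnowledgeCache()
fetcher = PythonDocsFetcher()
doc = fetcher.fetch_function_doc("asyncio", "run", "3.12")

if doc:
    # Cache it under the same version that was fetched
    cache.set("python_docs", f"{doc['module']}.{doc['function']}", doc, version="3.12")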

3. Integration with Memory System

from amplihack.memory import MemoryManager
from amplihack.external_knowledge import ExternalKnowledgeRetriever

memory = MemoryManager(session_id="my_session")
retriever = ExternalKnowledgeRetriever(memory)

# Automatic fallback: memory → cache → fetch
doc = retriever.get_function_doc("python", "asyncio", "run")

4. Agent Context with External Knowledge

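from amplihack.memory import MemoryManager
from amplihack.external_knowledge import ExternalKnowledgeRetriever

def build_agent_context(agent_id: str, task: str, memory: MemoryManager) -> str: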
    context = []

    # Project memory FIRST
    project_memories = memory.retrieve(agent_id=agent_id, search=task)
    if project_memories:
        context.append("## Project Knowledge")
        for m in project_memories:
            context.append(f"- {m.title}: {m.content}")

    # External knowledge IF NEEDED
    retriever = ExternalKnowledgeRetriever(memory)
    if retriever.should_fetch_external({"search_term": task}):
        ext_docs = retriever.get_relevant_docs(task, limit=2)
        if ext_docs:
            context.append("\n## External Reference (Advisory)")
            for doc in ext_docs:
                context.append(f"- [{doc['source']}] {doc['title']}")

    return "\n".join(context)

Performance Targets

| Metric | Target | How to Measure |
|---|---|---|
| Cache hit rate | >80% | `monitor.get_stats()["cache_hit_rate"]` |
| Query time | <100ms | Use monitoring decorator |
| Cache size | <100MB | `du -sh ~/.amplihack/external_knowledge` |
| Project memory first | 100% | Always check before external |

Configuration

Environment Variables

# Custom cache location
export AMPLIHACK_EXTERNAL_CACHE_DIR="/custom/path"

# Cache TTL (days)
export AMPLIHACK_EXTERNAL_CACHE_TTL="30"

# Enable/disable external knowledge
export AMPLIHACK_EXTERNAL_KNOWLEDGE_ENABLED="true"
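
One way these variables could be read in code is sketched below. `ExternalKnowledgeConfig` and `load_config` are hypothetical names, not part of the documented API. The 30-day TTL and the default directory are the values shown in this reference; treating `"true"`, `"1"`, and `"yes"` as enabled, and defaulting to enabled, are assumptions.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExternalKnowledgeConfig:
    """Hypothetical settings container for the three variables above."""
    cache_dir: Path
    ttl_days: int
    enabled: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> ExternalKnowledgeConfig:
    """Read settings from `env` (defaults to os.environ), with fallbacks."""
    env = os.environ if env is None else env
    cache_dir = Path(
        env.get("AMPLIHACK_EXTERNAL_CACHE_DIR", "~/.amplihack/external_knowledge")
    ).expanduser()
    ttl_days = int(env.get("AMPLIHACK_EXTERNAL_CACHE_TTL", "30"))
    # Assumption: these spellings (any case) mean "enabled"; anything else disables.
    flag = env.get("AMPLIHACK_EXTERNAL_KNOWLEDGE_ENABLED", "true")
    enabled = flag.strip().lower() in {"true", "1", "yes"}
    return ExternalKnowledgeConfig(cache_dir, ttl_days, enabled)


# Usage with an explicit mapping instead of the real environment
custom = load_config({
    "AMPLIHACK_EXTERNAL_CACHE_DIR": "/custom/path",
    "AMPLIHACK_EXTERNAL_CACHE_TTL": "7",
    "AMPLIHACK_EXTERNAL_KNOWLEDGE_ENABLED": "false",
})
defaults = load_config({})
```

Passing the mapping explicitly keeps the function easy to test without touching the process environment.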

Cache Location

Default: ~/.amplihack/external_knowledge/
Structure:
  python_docs/
    <hash>/
      3.12/
        data.json
  ms_learn/
    <hash>/
      data.json
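
The layout above can be sketched as a small file-backed cache. This is a minimal illustration, not the actual `ExternalKnowledgeCache`: the class name `FileCacheSketch` is hypothetical, `<hash>` is assumed to be a truncated SHA-256 of the identifier, and the `cached_at` timestamp field is an assumption about what is stored next to `data`.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, Optional


class FileCacheSketch:
    """Illustrative stand-in for ExternalKnowledgeCache; the real class may differ."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, source: str, identifier: str, version: Optional[str]) -> Path:
        # Assumption: <hash> is the first 16 hex chars of SHA-256(identifier).
        digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:16]
        entry_dir = self.root / source / digest
        if version:
            entry_dir = entry_dir / version  # versioned sources get a subfolder
        return entry_dir / "data.json"

    def set(self, source: str, identifier: str, data: Dict[str, Any],
            version: Optional[str] = None) -> None:
        path = self._path(source, identifier, version)
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = {"data": data, "cached_at": time.time()}
        path.write_text(json.dumps(payload), encoding="utf-8")

    def get(self, source: str, identifier: str, version: Optional[str] = None,
            max_age_days: float = 30) -> Optional[Dict[str, Any]]:
        path = self._path(source, identifier, version)
        if not path.exists():
            return None
        payload = json.loads(path.read_text(encoding="utf-8"))
        age_days = (time.time() - payload["cached_at"]) / 86400
        return payload if age_days <= max_age_days else None

    def invalidate(self, source: str, identifier: str,
                   version: Optional[str] = None) -> None:
        self._path(source, identifier, version).unlink(missing_ok=True)


# Usage against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    sketch = FileCacheSketch(Path(tmp))
    sketch.set("python_docs", "asyncio.run", {"title": "asyncio.run"}, version="3.12")
    hit = sketch.get("python_docs", "asyncio.run", version="3.12")
    other_version = sketch.get("python_docs", "asyncio.run", version="3.11")
    sketch.invalidate("python_docs", "asyncio.run", version="3.12")
    after_invalidate = sketch.get("python_docs", "asyncio.run", version="3.12")
```

Hashing the identifier keeps arbitrary names (dots, slashes) from leaking into file paths.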

Source Credibility

| Source | Trust Score | TTL | Use For |
|---|---|---|---|
| Python.org | 0.95 | 30d | API reference |
| MS Learn | 0.95 | 30d | Azure, .NET docs |
| MDN | 0.95 | 30d | Web APIs |
| Real Python | 0.85 | 90d | Tutorials |
| StackOverflow (accepted) | 0.75 | 7d | Solutions |
| GitHub (maintainer) | 0.80 | 14d | Library docs |
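
The table could be encoded as a lookup so retrieval code can rank results and pick a TTL per source. The sketch below copies the scores and TTLs from the table; the dictionary keys, `SourcePolicy`, and `rank_by_trust` are hypothetical names, and ranking unknown sources last is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourcePolicy:
    trust: float   # trust score from the table
    ttl_days: int  # cache lifetime in days


# Keys are hypothetical identifiers; values are copied from the table above.
SOURCE_POLICIES: Dict[str, SourcePolicy] = {
    "python_docs": SourcePolicy(0.95, 30),
    "ms_learn": SourcePolicy(0.95, 30),
    "mdn": SourcePolicy(0.95, 30),
    "real_python": SourcePolicy(0.85, 90),
    "github_maintainer": SourcePolicy(0.80, 14),
    "stackoverflow_accepted": SourcePolicy(0.75, 7),
}

# Assumption: sources missing from the table rank last and are never cached.
UNKNOWN_SOURCE = SourcePolicy(trust=0.0, ttl_days=0)


def rank_by_trust(docs: List[dict]) -> List[dict]:
    """Return docs ordered from most to least credible source (stable sort)."""
    return sorted(
        docs,
        key=lambda d: SOURCE_POLICIES.get(d["source"], UNKNOWN_SOURCE).trust,
        reverse=True,
    )


candidates = [
    {"source": "stackoverflow_accepted"},
    {"source": "python_docs"},
    {"source": "some_blog"},
    {"source": "real_python"},
]
ranked = [d["source"] for d in rank_by_trust(candidates)]
```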

Common Patterns

Pattern 1: Error-Driven Fetch

from typing import Optional

from amplihack.memory import MemoryManager
from amplihack.external_knowledge import ExternalKnowledgeRetriever

def handle_error_with_external_knowledge(error: Exception, memory: MemoryManager) -> Optional[str]:
    # Check project memory first
    solutions = memory.retrieve(
        search=str(error),
        tags=["error_solution"]
    )
    if solutions:
        return solutions[0].content

    # Fetch external solution
    retriever = ExternalKnowledgeRetriever(memory)
    external_solution = retriever.search_error_solution(str(error))

    if external_solution:
        # Store for next time
        memory.store(
            agent_id="error_handler",
            title=f"Solution: {type(error).__name__}",
            content=external_solution["solution"],
            tags=["error_solution"]
        )
        return external_solution["solution"]

    # Nothing found in project memory or externally
    return None

Pattern 2: API Documentation on Import

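import ast
from typing import Any, Dict

def link_imports_to_docs(code: str, retriever) -> Dict[str, Any]:
    """Fetch module-level docs for every `import x` statement in `code`."""
    docs = {}
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # Fetch and cache docs for imported module
                doc = retriever.get_function_doc(
                    "python",
                    alias.name,
                    ""  # Module-level docs
                )
                if doc:
                    docs[alias.name] = doc
    # The caller can store the returned docs in project memory
    return docs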
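
The pattern above only visits plain `import x` statements. A self-contained sketch of the module-discovery step that also covers absolute `from x import y` statements is shown below; `imported_modules` is a hypothetical helper, and skipping relative imports (which point at project code, not external docs) is an assumption about what is worth fetching.

```python
import ast
from typing import List


def imported_modules(code: str) -> List[str]:
    """Return the module names imported by `code`, de-duplicated, in first-seen order."""
    seen: List[str] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            if name not in seen:
                seen.append(name)
    return seen


sample = (
    "import os, sys\n"
    "from collections import OrderedDict\n"
    "from . import sibling\n"
    "import os\n"
)
modules = imported_modules(sample)
```

Each returned name could then be passed to `retriever.get_function_doc("python", name, "")` as in the pattern above.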

Pattern 3: Smart Refresh

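from datetime import datetime
from typing import Dict

# TTL_BY_SOURCE, fetch_from_source and cache are assumed to be defined elsewhere;
# doc["cached_at"] is assumed to be a datetime.
def refresh_if_needed(doc: Dict):
    age_days = (datetime.now() - doc["cached_at"]).days
    ttl = TTL_BY_SOURCE[doc["source"]]

    # Refresh only stale entries that are accessed often (high-value)
    if age_days > ttl and doc.get("access_count", 0) > 10:
        fresh_doc = fetch_from_source(doc)
        cache.set(doc["source"], doc["id"], fresh_doc)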
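
The refresh decision itself can be isolated as a pure function, which makes it testable without a cache or network. This is a sketch under the same rules as the pattern above (older than the source TTL and accessed more than 10 times); `should_refresh` and its parameters are hypothetical names, and `cached_at` is assumed to be a `datetime`.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def should_refresh(
    doc: dict,
    ttl_by_source: Dict[str, int],
    now: Optional[datetime] = None,
    min_access_count: int = 10,
) -> bool:
    """True when the doc has outlived its TTL and is used often enough to refresh."""
    now = now or datetime.now()
    age_days = (now - doc["cached_at"]).days
    expired = age_days > ttl_by_source[doc["source"]]
    return expired and doc.get("access_count", 0) > min_access_count


# Usage with a fixed clock so the result does not depend on the current date
clock = datetime(2024, 6, 1)
ttl = {"python_docs": 30}
stale_hot = {
    "source": "python_docs",
    "cached_at": clock - timedelta(days=40),
    "access_count": 11,
}
needs_refresh = should_refresh(stale_hot, ttl, now=clock)
```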

Monitoring

Get Cache Stats

from amplihack.external_knowledge import ExternalKnowledgeCache

cache = ExternalKnowledgeCache()
stats = cache.get_stats()

print(f"Total files: {stats['total_files']}")
print(f"Total size: {stats['total_size_mb']:.2f} MB")
print(f"Sources: {list(stats['sources'].keys())}")

Track Performance

from amplihack.external_knowledge.monitoring import monitor

@monitor.timed_query
def get_documentation(module: str, function: str):
    return retriever.get_function_doc("python", module, function)

# Get stats
stats = monitor.get_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']}")
print(f"Avg query time: {stats['avg_query_time_ms']}")
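
For orientation, a decorator-based monitor along these lines might look like the sketch below. `QueryMonitor` is a hypothetical stand-in, not the real `monitoring` module; counting any non-`None` result as a hit is an assumption, and the stat keys mirror the ones printed above.

```python
import functools
import time
from typing import Any, Callable, Dict, List


class QueryMonitor:
    """Hypothetical stand-in for amplihack.external_knowledge.monitoring.monitor."""

    def __init__(self) -> None:
        self._durations_ms: List[float] = []
        self._hits = 0

    def timed_query(self, func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            finally:
                # Record the duration even when the query raises.
                self._durations_ms.append((time.perf_counter() - start) * 1000)
            # Assumption: any non-None result counts as a hit.
            if result is not None:
                self._hits += 1
            return result
        return wrapper

    def get_stats(self) -> Dict[str, float]:
        total = len(self._durations_ms)
        return {
            "total_queries": total,
            "cache_hit_rate": self._hits / total if total else 0.0,
            "avg_query_time_ms": sum(self._durations_ms) / total if total else 0.0,
        }


sketch_monitor = QueryMonitor()


@sketch_monitor.timed_query
def lookup(key: str):
    return {"asyncio.run": "doc"}.get(key)


lookup("asyncio.run")  # hit
lookup("missing")      # miss
sketch_stats = sketch_monitor.get_stats()
```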

Testing

Test Cache

from amplihack.external_knowledge import ExternalKnowledgeCache

def test_cache():
    cache = ExternalKnowledgeCache()

    # Store
    cache.set("test_source", "test_id", {"key": "value"})

    # Retrieve
    cached = cache.get("test_source", "test_id")
    assert cached["data"]["key"] == "value"

    # Invalidate
    cache.invalidate("test_source", "test_id")
    assert cache.get("test_source", "test_id") is None

Test Integration

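from amplihack.memory import MemoryManager
from amplihack.external_knowledge import ExternalKnowledgeRetriever

def test_integration_with_memory():
    memory = MemoryManager(session_id="test_session")
    retriever = ExternalKnowledgeRetriever(memory)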

    # Store in project memory
    memory.store(
        agent_id="test",
        title="asyncio.run usage",
        content="...",
        tags=["external_doc"]
    )

    # Should retrieve from project memory, not external
    doc = retriever.get_function_doc("python", "asyncio", "run")
    # Verify it came from memory, not external fetch

Troubleshooting

Problem: Slow Queries

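from amplihack.external_knowledge.monitoring import monitor

# Check cache hit rate (hit rate and timings come from the monitor, not the cache)
stats = monitor.get_stats()
if stats["cache_hit_rate"] < 0.7:
    # Pre-warm cache for common APIs (pre_warm_cache and common_apis are placeholders)
    pre_warm_cache(common_apis)

# Check query time
if stats["avg_query_time_ms"] > 100:
    # Consider adding Neo4j for metadata (Phase 4)
    print("Average query time exceeds 100ms; consider Neo4j for metadata")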

Problem: Stale Data

# Force refresh high-value docs
# get_top_docs is a placeholder assumed to yield (source, doc_id) pairs
for source, doc_id in get_top_docs(limit=20):
    cache.invalidate(source, doc_id)
    # Will fetch fresh on next access

Problem: Cache Too Large

# Clean up old, unused entries
cache.cleanup(older_than_days=90, unused=True)
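
What such a cleanup might do on disk is sketched below, assuming the `data.json` layout from the Cache Location section. `cleanup_old_entries` is a hypothetical helper; it uses file modification time as the age signal and does not model the `unused` flag, whose exact meaning this reference does not define.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def cleanup_old_entries(root: Path, older_than_days: int,
                        now: Optional[float] = None) -> int:
    """Delete data.json files not modified within `older_than_days`; return how many."""
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not disturb the directory walk.
    for path in list(Path(root).rglob("data.json")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


# Usage against a throwaway directory with one fresh and one stale entry
with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "python_docs" / "aaa" / "3.12" / "data.json"
    stale = Path(tmp) / "ms_learn" / "bbb" / "data.json"
    for entry in (fresh, stale):
        entry.parent.mkdir(parents=True)
        entry.write_text("{}", encoding="utf-8")
    old = time.time() - 120 * 86400
    os.utime(stale, (old, old))  # backdate the stale entry by 120 days

    removed_count = cleanup_old_entries(Path(tmp), older_than_days=90)
    fresh_survived = fresh.exists()
    stale_survived = stale.exists()
```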

Neo4j Queries (Phase 4 - Optional)

Find Documentation for API

MATCH (api:APIReference)-[:DOCUMENTED_IN]->(doc:ExternalDoc)
WHERE api.namespace = "asyncio" AND api.function_name = "run"
RETURN doc

Find Best Practices for Domain

MATCH (bp:BestPractice)
WHERE bp.domain = "authentication"
RETURN bp
ORDER BY bp.confidence_score DESC
LIMIT 5

Track API Usage in Project

MATCH (api:APIReference)-[:USED_IN]->(file:CodeFile)
WHERE file.project_id = $project_id
RETURN api.namespace, api.function_name, count(file) as usage_count
ORDER BY usage_count DESC

Key Principles

Project Memory First: Always check before fetching externally
Cache Aggressively: 30-day TTL for official docs
Graceful Degradation: System works without external knowledge
Measure Before Optimizing: Start simple, add complexity only if needed
Version Awareness: Always track compatibility
Source Credibility: Official > curated > community
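
The graceful-degradation principle can be illustrated with a small guard around any external fetch. This is a sketch only: `fetch_or_none` is a hypothetical helper, and catching every `Exception` is a deliberate assumption that external sources are optional and must never break the caller.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("external_knowledge_sketch")


def fetch_or_none(fetch: Callable[[], Optional[dict]],
                  enabled: bool = True) -> Optional[dict]:
    """Run `fetch`, returning None instead of raising when the source fails."""
    if not enabled:
        return None  # feature switched off: behave as if nothing was found
    try:
        return fetch()
    except Exception as exc:  # broad on purpose: external knowledge is optional
        logger.warning("external knowledge unavailable: %s", exc)
        return None


def offline_fetch() -> dict:
    raise ConnectionError("network is down")


degraded = fetch_or_none(offline_fetch)
healthy = fetch_or_none(lambda: {"title": "asyncio.run"})
disabled = fetch_or_none(lambda: {"title": "asyncio.run"}, enabled=False)
```

The caller then continues with project memory alone whenever the result is `None`.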

Decision Tree

Need external knowledge?
    ↓ YES
Check project memory
    ↓ Not found      → Found: use project memory ✅
Check file cache
    ↓ Not found      → Found: return cached doc ✅
Fetch from source
    ↓
Cache for future ✅
    ↓
Store in project memory if used ✅
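
The lookup order in the tree can be sketched with plain dictionaries standing in for the two local tiers. `resolve_doc` and the tier labels are hypothetical. The final step, storing a doc in project memory once it has been used, is left to the caller here.

```python
from typing import Callable, Dict, List, Optional, Tuple


def resolve_doc(
    key: str,
    project_memory: Dict[str, dict],
    file_cache: Dict[str, dict],
    fetch: Callable[[str], Optional[dict]],
) -> Tuple[Optional[dict], str]:
    """Return (doc, tier), where tier names the layer that answered."""
    if key in project_memory:          # 1. project memory always wins
        return project_memory[key], "memory"
    if key in file_cache:              # 2. then the local file cache
        return file_cache[key], "cache"
    doc = fetch(key)                   # 3. finally the external source
    if doc is None:
        return None, "miss"
    file_cache[key] = doc              # cache for future lookups
    return doc, "fetched"


# Usage with in-memory stand-ins and a fetcher that records its calls
memory_tier = {"asyncio.run": {"origin": "memory"}}
cache_tier = {"asyncio.gather": {"origin": "cache"}}
fetch_calls: List[str] = []


def fake_fetch(key: str) -> Optional[dict]:
    fetch_calls.append(key)
    return {"origin": "network"} if key == "asyncio.sleep" else None


tiers = [
    resolve_doc(k, memory_tier, cache_tier, fake_fetch)[1]
    for k in ("asyncio.run", "asyncio.gather", "asyncio.sleep", "asyncio.sleep", "nope")
]
```

The second lookup of the same fetched key is answered from the cache, so the external source is contacted only once per document.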

Quick Commands

# Create basic structure
mkdir -p src/amplihack/external_knowledge/sources tests/test_external_knowledge
touch src/amplihack/external_knowledge/{__init__,cache,retriever,monitoring}.py
touch src/amplihack/external_knowledge/sources/__init__.py

# Run tests
pytest tests/test_external_knowledge/ -v

# Check cache stats
python -c "from amplihack.external_knowledge import ExternalKnowledgeCache; print(ExternalKnowledgeCache().get_stats())"

# Clean cache (default location; use your AMPLIHACK_EXTERNAL_CACHE_DIR path if you set one)
rm -rf ~/.amplihack/external_knowledge/

# Monitor performance
python -c "from amplihack.external_knowledge.monitoring import monitor; print(monitor.get_stats())"

File Locations Quick Reference

Design Docs:
- EXTERNAL_KNOWLEDGE_NEO4J_DESIGN.md         (Full design)
- EXTERNAL_KNOWLEDGE_IMPLEMENTATION_GUIDE.md (Code examples)
- EXTERNAL_KNOWLEDGE_INTEGRATION_SUMMARY.md  (Summary)
- EXTERNAL_KNOWLEDGE_QUICK_REFERENCE.md      (This file)

Implementation:
src/amplihack/external_knowledge/
├── cache.py              # File-based cache
├── retriever.py          # Main retrieval logic
├── monitoring.py         # Performance tracking
├── neo4j_schema.py       # Neo4j schema (optional, Phase 4)
├── code_linker.py        # Code linker (optional, Phase 4)
└── sources/
    ├── python_docs.py    # Python fetcher
    ├── ms_learn.py       # MS Learn fetcher
    ├── stackoverflow.py  # StackOverflow fetcher
    └── mdn.py            # MDN fetcher

Tests:
tests/test_external_knowledge/
├── test_cache.py
├── test_retriever.py
└── test_integration.py

Ready to implement? Start with Phase 1: src/amplihack/external_knowledge/cache.py

Questions? Refer to:

Design details → EXTERNAL_KNOWLEDGE_NEO4J_DESIGN.md
Code examples → EXTERNAL_KNOWLEDGE_IMPLEMENTATION_GUIDE.md
Architecture → EXTERNAL_KNOWLEDGE_INTEGRATION_SUMMARY.md