Skip to content

External Knowledge Integration for Neo4j Memory Graph

Complete design and implementation strategy for integrating external knowledge sources (API docs, developer guides, library references) into the coding agent memory system.


📚 Documentation Overview

This package contains comprehensive design documents for integrating external knowledge sources into the Neo4j-based memory graph for coding agents. The design follows the project's ruthless simplicity philosophy: start simple, measure performance, and add complexity only when justified by metrics.

Documents in This Package

Document Purpose Audience Size
NEO4J_DESIGN.md Complete design specification Architects, system designers 39KB
IMPLEMENTATION_GUIDE.md Concrete code examples and patterns Developers, implementers 33KB
INTEGRATION_SUMMARY.md Strategic overview and cost-benefit analysis Product managers, tech leads 18KB
QUICK_REFERENCE.md One-page developer reference All developers 12KB

Total Documentation: 102KB across 4 comprehensive documents


🎯 Quick Start

For Architects & Decision Makers

Start here: INTEGRATION_SUMMARY.md

Key takeaways:

  • Three-tier architecture (Project Memory → File Cache → Neo4j optional)
  • Phased implementation (4 phases, 5 weeks)
  • No breaking changes to existing system
  • Performance targets: <100ms queries, >80% cache hit rate

For Developers Implementing This

Start here: QUICK_REFERENCE.md

Then:

  1. Read IMPLEMENTATION_GUIDE.md for code examples
  2. Refer to NEO4J_DESIGN.md for detailed design decisions

For Code Review / Deep Dive

Start here: NEO4J_DESIGN.md

Complete specification covering:

  • Graph schema (nodes, relationships)
  • External knowledge sources (official docs, tutorials, community)
  • Caching strategies
  • Version tracking
  • Performance optimization
  • Integration patterns

🏗️ Architecture Summary

Three-Tier Strategy

┌────────────────────────────────────────────────────┐
│ Tier 1: Project Memory (SQLite)                   │
│ - HIGHEST PRIORITY                                 │
│ - Learned patterns from THIS project               │
│ - <10ms query performance                          │
└────────────────┬───────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ Tier 2: Cached External Knowledge (Files)         │
│ - ADVISORY                                         │
│ - Official docs, tutorials, solutions             │
│ - <20ms query performance                         │
│ - 7-30 day TTL by source type                     │
└────────────────┬───────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ Tier 3: Neo4j Metadata (Optional - Phase 4)       │
│ - OPTIMIZATION                                     │
│ - Fast relationship queries                       │
│ - <50ms query performance                         │
│ - Only add if file cache becomes bottleneck       │
└────────────────────────────────────────────────────┘

Key Design Principles

  1. Project Memory First: Always check project-specific learnings before external sources
  2. Start Simple: File-based cache before Neo4j
  3. Measure Before Optimizing: Add complexity only when metrics justify it
  4. No Breaking Changes: System works identically with or without external knowledge
  5. Graceful Degradation: Works offline after cache warm-up
  6. Version Awareness: Track compatibility with language/framework versions
  7. Source Credibility: Official docs > curated tutorials > community solutions

📊 Implementation Phases

Phase 1: File-Based Cache (Week 1) ✅ Ready to Implement

Goal: Prove value with simplest approach

Deliverables:

  • ExternalKnowledgeCache class (file-based storage)
  • PythonDocsFetcher class (fetch Python official docs)
  • Basic tests
  • Performance baseline

Success Criteria:

  • Can fetch and cache Python official docs
  • Query time <100ms
  • Cache hit rate >70% after warm-up

Phase 2: Memory Integration (Week 2)

Goal: Connect to existing memory system

Deliverables:

  • ExternalKnowledgeRetriever class
  • Integration with MemoryManager
  • Agent context builder enhancement
  • Integration tests

Success Criteria:

  • Agents query external knowledge when needed
  • Project memory always checked first
  • No breaking changes to existing agents

Phase 3: Multiple Sources (Week 3)

Goal: Expand knowledge sources

Deliverables:

  • MS Learn fetcher
  • MDN Web Docs fetcher
  • StackOverflow fetcher (with quality filtering)
  • Source credibility scoring

Success Criteria:

  • Support 3+ external sources
  • Source ranking by credibility
  • Smart fallback chains

Phase 4: Neo4j Optimization (Week 4+ - Optional)

Goal: Optimize for scale and relationships

Condition: Only implement if:

  • File cache queries consistently >100ms
  • Need complex relationship queries
  • Have >10,000 documents

Deliverables:

  • Neo4j schema implementation
  • Automatic code-to-doc linking
  • Complex version queries
  • Analytics and recommendations

🎓 Knowledge Sources

Tier 1: Official Documentation (Trust Score: 0.9-1.0)

Source Coverage Use Case
Python.org Python standard library API reference
MS Learn Azure, .NET, TypeScript Microsoft ecosystem
MDN JavaScript, Web APIs Web development
Library official docs Specific libraries Framework-specific

Characteristics: High credibility, version-specific, regularly updated

Tier 2: Curated Tutorials (Trust Score: 0.7-0.9)

Source Coverage Use Case
Real Python Python tutorials Learning resources
FreeCodeCamp Web development Beginner-friendly
Official framework tutorials Framework-specific Getting started

Characteristics: High quality, practical examples, may lag latest versions

Tier 3: Community Knowledge (Trust Score: 0.4-0.8)

Source Coverage Use Case
StackOverflow Error solutions Problem-solving
GitHub Issues Library-specific Bug workarounds
Reddit r/programming Best practices Community wisdom

Characteristics: Variable quality, practical solutions, requires filtering


🔍 Graph Schema (Neo4j - Phase 4)

Core Node Types

// External documentation
(:ExternalDoc {
    id, source, source_url, title, summary,
    version, language, category, relevance_score
})

// API references
(:APIReference {
    id, namespace, function_name, signature,
    version_introduced, deprecated_in
})

// Best practices
(:BestPractice {
    id, title, domain, description,
    confidence_score
})

// Code examples
(:CodeExample {
    id, title, language, code, upvotes
})

Key Relationships

// Link to project code
(doc:ExternalDoc)-[:EXPLAINS]->(file:CodeFile)
(api:APIReference)-[:USED_IN]->(func:Function)

// Knowledge hierarchy
(api:APIReference)-[:DOCUMENTED_IN]->(doc:ExternalDoc)
(ex:CodeExample)-[:DEMONSTRATES]->(api:APIReference)

// Version tracking
(api:APIReference)-[:COMPATIBLE_WITH {version}]->(lang:Language)
(old:APIReference)-[:REPLACED_BY {in_version}]->(new:APIReference)

📈 Performance Targets

Metric Target Rationale
Query time <100ms Keep agents responsive
Cache hit rate >80% Minimize external fetches
Cache size <100MB For 10k documents (metadata only)
Project memory check 100% Always check before external

Measured Performance (Phase 1-2 Expected)

Operation Target Expected
Project memory lookup <10ms 2-5ms
Cache lookup <20ms 5-15ms
External fetch <500ms 100-300ms
End-to-end <100ms 60-80ms

🔧 Integration with Existing System

Current State (SQLite Memory System)

# Agent gets context from project memory only
memory = MemoryManager(session_id=session_id)
project_memories = memory.retrieve(agent_id=agent_id, search=task)

Enhanced State (With External Knowledge)

# Agent gets context from project memory + external knowledge
memory = MemoryManager(session_id=session_id)
retriever = ExternalKnowledgeRetriever(memory)

context = build_comprehensive_context(
    agent_id=agent_id,
    task=task,
    memory=memory,
    retriever=retriever
)
# context includes:
# 1. Project-specific memories (priority 1)
# 2. Relevant external docs (priority 2, advisory)

Key: Project memory is always checked first. External knowledge is advisory only.


💡 Usage Examples

Example 1: New API Usage

# Agent task: "Use Azure Blob Storage to upload a file"

# Flow:
# 1. Check project memory → No prior Blob Storage usage
# 2. External retriever detects new API
# 3. Fetch Azure Blob Storage docs from MS Learn
# 4. Cache for 30 days
# 5. Provide agent with API reference + code example
# 6. Agent completes task successfully
# 7. Store pattern in project memory
# 8. Next time: Retrieved from project memory (faster!)

Example 2: Error Resolution

# Agent encounters: ImportError: No module named 'asyncio'

# Flow:
# 1. Check project memory → No prior solution
# 2. Query external knowledge for error pattern
# 3. Find StackOverflow accepted answer (150+ upvotes)
# 4. Extract solution: "asyncio is built-in for Python 3.4+"
# 5. Provide solution to agent
# 6. Store in project memory with tag "error_solution"
# 7. Next time: Instant resolution from project memory

🎯 Success Metrics

Must Have (All Phases)

  • ✅ No breaking changes to existing system
  • ✅ Project memory always checked first
  • ✅ External knowledge is advisory only
  • ✅ Graceful degradation if external unavailable

Should Have (Phase 1-2)

  • ✅ Cache hit rate >70%
  • ✅ Query performance <100ms
  • ✅ Multiple source support

Nice to Have (Phase 4)

  • ⏳ Neo4j relationship queries
  • ⏳ Complex version tracking
  • ⏳ Recommendation engine
  • ⏳ Learning analytics

🛠️ File Structure

Documentation:
├── EXTERNAL_KNOWLEDGE_README.md                    (This file)
├── EXTERNAL_KNOWLEDGE_NEO4J_DESIGN.md             (Complete design)
├── EXTERNAL_KNOWLEDGE_IMPLEMENTATION_GUIDE.md     (Code examples)
├── EXTERNAL_KNOWLEDGE_INTEGRATION_SUMMARY.md      (Strategic overview)
└── EXTERNAL_KNOWLEDGE_QUICK_REFERENCE.md          (Developer reference)

Implementation (Phase 1-2):
src/amplihack/external_knowledge/
├── __init__.py
├── cache.py                    # File-based cache (START HERE)
├── retriever.py               # Main retrieval logic
├── monitoring.py              # Performance tracking
└── sources/
    ├── __init__.py
    ├── python_docs.py         # Python official docs fetcher
    ├── ms_learn.py            # MS Learn fetcher
    ├── stackoverflow.py       # StackOverflow fetcher
    └── mdn.py                 # MDN Web Docs fetcher

Implementation (Phase 4 - Optional):
src/amplihack/external_knowledge/
├── neo4j_schema.py           # Neo4j integration
└── code_linker.py            # Automatic code-to-doc linking

Data Storage:
├── ~/.amplihack/external_knowledge/  # File cache (Phase 1-3)
└── Neo4j database (optional)         # Metadata + relationships (Phase 4)

Tests:
tests/test_external_knowledge/
├── test_cache.py
├── test_retriever.py
├── test_integration.py
└── test_neo4j.py              (Phase 4 only)

📋 Next Steps

For Project Leadership

  1. Review INTEGRATION_SUMMARY.md
  2. Approve phased implementation approach
  3. Allocate resources for Phase 1-2 (2 weeks)

For Development Team

  1. Read QUICK_REFERENCE.md
  2. Review IMPLEMENTATION_GUIDE.md
  3. Start Phase 1: Implement ExternalKnowledgeCache class
  4. Set up basic tests
  5. Measure baseline performance

For Architecture Review

  1. Deep dive into NEO4J_DESIGN.md
  2. Validate graph schema design
  3. Review integration patterns
  4. Approve or suggest modifications

❓ FAQ

Why start with files instead of Neo4j?

Answer: Following the project's ruthless simplicity philosophy. Files are:

  • Simple to implement and debug
  • Zero runtime dependencies
  • Version control friendly
  • Fast enough for most use cases (<100ms)

Neo4j adds value only when:

  • File queries consistently >100ms
  • Need complex relationship traversal
  • Have >10k documents with rich relationships

How does this integrate with the existing SQLite memory system?

Answer: Seamlessly. The SQLite memory system remains unchanged and is ALWAYS queried first. External knowledge is an optional enhancement that provides additional context when project memory doesn't have sufficient information.

What if external sources are unavailable?

Answer: Graceful degradation. The system:

  1. Uses cached data (even if slightly stale)
  2. Falls back to project memory only
  3. Continues working normally
  4. Never fails due to external unavailability

How is version compatibility handled?

Answer: Multiple strategies:

  • Cache is version-aware (Python 3.11 vs 3.12 cached separately)
  • Neo4j relationships track compatibility
  • Deprecation detection identifies outdated APIs
  • Version queries find appropriate documentation

What about cost/performance of external fetches?

Answer: Minimized through:

  • Aggressive caching (7-30 day TTL)
  • Project memory checked first
  • Batch fetching where possible
  • Smart refresh (only high-value docs)
  • Target: >80% cache hit rate

📞 Support & Feedback


🏆 Design Highlights

What Makes This Design Great

  1. Ruthlessly Simple: Start with files, not databases
  2. No Breaking Changes: Existing system works identically
  3. Project Memory First: Always prioritizes project-specific learnings
  4. Graceful Degradation: Works offline after warm-up
  5. Phased Implementation: Prove value before adding complexity
  6. Version Aware: Tracks compatibility across language versions
  7. Source Credibility: Ranks sources by trust score
  8. Performance Focused: <100ms query target
  9. Measurement Driven: Add Neo4j only if metrics justify it
  10. Integration Ready: Works with existing SQLite memory system

Alignment with Project Philosophy

  • Ruthless Simplicity: File cache before Neo4j
  • Modular Design: Clear interfaces, replaceable components
  • Zero-BS Implementation: No stubs, everything works
  • Measure First: Add complexity only when justified
  • AI-Ready: Clear contracts for agent integration

📜 License & Attribution

This design is part of the Microsoft Hackathon 2025 - Agentic Coding Framework project.

Design Principles: Based on project's ruthless simplicity philosophy Database Philosophy: Inspired by database.md agent guidelines (start simple, measure, optimize) Integration: Builds on existing SQLite memory system (amplihack/memory/)


Status: ✅ Design Complete | 🚀 Ready for Phase 1 Implementation

Last Updated: November 2, 2025


Quick Navigation

Total Documentation: 102KB of comprehensive design and implementation guidance

Ready to build? Start with Phase 1: src/amplihack/external_knowledge/cache.py