External Knowledge Integration for Neo4j Memory Graph¶

Complete design and implementation strategy for integrating external knowledge sources (API docs, developer guides, library references) into the coding agent memory system.

📚 Documentation Overview¶

This package contains comprehensive design documents for integrating external knowledge sources into the Neo4j-based memory graph for coding agents. The design follows the project's ruthless simplicity philosophy: start simple, measure performance, and add complexity only when justified by metrics.

Documents in This Package¶

Document	Purpose	Audience	Size
NEO4J_DESIGN.md	Complete design specification	Architects, system designers	39KB
IMPLEMENTATION_GUIDE.md	Concrete code examples and patterns	Developers, implementers	33KB
INTEGRATION_SUMMARY.md	Strategic overview and cost-benefit analysis	Product managers, tech leads	18KB
QUICK_REFERENCE.md	One-page developer reference	All developers	12KB

Total Documentation: 102KB across 4 comprehensive documents

🎯 Quick Start¶

For Architects & Decision Makers¶

Start here: INTEGRATION_SUMMARY.md

Key takeaways:

Three-tier architecture (Project Memory → File Cache → Neo4j optional)
Phased implementation (4 phases, 5 weeks)
No breaking changes to existing system
Performance targets: <100ms queries, >80% cache hit rate

For Developers Implementing This¶

Start here: QUICK_REFERENCE.md

Then:

Read IMPLEMENTATION_GUIDE.md for code examples
Refer to NEO4J_DESIGN.md for detailed design decisions

For Code Review / Deep Dive¶

Start here: NEO4J_DESIGN.md

Complete specification covering:

Graph schema (nodes, relationships)
External knowledge sources (official docs, tutorials, community)
Caching strategies
Version tracking
Performance optimization
Integration patterns

🏗️ Architecture Summary¶

Three-Tier Strategy¶

┌────────────────────────────────────────────────────┐
│ Tier 1: Project Memory (SQLite)                   │
│ - HIGHEST PRIORITY                                 │
│ - Learned patterns from THIS project               │
│ - <10ms query performance                          │
└────────────────┬───────────────────────────────────┘
                 │
                 ▼
┌────────────────────────────────────────────────────┐
│ Tier 2: Cached External Knowledge (Files)         │
│ - ADVISORY                                         │
│ - Official docs, tutorials, solutions             │
│ - <20ms query performance                         │
│ - 7-30 day TTL by source type                     │
└────────────────┬───────────────────────────────────┘
                 │
                 ▼
┌────────────────────────────────────────────────────┐
│ Tier 3: Neo4j Metadata (Optional - Phase 4)       │
│ - OPTIMIZATION                                     │
│ - Fast relationship queries                       │
│ - <50ms query performance                         │
│ - Only add if file cache becomes bottleneck       │
└────────────────────────────────────────────────────┘

Key Design Principles¶

Project Memory First: Always check project-specific learnings before external sources
Start Simple: File-based cache before Neo4j
Measure Before Optimizing: Add complexity only when metrics justify it
No Breaking Changes: System works identically with or without external knowledge
Graceful Degradation: Works offline after cache warm-up
Version Awareness: Track compatibility with language/framework versions
Source Credibility: Official docs > curated tutorials > community solutions

📊 Implementation Phases¶

Phase 1: File-Based Cache (Week 1) ✅ Ready to Implement¶

Goal: Prove value with simplest approach

Deliverables:

ExternalKnowledgeCache class (file-based storage)
PythonDocsFetcher class (fetch Python official docs)
Basic tests
Performance baseline

Success Criteria:

Can fetch and cache Python official docs
Query time <100ms
Cache hit rate >70% after warm-up

Phase 2: Memory Integration (Week 2)¶

Goal: Connect to existing memory system

Deliverables:

ExternalKnowledgeRetriever class
Integration with MemoryManager
Agent context builder enhancement
Integration tests

Success Criteria:

Agents query external knowledge when needed
Project memory always checked first
No breaking changes to existing agents

Phase 3: Multiple Sources (Week 3)¶

Goal: Expand knowledge sources

Deliverables:

MS Learn fetcher
MDN Web Docs fetcher
StackOverflow fetcher (with quality filtering)
Source credibility scoring

Success Criteria:

Support 3+ external sources
Source ranking by credibility
Smart fallback chains

Phase 4: Neo4j Optimization (Week 4+ - Optional)¶

Goal: Optimize for scale and relationships

Condition: Only implement if:

File cache queries consistently >100ms
Need complex relationship queries
Have >10,000 documents

Deliverables:

Neo4j schema implementation
Automatic code-to-doc linking
Complex version queries
Analytics and recommendations

🎓 Knowledge Sources¶

Tier 1: Official Documentation (Trust Score: 0.9-1.0)¶

Source	Coverage	Use Case
Python.org	Python standard library	API reference
MS Learn	Azure, .NET, TypeScript	Microsoft ecosystem
MDN	JavaScript, Web APIs	Web development
Library official docs	Specific libraries	Framework-specific

Characteristics: High credibility, version-specific, regularly updated

Tier 2: Curated Tutorials (Trust Score: 0.7-0.9)¶

Source	Coverage	Use Case
Real Python	Python tutorials	Learning resources
FreeCodeCamp	Web development	Beginner-friendly
Official framework tutorials	Framework-specific	Getting started

Characteristics: High quality, practical examples, may lag latest versions

Tier 3: Community Knowledge (Trust Score: 0.4-0.8)¶

Source	Coverage	Use Case
StackOverflow	Error solutions	Problem-solving
GitHub Issues	Library-specific	Bug workarounds
Reddit r/programming	Best practices	Community wisdom

Characteristics: Variable quality, practical solutions, requires filtering

🔍 Graph Schema (Neo4j - Phase 4)¶

Core Node Types¶

// External documentation
(:ExternalDoc {
    id, source, source_url, title, summary,
    version, language, category, relevance_score
})

// API references
(:APIReference {
    id, namespace, function_name, signature,
    version_introduced, deprecated_in
})

// Best practices
(:BestPractice {
    id, title, domain, description,
    confidence_score
})

// Code examples
(:CodeExample {
    id, title, language, code, upvotes
})

Key Relationships¶

// Link to project code
(doc:ExternalDoc)-[:EXPLAINS]->(file:CodeFile)
(api:APIReference)-[:USED_IN]->(func:Function)

// Knowledge hierarchy
(api:APIReference)-[:DOCUMENTED_IN]->(doc:ExternalDoc)
(ex:CodeExample)-[:DEMONSTRATES]->(api:APIReference)

// Version tracking
(api:APIReference)-[:COMPATIBLE_WITH {version}]->(lang:Language)
(old:APIReference)-[:REPLACED_BY {in_version}]->(new:APIReference)

📈 Performance Targets¶

Metric	Target	Rationale
Query time	<100ms	Keep agents responsive
Cache hit rate	>80%	Minimize external fetches
Cache size	<100MB	For 10k documents (metadata only)
Project memory check	100%	Always check before external

Measured Performance (Phase 1-2 Expected)¶

Operation	Target	Expected
Project memory lookup	<10ms	2-5ms
Cache lookup	<20ms	5-15ms
External fetch	<500ms	100-300ms
End-to-end	<100ms	60-80ms

🔧 Integration with Existing System¶

Current State (SQLite Memory System)¶

# Agent gets context from project memory only
memory = MemoryManager(session_id=session_id)
project_memories = memory.retrieve(agent_id=agent_id, search=task)

Enhanced State (With External Knowledge)¶

# Agent gets context from project memory + external knowledge
memory = MemoryManager(session_id=session_id)
retriever = ExternalKnowledgeRetriever(memory)

context = build_comprehensive_context(
    agent_id=agent_id,
    task=task,
    memory=memory,
    retriever=retriever
)
# context includes:
# 1. Project-specific memories (priority 1)
# 2. Relevant external docs (priority 2, advisory)

Key: Project memory is always checked first. External knowledge is advisory only.

💡 Usage Examples¶

Example 1: New API Usage¶

# Agent task: "Use Azure Blob Storage to upload a file"

# Flow:
# 1. Check project memory → No prior Blob Storage usage
# 2. External retriever detects new API
# 3. Fetch Azure Blob Storage docs from MS Learn
# 4. Cache for 30 days
# 5. Provide agent with API reference + code example
# 6. Agent completes task successfully
# 7. Store pattern in project memory
# 8. Next time: Retrieved from project memory (faster!)

Example 2: Error Resolution¶

# Agent encounters: ImportError: No module named 'asyncio'

# Flow:
# 1. Check project memory → No prior solution
# 2. Query external knowledge for error pattern
# 3. Find StackOverflow accepted answer (150+ upvotes)
# 4. Extract solution: "asyncio is built-in for Python 3.4+"
# 5. Provide solution to agent
# 6. Store in project memory with tag "error_solution"
# 7. Next time: Instant resolution from project memory

🎯 Success Metrics¶

Must Have (All Phases)¶

✅ No breaking changes to existing system
✅ Project memory always checked first
✅ External knowledge is advisory only
✅ Graceful degradation if external unavailable

Should Have (Phase 1-2)¶

✅ Cache hit rate >70%
✅ Query performance <100ms
✅ Multiple source support

Nice to Have (Phase 4)¶

⏳ Neo4j relationship queries
⏳ Complex version tracking
⏳ Recommendation engine
⏳ Learning analytics

🛠️ File Structure¶

Documentation:
├── EXTERNAL_KNOWLEDGE_README.md                    (This file)
├── EXTERNAL_KNOWLEDGE_NEO4J_DESIGN.md             (Complete design)
├── EXTERNAL_KNOWLEDGE_IMPLEMENTATION_GUIDE.md     (Code examples)
├── EXTERNAL_KNOWLEDGE_INTEGRATION_SUMMARY.md      (Strategic overview)
└── EXTERNAL_KNOWLEDGE_QUICK_REFERENCE.md          (Developer reference)

Implementation (Phase 1-2):
src/amplihack/external_knowledge/
├── __init__.py
├── cache.py                    # File-based cache (START HERE)
├── retriever.py               # Main retrieval logic
├── monitoring.py              # Performance tracking
└── sources/
    ├── __init__.py
    ├── python_docs.py         # Python official docs fetcher
    ├── ms_learn.py            # MS Learn fetcher
    ├── stackoverflow.py       # StackOverflow fetcher
    └── mdn.py                 # MDN Web Docs fetcher

Implementation (Phase 4 - Optional):
src/amplihack/external_knowledge/
├── neo4j_schema.py           # Neo4j integration
└── code_linker.py            # Automatic code-to-doc linking

Data Storage:
├── ~/.amplihack/external_knowledge/  # File cache (Phase 1-3)
└── Neo4j database (optional)         # Metadata + relationships (Phase 4)

Tests:
tests/test_external_knowledge/
├── test_cache.py
├── test_retriever.py
├── test_integration.py
└── test_neo4j.py              (Phase 4 only)

📋 Next Steps¶

For Project Leadership¶

Review INTEGRATION_SUMMARY.md
Approve phased implementation approach
Allocate resources for Phase 1-2 (2 weeks)

For Development Team¶

Read QUICK_REFERENCE.md
Review IMPLEMENTATION_GUIDE.md
Start Phase 1: Implement ExternalKnowledgeCache class
Set up basic tests
Measure baseline performance

For Architecture Review¶

Deep dive into NEO4J_DESIGN.md
Validate graph schema design
Review integration patterns
Approve or suggest modifications

❓ FAQ¶

Why start with files instead of Neo4j?¶

Answer: Following the project's ruthless simplicity philosophy. Files are:

Simple to implement and debug
Zero runtime dependencies
Version control friendly
Fast enough for most use cases (<100ms)

Neo4j adds value only when:

File queries consistently >100ms
Need complex relationship traversal
Have >10k documents with rich relationships

How does this integrate with the existing SQLite memory system?¶

Answer: Seamlessly. The SQLite memory system remains unchanged and is ALWAYS queried first. External knowledge is an optional enhancement that provides additional context when project memory doesn't have sufficient information.

What if external sources are unavailable?¶

Answer: Graceful degradation. The system:

Uses cached data (even if slightly stale)
Falls back to project memory only
Continues working normally
Never fails due to external unavailability

How is version compatibility handled?¶

Answer: Multiple strategies:

Cache is version-aware (Python 3.11 vs 3.12 cached separately)
Neo4j relationships track compatibility
Deprecation detection identifies outdated APIs
Version queries find appropriate documentation

What about cost/performance of external fetches?¶

Answer: Minimized through:

Aggressive caching (7-30 day TTL)
Project memory checked first
Batch fetching where possible
Smart refresh (only high-value docs)
Target: >80% cache hit rate

📞 Support & Feedback¶

Design Questions: Refer to NEO4J_DESIGN.md
Implementation Questions: See IMPLEMENTATION_GUIDE.md
Quick Lookup: Check QUICK_REFERENCE.md
Strategic Discussion: Review INTEGRATION_SUMMARY.md

🏆 Design Highlights¶

What Makes This Design Great¶

Ruthlessly Simple: Start with files, not databases
No Breaking Changes: Existing system works identically
Project Memory First: Always prioritizes project-specific learnings
Graceful Degradation: Works offline after warm-up
Phased Implementation: Prove value before adding complexity
Version Aware: Tracks compatibility across language versions
Source Credibility: Ranks sources by trust score
Performance Focused: <100ms query target
Measurement Driven: Add Neo4j only if metrics justify it
Integration Ready: Works with existing SQLite memory system

Alignment with Project Philosophy¶

✅ Ruthless Simplicity: File cache before Neo4j
✅ Modular Design: Clear interfaces, replaceable components
✅ Zero-BS Implementation: No stubs, everything works
✅ Measure First: Add complexity only when justified
✅ AI-Ready: Clear contracts for agent integration

📜 License & Attribution¶

This design is part of the Microsoft Hackathon 2025 - Agentic Coding Framework project.

Design Principles: Based on project's ruthless simplicity philosophy Database Philosophy: Inspired by database.md agent guidelines (start simple, measure, optimize) Integration: Builds on existing SQLite memory system (amplihack/memory/)

Status: ✅ Design Complete | 🚀 Ready for Phase 1 Implementation

Last Updated: November 2, 2025

📖 Complete Design Specification (39KB)
💻 Implementation Guide with Code (33KB)
📊 Strategic Summary & Cost-Benefit (18KB)
⚡ Developer Quick Reference (12KB)

Total Documentation: 102KB of comprehensive design and implementation guidance

Ready to build? Start with Phase 1: src/amplihack/external_knowledge/cache.py

External Knowledge Integration for Neo4j Memory Graph¶

📚 Documentation Overview¶

Documents in This Package¶

🎯 Quick Start¶

For Architects & Decision Makers¶

For Developers Implementing This¶

For Code Review / Deep Dive¶

🏗️ Architecture Summary¶

Three-Tier Strategy¶

Key Design Principles¶

📊 Implementation Phases¶

Phase 1: File-Based Cache (Week 1) ✅ Ready to Implement¶

Phase 2: Memory Integration (Week 2)¶

Phase 3: Multiple Sources (Week 3)¶

Phase 4: Neo4j Optimization (Week 4+ - Optional)¶

🎓 Knowledge Sources¶

Tier 1: Official Documentation (Trust Score: 0.9-1.0)¶

Tier 2: Curated Tutorials (Trust Score: 0.7-0.9)¶

Tier 3: Community Knowledge (Trust Score: 0.4-0.8)¶

🔍 Graph Schema (Neo4j - Phase 4)¶

Core Node Types¶

Key Relationships¶

📈 Performance Targets¶

Measured Performance (Phase 1-2 Expected)¶

🔧 Integration with Existing System¶

Current State (SQLite Memory System)¶

Enhanced State (With External Knowledge)¶

💡 Usage Examples¶

Example 1: New API Usage¶

Example 2: Error Resolution¶

🎯 Success Metrics¶

Must Have (All Phases)¶

Should Have (Phase 1-2)¶

Nice to Have (Phase 4)¶

🛠️ File Structure¶

📋 Next Steps¶

For Project Leadership¶

For Development Team¶

For Architecture Review¶

❓ FAQ¶

Why start with files instead of Neo4j?¶

How does this integrate with the existing SQLite memory system?¶

What if external sources are unavailable?¶

How is version compatibility handled?¶

What about cost/performance of external fetches?¶

📞 Support & Feedback¶

🏆 Design Highlights¶

What Makes This Design Great¶

Alignment with Project Philosophy¶

📜 License & Attribution¶

Quick Navigation¶