External Knowledge Integration for Neo4j Memory Graph¶
Complete design and implementation strategy for integrating external knowledge sources (API docs, developer guides, library references) into the coding agent memory system.
📚 Documentation Overview¶
This package contains comprehensive design documents for integrating external knowledge sources into the Neo4j-based memory graph for coding agents. The design follows the project's ruthless simplicity philosophy: start simple, measure performance, and add complexity only when justified by metrics.
Documents in This Package¶
| Document | Purpose | Audience | Size |
|---|---|---|---|
| NEO4J_DESIGN.md | Complete design specification | Architects, system designers | 39KB |
| IMPLEMENTATION_GUIDE.md | Concrete code examples and patterns | Developers, implementers | 33KB |
| INTEGRATION_SUMMARY.md | Strategic overview and cost-benefit analysis | Product managers, tech leads | 18KB |
| QUICK_REFERENCE.md | One-page developer reference | All developers | 12KB |
Total Documentation: 102KB across 4 comprehensive documents
🎯 Quick Start¶
For Architects & Decision Makers¶
Start here: INTEGRATION_SUMMARY.md
Key takeaways:
- Three-tier architecture (Project Memory → File Cache → Neo4j optional)
- Phased implementation (4 phases, 5 weeks)
- No breaking changes to existing system
- Performance targets: <100ms queries, >80% cache hit rate
For Developers Implementing This¶
Start here: QUICK_REFERENCE.md
Then:
- Read IMPLEMENTATION_GUIDE.md for code examples
- Refer to NEO4J_DESIGN.md for detailed design decisions
For Code Review / Deep Dive¶
Start here: NEO4J_DESIGN.md
Complete specification covering:
- Graph schema (nodes, relationships)
- External knowledge sources (official docs, tutorials, community)
- Caching strategies
- Version tracking
- Performance optimization
- Integration patterns
🏗️ Architecture Summary¶
Three-Tier Strategy¶
┌────────────────────────────────────────────────────┐
│ Tier 1: Project Memory (SQLite)                    │
│   - HIGHEST PRIORITY                               │
│   - Learned patterns from THIS project             │
│   - <10ms query performance                        │
└────────────────┬───────────────────────────────────┘
                 │
                 ▼
┌────────────────────────────────────────────────────┐
│ Tier 2: Cached External Knowledge (Files)          │
│   - ADVISORY                                       │
│   - Official docs, tutorials, solutions            │
│   - <20ms query performance                        │
│   - 7-30 day TTL by source type                    │
└────────────────┬───────────────────────────────────┘
                 │
                 ▼
┌────────────────────────────────────────────────────┐
│ Tier 3: Neo4j Metadata (Optional - Phase 4)        │
│   - OPTIMIZATION                                   │
│   - Fast relationship queries                      │
│   - <50ms query performance                        │
│   - Only add if file cache becomes bottleneck      │
└────────────────────────────────────────────────────┘
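The fall-through order in the diagram can be sketched as a list of lookups tried in sequence. This is a minimal illustration, not the project's implementation: `tiered_lookup` and the two in-memory dictionaries are hypothetical stand-ins for the SQLite project memory and the file cache.

```python
from typing import Callable, List, Optional, Tuple

# A tier is any callable that maps a query to a result, or None on a miss.
Tier = Callable[[str], Optional[str]]


def tiered_lookup(query: str, tiers: List[Tuple[str, Tier]]) -> Optional[Tuple[str, str]]:
    """Return (tier_name, result) from the first tier that answers, else None."""
    for name, lookup in tiers:
        result = lookup(query)
        if result is not None:
            return name, result
    return None


# Hypothetical in-memory stand-ins for the real stores.
project_memory = {"retry pattern": "Retry with exponential backoff"}
file_cache = {"pathlib.Path.glob": "Cached summary of the pathlib docs"}

tiers = [
    ("project_memory", project_memory.get),  # Tier 1: always tried first
    ("file_cache", file_cache.get),          # Tier 2: advisory
    # Tier 3 (Neo4j) would be appended here only if Phase 4 is adopted.
]

print(tiered_lookup("retry pattern", tiers))
print(tiered_lookup("pathlib.Path.glob", tiers))
print(tiered_lookup("unknown topic", tiers))
```

Keeping the tiers in an ordered list means a later tier can be added or removed without touching the lookup logic, which fits the "optional Phase 4" framing.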
Key Design Principles¶
- Project Memory First: Always check project-specific learnings before external sources
- Start Simple: File-based cache before Neo4j
- Measure Before Optimizing: Add complexity only when metrics justify it
- No Breaking Changes: System works identically with or without external knowledge
- Graceful Degradation: Works offline after cache warm-up
- Version Awareness: Track compatibility with language/framework versions
- Source Credibility: Official docs > curated tutorials > community solutions
📊 Implementation Phases¶
Phase 1: File-Based Cache (Week 1) ✅ Ready to Implement¶
Goal: Prove value with simplest approach
Deliverables:
- ExternalKnowledgeCache class (file-based storage)
- PythonDocsFetcher class (fetches Python official docs)
- Basic tests
- Performance baseline
Success Criteria:
- Can fetch and cache Python official docs
- Query time <100ms
- Cache hit rate >70% after warm-up
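A file-based cache with a time-to-live could look roughly like the sketch below. It is an assumption-laden illustration rather than the planned ExternalKnowledgeCache: the class name `FileCache`, the JSON entry layout, and the hashed file names are all hypothetical choices made for the example.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class FileCache:
    """Minimal JSON-file cache with a per-entry TTL (illustrative only)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so any string (URLs included) maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.root / (digest + ".json")

    def set(self, key: str, content: str, ttl_seconds: float) -> None:
        entry = {"content": content, "expires_at": time.time() + ttl_seconds}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")

    def get(self, key: str) -> Optional[str]:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() >= entry["expires_at"]:
            return None  # expired: treat as a miss so the caller can refetch
        return entry["content"]


# Demo in a temporary directory; the real cache location is a project decision.
cache = FileCache(Path(tempfile.mkdtemp()))
cache.set("python:3.12:pathlib", "pathlib summary", ttl_seconds=30 * 24 * 3600)
cache.set("stackoverflow:12345", "old answer", ttl_seconds=-1)  # already expired
print(cache.get("python:3.12:pathlib"))
print(cache.get("stackoverflow:12345"))
```

One file per entry keeps the cache easy to inspect and delete by hand, at the cost of a directory scan if entries ever need to be listed.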
Phase 2: Memory Integration (Week 2)¶
Goal: Connect to existing memory system
Deliverables:
- ExternalKnowledgeRetriever class
- Integration with MemoryManager
- Agent context builder enhancement
- Integration tests
Success Criteria:
- Agents query external knowledge when needed
- Project memory always checked first
- No breaking changes to existing agents
Phase 3: Multiple Sources (Week 3)¶
Goal: Expand knowledge sources
Deliverables:
- MS Learn fetcher
- MDN Web Docs fetcher
- StackOverflow fetcher (with quality filtering)
- Source credibility scoring
Success Criteria:
- Support 3+ external sources
- Source ranking by credibility
- Smart fallback chains
Phase 4: Neo4j Optimization (Week 4+ - Optional)¶
Goal: Optimize for scale and relationships
Condition: Only implement if:
- File cache queries consistently >100ms
- Need complex relationship queries
- Have >10,000 documents
Deliverables:
- Neo4j schema implementation
- Automatic code-to-doc linking
- Complex version queries
- Analytics and recommendations
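The Phase 4 condition can be expressed as a small check over collected metrics. The sketch below is hypothetical: `CacheMetrics` and `neo4j_triggers` are invented names, and reading "consistently >100ms" as a 95th-percentile latency is an assumption. Because the text does not say whether one condition or all three must hold, the function only reports which conditions are met and leaves the decision to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheMetrics:
    p95_query_ms: float               # assumed reading of "consistently >100ms"
    document_count: int
    needs_relationship_queries: bool


def neo4j_triggers(m: CacheMetrics) -> List[str]:
    """Return the Phase 4 conditions that currently hold."""
    triggers = []
    if m.p95_query_ms > 100:
        triggers.append("file cache queries above 100ms")
    if m.needs_relationship_queries:
        triggers.append("complex relationship queries needed")
    if m.document_count > 10_000:
        triggers.append("more than 10,000 documents")
    return triggers


print(neo4j_triggers(CacheMetrics(15.0, 800, False)))
print(neo4j_triggers(CacheMetrics(140.0, 12_000, False)))
```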
🎓 Knowledge Sources¶
Tier 1: Official Documentation (Trust Score: 0.9-1.0)¶
| Source | Coverage | Use Case |
|---|---|---|
| Python.org | Python standard library | API reference |
| MS Learn | Azure, .NET, TypeScript | Microsoft ecosystem |
| MDN | JavaScript, Web APIs | Web development |
| Library official docs | Specific libraries | Framework-specific |
Characteristics: High credibility, version-specific, regularly updated
Tier 2: Curated Tutorials (Trust Score: 0.7-0.9)¶
| Source | Coverage | Use Case |
|---|---|---|
| Real Python | Python tutorials | Learning resources |
| FreeCodeCamp | Web development | Beginner-friendly |
| Official framework tutorials | Framework-specific | Getting started |
Characteristics: High quality, practical examples, may lag latest versions
Tier 3: Community Knowledge (Trust Score: 0.4-0.8)¶
| Source | Coverage | Use Case |
|---|---|---|
| StackOverflow | Error solutions | Problem-solving |
| GitHub Issues | Library-specific | Bug workarounds |
| Reddit r/programming | Best practices | Community wisdom |
Characteristics: Variable quality, practical solutions, requires filtering
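One way to turn the three trust ranges into a ranking is to place each document inside its tier's range according to a per-document quality signal. This is a sketch under assumptions: the `quality` field, the linear interpolation, and the names `Candidate`, `trust_score`, and `rank` are hypothetical and not taken from the design documents.

```python
from dataclasses import dataclass
from typing import List

# Trust ranges from the three source tiers above.
TRUST_RANGES = {
    "official": (0.9, 1.0),
    "tutorial": (0.7, 0.9),
    "community": (0.4, 0.8),
}


@dataclass
class Candidate:
    title: str
    tier: str
    quality: float  # 0.0-1.0, hypothetical signal such as scaled upvotes


def trust_score(c: Candidate) -> float:
    """Interpolate within the tier's trust range using the quality signal."""
    low, high = TRUST_RANGES[c.tier]
    q = min(max(c.quality, 0.0), 1.0)  # clamp out-of-range inputs
    return low + (high - low) * q


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least trusted."""
    return sorted(candidates, key=trust_score, reverse=True)


ranked = rank([
    Candidate("Low-quality tutorial", "tutorial", 0.0),
    Candidate("Official reference", "official", 0.5),
    Candidate("Unvetted forum post", "community", 0.0),
    Candidate("Highly voted answer", "community", 1.0),
])
print([c.title for c in ranked])
```

Because the community range (0.4-0.8) overlaps the tutorial range (0.7-0.9), a well-vetted community answer can outrank a weak tutorial, which is consistent with the ranges stated above.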
🔍 Graph Schema (Neo4j - Phase 4)¶
Core Node Types¶
// External documentation
(:ExternalDoc {
  id, source, source_url, title, summary,
  version, language, category, relevance_score
})
// API references
(:APIReference {
  id, namespace, function_name, signature,
  version_introduced, deprecated_in
})
// Best practices
(:BestPractice {
  id, title, domain, description,
  confidence_score
})
// Code examples
(:CodeExample {
  id, title, language, code, upvotes
})
Key Relationships¶
// Link to project code
(doc:ExternalDoc)-[:EXPLAINS]->(file:CodeFile)
(api:APIReference)-[:USED_IN]->(func:Function)
// Knowledge hierarchy
(api:APIReference)-[:DOCUMENTED_IN]->(doc:ExternalDoc)
(ex:CodeExample)-[:DEMONSTRATES]->(api:APIReference)
// Version tracking
(api:APIReference)-[:COMPATIBLE_WITH {version}]->(lang:Language)
(old:APIReference)-[:REPLACED_BY {in_version}]->(new:APIReference)
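To make the REPLACED_BY relationship concrete without a running database, the traversal can be mimicked in plain Python. The sketch below is only an in-memory analogy: the API names and versions are invented sample data, and `current_api` is a hypothetical helper, not part of the schema.

```python
from typing import Dict, List, Tuple

# old API -> (new API, version in which the replacement happened).
# Invented sample data mirroring REPLACED_BY {in_version} edges.
REPLACED_BY: Dict[str, Tuple[str, str]] = {
    "old_pkg.fetch": ("old_pkg.fetch_v2", "2.0"),
    "old_pkg.fetch_v2": ("new_pkg.get", "3.0"),
}


def current_api(name: str) -> Tuple[str, List[str]]:
    """Follow REPLACED_BY edges to the newest API and return the path taken."""
    path = [name]
    seen = {name}
    while name in REPLACED_BY:
        successor, _in_version = REPLACED_BY[name]
        if successor in seen:  # guard against accidental cycles in the data
            break
        seen.add(successor)
        path.append(successor)
        name = successor
    return name, path


print(current_api("old_pkg.fetch"))
print(current_api("new_pkg.get"))
```

In Neo4j the same question would be a variable-length path match over REPLACED_BY; the Python version just shows what that query is meant to answer.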
📈 Performance Targets¶
| Metric | Target | Rationale |
|---|---|---|
| Query time | <100ms | Keep agents responsive |
| Cache hit rate | >80% | Minimize external fetches |
| Cache size | <100MB | For 10k documents (metadata only) |
| Project memory check | 100% | Always check before external |
Expected Performance (Phase 1-2)¶
| Operation | Target | Expected |
|---|---|---|
| Project memory lookup | <10ms | 2-5ms |
| Cache lookup | <20ms | 5-15ms |
| External fetch | <500ms | 100-300ms |
| End-to-end (no external fetch) | <100ms | 60-80ms |
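Checking these targets requires recording latency and hit/miss outcomes per lookup. The following is a minimal sketch of such a tracker, assuming a nearest-rank 95th percentile as the latency measure; `LookupStats` and its methods are hypothetical names, not the contents of monitoring.py.

```python
import math
from typing import List


class LookupStats:
    """Collects per-lookup latency and hit/miss counts (illustrative only)."""

    def __init__(self) -> None:
        self.latencies_ms: List[float] = []
        self.hits = 0
        self.misses = 0

    def record(self, latency_ms: float, hit: bool) -> None:
        self.latencies_ms.append(latency_ms)
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile for p in (0, 100]."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = math.ceil(p / 100 * len(ordered))
        return ordered[max(rank, 1) - 1]

    def meets_targets(self) -> bool:
        # Targets from the table above: <100ms queries, >80% cache hit rate.
        return self.percentile(95) < 100 and self.hit_rate() > 0.80


stats = LookupStats()
for _ in range(9):
    stats.record(10.0, hit=True)   # cache hits
stats.record(250.0, hit=False)     # one miss that needed an external fetch
print(stats.hit_rate(), stats.percentile(50), stats.percentile(95), stats.meets_targets())
```

In this sample the hit rate clears 80%, but a single slow miss pushes the 95th percentile past 100ms, which is why both numbers are worth tracking together.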
🔧 Integration with Existing System¶
Current State (SQLite Memory System)¶
# Agent gets context from project memory only
memory = MemoryManager(session_id=session_id)
project_memories = memory.retrieve(agent_id=agent_id, search=task)
Enhanced State (With External Knowledge)¶
# Agent gets context from project memory + external knowledge
memory = MemoryManager(session_id=session_id)
retriever = ExternalKnowledgeRetriever(memory)
context = build_comprehensive_context(
    agent_id=agent_id,
    task=task,
    memory=memory,
    retriever=retriever,
)
# context includes:
# 1. Project-specific memories (priority 1)
# 2. Relevant external docs (priority 2, advisory)
Key: Project memory is always checked first. External knowledge is advisory only.
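The "project first, external advisory" ordering might be assembled along these lines. This is a sketch under stated assumptions, not the signature of build_comprehensive_context: `ContextItem`, `build_context`, and the `max_items` budget are hypothetical, and plain strings stand in for whatever MemoryManager and ExternalKnowledgeRetriever actually return.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextItem:
    text: str
    origin: str     # "project" or "external"
    advisory: bool  # True for external knowledge


def build_context(
    project_memories: List[str],
    external_docs: List[str],
    max_items: int = 5,
) -> List[ContextItem]:
    """Put project memories first; external docs only fill the remaining room."""
    items = [ContextItem(m, "project", False) for m in project_memories]
    room = max(max_items - len(items), 0)
    items += [ContextItem(d, "external", True) for d in external_docs[:room]]
    return items[:max_items]


ctx = build_context(
    ["We wrap uploads in a retry helper"],
    ["Blob upload API reference", "Blob upload tutorial"],
    max_items=2,
)
for item in ctx:
    print(item.origin, item.advisory, item.text)
```

Marking each item with its origin lets the agent prompt state explicitly which parts are advisory.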
💡 Usage Examples¶
Example 1: New API Usage¶
# Agent task: "Use Azure Blob Storage to upload a file"
# Flow:
# 1. Check project memory → No prior Blob Storage usage
# 2. External retriever detects new API
# 3. Fetch Azure Blob Storage docs from MS Learn
# 4. Cache for 30 days
# 5. Provide agent with API reference + code example
# 6. Agent completes task successfully
# 7. Store pattern in project memory
# 8. Next time: Retrieved from project memory (faster!)
Example 2: Error Resolution¶
# Agent encounters: ImportError: No module named 'asyncio'
# Flow:
# 1. Check project memory → No prior solution
# 2. Query external knowledge for error pattern
# 3. Find StackOverflow accepted answer (150+ upvotes)
# 4. Extract solution: "asyncio is built-in for Python 3.4+"
# 5. Provide solution to agent
# 6. Store in project memory with tag "error_solution"
# 7. Next time: Instant resolution from project memory
🎯 Success Metrics¶
Must Have (All Phases)¶
- ✅ No breaking changes to existing system
- ✅ Project memory always checked first
- ✅ External knowledge is advisory only
- ✅ Graceful degradation if external unavailable
Should Have (Phase 1-3)¶
- ✅ Cache hit rate >70%
- ✅ Query performance <100ms
- ✅ Multiple source support
Nice to Have (Phase 4)¶
- ⏳ Neo4j relationship queries
- ⏳ Complex version tracking
- ⏳ Recommendation engine
- ⏳ Learning analytics
🛠️ File Structure¶
Documentation:
├── EXTERNAL_KNOWLEDGE_README.md (This file)
├── EXTERNAL_KNOWLEDGE_NEO4J_DESIGN.md (Complete design)
├── EXTERNAL_KNOWLEDGE_IMPLEMENTATION_GUIDE.md (Code examples)
├── EXTERNAL_KNOWLEDGE_INTEGRATION_SUMMARY.md (Strategic overview)
└── EXTERNAL_KNOWLEDGE_QUICK_REFERENCE.md (Developer reference)
Implementation (Phase 1-2):
src/amplihack/external_knowledge/
├── __init__.py
├── cache.py # File-based cache (START HERE)
├── retriever.py # Main retrieval logic
├── monitoring.py # Performance tracking
└── sources/
    ├── __init__.py
    ├── python_docs.py # Python official docs fetcher
    ├── ms_learn.py # MS Learn fetcher
    ├── stackoverflow.py # StackOverflow fetcher
    └── mdn.py # MDN Web Docs fetcher
Implementation (Phase 4 - Optional):
src/amplihack/external_knowledge/
├── neo4j_schema.py # Neo4j integration
└── code_linker.py # Automatic code-to-doc linking
Data Storage:
├── ~/.amplihack/external_knowledge/ # File cache (Phase 1-3)
└── Neo4j database (optional) # Metadata + relationships (Phase 4)
Tests:
tests/test_external_knowledge/
├── test_cache.py
├── test_retriever.py
├── test_integration.py
└── test_neo4j.py (Phase 4 only)
📋 Next Steps¶
For Project Leadership¶
- Review INTEGRATION_SUMMARY.md
- Approve phased implementation approach
- Allocate resources for Phase 1-2 (2 weeks)
For Development Team¶
- Read QUICK_REFERENCE.md
- Review IMPLEMENTATION_GUIDE.md
- Start Phase 1: Implement ExternalKnowledgeCache class
- Set up basic tests
- Measure baseline performance
For Architecture Review¶
- Deep dive into NEO4J_DESIGN.md
- Validate graph schema design
- Review integration patterns
- Approve or suggest modifications
❓ FAQ¶
Why start with files instead of Neo4j?¶
Answer: It follows the project's ruthless simplicity philosophy. Files are:
- Simple to implement and debug
- Zero runtime dependencies
- Version control friendly
- Fast enough for most use cases (<100ms)
Neo4j adds value only when:
- File queries consistently >100ms
- Need complex relationship traversal
- Have >10k documents with rich relationships
How does this integrate with the existing SQLite memory system?¶
Answer: Seamlessly. The SQLite memory system remains unchanged and is ALWAYS queried first. External knowledge is an optional enhancement that provides additional context when project memory doesn't have sufficient information.
What if external sources are unavailable?¶
Answer: Graceful degradation. The system:
- Uses cached data (even if slightly stale)
- Falls back to project memory only
- Continues working normally
- Never fails due to external unavailability
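The degradation behavior described above can be sketched as a lookup that prefers fresh cache entries, tries the source when the entry is stale, and serves the stale copy if the fetch fails. Everything here is hypothetical: `get_with_fallback`, the dictionary used as a store, and the `online`/`offline` fetchers exist only for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# key -> (content, fetched_at in epoch seconds); stands in for the file cache.
Store = Dict[str, Tuple[str, float]]


def get_with_fallback(
    key: str,
    store: Store,
    fetch: Callable[[str], str],
    ttl_seconds: float,
    now: Optional[float] = None,
) -> Optional[str]:
    """Fresh cache hit, else fetch, else stale cache, else None."""
    now = time.time() if now is None else now
    cached = store.get(key)
    if cached is not None and now - cached[1] < ttl_seconds:
        return cached[0]  # fresh hit, no network needed
    try:
        content = fetch(key)
    except Exception:
        # Source unavailable: serve stale data if any, otherwise return None so
        # the caller continues with project memory only. A real fetcher would
        # likely catch narrower network errors.
        return cached[0] if cached is not None else None
    store[key] = (content, now)
    return content


def online(key: str) -> str:
    return "fresh docs for " + key


def offline(key: str) -> str:
    raise ConnectionError("no network")


store: Store = {}
print(get_with_fallback("pathlib", store, online, ttl_seconds=60, now=0.0))
print(get_with_fallback("pathlib", store, offline, ttl_seconds=60, now=1000.0))
print(get_with_fallback("asyncio", store, offline, ttl_seconds=60, now=1000.0))
```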
How is version compatibility handled?¶
Answer: Multiple strategies:
- Cache is version-aware (Python 3.11 vs 3.12 cached separately)
- Neo4j relationships track compatibility
- Deprecation detection identifies outdated APIs
- Version queries find appropriate documentation
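Two of these strategies are easy to illustrate: a cache key that separates minor versions, and a check against `version_introduced` / `deprecated_in`. The sketch below assumes dotted numeric version strings; `cache_key`, `parse_version`, and `api_status` are hypothetical helpers, and the `deprecated_in` value in the last call is made up for the example.

```python
from typing import Optional, Tuple


def cache_key(source: str, language: str, version: str, topic: str) -> str:
    """Build a key that keeps docs for different minor versions apart."""
    major_minor = ".".join(version.split(".")[:2])  # "3.12.4" -> "3.12"
    return f"{source}/{language}-{major_minor}/{topic}"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "3.12.4" into (3, 12, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def api_status(target: str, introduced: str, deprecated_in: Optional[str] = None) -> str:
    """Classify an API for a target version: unavailable, deprecated, or ok."""
    t = parse_version(target)
    if t < parse_version(introduced):
        return "unavailable"
    if deprecated_in is not None and t >= parse_version(deprecated_in):
        return "deprecated"
    return "ok"


print(cache_key("python_docs", "python", "3.12.4", "pathlib"))
print(cache_key("python_docs", "python", "3.11.9", "pathlib"))
print(api_status("3.3", introduced="3.4"))   # asyncio arrived in Python 3.4
print(api_status("3.12", introduced="3.4"))
print(api_status("3.12", introduced="3.4", deprecated_in="3.10"))  # made-up value
```

Comparing parsed tuples rather than raw strings avoids the classic mistake of ordering "3.9" after "3.12".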
What about cost/performance of external fetches?¶
Answer: Minimized through:
- Aggressive caching (7-30 day TTL)
- Project memory checked first
- Batch fetching where possible
- Smart refresh (only high-value docs)
- Target: >80% cache hit rate
📞 Support & Feedback¶
- Design Questions: Refer to NEO4J_DESIGN.md
- Implementation Questions: See IMPLEMENTATION_GUIDE.md
- Quick Lookup: Check QUICK_REFERENCE.md
- Strategic Discussion: Review INTEGRATION_SUMMARY.md
🏆 Design Highlights¶
What Makes This Design Great¶
- Ruthlessly Simple: Start with files, not databases
- No Breaking Changes: Existing system works identically
- Project Memory First: Always prioritizes project-specific learnings
- Graceful Degradation: Works offline after warm-up
- Phased Implementation: Prove value before adding complexity
- Version Aware: Tracks compatibility across language versions
- Source Credibility: Ranks sources by trust score
- Performance Focused: <100ms query target
- Measurement Driven: Add Neo4j only if metrics justify it
- Integration Ready: Works with existing SQLite memory system
Alignment with Project Philosophy¶
- ✅ Ruthless Simplicity: File cache before Neo4j
- ✅ Modular Design: Clear interfaces, replaceable components
- ✅ Zero-BS Implementation: No stubs, everything works
- ✅ Measure First: Add complexity only when justified
- ✅ AI-Ready: Clear contracts for agent integration
📜 License & Attribution¶
This design is part of the Microsoft Hackathon 2025 - Agentic Coding Framework project.
- Design Principles: Based on the project's ruthless simplicity philosophy
- Database Philosophy: Inspired by database.md agent guidelines (start simple, measure, optimize)
- Integration: Builds on the existing SQLite memory system (amplihack/memory/)
Status: ✅ Design Complete | 🚀 Ready for Phase 1 Implementation
Last Updated: November 2, 2025
Quick Navigation¶
- 📖 Complete Design Specification (39KB)
- 💻 Implementation Guide with Code (33KB)
- 📊 Strategic Summary & Cost-Benefit (18KB)
- ⚡ Developer Quick Reference (12KB)
Total Documentation: 102KB of comprehensive design and implementation guidance
Ready to build? Start with Phase 1: src/amplihack/external_knowledge/cache.py