External Knowledge Integration for Neo4j Memory Graph¶
Design Document - November 2025
Executive Summary¶
This document outlines strategies for integrating external knowledge sources (API docs, developer guides, library references) into the Neo4j memory graph for coding agents. The design follows the project's ruthless simplicity philosophy: start simple, measure, optimize based on need.
Core Philosophy¶
Start Simple, Scale Smart¶
Phase 1: File-based cache (TODAY)
↓
Phase 2: Hybrid (cache + Neo4j references) (MEASURE FIRST)
↓
Phase 3: Full graph integration (ONLY IF NEEDED)
Golden Rule: External knowledge is ADVISORY. Project-specific memory always takes precedence.
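This precedence rule can be made explicit wherever context is assembled. A minimal sketch (the function and parameter names here are illustrative, not part of an existing API):

```python
def merge_knowledge(project_items: list, external_items: list,
                    budget: int = 5) -> list:
    """Project-specific memory always comes first; external knowledge
    only fills whatever budget remains, and is clearly labeled advisory."""
    merged = project_items[:budget]
    remaining = budget - len(merged)
    merged += [f"[advisory] {item}" for item in external_items[:remaining]]
    return merged
```

If project memory alone fills the budget, external knowledge is dropped entirely, which is the intended behavior.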
1. Graph Schema Design¶
Node Types¶
ExternalDoc Node¶
CREATE (doc:ExternalDoc {
id: string, // Unique identifier
source: string, // "ms_learn" | "python_docs" | "mdn" | "stackoverflow"
source_url: string, // Original URL
title: string, // Document title
content_hash: string, // SHA256 of content (for change detection)
summary: string, // AI-generated summary (200-500 chars)
content_snippet: string, // First 1000 chars or key excerpt
last_updated: datetime, // When we fetched it
last_accessed: datetime, // When agent used it
access_count: integer, // Usage tracking
relevance_score: float, // 0.0-1.0 based on usage patterns
version: string, // e.g., "Python 3.12", "Node 20"
language: string, // Programming language
category: string, // "api" | "tutorial" | "reference" | "guide"
fetch_method: string // "cache" | "api" | "web_scrape"
})
APIReference Node¶
CREATE (api:APIReference {
id: string,
namespace: string, // e.g., "azure.storage.blob"
function_name: string, // e.g., "BlobServiceClient.create_container"
signature: string, // Full function signature
parameters: string, // JSON array of parameters
return_type: string,
description: string,
example_code: string, // Working code example
common_patterns: string, // JSON array of usage patterns
gotchas: string, // Common pitfalls (JSON array)
version_introduced: string,
deprecated_in: string, // null if not deprecated
source_doc_id: string // Link to full documentation
})
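Since Neo4j properties cannot hold nested maps or heterogeneous lists, fields like `parameters`, `common_patterns`, and `gotchas` above are stored as JSON strings. A sketch of the round-trip (helper names are illustrative):

```python
import json

def encode_list_property(values: list) -> str:
    """Serialize a list for storage as a Neo4j string property."""
    return json.dumps(values)

def decode_list_property(raw: str) -> list:
    """Restore the list when reading the node back; tolerate empty values."""
    return json.loads(raw) if raw else []
```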
BestPractice Node¶
CREATE (bp:BestPractice {
id: string,
title: string,
domain: string, // "authentication" | "async" | "security"
description: string,
when_to_use: string,
when_not_to_use: string,
example_code: string,
anti_patterns: string, // JSON array of what NOT to do
related_apis: string, // JSON array of API references
confidence_score: float, // 0.0-1.0 based on source credibility
source_count: integer, // How many sources agree
last_validated: datetime
})
CodeExample Node¶
CREATE (ex:CodeExample {
id: string,
title: string,
language: string,
framework: string, // Optional: "Flask" | "FastAPI" | "Django"
code: string, // Full working example
explanation: string,
use_case: string,
difficulty: string, // "beginner" | "intermediate" | "advanced"
execution_time_ms: integer, // Performance metric if available
dependencies: string, // JSON array of required packages
works_with_version: string, // Version compatibility
upvotes: integer, // If from StackOverflow/GitHub
source_url: string
})
Relationships¶
Between External Knowledge and Code¶
// Link to project code
(doc:ExternalDoc)-[:EXPLAINS]->(file:CodeFile)
(api:APIReference)-[:USED_IN]->(func:Function)
(bp:BestPractice)-[:APPLIED_IN]->(file:CodeFile)
(ex:CodeExample)-[:SIMILAR_TO]->(func:Function)
// Knowledge hierarchy
(api:APIReference)-[:DOCUMENTED_IN]->(doc:ExternalDoc)
(bp:BestPractice)-[:REFERENCES]->(api:APIReference)
(ex:CodeExample)-[:DEMONSTRATES]->(api:APIReference)
(ex:CodeExample)-[:IMPLEMENTS]->(bp:BestPractice)
// Cross-references
(doc:ExternalDoc)-[:RELATED_TO]->(doc2:ExternalDoc)
(api:APIReference)-[:ALTERNATIVE_TO]->(api2:APIReference)
(bp:BestPractice)-[:CONFLICTS_WITH]->(bp2:BestPractice)
Version Tracking¶
// Version relationships
(api:APIReference)-[:VERSION_OF]->(api_v2:APIReference)
(doc:ExternalDoc)-[:SUPERSEDES]->(doc_old:ExternalDoc)
// Compatibility tracking
(api:APIReference)-[:COMPATIBLE_WITH {version: "3.12"}]->(lang:Language)
(bp:BestPractice)-[:DEPRECATED_IN {version: "4.0"}]->(framework:Framework)
Source Credibility¶
// Source credibility metadata
(doc:ExternalDoc)-[:SOURCED_FROM]->(source:Source {
name: "Microsoft Learn",
trust_score: 0.95, // Official docs = high trust
last_verified: datetime
})
(ex:CodeExample)-[:SOURCED_FROM]->(source:Source {
name: "StackOverflow",
trust_score: 0.75, // Community = medium trust
answer_accepted: true,
upvotes: 150
})
2. External Knowledge Sources¶
Tier 1: Official Documentation (Highest Priority)¶
Sources:
- Microsoft Learn (Azure, .NET, TypeScript)
- Python.org official docs
- MDN Web Docs (JavaScript, Web APIs)
- Library-specific official docs
Characteristics:
- High credibility (trust_score: 0.9-1.0)
- Version-specific
- Regularly updated
- Comprehensive
Fetch Strategy:
def fetch_official_docs(package_name: str, version: str) -> Optional[ExternalDoc]:
    """
    Fetch official documentation with caching.

    Priority:
    1. Check local cache (< 7 days old)
    2. Fetch from official API if available
    3. Web scrape documentation site
    4. Fallback to cached version (even if stale)
    """
    # Expand the home directory explicitly; a literal "~" in an f-string
    # is not expanded by file operations.
    cache_path = Path.home() / ".amplihack" / "external_knowledge" / package_name / version

    # Check cache first
    if cache_exists(cache_path) and cache_age_days(cache_path) < 7:
        return load_from_cache(cache_path)

    # Fetch from source
    try:
        doc = fetch_from_official_source(package_name, version)
        save_to_cache(cache_path, doc)
        return doc
    except FetchError:
        # Graceful degradation: use stale cache
        return load_from_cache(cache_path) if cache_exists(cache_path) else None
Tier 2: Curated Tutorials (Medium Priority)¶
Sources:
- Real Python
- FreeCodeCamp
- Official framework tutorials
- Microsoft sample code repositories
Characteristics:
- High quality (trust_score: 0.7-0.9)
- Practical, working examples
- Often more accessible than official docs
- May lag behind latest versions
Fetch Strategy:
def fetch_tutorial_knowledge(topic: str, language: str) -> List[ExternalDoc]:
"""
Fetch curated tutorials with quality filtering.
Filter criteria:
- Published within last 2 years
- Author has track record (check source credibility)
- Code examples compile/run
- Clear explanations
"""
candidates = search_tutorial_sources(topic, language)
return [t for t in candidates if quality_score(t) > 0.7]
Tier 3: Community Knowledge (Advisory Only)¶
Sources:
- StackOverflow (accepted answers with high upvotes)
- GitHub issues (maintainer responses)
- Reddit r/programming (highly upvoted)
Characteristics:
- Variable quality (trust_score: 0.4-0.8)
- Often practical, real-world solutions
- Version compatibility may vary
- Require validation
Fetch Strategy:
def fetch_community_knowledge(error_pattern: str) -> List[CodeExample]:
"""
Fetch community solutions with strict filtering.
Only include:
- StackOverflow: Accepted answer + 10+ upvotes
- GitHub: Maintainer comment or closed issue
- Must have code example
- Must be <2 years old or version-agnostic
"""
results = []
# StackOverflow
so_results = search_stackoverflow(error_pattern)
results.extend([
r for r in so_results
if r.accepted and r.upvotes >= 10
])
# GitHub issues
gh_results = search_github_issues(error_pattern)
results.extend([
r for r in gh_results
if r.author_is_maintainer or r.state == "closed"
])
return results
Tier 4: Library-Specific Knowledge¶
Sources:
- PyPI package documentation
- NPM package README
- Library changelog and migration guides
Fetch Strategy:
def fetch_library_knowledge(package: str, version: str) -> Dict:
"""
Fetch library-specific knowledge from package registries.
Data extracted:
- README (installation, quick start)
- API reference (if available)
- CHANGELOG (breaking changes, new features)
- Common usage patterns from examples/
"""
registry_data = fetch_from_registry(package, version)
return {
"readme": extract_readme(registry_data),
"api_reference": extract_api_docs(registry_data),
"changelog": extract_changelog(registry_data),
"examples": extract_examples(registry_data)
}
3. Caching vs. On-Demand Strategy¶
Decision Matrix¶
| Knowledge Type | Cache Strategy | Reason |
|---|---|---|
| Official API docs | Cache + Refresh | Stable, frequently used, version-specific |
| Tutorials | Cache Long-term | Don't change often, high value |
| StackOverflow | On-demand + Short cache | Dynamic, context-dependent |
| Library READMEs | Cache + Version-aware | Stable per version |
| Best practices | Cache + Periodic refresh | Evolve slowly |
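The matrix above reduces to a small TTL table. A sketch (the interval values mirror the refresh strategy in section 8; the type keys are assumptions):

```python
from datetime import timedelta

CACHE_TTL = {
    "official_api_docs": timedelta(days=30),  # stable, refresh periodically
    "tutorials": timedelta(days=90),          # slow-changing, cache long-term
    "stackoverflow": timedelta(days=7),       # dynamic, short cache
    "library_readme": timedelta(days=14),     # stable per version
    "best_practices": timedelta(days=30),     # evolve slowly
}

def ttl_for(knowledge_type: str) -> timedelta:
    """Fall back to a conservative short TTL for unknown types."""
    return CACHE_TTL.get(knowledge_type, timedelta(days=7))
```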
Caching Implementation¶
class ExternalKnowledgeCache:
    """
    Simple file-based cache with TTL and version awareness.

    Philosophy: Files are simple, versionable, inspectable.
    Don't use a database until you measure the need.
    """

    def __init__(self, cache_dir: Path = Path.home() / ".amplihack" / "external_knowledge"):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700)

    def cache_key(self, source: str, identifier: str, version: Optional[str] = None) -> Path:
        """Generate cache file path."""
        key = f"{source}/{identifier}"
        if version:
            key += f"/{version}"
        return self.cache_dir / f"{key}.json"

    def get(self, source: str, identifier: str, version: Optional[str] = None,
            max_age_days: int = 7) -> Optional[Dict]:
        """Get from cache if fresh enough."""
        cache_file = self.cache_key(source, identifier, version)
        if not cache_file.exists():
            return None

        # Check age via file modification time
        age_days = (datetime.now() - datetime.fromtimestamp(cache_file.stat().st_mtime)).days
        if age_days > max_age_days:
            return None

        with cache_file.open() as f:
            return json.load(f)

    def set(self, source: str, identifier: str, data: Dict, version: Optional[str] = None):
        """Save to cache."""
        cache_file = self.cache_key(source, identifier, version)
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open('w') as f:
            json.dump(data, f, indent=2)
        # Secure permissions
        cache_file.chmod(0o600)
When to Fetch On-Demand¶
Fetch on-demand for:
- Error-specific solutions (context-dependent)
- Rare API usage (< 5% of queries)
- Rapidly changing content (beta features)
- User-initiated searches
Pre-cache for:
- Common APIs (used in > 10% of projects)
- Core language features
- Framework essentials
- Known problem areas
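The two lists above collapse into a simple predicate. A sketch, where the 10% threshold comes from the list itself and everything else is an assumption:

```python
def should_precache(project_usage_ratio: float,
                    is_core_feature: bool = False) -> bool:
    """Pre-cache common APIs (used in >10% of projects) and core
    language/framework features; rare or context-dependent lookups
    stay on-demand."""
    return is_core_feature or project_usage_ratio > 0.10
```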
4. Linking External Knowledge to Code¶
Automatic Linking (Simple Heuristics)¶
def link_external_knowledge_to_code(file: CodeFile, external_docs: List[ExternalDoc]):
"""
Link external knowledge to code files using simple heuristics.
Match criteria:
1. Import statements → Library documentation
2. API calls → API reference docs
3. Error patterns → Solution examples
4. Code patterns → Best practices
"""
# Extract imports
imports = extract_imports(file.content)
for imp in imports:
matching_docs = find_docs_for_package(imp.package_name)
for doc in matching_docs:
create_relationship(doc, "EXPLAINS", file)
# Extract API calls
api_calls = extract_api_calls(file.content)
for call in api_calls:
matching_api_refs = find_api_reference(call.namespace, call.function)
for api_ref in matching_api_refs:
create_relationship(api_ref, "USED_IN", file)
# Pattern matching
patterns = detect_code_patterns(file.content)
for pattern in patterns:
best_practices = find_best_practices(pattern.type)
for bp in best_practices:
create_relationship(bp, "APPLIED_IN", file)
Manual Linking (Agent-Driven)¶
def agent_link_external_knowledge(agent_id: str, code_context: str, decision: str):
"""
Let agents explicitly link external knowledge they used.
When an agent says:
"I followed the Azure Blob Storage documentation for container creation"
We extract:
- Source: "Azure Blob Storage documentation"
- Topic: "container creation"
- Decision: <agent's decision>
Then create explicit link in graph.
"""
knowledge_references = extract_knowledge_references(decision)
for ref in knowledge_references:
external_doc = find_or_fetch_external_doc(ref.source, ref.topic)
if external_doc:
create_relationship(
external_doc,
"INFORMED_DECISION",
decision_node,
properties={"agent_id": agent_id, "confidence": ref.confidence}
)
5. Version Tracking Strategy¶
Version-Aware Querying¶
// Query: Find API reference for Python 3.12
// Note: AND binds more tightly than OR in Cypher, so the OR clause
// must be parenthesized or the deprecation filter is applied incorrectly.
MATCH (api:APIReference)-[:COMPATIBLE_WITH]->(lang:Language {name: "Python"})
WHERE (lang.version = "3.12" OR api.version_introduced <= "3.12")
AND (api.deprecated_in IS NULL OR api.deprecated_in > "3.12")
RETURN api
// Query: Find best practices valid for current framework version
MATCH (bp:BestPractice)-[:APPLIES_TO]->(framework:Framework)
WHERE framework.name = "FastAPI"
AND framework.version >= bp.min_version
AND (bp.deprecated_in IS NULL OR framework.version < bp.deprecated_in)
RETURN bp
ORDER BY bp.confidence_score DESC
LIMIT 5
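One caveat with the queries above: comparing versions as strings makes "3.9" sort after "3.12" lexicographically. If versions are stored as dotted strings, they should be normalized into numeric tuples before comparison, e.g.:

```python
def version_key(version: str) -> tuple:
    """Convert "3.12" into (3, 12) so ordering is numeric, not lexicographic."""
    return tuple(int(part) for part in version.split("."))

# String comparison gets the ordering wrong; tuple comparison is correct:
assert "3.9" > "3.12"                            # lexicographic: wrong
assert version_key("3.9") < version_key("3.12")  # numeric: correct
```

An alternative is to store versions as component integers (`major`, `minor`) on the node so Cypher can compare them natively.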
Version Metadata Storage¶
class VersionedKnowledge:
"""Track version-specific knowledge with deprecation."""
def store_api_reference(self, api_data: Dict):
"""Store API with version metadata."""
cypher = """
MERGE (api:APIReference {id: $id})
SET api.function_name = $function_name,
api.signature = $signature,
api.version_introduced = $version_introduced,
api.deprecated_in = $deprecated_in,
api.last_updated = datetime()
// Link to language version
MERGE (lang:Language {name: $language, version: $language_version})
MERGE (api)-[:COMPATIBLE_WITH]->(lang)
"""
self.neo4j.run(cypher, **api_data)
def mark_deprecated(self, api_id: str, deprecated_in_version: str, replacement: str):
"""Mark API as deprecated and link to replacement."""
cypher = """
MATCH (old:APIReference {id: $api_id})
SET old.deprecated_in = $deprecated_in
MERGE (new:APIReference {id: $replacement_id})
MERGE (old)-[:REPLACED_BY {in_version: $deprecated_in}]->(new)
"""
self.neo4j.run(cypher,
api_id=api_id,
deprecated_in=deprecated_in_version,
replacement_id=replacement)
6. Ranking & Relevance¶
Source Credibility Scoring¶
SOURCE_CREDIBILITY = {
# Official sources
"microsoft_learn": 0.95,
"python_docs": 0.95,
"mdn": 0.95,
# Curated content
"real_python": 0.85,
"official_tutorials": 0.85,
# Community (filtered)
"stackoverflow_accepted": 0.75,
"github_maintainer": 0.80,
"reddit_highvote": 0.60,
# Unknown/unverified
"blog_unknown": 0.40,
"forum_post": 0.35
}
def calculate_relevance_score(doc: ExternalDoc, context: str) -> float:
"""
Calculate relevance score based on multiple factors.
Factors (weighted):
- Source credibility (40%)
- Content freshness (20%)
- Usage frequency (20%)
- Text similarity to context (20%)
"""
# Base credibility
credibility = SOURCE_CREDIBILITY.get(doc.source, 0.5)
# Freshness score (decay over time)
age_days = (datetime.now() - doc.last_updated).days
freshness = max(0.0, 1.0 - (age_days / 730.0)) # 2-year decay
# Usage score (more used = more relevant)
usage = min(1.0, doc.access_count / 100.0)
# Semantic similarity (simple keyword matching for now)
similarity = calculate_text_similarity(doc.summary, context)
# Weighted average
score = (
credibility * 0.40 +
freshness * 0.20 +
usage * 0.20 +
similarity * 0.20
)
return score
Learning Which Sources Work¶
class ExternalKnowledgeFeedback:
    """Track which external knowledge actually helped."""

    def record_usage(self, doc_id: str, agent_id: str, task_outcome: str):
        """Record when external knowledge was used and outcome."""
        cypher = """
        MATCH (doc:ExternalDoc {id: $doc_id})
        SET doc.access_count = doc.access_count + 1,
            doc.last_accessed = datetime()

        // Record outcome and link it to the document so that
        // get_effective_sources() can traverse the USED relationship
        CREATE (usage:KnowledgeUsage {
            doc_id: $doc_id,
            agent_id: $agent_id,
            outcome: $outcome,
            timestamp: datetime()
        })
        CREATE (usage)-[:USED]->(doc)
        """
        self.neo4j.run(cypher, doc_id=doc_id, agent_id=agent_id, outcome=task_outcome)

    def get_effective_sources(self, domain: str) -> List[ExternalDoc]:
        """Find external knowledge that led to successful outcomes."""
        cypher = """
        MATCH (doc:ExternalDoc {category: $domain})
        OPTIONAL MATCH (doc)<-[:USED]-(usage:KnowledgeUsage {outcome: "success"})
        WITH doc, count(usage) AS success_count
        WHERE success_count > 5
        RETURN doc
        ORDER BY success_count DESC, doc.relevance_score DESC
        LIMIT 10
        """
        return self.neo4j.run(cypher, domain=domain)
7. Retrieval Strategies¶
When to Query External Knowledge¶
class ExternalKnowledgeRetriever:
"""Decide when and what external knowledge to retrieve."""
def should_query_external(self, context: AgentContext) -> bool:
"""
Decide if external knowledge would help.
Query external knowledge if:
1. Agent is encountering new library/API
2. Error pattern not in project memory
3. Best practice request for unfamiliar domain
4. User explicitly asks for documentation
Don't query if:
1. Project memory has sufficient context
2. Agent has handled this pattern before
3. Pure refactoring (no new APIs)
"""
# Check project memory first
project_memories = self.memory_manager.retrieve(
agent_id=context.agent_id,
memory_type=MemoryType.PATTERN,
search=context.current_task
)
if len(project_memories) >= 3:
# We have sufficient project-specific knowledge
return False
# Check if task involves new APIs
new_apis = self.detect_new_apis(context.code_context)
if new_apis:
return True
# Check for error patterns
if context.error_pattern and not self.has_solution_in_project_memory(context.error_pattern):
return True
return False
def retrieve_relevant_knowledge(self, context: AgentContext, max_items: int = 5) -> List[ExternalDoc]:
"""
Retrieve most relevant external knowledge.
Strategy:
1. Identify knowledge gaps in project memory
2. Query external knowledge to fill gaps
3. Rank by relevance
4. Return top N items
5. Cache for future use
"""
knowledge_gaps = self.identify_knowledge_gaps(context)
external_docs = []
for gap in knowledge_gaps:
docs = self.query_external_sources(
topic=gap.topic,
language=gap.language,
version=gap.version
)
external_docs.extend(docs)
# Rank by relevance
ranked_docs = sorted(
external_docs,
key=lambda d: calculate_relevance_score(d, context.current_task),
reverse=True
)
return ranked_docs[:max_items]
Combining Project Memory + External Knowledge¶
def build_agent_context(agent_id: str, task: str) -> str:
"""
Build comprehensive context from project memory + external knowledge.
Priority:
1. Project-specific memories (HIGHEST)
2. Previously successful external knowledge
3. Fresh external knowledge (if gaps exist)
"""
context_parts = []
# Project memory (always first)
project_memories = memory_manager.retrieve(
agent_id=agent_id,
search=task,
min_importance=5
)
if project_memories:
context_parts.append("## Project-Specific Knowledge")
for mem in project_memories[:3]: # Top 3
context_parts.append(f"- {mem.title}: {mem.content}")
# External knowledge (if needed)
context = AgentContext(agent_id=agent_id, current_task=task)
if external_retriever.should_query_external(context):
external_docs = external_retriever.retrieve_relevant_knowledge(context, max_items=2)
if external_docs:
context_parts.append("\n## External Reference (Advisory)")
for doc in external_docs:
context_parts.append(f"- [{doc.source}] {doc.title}: {doc.summary}")
if doc.example_code:
context_parts.append(f" Example:\n ```\n {doc.example_code}\n ```")
return "\n".join(context_parts)
8. Keeping Knowledge Up-to-Date¶
Refresh Strategy¶
class ExternalKnowledgeRefresher:
"""Keep external knowledge fresh without over-fetching."""
# Refresh frequencies by source type
REFRESH_INTERVALS = {
"official_docs": timedelta(days=30), # Stable
"tutorials": timedelta(days=90), # Slow-changing
"community": timedelta(days=7), # Dynamic
"library_specific": timedelta(days=14) # Version-dependent
}
def needs_refresh(self, doc: ExternalDoc) -> bool:
"""Check if document needs refreshing."""
age = datetime.now() - doc.last_updated
interval = self.REFRESH_INTERVALS.get(doc.category, timedelta(days=30))
# Refresh if:
# 1. Older than refresh interval
# 2. Has been accessed recently (indicates value)
# 3. Is in top 20% by access count (high-value docs)
is_stale = age > interval
is_valuable = doc.access_count > self.get_access_count_percentile(0.8)
recently_used = (datetime.now() - doc.last_accessed).days < 7
return is_stale and (is_valuable or recently_used)
def refresh_knowledge(self, doc: ExternalDoc):
"""Refresh external knowledge document."""
try:
# Fetch fresh content
fresh_content = fetch_from_source(doc.source_url)
# Check if content changed
new_hash = hashlib.sha256(fresh_content.encode()).hexdigest()
if new_hash != doc.content_hash:
# Content changed - update
doc.content_hash = new_hash
doc.summary = generate_summary(fresh_content)
doc.last_updated = datetime.now()
# Log change for version tracking
self.log_knowledge_change(doc.id, "content_updated")
except FetchError:
# Graceful degradation: keep using cached version
self.log_knowledge_change(doc.id, "fetch_failed")
Deprecation Detection¶
def detect_deprecations(api_ref: APIReference) -> Optional[Dict]:
"""
Detect if API has been deprecated.
Methods:
1. Check official deprecation notices in docs
2. Monitor library changelogs
3. Track community warnings
"""
# Check official docs for deprecation markers
doc_content = fetch_current_docs(api_ref.namespace)
if "deprecated" in doc_content.lower():
deprecation_info = extract_deprecation_info(doc_content)
return {
"deprecated": True,
"deprecated_in": deprecation_info.version,
"replacement": deprecation_info.replacement_api,
"reason": deprecation_info.reason
}
# Check changelog
changelog = fetch_changelog(api_ref.namespace)
deprecation = find_deprecation_in_changelog(changelog, api_ref.function_name)
if deprecation:
return deprecation
return None
9. Handling Large External Knowledge Bases¶
Problem: 100k+ Documents¶
Challenge: Can't load all external knowledge into agent context (token limits)
Solution: Tiered retrieval + aggressive filtering
class LargeKnowledgeBaseHandler:
"""Handle large external knowledge bases efficiently."""
def __init__(self):
self.index = ExternalKnowledgeIndex() # Fast lookup structure
self.cache = LRUCache(maxsize=1000) # In-memory cache of frequently used docs
def query(self, context: str, language: str, max_results: int = 5) -> List[ExternalDoc]:
"""
Query large knowledge base efficiently.
Strategy:
1. Fast keyword filtering (reduce 100k → 1k)
2. Semantic ranking (reduce 1k → 100)
3. Relevance scoring (reduce 100 → 5)
"""
# Stage 1: Fast keyword filter
candidates = self.index.keyword_search(
context=context,
language=language,
max_candidates=1000
)
# Stage 2: Semantic ranking
if len(candidates) > 100:
candidates = self.semantic_rank(candidates, context, top_k=100)
# Stage 3: Detailed relevance scoring
ranked_results = []
for doc in candidates:
score = calculate_relevance_score(doc, context)
ranked_results.append((score, doc))
ranked_results.sort(reverse=True, key=lambda x: x[0])
return [doc for score, doc in ranked_results[:max_results]]
def build_index(self):
"""Build efficient search index for large knowledge base."""
# Use inverted index for fast keyword lookup
index = {}
for doc in self.load_all_docs():
# Extract keywords
keywords = extract_keywords(doc.title + " " + doc.summary)
for keyword in keywords:
if keyword not in index:
index[keyword] = []
index[keyword].append(doc.id)
return index
Metadata-Only Storage¶
def store_metadata_only(doc: ExternalDoc):
"""
Store only metadata in Neo4j, full content in file cache.
Neo4j storage (lightweight):
- id, title, source, source_url
- summary (500 chars max)
- version, language, category
- relevance_score, access_count
File cache (full content):
- Complete documentation
- Code examples
- Full API reference
"""
# Store metadata in Neo4j
cypher = """
MERGE (doc:ExternalDoc {id: $id})
SET doc.title = $title,
doc.source = $source,
doc.source_url = $source_url,
doc.summary = $summary,
doc.version = $version,
doc.relevance_score = $relevance_score
"""
neo4j.run(cypher, **doc.metadata)
# Store full content in file cache
cache_path = get_cache_path(doc.id)
save_full_content(cache_path, doc.full_content)
10. Performance Considerations¶
Query Performance Targets¶
Target: <100ms for external knowledge queries
Breakdown:
- Relevance check: <10ms (decide if external knowledge needed)
- Cache lookup: <20ms (check if already cached)
- Neo4j query: <50ms (fetch metadata)
- Full content fetch: <20ms (from file cache)
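A lightweight way to verify these stage budgets during testing is to wrap each stage with a timer. A sketch (the decorator and the stand-in `cache_lookup` are illustrative; budgets come from the breakdown above):

```python
import time
from functools import wraps

def timed(budget_ms: float):
    """Decorator that measures elapsed wall-clock time and flags
    budget overruns without changing the wrapped function's result."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            if elapsed_ms > budget_ms:
                print(f"{fn.__name__} exceeded budget: "
                      f"{elapsed_ms:.1f}ms > {budget_ms}ms")
            return result
        return wrapper
    return decorator

@timed(budget_ms=20)  # cache lookup target from the breakdown above
def cache_lookup(key: str):
    return key.upper()  # stand-in for the real lookup
```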
Optimization Strategies¶
# 1. Pre-compute relevance scores
def precompute_relevance_scores():
"""Run nightly to update relevance scores."""
cypher = """
MATCH (doc:ExternalDoc)
SET doc.relevance_score =
(doc.access_count * 0.4) +
((1.0 - (duration.between(doc.last_updated, datetime()).days / 730.0)) * 0.3) +
(size([(doc)<-[:USED]-(usage:KnowledgeUsage {outcome: "success"}) | usage]) * 0.3)
"""
neo4j.run(cypher)
# 2. Index frequently queried paths
neo4j.run("""
CREATE INDEX external_doc_category IF NOT EXISTS
FOR (d:ExternalDoc) ON (d.category, d.language, d.relevance_score)
""")
# 3. Materialize common queries
def materialize_top_docs():
"""Cache top documents for each category."""
cypher = """
MATCH (doc:ExternalDoc)
WHERE doc.category = $category AND doc.language = $language
WITH doc
ORDER BY doc.relevance_score DESC
LIMIT 20
RETURN doc
"""
for category in CATEGORIES:
for language in LANGUAGES:
results = neo4j.run(cypher, category=category, language=language)
cache_results(f"top_{category}_{language}", results)
11. Integration Code Examples¶
Example 1: Agent Queries External Knowledge¶
class AgentWithExternalKnowledge:
"""Agent that uses both project memory and external knowledge."""
def __init__(self, agent_id: str, session_id: str):
self.agent_id = agent_id
self.memory_manager = MemoryManager(session_id=session_id)
self.external_knowledge = ExternalKnowledgeRetriever()
def execute_task(self, task: str) -> str:
"""Execute task with comprehensive knowledge context."""
# Build context from multiple sources
context = self.build_comprehensive_context(task)
# Execute with context
result = self.execute_with_context(context, task)
# Record what knowledge was useful
self.record_knowledge_usage(context, result)
return result
def build_comprehensive_context(self, task: str) -> str:
"""Build context from project memory + external knowledge."""
context_parts = []
# 1. Project-specific memories (highest priority)
project_memories = self.memory_manager.retrieve(
agent_id=self.agent_id,
search=task,
min_importance=5,
limit=3
)
if project_memories:
context_parts.append("## Project Memory (Proven Patterns)")
for mem in project_memories:
context_parts.append(f"- {mem.title}: {mem.content}")
# 2. External knowledge (if needed)
agent_context = AgentContext(
agent_id=self.agent_id,
current_task=task,
code_context=self.get_current_code_context()
)
if self.external_knowledge.should_query_external(agent_context):
external_docs = self.external_knowledge.retrieve_relevant_knowledge(
agent_context,
max_items=2
)
if external_docs:
context_parts.append("\n## External Reference (Advisory)")
for doc in external_docs:
context_parts.append(
f"- [{doc.source}] {doc.title}\n"
f" {doc.summary}\n"
f" URL: {doc.source_url}"
)
return "\n".join(context_parts)
Example 2: Automatic API Documentation Linking¶
def link_api_usage_to_docs(code_file: Path):
"""
Automatically link API usage in code to external documentation.
Process:
1. Parse code to find API calls
2. Query external knowledge for matching API docs
3. Create relationships in Neo4j
4. Cache for fast retrieval
"""
# Parse code
tree = ast.parse(code_file.read_text())
api_calls = extract_api_calls(tree)
for call in api_calls:
# Find or fetch API documentation
api_doc = find_or_fetch_api_doc(
namespace=call.module,
function=call.function_name,
version=detect_package_version(call.module)
)
if api_doc:
# Create relationship in Neo4j
cypher = """
MATCH (file:CodeFile {path: $file_path})
MATCH (api:APIReference {id: $api_id})
MERGE (api)-[:USED_IN {
line_number: $line_number,
context: $context
}]->(file)
"""
neo4j.run(cypher,
file_path=str(code_file),
api_id=api_doc.id,
line_number=call.line_number,
context=call.context_code)
Example 3: Error-Driven Knowledge Fetching¶
class ErrorDrivenKnowledgeFetcher:
"""Fetch external knowledge based on error patterns."""
def handle_error(self, error: Exception, code_context: str) -> Optional[str]:
"""
Fetch relevant external knowledge for error resolution.
Priority:
1. Check project memory for previous solutions
2. Query external knowledge for error pattern
3. Return most relevant solution
"""
error_pattern = classify_error(error)
# Check project memory first
project_solutions = self.memory_manager.retrieve(
memory_type=MemoryType.PATTERN,
search=str(error),
tags=["error_solution"]
)
if project_solutions:
# We've solved this before
return project_solutions[0].content
# Query external knowledge
external_solutions = self.query_external_error_solutions(
error_pattern=error_pattern,
code_context=code_context
)
if external_solutions:
best_solution = external_solutions[0]
# Store in project memory for future
self.memory_manager.store(
agent_id="error_handler",
title=f"Solution: {error_pattern}",
content=best_solution.solution,
memory_type=MemoryType.PATTERN,
tags=["error_solution", error_pattern],
importance=8,
metadata={
"external_source": best_solution.source,
"source_url": best_solution.url
}
)
return best_solution.solution
return None
12. Progressive Implementation Plan¶
Phase 1: Foundation (Week 1)¶
Goal: File-based cache + basic Neo4j metadata
Tasks:
- Implement ExternalKnowledgeCache (file-based)
- Create Neo4j schema (ExternalDoc, APIReference nodes)
- Build basic fetch functions for official docs
- Test with single source (e.g., Python docs)
Deliverable: Can fetch and cache Python official docs
Phase 2: Integration (Week 2)¶
Goal: Integrate with existing memory system
Tasks:
- Implement should_query_external() logic
- Add external knowledge to agent context builder
- Create Neo4j relationships to code files
- Test with architect agent
Deliverable: Agents can query external knowledge when needed
Phase 3: Multiple Sources (Week 3)¶
Goal: Support multiple knowledge sources
Tasks:
- Add MS Learn fetcher
- Add MDN fetcher
- Add StackOverflow fetcher (with filtering)
- Implement source credibility scoring
Deliverable: Multi-source knowledge retrieval
Phase 4: Optimization (Week 4)¶
Goal: Performance and ranking
Tasks:
- Implement relevance scoring
- Add usage tracking
- Build recommendation engine
- Performance testing
Deliverable: <100ms query performance
Phase 5: Learning (Week 5)¶
Goal: Adaptive knowledge retrieval
Tasks:
- Track which knowledge helps
- Implement feedback loop
- Auto-refresh stale content
- Deprecation detection
Deliverable: Self-improving knowledge system
13. Success Metrics¶
Performance Metrics¶
- External knowledge query time: <100ms (p95)
- Cache hit rate: >80%
- Neo4j query time: <50ms (p95)
Quality Metrics¶
- Source credibility score: >0.7 average
- Knowledge freshness: <30 days average age for high-use docs
- Agent satisfaction: Track when external knowledge was useful
Usage Metrics¶
- External knowledge query rate: 20-40% of agent tasks
- Cache efficiency: >80% hit rate
- Storage efficiency: <100MB for 10k documents (metadata only)
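These metrics are cheap to compute from plain counters. A sketch (counter and function names are illustrative):

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; the target above is > 0.80."""
    total = hits + misses
    return hits / total if total else 0.0

def meets_target(hits: int, misses: int, target: float = 0.80) -> bool:
    """Check the cache-efficiency target from the metrics above."""
    return cache_hit_rate(hits, misses) >= target
```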
14. Monitoring & Maintenance¶
Daily Operations¶
def daily_maintenance():
"""Daily maintenance tasks."""
# 1. Update relevance scores
precompute_relevance_scores()
# 2. Refresh high-value docs
refresh_high_value_docs()
# 3. Clean up unused cache
cleanup_cache(older_than_days=90)
Weekly Analysis¶
def weekly_analysis():
"""Analyze external knowledge usage patterns."""
# 1. Top 20 most used docs
top_docs = get_top_documents(limit=20)
# 2. Unused docs (consider removing)
unused_docs = get_documents_never_accessed(age_days=90)
# 3. Source effectiveness
source_stats = analyze_source_effectiveness()
# 4. Knowledge gaps
gaps = identify_knowledge_gaps()
return {
"top_docs": top_docs,
"unused_docs": unused_docs,
"source_stats": source_stats,
"gaps": gaps
}
15. Key Principles (Summary)¶
- Project Memory First: External knowledge is advisory, never replaces project-specific memory
- Cache Aggressively: File-based cache with smart refresh
- Measure Before Optimizing: Start simple, measure, optimize based on data
- Version Awareness: Always track version compatibility
- Source Credibility: Rank sources by trust score
- Graceful Degradation: System works even if external knowledge unavailable
- Lightweight Metadata: Store metadata in Neo4j, full content in files
- Learning Loop: Track what works, improve recommendations
- Performance First: <100ms query target
- User Control: Never override explicit user requirements
File Locations¶
Implementation:
- src/amplihack/external_knowledge/cache.py
- src/amplihack/external_knowledge/retriever.py
- src/amplihack/external_knowledge/neo4j_schema.py
- src/amplihack/external_knowledge/sources/
Data Storage:
- ~/.amplihack/external_knowledge/cache/ (file cache)
- Neo4j database (metadata + relationships)
Integration:
- src/amplihack/memory/manager.py (add external knowledge queries)
- .claude/agents/*/ (no changes - agents remain stateless)
END OF DESIGN DOCUMENT
This design follows the project's ruthless simplicity philosophy: start with file-based caching, measure what's needed, and only add Neo4j complexity where it provides clear value (relationships, fast metadata queries, version tracking). External knowledge is always advisory and never overrides project-specific memory.