Retrieval Pipeline¶
This page provides a detailed step-by-step walkthrough of the retrieval pipeline, from the moment a question enters the system to the final synthesized answer.
Pipeline Overview¶
Question
│
▼
1. Query Expansion (optional)
│
▼
2. Vector Search
│
▼
3. Confidence Gate
│
▼
4. Cross-Encoder Reranking (optional)
│
▼
5. Graph Reranking
│
▼
6. Multi-Document Expansion
│
▼
7. Content Quality Filtering
│
▼
8. Few-Shot Example Injection
│
▼
9. Claude Synthesis
│
▼
Answer + Sources
Step 1: Query Expansion (Multi-Query Retrieval)¶
Flag: enable_multi_query=True (opt-in, default False)
A single query embedding may miss content that uses different vocabulary. When enabled, Claude Haiku generates 2 alternative phrasings of the question.
Example:
Original: "What experiments demonstrate quantum entanglement?"
Alt 1: "Which laboratory tests have proven entangled particle behavior?"
Alt 2: "How has quantum correlation been experimentally verified?"
All 3 queries proceed to vector search independently. Results are deduplicated by article title, keeping the highest similarity score for each title.
Technical details:
- Question truncated to 500 characters before sending to Haiku (prompt injection defense)
- Each alternative capped at 300 characters
- If the Haiku call fails, the pipeline proceeds with only the original query (graceful degradation)
- Adds ~50ms latency (1 Haiku call) + ~200ms (2 extra vector searches)
Data residency
When enable_multi_query=True, the user's question is sent to the Anthropic API for expansion. Keep this False for deployments with data-residency or PII constraints.
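Putting the step together, a minimal sketch of the expand-search-dedupe flow, reusing the embed() and hnsw_index names from Step 2. haiku_expand() is a stand-in for the actual Haiku call, not the agent's real helper:
def multi_query_search(question, top_k=10):
    queries = [question[:500]]                  # truncation: prompt injection defense
    try:
        alternatives = haiku_expand(queries[0])  # hypothetical helper: 2 alternative phrasings
        queries += [alt[:300] for alt in alternatives]
    except Exception:
        pass                                    # graceful degradation: original query only
    best = {}
    for q in queries:
        for title, score in hnsw_index.search(embed(q), top_k=top_k):
            best[title] = max(score, best.get(title, 0.0))  # dedupe by title, keep max score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)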
Step 2: Vector Search¶
The question (and any alternatives from Step 1) is embedded using the same model that generated section embeddings during pack building. The HNSW index returns the top-K most similar sections by cosine similarity.
# Pseudocode
query_embedding = embed(question) # 768-dim vector
results = hnsw_index.search(query_embedding, top_k=10)
# Returns: [(section_title, similarity_score), ...]
Configuration:
- max_results: Number of results per query (default 10, clamped to [1, 1000])
  - When cross-encoder is enabled, fetches 2x candidates to give the cross-encoder a larger pool
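The pool sizing can be sketched as follows (variable names assumed; not the agent's exact code):
top_k = max(1, min(1000, max_results))   # clamp to [1, 1000]
if enable_cross_encoder:
    top_k *= 2                           # expanded candidate pool for Step 4
results = hnsw_index.search(query_embedding, top_k=top_k)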
Step 3: Confidence Gate¶
The confidence gate prevents the pack from injecting irrelevant content when it has nothing useful for the question.
max_similarity = max(score for _, score in results)
if max_similarity >= CONTEXT_CONFIDENCE_THRESHOLD (0.5):
→ proceed with pack context
else:
→ skip pack, call _synthesize_answer_minimal()
(Claude answers from own knowledge, no KG context)
When the gate fires:
- Response includes query_type: "confidence_gated_fallback"
- sources, entities, and facts are empty lists
- Claude still provides a useful answer from training data
When the gate does NOT fire:
- Response includes query_type: "vector_search"
- Full pipeline continues
Impact: Eliminates accuracy regressions on questions outside the pack's domain. Without this gate, packs like go-expert showed negative deltas because irrelevant content confused Claude.
| Scenario | max_similarity | Gate fires? |
|---|---|---|
| Question directly matches pack content | 0.7-0.95 | No |
| Question loosely related | 0.5-0.7 | No |
| Question on adjacent topic | 0.3-0.5 | Yes |
| Question completely outside domain | 0.0-0.3 | Yes |
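For reference, a gated response reduces to the field values listed above (a sketch of the shape, not a verbatim payload):
gated_response = {
    "answer": "...",                             # from Claude's own training data
    "query_type": "confidence_gated_fallback",
    "sources": [],                               # empty: no pack context was used
    "entities": [],
    "facts": [],
}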
Step 4: Cross-Encoder Reranking (Optional)¶
Flag: enable_cross_encoder=True (opt-in, default False)
Bi-encoder search scores query and document independently -- it cannot capture interactions between the two texts. Cross-encoders see both texts simultaneously, enabling precise relevance judgments for negations, comparisons, and qualifications.
How it works:
- Vector search fetches 2 * max_results candidates (expanded pool)
- Cross-encoder (ms-marco-MiniLM-L-12-v2) scores each (query, document) pair
- Results sorted by ce_score, truncated to max_results
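A sketch of this stage using the sentence-transformers CrossEncoder API (the candidate tuple layout and function name are assumptions):
from sentence_transformers import CrossEncoder

def cross_encoder_rerank(query, candidates, max_results=10):
    """candidates: list of (title, text) pairs from the 2x expanded pool."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")  # ~120MB; load once and reuse
    scores = model.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(title, float(ce_score)) for (title, _), ce_score in ranked[:max_results]]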
Performance:
- ~50ms per query on CPU (10-candidate pool)
- ~120MB model RAM (loaded once, shared across queries)
- If model fails to load, becomes a passthrough (results unchanged)
Step 5: Graph Reranking¶
Flag: enable_reranker=True (default True)
Degree centrality (in-degree + out-degree) is computed over the LINKS_TO edge graph. Articles with many connections are considered authoritative. In practice, the reranker is called via Reciprocal Rank Fusion (RRF) with k=60, combining the original vector ranking with the centrality ranking:
rrf_score = 1/(k + vector_rank) + 0.5/(k + centrality_rank)
The GraphReranker.rerank() method also supports direct weighted combination:
combined_score = vector_weight * vector_similarity + graph_weight * normalized_centrality
Default weights: vector_weight=0.6, graph_weight=0.4
Example:
Before reranking:
1. Quantum_fluctuation (similarity=0.95, centrality=0.02) → combined=0.57 + 0.01 = 0.58
2. Quantum_mechanics (similarity=0.90, centrality=0.85) → combined=0.54 + 0.34 = 0.88
After reranking:
1. Quantum_mechanics (combined=0.88) ← promoted (authoritative)
2. Quantum_fluctuation (combined=0.58)
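Both combination modes reduce to two small functions (a sketch; the worked example above checks out numerically):
def rrf_score(vector_rank, centrality_rank, k=60):
    # reciprocal rank fusion; the centrality ranking carries a 0.5 weight
    return 1.0 / (k + vector_rank) + 0.5 / (k + centrality_rank)

def combined_score(similarity, centrality, vector_weight=0.6, graph_weight=0.4):
    # direct weighted combination supported by GraphReranker.rerank()
    return vector_weight * similarity + graph_weight * centrality

# Quantum_mechanics from the example: 0.6 * 0.90 + 0.4 * 0.85 = 0.88
assert round(combined_score(0.90, 0.85), 2) == 0.88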
Performance:
- Degree centrality computation: O(V+E), computed per query
- Sparse graph detection: automatically disables centrality for graphs with < 2 avg links/article
- Reranking: O(N log N) where N = result count
- Adds ~50ms per query
Step 6: Multi-Document Expansion¶
Flag: enable_multidoc=True (default True)
Instead of using a single article, the synthesizer traverses LINKS_TO edges from the top result to find related articles.
How it works:
- Take the top result from the reranked list as the seed
- Follow LINKS_TO edges from the seed article
- Add up to 2 neighbor articles
- Cap the total source list at 7 articles
- Assemble unified context for synthesis
Constructor: MultiDocSynthesizer(kuzu_conn) -- takes only the LadybugDB connection. No additional configuration parameters.
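A rough sketch of the traversal against the Kùzu connection; the Article label and title property are assumptions about the pack schema:
def expand_sources(kuzu_conn, seed_title, max_neighbors=2, max_sources=7):
    sources = [seed_title]
    query = (
        "MATCH (a:Article {title: $seed})-[:LINKS_TO]->(b:Article) "
        f"RETURN b.title LIMIT {max_neighbors}"
    )
    result = kuzu_conn.execute(query, {"seed": seed_title})
    while result.has_next() and len(sources) < max_sources:
        sources.append(result.get_next()[0])
    return sources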
Step 7: Content Quality Filtering¶
Always active when a question is provided (no flag needed).
Each section is scored on a 0.0-1.0 scale:
if word_count < 20:
    score = 0.0  # hard stub cutoff
else:
    length_score = min(0.8, 0.2 + (word_count / 200) * 0.6)
    keyword_score = min(0.2, overlap_ratio * 0.2)
    score = min(1.0, length_score + keyword_score)
Sections below CONTENT_QUALITY_THRESHOLD = 0.3 are excluded from the synthesis context.
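A runnable version of the rule above; the whitespace tokenization and overlap computation here are assumptions, and the agent's exact implementation may differ:
def quality_score(section_text, question):
    words = section_text.split()
    if len(words) < 20:
        return 0.0  # hard stub cutoff
    q_terms = {w.lower() for w in question.split()}
    overlap_ratio = len(q_terms & {w.lower() for w in words}) / len(q_terms) if q_terms else 0.0
    length_score = min(0.8, 0.2 + (len(words) / 200) * 0.6)
    keyword_score = min(0.2, overlap_ratio * 0.2)
    return min(1.0, length_score + keyword_score)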
Score examples:
| Section | Words | Keyword Overlap | Score | Included? |
|---|---|---|---|---|
| "See also." | 2 | - | 0.0 | No |
| Short stub, no keywords | 20 | 0% | 0.26 | No |
| Medium paragraph | 50 | 0% | 0.35 | Yes |
| Medium with keywords | 50 | 100% | 0.55 | Yes |
| Full paragraph | 200 | 0% | 0.80 | Yes |
| Full with keywords | 200 | 100% | 1.0 | Yes |
Fallback: If ALL sections across ALL retrieved articles are filtered, the agent falls back to using raw article.content for the source titles. This is a global fallback -- it activates only when filtering would otherwise leave the synthesis context empty.
Step 8: Few-Shot Example Injection¶
Flag: enable_fewshot=True (default True)
Pack-specific examples are injected into the synthesis prompt before the question. These guide Claude to follow the pack's preferred answer format, citation style, and reasoning structure.
Example source:
The FewShotManager auto-detects examples from eval/questions.jsonl adjacent to pack.db. If not found, few-shot is silently disabled.
Selection: The 2 most semantically similar examples to the current question are chosen via cosine similarity over example question embeddings (called with k=2 in the query pipeline).
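The selection step amounts to a top-k cosine lookup over the example embeddings (a sketch; the tuple layout is assumed):
import numpy as np

def select_examples(question_embedding, examples, k=2):
    """examples: list of (question_text, answer_text, embedding) tuples."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(examples, key=lambda ex: cosine(question_embedding, ex[2]), reverse=True)
    return ranked[:k]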
Prompt structure:
Here are examples of high-quality answers:
=== Example 1 ===
Question: What does slices.Contains do?
Answer: slices.Contains reports whether v is present in s.
E must satisfy the comparable constraint. [Source: slices_stdlib]
=== Example 2 ===
[...]
Now answer this question following the same pattern:
Question: [user's question]
Context: [retrieved sections]
Answer:
Step 9: Claude Synthesis¶
The assembled prompt -- containing retrieved context, few-shot examples, and the user's question -- is sent to Claude Opus for synthesis.
Model configuration:
| Parameter | Value |
|---|---|
| Model | claude-opus-4-6 (configurable via synthesis_model) |
| Max tokens | 1024 (SYNTHESIS_MAX_TOKENS) |
| Temperature | Default (not overridden) |
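A minimal sketch of the call using the Anthropic Python SDK; assembled_prompt stands for the output of Steps 6-8:
import anthropic

SYNTHESIS_MAX_TOKENS = 1024
assembled_prompt = "..."  # few-shot examples + filtered context + question (Steps 6-8)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-opus-4-6",          # synthesis_model
    max_tokens=SYNTHESIS_MAX_TOKENS,
    # temperature intentionally not set (left at the API default)
    messages=[{"role": "user", "content": assembled_prompt}],
)
answer_text = response.content[0].text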
Response format:
{
"answer": "Goroutine scheduling uses an M:N model where...",
"sources": ["runtime_scheduling", "goroutines_overview"],
"entities": ["goroutine", "OS thread", "work-stealing"],
"facts": ["goroutines are multiplexed onto OS threads"],
"cypher_query": "CALL QUERY_VECTOR_INDEX(...)",
"query_type": "vector_search",
"token_usage": {
"input_tokens": 2847,
"output_tokens": 312,
"api_calls": 2
}
}
Configuration Flags Summary¶
| Flag | Default | What It Controls |
|---|---|---|
| use_enhancements | True | Master switch for all enhancement modules |
| enable_reranker | True | GraphReranker (degree centrality via RRF) |
| enable_multidoc | True | MultiDocSynthesizer (LINKS_TO expansion) |
| enable_fewshot | True | FewShotManager (example injection) |
| enable_cross_encoder | False | CrossEncoderReranker (joint scoring) |
| enable_multi_query | False | Multi-query retrieval (Haiku expansion) |
Flag interaction
All enable_* flags are ignored when use_enhancements=False. The confidence gate and content quality scoring are always active regardless of use_enhancements.
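The interaction rule amounts to AND-ing every enable_* flag with the master switch (a sketch):
def effective_flags(use_enhancements, **enable_flags):
    # every enable_* flag is gated by the master switch;
    # the confidence gate and quality scoring are unconditional and not listed here
    return {name: use_enhancements and value for name, value in enable_flags.items()}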
Latency Breakdown¶
Baseline (use_enhancements=False):
Vector search: ~100ms
Synthesis: ~150ms
─────────────────────────
Total: ~250ms
Balanced (default):
Vector search: ~100ms
GraphReranker: ~50ms (degree centrality)
MultiDocSynthesizer: ~300ms (LINKS_TO traversal)
FewShotManager: ~20ms (example retrieval)
Synthesis: ~200ms (larger context)
─────────────────────────
Total: ~670ms
Maximum Quality:
Multi-query: ~250ms (1 Haiku + 2 extra searches)
Cross-encoder: ~50ms (10 candidates on CPU)
GraphReranker: ~50ms
MultiDocSynthesizer: ~300ms
FewShotManager: ~20ms
Synthesis: ~200ms
─────────────────────────
Total: ~870ms
Even the maximum-quality configuration stays under 1 second for typical queries.