Improving Accuracy¶

Seven techniques used in the retrieval pipeline to achieve 98% pack accuracy across 44 packs. Each addresses a specific failure mode.

Overview¶

#	Improvement	What It Fixes	Accuracy Impact
1	Confidence-gated context injection	Pack injects irrelevant content	Eliminates negative deltas
2	Cross-encoder reranking	Bi-encoder misranks nuanced queries	+10-15% retrieval precision
3	Multi-query retrieval	Vocabulary mismatch misses content	+15-25% recall
4	Content quality scoring	Stub sections dilute context	-30% context noise
5	URL list expansion	Thin coverage in source URLs	More content to retrieve
6	Eval question calibration	Questions test training, not packs	Accurate measurement
7	Full pack rebuilds	Stale database after URL changes	Fresh content

Improvement 1: Confidence-Gated Context Injection¶

Problem¶

Before this improvement, the synthesis prompt always included retrieved pack content regardless of relevance. When a question fell outside the pack's domain, vector search returned low-similarity sections that confused Claude:

Claude hallucinated connections between irrelevant sections and the question
Answers incorrectly cited unrelated articles
Packs with strong training coverage (Go, React) showed accuracy regressions

Solution¶

After vector search, check whether the best result meets a minimum similarity threshold:

max_similarity >= 0.5  →  inject pack context (full pipeline)
max_similarity < 0.5   →  skip pack, let Claude answer from own knowledge

Configuration¶

The threshold is a class constant on KnowledgeGraphAgent:

class KnowledgeGraphAgent:
    CONTEXT_CONFIDENCE_THRESHOLD = 0.5

To tune for a specific pack:

class StrictAgent(KnowledgeGraphAgent):
    CONTEXT_CONFIDENCE_THRESHOLD = 0.65  # require higher confidence

class PermissiveAgent(KnowledgeGraphAgent):
    CONTEXT_CONFIDENCE_THRESHOLD = 0.35  # inject more context

Impact¶

Pack	Before	After
go-expert	Negative delta (KG noise)	100% (gate fires on OOD questions)
react-expert	Negative delta (KG noise)	100% (gate fires on OOD questions)

Improvement 2: Cross-Encoder Reranking¶

Problem¶

Bi-encoder search (embedding similarity) scores query and document independently. It cannot capture interactions like negations ("not supported"), comparisons ("differs from"), or qualifications ("only when").

Solution¶

After vector search returns candidates, a cross-encoder model (ms-marco-MiniLM-L-12-v2) rescores each (query, document) pair jointly. This enables precise relevance judgments.

Configuration¶

agent = KnowledgeGraphAgent(
    db_path="data/packs/go-expert/pack.db",
    use_enhancements=True,
    enable_cross_encoder=True,  # opt-in
)

First use downloads a ~33MB model. Subsequent runs load from ~/.cache/huggingface/.

Graceful Degradation¶

If the model fails to load (no network on first use), the cross-encoder becomes a passthrough -- results are returned unchanged. The agent continues to work with bi-encoder ranking.

Impact¶

+10-15% retrieval precision on nuanced queries
~50ms additional latency per query
~120MB model RAM (loaded once)

Improvement 3: Multi-Query Retrieval¶

Problem¶

A single query embedding misses relevant content that uses different vocabulary. "Memory safety in systems programming" may not surface articles about "ownership and borrowing."

Solution¶

When enabled, Claude Haiku generates 2 alternative phrasings. Vector search runs for all 3 queries. Results are deduplicated by title (highest similarity wins).

Configuration¶

agent = KnowledgeGraphAgent(
    db_path="data/packs/go-expert/pack.db",
    enable_multi_query=True,  # opt-in
)

Data residency

When enable_multi_query=True, the question (truncated to 500 chars) is sent to the Anthropic API. Keep False for data-sensitive deployments.

Impact¶

+15-25% recall improvement
~250ms additional latency (1 Haiku call + 2 extra searches)
Graceful fallback: if Haiku fails, proceeds with original query only

Improvement 4: Content Quality Scoring¶

Problem¶

Short stub sections (disambiguation headers, "See also" entries, navigation labels) waste Claude's context window and add noise that causes hallucination.

Solution¶

Each section is scored on a 0.0-1.0 scale before inclusion in synthesis context:

if word_count < 20:
    score = 0.0  (hard cutoff)
else:
    length_score  = min(0.8, 0.2 + (word_count / 200) * 0.6)
    keyword_score = min(0.2, overlap_ratio * 0.2)
    score = min(1.0, length_score + keyword_score)

Sections below CONTENT_QUALITY_THRESHOLD = 0.3 are filtered out.

Configuration¶

This is always active when a question is available -- no flag needed. The threshold is a class constant:

class KnowledgeGraphAgent:
    CONTENT_QUALITY_THRESHOLD = 0.3

Impact¶

~30% reduction in context noise
No API calls (CPU-only scoring, <1ms per section)
Global fallback if all sections filtered: raw article content used instead

Improvement 5: URL List Expansion¶

Problem¶

Initial URL lists were too short (15-30 URLs), leaving coverage gaps in the knowledge graph. The pack database lacked content for many evaluation questions.

Solution¶

Expand urls.txt for each pack to cover:

Core documentation and overview pages
Getting started and quickstart guides
How-to guides and sub-pages
Tutorials and sub-pages
API reference with sub-categories
GitHub source files and READMEs
Community resources (where applicable)

Steps¶

Audit existing URLs: Count URLs, check section coverage

grep -v '^\s*#' data/packs/langchain-expert/urls.txt | grep -v '^\s*$' | wc -l

Identify gaps: Compare against the documentation site's table of contents

Add URLs by section: Group with comment headers

# How-To Guides - Additional Sub-Pages
https://python.langchain.com/docs/how_to/custom_tools/
https://python.langchain.com/docs/how_to/streaming/

Validate reachability:

python scripts/validate_pack_urls.py --pack langchain-expert

Recommended URL Counts¶

Pack Complexity	Minimum	Recommended
Focused library	30	45-60
Framework with integrations	50	65-80
Full platform	50	70-90

Impact¶

More URLs lead to more articles in the graph, which leads to better retrieval coverage.

Improvement 6: Eval Question Calibration¶

Problem¶

Many initial evaluation questions tested general knowledge that Claude already has from training. When the training baseline is 100%, the pack cannot demonstrate improvement -- the questions are measuring the wrong thing.

Solution¶

Replace generic questions with pack-specific ones that target current documentation content.

Common Fixes¶

Replace training-data questions with pack-specific ones:

Pack	Old (generic)	New (pack-specific)
go-expert	"What is a goroutine?"	"What does `slices.Contains` do, and what constraint must E satisfy?"
react-expert	"What are React hooks?"	"What does the `useActionState` hook return, and when is it used?"

Update deprecated API references in ground truth:

Pack	Old	New
langchain-expert	LangServe as recommended	LangGraph Platform as successor
openai-api-expert	`gpt-4-turbo`	`gpt-4o`

Correct factually wrong ground truth:

Pack	Correction
vercel-ai-sdk	Removed incorrect WebSocket streaming claims
bicep-infrastructure	Fixed Key Vault reference syntax

Remove questions about removed features:

Pack	Removed	Replaced With
zig-expert	Zig async/await (removed in 0.12)	`GeneralPurposeAllocator`, `b.dependency()`

Validation After Editing¶

# Verify JSONL parses cleanly
python -c "
import json, sys
path = 'data/packs/PACK/eval/questions.jsonl'
lines = open(path).readlines()
for i, line in enumerate(lines, 1):
    obj = json.loads(line)
    required = {'id','domain','difficulty','question','ground_truth','source'}
    missing = required - obj.keys()
    if missing:
        print(f'Line {i}: missing fields {missing}')
        sys.exit(1)
print(f'OK: {len(lines)} questions, all valid')
"

Impact¶

Calibrated questions accurately measure pack value. Before calibration, some packs showed 0pp delta because questions tested training knowledge. After calibration, the same packs showed meaningful positive deltas.

Improvement 7: Full Pack Rebuilds¶

Problem¶

After expanding URLs (Improvement 5), the pack database still contains the old, smaller content set. Evaluation runs against the stale database.

Solution¶

Rebuild the pack from scratch after URL expansion:

# Rebuild (uses urls.txt, creates new pack.db)
echo "y" | uv run python scripts/build_go_pack.py

# Re-evaluate
uv run python scripts/eval_single_pack.py go-expert --sample 10

Impact¶

Fresh builds incorporate all new URLs, providing the expanded coverage needed for improved retrieval.

Applying All Seven Improvements¶

The improvements stack and should be applied together for maximum effect:

Expand urls.txt (Improvement 5)
Validate URLs: python scripts/validate_pack_urls.py --pack <name>
Rebuild pack: python scripts/build_<pack>_pack.py
Calibrate eval questions (Improvement 6)
Run evaluation with all enhancements:

agent = KnowledgeGraphAgent(
    db_path="data/packs/<pack>/pack.db",
    use_enhancements=True,         # enables reranker, multidoc, fewshot
    enable_cross_encoder=True,     # improvement 2
    enable_multi_query=True,       # improvement 3
    # improvement 1 (confidence gate) is always active
    # improvement 4 (quality scoring) is always active
)

Evaluate: python scripts/eval_single_pack.py <pack> --sample 25
Compare Pack vs Training delta