Evaluation Results¶

48 packs evaluated with LadybugDB 0.15.1, 5 questions per pack (240 total). Judge model: Claude Opus 4.6.

Summary¶

Metric	Training (Claude alone)	Pack (KG retrieval)	Enhanced (KG + reranker + multidoc + fewshot)
Avg Score	8.85/10	9.0/10	9.47/10
Accuracy (>=7)	90.8%	90.4%	95.0%
Questions	240	240	240

Enhanced mode achieves 95% accuracy — a +4.2pp improvement over training baseline.

Three-Condition Comparison¶

The evaluation tests three delivery modes:

Training: Claude Opus answers from training knowledge only
Pack: Claude + single KG Agent query (pre-fetched context)
Enhanced: Claude + KG Agent with reranker, multi-doc synthesis, and few-shot examples

Top Accuracy Gains (Enhanced vs Training)¶

Pack	Training	Pack	Enhanced	Delta
workiq-mcp	20%	20%	80%	+60pp
azure-ai-foundry	40%	60%	80%	+40pp
docker-expert	20%	20%	60%	+40pp
azure-lighthouse	80%	100%	100%	+20pp
fabric-graph-gql-expert	80%	80%	100%	+20pp
github-actions-advanced	60%	60%	80%	+20pp
go-expert	80%	100%	100%	+20pp
microsoft-agent-framework	80%	80%	100%	+20pp
rust-async-expert	80%	100%	100%	+20pp
semantic-kernel	80%	80%	100%	+20pp

Where Packs Add the Most Value¶

The biggest gains come from domains where Claude's training data is thin or outdated:

workiq-mcp (+60pp): Internal Microsoft tool with virtually no public training data
azure-ai-foundry (+40pp): Rapidly evolving Azure service with frequent API changes
docker-expert (+40pp): Advanced Docker patterns beyond basic training coverage

Where Packs Match Training¶

Many packs achieve 100% in both training and enhanced conditions (bicep, cpp, csharp, dotnet, dspy, fabric-graphql, huggingface, java, kotlin, kubernetes, langchain, llamaindex, mcp-protocol, nextjs, openai-api, opencypher, opentelemetry, physics, postgresql, prompt-engineering, python, react, ruby, rust, swift, terraform, typescript, vercel, vscode, wasm, zig). The pack still provides value through:

Source attribution (every answer cites specific documentation)
Confidence gating ensures the pack never degrades results
Graph traversal finds connections training data misses

Skill Delivery Evaluation¶

A separate A/B/C evaluation (scripts/eval_skill_delivery.py) tests whether delivering packs as Claude Code skills improves coding task outcomes (not just Q&A):

Condition	Avg Judge (0-10)	Accuracy (>=7)
A) Baseline (Claude alone)	7.0	73%
B) Pack retrieval	7.5	80%
C) Skill delivery (tool_use)	6.9	80%

Key finding: Pack retrieval helps for niche domains. Skill delivery (tool_use) doesn't consistently beat pre-fetched context. See issue #287 for full analysis.

Methodology¶

Database: LadybugDB 0.15.1 (rebuilt from Kuzu migration)
Answer model: Claude Opus 4.6
Judge model: Claude Opus 4.6
Questions per pack: 5 (from eval/questions.jsonl)
Accuracy threshold: Score >= 7 out of 10
Conditions: Training / Pack / Enhanced

Run the evaluation:

uv run python scripts/run_all_packs_evaluation.py --sample 5
uv run python scripts/eval_skill_delivery.py --all

See Methodology for full details.