Run Evaluations¶
How to evaluate Knowledge Pack quality using eval_single_pack.py and run_all_packs_evaluation.py.
Prerequisites¶
- `ANTHROPIC_API_KEY` environment variable set
- Pack built with `pack.db` and `eval/questions.jsonl` present
- Python dependencies installed (`uv sync`)
Single Pack Evaluation¶
Basic Usage¶
uv run python scripts/eval_single_pack.py <pack-name> [--sample N]
Arguments:
| Argument | Required | Description |
|---|---|---|
| `pack-name` | Yes | Name of the pack directory under `data/packs/` |
| `--sample N` | No | Number of questions to sample (default: all) |
Examples¶
# Evaluate 5 questions (quick check, ~$0.03)
uv run python scripts/eval_single_pack.py go-expert --sample 5
# Evaluate 25 questions (good confidence, ~$0.75)
uv run python scripts/eval_single_pack.py go-expert --sample 25
# Evaluate all questions
uv run python scripts/eval_single_pack.py go-expert
What It Does¶
For each question in data/packs/<pack-name>/eval/questions.jsonl:
- Training condition: Claude Opus answers without any pack context
- Pack condition: KG Agent retrieves from the pack with the full retrieval pipeline, then Claude Opus synthesizes the answer
- Judge scoring: Claude Opus scores each answer 0-10 against the `ground_truth`
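A minimal sketch of that loop is shown below. It is illustrative only: the `retrieve` callable stands in for the KG Agent retrieval pipeline, the judge prompt is heavily simplified, and the helper names (`complete`, `judge`, `evaluate`) are not taken from the script. It assumes each JSONL line carries `question` and `ground_truth` fields (`ground_truth` per the judge description above; `question` is assumed).

```python
# Illustrative shape of the per-question loop, not the script's actual code.
import json
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()       # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"  # same model answers and judges (see table below)

def complete(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def judge(question: str, answer: str, ground_truth: str) -> int:
    # Simplified judge prompt; the real rubric is more detailed.
    verdict = complete(
        "Score the answer from 0 to 10 against the ground truth. "
        "Reply with a single integer.\n\n"
        f"Question: {question}\nAnswer: {answer}\nGround truth: {ground_truth}"
    )
    return int(verdict.strip())

def evaluate(pack_name: str, retrieve) -> None:
    questions = Path("data/packs") / pack_name / "eval" / "questions.jsonl"
    training_scores, pack_scores = [], []
    for line in questions.read_text().splitlines():
        q = json.loads(line)
        # Training condition: answer from model knowledge alone.
        t_answer = complete(q["question"])
        # Pack condition: retrieved pack context is supplied before synthesis.
        context = retrieve(q["question"])
        p_answer = complete(f"Context:\n{context}\n\nQuestion: {q['question']}")
        training_scores.append(judge(q["question"], t_answer, q["ground_truth"]))
        pack_scores.append(judge(q["question"], p_answer, q["ground_truth"]))
    print(f"Training avg: {sum(training_scores) / len(training_scores):.1f}/10")
    print(f"Pack avg:     {sum(pack_scores) / len(pack_scores):.1f}/10")
```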
Output¶
Pack: go-expert (10 questions)
Condition Avg Score Accuracy
────────── ───────── ────────
Training 8.7/10 90.0%
Pack 9.6/10 100.0%
Delta +10.0pp
Models Used¶
| Role | Model | Purpose |
|---|---|---|
| Answer model | `claude-opus-4-6` | Generates answers for both conditions |
| Judge model | `claude-opus-4-6` | Scores answers against ground truth |
All-Packs Evaluation¶
Basic Usage¶
uv run python scripts/run_all_packs_evaluation.py [--sample N] [flags]
Arguments:
| Argument | Required | Description |
|---|---|---|
| `--sample N` | No | Questions per pack (default: all) |
| `--disable-reranker` | No | Disable GraphReranker for A/B testing |
| `--disable-multidoc` | No | Disable MultiDocSynthesizer |
| `--disable-fewshot` | No | Disable FewShotManager |
Examples¶
# Quick sample across all packs (~$0.15)
uv run python scripts/run_all_packs_evaluation.py --sample 5
# Standard evaluation (10 per pack, ~$0.30)
uv run python scripts/run_all_packs_evaluation.py --sample 10
# Full evaluation (all questions, ~$15)
uv run python scripts/run_all_packs_evaluation.py
# A/B test: evaluate without reranker
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-reranker
Which Packs Are Included?¶
The script automatically discovers packs that have both:
- `data/packs/<pack-name>/pack.db` (LadybugDB database)
- `data/packs/<pack-name>/eval/questions.jsonl` (evaluation questions)
Packs missing either file are skipped.
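For reference, the discovery rule can be expressed as a short check like the one below. It approximates the behaviour described above and is not the script's own code.

```python
# A pack qualifies only if both pack.db and eval/questions.jsonl exist.
from pathlib import Path

def discover_packs(root: str = "data/packs") -> list[str]:
    packs = []
    for pack_dir in sorted(Path(root).iterdir()):
        if not pack_dir.is_dir():
            continue
        has_db = (pack_dir / "pack.db").exists()
        has_questions = (pack_dir / "eval" / "questions.jsonl").exists()
        if has_db and has_questions:
            packs.append(pack_dir.name)
    return packs

print(discover_packs())
```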
Output¶
Results are saved to data/packs/all_packs_evaluation.json:
{
"timestamp": "2026-03-01T17:56:44Z",
"packs_evaluated": 8,
"sample_per_pack": 10,
"grand_summary": {
"training": {"avg": 8.875, "accuracy": 96.25, "n": 80},
"pack": {"avg": 9.05, "accuracy": 97.5, "n": 80}
},
"per_pack": {
"go-expert": {
"scores": {
"training": [5, 9, 10, 10, 9, 10, 7, 9, 9, 9],
"pack": [9, 10, 10, 9, 9, 10, 9, 10, 10, 10]
}
}
}
}
Console Output¶
=== All Packs Evaluation ===
Packs found: 8
Sample per pack: 10
[1/8] bicep-infrastructure ...
Training: avg=9.1 acc=100%
Pack: avg=9.4 acc=100%
[2/8] go-expert ...
Training: avg=8.7 acc=90%
Pack: avg=9.6 acc=100%
...
=== Grand Summary ===
Training: avg=8.9 acc=96.2% (n=80)
Pack: avg=9.1 acc=97.5% (n=80)
Results saved to data/packs/all_packs_evaluation.json
Cost Estimates¶
Evaluation cost depends on the number of questions and conditions:
| Questions | Packs | Total API Calls | Estimated Cost |
|---|---|---|---|
| 5 per pack | 8 | ~120 | ~$0.15 |
| 10 per pack | 8 | ~240 | ~$0.30 |
| 25 per pack | 8 | ~600 | ~$0.75 |
| 50 per pack | 8 | ~1200 | ~$1.50 |
| All (~200 total) | 8 | ~4800 | ~$6.00 |
Each question requires 2 answer calls (Training + Pack) and 2 judge calls, all using Claude Opus.
A/B Testing Enhancement Modules¶
Disable individual enhancement modules to measure their contribution:
# Baseline: all enhancements on
uv run python scripts/run_all_packs_evaluation.py --sample 10
# Test: no graph reranking
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-reranker
# Test: no multi-doc synthesis
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-multidoc
# Test: no few-shot examples
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-fewshot
Compare the output JSON files to see which modules contribute the most to each pack's accuracy.
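Since results are saved to `data/packs/all_packs_evaluation.json`, copy or rename the file between runs before comparing. The sketch below compares per-pack Pack-condition averages between two such files; the file names `baseline.json` and `no_reranker.json` are just examples, and the JSON layout follows the output sample above.

```python
# Compare per-pack "pack" condition averages between a baseline and an ablated run.
import json
from statistics import mean

def per_pack_avg(path: str, condition: str = "pack") -> dict[str, float]:
    with open(path) as f:
        data = json.load(f)
    return {name: mean(entry["scores"][condition])
            for name, entry in data["per_pack"].items()}

baseline = per_pack_avg("baseline.json")    # run with all enhancements on
ablated = per_pack_avg("no_reranker.json")  # run with --disable-reranker
for pack in sorted(set(baseline) & set(ablated)):
    delta = ablated[pack] - baseline[pack]
    print(f"{pack:<25} baseline={baseline[pack]:.2f}  "
          f"ablated={ablated[pack]:.2f}  delta={delta:+.2f}")
```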
Troubleshooting¶
"No questions found for pack"¶
Ensure eval/questions.jsonl exists in the pack directory:
ls data/packs/<pack-name>/eval/questions.jsonl
If missing, generate questions:
uv run python scripts/generate_eval_questions.py --pack <pack-name> --count 50
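To sanity-check an existing file, the snippet below verifies that every line parses as JSON and carries the fields the judge needs. Field names follow the judge description above (`ground_truth`; `question` is assumed), and the real schema may include additional fields.

```python
# Validate questions.jsonl: one JSON object per line, with the expected fields.
import json
from pathlib import Path

path = Path("data/packs/go-expert/eval/questions.jsonl")  # adjust the pack name
rows = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
bad = [i for i, row in enumerate(rows, start=1)
       if "question" not in row or "ground_truth" not in row]
print(f"{len(rows)} questions, {len(bad)} lines missing required fields: {bad}")
```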
"ANTHROPIC_API_KEY not set"¶
export ANTHROPIC_API_KEY="sk-ant-..."
High Variance Between Runs¶
Small sample sizes produce unreliable results. Increase the sample:
# Use 25+ questions for stable results
uv run python scripts/eval_single_pack.py go-expert --sample 25
Negative Delta on a Pack¶
See Improving Accuracy for diagnosis and remediation steps. Common causes:
- Pack content is outdated or incorrect
- Evaluation questions test training knowledge, not pack-specific knowledge
- Source URLs have thin content
See Also¶
- Evaluation Methodology -- Understanding what we measure and how
- Evaluation Results -- Current results across all packs
- Improving Accuracy -- Strategies for improving pack quality