Run Evaluations¶
How to evaluate Knowledge Pack quality using eval_single_pack.py and run_all_packs_evaluation.py.
Prerequisites¶
- `ANTHROPIC_API_KEY` environment variable set
- Pack built with `pack.db` and `eval/questions.jsonl` present
- Python dependencies installed (`uv sync`)
Single Pack Evaluation¶
Basic Usage¶
uv run python scripts/eval_single_pack.py <pack-name> [--sample N]
Arguments:
| Argument | Required | Description |
|---|---|---|
| `pack-name` | Yes | Name of the pack directory under `data/packs/` |
| `--sample N` | No | Number of questions to sample (default: all) |
Examples¶
# Evaluate 5 questions (quick check, ~$0.03)
uv run python scripts/eval_single_pack.py go-expert --sample 5
# Evaluate 25 questions (good confidence, ~$0.75)
uv run python scripts/eval_single_pack.py go-expert --sample 25
# Evaluate all questions
uv run python scripts/eval_single_pack.py go-expert
What It Does¶
For each question in data/packs/<pack-name>/eval/questions.jsonl:
- Training condition: Claude Opus answers without any pack context
- Pack condition: KG Agent retrieves from the pack with the full retrieval pipeline, then Claude Opus synthesizes the answer
- Judge scoring: Claude Opus scores each answer 0-10 against the `ground_truth`
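A minimal sketch of that loop is shown below. It is illustrative only: the `retrieve` callable stands in for the KG Agent retrieval pipeline, the judge prompt is heavily simplified, and the helper names (`complete`, `judge`, `evaluate`) are not taken from the script. It assumes each JSONL line carries `question` and `ground_truth` fields (`ground_truth` per the judge description above; `question` is assumed).

```python
# Illustrative shape of the per-question loop, not the script's actual code.
import json
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()       # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"  # same model answers and judges (see table below)

def complete(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def judge(question: str, answer: str, ground_truth: str) -> int:
    # Simplified judge prompt; the real rubric is more detailed.
    verdict = complete(
        "Score the answer from 0 to 10 against the ground truth. "
        "Reply with a single integer.\n\n"
        f"Question: {question}\nAnswer: {answer}\nGround truth: {ground_truth}"
    )
    return int(verdict.strip())

def evaluate(pack_name: str, retrieve) -> None:
    questions = Path("data/packs") / pack_name / "eval" / "questions.jsonl"
    training_scores, pack_scores = [], []
    for line in questions.read_text().splitlines():
        q = json.loads(line)
        # Training condition: answer from model knowledge alone.
        t_answer = complete(q["question"])
        # Pack condition: retrieved pack context is supplied before synthesis.
        context = retrieve(q["question"])
        p_answer = complete(f"Context:\n{context}\n\nQuestion: {q['question']}")
        training_scores.append(judge(q["question"], t_answer, q["ground_truth"]))
        pack_scores.append(judge(q["question"], p_answer, q["ground_truth"]))
    print(f"Training avg: {sum(training_scores) / len(training_scores):.1f}/10")
    print(f"Pack avg:     {sum(pack_scores) / len(pack_scores):.1f}/10")
```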
Output¶
Pack: go-expert (10 questions)
Condition Avg Score Accuracy
────────── ───────── ────────
Training 8.7/10 90.0%
Pack 9.6/10 100.0%
Delta +10.0pp
Models Used¶
| Role | Model | Purpose |
|---|---|---|
| Answer model | `claude-opus-4-6` | Generates answers for both conditions |
| Judge model | `claude-opus-4-6` | Scores answers against ground truth |
All-Packs Evaluation¶
Basic Usage¶
uv run python scripts/run_all_packs_evaluation.py [--sample N] [flags]
Arguments:
| Argument | Required | Description |
|---|---|---|
| `--sample N` | No | Questions per pack (default: all) |
| `--disable-reranker` | No | Disable GraphReranker for A/B testing |
| `--disable-multidoc` | No | Disable MultiDocSynthesizer |
| `--disable-fewshot` | No | Disable FewShotManager |
Examples¶
# Quick sample across all packs (~$0.15)
uv run python scripts/run_all_packs_evaluation.py --sample 5
# Standard evaluation (10 per pack, ~$0.30)
uv run python scripts/run_all_packs_evaluation.py --sample 10
# Full evaluation (all questions, ~$15)
uv run python scripts/run_all_packs_evaluation.py
# A/B test: evaluate without reranker
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-reranker
Which Packs Are Included?¶
The script automatically discovers packs that have both:
- `data/packs/<pack-name>/pack.db` (LadybugDB database)
- `data/packs/<pack-name>/eval/questions.jsonl` (evaluation questions)
Packs missing either file are skipped.
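For reference, the discovery rule can be expressed as a short check like the one below. It approximates the behaviour described above and is not the script's own code.

```python
# A pack qualifies only if both pack.db and eval/questions.jsonl exist.
from pathlib import Path

def discover_packs(root: str = "data/packs") -> list[str]:
    packs = []
    for pack_dir in sorted(Path(root).iterdir()):
        if not pack_dir.is_dir():
            continue
        has_db = (pack_dir / "pack.db").exists()
        has_questions = (pack_dir / "eval" / "questions.jsonl").exists()
        if has_db and has_questions:
            packs.append(pack_dir.name)
    return packs

print(discover_packs())
```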
Output¶
Results are saved to data/packs/all_packs_evaluation.json:
{
"timestamp": "2026-03-01T17:56:44Z",
"packs_evaluated": 8,
"sample_per_pack": 10,
"grand_summary": {
"training": {"avg": 8.875, "accuracy": 96.25, "n": 80},
"pack": {"avg": 9.05, "accuracy": 97.5, "n": 80}
},
"per_pack": {
"go-expert": {
"scores": {
"training": [5, 9, 10, 10, 9, 10, 7, 9, 9, 9],
"pack": [9, 10, 10, 9, 9, 10, 9, 10, 10, 10]
}
}
}
}
Console Output¶
=== All Packs Evaluation ===
Packs found: 8
Sample per pack: 10
[1/8] bicep-infrastructure ...
Training: avg=9.1 acc=100%
Pack: avg=9.4 acc=100%
[2/8] go-expert ...
Training: avg=8.7 acc=90%
Pack: avg=9.6 acc=100%
...
=== Grand Summary ===
Training: avg=8.9 acc=96.2% (n=80)
Pack: avg=9.1 acc=97.5% (n=80)
Results saved to data/packs/all_packs_evaluation.json
Cost Estimates¶
Evaluation cost depends on the number of questions and conditions:
| Questions | Packs | Total API Calls | Estimated Cost |
|---|---|---|---|
| 5 per pack | 8 | ~120 | ~$0.15 |
| 10 per pack | 8 | ~240 | ~$0.30 |
| 25 per pack | 8 | ~600 | ~$0.75 |
| 50 per pack | 8 | ~1200 | ~$1.50 |
| All (~200 total) | 8 | ~4800 | ~$6.00 |
Each question requires 2 answer calls (Training + Pack) and 2 judge calls, all using Claude Opus.
A/B Testing Enhancement Modules¶
Disable individual enhancement modules to measure their contribution:
# Baseline: all enhancements on
uv run python scripts/run_all_packs_evaluation.py --sample 10
# Test: no graph reranking
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-reranker
# Test: no multi-doc synthesis
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-multidoc
# Test: no few-shot examples
uv run python scripts/run_all_packs_evaluation.py --sample 10 --disable-fewshot
Compare the output JSON files to see which modules contribute the most to each pack's accuracy.
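Since results are saved to `data/packs/all_packs_evaluation.json`, copy or rename the file between runs before comparing. The sketch below compares per-pack Pack-condition averages between two such files; the file names `baseline.json` and `no_reranker.json` are just examples, and the JSON layout follows the output sample above.

```python
# Compare per-pack "pack" condition averages between a baseline and an ablated run.
import json
from statistics import mean

def per_pack_avg(path: str, condition: str = "pack") -> dict[str, float]:
    with open(path) as f:
        data = json.load(f)
    return {name: mean(entry["scores"][condition])
            for name, entry in data["per_pack"].items()}

baseline = per_pack_avg("baseline.json")    # run with all enhancements on
ablated = per_pack_avg("no_reranker.json")  # run with --disable-reranker
for pack in sorted(set(baseline) & set(ablated)):
    delta = ablated[pack] - baseline[pack]
    print(f"{pack:<25} baseline={baseline[pack]:.2f}  "
          f"ablated={ablated[pack]:.2f}  delta={delta:+.2f}")
```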
Troubleshooting¶
"No questions found for pack"¶
Ensure eval/questions.jsonl exists in the pack directory:
ls data/packs/<pack-name>/eval/questions.jsonl
If missing, generate questions:
uv run python scripts/generate_eval_questions.py --pack <pack-name> --count 50
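To sanity-check an existing file, the snippet below verifies that every line parses as JSON and carries the fields the judge needs. Field names follow the judge description above (`ground_truth`; `question` is assumed), and the real schema may include additional fields.

```python
# Validate questions.jsonl: one JSON object per line, with the expected fields.
import json
from pathlib import Path

path = Path("data/packs/go-expert/eval/questions.jsonl")  # adjust the pack name
rows = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
bad = [i for i, row in enumerate(rows, start=1)
       if "question" not in row or "ground_truth" not in row]
print(f"{len(rows)} questions, {len(bad)} lines missing required fields: {bad}")
```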
"ANTHROPIC_API_KEY not set"¶
export ANTHROPIC_API_KEY="sk-ant-..."
High Variance Between Runs¶
Small sample sizes produce unreliable results. Increase the sample:
# Use 25+ questions for stable results
uv run python scripts/eval_single_pack.py go-expert --sample 25
Negative Delta on a Pack¶
See Improving Accuracy for diagnosis and remediation steps. Common causes:
- Pack content is outdated or incorrect
- Evaluation questions test training knowledge, not pack-specific knowledge
- Source URLs have thin content
See Also¶
- Evaluation Methodology -- Understanding what we measure and how
- Evaluation Results -- Current results across all packs
- Improving Accuracy -- Strategies for improving pack quality