Evaluation Methodology¶
This page explains how Knowledge Packs are evaluated: what we measure, how we measure it, and how to interpret the results.
What We Measure¶
The evaluation framework answers one question: Does the pack improve answer quality compared to Claude answering from training alone?
We measure three metrics:
| Metric | Definition | Calculation |
|---|---|---|
| Accuracy | Percentage of answers scored >= 7 out of 10 | count(score >= 7) / total_questions * 100 |
| Average Score | Mean judge score across all questions | sum(scores) / total_questions |
| Delta | Accuracy difference between Pack and Training | pack_accuracy - training_accuracy |
A score of 7+ indicates the answer is substantially correct and useful. Scores below 7 indicate significant errors, omissions, or hallucinations.
The Two Conditions¶
Every pack is evaluated under two conditions, using the same set of questions:
1. Training Baseline¶
Claude Opus answers the question with no pack context -- purely from its training data. This establishes what the model already knows about the domain.
Prompt:
Answer the following question:
Q: {question}
2. Pack (KG Agent)¶
The KG Agent retrieves content from the pack database with the full retrieval pipeline enabled: confidence gating, graph reranking, multi-document synthesis, content quality scoring, and few-shot examples. This tests whether the pack adds value beyond training.
Pipeline:
question → vector search → confidence gate →
graph rerank → multi-doc expand →
quality filter → few-shot inject → synthesis
The Judge Model¶
Answers are scored by a judge model (Claude Opus claude-opus-4-6) on a 0-10 scale.
Scoring Prompt¶
Score 0-10.
Q: {question}
Expected: {ground_truth}
Actual: {answer}
Number only.
The judge compares the generated answer against the ground_truth from the evaluation question set and returns a single integer.
Judge Model Configuration¶
Both evaluation scripts use Claude Opus (claude-opus-4-6) as the judge model. Opus provides more accurate, nuanced scoring than smaller models — it evaluates semantic correctness rather than surface-level keyword matching.
| Consideration | Decision |
|---|---|
| Accuracy | Opus provides the most accurate relevance judgments |
| Consistency | Produces stable, reproducible scores at temperature 0 |
| Fewer false negatives | Evaluates semantic correctness, not just keyword overlap |
| Simplicity | Single-number output avoids parsing complexity |
Scoring Rubric¶
The judge model assigns scores based on correctness relative to the ground truth:
| Score Range | Meaning |
|---|---|
| 9-10 | Fully correct, comprehensive, may exceed ground truth |
| 7-8 | Substantially correct with minor gaps or imprecisions |
| 5-6 | Partially correct -- some key facts present, others missing |
| 3-4 | Mostly incorrect with some relevant information |
| 1-2 | Severely wrong, major hallucinations or misunderstandings |
| 0 | Completely irrelevant or no answer |
The accuracy threshold of 7 captures answers that are "good enough to be useful" while excluding answers with significant errors.
Answer Model¶
The model that generates answers for evaluation is Claude Opus (claude-opus-4-6). This is the same model used in production, ensuring evaluation results reflect real-world performance.
Question Format¶
Evaluation questions are stored in eval/questions.jsonl (one JSON object per line):
{"id": "ge_001", "domain": "go_expert", "difficulty": "easy", "question": "What does slices.Contains do?", "ground_truth": "slices.Contains reports whether v is present in s. E must satisfy the comparable constraint.", "source": "slices_stdlib"}
Required Fields¶
| Field | Type | Description |
|---|---|---|
id |
string | Unique identifier with pack prefix (e.g., ge_001, re_015) |
domain |
string | Snake-case domain name (e.g., go_expert) |
difficulty |
string | One of easy, medium, hard |
question |
string | The question text posed to both conditions |
ground_truth |
string | Expected correct answer used by the judge for scoring |
source |
string | Topic slug identifying the content area |
Difficulty Distribution¶
For a standard 50-question pack:
| Difficulty | Count | Purpose |
|---|---|---|
easy |
20 | Basic concepts the pack should definitely cover |
medium |
20 | Moderate complexity requiring some depth |
hard |
10 | Advanced topics, edge cases, implementation details |
Question Design Principles¶
Good questions test pack-specific knowledge:
- Use exact API names, types, and terminology from the documentation
- Target recent features or version-specific behavior
- Require depth that goes beyond general awareness
Bad questions test general knowledge:
- "What is a goroutine?" (Claude already knows this perfectly)
- Generic terminology instead of specific API names
- Topics covered extensively in training data
Training data overlap
If training baseline accuracy is already 100% on a question set, the questions are testing knowledge Claude already has. The pack cannot demonstrate value. Replace generic questions with pack-specific ones targeting current documentation content.
Statistical Considerations¶
Sample Size¶
The number of questions affects result reliability:
| Questions | Confidence | Typical Variance | Cost (all packs) |
|---|---|---|---|
| 5 | Low | +/- 2 points | ~$0.15 |
| 10 | Moderate | +/- 1.5 points | ~$0.30 |
| 25 | Good | +/- 1 point | ~$0.75 |
| 50 | High | +/- 0.5 points | ~$1.50 |
For production evaluation, use at least 25 questions per pack. For quick iteration, 5-10 questions provide directional signal.
Score Distributions¶
Individual question scores are integers from 0-10. Because both the answer model and judge model are deterministic at temperature 0, scores are reproducible across runs.
However, scores can vary if:
- The answer model is updated (different Claude version)
- The pack database is rebuilt (different content)
- Questions are modified
Interpreting Delta¶
The delta (Pack accuracy - Training accuracy) is the key metric:
| Delta | Interpretation | Action |
|---|---|---|
| +5pp or more | Strong pack value | Pack clearly helps; deploy |
| +1pp to +5pp | Moderate value | Pack helps on some questions; look for improvement areas |
| 0pp | Neutral | Pack matches training; may still provide provenance value |
| -1pp to -5pp | Slight regression | Investigate retrieval quality and question calibration |
| Below -5pp | Significant regression | Content quality issues; review URLs and rebuild |
Positive delta is not always required
A pack can provide value even with a zero delta if:
- It provides source attribution (provenance) that training cannot
- It handles questions about very recent content not yet in training
- It works offline without internet access
Running Evaluations¶
Single Pack¶
# Quick check (5 questions, ~$0.03)
uv run python scripts/eval_single_pack.py go-expert --sample 5
# Full evaluation (all questions)
uv run python scripts/eval_single_pack.py go-expert
All Packs¶
# Sample across all packs (10 questions each, ~$0.30)
uv run python scripts/run_all_packs_evaluation.py --sample 10
# Full evaluation
uv run python scripts/run_all_packs_evaluation.py
Output¶
Results are saved to data/packs/all_packs_evaluation.json:
{
"timestamp": "2026-03-01T17:56:44Z",
"packs_evaluated": 8,
"sample_per_pack": 10,
"grand_summary": {
"training": {"avg": 8.875, "accuracy": 96.25, "n": 80},
"pack": {"avg": 9.05, "accuracy": 97.5, "n": 80}
},
"per_pack": {
"go-expert": {
"scores": {
"training": [5, 9, 10, 10, 9, 10, 7, 9, 9, 9],
"pack": [9, 10, 10, 9, 9, 10, 9, 10, 10, 10]
}
}
}
}
Evaluation Architecture¶
questions.jsonl
│
├──► Training condition
│ Claude Opus answers without context
│ │
│ ▼
│ Judge (configurable) scores 0-10 vs ground_truth
│
└──► Pack condition
KG Agent retrieves + synthesizes (full pipeline)
│
▼
Judge (configurable) scores 0-10 vs ground_truth
Results:
per-question scores
per-condition averages and accuracy
delta (pack - training)