Evaluation Methodology

This page explains how Knowledge Packs are evaluated: what we measure, how we measure it, and how to interpret the results.

What We Measure

The evaluation framework answers one question: Does the pack improve answer quality compared to Claude answering from training alone?

We measure three metrics:

Metric        | Definition                                     | Calculation
Accuracy      | Percentage of answers scored >= 7 out of 10    | count(score >= 7) / total_questions * 100
Average Score | Mean judge score across all questions          | sum(scores) / total_questions
Delta         | Accuracy difference between Pack and Training  | pack_accuracy - training_accuracy

A score of 7+ indicates the answer is substantially correct and useful. Scores below 7 indicate significant errors, omissions, or hallucinations.
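
Concretely, the metrics reduce to a few lines of arithmetic. The sketch below is illustrative rather than taken from the evaluation scripts; the example score lists are the go-expert results shown in the Output section further down.

def summarize(scores: list[int], threshold: int = 7) -> dict:
    """Accuracy (% of scores >= threshold) and average score for one condition."""
    return {
        "accuracy": sum(s >= threshold for s in scores) / len(scores) * 100,
        "avg": sum(scores) / len(scores),
    }

training = summarize([5, 9, 10, 10, 9, 10, 7, 9, 9, 9])   # accuracy 90.0, avg 8.7
pack = summarize([9, 10, 10, 9, 9, 10, 9, 10, 10, 10])    # accuracy 100.0, avg 9.6
delta = pack["accuracy"] - training["accuracy"]            # +10.0 percentage points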

The Two Conditions

Every pack is evaluated under two conditions, using the same set of questions:

1. Training Baseline

Claude Opus answers the question with no pack context -- purely from its training data. This establishes what the model already knows about the domain.

Prompt:
  Answer the following question:
  Q: {question}

2. Pack (KG Agent)

The KG Agent retrieves content from the pack database with the full retrieval pipeline enabled: confidence gating, graph reranking, multi-document synthesis, content quality scoring, and few-shot examples. This tests whether the pack adds value beyond training.

Pipeline:
  question → vector search → confidence gate →
  graph rerank → multi-doc expand →
  quality filter → few-shot inject → synthesis
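
To make the two conditions concrete, here is a rough sketch of how they might be invoked. The Anthropic client call mirrors the baseline prompt above; the kg_agent object and its answer method are illustrative placeholders, not the real KG Agent API.

import anthropic

client = anthropic.Anthropic()

def training_baseline(question: str) -> str:
    """Condition 1: Claude Opus answers with no pack context."""
    resp = client.messages.create(
        model="claude-opus-4-6",  # answer model named on this page
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Answer the following question:\nQ: {question}"}],
    )
    return resp.content[0].text

def pack_condition(question: str, kg_agent) -> str:
    """Condition 2: the KG Agent runs the full retrieval pipeline
    (confidence gate, graph rerank, multi-doc expand, quality filter,
    few-shot inject) before synthesizing an answer."""
    return kg_agent.answer(question)  # hypothetical interface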

The Judge Model

Answers are scored by a judge model, Claude Opus (claude-opus-4-6), on a 0-10 scale.

Scoring Prompt

Score 0-10.
Q: {question}
Expected: {ground_truth}
Actual: {answer}
Number only.

The judge compares the generated answer against the ground_truth from the evaluation question set and returns a single integer.
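
A minimal sketch of the judging step, assuming direct use of the Anthropic SDK (the evaluation scripts may wrap this differently). The prompt string is the scoring prompt shown above, and the integer is extracted defensively in case the judge returns surrounding text.

import re
import anthropic

client = anthropic.Anthropic()

def judge_answer(question: str, ground_truth: str, answer: str) -> int:
    """Score an answer 0-10 against the ground truth."""
    prompt = (
        "Score 0-10.\n"
        f"Q: {question}\n"
        f"Expected: {ground_truth}\n"
        f"Actual: {answer}\n"
        "Number only."
    )
    resp = client.messages.create(
        model="claude-opus-4-6",  # judge model named on this page
        max_tokens=8,
        temperature=0,            # keep scoring as reproducible as possible
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\d+", resp.content[0].text)
    return min(int(match.group()), 10) if match else 0  # clamp stray values to the 0-10 scale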

Judge Model Configuration

Both evaluation scripts use Claude Opus (claude-opus-4-6) as the judge model. Opus provides more accurate, nuanced scoring than smaller models — it evaluates semantic correctness rather than surface-level keyword matching.

Consideration         | Rationale
Accuracy              | Opus provides the most accurate relevance judgments
Consistency           | Produces stable, reproducible scores at temperature 0
Fewer false negatives | Evaluates semantic correctness, not just keyword overlap
Simplicity            | Single-number output avoids parsing complexity

Scoring Rubric

The judge model assigns scores based on correctness relative to the ground truth:

Score Range | Meaning
9-10        | Fully correct, comprehensive, may exceed ground truth
7-8         | Substantially correct with minor gaps or imprecisions
5-6         | Partially correct -- some key facts present, others missing
3-4         | Mostly incorrect with some relevant information
1-2         | Severely wrong, major hallucinations or misunderstandings
0           | Completely irrelevant or no answer

The accuracy threshold of 7 captures answers that are "good enough to be useful" while excluding answers with significant errors.

Answer Model

The model that generates answers for evaluation is Claude Opus (claude-opus-4-6). This is the same model used in production, ensuring evaluation results reflect real-world performance.

Question Format

Evaluation questions are stored in eval/questions.jsonl (one JSON object per line):

{"id": "ge_001", "domain": "go_expert", "difficulty": "easy", "question": "What does slices.Contains do?", "ground_truth": "slices.Contains reports whether v is present in s. E must satisfy the comparable constraint.", "source": "slices_stdlib"}

Required Fields

Field        | Type   | Description
id           | string | Unique identifier with pack prefix (e.g., ge_001, re_015)
domain       | string | Snake-case domain name (e.g., go_expert)
difficulty   | string | One of easy, medium, hard
question     | string | The question text posed to both conditions
ground_truth | string | Expected correct answer used by the judge for scoring
source       | string | Topic slug identifying the content area
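
As an illustration (not one of the shipped scripts), a loader that enforces these fields might look like this:

import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "domain", "difficulty", "question", "ground_truth", "source"}
DIFFICULTIES = {"easy", "medium", "hard"}

def load_questions(path: str = "eval/questions.jsonl") -> list[dict]:
    """Read questions.jsonl and verify each line carries the required fields."""
    questions = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue
        q = json.loads(line)
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if q["difficulty"] not in DIFFICULTIES:
            raise ValueError(f"line {line_no}: unknown difficulty {q['difficulty']!r}")
        questions.append(q)
    return questions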

Difficulty Distribution

For a standard 50-question pack:

Difficulty | Count | Purpose
easy       | 20    | Basic concepts the pack should definitely cover
medium     | 20    | Moderate complexity requiring some depth
hard       | 10    | Advanced topics, edge cases, implementation details

Question Design Principles

Good questions test pack-specific knowledge:

  • Use exact API names, types, and terminology from the documentation
  • Target recent features or version-specific behavior
  • Require depth that goes beyond general awareness

Bad questions test general knowledge:

  • "What is a goroutine?" (Claude already knows this perfectly)
  • Generic terminology instead of specific API names
  • Topics covered extensively in training data

Training data overlap

If training baseline accuracy is already 100% on a question set, the questions are testing knowledge Claude already has. The pack cannot demonstrate value. Replace generic questions with pack-specific ones targeting current documentation content.

Statistical Considerations

Sample Size

The number of questions affects result reliability:

Questions | Confidence | Typical Variance | Cost (all packs)
5         | Low        | +/- 2 points     | ~$0.15
10        | Moderate   | +/- 1.5 points   | ~$0.30
25        | Good       | +/- 1 point      | ~$0.75
50        | High       | +/- 0.5 points   | ~$1.50

For production evaluation, use at least 25 questions per pack. For quick iteration, 5-10 questions provide directional signal.

Score Distributions

Individual question scores are integers from 0-10. Because both the answer model and judge model run at temperature 0, scores are largely reproducible across runs.

However, scores can vary if:

  • The answer model is updated (different Claude version)
  • The pack database is rebuilt (different content)
  • Questions are modified

Interpreting Delta

The delta (Pack accuracy - Training accuracy) is the key metric:

Delta        | Interpretation         | Action
+5pp or more | Strong pack value      | Pack clearly helps; deploy
+1pp to +5pp | Moderate value         | Pack helps on some questions; look for improvement areas
0pp          | Neutral                | Pack matches training; may still provide provenance value
-1pp to -5pp | Slight regression      | Investigate retrieval quality and question calibration
Below -5pp   | Significant regression | Content quality issues; review URLs and rebuild
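
The banding can be expressed as a small helper. The exact edge handling (for example, whether +0.5pp counts as neutral) is an assumption, since the table leaves the boundaries open:

def interpret_delta(delta_pp: float) -> str:
    """Map a pack-minus-training accuracy delta (percentage points) to the bands above."""
    if delta_pp >= 5:
        return "strong pack value"
    if delta_pp >= 1:
        return "moderate value"
    if delta_pp > -1:
        return "neutral"
    if delta_pp >= -5:
        return "slight regression"
    return "significant regression"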

Positive delta is not always required

A pack can provide value even with a zero delta if:

  • It provides source attribution (provenance) that training cannot
  • It handles questions about very recent content not yet in training
  • It works offline without internet access

Running Evaluations

Single Pack

# Quick check (5 questions, ~$0.03)
uv run python scripts/eval_single_pack.py go-expert --sample 5

# Full evaluation (all questions)
uv run python scripts/eval_single_pack.py go-expert

All Packs

# Sample across all packs (10 questions each, ~$0.30)
uv run python scripts/run_all_packs_evaluation.py --sample 10

# Full evaluation
uv run python scripts/run_all_packs_evaluation.py

Output

Results are saved to data/packs/all_packs_evaluation.json:

{
  "timestamp": "2026-03-01T17:56:44Z",
  "packs_evaluated": 8,
  "sample_per_pack": 10,
  "grand_summary": {
    "training": {"avg": 8.875, "accuracy": 96.25, "n": 80},
    "pack": {"avg": 9.05, "accuracy": 97.5, "n": 80}
  },
  "per_pack": {
    "go-expert": {
      "scores": {
        "training": [5, 9, 10, 10, 9, 10, 7, 9, 9, 9],
        "pack": [9, 10, 10, 9, 9, 10, 9, 10, 10, 10]
      }
    }
  }
}
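
The per-pack score lists make it straightforward to recompute accuracy and delta offline; a sketch assuming the structure above:

import json

with open("data/packs/all_packs_evaluation.json") as f:
    results = json.load(f)

# Recompute accuracy per condition with the same >= 7 threshold used above.
for pack, data in results["per_pack"].items():
    acc = {cond: sum(s >= 7 for s in scores) / len(scores) * 100
           for cond, scores in data["scores"].items()}
    delta = acc["pack"] - acc["training"]
    print(f"{pack}: training {acc['training']:.1f}%, pack {acc['pack']:.1f}%, delta {delta:+.1f}pp")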

Evaluation Architecture

questions.jsonl
  │
  ├──► Training condition
  │    Claude Opus answers without context
  │    │
  │    ▼
  │    Judge (configurable) scores 0-10 vs ground_truth
  │
  └──► Pack condition
       KG Agent retrieves + synthesizes (full pipeline)
       │
       ▼
       Judge (configurable) scores 0-10 vs ground_truth

Results:
  per-question scores
  per-condition averages and accuracy
  delta (pack - training)