Eval System Architecture¶
Comprehensive guide to the evaluation and self-improvement infrastructure for goal-seeking agents. Covers the progressive test suite (L1-L12), long-horizon memory testing with 12 information blocks (including security domain), 5-way matrix evaluation across SDKs, self-improvement loop with patch proposer + reviewer voting, and the standalone amplihack-agent-eval package.
Overview¶
The eval system is a multi-layered framework that tests agent learning and reasoning capabilities across 12 progressively harder levels. It supports 4 SDK implementations, includes a self-improvement loop with patch proposer and reviewer voting, and provides domain-specific evaluation for 5 specialized agents.
Current best scores (as of 2026-02-28):
Progressive test suite (3-run median, mini SDK):
| Level | Median | Description |
|---|---|---|
| L1 | 83% | Single source direct recall |
| L2 | 100% | Multi-source synthesis |
| L3 | 99.8% | Temporal reasoning |
| L4 | 79% | Procedural learning |
| L5 | 95% | Contradiction handling |
| L6 | 100% | Incremental learning |
| L7 | 84% | Teacher-student transfer |
| Overall | 97.8% | Weighted median across all levels |
Long-horizon memory evaluation: 98.9% at 1000 turns.
Recent improvements (2026-02-28):

- Overall accuracy: 96.0% → 97.8% (+1.8%)
- temporal_evolution: 86.6% → 99.8% (+13.2%)
- security_log_analysis: 88% → 100% (+12%)
- 9 categories now at 100% accuracy
+------------------------------------------------------------------+
| EVALUATION ENTRY POINTS |
+------------------------------------------------------------------+
| |
| progressive_test_suite.py sdk_eval_loop.py run_domain_evals |
| (L1-L12 single/parallel) (multi-SDK loop) (5 domain agents) |
| |
| self_improve/runner.py long_horizon_memory.py |
| (closed-loop improvement) (1000-turn stress test, 12 blocks) |
| |
| long_horizon_self_improve.py |
| (long-horizon self-improvement with patch proposer + voting) |
| |
| matrix_eval.py meta_eval_experiment.py |
| (5-way SDK comparison) (self-referential eval) |
| |
+------------------------------------------------------------------+
| |
v v
+------------------------------------------------------------------+
| CORE EVAL PIPELINE |
+------------------------------------------------------------------+
| |
| 1. DATA LAYER 2. AGENT LAYER |
| +-----------------+ +------------------------+ |
| | test_levels.py | | agent_subprocess.py | |
| | (L1-L12 defs) | | (subprocess isolation) | |
| | - TestArticle |------->| - learning phase | |
| | - TestQuestion | | - testing phase | |
| | - TestLevel | | - SDK routing | |
| +-----------------+ +------------------------+ |
| | long_horizon_ | +------------------------+ |
| | data.py | | teaching_subprocess.py | |
| | (12-block gen) | | (teacher-student) | |
| +-----------------+ +------------------------+ |
| | |
| 3. GRADING LAYER v |
| +---------------------+ +------------------------+ |
| | grader.py | | metacognition_grader.py| |
| | (LLM semantic grade) | | (4-dimension scoring) | |
| | - multi-vote | | - effort calibration | |
| | - level-specific | | - sufficiency judgment | |
| | rubrics (L3, L5, | | - search quality | |
| | L9, L11, L12) | | - self correction | |
| +---------------------+ +------------------------+ |
| |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| ANALYSIS & IMPROVEMENT |
+------------------------------------------------------------------+
| |
| self_improve/error_analyzer.py (10 failure modes) |
| self_improve/patch_proposer.py (LLM-generated diffs) |
| self_improve/reviewer_voting.py (3-perspective review) |
| +----------------------------------------------------------+ |
| | EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE -> | |
| | APPLY -> RE-EVAL -> DECIDE | |
| | | |
| | Patch proposer: hypothesis + diff + expected impact | |
| | Reviewer voting: quality/regression/simplicity perspectives | |
| | Challenge phase: devil's advocate arguments + defense | |
| | Auto-revert on regression > 5% | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Test Levels (L1-L12)¶
Each level represents a specific cognitive capability:
| Level | Name | Capability Tested | Key Challenge |
|---|---|---|---|
| L1 | Single Source Recall | Direct fact retrieval from one source | Baseline accuracy |
| L2 | Multi-Source Synthesis | Combining info from multiple sources | Cross-source counting |
| L3 | Temporal Reasoning | Understanding changes over time | Arithmetic on temporal data |
| L4 | Procedural Learning | Learning and applying step-by-step procedures | Sequence ordering |
| L5 | Contradiction Handling | Detecting conflicting information | Conflict acknowledgment |
| L6 | Incremental Learning | Updating knowledge with new info | Superseding old facts |
| L7 | Teacher-Student Transfer | Teaching knowledge to another agent | Multi-turn dialogue |
| L8 | Metacognition | Awareness of own reasoning quality | Effort calibration |
| L9 | Causal Reasoning | Identifying cause-and-effect chains | Root cause identification |
| L10 | Counterfactual Reasoning | "What if" hypothetical analysis | Hypothetical scenarios |
| L11 | Novel Skill Acquisition | Learning new task formats from examples | Config generation |
| L12 | Far Transfer | Applying learned patterns to new domains | Cross-domain application |
Level Data Structure¶
Defined in src/amplihack/eval/test_levels.py:
@dataclass
class TestLevel:
    level_id: str                    # "L1", "L2", etc.
    level_name: str                  # Human-readable name
    description: str                 # What this level tests
    articles: list[TestArticle]      # Source material to learn
    questions: list[TestQuestion]    # Questions to answer
    requires_temporal_ordering: bool = False
    requires_update_handling: bool = False
Each level provides its own articles and questions. L6 uses phased articles (initial + update) to test incremental learning. L7 uses a teaching session instead of direct learning/testing.
Why 2026 Winter Olympics? The test content uses synthetic data about the February 2026 Milan-Cortina Olympics -- a topic that post-dates LLM training cutoffs. This ensures the agent must actually learn from the provided sources rather than relying on pre-training knowledge.
Core Pipeline¶
1. Data Layer¶
- test_levels.py: Defines all 12 levels with articles and questions
- multi_source_collector.py: Collects and normalizes news articles
- quiz_generator.py: Generates quiz questions from articles
- long_horizon_data.py: Generates deterministic 1000-turn dialogues
2. Agent Layer¶
The eval system uses subprocess isolation to prevent state leakage between levels:
- agent_subprocess.py: Runs learning/testing phases in isolated subprocesses. Each level gets a unique agent name mapping to a separate memory database. Supports the --sdk flag for SDK routing (validates SDK agent creation, records SDK metadata in output). All SDKs use the same LearningAgent core for learning/answering.
- teaching_subprocess.py: Runs teacher-student dialogues for L7 evaluation. Teacher and student have separate memory databases; knowledge transfer happens only through conversation.
Memory isolation is critical: unique agent names like agent_L1_1708123456 ensure facts from L1 cannot leak into L2.
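The naming scheme can be sketched as a level ID plus a Unix timestamp; the helper name below is hypothetical, but the pattern matches the example above:

```python
import time

def isolated_agent_name(level_id: str) -> str:
    """Per-level agent name; each name maps to a separate memory database."""
    return f"agent_{level_id}_{int(time.time())}"

# Facts learned under "agent_L1_<timestamp>" are invisible to the
# L2 agent, so nothing can leak between levels.
name = isolated_agent_name("L1")
```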
Advanced Retrieval Methods (added 2026-02-28):
The LearningAgent now includes three retrieval strategies:
- Entity-Linked Retrieval (_entity_linked_retrieval): When questions contain structured entity IDs (INC-*, CVE-*, PROJ-*), retrieves all facts containing those IDs across ALL context tags. This addresses the problem where facts about a single entity are stored under different contexts (e.g., incident data in "incidents", "security_logs", and "post_mortems").
- Multi-Entity Retrieval (_multi_entity_retrieval): For multi-hop reasoning questions mentioning 2+ named entities, retrieves facts for EACH entity independently and merges results. This improves coverage for questions like "How did X affect Y?" where facts about X and Y are stored separately.
- Standard Context-Based Retrieval: Traditional semantic similarity search within the detected question context.
The agent automatically selects the most appropriate strategy based on question patterns. See EVAL_RETRIEVAL_REFERENCE.md for detailed API documentation.
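The selection logic can be approximated as a surface-pattern check; the function and regex below are a sketch, not the actual LearningAgent code:

```python
import re

# Structured entity-ID patterns mentioned above (INC-*, CVE-*, PROJ-*).
ENTITY_ID_RE = re.compile(r"\b(?:INC|CVE|PROJ)-[\w.-]+\b")

def choose_retrieval_strategy(question: str, named_entities: list[str]) -> str:
    """Pick a retrieval strategy from question patterns (illustrative)."""
    if ENTITY_ID_RE.search(question):
        return "entity_linked"   # IDs may span many context tags
    if len(named_entities) >= 2:
        return "multi_entity"    # retrieve per entity, merge results
    return "context_based"       # default semantic search

choose_retrieval_strategy("What caused INC-2041?", [])  # entity-ID path
```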
3. Grading Layer¶
grader.py: LLM-based semantic grading with level-specific rubrics. Uses Claude Sonnet 4.5 as the grading model. Supports multi-vote grading (--grader-votes N, recommended: 3) where each answer is graded N times independently and the median score is taken.
Level-specific grading criteria:

- L3: Numerical values are primary (correct numbers = at least 0.7); trend direction is secondary
- L5: Explicit conflict acknowledgment scoring rubric (4 tiers: 0.9-1.0, 0.6-0.8, 0.3-0.5, 0.0-0.2)
- L9: Accept multiple valid root causes if reasoning is sound
- L11: Grade on required fields; don't penalize extra optional fields
- L12: Direction of trend is critical for ratio computations
Deterministic Grading Improvements (2026-02-28):
The _deterministic_grade() function now implements a critical false negative fix: incorrect patterns are only penalized when correct keywords are NOT present. This prevents the grader from scoring 0% on answers like "budget increased from $1.2M to $1.4M" which contain both historical (incorrect) and current (correct) values. The fix improved temporal_evolution from 86.6% to 99.8%. See EVAL_GRADING_IMPROVEMENTS.md for implementation details.
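In outline, the fix orders the checks so that correct keywords short-circuit the penalty; this is an illustrative reduction, not the real _deterministic_grade() signature:

```python
def deterministic_score(answer: str, correct: list[str], incorrect: list[str]) -> float:
    """Sketch of the false-negative fix: incorrect patterns only
    penalize when no correct keyword is present."""
    text = answer.lower()
    if any(k.lower() in text for k in correct):
        return 1.0   # "$1.2M to $1.4M" keeps credit despite the stale value
    if any(k.lower() in text for k in incorrect):
        return 0.0   # only the superseded value is present
    return 0.5       # neither matched: defer to semantic grading

deterministic_score("budget increased from $1.2M to $1.4M", ["$1.4M"], ["$1.2M"])
```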
metacognition_grader.py: Grades reasoning traces on 4 dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Effort Calibration | 25% | Proportional effort to question difficulty |
| Sufficiency Judgment | 30% | Correct assessment of when enough info gathered |
| Search Quality | 25% | Ratio of useful results to total queries |
| Self Correction | 20% | Detects and fixes errors in reasoning |
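Given the weights above, the combined score is presumably a weighted sum over the four dimensions; a sketch:

```python
# Dimension weights from the table above.
WEIGHTS = {
    "effort_calibration": 0.25,
    "sufficiency_judgment": 0.30,
    "search_quality": 0.25,
    "self_correction": 0.20,
}

def metacognition_score(dims: dict[str, float]) -> float:
    """Weighted sum of the four per-dimension scores (each 0.0-1.0)."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

metacognition_score({
    "effort_calibration": 1.0,
    "sufficiency_judgment": 0.8,
    "search_quality": 0.9,
    "self_correction": 1.0,
})  # 0.25 + 0.24 + 0.225 + 0.20 = 0.915
```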
Evaluation Runners¶
Progressive Test Suite (progressive_test_suite.py)¶
The primary evaluation runner. Runs all levels sequentially with per-level memory isolation.
# Single run
python -m amplihack.eval.progressive_test_suite --sdk mini --levels L1 L2 L3
# Parallel runs (3-run median for stability)
python -m amplihack.eval.progressive_test_suite --runs 3 --sdk mini
# Multi-vote grading (3 votes per question)
python -m amplihack.eval.progressive_test_suite --grader-votes 3 --sdk mini
# Recommended: 3-run median + 3-vote grading
python -m amplihack.eval.progressive_test_suite --runs 3 --grader-votes 3 --sdk mini
# Advanced levels
python -m amplihack.eval.progressive_test_suite --advanced --sdk mini
# All levels
python -m amplihack.eval.progressive_test_suite \
--levels L1 L2 L3 L4 L5 L6 L8 L9 L10 L11 L12 --sdk mini
CLI options:
| Option | Description | Default |
|---|---|---|
| --output-dir | Directory for results | ./eval_progressive |
| --agent-name | Agent name (memory isolation) | progressive-test-agent |
| --levels | Specific levels (L1-L12) | L1-L6 |
| --advanced | Include L8-L10 | Off |
| --memory-backend | Memory backend | amplihack-memory-lib |
| --parallel N | Run N times, report medians | Off |
| --runs N | Alias for --parallel | Off |
| --sdk | SDK type: mini, claude, copilot, microsoft | mini |
| --grader-votes N | Grading votes per question (1=single, 3=majority) | 1 |
Key classes:
- ProgressiveConfig: SDK, levels, output dir, grader votes
- ProgressiveResult: Per-level results and overall scores
- ParallelResult: Median scores across multiple runs
SDK Eval Loop (sdk_eval_loop.py)¶
Runs improvement loops for one or more SDKs, tracking score progression and generating prompt tuning recommendations:
# Single SDK
python -m amplihack.eval.sdk_eval_loop --sdks mini --loops 5
# All SDKs comparison (4-way)
python -m amplihack.eval.sdk_eval_loop --all-sdks --loops 3
# Specific levels
python -m amplihack.eval.sdk_eval_loop --sdks mini claude --loops 3 --levels L1 L2 L3
Each loop iteration:
- Runs L1-L6 eval for the SDK
- Analyzes failures to identify weak levels
- Generates SDK-specific prompt tuning recommendations
- Re-evaluates to measure improvement
Output includes:
- Per-SDK score progression across iterations
- Failure analysis with recommendations
- Per-level comparison table
- Ranked SDK comparison by best overall score
CLI options:
| Option | Description | Default |
|---|---|---|
| --sdks | SDKs to evaluate (space-separated) | mini |
| --all-sdks | Evaluate all 4 SDKs | Off |
| --loops | Improvement loops per SDK | 5 |
| --levels | Levels to run | L1-L6 |
| --output-dir | Output directory | ./eval_sdk_loop |
Self-Improvement Runner (self_improve/runner.py)¶
Closed-loop improvement: EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE.
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 5
# With all options
python -m amplihack.eval.self_improve.runner \
--sdk mini \
--iterations 5 \
--improvement-threshold 2.0 \
--regression-tolerance 5.0 \
--levels L1 L2 L3 L4 L5 L6 \
--output-dir ./eval_results/self_improve \
--dry-run
6 phases per iteration:
- EVAL: Run the progressive test suite to get baseline scores.
- ANALYZE: Classify failures using the error taxonomy (10 failure modes). Maps each failure to the specific code component responsible.
- RESEARCH: For each proposed fix, state a hypothesis, gather evidence from eval results and failure patterns, consider counter-arguments (regression risk, stochasticity, cross-level impact), and make a reasoned decision: apply, skip, or defer. All research decisions are logged to research_decisions.json.
- IMPROVE: Apply approved changes. The patch proposer (self_improve/patch_proposer.py) generates specific code changes as unified diffs. Each PatchProposal includes: target_file, hypothesis, description, diff, expected_impact (per-category score delta), risk_assessment, and confidence (0.0-1.0). A PatchHistory is maintained to avoid re-proposing reverted/rejected patches. Proposals go through reviewer voting (self_improve/reviewer_voting.py), where three perspectives (quality, regression, simplicity) vote and the majority determines the outcome. A challenge phase forces the proposer to defend the change.
- RE-EVAL: Run the eval again on all levels to measure impact.
- DECIDE: Commit if net improvement >= +2% overall and no single level regresses > 5%. Revert if any level regresses beyond tolerance.
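The DECIDE rule maps onto a small function; scores below are percentages and the defaults mirror --improvement-threshold 2.0 and --regression-tolerance 5.0 (a sketch, not the runner's actual code):

```python
def decide(baseline: dict[str, float], after: dict[str, float],
           improvement_threshold: float = 2.0,
           regression_tolerance: float = 5.0) -> str:
    """Sketch of the DECIDE phase over per-level percentage scores."""
    # Revert if any single level regresses beyond tolerance.
    if any(after[lvl] < baseline[lvl] - regression_tolerance for lvl in baseline):
        return "revert"
    # Commit only on sufficient net (average) improvement.
    overall_delta = (sum(after.values()) - sum(baseline.values())) / len(baseline)
    return "commit" if overall_delta >= improvement_threshold else "defer"

decide({"L1": 83.0, "L3": 90.0}, {"L1": 88.0, "L3": 92.0})  # avg +3.5% -> "commit"
```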
CLI options:
| Option | Description | Default |
|---|---|---|
| --sdk | SDK type | mini |
| --iterations | Max improvement iterations | 5 |
| --improvement-threshold | Min % improvement to commit | 2.0 |
| --regression-tolerance | Max % regression on any level | 5.0 |
| --levels | Levels to evaluate | L1-L6 |
| --output-dir | Output directory | ./eval_results/self_improve |
| --agent-name | Agent name for memory isolation | self-improve-agent |
| --dry-run | Evaluate only, don't apply changes | Off |
Domain Agent Eval (domain_eval_harness.py)¶
Evaluates 5 domain-specific agents using the DomainEvalHarness:
from amplihack.eval.domain_eval_harness import DomainEvalHarness
from amplihack.agents.domain_agents.code_review.agent import CodeReviewAgent
harness = DomainEvalHarness(CodeReviewAgent("test"))
report = harness.run()
print(f"Overall: {report.overall_score:.0%}")
5 domain agents (defined in src/amplihack/agents/domain_agents/):
| Agent | Domain | Tools |
|---|---|---|
| CodeReviewAgent | Code quality, security, style | analyze_code, check_style, detect_security_issues, suggest_improvements |
| MeetingSynthesizerAgent | Meeting transcripts | extract_action_items, generate_summary, identify_decisions, identify_topics |
| DataAnalysisAgent | Dataset analysis | (domain-specific) |
| DocumentCreatorAgent | Documentation generation | (domain-specific) |
| ProjectPlanningAgent | Task breakdown and estimates | (domain-specific) |
Each agent provides its own eval levels with scenarios and rubrics. Domain eval uses deterministic grading (pattern matching, field checking) rather than LLM grading.
Combined scoring: Domain agents can use a combined score: 60% domain-specific eval + 40% teaching eval.
Long-Horizon Memory Eval (long_horizon_memory.py)¶
1000-turn memory stress test with 12 information blocks (including security domain):
# Quick test
python -m amplihack.eval.long_horizon_memory --turns 100 --questions 20
# Full stress test (best score: 98.9%)
python -m amplihack.eval.long_horizon_memory --turns 1000 --questions 100
# With SDK agent
python -m amplihack.eval.long_horizon_memory --sdk claude --turns 500 --questions 50
# Large-scale with subprocess segmentation (prevents OOM on 5000+ turns)
python -m amplihack.eval.long_horizon_memory --turns 5000 --questions 200 --segment-size 100
Subprocess segmentation (--segment-size N): For 5000+ turn evaluations, native memory from Kuzu C++ and aiohttp accumulates ~3MB/turn. The --segment-size flag splits the learning phase into subprocess segments of N turns each. Each subprocess loads its turn slice, learns it into the shared on-disk DB, then exits -- freeing all native memory. After all segments complete, the questioning phase runs against the accumulated DB. Default (--segment-size 0) runs everything in one process.
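The slicing itself is straightforward; a sketch of how --segment-size N could partition the learning phase (helper name illustrative):

```python
def segment_turns(num_turns: int, segment_size: int) -> list[range]:
    """Split the learning phase into per-subprocess turn slices.
    segment_size == 0 keeps everything in one process."""
    if segment_size <= 0:
        return [range(0, num_turns)]
    return [range(start, min(start + segment_size, num_turns))
            for start in range(0, num_turns, segment_size)]

segments = segment_turns(5000, 100)  # 50 slices of 100 turns
# Each slice would be learned in a fresh subprocess that writes to the
# shared on-disk DB and exits, releasing native (Kuzu/aiohttp) memory.
```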
12 information blocks with proportional turn allocation:
| Block | Name | % Turns | What It Tests | Recent Score |
|---|---|---|---|---|
| 1 | People | 5% | Personal details, relationships | ~95% |
| 2 | Projects | 10% | Project metadata with evolving updates | ~98% |
| 3 | Technical | 10% | Domain-specific knowledge facts | ~97% |
| 4 | Evolving Story | 15% | Narrative with corrections over time | 99.8% |
| 5 | Numerical | 10% | Exact number recall and arithmetic | ~96% |
| 6 | Contradictory | 8% | Conflicting sources with different claims | ~94% |
| 7 | Callbacks | 6% | References to facts from earlier blocks | ~97% |
| 8 | Distractors | 6% | Irrelevant noise the agent must resist | ~98% |
| 9 | Security Logs | 10% | CVEs, access logs, authentication events | 100% |
| 10 | Incidents | 8% | Incident reports, post-mortems, RCAs | ~95% |
| 11 | Infrastructure | 7% | Server inventory, network configuration | ~97% |
| 12 | Problem Solving | 5% | Multi-step reasoning tasks (code gen) | ~96% |
Blocks 9-12 (security domain) together account for 30% of turns.
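Using the percentages from the table, turn allocation for a run can be sketched as follows (block keys and rounding policy are illustrative):

```python
# Per-block turn percentages from the table above (sums to 100).
BLOCK_PERCENTS = {
    "people": 5, "projects": 10, "technical": 10, "evolving_story": 15,
    "numerical": 10, "contradictory": 8, "callbacks": 6, "distractors": 6,
    "security_logs": 10, "incidents": 8, "infrastructure": 7, "problem_solving": 5,
}

def allocate_turns(total_turns: int) -> dict[str, int]:
    """Proportional allocation; rounding remainder goes to the largest block."""
    alloc = {b: total_turns * p // 100 for b, p in BLOCK_PERCENTS.items()}
    alloc["evolving_story"] += total_turns - sum(alloc.values())
    return alloc

allocate_turns(1000)["security_logs"]  # 100 turns for Block 9 at 1000 turns
```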
Security Domain Improvements (2026-02-28):
Block 9 (Security Logs) now achieves 100% accuracy due to:

- Entity-linked retrieval for CVE IDs and incident references
- Improved conflict detection for superseded vulnerabilities
- Better aggregation of security event chains across contexts
Data generation is deterministic from long_horizon_data.py (same seed = same dialogue). When the amplihack-agent-eval package is installed, data types are imported from the package.
Tests memory at scale with:
- Deterministic dialogue generation (reproducible with seed)
- 12 topic domains covering both narrative and structured operational data
- Questions spanning different recency windows and complexity levels
- LLM-graded scoring on 5 dimensions per question:
- Factual accuracy
- Completeness
- Recency (uses most recent info)
- Source attribution
- Coherence
Matrix Eval Across SDKs (matrix_eval.py)¶
5-way agent comparison using the long-horizon eval:
# Full 5-way matrix
python -m amplihack.eval.matrix_eval --turns 500 --questions 50
# Specific agents
python -m amplihack.eval.matrix_eval --agents mini claude --turns 100 --questions 20
# With markdown report
python -m amplihack.eval.matrix_eval --turns 500 --questions 50 \
--report-path Specs/MATRIX_EVAL_REPORT.md
5 agent configurations:
| Name | SDK | Multi-Agent | Spawning | What It Tests |
|---|---|---|---|---|
| mini | mini | No | No | LearningAgent directly (no SDK wrapper) |
| claude | claude | No | No | ClaudeGoalSeekingAgent via SDK factory |
| copilot | copilot | No | No | CopilotGoalSeekingAgent via SDK factory |
| microsoft | microsoft | No | No | MicrosoftGoalSeekingAgent via SDK factory |
| multiagent-copilot | copilot | Yes | Yes | MultiAgentLearningAgent with spawning |
Key design: dialogue and questions are generated once and shared across all agents. Each agent uses a separate storage/DB path. SDK agents that fail to instantiate (missing SDK package) are skipped gracefully.
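The shared-ground-truth loop can be sketched as below; make_agent and grade are hypothetical stand-ins for the agent factory and grading steps:

```python
# Agent configurations from the table above.
AGENT_CONFIGS = ["mini", "claude", "copilot", "microsoft", "multiagent-copilot"]

def run_matrix(make_agent, dialogue, questions, grade):
    """Evaluate every config against ONE shared dialogue/question set."""
    results = {}
    for name in AGENT_CONFIGS:
        try:
            # Separate storage path per agent; same ground truth for all.
            agent = make_agent(name, db_path=f"./matrix_db/{name}")
        except ImportError:
            continue  # SDK package not installed: skip, don't fail the matrix
        for turn in dialogue:
            agent.learn(turn)
        results[name] = grade(agent, questions)
    return results
```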
Output: per-agent reports, shared ground truth, matrix comparison JSON, and a markdown report with category-by-category scores, best-performer-per-category, and overall ranking.
Long-Horizon Self-Improvement (long_horizon_self_improve.py)¶
Automated improvement loop targeting the long-horizon eval. Uses patch proposer and reviewer voting for safe iteration:
python -m amplihack.eval.long_horizon_self_improve \
--turns 100 --questions 20 --iterations 3
# With multi-agent architecture
python -m amplihack.eval.long_horizon_self_improve \
--turns 100 --questions 20 --multi-agent
8-stage cycle per iteration:
- EVAL: Run long-horizon eval to get per-category scores.
- ANALYZE: Identify worst-performing category.
- PROPOSE: Patch proposer generates a unified diff with hypothesis, expected impact (per-category delta), risk assessment, and confidence score.
- CHALLENGE: Devil's advocate arguments against the proposed patch.
- VOTE: Three reviewer perspectives (quality, regression, simplicity) vote accept/reject/modify with rationale. Majority vote determines outcome.
- APPLY: If accepted, apply the patch and git commit.
- RE-EVAL: Run eval again to measure impact.
- DECIDE: If regression > 5% on any category, auto-revert via git. If net improvement >= 2%, keep. Otherwise, keep with marginal note.
The PatchHistory object tracks all applied, reverted, and rejected patches to prevent repeating failed fixes across iterations.
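A minimal sketch of that bookkeeping (field names assumed, not the actual PatchHistory API):

```python
from dataclasses import dataclass, field

@dataclass
class PatchHistory:
    """Track patch outcomes so failed fixes are not re-proposed."""
    applied: list[str] = field(default_factory=list)
    reverted: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)

    def already_tried(self, hypothesis: str) -> bool:
        # Proposer consults this before generating a new diff.
        return hypothesis in self.reverted or hypothesis in self.rejected

history = PatchHistory()
history.reverted.append("widen L3 retrieval window")
history.already_tried("widen L3 retrieval window")  # True: proposer skips it
```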
Meta-Eval Experiment (meta_eval_experiment.py)¶
Self-referential test: an agent learns about the eval system itself and teaches it to a student.
- Deterministic knowledge base (from the EVAL_KNOWLEDGE_BASE constant, derived from actual source code)
- Uses TeachingSession for multi-turn dialogue
- Uses MetacognitionGrader for student evaluation
- JSON report output
Error Analysis Taxonomy¶
The FAILURE_TAXONOMY in error_analyzer.py maps symptoms to root causes:
| Failure Mode | Description | Code Component |
|---|---|---|
| retrieval_insufficient | Not enough facts retrieved | agentic_loop.py::_plan_retrieval |
| temporal_ordering_wrong | Wrong temporal arithmetic | learning_agent.py::_synthesize_with_llm |
| intent_misclassification | Wrong intent detection | learning_agent.py::_detect_intent |
| fact_extraction_incomplete | Missing facts in memory | learning_agent.py::_extract_facts_with_llm |
| synthesis_hallucination | Invented information | learning_agent.py::_synthesize_with_llm |
| update_not_applied | Used outdated data | hierarchical_memory.py::_detect_supersedes |
| contradiction_undetected | Missed conflicting sources | learning_agent.py::_detect_intent + synthesis |
| procedural_ordering_lost | Steps out of sequence | learning_agent.py::_extract_facts_with_llm |
| teaching_coverage_gap | Student not taught key facts | teaching_session.py::_teacher_respond |
| counterfactual_refusal | Refused hypothetical reasoning | learning_agent.py::_synthesize_with_llm |
General Capability Evaluation¶
Beyond memory, the system evaluates 5 general-purpose capabilities via general_capability_eval.py:
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval tool_use,planning,reasoning
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval all --sdk mini
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval planning --output-dir /tmp/cap-eval
| Eval Type | What It Tests | Scenarios |
|---|---|---|
| Tool Use Efficiency | Correct tool selection, chaining, call economy | 4 |
| Planning | Multi-step task decomposition and execution | 3 |
| Reasoning Under Uncertainty | Handling incomplete/conflicting evidence | 3 |
| Cross-Domain Transfer | Applying learned patterns to new domains | 2 |
| Collaborative Task | Multi-agent delegation and synthesis | 2 |
Each eval type defines scenarios with gold-standard expectations, runs the agent through them, and grades using rubric-based LLM evaluation.
How to Add New Eval Levels¶
1. Define the test content in test_levels.py¶
L13_LEVEL = TestLevel(
    level_id="L13",
    level_name="Analogical Reasoning",
    description="Apply learned analogies to novel situations",
    articles=[
        {"title": "Source Domain", "content": "..."},
        {"title": "Target Domain", "content": "..."},
    ],
    questions=[
        {
            "question": "Based on the source domain pattern, what would happen in the target domain?",
            "expected_answer": "The expected analogical inference...",
            "level": "L13",
        },
    ],
)
2. Register the level¶
Add your level to the appropriate list in test_levels.py:
# For standard learn-then-test levels:
STANDARD_LEVELS = [..., L13_LEVEL]
# For levels needing special handling (like L7 teaching):
CUSTOM_LEVELS = [L13_LEVEL]
3. Add level-specific grading criteria (if needed)¶
In grader.py, add criteria to _build_grading_prompt():
elif level == "L13":
    level_criteria = (
        "\n\nL13 ANALOGICAL REASONING grading criteria:\n"
        "- Award 0.9-1.0 if the agent correctly maps the source pattern to the target.\n"
        "- Award 0.6-0.8 if the mapping is partially correct.\n"
    )
4. Add the level to CLI choices¶
In progressive_test_suite.py, add "L13" to the --levels choices list.
5. Run and validate¶
PYTHONPATH=src python -m amplihack.eval.progressive_test_suite \
--output-dir /tmp/eval_l13 \
--levels L13
How to Add New Domain Agents¶
1. Create the agent directory¶
src/amplihack/agents/domain_agents/my_domain/
    __init__.py
    agent.py           # DomainAgent subclass
    tools.py           # Domain-specific tool functions
    prompts/
        system.txt     # System prompt
2. Implement the agent¶
Subclass DomainAgent and implement: _register_tools(), execute_task(), get_eval_levels(), teach(), and get_system_prompt().
from amplihack.agents.domain_agents.base import DomainAgent, EvalLevel, TaskResult

class MyDomainAgent(DomainAgent):
    def __init__(self, agent_name="my_agent", model="gpt-4o-mini", skill_injector=None):
        super().__init__(agent_name=agent_name, domain="my_domain", model=model, skill_injector=skill_injector)

    def _register_tools(self):
        self.executor.register_action("my_tool", my_tool_function)

    def execute_task(self, task: dict) -> TaskResult:
        result = self.executor.execute("my_tool", **task)
        return TaskResult(success=True, output=result.output)

    def get_eval_levels(self) -> list[EvalLevel]:
        return [
            EvalLevel(
                level_id="L1",
                name="Basic Task",
                scenarios=[EvalScenario(...)],
                passing_threshold=0.7,
            ),
        ]

    def teach(self, topic, student_level="beginner"):
        return TeachingResult(...)

    def get_system_prompt(self) -> str:
        return "You are an expert in my domain."
3. Run the eval¶
from amplihack.eval.domain_eval_harness import DomainEvalHarness
harness = DomainEvalHarness(MyDomainAgent("test"))
report = harness.run()
print(f"Overall: {report.overall_score:.0%}")
Key Design Decisions¶
- Subprocess Isolation: Each eval level runs in a separate subprocess with its own memory database. This prevents cross-level contamination and makes results reproducible.
- LLM-Based Grading: Uses Claude Sonnet 4.5 as the grading model rather than exact-match scoring. This handles semantic equivalence (e.g., "26 medals" vs "twenty-six medals") and partial credit.
- Multi-Vote Grading: When --grader-votes 3 is set, each answer is graded 3 times independently and the median is taken. This reduces noise on ambiguous answers where a single grader call might score 0.4 or 0.9.
- 3-Run Medians: Single runs are unreliable due to LLM stochasticity. Running 3 times and taking median scores gives stable, reproducible results.
- Level-Specific Rubrics: Grading prompts include level-specific criteria because different cognitive tasks need different scoring rules (e.g., L3 temporal reasoning vs L5 contradiction detection).
- Research Step in Self-Improvement: Rather than blindly applying fixes, the runner requires hypothesis + evidence + counter-arguments before any change. This prevents regressions from overenthusiastic tuning.
- Error Taxonomy: Rather than just reporting scores, the error analyzer maps failures to specific code components. This makes the self-improvement loop actionable rather than merely diagnostic.
- Deterministic Data: Test data is generated deterministically with seeds, enabling meaningful comparison across runs and SDK versions.
- Per-SDK Prompt Tuning: Each SDK has dedicated eval prompt templates in src/amplihack/agents/goal_seeking/prompts/sdk/, allowing SDK-specific instruction tuning without modifying shared agent code.
Best Practices for Running Evals¶
- Use --runs 3 for production eval results. Single runs have high variance on levels L2, L5, L9, L10, and L11.
- Use --grader-votes 3 for final benchmarks. Multi-vote grading reduces noise on ambiguous answers.
- Use --sdk mini for development iteration (fastest). Switch to the specific SDK for final comparison.
- Check that ANTHROPIC_API_KEY is set before running. The grader requires it.
- Monitor output directories for disk usage. Each eval run generates JSON logs per level.
- Use --dry-run with the self-improvement runner to preview what changes would be made before applying them.
- Run --levels L1 L2 L3 to iterate quickly on specific levels rather than running all 12 every time.
Package Architecture (amplihack-agent-eval)¶
The evaluation framework has been extracted into a standalone package (amplihack-agent-eval) that can be used independently of the full amplihack codebase. The amplihack repo depends on this package for its core eval types and data generation, and provides amplihack-specific adapter implementations.
Installation¶
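Assuming the package is distributed under its stated name (availability on PyPI is not confirmed here), installation would presumably look like:

```shell
# Standalone package (name taken from this page; adjust the source if it
# is installed from a local checkout rather than an index).
pip install amplihack-agent-eval

# Inside the amplihack repo, the optional eval dependency described below
# would be pulled in via an extra (extra name assumed):
pip install -e ".[eval]"
```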
Package Relationship¶
amplihack-agent-eval (standalone) amplihack (this repo)
================================ ==========================
amplihack_eval/ src/amplihack/eval/
adapters/base.py <-- AgentAdapter agent_adapter.py (amplihack adapters)
core/runner.py <-- EvalRunner compat.py (re-exports)
core/grader.py <-- grade_answer long_horizon_memory.py (uses package)
data/long_horizon.py <-- data gen long_horizon_data.py (local copy)
self_improve/ self_improve/ (local copy)
patch_proposer.py patch_proposer.py
reviewer_voting.py reviewer_voting.py
How It Works¶
- amplihack-agent-eval provides the core eval types: AgentAdapter, AgentResponse, ToolCall, EvalRunner, grade_answer, GradeResult, data generation (Turn, Question, GroundTruth, GradingRubric, generate_dialogue, generate_questions), and self-improvement primitives (PatchProposal, propose_patch, ReviewVote, ReviewResult, vote_on_proposal).
- amplihack adds the optional eval dependency in pyproject.toml.
- long_horizon_memory.py imports data types from the package:

      from amplihack_eval.data.long_horizon import (
          GradingRubric, GroundTruth, Question,
          generate_dialogue, generate_questions,
      )

  If the package is not installed, an ImportError is raised with installation instructions.
- compat.py re-exports all key types from the standalone package, so existing code can from amplihack.eval.compat import EvalRunner. If the package is not installed, compat.py is a no-op (no errors).
- agent_adapter.py provides three adapter implementations that wrap amplihack agents for the package's AgentAdapter interface:
  - AmplihackLearningAgentAdapter -- wraps LearningAgent
  - AmplihackMultiAgentAdapter -- wraps MultiAgentLearningAgent
  - AmplihackSDKAgentAdapter -- wraps any SDK-backed GoalSeekingAgent
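The no-op compat behavior described above follows a standard optional-import pattern; a sketch (flag name illustrative):

```python
# Re-export from the standalone package when available; stay importable
# (with no errors) when it is not -- mirroring compat.py's behavior.
try:
    from amplihack_eval.core.runner import EvalRunner  # type: ignore
    HAS_EVAL_PACKAGE = True
except ImportError:
    EvalRunner = None  # callers check for None (or the flag) before use
    HAS_EVAL_PACKAGE = False
```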
Using the Adapters¶
from amplihack_eval.core.runner import EvalRunner
from amplihack.eval.agent_adapter import AmplihackLearningAgentAdapter
# Create adapter wrapping amplihack's LearningAgent
adapter = AmplihackLearningAgentAdapter("eval-agent", model="gpt-4o-mini")
# Run evaluation
runner = EvalRunner(num_turns=100, num_questions=20, seed=42)
report = runner.run(adapter)
print(f"Score: {report.overall_score:.2%}")
adapter.close()
File Map¶
src/amplihack/eval/
__init__.py # Public API exports
compat.py # Re-exports from amplihack-agent-eval package
agent_adapter.py # Amplihack-specific AgentAdapter implementations
test_levels.py # L1-L12 level definitions (articles + questions)
progressive_test_suite.py # Main runner (single + parallel + multi-vote)
agent_subprocess.py # Subprocess isolation for learning/testing (SDK routing)
teaching_subprocess.py # Subprocess for teacher-student dialogue
grader.py # LLM semantic grading with multi-vote
metacognition_grader.py # 4-dimension metacognition scoring
multi_source_collector.py # News article collection
quiz_generator.py # Quiz question generation
harness_runner.py # Original 4-level harness
sdk_eval_loop.py # Multi-SDK eval comparison loop
domain_eval_harness.py # Generic harness for domain agents
run_domain_evals.py # Run all 5 domain agent evals
teaching_session.py # Teacher-student session framework
teaching_eval.py # Teaching quality evaluation
long_horizon_data.py # Deterministic dialogue generation (12 blocks)
long_horizon_memory.py # 1000-turn stress test (uses eval package data types)
long_horizon_self_improve.py # Long-horizon self-improvement loop (8 phases)
matrix_eval.py # 5-way matrix evaluation across SDKs
meta_eval_experiment.py # Meta-eval experiment framework
five_agent_experiment.py # 5-agent experiment runner
QUICK_START.md # Quick start guide
self_improve/
__init__.py
error_analyzer.py # Failure taxonomy + classification (10 modes)
runner.py # Closed-loop self-improvement (6 phases)
patch_proposer.py # LLM-generated code patches with hypotheses
reviewer_voting.py # 3-perspective review + majority voting
src/amplihack/agents/goal_seeking/
prompts/sdk/ # Per-SDK eval prompt templates
copilot_eval.md
claude_eval.md
microsoft_eval.md
goal_seeking_system.md
learning_task.md
synthesis_template.md
teaching_system.md
src/amplihack/agents/domain_agents/
base.py # DomainAgent ABC
code_review/agent.py # Code Review agent
meeting_synthesizer/agent.py # Meeting Synthesizer agent
data_analysis/agent.py # Data Analysis agent
document_creator/agent.py # Document Creator agent
project_planning/agent.py # Project Planning agent
tests/eval/
test_eval_compat.py # Compat layer + agent adapter tests (32 tests)
test_grader.py # Grader unit tests
test_harness_runner.py # Harness runner tests
test_metacognition_grader.py # Metacognition grader tests
test_teaching_session.py # Teaching session tests
test_progressive_suite.py # Progressive suite tests
test_meta_eval_experiment.py # Meta-eval tests
test_multi_source_collector.py # Collector tests
test_quiz_generator.py # Quiz generator tests
test_long_horizon_memory.py # Long-horizon memory tests
Related Documentation¶
- Goal-Seeking Agents -- End-to-end guide: generation, capabilities, evaluation, self-improvement
- SDK Adapters Guide -- Deep dive into each SDK backend
- Quick Start -- Get running in 30 seconds
- Eval Grading Improvements -- Tutorial on fixing grader false negatives and implementing advanced retrieval (NEW: 2026-02-28)
- Eval Retrieval Reference -- API reference for entity-linked and multi-entity retrieval methods (NEW: 2026-02-28)