
Eval System Architecture

Comprehensive guide to the evaluation and self-improvement infrastructure for goal-seeking agents. Covers the progressive test suite (L1-L12), long-horizon memory testing with 12 information blocks (including security domain), 5-way matrix evaluation across SDKs, self-improvement loop with patch proposer + reviewer voting, and the standalone amplihack-agent-eval package.

Overview

The eval system is a multi-layered framework that tests agent learning and reasoning capabilities across 12 progressively harder levels. It supports 4 SDK implementations, includes a self-improvement loop with patch proposer and reviewer voting, and provides domain-specific evaluation for 5 specialized agents.

Current best scores (as of 2026-02-28):

Progressive test suite (3-run median, mini SDK):

Level    Median   Description
L1       83%      Single source direct recall
L2       100%     Multi-source synthesis
L3       99.8%    Temporal reasoning
L4       79%      Procedural learning
L5       95%      Contradiction handling
L6       100%     Incremental learning
L7       84%      Teacher-student transfer
Overall  97.8%    Weighted median across all levels

Long-horizon memory evaluation: 98.9% at 1000 turns.

Recent improvements (2026-02-28):

  • Overall accuracy: 96.0% → 97.8% (+1.8%)
  • temporal_evolution: 86.6% → 99.8% (+13.2%)
  • security_log_analysis: 88% → 100% (+12%)
  • 9 categories now at 100% accuracy

+------------------------------------------------------------------+
|                    EVALUATION ENTRY POINTS                         |
+------------------------------------------------------------------+
|                                                                    |
|  progressive_test_suite.py   sdk_eval_loop.py   run_domain_evals  |
|  (L1-L12 single/parallel)   (multi-SDK loop)    (5 domain agents) |
|                                                                    |
|  self_improve/runner.py      long_horizon_memory.py               |
|  (closed-loop improvement)   (1000-turn stress test, 12 blocks)   |
|                                                                    |
|  long_horizon_self_improve.py                                     |
|  (long-horizon self-improvement with patch proposer + voting)     |
|                                                                    |
|  matrix_eval.py              meta_eval_experiment.py              |
|  (5-way SDK comparison)      (self-referential eval)              |
|                                                                    |
+------------------------------------------------------------------+
                        |                |
                        v                v
+------------------------------------------------------------------+
|                    CORE EVAL PIPELINE                              |
+------------------------------------------------------------------+
|                                                                    |
|  1. DATA LAYER              2. AGENT LAYER                        |
|  +-----------------+        +------------------------+            |
|  | test_levels.py  |        | agent_subprocess.py    |            |
|  | (L1-L12 defs)   |        | (subprocess isolation) |            |
|  | - TestArticle   |------->| - learning phase       |            |
|  | - TestQuestion  |        | - testing phase        |            |
|  | - TestLevel      |        | - SDK routing          |            |
|  +-----------------+        +------------------------+            |
|  | long_horizon_   |        +------------------------+            |
|  | data.py         |        | teaching_subprocess.py |            |
|  | (12-block gen)  |        | (teacher-student)      |            |
|  +-----------------+        +------------------------+            |
|                                       |                            |
|  3. GRADING LAYER                     v                            |
|  +----------------------+   +------------------------+           |
|  | grader.py            |   | metacognition_grader.py|           |
|  | (LLM semantic grade) |   | (4-dimension scoring)  |           |
|  | - multi-vote         |   | - effort calibration   |           |
|  | - level-specific     |   | - sufficiency judgment |           |
|  |   rubrics (L3, L5,   |   | - search quality       |           |
|  |   L9, L11, L12)      |   | - self correction      |           |
|  +----------------------+   +------------------------+           |
|                                                                    |
+------------------------------------------------------------------+
                        |
                        v
+------------------------------------------------------------------+
|                    ANALYSIS & IMPROVEMENT                          |
+------------------------------------------------------------------+
|                                                                    |
|  self_improve/error_analyzer.py (10 failure modes)                |
|  self_improve/patch_proposer.py (LLM-generated diffs)            |
|  self_improve/reviewer_voting.py (3-perspective review)          |
|  +----------------------------------------------------------+    |
|  | EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE ->         |   |
|  |   APPLY -> RE-EVAL -> DECIDE                               |   |
|  |                                                             |   |
|  | Patch proposer: hypothesis + diff + expected impact         |   |
|  | Reviewer voting: quality/regression/simplicity perspectives |   |
|  | Challenge phase: devil's advocate arguments + defense       |   |
|  | Auto-revert on regression > 5%                              |   |
|  +----------------------------------------------------------+    |
|                                                                    |
+------------------------------------------------------------------+

Test Levels (L1-L12)

Each level represents a specific cognitive capability:

Level  Name                      Capability Tested                              Key Challenge
L1     Single Source Recall      Direct fact retrieval from one source          Baseline accuracy
L2     Multi-Source Synthesis    Combining info from multiple sources           Cross-source counting
L3     Temporal Reasoning        Understanding changes over time                Arithmetic on temporal data
L4     Procedural Learning       Learning and applying step-by-step procedures  Sequence ordering
L5     Contradiction Handling    Detecting conflicting information              Conflict acknowledgment
L6     Incremental Learning      Updating knowledge with new info               Superseding old facts
L7     Teacher-Student Transfer  Teaching knowledge to another agent            Multi-turn dialogue
L8     Metacognition             Awareness of own reasoning quality             Effort calibration
L9     Causal Reasoning          Identifying cause-and-effect chains            Root cause identification
L10    Counterfactual Reasoning  "What if" hypothetical analysis                Hypothetical scenarios
L11    Novel Skill Acquisition   Learning new task formats from examples        Config generation
L12    Far Transfer              Applying learned patterns to new domains       Cross-domain application

Level Data Structure

Defined in src/amplihack/eval/test_levels.py:

@dataclass
class TestLevel:
    level_id: str           # "L1", "L2", etc.
    level_name: str         # Human-readable name
    description: str        # What this level tests
    articles: list[TestArticle]    # Source material to learn
    questions: list[TestQuestion]  # Questions to answer
    requires_temporal_ordering: bool = False
    requires_update_handling: bool = False

Each level provides its own articles and questions. L6 uses phased articles (initial + update) to test incremental learning. L7 uses a teaching session instead of direct learning/testing.

Why 2026 Winter Olympics? The test content uses synthetic data about the February 2026 Milan-Cortina Olympics -- a topic that post-dates LLM training cutoffs. This ensures the agent must actually learn from the provided sources rather than relying on pre-training knowledge.

Core Pipeline

1. Data Layer

  • test_levels.py: Defines all 12 levels with articles and questions
  • multi_source_collector.py: Collects and normalizes news articles
  • quiz_generator.py: Generates quiz questions from articles
  • long_horizon_data.py: Generates deterministic 1000-turn dialogues

2. Agent Layer

The eval system uses subprocess isolation to prevent state leakage between levels:

  • agent_subprocess.py: Runs learning/testing phases in isolated subprocesses. Each level gets a unique agent name mapping to a separate memory database. Supports --sdk flag for SDK routing (validates SDK agent creation, records SDK metadata in output). All SDKs use the same LearningAgent core for learning/answering.
  • teaching_subprocess.py: Runs teacher-student dialogues for L7 evaluation. Teacher and student have separate memory databases; knowledge transfer happens only through conversation.

Memory isolation is critical: unique agent names like agent_L1_1708123456 ensure facts from L1 cannot leak into L2.
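The naming scheme behind this isolation can be sketched as follows (a minimal illustration; the helper name is hypothetical, but it mirrors the agent_L1_1708123456 pattern above):

```python
import time

def isolated_agent_name(level_id: str) -> str:
    """Build a unique agent name so each level maps to its own memory database.

    Hypothetical helper: combines the level ID with a timestamp so that
    facts learned for one level can never be retrieved while running another.
    """
    return f"agent_{level_id}_{int(time.time())}"

name = isolated_agent_name("L1")
assert name.startswith("agent_L1_")
```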

Advanced Retrieval Methods (added 2026-02-28):

The LearningAgent now includes three retrieval strategies:

  1. Entity-Linked Retrieval (_entity_linked_retrieval): When questions contain structured entity IDs (INC-*, CVE-*, PROJ-*), retrieves all facts containing those IDs across ALL context tags. This addresses the problem where facts about a single entity are stored under different contexts (e.g., incident data in "incidents", "security_logs", and "post_mortems").

  2. Multi-Entity Retrieval (_multi_entity_retrieval): For multi-hop reasoning questions mentioning 2+ named entities, retrieves facts for EACH entity independently and merges results. This improves coverage for questions like "How did X affect Y?" where facts about X and Y are stored separately.

  3. Standard Context-Based Retrieval: Traditional semantic similarity search within detected question context.

The agent automatically selects the most appropriate strategy based on question patterns. See EVAL_RETRIEVAL_REFERENCE.md for detailed API documentation.
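A minimal sketch of that strategy dispatch, assuming regex-based entity-ID detection (the patterns, function name, and capitalized-word heuristic are illustrative, not the actual implementation):

```python
import re

# Structured entity IDs like INC-1042, CVE-2026-1234, PROJ-7 (pattern is illustrative)
ENTITY_ID = re.compile(r"\b(?:INC|CVE|PROJ)-[\w-]+\b")

def select_strategy(question: str) -> str:
    """Pick a retrieval strategy from question patterns (simplified sketch)."""
    if ENTITY_ID.search(question):
        return "entity_linked"      # all facts for these IDs, across ALL context tags
    # a crude capitalized-word count stands in for the real named-entity detector
    entities = set(re.findall(r"\b[A-Z][a-z]+\b", question))
    if len(entities) >= 2:
        return "multi_entity"       # retrieve per entity, then merge results
    return "context_based"          # semantic search within the detected context

assert select_strategy("What fixed CVE-2026-1234?") == "entity_linked"
```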

3. Grading Layer

  • grader.py: LLM-based semantic grading with level-specific rubrics. Uses Claude Sonnet 4.5 as the grading model. Supports multi-vote grading (--grader-votes N, recommended: 3) where each answer is graded N times independently and the median score is taken.

Level-specific grading criteria:

  • L3: Numerical values are primary (correct numbers = at least 0.7); trend direction is secondary
  • L5: Explicit conflict acknowledgment scoring rubric (4 tiers: 0.9-1.0, 0.6-0.8, 0.3-0.5, 0.0-0.2)
  • L9: Accept multiple valid root causes if reasoning is sound
  • L11: Grade on required fields; don't penalize extra optional fields
  • L12: Direction of trend is critical for ratio computations
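The multi-vote grading described above can be sketched as follows (the grading callable is a stand-in for a single LLM grader call; its signature is an assumption):

```python
from statistics import median

def multi_vote_grade(grade_once, answer: str, votes: int = 3) -> float:
    """Grade an answer `votes` times independently and take the median.

    `grade_once` stands in for one LLM grading call returning a 0.0-1.0 score.
    The median damps single-call noise on ambiguous answers.
    """
    return median(grade_once(answer) for _ in range(votes))

# With a noisy grader returning 0.4, 0.9, 0.9, the median discards the outlier
scores = iter([0.4, 0.9, 0.9])
result = multi_vote_grade(lambda a: next(scores), "26 medals", votes=3)
assert result == 0.9
```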

Deterministic Grading Improvements (2026-02-28):

The _deterministic_grade() function now implements a critical false negative fix: incorrect patterns are only penalized when correct keywords are NOT present. This prevents the grader from scoring 0% on answers like "budget increased from $1.2M to $1.4M" which contain both historical (incorrect) and current (correct) values. The fix improved temporal_evolution from 86.6% to 99.8%. See EVAL_GRADING_IMPROVEMENTS.md for implementation details.
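The false negative fix can be sketched like this (function name, keyword-list shape, and the neutral fallback score are illustrative, not the actual implementation):

```python
def deterministic_grade_sketch(answer: str, correct: list[str], incorrect: list[str]) -> float:
    """Sketch of the fix: penalize incorrect patterns only when no correct
    keyword is present, so answers quoting both old and new values still pass."""
    text = answer.lower()
    has_correct = any(kw.lower() in text for kw in correct)
    has_incorrect = any(kw.lower() in text for kw in incorrect)
    if has_correct:
        return 1.0   # correct value present; a stale value alongside it is fine
    if has_incorrect:
        return 0.0   # only the superseded value appears
    return 0.5       # neither pattern matched; defer to LLM grading (assumption)

# Mentions both the historical and current value, but is a correct answer
assert deterministic_grade_sketch(
    "budget increased from $1.2M to $1.4M", ["$1.4M"], ["$1.2M"]
) == 1.0
```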

  • metacognition_grader.py: Grades reasoning traces on 4 dimensions:
Dimension             Weight  What It Measures
Effort Calibration    25%     Proportional effort to question difficulty
Sufficiency Judgment  30%     Correct assessment of when enough info gathered
Search Quality        25%     Ratio of useful results to total queries
Self Correction       20%     Detects and fixes errors in reasoning
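The four-dimension weighting reduces to a simple weighted sum (the dimension keys below are illustrative; the weights come from the table above):

```python
# Weights from the metacognition grading table (sum to 1.0)
WEIGHTS = {
    "effort_calibration": 0.25,
    "sufficiency_judgment": 0.30,
    "search_quality": 0.25,
    "self_correction": 0.20,
}

def metacognition_score(dims: dict) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one weighted score."""
    return sum(weight * dims[dim] for dim, weight in WEIGHTS.items())

score = metacognition_score({
    "effort_calibration": 0.8,
    "sufficiency_judgment": 1.0,
    "search_quality": 0.6,
    "self_correction": 0.5,
})
# 0.25*0.8 + 0.30*1.0 + 0.25*0.6 + 0.20*0.5 = 0.75
assert abs(score - 0.75) < 1e-9
```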

Evaluation Runners

Progressive Test Suite (progressive_test_suite.py)

The primary evaluation runner. Runs all levels sequentially with per-level memory isolation.

# Single run
python -m amplihack.eval.progressive_test_suite --sdk mini --levels L1 L2 L3

# Parallel runs (3-run median for stability)
python -m amplihack.eval.progressive_test_suite --runs 3 --sdk mini

# Multi-vote grading (3 votes per question)
python -m amplihack.eval.progressive_test_suite --grader-votes 3 --sdk mini

# Recommended: 3-run median + 3-vote grading
python -m amplihack.eval.progressive_test_suite --runs 3 --grader-votes 3 --sdk mini

# Advanced levels
python -m amplihack.eval.progressive_test_suite --advanced --sdk mini

# All levels
python -m amplihack.eval.progressive_test_suite \
    --levels L1 L2 L3 L4 L5 L6 L8 L9 L10 L11 L12 --sdk mini

CLI options:

Option            Description                                        Default
--output-dir      Directory for results                              ./eval_progressive
--agent-name      Agent name (memory isolation)                      progressive-test-agent
--levels          Specific levels (L1-L12)                           L1-L6
--advanced        Include L8-L10                                     Off
--memory-backend  Memory backend                                     amplihack-memory-lib
--parallel N      Run N times, report medians                        Off
--runs N          Alias for --parallel                               Off
--sdk             SDK type: mini, claude, copilot, microsoft         mini
--grader-votes N  Grading votes per question (1=single, 3=majority)  1

Key classes:

  • ProgressiveConfig: SDK, levels, output dir, grader votes
  • ProgressiveResult: Per-level results and overall scores
  • ParallelResult: Median scores across multiple runs

SDK Eval Loop (sdk_eval_loop.py)

Runs improvement loops for one or more SDKs, tracking score progression and generating prompt tuning recommendations:

# Single SDK
python -m amplihack.eval.sdk_eval_loop --sdks mini --loops 5

# All SDKs comparison (4-way)
python -m amplihack.eval.sdk_eval_loop --all-sdks --loops 3

# Specific levels
python -m amplihack.eval.sdk_eval_loop --sdks mini claude --loops 3 --levels L1 L2 L3

Each loop iteration:

  1. Runs L1-L6 eval for the SDK
  2. Analyzes failures to identify weak levels
  3. Generates SDK-specific prompt tuning recommendations
  4. Re-evaluates to measure improvement

Output includes:

  • Per-SDK score progression across iterations
  • Failure analysis with recommendations
  • Per-level comparison table
  • Ranked SDK comparison by best overall score

CLI options:

Option        Description                         Default
--sdks        SDKs to evaluate (space-separated)  mini
--all-sdks    Evaluate all 4 SDKs                 Off
--loops       Improvement loops per SDK           5
--levels      Levels to run                       L1-L6
--output-dir  Output directory                    ./eval_sdk_loop

Self-Improvement Runner (self_improve/runner.py)

Closed-loop improvement: EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE.

# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 5

# With all options
python -m amplihack.eval.self_improve.runner \
    --sdk mini \
    --iterations 5 \
    --improvement-threshold 2.0 \
    --regression-tolerance 5.0 \
    --levels L1 L2 L3 L4 L5 L6 \
    --output-dir ./eval_results/self_improve \
    --dry-run

6 phases per iteration:

  1. EVAL: Run progressive test suite to get baseline scores.
  2. ANALYZE: Classify failures using the error taxonomy (10 failure modes). Maps each failure to the specific code component responsible.
  3. RESEARCH: For each proposed fix, state hypothesis, gather evidence from eval results and failure patterns, consider counter-arguments (regression risk, stochasticity, cross-level impact), and make a reasoned decision: apply, skip, or defer. All research decisions are logged to research_decisions.json.
  4. IMPROVE: Apply approved changes. The patch proposer (self_improve/patch_proposer.py) generates specific code changes as unified diffs. Each PatchProposal includes: target_file, hypothesis, description, diff, expected_impact (per-category score delta), risk_assessment, and confidence (0.0-1.0). Maintains PatchHistory to avoid re-proposing reverted/rejected patches. Proposals go through reviewer voting (self_improve/reviewer_voting.py) where three perspectives (quality, regression, simplicity) vote with majority determining outcome. A challenge phase forces the proposer to defend the change.
  5. RE-EVAL: Run eval again on all levels to measure impact.
  6. DECIDE: Commit if net improvement >= +2% overall and no single level regresses > 5%. Revert if any level regresses beyond tolerance.
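The DECIDE rule can be sketched as follows (the per-level score dict shape and the "keep_marginal" label for the in-between case are assumptions; the thresholds match the CLI defaults):

```python
def decide(baseline: dict, after: dict,
           improvement_threshold: float = 2.0,
           regression_tolerance: float = 5.0) -> str:
    """DECIDE phase sketch: scores are percentages keyed by level ID."""
    # Revert if any single level regresses beyond tolerance
    for level in baseline:
        if baseline[level] - after[level] > regression_tolerance:
            return "revert"
    overall_before = sum(baseline.values()) / len(baseline)
    overall_after = sum(after.values()) / len(after)
    if overall_after - overall_before >= improvement_threshold:
        return "commit"
    return "keep_marginal"   # neither commit nor revert (labeling is an assumption)

assert decide({"L1": 80, "L2": 90}, {"L1": 85, "L2": 95}) == "commit"
assert decide({"L1": 80, "L2": 90}, {"L1": 95, "L2": 80}) == "revert"
```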

CLI options:

Option                   Description                         Default
--sdk                    SDK type                            mini
--iterations             Max improvement iterations          5
--improvement-threshold  Min % improvement to commit         2.0
--regression-tolerance   Max % regression on any level       5.0
--levels                 Levels to evaluate                  L1-L6
--output-dir             Output directory                    ./eval_results/self_improve
--agent-name             Agent name for memory isolation     self-improve-agent
--dry-run                Evaluate only, don't apply changes  Off

Domain Agent Eval (domain_eval_harness.py)

Evaluates 5 domain-specific agents using the DomainEvalHarness:

from amplihack.eval.domain_eval_harness import DomainEvalHarness
from amplihack.agents.domain_agents.code_review.agent import CodeReviewAgent

harness = DomainEvalHarness(CodeReviewAgent("test"))
report = harness.run()
print(f"Overall: {report.overall_score:.0%}")

5 domain agents (defined in src/amplihack/agents/domain_agents/):

  • CodeReviewAgent (code quality, security, style): analyze_code, check_style, detect_security_issues, suggest_improvements
  • MeetingSynthesizerAgent (meeting transcripts): extract_action_items, generate_summary, identify_decisions, identify_topics
  • DataAnalysisAgent (dataset analysis): domain-specific tools
  • DocumentCreatorAgent (documentation generation): domain-specific tools
  • ProjectPlanningAgent (task breakdown and estimates): domain-specific tools

Each agent provides its own eval levels with scenarios and rubrics. Domain eval uses deterministic grading (pattern matching, field checking) rather than LLM grading.

Combined scoring: Domain agents can report a blended score of 60% domain-specific eval + 40% teaching eval.
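The combined score is a straightforward weighted blend (the function name is illustrative):

```python
def combined_score(domain_score: float, teaching_score: float) -> float:
    """Blend domain-specific and teaching eval scores: 60% domain + 40% teaching."""
    return 0.6 * domain_score + 0.4 * teaching_score

# Perfect domain performance, mediocre teaching -> 0.6 + 0.2 = 0.8
assert abs(combined_score(1.0, 0.5) - 0.8) < 1e-9
```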

Long-Horizon Memory Eval (long_horizon_memory.py)

1000-turn memory stress test with 12 information blocks (including security domain):

# Quick test
python -m amplihack.eval.long_horizon_memory --turns 100 --questions 20

# Full stress test (best score: 98.9%)
python -m amplihack.eval.long_horizon_memory --turns 1000 --questions 100

# With SDK agent
python -m amplihack.eval.long_horizon_memory --sdk claude --turns 500 --questions 50

# Large-scale with subprocess segmentation (prevents OOM on 5000+ turns)
python -m amplihack.eval.long_horizon_memory --turns 5000 --questions 200 --segment-size 100

Subprocess segmentation (--segment-size N): For 5000+ turn evaluations, native memory from Kuzu C++ and aiohttp accumulates ~3MB/turn. The --segment-size flag splits the learning phase into subprocess segments of N turns each. Each subprocess loads its turn slice, learns it into the shared on-disk DB, then exits -- freeing all native memory. After all segments complete, the questioning phase runs against the accumulated DB. Default (--segment-size 0) runs everything in one process.
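The turn-slicing behind --segment-size can be sketched like this (the function name is hypothetical; each resulting slice would be handed to one subprocess):

```python
def segment_ranges(total_turns: int, segment_size: int) -> list:
    """Split the learning phase into (start, end) turn slices, one per subprocess.

    segment_size == 0 reproduces the default single-process behaviour.
    """
    if segment_size <= 0:
        return [(0, total_turns)]
    return [(start, min(start + segment_size, total_turns))
            for start in range(0, total_turns, segment_size)]

# 250 turns with --segment-size 100 -> three subprocess slices
assert segment_ranges(250, 100) == [(0, 100), (100, 200), (200, 250)]
assert segment_ranges(100, 0) == [(0, 100)]
```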

12 information blocks with proportional turn allocation:

Block  Name             % Turns  What It Tests                              Recent Score
1      People           5%       Personal details, relationships            ~95%
2      Projects         10%      Project metadata with evolving updates     ~98%
3      Technical        10%      Domain-specific knowledge facts            ~97%
4      Evolving Story   15%      Narrative with corrections over time       99.8%
5      Numerical        10%      Exact number recall and arithmetic         ~96%
6      Contradictory    8%       Conflicting sources with different claims  ~94%
7      Callbacks        6%       References to facts from earlier blocks    ~97%
8      Distractors      6%       Irrelevant noise the agent must resist     ~98%
9      Security Logs    10%      CVEs, access logs, authentication events   100%
10     Incidents        8%       Incident reports, post-mortems, RCAs       ~95%
11     Infrastructure   7%       Server inventory, network configuration    ~97%
12     Problem Solving  5%       Multi-step reasoning tasks (code gen)      ~96%

Blocks 9-12 (security domain) together account for 30% of turns.
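The proportional allocation can be sketched as follows (the block keys and per-block rounding are illustrative; the real generator may distribute remainders differently):

```python
# Turn share per block, from the table above (fractions sum to 1.0)
BLOCK_SHARES = {
    "people": 0.05, "projects": 0.10, "technical": 0.10, "evolving_story": 0.15,
    "numerical": 0.10, "contradictory": 0.08, "callbacks": 0.06, "distractors": 0.06,
    "security_logs": 0.10, "incidents": 0.08, "infrastructure": 0.07, "problem_solving": 0.05,
}

def allocate_turns(total_turns: int) -> dict:
    """Allocate turns to blocks in proportion to their share (simplified sketch)."""
    return {name: round(total_turns * share) for name, share in BLOCK_SHARES.items()}

turns = allocate_turns(1000)
assert turns["security_logs"] == 100
# Security-domain blocks 9-12 together account for 30% of turns
assert sum(turns[b] for b in ("security_logs", "incidents",
                              "infrastructure", "problem_solving")) == 300
```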

Security Domain Improvements (2026-02-28):

Block 9 (Security Logs) now achieves 100% accuracy due to:

  • Entity-linked retrieval for CVE IDs and incident references
  • Improved conflict detection for superseded vulnerabilities
  • Better aggregation of security event chains across contexts

Data generation is deterministic from long_horizon_data.py (same seed = same dialogue). When the amplihack-agent-eval package is installed, data types are imported from the package.

Tests memory at scale with:

  • Deterministic dialogue generation (reproducible with seed)
  • 12 topic domains covering both narrative and structured operational data
  • Questions spanning different recency windows and complexity levels
  • LLM-graded scoring on 5 dimensions per question:
      • Factual accuracy
      • Completeness
      • Recency (uses most recent info)
      • Source attribution
      • Coherence

Matrix Eval Across SDKs (matrix_eval.py)

5-way agent comparison using the long-horizon eval:

# Full 5-way matrix
python -m amplihack.eval.matrix_eval --turns 500 --questions 50

# Specific agents
python -m amplihack.eval.matrix_eval --agents mini claude --turns 100 --questions 20

# With markdown report
python -m amplihack.eval.matrix_eval --turns 500 --questions 50 \
    --report-path Specs/MATRIX_EVAL_REPORT.md

5 agent configurations:

Name                SDK        Multi-Agent  Spawning  What It Tests
mini                mini       No           No        LearningAgent directly (no SDK wrapper)
claude              claude     No           No        ClaudeGoalSeekingAgent via SDK factory
copilot             copilot    No           No        CopilotGoalSeekingAgent via SDK factory
microsoft           microsoft  No           No        MicrosoftGoalSeekingAgent via SDK factory
multiagent-copilot  copilot    Yes          Yes       MultiAgentLearningAgent with spawning

Key design: dialogue and questions are generated once and shared across all agents. Each agent uses a separate storage/DB path. SDK agents that fail to instantiate (missing SDK package) are skipped gracefully.

Output: per-agent reports, shared ground truth, matrix comparison JSON, and a markdown report with category-by-category scores, best-performer-per-category, and overall ranking.

Long-Horizon Self-Improvement (long_horizon_self_improve.py)

Automated improvement loop targeting the long-horizon eval. Uses patch proposer and reviewer voting for safe iteration:

python -m amplihack.eval.long_horizon_self_improve \
    --turns 100 --questions 20 --iterations 3

# With multi-agent architecture
python -m amplihack.eval.long_horizon_self_improve \
    --turns 100 --questions 20 --multi-agent

8-stage cycle per iteration:

  1. EVAL: Run long-horizon eval to get per-category scores.
  2. ANALYZE: Identify worst-performing category.
  3. PROPOSE: Patch proposer generates a unified diff with hypothesis, expected impact (per-category delta), risk assessment, and confidence score.
  4. CHALLENGE: Devil's advocate arguments against the proposed patch.
  5. VOTE: Three reviewer perspectives (quality, regression, simplicity) vote accept/reject/modify with rationale. Majority vote determines outcome.
  6. APPLY: If accepted, apply the patch and git commit.
  7. RE-EVAL: Run eval again to measure impact.
  8. DECIDE: If regression > 5% on any category, auto-revert via git. If net improvement >= 2%, keep. Otherwise, keep with marginal note.

The PatchHistory object tracks all applied, reverted, and rejected patches to prevent repeating failed fixes across iterations.
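The three-perspective majority vote from the VOTE stage can be sketched as follows (the vote-dict shape and the conservative no-majority fallback are assumptions):

```python
from collections import Counter

def tally_votes(votes: dict) -> str:
    """Majority vote across the three reviewer perspectives.

    `votes` maps perspective -> "accept" | "reject" | "modify" (shape is illustrative).
    """
    counts = Counter(votes.values())
    outcome, n = counts.most_common(1)[0]
    # No two perspectives agree: default to reject (conservative assumption)
    return outcome if n >= 2 else "reject"

assert tally_votes({"quality": "accept", "regression": "accept",
                    "simplicity": "reject"}) == "accept"
assert tally_votes({"quality": "accept", "regression": "reject",
                    "simplicity": "modify"}) == "reject"
```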

Meta-Eval Experiment (meta_eval_experiment.py)

Self-referential test: an agent learns about the eval system itself and teaches it to a student.

  • Deterministic knowledge base (from EVAL_KNOWLEDGE_BASE constant, derived from actual source code)
  • Uses TeachingSession for multi-turn dialogue
  • Uses MetacognitionGrader for student evaluation
  • JSON report output

Error Analysis Taxonomy

The FAILURE_TAXONOMY in error_analyzer.py maps symptoms to root causes:

Failure Mode                Description                     Code Component
retrieval_insufficient      Not enough facts retrieved      agentic_loop.py::_plan_retrieval
temporal_ordering_wrong     Wrong temporal arithmetic       learning_agent.py::_synthesize_with_llm
intent_misclassification    Wrong intent detection          learning_agent.py::_detect_intent
fact_extraction_incomplete  Missing facts in memory         learning_agent.py::_extract_facts_with_llm
synthesis_hallucination     Invented information            learning_agent.py::_synthesize_with_llm
update_not_applied          Used outdated data              hierarchical_memory.py::_detect_supersedes
contradiction_undetected    Missed conflicting sources      learning_agent.py::_detect_intent + synthesis
procedural_ordering_lost    Steps out of sequence           learning_agent.py::_extract_facts_with_llm
teaching_coverage_gap       Student not taught key facts    teaching_session.py::_teacher_respond
counterfactual_refusal      Refused hypothetical reasoning  learning_agent.py::_synthesize_with_llm

General Capability Evaluation

Beyond memory, the system evaluates 5 general-purpose capabilities via general_capability_eval.py:

PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval tool_use,planning,reasoning
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval all --sdk mini
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval planning --output-dir /tmp/cap-eval

Eval Type                    What It Tests                                   Scenarios
Tool Use Efficiency          Correct tool selection, chaining, call economy  4
Planning                     Multi-step task decomposition and execution     3
Reasoning Under Uncertainty  Handling incomplete/conflicting evidence        3
Cross-Domain Transfer        Applying learned patterns to new domains        2
Collaborative Task           Multi-agent delegation and synthesis            2

Each eval type defines scenarios with gold-standard expectations, runs the agent through them, and grades using rubric-based LLM evaluation.

How to Add New Eval Levels

1. Define the test content in test_levels.py

L13_LEVEL = TestLevel(
    level_id="L13",
    level_name="Analogical Reasoning",
    description="Apply learned analogies to novel situations",
    articles=[
        {"title": "Source Domain", "content": "..."},
        {"title": "Target Domain", "content": "..."},
    ],
    questions=[
        {
            "question": "Based on the source domain pattern, what would happen in the target domain?",
            "expected_answer": "The expected analogical inference...",
            "level": "L13",
        },
    ],
)

2. Register the level

Add your level to the appropriate list in test_levels.py:

# For standard learn-then-test levels:
STANDARD_LEVELS = [..., L13_LEVEL]

# For levels needing special handling (like L7 teaching):
CUSTOM_LEVELS = [L13_LEVEL]

3. Add level-specific grading criteria (if needed)

In grader.py, add criteria to _build_grading_prompt():

elif level == "L13":
    level_criteria = (
        "\n\nL13 ANALOGICAL REASONING grading criteria:\n"
        "- Award 0.9-1.0 if the agent correctly maps the source pattern to the target.\n"
        "- Award 0.6-0.8 if the mapping is partially correct.\n"
    )

4. Add the level to CLI choices

In progressive_test_suite.py, add "L13" to the --levels choices list.

5. Run and validate

PYTHONPATH=src python -m amplihack.eval.progressive_test_suite \
    --output-dir /tmp/eval_l13 \
    --levels L13

How to Add New Domain Agents

1. Create the agent directory

src/amplihack/agents/domain_agents/my_domain/
    __init__.py
    agent.py           # DomainAgent subclass
    tools.py           # Domain-specific tool functions
    prompts/
        system.txt     # System prompt

2. Implement the agent

Subclass DomainAgent and implement: _register_tools(), execute_task(), get_eval_levels(), teach(), and get_system_prompt().

from amplihack.agents.domain_agents.base import DomainAgent, EvalLevel, TaskResult

class MyDomainAgent(DomainAgent):
    def __init__(self, agent_name="my_agent", model="gpt-4o-mini", skill_injector=None):
        super().__init__(agent_name=agent_name, domain="my_domain", model=model, skill_injector=skill_injector)

    def _register_tools(self):
        self.executor.register_action("my_tool", my_tool_function)

    def execute_task(self, task: dict) -> TaskResult:
        result = self.executor.execute("my_tool", **task)
        return TaskResult(success=True, output=result.output)

    def get_eval_levels(self) -> list[EvalLevel]:
        return [
            EvalLevel(
                level_id="L1",
                name="Basic Task",
                scenarios=[EvalScenario(...)],
                passing_threshold=0.7,
            ),
        ]

    def teach(self, topic, student_level="beginner"):
        return TeachingResult(...)

    def get_system_prompt(self) -> str:
        return "You are an expert in my domain."

3. Run the eval

from amplihack.eval.domain_eval_harness import DomainEvalHarness

harness = DomainEvalHarness(MyDomainAgent("test"))
report = harness.run()
print(f"Overall: {report.overall_score:.0%}")

Key Design Decisions

  1. Subprocess Isolation: Each eval level runs in a separate subprocess with its own memory database. This prevents cross-level contamination and makes results reproducible.

  2. LLM-Based Grading: Uses Claude Sonnet 4.5 as the grading model rather than exact-match scoring. This handles semantic equivalence (e.g., "26 medals" vs "twenty-six medals") and partial credit.

  3. Multi-Vote Grading: When --grader-votes 3 is set, each answer is graded 3 times independently and the median is taken. This reduces noise on ambiguous answers where a single grader call might score 0.4 or 0.9.

  4. 3-Run Medians: Single runs are unreliable due to LLM stochasticity. Running 3 times and taking median scores gives stable, reproducible results.

  5. Level-Specific Rubrics: Grading prompts include level-specific criteria because different cognitive tasks need different scoring rules (e.g., L3 temporal reasoning vs L5 contradiction detection).

  6. Research Step in Self-Improvement: Rather than blindly applying fixes, the runner requires hypothesis + evidence + counter-arguments before any change. This prevents regression from overenthusiastic tuning.

  7. Error Taxonomy: Rather than just reporting scores, the error analyzer maps failures to specific code components. This makes the self-improvement loop actionable rather than just diagnostic.

  8. Deterministic Data: Test data is generated deterministically with seeds, enabling meaningful comparison across runs and SDK versions.

  9. Per-SDK Prompt Tuning: Each SDK has dedicated eval prompt templates in src/amplihack/agents/goal_seeking/prompts/sdk/, allowing SDK-specific instruction tuning without modifying shared agent code.

Best Practices for Running Evals

  1. Use --runs 3 for production eval results. Single runs have high variance on levels L2, L5, L9, L10, and L11.

  2. Use --grader-votes 3 for final benchmarks. Multi-vote grading reduces noise on ambiguous answers.

  3. Use --sdk mini for development iteration (fastest). Switch to specific SDK for final comparison.

  4. Check ANTHROPIC_API_KEY is set before running. The grader requires it.

  5. Monitor output directories for disk usage. Each eval run generates JSON logs per level.

  6. Use --dry-run with the self-improvement runner to preview what changes would be made before applying them.

  7. Run --levels L1 L2 L3 to iterate quickly on specific levels rather than running all 12 every time.

Package Architecture (amplihack-agent-eval)

The evaluation framework has been extracted into a standalone package (amplihack-agent-eval) that can be used independently of the full amplihack codebase. The amplihack repo depends on this package for its core eval types and data generation, and provides amplihack-specific adapter implementations.

Installation

pip install "amplihack-agent-eval @ git+https://github.com/rysweet/amplihack-agent-eval.git@main"

Package Relationship

amplihack-agent-eval (standalone)     amplihack (this repo)
================================     ==========================
amplihack_eval/                       src/amplihack/eval/
  adapters/base.py  <-- AgentAdapter    agent_adapter.py  (amplihack adapters)
  core/runner.py    <-- EvalRunner      compat.py         (re-exports)
  core/grader.py    <-- grade_answer    long_horizon_memory.py (uses package)
  data/long_horizon.py  <-- data gen    long_horizon_data.py (local copy)
  self_improve/                         self_improve/ (local copy)
    patch_proposer.py                     patch_proposer.py
    reviewer_voting.py                    reviewer_voting.py

How It Works

  1. amplihack-agent-eval provides the core eval types: AgentAdapter, AgentResponse, ToolCall, EvalRunner, grade_answer, GradeResult, data generation (Turn, Question, GroundTruth, GradingRubric, generate_dialogue, generate_questions), and self-improvement primitives (PatchProposal, propose_patch, ReviewVote, ReviewResult, vote_on_proposal).

  2. amplihack adds the optional eval dependency in pyproject.toml:

[project.optional-dependencies]
eval = ["amplihack-agent-eval>=0.1.0"]

  3. long_horizon_memory.py imports data types from the package:
from amplihack_eval.data.long_horizon import (
    GradingRubric, GroundTruth, Question,
    generate_dialogue, generate_questions,
)

If the package is not installed, an ImportError is raised with installation instructions.
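A guarded import that fails with install guidance might look like the following sketch. The helper name is hypothetical; the install command is the one from the Installation section above:

```python
# Hypothetical guarded import: a missing package surfaces as an ImportError
# that carries the install command, instead of a bare ModuleNotFoundError.
def load_long_horizon_generators():
    try:
        from amplihack_eval.data.long_horizon import (
            generate_dialogue,
            generate_questions,
        )
    except ImportError as exc:
        raise ImportError(
            "amplihack-agent-eval is required for long-horizon evals. "
            "Install it with:\n"
            '  pip install "amplihack-agent-eval @ '
            'git+https://github.com/rysweet/amplihack-agent-eval.git@main"'
        ) from exc
    return generate_dialogue, generate_questions


try:
    load_long_horizon_generators()
except ImportError as err:
    print("install hint present:", "pip install" in str(err))
```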

  4. compat.py re-exports all key types from the standalone package, so existing code can still write from amplihack.eval.compat import EvalRunner. If the package is not installed, compat.py is a no-op (no errors).

  5. agent_adapter.py provides three adapter implementations that wrap amplihack agents for the package's AgentAdapter interface:

  - AmplihackLearningAgentAdapter -- wraps LearningAgent
  - AmplihackMultiAgentAdapter -- wraps MultiAgentLearningAgent
  - AmplihackSDKAgentAdapter -- wraps any SDK-backed GoalSeekingAgent

Using the Adapters

from amplihack_eval.core.runner import EvalRunner
from amplihack.eval.agent_adapter import AmplihackLearningAgentAdapter

# Create adapter wrapping amplihack's LearningAgent
adapter = AmplihackLearningAgentAdapter("eval-agent", model="gpt-4o-mini")

# Run evaluation
runner = EvalRunner(num_turns=100, num_questions=20, seed=42)
report = runner.run(adapter)
print(f"Score: {report.overall_score:.2%}")
adapter.close()

File Map

src/amplihack/eval/
  __init__.py                    # Public API exports
  compat.py                      # Re-exports from amplihack-agent-eval package
  agent_adapter.py               # Amplihack-specific AgentAdapter implementations
  test_levels.py                 # L1-L12 level definitions (articles + questions)
  progressive_test_suite.py      # Main runner (single + parallel + multi-vote)
  agent_subprocess.py            # Subprocess isolation for learning/testing (SDK routing)
  teaching_subprocess.py         # Subprocess for teacher-student dialogue
  grader.py                      # LLM semantic grading with multi-vote
  metacognition_grader.py        # 4-dimension metacognition scoring
  multi_source_collector.py      # News article collection
  quiz_generator.py              # Quiz question generation
  harness_runner.py              # Original 4-level harness
  sdk_eval_loop.py               # Multi-SDK eval comparison loop
  domain_eval_harness.py         # Generic harness for domain agents
  run_domain_evals.py            # Run all 5 domain agent evals
  teaching_session.py            # Teacher-student session framework
  teaching_eval.py               # Teaching quality evaluation
  long_horizon_data.py           # Deterministic dialogue generation (12 blocks)
  long_horizon_memory.py         # 1000-turn stress test (uses eval package data types)
  long_horizon_self_improve.py   # Long-horizon self-improvement loop (8 phases)
  matrix_eval.py                 # 5-way matrix evaluation across SDKs
  meta_eval_experiment.py        # Meta-eval experiment framework
  five_agent_experiment.py       # 5-agent experiment runner
  QUICK_START.md                 # Quick start guide
  self_improve/
    __init__.py
    error_analyzer.py            # Failure taxonomy + classification (10 modes)
    runner.py                    # Closed-loop self-improvement (6 phases)
    patch_proposer.py            # LLM-generated code patches with hypotheses
    reviewer_voting.py           # 3-perspective review + majority voting

src/amplihack/agents/goal_seeking/
  prompts/sdk/                   # Per-SDK eval prompt templates
    copilot_eval.md
    claude_eval.md
    microsoft_eval.md
    goal_seeking_system.md
    learning_task.md
    synthesis_template.md
    teaching_system.md

src/amplihack/agents/domain_agents/
  base.py                        # DomainAgent ABC
  code_review/agent.py           # Code Review agent
  meeting_synthesizer/agent.py   # Meeting Synthesizer agent
  data_analysis/agent.py         # Data Analysis agent
  document_creator/agent.py      # Document Creator agent
  project_planning/agent.py      # Project Planning agent

tests/eval/
  test_eval_compat.py            # Compat layer + agent adapter tests (32 tests)
  test_grader.py                 # Grader unit tests
  test_harness_runner.py         # Harness runner tests
  test_metacognition_grader.py   # Metacognition grader tests
  test_teaching_session.py       # Teaching session tests
  test_progressive_suite.py      # Progressive suite tests
  test_meta_eval_experiment.py   # Meta-eval tests
  test_multi_source_collector.py # Collector tests
  test_quiz_generator.py         # Quiz generator tests
  test_long_horizon_memory.py    # Long-horizon memory tests