Eval System Architecture¶
Comprehensive guide to the evaluation and self-improvement infrastructure for goal-seeking agents. Covers the progressive test suite (L1-L12), long-horizon memory testing with 12 information blocks (including security domain), 5-way matrix evaluation across SDKs, self-improvement loop with patch proposer + reviewer voting, and the standalone amplihack-agent-eval package.
Overview¶
The eval system is a multi-layered framework that tests agent learning and reasoning capabilities across 12 progressively harder levels. It supports 4 SDK implementations, includes a self-improvement loop with patch proposer and reviewer voting, and provides domain-specific evaluation for 5 specialized agents.
Current best scores (as of 2026-02-28):
Progressive test suite (3-run median, mini SDK):
| Level | Median | Description |
|---|---|---|
| L1 | 83% | Single source direct recall |
| L2 | 100% | Multi-source synthesis |
| L3 | 99.8% | Temporal reasoning |
| L4 | 79% | Procedural learning |
| L5 | 95% | Contradiction handling |
| L6 | 100% | Incremental learning |
| L7 | 84% | Teacher-student transfer |
| Overall | 97.8% | Weighted median across all levels |
Long-horizon memory evaluation: 98.9% at 1000 turns.
Recent improvements (2026-02-28):

- Overall accuracy: 96.0% → 97.8% (+1.8%)
- temporal_evolution: 86.6% → 99.8% (+13.2%)
- security_log_analysis: 88% → 100% (+12%)
- 9 categories now at 100% accuracy
+------------------------------------------------------------------+
| EVALUATION ENTRY POINTS |
+------------------------------------------------------------------+
| |
| progressive_test_suite.py sdk_eval_loop.py run_domain_evals |
| (L1-L12 single/parallel) (multi-SDK loop) (5 domain agents) |
| |
| self_improve/runner.py long_horizon_memory.py |
| (closed-loop improvement) (1000-turn stress test, 12 blocks) |
| |
| long_horizon_self_improve.py |
| (long-horizon self-improvement with patch proposer + voting) |
| |
| matrix_eval.py meta_eval_experiment.py |
| (5-way SDK comparison) (self-referential eval) |
| |
+------------------------------------------------------------------+
| |
v v
+------------------------------------------------------------------+
| CORE EVAL PIPELINE |
+------------------------------------------------------------------+
| |
| 1. DATA LAYER 2. AGENT LAYER |
| +-----------------+ +------------------------+ |
| | test_levels.py | | agent_subprocess.py | |
| | (L1-L12 defs) | | (subprocess isolation) | |
| | - TestArticle |------->| - learning phase | |
| | - TestQuestion | | - testing phase | |
| | - TestLevel | | - SDK routing | |
| +-----------------+ +------------------------+ |
| | long_horizon_ | +------------------------+ |
| | data.py | | teaching_subprocess.py | |
| | (12-block gen) | | (teacher-student) | |
| +-----------------+ +------------------------+ |
| | |
| 3. GRADING LAYER v |
| +---------------------+ +------------------------+ |
| | grader.py | | metacognition_grader.py| |
| | (LLM semantic grade) | | (4-dimension scoring) | |
| | - multi-vote | | - effort calibration | |
| | - level-specific | | - sufficiency judgment | |
| | rubrics (L3, L5, | | - search quality | |
| | L9, L11, L12) | | - self correction | |
| +---------------------+ +------------------------+ |
| |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| ANALYSIS & IMPROVEMENT |
+------------------------------------------------------------------+
| |
| self_improve/error_analyzer.py (10 failure modes) |
| self_improve/patch_proposer.py (LLM-generated diffs) |
| self_improve/reviewer_voting.py (3-perspective review) |
| +----------------------------------------------------------+ |
| | EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE -> | |
| | APPLY -> RE-EVAL -> DECIDE | |
| | | |
| | Patch proposer: hypothesis + diff + expected impact | |
| | Reviewer voting: quality/regression/simplicity perspectives | |
| | Challenge phase: devil's advocate arguments + defense | |
| | Auto-revert on regression > 5% | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Test Levels (L1-L12)¶
Each level represents a specific cognitive capability:
| Level | Name | Capability Tested | Key Challenge |
|---|---|---|---|
| L1 | Single Source Recall | Direct fact retrieval from one source | Baseline accuracy |
| L2 | Multi-Source Synthesis | Combining info from multiple sources | Cross-source counting |
| L3 | Temporal Reasoning | Understanding changes over time | Arithmetic on temporal data |
| L4 | Procedural Learning | Learning and applying step-by-step procedures | Sequence ordering |
| L5 | Contradiction Handling | Detecting conflicting information | Conflict acknowledgment |
| L6 | Incremental Learning | Updating knowledge with new info | Superseding old facts |
| L7 | Teacher-Student Transfer | Teaching knowledge to another agent | Multi-turn dialogue |
| L8 | Metacognition | Awareness of own reasoning quality | Effort calibration |
| L9 | Causal Reasoning | Identifying cause-and-effect chains | Root cause identification |
| L10 | Counterfactual Reasoning | "What if" hypothetical analysis | Hypothetical scenarios |
| L11 | Novel Skill Acquisition | Learning new task formats from examples | Config generation |
| L12 | Far Transfer | Applying learned patterns to new domains | Cross-domain application |
Level Data Structure¶
Defined in src/amplihack/eval/test_levels.py:
@dataclass
class TestLevel:
    level_id: str                    # "L1", "L2", etc.
    level_name: str                  # Human-readable name
    description: str                 # What this level tests
    articles: list[TestArticle]      # Source material to learn
    questions: list[TestQuestion]    # Questions to answer
    requires_temporal_ordering: bool = False
    requires_update_handling: bool = False
Each level provides its own articles and questions. L6 uses phased articles (initial + update) to test incremental learning. L7 uses a teaching session instead of direct learning/testing.
Why 2026 Winter Olympics? The test content uses synthetic data about the February 2026 Milan-Cortina Olympics -- a topic that post-dates LLM training cutoffs. This ensures the agent must actually learn from the provided sources rather than relying on pre-training knowledge.
Core Pipeline¶
1. Data Layer¶
- test_levels.py: Defines all 12 levels with articles and questions
- multi_source_collector.py: Collects and normalizes news articles
- quiz_generator.py: Generates quiz questions from articles
- long_horizon_data.py: Generates deterministic 1000-turn dialogues
2. Agent Layer¶
The eval system uses subprocess isolation to prevent state leakage between levels:
- agent_subprocess.py: Runs learning/testing phases in isolated subprocesses. Each level gets a unique agent name mapping to a separate memory database. Supports the --sdk flag for SDK routing (validates SDK agent creation, records SDK metadata in output). All SDKs use the same LearningAgent core for learning/answering.
- teaching_subprocess.py: Runs teacher-student dialogues for L7 evaluation. Teacher and student have separate memory databases; knowledge transfer happens only through conversation.
Memory isolation is critical: unique agent names like agent_L1_1708123456 ensure facts from L1 cannot leak into L2.
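The naming scheme can be sketched as a level ID plus a Unix timestamp; the helper name below is hypothetical, but the pattern matches the example above:

```python
import time

def isolated_agent_name(level_id: str) -> str:
    """Per-level agent name; each name maps to a separate memory database."""
    return f"agent_{level_id}_{int(time.time())}"

# Facts learned under "agent_L1_<timestamp>" are invisible to the
# L2 agent, so nothing can leak between levels.
name = isolated_agent_name("L1")
```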
Advanced Retrieval Methods (added 2026-02-28):
The LearningAgent now includes three retrieval strategies:
- Entity-Linked Retrieval (_entity_linked_retrieval): When questions contain structured entity IDs (INC-*, CVE-*, PROJ-*), retrieves all facts containing those IDs across ALL context tags. This addresses the problem where facts about a single entity are stored under different contexts (e.g., incident data in "incidents", "security_logs", and "post_mortems").
- Multi-Entity Retrieval (_multi_entity_retrieval): For multi-hop reasoning questions mentioning 2+ named entities, retrieves facts for EACH entity independently and merges results. This improves coverage for questions like "How did X affect Y?" where facts about X and Y are stored separately.
- Standard Context-Based Retrieval: Traditional semantic similarity search within the detected question context.
The agent automatically selects the most appropriate strategy based on question patterns. See EVAL_RETRIEVAL_REFERENCE.md for detailed API documentation.
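The selection logic can be approximated as a surface-pattern check; the function and regex below are a sketch, not the actual LearningAgent code:

```python
import re

# Structured entity-ID patterns mentioned above (INC-*, CVE-*, PROJ-*).
ENTITY_ID_RE = re.compile(r"\b(?:INC|CVE|PROJ)-[\w.-]+\b")

def choose_retrieval_strategy(question: str, named_entities: list[str]) -> str:
    """Pick a retrieval strategy from question patterns (illustrative)."""
    if ENTITY_ID_RE.search(question):
        return "entity_linked"   # IDs may span many context tags
    if len(named_entities) >= 2:
        return "multi_entity"    # retrieve per entity, merge results
    return "context_based"       # default semantic search

choose_retrieval_strategy("What caused INC-2041?", [])  # entity-ID path
```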
3. Grading Layer¶
grader.py: LLM-based semantic grading with level-specific rubrics. Uses Claude Sonnet 4.5 as the grading model. Supports multi-vote grading (--grader-votes N, recommended: 3) where each answer is graded N times independently and the median score is taken.
Level-specific grading criteria:

- L3: Numerical values are primary (correct numbers = at least 0.7); trend direction is secondary
- L5: Explicit conflict acknowledgment scoring rubric (4 tiers: 0.9-1.0, 0.6-0.8, 0.3-0.5, 0.0-0.2)
- L9: Accept multiple valid root causes if reasoning is sound
- L11: Grade on required fields; don't penalize extra optional fields
- L12: Direction of trend is critical for ratio computations
Deterministic Grading Improvements (2026-02-28):
The _deterministic_grade() function now implements a critical false negative fix: incorrect patterns are only penalized when correct keywords are NOT present. This prevents the grader from scoring 0% on answers like "budget increased from $1.2M to $1.4M" which contain both historical (incorrect) and current (correct) values. The fix improved temporal_evolution from 86.6% to 99.8%. See EVAL_GRADING_IMPROVEMENTS.md for implementation details.
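In outline, the fix orders the checks so that correct keywords short-circuit the penalty; this is an illustrative reduction, not the real _deterministic_grade() signature:

```python
def deterministic_score(answer: str, correct: list[str], incorrect: list[str]) -> float:
    """Sketch of the false-negative fix: incorrect patterns only
    penalize when no correct keyword is present."""
    text = answer.lower()
    if any(k.lower() in text for k in correct):
        return 1.0   # "$1.2M to $1.4M" keeps credit despite the stale value
    if any(k.lower() in text for k in incorrect):
        return 0.0   # only the superseded value is present
    return 0.5       # neither matched: defer to semantic grading

deterministic_score("budget increased from $1.2M to $1.4M", ["$1.4M"], ["$1.2M"])
```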
metacognition_grader.py: Grades reasoning traces on 4 dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Effort Calibration | 25% | Proportional effort to question difficulty |
| Sufficiency Judgment | 30% | Correct assessment of when enough info gathered |
| Search Quality | 25% | Ratio of useful results to total queries |
| Self Correction | 20% | Detects and fixes errors in reasoning |
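Given the weights above, the combined score is presumably a weighted sum over the four dimensions; a sketch:

```python
# Dimension weights from the table above.
WEIGHTS = {
    "effort_calibration": 0.25,
    "sufficiency_judgment": 0.30,
    "search_quality": 0.25,
    "self_correction": 0.20,
}

def metacognition_score(dims: dict[str, float]) -> float:
    """Weighted sum of the four per-dimension scores (each 0.0-1.0)."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

metacognition_score({
    "effort_calibration": 1.0,
    "sufficiency_judgment": 0.8,
    "search_quality": 0.9,
    "self_correction": 1.0,
})  # 0.25 + 0.24 + 0.225 + 0.20 = 0.915
```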
Evaluation Runners¶
Progressive Test Suite (progressive_test_suite.py)¶
The primary evaluation runner. Runs all levels sequentially with per-level memory isolation.
# Single run
python -m amplihack.eval.progressive_test_suite --sdk mini --levels L1 L2 L3
# Parallel runs (3-run median for stability)
python -m amplihack.eval.progressive_test_suite --runs 3 --sdk mini
# Multi-vote grading (3 votes per question)
python -m amplihack.eval.progressive_test_suite --grader-votes 3 --sdk mini
# Recommended: 3-run median + 3-vote grading
python -m amplihack.eval.progressive_test_suite --runs 3 --grader-votes 3 --sdk mini
# Advanced levels
python -m amplihack.eval.progressive_test_suite --advanced --sdk mini
# All levels
python -m amplihack.eval.progressive_test_suite \
--levels L1 L2 L3 L4 L5 L6 L8 L9 L10 L11 L12 --sdk mini
CLI options:
| Option | Description | Default |
|---|---|---|
| --output-dir | Directory for results | ./eval_progressive |
| --agent-name | Agent name (memory isolation) | progressive-test-agent |
| --levels | Specific levels (L1-L12) | L1-L6 |
| --advanced | Include L8-L10 | Off |
| --memory-backend | Memory backend | amplihack-memory-lib |
| --parallel N | Run N times, report medians | Off |
| --runs N | Alias for --parallel | Off |
| --sdk | SDK type: mini, claude, copilot, microsoft | mini |
| --grader-votes N | Grading votes per question (1=single, 3=majority) | 1 |
Key classes:
- ProgressiveConfig: SDK, levels, output dir, grader votes
- ProgressiveResult: Per-level results and overall scores
- ParallelResult: Median scores across multiple runs
SDK Eval Loop (sdk_eval_loop.py)¶
Runs improvement loops for one or more SDKs, tracking score progression and generating prompt tuning recommendations:
# Single SDK
python -m amplihack.eval.sdk_eval_loop --sdks mini --loops 5
# All SDKs comparison (4-way)
python -m amplihack.eval.sdk_eval_loop --all-sdks --loops 3
# Specific levels
python -m amplihack.eval.sdk_eval_loop --sdks mini claude --loops 3 --levels L1 L2 L3
Each loop iteration:
- Runs L1-L6 eval for the SDK
- Analyzes failures to identify weak levels
- Generates SDK-specific prompt tuning recommendations
- Re-evaluates to measure improvement
Output includes:
- Per-SDK score progression across iterations
- Failure analysis with recommendations
- Per-level comparison table
- Ranked SDK comparison by best overall score
CLI options:
| Option | Description | Default |
|---|---|---|
| --sdks | SDKs to evaluate (space-separated) | mini |
| --all-sdks | Evaluate all 4 SDKs | Off |
| --loops | Improvement loops per SDK | 5 |
| --levels | Levels to run | L1-L6 |
| --output-dir | Output directory | ./eval_sdk_loop |
Self-Improvement Runner (self_improve/runner.py)¶
Closed-loop improvement: EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE.
# Basic usage
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 5
# With all options
python -m amplihack.eval.self_improve.runner \
--sdk mini \
--iterations 5 \
--improvement-threshold 2.0 \
--regression-tolerance 5.0 \
--levels L1 L2 L3 L4 L5 L6 \
--output-dir ./eval_results/self_improve \
--dry-run
6 phases per iteration:
- EVAL: Run the progressive test suite to get baseline scores.
- ANALYZE: Classify failures using the error taxonomy (10 failure modes). Maps each failure to the specific code component responsible.
- RESEARCH: For each proposed fix, state a hypothesis, gather evidence from eval results and failure patterns, consider counter-arguments (regression risk, stochasticity, cross-level impact), and make a reasoned decision: apply, skip, or defer. All research decisions are logged to research_decisions.json.
- IMPROVE: Apply approved changes. The patch proposer (self_improve/patch_proposer.py) generates specific code changes as unified diffs. Each PatchProposal includes: target_file, hypothesis, description, diff, expected_impact (per-category score delta), risk_assessment, and confidence (0.0-1.0). A PatchHistory is maintained to avoid re-proposing reverted/rejected patches. Proposals go through reviewer voting (self_improve/reviewer_voting.py), where three perspectives (quality, regression, simplicity) vote and the majority determines the outcome. A challenge phase forces the proposer to defend the change.
- RE-EVAL: Run the eval again on all levels to measure impact.
- DECIDE: Commit if net improvement >= +2% overall and no single level regresses > 5%. Revert if any level regresses beyond tolerance.
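The DECIDE rule maps onto a small function; scores below are percentages and the defaults mirror --improvement-threshold 2.0 and --regression-tolerance 5.0 (a sketch, not the runner's actual code):

```python
def decide(baseline: dict[str, float], after: dict[str, float],
           improvement_threshold: float = 2.0,
           regression_tolerance: float = 5.0) -> str:
    """Sketch of the DECIDE phase over per-level percentage scores."""
    # Revert if any single level regresses beyond tolerance.
    if any(after[lvl] < baseline[lvl] - regression_tolerance for lvl in baseline):
        return "revert"
    # Commit only on sufficient net (average) improvement.
    overall_delta = (sum(after.values()) - sum(baseline.values())) / len(baseline)
    return "commit" if overall_delta >= improvement_threshold else "defer"

decide({"L1": 83.0, "L3": 90.0}, {"L1": 88.0, "L3": 92.0})  # avg +3.5% -> "commit"
```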
CLI options:
| Option | Description | Default |
|---|---|---|
| --sdk | SDK type | mini |
| --iterations | Max improvement iterations | 5 |
| --improvement-threshold | Min % improvement to commit | 2.0 |
| --regression-tolerance | Max % regression on any level | 5.0 |
| --levels | Levels to evaluate | L1-L6 |
| --output-dir | Output directory | ./eval_results/self_improve |
| --agent-name | Agent name for memory isolation | self-improve-agent |
| --dry-run | Evaluate only, don't apply changes | Off |
Domain Agent Eval (domain_eval_harness.py)¶
Evaluates 5 domain-specific agents using the DomainEvalHarness:
from amplihack.eval.domain_eval_harness import DomainEvalHarness
from amplihack.agents.domain_agents.code_review.agent import CodeReviewAgent
harness = DomainEvalHarness(CodeReviewAgent("test"))
report = harness.run()
print(f"Overall: {report.overall_score:.0%}")
5 domain agents (defined in src/amplihack/agents/domain_agents/):
| Agent | Domain | Tools |
|---|---|---|
| CodeReviewAgent | Code quality, security, style | analyze_code, check_style, detect_security_issues, suggest_improvements |
| MeetingSynthesizerAgent | Meeting transcripts | extract_action_items, generate_summary, identify_decisions, identify_topics |
| DataAnalysisAgent | Dataset analysis | (domain-specific) |
| DocumentCreatorAgent | Documentation generation | (domain-specific) |
| ProjectPlanningAgent | Task breakdown and estimates | (domain-specific) |
Each agent provides its own eval levels with scenarios and rubrics. Domain eval uses deterministic grading (pattern matching, field checking) rather than LLM grading.
Combined scoring: Domain agents can use a combined score: 60% domain-specific eval + 40% teaching eval.
Long-Horizon Memory Eval (long_horizon_memory.py)¶
1000-turn memory stress test with 12 information blocks (including security domain):
# Quick test
python -m amplihack.eval.long_horizon_memory --turns 100 --questions 20
# Full stress test (best score: 98.9%)
python -m amplihack.eval.long_horizon_memory --turns 1000 --questions 100
# With SDK agent
python -m amplihack.eval.long_horizon_memory --sdk claude --turns 500 --questions 50
# Large-scale with subprocess segmentation (prevents OOM on 5000+ turns)
python -m amplihack.eval.long_horizon_memory --turns 5000 --questions 200 --segment-size 100
Subprocess segmentation (--segment-size N): For 5000+ turn evaluations, native memory from Kuzu C++ and aiohttp accumulates ~3MB/turn. The --segment-size flag splits the learning phase into subprocess segments of N turns each. Each subprocess loads its turn slice, learns it into the shared on-disk DB, then exits -- freeing all native memory. After all segments complete, the questioning phase runs against the accumulated DB. Default (--segment-size 0) runs everything in one process.
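The slicing itself is straightforward; a sketch of how --segment-size N could partition the learning phase (helper name illustrative):

```python
def segment_turns(num_turns: int, segment_size: int) -> list[range]:
    """Split the learning phase into per-subprocess turn slices.
    segment_size == 0 keeps everything in one process."""
    if segment_size <= 0:
        return [range(0, num_turns)]
    return [range(start, min(start + segment_size, num_turns))
            for start in range(0, num_turns, segment_size)]

segments = segment_turns(5000, 100)  # 50 slices of 100 turns
# Each slice would be learned in a fresh subprocess that writes to the
# shared on-disk DB and exits, releasing native (Kuzu/aiohttp) memory.
```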
12 information blocks with proportional turn allocation:
| Block | Name | % Turns | What It Tests | Recent Score |
|---|---|---|---|---|
| 1 | People | 5% | Personal details, relationships | ~95% |
| 2 | Projects | 10% | Project metadata with evolving updates | ~98% |
| 3 | Technical | 10% | Domain-specific knowledge facts | ~97% |
| 4 | Evolving Story | 15% | Narrative with corrections over time | 99.8% |
| 5 | Numerical | 10% | Exact number recall and arithmetic | ~96% |
| 6 | Contradictory | 8% | Conflicting sources with different claims | ~94% |
| 7 | Callbacks | 6% | References to facts from earlier blocks | ~97% |
| 8 | Distractors | 6% | Irrelevant noise the agent must resist | ~98% |
| 9 | Security Logs | 10% | CVEs, access logs, authentication events | 100% |
| 10 | Incidents | 8% | Incident reports, post-mortems, RCAs | ~95% |
| 11 | Infrastructure | 7% | Server inventory, network configuration | ~97% |
| 12 | Problem Solving | 5% | Multi-step reasoning tasks (code gen) | ~96% |
Blocks 9-12 (security domain) together account for 30% of turns.
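Using the percentages from the table, turn allocation for a run can be sketched as follows (block keys and rounding policy are illustrative):

```python
# Per-block turn percentages from the table above (sums to 100).
BLOCK_PERCENTS = {
    "people": 5, "projects": 10, "technical": 10, "evolving_story": 15,
    "numerical": 10, "contradictory": 8, "callbacks": 6, "distractors": 6,
    "security_logs": 10, "incidents": 8, "infrastructure": 7, "problem_solving": 5,
}

def allocate_turns(total_turns: int) -> dict[str, int]:
    """Proportional allocation; rounding remainder goes to the largest block."""
    alloc = {b: total_turns * p // 100 for b, p in BLOCK_PERCENTS.items()}
    alloc["evolving_story"] += total_turns - sum(alloc.values())
    return alloc

allocate_turns(1000)["security_logs"]  # 100 turns for Block 9 at 1000 turns
```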
Security Domain Improvements (2026-02-28):
Block 9 (Security Logs) now achieves 100% accuracy due to:

- Entity-linked retrieval for CVE IDs and incident references
- Improved conflict detection for superseded vulnerabilities
- Better aggregation of security event chains across contexts
Data generation is deterministic from long_horizon_data.py (same seed = same dialogue). When the amplihack-agent-eval package is installed, data types are imported from the package.
Tests memory at scale with:
- Deterministic dialogue generation (reproducible with seed)
- 12 topic domains covering both narrative and structured operational data
- Questions spanning different recency windows and complexity levels
- LLM-graded scoring on 5 dimensions per question:
- Factual accuracy
- Completeness
- Recency (uses most recent info)
- Source attribution
- Coherence
Matrix Eval Across SDKs (matrix_eval.py)¶
5-way agent comparison using the long-horizon eval:
# Full 5-way matrix
python -m amplihack.eval.matrix_eval --turns 500 --questions 50
# Specific agents
python -m amplihack.eval.matrix_eval --agents mini claude --turns 100 --questions 20
# With markdown report
python -m amplihack.eval.matrix_eval --turns 500 --questions 50 \
--report-path Specs/MATRIX_EVAL_REPORT.md
5 agent configurations:
| Name | SDK | Multi-Agent | Spawning | What It Tests |
|---|---|---|---|---|
| mini | mini | No | No | LearningAgent directly (no SDK wrapper) |
| claude | claude | No | No | ClaudeGoalSeekingAgent via SDK factory |
| copilot | copilot | No | No | CopilotGoalSeekingAgent via SDK factory |
| microsoft | microsoft | No | No | MicrosoftGoalSeekingAgent via SDK factory |
| multiagent-copilot | copilot | Yes | Yes | MultiAgentLearningAgent with spawning |
Key design: dialogue and questions are generated once and shared across all agents. Each agent uses a separate storage/DB path. SDK agents that fail to instantiate (missing SDK package) are skipped gracefully.
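The shared-ground-truth loop can be sketched as below; make_agent and grade are hypothetical stand-ins for the agent factory and grading steps:

```python
# Agent configurations from the table above.
AGENT_CONFIGS = ["mini", "claude", "copilot", "microsoft", "multiagent-copilot"]

def run_matrix(make_agent, dialogue, questions, grade):
    """Evaluate every config against ONE shared dialogue/question set."""
    results = {}
    for name in AGENT_CONFIGS:
        try:
            # Separate storage path per agent; same ground truth for all.
            agent = make_agent(name, db_path=f"./matrix_db/{name}")
        except ImportError:
            continue  # SDK package not installed: skip, don't fail the matrix
        for turn in dialogue:
            agent.learn(turn)
        results[name] = grade(agent, questions)
    return results
```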
Output: per-agent reports, shared ground truth, matrix comparison JSON, and a markdown report with category-by-category scores, best-performer-per-category, and overall ranking.
Long-Horizon Self-Improvement (long_horizon_self_improve.py)¶
Automated improvement loop targeting the long-horizon eval. Uses patch proposer and reviewer voting for safe iteration:
python -m amplihack.eval.long_horizon_self_improve \
--turns 100 --questions 20 --iterations 3
# With multi-agent architecture
python -m amplihack.eval.long_horizon_self_improve \
--turns 100 --questions 20 --multi-agent
8-stage cycle per iteration:
- EVAL: Run long-horizon eval to get per-category scores.
- ANALYZE: Identify worst-performing category.
- PROPOSE: Patch proposer generates a unified diff with hypothesis, expected impact (per-category delta), risk assessment, and confidence score.
- CHALLENGE: Devil's advocate arguments against the proposed patch.
- VOTE: Three reviewer perspectives (quality, regression, simplicity) vote accept/reject/modify with rationale. Majority vote determines outcome.
- APPLY: If accepted, apply the patch and git commit.
- RE-EVAL: Run eval again to measure impact.
- DECIDE: If regression > 5% on any category, auto-revert via git. If net improvement >= 2%, keep. Otherwise, keep with marginal note.
The PatchHistory object tracks all applied, reverted, and rejected patches to prevent repeating failed fixes across iterations.
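A minimal sketch of that bookkeeping (field names assumed, not the actual PatchHistory API):

```python
from dataclasses import dataclass, field

@dataclass
class PatchHistory:
    """Track patch outcomes so failed fixes are not re-proposed."""
    applied: list[str] = field(default_factory=list)
    reverted: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)

    def already_tried(self, hypothesis: str) -> bool:
        # Proposer consults this before generating a new diff.
        return hypothesis in self.reverted or hypothesis in self.rejected

history = PatchHistory()
history.reverted.append("widen L3 retrieval window")
history.already_tried("widen L3 retrieval window")  # True: proposer skips it
```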
Meta-Eval Experiment (meta_eval_experiment.py)¶
Self-referential test: an agent learns about the eval system itself and teaches it to a student.
- Deterministic knowledge base (from the EVAL_KNOWLEDGE_BASE constant, derived from actual source code)
- Uses TeachingSession for multi-turn dialogue
- Uses MetacognitionGrader for student evaluation
- JSON report output
Error Analysis Taxonomy¶
The FAILURE_TAXONOMY in error_analyzer.py maps symptoms to root causes:
| Failure Mode | Description | Code Component |
|---|---|---|
| retrieval_insufficient | Not enough facts retrieved | agentic_loop.py::_plan_retrieval |
| temporal_ordering_wrong | Wrong temporal arithmetic | learning_agent.py::_synthesize_with_llm |
| intent_misclassification | Wrong intent detection | learning_agent.py::_detect_intent |
| fact_extraction_incomplete | Missing facts in memory | learning_agent.py::_extract_facts_with_llm |
| synthesis_hallucination | Invented information | learning_agent.py::_synthesize_with_llm |
| update_not_applied | Used outdated data | hierarchical_memory.py::_detect_supersedes |
| contradiction_undetected | Missed conflicting sources | learning_agent.py::_detect_intent + synthesis |
| procedural_ordering_lost | Steps out of sequence | learning_agent.py::_extract_facts_with_llm |
| teaching_coverage_gap | Student not taught key facts | teaching_session.py::_teacher_respond |
| counterfactual_refusal | Refused hypothetical reasoning | learning_agent.py::_synthesize_with_llm |
General Capability Evaluation¶
Beyond memory, the system evaluates 5 general-purpose capabilities via general_capability_eval.py:
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval tool_use,planning,reasoning
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval all --sdk mini
PYTHONPATH=src python -m amplihack.eval.general_capability_eval --eval planning --output-dir /tmp/cap-eval
| Eval Type | What It Tests | Scenarios |
|---|---|---|
| Tool Use Efficiency | Correct tool selection, chaining, call economy | 4 |
| Planning | Multi-step task decomposition and execution | 3 |
| Reasoning Under Uncertainty | Handling incomplete/conflicting evidence | 3 |
| Cross-Domain Transfer | Applying learned patterns to new domains | 2 |
| Collaborative Task | Multi-agent delegation and synthesis | 2 |
Each eval type defines scenarios with gold-standard expectations, runs the agent through them, and grades using rubric-based LLM evaluation.
How to Add New Eval Levels¶
1. Define the test content in test_levels.py¶
L13_LEVEL = TestLevel(
    level_id="L13",
    level_name="Analogical Reasoning",
    description="Apply learned analogies to novel situations",
    articles=[
        {"title": "Source Domain", "content": "..."},
        {"title": "Target Domain", "content": "..."},
    ],
    questions=[
        {
            "question": "Based on the source domain pattern, what would happen in the target domain?",
            "expected_answer": "The expected analogical inference...",
            "level": "L13",
        },
    ],
)
2. Register the level¶
Add your level to the appropriate list in test_levels.py:
# For standard learn-then-test levels:
STANDARD_LEVELS = [..., L13_LEVEL]
# For levels needing special handling (like L7 teaching):
CUSTOM_LEVELS = [L13_LEVEL]
3. Add level-specific grading criteria (if needed)¶
In grader.py, add criteria to _build_grading_prompt():
elif level == "L13":
    level_criteria = (
        "\n\nL13 ANALOGICAL REASONING grading criteria:\n"
        "- Award 0.9-1.0 if the agent correctly maps the source pattern to the target.\n"
        "- Award 0.6-0.8 if the mapping is partially correct.\n"
    )
4. Add the level to CLI choices¶
In progressive_test_suite.py, add "L13" to the --levels choices list.
5. Run and validate¶
PYTHONPATH=src python -m amplihack.eval.progressive_test_suite \
--output-dir /tmp/eval_l13 \
--levels L13
How to Add New Domain Agents¶
1. Create the agent directory¶
src/amplihack/agents/domain_agents/my_domain/
    __init__.py
    agent.py           # DomainAgent subclass
    tools.py           # Domain-specific tool functions
    prompts/
        system.txt     # System prompt
2. Implement the agent¶
Subclass DomainAgent and implement: _register_tools(), execute_task(), get_eval_levels(), teach(), and get_system_prompt().
from amplihack.agents.domain_agents.base import DomainAgent, EvalLevel, TaskResult

class MyDomainAgent(DomainAgent):
    def __init__(self, agent_name="my_agent", model="gpt-4o-mini", skill_injector=None):
        super().__init__(agent_name=agent_name, domain="my_domain", model=model, skill_injector=skill_injector)

    def _register_tools(self):
        self.executor.register_action("my_tool", my_tool_function)

    def execute_task(self, task: dict) -> TaskResult:
        result = self.executor.execute("my_tool", **task)
        return TaskResult(success=True, output=result.output)

    def get_eval_levels(self) -> list[EvalLevel]:
        return [
            EvalLevel(
                level_id="L1",
                name="Basic Task",
                scenarios=[EvalScenario(...)],
                passing_threshold=0.7,
            ),
        ]

    def teach(self, topic, student_level="beginner"):
        return TeachingResult(...)

    def get_system_prompt(self) -> str:
        return "You are an expert in my domain."
3. Run the eval¶
from amplihack.eval.domain_eval_harness import DomainEvalHarness
harness = DomainEvalHarness(MyDomainAgent("test"))
report = harness.run()
print(f"Overall: {report.overall_score:.0%}")
Key Design Decisions¶
- Subprocess Isolation: Each eval level runs in a separate subprocess with its own memory database. This prevents cross-level contamination and makes results reproducible.
- LLM-Based Grading: Uses Claude Sonnet 4.5 as the grading model rather than exact-match scoring. This handles semantic equivalence (e.g., "26 medals" vs "twenty-six medals") and partial credit.
- Multi-Vote Grading: When --grader-votes 3 is set, each answer is graded 3 times independently and the median is taken. This reduces noise on ambiguous answers where a single grader call might score 0.4 or 0.9.
- 3-Run Medians: Single runs are unreliable due to LLM stochasticity. Running 3 times and taking median scores gives stable, reproducible results.
- Level-Specific Rubrics: Grading prompts include level-specific criteria because different cognitive tasks need different scoring rules (e.g., L3 temporal reasoning vs L5 contradiction detection).
- Research Step in Self-Improvement: Rather than blindly applying fixes, the runner requires hypothesis + evidence + counter-arguments before any change. This prevents regressions from overenthusiastic tuning.
- Error Taxonomy: Rather than just reporting scores, the error analyzer maps failures to specific code components. This makes the self-improvement loop actionable rather than merely diagnostic.
- Deterministic Data: Test data is generated deterministically with seeds, enabling meaningful comparison across runs and SDK versions.
- Per-SDK Prompt Tuning: Each SDK has dedicated eval prompt templates in src/amplihack/agents/goal_seeking/prompts/sdk/, allowing SDK-specific instruction tuning without modifying shared agent code.
Best Practices for Running Evals¶
- Use --runs 3 for production eval results. Single runs have high variance on levels L2, L5, L9, L10, and L11.
- Use --grader-votes 3 for final benchmarks. Multi-vote grading reduces noise on ambiguous answers.
- Use --sdk mini for development iteration (fastest). Switch to the specific SDK for final comparison.
- Check that ANTHROPIC_API_KEY is set before running. The grader requires it.
- Monitor output directories for disk usage. Each eval run generates JSON logs per level.
- Use --dry-run with the self-improvement runner to preview what changes would be made before applying them.
- Run --levels L1 L2 L3 to iterate quickly on specific levels rather than running all 12 every time.
Package Architecture (amplihack-agent-eval)¶
The evaluation framework has been extracted into a standalone package (amplihack-agent-eval) that can be used independently of the full amplihack codebase. The amplihack repo depends on this package for its core eval types and data generation, and provides amplihack-specific adapter implementations.
Installation¶
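Assuming the package is distributed under its stated name (availability on PyPI is not confirmed here), installation would presumably look like:

```shell
# Standalone package (name taken from this page; adjust the source if it
# is installed from a local checkout rather than an index).
pip install amplihack-agent-eval

# Inside the amplihack repo, the optional eval dependency described below
# would be pulled in via an extra (extra name assumed):
pip install -e ".[eval]"
```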
Package Relationship¶
amplihack-agent-eval (standalone) amplihack (this repo)
================================ ==========================
amplihack_eval/ src/amplihack/eval/
adapters/base.py <-- AgentAdapter agent_adapter.py (amplihack adapters)
core/runner.py <-- EvalRunner compat.py (re-exports)
core/grader.py <-- grade_answer long_horizon_memory.py (uses package)
data/long_horizon.py <-- data gen long_horizon_data.py (local copy)
self_improve/ self_improve/ (local copy)
patch_proposer.py patch_proposer.py
reviewer_voting.py reviewer_voting.py
How It Works¶
- amplihack-agent-eval provides the core eval types: AgentAdapter, AgentResponse, ToolCall, EvalRunner, grade_answer, GradeResult, data generation (Turn, Question, GroundTruth, GradingRubric, generate_dialogue, generate_questions), and self-improvement primitives (PatchProposal, propose_patch, ReviewVote, ReviewResult, vote_on_proposal).
- amplihack adds the optional eval dependency in pyproject.toml.
- long_horizon_memory.py imports data types from the package:

      from amplihack_eval.data.long_horizon import (
          GradingRubric, GroundTruth, Question,
          generate_dialogue, generate_questions,
      )

  If the package is not installed, an ImportError is raised with installation instructions.
- compat.py re-exports all key types from the standalone package, so existing code can from amplihack.eval.compat import EvalRunner. If the package is not installed, compat.py is a no-op (no errors).
- agent_adapter.py provides three adapter implementations that wrap amplihack agents for the package's AgentAdapter interface:
  - AmplihackLearningAgentAdapter -- wraps LearningAgent
  - AmplihackMultiAgentAdapter -- wraps MultiAgentLearningAgent
  - AmplihackSDKAgentAdapter -- wraps any SDK-backed GoalSeekingAgent
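The no-op compat behavior described above follows a standard optional-import pattern; a sketch (flag name illustrative):

```python
# Re-export from the standalone package when available; stay importable
# (with no errors) when it is not -- mirroring compat.py's behavior.
try:
    from amplihack_eval.core.runner import EvalRunner  # type: ignore
    HAS_EVAL_PACKAGE = True
except ImportError:
    EvalRunner = None  # callers check for None (or the flag) before use
    HAS_EVAL_PACKAGE = False
```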
Using the Adapters¶
from amplihack_eval.core.runner import EvalRunner
from amplihack.eval.agent_adapter import AmplihackLearningAgentAdapter
# Create adapter wrapping amplihack's LearningAgent
adapter = AmplihackLearningAgentAdapter("eval-agent", model="gpt-4o-mini")
# Run evaluation
runner = EvalRunner(num_turns=100, num_questions=20, seed=42)
report = runner.run(adapter)
print(f"Score: {report.overall_score:.2%}")
adapter.close()
File Map¶
src/amplihack/eval/
__init__.py # Public API exports
compat.py # Re-exports from amplihack-agent-eval package
agent_adapter.py # Amplihack-specific AgentAdapter implementations
test_levels.py # L1-L12 level definitions (articles + questions)
progressive_test_suite.py # Main runner (single + parallel + multi-vote)
agent_subprocess.py # Subprocess isolation for learning/testing (SDK routing)
teaching_subprocess.py # Subprocess for teacher-student dialogue
grader.py # LLM semantic grading with multi-vote
metacognition_grader.py # 4-dimension metacognition scoring
multi_source_collector.py # News article collection
quiz_generator.py # Quiz question generation
harness_runner.py # Original 4-level harness
sdk_eval_loop.py # Multi-SDK eval comparison loop
domain_eval_harness.py # Generic harness for domain agents
run_domain_evals.py # Run all 5 domain agent evals
teaching_session.py # Teacher-student session framework
teaching_eval.py # Teaching quality evaluation
long_horizon_data.py # Deterministic dialogue generation (12 blocks)
long_horizon_memory.py # 1000-turn stress test (uses eval package data types)
long_horizon_self_improve.py # Long-horizon self-improvement loop (8 phases)
matrix_eval.py # 5-way matrix evaluation across SDKs
meta_eval_experiment.py # Meta-eval experiment framework
five_agent_experiment.py # 5-agent experiment runner
QUICK_START.md # Quick start guide
self_improve/
__init__.py
error_analyzer.py # Failure taxonomy + classification (10 modes)
runner.py # Closed-loop self-improvement (6 phases)
patch_proposer.py # LLM-generated code patches with hypotheses
reviewer_voting.py # 3-perspective review + majority voting
src/amplihack/agents/goal_seeking/
prompts/sdk/ # Per-SDK eval prompt templates
copilot_eval.md
claude_eval.md
microsoft_eval.md
goal_seeking_system.md
learning_task.md
synthesis_template.md
teaching_system.md
src/amplihack/agents/domain_agents/
base.py # DomainAgent ABC
code_review/agent.py # Code Review agent
meeting_synthesizer/agent.py # Meeting Synthesizer agent
data_analysis/agent.py # Data Analysis agent
document_creator/agent.py # Document Creator agent
project_planning/agent.py # Project Planning agent
tests/eval/
test_eval_compat.py # Compat layer + agent adapter tests (32 tests)
test_grader.py # Grader unit tests
test_harness_runner.py # Harness runner tests
test_metacognition_grader.py # Metacognition grader tests
test_teaching_session.py # Teaching session tests
test_progressive_suite.py # Progressive suite tests
test_meta_eval_experiment.py # Meta-eval tests
test_multi_source_collector.py # Collector tests
test_quiz_generator.py # Quiz generator tests
test_long_horizon_memory.py # Long-horizon memory tests
Related Documentation¶
- Goal-Seeking Agents -- End-to-end guide: generation, capabilities, evaluation, self-improvement
- SDK Adapters Guide -- Deep dive into each SDK backend
- Quick Start -- Get running in 30 seconds
- Eval Grading Improvements -- Tutorial on fixing grader false negatives and implementing advanced retrieval (NEW: 2026-02-28)
- Eval Retrieval Reference -- API reference for entity-linked and multi-entity retrieval methods (NEW: 2026-02-28)