Architecture¶
Overview¶
amplihack-agent-eval is a modular evaluation framework for goal-seeking AI agents. It measures memory recall, tool use, planning, and reasoning across progressive difficulty levels (L1--L16) and long-horizon memory stress tests (up to 5000 dialogue turns). The package is designed to be agent-agnostic: any system that can learn content and answer questions is evaluable through the AgentAdapter interface.
High-Level Architecture¶
                      +---------------------+
                      |         CLI         |
                      | amplihack_eval.cli  |
                      +----------+----------+
                                 |
         +-----------------------+-----------------------+
         |                       |                       |
+--------v---------+  +----------v--------+  +-----------v-------+
|    EvalRunner    |  | SelfImproveRunner |  |  MultiSeedRunner  |
|  (core/runner)   |  |  (self_improve/)  |  | (core/multi_seed) |
+--------+---------+  +----------+--------+  +-----------+-------+
         |                       |                       |
         +-----------------------+-----------------------+
                                 |
                        +--------v--------+
                        |     Grader      |
                        | (core/grader +  |
                        |  runner inline) |
                        +--------+--------+
                                 |
                        +--------v--------+
                        | Data Generation |
                        |     (data/)     |
                        +-----------------+

+----------------------+          +-----------------------+
|     AgentAdapter     |          |   Multi-Agent Eval    |
|  (adapters/base.py)  |          |  (multi_agent_eval/)  |
+----+------+------+---+          | Coordinator, Grader,  |
     |      |      |              | Adversary, Analyst,   |
   HTTP  Subproc  Learning        | Pipeline              |
 Adapter Adapter  AgentAdapter    +-----------------------+
Package Layout¶
src/amplihack_eval/
    __init__.py                      # Public API: EvalRunner, AgentAdapter, AgentResponse, ...
    cli.py                           # CLI entry point (amplihack-eval command)
    adapters/                        # Agent adapter layer
        __init__.py
        base.py                      # AgentAdapter ABC, AgentResponse, ToolCall
        http_adapter.py              # REST API adapter (POST /learn, /answer, /reset)
        subprocess_adapter.py        # CLI subprocess adapter (stdin/stdout)
        learning_agent.py            # amplihack LearningAgent wrapper
    core/                            # Evaluation engine
        __init__.py
        runner.py                    # EvalRunner, LevelResult, SuiteResult, run_level, run_suite
        grader.py                    # Standalone grade_answer() with multi-vote support
        multi_seed.py                # Multi-seed holdout evaluation with variance analysis
    data/                            # Test data generation
        __init__.py                  # Re-exports all data modules
        long_horizon.py              # 5000-turn dialogue generator (12 blocks, 15 categories)
        progressive_levels.py        # L1-L12 Python-defined levels (articles + questions)
        tool_use_scenarios.py        # L13 tool selection scenarios
        forgetting_scenarios.py      # L14 selective forgetting scenarios
        adversarial_scenarios.py     # L15 adversarial recall scenarios
        decision_scenarios.py        # L16 decision-from-memory scenarios
    levels/                          # Level schema, loader, and scoring modules
        __init__.py                  # Convenience re-exports
        schema.py                    # LevelDefinition, QuestionTemplate, ScoringConfig
        loader.py                    # YAML-driven level loader
        L13_tool_selection.py        # Tool selection scoring logic
        L14_selective_forgetting.py  # Selective forgetting scoring logic
        L15_adversarial_recall.py    # Adversarial recall scoring logic
        L16_decision_from_memory.py  # Decision-from-memory scoring logic
    self_improve/                    # Automated self-improvement loop
        __init__.py
        runner.py                    # 8-phase self-improvement orchestrator
        patch_proposer.py            # LLM-powered patch generation with history
        reviewer_voting.py           # Devil's advocate + 3-reviewer A/B voting
    multi_agent_eval/                # Multi-agent evaluation pipeline
        __init__.py
        coordinator.py               # EvalCoordinator
        grader_agent.py              # GraderAgent with perspective-based grading
        adversary_agent.py           # AdversaryAgent for hard question generation
        analyst_agent.py             # AnalystAgent for failure analysis
        pipeline.py                  # MultiAgentEvalPipeline end-to-end orchestrator
tests/
    test_adapters.py                 # AgentAdapter interface + concrete adapter tests
    test_data_generation.py          # Data generator + progressive level tests
recipes/                             # YAML recipes (future, currently .gitkeep)
Core Components¶
1. AgentAdapter (adapters/base.py)¶
The central abstraction. Four methods define the complete contract:
+---------------------+
|    AgentAdapter     |   (abstract base class)
+---------------------+
| + learn(content)    |   Feed content for memorization
| + answer(question)  |   Ask a question, get AgentResponse
| + reset()           |   Reset state between runs
| + close()           |   Clean up resources
+---------------------+
| capabilities        |   set[str] - what the agent can do
| name                |   str - human-readable name
+---------------------+
AgentResponse captures more than just the answer text. It includes:
- tool_calls: list[ToolCall] -- the agent's tool use trajectory
- reasoning_trace: str -- chain-of-thought or reasoning log
- confidence: float -- self-reported confidence
- metadata: dict -- arbitrary key-value pairs (e.g., latency, model name)
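To make the contract concrete, here is a toy in-memory adapter. It is an illustrative sketch, not code from the package: the `EchoAdapter` class, its naive recall strategy, and the stand-in `AgentResponse` dataclass (including the `answer` field name) are assumptions; only the method and attribute names follow the contract described above.

```python
from dataclasses import dataclass, field

# Stand-in for the real AgentResponse in adapters/base.py (field names follow
# the description above; the answer field name is an assumption).
@dataclass
class AgentResponse:
    answer: str
    tool_calls: list = field(default_factory=list)
    reasoning_trace: str = ""
    confidence: float = 0.0
    metadata: dict = field(default_factory=dict)

class EchoAdapter:
    """Toy adapter satisfying the four-method contract with verbatim recall."""
    name = "echo"
    capabilities = {"learn", "answer"}

    def __init__(self):
        self._memory = []

    def learn(self, content: str) -> None:
        self._memory.append(content)  # memorize each turn verbatim

    def answer(self, question: str) -> AgentResponse:
        # Naive recall: return the most recent learned line sharing a word
        # with the question.
        for line in reversed(self._memory):
            if any(w.lower() in line.lower() for w in question.split()):
                return AgentResponse(answer=line, confidence=0.5)
        return AgentResponse(answer="I don't know.", confidence=0.0)

    def reset(self) -> None:
        self._memory.clear()

    def close(self) -> None:
        pass  # no resources to release in this toy
```

Any class with these four methods (plus `capabilities` and `name`) is enough for the runner to drive a full evaluation.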
Built-in adapters:
| Adapter | Communication | Use case |
|---|---|---|
| HttpAdapter | REST API | Any agent with HTTP endpoints |
| SubprocessAdapter | stdin/stdout | Any CLI-invokable agent |
| LearningAgentAdapter | Direct import | amplihack's LearningAgent |
2. EvalRunner (core/runner.py)¶
Orchestrates the full evaluation pipeline: generate data, feed to agent, quiz, grade.
EvalRunner
|
|-- generate() -> (GroundTruth, list[Question])
|-- run_dialogue() -> feeds all turns to agent.learn()
|-- evaluate() -> quizzes agent, grades answers
|-- run() -> all three steps in sequence
The runner also supports YAML-driven level evaluation through run_level() and run_suite(), which load level definitions, feed articles to the agent, and grade with the level's scoring configuration.
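A level file consumed by run_level() might look roughly like the following. Every field name below is a guess inferred from the schema class names (LevelDefinition, QuestionTemplate, ScoringConfig), not the documented file format:

```yaml
# Hypothetical level definition -- field names are illustrative assumptions.
id: L3
name: temporal_updates
articles:
  - "In January the service ran on host-a. In March it moved to host-b."
questions:
  - text: "Which host runs the service now?"
    expected: "host-b"
    rubric_keywords: ["host-b"]
scoring:
  dimensions: [factual_accuracy, temporal_awareness]
  votes: 3
```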
Key data flow:
generate_dialogue(num_turns, seed)
|
v
GroundTruth # Turns with embedded facts, entity tracking
|
v
generate_questions(gt, num_questions)
|
v
list[Question] # Questions with expected answers, rubrics, categories
|
v
agent.learn(turn.content) # Feed each turn
|
v
agent.answer(question.text) # Quiz the agent
|
v
_grade_multi_vote() # Hybrid grading with multi-vote stability
|
v
EvalReport # Aggregate scores by category and dimension
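The flow above can be condensed into a short loop. This is an illustrative sketch, not EvalRunner's actual implementation; the question dict shape and the `grade` callback are assumptions:

```python
def run_eval(agent, dialogue, questions, grade):
    """Mirror the generate -> learn -> answer -> grade flow (illustrative)."""
    agent.reset()                         # fresh state for this run
    for turn in dialogue:
        agent.learn(turn)                 # feed each turn
    report = []
    for q in questions:
        response = agent.answer(q["text"])  # quiz the agent
        report.append({"question": q["text"],
                       "score": grade(response, q)})
    return report
```

Anything satisfying the adapter contract can be dropped in as `agent`, which is what makes the runner agent-agnostic.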
3. Grading System (core/runner.py inline + core/grader.py)¶
Two complementary grading subsystems:
Runner-integrated grading (in runner.py):
- _deterministic_grade() -- regex/keyword matching against rubrics
- _grade_with_llm() -- LLM semantic evaluation on multiple dimensions
- _grade_hybrid() -- deterministic for rubric-compatible dimensions, LLM for the rest
- _grade_multi_vote() -- runs hybrid grading N times, takes median per dimension
Standalone grader (in grader.py):
- grade_answer() -- independent grading function with level-specific criteria
- Level-aware grading prompts (L3 temporal, L5 contradiction, L9 causal, etc.)
- Multi-vote support with median aggregation
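The multi-vote aggregation used by both subsystems reduces to a per-dimension median over N grading passes. A minimal sketch (illustrative, not the actual grade_answer code):

```python
from statistics import median

def aggregate_votes(votes):
    """Combine N grading votes (each a dict of dimension -> score)
    by taking the median score per dimension."""
    return {dim: median(v[dim] for v in votes) for dim in votes[0]}
```

The median is robust to a single outlier vote, which is the point of running the LLM grader more than once.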
Grading dimensions:
| Dimension | Deterministic? | Description |
|---|---|---|
| factual_accuracy | Yes | Does the answer match key facts? |
| specificity | Yes | Does it include names, numbers, dates? |
| temporal_awareness | LLM only | Current vs. historical value distinction |
| source_attribution | LLM only | Correct source labeling |
| confidence_calibration | LLM only | Appropriate uncertainty expression |
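The deterministic dimensions amount to rubric matching. A sketch of the idea, not the real _deterministic_grade in runner.py:

```python
import re

def deterministic_grade(answer, keywords, patterns=()):
    """Score one dimension as the fraction of rubric keywords and regex
    patterns found in the answer (illustrative rubric matching)."""
    text = answer.lower()
    hits = sum(kw.lower() in text for kw in keywords)       # keyword hits
    hits += sum(bool(re.search(p, answer)) for p in patterns)  # regex hits
    total = len(keywords) + len(patterns)
    return hits / total if total else 0.0
```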
4. Data Generation (data/)¶
Long-horizon generator (long_horizon.py):
- Produces 100--5000 turns of structured dialogue
- 12 information blocks (people, projects, technical, evolving stories, numerical, contradictions, callbacks, distractors, security logs, incidents, infrastructure, problem-solving)
- Deterministic: same seed produces identical output
- Ground truth tracking: every fact is recorded with its delivery turn
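The determinism property follows from seeding a local RNG and filling templates from it. The templates below are invented for illustration; the real generator's content and structure are richer:

```python
import random

def generate_dialogue(num_turns, seed):
    """Template-based turns from a seeded RNG -- no LLM involved.
    Names and topics here are illustrative placeholders."""
    rng = random.Random(seed)  # local RNG: no global-state interference
    names = ["Alice", "Bob", "Carol"]
    topics = ["the migration", "the outage", "the budget"]
    return [f"Turn {i}: {rng.choice(names)} mentioned {rng.choice(topics)}."
            for i in range(num_turns)]
```

Because the RNG is constructed from the seed, the same seed always yields byte-identical output, which is what makes results reproducible.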
Progressive levels (progressive_levels.py):
- Hand-crafted L1--L12 with curated articles and questions
- Each level is a TestLevel dataclass with articles, questions, and metadata
Extended scenarios (L13--L16):
- tool_use_scenarios.py -- tool selection/chaining scenarios with expected tool sequences
- forgetting_scenarios.py -- fact update scenarios testing stale data handling
- adversarial_scenarios.py -- plausible-but-wrong questions testing hallucination resistance
- decision_scenarios.py -- fact recall + reasoning + decision scenarios
5. Self-Improvement Loop (self_improve/)¶
An 8-phase automated improvement cycle:
EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE -> APPLY -> RE-EVAL -> DECIDE
  ^                                                                       |
  +-----------------------------------------------------------------------+
                           (iterate up to N times)
Three cooperating modules:
- runner.py -- orchestrates the 8 phases, manages iteration state, detects regression
- patch_proposer.py -- LLM-powered analysis of failures, generates unified diffs
- reviewer_voting.py -- devil's advocate challenge + 3-reviewer (quality, regression, simplicity) voting
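The loop's decision logic can be sketched as a keep-or-revert control loop. This is illustrative; the real orchestrator in self_improve/runner.py is considerably more involved:

```python
def self_improve(evaluate, propose, apply_patch, revert, max_iters=3):
    """Keep a patch only if the re-evaluated score does not regress."""
    baseline = evaluate()                 # EVAL
    for _ in range(max_iters):
        patch = propose(baseline)         # ANALYZE + PROPOSE
        apply_patch(patch)                # APPLY (after challenge + vote)
        score = evaluate()                # RE-EVAL
        if score < baseline:
            revert(patch)                 # DECIDE: auto-revert on regression
        else:
            baseline = score              # DECIDE: keep the improvement
    return baseline
```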
6. Multi-Seed Evaluation (core/multi_seed.py)¶
Runs the same eval across multiple random seeds (default: 42, 123, 456, 789) to:
- Measure inter-seed variance for each category
- Flag noisy questions (>10 percentage point variance)
- Compute confidence intervals (mean +/- stddev)
Each seed gets a fresh agent instance to avoid cross-contamination.
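A sketch of the variance analysis, assuming per-category score lists; the real module also flags individual noisy questions:

```python
from statistics import mean, stdev

def seed_summary(scores_by_seed, noise_threshold=0.10):
    """scores_by_seed maps category -> list of per-seed scores.
    Returns mean +/- stddev per category and flags wide spreads
    (>10 percentage points by default)."""
    summary = {}
    for category, scores in scores_by_seed.items():
        mu, sigma = mean(scores), stdev(scores)
        summary[category] = {
            "mean": mu,
            "stddev": sigma,
            "interval": (mu - sigma, mu + sigma),   # mean +/- stddev
            "noisy": max(scores) - min(scores) > noise_threshold,
        }
    return summary
```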
7. Multi-Agent Evaluation Pipeline (multi_agent_eval/)¶
An advanced evaluation pipeline using specialized agent roles:
- GraderAgent -- grades from specific perspectives (quality, regression, simplicity)
- AdversaryAgent -- generates difficult questions targeting known weaknesses
- AnalystAgent -- analyzes results and proposes improvements
- EvalCoordinator -- orchestrates the multi-agent pipeline
- MultiAgentEvalPipeline -- end-to-end pipeline with adversarial rounds
Design Principles¶
- Agent-agnostic: The AgentAdapter interface makes any agent evaluable. No assumptions about the agent's internal architecture.
- Deterministic data generation: Same seed always produces identical dialogues and questions. No LLM is needed for data generation -- all content is template-based.
- Hybrid grading: Deterministic rubrics for fast, cheap, reproducible scoring of factual accuracy and specificity; LLM semantic judgment for nuanced dimensions (temporal awareness, source attribution, confidence calibration).
- Multi-vote stability: Grading N times and taking the median reduces LLM noise. For deterministic dimensions, multi-vote has zero overhead (the result is identical every time).
- Progressive difficulty: L1 (simple recall) through L16 (decision-from-memory). Each level isolates a specific cognitive capability.
- Safety-gated self-improvement: The self-improvement loop never modifies the grader, test data, or safety constraints. Auto-revert on regression protects existing quality.
- Reproducibility: Full configuration logging, JSON report output, and seeded generation ensure any result can be reproduced.
- Zero external dependencies for core: The core package has no required dependencies. LLM grading requires anthropic (optional). The LearningAgent adapter requires amplihack (optional).
Data Flow Summary¶
        +-----------+
        |   User    |
        +-----+-----+
              |
   CLI (amplihack-eval run)
   or Python API (EvalRunner)
              |
        +-----v------+
        | EvalRunner |
        +-----+------+
              |
      +-------+----------------------+
      |                              |
+-----v-------------+    +-----------v--------+
| generate_dialogue |    | generate_questions |
|  (long_horizon)   |    |   (long_horizon)   |
+-----+-------------+    +-----------+--------+
      |                              |
  GroundTruth                 list[Question]
  (turns with facts)          (with rubrics)
      |                              |
+-----v----------+                   |
| agent.learn()  |                   |
|   (N turns)    |                   |
+-----+----------+                   |
      |                              |
+-----v------------------------------v-----+
|          agent.answer(question)          |
+-----+------------------------------------+
      |
+-----v----------------+
| _grade_hybrid()      |   deterministic + LLM
| _grade_multi_vote()  |   N votes, median
+-----+----------------+
      |
+-----v----------------+
|      EvalReport      |
| (scores by category  |
|  and dimension)      |
+----------------------+
Environment Variables¶
| Variable | Purpose | Default |
|---|---|---|
| ANTHROPIC_API_KEY | Required for LLM grading | (none -- grading disabled) |
| GRADER_MODEL | LLM model for grading | claude-sonnet-4-5-20250929 |
| EVAL_MODEL | LLM model for LearningAgent adapter | claude-sonnet-4-5-20250929 |