Eval System Architecture

Type: Explanation (Understanding-Oriented)

Comprehensive guide to the evaluation and self-improvement infrastructure for goal-seeking agents. Covers the progressive test suite (L1-L12), long-horizon memory testing, multi-SDK evaluation, and the self-improvement loop.

Overview

The eval system is a multi-layered framework that tests agent learning and reasoning capabilities across 12 progressively harder levels. It supports multiple SDK implementations, includes a self-improvement loop with a patch proposer and reviewer voting, and provides domain-specific evaluation for specialized agents.

Architecture

+------------------------------------------------------------------+
|                    EVALUATION ENTRY POINTS                         |
+------------------------------------------------------------------+
|                                                                    |
|  progressive_test_suite     sdk_eval_loop       run_domain_evals  |
|  (L1-L12 single/parallel)  (multi-SDK loop)    (domain agents)   |
|                                                                    |
|  self_improve/runner        long_horizon_memory                   |
|  (closed-loop improvement)  (1000-turn stress test, 12 blocks)   |
|                                                                    |
+------------------------------------------------------------------+
                        |                |
                        v                v
+------------------------------------------------------------------+
|                    CORE EVAL PIPELINE                              |
+------------------------------------------------------------------+
|                                                                    |
|  1. DATA LAYER              2. AGENT LAYER                        |
|  - test_levels (L1-L12)     - subprocess isolation                |
|  - TestArticle/Question     - learning phase                      |
|  - long_horizon_data        - testing phase                       |
|    (12-block generation)    - SDK routing                         |
|                                                                    |
|  3. GRADING LAYER                                                 |
|  - LLM semantic grading     - metacognition grading               |
|  - multi-vote scoring       - level-specific rubrics              |
|  - deterministic fallback   - 4-dimension scoring                 |
|                                                                    |
+------------------------------------------------------------------+
                        |
                        v
+------------------------------------------------------------------+
|                    ANALYSIS & IMPROVEMENT                          |
+------------------------------------------------------------------+
|                                                                    |
|  error_analyzer (10 failure modes)                                |
|  patch_proposer (LLM-generated diffs)                             |
|  reviewer_voting (3-perspective review)                           |
|                                                                    |
|  EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE ->               |
|    APPLY -> RE-EVAL -> DECIDE                                     |
|                                                                    |
+------------------------------------------------------------------+

Progressive Test Suite (L1-L12)

Level	Name	Description
L1	Direct Recall	Single source direct recall
L2	Multi-Source Synthesis	Combine facts across sources
L3	Temporal Reasoning	Track changes over time
L4	Procedural Learning	Learn and apply procedures
L5	Contradiction Handling	Resolve conflicting information
L6	Incremental Learning	Accumulate knowledge over time
L7	Teacher-Student Transfer	Teach learned knowledge to another
L8	Analogical Reasoning	Apply patterns to novel domains
L9	Meta-Cognitive	Reason about own knowledge gaps
L10	Adversarial	Resist misleading information
L11	Long-Range Dependencies	Connect distant facts
L12	Open-Ended Synthesis	Generate novel insights

Each level runs in a separate subprocess with its own memory database, preventing cross-level contamination.
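The isolation described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `EVAL_MEMORY_DB` environment variable, the `run_level_isolated` helper, and the child command are hypothetical stand-ins. The point is only that each level gets a fresh interpreter and a database path of its own.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a level runner: it only reports which database
# path it was handed. A real runner would execute the learning and testing phases.
CHILD_CODE = "import os, sys; print(sys.argv[1], os.environ['EVAL_MEMORY_DB'])"


def run_level_isolated(level: str, workdir: Path) -> str:
    """Run one level in a fresh interpreter with its own memory database path."""
    db_path = workdir / f"{level}_memory.db"  # one database file per level
    env = {**os.environ, "EVAL_MEMORY_DB": str(db_path)}  # assumed variable name
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, level],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the child process exits with an error
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for level in ["L1", "L2", "L3"]:
            print(run_level_isolated(level, Path(tmp)))
```

Because no state is shared between child processes, a fact learned at one level cannot leak into the answers for another.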

Long-Horizon Memory Testing

Stress-tests agent memory across 1,000 turns using 12 information blocks (including a security domain). Measures retention, update handling, and retrieval accuracy at scale.
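To make the scale concrete, here is a rough sketch of how per-block retention could be tallied. The turn and block counts come from the text; everything else is an assumption for illustration, including the contiguous block layout, the `block_for_turn` and `retention_by_block` helpers, and the shape of the results.

```python
from collections import Counter

TOTAL_TURNS = 1000  # from the text
NUM_BLOCKS = 12     # from the text


def block_for_turn(turn: int) -> int:
    """Map a 0-based turn index to a 0-based block index.

    Assumes blocks are contiguous and roughly equal in size; the real
    generator may lay out its blocks differently.
    """
    if not 0 <= turn < TOTAL_TURNS:
        raise ValueError(f"turn out of range: {turn}")
    return turn * NUM_BLOCKS // TOTAL_TURNS


def retention_by_block(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Compute the fraction of correct answers per block.

    Each result is (turn where the fact was introduced, answered correctly).
    """
    asked: Counter = Counter()
    correct: Counter = Counter()
    for turn, ok in results:
        block = block_for_turn(turn)
        asked[block] += 1
        correct[block] += int(ok)
    return {block: correct[block] / asked[block] for block in asked}
```

Grouping scores by the block in which a fact was introduced makes it visible whether early facts are retained as well as recent ones.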

Self-Improvement Loop

An 8-stage cycle per iteration:

1. EVAL: Run the evaluation to get per-category scores
2. ANALYZE: Identify the worst-performing category
3. PROPOSE: Generate a unified diff with a hypothesis and expected impact
4. CHALLENGE: Raise devil's-advocate arguments against the patch
5. VOTE: Collect three reviewer perspectives (quality, regression, simplicity)
6. APPLY: If accepted, apply the patch and commit
7. RE-EVAL: Run the evaluation again to measure impact
8. DECIDE: Auto-revert if any category regresses by more than 5%; keep the patch if net improvement is at least 2%
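The DECIDE rule can be sketched as a small function. Only the two thresholds come from the text. The rest is assumed for illustration: scores are taken to be on a 0 to 1 scale, percentages are read as percentage points, and net improvement is read as the mean change across categories. The text does not say what happens when neither condition holds, so this sketch returns a separate `"undecided"` value rather than guessing.

```python
REGRESSION_LIMIT = 0.05  # "regression > 5% on any category"
KEEP_THRESHOLD = 0.02    # "net improvement >= 2%"


def decide(before: dict[str, float], after: dict[str, float]) -> str:
    """Return 'revert', 'keep', or 'undecided' for a re-evaluated patch."""
    deltas = {cat: after[cat] - before[cat] for cat in before}

    # Any single category dropping by more than the limit forces a revert.
    if any(delta < -REGRESSION_LIMIT for delta in deltas.values()):
        return "revert"

    # Assumption: net improvement is the mean per-category change.
    net = sum(deltas.values()) / len(deltas)
    if net >= KEEP_THRESHOLD:
        return "keep"

    # Not covered by the documented rule; left to the caller.
    return "undecided"
```

Checking regressions first means a large gain in one category cannot mask a serious loss in another.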

PatchHistory tracks all applied, reverted, and rejected patches to prevent repeating failed fixes.
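One way such a history could prevent repeated failures is to fingerprint each diff and look it up before proposing it again. The class below is a hypothetical sketch, not the actual PatchHistory implementation. `PatchHistorySketch`, `fingerprint`, and the outcome strings are all illustrative names.

```python
import hashlib
from dataclasses import dataclass, field

FAILED_OUTCOMES = {"reverted", "rejected"}
VALID_OUTCOMES = FAILED_OUTCOMES | {"applied"}


def fingerprint(diff: str) -> str:
    """Stable identifier for a patch, based on its diff text."""
    return hashlib.sha256(diff.strip().encode("utf-8")).hexdigest()


@dataclass
class PatchHistorySketch:
    """Illustrative record of patch outcomes keyed by diff fingerprint."""

    outcomes: dict[str, str] = field(default_factory=dict)

    def record(self, diff: str, outcome: str) -> None:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.outcomes[fingerprint(diff)] = outcome

    def already_failed(self, diff: str) -> bool:
        """True if this exact diff was previously reverted or rejected."""
        return self.outcomes.get(fingerprint(diff)) in FAILED_OUTCOMES
```

An exact-match fingerprint only catches identical diffs. Detecting near-duplicate patches would need a looser comparison, which is outside this sketch.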

Error Analysis Taxonomy

Failure Mode	Description
retrieval_insufficient	Not enough facts retrieved
temporal_ordering_wrong	Wrong temporal arithmetic
intent_misclassification	Wrong intent detection
fact_extraction_incomplete	Missing facts in memory
synthesis_hallucination	Invented information
update_not_applied	Used outdated data
contradiction_undetected	Missed conflicting sources
procedural_ordering_lost	Steps out of sequence
teaching_coverage_gap	Student not taught key facts
counterfactual_refusal	Refused hypothetical reasoning
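The ten labels above map naturally onto an enumeration. The sketch below shows one possible representation plus a tally of the most frequent mode. How the analyzer assigns a label to a failed answer is not described here, so `most_common_failure` is a hypothetical helper that only counts labels that have already been assigned.

```python
from collections import Counter
from enum import Enum


class FailureMode(str, Enum):
    RETRIEVAL_INSUFFICIENT = "retrieval_insufficient"
    TEMPORAL_ORDERING_WRONG = "temporal_ordering_wrong"
    INTENT_MISCLASSIFICATION = "intent_misclassification"
    FACT_EXTRACTION_INCOMPLETE = "fact_extraction_incomplete"
    SYNTHESIS_HALLUCINATION = "synthesis_hallucination"
    UPDATE_NOT_APPLIED = "update_not_applied"
    CONTRADICTION_UNDETECTED = "contradiction_undetected"
    PROCEDURAL_ORDERING_LOST = "procedural_ordering_lost"
    TEACHING_COVERAGE_GAP = "teaching_coverage_gap"
    COUNTERFACTUAL_REFUSAL = "counterfactual_refusal"


def most_common_failure(labels: list[str]) -> FailureMode:
    """Return the failure mode that appears most often in a list of labels."""
    if not labels:
        raise ValueError("no failure labels to analyze")
    # FailureMode(label) raises ValueError for labels outside the taxonomy.
    counts = Counter(FailureMode(label) for label in labels)
    return counts.most_common(1)[0][0]
```

Rejecting unknown labels early keeps typos from silently creating an eleventh category.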

General Capability Evaluation

Beyond memory, 5 general-purpose capabilities are evaluated:

Eval Type	What It Tests	Scenarios
Tool Use Efficiency	Correct tool selection, chaining, economy	4
Planning	Multi-step task decomposition	3
Reasoning Under Uncertainty	Handling incomplete/conflicting evidence	3
Cross-Domain Transfer	Applying patterns to new domains	2
Collaborative Task	Multi-agent delegation and synthesis	2

Key Design Decisions

Subprocess Isolation: Each eval level runs in a separate subprocess with its own memory database, preventing cross-level contamination and improving reproducibility.
LLM-Based Grading: Uses semantic grading rather than exact-match scoring, so equivalent answers such as "26 medals" and "twenty-six medals" are treated alike and partial credit is possible.
Multi-Vote Grading: Each answer is graded 3 times independently; taking the median reduces noise on ambiguous answers.
3-Run Medians: Single runs are unreliable due to LLM stochasticity. Running the suite 3 times and taking the median score gives more stable results.
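Both median steps are simple to express. The sketch below uses Python's standard `statistics.median`. The function names and the score layout (a level-to-score mapping per run) are assumptions for illustration.

```python
from statistics import median


def multi_vote_score(votes: list[float]) -> float:
    """Median of independent grader votes for a single answer."""
    if not votes:
        raise ValueError("at least one vote is required")
    return median(votes)


def median_over_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Per-level median across repeated runs of the suite."""
    if not runs:
        raise ValueError("at least one run is required")
    levels = runs[0].keys()
    return {level: median(run[level] for run in runs) for level in levels}
```

With three votes, one outlying grade cannot move the result: votes of 1.0, 0.0, and 1.0 yield 1.0. The same holds for one unusually good or bad run out of three.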

amplihack-rs Integration

In amplihack-rs, the eval system is accessed via:

# Run progressive test suite
amplihack agent-eval --levels L1,L2,L3

# Run with specific SDK
amplihack agent-eval --sdk mini --output-dir ./eval-results

The crates/amplihack-agent-eval/ crate provides the Rust evaluation harness, while the Python eval scripts in amplifier-bundle/ handle LLM-based grading.
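When driving the CLI from a script, the argument list can be assembled programmatically. The helper below is a hypothetical convenience, not part of amplihack. It uses only the flags shown in the commands above, and it assumes, without confirmation from this page, that those flags can be combined in a single invocation.

```python
from typing import Optional


def build_agent_eval_command(
    levels: Optional[list[str]] = None,
    sdk: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `amplihack agent-eval` from optional settings."""
    cmd = ["amplihack", "agent-eval"]
    if levels:
        cmd += ["--levels", ",".join(levels)]  # e.g. L1,L2,L3
    if sdk:
        cmd += ["--sdk", sdk]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.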

See Evaluation Framework for the high-level framework overview and Run Agent Evaluations for step-by-step instructions.

Eval Grading Improvements — fixing grader false negatives
Eval Retrieval Reference — retrieval method specifications
Evaluation Framework — high-level framework overview