amplihack-agent-eval

Evaluation framework for goal-seeking AI agents. Tests memory recall, tool use, planning, and reasoning across progressive difficulty levels (L1-L12).

Key Features

Long-horizon memory stress tests -- Generates 1000+ turn dialogues with embedded facts, then quizzes the agent on details from various points in the conversation
Hybrid grading -- Combines deterministic rubric-keyword checks with LLM semantic judgment, using multiple grader votes for stability
Progressive difficulty levels -- L1 (simple recall) through L12 (far transfer reasoning)
Agent-agnostic -- Works with any agent through the AgentAdapter interface
Self-improvement loop -- Automated EVAL -> ANALYZE -> PROPOSE -> CHALLENGE -> VOTE -> APPLY -> RE-EVAL cycle
Multi-seed holdout -- Runs the eval across multiple random seeds to measure inter-seed variance
Hive mind evaluation -- Multi-agent topology comparison (flat, distributed DHT, federated)
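
The hybrid grading idea listed above can be illustrated with a small sketch. The names `keyword_score` and `hybrid_grade`, the 50/50 weighting, and the use of a median over votes are illustrative assumptions, not the package's actual API; the LLM judge is stubbed out as a plain callable so the example runs offline.

```python
from statistics import median
from typing import Callable, Sequence

# Hypothetical judge signature: (question, expected, answer) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def keyword_score(answer: str, rubric_keywords: Sequence[str]) -> float:
    """Deterministic part: fraction of rubric keywords found in the answer."""
    if not rubric_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in rubric_keywords if kw.lower() in text)
    return hits / len(rubric_keywords)


def hybrid_grade(
    question: str,
    expected: str,
    answer: str,
    rubric_keywords: Sequence[str],
    judge: Judge,
    votes: int = 3,
    llm_weight: float = 0.5,  # assumed weighting, chosen for illustration
) -> float:
    """Blend the keyword score with the median of several judge votes."""
    llm_votes = [judge(question, expected, answer) for _ in range(votes)]
    llm_score = median(llm_votes)  # median damps a single outlier vote
    det_score = keyword_score(answer, rubric_keywords)
    return (1 - llm_weight) * det_score + llm_weight * llm_score


def stub_judge(question: str, expected: str, answer: str) -> float:
    """Stand-in for an LLM call: exact substring match only."""
    return 1.0 if expected.lower() in answer.lower() else 0.0


good = hybrid_grade(
    "Where does Ana work?", "Contoso",
    "Ana works at Contoso in Lisbon", ["Contoso", "Lisbon"], stub_judge,
)
bad = hybrid_grade(
    "Where does Ana work?", "Contoso",
    "I don't know", ["Contoso", "Lisbon"], stub_judge,
)
print(good, bad)
```

Taking several votes matters only when the judge is non-deterministic, as a real LLM grader would be; with the stub every vote is identical.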

Running Evals

Guide	What It Covers
Running Evals Quick Start	All eval types on one page -- single-agent, 20-agent, and distributed
Long-Horizon Memory Eval	Single-agent eval: 15 question categories, grading system, dataset details
Hive Mind Eval Strategy	Multi-agent topologies, scoring methodology, four-layer architecture
Distributed Eval on Azure	Deploy agents to Azure, feed content, run eval, cleanup

Framework Documentation

Guide	Description
Architecture	Package layout, core concepts, and design principles
Evaluation Levels	Complete guide to all 12 progressive difficulty levels (L1-L12)
Writing Adapters	How to write custom `AgentAdapter` implementations
Self-Improvement Loop	Automated improvement cycle with safety gates
Multi-Agent Eval	Multi-agent evaluation scenarios

Installation

# Basic installation (data generation and adapters, no LLM grading)
pip install amplihack-agent-eval

# With Anthropic grading support
# (extras are quoted so shells such as zsh do not treat the brackets as a glob)
pip install "amplihack-agent-eval[anthropic]"

# Development
pip install "amplihack-agent-eval[dev]"

# Everything
pip install "amplihack-agent-eval[all,dev]"

Quick Start

1. Implement the AgentAdapter interface

from amplihack_eval import AgentAdapter, AgentResponse

class MyMemoryAgent(AgentAdapter):
    def __init__(self):
        self.memory = []

    def learn(self, content: str) -> None:
        self.memory.append(content)

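    def answer(self, question: str) -> AgentResponse:
        # Naive retrieval: keep any stored item that contains at least one
        # word from the question (case-insensitive substring match).
        words = question.lower().split()
        relevant = [m for m in self.memory
                    if any(w in m.lower() for w in words)]
        # Return up to three matching items, or a fallback when nothing matches.
        return AgentResponse(
            answer=" ".join(relevant[:3]) if relevant else "I don't know"
        )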

    def reset(self) -> None:
        self.memory.clear()

    def close(self) -> None:
        pass
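
Before handing an adapter to the runner, it can help to exercise its four methods by hand. The sketch below is self-contained: `AgentResponse` and `AgentAdapter` are minimal local stand-ins whose shapes are assumed from the example above, so it runs without the package installed. With the package available, import the real classes instead. `smoke_test` is a hypothetical helper name.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResponse:  # stand-in; the real class may carry more fields
    answer: str


class AgentAdapter(ABC):  # stand-in for the interface used above
    @abstractmethod
    def learn(self, content: str) -> None: ...

    @abstractmethod
    def answer(self, question: str) -> AgentResponse: ...

    @abstractmethod
    def reset(self) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class MyMemoryAgent(AgentAdapter):
    """Same toy agent as in the example above."""

    def __init__(self):
        self.memory = []

    def learn(self, content: str) -> None:
        self.memory.append(content)

    def answer(self, question: str) -> AgentResponse:
        words = question.lower().split()
        relevant = [m for m in self.memory
                    if any(w in m.lower() for w in words)]
        return AgentResponse(
            answer=" ".join(relevant[:3]) if relevant else "I don't know"
        )

    def reset(self) -> None:
        self.memory.clear()

    def close(self) -> None:
        pass


def smoke_test(agent: AgentAdapter) -> None:
    """Walk the learn -> answer -> reset -> close lifecycle once."""
    question = "When does the deploy key rotate?"
    agent.learn("The deploy key rotates every 30 days.")
    assert "30 days" in agent.answer(question).answer
    agent.reset()  # memory must be empty after a reset
    assert agent.answer(question).answer == "I don't know"
    agent.close()


smoke_test(MyMemoryAgent())
```

A check like this catches adapters whose `reset` leaves state behind, which would leak facts from one run into the next.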

2. Run an evaluation

from amplihack_eval import EvalRunner

agent = MyMemoryAgent()
runner = EvalRunner(num_turns=100, num_questions=20, grader_votes=3)
report = runner.run(agent)

print(f"Overall score: {report.overall_score:.2%}")
for cb in report.category_breakdown:
    print(f"  {cb.category}: {cb.avg_score:.2%}")
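
The per-category breakdown is useful for finding where an agent is weakest. The helper below is a sketch: `weakest_categories` is a hypothetical name, `CategoryScore` is a local stand-in that assumes only the `category` and `avg_score` attributes used in the loop above, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CategoryScore:  # stand-in for one entry of report.category_breakdown
    category: str
    avg_score: float


def weakest_categories(
    breakdown: Iterable[CategoryScore], threshold: float = 0.5
) -> List[CategoryScore]:
    """Return categories scoring below `threshold`, lowest score first."""
    below = [cb for cb in breakdown if cb.avg_score < threshold]
    return sorted(below, key=lambda cb: cb.avg_score)


# Made-up sample data; real category names come from the report.
sample = [
    CategoryScore("category_a", 0.42),
    CategoryScore("category_b", 0.91),
    CategoryScore("category_c", 0.18),
]

for cb in weakest_categories(sample):
    print(f"{cb.category}: {cb.avg_score:.2%}")
```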

3. CLI usage

# Run eval against an HTTP agent
amplihack-eval run --turns 100 --questions 20 --adapter http --agent-url http://localhost:8000

# Run eval with amplihack's LearningAgent
amplihack-eval run --turns 100 --questions 20 --adapter learning-agent

# Multi-seed comparison
amplihack-eval compare --seeds 42,123,456,789 --turns 100

# Self-improvement loop
amplihack-eval self-improve --iterations 5 --turns 100
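
Inter-seed variance from a multi-seed run can be summarised with a mean and a standard deviation. The sketch below assumes one overall score per seed has already been collected into a dict; `summarize_seeds` and the sample numbers are illustrative, and nothing here reflects how `amplihack-eval compare` itself reports results.

```python
from statistics import mean, stdev
from typing import Dict


def summarize_seeds(scores_by_seed: Dict[int, float]) -> Dict[str, float]:
    """Summarise overall scores collected from several seeded runs."""
    scores = list(scores_by_seed.values())
    return {
        "mean": mean(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Made-up scores keyed by the seeds from the CLI example above.
summary = summarize_seeds({42: 0.70, 123: 0.80, 456: 0.75, 789: 0.75})
print(f"mean={summary['mean']:.3f} stdev={summary['stdev']:.3f}")
```

A large standard deviation relative to the gap between two agents suggests the difference may be seed noise rather than a real improvement.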

Environment Variables

Variable	Purpose	Default
`ANTHROPIC_API_KEY`	Required for LLM grading	--
`GRADER_MODEL`	Model for grading	`claude-sonnet-4-5-20250929`
`EVAL_MODEL`	Model for LearningAgent adapter	`claude-sonnet-4-5-20250929`
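
A script that wraps the evaluator can resolve these settings up front and fail early when the key is missing. This sketch uses only the variable names and defaults from the table; `load_eval_settings` is a hypothetical helper, not part of the package.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_MODEL = "claude-sonnet-4-5-20250929"  # default listed in the table


def load_eval_settings(
    env: Optional[Mapping[str, str]] = None, require_key: bool = True
) -> Dict[str, str]:
    """Read grading settings from the environment, applying table defaults."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if require_key and not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; LLM grading needs it")
    return {
        "api_key": api_key,
        "grader_model": env.get("GRADER_MODEL", DEFAULT_MODEL),
        "eval_model": env.get("EVAL_MODEL", DEFAULT_MODEL),
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = load_eval_settings({"ANTHROPIC_API_KEY": "placeholder-key"})
print(settings["grader_model"])
```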

Contributing

# Clone the repository
git clone https://github.com/rysweet/amplihack-agent-eval.git
cd amplihack-agent-eval

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest tests/ -q

# Run linting
ruff check src/ tests/
ruff format --check src/ tests/

License

MIT