Benchmarking amplihack with eval-recipes¶
Complete guide for running Microsoft's eval-recipes benchmarks to measure amplihack performance.
Rust Port
The eval-recipes benchmarking system works with both the upstream Python amplihack and amplihack-rs. When benchmarking amplihack-rs, the agent config uses the Rust binary directly.
Quick Start¶
# 1. Clone eval-recipes
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes
cd ~/eval-recipes
# 2. Copy agent configs
cp -r /path/to/amplihack/.claude/agents/eval-recipes/* data/agents/
# 3. Set API key
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
# 4. Run benchmark
uv run scripts/run_benchmarks.py \
--agent-filter "name=amplihack" \
--task-filter "name=linkedin_drafting" \
--num-trials 1
Available Agents¶
| Agent | Description |
|---|---|
amplihack |
Baseline amplihack with full context and workflow orchestration |
claude_code |
Vanilla Claude Code for baseline comparison |
amplihack_pr1443_v2 |
amplihack with task classification fix (+36.5 point improvement) |
Common Tasks¶
# LinkedIn drafting (complex tool creation)
--task-filter "name=linkedin_drafting"
# Email drafting (CLI tool creation)
--task-filter "name=email_drafting"
# Multiple tasks
--task-filter "name=linkedin_drafting,email_drafting"
Command Reference¶
Basic Run¶
uv run scripts/run_benchmarks.py \
--agent-filter "name=AGENT_NAME" \
--task-filter "name=TASK_NAME" \
--num-trials 1
With Reports¶
uv run scripts/run_benchmarks.py \
--agent-filter "name=amplihack" \
--task-filter "name=linkedin_drafting" \
--num-trials 3 \
--generate-reports
Compare Multiple Agents¶
uv run scripts/run_benchmarks.py \
--agent-filter "name=amplihack,claude_code" \
--task-filter "name=linkedin_drafting" \
--num-trials 1
Results Location¶
Results saved to: ~/eval-recipes/.benchmark_results/YYYY-MM-DD_HH-MM-SS/
# Find latest score
find .benchmark_results -name "score.txt" -newer /tmp/test_start -exec cat {} \;
# View failure report
find .benchmark_results -name "FAILURE_REPORT*.md" | tail -1 | xargs less
# View HTML report
open .benchmark_results/latest/benchmark_report.html
Testing a PR Branch¶
To test a specific PR branch:
- Create new agent config in
~/eval-recipes/data/agents/amplihack_prXXXX/ -
Update
install.dockerfile: -
Run:
Proven Results¶
PR #1443 Validation (2025-11-19):
| Metric | Score |
|---|---|
| Baseline (main) | 6.5/100 (created skill instead of tool) |
| With fix (PR #1443 V2) | 43.0/100 (created executable tool) |
| Improvement | +36.5 points proven via actual benchmark |
Timing Expectations¶
| Phase | Duration |
|---|---|
| Docker build | ~60 seconds |
| Task execution | 5–15 minutes |
| Test scoring | 8–12 minutes |
| Total | 15–25 minutes per task |
Troubleshooting¶
"command not found: amplihack"
- Agent config uses wrong command template
- Should use:
IS_SANDBOX=1 claude -p "{{task_instructions}}"notamplihack claude
Import errors in generated code
- This is expected — eval tests both architecture and execution
- Architecture scores (30%) award partial credit for good design
- Execution scores (70%) require working code
Docker build fails
- Check Dockerfile syntax matches base.dockerfile patterns
- Do not add
RUN apt-get install(causes errors) - Use existing Claude Code from base image
References¶
- eval-recipes: https://github.com/microsoft/eval-recipes
- Agent configs:
~/.amplihack/.claude/agents/eval-recipes/in amplihack repo
Related Documentation¶
- Eval System Architecture — evaluation framework design
- Eval Grading Improvements — grading rubric details
- Eval Retrieval Methods — retrieval strategy reference