Current Validation Results¶
This page records the current verified validation snapshot for the distributed hive work and the commands used to reproduce it.
Accepted Azure Distributed Eval Results¶
These are the currently accepted full Azure runs for the live amplihive3175e path after the final code change that fixed the late infrastructure relation follow-up issue.
| Run | Score |
|---|---|
| 1 | 98.20% |
| 2 | 98.80% |
| 3 | 98.23% |
Key blocker questions that stayed fixed across the accepted set:
Q11 = 1.00Q12 = 1.00Q13 = 1.00Q40 = 1.00Q48 = 1.00Q49 = 1.00Q50 = 1.00
Accepted image tag:
q49-infrastructure-relation-followup-20260322T064641Z
Current Local Validation Slices¶
Main amplihack repo¶
Current wrapper-validation result:
64 passedruff checkpassed for the touched wrapper files
Reproduction command:
cd /path/to/amplihack
PYTHONPATH=/path/to/amplihack-agent-eval/src:/path/to/amplihack/src \
.venv/bin/python -m pytest -q tests/eval/test_long_horizon_memory.py
.venv/bin/ruff check \
src/amplihack/eval/long_horizon_memory.py \
src/amplihack/eval/long_horizon_multi_seed.py \
tests/eval/test_long_horizon_memory.py
amplihack-agent-eval repo¶
Current source-accurate validation result:
42 passed, 1 warningruff checkpassed for the touched eval filesbash -n run_distributed_eval.shpassed
Reproduction command:
cd /path/to/amplihack-agent-eval
uv run --with pytest --with ruff python -m pytest -q \
tests/test_data_generation.py \
tests/test_datasets.py
uv run --with ruff ruff check \
src/amplihack_eval/data/long_horizon.py \
src/amplihack_eval/core/runner.py \
src/amplihack_eval/core/multi_seed.py \
src/amplihack_eval/cli.py \
src/amplihack_eval/azure/eval_distributed.py \
tests/test_data_generation.py \
tests/test_datasets.py
bash -n run_distributed_eval.sh
Why the eval repo reproduction uses uv run¶
In this checkout, running the same CLI integration slice through .venv/bin/python -m pytest can pick up a stale installed amplihack-eval console script and fail the --question-set help assertion even though the source checkout is correct.
Use uv run for a source-accurate reproduction, or reinstall the repo in editable mode before relying on the amplihack-eval console script.
How to Reproduce the Accepted Azure Path¶
Full clean rerun from source¶
Run from the eval repo and let the wrapper redeploy current code:
cd /path/to/amplihack-agent-eval
export ANTHROPIC_API_KEY=...
export AMPLIHACK_SOURCE_ROOT=/path/to/amplihack
./run_distributed_eval.sh \
--agents 100 \
--turns 5000 \
--questions 50 \
--question-set standard
Reuse an already deployed hive¶
cd /path/to/amplihack-agent-eval
export ANTHROPIC_API_KEY=...
export AMPLIHACK_SOURCE_ROOT=/path/to/amplihack
export SKIP_DEPLOY=1
export HIVE_NAME=amplihive3175e
export HIVE_RESOURCE_GROUP=hive-pr3175-rg
./run_distributed_eval.sh \
--agents 100 \
--turns 5000 \
--questions 50 \
--question-set standard
If you need to reproduce the exact accepted image rather than the latest source tree, redeploy the same image tag before launching the wrapper.