Current Validation Results

This page records the current verified validation snapshot for the distributed hive work and the commands used to reproduce it.

Accepted Azure Distributed Eval Results

These are the currently accepted full Azure runs for the live amplihive3175e path after the final code change that fixed the late infrastructure relation follow-up issue.

Run   Score
1     98.20%
2     98.80%
3     98.23%
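
For a quick sense of how tightly the accepted runs cluster, the three scores above can be summarized with a short awk pass. This is only an illustrative sketch: the scores are copied from the table, and the summary statistics are not part of the accepted record.

```shell
# Scores copied from the table above; the summary itself is an illustration.
scores="98.20 98.80 98.23"

# Intentionally unquoted so each score lands on its own line for awk.
summary=$(printf '%s\n' $scores | awk '
  NR == 1 { min = $1; max = $1 }
  { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
  END { printf "mean=%.2f min=%.2f max=%.2f spread=%.2f", sum / NR, min, max, max - min }
')
echo "$summary"
```

The mean is 295.23 / 3 = 98.41, and the runs differ by at most 0.60 percentage points.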

Key blocker questions that stayed fixed across the accepted set:

  • Q11 = 1.00
  • Q12 = 1.00
  • Q13 = 1.00
  • Q40 = 1.00
  • Q48 = 1.00
  • Q49 = 1.00
  • Q50 = 1.00

Accepted image tag:

  • q49-infrastructure-relation-followup-20260322T064641Z

Current Local Validation Slices

Main amplihack repo

Current wrapper-validation result:

  • 64 passed
  • ruff check passed for the touched wrapper files

Reproduction commands:

cd /path/to/amplihack

PYTHONPATH=/path/to/amplihack-agent-eval/src:/path/to/amplihack/src \
.venv/bin/python -m pytest -q tests/eval/test_long_horizon_memory.py

.venv/bin/ruff check \
  src/amplihack/eval/long_horizon_memory.py \
  src/amplihack/eval/long_horizon_multi_seed.py \
  tests/eval/test_long_horizon_memory.py

amplihack-agent-eval repo

Current source-accurate validation result:

  • 42 passed, 1 warning
  • ruff check passed for the touched eval files
  • bash -n run_distributed_eval.sh passed

Reproduction commands:

cd /path/to/amplihack-agent-eval

uv run --with pytest --with ruff python -m pytest -q \
  tests/test_data_generation.py \
  tests/test_datasets.py

uv run --with ruff ruff check \
  src/amplihack_eval/data/long_horizon.py \
  src/amplihack_eval/core/runner.py \
  src/amplihack_eval/core/multi_seed.py \
  src/amplihack_eval/cli.py \
  src/amplihack_eval/azure/eval_distributed.py \
  tests/test_data_generation.py \
  tests/test_datasets.py

bash -n run_distributed_eval.sh

Why the eval repo reproduction uses uv run

In this checkout, running the same CLI integration slice through .venv/bin/python -m pytest can pick up a stale installed amplihack-eval console script. The --question-set help assertion then fails even though the source checkout is correct.

Use uv run for a source-accurate reproduction, or reinstall the repo in editable mode before relying on the amplihack-eval console script.
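
One way to see whether a stale console script is in play is to check where amplihack-eval resolves on PATH before trusting it. The sketch below assumes a POSIX shell; resolve_script is a hypothetical helper name, and the reinstall commands are left commented out because the right one depends on how the virtual environment was created.

```shell
# Hypothetical helper: print where a command resolves on PATH, or "not found".
resolve_script() {
  command -v "$1" 2>/dev/null || echo "not found"
}

# Compare the resolved location against the checkout you expect to be testing.
resolve_script amplihack-eval

# To refresh the console script, reinstall the eval repo in editable mode from
# its root. Pick whichever matches your environment (assumption: the repo is a
# standard installable Python project):
#   uv pip install -e .
#   .venv/bin/python -m pip install -e .
```

If the reported path points at an environment that was installed before the latest source changes, the console script may not reflect the current checkout.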

How to Reproduce the Accepted Azure Path

Full clean rerun from source

Run from the eval repo and let the wrapper redeploy current code:

cd /path/to/amplihack-agent-eval

export ANTHROPIC_API_KEY=...
export AMPLIHACK_SOURCE_ROOT=/path/to/amplihack

./run_distributed_eval.sh \
  --agents 100 \
  --turns 5000 \
  --questions 50 \
  --question-set standard

Reuse an already deployed hive

cd /path/to/amplihack-agent-eval

export ANTHROPIC_API_KEY=...
export AMPLIHACK_SOURCE_ROOT=/path/to/amplihack
export SKIP_DEPLOY=1
export HIVE_NAME=amplihive3175e
export HIVE_RESOURCE_GROUP=hive-pr3175-rg

./run_distributed_eval.sh \
  --agents 100 \
  --turns 5000 \
  --questions 50 \
  --question-set standard

If you need to reproduce the exact accepted image rather than the latest source tree, redeploy the same image tag before launching the wrapper.
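
Both invocations above depend on exported variables, so a small preflight check can catch a missing one before a long run starts. This is a sketch, not part of run_distributed_eval.sh: require_env is a hypothetical helper, and treating HIVE_NAME and HIVE_RESOURCE_GROUP as required whenever SKIP_DEPLOY=1 is an assumption based on the example above.

```shell
# Hypothetical preflight helper; not part of run_distributed_eval.sh.
# Reports each unset or empty variable on stderr and returns non-zero if any is missing.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Variables exported in both invocations above.
require_env ANTHROPIC_API_KEY AMPLIHACK_SOURCE_ROOT \
  || echo "export the variables listed above before launching" >&2

# Assumption: the reuse path also needs the hive name and resource group.
if [ "${SKIP_DEPLOY:-0}" = "1" ]; then
  require_env HIVE_NAME HIVE_RESOURCE_GROUP \
    || echo "the reuse path needs HIVE_NAME and HIVE_RESOURCE_GROUP" >&2
fi
```

Run the check in the same shell session that will launch the wrapper, since exported variables do not carry over between sessions.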