
Distributed Hive Evaluation

This repo owns the agent/runtime side of distributed evaluation.

  • deploy/azure_hive/ contains the Azure Container Apps and Event Hubs deployment assets
  • src/amplihack/eval/long_horizon_memory.py is the thin local wrapper used from this repo
  • src/amplihack/eval/long_horizon_multi_seed.py is the thin multi-seed wrapper used from this repo

The authoritative long-horizon dataset generator and Azure distributed runner live in the sibling amplihack-agent-eval repo.

Read This Next

  • Day-zero operator guide — exact commands for local wrappers, the eval CLI, Azure distributed runs, and Aspire flows
  • How the eval stack fits together — five-minute walkthrough of repo ownership, local versus distributed paths, Event Hubs, Container Apps, why Aspire is in C#, and how EH_CONN reaches runners without going through argv

Use This Repo For

  • changing the agent runtime
  • changing the Azure deployment shape
  • running the thin local wrappers while you are editing amplihack

Use amplihack-agent-eval For

  • authoritative long-horizon question generation
  • the Event Hubs distributed runner
  • packaged eval reports and rerun metadata
  • the end-to-end run_distributed_eval.sh wrapper

Local Wrapper: Single Run

The wrapper in this repo delegates to amplihack_eval, so put the src directory of both repos on PYTHONPATH when you are testing sibling checkouts.

PYTHONPATH=/path/to/amplihack-agent-eval/src:/path/to/amplihack/src \
python -m amplihack.eval.long_horizon_memory \
  --turns 100 \
  --questions 20 \
  --output-dir /tmp/eval-run
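
If you drive the wrapper from a script rather than a shell, the same invocation can be assembled in Python. This is a minimal sketch, not part of either repo: build_wrapper_invocation is a hypothetical helper, and only the module name and flags shown in the command above are assumed to exist.

```python
import os
import sys


def build_wrapper_invocation(eval_src, amplihack_src, turns, questions, output_dir):
    """Return (argv, env) for a single-run wrapper call.

    Hypothetical helper: it mirrors the shell command above and sets
    PYTHONPATH to exactly the two source roots, eval repo first.
    """
    env = dict(os.environ)
    # os.pathsep is ":" on POSIX and ";" on Windows.
    env["PYTHONPATH"] = os.pathsep.join([eval_src, amplihack_src])
    argv = [
        sys.executable, "-m", "amplihack.eval.long_horizon_memory",
        "--turns", str(turns),
        "--questions", str(questions),
        "--output-dir", output_dir,
    ]
    return argv, env
```

Passing the result to subprocess.run(argv, env=env, check=True) would launch the wrapper with both checkouts importable.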

Useful flags from this wrapper:

  • --sdk {mini,claude,copilot,microsoft}
  • --memory-type {auto,hierarchical,cognitive}
  • --answer-mode {single-shot,agentic}
  • --parallel-workers
  • --load-db and --skip-learning

If you need standard versus holdout question slices, use the amplihack-agent-eval CLI or distributed runner rather than the thin wrappers in this repo.

Local Wrapper: Multi-Seed Comparison

PYTHONPATH=/path/to/amplihack-agent-eval/src:/path/to/amplihack/src \
python -m amplihack.eval.long_horizon_multi_seed \
  --turns 100 \
  --questions 20 \
  --seeds 42,123,456,789 \
  --output-dir /tmp/eval-compare
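
A multi-seed run is usually read as a mean plus a spread across seeds. The sketch below shows one way to summarize per-seed scores once you have them. It assumes you have already extracted one numeric score per seed yourself; the wrapper's actual output format is not described here, and parse_seeds and summarize_scores are hypothetical helpers, not part of amplihack.

```python
from statistics import mean, stdev


def parse_seeds(raw):
    """Split a comma-separated --seeds value such as "42,123,456,789" into ints."""
    return [int(part) for part in raw.split(",") if part.strip()]


def summarize_scores(scores_by_seed):
    """Return mean, sample standard deviation, and range for per-seed scores.

    scores_by_seed maps seed -> score. How those scores are obtained from
    the run output is left to the caller (assumption).
    """
    values = list(scores_by_seed.values())
    if not values:
        raise ValueError("no per-seed scores to summarize")
    return {
        "seeds": sorted(scores_by_seed),
        "mean": mean(values),
        # stdev needs at least two points; report 0.0 for a single seed.
        "stdev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```

A large standard deviation relative to the difference between two configurations suggests the comparison needs more seeds before drawing conclusions.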

Distributed Azure Run

For real Azure distributed runs, switch to the sibling amplihack-agent-eval repo and use its wrapper or direct runner.

cd /path/to/amplihack-agent-eval

export ANTHROPIC_API_KEY=...
export AMPLIHACK_SOURCE_ROOT=/path/to/amplihack

./run_distributed_eval.sh \
  --agents 100 \
  --turns 5000 \
  --questions 50 \
  --question-set standard

This workflow drives the Azure deployment assets in this repo, while the harness and reporting stay centralized in amplihack-agent-eval.
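
Before starting a long distributed run, it can help to confirm that the two exported variables are set and that AMPLIHACK_SOURCE_ROOT points at a checkout containing deploy/azure_hive/. The following is a sketch of such a preflight check. The helper names are hypothetical, and run_distributed_eval.sh may perform its own validation.

```python
import os

# The two variables exported in the commands above.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "AMPLIHACK_SOURCE_ROOT")


def missing_env(environ=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def source_root_has_deploy_assets(source_root):
    """Check that the amplihack checkout contains deploy/azure_hive/."""
    return os.path.isdir(os.path.join(source_root, "deploy", "azure_hive"))
```

Calling missing_env() with no arguments inspects the current process environment; an empty list means both variables are present.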

Question Sets

Value      Meaning
standard   Canonical deterministic question slice
holdout    Alternate deterministic slice for anti-overfitting validation

holdout changes which questions are asked. It does not generate a second fact universe.
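
To make the distinction concrete, here is an illustrative sketch of two deterministic slices drawn from one shared question pool. It is not how amplihack-agent-eval selects questions: the hash-based split, the assumption that the slices are disjoint, and the names slice_for and select_questions are all invented for this example.

```python
import hashlib


def slice_for(question_id):
    """Assign a question to a slice using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process and would not be reproducible across runs.
    """
    digest = hashlib.sha256(question_id.encode("utf-8")).digest()
    return "holdout" if digest[0] % 2 else "standard"


def select_questions(question_ids, question_set):
    """Return the questions in the requested slice, preserving input order."""
    if question_set not in ("standard", "holdout"):
        raise ValueError(f"unknown question set: {question_set!r}")
    return [qid for qid in question_ids if slice_for(qid) == question_set]
```

Both calls read the same pool of question IDs, and therefore the same underlying facts; only the subset returned differs, and it is identical on every run.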

Important Checkout Note

This repo uses a src/ layout. If you run the wrappers with only PYTHONPATH=src, Python may resolve a globally installed amplihack_eval instead of the sibling checkout you are editing. Include both source roots explicitly when validating local changes.
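
One way to confirm which copy Python would import is to ask the import system directly. The sketch below uses importlib.util.find_spec from the standard library; module_origin and resolves_under are hypothetical helper names, and the check assumes amplihack_eval is a regular package with an __init__.py (a namespace package would report no origin).

```python
import importlib.util
import os


def module_origin(name):
    """Return the file Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


def resolves_under(name, expected_root):
    """True if `name` would be imported from somewhere inside expected_root."""
    origin = module_origin(name)
    if origin is None:
        return False
    root = os.path.realpath(expected_root)
    # commonpath equals root only when origin lives inside root.
    return os.path.commonpath([root, os.path.realpath(origin)]) == root
```

Run under the same PYTHONPATH as the wrapper, resolves_under("amplihack_eval", "/path/to/amplihack-agent-eval/src") returning False would indicate that an installed copy is shadowing the sibling checkout.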

Current Verified Results

A current validation snapshot, including the accepted Azure scores and the latest reproducible local test commands, lives here: