Distributed Hive Evaluation
This repo owns the agent/runtime side of the distributed eval story.
- deploy/azure_hive/ contains the Azure Container Apps and Event Hubs deployment assets
- src/amplihack/eval/long_horizon_memory.py is the thin local wrapper used from this repo
- src/amplihack/eval/long_horizon_multi_seed.py is the thin multi-seed wrapper used from this repo
The authoritative long-horizon dataset generator and Azure distributed runner live in the sibling amplihack-agent-eval repo.
Read This Next
- Day-zero operator guide: exact commands for local wrappers, the eval CLI, Azure distributed runs, and Aspire flows
- How the eval stack fits together: five-minute walkthrough of repo ownership, local versus distributed paths, Event Hubs, Container Apps, why Aspire is in C#, and how EH_CONN reaches runners without going through argv
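As a rough illustration of the last point, the sketch below shows the general pattern of handing a connection string to a child process through its environment instead of its argument list. It is not the runner's actual launch code: the function name `run_child_with_conn` and the inline child script are hypothetical, and only the variable name EH_CONN comes from this page.

```python
import os
import subprocess
import sys


def run_child_with_conn(conn: str) -> str:
    """Start a child Python process that receives EH_CONN via its environment.

    Hypothetical helper for illustration; the real runner may launch workers differently.
    """
    # Copy the parent environment and add the secret to the copy only.
    child_env = dict(os.environ, EH_CONN=conn)
    # The child reports its argv and the length of the secret it received.
    child_code = "import os, sys; print(sys.argv, len(os.environ['EH_CONN']))"
    result = subprocess.run(
        [sys.executable, "-c", child_code],  # the secret is not in the argument list
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The practical benefit of this pattern is that process listings and shell history show the command line but not the connection string.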
Use This Repo For
- changing the agent runtime
- changing the Azure deployment shape
- running the thin local wrappers while you are editing amplihack
Use amplihack-agent-eval For
- authoritative long-horizon question generation
- the Event Hubs distributed runner
- packaged eval reports and rerun metadata
- the end-to-end run_distributed_eval.sh wrapper
Local Wrapper: Single Run
The wrapper in this repo delegates to amplihack_eval, so include both repos on PYTHONPATH when you are testing sibling checkouts.
PYTHONPATH=/path/to/amplihack-agent-eval/src:/path/to/amplihack/src \
python -m amplihack.eval.long_horizon_memory \
--turns 100 \
--questions 20 \
--output-dir /tmp/eval-run
Useful flags from this wrapper:
- --sdk {mini,claude,copilot,microsoft}
- --memory-type {auto,hierarchical,cognitive}
- --answer-mode {single-shot,agentic}
- --parallel-workers
- --load-db and --skip-learning
If you need standard versus holdout question slices, use the amplihack-agent-eval CLI or distributed runner rather than the thin wrappers in this repo.
Local Wrapper: Multi-Seed Comparison
PYTHONPATH=/path/to/amplihack-agent-eval/src:/path/to/amplihack/src \
python -m amplihack.eval.long_horizon_multi_seed \
--turns 100 \
--questions 20 \
--seeds 42,123,456,789 \
--output-dir /tmp/eval-compare
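Once a multi-seed run finishes, the per-seed scores usually need to be summarized before they can be compared. The sketch below shows one way to do that, assuming you have already collected one overall score per seed. The wrapper's real output format is not described here, so the function names and the `scores_by_seed` mapping are illustrative placeholders.

```python
from __future__ import annotations

from statistics import mean, stdev


def parse_seeds(raw: str) -> list[int]:
    """Turn a comma-separated --seeds value such as '42,123,456,789' into integers."""
    return [int(part) for part in raw.split(",") if part.strip()]


def summarize(scores_by_seed: dict[int, float]) -> dict[str, float]:
    """Summarize per-seed scores; the sample stdev needs at least two seeds."""
    values = list(scores_by_seed.values())
    return {
        "mean": mean(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```

A wide spread between `min` and `max` across seeds suggests that a single-seed result should not be trusted on its own.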
Distributed Azure Run
For real Azure distributed runs, switch to the sibling amplihack-agent-eval repo and use its wrapper or direct runner.
cd /path/to/amplihack-agent-eval
export ANTHROPIC_API_KEY=...
export AMPLIHACK_SOURCE_ROOT=/path/to/amplihack
./run_distributed_eval.sh \
--agents 100 \
--turns 5000 \
--questions 50 \
--question-set standard
That path drives the Azure deployment assets from this repo, but the harness and reporting stay centralized in amplihack-agent-eval.
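Because a distributed run depends on two environment variables and on the deployment assets in this repo, a small preflight check can catch a bad setup before any Azure resources are touched. The sketch below is an assumption-laden illustration: the `preflight` helper is hypothetical, and the check that AMPLIHACK_SOURCE_ROOT contains deploy/azure_hive is inferred from the repo layout described on this page, not from the wrapper script itself.

```python
from __future__ import annotations

import os
from pathlib import Path

# Variables exported in the command sequence above.
REQUIRED_VARS = ("ANTHROPIC_API_KEY", "AMPLIHACK_SOURCE_ROOT")


def preflight(env=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    root = env.get("AMPLIHACK_SOURCE_ROOT")
    # Assumption: the source root should contain the Azure deployment assets.
    if root and not (Path(root) / "deploy" / "azure_hive").is_dir():
        problems.append(f"{root} has no deploy/azure_hive directory")
    return problems
```

Calling `preflight()` with no arguments inspects the current process environment, so it can be run right before invoking the wrapper script.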
Question Sets
| Value | Meaning |
|---|---|
| standard | Canonical deterministic question slice |
| holdout | Alternate deterministic slice for anti-overfitting validation |
holdout changes which questions are asked. It does not generate a second fact universe.
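To make the distinction concrete, the sketch below shows one way two deterministic slices could be drawn from a single shared question pool. It is purely illustrative and is not how amplihack-agent-eval is known to select questions: the per-set seeds, the `pick_questions` helper, and the placeholder pool are all assumptions.

```python
import random

# Hypothetical per-set seeds; the real runner may select its slices differently.
SLICE_SEEDS = {"standard": 0, "holdout": 1}


def pick_questions(pool, question_set, count):
    """Deterministically sample `count` questions for the named set from one shared pool."""
    rng = random.Random(SLICE_SEEDS[question_set])  # same set name, same slice every time
    return rng.sample(pool, count)


# Placeholder question IDs standing in for questions generated over one fact universe.
pool = [f"q{i:03d}" for i in range(100)]
standard = pick_questions(pool, "standard", 20)
holdout = pick_questions(pool, "holdout", 20)
```

Both slices come from the same `pool`, which mirrors the point that holdout changes the questions asked and not the underlying facts.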
Important Checkout Note
This repo uses a src/ layout. If you run the wrappers with only PYTHONPATH=src, Python may resolve a globally installed amplihack_eval instead of the sibling checkout you are editing. Include both source roots explicitly when validating local changes.
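A quick way to confirm which copy Python would pick up is to ask the import system where a module resolves from, without importing it. The sketch below uses the standard `importlib.util.find_spec`. The helper name `resolved_under` is hypothetical, and the checkout path in the usage note is the same placeholder used in the commands on this page.

```python
import importlib.util
from pathlib import Path


def resolved_under(module_name: str, expected_root: str) -> bool:
    """Return True if `module_name` would be loaded from somewhere under `expected_root`."""
    spec = importlib.util.find_spec(module_name)
    # Missing modules and namespace packages have no usable origin path.
    if spec is None or spec.origin is None:
        return False
    origin = Path(spec.origin).resolve()
    return Path(expected_root).resolve() in origin.parents
```

For example, `resolved_under("amplihack_eval", "/path/to/amplihack-agent-eval/src")` returning False would indicate that a globally installed copy is shadowing the sibling checkout.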
Current Verified Results
A current validation snapshot, including the accepted Azure scores and the latest reproducible local test commands, lives here:
Related Docs
- Day-zero operator guide
- How the eval stack fits together
- amplihack-agent-eval/docs/distributed-hive-eval.md
- amplihack-agent-eval/docs/running-evals.md
- Getting Started
- How to Run the Learning Eval Harness