How The Eval Stack Fits Together¶

This page explains the eval system in plain English.

Use it when you understand the command names already but need to know which repo owns which part, how the Azure path works, and where Aspire fits.

The Short Version¶

amplihack-agent-eval owns the eval harness, datasets, grading, and report packaging.

amplihack owns the agent runtime, the Azure deployment assets, the remote adapter wiring, and the Aspire AppHost that orchestrates the scripts locally.

Azure Event Hubs carries the distributed learn and question traffic. Azure Container Apps runs the agent fleet. Aspire is an optional local dashboard and orchestration shell around the existing Python and bash entrypoints.

The Five-Minute Walkthrough¶

If you need to explain the stack quickly to another engineer, use this mental model:

For local runtime changes, start in amplihack and run the thin wrappers in src/amplihack/eval/. Those wrappers are convenience shims over amplihack_eval, not a second eval framework.
For authoritative local eval runs, switch to amplihack-agent-eval and use amplihack-eval run or amplihack-eval compare. That repo owns question generation, grading, and packaged reports.
For distributed Azure evals, the eval repo still owns the run shape and final report, but it calls the Azure deployment assets from amplihack. Event Hubs carries the traffic, and the agents themselves run in Azure Container Apps.
For Aspire, stay in amplihack and run the AppHost in deploy/azure_hive/aspire/. Aspire is the local orchestration shell around the existing deploy, monitor, and eval scripts. It is not a replacement for the Python eval harness.

Plain-English Component Diagram¶

flowchart TD
    A[Operator] --> B{Which path?}

    B -->|Local evals| C[amplihack-agent-eval CLI\nor amplihack wrapper]
    C --> D[Generate dialogue & questions]
    D --> E[Feed content to adapter]
    E --> F[Ask questions]
    F --> G[Grade answers]
    G --> H[Write reports]

    B -->|Distributed Azure evals| I[amplihack-agent-eval wrapper\nor direct runner]
    I --> J["Call deploy.sh\n(when deployment needed)"]
    J --> K[Apply main.bicep & push image]
    K --> L[Provision Event Hubs\nand start Container Apps fleet]
    L --> M[Runner publishes learn &\nquestion messages to Event Hubs]
    M --> N[Running agents in Container Apps\nconsume via remote adapter path]
    N --> O[Collect correlated answers]
    O --> P[Grade & write report artifacts]

    B -->|Aspire| Q["apphost.cs\n(deploy/azure_hive/aspire/)"]
    Q --> R["Launch deploy.sh, eval_monitor.py,\neval_distributed.py as resources"]
    R --> S[Send OTEL telemetry\nto Aspire dashboard]
    S --> T[Does not replace\nPython eval harness]

Who Owns What¶

Surface	Owning repo	Why it lives there
agent runtime behavior	`amplihack`	That is the production runtime under test
Azure deployment shape	`amplihack`	The main repo owns `deploy/azure_hive/` and `main.bicep`
thin local wrappers	`amplihack`	They let runtime changes call into the eval package without leaving the repo
datasets and question generation	`amplihack-agent-eval`	These are eval-framework concerns, not runtime concerns
grading and report packaging	`amplihack-agent-eval`	The eval repo is the authoritative harness and reporting layer
distributed Azure runner	`amplihack-agent-eval`	It owns the top-level Event Hubs distributed eval module
Aspire AppHost	`amplihack`	It orchestrates the main repo's deploy and monitoring scripts

Why There Are Two Repos¶

The split keeps the runtime and the eval framework from turning into one inseparable package.

That separation lets you:

change the agent runtime without rewriting the eval framework
change datasets, grading, and report packaging without modifying the runtime repo
reuse amplihack-agent-eval against adapters other than the main repo's learning agent

The cost is that some flows need both repos checked out at the same time, especially the thin wrappers and the distributed Azure runner.

How A Local Eval Run Works¶

For a local wrapper run such as python -m amplihack.eval.long_horizon_memory:

you run a thin wrapper from amplihack
that wrapper imports data types and runner logic from amplihack_eval
the adapter talks to the local agent runtime
the eval harness generates dialogue and questions
the grader scores answers and writes the report

The important detail is that the local wrapper is not a second eval framework. It is a convenience layer over the standalone eval package.

How The Authoritative Local CLI Fits¶

For amplihack-eval run or amplihack-eval compare:

you run the CLI in amplihack-agent-eval
the eval repo chooses the dataset, question slice, seed, and output location
the adapter talks to the runtime under test
grading and packaged report output stay in amplihack-agent-eval

That is why the thin wrappers in amplihack are useful for runtime iteration, while the eval repo remains the source of truth for the full local CLI surface.

How A Distributed Azure Run Works¶

For ./run_distributed_eval.sh or python -m amplihack_eval.azure.eval_distributed:

the eval repo decides the run shape: turns, questions, question set, retries, and output location
if deployment is needed, it calls amplihack/deploy/azure_hive/deploy.sh
deploy.sh applies main.bicep, which creates Event Hubs, Container Apps, Log Analytics, and the rest of the Azure fleet
the distributed runner publishes learn and question events into Event Hubs
agents in Container Apps consume those events, process them, and publish answers back to the response hub
the runner correlates answers by event ID, then grades and packages the final report

The eval harness and the report stay centralized in amplihack-agent-eval. The runtime and fleet deployment stay centralized in amplihack.

Where Event Hubs Fits¶

Event Hubs is the transport layer for the distributed path.

The main Bicep template defines three hubs:

hive-events-<hive-name> for learning and question input
hive-shards-<hive-name> for cross-shard retrieval traffic
eval-responses-<hive-name> for agent answers and eval progress events

That means the distributed runner does not talk to the agents directly. It talks to Event Hubs, and the fleet consumes and publishes through those hubs.

Where Container Apps Fits¶

Container Apps is the execution layer for the distributed path.

The deploy script builds or pushes the agent image, and the Bicep template creates the container apps that run the fleet. Each app can host more than one agent depending on agentsPerApp.

So if the question is "where do the agents actually run," the answer is "inside Azure Container Apps, using the image built from the main repo."

Where Aspire Fits¶

Aspire is optional and local to the operator workstation.

The AppHost in deploy/azure_hive/aspire/apphost.cs does three things:

starts the existing deploy and eval scripts as named resources
wires them into the Aspire dashboard and OTEL output
gives you one place to toggle monitor, retrieval-smoke, long-horizon, and security flows with environment variables

Aspire does not replace:

the eval dataset generator
the grader
the distributed runner
the Azure deployment script

It orchestrates those pieces. That is why the AppHost is small and the heavy logic still lives in Python and bash.

Why The Aspire AppHost Is In C#¶

The AppHost is in C# because Aspire's application model is a .NET host built around the DistributedApplication builder API.

That does not mean the eval stack became a C# system. The deploy script, monitor, and eval runners are still the existing Python and bash entrypoints. The C# layer is only the orchestration shell that names those resources, wires environment variables, and publishes OTEL telemetry into the Aspire dashboard.

How Secrets Reach The Distributed Runner¶

The Event Hubs connection string should move through environment variables, not command-line arguments.

The direct compatibility wrappers in deploy/azure_hive/ read EH_CONN, AMPLIHACK_EH_INPUT_HUB, and AMPLIHACK_EH_RESPONSE_HUB from the environment, then reconstruct the effective upstream argument list inside the Python process only when the operator did not already provide explicit flags. The Aspire AppHost follows the same pattern for its local monitor and eval executable resources, where it sets EH_CONN as an environment variable instead of putting the connection string on the shell command line.

That is why the docs prefer read -rsp ... EH_CONN plus export EH_CONN over typing --connection-string ... in the command you launch. It keeps the secret out of the operator-visible exec-time command line and normal process-list output, even though the compatibility wrapper still rebuilds the effective arguments internally before delegating to the upstream Python entrypoint.

What To Change When Something Breaks¶

If the problem is in...	Start in...
question generation, grading, report output	`amplihack-agent-eval`
local thin wrapper behavior	`src/amplihack/eval/` in `amplihack`
Azure deployment topology	`deploy/azure_hive/` in `amplihack`
agent runtime behavior	`src/amplihack/agents/` in `amplihack`
distributed answer transport or correlation	`deploy/azure_hive/remote_agent_adapter.py` and `agent_entrypoint.py` in `amplihack` plus the distributed runner in `amplihack-agent-eval`
Aspire dashboard orchestration	`deploy/azure_hive/aspire/apphost.cs` in `amplihack`