Long-Horizon Memory Evaluation¶

What It Tests¶

The long-horizon memory evaluation is a stress test for AI agent memory systems. It generates a structured dialogue of up to 5000 turns, feeds each turn to the agent's learn() method, then quizzes the agent on details from various points in the conversation. The goal is to measure how well an agent retains, organizes, and retrieves information at scale -- far beyond what fits in a single context window.
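
The learn-then-quiz loop described above can be sketched with a toy agent. This is a minimal illustration, not the harness itself: `KeywordAgent`, its `answer()` method, and `run_toy_eval` are hypothetical names, and only `learn()` is named in this document. The sample turns are made up for the example.

```python
class KeywordAgent:
    """Toy agent that stores every turn verbatim (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.memory: list[str] = []

    def learn(self, content: str) -> None:
        # Store the raw turn; a real agent would extract and index facts.
        self.memory.append(content)

    def answer(self, question: str) -> str:
        # Return the stored turn that shares the most words with the question.
        q_words = set(question.lower().replace("?", "").split())
        return max(
            self.memory,
            key=lambda turn: len(q_words & set(turn.lower().split())),
            default="",
        )


def run_toy_eval(agent, turns, questions) -> float:
    # Phase 1: feed every turn to the agent.
    for turn in turns:
        agent.learn(turn)
    # Phase 2: quiz the agent and check whether the expected answer appears.
    hits = sum(expected.lower() in agent.answer(q).lower() for q, expected in questions)
    return hits / len(questions)


turns = [
    "Sarah Chen is allergic to shellfish.",
    "The budget for Project Atlas is $2.5M.",
    "A distractor fun fact about octopuses.",
]
questions = [("What is Sarah Chen's allergy?", "shellfish")]
print(run_toy_eval(KeywordAgent(), turns, questions))
```

A verbatim store like this only works while the whole dialogue fits in memory and questions share wording with the source turn, which is the regime the real evaluation is designed to move past.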

Key capabilities tested:

Needle-in-haystack retrieval: Finding specific facts buried among thousands of turns
Temporal evolution tracking: Understanding how values change over time
Numerical precision: Exact recall of numbers, percentages, and metrics
Source attribution: Correctly attributing facts to their original sources
Cross-referencing: Connecting facts across different information blocks
Distractor resistance: Ignoring irrelevant information when answering
Meta-memory: Reasoning about what the agent knows (and does not know)
Security log analysis: Detecting patterns and anomalies in structured logs
Incident tracking: Following evolving incident timelines
Infrastructure knowledge: Recalling configuration details
Problem solving: Applying stored knowledge to solve new problems
Multi-hop reasoning: Chaining multiple facts to answer complex questions

Quick Start¶

Run Single-Agent Eval¶

python -m amplihack.eval.long_horizon_memory \
  --turns 300 --questions 50 \
  --model claude-sonnet-4-6 \
  --grader-model claude-haiku-4-5-20251001 \
  --output-dir /tmp/eval-results

Common Options¶

# Small smoke test
python -m amplihack.eval.long_horizon_memory \
  --turns 100 --questions 20

# Large-scale stress test
python -m amplihack.eval.long_horizon_memory \
  --turns 5000 --questions 200 --grader-votes 5 --seed 42

# Large-scale with subprocess segmentation (prevents OOM on 5000+ turns)
python -m amplihack.eval.long_horizon_memory \
  --turns 5000 --segment-size 100

# Parallel workers for faster learning phase
python -m amplihack.eval.long_horizon_memory \
  --turns 1000 --questions 100 --parallel-workers 10

How Dialogue Generation Works¶

Deterministic Generation¶

All dialogue content is template-based. No LLM is needed for data generation. The same seed always produces identical output:

from amplihack_eval.data.long_horizon import generate_dialogue, generate_questions

gt = generate_dialogue(num_turns=1000, seed=42)   # Deterministic
questions = generate_questions(gt, num_questions=50)
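
As a rough illustration of why template-based generation is reproducible, the sketch below drives template and value selection from a seeded `random.Random` instance. The templates, names, and `toy_generate` function are hypothetical; the library's actual generator is not shown in this document and may work differently.

```python
import random

# Hypothetical templates and values, for illustration only.
TEMPLATES = [
    "{name}'s favorite hobby is {hobby}.",
    "{name} mentioned a hobby: {hobby}.",
]
NAMES = ["Sarah Chen", "Person B", "Person C"]
HOBBIES = ["hiking", "chess", "pottery"]


def toy_generate(num_turns: int, seed: int) -> list[str]:
    # A private Random instance keeps generation independent of global RNG state.
    rng = random.Random(seed)
    turns = []
    for _ in range(num_turns):
        template = rng.choice(TEMPLATES)
        turns.append(template.format(name=rng.choice(NAMES), hobby=rng.choice(HOBBIES)))
    return turns


# The same seed yields the same turns on every call.
assert toy_generate(5, seed=42) == toy_generate(5, seed=42)
```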

The 12 Information Blocks¶

The dialogue is divided into 12 thematic blocks, each allocated a proportional range of the total turns. For a 5000-turn dialogue:

Block  1: People              (turns    1-  250,  5%)  -- personal details, preferences
Block  2: Projects            (turns  251-  750, 10%)  -- project updates with changes
Block  3: Technical           (turns  751- 1250, 10%)  -- technical facts across 9 domains
Block  4: Evolving Story      (turns 1251- 2000, 15%)  -- story with corrections/updates
Block  5: Numerical           (turns 2001- 2500, 10%)  -- precise metrics and KPIs
Block  6: Contradictory       (turns 2501- 2900,  8%)  -- conflicting reports from sources
Block  7: Callbacks           (turns 2901- 3200,  6%)  -- references back to earlier blocks
Block  8: Distractors         (turns 3201- 3500,  6%)  -- irrelevant fun facts
Block  9: Security Logs       (turns 3501- 4000, 10%)  -- structured security events
Block 10: Incidents           (turns 4001- 4400,  8%)  -- incident reports with status updates
Block 11: Infrastructure      (turns 4401- 4750,  7%)  -- server/network inventory
Block 12: Problem Solving     (turns 4751- 5000,  5%)  -- problem descriptions with solutions

Turn counts scale linearly. A 100-turn dialogue uses 1/50th of each range.
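
The proportional allocation can be reproduced from the percentages in the table above. The sketch below is an assumption about how boundaries could be computed (cumulative percentage with integer division); `block_ranges` is a hypothetical helper, and the generator's own rounding for turn counts that do not divide evenly may differ.

```python
# (block name, share of total turns in percent), taken from the table above.
BLOCK_SHARES = [
    ("People", 5), ("Projects", 10), ("Technical", 10), ("Evolving Story", 15),
    ("Numerical", 10), ("Contradictory", 8), ("Callbacks", 6), ("Distractors", 6),
    ("Security Logs", 10), ("Incidents", 8), ("Infrastructure", 7), ("Problem Solving", 5),
]


def block_ranges(num_turns: int) -> list[tuple[str, int, int]]:
    """Return (name, first_turn, last_turn) per block, 1-indexed and inclusive."""
    ranges = []
    cumulative_pct = 0
    start = 1
    for name, pct in BLOCK_SHARES:
        cumulative_pct += pct
        # Integer arithmetic avoids floating-point drift at block boundaries.
        end = num_turns * cumulative_pct // 100
        ranges.append((name, start, end))
        start = end + 1
    return ranges


for name, first, last in block_ranges(100):
    print(f"{name:16s} turns {first:3d}-{last:3d}")
```

For 5000 turns this reproduces the ranges listed above; for 100 turns, People covers turns 1-5 and Problem Solving covers turns 96-100.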

Block Details¶

Block 1: People -- 10 team members with detailed personal profiles (name, birthday, allergy, hobby, role, team, pet, hometown, favorite food, degree). Each person's facts are delivered across multiple turns with natural-language context.

Block 2: Projects -- 5 projects (Atlas, Beacon, Cascade, Delta, Echo) each with initial descriptions and a series of updates that change deadlines, budgets, team sizes, and project leads at specific turn offsets. This tests temporal evolution tracking.

Block 3: Technical -- Facts from 9 technical domains, including programming, security, databases, cloud, ML/AI, DevOps, architecture, and frontend. Each fact is a standalone technical statement.

Block 4: Evolving Story -- A multi-chapter narrative about a startup's journey with deliberate corrections and updates. Tests the agent's ability to track the most current version of facts.

Block 5: Numerical -- 30 precise metrics (Q1 revenue, server uptime, test coverage, API response times, etc.) with specific values and context details. Tests numerical precision.

Block 6: Contradictory -- 8 topics where 2-3 different sources provide conflicting claims. Tests the agent's ability to acknowledge and reason about contradictions.

Block 7: Callbacks -- References back to facts from earlier blocks, creating cross-references that require connecting information across different topics.

Block 8: Distractors -- 30 irrelevant fun facts designed to test whether the agent can filter relevant from irrelevant information.

Block 9: Security Logs -- Structured security events with timestamps, source IPs, event types, users, and severity levels. Includes attack patterns that require pattern recognition.

Block 10: Incidents -- Incident reports with evolving status updates (open -> investigating -> identified -> resolved). Each incident has a timeline of events that tests temporal tracking.

Block 11: Infrastructure -- Server and network inventory with detailed specifications (CPU, RAM, storage, OS, location, uptime).

Block 12: Problem Solving -- Problem descriptions paired with solutions, testing the agent's ability to recall and apply stored problem-solving knowledge.

Ground Truth Tracking¶

Every fact delivered to the agent is tracked in a GroundTruth structure:

@dataclass
class GroundTruth:
    turns: list[Turn]                           # All dialogue turns
    facts_by_entity: dict[str, list[dict]]      # Facts indexed by entity name
    current_values: dict[str, Any]              # Latest value for each entity
    superseded_values: dict[str, list[dict]]    # Historical values with timestamps

Each Turn records:

turn_number -- position in the dialogue
content -- the text delivered to the agent
block -- which block (1-12) this turn belongs to
block_name -- human-readable block name
facts -- list of ground truth facts delivered in this turn
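
To make the relationship between `current_values` and `superseded_values` concrete, here is a minimal sketch of how an update might be recorded. `GroundTruthSketch` and its `record()` helper are hypothetical, the key format and the turn number 12 are made up, and the stored dict shape is an assumption rather than the library's actual layout.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GroundTruthSketch:
    """Simplified stand-in for GroundTruth (turns and facts_by_entity omitted)."""

    current_values: dict[str, Any] = field(default_factory=dict)
    superseded_values: dict[str, list[dict]] = field(default_factory=dict)

    def record(self, key: str, value: Any, turn_number: int) -> None:
        # Hypothetical helper: move the old value into history before overwriting.
        if key in self.current_values:
            old = self.current_values[key]
            self.superseded_values.setdefault(key, []).append(old)
        self.current_values[key] = {"value": value, "turn": turn_number}


gt = GroundTruthSketch()
gt.record("Project Atlas.budget", "$2.1M", turn_number=12)  # initial value
gt.record("Project Atlas.budget", "$2.5M", turn_number=45)  # later update

print(gt.current_values["Project Atlas.budget"]["value"])   # latest value
print(len(gt.superseded_values["Project Atlas.budget"]))    # count of earlier values
```

Keeping the history separate from the latest value is what allows questions to ask both for the current state and for how it changed.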

The 15 Question Categories¶

Questions are generated from the ground truth data. Each question has an expected answer, relevant turn numbers, scoring dimensions, and an optional deterministic grading rubric.

1. `needle_in_haystack`¶

What it tests: Direct recall of specific facts from a single source among many turns.

Example: "What is Sarah Chen's allergy?" (Expected: "shellfish")

Scoring dimensions: factual_accuracy, specificity

2. `temporal_evolution`¶

What it tests: Tracking how values change over time, including computing differences.

Example: "What is the current deadline for Project Atlas, and how many times has it changed?"

Scoring dimensions: factual_accuracy, temporal_awareness

3. `numerical_precision`¶

What it tests: Exact recall of numbers, percentages, and metrics.

Example: "What is the Q1 revenue and how does it compare to the forecast?"

Scoring dimensions: factual_accuracy, specificity

4. `source_attribution`¶

What it tests: Correctly attributing claims to their original sources, especially when multiple sources discuss the same topic.

Scoring dimensions: factual_accuracy, source_attribution

5. `cross_reference`¶

What it tests: Connecting facts across different information blocks.

Scoring dimensions: factual_accuracy, specificity

6. `distractor_resistance`¶

What it tests: Answering questions accurately while ignoring irrelevant information.

Scoring dimensions: factual_accuracy, confidence_calibration

7. `meta_memory`¶

What it tests: The agent's awareness of what it knows and does not know.

Scoring dimensions: factual_accuracy, confidence_calibration

8. `security_log_analysis`¶

What it tests: Pattern recognition in structured security event data.

Scoring dimensions: factual_accuracy, specificity

9. `incident_tracking`¶

What it tests: Following incident timelines and status evolution.

Scoring dimensions: factual_accuracy, temporal_awareness

10. `infrastructure_knowledge`¶

What it tests: Recall of technical infrastructure details.

Scoring dimensions: factual_accuracy, specificity

11. `problem_solving`¶

What it tests: Recalling stored problem-solution pairs and applying them.

Scoring dimensions: factual_accuracy, specificity

12. `multi_hop_reasoning`¶

What it tests: Chaining multiple facts to answer a question that requires 2+ retrieval steps.

Scoring dimensions: factual_accuracy, specificity

13-15. Additional Categories¶

The question generator also produces additional questions that combine categories -- for example, temporal + numerical or cross-reference + security.

The Grading System¶

Hybrid Deterministic + LLM Grading¶

Each question is graded on multiple dimensions using a two-tier approach:

Tier 1 -- Deterministic grading (instant, free, reproducible):

Checks required_keywords against the answer (case-insensitive)
Awards bonus for acceptable_paraphrases found
Scores 0.0 if any incorrect_patterns match
Applies to factual_accuracy and specificity dimensions
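
The Tier 1 rules above can be sketched as a small function. The exact scoring formula is not given in this document, so the base score (fraction of required keywords found), the bonus size (+0.1 per paraphrase, capped at 1.0), and the treatment of `incorrect_patterns` as regular expressions are all assumptions; `deterministic_score` is a hypothetical name.

```python
import re
from typing import Optional


def deterministic_score(
    answer: str,
    required_keywords: list[str],
    acceptable_paraphrases: Optional[list[str]] = None,
    incorrect_patterns: Optional[list[str]] = None,
) -> float:
    text = answer.lower()
    # Any incorrect pattern (treated here as a regex) zeroes the score.
    for pattern in incorrect_patterns or []:
        if re.search(pattern, answer, flags=re.IGNORECASE):
            return 0.0
    # Base score: fraction of required keywords present (case-insensitive).
    matched = sum(kw.lower() in text for kw in required_keywords)
    base = matched / len(required_keywords) if required_keywords else 0.0
    # Assumed bonus: +0.1 per paraphrase found, capped at 1.0 overall.
    bonus = 0.1 * sum(p.lower() in text for p in (acceptable_paraphrases or []))
    return min(1.0, base + bonus)


print(deterministic_score("She is allergic to Shellfish.", ["shellfish"]))
print(deterministic_score("The budget is $2.5M.", ["$2.1M", "$2.5M"]))
```

Because the function uses only string and regex matching, repeated calls on the same answer always return the same score, which is what makes this tier reproducible.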

Tier 2 -- LLM semantic grading (slower, costs API calls, nuanced):

Used for temporal_awareness, source_attribution, confidence_calibration
Also used when no deterministic rubric is available
Prompt includes scoring guide and dimension descriptions
Returns structured JSON with per-dimension scores and reasoning

Multi-Vote Stability¶

To reduce LLM grading noise, each question is graded N times (configurable, default 3). For each dimension, the median score is taken as the final grade. The reasoning from the vote closest to the median is preserved.

Vote 1: factual_accuracy = 0.85
Vote 2: factual_accuracy = 0.90    -> Median: 0.85
Vote 3: factual_accuracy = 0.80

For deterministic dimensions, multi-vote adds no API cost, because the score is identical on every vote.
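
The median aggregation can be sketched as follows. `aggregate_votes` is a hypothetical helper that takes `(score, reasoning)` pairs for one dimension of one question; the real grader's data structures are not shown in this document.

```python
from statistics import median


def aggregate_votes(votes: list[tuple[float, str]]) -> tuple[float, str]:
    """Return the median score and the reasoning of the vote closest to it."""
    final = median(score for score, _ in votes)
    # Keep the reasoning from the vote whose score is closest to the median.
    _, reasoning = min(votes, key=lambda vote: abs(vote[0] - final))
    return final, reasoning


votes = [(0.85, "vote 1"), (0.90, "vote 2"), (0.80, "vote 3")]
print(aggregate_votes(votes))  # (0.85, 'vote 1')
```

With an odd number of votes the median is always one of the actual scores, so a single outlier vote cannot move the final grade.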

Scoring Scale¶

Score	Meaning
1.0	Perfect or semantically equivalent
0.8-0.9	Correct main points, minor differences
0.5-0.7	Partially correct, missing key details
0.2-0.4	Some relevant content, significant gaps
0.0-0.1	Incorrect or irrelevant
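
When reading raw scores, it can help to map them back to the bands in the table. The sketch below is one way to do that. The table leaves gaps between bands (for example 0.95 or 0.45), so treating each band's lower bound as a threshold is an assumption, and `describe_score` is a hypothetical helper.

```python
# (lower bound, meaning) pairs from the scoring scale, highest band first.
SCORE_BANDS = [
    (1.0, "Perfect or semantically equivalent"),
    (0.8, "Correct main points, minor differences"),
    (0.5, "Partially correct, missing key details"),
    (0.2, "Some relevant content, significant gaps"),
    (0.0, "Incorrect or irrelevant"),
]


def describe_score(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    # Return the first band whose lower bound the score reaches.
    for lower_bound, meaning in SCORE_BANDS:
        if score >= lower_bound:
            return meaning
    return SCORE_BANDS[-1][1]


print(describe_score(0.85))
```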

How to Interpret Results¶

The EvalReport¶

A completed evaluation produces an EvalReport containing:

@dataclass
class EvalReport:
    num_turns: int                          # Dialogue length
    num_questions: int                      # Questions asked
    total_facts_delivered: int              # Total facts in ground truth
    learning_time_s: float                  # Time to feed all turns
    questioning_time_s: float               # Time to ask + grade all questions
    grading_time_s: float                   # Time spent on grading only
    overall_score: float                    # Average of all question scores
    category_breakdown: list[CategoryBreakdown]  # Per-category averages
    results: list[EvalResult]              # Per-question details
    memory_stats: dict                     # Agent-reported memory statistics

Reading the Category Breakdown¶

CATEGORY BREAKDOWN:
-----------------------------------------------------------------------
Category                     Avg      Min      Max   Count
-----------------------------------------------------------------------
cross_reference            85.00%   70.00%   95.00%      10
distractor_resistance      92.00%   80.00%  100.00%       8
needle_in_haystack         88.00%   65.00%  100.00%      20
numerical_precision        82.00%   55.00%   95.00%      15
...

Focus on the weakest categories. Categories below 70% indicate systematic weaknesses in the agent's memory system. The min score reveals worst-case performance.
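
A breakdown like the one above can be derived from per-question scores, and the weak categories flagged automatically. This sketch uses plain `(category, score)` tuples with made-up values; `category_breakdown` and `weak_categories` are hypothetical helpers, not part of the library's API.

```python
from collections import defaultdict


def category_breakdown(results: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    # Group scores by category, then compute avg/min/max/count per group.
    by_category: dict[str, list[float]] = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    return {
        category: {
            "avg": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "count": len(scores),
        }
        for category, scores in sorted(by_category.items())
    }


def weak_categories(breakdown: dict[str, dict[str, float]], threshold: float = 0.70) -> list[str]:
    # Categories whose average falls below the 70% guideline.
    return [cat for cat, stats in breakdown.items() if stats["avg"] < threshold]


# Illustrative scores only, not real evaluation results.
sample = [
    ("needle_in_haystack", 1.0),
    ("needle_in_haystack", 0.5),
    ("temporal_evolution", 0.25),
    ("temporal_evolution", 0.75),
]
print(weak_categories(category_breakdown(sample)))
```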

The Worst 5 Questions¶

The report highlights the 5 lowest-scoring questions. These are the best starting points for debugging:

WORST 5 QUESTIONS:
  [25.00%] What was the budget change for Project Atlas over time?
    Expected: $2.1M -> $2.5M (turn 45, additional cloud credits needed)
    Got: The budget is $2.5M.

This example shows the agent knows the current value but lost the change history -- a temporal awareness issue.
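
The same ranking can be reproduced from saved results for a different cutoff. The sketch below assumes each result is a dict with `question` and `score` keys. Those field names and `worst_questions` are hypothetical, and the sample values are illustrative.

```python
import heapq


def worst_questions(results: list[dict], n: int = 5) -> list[dict]:
    # nsmallest returns the n lowest-scoring results in ascending score order.
    return heapq.nsmallest(n, results, key=lambda result: result["score"])


# Illustrative values only.
sample_results = [
    {"question": "Q1", "score": 0.90},
    {"question": "Q2", "score": 0.25},
    {"question": "Q3", "score": 0.60},
]
for result in worst_questions(sample_results, n=2):
    print(f"[{result['score']:.2%}] {result['question']}")
```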

Performance Characteristics¶

Scaling¶

Turns	Questions	Typical Facts	Generation Time	Learning Time*
100	20	~80	< 0.1s	Depends on agent
500	50	~400	< 0.5s	Depends on agent
1000	100	~800	< 1s	Depends on agent
5000	200	~4000	< 5s	Depends on agent

*Learning time depends entirely on the agent implementation -- a simple list-based agent will be much faster than one using LLM-powered ingestion.

Grading Costs¶

Deterministic grading: Free, instant (no API calls)
LLM grading: 1 API call per dimension per vote per question
Example: 100 questions * 3 LLM dimensions * 3 votes = 900 API calls
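
The cost arithmetic can be applied to a concrete question set by counting only the LLM-graded dimensions. This sketch assumes every question has a deterministic rubric, so factual_accuracy and specificity cost nothing. `estimate_llm_calls` is a hypothetical helper.

```python
# Dimensions graded by the LLM tier, per the grading section.
LLM_DIMENSIONS = {"temporal_awareness", "source_attribution", "confidence_calibration"}


def estimate_llm_calls(question_dimensions: list[list[str]], votes: int = 3) -> int:
    """One API call per LLM-graded dimension, per vote, per question."""
    llm_dims = sum(len(LLM_DIMENSIONS.intersection(dims)) for dims in question_dimensions)
    return llm_dims * votes


# 100 questions that each use all three LLM dimensions, 3 votes each.
print(estimate_llm_calls([sorted(LLM_DIMENSIONS)] * 100, votes=3))  # 900

# 10 temporal_evolution-style questions: only one LLM dimension each.
print(estimate_llm_calls([["factual_accuracy", "temporal_awareness"]] * 10, votes=3))  # 30
```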

Reproducibility¶

Same seed + same agent = same scores (within LLM grading variance). Multi-vote and multi-seed evaluation reduce this variance. For fully deterministic results, give every question a deterministic rubric and use grading_mode="deterministic".
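
When running several seeds, summarizing the spread makes the remaining variance visible. The sketch below is one way to do that with the standard library; `summarize_seeds` is a hypothetical helper and the scores passed in are placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_seeds(scores_by_seed: dict[int, float]) -> dict[str, float]:
    scores = list(scores_by_seed.values())
    # Sample standard deviation needs at least two data points.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "stdev": spread, "min": min(scores), "max": max(scores)}


# Placeholder overall scores keyed by seed.
print(summarize_seeds({42: 0.5, 123: 1.0}))
```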

Pre-built Datasets¶

Skip the Learning Phase¶

With an LLM-powered agent, the 5000-turn learning phase can take hours and require thousands of LLM API calls. Pre-built datasets let you skip this entirely and jump straight to evaluation.

Download and Use¶

# List available datasets
amplihack-eval list-datasets

# Download a pre-built dataset
amplihack-eval download-dataset 5000t-seed42-v1.0

# Run evaluation using the pre-built DB (skip learning phase)
amplihack-eval run \
  --adapter learning-agent \
  --skip-learning \
  --load-db datasets/5000t-seed42-v1.0/memory_db \
  --turns 5000 \
  --questions 100

Programmatic Usage¶

from amplihack_eval.adapters.learning_agent import LearningAgentAdapter
from amplihack_eval.datasets import download_dataset, list_datasets

# List available datasets
datasets = list_datasets()
for ds in datasets:
    print(f"{ds['name']}: {'local' if ds.get('local') else 'remote'}")

# Download a dataset
path = download_dataset("5000t-seed42-v1.0")

# Use with evaluation
adapter = LearningAgentAdapter(storage_path=path / "memory_db")

Dataset Structure¶

Each dataset contains:

metadata.json -- Configuration and provenance (turns, seed, facts, code version)
baseline_results.json -- Full evaluation scores at time of creation
memory_db/ -- Kuzu graph database (the pre-built learning DB)

Datasets are distributed via GitHub Releases to keep the repository lightweight.

CLI Usage¶

# Basic evaluation (100 turns, 20 questions)
amplihack-eval run --turns 100 --questions 20 --adapter http --agent-url http://localhost:8000

# With LearningAgent
amplihack-eval run --turns 100 --adapter learning-agent --model claude-sonnet-4-6

# Large-scale stress test
amplihack-eval run --turns 5000 --questions 200 --grader-votes 5 --seed 42

# Skip learning with pre-built DB
amplihack-eval run --adapter learning-agent --skip-learning \
  --load-db datasets/5000t-seed42-v1.0/memory_db --turns 5000 --questions 100

# Multi-seed comparison
amplihack-eval compare --seeds 42,123,456,789 --turns 100 --questions 20

Programmatic Usage¶

import json

from amplihack_eval import EvalRunner

# Create your agent (MyAgent is a placeholder for your own AgentAdapter implementation)
agent = MyAgent()

# Run evaluation
runner = EvalRunner(num_turns=1000, num_questions=100, seed=42, grader_votes=3)
report = runner.run(agent, grader_model="claude-sonnet-4-6")

# Inspect results
print(f"Overall: {report.overall_score:.2%}")
for cb in report.category_breakdown:
    print(f"  {cb.category}: {cb.avg_score:.2%} (n={cb.num_questions})")

# Save report as JSON
with open("report.json", "w") as f:
    json.dump(report.to_dict(), f, indent=2)
