Eval Grading Improvements Tutorial¶

Type: Tutorial (Learning-Oriented) Last Updated: 2026-02-28 Related PRs: #2673, #2674

Overview¶

This tutorial teaches you how to fix grader false negatives and implement advanced retrieval strategies for eval systems. These improvements increased evaluation accuracy from 96.0% to 97.8% in the amplihack eval system.

What You'll Learn¶

How to fix grader false negatives using deterministic patterns
How to implement entity-linked retrieval for structured IDs
How to implement multi-entity retrieval for multi-hop reasoning
Best practices for combining deterministic and semantic grading

Prerequisites¶

Basic understanding of LLM-based evaluation systems
Familiarity with semantic similarity grading
Knowledge of information retrieval concepts

Problem: Grader False Negatives¶

The Issue¶

Consider this question and answer:

Question: "What is the current project budget?" Expected: "$1.4M" Agent Answer: "The budget increased from $1.2M to $1.4M"

The agent's answer contains the correct current value ($1.4M) but also mentions the historical value ($1.2M). A naive grader using incorrect patterns might score this as 0% because $1.2M appears in the answer and matches an "incorrect pattern" for the historical value.

The Root Cause¶

The grader was checking for incorrect patterns WITHOUT first verifying that the correct answer was present:

# BUGGY VERSION (from before PR #2674)
if any(pattern in answer.lower() for pattern in incorrect_patterns):
    score = 0.0  # Wrong! Penalizes even when correct answer is present

This caused false negatives when answers contained both historical and current information.

Solution 1: Fix Deterministic Grading Logic¶

The Fix¶

Only apply incorrect pattern penalties when the correct keywords are NOT present:

# FIXED VERSION (from PR #2674)
def _deterministic_grade(answer: str, rubric: GradingRubric) -> float:
    answer_lower = answer.lower()

    # Check if correct keywords are present
    has_correct = any(
        keyword.lower() in answer_lower
        for keyword in rubric.correct_keywords
    )

    # Only penalize incorrect patterns if correct answer is MISSING
    if not has_correct:
        for pattern in rubric.incorrect_patterns:
            if pattern.lower() in answer_lower:
                return 0.0

    # Grade based on keyword matches
    matches = sum(
        1 for keyword in rubric.correct_keywords
        if keyword.lower() in answer_lower
    )
    return matches / len(rubric.correct_keywords)

Key Insight¶

Only skip incorrect patterns when ALL correct keywords are present. This ensures that:

Complete answers get full credit even if they mention historical data
Incomplete answers still get penalized for incorrect patterns
Grading is more aligned with human judgment

Impact¶

temporal_evolution: 86.6% → 99.8%
Overall accuracy: 96.0% → 97.4%

Solution 2: Entity-Linked Retrieval¶

The Problem¶

Questions about structured entities (incident IDs, CVE numbers, etc.) often fail because facts are stored under different context tags:

Example: - Question: "What was the impact of INC-2024-089?" - Facts stored under: "incidents", "security_logs", "post_mortems" - Standard retrieval: Only searches "incidents" context - Result: Misses related facts in "security_logs" and "post_mortems"

The Solution¶

When structured entity IDs are detected, search ALL facts containing that entity:

def _entity_linked_retrieval(self, question: str) -> list[Fact]:
    """
    Retrieves all facts containing structured entity IDs found in the question.

    Structured ID patterns:
    - INC-YYYY-NNN (incidents)
    - CVE-YYYY-NNNNN (vulnerabilities)
    - PROJ-NNN (projects)
    - SRV-NNN (servers)
    """
    import re

    # Extract entity IDs from question
    entity_patterns = [
        r'INC-\d{4}-\d{3}',    # Incident IDs
        r'CVE-\d{4}-\d{4,}',   # CVE IDs
        r'PROJ-\d{3}',         # Project IDs
        r'SRV-\d{3}',          # Server IDs
    ]

    entities = []
    for pattern in entity_patterns:
        entities.extend(re.findall(pattern, question, re.IGNORECASE))

    if not entities:
        return []

    # Search for facts containing any of these entity IDs
    all_facts = []
    for entity_id in entities:
        # Get facts from memory where fact text contains the entity ID
        facts = self.memory.search(
            query=entity_id,
            filters=None,  # Search ALL contexts
            limit=10
        )
        all_facts.extend(facts)

    return self._deduplicate_facts(all_facts)

When to Use¶

Use entity-linked retrieval when:

Questions contain structured identifiers (INC-, CVE-, PROJ-*)
Related facts are stored across multiple context tags
Standard context-based retrieval misses relevant information

Impact¶

security_log_analysis: 88% → 100%
incident_tracking: Improved multi-source fact aggregation

Solution 3: Multi-Entity Retrieval¶

The Problem¶

Multi-hop reasoning questions ask about relationships between multiple entities:

Example: "How did the Snowfall incident affect the Alpine Lodge project?"

This question involves: 1. The "Snowfall incident" entity 2. The "Alpine Lodge project" entity 3. The relationship/impact between them

Standard retrieval searches for both terms together and often finds nothing because facts about each entity are stored separately.

The Solution¶

Detect questions with 2+ named entities, retrieve facts for EACH entity independently, then merge results:

def _multi_entity_retrieval(self, question: str) -> list[Fact]:
    """
    Detects questions with multiple entities and retrieves facts for each independently.

    Useful for multi-hop reasoning questions like:
    - "How did X affect Y?"
    - "What's the relationship between A and B?"
    - "Compare X and Y"
    """
    # Detect named entities or key phrases
    entities = self._extract_entities(question)

    if len(entities) < 2:
        return []

    # Retrieve facts for EACH entity independently
    all_facts = []
    for entity in entities:
        facts = self.memory.search(
            query=entity,
            limit=5
        )
        all_facts.extend(facts)

    return self._deduplicate_facts(all_facts)

def _extract_entities(self, question: str) -> list[str]:
    """
    Extract named entities and key noun phrases from question.

    Uses simple heuristics:
    - Capitalized phrases (e.g., "Alpine Lodge")
    - Quoted terms (e.g., "Snowfall incident")
    - Common entity patterns
    """
    import re

    entities = []

    # Extract quoted terms
    entities.extend(re.findall(r'"([^"]+)"', question))

    # Extract capitalized phrases (2-4 words)
    entities.extend(
        re.findall(r'\b([A-Z][a-z]+(?: [A-Z][a-z]+){1,3})\b', question)
    )

    # Remove duplicates, keep unique entities
    return list(set(entities))

When to Use¶

Use multi-entity retrieval when:

Questions contain 2+ named entities or key phrases
Questions ask about relationships, comparisons, or impacts
Facts about each entity are stored in separate context tags
Standard combined search returns insufficient results

Impact¶

multi_hop_reasoning: Improved coverage for chain-of-thought questions
temporal_evolution: Better handling of "before/after" comparisons

Combining the Strategies¶

The most powerful approach combines all three techniques:

def answer_question(self, question: str) -> str:
    # 1. Detect question intent
    intent = self._detect_intent(question)

    # 2. Try entity-linked retrieval first (structured IDs)
    facts = self._entity_linked_retrieval(question)

    # 3. If insufficient, try multi-entity retrieval
    if len(facts) < 3:
        facts.extend(self._multi_entity_retrieval(question))

    # 4. Fall back to standard retrieval
    if len(facts) < 3:
        facts.extend(self._standard_retrieval(question))

    # 5. Synthesize answer from retrieved facts
    answer = self._synthesize_with_llm(question, facts)

    return answer

Retrieval Strategy Decision Tree¶

Question received
    │
    ├─ Contains structured IDs (INC-*, CVE-*)?
    │   └─ YES → Use entity-linked retrieval
    │
    ├─ Contains 2+ named entities?
    │   └─ YES → Use multi-entity retrieval
    │
    └─ Otherwise → Use standard context-based retrieval

Best Practices¶

1. Always Check for False Negatives¶

When adding new incorrect patterns to grading rubrics:

# DON'T: Apply incorrect patterns unconditionally
if "outdated_info" in answer:
    return 0.0

# DO: Check for correct answer first
if not has_correct_answer and "outdated_info" in answer:
    return 0.0

2. Deduplicate Retrieved Facts¶

Multiple retrieval strategies can return overlapping facts:

def _deduplicate_facts(self, facts: list[Fact]) -> list[Fact]:
    seen = set()
    unique = []
    for fact in facts:
        # Use fact ID or content hash as deduplication key
        key = fact.id or hash(fact.content)
        if key not in seen:
            seen.add(key)
            unique.append(fact)
    return unique

3. Log Retrieval Strategy Used¶

For debugging and analysis:

def answer_question(self, question: str) -> str:
    retrieval_method = None

    facts = self._entity_linked_retrieval(question)
    if facts:
        retrieval_method = "entity_linked"
    else:
        facts = self._multi_entity_retrieval(question)
        if facts:
            retrieval_method = "multi_entity"
        else:
            facts = self._standard_retrieval(question)
            retrieval_method = "standard"

    logger.info(f"Used {retrieval_method} retrieval: {len(facts)} facts")
    return self._synthesize_with_llm(question, facts)

4. Test with Multi-Vote Grading¶

Grading improvements should be validated with multiple grading runs:

# Single run (unreliable)
python -m amplihack.eval.progressive_test_suite --sdk mini

# Multi-vote grading (recommended)
python -m amplihack.eval.progressive_test_suite --grader-votes 3 --sdk mini

Results¶

Applying these three improvements to the amplihack eval system:

Category	Before	After	Improvement
temporal_evolution	86.6%	99.8%	+13.2%
security_log_analysis	88.0%	100.0%	+12.0%
incident_tracking	~85%	~95%	+10.0%
multi_hop_reasoning	~90%	~95%	+5.0%
Overall	96.0%	97.8%	+1.8%

Next Steps¶

Apply to your eval system: Implement deterministic grading fixes in your grader
Add retrieval strategies: Implement entity-linked and multi-entity retrieval
Validate with multi-vote grading: Run 3-vote grading to measure improvement
Monitor specific categories: Track per-category scores to identify remaining gaps

EVAL_SYSTEM_ARCHITECTURE.md - Complete eval system overview
EVAL_RETRIEVAL_REFERENCE.md - Detailed API reference
PR #2673 - Original implementation
PR #2674 - Grading regression fix
PR #2675 - Security domain improvements

Eval Grading Improvements Tutorial¶

Overview¶

What You'll Learn¶

Prerequisites¶

Problem: Grader False Negatives¶

The Issue¶

The Root Cause¶

Solution 1: Fix Deterministic Grading Logic¶

The Fix¶

Key Insight¶

Impact¶

Solution 2: Entity-Linked Retrieval¶

The Problem¶

The Solution¶

When to Use¶

Impact¶

Solution 3: Multi-Entity Retrieval¶

The Problem¶

The Solution¶

When to Use¶

Impact¶

Combining the Strategies¶

Retrieval Strategy Decision Tree¶

Best Practices¶

1. Always Check for False Negatives¶

2. Deduplicate Retrieved Facts¶

3. Log Retrieval Strategy Used¶

4. Test with Multi-Vote Grading¶

Results¶

Next Steps¶

Related Documentation¶