Eval Grading Improvements
Type: Explanation (Understanding-Oriented)
How grader false negatives were fixed, and how entity-linked and multi-entity retrieval were added, to improve evaluation accuracy.
The Problem: Grader False Negatives
Consider this example:
- Question: "What is the current project budget?"
- Expected: "$1.4M"
- Agent answer: "The budget increased from $1.2M to $1.4M"
The agent's answer is correct ($1.4M is present), but a naive grader penalizes it because the historical value $1.2M matches an "incorrect pattern." This is a false negative — correct answers scored as wrong.
Root Cause
The grader checked for incorrect patterns WITHOUT first verifying that the correct answer was present:
    # BUGGY: penalizes even when correct answer IS present
    if any(pattern in answer.lower() for pattern in incorrect_patterns):
        score = 0.0
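To see the failure concretely, the snippet below runs the budget example through a grader shaped like the buggy check. The function name `naive_grade`, the starting score of 1.0, and the pattern list are assumptions made for illustration; only the `if any(...)` test comes from the text above.

```python
def naive_grade(answer: str, incorrect_patterns: list[str]) -> float:
    """Hypothetical grader that reproduces the buggy check."""
    score = 1.0  # assumed starting score; the original default is not shown
    # BUGGY: never asks whether the expected value is also present
    if any(pattern in answer.lower() for pattern in incorrect_patterns):
        score = 0.0
    return score


answer = "The budget increased from $1.2M to $1.4M"
# Patterns are written in lowercase because the check compares against answer.lower()
print(naive_grade(answer, ["$1.2m"]))  # 0.0, although $1.4M is present
```

Note that the check lowercases the answer but not the pattern, so a pattern written as `"$1.2M"` would never match at all.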
The Fix: Correct-First Grading
Only apply incorrect pattern penalties when the correct keywords are NOT present:
    def _deterministic_grade(answer: str, rubric: GradingRubric) -> float:
        # Assumption: an empty keyword list scores 0.0 (avoids dividing by zero below)
        if not rubric.correct_keywords:
            return 0.0

        answer_lower = answer.lower()

        # Check if correct keywords are present FIRST
        has_correct = any(
            keyword.lower() in answer_lower
            for keyword in rubric.correct_keywords
        )

        # Only penalize incorrect patterns if correct answer is MISSING
        if not has_correct:
            for pattern in rubric.incorrect_patterns:
                if pattern.lower() in answer_lower:
                    return 0.0

        # Grade based on keyword match ratio
        matches = sum(
            1 for keyword in rubric.correct_keywords
            if keyword.lower() in answer_lower
        )
        return matches / len(rubric.correct_keywords)
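A quick check of the correct-first logic against the budget example may help. `GradingRubric` is not defined in this document, so the dataclass below is a stand-in with only the two fields the grader reads, and `deterministic_grade` is a condensed restatement of the function above rather than the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class GradingRubric:
    """Stand-in rubric; the real class may carry more fields."""
    correct_keywords: list[str]
    incorrect_patterns: list[str] = field(default_factory=list)


def deterministic_grade(answer: str, rubric: GradingRubric) -> float:
    answer_lower = answer.lower()
    matches = sum(1 for k in rubric.correct_keywords if k.lower() in answer_lower)
    # Penalize incorrect patterns only when no correct keyword was found
    if matches == 0:
        if any(p.lower() in answer_lower for p in rubric.incorrect_patterns):
            return 0.0
    # Assumed behaviour: an empty keyword list scores 0.0
    if not rubric.correct_keywords:
        return 0.0
    return matches / len(rubric.correct_keywords)


rubric = GradingRubric(correct_keywords=["$1.4M"], incorrect_patterns=["$1.2M"])

print(deterministic_grade("The budget increased from $1.2M to $1.4M", rubric))  # 1.0
print(deterministic_grade("The budget is $1.2M", rubric))  # 0.0
```

The first answer mentions the historical value but still scores 1.0 because the expected value is present; the second contains only the stale value and scores 0.0.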
Key Insight
An answer containing both historical and current values should be scored on whether the current (correct) value is present, not penalized for also mentioning the historical value.
Entity-Linked Retrieval
Standard context-based search can miss relevant facts when a question targets a structured entity ID
(e.g., INC-2024-089, CVE-2024-12345). Entity-linked retrieval extracts these
IDs from the question and searches across ALL contexts.
How It Works
- Extract entity IDs from the question using regex patterns
- For each entity ID, search ALL memory contexts (not just the current one)
- Aggregate and deduplicate retrieved facts
- Return enriched fact list
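The four steps above can be sketched as follows. This is a minimal illustration, not the project's retrieval code: the memory layout (a dict mapping context name to a list of fact strings), the generic ID regex, the function name, and the sample facts are all assumptions.

```python
import re

# Assumed generic ID shape: uppercase prefix, then one or two digit groups
ENTITY_ID = re.compile(r"\b[A-Z]+-\d+(?:-\d+)?\b")


def entity_linked_retrieve(question: str, contexts: dict[str, list[str]]) -> list[str]:
    """Return deduplicated facts from every context that mention an ID in the question."""
    entity_ids = ENTITY_ID.findall(question)  # step 1: extract IDs
    facts: list[str] = []
    seen: set[str] = set()
    for entity_id in entity_ids:
        for context_facts in contexts.values():  # step 2: search ALL contexts
            for fact in context_facts:
                if entity_id in fact and fact not in seen:  # step 3: deduplicate
                    seen.add(fact)
                    facts.append(fact)
    return facts  # step 4: enriched fact list


# Made-up sample memory, for illustration only
contexts = {
    "security": ["INC-2024-089 was caused by an expired certificate"],
    "infra": ["SRV-789 hosts the auth service", "INC-2024-089 affected SRV-789"],
}
print(entity_linked_retrieve("What caused INC-2024-089?", contexts))
```

Here the fact stored under `infra` is returned even though the question would most naturally be routed to the `security` context, which is the behaviour the cross-context search is meant to provide.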
Supported Entity ID Patterns
| Pattern | Example | Use Case |
|---|---|---|
| INC-YYYY-NNN | INC-2024-089 | Security incidents, outages |
| CVE-YYYY-NNNNN | CVE-2024-12345 | Vulnerability tracking |
| PROJ-NNN | PROJ-456 | Project management |
| SRV-NNN | SRV-789 | Infrastructure inventory |
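One way to express the table as regular expressions is shown below. The digit counts are inferred from the examples rather than taken from a specification, and the names `ENTITY_PATTERNS` and `extract_entity_ids` are hypothetical. The CVE pattern accepts four or more trailing digits, since real CVE IDs are not limited to five.

```python
import re

# Digit counts are inferred from the table's examples, not from a spec
ENTITY_PATTERNS = {
    "incident": re.compile(r"\bINC-\d{4}-\d{3}\b"),
    "vulnerability": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    "project": re.compile(r"\bPROJ-\d{3}\b"),
    "server": re.compile(r"\bSRV-\d{3}\b"),
}


def extract_entity_ids(text: str) -> dict[str, list[str]]:
    """Map each entity kind to the IDs of that kind found in the text."""
    found: dict[str, list[str]] = {}
    for kind, pattern in ENTITY_PATTERNS.items():
        ids = pattern.findall(text)
        if ids:
            found[kind] = ids
    return found


print(extract_entity_ids("Did CVE-2024-12345 affect SRV-789 during INC-2024-089?"))
# {'incident': ['INC-2024-089'], 'vulnerability': ['CVE-2024-12345'], 'server': ['SRV-789']}
```

Keeping one compiled pattern per entity kind makes it straightforward to add a new ID family without touching the extraction loop.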
Multi-Entity Retrieval
For questions requiring multi-hop reasoning across multiple entities:
- Identify all named entities in the question
- Retrieve facts for each entity independently
- Combine with context overlap detection
- Support cross-entity relationship traversal
This handles questions like "Compare the response times of INC-2024-089 and INC-2024-102" where facts about both incidents must be retrieved.
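A rough sketch of the per-entity retrieval for that question follows. It reads "context overlap detection" as "find the contexts that hold facts about two or more of the entities", which is an interpretation rather than something this document specifies; the memory layout, function name, and sample facts are likewise invented for illustration.

```python
import re

ENTITY_ID = re.compile(r"\b[A-Z]+-\d+(?:-\d+)?\b")  # assumed generic ID shape


def multi_entity_retrieve(
    question: str, contexts: dict[str, list[str]]
) -> tuple[dict[str, list[str]], set[str]]:
    """Retrieve facts per entity, then report contexts shared by several entities."""
    entity_ids = list(dict.fromkeys(ENTITY_ID.findall(question)))  # unique, in order
    per_entity: dict[str, list[str]] = {eid: [] for eid in entity_ids}
    contexts_by_entity: dict[str, set[str]] = {eid: set() for eid in entity_ids}
    for name, facts in contexts.items():
        for fact in facts:
            for eid in entity_ids:
                if eid in fact:
                    per_entity[eid].append(fact)
                    contexts_by_entity[eid].add(name)
    # Assumed meaning of "context overlap": contexts with facts on 2+ entities
    overlap = {
        name
        for name in contexts
        if sum(name in ctxs for ctxs in contexts_by_entity.values()) >= 2
    }
    return per_entity, overlap


# Made-up sample memory, for illustration only
contexts = {
    "incidents": [
        "INC-2024-089 response time: 42 minutes",
        "INC-2024-102 response time: 17 minutes",
    ],
    "postmortems": ["INC-2024-102 root cause: misconfigured firewall rule"],
}
per_entity, overlap = multi_entity_retrieve(
    "Compare the response times of INC-2024-089 and INC-2024-102", contexts
)
print(per_entity["INC-2024-089"])  # one fact, from "incidents"
print(overlap)  # {'incidents'}
```

Contexts in the overlap set are natural starting points for cross-entity comparison, because they already describe both entities in the same terms.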
Results
These improvements increased evaluation accuracy:
- Overall: 96.0% to 97.8% (+1.8 percentage points)
- Temporal evolution: 86.6% to 99.8% (+13.2 percentage points)
- Security log analysis: 88% to 100% (+12 percentage points)
- 9 categories reached 100% accuracy
Best Practices
- Deterministic before semantic — Check for exact/pattern matches first; fall back to LLM grading only for ambiguous cases
- Correct-first logic — Always check for correct answers before penalizing for incorrect patterns
- Cross-context retrieval — For structured IDs, search all contexts, not just the current one
- Multi-vote grading — Use 3+ grading votes and take the median
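The first and last practices combine naturally, as in the sketch below. It assumes the deterministic grader signals an ambiguous case by returning `None`, which is a convention chosen here and not one stated in this document; `grade_with_votes` and both grader callables are hypothetical, and the scripted votes stand in for real LLM calls.

```python
from statistics import median
from typing import Callable, Optional


def grade_with_votes(
    answer: str,
    deterministic: Callable[[str], Optional[float]],
    semantic: Callable[[str], float],
    votes: int = 3,
) -> float:
    """Use the deterministic score when decisive, else the median of semantic votes."""
    score = deterministic(answer)
    if score is not None:  # deterministic grader was decisive
        return score
    return median(semantic(answer) for _ in range(votes))


def exact_match(answer: str) -> Optional[float]:
    # None means "ambiguous, defer to the semantic grader" (assumed convention)
    return 1.0 if "$1.4m" in answer.lower() else None


scripted_votes = iter([0.0, 1.0, 1.0])  # stand-in for three noisy LLM votes


def fake_llm_grader(answer: str) -> float:
    return next(scripted_votes)


print(grade_with_votes("The budget is $1.4M", exact_match, fake_llm_grader))  # 1.0
print(grade_with_votes("About one point four million", exact_match, fake_llm_grader))  # 1.0
```

With an odd number of votes the median is always one of the actual votes, so a single outlier (the 0.0 above) cannot drag the final score.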
Related
- Eval System Architecture — full evaluation system overview
- Eval Retrieval Reference — retrieval method specifications