# Memory A/B Test - Quick Reference

One-page guide for running memory effectiveness tests.
## Quick Commands

```bash
# Install dependencies
pip install scipy statsmodels numpy

# Run full test suite (all phases)
python scripts/memory_test_harness.py --full

# Run a specific phase
python scripts/memory_test_harness.py --phase baseline
python scripts/memory_test_harness.py --phase sqlite
python scripts/memory_test_harness.py --phase neo4j

# Analyze existing results
python scripts/memory_test_harness.py --analyze

# Custom output directory
python scripts/memory_test_harness.py --full --output-dir my_results
```
## Test Phases

| Phase | Command | Duration | Decision |
|---|---|---|---|
| 1. Baseline | `--phase baseline` | ~8 hours | None |
| 2. SQLite | `--phase sqlite` | ~8 hours | Proceed if >20% improvement |
| 3. Neo4j | `--phase neo4j` | ~8 hours | Only if Phase 2 succeeds |
| 4. Report | Automatic | ~1 hour | Final recommendation |
## Success Criteria Checklist

### For SQLite (Phase 2 → Phase 3)

- Statistical significance: p < 0.05
- Effect size: Cohen's d > 0.5
- Time reduction: > 20%
- No major errors

**Decision**: Proceed to Neo4j testing? YES / NO

### For Neo4j (Phase 3 → Production)

- Statistical significance vs SQLite: p < 0.05
- Meaningful improvement: > 15% over SQLite
- Benefit justifies the added complexity
- Scale warrants a graph database

**Decision**: Deploy Neo4j? YES / NO (if NO, start with SQLite)
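
The Phase 2 gate can also be checked in a script once the comparison file exists. The sketch below is illustrative only: the field names `p_value`, `cohens_d`, and `time_reduction_pct` are assumptions, so rename them to match the actual keys in `baseline_vs_sqlite_comparison.json`.

```python
import json

# Load the Phase 2 vs Phase 1 comparison produced by the harness
with open('test_results/baseline_vs_sqlite_comparison.json') as f:
    comparison = json.load(f)

# NOTE: these key names are assumptions; adjust to the real file structure
p_value = comparison['p_value']
cohens_d = comparison['cohens_d']
time_reduction_pct = comparison['time_reduction_pct']

# Apply the Phase 2 → Phase 3 gate criteria from the checklist above
proceed = p_value < 0.05 and cohens_d > 0.5 and time_reduction_pct > 20
print(f"Proceed to Neo4j testing: {'YES' if proceed else 'NO'}")
```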
## Expected Results

### Memory vs No Memory (Phase 2)

| Metric | Expected | Action if Below |
|---|---|---|
| Time reduction | -20% to -35% | Investigate scenarios |
| Error reduction | -50% to -70% | Check memory quality |
| Quality improvement | +25% to +40% | Review metrics |

### Neo4j vs SQLite (Phase 3)

| Metric | Expected | Action if Below |
|---|---|---|
| Time reduction | -5% to -15% | Stick with SQLite |
| Query speed | -10% to -30% | Scale not reached yet |
## Results Files

```text
test_results/
├── baseline_results.json              # Phase 1: Control data
├── sqlite_results.json                # Phase 2: SQLite data
├── neo4j_results.json                 # Phase 3: Neo4j data
├── baseline_vs_sqlite_comparison.json
├── sqlite_vs_neo4j_comparison.json
└── final_report.md                    # Phase 4: Final recommendation
```
## Quick Troubleshooting

### Test Taking Too Long

```bash
# Run with fewer iterations (3 instead of 5)
# Edit memory_test_harness.py:
#   Change: for iteration in range(5)
#   To:     for iteration in range(3)
```

### Memory System Not Available

```bash
# Check SQLite memory
python -c "from amplihack.memory import MemoryManager; print('OK')"

# Check Neo4j memory
python -c "from amplihack.memory.neo4j import Neo4jConnector; print('OK')"
```

### Statistical Analysis Fails

```bash
# Install required packages
pip install scipy statsmodels numpy matplotlib seaborn

# Verify installation
python -c "import scipy.stats; import statsmodels; print('OK')"
```
## Interpreting Results

### P-Value (Statistical Significance)

- p < 0.001: Extremely strong evidence ✅✅✅
- p < 0.01: Strong evidence ✅✅
- p < 0.05: Moderate evidence ✅
- p > 0.05: Insufficient evidence ❌
### Effect Size (Practical Significance)

- d > 0.8: Large effect (major improvement) ✅✅
- d > 0.5: Medium effect (substantial improvement) ✅
- d > 0.2: Small effect (minor improvement) ~
- d < 0.2: Negligible effect (not worth it) ❌
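
For a quick cross-check, Cohen's d can be computed directly from the raw execution-time lists used in the Cheat Sheet. This is a minimal sketch using the pooled-standard-deviation form; the harness may report a paired variant, so small differences are expected.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation for two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Example with the execution-time lists from the Cheat Sheet:
# d = cohens_d(baseline_times, sqlite_times)
```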
### Confidence Interval

- CI entirely negative: Performance degraded ❌
- CI crosses zero: Uncertain benefit ⚠️
- CI entirely positive: Confirmed improvement ✅
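
The interval referenced here is the confidence interval on the mean per-scenario difference. A minimal sketch for computing it by hand is shown below, assuming paired runs of the same scenarios; the harness's comparison files may already contain this value.

```python
import numpy as np
from scipy import stats

def paired_mean_diff_ci(a, b, confidence=0.95):
    """Confidence interval for the mean of per-scenario differences (a - b)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = len(diff)
    sem = diff.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return diff.mean() - t_crit * sem, diff.mean() + t_crit * sem

# Example: interval on time saved per scenario (baseline minus SQLite)
# low, high = paired_mean_diff_ci(baseline_times, sqlite_times)
```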
## Decision Matrix

| p-value | Effect Size | Action |
|---|---|---|
| < 0.05 | > 0.8 | STRONG PROCEED ✅✅ |
| < 0.05 | 0.5-0.8 | PROCEED ✅ |
| < 0.05 | 0.2-0.5 | CONSIDER ~ |
| < 0.05 | < 0.2 | STOP ❌ |
| > 0.05 | Any | STOP ❌ |
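
If the gate is scripted, the matrix translates directly into a small helper. The function below is illustrative (its name and return strings are not part of the harness) and simply encodes the thresholds from the table above.

```python
def matrix_decision(p_value, effect_size):
    """Map a p-value and Cohen's d to the action in the decision matrix."""
    if p_value >= 0.05:
        return "STOP"
    if effect_size > 0.8:
        return "STRONG PROCEED"
    if effect_size > 0.5:
        return "PROCEED"
    if effect_size > 0.2:
        return "CONSIDER"
    return "STOP"

# Example: matrix_decision(0.001, 0.85) -> "STRONG PROCEED"
```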
## Common Scenarios

### Scenario 1: SQLite Shows Strong Benefit

Phase 2 results:

- p-value: 0.001 ✅
- Effect size: 0.85 ✅
- Time reduction: -32% ✅

**Decision**: PROCEED to Phase 3 (test Neo4j)

### Scenario 2: SQLite Shows Marginal Benefit

Phase 2 results:

- p-value: 0.04 ✅
- Effect size: 0.35 ~
- Time reduction: -12% ~

**Decision**: STOP - benefit too small to justify adoption

### Scenario 3: Neo4j Shows No Additional Benefit

Phase 3 results (Neo4j vs SQLite):

- p-value: 0.45 ❌
- Effect size: 0.15 ~
- Time reduction: -3% ~

**Decision**: Deploy SQLite, skip Neo4j
## Timeline

### Week 1-2: Implementation

- Implement scenario execution
- Integrate with agents
- Validate test harness

### Week 3: Phase 1

```bash
python scripts/memory_test_harness.py --phase baseline
# Wait ~8 hours
# Review baseline_results.json
```

### Week 4: Phase 2 + Decision

```bash
python scripts/memory_test_harness.py --phase sqlite
# Wait ~8 hours
# Review baseline_vs_sqlite_comparison.json
# DECISION GATE: Proceed to Phase 3?
```

### Week 5: Phase 3 (conditional)

```bash
# Only if Phase 2 was successful
python scripts/memory_test_harness.py --phase neo4j
# Wait ~8 hours
# Review sqlite_vs_neo4j_comparison.json
```

### Week 6: Final Decision

- Review all results
- Generate final report
- Make deployment decision
## Emergency Stops

### Stop Testing If:

- Test harness errors: Fix bugs before continuing
- Metrics look wrong: Validate collection logic
- Results highly variable: Increase iterations
- Clear negative impact: Stop immediately
## Contact / Questions

- Test design: `docs/memory/EFFECTIVENESS_TEST_DESIGN.md`
- Implementation: `scripts/memory_test_harness.py`
- Results: `test_results/` directory
## Cheat Sheet

```python
# Quick analysis of results
import json

import numpy as np
from scipy.stats import ttest_rel

# Load results
with open('test_results/baseline_results.json') as f:
    baseline = json.load(f)

with open('test_results/sqlite_results.json') as f:
    sqlite = json.load(f)

# Extract execution times
baseline_times = [r['time']['execution_time'] for r in baseline]
sqlite_times = [r['time']['execution_time'] for r in sqlite]

# Calculate improvement (negative value = faster with memory)
baseline_mean = np.mean(baseline_times)
sqlite_mean = np.mean(sqlite_times)
improvement = ((sqlite_mean - baseline_mean) / baseline_mean) * 100

print(f"Baseline: {baseline_mean:.1f}s")
print(f"SQLite: {sqlite_mean:.1f}s")
print(f"Improvement: {improvement:.1f}%")

# Quick decision: paired t-test across the same scenarios
t_stat, p_value = ttest_rel(baseline_times, sqlite_times)
print(f"p-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")
```
**Status**: Ready for Use | **Updated**: 2025-11-03