GraphReranker Module Documentation¶
Module: wikigr.agent.enhancements.graph_reranker
Module Overview¶
GraphReranker reranks vector search results using graph centrality metrics (PageRank) to promote authoritative articles with many incoming links.
Accuracy Impact: +5-10% over baseline vector search Citation Quality Impact: +15% (promotes authoritative sources) Latency: +50ms per query (PageRank cached)
Module-Level Docstring¶
"""
Graph-based reranking for vector search results.
This module provides graph centrality reranking to improve retrieval accuracy
by promoting authoritative articles (those with high PageRank scores) in
vector search results.
Algorithm:
1. Compute PageRank for all articles using LINKS_TO edges
2. Normalize PageRank scores to [0, 1] range
3. Combine vector similarity and PageRank:
combined_score = alpha * vector_similarity + beta * pagerank
4. Rerank results by combined score
Performance:
- PageRank computation: O(V + E) where V = articles, E = links
- Reranking: O(N log N) where N = number of results
- PageRank cached for 1 hour (configurable via cache_ttl)
Example:
>>> from wikigr.agent.enhancements.graph_reranker import GraphReranker
>>> import real_ladybug as kuzu
>>>
>>> conn = kuzu.Connection(kuzu.Database("physics.db"))
>>> reranker = GraphReranker(conn, alpha=0.7, beta=0.3)
>>>
>>> results = [
... {"title": "Quantum_fluctuation", "score": 0.95},
... {"title": "Quantum_mechanics", "score": 0.90}
... ]
>>>
>>> reranked = reranker.rerank(results, top_k=10)
>>> print(reranked[0]["title"])
'Quantum_mechanics' # Promoted due to high PageRank
Dependencies:
- real_ladybug (aliased as kuzu): LadybugDB graph database connection
- numpy: PageRank computation (optional, falls back to pure Python)
See Also:
- MultiDocSynthesizer: Multi-document retrieval
- FewShotManager: Few-shot example injection
"""
Class: GraphReranker¶
class GraphReranker:
"""
Reranks vector search results using graph centrality metrics.
This class computes PageRank scores for articles in the knowledge graph
and combines them with vector similarity scores to rerank search results.
PageRank scores are cached to avoid recomputation on every query.
Attributes:
conn (kuzu.Connection): LadybugDB database connection
alpha (float): Weight for vector similarity (default: 0.7)
beta (float): Weight for PageRank score (default: 0.3)
cache_ttl (int): PageRank cache TTL in seconds (default: 3600)
_pagerank_cache (dict | None): Cached PageRank scores
_cache_timestamp (float): Timestamp of last PageRank computation
Example:
>>> reranker = GraphReranker(conn)
>>> reranked = reranker.rerank(results, top_k=10)
"""
Constructor¶
def __init__(
self,
conn: kuzu.Connection,
alpha: float = 0.7,
beta: float = 0.3,
cache_ttl: int = 3600
) -> None:
"""
Initialize GraphReranker with connection and weights.
Args:
conn: LadybugDB database connection
alpha: Weight for vector similarity (must be 0 < alpha <= 1)
beta: Weight for PageRank score (must be 0 < beta <= 1)
cache_ttl: PageRank cache TTL in seconds (default: 1 hour)
Raises:
ValueError: If alpha + beta != 1.0
ValueError: If alpha or beta not in (0, 1]
TypeError: If conn is not a kuzu.Connection
Example:
>>> conn = kuzu.Connection(kuzu.Database("pack.db"))
>>> reranker = GraphReranker(conn, alpha=0.8, beta=0.2)
"""
rerank() Method¶
def rerank(
self,
results: list[dict],
top_k: int = 10
) -> list[dict]:
"""
Rerank search results using combined vector + PageRank scores.
Combines vector similarity (from input results) with PageRank scores
to produce reranked results that balance semantic relevance and
article authority.
Args:
results: List of search results with 'title' and 'score' keys
top_k: Number of top results to return (default: 10)
Returns:
Reranked results (list of dicts) with updated 'score' values,
sorted by combined score in descending order
Raises:
ValueError: If results is empty or missing required keys
RuntimeError: If PageRank computation fails
Example:
>>> results = [
... {"title": "Article_A", "score": 0.95},
... {"title": "Article_B", "score": 0.90}
... ]
>>> reranked = reranker.rerank(results, top_k=5)
>>> print(reranked[0]["title"])
'Article_B' # May be promoted if it has higher PageRank
Note:
- Results without PageRank scores (new articles) receive median score
- Missing 'title' or 'score' keys raise ValueError
- top_k larger than len(results) returns all results
"""
compute_pagerank() Method¶
def compute_pagerank(
self,
damping: float = 0.85,
max_iter: int = 100,
tol: float = 1e-6
) -> dict[str, float]:
"""
Compute PageRank scores for all articles in the knowledge graph.
Uses the LINKS_TO relationship edges to compute PageRank via the
power iteration method. Results are normalized to [0, 1] range.
Args:
damping: PageRank damping factor (default: 0.85, standard value)
max_iter: Maximum iterations (default: 100)
tol: Convergence tolerance (default: 1e-6)
Returns:
Dictionary mapping article titles to normalized PageRank scores
Raises:
RuntimeError: If LadybugDB query fails or graph is empty
ValueError: If damping not in (0, 1)
Example:
>>> pagerank = reranker.compute_pagerank(damping=0.85)
>>> print(pagerank["Quantum_mechanics"])
0.0145 # Normalized PageRank score
Implementation:
1. Query: MATCH (a:Article)-[:LINKS_TO]->(b:Article)
RETURN a.title, b.title
2. Build adjacency matrix from edges
3. Power iteration: PR(t+1) = (1-d)/N + d * sum(PR(t) / out_degree)
4. Normalize to [0, 1] range
Note:
- Uses cache if available and not expired (check cache_ttl)
- Returns empty dict if no LINKS_TO edges exist
- Articles with no incoming links receive minimum score
"""
_is_cache_valid() Method¶
def _is_cache_valid(self) -> bool:
"""
Check if PageRank cache is valid (not expired).
Returns:
True if cache exists and not expired, False otherwise
Example:
>>> if not reranker._is_cache_valid():
... reranker.compute_pagerank()
"""
_normalize_scores() Method¶
def _normalize_scores(
self,
scores: dict[str, float]
) -> dict[str, float]:
"""
Normalize PageRank scores to [0, 1] range using min-max scaling.
Args:
scores: Dictionary of raw PageRank scores
Returns:
Dictionary of normalized scores in [0, 1] range
Example:
>>> raw = {"A": 0.02, "B": 0.01, "C": 0.03}
>>> normalized = reranker._normalize_scores(raw)
>>> print(normalized)
{'A': 0.5, 'B': 0.0, 'C': 1.0} # Min-max normalized
"""
Usage Examples¶
Basic Reranking¶
from wikigr.agent.enhancements.graph_reranker import GraphReranker
import real_ladybug as kuzu
# Initialize
conn = kuzu.Connection(kuzu.Database("physics.db"))
reranker = GraphReranker(conn)
# Vector search results (from semantic search)
results = [
{"title": "Quantum_fluctuation", "score": 0.95},
{"title": "Quantum_mechanics", "score": 0.90},
{"title": "Quantum_field_theory", "score": 0.88}
]
# Rerank using graph centrality
reranked = reranker.rerank(results, top_k=10)
for rank, result in enumerate(reranked, 1):
print(f"{rank}. {result['title']} (score: {result['score']:.3f})")
Output:
1. Quantum_mechanics (score: 0.921) # Promoted (high PageRank)
2. Quantum_fluctuation (score: 0.913)
3. Quantum_field_theory (score: 0.871)
Custom Weights¶
# Emphasize graph centrality (favor authoritative articles)
reranker = GraphReranker(conn, alpha=0.5, beta=0.5)
reranked = reranker.rerank(results, top_k=10)
# Emphasize vector similarity (favor semantic relevance)
reranker = GraphReranker(conn, alpha=0.9, beta=0.1)
reranked = reranker.rerank(results, top_k=10)
PageRank Inspection¶
# Compute and inspect PageRank scores
pagerank = reranker.compute_pagerank()
# Find most authoritative articles
top_articles = sorted(
pagerank.items(),
key=lambda x: x[1],
reverse=True
)[:10]
print("Top 10 most authoritative articles:")
for title, score in top_articles:
print(f" {title}: {score:.4f}")
Output:
Top 10 most authoritative articles:
Quantum_mechanics: 0.0145
Physics: 0.0123
Quantum_entanglement: 0.0089
Particle_physics: 0.0078
...
Cache Management¶
# Check cache status
if reranker._is_cache_valid():
print("Using cached PageRank scores")
else:
print("Computing fresh PageRank scores")
# Force recomputation (e.g., after adding new articles)
reranker._pagerank_cache = None
pagerank = reranker.compute_pagerank()
# Configure custom cache TTL (30 minutes)
reranker = GraphReranker(conn, cache_ttl=1800)
Performance Tuning¶
Recommended Settings by Pack Size¶
| Pack Size | alpha | beta | cache_ttl | Notes |
|---|---|---|---|---|
| <100 articles | 0.8 | 0.2 | 1800 (30min) | Small packs: favor semantic similarity |
| 100-500 articles | 0.7 | 0.3 | 3600 (1h) | Balanced setting (default) |
| 500-1000 articles | 0.6 | 0.4 | 7200 (2h) | Large packs: leverage graph structure |
Latency Optimization¶
# Precompute PageRank at pack build time
pagerank = reranker.compute_pagerank()
# Save to file for instant loading
import json
with open("pagerank_cache.json", "w") as f:
json.dump(pagerank, f)
# Load precomputed scores
with open("pagerank_cache.json") as f:
reranker._pagerank_cache = json.load(f)
reranker._cache_timestamp = time.time()
Testing¶
import pytest
from wikigr.agent.enhancements.graph_reranker import GraphReranker
def test_rerank_promotes_high_pagerank():
"""Test that articles with high PageRank are promoted."""
conn = kuzu.Connection(kuzu.Database("test.db")) # kuzu aliased from real_ladybug
reranker = GraphReranker(conn)
results = [
{"title": "Low_Authority", "score": 0.95},
{"title": "High_Authority", "score": 0.90}
]
reranked = reranker.rerank(results, top_k=10)
# High_Authority should be promoted if it has higher PageRank
assert reranked[0]["title"] == "High_Authority"
def test_pagerank_computation():
"""Test PageRank computation returns valid scores."""
conn = kuzu.Connection(kuzu.Database("test.db")) # kuzu aliased from real_ladybug
reranker = GraphReranker(conn)
pagerank = reranker.compute_pagerank()
assert len(pagerank) > 0
assert all(0 <= score <= 1 for score in pagerank.values())
assert abs(sum(pagerank.values()) - 1.0) < 1e-6 # Scores sum to 1
def test_cache_invalidation():
"""Test PageRank cache expires correctly."""
conn = kuzu.Connection(kuzu.Database("test.db")) # kuzu aliased from real_ladybug
reranker = GraphReranker(conn, cache_ttl=1) # 1 second TTL
# First computation
pagerank1 = reranker.compute_pagerank()
assert reranker._is_cache_valid()
# Wait for cache to expire
time.sleep(2)
assert not reranker._is_cache_valid()
# Second computation (should recompute)
pagerank2 = reranker.compute_pagerank()
assert pagerank1 == pagerank2 # Same graph, same scores
Troubleshooting¶
No PageRank Scores Available¶
Problem: reranker.compute_pagerank() returns empty dict.
Cause: No LINKS_TO edges in the graph.
Solution: Check that articles have links:
result = conn.execute("MATCH ()-[r:LINKS_TO]->() RETURN count(r) AS count")
count = result.get_as_df().iloc[0]["count"]
print(f"Total links: {count}")
Reranking Has No Effect¶
Problem: Reranked results same order as input.
Cause: All articles have similar PageRank scores (flat graph).
Solution: Increase beta to emphasize graph centrality:
reranker = GraphReranker(conn, alpha=0.5, beta=0.5)
Slow Reranking Performance¶
Problem: Reranking takes >500ms per query.
Cause: PageRank recomputed on every query (cache invalid).
Solution: Increase cache TTL or precompute PageRank:
reranker = GraphReranker(conn, cache_ttl=7200) # 2 hours
See Also¶
- Phase 1 Enhancements Reference - Complete API reference
- Phase 1 How-To Guide - Usage examples
- MultiDocSynthesizer Module - Multi-document retrieval