BFS Link Expansion Algorithm

Understanding how WikiGR crawls web content using breadth-first search.

The Problem

When building knowledge graphs from web content, a single page rarely contains all relevant information. Users need to:

  • Follow links to related content
  • Control crawl depth and breadth
  • Filter irrelevant links
  • Avoid infinite loops and redundant processing

A naive depth-first approach can get stuck in deep rabbit holes. A better approach uses breadth-first search (BFS) to explore links systematically.

Why BFS?

Breadth-first search explores all links at depth N before moving to depth N+1.

BFS vs DFS Comparison

Depth-First Search (DFS):

Root → Link1 → Link1.1 → Link1.1.1 → Link1.1.2
                                        └→ (too deep!)

Breadth-First Search (BFS):

Depth 0: Root
Depth 1: Link1, Link2, Link3
Depth 2: Link1.1, Link1.2, Link2.1, Link2.2, Link3.1
Depth 3: ... (controlled by max_depth)

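The ordering difference is easy to see in code. A minimal sketch on a toy graph (the adjacency dict is made up for illustration); the only change between BFS and DFS is which end of the frontier is popped:

from collections import deque

# Toy graph: page → outgoing links (illustrative only)
graph = {
    "Root": ["Link1", "Link2", "Link3"],
    "Link1": ["Link1.1", "Link1.2"],
    "Link2": ["Link2.1"],
    "Link1.1": ["Link1.1.1"],
}

def traverse(start: str, fifo: bool) -> list:
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        node = frontier.popleft() if fifo else frontier.pop()  # FIFO = BFS, LIFO = DFS
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(traverse("Root", fifo=True))   # Root, Link1, Link2, Link3, Link1.1, Link1.2, Link2.1, Link1.1.1
print(traverse("Root", fifo=False))  # Root, Link3, Link2, Link2.1, Link1, Link1.2, Link1.1, Link1.1.1
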
Benefits of BFS:

  1. Proximity to root: Pages closer to root are processed first (usually more relevant)
  2. Controlled exploration: Easy to limit depth without missing breadth
  3. Fair processing: All links at same depth get equal priority
  4. Predictable memory: Queue growth is capped by the max_links limit rather than by the full link frontier (see Space Complexity below)

Algorithm Overview

The BFS crawler uses a queue to track URLs to visit:

1. Initialize queue with root URL (depth 0)
2. While queue not empty and haven't hit max_links:
   a. Dequeue (url, depth)
   b. Skip if already visited
   c. Fetch and parse page
   d. Yield Article object
   e. Extract links from page
   f. Filter links (domain, patterns, visited)
   g. If depth < max_depth, enqueue filtered links at depth+1
3. Done when queue empty or max_links reached

Implementation Details

Data Structures

from collections import deque
from typing import Deque, Set, Tuple

# FIFO queue of (url, depth) pairs
queue: Deque[Tuple[str, int]] = deque([(root_url, 0)])

# Visited set: O(1) membership test prevents reprocessing
visited: Set[str] = set()

# Depth and progress counters
current_depth = 0
processed_count = 0

Main Loop

from collections import deque

import requests
from bs4 import BeautifulSoup

def get_articles(self) -> Generator[Article, None, None]:
    queue = deque([(self.url, 0)])  # FIFO queue of (url, depth)

    while queue and len(self.visited) < self.max_links:
        current_url, depth = queue.popleft()  # FIFO (BFS); O(1) on a deque

        if current_url in self.visited:
            continue

        # Fetch page
        response = requests.get(current_url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract content
        title = self._extract_title(soup)
        content = self._extract_content(soup)
        links = self._extract_links(soup, current_url)

        # Mark visited
        self.visited.add(current_url)

        # Yield article
        yield Article(title=title, content=content, url=current_url, links=links)

        # Expand if not at max depth
        if depth < self.max_depth:
            filtered = self.expand_links(current_url, links, depth)
            queue.extend((link, depth + 1) for link in filtered)

Key aspects:

  • queue.popleft() - FIFO order guarantees BFS; a deque makes each dequeue O(1), where a plain list's pop(0) would be O(n)
  • visited set - O(1) lookup prevents cycles
  • Generator pattern - memory-efficient, lazy evaluation (see the usage sketch below)
  • Depth tracking - each URL knows its distance from root
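
Because get_articles is a generator, pages are fetched only as the caller consumes them. A minimal usage sketch (WebCrawler is an assumed name for the class that owns these methods):

crawler = WebCrawler(
    url="https://example.com/docs/intro",
    max_depth=2,
    max_links=50,
)

for article in crawler.get_articles():
    # Each iteration fetches at most one new page; breaking out stops the crawl early
    print(article.title, article.url)
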
Link Filtering

import re
from urllib.parse import urljoin, urlparse

def expand_links(self, root_url: str, links: List[str], depth: int) -> List[str]:
    """
    Filter discovered links before adding to queue.

    Filters applied in order:
    1. Already visited → skip
    2. Same domain (if enabled)
    3. Include pattern (if provided)
    4. Exclude pattern (if provided)
    5. Max links limit
    """
    filtered = []
    root_domain = urlparse(root_url).netloc

    for link in links:
        # Normalize URL: resolve relative links against the current page
        link = urljoin(root_url, link)

        # Already visited, or already collected from this page
        if link in self.visited or link in filtered:
            continue

        # Same domain check
        if self.same_domain_only:
            if urlparse(link).netloc != root_domain:
                continue

        # Include pattern
        if self.include_pattern:
            if not re.search(self.include_pattern, link):
                continue

        # Exclude pattern
        if self.exclude_pattern:
            if re.search(self.exclude_pattern, link):
                continue

        # Max links limit
        if len(self.visited) + len(filtered) >= self.max_links:
            break

        filtered.append(link)

    return filtered

Filter order matters:

  • Early checks (visited, domain) are cheap set and string operations
  • Regex checks are slower, so they run later (a precompiling sketch follows)
  • The max links check is last, so only links that survive every other filter count toward the cap
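
Since both patterns run against every surviving link, precompiling them once in the constructor avoids re-parsing the regex on each call. A sketch, assuming the patterns arrive as constructor arguments:

import re
from typing import Optional

def __init__(self, include_pattern: Optional[str] = None, exclude_pattern: Optional[str] = None):
    # Compile once. re.search() accepts a precompiled pattern unchanged,
    # so expand_links above keeps working without modification.
    self.include_pattern = re.compile(include_pattern) if include_pattern else None
    self.exclude_pattern = re.compile(exclude_pattern) if exclude_pattern else None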

Example Execution

Let's trace a BFS crawl with max_depth=2 and max_links=10:

Initial State

Root URL: https://example.com/docs/intro
max_depth: 2
max_links: 10
exclude_pattern: "/blog/"

Queue: [("https://example.com/docs/intro", 0)]
Visited: {} (empty)

Iteration 1 (Depth 0)

Dequeue: ("https://example.com/docs/intro", 0)
Fetch page, extract content
Links found: ["/docs/setup", "/docs/api", "/blog/news"]

Filter links (same domain, exclude pattern):
  ✓ https://example.com/docs/setup
  ✓ https://example.com/docs/api
  ✗ https://example.com/blog/news (excluded by pattern)

Queue: [
  ("https://example.com/docs/setup", 1),
  ("https://example.com/docs/api", 1)
]
Visited: {"https://example.com/docs/intro"}
Processed: 1/10

Iteration 2 (Depth 1)

Dequeue: ("https://example.com/docs/setup", 1)
Fetch page, extract content
Links found: ["/docs/config", "/docs/install", "/docs/intro"]

Filter links:
  ✓ https://example.com/docs/config
  ✓ https://example.com/docs/install
  ✗ https://example.com/docs/intro (already visited)

Queue: [
  ("https://example.com/docs/api", 1),
  ("https://example.com/docs/config", 2),
  ("https://example.com/docs/install", 2)
]
Visited: {"https://example.com/docs/intro", "https://example.com/docs/setup"}
Processed: 2/10

Iteration 3 (Depth 1)

Dequeue: ("https://example.com/docs/api", 1)
Fetch page, extract content
Links found: ["/docs/api/auth", "/docs/api/rest"]

Queue: [
  ("https://example.com/docs/config", 2),
  ("https://example.com/docs/install", 2),
  ("https://example.com/docs/api/auth", 2),
  ("https://example.com/docs/api/rest", 2)
]
Visited: {intro, setup, api}
Processed: 3/10

Iteration 4-7 (Depth 2)

Process all depth-2 links:

  • config (depth 2)
  • install (depth 2)
  • api/auth (depth 2)
  • api/rest (depth 2)

Each might discover more links, but depth 2 is max, so new links are not enqueued.

Final State

Processed: 7 pages (1 at depth 0, 2 at depth 1, 4 at depth 2)
Queue: [] (empty)
Visited: {intro, setup, api, config, install, api/auth, api/rest}

Depth vs Breadth Trade-offs

wikigr create --source=web --url="..." --max-depth=5 --max-links=25

Characteristics:

  • Explores deep hierarchies
  • Finds highly specific content
  • Narrow focus (few branches followed)

Use cases:

  • Following a specific path deep into docs
  • Researching a specialized topic
  • Building focused knowledge graphs

wikigr create --source=web --url="..." --max-depth=1 --max-links=100

Characteristics:

  • Stays close to root
  • Covers broad range of topics
  • Wide focus (many branches followed)

Use cases:

  • Surveying a documentation site
  • Building broad knowledge graphs
  • Discovering what's available

Balanced Crawl

wikigr create --source=web --url="..." --max-depth=2 --max-links=50

Characteristics:

  • Moderate depth and breadth
  • Good for most use cases

Performance Characteristics

Time Complexity

  • BFS traversal: O(V + E) where V = vertices (pages), E = edges (links)
  • With max_links limit: O(min(V, max_links))
  • Depth limit impact: O(b^d) where b = branching factor, d = max_depth

Space Complexity

  • Queue size: O(b^d) worst case (the entire depth-d frontier), capped by max_links in practice
  • Visited set: O(min(V, max_links))
  • Total: O(min(b^d, max_links))
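
For example, with b = 10 links per page and max_depth = 2, the depth-2 frontier can hold up to 10² = 100 queued URLs, so a max_links cap of 50 is what actually bounds memory.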

Real-World Performance

Measured on Microsoft Learn documentation:

Configuration          Pages   Depth Levels   Time      Queue Peak
depth=0, links=1       1       1              1.2s      1
depth=1, links=10      10      2              12s       9
depth=2, links=50      50      3              1m 45s    28
depth=3, links=100     100     4              4m 12s    47

Observations:

  • Time grows linearly with pages (network-bound)
  • Queue size stays manageable (< 50 even for 100 pages)
  • Depth 3+ rarely needed for most documentation sites

Edge Cases and Error Handling

Infinite Loops

Problem: Site has circular links (A → B → A)

Solution: Visited set prevents reprocessing

if current_url in self.visited:
    continue
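
A related trap is the same page appearing under superficially different URLs. A sketch of canonicalizing before the visited check (urldefrag is standard library; the trailing-slash rule is an assumption that may not hold for every site):

from urllib.parse import urldefrag

def canonicalize(url: str) -> str:
    # /page and /page#section are the same document
    url, _fragment = urldefrag(url)
    # Treat /page/ and /page as equivalent (site-specific assumption)
    return url.rstrip("/") or url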

Redirects

Problem: URL redirects to different URL (301/302)

Solution: Use final URL after redirects

response = requests.get(url, allow_redirects=True)
final_url = response.url  # Use this for visited check
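
To keep a redirecting URL from being fetched again when other pages link to it, one option is to record both addresses, a sketch:

response = requests.get(url, allow_redirects=True, timeout=10)

# Mark the requested URL and the post-redirect URL as visited,
# so future links to either form are skipped
self.visited.add(url)
self.visited.add(response.url)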

Unreachable Pages

Problem: HTTP 404, 500, or network timeout

Solution: Log error, continue with remaining queue

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    logging.warning(f"Failed to fetch {url}: {e}")
    continue  # Skip this page

Malformed URLs

Problem: Invalid URLs in <a href> tags

Solution: Normalize and validate before adding

from typing import Optional
from urllib.parse import urljoin, urlparse

def normalize_url(base_url: str, link: str) -> Optional[str]:
    try:
        # Resolve relative URLs
        absolute = urljoin(base_url, link)
        # Validate
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https"):
            return absolute
    except ValueError:
        pass
    return None
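
For instance:

print(normalize_url("https://example.com/docs/intro", "../api"))
# https://example.com/api
print(normalize_url("https://example.com/docs/intro", "mailto:team@example.com"))
# None (non-HTTP scheme rejected)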

Rate Limiting

Problem: Too many requests to same domain

Solution: Add delay between requests

import time

# In get_articles(), after each fetch:
time.sleep(0.5)  # 500ms politeness delay between requests

Comparison to Other Algorithms

vs Depth-First Search (DFS)

Aspect               BFS                DFS
Exploration order    Level by level     Branch by branch
Memory               O(b^d) frontier    O(b × d) stack
Finds closest first  Yes                No
Balanced crawl       Yes                No (can go very deep)

BFS wins for web crawling: proximity to the root correlates with relevance, and exploration stays balanced.

vs Bidirectional BFS

Start from both root and target, meet in middle.

Not applicable here: No specific target URL, exploring unknown space.

vs Priority Queue (Dijkstra-style)

Use priority queue instead of FIFO queue, prioritize by relevance score.

Possible enhancement:

import heapq

queue = []
heapq.heappush(queue, (priority, url, depth))

while queue:
    priority, url, depth = heapq.heappop(queue)
    # ... process ...

Trade-off: More complex, requires relevance scoring function.
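
As a sketch of what such a scoring function could look like (a naive path-prefix heuristic, purely illustrative):

from urllib.parse import urlparse

def priority(url: str, root_url: str) -> int:
    # Lower value = popped first by heapq.
    # Heuristic: prefer links that stay under the root's path prefix.
    root_path = urlparse(root_url).path
    return 0 if urlparse(url).path.startswith(root_path) else 1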

Future Enhancements

Parallel Crawling

Use thread pool for concurrent fetching:

from concurrent.futures import ThreadPoolExecutor

def fetch_batch(urls: List[str]) -> List[Article]:
    # fetch_single(url) fetches and parses one page into an Article
    with ThreadPoolExecutor(max_workers=5) as executor:
        return list(executor.map(fetch_single, urls))

Benefit: 5-10x speedup for I/O-bound crawling
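
One caveat: with concurrent workers, the visited set becomes shared mutable state. A sketch of a lock-guarded check (names and structure are hypothetical):

from threading import Lock

visited: set = set()
visited_lock = Lock()

def mark_visited(url: str) -> bool:
    # True if url was newly recorded; False if another worker got there first
    with visited_lock:
        if url in visited:
            return False
        visited.add(url)
        return True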

Adaptive Depth

Adjust max_depth based on content quality:

def should_expand(article: Article, depth: int) -> bool:
    if depth >= max_depth:
        return False
    # Expand if high-quality content
    return calculate_quality_score(article) > threshold
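
calculate_quality_score is left undefined above; a naive illustrative heuristic (the lengths and weights are arbitrary assumptions):

def calculate_quality_score(article: Article) -> float:
    # Longer content and more outgoing links suggest a substantive page
    content_score = min(len(article.content) / 5000, 1.0)
    link_score = min(len(article.links) / 20, 1.0)
    return 0.7 * content_score + 0.3 * link_score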

Smart Filtering

Use ML to predict link relevance:

def predict_relevance(link: str, context: str) -> float:
    # Use trained model to score link
    return model.predict(link, context)