How to Filter Links During Web Crawling

Control which links the crawler follows during breadth-first search (BFS) crawling so the crawl stays focused on relevant content.

Problem

You want to crawl a website but follow only certain links: for example, links on the same domain, links under specific paths, or links that do not match excluded patterns.

Solution

Use link filtering options to control BFS crawling behavior.

Filter by Domain

Only follow links within the same domain:

wikigr create \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks" \
  --max-depth=2 \
  --same-domain-only \
  --db-path=azure_aks.db

What happens:

- Starting domain: learn.microsoft.com
- Follows: https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads
- Skips: https://github.com/Azure/AKS (different domain)
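The domain check can be pictured as a hostname comparison. The sketch below is an illustration only, not wikigr's actual implementation: the function name `same_domain` is hypothetical, and it assumes an exact hostname match, so whether wikigr also treats subdomains as the same domain is not something this page confirms.

```python
from urllib.parse import urlparse


def same_domain(start_url: str, candidate_url: str) -> bool:
    """Return True when both URLs have the same hostname.

    Assumption: exact hostname equality. urlparse() lowercases the
    hostname, so the comparison is case-insensitive.
    """
    start_host = urlparse(start_url).hostname
    candidate_host = urlparse(candidate_url).hostname
    return start_host is not None and start_host == candidate_host


start = "https://learn.microsoft.com/en-us/azure/aks/what-is-aks"
for link in (
    "https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads",
    "https://github.com/Azure/AKS",
):
    print("follow" if same_domain(start, link) else "skip", link)
```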

Filter by URL Pattern

Include only URLs matching a pattern:

wikigr create \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks" \
  --max-depth=3 \
  --include-pattern="/azure/aks/" \
  --db-path=aks_only.db

What happens:

- Follows: https://learn.microsoft.com/en-us/azure/aks/tutorial-kubernetes-deploy-cluster
- Skips: https://learn.microsoft.com/en-us/azure/vm/... (doesn't match pattern)
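To predict which URLs an include pattern keeps, you can test it locally. This sketch assumes the pattern is a regular expression searched anywhere in the full URL; this page does not state whether wikigr matches against the full URL or only the path, and `matches_include` is a hypothetical helper name.

```python
import re


def matches_include(url: str, pattern: str) -> bool:
    # Assumption: the pattern is a regex that may match anywhere in the URL.
    return re.search(pattern, url) is not None


pattern = "/azure/aks/"
print(matches_include(
    "https://learn.microsoft.com/en-us/azure/aks/tutorial-kubernetes-deploy-cluster",
    pattern,
))
print(matches_include("https://learn.microsoft.com/en-us/azure/vm/...", pattern))
```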

Exclude URL Patterns

Skip URLs matching exclusion patterns:

wikigr create \
  --source=web \
  --url="https://docs.python.org/3/library/" \
  --max-depth=2 \
  --exclude-pattern="^.*/genindex\\.html$" \
  --exclude-pattern="^.*/py-modindex\\.html$" \
  --db-path=python_docs.db

What happens:

- Follows: https://docs.python.org/3/library/os.html
- Skips: https://docs.python.org/3/genindex.html (matches exclusion)
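Inside shell double quotes, `\\.` reaches the program as `\.`, so the first pattern above arrives as `^.*/genindex\.html$`. The sketch below checks the two patterns with Python's `re` module, assuming regex search semantics; `is_excluded` is a hypothetical helper, not part of wikigr.

```python
import re

# The patterns as the program would receive them after shell unquoting.
EXCLUDE_PATTERNS = [r"^.*/genindex\.html$", r"^.*/py-modindex\.html$"]


def is_excluded(url: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if any exclusion pattern matches the URL."""
    return any(re.search(p, url) for p in patterns)


print(is_excluded("https://docs.python.org/3/library/os.html"))
print(is_excluded("https://docs.python.org/3/genindex.html"))
# The trailing $ anchors the match to the end of the URL, so a query
# string after genindex.html would keep the URL from being excluded.
print(is_excluded("https://docs.python.org/3/genindex.html?x=1"))
```

Because of the `$` anchor, these patterns exclude only URLs that end exactly in the file name. Drop the anchor if variants with query strings or fragments should be excluded as well.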

Combine Multiple Filters

Use multiple filters together:

wikigr create \
  --source=web \
  --url="https://kubernetes.io/docs/concepts/overview/" \
  --max-depth=2 \
  --max-links=20 \
  --same-domain-only \
  --include-pattern="/docs/concepts/" \
  --exclude-pattern="/docs/concepts/workloads/controllers/" \
  --db-path=k8s_concepts.db

Filter logic:

1. Must be same domain (kubernetes.io)
2. Must match include pattern (/docs/concepts/)
3. Must NOT match exclude pattern (/docs/concepts/workloads/controllers/)
4. Stop after 20 pages
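The first three rules can be read as a single per-link predicate. The following is a minimal sketch under the same assumptions as before (exact hostname match, regex search on the full URL). `should_follow` and the sample URLs are illustrative, and the 20-page cap is left out because it applies to the crawl as a whole rather than to an individual link.

```python
import re
from urllib.parse import urlparse


def should_follow(url, start_url, same_domain_only=False, include=None, exclude=()):
    """Return True if the link passes the domain, include, and exclude rules."""
    if same_domain_only and urlparse(url).hostname != urlparse(start_url).hostname:
        return False
    if include is not None and not re.search(include, url):
        return False
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return True


start = "https://kubernetes.io/docs/concepts/overview/"
options = dict(
    same_domain_only=True,
    include="/docs/concepts/",
    exclude=["/docs/concepts/workloads/controllers/"],
)
samples = [
    "https://kubernetes.io/docs/concepts/architecture/nodes/",
    "https://kubernetes.io/docs/concepts/workloads/controllers/deployment/",
    "https://kubernetes.io/docs/tasks/",
    "https://github.com/kubernetes/kubernetes",
]
for link in samples:
    print(should_follow(link, start, **options), link)
```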

Example: Crawl Azure Documentation for AKS

wikigr create \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/" \
  --max-depth=2 \
  --max-links=50 \
  --same-domain-only \
  --include-pattern="/azure/aks/" \
  --exclude-pattern="/azure/aks/api-reference/" \
  --db-path=azure_aks_focused.db

Output:

Processing 1 article from web...
Expanding links: depth 1, found 28 URLs (12 after filtering)
Expanding links: depth 2, found 64 URLs (37 after filtering, limiting to 50 total)
Filtered out: 15 different domain, 8 excluded pattern
Extracted 1,456 entities, 892 relationships across 50 pages
Knowledge graph created: azure_aks_focused.db

Example: Crawl GitHub Wiki with URL Filtering

wikigr create \
  --source=web \
  --url="https://github.com/microsoft/WSL/wiki" \
  --max-depth=1 \
  --same-domain-only \
  --include-pattern="/microsoft/WSL/wiki/" \
  --db-path=wsl_wiki.db

Output:

Processing 1 article from web...
Expanding links: depth 1, found 42 URLs (42 after filtering)
All links within github.com/microsoft/WSL/wiki/
Extracted 783 entities, 521 relationships across 43 pages
Knowledge graph created: wsl_wiki.db

Depth vs Breadth Control

How --max-depth and --max-links interact:

# Deep but narrow: follow links far but limit total count
wikigr create \
  --source=web \
  --url="https://example.com/root" \
  --max-depth=5 \
  --max-links=25

# Shallow but wide: follow many links but stay close to root
wikigr create \
  --source=web \
  --url="https://example.com/root" \
  --max-depth=1 \
  --max-links=100

Depth 5, max 25 links:

- Explores deep hierarchies
- Finds highly specific content
- Fewer pages overall

Depth 1, max 100 links:

- Stays close to root
- Covers broad topics
- More pages at same level
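Because BFS visits shallow levels first, how deep a capped crawl actually gets depends on how many links survive at each level. The toy simulation below illustrates this on a synthetic site where every page links to three new pages. The functions `make_tree` and `crawl` are hypothetical, and the sketch assumes the link cap counts total pages including the start page.

```python
from collections import deque


def make_tree(branching, levels):
    """Build a synthetic site: every page links to `branching` new pages."""
    links, frontier = {}, ["root"]
    for _ in range(levels):
        next_frontier = []
        for page in frontier:
            links[page] = [f"{page}/{i}" for i in range(branching)]
            next_frontier.extend(links[page])
        frontier = next_frontier
    return links


def crawl(links, root, max_depth, max_links):
    """Return (page, depth) pairs in BFS order, capped at max_links pages."""
    visited = {root}
    order = [(root, 0)]
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for child in links.get(page, []):
            if len(order) >= max_links:
                return order  # page budget used up
            if child not in visited:
                visited.add(child)
                order.append((child, depth + 1))
                queue.append((child, depth + 1))
    return order


site = make_tree(branching=3, levels=5)
deep = crawl(site, "root", max_depth=5, max_links=25)
wide = crawl(site, "root", max_depth=1, max_links=100)
print(len(deep), max(d for _, d in deep))  # 25 3
print(len(wide), max(d for _, d in wide))  # 4 1
```

In this toy tree the 25-page budget is spent by depth 3, so the depth limit of 5 is never reached, while the depth-1 crawl stops at 4 pages because the root has only 3 links. On a real site, tighter include and exclude patterns are what let a small page budget reach deeper levels.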

Link Filtering Algorithm

The BFS crawler applies filters in this order:

1. Parse URL - Extract domain, path, query
2. Check visited - Skip if already processed
3. Same domain - Skip if --same-domain-only is set and the domain differs
4. Include pattern - Skip if --include-pattern is provided and the URL doesn't match
5. Exclude pattern - Skip if --exclude-pattern is provided and the URL matches
6. Max links - Stop once the total page count reaches --max-links
7. Add to queue - Process at the next depth level
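The ordered checks above can be sketched as one pass over the links found on a page, tallying why each link was skipped. This is an illustrative reconstruction, not wikigr's source: `filter_candidates`, its parameters, and the two sample URLs marked hypothetical are assumptions, as is the use of regex search on the full URL.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def filter_candidates(candidates, start_url, visited, total_pages,
                      same_domain_only=False, include=None, exclude=(),
                      max_links=None):
    """Apply the checks in the listed order; return (accepted, skip_counts)."""
    start_host = urlparse(start_url).hostname
    accepted, skipped = [], Counter()
    for url in candidates:
        parts = urlparse(url)                                  # 1. parse URL
        if url in visited:                                     # 2. check visited
            skipped["visited"] += 1
            continue
        if same_domain_only and parts.hostname != start_host:  # 3. same domain
            skipped["different domain"] += 1
            continue
        if include and not re.search(include, url):            # 4. include pattern
            skipped["not included"] += 1
            continue
        if any(re.search(p, url) for p in exclude):            # 5. exclude pattern
            skipped["excluded pattern"] += 1
            continue
        if max_links is not None and total_pages + len(accepted) >= max_links:
            break                                              # 6. max links
        visited.add(url)
        accepted.append(url)                                   # 7. add to queue
    return accepted, skipped


start = "https://learn.microsoft.com/en-us/azure/aks/"
candidates = [
    "https://learn.microsoft.com/en-us/azure/aks/what-is-aks",
    "https://github.com/Azure/AKS",
    "https://learn.microsoft.com/en-us/azure/aks/what-is-aks",            # duplicate
    "https://learn.microsoft.com/en-us/azure/aks/api-reference/overview", # hypothetical
    "https://learn.microsoft.com/en-us/azure/virtual-machines/overview",  # hypothetical
    "https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads",
]
accepted, skipped = filter_candidates(
    candidates, start, visited={start}, total_pages=1,
    same_domain_only=True, include="/azure/aks/",
    exclude=["/azure/aks/api-reference/"], max_links=50,
)
print(accepted)
print(dict(skipped))
```

Running the cheap checks (visited, domain) before the regex checks keeps the per-link cost low, and the per-reason tally makes it easy to see which filter is removing the most links.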

Troubleshooting

Too Many Irrelevant Pages

Problem: Crawler follows links to off-topic pages.

Solution: Add more restrictive include patterns:

--include-pattern="/docs/guide/"

Missing Important Pages

Problem: Crawler skips pages you want to include.

Solution: Check if exclude patterns are too broad:

# Too broad - excludes everything under /api/
--exclude-pattern="/api/"

# More specific - excludes only /api/internal/
--exclude-pattern="/api/internal/"
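Before re-running a crawl, it can help to check a candidate exclude pattern against a handful of URLs you care about. The helper below is a hypothetical dry-run aid, not a wikigr feature. It assumes regex search semantics, and the example.com URLs are placeholders.

```python
import re


def dropped_by(pattern, urls):
    """Return the URLs that the given exclude pattern would remove."""
    return [url for url in urls if re.search(pattern, url)]


sample_urls = [
    "https://example.com/api/users",
    "https://example.com/api/internal/debug",
    "https://example.com/docs/guide/intro",
]
print(dropped_by("/api/", sample_urls))           # drops both /api/ pages
print(dropped_by("/api/internal/", sample_urls))  # drops only the internal page
```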

Crawl Never Finishes

Problem: The crawler keeps discovering new links, so the crawl runs far longer than expected.

Solution: Reduce depth or add stricter filters:

wikigr create \
  --source=web \
  --url="..." \
  --max-depth=1 \
  --max-links=20 \
  --same-domain-only