Web Content Source API Reference¶
Complete reference for WebContentSource class and related interfaces.
WebContentSource¶
Implementation of ContentSource protocol for web-based knowledge graph creation.
Class Definition¶
from backend.sources.web_content_source import WebContentSource
class WebContentSource(ContentSource):
"""
Content source for web URLs with LLM extraction and BFS link expansion.
Provides feature parity with WikipediaContentSource:
- LLM entity and relationship extraction
- BFS link crawling with configurable depth
- Incremental updates with URL deduplication
- Shared ArticleProcessor for consistent extraction
"""
Constructor¶
def __init__(
self,
url: str,
max_depth: int = 0,
max_links: int = 1,
same_domain_only: bool = True,
include_pattern: Optional[str] = None,
exclude_pattern: Optional[str] = None
)
Parameters:
- `url` (str, required): Starting URL for content extraction
- `max_depth` (int, default: 0): Maximum BFS depth for link expansion (0 = no expansion)
- `max_links` (int, default: 1): Maximum total pages to process
- `same_domain_only` (bool, default: True): Only follow links within the same domain
- `include_pattern` (Optional[str], default: None): Regex pattern; only follow matching URLs
- `exclude_pattern` (Optional[str], default: None): Regex pattern; skip matching URLs
Example:
source = WebContentSource(
url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks",
max_depth=2,
max_links=50,
same_domain_only=True,
include_pattern=r"/azure/aks/",
exclude_pattern=r"/api-reference/"
)
Methods¶
get_articles()¶
def get_articles(self) -> Generator[Article, None, None]:
"""
Yields Article objects from web content.
Yields:
Article: Each discovered web page as an Article with:
- title: Extracted from <title> or <h1> tag
- content: Main text content (HTML stripped)
- url: Full URL of the page
- links: Discovered links for BFS expansion
Raises:
ValueError: If URL is invalid or unreachable
requests.HTTPError: If HTTP request fails
Example:
>>> source = WebContentSource(url="https://example.com")
>>> for article in source.get_articles():
... print(f"{article.title}: {len(article.content)} chars")
Example Page: 1234 chars
"""
expand_links()¶
def expand_links(self, root_url: str, links: List[str], depth: int) -> List[str]:
"""
Expand links using BFS with filtering.
Args:
root_url: Starting URL (for domain checking)
links: List of discovered links from current page
depth: Current depth in BFS traversal
Returns:
List of filtered URLs to process at next depth level
Filtering applied:
1. Same domain check (if same_domain_only=True)
2. Include pattern match (if include_pattern set)
3. Exclude pattern rejection (if exclude_pattern set)
4. Duplicate removal (already visited)
5. Max links enforcement
Example:
>>> source = WebContentSource(
... url="https://example.com",
... same_domain_only=True,
... exclude_pattern=r"/admin/"
... )
>>> links = [
... "https://example.com/about",
... "https://example.com/admin/settings",
... "https://other.com/page"
... ]
>>> filtered = source.expand_links("https://example.com", links, 1)
>>> print(filtered)
['https://example.com/about']
"""
Command-Line Interface¶
wikigr create --source=web¶
Create a knowledge graph from web content.
wikigr create --source=web [OPTIONS]
Required Options:
- `--url URL`: Starting URL for content extraction
- `--db-path PATH`: Output database file path
Optional Options:
- `--max-depth INT`: BFS depth (default: 0, no expansion)
- `--max-links INT`: Maximum pages to process (default: 1)
- `--same-domain-only`: Only follow same-domain links (default: enabled)
- `--include-pattern REGEX`: Only follow URLs matching pattern
- `--exclude-pattern REGEX`: Skip URLs matching pattern
- `--max-entities INT`: Maximum entities per page (default: 50)
- `--no-relationships`: Skip relationship extraction (faster)
Examples:
# Single page
wikigr create \
--source=web \
--url="https://example.com/article" \
--db-path=output.db
# Crawl with depth 2, max 25 pages
wikigr create \
--source=web \
--url="https://learn.microsoft.com/en-us/azure/aks/intro" \
--max-depth=2 \
--max-links=25 \
--db-path=azure_aks.db
# Filter by URL pattern
wikigr create \
--source=web \
--url="https://docs.python.org/3/library/" \
--max-depth=1 \
--include-pattern="/library/" \
--exclude-pattern="/genindex" \
--db-path=python_stdlib.db
Exit Codes:
- `0`: Success
- `1`: Invalid URL or unreachable
- `2`: Database error
- `3`: LLM extraction failure
wikigr update --source=web¶
Update existing knowledge graph with new web content.
wikigr update --source=web [OPTIONS]
Required Options:
- `--url URL`: URL to add/update
- `--db-path PATH`: Existing database file path
Optional Options:
Same as wikigr create --source=web, except that --db-path must point to an existing database file.
Behavior:
- Checks if URL already exists in database
- Skips if content hash matches (no changes)
- Updates if content changed
- Adds new entities and relationships
- Preserves existing graph structure
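The skip-if-unchanged behavior above can be illustrated with a content-hash comparison. This is a minimal sketch; `content_hash` and `should_update` are hypothetical helpers, and the real implementation's hash storage may differ:

```python
import hashlib
from typing import Dict


def content_hash(content: str) -> str:
    """Stable fingerprint of page text, used to detect changes between runs."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def should_update(url: str, content: str, stored_hashes: Dict[str, str]) -> bool:
    """Return True if the URL is new or its content changed since the last run."""
    new_hash = content_hash(content)
    if stored_hashes.get(url) == new_hash:
        return False  # unchanged: skip, nothing is written
    stored_hashes[url] = new_hash
    return True


store: Dict[str, str] = {}
assert should_update("https://example.com/a", "v1", store) is True   # new URL
assert should_update("https://example.com/a", "v1", store) is False  # unchanged
assert should_update("https://example.com/a", "v2", store) is True   # changed
```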
Example:
# Add single page to existing graph
wikigr update \
--source=web \
--url="https://learn.microsoft.com/en-us/azure/aks/best-practices" \
--db-path=azure_aks.db
# Add multiple pages with crawling
wikigr update \
--source=web \
--url="https://learn.microsoft.com/en-us/azure/aks/security-overview" \
--max-depth=1 \
--max-links=10 \
--db-path=azure_aks.db
Exit Codes:
- `0`: Success (updated, or skipped if unchanged)
- `1`: Database not found
- `2`: URL unreachable
- `3`: Update failed (database corruption)
Article Data Model¶
Web content is represented using the shared Article model.
from dataclasses import dataclass
from typing import List
@dataclass
class Article:
"""
Represents a single web page for processing.
Attributes:
title: Page title (from <title> or <h1>)
content: Main text content (HTML stripped)
url: Full URL of the page
links: Discovered hyperlinks for BFS expansion
"""
title: str
content: str
url: str
links: List[str]
Example:
article = Article(
title="What is Azure Kubernetes Service?",
content="Azure Kubernetes Service (AKS) is a managed container orchestration...",
url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks",
links=[
"https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads",
"https://learn.microsoft.com/en-us/azure/aks/tutorial-kubernetes-prepare-app"
]
)
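To show how a fetched page might be turned into an Article, here is a sketch using only the standard-library `html.parser` (the actual extraction pipeline may use a different HTML library; `PageParser` is hypothetical):

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List


@dataclass
class Article:
    """Mirrors the shared Article model documented above."""
    title: str
    content: str
    url: str
    links: List[str]


class PageParser(HTMLParser):
    """Collects the <title> text and all href values from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


html_text = '<html><head><title>Demo</title></head><body><a href="/about">About</a></body></html>'
parser = PageParser()
parser.feed(html_text)
article = Article(title=parser.title, content="...", url="https://example.com",
                  links=parser.links)
print(article.title, article.links)
# Demo ['/about']
```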
ContentSource Protocol¶
WebContentSource implements the ContentSource protocol so all sources share a consistent interface.
from typing import Protocol, Generator
class ContentSource(Protocol):
"""
Protocol for content sources (Wikipedia, web, local files, etc.).
All sources must implement get_articles() to yield Article objects.
"""
def get_articles(self) -> Generator[Article, None, None]:
"""
Yields Article objects from the content source.
Yields:
Article: Content with title, body, URL, and links
"""
...
Implementations:
- WikipediaContentSource: Wikipedia articles via API
- WebContentSource: Web pages via HTTP
- (Future) LocalFileSource: Local markdown/HTML files
- (Future) GitHubWikiSource: GitHub wiki pages
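Because ContentSource is a Protocol, any class with a matching get_articles() satisfies it structurally; no inheritance is required. A minimal hypothetical source (useful in tests) might look like:

```python
from dataclasses import dataclass
from typing import Generator, List, Protocol


@dataclass
class Article:
    title: str
    content: str
    url: str
    links: List[str]


class ContentSource(Protocol):
    def get_articles(self) -> Generator[Article, None, None]: ...


class StaticSource:
    """Hypothetical source yielding a fixed list of articles.

    Satisfies ContentSource by duck typing: it simply defines
    get_articles() with the right shape.
    """

    def __init__(self, articles: List[Article]) -> None:
        self._articles = articles

    def get_articles(self) -> Generator[Article, None, None]:
        yield from self._articles


source: ContentSource = StaticSource(
    [Article(title="Demo", content="body", url="https://example.com", links=[])]
)
titles = [a.title for a in source.get_articles()]
print(titles)  # ['Demo']
```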
Integration with ArticleProcessor¶
WebContentSource uses the shared ArticleProcessor for entity extraction.
from backend.kg_construction.article_processor import ArticleProcessor
processor = ArticleProcessor(conn, use_llm=True)
for article in web_source.get_articles():
processor.process_article(
title=article.title,
content=article.content,
url=article.url
)
Shared behavior across all sources:
- Same LLM extraction pipeline
- Same entity normalization
- Same relationship identification
- Same vector embedding generation
See ArticleProcessor API Reference for details.
Environment Variables¶
Configure LLM extraction behavior:
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key (required) | None |
| `OPENAI_MODEL` | Model for extraction | `gpt-4-turbo-preview` |
| `LLM_TEMPERATURE` | Sampling temperature | `0.0` |
| `LLM_MAX_RETRIES` | Retry attempts on failure | `3` |
| `LLM_RETRY_DELAY` | Seconds between retries | `1.0` |
Example:
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-3.5-turbo
export LLM_TEMPERATURE=0.0
wikigr create --source=web --url="..." --db-path=output.db
Error Handling¶
Common Exceptions¶
# Invalid URL
try:
source = WebContentSource(url="not-a-valid-url")
except ValueError as e:
print(f"Invalid URL: {e}")
# Unreachable URL
try:
for article in source.get_articles():
pass
except requests.HTTPError as e:
print(f"HTTP error: {e}")
# LLM extraction failure
try:
processor.process_article(title, content, url)
except openai.error.RateLimitError as e:
print(f"Rate limit: {e}")
Graceful Degradation¶
If LLM extraction fails:
1. Retries up to LLM_MAX_RETRIES times
2. Logs error and continues to next article
3. Partial graph created from successful extractions
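The retry-then-continue behavior above can be sketched as a small wrapper. This is an illustration of the pattern, not the actual pipeline code; `extract_with_retries` is a hypothetical helper, with `max_retries` and `retry_delay` standing in for `LLM_MAX_RETRIES` and `LLM_RETRY_DELAY`:

```python
import time
from typing import Callable, Optional


def extract_with_retries(
    extract: Callable[[str], list],
    content: str,
    max_retries: int = 3,
    retry_delay: float = 1.0,
) -> Optional[list]:
    """Call extract(content), retrying transient failures up to max_retries.

    Returns None after exhausting retries so the caller can log the error
    and continue to the next article (graceful degradation).
    """
    for attempt in range(1, max_retries + 1):
        try:
            return extract(content)
        except Exception as exc:
            if attempt == max_retries:
                print(f"extraction failed after {max_retries} attempts: {exc}")
                return None
            time.sleep(retry_delay)


calls = {"n": 0}

def flaky(content: str) -> list:
    """Simulated extractor that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return ["Entity"]

print(extract_with_retries(flaky, "text", max_retries=3, retry_delay=0.0))
# ['Entity']
```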
Performance Characteristics¶
Time Complexity¶
- Single page: O(1) HTTP request + O(n) LLM extraction (n = content length)
- BFS crawling: O(d × l) where d = depth, l = links per page
- Total: O(d × l × n) for full crawl with extraction
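As a back-of-envelope worked example of why `max_links` matters: assuming a uniform branching factor of l links per page, the number of candidate pages through depth d grows as 1 + l + l² + … + l^d before the `max_links` cap applies:

```python
def max_pages(depth: int, links_per_page: int) -> int:
    """Upper bound on candidate pages for a BFS crawl with uniform branching."""
    return sum(links_per_page ** level for level in range(depth + 1))


print(max_pages(2, 5))  # 1 + 5 + 25 = 31
```

So a depth-2 crawl with 5 links per page could reach 31 candidates; setting `max_links=25` keeps the actual work bounded.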
Space Complexity¶
- Memory: O(l) for link queue + O(n) for article content
- Database: O(e + r) where e = entities, r = relationships
Benchmarks¶
Measured on Azure AKS documentation:
| Operation | Pages | Entities | Relationships | Time | Cost |
|---|---|---|---|---|---|
| Single page | 1 | 42 | 28 | 3.2s | $0.02 |
| Depth 1 (10 pages) | 10 | 312 | 187 | 28s | $0.15 |
| Depth 2 (50 pages) | 50 | 1,456 | 892 | 2m 14s | $0.68 |
Benchmarks use gpt-4-turbo-preview with temperature 0.0.