Web Content Source API Reference¶

Complete reference for the WebContentSource class and related interfaces.

WebContentSource¶

Implementation of the ContentSource protocol for web-based knowledge graph creation.

Class Definition¶

from backend.sources.web_content_source import WebContentSource

class WebContentSource(ContentSource):
    """
    Content source for web URLs with LLM extraction and BFS link expansion.

    Provides feature parity with WikipediaContentSource:
    - LLM entity and relationship extraction
    - BFS link crawling with configurable depth
    - Incremental updates with URL deduplication
    - Shared ArticleProcessor for consistent extraction
    """

Constructor¶

def __init__(
    self,
    url: str,
    max_depth: int = 0,
    max_links: int = 1,
    same_domain_only: bool = True,
    include_pattern: Optional[str] = None,
    exclude_pattern: Optional[str] = None
)

Parameters:

url (str, required): Starting URL for content extraction
max_depth (int, default: 0): Maximum BFS depth for link expansion (0 = no expansion)
max_links (int, default: 1): Maximum total pages to process
same_domain_only (bool, default: True): Only follow links within same domain
include_pattern (Optional[str], default: None): Regex pattern - only follow matching URLs
exclude_pattern (Optional[str], default: None): Regex pattern - skip matching URLs

Example:

source = WebContentSource(
    url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks",
    max_depth=2,
    max_links=50,
    same_domain_only=True,
    include_pattern=r"/azure/aks/",
    exclude_pattern=r"/api-reference/"
)

Methods¶

get_articles()¶

def get_articles(self) -> Generator[Article, None, None]:
    """
    Yields Article objects from web content.

    Yields:
        Article: Each discovered web page as an Article with:
            - title: Extracted from <title> or <h1> tag
            - content: Main text content (HTML stripped)
            - url: Full URL of the page
            - links: Discovered links for BFS expansion

    Raises:
        ValueError: If URL is invalid or unreachable
        requests.HTTPError: If HTTP request fails

    Example:
        >>> source = WebContentSource(url="https://example.com")
        >>> for article in source.get_articles():
        ...     print(f"{article.title}: {len(article.content)} chars")
        Example Page: 1234 chars
    """

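The Article fields described above (title, stripped content, links) could be produced roughly as follows. This is a minimal sketch using only the standard library, not the actual extraction code: `PageParser` and `parse_page` are hypothetical names, and the choice to skip `<script>` and `<style>` text is an assumption.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Hypothetical helper: collects title, visible text, and absolute links."""

    SKIPPED = {"script", "style"}  # assumption: their text is not page content

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.title_parts: List[str] = []  # text inside <title>
        self.h1_parts: List[str] = []     # text inside the first <h1> (fallback title)
        self.text_parts: List[str] = []   # visible body text
        self.links: List[str] = []
        self._current = None              # "title" or "h1" while inside that tag
        self._skip_depth = 0              # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._current = "title"
        elif tag == "h1" and not self.h1_parts:
            self._current = "h1"
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in ("title", "h1"):
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._current == "title":
            self.title_parts.append(text)
            return  # <title> text is metadata, not body content
        if self._current == "h1":
            self.h1_parts.append(text)
        self.text_parts.append(text)


def parse_page(html: str, url: str) -> Dict[str, object]:
    """Return the four Article fields as a plain dict."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    title = " ".join(parser.title_parts) or " ".join(parser.h1_parts)
    return {
        "title": title,
        "content": " ".join(parser.text_parts),
        "url": url,
        "links": parser.links,
    }
```

A real implementation would likely also isolate the main content area and drop navigation chrome; the sketch keeps all visible text.
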
expand_links()¶

def expand_links(self, root_url: str, links: List[str], depth: int) -> List[str]:
    """
    Expand links using BFS with filtering.

    Args:
        root_url: Starting URL (for domain checking)
        links: List of discovered links from current page
        depth: Current depth in BFS traversal

    Returns:
        List of filtered URLs to process at next depth level

    Filtering applied:
        1. Same domain check (if same_domain_only=True)
        2. Include pattern match (if include_pattern set)
        3. Exclude pattern rejection (if exclude_pattern set)
        4. Duplicate removal (already visited)
        5. Max links enforcement

    Example:
        >>> source = WebContentSource(
        ...     url="https://example.com",
        ...     same_domain_only=True,
        ...     exclude_pattern=r"/admin/"
        ... )
        >>> links = [
        ...     "https://example.com/about",
        ...     "https://example.com/admin/settings",
        ...     "https://other.com/page"
        ... ]
        >>> filtered = source.expand_links("https://example.com", links, 1)
        >>> print(filtered)
        ['https://example.com/about']
    """

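The five filtering steps listed in the docstring can be sketched as a standalone function. This is an illustration under stated assumptions, not the class's actual code: `filter_links` is a hypothetical name, patterns are applied with `re.search` against the full URL, and the visited set is passed in explicitly.

```python
import re
from typing import List, Optional, Set
from urllib.parse import urlparse


def filter_links(
    root_url: str,
    links: List[str],
    visited: Set[str],
    max_links: int,
    same_domain_only: bool = True,
    include_pattern: Optional[str] = None,
    exclude_pattern: Optional[str] = None,
) -> List[str]:
    """Hypothetical stand-in for the filtering done by expand_links()."""
    root_domain = urlparse(root_url).netloc
    include = re.compile(include_pattern) if include_pattern else None
    exclude = re.compile(exclude_pattern) if exclude_pattern else None

    kept: List[str] = []
    for link in links:
        # 1. Same-domain check
        if same_domain_only and urlparse(link).netloc != root_domain:
            continue
        # 2. Include pattern must match somewhere in the URL (assumption: search, not fullmatch)
        if include and not include.search(link):
            continue
        # 3. Exclude pattern rejects the URL
        if exclude and exclude.search(link):
            continue
        # 4. Drop URLs already visited or already queued
        if link in visited or link in kept:
            continue
        # 5. Stop once the total page budget is used up
        if len(visited) + len(kept) >= max_links:
            break
        kept.append(link)
    return kept
```

Called with the three links from the docstring example, `visited={"https://example.com"}`, and `exclude_pattern=r"/admin/"`, this keeps only `https://example.com/about`.
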
Command-Line Interface¶

wikigr create --source=web¶

Create a knowledge graph from web content.

wikigr create --source=web [OPTIONS]

Required Options:

--url URL: Starting URL for content extraction
--db-path PATH: Output database file path

Optional Options:

--max-depth INT: BFS depth (default: 0, no expansion)
--max-links INT: Maximum pages to process (default: 1)
--same-domain-only: Only follow same-domain links (default: enabled)
--include-pattern REGEX: Only follow URLs matching pattern
--exclude-pattern REGEX: Skip URLs matching pattern
--max-entities INT: Maximum entities per page (default: 50)
--no-relationships: Skip relationship extraction (faster)

Examples:

# Single page
wikigr create \
  --source=web \
  --url="https://example.com/article" \
  --db-path=output.db

# Crawl with depth 2, max 25 pages
wikigr create \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/intro" \
  --max-depth=2 \
  --max-links=25 \
  --db-path=azure_aks.db

# Filter by URL pattern
wikigr create \
  --source=web \
  --url="https://docs.python.org/3/library/" \
  --max-depth=1 \
  --include-pattern="/library/" \
  --exclude-pattern="/genindex" \
  --db-path=python_stdlib.db

Exit Codes:

0: Success
1: Invalid or unreachable URL
2: Database error
3: LLM extraction failure

wikigr update --source=web¶

Update existing knowledge graph with new web content.

wikigr update --source=web [OPTIONS]

Required Options:

--url URL: URL to add/update
--db-path PATH: Existing database file path

Optional Options:

Same as for wikigr create --source=web. Note that --db-path must point to an existing database.

Behavior:

Checks if URL already exists in database
Skips if content hash matches (no changes)
Updates if content changed
Adds new entities and relationships
Preserves existing graph structure
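
The skip/update decision described above can be illustrated with a content hash comparison. This is a sketch only: the hash algorithm (SHA-256 here), the function names, and the in-memory dict standing in for the database are all assumptions.

```python
import hashlib
from typing import Dict


def content_hash(content: str) -> str:
    """Hash page text; SHA-256 is an assumption, the source does not name the algorithm."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def decide_update(stored_hashes: Dict[str, str], url: str, content: str) -> str:
    """Return "add", "skip", or "update" and record the new hash when it changes.

    stored_hashes is a hypothetical stand-in for the URL -> hash data kept in the database.
    """
    new_hash = content_hash(content)
    old_hash = stored_hashes.get(url)
    if old_hash is None:
        stored_hashes[url] = new_hash
        return "add"      # URL not seen before
    if old_hash == new_hash:
        return "skip"     # content unchanged
    stored_hashes[url] = new_hash
    return "update"       # content changed since the last run
```
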

Examples:

# Add single page to existing graph
wikigr update \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/best-practices" \
  --db-path=azure_aks.db

# Add multiple pages with crawling
wikigr update \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/security-overview" \
  --max-depth=1 \
  --max-links=10 \
  --db-path=azure_aks.db

Exit Codes:

0: Success (updated or skipped if unchanged)
1: Database not found
2: URL unreachable
3: Update failed (database corruption)

Article Data Model¶

Web content is represented using the shared Article model.

from dataclasses import dataclass
from typing import List

@dataclass
class Article:
    """
    Represents a single web page for processing.

    Attributes:
        title: Page title (from <title> or <h1>)
        content: Main text content (HTML stripped)
        url: Full URL of the page
        links: Discovered hyperlinks for BFS expansion
    """
    title: str
    content: str
    url: str
    links: List[str]

Example:

article = Article(
    title="What is Azure Kubernetes Service?",
    content="Azure Kubernetes Service (AKS) is a managed container orchestration...",
    url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks",
    links=[
        "https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads",
        "https://learn.microsoft.com/en-us/azure/aks/tutorial-kubernetes-prepare-app"
    ]
)

ContentSource Protocol¶

WebContentSource implements the ContentSource protocol to provide a consistent interface across sources.

from typing import Protocol, Generator

class ContentSource(Protocol):
    """
    Protocol for content sources (Wikipedia, web, local files, etc.).

    All sources must implement get_articles() to yield Article objects.
    """

    def get_articles(self) -> Generator[Article, None, None]:
        """
        Yields Article objects from the content source.

        Yields:
            Article: Content with title, body, URL, and links
        """
        ...

Implementations:

WikipediaContentSource - Wikipedia articles via API
WebContentSource - Web pages via HTTP
(Future) LocalFileSource - Local markdown/HTML files
(Future) GitHubWikiSource - GitHub wiki pages
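
Because ContentSource is a Protocol, any class with a matching get_articles() method satisfies it without inheriting from it. The sketch below shows this with a hypothetical `InMemorySource`; `Article` and `ContentSource` are redeclared here only to keep the example self-contained, and `count_articles` is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Dict, Generator, List, Protocol


@dataclass
class Article:
    # Mirrors the Article model documented above
    title: str
    content: str
    url: str
    links: List[str]


class ContentSource(Protocol):
    def get_articles(self) -> Generator[Article, None, None]:
        ...


class InMemorySource:
    """Hypothetical source backed by a dict of URL -> text (useful in tests)."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def get_articles(self) -> Generator[Article, None, None]:
        for url, text in self.pages.items():
            # Use the URL as a placeholder title; no links to expand
            yield Article(title=url, content=text, url=url, links=[])


def count_articles(source: ContentSource) -> int:
    """Works with any object that structurally matches ContentSource."""
    return sum(1 for _ in source.get_articles())
```
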

Integration with ArticleProcessor¶

WebContentSource uses the shared ArticleProcessor for entity extraction.

from backend.kg_construction.article_processor import ArticleProcessor

processor = ArticleProcessor(conn, use_llm=True)

for article in web_source.get_articles():
    processor.process_article(
        title=article.title,
        content=article.content,
        url=article.url
    )

Shared behavior across all sources:

Same LLM extraction pipeline
Same entity normalization
Same relationship identification
Same vector embedding generation

See ArticleProcessor API Reference for details.

Environment Variables¶

Configure LLM extraction behavior:

Variable	Description	Default
`OPENAI_API_KEY`	OpenAI API key (required)	None
`OPENAI_MODEL`	Model for extraction	`gpt-4-turbo-preview`
`LLM_TEMPERATURE`	Sampling temperature	`0.0`
`LLM_MAX_RETRIES`	Retry attempts on failure	`3`
`LLM_RETRY_DELAY`	Seconds between retries	`1.0`

Example:

export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-3.5-turbo
export LLM_TEMPERATURE=0.0
wikigr create --source=web --url="..." --db-path=output.db
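
For reference, the variables and defaults in the table could be read in Python along these lines. `LLMSettings` and `load_llm_settings` are hypothetical names used for illustration; how the tool actually loads its configuration is not shown in this reference.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LLMSettings:
    api_key: str
    model: str = "gpt-4-turbo-preview"
    temperature: float = 0.0
    max_retries: int = 3
    retry_delay: float = 1.0


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Build settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # The key has no default and is marked as required
        raise RuntimeError("OPENAI_API_KEY is required")
    return LLMSettings(
        api_key=api_key,
        model=env.get("OPENAI_MODEL", "gpt-4-turbo-preview"),
        temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        max_retries=int(env.get("LLM_MAX_RETRIES", "3")),
        retry_delay=float(env.get("LLM_RETRY_DELAY", "1.0")),
    )
```
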

Error Handling¶

Common Exceptions¶

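import openai
import requests

from backend.sources.web_content_source import WebContentSource

# Invalid URL
try:
    source = WebContentSource(url="not-a-valid-url")
except ValueError as e:
    print(f"Invalid URL: {e}")

# Unreachable URL (errors surface while iterating the generator)
source = WebContentSource(url="https://example.com")
try:
    for article in source.get_articles():
        pass
except requests.HTTPError as e:
    print(f"HTTP error: {e}")

# LLM extraction failure
# Note: openai>=1.0 exposes this exception as openai.RateLimitError
try:
    processor.process_article(title, content, url)
except openai.error.RateLimitError as e:
    print(f"Rate limit: {e}")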

Graceful Degradation¶

If LLM extraction fails for an article:

1. Retries up to LLM_MAX_RETRIES times
2. Logs the error and continues to the next article
3. Creates a partial graph from the successful extractions
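
The retry-then-continue behavior can be sketched as a small loop. This is an illustration, not the ArticleProcessor code: `process_with_retries` and the `extract` callable are hypothetical, and "retries up to LLM_MAX_RETRIES times" is interpreted as one initial attempt plus that many retries.

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def process_with_retries(
    articles: Iterable[str],
    extract: Callable[[str], None],
    max_retries: int = 3,
    retry_delay: float = 1.0,
) -> Tuple[List[str], List[str]]:
    """Run extract() on each article; return (succeeded, failed) lists."""
    succeeded: List[str] = []
    failed: List[str] = []
    for article in articles:
        for attempt in range(max_retries + 1):  # initial attempt + retries
            try:
                extract(article)
                succeeded.append(article)
                break
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                logger.warning("Attempt %d failed for %s: %s", attempt + 1, article, exc)
                if attempt < max_retries:
                    time.sleep(retry_delay)
        else:
            # All attempts failed: log and move on to the next article
            logger.error("Giving up on %s", article)
            failed.append(article)
    return succeeded, failed
```

Articles in the first list contribute to the graph; those in the second are skipped, which is what leaves a partial graph.
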

Performance Characteristics¶

Time Complexity¶

Single page: O(1) HTTP requests + O(n) LLM extraction (n = content length)
BFS crawling: O(p) page fetches, where p ≤ min(max_links, 1 + l + … + l^d) for depth d and up to l links followed per page
Total: O(p × n) for a full crawl with extraction

Space Complexity¶

Memory: O(l) for link queue + O(n) for article content
Database: O(e + r) where e = entities, r = relationships

Benchmarks¶

Measured on Azure AKS documentation:

Operation	Pages	Entities	Relationships	Time	Cost
Single page	1	42	28	3.2s	$0.02
Depth 1 (10 pages)	10	312	187	28s	$0.15
Depth 2 (50 pages)	50	1,456	892	2m 14s	$0.68

Using gpt-4-turbo-preview at temperature 0.0.
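
The per-page figures implied by the table can be derived with simple arithmetic. The snippet below only restates the benchmark rows above (with 2m 14s converted to 134 seconds); `per_page_stats` is an illustrative helper name.

```python
from typing import Dict, List, Tuple

# (operation, pages, total seconds, total cost in USD), copied from the table above
BENCHMARKS: List[Tuple[str, int, float, float]] = [
    ("Single page", 1, 3.2, 0.02),
    ("Depth 1 (10 pages)", 10, 28.0, 0.15),
    ("Depth 2 (50 pages)", 50, 134.0, 0.68),
]


def per_page_stats(
    rows: List[Tuple[str, int, float, float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each operation to (seconds per page, USD per page)."""
    return {label: (seconds / pages, cost / pages) for label, pages, seconds, cost in rows}


for label, (sec, usd) in per_page_stats(BENCHMARKS).items():
    print(f"{label}: {sec:.2f} s/page, ${usd:.4f}/page")
```

On these numbers, per-page time and cost fall slightly as the crawl grows, from 3.2 s and $0.02 for a single page to roughly 2.7 s and $0.014 per page at 50 pages.
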
