
Getting Started with Web Content Sources

Learn how to build knowledge graphs from web content, at full feature parity with Wikipedia sources.

What You'll Learn

  • Creating a knowledge graph from web URLs
  • Using LLM extraction for entities and relationships
  • Expanding your graph with link crawling
  • Keeping your graph up-to-date with incremental updates

Prerequisites

  • WikiGR installed (pip install -e .)
  • OpenAI API key configured (export OPENAI_API_KEY=your-key)
  • Basic familiarity with command-line tools

Tutorial: Building a Knowledge Graph from Microsoft Azure Docs

Step 1: Create Your First Web-Based Knowledge Graph

Start with a single URL to extract entities and relationships:

wikigr create \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks" \
  --db-path=azure_aks.db

What happens:

  • Downloads and parses the web page
  • Extracts entities (Azure Kubernetes Service, containers, orchestration)
  • Identifies relationships between entities
  • Creates nodes and edges in the knowledge graph

Expected output:

Processing 1 article from web...
Extracted 42 entities, 28 relationships
Knowledge graph created: azure_aks.db
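
The .db extension suggests a SQLite file, though that is an assumption about WikiGR's storage rather than documented behavior. If it is SQLite, you can sanity-check the result directly without assuming any particular table names:

import sqlite3

# Inspect the database produced by Step 1. We only assume the
# --db-path file is SQLite; the table layout is whatever WikiGR
# uses internally, so we just list what exists.
conn = sqlite3.connect("azure_aks.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print("Tables:", tables)
conn.close()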

Step 2: Expand Your Graph with Link Crawling

Add related pages using breadth-first search (BFS):

wikigr create \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/what-is-aks" \
  --max-depth=2 \
  --max-links=10 \
  --db-path=azure_aks_expanded.db

What happens:

  • Starts from the root URL
  • Follows links to depth 2 (root → linked pages → their linked pages)
  • Processes up to 10 pages total
  • Extracts entities and relationships from each page

Expected output:

Processing 1 article from web...
Expanding links: depth 1, found 8 new URLs
Expanding links: depth 2, found 15 new URLs (limiting to 10 total)
Extracted 312 entities, 187 relationships across 10 pages
Knowledge graph created: azure_aks_expanded.db
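
Step 2's crawl behavior can be modeled as a short depth-limited BFS. The sketch below illustrates the algorithm described above rather than WikiGR's actual implementation; it mirrors the --max-depth and --max-links flags and the same-domain default:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(root, max_depth=2, max_links=10):
    """Depth-limited BFS over same-domain links (cf. --max-depth / --max-links)."""
    domain = urlparse(root).netloc
    seen, order = {root}, [root]      # the root counts toward the page total
    queue = deque([(root, 0)])
    while queue and len(order) < max_links:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                  # don't expand beyond max_depth hops
        parser = LinkParser()
        parser.feed(urlopen(url).read().decode("utf-8", errors="ignore"))
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc != domain or absolute in seen:
                continue              # same-domain filter plus deduplication
            seen.add(absolute)
            order.append(absolute)
            queue.append((absolute, depth + 1))
            if len(order) >= max_links:
                break
    return order                      # URLs in the order they would be processed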

Step 3: Update Your Graph Incrementally

Add new content without rebuilding:

wikigr update \
  --source=web \
  --url="https://learn.microsoft.com/en-us/azure/aks/kubernetes-deployment" \
  --db-path=azure_aks_expanded.db

What happens:

  • Checks whether the URL already exists in the database
  • Skips it if already processed, or updates it if the content changed
  • Adds new entities and relationships
  • Preserves the existing graph structure

Expected output:

Checking existing content...
Processing 1 new article from web...
Extracted 18 new entities, 12 new relationships
Updated knowledge graph: azure_aks_expanded.db
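
A common way to implement the "skip or update" check is a content hash per URL. The sketch below models that idea; the dict standing in for the database is purely illustrative:

import hashlib

def needs_update(url, page_text, store):
    """Return True if the URL is new or its content changed.
    `store` maps URL -> last-seen content hash (a stand-in for the DB)."""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if store.get(url) == digest:
        return False    # already processed and unchanged: skip
    store[url] = digest
    return True         # new or changed: re-extract entities

store = {}
print(needs_update("https://example.com/a", "hello", store))   # True  (new)
print(needs_update("https://example.com/a", "hello", store))   # False (unchanged)
print(needs_update("https://example.com/a", "hello!", store))  # True  (changed)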

Step 4: Query Your Graph

Now explore the knowledge you've extracted:

# Start the backend server
python -m backend.main

# In another terminal, query the graph
curl -X POST http://localhost:8000/api/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Azure Kubernetes Service?"}'

Expected response:

{
  "answer": "Azure Kubernetes Service (AKS) is a managed container orchestration service...",
  "entities": ["Azure Kubernetes Service", "containers", "Kubernetes"],
  "relationships": [
    {"source": "AKS", "relation": "IS_A", "target": "managed service"},
    {"source": "AKS", "relation": "ORCHESTRATES", "target": "containers"}
  ]
}
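
The same query can be issued from Python using the endpoint and payload shown above; only the requests library is added:

import requests

# POST the question to the backend started with `python -m backend.main`
resp = requests.post(
    "http://localhost:8000/api/graph/query",
    json={"query": "What is Azure Kubernetes Service?"},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()

print(result["answer"])
for rel in result["relationships"]:
    print(f'{rel["source"]} --{rel["relation"]}--> {rel["target"]}')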

Key Concepts Learned

LLM Extraction

Web sources use the same LLM extraction pipeline as Wikipedia:

  • Identifies named entities (people, places, organizations, concepts)
  • Extracts semantic relationships between entities
  • Normalizes entity names for consistency
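
A minimal version of that extraction step could look like the following. The prompt wording, model name, and JSON schema are illustrative assumptions, not WikiGR's internal prompts:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(text: str) -> str:
    """Ask the model for entities and relationships as JSON.
    Prompt and model choice are assumptions for illustration."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {"role": "system", "content": (
                "Extract named entities and the relationships between them. "
                'Reply as JSON: {"entities": [...], "relationships": '
                '[{"source": ..., "relation": ..., "target": ...}]}'
            )},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content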

Link Crawling

BFS crawling follows links intelligently:

  • Respects max-depth (how many hops from the root)
  • Limits the total page count with max-links
  • Keeps only same-domain links by default
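
The same-domain default amounts to a one-line predicate. How WikiGR implements its filter internally is not shown here, so treat this as an illustration:

from urllib.parse import urlparse

def same_domain(root: str, candidate: str) -> bool:
    """Default link filter: keep only links on the root URL's host."""
    return urlparse(candidate).netloc == urlparse(root).netloc

root = "https://learn.microsoft.com/en-us/azure/aks/what-is-aks"
print(same_domain(root, "https://learn.microsoft.com/en-us/azure/aks/concepts-clusters"))  # True
print(same_domain(root, "https://kubernetes.io/docs/"))  # False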

Incremental Updates

The update command is efficient:

  • Skips already-processed URLs
  • Updates only changed content
  • Merges new entities into the existing graph
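
Merging typically relies on the same name normalization mentioned under LLM Extraction, so duplicates collapse to a single node. A sketch of that merge step, with a plain set standing in for the graph's entity index:

def merge_entities(existing, incoming):
    """Add only genuinely new entities; normalization makes
    'Containers' and 'containers ' collide. Illustrative only."""
    added = []
    for name in incoming:
        key = name.strip().lower()
        if key not in existing:
            existing.add(key)
            added.append(name)
    return added

graph_entities = {"azure kubernetes service", "containers"}
print(merge_entities(graph_entities, ["Containers", "Kubernetes", "kubelet"]))
# -> ['Kubernetes', 'kubelet']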

Next Steps