urls.txt Format and Conventions¶
Complete reference for the urls.txt file used to specify source URLs for knowledge packs.
File Location¶
data/packs/<pack-name>/urls.txt
Example:
data/packs/langchain-expert/urls.txt
data/packs/openai-api-expert/urls.txt
data/packs/vercel-ai-sdk/urls.txt
data/packs/llamaindex-expert/urls.txt
data/packs/zig-expert/urls.txt
File Format¶
Plain text. One URL or comment per line. No trailing whitespace requirements.
# This is a comment (full-line comment only)
https://example.com/docs/overview
https://example.com/docs/concepts/
# Another section comment
https://example.com/docs/api/
Lines¶
| Type | Syntax | Description |
|---|---|---|
| URL | https://... |
A fully-qualified HTTPS URL to ingest |
| Comment | # text |
Section header or explanatory note; ignored by build |
| Blank | (empty) | Visual separator; ignored by build |
URL Rules¶
Required:
- Must begin with
https://— plainhttp://lines are silently dropped byload_urls()at parse time (SEC-01) and rejected byvalidate_download_url()at fetch time - Must be a public URL reachable without authentication
- Must not contain private IP ranges or cloud metadata endpoints
Prohibited:
| Pattern | Example | Reason |
|---|---|---|
| Plain HTTP | http://example.com |
No cleartext transport |
| Localhost | http://localhost:8080 |
Not a public source |
| Private IP | http://10.0.0.1/ |
Not a public source |
| Cloud metadata | http://169.254.169.254/ |
SSRF risk |
| Credentials in URL | https://user:pass@example.com/ |
Secret exposure |
| API keys in query | https://example.com?api_key=sk-… |
Secret exposure |
Optional but recommended:
- Trailing slash on directory-style URLs (
/docs/concepts/not/docs/concepts) - Ordering within a section from most general to most specific
Comment Conventions¶
Comments serve as section headers. They appear in build logs when verbose logging is enabled.
Preferred style:
# Section Name - Optional Subtitle
Examples from production packs:
# Core Documentation
# How-To Guides
# How-To Guides - Additional Sub-Pages
# Tutorials - Additional Sub-Pages
# API Reference Sub-Categories
# GitHub - Raw Content & Source Files (reliable text extraction)
# Community Resources
Comments should describe the category of URLs that follow, not individual URLs. Per-URL comments are not supported.
Canonical Hosts¶
Use the current canonical hostname. Do not use deprecated or unofficial mirrors.
| Pack | Canonical Host(s) | Notes |
|---|---|---|
| langchain-expert | python.langchain.com |
Use this, NOT docs.langchain.com (dead) |
| langchain-expert | docs.smith.langchain.com |
LangSmith observability — separate service, valid |
| openai-api-expert | platform.openai.com |
Official API docs |
| openai-api-expert | cookbook.openai.com, github.com/openai/openai-cookbook |
Cookbook examples |
| openai-api-expert | openai.github.io/openai-agents-python |
Agents SDK docs |
| vercel-ai-sdk | sdk.vercel.ai |
Official SDK docs |
| vercel-ai-sdk | raw.githubusercontent.com/vercel/ai, github.com/vercel/ai |
Rate-limit fallback |
| llamaindex-expert | docs.llamaindex.ai/en/stable |
Stable release docs |
| zig-expert | ziglang.org |
Official language reference |
| zig-expert | zig.guide |
Community tutorial, structured learning path |
| zig-expert | ziglearn.org |
Community resource |
GitHub URL Formats¶
GitHub URLs may appear in two forms, each serving a different extraction strategy:
Blob URLs¶
https://github.com/owner/repo/blob/main/path/to/file.md
- Returns rendered HTML with file content
- Use for Markdown, Jupyter notebooks (
.ipynb), and documentation files - Jupyter notebooks: code cells and markdown cells are both extracted as text
Raw URLs¶
https://raw.githubusercontent.com/owner/repo/main/path/to/file.md
- Returns raw file content (plain text, Markdown, etc.)
- Faster and more reliable for text extraction
- Preferred for Markdown and MDX files when both options are available
Best practice: include both blob/ and raw.githubusercontent.com/ variants for important files:
# GitHub - Raw Content & Source Files (reliable text extraction)
https://raw.githubusercontent.com/vercel/ai/main/README.md
https://github.com/vercel/ai/blob/main/README.md
URL Count Guidelines¶
Minimum and recommended URL counts by pack complexity:
| Pack Complexity | Minimum URLs | Recommended URLs | Example |
|---|---|---|---|
| Focused library | 30 | 45–60 | openai-api-expert |
| Framework with integrations | 50 | 65–80 | langchain-expert |
| Full platform (RAG + agents) | 50 | 70–90 | llamaindex-expert |
| Language reference | 30 | 45–60 | zig-expert |
| TypeScript SDK | 35 | 45–55 | vercel-ai-sdk |
URL count does not equal article count. Build scripts may crawl additional linked pages; the urls.txt provides seed URLs only.
Section Ordering¶
Recommended section order for a typical technology pack:
- File header comment (technology name, topics covered, any important notes)
- Core Documentation / Overview
- Getting Started / Quickstart
- Concepts / Architecture
- How-To Guides (index + sub-pages)
- Tutorials (index + sub-pages)
- API Reference (index + sub-categories)
- Integrations / Providers
- GitHub (repository, README)
- SDK / CLI Tools
- Observability / Tooling
- Community Resources
Validation¶
The scripts/validate_pack_urls.py script checks all URLs in a pack file:
python scripts/validate_pack_urls.py --pack langchain-expert
Checks performed:
- URL passes SSRF safety validation (rejects private IPs, localhost, cloud metadata endpoints)
- HTTP HEAD returns a successful status (non-4xx / non-5xx). 429 (rate-limited) is treated as valid.
- Retry logic: up to 2 retries on timeout or 429
Output:
Validating 71 URLs from data/packs/langchain-expert/urls.txt...
❌ [404] https://python.langchain.com/docs/how_to/deprecated_example/ (HTTP 404)
❌ [0] https://internal.example.com/ (blocked: private IP)
Results: 69 valid, 2 invalid
Use --fix to automatically comment out invalid URLs in place.
Additive Changes Policy¶
When expanding an existing urls.txt:
- Never remove a URL that is currently reachable
- Replace only URLs that are confirmed dead (404, domain gone)
- Add new URLs by appending new section blocks at the end or inserting within the relevant existing section
This policy ensures that pack rebuilds always have at least as much source coverage as the previous build.
Example: Complete urls.txt Structure¶
# LangChain Framework - Official Documentation
# Agents, chains, retrievers, prompts, embeddings, vector stores, LCEL
# NOTE: Use python.langchain.com (NOT docs.langchain.com which is dead)
# Core Documentation
https://python.langchain.com/docs/concepts/
https://python.langchain.com/docs/concepts/architecture/
https://python.langchain.com/docs/concepts/agents/
# How-To Guides
https://python.langchain.com/docs/how_to/
https://python.langchain.com/docs/how_to/tool_calling/
# How-To Guides - Additional Sub-Pages
https://python.langchain.com/docs/how_to/custom_tools/
https://python.langchain.com/docs/how_to/agent_executor/
# Tutorials
https://python.langchain.com/docs/tutorials/
https://python.langchain.com/docs/tutorials/rag/
# Tutorials - Additional Sub-Pages
https://python.langchain.com/docs/tutorials/extraction/
https://python.langchain.com/docs/tutorials/sql_qa/
# API Reference
https://python.langchain.com/api_reference/core/index.html
# Integrations - Additional Categories
https://python.langchain.com/docs/integrations/llms/
https://python.langchain.com/docs/integrations/vectorstores/chroma/
# GitHub
https://github.com/langchain-ai/langchain
# LangSmith (Observability)
https://docs.smith.langchain.com/
Related Documentation¶
- How to Curate and Expand Pack URL Lists
- Pack URL Coverage: Expanded Packs
- Web Content Source API Reference
- Pack Utilities API Reference —
load_urlsfunction that reads this format