Pack Utilities API Reference
Reference documentation for wikigr.packs.utils — shared helper functions used by all knowledge pack build scripts.
Overview
wikigr/packs/utils.py is the single source of truth for common operations shared across the 49+ build scripts in scripts/. It eliminates copy-paste drift between scripts and ensures consistent behaviour for URL loading, filtering, and logging.
All build scripts import load_urls from this module. Local definitions of load_urls inside individual scripts are not permitted — use the shared import.
load_urls
from wikigr.packs.utils import load_urls

def load_urls(urls_file: Path, limit: int | None = None) -> list[str]:
    ...
Reads a urls.txt file and returns a filtered list of URLs, skipping blank lines and # comments.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls_file | pathlib.Path | (required) | Path to the urls.txt file to read. Must exist; raises FileNotFoundError otherwise. |
| limit | int \| None | None | If set and truthy, truncates the result to the first limit URLs. Intended for test-mode builds. A value of 0 is falsy and treated as no limit. |
Returns
list[str]: filtered list of URL strings. Each entry:
- Has leading and trailing whitespace stripped.
- Starts with "https://"; plain http:// URLs are silently dropped (SEC-01).
- Is not a # comment line.
- Is not a blank line.
Order is preserved from the file.
Side Effects
Up to two logging.INFO messages are emitted via the module-level logger (wikigr.packs.utils):
- "Limited to N URLs for testing": emitted first, only when limit is truthy.
- "Loaded N URLs from <path>": always emitted last. When limit is truthy, N reflects the truncated count, not the total lines in the file.
Raises
| Exception | When |
|---|---|
| FileNotFoundError | urls_file does not exist. |
| PermissionError | urls_file cannot be read by the current process. |
| OSError | Any other OS-level I/O failure. |
load_urls does not validate that URLs are reachable or well-formed. HTTPS is enforced at load time via the startswith("https://") filter. Use validate_download_url or scripts/validate_pack_urls.py to check reachability and additional safety constraints (private IPs, cloud metadata endpoints, etc.).
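To illustrate why the prefix filter alone does not guarantee well-formed URLs, the following sketch runs a few made-up lines through the same startswith("https://") test and then through a stricter parse. The helper has_hostname is hypothetical and is not part of wikigr.packs.utils; it only uses urllib.parse from the standard library.

```python
from urllib.parse import urlsplit


def has_hostname(url: str) -> bool:
    """Hypothetical helper: True if the URL parses with a non-empty host."""
    try:
        return bool(urlsplit(url).hostname)
    except ValueError:
        # urlsplit can raise ValueError for some malformed inputs.
        return False


# Made-up sample lines; only the first one has a host component.
lines = ["https://example.com/docs", "https://", "https:///missing-host"]

# Every sample passes the prefix test that load_urls is documented to apply...
passes_prefix = [u for u in lines if u.startswith("https://")]

# ...but only one of them survives a basic structural check.
well_formed = [u for u in passes_prefix if has_hostname(u)]
```

A check of this kind complements validate_download_url and scripts/validate_pack_urls.py rather than replacing them, since it covers neither reachability nor the SSRF constraints.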
Filter Details
The function uses a generator expression with itertools.islice to strip, filter, and limit in a single pass:
candidates = (
    stripped
    for line in f
    if (stripped := line.strip())  # skip blank lines
    and not stripped.startswith("#")  # skip comments
    and stripped.startswith("https://")  # HTTPS-only (SEC-01)
)
urls = list(islice(candidates, limit or None))
The startswith("https://") filter enforces HTTPS at parse time (SEC-01). Plain http:// lines in urls.txt are silently dropped before they reach any network layer. This provides defence-in-depth alongside the SSRF guard in WebContentSource and validate_download_url, which re-validate each URL at fetch time.
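Putting the documented pieces together, a minimal stand-in that reproduces the behaviour described on this page might look like the sketch below. It is not the actual source of wikigr/packs/utils.py: the function name load_urls_sketch, the logger name, the UTF-8 encoding, the value logged in the "Limited to" message, and the sample file contents are all assumptions made for illustration.

```python
import logging
import tempfile
from itertools import islice
from pathlib import Path
from typing import Optional

# Hypothetical logger name, chosen so the sketch does not shadow the real one.
logger = logging.getLogger("load_urls_sketch")


def load_urls_sketch(urls_file: Path, limit: Optional[int] = None) -> list[str]:
    """Illustrative stand-in for load_urls; not the real implementation."""
    # Opening a missing or unreadable file raises FileNotFoundError,
    # PermissionError, or another OSError, matching the Raises table.
    with urls_file.open(encoding="utf-8") as f:  # encoding is an assumption
        candidates = (
            stripped
            for line in f
            if (stripped := line.strip())        # skip blank lines
            and not stripped.startswith("#")     # skip comments
            and stripped.startswith("https://")  # HTTPS-only
        )
        # islice(iterable, None) applies no limit; `limit or None` maps 0 to None.
        urls = list(islice(candidates, limit or None))
    if limit:
        # Logging the requested limit here is an assumption about what N means.
        logger.info("Limited to %d URLs for testing", limit)
    logger.info("Loaded %d URLs from %s", len(urls), urls_file)
    return urls


# Demonstration on a throwaway file with made-up example.com URLs.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "urls.txt"
    sample.write_text(
        "# comment\n"
        "\n"
        "https://example.com/a\n"
        "http://example.com/insecure\n"
        "  https://example.com/b  \n"
        "https://example.com/c\n",
        encoding="utf-8",
    )
    all_urls = load_urls_sketch(sample)
    first_two = load_urls_sketch(sample, limit=2)
    zero_limit = load_urls_sketch(sample, limit=0)
```

In this sketch all_urls holds the three HTTPS entries in file order, first_two holds the first two of them, and zero_limit equals all_urls because 0 is falsy and therefore treated as no limit.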
Usage
Basic Usage
from pathlib import Path
from wikigr.packs.utils import load_urls
urls_file = Path("data/packs/my-pack/urls.txt")
urls = load_urls(urls_file)
for url in urls:
    process(url)
Test Mode (Limit URLs)
Build scripts accept --test-mode to process only the first few URLs:
import argparse
from pathlib import Path
from wikigr.packs.utils import load_urls
parser = argparse.ArgumentParser()
parser.add_argument("--test-mode", action="store_true")
args = parser.parse_args()
urls_file = Path("data/packs/my-pack/urls.txt")
limit = 5 if args.test_mode else None
urls = load_urls(urls_file, limit=limit)
Guard for Optional urls.txt
When a urls.txt may or may not exist (e.g. freshness checking over arbitrary pack directories), guard the call at the call site rather than passing a non-existent path:
urls = load_urls(urls_file) if urls_file.exists() else []
This pattern is used in scripts/check_pack_freshness.py.
Import Path
All build scripts resolve the project root and insert it into sys.path before importing:
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
from wikigr.packs.utils import load_urls # noqa: E402
The # noqa: E402 comment suppresses the E402 ("module level import not at top of file") linter warning triggered by the sys.path manipulation above.
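As a quick illustration of what Path(__file__).parent.parent resolves to, the sketch below applies the same two .parent steps to a made-up script location. The /repo prefix and the script name build_example_pack.py are hypothetical; PurePosixPath is used so the example does not touch the filesystem.

```python
from pathlib import PurePosixPath

# Hypothetical location of a build script inside a checkout at /repo.
script = PurePosixPath("/repo/scripts/build_example_pack.py")

# First .parent is the scripts/ directory, second .parent is the project root.
scripts_dir = script.parent
project_root = script.parent.parent

# With the project root on sys.path, `wikigr.packs.utils` maps to this file.
utils_module = project_root / "wikigr" / "packs" / "utils.py"
```

The pattern assumes the script sits exactly one directory below the project root; a script nested more deeply would need additional .parent steps.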
Logging Configuration
load_urls uses the standard library logging module. The logger name is wikigr.packs.utils.
To see INFO-level messages from load_urls, configure the root logger or the wikigr hierarchy:
import logging
logging.basicConfig(level=logging.INFO)
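Or, more targeted (a handler must still be attached somewhere in the hierarchy, for example via logging.basicConfig(), for the messages to be displayed):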
import logging
logging.getLogger("wikigr.packs.utils").setLevel(logging.INFO)
Build scripts that use logging.basicConfig(level=logging.INFO, ...) in their main() function will automatically surface these messages.
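Because logger names are hierarchical, configuring the parent wikigr logger is enough to capture messages from wikigr.packs.utils. The sketch below demonstrates this with a small in-memory handler; ListHandler is a hypothetical helper written for this example, and the log call is a stand-in for the message that load_urls is documented to emit.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores formatted messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()

# Configure only the parent logger of the hierarchy.
parent = logging.getLogger("wikigr")
parent.setLevel(logging.INFO)
parent.addHandler(handler)

# Stand-in for a message emitted inside wikigr.packs.utils. The child logger
# has no level of its own, so it inherits INFO from "wikigr" and propagates
# the record up to the parent's handler.
logging.getLogger("wikigr.packs.utils").info("Loaded %d URLs from %s", 3, "urls.txt")
```

After this runs, handler.messages contains the single formatted message, which shows that no per-module configuration is needed once the wikigr logger has a level and a handler.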
Writing a New Build Script
When creating a new scripts/build_<name>_pack.py, follow this import pattern:
#!/usr/bin/env python3
"""
Build <Name> Knowledge Pack from URLs.
...
"""
import argparse
import logging
import os
import sys
from pathlib import Path
# ... other stdlib and third-party imports ...
os.environ["TOKENIZERS_PARALLELISM"] = "false"
sys.path.insert(0, str(Path(__file__).parent.parent))
import real_ladybug as kuzu # noqa: E402
import wikigr.bootstrap # noqa: E402
from wikigr.packs.utils import load_urls # noqa: E402
PACK_NAME = "my-domain-expert"
URLS_FILE = Path(__file__).parent.parent / "data" / "packs" / PACK_NAME / "urls.txt"
def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--test-mode", action="store_true")
    args = parser.parse_args()
    limit = 5 if args.test_mode else None
    urls = load_urls(URLS_FILE, limit=limit)
    for url in urls:
        build_from_url(url)
Do not define a local def load_urls(...) in new scripts. Use the shared import.
Related
- urls.txt Format and Conventions: Format rules for the urls.txt input file.
- Pack Installer Security: validate_download_url for SSRF prevention before HTTP requests.
- How to Build a Pack: End-to-end guide for creating and verifying a new knowledge pack.
- How to Validate Pack URLs: Using scripts/validate_pack_urls.py to check URL reachability.