A self-improving, multi-agent vulnerability analyzer that bootstrapped from a short spec into a competitive detection system — using AI agents to analyze, benchmark, and improve their own investigation methodology.
The name comes from the Lushootseed word for Raven — the trickster who reveals hidden truths.
Skwaq is an agentic vulnerability analyzer that uses a team of AI agents to investigate source code and binaries for security vulnerabilities. Unlike traditional static analysis tools that rely on fixed rules, skwaq's agents reason about code like experienced security researchers — tracing data flows, mapping attack surfaces, and debating exploitability.
What makes it unique: skwaq improves itself. A built-in benchmark harness (Skwaq Gym) measures detection accuracy against industry benchmarks, and a self-improvement loop uses AI agents to analyze their own failures and propose better investigation strategies.
Skwaq started with a short specification and a question: can AI agents bootstrap themselves into a competitive vulnerability detection system?
The gym improve command: a failure-analyst agent investigates each false negative, diagnoses why detection failed, and proposes improvements to agent prompts, taint rules, CWE mappings, or patterns. An overfitting-reviewer agent validates every proposal. The agents started teaching themselves.
The overfitting-reviewer agent, together with added workflow steps, rejects proposals that only help on benchmark cases rather than in the general case. It has rejected ~66% of proposals, catching narrow heuristics, benchmark-specific naming tricks, and detection logic too tightly coupled to test structure.
Each vulnerability investigation passes through five layers, from fast pattern matching to deep semantic reasoning:
```mermaid
flowchart TB
SRC["Source Code / Binary"] --> L1
L1["Layer 1: Pattern Detection\n~260 patterns across 6 languages"] --> L2
L2["Layer 2: Dataflow Analysis\nTaint source-to-sink tracing"] --> L3
L3["Layer 3: Context Validation\nMulti-cycle false positive reduction"] --> L4
L4["Layer 4: LLM Agent Pipeline\nattack-surface, vuln-hunter, critic"] --> L5
L5["Layer 5: Synthesis\nDomain-expert weighted evidence"] --> OUT["Findings"]
```
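The staged escalation above can be sketched as a chain of filters, each layer narrowing the candidate set before the next (more expensive) layer runs. Every function below is a stub for exposition; none of these names are skwaq's actual API.

```python
# Illustrative sketch of the five-layer escalation; all names are hypothetical.

def pattern_scan(code):
    # Layer 1: cheap pattern match flags candidate lines
    return [i for i, line in enumerate(code.splitlines()) if "system(" in line]

def trace_taint(code, candidates):
    # Layer 2: keep candidates reachable from attacker input (stubbed)
    return [c for c in candidates if "argv" in code]

def validate_context(flows):
    # Layer 3: multi-cycle false-positive reduction (stubbed as a pass-through)
    return flows

def run_agents(flows):
    # Layer 4: LLM agents vote on each flow; a single stub critic accepts all
    return [("accept", f) for f in flows]

def synthesize(verdicts):
    # Layer 5: keep only accepted findings
    return [f for verdict, f in verdicts if verdict == "accept"]

def analyze(code):
    return synthesize(run_agents(validate_context(
        trace_taint(code, pattern_scan(code)))))
```

The point of the layering is cost control: only candidates that survive the cheap layers ever reach the LLM agents.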
All analysis operates on a code property graph stored in LadybugDB — functions, call edges, taint flows, data sources, data sinks, and symbols. Agents query this graph with specialized tools.
```mermaid
erDiagram
FUNCTIONS ||--o{ CALLS : "caller-callee"
FUNCTIONS ||--o{ DATA_SOURCES : "reads from"
FUNCTIONS ||--o{ DATA_SINKS : "writes to"
DATA_SOURCES ||--o{ TAINT_FLOWS : "source"
DATA_SINKS ||--o{ TAINT_FLOWS : "sink"
FUNCTIONS ||--o{ SYMBOLS : "defines-imports"
FUNCTIONS ||--o{ FINDINGS : "contains"
```
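One way to picture the schema is as tables joined by foreign keys. The snippet below is a hypothetical in-memory rendering of the relationships above; LadybugDB's actual storage format and query tools are not shown in this document.

```python
# Hypothetical rendering of the graph schema; not LadybugDB's real API.
functions = {1: "read_input", 2: "run_cmd"}
calls = [(1, 2)]                               # caller -> callee edge
data_sources = [{"fn": 1, "kind": "stdin"}]    # read_input reads stdin
data_sinks = [{"fn": 2, "kind": "exec"}]       # run_cmd reaches an exec sink
taint_flows = [{"source": 0, "sink": 0}]       # indices into the lists above

def functions_feeding_exec_sinks():
    # Walk each taint flow from its sink back to the source function
    return [functions[data_sources[t["source"]]["fn"]]
            for t in taint_flows
            if data_sinks[t["sink"]]["kind"] == "exec"]
```

A query like this is what lets an agent ask "which entry points can reach an exec sink?" without re-reading the source.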
All 18 agents and the pipelines they participate in:
```mermaid
flowchart TB
subgraph INPUT["Input"]
SRC["Source Code"]
BIN["Binary"]
end
subgraph PREPROCESS["Pre-processing (binary only)"]
DR["decompile-renamer\nImproves decompiled output"]
DA["decompile-analyst\nReviews decompilation quality"]
end
subgraph DISCOVERY["Discovery"]
AS["attack-surface\nMaps entry points,\ndata sources, imports"]
VH["vuln-hunter\nGraph-first vuln discovery"]
VHJ["vuln-hunter-java\nJava/servlet specialist"]
VHP["vuln-hunter-python\nPython specialist"]
TT["taint-tracer\nSource-to-sink flows"]
end
subgraph VALIDATION["Validation (standard pipeline)"]
CR["critic\nAccept / Reject / Adjust severity"]
RS["results-skeptic\nDouble-checks suspicious results"]
CC["cwe-classifier\nMaps findings to CWE families"]
end
subgraph DEBATE["Debate (deep pipeline)"]
EA["exploit-analyst\nCan attacker trigger this?"]
DEF["defense-analyst\nAre mitigations effective?"]
end
subgraph SYNTHESIS["Synthesis"]
VS["verdict-synthesizer\nWeighs all perspectives,\nbreaks ties"]
end
subgraph ANALYSIS["Specialized Analysis"]
CA["crash-analyst\nAnalyzes crash/fuzz output"]
PDA["patch-diff-analyst\nAnalyzes security patches"]
end
subgraph IMPROVE["Self-Improvement Loop"]
FA["failure-analyst\nDiagnoses missed detections"]
OR["overfitting-reviewer\nValidates proposals"]
end
SRC --> AS
BIN --> DR --> DA --> AS
AS --> VH & VHJ & VHP
VH & VHJ & VHP --> TT
TT --> CR
CR --> RS --> CC --> VS
CR --> EA & DEF
EA & DEF --> VS
VS --> OUT["Confirmed Findings"]
CA -.-> CR
PDA -.-> CR
OUT --> FA
FA --> OR
OR -->|"Accepted ~34%"| APPLY["Apply improvements"]
OR -->|"Rejected ~66%"| MEM["Store lesson"]
APPLY --> AS
```
The self-improvement loop is what makes skwaq unique. Instead of manually tuning rules, AI agents analyze their own failures and propose improvements — validated by another AI agent that prevents overfitting.
```mermaid
flowchart TB
BENCH["Run Benchmark"] --> SCORE["Score Results"]
SCORE --> FN["Identify False Negatives"]
FN --> FA["failure-analyst Agent\nReads code, queries graph,\ndiagnoses WHY we missed it"]
FA --> PROP["Generate Proposals\n1. Agent Prompt\n2. Taint Rule\n3. CWE Mapping\n4. Pattern"]
PROP --> REV["overfitting-reviewer\nValidates proposals"]
REV -->|Accepted| APPLY["Apply to agents,\ntaint engine, scoring"]
REV -->|Rejected| LEARN["Store lesson\nin durable memory"]
APPLY --> MEM["Durable memory\nfor future cycles"]
MEM --> BENCH
```
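One cycle of the loop can be sketched as a routing function over false negatives. Every helper below is a stand-in for exposition, not skwaq's implementation.

```python
# Sketch of one improvement cycle as diagrammed; all helpers are hypothetical.

def diagnose(case):
    # failure-analyst stand-in: propose a fix for a missed detection
    return {"kind": "TAINT_RULE", "case": case, "generalizes": case % 2 == 0}

def looks_overfit(proposal):
    # overfitting-reviewer stand-in: reject benchmark-only fixes
    return not proposal["generalizes"]

def improvement_cycle(false_negatives):
    applied, lessons = [], []
    for case in false_negatives:
        proposal = diagnose(case)
        if looks_overfit(proposal):
            lessons.append(proposal)   # store the lesson in durable memory
        else:
            applied.append(proposal)   # update prompts, taint rules, scoring
    return applied, lessons
```

The key design choice is that rejected proposals are not discarded: they become memory that steers future cycles away from the same dead end.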
| Type | What It Does | Example |
|---|---|---|
| AGENT_PROMPT | Teaches agents new investigation strategies | Check ALL argument positions in exec calls for tainted data |
| TAINT_RULE | Adds source/sink definitions to dataflow engine | Register putenv() as CWE-427 environment sink |
| CWE_MAPPING | Fixes scoring/classification | Map CWE-614 to CryptoWeakness semantic class |
| NEW_PATTERN | Regex detection (last resort) | Only when graph-based approaches cannot detect the pattern |
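A proposal can be pictured as a small tagged record that the harness routes by its type. This shape is purely illustrative, not skwaq's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    kind: str      # AGENT_PROMPT | TAINT_RULE | CWE_MAPPING | NEW_PATTERN
    payload: dict  # type-specific details

# The putenv() example from the table, expressed as a TAINT_RULE proposal
rule = Proposal("TAINT_RULE", {"sink": "putenv", "cwe": "CWE-427"})
```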
The overfitting-reviewer agent has rejected ~66% of proposals across 22+ cycles. Common rejection reasons: narrow heuristics that only match benchmark code, benchmark-specific naming tricks, and detection logic too tightly coupled to test structure.
| Suite | Source | Cases | Languages | Focus |
|---|---|---|---|---|
| Juliet | NIST | 54,488 | C/C++ | 116 CWEs, synthetic variants |
| OWASP Benchmark | OWASP Foundation | 2,740 | Java | Web app vulns (XSS, SQLi, crypto) |
| CyberSecEval | Meta | 578 | C/C++, Python | Real-world vuln patterns |
| CGC | DARPA | 300 | C | Real challenge binaries (patched/unpatched) |
| CyberGym | UC Berkeley | 3,014 | C/C++ | Real OSS-Fuzz CVEs from 188 projects |
| Fixtures | skwaq team | 99 | Mixed | Regression suite |
| Suite | Cases | F1 | Precision | Recall | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|
| Fixtures | 99 | 93.7% | 98.1% | 89.3% | 103 | 2 | 12 | 11 |
| CSE | 578 | 91.8% | 100% | 84.8% | 434 | 0 | 78 | 0 |
| Juliet | 1,000 | 88.8% | 100% | 79.9% | 798 | 0 | 201 | 1 |
| OWASP | 500 | 93.8% | 100% | 88.3% | 228 | 0 | 30 | 242 |
| CGC | 226 | 89.8% | 100% | 81.5% | 388 | 0 | 88 | 0 |
| CyberGym | 100 | 71.9% | 100% | 56.1% | 64 | 0 | 50 | 0 |
Key differentiator: Skwaq maintains 100% precision on every industry benchmark (0 false positives). Most tools trade precision for recall. We continue to run the self-improvement loop, expanding the number of cases with each cycle.
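The headline metrics follow directly from the confusion counts in the table; a quick check of the Juliet row:

```python
# Recompute precision/recall/F1 from the Juliet confusion counts above.
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = scores(tp=798, fp=0, fn=201)   # Juliet row
# p = 1.0, r rounds to 0.799, f1 rounds to 0.888 -- matching the table
```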
| CWE | Before | After | Method |
|---|---|---|---|
| CWE-401 memory leak | 0% | 99% | Agent-recommended calloc→ResourceLeak mapping |
| CWE-614 secure cookie | 0% | 100% | setSecure(false) + semantic class fix |
| CWE-78 cmd injection | 37% | 70% | spawn family (agent-identified taint rules) |
| CWE-22 path traversal | 41% | 65% | java.io.File* qualified name fix |
| CWE-79 XSS | 46% | 63% | getWriter().format/append (agentic cycle) |
| Race conditions | 33% | 100% | signal handler + thread (agentic cycle) |
| Tool | Approach | Benchmark | P | R | F1 |
|---|---|---|---|---|---|
| Skwaq (agentic) | Multi-agent + graph | Juliet 20 | 100% | 95% | 97.3% |
| VulBinLLM | LLM decompile+reason | Juliet stripped | ~85% | ~100% | ~92% |
| PS³ | Binary pattern | Curated dataset | 82% | 97% | 89% |
| LATTE | Taint + LLM | Juliet | ~70% | ~85% | ~77% |
IRIS (LLM+CodeQL hybrid) detected 2x more vulns than CodeQL alone — validating skwaq's hybrid approach. GitHub SecLab Taskflow Agent is the closest competitor architecture, with 80+ real CVEs found using multi-agent investigation.
| PR | Change | Impact |
|---|---|---|
| #29 | Gym benchmark harness | Baseline F1=50% |
| #35-39 | Agentic pipeline, multi-agent validation | First agent-driven detection |
| #49 | Dual-judge breakthrough | Precision: 15%→100% |
| #57 | All 4 industry benchmark adapters | Juliet, OWASP, CGC, CSE |
| #217-243 | 34 semantic classes, 109/109 Juliet CWEs | Full CWE coverage |
| #298 | Industry expansion | Juliet 1K, OWASP 1K, CSE 400 |
| #302 | CyberGym + results-skeptic agent | 3,014 real CVEs added |
| #303 | 3-layer improvement + 4 agentic cycles | +7.6% Juliet, +8.0% OWASP |
| #312-313 | Agentic eval tuning | Juliet F1=97.3% |
Full benchmark progress and history →
```bash
# Full agentic analysis of a repo
skwaq analyze /path/to/repo

# Quick pattern-only scan of a single file
skwaq analyze path/to/source.c --quick

# Analyze a compiled binary (decompile + graph + agents)
skwaq analyze path/to/binary --binary

# Analyze a stripped binary with decompiler renaming
skwaq analyze path/to/binary --binary --decompile-rename

# Agentic eval on industry suites
skwaq gym eval --suites juliet,owasp,cyberseceval

# Pattern-only eval
skwaq gym eval --suites juliet --quick --max-cases 200

# Run improvement cycle
skwaq gym improve juliet --max-cases 30 --max-improvements 5

# Target specific CWE
skwaq gym improve juliet --cwe CWE-78 --max-cases 20
```
Skwaq supports multiple LLM backends. Configure in skwaq.toml:
```toml
# skwaq.toml
[llm]
reasoning = "azure"
decompilation = "azure"

[llm.azure]
endpoint = "https://your-resource.cognitiveservices.azure.com/"
deployment = "gpt-54-skwaq"
api_version = "2024-10-21"
```
```bash
# Deploy model (idempotent)
bash infra/azure/setup.sh

# Run eval with GPT-5.4
skwaq gym eval --suites juliet --max-cases 100 --adaptive -j 4
```
```toml
# skwaq.toml
[llm]
reasoning = "anthropic"
decompilation = "anthropic"
# Auth: set ANTHROPIC_API_KEY or use Azure MaaS endpoint
```
```toml
# skwaq.toml
[llm]
reasoning = "copilot"
decompilation = "copilot"

[llm.copilot]
model = "claude-opus-4.6"
```
Run the same eval suite with each backend, then compare with the TUI dashboard:
```bash
# 1. Edit skwaq.toml to use GPT-5.4 (azure backend)
skwaq gym eval --suites juliet,owasp --max-cases 200 --adaptive -j 4

# 2. Edit skwaq.toml to use Claude Opus (copilot backend)
skwaq gym eval --suites juliet,owasp --max-cases 200 --adaptive -j 4

# 3. Compare results in the live dashboard
export SKWAQ_ROOT=/path/to/skwaq
skwaq gym dashboard --live

# 4. Or view the static snapshot
skwaq gym dashboard --tui
```
The dashboard shows F1/precision/recall per suite, trend sparklines across runs, agent stats with per-agent token usage, and the active model in the title bar. Each eval run is recorded in the history DB, so you can track how scores change when switching models.