Skwaq

A self-improving, multi-agent vulnerability analyzer that bootstrapped from a short spec into a competitive detection system — using AI agents to analyze, benchmark, and improve their own investigation methodology.

The name comes from the Lushootseed word for Raven — the trickster who reveals hidden truths.

Skwaq — multi-agent vulnerability analyzer

What Is Skwaq?

Skwaq is an agentic vulnerability analyzer that uses a team of AI agents to investigate source code and binaries for security vulnerabilities. Unlike traditional static analysis tools that rely on fixed rules, skwaq's agents reason about code like experienced security researchers — tracing data flows, mapping attack surfaces, and debating exploitability.

What makes it unique: skwaq improves itself. A built-in benchmark harness (Skwaq Gym) measures detection accuracy against industry benchmarks, and a self-improvement loop uses AI agents to analyze their own failures and propose better investigation strategies.

- 97.3% Juliet F1 (agentic)
- 91.0% CSE F1
- 18 agent prompts
- 6 benchmark suites
- 22+ improvement cycles

The Story: From Spec to Self-Improving System

Skwaq started with a short specification and a question: can AI agents bootstrap themselves into a competitive vulnerability detection system?

Architecture

Five-Layer Detection Pipeline

Each vulnerability investigation passes through five layers, from fast pattern matching to deep semantic reasoning:

```mermaid
flowchart TB
    SRC["Source Code / Binary"] --> L1
    L1["Layer 1: Pattern Detection\n~260 patterns across 6 languages"] --> L2
    L2["Layer 2: Dataflow Analysis\nTaint source-to-sink tracing"] --> L3
    L3["Layer 3: Context Validation\nMulti-cycle false positive reduction"] --> L4
    L4["Layer 4: LLM Agent Pipeline\nattack-surface, vuln-hunter, critic"] --> L5
    L5["Layer 5: Synthesis\nDomain-expert weighted evidence"] --> OUT["Findings"]
```

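The layering can be sketched as a chain of progressively more expensive filters: cheap checks generate candidates, deeper layers discard or enrich them. This is an illustrative toy, not skwaq's actual code; the layer functions below are hypothetical stand-ins.

```python
# Hypothetical sketch of a layered pipeline: each layer either drops a
# candidate finding (returns None) or passes it on to the next layer.
def run_pipeline(code, layers):
    findings = layers[0](code)            # Layer 1: cheap pattern matching
    for layer in layers[1:]:              # later layers: deeper, costlier checks
        findings = [f for f in (layer(f) for f in findings) if f is not None]
    return findings

# Toy layers: flag system() calls, then keep only ones fed by user input.
pattern_layer = lambda code: [ln for ln in code.splitlines() if "system(" in ln]
dataflow_layer = lambda f: f if "argv" in f else None   # crude taint check

code = 'safe();\nsystem(argv[1]);\nsystem("ls");\n'
result = run_pipeline(code, [pattern_layer, dataflow_layer])
print(result)  # ['system(argv[1]);']
```

The constant-string `system("ls")` call survives the pattern layer but is discarded by the (toy) dataflow layer, which mirrors how the later layers exist to cut false positives.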
Code Property Graph

All analysis operates on a code property graph stored in LadybugDB — functions, call edges, taint flows, data sources, data sinks, and symbols. Agents query this graph with specialized tools.

```mermaid
erDiagram
    FUNCTIONS ||--o{ CALLS : "caller-callee"
    FUNCTIONS ||--o{ DATA_SOURCES : "reads from"
    FUNCTIONS ||--o{ DATA_SINKS : "writes to"
    DATA_SOURCES ||--o{ TAINT_FLOWS : "source"
    DATA_SINKS ||--o{ TAINT_FLOWS : "sink"
    FUNCTIONS ||--o{ SYMBOLS : "defines-imports"
    FUNCTIONS ||--o{ FINDINGS : "contains"
```

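As a toy illustration of the kind of query agents run against this graph, here is a source-to-sink walk over a hand-built graph with the same entities. LadybugDB's real schema and query tools are not shown; the structure below is invented for the example.

```python
# Toy code property graph mirroring the entities above (illustrative only).
graph = {
    "functions": {"read_input", "build_query", "run_sql"},
    "calls": [("read_input", "build_query"), ("build_query", "run_sql")],
    "data_sources": {"read_input": "http_param"},   # function -> source kind
    "data_sinks": {"run_sql": "sql_exec"},          # function -> sink kind
}

def taint_flows(graph):
    """Find call chains that connect a data source to a data sink."""
    edges = {}
    for caller, callee in graph["calls"]:
        edges.setdefault(caller, []).append(callee)
    flows = []
    def walk(fn, path):
        if fn in graph["data_sinks"]:
            flows.append(path + [fn])               # reached a sink: record flow
        for nxt in edges.get(fn, []):
            walk(nxt, path + [fn])
    for src in graph["data_sources"]:
        walk(src, [])
    return flows

print(taint_flows(graph))  # [['read_input', 'build_query', 'run_sql']]
```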
Multi-Agent Analysis

All 18 agents and the pipelines they participate in:

```mermaid
flowchart TB
    subgraph INPUT["Input"]
        SRC["Source Code"]
        BIN["Binary"]
    end

    subgraph PREPROCESS["Pre-processing (binary only)"]
        DR["decompile-renamer\nImproves decompiled output"]
        DA["decompile-analyst\nReviews decompilation quality"]
    end

    subgraph DISCOVERY["Discovery"]
        AS["attack-surface\nMaps entry points,\ndata sources, imports"]
        VH["vuln-hunter\nGraph-first vuln discovery"]
        VHJ["vuln-hunter-java\nJava/servlet specialist"]
        VHP["vuln-hunter-python\nPython specialist"]
        TT["taint-tracer\nSource-to-sink flows"]
    end

    subgraph VALIDATION["Validation (standard pipeline)"]
        CR["critic\nAccept / Reject / Adjust severity"]
        RS["results-skeptic\nDouble-checks suspicious results"]
        CC["cwe-classifier\nMaps findings to CWE families"]
    end

    subgraph DEBATE["Debate (deep pipeline)"]
        EA["exploit-analyst\nCan attacker trigger this?"]
        DEF["defense-analyst\nAre mitigations effective?"]
    end

    subgraph SYNTHESIS["Synthesis"]
        VS["verdict-synthesizer\nWeighs all perspectives,\nbreaks ties"]
    end

    subgraph ANALYSIS["Specialized Analysis"]
        CA["crash-analyst\nAnalyzes crash/fuzz output"]
        PDA["patch-diff-analyst\nAnalyzes security patches"]
    end

    subgraph IMPROVE["Self-Improvement Loop"]
        FA["failure-analyst\nDiagnoses missed detections"]
        OR["overfitting-reviewer\nValidates proposals"]
    end

    SRC --> AS
    BIN --> DR --> DA --> AS
    AS --> VH & VHJ & VHP
    VH & VHJ & VHP --> TT
    TT --> CR
    CR --> RS --> CC --> VS
    CR --> EA & DEF
    EA & DEF --> VS
    VS --> OUT["Confirmed Findings"]
    CA -.-> CR
    PDA -.-> CR

    OUT --> FA
    FA --> OR
    OR -->|"Accepted ~34%"| APPLY["Apply improvements"]
    OR -->|"Rejected ~66%"| MEM["Store lesson"]
    APPLY --> AS
```

Self-Improvement Loop

The self-improvement loop is what makes skwaq unique. Instead of manually tuning rules, AI agents analyze their own failures and propose improvements — validated by another AI agent that prevents overfitting.

```mermaid
flowchart TB
    BENCH["Run Benchmark"] --> SCORE["Score Results"]
    SCORE --> FN["Identify False Negatives"]
    FN --> FA["failure-analyst Agent\nReads code, queries graph,\ndiagnoses WHY we missed it"]
    FA --> PROP["Generate Proposals\n1. Agent Prompt\n2. Taint Rule\n3. CWE Mapping\n4. Pattern"]
    PROP --> REV["overfitting-reviewer\nValidates proposals"]
    REV -->|Accepted| APPLY["Apply to agents,\ntaint engine, scoring"]
    REV -->|Rejected| LEARN["Store lesson\nin durable memory"]
    APPLY --> MEM["Durable memory\nfor future cycles"]
    MEM --> BENCH
```

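The loop's control flow can be sketched in a few lines. The benchmark, analyst, and reviewer below are toy stand-ins for the real agents and harness, included only to show the shape of one cycle.

```python
# Minimal sketch of one improvement cycle: benchmark -> diagnose misses ->
# propose -> review -> apply or store as a lesson. Control flow only.
def improvement_cycle(run_benchmark, failure_analyst, reviewer, rules, memory):
    misses = [c for c in run_benchmark(rules) if not c["detected"]]
    for proposal in failure_analyst(misses):
        if reviewer(proposal):               # overfitting-reviewer gate
            rules.append(proposal["rule"])   # accepted: apply the improvement
        else:
            memory.append(proposal)          # rejected: store lesson for later
    return rules, memory

# Toy run: one missed case yields two proposals; only the general one is kept.
run_benchmark = lambda rules: [{"id": "CWE-78-01", "detected": "spawn" in rules}]
analyst = lambda misses: ([
    {"kind": "TAINT_RULE", "rule": "spawn", "general": True},
    {"kind": "NEW_PATTERN", "rule": "case-specific regex", "general": False},
] if misses else [])
reviewer = lambda p: p["general"]

rules, memory = improvement_cycle(run_benchmark, analyst, reviewer, [], [])
print(rules)   # ['spawn']  -- the next cycle now detects the case
```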
Proposal Types (in Preference Order)

| Type | What It Does | Example |
|---|---|---|
| AGENT_PROMPT | Teaches agents new investigation strategies | Check ALL argument positions in exec calls for tainted data |
| TAINT_RULE | Adds source/sink definitions to dataflow engine | Register putenv() as CWE-427 environment sink |
| CWE_MAPPING | Fixes scoring/classification | Map CWE-614 to CryptoWeakness semantic class |
| NEW_PATTERN | Regex detection (last resort) | Only when graph-based approaches cannot detect the pattern |
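As an illustration of the preference order, a proposal could be modeled like this. The shape is hypothetical, not skwaq's actual data model:

```python
from dataclasses import dataclass

# Preference order from the table above: prompt/graph changes first, regex last.
PREFERENCE = ["AGENT_PROMPT", "TAINT_RULE", "CWE_MAPPING", "NEW_PATTERN"]

@dataclass
class Proposal:
    kind: str       # one of PREFERENCE
    payload: str    # e.g. a prompt addition or a sink registration

def pick_first(proposals):
    """Choose the most-preferred proposal among the candidates."""
    return min(proposals, key=lambda p: PREFERENCE.index(p.kind))

candidates = [Proposal("NEW_PATTERN", "regex for putenv"),
              Proposal("TAINT_RULE", "register putenv() as CWE-427 sink")]
print(pick_first(candidates).kind)  # TAINT_RULE
```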

Overfitting Prevention

The overfitting-reviewer agent has rejected ~66% of proposals across 22+ cycles, blocking changes that would overfit to a specific benchmark case rather than generalize.

Results

Benchmark Suites

| Suite | Source | Cases | Languages | Focus |
|---|---|---|---|---|
| Juliet | NIST | 54,488 | C/C++ | 116 CWEs, synthetic variants |
| OWASP Benchmark | OWASP Foundation | 2,740 | Java | Web app vulns (XSS, SQLi, crypto) |
| CyberSecEval | Meta | 578 | C/C++, Python | Real-world vuln patterns |
| CGC | DARPA | 300 | C | Real challenge binaries (patched/unpatched) |
| CyberGym | UC Berkeley | 3,014 | C/C++ | Real OSS-Fuzz CVEs from 188 projects |
| Fixtures | skwaq team | 99 | Mixed | Regression suite |

Current Baselines — Pattern+Dataflow Mode

| Suite | Cases | F1 | Precision | Recall | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|
| Fixtures | 99 | 93.7% | 98.1% | 89.3% | 103 | 2 | 12 | 11 |
| CSE | 578 | 91.8% | 100% | 84.8% | 434 | 0 | 78 | 0 |
| Juliet | 1,000 | 88.8% | 100% | 79.9% | 798 | 0 | 201 | 1 |
| OWASP | 500 | 93.8% | 100% | 88.3% | 228 | 0 | 30 | 242 |
| CGC | 226 | 89.8% | 100% | 81.5% | 388 | 0 | 88 | 0 |
| CyberGym | 100 | 71.9% | 100% | 56.1% | 64 | 0 | 50 | 0 |
Key differentiator: Skwaq holds 100% precision (zero false positives) on every industry benchmark suite; most tools trade precision for recall. We continue to run the self-improvement loop, expanding the number of cases with each cycle.
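The F1, precision, and recall columns follow directly from the TP/FP/FN counts. For example, the Juliet row:

```python
def scores(tp, fp, fn):
    """Standard precision/recall/F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Juliet pattern+dataflow row: TP=798, FP=0, FN=201
p, r, f1 = scores(798, 0, 201)
print(f"P={p:.1%} R={r:.1%} F1={f1:.1%}")  # P=100.0% R=79.9% F1=88.8%
```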

Per-CWE Improvements (from Self-Improvement Loop)

| CWE | Before | After | Method |
|---|---|---|---|
| CWE-401 memory leak | 0% | 99% | Agent-recommended calloc→ResourceLeak mapping |
| CWE-614 secure cookie | 0% | 100% | setSecure(false) + semantic class fix |
| CWE-78 cmd injection | 37% | 70% | spawn family (agent-identified taint rules) |
| CWE-22 path traversal | 41% | 65% | java.io.File* qualified name fix |
| CWE-79 XSS | 46% | 63% | getWriter().format/append (agentic cycle) |
| Race conditions | 33% | 100% | signal handler + thread (agentic cycle) |

Industry Comparison

| Tool | Approach | Benchmark | P | R | F1 |
|---|---|---|---|---|---|
| Skwaq (agentic) | Multi-agent + graph | Juliet 20 | 100% | 95% | 97.3% |
| VulBinLLM | LLM decompile+reason | Juliet stripped | ~85% | ~100% | ~92% |
| PS³ | Binary pattern | Curated dataset | 82% | 97% | 89% |
| LATTE | Taint + LLM | Juliet | ~70% | ~85% | ~77% |

IRIS (LLM+CodeQL hybrid) detected 2x more vulns than CodeQL alone — validating skwaq's hybrid approach. GitHub SecLab Taskflow Agent is the closest competitor architecture, with 80+ real CVEs found using multi-agent investigation.

Improvement History (Key Milestones)

| PR | Change | Impact |
|---|---|---|
| #29 | Gym benchmark harness | Baseline F1=50% |
| #35-39 | Agentic pipeline, multi-agent validation | First agent-driven detection |
| #49 | Dual-judge breakthrough | Precision: 15%→100% |
| #57 | All 4 industry benchmark adapters | Juliet, OWASP, CGC, CSE |
| #217-243 | 34 semantic classes, 109/109 Juliet CWEs | Full CWE coverage |
| #298 | Industry expansion | Juliet 1K, OWASP 1K, CSE 400 |
| #302 | CyberGym + results-skeptic agent | 3,014 real CVEs added |
| #303 | 3-layer improvement + 4 agentic cycles | +7.6% Juliet, +8.0% OWASP |
| #312-313 | Agentic eval tuning | Juliet F1=97.3% |

Full benchmark progress and history →

Usage

Analyze a Source Repository

```shell
# Full agentic analysis of a repo
skwaq analyze /path/to/repo

# Quick pattern-only scan of a single file
skwaq analyze path/to/source.c --quick
```

Binary Analysis

```shell
# Analyze a compiled binary (decompile + graph + agents)
skwaq analyze path/to/binary --binary

# Analyze a stripped binary with decompiler renaming
skwaq analyze path/to/binary --binary --decompile-rename
```

Run Benchmarks

```shell
# Agentic eval on industry suites
skwaq gym eval --suites juliet,owasp,cyberseceval

# Pattern-only eval
skwaq gym eval --suites juliet --quick --max-cases 200
```

Self-Improvement

```shell
# Run improvement cycle
skwaq gym improve juliet --max-cases 30 --max-improvements 5

# Target specific CWE
skwaq gym improve juliet --cwe CWE-78 --max-cases 20
```

Running with Different Models

Skwaq supports multiple LLM backends. Configure in skwaq.toml:

Azure AI Foundry (GPT-5.4)

```toml
# skwaq.toml
[llm]
reasoning = "azure"
decompilation = "azure"

[llm.azure]
endpoint = "https://your-resource.cognitiveservices.azure.com/"
deployment = "gpt-54-skwaq"
api_version = "2024-10-21"
```

```shell
# Deploy model (idempotent)
bash infra/azure/setup.sh

# Run eval with GPT-5.4
skwaq gym eval --suites juliet --max-cases 100 --adaptive -j 4
```

Anthropic Claude (via Azure MaaS)

```toml
# skwaq.toml
[llm]
reasoning = "anthropic"
decompilation = "anthropic"

# Auth: set ANTHROPIC_API_KEY or use Azure MaaS endpoint
```

GitHub Copilot (Claude Opus 4.6)

```toml
# skwaq.toml
[llm]
reasoning = "copilot"
decompilation = "copilot"

[llm.copilot]
model = "claude-opus-4.6"
```

Comparing Models

Run the same eval suite with each backend, then compare with the TUI dashboard:

```shell
# 1. Edit skwaq.toml to use GPT-5.4 (azure backend)
skwaq gym eval --suites juliet,owasp --max-cases 200 --adaptive -j 4

# 2. Edit skwaq.toml to use Claude Opus (copilot backend)
skwaq gym eval --suites juliet,owasp --max-cases 200 --adaptive -j 4

# 3. Compare results in the live dashboard
export SKWAQ_ROOT=/path/to/skwaq
skwaq gym dashboard --live

# 4. Or view the static snapshot
skwaq gym dashboard --tui
```

The dashboard shows F1/precision/recall per suite, trend sparklines across runs, agent stats with per-agent token usage, and the active model in the title bar. Each eval run is recorded in the history DB, so you can track how scores change when switching models.

Documentation
