A self-improving, multi-agent vulnerability analyzer that bootstrapped from a short spec into a competitive detection system — using AI agents to analyze, benchmark, and improve their own investigation methodology.
The name comes from the Lushootseed word for Raven — the trickster who reveals hidden truths.
Skwaq is an agentic vulnerability analyzer that uses a team of AI agents to investigate source code and binaries for security vulnerabilities. Unlike traditional static analysis tools that rely on fixed rules, skwaq's agents reason about code like experienced security researchers — tracing data flows, mapping attack surfaces, and debating exploitability.
What makes it unique: skwaq improves itself. A built-in benchmark harness (Skwaq Gym) measures detection accuracy against industry benchmarks, and a self-improvement loop uses AI agents to analyze their own failures and propose better investigation strategies.
Skwaq started with a short specification and a question: can AI agents bootstrap themselves into a competitive vulnerability detection system?
The gym improve command: a failure-analyst agent investigates each false negative, diagnoses why detection failed, and proposes improvements to agent prompts, taint rules, CWE mappings, or patterns. An overfitting-reviewer agent validates every proposal. The agents started teaching themselves.
The overfitting-reviewer agent, together with added workflow steps, rejects proposals that only help on benchmark cases rather than in the general case. It has rejected ~66% of proposals, catching narrow heuristics, benchmark-specific naming tricks, and detection logic too tightly coupled to test structure.
Each vulnerability investigation passes through five layers, from fast pattern matching to deep semantic reasoning:
```mermaid
flowchart TB
SRC["Source Code / Binary"] --> L1
L1["Layer 1: Pattern Detection\n~260 patterns across 6 languages"] --> L2
L2["Layer 2: Dataflow Analysis\nTaint source-to-sink tracing"] --> L3
L3["Layer 3: Context Validation\nMulti-cycle false positive reduction"] --> L4
L4["Layer 4: LLM Agent Pipeline\nattack-surface, vuln-hunter, critic"] --> L5
L5["Layer 5: Synthesis\nDomain-expert weighted evidence"] --> OUT["Findings"]
```
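The staged escalation above can be sketched as a chain of filters, each layer narrowing the candidate set before the next (more expensive) layer runs. Every function below is a stub for exposition; none of these names are skwaq's actual API.

```python
# Illustrative sketch of the five-layer escalation; all names are hypothetical.

def pattern_scan(code):
    # Layer 1: cheap pattern match flags candidate lines
    return [i for i, line in enumerate(code.splitlines()) if "system(" in line]

def trace_taint(code, candidates):
    # Layer 2: keep candidates reachable from attacker input (stubbed)
    return [c for c in candidates if "argv" in code]

def validate_context(flows):
    # Layer 3: multi-cycle false-positive reduction (stubbed as a pass-through)
    return flows

def run_agents(flows):
    # Layer 4: LLM agents vote on each flow; a single stub critic accepts all
    return [("accept", f) for f in flows]

def synthesize(verdicts):
    # Layer 5: keep only accepted findings
    return [f for verdict, f in verdicts if verdict == "accept"]

def analyze(code):
    return synthesize(run_agents(validate_context(
        trace_taint(code, pattern_scan(code)))))
```

The point of the layering is cost control: only candidates that survive the cheap layers ever reach the LLM agents.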
All analysis operates on a code property graph stored in LadybugDB — functions, call edges, taint flows, data sources, data sinks, and symbols. Agents query this graph with specialized tools.
```mermaid
erDiagram
FUNCTIONS ||--o{ CALLS : "caller-callee"
FUNCTIONS ||--o{ DATA_SOURCES : "reads from"
FUNCTIONS ||--o{ DATA_SINKS : "writes to"
DATA_SOURCES ||--o{ TAINT_FLOWS : "source"
DATA_SINKS ||--o{ TAINT_FLOWS : "sink"
FUNCTIONS ||--o{ SYMBOLS : "defines-imports"
FUNCTIONS ||--o{ FINDINGS : "contains"
```
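One way to picture the schema is as tables joined by foreign keys. The snippet below is a hypothetical in-memory rendering of the relationships above; LadybugDB's actual storage format and query tools are not shown in this document.

```python
# Hypothetical rendering of the graph schema; not LadybugDB's real API.
functions = {1: "read_input", 2: "run_cmd"}
calls = [(1, 2)]                               # caller -> callee edge
data_sources = [{"fn": 1, "kind": "stdin"}]    # read_input reads stdin
data_sinks = [{"fn": 2, "kind": "exec"}]       # run_cmd reaches an exec sink
taint_flows = [{"source": 0, "sink": 0}]       # indices into the lists above

def functions_feeding_exec_sinks():
    # Walk each taint flow from its sink back to the source function
    return [functions[data_sources[t["source"]]["fn"]]
            for t in taint_flows
            if data_sinks[t["sink"]]["kind"] == "exec"]
```

A query like this is what lets an agent ask "which entry points can reach an exec sink?" without re-reading the source.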
All 18 agents and the pipelines they participate in:
```mermaid
flowchart TB
subgraph INPUT["Input"]
SRC["Source Code"]
BIN["Binary"]
end
subgraph PREPROCESS["Pre-processing (binary only)"]
DR["decompile-renamer\nImproves decompiled output"]
DA["decompile-analyst\nReviews decompilation quality"]
end
subgraph DISCOVERY["Discovery"]
AS["attack-surface\nMaps entry points,\ndata sources, imports"]
VH["vuln-hunter\nGraph-first vuln discovery"]
VHJ["vuln-hunter-java\nJava/servlet specialist"]
VHP["vuln-hunter-python\nPython specialist"]
TT["taint-tracer\nSource-to-sink flows"]
end
subgraph VALIDATION["Validation (standard pipeline)"]
CR["critic\nAccept / Reject / Adjust severity"]
RS["results-skeptic\nDouble-checks suspicious results"]
CC["cwe-classifier\nMaps findings to CWE families"]
end
subgraph DEBATE["Debate (deep pipeline)"]
EA["exploit-analyst\nCan attacker trigger this?"]
DEF["defense-analyst\nAre mitigations effective?"]
end
subgraph SYNTHESIS["Synthesis"]
VS["verdict-synthesizer\nWeighs all perspectives,\nbreaks ties"]
end
subgraph ANALYSIS["Specialized Analysis"]
CA["crash-analyst\nAnalyzes crash/fuzz output"]
PDA["patch-diff-analyst\nAnalyzes security patches"]
end
subgraph IMPROVE["Self-Improvement Loop"]
FA["failure-analyst\nDiagnoses missed detections"]
OR["overfitting-reviewer\nValidates proposals"]
end
SRC --> AS
BIN --> DR --> DA --> AS
AS --> VH & VHJ & VHP
VH & VHJ & VHP --> TT
TT --> CR
CR --> RS --> CC --> VS
CR --> EA & DEF
EA & DEF --> VS
VS --> OUT["Confirmed Findings"]
CA -.-> CR
PDA -.-> CR
OUT --> FA
FA --> OR
OR -->|"Accepted ~34%"| APPLY["Apply improvements"]
OR -->|"Rejected ~66%"| MEM["Store lesson"]
APPLY --> AS
```
The self-improvement loop is what makes skwaq unique. Instead of manually tuning rules, AI agents analyze their own failures and propose improvements — validated by another AI agent that prevents overfitting.
```mermaid
flowchart TB
BENCH["Run Benchmark"] --> SCORE["Score Results"]
SCORE --> FN["Identify False Negatives"]
FN --> FA["failure-analyst Agent\nReads code, queries graph,\ndiagnoses WHY we missed it"]
FA --> PROP["Generate Proposals\n1. Agent Prompt\n2. Taint Rule\n3. CWE Mapping\n4. Pattern"]
PROP --> REV["overfitting-reviewer\nValidates proposals"]
REV -->|Accepted| APPLY["Apply to agents,\ntaint engine, scoring"]
REV -->|Rejected| LEARN["Store lesson\nin durable memory"]
APPLY --> MEM["Durable memory\nfor future cycles"]
MEM --> BENCH
```
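One cycle of the loop can be sketched as a routing function over false negatives. Every helper below is a stand-in for exposition, not skwaq's implementation.

```python
# Sketch of one improvement cycle as diagrammed; all helpers are hypothetical.

def diagnose(case):
    # failure-analyst stand-in: propose a fix for a missed detection
    return {"kind": "TAINT_RULE", "case": case, "generalizes": case % 2 == 0}

def looks_overfit(proposal):
    # overfitting-reviewer stand-in: reject benchmark-only fixes
    return not proposal["generalizes"]

def improvement_cycle(false_negatives):
    applied, lessons = [], []
    for case in false_negatives:
        proposal = diagnose(case)
        if looks_overfit(proposal):
            lessons.append(proposal)   # store the lesson in durable memory
        else:
            applied.append(proposal)   # update prompts, taint rules, scoring
    return applied, lessons
```

The key design choice is that rejected proposals are not discarded: they become memory that steers future cycles away from the same dead end.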
| Type | What It Does | Example |
|---|---|---|
| AGENT_PROMPT | Teaches agents new investigation strategies | Check ALL argument positions in exec calls for tainted data |
| TAINT_RULE | Adds source/sink definitions to dataflow engine | Register putenv() as CWE-427 environment sink |
| CWE_MAPPING | Fixes scoring/classification | Map CWE-614 to CryptoWeakness semantic class |
| NEW_PATTERN | Regex detection (last resort) | Only when graph-based approaches cannot detect the pattern |
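A proposal can be pictured as a small tagged record that the harness routes by its type. This shape is purely illustrative, not skwaq's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    kind: str      # AGENT_PROMPT | TAINT_RULE | CWE_MAPPING | NEW_PATTERN
    payload: dict  # type-specific details

# The putenv() example from the table, expressed as a TAINT_RULE proposal
rule = Proposal("TAINT_RULE", {"sink": "putenv", "cwe": "CWE-427"})
```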
The overfitting-reviewer agent has rejected ~66% of proposals across 22+ cycles. Common rejection reasons: narrow heuristics that only match benchmark code, benchmark-specific naming tricks, and detection logic too tightly coupled to test structure.
| Suite | Source | Cases | Languages | Focus |
|---|---|---|---|---|
| Juliet | NIST | 54,488 | C/C++ | 116 CWEs, synthetic variants |
| OWASP Benchmark | OWASP Foundation | 2,740 | Java | Web app vulns (XSS, SQLi, crypto) |
| CyberSecEval | Meta | 578 | C/C++, Python | Real-world vuln patterns |
| CGC | DARPA | 300 | C | Real challenge binaries (patched/unpatched) |
| CyberGym | UC Berkeley | 3,014 | C/C++ | Real OSS-Fuzz CVEs from 188 projects |
| Fixtures | skwaq team | 99 | Mixed | Regression suite |
| Suite | Cases | F1 | Precision | Recall | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|
| Fixtures | 99 | 93.7% | 98.1% | 89.3% | 103 | 2 | 12 | 11 |
| CSE | 578 | 91.8% | 100% | 84.8% | 434 | 0 | 78 | 0 |
| Juliet | 1,000 | 88.8% | 100% | 79.9% | 798 | 0 | 201 | 1 |
| OWASP | 500 | 93.8% | 100% | 88.3% | 228 | 0 | 30 | 242 |
| CGC | 226 | 89.8% | 100% | 81.5% | 388 | 0 | 88 | 0 |
| CyberGym | 100 | 71.9% | 100% | 56.1% | 64 | 0 | 50 | 0 |
Key differentiator: Skwaq maintains 100% precision on every industry benchmark (0 false positives). Most tools trade precision for recall. We continue to run the self-improvement loop, expanding the number of cases with each cycle.
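The headline metrics follow directly from the confusion counts in the table; a quick check of the Juliet row:

```python
# Recompute precision/recall/F1 from the Juliet confusion counts above.
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = scores(tp=798, fp=0, fn=201)   # Juliet row
# p = 1.0, r rounds to 0.799, f1 rounds to 0.888 -- matching the table
```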
| CWE | Before | After | Method |
|---|---|---|---|
| CWE-401 memory leak | 0% | 99% | Agent-recommended calloc→ResourceLeak mapping |
| CWE-614 secure cookie | 0% | 100% | setSecure(false) + semantic class fix |
| CWE-78 cmd injection | 37% | 70% | spawn family (agent-identified taint rules) |
| CWE-22 path traversal | 41% | 65% | java.io.File* qualified name fix |
| CWE-79 XSS | 46% | 63% | getWriter().format/append (agentic cycle) |
| Race conditions | 33% | 100% | signal handler + thread (agentic cycle) |
| Tool | Approach | Benchmark | P | R | F1 |
|---|---|---|---|---|---|
| Skwaq (agentic) | Multi-agent + graph | Juliet 20 | 100% | 95% | 97.3% |
| VulBinLLM | LLM decompile+reason | Juliet stripped | ~85% | ~100% | ~92% |
| PS³ | Binary pattern | Curated dataset | 82% | 97% | 89% |
| LATTE | Taint + LLM | Juliet | ~70% | ~85% | ~77% |
IRIS (LLM+CodeQL hybrid) detected 2x more vulns than CodeQL alone — validating skwaq's hybrid approach. GitHub SecLab Taskflow Agent is the closest competitor architecture, with 80+ real CVEs found using multi-agent investigation.
| PR | Change | Impact |
|---|---|---|
| #29 | Gym benchmark harness | Baseline F1=50% |
| #35-39 | Agentic pipeline, multi-agent validation | First agent-driven detection |
| #49 | Dual-judge breakthrough | Precision: 15%→100% |
| #57 | All 4 industry benchmark adapters | Juliet, OWASP, CGC, CSE |
| #217-243 | 34 semantic classes, 109/109 Juliet CWEs | Full CWE coverage |
| #298 | Industry expansion | Juliet 1K, OWASP 1K, CSE 400 |
| #302 | CyberGym + results-skeptic agent | 3,014 real CVEs added |
| #303 | 3-layer improvement + 4 agentic cycles | +7.6% Juliet, +8.0% OWASP |
| #312-313 | Agentic eval tuning | Juliet F1=97.3% |
Full benchmark progress and history →
```bash
# Full agentic analysis of a repo
skwaq analyze /path/to/repo

# Quick pattern-only scan of a single file
skwaq analyze path/to/source.c --quick

# Analyze a compiled binary (decompile + graph + agents)
skwaq analyze path/to/binary --binary

# Analyze a stripped binary with decompiler renaming
skwaq analyze path/to/binary --binary --decompile-rename

# Agentic eval on industry suites
skwaq gym eval --suites juliet,owasp,cyberseceval

# Pattern-only eval
skwaq gym eval --suites juliet --quick --max-cases 200

# Run improvement cycle
skwaq gym improve juliet --max-cases 30 --max-improvements 5

# Target specific CWE
skwaq gym improve juliet --cwe CWE-78 --max-cases 20
```
Skwaq supports multiple LLM backends. Configure in skwaq.toml:
```toml
# skwaq.toml
[llm]
reasoning = "azure"
decompilation = "azure"

[llm.azure]
endpoint = "https://your-resource.cognitiveservices.azure.com/"
deployment = "gpt-54-skwaq"
api_version = "2024-10-21"
```
```bash
# Deploy model (idempotent)
bash infra/azure/setup.sh

# Run eval with GPT-5.4
skwaq gym eval --suites juliet --max-cases 100 --adaptive -j 4
```
```toml
# skwaq.toml
[llm]
reasoning = "anthropic"
decompilation = "anthropic"
# Auth: set ANTHROPIC_API_KEY or use Azure MaaS endpoint
```
```toml
# skwaq.toml
[llm]
reasoning = "copilot"
decompilation = "copilot"

[llm.copilot]
model = "claude-opus-4.6"
```
Run the same eval suite with each backend, then compare with the TUI dashboard:
```bash
# 1. Edit skwaq.toml to use GPT-5.4 (azure backend)
skwaq gym eval --suites juliet,owasp --max-cases 200 --adaptive -j 4

# 2. Edit skwaq.toml to use Claude Opus (copilot backend)
skwaq gym eval --suites juliet,owasp --max-cases 200 --adaptive -j 4

# 3. Compare results in the live dashboard
export SKWAQ_ROOT=/path/to/skwaq
skwaq gym dashboard --live

# 4. Or view the static snapshot
skwaq gym dashboard --tui
```
The dashboard shows F1/precision/recall per suite, trend sparklines across runs, agent stats with per-agent token usage, and the active model in the title bar. Each eval run is recorded in the history DB, so you can track how scores change when switching models.