Formal Specifications as Prompt Language¶
Formal specification languages — TLA+ and Gherkin — produce better AI-generated code than English alone for complex tasks. This guide covers the experiments that proved it, when to use each formalism, and how the tooling integrates into the amplihack workflow.
Judgment call, not a rule
English-only is the default. Formal specs earn their place when complexity warrants them. This is a design judgment — not a mandatory step.
The Core Finding¶
Two independent experiments measured how prompt language affects code generation quality:
| Experiment | Formalism | English Baseline | Spec-Only | Improvement | Key Insight |
|---|---|---|---|---|---|
| #3497 — Distributed Retrieval Contract | TLA+ | 0.57 | 0.86 | +51% | Hybrid TLA++English degrades to 0.50 |
| #3969 — Recipe Step Executor | Gherkin | 0.713 | 0.898 | +26% | Hybrid Gherkin+English adds noise (widest CI) |
Both experiments converge on the same conclusion: spec-only prompts outperform hybrid spec+English prompts. Natural language dilutes rather than amplifies the formal signal.
When to Use What¶
flowchart TD
REQ[New Requirement] --> Q1{Is the hard part<br/>'what must always/never be true'?}
Q1 -->|Yes| TLA["Consider TLA+"]
Q1 -->|No| Q2{Is the hard part<br/>'what does done look like'?}
Q2 -->|Yes| GHK["Consider Gherkin"]
Q2 -->|No| ENG["Use English-only"]
TLA --> V1["Concurrent state, safety invariants,<br/>distributed protocols, state machines"]
GHK --> V2["Multi-step behavior, multi-actor scenarios,<br/>business rules, acceptance criteria"]
ENG --> V3["CRUD, UI components,<br/>simple transformations"] Consider TLA+ when¶
- Concurrent components with safety invariants (mutual exclusion, ordering, liveness)
- Distributed protocols (fan-out/merge, quorum, timeout handling)
- State machines with non-obvious valid transitions
- The requirement has "must never" or "must always" constraints
Consider Gherkin when¶
- Complex multi-step behavioral requirements with many edge cases
- Multi-actor scenarios with specific acceptance criteria
- Business rules with combinatorial conditions
- Stakeholder validation is needed (Gherkin is human-readable)
Use English-only when¶
- CRUD endpoints, UI components, simple data transformations
- No safety invariants or complex behavioral specs exist
- The requirement is straightforward and unambiguous
Experiment 1: TLA+ for Concurrent Systems¶
Issue: #3497 Task: Generate a Distributed Retrieval Contract — a request-local protocol preserving question end-to-end, dispatching retrieval across agents, merging results deterministically, and surfacing shard failures explicitly.
Methodology¶
- Evaluation:
heuristic_signal_v2— keyword/regex scoring across 6 dimensions - Models: Claude Opus 4.6, GPT-5.4
- Prompt variants:
english,tla_only,tla_plus_english,tla_plus_refinement - Matrix: 8 conditions (4 variants x 2 models), smoke run (1 repeat)
- TLC validation: 5,927 states explored, 2,660 distinct, 0 errors, 7 invariants hold
Results¶
| Prompt Variant | Baseline Score | Coverage Score | Notes |
|---|---|---|---|
| english | 0.57 | 0.43 | Neither model references formal methods spontaneously |
| tla_only | 0.86 | 0.86 | Best across both models |
| tla_plus_english | 0.50 | 0.50 | Worse than English alone — hybrid interference |
| tla_plus_refinement | 0.71 | 0.71 | Refinement guidance helps some, but less than spec-only |
Per-Feature Breakdown (Claude Opus 4.6)¶
| Feature | english | tla_only | tla_plus_english | tla_plus_refinement |
|---|---|---|---|---|
| Baseline compliance | 0.43 | 0.86 | 0.43 | 0.86 |
| Invariant adherence | 0.50 | 0.75 | 0.50 | 0.75 |
| Proof alignment | 0.00 | 1.00 | 1.00 | 1.00 |
| Local protocol | 0.00 | 1.00 | 0.00 | 1.00 |
| Progress signal | 0.00 | 1.00 | 0.00 | 1.00 |
| Spec coverage | 0.29 | 0.86 | 0.43 | 0.86 |
Key observations:
- English-only scores 0.0 on proof alignment, local protocol, and progress signal — the model has no formal target to aim for
- The hybrid prompt (tla_plus_english) achieves proof alignment=1.0 but loses local protocol and progress signal — English guidance directs attention to the wrong priorities
- TLA+-only achieves perfect scores on 4 of 6 features
Why Hybrid TLA++English Degrades¶
When both a formal spec and English description are present, the model must reconcile two representations that may subtly conflict. The model resolves ambiguity unpredictably — sometimes following the spec, sometimes the English. This produces worse results than either alone.
Caveats¶
- Heuristic scoring is keyword-based, not semantic
- GPT-5.4 timed out on
tla_onlyandtla_plus_refinementconditions (infrastructure limit, not content issue)
Experiment 2: Gherkin for Behavioral Requirements¶
Issue: #3969 Task: Generate a Python RecipeStepExecutor class with 6 interacting behavioral features: conditional execution, step dependencies, retry with exponential backoff, timeout handling, output capture, and sub-recipe delegation.
Methodology¶
- Evaluation:
agent_consensus_v1— 3 independent LLM evaluator agents per condition - Models: Claude Opus 4.6 (12/12 completed), GPT-5.4 (1/12 completed due to timeouts)
- Prompt variants:
english,gherkin_only,gherkin_plus_english,gherkin_plus_acceptance - Matrix: 24 conditions (4 variants x 2 models x 3 repeats)
- Statistical basis: 95% confidence intervals via t-distribution
Agent Consensus Scoring¶
Unlike the TLA+ experiment's heuristic scoring, the Gherkin experiment uses agent consensus — a method designed to reduce evaluator bias:
- Three independent evaluator agents assess each generated implementation
- Each agent has a distinct persona to reduce correlated errors:
- Senior software engineer (correctness focus)
- QA engineer (specification compliance focus)
- Systems architect (behavioral contract focus)
- Each agent reads the generated code against the acceptance criteria and Gherkin scenarios
- Each votes PASS/FAIL on each of 6 behavioral features with written reasoning
- The consensus score per feature = fraction of valid votes that are PASS
- Parse-failed evaluator responses are excluded (not counted as FAIL)
This method produces more reliable scores than single-evaluator or heuristic approaches because evaluator errors are independent and wash out through voting.
Results¶
| Prompt Variant | Avg Score | 95% CI | N |
|---|---|---|---|
| gherkin_only | 0.898 | ±0.144 | 3 |
| gherkin_plus_acceptance | 0.889 | ±0.478 | 3 |
| gherkin_plus_english | 0.785 | ±0.769 | 4 |
| english | 0.713 | ±0.478 | 3 |
Confidence intervals matter
gherkin_plus_english has the widest CI (±0.769) — one run scored near-perfect, another scored 0.5 across all features. English adds variance, not just noise. gherkin_only has the tightest CI (±0.144), meaning it's the most reliable variant.
Per-Feature Breakdown¶
| Feature | english | gherkin_only | gherkin_plus_english | gherkin_plus_acceptance |
|---|---|---|---|---|
| Conditional execution | 1.000 | 1.000 | 0.875 | 0.889 |
| Dependency handling | 0.167 | 0.778 | 0.875 | 0.889 |
| Retry logic | 1.000 | 0.833 | 0.625 | 0.889 |
| Timeout semantics | 1.000 | 0.778 | 0.583 | 0.889 |
| Output capture | 0.222 | 1.000 | 0.875 | 0.889 |
| Sub-recipe delegation | 0.889 | 1.000 | 0.875 | 0.889 |
Key observations:
- Timeout semantics is the hardest feature across all variants — it involves a negative constraint ("timed-out steps are NOT retried") that interacts with retry logic. Models frequently miss this interaction.
- English scores perfectly on conditional execution, retry, and timeout individually, but misses dependency handling (0.167) and output capture (0.222) — it fails on cross-feature interactions.
- Gherkin-only achieves the best overall score and the most consistent results, though it's not perfect on every individual feature.
gherkin_plus_acceptanceis the most uniform variant (0.889 on every feature) — acceptance criteria may help balance attention across features.
Why Gherkin Differs from TLA+¶
Unlike TLA+, Gherkin+English hybrids do improve over English-only (0.785 > 0.713). The degradation is milder because Gherkin scenarios are written in near-English, so there's less semantic conflict between the spec and the natural language guidance. However, pure Gherkin still wins because adding English introduces variance — some runs benefit, others are thrown off.
Workflow Integration¶
Formal specifications integrate into the Default Workflow at two points:
Step 05: Research and Design¶
The architecture design step includes this guidance:
For components with concurrent state or multiple actors modifying shared state, express key safety invariants as formal predicates. When designing complex behavioral requirements, consider whether Gherkin scenarios or TLA+ specs would clarify the design.
At this step, the prompt-writer agent uses a tri-path judgment to decide the specification language:
| Path | When | Signal |
|---|---|---|
| English-only (default) | Simple CRUD, sequential logic, internal utilities | Neither question applies |
| Gherkin | Complex behaviors, multi-actor, acceptance criteria | "What does done look like?" is hard to express in English |
| TLA+ | Concurrent state, safety invariants, protocols | "What must always/never be true?" is hard to express in English |
Step 07: TDD Write Tests First¶
When the design includes Gherkin scenarios or TLA+ formal predicates, use them as the primary source for test case derivation — they define the behavioral contract more precisely than English requirements.
Gherkin scenarios map directly to test cases. TLA+ invariants map to property assertions.
Skills and Agents¶
Skills (user-invocable)¶
| Skill | Invocation | Purpose |
|---|---|---|
| Gherkin Expert | /gherkin-expert | Write Gherkin scenarios, review existing specs, translate requirements to BDD format |
| TLA+ Expert | /tla-plus-expert | Write TLA+ specs, review specs, interpret TLC output, generate invariants |
Agents (delegated by other agents)¶
| Agent | Type | Role |
|---|---|---|
| gherkin-expert | Specialized | Writes Gherkin scenarios, applies BDD patterns, domain modeling through scenarios |
| tla-plus-expert | Specialized | Writes TLA+ specs, model checking, applies Demirbas's 7 mental models |
| prompt-writer | Specialized | Tri-path judgment: decides when to use English, Gherkin, or TLA+ |
Tutorial: Gherkin Expert in Practice¶
Here's how the gherkin-expert skill works in a typical development flow.
Scenario: Adding a rate limiter¶
You're implementing a rate limiter with sliding window, burst allowance, and per-client quotas. The behavioral requirements are complex enough that English descriptions would be ambiguous.
Step 1: Invoke the skill to write scenarios.
The gherkin-expert agent produces a .feature file:
Feature: Rate Limiter
Sliding window rate limiting with burst allowance and per-client quotas.
Background:
Given the rate limiter is configured with a window of 60 seconds
And the default quota is 100 requests per window
Scenario: Requests within quota are allowed
Given client "alpha" has made 50 requests in the current window
When client "alpha" sends a request
Then the request is allowed
And the remaining quota for "alpha" is 49
Scenario: Requests exceeding quota are rejected
Given client "alpha" has made 100 requests in the current window
When client "alpha" sends a request
Then the request is rejected with status 429
And the response includes a Retry-After header
Scenario: Burst allowance permits temporary spikes
Given client "alpha" has a burst allowance of 20
And client "alpha" has made 100 requests in the current window
When client "alpha" sends a request
Then the request is allowed
And the burst remaining for "alpha" is 19
Scenario: Sliding window expires old requests
Given client "alpha" made 100 requests between 0s and 30s ago
And 31 seconds have elapsed
When client "alpha" sends a request
Then the request is allowed
And the remaining quota reflects only requests within the window
Scenario: Per-client quotas override defaults
Given client "beta" has a custom quota of 500 requests per window
And client "beta" has made 200 requests in the current window
When client "beta" sends a request
Then the request is allowed
And the remaining quota for "beta" is 299
Step 2: Use the .feature file as the prompt for code generation (step 05/07 of the workflow). The builder agent receives the Gherkin scenarios as its primary specification — no English description needed.
Step 3: The tester agent derives test cases directly from the scenarios. Each Scenario: becomes one or more test functions.
Why this works¶
The Gherkin scenarios are unambiguous about:
- What "within quota" means (exact numbers)
- How burst allowance interacts with the base quota
- What the sliding window expiry behavior is
- That per-client quotas override (not supplement) defaults
An English description of the same requirements would likely leave at least one of these interactions ambiguous.
Tutorial: TLA+ Expert in Practice¶
Here's how the tla-plus-expert skill works, using the actual hive mind distributed retrieval system as the example.
Scenario: Distributed retrieval with shard failures¶
You're building a system where a question is dispatched to multiple retrieval agents (shards), results are merged, and the response must handle partial failures. The hard part isn't any single component — it's the interactions: What happens when some shards fail? Can the system complete with partial results? Is the merge deterministic regardless of response order?
These are exactly the questions that English descriptions fumble but TLA+ makes precise.
Step 1: Invoke the skill to write the spec.
/tla-plus-expert write a TLA+ spec for distributed retrieval: a question is dispatched to N agents,
each returns facts or fails, results are merged deterministically, partial failures are allowed,
total failure (no responses) is a distinct state
The tla-plus-expert agent produces a .tla specification. Here's the actual spec from the hive mind experiment (simplified for clarity):
---- MODULE DistributedRetrievalBestEffort ----
EXTENDS FiniteSets, Naturals, Sequences
CONSTANTS Agents, Questions, Facts, NullQuestion
VARIABLES
activeAgents, originalQuestion, normalizedQuery,
shardResults, respondedAgents, failedAgents,
mergedResult, phase
\* Phase transitions: idle -> dispatch -> complete | failed -> idle
Init ==
/\ activeAgents \in SUBSET Agents
/\ activeAgents # {}
/\ phase = "idle"
StartRequest(q, nq) ==
/\ phase = "idle"
/\ q \in Questions
/\ phase' = "dispatch"
RecordShardSuccess(a, facts) ==
/\ phase = "dispatch"
/\ a \in activeAgents
/\ a \notin respondedAgents \* Each agent responds at most once
/\ shardResults' = [shardResults EXCEPT ![a] = facts]
/\ respondedAgents' = respondedAgents \cup {a}
RecordShardFailure(a) ==
/\ phase = "dispatch"
/\ a \in activeAgents
/\ a \notin respondedAgents
/\ failedAgents' = failedAgents \cup {a}
/\ shardResults' = [shardResults EXCEPT ![a] = {}] \* Failed = empty
\* Best-effort completion: allowed even with failures
CompleteRequest ==
/\ phase = "dispatch"
/\ activeAgents \subseteq (respondedAgents \cup failedAgents)
/\ mergedResult' = CanonicalMerge \* Deterministic merge
/\ phase' = "complete"
\* Total failure: nobody responded at all
FailRequest ==
/\ phase = "dispatch"
/\ respondedAgents = {}
/\ failedAgents = activeAgents
/\ phase' = "failed"
Step 2: Define the invariants — the properties that must always hold.
This is where TLA+ earns its keep. These invariants are what English descriptions consistently miss:
\* The original question is never lost mid-request
OriginalQuestionPreserved ==
phase \in {"dispatch", "complete", "failed"} =>
originalQuestion # NullQuestion
\* Merged results come ONLY from actual responses, not garbage
MergedFactsComeFromResponses ==
phase = "complete" =>
SeqToSet(mergedResult) = UNION {shardResults[a] : a \in respondedAgents}
\* Same inputs produce same merge regardless of response order
DeterministicMerge ==
phase = "complete" =>
mergedResult = CanonicalizeSet(RespondedFactsSet)
\* Failed shards contribute nothing (not random data)
FailedShardsContributeNothing ==
\A a \in failedAgents : shardResults[a] = {}
\* Total failure ONLY when nobody responded
FailOnlyWhenNoResponses ==
phase = "failed" => respondedAgents = {}
Step 3: Model-check with TLC to verify the spec is consistent.
TLC explores all possible interleavings: agent A succeeds then B fails, B fails then A succeeds, both succeed, both fail, timeouts, etc. The actual experiment explored 5,927 states with 0 invariant violations across 7 invariants.
Step 4: Use the spec as the prompt for code generation.
Pass the .tla file to the builder agent as the primary specification — no English description needed. The experiment showed this produces code that scores 0.86 on contract compliance, vs 0.57 with English-only.
Why this works¶
The TLA+ spec makes five things explicit that English consistently misses:
-
"Each agent responds at most once" — the
a \notin respondedAgentsguard. English descriptions rarely state this constraint explicitly, leading to code that doesn't deduplicate. -
"Failed shards contribute empty, not garbage" —
shardResults' = [shardResults EXCEPT ![a] = {}]. English says "handle failures" but doesn't specify what the failure contribution looks like in the data structure. -
"Merge is deterministic" —
CanonicalizeSetproduces the same sequence regardless of response order. English says "merge results" but doesn't address ordering. -
"Total failure is distinct from partial success" — separate
FailRequestandCompleteRequestactions with different phase transitions. English descriptions often conflate these. -
"Completion requires all agents accounted for" —
activeAgents \subseteq (respondedAgents \cup failedAgents). Without this, the system could "complete" while agents are still in flight.
An English prompt describing the same system produced code that scored 0.57 — it consistently missed the deterministic merge invariant and the distinction between total failure and partial success.
When TLC catches bugs you wouldn't¶
The real power of TLA+ isn't just as a prompt language — it's that TLC can verify your spec before you generate code. In the hive mind spec, TLC verified that no interleaving of agent responses can violate the merge determinism invariant. No amount of English description review would give you that confidence.
Reference¶
External resources¶
- TLA+ Home Page — Leslie Lamport's TLA+ resource site with learning materials, tools, and the TLA+ book
- TLA+ Tools (TLC, TLAPS, PlusCal) — open-source model checker, proof system, and algorithm language
- Learn TLA+ — interactive tutorial for getting started with TLA+ specifications
- Cucumber / Gherkin Documentation — official Gherkin language reference (Given/When/Then syntax)
- BDD Introduction (Cucumber) — Behavior-Driven Development concepts and practices
- Gherkin Reference — complete keyword reference for .feature files
Project resources¶
- PATTERNS.md: The Formal Specification as Prompt pattern contains the summary evidence table and domain guidance.
- TLA+ experiment data:
experiments/hive_mind/tla_prompt_language/ - Gherkin experiment data:
experiments/hive_mind/gherkin_v2_recipe_executor/ - Evaluation code:
src/amplihack/eval/gherkin_agent_evaluator.py(agent consensus),src/amplihack/eval/gherkin_prompt_experiment.py(experiment runner)
Research¶
- Demirbas, M. "Using TLA+ as a Design Accelerator" — 8 industry case studies (AWS, MongoDB, Azure)
- SysMoBench: LLM Evaluation for Formal System Modeling — LLMs violate 41.9% of liveness properties vs 8.3% of safety properties; always validate with TLC
- Use of Formal Methods at Amazon Web Services — how TLA+ is used in production at AWS (DynamoDB, S3, EBS)
Potential Follow-ups¶
Increase statistical power¶
The current Gherkin experiment uses N=3 repeats per condition. Confidence intervals overlap across all variants, meaning we can't statistically distinguish them. Running N=10 would tighten the CIs enough to determine whether gherkin_only is genuinely better than english or whether we're observing noise. The infrastructure supports this — change full_repeats in manifest.json and re-run.
Investigate why hybrid prompts degrade¶
Both experiments show that adding English to a formal spec makes results worse. This is counterintuitive. Is it because the model tries to reconcile conflicting representations? Does the degradation depend on whether the English paraphrases or supplements the spec? A targeted experiment varying the relationship between the English and spec content could turn this from an observation into an actionable design principle for prompt engineering.
Execute generated code against the reference test suite¶
Agent consensus scoring is better than regex, but the ground truth is: does the generated code actually work? The Gherkin experiment includes reference/test_recipe_step_executor.py with 34 passing tests. Running each generated artifact against those tests would give real pass/fail per feature — not LLM opinion. This would also calibrate whether the evaluator agents are accurate (do they agree with the tests?).
Broader model coverage¶
GPT-5.4 timed out on every condition due to copilot SDK infrastructure issues. Fixing that would give cross-model data. The TLA+ experiment showed both Claude and GPT converge to 0.86 with formal specs — a stronger finding than single-model results. Testing with Sonnet (faster, cheaper) would show whether the spec-vs-english gap holds across capability levels.
Automate the spec-to-test pipeline¶
trace_to_test.py exists for converting TLA+ model-checker traces into pytest cases, but isn't integrated into the default workflow. For Gherkin, the scenario-to-test mapping is still manual. Automating both paths — TLC traces become pytest cases, Gherkin scenarios become pytest cases — would make formal specs directly executable rather than just prompt engineering artifacts. This closes the loop: spec drives code generation and test generation from the same source of truth.
Measure evaluator inter-rater reliability¶
The 3 evaluator agents sometimes disagree (split votes). We don't yet know if these disagreements are random noise or systematic. Measuring inter-rater reliability (Cohen's kappa or Fleiss' kappa across agents) and calibrating against human judgment on a small sample would tell us how much to trust the consensus scores — and whether 3 agents is enough or we need 5.
Instrument the workflow feedback loop¶
The prompt-writer, architect, and tester agents now have formal spec awareness, but we don't know if they actually use it or if it improves outcomes in real development tasks. Instrumenting the prompt-writer to log when it recommends TLA+ vs Gherkin vs English, and tracking whether downstream code quality improves on those tasks, would measure the production value of these experiments — not just the lab results.