How to Troubleshoot Recipe Execution Failures¶

Use this guide when a recipe step fails unexpectedly — a shell step hangs, an agent step produces no file changes, or a prerequisite tool is missing.

Contents¶

Shell step hangs waiting for input
Shell step fails with "TASK_DESCRIPTION: unbound variable"
Agent step completes but changes nothing
Shell step fails with "python3 not found"
Workflow classification routes to the wrong type
Workflow step requires a Git repository
Agent step killed by step timeout
Update completes but assets are stale
smart-orchestrator fails after update with stale Python behavior
Install fails after switching to the Rust repository
Checksum verification fails intermittently
Step-08c fails with WORKTREE_SETUP_WORKTREE_PATH not set
Step-08c rejects a valid verdict synonym
Step-08c reports hollow success on a clean worktree after commit

Shell step hangs waiting for input¶

Symptom: An amplihack recipe run invocation stalls indefinitely on a shell step. No output appears. Killing the process shows the step was waiting for TTY input from a tool like npm, apt, or a git credential helper.

Cause: The child shell process inherited an environment that signaled interactive mode. Tools like apt and npm prompt for confirmation unless told otherwise.

Fix: The recipe executor now injects five environment variables into every shell step automatically. Two supply defaults for HOME and PATH, and three signal non-interactive mode:

Variable	Value	Purpose
`HOME`	inherited or `/root`	Prevents `~`-expansion failures
`PATH`	inherited or `/usr/local/bin:/usr/bin:/bin`	Ensures basic tool lookup
`NONINTERACTIVE`	`1`	Generic non-interactive signal
`DEBIAN_FRONTEND`	`noninteractive`	Suppresses dpkg/apt prompts
`CI`	`true`	Suppresses interactive prompts in npm, yarn, pip

No recipe YAML changes are required. All shell steps receive these variables.

Verify the fix is active:

# Create a one-step recipe that prints CI env vars
cat > /tmp/env-check.yaml << 'EOF'
name: env-check
steps:
  - id: check
    type: shell
    command: "env | grep -E '^(CI|NONINTERACTIVE|DEBIAN_FRONTEND)='"
EOF

amplihack recipe run /tmp/env-check.yaml
# Expected output includes:
#   CI=true
#   NONINTERACTIVE=1
#   DEBIAN_FRONTEND=noninteractive
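
To reproduce the same environment outside a recipe (for example, to check whether a tool still prompts), the table above can be approximated with a small wrapper. This is a minimal sketch, not the executor's implementation: run_noninteractive is a hypothetical helper name, and the fallback values are copied from the table.

```shell
# Sketch only: mirrors the five variables listed in the table above.
# run_noninteractive is a hypothetical helper, not an amplihack command.
run_noninteractive() {
  # Keep inherited HOME/PATH when present, otherwise fall back to the defaults.
  env \
    HOME="${HOME:-/root}" \
    PATH="${PATH:-/usr/local/bin:/usr/bin:/bin}" \
    NONINTERACTIVE=1 \
    DEBIAN_FRONTEND=noninteractive \
    CI=true \
    "$@"
}

# Usage: run any command under the same variables a shell step would see.
run_noninteractive sh -c 'echo "CI=$CI NONINTERACTIVE=$NONINTERACTIVE DEBIAN_FRONTEND=$DEBIAN_FRONTEND"'
```

If a tool still prompts under this wrapper, the hang is likely caused by something other than the environment, such as a credential helper that reads directly from the terminal.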

Shell step fails with "TASK_DESCRIPTION: unbound variable"¶

Symptom: A bash step that reads a context value from the environment aborts immediately, often inside a nested sub-recipe such as step-03-create-issue:

TASK_DESCRIPTION: unbound variable
REPO_PATH: unbound variable

The step uses set -u (or set -euo pipefail) and references $TASK_DESCRIPTION, $REPO_PATH, or another context value directly rather than through a {{placeholder}}.

Cause: Before context environment export existed, recipe context fed only {{placeholder}} substitution in step text — it was not present in the shell step's process environment. Under set -u, referencing an unset variable is a hard error, so the step failed before doing any work. This blocked multi-workstream campaigns at step-03-create-issue.

Fix: amplihack recipe run now exports every recipe context variable to the recipe-runner-rs subprocess as an environment variable whose name is the ASCII-uppercased context key (task_description → TASK_DESCRIPTION). The export is inherited by every shell step and every nested sub-recipe, so the same value is available at any depth. No recipe YAML changes are required.

Verify the fix is active:

cat > /tmp/unbound-check.yaml << 'EOF'
name: unbound-check
context:
  task_description: ""
  repo_path: "."
steps:
  - id: read-env
    type: bash
    command: |
      set -euo pipefail
      echo "TASK_DESCRIPTION=$TASK_DESCRIPTION"
      echo "REPO_PATH=$REPO_PATH"
EOF

amplihack recipe run /tmp/unbound-check.yaml \
  -c task_description="probe" -c repo_path=.
# Expected output:
#   TASK_DESCRIPTION=probe
#   REPO_PATH=.

If the variable is still unbound:

The key cannot become a valid identifier. Only keys whose uppercased form matches ^[A-Z_][A-Z0-9_]*$ are exported. A key with spaces, dots, or a leading digit is skipped — look for a WARN recipe context key skipped for env export name=… reason=invalid_identifier line on stderr. These warnings are hidden by the CLI's default log filter (errors only), so re-run with RUST_LOG=warn to surface them. Rename the context key (for example issue-title → issue_title).
The name is reserved. Names such as PATH, HOME, IFS, BASH_ENV, LD_PRELOAD, PYTHONPATH, and anything beginning with AMPLIHACK_ are never exported from context (a reason=reserved_name warning is logged). Read those values from their canonical source instead of from context.
The context key is genuinely absent. Confirm the value is supplied via the recipe's context: block or a -c/--context flag; an unset, un-inferred key produces no environment variable.

See Recipe Context Environment Export for the full transform rules and denylist, and Propagate Recipe Context to Bash Steps for a hands-on walkthrough.
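
The naming rules above can be checked locally before renaming a context key. The following is a rough sketch under stated assumptions: context_key_to_env_name is a hypothetical helper, and its reserved list contains only the names quoted in this guide, so the real denylist may be longer.

```shell
# Sketch: predict whether a context key would be exported, and under what name.
# Hypothetical helper; the reserved list below is partial (names quoted in this guide).
context_key_to_env_name() {
  local key="$1" name
  # ASCII-uppercase the key (task_description -> TASK_DESCRIPTION).
  name=$(printf '%s' "$key" | LC_ALL=C tr 'a-z' 'A-Z')
  # Only names matching ^[A-Z_][A-Z0-9_]*$ are exported.
  if ! printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[A-Z_][A-Z0-9_]*$'; then
    echo "skipped: $key (invalid_identifier)" >&2
    return 1
  fi
  # Reserved names are never exported from context.
  case "$name" in
    PATH|HOME|IFS|BASH_ENV|LD_PRELOAD|PYTHONPATH|AMPLIHACK_*)
      echo "skipped: $key (reserved_name)" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$name"
}

# Usage
context_key_to_env_name task_description
context_key_to_env_name issue-title || true
```

The first call prints the exported name; the second reports the key as skipped because the hyphen does not survive the identifier check.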

Agent step completes but changes nothing¶

Symptom: An agent step runs, appears to succeed, but produces no file modifications. The agent's output may reference files by relative path without knowing where the working directory is.

Cause: The agent backend was not receiving the recipe's working directory or non-interactive flag in its context map. The agent would default to its own notion of "current directory," which may differ from the recipe's working_dir.

Fix: The executor now augments the context passed to every agent step with two entries:

Context key	Value	Purpose
`working_directory`	The recipe's configured `working_dir`	Tells the agent where to read/write files
`NONINTERACTIVE`	`1`	Prevents the agent from attempting interactive prompts

These entries are injected only when the recipe step's own context does not already define them, so explicit overrides in recipe YAML still work.

Verify in a recipe:

name: agent-test
steps:
  - id: write-file
    type: agent
    agent: claude
    prompt: "Create a file called hello.txt containing 'Hello from agent step'"

After running, hello.txt should appear in the recipe's working directory, not in some other location.
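
The "inject only when absent" rule can be illustrated with a small sketch. It assumes the step context is represented as KEY=VALUE lines on stdin, which is a simplification chosen for this example; augment_agent_context is a hypothetical name and not part of amplihack.

```shell
# Sketch: add the two default entries only if the step context lacks them.
# Context is modeled as KEY=VALUE lines on stdin (an assumption for illustration).
augment_agent_context() {
  local working_dir="$1" ctx
  ctx=$(cat)
  # Pass through whatever the recipe step already defined.
  if [ -n "$ctx" ]; then
    printf '%s\n' "$ctx"
  fi
  # Append defaults only when the key is not already present.
  printf '%s\n' "$ctx" | grep -q '^working_directory=' \
    || printf 'working_directory=%s\n' "$working_dir"
  printf '%s\n' "$ctx" | grep -q '^NONINTERACTIVE=' \
    || printf 'NONINTERACTIVE=1\n'
}

# Usage: an explicit working_directory override is kept, NONINTERACTIVE is added.
printf 'working_directory=/custom\n' | augment_agent_context /repo
```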

Shell step fails with "python3 not found"¶

Symptom: A recipe step that references python3 or python in its command fails immediately with:

Error: Shell step 'step-id' requires python3 but it is not installed or not on PATH.
Recipe steps should use deterministic Rust tools instead of Python sidecars.

Cause: The executor performs a pre-flight check for Python availability before executing any shell step whose command mentions python3 or python. This prevents long-running recipes from wasting hours before failing at a late step that needs Python.
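
The pre-flight idea can be sketched as a shell function. This is an approximation under assumptions: preflight_python is a hypothetical name, and the whole-word match on python or python3 is a guess at what "mentions" means; the executor's actual matching rule is not documented here.

```shell
# Sketch: fail fast when a step's command mentions Python but python3 is missing.
# Hypothetical helper; the whole-word match is an assumption.
preflight_python() {
  local step_id="$1" command_text="$2"
  local pattern='(^|[^[:alnum:]_])python3?([^[:alnum:]_]|$)'
  if printf '%s\n' "$command_text" | grep -Eq "$pattern"; then
    if ! command -v python3 >/dev/null 2>&1; then
      echo "Error: Shell step '$step_id' requires python3 but it is not installed or not on PATH." >&2
      return 1
    fi
  fi
  return 0
}

# Usage: a command that does not mention Python always passes the check.
preflight_python demo-step 'echo hello' && echo "pre-flight ok"
```

Running the check before the first step, rather than when the Python step is reached, is what saves the long wait described above.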

Resolution options:

Install Python 3 on the machine and ensure it is on PATH:

# Ubuntu/Debian
sudo apt-get install -y python3

# macOS
brew install python@3

# Verify
python3 --version

Rewrite the step to avoid the Python dependency. The error message recommends using Rust-native tools. For example, replace a Python JSON-processing script with jq or a purpose-built Rust binary.
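Use a Docker image that includes Python if the recipe must run Python. The image must also provide the amplihack binary; the stock python:3.12-slim image used below does not ship it, so build a derived image or mount the binary first: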

docker run --rm -v "$PWD:/work" -w /work python:3.12-slim \
  amplihack recipe run my-recipe.yaml

Workflow classification routes to the wrong type¶

Symptom: A development task like "Add an agentic disk-cleanup loop. Extend src/cmd_cleanup.rs" gets classified as Ops instead of Default, causing the wrong workflow to execute.

Cause: Single-word OPS keywords (like cleanup, manage) matched as substrings in code paths (cmd_cleanup.rs) and task descriptions. This triggered false-positive OPS classification.

Fix: OPS keywords are now multi-word phrases that require specific operational context:

Old keyword (removed)	New phrase (required)
`cleanup`	`disk cleanup`, `clean up temp`
`manage`	`manage repos`, `repo management`
`delete`	`delete files`
`organize`	`organize files`

A task description must contain the full phrase to match OPS. Single words like cleanup appearing inside code references no longer trigger misclassification.

Verify classification:

# This should classify as Default (development), not Ops
amplihack classify "Add an agentic disk-cleanup loop. Extend src/cmd_cleanup.rs with a new function."
# Expected: workflow=Default

# This should classify as Ops
amplihack classify "disk cleanup on staging servers"
# Expected: workflow=Ops
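
The difference between substring keywords and full phrases can be seen in a toy matcher. This is a simplified sketch, not the real classifier: classify_workflow is a hypothetical function, it knows only the phrases listed in the table above, and it recognizes only the Ops and Default types.

```shell
# Sketch: phrase-based matching using only the phrases from the table above.
# Hypothetical helper; the real classifier handles more workflow types.
classify_workflow() {
  local text phrase
  text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for phrase in "disk cleanup" "clean up temp" "manage repos" \
                "repo management" "delete files" "organize files"; do
    case "$text" in
      *"$phrase"*)
        echo "workflow=Ops"
        return 0
        ;;
    esac
  done
  echo "workflow=Default"
}

# Usage
classify_workflow "Add an agentic disk-cleanup loop. Extend src/cmd_cleanup.rs with a new function."
classify_workflow "disk cleanup on staging servers"
```

The first description contains "disk-cleanup" and "cmd_cleanup.rs" but never the exact phrase "disk cleanup", so it falls through to Default; the second contains the phrase and matches Ops.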

Workflow step requires a Git repository¶

Symptom: amplihack recipe run starts successfully from a scratch or temporary directory, then a development, publish, worktree, TDD, or PR step fails before executing its Git command:

ERROR: step <workflow>/<step> requires a git repo at /tmp/demo; either `git init` or rerun from a checkout

Cause: Recipe routing and investigation flows are Git-optional, but the selected step needs repository state to create a branch, compute tracked-file changes, create a worktree, commit, push, or open a pull request. The recipe checks this precondition before running git so failures are actionable instead of surfacing as a raw Git exit 128.

Fix: Run the Git-dependent workflow from a checkout:

cd /home/user/src/myproject
amplihack recipe run ~/.amplihack/.claude/recipes/default-workflow.yaml \
  -c task_description="Fix the pagination bug" \
  -c repo_path=.

Or initialize the directory when a new repository is appropriate:

git init
amplihack recipe run ~/.amplihack/.claude/recipes/default-workflow.yaml \
  -c task_description="Create the initial project structure" \
  -c repo_path=.

If your task is analysis-only, use an investigation or Q&A recipe instead; those workflows do not require Git, and history-specific analysis is skipped or marked unavailable outside a repository:

amplihack recipe run ~/.amplihack/.claude/recipes/investigation-workflow.yaml \
  -c task_description="Explain the error message" \
  -c repo_path=.

Note: Optional Git telemetry is different from required Git work. When a telemetry-only check runs outside a repository, it prints a visible skip message such as [skip] not a git repo at /tmp/demo; skipping operations file-change git telemetry and continues.
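
To check the same precondition by hand before starting a long run, a guard along these lines may help. It is a sketch only: require_git_repo is a hypothetical helper, and using git rev-parse --is-inside-work-tree is an assumption about how such a check could be done, not a description of the recipe's own code.

```shell
# Sketch: verify that a directory is inside a Git work tree before running
# a Git-dependent workflow. Hypothetical helper, not the recipe's own guard.
require_git_repo() {
  local step="$1" dir="$2"
  if ! git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "ERROR: step $step requires a git repo at $dir; either \`git init\` or rerun from a checkout" >&2
    return 1
  fi
}

# Usage: an empty scratch directory fails the check.
scratch=$(mktemp -d)
require_git_repo demo-step "$scratch" || echo "not a git repo: $scratch"
```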

Agent step killed by step timeout¶

Symptom: An agent step (architecture design, large refactoring, complex analysis) appears to fail mid-thought with a timeout error.

Cause: A user-supplied --step-timeout override (or AMPLIHACK_STEP_TIMEOUT in the environment) is forcing a per-step ceiling on agent steps. The bundled recipes under amplifier-bundle/recipes/ no longer set per-step timeouts on agent steps — agent reasoning is highly variable and aborting mid-thought corrupts orchestrator state (issue #439). If you are seeing a timeout on an agent step, something outside the recipe is imposing it.

Fix: Remove or relax the override.

# Drop the --step-timeout flag entirely (recommended for agent-heavy work)
amplihack recipe run recipe.yaml \
  -c task_description="Complex task"

# Or explicitly disable timeouts for the run
amplihack recipe run recipe.yaml \
  -c task_description="Very long task" \
  --step-timeout 0

# Or raise the ceiling to a generous value (e.g., as a CI guard rail)
amplihack recipe run recipe.yaml \
  -c task_description="CI run" \
  --step-timeout 1800

If AMPLIHACK_STEP_TIMEOUT is set in your shell or CI environment, unset it (unset AMPLIHACK_STEP_TIMEOUT) or override it on the command line with --step-timeout 0.
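
Because the environment variable is easy to overlook in CI configuration, a quick diagnostic can confirm whether it is present. This is a small sketch; describe_step_timeout_env is a hypothetical helper and only inspects the current shell's environment.

```shell
# Sketch: report whether AMPLIHACK_STEP_TIMEOUT is imposing a ceiling.
# Hypothetical helper; it only reads the current shell's variables.
describe_step_timeout_env() {
  if [ -n "${AMPLIHACK_STEP_TIMEOUT+set}" ]; then
    echo "AMPLIHACK_STEP_TIMEOUT is set to '${AMPLIHACK_STEP_TIMEOUT}'; unset it or pass --step-timeout 0"
  else
    echo "AMPLIHACK_STEP_TIMEOUT is not set"
  fi
}

# Usage
describe_step_timeout_env
```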

Note: A handful of bash steps in bundled recipes still carry timeout_seconds: 1800. Those are the network-I/O steps (gh api, git fetch, curl) where a stuck socket could hang indefinitely. The 1800-second limit is an availability guardrail, not a work bound. If one of those fires, the underlying network call has hung, and the right fix is to investigate the network failure, not to extend the timeout.

Note: --step-timeout overrides per-step timeout_seconds only. It does not affect recipe-level default_step_timeout (e.g., quality-loop.yaml) or the max_runtime budget in multitask workstreams.

Update completes but assets are stale¶

Symptom: After running amplihack update, the binary version is correct but skills, hooks, recipes, or bundled assets show old behavior. Running amplihack install manually fixes it.

Cause: The update command replaced the binary but did not re-stage framework assets from the new binary's embedded amplifier-bundle/.

Fix: The update flow now calls ensure_framework_installed() immediately after the binary swap. If asset re-staging fails, the update still succeeds (the binary is already replaced) but prints a warning:

⚠️  Binary updated but framework asset refresh failed: <error>
   Run `amplihack install` to refresh assets manually.

Verify the fix:

# Update to latest
amplihack update

# Check that assets match the new version
amplihack --version
ls ~/.amplihack/.claude/
# Asset files should have recent timestamps matching the update time

Since issue #734, the install/update path also validates smart-orchestrator compatibility before accepting local bundle candidates and after staging the destination. Stale AMPLIHACK_HOME bundles are skipped or repaired instead of being silently reused.

smart-orchestrator fails after update with stale Python behavior¶

Symptom: amplihack recipe run smart-orchestrator fails during parse-decomposition after update. The installed ~/.amplihack/amplifier-bundle/recipes/smart-orchestrator.yaml may be a large, old monolithic recipe that mentions Python/importlib, orch_helper.py, or an old resolve-bundle-asset helper-path lookup pointing at the orchestration helper.

Cause: The binary is current, but AMPLIHACK_HOME still contains stale framework bundle assets from an older release. The current smart-orchestrator is a composable parent recipe that delegates to four companion recipes:

smart-classify-route
smart-execute-routing
smart-reflect-loop
smart-validate-summarize

Fix: Run install or update with a version that includes the framework bundle compatibility validator:

amplihack update
# or
amplihack install

The installer validates candidate bundles before selecting them. A stale AMPLIHACK_HOME bundle is skipped, and a successful install validates that the staged destination contains the composable smart-orchestrator and all required companion recipes.

Verify:

grep -E 'recipe: "(smart-classify-route|smart-execute-routing|smart-reflect-loop|smart-validate-summarize)"' \
  ~/.amplihack/amplifier-bundle/recipes/smart-orchestrator.yaml

All four recipes should be present.
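
The grep above prints whichever lines match, so a missing companion is easy to overlook. A per-recipe check makes the gap explicit. This is a sketch: check_smart_orchestrator is a hypothetical helper, and it assumes the same recipe: "<name>" line format that the grep pattern above relies on.

```shell
# Sketch: report each companion recipe that the installed file does not reference.
# Hypothetical helper; assumes the recipe: "<name>" format used by the grep above.
check_smart_orchestrator() {
  local file="$1" missing=0 name
  for name in smart-classify-route smart-execute-routing \
              smart-reflect-loop smart-validate-summarize; do
    if ! grep -qF "recipe: \"$name\"" "$file"; then
      echo "missing companion recipe: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (path taken from the verify step above):
#   check_smart_orchestrator ~/.amplihack/amplifier-bundle/recipes/smart-orchestrator.yaml
```

A non-zero exit status means at least one companion is not referenced, which points to a stale bundle.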

Do not repair this by restoring orch_helper.py. Do not remap helper-path to orch_helper.py. helper-path intentionally resolves to amplifier-bundle/bin/multitask-orchestrator.sh.

See Repair a stale framework bundle.

Install fails after switching to the Rust repository¶

Symptom: Installation via amplihack install fails because the downloaded archive or cloned repository has a different directory structure than expected. The installer cannot find a .claude/ directory at the repository root.

Cause: Framework archives may use either .claude/ or amplifier-bundle/ as their root marker, but the root-detection function previously looked only for .claude/ and so rejected archives that ship amplifier-bundle/ alone.

Fix: find_framework_repo_root() now accepts either .claude/ or amplifier-bundle/ as a valid repository root marker. This handles:

Archives that contain .claude/
Archives that contain amplifier-bundle/
Mixed layouts where both directories exist
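
The acceptance rule can be expressed as a one-line shell test, which is handy for checking a checkout before passing it to --local. This is a sketch of the rule as described, not the find_framework_repo_root() implementation; is_framework_root is a hypothetical name.

```shell
# Sketch: a directory counts as a framework root if it contains either marker.
# Hypothetical helper mirroring the rule described above.
is_framework_root() {
  [ -d "$1/.claude" ] || [ -d "$1/amplifier-bundle" ]
}

# Usage: a layout with only amplifier-bundle/ is accepted.
demo=$(mktemp -d)
mkdir "$demo/amplifier-bundle"
is_framework_root "$demo" && echo "valid framework root: $demo"
```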

Verify:

# Clone the Rust repo and confirm install works
git clone --depth 1 https://github.com/rysweet/amplihack-rs /tmp/test-install
amplihack install --local /tmp/test-install
# Should succeed regardless of whether .claude/ or amplifier-bundle/ is present

Checksum verification fails intermittently¶

Symptom: amplihack install fails with a checksum verification error during binary download, but succeeds on retry.

Cause: The SHA-256 checksum file download used a single HTTP GET without retry logic. Transient network errors caused the checksum fetch to fail even when the main binary download succeeded.

Fix: Checksum verification now uses http_get_with_retry(), which retries with exponential backoff (up to 3 attempts). The retry logic applies only to the checksum fetch, not the binary download itself (which already had retries).
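
The retry pattern itself is simple enough to sketch in shell. The block below is an illustration of retry with exponential backoff, not the http_get_with_retry() code: retry_with_backoff is a hypothetical helper, and the initial delay and doubling factor are assumptions, since this guide states only the attempt limit.

```shell
# Sketch: retry a command up to max_attempts times, doubling the delay each time.
# Hypothetical helper; the delay values are assumptions, not the installer's.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (URL is a placeholder): up to 3 attempts, waiting 1s and then 2s.
#   retry_with_backoff 3 1 curl -fsSL -o checksums.txt "$CHECKSUM_URL"
```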

If it still fails after retries:

# Check network connectivity to GitHub releases
curl -I https://github.com/rysweet/amplihack-rs/releases/latest

# Manual install with local checkout bypasses all downloads
git clone https://github.com/rysweet/amplihack-rs /tmp/amplihack-local
amplihack install --local /tmp/amplihack-local

Step-08c fails with WORKTREE_SETUP_WORKTREE_PATH not set¶

Symptom: During an orchestrator-driven workflow (fixing an issue, implementing a feature), the recipe fails at step-08c with:

WORKTREE_SETUP_WORKTREE_PATH: step-08c requires worktree_setup.worktree_path
from step-04 (workflow-worktree); ensure parent recipe ran worktree-setup and
propagated outputs

This blocks the entire issue-fixing loop. Any task routed through smart-orchestrator → default-workflow → workflow-tdd fails at the no-op guard.

Cause: A sub-recipe in the default-workflow chain has context: {} instead of declaring worktree_setup in its context block. The recipe runner only forwards context variables that the child recipe explicitly declares. Without the declaration, worktree_setup is silently dropped at the recipe boundary, and step-08c cannot read WORKTREE_SETUP_WORKTREE_PATH.

Fix: Every post-worktree sub-recipe (workflow-tdd, workflow-refactor-review, workflow-precommit-test, workflow-publish, workflow-pr-review, workflow-finalize) declares worktree_setup: "" and allow_no_op: false in its context: block. This ensures the recipe runner threads the value from default-workflow through to the step that needs it.

Verify the fix:

# Run the propagation regression tests
python -m pytest amplifier-bundle/tools/test_default_workflow_fixes.py::TestWorktreeSetupPropagation479 -v

# All four tests should pass:
#   test_post_worktree_sub_recipes_declare_worktree_setup
#   test_post_worktree_sub_recipes_declare_allow_no_op
#   test_default_workflow_declares_worktree_setup
#   test_smart_execute_routing_forwards_worktree_setup

If you are writing a new sub-recipe that runs after workflow-worktree (step 04), add these keys to your recipe's context: block:

context:
  worktree_setup: ""
  allow_no_op: false

See worktree_setup Context Propagation for the complete propagation chain and design rationale.
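
For a quick check without running the test suite, a grep-based scan can flag sub-recipes that are missing either key. This is a rough sketch with clear limits: check_worktree_context is a hypothetical helper, it assumes each sub-recipe lives at <dir>/<name>.yaml, and it matches indented key names textually rather than parsing YAML.

```shell
# Sketch: flag post-worktree sub-recipes that do not declare both context keys.
# Hypothetical helper; assumes <dir>/<name>.yaml and does a textual match only.
check_worktree_context() {
  local recipes_dir="$1" status=0 name key file
  for name in workflow-tdd workflow-refactor-review workflow-precommit-test \
              workflow-publish workflow-pr-review workflow-finalize; do
    file="$recipes_dir/$name.yaml"
    for key in worktree_setup allow_no_op; do
      if ! grep -Eq "^[[:space:]]+$key:" "$file" 2>/dev/null; then
        echo "$name: missing context key $key" >&2
        status=1
      fi
    done
  done
  return "$status"
}

# Usage (directory is an assumption based on the bundled recipe location):
#   check_worktree_context amplifier-bundle/recipes
```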

Step-08c rejects a valid verdict synonym¶

Symptom: The recipe fails at step-08c-enforce-verdict with exit code 1, even though the work-verifier agent approved the work. The error log shows a verdict string like VERIFIED, SUCCESS, or APPROVED instead of the canonical WORK_VERIFIED.

Cause: The work-verifier agent is an LLM. It sometimes produces natural synonyms of the three canonical verdict strings (WORK_VERIFIED, HOLLOW_SUCCESS, INSUFFICIENT_EVIDENCE). Before issue #624, the enforce-verdict bash gate used a strict case statement with a *) exit 1 default — any non-canonical string hard-failed the recipe after the PR had already been opened.

Fix: The enforce-verdict step now normalizes common synonyms before the case gate:

LLM synonym	Normalized to
`VERIFIED`, `SUCCESS`, `APPROVED`, `PASS`, `PASSED`	`WORK_VERIFIED`
`FAILED`, `NO_WORK`, `EMPTY`, `NO_ARTIFACTS`	`HOLLOW_SUCCESS`
`INCONCLUSIVE`, `UNKNOWN`, `UNCLEAR`, `PARTIAL`	`INSUFFICIENT_EVIDENCE`

If the verdict string is still unrecognized after synonym mapping, the gate fail-safes to INSUFFICIENT_EVIDENCE (exit 0 with a loud warning) instead of exit 1. An LLM producing a novel verdict string never hard-fails a recipe that already has real artifacts.
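
The mapping and the fail-safe default can be sketched as a single case statement. This is an illustration built from the table above, not the recipe's actual gate: normalize_verdict is a hypothetical name, and the match is exact and case-sensitive because the guide does not say whether the real step folds case.

```shell
# Sketch: map verdict synonyms to canonical strings, defaulting to
# INSUFFICIENT_EVIDENCE with a warning. Hypothetical helper; exact-case match.
normalize_verdict() {
  case "$1" in
    WORK_VERIFIED|VERIFIED|SUCCESS|APPROVED|PASS|PASSED)
      echo WORK_VERIFIED ;;
    HOLLOW_SUCCESS|FAILED|NO_WORK|EMPTY|NO_ARTIFACTS)
      echo HOLLOW_SUCCESS ;;
    INSUFFICIENT_EVIDENCE|INCONCLUSIVE|UNKNOWN|UNCLEAR|PARTIAL)
      echo INSUFFICIENT_EVIDENCE ;;
    *)
      echo "WARNING: unrecognized verdict '$1'; treating as INSUFFICIENT_EVIDENCE" >&2
      echo INSUFFICIENT_EVIDENCE ;;
  esac
}

# Usage
normalize_verdict APPROVED
normalize_verdict LOOKS_GOOD_TO_ME
```

The first call prints WORK_VERIFIED; the second prints INSUFFICIENT_EVIDENCE and writes the warning to stderr, and the function still returns success in both cases.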

Verify:

# Run the verdict enforcement integration tests
cargo test enforce_verdict -- --nocapture
# The "unknown verdict" test should pass with exit 0 (not exit 1)

Step-08c reports hollow success on a clean worktree after commit¶

Symptom: After the implement step successfully commits, pushes, and opens a PR, step-08c-work-verifier reports HOLLOW_SUCCESS because the working tree is clean (no uncommitted changes).

Cause: A clean working tree after a successful commit/push is correct behavior — it means the work was committed to the branch. The legacy bash hollow-success guard counted uncommitted working-tree changes and treated a clean tree as "no work done," which is incorrect for the commit-then-verify flow.

Fix: The work-verifier agent (issue #596) now treats git log and PR state as primary evidence and working-tree state as secondary:

Primary: git log --oneline origin/main..HEAD — commits on the branch since divergence
Primary: gh pr list — PRs matching the task on this branch
Primary: gh issue view — linked issue closure by merged PR
Secondary: git status --porcelain — uncommitted edits (additional evidence, not required)

A clean worktree with commits on the branch and/or an open PR produces WORK_VERIFIED. The agent only reports HOLLOW_SUCCESS when all evidence sources show no work — no commits, no PR, no file changes, no linked issue resolution.

Verify:

# After a successful implement step that committed and pushed:
git log --oneline origin/main..HEAD   # Should show commits
gh pr list --head "$(git branch --show-current)"  # Should show PR
# The verifier should produce WORK_VERIFIED, not HOLLOW_SUCCESS
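
The evidence ordering can be summarized as a small decision function. This is only a sketch of the priorities described above: the real verifier is an LLM agent, classify_evidence is a hypothetical name, and the middle case (no commits or PR, but some other evidence) is an assumption because this guide does not state what the agent reports there.

```shell
# Sketch: primary evidence (commits, PRs) decides first; a clean worktree
# alone never causes HOLLOW_SUCCESS. Hypothetical helper, not the agent's logic.
# Arguments: commit count, PR count, issue resolved (0 or 1), uncommitted file count.
classify_evidence() {
  local commits="$1" prs="$2" issue_resolved="$3" dirty_files="$4"
  if [ "$commits" -gt 0 ] || [ "$prs" -gt 0 ]; then
    echo WORK_VERIFIED
  elif [ "$issue_resolved" -eq 0 ] && [ "$dirty_files" -eq 0 ]; then
    # Every evidence source shows no work.
    echo HOLLOW_SUCCESS
  else
    # Assumption: the guide does not specify the verdict for this case.
    echo INSUFFICIENT_EVIDENCE
  fi
}

# Usage: two commits and one PR on a clean worktree.
classify_evidence 2 1 0 0
```

In practice the counts would come from the commands listed above, for example the number of lines printed by git log --oneline origin/main..HEAD.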

Related¶

Run a Recipe End-to-End — Normal recipe execution workflow
Run amplihack in Non-interactive Mode — CI and headless environments
Install amplihack for the First Time — Bootstrap from scratch
Environment Variables — All variables read or injected by amplihack
Recipe Execution Flow — Architecture of the step execution pipeline
worktree_setup Context Propagation — Full reference for the worktree_setup propagation chain