How to Troubleshoot Recipe Execution Failures
Use this guide when a recipe step fails unexpectedly — a shell step hangs, an agent step produces no file changes, or a prerequisite tool is missing.
Contents
- Shell step hangs waiting for input
- Agent step completes but changes nothing
- Shell step fails with "python3 not found"
- Workflow classification routes to the wrong type
- Workflow step requires a Git repository
- Agent step killed by step timeout
- Update completes but assets are stale
- smart-orchestrator fails after update with stale Python behavior
- Install fails after switching to the Rust repository
- Checksum verification fails intermittently
- Step-08c fails with WORKTREE_SETUP_WORKTREE_PATH not set
- Step-08c rejects a valid verdict synonym
- Step-08c reports hollow success on a clean worktree after commit
Shell step hangs waiting for input
Symptom: A recipe run invocation stalls indefinitely on a shell step. No output appears. Killing the process shows the step was waiting for TTY input from a tool like npm, apt, or a git credential helper.
Cause: The child shell process inherited an environment that signaled interactive mode. Tools like apt and npm prompt for confirmation unless told otherwise.
Fix: The recipe executor now injects five non-interactive environment variables into every shell step automatically:
| Variable | Value | Purpose |
|---|---|---|
| HOME | inherited or /root | Prevents ~-expansion failures |
| PATH | inherited or /usr/local/bin:/usr/bin:/bin | Ensures basic tool lookup |
| NONINTERACTIVE | 1 | Generic non-interactive signal |
| DEBIAN_FRONTEND | noninteractive | Suppresses dpkg/apt prompts |
| CI | true | Suppresses interactive prompts in npm, yarn, pip |
No recipe YAML changes are required. All shell steps receive these variables.
Verify the fix is active:
# Create a one-step recipe that prints CI env vars
cat > /tmp/env-check.yaml << 'EOF'
name: env-check
steps:
  - id: check
    type: shell
    command: "env | grep -E '^(CI|NONINTERACTIVE|DEBIAN_FRONTEND)='"
EOF
amplihack recipe run /tmp/env-check.yaml
# Expected output includes:
# CI=true
# NONINTERACTIVE=1
# DEBIAN_FRONTEND=noninteractive
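To reproduce a hang outside the recipe runner, it can help to run the suspect command by hand with the same five variables. The sketch below is only an approximation of what the executor does: `run_noninteractive` is a hypothetical helper name, and the fallback values are copied from the table above.

```shell
# Hypothetical helper (not part of amplihack): run a command with the
# five variables listed in the table above.
run_noninteractive() {
  env \
    HOME="${HOME:-/root}" \
    PATH="${PATH:-/usr/local/bin:/usr/bin:/bin}" \
    NONINTERACTIVE=1 \
    DEBIAN_FRONTEND=noninteractive \
    CI=true \
    "$@"
}

# Show what a child shell sees when launched through the helper.
run_noninteractive sh -c 'echo "CI=$CI NONINTERACTIVE=$NONINTERACTIVE DEBIAN_FRONTEND=$DEBIAN_FRONTEND"'
# prints: CI=true NONINTERACTIVE=1 DEBIAN_FRONTEND=noninteractive
```

If the command still prompts under this environment, the tool probably needs its own flag (for example `-y` for apt-get) rather than an environment variable.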
Agent step completes but changes nothing
Symptom: An agent step runs, appears to succeed, but produces no file modifications. The agent's output may reference files by relative path without knowing where the working directory is.
Cause: The agent backend was not receiving the recipe's working directory or non-interactive flag in its context map. The agent would default to its own notion of "current directory," which may differ from the recipe's working_dir.
Fix: The executor now augments the context passed to every agent step with two entries:
| Context key | Value | Purpose |
|---|---|---|
| working_directory | The recipe's configured working_dir | Tells the agent where to read/write files |
| NONINTERACTIVE | 1 | Prevents the agent from attempting interactive prompts |
These entries are injected only when the recipe step's own context does not already define them, so explicit overrides in recipe YAML still work.
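The "inject only when absent" rule behaves like the shell's `${name=default}` expansion, which assigns a default only if the variable is unset. The snippet below is an analogy under that assumption, not the executor's code; `recipe_working_dir` and its value are hypothetical.

```shell
# Hypothetical value standing in for the recipe's configured working_dir.
recipe_working_dir="/srv/project"

# Simulate a step whose own context already defines NONINTERACTIVE
# but does not define working_directory.
unset working_directory
NONINTERACTIVE=0

# ${name=default} assigns only when the variable is unset,
# so the explicit override above survives.
: "${working_directory=$recipe_working_dir}"
: "${NONINTERACTIVE=1}"

echo "working_directory=$working_directory NONINTERACTIVE=$NONINTERACTIVE"
# prints: working_directory=/srv/project NONINTERACTIVE=0
```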
Verify in a recipe:
name: agent-test
steps:
  - id: write-file
    type: agent
    agent: claude
    prompt: "Create a file called hello.txt containing 'Hello from agent step'"
After running, hello.txt should appear in the recipe's working directory, not in some other location.
Shell step fails with "python3 not found"
Symptom: A recipe step that references python3 or python in its command fails immediately with:
Error: Shell step 'step-id' requires python3 but it is not installed or not on PATH.
Recipe steps should use deterministic Rust tools instead of Python sidecars.
Cause: The executor performs a pre-flight check for Python availability before executing any shell step whose command mentions python3 or python. This prevents long-running recipes from wasting hours before failing at a late step that needs Python.
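A pre-flight check of this kind can be approximated in plain shell. This is a sketch under assumptions: `needs_python` and `preflight` are hypothetical names, and the whole-word matching rule is a guess at how "mentions python3 or python" might be interpreted.

```shell
# Hypothetical: true when the command mentions python or python3 as a
# separate word (so "mypython" does not count, "python3.11" does).
needs_python() {
  case " $1 " in
    *[!A-Za-z0-9_]python[!A-Za-z0-9_]*|*[!A-Za-z0-9_]python3[!A-Za-z0-9_]*) return 0 ;;
  esac
  return 1
}

# Hypothetical: fail fast, before the step runs, if Python is needed
# but python3 cannot be found on PATH.
preflight() {
  step_id=$1
  cmd=$2
  if needs_python "$cmd" && ! command -v python3 >/dev/null 2>&1; then
    echo "Error: Shell step '$step_id' requires python3 but it is not installed or not on PATH." >&2
    return 1
  fi
  return 0
}

preflight build "cargo build --release" && echo "build: ok"
```

Running the check before any step starts is what keeps a long recipe from failing hours in at a late Python-dependent step.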
Resolution options:
- Install Python 3 on the machine and ensure it is on PATH:
# Ubuntu/Debian
sudo apt-get install -y python3
# macOS
brew install python@3
# Verify
python3 --version
- Rewrite the step to avoid the Python dependency. The error message recommends using Rust-native tools. For example, replace a Python JSON-processing script with jq or a purpose-built Rust binary.
- Use a Docker image that includes Python if the recipe must run Python.
Workflow classification routes to the wrong type
Symptom: A development task like "Add an agentic disk-cleanup loop. Extend src/cmd_cleanup.rs" gets classified as Ops instead of Default, causing the wrong workflow to execute.
Cause: Single-word OPS keywords (like cleanup, manage) matched as substrings in code paths (cmd_cleanup.rs) and task descriptions. This triggered false-positive OPS classification.
Fix: OPS keywords are now multi-word phrases that require specific operational context:
| Old keyword (removed) | New phrase (required) |
|---|---|
| cleanup | disk cleanup, clean up temp |
| manage | manage repos, repo management |
| delete | delete files |
| organize | organize files |
A task description must contain the full phrase to match OPS. Single words like cleanup appearing inside code references no longer trigger misclassification.
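The difference between single-word and full-phrase matching can be seen in a few lines of shell. This is an illustrative sketch, not the classifier's implementation: `classify` here is a hypothetical function, the phrase list is copied from the table above, and lowercasing the input is an assumption.

```shell
# Hypothetical phrase matcher: a task is Ops only if it contains one of
# the multi-word phrases from the table; otherwise it is Default.
classify() {
  task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for phrase in "disk cleanup" "clean up temp" "manage repos" \
                "repo management" "delete files" "organize files"; do
    case "$task" in
      *"$phrase"*) echo "Ops"; return 0 ;;
    esac
  done
  echo "Default"
}

# "disk-cleanup" and "cmd_cleanup.rs" contain the word but not the phrase.
classify "Add an agentic disk-cleanup loop. Extend src/cmd_cleanup.rs with a new function."
# prints: Default

classify "disk cleanup on staging servers"
# prints: Ops
```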
Verify classification:
# This should classify as Default (development), not Ops
amplihack classify "Add an agentic disk-cleanup loop. Extend src/cmd_cleanup.rs with a new function."
# Expected: workflow=Default
# This should classify as Ops
amplihack classify "disk cleanup on staging servers"
# Expected: workflow=Ops
Workflow step requires a Git repository
Symptom: amplihack recipe run starts successfully from a scratch or
temporary directory, then a development, publish, worktree, TDD, or PR step
fails before executing its Git command:
ERROR: step <workflow>/<step> requires a git repo at /tmp/demo; either `git init` or rerun from a checkout
Cause: Recipe routing and investigation flows are Git-optional, but the
selected step needs repository state to create a branch, compute tracked-file
changes, create a worktree, commit, push, or open a pull request. The recipe
checks this precondition before running git so failures are actionable instead
of surfacing as a raw Git exit 128.
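The same precondition can be checked by hand before launching a Git-dependent workflow. The sketch below assumes nothing about the recipe's internals: `require_git_repo` is a hypothetical helper built on `git rev-parse --is-inside-work-tree`, and it reuses the error wording shown above.

```shell
# Hypothetical guard: succeed only if the directory is inside a Git work tree.
require_git_repo() {
  dir=$1
  step=$2
  if ! git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "ERROR: step $step requires a git repo at $dir; either \`git init\` or rerun from a checkout" >&2
    return 1
  fi
  return 0
}

# Example: guard a Git-dependent action in the current directory.
if require_git_repo "$PWD" "example/step" 2>/dev/null; then
  echo "git repo detected at $PWD"
else
  echo "no git repo at $PWD"
fi
```

Checking up front gives one actionable message instead of a raw exit 128 from whichever Git command happens to run first.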
Fix: Run the Git-dependent workflow from a checkout:
cd /home/user/src/myproject
amplihack recipe run ~/.amplihack/.claude/recipes/default-workflow.yaml \
-c task_description="Fix the pagination bug" \
-c repo_path=.
Or initialize the directory when a new repository is appropriate:
git init
amplihack recipe run ~/.amplihack/.claude/recipes/default-workflow.yaml \
-c task_description="Create the initial project structure" \
-c repo_path=.
If your task is analysis-only, use an investigation or Q&A recipe instead; those workflows do not require Git, and history-specific analysis is skipped or marked unavailable outside a repository:
amplihack recipe run ~/.amplihack/.claude/recipes/investigation-workflow.yaml \
-c task_description="Explain the error message" \
-c repo_path=.
Note: Optional Git telemetry is different from required Git work. When a
telemetry-only check runs outside a repository, it prints a visible skip message
such as [skip] not a git repo at /tmp/demo; skipping operations file-change git
telemetry and continues.
Agent step killed by step timeout
Symptom: An agent step (architecture design, large refactoring, complex analysis) appears to fail mid-thought with a timeout error.
Cause: A user-supplied --step-timeout override (or AMPLIHACK_STEP_TIMEOUT in the environment) is forcing a per-step ceiling on agent steps. The bundled recipes under amplifier-bundle/recipes/ no longer set per-step timeouts on agent steps — agent reasoning is highly variable and aborting mid-thought corrupts orchestrator state (issue #439). If you are seeing a timeout on an agent step, something outside the recipe is imposing it.
Fix: Remove or relax the override.
# Drop the --step-timeout flag entirely (recommended for agent-heavy work)
amplihack recipe run recipe.yaml \
-c task_description="Complex task"
# Or explicitly disable timeouts for the run
amplihack recipe run recipe.yaml \
-c task_description="Very long task" \
--step-timeout 0
# Or set a generous ceiling (e.g., as a CI guard rail)
amplihack recipe run recipe.yaml \
-c task_description="CI run" \
--step-timeout 1800
If AMPLIHACK_STEP_TIMEOUT is set in your shell or CI environment, unset it (unset AMPLIHACK_STEP_TIMEOUT) or override it on the command line with --step-timeout 0.
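When it is unclear which setting is in effect, the precedence described above (command-line flag over environment variable, with 0 meaning no limit) can be written out as a small diagnostic. This is a sketch of that description only; `effective_step_timeout` is a hypothetical helper, not an amplihack command.

```shell
# Hypothetical diagnostic. $1 is the value you pass to --step-timeout,
# or an empty string if you omit the flag.
effective_step_timeout() {
  flag=$1
  # The flag wins when given; otherwise fall back to the environment.
  value=${flag:-${AMPLIHACK_STEP_TIMEOUT:-}}
  case "$value" in
    ''|0) echo "none" ;;
    *)    echo "${value}s" ;;
  esac
}

AMPLIHACK_STEP_TIMEOUT=600
effective_step_timeout ""     # prints: 600s  (environment applies)
effective_step_timeout 0      # prints: none  (flag disables the limit)
unset AMPLIHACK_STEP_TIMEOUT
effective_step_timeout ""     # prints: none  (nothing imposes a limit)
```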
Note: A handful of bash steps in bundled recipes still carry timeout_seconds: 1800. Those are the network-I/O steps (gh api, git fetch, curl) where a stuck socket could hang indefinitely. The 1800-second limit is an availability guardrail, not a work bound; if one of those fires, the underlying network call has hung, and the right fix is to investigate the network failure, not to extend the timeout.
Note: --step-timeout overrides per-step timeout_seconds only. It does not affect recipe-level default_step_timeout (e.g., quality-loop.yaml) or the max_runtime budget in multitask workstreams.
Update completes but assets are stale
Symptom: After running amplihack update, the binary version is correct but skills, hooks, recipes, or bundled assets show old behavior. Running amplihack install manually fixes it.
Cause: The update command replaced the binary but did not re-stage framework assets from the new binary's embedded amplifier-bundle/.
Fix: The update flow now calls ensure_framework_installed() immediately after the binary swap. If asset re-staging fails, the update still succeeds (the binary is already replaced) but prints a warning:
⚠️ Binary updated but framework asset refresh failed: <error>
Run `amplihack install` to refresh assets manually.
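The control flow described here (the swap is fatal on failure, the asset refresh only warns) can be sketched as follows. Everything in the block is a stand-in: `swap_binary`, `refresh_assets`, and `update_flow` are hypothetical shell functions used to illustrate the ordering, and the simulated failure message is invented for the example.

```shell
# Hypothetical stand-ins for the real update internals.
swap_binary() { echo "binary replaced"; }
refresh_assets() { echo "bundle directory not writable" >&2; return 1; }  # simulated failure

update_flow() {
  # A failed binary swap aborts the update.
  swap_binary || return 1

  # A failed asset refresh is reported but does not fail the update,
  # because the binary has already been replaced.
  if ! err=$(refresh_assets 2>&1); then
    printf '%s\n' "⚠️ Binary updated but framework asset refresh failed: $err" >&2
    printf '%s\n' 'Run `amplihack install` to refresh assets manually.' >&2
  fi
  return 0
}

update_flow
echo "exit status: $?"
# prints (stdout): binary replaced, then exit status: 0
```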
Verify the fix:
# Update to latest
amplihack update
# Check that assets match the new version
amplihack --version
ls ~/.amplihack/.claude/
# Asset files should have recent timestamps matching the update time
Since issue #734, the install/update path also validates smart-orchestrator
compatibility before accepting local bundle candidates and after staging the
destination. Stale AMPLIHACK_HOME bundles are skipped or repaired instead of
being silently reused.
smart-orchestrator fails after update with stale Python behavior
Symptom: amplihack recipe run smart-orchestrator fails during
parse-decomposition after update. The installed
~/.amplihack/amplifier-bundle/recipes/smart-orchestrator.yaml may be a large
old monolithic recipe and may mention Python/importlib, orch_helper.py, or an
old resolve-bundle-asset helper-path orchestration-helper path.
Cause: The binary is current, but AMPLIHACK_HOME still contains stale
framework bundle assets from an older release. The current smart-orchestrator is
a composable parent recipe that delegates to four companion recipes:
- smart-classify-route
- smart-execute-routing
- smart-reflect-loop
- smart-validate-summarize
Fix: Run amplihack install or amplihack update with a version that includes the framework bundle compatibility validator.
The installer validates candidate bundles before selecting them. A stale
AMPLIHACK_HOME bundle is skipped, and a successful install validates that the
staged destination contains the composable smart-orchestrator and all required
companion recipes.
Verify:
grep -E 'recipe: "(smart-classify-route|smart-execute-routing|smart-reflect-loop|smart-validate-summarize)"' \
~/.amplihack/amplifier-bundle/recipes/smart-orchestrator.yaml
All four recipes should be present.
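The single grep above prints whatever matches but does not say which recipe is absent. A per-recipe loop reports the missing ones by name. This is a convenience sketch: `check_companions` is a hypothetical helper, and it assumes the `recipe: "<name>"` form used in the grep pattern above.

```shell
# Hypothetical helper: report each companion recipe that the staged
# smart-orchestrator.yaml does not reference.
check_companions() {
  file=$1
  missing=0
  for name in smart-classify-route smart-execute-routing \
              smart-reflect-loop smart-validate-summarize; do
    if ! grep -Fq "recipe: \"$name\"" "$file"; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage against the installed bundle (path taken from the command above):
# check_companions ~/.amplihack/amplifier-bundle/recipes/smart-orchestrator.yaml
```

A non-zero exit status with one or more "missing:" lines points to a stale bundle.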
Do not repair this by restoring orch_helper.py. Do not remap helper-path to
orch_helper.py. helper-path intentionally resolves to
amplifier-bundle/bin/multitask-orchestrator.sh.
See Repair a stale framework bundle.
Install fails after switching to the Rust repository
Symptom: Installation via amplihack install fails because the downloaded archive or cloned repository has a different directory structure than expected. The installer cannot find a .claude/ directory at the repository root.
Cause: Framework archives may use either .claude/ or amplifier-bundle/ as their root marker. The root-detection function needs to accept both layouts.
Fix: find_framework_repo_root() now accepts either .claude/ or amplifier-bundle/ as a valid repository root marker. This handles:
- Archives that contain .claude/
- Archives that contain amplifier-bundle/
- Mixed layouts where both directories exist
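The acceptance rule for the three layouts above amounts to "either marker directory is enough". A shell equivalent, useful for checking a local checkout before passing it to the installer, might look like this. `is_framework_root` is a hypothetical helper and is not the Rust function itself.

```shell
# Hypothetical check mirroring the rule above: a directory counts as a
# framework root if it has .claude/ or amplifier-bundle/ (or both).
is_framework_root() {
  [ -d "$1/.claude" ] || [ -d "$1/amplifier-bundle" ]
}

# Usage:
# is_framework_root /tmp/test-install && echo "layout looks installable"
```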
Verify:
# Clone the Rust repo and confirm install works
git clone --depth 1 https://github.com/rysweet/amplihack-rs /tmp/test-install
amplihack install --local /tmp/test-install
# Should succeed regardless of whether .claude/ or amplifier-bundle/ is present
Checksum verification fails intermittently
Symptom: amplihack install fails with a checksum verification error during binary download, but succeeds on retry.
Cause: The SHA-256 checksum file download used a single HTTP GET without retry logic. Transient network errors caused the checksum fetch to fail even when the main binary download succeeded.
Fix: Checksum verification now uses http_get_with_retry(), which retries with exponential backoff (up to 3 attempts). The retry logic applies only to the checksum fetch, not the binary download itself (which already had retries).
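For scripts that wrap their own downloads, the same retry-with-exponential-backoff pattern can be written in shell. This is a generic sketch of the technique, not a port of http_get_with_retry(): `retry_with_backoff` and the `RETRY_BASE_DELAY` variable are hypothetical, and the one-second starting delay is an assumption.

```shell
# Hypothetical helper: run a command up to $1 times, doubling the delay
# between attempts. RETRY_BASE_DELAY (seconds, default 1) is an
# illustrative knob, not an amplihack setting.
retry_with_backoff() {
  max=$1
  shift
  delay=${RETRY_BASE_DELAY:-1}
  attempt=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is spent.
    [ "$attempt" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (three attempts, as in the fix described above):
# retry_with_backoff 3 curl -fsSL -o checksums.txt "$CHECKSUM_URL"
```

`CHECKSUM_URL` in the usage line is a placeholder; substitute the URL your script actually fetches.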
If it still fails after retries:
# Check network connectivity to GitHub releases
curl -I https://github.com/rysweet/amplihack-rs/releases/latest
# Manual install with local checkout bypasses all downloads
git clone https://github.com/rysweet/amplihack-rs /tmp/amplihack-local
amplihack install --local /tmp/amplihack-local
Step-08c fails with WORKTREE_SETUP_WORKTREE_PATH not set
Symptom: During an orchestrator-driven workflow (fixing an issue, implementing a feature), the recipe fails at step-08c with:
WORKTREE_SETUP_WORKTREE_PATH: step-08c requires worktree_setup.worktree_path
from step-04 (workflow-worktree); ensure parent recipe ran worktree-setup and
propagated outputs
This blocks the entire issue-fixing loop. Any task routed through smart-orchestrator → default-workflow → workflow-tdd fails at the no-op guard.
Cause: A sub-recipe in the default-workflow chain has context: {} instead of declaring worktree_setup in its context block. The recipe runner only forwards context variables that the child recipe explicitly declares. Without the declaration, worktree_setup is silently dropped at the recipe boundary, and step-08c cannot read WORKTREE_SETUP_WORKTREE_PATH.
Fix: Every post-worktree sub-recipe (workflow-tdd, workflow-refactor-review, workflow-precommit-test, workflow-publish, workflow-pr-review, workflow-finalize) declares worktree_setup: "" and allow_no_op: false in its context: block. This ensures the recipe runner threads the value from default-workflow through to the step that needs it.
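A quick way to spot a sub-recipe that is missing the declarations is a text-level lint over the recipe files. The sketch below is deliberately crude: `check_context_keys` is a hypothetical helper, it does not parse YAML, and it only checks that an indented `worktree_setup:` and `allow_no_op:` key appear somewhere in each file.

```shell
# Hypothetical lint: flag recipe files that never declare the two
# context keys as indented YAML keys.
check_context_keys() {
  status=0
  for f in "$@"; do
    for key in worktree_setup allow_no_op; do
      if ! grep -Eq "^[[:space:]]+$key:" "$f"; then
        echo "$f: missing context key $key"
        status=1
      fi
    done
  done
  return "$status"
}

# Usage (adjust the paths to wherever your sub-recipes live):
# check_context_keys path/to/workflow-tdd.yaml path/to/workflow-publish.yaml
```

A recipe with `context: {}` fails this check, which matches the failure mode described in the cause above.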
Verify the fix:
# Run the propagation regression tests
python -m pytest amplifier-bundle/tools/test_default_workflow_fixes.py::TestWorktreeSetupPropagation479 -v
# All four tests should pass:
# test_post_worktree_sub_recipes_declare_worktree_setup
# test_post_worktree_sub_recipes_declare_allow_no_op
# test_default_workflow_declares_worktree_setup
# test_smart_execute_routing_forwards_worktree_setup
If you are writing a new sub-recipe that runs after workflow-worktree (step 04), declare worktree_setup: "" and allow_no_op: false in your recipe's context: block.
See worktree_setup Context Propagation for the complete propagation chain and design rationale.
Step-08c rejects a valid verdict synonym
Symptom: The recipe fails at step-08c-enforce-verdict with exit code 1, even though the work-verifier agent approved the work. The error log shows a verdict string like VERIFIED, SUCCESS, or APPROVED instead of the canonical WORK_VERIFIED.
Cause: The work-verifier agent is an LLM. It sometimes produces natural synonyms of the three canonical verdict strings (WORK_VERIFIED, HOLLOW_SUCCESS, INSUFFICIENT_EVIDENCE). Before issue #624, the enforce-verdict bash gate used a strict case statement with a *) exit 1 default — any non-canonical string hard-failed the recipe after the PR had already been opened.
Fix: The enforce-verdict step now normalizes common synonyms before the case gate:
| LLM synonym | Normalized to |
|---|---|
| VERIFIED, SUCCESS, APPROVED, PASS, PASSED | WORK_VERIFIED |
| FAILED, NO_WORK, EMPTY, NO_ARTIFACTS | HOLLOW_SUCCESS |
| INCONCLUSIVE, UNKNOWN, UNCLEAR, PARTIAL | INSUFFICIENT_EVIDENCE |
If the verdict string is still unrecognized after synonym mapping, the gate fail-safes to INSUFFICIENT_EVIDENCE (exit 0 with a loud warning) instead of exit 1. An LLM producing a novel verdict string never hard-fails a recipe that already has real artifacts.
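The mapping and the fail-safe default fit naturally into a single case statement. The function below is a sketch of that idea built from the table above; `normalize_verdict` is a hypothetical name, and the actual gate in the recipe may be structured differently.

```shell
# Hypothetical normalizer: map synonyms to the three canonical verdicts
# and fall back to INSUFFICIENT_EVIDENCE for anything unrecognized.
normalize_verdict() {
  case "$1" in
    WORK_VERIFIED|VERIFIED|SUCCESS|APPROVED|PASS|PASSED)
      echo "WORK_VERIFIED" ;;
    HOLLOW_SUCCESS|FAILED|NO_WORK|EMPTY|NO_ARTIFACTS)
      echo "HOLLOW_SUCCESS" ;;
    INSUFFICIENT_EVIDENCE|INCONCLUSIVE|UNKNOWN|UNCLEAR|PARTIAL)
      echo "INSUFFICIENT_EVIDENCE" ;;
    *)
      # Fail safe: warn loudly on stderr, but do not hard-fail.
      echo "WARNING: unrecognized verdict '$1'; treating as INSUFFICIENT_EVIDENCE" >&2
      echo "INSUFFICIENT_EVIDENCE" ;;
  esac
}

normalize_verdict APPROVED            # prints: WORK_VERIFIED
normalize_verdict NO_ARTIFACTS        # prints: HOLLOW_SUCCESS
normalize_verdict LOOKS_GOOD_TO_ME    # warns on stderr, prints: INSUFFICIENT_EVIDENCE
```

Because the default branch still prints a canonical verdict and returns success, a novel string from the LLM degrades to a warning rather than an exit 1.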
Verify:
# Run the verdict enforcement integration tests
cargo test enforce_verdict -- --nocapture
# The "unknown verdict" test should pass with exit 0 (not exit 1)
Step-08c reports hollow success on a clean worktree after commit
Symptom: After the implement step successfully commits, pushes, and opens a PR, step-08c-work-verifier reports HOLLOW_SUCCESS because the working tree is clean (no uncommitted changes).
Cause: A clean working tree after a successful commit/push is correct behavior — it means the work was committed to the branch. The legacy bash hollow-success guard counted uncommitted working-tree changes and treated a clean tree as "no work done," which is incorrect for the commit-then-verify flow.
Fix: The work-verifier agent (issue #596) now treats git log and PR state as primary evidence and working-tree state as secondary:
- Primary: git log --oneline origin/main..HEAD (commits on the branch since divergence)
- Primary: gh pr list (PRs matching the task on this branch)
- Primary: gh issue view (linked issue closure by merged PR)
- Secondary: git status --porcelain (uncommitted edits; additional evidence, not required)
A clean worktree with commits on the branch and/or an open PR produces WORK_VERIFIED. The agent only reports HOLLOW_SUCCESS when all evidence sources show no work — no commits, no PR, no file changes, no linked issue resolution.
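The "all sources empty" condition can be stated as a small predicate. The verifier is an LLM agent, so this is only a model of the rule in the paragraph above: `is_hollow` and `count_branch_commits` are hypothetical helpers, and the four arguments are counts you would gather yourself.

```shell
# Hypothetical predicate: hollow only when every evidence source is empty.
# Arguments: commits on branch, matching PRs, resolved linked issues,
# uncommitted file changes.
is_hollow() {
  [ "$1" -eq 0 ] && [ "$2" -eq 0 ] && [ "$3" -eq 0 ] && [ "$4" -eq 0 ]
}

# Hypothetical counter for the first source, using the command listed above.
count_branch_commits() {
  git log --oneline origin/main..HEAD 2>/dev/null | wc -l | tr -d ' '
}

# Clean worktree after commit and PR: 2 commits, 1 PR, 0 issues, 0 dirty files.
if is_hollow 2 1 0 0; then echo "HOLLOW_SUCCESS"; else echo "not hollow"; fi
# prints: not hollow
```

The legacy guard effectively looked only at the fourth argument, which is why a clean tree after a successful commit was misread as no work.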
Verify:
# After a successful implement step that committed and pushed:
git log --oneline origin/main..HEAD # Should show commits
gh pr list --head "$(git branch --show-current)" # Should show PR
# The verifier should produce WORK_VERIFIED, not HOLLOW_SUCCESS
Related
- Run a Recipe End-to-End — Normal recipe execution workflow
- Run amplihack in Non-interactive Mode — CI and headless environments
- Install amplihack for the First Time — Bootstrap from scratch
- Environment Variables — All variables read or injected by amplihack
- Recipe Execution Flow — Architecture of the step execution pipeline
- worktree_setup Context Propagation — Full reference for the worktree_setup propagation chain