Cascade Workflow with Graceful Degradation¶

This workflow implements graceful degradation through cascading fallback strategies. When optimal approaches fail or timeout, the system automatically falls back to simpler, more reliable alternatives while maintaining acceptable functionality.

DEPRECATION WARNING: Markdown workflows deprecated. See docs/WORKFLOW_TO_SKILLS_MIGRATION.md

Configuration¶

Core Parameters¶

Timeout Strategy: How long to wait at each cascade level

aggressive - Fast failures, quick degradation (5s / 2s / 1s)
balanced - Reasonable attempts (30s / 10s / 5s) - DEFAULT
patient - Thorough attempts before fallback (120s / 30s / 10s)
custom - Define your own timeouts

Fallback Types: What degrades at each level

service - External API → Cached data → Static defaults
quality - Comprehensive → Standard → Minimal analysis
freshness - Real-time → Recent → Historical data
completeness - Full dataset → Sample → Summary
accuracy - Precise → Approximate → Estimate

Degradation Notification: How to inform users

silent - Log only, no user notification
warning - Inform user of degradation
explicit - Detailed explanation of what degraded and why

Cost-Benefit Analysis¶

When to Use:

External service dependencies (APIs, databases)
Time-sensitive operations with acceptable degraded modes
Operations where partial results are better than no results
High-availability requirements (system must always respond)
Scenarios where waiting for perfect solution is worse than good-enough solution

When NOT to Use:

Operations requiring exact correctness (no acceptable degradation)
Security-critical operations (authentication, authorization)
Financial transactions (no room for "approximate")
When failures must surface to user (diagnostic operations)
Simple operations with no meaningful fallback

Trade-offs:

Benefit: System always completes, never fully fails
Cost: Users may receive degraded responses
Best for: User-facing features where responsiveness matters

How This Workflow Works¶

Integration with DEFAULT_WORKFLOW:

This workflow can be applied to any step in DEFAULT_WORKFLOW that has fallback options. Most commonly used for external service integrations (Step 5) or data fetching operations.

Execution:

Invoke automatically when operations have known fallbacks
Or explicitly via /ultrathink --workflow cascade for resilient operations
System attempts optimal approach first
Falls back automatically on timeout or failure
Reports degradation level achieved

Key Principle: Better to deliver degraded service than no service

The Cascade Workflow¶

Step 1: Define Cascade Levels¶

Use architect agent to identify cascade levels
Define PRIMARY approach (optimal solution)
Define SECONDARY approach (acceptable degradation)
Define TERTIARY approach (guaranteed completion)
Set timeout for each level
Document what degrades at each level
CRITICAL: Ensure tertiary ALWAYS succeeds

Cascade Level Requirements:

PRIMARY (Optimal):

Best possible outcome
May depend on external services
May be slow or resource-intensive
Can fail or timeout

SECONDARY (Acceptable):

Reduced quality but functional
More reliable than primary
Faster or fewer dependencies
Acceptable for users

TERTIARY (Guaranteed):

Always succeeds, never fails
No external dependencies
Fast and reliable
Minimal but functional

Example Cascade Definitions:

## Example 1: Code Analysis with AI

PRIMARY: GPT-4 comprehensive analysis (timeout: 30s)

- Full semantic understanding
- Complex pattern detection
- Detailed recommendations

SECONDARY: GPT-3.5 standard analysis (timeout: 10s)

- Basic pattern detection
- Standard recommendations
- Faster, lower quality

TERTIARY: Static analysis with regex (timeout: 5s)

- Pattern matching only
- No semantic understanding
- Always completes, basic insights

## Example 2: External API Data Fetch

PRIMARY: Live API call (timeout: 10s)

- Real-time data
- Complete information
- Current state

SECONDARY: Cached data (timeout: 2s)

- Recent data (< 1 hour old)
- May be stale
- Fast retrieval

TERTIARY: Default values (timeout: 0s)

- Static fallback data
- Guaranteed available
- Generic/safe defaults

## Example 3: Test Execution

PRIMARY: Full test suite (timeout: 120s)

- All tests run
- Complete coverage
- High confidence

SECONDARY: Critical tests only (timeout: 30s)

- Core functionality verified
- Reduced coverage
- Acceptable confidence

TERTIARY: Smoke tests (timeout: 10s)

- Basic sanity checks
- Minimal coverage
- System functional

Step 2: Attempt Primary Approach¶

Execute optimal solution
Set timeout based on strategy configuration
Monitor execution progress
If completes successfully: DONE (best outcome)
If fails or times out: Continue to Step 3
Log attempt and reason for failure

Primary Execution:

# Pseudocode for primary attempt
try:
    result = execute_primary_approach(timeout=PRIMARY_TIMEOUT)
    log_success(level="PRIMARY", result=result)
    return result  # DONE - best outcome achieved
except TimeoutError:
    log_failure(level="PRIMARY", reason="timeout")
    # Continue to Step 3
except ExternalServiceError as e:
    log_failure(level="PRIMARY", reason=f"service_error: {e}")
    # Continue to Step 3

Example - Code Analysis:

PRIMARY ATTEMPT: GPT-4 Analysis

- Started: 2025-01-20 10:15:30
- Timeout: 30s
- Status: TIMEOUT after 30s
- Reason: GPT-4 API slow response
- Next: Falling back to SECONDARY (GPT-3.5)

Step 3: Attempt Secondary Approach¶

Log degradation to secondary level
Execute acceptable fallback solution
Set shorter timeout (typically ⅓ of primary)
Monitor execution progress
If completes successfully: DONE (acceptable outcome)
If fails or times out: Continue to Step 4
Log attempt and reason for failure

Secondary Execution:

# Pseudocode for secondary attempt
log_degradation(from_level="PRIMARY", to_level="SECONDARY")
try:
    result = execute_secondary_approach(timeout=SECONDARY_TIMEOUT)
    log_success(level="SECONDARY", result=result, degraded=True)
    return result  # DONE - acceptable outcome
except TimeoutError:
    log_failure(level="SECONDARY", reason="timeout")
    # Continue to Step 4
except Error as e:
    log_failure(level="SECONDARY", reason=f"error: {e}")
    # Continue to Step 4

Example - Code Analysis:

SECONDARY ATTEMPT: GPT-3.5 Analysis

- Started: 2025-01-20 10:16:00
- Timeout: 10s
- Status: SUCCESS after 6s
- Quality: Standard (degraded from comprehensive)
- Result: Basic analysis completed
- Degradation: Missing advanced semantic insights
- Outcome: ACCEPTABLE - delivered to user

Step 4: Attempt Tertiary Approach¶

Log degradation to tertiary level
Execute guaranteed completion approach
Set minimal timeout (typically 1s)
MUST succeed - no failures allowed
Return minimal but functional result
Log success (degraded but functional)
DONE (guaranteed completion)

Tertiary Execution:

# Pseudocode for tertiary attempt
log_degradation(from_level="SECONDARY", to_level="TERTIARY")
try:
    result = execute_tertiary_approach(timeout=TERTIARY_TIMEOUT)
    log_success(level="TERTIARY", result=result, heavily_degraded=True)
    return result  # DONE - minimal but functional
except Exception as e:
    # THIS SHOULD NEVER HAPPEN
    log_critical_failure("TERTIARY approach failed - this is a bug!")
    # Tertiary must be designed to never fail
    raise SystemError("Cascade safety violation: tertiary failed")

Example - Code Analysis:

TERTIARY ATTEMPT: Static Regex Analysis

- Started: 2025-01-20 10:16:10
- Timeout: 5s
- Status: SUCCESS after 0.3s
- Quality: Minimal (heavily degraded)
- Result: Basic pattern matching completed
- Degradation: No semantic understanding, pattern-based only
- Outcome: FUNCTIONAL - delivered basic insights to user

Step 5: Report Degradation¶

Determine notification level from configuration
If silent: Log only, no user message
If warning: Brief notification to user
If explicit: Detailed degradation explanation
Document which level succeeded
Explain impact of degradation
Log cascade path taken for analysis

Degradation Reporting Templates:

Silent Mode (logs only):

[LOG] CASCADE: PRIMARY timeout (30s) → SECONDARY success (6s)
Result: standard_analysis (degraded from comprehensive)

Warning Mode (brief user notification):

⚠️  Using cached data (less than 1 hour old)
Current real-time data unavailable.

Explicit Mode (detailed explanation):

ℹ️  Analysis Quality Notice

We attempted to provide comprehensive code analysis using GPT-4,
but encountered slow response times (>30s timeout).

Fallback Applied:
- Used: GPT-3.5 standard analysis (completed in 6s)
- Quality: Standard (vs. Comprehensive)
- Impact: Advanced semantic insights not included

What You're Getting:
✓ Basic pattern detection
✓ Standard recommendations
✓ Code quality assessment

What's Missing:
✗ Complex architectural insights
✗ Deep semantic analysis
✗ Advanced refactoring suggestions

This is still actionable analysis, just less detailed than optimal.

Cascade Path Documentation:

# Cascade Execution Report

Operation: Code Analysis
Timestamp: 2025-01-20 10:15:30
Strategy: Balanced timeouts

## Cascade Path

1. PRIMARY: GPT-4 Comprehensive Analysis
   - Timeout: 30s
   - Result: TIMEOUT
   - Reason: API response time exceeded threshold
   - Duration: 30.1s

2. SECONDARY: GPT-3.5 Standard Analysis
   - Timeout: 10s
   - Result: SUCCESS
   - Duration: 6.2s
   - Degradation: Reduced from comprehensive to standard

3. TERTIARY: Not Attempted (secondary succeeded)

## Final Outcome

- Level: SECONDARY (acceptable degradation)
- Quality: Standard analysis delivered
- User Impact: Moderate (missing advanced insights)
- Total Time: 36.2s (primary timeout + secondary execution)

Step 6: Log Cascade Metrics¶

Record cascade path taken
Document level reached (primary/secondary/tertiary)
Log timing for each level attempted
Track degradation frequency
Identify patterns in failures
Update cascade strategy if needed

Metrics to Track:

## Cascade Metrics (Last 30 Days)

Operation: Code Analysis (500 attempts)

Success by Level:

- PRIMARY: 320 (64%) - Optimal outcome
- SECONDARY: 150 (30%) - Acceptable degradation
- TERTIARY: 30 (6%) - Minimal outcome

Average Response Times:

- PRIMARY success: 12s
- PRIMARY timeout: 30s (then fallback)
- SECONDARY success: 5s
- TERTIARY success: 0.3s

Degradation Impact:

- 36% of requests served degraded results
- 94% acceptable quality (primary + secondary)
- 6% minimal quality (tertiary)

Recommendations:

- Consider increasing PRIMARY timeout from 30s to 45s
  (64% success rate suggests timeout too aggressive)
- SECONDARY performing well (30% fallback, 5s average)
- TERTIARY rarely needed (6%), working as designed

Step 7: Continuous Optimization¶

Use analyzer agent to review cascade metrics
Identify optimization opportunities
Adjust timeouts based on success rates
Improve secondary approaches if frequently used
Update tertiary if inadequate
Document learnings in .claude/context/DISCOVERIES.md

Optimization Criteria:

If PRIMARY succeeds < 50%:

Timeout too aggressive → Increase timeout
Approach too brittle → Improve reliability
External service unreliable → Consider different primary

If SECONDARY used > 40%:

Secondary is really the "normal" case → Swap primary and secondary
Primary too optimistic → Recalibrate expectations

If TERTIARY used > 10%:

Secondary not reliable enough → Improve secondary
Timeouts too aggressive → Increase secondary timeout
System under stress → Investigate root cause

Example Optimization:

## Optimization: Code Analysis Cascade

Current Performance:

- PRIMARY: 64% success (GPT-4, 30s timeout)
- SECONDARY: 30% usage (GPT-3.5, 10s timeout)
- TERTIARY: 6% usage (regex, 5s timeout)

Issue Identified:
PRIMARY timeout of 30s too aggressive. 64% success means
36% of requests waiting full 30s before fallback.

Optimization:

- Increase PRIMARY timeout: 30s → 45s
- Add fast-fail detection: If no response in 5s, likely to timeout
- Implement speculative execution: Start SECONDARY at 20s
  while PRIMARY still running, use whichever completes first

Expected Impact:

- PRIMARY success rate: 64% → 80%
- Average response time: 18s → 15s (less timeout waste)
- User experience: Fewer degraded results, similar speed

Cascade Patterns Library¶

Pattern 1: External API with Cache¶

Use Case: Fetching data from external API that may be slow or unavailable

PRIMARY: Live API call (timeout: 10s)

- Fresh data from API
- Complete and current

SECONDARY: Cached data (timeout: 2s)

- Data from cache (< 1 hour old)
- May be stale but recent

TERTIARY: Default/Historical data (timeout: 0s)

- Safe default values or old historical data
- Always available

Notification: Warning ("Using cached data from 30 minutes ago")

Pattern 2: AI Analysis Quality Levels¶

Use Case: Code analysis with AI models of varying capability

PRIMARY: GPT-4 comprehensive (timeout: 30s)

- Deep semantic analysis
- Complex insights

SECONDARY: GPT-3.5 standard (timeout: 10s)

- Basic analysis
- Standard patterns

TERTIARY: Static analysis (timeout: 5s)

- Pattern matching
- No AI, guaranteed fast

Notification: Explicit (explain quality degradation)

Pattern 3: Database Query Optimization¶

Use Case: Complex database queries that may be slow

PRIMARY: Exact query with full JOINs (timeout: 5s)

- Precise results
- All relationships included

SECONDARY: Simplified query (timeout: 2s)

- Approximate results
- Fewer JOINs, main data only

TERTIARY: Cached summary (timeout: 0s)

- Precomputed aggregates
- May be stale but instant

Notification: Silent (log only)

Pattern 4: Test Execution Levels¶

Use Case: Running tests with time constraints

PRIMARY: Full test suite (timeout: 120s)

- All tests run
- Complete coverage

SECONDARY: Critical tests (timeout: 30s)

- Core functionality only
- Reduced coverage

TERTIARY: Smoke tests (timeout: 10s)

- Basic sanity only
- Minimal validation

Notification: Warning ("Ran critical tests only, skipped full suite")

Pattern 5: Data Processing Completeness¶

Use Case: Processing large datasets

PRIMARY: Full dataset (timeout: 60s)

- Process all records
- Complete accuracy

SECONDARY: 10% sample (timeout: 10s)

- Statistical sample
- Approximate results

TERTIARY: Summary statistics (timeout: 1s)

- Precomputed aggregates
- High-level overview only

Notification: Explicit (explain sampling used)

Integration with DEFAULT_WORKFLOW¶

During Step 5: Implementation¶

When implementing features with external dependencies:

Identify operations that can degrade gracefully
Define cascade levels for each operation
Implement primary approach first
Add secondary and tertiary fallbacks
Configure timeouts and notification levels
Test all cascade levels independently

During Step 7: Testing¶

Test cascade behavior explicitly:

def test_cascade_levels():
    """Test all cascade levels independently"""

    # Test primary succeeds
    result = operation_with_cascade()
    assert result.level == "PRIMARY"
    assert result.quality == "optimal"

    # Test secondary fallback (mock primary failure)
    with mock_primary_timeout():
        result = operation_with_cascade()
        assert result.level == "SECONDARY"
        assert result.quality == "acceptable"

    # Test tertiary fallback (mock primary and secondary failure)
    with mock_primary_and_secondary_timeout():
        result = operation_with_cascade()
        assert result.level == "TERTIARY"
        assert result.quality == "minimal"

Examples¶

Example 1: Weather API Integration¶

Configuration:

Strategy: Balanced (30s / 10s / 5s)
Type: Service fallback
Notification: Warning

Implementation:

async def get_weather(location: str) -> WeatherData:
    """Get weather data with cascade fallback"""

    # PRIMARY: Live weather API
    try:
        return await fetch_weather_api(location, timeout=30)
    except (TimeoutError, APIError):
        log.warning("PRIMARY weather API failed, trying cache")

    # SECONDARY: Cached weather data
    try:
        cached = await get_cached_weather(location, max_age=3600)
        if cached:
            notify_user("Using weather data from cache (< 1 hour old)")
            return cached
    except CacheError:
        log.warning("SECONDARY cache failed, using defaults")

    # TERTIARY: Default weather data
    return get_default_weather(location)  # Never fails

Outcome: System always returns weather data, quality degrades gracefully

Example 2: Code Review with AI¶

Configuration:

Strategy: Patient (120s / 30s / 10s)
Type: Quality fallback
Notification: Explicit

Cascade Path (actual execution):

PRIMARY: GPT-4 comprehensive review - TIMEOUT after 120s
SECONDARY: GPT-3.5 standard review - SUCCESS in 18s
TERTIARY: Not attempted

User Notification:

ℹ️  Code Review Quality Notice

We attempted comprehensive AI review using GPT-4, but encountered
slow response times (>120s timeout).

Delivered: Standard review using GPT-3.5 (completed in 18s)
- Basic code quality checks ✓
- Standard best practices ✓
- Security pattern detection ✓

Not included in this review:
- Advanced architectural insights
- Complex refactoring suggestions
- Deep semantic analysis

This is still a thorough review, just less detailed than optimal.

Example 3: Search Results Ranking¶

Configuration:

Strategy: Aggressive (5s / 2s / 1s)
Type: Accuracy fallback
Notification: Silent

Implementation:

def search_and_rank(query: str) -> List[Result]:
    """Search with ML ranking, fallback to simple ranking"""

    results = fetch_results(query)

    # PRIMARY: ML-based ranking (sophisticated)
    try:
        return ml_rank(results, timeout=5)
    except TimeoutError:
        pass  # Silent fallback

    # SECONDARY: Heuristic ranking (good enough)
    try:
        return heuristic_rank(results, timeout=2)
    except TimeoutError:
        pass

    # TERTIARY: Simple text match ranking (basic)
    return simple_rank(results)  # Always fast

Outcome: User always gets results, ranking quality degrades silently

Customization¶

To customize this workflow:

Edit Configuration section to adjust timeouts and notification levels
Define cascade levels for your specific operations
Choose appropriate fallback types (service, quality, freshness, etc.)
Set notification preferences based on user expectations
Monitor metrics and optimize timeout values
Save changes - updated workflow applies to future executions

Philosophy Notes¶

This workflow enforces:

Resilience: System always completes, never completely fails
User Experience: Better degraded service than error message
Transparency: Users understand what they're getting (if explicit mode)
Progressive Enhancement: Optimal by default, degrade when necessary
Measurable Quality: Clear definition of what degrades at each level
Continuous Improvement: Metrics drive timeout optimization
Guaranteed Completion: Tertiary level must never fail