Overview
When a request's primary provider fails, ModelRiver's multi-provider failover system automatically retries with the next provider in your workflow's fallback chain. Each failed attempt is logged as a failover attempt in the timeline, capturing the full request/response data and the failure reason. This gives you complete visibility into your system's resilience behavior.
How failover attempts appear
In the log list
Requests that required failovers show a failed models badge — a red badge with a count indicating how many provider attempts failed before the request was ultimately resolved.
Example: A badge showing "2 failed" means two providers failed before the third succeeded (or before all providers were exhausted).
In the timeline
Failover attempts appear as amber/yellow badges at the beginning of the timeline, before the main request:
- Position: Before the main request, in chronological order
- Badge color: Amber/yellow
- Badge content: Provider icon, name, and model
- Additional info: Duration and timestamp
When clicked
Clicking a failover attempt reveals:
- Provider & model – Which provider and model were attempted
- Status – "Failed" with the specific error
- Duration – How long the attempt took before failing
- Timestamp – When the attempt occurred
- Request Body tab – The exact request sent to the provider
- Response Body tab – The provider's error response
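For reference in the examples on this page, here is one plausible shape for a single failover-attempt record, written as a Python dict. The field names are assumptions for illustration; apart from primary_req_id (described in the next section), they are not a documented ModelRiver schema.

```python
# Illustrative shape of a single failover-attempt record.
# Field names are assumptions used by the sketches on this page,
# not ModelRiver's documented schema.
failover_attempt = {
    "id": "f-1",
    "primary_req_id": "abc-123",   # links to the main request (see below)
    "provider": "provider-a",
    "model": "model-a",
    "status": "failed",
    "error": "Rate limit exceeded (429)",
    "duration_ms": 1200,
    "timestamp": "2024-05-01T12:00:00Z",
    "request_body": {"messages": [{"role": "user", "content": "..."}]},
    "response_body": {"error": {"type": "rate_limit_error"}},
}
```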
Understanding failover data
Primary Request ID
Every failover attempt is linked to its eventual main request via the primary_req_id field. This creates a chain:
Failover Attempt 1 (primary_req_id: "abc-123")
  ↓
Failover Attempt 2 (primary_req_id: "abc-123")
  ↓
Main Request (id: "abc-123")

This linking ensures you can always trace the complete failover chain for any request.
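If you work with an exported batch of log entries, the chains can be reconstructed from this field alone. A minimal sketch, assuming entries are plain dicts with id and primary_req_id:

```python
# Group failover attempts under their main request using primary_req_id.
# The list-of-dicts shape is an assumption about an exported log batch.
from collections import defaultdict

def group_failover_chains(entries):
    """Return {main_request_id: [failover attempts in order]}."""
    chains = defaultdict(list)
    for entry in entries:
        parent = entry.get("primary_req_id")
        if parent:                      # only failover attempts carry primary_req_id
            chains[parent].append(entry)
    return dict(chains)

entries = [
    {"id": "f-1", "primary_req_id": "abc-123", "provider": "provider-a"},
    {"id": "f-2", "primary_req_id": "abc-123", "provider": "provider-b"},
    {"id": "abc-123", "provider": "provider-c"},   # the main request
]
print(group_failover_chains(entries))              # {'abc-123': [f-1, f-2]}
```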
Common failure reasons
| Error type | Description | Action |
|---|---|---|
| Rate limit (429) | Provider's rate limit exceeded | Reduce request rate or upgrade provider plan |
| Server error (500/502/503) | Provider infrastructure issue | Usually transient; monitor for patterns |
| Model unavailable | Model offline or deprecated | Update workflow to use available model |
| Authentication error (401/403) | Invalid or expired API key | Rotate credentials in ModelRiver |
| Content policy violation | Input rejected by provider safety filters | Review and adjust input content |
| Timeout | Provider didn't respond in time | Provider may be overloaded |
| Invalid request (400) | Request format incompatible with provider | Check provider-specific requirements |
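If you triage failures programmatically, the table above maps naturally onto a small classifier. A sketch; the classify_failure helper is purely illustrative and not part of ModelRiver:

```python
# Map a failed attempt's status code / error message onto the error types above.
# Illustrative only; not a ModelRiver API.
def classify_failure(status_code, message=""):
    msg = message.lower()
    if status_code == 429:
        return "rate_limit"
    if status_code in (500, 502, 503):
        return "server_error"
    if status_code in (401, 403):
        return "authentication_error"
    if status_code == 400:
        return "invalid_request"
    if "policy" in msg or "safety" in msg:
        return "content_policy_violation"
    if "timeout" in msg or "timed out" in msg:
        return "timeout"
    return "unknown"

print(classify_failure(429))                         # rate_limit
print(classify_failure(0, "Request timed out"))      # timeout
```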
Failover impact on performance
Each failover attempt adds to total request latency:
Total latency = Attempt 1 duration + Attempt 2 duration + ... + Main request duration
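As a concrete example of the formula above, with made-up durations:

```python
# Total user-visible latency for a request with two failed attempts.
# Durations are example values in milliseconds.
failover_durations_ms = [1200, 800]   # Attempt 1 and Attempt 2, both failed
main_request_ms = 950                 # the attempt that finally succeeded

total_latency_ms = sum(failover_durations_ms) + main_request_ms
print(total_latency_ms)               # 2950 ms, versus 950 ms with no failovers
```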
Frequent failovers indicate provider instability and directly impact user experience. See Performance Monitoring for latency analysis.
Failover impact on cost
Some providers charge for tokens even on failed requests (if the model processed tokens before the error). This means failovers can increase the true cost of a request beyond what the main request shows.
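A sketch of this effect, with made-up token counts and prices; actual billing depends on each provider's rules:

```python
# The "true" request cost includes tokens billed on failed attempts.
# Token counts and per-1K prices below are invented for illustration.
attempts = [
    {"billed_tokens": 1500, "price_per_1k_usd": 0.010},  # failed attempt, tokens still billed
    {"billed_tokens": 0,    "price_per_1k_usd": 0.008},  # failed before any tokens were processed
    {"billed_tokens": 2100, "price_per_1k_usd": 0.006},  # main (successful) request
]

main_cost = attempts[-1]["billed_tokens"] / 1000 * attempts[-1]["price_per_1k_usd"]
true_cost = sum(a["billed_tokens"] / 1000 * a["price_per_1k_usd"] for a in attempts)

print(f"main request only:   ${main_cost:.4f}")   # $0.0126
print(f"including failovers: ${true_cost:.4f}")   # $0.0276
```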
See Cost Analysis for cost optimization strategies.
Debugging failover attempts
Step-by-step investigation
1. Open the request detail – Click the request in the log list
2. Review the timeline – Note the number and order of failover attempts
3. Click each failover attempt – View the error details
4. Compare request bodies – Verify the same request was sent to each provider
5. Read error messages – Identify the specific failure reason
6. Check for patterns – Is the same provider always failing? Is the error always the same?
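The same checks can also be run outside the UI against an exported batch of logs. A minimal sketch, assuming a JSON export and the illustrative field names used earlier on this page:

```python
# Walk one request's failover chain and print what each attempt shows
# in the UI (provider, model, error, duration, timestamp).
# The export file name and field names are assumptions.
import json

with open("request_logs.json") as f:
    entries = json.load(f)

main_id = "abc-123"
attempts = [e for e in entries if e.get("primary_req_id") == main_id]

for i, attempt in enumerate(attempts, start=1):
    print(f"Failover attempt {i}: {attempt.get('provider')}/{attempt.get('model')}")
    print(f"  error:    {attempt.get('error')}")
    print(f"  duration: {attempt.get('duration_ms')} ms")
    print(f"  at:       {attempt.get('timestamp')}")
```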
What to look for
- Same error across providers – May indicate a request-level issue (e.g., content policy violation) rather than a provider issue
- One provider always failing – Provider-specific issue; consider removing or deprioritizing
- Rate limit errors – Too many requests; need to distribute load or upgrade plans
- Intermittent failures – Transient provider issues; failover is working as designed
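A small sketch that applies these checks to one request's failover attempts, using the same assumed fields:

```python
# Distinguish a request-level problem from a provider-level problem
# for the failover attempts belonging to one request. Illustrative only.
def diagnose(attempts):
    providers = {a.get("provider") for a in attempts}
    errors = {a.get("error") for a in attempts}
    if len(errors) == 1 and len(providers) > 1:
        return "Same error across providers: likely a request-level issue"
    if len(providers) == 1:
        return "Single provider failing repeatedly: provider-specific issue"
    return "Mixed errors: likely transient provider issues; failover working as designed"

attempts = [
    {"provider": "provider-a", "error": "content policy violation"},
    {"provider": "provider-b", "error": "content policy violation"},
]
print(diagnose(attempts))   # Same error across providers: likely a request-level issue
```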
Failover best practices
Configure fallback providers wisely
- Diversity: Use providers from different vendors to avoid correlated failures
- Priority: Put the most reliable provider first
- Cost awareness: Put cheaper providers earlier in the chain
- Compatibility: Ensure all fallback providers support your request format
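To make the ordering principles concrete, here is a hypothetical fallback chain expressed as plain data. This is not ModelRiver's workflow configuration format, only an illustration of the trade-offs above:

```python
# Hypothetical fallback chain: most reliable first, cross-vendor diversity,
# and (where reliability is comparable) cheaper options earlier.
# Not ModelRiver's configuration format.
fallback_chain = [
    {"vendor": "vendor-a", "model": "model-a", "uptime": 0.999, "est_cost_per_1k_usd": 0.008},
    {"vendor": "vendor-b", "model": "model-b", "uptime": 0.998, "est_cost_per_1k_usd": 0.006},
    {"vendor": "vendor-c", "model": "model-c", "uptime": 0.995, "est_cost_per_1k_usd": 0.004},
]

# Sanity check: vendors are distinct, so failures are less likely to be correlated.
assert len({p["vendor"] for p in fallback_chain}) == len(fallback_chain)
```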
Monitor failover rates
- Track what percentage of requests require failovers
- Set alerts for failover rate thresholds (e.g., > 5%); see the sketch after this list
- Review Provider Reliability regularly
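A minimal sketch of such a threshold check over a batch of exported requests; the per-request failed_attempts count is an assumed export field:

```python
# Compute the share of requests that needed at least one failover and
# compare it with an alert threshold (the 5% example above).
ALERT_THRESHOLD = 0.05

requests = [
    {"id": "abc-123", "failed_attempts": 2},
    {"id": "def-456", "failed_attempts": 0},
    {"id": "ghi-789", "failed_attempts": 0},
]

failover_rate = sum(1 for r in requests if r["failed_attempts"] > 0) / len(requests)
print(f"failover rate: {failover_rate:.1%}")

if failover_rate > ALERT_THRESHOLD:
    print("ALERT: failover rate above threshold; check Provider Reliability")
```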
Respond to failover patterns
- Consistent failures: Remove the failing provider or update credentials
- Rate limiting: Upgrade provider plan or reduce request rate
- Model deprecation: Update workflow to use the replacement model
Next steps
- Primary Request – The main request details
- Provider Reliability – Aggregate provider failure analysis
- Performance Monitoring – Latency impact of failovers
- Back to Timeline – Timeline overview