Make data-driven provider decisions

See which providers fail most often, how often fallbacks trigger, and which providers are most reliable for your workloads.

Overview

AI providers are external services that can fail for various reasons—rate limiting, infrastructure issues, model availability, and more. ModelRiver's multi-provider failover system ensures your requests succeed even when individual providers fail. Request Logs capture every provider attempt, successful or failed, giving you the data to evaluate provider reliability and optimize your workflow configuration.


Understanding provider reliability data

Failover attempts in Request Logs

When a provider fails and ModelRiver retries with a fallback:

  1. A log entry is created for the failed attempt with status: "failed" and a primary_req_id linking to the eventual successful request
  2. The successful request (or final failed request if all providers fail) is the main log entry
  3. The timeline in the detail view shows all attempts in chronological order
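
As a rough illustration, a failed attempt and its linked successful request might look like the records below. Only status: "failed" and primary_req_id come from the description above; every other field name and value is an assumption, not the actual Request Logs schema.

```python
# Illustrative only: field names other than "status" and "primary_req_id"
# are assumptions about the log record shape.

failed_attempt = {
    "req_id": "attempt_123",           # hypothetical identifier for the failed attempt
    "status": "failed",                # the failed provider attempt
    "primary_req_id": "req_456",       # links to the eventual successful request
    "provider": "provider-a",          # hypothetical provider name
    "model": "model-x",                # hypothetical model name
    "error": "429 Too Many Requests",  # error message recorded for the failure
}

successful_request = {
    "req_id": "req_456",               # the main log entry
    "status": "success",
    "provider": "provider-b",          # fallback provider that succeeded
    "model": "model-y",
}
```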

Key indicators of provider reliability

| Indicator | Where to find it | What it tells you |
|---|---|---|
| Failed models badge | Log list view (red badge with count) | How many provider attempts failed before success |
| Failover attempts | Timeline in detail view (amber badges) | Which specific providers failed and why |
| Error messages | Failed attempt detail view | Root cause of each failure |
| Overall success/error rate | Status column across many requests | Provider health over time |

Reliability analysis workflow

Step 1: Identify failover frequency

  1. Navigate to Request Logs in your project console
  2. Filter to Live mode for production traffic
  3. Look for requests with failed models badges — these required fallbacks
  4. Count how frequently failovers occur relative to total requests

Healthy benchmark: Failovers should be rare events (< 5% of requests). If more than 10% of requests require fallbacks, there's likely a provider issue to address.
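
If you can get request-log records into a script (how you obtain them, for example via an export, is not covered here), the benchmark check is a one-line ratio. A minimal sketch, assuming each main log entry carries a failed_attempts count (a hypothetical field name):

```python
# Minimal sketch: estimate failover frequency from log records.
# The record shape and the "failed_attempts" field name are assumptions.

def failover_rate(logs):
    """Fraction of requests that needed at least one fallback attempt."""
    if not logs:
        return 0.0
    with_failover = sum(1 for entry in logs if entry.get("failed_attempts", 0) > 0)
    return with_failover / len(logs)

# Hypothetical sample data for illustration.
logs = [
    {"req_id": "req_1", "failed_attempts": 0},
    {"req_id": "req_2", "failed_attempts": 2},
    {"req_id": "req_3", "failed_attempts": 0},
]

rate = failover_rate(logs)
if rate > 0.10:
    print(f"{rate:.1%} of requests needed fallbacks: likely a provider issue to address")
elif rate > 0.05:
    print(f"{rate:.1%} of requests needed fallbacks: above the healthy benchmark")
else:
    print(f"{rate:.1%} of requests needed fallbacks: within the healthy range")
```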

Step 2: Analyze failure patterns by provider

Click on requests with failover attempts and review the timeline:

  • Which provider failed? – Note the provider name and model
  • Why did it fail? – Read the error message in the failed attempt detail
  • When did it fail? – Check timestamps for time-based patterns

Common failure reasons by provider:

| Failure type | Description | Indicates |
|---|---|---|
| Rate limiting | Provider returned 429 (Too Many Requests) | Need to reduce request rate or upgrade provider plan |
| Server errors | Provider returned 500/502/503 | Provider infrastructure issues |
| Model unavailable | Model is temporarily offline | Provider is updating or deprecating the model |
| Authentication error | Invalid or expired API key | Credentials need to be updated in ModelRiver |
| Content policy | Request rejected for policy violations | Input content may need filtering |
| Timeout | Provider didn't respond within time limit | Provider is overloaded or experiencing issues |
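
When reviewing many failed attempts, it can help to bucket error messages into the categories above. This is a rough sketch; the matched substrings are assumptions, since real provider error text varies, so adjust the rules to what you actually see in the failed attempt detail:

```python
# Rough sketch: bucket a provider error message into the failure types above.
# The matched substrings are illustrative, not exhaustive.

def classify_failure(error_message):
    msg = error_message.lower()
    if "429" in msg or "rate limit" in msg:
        return "Rate limiting"
    if any(code in msg for code in ("500", "502", "503")):
        return "Server errors"
    if "model" in msg and ("not found" in msg or "unavailable" in msg):
        return "Model unavailable"
    if "401" in msg or "invalid api key" in msg or "unauthorized" in msg:
        return "Authentication error"
    if "content policy" in msg or "safety" in msg:
        return "Content policy"
    if "timeout" in msg or "timed out" in msg:
        return "Timeout"
    return "Other"

print(classify_failure("429 Too Many Requests"))  # -> Rate limiting
```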

Step 3: Review reliability over time

  • Is a provider consistently failing? – May need to be removed or deprioritized
  • Are failures time-based? – Some providers have peak-hour degradation
  • Are failures model-specific? – The provider may be fine, but a specific model is unreliable
  • Are failures increasing? – May indicate a worsening provider issue
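
To answer these questions across many requests, you can tally failed attempts by provider, by model, and by hour of day. A minimal sketch over records shaped like the earlier example (the provider, model, and timestamp field names are assumptions):

```python
from collections import Counter
from datetime import datetime

# Minimal sketch: tally failed attempts by provider, model, and hour of day.
# Field names are assumptions about the record shape, not the actual schema.

def failure_patterns(failed_attempts):
    by_provider = Counter(a["provider"] for a in failed_attempts)
    by_model = Counter((a["provider"], a["model"]) for a in failed_attempts)
    by_hour = Counter(
        datetime.fromisoformat(a["timestamp"]).hour for a in failed_attempts
    )
    return by_provider, by_model, by_hour
```

High counts for one provider suggest deprioritizing it; counts clustered in a few hours point to peak-hour degradation; counts concentrated on a single provider/model pair suggest the issue is model-specific rather than provider-wide.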

Step 4: Optimize provider configuration

Based on your analysis:

  • Reorder fallback providers – Put the most reliable provider first
  • Remove unreliable providers – If a provider consistently fails, remove it from the workflow
  • Diversify providers – Use providers from different vendors to reduce correlated failures
  • Update credentials – If authentication errors occur, rotate API keys
  • Adjust rate limits – If rate limiting is frequent, consider upgrading your provider plan
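
Conceptually, most of these changes come down to editing the provider list in your workflow. The structure below is illustrative only, not ModelRiver's actual workflow schema, and the provider and model names are placeholders; it shows the most reliable provider promoted to first position and a consistently failing provider removed:

```python
# Illustrative only: not ModelRiver's actual workflow schema.
workflow_before = {
    "providers": [
        "provider-a/model-x",  # primary, but failing often
        "provider-b/model-y",  # reliable fallback
        "provider-c/model-z",
    ],
}

workflow_after = {
    "providers": [
        "provider-b/model-y",  # most reliable provider now first
        "provider-c/model-z",  # different vendor, reduces correlated failures
        # provider-a removed until its failure rate recovers
    ],
}
```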

Provider-specific considerations

Rate limiting patterns

Each provider has different rate limits:

  • Per-minute request limits – Too many requests in a short window
  • Per-minute token limits – Total tokens consumed too quickly
  • Per-day limits – Daily quota exceeded

How to identify: Look for 429 errors in failed attempt details. Rate limit errors typically come in bursts during high-traffic periods.

How to address:

  • Spread requests more evenly over time
  • Use multiple providers to distribute load
  • Upgrade provider plans for higher limits
  • Implement client-side rate limiting (see the sketch after this list)
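
As one example of client-side rate limiting, a small sliding-window limiter can cap outgoing requests before they reach a provider. This is a minimal, single-threaded sketch rather than production code:

```python
import time

class RateLimiter:
    """Allow at most `max_requests` calls per `window` seconds."""

    def __init__(self, max_requests, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.sent = []  # timestamps of recent requests

    def wait_for_slot(self):
        now = time.monotonic()
        # Keep only timestamps still inside the window.
        self.sent = [t for t in self.sent if now - t < self.window]
        if len(self.sent) >= self.max_requests:
            # Sleep until the oldest request falls out of the window.
            time.sleep(self.window - (now - self.sent[0]))
            now = time.monotonic()
            self.sent = [t for t in self.sent if now - t < self.window]
        self.sent.append(now)

limiter = RateLimiter(max_requests=60, window=60.0)  # roughly one request per second
# limiter.wait_for_slot()  # call before each provider request
```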

Provider outages

How to identify: Sudden spike in failures from a single provider, with 500/502/503 errors.

How to address:

  • ModelRiver's fallback system handles this automatically
  • Monitor the timeline to confirm fallbacks are working
  • Check the provider's status page for outage announcements
  • Consider temporarily removing the provider from workflows if outages are prolonged
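
One way to notice this pattern early in your own tooling is to compare each provider's recent server-error count against its earlier baseline. A rough sketch over records shaped like the earlier examples (field names and offset-aware ISO-8601 timestamps are assumptions):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def outage_suspects(failed_attempts, factor=3.0):
    """Flag providers whose 5xx failures in the last hour far exceed
    their average hourly count over the previous day."""
    now = datetime.now(timezone.utc)
    recent, baseline = Counter(), Counter()
    for a in failed_attempts:
        if not any(code in a["error"] for code in ("500", "502", "503")):
            continue
        # Assumes timestamps carry a UTC offset, e.g. "2024-01-01T12:00:00+00:00".
        age = now - datetime.fromisoformat(a["timestamp"])
        if age <= timedelta(hours=1):
            recent[a["provider"]] += 1
        elif age <= timedelta(hours=25):
            baseline[a["provider"]] += 1
    return [
        provider for provider, count in recent.items()
        if count > factor * max(baseline[provider] / 24, 1)
    ]
```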

Model deprecations

How to identify: Consistent "model not found" or "model unavailable" errors.

How to address:

  • Update your workflow to use the replacement model
  • Check the provider's documentation for model lifecycle announcements
  • Set up monitoring to catch deprecation warnings early (a small sketch follows this list)
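
For the monitoring step, one lightweight option is to scan recent failed attempts for model-availability errors and report which provider/model pairs they concern. A sketch under the same assumed record shape as the earlier examples:

```python
from collections import Counter

def deprecation_candidates(failed_attempts):
    """Count model-availability errors per (provider, model) pair."""
    phrases = ("model not found", "model unavailable", "deprecated")
    return Counter(
        (a["provider"], a["model"])
        for a in failed_attempts
        if any(p in a["error"].lower() for p in phrases)
    ).most_common()
```

Consistent non-zero counts for the same pair suggest the model is being retired and the workflow should move to its replacement.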

Making data-driven decisions

When to keep a provider

  • Failure rate is low (< 5%)
  • Failures are transient and self-recovering
  • Provider offers unique models or capabilities
  • Cost-performance ratio is favorable

When to remove a provider

  • Consistent failure rate above 10%
  • Frequent rate limiting despite reasonable usage
  • Provider latency is significantly higher than alternatives
  • Authentication or credential issues are recurring

When to change provider order

  • Primary provider has higher failure rate than secondary
  • Secondary provider has better latency for your use case
  • Primary provider is more expensive and failing often (double cost impact)
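
To make these checks repeatable, you can compute per-provider failure rates from attempt records and apply the thresholds from this page. A minimal sketch (the record shape and field names are assumptions):

```python
from collections import Counter

def provider_failure_rates(attempts):
    """Per-provider failure rate across all logged attempts."""
    totals, failures = Counter(), Counter()
    for a in attempts:
        totals[a["provider"]] += 1
        if a["status"] == "failed":
            failures[a["provider"]] += 1
    return {p: failures[p] / totals[p] for p in totals}

def recommendation(rate):
    if rate > 0.10:
        return "consider removing or deprioritizing"
    if rate > 0.05:
        return "above the healthy benchmark; watch closely"
    return "keep"

# Hypothetical sample data for illustration.
attempts = [
    {"provider": "provider-a", "status": "failed"},
    {"provider": "provider-a", "status": "success"},
    {"provider": "provider-b", "status": "success"},
]

for provider, rate in provider_failure_rates(attempts).items():
    print(f"{provider}: {rate:.0%} failures, {recommendation(rate)}")
```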

Provider reliability interacts with other observability metrics:

  • Performance – Unreliable providers add latency through failover attempts. See Performance Monitoring.
  • Cost – Failed attempts still consume tokens on some providers. See Cost Analysis.
  • Debugging – Provider failures can cause unexpected results if fallback models behave differently. See Debugging.

Next steps