Make data-driven provider decisions

See which providers fail most often, how often fallbacks trigger, and which providers are most reliable for your workloads.

Overview

AI providers are external services that can fail for various reasons—rate limiting, infrastructure issues, model availability, and more. ModelRiver's multi-provider failover system ensures your requests succeed even when individual providers fail. Request Logs capture every provider attempt, successful or failed, giving you the data to evaluate provider reliability and optimize your workflow configuration.


Understanding provider reliability data

Failover attempts in Request Logs

When a provider fails and ModelRiver retries with a fallback:

  1. A log entry is created for the failed attempt with status: "failed" and a primary_req_id linking to the eventual successful request
  2. The successful request (or final failed request if all providers fail) is the main log entry
  3. The timeline in the detail view shows all attempts in chronological order
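
As a rough illustration, a failed attempt and its linked successful request might look like the records below. Only status: "failed" and primary_req_id come from the description above; every other field name and value is an assumption, not the actual Request Logs schema.

```python
# Illustrative only: field names other than "status" and "primary_req_id"
# are assumptions about the log record shape.

failed_attempt = {
    "req_id": "attempt_123",           # hypothetical identifier for the failed attempt
    "status": "failed",                # the failed provider attempt
    "primary_req_id": "req_456",       # links to the eventual successful request
    "provider": "provider-a",          # hypothetical provider name
    "model": "model-x",                # hypothetical model name
    "error": "429 Too Many Requests",  # error message recorded for the failure
}

successful_request = {
    "req_id": "req_456",               # the main log entry
    "status": "success",
    "provider": "provider-b",          # fallback provider that succeeded
    "model": "model-y",
}
```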

Key indicators of provider reliability

| Indicator | Where to find it | What it tells you |
|---|---|---|
| Failed models badge | Log list view (red badge with count) | How many provider attempts failed before success |
| Failover attempts | Timeline in detail view (amber badges) | Which specific providers failed and why |
| Error messages | Failed attempt detail view | Root cause of each failure |
| Overall success/error rate | Status column across many requests | Provider health over time |

Reliability analysis workflow

Step 1: Identify failover frequency

  1. Navigate to Request Logs in your project console
  2. Filter to Live mode for production traffic
  3. Look for requests with failed models badges — these required fallbacks
  4. Count how frequently failovers occur relative to total requests

Healthy benchmark: Failovers should be rare events (< 5% of requests). If more than 10% of requests require fallbacks, there's likely a provider issue to address.
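
If you can get request-log records into a script (how you obtain them, for example via an export, is not covered here), the benchmark check is a one-line ratio. A minimal sketch, assuming each main log entry carries a failed_attempts count (a hypothetical field name):

```python
# Minimal sketch: estimate failover frequency from log records.
# The record shape and the "failed_attempts" field name are assumptions.

def failover_rate(logs):
    """Fraction of requests that needed at least one fallback attempt."""
    if not logs:
        return 0.0
    with_failover = sum(1 for entry in logs if entry.get("failed_attempts", 0) > 0)
    return with_failover / len(logs)

# Hypothetical sample data for illustration.
logs = [
    {"req_id": "req_1", "failed_attempts": 0},
    {"req_id": "req_2", "failed_attempts": 2},
    {"req_id": "req_3", "failed_attempts": 0},
]

rate = failover_rate(logs)
if rate > 0.10:
    print(f"{rate:.1%} of requests needed fallbacks: likely a provider issue to address")
elif rate > 0.05:
    print(f"{rate:.1%} of requests needed fallbacks: above the healthy benchmark")
else:
    print(f"{rate:.1%} of requests needed fallbacks: within the healthy range")
```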

Step 2: Analyze failure patterns by provider

Click on requests with failover attempts and review the timeline:

  • Which provider failed? – Note the provider name and model
  • Why did it fail? – Read the error message in the failed attempt detail
  • When did it fail? – Check timestamps for time-based patterns

Common failure reasons by provider:

| Failure type | Description | Indicates |
|---|---|---|
| Rate limiting | Provider returned 429 (Too Many Requests) | Need to reduce request rate or upgrade provider plan |
| Server errors | Provider returned 500/502/503 | Provider infrastructure issues |
| Model unavailable | Model is temporarily offline | Provider is updating or deprecating the model |
| Authentication error | Invalid or expired API key | Credentials need to be updated in ModelRiver |
| Content policy | Request rejected for policy violations | Input content may need filtering |
| Timeout | Provider didn't respond within time limit | Provider is overloaded or experiencing issues |
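
When reviewing many failed attempts, it can help to bucket error messages into the categories above. This is a rough sketch; the matched substrings are assumptions, since real provider error text varies, so adjust the rules to what you actually see in the failed attempt detail:

```python
# Rough sketch: bucket a provider error message into the failure types above.
# The matched substrings are illustrative, not exhaustive.

def classify_failure(error_message):
    msg = error_message.lower()
    if "429" in msg or "rate limit" in msg:
        return "Rate limiting"
    if any(code in msg for code in ("500", "502", "503")):
        return "Server errors"
    if "model" in msg and ("not found" in msg or "unavailable" in msg):
        return "Model unavailable"
    if "401" in msg or "invalid api key" in msg or "unauthorized" in msg:
        return "Authentication error"
    if "content policy" in msg or "safety" in msg:
        return "Content policy"
    if "timeout" in msg or "timed out" in msg:
        return "Timeout"
    return "Other"

print(classify_failure("429 Too Many Requests"))  # -> Rate limiting
```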

Step 3: Review reliability over time

  • Is a provider consistently failing? – May need to be removed or deprioritized
  • Are failures time-based? – Some providers have peak-hour degradation
  • Are failures model-specific? – The provider may be fine, but a specific model is unreliable
  • Are failures increasing? – May indicate a worsening provider issue
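
To answer these questions across many requests, you can tally failed attempts by provider, by model, and by hour of day. A minimal sketch over records shaped like the earlier example (the provider, model, and timestamp field names are assumptions):

```python
from collections import Counter
from datetime import datetime

# Minimal sketch: tally failed attempts by provider, model, and hour of day.
# Field names are assumptions about the record shape, not the actual schema.

def failure_patterns(failed_attempts):
    by_provider = Counter(a["provider"] for a in failed_attempts)
    by_model = Counter((a["provider"], a["model"]) for a in failed_attempts)
    by_hour = Counter(
        datetime.fromisoformat(a["timestamp"]).hour for a in failed_attempts
    )
    return by_provider, by_model, by_hour
```

High counts for one provider suggest deprioritizing it; counts clustered in a few hours point to peak-hour degradation; counts concentrated on a single provider/model pair suggest the issue is model-specific rather than provider-wide.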

Step 4: Optimize provider configuration

Based on your analysis:

  • Reorder fallback providers – Put the most reliable provider first
  • Remove unreliable providers – If a provider consistently fails, remove it from the workflow
  • Diversify providers – Use providers from different vendors to reduce correlated failures
  • Update credentials – If authentication errors occur, rotate API keys
  • Adjust rate limits – If rate limiting is frequent, consider upgrading your provider plan
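
Conceptually, most of these changes come down to editing the provider list in your workflow. The structure below is illustrative only, not ModelRiver's actual workflow schema, and the provider and model names are placeholders; it shows the most reliable provider promoted to first position and a consistently failing provider removed:

```python
# Illustrative only: not ModelRiver's actual workflow schema.
workflow_before = {
    "providers": [
        "provider-a/model-x",  # primary, but failing often
        "provider-b/model-y",  # reliable fallback
        "provider-c/model-z",
    ],
}

workflow_after = {
    "providers": [
        "provider-b/model-y",  # most reliable provider now first
        "provider-c/model-z",  # different vendor, reduces correlated failures
        # provider-a removed until its failure rate recovers
    ],
}
```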

Provider-specific considerations

Rate limiting patterns

Each provider has different rate limits:

  • Per-minute request limits – Too many requests in a short window
  • Per-minute token limits – Total tokens consumed too quickly
  • Per-day limits – Daily quota exceeded

How to identify: Look for 429 errors in failed attempt details. Rate limit errors typically come in bursts during high-traffic periods.

How to address:

  • Spread requests more evenly over time
  • Use multiple providers to distribute load
  • Upgrade provider plans for higher limits
  • Implement client-side rate limiting (see the sketch after this list)
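
As one example of client-side rate limiting, a small sliding-window limiter can cap outgoing requests before they reach a provider. This is a minimal, single-threaded sketch rather than production code:

```python
import time

class RateLimiter:
    """Allow at most `max_requests` calls per `window` seconds."""

    def __init__(self, max_requests, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.sent = []  # timestamps of recent requests

    def wait_for_slot(self):
        now = time.monotonic()
        # Keep only timestamps still inside the window.
        self.sent = [t for t in self.sent if now - t < self.window]
        if len(self.sent) >= self.max_requests:
            # Sleep until the oldest request falls out of the window.
            time.sleep(self.window - (now - self.sent[0]))
            now = time.monotonic()
            self.sent = [t for t in self.sent if now - t < self.window]
        self.sent.append(now)

limiter = RateLimiter(max_requests=60, window=60.0)  # roughly one request per second
# limiter.wait_for_slot()  # call before each provider request
```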

Provider outages

How to identify: Sudden spike in failures from a single provider, with 500/502/503 errors.

How to address:

  • ModelRiver's fallback system handles this automatically
  • Monitor the timeline to confirm fallbacks are working
  • Check the provider's status page for outage announcements
  • Consider temporarily removing the provider from workflows if outages are prolonged
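
One way to notice this pattern early in your own tooling is to compare each provider's recent server-error count against its earlier baseline. A rough sketch over records shaped like the earlier examples (field names and offset-aware ISO-8601 timestamps are assumptions):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def outage_suspects(failed_attempts, factor=3.0):
    """Flag providers whose 5xx failures in the last hour far exceed
    their average hourly count over the previous day."""
    now = datetime.now(timezone.utc)
    recent, baseline = Counter(), Counter()
    for a in failed_attempts:
        if not any(code in a["error"] for code in ("500", "502", "503")):
            continue
        # Assumes timestamps carry a UTC offset, e.g. "2024-01-01T12:00:00+00:00".
        age = now - datetime.fromisoformat(a["timestamp"])
        if age <= timedelta(hours=1):
            recent[a["provider"]] += 1
        elif age <= timedelta(hours=25):
            baseline[a["provider"]] += 1
    return [
        provider for provider, count in recent.items()
        if count > factor * max(baseline[provider] / 24, 1)
    ]
```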

Model deprecations

How to identify: Consistent "model not found" or "model unavailable" errors.

How to address:

  • Update your workflow to use the replacement model
  • Check the provider's documentation for model lifecycle announcements
  • Set up monitoring to catch deprecation warnings early (a small sketch follows this list)
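
For the monitoring step, one lightweight option is to scan recent failed attempts for model-availability errors and report which provider/model pairs they concern. A sketch under the same assumed record shape as the earlier examples:

```python
from collections import Counter

def deprecation_candidates(failed_attempts):
    """Count model-availability errors per (provider, model) pair."""
    phrases = ("model not found", "model unavailable", "deprecated")
    return Counter(
        (a["provider"], a["model"])
        for a in failed_attempts
        if any(p in a["error"].lower() for p in phrases)
    ).most_common()
```

Consistent non-zero counts for the same pair suggest the model is being retired and the workflow should move to its replacement.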

Making data-driven decisions

When to keep a provider

  • Failure rate is low (< 5%)
  • Failures are transient and self-recovering
  • Provider offers unique models or capabilities
  • Cost-performance ratio is favorable

When to remove a provider

  • Consistent failure rate above 10%
  • Frequent rate limiting despite reasonable usage
  • Provider latency is significantly higher than alternatives
  • Authentication or credential issues are recurring

When to change provider order

  • Primary provider has higher failure rate than secondary
  • Secondary provider has better latency for your use case
  • Primary provider is more expensive and failing often (double cost impact)
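
To make these checks repeatable, you can compute per-provider failure rates from attempt records and apply the thresholds from this page. A minimal sketch (the record shape and field names are assumptions):

```python
from collections import Counter

def provider_failure_rates(attempts):
    """Per-provider failure rate across all logged attempts."""
    totals, failures = Counter(), Counter()
    for a in attempts:
        totals[a["provider"]] += 1
        if a["status"] == "failed":
            failures[a["provider"]] += 1
    return {p: failures[p] / totals[p] for p in totals}

def recommendation(rate):
    if rate > 0.10:
        return "consider removing or deprioritizing"
    if rate > 0.05:
        return "above the healthy benchmark; watch closely"
    return "keep"

# Hypothetical sample data for illustration.
attempts = [
    {"provider": "provider-a", "status": "failed"},
    {"provider": "provider-a", "status": "success"},
    {"provider": "provider-b", "status": "success"},
]

for provider, rate in provider_failure_rates(attempts).items():
    print(f"{provider}: {rate:.0%} failures, {recommendation(rate)}")
```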

Provider reliability interacts with other observability metrics:

  • Performance – Unreliable providers add latency through failover attempts. See Performance Monitoring.
  • Cost – Failed attempts still consume tokens on some providers. See Cost Analysis.
  • Debugging – Provider failures can cause unexpected results if fallback models behave differently. See Debugging.

Next steps