Analyze provider reliability with data

Track which AI providers fail most often and how frequently failovers trigger, then use that historical data to make informed decisions about provider priority in your workflows.

Overview

Not all AI providers are equally reliable. Some have more frequent outages, others struggle during peak hours, and some models are more stable than others. Request Logs track every provider attempt — successful and failed — giving you concrete data to evaluate provider reliability.


Understanding provider reliability data

What gets tracked

For every request, Request Logs capture the following (a sketch of one such record, with illustrative field names, follows the list):

  • Each provider attempt — Including failed attempts that triggered failover
  • Failure reasons — Why each attempt failed (rate limit, server error, etc.)
  • Primary Request ID — Links failed attempts to the eventual successful request
  • Timing — How long each attempt took before failing or succeeding
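
The exact export schema isn't documented here; as a rough sketch in Python, one attempt record might look like this (every field name is an assumption for illustration, not the actual Request Logs schema):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AttemptRecord:
    """One provider attempt; all field names are illustrative."""
    request_id: str                 # this attempt
    primary_request_id: str         # links failed attempts to the eventual success
    provider: str                   # e.g. "openai"
    model: str                      # e.g. "gpt-4o"
    succeeded: bool
    failure_reason: Optional[str]   # e.g. "rate_limit_exceeded"; None on success
    started_at: datetime
    duration_ms: int                # time until this attempt failed or succeeded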

Key reliability metrics

From Request Logs data, you can derive:

Metric             What it measures                            How to calculate
Success rate       % of first-attempt successes                First-attempt successes ÷ Total requests per provider
Failover rate      How often this provider triggers failover   Failed attempts ÷ Total attempts per provider
Mean time to fail  Average latency of failed attempts          Average duration of failed attempts
Recovery time      How long outages last                       Time between first failure and first subsequent success
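
If you export the logs, these metrics reduce to counting and grouping. A minimal sketch, assuming attempt records shaped like the illustrative AttemptRecord above:

from collections import defaultdict

def reliability_metrics(attempts):
    """Derive the table's metrics per (provider, model) pair from
    AttemptRecord-like objects (illustrative schema sketched above)."""
    stats = defaultdict(lambda: {"attempts": 0, "failed": 0, "fail_ms": 0,
                                 "requests": set(), "failed_requests": set()})
    for a in attempts:
        s = stats[(a.provider, a.model)]
        s["attempts"] += 1
        s["requests"].add(a.primary_request_id)
        if not a.succeeded:
            s["failed"] += 1
            s["fail_ms"] += a.duration_ms
            s["failed_requests"].add(a.primary_request_id)
    out = {}
    for key, s in stats.items():
        first_attempt_ok = len(s["requests"]) - len(s["failed_requests"])
        out[key] = {
            "success_rate": first_attempt_ok / len(s["requests"]),
            "failover_rate": s["failed"] / s["attempts"],
            "mean_time_to_fail_ms": s["fail_ms"] / s["failed"] if s["failed"] else None,
        }
    return out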

Step-by-step reliability analysis

1. Review failover frequency

  1. Navigate to Request Logs and filter to Live mode
  2. Look for requests with Failed models badges (e.g., "2 failed")
  3. Count how many requests required failover over a time period

Example analysis over 7 days:

Provider               Attempts  Failures  Failure rate  Avg fail time
OpenAI gpt-4o             1,250        42          3.4%          180ms
Anthropic claude-3.5        890        12          1.3%          120ms
Google gemini-1.5           620         8          1.3%          150ms
OpenAI gpt-4o-mini          980        89          9.1%          200ms

This tells you gpt-4o-mini fails nearly three times as often as the next-worst model and should be deprioritized or investigated.
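
A table like this can be generated from exported logs. A sketch, reusing the illustrative record shape from above (datetimes are assumed naive and local):

from datetime import datetime, timedelta

def failover_table(attempts, days=7):
    """Print per-model attempt/failure counts for the last `days` days."""
    cutoff = datetime.now() - timedelta(days=days)
    rows = {}
    for a in attempts:
        if a.started_at < cutoff:
            continue
        r = rows.setdefault(f"{a.provider} {a.model}",
                            {"attempts": 0, "failures": 0, "fail_ms": 0})
        r["attempts"] += 1
        if not a.succeeded:
            r["failures"] += 1
            r["fail_ms"] += a.duration_ms
    # Worst offenders first
    for name, r in sorted(rows.items(), key=lambda kv: -kv[1]["failures"]):
        rate = 100 * r["failures"] / r["attempts"]
        avg = r["fail_ms"] / r["failures"] if r["failures"] else 0
        print(f"{name:24} {r['attempts']:>8} {r['failures']:>8} {rate:6.1f}% {avg:8.0f}ms")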

2. Identify failure patterns

Click through failed attempts and categorize failures:

Time-based patterns:

Failures per hour (OpenAI gpt-4o):
12 AM - 6 AM: 0 failures (low traffic)
6 AM - 12 PM: 5 failures (morning ramp)
12 PM - 6 PM: 28 failures (peak hours)
6 PM - 12 AM: 9 failures (evening wind-down)
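
Bucketing failure timestamps into these windows takes only a few lines (same illustrative record shape as above):

from collections import Counter

def failures_by_window(attempts, provider="openai", model="gpt-4o"):
    """Count failures per 6-hour window of the (local) day."""
    labels = ["12 AM - 6 AM", "6 AM - 12 PM", "12 PM - 6 PM", "6 PM - 12 AM"]
    counts = Counter(
        a.started_at.hour // 6
        for a in attempts
        if a.provider == provider and a.model == model and not a.succeeded
    )
    return {labels[i]: counts.get(i, 0) for i in range(4)}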

Error-based patterns:

OpenAI gpt-4o failure reasons:
rate_limit_exceeded: 32 (76%)
server_error: 7 (17%)
timeout: 3 (7%)

This tells you peak-hour rate limiting is the primary issue. Consider upgrading your OpenAI tier or adding provider capacity.
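
The error breakdown itself is a straightforward tally over failed attempts, e.g.:

from collections import Counter

def failure_reasons(attempts, provider="openai", model="gpt-4o"):
    """Tally failure reasons for one provider/model pair
    ('failure_reason' is the illustrative field sketched earlier)."""
    return Counter(
        a.failure_reason
        for a in attempts
        if a.provider == provider and a.model == model and not a.succeeded
    )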

3. Compare provider stability

Side-by-side provider comparison:

7-Day Provider Reliability Report

Provider                     Success rate  Avg latency  Outages
Anthropic claude-3.5-sonnet         98.7%      1,100ms  0
Google gemini-1.5-pro               98.7%        950ms  0
OpenAI gpt-4o                       96.6%      1,200ms  1 (23 min)
OpenAI gpt-4o-mini                  90.9%        420ms  3 (45 min total)
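
One way to turn a comparison like this into an ordering is to sort by success rate, breaking ties on latency. A sketch (the weighting is a judgment call, not a fixed rule, and the dictionary keys are illustrative):

def rank_providers(report):
    """Most reliable first; faster average latency wins ties.
    `report` maps provider name -> {'success_rate', 'avg_latency_ms'}."""
    return sorted(report, key=lambda p: (-report[p]["success_rate"],
                                         report[p]["avg_latency_ms"]))

print(rank_providers({
    "Anthropic claude-3.5-sonnet": {"success_rate": 0.987, "avg_latency_ms": 1100},
    "Google gemini-1.5-pro":       {"success_rate": 0.987, "avg_latency_ms": 950},
    "OpenAI gpt-4o":               {"success_rate": 0.966, "avg_latency_ms": 1200},
    "OpenAI gpt-4o-mini":          {"success_rate": 0.909, "avg_latency_ms": 420},
}))
# -> gemini-1.5-pro, claude-3.5-sonnet, gpt-4o, then gpt-4o-mini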

4. Adjust workflow provider priority

Based on your analysis, update your workflow's provider configuration:

Before (based on cost):

Priority 1: OpenAI gpt-4o-mini (cheapest)
Priority 2: Google gemini-1.5-pro
Priority 3: OpenAI gpt-4o

After (balanced for reliability):

Priority 1: Google gemini-1.5-pro (fast + reliable)
Priority 2: Anthropic claude-3.5 (reliable + high quality)
Priority 3: OpenAI gpt-4o (fallback)
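
If your fallback chain lives in application code rather than in the workflow configuration, the reordering amounts to changing a list. A minimal sketch; call_provider is a hypothetical stand-in for your actual client call:

PRIORITY = [
    ("google", "gemini-1.5-pro"),   # fast + reliable
    ("anthropic", "claude-3.5"),    # reliable + high quality
    ("openai", "gpt-4o"),           # fallback
]

def call_provider(provider, model, prompt):
    """Hypothetical placeholder for your real SDK/API call."""
    raise NotImplementedError

def complete_with_failover(prompt):
    last_error = None
    for provider, model in PRIORITY:
        try:
            return call_provider(provider, model, prompt)
        except Exception as err:    # rate limit, server error, timeout, ...
            last_error = err        # fall through to the next provider
    raise RuntimeError("all providers failed") from last_error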

Tracking outage impact

When a provider has an extended outage, Request Logs let you measure the impact:

  1. Identify the outage window — Look for a cluster of failures for a specific provider
  2. Count affected requests — How many requests hit the failed provider before failover
  3. Measure latency impact — Requests during the outage likely had higher latency due to failover attempts
  4. Calculate cost impact — Failed attempts may still consume tokens and incur costs

Example outage analysis:

OpenAI outage: Feb 10, 2:15 PM - 2:48 PM (33 minutes)
 
Affected requests: 47
Successfully failed over: 45 (96%)
Complete failures: 2 (4%)
Average latency increase: 2,400ms (from failover)
Estimated extra cost: $0.38 (from failed attempts)
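
Numbers like these can be derived from exported attempt records by grouping on the Primary Request ID. A sketch, with the same illustrative schema as above:

def outage_impact(attempts, provider, start, end):
    """Summarize one outage window for one provider."""
    # Requests that hit the failing provider inside the window
    hit = {a.primary_request_id for a in attempts
           if a.provider == provider and not a.succeeded
           and start <= a.started_at <= end}
    groups = {}
    for a in attempts:
        if a.primary_request_id in hit:
            groups.setdefault(a.primary_request_id, []).append(a)
    recovered = sum(1 for g in groups.values() if any(x.succeeded for x in g))
    # Added latency = time burned on failed attempts before any success
    extra = [sum(x.duration_ms for x in g if not x.succeeded)
             for g in groups.values()]
    return {
        "affected_requests": len(hit),
        "successfully_failed_over": recovered,
        "complete_failures": len(hit) - recovered,
        "avg_added_latency_ms": sum(extra) / len(extra) if extra else 0,
    }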

Setting up reliability alerts

Configure alerts based on your analysis (a sketch of the first rule follows this list):

  • Failure rate > 5% for any provider over a 1-hour window
  • 3+ consecutive failures for a specific provider/model
  • New error type that hasn't been seen before
  • Failover rate spike — Sudden increase in requests needing fallback
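
As a sketch, the first rule might be checked on a schedule like this; the threshold and window come straight from the list above, and wiring it to a notification channel is up to you:

from datetime import datetime, timedelta

def failure_rate_alert(attempts, provider, threshold=0.05, now=None):
    """True if `provider` exceeded the failure-rate threshold in the
    last hour (illustrative record fields, as in the earlier sketches)."""
    now = now or datetime.now()
    recent = [a for a in attempts
              if a.provider == provider
              and now - a.started_at <= timedelta(hours=1)]
    if not recent:
        return False
    failed = sum(1 for a in recent if not a.succeeded)
    return failed / len(recent) > threshold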

Next steps