Overview
Not all AI providers are equally reliable. Some have more frequent outages, others struggle during peak hours, and some models are more stable than others. Request Logs track every provider attempt — successful and failed — giving you concrete data to evaluate provider reliability.
Understanding provider reliability data
What gets tracked
For every request, Request Logs capture:
- Each provider attempt — Including failed attempts that triggered failover
- Failure reasons — Why each attempt failed (rate limit, server error, etc.)
- Primary Request ID — Links failed attempts to the eventual successful request
- Timing — How long each attempt took before failing or succeeding
Key reliability metrics
From Request Logs data, you can derive:
| Metric | What it measures | How to calculate |
|---|---|---|
| Success rate | % of first-attempt successes | Successful requests ÷ Total requests per provider |
| Failover rate | How often a provider's failures trigger failover | Failed attempts ÷ Total attempts per provider |
| Mean time to fail | Average latency of failed requests | Average duration of failed attempts |
| Recovery time | How long an outage lasts | Time between a provider's first failure and its next success |
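If you export the log entries, these metrics are straightforward to compute yourself. The sketch below is a minimal example that assumes each exported attempt record carries `provider`, `status`, and `duration_ms` fields; those field names are assumptions, so adjust them to your actual export format.

```python
from collections import defaultdict

def reliability_metrics(attempts):
    """Aggregate per-provider reliability metrics from exported attempt records.

    Assumed record shape (adjust to your export):
      {"provider": "OpenAI gpt-4o", "status": "success" or "failed", "duration_ms": 180}
    """
    stats = defaultdict(lambda: {"attempts": 0, "failures": 0, "fail_ms": []})
    for attempt in attempts:
        s = stats[attempt["provider"]]
        s["attempts"] += 1
        if attempt["status"] != "success":
            s["failures"] += 1
            s["fail_ms"].append(attempt["duration_ms"])

    report = {}
    for provider, s in stats.items():
        successes = s["attempts"] - s["failures"]
        report[provider] = {
            "success_rate": successes / s["attempts"],
            "failover_rate": s["failures"] / s["attempts"],
            "mean_time_to_fail_ms": (
                sum(s["fail_ms"]) / len(s["fail_ms"]) if s["fail_ms"] else None
            ),
        }
    return report
```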
Step-by-step reliability analysis
1. Review failover frequency
- Navigate to Request Logs and filter to Live mode
- Look for requests with Failed models badges (e.g., "2 failed")
- Count how many requests required failover over a time period
Example analysis over 7 days:
```
Provider              Attempts   Failures   Rate   Avg fail time
────────────────────────────────────────────────────────────────
OpenAI gpt-4o         1,250      42         3.4%   180ms
Anthropic claude-3.5  890        12         1.3%   120ms
Google gemini-1.5     620        8          1.3%   150ms
OpenAI gpt-4o-mini    980        89         9.1%   200ms
```

This tells you gpt-4o-mini has a significantly higher failure rate and should be deprioritized or investigated.
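To reproduce a count like this from exported log entries, group attempts by the request they belong to and check whether any attempt in each group failed. The sketch below assumes each record exposes a `primary_request_id` (the linkage described under "What gets tracked") and a `status` field; treat the exact field names as assumptions.

```python
from collections import defaultdict

def failover_summary(attempts):
    """Count how many logical requests needed at least one failover.

    Assumes every attempt record carries a `primary_request_id` linking
    failed attempts to the request that eventually succeeded.
    """
    by_request = defaultdict(list)
    for attempt in attempts:
        by_request[attempt["primary_request_id"]].append(attempt)

    total = len(by_request)
    with_failover = sum(
        1 for group in by_request.values()
        if any(a["status"] != "success" for a in group)
    )
    return {
        "total_requests": total,
        "requests_with_failover": with_failover,
        "failover_share": with_failover / total if total else 0.0,
    }
```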
2. Identify failure patterns
Click through failed attempts and categorize failures:
Time-based patterns:
```
Failures per hour (OpenAI gpt-4o):
  12 AM - 6 AM:   0 failures (low traffic)
  6 AM - 12 PM:   5 failures (morning ramp)
  12 PM - 6 PM:  28 failures (peak hours)
  6 PM - 12 AM:   9 failures (evening wind-down)
```

Error-based patterns:
```
OpenAI gpt-4o failure reasons:
  rate_limit_exceeded:  32 (76%)
  server_error:          7 (17%)
  timeout:               3 (7%)
```

This tells you peak-hour rate limiting is the primary issue. Consider upgrading your OpenAI tier or adding provider capacity.
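A rough way to surface both patterns from exported failures is to bucket them by time of day and by failure reason. This sketch assumes each failed attempt has an ISO 8601 `timestamp` and a `failure_reason` string such as `rate_limit_exceeded`; both field names are assumptions about the export format.

```python
from collections import Counter
from datetime import datetime

def failure_patterns(failed_attempts):
    """Bucket failed attempts by time-of-day band and by failure reason."""
    bands = ["12 AM - 6 AM", "6 AM - 12 PM", "12 PM - 6 PM", "6 PM - 12 AM"]
    by_band = Counter()
    by_reason = Counter()

    for attempt in failed_attempts:
        hour = datetime.fromisoformat(attempt["timestamp"]).hour  # 0-23
        by_band[bands[hour // 6]] += 1                            # 6-hour buckets
        by_reason[attempt["failure_reason"]] += 1

    return by_band, by_reason
```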
3. Compare provider stability
Side-by-side provider comparison:
7-Day Provider Reliability Report:

| Provider | Success rate | Avg latency | Outages | Rating |
|---|---|---|---|---|
| Anthropic claude-3.5-sonnet | 98.7% | 1,100ms | 0 | ★★★★★ |
| Google gemini-1.5-pro | 98.7% | 950ms | 0 | ★★★★★ |
| OpenAI gpt-4o | 96.6% | 1,200ms | 1 (23 min) | ★★★★☆ |
| OpenAI gpt-4o-mini | 90.9% | 420ms | 3 (45 min total) | ★★★☆☆ |

4. Adjust workflow provider priority
Based on your analysis, update your workflow's provider configuration:
Before (based on cost):
```
Priority 1: OpenAI gpt-4o-mini (cheapest)
Priority 2: Google gemini-1.5-pro
Priority 3: OpenAI gpt-4o
```

After (balanced for reliability):
```
Priority 1: Google gemini-1.5-pro (fast + reliable)
Priority 2: Anthropic claude-3.5 (reliable + high quality)
Priority 3: OpenAI gpt-4o (fallback)
```
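One way to derive such an ordering programmatically is to rank providers by measured success rate and break ties with average latency, pushing anything below a reliability floor to the back as a fallback. This is a sketch over the per-provider numbers from the 7-day report above, not a built-in feature; the `avg_latency_ms` input field is an assumption.

```python
def rank_providers(report, min_success_rate=0.95):
    """Order providers by reliability first, then by average latency.

    `report` maps provider -> {"success_rate": float, "avg_latency_ms": float}.
    Providers below `min_success_rate` sort to the back as fallbacks.
    """
    return sorted(
        report,
        key=lambda p: (
            report[p]["success_rate"] < min_success_rate,  # unreliable providers last
            -report[p]["success_rate"],                    # then most reliable first
            report[p]["avg_latency_ms"],                   # then fastest first
        ),
    )

# With the 7-day numbers from the report above:
priorities = rank_providers({
    "Google gemini-1.5-pro":       {"success_rate": 0.987, "avg_latency_ms": 950},
    "Anthropic claude-3.5-sonnet": {"success_rate": 0.987, "avg_latency_ms": 1100},
    "OpenAI gpt-4o":               {"success_rate": 0.966, "avg_latency_ms": 1200},
    "OpenAI gpt-4o-mini":          {"success_rate": 0.909, "avg_latency_ms": 420},
})
# -> ["Google gemini-1.5-pro", "Anthropic claude-3.5-sonnet",
#     "OpenAI gpt-4o", "OpenAI gpt-4o-mini"]
```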
Tracking outage impact

When a provider has an extended outage, Request Logs let you measure the impact:
- Identify the outage window — Look for a cluster of failures for a specific provider
- Count affected requests — How many requests hit the failed provider before failover
- Measure latency impact — Requests during the outage likely had higher latency due to failover attempts
- Calculate cost impact — Failed attempts may still consume tokens and incur costs
Example outage analysis:
```
OpenAI outage: Feb 10, 2:15 PM - 2:48 PM (33 minutes)

Affected requests:         47
Successfully failed over:  45 (96%)
Complete failures:          2 (4%)
Average latency increase:  2,400ms (from failover)
Estimated extra cost:      $0.38 (from failed attempts)
```
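If you want to compute these figures yourself from exported logs, the sketch below summarizes one provider's outage window. All field names (`provider`, `status`, `timestamp`, `duration_ms`, `primary_request_id`, `cost_usd`) are assumptions about the export format, and `timestamp` is assumed to already be a datetime.

```python
from collections import defaultdict

def outage_impact(attempts, provider, start, end):
    """Summarize the impact of one provider's outage window."""
    # Failed attempts against the affected provider inside the window.
    failed_here = [
        a for a in attempts
        if a["provider"] == provider
        and start <= a["timestamp"] <= end
        and a["status"] != "success"
    ]

    # Group every attempt by logical request so we can see which ones recovered.
    by_request = defaultdict(list)
    for a in attempts:
        by_request[a["primary_request_id"]].append(a)

    affected = {a["primary_request_id"] for a in failed_here}
    recovered = {
        rid for rid in affected
        if any(x["status"] == "success" for x in by_request[rid])
    }

    return {
        "affected_requests": len(affected),
        "successfully_failed_over": len(recovered),
        "complete_failures": len(affected) - len(recovered),
        # Time burned on failed attempts before failover kicked in.
        "avg_added_latency_ms": (
            sum(a["duration_ms"] for a in failed_here) / len(failed_here)
            if failed_here else 0
        ),
        "extra_cost_usd": sum(a.get("cost_usd", 0.0) for a in failed_here),
    }
```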
Setting up reliability alerts

Configure alerts based on your analysis (a sketch of the first two checks follows this list):
- Failure rate > 5% for any provider over a 1-hour window
- 3+ consecutive failures for a specific provider/model
- New error type that hasn't been seen before
- Failover rate spike — Sudden increase in requests needing fallback
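The first two rules can be expressed as simple checks over a recent window of exported attempts, for example in a scheduled monitoring job. This is a sketch, not a built-in alerting API; the `provider`, `status`, and `timestamp` fields are assumed.

```python
from collections import defaultdict

def failure_rate_alert(attempts, threshold=0.05):
    """Flag providers whose failure rate exceeds `threshold`.

    `attempts` should already be limited to the window you care about
    (e.g. the last hour); records need `provider` and `status` fields.
    """
    counts = defaultdict(lambda: [0, 0])  # provider -> [failures, total]
    for a in attempts:
        counts[a["provider"]][1] += 1
        if a["status"] != "success":
            counts[a["provider"]][0] += 1
    return [p for p, (failures, total) in counts.items() if failures / total > threshold]

def consecutive_failure_alert(attempts, limit=3):
    """Flag providers with `limit` or more failures in a row."""
    streaks, flagged = defaultdict(int), set()
    for a in sorted(attempts, key=lambda a: a["timestamp"]):
        provider = a["provider"]
        streaks[provider] = streaks[provider] + 1 if a["status"] != "success" else 0
        if streaks[provider] >= limit:
            flagged.add(provider)
    return flagged
```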
Next steps
- Webhook Delivery Monitoring — Track webhook reliability
- Performance Monitoring — Factor latency into provider decisions
- Provider Reliability Dashboard — Aggregated views
- Back to Observability — Return to the overview