Overview
Not all AI providers are equally reliable. Some have more frequent outages, others struggle during peak hours, and some models are more stable than others. Request Logs track every provider attempt — successful and failed — giving you concrete data to evaluate provider reliability.
Understanding provider reliability data
What gets tracked
For every request, Request Logs capture:
- Each provider attempt — Including failed attempts that triggered failover
- Failure reasons — Why each attempt failed (rate limit, server error, etc.)
- Primary Request ID — Links failed attempts to the eventual successful request
- Timing — How long each attempt took before failing or succeeding
Key reliability metrics
From Request Logs data, you can derive:
| Metric | What it measures | How to calculate |
|---|---|---|
| Success rate | % of first-attempt successes | Successful requests ÷ Total requests per provider |
| Failover rate | How often a provider's failures trigger failover | Failed attempts ÷ Total attempts per provider |
| Mean time to fail | Average latency of failed requests | Average duration of failed attempts |
| Recovery time | How long an outage lasts | Time between a provider's first failure and its next success |
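If you export the log entries, these metrics are straightforward to compute yourself. The sketch below is a minimal example that assumes each exported attempt record carries `provider`, `status`, and `duration_ms` fields; those field names are assumptions, so adjust them to your actual export format.

```python
from collections import defaultdict

def reliability_metrics(attempts):
    """Aggregate per-provider reliability metrics from exported attempt records.

    Assumed record shape (adjust to your export):
      {"provider": "OpenAI gpt-4o", "status": "success" or "failed", "duration_ms": 180}
    """
    stats = defaultdict(lambda: {"attempts": 0, "failures": 0, "fail_ms": []})
    for attempt in attempts:
        s = stats[attempt["provider"]]
        s["attempts"] += 1
        if attempt["status"] != "success":
            s["failures"] += 1
            s["fail_ms"].append(attempt["duration_ms"])

    report = {}
    for provider, s in stats.items():
        successes = s["attempts"] - s["failures"]
        report[provider] = {
            "success_rate": successes / s["attempts"],
            "failover_rate": s["failures"] / s["attempts"],
            "mean_time_to_fail_ms": (
                sum(s["fail_ms"]) / len(s["fail_ms"]) if s["fail_ms"] else None
            ),
        }
    return report
```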
Step-by-step reliability analysis
1. Review failover frequency
- Navigate to Request Logs and filter to Live mode
- Look for requests with Failed models badges (e.g., "2 failed")
- Count how many requests required failover over a time period
Example analysis over 7 days:
```
Provider              Attempts   Failures   Rate   Avg fail time
────────────────────────────────────────────────────────────────
OpenAI gpt-4o         1,250      42         3.4%   180ms
Anthropic claude-3.5  890        12         1.3%   120ms
Google gemini-1.5     620        8          1.3%   150ms
OpenAI gpt-4o-mini    980        89         9.1%   200ms
```

This tells you gpt-4o-mini has a significantly higher failure rate and should be deprioritized or investigated.
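To reproduce a count like this from exported log entries, group attempts by the request they belong to and check whether any attempt in each group failed. The sketch below assumes each record exposes a `primary_request_id` (the linkage described under "What gets tracked") and a `status` field; treat the exact field names as assumptions.

```python
from collections import defaultdict

def failover_summary(attempts):
    """Count how many logical requests needed at least one failover.

    Assumes every attempt record carries a `primary_request_id` linking
    failed attempts to the request that eventually succeeded.
    """
    by_request = defaultdict(list)
    for attempt in attempts:
        by_request[attempt["primary_request_id"]].append(attempt)

    total = len(by_request)
    with_failover = sum(
        1 for group in by_request.values()
        if any(a["status"] != "success" for a in group)
    )
    return {
        "total_requests": total,
        "requests_with_failover": with_failover,
        "failover_share": with_failover / total if total else 0.0,
    }
```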
2. Identify failure patterns
Click through failed attempts and categorize failures:
Time-based patterns:
```
Failures per hour (OpenAI gpt-4o):
  12 AM - 6 AM:   0 failures (low traffic)
  6 AM - 12 PM:   5 failures (morning ramp)
  12 PM - 6 PM:  28 failures (peak hours)
  6 PM - 12 AM:   9 failures (evening wind-down)
```

Error-based patterns:
```
OpenAI gpt-4o failure reasons:
  rate_limit_exceeded:  32 (76%)
  server_error:          7 (17%)
  timeout:               3 (7%)
```

This tells you peak-hour rate limiting is the primary issue. Consider upgrading your OpenAI tier or adding provider capacity.
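A rough way to surface both patterns from exported failures is to bucket them by time of day and by failure reason. This sketch assumes each failed attempt has an ISO 8601 `timestamp` and a `failure_reason` string such as `rate_limit_exceeded`; both field names are assumptions about the export format.

```python
from collections import Counter
from datetime import datetime

def failure_patterns(failed_attempts):
    """Bucket failed attempts by time-of-day band and by failure reason."""
    bands = ["12 AM - 6 AM", "6 AM - 12 PM", "12 PM - 6 PM", "6 PM - 12 AM"]
    by_band = Counter()
    by_reason = Counter()

    for attempt in failed_attempts:
        hour = datetime.fromisoformat(attempt["timestamp"]).hour  # 0-23
        by_band[bands[hour // 6]] += 1                            # 6-hour buckets
        by_reason[attempt["failure_reason"]] += 1

    return by_band, by_reason
```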
3. Compare provider stability
Side-by-side provider comparison:
7-Day Provider Reliability Report:

| Provider | Success rate | Avg latency | Outages | Rating |
|---|---|---|---|---|
| Anthropic claude-3.5-sonnet | 98.7% | 1,100ms | 0 | ★★★★★ |
| Google gemini-1.5-pro | 98.7% | 950ms | 0 | ★★★★★ |
| OpenAI gpt-4o | 96.6% | 1,200ms | 1 (23 min) | ★★★★☆ |
| OpenAI gpt-4o-mini | 90.9% | 420ms | 3 (45 min total) | ★★★☆☆ |

4. Adjust workflow provider priority
Based on your analysis, update your workflow's provider configuration:
Before (based on cost):
```
Priority 1: OpenAI gpt-4o-mini (cheapest)
Priority 2: Google gemini-1.5-pro
Priority 3: OpenAI gpt-4o
```

After (balanced for reliability):
```
Priority 1: Google gemini-1.5-pro (fast + reliable)
Priority 2: Anthropic claude-3.5 (reliable + high quality)
Priority 3: OpenAI gpt-4o (fallback)
```
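One way to derive such an ordering programmatically is to rank providers by measured success rate and break ties with average latency, pushing anything below a reliability floor to the back as a fallback. This is a sketch over the per-provider numbers from the 7-day report above, not a built-in feature; the `avg_latency_ms` input field is an assumption.

```python
def rank_providers(report, min_success_rate=0.95):
    """Order providers by reliability first, then by average latency.

    `report` maps provider -> {"success_rate": float, "avg_latency_ms": float}.
    Providers below `min_success_rate` sort to the back as fallbacks.
    """
    return sorted(
        report,
        key=lambda p: (
            report[p]["success_rate"] < min_success_rate,  # unreliable providers last
            -report[p]["success_rate"],                    # then most reliable first
            report[p]["avg_latency_ms"],                   # then fastest first
        ),
    )

# With the 7-day numbers from the report above:
priorities = rank_providers({
    "Google gemini-1.5-pro":       {"success_rate": 0.987, "avg_latency_ms": 950},
    "Anthropic claude-3.5-sonnet": {"success_rate": 0.987, "avg_latency_ms": 1100},
    "OpenAI gpt-4o":               {"success_rate": 0.966, "avg_latency_ms": 1200},
    "OpenAI gpt-4o-mini":          {"success_rate": 0.909, "avg_latency_ms": 420},
})
# -> ["Google gemini-1.5-pro", "Anthropic claude-3.5-sonnet",
#     "OpenAI gpt-4o", "OpenAI gpt-4o-mini"]
```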
Tracking outage impact

When a provider has an extended outage, Request Logs let you measure the impact:
- Identify the outage window — Look for a cluster of failures for a specific provider
- Count affected requests — How many requests hit the failed provider before failover
- Measure latency impact — Requests during the outage likely had higher latency due to failover attempts
- Calculate cost impact — Failed attempts may still consume tokens and incur costs
Example outage analysis:
```
OpenAI outage: Feb 10, 2:15 PM - 2:48 PM (33 minutes)

Affected requests:         47
Successfully failed over:  45 (96%)
Complete failures:          2 (4%)
Average latency increase:  2,400ms (from failover)
Estimated extra cost:      $0.38 (from failed attempts)
```
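If you want to compute these figures yourself from exported logs, the sketch below summarizes one provider's outage window. All field names (`provider`, `status`, `timestamp`, `duration_ms`, `primary_request_id`, `cost_usd`) are assumptions about the export format, and `timestamp` is assumed to already be a datetime.

```python
from collections import defaultdict

def outage_impact(attempts, provider, start, end):
    """Summarize the impact of one provider's outage window."""
    # Failed attempts against the affected provider inside the window.
    failed_here = [
        a for a in attempts
        if a["provider"] == provider
        and start <= a["timestamp"] <= end
        and a["status"] != "success"
    ]

    # Group every attempt by logical request so we can see which ones recovered.
    by_request = defaultdict(list)
    for a in attempts:
        by_request[a["primary_request_id"]].append(a)

    affected = {a["primary_request_id"] for a in failed_here}
    recovered = {
        rid for rid in affected
        if any(x["status"] == "success" for x in by_request[rid])
    }

    return {
        "affected_requests": len(affected),
        "successfully_failed_over": len(recovered),
        "complete_failures": len(affected) - len(recovered),
        # Time burned on failed attempts before failover kicked in.
        "avg_added_latency_ms": (
            sum(a["duration_ms"] for a in failed_here) / len(failed_here)
            if failed_here else 0
        ),
        "extra_cost_usd": sum(a.get("cost_usd", 0.0) for a in failed_here),
    }
```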
Setting up reliability alerts

Configure alerts based on your analysis (a sketch of the first two checks follows this list):
- Failure rate > 5% for any provider over a 1-hour window
- 3+ consecutive failures for a specific provider/model
- New error type that hasn't been seen before
- Failover rate spike — Sudden increase in requests needing fallback
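The first two rules can be expressed as simple checks over a recent window of exported attempts, for example in a scheduled monitoring job. This is a sketch, not a built-in alerting API; the `provider`, `status`, and `timestamp` fields are assumed.

```python
from collections import defaultdict

def failure_rate_alert(attempts, threshold=0.05):
    """Flag providers whose failure rate exceeds `threshold`.

    `attempts` should already be limited to the window you care about
    (e.g. the last hour); records need `provider` and `status` fields.
    """
    counts = defaultdict(lambda: [0, 0])  # provider -> [failures, total]
    for a in attempts:
        counts[a["provider"]][1] += 1
        if a["status"] != "success":
            counts[a["provider"]][0] += 1
    return [p for p, (failures, total) in counts.items() if failures / total > threshold]

def consecutive_failure_alert(attempts, limit=3):
    """Flag providers with `limit` or more failures in a row."""
    streaks, flagged = defaultdict(int), set()
    for a in sorted(attempts, key=lambda a: a["timestamp"]):
        provider = a["provider"]
        streaks[provider] = streaks[provider] + 1 if a["status"] != "success" else 0
        if streaks[provider] >= limit:
            flagged.add(provider)
    return flagged
```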
Next steps
- Webhook Delivery Monitoring — Track webhook reliability
- Performance Monitoring — Factor latency into provider decisions
- Provider Reliability Dashboard — Aggregated views
- Back to Observability — Return to the overview