Overview
Failed model attempts are one of the most important signals in Request Logs. A high failover rate suggests your primary provider is unreliable, which adds latency and hidden costs to each request that has to fall back. Monitoring these failures proactively lets you adjust provider configuration before users notice degradation.
What to monitor
Failed models badge
In the Request Logs list, look for amber badges showing failed model counts:
- Request #1: OpenAI gpt-4o ✓ success 1,200ms
- Request #2: OpenAI gpt-4o ✓ success 3,400ms ⚠ 2 failed
- Request #3: Anthropic claude ✓ success 890ms
- Request #4: Google gemini ✓ success 4,100ms ⚠ 3 failed

Requests #2 and #4 ultimately succeeded but required fallback providers, which added significant latency.
Failover rate calculation
Track this metric over time:
Failover rate = (Requests with failed models ÷ Total requests) × 100

Example:
- Total requests this week: 1,200
- Requests with ≥1 failed model: 72
- Failover rate: 6.0%

Thresholds:
- Healthy: < 3%
- Warning: 3-8%
- Critical: > 8%

Analyzing failure patterns
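A useful first step in this analysis is computing the failover rate itself. The sketch below applies the formula and the healthy/warning/critical bands from the previous section. The `RequestRecord` shape and its `failed_models` field are assumptions for illustration, not the actual Request Logs export format.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical shape of one exported Request Logs entry."""
    request_id: str
    failed_models: int  # failed model attempts before the request succeeded


def failover_rate(records: list[RequestRecord]) -> float:
    """Percentage of requests that had at least one failed model."""
    if not records:
        return 0.0
    with_failures = sum(1 for r in records if r.failed_models >= 1)
    return with_failures * 100 / len(records)


def classify(rate_percent: float) -> str:
    """Map a failover rate to the bands used in this guide."""
    if rate_percent < 3:
        return "healthy"
    if rate_percent <= 8:
        return "warning"
    return "critical"


# Reproduce the worked example: 72 of 1,200 requests had a failed model.
records = [RequestRecord(f"req-{i}", 1 if i < 72 else 0) for i in range(1200)]
rate = failover_rate(records)
print(f"{rate:.1f}% -> {classify(rate)}")  # 6.0% -> warning
```

Running the same function over each week's records gives the trend line to track over time.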
By provider
Group failures by provider to find the weakest link:
| Provider | Total attempts | Failures | Rate |
|---|---|---|---|
| OpenAI gpt-4o | 850 | 34 | 4.0% |
| OpenAI gpt-4o-mini | 620 | 52 | 8.4% ← Problem |
| Anthropic claude-3.5 | 480 | 6 | 1.3% |
| Google gemini-1.5 | 390 | 4 | 1.0% |

By time of day
Failures often cluster during peak hours:
| Hour | Requests | Failures | Rate |
|---|---|---|---|
| 00-06 | 120 | 2 | 1.7% |
| 06-12 | 380 | 14 | 3.7% |
| 12-18 | 520 | 38 | 7.3% ← Peak failures |
| 18-24 | 280 | 8 | 2.9% |

This pattern suggests peak-hour rate limiting. Solutions include upgrading your provider tier or distributing load across more providers.
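The per-provider and time-of-day breakdowns can both be produced by one grouping helper. This is a minimal sketch: the `Attempt` record, its field names, and the sample data are hypothetical, and a real export may look different.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Attempt(NamedTuple):
    """Hypothetical record of one model attempt exported from the logs."""
    provider: str
    hour: int     # hour of day, 0-23
    failed: bool


def failure_rates(
    attempts: Iterable[Attempt], key: Callable[[Attempt], str]
) -> dict[str, tuple[int, int, float]]:
    """Return {group: (total attempts, failures, failure rate in %)}."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for attempt in attempts:
        group = key(attempt)
        totals[group] += 1
        if attempt.failed:
            failures[group] += 1
    return {g: (totals[g], failures[g], failures[g] * 100 / totals[g]) for g in totals}


def six_hour_window(attempt: Attempt) -> str:
    """Bucket an attempt into the 00-06, 06-12, 12-18, 18-24 windows."""
    start = attempt.hour // 6 * 6
    return f"{start:02d}-{start + 6:02d}"


# Illustrative data only; not taken from real logs.
attempts = [
    Attempt("OpenAI gpt-4o", 13, True),
    Attempt("OpenAI gpt-4o", 14, False),
    Attempt("OpenAI gpt-4o", 2, False),
    Attempt("OpenAI gpt-4o", 9, False),
    Attempt("Anthropic claude-3.5", 13, False),
    Attempt("Anthropic claude-3.5", 20, False),
]

by_provider = failure_rates(attempts, key=lambda a: a.provider)
by_window = failure_rates(attempts, key=six_hour_window)

# List the groups with the highest failure rate first.
for group, (total, failed, rate) in sorted(
    by_provider.items(), key=lambda item: item[1][2], reverse=True
):
    print(f"{group:<24}{total:>6}{failed:>6}{rate:>7.1f}%")
```

Passing a different `key` function is enough to switch between the two views, so the counting logic stays in one place.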
By error type
Categorize failures to understand root causes:
| Error category | Count | % | Action |
|---|---|---|---|
| Rate limit exceeded | 42 | 66% | Upgrade tier or add providers |
| Server error | 12 | 19% | Monitor; usually transient |
| Timeout | 6 | 9% | Check network; consider faster models |
| Auth failure | 4 | 6% | Rotate API keys |

Taking action
Based on your analysis:
| Finding | Action |
|---|---|
| One provider has high failure rate | Deprioritize in workflow or add capacity |
| Peak-hour failures | Add more providers or implement request queuing |
| Consistent rate limits | Upgrade provider tier |
| Auth failures | Rotate API keys immediately |
| Timeout failures | Switch to faster model or increase timeout |
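To see which finding applies, it can help to tally failures by error category first. The sketch below reuses the counts and actions from the error-type breakdown earlier in this guide. The category keys (`rate_limit`, `server_error`, and so on) and the fallback action string are assumed labels, not identifiers defined by Request Logs.

```python
from collections import Counter

# Actions copied from the error-type breakdown in this guide.
# The dictionary keys are hypothetical category labels.
ACTIONS = {
    "rate_limit": "Upgrade tier or add providers",
    "server_error": "Monitor; usually transient",
    "timeout": "Check network; consider faster models",
    "auth_failure": "Rotate API keys",
}


def summarize(categories: list[str]) -> list[tuple[str, int, int, str]]:
    """Return (category, count, share in %, suggested action), most common first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return [
        # "Investigate manually" is a placeholder for unrecognized categories.
        (cat, n, round(n * 100 / total), ACTIONS.get(cat, "Investigate manually"))
        for cat, n in counts.most_common()
    ]


# Same counts as the error-type table: 42 + 12 + 6 + 4 = 64 failures.
failures = (
    ["rate_limit"] * 42
    + ["server_error"] * 12
    + ["timeout"] * 6
    + ["auth_failure"] * 4
)

summary = summarize(failures)
for category, count, share, action in summary:
    print(f"{category:<14}{count:>4}{share:>5}%  {action}")
```

The first row of the summary is the dominant cause, which is usually the one to act on first.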
Setting up failure monitoring
Recommended thresholds
- Alert on failover rate > 5% over any 1-hour window
- Alert on 3+ consecutive failures for a specific provider
- Weekly review of per-provider failure rates
- Monthly review of failure trends and provider performance
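The two alert rules above can be sketched as a small in-memory monitor. This is an illustration under stated assumptions: `FailoverMonitor` and its method names are hypothetical, events are assumed to arrive in timestamp order, and the 1-hour window is treated as a trailing window ending at the latest request.

```python
from collections import deque
from datetime import datetime, timedelta


class FailoverMonitor:
    """Hypothetical helper that checks the two alert rules listed above."""

    def __init__(
        self,
        rate_threshold: float = 5.0,
        window: timedelta = timedelta(hours=1),
        max_consecutive: int = 3,
    ) -> None:
        self.rate_threshold = rate_threshold
        self.window = window
        self.max_consecutive = max_consecutive
        self._requests: deque[tuple[datetime, bool]] = deque()
        self._streaks: dict[str, int] = {}

    def record_request(self, ts: datetime, had_failed_model: bool) -> bool:
        """Record a request; True if the trailing-window failover rate is too high."""
        self._requests.append((ts, had_failed_model))
        cutoff = ts - self.window
        # Drop requests that have aged out of the window.
        while self._requests[0][0] <= cutoff:
            self._requests.popleft()
        failed = sum(1 for _, flag in self._requests if flag)
        return failed * 100 / len(self._requests) > self.rate_threshold

    def record_attempt(self, provider: str, failed: bool) -> bool:
        """Record one model attempt; True on 3+ consecutive failures for a provider."""
        streak = (self._streaks.get(provider, 0) + 1) if failed else 0
        self._streaks[provider] = streak
        return streak >= self.max_consecutive


monitor = FailoverMonitor()
start = datetime(2024, 1, 1, 12, 0)

# 100 requests over 50 minutes, 6 of them with a failed model (6% > 5%).
rate_alert = False
for i in range(100):
    rate_alert = monitor.record_request(start + timedelta(seconds=30 * i), i % 17 == 0)

streak_alerts = [monitor.record_attempt("OpenAI gpt-4o-mini", True) for _ in range(3)]
print(rate_alert, streak_alerts)
```

With very little traffic a single failure can push the rate over 5%, so a production version would probably also require a minimum number of requests in the window before alerting.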
Next steps
- Tracking Webhook Reliability — Monitor webhook delivery
- Provider Reliability Analysis — Deep dive scenario
- Back to Best Practices — Return to the overview