Monitor provider failures proactively

Track which providers and models fail most often, detect instability trends, and make data-driven decisions about your workflow's provider priority.

Overview

Failed model attempts are one of the most important signals in Request Logs. A high failover rate suggests your primary provider is unreliable, and each failover adds latency and hidden cost to the affected request. Monitoring these failures proactively lets you adjust provider configuration before users notice degradation.


What to monitor

Failed models badge

In the Request Logs list, look for amber badges showing failed model counts:

Request #1: OpenAI gpt-4o ✓ success 1,200ms
Request #2: OpenAI gpt-4o ✓ success 3,400ms ⚠ 2 failed
Request #3: Anthropic claude ✓ success 890ms
Request #4: Google gemini ✓ success 4,100ms ⚠ 3 failed

Requests #2 and #4 ultimately succeeded, but only after falling back to other providers, which added significant latency.

Failover rate calculation

Track this metric over time:

Failover rate = Requests with failed models ÷ Total requests × 100
 
Example:
Total requests this week: 1,200
Requests with at least 1 failed model: 72
Failover rate: 6.0%
 
Healthy: < 3%
Warning: 3-8%
Critical: > 8%
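
The calculation and thresholds above can be sketched in Python. This is a minimal illustration, assuming you have already counted requests from an export of Request Logs; the function names are hypothetical and not part of any product API.

```python
def failover_rate(requests_with_failures: int, total_requests: int) -> float:
    """Return the failover rate as a percentage of total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * requests_with_failures / total_requests


def classify_failover_rate(rate_percent: float) -> str:
    """Map a failover rate to the bands listed above."""
    if rate_percent < 3:
        return "healthy"
    if rate_percent <= 8:
        return "warning"
    return "critical"


# The weekly example above: 72 of 1,200 requests had at least one failed model.
rate = failover_rate(72, 1200)
print(f"{rate:.1f}% -> {classify_failover_rate(rate)}")
```

Recomputing the rate on a fixed schedule (for example, daily) makes it easier to spot a drift from one band into the next.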

Analyzing failure patterns

By provider

Group failures by provider to find the weakest link:

Provider    Model         Total attempts   Failures   Rate   Note
OpenAI      gpt-4o        850              34         4.0%
OpenAI      gpt-4o-mini   620              52         8.4%   Problem
Anthropic   claude-3.5    480              6          1.3%
Google      gemini-1.5    390              4          1.0%
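
A breakdown like the table above can be computed from individual attempt records. The sketch below assumes each attempt is available as a (provider, model, succeeded) tuple; that record shape and the function name are assumptions for illustration, not a documented export format.

```python
from collections import defaultdict


def failure_rates(attempts):
    """Return (provider, model, total, failures, rate_percent) rows, worst first.

    `attempts` is an iterable of (provider, model, succeeded) tuples
    (a hypothetical record shape).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for provider, model, succeeded in attempts:
        key = (provider, model)
        totals[key] += 1
        if not succeeded:
            failures[key] += 1
    rows = [
        (provider, model, totals[(provider, model)], failures[(provider, model)],
         100 * failures[(provider, model)] / totals[(provider, model)])
        for provider, model in totals
    ]
    # Sort by failure rate so the weakest provider/model pair comes first.
    return sorted(rows, key=lambda row: row[4], reverse=True)


# Made-up sample data for illustration only.
sample = [
    ("OpenAI", "gpt-4o-mini", False),
    ("OpenAI", "gpt-4o-mini", True),
    ("Anthropic", "claude-3.5", True),
    ("Anthropic", "claude-3.5", True),
]
for provider, model, total, failed, rate in failure_rates(sample):
    print(f"{provider} {model}: {failed}/{total} failed ({rate:.1f}%)")
```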

By time of day

Failures often cluster during peak hours:

Hour    Requests   Failures   Rate   Note
00-06   120        2          1.7%
06-12   380        14         3.7%
12-18   520        38         7.3%   Peak failures
18-24   280        8          2.9%

This pattern suggests peak-hour rate limiting. Solutions include upgrading your provider tier or distributing load across more providers.
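
To check for this kind of clustering, requests can be bucketed into the same six-hour windows used above. The sketch assumes each request is a (timestamp, failed_model_count) pair; both the record shape and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def bucket_label(ts: datetime) -> str:
    """Return the six-hour bucket for a timestamp, e.g. '12-18'."""
    start = (ts.hour // 6) * 6
    return f"{start:02d}-{start + 6:02d}"


def failover_rate_by_bucket(requests):
    """Return {bucket_label: failover rate in percent}.

    `requests` is an iterable of (timestamp, failed_model_count) pairs
    (a hypothetical record shape).
    """
    totals = Counter()
    with_failures = Counter()
    for ts, failed_models in requests:
        label = bucket_label(ts)
        totals[label] += 1
        if failed_models > 0:
            with_failures[label] += 1
    return {label: 100 * with_failures[label] / totals[label]
            for label in sorted(totals)}


# Made-up sample data for illustration only.
sample_requests = [
    (datetime(2024, 1, 1, 13, 0), 2),
    (datetime(2024, 1, 1, 14, 0), 0),
    (datetime(2024, 1, 1, 2, 0), 0),
]
print(failover_rate_by_bucket(sample_requests))
```

Timestamps are bucketed in whatever timezone they carry, so convert them to the timezone your traffic follows before comparing buckets.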

By error type

Categorize failures to understand root causes:

Error category        Count   Share   Action
Rate limit exceeded   42      66%     Upgrade tier or add providers
Server error          12      19%     Monitor; usually transient
Timeout               6       9%      Check network; consider faster models
Auth failure          4       6%      Rotate API keys
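
One way to produce these categories is to map each failed attempt's HTTP status code to a bucket and then compute each bucket's share. The sketch assumes the status code and a timeout flag are available for every failed attempt, which may not match what your logs expose; the function names are hypothetical. The mapping relies only on standard HTTP semantics (429 for rate limiting, 401/403 for authentication and authorization, 5xx for server errors).

```python
from collections import Counter
from typing import Optional


def categorize(status_code: Optional[int], timed_out: bool = False) -> str:
    """Map one failed attempt to a category from the table above."""
    if timed_out:
        return "Timeout"
    if status_code is None:
        return "Other"
    if status_code == 429:
        return "Rate limit exceeded"
    if status_code in (401, 403):
        return "Auth failure"
    if 500 <= status_code <= 599:
        return "Server error"
    return "Other"


def category_shares(categories):
    """Return {category: share of all failures, as a rounded percentage}."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: round(100 * count / total)
            for name, count in counts.most_common()}


# Rebuild the counts from the table above: 42, 12, 6, and 4 failures.
failures = (
    [categorize(429)] * 42
    + [categorize(503)] * 12
    + [categorize(None, timed_out=True)] * 6
    + [categorize(401)] * 4
)
print(category_shares(failures))
```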

Taking action

Based on your analysis:

Finding                              Action
One provider has high failure rate   Deprioritize in workflow or add capacity
Peak-hour failures                   Add more providers or implement request queuing
Consistent rate limits               Upgrade provider tier
Auth failures                        Rotate API keys immediately
Timeout failures                     Switch to faster model or increase timeout

Setting up failure monitoring

  • Alert on failover rate > 5% over any 1-hour window
  • Alert on 3+ consecutive failures for a specific provider
  • Weekly review of per-provider failure rates
  • Monthly review of failure trends and provider performance
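
The two alert rules above can be sketched as simple checks. This is an illustration under stated assumptions: requests are (timestamp, failed_model_count) pairs, per-provider outcomes are a chronological list of booleans, and the function names are hypothetical. It evaluates only the most recent one-hour window, so it would need to run on a schedule to approximate "any 1-hour window".

```python
from datetime import datetime, timedelta


def window_failover_rate(requests, now, window=timedelta(hours=1)):
    """Return the failover rate (percent) for requests in the last `window`.

    `requests` is an iterable of (timestamp, failed_model_count) pairs
    (a hypothetical record shape).
    """
    recent = [failed for ts, failed in requests if now - window <= ts <= now]
    if not recent:
        return 0.0
    return 100 * sum(1 for failed in recent if failed > 0) / len(recent)


def has_consecutive_failures(outcomes, threshold=3):
    """Return True if `outcomes` contains `threshold` failures in a row.

    `outcomes` is a chronological list for one provider; True means success.
    """
    streak = 0
    for succeeded in outcomes:
        streak = 0 if succeeded else streak + 1
        if streak >= threshold:
            return True
    return False


# Made-up sample data for illustration only.
now = datetime(2024, 1, 1, 12, 0)
sample_requests = [
    (now - timedelta(minutes=10), 1),
    (now - timedelta(minutes=20), 0),
    (now - timedelta(hours=2), 3),  # outside the window, so ignored
]
if window_failover_rate(sample_requests, now) > 5:
    print("ALERT: failover rate above 5% in the last hour")
if has_consecutive_failures([True, False, False, False]):
    print("ALERT: 3+ consecutive failures for this provider")
```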

Next steps