Monitoring Failed Models – ModelRiver Docs

Overview

Failed model attempts are one of the most important signals in Request Logs. A high failover rate suggests your primary provider is unreliable, and each failover adds latency and hidden cost to the affected request. Monitoring these failures proactively lets you adjust provider configuration before users notice degradation.

What to monitor

Failed models badge

In the Request Logs list, look for amber badges showing failed model counts:

Request #1: OpenAI gpt-4o     ✓ success    1,200ms
Request #2: OpenAI gpt-4o     ✓ success    3,400ms   ⚠ 2 failed
Request #3: Anthropic claude  ✓ success      890ms
Request #4: Google gemini     ✓ success    4,100ms   ⚠ 3 failed

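Requests #2 and #4 ultimately succeeded, but only after falling back to other providers, which added significant latency (3,400 ms and 4,100 ms, compared with 1,200 ms and 890 ms for the requests with no failed attempts).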

Failover rate calculation

Track this metric over time:

Failover rate = (Requests with failed models ÷ Total requests) × 100
 
Example:
  Total requests this week:     1,200
  Requests with ≥1 failed model:   72
  Failover rate:                   6.0%
 
  Healthy: < 3%
  Warning: 3-8%
  Critical: > 8%
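
The calculation and thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not ModelRiver's implementation: the function names `failover_rate` and `classify_failover_rate` are hypothetical, and the two counts are assumed to come from your own tally of Request Logs.

```python
def failover_rate(requests_with_failures: int, total_requests: int) -> float:
    """Return the failover rate as a percentage of total requests."""
    if total_requests <= 0:
        return 0.0  # no traffic, so nothing failed over
    return requests_with_failures / total_requests * 100


def classify_failover_rate(rate_percent: float) -> str:
    """Map a failover rate to the bands listed above."""
    if rate_percent < 3:
        return "healthy"
    if rate_percent <= 8:
        return "warning"
    return "critical"


# Example figures from this page: 72 of 1,200 requests had a failed model.
rate = failover_rate(72, 1200)
status = classify_failover_rate(rate)
print(f"Failover rate: {rate:.1f}% ({status})")
```

Tracking the returned value per day or per week, rather than as a single number, makes it easier to see whether the rate is drifting toward the warning band.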

Analyzing failure patterns

By provider

Group failures by provider to find the weakest link:

Provider             Total attempts  Failures  Rate
─────────────────────────────────────────────────────
OpenAI gpt-4o             850          34      4.0%
OpenAI gpt-4o-mini        620          52      8.4%  ← Problem
Anthropic claude-3.5      480           6      1.3%
Google gemini-1.5         390           4      1.0%
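
A per-provider breakdown like the one above can be produced with a simple grouping pass. The sketch below assumes each attempt is available as a `(model, failed)` pair; that record shape and the function name `failure_rates_by_model` are hypothetical, so adapt them to however you export your log data.

```python
from collections import defaultdict


def failure_rates_by_model(attempts):
    """Summarize (model, failed) attempt pairs, worst failure rate first."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for model, failed in attempts:
        totals[model] += 1
        if failed:
            failures[model] += 1
    rows = [
        (model, totals[model], failures[model], failures[model] / totals[model] * 100)
        for model in totals
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Rebuild per-attempt records from the example counts in the table above.
example_counts = {
    "OpenAI gpt-4o": (850, 34),
    "OpenAI gpt-4o-mini": (620, 52),
    "Anthropic claude-3.5": (480, 6),
    "Google gemini-1.5": (390, 4),
}
attempts = []
for model, (total, failed) in example_counts.items():
    attempts += [(model, True)] * failed + [(model, False)] * (total - failed)

rows = failure_rates_by_model(attempts)
for model, total, failed, rate in rows:
    print(f"{model:<24}{total:>6}{failed:>6}{rate:>7.1f}%")
```

Sorting by rate rather than by raw failure count keeps a low-traffic but unreliable model from hiding behind a busier one.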

By time of day

Failures often cluster during peak hours:

Hour     Requests    Failures    Rate
─────────────────────────────────────
00-06      120          2       1.7%
06-12      380         14       3.7%
12-18      520         38       7.3%  ← Peak failures
18-24      280          8       2.9%

This pattern suggests peak-hour rate limiting. Solutions include upgrading your provider tier or distributing load across more providers.
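
To reproduce the time-of-day table from raw data, attempts can be bucketed into six-hour windows. This is a sketch under assumptions: each attempt is taken to be an `(hour, failed)` pair with the hour (0 to 23) already extracted from the request timestamp in a timezone of your choosing, and the helper names are hypothetical.

```python
from collections import Counter


def bucket_label(hour: int) -> str:
    """Return the six-hour window label for an hour of day, e.g. 13 -> '12-18'."""
    start = (hour // 6) * 6
    return f"{start:02d}-{start + 6:02d}"


def failure_rate_by_bucket(attempts):
    """Return {window label: failure rate in percent} for (hour, failed) pairs."""
    totals, failures = Counter(), Counter()
    for hour, failed in attempts:
        label = bucket_label(hour)
        totals[label] += 1
        failures[label] += int(failed)
    return {label: failures[label] / totals[label] * 100 for label in sorted(totals)}


# A tiny hand-made sample; real input would come from your own log export.
sample_attempts = [(2, False), (9, False), (13, True), (14, False), (20, False)]
rates = failure_rate_by_bucket(sample_attempts)
for label, rate in rates.items():
    print(f"{label}  {rate:.1f}%")
```

If the windows with the highest rates line up with your own traffic peaks, that supports the rate-limiting explanation; if they do not, the cause is more likely on the provider side.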

By error type

Categorize failures to understand root causes:

Error category          Count    %      Action
──────────────────────────────────────────────────
Rate limit exceeded       42    66%     Upgrade tier or add providers
Server error              12    19%     Monitor; usually transient
Timeout                    6     9%     Check network; consider faster models
Auth failure               4     6%     Rotate API keys
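
The categories above can be approximated by classifying the HTTP status code of each failed attempt. The mapping below follows standard HTTP semantics (429 for rate limiting, 401 and 403 for authentication, 408 and 504 for timeouts, other 5xx for server errors), but it is an assumption that your failure records expose a status code at all; the function names are hypothetical.

```python
from collections import Counter


def categorize(status_code: int) -> str:
    """Map an HTTP status code to one of the error categories in the table."""
    if status_code == 429:
        return "Rate limit exceeded"
    if status_code in (408, 504):
        return "Timeout"
    if status_code in (401, 403):
        return "Auth failure"
    if 500 <= status_code <= 599:
        return "Server error"
    return "Other"


def category_breakdown(status_codes):
    """Return (category, count, percent) tuples, most frequent first."""
    counts = Counter(categorize(code) for code in status_codes)
    total = sum(counts.values())
    return [
        (name, count, round(count / total * 100))
        for name, count in counts.most_common()
    ]


# Status codes chosen to match the example counts in the table above.
codes = [429] * 42 + [500] * 12 + [408] * 6 + [401] * 4
breakdown = category_breakdown(codes)
for name, count, percent in breakdown:
    print(f"{name:<22}{count:>4}{percent:>5}%")
```

Client-side timeouts may never produce a status code, so a real classifier would likely also need to inspect the error message or exception type.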

Taking action

Based on your analysis:

Finding                               Action
──────────────────────────────────────────────────────────────────────────────────────
One provider has high failure rate    Deprioritize in workflow or add capacity
Peak-hour failures                    Add more providers or implement request queuing
Consistent rate limits                Upgrade provider tier
Auth failures                         Rotate API keys immediately
Timeout failures                      Switch to faster model or increase timeout
