Systematic failure troubleshooting

When requests fail, follow a structured approach to identify root causes using error patterns, timeline analysis, and payload inspection.

Overview

Request failures happen for many reasons — provider outages, rate limits, malformed inputs, expired credentials, or content policy violations. Request Logs give you the tools to quickly categorize, diagnose, and resolve failures systematically.


Failure categories

Provider-level failures

Failures caused by the AI provider:

| Error type | Example message | Typical cause | Resolution |
| --- | --- | --- | --- |
| Rate limit | rate_limit_exceeded | Too many requests to provider | Back off, add more providers |
| Model unavailable | model_not_found | Model deprecated or temporarily unavailable | Switch model or wait |
| Authentication | invalid_api_key | API key expired or revoked | Update credentials in Settings |
| Content policy | content_filtered | Request violated provider guidelines | Review and modify prompt content |
| Server error | internal_server_error | Provider infrastructure issue | Rely on failover; wait for resolution |
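For rate-limit errors specifically, the standard mitigation is exponential backoff with jitter before retrying. A minimal sketch (the function names here are illustrative, not part of any SDK):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: the wait grows as
    base * 2^attempt, capped, then a random point in that range is used
    so retries from many clients don't synchronize."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retry(request_fn, max_attempts: int = 5):
    """Retry request_fn when it raises, sleeping between attempts."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(backoff_delay(attempt))
```

Adding a second or third provider as failover reduces how often this retry path is hit at all.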

Webhook delivery failures

Failures in delivering notifications to your backend:

| Error type | Example | Typical cause | Resolution |
| --- | --- | --- | --- |
| Connection refused | ECONNREFUSED | Endpoint is down | Fix or restart your server |
| Timeout | Request timeout after 30s | Endpoint is too slow | Optimize endpoint performance |
| DNS failure | ENOTFOUND | Domain doesn't resolve | Check webhook URL configuration |
| SSL error | UNABLE_TO_VERIFY_LEAF_SIGNATURE | Certificate issue | Fix SSL certificate |
| Non-2xx response | HTTP 500 | Your endpoint returned an error | Check your server logs |
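Most timeout and non-2xx failures come from doing heavy work inside the webhook handler. A common pattern is to validate the payload, enqueue it, and acknowledge immediately, keeping the response well under the 30-second delivery timeout. A minimal sketch (framework-agnostic; `handle_webhook` and the queue are illustrative):

```python
import json
import queue

# A worker thread or background task drains this queue and does the
# actual processing, decoupled from the HTTP response.
work_queue: "queue.Queue[dict]" = queue.Queue()

def handle_webhook(raw_body: bytes) -> int:
    """Validate the payload and acknowledge immediately with a 2xx.

    Returns the HTTP status code the endpoint should send."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400  # malformed body: reject rather than fail later
    work_queue.put(payload)  # hand off to the background worker
    return 200
```

This keeps the endpoint fast even when downstream processing is slow, which addresses both the Timeout and Non-2xx rows above.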

Callback failures

Failures in event-driven workflow callbacks:

| Error type | Typical cause | Resolution |
| --- | --- | --- |
| Timeout (5 min) | Backend didn't call back in time | Optimize backend processing speed |
| Invalid payload | Callback data doesn't match expected format | Review callback documentation |
| Missing callback | Backend never called the callback URL | Verify webhook handling code |
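Because the callback window is fixed at 5 minutes, a backend can bound its own processing time and report a failure instead of silently missing the deadline. One way to sketch this, using Python's standard `concurrent.futures` (the wrapper name is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

CALLBACK_DEADLINE_S = 300.0  # callbacks must arrive within 5 minutes

def run_before_deadline(task, deadline_s: float = CALLBACK_DEADLINE_S):
    """Run `task`, but stop waiting before the callback window closes
    so the backend can send an error callback instead of timing out."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(task)
        try:
            return ("ok", future.result(timeout=deadline_s))
        except FutureTimeout:
            return ("timeout", None)  # deadline passed: report failure now
    finally:
        pool.shutdown(wait=False)  # don't block on the stuck task
```

In practice you would set `deadline_s` below 300 to leave headroom for the callback HTTP request itself.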

Step-by-step troubleshooting

1. Assess the scope

Before diving into individual failures, understand the scope:

  1. Filter to Live mode and Error status
  2. Count the number of failures in the affected time period
  3. Look for patterns:
    • All requests failing? → Likely a systemic issue (bad credentials, provider outage)
    • Specific models failing? → Provider-specific issue
    • Intermittent failures? → Rate limits or transient errors
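The pattern check above can be sketched as a small triage function over exported log entries. Field names and the shape of `failures` are illustrative, not a built-in Request Logs API:

```python
from collections import Counter

def classify_scope(failures, total_requests):
    """Rough triage over a window of failed request-log entries.

    `failures` is a list of dicts with at least a "model" key."""
    if not failures:
        return "healthy"
    if len(failures) == total_requests:
        return "systemic"            # everything failing: credentials/outage
    models = Counter(f["model"] for f in failures)
    if len(models) == 1:
        return "provider-specific"   # one model accounts for all failures
    return "intermittent"            # scattered: rate limits or transient errors
```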

2. Categorize the failure

Click a failed request and inspect the timeline:

Scenario A: All provider attempts failed
OpenAI gpt-4o failed 150ms
Anthropic claude-3.5 failed 120ms
Google gemini-1.5 failed 180ms
No successful request
Check: Are all providers rejecting the same input?
 
Scenario B: Provider succeeded, webhook failed
OpenAI gpt-4o success 1,200ms
Webhook delivery error 30,000ms
Check: Is your webhook endpoint responding?
 
Scenario C: Everything worked except callback
OpenAI gpt-4o success 1,200ms
Webhook delivery success 45ms
Backend callback timeout 300,000ms
Check: Is your backend processing and calling back?

3. Inspect error details

Click the failed item and review:

Provider errors:

```json
{
  "error": {
    "type": "invalid_request_error",
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 135420 tokens. Please reduce the length of the messages.",
    "code": "context_length_exceeded"
  }
}
```
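If you handle errors like this programmatically, the machine-readable `code` field is the one to branch on, not the human-readable message. A small sketch:

```python
import json

def provider_error_code(raw: str):
    """Extract the machine-readable error code from a provider error body,
    or None if the body isn't the expected JSON shape."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return body.get("error", {}).get("code")
```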

Webhook errors:

Status: Error
HTTP Status: 502
Error: Bad Gateway
Duration: 120ms
URL: https://api.yourapp.com/webhooks/modelriver

4. Apply the fix

Based on the error category:

  • Rate limits: Add more providers, implement request queuing, or upgrade your plan
  • Context length exceeded: Trim conversation history, summarize older messages, or use a model with a larger context window
  • Authentication failures: Navigate to Settings → Providers and update your API keys
  • Webhook failures: Fix your endpoint, then use the Retry button in Request Logs
  • Callback timeouts: Optimize your backend processing time or increase parallelism
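For context-length errors, trimming the oldest messages is usually the simplest fix. A sketch of the idea, using a deliberately rough characters-per-token heuristic (use the model's real tokenizer for production decisions):

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens: int):
    """Drop the oldest messages until the conversation fits the budget,
    always keeping the most recent message."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest -> oldest
        cost = estimate_tokens(msg["content"])
        if kept and used + cost > max_tokens:
            break                           # budget exhausted: stop keeping older turns
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Summarizing the dropped turns into a single system message is a common refinement when older context still matters.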

5. Verify the fix

After applying the fix:

  1. Run a test in the Playground
  2. Check Request Logs for the new test request
  3. Confirm the timeline shows success
  4. Monitor for 30 minutes to ensure the fix holds

Advanced failure analysis

Identifying cascading failures

When one component fails, it can cascade:

1. Provider rate limit hit → failover to secondary provider
2. Secondary provider also rate limited → failover to tertiary
3. All providers exhausted → request fails
4. Webhook not sent (no response to deliver) → backend never notified
5. Downstream features that depend on the AI response also fail

How to trace: Open the failed request, review each timeline item, and note the chain of events. The first failure is usually the root cause.
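If you export timeline items for analysis, the same rule is easy to automate: scan in order and return the first non-success item. A sketch (the item shape is illustrative):

```python
def first_failure(timeline):
    """Return the first non-success item in a request timeline.

    In a cascade, the earliest failure is usually the root cause;
    everything after it is fallout."""
    for item in timeline:
        if item["status"] != "success":
            return item
    return None  # fully successful request
```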

Provider-specific error patterns

Common patterns by provider:

OpenAI:

  • rate_limit_exceeded — Most common during peak hours
  • context_length_exceeded — Prompt too long for the selected model
  • invalid_api_key — Key rotated or revoked

Anthropic:

  • overloaded — High demand periods
  • invalid_request_error — Format mismatch

Google:

  • RESOURCE_EXHAUSTED — Quota exceeded
  • INVALID_ARGUMENT — Parameter validation failure
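Because each provider spells these conditions differently, it can help to normalize codes into common categories so alerting and retry logic don't need per-provider branches. A sketch seeded with the codes listed above (extend the map as you encounter more):

```python
# Provider-specific code -> common category, from the patterns above.
ERROR_CATEGORY = {
    "rate_limit_exceeded": "rate_limit",      # OpenAI
    "overloaded": "rate_limit",               # Anthropic
    "RESOURCE_EXHAUSTED": "rate_limit",       # Google
    "context_length_exceeded": "bad_request", # OpenAI
    "invalid_request_error": "bad_request",   # Anthropic
    "INVALID_ARGUMENT": "bad_request",        # Google
    "invalid_api_key": "auth",                # OpenAI
}

def categorize(code: str) -> str:
    """Map a provider error code onto a common category."""
    return ERROR_CATEGORY.get(code, "unknown")
```

Rate-limit categories are typically retryable with backoff; bad-request and auth categories are not, and need a code or configuration change instead.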

Next steps