Systematic failure troubleshooting

When requests fail, follow a structured approach to identify root causes using error patterns, timeline analysis, and payload inspection.

Overview

Request failures happen for many reasons — provider outages, rate limits, malformed inputs, expired credentials, or content policy violations. Request Logs give you the tools to quickly categorize, diagnose, and resolve failures systematically.


Failure categories

Provider-level failures

Failures caused by the AI provider:

| Error type | Example message | Typical cause | Resolution |
| --- | --- | --- | --- |
| Rate limit | rate_limit_exceeded | Too many requests to provider | Back off, add more providers |
| Model unavailable | model_not_found | Model deprecated or temporarily unavailable | Switch model or wait |
| Authentication | invalid_api_key | API key expired or revoked | Update credentials in Settings |
| Content policy | content_filtered | Request violated provider guidelines | Review and modify prompt content |
| Server error | internal_server_error | Provider infrastructure issue | Rely on failover; wait for resolution |
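For rate-limit errors specifically, the standard mitigation is exponential backoff with jitter before retrying. A minimal sketch (the function names here are illustrative, not part of any SDK):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: the wait grows as
    base * 2^attempt, capped, then a random point in that range is used
    so retries from many clients don't synchronize."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retry(request_fn, max_attempts: int = 5):
    """Retry request_fn when it raises, sleeping between attempts."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(backoff_delay(attempt))
```

Adding a second or third provider as failover reduces how often this retry path is hit at all.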

Webhook delivery failures

Failures in delivering notifications to your backend:

| Error type | Example | Typical cause | Resolution |
| --- | --- | --- | --- |
| Connection refused | ECONNREFUSED | Endpoint is down | Fix or restart your server |
| Timeout | Request timeout after 30s | Endpoint is too slow | Optimize endpoint performance |
| DNS failure | ENOTFOUND | Domain doesn't resolve | Check webhook URL configuration |
| SSL error | UNABLE_TO_VERIFY_LEAF_SIGNATURE | Certificate issue | Fix SSL certificate |
| Non-2xx response | HTTP 500 | Your endpoint returned an error | Check your server logs |
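Most timeout and non-2xx failures come from doing heavy work inside the webhook handler. A common pattern is to validate the payload, enqueue it, and acknowledge immediately, keeping the response well under the 30-second delivery timeout. A minimal sketch (framework-agnostic; `handle_webhook` and the queue are illustrative):

```python
import json
import queue

# A worker thread or background task drains this queue and does the
# actual processing, decoupled from the HTTP response.
work_queue: "queue.Queue[dict]" = queue.Queue()

def handle_webhook(raw_body: bytes) -> int:
    """Validate the payload and acknowledge immediately with a 2xx.

    Returns the HTTP status code the endpoint should send."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400  # malformed body: reject rather than fail later
    work_queue.put(payload)  # hand off to the background worker
    return 200
```

This keeps the endpoint fast even when downstream processing is slow, which addresses both the Timeout and Non-2xx rows above.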

Callback failures

Failures in event-driven workflow callbacks:

| Error type | Typical cause | Resolution |
| --- | --- | --- |
| Timeout (5 min) | Backend didn't call back in time | Optimize backend processing speed |
| Invalid payload | Callback data doesn't match expected format | Review callback documentation |
| Missing callback | Backend never called the callback URL | Verify webhook handling code |
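Because the callback window is fixed at 5 minutes, a backend can bound its own processing time and report a failure instead of silently missing the deadline. One way to sketch this, using Python's standard `concurrent.futures` (the wrapper name is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

CALLBACK_DEADLINE_S = 300.0  # callbacks must arrive within 5 minutes

def run_before_deadline(task, deadline_s: float = CALLBACK_DEADLINE_S):
    """Run `task`, but stop waiting before the callback window closes
    so the backend can send an error callback instead of timing out."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(task)
        try:
            return ("ok", future.result(timeout=deadline_s))
        except FutureTimeout:
            return ("timeout", None)  # deadline passed: report failure now
    finally:
        pool.shutdown(wait=False)  # don't block on the stuck task
```

In practice you would set `deadline_s` below 300 to leave headroom for the callback HTTP request itself.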

Step-by-step troubleshooting

1. Assess the scope

Before diving into individual failures, understand the scope:

  1. Filter to Live mode and Error status
  2. Count the number of failures in the affected time period
  3. Look for patterns:
    • All requests failing? → Likely a systemic issue (bad credentials, provider outage)
    • Specific models failing? → Provider-specific issue
    • Intermittent failures? → Rate limits or transient errors
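The pattern check above can be sketched as a small triage function over exported log entries. Field names and the shape of `failures` are illustrative, not a built-in Request Logs API:

```python
from collections import Counter

def classify_scope(failures, total_requests):
    """Rough triage over a window of failed request-log entries.

    `failures` is a list of dicts with at least a "model" key."""
    if not failures:
        return "healthy"
    if len(failures) == total_requests:
        return "systemic"            # everything failing: credentials/outage
    models = Counter(f["model"] for f in failures)
    if len(models) == 1:
        return "provider-specific"   # one model accounts for all failures
    return "intermittent"            # scattered: rate limits or transient errors
```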

2. Categorize the failure

Click a failed request and inspect the timeline:

Scenario A: All provider attempts failed
OpenAI gpt-4o failed 150ms
Anthropic claude-3.5 failed 120ms
Google gemini-1.5 failed 180ms
No successful request
Check: Are all providers rejecting the same input?
 
Scenario B: Provider succeeded, webhook failed
OpenAI gpt-4o success 1,200ms
Webhook delivery error 30,000ms
Check: Is your webhook endpoint responding?
 
Scenario C: Everything worked except callback
OpenAI gpt-4o success 1,200ms
Webhook delivery success 45ms
Backend callback timeout 300,000ms
Check: Is your backend processing and calling back?

3. Inspect error details

Click the failed item and review:

Provider errors:

```json
{
  "error": {
    "type": "invalid_request_error",
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 135420 tokens. Please reduce the length of the messages.",
    "code": "context_length_exceeded"
  }
}
```
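If you handle errors like this programmatically, the machine-readable `code` field is the one to branch on, not the human-readable message. A small sketch:

```python
import json

def provider_error_code(raw: str):
    """Extract the machine-readable error code from a provider error body,
    or None if the body isn't the expected JSON shape."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return body.get("error", {}).get("code")
```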

Webhook errors:

Status: Error
HTTP Status: 502
Error: Bad Gateway
Duration: 120ms
URL: https://api.yourapp.com/webhooks/modelriver

4. Apply the fix

Based on the error category:

  • Rate limits: Add more providers, implement request queuing, or upgrade your plan
  • Context length exceeded: Trim conversation history, summarize older messages, or use a model with a larger context window
  • Authentication failures: Navigate to Settings → Providers and update your API keys
  • Webhook failures: Fix your endpoint, then use the Retry button in Request Logs
  • Callback timeouts: Optimize your backend processing time or increase parallelism
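For context-length errors, trimming the oldest messages is usually the simplest fix. A sketch of the idea, using a deliberately rough characters-per-token heuristic (use the model's real tokenizer for production decisions):

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens: int):
    """Drop the oldest messages until the conversation fits the budget,
    always keeping the most recent message."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest -> oldest
        cost = estimate_tokens(msg["content"])
        if kept and used + cost > max_tokens:
            break                           # budget exhausted: stop keeping older turns
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Summarizing the dropped turns into a single system message is a common refinement when older context still matters.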

5. Verify the fix

After applying the fix:

  1. Run a test in the Playground
  2. Check Request Logs for the new test request
  3. Confirm the timeline shows success
  4. Monitor for 30 minutes to ensure the fix holds

Advanced failure analysis

Identifying cascading failures

When one component fails, it can cascade:

1. Provider rate limit hit → failover to secondary provider
2. Secondary provider also rate limited → failover to tertiary
3. All providers exhausted → request fails
4. Webhook not sent (no response to deliver) → backend never notified
5. Downstream features that depend on the AI response also fail

How to trace: Open the failed request, review each timeline item, and note the chain of events. The first failure is usually the root cause.
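If you export timeline items for analysis, the same rule is easy to automate: scan in order and return the first non-success item. A sketch (the item shape is illustrative):

```python
def first_failure(timeline):
    """Return the first non-success item in a request timeline.

    In a cascade, the earliest failure is usually the root cause;
    everything after it is fallout."""
    for item in timeline:
        if item["status"] != "success":
            return item
    return None  # fully successful request
```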

Provider-specific error patterns

Common patterns by provider:

OpenAI:

  • rate_limit_exceeded — Most common during peak hours
  • context_length_exceeded — Prompt too long for the selected model
  • invalid_api_key — Key rotated or revoked

Anthropic:

  • overloaded — High demand periods
  • invalid_request_error — Format mismatch

Google:

  • RESOURCE_EXHAUSTED — Quota exceeded
  • INVALID_ARGUMENT — Parameter validation failure
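Because each provider spells these conditions differently, it can help to normalize codes into common categories so alerting and retry logic don't need per-provider branches. A sketch seeded with the codes listed above (extend the map as you encounter more):

```python
# Provider-specific code -> common category, from the patterns above.
ERROR_CATEGORY = {
    "rate_limit_exceeded": "rate_limit",      # OpenAI
    "overloaded": "rate_limit",               # Anthropic
    "RESOURCE_EXHAUSTED": "rate_limit",       # Google
    "context_length_exceeded": "bad_request", # OpenAI
    "invalid_request_error": "bad_request",   # Anthropic
    "INVALID_ARGUMENT": "bad_request",        # Google
    "invalid_api_key": "auth",                # OpenAI
}

def categorize(code: str) -> str:
    """Map a provider error code onto a common category."""
    return ERROR_CATEGORY.get(code, "unknown")
```

Rate-limit categories are typically retryable with backoff; bad-request and auth categories are not, and need a code or configuration change instead.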

Next steps