Debugging production issues with Request Logs

When a user reports that an AI feature isn't working, Request Logs give you visibility into the complete request lifecycle so you can pinpoint the root cause.

Overview

Production debugging is one of the most critical use cases for Request Logs. When users report issues with AI-powered features, you need a clear path from symptom to root cause. ModelRiver captures every detail of every request, so you never have to guess what went wrong.


Step-by-step debugging workflow

1. Filter to production traffic

Start by isolating real user requests:

  1. Navigate to Request Logs in your project console
  2. Select Live mode from the filter dropdown
  3. Narrow the time range to the period when the issue was reported

Why Live mode? Test mode and playground requests create noise. Filtering to Live mode ensures you're only looking at real production traffic.
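If you export logs for offline analysis, the same filter can be applied programmatically. This is a minimal sketch assuming exported records are dicts with `mode` and `timestamp` fields; those field names are illustrative, not ModelRiver's actual export schema.

```python
from datetime import datetime

def filter_live(logs, since):
    """Keep only Live-mode requests observed at or after `since`."""
    return [r for r in logs if r["mode"] == "live" and r["timestamp"] >= since]

# Illustrative records; "mode" and "timestamp" are assumed field names.
logs = [
    {"id": "req_1", "mode": "live", "timestamp": datetime(2024, 5, 1, 14, 5)},
    {"id": "req_2", "mode": "test", "timestamp": datetime(2024, 5, 1, 14, 6)},
    {"id": "req_3", "mode": "live", "timestamp": datetime(2024, 5, 1, 12, 0)},
]

recent_live = filter_live(logs, since=datetime(2024, 5, 1, 14, 0))
```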

2. Identify the failing request

Look for visual indicators in the log list:

  • Red error badges — Requests that failed entirely
  • Failed models count — Amber badges showing the number of failed provider attempts before success (e.g., "2 failed")
  • Duration anomalies — Unusually long request times may indicate multiple failover attempts
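Duration anomalies can also be spotted mechanically in an exported log set. The sketch below uses a rough heuristic (duration above a multiple of the median); the threshold factor and record shape are assumptions, not a ModelRiver feature.

```python
def flag_duration_anomalies(durations_ms, factor=3.0):
    """Return request IDs whose duration exceeds `factor` x the (upper) median,
    a rough heuristic for spotting likely failover chains."""
    ordered = sorted(durations_ms.values())
    median = ordered[len(ordered) // 2]
    return [rid for rid, ms in durations_ms.items() if ms > factor * median]

# Illustrative durations in milliseconds.
durations = {"req_a": 900, "req_b": 1100, "req_c": 1000, "req_d": 6200}
suspects = flag_duration_anomalies(durations)
```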

3. Inspect the timeline

Click the request to open the detail view. The timeline reveals the complete story:

Timeline
  • OpenAI gpt-4o — failed — 120ms
  • Anthropic claude-3 — failed — 340ms
  • Google gemini-1.5 — success — 890ms
  • Webhook delivery — success — 45ms
  • Backend callback — success — 210ms

This example tells us the request ultimately succeeded but required two failover attempts. If the issue is intermittent, this pattern helps explain latency spikes.
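To quantify how much of the latency comes from failover, you can sum the failed attempts separately. A small sketch over the timeline above, with entries modeled as (name, status, duration-in-ms) tuples (an assumed shape, not an exported format):

```python
# Timeline entries as (name, status, duration_ms) tuples.
timeline = [
    ("OpenAI gpt-4o", "failed", 120),
    ("Anthropic claude-3", "failed", 340),
    ("Google gemini-1.5", "success", 890),
]

failed_attempts = [name for name, status, _ in timeline if status == "failed"]
# Time spent on attempts that never produced a response.
failover_overhead_ms = sum(ms for _, status, ms in timeline if status == "failed")
total_model_ms = sum(ms for _, _, ms in timeline)
```

Here roughly a third of the model-side latency is failover overhead, which is exactly the kind of pattern that explains intermittent latency spikes.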

4. Examine request and response payloads

Click each timeline item to drill into the payload:

Request Body example:

JSON
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant for our e-commerce platform."
    },
    {
      "role": "user",
      "content": "What's the return policy for electronics?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 500
}

Check for these common issues:

  • Missing or malformed messages array
  • Incorrect temperature or max_tokens values
  • Truncated system prompts
  • Unexpected special characters in user content
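The checklist above can be turned into a quick validator for exported request bodies. This is a sketch, assuming a chat-style body with `messages`, `temperature`, and `max_tokens` keys; the accepted temperature range is an assumption you should adjust per provider.

```python
def check_request_body(body, max_temperature=2.0):
    """Return human-readable problems found in a chat-style request body."""
    problems = []
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("missing or malformed messages array")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                problems.append(f"message {i} is missing role or content")
    temperature = body.get("temperature")
    if temperature is not None and not 0 <= temperature <= max_temperature:
        problems.append("temperature out of range")
    if body.get("max_tokens", 1) < 1:
        problems.append("max_tokens below 1")
    return problems
```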

Response Body (error example):

JSON
{
  "error": {
    "type": "rate_limit_error",
    "message": "You have exceeded your rate limit. Please retry after 30 seconds.",
    "code": "rate_limit_exceeded"
  }
}

5. Trace the failover chain

When you see failed model attempts:

  1. Click each failed attempt to view its error message
  2. Note whether failures are provider-specific (all OpenAI attempts failing) or model-specific (only gpt-4o failing, but gpt-4o-mini succeeding)
  3. Common failure reasons:
    • Rate limiting — Too many requests to the provider
    • Model unavailable — Provider is experiencing an outage
    • Authentication failure — API key expired or invalid
    • Content policy violation — Request content was filtered
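The provider-specific vs. model-specific distinction from step 2 can be computed over a batch of attempts. A sketch under the assumption that each attempt is a (provider, model, status) tuple; this is an offline-analysis helper, not a ModelRiver API.

```python
from collections import defaultdict

def classify_failures(attempts):
    """attempts: (provider, model, status) tuples. Returns providers where
    every attempt failed, and models that failed while another model on the
    same provider succeeded."""
    by_provider = defaultdict(list)
    for provider, model, status in attempts:
        by_provider[provider].append((model, status))
    # Every attempt against the provider failed -> likely provider-wide.
    provider_wide = [p for p, results in by_provider.items()
                     if all(status == "failed" for _, status in results)]
    # A model failed while a sibling on the same provider succeeded.
    model_specific = [model for results in by_provider.values()
                      if any(status == "success" for _, status in results)
                      for model, status in results if status == "failed"]
    return provider_wide, model_specific

attempts = [
    ("openai", "gpt-4o", "failed"),
    ("openai", "gpt-4o-mini", "success"),
    ("anthropic", "claude-3", "failed"),
]
provider_wide, model_specific = classify_failures(attempts)
```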

Real-world debugging examples

Example 1: Chatbot returning empty responses

Symptom: Users report that the chatbot sometimes returns blank messages.

Investigation:

  1. Filter to Live mode, focus on the last 2 hours
  2. Find requests where the response has 0 output tokens
  3. Inspect the request body — the messages array is correct
  4. Inspect the response body — the provider returned finish_reason: "length" with max_tokens: 1
  5. Root cause: A workflow update accidentally set max_tokens to 1

Fix: Update the workflow's max_tokens parameter back to the intended value.
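Step 2 of this investigation — finding responses with zero output tokens or a truncating finish reason — is easy to script over exported logs. The `output_tokens` and `finish_reason` field names are assumptions about the export shape.

```python
def find_truncated(responses):
    """Flag responses that came back empty or cut off: zero output tokens
    or finish_reason 'length'."""
    return [r["id"] for r in responses
            if r.get("output_tokens", 0) == 0 or r.get("finish_reason") == "length"]

# Illustrative records; field names are assumed.
responses = [
    {"id": "r1", "finish_reason": "stop", "output_tokens": 42},
    {"id": "r2", "finish_reason": "length", "output_tokens": 1},
    {"id": "r3", "finish_reason": "stop", "output_tokens": 0},
]
suspicious = find_truncated(responses)
```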

Example 2: Intermittent 500 errors on async requests

Symptom: About 10% of async AI requests fail silently.

Investigation:

  1. Filter to errors in Live mode
  2. Open a failed request — the timeline shows the AI request succeeded
  3. The webhook delivery shows Error status with "Connection refused"
  4. Check multiple failed requests — all have the same webhook URL
  5. Root cause: The webhook endpoint server was running out of memory during peak hours

Fix: Scale the webhook endpoint server and add health monitoring.
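On the receiving side, transient "Connection refused" errors like these are also a reason to wrap webhook delivery in retries with backoff. A hedged sketch of the pattern (not a ModelRiver feature — delivery behavior on their side may differ): `send` is any callable that raises `ConnectionError` on failure.

```python
import time

def deliver_with_backoff(send, payload, attempts=3, base_delay=0.5):
    """Call `send(payload)`, retrying with exponential backoff on connection
    errors so a briefly overloaded endpoint gets time to recover."""
    for attempt in range(attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of retries; surface the failure.
            time.sleep(base_delay * 2 ** attempt)
```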

Example 3: Slow response times during peak hours

Symptom: AI responses take 5-10 seconds during peak hours, normally 1-2 seconds.

Investigation:

  1. Filter to Live mode, focus on peak hours (2-4 PM)
  2. Sort by duration — many requests show 4000-8000ms
  3. Open a slow request — the timeline shows 2-3 failover attempts before success
  4. Failed attempts show rate_limit_error from the primary provider
  5. Root cause: Primary provider rate limits are being hit during peak traffic

Fix: Add additional provider capacity or implement request queuing during peak hours.
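One way to add provider capacity at the application level is to route around providers you know are currently rate limited. A minimal sketch of that routing decision; how you track the `rate_limited` set (e.g., from recent `rate_limit_error` responses) is up to your application.

```python
def choose_provider(preference_order, rate_limited):
    """Pick the first provider not currently rate limited; fall back to the
    primary if every provider is limited."""
    for provider in preference_order:
        if provider not in rate_limited:
            return provider
    return preference_order[0]
```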


Best practices for production debugging

  • Compare failing and succeeding requests side by side to spot payload differences
  • Use the timeline to understand the full request lifecycle, not just the final result
  • Check for patterns — single failures may be transient, but repeated failures indicate systemic issues
  • Monitor the Provider Reliability page for trend analysis
  • Keep production logs clean by using test mode for development

Next steps