Overview
Production debugging is one of the most critical use cases for Request Logs. When users report issues with AI-powered features, you need a clear path from symptom to root cause. ModelRiver captures every detail of every request, so you never have to guess what went wrong.
Step-by-step debugging workflow
1. Filter to production traffic
Start by isolating real user requests:
- Navigate to Request Logs in your project console
- Select Live mode from the filter dropdown
- Narrow the time range to the period when the issue was reported
Why Live mode? Test mode and playground requests create noise. Filtering to Live mode ensures you're only looking at real production traffic.
2. Identify the failing request
Look for visual indicators in the log list:
- Red error badges — Requests that failed entirely
- Failed models count — Amber badges showing the number of failed provider attempts before success (e.g., "2 failed")
- Duration anomalies — Unusually long request times may indicate multiple failover attempts
3. Inspect the timeline
Click the request to open the detail view. The timeline reveals the complete story:
```
┌──────────────────────────────────────────────────┐
│ Timeline                                          │
│                                                   │
│ ⚠ OpenAI gpt-4o       failed     120ms            │
│ ⚠ Anthropic claude-3  failed     340ms            │
│ ✓ Google gemini-1.5   success    890ms            │
│ ✓ Webhook delivery    success     45ms            │
│ ✓ Backend callback    success    210ms            │
└──────────────────────────────────────────────────┘
```
This example tells us the request ultimately succeeded but required two failover attempts. If the issue is intermittent, this pattern helps explain latency spikes.
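Reading the timeline is mostly arithmetic: every failed attempt adds its latency to the end-to-end duration the user experiences. A minimal sketch of that reasoning in Python, using the durations from the example above; the list-of-dicts shape and field names are only an illustration, not an export format ModelRiver provides.

```python
# Illustrative only: sum per-attempt durations from a timeline like the one
# above to see how much latency the failovers added.
timeline = [
    {"step": "OpenAI gpt-4o",      "status": "failed",  "duration_ms": 120},
    {"step": "Anthropic claude-3", "status": "failed",  "duration_ms": 340},
    {"step": "Google gemini-1.5",  "status": "success", "duration_ms": 890},
    {"step": "Webhook delivery",   "status": "success", "duration_ms": 45},
    {"step": "Backend callback",   "status": "success", "duration_ms": 210},
]

failover_overhead = sum(e["duration_ms"] for e in timeline if e["status"] == "failed")
total = sum(e["duration_ms"] for e in timeline)
print(f"Failed attempts added {failover_overhead}ms of {total}ms total")
# -> Failed attempts added 460ms of 1605ms total
```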
4. Examine request and response payloads
Click each timeline item to drill into the payload:
Request Body example:
```json
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant for our e-commerce platform."
    },
    {
      "role": "user",
      "content": "What's the return policy for electronics?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 500
}
```
Check for these common issues:
- Missing or malformed messages array
- Incorrect temperature or max_tokens values
- Truncated system prompts
- Unexpected special characters in user content
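If any of these issues shows up repeatedly, a lightweight pre-flight check in your own client code can catch them before the request is ever sent. The sketch below is only illustrative: the validate_payload helper and its thresholds are assumptions, not part of ModelRiver or any provider SDK.

```python
# Illustrative pre-flight check for the common payload issues listed above.
# The helper and its thresholds are assumptions for this sketch.
def validate_payload(payload: dict) -> list[str]:
    problems = []

    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages is missing, empty, or not a list")

    temperature = payload.get("temperature", 1.0)
    if not 0 <= temperature <= 2:
        problems.append(f"temperature {temperature} is outside the usual 0-2 range")

    max_tokens = payload.get("max_tokens")
    if max_tokens is not None and max_tokens < 16:
        problems.append(f"max_tokens={max_tokens} is suspiciously low")

    for msg in messages or []:
        if msg.get("role") == "system" and len(msg.get("content", "")) < 20:
            problems.append("system prompt looks truncated")

    return problems


payload = {"model": "gpt-4o", "messages": [], "temperature": 0.7, "max_tokens": 1}
print(validate_payload(payload))
# -> ['messages is missing, empty, or not a list', 'max_tokens=1 is suspiciously low']
```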
Response Body (error example):
1{2 "error": {3 "type": "rate_limit_error",4 "message": "You have exceeded your rate limit. Please retry after 30 seconds.",5 "code": "rate_limit_exceeded"6 }7}5. Trace the failover chain
When you see failed model attempts:
- Click each failed attempt to view its error message
- Note whether failures are provider-specific (all OpenAI attempts failing) or model-specific (only gpt-4o failing, but gpt-4o-mini succeeding)
- Common failure reasons:
  - Rate limiting — Too many requests to the provider
  - Model unavailable — Provider is experiencing an outage
  - Authentication failure — API key expired or invalid
  - Content policy violation — Request content was filtered
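For the most common of these, rate limiting, the standard client-side mitigation is retrying with exponential backoff. A minimal sketch, assuming an error body shaped like the example above; send_request stands in for your own HTTP call and is not a ModelRiver function.

```python
import random
import time

# Generic retry-with-backoff sketch for rate_limit_error responses.
# send_request is a placeholder for your own HTTP call; the error shape
# matches the example response body shown earlier.
def call_with_backoff(send_request, payload: dict, max_attempts: int = 5) -> dict:
    for attempt in range(max_attempts):
        response = send_request(payload)
        error = response.get("error")
        if not error or error.get("type") != "rate_limit_error":
            return response
        # Exponential backoff with jitter: roughly 1s, 2s, 4s, 8s ...
        delay = (2 ** attempt) + random.random()
        time.sleep(delay)
    raise RuntimeError("still rate limited after retries")
```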
Real-world debugging examples
Example 1: Chatbot returning empty responses
Symptom: Users report that the chatbot sometimes returns blank messages.
Investigation:
- Filter to Live mode, focus on the last 2 hours
- Find requests where the response has 0 output tokens
- Inspect the request body — the messages array is correct
- Inspect the response body — the provider returned finish_reason: "length" with max_tokens: 1
- Root cause: A workflow update accidentally set max_tokens to 1
Fix: Update the workflow's max_tokens parameter back to the intended value.
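A regression like this is easy to catch automatically: any response whose finish_reason is "length" was cut off by the token limit. A small sketch, assuming an OpenAI-style response shape; the check_truncation helper is illustrative, not part of ModelRiver.

```python
# Illustrative check: flag responses that were cut off by the token limit.
# Assumes an OpenAI-style response shape; adjust to your provider's format.
def check_truncation(request_body: dict, response_body: dict) -> str | None:
    choices = response_body.get("choices") or [{}]
    if choices[0].get("finish_reason") == "length":
        return (
            f"response truncated: max_tokens={request_body.get('max_tokens')} "
            "may be set too low"
        )
    return None


print(check_truncation(
    {"max_tokens": 1},
    {"choices": [{"finish_reason": "length", "message": {"content": ""}}]},
))
# -> response truncated: max_tokens=1 may be set too low
```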
Example 2: Intermittent 500 errors on async requests
Symptom: About 10% of async AI requests fail silently.
Investigation:
- Filter to errors in Live mode
- Open a failed request — the timeline shows the AI request succeeded
- The webhook delivery shows Error status with "Connection refused"
- Check multiple failed requests — all have the same webhook URL
- Root cause: The webhook endpoint server was running out of memory during peak hours
Fix: Scale the webhook endpoint server and add health monitoring.
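A common way to harden a webhook endpoint against this failure mode is to acknowledge deliveries immediately and do the heavy processing in the background. A minimal sketch using Flask; the /ai-webhook route and the in-process queue are assumptions for illustration, not a ModelRiver requirement.

```python
import queue
import threading

from flask import Flask, request

app = Flask(__name__)
work_queue: "queue.Queue[dict]" = queue.Queue()

def worker() -> None:
    # Background consumer: the actual handling of the AI result goes here.
    while True:
        payload = work_queue.get()
        # ... process payload ...
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

# Acknowledge the delivery immediately instead of blocking the request
# handler on slow downstream work, so bursts can't exhaust the server.
@app.post("/ai-webhook")  # hypothetical route for this sketch
def ai_webhook():
    work_queue.put(request.get_json(force=True))
    return {"status": "accepted"}, 202
```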
Example 3: Slow response times during peak hours
Symptom: AI responses take 5-10 seconds during peak hours, normally 1-2 seconds.
Investigation:
- Filter to Live mode, focus on peak hours (2-4 PM)
- Sort by duration — many requests show 4000-8000ms
- Open a slow request — the timeline shows 2-3 failover attempts before success
- Failed attempts show rate_limit_error from the primary provider
- Root cause: Primary provider rate limits are being hit during peak traffic
Fix: Add additional provider capacity or implement request queuing during peak hours.
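If you opt for queuing, a client-side concurrency cap is often enough to stay under the primary provider's limits during bursts. A minimal asyncio sketch; the limit of 5 concurrent calls and the call_model placeholder are assumptions, not tuned recommendations.

```python
import asyncio

# Client-side request queuing sketch: a semaphore caps how many model calls
# run at once, smoothing out bursts that would otherwise trip rate limits.
MAX_CONCURRENT_REQUESTS = 5  # assumption; tune to your provider's limits
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def call_model(payload: dict) -> dict:
    # Placeholder for your real async HTTP call to the AI provider.
    await asyncio.sleep(0.1)
    return {"ok": True}

async def queued_call(payload: dict) -> dict:
    async with semaphore:
        return await call_model(payload)

async def main() -> None:
    payloads = [{"prompt": f"request {i}"} for i in range(20)]
    results = await asyncio.gather(*(queued_call(p) for p in payloads))
    print(f"completed {len(results)} requests")

asyncio.run(main())
```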
Best practices for production debugging
- Compare failing and succeeding requests side by side to spot payload differences
- Use the timeline to understand the full request lifecycle, not just the final result
- Check for patterns — single failures may be transient, but repeated failures indicate systemic issues
- Monitor the Provider Reliability page for trend analysis
- Keep production logs clean by using test mode for development
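For the first practice, comparing a failing and a succeeding request, a plain unified diff of the two payloads is often the fastest way to spot the offending field. A small sketch; both payloads here are invented for illustration.

```python
import difflib
import json

# Diff a succeeding and a failing request payload to spot the field that
# differs. The two payloads below are made up for illustration.
good = {"model": "gpt-4o", "temperature": 0.7, "max_tokens": 500}
bad = {"model": "gpt-4o", "temperature": 0.7, "max_tokens": 1}

diff = difflib.unified_diff(
    json.dumps(good, indent=2).splitlines(),
    json.dumps(bad, indent=2).splitlines(),
    fromfile="succeeding request",
    tofile="failing request",
    lineterm="",
)
print("\n".join(diff))
```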
Next steps
- Testing Workflows — Validate changes before deploying
- Cost Analysis — Understand spending patterns
- Troubleshooting Failures — Systematic failure resolution
- Back to Observability — Return to the overview