Debugging production issues with Request Logs

When a user reports that an AI feature isn't working, Request Logs give you visibility into the complete request lifecycle so you can pinpoint the root cause.

Overview

Production debugging is one of the most critical use cases for Request Logs. When users report issues with AI-powered features, you need a clear path from symptom to root cause. ModelRiver captures every detail of every request, so you never have to guess what went wrong.


Step-by-step debugging workflow

1. Filter to production traffic

Start by isolating real user requests:

  1. Navigate to Request Logs in your project console
  2. Select Live mode from the filter dropdown
  3. Narrow the time range to the period when the issue was reported

Why Live mode? Test mode and playground requests create noise. Filtering to Live mode ensures you're only looking at real production traffic.
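If you export logs for offline analysis, the same filter can be applied programmatically. This is a minimal sketch assuming exported records are dicts with `mode` and `timestamp` fields; those field names are illustrative, not ModelRiver's actual export schema.

```python
from datetime import datetime

def filter_live(logs, since):
    """Keep only Live-mode requests observed at or after `since`."""
    return [r for r in logs if r["mode"] == "live" and r["timestamp"] >= since]

# Illustrative records; "mode" and "timestamp" are assumed field names.
logs = [
    {"id": "req_1", "mode": "live", "timestamp": datetime(2024, 5, 1, 14, 5)},
    {"id": "req_2", "mode": "test", "timestamp": datetime(2024, 5, 1, 14, 6)},
    {"id": "req_3", "mode": "live", "timestamp": datetime(2024, 5, 1, 12, 0)},
]

recent_live = filter_live(logs, since=datetime(2024, 5, 1, 14, 0))
```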

2. Identify the failing request

Look for visual indicators in the log list:

  • Red error badges — Requests that failed entirely
  • Failed models count — Amber badges showing the number of failed provider attempts before success (e.g., "2 failed")
  • Duration anomalies — Unusually long request times may indicate multiple failover attempts
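Duration anomalies can also be spotted mechanically in an exported log set. The sketch below uses a rough heuristic (duration above a multiple of the median); the threshold factor and record shape are assumptions, not a ModelRiver feature.

```python
def flag_duration_anomalies(durations_ms, factor=3.0):
    """Return request IDs whose duration exceeds `factor` x the (upper) median,
    a rough heuristic for spotting likely failover chains."""
    ordered = sorted(durations_ms.values())
    median = ordered[len(ordered) // 2]
    return [rid for rid, ms in durations_ms.items() if ms > factor * median]

# Illustrative durations in milliseconds.
durations = {"req_a": 900, "req_b": 1100, "req_c": 1000, "req_d": 6200}
suspects = flag_duration_anomalies(durations)
```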

3. Inspect the timeline

Click the request to open the detail view. The timeline reveals the complete story:

Timeline
  • OpenAI gpt-4o — failed — 120ms
  • Anthropic claude-3 — failed — 340ms
  • Google gemini-1.5 — success — 890ms
  • Webhook delivery — success — 45ms
  • Backend callback — success — 210ms

This example tells us the request ultimately succeeded but required two failover attempts. If the issue is intermittent, this pattern helps explain latency spikes.
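To quantify how much of the latency comes from failover, you can sum the failed attempts separately. A small sketch over the timeline above, with entries modeled as (name, status, duration-in-ms) tuples (an assumed shape, not an exported format):

```python
# Timeline entries as (name, status, duration_ms) tuples.
timeline = [
    ("OpenAI gpt-4o", "failed", 120),
    ("Anthropic claude-3", "failed", 340),
    ("Google gemini-1.5", "success", 890),
]

failed_attempts = [name for name, status, _ in timeline if status == "failed"]
# Time spent on attempts that never produced a response.
failover_overhead_ms = sum(ms for _, status, ms in timeline if status == "failed")
total_model_ms = sum(ms for _, _, ms in timeline)
```

Here roughly a third of the model-side latency is failover overhead, which is exactly the kind of pattern that explains intermittent latency spikes.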

4. Examine request and response payloads

Click each timeline item to drill into the payload:

Request Body example:

JSON
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant for our e-commerce platform."
    },
    {
      "role": "user",
      "content": "What's the return policy for electronics?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 500
}

Check for these common issues:

  • Missing or malformed messages array
  • Incorrect temperature or max_tokens values
  • Truncated system prompts
  • Unexpected special characters in user content
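The checklist above can be turned into a quick validator for exported request bodies. This is a sketch, assuming a chat-style body with `messages`, `temperature`, and `max_tokens` keys; the accepted temperature range is an assumption you should adjust per provider.

```python
def check_request_body(body, max_temperature=2.0):
    """Return human-readable problems found in a chat-style request body."""
    problems = []
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("missing or malformed messages array")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                problems.append(f"message {i} is missing role or content")
    temperature = body.get("temperature")
    if temperature is not None and not 0 <= temperature <= max_temperature:
        problems.append("temperature out of range")
    if body.get("max_tokens", 1) < 1:
        problems.append("max_tokens below 1")
    return problems
```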

Response Body (error example):

JSON
{
  "error": {
    "type": "rate_limit_error",
    "message": "You have exceeded your rate limit. Please retry after 30 seconds.",
    "code": "rate_limit_exceeded"
  }
}

5. Trace the failover chain

When you see failed model attempts:

  1. Click each failed attempt to view its error message
  2. Note whether failures are provider-specific (all OpenAI attempts failing) or model-specific (only gpt-4o failing, but gpt-4o-mini succeeding)
  3. Common failure reasons:
    • Rate limiting — Too many requests to the provider
    • Model unavailable — Provider is experiencing an outage
    • Authentication failure — API key expired or invalid
    • Content policy violation — Request content was filtered
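The provider-specific vs. model-specific distinction from step 2 can be computed over a batch of attempts. A sketch under the assumption that each attempt is a (provider, model, status) tuple; this is an offline-analysis helper, not a ModelRiver API.

```python
from collections import defaultdict

def classify_failures(attempts):
    """attempts: (provider, model, status) tuples. Returns providers where
    every attempt failed, and models that failed while another model on the
    same provider succeeded."""
    by_provider = defaultdict(list)
    for provider, model, status in attempts:
        by_provider[provider].append((model, status))
    # Every attempt against the provider failed -> likely provider-wide.
    provider_wide = [p for p, results in by_provider.items()
                     if all(status == "failed" for _, status in results)]
    # A model failed while a sibling on the same provider succeeded.
    model_specific = [model for results in by_provider.values()
                      if any(status == "success" for _, status in results)
                      for model, status in results if status == "failed"]
    return provider_wide, model_specific

attempts = [
    ("openai", "gpt-4o", "failed"),
    ("openai", "gpt-4o-mini", "success"),
    ("anthropic", "claude-3", "failed"),
]
provider_wide, model_specific = classify_failures(attempts)
```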

Real-world debugging examples

Example 1: Chatbot returning empty responses

Symptom: Users report that the chatbot sometimes returns blank messages.

Investigation:

  1. Filter to Live mode, focus on the last 2 hours
  2. Find requests where the response has 0 output tokens
  3. Inspect the request body — the messages array is correct
  4. Inspect the response body — the provider returned finish_reason: "length" with max_tokens: 1
  5. Root cause: A workflow update accidentally set max_tokens to 1

Fix: Update the workflow's max_tokens parameter back to the intended value.
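Step 2 of this investigation — finding responses with zero output tokens or a truncating finish reason — is easy to script over exported logs. The `output_tokens` and `finish_reason` field names are assumptions about the export shape.

```python
def find_truncated(responses):
    """Flag responses that came back empty or cut off: zero output tokens
    or finish_reason 'length'."""
    return [r["id"] for r in responses
            if r.get("output_tokens", 0) == 0 or r.get("finish_reason") == "length"]

# Illustrative records; field names are assumed.
responses = [
    {"id": "r1", "finish_reason": "stop", "output_tokens": 42},
    {"id": "r2", "finish_reason": "length", "output_tokens": 1},
    {"id": "r3", "finish_reason": "stop", "output_tokens": 0},
]
suspicious = find_truncated(responses)
```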

Example 2: Intermittent 500 errors on async requests

Symptom: About 10% of async AI requests fail silently.

Investigation:

  1. Filter to errors in Live mode
  2. Open a failed request — the timeline shows the AI request succeeded
  3. The webhook delivery shows Error status with "Connection refused"
  4. Check multiple failed requests — all have the same webhook URL
  5. Root cause: The webhook endpoint server was running out of memory during peak hours

Fix: Scale the webhook endpoint server and add health monitoring.
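On the receiving side, transient "Connection refused" errors like these are also a reason to wrap webhook delivery in retries with backoff. A hedged sketch of the pattern (not a ModelRiver feature — delivery behavior on their side may differ): `send` is any callable that raises `ConnectionError` on failure.

```python
import time

def deliver_with_backoff(send, payload, attempts=3, base_delay=0.5):
    """Call `send(payload)`, retrying with exponential backoff on connection
    errors so a briefly overloaded endpoint gets time to recover."""
    for attempt in range(attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of retries; surface the failure.
            time.sleep(base_delay * 2 ** attempt)
```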

Example 3: Slow response times during peak hours

Symptom: AI responses take 5-10 seconds during peak hours, normally 1-2 seconds.

Investigation:

  1. Filter to Live mode, focus on peak hours (2-4 PM)
  2. Sort by duration — many requests show 4000-8000ms
  3. Open a slow request — the timeline shows 2-3 failover attempts before success
  4. Failed attempts show rate_limit_error from the primary provider
  5. Root cause: Primary provider rate limits are being hit during peak traffic

Fix: Add additional provider capacity or implement request queuing during peak hours.
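One way to add provider capacity at the application level is to route around providers you know are currently rate limited. A minimal sketch of that routing decision; how you track the `rate_limited` set (e.g., from recent `rate_limit_error` responses) is up to your application.

```python
def choose_provider(preference_order, rate_limited):
    """Pick the first provider not currently rate limited; fall back to the
    primary if every provider is limited."""
    for provider in preference_order:
        if provider not in rate_limited:
            return provider
    return preference_order[0]
```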


Best practices for production debugging

  • Compare failing and succeeding requests side by side to spot payload differences
  • Use the timeline to understand the full request lifecycle, not just the final result
  • Check for patterns — single failures may be transient, but repeated failures indicate systemic issues
  • Monitor the Provider Reliability page for trend analysis
  • Keep production logs clean by using test mode for development

Next steps