Overview
Request latency directly impacts user experience. Slow AI responses frustrate users and can break real-time applications. ModelRiver captures duration metrics for every request, enabling you to monitor performance, identify bottlenecks, and make informed decisions about provider and model selection.
Understanding performance data
Duration metric
The Duration column in Request Logs shows the request latency in milliseconds, measured from when the request is sent to the provider until the response is received.
What duration includes:
- Network latency to the provider
- Provider processing time (model inference)
- Response transmission time
What duration does NOT include:
- Time spent in ModelRiver's request routing
- Webhook delivery time (tracked separately)
- Backend callback processing time (tracked separately)
Duration in the timeline
The detail view's timeline shows duration for each component:
- Failover attempts – Duration of each failed provider attempt
- Main request – Duration of the successful request
- Webhook deliveries – Time to deliver the webhook to your endpoint
- Backend callbacks – Time for your backend to process and respond
Total perceived latency for a synchronous request is the sum of all failover attempt durations plus the main request duration. For event-driven workflows, add webhook delivery time and backend callback processing time.
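As a quick worked example, the sketch below sums hypothetical timeline components for one event-driven request. The field names and values are illustrative only and do not reflect ModelRiver's export format.

```python
# Hypothetical timeline durations (in milliseconds) for one event-driven request.
timeline_ms = {
    "failover_attempts": [850, 1200],  # two failed provider attempts
    "main_request": 2400,              # the successful provider call
    "webhook_delivery": 120,           # delivery to your endpoint
    "callback_processing": 300,        # your backend's processing time
}

total_perceived_ms = (
    sum(timeline_ms["failover_attempts"])
    + timeline_ms["main_request"]
    + timeline_ms["webhook_delivery"]
    + timeline_ms["callback_processing"]
)
print(f"Total perceived latency: {total_perceived_ms} ms")  # 4870 ms
```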
Performance monitoring workflow
Step 1: Identify slow requests
- Navigate to Request Logs in your project console
- Filter to Live mode to focus on production traffic
- Sort or review the Duration column
- Look for requests significantly above your baseline latency
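If you export your Request Logs for offline analysis, a minimal sketch like the one below can flag outliers automatically. The record fields and the baseline value are assumptions for illustration; adapt them to your actual export format.

```python
# Minimal sketch: flag requests well above your baseline latency.
# Record fields and the baseline value are illustrative assumptions.
BASELINE_MS = 1500

requests_log = [
    {"request_id": "req_001", "provider": "provider-a", "duration_ms": 980},
    {"request_id": "req_002", "provider": "provider-b", "duration_ms": 4200},
    {"request_id": "req_003", "provider": "provider-a", "duration_ms": 1650},
]

slow = [r for r in requests_log if r["duration_ms"] > 2 * BASELINE_MS]
for r in sorted(slow, key=lambda r: r["duration_ms"], reverse=True):
    print(f"{r['request_id']}: {r['duration_ms']} ms via {r['provider']}")
```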
Step 2: Analyze root causes
Click on slow requests to open the detail view:
- Multiple failover attempts? – Each failed provider attempt adds to total latency. If a request has 2-3 failover attempts before succeeding, the total time includes all attempts.
- Slow provider? – Compare the duration of the main request across different providers. Some providers consistently have higher latency.
- Large token counts? – More tokens typically means longer processing time. Check if prompt or completion tokens are unusually high.
- Model complexity? – Larger models generally take longer; gpt-4o will be slower than gpt-4o-mini.
Step 3: Compare across providers
Review multiple requests to build a picture of provider performance:
| Provider | Average latency | Consistency | Notes |
|---|---|---|---|
| Provider A | Low | Stable | Good for real-time applications |
| Provider B | Medium | Variable | Occasional spikes during peak hours |
| Provider C | High | Stable | Acceptable for background tasks |
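If you prefer to compute these figures rather than eyeball them, the sketch below summarizes mean and p95 latency per provider from exported log records. The record shape is an illustrative assumption; only the provider and duration_ms fields are used.

```python
# Sketch: per-provider latency summary from exported log records.
from collections import defaultdict
from statistics import mean, quantiles

records = [
    {"provider": "provider-a", "duration_ms": 950},
    {"provider": "provider-a", "duration_ms": 1100},
    {"provider": "provider-a", "duration_ms": 1020},
    {"provider": "provider-b", "duration_ms": 2400},
    {"provider": "provider-b", "duration_ms": 5200},
]

by_provider = defaultdict(list)
for r in records:
    by_provider[r["provider"]].append(r["duration_ms"])

for provider, durations in sorted(by_provider.items()):
    # quantiles needs at least two points; fall back to the single value otherwise
    p95 = quantiles(durations, n=20)[18] if len(durations) > 1 else durations[0]
    print(f"{provider}: mean={mean(durations):.0f} ms, p95={p95:.0f} ms, n={len(durations)}")
```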
Step 4: Optimize
Based on your analysis:
- Reorder fallback providers – Put faster providers first for latency-sensitive workflows
- Use smaller models – Where quality permits, smaller models respond faster
- Reduce prompt size – Fewer input tokens mean faster processing
- Set max_tokens – Limit response size to reduce completion time (see the sketch after this list)
- Use async for non-critical requests – Move non-time-sensitive requests to async mode
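The sketch below shows what a latency-conscious request payload might look like, assuming an OpenAI-style chat schema; the payload is only illustrative, so check your ModelRiver integration docs for the exact field names and endpoint.

```python
# Illustrative payload only: field names assume an OpenAI-style chat schema.
# Consult your ModelRiver integration docs for the actual request format.
payload = {
    "model": "gpt-4o-mini",  # smaller model responds faster where quality permits
    "max_tokens": 300,       # cap completion length to bound response time
    "messages": [
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Summarize today's deployment status."},
    ],
}
```

Keep in mind that max_tokens truncates long completions, so pair it with prompting that encourages concise answers.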
Performance patterns to watch
Consistent high latency
Symptom: All requests to a specific provider or model are slow.
Likely causes:
- Provider is experiencing slowdowns
- Model is inherently slow (e.g., very large models)
- Network latency to the provider's region
Actions:
- Consider switching to an alternative provider or model
- Use failover to faster fallback providers
- For background tasks, consider async processing
Intermittent latency spikes
Symptom: Occasional requests take much longer than usual.
Likely causes:
- Provider rate limiting causing queuing
- Provider infrastructure issues
- Cold starts for less frequently used models
- Network routing issues
Actions:
- Monitor frequency and timing of spikes
- Check if spikes correlate with high traffic periods
- Review Provider Reliability for failure patterns
- Consider adding failover timeout thresholds
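To check whether spikes correlate with particular hours, a sketch like the following buckets slow requests by hour of day from exported logs. The timestamps, field names, and spike threshold are illustrative assumptions.

```python
# Sketch: bucket latency spikes by hour of day to spot peak-hour patterns.
from collections import Counter
from datetime import datetime

records = [
    {"timestamp": "2024-05-01T09:15:00+00:00", "duration_ms": 5200},
    {"timestamp": "2024-05-01T09:40:00+00:00", "duration_ms": 4800},
    {"timestamp": "2024-05-01T14:05:00+00:00", "duration_ms": 900},
]

SPIKE_THRESHOLD_MS = 3000
spike_hours = Counter(
    datetime.fromisoformat(r["timestamp"]).hour
    for r in records
    if r["duration_ms"] > SPIKE_THRESHOLD_MS
)
print(spike_hours.most_common())  # e.g. [(9, 2)] -> spikes cluster around 09:00
```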
Latency increasing over time
Symptom: Average request duration gradually increasing.
Likely causes:
- Prompts growing longer over time (more context, larger system prompts)
- Higher completion token counts
- Provider service degradation
- More frequent failover triggers
Actions:
- Review prompt sizes — are system prompts growing?
- Check token usage trends
- Compare provider performance week-over-week
- Audit workflow configuration for unnecessary complexity
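A week-over-week comparison is straightforward if you export logs into pandas. The sketch below is a minimal example with made-up data, assuming columns for timestamp, provider, and duration_ms.

```python
# Sketch: week-over-week mean latency per provider (made-up sample data).
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-05-01", "2024-05-02", "2024-05-08", "2024-05-09"]
        ),
        "provider": ["provider-a"] * 4,
        "duration_ms": [1100, 1200, 1450, 1600],
    }
)

weekly = (
    df.set_index("timestamp")
    .groupby("provider")["duration_ms"]
    .resample("W")
    .mean()
)
print(weekly)  # a steadily rising weekly mean suggests drift worth investigating
```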
Failover-related latency
Symptom: Requests succeed but take longer due to multiple provider attempts.
Likely causes:
- Primary provider frequently failing
- Slow failure detection (provider takes time to return errors)
Actions:
- Review Provider Reliability for failure rates
- Consider removing unreliable providers
- Implement request-level timeout thresholds
- Reorder providers to put most reliable first
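On the client side, you can complement any gateway-level timeout configuration by capping how long your application waits. The sketch below uses the Python requests library against a hypothetical placeholder URL; the endpoint and payload schema are assumptions, not ModelRiver's actual API.

```python
# Client-side latency budget: the URL and payload schema are hypothetical
# placeholders, not ModelRiver's actual API.
import requests

try:
    resp = requests.post(
        "https://gateway.example.com/v1/chat/completions",  # hypothetical endpoint
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "ping"}],
        },
        timeout=(3, 15),  # 3 s to connect, 15 s to read the response
    )
    resp.raise_for_status()
except requests.Timeout:
    # Fall back, retry with a different provider order, or degrade gracefully.
    print("Request exceeded the latency budget")
```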
Setting performance baselines
Establish your baseline
- Review a week of production request logs
- Calculate average and p95 duration for each provider/model combination
- Document acceptable latency ranges for your application
- Use these baselines to identify anomalies
Monitor against baseline
- Requests 2x above baseline may warrant investigation
- Requests 5x above baseline should trigger immediate review
- Consistent drift above baseline indicates systemic issues
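A baseline check can be automated with a small helper like the one below. The baseline values and provider/model names are placeholders; derive real baselines from a week of your own production logs.

```python
# Sketch: classify a request's duration against per-provider/model baselines.
# Baseline values and provider/model names are illustrative placeholders.
BASELINES_MS = {
    ("provider-a", "gpt-4o-mini"): 900,
    ("provider-b", "gpt-4o"): 2500,
}

def classify(provider: str, model: str, duration_ms: float) -> str:
    baseline = BASELINES_MS.get((provider, model))
    if baseline is None:
        return "no baseline"
    if duration_ms >= 5 * baseline:
        return "immediate review"
    if duration_ms >= 2 * baseline:
        return "investigate"
    return "ok"

print(classify("provider-a", "gpt-4o-mini", 2100))  # "investigate"
```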
Performance vs. cost tradeoffs
Performance and cost often trade off:
| Decision | Performance impact | Cost impact |
|---|---|---|
| Use a larger model | Slower | More expensive |
| Use a smaller model | Faster | Cheaper |
| Add failover providers | Better resilience, potentially slower on failure | May incur costs on failed attempts |
| Reduce prompt size | Faster | Cheaper |
| Set max_tokens | Faster for capped responses | Cheaper for capped responses |
| Use async mode | No user-facing latency | Same AI cost |
See Cost Analysis for detailed cost optimization strategies.
Next steps
- Provider Reliability – Analyze failure rates affecting performance
- Cost Analysis – Balance performance and cost
- Debugging – Investigate specific slow or failed requests
- Timeline Components – Understand request lifecycle timing
- Back to Observability – Return to the overview