Overview
Request latency directly impacts user experience. Slow AI responses frustrate users and can break real-time applications. ModelRiver captures duration metrics for every request, enabling you to monitor performance, identify bottlenecks, and make informed decisions about provider and model selection.
Understanding performance data
Duration metric
The Duration column in Request Logs shows the request latency in milliseconds, measured from when the request is sent to the provider until the response is received.
What duration includes:
- Network latency to the provider
- Provider processing time (model inference)
- Response transmission time
What duration does NOT include:
- Time spent in ModelRiver's request routing
- Webhook delivery time (tracked separately)
- Backend callback processing time (tracked separately)
Duration in the timeline
The detail view's timeline shows duration for each component:
- Failover attempts – Duration of each failed provider attempt
- Main request – Duration of the successful request
- Webhook deliveries – Time to deliver the webhook to your endpoint
- Backend callbacks – Time for your backend to process and respond
Total perceived latency for a synchronous request is the sum of all failover attempt durations plus the main request duration. For event-driven workflows, add webhook delivery time and backend callback processing time.
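As a quick worked example, the sketch below sums hypothetical timeline components for one event-driven request. The field names and values are illustrative only and do not reflect ModelRiver's export format.

```python
# Hypothetical timeline durations (in milliseconds) for one event-driven request.
timeline_ms = {
    "failover_attempts": [850, 1200],  # two failed provider attempts
    "main_request": 2400,              # the successful provider call
    "webhook_delivery": 120,           # delivery to your endpoint
    "callback_processing": 300,        # your backend's processing time
}

total_perceived_ms = (
    sum(timeline_ms["failover_attempts"])
    + timeline_ms["main_request"]
    + timeline_ms["webhook_delivery"]
    + timeline_ms["callback_processing"]
)
print(f"Total perceived latency: {total_perceived_ms} ms")  # 4870 ms
```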
Performance monitoring workflow
Step 1: Identify slow requests
- Navigate to Request Logs in your project console
- Filter to Live mode to focus on production traffic
- Sort or review the Duration column
- Look for requests significantly above your baseline latency
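If you export your Request Logs for offline analysis, a minimal sketch like the one below can flag outliers automatically. The record fields and the baseline value are assumptions for illustration; adapt them to your actual export format.

```python
# Minimal sketch: flag requests well above your baseline latency.
# Record fields and the baseline value are illustrative assumptions.
BASELINE_MS = 1500

requests_log = [
    {"request_id": "req_001", "provider": "provider-a", "duration_ms": 980},
    {"request_id": "req_002", "provider": "provider-b", "duration_ms": 4200},
    {"request_id": "req_003", "provider": "provider-a", "duration_ms": 1650},
]

slow = [r for r in requests_log if r["duration_ms"] > 2 * BASELINE_MS]
for r in sorted(slow, key=lambda r: r["duration_ms"], reverse=True):
    print(f"{r['request_id']}: {r['duration_ms']} ms via {r['provider']}")
```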
Step 2: Analyze root causes
Click on slow requests to open the detail view:
- Multiple failover attempts? – Each failed provider attempt adds to total latency. If a request has 2-3 failover attempts before succeeding, the total time includes all attempts.
- Slow provider? – Compare the duration of the main request across different providers. Some providers consistently have higher latency.
- Large token counts? – More tokens typically means longer processing time. Check if prompt or completion tokens are unusually high.
- Model complexity? – Larger models generally take longer; gpt-4o will be slower than gpt-4o-mini.
Step 3: Compare across providers
Review multiple requests to build a picture of provider performance:
| Provider | Average latency | Consistency | Notes |
|---|---|---|---|
| Provider A | Low | Stable | Good for real-time applications |
| Provider B | Medium | Variable | Occasional spikes during peak hours |
| Provider C | High | Stable | Acceptable for background tasks |
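If you prefer to compute these figures rather than eyeball them, the sketch below summarizes mean and p95 latency per provider from exported log records. The record shape is an illustrative assumption; only the provider and duration_ms fields are used.

```python
# Sketch: per-provider latency summary from exported log records.
from collections import defaultdict
from statistics import mean, quantiles

records = [
    {"provider": "provider-a", "duration_ms": 950},
    {"provider": "provider-a", "duration_ms": 1100},
    {"provider": "provider-a", "duration_ms": 1020},
    {"provider": "provider-b", "duration_ms": 2400},
    {"provider": "provider-b", "duration_ms": 5200},
]

by_provider = defaultdict(list)
for r in records:
    by_provider[r["provider"]].append(r["duration_ms"])

for provider, durations in sorted(by_provider.items()):
    # quantiles needs at least two points; fall back to the single value otherwise
    p95 = quantiles(durations, n=20)[18] if len(durations) > 1 else durations[0]
    print(f"{provider}: mean={mean(durations):.0f} ms, p95={p95:.0f} ms, n={len(durations)}")
```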
Step 4: Optimize
Based on your analysis:
- Reorder fallback providers – Put faster providers first for latency-sensitive workflows
- Use smaller models – Where quality permits, smaller models respond faster
- Reduce prompt size – Fewer input tokens mean faster processing
- Set max_tokens – Limit response size to reduce completion time (see the sketch after this list)
- Use async for non-critical requests – Move non-time-sensitive requests to async mode
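The sketch below shows what a latency-conscious request payload might look like, assuming an OpenAI-style chat schema; the payload is only illustrative, so check your ModelRiver integration docs for the exact field names and endpoint.

```python
# Illustrative payload only: field names assume an OpenAI-style chat schema.
# Consult your ModelRiver integration docs for the actual request format.
payload = {
    "model": "gpt-4o-mini",  # smaller model responds faster where quality permits
    "max_tokens": 300,       # cap completion length to bound response time
    "messages": [
        {"role": "system", "content": "Answer in at most three sentences."},
        {"role": "user", "content": "Summarize today's deployment status."},
    ],
}
```

Keep in mind that max_tokens truncates long completions, so pair it with prompting that encourages concise answers.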
Performance patterns to watch
Consistent high latency
Symptom: All requests to a specific provider or model are slow.
Likely causes:
- Provider is experiencing slowdowns
- Model is inherently slow (e.g., very large models)
- Network latency to the provider's region
Actions:
- Consider switching to an alternative provider or model
- Use failover to faster fallback providers
- For background tasks, consider async processing
Intermittent latency spikes
Symptom: Occasional requests take much longer than usual.
Likely causes:
- Provider rate limiting causing queuing
- Provider infrastructure issues
- Cold starts for less frequently used models
- Network routing issues
Actions:
- Monitor frequency and timing of spikes
- Check if spikes correlate with high traffic periods
- Review Provider Reliability for failure patterns
- Consider adding failover timeout thresholds
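To check whether spikes correlate with particular hours, a sketch like the following buckets slow requests by hour of day from exported logs. The timestamps, field names, and spike threshold are illustrative assumptions.

```python
# Sketch: bucket latency spikes by hour of day to spot peak-hour patterns.
from collections import Counter
from datetime import datetime

records = [
    {"timestamp": "2024-05-01T09:15:00+00:00", "duration_ms": 5200},
    {"timestamp": "2024-05-01T09:40:00+00:00", "duration_ms": 4800},
    {"timestamp": "2024-05-01T14:05:00+00:00", "duration_ms": 900},
]

SPIKE_THRESHOLD_MS = 3000
spike_hours = Counter(
    datetime.fromisoformat(r["timestamp"]).hour
    for r in records
    if r["duration_ms"] > SPIKE_THRESHOLD_MS
)
print(spike_hours.most_common())  # e.g. [(9, 2)] -> spikes cluster around 09:00
```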
Latency increasing over time
Symptom: Average request duration gradually increasing.
Likely causes:
- Prompts growing longer over time (more context, larger system prompts)
- Higher completion token counts
- Provider service degradation
- More frequent failover triggers
Actions:
- Review prompt sizes — are system prompts growing?
- Check token usage trends
- Compare provider performance week-over-week
- Audit workflow configuration for unnecessary complexity
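A week-over-week comparison is straightforward if you export logs into pandas. The sketch below is a minimal example with made-up data, assuming columns for timestamp, provider, and duration_ms.

```python
# Sketch: week-over-week mean latency per provider (made-up sample data).
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-05-01", "2024-05-02", "2024-05-08", "2024-05-09"]
        ),
        "provider": ["provider-a"] * 4,
        "duration_ms": [1100, 1200, 1450, 1600],
    }
)

weekly = (
    df.set_index("timestamp")
    .groupby("provider")["duration_ms"]
    .resample("W")
    .mean()
)
print(weekly)  # a steadily rising weekly mean suggests drift worth investigating
```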
Failover-related latency
Symptom: Requests succeed but take longer due to multiple provider attempts.
Likely causes:
- Primary provider frequently failing
- Slow failure detection (provider takes time to return errors)
Actions:
- Review Provider Reliability for failure rates
- Consider removing unreliable providers
- Implement request-level timeout thresholds
- Reorder providers to put most reliable first
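On the client side, you can complement any gateway-level timeout configuration by capping how long your application waits. The sketch below uses the Python requests library against a hypothetical placeholder URL; the endpoint and payload schema are assumptions, not ModelRiver's actual API.

```python
# Client-side latency budget: the URL and payload schema are hypothetical
# placeholders, not ModelRiver's actual API.
import requests

try:
    resp = requests.post(
        "https://gateway.example.com/v1/chat/completions",  # hypothetical endpoint
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "ping"}],
        },
        timeout=(3, 15),  # 3 s to connect, 15 s to read the response
    )
    resp.raise_for_status()
except requests.Timeout:
    # Fall back, retry with a different provider order, or degrade gracefully.
    print("Request exceeded the latency budget")
```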
Setting performance baselines
Establish your baseline
- Review a week of production request logs
- Calculate average and p95 duration for each provider/model combination
- Document acceptable latency ranges for your application
- Use these baselines to identify anomalies
Monitor against baseline
- Requests 2x above baseline may warrant investigation
- Requests 5x above baseline should trigger immediate review
- Consistent drift above baseline indicates systemic issues
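A baseline check can be automated with a small helper like the one below. The baseline values and provider/model names are placeholders; derive real baselines from a week of your own production logs.

```python
# Sketch: classify a request's duration against per-provider/model baselines.
# Baseline values and provider/model names are illustrative placeholders.
BASELINES_MS = {
    ("provider-a", "gpt-4o-mini"): 900,
    ("provider-b", "gpt-4o"): 2500,
}

def classify(provider: str, model: str, duration_ms: float) -> str:
    baseline = BASELINES_MS.get((provider, model))
    if baseline is None:
        return "no baseline"
    if duration_ms >= 5 * baseline:
        return "immediate review"
    if duration_ms >= 2 * baseline:
        return "investigate"
    return "ok"

print(classify("provider-a", "gpt-4o-mini", 2100))  # "investigate"
```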
Performance vs. cost tradeoffs
Performance and cost often trade off:
| Decision | Performance impact | Cost impact |
|---|---|---|
| Use a larger model | Slower | More expensive |
| Use a smaller model | Faster | Cheaper |
| Add failover providers | Better resilience, potentially slower on failure | May incur costs on failed attempts |
| Reduce prompt size | Faster | Cheaper |
| Set max_tokens | Faster for capped responses | Cheaper for capped responses |
| Use async mode | No user-facing latency | Same AI cost |
See Cost Analysis for detailed cost optimization strategies.
Next steps
- Provider Reliability – Analyze failure rates affecting performance
- Cost Analysis – Balance performance and cost
- Debugging – Investigate specific slow or failed requests
- Timeline Components – Understand request lifecycle timing
- Back to Observability – Return to the overview