Keep your AI features fast and responsive

Monitor request latency, identify slow providers, and track performance trends to ensure your AI-powered features meet user expectations.

Overview

Request latency directly impacts user experience. Slow AI responses frustrate users and can break real-time applications. ModelRiver captures duration metrics for every request, enabling you to monitor performance, identify bottlenecks, and make informed decisions about provider and model selection.


Understanding performance data

Duration metric

The Duration column in Request Logs shows the request latency in milliseconds, measured from when the request is sent to the provider until the response is received.

What duration includes:

  • Network latency to the provider
  • Provider processing time (model inference)
  • Response transmission time

What duration does NOT include:

  • Time spent in ModelRiver's request routing
  • Webhook delivery time (tracked separately)
  • Backend callback processing time (tracked separately)

Duration in the timeline

The detail view's timeline shows duration for each component:

  • Failover attempts – Duration of each failed provider attempt
  • Main request – Duration of the successful request
  • Webhook deliveries – Time to deliver the webhook to your endpoint
  • Backend callbacks – Time for your backend to process and respond

Total perceived latency for a synchronous request is the sum of all failover attempt durations plus the main request duration. For event-driven workflows, add webhook delivery time and backend callback processing time on top of that.
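As a rough illustration, the arithmetic can be expressed in a few lines of Python. The timeline values below are made up, and the field names are illustrative rather than ModelRiver's actual export format:

```python
# Rough sketch: estimating total perceived latency from timeline components.
# The dictionary is illustrative; field names are assumptions, not the actual
# Request Logs schema.
timeline = {
    "failover_attempts_ms": [850, 1200],  # two failed provider attempts
    "main_request_ms": 1400,              # the successful request
    "webhook_delivery_ms": 120,           # event-driven workflows only
    "callback_processing_ms": 300,        # event-driven workflows only
}

# Synchronous request: failover attempts + main request
sync_latency_ms = sum(timeline["failover_attempts_ms"]) + timeline["main_request_ms"]

# Event-driven workflow: add webhook delivery and backend callback time
event_latency_ms = (
    sync_latency_ms
    + timeline["webhook_delivery_ms"]
    + timeline["callback_processing_ms"]
)

print(f"Sync perceived latency:  {sync_latency_ms} ms")    # 3450 ms
print(f"Event perceived latency: {event_latency_ms} ms")   # 3870 ms
```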


Performance monitoring workflow

Step 1: Identify slow requests

  1. Navigate to Request Logs in your project console
  2. Filter to Live mode to focus on production traffic
  3. Sort or review the Duration column
  4. Look for requests significantly above your baseline latency
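If you export request logs for offline analysis (for example as JSON), a short script can surface the outliers. The file name, field names, and baseline value below are illustrative assumptions, not ModelRiver's actual export schema:

```python
import json

BASELINE_MS = 2000  # example baseline; replace with your own measured value

# Hypothetical Request Logs export; the file and field names are assumptions.
with open("request_logs.json") as f:
    requests_log = json.load(f)

# Keep live-mode requests that are well above baseline (2x here, per the
# "Monitor against baseline" guidance later on this page).
slow = [
    r for r in requests_log
    if r.get("mode") == "live" and r.get("duration_ms", 0) > 2 * BASELINE_MS
]

# Slowest first, so the worst offenders get opened in the detail view first.
for r in sorted(slow, key=lambda r: r["duration_ms"], reverse=True):
    print(r["request_id"], r["provider"], r["model"], f"{r['duration_ms']} ms")
```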

Step 2: Analyze root causes

Click on slow requests to open the detail view:

  • Multiple failover attempts? – Each failed provider attempt adds to total latency. If a request has 2-3 failover attempts before succeeding, the total time includes all attempts.
  • Slow provider? – Compare the duration of the main request across different providers. Some providers consistently have higher latency.
  • Large token counts? – More tokens typically mean longer processing time. Check whether prompt or completion tokens are unusually high.
  • Model complexity? – Larger models generally take longer; gpt-4o typically responds more slowly than gpt-4o-mini.

Step 3: Compare across providers

Review multiple requests to build a picture of provider performance:

Provider | Average latency | Consistency | Notes
Provider A | Low | Stable | Good for real-time applications
Provider B | Medium | Variable | Occasional spikes during peak hours
Provider C | High | Stable | Acceptable for background tasks
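If you prefer to compute this comparison from an exported log rather than reviewing requests one by one, a sketch like the following can fill in a table like the one above. The export format and field names are assumptions:

```python
import json
import statistics
from collections import defaultdict

# Group durations by provider from a hypothetical Request Logs export
# (field names are assumptions, not ModelRiver's actual schema).
with open("request_logs.json") as f:
    requests_log = json.load(f)

by_provider = defaultdict(list)
for r in requests_log:
    by_provider[r["provider"]].append(r["duration_ms"])

print(f"{'Provider':<15}{'Avg (ms)':>10}{'Stdev (ms)':>12}{'Count':>8}")
for provider, durations in sorted(by_provider.items()):
    avg = statistics.mean(durations)
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    print(f"{provider:<15}{avg:>10.0f}{spread:>12.0f}{len(durations):>8}")
```

A high standard deviation relative to the average is the "Variable" column in the table above: it usually points to intermittent spikes rather than a uniformly slow provider.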

Step 4: Optimize

Based on your analysis:

  • Reorder fallback providers – Put faster providers first for latency-sensitive workflows
  • Use smaller models – Where quality permits, smaller models respond faster
  • Reduce prompt size – Fewer input tokens means faster processing
  • Set max_tokens – Limit response size to reduce completion time
  • Use async for non-critical – Move non-time-sensitive requests to async mode
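As an example of the "use smaller models", "reduce prompt size", and "set max_tokens" items above, here is a minimal sketch of an OpenAI-compatible chat completion request with a small model, a trimmed system prompt, and a capped response. The endpoint URL and header are placeholders, not ModelRiver's actual API:

```python
import requests

# Placeholder endpoint and key; substitute your actual gateway URL and credentials.
URL = "https://example.invalid/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "gpt-4o-mini",          # smaller model where quality permits
    "messages": [
        # Keep the system prompt short; every input token adds processing time.
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "Summarize my last order status in one sentence."},
    ],
    "max_tokens": 150,               # cap completion length to bound latency and cost
}

response = requests.post(URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```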

Performance patterns to watch

Consistent high latency

Symptom: All requests to a specific provider or model are slow.

Likely causes:

  • Provider is experiencing slowdowns
  • Model is inherently slow (e.g., very large models)
  • Network latency to the provider's region

Actions:

  • Consider switching to an alternative provider or model
  • Use failover to faster fallback providers
  • For background tasks, consider async processing

Intermittent latency spikes

Symptom: Occasional requests take much longer than usual.

Likely causes:

  • Provider rate limiting causing queuing
  • Provider infrastructure issues
  • Cold starts for less frequently used models
  • Network routing issues

Actions:

  • Monitor frequency and timing of spikes
  • Check if spikes correlate with high traffic periods (see the sketch after this list)
  • Review Provider Reliability for failure patterns
  • Consider adding failover timeout thresholds
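A quick way to check whether spikes cluster around peak hours is to bucket slow requests by hour of day. The export file, field names, and 10-second threshold below are assumptions:

```python
import json
from collections import Counter
from datetime import datetime

SPIKE_THRESHOLD_MS = 10_000  # example: treat anything above 10 s as a spike

# Hypothetical export; "created_at" is assumed to be an ISO 8601 timestamp.
with open("request_logs.json") as f:
    requests_log = json.load(f)

# Count spikes per hour of day to see whether they cluster around peak traffic.
spikes_by_hour = Counter(
    datetime.fromisoformat(r["created_at"]).hour
    for r in requests_log
    if r["duration_ms"] > SPIKE_THRESHOLD_MS
)

for hour in range(24):
    print(f"{hour:02d}:00  {'#' * spikes_by_hour.get(hour, 0)}")
```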

Latency increasing over time

Symptom: Average request duration gradually increasing.

Likely causes:

  • Prompts growing longer over time (more context, larger system prompts)
  • Higher completion token counts
  • Provider service degradation
  • More frequent failover triggers

Actions:

  • Review prompt sizes — are system prompts growing?
  • Check token usage trends
  • Compare provider performance week-over-week (see the sketch after this list)
  • Audit workflow configuration for unnecessary complexity
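To track average duration week-over-week from an exported log, a sketch like the following works. Field names are assumptions, and timestamps are assumed to be timezone-aware ISO 8601 strings:

```python
import json
import statistics
from datetime import datetime, timedelta, timezone

with open("request_logs.json") as f:
    requests_log = json.load(f)  # hypothetical export; field names are assumptions

now = datetime.now(timezone.utc)
this_week, last_week = [], []

for r in requests_log:
    ts = datetime.fromisoformat(r["created_at"])  # assumed timezone-aware ISO 8601
    if ts > now - timedelta(days=7):
        this_week.append(r["duration_ms"])
    elif ts > now - timedelta(days=14):
        last_week.append(r["duration_ms"])

if this_week and last_week:
    current, previous = statistics.mean(this_week), statistics.mean(last_week)
    change = (current - previous) / previous * 100
    print(f"Last week: {previous:.0f} ms, this week: {current:.0f} ms ({change:+.1f}%)")
```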

Failover-driven latency

Symptom: Requests succeed but take longer due to multiple provider attempts.

Likely causes:

  • Primary provider frequently failing
  • Slow failure detection (provider takes time to return errors)

Actions:

  • Review Provider Reliability for failure rates
  • Consider removing unreliable providers
  • Implement request-level timeout thresholds (see the sketch after this list)
  • Reorder providers to put most reliable first
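A request-level timeout can also be approximated on the client side so users are never left waiting on a stalled provider. The endpoint below is a placeholder, and the 5 s / 30 s values are examples:

```python
import requests

URL = "https://example.invalid/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Ping"}]}

try:
    # Fail fast: 5 s to connect, 30 s to receive the response, then fall back.
    response = requests.post(URL, headers=HEADERS, json=payload, timeout=(5, 30))
    response.raise_for_status()
except requests.Timeout:
    # Trigger your own fallback path instead of letting users wait indefinitely.
    print("Request timed out; falling back to a faster provider or cached answer.")
```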

Setting performance baselines

Establish your baseline

  1. Review a week of production request logs
  2. Calculate average and p95 duration for each provider/model combination
  3. Document acceptable latency ranges for your application
  4. Use these baselines to identify anomalies
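A sketch of step 2, computing average and p95 duration per provider/model pair from an exported log (field names are assumptions):

```python
import json
import statistics
from collections import defaultdict

with open("request_logs.json") as f:
    requests_log = json.load(f)  # hypothetical export; field names are assumptions

durations = defaultdict(list)
for r in requests_log:
    durations[(r["provider"], r["model"])].append(r["duration_ms"])

for (provider, model), values in sorted(durations.items()):
    avg = statistics.mean(values)
    # statistics.quantiles needs at least two data points; index 94 of the
    # 99 cut points (n=100) is the 95th percentile.
    p95 = statistics.quantiles(values, n=100)[94] if len(values) >= 2 else values[0]
    print(f"{provider}/{model}: avg {avg:.0f} ms, p95 {p95:.0f} ms")
```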

Monitor against baseline

  • Requests 2x above baseline may warrant investigation
  • Requests 5x above baseline should trigger immediate review
  • Consistent drift above baseline indicates systemic issues
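These thresholds are easy to check mechanically. A minimal sketch, assuming an example baseline of 1,500 ms:

```python
def classify_latency(duration_ms: float, baseline_ms: float) -> str:
    """Map a request duration to a review priority based on the baseline."""
    if duration_ms >= 5 * baseline_ms:
        return "immediate review"
    if duration_ms >= 2 * baseline_ms:
        return "investigate"
    return "normal"

# Example with an assumed 1,500 ms baseline
print(classify_latency(3500, 1500))   # investigate (at least 2x baseline)
print(classify_latency(9000, 1500))   # immediate review (at least 5x baseline)
```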

Performance vs. cost tradeoffs

Performance and cost often trade off:

Decision | Performance impact | Cost impact
Use a larger model | Slower | More expensive
Use a smaller model | Faster | Cheaper
Add failover providers | Better resilience, potentially slower on failure | May incur costs on failed attempts
Reduce prompt size | Faster | Cheaper
Set max_tokens | Faster for capped responses | Cheaper for capped responses
Use async mode | No user-facing latency | Same AI cost

See Cost Analysis for detailed cost optimization strategies.


Next steps