Monitor AI request performance in real time

Track request latency, identify performance bottlenecks, and compare provider speeds to deliver the fastest AI experience to your users.

Overview

Response time directly impacts user experience. A chatbot that takes 8 seconds to respond feels broken; one that responds in 1-2 seconds feels magical. Request Logs capture precise duration data for every request and every provider attempt, giving you the tools to monitor, compare, and optimize performance.


Key performance metrics

Request duration

The Duration column shows end-to-end latency in milliseconds:

Duration range     User experience                         Action
< 1,000ms          Excellent — feels instant               No action needed
1,000-3,000ms      Good — acceptable for most use cases    Monitor
3,000-5,000ms      Moderate — noticeable delay             Investigate
> 5,000ms          Poor — users may abandon                Immediate investigation
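
If you pull duration data out for offline analysis, a small helper can bucket requests into these ranges for reporting. A minimal sketch in Python; the thresholds mirror the table above and the function name is purely illustrative:

def classify_duration(duration_ms: float) -> str:
    """Map an end-to-end duration in milliseconds to an action bucket."""
    if duration_ms < 1_000:
        return "excellent - no action needed"
    if duration_ms < 3_000:
        return "good - monitor"
    if duration_ms < 5_000:
        return "moderate - investigate"
    return "poor - immediate investigation"

print(classify_duration(890))    # excellent - no action needed
print(classify_duration(4_200))  # moderate - investigate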

Time to first token (streaming)

For streaming requests, the key metric is how quickly the first token arrives. While Request Logs show total duration, you can estimate first-token latency by comparing streaming vs. non-streaming requests with similar prompts.
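
To measure it directly, time the gap between sending a streaming request and receiving the first content chunk. A minimal sketch using the OpenAI Python SDK (openai >= 1.0) as the example client; swap in whichever provider or gateway endpoint you actually route through:

import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content (e.g., the final stop chunk), so guard the access.
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.monotonic()

total_ms = (time.monotonic() - start) * 1000
if first_token_at is not None:
    print(f"time to first token: {(first_token_at - start) * 1000:.0f}ms")
print(f"total duration:      {total_ms:.0f}ms")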

Failover latency impact

Each failed provider attempt adds latency. A request that needed three provider attempts (two failovers) may show:

Attempt 1: OpenAI failed after 2,100ms
Attempt 2: Anthropic failed after 1,800ms
Attempt 3: Google success in 890ms
Total effective latency: 4,790ms

Without failovers, this request would have completed in ~890ms.
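
To quantify this overhead across your logs, the arithmetic is simple: sum every attempt's duration and subtract the successful attempt. A minimal sketch, assuming you can export per-attempt durations in a structure like the one below (the field names are illustrative):

attempts = [
    {"provider": "openai",    "duration_ms": 2100, "status": "failed"},
    {"provider": "anthropic", "duration_ms": 1800, "status": "failed"},
    {"provider": "google",    "duration_ms": 890,  "status": "success"},
]

total_ms = sum(a["duration_ms"] for a in attempts)
success_ms = next(a["duration_ms"] for a in attempts if a["status"] == "success")

print(f"total effective latency: {total_ms}ms")               # 4790ms
print(f"failover overhead:       {total_ms - success_ms}ms")  # 3900ms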


Step-by-step performance monitoring

1. Establish baseline metrics

Before you can identify anomalies, establish what "normal" looks like:

  1. Filter to Live mode and review 7 days of requests
  2. Note typical duration ranges for each provider/model
  3. Record average token counts per request type
  4. Document typical failover frequency

Example baseline:

Provider: OpenAI gpt-4o
P50 latency: 1,200ms
P95 latency: 3,400ms
P99 latency: 5,100ms
Avg tokens: 1,800 total
 
Provider: Anthropic claude-3.5-sonnet
P50 latency: 1,100ms
P95 latency: 2,800ms
P99 latency: 4,200ms
Avg tokens: 1,900 total
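
If you have access to raw durations (from an export or your own instrumentation), the percentiles above are straightforward to compute with Python's standard library. A minimal sketch; the sample values are illustrative:

import statistics

durations_ms = [1120, 980, 1340, 2210, 1050, 3400, 1275, 890, 5100, 1190]  # exported sample

cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
print(f"P50 latency: {cuts[49]:.0f}ms")
print(f"P95 latency: {cuts[94]:.0f}ms")
print(f"P99 latency: {cuts[98]:.0f}ms")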

2. Identify performance outliers

  1. Navigate to Request Logs and filter to Live mode
  2. Look for requests with abnormally high duration values
  3. Click to inspect — check the timeline for:
    • Multiple failover attempts — Primary cause of unexpected latency
    • High token counts — More tokens = longer processing
    • Specific provider/model — Some models are consistently slower
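
The same checks can be scripted against exported log data. A minimal sketch that flags requests running past twice the baseline P95 and notes the likely cause; the record fields and thresholds are illustrative:

BASELINE_P95_MS = 3400  # from step 1

requests = [
    {"id": "req_01", "duration_ms": 1210, "attempts": 1, "total_tokens": 1750},
    {"id": "req_02", "duration_ms": 7420, "attempts": 3, "total_tokens": 1810},
    {"id": "req_03", "duration_ms": 9800, "attempts": 1, "total_tokens": 6400},
]

for req in requests:
    if req["duration_ms"] <= 2 * BASELINE_P95_MS:
        continue
    reasons = []
    if req["attempts"] > 1:
        reasons.append(f"{req['attempts']} provider attempts")
    if req["total_tokens"] > 4000:
        reasons.append("high token count")
    cause = ", ".join(reasons) or "no obvious cause"
    print(f"{req['id']}: {req['duration_ms']}ms - inspect ({cause})")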

3. Compare provider performance

Run the same prompt through different providers to compare:

Provider Performance Comparison

Provider                Latency    Tokens    Cost
OpenAI gpt-4o-mini      420ms      800       $0.0004
OpenAI gpt-4o           1,200ms    780       $0.0120
Anthropic claude-3.5    1,100ms    820       $0.0098
Google gemini-1.5       950ms      810       $0.0071
Anthropic claude-3      2,100ms    790       $0.0450

Use this data to configure your workflow's provider priority for the best balance of speed, quality, and cost.
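
One way to gather numbers like these is a small timing harness that sends the same prompt to each provider you care about. A minimal sketch; call_provider stands in for whatever SDK or gateway call you actually use:

import time

PROMPT = "Classify this support ticket as billing, technical, or other: 'My invoice is wrong.'"

def time_call(label: str, call_provider) -> None:
    """Time a single completion call; call_provider(prompt) is a hypothetical callable."""
    start = time.monotonic()
    text = call_provider(PROMPT)
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{label:<28} {elapsed_ms:>8.0f}ms   {len(text)} chars")

# Wire up one callable per provider/model you want to compare, for example
# (openai_chat and anthropic_chat are hypothetical wrappers around the respective SDKs):
# time_call("OpenAI gpt-4o-mini", lambda p: openai_chat(p, model="gpt-4o-mini"))
# time_call("Anthropic claude-3.5", lambda p: anthropic_chat(p, model="claude-3-5-sonnet"))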

4. Monitor peak-hour performance

Performance often degrades during peak usage:

  1. Filter logs to specific time windows (e.g., 2-6 PM)
  2. Compare duration statistics to off-peak hours
  3. Look for higher failover rates during peak times
  4. Check if provider rate limits are being hit
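
If you export timestamps alongside durations, you can compare windows programmatically. A minimal sketch that splits requests into peak and off-peak buckets; the record shape and peak window are illustrative:

import statistics
from collections import defaultdict
from datetime import datetime

records = [
    {"timestamp": "2024-05-01T14:12:03+00:00", "duration_ms": 2900},
    {"timestamp": "2024-05-01T15:40:21+00:00", "duration_ms": 4100},
    {"timestamp": "2024-05-01T03:05:44+00:00", "duration_ms": 1100},
    {"timestamp": "2024-05-01T04:17:09+00:00", "duration_ms": 1250},
]

PEAK_HOURS = range(14, 18)  # 2-6 PM

by_window = defaultdict(list)
for rec in records:
    hour = datetime.fromisoformat(rec["timestamp"]).hour
    by_window["peak" if hour in PEAK_HOURS else "off-peak"].append(rec["duration_ms"])

for window, durations in by_window.items():
    print(f"{window}: median {statistics.median(durations):.0f}ms across {len(durations)} requests")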

Performance optimization strategies

Optimize provider selection

Configure your workflow to use the fastest reliable provider as primary:

Workflow provider priority:
1. Google Gemini 1.5 Pro (fastest, cost-effective)
2. Anthropic Claude 3.5 (reliable, good quality)
3. OpenAI GPT-4o (fallback, highest quality)
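
The priority list itself lives in your workflow configuration; the sketch below only illustrates the behavior it produces, with one hypothetical callable per provider:

PRIORITY = ["google/gemini-1.5-pro", "anthropic/claude-3.5-sonnet", "openai/gpt-4o"]

def complete_with_failover(prompt: str, providers: dict) -> tuple[str, str]:
    """Try providers in priority order; return (provider, response) from the first success."""
    errors = {}
    for name in PRIORITY:
        try:
            return name, providers[name](prompt)  # providers maps name -> hypothetical callable
        except Exception as exc:  # timeout, rate limit, auth failure, ...
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")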

Reduce token count

Fewer tokens = faster processing:

  • Trim system prompts — Remove unnecessary instructions
  • Limit conversation history — Summarize older messages
  • Set appropriate max_tokens — Don't allow more output than needed
  • Use structured outputs — Constrained outputs are typically faster
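
These limits are easy to enforce in the code that builds each request. A minimal sketch, assuming the common chat-completions message shape; the exact limits are illustrative:

MAX_HISTORY_MESSAGES = 6   # keep only the most recent exchanges
MAX_OUTPUT_TOKENS = 300    # cap output at what the use case actually needs

def build_request(system_prompt: str, history: list[dict], user_message: str) -> dict:
    """Assemble a request body with trimmed history and a bounded max_tokens."""
    recent = history[-MAX_HISTORY_MESSAGES:]
    return {
        "messages": [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": user_message}],
        "max_tokens": MAX_OUTPUT_TOKENS,
    }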

Minimize failovers

Failovers are the biggest source of unexpected latency:

  • Monitor provider reliability — Use Provider Reliability data to choose stable providers
  • Configure sensible timeouts — Don't wait too long for a slow provider before failing over
  • Keep provider credentials current — Expired API keys cause immediate failures
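
Timeouts in particular are worth setting client-side so a stalled provider fails fast and the next one in your priority list gets a chance. A sketch using the OpenAI Python SDK (openai >= 1.0) as an example; other SDKs expose similar options, and the 5-second value is illustrative:

from openai import OpenAI, APITimeoutError

client = OpenAI(timeout=5.0)  # seconds; tune relative to your baseline P95

def call_primary(prompt: str) -> str | None:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    except APITimeoutError:
        return None  # signal the caller to fail over to the next provider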

Consider model-task matching

Use faster models for simpler tasks:

  • Classification, routing — gpt-4o-mini (< 500ms typical)
  • Content generation — gpt-4o or claude-3.5-sonnet (1-2s typical)
  • Complex analysis — claude-3-opus (2-4s typical, but highest quality)
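
A simple routing table is often enough to enforce this. A minimal sketch; the task labels and default mirror the guidance above and are illustrative:

MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "routing": "gpt-4o-mini",
    "content_generation": "gpt-4o",
    "complex_analysis": "claude-3-opus",
}

def pick_model(task: str) -> str:
    return MODEL_BY_TASK.get(task, "gpt-4o")  # reasonable default for unknown tasks

print(pick_model("classification"))    # gpt-4o-mini
print(pick_model("complex_analysis"))  # claude-3-opus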

Setting up performance alerts

Based on your baseline metrics, set alerts for:

  • P95 latency exceeding 2x baseline — Early warning for degradation
  • Failover rate exceeding 10% — Provider instability
  • Duration exceeding absolute threshold — e.g., 5,000ms for user-facing requests
  • Consistent slowdown over time — Gradual degradation may indicate growing prompt sizes
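
If your alerting runs on exported metrics, the checks come down to a few comparisons. A minimal sketch; the thresholds mirror the list above and the window dict shape is illustrative:

BASELINE_P95_MS = 3400
ABSOLUTE_LIMIT_MS = 5000
MAX_FAILOVER_RATE = 0.10

def check_alerts(window: dict) -> list[str]:
    """Evaluate alert conditions for one rolling window of request metrics."""
    alerts = []
    if window["p95_ms"] > 2 * BASELINE_P95_MS:
        alerts.append(f"P95 {window['p95_ms']}ms exceeds 2x baseline ({BASELINE_P95_MS}ms)")
    if window["failover_rate"] > MAX_FAILOVER_RATE:
        alerts.append(f"failover rate {window['failover_rate']:.0%} exceeds 10%")
    if window["max_ms"] > ABSOLUTE_LIMIT_MS:
        alerts.append(f"slowest request {window['max_ms']}ms exceeds {ABSOLUTE_LIMIT_MS}ms")
    return alerts

print(check_alerts({"p95_ms": 7200, "failover_rate": 0.14, "max_ms": 9100}))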

Next steps