Monitor and optimize request performance

Duration data reveals performance bottlenecks. Set baselines, track trends, and balance speed with cost and quality for the best user experience.

Overview

Request duration directly impacts how responsive your AI features feel to users. Request Logs capture precise timing for every request and every provider attempt, giving you the data to monitor performance and optimize for speed.


Establishing baselines

Collect baseline data

Review 7 days of production requests to establish your baselines:

Model: gpt-4o
P50: 1,200ms (half of requests are faster)
P90: 2,800ms (90% of requests are faster)
P95: 3,400ms (95% of requests are faster)
P99: 5,100ms (99% of requests are faster)
 
Model: gpt-4o-mini
P50: 420ms
P90: 780ms
P95: 1,100ms
P99: 1,800ms
 
Model: claude-3.5-sonnet
P50: 1,100ms
P90: 2,200ms
P95: 2,800ms
P99: 4,200ms
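
If your logs can be exported, these percentiles are easy to recompute yourself. A minimal sketch in Python, assuming a CSV export with a duration_ms column (the file and column names are illustrative, not a fixed export format):

# Sketch: compute latency percentiles from exported request durations (ms).
# "request_logs_7d.csv" and the "duration_ms" column are assumed names;
# adapt them to your actual export.
import csv

with open("request_logs_7d.csv", newline="") as f:
    durations = sorted(float(row["duration_ms"]) for row in csv.DictReader(f))

def percentile(values, pct):
    # Nearest-rank percentile: the value below which pct% of requests fall
    index = max(0, round(pct / 100 * len(values)) - 1)
    return values[index]

for pct in (50, 90, 95, 99):
    print(f"P{pct}: {percentile(durations, pct):,.0f}ms")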

Set alert thresholds

Based on baselines, set performance alerts:

Alert level   Formula     Example
Warning       P95 × 1.5   3,400ms × 1.5 = 5,100ms
Critical      P95 × 2.0   3,400ms × 2.0 = 6,800ms
Emergency     P95 × 3.0   3,400ms × 3.0 = 10,200ms
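
Because the thresholds are simple multiples of your P95 baseline, they can live right next to your alerting code. A minimal sketch:

# Sketch: derive alert thresholds from a P95 baseline (values in ms).
P95_BASELINE = 3400  # gpt-4o baseline from above

THRESHOLDS = {
    "warning": P95_BASELINE * 1.5,    # 5,100ms
    "critical": P95_BASELINE * 2.0,   # 6,800ms
    "emergency": P95_BASELINE * 3.0,  # 10,200ms
}

def classify(duration_ms):
    # Return the most severe level this duration exceeds
    for level in ("emergency", "critical", "warning"):
        if duration_ms >= THRESHOLDS[level]:
            return level
    return "ok"

print(classify(7200))  # "critical"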

Duration breakdown

Where time is spent

A request's total duration includes:

Total duration = Network overhead + Provider processing + Response transfer
 
Typical breakdown for a 1,500ms request:
Network overhead: 50ms (3%)
Provider processing: 1,350ms (90%)
Response transfer: 100ms (7%)
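
The same arithmetic in code, using the example values above:

# Sketch: express each component as a share of total duration (example values).
components = {
    "Network overhead": 50,
    "Provider processing": 1350,
    "Response transfer": 100,
}
total = sum(components.values())  # 1,500ms
for name, ms in components.items():
    print(f"{name}: {ms}ms ({ms / total:.0%})")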

Failover impact on duration

When failovers occur, each failed attempt adds to total latency:

Without failover:
Primary provider: 1,200ms
Total: 1,200ms
 
With 1 failover:
Primary: 800ms (failed)
Secondary: 1,100ms
Total: 1,900ms (+58%)
 
With 2 failovers:
Primary: 800ms (failed)
Secondary: 600ms (failed)
Tertiary: 950ms
Total: 2,350ms (+96%)
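
If each log entry records its individual provider attempts, the overhead added by failed attempts can be computed directly. A sketch, assuming an attempts list with per-attempt durations (field names are illustrative, not the actual log schema):

# Sketch: how much latency failed attempts added to a single request.
# The "attempts", "duration_ms", and "status" fields are assumed names;
# adapt them to your actual Request Logs schema.
def failover_overhead(log_entry):
    attempts = log_entry["attempts"]
    total_ms = sum(a["duration_ms"] for a in attempts)
    wasted_ms = sum(a["duration_ms"] for a in attempts if a["status"] == "failed")
    return {"total_ms": total_ms, "wasted_ms": wasted_ms}

entry = {"attempts": [
    {"provider": "primary", "duration_ms": 800, "status": "failed"},
    {"provider": "secondary", "duration_ms": 1100, "status": "succeeded"},
]}
print(failover_overhead(entry))  # {'total_ms': 1900, 'wasted_ms': 800}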

Performance optimization checklist

Quick wins

  • Set appropriate max_tokens — Lower limits = shorter response time
  • Choose faster models for simple tasks — gpt-4o-mini is 2-3x faster than gpt-4o
  • Trim system prompts — Fewer input tokens = faster processing
  • Reduce conversation history — Summarize instead of sending full history
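
The first two quick wins map directly onto request parameters. A hedged sketch using the OpenAI Python SDK (the model, token limit, and prompts are examples, not recommendations):

# Sketch: cap output length and use a faster model for a simple task.
# Assumes the OpenAI Python SDK; adapt for your own provider or gateway.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # faster, cheaper model for a simple task
    max_tokens=150,        # lower cap shortens the worst-case response time
    messages=[
        {"role": "system", "content": "Answer in one short paragraph."},  # trimmed system prompt
        {"role": "user", "content": "Summarize this support ticket: ..."},
    ],
)
print(response.choices[0].message.content)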

Provider selection

Optimize your workflow's provider priority for speed:

For latency-sensitive features (chat UX):
1. gpt-4o-mini (fastest, good enough quality)
2. claude-3-haiku (fast fallback)
3. gpt-4o (quality fallback)
 
For quality-sensitive features (reports):
1. claude-3.5-sonnet (balanced)
2. gpt-4o (high quality fallback)
3. gemini-1.5-pro (alternative)
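
How this priority is expressed depends on your workflow configuration; the structure below is purely illustrative:

# Sketch: provider priority lists per feature (illustrative structure only;
# your workflow configuration will have its own format).
PROVIDER_PRIORITY = {
    "chat_ux": [             # latency-sensitive
        "gpt-4o-mini",       # fastest, good enough quality
        "claude-3-haiku",    # fast fallback
        "gpt-4o",            # quality fallback
    ],
    "reports": [             # quality-sensitive
        "claude-3.5-sonnet", # balanced
        "gpt-4o",            # high-quality fallback
        "gemini-1.5-pro",    # alternative
    ],
}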

Infrastructure

  • Minimize failovers — Choose reliable primary providers
  • Configure timeouts — Don't wait too long for a slow provider
  • Stream responses — Users see content sooner even if total time is the same
  • Cache repeated queries — For identical or similar prompts
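
Timeouts and streaming are usually one-line changes. A hedged sketch with the OpenAI Python SDK (the 5-second timeout is an example value; derive yours from your P95 baselines):

# Sketch: per-request timeout plus streaming so users see output sooner.
# Assumes the OpenAI Python SDK; the timeout value is an example.
from openai import OpenAI

client = OpenAI()

stream = client.with_options(timeout=5.0).chat.completions.create(
    model="gpt-4o-mini",
    stream=True,  # tokens arrive as they are generated
    messages=[{"role": "user", "content": "Explain request latency briefly."}],
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)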

Track trends over time

Track these metrics weekly:

Week   P50       P95       Failover rate   Notes
W1     1,200ms   3,400ms   2.8%
W2     1,250ms   3,600ms   3.1%            Slight increase
W3     1,180ms   3,200ms   2.5%            Improved
W4     1,400ms   4,800ms   6.2%            Investigate!

Week 4 shows a significant performance degradation — likely caused by a provider issue or increased prompt sizes.
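
If you export logs on a schedule, this rollup can be automated. A sketch with pandas, assuming timestamp, duration_ms, and failover_count columns (the names are illustrative):

# Sketch: weekly P50/P95 and failover rate from exported request logs.
# Column names (timestamp, duration_ms, failover_count) are assumptions.
import pandas as pd

logs = pd.read_csv("request_logs.csv", parse_dates=["timestamp"])
weekly = logs.groupby(pd.Grouper(key="timestamp", freq="W")).agg(
    p50_ms=("duration_ms", lambda s: s.quantile(0.50)),
    p95_ms=("duration_ms", lambda s: s.quantile(0.95)),
    failover_rate=("failover_count", lambda s: (s > 0).mean()),
)
print(weekly.round(2))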


Balancing speed, cost, and quality

The performance triangle:

          Speed
          /   \
         /     \
        /       \
    Cost ------- Quality
 
Fast + Cheap = Lower quality (gpt-4o-mini)
Fast + Quality = Expensive (gpt-4o with fast provider)
Cheap + Quality = Slower (batch processing)

Use Request Logs to find your optimal balance by comparing duration, token usage, and response quality across different configurations.


Next steps