Monitor and optimize request performance

Duration data reveals performance bottlenecks. Set baselines, track trends, and balance speed with cost and quality for the best user experience.

Overview

Request duration directly impacts how responsive your AI features feel to users. Request Logs capture precise timing for every request and every provider attempt, giving you the data to monitor performance and optimize for speed.


Establishing baselines

Collect baseline data

Review 7 days of production requests to establish your baselines:

Model: gpt-4o
P50: 1,200ms (half of requests are faster)
P90: 2,800ms (90% of requests are faster)
P95: 3,400ms (95% of requests are faster)
P99: 5,100ms (99% of requests are faster)
 
Model: gpt-4o-mini
P50: 420ms
P90: 780ms
P95: 1,100ms
P99: 1,800ms
 
Model: claude-3.5-sonnet
P50: 1,100ms
P90: 2,200ms
P95: 2,800ms
P99: 4,200ms
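
If your logs can be exported, these percentiles are easy to recompute yourself. A minimal sketch in Python, assuming a CSV export with a duration_ms column (the file and column names are illustrative, not a fixed export format):

# Sketch: compute latency percentiles from exported request durations (ms).
# "request_logs_7d.csv" and the "duration_ms" column are assumed names;
# adapt them to your actual export.
import csv

with open("request_logs_7d.csv", newline="") as f:
    durations = sorted(float(row["duration_ms"]) for row in csv.DictReader(f))

def percentile(values, pct):
    # Nearest-rank percentile: the value below which pct% of requests fall
    index = max(0, round(pct / 100 * len(values)) - 1)
    return values[index]

for pct in (50, 90, 95, 99):
    print(f"P{pct}: {percentile(durations, pct):,.0f}ms")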

Set alert thresholds

Based on baselines, set performance alerts:

Alert level   Formula     Example
Warning       P95 × 1.5   3,400ms × 1.5 = 5,100ms
Critical      P95 × 2.0   3,400ms × 2.0 = 6,800ms
Emergency     P95 × 3.0   3,400ms × 3.0 = 10,200ms
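
Because the thresholds are simple multiples of your P95 baseline, they can live right next to your alerting code. A minimal sketch:

# Sketch: derive alert thresholds from a P95 baseline (values in ms).
P95_BASELINE = 3400  # gpt-4o baseline from above

THRESHOLDS = {
    "warning": P95_BASELINE * 1.5,    # 5,100ms
    "critical": P95_BASELINE * 2.0,   # 6,800ms
    "emergency": P95_BASELINE * 3.0,  # 10,200ms
}

def classify(duration_ms):
    # Return the most severe level this duration exceeds
    for level in ("emergency", "critical", "warning"):
        if duration_ms >= THRESHOLDS[level]:
            return level
    return "ok"

print(classify(7200))  # "critical"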

Duration breakdown

Where time is spent

A request's total duration includes:

Total duration = Network overhead + Provider processing + Response transfer
 
Typical breakdown for a 1,500ms request:
Network overhead: 50ms (3%)
Provider processing: 1,350ms (90%)
Response transfer: 100ms (7%)
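
The same arithmetic in code, using the example values above:

# Sketch: express each component as a share of total duration (example values).
components = {
    "Network overhead": 50,
    "Provider processing": 1350,
    "Response transfer": 100,
}
total = sum(components.values())  # 1,500ms
for name, ms in components.items():
    print(f"{name}: {ms}ms ({ms / total:.0%})")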

Failover impact on duration

When failovers occur, each failed attempt adds to total latency:

Without failover:
Primary provider: 1,200ms
Total: 1,200ms
 
With 1 failover:
Primary: 800ms (failed)
Secondary: 1,100ms
Total: 1,900ms (+58%)
 
With 2 failovers:
Primary: 800ms (failed)
Secondary: 600ms (failed)
Tertiary: 950ms
Total: 2,350ms (+96%)
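
If each log entry records its individual provider attempts, the overhead added by failed attempts can be computed directly. A sketch, assuming an attempts list with per-attempt durations (field names are illustrative, not the actual log schema):

# Sketch: how much latency failed attempts added to a single request.
# The "attempts", "duration_ms", and "status" fields are assumed names;
# adapt them to your actual Request Logs schema.
def failover_overhead(log_entry):
    attempts = log_entry["attempts"]
    total_ms = sum(a["duration_ms"] for a in attempts)
    wasted_ms = sum(a["duration_ms"] for a in attempts if a["status"] == "failed")
    return {"total_ms": total_ms, "wasted_ms": wasted_ms}

entry = {"attempts": [
    {"provider": "primary", "duration_ms": 800, "status": "failed"},
    {"provider": "secondary", "duration_ms": 1100, "status": "succeeded"},
]}
print(failover_overhead(entry))  # {'total_ms': 1900, 'wasted_ms': 800}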

Performance optimization checklist

Quick wins

  • Set appropriate max_tokens — Lower limits = shorter response time
  • Choose faster models for simple tasks — gpt-4o-mini is 2-3x faster than gpt-4o
  • Trim system prompts — Fewer input tokens = faster processing
  • Reduce conversation history — Summarize instead of sending full history
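
The first two quick wins map directly onto request parameters. A hedged sketch using the OpenAI Python SDK (the model, token limit, and prompts are examples, not recommendations):

# Sketch: cap output length and use a faster model for a simple task.
# Assumes the OpenAI Python SDK; adapt for your own provider or gateway.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # faster, cheaper model for a simple task
    max_tokens=150,        # lower cap shortens the worst-case response time
    messages=[
        {"role": "system", "content": "Answer in one short paragraph."},  # trimmed system prompt
        {"role": "user", "content": "Summarize this support ticket: ..."},
    ],
)
print(response.choices[0].message.content)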

Provider selection

Optimize your workflow's provider priority for speed:

For latency-sensitive features (chat UX):
1. gpt-4o-mini (fastest, good enough quality)
2. claude-3-haiku (fast fallback)
3. gpt-4o (quality fallback)
 
For quality-sensitive features (reports):
1. claude-3.5-sonnet (balanced)
2. gpt-4o (high quality fallback)
3. gemini-1.5-pro (alternative)
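
How this priority is expressed depends on your workflow configuration; the structure below is purely illustrative:

# Sketch: provider priority lists per feature (illustrative structure only;
# your workflow configuration will have its own format).
PROVIDER_PRIORITY = {
    "chat_ux": [             # latency-sensitive
        "gpt-4o-mini",       # fastest, good enough quality
        "claude-3-haiku",    # fast fallback
        "gpt-4o",            # quality fallback
    ],
    "reports": [             # quality-sensitive
        "claude-3.5-sonnet", # balanced
        "gpt-4o",            # high-quality fallback
        "gemini-1.5-pro",    # alternative
    ],
}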

Infrastructure

  • Minimize failovers — Choose reliable primary providers
  • Configure timeouts — Don't wait too long for a slow provider
  • Stream responses — Users see content sooner even if total time is the same
  • Cache repeated queries — For identical or similar prompts
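
Timeouts and streaming are usually one-line changes. A hedged sketch with the OpenAI Python SDK (the 5-second timeout is an example value; derive yours from your P95 baselines):

# Sketch: per-request timeout plus streaming so users see output sooner.
# Assumes the OpenAI Python SDK; the timeout value is an example.
from openai import OpenAI

client = OpenAI()

stream = client.with_options(timeout=5.0).chat.completions.create(
    model="gpt-4o-mini",
    stream=True,  # tokens arrive as they are generated
    messages=[{"role": "user", "content": "Explain request latency briefly."}],
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)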

Track trends over time

Track these metrics weekly:

Week   P50       P95       Failover rate   Notes
W1     1,200ms   3,400ms   2.8%
W2     1,250ms   3,600ms   3.1%            Slight increase
W3     1,180ms   3,200ms   2.5%            Improved
W4     1,400ms   4,800ms   6.2%            Investigate!

Week 4 shows a significant performance degradation — likely caused by a provider issue or increased prompt sizes.
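
If you export logs on a schedule, this rollup can be automated. A sketch with pandas, assuming timestamp, duration_ms, and failover_count columns (the names are illustrative):

# Sketch: weekly P50/P95 and failover rate from exported request logs.
# Column names (timestamp, duration_ms, failover_count) are assumptions.
import pandas as pd

logs = pd.read_csv("request_logs.csv", parse_dates=["timestamp"])
weekly = logs.groupby(pd.Grouper(key="timestamp", freq="W")).agg(
    p50_ms=("duration_ms", lambda s: s.quantile(0.50)),
    p95_ms=("duration_ms", lambda s: s.quantile(0.95)),
    failover_rate=("failover_count", lambda s: (s > 0).mean()),
)
print(weekly.round(2))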


Balancing speed, cost, and quality

The performance triangle:

          Speed
          /   \
         /     \
        /       \
    Cost ------- Quality
 
Fast + Cheap = Lower quality (gpt-4o-mini)
Fast + Quality = Expensive (gpt-4o with fast provider)
Cheap + Quality = Slower (batch processing)

Use Request Logs to find your optimal balance by comparing duration, token usage, and response quality across different configurations.


Next steps