Overview
Request duration directly impacts how responsive your AI features feel to users. Request Logs capture precise timing for every request and every provider attempt, giving you the data to monitor performance and optimize for speed.
Establishing baselines
Collect baseline data
Review 7 days of production requests to establish your baselines:
| Model | P50 | P90 | P95 | P99 |
|---|---|---|---|---|
| gpt-4o | 1,200ms | 2,800ms | 3,400ms | 5,100ms |
| gpt-4o-mini | 420ms | 780ms | 1,100ms | 1,800ms |
| claude-3.5-sonnet | 1,100ms | 2,200ms | 2,800ms | 4,200ms |

P50 means half of requests complete faster than that value; P90, P95, and P99 mean 90%, 95%, and 99% of requests do.
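If you export per-request durations from Request Logs, you can compute these percentiles yourself. The sketch below uses only the Python standard library and assumes `durations_ms` holds seven days of request durations in milliseconds; the variable names are illustrative, not a Request Logs schema.

```python
# Minimal sketch: derive latency percentiles from exported request durations.
# `durations_ms` is assumed to hold 7 days of per-request durations (ms).
from statistics import quantiles

def latency_percentiles(durations_ms: list[float]) -> dict[str, float]:
    """Return the P50/P90/P95/P99 latency for a list of request durations."""
    cuts = quantiles(durations_ms, n=100)  # 99 cut points; cuts[k-1] approximates Pk
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

durations_ms = [980.0, 1180.0, 1225.0, 2410.0, 3390.0, 5020.0]  # toy data
print(latency_percentiles(durations_ms))
```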
Set alert thresholds
Based on these baselines, set performance alerts:
| Alert level | Formula | Example |
|---|---|---|
| Warning | P95 × 1.5 | 3,400ms × 1.5 = 5,100ms |
| Critical | P95 × 2.0 | 3,400ms × 2.0 = 6,800ms |
| Emergency | P95 × 3.0 | 3,400ms × 3.0 = 10,200ms |
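The thresholds themselves are simple multiples of your P95 baseline; how you wire them into alerts depends on your monitoring stack. A minimal sketch of the arithmetic, assuming a 3,400ms P95:

```python
# Sketch: derive warning/critical/emergency thresholds from a P95 baseline,
# using the multipliers from the table above. Alert wiring is up to your stack.
def alert_thresholds(p95_ms: float) -> dict[str, float]:
    return {
        "warning": p95_ms * 1.5,
        "critical": p95_ms * 2.0,
        "emergency": p95_ms * 3.0,
    }

print(alert_thresholds(3400))
# {'warning': 5100.0, 'critical': 6800.0, 'emergency': 10200.0}
```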
Duration breakdown
Where time is spent
A request's total duration includes:
Total duration = Network overhead + Provider processing + Response transfer

Typical breakdown for a 1,500ms request:

- Network overhead: 50ms (3%)
- Provider processing: 1,350ms (90%)
- Response transfer: 100ms (7%)
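A quick way to see where a slow request spends its time is to compute each component's share of the total. The sketch below assumes you have the three component timings for a request; the field names are illustrative, not the actual Request Logs schema.

```python
# Sketch: express each timing component as a share of total request duration.
# Field names are illustrative, not the actual Request Logs schema.
def duration_breakdown(network_ms: float, provider_ms: float, transfer_ms: float) -> dict[str, str]:
    total_ms = network_ms + provider_ms + transfer_ms
    parts = {"network": network_ms, "provider": provider_ms, "transfer": transfer_ms}
    return {name: f"{ms:.0f}ms ({ms / total_ms:.0%})" for name, ms in parts.items()}

print(duration_breakdown(network_ms=50, provider_ms=1350, transfer_ms=100))
# {'network': '50ms (3%)', 'provider': '1350ms (90%)', 'transfer': '100ms (7%)'}
```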
Failover impact on duration
When failovers occur, each failed attempt adds to total latency:
Without failover:
- ✓ Primary provider: 1,200ms
- Total: 1,200ms

With 1 failover:
- ✗ Primary: 800ms (failed)
- ✓ Secondary: 1,100ms
- Total: 1,900ms (+58%)

With 2 failovers:
- ✗ Primary: 800ms (failed)
- ✗ Secondary: 600ms (failed)
- ✓ Tertiary: 950ms
- Total: 2,350ms (+96%)
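Because attempts run sequentially, the total is simply the sum of every attempt's duration, and the percentages above compare that sum to the no-failover baseline. A minimal sketch of that arithmetic, using per-attempt timings like those recorded in Request Logs:

```python
# Sketch: total latency across sequential attempts and the increase relative
# to a no-failover baseline (the duration of a clean primary-only request).
def failover_impact(attempt_durations_ms: list[float], baseline_ms: float) -> tuple[float, float]:
    total_ms = sum(attempt_durations_ms)
    increase = (total_ms - baseline_ms) / baseline_ms
    return total_ms, increase

total_ms, increase = failover_impact([800, 600, 950], baseline_ms=1200)
print(f"{total_ms:.0f}ms total ({increase:+.0%} vs. no failover)")
# 2350ms total (+96% vs. no failover)
```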
Performance optimization checklist
Quick wins
- Set an appropriate `max_tokens` — Lower limits = shorter response time (see the sketch after this list)
- Choose faster models for simple tasks — `gpt-4o-mini` is 2-3x faster than `gpt-4o`
- Trim system prompts — Fewer input tokens = faster processing
- Reduce conversation history — Summarize instead of sending full history
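Several of these wins are just request parameters. A minimal sketch using the OpenAI Python SDK (an assumption; substitute whichever client your workflow uses) that combines a smaller model, a tight `max_tokens` cap, and a short system prompt:

```python
# Minimal sketch (assumes the OpenAI Python SDK and OPENAI_API_KEY in the env).
# A smaller model, a tight max_tokens cap, and a short system prompt all
# shorten request duration.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # faster model for a simple task
    max_tokens=150,        # cap output length; fewer output tokens = faster responses
    messages=[
        {"role": "system", "content": "You are a concise support assistant."},  # trimmed prompt
        {"role": "user", "content": "Summarize the order status below in two sentences."},
    ],
)
print(response.choices[0].message.content)
```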
Provider selection
Optimize your workflow's provider priority for speed:
For latency-sensitive features (chat UX):

1. gpt-4o-mini (fastest, good enough quality)
2. claude-3-haiku (fast fallback)
3. gpt-4o (quality fallback)

For quality-sensitive features (reports):

1. claude-3.5-sonnet (balanced)
2. gpt-4o (high quality fallback)
3. gemini-1.5-pro (alternative)
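If you manage these priorities in code or configuration, keeping them as plain data per use case makes them easy to review. The sketch below is a hypothetical structure, not an actual workflow schema; adapt it to your gateway's settings.

```python
# Hypothetical provider-priority configuration, kept as plain data per use case.
# This is not an actual workflow schema; adapt it to your gateway's settings.
PROVIDER_PRIORITY: dict[str, list[str]] = {
    "chat-ux": [              # latency-sensitive: fastest acceptable model first
        "gpt-4o-mini",
        "claude-3-haiku",
        "gpt-4o",
    ],
    "reports": [              # quality-sensitive: balanced model first
        "claude-3.5-sonnet",
        "gpt-4o",
        "gemini-1.5-pro",
    ],
}

print(PROVIDER_PRIORITY["chat-ux"][0])  # first provider to try for chat UX
```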
Infrastructure
- Minimize failovers — Choose reliable primary providers
- Configure timeouts — Don't wait too long for a slow provider
- Stream responses — Users see content sooner even if total time is the same (see the sketch after this list)
- Cache repeated queries — For identical or similar prompts
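Timeouts and streaming in particular are client-side settings. A minimal sketch with the OpenAI Python SDK (an assumption; use your own client) that sets a hard request timeout and streams tokens as they arrive:

```python
# Sketch (assumes the OpenAI Python SDK): cap how long you wait on a slow
# provider and stream the response so users see output immediately.
from openai import OpenAI

client = OpenAI(timeout=10.0)  # give up on a slow provider after 10 seconds

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Draft a one-line status update."}],
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```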
Monitoring duration trends
Track these metrics weekly:
| Week | P50 | P95 | Failover rate | Notes |
|---|---|---|---|---|
| W1 | 1,200ms | 3,400ms | 2.8% | |
| W2 | 1,250ms | 3,600ms | 3.1% | Slight increase |
| W3 | 1,180ms | 3,200ms | 2.5% | Improved |
| W4 | 1,400ms | 4,800ms | 6.2% | ⚠ Investigate |

Week 4 shows a significant performance degradation, likely caused by a provider issue or increased prompt sizes.
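You can automate this check by comparing each week's numbers to your baseline. The sketch below uses illustrative data and illustrative thresholds (25% P95 drift or a doubled failover rate), not product defaults:

```python
# Sketch: flag weeks whose P95 or failover rate drifts well above the baseline.
# Thresholds (25% latency drift, 2x failover rate) are illustrative choices.
BASELINE = {"p95_ms": 3400, "failover_rate": 0.028}

weekly = [
    {"week": "W1", "p95_ms": 3400, "failover_rate": 0.028},
    {"week": "W2", "p95_ms": 3600, "failover_rate": 0.031},
    {"week": "W3", "p95_ms": 3200, "failover_rate": 0.025},
    {"week": "W4", "p95_ms": 4800, "failover_rate": 0.062},
]

for row in weekly:
    latency_drift = row["p95_ms"] / BASELINE["p95_ms"] - 1
    failover_jump = row["failover_rate"] / BASELINE["failover_rate"]
    status = "investigate" if latency_drift > 0.25 or failover_jump >= 2.0 else "ok"
    print(f"{row['week']}: P95 {row['p95_ms']}ms ({latency_drift:+.0%}), "
          f"failover rate {row['failover_rate']:.1%} -> {status}")
```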
Balancing speed, cost, and quality
The performance triangle:
- Fast + Cheap = Lower quality (gpt-4o-mini)
- Fast + Quality = Expensive (gpt-4o with a fast provider)
- Cheap + Quality = Slower (batch processing)

Use Request Logs to find your optimal balance by comparing duration, token usage, and response quality across different configurations.
Next steps
- Request & Response Inspection — Debug with payloads
- Timeline Context — Complete lifecycle view
- Performance Monitoring Use Case — Detailed scenario guide
- Back to Best Practices — Return to the overview