Monitor AI request performance in real time

Track request latency, identify performance bottlenecks, and compare provider speeds to deliver the fastest AI experience to your users.

Overview

Response time directly impacts user experience. A chatbot that takes 8 seconds to respond feels broken; one that responds in 1-2 seconds feels magical. Request Logs capture precise duration data for every request and every provider attempt, giving you the tools to monitor, compare, and optimize performance.


Key performance metrics

Request duration

The Duration column shows end-to-end latency in milliseconds:

Duration range     User experience                         Action
< 1,000ms          Excellent — feels instant               No action needed
1,000-3,000ms      Good — acceptable for most use cases    Monitor
3,000-5,000ms      Moderate — noticeable delay             Investigate
> 5,000ms          Poor — users may abandon                Immediate investigation
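
If you pull duration data out for offline analysis, a small helper can bucket requests into these ranges for reporting. A minimal sketch in Python; the thresholds mirror the table above and the function name is purely illustrative:

def classify_duration(duration_ms: float) -> str:
    """Map an end-to-end duration in milliseconds to an action bucket."""
    if duration_ms < 1_000:
        return "excellent - no action needed"
    if duration_ms < 3_000:
        return "good - monitor"
    if duration_ms < 5_000:
        return "moderate - investigate"
    return "poor - immediate investigation"

print(classify_duration(890))    # excellent - no action needed
print(classify_duration(4_200))  # moderate - investigate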

Time to first token (streaming)

For streaming requests, the key metric is how quickly the first token arrives. While Request Logs show total duration, you can estimate first-token latency by comparing streaming vs. non-streaming requests with similar prompts.
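
To measure it directly, time the gap between sending a streaming request and receiving the first content chunk. A minimal sketch using the OpenAI Python SDK (openai >= 1.0) as the example client; swap in whichever provider or gateway endpoint you actually route through:

import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content (e.g., the final stop chunk), so guard the access.
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.monotonic()

total_ms = (time.monotonic() - start) * 1000
if first_token_at is not None:
    print(f"time to first token: {(first_token_at - start) * 1000:.0f}ms")
print(f"total duration:      {total_ms:.0f}ms")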

Failover latency impact

Each failed provider attempt adds latency. A request that needed three provider attempts (two failovers) may show:

Attempt 1: OpenAI failed after 2,100ms
Attempt 2: Anthropic failed after 1,800ms
Attempt 3: Google success in 890ms
Total effective latency: 4,790ms

Without failovers, this request would have completed in ~890ms.
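
To quantify this overhead across your logs, the arithmetic is simple: sum every attempt's duration and subtract the successful attempt. A minimal sketch, assuming you can export per-attempt durations in a structure like the one below (the field names are illustrative):

attempts = [
    {"provider": "openai",    "duration_ms": 2100, "status": "failed"},
    {"provider": "anthropic", "duration_ms": 1800, "status": "failed"},
    {"provider": "google",    "duration_ms": 890,  "status": "success"},
]

total_ms = sum(a["duration_ms"] for a in attempts)
success_ms = next(a["duration_ms"] for a in attempts if a["status"] == "success")

print(f"total effective latency: {total_ms}ms")               # 4790ms
print(f"failover overhead:       {total_ms - success_ms}ms")  # 3900ms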


Step-by-step performance monitoring

1. Establish baseline metrics

Before you can identify anomalies, establish what "normal" looks like:

  1. Filter to Live mode and review 7 days of requests
  2. Note typical duration ranges for each provider/model
  3. Record average token counts per request type
  4. Document typical failover frequency

Example baseline:

Provider: OpenAI gpt-4o
P50 latency: 1,200ms
P95 latency: 3,400ms
P99 latency: 5,100ms
Avg tokens: 1,800 total
 
Provider: Anthropic claude-3.5-sonnet
P50 latency: 1,100ms
P95 latency: 2,800ms
P99 latency: 4,200ms
Avg tokens: 1,900 total
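
If you have access to raw durations (from an export or your own instrumentation), the percentiles above are straightforward to compute with Python's standard library. A minimal sketch; the sample values are illustrative:

import statistics

durations_ms = [1120, 980, 1340, 2210, 1050, 3400, 1275, 890, 5100, 1190]  # exported sample

cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
print(f"P50 latency: {cuts[49]:.0f}ms")
print(f"P95 latency: {cuts[94]:.0f}ms")
print(f"P99 latency: {cuts[98]:.0f}ms")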

2. Identify performance outliers

  1. Navigate to Request Logs and filter to Live mode
  2. Look for requests with abnormally high duration values
  3. Click to inspect — check the timeline for:
    • Multiple failover attempts — Primary cause of unexpected latency
    • High token counts — More tokens = longer processing
    • Specific provider/model — Some models are consistently slower
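
The same checks can be scripted against exported log data. A minimal sketch that flags requests running past twice the baseline P95 and notes the likely cause; the record fields and thresholds are illustrative:

BASELINE_P95_MS = 3400  # from step 1

requests = [
    {"id": "req_01", "duration_ms": 1210, "attempts": 1, "total_tokens": 1750},
    {"id": "req_02", "duration_ms": 7420, "attempts": 3, "total_tokens": 1810},
    {"id": "req_03", "duration_ms": 9800, "attempts": 1, "total_tokens": 6400},
]

for req in requests:
    if req["duration_ms"] <= 2 * BASELINE_P95_MS:
        continue
    reasons = []
    if req["attempts"] > 1:
        reasons.append(f"{req['attempts']} provider attempts")
    if req["total_tokens"] > 4000:
        reasons.append("high token count")
    cause = ", ".join(reasons) or "no obvious cause"
    print(f"{req['id']}: {req['duration_ms']}ms - inspect ({cause})")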

3. Compare provider performance

Run the same prompt through different providers to compare:

Provider Performance Comparison

Provider                Latency    Tokens    Cost
OpenAI gpt-4o-mini      420ms      800       $0.0004
OpenAI gpt-4o           1,200ms    780       $0.0120
Anthropic claude-3.5    1,100ms    820       $0.0098
Google gemini-1.5       950ms      810       $0.0071
Anthropic claude-3      2,100ms    790       $0.0450

Use this data to configure your workflow's provider priority for the best balance of speed, quality, and cost.
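
One way to gather numbers like these is a small timing harness that sends the same prompt to each provider you care about. A minimal sketch; call_provider stands in for whatever SDK or gateway call you actually use:

import time

PROMPT = "Classify this support ticket as billing, technical, or other: 'My invoice is wrong.'"

def time_call(label: str, call_provider) -> None:
    """Time a single completion call; call_provider(prompt) is a hypothetical callable."""
    start = time.monotonic()
    text = call_provider(PROMPT)
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{label:<28} {elapsed_ms:>8.0f}ms   {len(text)} chars")

# Wire up one callable per provider/model you want to compare, for example
# (openai_chat and anthropic_chat are hypothetical wrappers around the respective SDKs):
# time_call("OpenAI gpt-4o-mini", lambda p: openai_chat(p, model="gpt-4o-mini"))
# time_call("Anthropic claude-3.5", lambda p: anthropic_chat(p, model="claude-3-5-sonnet"))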

4. Monitor peak-hour performance

Performance often degrades during peak usage:

  1. Filter logs to specific time windows (e.g., 2-6 PM)
  2. Compare duration statistics to off-peak hours
  3. Look for higher failover rates during peak times
  4. Check if provider rate limits are being hit
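
If you export timestamps alongside durations, you can compare windows programmatically. A minimal sketch that splits requests into peak and off-peak buckets; the record shape and peak window are illustrative:

import statistics
from collections import defaultdict
from datetime import datetime

records = [
    {"timestamp": "2024-05-01T14:12:03+00:00", "duration_ms": 2900},
    {"timestamp": "2024-05-01T15:40:21+00:00", "duration_ms": 4100},
    {"timestamp": "2024-05-01T03:05:44+00:00", "duration_ms": 1100},
    {"timestamp": "2024-05-01T04:17:09+00:00", "duration_ms": 1250},
]

PEAK_HOURS = range(14, 18)  # 2-6 PM

by_window = defaultdict(list)
for rec in records:
    hour = datetime.fromisoformat(rec["timestamp"]).hour
    by_window["peak" if hour in PEAK_HOURS else "off-peak"].append(rec["duration_ms"])

for window, durations in by_window.items():
    print(f"{window}: median {statistics.median(durations):.0f}ms across {len(durations)} requests")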

Performance optimization strategies

Optimize provider selection

Configure your workflow to use the fastest reliable provider as primary:

Workflow provider priority:
1. Google Gemini 1.5 Pro (fastest, cost-effective)
2. Anthropic Claude 3.5 (reliable, good quality)
3. OpenAI GPT-4o (fallback, highest quality)
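
The priority list itself lives in your workflow configuration; the sketch below only illustrates the behavior it produces, with one hypothetical callable per provider:

PRIORITY = ["google/gemini-1.5-pro", "anthropic/claude-3.5-sonnet", "openai/gpt-4o"]

def complete_with_failover(prompt: str, providers: dict) -> tuple[str, str]:
    """Try providers in priority order; return (provider, response) from the first success."""
    errors = {}
    for name in PRIORITY:
        try:
            return name, providers[name](prompt)  # providers maps name -> hypothetical callable
        except Exception as exc:  # timeout, rate limit, auth failure, ...
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")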

Reduce token count

Fewer tokens = faster processing:

  • Trim system prompts — Remove unnecessary instructions
  • Limit conversation history — Summarize older messages
  • Set appropriate max_tokens — Don't allow more output than needed
  • Use structured outputs — Constrained outputs are typically faster
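
These limits are easy to enforce in the code that builds each request. A minimal sketch, assuming the common chat-completions message shape; the exact limits are illustrative:

MAX_HISTORY_MESSAGES = 6   # keep only the most recent exchanges
MAX_OUTPUT_TOKENS = 300    # cap output at what the use case actually needs

def build_request(system_prompt: str, history: list[dict], user_message: str) -> dict:
    """Assemble a request body with trimmed history and a bounded max_tokens."""
    recent = history[-MAX_HISTORY_MESSAGES:]
    return {
        "messages": [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": user_message}],
        "max_tokens": MAX_OUTPUT_TOKENS,
    }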

Minimize failovers

Failovers are the biggest source of unexpected latency:

  • Monitor provider reliability — Use Provider Reliability data to choose stable providers
  • Configure sensible timeouts — Don't wait too long for a slow provider before failing over
  • Keep provider credentials current — Expired API keys cause immediate failures
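
Timeouts in particular are worth setting client-side so a stalled provider fails fast and the next one in your priority list gets a chance. A sketch using the OpenAI Python SDK (openai >= 1.0) as an example; other SDKs expose similar options, and the 5-second value is illustrative:

from openai import OpenAI, APITimeoutError

client = OpenAI(timeout=5.0)  # seconds; tune relative to your baseline P95

def call_primary(prompt: str) -> str | None:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    except APITimeoutError:
        return None  # signal the caller to fail over to the next provider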

Consider model-task matching

Use faster models for simpler tasks:

  • Classification, routing — gpt-4o-mini (< 500ms typical)
  • Content generation — gpt-4o or claude-3.5-sonnet (1-2s typical)
  • Complex analysis — claude-3-opus (2-4s typical, but highest quality)
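
A simple routing table is often enough to enforce this. A minimal sketch; the task labels and default mirror the guidance above and are illustrative:

MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "routing": "gpt-4o-mini",
    "content_generation": "gpt-4o",
    "complex_analysis": "claude-3-opus",
}

def pick_model(task: str) -> str:
    return MODEL_BY_TASK.get(task, "gpt-4o")  # reasonable default for unknown tasks

print(pick_model("classification"))    # gpt-4o-mini
print(pick_model("complex_analysis"))  # claude-3-opus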

Setting up performance alerts

Based on your baseline metrics, set alerts for:

  • P95 latency exceeding 2x baseline — Early warning for degradation
  • Failover rate exceeding 10% — Provider instability
  • Duration exceeding absolute threshold — e.g., 5,000ms for user-facing requests
  • Consistent slowdown over time — Gradual degradation may indicate growing prompt sizes
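
If your alerting runs on exported metrics, the checks come down to a few comparisons. A minimal sketch; the thresholds mirror the list above and the window dict shape is illustrative:

BASELINE_P95_MS = 3400
ABSOLUTE_LIMIT_MS = 5000
MAX_FAILOVER_RATE = 0.10

def check_alerts(window: dict) -> list[str]:
    """Evaluate alert conditions for one rolling window of request metrics."""
    alerts = []
    if window["p95_ms"] > 2 * BASELINE_P95_MS:
        alerts.append(f"P95 {window['p95_ms']}ms exceeds 2x baseline ({BASELINE_P95_MS}ms)")
    if window["failover_rate"] > MAX_FAILOVER_RATE:
        alerts.append(f"failover rate {window['failover_rate']:.0%} exceeds 10%")
    if window["max_ms"] > ABSOLUTE_LIMIT_MS:
        alerts.append(f"slowest request {window['max_ms']}ms exceeds {ABSOLUTE_LIMIT_MS}ms")
    return alerts

print(check_alerts({"p95_ms": 7200, "failover_rate": 0.14, "max_ms": 9100}))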

Next steps