Overview
The most effective teams don't wait for user reports to discover issues — they proactively review logs to catch anomalies early. A 5-10 minute daily review can prevent hours of firefighting.
Daily review checklist
Morning check (5 minutes)
Run through this quick check each morning (a scripted version is sketched after the checklist):
- Open Request Logs → Filter to Live mode
- Scan for red badges — Any error status requests since yesterday?
- Check failed models count — Are failover rates higher than usual?
- Review duration column — Any requests significantly slower than baseline?
- Check webhook deliveries — Any failed deliveries that need attention?
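If you can export recent request logs, much of this tally can be scripted. The sketch below is a minimal example in Python; the record fields (timestamp, status, failover_used, duration_ms, webhook_status) and the baseline value are illustrative assumptions, not the product's actual log schema.

```python
from datetime import datetime, timedelta, timezone

BASELINE_MS = 1_300   # assumed typical request duration; substitute your own baseline
SLOW_FACTOR = 2.0     # flag requests more than 2x slower than baseline

def morning_tally(logs: list[dict]) -> dict:
    """Summarize the last 24 hours of exported request-log records.

    Assumes each record has a timezone-aware datetime in "timestamp";
    all other field names are illustrative.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=1)
    recent = [r for r in logs if r["timestamp"] >= cutoff]
    return {
        "total": len(recent),
        "errors": sum(1 for r in recent if r["status"] == "error"),
        "failovers": sum(1 for r in recent if r.get("failover_used")),
        "slow": sum(1 for r in recent
                    if r["duration_ms"] > BASELINE_MS * SLOW_FACTOR),
        "webhook_failures": sum(1 for r in recent
                                if r.get("webhook_status") == "failed"),
    }

# Example: print(morning_tally(exported_logs))
```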
What to look for
┌──────────────────────────────────────────────────┐
│ Morning Review Dashboard (mental model)          │
│                                                  │
│ Error rate:     ● 2.1%  (normal: < 3%)     ✓ OK  │
│ Failover rate:  ● 4.2%  (normal: < 5%)     ✓ OK  │
│ P95 latency:    ● 3.2s  (normal: < 4s)     ✓ OK  │
│ Webhook errors: ● 0     (normal: < 2)      ✓ OK  │
│                                                  │
│ Status: All clear ✓                              │
└──────────────────────────────────────────────────┘
If anything is outside your normal range, investigate immediately.
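The same "normal ranges" can be expressed in code so the comparison is repeatable. This is only a sketch; the limits mirror the dashboard above, and the metric names are invented for illustration.

```python
# Normal ranges from the dashboard above; metric names are illustrative.
NORMAL_RANGES = {
    "error_rate_pct":    3.0,  # normal: < 3%
    "failover_rate_pct": 5.0,  # normal: < 5%
    "p95_latency_s":     4.0,  # normal: < 4s
    "webhook_errors":    2,    # normal: < 2
}

def morning_status(metrics: dict) -> bool:
    """Print each metric against its normal range; return True if all clear."""
    all_clear = True
    for name, limit in NORMAL_RANGES.items():
        value = metrics[name]
        ok = value < limit
        all_clear = all_clear and ok
        print(f"{name:<20} {value:>6}  {'OK' if ok else 'INVESTIGATE'}")
    print("Status:", "All clear" if all_clear else "Needs attention")
    return all_clear

# Using the example numbers from the dashboard above:
morning_status({"error_rate_pct": 2.1, "failover_rate_pct": 4.2,
                "p95_latency_s": 3.2, "webhook_errors": 0})
```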
Weekly deep dive (30 minutes)
Analyze trends
Once a week, look at broader patterns:
- Compare error rates week-over-week — Is reliability improving or degrading?
- Review token usage trends — Are average tokens per request growing? (This increases costs)
- Check provider performance — Has any provider's latency or failure rate changed?
- Review failover chains — Are specific providers consistently failing?
Example weekly report
Week of Feb 3-9, 2026:
  Total requests: 8,420
  Success rate: 97.8% (last week: 98.1%)
  Avg latency: 1,340ms (last week: 1,280ms)
  Failover rate: 3.2% (last week: 2.8%)
  Top failing model: gpt-4o-mini (12 failures, up from 6)
  Webhook success: 99.4% (last week: 99.6%)
  Estimated cost: $42.30 (last week: $38.50)
⚠ Action items:
- Investigate gpt-4o-mini failure increase
- Review cost increase (+10%)
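A summary in this shape can be generated from two weeks of aggregated metrics. The sketch below assumes you already have those aggregates as dictionaries; the keys and labels are illustrative, and it covers only the directly comparable rows.

```python
def weekly_report(this_week: dict, last_week: dict) -> str:
    """Format a week-over-week summary from aggregated metrics (assumed keys)."""
    compared = [
        ("Success rate",   "{:.1f}%",   "success_rate"),
        ("Avg latency",    "{:,.0f}ms", "avg_latency_ms"),
        ("Failover rate",  "{:.1f}%",   "failover_rate"),
        ("Estimated cost", "${:.2f}",   "cost_usd"),
    ]
    lines = [f"Week of {this_week['label']}:",
             f"  Total requests: {this_week['total_requests']:,}"]
    for label, fmt, key in compared:
        lines.append(f"  {label}: {fmt.format(this_week[key])}"
                     f" (last week: {fmt.format(last_week[key])})")
    return "\n".join(lines)

# Example, using the numbers from the report above:
# print(weekly_report(
#     {"label": "Feb 3-9, 2026", "total_requests": 8420, "success_rate": 97.8,
#      "avg_latency_ms": 1340, "failover_rate": 3.2, "cost_usd": 42.30},
#     {"success_rate": 98.1, "avg_latency_ms": 1280,
#      "failover_rate": 2.8, "cost_usd": 38.50},
# ))
```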
Setting up alerts
Complement manual reviews with automated alerts; a sketch encoding the recommended thresholds appears after the table:
Recommended alert thresholds
| Metric | Warning threshold | Critical threshold |
|---|---|---|
| Error rate | > 3% over 1 hour | > 10% over 15 min |
| Failover rate | > 5% over 1 hour | > 15% over 15 min |
| P95 latency | > 2x baseline | > 3x baseline |
| Webhook failures | > 2 in 1 hour | > 5 in 15 min |
| Callback timeouts | > 1 in 1 hour | > 3 in 15 min |
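If you script your own alerting, the table above translates directly into a threshold map. A minimal sketch with assumed metric names; latency is expressed as a multiple of your baseline to match the 2x/3x rows, and window handling depends on your alerting tool.

```python
# Alert thresholds from the table above; metric names are illustrative.
THRESHOLDS = {
    # metric: (warning over 1 hour, critical over 15 minutes)
    "error_rate_pct":     (3.0, 10.0),
    "failover_rate_pct":  (5.0, 15.0),
    "p95_latency_x_base": (2.0, 3.0),   # P95 latency as a multiple of baseline
    "webhook_failures":   (2, 5),
    "callback_timeouts":  (1, 3),
}

def classify(metric: str, value_1h: float, value_15m: float) -> str:
    """Return 'critical', 'warning', or 'ok' for one metric's two windows."""
    warning, critical = THRESHOLDS[metric]
    if value_15m > critical:
        return "critical"   # investigate immediately
    if value_1h > warning:
        return "warning"    # log it, review in the daily check
    return "ok"

# Example: classify("error_rate_pct", value_1h=3.4, value_15m=1.2) -> "warning"
```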
Alert escalation
Level 1 (Warning): Log it, review in daily check
Level 2 (Critical): Investigate immediately
Level 3 (Outage): All hands, coordinate in incident channel
Building the habit
- Schedule it — Add a 5-minute recurring calendar block for morning reviews
- Create a checklist — Use the checklist above or customize for your needs
- Rotate responsibility — If you have a team, rotate the daily review role
- Document findings — Keep a simple log of what you notice each day
- Act on trends — Don't just observe — take action when you see patterns
Next steps
- Using Filters Effectively — Master the filter system
- Monitoring Failed Models — Track provider stability
- Back to Best Practices — Return to the overview