Overview
ModelRiver input guardrails scan user-supplied text at the gateway before requests are forwarded to AI providers. Guardrails run on every supported entrypoint — native API, streaming, async, playground, and OpenAI-compatible routes.
Blocked requests never reach providers and are not billed, and their original prompts are never persisted in logs or returned in error responses.
How guardrails work
Each request passes through a layered decision pipeline:
- Text extraction: Scans the messages, prompt, and input fields. Multimodal message content is supported when text parts are present.
- Always-on minors check: CSAM and underage content is blocked regardless of project settings.
- Local classifier: Fast regex-based patterns detect clear violations.
- Remote moderation: Ambiguous cases escalate to OpenAI omni-moderation-latest when remote moderation is enabled.
- Decision: Allow, block, or (in monitor mode) log and allow based on your project policy.
- Audit log: Records categories, source, latency, and action. Request bodies are never stored for blocked requests.
Repeat prompts are fingerprinted and cached per project policy to reduce latency on identical input.
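The flow above can be sketched roughly as follows. This is a minimal illustration, not ModelRiver's implementation: the regex, the Decision fields, the result values "pass" and "violation", and the in-memory cache are all assumptions, and the minors check and remote moderation steps are left out for brevity.

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Hypothetical pattern for illustration; the real classifier rules are not documented here.
LOCAL_PATTERNS = {
    "violence": re.compile(r"\bbuild an? (bomb|explosive)\b", re.IGNORECASE),
}


@dataclass
class Decision:
    action: str                      # "allow" or "block"
    result: str                      # "pass", "violation", or "not_checked"
    categories: List[str] = field(default_factory=list)
    source: str = "local"            # "local", "cache", or "none"


# Keyed by fingerprint only, so the prompt text itself is never stored.
_cache: Dict[str, Decision] = {}


def fingerprint(project_id: str, mode: str, categories: Iterable[str], text: str) -> str:
    """Hash the policy together with the text so a policy change misses the cache."""
    policy = f"{project_id}|{mode}|{','.join(sorted(categories))}"
    return hashlib.sha256(f"{policy}|{text}".encode("utf-8")).hexdigest()


def check(project_id: str, text: str, mode: str, categories: Iterable[str]) -> Decision:
    enabled = set(categories) if mode != "disabled" else set()
    if not text.strip():
        return Decision("allow", "not_checked", source="none")

    key = fingerprint(project_id, mode, enabled, text)
    if key in _cache:
        hit = _cache[key]
        return Decision(hit.action, hit.result, list(hit.categories), "cache")

    matched = [name for name, pattern in LOCAL_PATTERNS.items()
               if name in enabled and pattern.search(text)]
    if matched:
        # enforce blocks; monitor records the violation and lets the request through
        action = "block" if mode == "enforce" else "allow"
        decision = Decision(action, "violation", matched)
    else:
        decision = Decision("allow", "pass")
    _cache[key] = decision
    return decision
```

Including the mode and category set in the fingerprint is one way to read "cached per project policy": a second identical prompt under the same policy is served from the cache, while a policy change forces a fresh check.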
Policy modes
Configure the enforcement mode per project in Settings → Project settings → Content Safety:
| Mode | Behavior |
|---|---|
| enforce | Violating requests are blocked with HTTP 403 (or OpenAI-compatible 400) |
| monitor | Violations are logged but requests proceed to providers |
| disabled | Configurable categories are off; minors protection still enforced |
New projects default to enforce mode with all four configurable categories enabled. Self-hosted deployments can set GUARDRAILS_FORCE_ENFORCE=true to force enforce mode on all projects.
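As a rough sketch of how the mode table and the force-enforce override combine, the two hypothetical helpers below map a project mode and a violation flag to an action and status code. The function names and the case-insensitive parsing of the environment value are assumptions, and the always-on minors check is not modeled here.

```python
from typing import Mapping, Optional, Tuple


def effective_mode(project_mode: str, env: Mapping[str, str]) -> str:
    """Apply the deployment-wide override on top of the project setting."""
    if env.get("GUARDRAILS_FORCE_ENFORCE", "").strip().lower() == "true":
        return "enforce"
    return project_mode


def outcome(mode: str, violated: bool, openai_compatible: bool = False) -> Tuple[str, Optional[int]]:
    """Return (action, http_status); http_status is None when the request proceeds."""
    if violated and mode == "enforce":
        return "block", 400 if openai_compatible else 403
    # monitor logs the violation; disabled skips configurable categories entirely
    return "allow", None
```

For example, outcome(effective_mode("monitor", os.environ), violated=True) would block on a self-hosted deployment that sets the override, and allow otherwise.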
Content categories
Four categories are configurable per project:
| Category | Description |
|---|---|
| sexual | Explicit or pornographic content |
| self-harm | Suicide or self-harm instructions |
| hate | Slurs, genocide, or ethnic cleansing |
| violence | Weapons, explosives, murder, or body disposal |
Minors/CSAM protection is always enforced and cannot be disabled. This applies even when project mode is disabled or the global GUARDRAILS_ENABLED flag is off.
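One way to picture the rule that minors protection survives every setting is a helper that computes which categories are actually checked for a request. The "minors" label and the function name are hypothetical; only the four configurable category names come from the table above.

```python
from typing import Iterable, Set

CONFIGURABLE = {"sexual", "self-harm", "hate", "violence"}
ALWAYS_ON = "minors"  # hypothetical internal label for the non-configurable check


def active_categories(mode: str, enabled: Iterable[str], guardrails_enabled: bool = True) -> Set[str]:
    """Return the categories checked for a request under the given policy."""
    active = {ALWAYS_ON}  # never removed, whatever the mode or global flag says
    if guardrails_enabled and mode != "disabled":
        active |= set(enabled) & CONFIGURABLE
    return active
```

With mode set to disabled, or with the global flag off, the result collapses to the always-on check alone rather than to an empty set.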
Configuring guardrails
- Open your project in the console
- Navigate to Settings → Project settings
- Scroll to Content Safety
- Select a Mode: Enforce, Monitor, or Disabled
- Toggle the Categories you want to enforce
- Click Save
Access controls
- Owner/admin only: Only organization owners and admins can weaken guardrails — switching from enforce to monitor/disabled, or removing categories
- Members can strengthen: Members can re-enable categories or switch back to enforce mode
- Per-project isolation: Guardrail policy is scoped to each project independently
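The access rules above amount to comparing the old and new policy and checking the caller's role only when the change weakens protection. The sketch below is an illustration under assumptions: the Policy type, the strength ordering (which also treats monitor to disabled as weakening, something the rules above do not state explicitly), and the role strings are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

MODE_STRENGTH = {"disabled": 0, "monitor": 1, "enforce": 2}


@dataclass(frozen=True)
class Policy:
    mode: str
    categories: FrozenSet[str]


def is_weakening(old: Policy, new: Policy) -> bool:
    """True when the mode gets weaker or any previously enabled category is dropped."""
    weaker_mode = MODE_STRENGTH[new.mode] < MODE_STRENGTH[old.mode]
    removed_category = not old.categories <= new.categories
    return weaker_mode or removed_category


def can_apply(role: str, old: Policy, new: Policy) -> bool:
    """Owners and admins may make any change; members may only strengthen."""
    return role in {"owner", "admin"} or not is_weakening(old, new)
```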
Covered entrypoints
Guardrails run automatically on:
- Native sync: POST /api/v1/ai
- Native async: POST /api/v1/ai/async
- Streaming: preflight check before SSE begins
- OpenAI-compatible: POST /api/v1/chat/completions
- Console playground (sync and async)
Requests with image-only or empty text content return not_checked and proceed without text classification. Guardrails apply to input only — AI output is not moderated server-side.
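To illustrate why an image-only request ends up as not_checked, here is a minimal sketch of text extraction over the messages, prompt, and input fields. It assumes OpenAI-style content parts of the form {"type": "text", "text": "..."}; the exact request schema and the helper names are assumptions, not the gateway's actual code.

```python
from typing import Any, Dict, List


def extract_text(body: Dict[str, Any]) -> str:
    """Collect scannable text from messages, prompt, and input; skip non-text parts."""
    parts: List[str] = []
    for message in body.get("messages") or []:
        content = message.get("content")
        if isinstance(content, str):
            parts.append(content)
        elif isinstance(content, list):
            for part in content:
                # Only text parts are scanned; image parts are ignored.
                if isinstance(part, dict) and part.get("type") == "text":
                    parts.append(str(part.get("text", "")))
    for key in ("prompt", "input"):
        value = body.get(key)
        if isinstance(value, str):
            parts.append(value)
    return "\n".join(p for p in parts if p.strip())


def is_checkable(body: Dict[str, Any]) -> bool:
    """False corresponds to the not_checked case: nothing to classify."""
    return bool(extract_text(body))
```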
Error handling
In enforce mode, blocked requests return the following errors:
| API | Status | Error code | Message |
|---|---|---|---|
| Native API | 403 | content_policy_violation | Request blocked by content policy. |
| OpenAI-compatible | 400 | content_policy_violation | Request blocked by content policy. |
| Streaming (preflight) | 403 | content_policy_violation | JSON response before SSE starts |
| Abuse throttle | 429 | rate_limited | Cooldown after repeated denials |
Blocked responses include triggered categories when available. The original prompt is never returned in the error body. After repeated enforce-mode denials from the same API key or user, ModelRiver returns HTTP 429 with a Retry-After header (default: 5 denials within 900 seconds per project).
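On the client side, the table above suggests branching on the status code and error code together. The sketch below is a hedged example: the JSON layout of the error body is not specified here, so the caller is assumed to have already extracted the error code, and the GuardrailOutcome type and interpret function are hypothetical names. Only the delta-seconds form of Retry-After is parsed; an HTTP-date value would yield None.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class GuardrailOutcome:
    kind: str                            # "ok", "blocked", "throttled", or "other_error"
    retry_after_s: Optional[int] = None  # set only for throttled responses


def interpret(status: int, headers: Mapping[str, str], error_code: Optional[str]) -> GuardrailOutcome:
    """Classify a gateway response using the status and error codes from the table."""
    if status == 429 and error_code == "rate_limited":
        raw = headers.get("Retry-After", "").strip()
        return GuardrailOutcome("throttled", int(raw) if raw.isdigit() else None)
    if status in (400, 403) and error_code == "content_policy_violation":
        # Do not retry the same prompt: repeated denials count toward the throttle.
        return GuardrailOutcome("blocked")
    if 200 <= status < 300:
        return GuardrailOutcome("ok")
    return GuardrailOutcome("other_error")
```

Checking the error code as well as the status matters on the OpenAI-compatible route, where 400 is also a plausible status for ordinary validation failures.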
Privacy and billing
- No prompt leakage: Blocked prompts are not stored in request logs or returned in API errors
- No provider billing: Blocked and throttled requests do not increment organization request counters or consume provider tokens
- Audit without exposure: Guardrail logs record guardrail_result, guardrail_action, guardrail_source, guardrail_categories, and guardrail_latency_ms with request_body: nil
In monitor mode, violating requests proceed to providers but guardrail decisions are still logged with guardrail_action: allow.
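A guardrail log entry with the fields listed above might look like the record built below. The field names come from this section; the example values for result and source, the "block" action, and the builder function itself are assumptions for illustration (Python's None stands in for nil).

```python
from typing import Any, Dict, List


def build_audit_record(result: str, action: str, source: str,
                       categories: List[str], latency_ms: float) -> Dict[str, Any]:
    """Build a guardrail log entry that never carries the prompt text."""
    if action not in {"allow", "block"}:
        raise ValueError(f"unexpected guardrail action: {action}")
    return {
        "guardrail_result": result,
        "guardrail_action": action,
        "guardrail_source": source,
        "guardrail_categories": list(categories),
        "guardrail_latency_ms": latency_ms,
        "request_body": None,  # the body is deliberately left empty
    }


# A monitor-mode violation: logged with its categories, but the request was allowed.
monitor_entry = build_audit_record("violation", "allow", "local", ["hate"], 3.2)
```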
Best practices
- Start in monitor mode: Observe violation rates before enforcing blocks on production traffic
- Keep enforce as default: Use enforce mode for public-facing applications and production API keys
- Restrict weakening to admins: Use organization roles so only owners/admins can disable or weaken guardrails
- Review guardrail logs: Use the console request logs to audit categories and latency without accessing blocked prompt text
Next steps
- Compliance: Audit trails and regulatory considerations
- Data retention: Understand how request data is stored and managed
- API keys: Manage authentication credentials
- Observability: Monitor and audit request logs