Gateway input guardrails

Scan user prompts at the gateway before they reach AI providers. Configure enforce, monitor, or disabled modes per project with always-on minors protection.

Overview

ModelRiver input guardrails scan user-supplied text at the gateway before requests are forwarded to AI providers. Guardrails run on every supported entrypoint — native API, streaming, async, playground, and OpenAI-compatible routes.

Blocked requests never reach providers and are not billed. The original prompt of a blocked request is not persisted in logs or returned in error responses.

How guardrails work

Each request passes through a layered decision pipeline:

  1. Text extraction — Scans messages, prompt, and input fields. Multimodal message content is supported when text parts are present.
  2. Always-on minors check — CSAM and underage content is blocked regardless of project settings.
  3. Local classifier — Fast regex-based patterns detect clear violations.
  4. Remote moderation — Ambiguous cases escalate to OpenAI omni-moderation-latest when remote moderation is enabled.
  5. Decision — Allow, block, or (in monitor mode) log and allow based on your project policy.
  6. Audit log — Records categories, source, latency, and action. Request bodies are never stored for blocked requests.

Repeat prompts are fingerprinted and cached per project policy to reduce latency on identical input.
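
The pipeline above can be sketched roughly as follows. This is a minimal illustration, not ModelRiver's implementation: the regex patterns are placeholders, the names check, fingerprint, and Decision are hypothetical, the result labels other than not_checked are assumed, and remote stands in for a call to a moderation API. It also simplifies step 4 by escalating whenever the local classifier finds nothing, rather than only on ambiguous input.

```python
import hashlib
import re
from dataclasses import dataclass, field

# Placeholder patterns, for illustration only.
LOCAL_PATTERNS = {
    "violence": re.compile(r"\bbuild a bomb\b", re.IGNORECASE),
}
MINORS_PATTERN = re.compile(r"\bminors-placeholder\b", re.IGNORECASE)


@dataclass
class Decision:
    result: str   # "violation", "clean", "not_checked", or "skipped" (labels assumed)
    action: str   # "allow" or "block"
    source: str   # "minors", "local", "remote", or "none"
    categories: list = field(default_factory=list)


def fingerprint(text, mode, categories):
    """Hash the prompt together with the policy, so a policy change misses the cache."""
    key = "|".join([mode, ",".join(sorted(categories)), text])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def check(text, mode, categories, cache, remote=None):
    # Empty text: nothing to classify.
    if not text.strip():
        return Decision("not_checked", "allow", "none")
    # Always-on minors check runs before any project policy is consulted.
    if MINORS_PATTERN.search(text):
        return Decision("violation", "block", "minors", ["minors"])
    if mode == "disabled":
        return Decision("skipped", "allow", "none")

    key = fingerprint(text, mode, categories)
    if key in cache:
        return cache[key]

    # Local classifier first; fall back to the remote moderator if provided.
    hits = sorted(c for c in categories
                  if c in LOCAL_PATTERNS and LOCAL_PATTERNS[c].search(text))
    source = "local"
    if not hits and remote is not None:
        hits = sorted(c for c in remote(text) if c in categories)
        source = "remote"

    if hits:
        # Monitor mode records the violation but lets the request through.
        action = "block" if mode == "enforce" else "allow"
        decision = Decision("violation", action, source, hits)
    else:
        decision = Decision("clean", "allow", source)
    cache[key] = decision
    return decision
```

Including the mode and category set in the fingerprint is one way to keep the cache scoped to the current project policy; the actual cache key used by the gateway is not documented here.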

Policy modes

Configure the enforcement mode per project in Settings → Project settings → Content Safety:

Mode      Behavior
enforce   Violating requests are blocked with HTTP 403 (or OpenAI-compatible 400)
monitor   Violations are logged but requests proceed to providers
disabled  Configurable categories are off; minors protection still enforced

New projects default to enforce mode with all four configurable categories enabled. Self-hosted deployments can set GUARDRAILS_FORCE_ENFORCE=true to force enforce mode on all projects.
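
As a rough sketch, the effective mode for a project could be resolved like this. The function name effective_mode is hypothetical, and treating the environment variable as a case-insensitive "true" is an assumption about how the flag is parsed.

```python
import os

VALID_MODES = {"enforce", "monitor", "disabled"}


def effective_mode(project_mode=None, env=None):
    """Resolve the mode applied to a request (illustrative only)."""
    env = os.environ if env is None else env
    # Assumption: the override is active when the variable equals "true".
    if str(env.get("GUARDRAILS_FORCE_ENFORCE", "")).lower() == "true":
        return "enforce"
    # New projects default to enforce when no mode has been chosen.
    mode = project_mode or "enforce"
    if mode not in VALID_MODES:
        raise ValueError(f"unknown guardrail mode: {mode!r}")
    return mode
```

For example, effective_mode("disabled", {"GUARDRAILS_FORCE_ENFORCE": "true"}) returns "enforce", while effective_mode("monitor", {}) returns "monitor".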

Content categories

Four categories are configurable per project:

Category   Description
sexual     Explicit or pornographic content
self-harm  Suicide or self-harm instructions
hate       Slurs, genocide, or ethnic cleansing
violence   Weapons, explosives, murder, or body disposal

Minors/CSAM protection is always enforced and cannot be disabled. This applies even when project mode is disabled or the global GUARDRAILS_ENABLED flag is off.
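
One way to picture this rule is as a set of active checks that always contains the minors check. In the sketch below, the label "minors" and the function active_checks are hypothetical names chosen for illustration; only the four configurable category names and the GUARDRAILS_ENABLED flag come from this page.

```python
CONFIGURABLE = {"sexual", "self-harm", "hate", "violence"}


def active_checks(project_mode, project_categories, guardrails_enabled=True):
    """Return the checks that apply to a request (illustrative only)."""
    # The minors check is present no matter what the policy says.
    active = {"minors"}
    if guardrails_enabled and project_mode != "disabled":
        # Ignore any category name that is not one of the four configurable ones.
        active |= set(project_categories) & CONFIGURABLE
    return active
```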

Configuring guardrails

  1. Open your project in the console
  2. Navigate to Settings → Project settings
  3. Scroll to Content Safety
  4. Select a Mode: Enforce, Monitor, or Disabled
  5. Toggle the Categories you want to enforce
  6. Click Save

Access controls

  • Owner/admin only: Only organization owners and admins can weaken guardrails — switching from enforce to monitor/disabled, or removing categories
  • Members can strengthen: Members can re-enable categories or switch back to enforce mode
  • Per-project isolation: Guardrail policy is scoped to each project independently
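
The weaken/strengthen distinction can be expressed as a small permission check. This is a sketch under stated assumptions: the role strings and function names are hypothetical, and ordering the modes as disabled < monitor < enforce (so that monitor to disabled also counts as weakening) is an assumption that goes slightly beyond what the list above states.

```python
STRICTNESS = {"disabled": 0, "monitor": 1, "enforce": 2}
ADMIN_ROLES = {"owner", "admin"}


def is_weakening(old_mode, old_categories, new_mode, new_categories):
    """True if the change lowers the mode or removes any category."""
    lowers_mode = STRICTNESS[new_mode] < STRICTNESS[old_mode]
    removes_category = not set(old_categories) <= set(new_categories)
    return lowers_mode or removes_category


def can_apply(role, old_mode, old_categories, new_mode, new_categories):
    """Owners and admins may make any change; other roles may only strengthen."""
    if is_weakening(old_mode, old_categories, new_mode, new_categories):
        return role in ADMIN_ROLES
    return True
```

Under this sketch, a member switching a project from monitor back to enforce is allowed, while the same member removing a category is rejected.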

Covered entrypoints

Guardrails run automatically on:

  • Native sync: POST /api/v1/ai
  • Native async: POST /api/v1/ai/async
  • Streaming: preflight check before SSE begins
  • OpenAI-compatible: POST /api/v1/chat/completions
  • Console playground (sync and async)

Requests with image-only or empty text content return not_checked and proceed without text classification. Guardrails apply to input only — AI output is not moderated server-side.
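
To illustrate how an image-only request ends up with nothing to classify, here is a minimal text-extraction sketch. It assumes multimodal message content follows the OpenAI-style list of parts with "type" and "text" keys; that shape and the function name extract_text are assumptions, not a documented ModelRiver schema.

```python
def extract_text(body):
    """Collect scannable text from the prompt, input, and messages fields."""
    parts = []
    for key in ("prompt", "input"):
        value = body.get(key)
        if isinstance(value, str):
            parts.append(value)
    for message in body.get("messages") or []:
        content = message.get("content")
        if isinstance(content, str):
            parts.append(content)
        elif isinstance(content, list):
            # Multimodal content: keep text parts, skip images and other types.
            for part in content:
                if isinstance(part, dict) and part.get("type") == "text":
                    text = part.get("text")
                    if isinstance(text, str):
                        parts.append(text)
    return "\n".join(p for p in parts if p.strip())
```

An empty return value corresponds to the not_checked case: there is no text for the classifier to scan.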

Error handling

Blocked requests in enforce mode surface clear errors:

API                    Status  Error code                Message
Native API             403     content_policy_violation  Request blocked by content policy.
OpenAI-compatible      400     content_policy_violation  Request blocked by content policy.
Streaming (preflight)  403     content_policy_violation  JSON response before SSE starts
Abuse throttle         429     rate_limited              Cooldown after repeated denials

Blocked responses include triggered categories when available. The original prompt is never returned in the error body. After repeated enforce-mode denials from the same API key or user, ModelRiver returns HTTP 429 with a Retry-After header (default: 5 denials within 900 seconds per project).
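
On the client side, these responses can be told apart with a small classifier like the one below. The status codes and error codes come from the table above; where the error code sits in the JSON body is not specified on this page, so the sketch checks both a nested "error" object and a top-level "code" field as an assumption. The names Outcome and classify_response are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    kind: str                          # "ok", "blocked", "throttled", or "other_error"
    retry_after_s: Optional[int] = None


def error_code(body):
    """Find the error code in a parsed JSON body (body shape is an assumption)."""
    if not isinstance(body, dict):
        return None
    err = body.get("error")
    if isinstance(err, dict):
        return err.get("code")
    return body.get("code")


def classify_response(status, headers, body):
    code = error_code(body)
    # 403 on native routes, 400 on OpenAI-compatible routes.
    if status in (400, 403) and code == "content_policy_violation":
        return Outcome("blocked")
    if status == 429:
        # Header names are case-insensitive; only the seconds form is parsed here.
        lowered = {k.lower(): v for k, v in headers.items()}
        raw = str(lowered.get("retry-after", "")).strip()
        return Outcome("throttled", int(raw) if raw.isdigit() else None)
    if 200 <= status < 300:
        return Outcome("ok")
    return Outcome("other_error")
```

A blocked request should not be retried unchanged, since repeated denials count toward the throttle; a throttled request can be retried after the returned number of seconds.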

Privacy and billing

  • No prompt leakage: Blocked prompts are not stored in request logs or returned in API errors
  • No provider billing: Blocked and throttled requests do not increment organization request counters or consume provider tokens
  • Audit without exposure: Guardrail logs record guardrail_result, guardrail_action, guardrail_source, guardrail_categories, and guardrail_latency_ms with request_body: nil

In monitor mode, violating requests proceed to providers but guardrail decisions are still logged with guardrail_action: allow.
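
For illustration, a guardrail log entry could be assembled as below. The field names are the ones listed above; the helper audit_record and the example values are hypothetical, and Python's None stands in for nil.

```python
import json


def audit_record(result, action, source, categories, latency_ms):
    """Build a guardrail log entry that carries no prompt text."""
    return {
        "guardrail_result": result,
        "guardrail_action": action,
        "guardrail_source": source,
        "guardrail_categories": sorted(categories),
        "guardrail_latency_ms": latency_ms,
        "request_body": None,  # the prompt is deliberately left out
    }


# A monitor-mode violation: logged, but the request was allowed through.
print(json.dumps(audit_record("violation", "allow", "local", ["violence"], 3)))
```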

Best practices

  • Start in monitor mode: Observe violation rates before enforcing blocks on production traffic
  • Keep enforce as default: Use enforce mode for public-facing applications and production API keys
  • Restrict weakening to admins: Use organization roles so only owners/admins can disable or weaken guardrails
  • Review guardrail logs: Use the console request logs to audit categories and latency without accessing blocked prompt text

Next steps