Input guardrails: ModelRiver Security

Overview

ModelRiver input guardrails scan user-supplied text at the gateway before requests are forwarded to AI providers. Guardrails run on every supported entrypoint — native API, streaming, async, playground, and OpenAI-compatible routes.

Blocked requests never reach providers and are not billed. Their original prompt is neither persisted in logs nor returned in error responses.

How guardrails work

Each request passes through a layered decision pipeline:

Text extraction — Scans messages, prompt, and input fields. Multimodal message content is supported when text parts are present.
Always-on minors check — CSAM and underage content is blocked regardless of project settings.
Local classifier — Fast regex-based patterns detect clear violations.
Remote moderation — Ambiguous cases escalate to OpenAI omni-moderation-latest when remote moderation is enabled.
Decision — Allow, block, or (in monitor mode) log and allow based on your project policy.
Audit log — Records categories, source, latency, and action. Request bodies are never stored for blocked requests.
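
The classifier, escalation, and decision steps above can be sketched roughly as follows. This is a minimal illustration, not ModelRiver's implementation: `Verdict`, `check`, the single regex pattern, and the `remote` callback are all hypothetical names, and the sketch treats "the local pass found nothing" as the ambiguous case. Text extraction, the always-on minors check, and audit logging are left out.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical placeholder pattern; the real classifier rules are not documented here.
LOCAL_PATTERNS = {
    "violence": re.compile(r"\bbuild a bomb\b", re.IGNORECASE),
}

@dataclass
class Verdict:
    action: str                      # "allow" or "block"
    source: str                      # "local", "remote", or "none"
    categories: list = field(default_factory=list)

def check(text: str, mode: str, enabled: set,
          remote: Optional[Callable[[str], list]] = None) -> Verdict:
    """Layered decision: local patterns first, then optional remote moderation."""
    if mode == "disabled":
        # Configurable categories are off, so nothing is classified here.
        return Verdict("allow", "none", [])
    hits = [c for c, p in LOCAL_PATTERNS.items() if c in enabled and p.search(text)]
    source = "local" if hits else "none"
    if not hits and remote is not None:
        # Assumption: escalate only when the local pass found nothing decisive.
        hits = [c for c in remote(text) if c in enabled]
        source = "remote" if hits else "none"
    if hits and mode == "enforce":
        return Verdict("block", source, hits)
    # Monitor mode keeps the categories for logging but lets the request through.
    return Verdict("allow", source, hits)
```

Passing `remote=None` models a project with remote moderation turned off, in which case only the local patterns decide.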

Repeated prompts are fingerprinted and cached per project policy, which reduces latency on identical input.
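
One way to picture the fingerprint cache is a hash over the project, its policy, and the prompt text. The key derivation below is an assumption made for illustration (the actual scheme is not documented here), and `fingerprint` and `cached_check` are hypothetical helpers.

```python
import hashlib
import json

def fingerprint(project_id: str, mode: str, categories: set, text: str) -> str:
    """Derive a cache key from the policy and the prompt text (hypothetical scheme)."""
    payload = json.dumps(
        {"project": project_id, "mode": mode,
         "categories": sorted(categories), "text": text},
        sort_keys=True,
    )
    # Only the digest is kept, so the cache never holds the prompt itself.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

_cache: dict = {}

def cached_check(key: str, compute):
    """Return the cached decision for this fingerprint, computing it once on a miss."""
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]
```

Including the mode and category set in the key means a policy change produces a different fingerprint, so a stale decision is not reused after the policy is edited.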

Policy modes

Configure the enforcement mode per project in Settings → Project settings → Content Safety:

Mode	Behavior
`enforce`	Violating requests are blocked with HTTP 403 (HTTP 400 on OpenAI-compatible routes)
`monitor`	Violations are logged but requests proceed to providers
`disabled`	Configurable categories are off; minors protection is still enforced

New projects default to enforce mode with all four configurable categories enabled. Self-hosted deployments can set GUARDRAILS_FORCE_ENFORCE=true to force enforce mode on all projects.
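
For self-hosted setups, the interaction between the project mode and the override can be sketched as below. The function name `effective_mode` is hypothetical, and the way the flag value is parsed (case-insensitive `true`) is an assumption.

```python
import os

VALID_MODES = {"enforce", "monitor", "disabled"}

def effective_mode(project_mode: str, env=os.environ) -> str:
    """Resolve the mode applied to a request (sketch; the override parsing is assumed)."""
    if project_mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {project_mode!r}")
    # Assumption: the string "true" (case-insensitive) turns the override on.
    if env.get("GUARDRAILS_FORCE_ENFORCE", "").lower() == "true":
        return "enforce"
    return project_mode
```

With the override set, `effective_mode("disabled", env={"GUARDRAILS_FORCE_ENFORCE": "true"})` resolves to `"enforce"`; without it, the project's own mode is returned unchanged.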

Content categories

Four categories are configurable per project:

Category	Description
`sexual`	Explicit or pornographic content
`self-harm`	Suicide or self-harm instructions
`hate`	Slurs, genocide, or ethnic cleansing
`violence`	Weapons, explosives, murder, or body disposal

Minors/CSAM protection is always enforced and cannot be disabled. This applies even when project mode is disabled or the global GUARDRAILS_ENABLED flag is off.
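
The rule that minors protection survives every setting can be expressed as a small set computation. In this sketch the label `minors` and the function `active_categories` are hypothetical, and it assumes that turning `GUARDRAILS_ENABLED` off disables only the configurable categories.

```python
CONFIGURABLE = {"sexual", "self-harm", "hate", "violence"}
ALWAYS_ON = "minors"  # hypothetical label for the always-enforced minors/CSAM check

def active_categories(mode: str, enabled: set, globally_enabled: bool = True) -> set:
    """Return the categories actually checked for a request (sketch)."""
    unknown = enabled - CONFIGURABLE
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    active = {ALWAYS_ON}  # present regardless of mode or global flag
    if globally_enabled and mode != "disabled":
        active |= enabled
    return active
```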

Configuring guardrails

Open your project in the console
Navigate to Settings → Project settings
Scroll to Content Safety
Select a Mode: Enforce, Monitor, or Disabled
Toggle the Categories you want to enforce
Click Save

Access controls

Owner/admin only: Only organization owners and admins can weaken guardrails — switching from enforce to monitor/disabled, or removing categories
Members can strengthen: Members can re-enable categories or switch back to enforce mode
Per-project isolation: Guardrail policy is scoped to each project independently
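
The weaken/strengthen distinction above could be checked along these lines. This is a sketch under stated assumptions: the strength ordering `disabled < monitor < enforce`, the role strings, and the helper names `weakens` and `may_apply` are all hypothetical.

```python
# Assumed ordering: a move to a lower number counts as weakening.
MODE_STRENGTH = {"disabled": 0, "monitor": 1, "enforce": 2}
PRIVILEGED = {"owner", "admin"}

def weakens(old_mode: str, old_cats: set, new_mode: str, new_cats: set) -> bool:
    """True if the change lowers the mode or removes any category."""
    lower_mode = MODE_STRENGTH[new_mode] < MODE_STRENGTH[old_mode]
    removed = not set(old_cats) <= set(new_cats)
    return lower_mode or removed

def may_apply(role: str, old_mode: str, old_cats: set,
              new_mode: str, new_cats: set) -> bool:
    """Members may only strengthen; owners and admins may also weaken."""
    if weakens(old_mode, old_cats, new_mode, new_cats):
        return role in PRIVILEGED
    return True
```

A member switching a project from `monitor` back to `enforce` passes this check, while the same member removing a category does not.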

Covered entrypoints

Guardrails run automatically on:

Native sync: POST /api/v1/ai
Native async: POST /api/v1/ai/async
Streaming: preflight check before SSE begins
OpenAI-compatible: POST /api/v1/chat/completions
Console playground (sync and async)

Requests with image-only or empty text content return not_checked and proceed without text classification. Guardrails apply to input only — AI output is not moderated server-side.
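
To illustrate when a request ends up as `not_checked`, here is a sketch of text extraction over the `messages`, `prompt`, and `input` fields. It assumes OpenAI-style multimodal parts (`{"type": "text", "text": ...}`); the helper names and the `"check"` return value are hypothetical.

```python
def extract_text(body: dict) -> list:
    """Collect scannable text from prompt, input, and messages (field shapes assumed)."""
    texts = []
    for key in ("prompt", "input"):
        value = body.get(key)
        if isinstance(value, str):
            texts.append(value)
    for message in body.get("messages") or []:
        content = message.get("content")
        if isinstance(content, str):
            texts.append(content)
        elif isinstance(content, list):
            # Multimodal content: keep only the text parts, skip images.
            texts.extend(
                part["text"] for part in content
                if isinstance(part, dict) and part.get("type") == "text"
                and isinstance(part.get("text"), str)
            )
    # Whitespace-only strings count as empty text.
    return [t for t in texts if t.strip()]

def precheck(body: dict) -> str:
    """Return "not_checked" when there is no text to classify, else "check"."""
    return "check" if extract_text(body) else "not_checked"
```

An image-only message yields an empty list, so `precheck` reports `not_checked` and the request would proceed without text classification.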

Error handling

Blocked requests in enforce mode surface clear errors:

API	Status	Error code	Message
Native API	403	`content_policy_violation`	`Request blocked by content policy.`
OpenAI-compatible	400	`content_policy_violation`	`Request blocked by content policy.`
Streaming (preflight)	403	`content_policy_violation`	JSON response before SSE starts
Abuse throttle	429	`rate_limited`	Cooldown after repeated denials
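
On the client side, the rows above map naturally to a small dispatch function. The sketch below takes an already-parsed status, error code, and headers, because the exact JSON layout of the error body is not specified here. `handle_guardrail_error` and its return shape are hypothetical; the `Retry-After` header is the one ModelRiver sends with 429 responses.

```python
from typing import Mapping

def handle_guardrail_error(status: int, error_code: str,
                           headers: Mapping[str, str]) -> dict:
    """Map a guardrail-related error to a client action (sketch)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    if status in (400, 403) and error_code == "content_policy_violation":
        # Resending the same prompt would be blocked again, so do not retry.
        return {"action": "reject", "retry_after": None}
    if status == 429 and error_code == "rate_limited":
        raw = lowered.get("retry-after", "")
        # Retry-After may also be an HTTP date; this sketch handles only seconds.
        delay = int(raw) if raw.isdigit() else None
        return {"action": "wait", "retry_after": delay}
    return {"action": "other", "retry_after": None}
```

Treating a content-policy block as non-retryable matters in practice, since automatic retries of a blocked prompt count as further denials.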

Blocked responses include triggered categories when available. The original prompt is never returned in the error body. After repeated enforce-mode denials from the same API key or user, ModelRiver returns HTTP 429 with a Retry-After header (default: 5 denials within 900 seconds per project).
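
The default threshold (5 denials within 900 seconds) behaves like a sliding-window counter. The class below is an in-memory sketch of that idea only. Whether the fifth denial itself triggers the throttle, and how long the cooldown actually lasts, are assumptions, and `DenialThrottle` is a hypothetical name.

```python
from collections import deque

class DenialThrottle:
    """Sliding-window counter for enforce-mode denials (sketch, in-memory only)."""

    def __init__(self, limit: int = 5, window_s: int = 900):
        self.limit = limit
        self.window_s = window_s
        self._denials = {}  # caller key -> deque of denial timestamps

    def record_denial(self, key: str, now: float) -> None:
        self._denials.setdefault(key, deque()).append(now)

    def retry_after(self, key: str, now: float) -> int:
        """Seconds to wait, or 0 if the caller is not throttled."""
        q = self._denials.get(key, deque())
        while q and now - q[0] >= self.window_s:
            q.popleft()  # drop denials that fell out of the window
        if len(q) < self.limit:
            return 0
        # Assumption: throttled until the oldest counted denial leaves the window.
        return max(1, int(q[0] + self.window_s - now))
```

A nonzero return value is what a server built this way would place in the `Retry-After` header.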

Privacy and billing

No prompt leakage: Blocked prompts are not stored in request logs or returned in API errors
No provider billing: Blocked and throttled requests do not increment organization request counters or consume provider tokens
Audit without exposure: Guardrail logs record guardrail_result, guardrail_action, guardrail_source, guardrail_categories, and guardrail_latency_ms with request_body: nil

In monitor mode, violating requests proceed to providers but guardrail decisions are still logged with guardrail_action: allow.
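
As an illustration of "audit without exposure", a guardrail log entry can be built from the decision metadata alone. The field names come from the list above; the function `audit_record` and the example values for `guardrail_result` are hypothetical.

```python
def audit_record(result: str, action: str, source: str,
                 categories: list, latency_ms: float) -> dict:
    """Build a guardrail log entry that never carries the prompt (sketch)."""
    return {
        "guardrail_result": result,
        "guardrail_action": action,
        "guardrail_source": source,
        "guardrail_categories": sorted(categories),
        "guardrail_latency_ms": round(latency_ms, 2),
        "request_body": None,  # nil in the log; None is the Python equivalent
    }

# A monitor-mode violation: logged with action "allow", prompt text absent.
entry = audit_record("flagged", "allow", "local", ["violence"], 3.2)
```

Because the function has no prompt parameter at all, there is no code path by which the request text could end up in the record.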

Best practices

Start in monitor mode: Observe violation rates before enforcing blocks on production traffic
Keep enforce as default: Use enforce mode for public-facing applications and production API keys
Restrict weakening to admins: Use organization roles so only owners/admins can disable or weaken guardrails
Review guardrail logs: Use the console request logs to audit categories and latency without accessing blocked prompt text

Next steps

Compliance: Audit trails and regulatory considerations
Data retention: Understand how request data is stored and managed
API keys: Manage authentication credentials
Observability: Monitor and audit request logs
