Why workflow caching?
Running AI models in production can become expensive as user volume grows. Often, users ask similar queries or trigger the exact same workflow multiple times. Re-processing these identical requests leads to high latency and duplicate token costs.
ModelRiver Cache solves this by capturing and instantly serving previously successful responses across your workflows.
How it works
- Intelligent Capture: When caching is enabled and an AI request completes successfully, ModelRiver records the input context and the structured output.
- Instant Delivery: When an incoming request matches a cached pattern exactly, the cached response is served in under 10ms.
- Analytics: The Cache tab in your project console shows your cache hit rate, bandwidth saved, and total latency reduction.
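The capture-and-serve flow above can be sketched as a simple lookup keyed by a fingerprint of the full input context. This is an illustrative, in-memory approximation, not ModelRiver's actual implementation (which runs as a managed global service); the function and variable names are hypothetical.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for the managed cache.
_cache: dict[str, str] = {}

def fingerprint(request: dict) -> str:
    """Derive a deterministic key from the full input context."""
    canonical = json.dumps(request, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def serve(request: dict, run_model) -> str:
    key = fingerprint(request)
    if key in _cache:                  # exact match: instant delivery
        return _cache[key]
    response = run_model(request)      # cache miss: run the workflow
    _cache[key] = response             # capture the successful response
    return response
```

Serializing the request with `sort_keys=True` before hashing keeps the key deterministic regardless of field order, which is what makes exact-match lookups reliable.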
Enabling the cache
- Navigate to the Workflows section and open the settings for a specific workflow.
- Locate the Caching (optional) section.
- Select your desired Cache window (e.g., 15m, 1h, or 1d). Cached responses older than this window are automatically ignored.
- You can also manually clear the cache for this workflow at any time using the Clear button.
Key metrics
With the Workflow Cache enabled, your team can expect:
- Zero-cost hits: Cached requests don't incur token costs from your underlying providers (like OpenAI or Anthropic).
- Sub-10ms latency: Instead of waiting 2-5 seconds for an LLM to generate text, the response is available immediately.
Note: While cache hits save you from costly external provider usage, they are still counted as standard requests towards your ModelRiver limits to help us maintain our high-performance global cache network.
Cache hit criteria
To ensure responses are accurate when caching, ModelRiver enforces strict matching requirements. A cache hit only triggers if:
- System and User prompts match exactly.
- Temperature, Top P, and Model settings remain identical.
- Attachments (such as images for vision models) compute to the exact same hash.
If any of these criteria differ, the request bypasses the cache and runs normally.
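The hit criteria above can be illustrated as a single cache key derived from every matching vector: prompts, model settings, and attachment hashes. This is a hedged sketch, not ModelRiver's documented key format; `cache_key` and its parameters are hypothetical.

```python
import hashlib
import json

def cache_key(system: str, user: str, model: str,
              temperature: float, top_p: float,
              attachments: tuple[bytes, ...] = ()) -> str:
    """Every field below must match exactly for a cache hit;
    changing any one of them produces a different key."""
    attachment_hashes = [hashlib.sha256(a).hexdigest() for a in attachments]
    payload = json.dumps({
        "system": system, "user": user, "model": model,
        "temperature": temperature, "top_p": top_p,
        "attachments": attachment_hashes,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because attachments are reduced to content hashes, two requests with byte-identical images share a key, while any edit to an image (or to temperature, Top P, or the model) yields a new key and a cache bypass.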
Next steps
- Type-safe solutions: Learn how cache works with structured outputs.
- Workflows: Discover how workflows process cached queries.
- Observability: Understand how cached responses appear in your timeline.