Streaming allows your application to receive AI-generated tokens as they're produced, rather than waiting for the entire response. This dramatically reduces time-to-first-token and creates a more responsive user experience.
Enabling streaming
Set "stream": true in your request body to receive Server-Sent Events (SSE):
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY"
)

stream = client.chat.completions.create(
    model="my_workflow",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
Node.js (OpenAI SDK)
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.modelriver.com/v1",
  apiKey: "mr_live_YOUR_API_KEY",
});

const stream = await client.chat.completions.create({
  model: "my_workflow",
  messages: [{ role: "user", content: "Tell me a story" }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
Native API (cURL)
```bash
curl -X POST https://api.modelriver.com/v1/ai \
  -H "Authorization: Bearer mr_live_your_key" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "workflow": "my-workflow",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
Vercel AI SDK (React)
```tsx
import { useChat } from "ai/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat",
  });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}
```
SSE event format
Each Server-Sent Event follows the OpenAI delta format:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]} data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]} data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]} data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]} data: [DONE]Chunk structure
| Field | Description |
|---|---|
| `id` | Consistent across all chunks in a stream |
| `object` | Always `"chat.completion.chunk"` |
| `choices[0].delta.role` | Set to `"assistant"` in the first chunk |
| `choices[0].delta.content` | The incremental text content |
| `choices[0].finish_reason` | `null` during streaming, `"stop"` on the final content chunk |
| `[DONE]` | Sentinel value indicating stream end |
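If you parse the SSE stream yourself rather than through an SDK, the fields above are all you need to reassemble the response. Below is a minimal sketch assuming the chunk format shown above; the helper name `collect_sse_text` and its line-by-line input are illustrative, not part of the ModelRiver API:

```python
import json

def collect_sse_text(lines):
    """Accumulate assistant text from a sequence of raw SSE lines (illustrative helper)."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skips blank separators and ": heartbeat" comments
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break  # sentinel: the stream is complete
        chunk = json.loads(payload)
        choice = chunk["choices"][0]
        content = choice.get("delta", {}).get("content")
        if content:
            parts.append(content)
        # choice.get("finish_reason") == "stop" marks the final content chunk
    return "".join(parts)
```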
Streaming features
Heartbeat
ModelRiver sends heartbeat comments (: heartbeat) every 15 seconds to keep connections alive. This prevents proxy servers and load balancers from closing idle connections.
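Heartbeats are SSE comment lines, so SDKs discard them automatically. If you read the stream manually, skip any line starting with a colon; since a heartbeat arrives at least every 15 seconds, a per-read timeout of roughly twice that is a simple way to detect a silently dropped connection. A sketch, assuming the `requests` library and the native endpoint shown above:

```python
import requests  # assumption: any HTTP client with streaming reads works

resp = requests.post(
    "https://api.modelriver.com/v1/ai",
    headers={"Authorization": "Bearer mr_live_your_key",
             "Content-Type": "application/json"},
    json={
        "workflow": "my-workflow",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
    },
    stream=True,
    timeout=(10, 30),  # (connect, read): heartbeats every 15 s make a 30 s read timeout reasonable
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or line.startswith(":"):
        continue  # blank separator or ": heartbeat" comment -- connection is alive, nothing to render
    # lines starting with "data: " carry the chunks described above
```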
Timeout
Streaming requests have a 5-minute timeout to accommodate long-running generations. If the provider doesn't begin generating within this window, the stream closes with an error event.
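The server-side limit doesn't replace a client-side one. With the OpenAI Python SDK you can pass a `timeout` when constructing the client; the values below are illustrative, not required by ModelRiver:

```python
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    timeout=httpx.Timeout(300.0, connect=10.0),  # overall cap mirrors the 5-minute server limit
)
```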
Stream termination
The stream ends with data: [DONE] to signal completion. Always check for this sentinel value to properly close your connection and handle cleanup.
Error handling in streams
If an error occurs during streaming, ModelRiver sends an error event before closing the stream:
data: {"error":{"message":"Provider timeout","type":"upstream_error","code":"timeout"}} data: [DONE]Handling stream errors
```python
try:
    stream = client.chat.completions.create(
        model="my_workflow",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
except Exception as e:
    print(f"Stream error: {e}")
    # Implement retry logic or fallback
```

```javascript
try {
  const stream = await client.chat.completions.create({
    model: "my_workflow",
    messages: [{ role: "user", content: "Hello" }],
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
} catch (error) {
  console.error("Stream error:", error.message);
  // Implement retry logic or fallback
}
```
Streaming with function calling
When streaming responses that include tool calls, the tool call data arrives incrementally:
```python
stream = client.chat.completions.create(
    model="my_workflow",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }],
    stream=True
)

tool_calls = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        for tc in delta.tool_calls:
            if tc.index >= len(tool_calls):
                tool_calls.append({"id": tc.id, "function": {"name": "", "arguments": ""}})
            if tc.function.name:
                tool_calls[tc.index]["function"]["name"] = tc.function.name
            if tc.function.arguments:
                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments

print("Tool calls:", tool_calls)
```

For more details on tool use, see Function calling.
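Once the stream completes, each accumulated `arguments` value is a JSON string that still needs to be parsed before the tool can be executed. A minimal sketch continuing from the `tool_calls` list built above; the `get_weather` body here is a stand-in for a real implementation:

```python
import json

def get_weather(location: str) -> str:
    return f"Sunny in {location}"  # placeholder for a real weather lookup

for call in tool_calls:
    if call["function"]["name"] == "get_weather":
        args = json.loads(call["function"]["arguments"])  # accumulated arguments are a JSON string
        print(get_weather(**args))
```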
Best practices
- Use streaming for user-facing chat: Reduces perceived latency significantly
- Buffer before rendering: Consider buffering a few tokens before displaying to avoid jittery output (see the sketch after this list)
- Handle disconnections: Implement reconnection logic for long streams
- Set reasonable timeouts: Don't rely solely on the server timeout; implement client-side timeouts
- Process the [DONE] sentinel: Always handle stream termination properly
- Use SDKs when possible: They handle SSE parsing, error handling, and reconnection automatically
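For the buffering point above, one approach is to flush output in small batches rather than per token. A sketch, assuming an OpenAI-SDK stream like the ones in the examples above; the helper name and threshold are arbitrary:

```python
import sys

def render_buffered(stream, min_chars=24):
    """Print streamed content in small batches to avoid jittery output (illustrative helper)."""
    buffer = ""
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if not content:
            continue
        buffer += content
        if len(buffer) >= min_chars:
            sys.stdout.write(buffer)
            sys.stdout.flush()
            buffer = ""
    if buffer:  # flush whatever is left when the stream ends
        sys.stdout.write(buffer)
        sys.stdout.flush()
```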
Next steps
- Function calling: Use tools with streaming
- OpenAI compatibility: Compatible SDKs
- Error handling: Complete error reference