
LlamaIndex + ModelRiver

Route every LlamaIndex query through ModelRiver. Get automatic failover, cost tracking, and structured outputs for your RAG pipelines.

Overview

LlamaIndex is a data-aware LLM framework for RAG, knowledge graphs, and document QA. Because its LLM and embedding integrations speak the OpenAI API, pointing them at ModelRiver only means changing the API base URL, API key, and model name.

What you get:

  • Every LlamaIndex query routes through ModelRiver
  • Automatic failover if your primary provider goes down mid-query
  • Token and cost tracking for every document retrieval + generation step
  • Provider switching from the console: no redeployment needed

Quick start

Install dependencies

Bash
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai

Connect LlamaIndex to ModelRiver

PYTHON
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# LLM for generation
llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="my-chat-workflow",
    temperature=0.7,
)

# Embeddings (if you have an embedding workflow)
embed_model = OpenAIEmbedding(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="my-embedding-workflow",
)

# Set as global defaults
Settings.llm = llm
Settings.embed_model = embed_model
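
To verify the connection before building anything on top of it, a quick smoke test (the prompt is arbitrary):

PYTHON
# One-off completion; confirms the API key and workflow name resolve
response = Settings.llm.complete("Reply with the word 'ok'.")
print(response.text)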

Document QA

Load and index documents

PYTHON
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()

# Build the index (embeddings go through ModelRiver)
index = VectorStoreIndex.from_documents(documents)

# Query (LLM calls go through ModelRiver)
query_engine = index.as_query_engine()
response = query_engine.query("What are the main themes in the documents?")
print(response)
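
Building the index embeds every chunk through ModelRiver, so re-running the script repeats those embedding calls. One option is to persist the index and reload it on later runs; a minimal sketch (the ./storage directory is illustrative):

PYTHON
from llama_index.core import StorageContext, load_index_from_storage

# Persist the built index (vectors + docstore) to disk
index.storage_context.persist(persist_dir="./storage")

# On later runs, reload instead of re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)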

From text strings

PYTHON
from llama_index.core import Document, VectorStoreIndex

documents = [
    Document(text="ModelRiver routes AI requests across providers."),
    Document(text="Workflows define provider, model, and fallback configuration."),
    Document(text="Structured outputs guarantee JSON schema compliance."),
]

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("How does ModelRiver handle provider routing?")
print(response)

Chat engine

PYTHON
chat_engine = index.as_chat_engine(chat_mode="context")

response = chat_engine.chat("What is ModelRiver?")
print(response)

# Follow-up (maintains conversation context)
response = chat_engine.chat("How does failover work?")
print(response)
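
Conversation history lives on the chat engine object, so to start a clean session (for example, per user) you can reset it. A small sketch, assuming the chat_engine created above:

PYTHON
# Clear the accumulated conversation history
chat_engine.reset()

# The next question is answered without any prior context
response = chat_engine.chat("What is ModelRiver?")
print(response)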

Streaming

PYTHON
query_engine = index.as_query_engine(streaming=True)

streaming_response = query_engine.query("Explain failover in detail")
streaming_response.print_response_stream()
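
Streaming also works with the chat engine. A sketch using the chat_engine from the previous section, iterating the token generator directly instead of printing the whole response:

PYTHON
streaming_response = chat_engine.stream_chat("Explain failover in detail")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()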

Sub-question query engine

For complex queries that need to be broken down into sub-questions:

PYTHON
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create tools for different document sets (index_1 and index_2 are
# VectorStoreIndex objects built as in the sections above)
tool_1 = QueryEngineTool(
    query_engine=index_1.as_query_engine(),
    metadata=ToolMetadata(
        name="product_docs",
        description="Product documentation and features",
    ),
)

tool_2 = QueryEngineTool(
    query_engine=index_2.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="API reference and technical specifications",
    ),
)

# Sub-question engine routes each sub-query through ModelRiver
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[tool_1, tool_2],
    llm=llm,
)

response = query_engine.query("Compare the product features with API capabilities")
print(response)

Different workflows per component

Use faster/cheaper models for embeddings and more powerful models for generation:

PYTHON
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Fast embedding model
embed_model = OpenAIEmbedding(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="fast-embeddings",  # text-embedding-3-small workflow
)

# Powerful generation model
llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="deep-generation",  # GPT-4o / Claude 3.5 workflow
)

# Lightweight model for summarisation steps
summary_llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="fast-summary",  # GPT-4o-mini workflow
)
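
These clients can be set as global defaults or passed to individual engines. The per-engine llm override below is a sketch; keyword support can vary across LlamaIndex versions:

PYTHON
from llama_index.core import Settings

# Global defaults: cheap embeddings, powerful generation
Settings.embed_model = embed_model
Settings.llm = llm

# Use the lightweight model only for a summarisation-heavy engine
summary_engine = index.as_query_engine(
    llm=summary_llm,
    response_mode="tree_summarize",
)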

Best practices

  1. Separate embedding and LLM workflows: Use a cheap, fast model for embeddings and a powerful one for generation
  2. Monitor indexing costs: Large document sets generate many embedding calls; track them in Request Logs (a client-side token counting sketch follows this list)
  3. Configure failover: RAG queries can be long; ensure fallback providers are configured
  4. Use structured outputs: Define answer schemas in ModelRiver for consistent response formats
  5. Stream for interactive QA: Use streaming mode for user-facing query interfaces
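
ModelRiver records usage in Request Logs; if you also want client-side counts while indexing, LlamaIndex's token counting callback reports embedding and LLM token totals. A minimal sketch (the tiktoken encoding is an assumption; match it to your workflow's model):

PYTHON
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count tokens locally for every embedding and LLM call
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o").encode,
)
Settings.callback_manager = CallbackManager([token_counter])

# ... build indexes and run queries as usual ...

print("Embedding tokens:", token_counter.total_embedding_token_count)
print("LLM tokens:", token_counter.total_llm_token_count)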

Next steps