
LlamaIndex + ModelRiver

Route every LlamaIndex query through ModelRiver. Get automatic failover, cost tracking, and structured outputs for your RAG pipelines.

Overview

LlamaIndex is a data-aware LLM framework for RAG, knowledge graphs, and document QA. Because its LLM and embedding integrations speak the OpenAI API, pointing them at ModelRiver only means changing the API base URL, API key, and model name.

What you get:

  • Every LlamaIndex query routes through ModelRiver
  • Automatic failover if your primary provider goes down mid-query
  • Token and cost tracking for every document retrieval + generation step
  • Provider switching from the console: no redeployment needed

Quick start

Install dependencies

Bash
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai

Connect LlamaIndex to ModelRiver

PYTHON
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# LLM for generation
llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="my-chat-workflow",
    temperature=0.7,
)

# Embeddings (if you have an embedding workflow)
embed_model = OpenAIEmbedding(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="my-embedding-workflow",
)

# Set as global defaults
Settings.llm = llm
Settings.embed_model = embed_model
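
To verify the connection before building anything on top of it, a quick smoke test (the prompt is arbitrary):

PYTHON
# One-off completion; confirms the API key and workflow name resolve
response = Settings.llm.complete("Reply with the word 'ok'.")
print(response.text)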

Document QA

Load and index documents

PYTHON
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()

# Build the index (embeddings go through ModelRiver)
index = VectorStoreIndex.from_documents(documents)

# Query (LLM calls go through ModelRiver)
query_engine = index.as_query_engine()
response = query_engine.query("What are the main themes in the documents?")
print(response)
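
Building the index embeds every chunk through ModelRiver, so re-running the script repeats those embedding calls. One option is to persist the index and reload it on later runs; a minimal sketch (the ./storage directory is illustrative):

PYTHON
from llama_index.core import StorageContext, load_index_from_storage

# Persist the built index (vectors + docstore) to disk
index.storage_context.persist(persist_dir="./storage")

# On later runs, reload instead of re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)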

From text strings

PYTHON
from llama_index.core import Document, VectorStoreIndex

documents = [
    Document(text="ModelRiver routes AI requests across providers."),
    Document(text="Workflows define provider, model, and fallback configuration."),
    Document(text="Structured outputs guarantee JSON schema compliance."),
]

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("How does ModelRiver handle provider routing?")
print(response)

Chat engine

PYTHON
chat_engine = index.as_chat_engine(chat_mode="context")

response = chat_engine.chat("What is ModelRiver?")
print(response)

# Follow-up (maintains conversation context)
response = chat_engine.chat("How does failover work?")
print(response)
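
Conversation history lives on the chat engine object, so to start a clean session (for example, per user) you can reset it. A small sketch, assuming the chat_engine created above:

PYTHON
# Clear the accumulated conversation history
chat_engine.reset()

# The next question is answered without any prior context
response = chat_engine.chat("What is ModelRiver?")
print(response)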

Streaming

PYTHON
query_engine = index.as_query_engine(streaming=True)

streaming_response = query_engine.query("Explain failover in detail")
streaming_response.print_response_stream()
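
Streaming also works with the chat engine. A sketch using the chat_engine from the previous section, iterating the token generator directly instead of printing the whole response:

PYTHON
streaming_response = chat_engine.stream_chat("Explain failover in detail")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()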

Sub-question query engine

For complex queries that need to be broken down into sub-questions:

PYTHON
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create tools for different document sets (index_1 and index_2 are
# VectorStoreIndex objects built as in the sections above)
tool_1 = QueryEngineTool(
    query_engine=index_1.as_query_engine(),
    metadata=ToolMetadata(
        name="product_docs",
        description="Product documentation and features",
    ),
)

tool_2 = QueryEngineTool(
    query_engine=index_2.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="API reference and technical specifications",
    ),
)

# Sub-question engine routes each sub-query through ModelRiver
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[tool_1, tool_2],
    llm=llm,
)

response = query_engine.query("Compare the product features with API capabilities")
print(response)

Different workflows per component

Use faster/cheaper models for embeddings and more powerful models for generation:

PYTHON
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Fast embedding model
embed_model = OpenAIEmbedding(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="fast-embeddings",  # text-embedding-3-small workflow
)

# Powerful generation model
llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="deep-generation",  # GPT-4o / Claude 3.5 workflow
)

# Lightweight model for summarisation steps
summary_llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="fast-summary",  # GPT-4o-mini workflow
)
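
These clients can be set as global defaults or passed to individual engines. The per-engine llm override below is a sketch; keyword support can vary across LlamaIndex versions:

PYTHON
from llama_index.core import Settings

# Global defaults: cheap embeddings, powerful generation
Settings.embed_model = embed_model
Settings.llm = llm

# Use the lightweight model only for a summarisation-heavy engine
summary_engine = index.as_query_engine(
    llm=summary_llm,
    response_mode="tree_summarize",
)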

Best practices

  1. Separate embedding and LLM workflows: Use a cheap, fast model for embeddings and a powerful one for generation
  2. Monitor indexing costs: Large document sets generate many embedding calls; track them in Request Logs (a client-side token counting sketch follows this list)
  3. Configure failover: RAG queries can be long; ensure fallback providers are configured
  4. Use structured outputs: Define answer schemas in ModelRiver for consistent response formats
  5. Stream for interactive QA: Use streaming mode for user-facing query interfaces
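
ModelRiver records usage in Request Logs; if you also want client-side counts while indexing, LlamaIndex's token counting callback reports embedding and LLM token totals. A minimal sketch (the tiktoken encoding is an assumption; match it to your workflow's model):

PYTHON
import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count tokens locally for every embedding and LLM call
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o").encode,
)
Settings.callback_manager = CallbackManager([token_counter])

# ... build indexes and run queries as usual ...

print("Embedding tokens:", token_counter.total_embedding_token_count)
print("LLM tokens:", token_counter.total_llm_token_count)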

Next steps