Overview
LlamaIndex is a data-aware LLM framework designed for RAG, knowledge graphs, and document QA. Because its OpenAI LLM and embedding classes accept a custom API base URL, pointing LlamaIndex at ModelRiver is a one-line change.
What you get:
- Every LlamaIndex query routes through ModelRiver
- Automatic failover if your primary provider goes down mid-query
- Token and cost tracking for every document retrieval + generation step
- Provider switching from the console: no redeployment needed
Quick start
Install dependencies
Bash
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
Connect LlamaIndex to ModelRiver
PYTHON
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

# LLM for generation
llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="my-chat-workflow",
    temperature=0.7,
)

# Embeddings (if you have an embedding workflow)
embed_model = OpenAIEmbedding(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="my-embedding-workflow",
)

# Set as global defaults
Settings.llm = llm
Settings.embed_model = embed_model
Document QA
Load and index documents
PYTHON
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()

# Build the index (embeddings go through ModelRiver)
index = VectorStoreIndex.from_documents(documents)

# Query (LLM calls go through ModelRiver)
query_engine = index.as_query_engine()
response = query_engine.query("What are the main themes in the documents?")
print(response)
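Rebuilding an index re-embeds every document, and each of those embedding calls goes through your ModelRiver workflow. A minimal sketch of persisting the index and reloading it on later runs so you only pay for embeddings once (the ./storage path is an arbitrary choice):
PYTHON
from llama_index.core import StorageContext, load_index_from_storage

# Persist the index after building it
index.storage_context.persist(persist_dir="./storage")

# On later runs, reload it instead of re-embedding the documents
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)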
From text strings
PYTHON
from llama_index.core import Document, VectorStoreIndex

documents = [
    Document(text="ModelRiver routes AI requests across providers."),
    Document(text="Workflows define provider, model, and fallback configuration."),
    Document(text="Structured outputs guarantee JSON schema compliance."),
]

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("How does ModelRiver handle provider routing?")
print(response)
Chat engine
PYTHON
chat_engine = index.as_chat_engine(chat_mode="context")

response = chat_engine.chat("What is ModelRiver?")
print(response)

# Follow-up (maintains conversation context)
response = chat_engine.chat("How does failover work?")
print(response)
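The chat engine's behaviour can be tuned further; a brief sketch, assuming you want to steer the context chat engine with a custom system prompt (the prompt text here is illustrative) and clear the history between sessions:
PYTHON
chat_engine = index.as_chat_engine(
    chat_mode="context",
    system_prompt="Answer using only the indexed ModelRiver documentation.",
)

response = chat_engine.chat("What is ModelRiver?")
print(response)

# Clear the conversation history before the next user session
chat_engine.reset()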
Streaming
PYTHON
query_engine = index.as_query_engine(streaming=True)

streaming_response = query_engine.query("Explain failover in detail")
streaming_response.print_response_stream()
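print_response_stream() writes to stdout; for a user-facing interface you will usually consume the token generator directly. A sketch iterating the streaming response's response_gen yourself:
PYTHON
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Explain failover in detail")

# Forward each text chunk to your UI as it arrives
for chunk in streaming_response.response_gen:
    print(chunk, end="", flush=True)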
Sub-question query engine
For complex queries that require breaking down into sub-questions:
PYTHON
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create tools for different document sets
# (index_1 and index_2 are VectorStoreIndex instances built as shown above)
tool_1 = QueryEngineTool(
    query_engine=index_1.as_query_engine(),
    metadata=ToolMetadata(
        name="product_docs",
        description="Product documentation and features"
    ),
)

tool_2 = QueryEngineTool(
    query_engine=index_2.as_query_engine(),
    metadata=ToolMetadata(
        name="api_docs",
        description="API reference and technical specifications"
    ),
)

# Sub-question engine routes each sub-query through ModelRiver
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[tool_1, tool_2],
    llm=llm,
)

response = query_engine.query("Compare the product features with API capabilities")
print(response)
Different workflows per component
Use faster/cheaper models for embeddings and more powerful models for generation:
PYTHON
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Fast embedding model
embed_model = OpenAIEmbedding(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="fast-embeddings",  # text-embedding-3-small workflow
)

# Powerful generation model
llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="deep-generation",  # GPT-4o / Claude 3.5 workflow
)

# Lightweight model for summarisation steps
summary_llm = OpenAI(
    api_base="https://api.modelriver.com/v1",
    api_key="mr_live_YOUR_API_KEY",
    model="fast-summary",  # GPT-4o-mini workflow
)
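One way these could be wired together; a sketch, assuming you pass the models per component rather than via the global Settings defaults (as_query_engine accepts a per-engine llm, and from_documents accepts an embed_model):
PYTHON
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

# Indexing uses the cheap embedding workflow
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Interactive answers use the powerful generation workflow
query_engine = index.as_query_engine(llm=llm)

# Summarisation-style queries use the lightweight workflow
summary_engine = index.as_query_engine(llm=summary_llm, response_mode="tree_summarize")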
Best practices
- Separate embedding and LLM workflows: Use a cheap, fast model for embeddings and a powerful one for generation
- Monitor indexing costs: Large document sets generate many embedding calls; track in Request Logs
- Configure failover: RAG queries can be long; ensure fallback providers are configured
- Use structured outputs: Define answer schemas in ModelRiver for consistent response formats
- Stream for interactive QA: Use streaming mode for user-facing query interfaces (see the sketch after this list)
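For the last point, the chat engine has a streaming counterpart to chat(); a minimal sketch using stream_chat and its token generator:
PYTHON
chat_engine = index.as_chat_engine(chat_mode="context")

streaming_response = chat_engine.stream_chat("How does ModelRiver handle failover?")
for chunk in streaming_response.response_gen:
    print(chunk, end="", flush=True)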
Next steps
- CrewAI integration: Multi-agent orchestration
- RAG system guide: Full architecture blueprint
- API reference: Endpoint documentation