Why Elixir + Phoenix Are Ideal for Building Scalable, Fault-Tolerant AI Gateways

The stack nobody talks about

Nearly every AI gateway and orchestration platform in 2026 runs on one of three stacks: Python + FastAPI, TypeScript + Node, or Go. That's the default mental model. If you search "how to build an AI gateway," virtually every tutorial, open-source boilerplate, and HackerNews thread assumes one of those three.

I'm going to make the case for a fourth option that almost nobody considers, and that I believe is architecturally superior for this specific class of problem: Elixir + Phoenix, running on the Erlang VM (BEAM).

This isn't a hot take for attention. I've spent over a year building a production AI gateway on this stack — handling multi-provider routing, failover chains, async job queues, WebSocket delivery, prompt caching, encrypted secrets management, rate limiting, webhook delivery, metered billing, and real-time observability. Every architectural decision in this post comes from production experience, not theory.

Let me walk through why BEAM is the most natural runtime for AI infrastructure, and why the Elixir ecosystem provides the exact primitives you need — often with less code and fewer moving parts than the alternatives.


The core problem: an AI gateway is a concurrent orchestration engine

Before comparing stacks, it helps to understand what an AI gateway actually does at runtime. It's not a simple proxy. For every incoming request, the system needs to:

  1. Authenticate and authorize the request against project-scoped API keys
  2. Load workflow configuration — which provider, which model, what system instructions, whether structured output is enabled, what backup providers exist
  3. Check rate limits at both the IP level and the project/organization level
  4. Check prompt cache (ETS in-memory first, then database fallback) to avoid redundant provider calls
  5. Run content moderation before forwarding to the provider
  6. Make the actual LLM API call — with timeout budgets, failure classification, and automatic failover to backup providers
  7. Log the request and response with full metadata: tokens, latency, provider used, cache status, failover attempts
  8. Meter usage for billing — report overage events to Stripe's metered billing API
  9. Deliver results — synchronously via HTTP response, or asynchronously via WebSocket channels and webhook delivery with HMAC-signed payloads
  10. Update caches — store the result in ETS for fast subsequent lookups

Every one of these steps involves I/O: database queries, HTTP calls to providers, HTTP calls to Stripe, WebSocket broadcasts, ETS reads/writes. And every one of them needs to happen concurrently — you can't block the entire server while waiting for OpenAI to respond to a single request.

This is where the BEAM changes the game.


Why BEAM is architecturally perfect for this

Lightweight processes for request isolation

Every request in an Elixir/Phoenix application runs in its own BEAM process. Not an OS thread. Not a goroutine. A BEAM process — a lightweight, isolated unit of execution with its own heap, its own garbage collection schedule, and its own mailbox.

The practical impact:

TEXT
Request A → BEAM process #1 → calls OpenAI → waits 3 seconds → responds
Request B → BEAM process #2 → calls Anthropic → waits 1.5 seconds → responds
Request C → BEAM process #3 → cache hit → responds in 2ms

Process #1 waiting on OpenAI does not block process #2 or #3. There's no thread pool to exhaust, no event loop to block, no async/await ceremony to manage. Concurrency is the default, not something you opt into.

In a Python/FastAPI stack, achieving this requires asyncio with careful discipline — one blocking call in the wrong place stalls the event loop. In Node, you get non-blocking I/O by default, but the single-threaded model means CPU-bound work (JSON parsing large responses, computing HMAC signatures, SHA-256 hashing for cache keys) blocks everything. In Go, goroutines are lightweight but share memory, making isolation harder to guarantee.

BEAM processes are preemptively scheduled with soft real-time guarantees. A process that's doing heavy computation gets preempted after a reduction count — it can't starve other processes. This is critical for an AI gateway where some requests take 200ms (cache hits) and others take 15 seconds (complex multi-model chains).
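To make that concrete, here is a small self-contained sketch in which simulated latencies stand in for real provider calls: a hundred 50 ms "provider calls" each run in their own process and finish in roughly 50 ms of wall-clock time, not 5 seconds.

```elixir
defmodule FanOut do
  # Stand-in for a blocking provider call; sleeps instead of doing HTTP
  def simulate_call(latency_ms) do
    Process.sleep(latency_ms)
    {:ok, latency_ms}
  end

  def run(latencies) do
    latencies
    |> Task.async_stream(&simulate_call/1, max_concurrency: 200, timeout: 5_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# 100 concurrent "calls" of 50ms each; total wall time stays near 50ms
{elapsed_us, results} = :timer.tc(fn -> FanOut.run(List.duplicate(50, 100)) end)

IO.puts("100 calls took #{div(elapsed_us, 1_000)}ms total")
IO.puts("results: #{length(results)}")
```

Each `Task.async_stream` entry is its own BEAM process with its own heap; a slow or crashing call never touches its neighbors.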

Supervision trees: self-healing infrastructure

An AI gateway talks to unreliable external services all day. Providers go down. Webhook endpoints return 500s. Stripe's API occasionally hiccups. The DB connection pool temporarily exhausts under load spikes.

In most stacks, you handle this with try/catch blocks, retry libraries, and health check endpoints that you monitor in PagerDuty. In Elixir, you get OTP supervision trees — a battle-tested process management framework that's been keeping telecom systems alive since the 1980s.

Here's a simplified version of what the application supervision tree looks like in production:

ELIXIR
children = [
  # Encryption vault — GenServer, auto-restarts if key loading fails
  MyApp.Vault,
  # 2FA token store — ETS-backed GenServer
  MyApp.TwoFactorTokens,
  # Pending async channels — tracks in-flight async requests
  MyApp.PendingChannels,
  # Telemetry supervisor — metrics collection
  MyAppWeb.Telemetry,
  # AWS credential initializer — creates ETS table before Repo starts
  MyApp.ExAwsInitializer,
  # Database connection pool
  MyApp.Repo,
  # Background job processor (Oban)
  {Oban, Application.fetch_env!(:myapp, Oban)},
  # DNS-based cluster discovery for horizontal scaling
  {DNSCluster, query: Application.get_env(:myapp, :dns_cluster_query)},
  # PubSub for real-time broadcasting
  {Phoenix.PubSub, name: MyApp.PubSub},
  # HTTP client pool for outbound requests (emails, webhooks)
  {Finch, name: MyApp.Finch},
  # WebSocket token manager — one-time tokens for WS auth
  MyApp.WebSocketTokens,
  # CLI WebSocket token manager — longer-lived reusable tokens
  MyApp.CLITokens,
  # CLI connection tracker
  MyApp.CLIConnections,
  # Real-time presence tracking
  MyAppWeb.Presence,
  # The HTTP endpoint — starts last, serves requests only after everything is ready
  MyAppWeb.Endpoint
]

opts = [strategy: :one_for_one, name: MyApp.Supervisor]
Supervisor.start_link(children, opts)

Each child is supervised. If the Vault process crashes because the encryption key loaded incorrectly, the supervisor restarts it. If the WebSocket token store crashes, it restarts. If the database connection pool dies and comes back, the Repo child restarts and reconnects.

The strategy: :one_for_one means each child is independent — one crashing doesn't take down the others. The boot order is sequential and deterministic: the ETS table for AWS credentials initializes before the Repo starts, so database URLs fetched from AWS Secrets Manager are available immediately.

You don't get this in Python, Node, or Go without significant manual engineering. OTP supervision has been refined for 30+ years. It's not a library — it's a fundamental runtime capability.
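To see the self-healing behavior in isolation, here is a minimal, self-contained sketch; the `Flaky` module and its deliberate crash are contrived for illustration. A supervised GenServer raises, and the supervisor transparently restarts it under a new pid, with no try/catch anywhere.

```elixir
defmodule Flaky do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}

  # Deliberately crash the process to demonstrate supervision
  def crash, do: GenServer.cast(__MODULE__, :boom)

  @impl true
  def handle_cast(:boom, _state), do: raise("simulated failure")
end

{:ok, _sup} = Supervisor.start_link([Flaky], strategy: :one_for_one)

pid_before = Process.whereis(Flaky)
Flaky.crash()
Process.sleep(100)
pid_after = Process.whereis(Flaky)

# The supervisor restarted Flaky with a fresh pid
IO.inspect(pid_before != pid_after and pid_after != nil)
```

The caller never observes the crash; the supervisor absorbs it and restores the child to its known-good initial state.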

ETS: in-process caching without Redis

Every AI gateway needs a fast cache. The obvious choice is Redis. But Redis introduces a network hop, a separate process to manage, and another failure point.

BEAM has ETS (Erlang Term Storage) — an in-memory key-value store that lives inside the VM, accessible from any process, with constant-time lookups and no serialization overhead.

Here's how prompt caching works:

ELIXIR
# Cache key: SHA-256 of user_id + workflow_name + normalized request params
def generate_cache_key(user_id, workflow_name, params) do
  normalized_params =
    params
    |> normalize_params() # Sort keys, strip volatile fields
    |> Jason.encode!()

  data = "#{user_id}:#{workflow_name}:#{normalized_params}"

  :crypto.hash(:sha256, data)
  |> Base.encode16(case: :lower)
end

# Two-tier lookup: ETS (fast path) → Database (durable fallback)
def get(user_id, workflow_name, params, cache_cutoff \\ 0) do
  cache_key = generate_cache_key(user_id, workflow_name, params)
  bucket_id = "prompt_cache:#{cache_key}"
  now = System.system_time(:millisecond)

  case :ets.lookup(:cache_table, bucket_id) do
    [{^bucket_id, {data, created_at}, expires_at}]
    when expires_at > now and created_at >= cache_cutoff ->
      {:ok, data}

    _ ->
      :miss
  end
end

No Redis connection to manage. No serialization/deserialization. No network latency. The cache check is a single ETS lookup — sub-microsecond. If the ETS cache misses, it falls back to a database query that checks for a matching request log with the same request body hash, same workflow, and an insertion time within the cache window.
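The write side is symmetric. Here is a minimal, self-contained sketch of the ETS tier; the table name, TTL, and `PromptCache` module are illustrative, not the production values:

```elixir
defmodule PromptCache do
  @table :prompt_cache_demo

  def ensure_table do
    # :named_table + :public lets any process read/write without a GenServer hop
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :public, read_concurrency: true])
    end

    :ok
  end

  # Store a response with a TTL; created_at supports cache-cutoff checks
  def put(bucket_id, data, ttl_ms) do
    now = System.system_time(:millisecond)
    :ets.insert(@table, {bucket_id, {data, now}, now + ttl_ms})
    :ok
  end

  # Return {:ok, data} only while the entry is unexpired
  def get(bucket_id) do
    now = System.system_time(:millisecond)

    case :ets.lookup(@table, bucket_id) do
      [{^bucket_id, {data, _created_at}, expires_at}] when expires_at > now ->
        {:ok, data}

      _ ->
        :miss
    end
  end
end

PromptCache.ensure_table()
PromptCache.put("prompt_cache:abc", %{text: "cached"}, 60_000)
PromptCache.get("prompt_cache:abc")     # → {:ok, %{text: "cached"}}
PromptCache.get("prompt_cache:missing") # → :miss
```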

This two-tier approach means:

  • Hot path: ETS returns cached responses in < 1ms
  • Warm path: Database returns cached responses in ~5ms (from previous request logs)
  • Cold path: Full provider call, 1-15 seconds

Redis would add ~0.5ms to every cache check. That doesn't sound like much until you're handling 10,000 requests per minute and every millisecond compounds.


The ecosystem: purpose-built tools, not glue

Oban: background job processing without a separate queue service

AI requests that run asynchronously need a reliable job queue. Most stacks reach for Celery (Python), BullMQ (Node), or a standalone message broker like RabbitMQ or SQS.

Elixir has Oban — a robust, PostgreSQL-backed job processing library that runs inside the same application. No external queue service. No separate worker process. Just another supervised child in the application tree.

ELIXIR
# Queue configuration
config :myapp, Oban,
  queues: [
    default: 10,
    ai_requests: 10,
    webhooks: 5,
    emails: 10,
    billing: 5
  ],
  plugins: [
    {Oban.Plugins.Cron,
     crontab: [
       {"*/5 * * * *", MyApp.Workers.DowngradeWorker, args: %{}}
     ]}
  ]

Five named queues with independent concurrency limits. The ai_requests queue handles async LLM calls. The webhooks queue handles webhook delivery with retries. The billing queue reports usage to Stripe. The emails queue sends transactional emails. And a cron plugin runs periodic tasks like checking for subscription downgrades.

Each worker is pattern-matched on its job arguments:

ELIXIR
defmodule MyApp.Workers.AIRequestWorker do
  use Oban.Worker, queue: :ai_requests, max_attempts: 1

  @impl Oban.Worker
  def perform(%Oban.Job{
        args: %{
          "channel_id" => channel_id,
          "params" => params,
          "workflow_name" => workflow_name,
          "user_id" => user_id,
          "project_id" => project_id
        }
      }) do
    # Load workflow → check cache → call provider → handle failover
    # → log request → broadcast result via WebSocket
    # All in a single, supervised background job
  end
end

Why this matters: the job queue, the cron scheduler, and the transactional guarantees all come from PostgreSQL. If the server crashes mid-job, the job isn't lost — it's still in the database, marked as in-progress, and will be picked up on restart. No Redis. No RabbitMQ. No SQS. One fewer infrastructure component to operate.
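Enqueueing a job is a plain function call plus a database insert. A sketch, assuming the worker above and a running Oban instance (the argument map mirrors what `perform/1` pattern-matches on):

```elixir
# Sketch: enqueue an async AI request onto the :ai_requests queue.
# Oban.insert/1 writes the job row to PostgreSQL, so it participates
# in the caller's transaction and survives restarts.
%{
  "channel_id" => channel_id,
  "params" => params,
  "workflow_name" => workflow_name,
  "user_id" => user_id,
  "project_id" => project_id
}
|> MyApp.Workers.AIRequestWorker.new()
|> Oban.insert()
```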

Comparison:

  • Python + Celery requires Redis or RabbitMQ as a broker, plus a separate worker process. Two infrastructure components in addition to your app.
  • Node + BullMQ requires Redis. One additional component.
  • Go typically uses a standalone queue (SQS, NATS, or RabbitMQ). One or two additional components.
  • Elixir + Oban uses PostgreSQL, which you already have for your application database. Zero additional components.

Phoenix Channels: real-time delivery without WebSocket libraries

Async AI requests need a way to deliver results back to clients. The standard approach: set up a WebSocket server, handle connection upgrades, manage authentication, implement heartbeats, handle reconnection.

Phoenix has Channels built in — a real-time abstraction on top of WebSockets with presence tracking, topic-based pub/sub, and authentication baked into the connection lifecycle.

ELIXIR
defmodule MyAppWeb.AIResponseChannel do
  use Phoenix.Channel

  def join("ai_response:" <> subtopic, _params, socket) do
    # Parse project_id and channel_id from topic
    # Verify project access via token or database lookup
    # Track presence for real-time connection monitoring
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    # Track presence so we know which channels are connected
    MyAppWeb.Presence.track(socket, socket.assigns.user_id, %{
      online_at: inspect(System.system_time(:second)),
      project_id: socket.assigns.project_id,
      channel_id: socket.assigns.channel_id
    })

    {:noreply, socket}
  end
end

When an async AI request completes, broadcasting the result is one line:

ELIXIR
MyAppWeb.Endpoint.broadcast(
  "ai_response:#{project_id}:#{channel_id}",
  "response",
  payload
)

Every connected client on that channel receives the result. No polling. No long-lived HTTP connections. No SSE with reconnection quirks. Phoenix Channels handle connection management, heartbeats, and reconnection on the client side via the official JavaScript client.

This also powers CLI webhook delivery — CLI clients connect via WebSocket, and the server broadcasts webhook payloads to all connected CLI users for a project, with HMAC-SHA256 signed payloads for verification:

ELIXIR
def generate_signature(payload, secret, timestamp) do
  json_payload = Jason.encode!(payload)
  signature_payload = "#{timestamp}.#{json_payload}"

  :crypto.mac(:hmac, :sha256, secret, signature_payload)
  |> Base.encode16(case: :lower)
end

Cloak: application-level encryption

AI gateways store provider API keys — OpenAI keys, Anthropic keys, customer-provided keys for bring-your-own-key scenarios. These must be encrypted at rest.

The Elixir ecosystem has Cloak — an encryption library that integrates with Ecto (the database layer) to provide transparent field-level encryption using AES-256-GCM:

ELIXIR
defmodule MyApp.Vault do
  use Cloak.Vault, otp_app: :myapp

  @impl GenServer
  def init(config) do
    key = fetch_key!("CLOAK_KEY")

    config =
      Keyword.put(config, :ciphers,
        default: {
          Cloak.Ciphers.AES.GCM,
          tag: "AES.GCM.V1", key: key
        }
      )

    {:ok, config}
  end
end

The Vault starts as a supervised GenServer in the application tree. Database fields marked with Cloak.Ecto.Binary are automatically encrypted on write and decrypted on read. API keys are never stored as plaintext in the database, never logged, and never serialized to disk.

Key rotation is a runtime operation — update the key, restart the Vault process, and re-encrypt. No redeployment needed. The supervisor tree ensures the Vault is available before any process tries to access encrypted data.

Behaviours: polymorphic provider adapters

An AI gateway needs to talk to OpenAI, Anthropic, Google, Mistral, Cohere, DeepSeek, Groq, xAI, Qwen, and custom endpoints. Each provider has different API formats, authentication schemes, and response structures.

Elixir's behaviours provide compile-time contracts for this:

ELIXIR
defmodule MyApp.Providers.Behaviour do
  @type request :: map()
  @type response :: {:ok, map()} | {:error, term()}

  @callback complete(request(), String.t()) :: response()
  @callback models() :: list(String.t())
  @callback provider_name() :: String.t()
end

Each provider module implements this behaviour. The routing layer dispatches based on provider name. Adding a new provider is one module — implement complete/2, models/0, and provider_name/0. No interface boilerplate. No factory pattern. No dependency injection framework. Just a behaviour contract enforced at compile time.
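As a concrete toy example, here is an echo provider satisfying the contract, plus name-based dispatch. The behaviour module is repeated so the snippet stands alone; the `Echo` provider is illustrative, not one of the real adapters:

```elixir
defmodule MyApp.Providers.Behaviour do
  @type request :: map()
  @type response :: {:ok, map()} | {:error, term()}

  @callback complete(request(), String.t()) :: response()
  @callback models() :: list(String.t())
  @callback provider_name() :: String.t()
end

defmodule MyApp.Providers.Echo do
  @behaviour MyApp.Providers.Behaviour

  # Toy provider: echoes the request back instead of calling an API.
  # @impl true makes the compiler verify each callback against the behaviour.
  @impl true
  def complete(request, _api_key), do: {:ok, %{echoed: request}}

  @impl true
  def models, do: ["echo-1"]

  @impl true
  def provider_name, do: "echo"
end

# The routing layer dispatches on the provider name at runtime
providers = %{"echo" => MyApp.Providers.Echo}
mod = Map.fetch!(providers, "echo")
mod.complete(%{prompt: "hi"}, "key")
# → {:ok, %{echoed: %{prompt: "hi"}}}
```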

Hammer: rate limiting without external state

AI gateways need multi-tier rate limiting: per-IP, per-project, per-API-key. Most stacks use Redis for distributed rate limiting.

The Elixir ecosystem has Hammer — a rate limiting library with pluggable backends. For single-node deployments, ETS is the backend. Zero external dependencies.

ELIXIR
# IP-level rate limiting
case Hammer.check_rate("ip:#{ip}", 60_000, 100) do
  {:allow, _count} -> :ok
  {:deny, _limit} -> {:error, :rate_limited}
end

# Project-level rate limiting
case Hammer.check_rate("project:#{project_id}", 60_000, 1000) do
  {:allow, _count} -> :ok
  {:deny, _limit} -> {:error, :rate_limited}
end

ETS handles the sliding window counters. No Redis. No Lua scripts. No distributed clock synchronization issues for single-node deployment.
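To show the shape of the state involved, here is a stripped-down fixed-window counter built directly on ETS. This is deliberately not Hammer's implementation (Hammer adds sliding windows, pluggable backends, and expiry); it just illustrates why no Redis is needed for a single node:

```elixir
defmodule MiniLimiter do
  @table :mini_limiter

  def ensure_table do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :public])
    end

    :ok
  end

  # One atomic ETS counter per {key, window}; no locks, no network hop
  def check_rate(key, window_ms, limit) do
    window = div(System.system_time(:millisecond), window_ms)
    count = :ets.update_counter(@table, {key, window}, 1, {{key, window}, 0})

    if count <= limit, do: {:allow, count}, else: {:deny, limit}
  end
end

MiniLimiter.ensure_table()
MiniLimiter.check_rate("ip:1.2.3.4", 60_000, 2) # → {:allow, 1}
MiniLimiter.check_rate("ip:1.2.3.4", 60_000, 2) # → {:allow, 2}
MiniLimiter.check_rate("ip:1.2.3.4", 60_000, 2) # → {:deny, 2}
```

`:ets.update_counter/4` is atomic, so concurrent requests from many BEAM processes never double-count.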


The stack comparison: concrete trade-offs

Python + FastAPI

Strengths: Fastest path to a working prototype. The AI/ML ecosystem is Python-native — LangChain, LlamaIndex, Hugging Face, every provider SDK. If your team already knows Python, the learning curve is minimal.

Weaknesses for AI gateways:

  • Concurrency is bolted on. asyncio works but requires discipline. One blocking call — a synchronous database query, a forgotten await, a library that doesn't support async — stalls the event loop. In an AI gateway making hundreds of concurrent provider calls, this is a constant source of bugs.
  • No supervision. If a background task crashes, you need to detect it and restart it manually. Celery workers that die mid-job need external monitoring (Flower, supervisor processes) to restart.
  • GIL limitations. CPU-bound work (HMAC computation, JSON parsing, cache key hashing) is limited by the Global Interpreter Lock. You can work around it with multiprocessing, but that adds complexity.
  • Separate infrastructure. You need Redis for Celery, Redis for caching (or Memcached), and a separate WebSocket server (or a library like socket.io with its own event loop). That's three additional components.

TypeScript + Node (Express/Fastify/Hono)

Strengths: Excellent TypeScript ecosystem. Good async I/O model. Large talent pool. Vercel/Cloudflare Workers make deployment trivial for simpler use cases.

Weaknesses for AI gateways:

  • Single-threaded CPU. JSON parsing a 50KB LLM response, computing SHA-256 cache keys, generating HMAC signatures — all of this blocks the event loop. worker_threads help but add complexity.
  • No built-in job queue. You need BullMQ + Redis, or an external service like SQS. Every additional dependency is another failure point.
  • WebSocket management. Libraries like ws or socket.io work but don't provide Phoenix-level abstractions. You manually manage connections, rooms, authentication, heartbeats, and reconnection. Phoenix Channels handle all of this declaratively.
  • No supervision trees. If a background process crashes, you catch it with process.on('uncaughtException') and hope your PM2 or Docker restarts it quickly enough.

Go

Strengths: Excellent performance. Great concurrency model with goroutines. Strong standard library. Compiles to a single binary. Perfect for high-throughput proxies.

Weaknesses for AI gateways:

  • Verbose for complex domain logic. AI gateway business logic — workflow loading, structured output merging, prompt caching with two-tier fallback, multi-step failover with attempt logging — produces a lot of code in Go. Error handling alone (if err != nil) can double the line count.
  • No built-in application framework. You assemble everything from libraries: router, middleware, database layer, migrations, job queue, WebSocket server, rate limiter, encryption. Each has its own conventions. The cognitive overhead compounds.
  • No OTP equivalent. Go has goroutines but no supervision trees. You write your own process management, health checking, and graceful shutdown. This is fine for simple services; it's significant work for a system with 15+ long-running supervised components.
  • No built-in hot code reloading. In development, you restart the server for every change. Trivial for microservices, tedious for monolithic applications with complex startup sequences.

Elixir + Phoenix

Strengths:

  • Concurrency is the default. Every request is isolated. No async/await ceremony. No event loop blocking. No GIL.
  • OTP supervision trees. Self-healing process management refined for 30+ years.
  • ETS. In-process caching without Redis.
  • Oban. PostgreSQL-backed job queue without a separate broker service.
  • Phoenix Channels. Real-time WebSocket delivery with presence tracking, built in.
  • Cloak. Field-level AES-256-GCM encryption for secrets.
  • Behaviours. Compile-time polymorphism for provider adapters.
  • Pattern matching. Cleanest error handling of any mainstream language: destructure success/failure tuples at every call site.
  • Telemetry. Built-in instrumentation framework for metrics and observability.
  • Hot code reloading in development. Instant feedback loop.

Weaknesses:

  • Smaller ecosystem. Fewer libraries than Python/Node/Go. No provider SDKs — you write HTTP adapters yourself (which, honestly, is fine for API gateways; you want control over the request).
  • Smaller talent pool. Hiring is harder. Finding Elixir developers takes more effort than finding Python or Go developers.
  • Learning curve. Functional programming, pattern matching, and OTP concepts take time to learn if your team comes from OOP backgrounds.
  • Deployment. Elixir releases are well-documented but less "push-button" than Docker + Node or Go's single binary. Mix releases (plus packaging tools like Burrito; Distillery is now legacy) add a learning step.

What this looks like in practice: the execution pipeline

To make this concrete, here's the simplified execution flow for a single AI request in the production system:

TEXT
HTTP Request arrives at Phoenix Endpoint
→ Plug pipeline: CORS → API key auth → project rate limit → IP rate limit
→ Controller extracts workflow name and params
→ Execution module:
   1. Load workflow from DB (via Ash framework)
   2. Check if test mode → return sample data
   3. Prepare structured output schema if configured
   4. Merge system instructions into messages
   5. Load provider credentials (Cloak-decrypted from DB)
   6. Content moderation check
   7. Prompt cache lookup (ETS → DB fallback)
      → Cache hit: return cached response, log as cache_hit
      → Cache miss: continue to provider call
   8. Call primary provider
      → Success: log, cache, return
      → Failure: try backup_1 provider
      → Failure: try backup_2 provider
      → All failed: try offline fallback (structured output defaults)
   9. Log request with full metadata (provider, model, tokens, latency, attempts)
   10. Increment organization request counter (async Task)
   11. Report usage to Stripe meter if overage (async Task)
   12. Return response

For async requests, the flow is:

TEXT
HTTP Request arrives
→ Validate and authenticate
→ Enqueue Oban job on :ai_requests queue
→ Return channel_id immediately (HTTP 202)
→ [Background]
   → Oban worker picks up job
   → Same execution pipeline as above
   → On completion: broadcast via Phoenix Channel
   → Deliver to webhooks via :webhooks queue (with HMAC signatures)

Every step in this pipeline runs in its own BEAM process. The controller process, the Oban worker process, the PubSub broadcast process, the webhook delivery process — all isolated, all supervised, all concurrent.
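In code, a pipeline like that reduces mostly to one `with` chain over tagged tuples. A stubbed, self-contained sketch (the step functions are stand-ins for the real database and provider calls) in which the primary provider fails and the chain falls through to the backup:

```elixir
defmodule Gateway do
  # Stubs standing in for the real pipeline steps
  defp load_workflow(name), do: {:ok, %{name: name}}
  defp moderate(_params), do: :ok
  defp cache_lookup(_key), do: :miss
  defp call_provider(:primary, _params), do: {:error, :timeout}
  defp call_provider(:backup, _params), do: {:ok, %{text: "hello"}}

  # Walk the failover chain until one provider succeeds
  defp call_with_failover(params) do
    Enum.find_value([:primary, :backup], {:error, :all_failed}, fn provider ->
      case call_provider(provider, params) do
        {:ok, resp} -> {:ok, resp}
        {:error, _reason} -> nil
      end
    end)
  end

  def execute(workflow_name, params) do
    with {:ok, _workflow} <- load_workflow(workflow_name),
         :ok <- moderate(params),
         :miss <- cache_lookup({workflow_name, params}),
         {:ok, response} <- call_with_failover(params) do
      {:ok, response}
    end
  end
end

Gateway.execute("summarize", %{prompt: "hi"})
# → {:ok, %{text: "hello"}} (failed over from :primary to :backup)
```

Any step returning an error tuple short-circuits the `with` and falls out with that value; no nested try/catch, no exceptions for control flow.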


Telemetry: observability built into the foundation

Phoenix ships with Telemetry — a lightweight instrumentation library that's integrated into every layer of the stack. Database queries, HTTP requests, channel operations, and custom application events all emit Telemetry events.

ELIXIR
def metrics do
  [
    # Phoenix endpoint and router metrics
    summary("phoenix.endpoint.stop.duration", unit: {:native, :millisecond}),
    summary("phoenix.router_dispatch.stop.duration", tags: [:route]),

    # Database metrics
    summary("myapp.repo.query.total_time", unit: {:native, :millisecond}),
    summary("myapp.repo.query.queue_time", unit: {:native, :millisecond}),

    # AI gateway-specific metrics
    summary("myapp.openai_compat.request.duration", tags: [:model, :status]),
    summary("myapp.openai_compat.stream.duration_ms", tags: [:model, :status]),
    counter("myapp.openai_compat.error.count", tags: [:error_type]),

    # VM metrics — free, because BEAM exposes them
    summary("vm.memory.total", unit: {:byte, :kilobyte}),
    summary("vm.total_run_queue_lengths.total"),
    summary("vm.total_run_queue_lengths.cpu"),
    summary("vm.total_run_queue_lengths.io")
  ]
end

No external APM agent required. No monkey-patching. The BEAM VM itself reports memory usage, run queue lengths, and I/O metrics. You plug in any reporter — Prometheus, Datadog, custom dashboards — and the metrics flow automatically.
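Emitting a custom event from application code is a single call. A sketch (assumes the `:telemetry` package, which ships with every Phoenix app; the event name matches the metric definitions above, and `call_provider/1` is a stand-in):

```elixir
# Sketch: wrap a provider call and emit its duration plus metadata.
start = System.monotonic_time()

result = call_provider(request)

:telemetry.execute(
  [:myapp, :openai_compat, :request],
  %{duration: System.monotonic_time() - start},
  %{model: request.model, status: elem(result, 0)}
)
```

Any attached reporter (Prometheus, Datadog, a console logger) receives the event synchronously; if none is attached, the call is nearly free.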


The infrastructure simplification

Here's the final picture — the full infrastructure footprint for a production AI gateway:

With Python/Node/Go:

  • Application server
  • PostgreSQL
  • Redis (caching + job queue broker)
  • Message queue (RabbitMQ/SQS for reliable job processing)
  • WebSocket server (separate process or library)
  • Cron scheduler (cron jobs or CloudWatch Events)
  • Rate limiter state (Redis-backed)
  • Encryption service or vault

With Elixir + Phoenix:

  • Application server (includes WebSocket server, job processor, cron scheduler, rate limiter, encryption vault, in-memory cache)
  • PostgreSQL (for data + Oban job queue)

That's it. Two components. Every other capability lives inside the BEAM runtime as a supervised process. When you deploy, you deploy one thing. When you debug, you look at one set of logs. When something crashes, the supervision tree handles it before you wake up.


When not to use Elixir

I'd be dishonest if I didn't mention the cases where Elixir is the wrong choice:

  • If your team has zero functional programming experience and you're on a 2-week deadline, the learning curve will cost you more than the architectural benefits save.
  • If you need deep ML/AI library integration (model training, embedding generation, vectorDB clients), the Python ecosystem is years ahead.
  • If you're building only a thin proxy with no business logic — just forwarding requests — Go or even Cloudflare Workers will be simpler and faster.
  • If hiring is your bottleneck, the Elixir talent pool is measurably smaller than Python, Node, or Go.

The sweet spot for Elixir is exactly the AI gateway use case: a system that's I/O-heavy, concurrent, requires fault tolerance, manages many long-lived connections, processes background jobs, and needs to stay up reliably without a complex infrastructure footprint.


Conclusion

At ModelRiver, the entire backend — API gateway, provider routing with multi-level failover, prompt caching, async job processing, WebSocket delivery, CLI tooling, webhook delivery, encrypted secrets management, metered billing, rate limiting, content moderation, telemetry, and a full observability layer — runs on Elixir + Phoenix with a single PostgreSQL database.

The stack doesn't get talked about enough in AI circles because the AI ecosystem is overwhelmingly Python-centric. But if you're building the infrastructure layer — the routing, orchestration, and reliability layer that sits between application code and LLM providers — the BEAM VM gives you primitives that other runtimes either lack or require bolting on through third-party services.

The best stack for building AI applications is Python. The best stack for building AI infrastructure might be Elixir.

If you want to see what this looks like in practice, ModelRiver's documentation walks through the system, or you can point your base_url to https://api.modelriver.com/v1 and make your first request.