Self-Host LiteLLM: One API Gateway for Every LLM

Michael Soto

04 Jul 2026

Here's a mess I got myself into last year. One service was calling OpenAI. Another used Anthropic. A batch job talked to a local Ollama box to save money. Each had its own SDK, its own auth, its own quirks. API keys were scattered across three .env files and a Kubernetes secret I'd forgotten about. And when finance asked "what are we actually spending on this AI stuff?", the honest answer was: I have no idea.

That's the problem LiteLLM solves. It's one gateway that sits in front of every model you use, speaks the OpenAI API format to all of them, and gives you a single place to manage keys, budgets, and fallbacks. Let me show you how to run your own.

What LiteLLM actually is

LiteLLM is an AI gateway (a proxy server) that exposes one OpenAI-compatible endpoint and routes your requests to 100+ providers behind it: OpenAI, Anthropic, Azure, Bedrock, Vertex AI, a local Ollama or vLLM instance, whatever. Your application only ever knows about one URL and one key. Swapping GPT-5.5 for Claude, or for a local Llama model, becomes a one-line config change instead of a code change.

On top of the routing, you get the things you actually need in production: virtual keys with per-key budgets, spend tracking per user and team, rate limits, and automatic fallbacks when a provider has a bad day.

Getting it running on Elestio

You can run LiteLLM from a Docker Compose file yourself, but the proxy needs a PostgreSQL database for virtual keys, spend tracking, and the admin UI, which means you're now running and backing up a database too. Deploying the managed LiteLLM on Elestio hands you the proxy plus its Postgres, with SSL, backups, and updates handled, starting around $16/month for the VM.

Once it's up, the part you care about is the config. LiteLLM is driven by a config.yaml that lists your models:

model_list:
  - model_name: gpt-5.5
    litellm_params:
      model: openai/gpt-5.5
      api_key: os.environ/OPENAI_API_KEY  # read from the proxy's environment
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama:11434  # local Ollama instance

litellm_settings:
  # If gpt-5.5 errors, try claude-sonnet, then local-llama.
  fallbacks: [{"gpt-5.5": ["claude-sonnet", "local-llama"]}]

general_settings:
  # Admin key; keep it out of application code.
  master_key: os.environ/LITELLM_MASTER_KEY

The model_name in each entry is the alias your apps call. The litellm_params nested under it say where the request actually goes. That indirection is the whole trick: your code says gpt-5.5, and you decide behind the scenes whether that's really OpenAI, Azure, or something else entirely.
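To make the indirection concrete, here's a minimal Python sketch of the alias lookup. It only illustrates the idea and is not LiteLLM's internal code: MODEL_LIST is a hand-written mirror of the config.yaml above, and resolve_alias is a hypothetical helper name.

```python
# Hand-written mirror of the model_list section of config.yaml above.
# Illustration only: the real lookup happens inside the LiteLLM proxy.
MODEL_LIST = [
    {"model_name": "gpt-5.5",
     "litellm_params": {"model": "openai/gpt-5.5"}},
    {"model_name": "claude-sonnet",
     "litellm_params": {"model": "anthropic/claude-sonnet-5"}},
    {"model_name": "local-llama",
     "litellm_params": {"model": "ollama/llama3.1",
                        "api_base": "http://ollama:11434"}},
]


def resolve_alias(alias: str) -> str:
    """Return the provider/model string an alias points to (hypothetical helper)."""
    for entry in MODEL_LIST:
        if entry["model_name"] == alias:
            return entry["litellm_params"]["model"]
    raise KeyError(f"no model_name {alias!r} in model_list")
```

Repointing gpt-5.5 at a different provider means changing one litellm_params value; the alias your callers use stays the same.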

Talking to your gateway

Because it speaks OpenAI's dialect, any OpenAI client works. Point the base URL at your gateway and use one of your keys:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-litellm.vm.elestio.app",
    api_key="sk-your-virtual-key",
)

resp = client.chat.completions.create(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "Summarize this in one line."}],
)
print(resp.choices[0].message.content)

Same call, model="local-llama", and you're hitting your own hardware instead. No new SDK, no rewrite.
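Here's a small sketch of that "one client, many models" idea. The helper names ask and ask_all are my own, and client is assumed to be an openai.OpenAI instance pointed at the gateway exactly as in the snippet above.

```python
def ask(client, model: str, prompt: str) -> str:
    """Send one user message to the given alias and return the reply text.

    `client` is assumed to be an openai.OpenAI instance configured with the
    gateway's base_url and a virtual key. `ask` is a hypothetical helper.
    """
    resp = client.chat.completions.create(
        model=model,  # a model_name alias from config.yaml
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_all(client, models: list[str], prompt: str) -> dict[str, str]:
    """Ask the same question of several aliases through the one client."""
    return {model: ask(client, model, prompt) for model in models}
```

Calling ask_all(client, ["claude-sonnet", "local-llama"], "Summarize this in one line.") sends the same prompt to a hosted model and to your own hardware, with the alias string as the only thing that differs.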

Virtual keys are the real reason to do this

The master key is your admin key and should never touch application code. Instead, mint scoped virtual keys, each with its own model access and budget:

curl https://your-litellm.vm.elestio.app/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gpt-5.5", "claude-sonnet"], "max_budget": 50, "duration": "30d"}'

That returns a key that can only reach those two models, stops working once it's burned through $50, and expires after 30 days. Hand one to each project, each teammate, or each agent. When something starts spending suspiciously, you revoke that one key instead of rotating a provider key that fifteen things depend on. The admin UI shows the spend per key so the finance question finally has an answer.
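If you'd rather mint keys from a script than from curl, the same request can be sketched in Python with only the standard library. build_key_request is a hypothetical helper name; the endpoint, headers, and body fields are copied from the curl call above, and nothing is sent until you pass the request to urllib.request.urlopen.

```python
import json
import urllib.request


def build_key_request(base_url, master_key, models, max_budget, duration):
    """Build (but do not send) the POST /key/generate request shown above."""
    body = {"models": models, "max_budget": max_budget, "duration": duration}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The master key authorizes key creation; keep it out of app code.
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

In a real script I'd read the master key from os.environ["LITELLM_MASTER_KEY"] rather than hard-coding it, then send the request and pull the new key out of the JSON response.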

Without a gateway	With LiteLLM
A different SDK per provider	One OpenAI-compatible client everywhere
Provider keys copied into every app	Scoped virtual keys with budgets
No idea what you're spending	Per-key, per-team spend in the UI
Provider outage takes you down	Automatic fallback to another model

The honest trade-offs

A gateway is one more hop and one more service to keep alive. Every request now depends on the proxy and its database being healthy, so back that Postgres up (on Elestio, that's automatic). And be aware of supply-chain hygiene: in March 2026 two LiteLLM releases (1.82.7 and 1.82.8) were briefly compromised before being pulled and cleaned up in 1.83.0. The images on GHCR are signed with cosign, so pin to a known-good version and verify signatures rather than blindly tracking latest in production.
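One cheap guard is a check in your deploy script that refuses floating tags and the two bad releases. This is a rough sketch under assumptions: tag_problem is a hypothetical helper, and the tag strings it expects (an x.y.z version somewhere in the tag) are my guess at a typical format, so adjust the pattern to whatever tags you actually pull. It does not verify signatures; that is still a separate cosign step.

```python
import re
from typing import Optional

# Releases this article names as compromised in March 2026.
COMPROMISED = {"1.82.7", "1.82.8"}

# First x.y.z version number found anywhere in the tag.
_VERSION = re.compile(r"(\d+\.\d+\.\d+)")


def tag_problem(tag: str) -> Optional[str]:
    """Return a reason string if the image tag looks unsafe, else None.

    Hypothetical helper; the accepted tag format is an assumption.
    """
    if "latest" in tag:
        return "floating tag, pin an explicit version"
    match = _VERSION.search(tag)
    if match is None:
        return "no explicit x.y.z version in tag"
    if match.group(1) in COMPROMISED:
        return f"version {match.group(1)} was compromised"
    return None
```

Failing the deploy when tag_problem returns a reason is stricter than a warning, which is the point: nobody reads warnings at 2 a.m.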

Troubleshooting

401 on every call: the key your app sends is mistyped, expired, or revoked, or it's a virtual key that doesn't include the model you requested. Check the Authorization header, then generate a key scoped to that model.
"model not found": the model in your request must match a model_name alias in config.yaml, not the provider's raw model string.
Fallbacks never fire: they trigger on errors, not on slowness. Make sure the fallback target is a real entry in your model_list.
UI shows no spend: the proxy can't reach Postgres. Check your DATABASE_URL and that the database container is actually up.
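Two of those failures ("model not found" and fallbacks that never fire) come down to names that don't match, and that's easy to check before you deploy. Here's a minimal sketch, assuming you've loaded config.yaml into a plain dict and that you don't use wildcard model names; CONFIG, check_config, and check_request_model are hypothetical names for illustration.

```python
# Trimmed, hand-written mirror of the config.yaml from earlier in this post.
CONFIG = {
    "model_list": [
        {"model_name": "gpt-5.5"},
        {"model_name": "claude-sonnet"},
        {"model_name": "local-llama"},
    ],
    "litellm_settings": {
        "fallbacks": [{"gpt-5.5": ["claude-sonnet", "local-llama"]}],
    },
}


def check_config(config: dict) -> list[str]:
    """Return a list of fallback entries that don't match any model_name."""
    aliases = {m["model_name"] for m in config.get("model_list", [])}
    problems = []
    for mapping in config.get("litellm_settings", {}).get("fallbacks", []):
        for source, targets in mapping.items():
            if source not in aliases:
                problems.append(f"fallback source {source!r} is not a model_name")
            for target in targets:
                if target not in aliases:
                    problems.append(f"fallback target {target!r} is not a model_name")
    return problems


def check_request_model(config: dict, requested: str) -> bool:
    """True if the requested model is an alias rather than a raw provider string."""
    return any(m["model_name"] == requested for m in config.get("model_list", []))
```

An empty list from check_config(CONFIG) means every fallback points at a real entry; check_request_model(CONFIG, "anthropic/claude-sonnet-5") returning False is the "model not found" case in miniature.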

Put a gateway in front of your models once and every future integration gets cheaper: new app, new agent, new provider, they all just point at the same endpoint. You can spin up a managed instance on Elestio and have your whole model fleet behind one door in a few minutes.

Thanks for reading ❤️ See you in the next one 👋