Why Vector Databases Are the Missing Piece in Your Self-Hosted AI Stack
You've probably heard the buzz about running your own AI. Maybe you've already got Ollama humming along on a spare machine, serving up Llama or Mistral models. You ask it questions, it gives you answers. Magic.
But here's the thing: those local LLMs have a fatal flaw. They don't know your stuff. Your company docs, your knowledge base, your product specs. They're smart in general, dumb about your specifics.
That's where vector databases come in. And if you're serious about self-hosted AI, you're going to need one.
The Problem With "Just Use an LLM"
Let's say you want an AI assistant that knows your internal documentation. The naive approach is to shove everything into the prompt. But LLMs have context limits. Most local models tap out around 128K tokens, and even within that window, huge prompts are slow, memory-hungry, and prone to losing details buried in the middle. Your documentation probably exceeds the limit anyway.
The smarter approach is called Retrieval-Augmented Generation (RAG). Instead of feeding the entire knowledge base to the LLM, you search for relevant chunks first, then include only those in your prompt.
But how do you search? Keywords don't cut it. Someone asking "how do I reset my password" should match documentation about "account recovery" and "credential management" even if those exact words aren't used.
You need semantic search. And semantic search needs vectors.
What Vector Databases Actually Do
A vector database stores embeddings, which are numerical representations of text (or images, or audio). Similar concepts end up as similar numbers. When someone asks a question, you convert it to an embedding, then find the closest matches in your database.
It's surprisingly simple in concept. The complexity is in doing it fast at scale. That's what vector databases optimize for.
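To make that concrete, here's a toy sketch of the core idea, ranking documents by embedding similarity. The three-dimensional vectors and document titles are made up purely for illustration; real embedding models produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Score how similar two embeddings are; closer to 1.0 means more similar."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings -- a real model outputs hundreds of dimensions, not three.
docs = {
    "account recovery guide":  [0.90, 0.15, 0.05],
    "credential management":   [0.85, 0.25, 0.10],
    "office pizza party menu": [0.05, 0.10, 0.95],
}
query = [0.88, 0.20, 0.07]  # embedding of "how do I reset my password"

# A vector database does exactly this ranking, just fast and at scale.
for title, vec in sorted(docs.items(), key=lambda kv: -cosine_similarity(query, kv[1])):
    print(f"{cosine_similarity(query, vec):.2f}  {title}")
```

The password question lands next to the account recovery and credential docs, and nowhere near the pizza menu, which is exactly the behavior keyword search can't give you.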
For self-hosters, this is the missing link. You've got Ollama for inference. You've got your documents. But without vector storage, you can't connect them efficiently.
Weaviate vs Qdrant vs Milvus: The Self-Hosted Showdown
Three vector databases dominate the self-hosted space. Here's what you need to know:
| Feature | Weaviate | Qdrant | Milvus |
|---|---|---|---|
| Language | Go | Rust | Go/C++ |
| Built-in Embeddings | Yes | No | No |
| RAM Requirement | Medium | Low | High |
| Learning Curve | Gentle | Gentle | Steep |
| Best For | Beginners, all-in-one | Performance, simplicity | Enterprise scale |
Weaviate is the friendliest option. It can generate embeddings internally using built-in modules, meaning you don't need a separate embedding service. GraphQL API, good documentation, batteries included. If you're just starting with vector search, this is your best bet.
Qdrant is lean and fast. Written in Rust, it's efficient with memory and surprisingly capable. No built-in embeddings, so you'll need something like Ollama or a dedicated embedding model. But if you want raw performance without the overhead, Qdrant delivers.
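As a rough sketch of how little ceremony Qdrant needs, here's the flow with the qdrant-client Python package. The collection name, the vector size (768, matching a common embedding model), and the dummy vectors are placeholders for embeddings you'd generate yourself.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # Qdrant's default port

# A collection sized to match whatever embedding model you pair it with.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Qdrant stores vectors you bring; it doesn't generate embeddings itself.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1] * 768, payload={"text": "placeholder chunk"})],
)

hits = client.search(collection_name="docs", query_vector=[0.1] * 768, limit=5)
for hit in hits:
    print(hit.score, hit.payload["text"])
```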
Milvus is the heavyweight. Designed for billion-scale datasets, it's what you reach for when the others can't keep up. But that power comes with complexity. More moving parts, more configuration, more things that can break.
For most self-hosters, I'd recommend starting with Weaviate or Qdrant. You can always migrate later if you outgrow them (spoiler: you probably won't).
A Practical RAG Setup
Here's what a working self-hosted RAG stack looks like (an ingestion sketch follows the list):
- Vector Database (Weaviate or Qdrant) stores your document embeddings
- Embedding Model (via Ollama or built into Weaviate) converts text to vectors
- LLM (Ollama with Llama/Mistral/Qwen) generates answers using retrieved context
- Ingestion Pipeline (Python script, LangChain, or similar) processes your documents
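Here's a minimal sketch of that ingestion step, assuming Ollama is serving an embedding model (nomic-embed-text here, which outputs 768-dimensional vectors), Qdrant is running locally, and the `docs` collection from the earlier Qdrant sketch already exists. The model names, URLs, and sample chunks are assumptions you'd swap for your own.

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"  # assumed embedding model pulled into Ollama

def embed(text: str) -> list[float]:
    """Turn text into a vector via Ollama's embeddings endpoint."""
    resp = requests.post(f"{OLLAMA_URL}/api/embeddings",
                         json={"model": EMBED_MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

# Chunks would normally come from your real documents; these are placeholders.
chunks = [
    {"text": "To reset your password, open Settings > Security...", "source": "handbook.md"},
    {"text": "Account recovery requires a verified email address...", "source": "faq.md"},
]

client = QdrantClient(url="http://localhost:6333")

# Store each chunk's embedding plus metadata (source) so answers can be traced back.
client.upsert(
    collection_name="docs",  # the collection created in the earlier Qdrant sketch
    points=[
        PointStruct(id=i, vector=embed(c["text"]), payload=c)
        for i, c in enumerate(chunks)
    ],
)
```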
When someone asks a question (see the query sketch after these steps):
- Question gets converted to a vector
- Vector database finds the 5-10 most relevant document chunks
- Those chunks plus the question go to your LLM
- LLM generates an answer grounded in your actual data
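Strung together, the query side looks roughly like this, continuing the earlier sketches: Qdrant at its default port, Ollama serving both the embedding model and a chat model such as llama3. All names are assumptions, not a prescribed setup.

```python
import requests
from qdrant_client import QdrantClient

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"   # must match the model used at ingestion time
CHAT_MODEL = "llama3"              # any chat model pulled into Ollama

client = QdrantClient(url="http://localhost:6333")

def answer(question: str) -> str:
    # 1. Convert the question into a vector with the same embedding model.
    vec = requests.post(f"{OLLAMA_URL}/api/embeddings",
                        json={"model": EMBED_MODEL, "prompt": question}).json()["embedding"]

    # 2. Retrieve the most relevant chunks from the vector database.
    hits = client.search(collection_name="docs", query_vector=vec, limit=5)
    context = "\n\n".join(hit.payload["text"] for hit in hits)

    # 3-4. Send the chunks plus the question to the LLM and return its answer.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    return resp.json()["response"]

print(answer("How do I reset my password?"))
```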
The beauty is that everything stays on your infrastructure. No API calls to OpenAI. No data leaving your network. No per-token billing surprises.
Getting Started on Elestio
All three vector databases are available on Elestio with one-click deployment: Weaviate, Qdrant, and Milvus. Pair them with Ollama for your LLM needs and you've got a complete RAG stack. Automated backups, updates, SSL, and monitoring all included.
A typical setup runs around $30-50/month for a capable RAG stack. Compare that to cloud AI APIs where costs can spiral unpredictably with usage, and the economics become clear.
Common Pitfalls
Chunk size matters. Too small and you lose context. Too large and you retrieve irrelevant information. Start with 500-1000 tokens per chunk with 100-token overlap.
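For illustration, here's a naive chunker along those lines. It counts words as a rough stand-in for tokens; a real pipeline would usually use a proper tokenizer or a library splitter that respects sentence and section boundaries.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, using words as a rough proxy for tokens."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```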
Embedding model choice affects everything. The embedding model defines what "similar" means, so you must use the same model (and the same version of it) at ingestion time and at query time; mixing them produces garbage results.
Don't skip metadata. Store source URLs, timestamps, and document sections alongside your vectors. You'll thank yourself when debugging why the AI cited the wrong document.
Monitor your retrieval quality. If users complain the AI doesn't know things it should, check whether the right chunks are being retrieved before blaming the LLM.
The Bottom Line
Vector databases aren't optional anymore for serious self-hosted AI. They're the bridge between your knowledge and your language models.
If you're already running Ollama, adding Weaviate or Qdrant is the logical next step. Your AI assistant will go from "generally smart" to "actually useful for your specific needs."
And isn't that the whole point?
Thanks for reading. See you in the next one.