Language models are commoditised. In production, answer quality is mostly a function of whether you handed the model the right context. We spend accordingly.
What "retrieval" actually means
It's tempting to think of retrieval as one step: embed the query, fetch the top-k chunks, hand them to the model. In practice it's at least four:
- Chunking — how you cut the source matters more than the embedding model.
- Hybrid search — dense vectors miss exact-token matches; BM25 misses semantic ones. Use both, then merge; Reciprocal Rank Fusion (Cormack et al., 2009) is a good default, sketched just after this list.
- Reranking — a cross-encoder over the top 50 beats a bigger embedding model over the top 5, almost every time. Cohere Rerank and open-source models like the BGE reranker are both production-grade starting points; see the second sketch after this list.
- Citation fidelity — the chunk that "won" needs to be inspectable by a human in one click. Otherwise trust collapses.
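To make the merge step concrete: Reciprocal Rank Fusion is small enough to show in full. The sketch below is a minimal, stack-agnostic illustration; the document IDs are hypothetical, and k = 60 is the constant from Cormack et al. (2009).

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across the lists that
    contain it; k = 60 is the constant from the original paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical inputs: IDs ranked by the dense retriever and by BM25.
dense_ranked = ["d3", "d1", "d7", "d2"]
sparse_ranked = ["d7", "d9", "d3", "d4"]
merged = rrf_merge([dense_ranked, sparse_ranked])
# d3 and d7 surface first because both retrievers agree on them.
```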
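The reranking pass over that merged list is similarly short with an off-the-shelf cross-encoder. This sketch assumes the sentence-transformers library and the open BGE reranker mentioned above; `candidates`, a mapping from document ID to chunk text, is a hypothetical stand-in for your chunk store.

```python
from sentence_transformers import CrossEncoder

# Open BGE reranker; Cohere Rerank slots in the same way via its API.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: dict[str, str], top_n: int = 5) -> list[str]:
    """Score (query, chunk) pairs with the cross-encoder, keep the best top_n."""
    doc_ids = list(candidates)
    scores = reranker.predict([(query, candidates[d]) for d in doc_ids])
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]
```

Scoring ~50 candidates with a cross-encoder is the expensive step, which is why it runs after the cheap hybrid merge rather than over the whole corpus.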
If your RAG system can't show the user the exact passage it used, you don't have a RAG system — you have a chatbot with extra steps.
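One way to keep that promise is to carry source coordinates through chunking instead of throwing them away. The sketch below is illustrative only; the field names and the text-fragment link scheme are assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from urllib.parse import quote

@dataclass(frozen=True)
class Chunk:
    doc_id: str      # stable ID of the source document
    source_url: str  # where a human can read the original
    start: int       # character offset of the chunk within the source
    end: int         # exclusive end offset
    text: str

    def citation_link(self) -> str:
        """Deep-link a reader to the passage in one click.

        A URL text fragment is one possible scheme; an internal viewer
        that highlights source[start:end] works just as well.
        """
        return f"{self.source_url}#:~:text={quote(self.text[:80])}"
```

The offsets matter more than the link format: as long as start and end survive the pipeline, any UI can reconstruct the exact highlighted passage.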
The trap of "good enough" evals
A common pattern: launch with hand-picked examples, watch demos go well, declare victory. Six weeks later, accuracy on real traffic is half of what the demo suggested. The fix isn't a better model. It's an evaluation harness with:
- A frozen test set drawn from real user queries, not synthetic ones.
- Ground-truth answers written by domain experts, not the team that built the system.
- Regression checks that run on every change — including prompt edits, not just code. A minimal harness sketch follows this list.
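The harness needn't be elaborate to start. The sketch below assumes a frozen JSONL file of real queries with expert-written answers and a hypothetical answer() entry point wrapping the system under test; the pass criterion (expected substring appears in the answer) is deliberately crude and is the first thing to swap for a proper grader.

```python
import json
from pathlib import Path

def answer(query: str) -> str:
    """Hypothetical entry point for the system under test."""
    raise NotImplementedError

def run_regression(test_set: Path, min_accuracy: float = 0.8) -> None:
    """Replay the frozen test set and fail loudly on regression.

    Each JSONL record is a real user query with an expert-written
    ground truth, e.g. {"query": "...", "expected": "..."}.
    """
    cases = [json.loads(line) for line in test_set.read_text().splitlines() if line]
    passed = sum(
        1 for case in cases
        if case["expected"].lower() in answer(case["query"]).lower()
    )
    accuracy = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({accuracy:.0%})")
    assert accuracy >= min_accuracy, "regression: accuracy below threshold"

# Wire this into CI so it runs on every change, prompt edits included.
```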
What we won't do
We won't ship a retrieval system without freshness handling for time-sensitive sources. We won't ship one that confidently answers when it shouldn't. And we won't ship one whose citations point to the wrong place — that's worse than no answer at all.
The cheap part of an AI system is asking the model. The expensive part is making sure it has the right thing to read.
