Feb 15, 2026
RAG Quality Is Chunking Quality
Before you tune prompts or swap embedding models, look at how you split documents. It is probably the whole problem.
Every RAG debugging session I have run for TypeFlow AI ends in the same place: the retrieval was fine, the model was fine, the chunks were garbage.
The failure mode
Fixed-size token windows cut arguments in half. The embedding of half an argument points somewhere useless in vector space, so retrieval surfaces a chunk that is lexically related and semantically broken. The model then does what models do with broken context: it improvises.
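To make the failure concrete, here is a minimal sketch of a fixed-size splitter. It assumes whitespace-separated words as a stand-in for real tokenizer tokens, and the function name and sample text are made up for illustration.

```python
def fixed_size_chunks(text: str, window: int) -> list[str]:
    """Split text into consecutive windows of `window` whitespace tokens.

    Whitespace tokens are an assumption; a real pipeline would count
    tokens with the embedding model's tokenizer.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), window)]


# Hypothetical two-sentence argument: a claim followed by its caveat.
doc = (
    "Caching helps read-heavy workloads. "
    "However, it hurts when data changes often, "
    "because stale entries must be invalidated."
)

chunks = fixed_size_chunks(doc, window=8)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The first window ends at "it hurts when", so the caveat is cut off from its condition and the last chunk is a single orphaned word. Each piece still embeds to something, just not to the argument the author made.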
What worked
Splitting on semantic boundaries instead of token counts. Headings, list boundaries, and argument turns become chunk edges. Chunks vary wildly in size and that is fine; a coherent 800-token chunk beats two incoherent 400-token ones every time.
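A rough sketch of boundary-based splitting, under some assumptions of mine: the input is Markdown-like, blank lines separate paragraphs and list blocks, and lines starting with `#` are headings. The `semantic_chunks` name and the `max_tokens` budget are illustrative, not the exact splitter used in production, and detecting argument turns is left out.

```python
import re

# Assumption: headings look like Markdown ATX headings ("# Title").
HEADING = re.compile(r"^#{1,6}\s")


def semantic_chunks(text: str, max_tokens: int = 800) -> list[str]:
    """Group blank-line-separated blocks into chunks.

    A new chunk starts at every heading, or when adding the next block
    would exceed `max_tokens` (counted as whitespace tokens here).
    A block is never cut in the middle, so a list or paragraph stays
    whole even if it is larger than the budget on its own.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        n = len(block.split())
        starts_section = bool(HEADING.match(block))
        if current and (starts_section or size + n > max_tokens):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical document with two sections and one list.
doc = """# Caching

Caching helps read-heavy workloads.

- cheap reads
- fewer database hits

# Invalidation

Stale entries must be invalidated when data changes."""

sections = semantic_chunks(doc)
tight = semantic_chunks(doc, max_tokens=5)
```

With the default budget each heading section becomes one chunk. With a deliberately tiny budget the chunks get smaller, but the edges still fall between blocks, so the list is never split.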
Two more compounding wins:
- Prefilter cheaply, rank expensively. A keyword prefilter ahead of the pgvector similarity search shrank the candidate set and cut latency enough to make autocomplete viable.
- Cite everything. When every answer shows its chunks, users debug your retrieval for you. The worst chunks get reported within days.
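Both wins can be sketched in a few lines. This is an in-memory illustration, not the pgvector setup itself: the `Chunk` type, the hand-made two-dimensional embeddings, and every function name below are assumptions for the example. In a real deployment the ranking step would run in the database.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    id: str                 # hypothetical "document#position" identifier
    text: str
    embedding: list[float]  # toy vectors; real ones come from an embedding model


def keyword_prefilter(chunks: list[Chunk], query: str) -> list[Chunk]:
    """Cheap step: keep chunks sharing at least one word with the query."""
    terms = {t.lower() for t in query.split()}
    return [c for c in chunks if terms & set(c.text.lower().split())]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(chunks: list[Chunk], query: str,
             query_embedding: list[float], k: int = 2) -> list[Chunk]:
    """Expensive step: rank only the prefiltered candidates by similarity."""
    candidates = keyword_prefilter(chunks, query)
    ranked = sorted(candidates,
                    key=lambda c: cosine(c.embedding, query_embedding),
                    reverse=True)
    return ranked[:k]


def answer_with_citations(answer: str, sources: list[Chunk]) -> str:
    """Attach the ids of the chunks the answer was built from."""
    cites = ", ".join(f"[{c.id}]" for c in sources)
    return f"{answer}\n\nSources: {cites}"


corpus = [
    Chunk("doc1#0", "cache invalidation is hard", [1.0, 0.0]),
    Chunk("doc1#1", "cache reads are cheap", [0.6, 0.8]),
    Chunk("doc2#0", "billing runs monthly", [1.0, 0.0]),
]

hits = retrieve(corpus, "cache invalidation", [1.0, 0.0])
reply = answer_with_citations("Invalidate on write.", hits)
```

The billing chunk has the same toy embedding as the query, yet it never reaches the ranking step because the prefilter drops it first. The citation line is what lets a user point at a specific bad chunk instead of reporting a vaguely wrong answer.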
The takeaway
Embedding model choice moved my answer quality a few percent. Chunking strategy moved it more than everything else combined.
Tags: RAG, Vector Search, Supabase