The Architecture Review
Episode 08

Vector Search Internals (HNSW, IVF, PQ)

How HNSW, IVF, and product quantization actually find nearest neighbors at scale.

Video publishes soon

slug: 008-vector-search
number: 8
title: "Vector Search Internals (HNSW, IVF, PQ)"
description: "How HNSW, IVF, and product quantization actually find nearest neighbors at scale."
youtubeId: null
publishedAt: null
anchor:
  authors: "Yu. A. Malkov & D. A. Yashunin"
  year: 2018
  title: "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
  institution: "—"
  venue: "IEEE TPAMI"

The pattern at a glance

Long-form article coming soon. The narration below is the spoken version of this episode — read it as a quick transcript while the written companion is in draft.

Transcript

Your RAG system worked at one million documents. Now you have ten million, and queries take two seconds instead of fifty milliseconds. The hardware is the same. The math just got harder.

Brute-force similarity search is linear in your corpus size. To find the most similar document to a query, you compute the distance between the query vector and every stored vector, then sort. Correct, but that per-query cost is unaffordable at scale. At a billion vectors, your latency budget is gone before you start.
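The brute-force pass fits in a few lines. A toy sketch in plain Python (names are illustrative; a real system would use a vectorized library):

```python
import math

def brute_force_search(query, corpus, k=10):
    """Exact nearest neighbors: score every stored vector, then sort.
    Cost is linear in len(corpus) for every single query."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(corpus)), key=lambda i: dist(query, corpus[i]))
    return ranked[:k]
```

Everything that follows — HNSW, IVF, PQ — exists to avoid running this loop over the whole corpus.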

The trade we make: give up exact answers in exchange for sub-linear search. This is approximate nearest neighbor search.

Approximate nearest neighbor — ANN — returns vectors that are probably the closest, not provably so. You measure quality with recall: of the true top ten neighbors, how many did the index actually return?
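Recall is a one-line measurement. A minimal sketch, assuming you already have ground-truth neighbor ids from an exact search:

```python
def recall_at_k(true_neighbors, returned, k=10):
    """Of the true top-k neighbors, what fraction did the index return?"""
    true_set = set(true_neighbors[:k])
    return len(true_set & set(returned[:k])) / len(true_set)
```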

Ninety-five percent recall is usually enough. Going from ninety-five to ninety-nine percent often costs ten times the latency, ten times the memory, or both. Pick the recall floor your application can tolerate.

Approximate nearest neighbor research goes back decades. Locality-sensitive hashing was published by Piotr Indyk and Rajeev Motwani in 1998.

The current dominant algorithm is HNSW — Hierarchical Navigable Small World graphs — published by Yury Malkov and Dmitry Yashunin in 2018. HNSW outperformed every prior approach on both recall and latency, and is now the default in nearly every vector database.

HNSW builds a layered graph.

The bottom layer contains every vector. Each vector is connected to its M nearest neighbors. Above that, sparser layers with longer-range connections. The top layer has only a handful of vectors with very long edges.

A search starts at the top. Find the closest entry point, descend. At each layer, walk the graph greedily — always move to the neighbor nearest to the query — until no neighbor is closer. Drop to the next layer. Repeat until the bottom.

The result: cover the search space in log time at the upper layers, and do precise local comparison only at the bottom.
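The greedy walk within one layer can be sketched like this. A toy version: real HNSW keeps a candidate beam of size efSearch rather than a single current node, and the graph here is just an adjacency dict:

```python
import math

def greedy_search(graph, vectors, entry, query):
    """Walk one graph layer greedily: hop to the neighbor closest to
    the query until no neighbor improves on the current node."""
    def dist(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[i], query)))
    current = entry
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current  # local minimum: no neighbor is closer
        current = best
```

The result of this walk at one layer becomes the entry point for the layer below.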

Two parameters matter most. M, the number of neighbors per node. And efSearch, how aggressively the search explores. Higher values, better recall, more time and memory.

IVF — inverted file index — takes a different approach. Cluster the vector space into N partitions using k-means. At query time, find the closest C clusters, search only those.

Ten thousand clusters, probe the closest ten — work reduced by a thousand times. Recall depends on how many clusters you probe.
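The probe step looks like this in a sketch. It assumes the k-means centroids and the per-cluster inverted lists have already been built; names are illustrative:

```python
import math

def ivf_search(query, centroids, inverted_lists, vectors, nprobe=10, k=10):
    """Rank clusters by centroid distance, keep the nprobe closest,
    then brute-force only the vectors inside those clusters."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    probed = sorted(range(len(centroids)),
                    key=lambda c: dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in inverted_lists[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]
```

With 10,000 clusters and nprobe=10, the candidate set is roughly a thousandth of the corpus.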

The trade-off: cluster boundaries are arbitrary. A query near a partition edge may have its true nearest neighbor in an unprobed cluster. IVF is predictable and easy to update. HNSW is faster at high recall but harder to update incrementally.

Product quantization — PQ — is not an alternative to HNSW or IVF. It's an orthogonal optimization that reduces memory.

A vector of dimension D is split into M sub-vectors. Each sub-vector is quantized to one of K codewords learned from training data. A vector that originally took thousands of bytes is stored as M small integer codes — often a hundred-fold compression.

Distances are computed in the compressed space using lookup tables. Faster, smaller, slightly less accurate.
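Encoding and the lookup-table distance can be sketched as follows. Toy codebooks are passed in directly; real PQ learns the codewords with k-means per sub-space:

```python
import math

def pq_encode(vector, codebooks):
    """Split the vector into len(codebooks) sub-vectors and store each
    as the index of its nearest codeword: M small integers, not D floats."""
    m = len(codebooks)
    d = len(vector) // m
    codes = []
    for j, book in enumerate(codebooks):
        sub = vector[j * d:(j + 1) * d]
        codes.append(min(range(len(book)),
                         key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(book[c], sub))))
    return codes

def pq_distance(query, codes, codebooks):
    """Asymmetric distance: precompute a query-to-codeword table per
    sub-space, then sum M table lookups per stored vector."""
    m = len(codebooks)
    d = len(query) // m
    total = 0.0
    for j, book in enumerate(codebooks):
        sub = query[j * d:(j + 1) * d]
        table = [sum((a - b) ** 2 for a, b in zip(cw, sub)) for cw in book]
        total += table[codes[j]]
    return math.sqrt(total)
```

The tables are built once per query; after that, scoring each compressed vector is just M additions.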

The standard production stack is IVF plus PQ — partition with IVF, compress with PQ, fit a billion vectors in RAM.

Three traps every vector search system hits.

One: parameter tuning is dataset-dependent. The right M, efSearch, cluster count, PQ codeword count — there are no universal values. Tune against your actual corpus. Defaults are starting points, not answers.

Two: recall measurement is a separate system. You need a ground-truth subset of queries with known correct answers, and to measure recall continuously as the corpus grows. Without this, you can't know whether your index is degrading.

Three: rebuild cost is real. HNSW does not support deletes well. Updates often mean periodic rebuild-from-scratch. For frequently-changing corpora, rebuild cost may dominate query cost.

For most applications, vector database defaults are fine. You don't need to tune HNSW parameters for a thousand-vector corpus.

You start to care at around a million vectors, or when latency is tight enough that the default trade-off doesn't fit. Then the choice is about recall, latency, and memory — not algorithm names.

Vector search is an old field with new urgency. HNSW dominates because it's the best practical balance. IVF and PQ remain important — they show up in combination depending on what you're optimizing.

Knowing which lever to pull saves you orders of magnitude.

Next episode: Agentic architectures, and why ReAct is the pattern most agent frameworks are quietly running.