Ribbon filter

A ribbon filter is a static set-membership structure built by solving a linear system over GF(2). Each item becomes one row of a sparse banded matrix; the bit array stored on disk is the solution that satisfies every row at once.

The interactive below builds a small one in front of you — 48 items, 64 columns, ribbon width 12, single-bit fingerprints. Press play and watch the algorithm run; the structure underneath is the same one production filters use at millions-of-rows scale.


ready
phase init · pivot row · pivot col · xors 0 · pivots 0/48 · failures 0 hover the matrix
Visualization of a 48-row by 64-column ribbon filter coefficient matrix being reduced by banded Gaussian elimination, with a separate fingerprint column on the right.
bit array A
The solved bit array, filled by back-substitution after elimination.
bit = 1 in band, bit = 0 outside band fingerprint pivot column XOR target

How it works

Each item hashes to a triple: a start column s, a 12-bit coefficient pattern c with bit 0 forced to 1, and a 1-bit fingerprint b. The equation for item x is A[s..s+12] · cx = bx, where the dot product is taken over GF(2) — XOR. The filter’s job is to find a bit array A that makes every equation true at once.

1. Sort

Rows get sorted by their start column. The nonzero entries now line up as a diagonal band across the matrix — that’s the ribbon, and it’s the entire reason the algorithm is practical.

2. Eliminate

Banded Gaussian elimination, top-to-bottom. For each row, find the leftmost set bit — that column becomes the row’s pivot. For every later row whose band covers that pivot column and has a 1 there, XOR the pivot row in to clear it. The band structure survives every XOR, so each elimination only touches r cells. Total construction cost is O(n · r) instead of O(n3).

3. Back-substitute

After elimination, each pivot row determines one bit of A. Process pivot columns right-to-left: A[pivot column] = b ⊕ (bits at later columns the row’s coefficients select). By the time you reach column zero, every bit is filled and every original equation holds.

Querying is the same dot product: hash the item, XOR the selected bits, compare to the fingerprint. False-positive rate is 2-f where f is the fingerprint width. The visualization uses f = 1 for legibility (50% FPR) so every cell can be one bit on screen. Real filters use 8–16-bit fingerprints.

Where they show up

Dillinger and Walzer published ribbon filters in 2021. RocksDB switched to them shortly after — the build-once-query-many shape fits LSM-tree compactions, and a ribbon filter at the same false-positive rate is measurably smaller than the alternatives.

My haveibeenfiltered uses one: 2 billion SHA-1 password hashes from Have I Been Pwned, packed into a 1.79 GB filter that runs entirely client-side with no per-check network call. Ribbon filters get within a few percent of the entropy bound — n · log2(1/FPR) bits — which is the floor no probabilistic filter can beat.

For the membership-filter cousin that takes the opposite approach — set bits as you go, no system to solve, no construction phase — see the Bloom filter.