GoldenMatch runs the same pipeline across four backends. The default is in-memory Polars; the others trade per-record speed for the ability to handle larger datasets without running out of memory.

Backends

| Backend | Range | How it works |
| --- | --- | --- |
| polars (default) | < 500K rows | In-memory Polars DataFrames. Fastest per record. |
| chunked | 5M+ rows, single machine | Streaming CSV reader plus a vectorized Polars cross-chunk join plus a block-keyed bucketed index. |
| duckdb | 500K – 50M rows | Out-of-core pair store via DuckDB. Spills to disk, so RAM is not the ceiling. |
| ray | ≥ 50M rows, distributed | Per-block remote tasks; matchkey config and exclude-pairs set shared zero-copy via the Ray object store. |

Measured scale

Quoted from the repository. Re-measure for your own hardware.
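| Records | Backend | Wall | Peak RSS | Clusters / notes |
| --- | --- | --- | --- | --- |
| 1,000 | polars | 0.2s | 101 MB | 210 multi-member |
| 10,000 | polars | 1.4s | 123 MB | 7,000 multi-member |
| 100,000 | polars | 12s | 544 MB | ~8,200 rec/s |
| 1M (exact) | polars | ~7.8s | | |
| 1M (fuzzy) | polars | ~43 min | ~10 GB | 836K |
| 5M | chunked | ~50 min | 11.9 GB | 618,817 multi-member, no OOM |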
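To re-measure on your own hardware, a small harness can report the same wall time, throughput, and peak RSS figures as the table. The sketch below uses only the Python standard library; `measure` is a hypothetical helper, not part of GoldenMatch, and the workload in the example is a stand-in for your own dedupe call. The `resource` module is Unix-only.

```python
import resource
import sys
import time


def measure(fn, n_records):
    """Run fn() once and return wall seconds, records/second, and peak RSS in MB."""
    start = time.perf_counter()
    fn()
    wall = time.perf_counter() - start
    # ru_maxrss is the peak for the whole process so far, not only for fn().
    # It is reported in kilobytes on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return {
        "wall_s": wall,
        "rec_per_s": n_records / wall if wall > 0 else float("inf"),
        "peak_rss_mb": peak / divisor,
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with your own pipeline run.
    stats = measure(lambda: sum(range(1_000_000)), n_records=1_000_000)
    print(stats)
```

Because peak RSS is a process-wide high-water mark, run each measurement in a fresh process if you want per-run numbers.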

Selecting a backend

You can force a backend in config or on the CLI:
import goldenmatch as gm  # assumed import alias for the gm prefix
config = gm.GoldenMatchConfig(backend="duckdb")
goldenmatch dedupe big.csv --backend ray
goldenmatch dedupe big.csv --chunked
Or let auto-config pick. By default it chooses duckdb at 1M+ rows and chunked at 5M+ rows. Tune the cutoff with the GOLDENMATCH_AUTOCONFIG_BACKEND_THRESHOLD environment variable.
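The auto-config rule can be pictured as a simple threshold check. This is only a sketch of the behavior described above, not GoldenMatch's actual code: `pick_backend` is a hypothetical function, and it assumes that GOLDENMATCH_AUTOCONFIG_BACKEND_THRESHOLD is an environment variable holding an integer row count that moves the duckdb cutoff.

```python
import os

DEFAULT_DUCKDB_THRESHOLD = 1_000_000  # "duckdb at 1M+ rows" from the text
CHUNKED_THRESHOLD = 5_000_000         # "chunked for 5M+" from the text


def pick_backend(n_rows, env=None):
    """Hypothetical mirror of the documented auto-config rule."""
    env = os.environ if env is None else env
    # Assumption: the variable holds an integer row count for the duckdb cutoff.
    raw = env.get("GOLDENMATCH_AUTOCONFIG_BACKEND_THRESHOLD")
    duckdb_threshold = int(raw) if raw else DEFAULT_DUCKDB_THRESHOLD
    if n_rows >= CHUNKED_THRESHOLD:
        return "chunked"
    if n_rows >= duckdb_threshold:
        return "duckdb"
    return "polars"


if __name__ == "__main__":
    for rows in (100_000, 1_000_000, 5_000_000):
        print(rows, pick_backend(rows, env={}))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.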

DuckDB spill path

The DuckDB backend stores candidate pairs in an on-disk store. Point it somewhere with room:
export GOLDENMATCH_DUCKDB_SCORE_DB=/scratch/goldenmatch_pairs.duckdb
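If you launch runs from Python, you may want to confirm that the target volume actually has room before pointing the pair store at it. The following is a minimal sketch using only the standard library; `point_score_db_at` and its `min_free_gb` default are hypothetical and should be sized for your own data. It assumes the variable is read when the run starts, so set it first.

```python
import os
import shutil
from pathlib import Path


def point_score_db_at(path, min_free_gb=50.0):
    """Set GOLDENMATCH_DUCKDB_SCORE_DB if the target volume has enough free space.

    Returns the free space in GB. min_free_gb is an illustrative default,
    not a GoldenMatch recommendation.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    free_gb = shutil.disk_usage(target.parent).free / 1e9
    if free_gb < min_free_gb:
        raise RuntimeError(
            f"only {free_gb:.1f} GB free under {target.parent}, "
            f"wanted at least {min_free_gb:.1f} GB"
        )
    os.environ["GOLDENMATCH_DUCKDB_SCORE_DB"] = str(target)
    return free_gb
```

For example, `point_score_db_at("/scratch/goldenmatch_pairs.duckdb")` has the same effect as the `export` line above once the free-space check passes.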

Ray distributed

pip install "goldenmatch[ray]"
goldenmatch dedupe big.csv --backend ray
The Ray path falls back to local parallel scoring when there are fewer than four large blocks, since the distribution overhead does not pay off for small workloads.
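The fallback decision amounts to counting large blocks. The sketch below illustrates that idea only: `should_distribute` is a hypothetical function, and `large_block_rows` is an assumed cutoff, because what counts as a "large" block is not specified here.

```python
def should_distribute(block_sizes, large_block_rows=100_000, min_large_blocks=4):
    """Return True when enough large blocks exist to justify Ray's overhead.

    large_block_rows is an assumed value for illustration; min_large_blocks
    follows the "four large blocks" figure in the text.
    """
    large = sum(1 for size in block_sizes if size >= large_block_rows)
    return large >= min_large_blocks


if __name__ == "__main__":
    # Three large blocks: stay local.
    print(should_distribute([250_000, 180_000, 120_000, 900]))
    # Four large blocks: distribute.
    print(should_distribute([250_000, 180_000, 120_000, 110_000]))
```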
Debug the prep-versus-kernel split for the bucket backend with GOLDENMATCH_BUCKET_DEBUG=1. It prints a per-bucket prep / kernel / post-filter timing breakdown. It is off by default, costs nothing, and does not change output.

Scale envelope

See Scale envelope for the block-size failure modes that matter more than the backend choice.