Backends and scale

GoldenMatch runs the same pipeline across five backends (the backend= / --backend values). The in-memory default trades per-record speed for the others’ ability to handle larger datasets without running out of memory.

Backends

Backend	Range	How it works
`polars-direct`	< 500K rows	In-memory classic scorer, and the default when the native `block_scoring` kernel is unavailable. Named for the Polars frame path; v3.1.0+ requires the `[polars]` extra, and without it the planner routes to the faster Arrow-native paths (the Rust kernels carry the hot paths).
`bucket` (default)	< 500K rows	Memory-bounded field-hash bucket scorer. The default in-memory route whenever the native `block_scoring` kernel is present (1.7-3.7x faster than `polars-direct` at 1K-60K rows); falls back to `polars-direct` otherwise.
`chunked`	5M+ rows, single machine	Streaming CSV reader plus a vectorized Polars cross-chunk join plus a block-keyed bucketed index.
`duckdb`	500K – 50M rows	Out-of-core pair store via DuckDB. Spills to disk, so RAM is not the ceiling.
`ray`	≥ 50M rows, distributed	Per-block remote tasks; matchkey config and exclude-pairs set shared zero-copy via the Ray object store.

Probabilistic (Fellegi-Sunter) matchkeys are single-box only. The chunked and distributed (Ray) lanes score exact and weighted matchkeys per chunk / partition; a type: probabilistic matchkey routed through them raises NotImplementedError at lane entry (it would otherwise contribute zero pairs silently). Run FS configs on the in-memory / bucket path, which scales to tens of millions of rows on one node — see Scoring → Scale-out.

Measured scale

Quoted from the repository. Re-measure for your own hardware.

Records	Backend	Wall	Peak RSS	Clusters
1,000	polars	0.2s	101 MB	210 multi-member
10,000	polars	1.4s	123 MB	7,000 multi-member
100,000	polars	12s	544 MB	~8,200 rec/s
1M (exact)	polars	~7.8s	—	—
1M (fuzzy)	polars	~43 min	~10 GB	836K
5M	chunked	~50 min	11.9 GB	618,817 multi-member, no OOM

Selecting a backend

You can force a backend in config or on the CLI:

config = gm.GoldenMatchConfig(backend="duckdb")

goldenmatch dedupe big.csv --backend ray
goldenmatch dedupe big.csv --chunked

Or let auto-config pick. By default it chooses duckdb at 1M+ rows and chunked for 5M+. To override the choice, set backend= in config or pass --backend on the CLI.

DuckDB spill path

The DuckDB backend stores candidate pairs in an on-disk store. Point it somewhere with room:

export GOLDENMATCH_DUCKDB_SCORE_DB=/scratch/goldenmatch_pairs.duckdb

Ray distributed

pip install goldenmatch[ray]
goldenmatch dedupe big.csv --backend ray

The Ray path short-circuits back to local parallel scoring below four large blocks, since the distribution overhead does not pay off for small workloads.

Distributed Phase 5 (100M+)

At hundreds of millions of rows the pipeline runs fully distributed on a multi-node Ray cluster, keeping every stage as a Ray Dataset so the driver never holds the full graph. Enable it with GOLDENMATCH_DISTRIBUTED_PIPELINE=2 and GOLDENMATCH_ENABLE_DISTRIBUTED_RAY=1, pointing RAY_ADDRESS at your cluster. The recall-complete path is now the default (shipped 2026-06-11): scoring block-shuffles records by blocking key so every record sharing a key co-locates, then clustering runs a distributed connected-components pass that handles components spanning partitions. It was validated end-to-end at 100M rows on a real 5-node GCP cluster — 9.2 min, 20M clusters recovered exactly, 0.36 GB driver RSS. GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0 restores the legacy per-partition path (faster but under-merges across partitions, so recall drops as partition count rises).

export GOLDENMATCH_DISTRIBUTED_PIPELINE=2
export GOLDENMATCH_ENABLE_DISTRIBUTED_RAY=1
# Multi-node only: a SHARED scratch the WCC checkpoints round-by-round.
export GOLDENMATCH_DISTRIBUTED_WCC_SCRATCH=gs://your-bucket/rc_scratch

The distributed connected-components algorithm is a randomized contraction (Bögeholz, Brand and Todor, 2018): relational, chain-robust, with no driver-side union-find and no per-vertex Python dictionary. Each round checkpoints its shrinking edge set to parquet, which keeps the Ray streaming executor from deadlocking on long iterative joins. Below 50M candidate pairs it uses driver-side scipy instead, which is correct and bounded at that scale.

On a multi-node cluster, GOLDENMATCH_DISTRIBUTED_WCC_SCRATCH must be shared storage (a gs:// prefix). A node-local path is not readable by workers on other nodes, so the per-round parquet checkpoints would fail — the run now raises rather than silently under-merging.

GOLDENMATCH_DISTRIBUTED_PIPELINE=2 requires a Ray Dataset input with a global __row_id__ column and writes golden records to output_path rather than returning a clusters dict. Note that backend="ray" on its own does not stream — it runs an in-memory cheat-line; you need PIPELINE=2. Cluster setup, the full env reference, and the benchmark live in the distributed Ray cluster guide. For every distributed knob and its default, see Tuning & opt-ins.

Debug the prep-versus-kernel split for the bucket backend with GOLDENMATCH_BUCKET_DEBUG=1. It prints a per-bucket prep / kernel / post-filter timing breakdown. It is off by default, costs nothing, and does not change output.

Clustering strategies

Once pairs are scored, GoldenMatch groups the surviving edges into entities. The planner picks a clustering_strategy to match the backend and scale; you can also set it explicitly in config.

Strategy	When it runs	How it works
`in_memory`	Default, single machine	In-memory union-find over the full edge set. Fastest when the graph fits in RAM.
`partitioned_union_find`	Large single-machine pair sets	Disk-partitioned union-find, so the edge set is not bounded by RAM.
`streaming_cc`	Chunked / out-of-core runs	Streaming connected-components that consumes edges without materializing the whole graph.
`distributed_wcc`	Ray cluster (100M+)	Distributed weakly-connected-components (randomized contraction) that resolves components spanning partitions, with no driver-side union-find. See the Distributed Phase 5 section above.

Native acceleration

pip install goldenmatch ships the Rust kernel (goldenmatch-native) on common platforms. Rust is the reference implementation (2026-07 reference-mode change); the pure-Python paths are the lossy fallback used when the wheel is absent. The GOLDENMATCH_NATIVE environment variable has three modes:

`GOLDENMATCH_NATIVE`	Behaviour
unset / `auto` (default)	Run native wherever the component’s kernel symbol exists — native is the reference. Pure-Python is the fallback for a missing wheel/symbol.
`1`	Require native everywhere it exists; raise if the wheel is not importable.
`0`	Force the pure-Python fallback everywhere.

For at-scale and benchmark runs, set GOLDENMATCH_NATIVE=1 so a missing/broken wheel fails loudly instead of silently running the slow pure-Python fallback. Under auto native already runs wherever a kernel exists; 1 just makes its absence a hard error.

“Available” does not mean “your hot loop ran native.” The wheel being importable makes the native telemetry report available: true regardless of whether the stage that dominates your wall actually dispatched to the kernel — for example, the bucket backend’s vectorized numpy fast-path handles many blocks instead of the native scoring kernel. Result telemetry therefore reports the per-run dispatch reality, not just availability:

"native": {
  "available": true,
  "env_gate": "1",
  "ran_native": ["block_scoring", "clustering", "hashing"],
  "ran_fallback": [],
  "hot_path_native": true
}

ran_native / ran_fallback list the components that actually dispatched to the kernel vs fell back on this run; hot_path_native is true when the scoring or clustering stage went native. When the kernel is importable but GOLDENMATCH_NATIVE is unset, a one-line INFO log notes that auto is in effect and that =1 enables full acceleration.

Tuning & opt-ins

Every runtime knob — native gate, backend selection, the distributed pipeline, and the full GOLDENMATCH_* table with defaults and when-to-use guidance.

Scale envelope

The block-size failure modes that matter more than the backend choice.

Get started

Concepts

GoldenMatch

GoldenCheck

GoldenFlow

GoldenPipe

GoldenAnalysis

InferMap

SQL extensions

Reference

Research

Backends

Measured scale

Selecting a backend

DuckDB spill path

Ray distributed

Distributed Phase 5 (100M+)

Clustering strategies

Native acceleration

Tuning & opt-ins

Scale envelope

​Backends

​Measured scale

​Selecting a backend

​DuckDB spill path

​Ray distributed

​Distributed Phase 5 (100M+)

​Clustering strategies

​Native acceleration

Tuning & opt-ins

Scale envelope

Backends

Measured scale

Selecting a backend

DuckDB spill path

Ray distributed

Distributed Phase 5 (100M+)

Clustering strategies

Native acceleration