GoldenMatch overview

GoldenMatch is the headline package of the Golden Suite. It finds duplicate records in seconds with no rules to write and no model to train. It combines fuzzy matching, probabilistic Fellegi-Sunter EM, LLM scoring, privacy-preserving record linkage, golden-record synthesis, and an introspective auto-configuration controller that beats hand-tuned baselines on standard benchmarks.

The healing loop

The core workflow is a loop:

1. Zero-config first pass. dedupe_df(df) runs with no rules or training; auto-config picks a defensible config and you get good results immediately.
2. You get the config it chose. It is returned, inspectable, and versionable, not a black box.
3. The healer suggests tweaks. review_config(df, config) reviews the results against the config and emits ranked, explainable, self-verified edits. An edit is kept only if it does not worsen an unsupervised health proxy, so an accepted suggestion should not make results worse by that measure.
4. You apply the tweaks. apply_suggestion produces an improved config.
5. Results improve, and you repeat.

Zero-config gets you most of the way in one pass; the healing loop closes the gap to an expert-tuned config without you having to be the expert. The healer is opt-in today: it is available through the Python API (from goldenmatch.core.suggest import review_config) and requires goldenmatch[native]. See Config suggestions for detail.
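The accept-only-if-not-worse rule in step 3 can be illustrated with a minimal, library-free sketch. Nothing here is GoldenMatch's API: heal, toy_proxy, and the dict-based config are hypothetical stand-ins, and the proxy is a toy function chosen only to show the control flow.

```python
from typing import Callable

Config = dict[str, float]
Edit = tuple[str, float]  # (config key, proposed new value)


def heal(
    config: Config,
    suggestions: list[Edit],
    health_proxy: Callable[[Config], float],
) -> tuple[Config, list[Edit]]:
    """Apply each suggested edit only if the health proxy does not drop."""
    current = dict(config)  # work on a copy; the caller's config is untouched
    accepted: list[Edit] = []
    best = health_proxy(current)
    for key, value in suggestions:
        candidate = {**current, key: value}
        score = health_proxy(candidate)
        if score >= best:  # self-verification: never accept a worse config
            current, best = candidate, score
            accepted.append((key, value))
    return current, accepted


def toy_proxy(cfg: Config) -> float:
    # Hypothetical proxy: peaks when the name threshold is 0.88.
    return 1.0 - abs(cfg["name_threshold"] - 0.88)


start = {"name_threshold": 0.80}
edits = [("name_threshold", 0.85), ("name_threshold", 0.99), ("name_threshold", 0.88)]
final, kept = heal(start, edits, toy_proxy)
```

In this run the 0.99 edit is rejected because it lowers the proxy, while the other two are kept in order. A real healer would also rank the suggestions and explain them, which this sketch leaves out.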

Quickstart

Dedupe, match, and write golden records.

Auto-config

How zero-config converges on a defensible config.

Backends and scale

Polars, DuckDB, chunked, and Ray.

CLI reference

Every command and the key flags.

Config suggestions

The healing loop: ranked, self-verified config tweaks.

Install

pip install goldenmatch

npm install goldenmatch

Common extras:

pip install goldenmatch[embeddings]   # sentence-transformers, FAISS
pip install goldenmatch[llm]          # Claude / OpenAI borderline scoring
pip install goldenmatch[duckdb]       # out-of-core backend
pip install goldenmatch[ray]          # distributed backend (50M+ rows)
pip install goldenmatch[postgres]     # Postgres sync
pip install goldenmatch[quality]      # GoldenCheck integration
pip install goldenmatch[transform]    # GoldenFlow integration
pip install goldenmatch[web]          # localhost browser workbench
pip install goldenmatch[mcp]          # MCP server for Claude Desktop
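Because extras are optional, code that relies on one may want to check at runtime whether its dependency is importable. The following is a minimal sketch using only the standard library; the module names passed in are assumptions about what each extra installs, not taken from the package metadata.

```python
from importlib.util import find_spec


def available(modules: list[str]) -> dict[str, bool]:
    """Report which top-level modules can be found, without importing them."""
    return {name: find_spec(name) is not None for name in modules}


# Assumed module names; adjust to whatever the extras actually pull in.
status = available(["duckdb", "ray", "sentence_transformers"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```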

Key features

12+ scoring methods: exact, Jaro-Winkler, Levenshtein, token-sort, soundex, ensemble, embedding, record-embedding, dice, jaccard, surname IDF-weighted, and alias-aware.
Probabilistic matching: Fellegi-Sunter EM-trained m/u probabilities with automatic threshold estimation.
LLM scorer: scores borderline pairs with budget caps and graceful degradation.
8+ blocking strategies, including static, adaptive, sorted-neighborhood, multi-pass, ANN, canopy, and learned.
Golden records: eight merge strategies with field-level provenance and quality-weighted survivorship.
PPRL: privacy-preserving record linkage across organizations (F1 0.924 on FEBRL4).
Learning Memory: persists steward corrections, unmerges, and LLM votes across runs.
Interactive TUI, REST API, MCP server (78 tools), A2A agent, and a localhost web workbench.
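For the probabilistic matching feature listed above, the textbook Fellegi-Sunter model scores a pair by summing per-field log-likelihood ratios: log2(m/u) when a field agrees and log2((1-m)/(1-u)) when it disagrees. The sketch below applies that formula with made-up m/u values. It does not run EM and is not GoldenMatch's implementation; match_weight is a hypothetical helper.

```python
import math


def match_weight(agreements: dict[str, bool], m: dict[str, float], u: dict[str, float]) -> float:
    """Sum Fellegi-Sunter log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


# Illustrative probabilities only (assumed, not learned from data).
m = {"email": 0.95, "zip": 0.90}   # P(field agrees | records match)
u = {"email": 0.001, "zip": 0.10}  # P(field agrees | records do not match)

both_agree = match_weight({"email": True, "zip": True}, m, u)
email_only = match_weight({"email": True, "zip": False}, m, u)
neither = match_weight({"email": False, "zip": False}, m, u)
```

Agreement on a rare-by-chance field such as email contributes a large positive weight, while disagreement on any field pulls the total down. A threshold on the summed weight then separates matches from non-matches.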

Benchmarks

Zero-config accuracy, quoted from the package README:

Dataset	F1 (DQBench: composite score)	Note
DBLP-ACM (bibliographic)	0.964	Hand-tuned ceiling 0.918
Febrl3 (PII)	0.944	Zero-config
NCVR (voter records)	0.972	Zero-config
Febrl4 (PII, PPRL)	0.924	Bloom-filter PPRL
DQBench ER composite	91.04	No LLM (v1.12.0)

These are the zero-config (weighted-path) numbers. On the separate probabilistic (Fellegi-Sunter) path, auto-config v2 beats Splink head-to-head under the shared bench_er_headtohead evaluator on every dataset Splink scores: historical_50k (0.778 vs 0.757), febrl3 (0.991 vs 0.965), and synthetic_person (0.998 vs 0.996). The results are deterministic as of #829. See Scoring → Probabilistic auto-config v2 for the full panel and the pairwise-vs-cluster caveat.
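The pairwise-versus-cluster distinction matters because the same clustering can score differently depending on how F1 is counted. As a rough illustration, pairwise F1 expands each cluster into its record pairs and compares the pair sets. This is a generic sketch, not the bench_er_headtohead evaluator, whose internals are not described here; the function names are hypothetical.

```python
from itertools import combinations


def cluster_pairs(clusters: list[set[str]]) -> set[frozenset[str]]:
    """Expand clusters into the set of unordered record pairs they imply."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(predicted: list[set[str]], truth: list[set[str]]) -> float:
    """F1 over record pairs; 0.0 when either side implies no pairs."""
    pred, gold = cluster_pairs(predicted), cluster_pairs(truth)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)


truth = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c"}, {"d", "e"}]  # one true cluster split in two
score = pairwise_f1(predicted, truth)
```

Splitting one record out of a three-record cluster loses two of the four true pairs, so pairwise recall halves even though only one record was misplaced.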

A minimal example

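import goldenmatch as gm

# Exact match on email; fuzzy match on name and zip with per-field thresholds.
result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85, "zip": 0.95})
# Write the merged golden records to disk.
result.golden.write_csv("deduped.csv")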

Leave out exact and fuzzy entirely to let auto-config choose them.
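To make the exact and fuzzy arguments concrete, here is one plausible reading of them as a pair-level rule: exact fields must be equal and each fuzzy field must reach its threshold. How GoldenMatch actually combines the two is not specified on this page, so the combination logic is an assumption, is_match is a hypothetical helper, and difflib.SequenceMatcher stands in for the library's own scorers.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: dict, b: dict, exact: list[str], fuzzy: dict[str, float]) -> bool:
    # Assumed rule: all exact fields equal AND every fuzzy field meets its threshold.
    if any(a[f] != b[f] for f in exact):
        return False
    return all(similarity(a[f], b[f]) >= t for f, t in fuzzy.items())


r1 = {"email": "jo@example.com", "name": "Jonathan Smith", "zip": "90210"}
r2 = {"email": "jo@example.com", "name": "Jonathon Smith", "zip": "90210"}
r3 = {"email": "jo@example.com", "name": "Maria Lopez", "zip": "90210"}

rule = {"exact": ["email"], "fuzzy": {"name": 0.85, "zip": 0.95}}
same = is_match(r1, r2, **rule)       # one-letter spelling variant of the name
different = is_match(r1, r3, **rule)  # shared email, unrelated name
```

Under this rule the spelling variant passes and the unrelated name fails, which shows why a shared email alone would not be enough to merge two records here.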
