Scoring

GoldenMatch provides more than 20 scoring methods for comparing record pairs. Scoring runs after blocking and produces (row_id_a, row_id_b, score) tuples.

Scorer reference

Scorer	Description	Range	Best For
`exact`	Binary 0/1 match	0 or 1	Email, phone, ID
`jaro_winkler`	Jaro similarity with prefix bonus	0.0–1.0	Names
`levenshtein`	Normalized Levenshtein similarity	0.0–1.0	General strings
`date`	Damerau-Levenshtein over canonical ISO digits	0.0–1.0	Dates (`dob`, `birth_date`)
`date_diff`	Day-distance bands; magnitude-aware (a year gap is a weak partial)	0.0–1.0	Dates (`dob`, `birth_date`)
`numeric_diff`	Banded numeric distance (`:abs:<eps>` / `:pct:<frac>`); magnitude-aware	0.0–1.0	Amounts, measurements, ages
`geo_haversine`	Great-circle (haversine) km distance banded, on a `lat,long` field	0.0–1.0	Coordinates (`lat,long`)
`token_sort`	Sort tokens, then ratio	0.0–1.0	Names, addresses
`soundex_match`	Phonetic code comparison	0 or 1	Names
`ensemble`	max(jaro_winkler, token_sort, soundex)	0.0–1.0	Names with reordering
`embedding`	Cosine similarity of sentence embeddings	0.0–1.0	Semantic matching
`record_embedding`	Multi-field concatenated embeddings	0.0–1.0	Cross-field semantic
`dice`	Dice coefficient on bloom filters	0.0–1.0	PPRL
`jaccard`	Jaccard similarity on bloom filters	0.0–1.0	PPRL
`name_freq_weighted_jw`	Jaro-Winkler modulated by US Census surname IDF	0.0–1.0	`last_name` / `surname`
`given_name_aliased_jw`	Jaro-Winkler with alias-aware exact bonus	0.0–1.0	`first_name` / `given_name`
`qgram`	Q-gram (n-gram) overlap similarity	0.0–1.0	General strings, typos
`alias_match`	Matches known name aliases and nicknames (e.g. Bob and Robert)	0 or 1	Names with nicknames
`initialism_match`	Matches initials/acronyms against their expansions	0 or 1	Acronyms, initials
`phash`	Perceptual-hash Hamming similarity	0.0–1.0	Images
`radial`	Rotation/crop-invariant radial-variance similarity	0.0–1.0	Rotated/cropped images
`audio_fp`	Audio-fingerprint similarity	0.0–1.0	Audio clips

The name_freq_weighted_jw / given_name_aliased_jw scorers ship as part of the bundled reference-data packs and are picked automatically by auto-config when a column matches the relevant name pattern AND its profiled col_type agrees. See Reference Data for the full pack overview, refinement rules, and the col_type gate.

Date fields

Do not score an ISO date with jaro_winkler. The fixed YYYY-MM-DD shape, the shared digit alphabet, and the common 19../20.. prefix push unrelated birthdays to 0.80+, so a threshold that admits a one-digit typo also admits a different person — precision collapses on any date column that blocking co-locates by birth year. The date scorer parses both sides as ISO dates and compares them by Damerau-Levenshtein edit distance over the eight canonical digits (a swapped-digit typo is one edit), mapped so a single-digit typo stays high while an unrelated date drops to 0:

1980-01-01 vs 1980-01-01  -> 1.00   (same)
1980-01-01 vs 1980-01-02  -> 0.90   (one-digit typo)
1980-01-01 vs 1975-01-01  -> 0.75   (two edits)
1980-01-01 vs 1975-11-30  -> 0.00   (unrelated)   # jaro_winkler gives this 0.80

Non-ISO input falls back to levenshtein. Auto-config already skips date columns as fuzzy fields, so date is for hand-written / Splink-converted configs; a preflight check warns when a name-oriented scorer (jaro_winkler / token_sort / …) is placed on a date field.

The date scorer is edit-distance based and therefore magnitude-blind: 1990-01-02 vs 1991-01-02 is one digit-edit → 0.90, over-scoring a full-year gap. The date_diff scorer instead parses both sides to a day-ordinal and bands by day-distance (same → 1.0, ≤1 day → 0.92, ≤1 month → 0.80, ≤1 year → 0.60, ≤5 years → 0.30, else 0), with an MM/DD-transposition floor and the same levenshtein fallback on unparseable input.

On the Fellegi-Sunter path, GOLDENMATCH_FS_DOMAIN_COMPARATORS=1 makes auto-config admit date columns as date_diff instead of levenshtein (default off; byte-identical when off).
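
The day-distance banding described for date_diff can be sketched in a few lines of standard-library Python. This is an illustration, not GoldenMatch's implementation: the function name `date_diff_sketch` is hypothetical, the 31-day "month" and 366-day "year" cutoffs are assumptions (the text only names the bands), and the MM/DD-transposition floor and levenshtein fallback are left out.

```python
from datetime import date

# (max day distance, score) bands quoted in the text. The 31-day month and
# 366-day year cutoffs are assumptions made for this sketch.
BANDS = [(0, 1.0), (1, 0.92), (31, 0.80), (366, 0.60), (5 * 366, 0.30)]


def date_diff_sketch(a: str, b: str) -> float:
    """Band the absolute day distance between two ISO (YYYY-MM-DD) dates."""
    days = abs(date.fromisoformat(a).toordinal() - date.fromisoformat(b).toordinal())
    for max_days, score in BANDS:
        if days <= max_days:
            return score
    return 0.0  # more than five years apart


print(date_diff_sketch("1990-01-02", "1990-01-03"))  # one day apart
print(date_diff_sketch("1990-01-02", "1991-01-02"))  # a full year apart
```

Unlike the edit-distance date scorer, the full-year gap lands in the weak-partial band instead of scoring like a one-digit typo.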

Numeric and coordinate fields

The same magnitude-blindness afflicts numbers and coordinates: levenshtein("100","900") ≈ 0.67 reads two very different amounts as near-agreement, and string similarity on a lat,long pair is meaningless. Two more domain comparators (behind the same GOLDENMATCH_FS_DOMAIN_COMPARATORS flag) fix this on the Fellegi-Sunter path:

numeric_diff parses both sides to a float and maps the distance to a monotone [0,1] ramp — numeric_diff:abs:<eps> for an absolute band (|a−b|) or numeric_diff:pct:<frac> for a relative band (|a−b| / max(|a|,|b|)); bare numeric_diff = pct:0.1. Auto-config admits numeric columns as numeric_diff:pct:0.1.
geo_haversine parses one combined "lat,long" field per side and bands the great-circle (haversine) km distance (≤0.1 km → 1.0, ≤1 km → 0.85, ≤10 km → 0.5, ≤100 km → 0.2, else 0). Auto-config admits a column whose sampled values parse as coordinates. (Two separate lat/long columns are a deferred cross-field comparator.)

Both fall back to exact-string equality on unparseable input (never returning None for non-null values), so the scalar and vectorized scoring paths agree by construction — and, like date_diff, they are scale-neutral: just new scorers flowing through the unchanged level → m/u → weight machinery, leaving blocking, the pair set, memory, and clustering untouched. Default off is byte-identical.
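
As a rough illustration of the geo_haversine banding and its exact-string fallback, the sketch below uses only the standard library. The function names, the 6371 km Earth radius, and the parsing rules are assumptions of this sketch; only the distance bands and the fallback behavior come from the text above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch
# (max km, score) bands quoted in the text.
BANDS = [(0.1, 1.0), (1.0, 0.85), (10.0, 0.5), (100.0, 0.2)]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def parse_lat_long(value):
    """Parse one combined "lat,long" string; raises ValueError if malformed."""
    lat, lon = (float(part) for part in value.split(","))
    return lat, lon


def geo_haversine_sketch(a: str, b: str) -> float:
    try:
        lat1, lon1 = parse_lat_long(a)
        lat2, lon2 = parse_lat_long(b)
    except ValueError:
        # Unparseable input: fall back to exact-string equality.
        return 1.0 if a == b else 0.0
    km = haversine_km(lat1, lon1, lat2, lon2)
    for max_km, score in BANDS:
        if km <= max_km:
            return score
    return 0.0


print(geo_haversine_sketch("0,0", "0,0.05"))  # roughly 5.6 km apart
```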

matchkeys:
  - name: person
    type: weighted
    fields:
      - field: dob
        scorer: date
        weight: 0.3

Fuzzy scoring

Fuzzy matching uses rapidfuzz.process.cdist for vectorized NxN scoring within each block. This is the core scoring engine for weighted matchkeys.

import goldenmatch as gm

score = gm.score_strings("John Smith", "Jon Smyth", "jaro_winkler")
# 0.884

Weighted matchkeys

Each field gets a scorer, weight, and optional transforms. The overall score is a weighted average:

matchkeys:
  - name: fuzzy_person
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
      - field: zip
        scorer: exact
        weight: 0.2

overall_score = sum(field_score * weight) / sum(weight)

Pairs with overall_score >= threshold are matched.
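
As a worked instance of the weighted average, the sketch below plugs in the weights and threshold from the fuzzy_person matchkey. The per-field scores are made-up values for illustration, and the helper name `overall_score` is not a GoldenMatch function.

```python
def overall_score(field_scores, weights):
    """Weighted average of per-field scores: sum(score * weight) / sum(weight)."""
    total_weight = sum(weights.values())
    return sum(field_scores[f] * w for f, w in weights.items()) / total_weight


weights = {"first_name": 0.4, "last_name": 0.4, "zip": 0.2}
# Hypothetical per-field scores for one candidate pair.
field_scores = {"first_name": 0.90, "last_name": 0.80, "zip": 1.0}

score = overall_score(field_scores, weights)  # (0.36 + 0.32 + 0.20) / 1.0
is_match = score >= 0.85
print(round(score, 2), is_match)
```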

Exact scoring

Exact matching uses a Polars self-join for high performance. No threshold is needed.

matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

import goldenmatch as gm

# The exact matchkey above is applied by the pipeline (df is your input DataFrame):
result = gm.dedupe_df(df, exact=["email"])

Probabilistic scoring (Fellegi-Sunter)

Probabilistic matchkeys turn each pair into a comparison vector and score it with EM-trained m/u probabilities. Match weights are log2 likelihood ratios, log2(m/u).

matchkeys:
  - name: fs_match
    type: probabilistic
    em_iterations: 20
    fields:
      - field: first_name
        scorer: jaro_winkler
        levels: 3              # agree / partial / disagree
        partial_threshold: 0.8
      - field: last_name
        scorer: jaro_winkler
        levels: 4              # N-level: explicit banding via level_thresholds
        level_thresholds: [0.95, 0.88, 0.7]
      - field: zip
        scorer: exact
        levels: 2

level_thresholds generalizes the fixed agree/partial/disagree banding to any number of levels: pass descending similarity cutoffs (len == levels - 1), and a pair’s comparison level is the count of thresholds its similarity satisfies (satisfying all of them = the top “agree” level, none = “disagree”). Omit it to keep the classic 2/3-level behavior (partial_threshold covers the 3-level case). The native FS kernel scores level_thresholds matchkeys natively from goldenmatch-native >= 0.1.14 (older wheels fall back to the pure-Python scoring path automatically); the fused-match path scores them natively from goldenmatch-native >= 0.1.15 (capability const FUSED_FS_SUPPORTS_LEVEL_THRESHOLDS; older wheels fall back).
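
The level rule ("the count of thresholds its similarity satisfies") is small enough to sketch directly. This assumes "satisfies" means similarity >= cutoff; the helper name `comparison_level` is hypothetical.

```python
def comparison_level(similarity, level_thresholds):
    """Return the comparison level: how many descending cutoffs are met.

    0 means "disagree"; len(level_thresholds) means the top "agree" level.
    Assumes a cutoff is satisfied when similarity >= cutoff.
    """
    return sum(similarity >= cutoff for cutoff in level_thresholds)


thresholds = [0.95, 0.88, 0.7]  # levels: 4, as in the last_name field above
for sim in (0.97, 0.90, 0.75, 0.40):
    print(sim, "->", comparison_level(sim, thresholds))
```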

import goldenmatch as gm

em_result = gm.train_em(df, matchkey, n_sample_pairs=10000, blocking_fields=["zip"])
pairs = gm.score_probabilistic(block_df, matchkey, em_result)

Key details:

u-probabilities estimated from random pairs and fixed during EM (Splink approach)
Blocking fields must be excluded from training (always agree within blocks)
Comparison vectors apply field transforms before scoring
Achieves P=0.978 / R=0.958 / F1=0.968 on DBLP-ACM (full-block vectorized scoring). The old “98.8% / 57.6%” figure was a benchmark artifact — a per-block size cap, not the scorer.
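
To make the log-likelihood-ratio weights concrete, here is a minimal sketch of how a comparison vector turns into a summed match weight. The m/u values are invented for illustration (chosen so the logs come out as whole bits), and `match_weight` is a hypothetical helper, not the library's scorer.

```python
import math

# Hypothetical m/u probabilities per comparison level (index 0 = disagree).
# m = P(level | match), u = P(level | non-match).
MODEL = {
    "first_name": {"m": [0.1, 0.1, 0.8], "u": [0.8, 0.1, 0.1]},  # 3 levels
    "zip": {"m": [0.2, 0.8], "u": [0.8, 0.2]},                   # 2 levels
}


def match_weight(comparison_vector):
    """Sum log2(m/u) over fields for the observed comparison levels."""
    total = 0.0
    for field, level in comparison_vector.items():
        params = MODEL[field]
        total += math.log2(params["m"][level] / params["u"][level])
    return total


print(match_weight({"first_name": 2, "zip": 1}))  # both fields agree
print(match_weight({"first_name": 0, "zip": 0}))  # both fields disagree
```

Agreement on a field that rarely agrees among non-matches (small u) earns more bits than agreement on a field that often agrees by chance.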

Negative evidence on Fellegi-Sunter matchkeys

negative_evidence is valid on type: probabilistic matchkeys (not just weighted/exact). Each NE field is a constrained EM-learned dimension: it fires when both values are present and scorer(a, b) < threshold (strict <), contributing log2(m_fired/u_fired) bits when fired and exactly 0 when not fired — that fired-else-zero clamp is what makes it negative evidence rather than a normal scored field (agreement never adds weight; only a hard disagreement subtracts it).

matchkeys:
  - name: fs_person
    type: probabilistic
    fields:
      - field: first_name
        scorer: jaro_winkler
        levels: 2
        partial_threshold: 0.8
      - field: city
        scorer: exact
        levels: 2
    negative_evidence:
      - field: phone
        scorer: exact
        threshold: 1.0
        # penalty_bits: 6.0   # optional fixed override, see below

Two ways to set the fired weight:

EM-learned (default): omit penalty_bits. The weight is estimated from data during train_em the same way every other field’s m/u is — m_fired = P(fired | match), u_fired = P(fired | non-match), both from the same EM loop and random-pair sample as the regular fields.
penalty_bits (fixed override): a literal log2 LLR in bits; the fired contribution is exactly -abs(penalty_bits), no EM training for that dimension. Useful when you know the right veto strength (e.g. migrating a Splink config) without waiting on EM to converge on enough data.
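
The fired-else-zero behavior and the two ways of setting the fired weight can be sketched as follows. This is an illustration under stated assumptions: "present" is taken to mean non-None, and the names `ne_contribution`, `m_fired`, and `u_fired` as function parameters are choices of this sketch rather than the library's signature.

```python
import math


def exact(a, b):
    """Binary exact scorer: 1.0 on equality, else 0.0."""
    return 1.0 if a == b else 0.0


def ne_contribution(a, b, scorer, threshold,
                    m_fired=None, u_fired=None, penalty_bits=None):
    """Bits contributed by one negative-evidence field for one pair."""
    # Fires only when both values are present and the score is strictly
    # below the threshold. "Present" is assumed to mean non-None here.
    fired = a is not None and b is not None and scorer(a, b) < threshold
    if not fired:
        return 0.0  # agreement (or a missing value) never adds weight
    if penalty_bits is not None:
        return -abs(penalty_bits)  # fixed override, no EM for this dimension
    return math.log2(m_fired / u_fired)  # EM-learned weight


print(ne_contribution("555-0100", "555-0199", exact, 1.0, penalty_bits=6.0))
print(ne_contribution("555-0100", "555-0100", exact, 1.0, penalty_bits=6.0))
```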

penalty (the weighted/exact knob) and penalty_bits are mutually exclusive by matchkey type: weighted/exact matchkeys still require penalty and reject penalty_bits; probabilistic matchkeys reject penalty and accept penalty_bits (or neither, for EM-learned).

An unregistered/unknown NE scorer on a probabilistic matchkey fails loudly at train/score time (score_field raises on unknown scorers; there is no _NE_BROKEN swallow on FS), unlike the weighted path, which silently swallows a broken NE scorer and only warns. This is intentional: FS negative evidence is new enough that a silent no-op would be worse than a hard error.

Guards: the native kernel scores NE-bearing FS matchkeys from goldenmatch-native >= 0.1.15 (capability const FS_SUPPORTS_NE; requires every NE scorer to be FS-native), and the fused-match path does likewise; older wheels keep the pure-Python fallback automatically. The fast-path per-pair scorer does not implement NE and defers to the standard path.

Persisted/imported EM models trained before this feature (including imported Splink models) fail loudly on load if the matchkey has NE fields without penalty_bits and the model lacks the corresponding trained weights, rather than silently scoring NE at weight 0. To resolve this, retrain, or set penalty_bits to bypass EM entirely.

NE is not supported on the continuous/Winkler EM path (train_em_continuous), which rejects it with a clear error.

Splink-parity surface

The Fellegi-Sunter matchkey is a full probabilistic-linkage engine, not just a scorer:

Model lifecycle (train-once, reuse). EMResult.save_json / load_json + MatchkeyConfig.model_path (or dedupe_df(fs_model_path=...)) train EM once and reuse the model — no retraining per run.
Supervised m from labels. estimate_m_from_labels(df, mk, labels) (Splink’s estimate_m_from_label_column) estimates m directly from known matches; adapters pull labels straight from the review-queue / memory corrections store.
Match-weight waterfall. explain_pair_fs decomposes a pair into per-comparison log2(m/u) bits + prior + posterior, surfaced in goldenmatch explain --pair and the lineage sidecar.
Calibration. GOLDENMATCH_FS_CALIBRATED=posterior turns the score into a true match probability 1/(1+2^-(log2(λ/(1-λ)) + ΣW)); linear (default) is monotonic in the summed weight.
Accuracy analysis from labels. goldenmatch evaluate --threshold-sweep emits the precision/recall/F1 operating-point curve, a recommended cut, probability_two_random_records_match, and the per-comparison m/u match-weight report.
Config migration from Splink. goldenmatch import-splink settings.json -o goldenmatch.yaml [--model-out model.json] (or gm.from_splink(...) in Python, or the convert_splink_config MCP tool) converts a Splink settings or trained-model JSON into a GoldenMatch config; trained m/u probabilities import directly so no re-training is needed, and anything lossy is reported in a ConversionReport.
Splink migration upgrade pass. goldenmatch import-splink settings.json --upgrade data.parquet -o goldenmatch.yaml --model-out model.json (or gm.upgrade_splink_conversion(conversion, data)) runs a data-aware pass over the converted config with four levers: it computes term-frequency tables from the data (Splink model exports don’t include them), re-derives Levenshtein distance thresholds from measured string lengths, applies fan-out defenses (a risk-gated negative-evidence suggestion — an unused identity-grade column whose disagreement contradicts pairs the imported model would confidently merge is added as negative_evidence with posterior-weighted __ne__ weights — plus golden_rules.max_cluster_size tuned from the reference clusters), and calibrates link/review thresholds from the blocked-pair score distribution (NE-aware). The faithful baseline conversion is always written alongside as *.baseline.*, and a baseline-vs-upgraded delta table is printed (add --splink-clusters / --labels for agreement / truth F1; --sample-cap, --no-measure, and --id-column control the measurement pass). Measured on wild Splink configs, the upgraded conversion beat native Splink on all three benchmark pairs (pairwise F1 vs truth: 0.633 vs 0.601, 0.766 vs 0.699, 0.740 vs 0.686).
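
The posterior calibration formula quoted in the Calibration item can be checked numerically with a short sketch. Here λ is read as the prior probability that two random records match and ΣW as the summed per-comparison weights in bits; the function name `posterior` is hypothetical.

```python
import math


def posterior(total_weight_bits, prior_lambda):
    """Match probability 1 / (1 + 2^-(log2(lambda / (1 - lambda)) + sum W))."""
    prior_bits = math.log2(prior_lambda / (1.0 - prior_lambda))
    return 1.0 / (1.0 + 2.0 ** -(prior_bits + total_weight_bits))


# With an even prior and no evidence, the posterior is the prior.
print(posterior(0.0, 0.5))
# Ten bits of evidence against a 1-in-1000 prior is only about a coin flip.
print(posterior(10.0, 0.001))
```

The second call shows why the prior matters: strong per-field agreement can still leave a pair near 0.5 when matches are rare.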

Scale-out

Probabilistic matchkeys ride the bucket backend’s hash-bucketed parallel scorer (the same path that carries the Ray / DataFusion distribution wiring), so they scale single-node the same way weighted matchkeys do. Measured 6M-row dedupe on a 16c/64GB node: 162.6 s (native) / 288.5 s (numpy), 11.3 GB peak RSS, F1 1.000 on synthetic data with entity-clique ground truth. An opt-in Rust kernel (GOLDENMATCH_FS_NATIVE=1) is ~10.8× faster on the scoring step for tiny-block workloads. See scale envelope.

Embedding scorers on the probabilistic path. embedding and record_embedding are first-class Fellegi-Sunter field scorers: they both train (EM) and score on the vectorized matrix path (their cosine-similarity matrix flows through the same comparison-level logic as string scorers, so training and scoring agree). Because they are matrix-only, a matchkey carrying one always runs vectorized; the GOLDENMATCH_FS_VECTORIZED=0 debug fallback only applies to string scorers.

Probabilistic auto-config v2 (beats Splink head-to-head)

The probabilistic auto-config path (auto_configure_probabilistic_df / build_probabilistic_matchkeys) builds Fellegi-Sunter configs that beat Splink on every dataset Splink scores in the shared bench_er_headtohead panel. This is default-on as of FS auto-config v2; set GOLDENMATCH_FS_AUTOCONFIG_V2=0 to restore the legacy auto-config byte-identically. It touches only the probabilistic path — the weighted/DQbench path and zero-config dedupe_df are unchanged. Numbers are deterministic as of #829 (which fixed a non-deterministic EM training-pair sample that previously swung historical_50k F1 between 0.64 and 0.80).

Dataset	GoldenMatch (probabilistic v2)	Splink
historical_50k (Splink’s flagship)	0.778	0.757
febrl3	0.991	0.965
synthetic_person	0.998	0.996

GoldenMatch also wins at the cluster level on historical_50k (B-cubed F1 0.844 vs 0.789). The full three-engine accuracy + performance bake-off (incl. the zero-config controller path, wall, peak RSS, throughput) is at docs/benchmarks/2026-06-09-splink-bakeoff.md. Four levers drive the gain:

Admit dob / date fields as levenshtein instead of leaving them exact-only, so near-miss dates still contribute weight.
Drop redundant name composites (full_name / first_and_surname) when atomic given + family fields already exist — no double-counting the same signal.
Additively diversify blocking onto orthogonal stable keys (date-year + postcode/zip) so a single noisy key doesn’t cap recall.
Admit description / multi-name fields as token_sort, which lifts the bibliographic case (DBLP-ACM) from 0.003 → 0.377 — a large relative gain, but still recall-bound.

These are pairwise F1 under one shared evaluator (bench_er_headtohead). The often-cited ~0.97 Splink figure on historical_50k is a cluster/entity-level metric, not exhaustive within-cluster pairwise F1 — Splink itself scores ~0.75 pairwise on this dataset under the same harness, and the pairwise blocking ceiling for any engine on these columns is ~0.93. The honest claim is “matches/beats Splink head-to-head on the same evaluator,” not “0.97 pairwise.” Where Splink still leads: it is 3-19x faster on these datasets, runs distributed FS at 1B+ rows on Spark, and ships a mature interactive m/u comparison-viewer charting UI.

For bibliographic data (DBLP-ACM), use the weighted path, not probabilistic. Splink skips dblp_acm, and the probabilistic auto-config is weak there (pairwise F1 0.377, recall-bound). The zero-config weighted controller scores 0.964 on DBLP-ACM and is the right tool for that shape. The probabilistic path targets PII / person-record linkage (historical_50k, febrl3, synthetic_person).

LLM scoring

Send borderline pairs to GPT-4o-mini or Claude for scoring. Two modes:

Pairwise mode

Score individual pairs. Best for small candidate sets.

llm_scorer:
  enabled: true
  mode: pairwise
  auto_threshold: 0.95       # auto-accept above this
  candidate_lo: 0.75         # LLM scores pairs in [0.75, 0.95]
  budget:
    max_cost_usd: 0.05
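
The routing implied by auto_threshold and candidate_lo can be sketched as a three-way split. The function name and the "reject" label are assumptions of this sketch; the config comments only state that pairs above auto_threshold are auto-accepted and that the LLM scores pairs in [0.75, 0.95].

```python
def route_pair(score, auto_threshold=0.95, candidate_lo=0.75):
    """Decide what happens to a candidate pair before any LLM call."""
    if score > auto_threshold:
        return "auto_accept"  # confident enough to skip the LLM
    if score >= candidate_lo:
        return "llm"          # borderline: send to the LLM scorer
    return "reject"           # assumed: too low to be worth an LLM call


for s in (0.98, 0.95, 0.80, 0.60):
    print(s, "->", route_pair(s))
```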

Cluster mode

Send entire borderline blocks to the LLM for in-context clustering. More efficient for large blocks.

llm_scorer:
  enabled: true
  mode: cluster
  cluster_max_size: 100
  cluster_min_size: 5        # below this, fall back to pairwise

See LLM Integration for full details.

Cross-encoder reranking

Re-score borderline pairs with a pre-trained cross-encoder for higher precision.

matchkeys:
  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1
    rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2

Pairs within threshold +/- rerank_band get reranked. Requires pip install goldenmatch[embeddings].

reranked = gm.rerank_top_pairs(pairs, df, matchkey)
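
Which pairs fall into the rerank band is a simple filter on the first-pass score. The sketch below assumes the band is inclusive at both ends and uses made-up pairs; `in_rerank_band` is a hypothetical helper, not part of the GoldenMatch API.

```python
def in_rerank_band(score, threshold=0.85, rerank_band=0.1):
    """True when a first-pass score lies within threshold +/- rerank_band."""
    return abs(score - threshold) <= rerank_band


# Hypothetical (row_id_a, row_id_b, score) tuples from the first pass.
pairs = [(1, 2, 0.97), (1, 3, 0.90), (2, 3, 0.80), (4, 5, 0.60)]
to_rerank = [p for p in pairs if in_rerank_band(p[2])]
print(to_rerank)
```

Pairs well above the band are already confident matches and pairs well below it are confident non-matches, so only the middle is worth the cross-encoder's cost.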

Parallel scoring

Fuzzy blocks are scored concurrently via ThreadPoolExecutor. RapidFuzz’s cdist releases the GIL, so threads provide real parallelism.

Block 1 ──> Thread 1 ──> pairs
Block 2 ──> Thread 2 ──> pairs    (concurrent)
Block 3 ──> Thread 3 ──> pairs

Implementation details:

Blocks are independent — frozen exclude_pairs snapshot avoids race conditions
For 2 or fewer blocks, threading overhead is skipped (sequential execution)
All call sites (pipeline, engine, chunked) use the shared score_blocks_parallel helper
Ray backend (score_blocks_ray) distributes blocks across Ray tasks for cluster-level scaling
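
The block-per-thread pattern, including the sequential shortcut for two or fewer blocks, can be sketched with the standard library. This is not the score_blocks_parallel helper itself: `score_block` is a toy stand-in scorer and `score_blocks` is a hypothetical name. The toy scorer is pure Python, so it holds the GIL and gains nothing from threads; the speedup described above relies on cdist releasing it.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def score_block(block):
    """Toy scorer: exact-match score for every pair of (row_id, value) rows."""
    return [
        (id_a, id_b, 1.0 if val_a == val_b else 0.0)
        for (id_a, val_a), (id_b, val_b) in combinations(block, 2)
    ]


def score_blocks(blocks, max_workers=4):
    """Score independent blocks, using threads only when it can pay off."""
    if len(blocks) <= 2:
        # Too few blocks to amortize thread startup: run sequentially.
        results = [score_block(b) for b in blocks]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves block order in its results.
            results = list(pool.map(score_block, blocks))
    return [pair for block_pairs in results for pair in block_pairs]


blocks = [
    [(1, "a"), (2, "a")],
    [(3, "b"), (4, "c")],
    [(5, "d"), (6, "d"), (7, "e")],
]
print(score_blocks(blocks))
```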

Intra-field early termination

After scoring each expensive field, the scorer checks if the remaining fields can push any pair above the threshold. If not, it breaks early. This reduces 100K fuzzy matching from ~100s to ~39s (2.5x speedup).
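
The early-termination check amounts to an upper bound: assume every remaining field scores a perfect 1.0 and see whether any pair could still reach the threshold. The sketch below illustrates that idea for a weighted matchkey; the function name and the lazy per-field callables are assumptions, not the library's internals.

```python
def score_with_early_exit(field_scorers, weights, n_pairs, threshold):
    """Score fields in order; stop once no pair can still reach threshold.

    field_scorers maps field name -> zero-argument callable returning a list
    of n_pairs scores, so skipped fields are never computed.
    Returns (matches, number_of_fields_scored).
    """
    total_weight = sum(weights.values())
    partial = [0.0] * n_pairs
    remaining = total_weight
    fields_scored = 0
    for field, scorer in field_scorers.items():
        w = weights[field]
        partial = [p + s * w for p, s in zip(partial, scorer())]
        remaining -= w
        fields_scored += 1
        # Best case: every remaining field scores a perfect 1.0.
        best = max(((p + remaining) / total_weight for p in partial), default=0.0)
        if best < threshold:
            return [], fields_scored
    final = [p / total_weight for p in partial]
    return [(i, s) for i, s in enumerate(final) if s >= threshold], fields_scored


weights = {"first_name": 0.4, "last_name": 0.4, "zip": 0.2}
scorers = {
    "first_name": lambda: [0.3, 0.5],  # both pairs already too weak
    "last_name": lambda: [0.9, 0.9],
    "zip": lambda: [1.0, 1.0],
}
print(score_with_early_exit(scorers, weights, n_pairs=2, threshold=0.85))
```

In the example the best achievable score after first_name is 0.8, below the 0.85 threshold, so the two remaining fields are never scored.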

Embedding scoring

Requires pip install goldenmatch[embeddings].

Single-field embedding

fields:
  - field: description
    scorer: embedding
    weight: 1.0
    model: all-MiniLM-L6-v2

Record embedding (multi-field)

Concatenate multiple fields with optional per-field weights:

fields:
  - columns: [title, authors, venue]
    scorer: record_embedding
    weight: 1.0
    column_weights: { title: 2.0, authors: 1.0, venue: 0.5 }

Vertex AI embeddings

Use Google Cloud’s managed embedding API (no GPU needed):

# Set GOOGLE_APPLICATION_CREDENTIALS, then use embedding scorer
# Vertex AI text-embedding-004 supports inference only (no fine-tuning)

Scoring a single pair

import goldenmatch as gm

# Score two strings
score = gm.score_strings("John Smith", "Jon Smyth", "jaro_winkler")

# Score two records
score = gm.score_pair_df(
    {"name": "John Smith", "zip": "10001"},
    {"name": "Jon Smyth", "zip": "10001"},
    fuzzy={"name": 0.7, "zip": 0.3},
)

# Explain the score
explanation = gm.explain_pair_df(
    {"name": "John Smith", "zip": "10001"},
    {"name": "Jon Smyth", "zip": "10001"},
    fuzzy={"name": 0.7, "zip": 0.3},
)
