(row_id_a, row_id_b, score) tuples.
Scorer reference
| Scorer | Description | Range | Best For |
|---|---|---|---|
| exact | Binary 0/1 match | 0 or 1 | Email, phone, ID |
| jaro_winkler | Edit distance with prefix bonus | 0.0–1.0 | Names |
| levenshtein | Normalized Levenshtein distance | 0.0–1.0 | General strings |
| token_sort | Sort tokens, then ratio | 0.0–1.0 | Names, addresses |
| soundex_match | Phonetic code comparison | 0 or 1 | Names |
| ensemble | max(jaro_winkler, token_sort, soundex) | 0.0–1.0 | Names with reordering |
| embedding | Cosine similarity of sentence embeddings | 0.0–1.0 | Semantic matching |
| record_embedding | Multi-field concatenated embeddings | 0.0–1.0 | Cross-field semantic |
| dice | Dice coefficient on bloom filters | 0.0–1.0 | PPRL |
| jaccard | Jaccard similarity on bloom filters | 0.0–1.0 | PPRL |
| name_freq_weighted_jw | Jaro-Winkler modulated by US Census surname IDF | 0.0–1.0 | last_name / surname |
| given_name_aliased_jw | Jaro-Winkler with alias-aware exact bonus | 0.0–1.0 | first_name / given_name |
col_type agrees. See Reference Data for the full pack overview, refinement rules, and the col_type gate.
Fuzzy scoring
Fuzzy matching uses rapidfuzz.process.cdist for vectorized NxN scoring within each block. This is the core scoring engine for weighted matchkeys.
Weighted matchkeys
Each field gets a scorer, weight, and optional transforms. The overall score is a weighted average:
overall_score = sum(field_score * weight) / sum(weight)
Pairs with overall_score >= threshold are matched.
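As a rough illustration of the weighted average and threshold rule above, the sketch below scores every pair inside one block. It is not GoldenMatch's implementation: the matchkey layout, the helper names (`overall_score`, `score_block`), and the sample records are hypothetical, and the standard library's `difflib` ratio stands in for a RapidFuzz scorer so the example runs without extra dependencies.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in fuzzy scorer in [0, 1] (difflib, not RapidFuzz)."""
    return SequenceMatcher(None, a, b).ratio()


def exact(a: str, b: str) -> float:
    """Binary 0/1 match."""
    return 1.0 if a == b else 0.0


# Hypothetical matchkey: (field, scorer, weight, transform)
MATCHKEY = [
    ("name", ratio, 2.0, str.lower),
    ("email", exact, 1.0, str.strip),
]


def overall_score(rec_a, rec_b, matchkey=MATCHKEY):
    """overall_score = sum(field_score * weight) / sum(weight)."""
    weighted_sum = 0.0
    weight_sum = 0.0
    for field, scorer, weight, transform in matchkey:
        field_score = scorer(transform(rec_a[field]), transform(rec_b[field]))
        weighted_sum += field_score * weight
        weight_sum += weight
    return weighted_sum / weight_sum


def score_block(block, threshold):
    """Score each unordered pair in a block; keep pairs at or above threshold."""
    matches = []
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            score = overall_score(block[i], block[j])
            if score >= threshold:
                matches.append((block[i]["row_id"], block[j]["row_id"], score))
    return matches


block = [
    {"row_id": 1, "name": "John Smith", "email": "js@example.com"},
    {"row_id": 2, "name": "john smith", "email": "js@example.com "},
    {"row_id": 3, "name": "Mary Jones", "email": "mj@example.com"},
]
matches = score_block(block, threshold=0.85)
```

In this toy block only rows 1 and 2 agree on both fields after the transforms are applied, so they are the only pair that clears the threshold. The nested loop is for readability; a vectorized scorer would compute the same per-field matrix in one call.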
Exact scoring
Exact matching uses Polars self-join for high performance. No threshold needed.
Probabilistic scoring (Fellegi-Sunter)
EM-trained m/u probabilities with comparison vectors. Match weights are log-likelihood ratios.
- u-probabilities estimated from random pairs and fixed during EM (Splink approach)
- Blocking fields must be excluded from training (always agree within blocks)
- Comparison vectors apply field transforms before scoring
- Achieves 98.8% precision, 57.6% recall on DBLP-ACM
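To make the match-weight idea concrete, here is a minimal sketch of the standard Fellegi-Sunter arithmetic, assuming binary agree/disagree comparison levels and base-2 logarithms (the base is an assumption; the text above does not specify it). The m/u values, field names, prior, and function names are all illustrative and are not taken from GoldenMatch.

```python
import math
import random

EPS = 1e-6  # keeps probabilities away from 0 and 1 so the logs stay finite


def clamp(p: float) -> float:
    return min(max(p, EPS), 1 - EPS)


def estimate_u(values, n_pairs, rng):
    """Estimate u as the share of randomly drawn record pairs that agree."""
    agree = 0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(values)), 2)
        agree += values[i] == values[j]
    return agree / n_pairs


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log-likelihood ratio for one field of the comparison vector."""
    m, u = clamp(m), clamp(u)
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(vector, params):
    """Sum the per-field weights; blocking fields should not appear in params."""
    return sum(field_weight(vector[f], p["m"], p["u"]) for f, p in params.items())


def match_probability(weight: float, prior: float = 0.5) -> float:
    """Convert a total weight to a posterior, given a hypothetical prior."""
    odds = (prior / (1 - prior)) * 2 ** weight
    return odds / (1 + odds)


# Made-up parameters for two non-blocking fields
PARAMS = {
    "title": {"m": 0.90, "u": 0.01},
    "year": {"m": 0.95, "u": 0.10},
}
both_agree = match_weight({"title": True, "year": True}, PARAMS)
both_disagree = match_weight({"title": False, "year": False}, PARAMS)
```

Agreement on a field with a low u-probability (a rare coincidence among non-matches) contributes a large positive weight, while disagreement on a field with a high m-probability contributes a large negative one.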
LLM scoring
Send borderline pairs to GPT-4o-mini or Claude for scoring. Two modes:
Pairwise mode
Score individual pairs. Best for small candidate sets.
Cluster mode
Send entire borderline blocks to the LLM for in-context clustering. More efficient for large blocks.
Cross-encoder reranking
Re-score borderline pairs with a pre-trained cross-encoder for higher precision. Pairs scoring within threshold +/- rerank_band get reranked. Requires pip install goldenmatch[embeddings].
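The band selection can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the band is treated as inclusive, the reranker's score is assumed to replace the original score, and `rerank_borderline` and `fake_cross_encoder` are hypothetical names, with the latter standing in for a real cross-encoder model.

```python
def rerank_borderline(pairs, threshold, rerank_band, rescore):
    """Re-score only pairs whose score lies within threshold +/- rerank_band.

    pairs:   iterable of (row_id_a, row_id_b, score) tuples
    rescore: callable (row_id_a, row_id_b) -> float, e.g. a cross-encoder
    """
    low = threshold - rerank_band
    high = threshold + rerank_band
    result = []
    for row_id_a, row_id_b, score in pairs:
        if low <= score <= high:
            # Borderline pair: let the more expensive model decide
            score = rescore(row_id_a, row_id_b)
        result.append((row_id_a, row_id_b, score))
    return result


def fake_cross_encoder(row_id_a, row_id_b):
    """Stand-in for a model call; returns fixed scores for the demo pairs."""
    return {(1, 2): 0.97, (3, 4): 0.40}[(row_id_a, row_id_b)]


pairs = [(1, 2, 0.80), (3, 4, 0.90), (5, 6, 0.99), (7, 8, 0.50)]
rescored = rerank_borderline(pairs, threshold=0.85, rerank_band=0.10,
                             rescore=fake_cross_encoder)
matched = [(a, b) for a, b, score in rescored if score >= 0.85]
```

Only the two pairs inside the band reach the reranker. In this toy data, one pair that was just below the threshold is promoted to a match and one that was just above it is rejected, while the clear match and clear non-match are left untouched.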
Parallel scoring
Fuzzy blocks are scored concurrently via ThreadPoolExecutor. RapidFuzz's cdist releases the GIL, so threads provide real parallelism.
- Blocks are independent; a frozen exclude_pairs snapshot avoids race conditions
- For 2 or fewer blocks, threading overhead is skipped (sequential execution)
- All call sites (pipeline, engine, chunked) use the shared score_blocks_parallel helper
- Ray backend (score_blocks_ray) distributes blocks across Ray tasks for cluster-level scaling
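The threading pattern described in the list above might look roughly like the sketch below. It is not the actual score_blocks_parallel helper: the function name `score_blocks_sketch`, its signature, and the way exclude_pairs is applied (as a set of pairs to skip) are assumptions made for illustration, and the toy scorer is pure Python, so unlike cdist it does not release the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def score_blocks_sketch(blocks, score_block, exclude_pairs=(), max_workers=4):
    """Score independent blocks, in threads when there are more than two."""
    # Immutable snapshot: workers only read it, so no locking is needed
    frozen_exclude = frozenset(exclude_pairs)

    def work(block):
        return [p for p in score_block(block)
                if (p[0], p[1]) not in frozen_exclude]

    if len(blocks) <= 2:
        # Too few blocks to justify thread start-up cost
        per_block = [work(block) for block in blocks]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() returns results in the same order as the input blocks
            per_block = list(pool.map(work, blocks))

    return [pair for block_pairs in per_block for pair in block_pairs]


def toy_score_block(block):
    """Stand-in scorer: rows with identical text match with score 1.0."""
    return [
        (id_a, id_b, 1.0)
        for i, (id_a, text_a) in enumerate(block)
        for id_b, text_b in block[i + 1:]
        if text_a == text_b
    ]


blocks = [
    [(1, "x"), (2, "x")],
    [(3, "y"), (4, "y")],
    [(5, "z"), (6, "q")],
    [(7, "w"), (8, "w")],
]
result = score_blocks_sketch(blocks, toy_score_block, exclude_pairs={(3, 4)})
```

Because each block is scored in isolation and the exclusion set is frozen before any worker starts, the threaded and sequential paths produce the same output.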
Intra-field early termination
After scoring each expensive field, the scorer checks if the remaining fields can push any pair above the threshold. If not, it breaks early. This reduces 100K fuzzy matching from ~100s to ~39s (2.5x speedup).
Embedding scoring
Requires pip install goldenmatch[embeddings].