Quick start
Ground truth format
A CSV file with two columns identifying matched pairs. The column names default to id_a and id_b but are configurable.
IDs correspond to GoldenMatch’s __row_id__ (int64). Ground truth CSVs may have string IDs — load_ground_truth_csv attempts int conversion automatically.
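The format above can be read with a few lines of standard-library Python. This is a minimal sketch of the idea, not the code behind load_ground_truth_csv: the helper name `read_ground_truth_pairs` and its parameters are hypothetical, and leaving non-numeric IDs as strings is an assumption made for illustration.

```python
import csv
import io


def read_ground_truth_pairs(handle, col_a="id_a", col_b="id_b"):
    """Hypothetical helper: read matched pairs, coercing IDs to int when possible."""

    def coerce(value):
        value = value.strip()
        try:
            return int(value)
        except ValueError:
            return value  # assumption: non-numeric IDs stay as strings

    pairs = set()
    for row in csv.DictReader(handle):
        a, b = coerce(row[col_a]), coerce(row[col_b])
        # Store each pair in a canonical order so (7, 3) and (3, 7) are one pair.
        # The sort key puts ints before strings and never compares across types.
        pairs.add(tuple(sorted((a, b), key=lambda v: (isinstance(v, str), v))))
    return pairs


# Two matched pairs; the CSV reader yields strings, which are converted to ints.
sample = io.StringIO("id_a,id_b\n0,1\n7,3\n")
pairs = read_ground_truth_pairs(sample)
```

Storing pairs in a canonical order means the direction in which a pair was written in the CSV does not affect later set comparisons.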
CI/CD quality gates
Exit with code 1 if accuracy falls below thresholds (for example, a --min-f1 threshold).
EvalResult
Evaluate pairs directly
Evaluate clusters
Evaluate a cluster dict (as returned by build_clusters). Cluster members are expanded into pairs for comparison.
run_dedupe() does not return scored_pairs — use the clusters dict instead.
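Expanding clusters into pairs can be sketched as follows. The cluster dict shape used here (cluster ID mapped to a list of member row IDs) is an assumption for illustration, and the real structure returned by build_clusters may differ. `clusters_to_pairs` is a hypothetical name, not a GoldenMatch function.

```python
from itertools import combinations


def clusters_to_pairs(clusters):
    """Hypothetical helper: expand each cluster into all of its member pairs."""
    pairs = set()
    for members in clusters.values():
        # Sorting first makes every pair come out as (smaller, larger).
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Assumed shape: cluster ID -> list of member row IDs.
clusters = {0: [1, 2, 3], 1: [4], 2: [5, 6]}
predicted_pairs = clusters_to_pairs(clusters)
```

A cluster of n members expands to n(n-1)/2 pairs, so singleton clusters contribute nothing and one large wrong cluster can produce many false-positive pairs.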
Build ground truth with label command
Interactively label record pairs to create a ground truth CSV. Each pair is labeled with one of these keys:

| Key | Meaning |
|---|---|
| y | Match (add to ground truth) |
| n | No match (skip) |
| s | Skip (unsure) |
Evaluation workflow
- Build ground truth: Use goldenmatch label or create a CSV manually
- Run evaluation: goldenmatch evaluate --gt gt.csv
- Iterate: Adjust config (thresholds, scorers, blocking) and re-evaluate
- Gate CI: Add a --min-f1 threshold to your CI pipeline
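The gating step in the workflow above comes down to comparing metrics with minimum values and returning a non-zero exit code on failure. The sketch below shows that logic in isolation. It is not the CLI's implementation: `gate_exit_code`, the threshold dict, and the metric values are all hypothetical and illustrative.

```python
def gate_exit_code(metrics, thresholds):
    """Hypothetical helper: return 1 if any metric is below its minimum, else 0."""
    failures = [
        f"{name}={metrics[name]:.3f} is below the minimum {floor:.3f}"
        for name, floor in thresholds.items()
        if metrics[name] < floor
    ]
    for message in failures:
        print(message)
    return 1 if failures else 0


# Illustrative numbers only, not results from a real run.
metrics = {"precision": 0.96, "recall": 0.81, "f1": 0.879}

f1_gate = gate_exit_code(metrics, {"f1": 0.90})                # F1 is too low
precision_gate = gate_exit_code(metrics, {"precision": 0.95})  # passes

# A CI entry point would finish with: sys.exit(f1_gate)
```

Because CI runners treat any non-zero exit code as a failed step, returning 1 is enough to block a merge when quality regresses.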
Metrics explained
| Metric | Formula | Interpretation |
|---|---|---|
| Precision | TP / (TP + FP) | Of the pairs GoldenMatch found, how many are correct? |
| Recall | TP / (TP + FN) | Of the true matches, how many did GoldenMatch find? |
| F1 | 2PR / (P+R) | Harmonic mean of precision and recall |
- High precision means few false merges (records incorrectly combined)
- High recall means few missed duplicates
- Most production systems prioritize precision (false merges are harder to fix than missed dupes)
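The formulas in the table can be checked by hand on a small example. The sketch below computes them from two sets of pairs. The function name `pair_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than documented GoldenMatch behavior.

```python
def pair_metrics(predicted, truth):
    """Hypothetical helper: precision, recall and F1 over sets of ID pairs."""
    tp = len(predicted & truth)   # pairs found that are true matches
    fp = len(predicted - truth)   # pairs found that are not true matches
    fn = len(truth - predicted)   # true matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


predicted = {(1, 2), (1, 3), (5, 6), (7, 8)}
truth = {(1, 2), (1, 3), (2, 3), (5, 6)}
result = pair_metrics(predicted, truth)
# Here (7, 8) is a false merge and (2, 3) is a missed duplicate,
# so TP = 3, FP = 1, FN = 1 and all three metrics equal 0.75.
```

Both sets must store pairs in the same canonical order (for example, smaller ID first), otherwise identical pairs will not intersect.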
Cluster comparison (CCMS)
Compare two clustering outcomes on the same dataset without ground truth. Based on the Case Count Metric System (Talburt et al., arXiv:2601.02824v1).

| Case | Meaning |
|---|---|
| Unchanged | Identical cluster in both runs |
| Merged | Run A cluster absorbed into a larger cluster in run B |
| Partitioned | Run A cluster split into smaller clusters in run B |
| Overlapping | Complex reorganization — members redistributed across clusters |
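One way to read the four cases in the table is as a per-cluster classification of run A against run B. The sketch below follows that reading. It is an interpretation of the table, not the reference definition from the paper or GoldenMatch's implementation. It assumes both runs cover the same record IDs and that clusters are given as cluster ID mapped to a list of members. `classify_clusters` is a hypothetical name.

```python
from collections import Counter


def classify_clusters(run_a, run_b):
    """Hypothetical helper: label each run A cluster with one of the four cases."""
    # Map every record to the run B cluster that holds it.
    b_of = {m: cid for cid, members in run_b.items() for m in members}
    b_sets = {cid: set(members) for cid, members in run_b.items()}

    cases = {}
    for cid, members in run_a.items():
        a_set = set(members)
        touched = {b_of[m] for m in a_set}  # run B clusters holding these members
        if len(touched) == 1:
            b_set = b_sets[next(iter(touched))]
            # Same members: unchanged. Otherwise B is a strict superset: merged.
            cases[cid] = "unchanged" if b_set == a_set else "merged"
        elif all(b_sets[b] <= a_set for b in touched):
            cases[cid] = "partitioned"  # split cleanly into smaller clusters
        else:
            cases[cid] = "overlapping"  # members mixed with other clusters
    return cases


run_a = {"a1": [1, 2], "a2": [3], "a3": [4],
         "a4": [5, 6, 7], "a5": [8, 9], "a6": [10, 11]}
run_b = {"b1": [1, 2], "b2": [3, 4], "b3": [5],
         "b4": [6, 7], "b5": [8, 10], "b6": [9, 11]}

cases = classify_clusters(run_a, run_b)
counts = Counter(cases.values())
```

In this toy input, a2 and a3 merge into b2, a4 splits into b3 and b4, and a5 and a6 exchange members, so they count as overlapping.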
Parameter sensitivity analysis
Sweep a parameter across a range and compare each run against a baseline. Supported parameters: threshold, matchkey.<name>.threshold, blocking.max_block_size.
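The sweep-and-compare idea can be illustrated with a toy example that varies only a match threshold. Everything here is hypothetical: the scored pairs are made up, `pairs_at` stands in for a full pipeline run, and comparing runs by pairs added and removed is a simplification chosen for brevity.

```python
# Made-up (pair, score) data standing in for a real matching run.
scored_pairs = [((1, 2), 0.95), ((2, 3), 0.82), ((4, 5), 0.74), ((6, 7), 0.61)]


def pairs_at(threshold):
    """Toy stand-in for one run: keep pairs scoring at or above the threshold."""
    return {pair for pair, score in scored_pairs if score >= threshold}


baseline_threshold = 0.80
baseline = pairs_at(baseline_threshold)

sweep = {}
for threshold in (0.60, 0.70, 0.80, 0.90):
    current = pairs_at(threshold)
    sweep[threshold] = {
        "added": len(current - baseline),    # matches the baseline did not make
        "removed": len(baseline - current),  # baseline matches that were lost
    }
```

Lowering the threshold below the baseline only adds pairs and raising it only removes them, so a sharp jump in either count marks a range where the parameter is sensitive.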
Benchmark evaluation tips
- Always use threshold-based pair generation, NOT top-1-per-record (argmax)
- Leipzig benchmark CSVs have invalid UTF-8; use pl.read_csv(encoding="utf8-lossy", ignore_errors=True)
- Run benchmarks: python tests/benchmarks/run_leipzig.py
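The first tip above contrasts two ways of turning scores into predicted pairs. The toy sketch below shows how the results can differ. The scores are made up, both function names are hypothetical, and the explanation in the comments is one plausible reason for the tip rather than a statement from the GoldenMatch authors.

```python
# Made-up similarity scores for candidate pairs.
scores = {(1, 2): 0.93, (1, 3): 0.91, (2, 3): 0.90, (4, 5): 0.40}


def threshold_pairs(scores, threshold):
    """Keep every pair whose score reaches the threshold."""
    return {pair for pair, s in scores.items() if s >= threshold}


def top1_pairs(scores):
    """Keep, for each record, only its single best-scoring pair (argmax)."""
    best = {}
    for pair, s in scores.items():
        for record in pair:
            if record not in best or s > best[record][1]:
                best[record] = (pair, s)
    return {pair for pair, _ in best.values()}


by_threshold = threshold_pairs(scores, 0.85)
by_top1 = top1_pairs(scores)
# Top-1 drops (2, 3) because both records have a better partner, and it
# keeps (4, 5) although its score is low, since it is the only option.
```

With duplicate groups larger than two, a top-1 rule cannot return all within-group pairs, which lowers measured recall on benchmarks with multi-record clusters.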