Measure matching accuracy against ground truth and enforce quality gates in CI/CD pipelines.

Quick start

import goldenmatch as gm

metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}, Recall: {metrics['recall']:.1%}")
From the command line:
goldenmatch evaluate data.csv --config config.yaml --gt ground_truth.csv

Ground truth format

A CSV file with two columns identifying matched pairs:
id_a,id_b
1,42
1,108
5,200
5,201
5,203
Each row represents a known true match. Column names default to id_a and id_b but are configurable. IDs correspond to GoldenMatch’s __row_id__ (int64). Ground truth CSVs may have string IDs — load_ground_truth_csv attempts int conversion automatically.
gt_pairs = gm.load_ground_truth_csv("gt.csv", col_a="id_a", col_b="id_b")
# Returns set of (int, int) tuples
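
To make the format concrete, here is a minimal standard-library sketch of a loader with the behavior described above. It is not GoldenMatch's implementation: `read_pairs` is a hypothetical helper, and letting a failed int conversion raise `ValueError` is an assumption about error handling.

```python
import csv
import io

def read_pairs(handle, col_a="id_a", col_b="id_b"):
    """Hypothetical helper: read matched pairs from CSV text as (int, int) tuples."""
    pairs = set()
    for row in csv.DictReader(handle):
        # IDs arrive as strings; convert them so they compare equal to int64 row IDs.
        pairs.add((int(row[col_a].strip()), int(row[col_b].strip())))
    return pairs

# An in-memory file stands in for gt.csv so the sketch runs without fixtures.
sample = io.StringIO("id_a,id_b\n1,42\n1,108\n5,200\n")
pairs = read_pairs(sample)
print(sorted(pairs))  # [(1, 42), (1, 108), (5, 200)]
```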

CI/CD quality gates

Exit with code 1 if accuracy falls below thresholds:
goldenmatch evaluate data.csv \
    --config config.yaml \
    --gt ground_truth.csv \
    --min-f1 0.90 \
    --min-precision 0.80 \
    --min-recall 0.70
Use in GitHub Actions:
# .github/workflows/quality.yml
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install goldenmatch
      - run: |
          goldenmatch evaluate data.csv \
            --config config.yaml \
            --gt ground_truth.csv \
            --min-f1 0.90 --min-precision 0.80
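
The gate logic itself is a comparison of each metric against its minimum. The sketch below shows that idea in plain Python. `failed_gates` is a hypothetical helper, the metric values are invented for illustration, and treating a value equal to the threshold as passing is an assumption, not documented CLI behavior.

```python
def failed_gates(metrics, thresholds):
    """Return the names of metrics that fall below their minimum."""
    return [name for name, minimum in thresholds.items() if metrics[name] < minimum]

metrics = {"f1": 0.91, "precision": 0.78, "recall": 0.88}      # illustrative values
thresholds = {"f1": 0.90, "precision": 0.80, "recall": 0.70}   # mirrors the --min-* flags

failures = failed_gates(metrics, thresholds)
exit_code = 1 if failures else 0
print(failures, exit_code)  # ['precision'] 1
```

A wrapper script would pass `exit_code` to `sys.exit()` so that the CI step fails when any gate is missed.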

EvalResult

@dataclass
class EvalResult:
    precision: float    # TP / (TP + FP)
    recall: float       # TP / (TP + FN)
    f1: float           # 2 * P * R / (P + R)
    tp: int             # True positives (correct matches)
    fp: int             # False positives (incorrect matches)
    fn: int             # False negatives (missed matches)

    def summary(self) -> dict: ...

Evaluate pairs directly

import goldenmatch as gm

predicted = {(1, 42), (1, 108), (5, 200), (7, 300)}
ground_truth = {(1, 42), (1, 108), (5, 200), (5, 201)}

result = gm.evaluate_pairs(predicted, ground_truth)
print(f"Precision: {result.precision:.1%}")  # 3/4 = 75%
print(f"Recall: {result.recall:.1%}")        # 3/4 = 75%
print(f"F1: {result.f1:.1%}")                # 75%
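
The numbers in the comments above follow from plain set arithmetic. The sketch below reproduces them without the library. `score_pairs` is a hypothetical helper, and treating (a, b) and (b, a) as the same pair is an assumption about how pairs are normalized.

```python
def normalize(pairs):
    """Order each pair so (a, b) and (b, a) compare equal (an assumption)."""
    return {tuple(sorted(pair)) for pair in pairs}

def score_pairs(predicted, truth):
    """Hypothetical re-implementation of pairwise precision, recall, and F1."""
    predicted, truth = normalize(predicted), normalize(truth)
    tp = len(predicted & truth)   # predicted and correct
    fp = len(predicted - truth)   # predicted but not in ground truth
    fn = len(truth - predicted)   # in ground truth but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

scores = score_pairs(
    {(1, 42), (108, 1), (5, 200), (7, 300)},   # (108, 1) is reversed on purpose
    {(1, 42), (1, 108), (5, 200), (5, 201)},
)
print(scores["tp"], scores["fp"], scores["fn"])  # 3 1 1
```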

Evaluate clusters

Evaluate a cluster dict (as returned by build_clusters). Expands cluster members into pairs for comparison.
import goldenmatch as gm

result = gm.evaluate_clusters(clusters, ground_truth_pairs)
print(result.f1)
Note: run_dedupe() does not return scored_pairs — use the clusters dict instead.
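
The expansion step can be pictured as generating every unordered pair inside each cluster. The sketch below assumes a simplified cluster shape, a mapping from cluster ID to a list of member row IDs. The dict returned by build_clusters may be structured differently, and `cluster_pairs` is a hypothetical helper.

```python
from itertools import combinations

def cluster_pairs(clusters):
    """Expand each cluster into all unordered member pairs."""
    pairs = set()
    for members in clusters.values():
        # A cluster of n members implies n * (n - 1) / 2 matched pairs;
        # singletons contribute none.
        pairs.update(combinations(sorted(members), 2))
    return pairs

clusters = {0: [1, 42, 108], 1: [5, 200], 2: [9]}   # hypothetical shape
expanded = cluster_pairs(clusters)
print(sorted(expanded))
# [(1, 42), (1, 108), (5, 200), (42, 108)]
```

Note that (42, 108) appears even though neither record was paired with the other directly. Under this kind of expansion, a ground truth file that lists only (1, 42) and (1, 108) would count the implied pair as a false positive unless it also lists (42, 108).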

Build ground truth with label command

Interactively label record pairs to create a ground truth CSV:
goldenmatch label customers.csv --config config.yaml --gt ground_truth.csv
The label command shows pairs and prompts for your judgment:
| Key | Meaning |
|-----|---------|
| y | Match (add to ground truth) |
| n | No match (skip) |
| s | Skip (unsure) |
Pairs are selected from actual pipeline output, focusing on borderline cases near the threshold.
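
One way to picture "borderline cases near the threshold" is to rank scored pairs by their distance from the threshold and label the closest ones first. The sketch below illustrates that idea only. The (id_a, id_b, score) tuple shape, the `borderline_pairs` helper, and the ranking rule are assumptions, not the label command's actual selection logic.

```python
def borderline_pairs(scored_pairs, threshold, limit):
    """Return the `limit` pairs whose scores lie closest to the threshold."""
    ranked = sorted(scored_pairs, key=lambda pair: abs(pair[2] - threshold))
    return ranked[:limit]

# Hypothetical pipeline output: (id_a, id_b, score)
scored = [(1, 42, 0.97), (5, 200, 0.86), (7, 300, 0.82), (9, 11, 0.40)]
to_label = borderline_pairs(scored, threshold=0.85, limit=2)
print(to_label)  # [(5, 200, 0.86), (7, 300, 0.82)]
```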

Evaluation workflow

  1. Build ground truth: Use goldenmatch label or create a CSV manually
  2. Run evaluation: goldenmatch evaluate data.csv --config config.yaml --gt gt.csv
  3. Iterate: Adjust config (thresholds, scorers, blocking) and re-evaluate
  4. Gate CI: Add --min-f1 threshold to your CI pipeline
label pairs --> ground_truth.csv --> evaluate --> adjust config --> repeat
                                         |
                                    CI/CD gate (--min-f1 0.90)

Metrics explained

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| Precision | TP / (TP + FP) | Of the pairs GoldenMatch found, how many are correct? |
| Recall | TP / (TP + FN) | Of the true matches, how many did GoldenMatch find? |
| F1 | 2PR / (P + R) | Harmonic mean of precision and recall |
For entity resolution:
  • High precision means few false merges (records incorrectly combined)
  • High recall means few missed duplicates
  • Most production systems prioritize precision (false merges are harder to fix than missed dupes)
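
Because F1 is a harmonic mean, it drops when precision and recall diverge, even if their arithmetic average stays the same. The short sketch below uses invented values to show the effect; `f1_score` is a local helper, not a GoldenMatch function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Both configurations average 0.70, but the skewed one scores lower on F1.
balanced = f1_score(0.70, 0.70)
skewed = f1_score(0.95, 0.45)
print(f"{balanced:.3f} {skewed:.3f}")  # 0.700 0.611
```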

Cluster comparison (CCMS)

Compare two clustering outcomes on the same dataset without ground truth. Based on the Case Count Metric System (Talburt et al., arXiv:2601.02824v1).
import goldenmatch as gm

result = gm.compare_clusters(clusters_a, clusters_b)
print(result.summary())
# {"unchanged": 42, "merged": 3, "partitioned": 5, "overlapping": 1, "twi": 0.92, ...}
Each cluster from run A is classified into one of four cases:
| Case | Meaning |
|------|---------|
| Unchanged | Identical cluster in both runs |
| Merged | Run A cluster absorbed into a larger cluster in run B |
| Partitioned | Run A cluster split into smaller clusters in run B |
| Overlapping | Complex reorganization: members redistributed across clusters |
The TWI (Talburt-Wang Index) measures overall clustering similarity, normalized to [0, 1] where 1.0 means identical outcomes.
From the command line:
goldenmatch compare-clusters run_a.json run_b.json --details --case-type merged
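
The four cases and the TWI can be sketched with set operations. The code below is an illustration under stated assumptions, not GoldenMatch's implementation: clusters are modeled as a mapping from cluster ID to member list, both runs are assumed to cover the same records, the case rules are one reading of the table above, and the TWI uses the commonly cited definition, sqrt(|A| * |B|) divided by the number of non-empty overlaps between A and B clusters. `classify` and `twi` are hypothetical helpers.

```python
from math import sqrt

def classify(clusters_a, clusters_b):
    """Classify each run-A cluster relative to run B (illustrative rules)."""
    b_sets = [set(members) for members in clusters_b.values()]
    cases = {}
    for cid, members in clusters_a.items():
        a = set(members)
        touched = [b for b in b_sets if a & b]   # run-B clusters sharing a member
        if len(touched) == 1 and touched[0] == a:
            cases[cid] = "unchanged"
        elif len(touched) == 1 and a < touched[0]:
            cases[cid] = "merged"        # absorbed into one larger cluster
        elif len(touched) > 1 and all(b <= a for b in touched):
            cases[cid] = "partitioned"   # split cleanly into smaller clusters
        else:
            cases[cid] = "overlapping"   # members redistributed
    return cases

def twi(clusters_a, clusters_b):
    """sqrt(|A| * |B|) / number of non-empty cluster overlaps."""
    overlaps = sum(
        1
        for a in clusters_a.values()
        for b in clusters_b.values()
        if set(a) & set(b)
    )
    return sqrt(len(clusters_a) * len(clusters_b)) / overlaps

run_a = {"a1": [1, 2], "a2": [3], "a3": [4], "a4": [5, 6, 7], "a5": [8, 9], "a6": [10, 11]}
run_b = {"b1": [1, 2], "b2": [3, 4], "b3": [5, 6], "b4": [7], "b5": [8, 10], "b6": [9, 11]}

cases = classify(run_a, run_b)
index = twi(run_a, run_b)
print(cases)
print(round(index, 3))  # 0.667
```

With identical clusterings every cluster overlaps exactly one counterpart, so the overlap count equals the cluster count and the index is 1.0.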

Parameter sensitivity analysis

Sweep a parameter across a range and compare each run against a baseline:
import goldenmatch as gm

results = gm.run_sensitivity(
    file_specs=[("data.csv", "src")],
    config=gm.load_config("config.yaml"),
    sweep_params=[gm.SweepParam("threshold", 0.70, 0.95, 0.05)],
    sample_size=5000,
)
for r in results:
    print(r.stability_report())
From the command line:
goldenmatch sensitivity data.csv -c config.yaml --sweep threshold:0.70:0.95:0.05 --sample 5000
Supported sweep fields: threshold, matchkey.<name>.threshold, blocking.max_block_size.
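
A sweep spec such as threshold:0.70:0.95:0.05 expands to a list of values to try. The sketch below shows one way to do that expansion. `sweep_values` is a hypothetical helper, and treating the stop value as inclusive is an assumption, so check the tool's output before relying on the exact endpoints.

```python
def sweep_values(spec):
    """Parse 'name:start:stop:step' into (name, values), stop assumed inclusive."""
    # rsplit keeps dotted names such as 'blocking.max_block_size' intact.
    name, start, stop, step = spec.rsplit(":", 3)
    start, stop, step = float(start), float(stop), float(step)
    # Derive the step count up front so float error does not accumulate
    # or drop the final value.
    count = int((stop - start) / step + 1e-9) + 1
    return name, [round(start + i * step, 10) for i in range(count)]

name, values = sweep_values("threshold:0.70:0.95:0.05")
print(name, values)
# threshold [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
```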

Benchmark evaluation tips

  • Always use threshold-based pair generation, NOT top-1-per-record (argmax)
  • Leipzig benchmark CSVs have invalid UTF-8 — use pl.read_csv(encoding="utf8-lossy", ignore_errors=True)
  • Run benchmarks: python tests/benchmarks/run_leipzig.py
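
The first tip can be illustrated with a toy example. The sketch below contrasts the two pair-generation strategies on invented scores. The (id_a, id_b, score) tuple shape and both helpers are hypothetical, and the explanation of why argmax hurts is one plausible reading of the tip rather than a statement from the benchmark code.

```python
def pairs_by_threshold(scored, threshold):
    """Keep every pair whose score meets the threshold."""
    return {(a, b) for a, b, score in scored if score >= threshold}

def pairs_by_argmax(scored):
    """Keep only the single best-scoring candidate for each left-hand record."""
    best = {}
    for a, b, score in scored:
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return {(a, b) for a, (b, _) in best.items()}

# Record 1 has two true duplicates; record 7 has no good candidate.
scored = [(1, 42, 0.96), (1, 108, 0.93), (5, 200, 0.91), (7, 300, 0.35)]

by_threshold = pairs_by_threshold(scored, threshold=0.85)
by_argmax = pairs_by_argmax(scored)
print(sorted(by_threshold))  # [(1, 42), (1, 108), (5, 200)]
print(sorted(by_argmax))     # [(1, 42), (5, 200), (7, 300)]
```

In this example argmax drops the second duplicate of record 1 and forces a low-scoring pair for record 7, which would lower both recall and precision.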