goldenmatch/core/.
Pipeline steps
1. Ingest
Load data from CSV, Excel, Parquet, or a Polars DataFrame..csv, .tsv, .xlsx, .xls, .parquet, .json. Cloud paths (s3://, gs://, az://) are handled by cloud_ingest.
Each record gets an __row_id__ (int64) and __source__ column.
2. Column Map
Map columns between different schemas when matching across sources.(path, source_name, column_map).
3. Auto Fix
Automatic data cleaning before validation.4. Validate
Apply validation rules and quarantine bad records.flag (keep but mark), null (set value to null), quarantine (remove from matching).
5. Standardize
Apply per-column standardization transforms._NATIVE_STANDARDIZERS) that avoids Python UDFs for common transforms.
6. Matchkeys
Compute matchkey columns by applying field transforms.__mk_*__. Matchkey transforms also have a native Polars fast path (_try_native_chain).
7. Block
Reduce the comparison space by grouping records that share a blocking key.auto_select: true to let GoldenMatch pick the best key by histogram analysis.
Dynamic block splitting automatically handles oversized blocks by splitting on the highest-cardinality column.
8. Score
Compare record pairs within each block. Exact matching uses Polars self-join (not Python loops):rapidfuzz.process.cdist for vectorized NxN scoring:
ThreadPoolExecutor. RapidFuzz’s cdist releases the GIL, so threads give real parallelism. For 2 or fewer blocks, threading overhead is skipped.
Intra-field early termination: after each expensive field, the scorer breaks early if no pair can reach the threshold.
Backend selection: _get_block_scorer(config) returns score_blocks_parallel (threads) or score_blocks_ray (Ray distributed) based on config.backend.
9. Cluster
Group matched pairs into clusters via iterative Union-Find.confidence = 0.4 * min_edge + 0.3 * avg_edge + 0.3 * connectivity. The bottleneck_pair identifies the weakest link in each cluster.
Incremental updates:
10. Golden
Merge each cluster into one canonical record.most_complete, majority_vote, source_priority, most_recent, first_non_null. Strategies can be set per-field.
Output
Write results to files or database.Pipeline entry points
| Entry Point | Description |
|---|---|
gm.dedupe(*files) | High-level file-based dedupe |
gm.dedupe_df(df) | DataFrame-based dedupe (no file I/O) |
gm.match(target, reference) | File-based list matching |
gm.match_df(target_df, ref_df) | DataFrame-based list matching |
run_dedupe(file_specs, config) | Low-level pipeline |
run_match(target_spec, ref_specs, config) | Low-level pipeline |
_run_dedupe_pipeline() and _run_match_pipeline() internal functions are shared by both file-based and DataFrame-based entry points.