Full YAML reference
Matchkeys
Three matchkey types:| Type | Description | Required Fields |
|---|---|---|
exact | Binary match on transformed values | field, optional transforms |
weighted | Weighted average of field scores | field, scorer, weight, threshold |
probabilistic | Fellegi-Sunter log-likelihood ratios | field, scorer, optional levels |
Transforms
Applied to field values before scoring.| Transform | Description |
|---|---|
lowercase | Convert to lowercase |
uppercase | Convert to uppercase |
strip | Remove leading/trailing whitespace |
strip_all | Remove all whitespace |
soundex | Soundex phonetic encoding |
metaphone | Metaphone phonetic encoding |
digits_only | Keep only digits |
alpha_only | Keep only letters |
normalize_whitespace | Collapse multiple spaces |
token_sort | Sort tokens alphabetically |
first_token | First whitespace-delimited token |
last_token | Last whitespace-delimited token |
substring:start:end | Substring extraction |
qgram:n | Q-gram tokenization |
bloom_filter or bloom_filter:ngram:k:size | Bloom filter (for PPRL) |
legal_form_strip | Strip corporate legal-form suffixes (Inc, LLC, Ltd, GmbH, S.A., …) — bundled refdata |
address_normalize | USPS Pub. 28 street-suffix + unit-abbrev canonicalization (Avenue→AVE, Apartment→APT) — bundled refdata |
naics_normalize | NAICS 2022 industry-code canonicalization (code-or-title input → canonical code) — bundled refdata |
col_type agrees. See Reference Data.
Scorers
| Scorer | Description | Best For |
|---|---|---|
exact | Binary 0/1 match | Email, phone, ID |
jaro_winkler | Edit distance with prefix bonus | Names |
levenshtein | Normalized Levenshtein distance | General strings |
token_sort | Order-invariant token matching | Names, addresses |
soundex_match | Phonetic match | Names |
ensemble | max(jaro_winkler, token_sort, soundex) | Names with reordering |
embedding | Cosine similarity of embeddings | Semantic matching |
record_embedding | Concatenated multi-field embeddings | Cross-field semantic |
dice | Dice coefficient on bloom filters | PPRL |
jaccard | Jaccard similarity on bloom filters | PPRL |
name_freq_weighted_jw | Surname IDF-weighted Jaro-Winkler — bundled refdata | last_name / surname |
given_name_aliased_jw | Alias-aware Jaro-Winkler — bundled refdata | first_name / given_name |
Cross-encoder reranking
Addrerank: true to a weighted matchkey to re-score borderline pairs with a cross-encoder model:
Blocking
| Strategy | Description |
|---|---|
static | Group by blocking key (default) |
adaptive | Static + recursive sub-blocking for oversized blocks |
sorted_neighborhood | Sliding window over sorted records |
multi_pass | Union of blocks from multiple passes |
ann | ANN via FAISS on embeddings |
ann_pairs | Direct-pair ANN scoring (50—100x faster than ann) |
canopy | TF-IDF canopy clustering |
learned | Data-driven predicate selection |
auto_select: true to auto-pick the best blocking key by histogram analysis. Set auto_suggest: true to get blocking suggestions when no keys are specified.
Golden rules
Five merge strategies for building canonical records:| Strategy | Description |
|---|---|
most_complete | Pick value with fewest nulls |
majority_vote | Most common value across cluster members |
source_priority | Prefer values from specified sources (requires source_priority list) |
most_recent | Latest value by date (requires date_column) |
first_non_null | First non-null value encountered |
Standardization
Map column names to standardizer functions:| Standardizer | Description |
|---|---|
email | Lowercase, strip, validate format |
name_proper | Title case |
name_upper | Uppercase |
name_lower | Lowercase |
phone | Strip non-digits, normalize format |
zip5 | First 5 digits |
address | Normalize abbreviations (St->Street, etc.) |
state | Normalize state abbreviations |
strip | Remove leading/trailing whitespace |
trim_whitespace | Collapse multiple spaces |
Validation
regex, min_length, max_length, not_null, in_set, format.
Actions: flag (mark but keep), null (set to null), quarantine (remove from matching).
Environment variables
Key environment variables that affect runtime behaviour:| Variable | Default | Description |
|---|---|---|
GOLDENMATCH_ALLOWED_ROOT | unset | When set, every user-supplied file path (MCP tools, ingest, rollback, lineage, domain registry) must resolve under this root. Paths that escape it return an error instead of executing. Unset = no containment; local-first default. See Path sandbox below. |
GOLDENMATCH_MCP_TOKEN | unset | Bearer token for the MCP HTTP server. Required on non-loopback binds (fail-closed). |
GOLDENMATCH_API_TOKEN | unset | Bearer token for the REST API server. Required on non-loopback binds. |
GOLDENMATCH_AGENT_TOKEN | unset | Bearer token for the A2A agent server. Required on non-loopback binds. |
GOLDENMATCH_ENABLE_DISTRIBUTED_RAY | 0 | Enable the Ray distributed backend. |
GOLDENMATCH_NATIVE_RAYON_MIN_PAIRS | 20000000 | Candidate-pair threshold below which the native kernel scores in the calling thread (avoids rayon futex contention on small workloads). |
GOLDENMATCH_BUCKET_DEBUG | 0 | When 1, prints per-bucket prep/kernel/post-filter timing for backend=bucket. |
POLARS_SKIP_CPU_CHECK | unset | Set to 1 to skip the Polars WMI CPU check on Windows (avoids startup hang). |
Path sandbox
GOLDENMATCH_ALLOWED_ROOT activates an opt-in path containment layer
for deployed or shared instances. When set, safe_path() resolves every
incoming file argument with Path.resolve(), checks for NUL bytes, and
then asserts the resolved path falls under the root. Paths that fail
this check are rejected before any I/O occurs.
The Railway
goldenmatch-mcp service should set
GOLDENMATCH_ALLOWED_ROOT=/data to scope the MCP tools to the mounted
data volume. See MCP Server.Settings persistence
- Global:
~/.goldenmatch/settings.yaml— output mode, default model, API keys - Project:
.goldenmatch.yaml— column mappings, thresholds, blocking config
Programmatic config
Verification (v1.5.0)
auto_configure_df runs preflight at the end of config generation — 6 checks that auto-repair missing domain-extracted columns, drop useless-cardinality exact matchkeys, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise ConfigValidationError; the full report is attached to the exception as err.report.
The pipeline runs postflight after scoring and before clustering — 4 signals (score histogram + bimodality, blocking recall, cluster sizes + bottleneck pairs, threshold-band overlap) that can auto-nudge the threshold on clear bimodality and attach the report to DedupeResult.postflight_report / MatchResult.postflight_report.
Two new kwargs on auto_configure_df:
preflight / postflight signatures and the PostflightSignals schema.
Learning Memory (v1.6.0)
The optionalmemory: section enables persistent corrections. Once a steward, agent, or LLM decides a pair, that decision is stored, re-anchored across row reorders by record_hash, and applied automatically on every subsequent dedupe_df / match_df call. After 10+ corrections accumulate against a matchkey, the learner adjusts that matchkey’s threshold for the next run. Off by default; enable via the YAML block above or config.memory.enabled = True.
| Field | Default | Notes |
|---|---|---|
enabled | false | Zero-config preserved. Enabling does not change pipeline output until corrections exist. |
backend | "sqlite" | "postgres" requires pip install goldenmatch[postgres]. |
path | ".goldenmatch/memory.db" | SQLite file or full DSN for postgres. |
reanchor | true | Re-anchor by record_hash when row IDs miss; ambiguous re-anchors report stale_ambiguous. |
dataset | null | Tag corrections; isolates per-table memory in shared DBs. |
learning.threshold_min_corrections | 10 | Trust-weighted grid search runs once a matchkey crosses this floor. |
learning.weights_min_corrections | 50 | Field-weight learning is stubbed in v1.6.0 and returns null. |