Skip to main content
GoldenMatch uses YAML config files with Pydantic validation. Every section is optional — GoldenMatch auto-configures what you leave out.

Full YAML reference

matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

  - name: probabilistic_fs
    type: probabilistic
    em_iterations: 20
    convergence_threshold: 0.001
    fields:
      - field: first_name
        scorer: jaro_winkler
        levels: 3
        partial_threshold: 0.8
      - field: last_name
        scorer: jaro_winkler
        levels: 2
      - field: zip
        scorer: exact
        levels: 2

  - name: semantic
    type: weighted
    threshold: 0.80
    fields:
      - columns: [title, authors, venue]
        scorer: record_embedding
        weight: 1.0
        column_weights: {title: 2.0, authors: 1.0, venue: 0.5}

blocking:
  strategy: adaptive
  auto_select: true
  auto_suggest: true
  max_block_size: 5000
  skip_oversized: false
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  # Sorted neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
  # Multi-pass
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  union_mode: true
  # ANN blocking
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20
  # Learned blocking
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl
  # Canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500

golden_rules:
  default_strategy: most_complete
  max_cluster_size: 100
  field_rules:
    email:
      strategy: majority_vote
    first_name:
      strategy: source_priority
      source_priority: [crm, marketing]
    updated_at:
      strategy: most_recent
      date_column: updated_at

standardization:
  rules:
    email: [email]
    first_name: [name_proper, strip]
    last_name: [name_proper, strip]
    phone: [phone]
    zip: [zip5]
    address: [address, strip]
    state: [state]

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: {pattern: "^.+@.+\\..+$"}
      action: flag
    - column: zip
      rule_type: min_length
      params: {length: 5}
      action: null
    - column: name
      rule_type: not_null
      action: quarantine

domain:
  enabled: true
  pack: electronics

llm_scorer:
  enabled: true
  provider: openai
  model: gpt-4o-mini
  auto_threshold: 0.95
  candidate_lo: 0.75
  candidate_hi: 0.95
  batch_size: 20
  mode: pairwise           # or "cluster" for in-context LLM clustering
  cluster_max_size: 100
  cluster_min_size: 5
  budget:
    max_cost_usd: 0.05
    max_calls: 100
    warn_at_pct: 80

memory:
  enabled: true                 # default: false. Off => no memory work, zero overhead.
  backend: sqlite               # sqlite | postgres
  path: .goldenmatch/memory.db  # sqlite path or postgres DSN
  reanchor: true                # default: true. Look up corrections by record_hash when row IDs miss.
  dataset: customers            # tag corrections so one DB can hold memory for many tables
  learning:
    threshold_min_corrections: 10   # learner runs once per matchkey at this floor
    weights_min_corrections: 50     # field-weight learner floor (stub in v1.6, returns null)

output:
  directory: ./output
  format: csv
  run_name: dedupe_run_001

backend: null              # null (Polars), "ray", or "duckdb"

Matchkeys

Three matchkey types:
TypeDescriptionRequired Fields
exactBinary match on transformed valuesfield, optional transforms
weightedWeighted average of field scoresfield, scorer, weight, threshold
probabilisticFellegi-Sunter log-likelihood ratiosfield, scorer, optional levels

Transforms

Applied to field values before scoring.
TransformDescription
lowercaseConvert to lowercase
uppercaseConvert to uppercase
stripRemove leading/trailing whitespace
strip_allRemove all whitespace
soundexSoundex phonetic encoding
metaphoneMetaphone phonetic encoding
digits_onlyKeep only digits
alpha_onlyKeep only letters
normalize_whitespaceCollapse multiple spaces
token_sortSort tokens alphabetically
first_tokenFirst whitespace-delimited token
last_tokenLast whitespace-delimited token
substring:start:endSubstring extraction
qgram:nQ-gram tokenization
bloom_filter or bloom_filter:ngram:k:sizeBloom filter (for PPRL)
legal_form_stripStrip corporate legal-form suffixes (Inc, LLC, Ltd, GmbH, S.A., …) — bundled refdata
address_normalizeUSPS Pub. 28 street-suffix + unit-abbrev canonicalization (Avenue→AVE, Apartment→APT) — bundled refdata
naics_normalizeNAICS 2022 industry-code canonicalization (code-or-title input → canonical code) — bundled refdata
Refdata transforms are auto-prepended by the controller when a column name matches the relevant pattern AND its profiled col_type agrees. See Reference Data.

Scorers

ScorerDescriptionBest For
exactBinary 0/1 matchEmail, phone, ID
jaro_winklerEdit distance with prefix bonusNames
levenshteinNormalized Levenshtein distanceGeneral strings
token_sortOrder-invariant token matchingNames, addresses
soundex_matchPhonetic matchNames
ensemblemax(jaro_winkler, token_sort, soundex)Names with reordering
embeddingCosine similarity of embeddingsSemantic matching
record_embeddingConcatenated multi-field embeddingsCross-field semantic
diceDice coefficient on bloom filtersPPRL
jaccardJaccard similarity on bloom filtersPPRL
name_freq_weighted_jwSurname IDF-weighted Jaro-Winkler — bundled refdatalast_name / surname
given_name_aliased_jwAlias-aware Jaro-Winkler — bundled refdatafirst_name / given_name

Cross-encoder reranking

Add rerank: true to a weighted matchkey to re-score borderline pairs with a cross-encoder model:
matchkeys:
  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1       # pairs within threshold +/- 0.1 get reranked
    rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2

Blocking

StrategyDescription
staticGroup by blocking key (default)
adaptiveStatic + recursive sub-blocking for oversized blocks
sorted_neighborhoodSliding window over sorted records
multi_passUnion of blocks from multiple passes
annANN via FAISS on embeddings
ann_pairsDirect-pair ANN scoring (50—100x faster than ann)
canopyTF-IDF canopy clustering
learnedData-driven predicate selection
Set auto_select: true to auto-pick the best blocking key by histogram analysis. Set auto_suggest: true to get blocking suggestions when no keys are specified.

Golden rules

Five merge strategies for building canonical records:
StrategyDescription
most_completePick value with fewest nulls
majority_voteMost common value across cluster members
source_priorityPrefer values from specified sources (requires source_priority list)
most_recentLatest value by date (requires date_column)
first_non_nullFirst non-null value encountered
Set a default strategy and override per field:
golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    name: { strategy: source_priority, source_priority: [crm, erp] }

Standardization

Map column names to standardizer functions:
standardization:
  rules:
    email: [email]
    phone: [phone]
    zip: [zip5]
    first_name: [name_proper, strip]
    address: [address, strip]
    state: [state]
StandardizerDescription
emailLowercase, strip, validate format
name_properTitle case
name_upperUppercase
name_lowerLowercase
phoneStrip non-digits, normalize format
zip5First 5 digits
addressNormalize abbreviations (St->Street, etc.)
stateNormalize state abbreviations
stripRemove leading/trailing whitespace
trim_whitespaceCollapse multiple spaces

Validation

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: { pattern: "^.+@.+\\..+$" }
      action: flag
    - column: name
      rule_type: not_null
      action: quarantine
    - column: zip
      rule_type: min_length
      params: { length: 5 }
      action: null
Rule types: regex, min_length, max_length, not_null, in_set, format. Actions: flag (mark but keep), null (set to null), quarantine (remove from matching).

Environment variables

Key environment variables that affect runtime behaviour:
VariableDefaultDescription
GOLDENMATCH_ALLOWED_ROOTunsetWhen set, every user-supplied file path (MCP tools, ingest, rollback, lineage, domain registry) must resolve under this root. Paths that escape it return an error instead of executing. Unset = no containment; local-first default. See Path sandbox below.
GOLDENMATCH_MCP_TOKENunsetBearer token for the MCP HTTP server. Required on non-loopback binds (fail-closed).
GOLDENMATCH_API_TOKENunsetBearer token for the REST API server. Required on non-loopback binds.
GOLDENMATCH_AGENT_TOKENunsetBearer token for the A2A agent server. Required on non-loopback binds.
GOLDENMATCH_ENABLE_DISTRIBUTED_RAY0Enable the Ray distributed backend.
GOLDENMATCH_NATIVE_RAYON_MIN_PAIRS20000000Candidate-pair threshold below which the native kernel scores in the calling thread (avoids rayon futex contention on small workloads).
GOLDENMATCH_BUCKET_DEBUG0When 1, prints per-bucket prep/kernel/post-filter timing for backend=bucket.
POLARS_SKIP_CPU_CHECKunsetSet to 1 to skip the Polars WMI CPU check on Windows (avoids startup hang).

Path sandbox

GOLDENMATCH_ALLOWED_ROOT activates an opt-in path containment layer for deployed or shared instances. When set, safe_path() resolves every incoming file argument with Path.resolve(), checks for NUL bytes, and then asserts the resolved path falls under the root. Paths that fail this check are rejected before any I/O occurs.
# Deployed MCP server on Railway: restrict to the /data volume
export GOLDENMATCH_ALLOWED_ROOT=/data
goldenmatch mcp-serve --transport http --port 8200
Unset (the default) means no containment check — any accessible path is valid. This is the correct default for a local-first desktop tool.
The Railway goldenmatch-mcp service should set GOLDENMATCH_ALLOWED_ROOT=/data to scope the MCP tools to the mounted data volume. See MCP Server.

Settings persistence

  • Global: ~/.goldenmatch/settings.yaml — output mode, default model, API keys
  • Project: .goldenmatch.yaml — column mappings, thresholds, blocking config
Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.

Programmatic config

import goldenmatch as gm

config = gm.GoldenMatchConfig(
    matchkeys=[
        gm.MatchkeyConfig(name="exact_email", type="exact",
            fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
        gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
            fields=[
                gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
                gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
            ]),
    ],
    blocking=gm.BlockingConfig(strategy="learned"),
    llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
    backend="ray",
)

result = gm.dedupe("data.csv", config=config)
Or auto-generate from data:
config = gm.auto_configure([("data.csv", "source")])

Verification (v1.5.0)

auto_configure_df runs preflight at the end of config generation — 6 checks that auto-repair missing domain-extracted columns, drop useless-cardinality exact matchkeys, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise ConfigValidationError; the full report is attached to the exception as err.report. The pipeline runs postflight after scoring and before clustering — 4 signals (score histogram + bimodality, blocking recall, cluster sizes + bottleneck pairs, threshold-band overlap) that can auto-nudge the threshold on clear bimodality and attach the report to DedupeResult.postflight_report / MatchResult.postflight_report. Two new kwargs on auto_configure_df:
import goldenmatch as gm

# Offline-safe (default): remote-asset scorers demoted, postflight may adjust threshold
cfg = gm.auto_configure_df(df)

# Opt in to cross-encoder rerank / embedding scorers
cfg = gm.auto_configure_df(df, allow_remote_assets=True)

# Strict: compute signals + advisories, but suppress auto-adjustments (DQBench, regression)
cfg = gm.auto_configure_df(df, strict=True)
The preflight report is available on the returned config (underscore is private-by-convention but stable across v1.5.x):
cfg = gm.auto_configure_df(df)
for finding in cfg._preflight_report.findings:
    print(f"[{finding.severity}] {finding.check}: {finding.message}")
See the Verification section in the Python API docs for the full preflight / postflight signatures and the PostflightSignals schema.

Learning Memory (v1.6.0)

The optional memory: section enables persistent corrections. Once a steward, agent, or LLM decides a pair, that decision is stored, re-anchored across row reorders by record_hash, and applied automatically on every subsequent dedupe_df / match_df call. After 10+ corrections accumulate against a matchkey, the learner adjusts that matchkey’s threshold for the next run. Off by default; enable via the YAML block above or config.memory.enabled = True.
FieldDefaultNotes
enabledfalseZero-config preserved. Enabling does not change pipeline output until corrections exist.
backend"sqlite""postgres" requires pip install goldenmatch[postgres].
path".goldenmatch/memory.db"SQLite file or full DSN for postgres.
reanchortrueRe-anchor by record_hash when row IDs miss; ambiguous re-anchors report stale_ambiguous.
datasetnullTag corrections; isolates per-table memory in shared DBs.
learning.threshold_min_corrections10Trust-weighted grid search runs once a matchkey crosses this floor.
learning.weights_min_corrections50Field-weight learning is stubbed in v1.6.0 and returns null.
Full guide: Learning Memory.