Configuration

GoldenMatch uses YAML config files with Pydantic validation. Every section is optional — GoldenMatch auto-configures what you leave out.

Full YAML reference

matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

  - name: probabilistic_fs
    type: probabilistic
    em_iterations: 20
    convergence_threshold: 0.001
    fields:
      - field: first_name
        scorer: jaro_winkler
        levels: 3
        partial_threshold: 0.8
      - field: last_name
        scorer: jaro_winkler
        levels: 4
        # Explicit N-level banding: descending cutoffs, len == levels - 1.
        # A pair's level = how many thresholds its similarity satisfies.
        level_thresholds: [0.95, 0.88, 0.7]
      - field: zip
        scorer: exact
        levels: 2
    negative_evidence:
      # Fires (and subtracts weight) when both values are present AND
      # scorer(a, b) < threshold; agreement/missing never adds weight.
      # Omit penalty_bits for EM-learned; probabilistic matchkeys reject
      # `penalty` (that knob is weighted/exact-only) -- see Scoring.
      - field: phone
        scorer: exact
        threshold: 1.0
        # penalty_bits: 6.0   # optional: fixed log2 LLR override, skips EM

  - name: semantic
    type: weighted
    threshold: 0.80
    fields:
      - columns: [title, authors, venue]
        scorer: record_embedding
        weight: 1.0
        column_weights: {title: 2.0, authors: 1.0, venue: 0.5}

blocking:
  strategy: adaptive
  auto_select: true
  auto_suggest: true
  max_block_size: 5000
  skip_oversized: false
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  # Sorted neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
  # Multi-pass
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  union_mode: true
  # ANN blocking
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20
  # Learned blocking
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl
  # Canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500
  # MinHash/LSH (near-duplicate text blocking; set strategy: lsh to use it)
  lsh:
    column: description   # the text column to shingle
    mode: word            # word- or char-grams
    k: 2                  # shingle size
    num_perms: 128        # MinHash signature length
    threshold: 0.5        # similarity threshold (picks the band/row split);
                          # or set num_bands explicitly (must divide num_perms)
  # SimHash/LSH (semantic near-duplicate text blocking; set strategy: simhash to use it)
  simhash:
    column: description   # the text column to embed
    num_planes: 256       # SimHash signature length (hyperplane count)
    num_bands: 32         # band count (must divide num_planes)
    seed: 0               # deterministic plane set
    # threshold: 0.85     # alternative to num_bands: picks the band/row split
    # model: inhouse      # embedder; defaults to the in-house ER embedder

golden_rules:
  default_strategy: most_complete
  max_cluster_size: 100
  field_group_detection: false   # true enables auto-detection of correlated field groups
  field_groups:
    - name: mailing_address
      columns: [street, city, state, zip]
      category: address            # optional hint
      strategy: most_complete      # most_complete | source_priority | most_recent
  field_rules:
    email:
      strategy: majority_vote
    phone:
      - when: "state in ['CA', 'NY']"
        strategy: most_recent
        date_column: updated_at
        validate: nanp
      - strategy: source_priority  # default clause (no when:), must be last
        source_priority: [crm, billing]
    first_name:
      strategy: source_priority
      source_priority: [crm, marketing]
    updated_at:
      strategy: most_recent
      date_column: updated_at

standardization:
  rules:
    email: [email]
    first_name: [name_proper, strip]
    last_name: [name_proper, strip]
    phone: [phone]
    zip: [zip5]
    address: [address, strip]
    state: [state]

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: {pattern: "^.+@.+\\..+$"}
      action: flag
    - column: zip
      rule_type: min_length
      params: {length: 5}
      action: "null"
    - column: name
      rule_type: not_null
      action: quarantine

domain:
  enabled: true
  mode: product          # product | person | bibliographic | company | auto

llm_scorer:
  enabled: true
  provider: openai
  model: gpt-4o-mini
  auto_threshold: 0.95
  candidate_lo: 0.75
  candidate_hi: 0.95
  batch_size: 20
  mode: pairwise           # or "cluster" for in-context LLM clustering
  cluster_max_size: 100
  cluster_min_size: 5
  budget:
    max_cost_usd: 0.05
    max_calls: 100
    warn_at_pct: 80

memory:
  enabled: true                 # MemoryConfig.enabled defaults to True; memory is off only when the `memory:` section is absent. Omit the section for zero overhead.
  backend: sqlite               # sqlite | postgres
  path: .goldenmatch/memory.db  # sqlite path or postgres DSN
  reanchor: true                # default: true. Look up corrections by record_hash when row IDs miss.
  dataset: customers            # tag corrections so one DB can hold memory for many tables
  learning:
    threshold_min_corrections: 10   # learner runs once per matchkey at this floor
    weights_min_corrections: 50     # field-weight learner floor (stub in v1.6, returns null)

output:
  directory: ./output
  format: csv
  run_name: dedupe_run_001

backend: null              # null (Polars), "ray", or "duckdb"

Matchkeys

Three matchkey types:

Type	Description	Required Fields
`exact`	Binary match on transformed values	`field`, optional `transforms`
`weighted`	Weighted average of field scores	`field`, `scorer`, `weight`, `threshold`
`probabilistic`	Fellegi-Sunter log-likelihood ratios	`field`, `scorer`, optional `levels`, optional `level_thresholds`

Transforms

Applied to field values before scoring.

Transform	Description
`lowercase`	Convert to lowercase
`uppercase`	Convert to uppercase
`strip`	Remove leading/trailing whitespace
`strip_all`	Remove all whitespace
`soundex`	Soundex phonetic encoding
`metaphone`	Metaphone phonetic encoding
`digits_only`	Keep only digits
`alpha_only`	Keep only letters
`normalize_whitespace`	Collapse multiple spaces
`token_sort`	Sort tokens alphabetically
`first_token`	First whitespace-delimited token
`last_token`	Last whitespace-delimited token
`strip_honorifics`	Drop honorific/title/rank tokens (“Sir”, “Baronet”) from a name; a honorific-only value becomes missing
`substring:start:end`	Substring extraction
`qgram:n`	Q-gram tokenization
`bloom_filter` or `bloom_filter:ngram:k:size`	Bloom filter (for PPRL)
`legal_form_strip`	Strip corporate legal-form suffixes (Inc, LLC, Ltd, GmbH, S.A., …) — bundled refdata
`address_normalize`	USPS Pub. 28 street-suffix + unit-abbrev canonicalization (Avenue→AVE, Apartment→APT) — bundled refdata
`naics_normalize`	NAICS 2022 industry-code canonicalization (code-or-title input → canonical code) — bundled refdata

Refdata transforms are auto-prepended by the controller when a column name matches the relevant pattern AND its profiled col_type agrees. See Reference Data.

Scorers

Scorer	Description	Best For
`exact`	Binary 0/1 match	Email, phone, ID
`jaro_winkler`	Edit distance with prefix bonus	Names
`levenshtein`	Normalized Levenshtein distance	General strings
`token_sort`	Order-invariant token matching	Names, addresses
`soundex_match`	Phonetic match	Names
`ensemble`	max(jaro_winkler, token_sort, soundex)	Names with reordering
`embedding`	Cosine similarity of embeddings	Semantic matching
`record_embedding`	Concatenated multi-field embeddings	Cross-field semantic
`dice`	Dice coefficient on bloom filters	PPRL
`jaccard`	Jaccard similarity on bloom filters	PPRL
`name_freq_weighted_jw`	Surname IDF-weighted Jaro-Winkler — bundled refdata	`last_name` / `surname`
`given_name_aliased_jw`	Alias-aware Jaro-Winkler — bundled refdata	`first_name` / `given_name`

Cross-encoder reranking

Add rerank: true to a weighted matchkey to re-score borderline pairs with a cross-encoder model:

matchkeys:
  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1       # pairs within threshold +/- 0.1 get reranked
    rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2

Blocking

Strategy	Description
`static`	Group by blocking key (default)
`adaptive`	Static + recursive sub-blocking for oversized blocks
`sorted_neighborhood`	Sliding window over sorted records
`multi_pass`	Union of blocks from multiple passes
`ann`	ANN via FAISS on embeddings
`ann_pairs`	Direct-pair ANN scoring (50—100x faster than `ann`)
`canopy`	TF-IDF canopy clustering
`learned`	Data-driven predicate selection

Set auto_select: true to auto-pick the best blocking key by histogram analysis. Set auto_suggest: true to get blocking suggestions when no keys are specified.

Golden rules

Eight merge strategies for building canonical records:

Strategy	Description
`most_complete`	Pick value with fewest nulls
`majority_vote`	Most common value across cluster members
`source_priority`	Prefer values from specified sources (requires `source_priority` list)
`most_recent`	Latest value by date (requires `date_column`)
`first_non_null`	First non-null value encountered
`longest_value`	Pick the longest value
`unanimous_or_null`	Keep the value only when all members agree, else null
`confidence_majority`	Confidence-weighted majority vote

Set a default strategy and override per field:

golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    name: { strategy: source_priority, source_priority: [crm, erp] }

Lock-step field groups

By default GoldenMatch picks the best value for each column independently. That can produce “Frankenstein” records where, say, a street address comes from one source while the city and zip come from another. field_groups prevents this by pinning a set of columns to a single winning source record.

golden_rules:
  default_strategy: most_complete
  field_groups:
    - name: mailing_address
      columns: [street, city, state, zip]   # must include >= 2 source columns
      category: address                      # optional hint (address | person_name | contact)
      strategy: most_complete                # most_complete | source_priority | most_recent
      # For strategy: most_recent, also set:
      # date_column: updated_at
      # For strategy: source_priority, also set:
      # source_priority: [crm, billing]

The winner is chosen by the group’s strategy applied across the group’s columns as a unit. All columns in the group then take their values from that one record — including that record’s nulls. There is no column-by-column fallback within a group (strict lock-step). Two constraints the config loader enforces:

Columns must be disjoint across groups (a column can belong to at most one group).
A column in a field_groups entry cannot also have its own field_rules entry.

Auto-detection (optional, default off). Set field_group_detection: true in golden_rules, or export GOLDENMATCH_FIELD_GROUP_SURVIVORSHIP=1, to let auto-config propose groups. The heuristic is conservative — it targets well-known correlated patterns (mailing address, person name, contact block) and optionally uses a GoldenCheck domain pack for infermap-fed detection. Explicitly declared field_groups are always honored regardless of whether the flag is set.

Correlated survivorship resolves on the driver (the in-memory golden builder). The Ray Phase-5 streaming pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) and the Sail backend reject configs that include field_groups or conditional field_rules with a clear error. Run those configs on the in-memory path.

Vectorized resolution (automatic). The in-memory golden builder resolves field_groups, scalar fields, allow_fill, anchor, and conditional field_rules through a vectorized Polars path when building plain golden records (no per-field provenance requested). It is selected automatically when every strategy in the config is one the vectorized path reproduces exactly; any config using a custom:, confidence_majority, majority_vote, or unanimous_or_null strategy — or one requesting per-field provenance — falls back to the per-cluster path. Output is byte-identical either way (values and confidences); the only difference is speed. No configuration is required. The richer per-field source provenance surfaced by save_lineage / explain --cluster is built on the per-cluster path.

Surfacing the group decision

When a config uses field_groups (or conditional/validated field_rules), GoldenMatch surfaces the survivorship decision in three places:

Lineage JSON (save_lineage / goldenmatch lineage --output-dir ...): a top-level golden_records array, one entry per cluster, each with a groups array (name, columns, winner_row_id, winner_source, strategy, tie, confidence) and a human-readable audit string. Example: street, city, zip promoted together from record 7 via most_complete (group 'mailing_address').
CLI goldenmatch explain --cluster <id>: the output gains a Survivorship: block with the same audit lines (group promotions and any conditional/validated field notes).
MCP lineage tool: the result includes the same golden_records section so agents can read which columns were promoted lock-step and from which source.

For plain configs with no field_groups and no conditional field_rules, output is byte-identical to prior behavior — none of these surfaces add anything.

anchor strategy and allow_fill

strategy: anchor selects the winning record by first restricting candidates to those where the named anchor column is non-null, then applying most_complete within that subset. If no record has the anchor present, the group falls back to plain most_complete. Use this when one column (such as a plan or product ID) anchors the rest of the group’s fields.

field_groups:
  - name: plan_block
    columns: [plan_id, plan_name, plan_tier]
    strategy: anchor
    anchor: plan_id

allow_fill: true relaxes strict lock-step for the winning record’s null cells. After the winner is selected, each null cell in the winner is back-filled from the best other record (chosen by the group’s strategy) that has a value for that cell. The lineage audit records a <group>: <col> back-filled from record N line for every filled cell, and the filled map in the group provenance tracks which row supplied each value. Off by default; when off, the winner’s nulls are kept as-is.

field_groups:
  - name: plan_block
    columns: [plan_id, plan_name, plan_tier]
    strategy: anchor
    anchor: plan_id
    allow_fill: true

Conditional and validated field rules

A field_rules entry can be a list of clauses instead of a single strategy. The first clause whose when: predicate is satisfied wins; the last clause must have no when: and acts as the default.

golden_rules:
  field_rules:
    phone:
      - when: "state in ['CA', 'NY']"
        strategy: most_recent
        date_column: updated_at
        validate: nanp
      - strategy: source_priority   # default clause (no when:), must be last
        source_priority: [crm, billing]

when: predicate syntax. The predicate is a safe boolean mini-language, not Python eval. Supported operators: == != < <= > >= in not in and or not. Names resolve to already-resolved golden fields. Literals: list/tuple, numeric, string, boolean. Unsupported: function calls, attribute access, indexing, arithmetic. If a predicate references a field that is absent or None for a given cluster, the predicate is treated as unsatisfied (that clause is skipped) and evaluation continues to the next clause. validate:. An optional candidate filter applied before the merge strategy. Candidates that fail the validator are dropped from consideration before the strategy runs (fail-open: an unknown validator name drops nothing). Only nanp / phone are friendly aliases; every other validator is referenced by its full GoldenFlow transform name:

Name	Resolves to
`nanp` or `phone`	`phone_validate` (aliases)
`email_validate`	email format validator
`npi_validate`	NPI checksum validator
`date_validate`	date-parse validator

Lineage. The audit trail records how each field’s value was chosen. For a field group the entry reads like street, city, state, zip promoted together from record 7 via most_complete (group 'mailing_address'). For a conditional rule it reads like phone used most_recent because state in ['CA', 'NY'].

Standardization

Map column names to standardizer functions:

standardization:
  rules:
    email: [email]
    phone: [phone]
    zip: [zip5]
    first_name: [name_proper, strip]
    address: [address, strip]
    state: [state]

Standardizer	Description
`email`	Lowercase, strip, validate format
`name_proper`	Title case
`name_upper`	Uppercase
`name_lower`	Lowercase
`phone`	Strip non-digits, normalize format
`zip5`	First 5 digits
`address`	Normalize abbreviations (St->Street, etc.)
`state`	Normalize state abbreviations
`strip`	Remove leading/trailing whitespace
`trim_whitespace`	Collapse multiple spaces

Validation

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: { pattern: "^.+@.+\\..+$" }
      action: flag
    - column: name
      rule_type: not_null
      action: quarantine
    - column: zip
      rule_type: min_length
      params: { length: 5 }
      action: "null"

Rule types: regex, min_length, max_length, not_null, in_set, format. Actions: flag (mark but keep), "null" (set to null — quote it so YAML keeps the string; bare null parses as None and fails validation), quarantine (remove from matching).

Environment variables

The table below covers the security/sandbox variables most relevant to a YAML config. For the complete, verified reference of every GOLDENMATCH_* variable — the native gate, backend selection, the distributed pipeline, perf opt-ins, and when to set or avoid each — see Tuning & opt-ins.

Variable	Default	Description
`GOLDENMATCH_ALLOWED_ROOT`	unset	When set, every user-supplied file path (MCP tools, ingest, rollback, lineage, domain registry) must resolve under this root. Paths that escape it return an error instead of executing. Unset = no containment; local-first default. See Path sandbox below.
`GOLDENMATCH_MCP_TOKEN`	unset	Bearer token for the MCP HTTP server. Required on non-loopback binds (fail-closed).
`GOLDENMATCH_API_TOKEN`	unset	Bearer token for the REST API server. Required on non-loopback binds.
`GOLDENMATCH_AGENT_TOKEN`	unset	Bearer token for the A2A agent server. Required on non-loopback binds.
`GOLDENMATCH_ENABLE_DISTRIBUTED_RAY`	`0`	Enable the Ray distributed backend.
`GOLDENMATCH_NATIVE_RAYON_MIN_PAIRS`	`20000000`	Candidate-pair threshold below which the native kernel scores in the calling thread (avoids rayon futex contention on small workloads).
`GOLDENMATCH_BUCKET_DEBUG`	`0`	When `1`, prints per-bucket prep/kernel/post-filter timing for `backend=bucket`.
`POLARS_SKIP_CPU_CHECK`	unset	Set to `1` to skip the Polars WMI CPU check on Windows (avoids startup hang).

Path sandbox

GOLDENMATCH_ALLOWED_ROOT activates an opt-in path containment layer for deployed or shared instances. When set, safe_path() resolves every incoming file argument with Path.resolve(), checks for NUL bytes, and then asserts the resolved path falls under the root. Paths that fail this check are rejected before any I/O occurs.

# Deployed MCP server on Railway: restrict to the /data volume
export GOLDENMATCH_ALLOWED_ROOT=/data
goldenmatch mcp-serve --transport http --port 8200

Unset (the default) means no containment check — any accessible path is valid. This is the correct default for a local-first desktop tool.

The Railway goldenmatch-mcp service should set GOLDENMATCH_ALLOWED_ROOT=/data to scope the MCP tools to the mounted data volume. See MCP Server.

Settings persistence

Global: ~/.goldenmatch/settings.yaml — output mode, default model, API keys
Project: .goldenmatch.yaml — column mappings, thresholds, blocking config

Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.

Programmatic config

import goldenmatch as gm

config = gm.GoldenMatchConfig(
    matchkeys=[
        gm.MatchkeyConfig(name="exact_email", type="exact",
            fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
        gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
            fields=[
                gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
                gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
            ]),
    ],
    blocking=gm.BlockingConfig(strategy="learned"),
    llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
    backend="ray",
)

result = gm.dedupe("data.csv", config=config)

Or auto-generate from data:

config = gm.auto_configure([("data.csv", "source")])

Verification (v1.5.0)

auto_configure_df runs preflight at the end of config generation — 6 checks that auto-repair missing domain-extracted columns, drop useless-cardinality exact matchkeys, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise ConfigValidationError; the full report is attached to the exception as err.report. The pipeline runs postflight after scoring and before clustering — 4 signals (score histogram + bimodality, blocking recall, cluster sizes + bottleneck pairs, threshold-band overlap) that can auto-nudge the threshold on clear bimodality and attach the report to DedupeResult.postflight_report / MatchResult.postflight_report. Two new kwargs on auto_configure_df:

import goldenmatch as gm

# Offline-safe (default): remote-asset scorers demoted, postflight may adjust threshold
cfg = gm.auto_configure_df(df)

# Opt in to cross-encoder rerank / embedding scorers
cfg = gm.auto_configure_df(df, allow_remote_assets=True)

# Strict: compute signals + advisories, but suppress auto-adjustments (DQBench, regression)
cfg = gm.auto_configure_df(df, strict=True)

The preflight report is available on the returned config (underscore is private-by-convention but stable across v1.5.x):

cfg = gm.auto_configure_df(df)
for finding in cfg._preflight_report.findings:
    print(f"[{finding.severity}] {finding.check}: {finding.message}")

See the Verification section in the Python API docs for the full preflight / postflight signatures and the PostflightSignals schema.

Learning Memory (v1.6.0)

The optional memory: section enables persistent corrections. Once a steward, agent, or LLM decides a pair, that decision is stored, re-anchored across row reorders by record_hash, and applied automatically on every subsequent dedupe_df / match_df call. After 10+ corrections accumulate against a matchkey, the learner adjusts that matchkey’s threshold for the next run. Off by default; enable via the YAML block above or config.memory.enabled = True.

Field	Default	Notes
`enabled`	`true` (when the `memory:` section is present)	Memory is off by default because the section is absent; once you add a `memory:` block it defaults on. Enabling does not change pipeline output until corrections exist.
`backend`	`"sqlite"`	`"postgres"` requires `pip install goldenmatch[postgres]`.
`path`	`".goldenmatch/memory.db"`	SQLite file or full DSN for postgres.
`reanchor`	`true`	Re-anchor by `record_hash` when row IDs miss; ambiguous re-anchors report `stale_ambiguous`.
`dataset`	`null`	Tag corrections; isolates per-table memory in shared DBs.
`learning.threshold_min_corrections`	`10`	Trust-weighted grid search runs once a matchkey crosses this floor.
`learning.weights_min_corrections`	`50`	Field-weight learning is stubbed in v1.6.0 and returns `null`.

Full guide: Learning Memory.

Throughput config

The optional throughput: section (or the throughput= kwarg on dedupe_df) enables the sketch-then-verify near-duplicate tier. Off by default; enabling it does not affect any run where the kwarg is None. Full reference: Throughput tier.

from goldenmatch import ThroughputConfig

result = gm.dedupe_df(df, throughput=ThroughputConfig(
    recall_target=0.95,       # LSH recall target; tunes banding
    similarity_threshold=0.8, # sketch-distance cutoff at verify
))

Field	Default	Notes
`enabled`	`true` when a `ThroughputConfig` is passed	Normally you pass `throughput=True` or `throughput=0.95` rather than constructing `ThroughputConfig` directly.
`recall_target`	`0.95`	LSH-theoretic recall target (0 to 1). Higher = more LSH bands = more candidate pairs = more wall time.
`similarity_threshold`	`0.8` (Jaccard) / `0.85` (cosine)	Sketch-distance cutoff applied at verify. Pairs below this are dropped without field-level scoring.

ThroughputConfig is re-exported from the top level. The throughput= kwarg also accepts True (uses defaults), a float (sets recall_target), or None (off, byte-identical to today).

Get started

Concepts

GoldenMatch

GoldenCheck

GoldenFlow

GoldenPipe

GoldenAnalysis

InferMap

SQL extensions

Reference

Research

Full YAML reference

Matchkeys

Transforms

Scorers

Cross-encoder reranking

Blocking

Golden rules

Lock-step field groups

Surfacing the group decision

anchor strategy and allow_fill

Conditional and validated field rules

Standardization

Validation

Environment variables

Path sandbox

Settings persistence

Programmatic config

Verification (v1.5.0)

Learning Memory (v1.6.0)

Throughput config

​Full YAML reference

​Matchkeys

​Transforms

​Scorers

​Cross-encoder reranking

​Blocking

​Golden rules

​Lock-step field groups

​Surfacing the group decision

​anchor strategy and allow_fill

​Conditional and validated field rules

​Standardization

​Validation

​Environment variables

​Path sandbox

​Settings persistence

​Programmatic config

​Verification (v1.5.0)

​Learning Memory (v1.6.0)

​Throughput config

Full YAML reference

Matchkeys

Transforms

Scorers

Cross-encoder reranking

Blocking

Golden rules

Lock-step field groups

Surfacing the group decision

anchor strategy and allow_fill

Conditional and validated field rules

Standardization

Validation

Environment variables

Path sandbox

Settings persistence

Programmatic config

Verification (v1.5.0)

Learning Memory (v1.6.0)

Throughput config