Auto-config

The introspective AutoConfig controller is what lets GoldenMatch beat hand-tuned baselines with no user-supplied configuration. It detects column types, selects scorers, picks a blocking strategy, then iterates on signals the pipeline emits until it converges on a defensible config.

What it does

Given a file, the controller:

Detects column types (name, email, phone, zip, address, description).
Selects appropriate scorers per column.
Picks a blocking strategy.
Runs the pipeline and reads back complexity signals.
Applies refinement rules and repeats until health stops improving.
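
The steps above amount to a run-inspect-refine loop. The sketch below is illustrative only: `Config`, `refine_until_stable`, `toy_pipeline`, and `toy_refine` are hypothetical names, not GoldenMatch APIs, and a single threshold stands in for a full config.

```python
from dataclasses import dataclass
from typing import Callable

# Health labels ordered worst to best, mirroring the GREEN/YELLOW/RED ranking.
HEALTH_RANK = {"RED": 0, "YELLOW": 1, "GREEN": 2}


@dataclass(frozen=True)
class Config:
    threshold: float  # hypothetical single knob standing in for a full config


def refine_until_stable(
    initial: Config,
    run_pipeline: Callable[[Config], str],    # returns a health label
    refine: Callable[[Config, str], Config],  # proposes the next config
    max_iters: int = 5,
) -> tuple[Config, str]:
    """Keep the best config seen and stop when health stops improving."""
    best, best_health = initial, run_pipeline(initial)
    for _ in range(max_iters):
        candidate = refine(best, best_health)
        health = run_pipeline(candidate)
        if HEALTH_RANK[health] <= HEALTH_RANK[best_health]:
            break  # no improvement: settle on the best config so far
        best, best_health = candidate, health
    return best, best_health


# Toy stand-ins so the loop can be run end to end.
def toy_pipeline(cfg: Config) -> str:
    if cfg.threshold >= 0.85:
        return "GREEN"
    if cfg.threshold >= 0.75:
        return "YELLOW"
    return "RED"


def toy_refine(cfg: Config, health: str) -> Config:
    return Config(threshold=round(cfg.threshold + 0.1, 2))


best, health = refine_until_stable(Config(0.7), toy_pipeline, toy_refine)
```

In this toy run the loop moves from RED to YELLOW to GREEN and stops once a further step no longer improves health.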

Signals it watches

The controller iterates on a ComplexityProfile:

Block-size distribution (p50 / p95 / p99).
Score histogram (bimodality detection).
Transitivity rate.
Borderline mass (pairs near the threshold).
Negative-evidence collision rate (v1.11+).
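
To make two of these signals concrete, here is a minimal sketch of how block-size percentiles and borderline mass could be computed with the standard library. The function names and the 0.05 band around the threshold are assumptions for illustration; the source does not say how ComplexityProfile computes them.

```python
from statistics import quantiles


def block_size_percentiles(block_sizes):
    """Return p50 / p95 / p99 of the block-size distribution."""
    cuts = quantiles(block_sizes, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def borderline_mass(scores, threshold, band=0.05):
    """Fraction of scored pairs within `band` of the threshold (band is assumed)."""
    if not scores:
        return 0.0
    near = sum(1 for s in scores if abs(s - threshold) <= band)
    return near / len(scores)


sizes = block_size_percentiles([1, 2, 2, 3, 50])
mass = borderline_mass([0.2, 0.78, 0.82, 0.95], threshold=0.8)
```

A heavy tail shows up as a p99 far above the p50, and a high borderline mass means many pair decisions would flip with a small threshold change.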

Refinement rules

The HeuristicRefitPolicy applies rules such as:

Rule	Effect
`rule_no_matches`	Raises the threshold if no pairs are found.
`rule_blocking_key_swap`	Switches to an orthogonal blocking key on low recall.
`rule_corruption_normalize`	Adds normalization for corrupted identity columns.
`rule_sparse_match_expand`	Lowers the threshold and adds side-channel blocking for sparse matches.
`promote_negative_evidence`	Adds negative-evidence penalties for identity-discriminating columns (v1.11).
`rule_precision_anchor_threshold_raise`	Raises the weighted threshold to 0.9 on the precision-collapse shape: near-total scored mass above the threshold on a name-dominated weighted matchkey that also carries a strong exact identity anchor.

Configs are ranked by a health metric, GREEN > YELLOW > RED, with the initial config as a virtual fallback.

Precision-collapse detection. The precision-anchor rule doubles as a labels-free precision-collapse detector at commit time. Once it has fired, commit selection rank-demotes any candidate config whose shape still trips the rule's trigger. In addition, the score-distribution unimodality (dip) gate reads RED only when at least 30 scored pairs back it; a flat dip over fewer pairs is sampling noise, not evidence. Measured on the crafted over-merge fixture, precision rose from 0.009 to 0.9868 at recall 1.0, with NCVR unaffected.
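
The health ranking, the rank-demotion, and the 30-pair floor on the dip gate can be sketched as follows. This is a guess at the shape of the logic, not the controller's actual code: `Candidate`, `dip_gate`, and `select_commit` are hypothetical names, and the tie-breaking in favor of refined candidates over the initial config is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

HEALTH_RANK = {"RED": 0, "YELLOW": 1, "GREEN": 2}
MIN_PAIRS_FOR_DIP_RED = 30  # the pair-count floor stated in the text


@dataclass(frozen=True)
class Candidate:
    name: str
    health: str
    trips_precision_trigger: bool = False


def dip_gate(is_unimodal: bool, n_scored_pairs: int) -> Optional[str]:
    """Return "RED" only when a flat dip is backed by enough scored pairs."""
    if is_unimodal and n_scored_pairs >= MIN_PAIRS_FOR_DIP_RED:
        return "RED"
    return None  # too few pairs or not unimodal: no verdict from this gate


def select_commit(initial, candidates, rule_fired):
    """Pick the healthiest config, demoting any that still trip the trigger."""
    # The initial config goes last so a refined candidate wins ties (assumed).
    pool = [*candidates, initial]

    def key(c):
        demoted = rule_fired and c.trips_precision_trigger
        return (not demoted, HEALTH_RANK[c.health])

    return max(pool, key=key)


initial = Candidate("initial", "YELLOW")
candidates = [Candidate("a", "GREEN", trips_precision_trigger=True),
              Candidate("b", "YELLOW")]
chosen_after_rule = select_commit(initial, candidates, rule_fired=True)
chosen_without_rule = select_commit(initial, candidates, rule_fired=False)
```

With the rule fired, candidate `a` is demoted despite its GREEN health, so `b` is committed; without it, `a` wins on health alone.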

At 100,000+ rows, auto-config raises ControllerNotConfidentError rather than silently committing a low-confidence (RED) config. Handle this explicitly instead of adding a silent fallback path.
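
One way to handle the error explicitly is to log it and let it propagate so the job fails visibly. In the sketch below, `ControllerNotConfidentError` is a local stand-in class (the real exception's import path is not shown in this page), and `run_dedupe` and `fake_dedupe` are hypothetical helpers.

```python
import logging

logger = logging.getLogger("dedupe_job")


class ControllerNotConfidentError(Exception):
    """Local stand-in for the library exception; import path is an assumption."""


def run_dedupe(df, dedupe):
    """Call `dedupe(df)` and surface a low-confidence refusal instead of hiding it."""
    try:
        return dedupe(df)
    except ControllerNotConfidentError:
        logger.error(
            "auto-config refused to commit a RED config; "
            "supply an explicit config or review the data"
        )
        raise  # no silent fallback: the caller must decide what happens next


# Stand-in for a dedupe call that hits the low-confidence path.
def fake_dedupe(df):
    raise ControllerNotConfidentError("health is RED")


try:
    run_dedupe([], fake_dedupe)
    escalated = False
except ControllerNotConfidentError:
    escalated = True
```

Re-raising keeps the failure visible to schedulers and alerting, which is the point of the explicit error.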

Planning effort

The planning-effort tier controls how hard the controller searches for a config. Pass it as a planning_effort= kwarg to dedupe_df / match_df / auto_configure_df, set it on a GoldenMatchConfig, or use the GOLDENMATCH_PLANNING_EFFORT env var. The default, normal, is byte-for-byte identical to the prior behavior.

Tier	What it does
`fast`	A single cheap pass — no refit breadth, tight wall budget.
`normal` (default)	Today’s interactive budget: sqrt-scaled sample, a few refit iterations, linear pair-count extrapolation.
`thinking`	Larger sample + more iterations + a longer budget, and measures real blocking on the full frame to pick the backend instead of extrapolating from the sample.
`einstein`	The widest search — the largest sample, the most iterations, and the longest budget.

```python
import goldenmatch as gm

# Spend more search effort on a tricky dataset.
# `df` is a DataFrame you have already loaded.
result = gm.dedupe_df(df, planning_effort="thinking")
```

Because block scoring is now ~5x faster (bucket+native), the higher tiers measure the true candidate-pair count on the full data rather than projecting it from a 2K–20K sample — which removes the wrong-backend-on-skewed-data failure. Any measurement failure falls back to extrapolation.
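
The skew failure is easy to reproduce with toy data. The sketch below assumes "linear extrapolation" means scaling the sample's pair count by the row ratio; the source does not give the exact formula, and `candidate_pairs` and `extrapolate_linear` are hypothetical names.

```python
from collections import Counter


def candidate_pairs(keys):
    """Exact pair count: a block of size n contributes n * (n - 1) / 2 pairs."""
    return sum(n * (n - 1) // 2 for n in Counter(keys).values())


def extrapolate_linear(sample_keys, full_rows):
    """Scale the sample's pair count by the row ratio (assumed form of 'linear')."""
    return candidate_pairs(sample_keys) * full_rows / len(sample_keys)


# Skewed blocking key: one hot value covers half of 1,000 rows, the rest are unique.
keys = ["hot"] * 500 + [f"k{i}" for i in range(500)]
sample = keys[::10]  # deterministic 10% sample, for reproducibility

measured = candidate_pairs(keys)
projected = extrapolate_linear(sample, full_rows=len(keys))
```

Pair counts grow quadratically with block size, so the linear projection here (12,250) undershoots the measured count (124,750) by about 10x, which is enough to steer a backend choice the wrong way.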

In-house embeddings

When you point auto-config at the local in-house embedding model (a matchkey field with model="inhouse:/path", or GOLDENMATCH_EMBEDDING_PROVIDER=inhouse + GOLDENMATCH_INHOUSE_MODEL), it is treated as a local, offline-safe scorer and is not demoted as a remote-asset drift risk. Cloud embedders (sentence-transformers / Vertex) still require allow_remote_assets=True.

Environment variables

Variable	Effect
`GOLDENMATCH_AUTOCONFIG_MEMORY=0`	Disable cross-run memory (default on; stored at `~/.goldenmatch/autoconfig_memory.db`).
`GOLDENMATCH_AUTOCONFIG_LLM=1`	Enable the LLM fallback policy (requires `OPENAI_API_KEY`).
`GOLDENMATCH_AUTOCONFIG_INDICATOR_BUDGET=fast|full`	Run the cheap or the full indicator set (default `fast`).
`GOLDENMATCH_PLANNING_EFFORT=fast|normal|thinking|einstein`	Planning-effort tier (default `normal`). Higher tiers search wider and measure full-frame blocking.
`GOLDENMATCH_EMBEDDING_PROVIDER=inhouse` + `GOLDENMATCH_INHOUSE_MODEL=<path>`	Treat the local in-house embedding model as an offline-safe scorer (not demoted as a remote asset).
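
These variables can also be set from Python before the library runs. The sketch below uses only the variable names and values from the table; `set_planning_effort` is a hypothetical helper, and it is an assumption that the library reads the environment at import or call time rather than earlier.

```python
import os

VALID_TIERS = ("fast", "normal", "thinking", "einstein")


def set_planning_effort(tier: str) -> None:
    """Validate the tier name locally, then export it for the library to read."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown planning effort {tier!r}; expected one of {VALID_TIERS}")
    os.environ["GOLDENMATCH_PLANNING_EFFORT"] = tier


set_planning_effort("thinking")
os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"               # disable cross-run memory
os.environ["GOLDENMATCH_AUTOCONFIG_INDICATOR_BUDGET"] = "full"  # run the full indicator set
```

Validating the tier up front catches typos before they reach the library.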

Cross-run memory

By default the controller remembers what worked on prior runs and seeds future runs from it. Disable it with GOLDENMATCH_AUTOCONFIG_MEMORY=0.

Scale envelope

The block-size guards the controller respects, and how it picks a backend.
