How the introspective AutoConfig controller detects column types, picks scorers and blocking, and iterates until it converges.
The introspective AutoConfig controller is what lets GoldenMatch beat hand-tuned baselines with no user-supplied configuration. It detects column types, selects scorers, picks a blocking strategy, then iterates on the signals the pipeline emits until it converges on a defensible config.
Switches to an orthogonal blocking key on low recall.
rule_corruption_normalize | Adds normalization for corrupted identity columns.
rule_sparse_match_expand | Lowers the threshold and adds side-channel blocking for sparse matches.
rule_demote_clustered_identity | Demotes exact matchkeys with a collision rate above 0.75 (v1.11).
promote_negative_evidence | Adds negative-evidence penalties for identity-discriminating columns (v1.11).
rule_negative_evidence_exact_filter | Applies negative-evidence penalties to exact matchkeys via a post-filter (Path Y, v1.12).
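As an illustration of the kind of check behind rule_demote_clustered_identity, here is a minimal sketch. The definition of "collision rate" used here (the fraction of non-null values that share their value with at least one other row) is an assumption, not taken from GoldenMatch, and the helper names collision_rate and should_demote are hypothetical. Only the 0.75 threshold comes from the rule description.

```python
from collections import Counter


def collision_rate(values):
    """Fraction of non-null values shared with at least one other row.

    Assumed definition for illustration; the library may compute it differently.
    """
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    counts = Counter(present)
    # Rows whose value occurs more than once are counted as colliding.
    colliding = sum(c for c in counts.values() if c > 1)
    return colliding / len(present)


def should_demote(values, threshold=0.75):
    """True when the column collides too often to serve as an exact matchkey."""
    return collision_rate(values) > threshold
```

Under this assumed definition, a column where four of five rows carry the same value has a collision rate of 0.8 and would be demoted, while a rate of exactly 0.75 would not.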
Configs are ranked by a health metric: GREEN > YELLOW > RED, with the initial config as a virtual fallback.
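The ranking can be pictured with a short sketch. The Health enum, the Candidate dataclass, and pick_best are hypothetical names for illustration, not GoldenMatch types. The tie-breaking rule (an equally healthy candidate does not displace the earlier one) is also an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Health(IntEnum):
    # Higher value means a healthier config: GREEN > YELLOW > RED.
    RED = 0
    YELLOW = 1
    GREEN = 2


@dataclass
class Candidate:
    name: str
    health: Health


def pick_best(initial, candidates):
    """Return the healthiest candidate, falling back to the initial config."""
    best = initial
    for cand in candidates:
        # Strict comparison: ties keep the earlier config (assumed behavior).
        if cand.health > best.health:
            best = cand
    return best
```

With this shape, the initial config wins whenever no iteration produces something strictly healthier, which is one way to read "virtual fallback".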
At 100,000+ rows, auto-config raises ControllerNotConfidentError rather than silently committing a low-confidence (RED) config. Handle this explicitly instead of adding a silent fallback path.
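One way to handle the error explicitly is sketched below. The import path of ControllerNotConfidentError is not shown in this document, so the class here is a local stand-in to replace with the library's own exception. The wrapper run_dedupe_explicitly is a hypothetical helper, and run_dedupe is any zero-argument callable you supply, for example a lambda wrapping gm.dedupe_df(df).

```python
import logging

logger = logging.getLogger("dedupe_job")


class ControllerNotConfidentError(Exception):
    """Local stand-in; swap for the exception class exported by the library."""


def run_dedupe_explicitly(run_dedupe):
    """Return (result, None) on success, or (None, reason) if auto-config is not confident.

    The caller decides what to do with the reason (review the data, supply a
    config by hand, and so on) instead of silently accepting a RED config.
    """
    try:
        return run_dedupe(), None
    except ControllerNotConfidentError as exc:
        logger.warning("auto-config was not confident: %s", exc)
        return None, str(exc)
```

Returning the reason rather than swallowing it keeps the low-confidence outcome visible to whatever job or person called the pipeline.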
The planning-effort tier controls how hard the controller searches for a config. Pass it as a planning_effort= kwarg to dedupe_df / match_df / auto_configure_df, set it on a GoldenMatchConfig, or use the GOLDENMATCH_PLANNING_EFFORT env var. The default tier, normal, reproduces the prior behavior byte for byte.
Tier | What it does
fast | A single cheap pass: no refit breadth, tight wall budget.
normal (default) | Today’s interactive budget: sqrt-scaled sample, a few refit iterations, linear pair-count extrapolation.
thinking | Larger sample + more iterations + a longer budget, and measures real blocking on the full frame to pick the backend instead of extrapolating from the sample.
einstein | The widest search: the largest sample, the most iterations, and the longest budget.
    import goldenmatch as gm

    # Spend more search effort on a tricky dataset:
    result = gm.dedupe_df(df, planning_effort="thinking")
Because block scoring is now ~5x faster (bucket+native), the higher tiers measure the true candidate-pair count on the full data rather than projecting it from a 2K–20K sample. This removes the failure mode where skewed data led the controller to pick the wrong backend. Any measurement failure falls back to extrapolation.
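To see why measuring beats projecting on skewed data, consider this sketch. The function names are hypothetical, and the linear scale-up is an assumed reading of "linear pair-count extrapolation". The exact count relies only on the standard fact that a block of n rows yields n(n-1)/2 candidate pairs.

```python
from collections import Counter


def measured_pair_count(block_keys):
    """Exact candidate-pair count: each block of size n contributes n*(n-1)/2."""
    sizes = Counter(block_keys).values()
    return sum(n * (n - 1) // 2 for n in sizes)


def extrapolated_pair_count(sample_keys, full_rows):
    """Scale the sample's pair count linearly by row count (assumed form)."""
    if not sample_keys:
        return 0
    scale = full_rows / len(sample_keys)
    return round(measured_pair_count(sample_keys) * scale)


def pair_count_with_fallback(measure, sample_keys, full_rows):
    """Try the real measurement; on any failure fall back to extrapolation."""
    try:
        return measure()
    except Exception:
        return extrapolated_pair_count(sample_keys, full_rows)


# Skewed data: half of 1,000 rows land in a single block.
full_keys = ["x"] * 500 + [f"u{i}" for i in range(500)]
sample_keys = ["x"] * 5 + [f"u{i}" for i in range(5)]

exact = measured_pair_count(full_keys)                      # 124750
projected = extrapolated_pair_count(sample_keys, 1000)      # 1000
```

In this toy case the projection is off by two orders of magnitude, because pair counts grow quadratically inside the one large block while the scale-up is linear.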
When you point auto-config at the local in-house embedding model (a matchkey field with model="inhouse:/path", or GOLDENMATCH_EMBEDDING_PROVIDER=inhouse + GOLDENMATCH_INHOUSE_MODEL), it is treated as a local, offline-safe scorer and is not demoted as a remote-asset drift risk. Cloud embedders (sentence-transformers / Vertex) still require allow_remote_assets=True.
By default the controller remembers what worked on prior runs and seeds future runs from it. Disable it with GOLDENMATCH_AUTOCONFIG_MEMORY=0, or point it elsewhere with GOLDENMATCH_AUTOCONFIG_MEMORY_PATH.
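A minimal sketch of setting these variables from Python follows. It assumes the controller reads them when it runs, so they are set before any GoldenMatch call; the storage path shown is a hypothetical placeholder.

```python
import os

# Option 1: turn cross-run memory off for this process.
os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"

# Option 2 (use instead of option 1): keep memory on but store it elsewhere.
# The path below is a hypothetical placeholder, not a documented default.
# os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY_PATH"] = "/tmp/goldenmatch-memory"
```

Setting the variable in the shell or in your job scheduler's environment achieves the same thing without touching code.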
Scale envelope
The block-size guards the controller respects, and how it picks a backend.