The introspective AutoConfig controller is what lets GoldenMatch beat hand-tuned baselines without any user-supplied configuration. It detects column types, selects scorers, picks a blocking strategy, and then iterates on the signals the pipeline emits until it converges on a defensible config.

What it does

Given a file, the controller:
  1. Detects column types (name, email, phone, zip, address, description).
  2. Selects appropriate scorers per column.
  3. Picks a blocking strategy.
  4. Runs the pipeline and reads back complexity signals.
  5. Applies refinement rules and repeats until health stops improving.
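
The loop in steps 4 and 5 can be sketched in a few lines. This is a minimal illustration, not GoldenMatch's implementation: `Config`, `refine_until_stable`, `toy_evaluate`, and `toy_refine` are hypothetical names, and the toy health function is an assumption made only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable

# Health grades ordered worst to best (GREEN > YELLOW > RED).
HEALTH_ORDER = {"RED": 0, "YELLOW": 1, "GREEN": 2}


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for a matching config."""
    threshold: float


def refine_until_stable(
    initial: Config,
    evaluate: Callable[[Config], str],
    refine: Callable[[Config], Config],
    max_iterations: int = 5,
) -> Config:
    """Keep refining while the health grade strictly improves."""
    best, best_health = initial, evaluate(initial)
    for _ in range(max_iterations):
        candidate = refine(best)
        health = evaluate(candidate)
        if HEALTH_ORDER[health] <= HEALTH_ORDER[best_health]:
            break  # health stopped improving; keep the best config so far
        best, best_health = candidate, health
    return best


def toy_evaluate(cfg: Config) -> str:
    """Toy health check: lower thresholds look healthier in this example."""
    if cfg.threshold <= 0.80:
        return "GREEN"
    if cfg.threshold <= 0.85:
        return "YELLOW"
    return "RED"


def toy_refine(cfg: Config) -> Config:
    """Toy refinement rule: lower the threshold by 0.05."""
    return Config(threshold=round(cfg.threshold - 0.05, 2))


final = refine_until_stable(Config(threshold=0.90), toy_evaluate, toy_refine)
```

Here the loop moves from RED (0.90) to YELLOW (0.85) to GREEN (0.80), then stops because the next candidate is no healthier.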

Signals it watches

The controller iterates on a ComplexityProfile:
  • Block-size distribution (p50 / p95 / p99).
  • Score histogram (bimodality detection).
  • Transitivity rate.
  • Borderline mass (pairs near the threshold).
  • Negative-evidence collision rate (v1.11+).
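
To make three of these signals concrete, here is a rough sketch of how they could be computed. The exact definitions inside ComplexityProfile are not given here, so the formulas below (nearest-rank percentiles, a symmetric band around the threshold, and a closed-triangle ratio) are assumptions, and all function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[int], pct: float) -> int:
    """Nearest-rank percentile of a list of block sizes."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def borderline_mass(scores: Sequence[float], threshold: float, band: float) -> float:
    """Fraction of scored pairs within +/- band of the threshold."""
    near = sum(1 for s in scores if abs(s - threshold) <= band)
    return near / len(scores)


def transitivity_rate(matches: set[frozenset[str]]) -> float:
    """Of all chains A-B, B-C among matched pairs, the share where A-C also matched."""
    neighbors: dict[str, set[str]] = {}
    for pair in matches:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    chains = closed = 0
    for linked in neighbors.values():
        ordered = sorted(linked)
        for i, a in enumerate(ordered):
            for c in ordered[i + 1:]:
                chains += 1
                closed += frozenset((a, c)) in matches
    return closed / chains if chains else 1.0


block_sizes = [1, 1, 2, 2, 3, 3, 4, 5, 8, 40]
p50, p95, p99 = (percentile(block_sizes, p) for p in (50, 95, 99))

scores = [0.10, 0.20, 0.78, 0.82, 0.95]
mass = borderline_mass(scores, threshold=0.80, band=0.05)

matches = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]}
rate = transitivity_rate(matches)
```

In this toy data a single block of 40 records dominates the upper percentiles, which is the kind of skew the block-size distribution is meant to expose.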

Refinement rules

The HeuristicRefitPolicy applies rules such as:

| Rule | Effect |
| --- | --- |
| rule_no_matches | Lowers the threshold if no pairs are found. |
| rule_blocking_key_swap | Switches to an orthogonal blocking key on low recall. |
| rule_corruption_normalize | Adds normalization for corrupted identity columns. |
| rule_sparse_match_expand | Lowers the threshold and adds side-channel blocking for sparse matches. |
| rule_demote_clustered_identity | Demotes exact matchkeys with a collision rate above 0.75 (v1.11). |
| promote_negative_evidence | Adds negative-evidence penalties for identity-discriminating columns (v1.11). |
| rule_negative_evidence_exact_filter | Applies negative-evidence penalties to exact matchkeys via a post-filter (Path Y, v1.12). |

Configs are ranked by a health metric: GREEN > YELLOW > RED, with the initial config as a virtual fallback.
At 100,000+ rows, auto-config raises ControllerNotConfidentError rather than silently committing a low-confidence (RED) config. Handle this explicitly instead of adding a silent fallback path.
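
The ranking and the large-input refusal can be sketched together. This is an illustration only: `Candidate` and `commit_config` are hypothetical, the exception class is a local stand-in for the library's ControllerNotConfidentError, and committing a RED config below 100,000 rows is an assumption the text above does not confirm.

```python
from dataclasses import dataclass

HEALTH_RANK = {"GREEN": 2, "YELLOW": 1, "RED": 0}
LARGE_INPUT_ROWS = 100_000  # row count at which a RED result is refused


class ControllerNotConfidentError(Exception):
    """Local stand-in for the library exception of the same name."""


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of a config and its health grade."""
    name: str
    health: str


def commit_config(initial: Candidate, refined: list[Candidate], n_rows: int) -> Candidate:
    # The initial config is listed first, and max() returns the first maximal
    # item, so the initial config wins ties and acts as the fallback.
    best = max([initial, *refined], key=lambda c: HEALTH_RANK[c.health])
    if best.health == "RED" and n_rows >= LARGE_INPUT_ROWS:
        raise ControllerNotConfidentError(
            f"best config {best.name!r} is RED at {n_rows} rows"
        )
    return best


# Handle the low-confidence case explicitly instead of falling back silently.
try:
    commit_config(Candidate("initial", "RED"), [Candidate("refit-1", "RED")], n_rows=250_000)
    outcome = "committed"
except ControllerNotConfidentError as exc:
    outcome = f"needs review: {exc}"
```

Catching the error and routing the dataset to a person or a stricter pipeline keeps a RED config from being applied unnoticed.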

Planning effort

The planning-effort tier controls how hard the controller searches for a config. Pass it as the planning_effort= keyword argument to dedupe_df, match_df, or auto_configure_df; set it on a GoldenMatchConfig; or use the GOLDENMATCH_PLANNING_EFFORT environment variable. The default tier, normal, is byte-for-byte identical to the prior behavior.

| Tier | What it does |
| --- | --- |
| fast | A single cheap pass: no refit breadth, tight wall budget. |
| normal (default) | Today's interactive budget: sqrt-scaled sample, a few refit iterations, linear pair-count extrapolation. |
| thinking | Larger sample, more iterations, and a longer budget; measures real blocking on the full frame to pick the backend instead of extrapolating from the sample. |
| einstein | The widest search: the largest sample, the most iterations, and the longest budget. |

```python
import goldenmatch as gm

# df is the input DataFrame you have already loaded.
# Spend more search effort on a tricky dataset:
result = gm.dedupe_df(df, planning_effort="thinking")
```

Because block scoring is now ~5x faster (bucket+native), the higher tiers measure the true candidate-pair count on the full data rather than projecting it from a 2K–20K sample. This removes the failure mode in which skewed data led the controller to pick the wrong backend. If the measurement fails for any reason, the controller falls back to extrapolation.
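
A small worked example shows why projection can go wrong on skewed data. It assumes that "linear extrapolation" means scaling the sample's pair count by the ratio of full rows to sample rows. That reading, the systematic sample, and both function names are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def candidate_pairs(blocking_keys: Sequence[str]) -> int:
    """Exact pair count: a block of size n contributes n * (n - 1) / 2 pairs."""
    return sum(n * (n - 1) // 2 for n in Counter(blocking_keys).values())


def extrapolated_pairs(blocking_keys: Sequence[str], sample_step: int) -> int:
    """Count pairs on every sample_step-th record, then scale linearly by row count."""
    sample = blocking_keys[::sample_step]
    scale = len(blocking_keys) / len(sample)
    return round(candidate_pairs(sample) * scale)


# Skewed data: one key covers half of the 1,000 records, the rest are unique.
keys = ["smith"] * 500 + [f"unique-{i}" for i in range(500)]

measured = candidate_pairs(keys)          # 500 * 499 / 2 = 124750
projected = extrapolated_pairs(keys, 10)  # (50 * 49 / 2) * 10 = 12250
```

Pair counts grow quadratically with block size, so the linear projection here is about ten times too low. A backend chosen from the projected figure could be badly undersized for the real workload.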

In-house embeddings

When you point auto-config at the local in-house embedding model (a matchkey field with model="inhouse:/path", or GOLDENMATCH_EMBEDDING_PROVIDER=inhouse + GOLDENMATCH_INHOUSE_MODEL), it is treated as a local, offline-safe scorer and is not demoted as a remote-asset drift risk. Cloud embedders (sentence-transformers / Vertex) still require allow_remote_assets=True.

Environment variables

| Variable | Effect |
| --- | --- |
| `GOLDENMATCH_AUTOCONFIG_MEMORY=0` | Disable cross-run memory (default on; stored at ~/.goldenmatch/autoconfig_memory.db). |
| `GOLDENMATCH_AUTOCONFIG_MEMORY_PATH=<path>` | Custom path for the cross-run memory store. |
| `GOLDENMATCH_AUTOCONFIG_LLM=1` | Enable the LLM fallback policy (requires OPENAI_API_KEY). |
| `GOLDENMATCH_AUTOCONFIG_BACKEND=duckdb\|ray\|0` | Force a backend, or 0 to auto-select. |
| `GOLDENMATCH_AUTOCONFIG_BACKEND_THRESHOLD=<int>` | Row-count cutoff for DuckDB auto-selection (default 1,000,000). |
| `GOLDENMATCH_AUTOCONFIG_INDICATOR_BUDGET=fast\|full` | Run the cheap or the full indicator set (default fast). |
| `GOLDENMATCH_PLANNING_EFFORT=fast\|normal\|thinking\|einstein` | Planning-effort tier (default normal). Higher tiers search wider and measure full-frame blocking. |
| `GOLDENMATCH_EMBEDDING_PROVIDER=inhouse` + `GOLDENMATCH_INHOUSE_MODEL=<path>` | Treat the local in-house embedding model as an offline-safe scorer (not demoted as a remote asset). |
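
These variables can also be set from Python before a run. The sketch below uses only the standard library; `set_planning_effort` is a hypothetical helper. Setting the variables before importing goldenmatch is a precaution, since whether the library reads them at import time or at call time is not stated here.

```python
import os

VALID_TIERS = ("fast", "normal", "thinking", "einstein")


def set_planning_effort(tier: str) -> None:
    """Validate the tier name locally, then export it for later runs in this process."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown planning-effort tier: {tier!r}")
    os.environ["GOLDENMATCH_PLANNING_EFFORT"] = tier


# Disable cross-run memory and run the full indicator set.
os.environ["GOLDENMATCH_AUTOCONFIG_MEMORY"] = "0"
os.environ["GOLDENMATCH_AUTOCONFIG_INDICATOR_BUDGET"] = "full"
set_planning_effort("thinking")

# Import goldenmatch only after this point so the settings are already in place.
```

Validating the tier before exporting it catches typos in your own code, whatever the library does with an unrecognized value.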

Cross-run memory

By default the controller remembers what worked on prior runs and seeds future runs from it. Disable it with GOLDENMATCH_AUTOCONFIG_MEMORY=0, or point it elsewhere with GOLDENMATCH_AUTOCONFIG_MEMORY_PATH.

Scale envelope

The block-size guards the controller respects, and how it picks a backend.
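
As a rough mental model of the backend choice, the two backend variables listed under Environment variables could combine as follows. This is a guess at the decision order, not the controller's actual logic: `choose_backend` is hypothetical, the inclusive cutoff is an assumption, and `None` stands in for whatever the default in-process backend is.

```python
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 1_000_000  # documented default row-count cutoff for DuckDB


def choose_backend(n_rows: int, env: Mapping[str, str]) -> Optional[str]:
    """Return a forced backend if set, else DuckDB at or above the row cutoff."""
    forced = env.get("GOLDENMATCH_AUTOCONFIG_BACKEND", "0")
    if forced in ("duckdb", "ray"):
        return forced  # an explicit override wins over auto-selection
    threshold = int(env.get("GOLDENMATCH_AUTOCONFIG_BACKEND_THRESHOLD", DEFAULT_THRESHOLD))
    return "duckdb" if n_rows >= threshold else None


small = choose_backend(50_000, {})
large = choose_backend(2_000_000, {})
forced = choose_backend(50_000, {"GOLDENMATCH_AUTOCONFIG_BACKEND": "ray"})
lowered = choose_backend(50_000, {"GOLDENMATCH_AUTOCONFIG_BACKEND_THRESHOLD": "10000"})
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.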