LLM integration

GoldenMatch uses LLMs (GPT-4o-mini, Claude) to score borderline pairs that fuzzy matching alone cannot resolve. It offers two modes: pairwise scoring and in-context block clustering.

Quick start

import goldenmatch as gm

# Enable LLM scoring via convenience API
result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)

# CLI (LLM scoring is enabled via the llm_scorer block in config.yaml)
goldenmatch dedupe products.csv --config config.yaml

Requires OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable.

Pairwise scoring

The default mode. Sends individual borderline pairs to the LLM for match/no-match decisions.

llm_scorer:
  enabled: true
  mode: pairwise
  provider: openai          # auto-detected from env vars if omitted
  model: gpt-4o-mini        # cheapest option, default
  auto_threshold: 0.95      # auto-accept pairs above this (no LLM call)
  candidate_lo: 0.75        # lower bound of LLM scoring range
  candidate_hi: 0.95        # upper bound (same as auto_threshold)
  batch_size: 75            # pairs per API call
  max_workers: 3            # concurrent LLM requests

How it works:

Fuzzy scoring produces pairs with scores in [0, 1]
Pairs above auto_threshold (0.95) are auto-accepted without an LLM call
Pairs in [candidate_lo, candidate_hi] (0.75 to 0.95) are candidates for LLM scoring
Pairs below candidate_lo (0.75) keep their original fuzzy score
LLM-approved pairs get score=1.0; LLM-rejected pairs keep their fuzzy score (never demoted)

import goldenmatch as gm

scored = gm.llm_score_pairs(borderline_pairs, df, config=llm_config)
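
The routing rules above can be sketched in plain Python. This is an illustration only, not GoldenMatch's implementation: `route_and_score` and the `llm_says_match` callback are hypothetical names, and the sketch assumes auto-accepted pairs simply keep their fuzzy score.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[int, int]


def route_and_score(
    fuzzy_scores: Dict[Pair, float],
    llm_says_match: Callable[[Pair], bool],  # hypothetical stand-in for the LLM call
    candidate_lo: float = 0.75,
    auto_threshold: float = 0.95,
) -> Dict[Pair, float]:
    """Apply the three-band routing and the never-demote rule."""
    final: Dict[Pair, float] = {}
    for pair, score in fuzzy_scores.items():
        if score > auto_threshold:
            # Auto-accepted: no LLM call (assumed to keep its fuzzy score).
            final[pair] = score
        elif score >= candidate_lo:
            # Borderline: promote to 1.0 on approval, otherwise keep the score.
            final[pair] = 1.0 if llm_says_match(pair) else score
        else:
            # Below the band: untouched.
            final[pair] = score
    return final


llm_calls = []


def fake_llm(pair: Pair) -> bool:
    """Toy oracle that approves exactly one pair and logs every call."""
    llm_calls.append(pair)
    return pair == (0, 2)


scores = {(0, 1): 0.97, (0, 2): 0.88, (1, 3): 0.80, (2, 4): 0.60}
result = route_and_score(scores, fake_llm)
```

Only the two pairs inside the 0.75 to 0.95 band reach the callback; the rejected pair `(1, 3)` stays at 0.80 rather than being pushed down.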

Iterative calibration

New in v1.2.6. When the candidate set is large (>100 pairs), GoldenMatch uses iterative calibration instead of scoring every pair:

Round 1: Stratified sample of 100 pairs across the score range
Learn threshold: Grid search finds the score that best separates LLM YES from NO
Round 2+: Focused sample of 100 pairs near the learned threshold (threshold +/- 0.03)
Converge: Stop when threshold shifts less than 0.01 between rounds
Apply: Pairs above threshold promoted to 1.0; all others keep original score

Typically converges in 2-3 rounds (~200 pairs, ~$0.01). On the Bulldozer dataset (401K rows, 23.7M candidate pairs), calibration learned threshold=0.947 from just 200 pairs.

llm_scorer:
  enabled: true
  calibration_sample_size: 100      # pairs per round
  calibration_max_rounds: 5         # max iterations
  calibration_convergence_delta: 0.01  # stop when threshold shift < this

Calibration activates automatically when candidates exceed calibration_sample_size. For small candidate sets (≤100 pairs), all pairs are scored directly.
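
The calibration loop described in this section could look roughly like the sketch below. It is not the library's code: the function names, the equal-width binning used for stratification, the 0.005 grid step, and the error-count objective are all assumptions chosen for illustration. For brevity the fake LLM receives only the score, whereas a real scorer would receive the record pair.

```python
import random
from typing import Callable, List, Tuple

Labeled = Tuple[float, bool]  # (fuzzy score, LLM answered "match")


def learn_threshold(labeled: List[Labeled], lo: float, hi: float,
                    step: float = 0.005) -> float:
    """Grid search for the cut that misclassifies the fewest labeled pairs."""
    best_t, best_err = lo, len(labeled) + 1
    for i in range(int(round((hi - lo) / step)) + 1):
        t = lo + i * step
        err = sum((score >= t) != is_match for score, is_match in labeled)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def stratified_sample(scores: List[float], n: int, lo: float, hi: float,
                      rng: random.Random, bins: int = 10) -> List[float]:
    """Draw roughly n scores spread evenly over equal-width score bins."""
    buckets: List[List[float]] = [[] for _ in range(bins)]
    for s in scores:  # assumes every score lies in [lo, hi]
        idx = min(int((s - lo) / (hi - lo) * bins), bins - 1)
        buckets[idx].append(s)
    per_bin = max(1, n // bins)
    sample: List[float] = []
    for bucket in buckets:
        sample.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sample


def calibrate(scores: List[float], ask_llm: Callable[[float], bool],
              lo: float = 0.75, hi: float = 0.95, sample_size: int = 100,
              max_rounds: int = 5, delta: float = 0.01, window: float = 0.03,
              seed: int = 0) -> Tuple[float, int]:
    """Return (learned threshold, number of LLM-labeled pairs)."""
    rng = random.Random(seed)
    # Round 1: stratified sample across the whole candidate band.
    labeled = [(s, ask_llm(s))
               for s in stratified_sample(scores, sample_size, lo, hi, rng)]
    threshold = learn_threshold(labeled, lo, hi)
    # Round 2+: focus on pairs near the current threshold.
    for _ in range(max_rounds - 1):
        near = [s for s in scores if abs(s - threshold) <= window]
        batch = rng.sample(near, min(sample_size, len(near)))
        labeled += [(s, ask_llm(s)) for s in batch]
        new_threshold = learn_threshold(labeled, lo, hi)
        converged = abs(new_threshold - threshold) < delta
        threshold = new_threshold
        if converged:
            break
    return threshold, len(labeled)


# Toy data: 5,000 candidate scores and an oracle whose true cut is 0.90.
data_rng = random.Random(42)
candidate_scores = [data_rng.uniform(0.75, 0.95) for _ in range(5000)]
threshold, n_labeled = calibrate(candidate_scores, lambda s: s >= 0.90)
promoted = [s for s in candidate_scores if s >= threshold]
```

On this toy data the learned threshold should land close to 0.90 after labeling a few hundred of the 5,000 candidates; the remaining pairs are never sent to the oracle.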

Cluster mode

Send entire blocks of borderline records to the LLM for in-context clustering. More efficient than pairwise for large candidate sets.

llm_scorer:
  enabled: true
  mode: cluster
  cluster_max_size: 100     # max records per LLM cluster block
  cluster_min_size: 5       # below this, fall back to pairwise

How it works:

Build connected components from borderline pairs
Send each component (block) to the LLM as a clustering task
LLM returns cluster assignments
Synthesize pair_scores from cluster confidence for compatibility with Union-Find, unmerge, and lineage

import goldenmatch as gm

scored = gm.llm_cluster_pairs(borderline_pairs, df, llm_config)

Graceful degradation: cluster mode falls back to pairwise scoring when a block has fewer than cluster_min_size records, and stops scoring once the budget is exhausted.
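
As a rough sketch of steps 1 and 4 above, the following builds connected components from borderline pairs with a small union-find, then turns a cluster assignment back into pair scores. The shape of the LLM response (`assignments` and `confidence` dictionaries) is a hypothetical format invented for this example, as are both function names.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, Iterable, List, Tuple


def connected_components(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group record ids into blocks: ids linked by any chain of pairs share a block."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return [sorted(group) for group in groups.values()]


def pair_scores_from_clusters(
    assignments: Dict[int, Hashable],    # record id -> cluster label (assumed format)
    confidence: Dict[Hashable, float],   # cluster label -> confidence (assumed format)
) -> Dict[Tuple[int, int], float]:
    """Every pair inside one cluster gets that cluster's confidence as its score."""
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for record_id, label in assignments.items():
        members[label].append(record_id)
    scores: Dict[Tuple[int, int], float] = {}
    for label, record_ids in members.items():
        for a, b in combinations(sorted(record_ids), 2):
            scores[(a, b)] = confidence[label]
    return scores


blocks = connected_components([(1, 2), (2, 3), (7, 8)])
# Pretend the LLM split block {1, 2, 3} into two clusters.
synth = pair_scores_from_clusters({1: "A", 2: "A", 3: "B"}, {"A": 0.9, "B": 1.0})
```

Emitting ordinary pair scores is what keeps downstream steps such as Union-Find clustering unaware of which mode produced them.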

Budget tracking

Control LLM spending with BudgetConfig:

llm_scorer:
  enabled: true
  budget:
    max_cost_usd: 0.05         # hard cost cap
    max_calls: 100             # max API calls
    warn_at_pct: 80            # warn at 80% of budget
    escalation_model: gpt-4o   # escalate to better model for hard pairs
    escalation_band: [0.80, 0.90]
    escalation_budget_pct: 20  # reserve 20% of budget for escalation

import goldenmatch as gm

tracker = gm.BudgetTracker(gm.BudgetConfig(max_cost_usd=0.05, max_calls=100))
# tracker.record_usage(input_tokens, output_tokens, model)
# tracker.budget_remaining_pct
# tracker.total_cost_usd
# tracker.budget_exhausted

The BudgetTracker class is constructed from a BudgetConfig, tracks token usage and cost, and enforces limits. When the budget runs out, scoring stops gracefully and the remaining pairs keep their fuzzy scores. A budget summary is available in EngineStats.llm_cost after a pipeline run.
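
To make the stop-gracefully behavior concrete, here is a minimal stand-in written from scratch. `SketchBudget` is a hypothetical class, not `gm.BudgetTracker`, and the per-token prices are placeholders rather than real provider pricing. It checks an estimated cost before each call so the cap is never exceeded, which is one possible reading of "hard cost cap".

```python
class SketchBudget:
    """Hypothetical budget tracker used only for this illustration."""

    def __init__(self, max_cost_usd: float, max_calls: int,
                 warn_at_pct: float = 80.0) -> None:
        self.max_cost_usd = max_cost_usd
        self.max_calls = max_calls
        self.warn_at_pct = warn_at_pct
        self.total_cost_usd = 0.0
        self.calls = 0

    def can_afford(self, estimated_cost_usd: float) -> bool:
        """True if one more call fits under both the cost cap and the call cap."""
        within_cost = self.total_cost_usd + estimated_cost_usd <= self.max_cost_usd
        return within_cost and self.calls < self.max_calls

    def record(self, cost_usd: float) -> None:
        self.total_cost_usd += cost_usd
        self.calls += 1

    @property
    def used_pct(self) -> float:
        """Percentage used of whichever limit is closer to being hit."""
        return 100.0 * max(self.total_cost_usd / self.max_cost_usd,
                           self.calls / self.max_calls)


def call_cost(input_tokens: int, output_tokens: int,
              usd_per_1m_in: float, usd_per_1m_out: float) -> float:
    """Cost of one call given token counts and per-million-token prices."""
    return (input_tokens * usd_per_1m_in + output_tokens * usd_per_1m_out) / 1_000_000


budget = SketchBudget(max_cost_usd=0.05, max_calls=100)
# Placeholder prices (1.0 and 4.0 USD per million tokens), not real pricing.
per_batch = call_cost(10_000, 1_000, usd_per_1m_in=1.0, usd_per_1m_out=4.0)

scored, kept_fuzzy = [], []
for batch in ["batch-%d" % i for i in range(6)]:
    if budget.can_afford(per_batch):
        budget.record(per_batch)      # an LLM call would happen here
        scored.append(batch)
    else:
        kept_fuzzy.append(batch)      # pairs keep their fuzzy scores

warned = budget.used_pct >= budget.warn_at_pct
```

With these numbers each batch costs $0.014, so three batches fit under the $0.05 cap, the warning level is crossed at 84%, and the last three batches are left at their fuzzy scores.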

Model tiering

Automatic escalation sends harder pairs to a better (more expensive) model:

Tier 1: GPT-4o-mini for most pairs (cheapest)
Tier 2: GPT-4o for pairs in the escalation band (0.80 to 0.90)

The escalation budget percentage (default 20%) reserves a portion of the total budget for tier-2 calls.
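
One way to picture the tier selection is the small function below. It is a sketch under an explicit assumption: the reserved share is treated as a cap on tier-2 spending, so once that share is spent, in-band pairs drop back to the cheaper model. `pick_model` is a hypothetical helper, not part of the GoldenMatch API.

```python
from typing import Tuple


def pick_model(
    score: float,
    escalation_spent_usd: float,
    max_cost_usd: float,
    escalation_band: Tuple[float, float] = (0.80, 0.90),
    escalation_budget_pct: float = 20.0,
    base_model: str = "gpt-4o-mini",
    escalation_model: str = "gpt-4o",
) -> str:
    """Choose the model for one pair from its fuzzy score and tier-2 spend so far."""
    reserve_usd = max_cost_usd * escalation_budget_pct / 100.0
    in_band = escalation_band[0] <= score <= escalation_band[1]
    # Assumption: escalate only while the reserved share has not been used up.
    if in_band and escalation_spent_usd < reserve_usd:
        return escalation_model
    return base_model


hard_pair = pick_model(0.85, escalation_spent_usd=0.0, max_cost_usd=0.05)
easy_pair = pick_model(0.93, escalation_spent_usd=0.0, max_cost_usd=0.05)
reserve_used = pick_model(0.85, escalation_spent_usd=0.02, max_cost_usd=0.05)
```

With a $0.05 total budget the reserve is $0.01, so the third call falls back to the base model even though its score sits inside the band.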

LLM boost

A separate feature from the LLM scorer. LLM boost fine-tunes an embedding model using LLM-generated labels:

goldenmatch dedupe products.csv --llm-boost

Tiered auto-escalation:

Level 1: zero-shot (free, instant)
Level 2: bi-encoder fine-tuning (~$0.20, ~2 min CPU)
Level 3: Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)

Active sampling selects the most informative pairs for labeling, reducing cost by ~45%. LLM boost is most valuable for product matching with local models (MiniLM). For structured data, fuzzy matching alone achieves 97%+ F1.

LLM feature extraction

Extract structured fields from unstructured text using the LLM. O(N) preprocessing, not O(N^2) pair scoring.

import goldenmatch as gm

# row_ids selects which (low-confidence) records to process; returns
# {row_id: {field: value, ...}}
features = gm.llm_extract_features(df, row_ids, text_column="description", budget_tracker=tracker)

Provider configuration

GoldenMatch auto-detects the provider from environment variables:

Variable	Provider
`OPENAI_API_KEY`	OpenAI (GPT-4o-mini, GPT-4o)
`ANTHROPIC_API_KEY`	Anthropic (Claude)

Both providers return (text, input_tokens, output_tokens) tuples for budget tracking.
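
Environment-based detection of this kind usually comes down to a lookup like the one below. This is a guess at the general shape, not GoldenMatch's code: the precedence when both variables are set (OpenAI first) and the provider string `"anthropic"` are assumptions; only `openai` appears as a provider value in the config shown earlier.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a provider name based on which API key variable is set."""
    # Assumed precedence: OpenAI wins when both keys are present.
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None  # no key found; LLM scoring cannot run


only_anthropic = detect_provider({"ANTHROPIC_API_KEY": "sk-example"})
both_set = detect_provider({"OPENAI_API_KEY": "a", "ANTHROPIC_API_KEY": "b"})
none_set = detect_provider({})
```

Setting `provider` explicitly in the `llm_scorer` block avoids depending on any detection order.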

Cost benchmarks

Dataset	Strategy	LLM Cost	F1
Abt-Buy (electronics)	Domain + emb + LLM	$0.04	72.2%
Amazon-Google (software)	emb + ANN + LLM	$0.02	45.3%
Abt-Buy (Vertex AI + LLM)	Embeddings + GPT-4o-mini	$0.74	81.7%
Bulldozer 401K (equipment)	Multi-pass + ANN + calibration	~$0.01	87.7% conf
Typical 5K dataset	LLM scorer (borderline only)	~$0.05	varies

With iterative calibration (v1.2.6+), the LLM scores only ~200 pairs to learn the optimal threshold, then applies it to all candidates. This reduced the Bulldozer benchmark from ~$0.50 (37,500 pairs) to ~$0.01 (200 pairs).

Python API summary

Function	Description
`gm.llm_score_pairs(pairs, df, config=config)`	Pairwise LLM scoring
`gm.llm_cluster_pairs(pairs, df, config)`	In-context block clustering
`gm.BudgetTracker(BudgetConfig(...))`	Track and limit LLM spending
`gm.llm_label_pairs(pairs, columns, context, provider, api_key, model)`	Generate LLM-labeled training pairs
`gm.llm_extract_features(df, row_ids, text_column)`	LLM-based feature extraction
