Quick start
Requires the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable.
Pairwise scoring
The default mode. Sends individual borderline pairs to the LLM for match/no-match decisions.
- Fuzzy scoring produces pairs with scores in [0, 1]
- Pairs above auto_threshold (0.95) are auto-accepted, with no LLM call
- Pairs in [candidate_lo, candidate_hi] (0.75 to 0.95) are candidates for LLM scoring
- Pairs below candidate_lo (0.75) keep their original fuzzy score
- LLM-approved pairs get score=1.0; LLM-rejected pairs keep their fuzzy score (never demoted)
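
The routing rules above can be sketched in plain Python. This is a minimal illustration, not GoldenMatch's implementation: `route_and_score`, the `Pair` alias, and the `llm_says_match` callback are hypothetical names, and treating a score of exactly 0.95 as a candidate rather than an auto-accept is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]  # hypothetical: a pair of record ids


def route_and_score(
    fuzzy_scores: Dict[Pair, float],
    llm_says_match: Callable[[Pair], bool],
    auto_threshold: float = 0.95,
    candidate_lo: float = 0.75,
    candidate_hi: float = 0.95,
) -> Dict[Pair, float]:
    """Apply the pairwise routing rules to a dict of fuzzy scores."""
    final: Dict[Pair, float] = {}
    for pair, score in fuzzy_scores.items():
        if score > auto_threshold:
            # Auto-accepted: no LLM call, score left as-is.
            final[pair] = score
        elif candidate_lo <= score <= candidate_hi:
            # Borderline: ask the LLM. Approved pairs are promoted to 1.0,
            # rejected pairs keep their fuzzy score (never demoted).
            final[pair] = 1.0 if llm_says_match(pair) else score
        else:
            # Below the candidate band: keep the original fuzzy score.
            final[pair] = score
    return final


# Stub standing in for a real LLM call; it records which pairs were sent.
llm_calls: List[Pair] = []


def stub_llm(pair: Pair) -> bool:
    llm_calls.append(pair)
    return pair == (3, 4)


scores = {(1, 2): 0.98, (3, 4): 0.88, (5, 6): 0.80, (7, 8): 0.40}
result = route_and_score(scores, stub_llm)
print(result)
print(llm_calls)
```

Only the two pairs inside the candidate band reach the stub, which is where the cost saving comes from.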
Iterative calibration
New in v1.2.6. When the candidate set is large (>100 pairs), GoldenMatch uses iterative calibration instead of scoring every pair:
- Round 1: Stratified sample of 100 pairs across the score range
- Learn threshold: Grid search finds the score that best separates LLM YES from NO
- Round 2+: Focused sample of 100 pairs near the learned threshold (threshold ± 0.03)
- Converge: Stop when the threshold shifts less than 0.01 between rounds
- Apply: Pairs above the threshold are promoted to 1.0; all others keep their original score
The sample size is configurable via calibration_sample_size. For small candidate sets (≤100 pairs), all pairs are scored directly.
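
The calibration loop can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `learn_threshold`, `calibrate`, and `ask_llm` are hypothetical names, each candidate is represented only by its fuzzy score, "best separates" is taken to mean "agrees with the most labels", and the cap of five rounds is an added safeguard.

```python
import random
from typing import Callable, List, Sequence


def learn_threshold(scores: Sequence[float], labels: Sequence[bool],
                    step: float = 0.01) -> float:
    """Grid search for the cut that agrees with the most YES/NO labels."""
    lo, hi = min(scores), max(scores)
    n_steps = int(round((hi - lo) / step))
    best_t, best_correct = lo, -1
    for i in range(n_steps + 1):
        t = lo + i * step
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def calibrate(scores: Sequence[float], ask_llm: Callable[[float], bool],
              sample_size: int = 100, band: float = 0.03, tol: float = 0.01,
              max_rounds: int = 5, seed: int = 0) -> float:
    rng = random.Random(seed)
    ordered = sorted(scores)
    n, k = len(ordered), min(sample_size, len(ordered))
    # Round 1: evenly spaced picks across the sorted scores
    # (a simple form of stratified sampling).
    picks = [ordered[int(i * (n - 1) / max(k - 1, 1))] for i in range(k)]
    seen: List[float] = list(picks)
    labels: List[bool] = [ask_llm(s) for s in picks]
    threshold = learn_threshold(seen, labels)
    # Round 2+: sample only near the current threshold, then re-learn.
    for _ in range(max_rounds - 1):
        near = [s for s in ordered if abs(s - threshold) <= band]
        focus = rng.sample(near, min(sample_size, len(near)))
        seen += focus
        labels += [ask_llm(s) for s in focus]
        updated = learn_threshold(seen, labels)
        shift = abs(updated - threshold)
        threshold = updated
        if shift < tol:  # converged
            break
    return threshold


# Synthetic data: 400 candidate scores and an oracle that says YES at 0.85+.
candidates = [0.75 + i * 0.0005 for i in range(400)]
threshold = calibrate(candidates, lambda s: s >= 0.85)
# Apply: promote pairs at or above the threshold, leave the rest unchanged.
final = [1.0 if s >= threshold else s for s in candidates]
print(round(threshold, 2))
```

With this synthetic oracle the loop labels at most 200 scores instead of all 400; the saving grows with the size of the candidate set.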
Cluster mode
Send entire blocks of borderline records to the LLM for in-context clustering. More efficient than pairwise for large candidate sets.
- Build connected components from borderline pairs
- Send each component (block) to the LLM as a clustering task
- LLM returns cluster assignments
- Synthesize pair_scores from cluster confidence for compatibility with Union-Find, unmerge, and lineage
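
The first and last steps above can be sketched in plain Python. The function names and the `(members, confidence)` shape of the LLM's answer are assumptions made for illustration; the LLM call itself is replaced by a hard-coded response.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[int, int]  # hypothetical: a pair of record ids


def connected_components(pairs: Iterable[Pair]) -> List[List[int]]:
    """Group record ids into blocks using a small union-find."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups: Dict[int, List[int]] = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted(sorted(g) for g in groups.values())


def pairs_from_clusters(
    clusters: Iterable[Tuple[List[int], float]]
) -> Dict[Pair, float]:
    """Give every within-cluster pair the cluster's confidence as its score."""
    scores: Dict[Pair, float] = {}
    for members, confidence in clusters:
        for a, b in combinations(sorted(members), 2):
            scores[(a, b)] = confidence
    return scores


borderline = [(1, 2), (2, 3), (7, 8)]
blocks = connected_components(borderline)
print(blocks)

# Hard-coded stand-in for the LLM's answer: in block [1, 2, 3] it groups
# records 1 and 2 and leaves 3 on its own; block [7, 8] is one cluster.
llm_answer = [([1, 2], 0.9), ([3], 1.0), ([7, 8], 0.8)]
pair_scores = pairs_from_clusters(llm_answer)
print(pair_scores)
```

Emitting ordinary pair scores means downstream steps that expect pairs do not need to know that clustering happened in-context.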
Budget tracking
Control LLM spending with BudgetConfig.
The BudgetTracker class tracks token usage and cost, and enforces limits. When the budget runs out, scoring stops gracefully and the remaining pairs keep their fuzzy scores.
The budget summary is available in EngineStats.llm_cost after a pipeline run.
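
The stop-when-exhausted behavior can be illustrated with a small stand-in class. `SketchBudget` and `score_within_budget` are hypothetical and are not the library's BudgetTracker or BudgetConfig; the per-token prices are placeholders, and the stub LLM is assumed to answer with text beginning with YES or NO.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Pair = Tuple[int, int]  # hypothetical: a pair of record ids


@dataclass
class SketchBudget:
    max_cost_usd: float
    max_calls: int
    usd_per_input_token: float = 1e-6   # placeholder price, not a real rate
    usd_per_output_token: float = 2e-6  # placeholder price, not a real rate
    cost_usd: float = 0.0
    calls: int = 0

    def exhausted(self) -> bool:
        return self.cost_usd >= self.max_cost_usd or self.calls >= self.max_calls

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.calls += 1
        self.cost_usd += (input_tokens * self.usd_per_input_token
                          + output_tokens * self.usd_per_output_token)


def score_within_budget(
    candidates: Dict[Pair, float],
    call_llm: Callable[[Pair], Tuple[str, int, int]],
    budget: SketchBudget,
) -> Dict[Pair, float]:
    final = dict(candidates)  # every pair starts at its fuzzy score
    for pair in candidates:
        if budget.exhausted():
            break  # stop gracefully; unscored pairs keep their fuzzy score
        text, input_tokens, output_tokens = call_llm(pair)
        budget.record(input_tokens, output_tokens)
        if text.strip().upper().startswith("YES"):
            final[pair] = 1.0
    return final


candidates = {(1, 2): 0.90, (3, 4): 0.85, (5, 6): 0.82}
budget = SketchBudget(max_cost_usd=1.0, max_calls=2)
final = score_within_budget(candidates, lambda pair: ("YES", 120, 1), budget)
print(final, budget.calls)
```

Checking the budget before each call, rather than after, guarantees the call limit is never exceeded.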
Model tiering
Automatic escalation sends harder pairs to a better (more expensive) model:
- Tier 1: GPT-4o-mini for most pairs (cheapest)
- Tier 2: GPT-4o for pairs in the escalation band (0.80 to 0.90)
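
One way to read the tiering rule is as a lookup on the fuzzy score. The sketch below assumes the escalation band is inclusive at both ends; `pick_model`, `escalation_lo`, and `escalation_hi` are illustrative names rather than GoldenMatch configuration keys.

```python
def pick_model(score: float,
               escalation_lo: float = 0.80,
               escalation_hi: float = 0.90) -> str:
    """Return the model tier for a pair, based on its fuzzy score."""
    if escalation_lo <= score <= escalation_hi:
        return "gpt-4o"       # Tier 2: harder pairs in the escalation band
    return "gpt-4o-mini"      # Tier 1: everything else


for s in (0.78, 0.85, 0.93):
    print(s, pick_model(s))
```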
LLM boost
A separate feature from the LLM scorer. LLM boost fine-tunes an embedding model using LLM-generated labels:
- Level 1: zero-shot (free, instant)
- Level 2: bi-encoder fine-tuning (~$0.20, ~2 min CPU)
- Level 3: Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)
LLM feature extraction
Extract structured fields from unstructured text using the LLM. This is O(N) preprocessing, not O(N^2) pair scoring.
Provider configuration
GoldenMatch auto-detects the provider from environment variables:
| Variable | Provider |
|---|---|
| OPENAI_API_KEY | OpenAI (GPT-4o-mini, GPT-4o) |
| ANTHROPIC_API_KEY | Anthropic (Claude) |
Provider calls return (text, input_tokens, output_tokens) tuples for budget tracking.
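
Environment-based detection can be sketched as a simple lookup. `detect_provider` is a hypothetical helper, and checking OPENAI_API_KEY first when both variables are set is an assumption; the source does not state the precedence.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a provider name based on which API key variable is set."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None  # no key found; LLM features cannot run


print(detect_provider({"ANTHROPIC_API_KEY": "placeholder"}))
print(detect_provider({}))
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.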
Cost benchmarks
| Dataset | Strategy | LLM Cost | F1 |
|---|---|---|---|
| Abt-Buy (electronics) | Domain + emb + LLM | $0.04 | 72.2% |
| Amazon-Google (software) | emb + ANN + LLM | $0.02 | 45.3% |
| Abt-Buy (Vertex AI + LLM) | Embeddings + GPT-4o-mini | $0.74 | 81.7% |
| Bulldozer 401K (equipment) | Multi-pass + ANN + calibration | ~$0.01 | 87.7% conf |
| Typical 5K dataset | LLM scorer (borderline only) | ~$0.05 | varies |
Python API summary
| Function | Description |
|---|---|
| gm.llm_score_pairs(pairs, df, config) | Pairwise LLM scoring |
| gm.llm_cluster_pairs(pairs, df, config) | In-context block clustering |
| gm.BudgetTracker(max_cost_usd, max_calls) | Track and limit LLM spending |
| gm.llm_label_pairs(pairs, df) | Generate LLM-labeled training pairs |
| gm.llm_extract_features(df, column) | LLM-based feature extraction |