Strategy overview
| Strategy | Description | Best For |
|---|---|---|
| `static` | Group by blocking key | Clean data with reliable keys |
| `adaptive` | Static + recursive sub-blocking for oversized blocks | Default choice |
| `sorted_neighborhood` | Sliding window over sorted records | Typos in blocking key |
| `multi_pass` | Union of blocks from multiple passes | Noisy data, best recall |
| `ann` | FAISS nearest-neighbor on embeddings | Semantic matching |
| `ann_pairs` | Direct-pair ANN scoring | 50-100x faster than `ann` |
| `canopy` | TF-IDF canopy clustering | Text-heavy data |
| `learned` | Data-driven predicate selection | Auto-discovers rules |
Static blocking
Group records by exact value of the blocking key.
Adaptive blocking
Static blocking with automatic sub-splitting for oversized blocks. When a block exceeds `max_block_size`, it is split on the highest-cardinality column within that block.
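The grouping and sub-splitting described above can be sketched in plain Python. This is an illustration of the idea only, not GoldenMatch's implementation: the function names, the `split_columns` parameter, and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def static_blocks(records, key):
    """Group records by the exact value of one column."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return dict(blocks)


def split_oversized(block, max_block_size, columns):
    """Recursively split a block on its highest-cardinality remaining column."""
    if len(block) <= max_block_size or not columns:
        return [block]
    # Cardinality is measured inside this block, not over the whole dataset.
    best = max(columns, key=lambda c: len({rec[c] for rec in block}))
    if len({rec[best] for rec in block}) <= 1:
        return [block]  # no column can separate these records any further
    remaining = [c for c in columns if c != best]
    result = []
    for sub in static_blocks(block, best).values():
        result.extend(split_oversized(sub, max_block_size, remaining))
    return result


def adaptive_blocks(records, key, max_block_size, split_columns):
    """Static blocking followed by sub-splitting of oversized blocks."""
    result = []
    for block in static_blocks(records, key).values():
        result.extend(split_oversized(block, max_block_size, split_columns))
    return result


# Hypothetical sample data: the NY block has 4 records and exceeds the limit of 2.
records = [
    {"id": 1, "state": "NY", "zip": "10001", "city": "Manhattan"},
    {"id": 2, "state": "NY", "zip": "10001", "city": "Manhattan"},
    {"id": 3, "state": "NY", "zip": "10002", "city": "Manhattan"},
    {"id": 4, "state": "NY", "zip": "11201", "city": "Brooklyn"},
    {"id": 5, "state": "CA", "zip": "90001", "city": "Los Angeles"},
]
blocks = adaptive_blocks(records, "state", max_block_size=2, split_columns=["zip", "city"])
block_ids = [[rec["id"] for rec in block] for block in blocks]
print(block_ids)  # [[1, 2], [3], [4], [5]]
```

In this sample, `zip` has three distinct values inside the NY block and `city` has two, so the split happens on `zip`.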
Sorted neighborhood
Sliding window over records sorted by a key. Catches near-matches that differ by one character in the blocking key.
Multi-pass blocking
Run multiple blocking passes and union the results. Best recall for noisy data.
ANN hybrid blocking
New in v1.2.6. Combine multi-pass string blocking with ANN fallback for oversized blocks. When a block exceeds `max_block_size` and would normally be skipped, GoldenMatch embeds only the unique text values in that block and uses FAISS to create smaller sub-blocks.
- Multi-pass blocking creates string-based blocks (fast, handles most data)
- Blocks exceeding `max_block_size` trigger ANN fallback instead of being skipped
- ANN embeds only unique text values (e.g., 61K records with 187 unique texts embed in seconds)
- FAISS finds nearest neighbors among unique texts, Union-Find creates sub-blocks
- Sub-blocks still exceeding `max_block_size` (after 10x cap) are skipped

Embedding uses either the backend selected by `GOLDENMATCH_GPU_MODE=vertex` or local sentence-transformers.
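The fallback steps listed above can be illustrated with a small standard-library sketch. It makes several loud substitutions: character-trigram counts stand in for sentence-transformer embeddings, a brute-force cosine search stands in for FAISS, and the `k` and `min_sim` parameters are hypothetical. The 10x cap is not modeled; sub-blocks that are still too large are simply dropped.

```python
from collections import Counter, defaultdict
from math import sqrt


def embed(text):
    """Stand-in embedding: character trigram counts (not a real sentence embedding)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find(parent, x):
    """Union-Find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def ann_fallback(block, text_of, max_block_size, k=2, min_sim=0.5):
    """Sub-block one oversized block by linking similar unique texts."""
    unique = sorted({text_of(r) for r in block})  # embed each distinct text once
    vectors = {t: embed(t) for t in unique}
    parent = {t: t for t in unique}
    for t in unique:
        # Brute-force search; a real system would query a FAISS index here.
        scored = sorted(((cosine(vectors[t], vectors[u]), u) for u in unique if u != t),
                        reverse=True)
        for sim, u in scored[:k]:
            if sim >= min_sim:
                parent[find(parent, t)] = find(parent, u)
    groups = defaultdict(list)
    for r in block:
        groups[find(parent, text_of(r))].append(r)
    # Sub-blocks that are still oversized are skipped.
    return [g for g in groups.values() if len(g) <= max_block_size]


def hybrid_blocks(blocks, text_of, max_block_size):
    """Keep small blocks as-is; send oversized ones through the fallback."""
    out = []
    for block in blocks:
        if len(block) <= max_block_size:
            out.append(block)
        else:
            out.extend(ann_fallback(block, text_of, max_block_size))
    return out


# Hypothetical oversized block: 6 records but only 4 unique texts.
block = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Acme Corp"},
    {"id": 3, "name": "Acme Corp."},
    {"id": 4, "name": "Globex Inc"},
    {"id": 5, "name": "Globex Inc."},
    {"id": 6, "name": "Globex Inc."},
]
sub_blocks = hybrid_blocks([block], lambda r: r["name"], max_block_size=3)
sub_ids = [[r["id"] for r in b] for b in sub_blocks]
print(sub_ids)  # [[1, 2, 3], [4, 5, 6]]
```

Because the similarity search runs over unique texts rather than records, its cost depends on the number of distinct values, which is what makes a block with many records but few unique texts cheap to sub-block.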
ANN blocking
Use FAISS approximate nearest-neighbor search on sentence-transformer embeddings. Requires `pip install goldenmatch[embeddings]`.
`ann_pairs` is a faster variant (50-100x) that returns direct pairs instead of block groups.
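To see why returning pairs directly can mean less scoring work than returning block groups, consider the following rough sketch. It is not GoldenMatch code: the brute-force `knn` function stands in for a FAISS search, the vectors are made up, and all function names are hypothetical.

```python
from itertools import combinations


def knn(vectors, k):
    """Brute-force k nearest neighbors by squared Euclidean distance."""
    result = {}
    for i, v in enumerate(vectors):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(v, w)), j)
            for j, w in enumerate(vectors) if j != i
        )
        result[i] = [j for _, j in dists[:k]]
    return result


def neighbor_pairs(neighbors):
    """Direct-pair mode: each (record, neighbor) hit is one candidate pair."""
    return sorted({tuple(sorted((i, j))) for i, js in neighbors.items() for j in js})


def neighbor_blocks(neighbors):
    """Block mode: records connected through neighbor links form one group."""
    parent = {i: i for i in neighbors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, js in neighbors.items():
        for j in js:
            parent[find(i)] = find(j)
    groups = {}
    for i in neighbors:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def pairs_within(blocks):
    """Every pair inside every block, as a block-based scorer would compare."""
    return sorted(p for b in blocks for p in combinations(sorted(b), 2))


# Made-up 2-D "embeddings": a chain of four points plus a distant pair.
vectors = [(0.0, 0.0), (1.0, 0.0), (1.9, 0.0), (2.7, 0.0), (9.0, 9.0), (9.5, 9.0)]
neighbors = knn(vectors, k=1)
direct = neighbor_pairs(neighbors)
via_blocks = pairs_within(neighbor_blocks(neighbors))
print(direct)           # [(0, 1), (1, 2), (2, 3), (4, 5)]
print(len(via_blocks))  # 7
```

The chain of four points collapses into one block, so block mode compares all six pairs inside it, while direct-pair mode keeps only the three neighbor links. The gap widens as blocks grow.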
Canopy blocking
TF-IDF-based canopy clustering with loose and tight thresholds.
Learned blocking
Data-driven predicate selection via a two-pass approach: sample pairs, train predicates, apply to full data. Achieves 96.9% F1 on DBLP-ACM, matching hand-tuned static blocking.
Auto-select
Let GoldenMatch pick the best blocking key by histogram analysis.
Performance impact
Blocking key choice dominates fuzzy matching performance. A coarse key (e.g., `state`) creates huge blocks and slows scoring. A fine key (e.g., `email`) misses near-duplicates.
| Key | Records | Blocks | Max Size | Comparisons | Time |
|---|---|---|---|---|---|
| `zip` | 100K | 8,200 | 340 | 1.2M | 12s |
| `state` | 100K | 50 | 12,000 | 45M | 320s |
| `last_name + soundex` | 100K | 4,100 | 180 | 0.8M | 9s |
| `learned` | 100K | 3,800 | 200 | 0.9M | 10s |
- Target max block size under 1,000 records
- Use `multi_pass` for best recall, `adaptive` for best speed
- Use `learned` to auto-discover optimal predicates
- Use `ann_pairs` for semantic/product matching
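The block count, max size, and comparison columns in the table above can be estimated for any candidate key before running a match. The sketch below is a generic illustration with made-up records and a hypothetical `block_stats` helper; it is not GoldenMatch's auto-select logic. It counts comparisons as n(n-1)/2 per block, which is why one oversized block dominates the total.

```python
from collections import Counter


def block_stats(records, key_fn):
    """Summarize the blocks a candidate key would produce."""
    sizes = Counter(key_fn(r) for r in records)
    return {
        "blocks": len(sizes),
        "max_size": max(sizes.values(), default=0),
        # Each block of n records needs n * (n - 1) / 2 pairwise comparisons.
        "comparisons": sum(n * (n - 1) // 2 for n in sizes.values()),
    }


# Made-up records for illustration only.
records = [
    {"state": "NY", "zip": "10001"},
    {"state": "NY", "zip": "10001"},
    {"state": "NY", "zip": "10002"},
    {"state": "NY", "zip": "10002"},
    {"state": "CA", "zip": "90001"},
    {"state": "CA", "zip": "90001"},
]
by_state = block_stats(records, lambda r: r["state"])
by_zip = block_stats(records, lambda r: r["zip"])
print(by_state)  # {'blocks': 2, 'max_size': 4, 'comparisons': 7}
print(by_zip)    # {'blocks': 3, 'max_size': 2, 'comparisons': 3}
```

Running a check like this against a sample of the data makes it easy to flag keys whose `max_size` exceeds the target block size.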