Blocking strategies

Blocking reduces the comparison space from O(N^2) to roughly O(N*B), where B is the typical block size, by comparing only records that share a key. GoldenMatch supports 11 strategies.
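
To make the reduction concrete, the sketch below counts candidate pairs with and without blocking on a toy list of keys. It is plain Python and does not use GoldenMatch; the function names and the zip values are made up for illustration.

```python
from collections import Counter

def pairs(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2

def blocked_pairs(keys: list[str]) -> int:
    """Comparisons needed when only records sharing a key are compared."""
    return sum(pairs(size) for size in Counter(keys).values())

# Toy data: 6 records with a hypothetical zip blocking key.
zips = ["10001", "10001", "10001", "94105", "94105", "60601"]
all_pairs = pairs(len(zips))          # every record against every other
within_blocks = blocked_pairs(zips)   # 3 + 1 + 0 pairs inside the three blocks
```

Here the all-pairs count is 15 and the blocked count is 4; the gap widens quickly as N grows while block sizes stay bounded.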

Strategy overview

Strategy	Description	Best For
`static`	Group by blocking key	Clean data with reliable keys
`adaptive`	Static + recursive sub-blocking for oversized blocks	Default choice
`sorted_neighborhood`	Sliding window over sorted records	Typos in blocking key
`multi_pass`	Union of blocks from multiple passes	Noisy data, best recall
`ann`	FAISS nearest-neighbor on embeddings	Semantic matching
`ann_pairs`	Direct-pair ANN scoring	50-100x faster than `ann`
`canopy`	TF-IDF canopy clustering	Text-heavy data
`learned`	Data-driven predicate selection	Auto-discovers rules
`lsh`	MinHash/LSH sketching on a text column	Near-duplicate text / corpus dedup
`simhash`	SimHash LSH over embeddings	Semantic near-duplicate text
`perceptual`	Banded-Hamming LSH over perceptual image hashes	Near-duplicate images

Static blocking

Group records by exact value of the blocking key.

blocking:
  strategy: static
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]

Multiple keys produce independent blocks that are unioned. Transforms are applied before grouping.
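
A minimal sketch of the grouping step, assuming a transform is just a string function applied in order before the key is built. The TRANSFORMS registry and helper names are hypothetical, not GoldenMatch's implementation, and soundex is left out to keep the example short.

```python
from collections import defaultdict

# Hypothetical transform registry; GoldenMatch's real transforms may differ.
TRANSFORMS = {
    "lowercase": str.lower,
    "strip": str.strip,
}

def block_key(record, fields, transforms=()):
    """Build the blocking key: each field value run through the transform chain."""
    parts = []
    for field in fields:
        value = str(record[field])
        for name in transforms:
            value = TRANSFORMS[name](value)
        parts.append(value)
    return tuple(parts)

def static_blocks(records, fields, transforms=()):
    """Group record indices by the exact value of their blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec, fields, transforms)].append(idx)
    return dict(blocks)

records = [
    {"zip": "10001", "last_name": "Smith "},
    {"zip": "10001", "last_name": "smith"},
    {"zip": "94105", "last_name": "SMITH"},
]
by_zip = static_blocks(records, ["zip"])
by_name = static_blocks(records, ["last_name"], ["strip", "lowercase"])
```

The zip key puts records 0 and 1 together; the transformed last_name key puts all three together, which is why the two keys are worth unioning.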

Adaptive blocking

Static blocking with automatic sub-splitting for oversized blocks. When a block exceeds max_block_size, it splits on the highest-cardinality column within the block.

blocking:
  strategy: adaptive
  max_block_size: 5000
  keys:
    - fields: [zip]
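
The sub-splitting rule can be sketched as a small recursive function. This is an illustration of the idea described above, not GoldenMatch's code: split_block and the column names are hypothetical, and the stop condition (give up when no column can split the block) is an assumption.

```python
from collections import defaultdict

def split_block(records, ids, max_block_size, columns):
    """Recursively split `ids` until each block fits or no column helps."""
    if len(ids) <= max_block_size or not columns:
        return [ids]
    # Pick the column with the most distinct values inside this block.
    best = max(columns, key=lambda c: len({records[i][c] for i in ids}))
    groups = defaultdict(list)
    for i in ids:
        groups[records[i][best]].append(i)
    if len(groups) == 1:  # even the best column cannot split this block
        return [ids]
    remaining = [c for c in columns if c != best]
    out = []
    for sub in groups.values():
        out.extend(split_block(records, sub, max_block_size, remaining))
    return out

# One oversized zip block; "last" has more distinct values than "city".
records = [
    {"city": "NYC", "last": "a"},
    {"city": "NYC", "last": "a"},
    {"city": "NYC", "last": "b"},
    {"city": "NYC", "last": "c"},
]
blocks = split_block(records, [0, 1, 2, 3], max_block_size=2, columns=["city", "last"])
```

With max_block_size=2 the four-record block is split on "last" into [0, 1], [2] and [3]; with a larger limit it is returned unchanged.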

Sorted neighborhood

Sliding window over records sorted by a key. Catches near-matches that differ by one character in the blocking key.

blocking:
  strategy: sorted_neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
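
A minimal sketch of the sliding window, assuming each record is paired with the next window_size - 1 records in sort order. The function name is hypothetical and the sort key here is the raw string rather than a transformed one.

```python
def sorted_neighborhood_pairs(values, window_size):
    """Pair each record with the next window_size - 1 records in sort order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window_size]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

names = ["smith", "smyth", "jones", "smithe", "adams"]
narrow = sorted_neighborhood_pairs(names, window_size=2)
wide = sorted_neighborhood_pairs(names, window_size=3)
```

In sort order "smithe" sits between "smith" and "smyth", so the pair (0, 1) only appears once the window grows to 3. A wider window trades more comparisons for more recall.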

Multi-pass blocking

Run multiple blocking passes and union the results. Best recall for noisy data.

blocking:
  strategy: multi_pass
  union_mode: true
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [first_name]
      transforms: [lowercase, first_token]
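
The union can be sketched as a set of candidate pairs accumulated across passes. The helper names are hypothetical, and skipping records with an empty key is an assumption made for the example, not documented GoldenMatch behavior.

```python
from collections import defaultdict
from itertools import combinations

def pass_pairs(records, key_fn):
    """Candidate pairs produced by one blocking pass."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        key = key_fn(rec)
        if key:  # assumption: records with an empty key join no block
            blocks[key].append(idx)
    return {pair for ids in blocks.values() for pair in combinations(ids, 2)}

def multi_pass_pairs(records, key_fns):
    """Union of the candidate pairs from every pass."""
    candidates = set()
    for key_fn in key_fns:
        candidates |= pass_pairs(records, key_fn)
    return candidates

records = [
    {"zip": "10001", "last_name": "Smith"},
    {"zip": "10001", "last_name": "Jones"},
    {"zip": "", "last_name": "smith"},
]
passes = [lambda r: r["zip"], lambda r: r["last_name"].lower()]
candidates = multi_pass_pairs(records, passes)
```

Record 2 has no zip, so the zip pass alone would miss it; the last_name pass recovers the pair (0, 2).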

ANN hybrid blocking

New in v1.2.6. Combine multi-pass string blocking with ANN fallback for oversized blocks. When a block exceeds max_block_size and would normally be skipped, GoldenMatch embeds only the unique text values in that block and uses FAISS to create smaller sub-blocks.

blocking:
  strategy: multi_pass
  passes:
    - fields: [model_desc, state]
      transforms: [lowercase, strip]
    - fields: [base_model]
      transforms: [lowercase, soundex]
  max_block_size: 1000
  skip_oversized: true
  ann_column: description_text     # enables ANN fallback
  ann_top_k: 20

How it works:

1. Multi-pass blocking creates string-based blocks (fast, handles most data)
2. Blocks exceeding max_block_size trigger ANN fallback instead of being skipped
3. ANN embeds only the unique text values, so a block of 61K records with 187 unique texts embeds in seconds
4. FAISS finds nearest neighbors among the unique texts, and Union-Find merges them into sub-blocks
5. Sub-blocks still exceeding max_block_size (after 10x cap) are skipped

On the Bulldozer dataset (401K rows), this recovered 363 sub-blocks from 15 oversized blocks that would otherwise be skipped, matching 949 additional records. Requires Vertex AI (GOLDENMATCH_GPU_MODE=vertex) or local sentence-transformers for embedding.
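
The Union-Find step can be sketched in plain Python. The neighbor edges below stand in for what a nearest-neighbor search over the unique texts would return; no FAISS or embedding call is made, and sub_blocks is a hypothetical helper, not GoldenMatch's API.

```python
def find(parent, x):
    """Return the root of x, shortening the path as we walk up."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def sub_blocks(texts, neighbor_edges):
    """Group record indices whose unique texts are linked by neighbor edges."""
    parent = {t: t for t in set(texts)}  # one node per unique text, not per record
    for a, b in neighbor_edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for idx, t in enumerate(texts):
        groups.setdefault(find(parent, t), []).append(idx)
    return sorted(groups.values())

texts = [
    "cat 320 excavator",
    "cat 320 excavator",
    "cat320 excavator",
    "deere 450 dozer",
    "cat 320 excavator",
]
# Assumed output of the neighbor search: the two excavator spellings are close.
edges = [("cat 320 excavator", "cat320 excavator")]
result = sub_blocks(texts, edges)
```

Five records collapse to three unique texts, and one neighbor edge is enough to place records 0, 1, 2 and 4 in the same sub-block.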

ANN blocking

Use FAISS approximate nearest-neighbor search on sentence-transformer embeddings. Requires pip install goldenmatch[embeddings].

blocking:
  strategy: ann
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20

ann_pairs is a faster variant (50-100x) that returns direct pairs instead of block groups:

blocking:
  strategy: ann_pairs
  ann_column: title
  ann_top_k: 20

SimHash (semantic) blocking

New in #1082. Bucket records by semantic near-duplication. SimHash embeds a text column, projects each embedding through num_planes random hyperplanes into a 0/1 signature, then bands the signature into LSH buckets. Records whose embeddings are cosine-near collide in a band and become candidates. This catches paraphrases that share meaning but little surface text, where the lexical lsh strategy (word/char-shingle MinHash) only catches shared shingles.

blocking:
  strategy: simhash
  simhash:
    column: description   # the text column to embed
    num_planes: 256       # hyperplane count = SimHash signature length
    num_bands: 32         # band count (must divide num_planes)
    seed: 0               # deterministic plane set
    # threshold: 0.85     # alternative to num_bands: picks the band/row split
    # model: inhouse      # embedder; defaults to the in-house ER embedder

Provide num_bands or threshold; you do not need both, and num_bands wins if both are set. More bands means looser blocking (higher recall, more candidates); fewer bands is tighter.

The default embedder is the zero-config in-house model, so SimHash works without external credentials; set model to use a configured provider.

For a text corpus (a column of long free text), auto-config picks simhash when an embedder is reachable and falls back to the lexical lsh strategy when it is not. So dedupe_df(corpus) takes the semantic near-duplicate path with no config when embeddings are available.

SimHash buckets dense embeddings; the lexical lsh strategy (MinHash) buckets sparse shingle sets. Use simhash for semantic paraphrase and lsh for lexical near-duplicates.
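
The hyperplane-and-band mechanics can be sketched with the standard library alone. This is an illustration under stated assumptions, not GoldenMatch's kernel: the three-dimensional vectors stand in for real embeddings, and make_planes, signature, band_keys and shares_band are hypothetical names.

```python
import random

def make_planes(dim, num_planes, seed=0):
    """Draw num_planes random hyperplane normals; a fixed seed makes them repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0) for plane in planes)

def band_keys(sig, num_bands):
    """Cut the signature into num_bands equal slices, tagged with the band index."""
    if len(sig) % num_bands:
        raise ValueError("num_bands must divide num_planes")
    rows = len(sig) // num_bands
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(num_bands)]

def shares_band(sig_a, sig_b, num_bands):
    """Two records are candidates if any band slice matches exactly."""
    keys_a, keys_b = band_keys(sig_a, num_bands), band_keys(sig_b, num_bands)
    return any(a == b for a, b in zip(keys_a, keys_b))

planes = make_planes(dim=3, num_planes=16, seed=0)
v = [0.2, -1.0, 0.5]             # stand-in for a text embedding
scaled = [2.0 * x for x in v]    # same direction, so cosine similarity is 1
sig_v, sig_scaled = signature(v, planes), signature(scaled, planes)
```

Because only the sign of each projection is kept, vectors pointing in the same direction get identical signatures regardless of length. Splitting into more bands makes each slice shorter, so a match in at least one band becomes more likely.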

High-recall corpus dedup without writing a blocking config. For large text corpora where you want near-duplicate recall at throughput scale, the throughput tier (dedupe_df(..., throughput=0.95)) picks lsh or simhash automatically based on embedder availability, then confirms candidate pairs by sketch distance — no BlockingConfig required. See Throughput tier.

Canopy blocking

TF-IDF-based canopy clustering with loose and tight thresholds.

blocking:
  strategy: canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500
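
A minimal sketch of how the two thresholds interact in classic canopy clustering. To stay dependency-free it swaps TF-IDF cosine similarity for Jaccard similarity over word tokens, so the numbers are not comparable to GoldenMatch's; the function names are hypothetical and max_canopy_size is not modeled.

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap; a stand-in for TF-IDF cosine similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

def canopies(texts, loose_threshold, tight_threshold):
    tokens = [set(t.lower().split()) for t in texts]
    pool = list(range(len(texts)))  # records still eligible to become a center
    result = []
    while pool:
        center = pool[0]
        sims = {i: jaccard(tokens[center], tokens[i]) for i in range(len(texts))}
        # Everything at least loosely similar to the center joins its canopy.
        result.append([i for i, s in sims.items() if i == center or s >= loose_threshold])
        # The center and anything tightly similar to it can no longer start a canopy.
        pool = [i for i in pool if i != center and sims[i] < tight_threshold]
    return result

texts = [
    "acme corp new york",
    "acme corp new york ny",
    "acme corporation boston",
    "zenith labs",
]
strict = canopies(texts, loose_threshold=0.3, tight_threshold=0.7)
loose = canopies(texts, loose_threshold=0.15, tight_threshold=0.7)
```

Lowering the loose threshold lets record 0 appear in two canopies, which is the intended behavior: canopies overlap, and only the tight threshold removes records from the pool of future centers.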

Learned blocking

Data-driven predicate selection via a two-pass approach: sample pairs, train predicates, then apply them to the full data. On DBLP-ACM it achieves 96.9% F1, matching hand-tuned static blocking.

blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl

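import goldenmatch as gm

# df and matchkey are assumed to be defined earlier in your script.
# Pass 1: sample pairs from df and learn blocking predicates.
rules = gm.learn_blocking_rules(df, matchkey, sample_size=5000)
# Pass 2: apply the learned predicates to the full dataset.
blocks = gm.apply_learned_blocks(df, rules)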

Cache the learned rules to skip re-training on subsequent runs.
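
One plausible reading of learned_min_recall and learned_min_reduction is sketched below: a candidate predicate is kept only if it co-blocks enough known matching pairs and removes enough of the all-pairs comparisons. This is an assumption about what the thresholds mean, not GoldenMatch's training procedure, and all names are hypothetical.

```python
from collections import Counter

def recall(values, match_pairs):
    """Fraction of known matching pairs that land in the same block."""
    hits = sum(values[i] == values[j] for i, j in match_pairs)
    return hits / len(match_pairs)

def reduction(values):
    """Fraction of all-pairs comparisons that blocking on these values avoids."""
    n = len(values)
    total = n * (n - 1) // 2
    kept = sum(c * (c - 1) // 2 for c in Counter(values).values())
    return 1 - kept / total

def select_predicates(columns, match_pairs, min_recall, min_reduction):
    """Keep the candidate keys that satisfy both thresholds on the sample."""
    return [
        name
        for name, values in columns.items()
        if recall(values, match_pairs) >= min_recall
        and reduction(values) >= min_reduction
    ]

# Toy sample of 6 records with two labeled matching pairs.
columns = {
    "zip": ["1", "1", "2", "2", "3", "3"],
    "state": ["A"] * 6,
    "id": ["a", "b", "c", "d", "e", "f"],
}
match_pairs = [(0, 1), (2, 3)]
chosen = select_predicates(columns, match_pairs, min_recall=0.95, min_reduction=0.75)
```

On this sample state keeps every match but removes no comparisons, id removes every comparison but keeps no match, and only zip passes both tests.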

MinHash/LSH blocking

New in #1081. Probabilistic sketching for near-duplicate text. Each record’s text column is shingled (char- or word-grams), MinHashed into a signature, and the signature is split into LSH bands; records that share at least one band bucket become candidates. This is the set-similarity (Jaccard) path — the right tool for document/corpus-scale near-duplicate detection, where keyed-predicate strategies (static/multi-pass) are a poor fit.

blocking:
  strategy: lsh
  lsh:
    column: description   # the text column to shingle
    mode: word            # word- or char-grams
    k: 2                  # shingle size
    num_perms: 128        # MinHash signature length
    threshold: 0.5        # similarity threshold; picks the band/row split via
                          # optimal_bands. Or set num_bands explicitly (must
                          # divide num_perms).

A lower threshold yields more bands (higher recall, more candidate pairs); a higher one yields fewer (more precision, fewer pairs).
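
The shingle, MinHash and band steps can be sketched with the standard library. This is not the goldenmatch.core.sketch kernel: the seeded BLAKE2b hash family, the fixed num_bands, and the function names are assumptions chosen to keep the example short and deterministic.

```python
import hashlib

def shingles(text, k=2):
    """Set of k-word shingles; empty if the text has fewer than k words."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def _hash(seed, item):
    """One member of a seeded hash family (an illustrative choice)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(shingle_set, num_perms=128):
    """Signature: the minimum hash of the set under each seeded hash.
    Callers must skip empty shingle sets, which have nothing to sketch."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perms)]

def bands(sig, num_bands):
    """Split the signature into num_bands slices tagged with their band index."""
    rows = len(sig) // num_bands
    return {(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(num_bands)}

def is_candidate(sig_a, sig_b, num_bands):
    """Records become candidates when they share at least one band bucket."""
    return bool(bands(sig_a, num_bands) & bands(sig_b, num_bands))

a = "the quick brown fox jumps over the lazy dog"
b = "completely unrelated sentence about database indexing strategies"
sig_a, sig_b = minhash(shingles(a)), minhash(shingles(b))
```

Identical texts share every band, while texts with no shingles in common share none. For pairs in between, the chance of sharing a band rises with their Jaccard similarity and with the number of bands.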

LSH measures lexical overlap, not semantic similarity. It excels at near-duplicates that share words or characters (boilerplate, reposts, lightly edited copies). It does not catch paraphrases that mean the same thing in different words: on the Quora Question Pairs benchmark (semantic duplicates) it recovers only ~21% of labeled pairs at threshold: 0.5, versus ~98% on a lexical near-duplicate set. For semantic matching, use an embedding-based strategy such as ann or simhash.

Empty / whitespace-only text rows have nothing to sketch and are skipped. The underlying kernel (goldenmatch.core.sketch) is shared byte-for-byte across Python, the optional native extension, and the TypeScript port. Recall is measured by an always-on synthetic gate plus a Quora Question Pairs benchmark (bench-lsh-recall.yml).

Auto-select

Let GoldenMatch pick the best blocking key by histogram analysis:

blocking:
  auto_select: true
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [city]

The analyzer scores each key by block count, max block size, estimated comparisons, and recall. Use the CLI to see suggestions:

goldenmatch analyze-blocking customers.csv --config config.yaml
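
The structural part of that score (block count, max block size, estimated comparisons) can be computed from a histogram of key values, as in the sketch below. Recall is left out because it needs labeled matches. The function name and sample records are hypothetical and this is not the analyzer's actual scoring code.

```python
from collections import Counter

def key_stats(values):
    """Histogram-based statistics for one candidate blocking key."""
    sizes = Counter(values).values()
    return {
        "blocks": len(sizes),
        "max_block_size": max(sizes),
        "comparisons": sum(n * (n - 1) // 2 for n in sizes),
    }

records = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10002", "state": "NY"},
    {"zip": "10001", "state": "NY"},
    {"zip": "94105", "state": "CA"},
]
stats = {field: key_stats([r[field] for r in records]) for field in ("zip", "state")}
```

Even on four records the coarser state key produces a larger max block and more comparisons than zip, which is the same pattern a full analysis would show at scale.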

Performance impact

Blocking key choice dominates fuzzy matching performance. A coarse key (e.g., state) creates huge blocks and slow scoring. A fine key (e.g., email) misses near-duplicates.

Key	Records	Blocks	Max Size	Comparisons	Time
`zip`	100K	8,200	340	1.2M	12s
`state`	100K	50	12,000	45M	320s
`last_name + soundex`	100K	4,100	180	0.8M	9s
`learned`	100K	3,800	200	0.9M	10s

Rules of thumb:

Target max block size under 1,000 records
Use multi_pass for best recall, adaptive for best speed
Use learned to auto-discover optimal predicates
Use ann_pairs for semantic/product matching
