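Blocking reduces the comparison space from O(N^2) to O(N*B), where N is the number of records and B is the typical block size, by grouping records that share a key and comparing only within each group. GoldenMatch supports 8 strategies.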
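To make the reduction concrete, here is a minimal sketch that counts candidate pairs with and without blocking. It is plain Python and does not call GoldenMatch; the synthetic zip keys and the helper names are illustrative assumptions.

```python
from collections import Counter
from math import comb

def pairs_without_blocking(n: int) -> int:
    """All unordered pairs among n records: n * (n - 1) / 2."""
    return comb(n, 2)

def pairs_with_blocking(block_keys: list[str]) -> int:
    """Sum of within-block pairs when records are grouped by key."""
    sizes = Counter(block_keys).values()
    return sum(comb(size, 2) for size in sizes)

# Hypothetical data: 10,000 records spread evenly over 100 zip codes.
keys = [f"{i % 100:05d}" for i in range(10_000)]

print(pairs_without_blocking(len(keys)))  # 49995000
print(pairs_with_blocking(keys))          # 495000
```

With 100 evenly sized blocks the pair count drops by roughly a factor of 100. Skewed keys reduce it far less, because the largest block dominates the sum.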

Strategy overview

| Strategy | Description | Best For |
| --- | --- | --- |
| static | Group by blocking key | Clean data with reliable keys |
| adaptive | Static + recursive sub-blocking for oversized blocks | Default choice |
| sorted_neighborhood | Sliding window over sorted records | Typos in blocking key |
| multi_pass | Union of blocks from multiple passes | Noisy data, best recall |
| ann | FAISS nearest-neighbor on embeddings | Semantic matching |
| ann_pairs | Direct-pair ANN scoring | 50-100x faster than ann |
| canopy | TF-IDF canopy clustering | Text-heavy data |
| learned | Data-driven predicate selection | Auto-discovers rules |

Static blocking

Group records by exact value of the blocking key.
blocking:
  strategy: static
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
Multiple keys produce independent blocks that are unioned. Transforms are applied before grouping.
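As a rough illustration of "transforms are applied before grouping", the sketch below builds blocks for two keys in plain Python. The `TRANSFORMS` registry, `block_key`, and `static_blocks` are hypothetical names for this example, not GoldenMatch functions, and `soundex` is left out to keep it short.

```python
from collections import defaultdict

# Hypothetical transform registry; the names mirror the config, but the
# implementations here are illustrative only.
TRANSFORMS = {
    "lowercase": str.lower,
    "strip": str.strip,
}

def block_key(record, fields, transforms=()):
    """Build the key tuple for one record, applying transforms in order."""
    parts = []
    for field in fields:
        value = str(record[field])
        for name in transforms:
            value = TRANSFORMS[name](value)
        parts.append(value)
    return tuple(parts)

def static_blocks(records, fields, transforms=()):
    """Group record indices by the exact value of the transformed key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[block_key(record, fields, transforms)].append(idx)
    return dict(blocks)

records = [
    {"zip": "10001", "last_name": "Smith"},
    {"zip": "10001", "last_name": "SMITH "},
    {"zip": "94105", "last_name": "smith"},
]
by_zip = static_blocks(records, ["zip"])
by_name = static_blocks(records, ["last_name"], ["lowercase", "strip"])
print(by_zip)   # {('10001',): [0, 1], ('94105',): [2]}
print(by_name)  # {('smith',): [0, 1, 2]}
```

Without the transforms, the three spellings of the last name would land in three separate blocks.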

Adaptive blocking

Static blocking with automatic sub-splitting for oversized blocks. When a block exceeds max_block_size, it splits on the highest-cardinality column within the block.
blocking:
  strategy: adaptive
  max_block_size: 5000
  keys:
    - fields: [zip]
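The sub-splitting idea can be sketched as a short recursive function. This is an assumption-laden illustration rather than GoldenMatch's implementation: `adaptive_split` and its arguments are hypothetical, and it stops when no remaining column can split a block further.

```python
from collections import defaultdict

def split_by(records, indices, column):
    """Partition a block's record indices by the value of one column."""
    groups = defaultdict(list)
    for i in indices:
        groups[records[i][column]].append(i)
    return list(groups.values())

def adaptive_split(records, indices, columns, max_block_size):
    """Recursively split a block until every piece fits max_block_size."""
    if len(indices) <= max_block_size or not columns:
        return [indices]
    # Pick the column with the most distinct values inside this block.
    best = max(columns, key=lambda c: len({records[i][c] for i in indices}))
    if len({records[i][best] for i in indices}) == 1:
        return [indices]  # no column can split this block further
    remaining = [c for c in columns if c != best]
    result = []
    for group in split_by(records, indices, best):
        result.extend(adaptive_split(records, group, remaining, max_block_size))
    return result

records = [
    {"zip": "10001", "city": "NYC", "last_name": "smith"},
    {"zip": "10001", "city": "NYC", "last_name": "smith"},
    {"zip": "10001", "city": "NYC", "last_name": "jones"},
    {"zip": "10001", "city": "NYC", "last_name": "lee"},
]
# One oversized zip block of 4 records, split with a cap of 2.
sub_blocks = adaptive_split(records, [0, 1, 2, 3], ["city", "last_name"], 2)
print(sub_blocks)  # [[0, 1], [2], [3]]
```

Here `city` has a single value inside the block, so the split falls on `last_name`, the higher-cardinality column.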

Sorted neighborhood

Sliding window over records sorted by a key. Catches near-matches that differ by one character in the blocking key.
blocking:
  strategy: sorted_neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
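A minimal sketch of the sliding-window idea, in plain Python: records are sorted by a key and each record is paired with the next `window_size - 1` records in sort order. The function name and the toy data are assumptions made for illustration.

```python
def sorted_neighborhood_pairs(records, sort_key, window_size):
    """Pair each record with the next (window_size - 1) records in sort order."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window_size]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

names = ["Johnson", "Jonson", "Smith", "Smyth", "Adams"]
pairs = sorted_neighborhood_pairs(names, str.lower, window_size=2)
print(sorted(pairs))  # [(0, 1), (0, 4), (1, 2), (2, 3)]
```

"Johnson"/"Jonson" and "Smith"/"Smyth" end up adjacent after sorting, so they become candidate pairs even though an exact-key grouping on the name would keep them apart. The window also produces some unrelated neighbors, such as (0, 4), which the scorer is expected to reject.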

Multi-pass blocking

Run multiple blocking passes and union the results. Best recall for noisy data.
blocking:
  strategy: multi_pass
  union_mode: true
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [first_name]
      transforms: [lowercase, first_token]
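The recall benefit of unioning passes can be seen in a small sketch. It assumes each pass is just a key function over a record; `pairs_for_pass` and `multi_pass_pairs` are hypothetical helpers, not part of the GoldenMatch API.

```python
from collections import defaultdict
from itertools import combinations

def pairs_for_pass(records, key_fn):
    """Candidate pairs produced by one blocking pass."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        key = key_fn(record)
        if key:  # skip records with a missing key
            blocks[key].append(idx)
    return {pair for members in blocks.values() for pair in combinations(members, 2)}

def multi_pass_pairs(records, key_fns):
    """Union of candidate pairs across all passes."""
    pairs = set()
    for key_fn in key_fns:
        pairs |= pairs_for_pass(records, key_fn)
    return pairs

records = [
    {"zip": "10001", "last_name": "Smith"},
    {"zip": "10001", "last_name": "Smyth"},  # same zip, different spelling
    {"zip": "", "last_name": "smith"},       # missing zip, same name
]
passes = [
    lambda r: r["zip"],
    lambda r: r["last_name"].lower(),
]
pairs = multi_pass_pairs(records, passes)
print(sorted(pairs))  # [(0, 1), (0, 2)]
```

Neither pass alone finds both pairs: the zip pass misses the record with the empty zip, and the name pass misses the misspelling.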

ANN hybrid blocking

New in v1.2.6. Combine multi-pass string blocking with ANN fallback for oversized blocks. When a block exceeds max_block_size and would normally be skipped, GoldenMatch embeds only the unique text values in that block and uses FAISS to create smaller sub-blocks.
blocking:
  strategy: multi_pass
  passes:
    - fields: [model_desc, state]
      transforms: [lowercase, strip]
    - fields: [base_model]
      transforms: [lowercase, soundex]
  max_block_size: 1000
  skip_oversized: true
  ann_column: description_text     # enables ANN fallback
  ann_top_k: 20
How it works:
  1. Multi-pass blocking creates string-based blocks (fast, handles most data)
  2. Blocks exceeding max_block_size trigger ANN fallback instead of being skipped
  3. ANN embeds only unique text values, so a block of 61K records with 187 unique texts is embedded in seconds
  4. FAISS finds nearest neighbors among unique texts, Union-Find creates sub-blocks
  5. Sub-blocks still exceeding max_block_size (after 10x cap) are skipped
On the Bulldozer dataset (401K rows), this recovered 363 sub-blocks from 15 oversized blocks that would otherwise be skipped, matching 949 additional records. Requires Vertex AI (GOLDENMATCH_GPU_MODE=vertex) or local sentence-transformers for embedding.
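Steps 3 and 4 of the list above can be sketched in pure Python. This is a toy under stated assumptions: a character-bigram count vector stands in for a sentence-transformer embedding, a brute-force cosine search stands in for FAISS, and the `min_sim` cutoff is a parameter introduced for this example (it is not a documented GoldenMatch option).

```python
from collections import Counter, defaultdict
from math import sqrt

def embed(text):
    """Toy embedding: character-bigram counts (stand-in for a real model)."""
    text = text.lower()
    return Counter(text[i : i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find(parent, x):
    """Union-Find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def ann_sub_blocks(texts, top_k=2, min_sim=0.5):
    unique = sorted(set(texts))  # embed each distinct text only once
    vectors = [embed(t) for t in unique]
    parent = list(range(len(unique)))
    for i, vec in enumerate(vectors):
        # Brute-force neighbor search; FAISS would replace this loop.
        scored = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in scored[:top_k]:
            if sim >= min_sim:
                parent[find(parent, i)] = find(parent, j)  # union
    root_of = {t: find(parent, i) for i, t in enumerate(unique)}
    blocks = defaultdict(list)
    for row, text in enumerate(texts):  # map rows back through their text
        blocks[root_of[text]].append(row)
    return sorted(blocks.values())

texts = [
    "cat 320 excavator",
    "CAT 320 Excavator",
    "cat 320 excavator",
    "john deere loader",
    "john deere loaders",
]
print(ann_sub_blocks(texts))  # [[0, 1, 2], [3, 4]]
```

The cost of the neighbor search depends on the number of unique texts rather than the number of rows, which is why an oversized block with few distinct values is cheap to sub-block.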

ANN blocking

Use FAISS approximate nearest-neighbor search on sentence-transformer embeddings. Requires pip install goldenmatch[embeddings].
blocking:
  strategy: ann
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20
ann_pairs is a faster variant (50-100x) that returns direct pairs instead of block groups:
blocking:
  strategy: ann_pairs
  ann_column: title
  ann_top_k: 20

Canopy blocking

TF-IDF-based canopy clustering with loose and tight thresholds.
blocking:
  strategy: canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500
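For readers unfamiliar with canopy clustering, here is a small sketch of the classic two-threshold procedure expressed with similarities: every record at least `loose_threshold` similar to a center joins its canopy, and records at least `tight_threshold` similar are removed from the pool of future centers. Token-set Jaccard stands in for TF-IDF cosine, `max_canopy_size` is not modeled, and none of this is taken from GoldenMatch's source.

```python
def jaccard(a, b):
    """Token-set similarity; a simple stand-in for TF-IDF cosine."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def canopy_clusters(texts, loose_threshold, tight_threshold, sim=jaccard):
    if loose_threshold > tight_threshold:
        raise ValueError("loose_threshold must not exceed tight_threshold")
    pool = list(range(len(texts)))  # candidates for future canopy centers
    canopies = []
    while pool:
        center = pool[0]
        scores = {i: sim(texts[center], texts[i]) for i in range(len(texts))}
        canopies.append([i for i, s in scores.items() if s >= loose_threshold])
        # Records tightly bound to this center cannot seed another canopy.
        pool = [i for i in pool if i != center and scores[i] < tight_threshold]
    return canopies

texts = [
    "acme corp new york",
    "acme corp new york ny",
    "acme corp boston ma",
    "globex inc boston ma",
]
print(canopy_clusters(texts, loose_threshold=0.3, tight_threshold=0.7))
# [[0, 1, 2], [0, 2, 3], [2, 3]]
```

Canopies may overlap, as record 2 shows here. That overlap is what lets a borderline record be compared against more than one neighborhood.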

Learned blocking

Data-driven predicate selection via a two-pass approach: sample pairs, train predicates, then apply them to the full data. On DBLP-ACM it achieves 96.9% F1, matching hand-tuned static blocking.
blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl
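The same steps can be run from Python:
import goldenmatch as gm

# df and matchkey are assumed to be defined earlier (input records and matchkey config)
rules = gm.learn_blocking_rules(df, matchkey, sample_size=5000)  # learn predicates from a sample
blocks = gm.apply_learned_blocks(df, rules)  # apply the learned rules to the full data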
Cache the learned rules to skip re-training on subsequent runs.
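To illustrate what the recall and reduction thresholds constrain, the sketch below scores a few candidate predicates against a labeled sample and keeps those that pass both. It is a simplified stand-in, not GoldenMatch's training procedure: `evaluate_predicate`, `select_predicates`, the candidate predicates, and the toy thresholds are all assumptions.

```python
from collections import Counter
from math import comb

def evaluate_predicate(records, true_pairs, predicate):
    """Return (recall, reduction) for one blocking predicate."""
    keys = [predicate(r) for r in records]
    covered = sum(1 for i, j in true_pairs if keys[i] == keys[j])
    recall = covered / len(true_pairs)
    candidate_pairs = sum(comb(n, 2) for n in Counter(keys).values())
    reduction = 1 - candidate_pairs / comb(len(records), 2)
    return recall, reduction

def select_predicates(records, true_pairs, predicates, min_recall, min_reduction):
    """Keep predicates that meet both the recall and reduction thresholds."""
    chosen = []
    for name, predicate in predicates.items():
        recall, reduction = evaluate_predicate(records, true_pairs, predicate)
        if recall >= min_recall and reduction >= min_reduction:
            chosen.append(name)
    return chosen

records = [
    {"name": "Ann Lee", "zip": "10001"},
    {"name": "ann lee", "zip": "10001"},
    {"name": "Bob Ray", "zip": "10001"},
    {"name": "bob ray", "zip": "94105"},
    {"name": "Al Dunn", "zip": "94105"},
    {"name": "Abe Fox", "zip": "60601"},
]
true_pairs = [(0, 1), (2, 3)]  # labeled matches in the sample
predicates = {
    "exact_zip": lambda r: r["zip"],
    "lower_name": lambda r: r["name"].lower(),
    "first_letter": lambda r: r["name"][0].lower(),
}
# The reduction threshold is loosened to 0.8 because the sample has only 6 records.
chosen = select_predicates(records, true_pairs, predicates, 0.95, 0.8)
print(chosen)  # ['lower_name']
```

`exact_zip` fails on recall (it separates the two "Bob Ray" records), while `first_letter` keeps full recall but fails on reduction because too many names share an initial.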

Auto-select

Let GoldenMatch pick the best blocking key by histogram analysis:
blocking:
  auto_select: true
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [city]
The analyzer scores each key by block count, max block size, estimated comparisons, and recall. Use the CLI to see suggestions:
goldenmatch analyze-blocking customers.csv --config config.yaml
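The histogram statistics mentioned above (block count, max block size, estimated comparisons) are simple to compute by hand, which can help when sanity-checking the analyzer's suggestions. The sketch below is an independent illustration with a hypothetical `key_stats` helper. It omits recall, which needs labeled or estimated matches.

```python
from collections import Counter
from math import comb

def key_stats(values):
    """Block-size histogram summary for one candidate blocking key."""
    sizes = Counter(values).values()
    return {
        "blocks": len(sizes),
        "max_block_size": max(sizes),
        "comparisons": sum(comb(n, 2) for n in sizes),
    }

# Hypothetical column values for six records.
columns = {
    "zip": ["10001", "10001", "10002", "94105", "94105", "60601"],
    "state": ["NY", "NY", "NY", "CA", "CA", "IL"],
}
stats = {name: key_stats(values) for name, values in columns.items()}
for name, s in stats.items():
    print(name, s)
# zip {'blocks': 4, 'max_block_size': 2, 'comparisons': 2}
# state {'blocks': 3, 'max_block_size': 3, 'comparisons': 4}
```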

Performance impact

Blocking key choice dominates fuzzy matching performance. A key that is too coarse (e.g., state) creates huge blocks and slow scoring. A key that is too fine (e.g., email) misses near-duplicates whose key values differ.
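| Key | Records | Blocks | Max Size | Comparisons | Time |
| --- | --- | --- | --- | --- | --- |
| zip | 100K | 8,200 | 340 | 1.2M | 12s |
| state | 100K | 50 | 12,000 | 45M | 320s |
| last_name + soundex | 100K | 4,100 | 180 | 0.8M | 9s |
| learned | 100K | 3,800 | 200 | 0.9M | 10s |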
Rules of thumb:
  • Target max block size under 1,000 records
  • Use multi_pass for best recall, adaptive for best speed
  • Use learned to auto-discover optimal predicates
  • Use ann_pairs for semantic/product matching