match_one
The core primitive for streaming. Matches a single record against an existing DataFrame.
match_one works with fuzzy (weighted) matchkeys. For exact matchkeys (threshold=None), use find_exact_matches with a Polars join instead.
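To make the idea of a weighted fuzzy matchkey concrete, here is a minimal standard-library sketch of scoring one record against a set of existing rows. It is not the library's match_one: the function name match_one_sketch, the weights and threshold parameters, and the use of difflib for string similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_one_sketch(record, existing, weights, threshold):
    """Hypothetical stand-in for a single-record fuzzy match.

    Returns (row_index, score) pairs for rows whose weighted score
    reaches the threshold, best match first.
    """
    total = sum(weights.values())
    hits = []
    for i, row in enumerate(existing):
        # Weighted average of per-field similarities.
        score = sum(w * similarity(record[f], row[f]) for f, w in weights.items()) / total
        if score >= threshold:
            hits.append((i, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


existing = [
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "Mary Jones", "city": "Denver"},
]
new = {"name": "John Smith", "city": "Boston"}
matches = match_one_sketch(new, existing, weights={"name": 0.7, "city": 0.3}, threshold=0.85)
```

Because every existing row is scored, the cost grows linearly with the size of the existing data, which is why exact matchkeys are better served by a join.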
StreamProcessor
Incremental record matching with immediate or micro-batch processing. Wraps match_one and add_to_cluster for continuous operation.
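The wrapping pattern can be sketched as a small class that buffers records and, once the buffer is full, matches each record and updates the clusters. The class name StreamProcessorSketch, the batch_size parameter, and the match_fn and cluster_fn callables are hypothetical; the real StreamProcessor may expose a different interface.

```python
class StreamProcessorSketch:
    """Hypothetical wrapper: match each record, then update clusters."""

    def __init__(self, match_fn, cluster_fn, batch_size=1):
        self.match_fn = match_fn      # record -> list of matched ids
        self.cluster_fn = cluster_fn  # (record, matched ids) -> None
        self.batch_size = batch_size  # 1 behaves like immediate processing
        self.buffer = []

    def submit(self, record):
        """Queue a record; process the buffer once it reaches batch_size."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return 0

    def flush(self):
        """Match and cluster everything buffered; return the count processed."""
        batch, self.buffer = self.buffer, []
        for record in batch:
            self.cluster_fn(record, self.match_fn(record))
        return len(batch)


# Toy matching rule (an assumption): records match on identical email.
seen, log = [], []


def match_fn(record):
    return [r["id"] for r in seen if r["email"] == record["email"]]


def cluster_fn(record, matched_ids):
    log.append((record["id"], matched_ids))
    seen.append(record)


immediate = StreamProcessorSketch(match_fn, cluster_fn, batch_size=1)
immediate.submit({"id": 1, "email": "a@x.com"})  # processed at once

batched = StreamProcessorSketch(match_fn, cluster_fn, batch_size=2)
batched.submit({"id": 2, "email": "a@x.com"})    # buffered
batched.submit({"id": 3, "email": "b@x.com"})    # triggers a flush of both
```

This sketch still processes a flushed batch one record at a time. A real micro-batch implementation would presumably gain its throughput by matching the whole batch in one vectorized pass.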
Immediate mode
Each record is matched and clustered as it arrives.
Micro-batch mode
Buffer records and process them together for better throughput.
Incremental cluster updates
When a new record matches existing records, update the cluster structure.
add_to_cluster handles three cases:
- New record matches records in one cluster — joins that cluster
- New record matches records in multiple clusters — merges those clusters
- New record has no matches — creates a singleton cluster
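The three cases above can be sketched with a plain dictionary of clusters. The function name add_to_cluster_sketch, its arguments, and the cluster-id scheme are assumptions for illustration, not the library's actual signature.

```python
def add_to_cluster_sketch(clusters, record_id, matched_ids):
    """Hypothetical incremental cluster update.

    clusters maps cluster_id -> set of record ids and is mutated in place.
    Returns the id of the cluster the new record ends up in.
    """
    matched = set(matched_ids)
    touched = sorted(cid for cid, members in clusters.items() if members & matched)

    if not touched:
        # No matches: create a singleton cluster.
        new_id = max(clusters, default=0) + 1
        clusters[new_id] = {record_id}
        return new_id

    # One or more clusters matched: keep the lowest id and merge the rest into it.
    target = touched[0]
    for cid in touched[1:]:
        clusters[target] |= clusters.pop(cid)
    clusters[target].add(record_id)
    return target


clusters = {1: {"a", "b"}, 2: {"c"}}
joined = add_to_cluster_sketch(clusters, "d", ["a"])       # joins cluster 1
single = add_to_cluster_sketch(clusters, "e", [])          # new singleton cluster
merged = add_to_cluster_sketch(clusters, "f", ["b", "c"])  # merges clusters 1 and 2
```

Keeping the lowest cluster id on merge is an arbitrary choice made here so that the result is deterministic.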
Incremental CLI
Match new CSV records against an existing base dataset:
- Exact matchkeys: Polars join between new and base records (fast)
- Fuzzy matchkeys: match_one brute-force against the base (thorough)
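The split between the two strategies can be illustrated without the CLI. In this standard-library sketch a dictionary lookup stands in for the Polars join on an exact key, and a nested loop stands in for the brute-force fuzzy pass. The CSV contents, column names, and the 0.85 threshold are invented for the example.

```python
import csv
import io
from difflib import SequenceMatcher

BASE_CSV = "id,email,name\n1,ann@x.com,Ann Lee\n2,bob@x.com,Robert Stone\n"
NEW_CSV = (
    "id,email,name\n"
    "10,ann@x.com,Ann L.\n"
    "11,rob@y.com,Robert Stoen\n"
    "12,zed@z.com,Zed Quill\n"
)


def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


base, new = read_rows(BASE_CSV), read_rows(NEW_CSV)

# Exact matchkey: index the base once, then one lookup per new record
# (the same shape of work as an equi-join between the two tables).
by_email = {}
for row in base:
    by_email.setdefault(row["email"], []).append(row["id"])
exact = {r["id"]: by_email[r["email"]] for r in new if r["email"] in by_email}


# Fuzzy matchkey: compare every new record with every base record.
def name_score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


fuzzy = {}
for r in new:
    hits = [b["id"] for b in base if name_score(r["name"], b["name"]) >= 0.85]
    if hits:
        fuzzy[r["id"]] = hits
```

Here the exact pass links record 10 to base record 1 through the shared email, while the fuzzy pass catches the transposed surname in record 11 that an exact key would miss.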
ANN incremental operations
For embedding-based matching, the ANN index supports incremental updates.
Database watch mode
Continuously monitor a database table for new records and match them incrementally.
Daemon mode
Run watch as a background service with a health endpoint and PID file:
- HTTP health endpoint at /health
- PID file for process management
- SIGTERM handling for graceful shutdown
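The three daemon features listed above can be sketched with the Python standard library alone. The function names start_daemon, stop_daemon, and run_until_stopped, the JSON response body, and the polling interval are assumptions; the actual daemon may be structured differently.

```python
import json
import os
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

stop = threading.Event()  # set by the SIGTERM handler


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ok = self.path == "/health"
        body = json.dumps({"status": "ok"} if ok else {"error": "not found"}).encode()
        self.send_response(200 if ok else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def start_daemon(pid_path, port=0):
    """Write the PID file, serve /health in a thread, install a SIGTERM handler."""
    with open(pid_path, "w") as fh:
        fh.write(str(os.getpid()))
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    return server


def run_until_stopped(poll_once, interval=5.0):
    """Call poll_once every `interval` seconds until SIGTERM arrives."""
    while not stop.wait(interval):
        poll_once()


def stop_daemon(server, pid_path):
    """Graceful shutdown: stop serving and remove the PID file."""
    server.shutdown()
    server.server_close()
    os.remove(pid_path)
```

A caller would run start_daemon, then run_until_stopped with whatever function polls for new records, then stop_daemon. Using an event rather than exiting inside the signal handler lets the current polling pass finish before shutdown.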
Database sync
Full incremental matching against a live Postgres database:
- Incremental sync: only processes records added since last run
- Hybrid blocking: SQL WHERE clauses for exact fields + FAISS ANN for semantic fields
- Persistent ANN index: disk cache + DB source of truth
- Golden record versioning: append-only with is_current flag
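Two of these ideas, incremental sync and golden record versioning, can be sketched with an in-memory SQLite database standing in for Postgres. The table names, column names, and the id-based watermark are assumptions chosen for the example; the real schema is not shown in this document.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for a live Postgres database
db.executescript("""
CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sync_state (last_id INTEGER NOT NULL);
CREATE TABLE golden (cluster_id INTEGER, version INTEGER, name TEXT, is_current INTEGER);
INSERT INTO sync_state VALUES (0);
""")


def fetch_new(db):
    """Return rows added since the last run and advance the watermark."""
    (last_id,) = db.execute("SELECT last_id FROM sync_state").fetchone()
    rows = db.execute(
        "SELECT id, name FROM records WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    if rows:
        db.execute("UPDATE sync_state SET last_id = ?", (rows[-1][0],))
    return rows


def write_golden(db, cluster_id, name):
    """Append a new version; old rows are kept and only lose is_current."""
    (prev,) = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM golden WHERE cluster_id = ?", (cluster_id,)
    ).fetchone()
    db.execute("UPDATE golden SET is_current = 0 WHERE cluster_id = ?", (cluster_id,))
    db.execute("INSERT INTO golden VALUES (?, ?, ?, 1)", (cluster_id, prev + 1, name))


db.executemany("INSERT INTO records (name) VALUES (?)", [("Ann Lee",), ("Bob Stone",)])
first = fetch_new(db)   # both rows
db.execute("INSERT INTO records (name) VALUES ('Ann M. Lee')")
second = fetch_new(db)  # only the row added since the first run

write_golden(db, 1, "Ann Lee")
write_golden(db, 1, "Ann M. Lee")
history = db.execute(
    "SELECT version, name, is_current FROM golden ORDER BY version"
).fetchall()
```

Querying golden with is_current = 1 yields the latest record per cluster, while the older versions remain available for audit.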