AgentSession.autoconfigure() — run controller; return config + telemetry (v1.7-v1.12)
dedupe_df() — DataFrame deduplication
DedupeResult fields
StandardizationConfig - use rules dict, NOT keyword args
StandardizationConfig has a single field, rules: dict[str, list[str]], with a model validator. Passing keyword args instead of the rules dict raises a Pydantic validation error.
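To make the call shape concrete, here is a minimal stand-in written with a plain dataclass. It is not the library class and does not use Pydantic. The class name `StandardizationConfigSketch` and the rule names `"lowercase"` and `"strip"` are assumptions chosen for illustration. In the real library the rejected call would surface as a Pydantic validation error rather than a `TypeError`.

```python
from dataclasses import dataclass


@dataclass
class StandardizationConfigSketch:
    """Hypothetical stand-in, not the library class: a single `rules` field."""

    rules: dict[str, list[str]]

    def __post_init__(self) -> None:
        # Plays the role of the model validator: check the shape of the dict.
        if not isinstance(self.rules, dict):
            raise TypeError("rules must be a dict of name -> list of rule names")
        for key, names in self.rules.items():
            if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
                raise TypeError(f"rules[{key!r}] must be a list of strings")


# Correct shape: everything goes inside the single `rules` dict.
ok = StandardizationConfigSketch(rules={"email": ["lowercase", "strip"]})

# Wrong shape: passing entries as keyword arguments is rejected.
try:
    StandardizationConfigSketch(email=["lowercase", "strip"])
    rejected = False
except TypeError:
    rejected = True
```

The point to carry over is the shape of the argument: one dict keyed by name, with a list of rule names as each value.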
BlockingConfig requires keys field
MatchkeyConfig requires name field
Extracting pairs from clusters (correct way)
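As a rough sketch of the idea, pairs can be derived by enumerating every two-member combination inside each cluster. The exact shape of `result.clusters` is not shown here, so this example assumes a simple mapping from cluster id to member ids. The helper name `pairs_from_clusters` is hypothetical and not part of the library.

```python
from itertools import combinations
from typing import Hashable, Iterable, Mapping


def pairs_from_clusters(
    clusters: Mapping[Hashable, Iterable[Hashable]],
) -> set[tuple[Hashable, Hashable]]:
    """Return every unordered within-cluster pair as a sorted tuple."""
    pairs: set[tuple[Hashable, Hashable]] = set()
    for members in clusters.values():
        # De-duplicate and sort so each pair has one canonical orientation.
        unique = sorted(set(members))
        # Singleton clusters contribute nothing: combinations() yields no pairs.
        pairs.update(combinations(unique, 2))
    return pairs


# Assumed shape: cluster id -> list of row ids.
example_clusters = {0: [1, 4, 7], 1: [2], 2: [3, 5]}
example_pairs = pairs_from_clusters(example_clusters)
```

If the real clusters store their members under a nested key, pull that list out first and pass the resulting mapping to the helper.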
Multi-pass blocking for catching different dupe types
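The general technique can be illustrated without the library: run several blocking keys over the same records and take the union of the candidate pairs, so that a duplicate missed by one key can still be caught by another. This is a library-independent sketch. The record fields, the key functions, and the helper `candidate_pairs` are assumptions for illustration and do not reflect BlockingConfig's actual parameters.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable

Record = dict[str, str]
KeyFn = Callable[[Record], str]


def candidate_pairs(records: list[Record], passes: list[KeyFn]) -> set[tuple[int, int]]:
    """Union of within-block index pairs across all blocking passes."""
    pairs: set[tuple[int, int]] = set()
    for key_fn in passes:
        blocks: dict[str, list[int]] = defaultdict(list)
        for idx, rec in enumerate(records):
            key = key_fn(rec)
            if key:  # skip empty keys so blanks do not form one giant block
                blocks[key].append(idx)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


def by_zip(rec: Record) -> str:
    return rec["zip"]


def by_name_prefix(rec: Record) -> str:
    return rec["name"][:3].lower()


records = [
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "John Smith", "zip": "10001"},
    {"name": "Jane Doe", "zip": "94105"},
    {"name": "Jane Doe", "zip": ""},
]

zip_only = candidate_pairs(records, [by_zip])
multi = candidate_pairs(records, [by_zip, by_name_prefix])
```

Here the zip pass finds the two Smith records, while the name-prefix pass recovers the Jane Doe pair that the zip pass misses because one zip is blank.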
Available scorers
- exact: 1.0 if equal, 0.0 otherwise
- jaro_winkler: best for short strings (names)
- levenshtein: normalized edit distance
- token_sort: handles word reordering
- ensemble: weighted combination of jaro_winkler + levenshtein + token_sort + dice (best for names)
- dice, jaccard: set-based similarity
- soundex_match: phonetic matching
- embedding: sentence-transformer cosine similarity
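To give a feel for how a few of these scorers behave, here is a small pure-Python approximation. It is not the library's implementation. It assumes the set-based scorers operate on lowercase whitespace tokens, uses `difflib.SequenceMatcher` as a stand-in similarity for `token_sort`, and the `weighted_ensemble` helper with its weights is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable

Scorer = Callable[[str, str], float]


def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def _tokens(s: str) -> set[str]:
    # Assumption: set-based scorers compare lowercase whitespace tokens.
    return set(s.lower().split())


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dice(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def token_sort(a: str, b: str) -> float:
    # Sort tokens first so word order no longer affects the comparison.
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sa, sb).ratio()


def weighted_ensemble(a: str, b: str, parts: list[tuple[Scorer, float]]) -> float:
    # Weighted average of the component scores (weights are illustrative).
    total = sum(weight for _, weight in parts)
    return sum(scorer(a, b) * weight for scorer, weight in parts) / total


score = weighted_ensemble(
    "John A Smith", "smith john", [(token_sort, 0.5), (dice, 0.3), (jaccard, 0.2)]
)
```

For "john a smith" against "smith john", the token sets share two of three distinct tokens, so this sketch gives a Jaccard score of 2/3 and a Dice score of 0.8.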
Available transforms (applied at matchkey time)
lowercase, uppercase, strip, strip_all, soundex, metaphone, digits_only, alpha_only, normalize_whitespace, token_sort, first_token, last_token, substring:start:end, qgram:n
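A chain of transforms can be pictured as a list of names applied left to right. The sketch below reimplements a subset of the list in plain Python, based only on what the names suggest. The library's exact semantics may differ, and `apply_transforms` is a hypothetical helper rather than a library function.

```python
import re
from typing import Callable

# Assumed semantics, inferred from the transform names only.
SIMPLE: dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "strip": str.strip,
    "digits_only": lambda s: re.sub(r"\D", "", s),
    "alpha_only": lambda s: re.sub(r"[^A-Za-z]", "", s),
    "normalize_whitespace": lambda s: " ".join(s.split()),
    "token_sort": lambda s: " ".join(sorted(s.split())),
    "first_token": lambda s: s.split()[0] if s.split() else "",
    "last_token": lambda s: s.split()[-1] if s.split() else "",
}


def apply_transforms(value: str, names: list[str]) -> str:
    """Apply each named transform in order, left to right."""
    for name in names:
        if name.startswith("substring:"):
            # Parameterized form: substring:start:end
            _, start, end = name.split(":")
            value = value[int(start):int(end)]
        elif name in SIMPLE:
            value = SIMPLE[name](value)
        else:
            raise ValueError(f"unknown transform: {name}")
    return value


name_key = apply_transforms(
    "  John   SMITH ", ["strip", "lowercase", "normalize_whitespace", "token_sort"]
)
phone_key = apply_transforms("(555) 123-4567", ["digits_only"])
prefix_key = apply_transforms("Smith", ["lowercase", "substring:0:3"])
```

Order matters: sorting tokens before lowercasing, for example, can give a different result from lowercasing first, because uppercase letters sort ahead of lowercase ones.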
LLM Scorer for borderline pairs
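One way to think about this pattern: only pairs whose cheap score falls in an uncertain band are sent to the more expensive scorer, and everything else keeps its original score. The sketch below shows that routing in isolation. The thresholds 0.75 and 0.95, the helper `rescore_borderline`, and the stub `fake_llm_score` are all assumptions. No real LLM is called and none of this is the library's configuration.

```python
from typing import Callable

Pair = tuple[str, str]


def rescore_borderline(
    scored: dict[Pair, float],
    expensive_score: Callable[[Pair], float],
    low: float = 0.75,   # hypothetical lower bound of the uncertain band
    high: float = 0.95,  # hypothetical upper bound of the uncertain band
) -> dict[Pair, float]:
    """Replace scores in [low, high) with the expensive scorer's verdict."""
    result: dict[Pair, float] = {}
    for pair, score in scored.items():
        if low <= score < high:
            result[pair] = expensive_score(pair)
        else:
            result[pair] = score  # confident match or non-match: keep as-is
    return result


calls: list[Pair] = []


def fake_llm_score(pair: Pair) -> float:
    # Stub standing in for an LLM call; records which pairs were sent.
    calls.append(pair)
    return 1.0


scored = {
    ("Jon Smith", "John Smith"): 0.88,
    ("Jane Doe", "Jane Doe"): 1.0,
    ("Jane Doe", "Bob Ray"): 0.2,
}
final = rescore_borderline(scored, fake_llm_score)
```

Only the 0.88 pair reaches the stub here, which is the cost argument for restricting an LLM scorer to borderline pairs.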
Common Mistakes
- Using exact=["email"] as sole matchkey - creates oversized clusters with common emails
- Using auto_configure() on synthetic data - it may produce poor configs
- Not setting name= on MatchkeyConfig - it’s required
- Not providing keys= on BlockingConfig - it’s required even with multi_pass
- Extracting pairs from dupes DataFrame directly instead of using result.clusters