dedupe_df() into a durable, queryable identity graph. Stable entity_ids survive re-runs, every match has provenance, and the same answer comes back from Python, SQL, REST, MCP, A2A, and the web UI.
Shipped in v1.15.0 (2026-05-12). Off by default, so the zero-config posture is preserved. Enable via config.identity.enabled = True or an identity: block in YAML.
What problem this solves
A plain run_dedupe() returns clusters whose IDs are meaningful only inside that result. Re-run the pipeline tomorrow with one new record and:
- Cluster IDs are different.
- There is no record of “Alice Smith merged into entity 7 because of evidence X.”
- A second source pointing at the same person has no way to find the existing identity.
- The web/SQL/agent surfaces all reason from scratch each time.
Quickstart
Add an identity: block to your config and run dedupe normally:
Storage model
Five tables. SQLite default at .goldenmatch/identity.db; Postgres optional.
| Table | What it stores |
|---|---|
| identity_nodes | One row per identity: entity_id (UUIDv7), status, rolled-up golden record, confidence, dataset. |
| source_records | One row per {source}:{source_pk}: the raw observation, current owning identity, payload, first/last seen. |
| evidence_edges | One row per scored pair that supports an identity. Score, matchkey, per-field breakdown, NE penalties, controller telemetry, run name. |
| identity_events | Append-only log: created / absorbed_record / merged_with / split_from / retired / manual_*. |
| identity_aliases | Optional cross-source convenience lookups (e.g. salesforce:003abc -> entity). |
Three views sit on top of the tables: v_identities, v_identity_pairs, v_identity_timeline. Apply them via packages/python/goldenmatch/goldenmatch/db/migrations/identity_v1.sql, or let IdentityStore(backend="postgres", connection=...) create the tables on first connect.
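The entity_id column is a UUIDv7, a format whose leading 48 bits are a millisecond timestamp, so IDs sort roughly by creation time. The sketch below hand-rolls that layout (per RFC 9562) purely for illustration; make_uuid7 is a hypothetical helper, not a GoldenMatch function, and Python 3.14+ ships uuid.uuid7() in the standard library.

```python
import os
import time
import uuid
from typing import Optional


def make_uuid7(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Hypothetical helper: 48-bit ms timestamp, version 7, RFC variant, 74 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits, 74 of them used
    rand_a = (rand >> 62) & 0xFFF                  # 12 bits placed after the version
    rand_b = rand & ((1 << 62) - 1)                # 62 bits placed after the variant

    value = (unix_ms & ((1 << 48) - 1)) << 80      # timestamp in the top 48 bits
    value |= 0x7 << 76                             # version nibble = 7
    value |= rand_a << 64
    value |= 0b10 << 62                            # RFC 4122 / 9562 variant bits
    value |= rand_b
    return uuid.UUID(int=value)


earlier = make_uuid7(1_000)
later = make_uuid7(2_000)
print(earlier.version, earlier < later)
```

Because the timestamp occupies the most significant bits, an ID minted later compares greater than one minted earlier, which makes the IDs index-friendly in both SQLite and Postgres.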
How resolution works
After clustering, the pipeline takes each cluster and:
- Look up existing identities that already own any record in the cluster.
- Decide what happened:
  - No overlap: mint a new identity (UUIDv7), emit created.
  - One existing identity covers all overlapping records: absorb the new records, emit absorbed_record per addition.
  - Multiple existing identities overlap: merge them. Winner = most members (tie-break: oldest created_at). Emit merged_with on winner, retire losers with status='merged_into', merged_into=<winner>.
- Upsert every cluster record under the chosen identity.
- Record evidence: one row in evidence_edges for every scored within-cluster pair, including matchkey name, per-field scores, negative-evidence penalties, and a controller-telemetry snapshot.
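The decision logic above can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: InMemoryStore, Identity, and resolve_cluster are hypothetical names rather than the real IdentityStore API, entity IDs are short strings instead of UUIDv7s, created_at is a logical counter, and the evidence-recording step is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Identity:
    entity_id: str
    created_at: int                        # logical clock standing in for a timestamp
    members: set = field(default_factory=set)
    status: str = "active"
    merged_into: Optional[str] = None


class InMemoryStore:
    """Toy stand-in for an identity store (hypothetical, not the real one)."""

    def __init__(self):
        self.identities = {}               # entity_id -> Identity
        self.owner = {}                    # record_id -> owning entity_id
        self.events = []                   # (kind, entity_id, detail)
        self._clock = count(1)

    def resolve_cluster(self, record_ids):
        records = set(record_ids)
        # Look up identities that already own any record in the cluster.
        overlapping = {self.owner[r] for r in records if r in self.owner}

        if not overlapping:
            # No overlap: mint a new identity.
            tick = next(self._clock)
            winner = Identity(entity_id=f"e{tick}", created_at=tick)
            self.identities[winner.entity_id] = winner
            self.events.append(("created", winner.entity_id, None))
        elif len(overlapping) == 1:
            # One identity covers the overlap: absorb the new records.
            winner = self.identities[next(iter(overlapping))]
            for record in sorted(records - winner.members):
                self.events.append(("absorbed_record", winner.entity_id, record))
        else:
            # Several identities overlap: most members wins, oldest breaks ties.
            ranked = sorted(
                (self.identities[e] for e in overlapping),
                key=lambda ident: (-len(ident.members), ident.created_at),
            )
            winner, losers = ranked[0], ranked[1:]
            for loser in losers:
                records |= loser.members   # the loser's records move to the winner
                loser.status = "merged_into"
                loser.merged_into = winner.entity_id
                self.events.append(("merged_with", winner.entity_id, loser.entity_id))

        # Upsert every cluster record under the chosen identity.
        for record in records:
            self.owner[record] = winner.entity_id
        winner.members |= records
        return winner.entity_id


store = InMemoryStore()
first = store.resolve_cluster(["crm:1", "crm:2"])
again = store.resolve_cluster(["crm:1", "crm:2", "crm:3"])
print(first == again)
```

Running the same cluster again with one extra record returns the same entity_id, which is the stability property the next paragraph describes.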
entity_id is stable across runs. The same Alice Smith on a Tuesday run and a Wednesday run with one extra record both resolve to the same UUID — evidence in evidence_edges shows which run added which edge.
Resolution is idempotent: replaying the same run_name is a no-op. Edges deduplicate on (entity_id, record_a_id, record_b_id, run_name). Events deduplicate on (run_name, kind, entity_id).
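One common way to get this kind of replay safety is a UNIQUE constraint on the dedupe key plus INSERT OR IGNORE. The sketch below shows that technique on a cut-down evidence_edges table; the column list, the record_edge helper, and the choice to sort the record pair into a canonical order are all assumptions, not the shipped schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE evidence_edges (
        entity_id   TEXT NOT NULL,
        record_a_id TEXT NOT NULL,
        record_b_id TEXT NOT NULL,
        run_name    TEXT NOT NULL,
        score       REAL,
        UNIQUE (entity_id, record_a_id, record_b_id, run_name)
    )
    """
)


def record_edge(conn, entity_id, record_a, record_b, run_name, score):
    """Insert an edge once per dedupe key; return True only if a row was written."""
    # Assumption: store the pair in sorted order so (a, b) and (b, a) share a key.
    record_a, record_b = sorted((record_a, record_b))
    cursor = conn.execute(
        "INSERT OR IGNORE INTO evidence_edges VALUES (?, ?, ?, ?, ?)",
        (entity_id, record_a, record_b, run_name, score),
    )
    return cursor.rowcount == 1


print(record_edge(conn, "e1", "crm:1", "crm:2", "tuesday", 0.93))   # first write
print(record_edge(conn, "e1", "crm:2", "crm:1", "tuesday", 0.93))   # replay, ignored
```

A different run_name produces a new row, which is what lets the edge table show which run contributed which evidence.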
Surfaces — one shape, many faces
Every surface returns the same JSON (the IdentityView.to_dict() shape). The cross-surface contract test at tests/identity/test_cross_surface_contract.py enforces this byte-for-byte across all six.
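A byte-for-byte comparison only works if every payload is serialized the same way first. The sketch below shows one plausible approach, canonical JSON with sorted keys; canonical_bytes and find_mismatches are hypothetical helpers and the payload fields are made up, so this is not a copy of the actual contract test.

```python
import json


def canonical_bytes(payload):
    """Serialize with sorted keys and fixed separators so equal dicts give equal bytes."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def find_mismatches(payloads):
    """Return the surfaces whose payload differs from the first surface listed."""
    names = list(payloads)
    reference = canonical_bytes(payloads[names[0]])
    return [name for name in names[1:] if canonical_bytes(payloads[name]) != reference]


# Made-up payloads standing in for what each surface returned.
payloads = {
    "python": {"entity_id": "e1", "status": "active", "members": ["crm:1", "crm:2"]},
    "rest": {"status": "active", "members": ["crm:1", "crm:2"], "entity_id": "e1"},
    "mcp": {"entity_id": "e1", "status": "active", "members": ["crm:1"]},
}
print(find_mismatches(payloads))
```

Key order does not matter after canonicalization, but list order and values do, so the third payload is reported as a mismatch.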
Python
CLI
REST
Web UI
The “Identities” tab in goldenmatch serve-ui lists identities with dataset/status filters, drills into one to show members + evidence + event log, and supports steward merge/split.
MCP
Seven tools on the standard MCP server (goldenmatch mcp-serve):
- identity_resolve: look up by record_id
- identity_show: full payload by entity_id
- identity_list: list with filters
- identity_history: event log
- identity_conflicts: list conflicts_with edges
- identity_merge / identity_split: steward operations
A2A
Same seven skills on the A2A agent server (goldenmatch agent-serve). The agent card declares 31 total skills.
SQL (Postgres + DuckDB)
The goldenmatch-duckdb PyPI package (>= 0.3.0) and goldenmatch_pg Postgres extension (>= 0.4.0) expose five read-only functions per backend. Each returns the same JSON that IdentityView.to_dict() returns. SQL is read-only: writes go through the Python CLI, REST endpoints, or MCP tools.
“Why did these link?” Reading the evidence
Every link decision is auditable. Pull an entity’s edges via GET /api/v1/identities/{eid}/evidence, identity_history over MCP/A2A, or the DuckDB / Postgres _view functions.
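Once an edge is in hand, a small formatter can turn it into a readable explanation. The sketch below assumes an edge arrives as a dict; the key names (record_a_id, record_b_id, score, matchkey, field_scores, ne_penalties, run_name) are guesses based on the evidence_edges description, and explain_edge is a hypothetical helper.

```python
def explain_edge(edge):
    """Render one evidence edge as a one-line explanation, strongest field first."""
    ranked = sorted(edge["field_scores"].items(), key=lambda kv: kv[1], reverse=True)
    fields = ", ".join(f"{name}={score:.2f}" for name, score in ranked)
    penalty = sum(edge.get("ne_penalties", {}).values())
    return (
        f'{edge["record_a_id"]} <-> {edge["record_b_id"]}: '
        f'score {edge["score"]:.2f} via {edge["matchkey"]} '
        f'[{fields}; penalties {penalty:.2f}] in run {edge["run_name"]}'
    )


# Made-up edge for illustration only.
sample_edge = {
    "record_a_id": "crm:1",
    "record_b_id": "web:9",
    "score": 0.91,
    "matchkey": "name_email",
    "field_scores": {"name": 0.88, "email": 1.0},
    "ne_penalties": {"dob_mismatch": 0.05},
    "run_name": "nightly-01",
}
print(explain_edge(sample_edge))
```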
Configuration reference
If source_pk_column is unset and you have near-duplicate raw rows from the same source, two physically different observations may collide on the same record_id. The recommended pattern is to always pass an explicit PK column when you can.
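The collision risk is easier to see in code. The sketch below assumes the fallback hashes the whole row; the make_record_id helper, the SHA-256 choice, and the 16-character truncation are all illustrative assumptions, and the real fallback may hash different fields.

```python
import hashlib
import json


def make_record_id(source, row, source_pk_column=None):
    """Build a {source}:{key} record_id, hashing the row when no PK column is given."""
    if source_pk_column is not None:
        return f"{source}:{row[source_pk_column]}"
    # Assumed fallback: hash a canonical serialization of the entire row.
    canonical = json.dumps(row, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{source}:{digest}"


# Two separate observations that happen to carry identical values.
row_a = {"id": 101, "name": "Alice Smith", "email": "alice@example.com"}
row_b = {"id": 102, "name": "Alice Smith", "email": "alice@example.com"}

without_pk_a = make_record_id("crm", {k: v for k, v in row_a.items() if k != "id"})
without_pk_b = make_record_id("crm", {k: v for k, v in row_b.items() if k != "id"})
print(without_pk_a == without_pk_b)   # same hash: the two observations fold together

print(make_record_id("crm", row_a, "id"), make_record_id("crm", row_b, "id"))
```

With an explicit PK column the two rows keep distinct record_ids, which is why passing one is the safer default.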
Postgres setup
Apply the schema directly from identity_v1.sql (skip if you only use SQLite). The migration file creates the views (v_identities, v_identity_pairs, v_identity_timeline) that the bare IdentityStore does not, so prefer it for shared/team setups.
Performance notes
- Resolve runs after clustering, before output. On a 100k-row dedupe the resolve step is dominated by SQLite write throughput (~5-15s). Postgres scales further but adds network latency.
- Resolution is gated and additive — if the store fails to open, the pipeline logs a warning and continues. Identity never blocks a dedupe.
- For multi-process writers, the SQLite store uses WAL + a 5s busy_timeout. Postgres relies on row-level locks. Single-tenant web UI / CLI invocations are the assumed model; for high-write multi-tenant graphs use Postgres.
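The SQLite settings and the fail-open behaviour described above can be sketched with the standard sqlite3 module. The function names open_store and open_store_or_none and the logger name are hypothetical; only the two PRAGMAs correspond to what the notes describe.

```python
import logging
import os
import sqlite3
import tempfile

log = logging.getLogger("identity_sketch")


def open_store(path):
    """Open a SQLite file with write-ahead logging and a 5 second busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")      # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=5000")     # wait up to 5000 ms on a locked db
    return conn


def open_store_or_none(path):
    """Fail open: log a warning and return None so the caller can carry on without it."""
    try:
        return open_store(path)
    except sqlite3.Error as exc:
        log.warning("identity store unavailable, continuing without it: %s", exc)
        return None


with tempfile.TemporaryDirectory() as tmp:
    conn = open_store(os.path.join(tmp, "identity.db"))
    print(conn.execute("PRAGMA journal_mode").fetchone()[0])
    conn.close()

    # The parent directory does not exist, so the open fails and None comes back.
    print(open_store_or_none(os.path.join(tmp, "missing", "identity.db")))
```

A caller that receives None simply skips the resolve step, which matches the "identity never blocks a dedupe" guarantee.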
When NOT to use it
- Single-shot ad-hoc dedupe where you only want golden records out and don’t care about the next run.
- Pipelines whose source has no stable PK and whose rows are duplicated character-for-character — the hash fallback will fold them together.
Migration / backfill
Existing projects without an identity graph don’t get retroactive entity_id stability. New runs will assign fresh UUIDs from the moment you enable identity. A best-effort backfill command that walks lineage JSONL + cluster snapshots is on the v2.1 roadmap.
See also
- examples/python/08_identity_graph.py: end-to-end demo (two-run stability + absorb + merge + split + conflict)
- Pipeline architecture: where identity sits in the dedupe flow
- Learning Memory: the other persistent-state layer; complementary, not competing
- Configuration: full schema reference