The goldenmatch-extensions package runs GoldenMatch directly from SQL, without leaving the database. It ships a pgrx-based PostgreSQL extension and a DuckDB UDF package.

PostgreSQL

Install

The fastest path is the prebuilt Docker image with the extension preinstalled:
docker run -p 5432:5432 -e POSTGRES_PASSWORD=postgres \
  ghcr.io/benseverndev-oss/goldenmatch-extensions:latest

psql -h localhost -U postgres \
  -c "SELECT goldenmatch.goldenmatch_score('John', 'Jon', 'jaro_winkler');"
Package installs are also available:
# Debian / Ubuntu
curl -LO https://github.com/benseverndev-oss/goldenmatch-extensions/releases/latest/download/postgresql-16-goldenmatch_0.2.0_amd64.deb
sudo dpkg -i postgresql-16-goldenmatch_0.2.0_amd64.deb
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "goldenmatch>=1.1.0"
Then enable it:
CREATE EXTENSION goldenmatch_pg;
SELECT goldenmatch.goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler');

Functions

-- Score two strings
SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler');  -- 0.91

-- Score two JSON records against a config
SELECT goldenmatch_score_pair(
    '{"name": "John Smith", "email": "j@x.com"}',
    '{"name": "Jon Smyth", "email": "j@x.com"}',
    '{"fuzzy": {"name": 0.85}, "exact": ["email"]}'
);  -- 0.95

-- Explain a match
SELECT goldenmatch_explain(rec_a, rec_b, config);

-- Whole-table operations
SELECT goldenmatch_dedupe_table('customers', '{"exact": ["email"]}');
SELECT goldenmatch_match_tables('prospects', 'customers', '{"fuzzy": {"name": 0.85}}');

DuckDB

pip install goldenmatch-duckdb
import duckdb
import goldenmatch_duckdb

con = duckdb.connect()
goldenmatch_duckdb.register(con)

con.sql("SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler')").show()
con.sql("SELECT goldenmatch_dedupe_table('customers', '{\"exact\": [\"email\"]}')").show()
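Hand-escaping the quotes inside the JSON config, as in the last call above, is easy to get wrong. One alternative is to build the config with json.dumps and bind it as a query parameter. The following is a minimal sketch of that idea. The helper name dedupe_query is hypothetical and not part of goldenmatch_duckdb; it only assembles the statement and its parameters, and it assumes the UDF accepts bound parameters the same way it accepts literals.

```python
import json


def dedupe_query(table: str, config: dict) -> tuple[str, list[str]]:
    """Return a parameterized SELECT and its bound values (hypothetical helper)."""
    sql = "SELECT goldenmatch_dedupe_table(?, ?)"
    # json.dumps produces valid JSON, so no manual quote escaping is needed.
    return sql, [table, json.dumps(config)]


sql, params = dedupe_query("customers", {"exact": ["email"]})
print(sql)
print(params)
```

With a registered connection, the pair would be passed to DuckDB as con.execute(sql, params).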
Registered UDFs:
| Function | Purpose |
| --- | --- |
| goldenmatch_score(a, b, scorer) | Score two strings. |
| goldenmatch_score_pair(rec_a, rec_b, config) | Score two JSON records. |
| goldenmatch_explain(rec_a, rec_b, config) | Explain a match. |
| goldenmatch_dedupe_table(table, config) | Deduplicate a table. |
| goldenmatch_match_tables(target, ref, config) | Match two tables. |
| goldenmatch_dedupe(json, config) | Deduplicate JSON records directly. |
| goldenmatch_match(target_json, ref_json, config) | Match two JSON record sets. |
| goldenmatch_connected_components(...) | Group a candidate-pair graph into entities. |
| goldenmatch_pair_dedup(...) | Keep the best score per canonical pair. |
| goldenmatch_embed_local(text, model_path) | Embed text with a local in-house model. |
| gm_embed(text) (PostgreSQL) | Embed text with the in-house model; the model directory is read from GOLDENEMBED_MODEL_DIR. |

Graph and embedding kernels

These run native-direct in pure Rust, with no CPython round-trip. They expose GoldenMatch’s clustering primitives and the local embedder directly in SQL, on both backends (and as DataFusion FFI UDFs). One shared kernel backs all surfaces, so results are identical across them.

Connected components and pair dedupe

goldenmatch_connected_components groups a candidate-pair graph into entities, one component per entity, with singletons included. goldenmatch_pair_dedup canonicalizes a candidate-pair set and keeps the best score per pair. Both take the edge columns as lists. Pass integer record ids to the bare name, or string ids to the _str sibling.
-- Components over an edge set plus the id universe
SELECT goldenmatch_connected_components(
  (SELECT list(id_a) FROM edges),
  (SELECT list(id_b) FROM edges),
  (SELECT list(score) FROM edges),
  (SELECT list(id) FROM records)
);  -- [[id, ...], ...]

-- Canonical max-score pairs
SELECT goldenmatch_pair_dedup(
  (SELECT list(id_a) FROM pairs),
  (SELECT list(id_b) FROM pairs),
  (SELECT list(score) FROM pairs)
);  -- [{a, b, s}, ...]

-- String record ids use the _str variants
SELECT goldenmatch_connected_components_str(
  ['a', 'b'], ['b', 'c'], [0.9, 0.8], ['a', 'b', 'c', 'd']
);
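
The semantics described above can be illustrated in plain Python. This is a reference sketch only, not the Rust kernel: the function names mirror the SQL ones for readability, the sorted output order is an assumption made for determinism, and the score list that the SQL connected-components function accepts is left out because the source does not say how the kernel uses it.

```python
from collections import defaultdict


def connected_components(ids_a, ids_b, universe):
    """Group ids into components with union-find; ids without edges become singletons."""
    parent = {i: i for i in universe}

    def find(x):
        parent.setdefault(x, x)  # tolerate edge ids missing from the universe
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in zip(ids_a, ids_b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for i in list(parent):
        groups[find(i)].append(i)
    # Sorted output is an assumption of this sketch, not documented behavior.
    return sorted(sorted(g) for g in groups.values())


def pair_dedup(ids_a, ids_b, scores):
    """Canonicalize each pair as (smaller id, larger id) and keep its highest score."""
    best = {}
    for a, b, s in zip(ids_a, ids_b, scores):
        key = (a, b) if a <= b else (b, a)
        if key not in best or s > best[key]:
            best[key] = s
    return [{"a": a, "b": b, "s": s} for (a, b), s in sorted(best.items())]


components = connected_components(["a", "b"], ["b", "c"], ["a", "b", "c", "d"])
pairs = pair_dedup([1, 2, 1], [2, 1, 3], [0.8, 0.9, 0.7])
print(components)
print(pairs)
```

Here the id "d" has no edges and still appears as its own component, and the reversed pair (2, 1) collapses into (1, 2) with the higher of its two scores.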

Local embedding

goldenmatch_embed_local embeds text with a saved in-house model through the goldenembed ONNX runtime. No network and no API key. model_path is a directory holding config.json and model.onnx.
SELECT goldenmatch_embed_local('John Smith', '/path/to/model');  -- JSON float array
On PostgreSQL, gm_embed(text) is a one-argument convenience that reads the model directory from the GOLDENEMBED_MODEL_DIR environment variable instead of taking it per call, and returns real[] (float4) to match the DataFusion goldenmatch_embed UDF. The model loads once per backend process and is cached. A NULL input embeds the empty string rather than returning NULL.
PostgreSQL
-- Set GOLDENEMBED_MODEL_DIR in the server environment first.
SELECT goldenmatch.gm_embed('John Smith');  -- real[]
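The three behaviors described for gm_embed (directory taken from the environment, one model load per process, NULL treated as the empty string) can be mimicked in a few lines of Python. This is a sketch under loud assumptions: load_model and gm_embed_sketch are hypothetical names, and the "model" is a placeholder that returns the text length instead of running ONNX inference.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model(model_dir: str):
    """Stand-in loader; a real one would read config.json and model.onnx."""
    print(f"loading model from {model_dir}")
    # Placeholder "embedding": the text length as a one-element vector.
    return lambda text: [float(len(text))]


def gm_embed_sketch(text):
    model_dir = os.environ["GOLDENEMBED_MODEL_DIR"]  # from the environment, not per call
    model = load_model(model_dir)                    # cached after the first call
    return model("" if text is None else text)       # None embeds "", never returns None


# Example value only; point this at a real model directory in practice.
os.environ.setdefault("GOLDENEMBED_MODEL_DIR", "/path/to/model")
print(gm_embed_sketch("John Smith"))
print(gm_embed_sketch(None))
```

The "loading model" line prints once even though the function is called twice, which is the caching behavior the paragraph above describes.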
The DuckDB embedding UDF needs the optional embed runtime: pip install "goldenmatch-duckdb[embed]" (quoted so that shells such as zsh do not try to expand the brackets).

Requirements

  • Python 3.11+
  • goldenmatch >= 1.1.0
  • DuckDB 1.0+ (DuckDB extension)
  • PostgreSQL 15, 16, or 17 (Postgres extension)
The scoring and table operations embed CPython through pyo3 and call the GoldenMatch Python API, so they match the Python package exactly. The graph and embedding kernels run native-direct in pure Rust with no CPython, sharing one kernel across DuckDB, PostgreSQL, and DataFusion.