GoldenFlow overview

GoldenFlow standardizes, reshapes, and normalizes messy data before it enters a pipeline. In zero-config mode it auto-detects quality issues and applies safe transforms. It ships 92 built-in transforms across 11 categories (text, phone, name, address, date, categorical, numeric, email, identifiers, URLs, and auto-correct), plus domain packs for healthcare, finance, ecommerce, people/HR, real estate, and carceral. The identifiers group includes checksum and structural validators and formatters for payment cards (Luhn), IBAN (mod-97), ISBN, EAN/UPC, EU VAT, SWIFT/BIC, US ABA routing numbers, and IMEI, all running on owned Rust kernels. The name group includes two owned i18n kernels: name_transliterate (a deterministic Unicode-to-ASCII fold via a curated map) and name_script (dominant-script detection). It scores 100/100 on the DQBench transform benchmarks.
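To make the checksum validators concrete, here is a minimal pure-Python sketch of the two checks named above: Luhn for payment cards and mod-97 for IBAN. It is not GoldenFlow's Rust kernel; the function names luhn_valid and iban_valid are hypothetical and only illustrate the arithmetic behind this kind of validator.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check (structure only)."""
    s = "".join(iban.split()).upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Letters map to 10..35; int(ch, 36) does that for A-Z and keeps 0-9.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

This sketch checks the checksum only. A production validator would also verify length and per-country structure, which is presumably what the "structural" part of the built-in validators covers.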

CLI reference

Commands, flags, and the config format.

Performance

Vectorized fast paths and the optional Arrow-native phone kernel.

Install

pip install goldenflow

npm install goldenflow

Extras: goldenflow[check] adds GoldenCheck integration, goldenflow[mcp] adds the MCP server, and goldenflow[all] installs every extra.

Quickstart

Zero-config auto-detect:

goldenflow transform messy_data.csv
goldenflow demo

In Python:

import goldenflow

result = goldenflow.transform_file("messy_data.csv")
print(result.df)        # clean Polars DataFrame
print(result.manifest)  # audit trail

In TypeScript:

import { TransformEngine } from "goldenflow";

const result = new TransformEngine().transformDf([
  { name: "  JOHN  ", phone: "(555) 123-4567", email: "John@Example.COM" },
]);
console.log(result.rows[0]);
// { name: "JOHN", phone: "+15551234567", email: "john@example.com" }

Key features

Zero-config mode: auto-detects column types and applies safe transforms.
113 transforms across 13 categories, including phonetic blocking keys (soundex, double_metaphone), company/org dedup, email/URL dedup keys, checksummed/structural identifiers (payment card, IBAN, ISBN, EAN/UPC, EU VAT, SWIFT/BIC, ABA routing, IMEI, ISIN, CUSIP, NPI, Luhn) and owned i18n name kernels (name_transliterate, name_script).
6 domain packs: healthcare, finance, ecommerce, people/HR, real estate, and carceral.
Audit trail: a JSON manifest of every transform, which rows changed, and before/after samples.
Schema mapping: auto-map columns between systems by name similarity and data profiling.
Streaming: process large files in chunks via StreamProcessor.
Cloud connectors: transparent s3:// and gs:// paths.
Watch and schedule: auto-transform new files, or run on an interval.
LLM-enhanced mode: optional categorical corrections via --llm.
Polars-native, with Jupyter-friendly rich rendering. Date and phone normalization use vectorized fast paths (≈14× faster end-to-end), plus an optional Arrow-native Rust kernel for the phone tail.
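The i18n name kernels listed above (name_transliterate and name_script) can be pictured with a small standard-library sketch. It is an illustration under stated assumptions, not the owned kernels: FOLD_MAP is a tiny hypothetical stand-in for the curated map, and fold_to_ascii and dominant_script are hypothetical names.

```python
import unicodedata
from collections import Counter

# Tiny illustrative map for letters that do not decompose under NFKD.
# A real curated map would be far larger.
FOLD_MAP = {
    "ß": "ss", "ø": "o", "Ø": "O", "ł": "l", "Ł": "L",
    "æ": "ae", "Æ": "AE", "đ": "d", "Đ": "D",
}


def fold_to_ascii(text: str) -> str:
    """Fold accented Latin letters to ASCII, deterministically.

    Characters with no mapping and no decomposition pass through unchanged.
    """
    out = []
    for ch in text:
        if ch in FOLD_MAP:
            out.append(FOLD_MAP[ch])
            continue
        # NFKD splits "é" into "e" plus a combining accent; drop the accent.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def dominant_script(text: str) -> str:
    """Return the most common script among the letters in `text`.

    The script is approximated by the first word of the Unicode
    character name, e.g. "LATIN", "CYRILLIC", "CJK".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

Because the fold depends only on a fixed map and Unicode decomposition, the same input always produces the same output, which is the property that makes a transliterated name usable as a stable matching key.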

Configured transforms

import goldenflow
import polars as pl
from goldenflow import GoldenFlowConfig, TransformSpec

df = pl.read_csv("messy_data.csv")  # any Polars DataFrame works here

config = GoldenFlowConfig(transforms=[
    TransformSpec(column="first_name", ops=["strip", "title_case"]),
    TransformSpec(column="last_name", ops=["strip", "title_case"]),
    TransformSpec(column="email", ops=["strip", "lowercase"]),
    TransformSpec(column="phone", ops=["strip", "phone_national"]),
])
result = goldenflow.transform_df(df, config=config)

Polars-free `transform()`

goldenflow.transform() runs the same transforms on the native/Arrow substrate with Polars never imported. It takes a dict[str, list] or a file path (.csv / .parquet / .xlsx), and returns a ColumnarResult (.columns + .manifest, with an opt-in .to_polars()):

result = goldenflow.transform({"phone": ["212-555-0100"]}, config=config)
result = goldenflow.transform("customers.parquet", config=config)  # pyarrow, no Polars
result = goldenflow.transform("customers.csv", config=None)        # zero-config, Polars-free

transform_df(pl.DataFrame) stays as the Polars-backend adapter. Install optional backends with pip install goldenflow[polars] (bulk-vectorized backend + pl.read_*) or goldenflow[parquet] (pyarrow, for Polars-free Parquet read).
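To illustrate what running an ordered list of ops over a dict[str, list] means, here is a minimal pure-Python sketch. It is not GoldenFlow's implementation: apply_ops and the OPS registry are hypothetical, and the three ops are simple stand-ins (str.strip, str.lower, str.title) that may differ in detail from the built-in transforms of the same name.

```python
from typing import Callable

# Hypothetical registry: op name -> function applied to each string value.
OPS: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lowercase": str.lower,
    "title_case": str.title,
}


def apply_ops(columns: dict[str, list], specs: dict[str, list[str]]) -> dict[str, list]:
    """Apply each column's ops in order and return new columns.

    Non-string values (for example None) are passed through untouched,
    and the input dict is not mutated.
    """
    result = {name: list(values) for name, values in columns.items()}
    for column, op_names in specs.items():
        for op_name in op_names:
            op = OPS[op_name]
            result[column] = [op(v) if isinstance(v, str) else v for v in result[column]]
    return result


cleaned = apply_ops(
    {"email": ["  John@Example.COM "], "first_name": [" jOHN ", None]},
    {"email": ["strip", "lowercase"], "first_name": ["strip", "title_case"]},
)
print(cleaned)
```

Order matters in a chain like this: each op sees the output of the one before it, which is why strip comes first in the specs above.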

Schema mapping

from goldenflow import SchemaMapper
import polars as pl

source = pl.DataFrame({"fname": ["John"], "lname": ["Smith"]})
target = pl.DataFrame({"first_name": [""], "last_name": [""]})

mapper = SchemaMapper()
for m in mapper.map(source, target):
    print(f"{m.source} → {m.target} ({m.confidence:.0%})")
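The name-similarity half of schema mapping can be sketched with the standard library. This is an illustration only: it uses difflib.SequenceMatcher as a stand-in scorer, ignores data profiling entirely, and map_columns, normalize, and the 0.5 threshold are hypothetical choices, not SchemaMapper's actual algorithm.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop underscores and spaces."""
    return name.lower().replace("_", "").replace(" ", "")


def similarity(a: str, b: str) -> float:
    """Similarity of two column names in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def map_columns(source_cols, target_cols, threshold=0.5):
    """Pair each source column with its most similar target column.

    Returns (source, target, score) tuples; pairs scoring below
    `threshold` are dropped.
    """
    mappings = []
    for src in source_cols:
        best = max(target_cols, key=lambda tgt: similarity(src, tgt))
        score = similarity(src, best)
        if score >= threshold:
            mappings.append((src, best, score))
    return mappings


for src, tgt, score in map_columns(["fname", "lname"], ["first_name", "last_name"]):
    print(f"{src} -> {tgt} ({score:.0%})")
```

Name similarity alone can be fooled by abbreviations or unrelated columns with similar names, which is a plausible reason for combining it with profiling of the column values.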

Streaming

from goldenflow.streaming import StreamProcessor

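# config is the GoldenFlowConfig from "Configured transforms";
# write_to_output is a placeholder for your own sink function.
processor = StreamProcessor(config=config)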
for result in processor.stream_file("large_data.csv", chunk_size=10_000):
    write_to_output(result.df)
print(f"Processed {processor.batches_processed} batches")
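The idea behind chunked processing is that only one batch of rows is held in memory at a time. The sketch below shows that pattern with the standard csv module; iter_chunks is a hypothetical helper and says nothing about how StreamProcessor reads files internally.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `chunk_size` rows from an open CSV handle."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls the next chunk_size rows without reading further.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Demo with an in-memory file: five rows read in chunks of two.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
sizes = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
print(sizes)  # [2, 2, 1]
```

Peak memory is bounded by chunk_size rather than file size, so the value is a trade-off between per-batch overhead and memory use.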

SQL (DuckDB)

GoldenFlow’s transforms also ship as a zero-Python DuckDB extension (goldenflow-duckdb) — a compiled Rust extension that links the same reference kernels straight into DuckDB, so the transforms run natively in the query engine, byte-identical to the Python / TypeScript / WASM surfaces. 98 transforms are exposed as goldenflow_<kernel> SQL functions:

-- CLI: duckdb -unsigned   (or: SET allow_unsigned_extensions = true;)
LOAD 'goldenflow_duckdb.duckdb_extension';

SELECT
  goldenflow_email_normalize(email)     AS email,
  goldenflow_name_proper(name)          AS name,
  goldenflow_address_standardize(addr)  AS address
FROM read_parquet('s3://bucket/raw/*.parquet');

Download the per-platform zip from the goldenflow-duckdb-v* release assets (linux / macOS / Windows; loads on DuckDB >= 1.3.0).
