GoldenCheck overview

GoldenCheck is a zero-config data-validation platform that discovers quality rules automatically from your data. It combines fast statistical profilers with optional LLM enhancement to catch type mismatches, missing values, format violations, outliers, and cross-column constraints. It runs as a CLI, a Python and TypeScript library, a dbt package, and a GitHub Action.

CLI reference

Commands, flags, and domain packs.

Integrations

dbt tests and the GitHub Action.

Install

pip install goldencheck

npm install goldencheck

Extras:

pip install goldencheck[llm]                # LLM boost (Anthropic / OpenAI)
pip install goldencheck[baseline]           # deep profiling and drift detection
pip install goldencheck[baseline,semantic]  # + semantic type inference
pip install goldencheck[db]                 # database scanning
pip install goldencheck[mcp]                # MCP server

Polars-free by default (3.0.0)

As of 3.0.0 the default scan path is Arrow-native and needs no Polars. A plain pip install goldencheck scans CSV, Parquet, and Excel end-to-end: scan_file, scan_dataframe, and the CLI all run on a pyarrow.Table (pyarrow is now a base dependency). The 2.0.0 rule that CSV and the full scan needed goldencheck[polars] no longer applies. Polars is used only by two opt-in extras:

goldencheck[baseline] — pulls Polars (plus scipy/numpy) for the statistical, drift, and correlation subsystems, which still run on Polars.
goldencheck[polars] — only for the scan_dataframe(pl.DataFrame) convenience overload. scan_dataframe accepts a pyarrow.Table natively; a polars.DataFrame is converted via .to_arrow() when Polars is present.

inferred_type reports a neutral dtype vocabulary (str, int, uint, float, date, datetime, bool, other) rather than raw Polars dtype names.

Quickstart

# Scan a file: discover issues and launch the TUI
goldencheck data.csv

# CI-friendly: no TUI, JSON output, fail on errors
goldencheck data.csv --no-tui --json
goldencheck validate data.csv

In Python:

import goldencheck

findings, profile = goldencheck.scan_file("data.csv")
for f in findings:
    print(f"[{f.severity}] {f.column}: {f.check} — {f.message}")

grade, score = profile.health_score()
print(f"{grade} ({score}/100)")  # e.g. "B (78/100)"
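For CI use from Python, the findings list can be reduced to a process exit code. The following is a minimal sketch, not part of GoldenCheck: the Finding dataclass is a hypothetical stand-in that mirrors only the attributes used in the loop above, the severity strings "error" and "warning" are assumed renderings, and the sample findings are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Finding:
    """Hypothetical stand-in with the attributes printed in the quickstart loop."""
    severity: str  # assumed to render as "error" or "warning"
    column: str
    check: str
    message: str


def exit_code(findings: Iterable[Finding], fail_on: str = "error") -> int:
    """Return 1 if any finding has the given severity, otherwise 0."""
    return 1 if any(str(f.severity).lower() == fail_on for f in findings) else 0


# Illustrative findings only; real ones would come from goldencheck.scan_file.
sample = [
    Finding("warning", "age", "range", "negative value"),
    Finding("error", "email", "format", "not a valid email"),
]
print(exit_code(sample))      # 1
print(exit_code(sample[:1]))  # 0
```

In a CI script the result would typically be passed to sys.exit so that the job fails when an error-level finding is present.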

The edge-safe TypeScript core:

import { scanData, TabularData, Severity } from "goldencheck";

const data = new TabularData([
  { id: 1, email: "alice@example.com", age: 30, status: "active" },
  { id: 2, email: "bob@test.com", age: -5, status: "inactive" },
  { id: 3, email: "not-an-email", age: 25, status: "active" },
]);

const { findings, profile } = scanData(data);
for (const f of findings) {
  console.log(`[${f.severity === Severity.ERROR ? "ERROR" : "WARNING"}] ${f.column}: ${f.message}`);
}

Key features

Profiling: type inference, nullability, uniqueness, format detection, range and distribution, cardinality, pattern consistency, encoding and sequence detection, near-duplicate value detection (inconsistent categorical encodings such as California/Californa/CALIFORNIA), and freshness (future-dated values and name-gated staleness).
Cross-column checks: temporal ordering, null correlation, numeric constraints, age-versus-DOB, composite-key discovery, exact and near-duplicate rows, strict functional dependencies, and approximate-FD violations (rows that break a near-strict zip → city).
Denial-constraint discovery (opt-in): mines if-then / cross-tuple invariants of the form ¬(p1 ∧ … ∧ pm) from a single table and surfaces the violating rows. For example, ¬(status=shipped ∧ ship_date<order_date) means “if shipped, ship_date must be ≥ order_date”. Off by default; enable with goldencheck denial-constraints data.csv or --denial on a scan.
Baseline and drift detection: 12 check types including distribution drift, entropy drift, Benford drift, functional-dependency violation, and type drift.
Domain packs: healthcare, finance, and ecommerce.
Auto-fix: safe repairs such as trim, normalize case, fix encoding, and coerce types.
LLM boost: optional semantic understanding at roughly $0.01 per scan.
Confidence scoring: every finding carries a 0.0–1.0 confidence.
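To make two of the cross-column ideas above concrete, here is a small standalone sketch in plain Python. It is not GoldenCheck's implementation and does not show rule discovery: it only checks one already-known denial constraint, ¬(status=shipped ∧ ship_date<order_date), and flags rows that break a near-strict zip → city dependency by comparing each row against the majority city for its zip. The function names and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Made-up sample rows for illustration.
rows = [
    {"status": "shipped", "order_date": date(2024, 1, 5), "ship_date": date(2024, 1, 7),
     "zip": "94105", "city": "San Francisco"},
    {"status": "shipped", "order_date": date(2024, 1, 9), "ship_date": date(2024, 1, 8),
     "zip": "94105", "city": "San Francisco"},
    {"status": "pending", "order_date": date(2024, 1, 9), "ship_date": None,
     "zip": "94105", "city": "San Fransisco"},
    {"status": "shipped", "order_date": date(2024, 2, 1), "ship_date": date(2024, 2, 2),
     "zip": "10001", "city": "New York"},
]


def denial_violations(rows):
    """Indices of rows where status=shipped and ship_date<order_date both hold."""
    return [
        i for i, r in enumerate(rows)
        if r["status"] == "shipped"
        and r["ship_date"] is not None
        and r["ship_date"] < r["order_date"]
    ]


def approx_fd_violations(rows, lhs, rhs):
    """Indices of rows whose rhs value differs from the majority rhs for their lhs."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[lhs]][r[rhs]] += 1
    majority = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return [i for i, r in enumerate(rows) if r[rhs] != majority[r[lhs]]]


print(denial_violations(rows))                    # [1]
print(approx_fd_violations(rows, "zip", "city"))  # [2]
```

The majority-vote rule is one simple way to decide which side of a near-strict dependency is the violation; a real profiler would likely also require a minimum support before treating the dependency as meaningful.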

Baseline and drift

Create a statistical baseline once, then check new data against it cheaply:

import goldencheck

baseline = goldencheck.create_baseline("reference.csv")
baseline.save("goldencheck_baseline.yaml")

findings, profile = goldencheck.scan_file("production.csv", baseline="goldencheck_baseline.yaml")
drift = [f for f in findings if f.source == "baseline_drift"]
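As an illustration of what one of the drift checks might measure, the sketch below compares the Shannon entropy of a categorical column in reference and current data. It is a standalone approximation written for this page, not the library's algorithm; the function names and the 0.5-bit tolerance are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_drift(reference, current, tolerance=0.5):
    """Return (drifted, delta), where delta = H(current) - H(reference)."""
    delta = shannon_entropy(current) - shannon_entropy(reference)
    return abs(delta) > tolerance, delta


reference = ["active", "inactive"] * 50  # two equally likely values: 1 bit
current = ["active"] * 100               # a single value: 0 bits
drifted, delta = entropy_drift(reference, current)
print(drifted, round(delta, 3))          # True -1.0
```

A drop in entropy like this one suggests a column has collapsed toward fewer distinct values, which is the kind of change a baseline comparison is meant to surface.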

Benchmark

GoldenCheck scores 88.40 on DQBench (versus Pandera 32.51, Great Expectations 21.68, Soda 22.36), at roughly 482K rows/sec. The Arrow-native scan that shipped in 3.0.0 became roughly 3.8x faster across the 3.0.x and 3.1.x releases (a 1M-row x 7-column mixed scan went from 3.74s to about 1.0s), with each release's output byte-identical to the last. The wins come from vectorized pyarrow ops, fused single-pass kernels, and a deterministic parallel scan. The scan thread pool is tunable with GOLDENCHECK_SCAN_THREADS (parallel by default; set it to 1 to force sequential). Installing goldencheck[native] accelerates the scan further. See Native acceleration for the details.
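When comparing runs or debugging, it can help to force the sequential path from Python. The sketch below only sets the environment variable named above. When GoldenCheck reads the variable (at import or at scan time) is not stated here, so setting it before the import is an assumption made to be safe, and the scan call is left commented out.

```python
import os

# Force a sequential scan; the variable name and the value "1" come from the text above.
os.environ["GOLDENCHECK_SCAN_THREADS"] = "1"

# Assumption: the variable is read at import time or when a scan starts,
# so it is set before importing the library.
# import goldencheck
# findings, profile = goldencheck.scan_file("data.csv")

print(os.environ["GOLDENCHECK_SCAN_THREADS"])  # 1
```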
