GoldenPipe chains GoldenCheck, GoldenFlow, and GoldenMatch into one pipeline. It profiles your data, conditionally routes it through transformation, deduplicates it (or routes sensitive fields to privacy-preserving matching), and emits golden records. Its logic is adaptive: it skips transformation when no issues are found, and it explains the reasoning behind every decision.
Install
pip install goldenpipe
pip install "goldenpipe[mcp]" # MCP server mode (quotes keep shells like zsh from globbing the brackets)
Quickstart
import goldenpipe as gp
result = gp.run("customers.csv")
print(result.status) # "success"
print(result.check) # quality findings
print(result.transform) # what was fixed
print(result.match) # deduplicated clusters
print(result.reasoning) # why each decision was made
On the CLI:
goldenpipe run customers.csv # full pipeline
goldenpipe run customers.csv --verbose # show reasoning
goldenpipe run customers.csv --skip-flow # check + match only
goldenpipe run customers.csv --strategy pprl # force privacy mode
goldenpipe run customers.csv -o golden.csv # save golden records
Key features
- Orchestrates the full pipeline (Check → Flow → Match) in one call.
- Adaptive logic that skips transformation when there are no quality issues.
- Privacy-preserving routing that detects sensitive fields and routes to PPRL.
- Reasoning transparency that reports why each stage ran or was skipped.
- Column-context enrichment that builds targeted dedupe config from GoldenCheck profiles and column-name heuristics.
- Remote and local MCP server (goldenpipe mcp-serve).
GoldenPipe scores 88.07 on the DQBench Pipeline category.
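To make the adaptive and privacy-routing behavior listed under Key features concrete, here is a minimal conceptual sketch. It is not GoldenPipe's implementation: the Profile class, the plan_stages function, the stage labels, and the thresholds are all hypothetical and exist only to illustrate "skip transformation when nothing is wrong, switch to PPRL when sensitive fields are present, and record a reason either way".

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical stand-in for a profiling result; not a GoldenCheck type."""
    issue_count: int
    sensitive_columns: list[str] = field(default_factory=list)


def plan_stages(profile: Profile) -> tuple[list[str], dict[str, str]]:
    """Return the stages to run plus a reason for every decision."""
    stages = ["check"]
    reasoning = {"check": "always runs first to profile the data"}

    # Transformation is only worth running when profiling found something to fix.
    if profile.issue_count > 0:
        stages.append("flow")
        reasoning["flow"] = f"ran: {profile.issue_count} quality issue(s) found"
    else:
        reasoning["flow"] = "skipped: no quality issues found"

    # Sensitive fields switch matching to the privacy-preserving strategy.
    if profile.sensitive_columns:
        cols = ", ".join(profile.sensitive_columns)
        stages.append("match:pprl")
        reasoning["match"] = f"pprl: sensitive column(s) detected ({cols})"
    else:
        stages.append("match:dedupe")
        reasoning["match"] = "dedupe: no sensitive columns detected"

    return stages, reasoning


stages, reasoning = plan_stages(Profile(issue_count=0, sensitive_columns=["ssn"]))
print(stages)
for stage, why in reasoning.items():
    print(f"{stage}: {why}")
```

Note that the reasoning dictionary gets an entry for every stage, including skipped ones, which mirrors the idea of reporting why each stage ran or was skipped.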
Selective stages
Run only part of the pipeline by specifying stages:
from goldenpipe import Pipeline, PipelineConfig, StageSpec
config = PipelineConfig(
pipeline="check-and-flow-only",
stages=[
StageSpec(use="goldencheck.scan"),
StageSpec(use="goldenflow.transform"),
# omit goldenmatch.dedupe to skip dedup
],
)
result = Pipeline(config=config).run(source="data.csv")
The PipeResult
Pipeline.run() returns a PipeResult, not the output DataFrame:
result.status # PipeStatus enum: SUCCESS, PARTIAL, FAILED
result.input_rows # int
result.stages # dict[str, StageResult]
result.artifacts # dict[str, Any], e.g. {"manifest": Manifest}
result.errors # list[str]
result.reasoning # dict[str, str], why each stage ran or was skipped
result.timing # dict[str, float]
result.skipped # list[str]
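As a rough illustration of how these fields fit together, the sketch below builds a short report from any object that exposes the attributes listed above. The summarize helper is hypothetical, and so is the stand-in object: it is a SimpleNamespace with invented stage names and values, used so the example runs without the package. It also assumes that reasoning, timing, and skipped are all keyed by the same stage names, which the field list suggests but does not state.

```python
from types import SimpleNamespace


def summarize(result) -> list[str]:
    """Build a short, human-readable report from PipeResult-style attributes."""
    lines = [f"status={result.status} input_rows={result.input_rows}"]
    for stage, why in result.reasoning.items():
        if stage in result.skipped:
            lines.append(f"{stage}: skipped ({why})")
        else:
            # Timing may be missing for a stage, so fall back to "n/a".
            seconds = result.timing.get(stage)
            took = f"{seconds:.2f}s" if seconds is not None else "n/a"
            lines.append(f"{stage}: ran in {took} ({why})")
    lines.extend(f"error: {msg}" for msg in result.errors)
    return lines


# Hypothetical stand-in for a real PipeResult; stage names and values are invented.
fake = SimpleNamespace(
    status="SUCCESS",
    input_rows=1200,
    reasoning={
        "check": "profile input",
        "flow": "no issues found",
        "match": "standard dedupe",
    },
    timing={"check": 0.41, "match": 2.5},
    skipped=["flow"],
    errors=[],
)
print("\n".join(summarize(fake)))
```

With a real run, the same helper could be called as summarize(gp.run("customers.csv")), provided the returned object carries the attributes shown in the list above.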
A few sharp edges from the package docs: PipeResult does not expose the output DataFrame directly. Use gp.run(path) (file-based) rather than gp.run_df(df), since GoldenCheck needs a file extension. And cast mixed-type columns (for example a birth_year that is sometimes an int and sometimes a string) to a single type before dedup, or GoldenMatch will raise.
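One way to handle the mixed-type caveat together with the file-extension caveat is to normalize the column yourself and write a fresh .csv before running the pipeline. The sketch below uses only the standard library; the helper names are hypothetical, and treating unparseable values as blanks (and truncating fractional years) is an assumption about what suits your data, not something the package prescribes. It also assumes rows is non-empty.

```python
import csv
import os
import tempfile


def to_int_or_blank(value: str) -> str:
    """Render a value as a whole-number string, or "" when it is not numeric."""
    text = value.strip()
    try:
        # float() first so "1985.0" is accepted; int() then truncates any fraction.
        return str(int(float(text)))
    except (ValueError, OverflowError):
        return ""


def write_clean_csv(rows: list[dict[str, str]], column: str) -> str:
    """Coerce one column to a single type, write a .csv file, return its path."""
    cleaned = [{**row, column: to_int_or_blank(row[column])} for row in rows]
    # Keep the .csv suffix so a file-based run can detect the format.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(cleaned)
    return path


rows = [
    {"name": "Ada", "birth_year": "1985"},
    {"name": "Bo", "birth_year": "1985.0"},
    {"name": "Cy", "birth_year": "unknown"},
]
path = write_clean_csv(rows, "birth_year")
print(path)
# The cleaned file can then be passed to the file-based entry point:
# result = gp.run(path)
```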
MCP endpoints
- Remote: https://goldenpipe-mcp-production.up.railway.app/mcp/
- Local: goldenpipe mcp-serve --transport http --port 8250