dateutil / phonenumbers) once
per row. They are now resolved with a vectorized fast path and a per-row
fallback, with an optional compiled kernel for the phone tail.
The three-tier resolver
Date and phone transforms resolve each value with the cheapest tier that can, falling through only for the rows it can’t:

- Vectorized Polars fast path: resolves the well-formed common case in Rust (multi-format str.to_date for dates; a NANP-shape regex for phones), leaving anything it isn’t certain about unresolved.
- Native kernel (optional, phone only): the goldenflow-native Rust kernel runs on just the residual rows.
- Per-row reference: the original dateutil / phonenumbers path settles whatever the first two tiers left.

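
The fall-through between tiers can be sketched in plain Python. This is an illustration only, not goldenflow's implementation: the regex patterns, the function names (`fast_path`, `kernel_stand_in`, `reference_stand_in`, `resolve`), and the lookup table standing in for the per-row reference are all hypothetical, and the real first tier runs as vectorized Polars expressions rather than a Python loop.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

Tier = Tuple[str, Callable[[str], Optional[str]]]

# Hypothetical patterns (assumptions, not goldenflow's): a bare 3-3-4 NANP
# shape, and the same shape behind a 1 / +1 prefix.
_BODY = r"\(?([2-9]\d{2})\)?[\s.-]*([2-9]\d{2})[\s.-]?(\d{4})"
BARE_NANP = re.compile(r"\s*" + _BODY + r"\s*")
PREFIXED_NANP = re.compile(r"\s*\+?1[\s.-]?" + _BODY + r"\s*")


def fast_path(value: str) -> Optional[str]:
    """Tier 1 stand-in: accept only the unambiguous bare shape, else None."""
    m = BARE_NANP.fullmatch(value)
    return "+1" + "".join(m.groups()) if m else None


def kernel_stand_in(value: str) -> Optional[str]:
    """Tier 2 stand-in: handle the prefixed forms the fast path skipped."""
    m = PREFIXED_NANP.fullmatch(value)
    return "+1" + "".join(m.groups()) if m else None


def reference_stand_in(value: str) -> Optional[str]:
    """Tier 3 stand-in for the authoritative per-row parser.

    A lookup table keeps the sketch self-contained; it is not a real parser.
    """
    return {"020 7946 0958": "+442079460958"}.get(value)


def resolve(values: Sequence[str], tiers: Sequence[Tier]):
    """Run each tier only on the rows that earlier tiers left unresolved."""
    results: List[Optional[str]] = [None] * len(values)
    resolved_by: List[Optional[str]] = [None] * len(values)
    pending = list(range(len(values)))
    for name, tier in tiers:
        remaining = []
        for i in pending:
            out = tier(values[i])
            if out is None:
                remaining.append(i)  # fall through to the next tier
            else:
                results[i] = out
                resolved_by[i] = name
        pending = remaining
        if not pending:
            break  # nothing left for the more expensive tiers
    return results, resolved_by


TIERS: List[Tier] = [
    ("fast", fast_path),
    ("kernel", kernel_stand_in),
    ("reference", reference_stand_in),
]
```

The point of the structure is that each tier sees a shrinking residual: the expensive per-row path only runs on rows that neither cheaper tier could settle with certainty.
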
Measured on a realistic messy 1M-row frame:
date_iso8601 76× faster,
phone_e164 19×, phone_digits 4.9× — roughly 14× end-to-end on
a mixed date/phone/text run, with no change to the cleaned values.

Optional native kernel
goldenflow-native is a separate compiled runtime (Rust/PyO3, abi3) — the same
split as polars / polars-runtime. The pure-Python goldenflow wheel works
on its own; the native kernel is opt-in and accelerates the phone residual
the Polars fast path can’t reach (alpha numbers like 1-800-FLOWERS,
extensions, +1-prefixed forms) via an Arrow zero-copy path.
The kernel stays in parity with the phonenumbers library: it resolves
North American (NANP) numbers and defers everything international or ambiguous to
Python. In the default gated mode, you never get a different cleaned value with the kernel on.
Controlling it
The GOLDENFLOW_NATIVE environment variable selects the path:
| Value | Behavior |
|---|---|
| unset / auto | Use the native kernel where it’s gated (phone, NANP-only). Default. |
| 0 | Force the pure-Python path everywhere. |
| 1 | Use native for every component with no NANP restriction. This is a benchmarking/parity lane that can differ from Python on international numbers. |
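
As a rough illustration of the table, the helper below maps the variable's value to one of the three behaviors. It is a hypothetical sketch: `native_mode` and its return labels are invented here, and the case-insensitive matching and the `ValueError` for unrecognized values are assumptions, since the handling of other values is not documented above.

```python
import os
from typing import Mapping, Optional


def native_mode(environ: Optional[Mapping[str, str]] = None) -> str:
    """Interpret GOLDENFLOW_NATIVE as described in the table (hypothetical)."""
    env = os.environ if environ is None else environ
    raw = env.get("GOLDENFLOW_NATIVE", "auto").strip().lower()
    if raw in ("", "auto"):
        return "gated"  # default: native kernel for phone, NANP-only
    if raw == "0":
        return "python-only"  # pure-Python path everywhere
    if raw == "1":
        return "native-everywhere"  # benchmarking/parity lane, no NANP gate
    # Assumption: reject anything else rather than guess at its meaning.
    raise ValueError(f"unrecognized GOLDENFLOW_NATIVE value: {raw!r}")
```

In practice the variable would be set in the process environment before running a pipeline, for example to 0 when ruling out the kernel while debugging.
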
Dates intentionally have no native kernel: the Polars fast path already
resolves them in vectorized Rust, so a per-row compiled parser would be
slower.
phone_national / phone_validate stay pure Python as well.

Why it’s safe
The parity contract is enforced by tests that compare the full output against the pure dateutil / phonenumbers reference over a large random corpus
(clean, alpha, extension, ambiguous, and international inputs), and a CI lane
builds the native kernel and runs that suite with the kernel active. Turning the
kernel on or off only changes speed, never the cleaned data.
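
A parity check of this kind can be sketched as follows. Everything here is a stand-in chosen for illustration, not the project's test suite: `reference_stand_in` is a toy digit-stripping normalizer rather than phonenumbers, `kernel_stand_in` is a toy accelerated path, and `DEFER`, `random_corpus`, and `parity_mismatches` are hypothetical names.

```python
import random
import re

DEFER = object()  # sentinel: the accelerated path declines to answer


def reference_stand_in(value):
    """Toy reference: keep digits, accept 10 digits or 1 + 10 digits."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return "+1" + digits if len(digits) == 10 else None


def kernel_stand_in(value):
    """Toy accelerated path: answer only plain 10-digit strings, else defer."""
    return "+1" + value if re.fullmatch(r"\d{10}", value) else DEFER


def with_kernel(kernel, reference, value):
    """Output with the kernel on: kernel answer, or the reference if deferred."""
    out = kernel(value)
    return reference(value) if out is DEFER else out


def parity_mismatches(kernel, reference, corpus):
    """Inputs whose kernel-on output differs from the pure reference output."""
    return [v for v in corpus if with_kernel(kernel, reference, v) != reference(v)]


def random_corpus(n, seed=0):
    """Seeded mix of clean, formatted, prefixed, extension, and foreign shapes."""
    rng = random.Random(seed)
    templates = [
        "{a}{b}{c}",
        "({a}) {b}-{c}",
        "+1 {a} {b} {c}",
        "{a}-{b}-{c} x{e}",
        "+44 20 {c} {c}",
        "1{a}{b}{c}",
    ]
    corpus = []
    for _ in range(n):
        corpus.append(
            rng.choice(templates).format(
                a=rng.randint(200, 999),
                b=rng.randint(200, 999),
                c=f"{rng.randint(0, 9999):04d}",
                e=rng.randint(1, 999),
            )
        )
    return corpus
```

An empty result from `parity_mismatches` is what a parity contract asks for: wherever the accelerated path answers, it agrees with the reference, and everywhere else it defers.
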