GoldenMatch includes 7 built-in YAML rulebooks that extract structured fields from unstructured product descriptions and other domain-specific text.

Built-in packs

| Pack | Domain | Extracted fields |
| --- | --- | --- |
| electronics | Consumer electronics | brand, model, SKU, color, specs |
| software | Software products | name, version, edition, platform |
| healthcare | Medical records | NPI, CPT codes, drug names, dosages |
| financial | Financial instruments | CUSIP, LEI, ticker, account numbers |
| real_estate | Property listings | address, MLS number, lot size, year built |
| people | Person records | name parts, phone, email, SSN pattern |
| retail | General retail | brand, SKU, UPC, size, color |

Using domain packs

Discover available packs

import goldenmatch as gm

rulebooks = gm.discover_rulebooks()  # Returns all 7 packs
print(list(rulebooks.keys()))
# ['electronics', 'software', 'healthcare', 'financial', 'real_estate', 'people', 'retail']

Extract fields

import goldenmatch as gm

rulebooks = gm.discover_rulebooks()
enhanced_df, low_confidence = gm.extract_with_rulebook(df, "title", rulebooks["electronics"])

# enhanced_df has new columns: __brand__, __model__, __sku__, etc.
# low_confidence contains records where extraction confidence was low

Auto-detect domain

import goldenmatch as gm

domain = gm.match_domain(df, "description")
# Returns the detected pack name ("electronics", "software", etc.), or None if no pack matches

YAML config

Enable domain extraction in your config file:
domain:
  enabled: true
  pack: electronics
Or omit pack to let GoldenMatch auto-detect the domain:
domain:
  enabled: true

Electronics pack

Extracts brand, model number, SKU, color, and technical specs from product titles.
"Samsung Galaxy S24 Ultra 256GB Titanium Black SM-S928B"
  -> brand: Samsung
  -> model: Galaxy S24 Ultra
  -> sku: SM-S928B
  -> color: Titanium Black
  -> specs: 256GB
Model normalization strips hyphens, region suffixes, and color suffixes for better matching.
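
The normalization described above can be sketched in a few lines. This is a minimal illustration, not the electronics pack's actual rules: the suffix lists and the `normalize_model` helper are hypothetical, and the real pack may recognize different region and color codes.

```python
import re

# Hypothetical suffix lists; the shipped electronics pack may use different ones.
REGION_SUFFIXES = {"US", "EU", "UK", "JP"}
COLOR_SUFFIXES = {"BLK", "WHT", "SLV", "BLU"}


def normalize_model(model: str) -> str:
    """Uppercase, drop trailing region/color tokens, and remove separators."""
    tokens = [t for t in re.split(r"[-/\s]+", model.upper()) if t]
    # Peel known suffix tokens off the end, but always keep at least one token.
    while len(tokens) > 1 and tokens[-1] in (REGION_SUFFIXES | COLOR_SUFFIXES):
        tokens.pop()
    return "".join(tokens)


print(normalize_model("SM-S928B"))         # SMS928B
print(normalize_model("sm-s928b/us-blk"))  # SMS928B
```

Both spellings collapse to the same key, which is what lets an exact or fuzzy scorer treat them as the same model.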

Software pack

Extracts name, version, edition, and platform.
"Microsoft Office 365 Professional Plus - Windows"
  -> name: Microsoft Office
  -> version: 365
  -> edition: Professional Plus
  -> platform: Windows

Healthcare pack

Extracts medical identifiers with contextual prefix requirements (e.g., NPI:, CPT:) to avoid false positives on generic numbers.
"Provider NPI:1234567890, CPT:99213 Office Visit"
  -> npi: 1234567890
  -> cpt_code: 99213
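
To illustrate why the contextual prefix matters, here is a minimal sketch using Python's standard `re` module. The patterns and the `extract_ids` helper are assumptions made for illustration; the shipped healthcare rulebook may define its patterns differently.

```python
import re

# Illustrative patterns only; the shipped healthcare rulebook may differ.
PATTERNS = {
    "npi": re.compile(r"\bNPI[:\s]\s*(\d{10})\b"),
    "cpt_code": re.compile(r"\bCPT[:\s]\s*(\d{5})\b"),
}


def extract_ids(text: str) -> dict:
    """Return only identifiers that appear right after their labelled prefix."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


print(extract_ids("Provider NPI:1234567890, CPT:99213 Office Visit"))
# {'npi': '1234567890', 'cpt_code': '99213'}
print(extract_ids("Invoice 1234567890, ref 99213"))
# {}
```

The second call returns nothing even though it contains a 10-digit and a 5-digit number, because neither is introduced by its prefix.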

Financial pack

Extracts financial identifiers (CUSIP, LEI, ticker). As with the healthcare pack, contextual prefixes (e.g., CUSIP:, LEI:) are required.
"Bond CUSIP:037833AK6, Issuer LEI:HWUPKR0MPOU8FGXBT394"
  -> cusip: 037833AK6
  -> lei: HWUPKR0MPOU8FGXBT394
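
Beyond requiring a prefix, an extracted CUSIP can also be sanity-checked with the standard CUSIP check-digit calculation. Whether GoldenMatch does this is not stated here, so treat the sketch below as an optional post-processing step you could apply yourself; `is_valid_cusip` is a hypothetical helper, not part of the library.

```python
CUSIP_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#"


def is_valid_cusip(candidate: str) -> bool:
    """Check a 9-character CUSIP against its final check digit."""
    candidate = candidate.upper()
    if len(candidate) != 9 or candidate[8] not in "0123456789":
        return False
    total = 0
    for index, char in enumerate(candidate[:8]):
        value = CUSIP_ALPHABET.find(char)  # digits 0-9, A=10 ... Z=35, *=36, @=37, #=38
        if value < 0:
            return False
        if index % 2 == 1:  # double every second character (2nd, 4th, ...)
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10 == int(candidate[8])


print(is_valid_cusip("037833AK6"))  # True
print(is_valid_cusip("037833AK5"))  # False
```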

Custom domain packs

Create your own YAML rulebook and place it in one of the search paths:
| Path | Scope |
| --- | --- |
| .goldenmatch/domains/ | Project-local |
| ~/.goldenmatch/domains/ | Global (user) |
| goldenmatch/domains/ | Built-in (read-only) |
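
To check which custom rulebook files are visible from your working directory, a small standard-library sketch like the one below can help. It rests on two assumptions that are not confirmed here: that a rulebook's name matches its file stem, and that directories listed earlier take precedence over later ones. `find_rulebooks` is a hypothetical helper, not a GoldenMatch function.

```python
from pathlib import Path


def find_rulebooks(search_dirs):
    """Map rulebook name -> YAML path; earlier directories take precedence (assumed)."""
    found = {}
    for directory in search_dirs:
        directory = Path(directory).expanduser()
        if not directory.is_dir():
            continue  # a missing search path is simply skipped
        for path in sorted(directory.glob("*.yaml")):
            found.setdefault(path.stem, path)  # keep the first hit per name
    return found


# Project-local first, then the per-user directory (assumed order).
search_dirs = [Path(".goldenmatch/domains"), Path("~/.goldenmatch/domains")]
for name, path in find_rulebooks(search_dirs).items():
    print(f"{name}: {path}")
```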

Rulebook YAML format

# .goldenmatch/domains/my_domain.yaml
name: my_domain
description: Custom domain for matching widgets
signals:
  - pattern: "widget"
    weight: 1.0
  - pattern: "part_?number"
    weight: 0.8
extractors:
  - name: part_number
    pattern: "PN[:-]?\\s*(\\w{6,12})"
    group: 1
  - name: manufacturer
    pattern: "(Acme|Globex|Initech)"
    group: 1
normalizers:
  part_number:
    strip_chars: "-"
    uppercase: true
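
One plausible reading of this format, sketched with only the standard library: signals score how well a piece of text fits the domain, extractors pull fields out with a regex capture group, and normalizers clean each extracted value. The semantics below (weighted-fraction scoring, case-insensitive signals, first match wins) are assumptions for illustration and may not match GoldenMatch's behavior; `signal_score` and `extract` are hypothetical helpers.

```python
import re

# The YAML above, mirrored as a plain dict so the sketch needs no YAML parser.
RULEBOOK = {
    "signals": [
        {"pattern": "widget", "weight": 1.0},
        {"pattern": "part_?number", "weight": 0.8},
    ],
    "extractors": [
        {"name": "part_number", "pattern": r"PN[:-]?\s*(\w{6,12})", "group": 1},
        {"name": "manufacturer", "pattern": "(Acme|Globex|Initech)", "group": 1},
    ],
    "normalizers": {"part_number": {"strip_chars": "-", "uppercase": True}},
}


def signal_score(text, rulebook):
    """Fraction of total signal weight whose pattern occurs in the text."""
    signals = rulebook["signals"]
    total = sum(s["weight"] for s in signals)
    hit = sum(s["weight"] for s in signals
              if re.search(s["pattern"], text, re.IGNORECASE))
    return hit / total if total else 0.0


def extract(text, rulebook):
    """Run each extractor, then apply that field's normalizer if one exists."""
    fields = {}
    for rule in rulebook["extractors"]:
        match = re.search(rule["pattern"], text)
        if not match:
            continue
        value = match.group(rule["group"])
        options = rulebook["normalizers"].get(rule["name"], {})
        for char in options.get("strip_chars", ""):
            value = value.replace(char, "")
        if options.get("uppercase"):
            value = value.upper()
        fields[rule["name"]] = value
    return fields


title = "Acme widget PN: ab12cd34"
print(signal_score(title, RULEBOOK))
print(extract(title, RULEBOOK))
# {'part_number': 'AB12CD34', 'manufacturer': 'Acme'}
```

Note that in the YAML file the backslashes are doubled (`\\s`, `\\w`) because the pattern sits inside a double-quoted YAML string; the Python raw string above is the same regex after YAML unescaping.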

Create via Python

import goldenmatch as gm

# rulebook: a rulebook object defined elsewhere (its construction is not shown here)
gm.save_rulebook("my_domain", rulebook)
loaded = gm.load_rulebook("my_domain")

Create via MCP

The MCP server provides tools for domain management:
| Tool | Description |
| --- | --- |
| list_domains | List all available domain packs |
| create_domain | Create a new custom domain pack |
| test_domain | Test a domain pack against sample data |

Domain extraction in the pipeline

Domain extraction runs between the standardize and matchkeys steps. It adds each extracted field as a new column whose name is wrapped in double underscores (for example, __brand__ and __model__), and these columns can be referenced in matchkeys:
matchkeys:
  - name: product_match
    type: weighted
    threshold: 0.85
    fields:
      - field: __brand__
        scorer: exact
        weight: 0.3
      - field: __model__
        scorer: jaro_winkler
        weight: 0.5
      - field: title
        scorer: token_sort
        weight: 0.2
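
To make the weighted matchkey concrete, the sketch below computes a weighted sum of per-field similarities and compares it to the threshold. It is an approximation under stated assumptions: the combination rule is assumed to be a plain weighted sum, and `difflib.SequenceMatcher` stands in for the jaro_winkler and token_sort scorers, so the numbers will not match GoldenMatch's own scores.

```python
from difflib import SequenceMatcher


def exact(a, b):
    return 1.0 if a == b else 0.0


def ratio(a, b):
    # Stand-in for jaro_winkler: difflib's similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def token_sort(a, b):
    # Compare after sorting lowercase tokens, so word order is ignored.
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(norm(a), norm(b))


FIELDS = [  # (field, scorer, weight), mirroring the matchkey above
    ("__brand__", exact, 0.3),
    ("__model__", ratio, 0.5),
    ("title", token_sort, 0.2),
]
THRESHOLD = 0.85


def weighted_score(left, right):
    """Weighted sum of per-field similarities; the weights here sum to 1.0."""
    return sum(w * scorer(left[f], right[f]) for f, scorer, w in FIELDS)


a = {"__brand__": "Samsung", "__model__": "Galaxy S24 Ultra",
     "title": "Samsung Galaxy S24 Ultra 256GB"}
b = {"__brand__": "Samsung", "__model__": "Galaxy S24 Ultra",
     "title": "Galaxy S24 Ultra Samsung 256GB"}
score = weighted_score(a, b)
print(score >= THRESHOLD)  # True
```

Because __brand__ uses an exact scorer with weight 0.3, a brand mismatch alone caps the score at 0.7 and keeps the pair below the 0.85 threshold.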

Benchmarks

The effect of domain extraction on product matching depends on the dataset: it improved F1 substantially on Abt-Buy but slightly reduced it on Amazon-Google.
| Dataset | Without domain | With domain | Change |
| --- | --- | --- | --- |
| Abt-Buy (electronics) | 44.5% F1 | 72.2% F1 | +27.7pp |
| Amazon-Google (software) | 45.3% F1 | 42.1% F1 | -3.2pp |
Domain extraction helps datasets with structured identifiers (brand, model, SKU) but can hurt datasets with unstructured descriptions. For software matching, clean embedding + ANN pipelines perform better.