Data Lakehouse · Medallion Architecture · Data Engineering · Data Quality

Medallion Architecture Explained: Bronze, Silver, Gold Data Layers

March 25, 2026 · 7 min read · By Hybridyn Engineering

Medallion architecture is a data design pattern that organizes a lakehouse into three layers: Bronze (raw), Silver (cleaned), and Gold (business-ready). It was popularized by Databricks but is now used across the industry regardless of the underlying technology.

The pattern solves a fundamental problem: raw data is messy, and business users need clean, reliable data. Instead of trying to do everything in one step, medallion architecture breaks the problem into progressive refinement stages.

The Three Layers

Bronze Layer — Raw Ingestion

The Bronze layer is where raw data lands exactly as it arrives from source systems. No transformations, no filtering, no cleaning. Just a faithful copy of the source.

What goes here:

  • Raw API responses (JSON as-is)
  • Database CDC (Change Data Capture) events
  • Log files, CSV dumps, webhook payloads
  • File uploads from external partners

Key principles:

  • Append-only — never delete or modify raw data
  • Schema-on-read — store first, define schema later
  • Full history — keep every version, every event
  • Metadata tracking — when was this ingested? From which source? Which pipeline?

Storage format: Parquet files in object storage (MinIO, S3, Azure Blob, GCS). Parquet gives you columnar compression and schema evolution without a database.
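The metadata-tracking principle above can be sketched as a small wrapper that attaches ingestion context to a raw payload without altering it. This is an illustrative example using only the standard library; the function and field names (`wrap_bronze_record`, `_ingested_at`, and so on) are hypothetical, and the actual Parquet write (typically via a library such as pyarrow) is omitted.

```python
import json
import uuid
from datetime import datetime, timezone

def wrap_bronze_record(raw_payload: dict, source: str, pipeline: str) -> dict:
    """Attach ingestion metadata to a raw payload, leaving the payload untouched."""
    return {
        "payload": raw_payload,                          # stored exactly as received
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source": source,                               # which system produced it
        "_pipeline": pipeline,                           # which job ingested it
        "_record_id": str(uuid.uuid4()),                 # unique id for lineage
    }

record = wrap_bronze_record({"order_id": 42, "total": "19.99"}, "shop-api", "orders-ingest")
print(json.dumps(record["payload"]))
```

Note that the payload keeps its original, possibly "wrong" types (`total` as a string here) — type casting is deliberately deferred to Silver.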

Silver Layer — Cleaned and Conformed

The Silver layer is where data gets cleaned, deduplicated, typed, and conformed to a consistent schema. This is where most of the engineering work happens.

Transformations applied:

  • Data type casting (strings to dates, numbers)
  • Null handling and default values
  • Deduplication (remove duplicate events)
  • Schema normalization (consistent column names across sources)
  • Join enrichment (combine data from multiple Bronze sources)
  • Data quality validation (reject rows that fail rules)
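A minimal sketch of these transformations in Python — casting types, deduplicating on a key, and routing rule failures to a reject pile rather than dropping them silently. The row shape and function name are hypothetical; a production pipeline would run this at scale in an engine such as Spark or a SQL transform.

```python
from datetime import datetime

def clean_orders(bronze_rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Cast types, deduplicate on order_id, and reject rows that fail quality rules."""
    seen, silver, rejected = set(), [], []
    for row in bronze_rows:
        try:
            order = {
                "order_id": int(row["order_id"]),
                "created_at": datetime.fromisoformat(row["created_at"]),
                "total": round(float(row["total"]), 2),
                "customer_email": row.get("customer_email", "").strip().lower(),
            }
        except (KeyError, TypeError, ValueError):
            rejected.append(row)   # quality gate: malformed rows never reach Silver
            continue
        if order["total"] <= 0 or order["order_id"] in seen:
            rejected.append(row)   # rule failure or duplicate event
            continue
        seen.add(order["order_id"])
        silver.append(order)
    return silver, rejected

rows = [
    {"order_id": "1", "created_at": "2026-03-01T10:00:00", "total": "19.99",
     "customer_email": " A@B.com "},
    {"order_id": "1", "created_at": "2026-03-01T10:00:00", "total": "19.99"},  # duplicate
    {"order_id": "2", "created_at": "2026-03-01T11:00:00", "total": "-5"},     # fails rule
]
silver, rejected = clean_orders(rows)
print(len(silver), len(rejected))  # → 1 2
```

Because the function is a pure transformation of its input, rerunning it over the same Bronze rows yields the same Silver rows — the idempotency property listed below.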

Key principles:

  • Idempotent — reprocessing produces the same result
  • Testable — data quality rules are explicit and measurable
  • Documented — every transformation has a clear purpose
  • Versioned — schema changes are tracked

Gold Layer — Business-Ready

The Gold layer contains data modeled for specific business use cases. This is what analysts query, dashboards display, and ML models train on.

What goes here:

  • Aggregated metrics (daily revenue, user counts, conversion rates)
  • Dimensional models (star/snowflake schemas)
  • Feature tables for ML models
  • Data products (pre-computed datasets published for consumption)

Key principles:

  • Use-case driven — each Gold table serves a specific business question
  • Performant — optimized for query patterns (partitioned, indexed)
  • Governed — access controls, data classification, lineage tracked
  • SLA-backed — freshness guarantees for downstream consumers

Why This Pattern Works

1. Separation of Concerns

Each layer has a clear responsibility. Bronze handles ingestion. Silver handles quality. Gold handles business logic. Teams can work on each layer independently.

2. Reprocessing Safety

If you discover a bug in your Silver transformation, you can fix it and reprocess from Bronze. The raw data is always there. Without Bronze, a transformation bug means lost data.

3. Data Quality Gates

You can enforce quality rules between layers. Data that fails validation stays in Bronze — it doesn't pollute Silver or Gold. This is much harder when everything happens in one step.

4. Incremental Processing

Each layer processes only what changed. Bronze ingests new events. Silver processes only new Bronze data. Gold aggregates only updated Silver records. This makes pipelines efficient at scale.
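One common way to implement "process only what changed" is a high-water mark: each run remembers the latest timestamp it saw and picks up only newer records next time. A minimal sketch, assuming rows carry an ISO-8601 `ingested_at` field (the names here are illustrative, not from any specific tool):

```python
def process_incrementally(rows: list[dict], watermark: str) -> tuple[list[dict], str]:
    """Return rows newer than the watermark, plus the advanced watermark."""
    # ISO-8601 timestamps sort correctly as strings, so plain comparison works
    new_rows = [r for r in rows if r["ingested_at"] > watermark]
    new_watermark = max((r["ingested_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

bronze = [
    {"id": 1, "ingested_at": "2026-03-25T09:00:00"},
    {"id": 2, "ingested_at": "2026-03-25T10:00:00"},
    {"id": 3, "ingested_at": "2026-03-25T11:00:00"},
]
batch, wm = process_incrementally(bronze, "2026-03-25T09:30:00")
print([r["id"] for r in batch], wm)  # → [2, 3] 2026-03-25T11:00:00
```

The watermark itself would be persisted between runs (in pipeline state or a control table) so a restart resumes where the last run left off.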

Implementing Medallion Architecture

With SQL Transforms

The most common implementation uses SQL for Silver and Gold transformations:

-- Silver: Clean and deduplicate orders
SELECT DISTINCT
    order_id,
    CAST(created_at AS TIMESTAMP) AS created_at,
    CAST(total AS DECIMAL(10,2)) AS total,
    LOWER(TRIM(customer_email)) AS customer_email,
    status
FROM bronze.raw_orders
WHERE order_id IS NOT NULL
  AND total > 0;

-- Gold: Daily revenue aggregation
SELECT
    DATE(created_at) AS order_date,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    AVG(total) AS avg_order_value
FROM silver.orders
WHERE status = 'completed'
GROUP BY DATE(created_at);

With Table Formats

Modern lakehouses use table formats like Apache Iceberg, Delta Lake, or Apache Hudi on top of Parquet files. These add:

  • ACID transactions — concurrent reads and writes without corruption
  • Time travel — query data as it existed at any point in time
  • Schema evolution — add columns without rewriting all data
  • Partition evolution — change partitioning strategy without reprocessing

With F-Pulse Pipeline Templates

F-Pulse includes a Medallion pipeline template that scaffolds the three-layer architecture automatically. You define your sources, and the template creates the Bronze ingestion, Silver cleaning, and Gold aggregation pipelines with proper scheduling and dependencies.

Common Mistakes

1. Skipping Bronze. Teams sometimes transform data on ingestion, losing the raw source. When the transformation logic changes (and it will), there's no way to reprocess.

2. Over-engineering Silver. Silver should clean and conform — not aggregate or model. If you're doing business logic in Silver, it belongs in Gold.

3. One giant Gold table. Gold tables should be use-case specific. A single "master" Gold table that tries to serve every query pattern serves none of them well.

4. No data quality rules. Without explicit quality gates between layers, bad data flows through to Gold and corrupts dashboards. Define rules, measure them, alert on failures.
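"Define rules, measure them, alert on failures" can be as simple as a dictionary of named predicates evaluated against each batch. A sketch under illustrative assumptions (the rule names, row shape, and 95% threshold are made up for the example):

```python
RULES = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "total_positive": lambda r: isinstance(r.get("total"), (int, float)) and r["total"] > 0,
}

def measure_quality(rows: list[dict], rules: dict, threshold: float = 0.95) -> dict:
    """Measure each rule's pass rate and flag rules below the alert threshold."""
    report = {}
    for name, check in rules.items():
        passed = sum(1 for r in rows if check(r))
        rate = passed / len(rows) if rows else 1.0
        report[name] = {"pass_rate": rate, "alert": rate < threshold}
    return report

rows = [{"order_id": 1, "total": 9.5}, {"order_id": None, "total": 3.0}]
report = measure_quality(rows, RULES)
print(report["order_id_present"]["alert"])  # → True
```

The point is that rules are explicit and measurable: a dashboard can plot pass rates over time, and the `alert` flag can feed a notification channel before bad data reaches Gold.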

Summary

Medallion architecture is a proven pattern for organizing data lakehouses. Bronze preserves raw data. Silver enforces quality. Gold delivers business value. The pattern scales from small teams to enterprise data platforms — and it's the default architecture in tools like F-Pulse and D-Pulse.
