Tags: medallion, lakehouse, ETL, data architecture, tutorial

Building a Medallion Architecture ETL Pipeline — Bronze, Silver, Gold Explained

April 7, 2026 · 12 min read · By Hybridyn

The medallion architecture — Bronze, Silver, Gold — has become the default pattern for organizing data in lakehouses and data platforms. It's simple to explain, hard to get wrong, and flexible enough for teams of any size.

This guide walks through the architecture, the design decisions that matter, and how to implement it with a visual pipeline builder.

What Is the Medallion Architecture?

The medallion architecture organizes data into three layers based on quality and readiness:

Bronze (Raw): Data as it arrived. No transformations, no cleaning. The system of record for your pipeline. If something goes wrong downstream, you can always replay from Bronze.

Silver (Cleaned): Data that has been typed, deduplicated, validated, and standardized. Business logic starts here — renaming columns, casting types, filtering bad records.

Gold (Business-Ready): Aggregated, joined, and modeled data ready for dashboards, reports, and APIs. These are your data products — the tables that business users query directly.

Why Three Layers?

Replayability

Bronze stores raw data. If your Silver transform has a bug, fix it and rerun — the source data hasn't changed.

Quality Gates

Each layer boundary is a checkpoint. You can assert row counts, null rates, uniqueness, and freshness before data moves to the next layer.

Team Boundaries

Different teams can own different layers. Data engineers own Bronze → Silver. Analytics engineers own Silver → Gold. This maps cleanly to RBAC and approval workflows.

Designing Your Bronze Layer

Bronze should be append-only and schema-flexible. Store data in Parquet or JSON, partitioned by ingestion date.

Key decisions:

  • Partitioning: By ingest_date or source_system/ingest_date
  • Format: Parquet (columnar, compressed) for structured data. JSON for semi-structured.
  • Retention: Keep Bronze data for at least 90 days. Some teams keep it indefinitely — storage is cheap, reprocessing is not.
  • Schema: Don't enforce strict schemas at Bronze. Let the Silver layer handle validation.
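As a concrete sketch of the partitioning decision above, a Bronze landing key might be built like this (the `bronze_key` helper and the exact layout are illustrative conventions, not an F-Pulse API):

```python
from datetime import date

def bronze_key(source_system: str, ingest_date: date, filename: str) -> str:
    """Build an object-store key for a Bronze landing file,
    partitioned by source system and ingestion date (append-only)."""
    return (
        f"bronze/{source_system}/"
        f"ingest_date={ingest_date.isoformat()}/{filename}"
    )

print(bronze_key("stripe", date(2026, 4, 7), "charges-0001.parquet"))
# → bronze/stripe/ingest_date=2026-04-07/charges-0001.parquet
```

Keying by `source_system/ingest_date` keeps replays cheap: reprocessing one source for one day is a single prefix scan.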

Common Sources

  • Database CDC (Debezium, WAL-based)
  • SaaS API pulls (Stripe, HubSpot, Salesforce)
  • Event streams (Kafka, Redpanda)
  • File drops (S3, SFTP, GCS)

Designing Your Silver Layer

Silver is where you make data reliable. Every record should be:

  • Correctly typed (strings to dates, ints to decimals)
  • Deduplicated (by primary key + timestamp)
  • Validated (null checks, range checks, referential integrity)
  • Standardized (consistent naming, timezone normalization)

SQL Transform Pattern

SELECT
  CAST(id AS INTEGER) AS customer_id,
  TRIM(LOWER(email)) AS email,
  CAST(created_at AS TIMESTAMP) AS created_at,
  COALESCE(status, 'unknown') AS status
FROM source_table
WHERE id IS NOT NULL
  AND email IS NOT NULL

Data Quality Gates

Between Bronze → Silver, assert:

  • Row count within expected range (±20% of prior run)
  • Critical columns have <1% null rate
  • Primary key uniqueness holds
  • Timestamps are within valid ranges

If any gate fails, the pipeline should halt and alert — not silently pass bad data downstream.
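The gates above can be sketched as a single check function; the column names (`id`, `email`) and the returned failure labels are assumptions for illustration:

```python
def check_gates(rows, prior_count):
    """Return a list of failed gate names; an empty list means all gates pass.
    Gates: row count within ±20% of the prior run, <1% null rate on a
    critical column, and primary-key uniqueness."""
    failures = []
    n = len(rows)
    if prior_count and not (0.8 * prior_count <= n <= 1.2 * prior_count):
        failures.append("row_count_drift")
    null_emails = sum(1 for r in rows if r.get("email") is None)
    if n and null_emails / n >= 0.01:
        failures.append("email_null_rate")
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("pk_uniqueness")
    return failures
```

The caller halts and alerts on a non-empty result, rather than writing to Silver and hoping someone notices later.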

Designing Your Gold Layer

Gold tables are business-facing. They should be:

  • Named in business terms (monthly_revenue, not agg_txn_amt_m)
  • Documented with column descriptions
  • Versioned (add columns, never remove without migration)
  • Optimized for query patterns (partitioned by common filter columns)

Common Gold Patterns

Fact tables: Events with foreign keys. orders, page_views, support_tickets.

Dimension tables: Slowly changing entities. customers, products, regions.

Aggregate tables: Pre-computed rollups. daily_revenue_by_region, weekly_active_users.

Wide tables: Denormalized joins for dashboard performance. order_detail_with_customer_and_product.
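As a minimal sketch of the aggregate-table pattern, here is a `daily_revenue_by_region` rollup over Silver order rows (the column names are illustrative assumptions):

```python
from collections import defaultdict

def daily_revenue_by_region(orders):
    """Roll Silver order rows up into a Gold aggregate keyed by
    (order_date, region). Pre-computing this keeps dashboards fast."""
    totals = defaultdict(float)
    for o in orders:
        totals[(o["order_date"], o["region"])] += o["amount"]
    return [
        {"order_date": d, "region": r, "revenue": v}
        for (d, r), v in sorted(totals.items())
    ]
```

Note the table is named in business terms, per the guideline above, and the grain (one row per date and region) is obvious from the key.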

Implementing with F-Pulse

F-Pulse includes a Medallion ETL template that sets up the full three-layer pipeline:

  1. Source Node — Connect to your database, API, or file system
  2. Bronze Sink — Write raw data to S3/MinIO as Parquet, partitioned by date
  3. Transform Node — SQL transform with the expression editor (references upstream data as source_table)
  4. Data Quality Node — Assert row counts, null rates, and uniqueness
  5. Silver Sink — Write cleaned data to the silver bucket
  6. Aggregate Transform — Join and aggregate for business use
  7. Gold Sink — Write final tables

Each node shows live data preview — you can inspect the output at every stage before promoting to production.

Operational Patterns

Backfill

When you change a Silver transform, you need to reprocess historical Bronze data. Design your pipeline to accept a date range parameter. F-Pulse's scheduling supports backfill runs out of the box.
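A date-range parameter for backfills can be as simple as a generator over ingest-date partitions (a sketch; how the parameter reaches your transform depends on your scheduler):

```python
from datetime import date, timedelta

def backfill_dates(start: date, end: date):
    """Yield each ingest_date partition to reprocess, inclusive of both ends.
    Each yielded date maps to one Bronze partition and one Silver rerun."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)
```

Running the Silver transform once per yielded date keeps each backfill run small, idempotent, and easy to retry.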

Late-Arriving Data

Data from APIs often arrives late. Your Silver dedup logic should handle re-processing the same record with an updated timestamp. Use ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) to keep the latest version.
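The same keep-the-latest logic, sketched in Python for clarity (equivalent in spirit to the `ROW_NUMBER()` pattern above, with `id` and `updated_at` as assumed column names):

```python
def dedup_latest(records):
    """Keep only the most recent version of each record by id,
    mirroring ROW_NUMBER() OVER (PARTITION BY id
    ORDER BY updated_at DESC) = 1 in SQL."""
    latest = {}
    for r in records:
        current = latest.get(r["id"])
        if current is None or r["updated_at"] > current["updated_at"]:
            latest[r["id"]] = r
    return list(latest.values())
```

Because the function is idempotent, re-running it after a late arrival simply replaces the stale version with the newer one.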

Schema Evolution

Bronze is append-only, so new columns appear naturally. Silver transforms should handle missing columns with COALESCE or conditional logic. Gold tables should be additive — add columns, don't remove them.
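In a row-mapping transform, the COALESCE-style defaulting looks like this (the Silver schema shown is a hypothetical example, not a prescribed one):

```python
def to_silver(row):
    """Map a Bronze record to the Silver schema, defaulting columns that
    older Bronze files may not have (mirrors COALESCE in SQL)."""
    return {
        "customer_id": int(row["id"]),
        "email": row["email"].strip().lower(),
        # 'status' was added later; older Bronze rows lack the key entirely
        "status": row.get("status") or "unknown",
    }
```

Defensive defaults like this let one Silver transform read every vintage of Bronze data, which is exactly what a backfill needs.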

Anti-Patterns to Avoid

  1. Skipping Bronze: Writing cleaned data directly means you can't replay from raw. Always land raw data first.
  2. Too many layers: Some teams add "Platinum" or "Diamond" layers. Stick with three — complexity kills maintainability.
  3. No quality gates: Moving data between layers without validation is just copying files. The gates are the point.
  4. Gold tables with technical names: If a business user can't understand the table name, it's not Gold yet.

Getting Started

The fastest way to build a medallion pipeline:

  1. Download F-Pulse (docker compose up -d)
  2. Open the pipeline builder and select the Medallion ETL template
  3. Configure your source (database, API, or file)
  4. Set up Bronze, Silver, and Gold sinks pointing to your storage
  5. Add data quality gates between each layer
  6. Schedule and monitor

F-Pulse is free and open source under the MIT license. The Medallion ETL template is included by default. Get started here.

Build data pipelines visually

F-Pulse is open source. Try it in under 3 minutes.