Data Lakehouse · Data Engineering · Data Architecture · Apache Iceberg

How to Build a Data Lakehouse from Scratch

March 10, 2026 · 9 min read · By Hybridyn Engineering

A data lakehouse combines the best of data lakes (cheap storage, schema flexibility, raw data retention) with the best of data warehouses (ACID transactions, SQL access, performance). Instead of choosing between a lake and a warehouse, you get both in one architecture.

This guide walks through building a lakehouse from scratch — what components you need, how they fit together, and where the common pitfalls are.

The Lakehouse Stack

A data lakehouse has four layers:

┌─────────────────────────────┐
│    SQL Query Engine         │  ← Trino, DuckDB, Spark SQL
├─────────────────────────────┤
│    Table Format             │  ← Iceberg, Delta Lake, Hudi
├─────────────────────────────┤
│    File Format              │  ← Parquet, ORC, Avro
├─────────────────────────────┤
│    Object Storage           │  ← S3, MinIO, GCS, Azure Blob
└─────────────────────────────┘

Layer 1: Object Storage

Object storage is the foundation. It stores your data files cheaply and durably. Unlike a database, there's no compute tied to storage — you pay for bytes stored, not for a running server.

Options:

  • MinIO — self-hosted, S3-compatible, free. Great for development and on-premise.
  • Amazon S3 — the standard for cloud deployments.
  • Google Cloud Storage (GCS) — Google's equivalent.
  • Azure Blob Storage — Microsoft's equivalent.

For self-hosted lakehouses, MinIO is the standard choice. It's S3-compatible, so every tool that works with S3 works with MinIO.

Layer 2: File Format

Data files need a format that supports efficient querying. CSV and JSON are inefficient — they require reading entire files to answer any query.

Parquet is the industry standard for lakehouse storage:

  • Columnar — reads only the columns your query needs
  • Compressed — typically 5-10x smaller than CSV
  • Typed — schema is embedded in the file
  • Widely supported — every data tool reads Parquet
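To see the difference in practice, here is a DuckDB-flavored sketch that converts a CSV export to Parquet and then runs a columnar scan (the file names are examples, not from a real pipeline):

```sql
-- Convert a raw CSV export to compressed Parquet (file names are examples)
COPY (SELECT * FROM read_csv_auto('raw-orders.csv'))
  TO 'raw-orders.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- A columnar scan reads only the 'amount' column, not the whole file
SELECT SUM(amount) FROM read_parquet('raw-orders.parquet');
```

With CSV, that aggregation would parse every field of every row; with Parquet, the engine touches one column's pages and skips the rest.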

Layer 3: Table Format

Raw Parquet files in object storage give you a data lake. A table format layered on top adds the warehouse guarantees — ACID transactions, time travel, and schema evolution — without moving the data.

Apache Iceberg is the leading open table format:

  • ACID transactions — concurrent readers and writers don't corrupt data
  • Time travel — query data as it existed at any point in time
  • Schema evolution — add, rename, or drop columns without rewriting data
  • Partition evolution — change partitioning strategy transparently
  • Hidden partitioning — users query without knowing partition structure

Delta Lake (by Databricks) and Apache Hudi (by Uber) are alternatives with similar capabilities. Iceberg has the broadest engine support.

Layer 4: SQL Query Engine

The query engine reads table format metadata, pushes down filters, and executes SQL against Parquet files in object storage.

Options:

  • DuckDB — embedded, single-machine, fast for development and moderate data volumes
  • Trino — distributed, production-grade, supports federated queries across multiple data sources
  • Spark SQL — distributed, best for heavy transformations and ML workloads
  • StarRocks / ClickHouse — optimized for real-time analytics

For most teams starting out, DuckDB is the right choice. It's fast, requires no infrastructure, and handles datasets up to hundreds of gigabytes on a single machine. Move to Trino when you need distributed queries or federation.
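For example, DuckDB can query Parquet files sitting in MinIO directly over the S3 API — no cluster, no server. A minimal sketch, assuming the MinIO setup from later in this guide (endpoint and credentials are the development defaults, not production values):

```sql
-- DuckDB talking to MinIO over the S3 API
INSTALL httpfs; LOAD httpfs;
SET s3_endpoint = 'localhost:9000';
SET s3_access_key_id = 'minioadmin';
SET s3_secret_access_key = 'minioadmin';
SET s3_use_ssl = false;
SET s3_url_style = 'path';

-- Query Parquet files in the bronze bucket in place
SELECT COUNT(*) FROM read_parquet('s3://bronze/orders/*/*/*.parquet');
```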

Setting Up the Lakehouse

Step 1: Object Storage

With Docker Compose, MinIO is a single service:

minio:
  image: minio/minio
  command: server /data --console-address ":9001"
  environment:
    MINIO_ROOT_USER: minioadmin
    MINIO_ROOT_PASSWORD: minioadmin
  ports:
    - "9000:9000"
    - "9001:9001"
  volumes:
    - ./minio-data:/data  # persist objects across container restarts

Create three buckets for the medallion architecture: bronze, silver, gold.

Step 2: Organize with Medallion Architecture

Structure your data in three layers:

bronze/
  ├── orders/2026/03/raw-orders-20260310.parquet
  ├── customers/2026/03/raw-customers-20260310.parquet
  └── products/2026/03/raw-products-20260310.parquet

silver/
  ├── orders/cleaned-orders.parquet
  ├── customers/cleaned-customers.parquet
  └── dim_products/dim-products.parquet

gold/
  ├── daily_revenue/daily-revenue.parquet
  ├── customer_segments/customer-segments.parquet
  └── product_performance/product-performance.parquet

Bronze holds raw ingested data (append-only). Silver holds cleaned, deduplicated, typed data. Gold holds business-ready aggregations and models.

Step 3: Build Ingestion Pipelines

Ingestion pipelines extract data from source systems and write Parquet files to the Bronze bucket. Key requirements:

  • Incremental extraction — only pull new/changed records
  • Change Data Capture (CDC) — capture inserts, updates, and deletes
  • Schema detection — handle schema changes from source systems
  • Metadata — track ingestion timestamp, source, and pipeline version
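A bronze write can carry that metadata as extra columns stamped at ingestion time. A DuckDB-flavored sketch (the source table, source name, and version string are hypothetical):

```sql
-- Stamp each Bronze record with ingestion metadata at write time
COPY (
    SELECT
        *,
        CURRENT_TIMESTAMP AS _ingested_at,
        'orders_api'      AS _source,
        'v1.4.2'          AS _pipeline_version
    FROM source_orders
) TO 's3://bronze/orders/2026/03/raw-orders-20260310.parquet' (FORMAT PARQUET);
```

Prefixing metadata columns with an underscore keeps them visually separate from source fields in downstream transforms.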

Step 4: Build Transformation Pipelines

Silver transforms clean and conform Bronze data. Gold transforms aggregate and model Silver data. Both are typically SQL:

-- Silver transform: clean orders
CREATE TABLE silver.orders AS
SELECT
    order_id,
    CAST(created_at AS TIMESTAMP) AS created_at,
    CAST(amount AS DECIMAL(10,2)) AS amount,
    LOWER(TRIM(customer_email)) AS customer_email
FROM bronze.raw_orders
WHERE order_id IS NOT NULL;
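A matching Gold transform then aggregates Silver into a business model — here, the daily_revenue table from the medallion layout above (a sketch, assuming the silver.orders table just created):

```sql
-- Gold transform: aggregate cleaned orders into a daily revenue model
CREATE TABLE gold.daily_revenue AS
SELECT
    CAST(created_at AS DATE) AS order_date,
    COUNT(*)                 AS order_count,
    SUM(amount)              AS revenue
FROM silver.orders
GROUP BY CAST(created_at AS DATE)
ORDER BY order_date;
```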

Step 5: Add Governance

As the lakehouse grows, governance becomes critical:

  • Access control — who can read Gold? Who can write to Silver?
  • Data catalog — what datasets exist? What do the columns mean?
  • Data lineage — where did this data come from? What transforms were applied?
  • Data classification — which columns contain PII? Which are sensitive?
  • Quality rules — define and enforce expectations on each dataset
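A quality gate between layers can be as simple as SQL assertions that must return zero rows before promotion runs. The rules below are illustrative, not a complete rule set:

```sql
-- Expectation 1: no negative amounts in Silver
SELECT order_id FROM silver.orders WHERE amount < 0;

-- Expectation 2: no duplicate order IDs in Silver
SELECT order_id
FROM silver.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Promote Silver -> Gold only if both queries return zero rows
```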

Common Mistakes

1. Starting with distributed engines. Unless you have terabytes of data, DuckDB on a single machine is faster and simpler than Trino on a cluster. Scale up when you need to, not before.

2. Skipping the table format. Raw Parquet files work, but you'll miss ACID transactions, time travel, and schema evolution. Add Iceberg from the start — the cost is minimal and the benefit compounds over time.

3. No data quality gates. Without quality checks between medallion layers, bad data flows through to Gold tables and corrupts dashboards. Define rules, measure them, alert on failures.

4. Manual pipeline management. As the number of pipelines grows, manual scheduling and monitoring break down. Use an orchestration tool that handles dependencies, retries, and alerting.

The D-Pulse Approach

D-Pulse implements the full lakehouse stack as part of its 9-plane architecture. Object storage (MinIO), table formats (Iceberg/Delta/Hudi via Parquet), query engine (DuckDB + Trino), data catalog, lineage, governance, and pipeline orchestration — all integrated into one platform.

For teams starting smaller, F-Pulse provides the pipeline orchestration layer with medallion architecture templates. You can build the lakehouse incrementally and grow into the full platform when ready.

Summary

A data lakehouse needs four layers: object storage, file format (Parquet), table format (Iceberg), and a SQL engine (DuckDB or Trino). Organize data with medallion architecture (Bronze/Silver/Gold). Start simple with Docker and DuckDB, add governance as you scale, and use pipeline orchestration to keep it all running reliably.

Build data pipelines visually

F-Pulse is open source. Try it in under 3 minutes.