# How to Set Up a CDC Pipeline from PostgreSQL to Your Data Warehouse
Change Data Capture (CDC) is the most efficient way to replicate data from a production database to a data warehouse. Instead of querying the entire table on every sync, CDC reads the database's write-ahead log (WAL) and streams only the changes — inserts, updates, and deletes.
This guide covers how to set up a CDC pipeline from PostgreSQL to a warehouse (Snowflake, BigQuery, or Redshift) using Debezium and F-Pulse.
## Why CDC Instead of Batch Queries?
Batch-based replication (`SELECT * FROM table WHERE updated_at > last_sync`) has three problems:
- Deletes are invisible. A deleted row won't appear in your next query — you'll never know it was removed.
- High database load. Full table scans on production databases cause lock contention and slow down your application.
- Latency. Batch jobs run on a schedule (hourly, daily). CDC can stream changes in near real-time.
CDC solves all three. The WAL records every change, including deletes. Reading the WAL is lightweight — it's an append-only log, not a table scan. And changes stream as they happen.
## Prerequisites
### PostgreSQL Configuration
Your PostgreSQL instance needs WAL-level replication enabled:
```sql
-- Check current setting
SHOW wal_level;
-- Must be 'logical' for CDC

-- In postgresql.conf:
-- wal_level = logical
-- max_replication_slots = 4
-- max_wal_senders = 4
```
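If you prefer not to edit postgresql.conf by hand, the same settings can be applied with `ALTER SYSTEM`, which persists them to postgresql.auto.conf (a server restart is still required for `wal_level` to take effect):

```sql
-- Persist replication settings without editing postgresql.conf
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;
-- Restart PostgreSQL afterwards for wal_level to apply
```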
You also need a replication slot and a publication:
```sql
-- Create a publication for the tables you want to replicate
CREATE PUBLICATION fpulse_pub FOR TABLE orders, customers, products;

-- The replication slot is created automatically by Debezium
```
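To confirm the publication covers exactly the tables you expect, query the `pg_publication_tables` view:

```sql
-- List the tables included in the publication
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'fpulse_pub';
```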
### Permissions
The replication user needs:
```sql
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
```
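Note that `GRANT SELECT ON ALL TABLES` only covers tables that exist today. If new tables will be added to the publication later, you may also want to grant default privileges so future tables are readable without another manual grant:

```sql
-- Applies to tables created in the future by the role running this statement
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO replication_user;
```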
## Architecture Overview
The CDC pipeline has four components:
```
PostgreSQL WAL → Debezium → Message Queue (optional) → Warehouse Sink
```
- PostgreSQL WAL: The source of truth. Every INSERT, UPDATE, DELETE is recorded.
- Debezium: Reads the WAL and converts changes into structured events.
- Message Queue (Kafka/Redpanda): Buffers events for durability. Optional but recommended for production.
- Warehouse Sink: Applies changes to the destination tables.
## Setting Up with F-Pulse
F-Pulse includes Debezium connectors as first-class nodes. Here's how to build the pipeline:
### Step 1: Add a CDC Source Node
Drag a Debezium (PostgreSQL) source onto the canvas. Configure:
- Host, port, database name
- Replication user credentials
- Tables to capture (or use the publication name)
- Snapshot mode: `initial` (full snapshot on first run, then streaming)
### Step 2: Add a Transform Node (Optional)
If you need to reshape data before loading:
```sql
SELECT
  id,
  customer_email,
  total_amount,
  CASE
    WHEN __op = 'd' THEN 'DELETED'
    WHEN __op = 'c' THEN 'INSERTED'
    WHEN __op = 'u' THEN 'UPDATED'
    ELSE 'UNKNOWN'
  END AS change_type,
  __source_ts AS change_timestamp
FROM source_table
```
### Step 3: Add a Warehouse Sink
Connect your destination — Snowflake, BigQuery, or Redshift. Configure:
- Connection credentials
- Target schema and table
- Write mode: `upsert` (merge on primary key)
- Batch size: 1,000-5,000 rows (depends on warehouse)
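Under the hood, an upsert sink typically stages each batch and merges it into the target table on the primary key. As a rough sketch (a warehouse-style `MERGE`; the staging table name and soft-delete column are illustrative, not F-Pulse's actual internals):

```sql
MERGE INTO analytics.orders AS target
USING staging.orders_batch AS batch
  ON target.id = batch.id
WHEN MATCHED AND batch.change_type = 'DELETED' THEN
  -- Soft delete instead of physically removing the row
  UPDATE SET _deleted_at = batch.change_timestamp
WHEN MATCHED THEN
  UPDATE SET customer_email = batch.customer_email,
             total_amount   = batch.total_amount
WHEN NOT MATCHED THEN
  INSERT (id, customer_email, total_amount)
  VALUES (batch.id, batch.customer_email, batch.total_amount);
```

Because the merge keys on the primary key, replaying the same batch twice produces the same end state, which is what makes the sink safe under at-least-once delivery.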
### Step 4: Schedule
For near real-time: schedule every 1-5 minutes.
For cost-sensitive workloads: schedule every 15-60 minutes.
## Handling Schema Changes
Schema changes (ALTER TABLE) are the biggest operational challenge with CDC. When a column is added, renamed, or dropped in PostgreSQL, the Debezium connector needs to handle it.
Best practices:
- Additive changes only: Add columns, don't rename or drop them
- Schema registry: Use a schema registry (included with Redpanda) to version schemas
- Downstream flexibility: Your warehouse sink should handle new columns automatically (Snowflake's VARIANT column is good for this)
- Alerting: Set up alerts for schema drift events so your team can review changes before they propagate
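To illustrate the first rule: an additive change such as a new nullable column flows through cleanly, while a rename looks to downstream consumers like a drop plus an unrelated add (the column names here are illustrative):

```sql
-- Safe, additive change: consumers see a new optional column
ALTER TABLE orders ADD COLUMN discount_amount numeric DEFAULT 0;

-- Risky: downstream, this appears as total_amount vanishing
-- and a brand-new column "amount" appearing
-- ALTER TABLE orders RENAME COLUMN total_amount TO amount;
```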
## Common Pitfalls
### 1. WAL Retention Filling Disk
If Debezium falls behind (downtime, slow sink), unread WAL segments accumulate and can fill your disk.
Fix: Set `max_slot_wal_keep_size` in PostgreSQL to cap WAL retention. Monitor the `pg_replication_slots` view for lag.
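A quick way to see how much WAL each slot is holding back (the slot name will be whatever Debezium created):

```sql
-- Retained WAL per replication slot; large values mean the consumer is behind
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```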
### 2. Initial Snapshot Timeout
The first run takes a full snapshot of every captured table. For large tables, this can take hours and hold locks.
Fix: Use `snapshot.mode=exported` (lock-free snapshot) or schedule the initial snapshot during low-traffic hours.
### 3. DELETE Handling in Warehouses
Most columnar warehouses handle row-level DELETE statements poorly. Upserting a "deleted" marker is cheaper than running individual DELETEs.
Fix: Use soft deletes: add a `_deleted_at` timestamp column and filter it out in your downstream queries.
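With soft deletes in place, downstream consumers can read through a view that hides deleted rows (the view and column names here are illustrative):

```sql
-- Hide soft-deleted rows from analysts by default
CREATE VIEW analytics.orders_current AS
SELECT *
FROM analytics.orders
WHERE _deleted_at IS NULL;
```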
### 4. Duplicate Events
CDC guarantees at-least-once delivery. Your sink must be idempotent.
Fix: Use upsert (merge) mode with the primary key. F-Pulse's warehouse sinks handle this automatically.
## Monitoring Your CDC Pipeline
Key metrics to watch:
- Replication lag: Time between a change in PostgreSQL and its arrival in the warehouse
- WAL disk usage: How much unread WAL is queued
- Event throughput: Changes per second being processed
- Error rate: Failed upserts or schema mismatch events
F-Pulse's monitoring dashboard shows all of these in real time, with alerting for lag spikes and errors.
## Production Checklist
Before going live:
- [ ] WAL level set to `logical` on PostgreSQL
- [ ] Replication user has minimal required permissions
- [ ] Initial snapshot completed successfully
- [ ] Upsert mode enabled on warehouse sink (not append-only)
- [ ] WAL disk monitoring in place
- [ ] Schema drift alerting configured
- [ ] Replication lag alert set (threshold: 5 minutes for batch, 30 seconds for real-time)
- [ ] Tested with a DELETE event to confirm soft-delete handling
F-Pulse includes Debezium CDC connectors for PostgreSQL, MySQL, SQL Server, MongoDB, and Oracle — all free under the MIT license. Set up your first CDC pipeline.
Build data pipelines visually
F-Pulse is open source. Try it in under 3 minutes.