# How to Set Up a CDC Pipeline from PostgreSQL to Your Data Warehouse
Change Data Capture (CDC) is the most efficient way to replicate data from a production database to a data warehouse. Instead of querying the entire table on every sync, CDC reads the database's write-ahead log (WAL) and streams only the changes — inserts, updates, and deletes.
This guide covers how to set up a CDC pipeline from PostgreSQL to a warehouse (Snowflake, BigQuery, or Redshift) using Debezium and F-Pulse.
## Why CDC Instead of Batch Queries?
Batch-based replication (`SELECT * FROM table WHERE updated_at > last_sync`) has three problems:
- Deletes are invisible. A deleted row won't appear in your next query — you'll never know it was removed.
- High database load. Full table scans on production databases cause lock contention and slow down your application.
- Latency. Batch jobs run on a schedule (hourly, daily). CDC can stream changes in near real-time.
CDC solves all three. The WAL records every change, including deletes. Reading the WAL is lightweight — it's an append-only log, not a table scan. And changes stream as they happen.
## Prerequisites
### PostgreSQL Configuration
Your PostgreSQL instance needs WAL-level replication enabled:
```sql
-- Check current setting
SHOW wal_level;
-- Must be 'logical' for CDC

-- In postgresql.conf:
-- wal_level = logical
-- max_replication_slots = 4
-- max_wal_senders = 4
```
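If you prefer not to edit postgresql.conf by hand, the same settings can be applied with `ALTER SYSTEM`, which persists them to postgresql.auto.conf (a server restart is still required for `wal_level` to take effect):

```sql
-- Persist replication settings without editing postgresql.conf
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;
-- Restart PostgreSQL afterwards for wal_level to apply
```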
You also need a replication slot and a publication:
```sql
-- Create a publication for the tables you want to replicate
CREATE PUBLICATION fpulse_pub FOR TABLE orders, customers, products;

-- The replication slot is created automatically by Debezium
```
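To confirm the publication covers exactly the tables you expect, query the `pg_publication_tables` view:

```sql
-- List the tables included in the publication
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'fpulse_pub';
```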
### Permissions
The replication user needs:
```sql
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
```
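Note that `GRANT SELECT ON ALL TABLES` only covers tables that exist today. If new tables will be added to the publication later, you may also want to grant default privileges so future tables are readable without another manual grant:

```sql
-- Applies to tables created in the future by the role running this statement
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO replication_user;
```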
## Architecture Overview
The CDC pipeline has four components:
```
PostgreSQL WAL → Debezium → Message Queue (optional) → Warehouse Sink
```
- PostgreSQL WAL: The source of truth. Every INSERT, UPDATE, DELETE is recorded.
- Debezium: Reads the WAL and converts changes into structured events.
- Message Queue (Kafka/Redpanda): Buffers events for durability. Optional but recommended for production.
- Warehouse Sink: Applies changes to the destination tables.
## Setting Up with F-Pulse
F-Pulse includes Debezium connectors as first-class nodes. Here's how to build the pipeline:
### Step 1: Add a CDC Source Node
Drag a Debezium (PostgreSQL) source onto the canvas. Configure:
- Host, port, database name
- Replication user credentials
- Tables to capture (or use the publication name)
- Snapshot mode: `initial` (full snapshot on first run, then streaming)
### Step 2: Add a Transform Node (Optional)
If you need to reshape data before loading:
```sql
SELECT
  id,
  customer_email,
  total_amount,
  CASE
    WHEN __op = 'd' THEN 'DELETED'
    WHEN __op = 'c' THEN 'INSERTED'
    WHEN __op = 'u' THEN 'UPDATED'
    ELSE 'UNKNOWN'
  END AS change_type,
  __source_ts AS change_timestamp
FROM source_table
```
### Step 3: Add a Warehouse Sink
Connect your destination — Snowflake, BigQuery, or Redshift. Configure:
- Connection credentials
- Target schema and table
- Write mode: `upsert` (merge on primary key)
- Batch size: 1,000-5,000 rows (depends on warehouse)
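Under the hood, an upsert sink typically stages each batch and merges it into the target table on the primary key. As a rough sketch (a warehouse-style `MERGE`; the staging table name and soft-delete column are illustrative, not F-Pulse's actual internals):

```sql
MERGE INTO analytics.orders AS target
USING staging.orders_batch AS batch
  ON target.id = batch.id
WHEN MATCHED AND batch.change_type = 'DELETED' THEN
  -- Soft delete instead of physically removing the row
  UPDATE SET _deleted_at = batch.change_timestamp
WHEN MATCHED THEN
  UPDATE SET customer_email = batch.customer_email,
             total_amount   = batch.total_amount
WHEN NOT MATCHED THEN
  INSERT (id, customer_email, total_amount)
  VALUES (batch.id, batch.customer_email, batch.total_amount);
```

Because the merge keys on the primary key, replaying the same batch twice produces the same end state, which is what makes the sink safe under at-least-once delivery.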
### Step 4: Schedule
For near real-time: schedule every 1-5 minutes.
For cost-sensitive workloads: schedule every 15-60 minutes.
## Handling Schema Changes
Schema changes (ALTER TABLE) are the biggest operational challenge with CDC. When a column is added, renamed, or dropped in PostgreSQL, the Debezium connector needs to handle it.
Best practices:
- Additive changes only: Add columns, don't rename or drop them
- Schema registry: Use a schema registry (included with Redpanda) to version schemas
- Downstream flexibility: Your warehouse sink should handle new columns automatically (Snowflake's VARIANT column is good for this)
- Alerting: Set up alerts for schema drift events so your team can review changes before they propagate
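To illustrate the first rule: an additive change such as a new nullable column flows through cleanly, while a rename looks to downstream consumers like a drop plus an unrelated add (the column names here are illustrative):

```sql
-- Safe, additive change: consumers see a new optional column
ALTER TABLE orders ADD COLUMN discount_amount numeric DEFAULT 0;

-- Risky: downstream, this appears as total_amount vanishing
-- and a brand-new column "amount" appearing
-- ALTER TABLE orders RENAME COLUMN total_amount TO amount;
```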
## Common Pitfalls
### 1. WAL Retention Filling Disk
If Debezium falls behind (downtime, slow sink), unread WAL segments accumulate and can fill your disk.
Fix: Set `max_slot_wal_keep_size` in PostgreSQL to cap WAL retention. Monitor the `pg_replication_slots` view for lag.
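A quick way to see how much WAL each slot is holding back (the slot name will be whatever Debezium created):

```sql
-- Retained WAL per replication slot; large values mean the consumer is behind
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```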
### 2. Initial Snapshot Timeout
The first run takes a full snapshot of every captured table. For large tables, this can take hours and hold locks.
Fix: Use `snapshot.mode=exported` (lock-free snapshot) or schedule the initial snapshot during low-traffic hours.
### 3. DELETE Handling in Warehouses
Most columnar warehouses handle row-level DELETE statements poorly. Upserting a "deleted" marker is cheaper than running individual DELETEs.
Fix: Use soft deletes: add a `_deleted_at` timestamp column and filter it out in your downstream queries.
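With soft deletes in place, downstream consumers can read through a view that hides deleted rows (the view and column names here are illustrative):

```sql
-- Hide soft-deleted rows from analysts by default
CREATE VIEW analytics.orders_current AS
SELECT *
FROM analytics.orders
WHERE _deleted_at IS NULL;
```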
### 4. Duplicate Events
CDC guarantees at-least-once delivery. Your sink must be idempotent.
Fix: Use upsert (merge) mode with the primary key. F-Pulse's warehouse sinks handle this automatically.
## Monitoring Your CDC Pipeline
Key metrics to watch:
- Replication lag: Time between a change in PostgreSQL and its arrival in the warehouse
- WAL disk usage: How much unread WAL is queued
- Event throughput: Changes per second being processed
- Error rate: Failed upserts or schema mismatch events
F-Pulse's monitoring dashboard shows all of these in real time, with alerting for lag spikes and errors.
## Production Checklist
Before going live:
- [ ] WAL level set to `logical` on PostgreSQL
- [ ] Replication user has minimal required permissions
- [ ] Initial snapshot completed successfully
- [ ] Upsert mode enabled on warehouse sink (not append-only)
- [ ] WAL disk monitoring in place
- [ ] Schema drift alerting configured
- [ ] Replication lag alert set (threshold: 5 minutes for batch, 30 seconds for real-time)
- [ ] Tested with a DELETE event to confirm soft-delete handling
F-Pulse includes Debezium CDC connectors for PostgreSQL, MySQL, SQL Server, MongoDB, and Oracle — all free under the MIT license. Set up your first CDC pipeline.
Build data pipelines visually
F-Pulse is open source. Try it in under 3 minutes.