How CDC Works
Change Data Capture (CDC) is the mechanism for capturing every INSERT, UPDATE, and DELETE as it commits to the database — in order, in real time — without polling or reading the full table.
Why polling doesn’t work
The obvious alternative to CDC is polling: run SELECT * FROM orders WHERE updated_at > $last_check on a schedule. It seems simple but breaks down at scale:
- Deletes are invisible. A deleted row is gone. You can’t query for it. Without CDC, you never know it was deleted unless you maintain a separate tombstone table.
- Ordering is not guaranteed. Two rows with the same
updated_atvalue may arrive in different poll windows. Concurrent transactions compound this. - High load at busy times. A full-table scan every 10 seconds on a multi-billion-row table is expensive. CDC reads only what changed.
- Requires a watermark column. Many tables don’t have an
updated_at. Adding one to every table is invasive and doesn’t always happen.
CDC sidesteps all of these by reading the database’s own change stream — the same stream the database uses internally for replication.
How PostgreSQL implements CDC
PostgreSQL writes every committed transaction to the Write-Ahead Log (WAL) before applying it to data files. The WAL is an append-only, ordered, durable record of every change.
Postgres exposes the WAL to external consumers through logical replication, using a plugin to decode binary WAL entries into a structured change stream. Nanosync uses the pgoutput plugin, which is built into Postgres 10+ and requires no extension install.
The setup involves two objects on the source:
- Publication — a named set of tables whose changes should be streamed. Nanosync creates this automatically (
nanosync_<slot_name>). - Replication slot — a durable cursor that tracks how far nanosync has consumed the WAL. Postgres holds WAL segments on disk until the slot acknowledges them.
Nanosync connects to Postgres over the replication protocol (a special Postgres wire protocol mode), subscribes to the publication through the slot, and receives a stream of decoded change messages.
Postgres WAL
│
▼
pgoutput logical decoding plugin
│ (binary WAL → structured messages)
▼
Replication slot (nanosync_orders)
│ (holds WAL until acknowledged)
▼
nanosync CDC decoder
│ (decoded messages → Arrow records)
▼
Sink writer
How SQL Server implements CDC
SQL Server doesn’t expose a streaming replication protocol the way Postgres does. Instead, it writes changes to CDC change tables — one per tracked table, stored in the cdc schema.
A background SQL Agent job called the capture job reads the transaction log continuously and copies change records into these tables. Each row in a change table represents one INSERT, UPDATE, or DELETE, annotated with the LSN and a capture instance identifier.
Nanosync polls the CDC change tables using LSN watermarks: it reads from the last acknowledged LSN, processes the batch, commits to the sink, and advances the watermark. The poll interval is configurable (default: 200ms).
Nanosync also supports tlog mode, which reads sys.fn_dblog directly — bypassing the CDC capture job entirely. This is useful when CDC isn’t enabled or available, but requires more careful tuning on high-volume databases. See SQL Server source setup for details.
What an LSN is
An LSN (Log Sequence Number) is a monotonically increasing pointer into the transaction log. Every committed transaction produces a unique LSN. Nanosync stores the last LSN it successfully committed to the sink.
On restart, nanosync reads the checkpoint LSN from the state store and resumes consuming changes from exactly that point — not from “now”, not from the beginning, but from the precise position where it left off. This is what makes restarts safe and fast: no re-read, no manual recovery, no data loss.
Postgres LSNs look like 0/3A1B2C4D (hexadecimal offset into the WAL file). SQL Server LSNs are binary values typically rendered as 0x0000C3A100000001000a.
The event format nanosync produces
Every event emitted by nanosync has the same envelope regardless of source:
| Field | Type | Description |
|---|---|---|
_ns_op | string | INSERT, UPDATE, or DELETE |
_ns_table | string | Source table in schema.table format |
_ns_lsn | string | LSN at commit |
_ns_committed_at | string | Wall-clock commit timestamp (UTC, RFC3339) |
_ns_tx_id | string | Source transaction identifier |
| column fields | any | Column values from the after-image (or before-image for DELETEs) |
UPDATEs include the full after-image. The before-image is included under _ns_before when the source table has REPLICA IDENTITY FULL set (Postgres) or when all columns are captured (SQL Server CDC).
DELETEs include only the primary key columns by default, unless REPLICA IDENTITY FULL is set, in which case the full before-image is included.
What CDC cannot capture
CDC has blind spots. Nanosync handles these explicitly:
| Event | What nanosync does |
|---|---|
TRUNCATE | Detected separately from the WAL (Postgres 11+) and forwarded as a _ns_op: TRUNCATE event. SQL Server CDC does not capture TRUNCATE — it’s treated as a schema change requiring manual intervention. |
| DDL changes (ALTER TABLE, etc.) | Detected via schema comparison on each batch. Not replayed. Nanosync pauses with a SCHEMA_DRIFT alert on breaking changes. |
| Changes before CDC was enabled | Not capturable. If CDC was enabled after the table was created and already had data, the initial snapshot fills the gap. |
| Large object (LOB) changes | Captured only when the LOB column is included in the publication (Postgres) or the capture instance (SQL Server CDC). |
Sequence value changes (e.g. auto-increment counters) are not emitted as change events. Only row-level DML produces CDC events.