Snapshot to Streaming Handoff
Getting all the existing rows from a source database into the sink is straightforward. Switching seamlessly to a live CDC stream without losing or duplicating any events is the hard part. This page explains how nanosync solves it.
The problem
You can’t just start reading the WAL from “now” — the WAL only contains changes from the moment you connect. All the rows that existed in the table before you connected are invisible to the change stream. You’d land in the sink with a gap: no historical data and only future changes.
The naive fix — snapshot first, then start CDC — introduces a different problem: the snapshot takes time, and during that time, rows are being modified. If row id=42 was updated twice while the snapshot was running and you start CDC after the snapshot completes, you might never see those updates because they occurred before your CDC start point.
Nanosync solves this with a coordinated overlap between the snapshot and the CDC stream.
The two phases
Every cdc_backfill pipeline runs in two sequential phases:
- Snapshot phase — read all existing rows from each table and write them to the sink.
- CDC phase — stream live changes from the source, starting from the LSN recorded at the beginning of the snapshot.
The key detail is when the start LSN is recorded.
The overlap strategy
Before the snapshot starts, nanosync records the current WAL LSN from the source. Call it L₀.
While the snapshot runs, the WAL continues to advance. Rows may be inserted, updated, or deleted. Nanosync does not read these changes during the snapshot — it just keeps L₀ in memory.
When the snapshot completes, nanosync transitions to the CDC phase and begins reading the WAL starting from L₀. This means it replays all WAL events that occurred during the snapshot.
Some of those events will reference rows that the snapshot already wrote to the sink. For example, a row inserted during the snapshot will appear once via the snapshot read and again via the WAL replay. Nanosync handles this by keying all writes on the row’s primary key — the WAL replay’s version simply overwrites the snapshot version at the sink. The final state is correct.
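The idempotence of this overlap can be illustrated with a toy sink keyed on the primary key. The types and names below are hypothetical stand-ins, not nanosync's actual API:

```go
package main

import "fmt"

// Row is a simplified record; ID is the primary key.
type Row struct {
	ID    int
	Value string
}

// Sink is keyed by primary key, so writing the same key twice
// overwrites rather than duplicates.
type Sink map[int]Row

func (s Sink) Upsert(r Row) { s[r.ID] = r }

func main() {
	sink := Sink{}

	// Snapshot phase: reads id=42 with whatever value it had at scan time.
	sink.Upsert(Row{ID: 42, Value: "snapshot-copy"})

	// CDC phase: replays WAL events from L₀, including an update to
	// id=42 that happened while the snapshot was still running.
	sink.Upsert(Row{ID: 42, Value: "updated-during-snapshot"})

	// The replayed event overwrites the snapshot copy; no duplicate row.
	fmt.Println(len(sink), sink[42].Value) // prints: 1 updated-during-snapshot
}
```

Because the later write wins per key, replaying the overlap window is safe no matter how many snapshot rows it touches.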
Timeline:

    L₀ (LSN recorded)
     │
     ├── Snapshot reads ─────────────────────────►  snapshot complete
     │     (full table scan)
     │
     │     WAL events accumulate during the snapshot
     │
     └── CDC starts (from L₀) ───────────────────────────────►
           └── events between L₀ and snapshot completion are
               replayed and overwrite snapshot rows if needed
The result: no events are lost and the final state contains no duplicate rows. The handoff is automatic and requires no manual coordination.
Snapshot parallelism
A single goroutine reading a 500-million-row table would take far too long. Nanosync splits each table into independent partitions and reads them in parallel:
- PostgreSQL — partitions by CTID ranges (physical row addresses). Nanosync divides the table’s page range into equal-sized chunks and assigns each to a worker goroutine.
- SQL Server — partitions by primary key ranges. Nanosync samples the key distribution and creates balanced ranges.
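As a sketch of the PostgreSQL case, dividing a table's page range into equal CTID chunks might look like the following. The page count and chunk math here are illustrative, not nanosync's internal code:

```go
package main

import "fmt"

// ctidRange describes one snapshot partition as a half-open
// interval of physical heap page numbers.
type ctidRange struct{ LoPage, HiPage int }

// splitPages divides [0, totalPages) into at most n roughly equal chunks.
func splitPages(totalPages, n int) []ctidRange {
	ranges := make([]ctidRange, 0, n)
	chunk := (totalPages + n - 1) / n // ceiling division
	for lo := 0; lo < totalPages; lo += chunk {
		hi := lo + chunk
		if hi > totalPages {
			hi = totalPages
		}
		ranges = append(ranges, ctidRange{lo, hi})
	}
	return ranges
}

func main() {
	// Each range becomes a ctid predicate a worker can scan, e.g.:
	//   SELECT * FROM public.orders
	//   WHERE ctid >= '(0,0)'::tid AND ctid < '(250,0)'::tid
	for _, r := range splitPages(1000, 4) {
		fmt.Printf("ctid >= '(%d,0)' AND ctid < '(%d,0)'\n", r.LoPage, r.HiPage)
	}
}
```

CTID ranges are attractive because they map to contiguous heap pages, so each worker's scan is mostly sequential I/O.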
The number of parallel workers is controlled by the snapshot.concurrency option (default: max(4, GOMAXPROCS)):
    pipelines:
      - name: orders-pipeline
        source:
          connection: prod-postgres
          tables: [public.orders]
        sink:
          connection: prod-bigquery
        snapshot:
          concurrency: 8
          target_rows_per_partition: 100000
Each worker reads its partition independently and writes batches to the sink. Workers for different tables can also run concurrently — nanosync snapshots multiple tables in parallel, not sequentially.
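The worker model amounts to a fixed goroutine pool draining a channel of partitions. A minimal sketch, with placeholder names rather than nanosync's internals:

```go
package main

import (
	"fmt"
	"sync"
)

// runPartitions drains nParts snapshot partitions with a fixed pool
// of workers and returns how many partitions completed.
func runPartitions(concurrency, nParts int) int {
	parts := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0

	// One worker goroutine per unit of snapshot.concurrency.
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range parts {
				// ... read this partition and batch-write it to the sink ...
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}

	// Workers pick up partitions as they finish previous ones,
	// so uneven partition sizes still balance across the pool.
	for p := 0; p < nParts; p++ {
		parts <- p
	}
	close(parts)
	wg.Wait()
	return done
}

func main() {
	// 128 partitions drained by 8 workers (snapshot.concurrency: 8).
	fmt.Println("partitions completed:", runPartitions(8, 128)) // prints: partitions completed: 128
}
```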
Resuming an interrupted snapshot
If nanosync is stopped mid-snapshot — due to a crash, restart, or Ctrl-C — it does not start over. Each partition’s completion is checkpointed to the state store. On restart, nanosync resumes from the last completed partition boundary.
    nanosync serving — Ctrl-C to stop
    INF pipeline resuming name=orders-pipeline
    INF snapshot in progress table=public.orders completed_partitions=34/128
    INF snapshotting table=public.orders partition=35/128
This makes large snapshots safe across restarts. A 2-billion-row table that takes 4 hours to snapshot resumes from where it was interrupted, not from the beginning.
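Resumption reduces to persisting each partition's completion and skipping completed partitions on restart. A minimal sketch, with an in-memory map standing in for the durable state store:

```go
package main

import "fmt"

// stateStore stands in for nanosync's durable checkpoint store;
// here it is just a set of completed partition IDs.
type stateStore map[int]bool

// runSnapshot processes every partition not yet checkpointed and
// returns how many it actually ran.
func runSnapshot(state stateStore, totalPartitions int) (ran int) {
	for p := 0; p < totalPartitions; p++ {
		if state[p] {
			continue // already checkpointed before the restart
		}
		// ... read the partition and write it to the sink ...
		state[p] = true // checkpoint only after the partition completes
		ran++
	}
	return ran
}

func main() {
	state := stateStore{}
	// First run is interrupted after 34 of 128 partitions.
	for p := 0; p < 34; p++ {
		state[p] = true
	}
	// On restart, only the remaining 94 partitions are snapshotted.
	fmt.Println("partitions run after restart:", runSnapshot(state, 128)) // prints: partitions run after restart: 94
}
```

Checkpointing after completion (never before) means a crash mid-partition re-reads that one partition, which is safe because sink writes are keyed on the primary key.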
The replication_type option
The snapshot and CDC behavior is controlled by replication_type in the pipeline definition:
    pipelines:
      - name: orders-pipeline
        replication_type: cdc_backfill
        source:
          connection: prod-postgres
          tables: [public.orders]
        sink:
          connection: prod-bigquery
| Value | Behavior |
|---|---|
| cdc_backfill | (default) Snapshot existing rows, then stream live CDC changes. The handoff strategy described on this page. |
| cdc | Skip the snapshot. Start streaming CDC from the current LSN. Use this when the sink already has the historical data and you only want live changes. |
| snapshot | Snapshot once and stop. No live CDC. Use for one-off full table copies. |
| query | Run a scheduled SQL query on the source and write results to the sink. No CDC. |
cdc_backfill is the right default for new pipelines. cdc is correct when you're adding nanosync to a sink that was already populated by another process, since it avoids re-reading data that is already there.
Switching from cdc to cdc_backfill on an existing pipeline triggers a full re-snapshot. Switching from cdc_backfill to cdc on a pipeline that has already completed its snapshot has no effect — the snapshot is done and the pipeline is already in CDC mode.