Snapshot to Streaming Handoff

Getting all the existing rows from a source database into the sink is straightforward. Switching seamlessly to a live CDC stream without losing or duplicating any events is the hard part. This page explains how nanosync solves it.


The problem

You can’t just start reading the WAL from “now” — the WAL only contains changes from the moment you connect. All the rows that existed in the table before you connected are invisible to the change stream. You’d land in the sink with a gap: no historical data and only future changes.

The naive fix — snapshot first, then start CDC — introduces a different problem: the snapshot takes time, and during that time, rows are being modified. If row id=42 was updated twice while the snapshot was running and you start CDC after the snapshot completes, you might never see those updates because they occurred before your CDC start point.

Nanosync solves this with a coordinated overlap between the snapshot and the CDC stream.


The two phases

Every cdc_backfill pipeline runs in two sequential phases:

  1. Snapshot phase — read all existing rows from each table and write them to the sink.
  2. CDC phase — stream live changes from the source, starting from the LSN recorded at the beginning of the snapshot.

The key detail is when the start LSN is recorded.


The overlap strategy

Before the snapshot starts, nanosync records the current WAL LSN from the source. Call it L₀.

While the snapshot runs, the WAL continues to advance. Rows may be inserted, updated, or deleted. Nanosync does not read these changes during the snapshot — it just keeps L₀ in memory.

When the snapshot completes, nanosync transitions to the CDC phase and begins reading the WAL starting from L₀. This means it replays all WAL events that occurred during the snapshot.

Some of those events will reference rows that the snapshot already wrote to the sink. For example, a row inserted during the snapshot will appear once via the snapshot read and again via the WAL replay. Nanosync handles this by keying all writes on the row’s primary key — the WAL replay’s version simply overwrites the snapshot version at the sink. The final state is correct.

Timeline:
                      L₀ (LSN recorded)

  Snapshot reads ─────┼───────────────────────────────►
  (full table scan)   │                               │
                      │  WAL events accumulate        │
                      │  during the snapshot          │
                      │                               │
  CDC starts ─────────┼───────────────────────────────┼────────►
  (from L₀)           │                               │
                      └───────────────────────────────┘
                        events in this range are replayed
                        and overwrite snapshot rows if needed

The result: no events are lost and the final state contains no duplicates. The handoff is automatic and requires no manual coordination.
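
To make the handoff concrete, here is a minimal sketch of the two phases in Go. The Source, Sink, Row, and ChangeEvent types are hypothetical stand-ins for illustration, not nanosync's actual internals:

// A minimal sketch of the handoff. Source, Sink, Row, and ChangeEvent
// are hypothetical stand-ins, not nanosync's real API.
package handoff

import "context"

type Row struct {
	Key  string         // primary key
	Data map[string]any // column values
}

type Op int

const (
	OpUpsert Op = iota // covers both inserts and updates
	OpDelete
)

type ChangeEvent struct {
	Op  Op
	Row Row
}

type Source interface {
	CurrentLSN(ctx context.Context) (uint64, error)
	SnapshotRows(ctx context.Context, emit func(Row) error) error
	StreamChanges(ctx context.Context, fromLSN uint64, emit func(ChangeEvent) error) error
}

type Sink interface {
	Upsert(ctx context.Context, row Row) error // keyed on row.Key
	Delete(ctx context.Context, key string) error
}

func Run(ctx context.Context, src Source, sink Sink) error {
	// 1. Record L0 before the first snapshot read.
	l0, err := src.CurrentLSN(ctx)
	if err != nil {
		return err
	}

	// 2. Snapshot phase: copy every existing row.
	if err := src.SnapshotRows(ctx, func(r Row) error {
		return sink.Upsert(ctx, r)
	}); err != nil {
		return err
	}

	// 3. CDC phase: replay the WAL from L0. Events that raced the
	// snapshot are applied again; keyed upserts make that idempotent.
	return src.StreamChanges(ctx, l0, func(ev ChangeEvent) error {
		if ev.Op == OpDelete {
			return sink.Delete(ctx, ev.Row.Key)
		}
		return sink.Upsert(ctx, ev.Row)
	})
}

The only ordering requirement is step 1: L₀ must be captured before the first snapshot read, or changes made early in the snapshot could fall between the snapshot and the stream.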


Snapshot parallelism

A single goroutine reading a 500-million-row table would take too long. Nanosync splits each table into independent partitions and reads them in parallel.

The number of parallel workers is controlled by the snapshot.concurrency option (default: max(4, GOMAXPROCS)):

pipelines:
  - name: orders-pipeline
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery
    snapshot:
      concurrency: 8
      target_rows_per_partition: 100000

Each worker reads its partition independently and writes batches to the sink. Workers for different tables can also run concurrently — nanosync snapshots multiple tables in parallel, not sequentially.
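
In sketch form, the worker pool over one table's partitions looks like the following. Partition and the readPartition callback are hypothetical; the real workers also handle batching and backpressure:

// Sketch of the parallel partition scan. Partition and readPartition
// are hypothetical; the real workers also batch writes to the sink.
package snapshot

import (
	"context"
	"runtime"

	"golang.org/x/sync/errgroup"
)

type Partition struct{ Lo, Hi int64 } // primary-key range to scan

func snapshotTable(ctx context.Context, parts []Partition, concurrency int,
	readPartition func(context.Context, Partition) error) error {
	if concurrency <= 0 {
		concurrency = max(4, runtime.GOMAXPROCS(0)) // documented default
	}
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(concurrency) // at most `concurrency` scans in flight
	for _, p := range parts {
		g.Go(func() error {
			return readPartition(ctx, p) // full scan of one key range
		})
	}
	return g.Wait() // first error cancels the remaining workers
}

With this shape, a table split into 128 partitions and concurrency: 8 keeps exactly 8 scans running at a time.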


Resuming an interrupted snapshot

If nanosync is stopped mid-snapshot — due to a crash, restart, or Ctrl-C — it does not start over. Each partition’s completion is checkpointed to the state store. On restart, nanosync resumes from the last completed partition boundary.

nanosync serving — Ctrl-C to stop
  INF pipeline resuming      name=orders-pipeline
  INF snapshot in progress   table=public.orders  completed_partitions=34/128
  INF snapshotting           table=public.orders  partition=35/128

This makes large snapshots safe across restarts. A 2 billion row table that takes 4 hours to snapshot will resume from where it was interrupted, not from the beginning.
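
In sketch form, resume is just a skip-if-done check in front of each partition. StateStore is a hypothetical interface, and the loop is shown sequentially for clarity; the real scan runs partitions in parallel as above:

// Sketch of checkpointed resume. StateStore is hypothetical; nanosync
// records completion per partition, so a restart re-reads at most the
// partitions that were in flight when it stopped.
package snapshot

import "context"

type StateStore interface {
	IsPartitionDone(ctx context.Context, table string, part int) (bool, error)
	MarkPartitionDone(ctx context.Context, table string, part int) error
}

func snapshotWithResume(ctx context.Context, table string, nParts int,
	st StateStore, readPartition func(context.Context, int) error) error {
	for i := 0; i < nParts; i++ {
		done, err := st.IsPartitionDone(ctx, table, i)
		if err != nil {
			return err
		}
		if done {
			continue // finished before the interruption; skip it
		}
		if err := readPartition(ctx, i); err != nil {
			return err
		}
		// Checkpoint only after the partition's rows are durably written.
		if err := st.MarkPartitionDone(ctx, table, i); err != nil {
			return err
		}
	}
	return nil
}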


The replication_type option

The snapshot and CDC behavior is controlled by replication_type in the pipeline definition:

pipelines:
  - name: orders-pipeline
    replication_type: cdc_backfill
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery

  Value          Behavior
  ─────────────  ──────────────────────────────────────────────────────
  cdc_backfill   (default) Snapshot existing rows, then stream live CDC
                 changes. The handoff strategy described on this page.
  cdc            Skip the snapshot. Start streaming CDC from the current
                 LSN. Use this when the sink already has the historical
                 data and you only want live changes.
  snapshot       Snapshot once and stop. No live CDC. Use for one-off
                 full table copies.
  query          Run a scheduled SQL query on the source and write
                 results to the sink. No CDC.

cdc_backfill is the right default for new pipelines. cdc is correct when you’re adding nanosync to a sink that was already populated by another process — it avoids re-reading data that’s already there.
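
For example, a pipeline that attaches to an already-populated sink and streams only live changes (connections reused from the examples above):

pipelines:
  - name: orders-pipeline
    replication_type: cdc    # no snapshot; stream from the current LSN
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery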

Switching from cdc to cdc_backfill on an existing pipeline triggers a full re-snapshot. Switching from cdc_backfill to cdc on a pipeline that has already completed its snapshot has no effect — the snapshot is done and the pipeline is already in CDC mode.