Replicate a Postgres Database to BigQuery
End-to-end guide to setting up a live Postgres → BigQuery pipeline with nanosync. CDC prerequisites, IAM, config, monitoring, and production considerations.
What you’ll build
A live Postgres → BigQuery pipeline that snapshots existing rows, then streams inserts, updates, and deletes in real time. After setup, new commits in Postgres appear in BigQuery within seconds.
Prerequisites
Before starting, complete the setup steps in PostgreSQL source prerequisites and BigQuery IAM setup.
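Before going further, it is also worth confirming logical decoding is enabled on the source, since CDC depends on it. A quick check (standard Postgres, nothing nanosync-specific):
psql -d mydb -c "SHOW wal_level"
#  wal_level
# -----------
#  logical
If this prints replica or minimal, set wal_level = logical in postgresql.conf and restart Postgres before continuing.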
Step 1 — Set up BigQuery
# Enable the API
gcloud services enable bigquery.googleapis.com --project=my-project
# Create the destination dataset
bq mk --dataset --location=US my-project:replication
# Create a service account and grant access
gcloud iam service-accounts create nanosync --project=my-project
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:nanosync@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:nanosync@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"
# Download the key
gcloud iam service-accounts keys create nanosync-bq-key.json \
  --iam-account=nanosync@my-project.iam.gserviceaccount.com
On GKE, skip the key — Workload Identity works automatically. Omit credentials_file from the config below.
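On GKE, the BigQuery connection block from Step 2 would then simply drop the key path (a sketch, assuming nanosync falls back to the ambient Google credentials when credentials_file is absent):
- name: my-bigquery
  type: bigquery
  properties:
    project_id: my-project
    dataset_id: replication
    # no credentials_file: Workload Identity supplies credentials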
Step 2 — Create the config file
Create nanosync.yaml:
connections:
  - name: my-postgres
    type: postgres
    dsn: "postgres://nanosync:${env:PG_PASSWORD}@localhost:5432/mydb?sslmode=require"
  - name: my-bigquery
    type: bigquery
    properties:
      project_id: my-project
      dataset_id: replication
      credentials_file: /path/to/nanosync-bq-key.json

pipelines:
  - name: orders-to-bigquery
    source:
      connection: my-postgres
      tables:
        - public.orders
        - public.order_items
    sink:
      connection: my-bigquery
      properties:
        table_id: orders
        primary_keys: id
Secrets never go in the config file. Use ${env:VAR_NAME} — values are injected from environment variables at startup.
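A convenient pattern is to keep secrets in an untracked env file and source it before running any nanosync command (the file name here is arbitrary):
# secrets.env (add to .gitignore)
export PG_PASSWORD=my-secret

source secrets.env   # then run nanosync in this shell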
Step 3 — Test your connections
Before starting the pipeline, verify both connections are reachable:
export PG_PASSWORD=my-secret
nanosync test connection my-postgres
# ✓ my-postgres connected in 12 ms
nanosync test connection my-bigquery
# ✓ my-bigquery connected in 34 ms
If either fails, the error tells you exactly what’s wrong:
✗ my-postgres: connection refused — dial tcp 127.0.0.1:5432: connect: connection refused
Fix the issue before continuing — don’t waste time debugging a running pipeline.
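For the connection-refused case above, pg_isready (shipped with the standard PostgreSQL client tools) is a fast way to confirm the server is actually listening before re-running the test:
pg_isready -h localhost -p 5432
# localhost:5432 - accepting connections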
Step 4 — Start the server
nanosync start dev --config nanosync.yaml
nanosync serving — Ctrl-C to stop
REST API: http://localhost:7600/v1
Web UI: http://localhost:7600/app
The pipeline starts automatically. The initial snapshot begins immediately — nanosync reads all rows from the configured tables in parallel, then switches to streaming live changes.
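You can also query the REST API directly while the server runs. The route below is illustrative only, not a confirmed endpoint; check the CLI reference for the actual paths:
curl http://localhost:7600/v1/pipelines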
Step 5 — Watch it work
nanosync monitor
A full-screen live dashboard opens in your terminal:
NAME                 SOURCE       TARGET       STATUS          LAG    EV/S
orders-to-bigquery   my-postgres  my-bigquery  ● snapshotting  —      —
  [████████░░░░] 67% 2,400,000 rows
Once the snapshot completes, the status flips to live CDC:
orders-to-bigquery   my-postgres  my-bigquery  ● live CDC      12ms   4,231
Press Enter to drill into table-level breakdown. Press w for worker fleet view. Press q to quit.
Step 6 — Kill it and restart
This is the part worth seeing. Press Ctrl-C to stop the server, then restart it:
nanosync start dev --config nanosync.yaml
INF pipeline resuming name=orders-to-bigquery lsn=0/3A1B2C4D
INF cdc streaming lsn=0/3A1B2C4D
No re-snapshot. No manual recovery. The pipeline resumed from the last committed WAL offset — exactly where it left off. That’s the checkpoint.
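You can confirm the same checkpoint from the Postgres side. The replication slot records how far the consumer has acknowledged (a standard catalog view, nothing nanosync-specific), and confirmed_flush_lsn should line up with the lsn in the resume log:
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots;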
Production considerations
Slot lag monitoring
Replication slots retain WAL until consumed by nanosync. If nanosync stops for an extended period, unprocessed WAL accumulates and can fill your disk. Monitor slot lag with:
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE active = false;
Set slot_lag_alert_mb: 5000 in your pipeline config to receive warnings before lag becomes a problem.
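As a sketch of where that option might live, assuming it sits with the pipeline's source settings (the Configuration reference has the authoritative placement):
source:
  connection: my-postgres
  properties:
    slot_lag_alert_mb: 5000  # warn when retained WAL exceeds ~5 GB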
Schema drift
If you add a column to the Postgres table, nanosync will automatically add the column to the BigQuery table on the next batch. If you rename or drop a column, nanosync will stop and alert — manual resolution is required. See Configuration reference for the schema_drift_mode option.
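As a concrete illustration (the column names are made up for the example):
-- Additive: nanosync adds the column to BigQuery on the next batch
ALTER TABLE public.orders ADD COLUMN coupon_code text;

-- Breaking: nanosync stops and alerts; resolve manually
ALTER TABLE public.orders RENAME COLUMN coupon_code TO promo_code;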
How to know it’s working
Verify the pipeline is healthy and rows are in sync:
# Check lag is low
nanosync monitor --pipeline orders-to-bigquery
# Verify rows in BigQuery
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM `my-project.replication.orders`'
# Compare with source
psql -d mydb -c "SELECT COUNT(*) FROM public.orders"
Both counts should match within a few seconds of lag during active writes.
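To turn that into a single check, a small shell helper works (a sketch, assuming psql and bq are already authenticated against the right database and project):
src=$(psql -d mydb -tA -c "SELECT COUNT(*) FROM public.orders")
dst=$(bq query --use_legacy_sql=false --format=csv \
  'SELECT COUNT(*) FROM `my-project.replication.orders`' | tail -n 1)
if [ "$src" = "$dst" ]; then
  echo "in sync: $src rows"
else
  echo "drift: source=$src sink=$dst"
fi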
What’s next
- Configuration reference — all YAML pipeline options
- CLI reference — full command listing
- PostgreSQL source docs — replication slot details, Cloud SQL/AlloyDB specifics
- BigQuery sink docs — IAM, partitioning, DELETE handling