Replicate a Postgres Database to BigQuery
End-to-end guide to setting up a live Postgres → BigQuery pipeline with nanosync. CDC prerequisites, IAM, config, monitoring, and production considerations.
What you’ll build
A live Postgres → BigQuery pipeline that snapshots existing rows, then streams inserts, updates, and deletes in real time. After setup, new commits in Postgres appear in BigQuery within seconds.
Prerequisites
Before starting, complete the setup steps in PostgreSQL source prerequisites and BigQuery IAM setup.
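Before going further, it is also worth confirming logical decoding is enabled on the source, since CDC depends on it. A quick check (standard Postgres, nothing nanosync-specific):
psql -d mydb -c "SHOW wal_level"
#  wal_level
# -----------
#  logical
If this prints replica or minimal, set wal_level = logical in postgresql.conf and restart Postgres before continuing.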
Step 1 — Set up BigQuery
# Enable the API
gcloud services enable bigquery.googleapis.com --project=my-project
# Create the destination dataset
bq mk --dataset --location=US my-project:replication
# Create a service account and grant access
gcloud iam service-accounts create nanosync --project=my-project
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:nanosync@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:nanosync@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"
# Download the key
gcloud iam service-accounts keys create nanosync-bq-key.json \
  --iam-account=nanosync@my-project.iam.gserviceaccount.com
On GKE, skip the key — Workload Identity works automatically. Omit credentials_file from the config below.
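On GKE, the BigQuery connection block from Step 2 would then simply drop the key path (a sketch, assuming nanosync falls back to the ambient Google credentials when credentials_file is absent):
- name: my-bigquery
  type: bigquery
  properties:
    project_id: my-project
    dataset_id: replication
    # no credentials_file: Workload Identity supplies credentials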
Step 2 — Create the config file
Create nanosync.yaml:
connections:
  - name: my-postgres
    type: postgres
    dsn: "postgres://nanosync:${env:PG_PASSWORD}@localhost:5432/mydb?sslmode=require"
  - name: my-bigquery
    type: bigquery
    properties:
      project_id: my-project
      dataset_id: replication
      credentials_file: /path/to/nanosync-bq-key.json

pipelines:
  - name: orders-to-bigquery
    source:
      connection: my-postgres
      tables:
        - public.orders
        - public.order_items
    sink:
      connection: my-bigquery
      properties:
        table_id: orders
        primary_keys: id
Secrets never go in the config file. Use ${env:VAR_NAME} — values are injected from environment variables at startup.
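A convenient pattern is to keep secrets in an untracked env file and source it before running any nanosync command (the file name here is arbitrary):
# secrets.env (add to .gitignore)
export PG_PASSWORD=my-secret

source secrets.env   # then run nanosync in this shell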
Step 3 — Test your connections
Before starting the pipeline, verify both connections are reachable:
export PG_PASSWORD=my-secret
nanosync test connection my-postgres
# ✓ my-postgres connected in 12 ms
nanosync test connection my-bigquery
# ✓ my-bigquery connected in 34 ms
If either fails, the error tells you exactly what’s wrong:
✗ my-postgres: connection refused — dial tcp 127.0.0.1:5432: connect: connection refused
Fix the issue before continuing — don’t waste time debugging a running pipeline.
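For the connection-refused case above, pg_isready (shipped with the standard PostgreSQL client tools) is a fast way to confirm the server is actually listening before re-running the test:
pg_isready -h localhost -p 5432
# localhost:5432 - accepting connections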
Step 4 — Start the server
nanosync start dev --config nanosync.yaml
nanosync serving — Ctrl-C to stop
REST API: http://localhost:7600/v1
Web UI: http://localhost:7600/app
The pipeline starts automatically. The initial snapshot begins immediately — nanosync reads all rows from the configured tables in parallel, then switches to streaming live changes.
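You can also query the REST API directly while the server runs. The route below is illustrative only, not a confirmed endpoint; check the CLI reference for the actual paths:
curl http://localhost:7600/v1/pipelines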
Step 5 — Watch it work
nanosync monitor
A full-screen live dashboard opens in your terminal:
NAME                 SOURCE       TARGET       STATUS          LAG    EV/S
orders-to-bigquery   my-postgres  my-bigquery  ● snapshotting  —      —
  [████████░░░░] 67% 2,400,000 rows
Once the snapshot completes, the status flips to live CDC:
orders-to-bigquery   my-postgres  my-bigquery  ● live CDC      12ms   4,231
Press Enter to drill into table-level breakdown. Press w for worker fleet view. Press q to quit.
Step 6 — Kill it and restart
This is the part worth seeing. Press Ctrl-C to stop the server, then restart it:
nanosync start dev --config nanosync.yaml
INF pipeline resuming name=orders-to-bigquery lsn=0/3A1B2C4D
INF cdc streaming lsn=0/3A1B2C4D
No re-snapshot. No manual recovery. The pipeline resumed from the last committed WAL offset — exactly where it left off. That’s the checkpoint.
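You can confirm the same checkpoint from the Postgres side. The replication slot records how far the consumer has acknowledged (a standard catalog view, nothing nanosync-specific), and confirmed_flush_lsn should line up with the lsn in the resume log:
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots;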
Production considerations
Slot lag monitoring
Replication slots retain WAL until consumed by nanosync. If nanosync stops for an extended period, unprocessed WAL accumulates and can fill your disk. Monitor slot lag with:
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE active = false;
Set slot_lag_alert_mb: 5000 in your pipeline config to receive warnings before lag becomes a problem.
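As a sketch of where that option might live, assuming it sits with the pipeline's source settings (the Configuration reference has the authoritative placement):
source:
  connection: my-postgres
  properties:
    slot_lag_alert_mb: 5000  # warn when retained WAL exceeds ~5 GB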
Schema drift
If you add a column to the Postgres table, nanosync will automatically add the column to the BigQuery table on the next batch. If you rename or drop a column, nanosync will stop and alert — manual resolution is required. See Configuration reference for the schema_drift_mode option.
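As a concrete illustration (the column names are made up for the example):
-- Additive: nanosync adds the column to BigQuery on the next batch
ALTER TABLE public.orders ADD COLUMN coupon_code text;

-- Breaking: nanosync stops and alerts; resolve manually
ALTER TABLE public.orders RENAME COLUMN coupon_code TO promo_code;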
How to know it’s working
Verify the pipeline is healthy and rows are in sync:
# Check lag is low
nanosync monitor --pipeline orders-to-bigquery
# Verify rows in BigQuery
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM `my-project.replication.orders`'
# Compare with source
psql -d mydb -c "SELECT COUNT(*) FROM public.orders"
Both counts should match within a few seconds of lag during active writes.
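To turn that into a single check, a small shell helper works (a sketch, assuming psql and bq are already authenticated against the right database and project):
src=$(psql -d mydb -tA -c "SELECT COUNT(*) FROM public.orders")
dst=$(bq query --use_legacy_sql=false --format=csv \
  'SELECT COUNT(*) FROM `my-project.replication.orders`' | tail -n 1)
if [ "$src" = "$dst" ]; then
  echo "in sync: $src rows"
else
  echo "drift: source=$src sink=$dst"
fi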
What’s next
- Configuration reference — all YAML pipeline options
- CLI reference — full command listing
- PostgreSQL source docs — replication slot details, Cloud SQL/AlloyDB specifics
- BigQuery sink docs — IAM, partitioning, DELETE handling