Troubleshooting
Diagnose and fix common nanosync issues — slot lag, connection errors, schema drift, checkpoint failures, and BigQuery errors.
Replication slot accumulating WAL
Symptom: Disk usage on the Postgres host keeps growing. pg_replication_slots shows a slot with growing lag or a wal_status of extended or unreserved.
Cause: nanosync stopped (crashed, restarted, or was paused) and the replication slot is retaining WAL until nanosync reconnects and acknowledges the backlog.
Fix:
-- Check slot lag
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag,
       active,
       wal_status
FROM pg_replication_slots
WHERE plugin = 'pgoutput';
If nanosync is running, the slot should drain automatically once it catches up. If nanosync is stopped and the lag is critical:
# Restart nanosync — it will resume from the saved checkpoint
nanosync start server --config nanosync.yaml
# Monitor catch-up progress
nanosync monitor
If you need to drop the slot (data loss — only if restarting from scratch):
SELECT pg_drop_replication_slot('nanosync_slot_<pipeline-name>');
Set slot_lag_alert_mb: 5000 in your pipeline config to receive warnings before the disk fills. See Configuration reference.
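For example (the exact placement of the key within the pipeline block is an assumption — check the Configuration reference for the authoritative location):

```yaml
pipelines:
  - name: orders-to-bigquery
    # Warn when the slot is retaining roughly 5 GB of WAL,
    # well before the disk fills.
    slot_lag_alert_mb: 5000
```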
Pipeline not resuming after restart
Symptom: After restarting nanosync, the pipeline starts a full snapshot instead of resuming from where it left off.
Cause: One of three things — the checkpoint file is missing, the replication slot was dropped, or the config pipeline name changed.
Diagnose:
# Check if the checkpoint exists
nanosync pipeline get <pipeline-name>
# Check if the slot still exists in Postgres
psql -c "SELECT slot_name, active FROM pg_replication_slots WHERE slot_name LIKE 'nanosync%';"
Fix: If the slot was dropped, nanosync will re-create it and run a full snapshot automatically. There is no manual recovery step — just let it run.
If the pipeline name changed in nanosync.yaml, nanosync treats it as a new pipeline. Keep pipeline names stable across restarts.
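Because the slot name is derived from the pipeline name (nanosync_slot_&lt;pipeline-name&gt;, as used above), renaming a pipeline points nanosync at a slot that does not exist yet. A minimal sketch of that mapping — the helper and the dash normalization are illustrative, not nanosync internals:

```python
def slot_name(pipeline: str) -> str:
    """Derive a replication slot name from a pipeline name.

    Postgres slot names allow only lowercase letters, digits, and
    underscores, so dashes are normalized (assumed behavior).
    """
    return "nanosync_slot_" + pipeline.lower().replace("-", "_")

# Renaming the pipeline yields a different slot, so nanosync finds
# no existing slot or checkpoint and starts a fresh snapshot.
print(slot_name("orders-to-bigquery"))  # nanosync_slot_orders_to_bigquery
print(slot_name("orders-to-bq-v2"))     # nanosync_slot_orders_to_bq_v2
```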
Connection refused / cannot connect to database
Symptom: nanosync test connection <name> fails with connection refused or dial tcp ... connect: connection refused.
Steps:
- Verify the host and port are reachable from where nanosync is running:
  nc -zv <host> 5432
- Check that sslmode is correct. Cloud databases (Cloud SQL, RDS) typically require sslmode=require:
  dsn: "postgres://user:pass@host:5432/db?sslmode=require"
- Check that the replication user has the REPLICATION attribute:
  SELECT usename, userepl FROM pg_user WHERE usename = 'nanosync';
- For Cloud SQL: ensure the Cloud SQL Auth Proxy is running, or that the instance allows connections from nanosync’s IP.
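The reachability check in the first step can also be scripted. This stdlib-only sketch (not part of nanosync) does what nc -zv does:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False

if __name__ == "__main__":
    # Replace with your database host; 5432 is the default Postgres port.
    print(tcp_reachable("localhost", 5432))
```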
Schema drift — pipeline stopped on column change
Symptom: Pipeline stopped with an error like schema drift detected: column 'email' dropped — manual resolution required.
Cause: A breaking schema change was detected (a column was dropped or renamed, or a type was narrowed). nanosync stops rather than risk silently corrupting data.
Fix:
- Check what changed:
  nanosync pipeline schema-diff <pipeline-name>
- If the change is intentional, approve it:
  nanosync pipeline approve-drift <pipeline-name>
- If using schema_drift_mode: auto in config, backward-compatible changes (column additions, type widenings) apply automatically. Only breaking changes require manual approval.
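The auto/manual split is a classification over change types. This sketch (names are illustrative, not nanosync's internals) encodes the rule described above:

```python
# Backward-compatible changes that schema_drift_mode: auto applies
# automatically.
AUTO_APPLY = {"column_added", "type_widened"}

# Breaking changes that stop the pipeline until approved.
NEEDS_APPROVAL = {"column_dropped", "column_renamed", "type_narrowed"}

def requires_manual_approval(change_type: str) -> bool:
    if change_type in AUTO_APPLY:
        return False
    if change_type in NEEDS_APPROVAL:
        return True
    # Unknown change types are treated conservatively: stop and ask.
    return True

print(requires_manual_approval("column_added"))    # False
print(requires_manual_approval("column_dropped"))  # True
```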
See Configuration reference for schema_drift_mode options.
BigQuery write errors
Quota exceeded
Symptom: Sink errors containing quota exceeded or rateLimitExceeded.
Fix: The Storage Write API has per-project quota. Check the BigQuery quotas page and request an increase if needed. You can also reduce throughput temporarily:
pipelines:
  - name: orders-to-bigquery
    sink:
      properties:
        batch_size: 500        # reduce from default 1000
        flush_interval_ms: 500 # slow down flush rate
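Besides lowering batch_size and flush_interval_ms, transient rateLimitExceeded errors are generally safe to retry with jittered exponential backoff. A generic sketch of that pattern (not nanosync or BigQuery client code):

```python
import random
import time

def retry_with_backoff(fn, *, attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying on exceptions with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # A real client would retry only retryable errors
            # (e.g. rateLimitExceeded), not every exception.
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```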
Permission denied on table creation
Symptom: Errors like permission denied: ... does not have bigquery.tables.create permission.
Fix: Ensure the service account has both roles/bigquery.dataEditor AND roles/bigquery.jobUser. dataEditor alone is not sufficient for schema operations.
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:nanosync@my-project.iam.gserviceaccount.com" \
--role="roles/bigquery.jobUser"
WAL level not set to logical
Symptom: nanosync test connection or pipeline start fails with wal_level is not logical.
Fix:
-- Check current setting
SHOW wal_level;
-- Fix: add to postgresql.conf and restart
wal_level = logical
For Cloud SQL: set cloudsql.logical_decoding = on via the GCP Console under Database flags, then restart the instance.
Pipeline stuck in snapshotting
Symptom: nanosync monitor shows a pipeline that has been in snapshotting state for longer than expected.
Diagnose:
# Check snapshot progress
nanosync monitor --pipeline <name>
# Check for errors in logs
journalctl -u nanosync -f
Large tables can take time. The snapshot throughput is shown in nanosync monitor. If it’s stuck at 0 rows/s, check:
- Network connectivity to the source database
- Whether the source database is under heavy load
- Whether BigQuery is accepting writes (nanosync test connection <sink-name>)
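If throughput is nonzero but slow, a rough ETA helps decide whether to wait or investigate. The inputs here are the figures you read off nanosync monitor (the helper itself is illustrative):

```python
def snapshot_eta_seconds(total_rows: int, copied_rows: int,
                         rows_per_sec: float) -> float:
    """Estimate remaining snapshot time from monitor readings."""
    if rows_per_sec <= 0:
        raise ValueError("0 rows/s means the snapshot is stalled, not slow")
    return (total_rows - copied_rows) / rows_per_sec

# e.g. 50M rows total, 20M copied, 25k rows/s -> 1200 s (20 min) remaining
print(snapshot_eta_seconds(50_000_000, 20_000_000, 25_000))  # 1200.0
```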
What’s next
- Configuration reference — all pipeline and sink options
- Observability — Prometheus metrics for lag, throughput, and errors
- PostgreSQL source — slot management and WAL configuration