Troubleshooting
Diagnose and fix common nanosync issues — slot lag, connection errors, schema drift, checkpoint failures, and BigQuery errors.
Replication slot accumulating WAL
Symptom: Disk usage on the Postgres host keeps growing. pg_replication_slots shows a slot with growing lag or a wal_status of extended or unreserved.
Cause: nanosync stopped (crashed, restarted, or was paused) and the replication slot is retaining WAL until nanosync reconnects and acknowledges the backlog.
Fix:
-- Check slot lag
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag,
       active,
       wal_status
FROM pg_replication_slots
WHERE plugin = 'pgoutput';
If nanosync is running, the slot should drain automatically once it catches up. If nanosync is stopped and the lag is critical:
# Restart nanosync — it will resume from the saved checkpoint
nanosync start server --config nanosync.yaml
# Monitor catch-up progress
nanosync monitor
If you need to drop the slot (data loss — only if restarting from scratch):
SELECT pg_drop_replication_slot('nanosync_slot_<pipeline-name>');
Set slot_lag_alert_mb: 5000 in your pipeline config to receive warnings before the disk fills. See Configuration reference.
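For example (the exact placement of the key within the pipeline block is an assumption — check the Configuration reference for the authoritative location):

```yaml
pipelines:
  - name: orders-to-bigquery
    # Warn when the slot is retaining roughly 5 GB of WAL,
    # well before the disk fills.
    slot_lag_alert_mb: 5000
```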
Pipeline not resuming after restart
Symptom: After restarting nanosync, the pipeline starts a full snapshot instead of resuming from where it left off.
Cause: One of three things — the checkpoint file is missing, the replication slot was dropped, or the config pipeline name changed.
Diagnose:
# Check if the checkpoint exists
nanosync pipeline get <pipeline-name>
# Check if the slot still exists in Postgres
psql -c "SELECT slot_name, active FROM pg_replication_slots WHERE slot_name LIKE 'nanosync%';"
Fix: If the slot was dropped, nanosync will re-create it and run a full snapshot automatically. There is no manual recovery step — just let it run.
If the pipeline name changed in nanosync.yaml, nanosync treats it as a new pipeline. Keep pipeline names stable across restarts.
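Because the slot name is derived from the pipeline name (nanosync_slot_&lt;pipeline-name&gt;, as used above), renaming a pipeline points nanosync at a slot that does not exist yet. A minimal sketch of that mapping — the helper and the dash normalization are illustrative, not nanosync internals:

```python
def slot_name(pipeline: str) -> str:
    """Derive a replication slot name from a pipeline name.

    Postgres slot names allow only lowercase letters, digits, and
    underscores, so dashes are normalized (assumed behavior).
    """
    return "nanosync_slot_" + pipeline.lower().replace("-", "_")

# Renaming the pipeline yields a different slot, so nanosync finds
# no existing slot or checkpoint and starts a fresh snapshot.
print(slot_name("orders-to-bigquery"))  # nanosync_slot_orders_to_bigquery
print(slot_name("orders-to-bq-v2"))     # nanosync_slot_orders_to_bq_v2
```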
Connection refused / cannot connect to database
Symptom: nanosync test connection <name> fails with connection refused or dial tcp ... connect: connection refused.
Steps:
- Verify the host and port are reachable from where nanosync is running:
  nc -zv <host> 5432
- Check that sslmode is correct. Cloud databases (Cloud SQL, RDS) typically require sslmode=require:
  dsn: "postgres://user:pass@host:5432/db?sslmode=require"
- Check that the replication user has the REPLICATION attribute:
  SELECT usename, userepl FROM pg_user WHERE usename = 'nanosync';
- For Cloud SQL: ensure the Cloud SQL Auth Proxy is running, or that the instance allows connections from nanosync’s IP.
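The reachability check in the first step can also be scripted. This stdlib-only sketch (not part of nanosync) does what nc -zv does:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False

if __name__ == "__main__":
    # Replace with your database host; 5432 is the default Postgres port.
    print(tcp_reachable("localhost", 5432))
```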
Schema drift — pipeline stopped on column change
Symptom: Pipeline stopped with an error like schema drift detected: column 'email' dropped — manual resolution required.
Cause: A breaking schema change was detected (a column was dropped or renamed, or a type was narrowed). nanosync stops rather than risk silently corrupting data.
Fix:
- Check what changed:
  nanosync pipeline schema-diff <pipeline-name>
- If the change is intentional, approve it:
  nanosync pipeline approve-drift <pipeline-name>
- If using schema_drift_mode: auto in config, backward-compatible changes (column additions, type widenings) apply automatically. Only breaking changes require manual approval.
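The auto/manual split is a classification over change types. This sketch (names are illustrative, not nanosync's internals) encodes the rule described above:

```python
# Backward-compatible changes that schema_drift_mode: auto applies
# automatically.
AUTO_APPLY = {"column_added", "type_widened"}

# Breaking changes that stop the pipeline until approved.
NEEDS_APPROVAL = {"column_dropped", "column_renamed", "type_narrowed"}

def requires_manual_approval(change_type: str) -> bool:
    if change_type in AUTO_APPLY:
        return False
    if change_type in NEEDS_APPROVAL:
        return True
    # Unknown change types are treated conservatively: stop and ask.
    return True

print(requires_manual_approval("column_added"))    # False
print(requires_manual_approval("column_dropped"))  # True
```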
See Configuration reference for schema_drift_mode options.
BigQuery write errors
Quota exceeded
Symptom: Sink errors containing quota exceeded or rateLimitExceeded.
Fix: The Storage Write API has per-project quota. Check the BigQuery quotas page and request an increase if needed. You can also reduce throughput temporarily:
pipelines:
  - name: orders-to-bigquery
    sink:
      properties:
        batch_size: 500        # reduce from default 1000
        flush_interval_ms: 500 # slow down flush rate
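Besides lowering batch_size and flush_interval_ms, transient rateLimitExceeded errors are generally safe to retry with jittered exponential backoff. A generic sketch of that pattern (not nanosync or BigQuery client code):

```python
import random
import time

def retry_with_backoff(fn, *, attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying on exceptions with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # A real client would retry only retryable errors
            # (e.g. rateLimitExceeded), not every exception.
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```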
Permission denied on table creation
Symptom: Errors like permission denied: ... does not have bigquery.tables.create permission.
Fix: Ensure the service account has both roles/bigquery.dataEditor AND roles/bigquery.jobUser. dataEditor alone is not sufficient for schema operations.
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:nanosync@my-project.iam.gserviceaccount.com" \
--role="roles/bigquery.jobUser"
WAL level not set to logical
Symptom: nanosync test connection or pipeline start fails with wal_level is not logical.
Fix:
-- Check current setting
SHOW wal_level;
-- Fix: add to postgresql.conf and restart
wal_level = logical
For Cloud SQL: set cloudsql.logical_decoding = on via the GCP Console under Database flags, then restart the instance.
Pipeline stuck in snapshotting
Symptom: nanosync monitor shows a pipeline that has been in snapshotting state for longer than expected.
Diagnose:
# Check snapshot progress
nanosync monitor --pipeline <name>
# Check for errors in logs
journalctl -u nanosync -f
Large tables can take time. The snapshot throughput is shown in nanosync monitor. If it’s stuck at 0 rows/s, check:
- Network connectivity to the source database
- Whether the source database is under heavy load
- Whether BigQuery is accepting writes (nanosync test connection <sink-name>)
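If throughput is nonzero but slow, a rough ETA helps decide whether to wait or investigate. The inputs here are the figures you read off nanosync monitor (the helper itself is illustrative):

```python
def snapshot_eta_seconds(total_rows: int, copied_rows: int,
                         rows_per_sec: float) -> float:
    """Estimate remaining snapshot time from monitor readings."""
    if rows_per_sec <= 0:
        raise ValueError("0 rows/s means the snapshot is stalled, not slow")
    return (total_rows - copied_rows) / rows_per_sec

# e.g. 50M rows total, 20M copied, 25k rows/s -> 1200 s (20 min) remaining
print(snapshot_eta_seconds(50_000_000, 20_000_000, 25_000))  # 1200.0
```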
What’s next
- Configuration reference — all pipeline and sink options
- Observability — Prometheus metrics for lag, throughput, and errors
- PostgreSQL source — slot management and WAL configuration