Kubernetes: off-site PostgreSQL backups

5 April 2020

It's been a long time since I managed a PostgreSQL cluster; these days I prefer to use a managed service, which almost every cloud provider offers. One of the benefits of a managed database service is that backups are automated and are the responsibility of the service provider. However, these backups are usually not exposed to you and you can't take them with you to another provider, which leads to vendor lock-in and leaves you exposed should your provider have a prolonged problem. To solve this I also run my own automated backups on a less frequent schedule than the vendor's. If I need to restore from a backup within the same cloud I can use the provider's backups with minimal data loss. If there is a prolonged outage, or some other reason to switch providers without being able to access the original, running database, I can use my own logical backups to spin up a new database myself or with a different provider.

You could also send the logical backups to storage in the same cloud provider, but for me that defeats the purpose. I send my backups to a bucket on Google Cloud Storage. I use the standard pg_dump tool to make the logical backups, so I maintain a Docker image that installs Google's gsutil program into the official Postgres image. You can find the source on GitHub and the image on Docker Hub.
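
If you don't already have a destination bucket you can create one with gsutil mb; this is just a sketch, and the bucket name, location, and storage class are placeholders to adjust for your own project:

gsutil mb -l europe-west1 -c nearline gs://YOUR-PG-BACKUP-BUCKET/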

I also like to encrypt the backups client-side using symmetric encryption (i.e., a passphrase) with gnupg. You'll want to load the passphrase into your cluster's secrets ahead of time. For backups to Google Cloud Storage (if you're not running on GKE) you'll also need to create a service account and load its configuration JSON file into the secrets as well. Finally, my application is set up to consume the Postgres connection information as a URI, so you'll also need that in the secret store. I run the backup cronjob in the same namespace as each application whose database should be backed up, but this is not strictly required.

Given the service account JSON in a file backups.json and the passphrase in a file backup.key, you can create the two secrets:

kubectl -n app create secret generic backups --from-file=backups.json
kubectl -n app create secret generic backup --from-file=backup.key
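
The cronjob below also expects the Postgres connection URI in a secret named postgres under the key postgres; as a sketch (the connection string here is just a placeholder), you could create it with:

kubectl -n app create secret generic postgres --from-literal=postgres='postgres://user:password@db-host:5432/app'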

Here's the cronjob configuration:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: pgbackup
  namespace: app
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: pgbackup
              image: mfinelli/pgbackup
              command: [/bin/sh]
              args:
                - -c
                - >-
                  gcloud auth activate-service-account
                  --key-file $GOOGLE_APPLICATION_CREDENTIALS &&
                  pg_dump $DATABASE_URL --clean --create --format=plain
                  --column-inserts --if-exists --quote-all-identifiers |
                  gzip --best |
                  gpg --armor --cipher-algo AES256 --symmetric --yes --batch
                  --passphrase="$BACKUP_PASSPHRASE" |
                  gsutil cp -a project-private -
                  gs://YOUR-PG-BACKUP-BUCKET/$(date +%Y%m%d%H%M%S).sql.gz.asc
              env:
                - name: GOOGLE_APPLICATION_CREDENTIALS
                  value: /root/backups.json
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: postgres
                      key: postgres
                - name: BACKUP_PASSPHRASE
                  valueFrom:
                    secretKeyRef:
                      name: backup
                      key: backup.key
              volumeMounts:
                - mountPath: /root/backups.json
                  subPath: backups.json
                  name: backups
          restartPolicy: Never
          volumes:
            - name: backups
              secret:
                secretName: backups

The YAML syntax makes it a little hard to see exactly what command we're running, so here it is as if we were going to run it from the command line:

gcloud auth activate-service-account \
  --key-file $GOOGLE_APPLICATION_CREDENTIALS && \
pg_dump $DATABASE_URL --clean --create --format=plain --column-inserts \
  --if-exists --quote-all-identifiers | \
gzip --best | \
gpg --armor --cipher-algo AES256 --symmetric --yes --batch \
  --passphrase="$BACKUP_PASSPHRASE" | \
gsutil cp -a project-private - \
  gs://YOUR-PG-BACKUP-BUCKET/$(date +%Y%m%d%H%M%S).sql.gz.asc

The first thing we do is authorize the gsutil program with our service account credentials. Since the container is spun up fresh on each invocation there is no shared state, so this needs to be done every time. Then we do a standard pg_dump with all of our desired options, pipe it to gzip to compress it, pipe that to gpg to encrypt it as mentioned before, and finally pipe it to gsutil, which streams it directly from stdin to the storage bucket.
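
To try it out without waiting for the schedule you can apply the manifest and trigger a one-off job from the cronjob; the manifest filename and job name here are arbitrary:

kubectl apply -f pgbackup.yaml
kubectl -n app create job pgbackup-test --from=cronjob/pgbackup
kubectl -n app logs job/pgbackup-test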

And that's it! Now you'll get nightly logical dumps of your database. Should it be necessary you could download one of the backups and play it into a new database:

gpg -d --batch --passphrase="$(cat backup.key)" ./path/to/file.sql.gz.asc | \
gunzip | psql ...
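
To pull the backup file down from the bucket in the first place you can list it and copy the object you want with gsutil; the object name below is just an example of the timestamped naming scheme:

gsutil ls gs://YOUR-PG-BACKUP-BUCKET/
gsutil cp gs://YOUR-PG-BACKUP-BUCKET/20200405060000.sql.gz.asc .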