Kubernetes: off-site PostgreSQL backups

5 April 2020

It's been a long time since I managed a PostgreSQL cluster; these days I prefer to use a managed service, which almost every cloud provider offers. One of the benefits of a managed database service is that backups are automated and are the responsibility of the service provider. However, these backups are usually not exposed to you and you can't take them with you to another provider, which leads to vendor lock-in and leaves you exposed should your provider have a prolonged problem. To solve this I also run my own automated backups on a less frequent schedule than the vendor's. If I need to restore from a backup within the same cloud I can use the provider's backups with minimal data loss. If there is a prolonged outage, or some other reason to switch providers without being able to access the original, running database, I can use my own logical backups to spin up a new database myself or with a different provider.

You could also send the logical backups to storage in the same cloud provider, but for me that defeats the purpose. I send my backups to a bucket on Google Cloud Storage. I use the standard pg_dump tool to make the logical backups, so I maintain a Docker image that installs Google's gsutil program into the official Postgres image. You can find the source on GitHub and the image on Docker Hub.
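
If you don't already have a destination bucket you can create one with gsutil mb; this is just a sketch, and the bucket name, location, and storage class are placeholders to adjust for your own project:

gsutil mb -l europe-west1 -c nearline gs://YOUR-PG-BACKUP-BUCKET/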

I also like to encrypt the backups client-side using symmetric encryption (i.e., a passphrase) with gnupg. You'll want to load the passphrase into your cluster's secrets ahead of time. For backups to Google Cloud Storage (if you're not running on GKE) you'll also need to create a service account and load its configuration JSON file into the secrets as well. Finally, my application is set up to consume the Postgres connection information as a URI, so you'll also need that in the secret store. I run the backup cronjob in the same namespace as each application whose database should be backed up, but this is not strictly required.

Given the service account JSON in a file backups.json and the passphrase in a file backup.key, you can create the two secrets:

kubectl -n app create secret generic backups --from-file=backups.json
kubectl -n app create secret generic backup --from-file=backup.key
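
The cronjob below also expects the Postgres connection URI in a secret named postgres under the key postgres; as a sketch (the connection string here is just a placeholder), you could create it with:

kubectl -n app create secret generic postgres --from-literal=postgres='postgres://user:password@db-host:5432/app'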

Here's the cronjob configuration:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: pgbackup
  namespace: app
spec:
  schedule: "0 6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: pgbackup
              image: mfinelli/pgbackup
              command: [/bin/sh]
              args:
                - -c
                - >-
                  gcloud auth activate-service-account
                  --key-file $GOOGLE_APPLICATION_CREDENTIALS &&
                  pg_dump $DATABASE_URL --clean --create --format=plain
                  --column-inserts --if-exists --quote-all-identifiers |
                  gzip --best |
                  gpg --armor --cipher-algo AES256 --symmetric --yes --batch
                  --passphrase="$BACKUP_PASSPHRASE" |
                  gsutil cp -a project-private -
                  gs://YOUR-PG-BACKUP-BUCKET/$(date +%Y%m%d%H%M%S).sql.gz.asc
              env:
                - name: GOOGLE_APPLICATION_CREDENTIALS
                  value: /root/backups.json
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: postgres
                      key: postgres
                - name: BACKUP_PASSPHRASE
                  valueFrom:
                    secretKeyRef:
                      name: backup
                      key: backup.key
              volumeMounts:
                - mountPath: /root/backups.json
                  subPath: backups.json
                  name: backups
          restartPolicy: Never
          volumes:
            - name: backups
              secret:
                secretName: backups

The YAML syntax makes it a little hard to see exactly what command we're running, so here it is as if we were going to run it from the command line:

gcloud auth activate-service-account \
  --key-file $GOOGLE_APPLICATION_CREDENTIALS && \
pg_dump $DATABASE_URL --clean --create --format=plain --column-inserts \
  --if-exists --quote-all-identifiers | \
gzip --best | \
gpg --armor --cipher-algo AES256 --symmetric --yes --batch \
  --passphrase="$BACKUP_PASSPHRASE" | \
gsutil cp -a project-private - \
  gs://YOUR-PG-BACKUP-BUCKET/$(date +%Y%m%d%H%M%S).sql.gz.asc

The first thing we do is authorize the gsutil program with our service account credentials. Since the container is spun up fresh on each invocation there is no shared state, so this needs to be done every time. Then we do a standard pg_dump with all of our desired options, pipe it to gzip to compress it, pipe that to gpg to encrypt it as mentioned before, and finally pipe it to gsutil, which streams it directly from stdin to the storage bucket.
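
To try it out without waiting for the schedule you can apply the manifest and trigger a one-off job from the cronjob; the manifest filename and job name here are arbitrary:

kubectl apply -f pgbackup.yaml
kubectl -n app create job pgbackup-test --from=cronjob/pgbackup
kubectl -n app logs job/pgbackup-test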

And that's it! Now you'll get nightly logical dumps of your database. Should it be necessary you could download one of the backups and play it into a new database:

gpg -d --batch --passphrase="$(cat backup.key)" ./path/to/file.sql.gz.asc | \
gunzip | psql ...
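
To pull the backup file down from the bucket in the first place you can list it and copy the object you want with gsutil; the object name below is just an example of the timestamped naming scheme:

gsutil ls gs://YOUR-PG-BACKUP-BUCKET/
gsutil cp gs://YOUR-PG-BACKUP-BUCKET/20200405060000.sql.gz.asc .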