11/13/2022
Last night, we had prolonged downtime of about 12 hours. Here's what happened, and what we plan to do better.
We use Flux to deploy changes to our Kubernetes cluster. Changes are tested, pushed to Git, and eventually released to production. We were also working on a staging environment that closely matches production, which would add another layer of validation. Flux is widely used for deployment and generally rock-solid.
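For context, a change flows from a Git push to a Flux reconciliation. The Kustomization name below is just an illustration, not our actual configuration:

```sh
# Illustrative: "apps" is a placeholder Kustomization name.
# A change lands in Git and Flux picks it up on its next sync interval,
# or we can nudge it manually:
flux reconcile kustomization apps --with-source

# And check what Flux is currently managing:
flux get kustomizations
```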
But there is one component that we don't manage through this mechanism, and that is Flux itself. To install Flux, we use a Helm chart provided by the Flux community. Yesterday, I updated that Helm chart to the latest version, but I made one fatal mistake: I installed the wrong chart. I installed the chart from fluxcd/flux, but I should've installed flux-community/flux2.
I installed the wrong version of Flux, a version that does not work with our setup. No big deal, I thought, so I uninstalled the incorrect version and reinstalled the correct one. I genuinely had no idea of the catastrophic consequences of my actions.
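Roughly, the sequence of commands looked like this. This is a reconstruction for illustration; the release names, namespace, and repo aliases are examples rather than our exact setup:

```sh
# Illustration only; release names, namespace, and repo aliases are examples.

# The mistake: installing the legacy Flux v1 chart
helm repo add fluxcd https://charts.fluxcd.io
helm upgrade --install flux fluxcd/flux --namespace flux-system

# The "fix": removing it again and installing the chart we actually use
helm uninstall flux --namespace flux-system
helm repo add flux-community https://fluxcd-community.github.io/helm-charts
helm upgrade --install flux2 flux-community/flux2 --namespace flux-system
```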
This is when everything went wrong. Our cluster setup is managed as a custom resource by Flux. When you delete a resource in Kubernetes, it usually also deletes all children of that resource. Somehow, installing the wrong version of Flux removed all flux2 custom resources, which triggered a fatal chain of events: all our apps, volumes, and databases were suddenly being deleted as well. Everything in our production namespace was being terminated.
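You can see these parent-child links on any managed object through its owner references. The resource names below are examples, not our actual workloads:

```sh
# Example names only.
# A child object records its parent under metadata.ownerReferences:
kubectl get replicaset -n production -o jsonpath='{.items[0].metadata.ownerReferences}'

# Deleting the parent lets the garbage collector delete the children too,
# unless they are explicitly orphaned:
kubectl delete deployment my-app -n production --cascade=orphan
```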
This is not supposed to happen. Helm can install custom resource definitions, but it does not upgrade or remove them, because these actions can cause exactly what happened to us. I suspect that a custom installation hook in the original flux chart is responsible, but I'm not sure yet.
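Helm 3 only installs CRDs that a chart ships in its crds/ directory and then leaves them alone on upgrade and uninstall; CRDs rendered as regular templates or applied through hooks don't get that protection. Two ways to check, sketched here with the chart names from above:

```sh
# CRDs shipped in the chart's crds/ directory (Helm will not touch these
# on upgrade or uninstall):
helm show crds flux-community/flux2

# CRDs or jobs installed via hooks are managed like any other resource;
# rendering the chart and searching for hook annotations can reveal them:
helm template flux fluxcd/flux | grep -B 3 "helm.sh/hook"
```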
We had a backup strategy from the start, but as we only launched a month ago, it was still being improved:
Our databases perform full backups every night and also archive their write-ahead log (WAL), a continuous stream of changes shipped to a backup location, which enables us to recover a database with next to no data loss.
Our object storage is replicated to multiple datacenters and uses versioning to protect against accidental deletion.
Our block storage is snapshotted by Velero every night and the snapshot is stored on object storage. Unfortunately, this was not actually fully set up yet, which we had to learn the hard way.
Velero also backs up our resource manifests every night (a rough sketch of those nightly jobs follows below).
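The schedule name, cron expression, and namespace here are illustrative rather than our exact configuration:

```sh
# Illustrative names and schedule.
# Nightly backup of the production namespace, including volume snapshots
# and the resource manifests:
velero schedule create nightly-production \
  --schedule="0 3 * * *" \
  --include-namespaces production \
  --snapshot-volumes \
  --ttl 720h

# List the backups that have actually been produced:
velero backup get
```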
The good news in all of this is that none of our object storage got deleted; most data and backups were completely safe. What was gone, however, were our cluster resources and volumes. Both were meant to be backed up by Velero, so I fired up the Velero CLI and attempted a restore. Except there was one major issue: Velero requires its in-cluster backup resources to be able to restore, and since everything got deleted, it was unusable. There was no way to recover our existing cluster, so I launched a new cluster and began manual restoration there.
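For the curious, the restore path looks roughly like this; the backup name is illustrative:

```sh
# A restore references a Backup resource inside the cluster:
velero backup get

# Normally this would have been enough:
velero restore create --from-backup nightly-production-20221112

# In our case the Backup resources were deleted along with everything else,
# so there was nothing left for a restore to reference.
```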
The first step was to restore our databases from the WALs. I had never done this before, so it took some time to figure out. Once I did, I was able to recreate most apps with all data intact.
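The general shape of that restore is a base backup plus WAL replay, sketched here with PostgreSQL and wal-g as the archiver; the paths and the tool choice are examples rather than our exact setup:

```sh
# Sketch only: assumes PostgreSQL 12+ with wal-g as the WAL archiver;
# paths and tooling are examples.

# 1. Fetch the most recent base backup into an empty data directory:
wal-g backup-fetch /var/lib/postgresql/data LATEST

# 2. Tell PostgreSQL how to pull archived WAL segments during recovery,
#    then request recovery on next start:
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /var/lib/postgresql/data/postgresql.conf
touch /var/lib/postgresql/data/recovery.signal

# 3. Start the server; it replays the WAL stream up to the last archived
#    segment before accepting connections:
pg_ctl -D /var/lib/postgresql/data start
```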
There was one exception: I was not able to restore the disk of one Gitea instance due to the lack of volume backups.
This experience was very stressful for me. It was already late when this happened, and I stayed up all night to fix it. Unfortunately, we did lose one instance, but I'm glad I at least managed to restore all the other apps. Losing any more might have meant the end of this service, just one month into its life. Trust is hard to build, and even harder to rebuild. I am very sorry for all of this, but I suppose events like this are unavoidable; even the biggest companies go down occasionally. I just never expected it to happen this early, and through such a tiny error.
I disabled app creation earlier to allow myself some much-needed rest. As of right now, I am turning it back on, but only for apps that do not use block storage. Over the next few days, I will be taking steps to make sure this can never happen again, including fixing volume backups and improving our monitoring.
This also shows how important it is that we give you an easy way to export your data on a regular schedule. This is a feature I had planned from the start, and I'll increase its priority as a result of this event.