Automated rollback using Prometheus and GitHub Actions

For an application with tight service level objectives, you might want to automate the mitigation process, as the bible of SRE states:

... insert citation

For one of our systems at ninetailed.io, we were still manually deploying one service that required a fast rollback in case anything went wrong. With an SLO of 99.99% availability, a quick response to an increase in error rate after a deployment was necessary.
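
To make the availability target concrete, here is a minimal sketch of the error budget arithmetic. The 30-day window and the function name are assumptions for illustration; the article does not say over which period the SLO is measured.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLO.

    window_days is an assumed measurement window, not a value from the article.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


# 99.99% over an assumed 30-day window leaves about 4.32 minutes of budget.
print(round(allowed_downtime_minutes(99.99), 2))
```

Under that assumption, a single bad deployment that takes a human a few minutes to notice and revert could consume the whole budget for the period.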

To move the deployment to an automated pipeline and achieve continuous deployment, we needed a good strategy to ensure high availability.

Contrary to common belief, the goal of DevOps is not to automatically create outages in production [TODO: find the citation]!

Examining solutions

Even though we have all kinds of tests and a QA environment we deploy to, simply deploying to production was off the table, as a complete outage would take some time to fix. It would require an alert to trigger, someone to notice the error, and then a revert or a local deployment to roll back as soon as possible. That process was error-prone and far from ideal.

Hope is not a strategy, however.

Blue/Green deployment & Canaries

Probably the MVP of modern controlled deployment is the canary release combined with blue/green deployments. It keeps the impact on users low in case of failure and makes rollback easy, since it is just a switch.

Sadly, we had data constraints preventing us from having two exactly replicated environments. For the most curious of you, the stack is Cloudflare Workers and Durable Objects.

We needed another way to control the deployment.

Check for fire

Without going as far as smoke tests, since we don't want to run tests on production data, we opted for the "check if there is fire" strategy.

First, we deploy from the pipeline on merge to main. Once deployed, we run several checks over a few minutes using the Prometheus API and a Prometheus recorded metric: endpoint:http_requests:error_rate5m. If we notice the error rate rising above a defined threshold (1%), we automatically trigger a rollback.
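
The check described above could be sketched roughly as follows. This is not our actual pipeline code, only an illustration under stated assumptions: the function names, the number of checks, and the polling interval are hypothetical, and the recorded metric is assumed to be a ratio (0.01 meaning 1%). The only external interface used is the Prometheus HTTP API instant query endpoint, /api/v1/query.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, List

METRIC = "endpoint:http_requests:error_rate5m"  # recorded metric named in the article
THRESHOLD = 0.01  # 1%, assuming the metric is expressed as a ratio


def fetch_error_rates(base_url: str, query: str = METRIC) -> List[float]:
    """Run an instant query against Prometheus and return one value per series."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "0.002"]}
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def should_rollback(
    fetch: Callable[[], List[float]],
    checks: int = 10,            # hypothetical: number of polls after a deployment
    interval_s: float = 30.0,    # hypothetical: pause between polls
    threshold: float = THRESHOLD,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the error rate; return True as soon as any series exceeds the threshold."""
    for attempt in range(checks):
        if any(rate > threshold for rate in fetch()):
            return True
        if attempt < checks - 1:
            sleep(interval_s)
    return False
```

A pipeline step could call should_rollback(lambda: fetch_error_rates(prometheus_url)) and exit with a non-zero status when it returns True, so that a later step performs the rollback. Passing fetch and sleep as parameters keeps the decision logic testable without a live Prometheus server.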

💡 Note 1: If you deploy to Kubernetes with a tool like Helm, you can use Helm Tests (link) to run a job that checks whether a deployment went well and, if it did not, roll back automatically.

💡 Note 2: If you deploy to Kubernetes, I highly suggest checking out Flagger or Argo Rollouts to achieve canary deployments.