Sometimes when you do everything right, things still go wrong. I previously talked about how bad I am at spelling and grammar in “The Four Year Typo,” which reminded me of my first major production failure at Heroku.
Here’s the setup. We have a service that builds apps. You don’t need to know this, but it’s called “codon.” This is the service that runs the buildpacks, such as the one I currently maintain, the Heroku Ruby Buildpack. Believe it or not, when I started, we had no production monitoring of build failures. If we so much as hiccup, Twitter tends to catch on fire and our support tickets come in like a tsunami, so there wasn’t a huge need. However, the faster we find out about failures, the faster we can fix them and the fewer people are affected. Also, when you’re deploying as many apps as we are, a one-in-a-million bug occurs a non-trivial number of times. So, we really have to be on top of things. One day I made a change to a buildpack that exposed one of those one-in-a-million bugs in Bundler, but because it wasn’t a major system meltdown, we didn’t really hear anything about it. While that is a bug, it’s not the one I’m writing about.