Blog outage: A Post-Mortem
Tagged: learning rants
While enjoying Christmas dinner with the family, I received the most dreaded message: a request for IT support. Not just any request, but one about a family member's PHP 5 site I'm hosting on my VPS. "Weird," I thought, since the site had been chugging along fine for over a year. I couldn't service the request because I had left my laptop at home, on a different continent, which bought me a few days of calm.
Once home, the first thing I do is SSH into the box to check a few things. But my connection times out without even logging me in. I get a funny feeling in my stomach. Pointing the browser to this blog or the photolog times out as well. Crap. Probably some sort of outage. Let's open a ticket with SparkVPS.
Their site is down too? What the hell? Googling around brings up "something something deadpooling something". What's deadpooling? It's when a VPS (or VPN, etc.) provider sells you months/years of a service, then disappears with the money and the computing capacity. I remember being happy as a clam at getting a nice, cheap hosting deal just a few months earlier. When did this happen? Early December? My shit's been down for weeks and I never noticed?!
Oh. I turned off the HTTP/ping monitoring and alerting in September for an experiment and forgot to turn it back on. Very humbling.
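For the curious, here is a minimal sketch of the kind of HTTP check that would have caught this. It assumes Python 3 and nothing outside the standard library. The function names and the URLs mentioned afterwards are made up for illustration, and this is not necessarily how the original monitoring worked.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10.0):
    """Return (ok, detail). ok is True only if the site answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections and timeouts
        # (urllib.error.URLError is a subclass of OSError).
        return False, f"unreachable: {exc}"


def failing_sites(urls, timeout=10.0):
    """Check every URL and return {url: detail} for the ones that are down."""
    failures = {}
    for url in urls:
        ok, detail = check_url(url, timeout)
        if not ok:
            failures[url] = detail
    return failures
```

Calling something like `failing_sites(["https://blog.example.com/", "https://photos.example.com/"])` from a cron job and sending a mail whenever the result is non-empty would be enough. The important part is that the job runs on a machine other than the VPS being watched, since a box that has vanished cannot report its own disappearance.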
Finding a new provider, this time with a physical address in the "about" section, took around an hour. Thankfully I've kept everything simple: my blog is just a bunch of HTML files and the photolog a simple, SQLite-backed Django application. Another hour and I have those back up, with most of the time spent making sure I'm putting things in the right place. I'll leave automating this with something like Ansible for another day. Now, the relative's PHP 5 site. php5-fpm? I don't even want to know what that is. This time around, I will do the right thing and dockerize it. Future me will be thankful next time something breaks.
This must be the fifth time I've moved my Internet home since 2013 and the first time I've been forced to do it. Still, it's amazing how well everything runs 99.9% of the time. It's also a good reminder that even the simplest deployment can fail due to human error.