Today’s Site Outage: My After-Action Report

This afternoon, this site had an outage. From what I can tell from my Google Analytics data, the outage started around 4 pm EDT and lasted for a little over an hour. I called the host to report the problem, and also pinged them on Twitter. It turned out to be a known issue, and by 5:15 the site was up and running again.

Let me apologize for the outage. I get annoyed when I go to a site and it spins and spins and eventually spits out an error, and I imagine I’m not the only one who feels this way.

A brief review of outages on this site

The previous outage I had on the site was back on April 24, not that long ago, relatively speaking. That outage was shorter than this one, lasting somewhere around 25 minutes.

Prior to that, the last outage I had was on September 10, 2012, back when the host came under a DDoS attack. That one lasted 4 hours.

A total of about 6 hours of downtime over a span of more than three years is a pretty good track record overall, well above the 99.9% uptime commitment of my hosting service. This is far better than what I had with my previous hosting service, and makes me glad that I made the switch. Problems have been very few and far between.
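For anyone who wants to check the arithmetic, here is a minimal sketch of the uptime calculation. It assumes the period is exactly three 365-day years (the real span is a bit longer, which only helps the number), and the function names are mine, purely for illustration.

```python
# Rough uptime arithmetic for the outage figures quoted in this post.
# Assumption: "more than three years" is rounded down to three 365-day years.

HOURS_PER_YEAR = 365 * 24


def uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Uptime as a percentage of the whole period."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent: float, period_hours: float) -> float:
    """Hours of downtime a given uptime commitment permits over the period."""
    return period_hours * (1.0 - target_percent / 100.0)


period = 3 * HOURS_PER_YEAR  # 26,280 hours

# About 6 hours of downtime in total across the three outages.
print(f"Observed uptime: {uptime_percent(6, period):.3f}%")
print(f"Downtime allowed at 99.9%: {allowed_downtime_hours(99.9, period):.2f} hours")
```

Under those assumptions, 6 hours of downtime works out to roughly 99.98% uptime, while a 99.9% commitment would have allowed a little over 26 hours in the same period.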

A pet peeve about how host support handles these issues

The one complaint I have about today’s outage is an IT pet peeve I have in general. Keep in mind that I’ve worked in IT in a variety of capacities for more than 20 years now. For the first 10 or so, I was intimately involved with running a service desk, so I know a little about how these things work.

When I called to report the problem, I had already done my homework. Mostly, I verified that while I could not get to my sites (503 errors), I could get to the server that hosts the files. This tells me that the issue is with the web server. I pointed this out right away to the very pleasant support person I spoke to.

But these folks have to follow a support tree. I hate support trees because they virtually eliminate any possibility of creative thinking on the part of the support person, and they end up wasting a lot of time.

For instance, today, I pointed out that I could access the file server but not the web server, which told me that there was an issue with the web server. The support person told me that there were no known issues with the web server at this time: that’s the first problem with a support tree.

It’s quite possible that there were no known issues and that I, acting as a canary, was the first to report the problem. Rather than running down a rabbit hole of steps that are almost certainly unrelated to what is wrong, it would have been prudent at this point to check the web server. Instead, I was asked when the site was last working. I had this information from Google Analytics. The next question was when the site was last changed. It had been days. One look at the modified dates on the files in my directory would have told the support person this. But it was irrelevant, because when unchanged files are accessible through one protocol but not another, the likely culprit is the other protocol.
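The reasoning I was applying can be written down as a tiny decision rule. This is only a sketch of my own triage logic, not anything the host uses; the function name, its parameters, and the returned labels are all made up for illustration.

```python
from typing import Optional


def likely_culprit(http_status: Optional[int], files_reachable: bool) -> str:
    """Guess where a fault lies from two observations.

    http_status: status code the site returned, or None if there was no response.
    files_reachable: whether the same files can be reached over the
        file-transfer protocol (hypothetical input; the check itself is manual).
    """
    # Any response below 500 means the web server is answering requests.
    web_ok = http_status is not None and http_status < 500
    if web_ok:
        return "no server-side fault observed"
    if files_reachable:
        # Files are intact and reachable, so suspect the layer serving them.
        return "web server"
    # Neither protocol works, so the problem is broader than the web server.
    return "host or network"


# Today's situation: 503 errors from the site, files still reachable.
print(likely_culprit(503, True))
```

Nothing in that rule depends on when the files were last changed, which is why the questions about recent edits struck me as a detour.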

Next, I was asked to perform the rather ridiculous task of renaming my .htaccess files in case the permissions had somehow gotten messed up. Mind you, the modified dates on these files were from months and years ago. Again, no changes. This is what I call a “customer comfort” task: it makes me feel like I’m doing something useful. But it was a complete and utter waste because it was completely unrelated to the problem.

Eventually, lots of people started reporting the problem on Twitter, and not long after that, the site was up and running again. It was the web server, no doubt, just as I had figured. It might have saved a lot of customers some negative moments of truth, if only the support staff weren’t chained to those useless support trees.

One thought on “Today’s Site Outage: My After-Action Report”

  1. About “the modified dates on these files were from months and years ago.” These dates can be modified using the *touch* command. So anybody who has access to .htaccess can modify it and set an old date.
