
Is there a timeline to how long it took them to figure out Redis was down? Because having experienced the same, you get an alert. Cool. HA-Proxy says app servers are down. Ok. You SSH in and see that everything looks ok but the processes are bouncing. You tail the logs to find out why (obviously lots of these steps could be optimized). Within a few seconds you spot the error connecting to Redis. A minute later you've verified the Redis hosts are offline.

That's the first 5 minutes after getting to a computer.

After that it doesn't really matter why they're down. You failover, get the site back up and worry about it later.

Are these systems on a SAN? That's probably the first mistake if so. Redis isn't HA. You're not going to bounce its block devices over to another server in the event of a failure. That's just a complex, very expensive strategy that introduces a lot of novel ways to shoot yourself in the face. If you're hosting at your own data-center, you use DAS with Redis. Cheaper, simpler. I've never seen an issue where a cabinet power loss caused a JBOD failure (I'm sure it happens, but it's far from a common scenario IME), but then again, locality matters. Don't get overly clever and spread logical systems across cabinets just because you can.

Being involved with this sort of thing more frequently than I'd like to admit, I don't know the exact situation here, but 2h6m isn't necessarily anything to brag about without a lot more context.

What's pretty shameful is that a company with GitHub's resources isn't drilling failover procedures, is ignoring physical segmentation as an availability target (or maybe just got really really unlucky; stuff happens), and doesn't have a backup data-center with BGP or DNS failover. This is all stuff that (in theory if not always in practice), many of their clients wearing a "PCI Compliant" badge are already doing on their own systems.



Is it really "shameful"? Running systems like this at scale is hard. We're not talking about redundant power systems for an ICU Ward in a hospital. We're talking about a website which powers a sliver of the first world.

You bet they busted their ass to get this fixed and shared their learnings with us. I'm extremely grateful for this and yeah it inconvenienced my morning but nothing more.

You make it sound so easy. If it takes the Github folks 2 hours, I can bet it would've taken us much longer.


"Busting your ass" doesn't make up for the lack of failover servers.


What did they learn? What did we learn?

I learned they had an unfortunate power outage. Then it took about two hours to determine that the Redis servers weren't booting, and to failover.

That's honestly pretty unimpressive no matter how you slice it.

Setting up a Redis server is easy. Sharding it is easy. Setting up slave systems is easy.
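Concretely, pointing a slave at its master is a couple of lines in redis.conf (hostnames here are hypothetical; this uses the `slaveof` spelling of the directive, which newer Redis versions rename to `replicaof`):

```
# redis.conf on the slave (hypothetical host)
slaveof redis-master.internal 6379
# don't accept writes on the slave
slave-read-only yes
```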

What's hard is, like most things in life and tech, planning. Planning for failure. Practicing failure. Not by insisting you need a monkey-army, which sounds cool and fun, but by having a staging environment which is mundane and boring. Pull the power plug on a server. See how long it takes to identify the culprit and recover. Figure out what you could have done to identify the issue quicker. Figure out how you could recover quicker.

You're running Redis, and you haven't even setup slave systems in different cabinets? You've never tried pulling the plug on the JBOD to see what might happen? These servers apparently weren't even pinging. Why did it take more than 10 seconds to find out they weren't running? Why wasn't there a dashboard for basic system/process status across all services? Is GitHub's operations budget really that tight?
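The bare-bones version of that dashboard check really is a few lines. A sketch, with made-up host names, that just attempts a TCP connect with a short timeout so dead hosts surface in seconds rather than hours:

```python
import socket

def is_up(host, port, timeout=2.0):
    """Return True if a TCP connect to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical service list; in practice this feeds a dashboard or a pager.
SERVICES = {
    "redis-01": ("127.0.0.1", 6379),
    "redis-02": ("127.0.0.1", 6380),
}

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        print(name, "UP" if is_up(host, port) else "DOWN")
```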

Setting up a pair of OpenBSD boxes with pf and HA-Proxy is easy. Making sure CARP picks up on the failover system when the primary fails is pretty easy. Scheduling it, testing it so you know it's actually going to work when you need it to is the hard part. Holding people accountable for its uptime when the log says that hasn't been done this month is the hard part.
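For the "pretty easy" part: on OpenBSD the CARP pairing is a few lines of interface config. A sketch with made-up addresses and interface names; the backup box uses the same vhid and password but a higher advskew so it only takes the VIP when the primary goes quiet:

```
# /etc/hostname.carp0 on the primary (hypothetical addresses)
inet 192.0.2.10 255.255.255.0 192.0.2.255 vhid 1 carpdev em0 \
        pass sharedsecret advskew 0
# On the backup: identical, except advskew 100
```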

Setting up some FreeBSD boxes with DAS is easy. Giving each shard its own slaves is easy. Monitoring the slaves to make sure they're actually useful is hard. Making sure you have a one-way failover script fire when the CARP interface fails over is easy. Snapshotting the ZFS filesystems is easy. Simulating breaks by pulling a power plug is hard. Write some garbage to the FS. Unplug the DAS. Unplug the host. Force "up but non-responsive" situations, figure out how to identify them, and how to recover from them.
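The one-way failover script is the easy bit precisely because it's a latch: promote the local Redis slave exactly once when CARP goes MASTER, and never auto-demote. A minimal sketch under those assumptions (the promotion command is real Redis; the CARP-state plumbing and host names are hypothetical):

```python
import subprocess

def should_promote(prev_state, new_state, already_promoted):
    """Fire only on a transition into MASTER, and only once."""
    return (not already_promoted
            and prev_state != "MASTER"
            and new_state == "MASTER")

def promote():
    # 'SLAVEOF NO ONE' turns a Redis slave into a writable master.
    subprocess.check_call(["redis-cli", "SLAVEOF", "NO", "ONE"])

class FailoverLatch:
    def __init__(self):
        self.state = "BACKUP"
        self.promoted = False

    def on_carp_state(self, new_state, promote_fn=promote):
        """Called whenever the CARP interface reports a state change."""
        if should_promote(self.state, new_state, self.promoted):
            promote_fn()
            self.promoted = True  # one-way: a human resets this, not the script
        self.state = new_state
```

The latch is the point: flapping interfaces must never demote a promoted master automatically, or you get split-brain writes.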

Doing anything reliable with iSCSI is hard. Even Amazon has a poor reputation for it. Do your best to avoid it, and never ever ever buy into vendor promises that it'll allow you to just remount your block devices on a different host and never have to worry about data or downtime. IME. YMMV.

Using runit to keep your Ruby processes up: Easy. Proper logging, monitoring and alerts because processes are just bouncing constantly? Well, it's not rocket science. But it definitely takes discipline.

There are people a hell of a lot smarter than me that have been doing this for a lot longer. But some of this stuff is just flat out not acceptable. I've been on those angry-client phone calls. "Inadvertent" should be a trigger word to anyone in Ops. It really means somebody didn't do something they should have and it bit them in the ass (at least that's what it meant when I said it). Would the app have started fine if boot.rb didn't attempt to connect to Redis? That seems pretty far fetched. So that sounds like a red herring. But what do I know.

Setting these things up is not the hard part. Identifying that a host isn't responding to ping isn't the hard part. Planning, procedures, discipline and execution, day in and day out when things aren't on fire to prepare for the day that they are, that's the hard part.

Maybe Github has a much smaller Ops team and budget than I would've imagined. Maybe this should be a wake up call to the CEO. I don't know. These are just my observations from what was said and having deja-vu thinking about lessons I had to learn the hard way.

BTW, I'm not necessarily trying to bag on the guy getting paged at 3AM. Been there, done that. It sucks. You do the best you can. If anything that I'm saying has a ring of truth to it, then it's a leadership issue. And it's not about "give smart people things and get out of their way blah blah bullshit". This sort of stuff doesn't materialize out of thin air and good intentions.

OTOH it's crazy that Github is single-homed and can't afford an F5. :shrug:

BTW:

> I can bet it would've taken us much longer

One of the worst experiences of my professional life was 72 sleepless hours, mostly in a 50F data-center, trying to figure out why the SAN would sporadically drop off servers every other minute. Turns out somebody, not me for a change, set the MTU on the switches to 9000. So whenever a max-frame was used BOOM.

But yeah, I've been through Redis failures. It didn't take me two hours to get things going again. Load-balancer failures. Database failures (backup sure, but never ever plan to actually use it; it's plan Z at best). NFS failures as well. Though those might be pushing it with heads that take almost 10 minutes just to reboot.

From the outside 2 hours seems like a very long time to identify and recover from a Redis failure. (Knock on wood.) And it sounds like they didn't have Warm systems standing by for failover? That's bad...



