1.7 petabytes and 850M files lost, and how we survived it (csc.fi)
92 points by beck5 on March 30, 2016 | hide | past | favorite | 33 comments


"The directory is intended for temporary storage of results before staging them into a more permanent location [...] During the three years that the filesystem has been in operation, it has accumulated 1.7 Petabytes of data in 850 million objects."

There needs to be some law about how temporary directories always end up containing vitally important data.


What was interesting to me about this was that they had decided not to enforce a deletion policy on /wrk, because they had so much space and the filesystem hadn't ever failed. But a rolling deletion policy would have gone a long way to containing the damage by encouraging the users to move their data to a filesystem optimized for reliability instead of availability. Still, I appreciate the heroics involved in restoring the data.


Author here. We have had an automated deletion policy on our previous filesystems but opted out this time: There are users that have temporary files that they want to persist on the /wrk and we have plenty of capacity. We definitely learned our lesson, though. :)


Better solution for the future: client side scripts that push those files back after every purge.


The (inevitable?) consequence of deletion policies, which typically delete based on mtime, tends to be users putting touch scripts in cron, making the metadata servers even more of a bottleneck than they already are. Been there, done that.

Perhaps the solution is some Netflix-like chaos monkey that randomly deletes files..;) Or, for each user over their soft quota, delete the oldest files until they're under the quota. Or something like that.


It's not just files. Anything originally designed as a temporary stop-gap has a habit of becoming permanent. I first learned this about portable classrooms, then laws, then organizational practices. By comparison, the temp directory isn't so surprising.


There are few things as permanent as a temporary solution.

The problem with a temporary solution is that it makes the problem go away, so suddenly there is no longer any incentive to fix it properly.


How about this formulation?

"Any quick fix that suppresses the feeling of emergency will become permanent."


(since I can't update the previous comment anymore)

As a corollary, I wrote a few variants of a suggestion for an anti-quick fix permanence policy:

A: Ugly problems get fixed first, so make your quick fixes ugly.

B: Every quick fix must be made exponentially uglier than its degree of crappiness.

C: Make your fixes transparent; have your fixes let the symptoms shine through from the original problem. And if the problem is silent, have the quick fix generate noise.

Definitions:

* Ugliness = how bad it looks to everybody involved, including non-technical bosses.

* Crappiness = how risky / exploit prone / buggy / inefficient / hard to maintain it is.

Justification: to maintain an appropriate degree of a sense of emergency.

Two days of unplanned downtime is more than twice as bad as one day (thus the exponent), as costs often grow exponentially. The apparent ugliness of the fix itself should drive everybody to implement a better fix, as they all want to get rid of it right away and NOT put something worse in its place.

Physical analogy: a lobby's unstable ceiling held up by a temporary pillar will remain unfixed with the pillar standing if the pillar gets painted, but will be urgently fixed if the pillar stays looking ugly.


Here's my solution: Do it right the first time.


Absolutely, if you have no deadline.


Lots of fun; while backing up the filesystem prior to wiping and rebuilding it, they ran out of IOPS to do it in a reasonable time frame, so after considering other options:

One obvious solution would be to use a ramdisk, a virtual disk that actually resides in the memory of a node. The problem was that even our biggest system had 1.5TB of memory while we needed at least 3TB.

As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them via iSCSI over the high-speed InfiniBand network to a server and pooled them together to make a sufficiently large filesystem for our needs.

A hack they weren't at all sure would work, but it did nicely.
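Conceptually, the pooling might look something like this (a hypothetical sketch using the Linux brd ramdisk driver, LIO/targetcli for the iSCSI export, and mdadm to stripe the imported LUNs; the article doesn't say which tools CSC actually used, and the node names and sizes here are made up):

```shell
# On each compute node: carve out a ramdisk and export it over iSCSI.
modprobe brd rd_nr=1 rd_size=$((256 * 1024 * 1024))   # 256 GB; rd_size is in KiB
targetcli /backstores/block create name=ramdisk dev=/dev/ram0
targetcli /iscsi create iqn.2016-03.fi.csc:node01.ramdisk
targetcli /iscsi/iqn.2016-03.fi.csc:node01.ramdisk/tpg1/luns \
    create /backstores/block/ramdisk

# On the aggregating server: log in to each node (IPoIB over the
# InfiniBand fabric) and stripe the imported devices into one volume.
for node in node01 node02 node03; do
    iscsiadm -m discovery -t sendtargets -p "$node"
    iscsiadm -m node -p "$node" --login
done
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.xfs /dev/md0 && mount /dev/md0 /mnt/bigram
```

Everything here is volatile, of course: a reboot or crash on any contributing node loses the whole pool, which is acceptable only because this was a short-lived staging area.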


Couldn't they add 1.5TB of swap to their 1.5TB of memory system and run a ramdisk on that? I'm curious what performance would look like, but given 2-3k IOPS for the on-disk solution, and 20k IOPS for the in-memory I would naively expect at least 11k IOPS for random access, which should have been fast enough without the headache of clustering?
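One way to sanity-check that estimate (my arithmetic, using the commenter's own figures): the ~11k number is the arithmetic mean of the two rates, but for random accesses spread evenly across the RAM-backed and swap-backed halves, effective throughput follows the harmonic mean, which is much lower:

```python
disk_iops = 2_500   # midpoint of the quoted 2-3k on-disk figure
ram_iops = 20_000   # quoted in-memory figure

# Arithmetic mean -- the "at least 11k" intuition:
naive = (disk_iops + ram_iops) / 2

# Harmonic mean -- correct when half the *operations* land on each tier,
# because time per operation, not rate, is what adds up:
realistic = 2 / (1 / disk_iops + 1 / ram_iops)

print(naive, round(realistic))
```

The slow tier dominates: the mix lands under ~4.5k IOPS, much closer to the disk figure than to the average.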


Author here: We considered that, but as the access pattern was likely pretty much random, the performance would have been terrible. Due to the break we had nearly 1,000 clustered servers sitting idle, so it was reasonably quick to do the ramdisk trick.


I'm sorry but I don't understand something. What did you put on that big ramdisk? The metadata?


We copied the raw image file of the corrupted metadata filesystem (MDT in Lustre lingo) to the ramdisk.

Then we mounted it via loopback and copied the files to tarballs. The bit that was really slow on the spinning disk was reading the millions of files from the metadata FS.

The basic process of the file-level backup is documented here: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfu...
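In outline, the procedure reads roughly like this (my sketch based on the description above; the device paths and mount points are invented):

```shell
# 1. Sequential copy of the raw MDT block device into the RAM-backed
#    volume -- fast even on spinning disks, since there's no seeking.
dd if=/dev/mapper/mdt_dev of=/mnt/bigram/mdt.img bs=64M

# 2. Loopback-mount the image read-only; the random metadata traversal
#    now hits RAM instead of platters.
mount -o loop,ro -t ldiskfs /mnt/bigram/mdt.img /mnt/mdt

# 3. File-level backup preserving extended attributes, which Lustre
#    needs to reconstruct the namespace on restore.
cd /mnt/mdt && tar czf /backup/mdt_backup.tgz --xattrs --sparse .
```

Step 1 turns millions of random reads into one big sequential read; step 2 moves the unavoidable random I/O onto a medium where seeks are free.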


For those still not quite getting it:

The first copy to RAM was a sequential image copy, thus not bottlenecked on seeks despite spinning platters.

The second copy from RAM was a file copy with a lot of random I/O, but not bottlenecked on seeks because it was reading from RAM.

Bulk writes tend to be more efficient. They might have made temporary configuration changes to make that end faster, or not if they lacked the appetite for the extra risk.


My guess is the "headache of clustering" wasn't a big one for them, they do this for a living, and by that time in the process/downtime, they wanted the job done ASAP.


Current HN Title: 1.7 petabytes and 850M files lost, and how we survived it.

Article title: The largest unplanned outage in years and how we survived it

Article overview: A month ago CSC's high-performance computing services suffered the largest unplanned outage in years. In total approximately 1.7 petabytes and 850 million files were recovered.

Although technically correct, the HN title is misleading.


To be fair, giving some numbers in the title makes the link much more interesting. With the original title this piece probably wouldn't have made it to the front page, as it doesn't even hint that this is about a scientific computing center.

It was an interesting read, so thumbs up for the dramatization.


"1.7 petabytes and 850M files lost" --- "1.7 petabytes and 850 million files were recovered"

Given that the latter statement is from the article, how is the former "technically correct"?


Imagine an article "One web server lost and how we survived it" simply said "Our load balancer automatically removed that server from the pool and we let the other 15 web servers pick up the load. We didn't have to do anything." This is different from "Oh crap, we only had one web server and we absolutely had to do a lengthy recovery process to get it back online."


Yeah; I read the headline and presumed that they had been using a resilient data duplication scheme that allowed them to recover from the catastrophic loss of e.g. an entire datacenter's worth of data.


It should be noted that this is about a Lustre filesystem hosted on DDN hardware. It's unclear whether the failed controller contributed to the file system corruption, but Lustre is quite capable of accelerating local entropy all by itself. It was designed/spec-ed at LLNL as huge file, high performance, short term scratch/swap and even after 15 years isn't especially reliable or fit for use outside that domain.


I'm surprised that the copying bottleneck seems to have been entirely at the target rather than the source. Is that because there were multiple copies of the source?

I've had to employ the horrible hack of iscsi from compute nodes, raided and re-exported, but it's not what I'd have tried to use first. The article doesn't mention the possibility of just spinning up a parallel filesystem on compute node local disks (assuming they have disks); I wonder if that was ruled out. I don't have a good feeling for the numbers, but I'd have tried OrangeFS on a good number of nodes initially.

By the way, it's been pointed out that RAM disk is relatively slow, though in the context of data rates rather than metadata: <http://mvapich.cse.ohio-state.edu/static/media/publications/....


Reading the metadata required quite a lot of random access. We were fairly sure that if a high-end array and controller with fast disks was struggling with it, then a traditional clustered solution with slower node-local disks would not fare much better. Thus we tried to find the solution that yields the highest possible IOPS.


I misunderstood the bottleneck, not having had to do that. (Distributed metadata for the parallel filesystem could actually be tuned to be memory resident.)


Out of curiosity, why weren't they running the metadata drive in a mirrored RAID? If you have petabytes of data, wouldn't it make sense to spend the ~$100 for a second 3TB drive to mirror your metadata?

Or was the inode problem not a local disk problem but a problem in the Lustre fs? I couldn't quite tell from the article.


It's almost a certainty that the MDS (metadata server) was situated on a mirrored RAID (prob RAID10). I'm guessing that the RAID system itself (software MDRAID or some HW array, DDN or something like a NetApp E-Series) corrupted the data under the FS that the MDS used, which I'm also assuming was XFS.

Lustre, for those who don't know it, is a cluster meta-filesystem, with separate metadata and object servers, each sitting on top of host file systems/RAID/storage.


The metadata target (MDT) in the MDS actually uses "ldiskfs", which is an enhanced version of ext4. One possibility may be to use ZFS in the future, as the support in Lustre seems to be quite stable now.

It seems pretty impossible to find out the exact root cause in retrospect as the system was running for a long time without apparent issue. Any ideas are welcome though.


It was filesystem-level corruption in Lustre. The underlying disk arrays and other hardware have comprehensive redundancy.


I bookmarked this for whenever I think I'm having a really bad day...


Hehe.. In retrospect the whole team was in fairly good spirits, although the situation was stressful. A lot of this was due to the top management giving the specialists the time and space to do their thing, and the very understanding response from the customers once we explained the situation.



