Hacker News

A couple of seconds is very aggressive. There is a window for data loss while the segment is incomplete. The calculus of what this means in a real system is somewhat complicated, and is sketched below.

Although people like to measure data loss temporally, it's more precise, for the system-minded, to say that up to 16MB of transaction log can be lost should the drive die between COMMIT and the WAL-E send. Thus, temporally, there is a plateauing effect: the more data you push, up to a point, the less time you lose, because Postgres swaps segments more quickly. If you push too much, backlogs can occur. If you measure in terms of transaction bytes lost, it's simple: assuming a trivial backlog size, the maximum lies between 16MB and (32 - epsilon)MB, lose-able between COMMIT; and the archiver send.
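To make that bound concrete, here's a small illustrative sketch (the function and scenario are made up for illustration): with the default 16MB segment size, the at-risk bytes are one completed-but-unarchived segment plus whatever has been written into the current, incomplete segment.

```python
SEGMENT = 16 * 1024 * 1024  # default Postgres WAL segment size, in bytes

def worst_case_loss(unsent_complete_segments, current_fill):
    # bytes at risk: completed-but-unarchived segments, plus the
    # partially filled current segment
    return unsent_complete_segments * SEGMENT + current_fill

# trivial backlog: one complete segment waiting, plus the current segment
lo = worst_case_loss(1, 0)            # 16 MiB: current segment just swapped
hi = worst_case_loss(1, SEGMENT - 1)  # just under 32 MiB: nearly full
```

So the worst case ranges from 16MB up to (32 - epsilon)MB, matching the figure above; a non-trivial archiver backlog adds a further 16MB per waiting segment.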

A word on backlogs: my experience suggests you need to be doing very demanding things (bulk loads, large in-server denormalizations or statement executions) to produce backlog, given the throughput one sees on EC2. It's easy to write a monitoring query to detect backlog using pg_ls_dir and regular expressions or similar. Nominal operation doesn't often see backlog; the pipes to S3 are reasonably fat. I hope to more carefully document ways to limit these backlogs via parallel execution and adaptive throttling of block device I/O for WAL writing. Another idea I had was to mirror WAL writes in memory in addition to on disk (RAID-1 style), so WAL-E would have a chance to send the last few WAL segments, if any, in the event of sudden backing block device failure.
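As a sketch of what such a check might look like: segments awaiting archive are marked by ".ready" files in the archive_status directory (which pg_ls_dir can list from SQL); here's the same logic in Python, with a made-up file listing, matching WAL segment names by their 24-hex-digit convention.

```python
import re

# WAL segment file names: 24 uppercase hex digits (timeline + log + segment)
WAL_RE = re.compile(r'^[0-9A-F]{24}$')

def backlog(status_files):
    # A ".ready" file in pg_xlog/archive_status marks a segment that is
    # completed but not yet shipped by the archiver.
    suffix = '.ready'
    return sum(1 for f in status_files
               if f.endswith(suffix) and WAL_RE.match(f[:-len(suffix)]))

# hypothetical listing, as pg_ls_dir('pg_xlog/archive_status') might return
files = ['000000010000000000000003.done',
         '000000010000000000000004.ready',
         '000000010000000000000005.ready']
```

Here backlog(files) is 2; alert when that number grows past a couple of segments.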

A dead WAL drive is interesting because it prevents COMMIT; from executing successfully; hence the amount of data loss is reduced, because availability comes to a halt immediately, even if the WAL segment is incomplete. By contrast, if a Postgres cluster-directory disk fails, new transactions might still COMMIT (the WAL continues to be written, and no fsync that would block has necessarily been issued), but you have a good chance of grabbing those segments anyway as database activity halts, since WAL-E can continue to execute even in the presence of a failed block device serving the cluster directory. A dead WAL drive will nominally still allow non-writing SELECT statements to execute, so availability is generally lost to new writes only, although this may change on account of crash-safe hint bits (I'm not terribly familiar with the latest thinking on that design, but I imagine it may have to generate WAL even for read-only SELECTs).

Finally, interesting things become possible with synchronous replication and tools like pg_streamrecv in 9.1, even if pg_streamrecv runs on the same box: I don't see an obvious reason why it would not be possible to offer user-transaction-controlled durability at (at least) two levels: committed to EBS, and committed to S3. S3 could effectively act as a synchronous replication partner.

Fundamentally, putting aside the small archiver asynchronism, EBS with WAL-E is basically a cache of sorts to speed up recovery; the backing store is really, in some respects, S3.



I was thinking setting archive_timeout to a low number would limit the temporal data loss.


You are right, especially if you aren't pushing much data. Your restore times may be rather long, though. I hope to implement a prefetching strategy to make restores much, much faster, so one could do that if they absolutely wished.
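For reference, a rough sketch of what that configuration might look like (values are illustrative; the envdir pattern is the one WAL-E's documentation suggests for passing credentials):

```ini
# postgresql.conf sketch -- values are illustrative
wal_level = archive       # 9.0/9.1-era setting enabling WAL archiving
archive_mode = on
archive_timeout = 60      # force a segment switch at least once a minute
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
```

A low archive_timeout caps the temporal loss window at roughly that many seconds, at the cost of shipping many mostly-empty 16MB segments when write traffic is light.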



