(2011-04-24) Amazon EBS Outage

Amazon EBS had a big outage, which took down a bunch of sites: basically anyone using Amazon for their RDBMS rather than just for asset-serving.

EBS doesn't even have an SLA!

Joyent thinks the goal is unrealistic. At Joyent, when VMware and Amazon were preaching centralized network block storage as the salvation, we tacked the other way, for reasons I’ll explain below. (Hint: We tried and failed.) We simply provide a POSIX File System interface to storage, which happens to be sitting on local, not network, block devices. We lean on ZFS for durability and a fighting chance when things “go byzantine.” We use fast SAS drives in RAID groups with multiple parity stripes. We don’t pretend that this “disk” can never, ever go away or survive all failure modes. What’s more important, I believe, is the set of technologies we put in place so that our customers can transparently see exactly how our filesystems are holding up... We don’t have the luxury of treating local disk as ephemeral when selecting nodes to place workload on, for example. We lose some nice features like quick re-mounting to arbitrary VM instances, too. But it allowed us to present a usable, observable abstraction that we’re continuing to improve with innovative I/O throttling and QoS.

Edited: 2011-04-26 00:00:00

Bill Seitz