Thursday, January 07, 2010

blog.reddit -- what's new on reddit: Why did we take reddit down for 71 minutes?

"We use a lot of the EBS disks. All of our databases were each using one EBS. This worked really well for us up until a week or so ago. Then all of you came back from holiday and decided that work was just too boring or something, and our traffic spiked, essentially breaking the camel's back, if you will.

In response, we started upgrading some of our databases to use a software RAID of EBS disks, which gives drastically increased performance (at a higher cost of course). This worked really well, but there was still one missing piece of the puzzle.

Part of our setup uses what we call a 'permacache', which uses Memcachedb. Memcachedb is Memcached with a built-in permanent storage system using BDB. One of the 'features' of this system is that it saves up its disk writes and then bursts them to the disk. Unfortunately, the single EBS volumes they were on could not handle these bursting writes. Memcachedb also has another feature that blocks all reads while it writes to the disk. These two things together would cause the site to go down for about 30 seconds every hour or so lately."
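
To make the arithmetic behind that stall concrete, here is a rough back-of-the-envelope sketch in Python. It is not reddit's code or Memcachedb's actual behavior in detail: the function names and every number in it (write rate, flush interval, per-volume throughput, four volumes) are hypothetical, picked only so the single-volume case lands near the 30 seconds quoted above. It also assumes the software RAID is a simple stripe whose throughput scales linearly with the number of volumes, which the quoted post does not say and which is an optimistic upper bound.

```python
def flush_stall_seconds(write_rate_mb_s, flush_interval_s, disk_throughput_mb_s):
    """Estimate how long reads are blocked while buffered writes are flushed.

    Assumes all writes accepted during the interval are held in memory and
    then written out in one burst, with reads blocked for the whole burst.
    """
    buffered_mb = write_rate_mb_s * flush_interval_s
    return buffered_mb / disk_throughput_mb_s


def striped_throughput_mb_s(per_volume_mb_s, volumes):
    """Ideal throughput of a stripe across several volumes (an upper bound)."""
    return per_volume_mb_s * volumes


# Hypothetical figures, not measurements from reddit.
WRITE_RATE = 0.25      # MB/s of incoming writes being buffered
FLUSH_INTERVAL = 3600  # seconds between bursts ("every hour or so")
ONE_VOLUME = 30        # MB/s a single volume can absorb

single = flush_stall_seconds(WRITE_RATE, FLUSH_INTERVAL, ONE_VOLUME)
striped = flush_stall_seconds(
    WRITE_RATE, FLUSH_INTERVAL, striped_throughput_mb_s(ONE_VOLUME, 4)
)

print(single)   # 30.0 seconds of blocked reads on one volume
print(striped)  # 7.5 seconds on an ideal four-volume stripe
```

The point of the sketch is only that the stall grows with the amount of buffered data and shrinks with disk throughput, so faster storage shortens the outage but does not remove the read-blocking behavior itself.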
