I'm interested in why he's so against the "custom solution". Almost everything a DB adds for key-value lookup is predicated on the idea that the requests aren't randomly distributed. The DB index will probably be based around B-trees, which will do a logarithmic-time search with the top few levels cached in RAM, but will fall over fairly miserably once it has to do multiple seeks to page in leaf nodes, and likely the nodes a level above those. I don't see it as realistic to expect a DB to do it all in a single seek.
On the other hand, if you have a perfect hash function for the keyspace into a hash addressing space (you'll want something a bit larger than 2^32), you could simply choose the device with the first few bits, then do a single seek into a (large) file on disk with the position indicated by the remaining bits of the hash addressing space, shifted over appropriately for the 128-byte stride of values. You don't even need a perfect hash: with a merely good hash function, you can cheaply read in more than 128 bytes from disk (OS will probably be reading in 4K anyhow), and use open addressing with linear probing.
You'll need enough devices to spread the load to get good latency for random seeks; probably a combination of mirroring and what amounts to sharding, using a portion of the key hash to select the device. But that's a relatively cheap numbers game.
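The hash-then-route scheme above can be sketched in a few lines. This is a hypothetical illustration, not a spec: SHA-256 stands in for "a merely good hash function", and the bit widths (16 devices, a 36-bit slot space) are made-up numbers to show the shape of the arithmetic.

```python
import hashlib

RECORD_SIZE = 128   # fixed stride of each value on disk
DEVICE_BITS = 4     # top bits of the hash pick one of 16 devices
SLOT_BITS = 36      # remaining bits index into the on-disk table
BLOCK_SIZE = 4096   # OS page: one read covers 32 consecutive slots

def route(key: bytes):
    """Map a key to (device, byte offset) with a single hash computation.

    Returns the device index, the exact slot offset, and the offset
    aligned down to a 4K boundary, so one read pulls in the slot plus
    its neighbours for linear probing on collisions.
    """
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    device = h >> (64 - DEVICE_BITS)
    slot = (h >> (64 - DEVICE_BITS - SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
    offset = slot * RECORD_SIZE
    aligned = offset - (offset % BLOCK_SIZE)
    return device, offset, aligned
```

The point of the alignment step is that the open-addressing probe is nearly free: the OS reads a full page anyway, so a collision usually resolves within data already in memory, keeping the one-seek-per-lookup property.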
But I'm not really seeing what a DB buys you here, in this specific scenario.
EDIT: Further investigation of tc and cdb (as linked in the article) suggests that tc (Tokyo Cabinet) may work in its hash form, but I'd rule out cdb for using two seeks, which I would expect to double the number of devices required.
I would imagine that a certain number of candidates answer with a custom solution because they believe that's what the interviewer is expecting to hear (and asking for) when the question contains "you need to design, build, and deploy...". I think for some people, the immediate thought is that the interviewer would deduct points for a lack of originality if you said "well, let's just use this tool off the shelf".
This is why interviews are so tricky - it's hard as a candidate to juggle answering the question and trying to decipher what traits the interviewer is really trying to expose with this question.
For example with this one: is the interviewer trying to see if I understand my data structures and the size of this data to the point where I could design such a system from the ground up? Or are they trying to see how practical I can be given real-world time constraints?
I agree, and I try to make clear that I really want to understand how the candidate would approach it -- if they genuinely would try to find a perfect hash, keep a hash -> offset map in memory, and then mmap a big data file, that is fine. I will ask how they would find the perfect hash, and so on. As long as you can describe your approach well, I have no objection.
Ofttimes, though, candidates have started designing a general purpose on-disk hash database. That is fine and dandy, and we always need a better database, but it is pretty orthogonal to the immediate problem.
Actually, ofttimes engineers have implemented general purpose on-disk hash databases as part of j. random project, so that path certainly does occur on real projects :-)
Were I asked this, my general approach would be: (1) Given the available building blocks (disks, ram, cpus, etc) what strategies are workable? An example of a strategy would be "store it in a btree and keep all but the last two levels in ram; therefore we'll need enough ram for the top of the tree and 2x seeks for each lookup." (2) Find existing software that implements a workable strategy if at all possible, or at least the complicated components of it and code the rest yourself.
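The B-tree strategy in (1) is easy to cost out with a quick script. A back-of-envelope sketch, where the fanout and the number of disk-resident levels are assumed numbers for illustration, not anything from the original question:

```python
def btree_plan(n_keys, fanout=256, levels_on_disk=2):
    """Rough sizing for the 'top of the B-tree in RAM' strategy.

    Hypothetical parameters: fanout=256 and two on-disk levels are
    assumptions to show the shape of the estimate.
    """
    # Depth of a tree with the given fanout covering n_keys.
    depth, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        depth += 1
    in_ram_levels = max(depth - levels_on_disk, 0)
    # Nodes pinned in RAM: 1 + fanout + fanout^2 + ... for the top levels.
    ram_nodes = sum(fanout ** i for i in range(in_ram_levels))
    seeks_per_lookup = min(levels_on_disk, depth)
    return depth, ram_nodes, seeks_per_lookup
```

For 2^32 keys at fanout 256 this gives a depth-4 tree with 257 nodes held in RAM and 2 seeks per lookup -- exactly the "enough RAM for the top of the tree and 2x seeks" shape described above, which then feeds directly into how many devices you need for the target throughput.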
You can think of two extreme answers to this as "buy Oracle" and "buy some transistors". Both are bad answers. The first is bad because, unless the candidate can answer the follow-up question "how does Oracle work?", it probably just amounts to wishful thinking, and they almost certainly won't be able to say when it will fail. The second is bad because you can't do that in two weeks by yourself.
Take for example, Foursquare's recent outage. From what I read online, it seems like they went with "use MongoDB" as a strategy without actually understanding what Mongo was doing under the hood. Presumably they didn't figure out how it would fail (the author's follow-up question, natch) and thus they discovered that the hard way. So they chose bad answer one. On the other hand, had they decided to implement their own database, they'd know all about its characteristics, but they wouldn't have launched yet.
I have gotten solutions which implement custom stores which I have had faith that the candidate could build, but not many. I am not dead set against it, but I will press for an estimate and a good explanation of it if they want to go down that route.
I try to emphasize that I want to know what they would actually do, on the job. Other questions (typically other interviewers) will explore algorithmic stuff, and folks I work with have used this question with that intent -- I use it to see how they think about building systems.
Another thing I didn't mention in the blog post is that up front I tell them exactly what I am looking to learn. I am not going on a fishing expedition, or trying to trick anyone. I find that kind of "second guess the interviewer" thing to be asinine, so strive to avoid doing it myself.
How often do you decide exactly what you're going to do to solve a problem for your job in the span of one hour? Oh yeah, and make sure you provide an estimate and a good explanation, within that hour, as to why you've made that decision. Give me a break man, you're looking for people who tell you what you want to hear because then you've found someone who thinks like you--and that, in your mind, is the only correct answer.
The idea is that you don't have to write a DB - it is already there, usually even deployed and monitored by your ops team. Even if it gives absolutely no other value, that's a big plus.
In terms of randomly distributed requests:
I'm surprised the topic of "working set" is not part of the question. Even with huge amounts of data, if the working set can be stored in RAM and you only rarely swap data in and out, it's a much different problem than 5,000 disk reads a second. Lots of databases are optimized to keep cached data in memory and to swap as little, and as efficiently, as possible.
I must say the problem bores me to death. I don't think I'd want to work where they ask me to solve this. However, I am confident I'd design a near-perfect solution in the two weeks given; it would just require some experimentation on the data-storage side - the code is trivial.