I'm interested in why he's so against the "custom solution". Almost everything a DB adds for key-value lookup is predicated on the idea that the requests aren't randomly distributed. The DB index will probably be based around B-trees, which will do a logarithmic-time search with the top few levels cached in RAM, but will fall over fairly miserably once it has to do multiple seeks to page in leaf nodes, and likely the nodes a level above those. I don't see it as realistic to expect a DB to do it all in a single seek.
On the other hand, if you have a perfect hash function for the keyspace into a hash addressing space (you'll want something a bit larger than 2^32), you could simply choose the device with the first few bits, then do a single seek into a (large) file on disk with the position indicated by the remaining bits of the hash addressing space, shifted over appropriately for the 128-byte stride of values. You don't even need a perfect hash: with a merely good hash function, you can cheaply read in more than 128 bytes from disk (OS will probably be reading in 4K anyhow), and use open addressing with linear probing.
You'll need enough devices to spread the load to get good latency for random seeks; probably a combination of mirroring and what amounts to sharding, using a portion of the key hash to select the device. But that's a relatively cheap numbers game.
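The hash-then-route scheme above can be sketched in a few lines. This is a hypothetical illustration, not a spec: SHA-256 stands in for "a merely good hash function", and the bit widths (16 devices, a 36-bit slot space) are made-up numbers to show the shape of the arithmetic.

```python
import hashlib

RECORD_SIZE = 128   # fixed stride of each value on disk
DEVICE_BITS = 4     # top bits of the hash pick one of 16 devices
SLOT_BITS = 36      # remaining bits index into the on-disk table
BLOCK_SIZE = 4096   # OS page: one read covers 32 consecutive slots

def route(key: bytes):
    """Map a key to (device, byte offset) with a single hash computation.

    Returns the device index, the exact slot offset, and the offset
    aligned down to a 4K boundary, so one read pulls in the slot plus
    its neighbours for linear probing on collisions.
    """
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    device = h >> (64 - DEVICE_BITS)
    slot = (h >> (64 - DEVICE_BITS - SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
    offset = slot * RECORD_SIZE
    aligned = offset - (offset % BLOCK_SIZE)
    return device, offset, aligned
```

The point of the alignment step is that the open-addressing probe is nearly free: the OS reads a full page anyway, so a collision usually resolves within data already in memory, keeping the one-seek-per-lookup property.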
But I'm not really seeing what a DB buys you here, in this specific scenario.
EDIT: Further investigation of tc and cdb (as linked in the article) suggests that tc (Tokyo Cabinet) may work in its hash form, but I'd rule out cdb for using two seeks, which I would expect to double the number of devices required.
I would imagine that a certain number of candidates answer with a custom solution because they believe that's what the interviewer is expecting to hear (and asking for) when the question contains "you need to design, build, and deploy...". I think for some people, the immediate thought is that the interviewer would deduct points for a lack of originality if you said "well, let's just use this tool off the shelf".
This is why interviews are so tricky - it's hard as a candidate to juggle answering the question and trying to decipher what traits the interviewer is really trying to expose with this question.
For example with this one: is the interviewer trying to see if I understand my data structures and the size of this data to the point where I could design such a system from the ground up? Or are they trying to see how practical I can be given real-world time constraints?
I agree, and I try to make clear that I really want to understand how the candidate would approach it -- if they genuinely would try to find a perfect hash, keep a hash -> offset map in memory, and then mmap a big data file, that is fine. I will ask how they would find the perfect hash, and so on. As long as you can describe your approach well, I have no objection.
Ofttimes, though, candidates have started designing a general purpose on-disk hash database. That is fine and dandy, and we always need a better database, but it is pretty orthogonal to the immediate problem.
Actually, ofttimes engineers have implemented general purpose on-disk hash databases as part of j. random project, so that path certainly does occur on real projects :-)
Were I asked this, my general approach would be: (1) Given the available building blocks (disks, ram, cpus, etc) what strategies are workable? An example of a strategy would be "store it in a btree and keep all but the last two levels in ram; therefore we'll need enough ram for the top of the tree and 2x seeks for each lookup." (2) Find existing software that implements a workable strategy if at all possible, or at least the complicated components of it and code the rest yourself.
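The B-tree strategy in (1) is easy to cost out with a quick script. A back-of-envelope sketch, where the fanout and the number of disk-resident levels are assumed numbers for illustration, not anything from the original question:

```python
def btree_plan(n_keys, fanout=256, levels_on_disk=2):
    """Rough sizing for the 'top of the B-tree in RAM' strategy.

    Hypothetical parameters: fanout=256 and two on-disk levels are
    assumptions to show the shape of the estimate.
    """
    # Depth of a tree with the given fanout covering n_keys.
    depth, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        depth += 1
    in_ram_levels = max(depth - levels_on_disk, 0)
    # Nodes pinned in RAM: 1 + fanout + fanout^2 + ... for the top levels.
    ram_nodes = sum(fanout ** i for i in range(in_ram_levels))
    seeks_per_lookup = min(levels_on_disk, depth)
    return depth, ram_nodes, seeks_per_lookup
```

For 2^32 keys at fanout 256 this gives a depth-4 tree with 257 nodes held in RAM and 2 seeks per lookup -- exactly the "enough RAM for the top of the tree and 2x seeks" shape described above, which then feeds directly into how many devices you need for the target throughput.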
You can think of two extreme answers to this as "buy Oracle" and "buy some transistors". Both are bad answers. The first is bad because, unless the candidate can answer the follow-up question "how does Oracle work?", it probably just amounts to wishful thinking, and they almost certainly won't be able to say when it will fail. The second is bad because you can't do that in two weeks by yourself.
Take for example, Foursquare's recent outage. From what I read online, it seems like they went with "use MongoDB" as a strategy without actually understanding what Mongo was doing under the hood. Presumably they didn't figure out how it would fail (the author's follow-up question, natch) and thus they discovered that the hard way. So they chose bad answer one. On the other hand, had they decided to implement their own database, they'd know all about its characteristics, but they wouldn't have launched yet.
I have gotten solutions which implement custom stores which I have had faith that the candidate could build, but not many. I am not dead set against it, but I will press for an estimate and a good explanation of it if they want to go down that route.
I try to emphasize that I want to know what they would actually do, on the job. Other questions (typically other interviewers) will explore algorithmic stuff, and folks I work with have used this question with that intent -- I use it to see how they think about building systems.
Another thing I didn't mention in the blog post is that up front I tell them exactly what I am looking to learn. I am not going on a fishing expedition, or trying to trick anyone. I find that kind of "second guess the interviewer" thing to be asinine, so strive to avoid doing it myself.
How often do you decide exactly what you're going to do to solve a problem for your job in the span of one hour? Oh yeah, and make sure you provide an estimate and a good explanation, within that hour, as to why you've made that decision. Give me a break man, you're looking for people who tell you what you want to hear because then you've found someone who thinks like you--and that, in your mind, is the only correct answer.
The idea is that you don't have to write a DB - it is already there, usually even deployed and monitored by your ops team. Even if it gives absolutely no other value, that's a big plus.
In terms of randomly distributed requests:
I'm surprised the topic of "working set" is not part of the question. Even with huge amounts of data, if the working set can be stored in RAM and you only rarely swap data in and out, it's a much different problem than 5,000 disk reads a second. Lots of databases are optimized to keep cached data in memory and to swap as little, and as efficiently, as possible.
I must say the problem bores me to death. I don't think I'd want to work where they ask me to solve this. However, I am confident I'd design a near-perfect solution in the two weeks given; it would just require some experimentation on the data-storage side - the code is trivial.