After my presentation in the Data Analytics devroom at FOSDEM this year, I felt that I needed to complete the information I provided there. This post will try to explain my methodology and the results in more detail.

But before I start, I would like to say that this whole benchmark is quite a recent idea and I’m still working on it. That means that there could be bugs in my code, errors in my configuration of the various clusters and other kinds of problems. This is why I wouldn’t recommend using the results of this benchmark to make any kind of serious strategic decision. I will make the code available soon so everyone can play with it.

First, here are the slides of my presentation. Without the oral explanations, it is not obvious why there are 3 measures for each cluster size on slide 11. Those three measures are in fact the three runs described in the step-by-step methodology on slide 10. Speaking about those runs, one run is in fact the chosen total number of requests repeated ten times in a row. This is what I use to compute the average and standard deviation values for each of those 3 runs.
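To make that computation concrete, here is a minimal sketch of how the per-run statistics could be obtained. The function name and the timing values are made up for illustration and are not taken from the actual benchmark code.

```python
import statistics

def run_statistics(timings):
    """timings: the durations of the ten repetitions that make up one run."""
    return statistics.mean(timings), statistics.stdev(timings)

# Made-up numbers: 3 runs of 10 repetitions each (seconds).
runs = [
    [12.1, 11.8, 12.3, 12.0, 11.9, 12.2, 12.4, 11.7, 12.0, 12.1],
    [12.5, 12.6, 12.2, 12.8, 12.4, 12.3, 12.7, 12.5, 12.6, 12.4],
    [11.9, 12.0, 11.8, 12.1, 12.2, 11.9, 12.0, 12.1, 11.8, 12.0],
]
for i, timings in enumerate(runs, start=1):
    mean, stdev = run_statistics(timings)
    print(f"run {i}: mean={mean:.2f}s stdev={stdev:.2f}s")
```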

I would like to justify my choice of storing each article as one XML blob instead of splitting it into different parts. I could have split it to take advantage of the specificities of the column-oriented or document-oriented models, but I chose not to, for two reasons:

  • I don’t think that Wikipedia stores the content of an article in multiple parts: you always get the whole page when you want to edit it. Of course I’m also downloading some metadata when I use the XML export, but it has a negligible impact on the size of the whole export compared to the content of the articles.
  • I want to be able to adapt this benchmark (at least the read/update performance part) to as many databases as possible without having to do specific work on the storage part for each of them. As I want to test the performance of distributed key/value stores like Scalaris or Voldemort, I don’t have a choice (see the sketch after this list).

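To make the storage model explicit, here is a minimal sketch: each article is one opaque XML blob stored under a single key, the article ID. The class and method names below are purely illustrative stand-ins for whatever put/get interface the actual store exposes, not the benchmark code itself.

```python
class KeyValueStore:
    """Stand-in for a distributed key/value store (Scalaris, Voldemort, ...)."""

    def __init__(self):
        self._data = {}

    def put(self, article_id: int, xml_blob: str) -> None:
        # The whole article, metadata included, is stored as one value.
        self._data[article_id] = xml_blob

    def get(self, article_id: int) -> str:
        return self._data[article_id]

store = KeyValueStore()
store.put(42, "<page><title>FOSDEM</title><revision>...</revision></page>")
print(store.get(42))
```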
Moreover, I don’t think that any of the databases providing a storage paradigm more complex than simple key/value would have anything to gain, performance-wise, in this benchmark if the articles had been stored using that paradigm. My update function is (really!) simple and the size of the metadata is negligible, so I don’t see how there could be any improvement in performance.

Another thing that I would like to explain a little bit more is how my MapReduce implementation works. My first idea was to use two phases of MapReduce to build the reverse index for a given keyword:

  • The first Map phase splits each article using spaces as separators and then compares each string of the resulting list with the given keyword in a case-insensitive way. If the two strings are equal, it outputs the pair (string, ID), with ID being the integer identifying the article. Note that several different strings can be emitted since the comparison is case-insensitive.
  • The first Reduce phase simply collects those results and stores them.
  • The second Map phase emits the pair (ID, 1) for each occurrence of an ID in the previously saved output of the first Reduce phase.
  • The second Reduce phase sums up all the “1”s for each ID and outputs pairs (ID, sum). A sketch of both phases is given after this list.

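Here is a framework-free sketch of those two phases. The real jobs were written against each database’s own MapReduce API, so the function names, types and the tiny in-memory driver at the end are only an approximation of the idea, not the actual implementation.

```python
from collections import defaultdict

def map1(article_id, text, keyword):
    # Split the article on spaces and compare each token with the
    # keyword in a case-insensitive way.
    for token in text.split(" "):
        if token.lower() == keyword.lower():
            yield (token, article_id)

def reduce1(pairs):
    # Collect the (string, ID) pairs; this intermediate result is stored.
    return list(pairs)

def map2(stored_pairs):
    # Emit (ID, 1) for each occurrence of an article ID.
    for _token, article_id in stored_pairs:
        yield (article_id, 1)

def reduce2(pairs):
    # Sum the 1s per article ID, giving (ID, number of occurrences).
    counts = defaultdict(int)
    for article_id, one in pairs:
        counts[article_id] += one
    return dict(counts)

# Tiny end-to-end example on two fake articles.
articles = {1: "FOSDEM is great and FOSDEM is free", 2: "fosdem devroom"}
intermediate = reduce1(p for aid, text in articles.items()
                         for p in map1(aid, text, "FOSDEM"))
index = reduce2(map2(intermediate))
print(index)  # {1: 2, 2: 1}
```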
I know the second MapReduce phase looks a little bit unnecessary, and in fact it mostly is, as I could compute the same result without it. I added this second phase because I wanted to test chaining MapReduce jobs. Anyway, the first part is where all the heavy computation takes place; the second part will always be very fast in comparison.

So this is what I wanted to do before I actually started to look at all the specificities of each MapReduce implementation. Those specificities forced me to make some specific modifications to this idea for some of the databases. This is explained in this post.