As I updated my benchmark to work with a more up-to-date HBase version, I thought I had to do the same for the other databases if I wanted to be fair. Moreover, I had some problems with the Cassandra MapReduce implementation in version 0.6.10 (you can read this post to learn about those problems).
In the same spirit of fairness between my databases, the Cassandra configuration was updated to be closer to the HBase one. In fact, the only change is that Cassandra can now also use up to 3 GB of RAM if it wants to. The updated configuration can be downloaded here.
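For reference, in Cassandra 0.7 a heap increase like this is set in conf/cassandra-env.sh rather than in the YAML file. A minimal sketch, assuming the stock 0.7 layout (the exact values below are illustrative, not my benchmark's):

```shell
# conf/cassandra-env.sh -- override the automatic heap sizing so
# Cassandra may use up to 3 GB, matching the HBase setup.
MAX_HEAP_SIZE="3G"
# Young-generation size; usually tuned along with the total heap.
HEAP_NEWSIZE="400M"
```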
Unlike HBase, Cassandra introduced an API break between the 0.6.x and 0.7.x series, so I had to rewrite part of the Cassandra implementation. This time I saw in the documentation that using the Thrift interface directly, as I had been doing before, was not recommended, which is why I decided to switch to the Pelops high-level client. The new implementation can be seen here. There have also been very welcome API changes on the MapReduce side, so that part has been rewritten and can be seen here.
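To give an idea of what Pelops looks like compared to raw Thrift, here is a rough sketch of a write followed by a read. This is an illustration only: the pool, keyspace, column family, and row names are placeholders I made up, and exact method signatures vary a bit between Pelops releases.

```java
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.scale7.cassandra.pelops.Cluster;
import org.scale7.cassandra.pelops.Mutator;
import org.scale7.cassandra.pelops.Pelops;
import org.scale7.cassandra.pelops.Selector;

public class PelopsSketch {
    public static void main(String[] args) throws Exception {
        // Register a connection pool against the cluster.
        // "pool", "Keyspace1" and "Standard1" are placeholder names.
        Cluster cluster = new Cluster("localhost", 9160);
        Pelops.addPool("pool", cluster, "Keyspace1");

        // Write one column, then read it back.
        Mutator mutator = Pelops.createMutator("pool");
        mutator.writeColumn("Standard1", "row1",
                mutator.newColumn("name", "value"));
        mutator.execute(ConsistencyLevel.ONE);

        Selector selector = Pelops.createSelector("pool");
        System.out.println(selector.getColumnFromRow(
                "Standard1", "row1", "name", ConsistencyLevel.ONE));

        Pelops.shutdown();
    }
}
```

The nice part is that the pooling, connection failover, and byte serialization boilerplate of raw Thrift disappear behind the Mutator/Selector pair.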
Now my MapReduce job does exactly the same thing as for HBase and the others. It is also worth noting that the bug described in this post is no longer present: everything seems to work fine and I can easily write output to Cassandra directly from my MapReduce jobs.
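Writing job output straight into Cassandra relies on the ColumnFamilyOutputFormat that appeared in 0.7. A sketch of the job setup, assuming placeholder keyspace and column family names (mapper and reducer classes are omitted; the reducer must emit a ByteBuffer row key and a List of Mutation objects):

```java
import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class InvertedIndexJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "inverted-index");

        // Send the reducer output straight into Cassandra.
        // "Keyspace1" and "InvertedIndex" are placeholder names.
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);
        ConfigHelper.setOutputColumnFamily(
                job.getConfiguration(), "Keyspace1", "InvertedIndex");
        ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");

        // Mapper/reducer wiring omitted for brevity.
        job.waitForCompletion(true);
    }
}
```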
Enough talking, here are the read/update performance results:
As you can see, there has been a notable performance increase between the 0.6.x and 0.7.x series. I don't think this increase comes from the extra RAM I gave Cassandra, because by default it can already use up to 1 GB, which should be enough to hold everything since my data set is only 620 MB. The other interesting thing to notice is that Cassandra 0.7.2 and HBase are now really close in terms of both performance and scalability. In fact, those results are almost identical.
Here are the MapReduce performance results:
It is kind of hard to see the new results, as they are almost identical to those of the 0.6.x series, except for the very important fact that the results of this MapReduce job are now always correct. For now, and at this (small) cluster size, Cassandra seems to be the fastest at computing my inverted index using MapReduce. But there is still one weird thing about these results: the performance does not increase with the size of the cluster! I am facing exactly the same problem with HBase 0.90.0 and I am still investigating why that could be the case. This would be a big problem if the results stay the same on bigger clusters. Indeed, a database like MongoDB, which scales very well in terms of MapReduce performance, would not take long to become faster even though its raw performance is way behind.
As always, feedback and criticism are welcome!