New results for Cassandra 0.7.2
Since I updated my benchmark to work with a more up-to-date HBase version, I thought I had to do the same for the other databases to be fair. Moreover, I had some problems with the Cassandra implementation of MapReduce in version 0.6.10 (you can read this post to learn about those problems).
In the same spirit of fairness between my databases, the configuration of Cassandra was updated to be closer to the HBase one. In fact, the only change is that Cassandra can now also use up to 3 GB of RAM if it wants to. The updated configuration can be downloaded here.
Unlike HBase, Cassandra had an API break between the 0.6.X and 0.7.X series, so I had to rewrite part of the Cassandra implementation. This time I saw in the documentation that using the Thrift interface directly, as I was doing before, is not recommended, which is why I decided to start using the Pelops high-level client. This new implementation can be seen here. There have been very welcome API changes for the MapReduce part too, so it has been rewritten and can be seen here.
Now my MapReduce job does exactly the same thing as the ones for HBase and the other databases. It is also worth noting that the bug described in this post is no longer present. Everything seems to work fine and I can easily output to Cassandra directly from my MapReduce jobs.
Enough talking, here are the read/update performance results:
As you can see, there has been a notable performance increase between the 0.6.X series and the 0.7.X series. I don’t think that this increase is due to the extra RAM I gave to Cassandra, because by default it can already use up to 1 GB of RAM. That should be enough to hold everything, as my data set is only 620 MB. The other interesting thing to notice is that Cassandra 0.7.2 and HBase are now really close in terms of both performance and scalability. In fact, their results are almost identical.
Here are the MapReduce performance results:
It is rather hard to see the new results, as they are almost identical to those of the 0.6.X series, except for the very important fact that the results of this MapReduce job are now always correct. For now, and for this (small) cluster size, Cassandra seems to be the fastest at computing my inverted index using MapReduce. But there is still one weird thing about those results: the performance does not increase with the size of the cluster! I am facing exactly the same problem with HBase 0.90.0 and I am still investigating why that could be the case. This is a fairly big problem if those results do not change for bigger clusters. Indeed, databases like MongoDB, whose MapReduce performance scales very well, will not take long to become faster even if their raw performance is way behind.
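For readers unfamiliar with the job being measured, here is a minimal, single-process sketch of how an inverted index is built with a map step and a reduce step. It is only an illustration written in Python: the function names and the two sample documents are hypothetical, and it is not the Hadoop job used in this benchmark.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_phase(word, doc_ids):
    """Merge all document ids seen for a word into one sorted list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    # Shuffle step: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_phase(doc_id, text):
            grouped[word].append(emitted_id)
    # Reduce step: one call per word.
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())


# Hypothetical sample data, not taken from the benchmark data set.
documents = {
    "doc1": "Cassandra stores columns",
    "doc2": "HBase stores rows and columns",
}
index = build_inverted_index(documents)
print(index["stores"])
```

In a real cluster the map calls run in parallel on different nodes and the grouping is done by the framework, which is where the scalability discussed here is supposed to come from.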
As always, feedback and criticism are welcome!


Could you test http://www.redis.io? Also comparison with Oracle would be very interesting.
Yes, I agree with Kuba, but there are more typical RDBMSs to compare:
MySQL, MS SQL and PostgreSQL
Hello,
Right now my priority is to make sure that the current implementations work well, because I’m going to run tests on bigger clusters with bigger data sets.
I could add support for redis in a short time, but running the tests takes time.
Concerning the RDBMSs, the work needed to support them would also be quite simple, but this is even less of a priority for me. Indeed, I’m not that interested in raw performance but rather in scalability and elasticity on a lot of small servers, while SQL is more about small clusters of very powerful servers.
Anyway, if I have time I would like to do both!
I have designed an EAI/JMS stress test that logs some process statistics (JMSTimestamp, JMSExpiration of the received message, process start time, end time, actual number of started/finished/running processes, etc.). All software runs on the same machine. 64k JMS messages + 64k inserts to HSQLDB take 8 minutes. The same with HBase takes 10.5 minutes.
@Kuba Raw performance on a single machine is interesting but not really representative for NoSQL databases that have been designed with scalability in mind. What is interesting is how those databases work in a cluster environment, how much performance they gain when more nodes are added to the cluster, etc.
You should try your test on more than one machine, HBase is not designed to run on a single server. It is possible but only as a development environment.
About the lack of speed improvement on the mapreduce task when the number of nodes is growing: I am no specialist of Cassandra and HBase but it is possible that the default settings of the block size for replication of chunks over the nodes makes it impossible to see an improvement on such a small dataset. If it fits in 3 chunks / blocks, then only 3 nodes are working at a time.
I would suggest trying again with a much larger dataset (e.g. at least a complete English Wikipedia dump of several tens of GB).
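To illustrate the reasoning in the comment above, here is a rough Python sketch of how the number of blocks caps the number of nodes that can work at the same time. The 620 MB figure is the data set size given in the post; the 256 MB block size and the function name are assumptions picked only for illustration, not the actual settings of Cassandra, HBase or Hadoop in this benchmark.

```python
import math


def busy_nodes(dataset_mb, block_mb, nodes):
    """Number of nodes that can work in parallel if each block feeds one task."""
    blocks = math.ceil(dataset_mb / block_mb)
    return min(nodes, blocks)


DATASET_MB = 620  # data set size given in the post
BLOCK_MB = 256    # hypothetical block size, chosen only for illustration

for nodes in (3, 6, 12):
    # With these assumed values the data fits in 3 blocks,
    # so adding nodes beyond 3 gives no extra parallelism.
    print(nodes, busy_nodes(DATASET_MB, BLOCK_MB, nodes))
```

Under these assumptions a larger data set, or a smaller block size, is what lets additional nodes take part in the job.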
@ogrisel I have found the reason why I didn’t see any improvement with the Hadoop-based MapReduce implementations, thanks to Bruno Dumon: http://www.nosqlbenchmarking.com/2011/02/hbase-0-90-0-configuration-and-mapreduce/#comment-9
In fact, the jobs were executed locally, on the client side. But correcting this problem while keeping everything else the same led to worse performance. This is because setting up the MapReduce jobs has a cost, and this cost is higher than the gain in computing power because my data set is too small.
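The trade-off can be written down as a toy model, sketched here in Python. It assumes that a distributed job pays a fixed setup cost and then splits the work evenly across the nodes, while a local run pays no setup cost at all. All names and numbers are hypothetical and are not measurements from this benchmark.

```python
def local_time(work_s):
    """Time to run the whole job locally, with no setup cost."""
    return work_s


def distributed_time(work_s, setup_s, nodes):
    """Fixed setup cost plus the work divided evenly across the nodes."""
    return setup_s + work_s / nodes


def break_even_work(setup_s, nodes):
    """Amount of work (in seconds) above which distributing the job pays off."""
    return setup_s * nodes / (nodes - 1)


# Hypothetical values: 40 s of setup, 4 nodes.
SETUP_S = 40
NODES = 4

for work_s in (30, 600):
    print(work_s, local_time(work_s), distributed_time(work_s, SETUP_S, NODES))
```

In this model a small job (30 s of work) is slower when distributed, while a large one (600 s of work) is much faster, which is the behaviour described above.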
I’m currently working with the Rackspace cloud, where I will benchmark bigger clusters with data sets at least as big as the whole English Wikipedia (more than 10 million articles, 28 GB). I will post the results here as soon as they are ready.