As I promised in a previous post, this one will explain how I configured HBase 0.90.0 for the last tests and a few observations about my experience with MapReduce on this HBase version.
First, on the configuration side, there are a few modifications worth noting:
- I have increased the heap size allocated to HBase to 3 GB
- There are now 3 ZooKeeper nodes instead of only one
- I had to raise the maximum number of connections allowed to a single ZooKeeper node because of my MapReduce jobs; more on this below
- The region size is now about 12 MB, to make sure the regions are evenly distributed across the nodes; remember that the data set used for these tests is rather small (20000 articles summing up to 620 MB)
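For reference, here is roughly what those settings look like in the configuration files. The host names are placeholders, and the heap size itself is set in `hbase-env.sh` (for example `export HBASE_HEAPSIZE=3000`, the value being in MB) rather than in this file:

```xml
<!-- hbase-site.xml (sketch; host names are placeholders) -->
<configuration>
  <!-- Three ZooKeeper nodes instead of one -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <!-- Small maximum region size (~12 MB, expressed in bytes) so the
       regions of a 620 MB data set still spread across all the nodes -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>12582912</value>
  </property>
</configuration>
```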
The first thing I noticed, even before looking at the MapReduce performance of HBase 0.90.0, is that with this new version the job was opening a lot of connections to ZooKeeper, despite the fact that I was using exactly the same code as with HBase 0.20.6. I actually learned that the hard way: my MapReduce job was no longer working. I got an error telling me that too many connections to ZK were opened by a single client, and indeed the job was opening a lot of connections to ZK. It seems that my big Map phase is divided into several Map phases and that each of those opens its own connection to ZK, as you can see in this log.
At this point I have no idea why this is happening; you can see the code of the MapReduce job here, and if you have any idea of what could be wrong, it would be much appreciated. For now I have worked around the problem by increasing the number of connections that a ZK node can handle, but it seems weird to have to raise this value to hundreds of connections when it is set to 30 by default. Maybe this problem is affecting the performance of MapReduce on HBase and contributing to the apparent lack of scalability.
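For the record, this is the workaround I used. Assuming HBase manages the ZooKeeper ensemble, the per-client connection limit can be raised from `hbase-site.xml` via `hbase.zookeeper.property.maxClientCnxns` (which is passed through as `maxClientCnxns` in `zoo.cfg`; set it there directly if you run ZooKeeper standalone). The value 300 below is just an example:

```xml
<!-- hbase-site.xml: raise the per-client ZooKeeper connection limit.
     300 is an example value; pick one comfortably above the number of
     connections your jobs actually open. -->
<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>300</value>
</property>
```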
Finally, in an attempt to solve this apparent lack of scalability, I tried to use the Hadoop balancer script to distribute all the blocks evenly across my servers. Since the MapReduce computation is data-local, I thought that maybe the same few nodes were always doing all the work (remember that I do not increase the data set size during the tests). So I ran the balancer with a threshold of 0.2, but without any impact on performance worth noticing. Again, if you have any idea of what could be causing this, please share.
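For completeness, the balancer run itself is a one-liner; the threshold argument is a percentage, the maximum disk-usage deviation allowed between each DataNode and the cluster average (so a smaller value forces a more even distribution):

```
# Rebalance HDFS blocks until every DataNode's disk usage is within
# 0.2 percentage points of the cluster average
bin/hadoop balancer -threshold 0.2
```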