HBase 0.90.0 configuration and MapReduce
As I promised in a previous post, this one will explain how I configured HBase 0.90.0 for the last tests and a few observations about my experience with MapReduce on this HBase version.
First on the configuration side there are a few modifications worth noticing :
- I have increased the memory allowed to HBase to 3Gb
- There are now 3 ZooKeeper nodes instead of only one
- I had to raise the maximum number of connections allowed to a singe ZooKeeper node because of my MapReduce jobs, I talk about this just after.
- The size of the regions is now about 12Mb to be sure that they are evenly distributed across the nodes, remember that the data set used for those test is kinda small (20000 articles summing up to 620Mb)
You can download the whole configuration folder usedĀ here. Please note that the Hadoop configuration did not change, it can still be downloaded here.
The first thing I noticed before I saw the performances of HBase 0.90.0 for MapReduce is that with this new version, despite the fact that I was using exactly the same code than with Hbase 0.20.6, the job was opening a lot of connections to ZooKeeper. In fact I learned that the hard way, my MapReduce job was no longer working. I had an error telling me that too many connections to ZK where opened by a single client and indeed, the job was opening a lot connection to ZK. In fact it seems that my big Map phase is divided in several Map phases and that each of those is opening its own connection to ZK, you can see it in this log.
At this point I have no idea why this isĀ happening, you can see the code of the MapReduce job here and if you have any idea of what could be wrong, it would be very appreciated. For now I have bypassed the problem by increasing the number of connections that a ZK node can handle but it seems weird to have to increase this value to hundreds of connections while by default it is set to 30. Maybe this problem is affecting the performances of MapReduce on HBase and leading to an apparent lack of scalability.
Finally, in an attempt to solve this apparent lack of scalability, I have tried to use the Hadoop balancer script to distribute evenly all the blocks on my servers. As the MapReduce computation is local, I thought that maybe it was always the same nodes that were doing all the work (remember that I do not increase the data set size during the tests). So I ran the balancer with a threshold of 0.2 but without any impact on the performances that would be worth noticing. Again if you have any idea of what could be causing this, please share
This in the log is suspicious:
11/02/20 17:24:18 INFO mapred.LocalJobRunner:
It seems you are not running the job on the cluster but simply in the client VM. In the configuration object you pass to the Job, you need to set the property mapred.job.tracker with as value host:port of the jobtracker (this is not hbase specific).
Concerning the ZooKeeper connections: HBase caches connections based on the Configuration object. Each time you use a different Configuration object (when instantiating an HTable object), a different ZooKeeper connection will be made, but also a different cache of the root/meta information etc. So it’s really important to have just one of these per JVM. In your case the problem could be internal to HBase, the leak might be in the TableInputFormat (but running on a real cluster this will not be a problem as a new client VM is started each time).
I recommend you lower again the ZK setting so that in the future you’ll notice when this problem pops up again. Also, for ZK, using 3 servers instead of 1 does not bring a performance advantage.
Some other things that come to mind:
* using just 20.000 docs to test hbase is not serious, you should really change this
* given the data set size, probably a lot more time goes into the setup of the job than in its execution. Job setup includes copying all necessary code (various big jar files) to the task nodes (on each mapper invocation, if i’m not mistaken) and starting JVM’s. So either you’d have to optimize that, or better just measure the time which goes into the individual map executions.
* HBase starts one map task per region. So if your table has e.g. 20 regions, this will lead to 20 map tasks. If you have e.g. 5 task nodes, this means 4 tasks which will have to be executed on each node (sequentially, with jvm instantiation etc). Again, given the small data size, this means more time will go into setting this all up than in actual execution.
* the class you linked to can be optimized by removing the system.out.println’s, by using a StringBuilder for string concatenation, and by doing the Bytes.toBytes() call just once for fixed data like column names.
Thanks a lot for your detailed answer Bruno! The fact that the job runs only on the client side explains a lot. I didn’t even know that it was possible for a MapReduce job to run only on the client side, the doc should say that.
I’ll try setting this property and maybe those connections problems to ZK will be solved too.
I know the 20000 articles are not enough, in fact those tests are mostly a prelude to the next benchmarks I plan to run. Those will use the EC2 infrastructure with hundreds of small instances and I plan to insert the whole +10 millions articles of the entire Wikipedia in English several times. This will cost money and my budget is kinda tight, so I would like to be as sure as possible that everything works as it should on this small cluster.
Indeed with the current setup a lot of time can be wasted into the various setups, I have 76 regions into HBase for now. But this should be negligible with the larger data set I plan to use. I will also optimize the class as you say.
Again, thank you for your feedback this is really appreciated. It is not simple to learn to use HBase and Hadoop in the right way and inputs from professionals is highly valuable!
No problem, glad I can be of help.
I forgot to mention this: a simple way to check that everything is as it should is by having a look at the web ui’s:
* map-reduce jobtracker: http://jobtrackerhost:50030. There you can see the status of the jobs. If it is run locally (= not submitted to the job tracker), it will not even appear there.
* hbase: http://hbase-master-host:60010. You can there for example see the list of region servers (to verify they are all running).
* hdfs web ui, http://namenode:50070, where you can browse the file system to check if HBase is actually storing its files on HDFS.
The same information is also available via APIs, might be interesting to pull out some of this data before/after the test to have some basic checks like ‘are all systems actually running and participating’.
Similarly, I’d recommend to monitor some metrics during the execution of the tests. For example cpu and memory, but you can also get hbase & hadoop specific metrics via JMX. This sort of thing can quickly reveal problems.
Have fun