As my work on the benchmark progressed, I needed to update my methodology to fit my new needs and to take into account the feedback I received. First, here is a short reminder of the basis of my benchmark. It was inspired by Wikipedia, because Wikipedia provides a lot of real data that can be used to perform computations that make sense. I have downloaded the whole English version of Wikipedia (more than 10 million articles, adding up to ~28 GB) and I plan to insert it multiple times. Each article is inserted as a blob with a unique integer ID as its key. I deliberately did not use the more complex data structures provided by the databases, to ensure that my data model is easy to port from simple key/value systems to column-oriented or document-oriented systems. You can see the various database implementations here.
I’m measuring performance for two simple use cases. First, I’m simulating traffic from people reading articles and updating a few of them. Concretely, I choose a percentage of read-only requests, the total number of requests and the number of threads. The benchmark then measures the average time needed to complete those requests. The second test concerns MapReduce performance. I use the various MapReduce implementations to build an inverted index for a given keyword. This index is simply a list of pairs, each containing the ID of an article and the number of occurrences of the keyword in that article. You can find a more detailed explanation of the various MapReduce phases at the end of this post.
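To make the first use case concrete, here is a minimal Python sketch of such a read/update workload. It is not the code used in the benchmark: `InMemoryStore` is a hypothetical stand-in for a real database client, and the names `run_benchmark`, `read_percentage` and `seed` are assumptions made for illustration. It also assumes that "average time" means the mean latency per request.

```python
import random
import threading
import time


class InMemoryStore:
    """Hypothetical stand-in for a database client: article ID -> blob."""

    def __init__(self, articles):
        self._data = dict(articles)
        self._lock = threading.Lock()

    def read(self, article_id):
        with self._lock:
            return self._data[article_id]

    def update(self, article_id, blob):
        with self._lock:
            self._data[article_id] = blob


def run_benchmark(store, article_ids, total_requests, read_percentage, threads, seed=0):
    """Issue random reads and updates; return the mean time per request in seconds.

    Requests are split evenly between threads, so total_requests should be
    at least as large as (and ideally a multiple of) the thread count.
    """
    per_thread = total_requests // threads
    latencies = []
    latencies_lock = threading.Lock()

    def worker(worker_seed):
        rng = random.Random(worker_seed)  # one generator per thread
        local = []
        for _ in range(per_thread):
            article_id = rng.choice(article_ids)  # fully random access
            start = time.perf_counter()
            if rng.random() * 100 < read_percentage:
                store.read(article_id)
            else:
                store.update(article_id, b"updated article body")
            local.append(time.perf_counter() - start)
        with latencies_lock:
            latencies.extend(local)

    pool = [threading.Thread(target=worker, args=(seed + i,)) for i in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return sum(latencies) / len(latencies)
```

A call such as `run_benchmark(store, ids, total_requests=10000, read_percentage=80, threads=4)` would then correspond to one run with 80% read-only requests. Swapping `InMemoryStore` for a real client with the same `read` and `update` methods is the only change needed to point the sketch at a database.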
I’m not only interested in raw performance. Here are the properties that I want to measure:
- Raw performance: how fast can all the requests be completed?
- Scalability: what is the impact on performance of changing the cluster size?
- Elasticity: how long does it take to reach a stable state with increased performance when nodes are added to the cluster?
Raw performance is the easiest one to measure. The articles are stored in a very similar way across the databases, and the load is also very similar. In fact, for the reads and updates the load is exactly the same, while for the MapReduce test I have tried to build the inverted index in a similar way for the various databases. All of this enables me to compare the results between databases for the same cluster size.
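As an illustration of what the MapReduce test computes, here is a small single-process Python sketch of the keyword index. It only mirrors the map, group and reduce steps in plain Python and does not use any of the databases' MapReduce APIs. The function names are hypothetical, articles are assumed to be text strings, and matching the keyword as a whole word, ignoring case, is an assumption.

```python
import re
from collections import defaultdict


def map_article(article_id, text, keyword):
    """Map phase: emit (keyword, (article_id, count)) if the keyword occurs."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    count = len(re.findall(pattern, text, flags=re.IGNORECASE))
    if count > 0:
        yield keyword, (article_id, count)


def reduce_pairs(keyword, pairs):
    """Reduce phase: gather all (article_id, count) pairs into one sorted list."""
    return keyword, sorted(pairs)


def build_index(articles, keyword):
    """Run map, group by key, then reduce; return the list of pairs."""
    grouped = defaultdict(list)  # the "shuffle" step: group map output by key
    for article_id, text in articles.items():
        for key, pair in map_article(article_id, text, keyword):
            grouped[key].append(pair)
    results = dict(reduce_pairs(key, pairs) for key, pairs in grouped.items())
    return results.get(keyword, [])
```

For example, with the three articles `{1: "NoSQL is fun. nosql scales.", 2: "Relational only", 3: "A NoSQL store"}` and the keyword `"nosql"`, `build_index` returns `[(1, 2), (3, 1)]`.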
Scalability is measured as the change in performance when new machines and data are added to the cluster in a linear way. This is also quite easy to measure, given that I have a way to know when the cluster has stabilized. Indeed, to be fair I need to wait for the cluster to stabilize before I can measure the real gain in performance. Otherwise I would only measure the performance of transient states, which are not representative of a real cluster deployment.
Finally, elasticity is the most complex of the three measures, since the time needed for a system to stabilize should be different for each database and for each cluster size. In most systems, newly added machines only start to serve requests once they host their share of the data. Therefore I have chosen to keep the same number of threads (talking to the old nodes in the cluster) while I’m running my elasticity test. The goal is to measure the change in the time needed to complete the same work when nodes are added and data starts to move across the cluster. To put it more formally, I characterize the stability of a cluster in terms of the standard deviation computed over the time needed to run N read/update tests. The procedure to measure elasticity would look like this:
- Use a stable cluster to determine the usual standard deviation of this DB for this cluster size
- Add new nodes to the cluster without increasing the data set size
- Repeat:
  - Start a read/update benchmark and compute the standard deviation
  - Wait X seconds
- Until the new standard deviation can be considered equal to the old one by a statistical test with a confidence of Y percent.
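The procedure above can be sketched in Python as follows. The post does not say which statistical test is meant, so the permutation test on the standard deviation used here is only one possible choice, not the author's method. The names `same_spread`, `measure_elasticity`, `run_tests` and `max_rounds` are hypothetical, and `run_tests` is assumed to return the list of times measured for the N read/update tests.

```python
import random
import statistics
import time


def same_spread(baseline, current, confidence=0.95, permutations=2000, seed=0):
    """Permutation test: True if equal standard deviations cannot be rejected."""
    rng = random.Random(seed)
    # Center each sample so that only the spread is compared, not the mean.
    mean_a, mean_b = statistics.mean(baseline), statistics.mean(current)
    a = [x - mean_a for x in baseline]
    b = [x - mean_b for x in current]
    observed = abs(statistics.stdev(a) - statistics.stdev(b))
    pooled = a + b
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.stdev(left) - statistics.stdev(right)) >= observed:
            extreme += 1
    p_value = (extreme + 1) / (permutations + 1)
    return p_value > 1 - confidence


def measure_elasticity(run_tests, baseline, wait_seconds, confidence=0.95,
                       max_rounds=100, sleep=time.sleep):
    """Repeat the benchmark until its spread matches the baseline.

    Returns the number of benchmark rounds that were needed, or None if the
    cluster did not stabilize within max_rounds.
    """
    for round_number in range(1, max_rounds + 1):
        timings = run_tests()  # times of the N read/update tests
        if same_spread(baseline, timings, confidence):
            return round_number
        sleep(wait_seconds)  # the "wait X seconds" step
    return None
```

Here `confidence` plays the role of Y and `wait_seconds` the role of X. The `sleep` parameter exists only so that the waiting can be replaced in tests; an estimate of the stabilization time is the number of rounds multiplied by the duration of one round plus the wait.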
There is a potential problem with this methodology: my data set is now much bigger than the available memory in the cluster. A lot of requests will therefore require a seek on the hard drive, and since my requests are fully random, it is possible to see big differences between standard deviations even for a stable cluster.
For example, here are the standard deviations that I observed just after all the articles were inserted into the databases. Each run is made of 10000 requests with a read-only percentage of 80%. The clusters were made of six 4 GB nodes from Rackspace.
As you can see, the standard deviation can vary a lot between runs. A possible solution would be to lower the confidence level of the statistical test, but that would create a new problem: how to select this confidence level.
Finally, here is the complete step-by-step methodology that I plan to use:
1. Start with a clean cluster of size 6 and insert all the articles
2. Measure the standard deviation for this cluster once it has stabilized
3. Choose a total number of requests and a read-only percentage
4. Start the benchmark with the values chosen above
5. Start the MapReduce benchmark
6. Double the number of nodes in the cluster
7. Start the elasticity test
8. Double the size of the inserted data set
9. Restart the elasticity test
10. Jump to step 4 with a doubled number of requests
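To show how the parameters evolve from one iteration of this loop to the next, here is a small Python sketch that only generates the plan and does not run anything. The `Round` class and `plan_rounds` function are hypothetical names, the initial request count of 10000 is an assumption borrowed from the run size mentioned earlier, and the number of rounds is arbitrary.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Round:
    nodes: int              # cluster size while the two benchmarks run
    data_copies: int        # how many times the data set has been inserted
    requests: int           # total requests for the read/update benchmark
    nodes_after: int        # cluster size during the first elasticity test
    data_copies_after: int  # data set size during the second elasticity test


def plan_rounds(initial_nodes: int = 6, initial_requests: int = 10000,
                rounds: int = 3) -> Iterator[Round]:
    """Yield the parameters of each pass through the benchmark loop."""
    nodes, copies, requests = initial_nodes, 1, initial_requests
    for _ in range(rounds):
        yield Round(nodes, copies, requests, nodes * 2, copies * 2)
        # The next pass starts from the doubled cluster, data set and requests.
        nodes, copies, requests = nodes * 2, copies * 2, requests * 2
```

With the defaults, the first pass runs on 6 nodes with 10000 requests, the second on 12 nodes with 20000 requests and twice the data, and the third on 24 nodes with 40000 requests and four times the data.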
I’m currently tuning the various databases for the Rackspace 4 GB cloud servers; it’s almost done for Cassandra and HBase. I will soon start the same work for Riak and MongoDB. Once everything is ready and bug-free, I will start to benchmark at bigger scales!