Spark Performance Tuning with Ganglia and Sparklens
Leveraging Sparklens and Ganglia to optimize Spark jobs.
In this article, the goal is to show how tuning tools can help while developing an application. Ganglia and Sparklens are useful and, I would say, underrated tools that offer a lot when used properly. Both are open-source tools built to help developers. Let’s look at what each of them actually is:
Ganglia is a real-time cluster monitoring tool, open-sourced by the University of California. It provides a detailed UI view of cluster metrics such as resource utilization, load, network and much more.
Sparklens is a tuning tool that saves developer time by suggesting the right tuning options, particularly around executors. It is written in Scala and open-sourced by Qubole. It runs on top of a job’s logs, works with a JSON-based configuration, and can also be run on demand whenever needed.
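For context, hooking Sparklens into a run mostly comes down to putting its jar on the classpath and registering its listener. The sketch below assumes the jar is already available (for example via `--packages`; the exact coordinates and version depend on your Spark and Scala build) and only shows the listener config:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: attach the Sparklens listener to a job. Assumes the Sparklens
// jar/package is already on the classpath (e.g. added via --packages; the exact
// coordinates and version depend on your Spark and Scala build).
object SparklensEnabledJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparklens-demo")
      // QuboleJobListener collects the per-stage data that Sparklens reports on.
      .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
      .getOrCreate()

    // ... the actual workload goes here ...

    spark.stop() // the Sparklens report is printed when the application finishes
  }
}
```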
We will look into each of them using a real-world example of how they can be genuinely helpful. I am going to use a subset of a proprietary dataset (~5 GB) and start with R instances on AWS EMR (Spot Instance Fleets with a total of 384 units).
We are going to use Ganglia to check whether the current cluster suits our case or whether we should try a different one. You will notice two peaks in each graph; they correspond to two runs of the same job. The first uses a somewhat fat-executor approach, while the second follows Sparklens’s suggestion to go with thin executors while keeping the parallelism right, as sketched below.
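To make the fat vs. thin distinction concrete, here is a rough sketch of the two layouts. The numbers are purely illustrative (the exact executor sizes from these runs aren’t listed here); the shape is what matters: a few large executors versus many small ones whose total core count matches the parallelism you want.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative executor layouts only -- the exact sizes used in these runs aren't
// listed in the article, so the numbers below are hypothetical.
object ExecutorLayouts {
  // "Fat" executors: a handful of large executors (the first peak in each graph).
  val fat = Map(
    "spark.executor.instances" -> "12",
    "spark.executor.cores"     -> "8",
    "spark.executor.memory"    -> "40g"
  )

  // "Thin" executors: many small executors, sized so the total core count matches
  // the parallelism we want (the second peak, the Sparklens-style layout).
  val thin = Map(
    "spark.executor.instances" -> "48",
    "spark.executor.cores"     -> "2",
    "spark.executor.memory"    -> "8g"
  )

  def build(appName: String, layout: Map[String, String]): SparkSession = {
    val builder = layout.foldLeft(SparkSession.builder().appName(appName)) {
      case (b, (key, value)) => b.config(key, value)
    }
    builder.getOrCreate()
  }
}
```

For example, `ExecutorLayouts.build("demo", ExecutorLayouts.thin)` would start the thin-executor variant; the trade-off is less memory and fewer cores locked up per JVM at the cost of a bit more per-executor overhead.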
The following snapshot shows the cluster’s memory usage, and it is clear we are wasting a lot of memory. The second peak, where I used less memory, still gives better results, showing that the tasks really don’t need that much memory.
This is the CPU utilization graph, and with the Sparklens suggestions the job was able to utilize the CPU better. Perhaps with more compute power we could get even better results.
Below is the load graph. The second (fastest) run is well above the threshold line, and from the other Ganglia views we can see the load on the executors is quite high, which might cause some failures; however, we didn’t face any. Load usually increases along with utilization. The first run looks good here, but only because its CPU utilization is very low, which keeps performance slow.
The R-instance approach still looks feasible: increasing the executor count by making executors thin and setting the correct parallelism, as Sparklens suggested, gave us a performance improvement.
A sample of the Sparklens standard-output suggestions looks like this (they can also be viewed from the UI’s Simulation tab):
Executor count 72 (100%) estimated time 01m 49s and estimated cluster utilization 42.77%
Executor count 79 (110%) estimated time 01m 46s and estimated cluster utilization 39.99%
Executor count 86 (120%) estimated time 01m 40s and estimated cluster utilization 39.14%
Executor count 108 (150%) estimated time 01m 40s and estimated cluster utilization 31.17%
The Sparklens UI view shows the runtime and utilization for the second run; check the full details here:
A detailed analysis for the first peak can be found here. The 2m 44s (first peak: 3m 28s) is the best I achieved on R instances, with ~50% of executor time wasted. However, I used Sparklens here just for testing purposes because, from experience, I was pretty sure we needed a different instance type first. From the full view you can also see more about the job and each stage; I’m not going to dive into that here.
Let’s move to C instances. First, we can see that we need more compute power than memory, so we can switch to the C instance family with half the number of units (192) and much less memory. There are two peaks here as well: the first is a fairly arbitrary approach leaning towards fat executors, while the second is the thin-executor approach with the executor count and parallelism based on Sparklens’s suggestions.
Memory usage seems fine; in this case the memory usage is inversely proportional to the runtime, unlike what we saw with the R instances.
CPU usage is good as well, with the same relationship that we saw in the memory graph.
Load is much improved here; we are only a little over the threshold line in the second run. Slightly decreasing the parallelism might bring it down further, but that would compromise the runtime, and the current load isn’t causing any problems anyway.
Below is the Sparklens UI view of the runtime and utilization from the second run; check the full details here. The UI also defines the terms you see in the snapshot, so I would recommend heading over for a quick glance. The first run’s UI can be seen here. For executor suggestions, check the Simulation tab.
You can see that it has certainly improved (`2m 32s`; compare with its first peak or with the R-instance best case), with only ~39% of executor time wasted. However, this is not the ideal time Sparklens suggested, since we can see there could be data skew that leaves many executors sitting idle. The data skew can be spotted in the linked report as well (Per Stage Metrics). Another metric I found improved was garbage collection, comparing the best cases from the C and R instances; it can be found in the Aggregated Metrics tab of the UI under jvmGCTime. If some metrics are not reported, it could be because the job was so small that they weren’t picked up. There are a lot of other metrics that I didn’t find particularly useful here, but I would suggest you look at them.
So, to conclude, I improved the performance by:
Changing the instances to the correct type (R to C, using Ganglia)
Getting the correct executor count (suggested by Sparklens)
Setting the parallelism and partitions correctly (equal to vCores for maximum parallelism; see the sketch below)
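As a rough sketch of that last point, assuming the 192-vCore C fleet from this example, matching the shuffle partitions and repartition width to the vCore count looks something like this:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of aligning parallelism with cluster vCores. 192 matches the C-instance
// fleet in this example; substitute your own cluster's capacity.
object ParallelismSetup {
  val totalVCores = 192

  def configure(spark: SparkSession): Unit = {
    // One shuffle partition per vCore keeps every core busy without oversubscribing.
    spark.conf.set("spark.sql.shuffle.partitions", totalVCores.toString)
    // For RDD-based stages, spark.default.parallelism plays the same role, but it
    // has to be set before the SparkContext is created.
  }

  def widen(df: DataFrame): DataFrame =
    // Repartition a narrow input so its first wide stage also uses all the cores.
    df.repartition(totalVCores)
}
```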
We improved performance a little (2m 44s to 2m 32s, a 12-second gain) but saved 50% on cluster cost (384 R units vs. 192 C units). Overall, there could still be room for improvement; increasing the C-instance units to 384 could bump up the performance. Also, one important point: we can’t generalize from this that the job would perform the same on the full dataset; the same metrics could behave differently there.
Tool Takeaways:
Sparklens is a good supplementary tool, but one cannot rely on it completely; in particular, the Ideal Time is not a time every job can achieve.
Ganglia is a good tool for real-time monitoring; however, it dies with the cluster. There are alternative tools out there such as Datadog and Prometheus, but they require some initial configuration investment, and managed options like Datadog are not free.
Both tools have a lot more features and metrics that one can utilize; I used the ones I found most important and useful.