Solving Data Skewness in Spark
You might have been a victim of skewed data when performing operations in Spark, especially ones that require a shuffle, such as a join. You might have noticed that your job takes longer than expected without knowing what is going on or why.
This article will help you answer some questions related to skewed data.
What is Data Skewness?
Data skewness is a non-uniform distribution of data across the executors, meaning a few executors hold far more data than the others. It directly affects the performance of the job, which can be frustrating at times.
How to know if the Data is Skewed?
Sometimes we do not pay attention to how Spark is performing and assume it is just normal behavior, but you might not realize that you can improve performance by overcoming the data skewness issue. The biggest hint is in the Resource Manager, where you can inspect your tasks to learn more.
In the above image, it is clear that something is wrong when most of your tasks complete quickly (around 1 minute) while the last few partitions take far longer (around 20 minutes). Most likely, this means your data is skewed!
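Besides eyeballing task durations in the Resource Manager, you can confirm skew from the data itself. Below is a minimal PySpark sketch, assuming a hypothetical DataFrame df with a join column Key (neither is from the original article); a few keys dominating the counts is a strong sign that a shuffle on that column will be skewed.

```python
from pyspark.sql import functions as F

# Count rows per join key; if a handful of keys dwarf the rest,
# any shuffle on `Key` (e.g. a join) will be skewed.
(df.groupBy("Key")
   .count()
   .orderBy(F.desc("count"))
   .show(10))
```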
What causes Data Skewness?
Data skewness is related to the distribution of keys. When the keys on which you are performing an operation are skewed, the data is likely to be spread across the executors in a skewed manner as well.
Suppose you have the following tables, with three executors to perform a join operation.
Table 1
Key | Value
A | 1
A | 2
A | 5
Table 2
Key | Value
A | 1
Spark will then distribute the data based on the join key, and Executor1 ends up doing all the work because it holds every record while the others sit idle. This is data skewness; consider how long the job would take if your records were in the millions with this level of skew.
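You can observe this imbalance directly with spark_partition_id(), which reports which partition each row landed in after the shuffle. A minimal sketch, assuming table1 and table2 are hypothetical DataFrames holding the tables above:

```python
from pyspark.sql import functions as F

joined = table1.join(table2, on="Key")

# Tag each row with the shuffle partition it ended up in; with a
# single hot key, one partition holds every row and the rest are empty.
(joined.withColumn("partition", F.spark_partition_id())
       .groupBy("partition")
       .count()
       .show())
```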
How to solve the data skewness problem?
There are a few approaches we have all tried to solve this issue. Increasing the number of cores will not help here; it might speed up the processing, but the skewness will remain. Repartitioning for a uniform distribution will not help either, because the shuffle still partitions on the same join key, as the sketch below illustrates.
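A quick illustration of why repartitioning falls short, again using the hypothetical table1: hash partitioning sends every row with the same key to the same partition, no matter how many partitions you ask for.

```python
# Every row with Key "A" still hashes to the same partition,
# so the skew survives the repartition.
repartitioned = table1.repartition(200, "Key")
```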
The solution that can help us is known as Salting.
Salting is a process in which we add a column Salted_Key, where each join key is replicated with a random number attached to it, drawn from a range set by the Salting Factor.
The bigger table, Table 1, gets one salted key per row, while the small table, Table 2, has to generate every key-salt combination so that the join on Salted_Key still finds a match; the small table can be broadcast if possible.
So, let's say we have a Salting Factor = 3.
Table 1
Key | Value | Salted_Key
A | 1 | A_1
A | 2 | A_2
A | 5 | A_3
Table 2
Key | Value | Salted_Key
A | 1 | A_1
A | 1 | A_2
A | 1 | A_3
Now if we look at the keys, they can be distributed evenly across the executors with no skew. You might be asking why Table 2 is now the same size as Table 1, but this example is kept small for the sake of understanding; in the real world a single key is unlikely to need this, and the Salting Factor plays a big role (it does not have to equal the total number of rows for a key). If Table 2 is itself big, however, this technique might not be useful.
Now the data is evenly distributed and no executor is sitting idle; this extreme case is where Salting plays its best role in solving the problem.
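To make this concrete, here is a minimal salting sketch in PySpark, using the same hypothetical table1 (big, skewed) and table2 (small) and the column names from the example; none of this code is from the original article.

```python
from pyspark.sql import functions as F

SALT_FACTOR = 3

# Big table: tag each row with a random salt in 1..SALT_FACTOR,
# so key "A" becomes "A_1", "A_2", or "A_3".
salted_big = table1.withColumn(
    "Salted_Key",
    F.concat(F.col("Key"), F.lit("_"),
             ((F.rand() * SALT_FACTOR).cast("int") + 1).cast("string")))

# Small table: replicate each row once per possible salt value so
# every salted key on the big side finds its match.
salt_values = F.array(*[F.lit(i) for i in range(1, SALT_FACTOR + 1)])
salted_small = (table2
    .withColumn("salt", F.explode(salt_values))
    .withColumn("Salted_Key",
                F.concat(F.col("Key"), F.lit("_"),
                         F.col("salt").cast("string"))))

# Join on the salted key; broadcast the small side if it fits in memory.
result = salted_big.join(
    F.broadcast(salted_small.select("Salted_Key",
                                    F.col("Value").alias("small_value"))),
    on="Salted_Key")
```

If you need per-key aggregates afterwards, group on the original Key so the rows that salting split apart are merged back together.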
To understand more, take a look at this quick video.
There are a few more techniques to solve the data skewness problem, such as Structured Streaming and Sampling.
I would like to thank my former colleague Alex Healey, who introduced these techniques to our Data Team.
Feel free to reach out to me with any questions, and if you think I missed a point, leave a comment.