# Solving Data Skewness in Spark

You might have been a victim of skewed data when performing operations in **Spark**, especially ones that require a shuffle, like a **Join**. You may have noticed that your job takes longer than expected, and you don't know what is going on or why it's happening.

This article will help you answer some questions related to skewed data.

### What is Data Skewness?

Data skewness is the non-uniform distribution of data across executors: a few executors hold far more data than the others. It directly affects job performance, which can be frustrating at times.
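To see how a single hot key turns into an unbalanced load, here is a small plain-Python sketch. The dataset and executor count are made up for illustration; Python's built-in `hash()` stands in for Spark's hash partitioner, which behaves analogously.

```python
from collections import Counter

# Toy dataset: 1,000 rows, 900 of which share the hot key "A"
# (hypothetical numbers, purely for illustration).
rows = ["A"] * 900 + ["B"] * 50 + ["C"] * 50

NUM_EXECUTORS = 3

# Spark's hash partitioner sends each row to hash(key) % numPartitions;
# Python's built-in hash() stands in for it here.
per_executor = Counter(hash(key) % NUM_EXECUTORS for key in rows)

# Every "A" row hashes to the same bucket, so one executor receives
# at least 900 of the 1,000 rows while the others stay nearly idle.
print(sorted(per_executor.values(), reverse=True))
```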

### How to know if the Data is Skewed?

Sometimes we do not pay attention to how Spark is performing and assume it is just normal behavior, but you might not know that you can improve performance by overcoming the data skewness issue. The biggest hint can be found in the **Resource Manager**, where you can inspect your tasks to learn more.

A clear sign of trouble is when most of your tasks complete quickly (say, in `1 min`) while the last few partitions take much longer (`20 min`). Most likely, this means your data is skewed!
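If you want a number rather than a visual hint, you can compare per-partition record counts (in PySpark, one way to collect them is `df.rdd.glom().map(len).collect()`). The check below uses hypothetical counts:

```python
# Hypothetical per-partition record counts; in PySpark you could obtain
# them with df.rdd.glom().map(len).collect().
partition_sizes = [1_000, 1_200, 950, 1_100, 48_000, 1_050]

mean_size = sum(partition_sizes) / len(partition_sizes)
skew_ratio = max(partition_sizes) / mean_size

# A ratio close to 1 means balanced partitions; anything well above
# that (say, > 2) suggests skew worth investigating.
print(f"skew ratio: {skew_ratio:.1f}")
```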

### What causes Data Skewness?

Data skewness is related to the distribution of keys: when the keys you are operating on are themselves skewed, the operation is likely to spread the data across executors in a skewed manner.

Suppose you have the following tables, with three executors to perform a join operation.

```
Table 1
Key | Value
A   | 1
A   | 2
A   | 5

Table 2
Key | Value
A   | 1
```

Spark will then distribute the data by hashing the join key, so **Executor 1** ends up doing all the work because it holds all the records, while the others sit idle. This is data skewness. Consider how long the job would take if you had millions of records with this level of skew.

### How to solve the data skewness problem?

There are a few things we have all tried to solve this issue. Increasing the number of cores will not help here; it might speed up the unskewed tasks, but the skew remains. Repartitioning for a uniform distribution will not help either, because the shuffle for the join will redistribute the data based on the same join key.
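A quick way to convince yourself that repartitioning on the same key cannot fix skew: however many partitions you choose, every row with the hot key still hashes to the same one. A plain-Python sketch with made-up data:

```python
from collections import Counter

# Hypothetical skewed key distribution: 900 of 1,000 rows share key "A".
rows = ["A"] * 900 + ["B"] * 100

for num_partitions in (3, 12, 48):
    buckets = Counter(hash(key) % num_partitions for key in rows)
    # The hot key "A" always maps to a single bucket, so the largest
    # partition never shrinks below 900 rows, whatever num_partitions is.
    assert max(buckets.values()) >= 900
```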

The solution that can help us is known as **Salting**.

Salting is a process in which we add a column `Salted_Key`, where we replicate our join keys with a random number appended, based on a Salting Factor.

The bigger table (`Table 1`) gets one salted key per row, while the small table (`Table 2`) has to generate all salt combinations for each row so the join can work on the `Salted_Key` column. The small table can be broadcast if possible.

So, let's say we have a `Salting Factor = 3`.

```
Table 1
Key | Value | Salted_Key
A   | 1     | A_1
A   | 2     | A_2
A   | 5     | A_3

Table 2
Key | Value | Salted_Key
A   | 1     | A_1
A   | 1     | A_2
A   | 1     | A_3
```

Now if we look at the keys, they can be distributed evenly across executors with no data skew. You might object that `Table 2` is now the same size as `Table 1`, but this example is only for the sake of understanding; in the real world a single key rarely dominates to this degree, and the salting factor plays a big role (the Salting Factor does not have to equal the total number of rows per key). If `Table 2` is itself big, however, this technique might not be useful.

Now the data is evenly distributed and no executor sits idle. This is the extreme case, where salting plays its best role in solving the problem.
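The whole salting flow can be sketched in plain Python (hypothetical data; in Spark you would typically build `Salted_Key` with expressions like `concat` and `floor(rand() * factor)`, and the dict lookup below stands in for the actual join):

```python
import random
from collections import Counter

SALT_FACTOR = 3
random.seed(0)  # deterministic salts, just for this example

big = [("A", v) for v in range(9_000)]  # heavily skewed big table
small = [("A", 1)]                      # small lookup table

# Big table: append a random salt in [0, SALT_FACTOR) to each key.
big_salted = [(f"{k}_{random.randrange(SALT_FACTOR)}", v) for k, v in big]

# Small table: replicate every row once per possible salt, so each
# salted key on the big side still finds its match.
small_salted = [(f"{k}_{s}", v) for k, v in small for s in range(SALT_FACTOR)]

# Join on Salted_Key (a dict lookup stands in for Spark's join).
lookup = dict(small_salted)
joined = [(k, v, lookup[k]) for k, v in big_salted if k in lookup]

# One hot key became SALT_FACTOR distinct keys, so the shuffle can now
# spread the rows over up to SALT_FACTOR partitions instead of one.
spread = Counter(k for k, _ in big_salted)
assert len(joined) == len(big)          # salting loses no rows
assert len(spread) == SALT_FACTOR       # all salts were actually used
```

After the join, the salt can be stripped (for example by splitting the key on `_`) to recover the original join key.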


There are a few more techniques to solve the data skewness problem, such as Structured Streaming and Sampling.

I would like to thank my former colleague Alex Healey who introduced these techniques to our Data Team.

Feel free to reach out to me with any questions, and if you think I missed a point, feel free to leave a comment.