What's new in Spark 3.0?
As a Data Engineer, I look forward to an improved Spark version every year, and this year the community introduced a major, long-awaited upgrade known as Spark 3.0. The stable release of Spark 3.0 is planned for the next few months; however, earlier this month a preview release was made available to everyone on request for testing purposes, as the community says the APIs and functionality are not yet stable.
Despite that, it is still worth discussing the benefits of the new version, which we hope will soon be available to everyone. The new version improves the optimizer and the data catalog by adding important new features:
- Data Catalog
- Query Optimization
  - Auto Broadcast Join
  - Dynamic Partition Pruning
Data Catalog
Spark 3.0 introduces a new data source API, known as DataSourceV2, which gives the ability to plug in various data sources in an efficient manner. Through unified Spark Streaming and Batch APIs, Spark will be able to find data across all the data sources that exist in your data warehouse and query them much faster, without putting additional load on the network. Personally, I am waiting for it so I can test this use case.
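To give a concrete feel for this, here is a minimal PySpark sketch of how a pluggable catalog can be wired in. The class name com.example.MyCatalog and the table names are hypothetical; the spark.sql.catalog.&lt;name&gt; configuration is the hook Spark 3.0 exposes for DataSourceV2 catalog plugins.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datasourcev2-catalog-demo")
    # Register a hypothetical DataSourceV2 catalog implementation under the name "my_catalog"
    .config("spark.sql.catalog.my_catalog", "com.example.MyCatalog")
    .getOrCreate()
)

# Tables behind the plugged-in catalog are addressed with a three-part name:
# <catalog>.<namespace>.<table>
spark.sql("SELECT * FROM my_catalog.sales_db.orders").show()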
Query Optimization
Optimization plays an important role in writing efficient Spark applications, and Spark 3.0 brings two optimization techniques that can help speed up jobs by up to 10x.
Auto Broadcast Join
In a scenario where a small table is joined with a large table and shuffling takes place, it's time-consuming, right? Spark 3.0 adds support to automatically convert a normal join into a broadcast join when needed, which eliminates the shuffling because the smaller table is copied to each executor node.
Usually, this is done manually by calling the broadcast function on the smaller table, like below.
from pyspark.sql.functions import broadcast
large_table.join(broadcast(small_table), key)
But in the new version, we can achieve the same with:
large_table.join(small_table, key)
So, if you have poorly written legacy code where a join does not use broadcast, and you now realize you could improve performance by adding the broadcast function, Spark 3.0 will do it for you. Just upgrade the Spark version; it will save your time and improve performance by a big factor.
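As a rough sketch of how this could look in practice (the table names and threshold are hypothetical), Spark 3.0's adaptive query execution can be enabled so that a plain join is converted to a broadcast join at runtime when one side turns out to be small enough:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("auto-broadcast-join-demo")
    # Let Spark re-optimize the plan at runtime (adaptive query execution)
    .config("spark.sql.adaptive.enabled", "true")
    # Tables smaller than this threshold are eligible for broadcasting (10 MB here)
    .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
    .getOrCreate()
)

large_table = spark.table("sales")    # hypothetical large fact table
small_table = spark.table("stores")   # hypothetical small dimension table

# No explicit broadcast() hint; Spark decides based on the table size
result = large_table.join(small_table, "store_id")
result.explain()  # expect a BroadcastHashJoin in the physical plan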
Dynamic Partition Pruning
Spark now has an optimization technique known as dynamic partition pruning, which already exists in traditional data warehouses. It speeds up processing by loading only the required data instead of the whole dataset. Spark introduced partition pruning in the 2.x versions; as an example, if you have a table with a filter on top, the usual flow is:
Table -> scan -> filter
Here, it scans all the data and then applies the filter. But partition pruning uses a push-down approach, which helps increase performance:
Table -> filter -> scan
Here, it scans only what's required, based on the filter.
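For example, here is a small sketch (with a hypothetical dataset and path, partitioned by a country column) where the filter is pushed down so only the matching partitions are scanned:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Assume /data/events was written partitioned by a "country" column (hypothetical)
events = spark.read.parquet("/data/events")
us_events = events.where(events["country"] == "US")  # the partition filter is pushed down to the scan
us_events.explain()  # PartitionFilters in the plan should list the country predicate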
But dynamic partition pruning is something more than that. A more complex case is a join scenario where a small table is joined with a larger one, and the filter is applied on the small table's side of the join key. In Spark 3.0, the small table is broadcast first and converted into a hash table; the larger table can then reuse that filter to scan only what is needed via dynamic partition pruning, which avoids redundant scanning and improves the performance of the system tremendously. In the worst case, the flow is:
scan -> join -> filter
To make it better, we do:
scan -> filter -> join
But now you will be able to do:
filter -> scan -> join
where the filter is reused from the smaller table; this is how it increases performance.
This helps in almost every query that combines a join and a filter, provided the data is partitioned correctly on that filter column.
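As a sketch of the join-plus-filter case described above (the table, column, and partition names are hypothetical; the configuration flag is on by default in Spark 3.0):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-partition-pruning-demo")
    # Enabled by default in Spark 3.0; shown here for clarity
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)

fact = spark.table("sales")   # large fact table, partitioned by date_key
dim = spark.table("dates")    # small dimension table

# The filter on the dimension side is broadcast and reused to prune
# partitions of the fact table at runtime
result = fact.join(dim, "date_key").where(dim["year"] == 2019)
result.explain()  # the scan of "sales" should show a dynamic partition pruning filter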
As mentioned at the start, to get all the benefits of Spark 3.0, we need to wait until a stable version is released. But overall, these Spark optimizations should boost performance by a big factor.