Optimizing Spark Cluster
Welcome to the first optimization blog of the Spark Optimization series, in this article we will learn what cluster to choose based on memory, cores and storage. The goal of this is to give you enough so you can start off with an optimal cluster and save some money. Since I have worked mainly in AWS, I will refer its services like EMR along the way. For those who are not aware of these terms, please read here to get an idea before moving on.
Start with the right EMR
Picking or selecting which type of EMR would be great for your job is not easy, especially if you are not aware of the underlying data and the transformation and aggregation you are going to apply. It is always better to plan out whatwould be the best for the job so you can make the most of it.
EMR gives us several optimized instances options; memory (r), compute (c), general (m), storage (h) and GPU (g) instances.
Few common things to consider when selecting an instance are:
Use Memory instance, if your logic is going to do a lot of caching and shuffling.
Use Compute instance, if your job is dependent on parallelism.
Use Storage instance, if you are writing big data into HDFS. Alternate, you can also increase the EBS on other instances.
Use GPU instance, if you are doing some heavy deep learning.
Second is Driver node, it does not do the main work, it's job is to send, fetches and connect between the executor, so usually a smaller one works.
However, sometimes your job fails as the driver node couldn’t handle the results. That scenario occurs when if you are reading thousands of partitions from external or internal storage, if you are doing some action operation that results in a big dataframe that driver cannot handle.
Bootstrapping steps
On startup, the user has the option to setup some bootstrapping steps that will run when the cluster spins up, these steps usually have environment setup like loading and pulling required packages and code from external source and setting up environment variables. However, it is always useful to consider puttinglight jobs in startup scripts otherwise your bootstrap might hang which delaysthe whole process and you pay for no reason.