Apache Spark - Frequently Asked Questions
Developers and engineers are by now familiar with Apache Spark and its place in the technology stack, yet a few basic questions keep coming up, both in conversations I have and across the internet.
This article compiles the most frequently asked questions about Spark, and I will try to keep updating it with more. If you think a question should be added here, please leave a comment.
What is a Fat Executor?
A fat executor setup means that you run one executor per node. For example, if you have a single-node cluster with 16 cores, you configure it to have a single executor that uses all 16 cores. This is good for throughput, but in practice you should strike a balance between fat and thin executors.
What is a Thin Executor?
A thin (or tiny) executor setup means that you run one executor per core. For example, if you have a cluster with 16 cores, you configure it to have 16 executors, each with a single core. This is good for parallelism, but in practice you should strike a balance between fat and thin executors.
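To make the two extremes concrete, here is a minimal sketch in plain Python that works out the executor count for a given executor size. The helper `executor_layout` is hypothetical and written only for illustration; `spark.executor.cores` and `spark.executor.instances` are real Spark properties, but the sketch ignores any cores you would normally reserve for the operating system and cluster daemons.

```python
def executor_layout(nodes: int, cores_per_node: int, cores_per_executor: int) -> dict:
    """Return the executor settings for a cluster, given the cores per executor.

    Hypothetical helper: assumes every core on every node is available to Spark.
    """
    if cores_per_node % cores_per_executor != 0:
        raise ValueError("cores_per_executor must divide cores_per_node evenly")
    executors_per_node = cores_per_node // cores_per_executor
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.instances": nodes * executors_per_node,
    }


# One node with 16 cores, as in the examples above.
fat = executor_layout(nodes=1, cores_per_node=16, cores_per_executor=16)
thin = executor_layout(nodes=1, cores_per_node=16, cores_per_executor=1)
balanced = executor_layout(nodes=1, cores_per_node=16, cores_per_executor=4)

print(fat)       # {'spark.executor.cores': 16, 'spark.executor.instances': 1}
print(thin)      # {'spark.executor.cores': 1, 'spark.executor.instances': 16}
print(balanced)  # {'spark.executor.cores': 4, 'spark.executor.instances': 4}
```

The `balanced` layout is just one possible middle ground between the two; the right executor size depends on your workload and memory per node.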
What is Memory Overhead?
Memory overhead is the off-heap memory allocated to an executor in addition to its heap. It accounts for virtual machine overheads, interned strings and other native overheads (as described in the Spark configuration documentation). Keeping data off-heap avoids garbage collection overhead, which can become significant when memory usage runs into gigabytes. Objects are stored off-heap in a serialized state and have to be deserialized before they are used.
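As a rough illustration of how much extra memory this adds, the sketch below estimates the default overhead for a given executor heap size. It assumes the default documented for `spark.executor.memoryOverhead` on JVM jobs (10% of executor memory, with a minimum of 384 MiB); the function names are hypothetical, and the exact rounding is an assumption, so treat the numbers as approximate.

```python
MIN_OVERHEAD_MIB = 384   # assumed minimum overhead, per the Spark configuration docs
OVERHEAD_FACTOR = 0.10   # assumed default overhead factor for JVM jobs


def default_memory_overhead_mib(executor_memory_mib: int) -> int:
    """Estimate the default memory overhead (in MiB) for one executor."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))


def container_request_mib(executor_memory_mib: int) -> int:
    """Heap plus overhead: roughly what the cluster manager is asked for."""
    return executor_memory_mib + default_memory_overhead_mib(executor_memory_mib)


# An 8 GiB executor: 10% of 8192 MiB is above the 384 MiB minimum.
print(default_memory_overhead_mib(8192))  # 819
print(container_request_mib(8192))        # 9011

# A 2 GiB executor: 10% would be 204 MiB, so the 384 MiB minimum applies.
print(default_memory_overhead_mib(2048))  # 384
print(container_request_mib(2048))        # 2432
```

This is why an executor configured with 8 GiB of heap ends up requesting noticeably more than 8 GiB from the cluster manager.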
What should be the correct number of partitions?
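The correct number of partitions depends on the number of cores being used: the number of partitions should be equal to, or a multiple of, the number of cores, so that every core has a task to run.
For example, with 4 cores the number of partitions should be 4, or a multiple of it such as 8 or 12. With only 2 partitions, two of the four cores would sit idle; 4 is the smallest count that gives maximum parallelism in this case.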
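A quick way to see how a partition count maps onto cores is to count the scheduling rounds it needs and the cores left idle in the last round. The sketch below is plain Python with a hypothetical helper, `task_waves`, and it assumes one task per partition and that all tasks take about the same time.

```python
def task_waves(num_partitions: int, total_cores: int) -> tuple:
    """Return (rounds, idle_cores_in_last_round) for a stage.

    Hypothetical helper: assumes one task per partition, one core per task,
    and tasks of roughly equal duration.
    """
    full_rounds, remainder = divmod(num_partitions, total_cores)
    rounds = full_rounds + (1 if remainder else 0)
    idle_in_last_round = (total_cores - remainder) % total_cores
    return rounds, idle_in_last_round


cores = 4
for partitions in (2, 4, 6, 8):
    rounds, idle = task_waves(partitions, cores)
    print(f"{partitions} partitions on {cores} cores: {rounds} round(s), {idle} idle core(s) in the last round")
```

With 4 cores, 4 and 8 partitions leave no core idle, while 2 and 6 partitions each leave two cores without work in the last round.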
What is Salting?
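Salting is a technique for overcoming data skew in operations that require a shuffle, such as a join. A random value (the salt) is added to the skewed key so that rows sharing that key are spread across several partitions instead of piling up in one.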
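The idea can be sketched without a cluster. The plain-Python example below is an illustration only, not Spark code: the data, the function names and `NUM_SALTS` are all made up. It salts the keys on the large, skewed side of a join and replicates the small side once per salt value so that every salted key still finds its match.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt buckets


def salt_large_side(rows, num_salts, rng):
    """Attach a random salt to each key on the large (skewed) side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so all salted keys match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def join(left, right):
    """Simple hash join on the (key, salt) pair; returns (key, left, right) rows."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key[0], lv, rv) for key, lv in left for rv in lookup.get(key, [])]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Skewed data: "hot_key" dominates the large side.
orders = [("hot_key", i) for i in range(1000)] + [("rare_key", 1000)]
customers = [("hot_key", "Alice"), ("rare_key", "Bob")]

salted_orders = salt_large_side(orders, NUM_SALTS, rng)
salted_customers = explode_small_side(customers, NUM_SALTS)
joined = join(salted_orders, salted_customers)

# The hot key is now split across up to NUM_SALTS distinct join keys.
rows_per_key = Counter(key for key, _ in salted_orders)
print(rows_per_key)
print(len(joined))  # 1001
```

In PySpark the same idea is usually expressed by adding a salt column to the large DataFrame (for example with `rand()`) and exploding a range of salt values on the small one. The trade-off is that the small side grows by a factor of `NUM_SALTS`.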
Read more: Salting Explained!
Feel free to reach out to me, or leave a comment, if there is a question you would like to see included.