Common Issues faced in Spark
There are several issues almost everyone faces when they start using Spark, whether at work or for fun. These issues come up every other day, and finding easy solutions for them can sometimes be hard. I decided to write down solutions to the common problems I have faced, and will probably face again in the future. Thanks to my teammates who helped me solve those issues, and thanks to the internet, for sure.
Most of the issues are due to poor configuration; memory management, and understanding it, plays a big role here, so adjusting the memory settings helps a lot in general. Adjusting the cores also helps in many scenarios, as you will find below.
I would recommend that you understand how Spark memory management works; read the Spark memory management article by Databricks.
The following are a few common issues with their solutions.
Java Heap OOM
Java heap issues usually come up when the driver node is unable to handle the job due to a lack of memory. A few solutions: try setting the driver memory:
--conf spark.driver.memory=<value> (default is 1g only)
Or try freeing up memory by reducing the storage memory fraction:
--conf spark.storage.memoryFraction=<value> (default is 0.6; this is a legacy setting, superseded by spark.memory.fraction since Spark 1.6)
If you fail to solve it by the above two, then try increasing the cluster size.
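Putting the settings above together, a spark-submit invocation could look like the sketch below; the class name and jar are hypothetical placeholders, and the values are starting points rather than recommendations:

```shell
# Hypothetical job: the class name, jar, and values are placeholders.
# In client mode the driver JVM is already running by the time --conf is
# read, so the --driver-memory flag is the reliable way to set driver memory.
spark-submit \
  --driver-memory 4g \
  --conf spark.storage.memoryFraction=0.4 \
  --class com.example.MyJob \
  my-job.jar
```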
Stack Overflow
A Stack Overflow issue comes up when the stack has no space left; this can occur on the driver and/or the executors. Try adjusting the following:
--conf spark.driver.extraJavaOptions=<value> (e.g. '-Xss64m' to raise the thread stack size)
--conf spark.executor.extraJavaOptions=<value> (e.g. '-Xss64m')
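As a sketch, the stack size could be raised like this; the class name and jar are hypothetical placeholders:

```shell
# Raise the JVM thread stack size on both the driver and the executors.
# Note: setting extraJavaOptions replaces any JVM options you were
# already passing, so merge them into one string if needed.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Xss64m" \
  --conf "spark.executor.extraJavaOptions=-Xss64m" \
  --class com.example.MyJob \
  my-job.jar
```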
Heartbeat Interval
When the driver does not get a response from an executor, it could be because the task size is too big. Try decreasing the task size by splitting the data into more partitions, or try increasing the network timeout and the heartbeat interval:
--conf spark.executor.heartbeatInterval=<value> (default is 10s; keep it well below spark.network.timeout)
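For example, the following flags (values are illustrative, not recommendations) give executors more headroom before the driver gives up on them; Spark requires the heartbeat interval to stay significantly below the network timeout:

```shell
# Illustrative values: heartbeat every 30s, time out after 300s.
# spark.executor.heartbeatInterval must be well below spark.network.timeout.
--conf spark.executor.heartbeatInterval=30s
--conf spark.network.timeout=300s
```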
Failed to Send RPC
So far, I have not been able to pin down the reason behind this problem; it comes up on different occasions. It could be that the executors are not getting enough memory on the data nodes. If you know, add a comment. An explanation can be found here.
Out of physical memory
This usually comes up with a suggestion in the error message itself: try increasing the memory overhead. Pretty straightforward.
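The relevant setting is the executor memory overhead; as a sketch (the value is illustrative):

```shell
# Extra off-heap memory allocated per executor container.
# On Spark older than 2.3 with YARN, the key is
# spark.yarn.executor.memoryOverhead instead.
--conf spark.executor.memoryOverhead=2g
```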
GC Overhead Limit Exceeded
This occurs when the GC takes too much time to recover space from the Java heap. The reason is usually that many cores are running tasks in parallel, and the GC cannot free space fast enough to allocate to new objects. Try increasing the executor memory or reducing the number of cores.
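As a sketch, both knobs can be set on the spark-submit command line (values are illustrative, not recommendations):

```shell
--executor-memory 8g   # more heap for the GC to work with
--executor-cores 2     # fewer tasks competing for that heap in parallel
```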
Too Large DataFrame
This occurs when the DataFrame is way too big for Spark to handle with the current configuration. Increasing the shuffle partitions should fix this issue; otherwise, try increasing the executor memory.
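Shuffle partitions for DataFrame operations are controlled by spark.sql.shuffle.partitions; an illustrative bump:

```shell
# Default is 200; more partitions means each shuffle partition is
# smaller and easier for an executor to handle.
--conf spark.sql.shuffle.partitions=800
```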
Unhealthy Nodes
Nodes becoming unhealthy can have many causes; a common one is that the local directories on the cluster nodes fill up, for example with big log files. To prevent this, avoid printing anything to the logs unnecessarily, especially inside loops.
Fetch Failed Exception
This usually occurs due to a change in cluster size while the job is running; if not, then it is most likely timeouts due to a small number of partitions. Try adjusting the following:
--conf spark.network.timeout=<value> (default is 120s)
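Alongside the network timeout, the shuffle fetch retry settings are often tuned for this error; the values below are illustrative, not recommendations:

```shell
--conf spark.network.timeout=300s       # default is 120s
--conf spark.shuffle.io.maxRetries=10   # default is 3
--conf spark.shuffle.io.retryWait=15s   # default is 5s
```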
Serialized Result Size
You might have seen something like this:
Total size of serialized results of 262114 tasks (2 GB) is bigger than spark.driver.maxResultSize (2 GB)
This is a common problem that comes up when an executor tries to send a serialized result to the driver that is bigger than the size the driver can receive.
The easiest and shortest solution is to increase that size:
--conf spark.driver.maxResultSize=<value> // 0 will make it unlimited
However, the other solution is to check whether the data is skewed and one of the executors is trying to send a result of that size. If so, a code review with proper repartitioning and salting would be helpful.
Stage Size Exception
A stage size failure occurs when the generated Java bytecode for a stage exceeds the 64 KB limit. A common error looks like this:
'org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426' grows beyond 64 KB
It happens when the plan for a single stage is too big, for example when it has too many case statements. The workaround here is to cache or write out the intermediate DataFrame so the plan is truncated and the space clears up. If you run from spark-shell it might still work, but spark-submit terminates the job with that error.