Common Issues faced in Spark

Apr 09, 2018

There are several issues everyone faces when they start using Spark, either at work or for fun. These issues come up every other day, and finding easy solutions for them can be hard. I decided to collect solutions to the common issues that I have faced and will probably face again in the future. Thanks to my teammates who helped me solve these issues, and thanks to the internet, of course.

Most of the issues are due to poor configuration. Memory management, and understanding how it works, plays a big role here, and adjusting the following helps a lot in general.

--executor-memory <value>
--conf spark.yarn.executor.memoryOverhead=<value>
--driver-memory <value>

Adjusting the number of cores may also help in many of the scenarios below.

--executor-cores <value>
--driver-cores <value>
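To make the flags above concrete, here is a minimal Python sketch that assembles a spark-submit command line from them. The helper name build_spark_submit, the default values, and the application file my_job.py are all hypothetical and only there for illustration; pick values that match your own cluster.

```python
def build_spark_submit(app, executor_memory="4g", driver_memory="2g",
                       executor_cores=2, driver_cores=1, overhead_mb=512):
    """Assemble a spark-submit argument list from the memory and core settings.

    All defaults here are illustrative assumptions, not recommendations.
    """
    return [
        "spark-submit",
        "--executor-memory", executor_memory,
        "--driver-memory", driver_memory,
        "--executor-cores", str(executor_cores),
        "--driver-cores", str(driver_cores),
        # Overhead is given in MiB when no unit suffix is supplied.
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
        app,
    ]


# my_job.py is a hypothetical application file.
cmd = build_spark_submit("my_job.py", executor_memory="8g", executor_cores=4)
print(" ".join(cmd))
```

Keeping the command as a list (rather than one long string) makes it easy to pass to subprocess.run later without worrying about shell quoting.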

I would recommend that you understand how Spark memory management works; read the Spark memory management article by Databricks.

The following are a few common issues with their solutions.

Java Heap OOM

Java heap issues usually come up when the driver node is unable to handle the job due to lack of memory. A few solutions: Try setting the driver memory:

--conf spark.driver.memory=<value> (default is 1g)

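Try freeing up memory for execution by reducing the storage memory fraction (a legacy setting, deprecated since Spark 1.6 and only read when spark.memory.useLegacyMode is enabled):

--conf spark.storage.memoryFraction=<value> (default is 0.6)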

If neither of the above solves it, try increasing the cluster size.

Stack Overflow

A stack overflow occurs when the thread stack has no space left; this can happen in the driver, the executor, or both. Try adjusting the thread stack size through the following.

--conf spark.driver.extraJavaOptions=<value> (for example '-Xss64m'; Spark sets no default)
--conf spark.executor.extraJavaOptions=<value> (for example '-Xss64m'; Spark sets no default)

Heart Beat interval

This error comes up when the driver does not get a response from an executor. This could be because the task size is too big; try decreasing the task size by splitting the data into more partitions, or increase the network timeout and heartbeat interval.

--conf spark.default.parallelism=<value>
--conf spark.network.timeout=<value>
--conf spark.executor.heartbeatInterval=<value> (default is 10s)
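When raising these two timeouts, keep in mind that the Spark documentation says the heartbeat interval should be significantly less than spark.network.timeout. The sketch below is one way to sanity-check a pair of values before submitting. The helper names and the factor of 4 are my own assumptions (Spark does not define an exact ratio), and the parser only handles a few of the time suffixes Spark accepts.

```python
import re

# Only a subset of Spark's time suffixes is handled in this sketch.
_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000, "min": 60_000, "h": 3_600_000}


def to_millis(value: str) -> int:
    """Convert a duration such as '10s' or '2m' to milliseconds."""
    match = re.fullmatch(r"(\d+)\s*(ms|s|min|m|h)", value.strip().lower())
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]


def heartbeat_is_safe(heartbeat="10s", network_timeout="120s", ratio=4):
    """Return True if the timeout is at least `ratio` times the heartbeat.

    `ratio` is an assumed safety margin, not a value defined by Spark.
    """
    return to_millis(heartbeat) * ratio <= to_millis(network_timeout)


print(heartbeat_is_safe("10s", "120s"))
print(heartbeat_is_safe("60s", "120s"))
```

The defaults used here (10s and 120s) are Spark's own defaults for the two settings.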

Failed to Send RPC

So far, I have not been able to find the reason behind this problem; it comes up on different occasions. It could be due to executors not getting enough memory on the data node. If you know, add a comment. Explanation found here.

Out of physical memory

This usually comes with a suggestion to try increasing the memory overhead, which is pretty straightforward.

--conf spark.yarn.executor.memoryOverhead=<value>
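To see why the overhead matters, it helps to work out how much memory YARN is actually asked for per executor. The sketch below assumes the documented default overhead of 10% of executor memory with a 384 MiB minimum; the function name is hypothetical, and YARN's rounding up to its minimum allocation size is not modeled. Note that from Spark 2.3 onward the setting is named spark.executor.memoryOverhead.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Estimate the memory (in MiB) requested from YARN for one executor.

    If no overhead is given, fall back to the default of
    max(384 MiB, 10% of executor memory).
    """
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb


print(yarn_container_mb(2048))        # small executor: the 384 MiB floor applies
print(yarn_container_mb(8192))        # larger executor: 10% applies
print(yarn_container_mb(8192, 2048))  # explicit overhead
```

If the container is killed for exceeding physical memory, the executor's off-heap usage is outgrowing the overhead part of this sum, which is why raising the overhead helps where raising the heap alone may not.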

Garbage Collector memory exceeds

This occurs when GC takes too much time to recover space from the Java heap. The reason is that many cores run their tasks in parallel, so the GC cannot free enough space to allocate new objects. Try increasing the executor memory or reducing the number of cores.

--executor-memory <value>
--executor-cores <value>
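A quick back-of-the-envelope calculation shows why both knobs help: all tasks running on an executor share the same heap, so the memory available per concurrent task is roughly the executor memory divided by the number of cores. The function below is a deliberately rough sketch under that assumption; it ignores reserved memory and the storage/execution split.

```python
def memory_per_task_mb(executor_memory_mb, executor_cores):
    """Rough heap share (MiB) per concurrently running task on one executor."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return executor_memory_mb // executor_cores


print(memory_per_task_mb(8192, 8))   # 8 tasks share an 8 GiB heap
print(memory_per_task_mb(8192, 4))   # halving the cores doubles the share
print(memory_per_task_mb(16384, 8))  # so does doubling the memory
```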

Too large dataframe

This occurs when the DataFrame is too big to be handled by Spark with the current settings. Increasing the number of shuffle partitions should fix this issue; otherwise, try increasing the executor memory.

--conf spark.sql.shuffle.partitions=<value>
--executor-memory <value>
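One way to pick a value for spark.sql.shuffle.partitions is to divide the amount of shuffled data by a target partition size. The sketch below uses 128 MiB per partition, which is only a common rule of thumb and an assumption on my part; the floor of 200 is Spark's default for this setting. The function name is hypothetical.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024 ** 2,
                               minimum=200):
    """Suggest a shuffle partition count for a given amount of shuffle data.

    target_partition_bytes is a rule-of-thumb assumption; tune it for your job.
    """
    return max(minimum, math.ceil(shuffle_bytes / target_partition_bytes))


print(suggest_shuffle_partitions(100 * 1024 ** 3))  # 100 GiB of shuffle data
print(suggest_shuffle_partitions(1 * 1024 ** 3))    # small job: default floor
```

The shuffle size itself can be read from the stage details in the Spark UI after a first run.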

Nodes unhealthy

Nodes can become unhealthy for many reasons; a common one is that the local directory on a cluster node fills up. Big log files, for example, can fill it up; to prevent this, avoid printing anything to the logs, especially in loops.

Fetch Failed Exception

This usually occurs due to a change in cluster size while the job is running; if not, it is most likely timeouts caused by a small number of partitions. Try configuring the following two.

--conf spark.sql.shuffle.partitions=<value>
--conf spark.network.timeout=<value>

Serialized Result Size

You might have seen something like this:
Total size of serialized results of 262114 tasks (2 GB) is bigger than spark.driver.maxResultSize (2 GB)
This is a common problem that comes up when executors try to send serialized results to the driver whose total size is bigger than what the driver is allowed to receive.
The easiest and shortest solution is to increase the limit.

--conf spark.driver.maxResultSize=<value> // 0 will make it unlimited
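The check behind this error is simple arithmetic: the serialized results of all tasks are added up and compared with the limit. Here is a small sketch of that comparison; the helper names are hypothetical, the size parser handles only k/m/g suffixes, and the default of 1g is Spark's default for spark.driver.maxResultSize.

```python
_SIZE_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(size: str) -> int:
    """Convert a size such as '512m' or '2g' to bytes (k/m/g suffixes only)."""
    size = size.strip().lower()
    if size[-1] in _SIZE_UNITS:
        return int(size[:-1]) * _SIZE_UNITS[size[-1]]
    return int(size)  # a plain number is treated as bytes in this sketch


def exceeds_max_result_size(num_tasks, avg_result_bytes, max_result_size="1g"):
    """Return True if the combined task results would exceed the limit."""
    limit = to_bytes(max_result_size)
    if limit == 0:  # 0 means unlimited
        return False
    return num_tasks * avg_result_bytes > limit


# With 262114 tasks, roughly 8 KiB per task is already enough to hit 2g.
print(exceeds_max_result_size(262114, 8193, "2g"))
print(exceeds_max_result_size(262114, 8192, "2g"))
```

With that many tasks, even tiny per-task results add up, so reducing the number of tasks or the amount of data collected to the driver can help as much as raising the limit.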

Alternatively, check whether the data is skewed and one of the executors is trying to send that much data. If so, a code review with proper repartitioning and salting would be helpful.
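As a rough illustration of why salting helps with skew, here is a small pure-Python sketch with no Spark involved. The partition_of function is a stand-in for a hash partitioner, and the round-robin salt is a simplification I chose to keep the example deterministic; in a real job you would typically add a random salt column and aggregate or join in two steps.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Stand-in for a hash partitioner (CRC32 keeps the result deterministic)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many records land in each partition."""
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return [loads.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, num_salts):
    """Append a salt to every key so one hot key becomes num_salts keys."""
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


# One very hot key plus a few ordinary ones (made-up data).
keys = ["hot"] * 1000 + ["a", "b", "c", "d"] * 10

before = partition_loads(keys, 8)
after = partition_loads(salt_keys(keys, 8), 8)
print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

Before salting, every record with the hot key lands in the same partition; after salting, those records are spread over several partitions, so no single task has to carry all of them.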

Stage Size Exception

A stage size failure occurs when the generated Java bytecode exceeds the limit. A common error would be something like this: 'org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426' grows beyond 64 KB

It happens when the plan for a single stage is too big, for example when it has too many case statements. The workaround is to cache or write out the DataFrame partway through, so that the work is split into smaller pieces. If running from spark-shell, the job might still work, but spark-submit terminates the flow with that error.
