java.lang.OutOfMemoryError for a simple rdd.count() operation

I'm having a lot of trouble getting a simple count operation to work on about 55 files on HDFS, roughly 1B records in total. Both spark-shell and PySpark fail with OOM errors. I'm using YARN, MapR, Spark 1.3.1, and HDFS 2.4.1. (It fails in local mode as well.) I've tried following the tuning and configuration advice and throwing more and more memory at the executors. My configuration is:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("pyspark-testing")
        .set("spark.executor.memory", "6g")
        .set("spark.driver.memory", "6g")
        .set("spark.executor.instances", 20)
        .set("spark.yarn.executor.memoryOverhead", "1024")
        .set("spark.yarn.driver.memoryOverhead", "1024")
        .set("spark.yarn.am.memoryOverhead", "1024")
        )
sc = SparkContext(conf=conf)
sc.textFile('/data/on/hdfs/*.csv').count()  # fails every time

The job gets split into 893 tasks, and after about 50 tasks complete successfully, many start failing. I see ExecutorLostFailure in the application's stderr, and when I dig through the executor logs I find errors like the following:

15/06/24 16:54:07 ERROR util.Utils: Uncaught exception in thread stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
    at java.nio.CharBuffer.allocate(CharBuffer.java:331)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
    at org.apache.hadoop.io.Text.decode(Text.java:406)
    at org.apache.hadoop.io.Text.decode(Text.java:383)
    at org.apache.hadoop.io.Text.toString(Text.java:281)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
    at java.nio.CharBuffer.allocate(CharBuffer.java:331)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
    at org.apache.hadoop.io.Text.decode(Text.java:406)
    at org.apache.hadoop.io.Text.decode(Text.java:383)
    at org.apache.hadoop.io.Text.toString(Text.java:281)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

In the stdout:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
#   Executing /bin/sh -c "kill 16490"...

In general, I think I understand OOM errors and how to troubleshoot them, but I'm stuck conceptually here. This is just a simple count. I don't understand how the Java heap could possibly overflow when the executors have ~3G heaps. Has anyone run into this before, or have any pointers? Is there something going on under the hood that would shed light on the issue?
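
One idea (sketched below with placeholder file names, since I haven't confirmed this is the cause) is to check whether any single file contains an enormous "line": judging by the stack trace, each line is decoded into one in-memory string (Text.toString) before it is shipped to the Python worker, so a single file with broken or missing newlines could blow the heap on its own, no matter how much memory the executors get.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("max-line-length-check"))
sc = SparkContext(conf=conf)

# Placeholder names -- the real list comes from `hdfs dfs -ls /data/on/hdfs`
suspect_files = ["/data/on/hdfs/part-0001.csv",
                 "/data/on/hdfs/part-0002.csv"]

for path in suspect_files:
    # map(len) replaces each line with its length and reduce(max) keeps only
    # the largest value, so the driver receives a single integer per file.
    # If one of these per-file jobs itself blows up, that file is the suspect.
    longest = sc.textFile(path).map(len).reduce(max)
    print("%s: longest line = %d characters" % (path, longest))
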

Update:

I've also noticed that when I specify the parallelism explicitly (for example sc.textFile(..., 1000)), even setting it to the same number of tasks as before (893), the created job has 920 tasks, all but the last of which complete without error. The very last task then hangs indefinitely. This seems exceedingly strange!


It turns out that the issue I was having was caused by a single corrupted file. Even running a simple cat or wc -l on that file would hang the terminal.
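
For anyone who hits something similar, here is a rough sketch (with placeholder file names; the real list can be pulled with hdfs dfs -ls) of how the bad file could be isolated without cat-ing anything locally: count each file in its own small job and watch which one fails or hangs.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("per-file-count"))
sc = SparkContext(conf=conf)

# Placeholder names; enumerate the real files with `hdfs dfs -ls /data/on/hdfs`
files = ["/data/on/hdfs/part-0001.csv",
         "/data/on/hdfs/part-0002.csv"]

for path in files:
    try:
        print("%s: %d records" % (path, sc.textFile(path).count()))
    except Exception as e:
        # A job whose executors are OOM-killed fails and lands here; a job
        # that hangs instead shows up as the one stuck in the Spark UI.
        print("%s: FAILED (%s)" % (path, e))
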