This is the story of how to understand and fix out-of-memory errors in Spark. OOM errors are among the most common and most frustrating Spark problems: the job dies, the error message points to a line deep in the JVM stack, and it’s not obvious what ran out of memory, why, or how much more it needed. The key to fixing OOM errors is knowing which memory was exhausted, why it grew beyond its limit, and what to change. This story maps out the main OOM scenarios, their symptoms, and their remedies.
Before diagnosing OOMs, you need to know how Spark’s memory is divided. Each executor JVM has:
The driver has its own heap, separate from executors.
Think of an executor’s memory like the workspace in a restaurant kitchen. Reserved memory is the fixed equipment bolted to the floor (ovens, sinks)—you can’t use that space for anything else. User memory is the chef’s personal prep corner. Spark memory is the shared counter space, divided between “active cooking” (execution) and “holding prepared ingredients” (storage). If cooking spills into the holding area, that’s acceptable. But if the combined cooking and holding exceeds the total counter space, the kitchen grinds to a halt.
Symptom: task fails with java.lang.OutOfMemoryError: Java heap space or GC overhead limit exceeded. Often appears on tasks with large inputs or during aggregations and joins on wide data.
Cause: the task’s working memory (sort buffer, aggregation hash map, join build-side hash map) grew beyond the executor heap’s capacity. The most common triggers:
spark.sql.shuffle.partitions = 200 but you’re processing 2 TB of data, each partition is ~10 GB. One task must hold that in memory.groupBy with very high cardinality causes the in-memory hash map to grow large before spilling.Too few shuffle partitions is like dividing a city’s entire water supply into only 10 buckets. Each bucket is enormous; no single person can lift one. Increasing shuffle partitions is like using 2,000 normal buckets instead—each is manageable and can be carried without strain.
Fixes:
spark.sql.shuffle.partitions to reduce per-task data volume.--executor-memory) to give each task more headroom.spark.sql.autoBroadcastJoinThreshold.Symptom: executor OOMs, but the task input is small and there’s no obvious aggregation. The stack trace points to user code or a large collection being iterated.
Cause: code in the closure captures or builds a large data structure. Common examples: reading the entire partition into a list, building a large HashMap from data in the task, or loading a large model file inside the task function.
Capturing a large object in a closure is like asking each of 500 delivery drivers to carry a full copy of the city’s entire street map book in their van, just in case. One book per driver, 500 copies—even though a single shared map on a server would do. The fix is broadcast: one shared copy, accessed by reference.
Fixes:
mapPartitions for per-partition initialization rather than per-row.Symptom: executor dies with Direct buffer memory error or the native process is killed by the OS. In Kubernetes, the pod is OOMKilled.
Cause: memory used outside the JVM heap—Arrow buffers, off-heap Tungsten storage, or native library allocations—exceeded the container or OS memory limit. The JVM heap limit (-Xmx) only controls heap memory; native and direct buffer memory has a separate budget.
Fixes:
spark.executor.memoryOverhead (YARN) or spark.executor.memoryOverheadFactor to reserve more non-heap memory. Default is often insufficient for Arrow-heavy PySpark workloads.spark.sql.execution.arrow.maxRecordsPerBatch).spark.memory.offHeap.size is accounted for in the container memory request.Symptom: driver fails with java.lang.OutOfMemoryError. Job submission fails, or the job dies partway through.
Cause: the driver heap is used for the application code, DAG/stage metadata, task result objects, broadcast variable data, and any data collected from executors. The most common driver OOM causes:
collect() on a large DataFrame: collect() brings all rows to the driver. For a 10 GB DataFrame this puts 10 GB of data on the driver heap.A
collect()on a large DataFrame is like funnelling an entire lake through a single garden hose into a bathtub. The lake (cluster memory) has plenty of room for the water. The bathtub (driver heap) does not. The fix is to write the water to a reservoir (storage) and read from there, rather than routing it through the bathtub.
Fixes:
collect() with write to storage and read back separately.--driver-memory high enough for broadcast variables and result accumulation.toPandas() calls, ensure Arrow optimization is enabled.Symptom: task fails with a Python process crash or MemoryError from Python.
Cause: the Python worker process—a separate OS process from the executor JVM—ran out of memory. For mapPartitions with Python or ForeachBatch writing large objects, the entire partition or batch may be materialized in Python memory at once.
Fixes:
spark.python.worker.memory to configure the Python worker’s memory limit.spark.sql.execution.arrow.maxRecordsPerBatch to send smaller Arrow batches to Python.The OOM error message and stack trace tell you which heap ran out. Java heap space is the JVM heap—execution, storage, or user memory. GC overhead limit exceeded means the JVM spent more than 98% of its time in GC trying to free memory—effectively the heap is full. Direct buffer memory is off-heap. A pod OOMKill on Kubernetes points to container-level memory.
The Spark UI’s Executors tab shows GC time per executor. High GC fraction (> 20–30%) means memory pressure before the OOM hit. The Stage detail shows spill metrics: if spill is present, memory was insufficient but the job didn’t die—increasing memory or partition count would reduce spill.
collect() without knowing the data size; prefer write.memoryOverhead generously for PySpark and Arrow-heavy workloads.Spark OOM errors fall into five categories, each with a distinct cause and remedy. Executor execution OOM: per-task working memory exceeds the heap—fix with more partitions, more executor memory, or a different join strategy. Executor user memory OOM: large objects allocated by user code in closures—fix with broadcast variables or mapPartitions. Off-heap/container OOM: native memory exceeds the container limit—fix with memoryOverhead or offHeap.size. Driver OOM: collect(), large broadcasts, or result accumulation exhausts driver heap—fix with --driver-memory, write instead of collect. Python worker OOM: too much data materialized in the Python process—fix with smaller batches or chunked processing. The Spark UI (GC time, spill metrics, task duration) and the OOM stack trace together point to which category you’re in. Each category has a different fix, and diagnosing which memory ran out is the first and most important step.