sparklearning

Out of Memory: A Field Guide to Spark OOM Errors

This is the story of how to understand and fix out-of-memory errors in Spark. OOM errors are among the most common and most frustrating Spark problems: the job dies, the error message points to a line deep in the JVM stack, and it’s not obvious what ran out of memory, why, or how much more it needed. The key to fixing OOM errors is knowing which memory was exhausted, why it grew beyond its limit, and what to change. This story maps out the main OOM scenarios, their symptoms, and their remedies.


Spark’s memory model: a quick map

Before diagnosing OOMs, you need to know how Spark’s memory is divided. Each executor JVM has:

The driver has its own heap, separate from executors.

Think of an executor’s memory like the workspace in a restaurant kitchen. Reserved memory is the fixed equipment bolted to the floor (ovens, sinks)—you can’t use that space for anything else. User memory is the chef’s personal prep corner. Spark memory is the shared counter space, divided between “active cooking” (execution) and “holding prepared ingredients” (storage). If cooking spills into the holding area, that’s acceptable. But if the combined cooking and holding exceeds the total counter space, the kitchen grinds to a halt.


OOM type 1: executor heap — execution memory

Symptom: task fails with java.lang.OutOfMemoryError: Java heap space or GC overhead limit exceeded. Often appears on tasks with large inputs or during aggregations and joins on wide data.

Cause: the task’s working memory (sort buffer, aggregation hash map, join build-side hash map) grew beyond the executor heap’s capacity. The most common triggers:

Too few shuffle partitions is like dividing a city’s entire water supply into only 10 buckets. Each bucket is enormous; no single person can lift one. Increasing shuffle partitions is like using 2,000 normal buckets instead—each is manageable and can be carried without strain.

Fixes:


OOM type 2: executor heap — user memory and large closures

Symptom: executor OOMs, but the task input is small and there’s no obvious aggregation. The stack trace points to user code or a large collection being iterated.

Cause: code in the closure captures or builds a large data structure. Common examples: reading the entire partition into a list, building a large HashMap from data in the task, or loading a large model file inside the task function.

Capturing a large object in a closure is like asking each of 500 delivery drivers to carry a full copy of the city’s entire street map book in their van, just in case. One book per driver, 500 copies—even though a single shared map on a server would do. The fix is broadcast: one shared copy, accessed by reference.

Fixes:


OOM type 3: executor off-heap

Symptom: executor dies with Direct buffer memory error or the native process is killed by the OS. In Kubernetes, the pod is OOMKilled.

Cause: memory used outside the JVM heap—Arrow buffers, off-heap Tungsten storage, or native library allocations—exceeded the container or OS memory limit. The JVM heap limit (-Xmx) only controls heap memory; native and direct buffer memory has a separate budget.

Fixes:


OOM type 4: driver heap

Symptom: driver fails with java.lang.OutOfMemoryError. Job submission fails, or the job dies partway through.

Cause: the driver heap is used for the application code, DAG/stage metadata, task result objects, broadcast variable data, and any data collected from executors. The most common driver OOM causes:

A collect() on a large DataFrame is like funnelling an entire lake through a single garden hose into a bathtub. The lake (cluster memory) has plenty of room for the water. The bathtub (driver heap) does not. The fix is to write the water to a reservoir (storage) and read from there, rather than routing it through the bathtub.

Fixes:


OOM type 5: Python worker (PySpark)

Symptom: task fails with a Python process crash or MemoryError from Python.

Cause: the Python worker process—a separate OS process from the executor JVM—ran out of memory. For mapPartitions with Python or ForeachBatch writing large objects, the entire partition or batch may be materialized in Python memory at once.

Fixes:


Diagnosis: reading the error message and the UI

The OOM error message and stack trace tell you which heap ran out. Java heap space is the JVM heap—execution, storage, or user memory. GC overhead limit exceeded means the JVM spent more than 98% of its time in GC trying to free memory—effectively the heap is full. Direct buffer memory is off-heap. A pod OOMKill on Kubernetes points to container-level memory.

The Spark UI’s Executors tab shows GC time per executor. High GC fraction (> 20–30%) means memory pressure before the OOM hit. The Stage detail shows spill metrics: if spill is present, memory was insufficient but the job didn’t die—increasing memory or partition count would reduce spill.


The OOM prevention checklist


Bringing it together

Spark OOM errors fall into five categories, each with a distinct cause and remedy. Executor execution OOM: per-task working memory exceeds the heap—fix with more partitions, more executor memory, or a different join strategy. Executor user memory OOM: large objects allocated by user code in closures—fix with broadcast variables or mapPartitions. Off-heap/container OOM: native memory exceeds the container limit—fix with memoryOverhead or offHeap.size. Driver OOM: collect(), large broadcasts, or result accumulation exhausts driver heap—fix with --driver-memory, write instead of collect. Python worker OOM: too much data materialized in the Python process—fix with smaller batches or chunked processing. The Spark UI (GC time, spill metrics, task duration) and the OOM stack trace together point to which category you’re in. Each category has a different fix, and diagnosing which memory ran out is the first and most important step.