sparklearning

Elastic Executors: How Dynamic Allocation Grows and Shrinks the Cluster

This is the story of how a Spark application can grow and shrink its pool of executors while it runs, rather than reserving a fixed set of resources for the entire job. Dynamic resource allocation (DRA) is the mechanism behind this elasticity. Understanding how it works explains why a short ETL job doesn’t waste cluster resources during quiet periods, why a batch job that spikes in the middle can acquire extra capacity without restarting, and what the External Shuffle Service has to do with giving executors back safely.

The problem with fixed allocation

In static allocation—the original Spark model—you declare how many executors you want when you launch the application (--num-executors N), and those executors are reserved for the application’s entire lifetime. This is simple but wasteful. A typical Spark job is not uniformly busy: it reads data (tasks running on many cores), then hits a wide shuffle (most cores idle while map output accumulates), then reduces (cores busy again), then writes output (tasks running), then the job ends.

Between each active phase, many executors sit idle, consuming cluster memory and CPU that another job could be using.

Static allocation is like booking 50 taxis for an airport transfer, even though only 10 people are travelling at a time. For most of the trip, 40 taxis idle at red lights burning fuel, while other passengers in the city can’t get a cab. Dynamic allocation is the ride-share model: cars join when demand spikes and drop off when demand falls. The total fleet handles more demand, and no car sits idle for an hour.

On a shared cluster—a YARN or Kubernetes environment serving dozens of teams—idle reserved executors can prevent other jobs from acquiring resources. Dynamic allocation solves both problems: the application holds only the executors it currently needs.

The two halves: requesting and releasing

Dynamic allocation operates through two independent timers running on the driver.

The scale-up timer watches the pending task queue: tasks that are ready to run but have no executor slot available. If there are pending tasks and the application has fewer executors than the configured maximum, the driver requests additional executors from the cluster manager. Requests are made in a multiplicative ramp: if you have N executors and need more, Spark requests up to 2N additional executors in the next round (doubling up to the maximum). This aggressive ramp-up means a job that suddenly has 500 tasks to run doesn’t wait through many small incremental requests.

The scale-down timer watches for idle executors: executors that have had no running tasks for a configurable duration (default: 60 seconds). When an executor has been idle long enough, the driver tells the cluster manager to remove it.

The scale-up timer is like a restaurant calling in extra kitchen staff when the dining room fills up. The more tables waiting for food, the faster new staff are called. The scale-down timer is like sending staff home when the rush ends and they’re standing around idle. The kitchen runs full-speed during the dinner rush and runs lean during the quiet afternoon.

There is also a separate, shorter timeout for executors that hold cached data. An executor with a cached RDD partition is more valuable to release late, because releasing it loses that cached data.

The shuffle service problem

The original design of dynamic allocation hit a fundamental obstacle: shuffle data. When a map stage completes, its output lives in local files on the executors that ran the map tasks. Reduce tasks read those files later. If a map-stage executor is released before the reduce stage runs, its shuffle files are gone—the reduce tasks fail with fetch failures, the map stage must re-run.

Releasing map-side executors early is like the warehouse workers clocking out and locking up before the delivery drivers arrive to collect the shipped goods. The drivers show up to find the warehouse dark and shuttered. The goods are gone—or rather, still inside a locked building with no one to give them out. The External Shuffle Service is a night watchman who stays behind: the workers go home, but the goods are still accessible through the watchman.

The solution is the External Shuffle Service (ESS): a long-lived daemon process that runs on each worker node, separate from the executor JVM. When the ESS is running, map-stage shuffle files are written so the ESS can serve them. If the executor is later released, the files remain on the node’s local disk and the ESS continues to serve them to reducers.

Dynamic allocation is only safe to enable when either (a) the External Shuffle Service is running, or (b) you are using shuffle-tracking (a Spark 3 feature that tracks which executors hold shuffle data and keeps them alive until their shuffle output has been consumed, without requiring a separate service).

Executor lifecycle under dynamic allocation

The executor lifecycle becomes more dynamic and observable. At application start, Spark requests spark.dynamicAllocation.initialExecutors executors. As the first stage’s tasks are submitted, the task queue fills and the scale-up timer fires. As stages complete and the task queue drains, idle executors accumulate and are released.

This means that under dynamic allocation, the set of executors is not stable across stages. Cached data on a released executor is lost. If your job caches an RDD in stage 2 and reads it in stage 5, and the caching executors were released between those stages, stage 5 will see cache misses and recompute.

Minimum, maximum, and initial executor counts

Three configuration values define the bounds:

spark.dynamicAllocation.minExecutors (default 0): the floor. The application will never have fewer than this many executors. Setting this above zero ensures a warm pool is always available, at the cost of holding resources even when idle.
spark.dynamicAllocation.maxExecutors (default infinity): the ceiling. The application will never request more than this many executors. This is the primary way to limit cluster impact.
spark.dynamicAllocation.initialExecutors (default = minExecutors): how many executors to request at startup.

A common production pattern: minExecutors = 1 (keep at least one warm executor to avoid cold-start latency for small queries), maxExecutors = 50 (cap cluster impact), and let DRA handle everything in between.

Dynamic allocation and Kubernetes

On Kubernetes, executors are pods. Dynamic allocation works by creating new executor pods (via the Kubernetes API) when scale-up is needed and deleting them when idle. Because Kubernetes pods are ephemeral, there is no persistent ESS equivalent. Shuffle tracking is the default mechanism.

Kubernetes-native Spark also benefits from executor decommissioning: rather than abruptly killing an executor pod, Spark sends a decommission signal. The executor finishes its currently running tasks, migrates any shuffle or cache blocks it can, and then exits cleanly. This reduces the likelihood of fetch failures during scale-down.

Decommissioning is like a courteous employee giving notice instead of walking out mid-shift. “I’m leaving in 5 minutes—let me finish this customer’s order first, hand off my notes, and then I’ll go.” An abrupt kill is the employee vanishing mid-transaction, leaving an unfinished order on the counter and confused customers.

Bringing it together

Dynamic allocation lets a Spark application hold only the executors it currently needs, scaling up when the task queue grows and scaling down when executors go idle. It operates through two timers on the driver: a scale-up timer that requests new executors (doubling per round) when tasks are pending, and a scale-down timer that releases executors that have been idle beyond a threshold. The External Shuffle Service (or shuffle tracking in Spark 3) makes scale-down safe by ensuring that shuffle output remains accessible after the executor that wrote it has been released. Cached-data executors are protected by a separate, longer idle timeout. Minimum and maximum executor counts bound the cluster footprint. So the story is: tasks pile up → driver requests executors → cluster allocates them → tasks drain → executors go idle → driver releases them → shuffle data stays accessible via ESS or shuffle tracking. The result is a Spark application that scales with its workload rather than reserving peak capacity for the entire job.