sparklearning

Taming the Shuffle: Partition Count, Spill, and the Right Shuffle for Your Job

This is the story of shuffle tuning—the art and science of making Spark’s most expensive operation as fast and efficient as possible. A shuffle is unavoidable in any query that groups, joins, or aggregates data: rows must be redistributed across executors so that all rows with the same key end up on the same machine. Done well, a shuffle is fast, memory-efficient, and transparent. Done badly, it’s the source of slow stages, GC pressure, out-of-memory errors, and mysterious long-tail tasks. This story covers the three main levers of shuffle tuning: partition count, spill management, and codec selection.

Why shuffle is expensive

The shuffle has two phases:

Map phase (write): each mapper task computes its output rows, hashes them by the partitioning key, and writes each row to a shuffle file on local disk—one section per reduce partition. With 200 reduce partitions, each map task writes 200 files (or one file with 200 sections using sort-merge shuffle). With 1,000 map tasks and 200 reduce partitions, there are 200,000 shuffle file sections on disk.

Reduce phase (read): each reducer task reads its section from every mapper—a remote read from potentially thousands of machines. With 1,000 mappers and 200 reducers, each reducer makes 1,000 remote HTTP requests to fetch its data. This is the network bottleneck.

Beyond the network, the shuffle involves: disk I/O (writing and reading shuffle files), serialization/deserialization (objects to bytes and back), memory for sort buffers and in-memory blocks, and coordination overhead (the driver tracks every shuffle block location).

A shuffle is like a postal sorting facility that redistributes mail from 1,000 origin post offices to 200 destination zip codes. The origin offices (map tasks) sort and bundle their outgoing mail by destination. The destination offices (reduce tasks) send a truck to each origin office to collect their bundles. With 1,000 origins and 200 destinations, that’s 200,000 bundles and 200 × 1,000 = 200,000 truck trips. Tuning the shuffle means reducing bundle count, truck trips, and sorting time.

The most important parameter: partition count

spark.sql.shuffle.partitions (default: 200) controls how many reduce partitions are created after each shuffle. This single number has more impact on shuffle performance than almost any other configuration.

Too few partitions (e.g., 10 for a large dataset):

Each reducer task gets a very large amount of data
Tasks run for a long time (sequential bottleneck)
Tasks are more likely to spill to disk (not enough memory for the full partition)
Parallelism is limited: with 10 tasks and 100 executor cores, 90 cores sit idle

Too many partitions (e.g., 50,000 for a small dataset):

Each reducer task gets almost no data—milliseconds of actual work
Scheduling overhead dominates (each task has a fixed overhead of ~10–20ms)
The driver is overwhelmed tracking 50,000 shuffle blocks
Small file problem: writing 50,000 tiny output files to storage is inefficient

The right number: A commonly cited target is partition sizes of 100–200MB per partition (for memory-intensive operations) or 256MB–1GB (for I/O-bound operations writing to storage). For a 100GB shuffle, 500–1,000 partitions is a reasonable starting range.

With AQE enabled (Spark 3.0+), Spark automatically coalesces small shuffle partitions after the map phase, reducing the number of reduce tasks for light data. This makes spark.sql.shuffle.partitions=200 a safe default even for varying workload sizes—AQE will merge partitions that end up small.

Setting partition count is like choosing how many lanes to have at a highway toll plaza. Too few lanes (too few partitions): each lane has a huge queue; cars wait a long time. Too many lanes: the plaza is massive, attendants stand idle, and the overhead of the plaza itself slows traffic. The right number matches throughput to traffic volume—and AQE is like having variable lanes that automatically close when traffic is light.

Spill: the emergency valve and why to avoid it

When a shuffle reducer’s input data doesn’t fit in memory, Spark spills to disk: it sorts what it has in memory into a temp file, clears the buffer, and continues accumulating. At the end, it merges the sorted temp files with the remaining in-memory data. Spill is correct—it prevents OOM. But it’s slow: it writes and reads data twice, consuming disk I/O bandwidth.

Signs of spill in the Spark UI: the Stages tab shows “Spill (Memory)” and “Spill (Disk)” metrics. A stage with significant spill is a performance opportunity.

Causes of spill:

Too few partitions (each partition too large for memory)
Data skew (one partition gets far more data than the others)
Insufficient executor memory for the workload

Fixes:

Increase partition count: smaller partitions require less memory per task.
Increase spark.executor.memory: more memory per executor means larger partitions can fit without spilling.
Increase spark.memory.fraction (default 0.6): the fraction of executor memory available for execution (shuffle buffers, sort buffers). More execution memory = less spill.
Fix data skew: if one key has vastly more rows than others, all its data ends up in one partition (covered in the data skew story).
Enable AQE skew join handling: automatically splits skewed partitions.

The shuffle implementation: sort-merge shuffle

Spark’s default shuffle implementation is sort-merge shuffle (also called sort-based shuffle, enabled since Spark 1.2):

For each map task:

Rows are written to an in-memory sort buffer.
When the buffer is full (or the task ends), the buffer is sorted by partition ID (and optionally by key) and written to a single spill file on local disk.
At the end of the map task, all spill files are merged into one shuffle file with a corresponding index file (recording where each reduce partition’s data starts).

For each reduce task:

It reads its section from each map task’s shuffle file (via remote HTTP or local file read if on the same executor).
It merges the fetched data (optionally sorting by key for sort-merge joins).

The External Shuffle Service (ESS) is a separate process on each worker node that serves shuffle files. Without ESS, shuffle files are served by the executor JVM’s HTTP server. ESS allows executors to be released (dynamic allocation) without losing their shuffle files—the ESS continues serving the files even after the executor exits.

The Tungsten path for small shuffles: hash shuffle

For small data volumes (e.g., a broadcast join where the small side is shuffled, or a repartition of a small dataset), sort-merge shuffle’s overhead (sorting, writing to disk) can be excessive. Spark uses Tungsten sort-based shuffle (UnsafeShuffleWriter) which sorts by partition ID only (not by key), writes in Tungsten’s binary format, and avoids Java object overhead. For shuffle writes that fit in memory (no spill), this is very fast.

For very small shuffles (output fits in memory entirely), Spark can bypass sorting entirely and use BypassMergeSortShuffleWriter—one output file per partition per map task. This is fast for small partition counts (below spark.shuffle.sort.bypassMergeThreshold, default 200 partitions) but creates many small files for large partition counts.

Shuffle compression and network bandwidth

Shuffle data can be compressed before being written and transmitted:

spark.shuffle.compress = true        (default: true)
spark.shuffle.spill.compress = true  (default: true)
spark.io.compression.codec = lz4    (default: lz4; alternatives: snappy, zstd, lzf)

lz4 is fast to compress/decompress with decent compression ratios—a good default. zstd achieves better compression at slightly higher CPU cost, useful if network is the bottleneck. snappy is similarly fast but slightly worse compression. lzf is the fastest decompressor but worst compression.

If your network is the bottleneck (you see high shuffle fetch wait time), more aggressive compression (zstd) reduces network bytes at the cost of more CPU. If your CPU is the bottleneck, use lz4 or disable compression entirely for small shuffles.

Diagnosing shuffle problems

Check stage metrics in the Spark UI: look at the shuffle read and write bytes for stages that involve shuffles. Large shuffle sizes are expected for joins on large tables; unexpected large sizes (e.g., a filter that should have reduced data didn’t push down to the scan) are a problem.
Check fetch wait time: high shuffle read fetch wait time indicates network congestion or a skewed reducer that’s waiting for one large partition while its peers are already done.
Check spill: non-zero “Spill (Memory)” in the stage metrics means spill occurred. How much? Is it occasional (acceptable) or massive (problem)?
Check task duration distribution: if the median task duration is 2 seconds but the max is 120 seconds, there’s skew—one or a few partitions are much larger than the rest.
Count exchanges in the EXPLAIN output: more exchanges means more shuffles. Can any shuffles be eliminated by bucketing or pre-partitioning?

Bringing it together

Shuffle tuning has three main axes: partition count (the single most impactful parameter—too few causes large, slow tasks; too many causes overhead; AQE can auto-tune), spill management (unavoidable when partitions exceed memory; reduced by increasing partition count, memory, or memory fraction; diagnosed via the Stages tab), and network/compression (compression reduces network bytes at CPU cost; codec choice should match your bottleneck). The shuffle implementation (sort-merge with ESS) is mostly correct by default; tuning is usually about partition sizing and memory allocation. For most workloads with AQE enabled, the default spark.sql.shuffle.partitions=200 and AQE’s coalescing handles routine cases automatically. Manual tuning is needed for extreme data sizes, severe skew, or memory-constrained environments. So the story of shuffle tuning is: size partitions right for your data → give each partition enough memory → let AQE adapt → compress for the network → watch the metrics and iterate.