sparklearning

Shuffle

Stories about how Spark redistributes data across executors and how to tune that redistribution.

The Journey of a Shuffle Record — map phase, sort-merge shuffle, shuffle files, reduce phase, External Shuffle Service
Taming the Shuffle: Partition Count, Spill, and the Right Shuffle for Your Job — partition sizing, spill management, codec selection, diagnosing shuffle bottlenecks

Partitions: The Grain of Parallelism — the number of shuffle partitions is often the single most impactful tuning parameter
AQE: How Spark Rewrites Plans After the Shuffle — AQE coalesces small shuffle output partitions automatically after the map phase
How Spark Chooses a Join — sort-merge join is built on top of the shuffle; broadcast hash join avoids it entirely
The Two Lives of Spark’s Memory — shuffle spill is triggered when execution memory is insufficient for the in-memory sort buffer
Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles — shuffle data is serialized and optionally compressed before being written to disk