Shuffle
Stories about how Spark redistributes data across executors and how to tune the shuffle.
Stories
The Journey of a Shuffle Record
— map phase, sort-merge shuffle, shuffle files, reduce phase, External Shuffle Service
Taming the Shuffle: Partition Count, Spill, and the Right Shuffle for Your Job
— partition sizing, spill management, codec selection, diagnosing shuffle bottlenecks
Related stories
Partitions: The Grain of Parallelism
— the number of shuffle partitions sets reduce-side parallelism and is often the single most impactful tuning parameter
AQE: How Spark Rewrites Plans After the Shuffle
— AQE coalesces small shuffle output partitions automatically after the map phase
How Spark Chooses a Join
— sort-merge join is built on top of the shuffle; broadcast hash join avoids it entirely
The Two Lives of Spark’s Memory
— shuffle spill is triggered when execution memory is insufficient for the in-memory sort buffer
Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles
— shuffle data is serialized and, by default, compressed before being written to disk
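
As a quick orientation to the knobs these stories discuss, the following is a minimal sketch rather than a recommended configuration. The helper `suggest_shuffle_partitions` and the 128 MiB target partition size are assumptions made for illustration; the configuration keys are standard Spark settings, but the values shown are placeholders to adapt per job.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * MIB) -> int:
    """Hypothetical helper: estimate a shuffle partition count.

    Divides the expected shuffle size by a target per-partition size
    (128 MiB here is an assumed rule of thumb, not a Spark default)
    and rounds up, never returning fewer than one partition.
    """
    if shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


# Illustrative values only; the keys are real Spark configuration names.
shuffle_conf = {
    # Partition count for SQL/DataFrame shuffles (Spark's default is 200).
    "spark.sql.shuffle.partitions": str(suggest_shuffle_partitions(200 * GIB)),
    # Let AQE coalesce small shuffle partitions after the map phase.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Compress shuffle output with the chosen codec.
    "spark.shuffle.compress": "true",
    "spark.io.compression.codec": "lz4",
    # Serve shuffle files from the External Shuffle Service.
    "spark.shuffle.service.enabled": "true",
}

# Render the settings as spark-submit flags.
for key, value in shuffle_conf.items():
    print(f"--conf {key}={value}")
```

For an assumed 200 GiB shuffle, the helper suggests 1600 partitions at 128 MiB each. With AQE coalescing enabled, it is generally safer to err on the high side, since small partitions can be merged after the map phase.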