sparklearning

Partitioning

Stories about how Spark divides data into partitions, and how to deal with skew.

Partitions: The Grain of Parallelism — partition basics, coalesce vs repartition, repartitionByRange, partition pruning
When One Partition Holds Up Everyone: The Data Skew Story — detecting skew, hot keys, salting, AQE skew join handling

The Journey of a Shuffle Record — a shuffle is how Spark repartitions data; the shuffle partition count sets how many output partitions each shuffle produces
AQE: How Spark Rewrites Plans After the Shuffle — AQE coalesces small shuffle partitions and splits skewed ones at runtime
Join Without Pain: Patterns for Fast Joins on Large Tables — skewed join keys and bucket-based pre-partitioning are covered here
Taming the Shuffle: Partition Count, Spill, and the Right Shuffle for Your Job — how to choose the right number of shuffle partitions
Windows into Your Data: How Window Functions Are Planned and Executed — a window function's PARTITION BY clause creates the same data-per-partition tradeoffs