sparklearning

Joins

Stories about how Spark plans and executes joins, and how to make them faster.

How Spark Chooses a Join — broadcast hash join, sort-merge join, shuffled hash join, AQE join conversion
Join Without Pain: Patterns for Fast Joins on Large Tables — broadcast, bucketing, AQE conversion, skew handling, salting, partition pruning

AQE: How Spark Rewrites Plans After the Shuffle — AQE can convert sort-merge joins to broadcast hash joins at runtime
When One Partition Holds Up Everyone: The Data Skew Story — skewed join keys are a primary cause of join slowdowns
The Journey of a Shuffle Record — sort-merge joins are built on top of the shuffle mechanism
From SQL to a Running Plan: The Catalyst Story — join strategy selection happens during physical planning
What Spark Knows About Your Data: Statistics and the Cost-Based Optimizer — the CBO uses table statistics to estimate costs and choose a join order and strategy
EXPLAIN Yourself: How to Read a Spark Physical Plan — how to verify which join strategy was actually chosen