sparklearning
Joins
Stories about how Spark plans and executes joins, and how to make them faster.
Stories
How Spark Chooses a Join
— broadcast hash join, sort-merge join, shuffled hash join, AQE join conversion
Join Without Pain: Patterns for Fast Joins on Large Tables
— broadcast, bucketing, AQE conversion, skew handling, salting, partition pruning
Related stories
AQE: How Spark Rewrites Plans After the Shuffle
— AQE can convert sort-merge joins to broadcast hash joins at runtime
When One Partition Holds Up Everyone: The Data Skew Story
— skewed join keys are a primary cause of join slowdowns
The Journey of a Shuffle Record
— sort-merge joins are built on top of the shuffle mechanism
From SQL to a Running Plan: The Catalyst Story
— join strategy selection happens during physical planning
What Spark Knows About Your Data: Statistics and the Cost-Based Optimizer
— the CBO uses table and column statistics to choose a cheaper join order and to inform join strategy selection
EXPLAIN Yourself: How to Read a Spark Physical Plan
— how to verify which join strategy was actually chosen