Catalyst: Query Planning & Optimization
Stories about how Spark SQL turns a query into an optimized physical plan and executes it.
Stories
- From SQL to a Running Plan: The Catalyst Story — the full pipeline overview: parsing → analysis → optimization → physical planning → codegen
- Making Sense of Names: The Analyzer’s Resolution Rules — ResolveRelations, ResolveReferences, ResolveFunctions, type coercion, CheckAnalysis; why AnalysisExceptions happen
- The Optimizer’s Rulebook: How Catalyst Makes Plans Cheaper — predicate pushdown, column pruning, constant folding, join elimination, subquery decorrelation; before/after plan diffs
- From Logic to Execution: How Spark Picks Physical Operators — planning strategies (JoinSelection, Aggregation, FileSource), EnsureRequirements, CollapseCodegenStages, AQE rules
- What Spark Knows About Your Data: Statistics and the Cost-Based Optimizer — table/column statistics, histograms, CBO join reordering
- Subqueries Untangled: How Spark Rewrites Nested Queries — correlated vs uncorrelated subqueries, decorrelation, semi-join rewriting
- Windows into Your Data: How Window Functions Are Planned and Executed — window specs, frames, WindowExec, memory and skew considerations
- The Encoder Contract: How Spark Converts Between JVM Objects and Binary Rows — ExpressionEncoder, Dataset[T] vs DataFrame, Kryo fallback
- Expressions All the Way Down: How Spark Represents and Evaluates Computations — expression trees, leaf/unary/binary nodes, interpreted vs codegen evaluation
- EXPLAIN Yourself: How to Read a Spark Physical Plan — reading physical plans, key operators, diagnostic checklist
Recommended reading order
For a complete understanding of the Catalyst pipeline, read in this order:
- From SQL to a Running Plan — the big picture
- Expressions All the Way Down — the data model rules operate on
- Making Sense of Names — phase 2: resolution
- The Optimizer’s Rulebook — phase 3: logical optimization
- From Logic to Execution — phase 4: physical planning
- EXPLAIN Yourself — reading the output of all four phases
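One idea runs through most of these stories: plans and expressions are immutable trees, and each rule rewrites a tree into a cheaper or more resolved one. The sketch below illustrates that idea with constant folding, one of the optimizations named in The Optimizer’s Rulebook. It is a toy model, not Spark's code: the classes `Literal`, `Attribute`, and `Add` and the functions `transform_up` and `fold_constants` are hypothetical names chosen for illustration, and real Catalyst expressions carry data types, nullability, and many more node kinds.

```python
from dataclasses import dataclass
from typing import Callable, Union


# Toy expression nodes. Hypothetical names for illustration; not Spark's classes.
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def transform_up(expr: Expr, rule: Callable[[Expr], Expr]) -> Expr:
    """Rewrite the children first, then apply the rule to the rebuilt node."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Replace Add(Literal, Literal) with one Literal; leave other nodes unchanged."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr


# The tree for: x + (1 + 2)
before = Add(Attribute("x"), Add(Literal(1), Literal(2)))

# After the rule runs bottom-up, the tree represents: x + 3
after = transform_up(before, fold_constants)
print(after)
```

Because the nodes are frozen dataclasses, a rule never mutates its input; it returns either the same node or a new one. That makes a before/after comparison of the two trees straightforward.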