sparklearning

Spark Internals: Learning Through Stories

📖 Published as GitHub Pages

A growing collection of story-style explanations of Apache Spark internals. Each story focuses on one concept or subsystem and explains it as a narrative—what problem it solves, how it works, and how the pieces fit together. Every story includes analogies and real-world examples to make the ideas concrete and memorable, without diving into code.

53 stories published

Stories are grouped by topic (each has its own directory); related topics are grouped into themes in the index below.

How to use this doc

Browse by theme — Execution core, Query & planning, Streaming, Data & I/O, and more.
Read in any order — Stories are self-contained; follow your curiosity.

Goal	Path
New to Spark internals	Driver & Executors → Scheduler → Shuffle → Memory → Fault tolerance
Understanding SQL/DataFrame performance	Catalyst → Analyzer rules → Optimizer rules → Physical planning rules → Statistics & CBO → AQE → EXPLAIN output
Debugging slow jobs	Spark UI → EXPLAIN output → Partitions → Data skew → Shuffle tuning → OOM diagnosis
PySpark & UDF performance	PySpark bridge → UDF tax → Pandas UDFs → Arrow columnar → Serialization
Streaming systems	Micro-batch engine → Watermarks → Stateful operations → Kafka source → Trigger types → Exactly-once → RocksDB state store
Scheduling & fairness	Scheduler → Locality → Fair sharing → Dynamic allocation
Data lake & storage	Parquet internals → DataSource V2 → Delta Lake → Catalog & tables → Unity Catalog
Cluster & deploy	spark-submit → YARN mode → Kubernetes mode → SparkConf → Event log
Join & shuffle optimization	Join strategies → Join patterns → Shuffle tuning → Caching strategy → GC tuning
Advanced internals	Tungsten → Expression tree → Encoders & Datasets → Subqueries → Window functions

Index by theme

Execution core

How jobs become stages and tasks, how data moves, and how memory and fault tolerance work.

Topic	Description	Stories
Execution & scheduling	From actions to DAG, stages, tasks; driver and executors	The Driver, the Executors, and How a Job Actually Runs
Dynamic allocation	Requesting and releasing executors at runtime; elasticity under load	Elastic Executors: How Dynamic Allocation Grows and Shrinks the Cluster
Scheduler	DAG Scheduler, Task Scheduler; how stages and tasks are submitted and run	From One Action to Many Tasks
Locality and delay scheduling	Preferred locations, locality levels, delay scheduling; when Spark waits for a good executor	Locality and Delay Scheduling
Scheduling pools and fair sharing	Pools, minimum share, weight; how multiple jobs share resources in fair mode	Scheduling Pools and Fair Sharing
Shuffle	Shuffle write/read, sort shuffle, external shuffle service	The Journey of a Shuffle Record
Memory & storage	Unified memory, BlockManager, caching and eviction	The Two Lives of Spark’s Memory
Fault tolerance	Lineage, recomputation, checkpointing, speculation	How Spark Survives Failure
Partitioning	Partitions, coalesce vs repartition, partition pruning	Partitions: The Grain of Parallelism
Broadcast & shared state	Broadcast variables, accumulators	Shared State in a Distributed Job
Data skew	Detecting and handling skewed partitions; salting, AQE skew join	When One Partition Holds Up Everyone: The Data Skew Story

Query & planning

How DataFrame/SQL becomes a plan, how it’s optimized, and how joins and adaptive execution work.

Topic	Description	Stories
Query planning (Catalyst)	Logical plan, optimization rules, physical plan, codegen	From SQL to a Running Plan: The Catalyst Story
Analyzer rules	How Spark resolves table names, column references, functions, and types	Making Sense of Names: The Analyzer’s Resolution Rules
Optimizer rules	Predicate pushdown, column pruning, constant folding, join elimination; before/after plan diffs	The Optimizer’s Rulebook: How Catalyst Makes Plans Cheaper
Physical planning rules	Strategy selection, EnsureRequirements, WholeStageCodegen grouping, AQE rules	From Logic to Execution: How Spark Picks Physical Operators
Statistics & CBO	Table statistics, column histograms, cost-based optimizer decisions	What Spark Knows About Your Data: Statistics and the Cost-Based Optimizer
Adaptive & runtime	AQE: coalescing partitions, join conversion, skew handling, dynamic partition pruning	AQE: How Spark Rewrites Plans After the Shuffle
Join strategies	Sort-merge, broadcast hash, and shuffled hash joins; when each is chosen	How Spark Chooses a Join
Subqueries	Correlated and uncorrelated subqueries; how they are rewritten and executed	Subqueries Untangled: How Spark Rewrites Nested Queries
Window functions	Window specs, frame boundaries, ranking and analytic functions	Windows into Your Data: How Window Functions Are Planned and Executed
Encoders & Datasets	How Dataset[T] maps JVM types to Spark’s internal row format	The Encoder Contract: How Spark Converts Between JVM Objects and Binary Rows
Expression tree	How computations are represented as trees of expressions; evaluation model	Expressions All the Way Down: How Spark Represents and Evaluates Computations
Reading EXPLAIN output	Parsing physical plans to find shuffles, broadcast decisions, and skipped filters	EXPLAIN Yourself: How to Read a Spark Physical Plan

Streaming

State, checkpointing, watermarks, and the lifecycle of micro-batches.

Topic	Description	Stories
Micro-batch engine	How each batch is planned, executed, and committed; the StreamExecution thread	Batch by Batch: Inside the Structured Streaming Micro-Batch Engine
Watermarks & late data	Event time, watermarks, how late records are handled or dropped	Watermarks: How Structured Streaming Decides When to Stop Waiting
Exactly-once delivery	Sources, sinks, idempotent writes, transactional commits	Exactly Once, For Real: How Structured Streaming Guarantees No Duplicates
Structured Streaming	State stores, checkpointing; RocksDB as the state backend	RocksDB in Structured Streaming
Stateful operations	Aggregations over time windows, mapGroupsWithState, flatMapGroupsWithState	Keeping Score: How Spark Maintains State Across Micro-Batches
Kafka integration	Offset management, partition assignment, rate limiting in the Kafka source	Spark Meets Kafka: How Offsets, Partitions, and Backpressure Work Together
Trigger types	ProcessingTime, Once, AvailableNow, Continuous; what changes under the hood	When Should the Next Batch Run? The Story of Trigger Types

Data & I/O

Reading and writing data, formats, and data source APIs.

Topic	Description	Stories
Parquet internals	Row groups, column chunks, page encoding, predicate and projection pushdown	Inside a Parquet File: Row Groups, Column Chunks, and Why Spark Loves It
Delta Lake basics	Transaction log, snapshot isolation, schema enforcement, time travel	The Transaction Log: How Delta Lake Brings ACID to Object Storage
Serialization	Tungsten binary format, Kryo, Java serialization; when each is used	Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles
DataSource V2 API	Pluggable connector API; pushdown negotiation, transactional writes, streaming sources	The DataSource V2 API: How Spark Talks to Storage Systems
Arrow & columnar transfer	Apache Arrow format, columnar batches, zero-copy transfer in PySpark	The Columnar Fast Lane: How Apache Arrow Speeds Up PySpark

Python & UDFs

How PySpark and UDFs integrate with the JVM.

Topic	Description	Stories
Python (PySpark)	JVM ↔ Python bridge, Py4J, Arrow, serialization overhead	Two Runtimes, One Job: How PySpark Bridges Python and the JVM
Pandas UDFs	Arrow-based columnar UDFs; why they are faster than row-at-a-time UDFs	Pandas UDFs: How Arrow Makes Python Functions Fast Enough for Spark
UDFs	Scalar UDF execution path, deserialization cost, why UDFs block Catalyst	The UDF Tax: Why User-Defined Functions Are a Black Box to the Optimizer
UDTFs & table functions	User-defined table functions, how they expand one row into many	One Row In, Many Rows Out: The Story of User-Defined Table Functions

Cluster & observability

How Spark runs on clusters and how you observe it.

Topic	Description	Stories
UI & metrics	Spark UI tabs — Jobs, Stages, SQL, Executors, Storage — and what each reveals	Reading the Spark UI: What Every Tab Is Actually Telling You
spark-submit & resource negotiation	From spark-submit to running tasks; driver launch, executor acquisition	From spark-submit to Running Tasks: The Resource Negotiation Story
YARN mode	How Spark runs on YARN; AM lifecycle, container allocation, queue policies	Spark on YARN: ApplicationMaster, Containers, and the Queue
Kubernetes mode	Pod lifecycle, driver pod, executor pods, dynamic allocation on K8s	Spark on Kubernetes: Pods, Namespaces, and Ephemeral Executors
Event log & history server	What goes into the event log, how the history server replays it	The Event Log: A Complete Record of Everything That Happened in Your Job
Configuration	SparkConf, config sources and precedence, how settings flow through the stack	SparkConf to Code: How Configuration Reaches the Component That Needs It

Advanced / internals

Deeper internals: Tungsten, the catalog and metastore, and Unity Catalog. (Encoders and expression trees are listed under Query & planning.)

Topic	Description	Stories
Tungsten	Binary rows, off-heap memory, cache-friendly layout, UnsafeRow	Tungsten: How Spark Stopped Trusting the JVM
Catalog & tables	Spark catalog, session catalog, Hive metastore, managed vs external tables	What Is a Table to Spark? The Catalog, Metadata, and the Metastore
Unity Catalog	Governance layer beyond the session catalog; lineage, fine-grained access control	Beyond the Session Catalog: Unity Catalog and the Governed Lakehouse

Performance & tuning

Practical stories about diagnosing and fixing common Spark performance problems.

Topic	Description	Stories
OOM diagnosis	Heap vs off-heap OOMs, driver vs executor, common causes and fixes	Out of Memory: A Field Guide to Spark OOM Errors
Reading EXPLAIN output	Parsing physical plans to find shuffles, broadcast decisions, and skipped filters	EXPLAIN Yourself: How to Read a Spark Physical Plan
Shuffle tuning	Shuffle partition count, spill, sort vs bypass; tuning for job size	Taming the Shuffle: Partition Count, Spill, and the Right Shuffle for Your Job
Join optimization patterns	When to broadcast, pre-partition, bucket, or cache to eliminate shuffle	Join Without Pain: Patterns for Fast Joins on Large Tables
Caching strategy	What to cache, what not to, storage levels, when caching hurts	Cache Wisely: When Persisting Data Helps and When It Hurts
GC tuning	G1GC vs ZGC, heap sizing, off-heap trade-offs, diagnosing GC pauses	Garbage Collection in Spark: Why the JVM Pauses and How to Make It Stop

Story map at a glance

Theme	Stories
Execution core	11
Query & planning	12
Streaming	7
Data & I/O	5
Python & UDFs	4
Cluster & observability	6
Advanced / internals	3
Performance & tuning	6
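Unique total	53 (the rows above sum to 54; "EXPLAIN Yourself" is listed under both Query & planning and Performance & tuning and is counted once)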
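
The counts in this table can be sanity-checked with a short tally whenever a story is added. The sketch below is illustrative only and not part of the repository: the variable names are made up, the per-theme numbers are copied from the table, and it assumes the only cross-listed story is "EXPLAIN Yourself: How to Read a Spark Physical Plan", which appears in both the Query & planning and Performance & tuning tables.

```python
# Per-theme story counts, copied from the "Story map at a glance" table.
theme_counts = {
    "Execution core": 11,
    "Query & planning": 12,
    "Streaming": 7,
    "Data & I/O": 5,
    "Python & UDFs": 4,
    "Cluster & observability": 6,
    "Advanced / internals": 3,
    "Performance & tuning": 6,
}

# Stories that appear under more than one theme (assumed: just this one).
cross_listed = {
    "EXPLAIN Yourself: How to Read a Spark Physical Plan": [
        "Query & planning",
        "Performance & tuning",
    ],
}

# Every table row, counting cross-listed stories once per theme.
listed_total = sum(theme_counts.values())

# Each extra listing of the same story is one duplicate row.
duplicate_rows = sum(len(themes) - 1 for themes in cross_listed.values())

unique_total = listed_total - duplicate_rows

print(f"rows listed: {listed_total}, unique stories: {unique_total}")
```

If a new story is cross-listed under a second theme, adding it to `cross_listed` keeps the unique total honest.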

Adding new stories

Put each new story in the directory for its topic (create the directory if it’s the first story in that group).
Use a descriptive filename (e.g. rocksdb_structured_streaming_story.md).
Update this README: add the story link to the table for its theme, then update the Story map counts and the published-story total near the top.