sparklearning

Spark Internals: Learning Through Stories

📖 Published as GitHub Pages

A growing collection of story-style explanations of Apache Spark internals. Each story focuses on one concept or subsystem and explains it as a narrative: what problem it solves, how it works, and how the pieces fit together. Stories are written to be engaging and readable without diving into code, so the ideas stick.

Stories are grouped by topic (each has its own directory); related topics are grouped into themes in the index below.


How to use this doc

New stories are added over time and linked from this README.


Index by theme

Execution core

How jobs become stages and tasks, how data moves, and how memory and fault tolerance work.

| Topic | Description | Stories |
| --- | --- | --- |
| Execution & scheduling | From actions to DAG, stages, tasks; driver and executors | Coming soon |
| Scheduler | DAG Scheduler, Task Scheduler; how stages and tasks are submitted and run | From One Action to Many Tasks |
| Locality and delay scheduling | Preferred locations, locality levels, delay scheduling; when Spark waits for a good executor | Locality and Delay Scheduling |
| Scheduling pools and fair sharing | Pools, minimum share, weight; how multiple jobs share resources in fair mode | Scheduling Pools and Fair Sharing |
| Shuffle | Shuffle write/read, sort shuffle, external shuffle service | The Journey of a Shuffle Record |
| Memory & storage | Unified memory, BlockManager, caching and eviction | Coming soon |
| Fault tolerance | Lineage, recomputation, checkpointing, speculation | Coming soon |
| Partitioning | Partitions, coalesce vs repartition, partition pruning | Coming soon |
| Broadcast & shared state | Broadcast variables, accumulators | Coming soon |

Query & planning

How DataFrame/SQL becomes a plan, how it’s optimized, and how joins and adaptive execution work.

| Topic | Description | Stories |
| --- | --- | --- |
| Query planning (Catalyst) | Logical plan, optimization rules, physical plan, codegen | Coming soon |
| Adaptive & runtime | AQE, dynamic partition pruning | Coming soon |
| Join strategies | Sort-merge, broadcast, hash join; when each is chosen | Coming soon |

Streaming

State, checkpointing, and the lifecycle of micro-batches.

| Topic | Description | Stories |
| --- | --- | --- |
| Structured Streaming | State stores, checkpointing, micro-batches, exactly-once | RocksDB in Structured Streaming |

Data & I/O

Reading and writing data, formats, and data source APIs.

| Topic | Description | Stories |
| --- | --- | --- |
| Data sources | Reading/writing, V1 vs V2 API, file formats | Coming soon |
| Serialization | Tungsten binary format, Kryo, wire format | Coming soon |

Python & UDFs

How PySpark and UDFs integrate with the JVM.

| Topic | Description | Stories |
| --- | --- | --- |
| Python (PySpark) | JVM ↔ Python, Arrow, Pandas UDFs | Coming soon |
| UDFs | Scala/Java UDFs, registration, execution path | Coming soon |

Cluster & observability

How Spark runs on clusters and how you observe it.

| Topic | Description | Stories |
| --- | --- | --- |
| Cluster & deploy | Cluster managers, driver/executor lifecycle, resource negotiation | Coming soon |
| UI & metrics | Spark UI, event log, history server, where metrics come from | Coming soon |
| Configuration | SparkConf, important configs, how they flow through the app | Coming soon |

Advanced / internals

Deeper internals: Tungsten, catalog, and table metadata.

| Topic | Description | Stories |
| --- | --- | --- |
| Tungsten | Binary rows, off-heap, cache-friendly layout | Coming soon |
| Catalog & tables | Spark catalog, table metadata, session catalog | Coming soon |

Adding new stories