sparklearning

Data Sources & I/O

Stories about reading and writing data — file formats, storage APIs, and the columnar data exchange layer.

Inside a Parquet File: Row Groups, Column Chunks, and Why Spark Loves It — row groups, column chunks, encoding, predicate/projection pushdown, bloom filters
The Transaction Log: How Delta Lake Brings ACID to Object Storage — transaction log, snapshot isolation, optimistic concurrency, time travel, checkpoints
The DataSource V2 API: How Spark Talks to Storage Systems — pluggable connector API, pushdown negotiation, transactional writes, streaming source support
The Columnar Fast Lane: How Apache Arrow Speeds Up PySpark — Arrow columnar format, zero-copy transfer, toPandas() and pandas UDF performance

Related stories

What Is a Table to Spark? The Catalog, Metadata, and the Metastore — how catalog metadata points to the physical files these stories describe
Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles — how data is serialized once it leaves the file reader
Cache Wisely: When Persisting Data Helps and When It Hurts — when to cache the results of expensive reads from these formats
Two Runtimes, One Job: How PySpark Bridges Python and the JVM — Arrow is the fast path for getting data from Spark into Python
Batch by Batch: Inside the Structured Streaming Micro-Batch Engine — DataSource V2’s MicroBatchStream is how streaming sources plug into Structured Streaming