Data Sources & I/O
Stories about reading and writing data — file formats, storage APIs, and the columnar data exchange layer.
Stories
- Inside a Parquet File: Row Groups, Column Chunks, and Why Spark Loves It — row groups, column chunks, encoding, predicate/projection pushdown, Bloom filters
- The Transaction Log: How Delta Lake Brings ACID to Object Storage — transaction log, snapshot isolation, optimistic concurrency, time travel, checkpoints
- The DataSource V2 API: How Spark Talks to Storage Systems — pluggable connector API, pushdown negotiation, transactional writes, streaming source support
- The Columnar Fast Lane: How Apache Arrow Speeds Up PySpark — Arrow columnar format, zero-copy transfer, toPandas() and pandas UDF performance