sparklearning

Structured Streaming

Stories about how Spark processes continuous data streams — micro-batches, state, watermarks, and sources.

Batch by Batch: Inside the Structured Streaming Micro-Batch Engine — StreamExecution thread, offset log, commit log, checkpoint directory
Watermarks: How Structured Streaming Decides When to Stop Waiting — event time vs processing time, watermark semantics, late data handling, output modes
Exactly Once, For Real: How Structured Streaming Guarantees No Duplicates — at-most-once vs at-least-once vs exactly-once, idempotent writes, transactional commits
RocksDB in Structured Streaming — default in-memory state store vs RocksDB, changelog-based checkpointing, GC pressure
Keeping Score: How Spark Maintains State Across Micro-Batches — streaming aggregations, windowed state, deduplication, mapGroupsWithState, timeouts
Spark Meets Kafka: How Offsets, Partitions, and Backpressure Work Together — Kafka offset management, partition-to-task mapping, rate limiting, exactly-once recovery
When Should the Next Batch Run? The Story of Trigger Types — ProcessingTime, Once, AvailableNow, Continuous triggers and their trade-offs

How Spark Survives Failure — Structured Streaming’s checkpoint-based recovery is an extension of Spark’s fault tolerance model
The DataSource V2 API: How Spark Talks to Storage Systems — streaming sources implement the MicroBatchStream interface from DataSource V2
The Two Lives of Spark’s Memory — stateful streaming keeps state either on the executor JVM heap (the default state store) or in RocksDB
The Transaction Log: How Delta Lake Brings ACID to Object Storage — Delta Lake is a common exactly-once sink for Structured Streaming pipelines
Partitions: The Grain of Parallelism — the number of shuffle partitions (spark.sql.shuffle.partitions) sets how many state store partitions a stateful streaming query has