Structured Streaming
Stories about how Spark processes continuous data streams: micro-batches, state, watermarks, and sources.
Stories
- Batch by Batch: Inside the Structured Streaming Micro-Batch Engine — StreamExecution thread, offset log, commit log, checkpoint directory
- Watermarks: How Structured Streaming Decides When to Stop Waiting — event time vs processing time, watermark semantics, late data handling, output modes
- Exactly Once, For Real: How Structured Streaming Guarantees No Duplicates — at-most-once vs at-least-once vs exactly-once, idempotent writes, transactional commits
- RocksDB in Structured Streaming — default in-memory state store vs RocksDB, changelog-based checkpointing, GC pressure
- Keeping Score: How Spark Maintains State Across Micro-Batches — streaming aggregations, windowed state, deduplication, mapGroupsWithState, timeouts
- Spark Meets Kafka: How Offsets, Partitions, and Backpressure Work Together — Kafka offset management, partition-to-task mapping, rate limiting, exactly-once recovery
- When Should the Next Batch Run? The Story of Trigger Types — ProcessingTime, Once, AvailableNow, Continuous triggers and their trade-offs