sparklearning

Tungsten & JVM Internals

Stories about Spark’s low-level memory and execution optimizations — and the JVM garbage collector.

Tungsten: How Spark Stopped Trusting the JVM — UnsafeRow binary format, off-heap memory, cache-friendly layout, binary comparisons
Garbage Collection in Spark: Why the JVM Pauses and How to Make It Stop — generational GC, G1GC tuning, Spark-specific GC patterns, diagnosing GC pauses

Related stories from other sections

The Two Lives of Spark’s Memory — the unified memory manager that controls how much memory Tungsten’s execution pool gets
From SQL to a Running Plan: The Catalyst Story — whole-stage codegen generates the tight loops that Tungsten’s binary format is designed for
Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles — Tungsten’s binary format is also used for shuffle serialization
The Encoder Contract: How Spark Converts Between JVM Objects and Binary Rows — encoders translate between JVM objects and Tungsten’s UnsafeRow binary format
Cache Wisely: When Persisting Data Helps and When It Hurts — serialized caching stores data in Tungsten-compatible byte buffers, reducing GC pressure