sparklearning

Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles

This is the story of serialization in Spark—the process of turning in-memory data structures into bytes for transmission across the network, for writing to disk, or for passing between the JVM and Python. Serialization is one of those topics that rarely appears in architecture diagrams but profoundly affects performance. Every byte that crosses a network, every task closure that is dispatched to an executor, every shuffle record that is written to disk—all of them go through serialization and deserialization. Understanding where Spark serializes, what format it uses, and how to control the cost is essential for diagnosing and fixing performance problems in production jobs.

Why serialization is necessary

The JVM heap holds objects in a format designed for efficient access from running Java code—object headers, references, arrays of references. This format is not portable: two JVMs don’t share a heap, and a file on disk can’t hold a live object graph. To move data from one JVM to another (across the network in a shuffle), to persist it to disk (shuffle files, cached blocks), or to pass it to a Python process (for PySpark UDFs), it must be converted to a flat sequence of bytes. That conversion is serialization; the reverse is deserialization.

Serialization is like packing a house for a move. The furniture and appliances are in their “native format”—arranged, plugged in, functional. To move them, you box everything up (serialize), label the boxes, ship them to the new house (transmit), unpack them (deserialize), and arrange them again. The boxes are portable; the assembled furniture is not. The cost of moving depends heavily on how efficiently you pack.

Serialization appears in several distinct places in a Spark job: task dispatch, shuffle, RDD caching, broadcasting, and Python interop. Each context has somewhat different requirements and may use a different format.

Java serialization: the default and its costs

The simplest serialization mechanism is Java’s built-in ObjectOutputStream / ObjectInputStream—Java serialization. Any class that implements java.io.Serializable can be serialized this way. Spark uses Java serialization by default for task closures.

Java serialization is universal but verbose: it includes full class names, field names, and type metadata. A simple closure that captures a few variables can serialize to kilobytes of overhead. Deserialization involves class loading, reflection, and object allocation—it is slow compared to hand-coded binary formats.

Java serialization is like packing using a moving company that insists on labelling every item in full detail: “One (1) wooden chair, oak finish, four legs, upholstered seat, circa 2019, serial number BX4291.” The information is complete and unambiguous, but the labelling takes longer than the packing itself, and the labels add weight. Kryo is the company that writes “chair” on the box.

Kryo: faster, more compact object serialization

Kryo is an alternative JVM serialization library that is significantly faster and produces more compact output than Java serialization. Instead of including full class names and field names, Kryo registers classes by ID (a small integer) and uses a compact binary format. For a Long value, Kryo writes 1–9 bytes; Java serialization writes tens of bytes.

Spark supports Kryo for RDD-based operations (spark.serializer = org.apache.spark.serializer.KryoSerializer). For custom classes used in RDD operations, registering them with Kryo is important: without registration, Kryo falls back to writing the full class name, losing much of the size advantage.

Kryo is the right choice for RDD-based workloads where custom Java/Scala objects flow through map, reduce, and groupByKey operations. The speedup over Java serialization for data-heavy jobs can be substantial—often 2–5× faster with 30–60% smaller serialized size.

Tungsten binary format: the fastest path

For DataFrame and Dataset operations, Spark uses a completely different approach that bypasses Java objects entirely—the Tungsten binary row format. Rows are stored as compact binary byte arrays, with fixed offsets for each field. No class headers, no field names, no Java object overhead.

Tungsten rows don’t need traditional serialization when they are written to shuffle files: the bytes are already in a compact format, and writing them to disk is essentially a memcpy. Deserialization is reading the bytes back and interpreting fields at their known offsets—no object allocation, no reflection.

The Tungsten format is like a shipping container pre-loaded at the factory. Every item has a fixed slot; no individual labelling; no re-packing at the port. The container arrives, you know exactly where each item is, you pick it up from its slot. Java serialization is individual FedEx packages with handwritten labels; Tungsten is a factory-packed container where everything is at a predictable offset.

This is why DataFrame shuffles are dramatically faster than RDD shuffles with Java serialization: the data never becomes Java objects at the shuffle boundary.

Task closure serialization: what goes on the wire

When the Task Scheduler sends a task to an executor, it serializes the entire task closure: the function to run plus everything it references. If your map() function captures a reference to a large object—a HashMap, a configuration object—that object is serialized and sent with every task that uses it.

Capturing a large object in a task closure is like sending the entire contents of your office filing cabinet as an attachment to a one-line email. The recipient only needed one document, but you sent the whole cabinet. The fix is to use a shared drive (broadcast variable): the filing cabinet is uploaded once and everyone accesses their copy locally.

If you capture a 100 MB configuration map in a lambda and have 1000 tasks, the driver serializes 100 MB 1000 times and sends 100 GB of data to executors. The fix is to broadcast the large object: sc.broadcast(myMap), then reference myMap.value inside the task.

Serialization failures in task closures are a common source of NotSerializableException errors. If your lambda captures a reference to a class that is not serializable (a database connection, a file handle), Spark will throw this exception when trying to send the task.

Shuffle serialization: what flows between stages

Shuffle records in DataFrame operations are Tungsten binary rows. For RDD operations, shuffle records are serialized with Kryo (if configured) or Java serialization. The serialized bytes are written to the map-side shuffle file along with an index, and deserialized by the reduce-side tasks when they fetch the blocks.

Compression is applied on top of serialization for shuffle data. Spark supports Snappy (fast, moderate compression—the default), LZ4, ZSTD, and no compression. The compression codec trades CPU for I/O: fast codecs reduce shuffle write/read time; good codecs reduce network transfer. For shuffle-heavy jobs on network-constrained clusters, ZSTD often wins overall.

Broadcast serialization: sending large objects efficiently

Broadcast variables are serialized by the driver and then chunked and distributed to executors using the TorrentBroadcast protocol. The serialized form is stored in the executor’s BlockManager and deserialized once (cached in memory) rather than re-deserialized for every task. This is why registering a broadcast object’s class with Kryo matters: a 100 MB object that serializes to 10 MB with Kryo instead of 150 MB with Java serialization reduces broadcast transfer time and executor memory significantly.

Python serialization: pickle and Arrow

For Python UDFs and PySpark RDD operations, data is serialized from JVM to Python and back using pickle (Python’s native serialization protocol) for row-at-a-time UDFs, or Apache Arrow columnar format for pandas UDFs. Pickle is general but slow and produces verbose output; Arrow is fast, compact, and zero-copy readable.

The choice between pickle and Arrow has the same performance implications as the choice between Java serialization and Tungsten in the JVM world—Arrow’s batch, columnar approach is far more efficient for large data transfers.

Bringing it together

Serialization appears everywhere in Spark: task closures sent to executors (Java serialization by default; Kryo if configured), shuffle records between stages (Tungsten binary for DataFrames; Kryo/Java for RDDs), cached blocks in the MemoryStore, broadcast variables distributed to executors (Java/Kryo, then cached per executor), and Python data passing through the PySpark bridge (pickle row-at-a-time or Arrow batch). Java serialization is universal but verbose and slow. Kryo is faster and more compact for JVM objects. Tungsten binary is the fastest path for DataFrame rows—no serialization in the traditional sense, just compact bytes with known offsets. Minimizing what is captured in task closures (use broadcast for large objects) and choosing the right storage level and serializer for your workload are the main levers. So the story of bytes on the wire is: JVM objects → serialized bytes → transmitted or written → deserialized back, and the format and compressor you choose determines how much of your job’s time is spent on that journey versus doing actual work.