sparklearning

The Columnar Fast Lane: How Apache Arrow Speeds Up PySpark

This is the story of how Apache Arrow transforms PySpark performance—changing the way data crosses the boundary between the JVM and Python, between Spark and pandas, and between row-by-row processing and batch-oriented analytics. Arrow is a cross-language, columnar, in-memory format that has become the lingua franca of the modern data stack. In the context of Spark, Arrow’s most visible impact is in PySpark: replacing the slow pickle-based serialization of individual rows with zero-copy columnar batches that are orders of magnitude faster. Understanding how Arrow works, why columns are faster than rows for analytics, and where Arrow appears in the Spark execution path explains a large part of why modern PySpark is significantly more performant than older versions.

Why columnar layout matters for analytics

Traditional row-oriented storage writes each record’s fields together: [row1_col1, row1_col2, row1_col3, row2_col1, row2_col2, ...]. To read only col2 from 1 million rows, you skip through the entire dataset touching every other field on the way.

Columnar storage writes each column’s values together: [row1_col1, row2_col1, row3_col1, ..., row1_col2, row2_col2, ...]. To read only col2, you read a contiguous block of memory containing nothing but col2 values. CPU caches are used efficiently because every cache line loaded contains only relevant data.

Columnar layout is like a spreadsheet where all the values in one column are physically stored together on the same page, rather than one row per page. If you want to compute the sum of column B across 10,000 rows, the row-oriented spreadsheet requires flipping to every page (one per row) and picking out column B. The columnar spreadsheet stores all of column B on pages 1–10, which you read straight through. The data you actually need is always together.

For analytic workloads that compute aggregates over a few columns of wide tables, columnar access is often 10–100× faster for raw I/O, and the contiguous memory layout enables SIMD (CPU vector instructions) that process 4–8 values per CPU instruction.

Apache Arrow: the columnar standard

Arrow is a language-independent, columnar, in-memory format with:

A fixed specification for how each data type (integer, string, list, struct, etc.) is laid out in memory
Zero-copy semantics: two processes sharing an Arrow buffer (via shared memory or IPC) don’t need to copy data; they both point to the same memory region
A rich set of libraries (Java, C++, Python, Rust, Go, etc.) that all understand the same format

Because Arrow is standardized, a Spark executor (JVM) can produce an Arrow buffer that a Python worker reads without any serialization—no JSON conversion, no pickle, no struct-pack. The JVM writes column arrays in Arrow format; the Python process reads them directly via PyArrow’s memory mapping.

Arrow is a universal power adapter for columnar data. Without it, every device (language, framework, process) uses its own plug shape (serialization format), and you need a different converter for each pair. With Arrow, everyone agrees on a single plug shape—data plugs directly into any Arrow-compatible consumer without conversion.

How Arrow appears in PySpark: `toPandas()` with Arrow

The most widely noticed Arrow integration is in toPandas() with spark.sql.execution.arrow.pyspark.enabled = true (enabled by default in recent Spark versions):

Without Arrow:

Spark collects all rows to the driver (as InternalRow Java objects)
For each row, the Python gateway (Py4J) serializes the row as a Python tuple (pickle)
The Python side accumulates these tuples
pandas constructs a DataFrame by creating a numpy array per column, copying from the accumulated rows

This involves: JVM-to-Python serialization of every value, row-by-row transfer over a socket, and column reconstruction from rows on the Python side. For a 10-million-row DataFrame, this is extremely slow.

With Arrow:

Spark collects rows to the driver executor
The executor’s Arrow writer groups rows into column batches in Arrow IPC format
Arrow batches are sent over the socket to Python (still bytes, but in compact columnar format)
PyArrow reads the batches directly—no per-row deserialization; the column arrays are handed directly to pandas without copying

The performance difference is typically 5–20× for wide DataFrames. The data is already columnar in Arrow format; pandas reads it by pointing to the same memory.

toPandas() with Arrow is like receiving a pre-organized filing cabinet instead of a box of loose papers. Without Arrow, you receive the box (rows), sort the papers by category (column), and file them (pandas columns). With Arrow, you receive a filing cabinet where all the papers are already sorted by category (columnar Arrow batches). You just slide each drawer (column) directly into your desk (pandas DataFrame).

Arrow in Pandas UDFs

Arrow also enables pandas UDFs (covered in the pandas UDFs story), where a Python function receives a pandas Series (or DataFrame) instead of scalar values:

@pandas_udf("double")
def multiply(amount: pd.Series) -> pd.Series:
    return amount * 1.1

Without Arrow, Spark would serialize each row’s amount value individually to Python, call the Python function once per row, and serialize the result back. 1 million rows = 1 million serialization round-trips.

With Arrow, Spark batches many rows’ amount values into an Arrow array (a single contiguous buffer), sends the array to Python as one batch, Python processes the entire batch as a pandas Series (a numpy array under the hood), and returns the result array. 1 million rows might be sent as 4 batches of 250,000 values each—4 round-trips instead of 1 million.

The performance improvement for pandas UDFs over row-at-a-time Python UDFs is typically 10–100× for numeric operations on large datasets.

Arrow in `createDataFrame()` from pandas

The reverse direction—creating a Spark DataFrame from a pandas DataFrame—also benefits from Arrow:

Without Arrow:

spark.createDataFrame(pandas_df)

Each row of pandas_df is pickled as a Python tuple, sent from Python to the JVM over Py4J, and deserialized into an InternalRow. N rows = N round-trips.

With Arrow: The pandas DataFrame (which already stores data as numpy arrays per column) is converted to Arrow IPC format in Python (a fast operation: numpy arrays are already compatible with Arrow’s memory layout for numeric types). The Arrow batches are sent to the JVM in bulk. The JVM reads the Arrow batches and constructs Spark’s internal rows with minimal per-value work.

For large pandas DataFrames, Arrow-based createDataFrame() can be 5–50× faster.

Arrow’s columnar format: what’s actually in memory

An Arrow buffer for a column of integers looks like:

A validity bitmap (1 bit per value: is this value null or not?)
A data buffer (4 bytes × N values for int32, packed contiguously)

For a string column (variable-width type):

A validity bitmap
An offsets buffer (4 bytes × N+1 values, pointing to where each string starts in the data buffer)
A data buffer (all string bytes concatenated)

This fixed layout means any Arrow-compatible reader in any language knows exactly where to find value i in any Arrow buffer without parsing a format description at runtime.

For nested types (structs, lists, maps), Arrow uses a nested buffer structure. Struct columns are represented as multiple child arrays (one per field), sharing the same validity bitmap structure.

Arrow for shuffle and exchange in Spark (future direction)

Arrow is increasingly used within Spark itself, not just at the Python boundary. Some projects and configurations use Arrow-native shuffle (passing Arrow batches between stages rather than serializing to Spark’s internal row format). The DataSource V2 API supports ColumnarBatch reads, enabling connectors to return Arrow data that flows into Spark’s columnar operators without row conversion.

As more of Spark’s operators gain vectorized (columnar) implementations (HashAggregateExec with vectorized aggregation, FileScan with vectorized reader), Arrow and columnar execution become the fast path for more and more of the query plan—not just at the Python boundary.

Bringing it together

Apache Arrow brings columnar data processing to the boundary between Spark and the outside world. Its fixed, language-independent, columnar memory format eliminates costly row-by-row serialization at the JVM–Python boundary, making toPandas(), createDataFrame(), and pandas UDFs dramatically faster. Columnar layout speeds up analytic operations by placing all values for one column in contiguous memory—maximizing CPU cache efficiency and enabling SIMD instructions. Arrow’s zero-copy semantics mean that data shared between processes points to the same memory region rather than being copied. In PySpark, enabling Arrow via spark.sql.execution.arrow.pyspark.enabled = true activates this fast path for toPandas() and pandas UDFs, reducing what was millions of row-level round-trips to a handful of bulk column batch transfers. So the story of Arrow in PySpark is: group rows into column batches → encode in Arrow’s fixed format → transfer once per batch → receive and read directly in pandas → no copies, no per-row overhead.