sparklearning

Two Runtimes, One Job: How PySpark Bridges Python and the JVM

This is the story of how PySpark makes Spark—a system written entirely in Scala and running on the JVM—accessible from Python. It’s a story of two processes running on the same machine, talking across a socket, each with its own memory and runtime. Understanding how the bridge works explains a lot about PySpark’s behavior: why Python UDFs are slow, why pandas UDFs are faster, why you shouldn’t collect large DataFrames to Python, and what “driver-side” vs. “executor-side” really means in a PySpark job.


Two processes, not one

When you run a PySpark job, you have at least two processes: a Python driver process (the python or pyspark interpreter running your script) and a JVM driver process (the Spark JVM that does the actual scheduling, planning, and coordination). These are separate operating system processes with separate memory spaces. They communicate over a local socket using the Py4J library.

Py4J is a bridge library that lets Python code call Java/Scala objects over a network connection. When you write spark.sql("SELECT 1") in Python, Py4J serializes that method call, sends it over the socket to the JVM process, the JVM executes it, and the result is sent back. From your Python code it looks like a normal method call, but under the hood it’s a socket round-trip to another process.

Think of the Python driver as a French-speaking executive and the JVM as a German-speaking engineering team. The executive (Python) speaks into a phone (Py4J socket), a live interpreter (Py4J) translates in real time, and the engineering team (JVM) does the actual work. The executive doesn’t do the engineering; they issue instructions and receive results. Every DataFrame operation you call in Python is a translated instruction.

The JVM driver is the real Spark driver: it runs the DAG Scheduler, Task Scheduler, BlockManagerMaster—all the components described in other stories. The Python process is a thin client that proxies your operations to the JVM. Every DataFrame operation you call in Python (filter, groupBy, select, join) results in Py4J calls that build up a logical plan in the JVM. The JVM then compiles, optimizes, and executes that plan exactly as it would for a Scala or Java program. The Python process is not involved in the execution at all—as long as you stick to DataFrame/SQL operations.


The executor side: where Python re-enters

Executors are JVM processes, too. For DataFrame operations (filter, join, aggregate, etc.), Python is not involved at executor time: Catalyst generates JVM bytecode for those operations, and they run at full JVM speed.

Python re-enters the picture when you use a Python UDF or RDD API with Python functions. In those cases, a Python worker process must run on the executor host. When a task needs to call a Python UDF, the executor JVM spawns (or reuses from a pool) a Python worker process on the same machine, serializes the data it needs to send to Python, sends it over a local pipe or socket, waits for Python to process it and return results, then deserializes the results back into JVM types.

It’s like the engineering team (JVM) hitting a problem that only a specialist consultant (Python) can solve. The team packages up the problem, sends it via courier (pipe) to the consultant’s office next door, the consultant solves it, couriers the answer back, and the team continues. For one problem this is fine. For one hundred million problems (rows), the couriering dominates the cost.

This Python worker process is a full Python interpreter. Every record that passes through a Python UDF crosses a process boundary twice (JVM → Python → JVM), gets serialized and deserialized twice, and incurs the overhead of Python’s interpreter for the function logic.


Serialization: what crosses the process boundary

The format used to move data between the JVM and Python processes has evolved significantly.

For RDD-based Python operations and row-at-a-time Python UDFs, each row is serialized as a pickled row, passed to Python, and pickled back. Because this happens row by row within a task, the per-row overhead (serialization + deserialization + process-boundary crossing) often dominates the cost of the UDF logic itself.

Row-at-a-time pickle is like passing individual marbles through a mail slot, one at a time. Each marble requires opening the slot, inserting the marble, closing the slot, the other person picking it up, processing it, and posting the result back through the slot. Arrow’s batch transfer is like opening a door and rolling a full bag of 10,000 marbles across at once. Same marbles, same processing—but the door-opening overhead is paid once, not 10,000 times.


Arrow: the fast lane for bulk transfer

Apache Arrow changed the performance picture dramatically. Arrow is a columnar, in-memory data format that is efficient to read and write, supports zero-copy sharing, and has implementations in both Java and Python. When Arrow is used to transfer data between the JVM and Python, an entire batch of rows is transferred as a columnar Arrow buffer rather than row-by-row pickled objects.

Arrow is what makes pandas UDFs (also called Vectorized UDFs) fast. Instead of calling your Python function once per row with one value, Spark calls it once per batch with a pandas Series—backed by an Arrow buffer. Your function operates on an entire column at a time using pandas’ vectorized operations (which in turn call NumPy routines implemented in C). The per-batch overhead is paid once for thousands of rows.


The driver-side Python limitation

Even with Arrow-accelerated executors, the Python driver process has a ceiling. collect(), toPandas(), show(), and any operation that brings data back to the driver goes through the Py4J bridge. For large results, this transfer adds latency and memory pressure on both the JVM heap and the Python heap.

toPandas() in particular deserves attention. When Arrow optimization is enabled (spark.sql.execution.arrow.pyspark.enabled = true), toPandas() uses Arrow to batch-transfer the collected data from JVM to Python—much faster than row-by-row pickle. Without it, each row is pickled.

The deeper lesson: the Python driver is a thin layer. The closer your operations stay to DataFrame/SQL (which execute entirely in the JVM), the better performance you get. Every collect(), toPandas(), or custom Python aggregation that pulls data to the Python side pays the bridge-crossing cost.


SparkContext vs SparkSession from Python’s perspective

In PySpark, SparkContext and SparkSession are Python proxy objects backed by Py4J references to JVM objects. When you call sc.parallelize([1, 2, 3]), the Python list [1, 2, 3] is pickled and sent to the JVM, which creates an RDD from it. When you call spark.range(1000), no Python data is involved at all—the JVM creates the range dataset entirely in its own memory. The Python DataFrame object you get back is just a handle (a Py4J reference) to a JVM DataFrame; it carries no data.

This is why operations on large DataFrames are cheap in Python—you’re not moving data, you’re sending method calls. The data lives in JVM memory; the Python process only holds references.


Worker process reuse and startup cost

Spawning a new Python worker process for every task would be expensive—each spawn involves forking a Python interpreter, importing libraries, and loading the function’s closure. PySpark reuses Python worker processes via a daemon that pre-forks a worker and keeps it alive to serve multiple tasks. Library imports that happen at module load time are paid once per worker process, not once per task.

The first task that touches a Python UDF in a session still pays the Python process startup cost. Subsequent tasks on the same executor reuse the warm worker. This is why the first few Python-UDF tasks in a stage may appear slower than the rest in the Spark UI.


Bringing it together

PySpark is a two-process, two-runtime system. The Python driver process communicates with the JVM driver over a Py4J socket: every DataFrame operation is a call that builds a logical plan in the JVM, which then optimizes and executes it without Python’s involvement. At executor time, Python re-enters only for Python UDFs and RDD API operations—each requiring a Python worker process on the executor host, data serialization across a process boundary, and deserialization back. Pickle handles row-by-row UDF transfer at high per-record overhead; Apache Arrow enables batch transfer for pandas UDFs, amortizing the boundary-crossing cost over thousands of rows. toPandas() with Arrow enabled collects data efficiently from JVM to Python; without it, row-by-row pickling dominates. The closer your code stays to DataFrame/SQL operations—which compile to pure JVM bytecode—the less the bridge matters and the faster your job runs.