sparklearning

Python & PySpark

Stories about how PySpark bridges Python and the JVM, and how to use the Python API efficiently.

Two Runtimes, One Job: How PySpark Bridges Python and the JVM — Py4J gateway, Python worker processes, pickle vs Arrow serialization
Pandas UDFs: How Arrow Makes Python Functions Fast Enough for Spark — scalar, scalar iterator, grouped aggregate, and grouped map pandas UDFs; Arrow batching

The UDF Tax: Why User-Defined Functions Are a Black Box to the Optimizer — why Python UDFs are expensive and how pandas UDFs improve on them
One Row In, Many Rows Out: The Story of User-Defined Table Functions — Python UDTFs for generating multiple output rows per input
The Columnar Fast Lane: How Apache Arrow Speeds Up PySpark — the Arrow format that powers fast toPandas() and pandas UDF transfers
Bytes on the Wire: How Spark Serializes Data for Tasks and Shuffles — pickle vs Arrow is the Python-specific chapter of the serialization story