Two Sigma uses third-party advertising and advertising analytics cookies that allow us and our partners to serve your more relevant advertisements across platforms. You may accept or decline our use of these kinds of cookies by selecting “accept” or “decline” below. For more information about our privacy practices, please visit our Cookies Policy here.

Learn More

Engineering

Improving Python and Spark Performance and Interoperability with Apache Arrow

Author: Li Jin (Two Sigma), Julien Le Dem

Venue: Spark Summit 2017, San Francisco

Abstract: Apache Spark has become a popular and successful way for Python programming to parallelize and scale up data processing. In many use cases though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master. Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently, without overhead. When collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When remote, scatter-gather I/O sends the memory representation directly to the socket avoiding serialization costs.

Improving Python and Spark Performance and Interoperability with Apache Arrow from Two Sigma