
Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...
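For reference, a minimal sketch of opting into the Arrow backend in pandas 2.x (the file name is just for illustration):

    import pandas as pd

    # pandas 2.x can back columns with Arrow dtypes instead of NumPy ones
    df = pd.read_csv("sales.csv", dtype_backend="pyarrow")
    print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow]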



Not according to DuckDB benchmarks. Not even close.

https://duckdblabs.github.io/db-benchmark/


Ouch! It is going to take a lot of work to get Polars this fast. If ever.


Polars has an OLAP query engine, so without a significant overhaul I highly doubt pandas will come close to Polars in performance for many general workloads.


This is a great chance to ELI5: what is an OLAP query engine and why does it make polars fast?


Polars can use lazy processing: it collects all of the operations together and builds a graph of what needs to happen, while pandas executes each operation as soon as it's called.

Spark does the same thing, and it makes complete sense for distributed setups, but apparently it's still faster locally too.
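A minimal sketch of the difference, assuming a hypothetical sales.csv with amount and region columns:

    import pandas as pd
    import polars as pl

    # pandas: each step runs immediately and materializes a full intermediate frame
    df = pd.read_csv("sales.csv")
    out_pd = df[df["amount"] > 100].groupby("region")["amount"].sum()

    # Polars: scan/filter/group_by only build a query plan;
    # nothing executes until .collect(), so the whole graph can be optimized first
    out_pl = (
        pl.scan_csv("sales.csv")
          .filter(pl.col("amount") > 100)
          .group_by("region")
          .agg(pl.col("amount").sum())
          .collect()
    )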


Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.
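You can see this in recent Polars versions by printing the optimized plan for a lazy query like the one above: the filter and column selection get pushed down into the scan, so the full CSV is never materialized (again assuming the hypothetical sales.csv):

    import polars as pl

    lazy = (
        pl.scan_csv("sales.csv")
          .filter(pl.col("amount") > 100)
          .select("region", "amount")
    )
    # Shows the optimized plan: the predicate and projection are pushed into
    # the CSV scan, so no big intermediate DataFrame is ever allocated
    print(lazy.explain())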


Yeah, totally, I can see that. I think Polars is the first library to do this locally, which is surprising if it has so many advantages.


It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.


> It's been around in R-land for a while with dplyr and its variety of backends

Only for SQL databases, so not really. Source: have been running dplyr since 2011.


The Arrow backend does allow for lazy eval.

https://arrow.apache.org/cookbook/r/manipulating-data---tabl...


Memory and CPU usage are still really high, though.


Not with the eager API.




