
Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...
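For reference, a minimal sketch of opting into the Arrow backend in pandas 2.x (the file name is just for illustration):

    import pandas as pd

    # pandas 2.x can back columns with Arrow dtypes instead of NumPy ones
    df = pd.read_csv("sales.csv", dtype_backend="pyarrow")
    print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow]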



Not according to DuckDB benchmarks. Not even close.

https://duckdblabs.github.io/db-benchmark/


Ouch! It is going to take a lot of work to get Polars this fast. If ever.


Polars has an OLAP query engine, so without a significant overhaul I highly doubt pandas will come close to Polars in performance for many general workloads.


This is a great chance to ELI5: what is an OLAP query engine and why does it make polars fast?


Polars can use lazy processing: it collects all of the operations together and builds a graph of what needs to happen, while pandas executes each operation as soon as it's called.

Spark does the same thing, and it makes complete sense for distributed setups, but apparently it's still faster locally too.
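A minimal sketch of the difference, assuming a hypothetical sales.csv with amount and region columns:

    import pandas as pd
    import polars as pl

    # pandas: each step runs immediately and materializes a full intermediate frame
    df = pd.read_csv("sales.csv")
    out_pd = df[df["amount"] > 100].groupby("region")["amount"].sum()

    # Polars: scan/filter/group_by only build a query plan;
    # nothing executes until .collect(), so the whole graph can be optimized first
    out_pl = (
        pl.scan_csv("sales.csv")
          .filter(pl.col("amount") > 100)
          .group_by("region")
          .agg(pl.col("amount").sum())
          .collect()
    )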


Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.
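You can see this in recent Polars versions by printing the optimized plan for a lazy query like the one above: the filter and column selection get pushed down into the scan, so the full CSV is never materialized (again assuming the hypothetical sales.csv):

    import polars as pl

    lazy = (
        pl.scan_csv("sales.csv")
          .filter(pl.col("amount") > 100)
          .select("region", "amount")
    )
    # Shows the optimized plan: the predicate and projection are pushed into
    # the CSV scan, so no big intermediate DataFrame is ever allocated
    print(lazy.explain())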


Yeah, totally, I can see that. I think Polars is the first library to do this locally, which is surprising if it has so many advantages.


It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.


> It's been around in R-land for a while with dplyr and its variety of backends

Only for SQL databases, so not really. Source: have been running dplyr since 2011.


The Arrow backend does allow for lazy eval.

https://arrow.apache.org/cookbook/r/manipulating-data---tabl...


Memory and CPU usage are still really high, though.


Not with the eager API.




