Complex operations on very large datasets can take multiple minutes in pandas. Polars is supposed to reduce that to a few seconds.
Since a lot of people use pandas to explore and experiment with datasets, having your workflow speed limited to a few operations an hour is hard to defend. That's where the value proposition of polars and similar solutions lies, IMO.
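To make the comparison concrete, here's roughly what the same aggregation looks like in each (file and column names are made up; polars is used via its lazy API, which is where most of the speedup is supposed to come from):

```python
import pandas as pd
import polars as pl

# pandas: eager evaluation, mostly single-threaded
pdf = pd.read_csv("trips.csv")                        # hypothetical file
result_pd = (pdf.groupby("passenger_count")["fare_amount"]
                .mean()
                .reset_index())

# polars: lazy query plan, optimised and executed across all cores
result_pl = (pl.scan_csv("trips.csv")                 # nothing is read yet
               .group_by("passenger_count")
               .agg(pl.col("fare_amount").mean())
               .collect())                            # plan runs here
```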
> operations on very large datasets can take multiple minutes in pandas. Polars is supposed to reduce that to a few seconds.
Good example; I can see how efficiency would matter for workflows like that.
I work with dataframes in the tens to hundreds of millions of rows (mostly in the tidyverse, but also pandas and base Python and R), and find most data wrangling operations are close to instant on modern laptops. Plotting is another story (not sure if polars helps there).
So the case for efficiency is weak at the 10-100 million row dataframe size (unless doing some intense computations), but gains strength as the size of the dataframe grows.
Would be a fun aside to test all these frameworks side by side with some 1m/10m/100m/1bn row joins, filters, summary calcs, maps, etc. to get some concrete data on where the efficiency gains start to become noticeable and start to matter. I think at sub-100m rows it probably doesn't. Not for the kinds of operations I do anyway.
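Something like the sketch below is what I have in mind: synthetic data, one identical group-by timed in each library at increasing row counts. The sizes, column names, and single operation are just placeholders (and the 100m case needs a few GB of RAM), but it gives a starting point:

```python
import time
import numpy as np
import pandas as pd
import polars as pl

def time_groupby(n_rows: int) -> tuple[float, float]:
    """Time an identical group-by mean in pandas and polars on synthetic data."""
    rng = np.random.default_rng(0)
    keys = rng.integers(0, 1_000, n_rows)
    vals = rng.random(n_rows)

    pdf = pd.DataFrame({"key": keys, "val": vals})
    pldf = pl.DataFrame({"key": keys, "val": vals})

    t0 = time.perf_counter()
    pdf.groupby("key")["val"].mean()
    t_pandas = time.perf_counter() - t0

    t0 = time.perf_counter()
    pldf.group_by("key").agg(pl.col("val").mean())
    t_polars = time.perf_counter() - t0

    return t_pandas, t_polars

for n in (1_000_000, 10_000_000, 100_000_000):
    tp, tl = time_groupby(n)
    print(f"{n:>11,} rows  pandas {tp:6.2f}s  polars {tl:6.2f}s")
```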
I'd be interested to know what proportion of users of dataframes are working at different orders of magnitude.
Most of my life I've had databases of like 1,000 rows. Now I have a big one of about 500K! So for me, speed is almost a non-issue. But that is my specific field.