
I'm really excited about Polars and its speed is super impressive, buuut... it annoys me to see vaex, modin and dask all compared on the same benchmarks.

For anyone who doesn't use those libraries, they are all targeted towards out-of-core or distributed data processing (i.e. working with data that is too big for memory, often by spreading the computation across multiple machines). Comparing them to a single-machine, in-memory data frame library is just silly, and they will obviously be slower because they necessarily come with a lot of overhead. It just wouldn't make sense to use polars in the same context as those libraries, so presenting them in benchmarks as if they were equivalents is misleading.

And on top of that, duckdb, which you might use in the same context as polars and is faster than polars in a lot of contexts, isn't included in the benchmarks.

The software engineering behind polars is amazing work and there's no need to have misleading benchmarks like this.



I don't know about the others, but you can use Dask on a single machine, and that's also the easiest way to use it. It parallelizes operations by splitting dataframes into partitions that are processed on individual cores of your machine. The performance boost over pandas can be 2x with zero config, and I've seen up to 5x on certain operations.


Ibis, a Python dataframe library created by the creator of pandas, uses DuckDB as the default backend and generally beats Polars on these benchmarks (with exceptions on some queries).



