
It’s pandas, but fast. Pandas is the original open source data frame library. Pandas is robust and widely used, but sprawling and apparently slower than this newcomer. The word “data frames” keys in people who have worked with them before.


Actually pandas is not the original open source data frame library, perhaps only in Python. There is a very rich tradition in R on data.frames, which includes the unjustly neglected data.table.


Yep! Unless I'm mistaken, R (and its predecessor S) seems to have been the first to introduce the concept of a dataframe.

One could also argue that dataframes are basically in-memory database tables. And in that case, S and SQL probably tie in terms of the creation timeline.


The difference is dataframes can also be seen like matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them etc. These kinds of things don't really make sense in DB tables (they're generally not supported, and you have to jump through hoops to do similar things in DBs).


Yes, that's totally fair; dataframes are more flexible in that sense.


Oh, and another important difference is memory layout. The dataframe implementations mostly (or all) use column-major format. Whereas most conventional SQL implementations use row-major format, I believe.


I think most OLTP databases are row oriented whilst most OLAP are column.
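To make the layout difference concrete, here's a toy sketch using NumPy's C (row-major) vs Fortran (column-major) orders as stand-ins (assuming numpy is installed; real database page layouts are far more involved, this is just an illustration):

```python
import numpy as np

# The same 3x2 table stored two ways. This uses NumPy memory orders as a
# stand-in for the layouts discussed above, not any engine's actual internals.
table = np.array([[1, 10], [2, 20], [3, 30]])

row_major = np.ascontiguousarray(table)  # C order: each row is contiguous
col_major = np.asfortranarray(table)     # Fortran order: each column is contiguous

# Scanning a single column touches contiguous memory only in the
# column-major copy, which is why OLAP/columnar engines favor that
# layout for aggregations over a few columns.
```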


> The difference is dataframes can also be seen like matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them etc.

I think this is overblowing the similarities to matrices. Matrices have elements all of the same type, while data.frames mix numbers, characters, factors, etc. You certainly cannot transpose a data.frame and still have a data.frame that makes sense. Multiplying rows would not make sense either, since within one row you will have different types of data. Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.


> Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.

They still have their advantages with row/column labels, NaN handling, etc. These are not operations I am speculating about, by the way: I am most familiar with pandas, and the dataframe there has transpose and dot-product operations, and almost every column operation has a row counterpart (i.e. you can either sum(axis=0) or sum(axis=1)).
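For the curious, a tiny sketch of that row/column symmetry in pandas (assuming pandas is installed; the labels and values are just illustrative):

```python
import pandas as pd

# A small labeled table: columns "a"/"b", rows "x"/"y" (names are made up).
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["x", "y"])

col_sums = df.sum(axis=0)  # down each column: a -> 3.0, b -> 7.0
row_sums = df.sum(axis=1)  # across each row:  x -> 4.0, y -> 6.0
dt = df.T                  # transpose swaps the row and column labels
```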


Oh, based on the comment you replied to I thought this was about R. In R matrices can handle NaNs and NAs, have column and row labels, have dot products and much more.


I feel like the predecessor of R should be Q!


The way that I've heard the story, S was short for "statistics", and R was chosen because the authors were _R_obert [Gentleman] and _R_oss [Ihaka].

Statisticians are funny!


Yeah. I think Wes McKinney liked the data frames in R, but preferred the programming language of Python. I've heard somewhere that he also got a lot of inspiration from APL.


R is literally designed to do statistics and has first class support and language feature support for many specialized tasks in statistics and closely related fields.

Python is literally designed to be easy to program with in general.

Well, it turns out when you’re dealing with terabytes of data and TFLOPS, the programming becomes more important than the math. Not all R devs are happy about this and they are very loud about it.

But it shouldn’t really surprise anyone. That is literally how those languages are designed.

Most of the R devs I know like this are just butthurt they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. first group is all over the place, but the second group tends to skew older of course


R is heavily influenced by Scheme. Not only is it heavily functional, but it has metaprogramming capabilities allowing a high level of flexibility and expressiveness. The tidyverse libraries use this heavily to produce very nice composable APIs that aren't really practically possible in Python.

R is fine. The issue is more in the ecosystem (with the aforementioned exception of the tidyverse).


Imho forecasting is still way better in R than in Python


> Most of the R devs I know like this are just butthurt they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. first group is all over the place, but the second group tends to skew older of course

Look, I started with R and use mostly Python these days, but this is not really a fair take.

R is (still) much, much, much better for analytics and graphing (the only decent plotting library in python is a ggplot clone). The big change (and why Python ended up winning) is that integrating R with other tools (like web stuff, for example) is harder than just using Python.

pandas (for instance) is like an unholy clone of the worst features from both R and Python. Polars is pretty rocking, though (mostly because it clones from Spark/dplyr/LINQ).

It's another example of Python being the second best language for everything winning out in the marketplace.

That being said, if I was starting a data focused company and needed to pick a language, I'd almost certainly build all the DS focused stuff in R as it would be many many times quicker, as long as I didn't need to hire too many people.


> which includes the unjustly neglected data.table

So so true.

I was working on an ad hoc project that needed a quick result by the end of the day. I had to pull in a series of parquet files and do some quick and dirty analysis. My first reflex was to use Python with pandas, quick and easy, but pandas could not handle the datasets; they were too large. I decided to give R and data.table a go and it went smoothly. I am usually a Python user, but from time to time I feel compelled to jump back to R and data.table. Phenomenal tool.


Indeed.

Python is great for getting things into a data.frame. But once in a data.frame, R is so much easier to work with for all things stats and ml-related.


My friend. You cannot make people like R. We all know about and study data.table, so it’s not neglected, we just don’t use that implementation.

Mainly because R sucks for anything that isn’t statistics.


Ah, like polar bears are a much more aggressive implementation of the idea behind panda bears? That’s a pretty funny name if so.


Yeah. The name always makes me chuckle


I don't know, I think the name is kind of polar-izing

/pun


Oh, I'm not sure. I'd say it's bear-ly polarizing.

I'm so sorry.


Depends on your frame of mind.


This thread is turning into pandamonium


I'm worried it's going to get grizzly.


Thanks folks, you all made my day. "Frame of mind" was my favorite. I'm surprised I didn't think of some of these, I must be getting... Rust-y


Next re-implementation will be called grizzl.ys, hand-written in Y86 assembly.


Pandas has also adopted Apache Arrow as an optional backend [1], so it's likely performance will be similar when comparing recent versions. But it's great to have some friendly competition.

[1] https://datapythonista.me/blog/pandas-20-and-the-arrow-revol...


Not according to DuckDB benchmarks. Not even close.

https://duckdblabs.github.io/db-benchmark/


Ouch! It is going to take a lot of work to get Polars this fast. If ever.


Polars has an OLAP query engine so without any significant pandas overhaul, I highly doubt it will come close to polars in performance for many general case workloads.


This is a great chance to ELI5: what is an OLAP query engine and why does it make polars fast?


Polars can use lazy processing: it collects the operations together and builds a graph of what needs to happen, while pandas executes each operation eagerly, as soon as it's called.

Spark tended to do this, and it makes complete sense for distributed setups, but apparently it's faster locally as well.


Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.


yeah, totally, I can see that. I think that polars is the first library to do this locally, which is surprising if it has so many advantages.


It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.


> It's been around in R-land for a while with dplyr and its variety of backends

Only for SQL databases, so not really. Source: have been running dplyr since 2011.


The Arrow backend does allow for lazy eval.

https://arrow.apache.org/cookbook/r/manipulating-data---tabl...


Memory and CPU usage is still really high though.


Not with eager API.


> Pandas is the original open source data frame library

...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.


That's interesting! I didn't realize there had been prior dataframe libraries in Python!

Out of curiosity, what was/were the previous libraries?


It's a built-in data structure (and set of functions) in R.


Oh, yes, I was aware that R (and its predecessor S) have a native dataframe object in the language.

It seemed that gmfawcett was indicating that there was a dataframe library in _Python_ that existed prior to Pandas. I was curious what that library was/is, as I'd not heard that before.


OK, I guess I misunderstood the comments from both of you. ´_>`


Sorry :) Pandas is undisputed king. But there were multiple bindings from Python into R available in the early 2000's. Some like rpy and rpy2 are still around, others are long defunct. I concede that these weren't standalone dataframe libraries, but rather dataframe features built into a language binding.


Not *original* but probably most commonly used.


Yeah, I believe Pandas was inspired by similar functionality in R.


Yup, I first met data frames in R, and pandas is the Python answer to R, isn't it?


If I understand correctly, pandas' original scope was indexed in-memory data frames for use in high-frequency trading, using the numpy library under the hood. At the time it was written you had JPMC's Athena, GS's platform, and several internal HFT systems (C++, my friends in that space have mentioned). Pandas is just so darn useful! I've been using it since maybe version 0.10, and even got to contribute a tiny bit to the sas7bdat handling.


Indeed, it's both: it was created for financial analytics, and it brings R's dataframe features to Python. Thanks for making me detour into the history of it.


I may be a rare bird: I started with R dataframes (still newbie+ level), then moved to Python's polars (intermediate-?). Frankly, whenever I have to use pandas or data frames in R, I am not convinced that they are more intuitive or easier to master; e.g. I do not like the concept of row names.

Polars can be overkill for small/medium datasets, but since I have been bitten by corrupted or badly formatted CSVs/TSVs, I love the fact that Polars will throw in the towel and complain about type or column-count mismatches. And the fact that it can scale up to millions of rows on a modest workstation compensates for the fact that one can sometimes spend hours finding the proper way to manipulate a dataset.



