It’s pandas, but fast. Pandas is the original open source data frame library: robust and widely used, but sprawling and apparently slower than this newcomer. The phrase “data frames” clues in people who have worked with them before.
Actually, pandas is not the original open source data frame library; at most, it is the original one in Python. There is a very rich tradition of data.frames in R, which includes the unjustly neglected data.table.
Yep! Unless I'm mistaken, R (and its predecessor S) seems to have been the first to introduce the concept of a dataframe.
One could also argue that dataframes are basically in-memory database tables. And in that case, S and SQL probably tie in terms of the creation timeline.
The difference is that dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc. These kinds of things don't really make sense in DB tables (they are generally not supported, and you have to jump through hoops to do similar things in DBs).
Oh, and another important difference is memory layout. Dataframe implementations mostly (or all) use a column-major format, whereas most conventional SQL implementations use a row-major format, I believe.
> The difference is that dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc.
I think this is overblowing the similarities to matrices. Matrices have elements all of the same type, while data.frames mix numbers, characters, factors, etc. You certainly cannot transpose a data.frame and still have a data.frame that makes sense. Multiplying rows would not make sense either, since within one row you will have different types of data. Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.
> Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.
They still have their advantages: row/column labels, NaN handling, etc. These are not operations I am speculating about, by the way. I am most familiar with pandas, and the dataframe there has transpose and dot product operations, and almost all column operations have their row counterparts (e.g. you sum with either sum(axis=0) or sum(axis=1)).
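For what it's worth, a minimal pandas sketch of those matrix-style operations (the toy frame here is my own invention):

```python
import pandas as pd

# A small all-numeric frame (made-up data) with row and column labels.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r1", "r2"])

col_sums = df.sum(axis=0)   # collapse rows: one value per column
row_sums = df.sum(axis=1)   # collapse columns: one value per row
transposed = df.T           # transpose; labels swap roles instead of being lost
product = df.dot(df.T)      # matrix product, aligned on labels
```

Unlike a raw matrix, the labels travel with the data: `product` ends up indexed by `r1`/`r2` on both axes.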
Oh, based on the comment you replied to, I thought this was about R. In R, matrices can handle NaNs and NAs, have column and row labels, have dot products, and much more.
Yeah. I think Wes McKinney liked the data frames in R, but preferred the programming language of Python. I've heard somewhere that he also got a lot of inspiration from APL.
R is literally designed to do statistics and has first class support and language feature support for many specialized tasks in statistics and closely related fields.
Python is literally designed to be easy to program with in general.
Well, it turns out that when you’re dealing with terabytes of data and TFLOPS, the programming becomes more important than the math. Not all R devs are happy about this, and they are very loud about it.
But it shouldn’t really surprise anyone. That is literally how those languages are designed.
Most of the R devs I know like this are just butthurt that they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. The first group is all over the place, but the second group tends to skew older, of course.
R is heavily influenced by Scheme. Not only is it heavily functional, but it has metaprogramming capabilities allowing a high level of flexibility and expressiveness. The tidyverse libraries use this heavily to produce very nice composable APIs that aren't really practically possible in Python.
R is fine. The issue is more in the ecosystem (with the aforementioned exception of the tidyverse).
> Most of the R devs I know like this are just butthurt that they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. The first group is all over the place, but the second group tends to skew older, of course.
Look, I started with R and use mostly Python these days, but this is not really a fair take.
R is (still) much, much, much better for analytics and graphing (the only decent plotting library in Python is a ggplot clone). The big change (and the reason Python ended up winning) is that integrating R with other tools (web stuff, for example) is harder than just using Python.
pandas (for instance) is like an unholy clone of the worst features of both R and Python. Polars is pretty rocking, though (mostly because it borrows from Spark/dplyr/LINQ).
It's another example of Python being the second best language for everything winning out in the marketplace.
That being said, if I was starting a data focused company and needed to pick a language, I'd almost certainly build all the DS focused stuff in R as it would be many many times quicker, as long as I didn't need to hire too many people.
> which includes the unjustly neglected data.table
So so true.
I was working on an ad hoc project that needed a quick result by the end of the day. I had to pull in a series of Parquet files and do some quick and dirty analysis. My first reflex was to use Python with pandas: quick and easy. But pandas could not handle the datasets; they were too large. I decided to give R and data.table a go, and it went smoothly. I am usually a Python user, but from time to time I feel compelled to jump back to R and data.table. Phenomenal tool.
Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.
Polars has an OLAP query engine, so without a significant overhaul I highly doubt pandas will come close to Polars in performance for many general-case workloads.
Polars can use lazy processing: it collects all of the operations together and builds a graph of what needs to happen, while pandas executes each operation eagerly as soon as it is called.
Spark does this too, and it makes complete sense for distributed setups, but apparently it is still faster locally as well.
Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.
It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.
> Pandas is the original open source data frame library
...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.
Oh, yes, I was aware that R (and its predecessor S) have a native dataframe object in the language.
It seemed that gmfawcett was indicating that there was a dataframe library in _Python_ that existed prior to Pandas. I was curious what that library was/is, as I'd not heard that before.
Sorry :) Pandas is the undisputed king. But there were multiple bindings from Python into R available in the early 2000s. Some, like rpy and rpy2, are still around; others are long defunct. I concede that these weren't standalone dataframe libraries, but rather dataframe features built into a language binding.
If I understand correctly, pandas' original scope was indexed in-memory data frames for use in high-frequency trading, making use of the numpy library under the hood. At the time it was written, you had JPMC's Athena, GS's platform, and several HFT-internal systems (C++, my friends in that space have mentioned). Pandas is just so darn useful! I've been using it since maybe version 0.10, and even got to contribute a tiny bit to the sas7bdat handling.
Indeed, it's both: it was created for financial analytics, and it provides R dataframe features to Python. Thanks for making me detour into the history of it.
I may be a rare bird: I started with R dataframes (still newbie+ level), then Python Polars (intermediate- ?). Frankly, whenever I have to use pandas or dataframes in R, I am not convinced that they are more intuitive or easier to master, e.g. I do not like the concept of row names.
Polars can be overkill for small/medium datasets, but since I have been bitten by corrupted or badly formatted CSVs/TSVs, I love the fact that Polars will throw in the towel and complain about type or column-count mismatches, etc.
And the fact that it can scale up to millions of rows on a modest workstation compensates for the fact that sometimes one can spend hours finding the proper way to manipulate a dataset.