It’s pandas, but fast. Pandas is the original open source data frame library: robust and widely used, but sprawling and apparently slower than this newcomer. The phrase “data frames” clues in people who have worked with them before.
Actually, pandas is not the original open source data frame library; at most, it is the original one in Python. There is a very rich tradition of data.frames in R, which includes the unjustly neglected data.table.
Yep! Unless I'm mistaken, R (and its predecessor S) seems to have been the first to introduce the concept of a dataframe.
One could also argue that dataframes are basically in-memory database tables. And in that case, S and SQL probably tie in terms of the creation timeline.
The difference is that dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc. These kinds of things don't really make sense in DB tables (they are generally not supported, and you have to jump through hoops to do similar things in DBs).
Oh, and another important difference is memory layout. Dataframe implementations mostly (or all) use a column-major format, whereas most conventional SQL implementations use a row-major format, I believe.
> The difference is that dataframes can also be seen as matrices. You can do row operations, row + column operations, multiply rows and columns, multiply different matrices, transpose them, etc.
I think this is overblowing the similarities to matrices. Matrices have elements all of the same type, while data.frames mix numbers, characters, factors, etc. You certainly cannot transpose a data.frame and still have a data.frame that makes sense. Multiplying rows would not make sense either, since within one row you will have different types of data. Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.
> Unless you have a data.frame that is all numeric, but in that case one should probably be using a matrix in the first place.
They still have their advantages: row/column labels, NaN handling, etc. These are not operations I am speculating about, by the way. I am most familiar with pandas, and the dataframe there has transpose and dot product operations, and almost all column operations have their row counterparts (e.g. you sum with either sum(axis=0) or sum(axis=1)).
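For what it's worth, a minimal pandas sketch of those matrix-style operations (the toy frame here is my own invention):

```python
import pandas as pd

# A small all-numeric frame (made-up data) with row and column labels.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r1", "r2"])

col_sums = df.sum(axis=0)   # collapse rows: one value per column
row_sums = df.sum(axis=1)   # collapse columns: one value per row
transposed = df.T           # transpose; labels swap roles instead of being lost
product = df.dot(df.T)      # matrix product, aligned on labels
```

Unlike a raw matrix, the labels travel with the data: `product` ends up indexed by `r1`/`r2` on both axes.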
Oh, based on the comment you replied to, I thought this was about R. In R, matrices can handle NaNs and NAs, have column and row labels, have dot products, and much more.
Yeah. I think Wes McKinney liked the data frames in R, but preferred the programming language of Python. I've heard somewhere that he also got a lot of inspiration from APL.
R is literally designed to do statistics and has first class support and language feature support for many specialized tasks in statistics and closely related fields.
Python is literally designed to be easy to program with in general.
Well, it turns out that when you’re dealing with terabytes of data and TFLOPS, the programming becomes more important than the math. Not all R devs are happy about this, and they are very loud about it.
But it shouldn’t really surprise anyone. That is literally how those languages are designed.
Most of the R devs I know like this are just butthurt that they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. The first group is all over the place, but the second group tends to skew older, of course.
R is heavily influenced by Scheme. Not only is it heavily functional, but it has metaprogramming capabilities allowing a high level of flexibility and expressiveness. The tidyverse libraries use this heavily to produce very nice composable APIs that aren't really practically possible in Python.
R is fine. The issue is more in the ecosystem (with the aforementioned exception of the tidyverse).
> Most of the R devs I know like this are just butthurt that they are paid less and refuse to switch because they’re obstinate, or they’re a little scared they’re being left behind. The first group is all over the place, but the second group tends to skew older, of course.
Look, I started with R and use mostly Python these days, but this is not really a fair take.
R is (still) much, much, much better for analytics and graphing (the only decent plotting library in Python is a ggplot clone). The big change (and the reason Python ended up winning) is that integrating R with other tools (web stuff, for example) is harder than just using Python.
pandas (for instance) is like an unholy clone of the worst features of both R and Python. Polars is pretty rocking, though (mostly because it borrows from Spark/dplyr/LINQ).
It's another example of Python being the second best language for everything winning out in the marketplace.
That being said, if I was starting a data focused company and needed to pick a language, I'd almost certainly build all the DS focused stuff in R as it would be many many times quicker, as long as I didn't need to hire too many people.
> which includes the unjustly neglected data.table
So so true.
I was working on an ad hoc project that needed a quick result by the end of the day. I had to pull in a series of Parquet files and do some quick and dirty analysis. My first reflex was to use Python with pandas: quick and easy. But pandas could not handle the datasets; they were too large. I decided to give R and data.table a go, and it went smoothly. I am usually a Python user, but from time to time I feel compelled to jump back to R and data.table. Phenomenal tool.
Pandas has also moved to Apache Arrow as a backend [1], so it’s likely performance will be similar when comparing recent versions. But it’s great to have some friendly competition.
Polars has an OLAP query engine, so without a significant overhaul I highly doubt pandas will come close to Polars in performance for many general-case workloads.
Polars can use lazy processing: it collects all of the operations together and builds a graph of what needs to happen, while pandas executes each operation eagerly as soon as it is called.
Spark does this too, and it makes complete sense for distributed setups, but apparently it is still faster locally as well.
Laziness in this context has huge advantages in reducing memory allocation. Many operations can be fused together, so there's less of a need to allocate huge intermediate data structures at every step.
It's been around in R-land for a while with dplyr and its variety of backends (including Arrow, the same as Polars). Pandas is just an incredibly mediocre library in nearly all respects.
> Pandas is the original open source data frame library
...ehh, not quite. R and its predecessor S have Pandas beat by decades. Pandas wasn't even the first data frame library for Python. But it sure is popular now.
Oh, yes, I was aware that R (and its predecessor S) have a native dataframe object in the language.
It seemed that gmfawcett was indicating that there was a dataframe library in _Python_ that existed prior to Pandas. I was curious what that library was/is, as I'd not heard that before.
Sorry :) Pandas is the undisputed king. But there were multiple bindings from Python into R available in the early 2000s. Some, like rpy and rpy2, are still around; others are long defunct. I concede that these weren't standalone dataframe libraries, but rather dataframe features built into a language binding.
If I understand correctly, pandas' original scope was indexed in-memory data frames for use in high-frequency trading, making use of the numpy library under the hood. At the time it was written, you had JPMC's Athena, GS's platform, and several HFT-internal systems (C++, my friends in that space have mentioned). Pandas is just so darn useful! I've been using it since maybe version 0.10, and even got to contribute a tiny bit to the sas7bdat handling.
Indeed, it's both: it was created for financial analytics, and it provides R dataframe features to Python. Thanks for making me detour into the history of it.
I may be a rare bird: I started with R dataframes (still newbie+ level), then Python Polars (intermediate- ?). Frankly, whenever I have to use pandas or dataframes in R, I am not convinced that they are more intuitive or easier to master, e.g. I do not like the concept of row names.
Polars can be overkill for small/medium datasets, but since I have been bitten by corrupted or badly formatted CSVs/TSVs, I love the fact that Polars will throw in the towel and complain about type or column-count mismatches, etc.
And the fact that it can scale up to millions of rows on a modest workstation compensates for the fact that sometimes one can spend hours finding the proper way to manipulate a dataset.