Transforming Postgres into a Fast OLAP Database (paradedb.com)
245 points by philippemnoel on Feb 7, 2024 | 60 comments


Thanks for sharing, I really like Postgres. However, I generally use Postgres for OLTP work. I would like to point out two things:

1. Based on Clickbench results, pg_analytics is still far from top-tier performance. If you're looking for a high-performance OLAP database, you should consider the top-ranked products.

2. The queries tested by ClickBench are really simple, far from real data analysis scenarios. You're surely aware of the TPC-DS and TPC-H benchmarks; ClickHouse simply cannot run these test suites, which is why ClickBench does not include those datasets.

Lastly, I want to say, if your enterprise is of a certain size, separating OLTP and OLAP into two databases is the right choice, because you will have two different teams responsible for two different tasks. By the way, I use StarRocks for OLAP work.


One of the authors of pg_analytics here.

1. On Clickbench, make sure you're doing an apples-to-apples comparison by comparing scores from the same instance. We used the most commonly-used c6a.4xlarge instance. While a few databases like DuckDB rank higher, the performance of Datafusion (our underlying query engine) is constantly improving, and pg_analytics inherits those improvements.

Then again, people only care about performance and benchmarks up to a certain threshold. The goal of pg_analytics is not to displace something like StarRocks, but to enable analytical workloads that require both row and column-oriented data or Postgres transactions.

2. We're working on TPC-H benchmarks. They're good for demonstrating JOIN performance and we'll have them published early next week.


+1 for this.

Most of the time, all that matters in terms of performance is the user's tolerance. Once that is reached, operational complexity becomes a lot more important. We use raw Postgres for analytics, knowing that projects like these and cloud offerings like AlloyDB will make our lives easier (in terms of performance) as time goes on.

pg_bm25 looks awesome too! Next up, take FDWs to the level of Trino/Drill, and we don't need anything else other than Postgres and its extensions!


This is exactly how we think over at ParadeDB, and querying data lakes is on our roadmap :)


Thank you for your great work!


> Based on Clickbench results, pg_analytics is still far from top-tier performance

You touch on this in your next sentence, but really, how many people need that kind of performance?


As long as you own it, you will love it. Trust me.


Clickbench is really nice because it is so easy to compare and contribute benchmarks. Is there anything out there like Clickbench, but for the TPC-DS and TPC-H benchmarks?


TPC-H has some published benchmarks on their website, but they're for specific hardware systems. It's certainly not as user-friendly and modern as ClickBench. I'm not sure if it's possible to make something better without their consent, though.


Here's a link comparing ParadeDB to DuckDB, ClickHouse, and Snowflake:

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...


Interesting stuff keeps happening in PostgresLand. This uses Apache DataFusion, which is the first time I'm hearing about it, but from reading the docs it seems like it already offers a built-in SQL interface atop Apache Parquet and Apache Arrow. What is the difference between these two versions of SQL? Can I still use my favourite PL/SQL functions together with a deltalake table, and does this play nice with other kinds of tables?


Datafusion's SQL dialect has some slight quirks that pertain to Datafusion-specific functionality. For instance, the ability to create an external table.

With regards to ParadeDB, we rely on the Datafusion SQL parser, which can transform the Postgres SQL dialect into a Datafusion logical plan that can be executed by Datafusion. We actually have an open PR that adds support for user-defined functions...it will likely get merged within a few days.
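
For reference, here's a rough sketch of the kind of DataFusion-specific statement mentioned above (the table name and file path are just illustrative):

    -- DataFusion SQL, not standard Postgres: register a Parquet file
    -- as an external table, then query it like any other table.
    CREATE EXTERNAL TABLE hits
    STORED AS PARQUET
    LOCATION '/data/hits.parquet';

    SELECT COUNT(*) FROM hits;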


Thanks for your reply. So Postgres is taking care of the Parquet files in the background in some way, I guess. Would it be possible to combine this with something like Neon?


We haven't tested. Neon has a pretty unique storage architecture, so I suspect it wouldn't work out-of-the-box, but could with some modifications on our end. We're open to this type of partnership eventually.

If you decide to try it, I believe they now have a way to load in arbitrary Postgres extensions!


We are happy to try for sure. Analytics workloads really benefit from separation of storage and compute!


Looks very nice. I have three random questions:

1) How does this deal with backups? Presumably the deltalake tables can't be backed up by Postgres itself, so I guess there's some special way to do backups?

2) Similarly for replication (physical or logical). Presumably that's not supported, right? I guess logical replication is more useful for OLTP databases from which the data flows to data lakes, so that's fine. But what's the HA story without physical replication?

3) Presumably all the benefits are from compression at the storage level? Or are there some tweaks to the executor to do columnar stuff? I looked at the hooks in pg_analytics, but I see only stuff to handle DML. But I don't speak Rust, so maybe I missed something.


1) and 2) Backups and replication are on the roadmap. It's the next major feature we're working on.

3) We store data in Parquet, a heavily compressed columnar file format. This ensures good compression and compatibility with Arrow for in-memory columnar processing. We also hook at the executor level and route queries on deltalake tables to the DataFusion query engine, which processes the data in a vectorized fashion for much faster execution.
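
As a rough sketch of what that looks like from the SQL side (assuming a `deltalake` table access method, as the tables are referred to in this thread; the exact pg_analytics syntax may differ):

    -- Hypothetical example: a columnar table using the deltalake
    -- access method described above.
    CREATE TABLE sales (
        id    BIGINT,
        price DOUBLE PRECISION,
        sold  DATE
    ) USING deltalake;

    -- Queries over this table are picked up by the executor hook and
    -- handed to DataFusion, which scans the Parquet data vectorized.
    SELECT sold, SUM(price) FROM sales GROUP BY sold;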


Thanks. I wonder how you plan to replicate stuff, considering how heavily it relies on WAL (I haven't found the answer in the code, but I'm not very familiar with Rust).

How large a part of the plan do you route to the deltalake tables? Just scans, or something more complex? Can you point me to the part of the code doing that? I'm intrigued.


Hey! Sorry I missed this.

1) You can find the code here: https://github.com/paradedb/paradedb/blob/996f018e3258d3989f...

For deltalake tables, we send the full plan to DataFusion.

2) Re: WALs and replication -- We are currently adding support for WALs


Congrats on launching – we just spent quite a bit of time replicating/transforming our primary database into ClickHouse for OLAP use cases, and it would have been way easier if there were a Postgres-native solution. Hoping the managed hosting providers catch on.


I’m actually in the same boat right now (primary DB -> Clickhouse). We’re currently trialing Airbyte but it appears they’re removing normalization support for Clickhouse this year. Did you land on a custom solution, or some other ETL/ELT tool?


Have you tried the PostgreSQL table engine?

Or Debezium to Kafka for a more hand-coded solution?


The MaterializedPostgreSQL engine is still experimental and has some backlog before being production-ready.

Debezium -> Kafka -> Kafka engine -> MVs -> ReplacingMergeTree works like a charm for me.
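
For anyone curious, a stripped-down sketch of that pipeline on the ClickHouse side (broker, topic, and columns are illustrative, and real Debezium payloads usually need unwrapping first):

    -- Kafka source table consuming the Debezium topic (illustrative names).
    CREATE TABLE events_queue (
        id      UInt64,
        payload String,
        version UInt64
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list  = 'pg.public.events',
             kafka_group_name  = 'ch-events',
             kafka_format      = 'JSONEachRow';

    -- Deduplicating target table keyed by id.
    CREATE TABLE events (
        id      UInt64,
        payload String,
        version UInt64
    ) ENGINE = ReplacingMergeTree(version)
    ORDER BY id;

    -- Materialized view that moves rows from the queue into the target.
    CREATE MATERIALIZED VIEW events_mv TO events AS
    SELECT id, payload, version FROM events_queue;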


There are many Postgres forks for OLAP workloads, such as:

- Greenplum;

- Citus with cstore_fdw;

- IMCS (Konstantin Knizhnik);

- Hydra;

- AlloyDB;

For example, Greenplum is one of the earliest and fairly mature, but it has been abandoned.


That's right! Greenplum is still active, but falling off.

We see pg_analytics as the next-generation Citus columnar, with much better performance and integration into the wider data ecosystem via Delta Lake, and eventually Iceberg


I've been scanning through the comments looking for the opportunity to ask, while not seeming critical...

why start with Delta Lake instead of Iceberg?

I imagine it might have had something to do with timing and the state of Iceberg at the time you started this effort?


You're exactly right. There is a mature project called delta-rs which we use. The Iceberg equivalent, iceberg-rs, is still very young. We are planning to contribute to it and eventually add support for Iceberg


Slightly related, but it's so interesting to me to see all those new small tech companies that basically provide value on top of Postgres! They usually blog about some obscure Postgres topic and one of their articles gets on the front page of HN. Good marketing and very cool stuff!


How do you find working with Arrow / Parquet?

Personally, I found it very hard to reason about and thought that ClickHouse's strategy for managing columnar data was much more reasonable.


Datafusion and Deltalake abstract away most of Arrow/Parquet. And those APIs were very nice to work with.


Do you have an example where CH was more reasonable? I don't have enough experience here, so would greatly appreciate hearing what you've seen. Thank you!


Is this deltalake the same as Databricks deltalake? Is it compatible?


Yes. delta-rs is a Rust-based implementation of Delta Lake.

The existing version of pg_analytics uses delta-rs to manage Parquet files stored within Postgres. In the future, we plan on integrating external object stores. This means that you'll be able to query any Delta Lake directly from Postgres.

Iceberg support will come later, once the Rust implementation of Iceberg matures.


Now this is cool. Making data available to applications (and other nice things).


This is pretty cool. It seems like there are a lot of hard database problems that still need to be solved, but finding the right database system (or extension) seems about as difficult as finding a research article and implementing the solution yourself lately. This seems like a step in the right direction by integrating with Postgres, which is widely used and supported.

Case in point regarding OLAP in particular, I am currently trying to solve a problem where I have a high number of categorical dimensions, and I want to perform a “count distinct” over any combination of dimensions, grouped by any other combination of dimensions, filtered by specific values in each dimension. E.g., count(distinct a.col1, b.col2), count(distinct a.col1), count(distinct b.col3) from table a join table b using (id) group by a.col4, b.col7.
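
Spelled out as a standalone query (same hypothetical tables and columns as above; note that Postgres wants a row constructor for a multi-column count distinct):

    SELECT
        a.col4,
        b.col7,
        count(DISTINCT (a.col1, b.col2)) AS distinct_pairs,  -- row constructor
        count(DISTINCT a.col1)           AS distinct_col1,
        count(DISTINCT b.col3)           AS distinct_col3
    FROM table_a a
    JOIN table_b b USING (id)
    GROUP BY a.col4, b.col7;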

Sounds obscure when I word it that way, but this is actually a pretty “generic” problem that appears whenever you want to filter and count the number of distinct property combinations that occur within a fact dataset of transactions or events that has been joined with other dimensional datasets. A naive implementation is exorbitantly expensive (and impractical) if you have to join many large tables before grouping and performing count distinct.

However, this specific problem manifests in various equivalent forms mathematically: model counting of boolean expressions, low rank factorization of sparse high dimensional boolean tensors (each row in your transaction dataset corresponds to a value of “true” in a sparse tensor with dimensions indexed by the values of your columns), minimal hypergraph covering set, etc.

Is there a database already out there that’s optimized for this fairly common business problem? Maybe...? I searched for a while but couldn’t easily separate the startup database hype from the actual capabilities of a particular offering. Plus, even if the ideal “hypergraph counting database” exists, it’s not like my company is just going to replace its standard cloud SQL platform that serves as the backbone of our entire product with a niche and fragile experimental database with questionable long-term support. It’s much easier to just translate one of the latest tensor factoring research papers into a Python script, plop that into the data processing pipeline, and output the simple factored form of the transactions dataset into a new table that can be easily queried in the standard way.


Is a sparse high dimensional boolean tensor modeled as you describe roughly the same as inverted index in a search engine like ElasticSearch/Solr?


Nice benchmark with amazing results! Congrats!


That looks like what Microsoft has done with Microsoft Fabric: SQL Server going full Parquet for table storage.


I thought going full Parquet for storage was something Snowflake did first?


If I combined ParadeDB with a common postgres extension like PostGIS, would I get the benefits of both?


Yes and no, depending on what you mean.

The custom types/indexes introduced by PostGIS won't work with deltalake tables. Even if it were possible, the benefits of using deltalake tables to execute geospatial queries are unclear, since PostGIS indexes are already optimized for this task.

But, what ParadeDB enables is for geospatial tables and deltalake tables to exist within the same database.
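
To make the "yes" part concrete, a minimal sketch of both kinds of tables living side by side in one database (the `deltalake` access-method name is assumed from this thread):

    -- Regular heap table with PostGIS types and a GiST index.
    CREATE EXTENSION IF NOT EXISTS postgis;
    CREATE TABLE stores (
        id   BIGINT PRIMARY KEY,
        geom GEOMETRY(Point, 4326)
    );
    CREATE INDEX ON stores USING gist (geom);

    -- Columnar analytics table in the same database
    -- (hypothetical deltalake access method).
    CREATE TABLE orders (
        store_id BIGINT,
        amount   NUMERIC,
        placed   DATE
    ) USING deltalake;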


Interesting and smart choice to use Apache DataFusion as their query engine.


ParadeDB maker here -- DataFusion is amazing! Apple used it to rebuild part of Spark and open-sourced it under Comet too: https://github.com/apache/arrow-datafusion-comet/pull/1

It's a lot more mature than people think :)


Which is a point of worry, because it is not a mature and widely adopted product.


If you're looking for pg_analytics (the product in TFA) in the charts and can't find it, it's because the chart calls it ParadeDB (after the company, instead of the product).


One of the ParadeDB makers here -- Yes, we release our Helm charts under the `paradedb` name.

We also ship our individual extensions (pg_analytics and pg_bm25) pre-compiled as .deb packages under our GitHub Releases, and the complete product, ParadeDB, as a Dockerfile and Helm chart.


I would say that the biggest missing feature is not being able to mix normal and delta tables.

This would allow easier reporting within a very active db without too much bother.


You're right. We're working on this currently. You can track the issue here: https://github.com/paradedb/paradedb/issues/717


Congrats on the pg_analytics launch! Great to see ParadeDB consolidating some of these tools. Useful at the early stage to not manage multiple databases.


How does this handle sort ordering / collations?


Sort ordering is handled automatically by Datafusion.

Collations at the column/operation level are not yet supported but we're working on it.


Nice, congrats! How does this compare to Hydra in philosophy and use cases?


I'm not fully familiar with Hydra. From my understanding, they forked Citus columnar and improved it. I think the use case for just pg_analytics vs Hydra is similar, i.e. fast analytics in Postgres. However, they'll soon diverge as we interoperate it with pg_bm25, our full-text search extension. We see ParadeDB as becoming a modern alternative to Elasticsearch, while Hydra appears to be positioning itself as a Postgres warehouse.

In philosophy, we believe in playing into the ecosystem. We use DataFusion to avoid needing to write a vectorized query engine, Arrow to avoid needing to build an in-memory representation, and Parquet to avoid needing to build columnar disk storage. Citus columnar/Hydra appear to be working from first principles directly within Postgres, storing data in Postgres blocks, and writing vectorized execution operator by operator


Excited to see ParadeDB make progress in this area :)


Can it be installed on managed solutions?


It can't. We are open to exploring this, although we expect most of our users, especially at larger organizations, to want a separate search & analytics node, similar to how they use Elastic today.


ClickHouse is phenomenal


(in Rust) :)


It's nice to have a title that doesn't say that for once!


OLAP was my favorite character in Frozen



