Thanks for sharing; I really like Postgres, though I generally use it for OLTP work. I would like to point out two things:
1. Based on ClickBench results, pg_analytics is still far from top-tier performance. If you're looking for a high-performance OLAP database, you should consider the top-ranked products.
2. The queries tested by ClickBench are really simple, far from real data analysis scenarios. You're surely aware of the TPC-DS and TPC-H benchmarks; ClickHouse simply cannot run those test suites, which is why its ClickBench does not include those datasets.
Lastly, if your enterprise is of a certain size, separating OLTP and OLAP into two databases is the right choice, because you will have two different teams responsible for two different workloads. By the way, I use StarRocks for OLAP work.
1. On ClickBench, make sure you're doing an apples-to-apples comparison by comparing scores from the same instance type. We used the most commonly used c6a.4xlarge instance. While a few databases like DuckDB rank higher, the performance of DataFusion (our underlying query engine) is constantly improving, and pg_analytics inherits those improvements.
Then again, people only care about performance and benchmarks up to a certain threshold. The goal of pg_analytics is not to displace something like StarRocks, but to enable analytical workloads that require both row and column-oriented data or Postgres transactions.
2. We're working on TPC-H benchmarks. They're good for demonstrating JOIN performance and we'll have them published early next week.
Most of the time, all that matters in terms of performance is the user's tolerance. Once that threshold is reached, operational complexity becomes a lot more important. We use raw Postgres for analytics, knowing that projects like these and cloud offerings like AlloyDB will make our lives easier (in terms of performance) as time goes on.
pg_bm25 looks awesome too! Next up, take FDWs to the level of Trino/Drill, and we don't need anything other than Postgres and its extensions!
ClickBench is really nice because it is so easy to compare and contribute benchmarks. Is there anything out there like ClickBench, but for the TPC-DS and TPC-H benchmarks?
TPC-H has some published benchmarks on its website, but they're for specific hardware systems. It's certainly not as user-friendly and modern as ClickBench. I'm not sure it's possible to make something better without their consent, though.
Interesting stuff keeps happening in PostgresLand. This uses Apache DataFusion, which I'm hearing about for the first time, but from reading the docs it seems like it already offers a built-in SQL interface atop Apache Parquet and Apache Arrow. What is the difference between these two flavors of SQL? Can I still use my favourite PL/SQL functions together with a deltalake table, and does this play nicely with other kinds of tables?
DataFusion's SQL dialect has some slight quirks that pertain to DataFusion-specific functionality, for instance the ability to create an external table.
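For illustration, here's roughly what that looks like in DataFusion's own SQL dialect (a sketch only; the table name and file path are made up):

    -- register a Parquet file as a queryable table in DataFusion
    CREATE EXTERNAL TABLE hits
    STORED AS PARQUET
    LOCATION '/data/hits.parquet';

    -- once registered, it can be queried like any other table
    SELECT COUNT(*) FROM hits;

Plain Postgres has no CREATE EXTERNAL TABLE statement, which is the kind of dialect difference in question.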
With regards to ParadeDB, we rely on the DataFusion SQL parser, which can transform the Postgres SQL dialect into a DataFusion logical plan that can be executed by DataFusion. We actually have an open PR that adds support for user-defined functions... it will likely get merged within a few days.
Thanks for your reply. So Postgres is taking care of the Parquet files in the background in some way, I guess. Would it be possible to combine this with something like Neon?
We haven't tested it. Neon has a pretty unique storage architecture, so I suspect it wouldn't work out of the box, but it could with some modifications on our end. We're open to this type of partnership eventually.
If you decide to try it, I believe they now have a way to load in arbitrary Postgres extensions!
1) How does this deal with backups? Presumably the deltalake tables can't be backed up by Postgres itself, so I guess there's some special way to do backups?
2) Similarly for replication (physical or logical). Presumably that's not supported, right? I guess logical replication is more useful for OLTP databases from which the data flows to data lakes, so that's fine. But what's the HA story without physical replication?
3) Presumably all the benefits are from compression at the storage level? Or are there some tweaks to the executor to do columnar stuff? I looked at the hooks in pg_analytics, but I see only stuff to handle DML. But I don't speak Rust, so maybe I missed something.
1) and 2) Backups and replication are on the roadmap. It's the next major feature we're working on.
3) We store data in Parquet files, a highly compressed columnar file format. This ensures good compression and compatibility with Arrow for in-memory columnar processing. We also hook in at the executor level and route queries on deltalake tables to the DataFusion query engine, which processes the data in a vectorized fashion for much faster execution.
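To make that concrete, here's a minimal sketch of what using a deltalake table looks like, assuming the table access method is exposed under the `deltalake` name used throughout this thread (table and column names are made up):

    -- a column-oriented table backed by Parquet/Delta Lake
    CREATE TABLE analytics_events (
        event_id   bigint,
        user_id    bigint,
        event_type text,
        created_at timestamptz
    ) USING deltalake;

    -- heap tables and deltalake tables coexist in the same database;
    -- aggregates over the deltalake table get routed to DataFusion
    SELECT event_type, count(*)
    FROM analytics_events
    GROUP BY event_type;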
Thanks. I wonder how you plan to replicate stuff, considering how heavily replication relies on WAL (I haven't found the answer in the code, but I'm not very familiar with Rust).
How large a part of the plan do you route to the deltalake tables? Just scans, or some more complex parts? Can you point me to the part of the code doing that? I'm intrigued.
Congrats on launching – we just spent quite a bit of time replicating/transforming our primary database into ClickHouse for OLAP use cases, and it would have been way easier if there were a Postgres-native solution. Hoping the managed hosting providers catch on.
I’m actually in the same boat right now (primary DB -> ClickHouse). We’re currently trialing Airbyte but it appears they’re removing normalization support for ClickHouse this year. Did you land on a custom solution, or some other ETL/ELT tool?
That's right! Greenplum is still active, but falling off.
We see pg_analytics as the next-generation Citus columnar, with much better performance and integration into the wider data ecosystem via Delta Lake, and eventually Iceberg
You're exactly right. There is a mature project called delta-rs which we use. The Iceberg equivalent, iceberg-rs, is still very young. We are planning to contribute to it and eventually add support for Iceberg
Slightly related, but it's so interesting to me how all these new small tech companies basically provide value on top of Postgres! They usually blog about some obscure Postgres topic and one of their articles gets on the front page of HN. Good marketing and very cool stuff!
Do you have an example where CH was more reasonable? I don't have enough experience here, so I would greatly appreciate hearing what you've seen. Thank you!
Yes. delta-rs is a Rust-based implementation of Delta Lake.
The existing version of pg_analytics uses delta-rs to manage Parquet files stored within Postgres. In the future, we plan on integrating external object stores. This means that you'll be able to query any Delta Lake directly from Postgres.
Iceberg support will come later, once the Rust implementation of Iceberg matures.
This is pretty cool. It seems like there are a lot of hard database problems that still need to be solved, but lately, finding the right database system (or extension) seems about as difficult as finding a research article and implementing the solution yourself. This seems like a step in the right direction by integrating with Postgres, which is widely used and supported.
Case in point regarding OLAP in particular, I am currently trying to solve a problem where I have a high number of categorical dimensions, and I want to perform a “count distinct” over any combination of dimensions, grouped by any other combination of dimensions, filtered by specific values in each dimension. E.g., count(distinct a.col1, b.col2), count(distinct a.col1), count(distinct b.col3) from table a join table b using (id) group by a.col4, b.col7.
Sounds obscure when I word it that way, but this is actually a pretty “generic” problem that appears whenever you want to filter and count the number of distinct property combinations that occur within a fact dataset of transactions or events that has been joined with other dimensional datasets. A naive implementation is exorbitantly expensive (and impractical) if you have to join many large tables before grouping and performing count distinct.
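To make the naive form concrete, here is a sketch mirroring the example above (table and column names are placeholders; note that Postgres needs a row constructor to count distinct combinations of columns):

    -- naive approach: join the large tables first, then group and count
    SELECT
        a.col4,
        b.col7,
        count(DISTINCT (a.col1, b.col2)) AS distinct_col1_col2,
        count(DISTINCT a.col1)           AS distinct_col1,
        count(DISTINCT b.col3)           AS distinct_col3
    FROM table_a AS a
    JOIN table_b AS b USING (id)
    WHERE b.col5 = 'some_value'
    GROUP BY a.col4, b.col7;

The join materializes every matching row combination before any grouping happens, which is where the cost blows up as the tables grow.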
However, this specific problem manifests in various equivalent forms mathematically: model counting of boolean expressions, low rank factorization of sparse high dimensional boolean tensors (each row in your transaction dataset corresponds to a value of “true” in a sparse tensor with dimensions indexed by the values of your columns), minimal hypergraph covering set, etc.
Is there a database already out there that’s optimized for this fairly common business problem? Maybe...? I searched for a while but couldn’t easily separate the startup database hype from the actual capabilities of a particular offering. Plus, even if the ideal “hypergraph counting database” exists, it’s not like my company is just going to replace its standard cloud SQL platform that serves as the backbone of our entire product with a niche and fragile experimental database with questionable long-term support. It’s much easier to just translate one of the latest tensor factoring research papers into a Python script, plop that into the data processing pipeline, and output the simple factored form of the transactions dataset into a new table that can be easily queried in the standard way.
The custom types/indexes introduced by PostGIS won't work with deltalake tables. Even if it were possible, the benefits of using deltalake tables to execute geospatial queries are unclear, since PostGIS indexes are already optimized for this task.
What ParadeDB does enable, though, is for geospatial tables and deltalake tables to exist within the same database.
If you're looking for pg_analytics (the product in TFA) in the charts and can't find it, it's because the chart calls it ParadeDB (after the company, instead of the product).
One of the ParadeDB makers here -- Yes, we release our Helm charts under the `paradedb` name.
We also ship our individual extensions (pg_analytics and pg_bm25) pre-compiled as .deb packages under our GitHub Releases, and the complete product, ParadeDB, as a Dockerfile and Helm chart.
Congrats on the pg_analytics launch! Great to see ParadeDB consolidating some of these tools. It's useful at the early stage not to have to manage multiple databases.
I'm not fully familiar with Hydra. From my understanding, they forked Citus columnar and improved it. I think the use case for pg_analytics alone vs. Hydra is similar, i.e. fast analytics in Postgres. However, they'll soon diverge as we integrate pg_analytics with pg_bm25, our full-text search extension. We see ParadeDB as becoming a modern alternative to Elasticsearch, while Hydra appears to be positioning itself as a Postgres warehouse.
Philosophically, we believe in playing into the ecosystem. We use DataFusion to avoid needing to write a vectorized query engine, Arrow to avoid needing to build an in-memory columnar representation, and Parquet to avoid needing to build columnar disk storage. Citus columnar/Hydra appear to be working from first principles directly within Postgres, storing data in Postgres blocks and writing vectorized execution operator by operator.
It can't. We are open to exploring this, although we expect most of our users, especially at larger organizations, to want a separate search & analytics node, similar to how they use Elastic today.