Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

SQL is better than 99% of the nosql alternatives out there.

But one thing it falls apart are these time series data processing tasks.

It's because of its model of unordered sets (or multisets to be more precise, but still undordered). When you look at majority of those queries and other real life queries they always involve the dimension of time. Well - then that means we have a natural order of the data - why not use an array data structure instead of this unordered bag and throw the sort order out of the window.

SQL realized this and bolted window functions on top of the original relational model. But you still feel its inadequacy when trying to do simple things such as (x join x on x.t=x.t-1) or even the infamous asof joins where you do (x join y on x.t <= y.t).

In array databases with implicit time order both examples are a trivial linear zip operation on the the sorted tables.

In traditional set oriented SQL it results in horrible O(n^2) cross join with a filter or in the better case od equijoin in sort merge join which still has to do two expensive sorts on my massive time series tables and then perform the same trivival linear zip that the array processor would do. Which is also completely useless on unbounded streams.

Also many stackoverflow answers suggest to use binary blobs for time series data and process them locally in your favorite programming language - which points at wrong conceptual model of SQL engines.

Is SQL really so inadequate for real life data problems or have I been living under a rock and Kdb is no longer the only option to do these trivial time oriented tasks?



Have you seen TimescaleDB? https://www.timescale.com/


I've seen it.

Most of these 'time series databases' are for processing of structured logs and metrics to be plugged into you favorite system for monitoring site latency.

Asof join is still an open issue in their issue tracker so it is not usable as a time series database.

https://github.com/timescale/timescaledb/issues/271


If AS OF is a dealbreaker for you, so be it, but I think TimescaleDB is trying hard to be more than a "simple timeseries" aggregator like so many competitors. We're using it to store hundreds of millions of complex IIOT data packets, where a fully normalized packet adds dozens of rows across ~4 tables.

Having it be mostly plain-ol-postgres has been huge for designing and evolving the data model in a way that I'd hate to do without "real" SQL.


I'm sure your product is great for many customers in their domains.

The asof and other time based joins are a necessity for just slightly more complicated data processing than just a trivial aggregation at the tail of the stream which is majority of use cases in IOT and real time monitoring.

Without that I can pipe my data to /dev/null giving me even better write performance while also not being able to solve this common financial industry use case.


Sorry, didn't mean to imply I work there, I don't; my employer uses Timescale and I designed our schema.

Timescale has been quite performant for us fetching arbitrary past time windows, but I'll stop the cheerleading there as I think mfreed covered the details far better.

I'd certainly still like to see AS OF support!


(Timescale person here)

Just an observation that the parent post was the unordered "set" nature of tables, and not strictly about ASOF joins.

I should point out there is an alternative reason for this in TimescaleDB (and many other databases): to support loose-time-order writes (or even data backfill), where records don't always arrive in perfect timestamp order.

So some of these engineering decisions are to balance tradeoffs across insert vs. query rates.

In TimescaleDB, for example, we write data to time intervals in arrival order, not timestamp order, in order to achieve higher writes rates. On the other hand, we support the automated asynchronous rewriting of data to reorder data after it reaches a certain age, according to arbitrary columns.

This can be used to reorder data precisely in timestamp order. It's also highly useful and commonly used for composite reorderings, such that you can then reorder on (some_id, timestamp DESC), such that, within a time interval (chunk in TimescaleDB), data belonging to each some_id is collocated on disk, and within each id, it's sorted on disk again by timestamp. (Here, some_id might be a unique hostname, device_id, metric_id, stock ticker symbol, etc.)

These allow you to take much faster queries for SELECT * from table WHERE some_id = foo and time > X and time < Y.

We also take a similar approach when performing asynchronous columnar-style compression, where we "segment" together many records belonging to the same id into a single internal row, then order those records in array style within each row, sorted by some column (typically timestamp) before applying a type-specific compression algorithm to then. Again, all metrics then belonging to some_id are collocated, ordered by timestamp as an array on out-of-line pages so that only the requested columns need to be fetched from disk. And this gets hidden behind a SQL view that makes it look like you are still interacting with a standard row-based table.

Anyway, totally appreciate that we haven't prioritized ASOF JOINs, as our experience is that these typically come up in more specific financial use cases, rather than the vast majority of time-series use cases.

And we also find that our "gapfill/LOCF" functionality can be used to bucket-aggregate data with different ids for related behavior. That is, it's a data modeling choice to store different ticker symbols in different tables (and then ASOF JOIN), and not necessarily the only way to achieve the desired goals.

https://docs.timescale.com/latest/api#add_reorder_policy https://docs.timescale.com/latest/using-timescaledb/compress... https://docs.timescale.com/latest/api#time_bucket_gapfill

And thanks clatonjy for your use!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: