One point the article mentions is caching for global users, which shared hosting does not give you. At most you can replace S3 with shared hosting, not the CDN part.
I have shared hosting + Cloudflare for Spanish and Latin American audiences and it works pretty well. IMO AWS is for other purposes; for simple apps it's not worth it.
Funny historical and architectural fact about PostgreSQL: it can actually do this, for all tables, without any special features. Unfortunately the facility to perform a query like this is no longer exposed, but it shouldn't be impossible to re-add in a more modern way.
Essentially PostgreSQL has copy-on-write semantics, so historical row versions persist until a vacuum marks them as no longer needed and subsequent inserts/updates overwrite the values.
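To make the no-overwrite idea concrete, here is a toy sketch in Python (not Postgres's actual storage format; the class and field names are made up). Updates never overwrite a row; they expire the old version and append a new one, which is what makes "as of transaction T" lookups possible:

```python
import itertools

class NoOverwriteTable:
    """Toy no-overwrite storage: updates append new row versions
    instead of overwriting, so old versions stay queryable."""

    def __init__(self):
        self._txid = itertools.count(1)
        self._versions = []  # (pk, value, born_tx, died_tx)

    def upsert(self, pk, value):
        tx = next(self._txid)
        # "Expire" the currently live version instead of overwriting it.
        for i, (p, v, born, died) in enumerate(self._versions):
            if p == pk and died is None:
                self._versions[i] = (p, v, born, tx)
        self._versions.append((pk, value, tx, None))
        return tx

    def get_as_of(self, pk, tx):
        # Time travel: return the version that was live at transaction tx.
        for p, v, born, died in self._versions:
            if p == pk and born <= tx and (died is None or died > tx):
                return v
        return None

emp = NoOverwriteTable()
t1 = emp.upsert("Sam", 1000)     # Sam's salary as first recorded
t2 = emp.upsert("Sam", 1200)     # a later raise
print(emp.get_as_of("Sam", t1))  # 1000 -- the historical version survives
print(emp.get_as_of("Sam", t2))  # 1200
```

A vacuum, in this picture, would be a sweep that physically reclaims versions whose `died_tx` is older than any reader still cares about.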
In the past, when PostgreSQL had the POSTQUEL language (before SQL support was added), there was special syntax to access data at specific points in time:
This is nicely outlined in "THE IMPLEMENTATION OF POSTGRES" by Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama[1]. Go ahead, open the PDF and search for "time travel", or read the quotes below.
> The second benefit of a no-overwrite storage manager is the possibility of time travel. As noted earlier, a user can ask a historical query and POSTGRES will automatically return information from the record valid at the correct time.
Quoting the paper again:
> For example to find the salary of Sam at time T one would query:
    retrieve (EMP.salary)
    using EMP [T]
    where EMP.name = "Sam"
> POSTGRES will automatically find the version of Sam’s record valid at the correct time and get the appropriate salary.
Really nice background, thanks for sharing! I knew Postgres did CoW internally and always wondered why the SQL standard for time-travel queries was not implemented.
I am using triggers and audit tables, which works, but my data requirements are relatively small, so I won't face many challenges that way. However, reusing the old rows like this would be a far more efficient approach if it were supported natively.
Shameless plug (I'm a co-founder) but this is basically what we've built with Splitgraph[0]: we can add change tracking to tables using PostgreSQL's audit triggers and let the user switch between different versions of the table / query past versions.
Change tracking is not a fully bitemporal scheme, though. A bitemporal table tracks _two_ timelines. One is about when facts in the world were true ("valid time" or "application time"), the other is about the history of particular records in the database ("transaction time" or "system time"). Change tracking can only capture the second.
This was extremely confusing to me on first read, but since temporal data is an area of interest to me (both for improving an existing service I maintain and for future endeavors), it prompted me to go do a bunch of reading to understand what was meant by the distinction. If others are reading this and confused, maybe what I learned will help make the distinction clearer.
There are two cases where valid/applicable time might be meaningful:
1. Future state known at present time (e.g. when you know now that a fact will change at a specific future point in time).
2. Corrections to current state which should be applied to historical state (e.g. when you know your application produced invalid state and you want to produce a valid state both at present and in historical representations).
The first case appears more often in the literature I found, but it didn't really make the distinction clearer for me because I have limited domain use for that kind of behavior. The correction case really drove the point home, because my use cases would benefit from it considerably.
I hope this helps other readers interested in the topic but struggling to visualize the two timelines and how they could be used.
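The correction case above can be sketched in a few lines of Python. This is a toy model, not any real temporal-SQL implementation; the field names (`valid_from`, `sys_from`, etc.) are just illustrative. Each fact carries two intervals, and a retroactive correction closes the old assertion in system time while claiming the same stretch of valid time:

```python
# Each fact carries two intervals: when it was true in the world
# (valid time) and when the database asserted it (system time).
# None means "until further notice"; "dates" are just day numbers.
facts = []

def assert_fact(key, value, valid_from, valid_to, now):
    # System-close any currently asserted version of this fact...
    for f in facts:
        if f["key"] == key and f["sys_to"] is None:
            f["sys_to"] = now
    # ...then record the new assertion, possibly about the past.
    facts.append({"key": key, "value": value,
                  "valid_from": valid_from, "valid_to": valid_to,
                  "sys_from": now, "sys_to": None})

def as_of(key, valid_day, sys_day):
    """What did the database, as of sys_day, say was true on valid_day?"""
    for f in facts:
        in_valid = f["valid_from"] <= valid_day and (
            f["valid_to"] is None or valid_day < f["valid_to"])
        in_sys = f["sys_from"] <= sys_day and (
            f["sys_to"] is None or sys_day < f["sys_to"])
        if f["key"] == key and in_valid and in_sys:
            return f["value"]
    return None

# Day 10: we record that Sam's salary has been 1000 since day 1.
assert_fact("sam_salary", 1000, valid_from=1, valid_to=None, now=10)
# Day 20: we learn that was wrong -- it was actually 1100 all along.
assert_fact("sam_salary", 1100, valid_from=1, valid_to=None, now=20)

print(as_of("sam_salary", valid_day=5, sys_day=15))  # 1000: old belief
print(as_of("sam_salary", valid_day=5, sys_day=25))  # 1100: corrected
```

The point is that both answers remain queryable: what we believed then, and what we believe now, about the same day in the past.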
It is hard to maintain the two timelines in one's mind at once. Most of the time I maintain the dehydrated version that "you need both" and talk myself through the details when I need them.
Typically what happens is that folks incorrectly capture parts of both timelines. For "valid time", folks will have "created at", "deleted at", "updated at" fields etc and assume this covers it. Unfortunately, it doesn't really capture proper spans. If I have a fact that starts being true on Tuesday, then add another one that starts being true on Thursday, can I deduce that the first fact was true from Tuesday until Thursday?
No, I can't logically make that claim. It's possible that the first fact was true on Tuesday only, leaving a gap on Wednesday. Without explicit valid time, I can't find those kinds of holes, or apply constraints against them.
Similarly, folks rely on history tables, audit tables etc to try to capture transaction time changes. These are still not enough, for the same reason. You need the span of time, the interval, to properly assert the time during which the database held a record to be true. When we discover that the Thursday fact was wrong, we need to be able to know over what window we held that to be true. Overwriting the record, or entering an update into a log etc, is not enough to truly reconstruct the history. You must be explicit about the span.
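The Tuesday/Thursday point can be shown with a tiny sketch (my own illustration, not anything from a real temporal library). With explicit half-open `[from, to)` spans the Wednesday hole is detectable; with bare "created at" timestamps it is not:

```python
# With explicit [from, to) spans, gaps are detectable; with bare
# "created at" timestamps they are not. Spans are day numbers.
def find_gaps(spans):
    """Return the holes between sorted, non-overlapping [from, to) spans."""
    gaps = []
    spans = sorted(spans)
    for (_, end), (start, _) in zip(spans, spans[1:]):
        if end < start:
            gaps.append((end, start))
    return gaps

# Fact one was true on Tuesday only (day 2), fact two from Thursday (day 4):
print(find_gaps([(2, 3), (4, 6)]))  # [(3, 4)] -- Wednesday is a hole
# With only "created at" days [2, 4], that hole is invisible.
```

The same span logic applies to transaction time: to know over what window the database held the Thursday fact to be true, you need the interval, not just the moment it was written or overwritten.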
The necessity of bitemporalism was easy for me, because I had been responsible for either creating or maintaining multiple databases in which I could not answer a wide range of questions after the march of time. I learned many of the standard hacks (copy the price of a stocked item into the invoice line item, or it will change beneath you! Use slowly changing dimensions! Have a transactions table which creates a slow-as-hell copy of what the database does internally, but not as well!). When I read Snodgrass's book it all went clunk and I've been a wild-haired proselyte ever since.
That sounds neat. What does the performance of querying past versions look like? For instance, is lookup time linear with the amount of history or do you maintain special temporal indexes?
It varies depending on how the user chooses to structure storage (we're flexible with that) and what mode of querying they use. We have a more in-depth explanation and some benchmarks in an IPython notebook at [1].
We store Splitgraph "image" (schema snapshot) metadata in PostgreSQL itself and each image has a timestamp, so you could create a PG index on that to quickly get to an image valid at a certain time.
Each image consists of tables, and each table is a set of possibly overlapping objects, or "chunks". If two chunks have a row with the same PK, the row from the latter takes precedence. Within these constraints, you can store table versions however you want -- e.g. as a big "base" chunk and multiple deltas (least storage, slowest querying) or as multiple big chunks (faster querying, more storage).
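If I'm reading the precedence rule right, materializing a table version amounts to folding chunks in order, with later chunks winning per-PK. A minimal sketch (Splitgraph's actual storage is far more involved; the names here are made up):

```python
# A table version as an ordered list of chunks; later chunks win
# on primary-key conflicts, so a small delta can override a base.
def materialize(chunks):
    """Merge chunks (each a dict of pk -> row) into one table image."""
    table = {}
    for chunk in chunks:   # later chunks overwrite earlier ones per-PK
        table.update(chunk)
    return table

base  = {1: ("alice", 30), 2: ("bob", 25)}
delta = {2: ("bob", 26)}           # an update packaged as a small delta
print(materialize([base, delta]))  # {1: ('alice', 30), 2: ('bob', 26)}
```

The base-plus-deltas vs. big-chunks tradeoff falls straight out of this: more deltas means less storage but a longer fold at query time.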
You can query tables in two ways. Firstly, you can perform a "checkout". Like Git, this replays changes to a table in the staging area and turns it into a normal PostgreSQL table with audit triggers. You get the same read performance (and can create whatever PG indexes you want to speed it up). Write performance is 2x slower than normal PostgreSQL, since every change has to be mirrored by the audit trigger. When you "commit" the table (a Splitgraph commit, not a Postgres commit), we grab those changes and package them into a new chunk. In this case, you have to pay the initial checkout cost.
You can also query tables without checking them out (we call this "layered querying" [2]). We implemented this through a read-only foreign data wrapper, so all PG clients still support it. In layered querying, we find the chunks that the query requires (using bloom filters and other metadata), direct the query to those and assemble the result. The cool thing about this is that you don't need the whole table history local to your machine: you can store some chunks on S3 and Splitgraph will download them behind the scenes as required, without interrupting the client. Especially for large tables, this can be faster than PostgreSQL itself, since we are backed by a columnar store [3].
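The bloom-filter pruning step can be illustrated in general terms (this is not Splitgraph's implementation, just the standard technique: a filter can say "maybe present" with some false-positive rate, but never misses a real member, so it is safe to skip chunks whose filter says no):

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: may say "maybe present", never misses a member."""
    def __init__(self, nbits=4096, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, item):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

def prune(chunks, wanted_key):
    """Only fetch chunks whose filter says the key might be there."""
    return [name for name, bloom in chunks if bloom.might_contain(wanted_key)]

chunk_a, chunk_b = Bloom(), Bloom()
for key in range(0, 100):
    chunk_a.add(key)        # chunk A holds PKs 0..99
for key in range(100, 200):
    chunk_b.add(key)        # chunk B holds PKs 100..199

# A query for PK 150 only needs chunk B (barring a rare false
# positive from chunk A's filter).
print(prune([("A", chunk_a), ("B", chunk_b)], 150))
```

In the remote-storage setup described above, each skipped chunk is a download avoided, which is where the win comes from.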
I work at a company where (many years ago) we built an extension to Postgres (and some helper libs in SQLAlchemy, Go) for implementing decently-performant bitemporal tables (biggest history tables have hundreds of millions of rows). Pretty much our entire company runs on it today.
We implemented the “minimum viable” features (i.e. automatic expiring, non-destructive updates, generated indexes and generated table declarations), but left some of the “harder” ideas up to the application designer (adding semantic versioning on top of temporal versioning, schema migrations).
It’s worked really well for us. I can’t think of anything we’ve done that’s had a higher ROI than this. I’ll really miss it when I leave!
This is the kind of thing I always design with the possibility of open sourcing in mind, even if I don't have buy-in or dedicated time to make the open source effort at that moment. Even if you miss it when you're gone, you'll have the benefit of hindsight about where the boundaries lie between your own business needs and the more general use case, and you can take that with you and apply the same lessons (often with improvements) the next time you face a similar problem.
System time (aka "transaction time") is also invaluable for debugging if you annotate it with release versions. Unless an application is particularly strapped for storage costs, which is rare in this day and age, it ought to be the default choice to use built-in system time versioning wherever it exists.
2. If you want to run your own, you'd better know how HA setups work and accept that you can lose at least one machine at ANY point in time (not just when AWS warns you the hardware is failing).
I've been running docsapp.io, a SaaS documentation hub. It is growing and profitable, but not by a lot. I work between 1 and 8 hours per month, mostly on development with a little support.
I run a SaaS for building and hosting developer hubs, https://www.docsapp.io, using the following technologies:
Backend: Scala and Play Framework
Frontend: jQuery
DB: PostgreSQL on AWS RDS
Hosting: Bare metal on Scaleway
DNS: AWS Route53
Object Store: AWS S3
CDN: AWS CloudFront
Infrastructure: Ansible and Kubernetes
Monitoring: CloudWatch and InfluxDB + Grafana in k8s
Logging: ELK in k8s
Scaleway is a great host at a cheap price. Instead of finding reliable but expensive hosting, I prefer cheaper hosting with more servers, building HA to handle server issues. Servers die all the time.
I am working on the SaaS DocsApp (https://www.docsapp.io/), solo and bootstrapped. DocsApp is a platform to manage your product documentation. It is now profitable and growing, and I am currently focusing on marketing to grow more. DocsApp was born out of frustration with building documentation for products/SaaS.
I have been working on DocsApp for 3 years, and it is only now starting to be profitable. The hardest problems I faced were marketing and sales. Development is only 30% of the work.
I built https://www.docsapp.io/ while learning a new programming language. DocsApp now generates passive income for me; not a lot, but a small profit. The next step is to build content around my product to generate more traffic.
Does anyone know where to find good technical content writers?
It depends on what level; you can probably find the more creatively independent ones on here, and even judge work samples through their comments. My email is in my profile if you're interested.
Congratulations on launching! I built DocsApp to solve a similar pain point. For a documentation hub, performance is a feature :). Feel free to contact me (email in my profile) to chat.