You could store the data in one git repo per domain, so that de-duplication happens implicitly on re-visits and for shared resources.
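Git's content-addressed storage is what makes that de-duplication implicit: a blob's id is just a hash of its bytes, so the same stylesheet fetched on ten visits is stored once per repo. A minimal sketch of the hashing scheme (the function name is illustrative):

```python
import hashlib

def git_blob_id(content):
    """The object id git assigns to file content: SHA-1 over a small
    header plus the raw bytes, independent of filename or fetch time."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The same stylesheet fetched on two different visits gets the same id,
# so the domain's repo stores the bytes exactly once.
first_visit = git_blob_id(b"body { margin: 0 }")
revisit = git_blob_id(b"body { margin: 0 }")
assert first_visit == revisit
```

Committing an unchanged page therefore costs only a little commit metadata, not a second copy of the content.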
You could have a raw dir (the files as received from the server) and a render dir of DOM + CSS snapshots, free of JS and external-resource complexity.
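One way to sketch that layout (the path scheme below is an assumption, not something the comment prescribes): each domain's repo gets parallel raw/ and render/ trees keyed by the URL path.

```python
from pathlib import Path
from urllib.parse import urlparse

def archive_paths(url, root="archive"):
    """Map a URL to its raw (as-fetched) and render (DOM+CSS snapshot)
    locations inside a per-domain repo. Hypothetical layout for illustration."""
    u = urlparse(url)
    page = u.path.lstrip("/") or "index.html"   # bare domain -> index.html
    repo = Path(root) / u.netloc                # one git repo per domain
    return repo / "raw" / page, repo / "render" / page

raw, render = archive_paths("https://example.com/posts/archiving.html")
# raw    -> archive/example.com/raw/posts/archiving.html
# render -> archive/example.com/render/posts/archiving.html
```

Keeping both trees in the same commit means a single checkout of any date gives you the original server response and the readable snapshot together.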
When the global archive becomes too big, history could be discarded from all the git repos by discarding the oldest commit in each repo, and so on.
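Dropping the oldest commit means rewriting history, since every commit points at its parent. One way to sketch it (assumes linear history, a single-writer personal archive, and git on PATH; the helper names are mine):

```python
import subprocess

def run(repo, *args):
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def truncate_history(repo, keep):
    """Keep only the newest `keep` commits, plus one synthetic root.
    Rewrites history in place -- only safe for a single-writer archive."""
    if int(run(repo, "rev-list", "--count", "HEAD")) <= keep + 1:
        return
    cutoff = "HEAD~%d" % keep
    # New parentless root commit with the same tree as the cutoff commit.
    new_root = run(repo, "commit-tree", cutoff + "^{tree}",
                   "-m", "history truncated")
    # Replay the newest `keep` commits onto the new root.
    run(repo, "rebase", "--onto", new_root, cutoff)
    # Expire old refs and actually reclaim the disk space.
    run(repo, "reflog", "expire", "--expire=now", "--all")
    run(repo, "gc", "--prune=now")
```

The gc step matters: until the old objects are pruned, the rewrite frees no space at all.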
Solr is probably the right tool for the index, but there is something undeniably appealing about staying in the pure-file paradigm - you could use SQLite's FTS5 module for that too.
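For the pure-file route, SQLite's FTS5 virtual-table module keeps the whole full-text index in a single file next to the archive. A minimal sketch (the schema and file name are illustrative; FTS5 is compiled into most bundled SQLite builds, but not all):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # or a file like "archive-index.db"
# url is stored but not tokenized; only body is searchable.
con.execute("CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, body)")
con.execute("INSERT INTO pages VALUES (?, ?)",
            ("https://example.com/posts/archiving.html",
             "versioning a personal web archive with git"))
con.commit()

# Full-text query with prefix matching; FTS5 can also rank via bm25().
hits = con.execute("SELECT url FROM pages WHERE pages MATCH ?",
                   ("archive*",)).fetchall()
```

Re-indexing a page on each auto-commit is then just a delete + insert on the same table, with no external search server to run.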
You still end up with two copies of the same file, one in the local LFS “server”, one in the work tree, no? (I only played with LFS a bit many years ago when it first came out, so I could be wrong.) Unless you take into account deduplication built into certain filesystems.
Also saves you from beating up your index with every change.
I've been using Git LFS for the last 6 months with an Unreal Engine project, with multiple gigabytes of files being tracked, and it really is painless.
I know. But you do need them, and files in your work tree don't magically disappear when you commit them (presumably). So either you delete the work-tree copy immediately after pushing to the LFS server and re-fetch the server copy every time you need it - in which case the file is only duplicated at access time, but access gets more expensive - or the latest copy sits around costing double the space at all times.
I don't see the issue? Either you want to use Git or not. I have gigabyte-scale files in my 6-month-old repos and haven't ever run into any issues. Of course this may be because my git server is right next to my desk and I'm on gigabit Ethernet...
Aside: are there non-binary and/or non-large blobs? I'm thinking along the lines of ATM machine / PIN number but maybe BLOB no longer implies "binary large" without being explicit.
I like this idea, especially using git to version the store. With automatic commits, you could roll back to a particular date to see the page versions then. A personal "archive.org" sounds very awesome!
I like the idea (not of git, but a personal archive), especially if search is integrated.
However, what'd make it really amazing for me would be the ability to share those archived versions with everyone around the world, so we wouldn't duplicate our efforts and would have a higher chance of having that specific version of one special page saved.
For now the best way to contribute to this seems to be centralization: Donate to archive.org.