You could store the data in one git repo per domain, so that de-duplication happens implicitly on re-visits and for shared resources.
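Git's content-addressed storage is what makes that de-duplication implicit: a blob's id is just a hash of its bytes, so the same stylesheet fetched on ten visits is stored once per repo. A minimal sketch of the hashing scheme (the function name is illustrative):

```python
import hashlib

def git_blob_id(content):
    """The object id git assigns to file content: SHA-1 over a small
    header plus the raw bytes, independent of filename or fetch time."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The same stylesheet fetched on two different visits gets the same id,
# so the domain's repo stores the bytes exactly once.
first_visit = git_blob_id(b"body { margin: 0 }")
revisit = git_blob_id(b"body { margin: 0 }")
assert first_visit == revisit
```

Committing an unchanged page therefore costs only a little commit metadata, not a second copy of the content.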
You could have a raw dir (the files as received from the server) and a render dir of DOM + CSS snapshots, free of JS and external-resource complexity.
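One way to sketch that layout (the path scheme below is an assumption, not something the comment prescribes): each domain's repo gets parallel raw/ and render/ trees keyed by the URL path.

```python
from pathlib import Path
from urllib.parse import urlparse

def archive_paths(url, root="archive"):
    """Map a URL to its raw (as-fetched) and render (DOM+CSS snapshot)
    locations inside a per-domain repo. Hypothetical layout for illustration."""
    u = urlparse(url)
    page = u.path.lstrip("/") or "index.html"   # bare domain -> index.html
    repo = Path(root) / u.netloc                # one git repo per domain
    return repo / "raw" / page, repo / "render" / page

raw, render = archive_paths("https://example.com/posts/archiving.html")
# raw    -> archive/example.com/raw/posts/archiving.html
# render -> archive/example.com/render/posts/archiving.html
```

Keeping both trees in the same commit means a single checkout of any date gives you the original server response and the readable snapshot together.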
When the global archive becomes too big, history could be discarded from all the git repos by discarding the oldest commit in each repo, and so on.
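Dropping the oldest commit means rewriting history, since every commit points at its parent. One way to sketch it (assumes linear history, a single-writer personal archive, and git on PATH; the helper names are mine):

```python
import subprocess

def run(repo, *args):
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def truncate_history(repo, keep):
    """Keep only the newest `keep` commits, plus one synthetic root.
    Rewrites history in place -- only safe for a single-writer archive."""
    if int(run(repo, "rev-list", "--count", "HEAD")) <= keep + 1:
        return
    cutoff = "HEAD~%d" % keep
    # New parentless root commit with the same tree as the cutoff commit.
    new_root = run(repo, "commit-tree", cutoff + "^{tree}",
                   "-m", "history truncated")
    # Replay the newest `keep` commits onto the new root.
    run(repo, "rebase", "--onto", new_root, cutoff)
    # Expire old refs and actually reclaim the disk space.
    run(repo, "reflog", "expire", "--expire=now", "--all")
    run(repo, "gc", "--prune=now")
```

The gc step matters: until the old objects are pruned, the rewrite frees no space at all.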
Solr is probably the right tool for the index, but there is something undeniably appealing about staying in the pure-file paradigm - you could use SQLite's FTS5 module for that too.
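For the pure-file route, SQLite's FTS5 virtual-table module keeps the whole full-text index in a single file next to the archive. A minimal sketch (the schema and file name are illustrative; FTS5 is compiled into most bundled SQLite builds, but not all):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # or a file like "archive-index.db"
# url is stored but not tokenized; only body is searchable.
con.execute("CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, body)")
con.execute("INSERT INTO pages VALUES (?, ?)",
            ("https://example.com/posts/archiving.html",
             "versioning a personal web archive with git"))
con.commit()

# Full-text query with prefix matching; FTS5 can also rank via bm25().
hits = con.execute("SELECT url FROM pages WHERE pages MATCH ?",
                   ("archive*",)).fetchall()
```

Re-indexing a page on each auto-commit is then just a delete + insert on the same table, with no external search server to run.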
You still end up with two copies of the same file, one in the local LFS “server”, one in the work tree, no? (I only played with LFS a bit many years ago when it first came out, so I could be wrong.) Unless you take into account deduplication built into certain filesystems.
Also saves you from beating up your index with every change.
I've been using Git LFS for the last 6 months with an Unreal Engine project, with multiple gigabytes of files being tracked, and it really is painless.
I know. But you do need them, and files in your work tree don't magically disappear when you commit them (presumably). So either you delete the work-tree copy immediately after pushing to the LFS server and re-fetch the server copy every time you need it - in which case the file is only duplicated at access time, but access gets more expensive - or the latest copy sits around costing double the space at all times.
I don't see the issue? Either you want to use Git or not. I have gigabyte-scale files in my 6-month-old repos and haven't ever run into any issues. Of course this may be because my git server is right next to my desk and I'm on gigabit Ethernet...
Aside: are there non-binary and/or non-large blobs? I'm thinking along the lines of ATM machine / PIN number but maybe BLOB no longer implies "binary large" without being explicit.
I like this idea, especially using git to version the store. With automatic commits, you could roll back to a particular date to see the page versions then. A personal "archive.org" sounds very awesome!
I like the idea (not of git, but a personal archive), especially if search is integrated.
However, what'd make it really amazing for me would be the ability to share those archived versions with everyone around the world, so we wouldn't duplicate our efforts and would have a higher chance of having that specific version of one special page saved.
For now the best way to contribute to this seems to be centralization: Donate to archive.org.