I think you are on the right track. Storing every individual (term, document, ...) pair in the key-value store will not be efficient, but you should be able to take Lucene's nice fast immutable data structures and stuff blocks of them (at the term level or below) into FDB values very efficiently. And of course you can do caching (and represent invalidation data structures in FDB), and...
FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.
So from the Lucene perspective, the idea of a filesystem is pretty baked into the format. However, there's also the idea of a Codec which takes the logical data structures and translates to/from the filesystem. If you made a Codec that ignored the filesystem and just interacted with FDB, then that could work.
You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.
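To make the "one FDB value per (segment, term)" idea concrete, here is a minimal sketch in Python. The key layout, the `pack_key`/`encode_postings` names, and the varint delta encoding are illustrative assumptions, not Lucene or FDB APIs (the real thing would be a Java Codec); delta-plus-varint is, however, the general style of compression Lucene applies to sorted doc ids. FDB values are capped at around 100 KB, which is exactly why max segment size would need tuning.

```python
# Hypothetical key/value layout for posting lists in FDB:
# one value per (segment, field, term), holding that segment's
# delta-encoded posting list.

def pack_key(segment_id: int, field: str, term: str) -> bytes:
    """Order-preserving key: segment, then field, then term."""
    return b"\x00".join([
        segment_id.to_bytes(8, "big"),
        field.encode("utf-8"),
        term.encode("utf-8"),
    ])

def encode_postings(doc_ids: list[int]) -> bytes:
    """Delta-encode sorted doc ids as varints (Lucene-style compression)."""
    out, prev = bytearray(), 0
    for doc in doc_ids:
        delta = doc - prev
        while delta >= 0x80:          # varint: 7 bits per byte
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
        prev = doc
    return bytes(out)

def decode_postings(data: bytes) -> list[int]:
    """Invert encode_postings: varints back to deltas, deltas to doc ids."""
    doc_ids, prev, i = [], 0, 0
    while i < len(data):
        shift, delta = 0, 0
        while True:
            b = data[i]; i += 1
            delta |= (b & 0x7F) << shift
            if b < 0x80:
                break
            shift += 7
        prev += delta
        doc_ids.append(prev)
    return doc_ids
```

Because the key sorts by segment first, a single FDB range read can fetch every term of one segment, and a prefix read can fetch one term's postings.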
If anyone has any serious interest in trying this, my email is in my profile to discuss further.
Hmmm. I'm skeptical. A Lucene term lookup is stupidly fast. It traverses an FST, which is small and probably in memory. Traversing the postings lists themselves also needs to be smart, following a skip table, which is critical for performance.
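To illustrate why the skip table matters: when intersecting two posting lists, you want to leapfrog to the other list's current doc id rather than scan linearly. A minimal sketch below, with `SKIP = 128` echoing Lucene's default postings block size; the flat-list representation and the `advance`/`intersect` names are simplifications, not Lucene's actual data structures.

```python
SKIP = 128  # block size; Lucene's skip data works on 128-doc blocks

def advance(postings: list[int], start: int, target: int) -> int:
    """Index of the first doc id >= target, jumping whole blocks first."""
    i = start
    # Skip a block whenever its last entry is still below the target.
    while i + SKIP < len(postings) and postings[i + SKIP - 1] < target:
        i += SKIP
    # Linear scan only inside the final block.
    while i < len(postings) and postings[i] < target:
        i += 1
    return i

def intersect(a: list[int], b: list[int]) -> list[int]:
    """AND of two sorted posting lists via leapfrogging."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return out
```

A naive FDB layer that forces a full posting-list read (or worse, one read per doc id) loses exactly this skip-ahead behavior, which is the performance concern here.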
> you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently.
That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?
Datomic’s single-writer system requires conditional put (CAS) for the root pointers of its index and transaction-log trees (the only mutable writes), and only eventual consistency for all other writes (immutable writes) [0].
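The pattern is small enough to sketch: all tree nodes are written as immutable blobs, and the only mutation is a CAS swap of the root pointer. The `Store`/`publish_index` names below are hypothetical, and the dict stands in for the storage backend; in FDB the explicit CAS check would be subsumed by transaction conflict detection, since reading the root inside the same transaction that rewrites it makes a concurrent swap fail the commit.

```python
class Store:
    """Toy storage backend exposing the one primitive Datomic needs."""
    def __init__(self):
        self.kv = {}

    def cas(self, key: str, expected, new) -> bool:
        """Conditional put: write `new` only if the value equals `expected`."""
        if self.kv.get(key) != expected:
            return False
        self.kv[key] = new
        return True

def publish_index(store: Store, new_root: str) -> bool:
    """Single-writer root swap: immutable nodes first, root pointer last."""
    old_root = store.kv.get("index-root")
    # ... write the new tree's immutable nodes here; eventual
    # consistency is fine for them, only the root swap must be atomic.
    return store.cas("index-root", old_root, new_root)
```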
I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system thanks to FoundationDB’s external consistency and causality guarantees [1], drop its 64-bit integer-based keys in favor of FoundationDB range reads [2], drop its memcached layer thanks to FoundationDB’s distributed caching [3], use FoundationDB watches for transactor messaging and the tx-report-queue function [4], use FoundationDB snapshot reads [5] for its immutable index tree nodes, and maybe more?
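On point [2]: if datoms live under order-preserving tuple keys (the job FDB's tuple layer does), then an index "segment" is just a key range, so integer surrogate keys become unnecessary. A simplified sketch, assuming a fixed-width big-endian encoding in place of the real tuple layer, and hypothetical `eavt_key`/`entity_range` names:

```python
def eavt_key(e: int, a: int, tx: int) -> bytes:
    """Simplified EAVT-ordered key: entity, attribute, transaction.
    Big-endian fixed-width ints make byte order match numeric order."""
    return e.to_bytes(8, "big") + a.to_bytes(8, "big") + tx.to_bytes(8, "big")

def entity_range(e: int) -> tuple[bytes, bytes]:
    """Key range covering every datom of one entity, for one range read."""
    prefix = e.to_bytes(8, "big")
    return prefix, prefix + b"\xff" * 16
```

Sorting the keys sorts the datoms, so "give me everything about entity e" is a single FDB range read instead of a tree walk over integer-keyed segments.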
Datomic is a FoundationDB layer. It just doesn’t know yet.