
How would you replace a Lucene/Elasticsearch index with FoundationDB?


It's more like you would build a better Elasticsearch using Lucene to do the indexing and FoundationDB to do the storage. FoundationDB will make it fault tolerant and scalable; the other pieces will be stateless.


It'd take only a few hours to wire up FoundationDB as a Lucene filesystem (Directory) implementation. A shared filesystem with a local RAM cache has been practical for a while in Lucene, and was briefly supported then deprecated in Elasticsearch. I've used Lucene on top of HDFS and S3 quite nicely.

If you have a reason to use FoundationDB over HDFS, NFS, S3, etc, then this will work well.
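The storage side of such a Directory boils down to chunking each index file into fixed-size values so a whole-file read becomes one range read. A rough sketch of that idea, with a plain dict standing in for an FDB transaction and `CHUNK` as a made-up tuning knob (real FDB values max out at 100 KB):

```python
CHUNK = 10_000  # bytes per value; hypothetical, must stay under FDB's 100 KB value limit

def write_file(tr, name, data):
    # One KV pair per chunk, keyed by (file name, chunk number).
    for i in range(0, max(len(data), 1), CHUNK):
        tr[(name, i // CHUNK)] = data[i:i + CHUNK]

def read_file(tr, name):
    # In real FDB this would be a single range read over the (name, *) prefix.
    chunks = sorted((k[1], v) for k, v in tr.items() if k[0] == name)
    return b"".join(v for _, v in chunks)
```

Since Lucene files are written once and never modified, there is no in-place-update case to worry about, which is what makes this layout so simple.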

A Lucene+DB implementation where each term's posting list is stored natively in the key-value system was explored for Lucene+Cassandra as Solandra (https://github.com/tjake/Solandra). It was horrifically slow, not because Cassandra was slow, but because posting lists are optimized, and putting them in a generalized b-tree or LSM-tree variant will remove some locality and many of the possible optimizations.

I'm still holding out some hope for a hybrid implementation where posting list ranges are stored in a kv store.


I think you are on the right track. Storing every individual (term, document, ...) entry in the key-value store will not be efficient, but you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently. And of course you can do caching (and represent invalidation data structures in FDB), and...

FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.
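One way to picture the "blocks at the term level or below" idea: store each posting-list block under a key containing the block's first doc id, so a seek becomes a short range read. A toy sketch under those assumptions, with a dict standing in for FDB and `BLOCK` as a guessed block size:

```python
import bisect

BLOCK = 128  # doc ids per value; a guess at a reasonable block size

def store_postings(kv, term, doc_ids):
    # One value per block, keyed by (term, first doc id in the block),
    # so a range read starting near (term, target) resumes mid-list.
    for i in range(0, len(doc_ids), BLOCK):
        block = doc_ids[i:i + BLOCK]
        kv[(term, block[0])] = block

def seek(kv, term, target):
    # Find the first stored doc id >= target: locate the block whose
    # first id is <= target (in FDB, a reverse range read of one key).
    starts = sorted(k[1] for k in kv if k[0] == term)
    idx = bisect.bisect_right(starts, target) - 1
    for start in starts[max(idx, 0):]:
        for doc in kv[(term, start)]:
            if doc >= target:
                return doc
    return None
```

The blocks themselves stay in Lucene's compressed form; only the key layout changes, which is what keeps range locality intact.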


So from the Lucene perspective, the idea of a filesystem is pretty baked into the format. However, there's also the idea of a Codec which takes the logical data structures and translates to/from the filesystem. If you made a Codec that ignored the filesystem and just interacted with FDB, then that could work.

You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.
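The "whole posting list per key" layout depends on the list compressing well. Lucene does this with delta-plus-varint style encodings; a simplified sketch of that encoding (names made up, not Lucene's actual format), where one such blob would be one FDB value keyed by (segment, field, term):

```python
def encode_postings(doc_ids):
    # Delta-encode sorted doc ids, then varint-pack; similar in spirit
    # to Lucene's postings compression, much simplified.
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        delta = doc - prev
        prev = doc
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_postings(blob):
    doc_ids, doc, shift, delta = [], 0, 0, 0
    for b in blob:
        delta |= (b & 0x7F) << shift
        shift += 7
        if b < 0x80:  # high bit clear marks the last byte of a varint
            doc += delta
            doc_ids.append(doc)
            delta = shift = 0
    return doc_ids
```

Dense lists shrink to roughly a byte per document this way, which is why a whole segment-level posting list can plausibly fit in a single value.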

If anyone has any serious interest in trying this, my email is in my profile to discuss further.


Hmmm. I'm skeptical. A Lucene term lookup is stupidly fast. It traverses an FST, which is small and probably in memory. Traversing the postings lists themselves also needs to be smart, following a skip table, which is critical for performance.
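To illustrate why the skip table matters: a conjunctive query intersects posting lists by repeatedly advancing each list to the other's current doc id, and skip pointers make that advance logarithmic instead of a linear scan. A toy sketch (not Lucene's implementation; binary search stands in for the skip-table jump):

```python
import bisect

def intersect(a, b):
    # Conjunctive query over two sorted posting lists. advance() models
    # a skip-table jump: O(log n) to the first doc >= target instead of
    # scanning; losing that jump is part of why naive KV layouts hurt.
    def advance(lst, pos, target):
        return bisect.bisect_left(lst, target, pos)

    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return out
```

A KV layout that only supports sequential iteration over a term's entries forces the linear-scan version of `advance`, which is where implementations like Solandra lost most of their speed.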


> you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently.

That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?


It most definitely would.

Datomic’s single-writer system requires conditional put (CAS) for index and (transaction) log (trees) roots pointers (mutable writes), and eventual consistency for all other writes (immutable writes) [0].

I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system due to FoundationDB's external consistency and causality guarantees [1], drop its 64-bit integer-based keys to take advantage of FoundationDB range reads [2], drop its memcached layer due to FoundationDB's distributed caching [3], use FoundationDB watches for transactor messaging and the tx-report-queue function [4], use FoundationDB snapshot reads [5] for its immutable index tree nodes, and maybe more?

Datomic is a FoundationDB layer. It just doesn’t know yet.
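The conditional-put pattern described above is simple to state: all tree nodes are written immutably first, and the only mutable write is a compare-and-swap on the root pointer. A toy sketch of that invariant, with a dict standing in for the transactional store (in FoundationDB, the read and write below would sit inside one transaction, so the conflict check comes for free):

```python
def cas_root(store, key, expected, new_root):
    # Compare-and-swap on the root pointer: the only mutable write in a
    # Datomic-style scheme. If another writer flipped the root first,
    # the swap fails and the caller must re-read and retry.
    if store.get(key) != expected:
        return False
    store[key] = new_root
    return True
```

Usage: write the new immutable index nodes, then try to flip `"index-root"` from the root you read to the new one; a `False` return means you lost the race.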

[0] https://docs.datomic.com/on-prem/acid.html#how-it-works

[1] https://apple.github.io/foundationdb/developer-guide.html?hi...

[2] https://apple.github.io/foundationdb/developer-guide.html?hi...

[3] https://apple.github.io/foundationdb/features.html#distribut...

[4] https://docs.datomic.com/on-prem/clojure/index.html#datomic....

[5] https://apple.github.io/foundationdb/developer-guide.html?hi...


Can't see Datomic itself ever doing that, though, because they'd have to support those features in all the backends.


I wrote the original version of Solandra (which is/was Solr on Cassandra) on top of Jake's Lucene on Cassandra[1].

I can confirm it wasn't fast!

(And to be fair that wasn't the point - back then there were no distributed versions of Solr available so the idea of this was to solve the reliability/failover issue).

I wouldn't use it on a production system nowadays.

[1] http://nicklothian.com/blog/2009/10/27/solr-cassandra-soland...


> I've used Lucene on top of HDFS and S3 quite nicely.

Out of curiosity, what led you to do this? And what does it do better/worse/differently than out-of-the-box things like Elasticsearch or SOLR?


HDFS is supported by Solr+Lucene, but rather than my own poor paraphrasing, see what you think of this writeup: https://engineering.linkedin.com/search/did-you-mean-galene


Ah, excellent! Thanks. That answers my question. I also found the idea of early termination via static rank very intriguing.


See Elassandra for a better Solandra; it keeps Lucene indexes and SSTables separate. That should be better than keeping posting lists in a kv-store.


OK, thanks, it was sort of confusing me.


Actually, since my current side project needs good graph support, I went looking for a FoundationDB graph layer and found this: https://github.com/rayleyva/blueprints-foundationdb-graph. May have to check it out.



