
The most interesting thing about zstandard is how easy it makes training custom compression dictionaries, which can provide massive improvements on certain types of data.

I'd like to see how well a custom dictionary trained on a few hundred npm packages would perform on arbitrary other npm packages. My hunch is that there are a lot of patterns - both in the JavaScript code and in the JSON and README conventions used in those packages - that could help achieve much better compression.



We had billions of Protobufs to store in Cassandra as byte blobs; using a zstd dictionary dramatically reduced storage size and improved latency over the built-in compression. The complexity overhead of managing these dictionaries and making sure the client always has access to the right dictionary for decompression was non-trivial but well worth it.

We looked at Brotli as well, but decompression speed at an acceptable ratio was the most important factor for us; that, plus zstd's far superior docs and evangelism, sealed the deal.


What kind of additional gains (% wise) did you see with custom dictionaries compared to vanilla zstd?


Depends on how much shared redundancy your data has. You could test this by compressing all your content into one stream (which acts like a shared dictionary) versus compressing it into separate streams (no shared dictionary) and comparing total sizes.


We think similarly. :) I posted the same idea while you were typing yours: https://news.ycombinator.com/item?id=37755005



