
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".

If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.

It's far more efficient - and less work - if you just use a SHA-256 of the file's contents. That gets you a much smaller 32 byte key, and you don't need to bother with 2-stage comparisons.
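A minimal sketch of what that looks like, assuming Python's standard hashlib (function names here are my own):

```python
import hashlib
from collections import defaultdict

def sha256_of_file(path, chunk_size=65536):
    """Stream the file through SHA-256 so large files don't sit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.digest()  # the 32-byte lookup key

def find_duplicates(paths):
    # Files with the same digest are (overwhelmingly likely) duplicates,
    # so no second comparison stage is needed.
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of_file(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```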



I understand the concept. My main point is that it's probably not a huge advantage to store hashes of the first 1KB, which requires CPU to calculate, over just the raw bytes, which requires storage. There's a tradeoff either way.

I don't think it would be far more efficient to hash the entire contents though. If you have a million files storing a terabyte of data, the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you do a comparison of the whole hashed contents, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.
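For concreteness, the 2-stage comparison being costed here could look something like this sketch (names are illustrative; stage 2 uses a full byte-for-byte compare via the stdlib's filecmp):

```python
import filecmp
from collections import defaultdict

def find_duplicates_two_stage(paths):
    # Stage 1: bucket by the raw first 1 KiB -- the "lookup key".
    buckets = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            buckets[f.read(1024)].append(p)
    # Stage 2: full comparison, but only within buckets whose keys collided.
    dups = []
    for group in buckets.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    dups.append((a, b))
    return dups
```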


If you're going to keep partial hashes in memory, may as well align it on whatever boundary is the minimal block/sector size that your drives give back to you. Hashing (say) 8kB takes less time than it takes to fetch it from SSD (much less disk), so if you only used the first 1kB, you'd (eventually) need to re-fetch the same block to calculate the hash for the rest of the bytes in that block.

... okay, so as long as you always feed chunks of data into your hash in the same deterministic order, it doesn't matter for the sake of correctness what that order is or even if you process some bytes multiple times. You could hash the first 1kB, then the second-through-last disk blocks, then the entire first disk block again (double-hashing the first 1kB) and it would still tell you whether two files are identical.

If you're reading from an SSD and seek times don't matter, it's in fact probable that on average a lot of files are going to differ near the start and end (file formats with a header and/or footer) more than in the middle, so maybe a good strategy is to use the first 32k and the last 32k, and then if they're still identical, continue with the middle blocks.
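A rough sketch of that head+tail strategy, assuming we fold the file length into the key so equal-length files are always hashed over the same byte ranges (the 32 KiB size and helper name are my own):

```python
import hashlib, os

HEAD_TAIL = 32 * 1024  # illustrative; headers/footers usually differ early

def head_tail_key(path):
    """Key on (length, hash of first 32 KiB + last 32 KiB).

    Only if two files agree on this key would you go on to
    compare the middle blocks.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(HEAD_TAIL))  # first 32 KiB (or the whole file)
        if size > HEAD_TAIL:
            # Seek so we never re-hash bytes already covered by the head.
            f.seek(max(HEAD_TAIL, size - HEAD_TAIL))
            h.update(f.read())       # last (up to) 32 KiB
    return (size, h.digest())
```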

In memory, per-file, you can keep something like

  - the length
  - h(block[0:4])
  - h(block[0:4] | block[-5:])
  - h(block[0:4] | block[-5:] | block[4:32])
  - h(block[0:4] | block[-5:] | block[4:128])
  - ...
  - h(block[0:4] | block[-5:] | block[4:])
etc, and only calculate the latter partial hashes when there is a collision between earlier ones. If you have 10M files and none of them have the same length, you don't need to hash anything. If you have 10M files and 9M of them are copies of each other except for a metadata tweak that resides in the last handful of bytes, you don't need to read the entirety of all 10M files, just a few blocks from each.
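A simplified sketch of that lazy escalation (two hash stages instead of the full ladder above; the 4 KiB prefix size and function names are my own):

```python
import hashlib, os
from collections import defaultdict

def group_by(paths, key_fn):
    buckets = defaultdict(list)
    for p in paths:
        buckets[key_fn(p)].append(p)
    return buckets.values()

def prefix_hash(path, n):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(n))
    return h.digest()

def find_duplicates_lazy(paths):
    dups = []
    # Stage 1: length alone -- files with a unique length are never read.
    for bucket in group_by(paths, os.path.getsize):
        if len(bucket) < 2:
            continue
        # Stage 2: a cheap 4 KiB prefix hash within colliding buckets.
        for b2 in group_by(bucket, lambda p: prefix_hash(p, 4096)):
            if len(b2) < 2:
                continue
            # Stage 3: hash the whole file, only for survivors.
            for b3 in group_by(b2, lambda p: prefix_hash(p, os.path.getsize(p))):
                if len(b3) > 1:
                    dups.append(b3)
    return dups
```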

A further refinement would be to have per-file-format hashing strategies... but then hashes wouldn't be comparable between different formats, so if you had 1M pngs, 1M zips, and 1M png-but-also-zip quine files, it gets weird. Probably not worth it to go down this road.



