
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key".

If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes -- e.g. perhaps all the photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches.

It's far more efficient - and less work - if you just use a SHA-256 of the file's contents. That gets you a much smaller 32 byte key, and you don't need to bother with 2-stage comparisons.
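A minimal sketch of what that looks like, assuming Python's standard hashlib (function names here are my own):

```python
import hashlib
from collections import defaultdict

def sha256_of_file(path, chunk_size=65536):
    """Stream the file through SHA-256 so large files don't sit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.digest()  # the 32-byte lookup key

def find_duplicates(paths):
    # Files with the same digest are (overwhelmingly likely) duplicates,
    # so no second comparison stage is needed.
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of_file(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```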



I understand the concept. My main point is that it's probably not a huge advantage to store hashes of the first 1KB, which requires CPU to calculate, over just the raw bytes, which requires storage. There's a tradeoff either way.

I don't think it would be far more efficient to hash the entire contents though. If you have a million files storing a terabyte of data, the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you do a comparison of the whole hashed contents, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.
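For concreteness, the 2-stage comparison being costed here could look something like this sketch (names are illustrative; stage 2 uses a full byte-for-byte compare via the stdlib's filecmp):

```python
import filecmp
from collections import defaultdict

def find_duplicates_two_stage(paths):
    # Stage 1: bucket by the raw first 1 KiB -- the "lookup key".
    buckets = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            buckets[f.read(1024)].append(p)
    # Stage 2: full comparison, but only within buckets whose keys collided.
    dups = []
    for group in buckets.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    dups.append((a, b))
    return dups
```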


If you're going to keep partial hashes in memory, may as well align it on whatever boundary is the minimal block/sector size that your drives give back to you. Hashing (say) 8kB takes less time than it takes to fetch it from SSD (much less disk), so if you only used the first 1kB, you'd (eventually) need to re-fetch the same block to calculate the hash for the rest of the bytes in that block.

... okay, so as long as you always feed chunks of data into your hash in the same deterministic order, it doesn't matter for the sake of correctness what that order is or even if you process some bytes multiple times. You could hash the first 1kB, then the second-through-last disk blocks, then the entire first disk block again (double-hashing the first 1kB) and it would still tell you whether two files are identical.

If you're reading from an SSD and seek times don't matter, it's in fact probable that on average a lot of files are going to differ near the start and end (file formats with a header and/or footer) more than in the middle, so maybe a good strategy is to use the first 32k and the last 32k, and then if they're still identical, continue with the middle blocks.
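A rough sketch of that head+tail strategy, assuming we fold the file length into the key so equal-length files are always hashed over the same byte ranges (the 32 KiB size and helper name are my own):

```python
import hashlib, os

HEAD_TAIL = 32 * 1024  # illustrative; headers/footers usually differ early

def head_tail_key(path):
    """Key on (length, hash of first 32 KiB + last 32 KiB).

    Only if two files agree on this key would you go on to
    compare the middle blocks.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(HEAD_TAIL))  # first 32 KiB (or the whole file)
        if size > HEAD_TAIL:
            # Seek so we never re-hash bytes already covered by the head.
            f.seek(max(HEAD_TAIL, size - HEAD_TAIL))
            h.update(f.read())       # last (up to) 32 KiB
    return (size, h.digest())
```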

In memory, per-file, you can keep something like

  - the length
  - h(block[0:4])
  - h(block[0:4] | block[-5:])
  - h(block[0:4] | block[-5:] | block[4:32])
  - h(block[0:4] | block[-5:] | block[4:128])
  - ...
  - h(block[0:4] | block[-5:] | block[4:])
etc, and only calculate the latter partial hashes when there is a collision between earlier ones. If you have 10M files and none of them have the same length, you don't need to hash anything. If you have 10M files and 9M of them are copies of each other except for a metadata tweak that resides in the last handful of bytes, you don't need to read the entirety of all 10M files, just a few blocks from each.
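A simplified sketch of that lazy escalation (two hash stages instead of the full ladder above; the 4 KiB prefix size and function names are my own):

```python
import hashlib, os
from collections import defaultdict

def group_by(paths, key_fn):
    buckets = defaultdict(list)
    for p in paths:
        buckets[key_fn(p)].append(p)
    return buckets.values()

def prefix_hash(path, n):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(n))
    return h.digest()

def find_duplicates_lazy(paths):
    dups = []
    # Stage 1: length alone -- files with a unique length are never read.
    for bucket in group_by(paths, os.path.getsize):
        if len(bucket) < 2:
            continue
        # Stage 2: a cheap 4 KiB prefix hash within colliding buckets.
        for b2 in group_by(bucket, lambda p: prefix_hash(p, 4096)):
            if len(b2) < 2:
                continue
            # Stage 3: hash the whole file, only for survivors.
            for b3 in group_by(b2, lambda p: prefix_hash(p, os.path.getsize(p))):
                if len(b3) > 1:
                    dups.append(b3)
    return dups
```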

A further refinement would be to have per-file-format hashing strategies... but then hashes wouldn't be comparable between different formats, so if you had 1M pngs, 1M zips, and 1M png-but-also-zip quine files, it gets weird. Probably not worth it to go down this road.



