How I created two images with the same MD5 hash (natmchugh.blogspot.com)
193 points by sebg on Nov 4, 2014 | hide | past | favorite | 47 comments


tl;dr: collision by appending data at the end of the .jpeg file.

> The chosen prefix collision attack works by repeatedly adding 'near collision' blocks which gradually work to eliminate the differences in the internal MD5 state until they are the same

> This type of collision has been termed a chosen prefix collision. In this case the image data is the prefix, or, to be more exact, the internal state of the MD5 algorithm after processing the image is. You can't see the added binary data at the end of JPEG images as it is preceded by an End Of Image JPEG marker.


Of course, algorithms like md5 are still useful when not used as cryptographic functions. For example, they can be used for duplicate detection across a large data set. Unless the data set is under the control of an attacker, and they have something to gain by creating a collision, there's little risk in this use.


That's the issue though, isn't it - that an attacker would have control over the dataset and have something to gain by providing a modified file under a publicly known "safe" md5. For example, compromised Linux install ISOs.


In that case it's second pre-image resistance you need, though, not collision resistance. TTBOMK that's still not possible with MD5 (though it probably will be in a few years).


Only if the attacker had no part in making the "safe" ISO image in the first place. You just cannot know. Using MD5 makes no sense whatsoever: if you don't need collision resistance, there are faster options, and if you do, it's obviously completely unsuitable.


"For example, they can be used for duplicate detection across a large data set."

From extensive experimentation with this exact topic: you have to be really careful, because there are a lot of handwavy arguments, or just "use openssl". Just because md5 is 33% faster than sha1 with 32-bit OpenSSL on i386 doesn't mean whatever random platform and library combo you're actually using will gain exactly 33% by switching from sha1 to md5. Probably someone out there has a pathological example where md5 is somehow actually slower than sha1.

MD5 is almost too good a hash function for mere de-duping, but I have to use it because it's nearly universal. In MySQL I have the function md5(). In Clojure it's (digest/md5 "something"). On the AS/400, thankfully, that's not my area of responsibility, but I know it's compatible with all of the above. In Perl it's "use Digest::MD5". The most important part is that an md5 calculated on an engineering data record will always yield the same hash result no matter the source, which is very important for de-duping.

The world of programmers has some very peculiar ideas about hashes: there is usually only one library for hashing, and it often only has stuff like md5 or sha1, which is not good enough for crypto and overkill for de-duping and ID creation. The world would be a better place if, as a profession, we split our hashing libraries on all machines and languages into "really fast insecure de-dupe hash functions" and "really secure, glacially slow crypto hash functions".
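A minimal sketch of that de-dupe use case in Python (the record names here are made up for illustration). The point the parent makes holds: MD5 is deterministic, so any two tools that agree on the input bytes agree on the digest.

```python
import hashlib
from collections import defaultdict

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

def find_duplicates(records: dict) -> list:
    """Group record names whose contents hash to the same MD5 digest."""
    by_digest = defaultdict(list)
    for name, data in records.items():
        by_digest[md5_hex(data)].append(name)
    return [names for names in by_digest.values() if len(names) > 1]

records = {
    "part-001.dat": b"engineering data record",
    "part-002.dat": b"engineering data record",  # byte-identical duplicate
    "part-003.dat": b"a different record",
}
print(find_duplicates(records))  # [['part-001.dat', 'part-002.dat']]
```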


You can also use e.g. BLAKE2 which is at least as fast as MD5, while still being cryptographically strong. Win-win.
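Python's hashlib has had BLAKE2 built in since 3.6; a quick sketch showing it drops in where md5() would go, plus the keyed mode MD5 lacks:

```python
import hashlib

data = b"some engineering data record"

# Drop-in replacement for md5(): same call shape, 128-bit output, stronger hash.
digest = hashlib.blake2b(data, digest_size=16).hexdigest()
print(len(digest))  # 32 hex characters, same width as an MD5 hex digest

# BLAKE2 also has a built-in keyed mode, usable as a MAC without the HMAC wrapper.
mac = hashlib.blake2b(data, key=b"secret key").hexdigest()
```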


Nice BLAKE2 article from HN in March 2014:

https://leastauthority.com/blog/BLAKE2-harder-better-faster-...


For hashing passwords, you want something slow, not fast.


Who said anything about hashing passwords? But this is a misconception. You don't want a primitive that is slow. You want one that gives you a lot of security margin in few cycles; then you make it slow by, for instance, iterating it many times. There are many password hashing schemes based on BLAKE2 - 6 different ones were submitted to PHC (https://password-hashing.net/).
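The standard construction for "make a fast primitive slow by iterating it" is PBKDF2, which is in Python's stdlib. A sketch (the iteration count below is illustrative, not a recommendation):

```python
import hashlib
import hmac
import os

password = b"correct horse battery staple"
salt = os.urandom(16)  # random per-user salt, stored alongside the derived hash

# PBKDF2 iterates a fast primitive (here HMAC-SHA-256) many times,
# turning cheap single hashes into a deliberately slow derivation.
stored = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)

# Verification: re-derive with the same salt and compare in constant time.
attempt = hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)
print(hmac.compare_digest(stored, attempt))  # True
```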


It's sadly common for inexperienced devs to use MD5 and call it a day (if they hash at all).

I just wanted to point out that, for situations where user input and security are important, you want the algorithm to be slow.

I didn't say anything about how to implement it or whether you should use BLAKE2 or what. There's a lot more to it than I could put in a reply here, and even quick Googling would turn up info about salting/iterating/etc.


Thanks for BLAKE2


If you aren't concerned about an attacker, then Murmur and FNV-1a are attractive hashes: http://programmers.stackexchange.com/questions/49550/which-h...
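FNV-1a is small enough to write out in full. A 64-bit version in Python (constants are the published FNV-1a offset basis and prime); a real dedupe pipeline would of course use a vetted library rather than this sketch:

```python
FNV64_OFFSET = 0xcbf29ce484222325  # FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001b3        # FNV-1a 64-bit prime

def fnv1a_64(data: bytes) -> int:
    """Fast, non-cryptographic FNV-1a hash; fine for de-duping, not for security."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte                                   # xor first (the "1a" variant)...
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # ...then multiply, mod 2^64
    return h

print(hex(fnv1a_64(b"")))  # 0xcbf29ce484222325 (empty input returns the offset basis)
```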


> For example, they can be used for duplicate detection across a large data set.

And that's how my colleague sees the photos of some unknown woman in his Dropbox (and supposedly, she sees his pictures too).


Can you elaborate on what mechanism of Dropbox lets this work?


Almost certainly by using some form of fragile hashing mechanism such as MD5 to identify users, instead of UUID or something similar.


No, almost certainly not. MD5 being broken in this way doesn't make two random 128-bit digests any more likely to collide by chance; it requires an active adversary. There are a million ways Dropbox's software could be buggy that would result in this, all more likely than a chance md5 collision.

Furthermore, the kind of weakness md5 has now would require that the attacker generate both usernames. And the usernames would be really, really long, because they'd have gibberish tacked onto the end to make the hashes collide. In the JPEG format you can make that part invisible, but not in a username.


Do I understand this right: if the length of the original file is specified along with the hash, then such attacks are not trivial?




When you have <pre> formatted paragraphs that are wider than the viewport, it's quite annoying that you also intercept swipe-left to open a different article instead of letting the user scroll.

MobileSafari, iOS8.


Valid criticism, but Blogspot is to blame here.


Do authors not have control over their Blogspot theme? I thought they used to.

Or if not, maybe more people need to be told not to use blogspot. Second article I've read in a week that was broken by this "improvement".


Same problem here on Android, and the browser is "Internet".


This reminds me of a very interesting challenge on smashthestack.org (IO #11 I believe), involving creating two brainfuck programs with the same MD5 hash, which produced quite different outputs. Excellent learning process. Having this article probably would have cut a week off the time it took to complete...


That was a fun one. I seem to recall it taking a day or so to generate a suitably colliding prefix pair using publicly available code. After that the rest was pretty easy.


If anyone wants to re-run this this is the version of the shell script I ended up with https://gist.github.com/natmchugh/ab3e30a45fd724888ad8


[deleted]


It is indeed, 2^64. Here is a good explanation of the Birthday Paradox that sheds some light on this.

http://betterexplained.com/articles/understanding-the-birthd...


I'm curious what sort of vulnerabilities this could expose.

Conflicting git check-ins, breaking cache layers, tampering with downloads, etc.


1. git uses SHA-1, not MD5; the former has problems, but is still better than the latter

2. git stores the length of a blob before hashing [1] which makes it harder (but not impossible) to perform such attacks [2]

[1] http://www.git-scm.com/book/en/v2/Git-Internals-Git-Objects#... [2] http://blogs.msdn.com/b/oldnewthing/archive/2004/05/19/13493...
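The length-prefixing detail is easy to reproduce: git hashes the header "blob <size>\0" followed by the contents, so the length is baked into the digest. A sketch (not git's actual code):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Compute the SHA-1 git assigns to a blob: header 'blob <len>\\0' + contents."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Well-known value: the empty blob's id, the same in every git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```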


I've seen quite a few examples of binary collisions, but never plain text collisions.

I'm aware that they are possible, but they appear to be unlikely, especially when using md5 correctly. Even then, most people have already migrated to sha1 or sha256...


Few things are as aggravating as a website that commandeers the navigate back gesture.


I think at this point everybody knows (or should know) MD5 is only useful for comparing two files of the same size to find duplicate files. Any other use is asking for trouble.


There are known MD5 collisions of the same length.


When is it worth it to be concerned about this type of attack?

Do both files need to be altered to find a collision, or just one? Can it be done in a fixed file size?


Do sha1, sha2 have similar weakness?


SHA-1 can be considered near-broken at this point¹, as far as I remember. No actual successful attack like with MD5, but close enough to be theoretically possible in the foreseeable future.

There was fear that the attacks could be extended to SHA-2, thus we now have SHA-3 too. However, SHA-2 remains secure for now.

_____

¹ Wikipedia: »As of 2012, the most efficient attack against SHA-1 is considered to be the one by Marc Stevens[34] with an estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers.« I.e. it's quite expensive, but can be done in a reasonable time, especially by adversaries with interest and funds to do so.


There is no indication that SHA-2 is threatened in any practical way.

SHA-1 and SHA-2 are similar at an architectural level, in some of the same ways that two mid-1990s Feistel ciphers might be similar, and share building blocks, but they aren't the same hash function. They are much more different than, say, DES and 3DES.

SHA-2 remains the best practical choice for most systems today. The truncated variants (like SHA2-512/256) even break length extension exploits.


It would be a lot cheaper to approach industries which already have very large FPGA clusters (weather forecasting does, for example) and rent some compute time on them with your own bitstream. Problems like this are embarrassingly parallel and very suited to hardware-based attacks, given sufficient financial motivation. Time for a Kickstarter, maybe?


Every time you map a large set to a smaller set, you are bound to have collisions[1]. In other words, all hashing algorithms which map input data to a bounded string will have collisions. Depending on what you're using the hashing function for, it just becomes a matter of how feasible it is to either find a collision or find the data which yielded the hash.

[1] http://en.wikipedia.org/wiki/Pigeonhole_principle

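The pigeonhole argument can be demonstrated directly with a toy 8-bit hash (MD5 truncated to its first byte): 257 distinct inputs into 256 buckets must collide.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Toy 8-bit hash: first byte of the MD5 digest (only 256 possible outputs)."""
    return hashlib.md5(data).digest()[0]

seen = {}
collision = None
for i in range(257):                 # 257 distinct inputs, 256 pigeonholes
    data = b"input-%d" % i
    h = tiny_hash(data)
    if h in seen:
        collision = (seen[h], data)  # guaranteed by the pigeonhole principle
        break
    seen[h] = data

print(collision is not None)  # True
```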


While that's true, I wouldn't call the pigeonhole principle a remotely similar weakness to the chosen prefix collision attack vulnerability in MD5. The feasibility of finding a collision in MD5 has little to do with the number of pigeonholes in MD5.


Not at all. Not to completely repeat myself: I'm merely saying that all such hash functions will have collisions, so the fact that collisions exist doesn't imply weakness - it's a fact of life. All you're left with is looking at the feasibility of finding a collision. We're both saying the same thing, IMO.


No one has yet shown that SHA-1 collisions are possible, as with this MD5 technique, but it's known that SHA-1 isn't as strong against potential attacks as it was designed to be.

That knowledge has driven adoption of SHA-2, and was partial motivation for the (completed in 2012) competition to design SHA-3. I don't believe any weaker-than-designed problems have yet been found with the SHA-2 family.

In 2012, Bruce Schneier reported on an analysis by Jesse Walker of Intel about when SHA-1 collisions might be practical to create:

https://www.schneier.com/blog/archives/2012/10/when_will_we_...

That analysis suggests: "A collision attack is therefore well within the range of what an organized crime syndicate can practically budget by 2018, and a university research project by 2021."

But it also notes non-commodity approaches (GPUs, custom chips, etc) could achieve SHA-1 collisions sooner/cheaper.


This is not new at all, I remember playing with MD5 collisions in 2006, but it's good to be discussed from time to time so that we know we cannot trust such things.

We're getting there with reduced SHA-1[1] (that is, fewer than 80 rounds, which means fewer than 2^80 theoretical operations[2]). And the cost of finding a collision decreases over time[3], which is why everybody says SHA-1 is obsolete.

[1] http://eprint.iacr.org/2010/413.pdf [2] https://www.schneier.com/blog/archives/2005/02/sha1_broken.h... [3] https://www.schneier.com/blog/archives/2012/10/when_will_we_...


There aren't any known collisions for either of those algorithms.


    "To search though all possible MD5 values is 2^128 operations which is massive. To be in with a good chance of finding a collision would take ~  2^64 operations which is again far too big for normal computing."
Seems weird. Why is finding a collision so much cheaper than a full search? Or does the author not understand exponentiation?


He doesn't reference it but he understands exponentiation and the birthday problem [1] that is used to find collisions when you control both files [2].

[1] http://en.wikipedia.org/wiki/Birthday_problem

[2] http://en.wikipedia.org/wiki/Collision_attack#Classical_coll...
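The 2^64 figure falls out of the birthday bound: with N = 2^128 possible digests and n randomly chosen inputs, the collision probability is approximately 1 - exp(-n^2/2N), which is already ~39% at n = 2^64. A quick check using that standard approximation:

```python
import math

N = 2**128  # number of possible MD5 digests
n = 2**64   # number of randomly chosen inputs

# Standard birthday-problem approximation: P(collision) ~ 1 - e^(-n^2 / 2N)
p = 1 - math.exp(-(n * n) / (2 * N))
print(round(p, 4))  # 0.3935
```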



