
I have some interest in various edit distance implementations. The problem with the basic edit distance is that you have to compare the query string against every stored string, which becomes an unbearable computational cost as the corpus grows.
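To make the cost concrete: the classic dynamic-programming Levenshtein computation is O(n·m) per pair of strings, so matching one query against a corpus of N strings is O(N·n·m). A minimal sketch (not from the linked project):

```python
def levenshtein(a, b):
    """Classic DP edit distance: O(len(a) * len(b)) time per pair of strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]
```

Running this against every stored string is exactly the brute-force approach the indexing schemes below are trying to avoid.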

It seems that what they have done here is create a rainbow table of sorts that houses all the possible edits at distance 1 and 2. Distance 3 is possible, but requires more space to store and more time to scan.
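One common way to precompute such a table (this is the Norvig-style approach, not necessarily what the linked project does) is to enumerate every string reachable by one edit, then apply the generator twice for distance 2. The blow-up is why distance 3 gets expensive: for a word of length n there are roughly 54n + 25 strings at distance 1, and squaring that for each extra level.

```python
import string

def edits1(word):
    """All strings reachable from `word` by one insertion, deletion, or
    substitution over lowercase ASCII (a Norvig-style candidate generator)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | replaces | inserts

def edits2(word):
    """All strings within distance 2: apply edits1 to every distance-1 result."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
```

Looking a word up is then just set or hash-table membership, which is what makes the query side fast at the cost of storage.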

This is a very interesting problem with applications in many, many areas. I always felt there was more work to be done here, and it looks like there still is. For example, one of the areas edit distance was initially applied to was person deduplication: when merging lists of names it is important to identify duplicates and merge them appropriately. This is a problem for me in medical informatics, and it is more devious than it sounds at first blush.



A "rainbow table" works in a somewhat similar manner, but is based on doing a full calculation then saving only certain starting points. I think what they are doing is actually more like creating a regular expression that matches all words a particular Levenshtein distance from the target.
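The automaton idea can be illustrated without the full memoized-DFA construction from the linked post: simulate a Levenshtein NFA whose states are (characters of the target consumed, edits used). This is a simplified sketch of the concept, not the construction the article describes:

```python
def _close(states, word, k):
    """Epsilon closure: deleting a character of `word` costs one edit
    and consumes no input."""
    states, stack = set(states), list(states)
    while stack:
        i, e = stack.pop()
        if i < len(word) and e < k and (i + 1, e + 1) not in states:
            states.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return states

def within_distance(word, candidate, k):
    """True iff levenshtein(word, candidate) <= k, via NFA simulation.
    States are pairs (chars of `word` matched, edits spent)."""
    states = _close({(0, 0)}, word, k)
    for ch in candidate:
        nxt = set()
        for i, e in states:
            if e < k:
                nxt.add((i, e + 1))                  # ch inserted into word
            if i < len(word) and word[i] == ch:
                nxt.add((i + 1, e))                  # exact match
            elif i < len(word) and e < k:
                nxt.add((i + 1, e + 1))              # substitution
        states = _close(nxt, word, k)
        if not states:
            return False                             # pruned: too many edits
    return any(i == len(word) for i, _ in states)
```

The trick in the article is that this machine, once determinized, can be intersected with a sorted dictionary or trie, so you only ever visit entries the automaton could still accept.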

Here's more about how they are doing it: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...

This article has a very good overview of the options: http://stevehanov.ca/blog/index.php?id=114



