algorithm - String similarity score/hash


Is there a way to calculate a general "similarity score" of a string? That is, not by comparing two strings directly, but by computing a number (hash) for each string that can later tell me whether two strings are similar or not. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

hello world                1000
hello world!               1010
hello earth                1125
foo bar                    3250
foobarbar                  3750
foo bar!                   3300
foo world!                 2350

You can see that "hello world!" and "hello world" are similar, and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by absolute value.
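To make the lookup concrete, here is a minimal Python sketch of that subtract-and-sort step. It takes the example scores above as given (they are illustrative numbers from the question, not the output of any real hash), and the function name most_similar is made up for this illustration.

```python
# Example scores copied from the question; they are illustrative values,
# not produced by any actual hashing algorithm.
scores = {
    "hello world": 1000,
    "hello world!": 1010,
    "hello earth": 1125,
    "foo bar": 3250,
    "foobarbar": 3750,
    "foo bar!": 3300,
    "foo world!": 2350,
}

def most_similar(query, scores):
    """Rank all other strings by the absolute difference of their scores (hypothetical helper)."""
    target = scores[query]
    others = [s for s in scores if s != query]
    return sorted(others, key=lambda s: abs(scores[s] - target))

print(most_similar("hello world", scores))
# ['hello world!', 'hello earth', 'foo world!', 'foo bar', 'foo bar!', 'foobarbar']
```

If the scores were kept in a sorted list, the nearest neighbours could also be found with a binary search instead of sorting everything on each query.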

I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed so that small variations in the input cause large changes in the output, these hashes attempt the opposite: small changes in the input generate proportionally small changes in the output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
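As one possible illustration, here is a rough Python sketch of SimHash, a well-known locality-sensitive hash. The answer does not name a specific algorithm, so this choice is an assumption, as are the use of character trigrams as the "feature", MD5 as the per-feature hash, and the 64-bit fingerprint size. Note also that SimHash fingerprints are compared by Hamming distance (the number of differing bits), not by numeric subtraction, so it only approximates what the question asks for.

```python
import hashlib

BITS = 64  # fingerprint width; an arbitrary choice for this sketch

def trigrams(text):
    """Return the overlapping 3-character pieces of text (the whole text if shorter)."""
    if len(text) < 3:
        return [text]
    return [text[i:i + 3] for i in range(len(text) - 2)]

def simhash(text):
    """Compute a 64-bit SimHash fingerprint from character trigrams."""
    votes = [0] * BITS
    for gram in trigrams(text):
        # Use the first 8 bytes of MD5 as a stable 64-bit hash of the trigram.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each trigram votes +1 or -1 on every bit position.
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")

base = simhash("hello world")
for other in ["hello world!", "hello earth", "foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams should end up with fingerprints that differ in only a few bits, while unrelated strings should differ in roughly half of them. Choosing a different feature (words, longer n-grams) changes what the hash considers "alike".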

