algorithm - String similarity score/hash

July 15, 2015

Is there a method to calculate a general "similarity score" of a string? Not by comparing two strings directly, but by computing a number (hash) for each string that can later tell me whether two strings are similar or not. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

hello world                1000
hello world!               1010
hello earth                1125
foo bar                    3250
foobarbar                  3750
foo bar!                   3300
foo world!                 2350

You can see that "hello world!" and "hello world" are similar, and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by absolute value.
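The lookup described above can be sketched in a few lines of Python. This is only an illustration: the scores are the made-up example values from the question, not the output of any real hash function, and the names `SCORES` and `most_similar` are hypothetical.

```python
# Example scores copied from the question; they are illustrative values,
# not the output of any real hash function.
SCORES = {
    "hello world": 1000,
    "hello world!": 1010,
    "hello earth": 1125,
    "foo bar": 3250,
    "foobarbar": 3750,
    "foo bar!": 3300,
    "foo world!": 2350,
}


def most_similar(query, scores=SCORES):
    """Return the other strings ordered by |score - score(query)|, closest first."""
    target = scores[query]
    others = (s for s in scores if s != query)
    return sorted(others, key=lambda s: abs(scores[s] - target))


if __name__ == "__main__":
    print(most_similar("hello world"))
```

With these example values, the closest match to "hello world" is "hello world!" (difference 10), followed by "hello earth" (difference 125).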

I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
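One well-known locality-sensitive scheme is SimHash. The sketch below is a minimal version under stated assumptions, not something the answer itself prescribes: it assumes character trigrams as the features, MD5 as the per-feature hash, and a 64-bit fingerprint, and the function names are hypothetical. Note that SimHash fingerprints are compared by Hamming distance (number of differing bits), not by subtracting them as numbers, so it does not give the single sortable score the question asks for.

```python
import hashlib


def shingles(text, n=3):
    """Character n-grams of the text (the features being compared)."""
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, bits=64, n=3):
    """Compute a SimHash fingerprint of `text` as an integer with `bits` bits."""
    counts = [0] * bits
    for gram in shingles(text, n):
        # Hash each feature to a fixed-width integer (MD5 is an arbitrary choice).
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes won.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    base = simhash("hello world")
    for s in ["hello world!", "hello earth", "foo bar"]:
        print(s, hamming(base, simhash(s)))
```

Strings that share most of their trigrams tend to produce fingerprints that differ in only a few bits, while unrelated strings tend to differ in roughly half of them. The choice of features (trigrams here) is what determines which strings count as "alike".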
