algorithm - String similarity score/hash
Is there a method to calculate a general "similarity score" of a string? Not by comparing two strings directly, but by getting a number (hash) for each string that can later tell me whether two strings are similar or not. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
hello world     1000
hello world!    1010
hello earth     1125
foo bar         3250
foobarbar       3750
foo bar!        3300
foo world!      2350
You can see that "hello world!" and "hello world" are similar, and their scores are close to each other.
This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by absolute value.
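To make the lookup step concrete, here is a minimal sketch in Python. It assumes the scores already exist and simply reuses the example numbers from the question. The `scores` dictionary and the `nearest` helper are hypothetical names introduced for illustration, and nothing here says how such scores would actually be computed.

```python
# Example scores copied from the question; how to compute them is the open problem.
scores = {
    "hello world": 1000,
    "hello world!": 1010,
    "hello earth": 1125,
    "foo bar": 3250,
    "foobarbar": 3750,
    "foo bar!": 3300,
    "foo world!": 2350,
}

def nearest(query, scores, k=3):
    """Return the k strings whose scores are closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by the absolute difference between each score and the query's score.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]

print(nearest("hello world", scores))
```

With these numbers, "hello world!" (difference 10) and "hello earth" (difference 125) rank first for "hello world".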
I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed so that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.
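As one possible illustration, the sketch below implements SimHash, a well-known locality-sensitive scheme, over character trigrams. This is a choice made for the example, not something the answer prescribes: the trigram size, the 64-bit width, the use of MD5 as the per-feature hash, and the names `shingles`, `simhash`, and `hamming` are all assumptions. Note also that SimHash fingerprints are compared by Hamming distance (the number of differing bits), not by numeric subtraction, so it only approximates the "close numbers" idea in the question.

```python
import hashlib

def shingles(text, n=3):
    """Split text into a set of overlapping character n-grams (assumed feature)."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def simhash(text, bits=64):
    """Compute a SimHash fingerprint: each feature votes on every bit."""
    counts = [0] * bits
    for sh in shingles(text):
        # Take the first 8 bytes of MD5 as a stable 64-bit feature hash.
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # a bit is set when the votes for it are positive
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

base = simhash("hello world")
for s in ["hello world!", "hello earth", "foo bar"]:
    print(s, hamming(base, simhash(s)))
```

Strings that share most of their trigrams tend to produce fingerprints that differ in only a few bits, while unrelated strings tend to differ in roughly half of them. This is a statistical tendency rather than a guarantee.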
As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".