algorithm - Better distance metrics besides Levenshtein for ordered word sets and subsequent clustering -

January 15, 2012

i trying solve problem involves comparing large numbers of word sets , each of contains large, ordered number of words set of words (totaling around 600+, high dimensionality!) similarity , clustering them distinct groupings. solution needs unsupervised possible.

the data looks like

[apple, banana, orange...]
[apple, banana, grape...]
[jelly, anise, orange...]
[strawberry, banana, orange...]
...etc

the order of words in each set matters ([apple, banana, orange] distinct [apple, orange, banana]

the approach have been using far has been use levenshtein distance (limited distance threshold) metric calculated in python script each word being unique identifier, generate similarity matrix distances, , throwing matrix k-mediods in knime groupings.

my questions are:

is levenshtein appropriate distance metric use problem?
is mean/medoid prototype clustering best way go groupings?
i haven't yet put thought validating choice 'k' in clustering. evaluating sse curve of clustering best way go this?
are there flaws in methodology?
as extension solution in future, given training data, happen have ideas going assigning probabilities cluster assignments? example, set 1 has 80% chance of being in cluster 1, etc.

i hope questions don't seem silly or answers painfully obvious, i'm relatively new data mining.

thanks!

yes, levenshtein suitable way this. if sequences vary in size much, might better off normalising these distances dividing sum of sequence lengths -- otherwise find observed distances tend increase pairs of long sequences "average distance" (in sense of average distance between corresponding k-length substrings, small k) constant.

example: pair ([apple, banana], [carrot, banana]) said have same "average" distance ([apple, banana, widget, xylophone], [carrot, banana, yam, xylophone]) since every 2nd item matches in both, latter pair's raw levenshtein distance twice great.

also bear in mind levenshtein not make special allowances "block moves": if take string, , move 1 of substrings sufficiently far away, resulting pair (of original , modified strings) have same levenshtein score if 2nd string had different elements @ position substring moved to. if want take account, consider using compression-based distance instead. (although there it's useful computing distances without respect order, of course favour ordered similarity disordered similarity.)

Search This Blog

shell

algorithm - Better distance metrics besides Levenshtein for ordered word sets and subsequent clustering -

Comments

Post a Comment

Popular posts from this blog

Add email recipient to all new Trac tickets -

400 Bad Request on Apache/PHP AddHandler wrapper -

asp.net - repeatedly call AddImageUrl(url) to assemble pdf document -