math - Measuring Error Rates Between Rank-Order Lists
I'm trying to measure the agreement between two different systems of classification (one of them based on machine-learning algorithms, and the other based on human ground-truthing), and I'm looking for input from anyone who's implemented a similar sort of system.
The classification schema allows each item to be classified into multiple different nodes in a category taxonomy, where each classification carries a weight coefficient. For example, if an item can be classified into four different taxonomy nodes, the result might look like this for the algorithmic and ground-truth classifiers:
                algo    truth
    Category A: 0.35    0.50
    Category B: 0.30    0.30
    Category C: 0.25    0.15
    Category D: 0.10    0.05

The weights add up to 1.0 across the selected category nodes (of which there are 200 in the classification taxonomy).
In the example above, it's important to note that both lists agree on the rank ordering (ABCD), so they should be scored as being in strong agreement with one another (even though there are differences in the weights assigned to each category). By contrast, in the next example, the two classifications are in complete disagreement with respect to rank order:
                algo    truth
    Category A: 0.40    0.10
    Category B: 0.35    0.15
    Category C: 0.15    0.35
    Category D: 0.10    0.40

So a result like this should get a low score.
One final example demonstrates a common case where the human-generated ground truth contains duplicate weight values:

                algo    truth
    Category A: 0.40    0.50
    Category B: 0.35    0.50
    Category C: 0.15    0.00
    Category D: 0.10    0.00

So it's important that the algorithm allows lists without a perfect rank ordering (since the ground truth could be validly interpreted as ABCD, ABDC, BACD, or BADC).
Stuff I've tried so far:
Root mean squared error (RMSE): Problematic. It doesn't account for rank-order agreement, which means that gross disagreements between categories at the top of the list are swept under the rug by agreement about categories at the bottom of the list.
Spearman's rank correlation: Although it accounts for differences in rank, it gives equal weight to rank agreements at the top of the list and those at the bottom of the list. I don't care much about low-level discrepancies, as long as the high-level discrepancies contribute to the error metric. It also doesn't handle cases where multiple categories can have tie-value ranks.
Kendall tau rank correlation coefficient: Has the same basic properties and limitations as Spearman's rank correlation, as far as I can tell.
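To make the "equal weight" complaint about Spearman's concrete, here is a minimal sketch in plain Python. The weights are hypothetical (they are not taken from the examples above), the helper names `ranks` and `spearman_no_ties` are made up for illustration, and the code assumes there are no tied weights. It shows that swapping the top two categories costs exactly as much as swapping the bottom two.

```python
def ranks(weights):
    """Rank 1 = largest weight. Assumes no tied weights."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    result = [0] * len(weights)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_no_ties(a, b):
    """Classic Spearman formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical weights for categories A, B, C, D.
truth = [0.40, 0.30, 0.20, 0.10]
swap_top = [0.30, 0.40, 0.20, 0.10]     # A and B trade places
swap_bottom = [0.40, 0.30, 0.10, 0.20]  # C and D trade places

rho_top = spearman_no_ties(swap_top, truth)
rho_bottom = spearman_no_ties(swap_bottom, truth)
print(rho_top, rho_bottom)  # both swaps get the same score
```

Both swaps produce the same sum of squared rank differences, so the coefficient cannot tell a mistake at the top of the list from one at the bottom.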
I've been thinking about rolling my own ad hoc metrics, but I'm no mathematician, so I'd be suspicious of whether my own little metric would provide rigorous value. If there's a standard methodology for this kind of thing, I'd rather use that.
Any ideas?
I don't think you need to worry about rigour to that extent. If you want to weight certain types of agreement more than others, that is legitimate.
For example, calculate Spearman's only for the top k categories. I think you should get legitimate answers.
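One possible reading of the top-k suggestion, as a rough sketch rather than a standard method: keep the k categories with the largest ground-truth weight, re-rank just those in both lists, and apply the usual Spearman formula. The function name `spearman_top_k` is hypothetical, choosing the top k by the ground-truth list (rather than the algorithm's list) is my assumption, and the sketch assumes k >= 2 and no tied weights among the selected categories.

```python
def spearman_top_k(algo, truth, k):
    """Spearman's rho restricted to the k categories with the largest truth weight.

    Assumptions: k >= 2 and no tied weights among the selected categories.
    """
    # Indices of the k heaviest ground-truth categories.
    top = sorted(range(len(truth)), key=lambda i: -truth[i])[:k]

    def ranks(weights):
        # Re-rank only the selected categories, 1 = largest weight.
        order = sorted(top, key=lambda i: -weights[i])
        return {index: position for position, index in enumerate(order, start=1)}

    algo_rank, truth_rank = ranks(algo), ranks(truth)
    d_squared = sum((algo_rank[i] - truth_rank[i]) ** 2 for i in top)
    return 1 - 6 * d_squared / (k * (k * k - 1))


# First two examples from the question, categories A, B, C, D.
agree = spearman_top_k([0.35, 0.30, 0.25, 0.10], [0.50, 0.30, 0.15, 0.05], k=3)
disagree = spearman_top_k([0.40, 0.35, 0.15, 0.10], [0.10, 0.15, 0.35, 0.40], k=3)
print(agree, disagree)
```

With k = 3 the first example scores 1.0 and the second scores -1.0, while anything that happens below the cut-off has no effect on the result.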
You can also z-transform etc. to map everything to [0,1] while preserving what you consider to be the "important" pieces of your data set (variance, difference, etc.). Then you can take advantage of the large number of hypothesis-testing functions available.
(As a side note, you can modify Spearman's to account for ties. See Wikipedia.)
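For the tie handling, the usual approach is to give tied items the average of the rank positions they occupy and then take the Pearson correlation of the two rank lists. Below is a small sketch of that in plain Python, applied to the third example from the question. The helper names are made up for illustration, and ties are detected with exact float equality, which is an assumption that suits hand-entered weights like these.

```python
from math import sqrt


def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_with_ties(a, b):
    """Pearson correlation of average ranks. Undefined if one list is all ties."""
    x, y = average_ranks(a), average_ranks(b)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    var_x = sum((p - mean_x) ** 2 for p in x)
    var_y = sum((q - mean_y) ** 2 for q in y)
    return cov / sqrt(var_x * var_y)


algo = [0.40, 0.35, 0.15, 0.10]
truth = [0.50, 0.50, 0.00, 0.00]  # A and B tied, C and D tied
rho = spearman_with_ties(algo, truth)
print(average_ranks(truth), round(rho, 4))
```

Here the ground truth gets ranks 1.5, 1.5, 3.5, 3.5, and the coefficient comes out at roughly 0.89: high, but below 1 because the algorithm separates categories that the ground truth treats as equal.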