regex - How do I remove words from multilingual text?
I have 2 versions of the same document (D, say) containing multilingual text (English and others):
i. One is encoded in ASCII, with Unicode code points represented as character entity references (i.e. Unicode characters in the form &#n, where n is the decimal equivalent of the Unicode hex value).
ii. The other is in UTF-8 encoding.
Q1:
I have a separate list of words (encoded in UTF-8, and in more than 1 language) that I have to remove from document D. How should I proceed?
Can I use a regex to clean D? For doc type i, I believe I would have to specify the whole &#n pattern for each word in the list when I form the regex.
Should the task be easier for doc type ii, since I can specify non-English characters directly in the regex (my Emacs is configured to use these non-English fonts)?
Q2:
I have huge collections of such documents. What would be the best algorithm to remove the words from each of these documents? A table look-up is straightforward but slowest. Should I run a regex through each?
I suggest processing the entities first, so that the 2 sorts of files are the same. When you’re done removing, put the first set back into the encoded form.
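A rough Python sketch of that idea, in case it helps. It is only an illustration under a few assumptions: the references are decimal and end with a semicolon (`&#1082;`), the word list is already a list of Unicode strings, and "word" means a run delimited by non-word characters. The function names (`build_pattern`, `decode_refs`, `encode_refs`, `clean_entity_document`) are made up for this example.

```python
import re

# Assumption: references look like "&#1082;" (decimal, with a trailing semicolon).
NUMERIC_REF = re.compile(r"&#(\d+);")


def build_pattern(words):
    """Compile one regex matching any word in the list as a whole word."""
    # Longest first, so a longer word is tried before a shorter one that is its prefix.
    ordered = sorted(set(words), key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # (?<!\w) and (?!\w) serve as Unicode-aware word boundaries in Python 3.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def decode_refs(text):
    """Turn decimal references into the characters they stand for."""
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


def encode_refs(text):
    """Turn every non-ASCII character back into a decimal reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def remove_words(text, pattern):
    """Delete every listed word from already-decoded (type ii) text."""
    return pattern.sub("", text)


def clean_entity_document(text, pattern):
    """Type i documents: decode, remove, then restore the encoded form."""
    return encode_refs(remove_words(decode_refs(text), pattern))


# Build the pattern once and reuse it for every document in the collection.
pattern = build_pattern(["cat", "кот"])
utf8_result = remove_words("cat кот дом", pattern)
entity_result = clean_entity_document(
    "cat &#1082;&#1086;&#1090; &#1076;&#1086;&#1084;", pattern
)
```

For the second question, compiling the pattern once and running it over each document is probably a reasonable starting point; whether it beats a table look-up on your data is something you would have to measure. Removal leaves the surrounding whitespace as it was, so you may want to collapse double spaces afterwards. One caveat: in scripts that rely on combining marks, `\w` may not cover those marks, so the boundary check might need adjusting.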