regex - How do I remove words from multilingual text?


I have two versions of the same document (call it D) containing multilingual text (English and others):

i. One is encoded in ASCII, with the Unicode code points represented as character entity references (i.e. Unicode characters are of the form &#n, where n is the decimal equivalent of the Unicode hex value).

ii. The other is in UTF-8 encoding.

Q 1:

I have a separate list of words (encoded in UTF-8, and in more than one language) which I have to remove from document D. How should I proceed?

Can I use a regex to clean D? For doc type i, I believe I would have to specify the whole &#n patterns for each word in the list when I form the regex.

Should the task be easier for doc type ii, since I can specify the non-English characters directly in the regex (my Emacs is configured to use these non-English fonts)?

Q 2:

I have huge collections of such documents. What would be the best algorithm to remove the words from each of these documents? A table look-up is straightforward but the slowest. Should I run a regex through each?

I suggest processing the entities first, so that the two sorts of files are the same. When you're done removing the words, put the first set back into its encoded form.
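The decode, remove, re-encode idea could look roughly like the Python sketch below. It is only an illustration under some assumptions: the function names (`decode_decimal_refs`, `build_pattern`, `remove_words`, `clean_entity_document`) are made up for this example, only decimal `&#n` references are decoded (a trailing semicolon is accepted but not required), and "word" is approximated with `\w` lookarounds, which may need adjusting for scripts that use combining marks or do not separate words with spaces.

```python
import re

# Decimal numeric character references such as &#233; (semicolon optional here).
_DECIMAL_REF = re.compile(r"&#(\d+);?")


def decode_decimal_refs(text):
    """Turn &#n references into the characters they stand for."""
    def repl(match):
        code_point = int(match.group(1))
        # Leave out-of-range values untouched rather than raising.
        return chr(code_point) if code_point <= 0x10FFFF else match.group(0)
    return _DECIMAL_REF.sub(repl, text)


def build_pattern(words):
    """Compile one alternation for the whole word list (hypothetical helper)."""
    # Longest first, so a longer word is tried before any word it starts with.
    ordered = sorted(set(words), key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    # Lookarounds on \w serve as rough, Unicode-aware word boundaries.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def remove_words(text, pattern):
    """Delete every listed word from already-decoded (type ii) text."""
    return pattern.sub("", text)


def clean_entity_document(text, pattern):
    """Type i document: decode, remove, then re-encode non-ASCII characters."""
    decoded = decode_decimal_refs(text)
    cleaned = remove_words(decoded, pattern)
    # xmlcharrefreplace writes each non-ASCII character back as &#n;
    return cleaned.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    pattern = build_pattern(["chat", "naïve"])
    print(remove_words("le chat naïve dort", pattern))
    print(clean_entity_document("caf&#233; chat", pattern))
```

For a large collection, the pattern would be built once and reused for every document, so the per-document cost is a single regex pass. Two caveats with this sketch: the re-encoding step always writes the semicolon form `&#n;`, and a reference to an ASCII character (such as `&#38;`) is decoded but not encoded again, so the round trip is not byte-identical in those cases. If the word list and the documents might use different Unicode normalization forms, normalizing both beforehand would probably be needed as well.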

