regex - Multi-language input validation with UTF-8 encoding

August 15, 2013

To check that a user's English name is valid, I would usually match the input against a regular expression such as [A-Za-z]. How can I do this if multi-language support (Chinese, Japanese, etc.) is required with UTF-8 encoding?

You can approximate the Unicode derived property \p{Alphabetic} pretty succinctly with [\pL\pM\p{Nl}] if your language doesn’t support a proper Alphabetic property directly.
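As a rough illustration of that approximation, here is a minimal Python sketch. Python's standard `re` module does not support `\p{...}` classes, so the sketch checks general categories through `unicodedata` instead. The function names `is_alphabetic_approx` and `looks_like_name` are hypothetical, chosen only for this example.

```python
import unicodedata


def is_alphabetic_approx(ch: str) -> bool:
    # Approximates \p{Alphabetic} as [\pL\pM\p{Nl}]:
    # any letter (L*), any mark (M*), or a letter number (Nl).
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nl"


def looks_like_name(text: str) -> bool:
    # Non-empty and made only of characters in the approximation.
    return bool(text) and all(is_alphabetic_approx(ch) for ch in text)


if __name__ == "__main__":
    for sample in ["Smith", "\u5c71\u7530", "Jose\u0301", "O'Brien"]:
        print(repr(sample), looks_like_name(sample))
```

Names written in CJK or with combining accents pass this check, while anything containing an apostrophe or a hyphen does not.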

Don’t use Java’s \p{Alpha}, because by default that’s ASCII-only.

But then you’ll notice that you’ve failed to account for dashes (\p{Pd} or DashPunctuation works, but that does not include all of the hyphens!), apostrophes (usually but not always one of U+0027, U+02BC, U+2019, or U+FF07), comma, or full stop/period.

You probably had better include \p{Pc} ConnectorPunctuation, just in case.
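The punctuation additions can be sketched the same way. This is only an illustration in Python using `unicodedata`; the names `APOSTROPHES`, `EXTRA_PUNCT`, and `is_name_char` are hypothetical, and the apostrophe set is just the four code points listed above, which is not exhaustive. Whitespace is deliberately left out, so the check applies to one name part at a time.

```python
import unicodedata

# The four apostrophe code points mentioned above (not an exhaustive list).
APOSTROPHES = {"\u0027", "\u02BC", "\u2019", "\uFF07"}
# Comma and full stop/period.
EXTRA_PUNCT = {",", "."}


def is_name_char(ch: str) -> bool:
    cat = unicodedata.category(ch)
    # Letters, marks, and letter numbers: the [\pL\pM\p{Nl}] approximation.
    if cat[0] in ("L", "M") or cat == "Nl":
        return True
    # Dash punctuation (Pd) and connector punctuation (Pc).
    if cat in ("Pd", "Pc"):
        return True
    return ch in APOSTROPHES or ch in EXTRA_PUNCT


def is_name_part(text: str) -> bool:
    return bool(text) and all(is_name_char(ch) for ch in text)
```

With this, `is_name_part("Anne-Marie")` and `is_name_part("O\u2019Brien")` both hold, while a digit or an exclamation mark is still rejected.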

If you have the Unicode derived property \p{Diacritic}, you should use that, too, because it includes things like the mid-dot needed for geminated L’s in Catalan, and the non-combining forms of diacritic marks that people sometimes use.

But then you’ll find people who use ordinal numbers in their names in ways that \p{Nl} (LetterNumber) doesn’t accommodate, so you throw \p{Nd} (DecimalNumber) or even all of \pN (Number) into the mix.
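To make the difference between those number classes concrete, here is a small Python sketch using `unicodedata`. The helper `is_number_char` and its `any_number` flag are hypothetical names for this example only.

```python
import unicodedata


def is_number_char(ch: str, any_number: bool = False) -> bool:
    cat = unicodedata.category(ch)
    if any_number:
        # \pN: decimal numbers (Nd), letter numbers (Nl), other numbers (No).
        return cat[0] == "N"
    # Narrower choice: only \p{Nd} and \p{Nl}.
    return cat in ("Nd", "Nl")
```

An ASCII digit or a Roman numeral character such as U+2162 is accepted either way, while a superscript two (U+00B2, category No) is accepted only when `any_number=True`.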

Then you realize that Asian names often require the use of ZWJ or ZWNJ to be written correctly in their scripts, so you have to add U+200D and U+200C to the mix, which are both \p{Cf} (Format) characters and indeed also JoinControl ones.
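A short Python sketch shows why the join controls need explicit handling: they are format characters, so a letters-and-marks check rejects them. The names `JOIN_CONTROLS` and `is_letterish_or_joiner` are hypothetical, and the sample string is just Persian-script letters with a ZWNJ in the middle, used for illustration.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
JOIN_CONTROLS = {ZWNJ, ZWJ}


def is_letterish(ch: str) -> bool:
    # Letters, marks, and letter numbers only.
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nl"


def is_letterish_or_joiner(ch: str) -> bool:
    # Same check, plus the two join-control characters.
    return is_letterish(ch) or ch in JOIN_CONTROLS


# Persian-script letters with a ZWNJ between two parts (illustrative sample).
SAMPLE = "\u0639\u0644\u06CC" + ZWNJ + "\u0631\u0636\u0627"

if __name__ == "__main__":
    print(unicodedata.category(ZWNJ), unicodedata.category(ZWJ))
    print(all(is_letterish(ch) for ch in SAMPLE))
    print(all(is_letterish_or_joiner(ch) for ch in SAMPLE))
```

Allowing only these two code points is narrower than allowing all of \p{Cf}, which also contains other invisible format characters.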

By the time you’re done looking up the various Unicode properties for the various and many exotic characters that keep cropping up (or when you think you’re done, rather), you’re likely to conclude that you would do a better job at this if you simply allowed people to use whatever Unicode characters in their name they wish, as the link Tim cites advises. Yes, you’ll get a few jokers putting in things like “əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ”, but that goes with the territory, and you can’t preclude silly names in any reasonable way.
