regex - Multi-language input validation with UTF-8 encoding
To check that a user's input is a valid English name, I match the input against a regular expression such as [a-zA-Z]. How can I do this if multi-language support (Chinese, Japanese, etc.) is required, with UTF-8 encoding?
You can approximate the Unicode derived property \p{Alphabetic} pretty succinctly with [\pL\pM\p{Nl}] if your language doesn’t support a proper Alphabetic property directly.
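As a rough illustration of that approximation: Python's standard `re` module does not understand `\p{...}` at all, so this sketch checks the same three general-category groups (L, M and Nl) through `unicodedata` instead. The helper names `is_alphabetic_approx` and `looks_alphabetic` are made up for this example.

```python
import unicodedata


def is_alphabetic_approx(ch: str) -> bool:
    # Approximates \p{Alphabetic} as [\pL\pM\p{Nl}] using general categories:
    # any Letter (L*), any Mark (M*), or a Letter Number (Nl).
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nl"


def looks_alphabetic(text: str) -> bool:
    # Hypothetical helper: true when the text is non-empty and every
    # character passes the approximation above.
    return bool(text) and all(is_alphabetic_approx(ch) for ch in text)


if __name__ == "__main__":
    for sample in ["Smith", "李小龍", "さくら", "Jose\u0301", "O'Brien"]:
        print(sample, looks_alphabetic(sample))
```

The last sample is rejected because the ASCII apostrophe is punctuation, not a letter or mark, which is exactly the kind of gap that shows up next.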
Don’t use Java’s \p{Alpha}, because that’s ASCII-only by default.
But then you’ll notice that you’ve failed to account for dashes (\p{Pd}, or DashPunctuation, works, but it does not include every hyphen-like character!), apostrophes (usually, but not always, one of U+27, U+2BC, U+2019, or U+FF07), comma, or full stop/period.
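To see why a single property does not cover these, here is a small sketch that looks up the general category of a few dash-like characters and of the four apostrophes listed above. The choice of sample dashes and the helper name `is_dash_punctuation` are assumptions made for illustration.

```python
import unicodedata

# Dash-like characters a user might type or paste into a name field.
DASH_CANDIDATES = [
    "\u002D",  # HYPHEN-MINUS        (Pd)
    "\u2010",  # HYPHEN              (Pd)
    "\u2011",  # NON-BREAKING HYPHEN (Pd)
    "\u00AD",  # SOFT HYPHEN         (Cf, so \p{Pd} misses it)
    "\u2212",  # MINUS SIGN          (Sm, so \p{Pd} misses it)
]

# The four apostrophes named in the answer; they fall into three categories.
APOSTROPHES = [
    "\u0027",  # APOSTROPHE                  (Po)
    "\u02BC",  # MODIFIER LETTER APOSTROPHE  (Lm)
    "\u2019",  # RIGHT SINGLE QUOTATION MARK (Pf)
    "\uFF07",  # FULLWIDTH APOSTROPHE        (Po)
]


def is_dash_punctuation(ch: str) -> bool:
    # Equivalent in spirit to matching \p{Pd}.
    return unicodedata.category(ch) == "Pd"


if __name__ == "__main__":
    for ch in DASH_CANDIDATES + APOSTROPHES:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

Because the apostrophes do not share a category, listing them explicitly is usually simpler than hunting for a property that matches all of them.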
You had better also include \p{Pc} (ConnectorPunctuation), just in case.
If you have the Unicode derived property \p{Diacritic}, you should use that, too, because it includes things like the mid-dot needed for geminated L’s in Catalan, as well as the non-combining forms of diacritic marks that people sometimes use.
But then you’ll find people who use ordinal numbers in their names in ways that \p{Nl} (LetterNumber) doesn’t accommodate, so you throw \p{Nd} (DecimalNumber) or even all of \pN (Number) into the mix.
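The difference between those number properties can be sketched with general categories as well. The function name `is_name_number` and its `allow_all_numbers` switch are hypothetical, chosen only to mirror the "\p{Nl} plus \p{Nd}, or all of \pN" choice described above.

```python
import unicodedata


def is_name_number(ch: str, allow_all_numbers: bool = False) -> bool:
    category = unicodedata.category(ch)
    if allow_all_numbers:
        # Mirrors \pN: Nd, Nl and No together.
        return category[0] == "N"
    # Mirrors [\p{Nl}\p{Nd}]: letter numbers and decimal digits only.
    return category in ("Nl", "Nd")


if __name__ == "__main__":
    samples = [
        "3",       # DIGIT THREE              (Nd)
        "\u0663",  # ARABIC-INDIC DIGIT THREE (Nd)
        "\u2162",  # ROMAN NUMERAL THREE      (Nl)
        "\u00B2",  # SUPERSCRIPT TWO          (No)
    ]
    for ch in samples:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              is_name_number(ch), is_name_number(ch, allow_all_numbers=True))
```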
Then you realize that Asian names often require the use of ZWJ or ZWNJ to be written correctly in their scripts, so you have to add U+200D and U+200C to the mix, which are both \p{Cf} (Format) characters and indeed also JoinControl ones.
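Putting the pieces mentioned so far together, a character-by-character check might look like the sketch below. It is only one possible reading of the list: the `EXTRA_ALLOWED` set, the decision to allow a plain space, and the use of U+00B7 as a stand-in for \p{Diacritic} (which `unicodedata` does not expose) are all assumptions made for this example.

```python
import unicodedata

# General categories allowed outright: decimal and letter numbers,
# dash punctuation, connector punctuation. Letters (L*) and marks (M*)
# are handled by prefix in the function below.
ALLOWED_CATEGORIES = {"Nd", "Nl", "Pd", "Pc"}

# Individual characters added by hand (an assumption of this sketch).
EXTRA_ALLOWED = {
    "\u0027", "\u02BC", "\u2019", "\uFF07",  # apostrophes from the list above
    ",", ".", " ",                           # comma, full stop, plain space
    "\u00B7",                                # MIDDLE DOT, for Catalan l·l
    "\u200C", "\u200D",                      # ZWNJ and ZWJ (both category Cf)
}


def is_plausible_name(name: str) -> bool:
    if not name.strip():
        return False
    for ch in name:
        if ch in EXTRA_ALLOWED:
            continue
        category = unicodedata.category(ch)
        if category[0] in ("L", "M") or category in ALLOWED_CATEGORIES:
            continue
        return False
    return True


if __name__ == "__main__":
    for sample in ["Jean-Luc Picard", "O\u2019Brien", "Gal\u00B7la",
                   "Martin Luther King, Jr.", "Smith<script>"]:
        print(sample, is_plausible_name(sample))
```

Even this list keeps growing as new names turn up, which is the point the answer goes on to make.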
By the time you’re done looking up the various Unicode properties for the various and many exotic characters that keep cropping up (or when you think you’re done, rather), you’re likely to conclude that you would do a better job at this if you simply allowed people to use whatever Unicode characters in their name they wish, as the link Tim cites advises. Yes, you’ll get a few jokers putting in things like “əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ”, but that goes with the territory, and you can’t preclude silly names in any reasonable way.
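For comparison, a permissive approach along those lines could be as small as the following sketch. The function name `accept_name` is hypothetical, and the NFC normalization and the rejection of control characters (category Cc) are assumptions of this example, not something the answer prescribes.

```python
import unicodedata
from typing import Optional


def accept_name(raw: str) -> Optional[str]:
    # Normalize to NFC so equivalent spellings compare equal, then trim
    # surrounding whitespace.
    name = unicodedata.normalize("NFC", raw).strip()
    if not name:
        return None
    # Assumption of this sketch: control characters are never part of a name.
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        return None
    return name


if __name__ == "__main__":
    print(accept_name("  Zoë  "))
    print(accept_name("əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ"))
    print(accept_name("a\x00b"))
```

The upside-down name is accepted, which fits the advice above that silly names can’t be ruled out in any reasonable way.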