Why is jQuery's email validation regex so simple?


We know that regexes to validate emails are quite complicated. However, jQuery's validation plugin has a shorter regex (contributed by Scott Gonzalez), spanning only a few lines:

/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])|(([a-z]|\d|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])([a-z]|\d|-|\.|_|~|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])*([a-z]|\d|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])))\.)+(([a-z]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])|(([a-z]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])([a-z]|\d|-|\.|_|~|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])*([a-z]|[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef])))\.?$/

Why is it so 'simple' compared to the more well-known monstrosity? Are there cases where one regex would fail and the other succeed (whether those cases are valid or invalid emails)?

The regex is a custom combination of:

  • RFC 2234 ABNF
  • RFC 2396 URI Generic Syntax (obsoleted by RFC 3986)
  • RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1
  • RFC 2822 Internet Message Format
  • RFC 3987 IRI
  • RFC 3986 URI Generic Syntax

I wrote the regex when Web Forms 2.0 was being drafted and RFC 5322 did not exist. If you look at the order in which the RFCs were written, you'll notice that the definitions of IRI and URI changed after Internet Message Format was written. This means that RFC 2822 does not support the current IRI definitions. Unfortunately, it wasn't a simple task of substituting definitions, so I had to pick and choose which definitions to use from which RFCs. I also made some choices about what to remove (like support for comments).

The regex was not hand-written. While I did manually write every section of the regex, I scripted the "glue". Each definition from the RFCs is stored in a variable, with compound definitions utilizing the variables that store the simpler definitions (@Walf: this is why there are so many subpatterns and ORs).
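As a rough illustration of that "glue" approach, here is a minimal sketch that builds a pattern from named fragments. It is not the script that generated the plugin's regex: the variable names are hypothetical, and the grammar is a heavily simplified subset loosely modeled on the RFC 2822 dot-atom form (no quoted strings, no Unicode ranges).

```javascript
// Simple definitions, one variable per grammar rule (names are hypothetical).
const ALPHA = "[a-z]";
const DIGIT = "\\d";
const SPECIALS = "[!#$%&'*+\\-/=?^_`{|}~]";

// Compound definitions reuse the simpler ones, which is where the
// nested groups and ORs in the final pattern come from.
const atext = "(" + [ALPHA, DIGIT, SPECIALS].join("|") + ")";
const atom = atext + "+";
const dotAtom = "(" + atom + "(\\." + atom + ")*)";

// Simplified domain: one or more dot-terminated labels, then an alphabetic label.
const label = "(" + ALPHA + "|" + DIGIT + ")+";
const domain = "(" + label + "\\.)+" + ALPHA + "+";

// Glue the pieces together into the final, anchored, case-insensitive regex.
const email = new RegExp("^" + dotAtom + "@" + domain + "$", "i");

console.log(email.source); // shows the generated subpatterns
console.log(email.test("john.doe@example.com"));
console.log(email.test("john..doe@example.com"));
```

Because every rule is wrapped in its own group before being reused, the generated source grows quickly, which matches the shape of the regex in the question.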

To complicate the matter, the version of the regex used in the jQuery Validation plugin was modified further to account for differences between spec-valid addresses and user expectation of a valid address. I have no recollection of which modifications were made. I promised Jörn Zaefferer (the author of the validation plugin) that I would write a newer script to generate the regex. The new script would allow you to specify which options you do and don't want to support (required TLD, specific TLDs, IPv6, comments, obsolete definitions, quoted local names, etc.). That was 5 years ago. I started once, but never finished. Maybe one day I will. What I have so far is hosted on GitHub: https://github.com/scottgonzalez/regex-builder

If you want a regex for validating email addresses, I'd suggest the following regex, which is included in the HTML5 specification:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

If you use regex-builder and turn off all the options, you'll get something similar. But it's been about a year since I looked at that, so I don't remember what the differences are.
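For reference, a minimal sketch of using the HTML5 pattern in plain JavaScript. The constant and function names are hypothetical, and the `/` inside the character class is escaped here only so the literal is unambiguous.

```javascript
// The HTML5 email pattern, stored once and reused.
const HTML5_EMAIL = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical helper: true when the value matches the HTML5 pattern.
function isValidEmail(value) {
  return HTML5_EMAIL.test(value);
}

console.log(isValidEmail("user@example.com"));  // true
console.log(isValidEmail("user@localhost"));    // true: no TLD is required
console.log(isValidEmail("user@@example.com")); // false
console.log(isValidEmail("user@-example.com")); // false: a label cannot start with a hyphen
```

Note that this pattern is deliberately looser than the RFCs in some places (the local part may contain consecutive dots) and stricter in others (no quoted strings or comments).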


I'd also like to point out that the link in the original question mentions RFC 822. While it's great that RFC 822 advanced us from ARPANET to ARPA Internet, it isn't exactly current. The Internet has made a few advances in the past three decades and this RFC has been superseded twice. I'd like to see new work following the latest standards.


Update:

A friend asked me why the HTML5 regex doesn't support UTF-8. I've never asked Hixie about it, but I assume this is the reason: even though some TLDs started to support IDNs (Internationalized Domain Names) in 2000 and RFC 3987 (IRI) was written in 2005, when RFC 5322 was written in 2008 it only listed characters in the ranges 33-90 and 94-126 as valid dtext (characters allowed for use in a domain literal). HTML5 is based on RFC 5322 and as a result there is no UTF-8 support. It seems strange that RFC 5322 doesn't account for IDNs, but it's worth noting that even in 2008 IDNs weren't actually usable. It wasn't until 2010 that ICANN approved the first set of IDNs. However, even today if you want to use an IDN, you pretty much need to destroy your domain name using Punycode if you want things like email and DNS to work globally.
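To make the Punycode point concrete, here is a small sketch that converts the domain part of an address to its ASCII form before testing it against the HTML5 pattern. It assumes Node.js built with ICU (the default for official builds) and uses the built-in `url.domainToASCII`; the helper name `toAsciiEmail` is hypothetical.

```javascript
const { domainToASCII } = require("url");

const HTML5_EMAIL = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical helper: Punycode-encode the domain, leave the local part as-is.
// Returns null when there is no "@" or the domain cannot be converted.
function toAsciiEmail(address) {
  const at = address.lastIndexOf("@");
  if (at === -1) return null;
  const asciiDomain = domainToASCII(address.slice(at + 1));
  return asciiDomain ? address.slice(0, at) + "@" + asciiDomain : null;
}

const raw = "user@español.com";
const converted = toAsciiEmail(raw);

console.log(HTML5_EMAIL.test(raw));       // false: "ñ" is outside the allowed ASCII set
console.log(converted);                   // user@xn--espaol-zwa.com
console.log(HTML5_EMAIL.test(converted)); // true
```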

Update 2:

Updated the HTML5 regex to match the updated spec, which changed the label length limits from 255 characters to 63 characters, as specified in RFC 1034 section 3.5.
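The 63-character limit comes from the shape of each label subpattern: one leading alphanumeric, up to 61 middle characters, and one trailing alphanumeric. A quick sketch that isolates that subpattern (the constant name `LABEL` is hypothetical) shows the boundary:

```javascript
// A single domain label as it appears inside the HTML5 pattern, anchored on its own.
// 1 leading character + up to 61 middle characters + 1 trailing character = 63 at most.
const LABEL = /^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$/;

const label63 = "a".repeat(63);
const label64 = "a".repeat(64);

console.log(LABEL.test(label63)); // true
console.log(LABEL.test(label64)); // false
console.log(LABEL.test("a"));     // true: the middle and trailing part is optional
console.log(LABEL.test("a-"));    // false: a label cannot end with a hyphen
```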

