encoding - How many bytes do we need to store an Arabic character?

August 15, 2011

I'm a little confused about the storage needed for representing an Arabic character.

Please let me know if the following is true:

In the ISO/IEC 8859-6 encoding, it takes 2 bytes (http://en.wikipedia.org/wiki/iso/iec_8859-6)
In Unicode, it takes 4 bytes (http://en.wikipedia.org/wiki/arabic_unicode)

What are the advantages of each encoding? When should I prefer one over the other?

Well, first, Unicode is not an encoding. It is a standard for assigning code points to every character in every language. These code points are integers; how many bytes they take depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.
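A minimal sketch of that distinction in Python (the variable names are only for illustration): the code point is a plain integer, and a byte count only exists once you pick an encoding.

```python
# A code point is just an integer; it has no byte length of its own.
ch = "\u062d"  # ARABIC LETTER HAH
code_point = ord(ch)
print(hex(code_point))  # 0x62d

# The byte count only appears once an encoding is chosen.
# "utf-16-le" is used here so that no byte order mark is added.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-le")
print(len(utf8_bytes), len(utf16_bytes))
```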

To summarise:

ISO 8859-6 uses 1 byte for each Arabic character, but doesn't support "Arabic Presentation Forms", nor characters from any script other than ASCII.
UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for "Arabic Presentation Forms".
UTF-16 uses 2 bytes for each Arabic character, including "Arabic Presentation Forms".
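The summary can be checked with a short Python sketch. The helper name `encoded_length` and the sample labels are my own, and "utf-16-le" is chosen so that no byte order mark is counted.

```python
def encoded_length(ch, encoding):
    """Return the number of bytes ch takes in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

samples = {
    "ASCII 'a'": "a",
    "Arabic U+062D": "\u062d",
    "Presentation form U+FEF0": "\ufef0",
}
encodings = ["iso-8859-6", "utf-8", "utf-16-le"]

# One row per sample character, one column per encoding.
table = {
    label: [encoded_length(ch, enc) for enc in encodings]
    for label, ch in samples.items()
}
for label, lengths in table.items():
    print(label, dict(zip(encodings, lengths)))
```

A `None` in the ISO 8859-6 column means the character simply has no representation in that encoding.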

I will use two examples: 'ح' (U+062D) and 'ﻰ' (U+FEF0). Those numbers are the hexadecimal codes representing the Unicode code point of each of these characters.

In ISO 8859-6, Arabic characters take a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see in the table on the Wikipedia article. The character 'ﻰ' (U+FEF0) is listed as an "Arabic Presentation Form", which I suppose explains why it doesn't appear in ISO 8859-6 at all (you can't encode that character in this encoding).
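Here is a small sketch of both cases using Python's built-in "iso-8859-6" codec. The variable names are illustrative only.

```python
hah = "\u062d"           # 'ح', part of the basic Arabic block
alef_maksura = "\ufef0"  # 'ﻰ', an Arabic Presentation Form

# The basic letter maps to exactly one byte and round-trips cleanly.
encoded = hah.encode("iso-8859-6")
print(encoded.hex())     # cd
round_trip = encoded.decode("iso-8859-6")

# The presentation form has no slot in ISO 8859-6, so encoding fails.
try:
    alef_maksura.encode("iso-8859-6")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)         # False
```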

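There are two common Unicode encodings that let you encode all characters: UTF-8 and UTF-16. They have different uses. UTF-8 uses 1 byte for ASCII characters, 2 or 3 bytes for the other characters of the Basic Multilingual Plane (including the Arabic discussed here), and 4 bytes for all other characters. UTF-16 uses 2 bytes for Basic Multilingual Plane characters, and 4 bytes for all other characters. Basically, if you are using lots of ASCII, UTF-8 is better. For international text, UTF-16 may be better, since characters that need 3 bytes in UTF-8 take only 2 in UTF-16.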
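To make the trade-off concrete, here is a rough size comparison in Python. The sample strings are my own picks, and U+1F600 is just an arbitrary character outside the Basic Multilingual Plane.

```python
def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "ascii": "hello world",                      # 11 ASCII characters
    "arabic": "\u0645\u0631\u062d\u0628\u0627",  # five letters from the basic Arabic block
    "presentation": "\ufef0" * 5,                # five Arabic Presentation Forms
    "supplementary": "\U0001F600",               # one character above U+FFFF
}
for name, text in samples.items():
    utf8, utf16 = sizes(text)
    print(f"{name}: {len(text)} chars, utf-8={utf8} bytes, utf-16={utf16} bytes")
```

ASCII text doubles in size under UTF-16, basic Arabic is the same size in both, and only the presentation forms come out smaller in UTF-16.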

In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". Basically, characters between U+0080 and U+07FF use 2 bytes, and characters between U+0800 and U+FFFF use 3 bytes. The basic Arabic and Arabic Supplement characters use 2 bytes, whereas Arabic Presentation Forms use 3 bytes.
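The range rule can be written down directly. This is a sketch: `utf8_length` is a hypothetical helper of mine, and it ignores the surrogate range U+D800 to U+DFFF, which cannot be encoded at all.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a code point, based only on its range."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# ASCII, basic Arabic, Arabic Supplement, presentation form, beyond U+FFFF.
code_points = [0x41, 0x062D, 0x0750, 0xFEF0, 0x1F600]
for cp in code_points:
    actual = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: predicted {utf8_length(cp)}, actual {len(actual)} ({actual.hex()})")
```

The predicted lengths agree with what Python's own encoder produces for each of these code points.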

In UTF-16, 'ح' (U+062D) is encoded as the 2-byte sequence "2D 06", while 'ﻰ' (U+FEF0) is encoded as the 2-byte sequence "F0 FE". In UTF-16, all of these Arabic characters are 2 bytes. This is further complicated by endianness. Note that the bytes shown here are the code point's two halves swapped around, which is little-endian order. An equally valid encoding (big-endian) would be "06 2D" for the first one, and "FE F0" for the second.
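Both byte orders can be seen with Python's explicit "utf-16-le" and "utf-16-be" codecs. One detail goes beyond the explanation above: the plain "utf-16" codec prepends a 2-byte byte order mark so that a decoder can tell which order was used.

```python
import codecs

ch = "\u062d"  # 'ح'

le = ch.encode("utf-16-le")  # low byte first
be = ch.encode("utf-16-be")  # high byte first
print(le.hex(), be.hex())    # 2d06 062d

# The plain "utf-16" codec writes a byte order mark, then the character,
# in the platform's native byte order.
with_bom = ch.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(len(with_bom), has_bom)

# Decoding with "utf-16" reads the mark and picks the right order.
decoded = with_bom.decode("utf-16")
```

When you read UTF-16 data without a mark, you have to know the byte order from somewhere else.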

In summary, I recommend UTF-8, as it is unambiguous and supports ASCII text well. Arabic characters are 2 bytes in either encoding (unless you use the "presentation forms", which take 3 bytes in UTF-8). You can use ISO 8859-6 if you are only using ASCII and Arabic characters, and nothing else, and it will save space, but it usually isn't worth it, as it will break when other characters come along. UTF-8 and UTF-16 both support all characters in Unicode.
