encoding - How many bytes do we need to store an Arabic character?
I'm a little confused about the storage needed to represent an Arabic character.
Please let me know if this is true:
- In the ISO/IEC 8859-6 encoding it takes 2 bytes (http://en.wikipedia.org/wiki/iso/iec_8859-6)
- In Unicode it takes 4 bytes (http://en.wikipedia.org/wiki/arabic_unicode)
What are the advantages of each encoding? When should I prefer one over the other?
Well, first, Unicode is not an encoding. It is a standard that assigns a code point to every character in every language. These code points are integers; how many bytes they take depends on the specific encoding. The most common Unicode encodings are UTF-8 and UTF-16.
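To make the code point versus encoding distinction concrete, here is a minimal Python sketch. The byte strings are simply the UTF-8 and big-endian UTF-16 forms of one Arabic letter; the variable names are made up for illustration.

```python
# Two different byte sequences, produced by two different encodings...
utf8_bytes = b"\xd8\xad"
utf16_be_bytes = b"\x06\x2d"

# ...decode to the same single character.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_be_bytes.decode("utf-16-be")

# The code point is just an integer attached to that character,
# independent of how it was stored.
code_point = ord(from_utf8)
print(f"U+{code_point:04X}")
print(from_utf8 == from_utf16)
```

The integer says nothing about storage size; the byte count only appears once an encoding is chosen.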
To summarise:
- ISO 8859-6 uses 1 byte for each Arabic character, but doesn't support the "Arabic presentation forms", nor characters from any other script except ASCII.
- UTF-8 uses 2 bytes for each Arabic character, and 3 bytes for the "Arabic presentation forms".
- UTF-16 uses 2 bytes for each Arabic character, including the "Arabic presentation forms".
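The summary can be checked with a short Python sketch. It assumes Python's built-in "iso-8859-6", "utf-8" and "utf-16-le" codecs; the helper `byte_length` and the sample labels are hypothetical names used only here.

```python
SAMPLES = {
    "basic letter": "\u062d",        # Arabic block
    "presentation form": "\ufef0",   # Arabic Presentation Forms-B
}
ENCODINGS = ["iso-8859-6", "utf-8", "utf-16-le"]

def byte_length(ch, encoding):
    """Return the encoded size in bytes, or None if the encoding has no slot for ch."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

table = {
    label: {enc: byte_length(ch, enc) for enc in ENCODINGS}
    for label, ch in SAMPLES.items()
}

for label, row in table.items():
    print(label, row)
```

A `None` entry means the character cannot be represented in that encoding at all.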
I'll use two examples: 'ح' (U+062D) and 'ﻰ' (U+FEF0). Those numbers are the hexadecimal codes representing the Unicode code point of each of these characters.
In ISO 8859-6, Arabic characters take a single byte, since that encoding is dedicated to Arabic. For example, the character 'ح' (U+062D) is encoded as the single byte "CD", as you can see in the table in the Wikipedia article. The character 'ﻰ' (U+FEF0) is listed as an "Arabic presentation form", which I suppose explains why it doesn't appear in ISO 8859-6 at all (you can't encode that character in this encoding).
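As a rough illustration of the single-byte behaviour, the following Python sketch encodes a mixed ASCII and Arabic string with the built-in "iso-8859-6" codec, then shows what happens with a presentation form. The variable names are only for this example.

```python
text = "a\u062d"  # one ASCII letter followed by one Arabic letter

encoded = text.encode("iso-8859-6")
print(encoded.hex(" "))  # one byte per character

# Decoding the same bytes gives the original text back.
round_trip = encoded.decode("iso-8859-6")

# A presentation form has no slot in this code page, so encoding fails.
try:
    "\ufef0".encode("iso-8859-6")
    presentation_form_ok = True
except UnicodeEncodeError as exc:
    presentation_form_ok = False
    print("cannot encode:", exc.reason)
```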
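There are two common Unicode encodings that let you encode all those characters: UTF-8 and UTF-16. They have different uses. UTF-8 uses 1 byte for ASCII characters, 2 or 3 bytes for the other basic characters (the rest of the Basic Multilingual Plane, up to U+FFFF, including those of Arabic), and 4 bytes for all other characters. UTF-16 uses 2 bytes for the basic characters, and 4 bytes for all other characters. Basically, if you're using lots of ASCII, UTF-8 is better. For international text, UTF-16 can be better, since characters that need 3 bytes in UTF-8 take only 2.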
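The trade-off can be seen by measuring a few strings. This is only a sketch: the sample texts are illustrative choices of mine (they are not from the question), and `encoded_sizes` is a hypothetical helper.

```python
# Illustrative samples; any text from the same ranges behaves the same way.
samples = {
    "ascii": "hello world",
    "arabic": "\u0645\u0631\u062d\u0628\u0627",  # five Arabic letters
    "cjk": "\u4f60\u597d",                       # two characters above U+07FF
    "emoji": "\U0001F600",                       # one character above U+FFFF
}

def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, (u8, u16) in sizes.items():
    print(f"{name:7} utf-8={u8:2} bytes  utf-16={u16:2} bytes")
```

ASCII doubles in size under UTF-16, Arabic comes out the same either way, and only the 3-byte UTF-8 range is smaller in UTF-16.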
In UTF-8, 'ح' (U+062D) is encoded as the 2-byte sequence "D8 AD", while 'ﻰ' (U+FEF0) is encoded as the 3-byte sequence "EF BB B0". Basically, characters between U+0080 and U+07FF use 2 bytes, and characters between U+0800 and U+FFFF use 3 bytes. The basic Arabic and Arabic Supplement characters use 2 bytes, whereas the Arabic presentation forms use 3 bytes.
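To show where the 2-byte and 3-byte boundaries come from, here is a hand-rolled sketch of the UTF-8 bit layout for code points up to U+FFFF. `utf8_encode_bmp` is a hypothetical helper written for this answer; in real code you would just call `str.encode("utf-8")`, and this sketch skips the surrogate check a full encoder needs.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one code point in U+0000..U+FFFF as UTF-8 (illustrative sketch only)."""
    if code_point < 0x80:
        # 7 bits fit in one byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits split as 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 16 bits split as 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("code points above U+FFFF are outside this sketch")

for cp in (0x062D, 0xFEF0):
    print(f"U+{cp:04X} ->", utf8_encode_bmp(cp).hex(" "))
```

The 2-byte form carries only 11 payload bits, which is why U+07FF is the last code point that fits and U+0800 is the first to need 3 bytes.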
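In UTF-16, 'ح' (U+062D) is encoded as the 2-byte sequence "2D 06", while 'ﻰ' (U+FEF0) is encoded as the 2-byte sequence "F0 FE". So in UTF-16, all of these Arabic characters are 2 bytes. This is further complicated by endianness, though. Note that the bytes shown are just the code points with their two halves swapped around (little-endian). An equally valid encoding would be "06 2D" for the first one, and "FE F0" for the second (big-endian).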
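The two byte orders can be compared directly in Python. This sketch uses the built-in "utf-16-le" and "utf-16-be" codecs; it also shows that the plain "utf-16" codec prepends a byte order mark (BOM) so that a reader can tell which order was used.

```python
import codecs

ch = "\u062d"

little = ch.encode("utf-16-le")  # low byte first
big = ch.encode("utf-16-be")     # high byte first
print(little.hex(" "), "|", big.hex(" "))

# The generic codec writes a 2-byte BOM first, then the character
# in the platform's native byte order.
with_bom = ch.encode("utf-16")
bom = with_bom[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Thanks to the BOM, decoding works without naming the byte order.
decoded = with_bom.decode("utf-16")
```

Without a BOM or some outside agreement, a reader cannot tell "2D 06" in one byte order from the same bytes in the other, which is the ambiguity UTF-8 avoids.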
In summary, I recommend UTF-8, as it is unambiguous and supports ASCII text well. Arabic characters are 2 bytes in either encoding (unless you use the "presentation forms"). You can use ISO 8859-6 if you are only using ASCII and Arabic characters, and nothing else, and it will save space, but it usually isn't worth it, as it will break as soon as other characters come along. Both UTF-8 and UTF-16 support all the characters in Unicode.