encoding - Perl: utf8::decode vs. Encode::decode -

March 15, 2012

I am getting some interesting results while trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in the error "Cannot decode string with wide characters at...", whereas the latter will happily run as many times as you want, simply returning false.

What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" UTF-8 text from an outside file. To demonstrate the issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00E8, U+00AB, U+0086, U+000A. These Unicode characters are the double-encoding of the Unicode character U+8ACB, along with a newline character. The file was encoded to disk in UTF-8. I then run the following Perl script:
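The difference in failure behavior described above can be reproduced without a file. The following is only a minimal sketch: the byte string is an assumption (U+8ACB plus a newline, UTF-8 encoded twice), and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Assumed input: U+8ACB plus a newline, UTF-8 encoded twice.
my $bytes = "\xC3\xA8\xC2\xAB\xC2\x8B\x0A";

# utf8::decode works in place and reports success as a boolean.
my $in_place = $bytes;
my @returned = map { utf8::decode($in_place) ? 1 : 0 } 1 .. 3;

# Encode::decode returns a new string and dies once its input
# contains a code point above 0xFF.
my $decoded = $bytes;
$decoded = Encode::decode("utf8", $decoded) for 1 .. 2;
my $third_ok = eval { Encode::decode("utf8", $decoded); 1 } ? 1 : 0;
my $error    = $@;

print "utf8::decode returned: @returned\n";
print "third Encode::decode succeeded: $third_ok\n";
print "error: $error";
```

The third call is the interesting one in both cases: by then the string holds a code point that cannot be an octet, so utf8::decode returns false and leaves the string alone, while Encode::decode raises an exception.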

    #!/usr/bin/perl
    use strict;
    use warnings;
    require "Encode.pm";
    require "utf8.pm";

    # Read the first line of the file as raw bytes (no I/O layer).
    open FILE, "test.txt" or die $!;
    my @lines = <FILE>;
    my $test = $lines[0];

    print "length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    my @unicode = (unpack('U*', $test));
    print "unicode:\n@unicode\n";
    my @hex = (unpack('H*', $test));
    print "hex:\n@hex\n";

    print "==============\n";

    # First decoding
    $test = Encode::decode("utf8", $test);
    print "length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "hex:\n@hex\n";

    print "==============\n";

    # Second decoding
    $test = Encode::decode("utf8", $test);
    print "length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "hex:\n@hex\n";

This gives the following output:

    length: 7
    utf8 flag: 
    unicode:
    195 168 194 171 194 139 10
    hex:
    c3a8c2abc28b0a
    ==============
    length: 4
    utf8 flag: 1
    unicode:
    232 171 139 10
    hex:
    c3a8c2abc28b0a
    ==============
    length: 2
    utf8 flag: 1
    unicode:
    35531 10
    hex:
    e8ab8b0a

This is what I would expect. The length is originally 7 because Perl thinks that $test is just a series of bytes. After decoding once, Perl knows that $test is a series of characters that are UTF-8 encoded (i.e. instead of returning a length of 7 bytes, Perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points and interpreted them as UTF-8 encoded bytes, resulting in 2 characters. The strange thing happens when I modify the code to call utf8::decode instead (replacing each $test = Encode::decode("utf8", $test); with utf8::decode($test)).

This gives almost identical output; only the result of length differs:

    length: 7
    utf8 flag: 
    unicode:
    195 168 194 171 194 139 10
    hex:
    c3a8c2abc28b0a
    ==============
    length: 4
    utf8 flag: 1
    unicode:
    232 171 139 10
    hex:
    c3a8c2abc28b0a
    ==============
    length: 4
    utf8 flag: 1
    unicode:
    35531 10
    hex:
    e8ab8b0a

It seems that Perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?

Thanks,
Matt

You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error-checking.
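To illustrate the kind of error-checking Encode offers, here is a small sketch. The helper name try_decode and both sample inputs are made up for this example; Encode::FB_CROAK is the check value that makes decode die on malformed input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (string, undef) on success, or
# (undef, error message) when the octets are not valid UTF-8.
sub try_decode {
    my ($octets) = @_;    # work on a copy; decode may consume its input
    my $chars = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
    return defined $chars ? ($chars, undef) : (undef, $@);
}

# Well-formed UTF-8: U+8AC6 followed by a newline.
my ($good, $good_err) = try_decode("\xE8\xAB\x86\x0A");

# 0xFF can never appear in UTF-8, so this input is rejected.
my ($bad, $bad_err) = try_decode("abc\xFFdef");

print "good input decoded to ", length($good), " characters\n";
print "bad input rejected: $bad_err" if !defined $bad;
```

Passing the octets in a variable rather than as a literal is deliberate: with a check value such as FB_CROAK, decode may modify its input in place.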

You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. They are the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
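The distinction between single and double encoding can be checked directly. This is only an illustrative sketch with made-up variable names; it starts from U+8AC6 plus a newline and encodes once, then once more.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{8AC6}\n";    # U+8AC6 plus a newline, as characters

# One round of UTF-8 encoding.
my $single = encode('UTF-8', $chars);

# Encoding the result again treats every octet as a code point of its
# own, which is what "double encoding" means here.
my $double = encode('UTF-8', $single);

printf "single: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $single;
printf "double: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $double;
```

The first line of output is E8 AB 86 0A, the octets discussed above; the second is the longer sequence C3 A8 C2 AB C2 86 0A, in which each octet above 0x7F has grown into two.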

length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.
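A short sketch of the two meanings of length, using the same sample data; the variable names are made up for illustration. Applied to a decoded character string, length counts code points; applied to the encoded byte string, it counts octets.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "\x{8AC6}\n";               # character string
my $octets = encode('UTF-8', $chars);    # the same text as UTF-8 octets

printf "characters: %d (utf8 flag %s)\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "octets:     %d (utf8 flag %s)\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
```

Keeping decoded text and encoded octets in separately named variables, as here, makes it easier to tell which of the two counts a given length call reports.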

    #!/usr/bin/env perl
    use strict;
    use warnings FATAL => 'all';
    use Devel::Peek qw(Dump);
    use Encode qw(decode);

    my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
    # or read the octets without implicit decoding from a file, it does not matter

    Dump $test;
    #  FLAGS = (PADMY,POK,pPOK)
    #  PV = 0x8d8520 "\350\253\206\n"\0

    $test = decode('UTF-8', $test, Encode::FB_CROAK);
    Dump $test;
    #  FLAGS = (PADMY,POK,pPOK,UTF8)
    #  PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
