So I was looking at Drupal's import page and noticed that non-ASCII characters looked quite botched. Viewing the page source revealed that the input had apparently been UTF-8 encoded twice (that is, encoded as UTF-8, then assumed to be ISO-8859-1 and encoded as UTF-8 again).
The source was an XML feed in UTF-8 which looked perfectly fine. I went over import.module and couldn't find any explicit UTF-8 encoding step. Some testing revealed that PHP's XML parser was the culprit:
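<?php
// UTF-8 input: \xC3\xA3 is the two-byte UTF-8 sequence for "ã".
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";
// Print each chunk of character data the parser hands over.
function handler_data($parser, $data) {
  print "Data: $data\r\n";
}
// Note: no input encoding is passed to xml_parser_create() here.
$xml_parser = xml_parser_create();
xml_set_character_data_handler($xml_parser, "handler_data");
// Ask for UTF-8 output from the parser.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "utf-8");
// The third argument marks this as the final (and only) chunk of input.
xml_parse($xml_parser, $xmlfile, true);
xml_parser_free($xml_parser);
?>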
The input XML contains the word vãlue and is encoded as UTF-8. The output encoding is specified as UTF-8 as well, so you would expect PHP to print the value unchanged (i.e. with 2 bytes for the ã character). Not so... PHP incorrectly treats the input as ISO-8859-1 and encodes it as UTF-8 again, resulting in 4 bytes for the ã character.
This is strange, because PHP claims to support UTF-8 as a source encoding.
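To make the byte counts concrete, here is a minimal sketch that reproduces the double encoding by hand, outside the XML parser. It assumes the mbstring extension is loaded; the variable names are only illustrative and not taken from import.module.

```php
<?php
// The two UTF-8 bytes for "ã" (U+00E3).
$utf8 = "\xC3\xA3";

// Simulate the bug: treat those two bytes as ISO-8859-1 characters
// and encode each of them as UTF-8 again.
// Assumption: the mbstring extension is available.
$double = mb_convert_encoding($utf8, "UTF-8", "ISO-8859-1");

print bin2hex($utf8) . " (" . strlen($utf8) . " bytes)\n";     // c3a3 (2 bytes)
print bin2hex($double) . " (" . strlen($double) . " bytes)\n"; // c383c2a3 (4 bytes)

// Converting back once strips the extra layer of encoding.
$repaired = mb_convert_encoding($double, "ISO-8859-1", "UTF-8");
print bin2hex($repaired) . "\n"; // c3a3
```

If botched output matches the 4-byte pattern above, that is a strong hint the text was encoded twice rather than corrupted in some other way.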

-
What happens when you leave out the set_option call and specify the encoding in the xml_parser_create call instead? Like:
$xml_parser = xml_parser_create("UTF-8");
Also (far-fetched, I know), what about the caps? utf/UTF? It shouldn't matter, but you never know :p
Input vs Output
Actually, that set_option is there to specify the output encoding, not the input... but you've brought the solution to my attention: the parameter for xml_parser_create is indeed the input encoding that PHP4 uses. PHP4 does not extract the input encoding automatically, but requires it to be specified explicitly.
PHP5 does do this automatically (and ignores any encoding given to xml_parser_create).
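Putting the fix from this thread together, a sketch of the corrected test script could look like the following. It passes the input encoding to xml_parser_create and keeps the set_option call for the output encoding. The handler name collect_data and the $collected buffer are illustrative assumptions; the handler appends rather than prints because the parser may deliver character data in more than one chunk.

```php
<?php
$xmlfile = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><tag>UTF8 v\xC3\xA3lue</tag>";

// Buffer for all character data seen by the parser (hypothetical helper).
$collected = '';
function collect_data($parser, $data) {
  global $collected;
  $collected .= $data;
}

// State the input encoding explicitly, as PHP4 requires per the comment above.
$xml_parser = xml_parser_create("UTF-8");
xml_set_character_data_handler($xml_parser, "collect_data");
// Output encoding, as in the original script.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parse($xml_parser, $xmlfile, true);
// On PHP4/PHP5 you would also call xml_parser_free($xml_parser) here.

// With the encoding handled correctly, "ã" stays at 2 bytes,
// so the whole string is 11 bytes long.
print strlen($collected) . " bytes: " . bin2hex($collected) . "\n";
```

Passing the encoding explicitly should be harmless on later PHP versions too, since according to the comment above they detect the input encoding on their own.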