I didn’t know much about PHP’s mb_detect_encoding() function since I seldom use it, though I knew it existed.

Today I ran into some encoding problems and found some interesting facts about the mbstring module.

OK, the mission was to check the encoding of some files and re-encode any that were not already UTF-8.

I hit a problem at the very beginning: when I tested the string ‘abc中文’, encoded in GB2312/GBK (aka CP936), mb_detect_encoding() returned ‘UTF-8’.

Obviously that is wrong, at least in this case. But what was the problem? I didn’t think it was a bug, so I went looking for answers on php.net.

I noticed a user note on the mb_detect_order() page that says:

Note that as of mbstring.c version 1.142.2.31, first released as PHP 4.3.4RC3, “auto” has changed meaning. It used to be configured based on #defines, so it was set at compile time, so for precompiled binary users (esp. Windows users) it has always been the same (Japanese mode). However, it is now based on the language that mbstring is configured for at runtime. (setlocale() doesn’t affect this though) Running on English Windows at least, mbstring defaults to a “neutral” mode which results in an “auto” list of “ASCII, UTF-8”. So, the point is, for PHP 4.3.4 or newer, you probably want to either use mb_language(“Japanese”) followed by mb_detect_order(“auto”), or just hardcode your detect order with mb_detect_order(“ASCII, JIS, UTF-8, EUC-JP, SJIS”). (Also note that mb_language() alone won’t do it, you’ll have to set the detect order to “auto” _after_ calling mb_language().)

The problem is now clear: I didn’t pass an encoding list to mb_detect_encoding(), and I hadn’t set one with mb_detect_order() either, so the function ran all its tests in the default neutral mode, which only considers ASCII and UTF-8. The test string ‘abc中文’ contains bytes outside the ASCII range, so ASCII was ruled out and the function fell back to the only other candidate, ‘UTF-8’.
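To make the effect of the candidate list visible, here is a small sketch. It assumes the mbstring extension is loaded and uses the third (strict) argument of mb_detect_encoding(), which my original test did not use; the variable names are only for illustration. The test string is written as raw GBK bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// 'abc中文' encoded as GBK/CP936, given as raw bytes.
$gbk = "abc\xD6\xD0\xCE\xC4";

// Only ASCII and UTF-8 as candidates, like the "neutral" default list.
// In strict mode the function reports failure instead of guessing.
$neutral = mb_detect_encoding($gbk, array('ASCII', 'UTF-8'), true);

// With CP936 added, there is a candidate the bytes are actually valid in.
$withGbk = mb_detect_encoding($gbk, array('ASCII', 'UTF-8', 'CP936'), true);

var_dump($neutral);
var_dump($withGbk);
```

With strict mode on, the first call should fail rather than answer ‘UTF-8’, because the bytes are not valid UTF-8. That makes the missing candidate easy to spot.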

So I wrote the solution below:

function detectEncoding($str){

	$encodings = array(
		//ascii
		'ASCII',
		
		//unicodes
		'UTF-8',
		'UTF-16',
		
		//chinese
		'EUC-CN',  //gb2312
		'CP936',   //gbk
		'EUC-TW',  //traditional Chinese (EUC-TW is not Big5; mbstring names Big5 'BIG-5')
		
		//japanese
		'EUC-JP',
		'SJIS',
		'eucJP-win',
		'SJIS-win',
		'JIS',
		'ISO-2022-JP'
	);
	$charset = mb_detect_encoding($str, $encodings);
	return $charset;
}
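Since the goal was to re-encode non-UTF-8 input, the detection step can be followed by mb_convert_encoding(). The helper below is a minimal sketch, not the exact code I ran: the name convertToUtf8() and the short candidate list are made up for the example, and strict mode is an assumption on my part.

```php
<?php
// Hypothetical helper: detect from a candidate list, then convert to UTF-8.
function convertToUtf8($str, array $candidates)
{
    $charset = mb_detect_encoding($str, $candidates, true);
    if ($charset === false) {
        // Nothing matched; let the caller decide what to do.
        return false;
    }
    if ($charset === 'UTF-8' || $charset === 'ASCII') {
        // ASCII is a subset of UTF-8, so no conversion is needed.
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', $charset);
}

// 'abc中文' as raw GBK bytes.
$gbk  = "abc\xD6\xD0\xCE\xC4";
$utf8 = convertToUtf8($gbk, array('ASCII', 'UTF-8', 'CP936'));
```

Keep the candidate list as short as the data allows. Many byte sequences are valid in several East Asian encodings at once, so a long list makes a wrong guess more likely.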

And this works for my test string.

Note the different names used for the Chinese charsets:

PHP outputs ‘EUC-CN’ rather than ‘GB2312’ and ‘CP936’ rather than ‘GBK’.
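If you want to see which other names mbstring accepts for an encoding, mb_encoding_aliases() (available since PHP 5.3) lists them. A small sketch follows; I have not reproduced the alias lists here because they may differ between PHP versions.

```php
<?php
// List the alternative names mbstring knows for each canonical name.
$aliases = array();
foreach (array('EUC-CN', 'CP936') as $name) {
    $aliases[$name] = mb_encoding_aliases($name);
}
print_r($aliases);

// An alias can be used wherever an encoding name is expected,
// for example 'GBK' in place of 'CP936'.
$gbk  = "abc\xD6\xD0\xCE\xC4"; // 'abc中文' as raw GBK bytes
$utf8 = mb_convert_encoding($gbk, 'UTF-8', 'GBK');
```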
