How to detect the character encoding of a Web page programmatically?


I want to fetch the source of a web page with Qt or PyQt. I know how to download the raw (encoded) bytes, but I then need the right codec to convert them into plain text. So the problem is: how can I detect the character encoding of a web page programmatically? Can anyone help?

This page is encoded in UTF-8: http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==

And this one is encoded in GB2312:

http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html

Your answer should be tested on these two pages.


You can use the static function QTextCodec::codecForHtml. From the Qt documentation:

Tries to detect the encoding of the provided snippet of HTML in the given byte array, ba, by checking the BOM (Byte Order Mark) and the content-type meta header and returns a QTextCodec instance that is capable of decoding the html to unicode. If the codec cannot be detected from the content provided, defaultCodec is returned.
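To make the idea concrete, here is a rough pure-Python sketch of the same two checks the documentation describes (BOM first, then the content-type meta tag). It is not Qt's implementation; the function name `sniff_html_encoding`, the 4096-byte search window, and the regular expression are assumptions made for illustration only.

```python
import codecs
import re

# BOMs are checked longest-first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def sniff_html_encoding(data, default="utf-8"):
    """Guess the encoding of raw HTML bytes (hypothetical helper, not a Qt API)."""
    # 1. A byte order mark at the very start wins.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise look for a charset declared in a meta tag near the top.
    match = _META_CHARSET.search(data[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing found: fall back to the caller's default, like defaultCodec.
    return default
```

For the GB2312 page in the question, a sniffer like this would only succeed if the page declares its charset in a meta tag; for a response that is not HTML at all, it simply returns the default.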

This will not work for pages that have no encoding meta tag. For example, the first link you posted has no such tag (the response is not HTML, so there cannot be any tags); its encoding is specified in the HTTP header named 'Content-Type'. You need to check that header's value. If you use Qt to download the page, it can be retrieved using QHttpHeader::contentType (or, with QNetworkAccessManager, QNetworkReply::rawHeader("Content-Type")).
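As a small illustration of the header check, the sketch below pulls the charset parameter out of a Content-Type value using only Python's standard library. The helper names `charset_from_content_type` and `decode_body` are made up for this example, and the fallback to UTF-8 is an assumption rather than anything Qt does.

```python
import codecs
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() parses parameters such as '; charset=gb2312'.
    return msg.get_content_charset()


def decode_body(body, content_type, default="utf-8"):
    """Decode raw response bytes using the header's charset (hypothetical helper)."""
    name = charset_from_content_type(content_type) or default
    try:
        codecs.lookup(name)  # reject charset names Python does not know
    except LookupError:
        name = default
    return body.decode(name, errors="replace")
```

With PyQt you would pass in the header string read from the reply and the downloaded bytes; when the header carries no charset, sniffing the body for a BOM or meta tag is the natural next step.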