Two apparently equal Python Unicode UTF-8 encoded strings do not match

>>> str1 = unicode('María','utf8')
>>> str2 = u'María'.encode('utf8')
>>> str1 == str2
False

How is that possible?

Just in case it is relevant, I'm using the iPython Notebook.


You have a unicode string and a byte string. They are not the same thing.

str1 holds a Unicode value, María. str2 holds a UTF-8 encoded series of bytes, 'Mar\xc3\xada'.
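As a rough illustration of the difference, the sketch below builds the same two values and compares their lengths. The names text and data are made up for this example, and the escape u'\xed' stands in for the literal í so the snippet does not depend on the source file's encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative names only: `text` and `data` are not from the question.
text = u'Mar\xeda'            # Unicode string: 5 code points (\xed is the letter í)
data = text.encode('utf-8')   # byte string: í becomes the two bytes \xc3\xad

print(len(text))  # 5
print(len(data))  # 6

# The encoded form is exactly the byte sequence quoted above.
print(data == b'Mar\xc3\xada')  # True
```

The one extra byte is the second half of the UTF-8 encoding of í, which is why the two objects cannot simply be treated as the same sequence.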

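Python 2 does attempt an implicit conversion when comparing Unicode and byte string values: it decodes the byte string with the default codec, which is ASCII unless it has been changed on your system. Here the bytes \xc3\xad are not valid ASCII, so the decode fails, Python issues a UnicodeWarning, and the comparison evaluates to False. You should not count on that conversion; decode the bytes (or encode the Unicode value) explicitly so that both sides are the same type before comparing.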
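One way to make the comparison explicit is to decode any byte string before comparing. This is only a minimal sketch: to_text is a hypothetical helper written for this answer, and it assumes the bytes really are UTF-8.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass Unicode text through."""
    if isinstance(value, bytes):      # `bytes` is an alias for `str` in Python 2
        return value.decode(encoding)
    return value

str1 = u'Mar\xeda'                    # Unicode string (\xed is the letter í)
str2 = u'Mar\xeda'.encode('utf-8')    # UTF-8 byte string

print(str1 == str2)                    # False: mixed types
print(to_text(str1) == to_text(str2))  # True: both sides are Unicode
```

Decoding at the boundary (when bytes arrive from a file, socket, or terminal) and working with Unicode values everywhere else avoids relying on the default codec at all.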

If you don't yet know what Unicode really is, or why UTF-8 is not the same thing, or want to know anything else about encodings, see: