Python Unicode Bug

I'm writing a virtual machine in RPython using PyPy. When I tried to add Unicode support, I ran into an unusual problem. I'll use the letter "á" in my examples.

# The char in the example is á
print len(char)

OUTPUT:
2

I understand that the letter "á" takes two bytes, hence the length of 2. The problem shows up when I run the example below.

# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))

OUTPUT:
0x22
0xc3
0xa1
0x22

As you can see, there are four numbers. The two 0x22 values are the quotes, but there is only one letter between the quotes and yet two numbers for it. On some machines I tested, the same script produced this output instead:

OUTPUT:
0x22
0xe1
0x22

Is there any way to make the output the same on both machines? The script is exactly the same on each.


The issue is that you are using bytestrings to work with text data. You should use Unicode instead.

That means you need to know the character encoding of your input data: There Ain't No Such Thing As Plain Text.

If you know the character encoding, it is easy to convert a bytestring to Unicode, e.g.:

unicode_text = bytestring.decode(encoding)

It should resolve your initial issue.
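Here is a minimal sketch of that fix applied to the string from the question. It assumes the input is UTF-8, which the question does not actually state, and the names raw, text and code_points are only illustrative. It is written to run on ordinary Python 2.6+ and Python 3; I have not checked it under RPython.

```python
# Assumption: the input bytes are UTF-8. `raw` stands in for the instr value
# from the question: a quote, the two UTF-8 bytes of U+00E1, a closing quote.
raw = b'"\xc3\xa1"'

# Decode once, at the input boundary, and work with Unicode text afterwards.
text = raw.decode('utf-8')

# Iterating over Unicode text yields code points rather than bytes,
# so the accented letter is a single item.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x22', '0xe1', '0x22']
```

Once the data is decoded, the result no longer depends on how many bytes the letter occupied in the input.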

There are also Unicode normalization forms, which matter when the same character can be represented by more than one sequence of code points, e.g.:

import unicodedata

norm_text = unicodedata.normalize('NFC', unicode_text)
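To illustrate why normalization can matter here, the sketch below builds the same letter from two code points and composes it with NFC. The variable names are made up for the example.

```python
import unicodedata

# The same visible letter can be stored as two code points:
# U+0061 (a) followed by U+0301 (combining acute accent).
decomposed = u'a\u0301'

# NFC composes the pair into the single code point U+00E1.
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))        # 2
print(len(composed))          # 1
print(composed == u'\xe1')    # True
```

Without normalization, two strings that look identical on screen can compare unequal and report different lengths.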


If I don't change the encoding in the program, how can I output Unicode characters, for example?

You might mean that you have a sequence of bytes, e.g., '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding; in utf-8, for instance, it is the Unicode code point U+00E1. The same bytes may mean something different in a different character encoding. Please read the article I linked above: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Unless your terminal happens to use the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted, e.g., instead of á you might get ├б on the screen.
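The corruption is easy to reproduce. In the sketch below, cp866 is picked only because it happens to produce the two garbled characters shown above; which encoding a given terminal really uses is an assumption you would have to check.

```python
# UTF-8 bytes for the single code point U+00E1.
raw = b'\xc3\xa1'

# Interpreted with the encoding the data was written in: one character.
correct = raw.decode('utf-8')

# Interpreted with a different single-byte encoding (cp866 is an example
# choice): each byte becomes its own, unrelated character.
garbled = raw.decode('cp866')

print([hex(ord(ch)) for ch in correct])  # ['0xe1']
print([hex(ord(ch)) for ch in garbled])  # ['0x251c', '0x431']
```

U+251C is the box-drawing character and U+0431 is the Cyrillic letter, so the bytes themselves are intact; only the interpretation is wrong.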

In ordinary Python, you could use the bytes.decode and unicode.encode methods (or the codecs module directly). I don't know whether that is possible in RPython.
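For ordinary Python (not verified under RPython), converting between encodings might look like the sketch below. The choice of utf-8 as the source and latin-1 as the target is an assumption made for the example, and the variable names are illustrative.

```python
import codecs

# Assumed input: UTF-8 bytes for U+00E1.
utf8_bytes = b'\xc3\xa1'

# Step 1: bytes -> Unicode text, using the encoding of the input.
text = utf8_bytes.decode('utf-8')

# Step 2: Unicode text -> bytes, using the encoding the output expects.
latin1_bytes = text.encode('latin-1')

# The codecs module offers the same two steps as functions.
same_text = codecs.decode(utf8_bytes, 'utf-8')
same_bytes = codecs.encode(text, 'latin-1')

print(len(utf8_bytes))      # 2
print(len(latin1_bytes))    # 1
print(same_bytes == b'\xe1')  # True
```

Note that latin-1 stores this letter as the single byte 0xe1, which is one possible reason a machine could print 0xe1 where another prints 0xc3 0xa1; that is a guess, since the question does not say how the input was encoded on each machine.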