Why Does Python 2.x Throw An Exception With String Formatting + Unicode?

I have the following piece of code. The last line throws an error. Why is that?

class Foo(object):
    def __unicode__(self):
        return u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3

Solution 1:

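When you mix unicode and byte string (str) objects, Python 2 converts between them implicitly: depending on the operation, it either encodes the unicode value to a byte string or decodes the byte string to unicode, using the default ASCII codec in both cases.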

You are mixing unicode, byte strings and a custom object, which triggers a sequence of implicit encodes and decodes that are incompatible with each other.

In this case, the byte-string template first interpolates your Foo() value as a byte string (str(Foo()) is used). The u'asdf' argument then forces the result so far, which already contains the UTF-8 encoded Foo() value, to be decoded to unicode so that the unicode string can be interpolated. That decode fails because the ASCII codec cannot decode the \xe6\x9e\x97 UTF-8 byte sequence already interpolated.
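The failing sequence can be spelled out step by step. This is a minimal sketch, not the code Python 2 runs internally; it assumes str(Foo()) returns the UTF-8 encoding of the unicode value (as the byte sequence above suggests), and the names text, partial and error are illustrative only. It is written to run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# Each implicit Python 2 step is written out explicitly.
text = u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

# Step 1: the byte-string template interpolates str(Foo()),
# assumed here to be the UTF-8 encoding of the unicode value.
partial = b'this should break ' + text.encode('utf-8')

# Step 2: the unicode argument forces the partial result to be
# decoded with the default ASCII codec; this is the step that fails.
try:
    partial.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The error points at the first non-ASCII byte in the partial result.
print(repr(error))
```

The template prefix itself is plain ASCII and would decode fine; it is the already-interpolated UTF-8 bytes that the ASCII codec rejects.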

You should always explicitly encode Unicode values to byte strings, or decode byte strings to Unicode, before mixing types, because the corner cases are complex.
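One way to follow that advice is to normalise every value to unicode before formatting. The sketch below is only an illustration: the helper name to_text and the UTF-8 default are assumptions, not part of the original code, and it runs on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding byte strings explicitly."""
    # On Python 2, bytes is an alias for str, so this catches byte strings there too.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b'\xe6\x9e\x97\xe8\xa6\xba'  # UTF-8 encoded bytes from some external source

# Unicode template plus unicode arguments: no implicit ASCII decode is needed.
result = u'this should break %s %s' % (to_text(raw), u'asdf')
print(repr(result))
```

Because every argument is already unicode when it reaches the template, the outcome no longer depends on the default codec.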

Explicitly converting to unicode() works:

>>> print "this should break %s %s" % (unicode(Foo()), u'asdf')
this should break 林覺民謝冰心故居 asdf

as the output is turned into a unicode string:

>>> "this should break %s %s" % (unicode(Foo()), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

while otherwise you'd end up with a byte string:

>>> "this should break %s %s" % (Foo(), 'asdf')
'this should break \xe6\x9e\x97\xe8\xa6\xba\xe6\xb0\x91\xe8\xac\x9d\xe5\x86\xb0\xe5\xbf\x83\xe6\x95\x85\xe5\xb1\x85 asdf'

(note that 'asdf' is a byte string here too).

Alternatively, use a unicode template:

>>> u"this should break %s %s" % (Foo(), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'
