Unicode encoding error when writing to a file


I know this is an ever-present problem when working with Python 2.x; I'm currently on Python 2.7. The text I want to output to a tab-delimited text file is pulled from a SQL Server 2012 database table whose server collation is set to SQL_Latin1_General_CP1_CI_AS.

The exception I get varies a little, but is essentially: UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)

or UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)

Here is what I typically turn to, but it still results in an error:

from kitchen.text.converters import getwriter

UTF8Writer = getwriter('utf8')
with open("output.txt", 'a') as myfile:
    myfile = UTF8Writer(myfile)
    # content processing done here
    # title is text pulled directly from the database
    # just_text is content pulled from raw html fed into Beautiful Soup,
    #     using its .get_text() to retrieve just the text content
    myfile.write(title + '\t' + just_text)

I have also tried:

# also performed for just_text and still resulting in exceptions
title = title.encode('utf-8')

and

title = title.decode('latin-1')
title = title.encode('utf-8')

and

title = unicode(title, 'latin-1')

I have also replaced the with open() with:

with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:

I'm not sure what I'm doing wrong or forgetting to do. I have also swapped encode with decode, in case I had the encoding/decoding backwards, with no success.

Any help would be greatly appreciated.

Update

I have added print repr(title) and print repr(just_text), both when I first retrieve title from the database and after calling .get_text(). Not sure how much this helps, but...

For title I get <type 'str'>; for just_text I get <type 'unicode'>.
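That type mismatch is the root cause: concatenating a byte string (str) with a unicode string forces Python 2 to decode the bytes with the default ascii codec, which blows up on bytes like 0xa0 and 0xe2. A minimal sketch of the fix, using hypothetical stand-in values for the database and Beautiful Soup content (the collation suggests latin-1/cp1252 bytes, which is an assumption here):

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-ins for the real database/Beautiful Soup values.
title = b'Caf\xe9 \xa0menu'       # byte string (Python 2 str) from the database
just_text = u'na\u00efve text'    # already unicode, from get_text()

# Decode the byte string BEFORE mixing it with unicode, so Python never
# falls back to the implicit ascii decode that raises UnicodeDecodeError.
title = title.decode('latin-1')

line = title + u'\t' + just_text  # both operands are unicode now
```

Once everything on the line is unicode, encoding happens exactly once, at write time, instead of implicitly mid-expression.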

Errors

These are errors I'm getting from the content pulled from the BeautifulSoup Summary() function.

C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

ValueError: Expected a bytes object, not a unicode object

The traceback portion is:

File <myfile>, line 39, in <module>
  summary_soup = BeautifulSoup(page_summary)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
  self.builder.prepare_markup(markup, from_encoding)):
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
  for encoding in detector.encodings:
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
  self.chardet_encoding = chardet_dammit(self.markup)
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
  return chardet.detect(s)['encoding']
File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
  raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object
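The traceback shows chardet being handed a unicode object: encoding detection only makes sense on bytes. One hedged workaround (a sketch, not the only fix) is a small guard that encodes unicode markup to bytes before it reaches the parser; `to_bytes` is a hypothetical helper name:

```python
def to_bytes(markup, encoding='utf-8'):
    """Ensure markup handed to BeautifulSoup is bytes, so chardet
    (which only accepts bytes) never sees a unicode object."""
    if isinstance(markup, bytes):
        return markup
    return markup.encode(encoding)

# summary_soup = BeautifulSoup(to_bytes(page_summary))
```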


Here's some advice: everything has an encoding. Your issue is a matter of finding the encodings of the different pieces, re-encoding them into a common format, and writing the result to a file.

I recommend choosing utf-8 as the output encoding.

with open('output', 'w') as f:
    unistr = title.decode("latin-1") + u"\t" + just_text
    f.write(unistr.encode("utf-8"))

Beautiful Soup's get_text() returns Python's unicode type. decode("latin-1") gets your database content into unicode as well; the two are joined with the tab, and the result is encoded to UTF-8 bytes before writing.
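An alternative sketch of the same idea uses io.open, which takes an encoding argument and performs the UTF-8 encoding on write in both Python 2 and 3 (the literal values below are hypothetical stand-ins, and latin-1 is assumed for the database bytes):

```python
# -*- coding: utf-8 -*-
import io

title = b'Caf\xe9'               # byte string from the database (assumed latin-1)
just_text = u'r\u00e9sum\u00e9'  # unicode from get_text()

# Join as unicode; the file object does the utf-8 encoding for us.
unistr = title.decode('latin-1') + u'\t' + just_text
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(unistr)
```

With io.open you write unicode objects directly and never call .encode() by hand, which removes one place where bytes and unicode can get mixed up.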