Cp949 Codec Can't Encode Character Error In Python
I am using the code below to parse the XML format wikipedia training data into a pure text file: from __future__ import print_function import logging import os.path import six imp
Solution 1:
Minimal example
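The problem is that your file is opened with an implicit encoding (inferred from your system). When open() is called without an encoding argument, Python falls back to the platform's preferred encoding, which on a Korean-locale Windows system is typically cp949. I can recreate your issue as follows: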
a = '\u1f00'
with open('f.txt', 'w', encoding='cp949') as f:
    f.write(a)
Error message: UnicodeEncodeError: 'cp949' codec can't encode character '\u1f00' in position 0: illegal multibyte sequence
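To see which encoding a system would pick implicitly, and whether a given character fits into it, something like the following sketch may help. The helper name can_encode is made up for illustration; locale.getpreferredencoding(False) is what text-mode open() normally consults when no encoding is passed.

```python
import locale

# Encoding that text-mode open() normally falls back to
# when no `encoding` argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)


def can_encode(text, encoding):
    """Hypothetical helper: True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The character from the minimal example above.
print(can_encode('\u1f00', 'cp949'))
print(can_encode('\u1f00', 'utf-8'))
```

If the first print shows cp949 (or another legacy code page), any text containing characters outside that code page will fail in the same way unless an encoding is passed explicitly.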
You have two options. Either open the file using an encoding which can encode the character you are using:
with open('f.txt', 'w', encoding='utf-8') as f:
    f.write(a)
Or open the file as binary and write encoded bytes:
with open('f.txt', 'wb') as f:
    f.write(a.encode('utf-8'))
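As a quick check that the two options are equivalent for this input, the sketch below writes the same string both ways into a temporary directory and compares the raw bytes. The file names are arbitrary and chosen only for this illustration.

```python
import os
import tempfile

a = '\u1f00'
tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, 'text_mode.txt')      # arbitrary name
bytes_path = os.path.join(tmpdir, 'binary_mode.txt')   # arbitrary name

# Option 1: text mode with an explicit encoding.
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(a)

# Option 2: binary mode, encoding the string by hand.
with open(bytes_path, 'wb') as f:
    f.write(a.encode('utf-8'))

# Read both files back as raw bytes and compare.
with open(text_path, 'rb') as f:
    text_bytes = f.read()
with open(bytes_path, 'rb') as f:
    binary_bytes = f.read()

print(text_bytes == binary_bytes)
```

One difference to keep in mind: text mode translates '\n' to the platform line ending on write (for example '\r\n' on Windows), while binary mode writes exactly the bytes it is given. The string here contains no newline, so both files hold the same bytes.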
Applied to your code:
I would replace this part:
output = open(outp, 'w')
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
for text in wiki.get_texts():
    if six.PY3:
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        # ###another method###
        # output.write(
        #     space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
    else:
        output.write(space.join(text) + "\n")
with this:
from io import open
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
with open(outp, 'w', encoding='utf-8') as output:
    for text in wiki.get_texts():
        output.write(u' '.join(text) + u'\n')
which should work in both Python 2 and Python 3.
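To try the write loop without downloading a Wikipedia dump, here is a minimal sketch that swaps wiki.get_texts() for a hard-coded list of token lists. The sample_texts data and the output file name are assumptions made up for this example, not part of the original code.

```python
from io import open
import os
import tempfile

# Hypothetical stand-in for wiki.get_texts(): an iterable of token lists.
# The tokens include characters that cp949 cannot encode.
sample_texts = [
    [u'\u1f00', u'alpha'],
    [u'\ud55c\uae00', u'text'],
]

outp = os.path.join(tempfile.mkdtemp(), 'wiki_sample.txt')  # arbitrary path

# Same loop as above, with the encoding stated explicitly.
with open(outp, 'w', encoding='utf-8') as output:
    for text in sample_texts:
        output.write(u' '.join(text) + u'\n')

# Read the file back with the same encoding to confirm the round trip.
with open(outp, 'r', encoding='utf-8') as check:
    lines = check.read().splitlines()

print(len(lines))
```

Whoever reads the file later has to pass encoding='utf-8' as well, otherwise the same implicit system encoding is used for decoding.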