
Cp949 Codec Can't Encode Character Error In Python

I am using the code below to parse the XML format wikipedia training data into a pure text file: from __future__ import print_function import logging import os.path import six imp

Solution 1:

Minimal example

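The problem is that your file is opened without an explicit encoding, so Python falls back to the system default (cp949 on your machine, judging by the error message). I can recreate the issue by requesting that encoding explicitly: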

a = '\u1f00'
with open('f.txt', 'w', encoding='cp949') as f:
    f.write(a)

Error message: UnicodeEncodeError: 'cp949' codec can't encode character '\u1f00' in position 0: illegal multibyte sequence
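To confirm which encoding is being picked up implicitly, something like the following sketch may help. The helper names default_text_encoding and can_encode are made up for illustration; locale.getpreferredencoding(False) is what open() has traditionally used in text mode when no encoding is passed, though the exact mechanism varies slightly between Python versions.

```python
import locale


def default_text_encoding():
    """Return the encoding open() is expected to fall back to in text mode.

    Hypothetical helper: it only wraps locale.getpreferredencoding(False).
    """
    return locale.getpreferredencoding(False)


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


a = '\u1f00'
print(default_text_encoding())    # system dependent
print(can_encode(a, 'cp949'))     # the failing case from the error message
print(can_encode(a, 'utf-8'))     # UTF-8 can encode any Unicode character
```

If the first line prints cp949, any open(path, 'w') call without an encoding argument is exposed to the same error.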

You have two options. Either open the file using an encoding which can encode the character you are using:

with open('f.txt', 'w', encoding='utf-8') as f:
    f.write(a)

Or open the file as binary and write encoded bytes:

with open('f.txt', 'wb') as f:
    f.write(a.encode('utf-8'))
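As a quick check, the two options should leave the same bytes on disk. The sketch below writes the character both ways into a temporary directory and compares the results; the file names and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

a = '\u1f00'

# Hypothetical scratch locations, used only for this comparison.
tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, 'text_mode.txt')
binary_path = os.path.join(tmpdir, 'binary_mode.txt')

# Option 1: text mode with an explicit encoding.
with open(text_path, 'w', encoding='utf-8') as f:
    f.write(a)

# Option 2: binary mode, encoding the string by hand.
with open(binary_path, 'wb') as f:
    f.write(a.encode('utf-8'))

# Read both files back as raw bytes and compare them.
with open(text_path, 'rb') as f:
    text_bytes = f.read()
with open(binary_path, 'rb') as f:
    binary_bytes = f.read()

print(text_bytes == binary_bytes)
```

Text mode is usually the more convenient choice, because the encoding is stated once at open() rather than at every write.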

Applied to your code:

I would replace this part:

output = open(outp, 'w')
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
for text in wiki.get_texts():
    if six.PY3:
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
    #   ###another method###
    #    output.write(
    #            space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
    else:
        output.write(space.join(text) + "\n")

with this:

from io import open

wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
with open(outp, 'w', encoding='utf-8') as output:
    for text in wiki.get_texts():
        output.write(u' '.join(text) + u'\n')

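which should work in both Python 2 and Python 3, since io.open accepts the encoding argument in Python 2 and is the same function as the built-in open in Python 3.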
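To try the writing loop without downloading a Wikipedia dump, a rough sketch with stand-in data might look like this. The function write_texts and the list sample_texts are hypothetical; sample_texts only imitates the lists of unicode tokens that wiki.get_texts() is assumed to yield here.

```python
from io import open
import os
import tempfile


def write_texts(texts, path):
    """Write one space-joined line per token list, always as UTF-8."""
    with open(path, 'w', encoding='utf-8') as output:
        for text in texts:
            output.write(u' '.join(text) + u'\n')


# Stand-in for wiki.get_texts(): token lists containing characters
# that cp949 cannot encode (assumed data, not real corpus output).
sample_texts = [
    [u'anarchism', u'\u1f00\u03c1\u03c7\u03ae'],
    [u'plain', u'ascii', u'tokens'],
]

out_path = os.path.join(tempfile.mkdtemp(), 'wiki.txt')
write_texts(sample_texts, out_path)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

print(lines == [u' '.join(t) for t in sample_texts])
```

Reading the file back with encoding='utf-8' matters as well: opening it later without an encoding would hit the same system default on the decoding side.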

