Skip to content Skip to sidebar Skip to footer

Wrong Encoding Of Email Attachment

I have a python 2.7 script running on windows. It logs in gmail, checks for new e-mails and attachments: #!/usr/bin/env python # -*- coding: utf-8 -*- file_types = ['pdf', 'doc',

Solution 1:

If you read the spec that defines the filename field, RFC 2183, section 2.3, it says:

Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII. We recognize the great desirability of allowing arbitrary character sets in filenames, but it is beyond the scope of this document to define the necessary mechanisms. We expect that the basic [RFC 1521] 'value' specification will someday be amended to allow use of non-US-ASCII characters, at which time the same mechanism should be used in the Content-Disposition filename parameter.

There are proposed RFCs to handle this. In particular, it's been suggested that filenames be handled as encoded-words, as defined by RFC 5987, RFC 2047, and RFC 2231. In brief this means either RFC 2047 format:

"=?" charset "?" encoding "?" encoded-text"?="

… or RFC 2231 format:

"=?" charset ["*" language] "?" encoded-text "?="

Some mail agents are already using this functionality, others don't know what to do with it. The email package in Python 2.x is among those that don't know what to do with it. (It's possible that the later version in Python 3.x does, or that it may change in the future, but that won't help you if you want to stick with 2.x.) So, if you want to parse this, you have to do it yourself.

In your example, you've got a filename in RFC 2047 format, with charset UTF-8 (which is usable directly as a Python encoding name), encoding B, which means Base-64, and content 0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv. So, you have to base-64 decode that, then UTF-8-decode that, and you get u'часть 1 текст методички.do'.

If you want to do this more generally, you're going to have to write code which tries to interpret each filename in RFC 2231 format if possible, in RFC 2047 format otherwise, and does the appropriate decoding steps. This code isn't trivial enough to write in a StackOverflow answer, but the basic idea is pretty simple, as demonstrated above, so you should be able to write it yourself. You may also want to search PyPI for existing implementations.

Post a Comment for "Wrong Encoding Of Email Attachment"