Yesterday I posted a patch to issue 4661, which I have called "the
poster child for the problems with the email module in Python3".
The patch proposes a way to make email5 handle binary data...more
or less. I'll quote the tracker post here since it explains
the situation. Please go to
OK, I'm not entirely sure I want to post this, but....
Antoine and I were having a conversation about nntplib and email and I
noted that unicode as an email transmission channel acts as if it required
7bit clean data. That is, there's no way to use unicode as an 8bit
data transmission channel. Antoine pointed out that there is PEP 383,
and that he is using that in his nntplib update to tunnel 8bit data (if
there is any) from and back to the nntp server. I said I couldn't do
that with email because I not only needed to transmit the data, I also
needed to *parse* it.
Antoine pointed out that you can in fact parse a header even if it has
surrogateescape code points in it.
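To make that point concrete, here is a minimal sketch (the sample header
line and the variable names are invented for illustration, not taken from
the patch). It shows a header carrying a stray 8bit byte still splitting
cleanly into name and value once it has been decoded with surrogateescape:

```python
# Hypothetical wire-format header line containing one non-ASCII byte (0xE9).
raw = b"Subject: caf\xe9 menu\r\n"

# PEP 383: each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
text = raw.decode("ascii", "surrogateescape")

# Header parsing only cares about ASCII syntax (the colon), so the
# surrogate code point is just opaque content as far as the split goes.
name, _, value = text.rstrip("\r\n").partition(":")
value = value.strip()

# The original bytes of the value can be recovered losslessly.
value_bytes = value.encode("ascii", "surrogateescape")
```

The byte 0xE9 travels through the string as the code point U+DCE9 and
comes back out unchanged.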
So I started thinking about that. In point of fact, from the point
of view of an email parser, non-ASCII bytes are pretty much opaque.
They don't affect the semantics of the parsing. Either they are
invalid data (in headers), or they are opaque content data (8bit
So...I came up with a horrible little hack, which is attached here as
a patch. This is horrible because it is a perversion of the Python3
desire to make a clean separation between bytes and strings. The only
thing it really has to recommend it is that it works: it allows email5
(the version of email currently in Python3) to read wire-format messages
and parse them into valid message structures.
The patch is a proof of concept and is far from complete. It handles
only message bodies (but those are the most important) and has no doc
updates and only one test. If this approach is deemed worth considering,
I will flesh out the tests and make sure the corner cases are handled
correctly, and write docs with lots of notes about why this is perverse
and email6 will make it all better :)
I feel bad about posting this both because it is an ugly hack and because
it will likely slow down email6 development (because it will make email5
mostly work). But making email5 mostly work in 3.2 seems like a case
where practicality beats purity.
The essence of the hack is as follows: Given binary data we decode it
as ASCII using the surrogateescape error handler. Then, when a message
body is retrieved we check to see if there are any surrogates in it,
and if there are we encode it back to ASCII using surrogateescape,
thereby recovering the original bytes. For "Content-Transfer-Encoding:
8bit" parts we can then try to decode it using the declared charset, or
ASCII with the replace error handler if the charset isn't known. But in
any case the original binary data is accessible by using 'decode=True'
in the call to get_payload. (NB for those not familiar with the API:
decode=True refers to decoding the Content-Transfer-Encoding, *not*
decoding to unicode...which means after CTE decoding you end up with a
byte string.)
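The round trip described above can be sketched roughly as follows. The
function names (bytes_to_text, has_surrogates, recover_body) are
hypothetical helpers made up for this illustration and are not the
functions in the patch. Falling back to ASCII/replace when the declared
charset fails to decode is also an assumption of this sketch.

```python
def bytes_to_text(data: bytes) -> str:
    """Smuggle arbitrary bytes into a str; non-ASCII bytes become surrogates."""
    return data.decode("ascii", "surrogateescape")


def has_surrogates(text: str) -> bool:
    """True if the text contains code points produced by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def recover_body(text: str, charset=None):
    """Return (raw_bytes, display_text) for a body built by bytes_to_text.

    raw_bytes is always the original wire data. display_text uses the
    declared charset when it is known, otherwise ASCII with 'replace'.
    """
    raw = text.encode("ascii", "surrogateescape")
    if not has_surrogates(text):
        return raw, text
    try:
        shown = raw.decode(charset) if charset else raw.decode("ascii", "replace")
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset (assumed fallback for this sketch).
        shown = raw.decode("ascii", "replace")
    return raw, shown


# Example: an 8bit latin-1 body as it would arrive off the wire.
body = b"na\xefve caf\xe9\n"
text = bytes_to_text(body)
raw, shown = recover_body(text, "latin-1")
```

Whatever happens on the display side, raw is always identical to the
bytes that came in, which is the property the hack depends on.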
For headers, which are not supposed to have 8bit data in them, the best
we can do is re-decode them with ASCII/replace, but at least it will be
possible to parse the messages. (The current patch doesn't do this.)
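A minimal sketch of what that header re-decoding might look like.
sanitize_header is a hypothetical name, and the sketch assumes the header
text was produced by decoding bytes as ASCII with surrogateescape:

```python
def sanitize_header(value: str) -> str:
    """Replace smuggled 8bit bytes in a header value with U+FFFD.

    Assumes value came from bytes.decode('ascii', 'surrogateescape'),
    so it contains only ASCII characters and escape surrogates.
    """
    raw = value.encode("ascii", "surrogateescape")  # back to the wire bytes
    return raw.decode("ascii", "replace")           # invalid bytes -> U+FFFD


cleaned = sanitize_header("caf\udce9 menu")
```

The invalid byte is lost, but what remains is an ordinary string that can
be printed or encoded without raising an error.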
Another thing missing from the current patch is the generator side.
But since the binary data for the message content is now available,
it should be possible to have a generator that outputs binary.
Note that in this patch I've introduced new functions/methods for getting
binary string data in, but for file input one needs to open the file as
text using ASCII encoding and the surrogateescape error handler.
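For the file case, opening the file that way looks something like this.
It is a sketch only: the sample message and the temporary file are
invented for the example, and it stops at reading the text rather than
calling anything from the patch.

```python
import os
import tempfile

# Hypothetical wire-format message with an 8bit body (0xE9 is not ASCII).
wire = (
    b"Content-Type: text/plain; charset=latin-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(wire)
    path = tmp.name

try:
    # Open as text, ASCII + surrogateescape. newline="" disables newline
    # translation so the CRLF line endings survive untouched.
    with open(path, encoding="ascii", errors="surrogateescape", newline="") as fp:
        text = fp.read()
finally:
    os.remove(path)

# Encoding with the same handler gives back exactly what was on disk.
round_tripped = text.encode("ascii", "surrogateescape")
```

The text read this way is what would be handed to the parser, with the
8bit byte carried along as a surrogate.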
I've only done minimal testing on this (obviously), and so I may find
a showstopper somewhere along the way, but so far it seems to work,
and logically it seems like it should work.