Thoughts on the general API, and the Header API.


R. David Murray
OK, so we've agreed that we need to handle bytes and text at pretty
much all API levels, and that the "original data" that informs the data
structure can be either bytes or text.  We want to be able to recover
that original data, especially in the bytes case, but arguably also in
the text case.

Then there's also the issue of transforming a message once we have it in
a data structure, and the consequent issue of what it means to serialize
the resulting modified message.  (This last comes up in a very specific
way in issues 968430 and 1670765, which are about preserving the *exact*
byte representation of a multipart/signed message).

We've also agreed that whatever we decide to do with the __str__ and
__bytes__ magic methods, they will be implemented in terms of other
parts of the API.  So I'll ignore those for now.

I think we want to decide on a general API structure that is implemented
at all levels and objects where it makes sense, this being the API
for creating and accessing the following information about any part of
the model:

    * create object from bytes
    * create object from text
    * obtain the defect list resulting from creating the object
    * serialize object to bytes
    * serialize object to text
    * obtain original input data
    * discover the type of the original input data

At the moment I see no reason to change the API for defects (a .defects
attribute on the object holding a list of defects), so I'm going to
ignore that for now as well.
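
As a rough illustration of that general structure (all names here are
hypothetical, not part of the proposal), the seven operations might map onto
a per-object surface like this:

```python
class MessagePart:
    """Hypothetical sketch of the API surface each model object would share."""

    def __init__(self):
        self.defects = []        # defect list populated while parsing
        self._raw = None         # original input data, if any

    @classmethod
    def from_bytes(cls, data: bytes):
        # create object from bytes, keeping the original for recovery
        obj = cls()
        obj._raw = data
        return obj

    @classmethod
    def from_text(cls, data: str):
        # create object from text
        obj = cls()
        obj._raw = data
        return obj

    def serialize_bytes(self) -> bytes:      # serialize object to bytes
        raise NotImplementedError

    def serialize_text(self) -> str:         # serialize object to text
        raise NotImplementedError

    @property
    def raw(self):                           # obtain original input data
        return self._raw

    @property
    def raw_type(self):                      # discover its type
        return type(self._raw)
```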

I spent a bunch of time trying to define an API for Headers that provided
methods for all of the above.  As I was writing the descriptions for
the various methods, and especially trying to specify the "correct"
behavior for both the raw-data-is-bytes and raw-data-is-text cases
(especially for the methods that serialize the data), the whole thing
began to give off a bad code smell.

After setting it aside for a bit, I had what I think is a little epiphany:
our need is to deal with messages (and parts of messages) that could be
in either bytes form or text form.  The things we need to do with them
are similar regardless of their form, and so we have been talking about a
"dual API": one method for bytes and a parallel method for text.

What if we recognize that we have two different data types, bytes messages
and text messages?  Then the "dual API" becomes a more uniform, almost
single, API, but with two possible underlying data types.

In the context specifically of the proposed new Header object, I propose
that we have a StringHeader and a BytesHeader, and an API that looks
something like this:

StringHeader

    properties:
        raw_header (None unless from_full_header was used)
        raw_name
        raw_value
        name
        value

    __init__(name, value)
    from_full_header(header)
    serialize(max_line_len=78,
              newline='\n',
              use_raw_data_if_possible=False)
    encode(charset='utf-8')

BytesHeader would be exactly the same, with the exception of the signature
for serialize and the fact that it has a 'decode' method rather than an
'encode' method.  Serialize would be different only in the fact that
it would have an additional keyword parameter, must_be_7bit=True.

The magic of this approach is in those encode/decode methods.

Encoding a StringHeader would yield a BytesHeader containing the same
data, but encoded per RFC2047 using the specified charset.  Decoding a
BytesHeader would yield a StringHeader with the same data, but decoded to
unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
not the RFC2047 sense) using the specified charset (which would default to
ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
do with RFC2047 charsets like unknown-8bit is an open question...probably
throw an error.)
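
The RFC 2047 transform at the heart of those encode/decode methods can be
seen with today's stdlib email.header module; the proposed methods would do
the equivalent, but return a new BytesHeader or StringHeader object rather
than a bare string:

```python
from email.header import Header, decode_header

# StringHeader.encode(charset='utf-8') would RFC 2047-encode non-ASCII words:
wire = Header('Grüße', charset='utf-8').encode()
print(wire)                  # an encoded word such as '=?utf-8?b?...?='

# BytesHeader.decode() would reverse it, yielding unicode again:
(raw, charset), = decode_header(wire)
print(raw.decode(charset))   # 'Grüße'
```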

(Encoding or decoding a Message would cause the Message to recursively
encode or decode its subparts.  This means you are making a complete
new copy of the Message in memory.  If you don't want to do that you
can walk the Message and convert it piece by piece (we could provide a
generator that does this).)
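
Such a generator might look like the toy sketch below (the classes and
method names are stand-ins, not the proposed API; walk() is modeled on the
current email package's Message.walk):

```python
class BytesPart:
    """Toy stand-in for a bytes-flavored message part."""
    def __init__(self, payload, subparts=()):
        self.payload, self.subparts = payload, list(subparts)

    def walk(self):                      # depth-first, like Message.walk()
        yield self
        for sub in self.subparts:
            yield from sub.walk()

    def decode(self, charset='ascii'):   # per-part decode, as proposed
        return self.payload.decode(charset)

def decode_piecewise(root, charset='ascii'):
    """Convert one part at a time, so the caller can drop each bytes part
    as soon as its string form is produced instead of copying the tree."""
    for part in root.walk():
        yield part.decode(charset)

tree = BytesPart(b'outer', [BytesPart(b'inner1'), BytesPart(b'inner2')])
print(list(decode_piecewise(tree)))     # ['outer', 'inner1', 'inner2']
```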

raw_header would be the data passed in to the constructor if
from_full_header is used, and None otherwise.  If encode/decode call
the regular constructor, then this attribute would also act as a flag
as to whether or not the header was constructed from raw input data
or via program.

raw_name and raw_value would be the fieldname and fieldbody, either
what was passed in to the __init__ constructor, or the result of
splitting what was passed to the from_full_header constructor on
the first ':'.  (These are convenience attributes and are not
essential to the proposed API).

name would be the fieldname stripped of trailing whitespace.

value would be the *unfolded* fieldbody stripped of leading and
trailing whitespace (but with internal whitespace intact).
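
Pulling the attribute semantics above together, a minimal StringHeader
sketch (names from the proposal, the implementation purely illustrative)
might behave like this:

```python
import re

class StringHeader:
    """Sketch of the proposed StringHeader attribute semantics."""

    def __init__(self, name, value):
        self.raw_header = None        # None unless from_full_header was used
        self.raw_name = name
        self.raw_value = value

    @classmethod
    def from_full_header(cls, header):
        # split on the first ':' to get fieldname and fieldbody
        name, _, value = header.partition(':')
        obj = cls(name, value)
        obj.raw_header = header       # keep the raw input
        return obj

    @property
    def name(self):
        # fieldname stripped of trailing whitespace
        return self.raw_name.rstrip()

    @property
    def value(self):
        # *unfolded* fieldbody, outer whitespace stripped
        return re.sub(r'\r?\n', '', self.raw_value).strip()

h = StringHeader.from_full_header('Subject: a folded\n subject line')
print(h.name, '/', h.value)           # Subject / a folded subject line
```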

As for serialize, my thought here is that every object in the tree
has a serialize method with the same signature, and serialization
is a matter of recursively passing the specified parameters downward.

max_line_len is obvious, and defaults to the RFC recommended max.  (If you
want the unfolded header, use its .value attribute).  newline resolves
issue 1349106, allowing an email package client to generate completely
wire-format messages if it needs to.  use_raw_data_if_possible would
mean to emit the original raw data if it exists (modulo changing
the flavor of newline if needed, for those object types (such as
headers) where that makes sense).  The serialize method of specific
sub-types can do specialized things (eg: multipart/signed can make
use_raw_data_if_possible default to True).
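
A naive sketch of what serialize's max_line_len and newline parameters
would do for a header (this folding logic is hypothetical and far simpler
than a real implementation, which must respect encoded words and
structured tokens):

```python
def fold(name, value, max_line_len=78, newline='\n'):
    """Naively fold an unfolded field body at spaces."""
    lines, cur = [], name + ':'
    for word in value.split():
        if cur not in (name + ':', '') and len(cur) + 1 + len(word) > max_line_len:
            lines.append(cur)
            cur = ''                  # continuation lines begin with the
        cur += ' ' + word             # folding whitespace added here
    lines.append(cur)
    return newline.join(lines)

# newline='\r\n' yields wire-format line endings (the issue 1349106 case)
print(fold('To', 'a@b.c', newline='\r\n'))   # To: a@b.c
```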

For Bytes types, the extra 'must_be_7bit' flag would cause any 8bit
data to be transport encoded to be 7bit clean.  (For headers, this would
mean raw 8bit data would get the charset 'unknown-8bit', and we might
want to provide more control over that in some way: an error and way to
provide an error handler, or some other way to specify a charset to use
for such encodings.)  use_raw_data_if_possible would cause this flag to
be ignored when raw data was available for the object.

(If you want the text version of the transport-encoded message for some
reason, you can serialize the Bytes form using must_be_7bit and decode
the result as ASCII.)

Subclasses of these classes for structured headers would have additional
methods that would return either specialized object types (datetimes,
address objects) or bytes/strings, and these may or may not exist in
both Bytes and String forms (that depends on the use cases, I think).

I also think that the Bytes and Strings versions of objects that have
them can share large portions of their implementation through a base
class.  I think that makes this approach both easier to code than a
single-type-dual-API approach, and more robust in the face of changes.

So, those are my thoughts, and I'm sure I haven't thought of all the
corner cases.  The biggest question is, does it seem like this general
scheme is worth pursuing?

--David
_______________________________________________
Email-SIG mailing list
[hidden email]
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: Thoughts on the general API, and the Header API.

Glenn Linderman-3
On approximately 1/25/2010 12:10 PM, came the following characters from
the keyboard of R. David Murray:
> So, those are my thoughts, and I'm sure I haven't thought of all the
> corner cases.  The biggest question is, does it seem like this general
> scheme is worth pursuing?

Moving your last question to the front, yes.  And of course, we do need
to think through most of the corner cases before absolutely committing
to this approach.  But it sounds viable, and avoids an awful lot of
duplicate APIs, and would allow simple email clients to be written
primarily or even fully in bytes or primarily or even fully in strings.

A simple email client that is written fully in strings would "simply"
reject/bounce messages that cannot be decoded to strings.  This is
simple; it works for 100% properly encoded messages; in an environment
where a client is coded to process messages from some generator, once
they are both debugged to the extent of generating messages that can be
consumed, then all is well, and no messages would be rejected.  This
would not be an appropriate model for a general email server; while I'd
like to see a popular mailing list submission client that would bounce
messages that are improperly formed -- forcing contributors to use RFC
conformant clients, and thus encouraging fixes to those clients that are
not RFC conformant -- but I'm not going to hold my breath.

I think there can be enough power in an API designed in this manner to
allow the full nitty-gritty access as required.

I have some questions and concerns; I haven't thought through all of
them; perhaps some of them are corner cases, if so, they are corner
cases that are particularly interesting to me.

> OK, so we've agreed that we need to handle bytes and text at pretty
> much all API levels, and that the "original data" that informs the data
> structure can be either bytes or text.  We want to be able to recover
> that original data, especially in the bytes case, but arguably also in
> the text case.
>
> Then there's also the issue of transforming a message once we have it in
> a data structure, and the consequent issue of what it means to serialize
> the resulting modified message.  (This last comes up in a very specific
> way in issues 968430 and 1670765, which are about preserving the *exact*
> byte representation of a multipart/signed message).
>
> We've also agreed that whatever we decide to do with the __str__ and
> __bytes__ magic methods, they will be implemented in terms of other
> parts of the API.  So I'll ignore those for now.
>
> I think we want to decide on a general API structure that is implemented
> at all levels and objects where it makes sense, this being the API
> for creating and accessing the following information about any part of
> the model:
>
>      * create object from bytes
>      * create object from text
>      * obtain the defect list resulting from creating the object
>      * serialize object to bytes
>      * serialize object to text
>      * obtain original input data
>      * discover the type of the original input data
>
> At the moment I see no reason to change the API for defects (a .defects
> attribute on the object holding a list of defects), so I'm going to
> ignore that for now as well.
>
> I spent a bunch of time trying to define an API for Headers that provided
> methods for all of the above.  As I was writing the descriptions for
> the various methods, and especially trying to specify the "correct"
> behavior for both the raw-data-is-bytes and raw-data-is-text cases
> (especially for the methods that serialize the data), the whole thing
> began to give off a bad code smell.
>
> After setting it aside for a bit, I had what I think is a little epiphany:
> our need is to deal with messages (and parts of messages) that could be
> in either bytes form or text form.  The things we need to do with them
> are similar regardless of their form, and so we have been talking about a
> "dual API": one method for bytes and a parallel method for text.
>
> What if we recognize that we have two different data types, bytes messages
> and text messages?  Then the "dual API" becomes a more uniform, almost
> single, API, but with two possible underlying data types.
>
> In the context specifically of the proposed new Header object, I propose
> that we have a StringHeader and a BytesHeader, and an API that looks
> something like this:
>
> StringHeader
>
>      properties:
>          raw_header (None unless from_full_header was used)
>          raw_name
>          raw_value
>          name
>          value
>
>      __init__(name, value)
>      from_full_header(header)
>      serialize(max_line_len=78,
>                newline='\n',
>                use_raw_data_if_possible=False)
>      encode(charset='utf-8')
>    

If it was stated, I missed it: is  from_full_header  a way of producing
an object from a raw data value?  Whereas __init__ would obviously be
used to produce one from string or bytes values.  If so, then it would
be a requirement that this from_full_header API would never produce an
exception?  Rather it would produce an object with or without defects?

Are there any other *Header APIs that would be required not to produce
exceptions?  I don't yet perceive any.

The "charset" parameter... is that not mostly needed for data parts?
Headers are either ASCII, or contain self-describing charset info.
I guess I could see an intermediate decode from string to some charset,
before serialization, as a hint that, when generating headers, all
the characters in the header that are not ASCII are in the specified
charset... and that that charset is the one to be used in the
self-describing serialized ASCII stream?  The full generality of the
RFCs, however, allows pieces of headers to be encoded using different
charsets... with
this API, it would seem that that could only be created containing one
charset... the serialization primitives were made available, so that
piecewise construction of a header value could be done with different
charsets, and then the from_full_header API used to create the complex
value.  I don't see this as a severe limitation, I just want to
understand your intention, and document the limitation, or my
misunderstanding.


> BytesHeader would be exactly the same, with the exception of the signature
> for serialize and the fact that it has a 'decode' method rather than an
> 'encode' method.  Serialize would be different only in the fact that
> it would have an additional keyword parameter, must_be_7bit=True.
>    

I am not clear on why StringHeader's serialize would not need the  
must_be_7bit  parameter... or do I misunderstand that
StringHeader.serialize produces wire-format data?

> The magic of this approach is in those encode/decode methods.
>
> Encoding a StringHeader would yield a BytesHeader containing the same
> data, but encoded per RFC2047 using the specified charset.  Decoding a
> BytesHeader would yield a StringHeader with the same data, but decoded to
> unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
> not the RFC2047 sense) using the specified charset (which would default to
> ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
> do with RFC2047 charsets like unknown-8bit is an open question...probably
> throw an error).
>    

Would the encoding to/from StringHeader/BytesHeader preserve the  
from_full_header  state and value?

> (Encoding or decoding a Message would cause the Message to recursively
> encode or decode its subparts.  This means you are making a complete
> new copy of the Message in memory.  If you don't want to do that you
> can walk the Message and convert it piece by piece (we could provide a
> generator that does this).)
>    

Walking it piece by piece would allow the old pieces to be discarded, to
save total memory consumption, where that is appropriate.

Perhaps one generator that would be commonly used, would be to convert
headers only, and leave MIME data parts alone, accessing and converting
them only with the registered methods?  This would mean that a "complete
copy" wouldn't generally be very big, if the data parts were excluded
from implicit conversion.  Perhaps the "external storage protocol" might
also only be defined for MIME data parts, and walking the tree with this
generator would not need to reference the MIME data parts, nor bring
them in from "external storage".

> raw_header would be the data passed in to the constructor if
> from_full_header is used, and None otherwise.  If encode/decode call
> the regular constructor, then this attribute would also act as a flag
> as to whether or not the header was constructed from raw input data
> or via program.
>    

This _implies_ that  from_full_header always accepts raw data bytes...
even for the StringHeader.  And that implies the need for an implicit
decode, and therefore, perhaps a charset parameter?  No, not a charset
parameter, since they are explicitly contained in the header values.

Decode for header values may not need a charset value at all!


No comments for the rest.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: Thoughts on the general API, and the Header API.

R. David Murray
On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman <[hidden email]> wrote:
> On approximately 1/25/2010 12:10 PM, came the following characters from
> the keyboard of R. David Murray:
> > So, those are my thoughts, and I'm sure I haven't thought of all the
> > corner cases.  The biggest question is, does it seem like this general
> > scheme is worth pursuing?
>
> If it was stated, I missed it: is  from_full_header  a way of producing
> an object from a raw data value?  Whereas __init__ would obviously be

Yes.

> used to produce one from string or bytes values.  If so, then it would

Well, StringHeader.from_full_header would take a string as input,
while BytesHeader.from_full_header would take bytes as input.
__init__ would be used to construct a header in your program:

    StringHeader('MyHeader', 'my value')
    BytesHeader(b'MyHeader', b'my value')

> be a requirement that this from_full_header API would never produce an
> exception?  Rather it would produce an object with or without defects?

Yes.

> Are there any other *Header APIs that would be required not to produce
> exceptions?  I don't yet perceive any.

I don't think so.  from_full_header is the only one involved in parsing
raw data.  Whether __init__ throws errors or records defects is an open
question, but I lean toward it throwing errors.  The reason there is an
open question is because an email manipulating application may want to
convert to text to process an incoming message, and there are things
that a BytesHeader can hold that would cause errors when encoded to a
StringHeader (specifically, 8 bit bytes that aren't transfer encoded).
So it may be that decode, at least, should not throw errors but instead
record additional defects in the resulting StringHeader.  I think that
even in that case __init__ should still throw errors, though; decode
could deal with the defects before calling StringHeader.__init__, or
(more likely) catch the errors thrown by __init__, fix/record the defects,
and call it again.
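
In code, the error-catching flavor of decode described above might look
like this sketch (the function name and defect wording are hypothetical):

```python
def decode_value(raw: bytes, charset='ascii'):
    """Sketch: decode never raises; it records a defect and substitutes,
    leaving hard errors to the plain __init__ path."""
    defects = []
    try:
        text = raw.decode(charset)
    except UnicodeDecodeError:
        # an 8-bit byte that wasn't transfer encoded: defect, not exception
        defects.append('non-ASCII byte in unencoded header body')
        text = raw.decode(charset, errors='replace')
    return text, defects

print(decode_value(b'hello'))        # ('hello', [])
print(decode_value(b'caf\xe9')[1])   # one defect recorded
```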

Note, by the way, that by 'raw data' I mean what you are feeding in.
Raw data fed to a BytesHeader would be bytes, but raw data fed to
a StringHeader would be text (eg: if read from a file in text mode).

> The "charset" parameter... is that not mostly needed for data parts?

No, if you start with a unicode string in a StringHeader, you need to
know what charset to encode the unicode to and therefore to specify as
the charset in the RFC 2047 encoded words.

> Headers are either ASCII, or contain self-describing charset info.

That's true for BytesHeaders, but not for StringHeaders.  So as I
said above charset for StringHeader says which charset to put into
the encoded words when converting to BytesHeader form.

I specified a charset parameter for 'decode' only to handle the case
of raw bytes data that contains 8 bit data that is not in encoded words
(ie: is not RFC compliant).  I am visualizing this as satisfying a use
case where you have non-email (non RFC compliant) data where you allow
8 bit data in the header bodies because it's an internal app and you
know the encoding.  You can then use decode(charset) to decode those
BytesHeaders into StringHeaders.
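
For instance, with a hypothetical internal-app header carrying bare
latin-1 bytes (the splitting and stripping here just mimic what the
proposed decode('latin-1') would do internally):

```python
raw = b'X-Note: caf\xe9 menu'       # bare 8-bit byte: not RFC compliant
name, _, body = raw.partition(b':')

# BytesHeader.decode('latin-1') would read such bare bytes using the
# caller-supplied charset, since no encoded word declares one:
value = body.strip().decode('latin-1')
print(value)                        # 'café menu'
```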

> I guess I could see an intermediate decode from string to some charset,
> before serialization, as a hint that when generating headers, that all
> the characters in the header that are not ASCII are in the specified
> charset... and that that charset is the one to be used in the
> self-describing serialized ASCII stream?  The full generality of the

Exactly.

> RFCs, however,
> allows pieces of headers to be encoded using different charsets... with
> this API, it would seem that that could only be created containing one
> charset... the serialization primitives were made available, so that
> piecewise construction of a header value could be done with different
> charsets, and then the from_full_header API used to create the complex
> value.  I don't see this as a severe limitation, I just want to
> understand your intention, and document the limitation, or my
> misunderstanding.

Right.  I'm visualizing the "normal case" being encoding a StringHeader
using the default utf-8 charset or another specified charset, turning
the words containing non-ASCII characters into encoded words using that
charset.  The utility methods that turn unicode into encoded words would
be exposed, and an application that needs to create a header with mixed
charsets can use those utilities to build RFC compliant bytes data and
pass that to one of the BytesHeader constructors.  (Make the common case
easy, and the complicated cases possible.)
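
The stdlib's email.header module already shows the mechanics of building
one field body from pieces in different charsets; the exposed utilities
would let an application do the same and hand the result to
from_full_header:

```python
from email.header import Header, decode_header

# Piecewise construction with two different charsets:
h = Header()
h.append('Grüße', 'iso-8859-1')
h.append('日本', 'utf-8')
wire = h.encode()
print(wire)   # two encoded words, one per charset

# The pre-built RFC 2047 string really does carry both charsets:
charsets = [cs for _, cs in decode_header(wire)]
print(charsets)
```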

> > BytesHeader would be exactly the same, with the exception of the signature
> > for serialize and the fact that it has a 'decode' method rather than an
> > 'encode' method.  Serialize would be different only in the fact that
> > it would have an additional keyword parameter, must_be_7bit=True.
>
> I am not clear on why StringHeader's serialize would not need the  
> must_be_7bit  parameter... or do I misunderstand that
> StringHeader.serialize produces wire-format data?

The latter.  StringHeader serialize does not produce wire-format data,
it produces text (for example, for display to the user).  If you want
wire format, you encode the StringHeader and use the resulting BytesHeader
serialize.

> > The magic of this approach is in those encode/decode methods.
> >
> > Encoding a StringHeader would yield a BytesHeader containing the same
> > data, but encoded per RFC2047 using the specified charset.  Decoding a
> > BytesHeader would yield a StringHeader with the same data, but decoded to
> > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
> > not the RFC2047 sense) using the specified charset (which would default to
> > ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
> > do with RFC2047 charsets like unknown-8bit is an open question...probably
> > throw an error).
> >    
>
> Would the encoding to/from StringHeader/BytesHeader preserve the  
> from_full_header  state and value?

My thought is no.  Once you encode/decode the header, your program has
transformed it, and I think it is better to treat the original raw data
as gone.  The motivation for this is that the 'raw data' of a StringHeader
is the *text* string used to create it.  Keeping a bytes string 'raw data'
around as well would get us back into the mess that I developed this
approach to avoid, where we'd need to specify carefully the difference
between handing a header whose 'original' raw data was bytes vs string,
for each of the BytesHeader and StringHeader cases.  Better, I think,
to put the (small) burden on the application programmer: if you want to
preserve the original input data, do so by keeping the original object
around.  Once you mutate the object model, the original raw data for
the mutated piece is gone.

There are some use-case questions here, though, with regards to
preservation of as much original information/format as possible, and how
valuable that is.  I think we'll have to figure that out by examining
concrete use cases in detail.  (It is not something that the current email
package supports very well, by the way...headers currently get modified
significantly in the parse/generate cycle, even without bytes-to-string
transformations happening.)

> > (Encoding or decoding a Message would cause the Message to recursively
> > encode or decode its subparts.  This means you are making a complete
> > new copy of the Message in memory.  If you don't want to do that you
> > can walk the Message and convert it piece by piece (we could provide a
> > generator that does this).)
>
> Walking it piece by piece would allow the old pieces to be discarded, to
> save total memory consumption, where that is appropriate.
>
> Perhaps one generator that would be commonly used, would be to convert
> headers only, and leave MIME data parts alone, accessing and converting
> them only with the registered methods?  This would mean that a "complete
> copy" wouldn't generally be very big, if the data parts were excluded
> from implicit conversion.  Perhaps the "external storage protocol" might
> also only be defined for MIME data parts, and walking the tree with this
> generator would not need to reference the MIME data parts, nor bring
> them in from "external storage".

That's true.  The Bytes and String versions of binary MIME parts,
which are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object.  That breaking of the expectation that 'encode' and 'decode'
return new objects (in analogy to how encode and decode of strings/bytes
works) might not be a good thing, though.

In any case, text MIME parts have the same bytes vs string issues as
headers do, and should, IMO, be converted from one to the other on
encode/decode.

Another possible approach would be some sort of 'encode/decode on demand'
system, although that would need to retain a pointer to the original
object, which might get us into suboptimal reference cycle difficulties.

These bits are implementation details, though, and don't affect the API
design.

> > raw_header would be the data passed in to the constructor if
> > from_full_header is used, and None otherwise.  If encode/decode call
> > the regular constructor, then this attribute would also act as a flag
> > as to whether or not the header was constructed from raw input data
> > or via program.
> >    
>
> This _implies_ that  from_full_header always accepts raw data bytes...
> even for the StringHeader.  And that implies the need for an implicit
> decode, and therefore, perhaps a charset parameter?  No, not a charset
> parameter, since they are explicitly contained in the header values.

Your confusion was my confusing use of the term 'raw data' to mean
whatever was input to the from_full_header constructor, which is
bytes for a BytesHeader and text for a StringHeader.

> Decode for header values may not need a charset value at all!

Normally it would not.  charset would be useful in decode only for
non-RFC compliant headers.

> No comments for the rest.

Thanks for your feedback.

--David

Re: Thoughts on the general API, and the Header API.

Glenn Linderman-2
On approximately 1/25/2010 6:51 PM, came the following characters from
the keyboard of R. David Murray:

> On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman<[hidden email]>  wrote:
>    
>> Are there any other *Header APIs that would be required not to produce
>> exceptions?  I don't yet perceive any.
>>      
> I don't think so.  from_full_header is the only one involved in parsing
> raw data.  Whether __init__ throws errors or records defects is an open
> question, but I lean toward it throwing errors.  The reason there is an
> open question is because an email manipulating application may want to
> convert to text to process an incoming message, and there are things
> that a BytesHeader can hold that would cause errors when encoded to a
> StringHeader (specifically, 8 bit bytes that aren't transfer encoded).
> So it may be that decode, at least, should not throw errors but instead
> record additional defects in the resulting StringHeader.  I think that
> even in that case __init__ should still throw errors, though; decode
> could deal with the defects before calling StringHeader.__init__, or
> (more likely) catch the errors thrown by __init__, fix/record the defects,
> and call it again.
>
> Note, by the way, that by 'raw data' I mean what you are feeding in.
> Raw data fed to a BytesHeader would be bytes, but raw data fed to
> a StringHeader would be text (eg: if read from a file in text mode).
>    

Glad you clarified that; it wasn't obvious, without typed parameters to
the APIs.

I had assumed that serialize and from_full_header would produce/consume
bytes, and I think that showed up in my comments, and you've probably
addressed that below.  Of course, the reason that I assumed that, is
that there are no RFCs to describe a string format email message, either
on the wire, in memory, or, particularly, stored in a file.  So it is
really up to the application to define that, if it wants that.  Now
since py3 has a natural string format manipulation capability, and since
the emaillib wants to provide the interface between them, I suppose it
is a somewhat obvious thing that you might want to store a whole email
message in string format... I say somewhat obvious, because you thought
of it, but I didn't, until you clarified the above.

Perhaps the reason I didn't think of it, is simply that all the
currently used email message storage containers of which I am aware use
wire format.  So using string format for that purpose would require
inventing a new storage container (perhaps a trivial extension of an
existing one, but new, nonetheless).  I sort of expected email clients
would, given the capabilities of the emaillib, simply continue to
save/read in wire format.  In fact, it may be the only choice of format
that can completely preserve raw format messages for later processing,
in the presence of defects.

>> The "charset" parameter... is that not mostly needed for data parts?
>>      
> No, if you start with a unicode string in a StringHeader, you need to
> know what charset to encode the unicode to and therefore to specify as
> the charset in the RFC 2047 encoded words.
>
>    
>> Headers are either ASCII, or contain self-describing charset info.
>>      
> That's true for BytesHeaders, but not for StringHeaders.  So as I
> said above charset for StringHeader says which charset to put into
> the encoded words when converting to BytesHeader form.
>
> I specified a charset parameter for 'decode' only to handle the case
> of raw bytes data that contains 8 bit data that is not in encoded words
> (ie: is not RFC compliant).  I am visualizing this as satisfying a use
> case where you have non-email (non RFC compliant) data where you allow
> 8 bit data in the header bodies because it's an internal app and you
> know the encoding.  You can then use decode(charset) to decode those
> BytesHeaders into StringHeaders.
>
>    
>> I guess I could see an intermediate decode from string to some charset,
>> before serialization, as a hint that when generating headers, that all
>> the characters in the header that are not ASCII are in the specified
>> charset... and that that charset is the one to be used in the
>> self-describing serialized ASCII stream?  The full generality of the
>>      
> Exactly.
>    

OK, I'm with you now on the charset parameter, for encoding and decoding.
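To make the two charset roles concrete, here is a minimal sketch using toy stand-ins for the proposed StringHeader/BytesHeader classes (all class and method names are hypothetical; the proposal was never pinned down to this shape), built on the current email.header utilities:

```python
from email.header import Header, decode_header

class StringHeader:
    """Hypothetical text-domain header: a name plus a unicode value."""
    def __init__(self, name, value):
        self.name, self.value = name, value

    def encode(self, charset='utf-8'):
        # Values with non-ASCII characters become RFC 2047 encoded words
        # in the given charset; pure-ASCII values pass through untouched.
        try:
            wire = self.value.encode('ascii')
        except UnicodeEncodeError:
            wire = Header(self.value, charset=charset).encode().encode('ascii')
        return BytesHeader(self.name, wire)

class BytesHeader:
    """Hypothetical wire-domain header: a name plus raw bytes."""
    def __init__(self, name, raw):
        self.name, self.raw = name, raw

    def decode(self, charset='ascii'):
        # `charset` only matters for non-compliant bare 8-bit bytes;
        # RFC-compliant input is ASCII plus self-describing encoded
        # words, so the default never fires for it.
        text = self.raw.decode(charset)
        parts = []
        for chunk, cs in decode_header(text):
            parts.append(chunk.decode(cs or charset)
                         if isinstance(chunk, bytes) else chunk)
        return StringHeader(self.name, ''.join(parts))
```

So `StringHeader('Subject', 'Héllo').encode('utf-8')` yields an RFC 2047 wire form, while `BytesHeader.decode('latin-1')` covers the non-compliant bare-8-bit case discussed above.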


>> RFCs, however,
>> allows pieces of headers to be encoded using different charsets... with
>> this API, it would seem that that could only be created containing one
>> charset... the serialization primitives were made available, so that
>> piecewise construction of a header value could be done with different
>> charsets, and then the from_full_header API used to create the complex
>> value.  I don't see this as a severe limitation, I just want to
>> understand your intention, and document the limitation, or my
>> misunderstanding.
>>      
> Right.  I'm visualizing the "normal case" being encoding a StringHeader
> using the default utf-8 charset or another specified charset, turning
> the words containing non-ASCII characters into encoded words using that
> charset.  The utility methods that turn unicode into encoded words would
> be exposed, and an application that needs to create a header with mixed
> charsets can use those utilities to build RFC compliant bytes data and
> pass that to one of the BytesHeader constructors.  (Make the common case
> easy, and the complicated cases possible.)
>    

Thanks for this clarification also.
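The mixed-charset case described above (piecewise construction with the serialization primitives, then handing the finished value to from_full_header) can already be sketched with today's email.header.Header, which accepts a different charset per appended chunk; the from_full_header call itself remains hypothetical:

```python
from email.header import Header

# One header value built word by word from two different charsets.
h = Header()
h.append('Straße', charset='iso-8859-1')   # quoted-printable encoded word
h.append('naïve', charset='utf-8')         # base64 encoded word
wire = h.encode()
# The finished wire-format value could then be fed to the (hypothetical)
# BytesHeader.from_full_header constructor.
```

Each word carries its own self-describing charset in the serialized form, which is exactly the "complicated case made possible" above.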


>>> BytesHeader would be exactly the same, with the exception of the signature
>>> for serialize and the fact that it has a 'decode' method rather than an
>>> 'encode' method.  Serialize would be different only in the fact that
>>> it would have an additional keyword parameter, must_be_7bit=True.
>>>        
>> I am not clear on why StringHeader's serialize would not need the
>> must_be_7bit  parameter... or do I misunderstand that
>> StringHeader.serialize produces wire-format data?
>>      
> The latter.  StringHeader serialize does not produce wire-format data,
> it produces text (for example, for display to the user).  If you want
> wire format, you encode the StringHeader and use the resulting BytesHeader
> serialize.
>    

OK, I'm with you here now too.  So it may be nice to have a recursive
operation that would convert String format stuff to Bytes and then to
wire format, in one go, discarding the intermediate Bytes format stuff
along the way to avoid three copies of the data, for simple email
clients that only use the String format interfaces.
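That "one go" operation might look like the following sketch (the function name and shape are invented here, not part of the proposal), which never keeps an intermediate bytes-form header object alive:

```python
from email.header import Header

def header_to_wire(name, value, charset='utf-8'):
    """Hypothetical one-step text -> wire conversion: RFC 2047-encode
    and serialize in a single call, so a String-interface-only client
    never handles a Bytes-form object."""
    try:
        body = value.encode('ascii')   # ASCII passes straight through
    except UnicodeEncodeError:
        body = Header(value, charset=charset).encode().encode('ascii')
    return name.encode('ascii') + b': ' + body + b'\r\n'
```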


>>> The magic of this approach is in those encode/decode methods.
>>>
>>> Encoding a StringHeader would yield a BytesHeader containing the same
>>> data, but encoded per RFC2047 using the specified charset.  Decoding a
>>> BytesHeader would yield a StringHeader with the same data, but decoded to
>>> unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
>>> not the RFC2047 sense) using the specified charset (which would default to
>>> ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to do
>>> with RFC2047 charsets like unknown-8bit is an open question...probably
>>> throw an error).
>>>        
>> Would the encoding to/from StringHeader/BytesHeader preserve the
>> from_full_header  state and value?
>>      
> My thought is no.  Once you encode/decode the header, your program has
> transformed it, and I think it is better to treat the original raw data
> as gone.  The motivation for this is that the 'raw data' of a StringHeader
> is the *text* string used to create it.  Keeping a bytes string 'raw data'
> around as well would get us back into the mess that I developed this
> approach to avoid, where we'd need to specify carefully the difference
> between handling a header whose 'original' raw data was bytes vs string,
> for each of the BytesHeader and StringHeader cases.  Better, I think,
> to put the (small) burden on the application programmer: if you want to
> preserve the original input data, do so by keeping the original object
> around.  Once you mutate the object model, the original raw data for
> the mutated piece is gone.
>
> There are some use-case questions here, though, with regards to
> preservation of as much original information/format as possible, and how
> valuable that is.  I think we'll have to figure that out by examining
> concrete use cases in detail.  (It is not something that the current email
> package supports very well, by the way...headers currently get modified
> significantly in the parse/generate cycle, even without bytes-to-string
> transformations happening.)
>    

Not every transformation is intended to be a change.  Until there is a
change, it would be nice to be able to retain the original byte stream,
for invertibility, without requiring that a simple email client deal
with bytes interfaces for RFC conformant messages.

I hear you regarding the mess... here's a brainstorming idea, tossed
out mostly to get your creative juices flowing in this direction, not
because I think it is "definitely the way to go".  The decode API could,
in addition to your description, have an option to preserve itself and
the decode charset, within the String object... If encode "discovers" a
preserved Bytes object, and the same charset is provided, it would
return the preserved Bytes object, rather than creating a new one.  
There may be no need to drop the Bytes object explicitly; as it seems
the only API for making changes to a Header object is to create a new
one, and substitute the new one for the old one.  Or maybe
from_full_header does a modify.  Or maybe the properties are assignable
(that is not explicitly stated, by the way).  So if there are modify
operations, they should drop the Bytes object.
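A toy rendering of that brainstorm (class names are stand-ins, not the proposed API): decode stashes the source object and charset, and encode hands the preserved object back when the charset matches:

```python
class BytesHeader:
    """Toy stand-in for the proposed bytes-form header."""
    def __init__(self, raw):
        self.raw = raw

    def decode(self, charset='ascii'):
        s = StringHeader(self.raw.decode(charset))
        # the brainstormed option: remember the source and its charset
        s._source, s._source_charset = self, charset
        return s

class StringHeader:
    """Toy stand-in for the proposed text-form header."""
    def __init__(self, value):
        self.value = value
        self._source = self._source_charset = None

    def encode(self, charset='utf-8'):
        # If the original bytes survive and the charset matches, return
        # the preserved object: a perfectly invertible round trip.
        if self._source is not None and charset == self._source_charset:
            return self._source
        return BytesHeader(self.value.encode(charset))
```

Any modify operation on the StringHeader would then null out `_source`, per the last sentence above.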


>>> (Encoding or decoding a Message would cause the Message to recursively
>>> encode or decode its subparts.  This means you are making a complete
>>> new copy of the Message in memory.  If you don't want to do that you
>>> can walk the Message and convert it piece by piece (we could provide a
>>> generator that does this).)
>>>        
>> Walking it piece by piece would allow the old pieces to be discarded, to
>> save total memory consumption, where that is appropriate.
>>
>> Perhaps one generator that would be commonly used, would be to convert
>> headers only, and leave MIME data parts alone, accessing and converting
>> them only with the registered methods?  This would mean that a "complete
>> copy" wouldn't generally be very big, if the data parts were excluded
>> from implicit conversion.  Perhaps the "external storage protocol" might
>> also only be defined for MIME data parts, and walking the tree with this
>> generator would not need to reference the MIME data parts, nor bring
>> them in from "external storage".
>>      
> That's true.  The Bytes and String versions of binary MIME parts,
> which are likely to be the large ones, will probably have a common
> representation for the payload, and could potentially point to the same
> object.  That breaking of the expectation that 'encode' and 'decode'
> return new objects (in analogy to how encode and decode of strings/bytes
> works) might not be a good thing, though.
>    

Well, one generator could provide the expectation that everything is
new; another could provide different expectations.  The differences
between them, and the tradeoffs would be documented, of course, were
both provided.  I'm not convinced that treating headers and data exactly
the same at all times is a good thing... a convenient option at times,
perhaps, but I can see it as a serious inefficiency in many use cases
involving large data.
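The "headers only" generator suggested above can be sketched against today's email package: walk the tree, yield only the parts a conversion pass should touch, and leave binary payloads alone (the filtering policy here is illustrative, not a settled design):

```python
import email

raw = (b"Content-Type: multipart/mixed; boundary=B\n\n"
       b"--B\nContent-Type: text/plain\n\nhello\n"
       b"--B\nContent-Type: application/octet-stream\n"
       b"Content-Transfer-Encoding: base64\n\naGkh\n--B--\n")
msg = email.message_from_bytes(raw)

def convertible_parts(m):
    """Yield only the parts a headers-and-text conversion pass should
    touch, leaving binary MIME payloads untouched in their original
    form (and, potentially, out in external storage)."""
    for part in m.walk():
        if part.is_multipart() or part.get_content_maintype() == 'text':
            yield part

touched = [p.get_content_type() for p in convertible_parts(msg)]
```

The binary part stays in its transfer-encoded form until something explicitly asks for it, so the "complete copy" stays small.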

This deserves a bit more thought/analysis/discussion, perhaps.  More
than I have time for tonight, but I may reply again, perhaps after
others have responded, if they do.

> In any case, text MIME parts have the same bytes vs string issues as
> headers do, and should, IMO, be converted from one to the other on
> encode/decode.
>    

To me, your first phrase implies that they should share common
encode/decode routines, but the second does not follow from it.  I can
clearly see a use case
where your opinion is the right approach, but I think there are use
cases where it might not be... while text MIME parts are generally
smaller than binary MIME parts, that is neither a requirement, nor
always true (think about transferring an XML format database... could be
huge... and is text of sorts -- human decipherable, more easily than hex
dumps, but not what I would call "human readable").


> Another possible approach would be some sort of 'encode/decode on demand'
> system, although that would need to retain a pointer to the original
> object, which might get us into suboptimal reference cycle difficulties.
>    

Hmm.  Brainstorming again.  decode could minimally create the String
format object, with only the Bytes format object and charset parameter
set (from the above brainstorming idea).  Then the real decoding could
be done if the properties are accessed.  If the properties are not
accessed (because the client/application makes its decisions based on
access to other components of the email), the decoding need never be
done for some objects.  Perhaps this would also neatly deal with my
desire to delay the decode of MIME data parts as well?
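That decode-on-demand idea reduces to a property that holds only the raw material until first access; a minimal sketch (the class name is invented):

```python
class LazyStringHeader:
    """Brainstormed decode-on-demand: hold only the raw bytes and the
    charset; the real decode runs the first time .value is read."""
    def __init__(self, raw, charset):
        self._raw, self._charset = raw, charset
        self._value = None

    @property
    def value(self):
        if self._value is None:
            # deferred work: never runs if the value is never accessed
            self._value = self._raw.decode(self._charset)
        return self._value
```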

> These bits are implementation details, though, and don't affect the API
> design.
>    

Well, one impact of the above brainstorming would be an interface to
create the StringHeader containing the BytesHeader and charset
parameters.  Or maybe that would be a private interface, not considered
to be part of the API?


>>> raw_header would be the data passed in to the constructor if
>>> from_full_header is used, and None otherwise.  If encode/decode call
>>> the regular constructor, then this attribute would also act as a flag
>>> as to whether or not the header was constructed from raw input data
>>> or via program.
>>>
>>>        
>> This _implies_ that  from_full_header always accepts raw data bytes...
>> even for the StringHeader.  And that implies the need for an implicit
>> decode, and therefore, perhaps a charset parameter?  No, not a charset
>> parameter, since they are explicitly contained in the header values.
>>      
> Your confusion was my confusing use of the term 'raw data' to mean
> whatever was input to the from_full_header constructor, which is
> bytes for a BytesHeader and text for a StringHeader.
>    

If we are going to invent a new "string format raw data" element, maybe
we should invent a term to describe it, also... maybe "raw data" should
be split into "raw bytes" and "raw string", and "raw data" become a
synonym for "raw bytes", as that is what it was historically?


--
Glenn
------------------------------------------------------------------------
“Everyone is entitled to their own opinion, but not their own facts. In
turn, everyone is entitled to their own opinions of the facts, but not
their own facts based on their opinions.” -- Guy Rocha, retiring NV
state archivist
_______________________________________________
Email-SIG mailing list
[hidden email]
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Thoughts on the general API, and the Header API.

Glenn Linderman-3
On approximately 1/25/2010 8:10 PM, came the following characters from
the keyboard of Glenn Linderman:

>> That's true.  The Bytes and String versions of binary MIME parts,
>> which are likely to be the large ones, will probably have a common
>> representation for the payload, and could potentially point to the same
>> object.  That breaking of of the expectation that 'encode' and 'decode'
>> return new objects (in analogy to how encode and decode of strings/bytes
>> works) might not be a good thing, though.
>
> Well, one generator could provide the expectation that everything is
> new; another could provide different expectations.  The differences
> between them, and the tradeoffs would be documented, of course, were
> both provided.  I'm not convinced that treating headers and data
> exactly the same at all times is a good thing... a convenient option
> at times, perhaps, but I can see it as a serious inefficiency in many
> use cases involving large data.
>
> This deserves a bit more thought/analysis/discussion, perhaps.  More
> than I have time for tonight, but I may reply again, perhaps after
> others have responded, if they do.

I guess no one else is responding here at the moment.  Read the ideas
below, and then afterward, consider building the APIs you've suggested
on top of them.  And then, with the full knowledge that the messages may
be either in fast or slow storage, I think that you'll agree that
converting the whole tree in one swoop isn't always appropriate... all
headers, probably could be.  Data, because of its size, should probably
be done on demand.


In earlier discussions about the registry, there was the idea of having
a registry for transport encoding handling, and a registry for MIME
encoding handling.  There were also vague comments about doing an
external storage protocol "somehow", but it was a vague concept to be
defined later, or at least I don't recall any definitions.

Given a raw bytes representation of an incoming email, mail servers need
to choose how to handle it... this may need to be a dynamic choice based
on current server load, the obvious static server resources, and
configured limits.

Unfortunately, the SMTP protocol does not require predeclaration of the
size of the incoming DATA part, so servers cannot enforce size limits
until they are exceeded.  So as the data streams in, a dynamic
adjustment to the handling strategy might be appropriate.  Gateways may
choose to route messages, and stall the input until the output channel
is ready to receive it, and basically "pass through" the data, with
limited need to buffer messages on disk... unless the output channel
doesn't respond... then they might reject the message.  An SMTP server
should be willing to act as a store-and-forward server, and also must do
individual delivery of messages to each RCPT (or at least one per
destination domain), so must have a way of dealing with large messages,
probably via disk buffering.  The case of disk buffering and retrying
generally means that the whole message, not just the large data parts,
must be stored on disk, so the external storage protocol should be able
to deal with that case.

The minimal external storage format capability is to store the received
bytestream to disk, associate it with the envelope information, and be
able to retrieve it in whole later.  This would require having the whole
thing in RAM at those two points in time, however, and doesn't solve the
real problem.  Incremental writing and reading to the external storage
would be much more useful.  Even more useful, would be "partially
parsed" seek points.

An external storage system that provides "partially parsed" information
could include:

1) envelope information.  This section is useful to SMTP servers, but
not other email tools, so should be optional.  This could be a copy of
the received RCPT command texts, complete with CRLF endings.

2) header information.  This would be everything between DATA and the
first CRLF CRLF sequence.

3) data.  Pre-MIME this would simply be the rest of the message, but
post-MIME it would be usefully more complex.  If MIME headers can be
observed and parsed as the data passes through, then additional metadata
could be saved that could enhance performance of the later processing
steps.  Such additional metadata could include the beginning of each
MIME part, the end of the headers for that part, and the end of the data
for that part.

The result of saving that information would mean that minimal data (just
headers) would need to be read in to create a tree representing the email,
the rest could be left in external storage until it is accessed... and
then obtained directly from there when needed, and converted to the form
required by the request... either the whole part, or some piece in a buffer.
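The "partially parsed seek points" idea amounts to recording offsets while the bytes stream through; a minimal sketch (the function and the shape of the index are invented for illustration):

```python
def index_message(raw, boundary=None):
    """Record seek points for a raw message: where the top-level
    headers end, and where each MIME boundary line begins, so later
    passes can seek straight to the piece they need instead of
    re-reading the whole bytestream."""
    marks = {}
    eoh = raw.find(b'\n\n')            # end of top-level headers
    marks['headers'] = (0, eoh)
    marks['body_start'] = eoh + 2
    if boundary is not None:
        delim, pos, offsets = b'--' + boundary, 0, []
        while True:
            pos = raw.find(delim, pos)
            if pos < 0:
                break
            offsets.append(pos)
            pos += len(delim)
        marks['part_offsets'] = offsets
    return marks
```

With such an index stored alongside the bytestream, building the tree needs only the header regions; payloads stay in external storage until accessed.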

So there could be a variety of external storage systems... one that
stores in memory, one that stores on disk per the ideas above, and a
variety that retain some amount of cached information about the email,
even though they store it all on disk.  Sounds like this could be a
plug-in, or an attribute of a message object creation.

But to me, it sounds like the foundation upon which the whole email lib
should be built, not something that is shoveled in later.

A further note about access to data parts... clearly "data for the whole
MIME part" could be provided, but even for a single part that could be
large.  So access to smaller chunks might be desired.

The data access/conversion functions, therefore, should support a
buffer-at-a-time access interface.  Base64 supports random access
easily, unless it contains characters outside the base64 alphabet that
are to be ignored, which could throw off the size calculations.  So maybe
providing sequential buffer-at-a-time access with rewind is the best that
can be done -- quoted-printable doesn't support random access very well, and
neither would some sort of compression or encryption technique -- they
usually like to start from the beginning -- and those are the sorts of
things that I would consider likely to be standardized in the future, to
reduce the size of the payload, and to increase the security of the payload.
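A sequential, rewindable, buffer-at-a-time reader over a base64 part might look like this sketch (the `fetch` storage callback and class name are assumptions, not an existing API):

```python
import base64

class ChunkReader:
    """Sequential buffer-at-a-time access, with rewind, over a
    base64-encoded part.  A sketch: `fetch` is an assumed storage
    callback returning an iterator of raw encoded chunks."""

    def __init__(self, fetch):
        self._fetch = fetch
        self.rewind()

    def rewind(self):
        # Decoding restarts from the beginning -- the best we can
        # promise once ignorable characters (CRLF) are in the stream.
        self._chunks = self._fetch()
        self._pending = b''

    def read(self):
        """Return the next decoded buffer, or b'' at end of part."""
        while True:
            raw = next(self._chunks, b'')
            if not raw and not self._pending:
                return b''
            # Drop characters outside the alphabet (line breaks), then
            # decode only the 4-byte-aligned prefix; carry the rest.
            data = self._pending + raw.replace(b'\r', b'').replace(b'\n', b'')
            usable = len(data) - len(data) % 4
            self._pending = data[usable:]
            if usable or not raw:
                return base64.b64decode(data[:usable])
```

The caller sees only decoded buffers and a rewind; the alignment bookkeeping that makes "random access" hard stays hidden inside.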

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

_______________________________________________
Email-SIG mailing list
[hidden email]
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: Thoughts on the general API, and the Header API.

Glenn Linderman-3
Another thought occurred to me regarding this "Access API"... an IMAP
implementation could defer obtaining data parts from the server until
requested, under the covers of this same API.  Of course, for devices
with limited resources, that would probably be the optimal approach, but
for devices with lots of resources, an IMAP implementation might also
want to offer other options.
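Deferring the fetch under the same access API could be as simple as this sketch (names are illustrative; `fetcher` stands in for, e.g., an IMAP FETCH of one BODY section):

```python
class LazyPayload:
    """Defer fetching a part's body until first access.  A sketch of
    the idea only: `fetcher` is an assumption, not an existing API."""

    def __init__(self, fetcher):
        self._fetcher = fetcher
        self._data = None

    def get_payload(self):
        if self._data is None:      # fetch once, on demand
            self._data = self._fetcher()
        return self._data
```

A resource-rich implementation could swap in an eager fetcher behind the same interface, which is the point: callers never learn where the bytes live.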


On approximately 1/28/2010 6:20 PM, came the following characters from
the keyboard of Glenn Linderman:

> On approximately 1/25/2010 8:10 PM, came the following characters from
> the keyboard of Glenn Linderman:
>>> That's true.  The Bytes and String versions of binary MIME parts,
>>> which are likely to be the large ones, will probably have a common
>>> representation for the payload, and could potentially point to the same
>>> object.  That breaking of the expectation that 'encode' and 'decode'
>>> return new objects (in analogy to how encode and decode of
>>> strings/bytes
>>> works) might not be a good thing, though.
>>
>> Well, one generator could provide the expectation that everything is
>> new; another could provide different expectations.  The differences
>> between them, and the tradeoffs would be documented, of course, were
>> both provided.  I'm not convinced that treating headers and data
>> exactly the same at all times is a good thing... a convenient option
>> at times, perhaps, but I can see it as a serious inefficiency in many
>> use cases involving large data.
>>
>> This deserves a bit more thought/analysis/discussion, perhaps.  More
>> than I have time for tonight, but I may reply again, perhaps after
>> others have responded, if they do.
>
> I guess no one else is responding here at the moment.  Read the ideas
> below, and then afterward, consider building the APIs you've suggested
> on top of them.  And then, with the full knowledge that the messages
> may be either in fast or slow storage, I think that you'll agree that
> converting the whole tree in one swoop isn't always appropriate... all
> headers, probably could be.  Data, because of its size, should
> probably be done on demand.
>
>
> In earlier discussions about the registry, there was the idea of
> having a registry for transport encoding handling, and a registry for
> MIME encoding handling.  There were also vague comments about doing an
> external storage protocol "somehow", but it was a vague concept to be
> defined later, or at least I don't recall any definitions.
>
> Given a raw bytes representation of an incoming email, mail servers
> need to choose how to handle it... this may need to be a dynamic
> choice based on current server load, as well as the obvious static
> server resources, as well as configured limits.
>
> Unfortunately, the SMTP protocol does not require predeclaration of
> the size of the incoming DATA part, so servers cannot enforce size
> limits until they are exceeded.  So as the data streams in, a dynamic
> adjustment to the handling strategy might be appropriate.  Gateways
> may choose to route messages, and stall the input until the output
> channel is ready to receive it, and basically "pass through" the data,
> with limited need to buffer messages on disk... unless the output
> channel doesn't respond... then they might reject the message.  An
> SMTP server should be willing to act as a store-and-forward server,
> and also must do individual delivery of messages to each RCPT (or at
> least one per destination domain), so must have a way of dealing with
> large messages, probably via disk buffering.  The case of disk
> buffering and retrying generally means that the whole message, not
> just the large data parts, must be stored on disk, so the external
> storage protocol should be able to deal with that case.
>
> The minimal external storage format capability is to store the
> received bytestream to disk, associate it with the envelope
> information, and be able to retrieve it in whole later.  This would
> require having the whole thing in RAM at those two points in time,
> however, and doesn't solve the real problem.  Incremental writing and
> reading to the external storage would be much more useful.  Even more
> useful, would be "partially parsed" seek points.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: Thoughts on the general API, and the Header API.

R. David Murray
On Mon, 01 Feb 2010 11:06:34 -0800, Glenn Linderman <[hidden email]> wrote:
> Another thought occurred to me regarding this "Access API"... an IMAP
> implementation could defer obtaining data parts from the server until
> requested, under the covers of this same API.  Of course, for devices
> with limited resources, that would probably be the optimal approach, but
> for devices with lots of resources, an IMAP implementation might also
> want to offer other options.

I like your thought about treating memory as just another backing
store and designing the API accordingly.  I will keep it in mind
as I go along.

> On approximately 1/28/2010 6:20 PM, came the following characters from
> the keyboard of Glenn Linderman:
> > I guess no one else is responding here at the moment.  Read the ideas
> > below, and then afterward, consider building the APIs you've suggested
> > on top of them.  And then, with the full knowledge that the messages
> > may be either in fast or slow storage, I think that you'll agree that
> > converting the whole tree in one swoop isn't always appropriate... all
> > headers, probably could be.  Data, because of its size, should
> > probably be done on demand.

I hope the fact that no one is responding means that they think I'm at
least on the right track :)

I've committed a skeleton of the new Header classes to the
lp:python-email6 repository, along with my testing framework.  More test
cases to come.

--David

Re: Thoughts on the general API, and the Header API.

Barry Warsaw
In reply to this post by R. David Murray
On Jan 25, 2010, at 03:10 PM, R. David Murray wrote:

>After setting it aside for a bit, I had what I think is a little epiphany:
>our need is to deal with messages (and parts of messages) that could be
>in either bytes form or text form.  The things we need to do with them
>are similar regardless of their form, and so we have been talking about a
>"dual API": one method for bytes and a parallel method for text.
>
>What if we recognize that we have two different data types, bytes messages
>and text messages?  Then the "dual API" becomes a more uniform, almost
>single, API, but with two possible underlying data types.

I really like this, especially because it kind of mirrors the transformations
between bytes and strings.  I have one suggestion that might clean up the API
and make some other things possible or easier.

>In the context specifically of the proposed new Header object, I propose
>that we have a StringHeader and a BytesHeader, and an API that looks
>something like this:
>
>StringHeader
>
>    properties:
>        raw_header (None unless from_full_header was used)
>        raw_name
>        raw_value
>        name
>        value
>
>    __init__(name, value)
>    from_full_header(header)
>    serialize(max_line_len=78,
>              newline='\n',
>              use_raw_data_if_possible=False)
>    encode(charset='utf-8')
>
>BytesHeader would be exactly the same, with the exception of the signature
>for serialize and the fact that it has a 'decode' method rather than an
>'encode' method.  Serialize would be different only in the fact that
>it would have an additional keyword parameter, must_be_7bit=True.

The one thing that I think is unwieldy is the signature of the serialize() and
deserialize() methods.  I've been thinking about "policy" objects that can be
used to control formatting and I think that perhaps substituting an API like
this might work:

serialize(policy=None)
deserialize(policy=None)

The idea is that the policy object would describe how and when to fold header
lines, what EOL characters to use, but also such choices such as whether to
use raw data if possible, and must_be_7bit.  A first order improvement is that
it would be much easier to pass the policy object up and down the call stack
than a slew of independent parameters.

Further, it might be interesting to allow policy objects in the generator,
which would control default formatting options, and on Message objects in the
hierarchy which would control formatting for that Message and all the ones
below it in the tree (unless overridden by a policy object on a sub-message).
Maybe headers themselves also support policy objects.

I think this could be interesting for supporting output of the same message
tree to different destinations.  E.g. if the message is being output directly
to an SMTP server, you'd stick a policy object on there that had the RFC 5321
required EOL, but you'd have a different policy object for output to a web
server.
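A minimal sketch of such a policy object (attribute names here are illustrative, though the stdlib's later email.policy module ended up with a similar shape):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Policy:
    """Bundle of serialization/parsing choices (illustrative sketch)."""
    linesep: str = '\n'
    max_line_length: int = 78
    use_raw_data_if_possible: bool = False
    must_be_7bit: bool = True
    strict: bool = False    # parsing: raise on defects vs. record them

    def clone(self, **kw):
        # Derive a tweaked policy instead of mutating shared state.
        return replace(self, **kw)

# Premade policies for common destinations:
SMTP = Policy(linesep='\r\n')                      # RFC 5321 requires CRLF
HTTP = Policy(linesep='\r\n', max_line_length=0)   # no folding
```

Because policies are immutable and cloneable, a sub-message can derive a variant without disturbing the policy attached higher up the tree.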

>(Encoding or decoding a Message would cause the Message to recursively
>encode or decode its subparts.  This means you are making a complete
>new copy of the Message in memory.  If you don't want to do that you
>can walk the Message and convert it piece by piece (we could provide a
>generator that does this).)

It sounds like there's overlap between the encoding/decoding API and the
serialize/deserialize API.  Are you thinking along those lines?  Differences
in signature could be papered over with the policy objects.

>Subclasses of these classes for structured headers would have additional
>methods that would return either specialized object types (datetimes,
>address objects) or bytes/strings, and these may or may not exist in
>both Bytes and String forms (that depends on the use cases, I think).

Is it crackful to think about the policy object also containing a MIME type
registry for conversion to the specialized object types?

>So, those are my thoughts, and I'm sure I haven't thought of all the
>corner cases.  The biggest question is, does it seem like this general
>scheme is worth pursuing?

Definitely!  I think it's a great idea.

-Barry


Re: Thoughts on the general API, and the Header API.

Glenn Linderman-3
On approximately 2/19/2010 6:23 PM, came the following characters from
the keyboard of Barry Warsaw:
> Is it crackful to think about the policy object also containing a MIME
> type
> registry for conversion to the specialized object types?
>    

While the MIME type registry (and other registries) were (I think)
conceptualized as global objects, having them be "just objects" means
you could have as many as you want, for different purposes, and means
that you could pass them in to the encoding and decoding methods, and
might even solve issues with different threads wanting different
registries concurrently... they could have them.

I like the idea, although clearly it needs to be fleshed out a bit.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: Thoughts on the general API, and the Header API.

R. David Murray
In reply to this post by Barry Warsaw
On Fri, 19 Feb 2010 21:23:52 -0500, Barry Warsaw <[hidden email]> wrote:
> On Jan 25, 2010, at 03:10 PM, R. David Murray wrote:
> The one thing that I think is unwieldy is the signature of the serialize() and
> deserialize() methods.  I've been thinking about "policy" objects that can be
> used to control formatting and I think that perhaps substituting an API like
> this might work:
>
> serialize(policy=None)
> deserialize(policy=None)

I love the idea of policy objects.  I'm clear on what they do for
serialization.  What do you visualize them doing for deserialization
(parsing)?

> I think this could be interesting for supporting output of the same message
> tree to different destinations.  E.g. if the message is being output directly
> to an SMTP server, you'd stick a policy object on there that had the RFC 5321
> required EOL, but you'd have a different policy object for output to a web
> server.

Yes, this was my intent in providing the newline and max_line_length
parameters, but a policy object is a much cleaner way to do that.
Especially since we can then provide premade policy objects to support
common output scenarios such as SMTP and HTTP.

> >(Encoding or decoding a Message would cause the Message to recursively
> >encode or decode its subparts.  This means you are making a complete
> >new copy of the Message in memory.  If you don't want to do that you
> >can walk the Message and convert it piece by piece (we could provide a
> >generator that does this).)
>
> It sounds like there's overlap between the encoding/decoding API and the
> serialize/deserialize API.  Are you thinking along those lines?  Differences
> in signature could be papered over with the policy objects.

No, I'm thinking of encode/decode as exactly parallel to encode/decode
on string/bytes.  In my prototype API, for example,  StringHeader
values are unicode, and do *not* contain any rfc2047 encoded words.
decoding a BytesHeader decodes the RFC2047 stuff.  Contrariwise, encoding
a StringHeader does the RFC2047 encoding (using whatever charset you
specify or utf-8 by default).  (This means you lose the ability to piece
together headers from bits in different charsets, but what is the actual
use case for that?  And in any case, there will be a way to get at the
underlying header-translation machinery to do it if you really need to.)

Serializing a StringHeader, in my design, produces *text* not bytes.
This is to support the use case of using the email package to manipulate
generic 'name:value // body' formatted data in unicode form (presumably
utf-8 on disk).

To get something that is RFC compliant, you have to encode the StringMessage
object (and thus the headers) to a BytesMessage object, and then
serialize that.  (That's where the incremental encoder may be needed).

The advantage of doing it this way is we support all possible combinations
of input and output format via two strictly parallel interfaces and
their encode/decode methods.
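The RFC 2047 machinery that such an encode/decode pair would wrap already exists in the current package; for example:

```python
from email.header import Header, decode_header, make_header

# encode(): a unicode header value becomes RFC 2047 encoded words
wire = Header('café', charset='utf-8').encode()

# decode(): encoded words become a unicode value with none left in it
value = str(make_header(decode_header(wire)))
```

The proposed StringHeader/BytesHeader API would make these two directions methods on the header objects themselves rather than free-standing helpers.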

Hmm.  It occurs to me now that another possible way to do this would be to
put the output data format into the policy object.  Then you could
serialize a StringMessage object, and it would know to do the string
to bytes conversion as it went along doing the serialization.
I don't think that would eliminate the need for encode/decode methods:
first, that's what serialize would use when converting for output,
and second, you will sometimes want to manipulate, eg, individual
header values, and it seems like the natural way to do that is something like
this:

    mybytesmessage['subject'].decode().value

You don't want to serialize using a to-string policy object, because
what you want is the decoded value, and you can't do

    mybytesmessage['subject'].value.decode()

because you have to rfc2047 decode....

Hmm.  Here's a thought: could we write an rfc2047 codec?  Then we
could use that second, more python-intuitive form like this:

    mybytesmessage['subject'].value.decode('mimeheader')

Well, looking at that I'm not sure it's better :(
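For what it's worth, such a codec can be registered with the stdlib codec machinery; a sketch, using the hypothetical 'mimeheader' name from above and the existing email.header helpers for the actual RFC 2047 work:

```python
import codecs
from email.header import decode_header, make_header

def _rfc2047_decode(data, errors='strict'):
    # Raw header bytes in, fully decoded unicode out.
    text = bytes(data).decode('ascii', errors)
    return str(make_header(decode_header(text))), len(data)

def _search(name):
    if name == 'mimeheader':   # the hypothetical codec name from above
        return codecs.CodecInfo(None, _rfc2047_decode, name='mimeheader')
    return None

codecs.register(_search)
```

With that registered, `somebytes.decode('mimeheader')` works as written, though as noted it is debatable whether this beats an explicit decode() method on the header object.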

> >Subclasses of these classes for structured headers would have additional
> >methods that would return either specialized object types (datetimes,
> >address objects) or bytes/strings, and these may or may not exist in
> >both Bytes and String forms (that depends on the use cases, I think).
>
> Is it crackful to think about the policy object also containing a MIME type
> registry for conversion to the specialized object types?

Oooh.  I *like* that idea.  I dislike global registries.  Like Glenn
says, this could make a lot of things safer threading-wise, and
certainly makes things more flexible.  I was worrying that there
might be a case of a complex app needing the registry to have
different states in different parts of the app, and while I don't
have an actual use-case in mind, this would make that a non-problem.

> >So, those are my thoughts, and I'm sure I haven't thought of all the
> >corner cases.  The biggest question is, does it seem like this general
> >scheme is worth pursuing?
>
> Definitely!  I think it's a great idea.

Thanks.  The repository (lp:python-email6) contains the beginnings
of the implementation of the StringHeader and BytesHeader classes.
I'm currently working on fleshing out the part where it says "this
is a temporary hack, need to handle folding encoded words", which is,
needless to say, a bit complicated...I may set that aside for a bit and
work on the policy object stuff.  Though I also need to put a bunch more
tests into the test database...

--David

Re: Thoughts on the general API, and the Header API.

Barry Warsaw
On Feb 20, 2010, at 12:50 AM, R. David Murray wrote:

>> serialize(policy=None)
>> deserialize(policy=None)
>
>I love the idea of policy objects.  I'm clear on what they do for
>serialization.  What do you visualize them doing for deserialization
>(parsing)?

As Glenn points out, they could contain the MIME type registry for producing
more specific instance types.  I also think they'll serve as a container for
any other configuration variables that we'll find convenient for controlling
the parsing process.  E.g. we might enable strict parsing this way.  It's
basically just a hand-wavy way of saying, let's define the API in terms of the
policy object to keep our signatures small and sane (at the cost of course of
making the policy objects huge and insane ;).

>Yes, this was my intent in providing the newline and max_line_length
>parameters, but a policy object is a much cleaner way to do that.
>Especially since we can then provide premade policy objects to support
>common output scenarios such as SMTP and HTTP.

+1

>> It sounds like there's overlap between the encoding/decoding API and the
>> serialize/deserialize API.  Are you thinking along those lines?  Differences
>> in signature could be papered over with the policy objects.
>
>No, I'm thinking of encode/decode as exactly parallel to encode/decode
>on string/bytes.  In my prototype API, for example,  StringHeader
>values are unicode, and do *not* contain any rfc2047 encoded words.
>decoding a BytesHeader decodes the RFC2047 stuff.  Contrariwise, encoding
>a StringHeader does the RFC2047 encoding (using whatever charset you
>specify or utf-8 by default).

Makes sense, thanks.  Yep, we probably don't need the policy API for that.  It
makes me wonder whether 'serialize' and 'deserialize' are the right names for
functionality we've traditionally called 'parsing' and 'generating'.  But we
can paint that bikeshed later.

>(This means you lose the ability to piece together headers from bits in
>different charsets, but what is the actual use case for that?  And in any
>case, there will be a way to get at the underlying header-translation
>machinery to do it if you really need to.)

The degenerate case is to mix ASCII and non-ASCII header chunks, which I think
is fairly common.  Of course the RFCs allow it, so we have to support it, even
if doing so is via a different API.

>Serializing a StringHeader, in my design, produces *text* not bytes.
>This is to support the use case of using the email package to manipulate
>generic 'name:value // body' formatted data in unicode form (presumably
>utf-8 on disk).
>
>To get something that is RFC compliant, you have to encode the StringMessage
>object (and thus the headers) to a BytesMessage object, and then
>serialize that.  (That's where the incremental encoder may be needed).
>
>The advantage of doing it this way is we support all possible combinations
>of input and output format via two strictly parallel interfaces and
>their encode/decode methods.

This all sounds great.

>Hmm.  It occurs to me now that another possible way to do this would be to
>put the output data format into the policy object.

Indeed, that's an interesting idea.

>Then you could serialize a StringMessage object, and it would know to do the
>string to bytes conversion as it went along doing the serialization.  I don't
>think that would eliminate the need for encode/decode methods: first, that's
>what serialize would use when converting for output, and second, you will
>sometimes want to manipulate, eg, individual header values, and it seems like
>the natural way to do that is something like this:
>
>    mybytesmessage['subject'].decode().value
>
>You don't want to serialize using a to-string policy object, because
>what you want is the decoded value, and you can't do
>
>    mybytesmessage['subject'].value.decode()
>
>because you have to rfc2047 decode....

I'm with ya!

>Hmm.  Here's a thought: could we write an rfc2047 codec?  Then we
>could use that second, more python-intuitive form like this:
>
>    mybytesmessage['subject'].value.decode('mimeheader')
>
>Well, looking at that I'm not sure it's better :(

Yeah.

>Thanks.  The repository (lp:python-email6) contains the beginnings
>of the implementation of the StringHeader and BytesHeader classes.
>I'm currently working on fleshing out the part where it says "this
>is a temporary hack, need to handle folding encoded words", which is,
>needless to say, a bit complicated...I may set that aside for a bit and
>work on the policy object stuff.  Though I also need to put a bunch more
>tests into the test database...

+1
-Barry


Re: Thoughts on the general API, and the Header API.

R. David Murray
On Sun, 21 Feb 2010 14:07:32 -0500, Barry Warsaw <[hidden email]> wrote:

> On Feb 20, 2010, at 12:50 AM, R. David Murray wrote:
>
> >> serialize(policy=None)
> >> deserialize(policy=None)
> >
> >I love the idea of policy objects.  I'm clear on what they do for
> >serialization.  What do you visualize them doing for deserialization
> >(parsing)?
>
> As Glenn points out, they could contain the MIME type registry for producing
> more specific instance types.  I also think they'll serve as a container for

Arg.  I was of course writing that email late at night and sleep
deprived or I'd have noticed that :)

> any other configuration variables that we'll find convenient for controlling
> the parsing process.  E.g. we might enable strict parsing this way.  It's
> basically just a hand-wavy way of saying, let's define the API in terms of
> the policy object to keep our signatures small and sane (at the cost of course
> of making the policy objects huge and insane ;).

Sounds good.

> Makes sense, thanks.  Yep, we probably don't need the policy API for that.  It
> makes me wonder whether 'serialize' and 'deserialize' are the right names for
> functionality we've traditionally called 'parsing' and 'generating'.  But we
> can paint that bikeshed later.

Yes.  I'm thinking of serialization as the replacement for generating,
with the idea that the 'generator' API at the top level will be
convenience functions wrapped around the serialization API.  But we can
deal with that when I get up to that level.

> >(This means you lose the ability to piece together headers from bits in
> >different charsets, but what is the actual use case for that?  And in any
> >case, there will be a way to get at the underlying header-translation
> >machinery to do it if you really need to.)
>
> The degenerate case is to mix ASCII and non-ASCII header chunks, which I think
> is fairly common.  Of course the RFCs allow it, so we have to support it, even
> if doing so is via a different API.

I'd better talk about what I'm thinking about in that regard.  My notion
is that the serializer will actually try to minimize the amount of
encoded text (modulo caring about how long the encoded bits are when
the RFC2047 chrome is included) and putting anything that can be put in
ascii in ascii.  But also using us-ascii encoded words to do things like
wrap tokens that won't fit in 77 chars and even to preserve whitespace
in unstructured headers in certain situations (this bit would be the
more controversial bit, I think).  So combining ascii chunks and chunks
encoded in the charset specified to the encode method happens naturally.
You could also modify the value of a BytesHeader, stuffing into it ascii
or encoded words created 'manually' using a low level function I plan
to expose.  So I think that's the 'different API', and it fits in
pretty logically.
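The existing Header class already combines ASCII chunks with encoded-word chunks along these lines:

```python
from email.header import Header, decode_header, make_header

h = Header('Hello', charset='us-ascii')    # plain ASCII chunk
h.append('Wörld', charset='iso-8859-1')    # chunk needing encoding
wire = h.encode(maxlinelen=77)
```

Only the non-ASCII chunk becomes an encoded word; the ASCII chunk is emitted as-is, which is the "minimize the amount of encoded text" behaviour described above.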

If you want to control *exactly* how the encoded words appear, then I
think it would be reasonable to also require that you do your own header
wrapping, which means using the low level tools to build the encoded
words, putting in the appropriate folding yourself, adding the fieldname
on the front, passing the result to BytesHeader.from_full_header,
and using a policy that says to use the raw header data.

--David