rfc822 parser (the elephant has landed)

R. David Murray
Things have been a bit disrupted in my life over the past month (a family
tragedy).  Fortunately for this group one of my ways of coping is to write
code, so I did manage to do a fair bit on the email6 project; I just haven't
been keeping up with publishing about it consistently.  I did write one other
blog post before today's.  Here are the two links:

    http://www.bitdance.com/blog/2011/05/23_01_Email6_Headers_and_Header_Classes/
    http://www.bitdance.com/blog/2011/06/08_01_Email6_RFC822_Parser/

The big thing is an RFC822 parser.  I should probably have asked for advice
here before plunging into it, but it seemed reasonably straightforward when I
started :).  And it still seems simple in outline, just complex in details.

Take a look and give me whatever feedback you've got.

--
R. David Murray           http://www.bitdance.com
_______________________________________________
Email-SIG mailing list
[hidden email]
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: rfc822 parser (the elephant has landed)

vikas ruhil
Hey, I am looking for a web-based mail service for the FOSS community, something like Gmail. Can anybody suggest where I should start? Should I use Sendmail, Postfix, or the Lamson mail server? Please help.


Re: rfc822 parser (the elephant has landed)

Barry Warsaw
In reply to this post by R. David Murray
On Jun 08, 2011, at 02:28 PM, R. David Murray wrote:

>Things have been a bit disrupted in my life over the past month (a family
>tragedy).

I'm very sorry to hear this, David.  My thoughts are with you.

As always, thanks for your amazing work on email6.  You are my hero.
Comments:

* Changing the __setitem__ API.  I've always thought about this as a pure
  convenience, and that appending was the most convenient semantics.  Other
  methods, e.g. replace_header() should be included to provide the range of
  semantics that people want.  Then we'd just pick one and alias it to
  __setitem__.  I'm mixed as to whether appending still is the most convenient
  alias, since in my own code I often `del msg[header]; msg[header] = foo`.
  But that also changes the header order so it's not a perfect replacement.

* Unique headers: is this controlled or influenced by a policy?  For example,
  duplicate Subjects might be disallowed by RFC 5322, but could conceivably be
  allowed (or at least not prohibited) by other email-like protocols.

  Also, while some fields like CC allow only one occurrence, it can contain
  multiple values in that single field.  Is it totally insane to say that
  `msg['cc'] = 'address'` would append `address` to the existing value?  It
  probably is, but having to do that manually also kind of sucks.

  Some headers have other constraints (RFC 5322, §3.6).  For example
  Message-ID can technically appear zero times, but "SHOULD be present".  Part
  of me thinks it should be out of scope for email6 to enforce this, and I'm
  not sure where that would get enforced anyway, but I'm just wondering if
  you've thought about that.

* Datetimes: \o/.  It will be awesome when I can `msg['date'] = a_datetime`.
  While it does seem reasonable that a naive datetime uses -0000, it should
  also be very easy for folks to add a Date header that references the local
  timezone, since I suspect that will be a more common use case than UTC.  I
  don't know what the answer for that is though.

* As for header parsing, have you looked at the pyparsing module?  I don't
  write many parsers, and have no direct experience with pyparsing, but I keep
  hearing really good things about it.  OTOH, it's not in the stdlib, so it
  would present problems if email6 were to adopt it.  Still, I don't envy this
  part of the job, and I sympathize with the rabbit-hole effect of "just one
  more little thing..." ;)  Oh, and I'm just blown away impressed by the work
  you've done on the parser.

* Are there operations on Groups and Mailboxes?  E.g. in your example, I see
  that you added `[hidden email]` to the To header by string
  concatenation.  What if for example, I had a number of addresses that I
  wanted to combine into a Reply-To header (which RFC 5322 says I can only
  have one of).  Would I be able to do something like the following:

  >>> msg['reply_to'].mailboxes.append('[hidden email]')

  and have the printed representation of the message look correct?  Ah, maybe
  something like your last example in the What's Missing section covers this.

* Oooh!  Your example has an `== None` which should probably be `is None` :)
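
The append-versus-replace trade-off in the first bullet can be seen with the existing stdlib `email.message.Message` class.  This is a minimal sketch of that class's current behaviour, not of the email6 API under discussion; the addresses are placeholders.

```python
from email.message import Message

msg = Message()
msg['From'] = 'alice@example.com'
msg['Subject'] = 'first'
msg['To'] = 'bob@example.com'

# __setitem__ appends: a second assignment adds a second Subject header.
msg['Subject'] = 'second'
assert msg.get_all('Subject') == ['first', 'second']

# del removes every occurrence; re-adding puts the header at the end,
# which is the reordering mentioned above.
del msg['Subject']
msg['Subject'] = 'third'
assert msg.keys() == ['From', 'To', 'Subject']

# replace_header() swaps the value in place and keeps the position.
msg.replace_header('To', 'carol@example.com')
assert msg.keys() == ['From', 'To', 'Subject']
assert msg['To'] == 'carol@example.com'
```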

Really, *really* fantastic stuff.
-Barry


Modoboa (was Re: rfc822 parser (the elephant has landed))

Barry Warsaw
In reply to this post by vikas ruhil
On Jun 09, 2011, at 12:05 AM, vikas ruhil wrote:

>Hey, I am looking for a web-based mail service for the FOSS community,
>something like Gmail. Can anybody suggest where I should start? Should I use
>Sendmail, Postfix, or the Lamson mail server? Please help.

Modoboa perhaps?

http://modoboa.org/

-Barry


Re: rfc822 parser (the elephant has landed)

R. David Murray
In reply to this post by Barry Warsaw
On Wed, 08 Jun 2011 16:48:50 -0400, Barry Warsaw <[hidden email]> wrote:
> * Changing the __setitem__ API.  I've always thought about this as a pure
>   convenience, and that appending was the most convenient semantics.  Other
>   methods, e.g. replace_header() should be included to provide the range of
>   semantics that people want.  Then we'd just pick one and alias it to
>   __setitem__.  I'm mixed as to whether appending still is the most convenient
>   alias, since in my own code I often `del msg[header]; msg[header] = foo`.
>   But that also changes the header order so it's not a perfect replacement.

Yeah, it would be really nice if setting (say) 'To' replaced it, but
setting (say) 'Resent-To' appended.  But that way lies chaos :)

One of my ideas is to eventually decouple the header dictionary from the
Message.  That is, you access the headers through msg.headers instead
of directly on msg.  At that point we could get away with changing
the semantics of __setitem__, and have msg.headers[X] be 'replace'.
Having append be spelled 'msg.headers.append(X)' seems slightly more
natural than having replace spelled msg.headers.replace(X), so that's
what I'd be in favor of.
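
A rough sketch of what such a decoupled container might look like, with assignment meaning "replace" and an explicit append.  The class name `Headers` and the `(name, value)` signature of `append` are assumptions made for illustration; nothing here is taken from the email6 code.

```python
class Headers:
    """Hypothetical ordered header container; not an email6 class."""

    def __init__(self):
        self._items = []  # (name, value) pairs in message order

    def append(self, name, value):
        """Add another occurrence of `name` at the end."""
        self._items.append((name, value))

    def __setitem__(self, name, value):
        """Replace the first occurrence in place, drop any others,
        and append if the header is not present at all."""
        key = name.lower()
        result, replaced = [], False
        for n, v in self._items:
            if n.lower() != key:
                result.append((n, v))
            elif not replaced:
                result.append((name, value))
                replaced = True
        if not replaced:
            result.append((name, value))
        self._items = result

    def __getitem__(self, name):
        key = name.lower()
        for n, v in self._items:
            if n.lower() == key:
                return v
        raise KeyError(name)

    def get_all(self, name):
        key = name.lower()
        return [v for n, v in self._items if n.lower() == key]

    def keys(self):
        return [n for n, _ in self._items]


headers = Headers()
headers['To'] = 'alice@example.com'
headers['Subject'] = 'hello'
headers['To'] = 'bob@example.com'        # replaces, position kept
headers.append('Received', 'from a')     # explicit append
headers.append('Received', 'from b')
```

With this spelling the common case (one value per header) is the short one, and duplicates only arise when the caller asks for them.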

> * Unique headers: is this controlled or influenced by a policy?  For example,
>   duplicate Subjects might be disallowed by RFC 5322, but could conceivably be
>   allowed (or at least not prohibited) by other email-like protocols.

Right now it is always applied, but IMO it needs to be a policy setting.
So despite my thought that Messages don't have a policy, it turns out
that they do :(.  I haven't thought through how to handle that yet, though
the obvious way is to set attributes on the Message when it is created.
Perhaps what needs to be controlled on a Message is what Defects are
considered to be errors that should be raised.

An alternative would be to take the uniqueness check out of __setitem__
and do that check only at message generation time, if the policy says to
do so.  I'd prefer that the immediate raise be available as an option,
myself, since it seems like it would catch programming errors sooner
and thus make for a better user experience.

>   Also, while some fields like CC allow only one occurrence, it can contain
>   multiple values in that single field.  Is it totally insane to say that
>   `msg['cc'] = 'address'` would append `address` to the existing value?  It
>   probably is, but having to do that manually also kind of sucks.

Yeah I think that would be insane :).  But += isn't and I want to support
that, as you note later.

>   Some headers have other constraints (RFC 5322, §3.6).  For example
>   Message-ID can technically appear zero times, but "SHOULD be present".  Part
>   of me thinks it should be out of scope for email6 to enforce this, and I'm
>   not sure where that would get enforced anyway, but I'm just wondering if
>   you've thought about that.

That one I think can only be enforced when the message is known to be
"complete", which would be when it is transmitted.  So the generator
could have a policy setting that controls whether or not a lack of
a Message-ID is a raisable error.

> * Datetimes: \o/.  It will be awesome when I can `msg['date'] = a_datetime`.
>   While it does seem reasonable that a naive datetime uses -0000, it should
>   also be very easy for folks to add a Date header that references the local
>   timezone, since I suspect that will be a more common use case than UTC.  I
>   don't know what the answer for that is though.

Well, Alexander has an answer (a function that returns an aware localtime
in the datetime module) but hasn't gotten consensus on adding it.
Perhaps I'll add such a function to email6, at least for the field trials.
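
For reference, Python 3.3 and later ship helpers along these lines in `email.utils`.  The sketch below uses them only to show the two cases discussed here (naive versus aware datetimes); it illustrates the later stdlib and is not the code being proposed in this message.

```python
from datetime import datetime
from email.utils import format_datetime, localtime

# A naive datetime carries no zone information, so it is rendered
# with the conventional -0000 offset.
naive = datetime(2011, 6, 8, 14, 28, 0)
naive_text = format_datetime(naive)
assert naive_text == 'Wed, 08 Jun 2011 14:28:00 -0000'

# localtime() returns an aware datetime in the system's local zone,
# so the formatted value carries a real UTC offset.
aware = localtime()
assert aware.tzinfo is not None
date_header = format_datetime(aware)
```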

> * As for header parsing, have you looked at the pyparsing module?  I don't
>   write many parsers, and have no direct experience with pyparsing, but I keep
>   hearing really good things about it.  OTOH, it's not in the stdlib, so it
>   would present problems if email6 were to adopt it.  Still, I don't envy this
>   part of the job, and I sympathize with the rabbit-hole effect of "just one
>   more little thing..." ;)  Oh, and I'm just blown away impressed by the work
>   you've done on the parser.

I thought about pyparsing (though I haven't tried it out myself), but
I think its scope is much wider than email6 needs, and getting it into
the stdlib should be an independent project if doing so seems worthwhile.
I don't think email6 should depend on anything not already in the stdlib.
In any case, at this point I think the hard part of the parser is done,
and everything else is incremental additions and tweaks.

Something I didn't say in my blog post is that I'm thinking of marking
rfc822_parser as a private module for the 3.3 release, but that a long
term goal would be to expose it, if it proves to be worthwhile and useful
apart from its internal use in email6.  I think there are occasions when
programs need to do non-email rfc822 parsing, where it could come in handy
(perhaps with a few API tweaks to optionally suppress email-specific hacks).

Alternatively, the parser might get replaced by something else that does
the same job, especially if it proves to be a performance bottleneck.

> * Are there operations on Groups and Mailboxes?  E.g. in your example, I see
>   that you added `[hidden email]` to the To header by string
>   concatenation.  What if for example, I had a number of addresses that I
>   wanted to combine into a Reply-To header (which RFC 5322 says I can only
>   have one of).  Would I be able to do something like the following:
>
>   >>> msg['reply_to'].mailboxes.append('[hidden email]')
>
>   and have the printed representation of the message look correct?  Ah, maybe
>   something like your last example in the What's Missing section covers this.

Yes.  Headers are immutable, so 'append' is not the appropriate operation
for this.  + or += is.  What I'm thinking is that the current Mailbox
and Group objects should be enhanced so that there is a nice API for
creating them from various kinds of input data, and an explicit AddressList
object added, and then they can be passed around, summed, and maybe even
subtracted with each other and with AddressList valued header fields.
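
As an illustration of how + and - could work on an immutable value, here is a minimal sketch.  The `AddressList` below is hypothetical (plain strings stand in for Mailbox objects) and is not the email6 implementation.

```python
class AddressList:
    """Hypothetical immutable address list; API is illustrative only."""

    def __init__(self, addresses=()):
        self._addresses = tuple(addresses)

    def __add__(self, other):
        # Accept a single address string or any iterable of addresses.
        extra = (other,) if isinstance(other, str) else tuple(other)
        return AddressList(self._addresses + extra)

    def __sub__(self, other):
        drop = {other} if isinstance(other, str) else set(other)
        return AddressList(a for a in self._addresses if a not in drop)

    def __iter__(self):
        return iter(self._addresses)

    def __str__(self):
        return ', '.join(self._addresses)


reply_to = AddressList(['alice@example.com'])
original = reply_to

# There is no __iadd__, so += builds a new object and rebinds the name;
# the original value is left untouched.
reply_to += 'bob@example.com'
assert str(reply_to) == 'alice@example.com, bob@example.com'
assert str(original) == 'alice@example.com'

reply_to -= 'alice@example.com'
assert str(reply_to) == 'bob@example.com'
```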

> * Oooh!  Your example has an `== None` which should probably be `is None` :)

Heh.  Oops :)  At least I ran the doc tests this time before posting.

> Really, *really* fantastic stuff.

Thanks.

--
R. David Murray           http://www.bitdance.com

Re: rfc822 parser (the elephant has landed)

Stephen J. Turnbull
R. David Murray writes:

 > Yeah, it would be really nice if setting (say) 'To' replaced it, but
 > setting (say) 'Resent-To' appended.  But that way lies chaos :)

Especially since "Resent-To" (and other Resent-*, as well as trace
headers) needs to be *pre*pended. :)

 > One of my ideas is to eventually decouple the header dictionary from the
 > Message.

I don't understand why you want to do that; in many applications, you
pass around a reference to the body but never need to access it until
a final flattening operation.  The headers are naturally structured as
a list or ordered dictionary.  Bodies OTOH are recursively structured,
so they really can't be handled in the same way.

 > > * Unique headers: is this controlled or influenced by a policy?  For example,
 > >   duplicate Subjects might be disallowed by RFC 5322, but could conceivably be
 > >   allowed (or at least not prohibited) by other email-like protocols.
 >
 > Right now it is always applied, but IMO it needs to be a policy
 > setting.

Yes.  The Postel Principle applies here.

 > >   Also, while some fields like CC allow only one occurrence, it can contain
 > >   multiple values in that single field.  Is it totally insane to say that
 > >   `msg['cc'] = 'address'` would append `address` to the existing value?  It
 > >   probably is, but having to do that manually also kind of sucks.
 >
 > Yeah I think that would be insane :).

+1 for insanity.

 > But += isn't and I want to support that, as you note later.

+1 for += (and perhaps -=).

 > >   Some headers have other constraints (RFC 5322, §3.6).  For example
 > >   Message-ID can technically appear zero times, but "SHOULD be present".  Part
 > >   of me thinks it should be out of scope for email6 to enforce this, and I'm
 > >   not sure where that would get enforced anyway, but I'm just wondering if
 > >   you've thought about that.
 >
 > That one I think can only be enforced when the message is known to be
 > "complete", which would be when it is transmitted.

"Enforced", yes, it's out of scope, for several reasons.  However, any
given application may know at some early stage that headers are
complete, and want to check policy at that point.  So there should be
a mechanism to explicitly check policy conformance, perhaps a
.check_policy() method on Message objects.  Then it becomes a question
of whether the policy check should ever be called implicitly, or
always left up to the application.


Re: rfc822 parser (the elephant has landed)

R. David Murray
On Thu, 09 Jun 2011 16:45:08 +0900, "Stephen J. Turnbull" <[hidden email]> wrote:
> R. David Murray writes:
>
>  > Yeah, it would be really nice if setting (say) 'To' replaced it, but
>  > setting (say) 'Resent-To' appended.  But that way lies chaos :)
>
> Especially since "Resent-To" (and other Resent-*, as well as trace
> headers) needs to be *pre*pended. :)

Ah, right.  Which means we don't support that currently...

>  > One of my ideas is to eventually decouple the header dictionary from the
>  > Message.
>
> I don't understand why you want to do that; in many applications, you
> pass around a reference to the body but never need to access it until
> a final flattening operation.  The headers are naturally structured as
> a list or ordered dictionary.  Bodies OTOH are recursively structured,
> so they really can't be handled in the same way.

Well, the main motivation was so that I could change the semantics of
__setitem__.

>  > > * Unique headers: is this controlled or influenced by a policy?  For example,
>  > >   duplicate Subjects might be disallowed by RFC 5322, but could conceivably be
>  > >   allowed (or at least not prohibited) by other email-like protocols.
>  >
>  > Right now it is always applied, but IMO it needs to be a policy
>  > setting.
>
> Yes.  The Postel Principle applies here.

Well, that's already in place.  The parser treats duplicate unique headers
as a defect by default.  But there needs to be a way to construct invalid
messages, too, I think.

Heh.  And I forgot, there actually is a way with the current code to
create duplicate headers, you just have to call append instead of using
__setitem__.

So maybe it wouldn't be totally crazy to have unique headers __setitem__
be replace while non-unique headers __setitem__ does append.  We could
even go really crazy and have Resent headers __setitem__ do prepend :)

The other way to control this "unique header" behavior would be to
change the header registry.  If you are building an application whose
headers do not conform to the RFC, you would probably end up doing that
anyway.

If you combine the last two ideas, we could have a carefully defined
API for controlling how __setitem__ works using attributes on the
header classes.

Totally crazy?  Crazy-smart?
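
A minimal sketch of the combined idea, assuming a class attribute (here called `assign_mode`, a made-up name) on each header class and a registry that maps header names to classes.  All names are hypothetical.

```python
class Header:
    assign_mode = 'append'      # default: headers may repeat

class UniqueHeader(Header):
    assign_mode = 'replace'     # at most one occurrence

class ResentHeader(Header):
    assign_mode = 'prepend'     # newest block goes first

# Hypothetical registry; an application with non-RFC headers would edit this.
REGISTRY = {
    'to': UniqueHeader,
    'subject': UniqueHeader,
    'resent-to': ResentHeader,
}

class HeaderList:
    def __init__(self, registry=REGISTRY):
        self._registry = registry
        self._items = []        # (name, value) pairs in message order

    def __setitem__(self, name, value):
        key = name.lower()
        mode = self._registry.get(key, Header).assign_mode
        if mode == 'prepend':
            self._items.insert(0, (name, value))
        elif mode == 'replace':
            for i, (n, _) in enumerate(self._items):
                if n.lower() == key:
                    self._items[i] = (name, value)
                    return
            self._items.append((name, value))
        else:
            self._items.append((name, value))

    def items(self):
        return list(self._items)


headers = HeaderList()
headers['To'] = 'alice@example.com'
headers['Received'] = 'from a'
headers['To'] = 'bob@example.com'          # unique: replaced in place
headers['Received'] = 'from b'             # non-unique: appended
headers['Resent-To'] = 'carol@example.com' # resent: prepended
```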

>  > >   Also, while some fields like CC allow only one occurrence, it can contain
>  > >   multiple values in that single field.  Is it totally insane to say that
>  > >   `msg['cc'] = 'address'` would append `address` to the existing value?  It
>  > >   probably is, but having to do that manually also kind of sucks.
>  >
>  > Yeah I think that would be insane :).
>
> +1 for insanity.

Are you saying = should append to the value?  I think that would be
bad/counterintuitive.

>  > But += isn't and I want to support that, as you note later.
>
> +1 for += (and perhaps -=).

Agreed.

>  > >   Some headers have other constraints (RFC 5322, §3.6).  For example
>  > >   Message-ID can technically appear zero times, but "SHOULD be present".  Part
>  > >   of me thinks it should be out of scope for email6 to enforce this, and I'm
>  > >   not sure where that would get enforced anyway, but I'm just wondering if
>  > >   you've thought about that.
>  >
>  > That one I think can only be enforced when the message is known to be
>  > "complete", which would be when it is transmitted.
>
> "Enforced", yes, it's out of scope, for several reasons.  However, any
> given application may know at some early stage that headers are
> complete, and want to check policy at that point.  So there should be
> a mechanism to explicitly check policy conformance, perhaps a
> .check_policy() method on Message objects.  Then it becomes a question
> of whether the policy check should ever be called implicitly, or
> always left up to the application.

How about a validate function that takes a message and a policy?
That would be parallel to generator.  In fact, it might share some code
with generator.
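
Something along these lines, as a sketch only.  The `Policy` class and its attribute names are invented for the example, and the stdlib `email.message.Message` stands in for an email6 message.

```python
from email.message import Message

class Policy:
    """Hypothetical policy object; attribute names are illustrative."""

    def __init__(self, unique_headers=('subject', 'date', 'message-id'),
                 require_message_id=True):
        self.unique_headers = unique_headers
        self.require_message_id = require_message_id

def validate(msg, policy):
    """Return a list of problems found in `msg` under `policy`."""
    problems = []
    for name in policy.unique_headers:
        count = len(msg.get_all(name, []))
        if count > 1:
            problems.append(f'{name}: {count} occurrences, at most 1 allowed')
    if policy.require_message_id and msg['message-id'] is None:
        problems.append('message-id: missing')
    return problems

msg = Message()
msg['Subject'] = 'one'
msg['Subject'] = 'two'      # the stdlib Message appends, giving a duplicate
problems = validate(msg, Policy())
```

Returning a list rather than raising leaves it to the caller (or to the generator) to decide which problems are fatal.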

--
R. David Murray           http://www.bitdance.com

Re: rfc822 parser (the elephant has landed)

Éric Araujo
In reply to this post by R. David Murray
Hi,

I know close to zilch about email but thought I’d give two eurocents.

The first cent is about subclassing builtins.  I read in your article
that your code uses subclasses of str and list; can’t that lead to
problems caused by fast paths for built-in types in CPython code?  (if I
understand http://bugs.python.org/issue10977 correctly)

The second cent is about naming.  Does a Mailbox represent an email
address?  The confusion with mailbox.Mailbox would be a problem.

Dare I say it? PEP 8 would advise rfc822parser for the name, or parser
(but I don’t know how you plan to deprecate/replace the existing
email.parser module).

I’m sorry for your family stuff.

Regards

Re: rfc822 parser (the elephant has landed)

R. David Murray
On Fri, 10 Jun 2011 19:00:32 +0200, <[hidden email]> wrote:
> The first cent is about subclassing builtins.  I read in your article
> that your code uses subclasses of str and list; can’t that lead to
> problems caused by fast paths for built-in types in CPython code?  (if I
> understand http://bugs.python.org/issue10977 correctly)

The problems there arise from C code calling (or, rather, not calling)
methods on the subclass.  In email, though, headers act *just like* strings;
they simply have *extra* methods.  So there should be no problem.  Anything
that doesn't know about the extra methods will treat the header just
like a string, which is exactly what we want for backward compatibility
reasons.

The one place where this might bite us is in the proposed support for +=
and -=.  I haven't tested that yet, and even if it does work I'm not sure
that there won't be obscure corners in which it turns out to be broken.
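
A small sketch of both points, using a made-up `Header` class (not the email6 one): code that only knows about str keeps working, but built-in operators hand back a plain str.

```python
class Header(str):
    """Hypothetical str subclass carrying extra header data."""

    def __new__(cls, value, name):
        self = super().__new__(cls, value)
        self.name = name    # extra attribute; plain str code never looks at it
        return self

    def as_line(self):
        """Extra method: render as a 'Name: value' line."""
        return f'{self.name}: {self}'


subject = Header('the elephant has landed', name='Subject')

# Anything that treats the header as a string is unaffected.
assert subject == 'the elephant has landed'
assert subject.startswith('the')
assert isinstance(subject, str)
assert subject.as_line() == 'Subject: the elephant has landed'

# Built-in operations return a plain str, so the extra data is lost
# unless the subclass overrides the operators involved.
combined = subject + '!'
assert type(combined) is str
```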

> The second cent is about naming.  Does a Mailbox represent an email
> address?  The confusion with mailbox.Mailbox would be a problem.

Well, that is an issue.  I'm not entirely happy about the name, but I
haven't thought of a better one.  The problem is that we have to deal
both with a full 'mailbox' and the 'addr-spec' subpart, and I don't know
of *any* other name (other than 'addr-spec') for the addr-spec part.
(Well, 'address', but you can see the problem with using that for both
meanings...) Perhaps it would be better to use that (or rather
addr_spec), and use 'address' for the address-with-display-name
('mailbox').

I'm open to suggestions for better naming in the API.

> Dare I say it? PEP 8 would advise rfc822parser for the name, or parser
> (but I don’t know how you plan to deprecate/replace the existing
> email.parser module).

Good point.  rfc822parser is completely distinct from 'parser', which
probably won't get deprecated.  On the other hand, once I add RFC2047
support to it, perhaps I should rename it rfcparser (or, at least at
first, _rfcparser).  Or perhaps _headerparser, though it doesn't
contain *all* of the header parsing machinery.
 
> I’m sorry for your family stuff.

Thanks.

--
R. David Murray           http://www.bitdance.com


Re: rfc822 parser (the elephant has landed)

Barry Warsaw
In reply to this post by R. David Murray
On Jun 08, 2011, at 06:46 PM, R. David Murray wrote:

>One of my ideas is to eventually decouple the header dictionary from the
>Message.  That is, you access the headers through msg.headers instead
>of directly on msg.  At that point we could get away with changing
>the semantics of __setitem__, and have msg.headers[X] be 'replace'.
>Having append be spelled 'msg.headers.append(X)' seems slightly more
>natural than having replace spelled msg.headers.replace(X), so that's
>what I'd be in favor of.

I agree that it probably does make sense to eventually relegate the headers to
msg.headers.  But I think you'll want both .append() and .replace() methods
for explicitness, with one of them being mapped to __setitem__() for
convenience.  Heck, as is pointed out elsewhere, __setitem__() will probably
be mapped to .magical_rfc_compliant_manipulation_of_header(X, policy) anyway.

>An alternative would be to take the uniqueness check out of __setitem__
>and do that check only at message generation time, if the policy says to
>do so.  I'd prefer that the immediate raise be available as an option,
>myself, since it seems like it would catch programming errors sooner
>and thus make for a better user experience.

Definitely.

>>   Also, while some fields like CC allow only one occurrence, it can contain
>>   multiple values in that single field.  Is it totally insane to say that
>>   `msg['cc'] = 'address'` would append `address` to the existing value?  It
>>   probably is, but having to do that manually also kind of sucks.
>
>Yeah I think that would be insane :).  But += isn't and I want to support
>that, as you note later.

+=1!

>>   Some headers have other constraints (RFC 5322, §3.6).  For example
>>   Message-ID can technically appear zero times, but "SHOULD be present".  Part
>>   of me thinks it should be out of scope for email6 to enforce this, and I'm
>>   not sure where that would get enforced anyway, but I'm just wondering if
>>   you've thought about that.
>
>That one I think can only be enforced when the message is known to be
>"complete", which would be when it is transmitted.  So the generator
>could have a policy setting that controls whether or not a lack of
>a Message-ID is a raisable error.

It might also make sense for Messages to have a .validate(policy) method.  The
application using email6 should essentially know when it's done parsing or
manipulating the message, so it could call .validate() at that point.

>> * Datetimes: \o/.  It will be awesome when I can `msg['date'] = a_datetime`.
>>   While it does seem reasonable that a naive datetime uses -0000, it should
>>   also be very easy for folks to add a Date header that references the local
>>   timezone, since I suspect that will be a more common use case than UTC.  I
>>   don't know what the answer for that is though.
>
>Well, Alexander has an answer (a function that returns an aware localtime
>in the datetime module) but hasn't gotten consensus on adding it.
>Perhaps I'll add such a function to email6, at least for the field trials.

Nice.

>> * As for header parsing, have you looked at the pyparsing module?  I don't
>>   write many parsers, and have no direct experience with pyparsing, but I keep
>>   hearing really good things about it.  OTOH, it's not in the stdlib, so it
>>   would present problems if email6 were to adopt it.  Still, I don't envy this
>>   part of the job, and I sympathize with the rabbit-hole effect of "just one
>>   more little thing..." ;)  Oh, and I'm just blown away impressed by the work
>>   you've done on the parser.
>
>I thought about pyparsing (though I haven't tried it out myself), but
>I think its scope is much wider than email6 needs, and getting it into
>the stdlib should be an independent project if doing so seems worthwhile.
>I don't think email6 should depend on anything not already in the stdlib.

Agreed.

>In any case, at this point I think the hard part of the parser is done,
>and everything else is incremental additions and tweaks.
>
>Something I didn't say in my blog post is that I'm thinking of marking
>rfc822_parser as a private module for the 3.3 release, but that a long
>term goal would be to expose it, if it proves to be worthwhile and useful
>apart from its internal use in email6.  I think there are occasions when
>programs need to do non-email rfc822 parsing, where it could come in handy
>(perhaps with a few API tweaks to optionally suppress email-specific hacks).

Again, agreed.  There are *lots* of file formats that follow rfc822 style
layouts.  One that I'm particularly interested in these days is Debian control
files.  It's essentially rfc822 headers with no bodies, with sections
separated by a blank line.  It would be kind of neat if the stdlib could help
me parse those.
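
As a rough sketch of that use case with what the stdlib offers already: split on blank lines and hand each stanza to `email.message_from_string`.  The sample data and the helper name `parse_stanzas` are made up, and this deliberately ignores format details that a real control-file parser would have to handle (comment lines, multi-line fields).

```python
import email
import re

def parse_stanzas(text):
    """Split RFC 822-style text on blank lines and parse each stanza."""
    stanzas = [s for s in re.split(r'\n[ \t]*\n', text) if s.strip()]
    return [email.message_from_string(s + '\n') for s in stanzas]

control = """\
Source: hello
Maintainer: Jane Doe <jane@example.org>

Package: hello
Architecture: any
"""

source, binary = parse_stanzas(control)
assert source['Source'] == 'hello'
assert binary.keys() == ['Package', 'Architecture']
```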

>Yes.  Headers are immutable, so 'append' is not the appropriate operation
>for this.  + or += is.  What I'm thinking is that the current Mailbox
>and Group objects should be enhanced so that there is a nice API for
>creating them from various kinds of input data, and an explicit AddressList
>object added, and then they can be passed around, summed, and maybe even
>subtracted with each other and with AddressList valued header fields.

Sounds good to me.

-Barry



Re: rfc822 parser (the elephant has landed)

Barry Warsaw
In reply to this post by Stephen J. Turnbull
On Jun 09, 2011, at 04:45 PM, Stephen J. Turnbull wrote:

>R. David Murray writes:
>
> > Yeah, it would be really nice if setting (say) 'To' replaced it, but
> > setting (say) 'Resent-To' appended.  But that way lies chaos :)
>
>Especially since "Resent-To" (and other Resent-*, as well as trace
>headers) needs to be *pre*pended. :)

.insert(i, header) probably, where `i` could (maybe) be either an integer or
the name of the header before which the new header should be inserted.

>"Enforced", yes, it's out of scope, for several reasons.  However, any
>given application may know at some early stage that headers are
>complete, and want to check policy at that point.  So there should be
>a mechanism to explicitly check policy conformance, perhaps a
>.check_policy() method on Message objects.  Then it becomes a question
>of whether the policy check should ever be called implicitly, or
>always left up to the application.

Smart minds think alike. :)

-Barry


Re: rfc822 parser (the elephant has landed)

Barry Warsaw
In reply to this post by R. David Murray
On Jun 10, 2011, at 09:58 AM, R. David Murray wrote:

>If you combine the last two ideas, we could have a carefully defined
>API for controlling how __setitem__ works using attributes on the
>header classes.
>
>Totally crazy?  Crazy-smart?

Could be!

>How about a validate function that takes a message and a policy?
>That would be parallel to generator.  In fact, it might share some code
>with generator.

Smart minds think alike.

-Barry


Re: rfc822 parser (the elephant has landed)

Stephen J. Turnbull
In reply to this post by R. David Murray
R. David Murray writes:

 > Ah, right.  Which means we don't support that currently...

No biggie.  As Barry says, .insert(0, resent_header) would do the
trick.  However, Resent-* headers might be prepended as a block.  Do you
support

    headers[0:0] = resent_header_list

now?
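For reference, this is how the slice assignment behaves on a plain Python list of (name, value) pairs. Whether the email6 header container supports it is exactly the open question; the addresses are made up:

```python
headers = [("From", "a@example.com"), ("To", "b@example.com")]
resent_header_list = [
    ("Resent-From", "c@example.com"),
    ("Resent-To", "d@example.com"),
]

# Assigning to the empty slice [0:0] prepends the whole block and keeps
# the block's internal order.
headers[0:0] = resent_header_list

# Inserting the same headers one at a time at index 0 reverses them,
# which is why a block operation is attractive.
one_by_one = [("From", "a@example.com"), ("To", "b@example.com")]
for header in resent_header_list:
    one_by_one.insert(0, header)

print([name for name, _ in headers])
print([name for name, _ in one_by_one])
```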

 > >  > One of my ideas is to eventually decouple the header dictionary from the
 > >  > Message.
 > >
 > > I don't understand why you want to do that;
 >
 > Well, the main motivation was so that I could change the semantics of
 > __setitem__.

Ah, OK.  I've always thought the email representation of messages as a
mapping of headers with a couple of special attributes was a little
quirky but nice, that's all.  It's not something that's hard to give
up, especially since

    hs = msg.headers

is always available.

 > So maybe it wouldn't be totally crazy to have unique headers __setitem__
 > be replace while non-unique headers __setitem__ does append.  We could
 > even go really crazy and have Resent headers __setitem__ do prepend :)

But this kind of thing would probably have to be optional, since not
every protocol that uses RFC 822-style headers is going to obey the
modern rules that RFC 5322 requires.

 > The other way to control this "unique header" behavior would be to
 > change the header registry.  If you are building an application whose
 > headers do not conform to the RFC, you would probably end up doing that
 > anyway.
 >
 > If you combine the last two ideas, we could have a carefully defined
 > API for controlling how __setitem__ works using attributes on the
 > header classes.
 >
 > Totally crazy?  Crazy-smart?

Totally crazy in the sense of +1 for more craziness. :-)
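A rough sketch of what "attributes on the header classes" could look like. Everything here (the on_set attribute, the REGISTRY dict, the Headers class) is hypothetical and only meant to make the idea concrete, not to describe email6:

```python
class BaseHeader:
    on_set = "append"    # hypothetical: "replace", "append" or "prepend"


class UniqueHeader(BaseHeader):
    on_set = "replace"


class ResentHeader(BaseHeader):
    on_set = "prepend"


# Hypothetical registry; an application with non-RFC headers would edit it.
REGISTRY = {
    "to": UniqueHeader,
    "subject": UniqueHeader,
    "resent-to": ResentHeader,
}


class Headers:
    def __init__(self):
        self.items = []  # (name, value) pairs in message order

    def __setitem__(self, name, value):
        key = name.lower()
        mode = REGISTRY.get(key, BaseHeader).on_set
        if mode == "replace":
            # Drop any existing field of this name, then add the new one
            # at the end (a real implementation might keep its position).
            self.items = [(n, v) for n, v in self.items if n.lower() != key]
            self.items.append((name, value))
        elif mode == "prepend":
            self.items.insert(0, (name, value))
        else:
            self.items.append((name, value))


h = Headers()
h["To"] = "a@example.com"
h["To"] = "b@example.com"          # unique: replaces
h["Received"] = "hop 1"
h["Received"] = "hop 2"            # non-unique: appends
h["Resent-To"] = "c@example.com"   # Resent-*: prepends
print(h.items)
```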

 > >  > Yeah I think that would be insane :).
 > >
 > > +1 for insanity.
 >
 > Are you saying = should append to the value?  I think that would be
 > bad/counterintuitive.

No, just that insanity is a good thing as long as we don't implement
more than the very best 10% of it. :-)

 > How about a validate function that takes a message and a policy?

I would be comfortable with that API, for sure.  Maybe there should be
a way to set a default policy in the header registry.  Perhaps each
header in the registry could have its own ignore, warn, raise (, fix?)
option, or even more flexibility.  For example, you might want a
policy so that email will accept and pass through multiple From
fields, but never generate that (eg, a mailing list).  Alternatively,
you might want an exception raised if an incoming message has multiple
From fields (a local submission agent).
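To make the per-header ignore/warn/raise idea concrete, here is a small sketch. The validate() signature, the PolicyViolation exception, and the policy-as-dict representation are all assumptions for illustration; the real API was still under discussion:

```python
import warnings
from collections import Counter


class PolicyViolation(Exception):
    """Raised when a header breaks a rule whose action is 'raise'."""


def validate(headers, policy):
    """Check unique-header rules; return a list of defect descriptions.

    headers: list of (name, value) pairs.
    policy:  dict mapping the lowercased name of a header that must be
             unique to one of 'ignore', 'warn' or 'raise'.
    """
    counts = Counter(name.lower() for name, _ in headers)
    defects = []
    for name, action in policy.items():
        if counts[name] <= 1 or action == "ignore":
            continue
        defect = f"{counts[name]} {name!r} fields, expected at most one"
        if action == "raise":
            raise PolicyViolation(defect)
        warnings.warn(defect)
        defects.append(defect)
    return defects


incoming = [("From", "a@example.com"), ("From", "b@example.com")]
mailing_list_policy = {"from": "ignore"}   # pass multiple From through
submission_policy = {"from": "raise"}      # reject at submission time
```

With these definitions, validate(incoming, mailing_list_policy) returns an empty list, while validate(incoming, submission_policy) raises PolicyViolation.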



Re: rfc822 parser (the elephant has landed)

Éric Araujo
In reply to this post by R. David Murray
On 10/06/2011 20:27, R. David Murray wrote:
> On Fri, 10 Jun 2011 19:00:32 +0200, <[hidden email]> wrote:
> The problems there arise from C code calling (or, rather, not calling)
> methods on the subclass.  But in email headers act *just like* strings,
> but they have *extra* methods.  So there should be no problem.  Anything
> that doesn't know about the extra methods will treat the header just
> like a string, which is exactly what we want for backward compatibility
> reasons.

Good.
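A tiny illustration of the "string with extra methods" approach, using a hypothetical Header class (the name attribute and as_field() method are invented for this sketch, not taken from email6):

```python
class Header(str):
    """Hypothetical header value: a str that also knows its field name."""

    def __new__(cls, name, value):
        self = super().__new__(cls, value)
        self.name = name
        return self

    def as_field(self):
        # Extra method that plain-string code never needs to know about.
        return f"{self.name}: {self}"


subject = Header("Subject", "Re: rfc822 parser")

# Code unaware of Header just sees a string.
print(isinstance(subject, str))
print(subject.startswith("Re:"))
print(len(subject))

# Code that does know about it can use the extras.
print(subject.as_field())
```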

> The one place where this might bite us is in the proposed support for +=
> and -=.  I haven't tested that yet, and if it does work I'm not sure
> that there won't be obscure corners that turn out to be broken.

I don’t know either.
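One general Python behavior that may be relevant, though it is an assumption that this is the kind of corner meant: on a str subclass that does not override __add__, += silently produces a plain str, so the extra methods are lost. Both classes below are hypothetical:

```python
class Header(str):
    def describe(self):
        return f"header value {str(self)!r}"


h = Header("a@example.com")
h += ", b@example.com"
print(type(h) is str)       # the subclass is gone after +=


class StickyHeader(str):
    def describe(self):
        return f"header value {str(self)!r}"

    def __add__(self, other):
        # Re-wrap the result so + and += keep the subclass.
        return type(self)(str(self) + other)


s = StickyHeader("a@example.com")
s += ", b@example.com"
print(type(s) is StickyHeader)
print(s.describe())
```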

>> The second cent is about naming.  Does a Mailbox represent an email
>> address?  The confusion with mailbox.Mailbox would be a problem.
> Well, that is an issue.  I'm not entirely happy about the name, but I
> haven't thought of a better one.  The problem is that we have to deal
> both with a full 'mailbox' and the 'addr-spec' subpart, and I don't know
> of *any* other name (other than 'addr-spec') for the addr-spec part.
> (Well, 'address', but you can see the problem with using that for both
> meanings...) Perhaps it would be better to use that (or rather
> addr_spec), and use 'address' for the address-with-display-name
> ('mailbox').

Yep, +1 for using addr_spec for the format defined in the RFCs, and
address for the higher-level full address more familiar to humans.
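The existing email.utils helpers already show the split between the two concepts; the name and address below are made up:

```python
from email.utils import formataddr, parseaddr

# Full "address" as a human would write it: display name plus addr-spec.
address = "Jane Doe <jane@example.com>"

display_name, addr_spec = parseaddr(address)
print(display_name)   # the human-readable part
print(addr_spec)      # the bare local-part@domain form from the RFC

# The addr-spec itself splits into local part and domain.
local_part, _, domain = addr_spec.rpartition("@")
print(local_part, domain)

# And the two pieces can be combined back into the full address.
print(formataddr((display_name, addr_spec)))
```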

> Good point.  rfc822parser is completely distinct from 'parser', which
> probably won't get deprecated.  On the other hand, once I add RFC2047
> support to it, perhaps I should rename it rfcparser (or, at least at
> first, _rfcparser).  Or perhaps _headerparser, though it doesn't
> contain *all* of the header parsing machinery.

After reading your blog post and this email, I still can’t say whether
this parser module deals with headers only or with full messages.  If
it’s the former, definite +1 to _headerparser; if it’s the latter, then
_rfcparser or something else would be okay.

Regards