I miss size() (and some latest frustration)

Steffen Daode Nurpmeso-2
I'm stressing this list again, but i stumbled over a missing
[message_]size().
http://wiki.python.org/moin/Email%20SIG/DesignThoughts makes it
a prerequisite for the new EMail package that

    The API needs to at a minimum have hooks available for an
    application to store data on disk rather than holding
    everything in memory.

It would be great if the message (file) size would also be
provided as a public method, so that code-flow decisions can be
made dependent upon the plain size of a message.
(The size is known without parsing for many real-life message
objects anyway or can be detected *cheaply*.  True, e.g., for
all Message objects which are created by mailbox.py.)
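
A minimal sketch of such a cheap size check, assuming the message is available as a seekable binary stream (for example a file opened from a mailbox directory). The helper name `stream_size` is made up for illustration and is not part of the email package.

```python
import io
import os


def stream_size(fp):
    """Return the total size in bytes of a seekable binary stream.

    The read position is restored afterwards, so the same stream can
    still be handed to a parser.
    """
    here = fp.tell()
    try:
        # seek() returns the new absolute position, i.e. the size.
        return fp.seek(0, os.SEEK_END)
    finally:
        fp.seek(here)


# Stand-in for a message file; io.BytesIO behaves like an open binary file.
raw = io.BytesIO(b"Subject: hi\r\n\r\nbody\r\n")
raw.read(5)                    # something already consumed a few bytes
size = stream_size(raw)        # no parsing involved
position_after = raw.tell()    # still where the reader left off
```

For a message that lives in its own file, `os.path.getsize(path)` gives the same number without opening the file at all.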

It's also so unfortunate that 'headersonly' of Parser is in fact
treated as "a backwards compatibility hack", effectively consuming
the entire input nonetheless.
And *DesignThoughts* treats lazy parsing/partial loading as an
"interesting idea" only, though i can think about many cases where
it is a good thing to parse a Message{Headers[/Part/Part/Part...]}
sequentially.
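
The effect can be observed with a short sketch. It assumes the `email.parser.Parser` of a current Python 3; the sample message is invented.

```python
import io
from email.parser import Parser

RAW = (
    "From: alice@example.org\n"
    "Subject: test\n"
    "\n"
    "line one of a potentially huge body\n"
    "line two\n"
)

source = io.StringIO(RAW)
msg = Parser().parse(source, headersonly=True)

# headersonly=True only skips MIME parsing of the body; the parser
# still reads the stream to the end and keeps the body as the payload.
left_in_stream = source.read()
payload = msg.get_payload()
```

After the call nothing is left in `source`, and the whole body sits in memory as a string payload.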

E.g., why should a spam detector load an entire message if it only
wants to check addresses against some white-/blacklists and simply
throw away bad hits?
Even more, why should a company's dispatcher read all the content
if it's only about to rewrite addresses and dispatch the mail to
some other internal server?
(Of course - hey, it's you, you know *so* much more about this stuff
than i do.)
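
One way to approximate the spam-detector case with what the package offers today is to read the header block by hand and parse only that. This is a sketch under assumptions: the blacklist, the helper names and the sample mail are hypothetical, and the input is assumed to be a text stream.

```python
import io
from email.parser import HeaderParser
from email.utils import getaddresses

BLACKLIST = {"spammer@bad.example"}  # hypothetical blacklist


def read_header_text(fp):
    """Read lines up to and including the blank separator line."""
    lines = []
    for line in iter(fp.readline, ""):
        if line in ("\n", "\r\n"):
            break
        lines.append(line)
    return "".join(lines)


def sender_is_blacklisted(fp):
    """Check the From: addresses without touching the body."""
    headers = HeaderParser().parsestr(read_header_text(fp))
    senders = getaddresses(headers.get_all("From", []))
    return any(addr.lower() in BLACKLIST for _name, addr in senders)


mail = io.StringIO(
    "From: Bad Guy <Spammer@bad.example>\n"
    "Subject: buy now\n"
    "\n"
    "a very long body that is never read\n"
)
rejected = sender_is_blacklisted(mail)
unread = mail.read()   # the body is still waiting in the stream
```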

Waiting is an electric experience ...
Have fun.

--
Steffen Daode Nurpmeso <[hidden email]>
:wq steffen

_______________________________________________
Email-SIG mailing list
[hidden email]
Your options: http://mail.python.org/mailman/options/email-sig/lists%40nabble.com

Re: I miss size() (and some latest frustration)

Barry Warsaw
On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:

>It would be great if the message (file) size would also be
>provided as a public method, so that code-flow decisions can be
>made dependent upon the plain size of a message.
>(The size is known without parsing for many real-life message
>objects anyway or can be detected *cheaply*.  True, e.g., for
>all Message objects which are created by mailbox.py.)

Certainly the normal FeedParser will see every byte of the message, even if it
does save parts of it on disk.  Mailman 3's LMTP server also sees every byte
and tucks the size away on an .original_size attribute of its Message
subclass.

But how would you handle it when you are creating the message yourself?  I
think there are too many places you'd have to hook to get an accurate reading,
or you'd have to essentially serialize it via a generator before you'd know,
so it's less than helpful.
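
For a message built in code, the serialize-first approach looks roughly like this. It is a sketch, assuming `email.message.EmailMessage` and `email.generator.BytesGenerator` from a current Python 3; `serialized_size` is a made-up helper name.

```python
import io
from email.generator import BytesGenerator
from email.message import EmailMessage


def serialized_size(msg):
    """Flatten the message into memory and count the resulting bytes."""
    buf = io.BytesIO()
    BytesGenerator(buf).flatten(msg)
    return len(buf.getvalue())


msg = EmailMessage()
msg["From"] = "a@example.org"
msg["To"] = "b@example.org"
msg["Subject"] = "size check"
msg.set_content("hello\n")

size = serialized_size(msg)
```

The number is only valid for this particular serialization; a different policy or line separator can change it, which is part of why a single notion of "size" is hard to pin down.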

It may indeed be possible to ask some external process what the size of the
message is, but it would likely be a hint you couldn't necessarily trust.
(I.e. the server might only have an approximate size.)

So, I'm not sure whether the email package can have a consistent notion of a
message's 'size'.  Perhaps though it ought to define an attribute for when the
message is created by a parser, but let it be writable so that e.g. your
application could get it from an IMAP server or whatever, and stick it in the
attribute.

>It's also so unfortunate that 'headersonly' of Parser is in fact treated as
>"a backwards compatibility hack", effectively consuming the entire input
>nonetheless.  And *DesignThoughts* treats lazy parsing/partial loading as an
>"interesting idea" only, though i can think about many cases where it is a
>good thing to parse a Message{Headers[/Part/Part/Part...]}  sequentially.
>
>E.g., why should a spam detector load an entire message if it only wants to
>check addresses against some white-/blacklists and simply throw away bad
>hits?  Even more, why should a company's dispatcher read all the content if
>it's only about to rewrite addresses and dispatch the mail to some other
>internal server?  (Of course - hey, it's you, you know *so* much more about
>this stuff than i do.)

Do you have suggestions for how the email package can help with these use
cases?  Do you have specific API or implementation proposals?

Cheers,
-Barry

Re: I miss size() (and some latest frustration)

Glenn Linderman-3
On 3/24/2011 2:41 PM, Barry Warsaw wrote:
> On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:
>
> >It would be great if the message (file) size would also be
> >provided as a public method, so that code-flow decisions can be
> >made dependent upon the plain size of a message.
> >(The size is known without parsing for many real-life message
> >objects anyway or can be detected *cheaply*.  True, e.g., for
> >all Message objects which are created by mailbox.py.)
>
> Certainly the normal FeedParser will see every byte of the message, even if it
> does save parts of it on disk.  Mailman 3's LMTP server also sees every byte
> and tucks the size away on an .original_size attribute of its Message
> subclass.
>
> But how would you handle it when you are creating the message yourself?  I
> think there are too many places you'd have to hook to get an accurate reading,
> or you'd have to essentially serialize it via a generator before you'd know,
> so it's less than helpful.
>
> It may indeed be possible to ask some external process what the size of the
> message is, but it would likely be a hint you couldn't necessarily trust.
> (I.e. the server might only have an approximate size.)
>
> So, I'm not sure whether the email package can have a consistent notion of a
> message's 'size'.  Perhaps though it ought to define an attribute for when the
> message is created by a parser, but let it be writable so that e.g. your
> application could get it from an IMAP server or whatever, and stick it in the
> attribute.

When created by a parser, it could have the notion of size-seen-so-far, or bytes-fed.  Once the whole message has been processed, the size of the message would be known, as well as of each piece.

Incomplete messages, such as those from IMAP servers for which only partial requests have been made for pieces, could only get the concept of "total size" from the server, if it provides it.  Since POP servers do, I think IMAP would also, but I'm not an IMAP expert.

> >It's also so unfortunate that 'headersonly' of Parser is in fact treated as
> >"a backwards compatibility hack", effectively consuming the entire input
> >nonetheless.  And *DesignThoughts* treats lazy parsing/partial loading as an
> >"interesting idea" only, though i can think about many cases where it is a
> >good thing to parse a Message{Headers[/Part/Part/Part...]}  sequentially.
> >
> >E.g., why should a spam detector load an entire message if it only wants to
> >check addresses against some white-/blacklists and simply throw away bad
> >hits?  Even more, why should a company's dispatcher read all the content if
> >it's only about to rewrite addresses and dispatch the mail to some other
> >internal server?  (Of course - hey, it's you, you know *so* much more about
> >this stuff than i do.)
>
> Do you have suggestions for how the email package can help with these use
> cases?  Do you have specific API or implementation proposals?

For message parsing, it seems like allowing registered callbacks for various pieces would be handy... "Call me when you parse this type of a header" (or body part, etc.).

Re: I miss size() (and some latest frustration)

Steffen Daode Nurpmeso-2
In reply to this post by Barry Warsaw
On Thu, Mar 24, 2011 at 05:41:49PM -0400, Barry Warsaw wrote:
> On Mar 24, 2011, at 05:10 PM, Steffen Daode Nurpmeso wrote:
> So, I'm not sure whether the email package can have a consistent notion of a
> message's 'size'.

> Do you have suggestions for how the email package can help with these use
> cases?  Do you have specific API or implementation proposals?

An incremental package must of course have a notion of a "current
state of a message", so that all methods of an object must first
check whether they're applicable - anyway!?
Methods which can be used in multiple states need to document how
they react in each of those anyway (if behaviour changes).

So there could be a .current_parse_state() returning
a to-be-defined enum.
Or size() may return a tuple (Bool_is_final_size, current_size)
(but that's really ugly).

Besides size(), the simplest way would be to extend the
FeedParser so that it could stop in a defined way at all
boundaries of a message (i.e. Headers,Part,Part...).
That would be a state().
It would need to be restartable, i.e., .close() may remain and
return an entire message, but .last_part() or similar must be
added.  .feed() must return something useful, too.  E.g.:

    dataf = SOMERAWDATA.get_fileobject()
    while True:
        line = dataf.readline()
        ..
        parser_state = fp.feed(line)
        if parser_state == fp.BOUNDARY_SEEN:
            ..
            break
        ..
    # This is a header object
    # (Or, simply: Message without payload)
    headerobject = fp.get_headers()
    rewrite_headers(headerobject)
    datachunk = prepare_as_sendfile_header_object(headerobject)
    call_sendfile_with_headers_and_unchanged_rest_of_dataf(datachunk, dataf)

Interestingly FeedParser has almost all capabilities which are
required to do all that internally, but it does not offer it to
the outside.  8-)
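
Until the parser can stop at boundaries itself, the dispatcher case can be approximated from the outside. The following is only a sketch: `dispatch` and the addresses are invented, the message is assumed to carry a To: header, and the rewritten headers are emitted with "\n" line endings regardless of what the source used.

```python
import io
import shutil
from email.generator import BytesGenerator
from email.parser import BytesHeaderParser


def dispatch(source, target, new_recipient):
    """Rewrite the To: header and copy the body through untouched."""
    header_lines = []
    for line in iter(source.readline, b""):
        if line in (b"\n", b"\r\n"):
            break                      # separator line: headers are done
        header_lines.append(line)
    headers = BytesHeaderParser().parsebytes(b"".join(header_lines))
    headers.replace_header("To", new_recipient)
    # Writes the headers followed by the blank separator line.
    BytesGenerator(target, mangle_from_=False).flatten(headers)
    # The rest of the stream is never parsed, only copied.
    shutil.copyfileobj(source, target)


src = io.BytesIO(
    b"From: a@example.org\nTo: old@example.org\n\nbody bytes\n"
)
dst = io.BytesIO()
dispatch(src, dst, "new@internal.example")
result = dst.getvalue()
```

On real sockets or files the final copy could be replaced by a sendfile-style call; `shutil.copyfileobj` merely stands in for that here.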

Anyway, EMail is capable of many things, but it does not expose
them to the outside, so that one gets stuck soon if a special task
is to be performed.  email.message_from_xy() is a fantastic
abstraction of a complex set of RFCs and real-life potholes.
On the other hand a programming package is not a shelter - you
can mess up any package which goes beyond some message_from_xy().
So i really think that it is acceptable to offer an interface
which gives you access to partially constructed objects as long as
it is well-defined in some manner.

--
Steffen Daode Nurpmeso <[hidden email]>
:wq steffen

Re: I miss size() (and some latest frustration)

Steffen Daode Nurpmeso-2
In reply to this post by Glenn Linderman-3
On Thu, Mar 24, 2011 at 03:54:48PM -0700, Glenn Linderman wrote:
> For message parsing, it seems like allowing registered callbacks
> for various pieces would be handy... "Call me when you parse this
> type of a header" (or body part, etc.).

A completely different idea, but i also like it.
I remember that DOM did not even rock a bit until SAX came along.

Re: I miss size() (and some latest frustration)

Barry Warsaw
In reply to this post by Glenn Linderman-3
On Mar 24, 2011, at 03:54 PM, Glenn Linderman wrote:

>When created by a parser, it could have the notion of size-seen-so-far, or
>bytes-fed.  Once the whole message has been processed, the size of the
>message would be known, as well as of each piece.

It makes sense to record this in the Message objects, but I'd want to be very
careful about what that attribute is called.  Using just 'size' could be
misleading, either because parsing has not completed, or because they might
think that it's an exact count of the serialized size.  Something like
'parsed_byte_count' might be okay though.

>Incomplete messages, such as those from IMAP servers for which only partial
>requests have been made for pieces, could only get the concept of "total
>size" from the server, if it provides it.  Since POP servers do, I think IMAP
>would also, but I'm not an IMAP expert.

In a case like that, an attribute such as 'server_reported_size' or some such
would be okay.
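
The two attribute names could be carried by a Message subclass along these lines. This is a sketch, not an existing API: `SizedMessage` and `parse_with_count` are invented names, and the only library feature relied on is the `_factory` argument of `email.feedparser.BytesFeedParser`.

```python
from email.feedparser import BytesFeedParser
from email.message import Message


class SizedMessage(Message):
    """Message with two optional, writable size hints."""

    parsed_byte_count = None     # bytes the parser has been fed
    server_reported_size = None  # e.g. a size reported by an IMAP server


def parse_with_count(chunks):
    """Feed byte chunks to the parser and record how many were seen."""
    parser = BytesFeedParser(_factory=SizedMessage)
    total = 0
    for chunk in chunks:
        parser.feed(chunk)
        total += len(chunk)
    msg = parser.close()
    msg.parsed_byte_count = total
    return msg


raw = b"Subject: hi\n\nbody\n"
msg = parse_with_count([raw[:7], raw[7:]])
```

An application that only has a server-side figure would leave `parsed_byte_count` at `None` and set `server_reported_size` itself.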

>For message parsing, it seems like allowing registered callbacks for various
>pieces would be handy... "Call me when you parse this type of a header" (or
>body part, etc.).

I think David's design documents to allow for extensions and callbacks based
on the content-types of things seen.

Cheers,
-Barry

Re: I miss size() (and some latest frustration)

Glenn Linderman-2
On 3/25/2011 1:10 PM, Barry Warsaw wrote:
> >For message parsing, it seems like allowing registered callbacks for various
> >pieces would be handy... "Call me when you parse this type of a header" (or
> >body part, etc.).
>
> I think David's design documents to allow for extensions and callbacks based
> on the content-types of things seen.

I recall registration of handlers for various MIME types.  I don't recall callbacks (registered handlers) being available for header parsing, but I have no time to find and reread it at the moment.  It would be a good idea, though.  Also, callbacks should have the capability to stop the parse.  That technique could also be used to implement "only parse headers", but it might be nicer to implement that as a flag when parsing starts.

Along this line, if parsing is stopped, it would be nice to be able to retrieve the unparsed data for alternate use.  Some is likely to have been already retrieved from whatever data stream and passed as a "chunk" to the parser; an early-out would leave a "partial chunk" that hasn't been processed, but that some other entity may want to process, even if only for logging or error reporting.
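
A rough sketch of callbacks that can stop the parse and hand back the unprocessed partial chunk. Everything here is hypothetical (`StopParsing`, `scan_headers`, the callback table), and folded continuation lines are simply skipped to keep the example short.

```python
class StopParsing(Exception):
    """Raised by a callback to end the scan early."""


def scan_headers(chunks, callbacks):
    """Call callbacks[name](value) for each header line seen.

    Returns the part of the chunks fed so far that was not processed.
    Chunks that were never fed stay in the caller's iterator.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.rstrip("\r")
            if line == "":
                return buffer              # blank line: end of headers
            if line[0] in " \t" or ":" not in line:
                continue                   # continuation or junk line
            name, value = line.split(":", 1)
            handler = callbacks.get(name.lower())
            if handler is not None:
                try:
                    handler(value.strip())
                except StopParsing:
                    return buffer          # early-out with leftover data
    return buffer


def reject_bad_sender(value):
    if value.endswith("@bad.example"):
        raise StopParsing


stream = iter(["From: spam@bad.example\nSub", "ject: hi\n\nbody"])
leftover = scan_headers(stream, {"from": reject_bad_sender})
next_chunk = next(stream)   # still available for logging or forwarding
```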

--
Glenn
Experience is that marvelous thing that enables you to recognize a
mistake when you make it again. -- Franklin Jones

Re: I miss size() (and some latest frustration)

R. David Murray
In reply to this post by Barry Warsaw
On Fri, 25 Mar 2011 16:10:03 -0400, Barry Warsaw <[hidden email]> wrote:
> >For message parsing, it seems like allowing registered callbacks for various
> >pieces would be handy... "Call me when you parse this type of a header" (or
> >body part, etc.).
>
> I think David's design documents to allow for extensions and callbacks based
> on the content-types of things seen.

Effectively, yes.  The idea is that there is a factory that gets called
whenever a mime content type or a header is instantiated, so that
factory can do whatever magic it would like.  The standard factories will
have a lookup table for the factories for individual types, so you can
alternately use a copy of the standard factory with just the headers or
mime types you are interested in hooked.
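
The lookup-table idea can be pictured with a toy factory. This is an illustration only, not the design that was being drafted: `LookupFactory`, `copy_with` and the tuple-returning default are all invented for the example.

```python
class LookupFactory:
    """Dispatch construction to a per-name factory, with a default."""

    def __init__(self, default, table=None):
        self.default = default
        self.table = dict(table or {})

    def copy_with(self, hooks):
        """Return a copy with extra or replaced per-name factories."""
        merged = dict(self.table)
        merged.update({name.lower(): f for name, f in hooks.items()})
        return LookupFactory(self.default, merged)

    def __call__(self, name, value):
        factory = self.table.get(name.lower(), self.default)
        return factory(name, value)


def plain_header(name, value):
    return (name, value)


seen_subjects = []


def logging_subject(name, value):
    seen_subjects.append(value)        # the "magic" a hook might do
    return plain_header(name, value)


standard = LookupFactory(plain_header)
hooked = standard.copy_with({"Subject": logging_subject})
result = hooked("Subject", "hello")
other = hooked("To", "b@example.org")
```

Only the hooked header goes through the custom factory; everything else falls back to the default, and the standard factory itself is left unchanged.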

We'll want to refine the design when I get near to actually implementing
it.

--
R. David Murray           http://www.bitdance.com

Re: I miss size() (and some latest frustration)

Steffen Daode Nurpmeso-2
In reply to this post by Barry Warsaw
    First of all i have to say that i am sooo proud of myself
    that this mail manages to get addressed correctly right away!
    Wow!  (Or WAU! WAU! as those four-legged Germans would say;)
    Thanks for your understanding.

On Thu, Mar 24, 2011 at 05:41:49PM -0400, Barry Warsaw wrote:
> Certainly the normal FeedParser will see every byte of the
> message, even if it does save parts of it on disk.  Mailman 3's
> LMTP server also sees every byte

I'm afraid of it, and i hate it from the bottom of my heart, but
it is to be expected that EMail 6 will see times where mails
actually contain entire 3-D Blockbusters as MIME attachments.
And the truth will not be far from that.

Thus i personally would really vote for the possibility that
parsing can be stopped at defined boundaries so that

    target_file.write(yet_parsed_object.data())
    while True:
        x = source_file.read(65536)
        if not x:
            break
        target_file.write(x)
can be used directly (i.e. no swallowed boundary line).
Hooks are a fine thing but they are on the wrong side of the story
for this kind of problem (unless you have full, i.e. linewise,
control of the input side, too, and set one flag here and there.)

Have a nice weekend - it's cherry blossom, and it smells fantastic!

--
Steffen Daode Nurpmeso <[hidden email]>
:wq steffen
