API for email threading library?

jalopyuser
Folks, I'm working on an implementation of RFC 5256 email threading,
designed so that it could fit as a submodule in the "email" package, if
such a thing was ever seen to be useful.

I'd like to ask "the wisdom of the crowd" what they think an appropriate
interface to such a thing would be?  The basic operation is that you
create a collection (type C) of email threads (type T) by passing a set
of messages (type M) to the constructor.

* Should M be required to be "email.message.Message", or perhaps some
  less restrictive type, say "ThreadableMessageAPI"?  All that's
  strictly required is the ability to retrieve the Message-ID, Subject,
  Date, References, and In-Reply-To fields.

* What operations should be possible on C?  Some that come to mind:

  * retrieve_thread (M or message-id) => T
  * add_message (M) => T
  * add_messages (set of M) => None
  * remove_message (M or message-id) => T (or None) ?

* What's the interface for T?  It's a tree with possible dummy nodes, so
  a tuple of messages plus nested tuples would do it.  What should the
  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
  email.message.Message instances?

* For large sets of threads (millions of messages) a persistence
  mechanism would be useful.  Should there be a standard interface to
  such a mechanism, perhaps as class methods on C?  If so, what should
  it look like?  Should the implementation contain a default persistent
  subclass of C, based on sqlite3?  What side-effects would persistence
  requirements have on the other design considerations?  For instance,
  would you have to save the entire text of a message for each node?
  Just the headers?  Just some of the headers?  Just the Message-ID?
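
To make the duck-typing question concrete, here is a minimal sketch,
assuming M only has to offer a mapping-style get() for header lookup.
The name threading_headers is hypothetical and not part of the email
package; both email.message.Message and a plain dict satisfy it.

```python
import email

# Headers the threading code needs, per the list above.
THREADING_HEADERS = ("Message-ID", "Subject", "Date", "References", "In-Reply-To")


def threading_headers(msg):
    """Return the five threading headers of any object with a .get() method.

    Missing headers come back as None. Hypothetical helper for illustration.
    """
    return {name: msg.get(name) for name in THREADING_HEADERS}


raw = (
    "Message-ID: <a@example.com>\n"
    "Subject: hello\n"
    "Date: Thu, 05 Jan 2012 09:55:00 -0800\n"
    "\n"
    "body\n"
)
from_message = threading_headers(email.message_from_string(raw))
from_dict = threading_headers({"Message-ID": "<b@example.com>", "Subject": "hi"})
```

An ABC or other formal interface could document the same requirement,
but nothing in this sketch depends on one.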

Have at it!  Advise away!

Bill
_______________________________________________
Email-SIG mailing list
[hidden email]
Your options: http://mail.python.org/mailman/options/email-sig/lists%2B1324540640401-2213733%40n6.nabble.com

Re: API for email threading library?

Barry Warsaw
On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:

>Folks, I'm working on an implementation of RFC 5256 email threading,
>designed so that it could fit as a submodule in the "email" package, if
>such a thing was ever seen to be useful.

I really like the idea of threading support being included in the email
package.  (I admit that I don't have time right now to read the RFC.)  My
general thoughts are that the actual messages needn't be included in the
thread collection, but perhaps just Message-IDs.  That would allow an
application to store the actual message objects anywhere they want, and would
reduce space requirements of the thread collection.

>I'd like to ask "the wisdom of the crowd" what they think an appropriate
>interface to such a thing would be?  The basic operation is that you
>create a collection (type C) of email threads (type T) by passing a set
>of messages (type M) to the constructor.
>
>* Should M be required to be "email.message.Message", or perhaps some
>  less restrictive type, say "ThreadableMessageAPI"?  All that's
>  strictly required is the ability to retrieve the Message-ID, Subject,
>  Date, References, and In-Reply-To fields.

I think it would be fine then to allow duck-typing of the input objects.  I
don't have a sense of whether it needs a formal (as in Python's ABCs)
interface type.

>* What operations should be possible on C?  Some that come to mind:
>
>  * retrieve_thread (M or message-id) => T

Message-ID as input.

>  * add_message (M) => T

Duck-typed message.

>  * add_messages (set of M) => None
>  * remove_message (M or message-id) => T (or None) ?

Probably Message-ID as the input.  I guess the rule would be that if you need
all the headers you mention above, a duck-typed message would be required.
For operations that only need the Message-ID, just accept that.

And you probably want the full Message-ID header value, e.g. it would include
the angle brackets.

>* What's the interface for T?  It's a tree with possible dummy nodes, so
>  a tuple of messages plus nested tuples would do it.  What should the
>  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
>  email.message.Message instances?

Will the tree get mutated when a message is added in the middle of a thread,
or will you generate a new tree?  That would make a difference for
tuple-of-tuples or list-of-lists.

I think the nodes would be Message-IDs, but you'd need a public API for
normalizing them, and my application would have to make sure that my messages
are normalized (or at least the lookup keys are) or I might not be able to
find a message given its normalized id.  OTOH, maybe the message parser or
message object itself should provide an API for normalizing ids?

Let's think about some use cases.

- given any message, find the entire thread it's a part of
- given a message, find all children
- given a message, find a path to the root of the thread
- find the parts of the thread that fall within a date range
- find the parts of a thread with a matching subject
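
The first three use cases can be sketched with nothing but Message-IDs,
assuming the collection keeps a parent map and a children map.  The
class and method names here are hypothetical; the date-range and
subject queries would additionally need those headers and are left out.

```python
from collections import defaultdict


class ThreadIndex:
    """Toy index of Message-IDs; names and layout are assumptions."""

    def __init__(self):
        self.parent = {}                   # message-id -> parent id or None
        self.children = defaultdict(list)  # message-id -> list of child ids

    def add(self, msg_id, parent_id=None):
        self.parent[msg_id] = parent_id
        if parent_id is not None:
            # Make sure the parent is known, even if not seen yet.
            self.parent.setdefault(parent_id, None)
            self.children[parent_id].append(msg_id)

    def path_to_root(self, msg_id):
        """Return [msg_id, parent, ..., root]."""
        path = [msg_id]
        while self.parent.get(path[-1]) is not None:
            path.append(self.parent[path[-1]])
        return path

    def children_of(self, msg_id):
        return list(self.children.get(msg_id, ()))

    def thread_of(self, msg_id):
        """Return every id in the thread containing msg_id, root first."""
        root = self.path_to_root(msg_id)[-1]
        found, stack = [], [root]
        while stack:
            node = stack.pop()
            found.append(node)
            stack.extend(reversed(self.children.get(node, ())))
        return found


index = ThreadIndex()
index.add("<a@x>")
index.add("<b@x>", "<a@x>")
index.add("<c@x>", "<a@x>")
index.add("<d@x>", "<c@x>")
```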

>* For large sets of threads (millions of messages) a persistence
>  mechanism would be useful.  Should there be a standard interface to
>  such a mechanism, perhaps as class methods on C?  If so, what should
>  it look like?  Should the implementation contain a default persistent
>  subclass of C, based on sqlite3?  What side-effects would persistence
>  requirements have on the other design considerations?  For instance,
>  would you have to save the entire text of a message for each node?
>  Just the headers?  Just some of the headers?  Just the Message-ID?

Great questions.  We've long talked about a persistence mechanism for message
parts (e.g. store the big binary parts on disk instead of in memory).  Some
consistency of design would be good here.  But I agree that persistence should
definitely be part of the story, and it needs to be pluggable.

Have to think more about this, but a big +1 for the idea.  It would serve as a
very good component for the ideas I have about a next generation email
archiver.

-Barry

Re: API for email threading library?

R. David Murray
On Thu, 05 Jan 2012 20:21:08 -0500, Barry Warsaw <[hidden email]> wrote:

> On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:
>
> >Folks, I'm working on an implementation of RFC 5256 email threading,
> >designed so that it could fit as a submodule in the "email" package, if
> >such a thing was ever seen to be useful.
>
> I really like the idea of threading support being included in the email
> package.  (I admit that I don't have time right now to read the RFC.)  My
> general thoughts are that the actual messages needn't be included in the
> thread collection, but perhaps just Message-IDs.  That would allow an
> application to store the actual message objects anywhere they want, and would
> reduce space requirements of the thread collection.

I don't have time to read the RFC either :(.  But from a skim of the
first bits, my immediate reaction is that the best thing to do is to break
everything down into as many discrete components as practical (pluggable
thread storage, thread construction (which presumably takes duck typed
Message objects containing at least the relevant headers) with different
subclasses or plugins for the different sorting algorithms, thread query,
etc) and keep them as decoupled as possible.  That would give a server
implementer the greatest flexibility.  You'll probably want to noodle
on the various APIs and make some concrete (but not fully fleshed out)
proposals for discussion.  That's the procedure that seemed to work best
when we were working on the email6 API.

On a possibly related note, it has become clear to me through work I've
done recently that the parser/generator classes need some non-trivial
refactoring to make using external (not in-the-object-in-memory) storage
of all or parts of the message possible.  I'm not at all sure when I'll
have time to work on that, but I've got a bunch of relevant notes for
use when I do :)

--David

PS: If you implement the 'base subject' algorithm I bet we can get
agreement to check that right in to email.utils before 3.3 :)

Re: API for email threading library?

jalopyuser
In reply to this post by Barry Warsaw
Thanks for the feedback, Barry.

Barry Warsaw <[hidden email]> wrote:

> On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:
>
> >Folks, I'm working on an implementation of RFC 5256 email threading,
> >designed so that it could fit as a submodule in the "email" package, if
> >such a thing was ever seen to be useful.
>
> I really like the idea of threading support being included in the email
> package.  (I admit that I don't have time right now to read the RFC.)

It basically defines two kinds of threading for IMAP:  ORDEREDSUBJECT,
which is "poor man's threading" using "Subject" and "Date" headers, and
REFERENCES, which is JWZ threading a la Netscape using "References" and
"In-Reply-To" headers.  I intend to support both.

> My general thoughts are that the actual messages needn't be included in the
> thread collection, but perhaps just Message-IDs.  That would allow an
> application to store the actual message objects anywhere they want, and would
> reduce space requirements of the thread collection.

We need "Subject", "Date", and either "References" or "In-Reply-To", in
addition to "Message-ID", in order to add a new message to the thread
DB.  I was planning to use a struct with slots containing hashes of the
value of each of these as the internal node structure in the thread-set
instance.

If the message objects were available on-demand (perhaps via a weakref
or via a Message-ID to message mapping), we could save only a pointer to
the message.  Perhaps a "retrieve-message-by-message-id" callback object
should be a parameter to the constructor.
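
A rough sketch of that idea, assuming the callback is any callable
that maps a Message-ID to the full message.  ThreadNode, ThreadSet,
and their attributes are hypothetical names, and only the subject is
hashed here to keep the example short.

```python
class ThreadNode:
    """Internal node holding only what threading needs (hypothetical layout)."""

    __slots__ = ("message_id", "subject_hash", "date", "parent_id")

    def __init__(self, message_id, subject, date, parent_id=None):
        self.message_id = message_id
        self.subject_hash = hash(subject)
        self.date = date
        self.parent_id = parent_id


class ThreadSet:
    def __init__(self, fetch_message):
        # fetch_message: callable mapping a Message-ID to the full message,
        # supplied by the application that owns the message store.
        self._fetch = fetch_message
        self._nodes = {}

    def add(self, message_id, subject, date, parent_id=None):
        self._nodes[message_id] = ThreadNode(message_id, subject, date, parent_id)

    def message_for(self, message_id):
        """Look the full message up on demand instead of storing it."""
        if message_id not in self._nodes:
            raise KeyError(message_id)
        return self._fetch(message_id)


store = {"<a@example.com>": "full text of message A"}
threads = ThreadSet(store.__getitem__)
threads.add("<a@example.com>", "hello", 1325786100.0)
```

The thread set never holds the message itself, so the application is
free to keep messages in memory, on disk, or on a server.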

> >I'd like to ask "the wisdom of the crowd" what they think an appropriate
> >interface to such a thing would be?  The basic operation is that you
> >create a collection (type C) of email threads (type T) by passing a set
> >of messages (type M) to the constructor.
> >
> >* Should M be required to be "email.message.Message", or perhaps some
> >  less restrictive type, say "ThreadableMessageAPI"?  All that's
> >  strictly required is the ability to retrieve the Message-ID, Subject,
> >  Date, References, and In-Reply-To fields.
>
> I think it would be fine then to allow duck-typing of the input objects.  I
> don't have a sense of whether it needs a formal (as in Python's ABCs)
> interface type.

I prefer an ABC as documentation of the duck-typing requirements.  I'm
thinking a subtype of email.message.Message would be good -- basically
adding the constraint that the "Message-ID", "Subject", "Date", and
"References" (or "In-Reply-To") headers be set, but not requiring any
payload.

> >* What operations should be possible on C?  Some that come to mind:
> >
> >  * retrieve_thread (M or message-id) => T
>
> Message-ID as input.
>
> >  * add_message (M) => T
>
> Duck-typed message.
>
> >  * add_messages (set of M) => None
> >  * remove_message (M or message-id) => T (or None) ?
>
> Probably Message-ID as the input.  I guess the rule would be that if you need
> all the headers you mention above, a duck-typed message would be required.

For "add", but not "remove".

> For operations that only need the Message-ID, just accept that.

Sure.  Either/or.

> And you probably want the full Message-ID header value, e.g. it would include
> the angle brackets.

Easier to get at.

> >* What's the interface for T?  It's a tree with possible dummy nodes, so
> >  a tuple of messages plus nested tuples would do it.  What should the
> >  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
> >  email.message.Message instances?
>
> Will the tree get mutated when a message is added in the middle of a thread,
> or will you generate a new tree?  That would make a difference for
> tuple-of-tuples or list-of-lists.

It would be mutated internally, but the thread given back to callers
would be an immutable copy of the internal tree.  I was thinking that
the returned thread would be a fresh tuple containing (<root>
<child>...), where <root> is a message ID, and each <child> is a
fresh tuple of the same form.

Thus the tree

  A --+-- B
      |
      +-- C --+-- D
              |
              +-- E
              |
              +-- F

would look like this:

  ('[hidden email]'
     ('[hidden email]')
     ('[hidden email]'
        ('[hidden email]')
        ('[hidden email]')
        ('[hidden email]')))

though perhaps

  ('[hidden email]'
     '[hidden email]'
     ('[hidden email]'
        '[hidden email]'
        '[hidden email]'
        '[hidden email]'))

would be more efficient -- each child is either a singleton represented
by a string message-id, or a tuple of a reply plus its children.
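
The second, more compact form can be produced from a children map with
a short recursive function.  This is only a sketch: to_tuple is a
hypothetical name, and the single-letter ids stand in for real
Message-IDs.

```python
def to_tuple(node, children):
    """Return node as a bare id if it has no replies, else (id, child, ...)."""
    kids = children.get(node, ())
    if not kids:
        return node
    return (node,) + tuple(to_tuple(kid, children) for kid in kids)


# The example tree: A has replies B and C; C has replies D, E and F.
children = {"A": ["B", "C"], "C": ["D", "E", "F"]}
tree = to_tuple("A", children)
```

One consequence of this form is that a thread consisting of a single
message comes back as a bare string rather than a one-element tuple,
so callers have to handle both cases.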

> I think the nodes would be Message-IDs, but you'd need a public API for
> normalizing them, and my application would have to make sure that my messages
> are normalized (or at least the lookup keys are) or I might not be able to
> find a message given its normalized id.  OTOH, maybe the message parser or
> message object itself should provide an API for normalizing ids?

The normalization of the Message-ID in RFC 5256 refers to the optional
quoting allowed in RFC 2822, in which '<"01KF8JCEOCBS0045PS"@xxx.yyy.com>'
and '<[hidden email]>' and
'<"01KF8JCEOCBS0045PS"@[xxx.yyy.com]>' and
'<01KF8JCEOCBS0045PS@[xxx.yyy.com]>' are all the same message ID, the
normalized form of which is '[hidden email]'.

Might be useful to have a method or property on email.message.Message to
retrieve this value.

I'd certainly want to normalize any message-IDs passed in as keys.

> Let's think about some use cases.
>
> - given any message, find the entire thread it's a part of
> - given a message, find all children
> - given a message, find a path to the root of the thread
> - find the parts of the thread that fall within a date range

Interesting, hadn't thought about that one.  Good idea.

> - find the parts of a thread with a matching subject

Hmmm.  Using ORDEREDSUBJECT, all of the parts of a thread have the same
"base subject" -- which is another thing defined in RFC 5256.  It's
basically the subject of the message with any "Re:" or "Fwd:" or
"[mailman-listname]" stuff trimmed off.  The ORDEREDSUBJECT algorithm
collects all messages with the same "base subject" and sorts
them by date.

"Base subject" would be another good thing to add to email.utils or
email.message.Message, by the way.
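
As a rough illustration of both ideas, here is a simplified sketch.  It
is not the full RFC 5256 base-subject algorithm (it ignores charset
decoding and the "[fwd: ...]" wrapper, and compares subjects exactly),
and base_subject and ordered_subject are hypothetical names.  Threads
are ordered by the date of their first message, which is my reading of
the RFC.

```python
import re
from itertools import groupby

# "Re:", "Fw:", "Fwd:", optionally followed by a "[blob]" before the colon.
_LEADER = re.compile(r"^\s*(?:re|fwd?)\s*(?:\[[^\]]*\])?\s*:\s*", re.IGNORECASE)
# A leading "[listname]" style blob.
_BLOB = re.compile(r"^\s*\[[^\]]*\]\s*")
# A trailing "(fwd)".
_TRAILER = re.compile(r"\s*\(fwd\)\s*$", re.IGNORECASE)


def base_subject(subject):
    """Strip reply/forward markers and list blobs until nothing changes."""
    s = " ".join(subject.split())
    while True:
        new = _TRAILER.sub("", s)
        new = _LEADER.sub("", new)
        without_blob = _BLOB.sub("", new)
        if without_blob:  # never strip a blob if that would leave nothing
            new = without_blob
        if new == s:
            return s
        s = new


def ordered_subject(messages):
    """Group (message_id, subject, timestamp) triples by base subject.

    Each thread is sorted by date; threads are ordered by their first date.
    """
    keyed = sorted((base_subject(s), t, m) for m, s, t in messages)
    threads = []
    for _, group in groupby(keyed, key=lambda item: item[0]):
        threads.append([(t, m) for _, t, m in group])
    threads.sort(key=lambda thread: thread[0][0])
    return [[m for _, m in thread] for thread in threads]


msgs = [
    ("<1@x>", "Lunch?", 10.0),
    ("<2@x>", "API question", 20.0),
    ("<3@x>", "Re: Lunch?", 30.0),
    ("<4@x>", "Re: [list] API question", 40.0),
]
threads = ordered_subject(msgs)
```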

In the REFERENCES algorithm, threads with the same base subject are
merged, but I suppose threads where someone replied to an earlier
message, but with a different subject line, would allow multiple base
subjects per thread.  Perhaps such threads should be split apart?

> >* For large sets of threads (millions of messages) a persistence
> >  mechanism would be useful.  Should there be a standard interface to
> >  such a mechanism, perhaps as class methods on C?  If so, what should
> >  it look like?  Should the implementation contain a default persistent
> >  subclass of C, based on sqlite3?  What side-effects would persistence
> >  requirements have on the other design considerations?  For instance,
> >  would you have to save the entire text of a message for each node?
> >  Just the headers?  Just some of the headers?  Just the Message-ID?
>
> Great questions.  We've long talked about a persistence mechanism for message
> parts (e.g. store the big binary parts on disk instead of in memory).  Some
> consistency of design would be good here.  But I agree that persistence should
> definitely be part of the story, and it needs to be pluggable.
>
> Have to think more about this, but a big +1 for the idea.  It would serve as a
> very good component for the ideas I have about a next generation email
> archiver.

Yes, I intend to use it for UpLib (http://uplib.parc.com/), which is
what I use to archive my many years of email.  But I thought it would be
more generally useful for others if I wrote it to work with the
stdlib email package.

Bill

Re: API for email threading library?

jalopyuser
In reply to this post by R. David Murray
David, thanks for the follow-up.

R. David Murray <[hidden email]> wrote:

> On Thu, 05 Jan 2012 20:21:08 -0500, Barry Warsaw <[hidden email]> wrote:
> > On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:
> >
> > >Folks, I'm working on an implementation of RFC 5256 email threading,
> > >designed so that it could fit as a submodule in the "email" package, if
> > >such a thing was ever seen to be useful.
> >
> > I really like the idea of threading support being included in the email
> > package.  (I admit that I don't have time right now to read the RFC.)  My
> > general thoughts are that the actual messages needn't be included in the
> > thread collection, but perhaps just Message-IDs.  That would allow an
> > application to store the actual message objects anywhere they want, and would
> > reduce space requirements of the thread collection.
>
> I don't have time to read the RFC either :(.  But from a skim of the
> first bits, my immediate reaction is that the best thing to do is to break
> everything down into as many discrete components as practical (pluggable
> thread storage, thread construction (which presumably takes duck typed
> Message objects containing at least the relevant headers) with different
> subclasses or plugins for the different sorting algorithms, thread query,
> etc) and keep them as decoupled as possible.  That would give a server
> implementer the greatest flexibility.

That sounds good to me, too.  Let me think about pluggable thread
persistence a bit more -- pluggable might work better than subtypes
there, which is the path I've been going down.  The key question is what
would we want to be able to do with a re-vivified thread store.  If we
want to be able to add new messages to it, we need to have access to the
"five headers" of each of the messages, either by saving them, or by
having access to the message store.  If not, we can just save the
message-IDs.  (It would be nice if we could use fixed-size hashes of the
message IDs instead of strings, but that would require a message store
which understood that concept.)
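
For what it's worth, the hashing half of that is easy to sketch with
hashlib.  The helper name and the 8-byte size are assumptions, and the
message store would still have to map keys back to messages and cope
with the (unlikely) collision.

```python
import hashlib


def message_id_key(message_id, size=8):
    """Return a fixed-size bytes key for a Message-ID (hypothetical helper)."""
    return hashlib.blake2b(message_id.encode("utf-8"), digest_size=size).digest()


key = message_id_key("<a@example.com>")
```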

On the other hand, if we're adding a message, presumably we also have
access to the message store, and could retrieve the "five headers"
therefrom given the message-id -- though that might be an expensive
operation for large message stores.

Interesting set of metadata requirements on the pluggable design, both
for the thread store and the message store.
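
One way to picture the "save the headers" option is a small sqlite3
table keyed by Message-ID.  This is a sketch under assumptions: the
table layout and the function names are made up for illustration, and
references are stored as a space-separated string.

```python
import sqlite3


def open_header_store(path=":memory:"):
    """Open (or create) a store holding the threading headers per message."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS headers ("
        " message_id TEXT PRIMARY KEY,"
        " subject TEXT, date REAL, refs TEXT, in_reply_to TEXT)"
    )
    return db


def save_headers(db, message_id, subject, date, references, in_reply_to=None):
    db.execute(
        "INSERT OR REPLACE INTO headers VALUES (?, ?, ?, ?, ?)",
        (message_id, subject, date, " ".join(references), in_reply_to),
    )
    db.commit()


def load_headers(db, message_id):
    """Return (subject, date, references, in_reply_to), or None if unknown."""
    row = db.execute(
        "SELECT subject, date, refs, in_reply_to FROM headers WHERE message_id = ?",
        (message_id,),
    ).fetchone()
    if row is None:
        return None
    subject, date, refs, in_reply_to = row
    return subject, date, refs.split(), in_reply_to


db = open_header_store()
save_headers(db, "<b@x>", "Re: hi", 20.0, ["<a@x>"], "<a@x>")
```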

> You'll probably want to noodle on the various APIs and make some
> concrete (but not fully fleshed out) proposals for discussion.  That's
> the procedure that seemed to work best when we were working on the
> email6 API.

Think of this as the noodling :-).

> PS: If you implement the 'base subject' algorithm I bet we can get
> agreement to check that right in to email.utils before 3.3 :)

I have working code for all of this; right now I'm expanding the test
suite and looking at performance and API optimizations, not to mention
PEP8-ification.

Bill

Re: API for email threading library?

Matthew Dixon Cowles
In reply to this post by jalopyuser
Bill,

> Folks, I'm working on an implementation of RFC 5256 email
> threading, designed so that it could fit as a submodule in the
> "email" package, if such a thing was ever seen to be useful.

If you find it at all useful, you're very welcome to use anything you
like from:

http://www.mondoinfo.com/threadMessages.tar.gz

Regards,
Matt


Re: API for email threading library?

jalopyuser
Matthew Dixon Cowles <[hidden email]> wrote:

> Bill,
>
> > Folks, I'm working on an implementation of RFC 5256 email
> > threading, designed so that it could fit as a submodule in the
> > "email" package, if such a thing was ever seen to be useful.
>
> If you find it at all useful, you're very welcome to use anything you
> like from:
>
> http://www.mondoinfo.com/threadMessages.tar.gz
>
> Regards,
> Matt

Thanks, Matt.

Bill

API for email threading library?

Stephen J. Turnbull
In reply to this post by jalopyuser
Bill Janssen writes:

 > Folks, I'm working on an implementation of RFC 5256 email threading,
 > designed so that it could fit as a submodule in the "email" package, if
 > such a thing was ever seen to be useful.

I don't know if it belongs there, although that's the obvious place.
There are other threaded message structures that aren't email (or
netnews, which is obviously basically the same thing).  For example,
issue trackers.

 > * Should M be required to be "email.message.Message",

-1

 >   or perhaps some less restrictive type, say
 >   "ThreadableMessageAPI"?  All that's strictly required is the
 >   ability to retrieve the Message-ID, Subject, Date, References,
 >   and In-Reply-To fields.

If a variety of existing apps are to be able to plug this in, the API
shouldn't be bound to email.message.Message.  +1 for duck-typing.

 > * What operations should be possible on C?  Some that come to mind:
 >
 >   * retrieve_thread (M or message-id) => T
 >   * add_message (M) => T
 >   * add_messages (set of M) => None
 >   * remove_message (M or message-id) => T (or None) ?

* Reparent message (this will actually merge threads).
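
A minimal sketch of what reparenting might look like, assuming the
thread set keeps a map from each message-id to its parent's id (None
for a root).  The function names are hypothetical.  The loop check
matters because moving a message under one of its own descendants
would otherwise create a cycle.

```python
def reparent(parent, msg_id, new_parent_id):
    """Move msg_id (and its replies) under new_parent_id in a parent map."""
    node = new_parent_id
    while node is not None:
        if node == msg_id:
            raise ValueError("reparenting would create a loop")
        node = parent.get(node)
    parent[msg_id] = new_parent_id


def root_of(parent, msg_id):
    """Follow parent links up to the root of the thread."""
    while parent.get(msg_id) is not None:
        msg_id = parent[msg_id]
    return msg_id


# Two separate threads: a -> b, and c -> d.
parent = {"<a@x>": None, "<b@x>": "<a@x>", "<c@x>": None, "<d@x>": "<c@x>"}
reparent(parent, "<c@x>", "<b@x>")  # the c thread now hangs under b
```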

 > * What's the interface for T?  It's a tree with possible dummy nodes, so
 >   a tuple of messages plus nested tuples would do it.  What should the
 >   nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?

In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm
working on, I just use symbols named by the message IDs themselves;
I'm not familiar with the normalization yet.

 >   email.message.Message instances?

I think it should be more abstract than that.


Re: API for email threading library?

jalopyuser
Thanks for the feedback, Stephen.

Stephen J. Turnbull <[hidden email]> wrote:

> Bill Janssen writes:
>
>  > Folks, I'm working on an implementation of RFC 5256 email threading,
>  > designed so that it could fit as a submodule in the "email" package, if
>  > such a thing was ever seen to be useful.
>
> I don't know if it belongs there, although that's the obvious place.
> There are other threaded message structures that aren't email (or
> netnews, which is obviously basically the same thing).  For example,
> issue trackers.
>
>  > * Should M be required to be "email.message.Message",
>
> -1
>
>  >   or perhaps some less restrictive type, say
>  >   "ThreadableMessageAPI"?  All that's strictly required is the
>  >   ability to retrieve the Message-ID, Subject, Date, References,
>  >   and In-Reply-To fields.
>
> If a variety of existing apps are to be able to plug this in, the API
> shouldn't be bound to email.message.Message.  +1 for duck-typing.

I think I'll finesse this issue with another (appropriate) layer of
indirection.

>  > * What operations should be possible on C?  Some that come to mind:
>  >
>  >   * retrieve_thread (M or message-id) => T
>  >   * add_message (M) => T
>  >   * add_messages (set of M) => None
>  >   * remove_message (M or message-id) => T (or None) ?
>
> * Reparent message (this will actually merge threads).
>
>  > * What's the interface for T?  It's a tree with possible dummy nodes, so
>  >   a tuple of messages plus nested tuples would do it.  What should the
>  >   nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
>
> In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm
> working on, I just use symbols named by the message IDs themselves;

Yes, that works well for a static persistent representation.

Lisp message threading?  What's that in aid of, if you can say?

> I'm not familiar with the normalization yet.

RFC 5256 mentions it, but I had to go back to 2822 to figure it out.
Referencing section 3.6.4 of RFC 2822:  The IMAP guys seem to be implying
that the DQUOTEs in "no-fold-quote" and the "[" and "]" brackets in
"no-fold-literal" should be removed before comparing message-ids.

I'll send a note to the IMAP list to verify that.

Bill

Re: API for email threading library?

Stephen J. Turnbull
Bill Janssen writes:

 > I think I'll finesse this issue with another (appropriate) layer of
 > indirection.

OK by me (can't bring myself to +1 on a thoughtful finesse. :)

 > > In a Lisp implementation of http://www.jwz.org/doc/threading.html I'm
 > > working on, I just use symbols named by the message IDs themselves;
 >
 > Yes, that works well for a static persistent representation.
 >
 > Lisp message threading?  What's that in aid of, if you can say?

The "VM" MUA for Emacs and XEmacs.

 > RFC 5256 mentions it, but I had to go back to 2822 to figure it out.

Tee-hee-hee!  The wild, wonderful world of RFCs: "You are in a twisty
maze of ABNF, all alike ...."

Re: API for email threading library?

jalopyuser
Stephen J. Turnbull <[hidden email]> wrote:

>  > Lisp message threading?  What's that in aid of, if you can say?
>
> The "VM" MUA for Emacs and XEmacs.

Ah!  I use MH-E, myself.

Bill

Re: API for email threading library?

jalopyuser
In reply to this post by Stephen J. Turnbull
Some input from Mark Crispin (who wrote that bit about message-ID
normalization in RFC 5256):

> no-fold-quote does not exist in the current specification (RFC 5322)
> [which obsoletes 2822 - wcj].
>
> I don't know why you think that the brackets should be removed in
> no-fold-literal. The brackets indicate that the contents are a literal IP
> address as opposed to a domain. The fact that 10.20.30.40, as opposed to
> [10.20.30.40], is parsed by some people as an IP address does not
> necessarily mean that it is (I'll laugh when the first all-numeric TLD is
> created!). Now, in the modern day of RFC 5322, this isn't a domain at all
> but rather an id-right.
>
> People can flame at some length whether bloop@10.20.30.40 and
> bloop@[10.20.30.40] are the same message-ID. My guess is "no".
>
> The bottom line here is whether that text about normalized message ID has
> any particular meaning in the context of RFC 5322 as opposed to earlier
> versions of header syntax that used local-part@domain for message-id.
> IMHO (and I wrote that text!) I would treat it as advice on how to treat
> warts from the past rather than how to move forward.
>
> That is, once upon a time, it was necessary to treat:
>
> Message-ID: <"bloop"@grok.this>
> and
> Message-ID: <[hidden email]>
>
> as the same thing. This was a protocol wart and I'm glad to see it
> declared obsolete. I wouldn't flame anyone who decided that strcmp() is
> the one and only way to compare Message-IDs. I daresay that's what most
> implementations did anyway even when RFC 822 was king.

So, stripping double-quotes on the left side stays, stripping brackets
on the right side is a no-no.
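
In code, that rule might look like the following sketch.  The function
name is hypothetical, and dropping the surrounding angle brackets in
the normalized form is an assumption on my part.

```python
def normalize_message_id(value):
    """Strip angle brackets and left-side double quotes; keep [..] on the right."""
    value = value.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    left, at, right = value.rpartition("@")
    if at and len(left) >= 2 and left.startswith('"') and left.endswith('"'):
        left = left[1:-1]
    return left + at + right


quoted = normalize_message_id('<"bloop"@grok.this>')
plain = normalize_message_id("<bloop@grok.this>")
literal = normalize_message_id("<bloop@[10.20.30.40]>")
```

With this, the quoted and unquoted forms compare equal, while
bloop@[10.20.30.40] and bloop@10.20.30.40 stay distinct.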

Bill

Re: API for email threading library?

jalopyuser
In reply to this post by jalopyuser
Thanks for all the feedback, folks.

After musing about all of this, it seems to me that threading makes no
sense outside the context of a message store (or forum post store, or
netnews article store, or...).  So I'm going to push all the decisions
about "what's a message anyway" into an abstraction of such a message
store:

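   from abc import ABC, abstractmethod

   class ThreadableObjectStore(ABC):

      @abstractmethod
      def get_message_id(self, msg):
         """Return the message-ID of msg (string)."""

      @abstractmethod
      def get_subject(self, msg):
         """Return the subject of msg (string)."""

      @abstractmethod
      def get_date(self, msg):
         """Return the timestamp of msg (float, seconds past epoch)."""

      @abstractmethod
      def get_references(self, msg):
         """Return the message-IDs msg refers to (list of string)."""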

So your particular instantiation of ThreadableObjectStore can decide
what a 'msg' is, what a message-ID is, whether they're normalized, etc.
An instance of a ThreadableObjectStore will be required to create an
instance of a threadset.  I'll provide such a class for mailbox.Mailbox,
for testing.
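
A sketch of what such a class could look like for messages that are
email.message.Message instances (which is what mailbox.Mailbox hands
back, since mailbox.Message subclasses it).  EmailMessageStore is a
hypothetical name, and falling back to the first In-Reply-To id when
References is empty is an assumption about the desired behaviour.

```python
from abc import ABC, abstractmethod
from email import message_from_string
from email.utils import parsedate_to_datetime


class ThreadableObjectStore(ABC):
    """The abstraction described above, repeated so the sketch is runnable."""

    @abstractmethod
    def get_message_id(self, msg): ...

    @abstractmethod
    def get_subject(self, msg): ...

    @abstractmethod
    def get_date(self, msg): ...

    @abstractmethod
    def get_references(self, msg): ...


class EmailMessageStore(ThreadableObjectStore):
    """Treats msg as an email.message.Message."""

    def get_message_id(self, msg):
        return (msg.get("Message-ID") or "").strip()

    def get_subject(self, msg):
        return msg.get("Subject") or ""

    def get_date(self, msg):
        # Seconds past the epoch, taken from the Date header.
        return parsedate_to_datetime(msg["Date"]).timestamp()

    def get_references(self, msg):
        refs = (msg.get("References") or "").split()
        if not refs:
            # Assumed fallback: use the first In-Reply-To id, if any.
            refs = (msg.get("In-Reply-To") or "").split()[:1]
        return refs


raw = (
    "Message-ID: <b@example.com>\n"
    "Subject: Re: hello\n"
    "Date: Thu, 05 Jan 2012 20:21:08 -0500\n"
    "In-Reply-To: <a@example.com>\n"
    "\n"
    "body\n"
)
store = EmailMessageStore()
msg = message_from_string(raw)
```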

Also, I think that persistence of the threading analysis is really a
function of the message store, not the threadset.  So what the threadset
requires is simply

  (1) a way to externalize its threads in a meaningful way, which a
      forest of tuple trees with message IDs at the nodes works perfectly
      well for, and

  (2) a way to take such a representation and revivify it, given a
      message store.
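
Step (2) might start from something like this sketch, which rebuilds a
child-to-parent map from the compact tuple form discussed earlier in
the thread.  The name revive is hypothetical, the single-letter ids
stand in for real Message-IDs, and dummy nodes are not handled.

```python
def revive(forest):
    """Rebuild a child -> parent map from a forest of tuple trees."""
    parent = {}

    def walk(tree, parent_id):
        if isinstance(tree, str):
            # A bare id is a message with no replies.
            parent[tree] = parent_id
            return
        root, *replies = tree
        parent[root] = parent_id
        for reply in replies:
            walk(reply, root)

    for tree in forest:
        walk(tree, None)
    return parent


forest = [("A", "B", ("C", "D", "E", "F")), "G"]
parent = revive(forest)
```

Because walk() only distinguishes strings from other sequences, the
same code also accepts nested lists, for example after a round trip
through JSON.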

Bill

Re: API for email threading library?

Stephen J. Turnbull
In reply to this post by jalopyuser
Bill Janssen writes:

 > So, stripping double-quotes on the left side stays, stripping brackets
 > on the right side is a no-no.

Hrm.  How about interpreting quoted-pairs?  (Not that you should ever
see them, but ....)

That is, <"b\l\o\op"@grok.this> and <"bloop"@grok.this> should compare
equal, no?  Or yes?

Which leads me to ... I wonder if the way the Postel Principle applies
here isn't "you're better off unifying too many message IDs because the
user will immediately recognize thread content skew, while unifying
too few will result in different parts of the thread being widely
separated in the presentation of the message set, and possibly
premature ejaculation of responses".[1]  So, (without having thought
about it *too* much<wink/>) I would advocate unifying message IDs that
are likely to be (mistakenly?) "normalized" by some implementations.

And of course, you should never see such message IDs in practice; I
don't think I've ever seen a mailbox, let alone the LHS of a message
ID, in quotes outside of an RFC.<wink/>  Although I *have* seen whole
addresses in quotes.

BTW, although I'm working with VM myself, my intent is to make
jwz-thread.el usable with any Emacsen-based MUA.  (I'm really sick of
how crappy *all* of the MUA code is in Emacs -- I can understand why
one would use MH-E since the MUA is actually implemented elsewhere!)

Footnotes:
[1]  Which is why I'm implementing a threading engine....


Re: API for email threading library?

jalopyuser
Stephen J. Turnbull <[hidden email]> wrote:

> BTW, although I'm working with VM myself, my intent is to make
> jwz-thread.el usable with any Emacsen-based MUA.

Great!

> (I'm really sick of
> how crappy *all* of the MUA code is in Emacs -- I can understand why
> one would use MH-E since the MUA is actually implemented elsewhere!)

Completely agree -- when I shifted over to it, I had to re-write a third
of the MH-E code to get it into some shape I could live with.

Bill

Re: API for email threading library?

jalopyuser
In reply to this post by Stephen J. Turnbull
Stephen J. Turnbull <[hidden email]> wrote:

> Bill Janssen writes:
>
>  > So, stripping double-quotes on the left side stays, stripping brackets
>  > on the right side is a no-no.
>
> Hrm.  How about interpreting quoted-pairs?  (Not that you should ever
> see them, but ....)
>
> That is, <"b\l\o\op"@grok.this> and <"bloop"@grok.this> should compare
> equal, no?  Or yes?

Yes, I think.

> Which leads me to ... I wonder if the way the Postel Principle applies
> here isn't "you're better unifying too many message IDs because the
> user will immediately recognize thread content skew, while unifying
> too few will result in different parts of the thread being widely
> separated in the presentation of the message set, and possibly
> premature ejaculation of responses".[1]  So, (without having thought
> about it *too* much<wink/>) I would advocate unifying message IDs that
> are likely to be (mistakenly?) "normalized" by some implementations.
>
> And of course, you should never see such message IDs in practice; I
> don't think I've ever seen a mailbox, let alone the LHS of a message
> ID, in quotes outside of an RFC.<wink/>  Although I *have* seen whole
> addresses in quotes.

That's probably right, too.  And Mark Crispin says as much.

Bill

Re: API for email threading library?

jalopyuser
In reply to this post by jalopyuser
Here's what I've got so far.  Comments would be appreciated.

Bill

======================================================================

This module implements email threading per RFC 5256.

It provides four classes: ThreadableObjectStore, MailboxStore,
ReferencesSet, and OrderedSubjectSet.

To use it, you need to provide it with a "mailstore", and a set of
messages to thread.  The mailstore must be a subclass of the
abstract class ThreadableObjectStore; an implementation of a
ThreadableObjectStore for mailbox.Mailbox is provided, as the class
MailboxStore.  Four methods must be implemented for a new
ThreadableObjectStore subclass:

  tos_get_message_id(msg or message ID) => message ID

    where the message ID is an immutable value that must be unique in
    that ThreadableObjectStore context, and the msg can be whatever
    that ThreadableObjectStore considers a message.

  tos_get_subject(msg or message ID) => subject

    where the subject is the subject of the message, or None

  tos_get_date (msg or message ID) => timestamp

    where the timestamp is the date and time of the message, expressed
    as a standard Python time.time() value

  tos_get_references (msg or message ID) => sequence of message ID

    where the references are a sequence of message IDs, arranged in
    order as per RFC 5322.  These message IDs must be in the same
    format as the message ID returned by tos_get_message_id().

The base ThreadableObjectStore class also provides a class method to
compute the RFC 5256 "base subject":

  ThreadableObjectStore.tos_base_subject (subject text) => \
        subject, is_reply_or_forward

    Takes a standard Subject: header value, and returns the "base
    subject" for it, along with a boolean flag indicating whether the
    supplied subject indicated a reply to or forward of the original
    subject
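
For readers who want a feel for what "base subject" means, here is a simplified sketch in the spirit of the RFC 5256 extraction rules.  It is not the module's implementation: the function name base_subject and the regular expressions are assumptions, and details such as charset decoding are left out.

```python
import re

# Leading "Re:", "Fw:" or "Fwd:", optionally preceded by "[blob]" tags.
_LEADER = re.compile(
    r"^(?:\[[^\]]*\]\s*)*(?:re|fwd?)\s*(?:\[[^\]]*\])?\s*:\s*", re.IGNORECASE)
_TRAILER = re.compile(r"\s*\(fwd\)$", re.IGNORECASE)
_BLOB = re.compile(r"^\[[^\]]*\]\s*")
_FWD_WRAPPER = re.compile(r"^\[fwd:(.*)\]$", re.IGNORECASE)


def base_subject(subject):
    """Return (base subject, is_reply_or_forward) for a Subject: value."""
    text = " ".join(subject.split())  # collapse whitespace, trim the ends
    flagged = False
    while True:
        # Drop every trailing "(fwd)".
        while _TRAILER.search(text):
            text = _TRAILER.sub("", text)
            flagged = True
        # Drop leaders and leading blobs until neither applies.
        changed = True
        while changed:
            changed = False
            without_leader = _LEADER.sub("", text)
            if without_leader != text:
                text, flagged, changed = without_leader, True, True
            without_blob = _BLOB.sub("", text)
            # A blob is only removed if something is left afterwards.
            if without_blob and without_blob != text:
                text, changed = without_blob, True
        wrapped = _FWD_WRAPPER.match(text)
        if not wrapped:
            return text, flagged
        # "[fwd: ...]" wrapper: unwrap and start over.
        text = wrapped.group(1).strip()
        flagged = True


print(base_subject("Re: API for email threading library?"))
```

Removing a list tag such as "[Email-SIG]" does not set the flag in this sketch; only reply and forward markers do.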

To develop a set of threads, you then instantiate either ReferencesSet
(the JWZ algorithm from Netscape, formalized in RFC 5256), or
OrderedSubjectSet (the "same subjects" algorithm, aka "poor man's
threading"), both subclasses of the abstract class ThreadSet.  Each
constructor takes a ThreadableObjectStore instance and optionally a
set of messages to use for the initial threads.  If provided, those
messages are analyzed into a set of threads.  The threadset is
iterable; the iteration is over the threads it contains.
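
The "same subjects" approach is simple enough to sketch in a few lines.  This is an illustration under assumptions, not the module's code: the input is taken to be (message ID, base subject, timestamp) triples that have already been pulled from a store, and the function name is hypothetical.

```python
from collections import defaultdict

# Made-up messages: (message_id, base_subject, timestamp).
messages = [
    ("<3@example.org>", "API design", 300.0),
    ("<1@example.org>", "api design", 100.0),
    ("<2@example.org>", "Test mbox", 200.0),
    ("<4@example.org>", "test mbox", 400.0),
]


def ordered_subject_threads(messages):
    """Group by base subject; earliest message is the root, the rest its children."""
    by_subject = defaultdict(list)
    for message_id, subject, timestamp in messages:
        # Subject comparison is case-insensitive.
        by_subject[subject.casefold()].append((timestamp, message_id))

    threads = []
    for group in by_subject.values():
        group.sort()  # by date within the thread
        (root_time, root_id), *rest = group
        tree = (root_id, *((message_id,) for _, message_id in rest))
        threads.append((root_time, tree))

    # Threads themselves are ordered by the date of their first message.
    threads.sort(key=lambda item: item[0])
    return [tree for _, tree in threads]


for tree in ordered_subject_threads(messages):
    print(tree)
```

The References algorithm builds deeper trees from the References and In-Reply-To headers, so it needs considerably more bookkeeping than this.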

An instance of ThreadSet provides the following methods:

  add (msg or message ID) => thread

    add another message from the mailstore to the thread set, where
    "thread" is an object which has the attributes "message_id" (a
    string) and "children" (an ordered list of sub-threads), and is
    the root of the thread tree for that msg.

  remove (msg or message ID) => thread

    remove a message from the thread set, where thread is as for
    "add()", but may additionally be 'None' if the message was not in
    a thread, or was the only message in the thread.

  thread (msg or message ID) => thread

    obtain the thread containing the specified message, if any,
    where "thread" is as for "add()", or 'None' if no thread for
    that message exists.

  subject_threads (subject regexp) => set of thread

    obtain the threads where the base subject of the thread contains
    the specified regular expression, where "regexp" is a textual or
    compiled regular expression, and the return value is a set of
    threads.  Note that subject comparisons are case-insensitive;
    compiled regexps must use the re.IGNORECASE flag.

  date_threads (starting time, ending time, root_only=False) => set of thread

    obtain the set of threads containing any messages between
    the two timestamps.  Timestamps are time.time() timestamps;
    either may be specified as 'None' to mean either the start
    of time, or the distant future, respectively.  If "root_only"
    is specified, will only consider the dates of the roots of
    each thread; threads with no root message (a subject forest)
    will always fail to match in this case.

  __contains__ (msg or message ID) => boolean

    Present to support the "in" operator.
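
The date_threads semantics, with open-ended bounds and the root_only switch, can be sketched as follows.  The dict-based thread records are a stand-in for real thread objects, and treating both bounds as inclusive is an assumption, since the description above does not say.

```python
import math

# Made-up threads; "root_date" is None for a subject forest with no root message.
threads = [
    {"name": "A", "root_date": 100.0, "dates": [100.0, 250.0]},
    {"name": "B", "root_date": None, "dates": [150.0, 160.0]},
    {"name": "C", "root_date": 400.0, "dates": [400.0]},
]


def in_range(timestamp, start, end):
    # None means "start of time" or "distant future" respectively.
    low = -math.inf if start is None else start
    high = math.inf if end is None else end
    return low <= timestamp <= high  # assumption: inclusive bounds


def date_threads(threads, start, end, root_only=False):
    """Return the names of threads with a message (or root) in the range."""
    matches = set()
    for thread in threads:
        if root_only:
            root = thread["root_date"]
            # Threads without a root message never match in this mode.
            hit = root is not None and in_range(root, start, end)
        else:
            hit = any(in_range(t, start, end) for t in thread["dates"])
        if hit:
            matches.add(thread["name"])
    return matches


print(date_threads(threads, 120.0, 300.0))
```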

Support for persistence is provided by an instance method
"to_external_form" and a class method "from_external_form" on thread
sets.  Calling "to_external_form" on a thread set instance will
generate a set of tree-structured nested tuples, where each tuple
consists of an optional message ID followed by zero or more child
tuples.  The class method "from_external_form", available on both
ReferencesSet and OrderedSubjectSet, takes a ThreadableObjectStore
instance and an externalized thread set value, and creates and returns
a new thread set instance initialized to that set of threads.
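
A round trip between thread objects and nested tuples might look like the sketch below.  The Thread dataclass is a hypothetical stand-in with the "message_id" and "children" attributes described earlier, a dummy node is assumed to carry None in the first slot, and the store argument is omitted because the sketch has no store to consult.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Thread:
    """Hypothetical thread node: a message ID (or None) plus sub-threads."""
    message_id: Optional[str]
    children: list = field(default_factory=list)


def to_external_form(thread):
    # (message_id, child_tuple, child_tuple, ...)
    return (thread.message_id,
            *(to_external_form(child) for child in thread.children))


def from_external_form(external):
    message_id, *children = external
    return Thread(message_id, [from_external_form(child) for child in children])


# A dummy root (no message of its own) with two replies under it.
root = Thread(None, [
    Thread("<b@example.org>", [Thread("<d@example.org>")]),
    Thread("<c@example.org>"),
])

external = to_external_form(root)
print(external)
```

Since the external form is built only from tuples, strings, and None, it can be handed to pickle or written to a database column as it stands.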

MailboxStore is a subclass of ThreadableObjectStore designed to
wrap mailboxes (subclasses of mailbox.Mailbox).  For instance,

  >>> import mailbox
  >>> mbox = mailbox.mbox("foo.mbox")
  >>> mboxstore = MailboxStore(mbox)
  >>> threadset = ReferencesSet(mboxstore, mbox.itervalues())

will produce a thread set for all the messages in the mbox-format
mailbox 'foo.mbox', using the REFERENCES threading algorithm.

MailboxStore also provides a static method to compute the normalized
form of a message ID (the message ID stripped of <> angle brackets,
and various quoted parts unquoted):

  MailboxStore.normalize_message_id(message ID) => message ID

    Take a standard RFC 5322 message ID string and return the
    normalized form of it.
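
As an illustration of the normalization discussed in this thread, here is a minimal sketch, not the module's code.  It strips the angle brackets, unquotes a quoted left-hand side (interpreting quoted-pairs), and leaves brackets on the right-hand side alone; the handling of IDs without an "@" is an assumption.

```python
import re


def normalize_message_id(message_id):
    """Return a normalized form of an RFC 5322 message ID string."""
    text = message_id.strip()
    if text.startswith("<") and text.endswith(">"):
        text = text[1:-1]

    # Split at the last "@" so a quoted left side may itself contain "@".
    left, separator, right = text.rpartition("@")
    if not separator:
        return text  # assumption: no "@" means nothing further to do

    if len(left) >= 2 and left.startswith('"') and left.endswith('"'):
        # Drop the surrounding quotes, then turn each quoted-pair "\x" into "x".
        left = re.sub(r"\\(.)", r"\1", left[1:-1])

    # The right side is kept as is, including any domain-literal brackets.
    return left + "@" + right


print(normalize_message_id(r'<"b\l\o\op"@grok.this>'))
print(normalize_message_id('<"bloop"@grok.this>'))
```

With this sketch the two forms raised earlier in the thread normalize to the same string, so they compare equal.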

Re: API for email threading library?

jalopyuser
In reply to this post by Stephen J. Turnbull
Stephen J. Turnbull <[hidden email]> wrote:

>  > Lisp message threading?  What's that in aid of, if you can say?
>
> The "VM" MUA for Emacs and XEmacs.

Incidentally, I'm using the Nov 2011 Python-dev archive as a test mbox.
If you were to try it with your software, too, we could test the
implementations against each other.

Bill

Re: API for email threading library?

Stephen J. Turnbull
Bill Janssen writes:
 > Stephen J. Turnbull <[hidden email]> wrote:
 >
 > >  > Lisp message threading?  What's that in aid of, if you can say?
 > >
 > > The "VM" MUA for Emacs and XEmacs.
 >
 > Incidentally, I'm using the Nov 2011 Python-dev archive as a test mbox.
 > If were to try it with your software, too, we could test the
 > implementations against each other.

Of course I've been using an archive of my own, but I'm more than
happy to switch to something publicly available.  I've got a big
meeting coming up on Saturday, I'll get back to you on this after
that.

Steve