thoughts on an iterator

Brandon Craig Rhodes
Graham, I confess that it was I who brought up the idea of a wsgi.input
iterator at the WSGI Open Space yesterday evening. :-) The discussion
seemed to be assuming a file-like input object that could be read from
by a piece of middleware, then "backed up" or "rewound" before passing
it down to the next layer.  This seemed to have problems: it doesn't
support the case where the middleware wants to alter the input or pass
it piecemeal down to the client as it comes in, and it also means that
the *entire* input stream has to be kept around in memory for the
lifetime of the whole request in case the client reading it is not the
"real client" at the bottom of the stack, and a request is coming that
will ask for the whole thing to be replayed.

So, I suggested placing the responsibility for rewind and buffering on
the middleware.  You want to read 2k of the input to make a middleware
decision before invoking the next layer down?  Then read it, and pass
along a fresh iterator that first yields that 2k, then starts yielding
everything from the partially-read iterator.  Or, you can pass along a
filter iterator that scans or changes the entire stream as it reads it
from the upstream iterator.
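
To make the first case concrete, here's a rough sketch, assuming the
proposed protocol where environ['wsgi.input'] is an iterator of byte
strings; decide_something() and next_app are made-up stand-ins for
whatever the middleware actually does:

   import itertools

   def middleware(environ, start_response):
       input_iter = environ['wsgi.input']   # assumed: an iterator of bytes
       head, consumed = [], 0
       for chunk in input_iter:             # peek at roughly the first 2k
           head.append(chunk)
           consumed += len(chunk)
           if consumed >= 2048:
               break
       decide_something(b''.join(head))     # hypothetical decision helper
       # Fresh iterator: first the buffered head, then the rest of the
       # partially-read upstream iterator.
       environ['wsgi.input'] = itertools.chain(iter(head), input_iter)
       return next_app(environ, start_response)   # the wrapped application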

But, having thought more about the idea, I think that your criticisms,
Graham, are exactly on-target.  Iterators don't give enough control to
the reader to ask about the chunks (lines or blocks) that get delivered
as they read.  So at the very least we should indeed be looking at a
file-like object; it's still easy to construct a file-like object that's
really streaming from another file as it comes in, and we could even
provide shortcuts to build files from inline iterators or something.
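
For instance, a minimal file-like wrapper over such an iterator might
look like this (just a sketch; the name and the buffering policy are
invented):

   class IterFile:
       # Sketch: expose an iterator of byte strings through read(),
       # buffering no more than one partial chunk at a time.
       def __init__(self, iterator):
           self._iter = iterator
           self._pending = b''

       def read(self, size=-1):
           parts, total = [self._pending], len(self._pending)
           while size < 0 or total < size:
               try:
                   chunk = next(self._iter)
               except StopIteration:
                   break
               parts.append(chunk)
               total += len(chunk)
           data = b''.join(parts)
           if size < 0:
               self._pending = b''
               return data
           self._pending = data[size:]
           return data[:size]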

And, the idea that each piece of middleware does its *own* buffering
might be a bad one too.  One might naively store everything in RAM,
another might put blocks on disk, another might run you out of /tmp
space trying to do the same thing - even storing duplicates of the same
data if we're not careful!  The same 1MB initial block could wind up on
disk two or three times if each piece of middleware thinks it's the one
with it cached to pass along to the bottom layer that's reading 16k
blocks at a time.

So what's left of my suggestion?  I suggest that we *not* commit to
unlimited rewinding of the input stream; that was my single real
insight, and an uncontrollable iterator design gives up too much in
order to achieve that.  A file-like object is more appropriate, but we
either need to make middleware do its own caching of partially-consumed
data, *or* we need some way for middleware to signal whether it needs
the older data kept.

Could "input.bookmark()" signal that everything from this point on in
the stream needs to be retained, in memory or on disk, to be rewound to
later?  And the data be released only after the bookmark is deleted?

   b = input.bookmark()
   input.read()...

   input2 = b.file()
   del b
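
One possible shape for that machinery, sketched with invented names,
retention in RAM purely for brevity (a real framework might spill to
disk), and a release() method standing in for "del b":

   import io

   class BookmarkableInput:
       # Sketch only: wraps the real input and retains data for any
       # active bookmarks.
       def __init__(self, source):
           self._source = source
           self._marks = []

       def read(self, size):
           data = self._source.read(size)
           for m in self._marks:
               m._chunks.append(data)   # retain for possible rewind
           return data

       def bookmark(self):
           m = _Bookmark(self)
           self._marks.append(m)
           return m

   class _Bookmark:
       def __init__(self, owner):
           self._owner = owner
           self._chunks = []

       def file(self):
           # Replay everything read since the bookmark was taken.
           return io.BytesIO(b''.join(self._chunks))

       def release(self):
           # Stands in for "del b": stop retaining data.
           self._owner._marks.remove(self)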

Or, we could allow the "input" object to support cloning, where all data
is cached from the clone-that's-read-least-far to the one that's read
the farthest:

   c = input.clone()
   input.read(100)
   # 100 bytes are now cached by the framework, in RAM or on disk or on
   # a USB keyfob or wherever this framework puts it. (Django will write
   # their own caching that's different from everyone else's).
   c.read(100)
   # the bytes are released
   del c
   # Now that there's just one active clone, no buffering takes place.
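
One way such a framework-level clone might work, sketched with invented
names (CloneableInput, _SharedState) and a close() call standing in for
"del c"; the shared buffer holds only the span between the slowest and
fastest reader, and lives in RAM here purely for brevity:

   class _SharedState:
       # One buffer shared by all clones; each clone keeps its own
       # absolute offset into the stream.
       def __init__(self, source):
           self.source = source      # the real underlying input stream
           self.buffer = b''
           self.base = 0             # absolute offset of buffer[0]
           self.clones = []

   class CloneableInput:
       def __init__(self, source=None, shared=None):
           self._s = shared or _SharedState(source)
           self._pos = self._s.base
           self._s.clones.append(self)

       def clone(self):
           c = CloneableInput(shared=self._s)
           c._pos = self._pos
           return c

       def read(self, size):
           s = self._s
           # Pull more data from the source if this clone is ahead of it.
           need = (self._pos + size) - (s.base + len(s.buffer))
           if need > 0:
               s.buffer += s.source.read(need)
           start = self._pos - s.base
           data = s.buffer[start:start + size]
           self._pos += len(data)
           self._trim()
           return data

       def _trim(self):
           # Drop bytes that every live clone has already consumed.
           s = self._s
           slowest = min(c._pos for c in s.clones)
           if slowest > s.base:
               s.buffer = s.buffer[slowest - s.base:]
               s.base = slowest

       def close(self):
           self._s.clones.remove(self)
           if self._s.clones:
               self._trim()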

That way you could "read ahead" on your own input, while passing the
complete stream back down to the next level.  This has the disadvantage
that if a middleware piece wants to keep the first 100MB and last 100MB
from a stream but throw out the middle, it's got no way to do so without
dropping back to its own caching scheme that the framework can't
coordinate with other schemes; but it seems to cover the majority of
cases that I can think of.

Anyway: no unlimited caching, no unlimited rewind; that's my argument.
Iterators were just one way of cleanly getting there, but probably, in
the light of the next day, not a powerful enough way.

--
Brandon Craig Rhodes   [hidden email]   http://rhodesmill.org/brandon

Re: thoughts on an iterator

Robert Brewer-4
Brandon Craig Rhodes wrote:
> Graham, I confess that it was I who brought up the idea of a wsgi.input
> iterator at the WSGI Open Space yesterday evening. :-) The discussion
> seemed to be assuming a file-like input object that could be read from
> by a piece of middleware, then "backed up" or "rewound" before passing
> it down to the next layer.  This seemed to have problems: it doesn't
> support the case where the middleware wants to alter the input or pass
> it piecemeal down to the client as it comes in, and it also means that
> the *entire* input stream has to be kept around in memory for the
> lifetime of the whole request in case the client reading it is not the
> "real client" at the bottom of the stack, and a request is coming that
> will ask for the whole thing to be replayed.
>
> So, I suggested placing the responsibility for rewind and buffering on
> the middleware.  You want to read 2k of the input to make a middleware
> decision before invoking the next layer down?  Then read it, and pass
> along a fresh iterator that first yields that 2k, then starts yielding
> everything from the partially-read iterator.  Or, you can pass along a
> filter iterator that scans or changes the entire stream as it reads it
> from the upstream iterator.
>
> But, having thought more about the idea, I think that your criticisms,
> Graham, are exactly on-target.  Iterators don't give enough control to
> the reader to ask about the chunks (lines or blocks) that get delivered
> as they read.  So at the very least we should indeed be looking at a
> file-like object;

Hmmmm. Graham brought up chunked requests which I don't think have much
bearing on this issue--the server/app can't rely on the client-specified
chunk sizes either way (or you enable a Denial of Service attack). I
don't see much difference between the file approach and the iterator
approach, other than moving the read chunk size from the app (or more
likely, the cgi module) to the server. That may be what kills this
proposal: cgi.FieldStorage expects a file pointer and I doubt we want to
either rewrite the entire cgi module to support iterators, or re-package
the iterator up as a file.

> it's still easy to construct a file-like object that's
> really streaming from another file as it comes in, and we could even
> provide shortcuts to build files from inline iterators or something.

Right; either approach can be re-streamed pretty easily.

> And, the idea that each piece of middleware does its *own* buffering
> might be a bad one too.  One might naively store everything in RAM,
> another might put blocks on disk, another might run you out of /tmp
> space trying to do the same thing - even storing duplicates of the same
> data if we're not careful!  The same 1MB initial block could wind up on
> disk two or three times if each piece of middleware thinks it's the one
> with it cached to pass along to the bottom layer that's reading 16k
> blocks at a time.

Any middleware which did so would pretty quickly get fixed or abandoned.
I don't think that's a strong argument given that we have many
developers with experience in this area from existing middleware.

> So what's left of my suggestion?  I suggest that we *not* commit to
> unlimited rewinding of the input stream; that was my single real
> insight, and an uncontrollable iterator design gives up too much in
> order to achieve that.  A file-like object is more appropriate, but we
> either need to make middleware do its own caching of partially-consumed
> data, *or* we need some way for middleware to signal whether it needs
> the older data kept.
>
> Could "input.bookmark()" signal that everything from this point on in
> the stream needs to be retained, in memory or on disk, to be rewound to
> later?  And the data be released only after the bookmark is deleted?
>
>    b = input.bookmark()
>    input.read()...
>
>    input2 = b.file()
>    del b
>
> Or, we could allow the "input" object to support cloning, where all data
> is cached from the clone-that's-read-least-far to the one that's read
> the farthest:
>
>    c = input.clone()
>    input.read(100)
>    # 100 bytes are now cached by the framework, in RAM or on disk or on
>    # a USB keyfob or wherever this framework puts it. (Django will write
>    # their own caching that's different from everyone else's).
>    c.read(100)
>    # the bytes are released
>    del c
>    # Now that there's just one active clone, no buffering takes place.
>
> That way you could "read ahead" on your own input, while passing the
> complete stream back down to the next level.  This has the disadvantage
> that if a middleware piece wants to keep the first 100MB and last 100MB
> from a stream but throw out the middle, it's got no way to do so without
> dropping back to its own caching scheme that the framework can't
> coordinate with other schemes; but it seems to cover the majority of
> cases that I can think of.

Those seem like strategies for individual middleware components to
implement, not something to burden the general case with.

> Anyway: no unlimited caching, no unlimited rewind; that's my argument.
> Iterators were just one way of cleanly getting there, but probably, in
> the light of the next day, not a powerful enough way.

I'd vote to stick with the file-like approach for no other reason than
that FieldStorage expects one.


Robert Brewer
[hidden email]


Re: thoughts on an iterator

ianb
2009/3/28 Robert Brewer <[hidden email]>:
> Hmmmm. Graham brought up chunked requests which I don't think have much
> bearing on this issue--the server/app can't rely on the client-specified
> chunk sizes either way (or you enable a Denial of Service attack). I
> don't see much difference between the file approach and the iterator
> approach, other than moving the read chunk size from the app (or more
> likely, the cgi module) to the server. That may be what kills this
> proposal: cgi.FieldStorage expects a file pointer and I doubt we want to
> either rewrite the entire cgi module to support iterators, or re-package
> the iterator up as a file.

There are some alternate implementations of the cgi POST-parsing
functionality, some of which might be more amenable to using an
iterator.  For that matter, probably none of us has read the cgi
module with this in mind.  From a quick look, it'll be slightly tricky
because it uses .readline a lot, but there's just not that much code
involved so it can't be too hard.
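
Something like this quick, untested sketch might be enough to feed
line-oriented parsers from an iterator (the class name is made up):

   class IterReader:
       # Adapt an iterator of byte strings for code that calls
       # .readline(), such as the cgi module's parsing.
       def __init__(self, iterator):
           self._iter = iterator
           self._buf = b''

       def readline(self):
           while b'\n' not in self._buf:
               try:
                   self._buf += next(self._iter)
               except StopIteration:
                   break
           line, sep, rest = self._buf.partition(b'\n')
           self._buf = rest
           return line + sep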

For clarity, I think everyone has been discussing an *iterator*, not
an iterable; an iterable would have a lot of unnecessary overhead, but
I've seen both terms used.

I don't agree with Graham's objection, as I think the reason to read
specific-sized chunks is that you don't want to read too much data
into memory at one time.  But the server is free to chunk the iterator
to avoid too much data, and once the strings are in memory the
consumer really isn't any better off reading a smaller chunk than what
is available.

This also means I can stop making up entirely random chunk sizes in
applications.  Applications have no real information to inform this
chunking.  If the string is already in memory, the chunking actually
is counterproductive.

--
Ian Bicking  |  http://blog.ianbicking.org

Re: thoughts on an iterator

Alan Kennedy-12
In reply to this post by Robert Brewer-4
Hi all,

It was great to meet (nearly) everybody at PyCon; I look forward to
the next time.

I particularly want to thank Robert for being so meticulous about
recording and reporting the discussions; a necessary part of moving
forward, IMO.

[Robert]
> Hmmmm. Graham brought up chunked requests which I don't think have much
> bearing on this issue--the server/app can't rely on the client-specified
> chunk sizes either way (or you enable a Denial of Service attack). I
> don't see much difference between the file approach and the iterator
> approach, other than moving the read chunk size from the app (or more
> likely, the cgi module) to the server. That may be what kills this
> proposal: cgi.FieldStorage expects a file pointer and I doubt we want to
> either rewrite the entire cgi module to support iterators, or re-package
> the iterator up as a file.

I recommend that any discussion of file-like vs. iterator for input
should be informed by this discussion between myself and PJE back when
the spec was being written.

http://mail.python.org/pipermail/web-sig/2004-September/000885.html

Most relevant quote:

[PJE]

> Aha!  There's the problem.  The 'read()' protocol is what's wrong.  If
> 'wsgi.input' were an *iterator* instead of a file-like object, it would be
> fairly straightforward for async servers to implement "would block" reads
> as yielding empty strings.  And, servers could actually support streaming
> input via chunked encoding, because they could just yield blocks once
> they've arrived.
>
> The downside to making 'wsgi.input' an iterator is that you lose control
> over how much data to read at a time: the upstream server or middleware
> determines how much data you get.  But, it's quite possible to make a
> buffering, file-like wrapper over such an iterator, if that's what you
> really need, and your code is synchronous.  (This will slightly increase
> the coding burden for interfacing applications and frameworks that expect
> to have a readable stream for CGI input.)  For asynchronous code, you're
> just going to invoke some sort of callback with each block, and it's the
> callback's job to deal with it.
>
> What does everybody think?  If combined with a "pause iterating me until
> there's input data available" extension API, this would let the input
> stream be non-blocking, and solve the chunked-encoding input issue all in
> one change to the protocol.  Or am I missing something here?

http://mail.python.org/pipermail/web-sig/2004-September/000890.html
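
To make the "would block" idea concrete, here's a rough sketch of such
an input iterator; the names are invented and a non-blocking socket is
assumed, since PJE's post doesn't prescribe an implementation:

   class NonBlockingInput:
       # Sketch: yield b'' when a read would block, so an async server
       # can reschedule the application instead of stalling on input.
       def __init__(self, sock):
           self._sock = sock     # assumed: a non-blocking socket

       def __iter__(self):
           return self

       def __next__(self):
           try:
               data = self._sock.recv(8192)
           except BlockingIOError:
               return b''            # "would block": try again later
           if not data:
               raise StopIteration   # client finished sending
           return data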

I'd also be interested in the Twisted folks' take on that discussion.

All the best,

Alan.