Backup plan: WSGI 1 Addenda and wsgiref update for Py3


PJ Eby
While the Web-SIG is trying to hash out PEP 444, I thought it would
be a good idea to have a backup plan that would allow the Python 3
stdlib to move forward, without needing a major new spec to settle
out implementation questions.

After all, even if PEP 333 is ultimately replaced by PEP 444, it's
probably a good idea to have *some* sort of WSGI 1-ish thing
available on Python 3, with bytes/unicode and other matters settled.

In the past, I was waiting for some consensuses (consensi?) on
Web-SIG about different approaches to Python 3, looking for some sort
of definite, "yes, we all like this" response.  However, I can see
now that this just means it's my fault we don't have a spec yet.    :-(

So, unless any last-minute showstopper rebuttals show up this week,
I've decided to go ahead and officially bless nearly all of what Graham
Dumpleton (who's not only the mod_wsgi author, but has put huge
amounts of work into shepherding WSGI-on-Python3 proposals, WSGI
amendments, etc.) has proposed, with a few minor exceptions.

In other words: almost none of the following is my own original work;
it's like 90% Graham's.  Any praise for this belongs to him; the only
thing that belongs to me is the blame for not doing this
sooner!  (Sorry Graham.  You asked me to do this ages ago, and you were right.)

Anyway, I'm posting this for comment to both Python-Dev and the
Web-SIG.  If you are commenting on the technical details of the
amendments, please reply to the Web-SIG only.  If you are commenting
on the development agenda for wsgiref or other Python 3 library
issues, please reply to Python-Dev only.  That way, neither list will
see off-topic discussions.  Thanks!


The Plan
========

I plan to update the proposal below per comments and feedback during
this week, then update PEP 333 itself over the weekend or early next
week, followed by a code review of Python 3's wsgiref, and
implementation of needed changes (such as recoding os.environ to
latin1-captured bytes in the CGI handler).

To complete the changes, it is possible that I may need assistance
from one or more developers who have more Python 3 experience.  If
after reading the proposed changes to the spec, you would like to
volunteer to help with updating wsgiref to match, please let me know!


The Proposal
============


Overview
--------

1. The primary purpose of this update is to provide a uniform porting
pattern for moving Python 2 WSGI code to Python 3, meaning a pattern
of changes that can be mechanically applied to as little code as
practical, while still keeping the WSGI spec easy to programmatically
validate (e.g. via ``wsgiref.validate``).

The Python 3 specific changes are to use:

* ``bytes`` for I/O streams in both directions
* ``str`` for environ keys and values
* ``bytes`` for arguments to start_response() and write()
* text stream for wsgi.errors

In other words, "strings in, bytes out" for headers, bytes for bodies.
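
As a rough illustration of these rules (a sketch only, not part of the proposed spec text), a Python 3 application would read ``str`` values from environ and hand ``bytes`` to start_response() and back to the server as its body. The names ``app`` and ``record_start_response`` are made up for this example, and the stand-in "server" is just a direct function call.

```python
def app(environ, start_response):
    # environ keys and values are native strings (str) on Python 3.
    path = environ.get("PATH_INFO", "/")
    body = ("You asked for " + path + "\n").encode("utf-8")
    # Under this proposal, status and headers are passed as bytes.
    start_response(b"200 OK", [
        (b"Content-Type", b"text/plain; charset=utf-8"),
        (b"Content-Length", str(len(body)).encode("latin-1")),
    ])
    # The response body is an iterable of bytes.
    return [body]


# A stand-in for a server: call the app directly and record what it sends.
captured = {}

def record_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers
    return lambda data: None  # placeholder write() callable

chunks = app({"PATH_INFO": "/hello"}, record_start_response)
```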

In general, only changes that don't break Python 2 WSGI
implementations are allowed.  The changes should also not break
mod_wsgi on Python 3, but may make some Python 3 wsgi applications
non-compliant, despite continuing to function on mod_wsgi.

This is because mod_wsgi allows applications to output string headers
and bodies, but I am ruling that option out because it forces every
piece of middleware to have to be tested with arbitrary combinations
of strings and bytes in order to test compliance.  If you want your
application to output strings rather than bytes, you can always use a
decorator to do that.  (And a sample one could be provided in wsgiref.)


2. The secondary purpose of the update is to address some
long-standing open issues documented here:

    http://www.wsgi.org/wsgi/Amendments_1.0

As with the Python 3 changes, only changes that don't retroactively
invalidate existing implementations are allowed.


3. There is no tertiary purpose.  ;-)  (By which I mean, all other
kinds of changes are out-of-scope for this update.)


4. The section below labeled "A Note On String Types" is proposed for
verbatim addition to the "Specification Overview" section in the PEP;
the other sections below describe changes to be made inline at the
appropriate part of the spec, and changes that were proposed but are
rejected for inclusion in this amendment.


A Note On String Types
----------------------

In general, HTTP deals with bytes, which means that this
specification is mostly about handling bytes.

However, the content of those bytes often has some kind of textual
interpretation, and in Python, strings are the most convenient way to
handle text.

But in many Python versions and implementations, strings are Unicode,
rather than bytes.  This requires a careful balance between a usable
API and correct translations between bytes and text in the context of
HTTP...  especially to support porting code between Python
implementations with different ``str`` types.

WSGI therefore defines two kinds of "string":

* "Native" strings (which are always implemented using the type named ``str``)

* "Bytestrings" (which are implemented using the ``bytes`` type in
Python 3, and ``str`` elsewhere)

So, even though HTTP is in some sense "really just bytes", there are
many API conveniences to be had by using whatever Python's default
``str`` type is.

Do not be confused however: even if Python's ``str`` is actually
Unicode under the hood, the *content* of a native string is still
restricted to bytes!  See the section on `Unicode Issues`_ later in
this document.

In short: where you see the word "string" in this document, it refers
to a "native" string, i.e., an object of type ``str``, whether it is
internally implemented as bytes or unicode.  Where you see references
to "bytestring", this should be read as "an object of type ``bytes``
under Python 3, or type ``str`` under Python 2".
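
To make the "restricted to bytes" point concrete, here is a sketch that assumes the Latin-1 convention mentioned in the plan above (one code point per byte). The helper names ``bytes_to_native`` and ``native_to_text`` are hypothetical and exist only for this illustration.

```python
def bytes_to_native(raw):
    """Server side: wrap raw HTTP bytes in a native str, one code point per byte."""
    return raw.decode("latin-1")


def native_to_text(value, charset="utf-8"):
    """Application side: recover the original bytes, then decode with the real charset."""
    return value.encode("latin-1").decode(charset)


raw_path = "/caf\u00e9".encode("utf-8")   # bytes as they arrived on the wire
native = bytes_to_native(raw_path)        # a str, but every code point is < 256
text = native_to_text(native)             # the text the client actually meant
```

The native string is a lossless container for the bytes, which is why it can be converted back and decoded properly once the application knows the charset.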


Clarifications (To be made in-line)
-----------------------------------

The following amendments are clarifications to parts of the existing
spec that proved over the years to be ambiguous or insufficiently
specified, as well as some attempts to correct practical errors.

(Note: many of these issues cannot be completely fixed in WSGI 1
without breaking existing implementations, and so the text below has
notations such as "(MUST in WSGI 2)" to indicate where any
replacement spec for WSGI 1 should strengthen them.)

* If an application returns a body iterator, a server (or middleware)
MAY stop iterating over it and discard the remainder of the output,
as long as it calls any close() method provided by the
iterator.  Applications returning a generator or other custom
iterator SHOULD NOT assume that the entire iterator will be
consumed.  (This change makes it explicit that caching middleware or
HEAD-processing servers can throw away the response body.)

* start_response() SHOULD (MUST in WSGI 2) check for errors in the
status or headers at the time it's called, so that an error can be
raised as close to the problem as possible

* If start_response() raises an error when called normally (i.e.
without exc_info), it SHOULD be an error to call it a second time
without passing exc_info

* The SERVER_PORT variable is of type str, just like any other CGI
environ variable.  (According to the WSGI wiki, "some
implementations" expect it to be an integer, even though there is
nothing in the WSGI spec that allows a CGI variable to be anything but a str.)

* A server SHOULD (MUST in WSGI 2) support the size hint argument to
readline() on its wsgi.input stream.

* A server SHOULD (MUST in WSGI 2) return an empty bytestring from
read() on wsgi.input to indicate an end-of-file condition.  (In WSGI
2, language should be clarified to allow the input stream length and
CONTENT_LENGTH to be out of sync, for reasons explained in Graham's blog post.)

* A server SHOULD (MUST in WSGI 2) allow read() to be called without
an argument, and return the entire remaining contents of the stream

* If an application provides a Content-Length header, the server
SHOULD NOT (MUST NOT in WSGI 2) send more data to the client than was
specified in that header, whether via write(), yielded body
bytestrings, or via a wsgi.file_wrapper.  (This rule applies to
middleware as well.)

* wsgi.errors is a text stream accepting "native strings"
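
The rules above about early termination and Content-Length can be sketched from the server's side as follows. This is an illustration under stated assumptions, not wsgiref code: ``send_body``, its parameters, and the ``Body`` class are hypothetical names.

```python
def send_body(app_iter, write, content_length=None):
    """Send a response body, honouring Content-Length and always calling close()."""
    remaining = content_length
    try:
        for chunk in app_iter:
            if remaining is not None:
                chunk = chunk[:remaining]   # never send more than was declared
                remaining -= len(chunk)
            if chunk:
                write(chunk)
            if remaining == 0:
                break                       # stop iterating; the rest is discarded
    finally:
        # close() must be called even when iteration stops early.
        close = getattr(app_iter, "close", None)
        if close is not None:
            close()


class Body:
    """A custom iterable that records whether close() was called."""
    def __init__(self, chunks):
        self.chunks = chunks
        self.closed = False

    def __iter__(self):
        return iter(self.chunks)

    def close(self):
        self.closed = True


sent = []
body = Body([b"hello ", b"world", b" and more"])
send_body(body, sent.append, content_length=8)
```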



Rejected Amendments
-------------------

* Manlio Perillo's suggestion to allow header specification to be
delayed until the response iterator is producing non-empty
output.  This would've been a possible win for async WSGI, but could
require substantial changes to existing servers.

_______________________________________________
Web-SIG mailing list
[hidden email]
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/lists%40nabble.com

Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3

Chris McDonough
On Tue, 2010-09-21 at 12:09 -0400, P.J. Eby wrote:
> While the Web-SIG is trying to hash out PEP 444, I thought it would
> be a good idea to have a backup plan that would allow the Python 3
> stdlib to move forward, without needing a major new spec to settle
> out implementation questions.

If a WSGI-1-compatible protocol seems more sensible to folks, I'm
personally happy to defer discussion on PEP 444 or any other
backwards-incompatible proposal.

- C



Re: [Python-Dev] Backup plan: WSGI 1 Addenda and wsgiref update for Py3

ianb
On Tue, Sep 21, 2010 at 12:47 PM, Chris McDonough <[hidden email]> wrote:
> On Tue, 2010-09-21 at 12:09 -0400, P.J. Eby wrote:
> > While the Web-SIG is trying to hash out PEP 444, I thought it would
> > be a good idea to have a backup plan that would allow the Python 3
> > stdlib to move forward, without needing a major new spec to settle
> > out implementation questions.
>
> If a WSGI-1-compatible protocol seems more sensible to folks, I'm
> personally happy to defer discussion on PEP 444 or any other
> backwards-incompatible proposal.

I think both make sense, making WSGI 1 sensible for Python 3 (as well as other small errata like the size hint) doesn't detract from PEP 444 at all, IMHO.

--
Ian Bicking  |  http://blog.ianbicking.org


Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3

ianb
In reply to this post by PJ Eby
On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby <[hidden email]> wrote:
> The Python 3 specific changes are to use:
>
> * ``bytes`` for I/O streams in both directions
> * ``str`` for environ keys and values
> * ``bytes`` for arguments to start_response() and write()

This is the only thing that seems odd to me -- it seems like the response should be symmetric with the request, and the request in this case uses str for headers (status being header-like), and bytes for the body.

Otherwise this seems good to me, the only other major errata I can think of are all listed in the links you included.

> * text stream for wsgi.errors
>
> In other words, "strings in, bytes out" for headers, bytes for bodies.
>
> In general, only changes that don't break Python 2 WSGI implementations are allowed.  The changes should also not break mod_wsgi on Python 3, but may make some Python 3 wsgi applications non-compliant, despite continuing to function on mod_wsgi.
>
> This is because mod_wsgi allows applications to output string headers and bodies, but I am ruling that option out because it forces every piece of middleware to have to be tested with arbitrary combinations of strings and bytes in order to test compliance.  If you want your application to output strings rather than bytes, you can always use a decorator to do that.  (And a sample one could be provided in wsgiref.)

I agree allowing both is not ideal.


--
Ian Bicking  |  http://blog.ianbicking.org


Re: [Python-Dev] Backup plan: WSGI 1 Addenda and wsgiref update for Py3

Jeff Hardy-4
In reply to this post by Chris McDonough
On Tue, Sep 21, 2010 at 10:47 AM, Chris McDonough <[hidden email]> wrote:
> On Tue, 2010-09-21 at 12:09 -0400, P.J. Eby wrote:
> If a WSGI-1-compatible protocol seems more sensible to folks, I'm
> personally happy to defer discussion on PEP 444 or any other
> backwards-incompatible proposal.

I think both make sense. PEP 444 can continue to be worked out (and it
should be!); the changes here are pretty much uncontroversial. It also
helps clarify how WSGI should work on IronPython, which has the same
str/unicode issues as Python 3. The fact that it's basically how I've
implemented it for IronPython is nice as well.

- Jeff

Re: [Python-Dev] Backup plan: WSGI 1 Addenda and wsgiref update for Py3

Jeff Hardy-4
In reply to this post by ianb
On Tue, Sep 21, 2010 at 10:57 AM, Ian Bicking <[hidden email]> wrote:

> On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby <[hidden email]> wrote:
>>
>> The Python 3 specific changes are to use:
>>
>> * ``bytes`` for I/O streams in both directions
>> * ``str`` for environ keys and values
>> * ``bytes`` for arguments to start_response() and write()
>
> This is the only thing that seems odd to me -- it seems like the response
> should be symmetric with the request, and the request in this case uses str
> for headers (status being header-like), and bytes for the body.

FWIW I agree with Ian about the symmetry breaking being odd. For
IronPython, most .NET webservers expect the status and headers as
strings, which in .NET are unicode, but that would just be an
implementation convenience for me.

- Jeff

Re: [Python-Dev] Backup plan: WSGI 1 Addenda and wsgiref update for Py3

PJ Eby
In reply to this post by ianb
At 12:55 PM 9/21/2010 -0400, Ian Bicking wrote:

>On Tue, Sep 21, 2010 at 12:47 PM, Chris McDonough
><[hidden email]> wrote:
>On Tue, 2010-09-21 at 12:09 -0400, P.J. Eby wrote:
> > While the Web-SIG is trying to hash out PEP 444, I thought it would
> > be a good idea to have a backup plan that would allow the Python 3
> > stdlib to move forward, without needing a major new spec to settle
> > out implementation questions.
>
>If a WSGI-1-compatible protocol seems more sensible to folks, I'm
>personally happy to defer discussion on PEP 444 or any other
>backwards-incompatible proposal.
>
>
>I think both make sense, making WSGI 1 sensible for Python 3 (as
>well as other small errata like the size hint) doesn't detract from
>PEP 444 at all, IMHO.

Yep.  I agree.  I do, however, want to get these amendments settled
and make sure they get carried over to whatever spec is the successor
to PEP 333.  I've had a lot of trouble following exactly what was
changed in 444, and I'm a tad worried that several new ambiguities
may be being introduced.  So, solidifying 333 a bit might be helpful
if it gives a good baseline against which to diff 444 (or whatever).


Re: [Python-Dev] Backup plan: WSGI 1 Addenda and wsgiref update for Py3

PJ Eby
In reply to this post by PJ Eby
At 06:52 PM 9/21/2010 +0200, Antoine Pitrou wrote:

>On Tue, 21 Sep 2010 12:09:44 -0400
>"P.J. Eby" <[hidden email]> wrote:
> > While the Web-SIG is trying to hash out PEP 444, I thought it would
> > be a good idea to have a backup plan that would allow the Python 3
> > stdlib to move forward, without needing a major new spec to settle
> > out implementation questions.
>
>If this allows the Web situation in Python 3 to be improved faster
>and with less hassle then all the better.
>There's something strange in your proposal: it mentions WSGI 2 at
>several places while there's no guarantee about what WSGI 2 will be (is
>there?).

Sorry - "WSGI 2" should be read as shorthand for, "whatever new spec
succeeds PEP 333", whether that's PEP 444 or something else.

It just means that any new spec that doesn't have to be
backward-compatible can (and should) more thoroughly address the
issue in question.


Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3

PJ Eby
In reply to this post by ianb
[trimming reply headers to just web-sig]

At 12:57 PM 9/21/2010 -0400, Ian Bicking wrote:

>On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby
><[hidden email]> wrote:
>The Python 3 specific changes are to use:
>
>* ``bytes`` for I/O streams in both directions
>* ``str`` for environ keys and values
>* ``bytes`` for arguments to start_response() and write()
>
>
>This is the only thing that seems odd to me -- it seems like the
>response should be symmetric with the request, and the request in
>this case uses str for headers (status being header-like), and bytes
>for the body.

Are you suggesting a "``str`` for headers, ``bytes`` for bodies"
approach instead?

I suppose that could work; I was going for "str in, bytes out".  My
assumption, though, was that headers are relatively easy to address
at a choke point from a framework's output.  But I guess that
iterator output is equally chokable.

I'm open to discussion on this point, so long as every value produced
or consumed by a WSGI application is of a specified single type().
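
One way to picture the "single specified type" requirement, together with the proposal's point that start_response() should report errors at the time it is called, is a checking wrapper like the sketch below. It assumes the bytes-output variant of the proposal; ``checked`` and ``dummy_start_response`` are hypothetical helpers, not anything that exists in wsgiref.

```python
def checked(start_response):
    """Wrap start_response so that type errors surface at the call site."""
    def start(status, response_headers, exc_info=None):
        # Exact type checks: no subclasses, no mixing of str and bytes.
        if type(status) is not bytes:
            raise TypeError("status must be bytes, got %r" % type(status))
        if type(response_headers) is not list:
            raise TypeError("headers must be a list, got %r" % type(response_headers))
        for item in response_headers:
            if type(item) is not tuple or len(item) != 2:
                raise TypeError("each header must be a 2-tuple: %r" % (item,))
            for part in item:
                if type(part) is not bytes:
                    raise TypeError("header names and values must be bytes: %r" % (item,))
        return start_response(status, response_headers, exc_info)
    return start


def dummy_start_response(status, headers, exc_info=None):
    return lambda data: None


start = checked(dummy_start_response)
start(b"200 OK", [(b"Content-Type", b"text/plain")])      # accepted

error = None
try:
    start("200 OK", [(b"Content-Type", b"text/plain")])   # str status: rejected
except TypeError as exc:
    error = str(exc)
```

Because each value has exactly one allowed type, the check is a handful of ``type(x) is ...`` comparisons rather than a matrix of str/bytes combinations.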


>Otherwise this seems good to me, the only other major errata I can
>think of are all listed in the links you included.

Um, if by "links" you mean, "included textually in the proposal",
then sure.  If it's not in the proposal, it's not going in the PEP,
even if it's on the WSGI Amendments page or Graham's blog.


Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3

ianb
On Tue, Sep 21, 2010 at 1:17 PM, P.J. Eby <[hidden email]> wrote:
> [trimming reply headers to just web-sig]
>
> At 12:57 PM 9/21/2010 -0400, Ian Bicking wrote:
>
>> On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby <[hidden email]> wrote:
>>> The Python 3 specific changes are to use:
>>>
>>> * ``bytes`` for I/O streams in both directions
>>> * ``str`` for environ keys and values
>>> * ``bytes`` for arguments to start_response() and write()
>>
>> This is the only thing that seems odd to me -- it seems like the response should be symmetric with the request, and the request in this case uses str for headers (status being header-like), and bytes for the body.
>
> Are you suggesting a "``str`` for headers, ``bytes`` for bodies" approach instead?

Yes.

> I suppose that could work; I was going for "str in, bytes out".  My assumption, though, was that headers are relatively easy to address at a choke point from a framework's output.  But I guess that iterator output is equally chokable.

The request body would still be bytes in either model (at least, I assumed that).

> I'm open to discussion on this point, so long as every value produced or consumed by a WSGI application is of a specified single type().
>
>> Otherwise this seems good to me, the only other major errata I can think of are all listed in the links you included.
>
> Um, if by "links" you mean, "included textually in the proposal", then sure.  If it's not in the proposal, it's not going in the PEP, even if it's on the WSGI Amendments page or Graham's blog.

Well, at a minimum there is the size hint on wsgi.input.  Things like CONTENT_LENGTH are probably more involved than is necessary for this revision.
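
For reference, the size hint and end-of-file behaviour under discussion can be demonstrated with ``io.BytesIO`` standing in for wsgi.input. That stand-in is an assumption made for this example; a real server supplies its own stream object.

```python
import io

wsgi_input = io.BytesIO(b"first line\nsecond line\n")

# readline() with a size hint returns at most that many bytes.
part = wsgi_input.readline(5)

# Without a hint, readline() returns the rest of the current line.
rest_of_line = wsgi_input.readline()

# read() with no argument returns everything that remains.
remaining = wsgi_input.read()

# At end of file, read() returns an empty bytestring.
eof = wsgi_input.read()
```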


--
Ian Bicking  |  http://blog.ianbicking.org


Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

PJ Eby
In reply to this post by ianb
At 12:57 PM 9/21/2010 -0400, Ian Bicking wrote:

>On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby
><[hidden email]> wrote:
>The Python 3 specific changes are to use:
>
>* ``bytes`` for I/O streams in both directions
>* ``str`` for environ keys and values
>* ``bytes`` for arguments to start_response() and write()
>
>
>This is the only thing that seems odd to me -- it seems like the
>response should be symmetric with the request, and the request in
>this case uses str for headers (status being header-like), and bytes
>for the body.

So, I've given some thought to your suggestion, and, while it's true
that most of the output headers are far less prone to ending up with
unintended unicode content, there are at least two output headers
that can include some sort of application content (and can therefore
have random failures): Location and Set-Cookie.

If these headers accidentally contain non-Latin1 characters, the
error isn't detectable until the header reaches the origin server
doing the transmission encoding, and it'll likely be a dynamic (and
therefore hard-to-debug) error.

However, if the output is always bytes (and this can be
relatively-statically verified), then any error can't occur except
*inside* the application, where the app's developer can find it more easily.
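
A small sketch of that difference: when the application itself encodes a header value, a stray non-Latin-1 character fails right there, in application code. The helper ``make_cookie_header`` and the cookie values are invented for the example.

```python
def make_cookie_header(value):
    """Encode a Set-Cookie header inside the application (bytes output)."""
    return (b"Set-Cookie", ("session=" + value).encode("latin-1"))


ok = make_cookie_header("abc123")

failed_in_app = False
try:
    make_cookie_header("\u4e2d\u6587")   # characters outside Latin-1
except UnicodeEncodeError:
    # The traceback points at application code, not at the server.
    failed_in_app = True
```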

So I guess the question boils down to: would we rather make sure that
coding errors happen *inside* applications, or would we rather make
porting WSGI apps trivial (or nearly so)?

But I think that it's possible here to have one's cake and eat it
too: if we require bytes for all outputs, but provide a pair of
decorators in wsgiref.util like the following:

     def encode_body(codec='utf8'):
         """Allow a WSGI app to output its response body as strings w/specified encoding"""
         def decorate(app):
             def encode(response):
                 try:
                     for data in response:
                         yield data.encode(codec)
                 finally:
                     if hasattr(response, 'close'):
                         response.close()
             def decorated_app(environ, start_response):
                 def start(status, response_headers, exc_info=None):
                     _write = start_response(status, response_headers, exc_info)
                     def write(data):
                         return _write(data.encode(codec))
                     return write
                 return encode(app(environ, start))
             return decorated_app
         return decorate

     def encode_headers(codec='latin1'):
         """Allow a WSGI app to output its headers as strings, w/specified encoding"""
         def decorate(app):
             def decorated_app(environ, start_response):
                 def start(status, response_headers, exc_info=None):
                     status = status.encode(codec)
                     response_headers = [
                         (k.encode(codec), v.encode(codec)) for k, v in response_headers
                     ]
                     return start_response(status, response_headers, exc_info)
                 return app(environ, start)
             return decorated_app
         return decorate

So, this seems like a win-win to me: relatively-static verification,
errors stay in the app (or at least in the decorator), and the API is
clean-and-easy.  Indeed, it seems likely that at least some apps that
don't read wsgi.input themselves could be ported *just* by adding the
appropriate decorator(s).  And, if your app is using unicode on 2.x,
you can even use the same decorators there, for the benefit of
2to3.  (Assuming I release an updated standalone wsgiref version with
the decorators, of course.)

So, unless somebody has some additional arguments on this one, I
think I'm going to stick with bytes output.


Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

ianb
On Thu, Sep 23, 2010 at 11:06 AM, P.J. Eby <[hidden email]> wrote:
> At 12:57 PM 9/21/2010 -0400, Ian Bicking wrote:
>> On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby <[hidden email]> wrote:
>>> The Python 3 specific changes are to use:
>>>
>>> * ``bytes`` for I/O streams in both directions
>>> * ``str`` for environ keys and values
>>> * ``bytes`` for arguments to start_response() and write()
>>
>> This is the only thing that seems odd to me -- it seems like the response should be symmetric with the request, and the request in this case uses str for headers (status being header-like), and bytes for the body.
>
> So, I've given some thought to your suggestion, and, while it's true that most of the output headers are far less prone to ending up with unintended unicode content, there are at least two output headers that can include some sort of application content (and can therefore have random failures): Location and Set-Cookie.
>
> If these headers accidentally contain non-Latin1 characters, the error isn't detectable until the header reaches the origin server doing the transmission encoding, and it'll likely be a dynamic (and therefore hard-to-debug) error.

I don't see any reason why Location shouldn't be ASCII.  Any header could have any character put in it, of course, there's just no valid case where Location shouldn't be a URL, and URLs are ASCII.  Cookie can contain weirdness, yes.  I would expect any library that abstracts cookies to handle this (it's certainly doable)... otherwise, this seems like one among many ways a person can do the wrong thing.
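
For instance, a non-ASCII path can be percent-encoded before it goes into a Location header, which leaves only ASCII characters. This sketch uses ``urllib.parse.quote`` from the Python 3 standard library; the path is an invented example.

```python
from urllib.parse import quote

path = "/caf\u00e9/men\u00fc"

# quote() percent-encodes the UTF-8 bytes by default and leaves "/" alone.
location = quote(path)

# The result is pure ASCII, so it encodes cleanly with ascii or latin-1 alike.
location_bytes = location.encode("ascii")
```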

This can also be detected with the validator, which doesn't avoid runtime errors, but bytes allow runtime errors too -- they will just happen somewhere else (e.g., when a value is converted to bytes in an application or library).

If servers print the invalid value on error (instead of just some generic error) I don't think it would be that hard to track down problems.  This requires some explicit effort on the part of the server (most servers handle app_iter==None ungracefully, which is a similar problem).


> However, if the output is always bytes (and this can be relatively-statically verified), then any error can't occur except *inside* the application, where the app's developer can find it more easily.
>
> So I guess the question boils down to: would we rather make sure that coding errors happen *inside* applications, or would we rather make porting WSGI apps trivial (or nearly so)?
>
> But I think that it's possible here to have one's cake and eat it too: if we require bytes for all outputs, but provide a pair of decorators in wsgiref.util like the following:
>
>    def encode_body(codec='utf8'):
>        """Allow a WSGI app to output its response body as strings w/specified encoding"""
>        def decorate(app):
>            def encode(response):
>                try:
>                    for data in response:
>                        yield data.encode(codec)
>                finally:
>                    if hasattr(response, 'close'):
>                        response.close()
>            def decorated_app(environ, start_response):
>                def start(status, response_headers, exc_info=None):
>                    _write = start_response(status, response_headers, exc_info)
>                    def write(data):
>                        return _write(data.encode(codec))
>                    return write
>                return encode(app(environ, start))
>            return decorated_app
>        return decorate
>
>    def encode_headers(codec='latin1'):
>        """Allow a WSGI app to output its headers as strings, w/specified encoding"""
>        def decorate(app):
>            def decorated_app(environ, start_response):
>                def start(status, response_headers, exc_info=None):
>                    status = status.encode(codec)
>                    response_headers = [
>                        (k.encode(codec), v.encode(codec)) for k,v in response_headers
>                    ]
>                    return start_response(status, response_headers, exc_info)
>                return app(environ, start)
>            return decorated_app
>        return decorate
>
> So, this seems like a win-win to me: relatively-static verification, errors stay in the app (or at least in the decorator), and the API is clean-and-easy.  Indeed, it seems likely that at least some apps that don't read wsgi.input themselves could be ported *just* by adding the appropriate decorator(s).  And, if your app is using unicode on 2.x, you can even use the same decorators there, for the benefit of 2to3.  (Assuming I release an updated standalone wsgiref version with the decorators, of course.)

This doesn't seem that different than the validator, except that the decorator uses a different interface internally and externally (the internal interface using text, the external one bytes).


--
Ian Bicking  |  http://blog.ianbicking.org


Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

Jeff Hardy-4
In reply to this post by PJ Eby
On Thu, Sep 23, 2010 at 10:06 AM, P.J. Eby <[hidden email]> wrote:
> So, unless somebody has some additional arguments on this one, I think I'm
> going to stick with bytes output.

I don't have a strong opinion on whether it should be bytes or strings
-- I'll leave that discussion for people who know more about the
details than I do.

I do think input and output should be symmetric, though. If response
headers are going to be bytes, then the request headers should be as
well, or vice versa. The same arguments apply to both, after all.

- Jeff

Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

PJ Eby
At 11:11 AM 9/23/2010 -0600, Jeff Hardy wrote:

>On Thu, Sep 23, 2010 at 10:06 AM, P.J. Eby <[hidden email]> wrote:
> > So, unless somebody has some additional arguments on this one, I think I'm
> > going to stick with bytes output.
>
>I don't have a strong opinion on whether it should be bytes or strings
>-- I'll leave that discussion for people who know more about the
>details than I do.
>
>I do think input and output should be symmetric, though. If response
>headers are going to be bytes, then the request headers should be as
>well, or vice versa. The same arguments apply to both, after all.

Actually, they don't.  There are more apps than servers, so more code
to get right, by more people.  Servers also don't generally *create*
any of the bytes or text involved, they're just ferrying it from one
place to the next.  So the API conditions are not symmetrical.


Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

Jeff Hardy-4
On Thu, Sep 23, 2010 at 11:52 AM, P.J. Eby <[hidden email]> wrote:
>> I do think input and output should be symmetric, though. If response
>> headers are going to be bytes, then the request headers should be as
>> well, or vice versa. The same arguments apply to both, after all.
>
> Actually, they don't.  There are more apps than servers, so more code to get
> right, by more people.  Servers also don't generally *create* any of the
> bytes or text involved, they're just ferrying it from one place to the next.
>  So the API conditions are not symmetrical.

How so? If I'm writing an application, I would need to deal with
strings in environ but remember to send bytes to start_response.
Conversions can happen on the application side either way. I just
don't see how having strings-in/bytes-out is more error-prone than
bytes-in/bytes-out or strings-in/strings-out, from an application or
a server perspective.

Also, IronPython/.NET falls outside of "generally". Every .NET server
I've seen deals with headers exclusively as strings (like Python 3,
.NET strings are Unicode), so NWSGI would be decoding the response
headers from bytes to strings, but passing the request headers through unchanged.

- Jeff

Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

ianb
In reply to this post by ianb
On Thu, Sep 23, 2010 at 11:17 AM, Ian Bicking <[hidden email]> wrote:
>> If these headers accidentally contain non-Latin1 characters, the error isn't detectable until the header reaches the origin server doing the transmission encoding, and it'll likely be a dynamic (and therefore hard-to-debug) error.
>
> I don't see any reason why Location shouldn't be ASCII.  Any header could have any character put in it, of course, there's just no valid case where Location shouldn't be a URL, and URLs are ASCII.  Cookie can contain weirdness, yes.  I would expect any library that abstracts cookies to handle this (it's certainly doable)... otherwise, this seems like one among many ways a person can do the wrong thing.

Minor correction, Set-Cookie, not Cookie.  Good practice is to stick to ASCII even there (all other techniques have a high risk of mojibake), so we're really considering legacy integration.  Note that a similar problem is using [('Content-length', len(body))] -- which also results in a sometimes confusing error message well away from the application itself.

Generally without validation any data errors occur away from the application.  A type error is not any different than an encoding error.  Using bytes removes a possible encoding error, but IMHO has a greater chance of type errors (as bytes are not as natural as text in most cases).  Validation can check all aspects, including encoding (simply by doing a test encoding).

Consider this hello world:

def app(environ, start_response):
    body = b'Hello World'
    start_response(b'200 OK', [(b'Content-Length', str(len(body)).encode('ascii'))])
    return [body]

str(len(body)).encode('ascii')?!?  Yuck.  Also no 2to3 fixup can help there.  bytes(len(body)) does something weird.
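
To make the "something weird" concrete: on Python 3, passing an integer to bytes() gives that many zero bytes, not the ASCII digits. (On Python 2, where bytes is just an alias for str, the same call gives '11', which is why the two versions diverge here.) A small illustration, with variable names chosen only for this sketch:

```python
body = b'Hello World'

# On Python 3, bytes(int) allocates a zero-filled buffer of that length;
# it does not produce the decimal digits.
weird = bytes(len(body))

# The explicit spelling that yields the digits as ASCII bytes.
digits = str(len(body)).encode('ascii')

print(weird)   # eleven NUL bytes
print(digits)  # b'11'
```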

--
Ian Bicking  |  http://blog.ianbicking.org


Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

PJ Eby
In reply to this post by ianb
At 11:17 AM 9/23/2010 -0500, Ian Bicking wrote:

>I don't see any reason why Location shouldn't be ASCII.  Any header
>could have any character put in it, of course, there's just no valid
>case where Location shouldn't be a URL, and URLs are ASCII.  Cookie
>can contain weirdness, yes.  I would expect any library that
>abstracts cookies to handle this (it's certainly doable)...
>otherwise, this seems like one among many ways a person can do the wrong thing.
>
>This can also be detected with the validator, which doesn't avoid
>runtime errors, but bytes allow runtime errors too -- they will just
>happen somewhere else (e.g., when a value is converted to bytes in
>an application or library).

Right: somewhere much closer to the *actual* error, where the
developer can know the problem is, "I have garbage data or have not
selected an appropriate codec", rather than "this WSGI stuff is
giving me errors some place".


>If servers print the invalid value on error (instead of just some
>generic error) I don't think it would be that hard to track down
>problems.  This requires some explicit effort on the part of the
>server (most servers handle app_iter==None ungracefully, which is a
>similar problem).

The difference is that if a server rejects non-bytes, you'll know
*right away* that your app isn't compliant, instead of having to wait
until some non-latin1 data shows up.

AFAICT, there are only two advantages to using text for output headers:

1. Text is easier to work with, and
2. It's symmetric with using text for input headers.

Both of which can still be had, by using the @encode_headers decorator.
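
For illustration, here is a rough sketch of what such a decorator could look like under the "output is bytes" proposal. Only the name encode_headers comes from this thread; the signature, the encoding parameter, and the Latin-1 default are assumptions made for the example, not an existing wsgiref API.

```python
import functools


def encode_headers(app, encoding='latin-1'):
    """Hypothetical sketch: the wrapped app passes text to start_response,
    and the server-facing start_response receives bytes."""
    @functools.wraps(app)
    def wrapper(environ, start_response):
        def text_start_response(status, headers, exc_info=None):
            # Encode the status line and every header name/value before
            # handing them to the server's bytes-only start_response.
            return start_response(
                status.encode(encoding),
                [(name.encode(encoding), value.encode(encoding))
                 for name, value in headers],
                exc_info,
            )
        return app(environ, text_start_response)
    return wrapper


@encode_headers
def app(environ, start_response):
    body = b'Hello World'
    # The app itself works purely with text headers.
    start_response('200 OK', [('Content-Length', str(len(body)))])
    return [body]
```

Any encoding failure would then surface inside the decorator, i.e. on the application side of the interface, rather than inside the server.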

I'm a little bit on the fence on this one, because 1) it does seem a
little pointless (if harmless) to shuffle headers around in bytes
form, and 2) Location and Set-Cookie are very likely the only headers
where any kind of damage could ever happen.

But, since it *can* happen, and because it is also really easy to fix
the API issue with a decorator, I'm still leaning in favor of "output
is bytes" over "headers are text, bodies are bytes", unless somebody
can come up with either some actually-bad consequence of using bytes,
or some extra-good consequence of using text (that isn't addressed by
just using the decorator).

(Note, by the way, that WSGI design has always leaned in the
direction of "any convenience that can be handled by a library should
be", if it keeps the spec simpler and more verifiable.  So, this
seems like a good use of that principle.)


Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

ianb
On Thu, Sep 23, 2010 at 3:23 PM, P.J. Eby <[hidden email]> wrote:
> At 11:17 AM 9/23/2010 -0500, Ian Bicking wrote:
>> I don't see any reason why Location shouldn't be ASCII.  Any header could have any character put in it, of course, there's just no valid case where Location shouldn't be a URL, and URLs are ASCII.  Cookie can contain weirdness, yes.  I would expect any library that abstracts cookies to handle this (it's certainly doable)... otherwise, this seems like one among many ways a person can do the wrong thing.
>>
>> This can also be detected with the validator, which doesn't avoid runtime errors, but bytes allow runtime errors too -- they will just happen somewhere else (e.g., when a value is converted to bytes in an application or library).
>
> Right: somewhere much closer to the *actual* error, where the developer can know the problem is, "I have garbage data or have not selected an appropriate codec", rather than "this WSGI stuff is giving me errors some place".
>
>> If servers print the invalid value on error (instead of just some generic error) I don't think it would be that hard to track down problems.  This requires some explicit effort on the part of the server (most servers handle app_iter==None ungracefully, which is a similar problem).
>
> The difference is that if a server rejects non-bytes, you'll know *right away* that your app isn't compliant, instead of having to wait until some non-latin1 data shows up.

No, you've only pushed the encoding elsewhere, and the error elsewhere.  Somewhere someone is probably doing text_value.encode('ascii') (or latin1 or whatever), and if they haven't tested with non-ascii or non-latin1 input then they might encounter an error.  It will be in their code, not in the WSGI server, but the error will be present in all the same situations.  I don't think it will be much harder to fix if it occurs in the WSGI server, so long as the error message is at least a little bit helpful.
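
As a sketch of what "at least a little bit helpful" might mean in practice: whoever does the encoding, app or server, can name the offending header and value when it fails. The helper name and message wording below are made up for this example.

```python
def encode_header_value(name, value, encoding='latin-1'):
    """Hypothetical helper: encode one header value, naming the header and
    the bad value in the error instead of raising a bare codec error."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            'header %s has value %r, which cannot be encoded as %s'
            % (name, value, encoding)
        ) from exc
```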
 
> AFAICT, there are only two advantages to using text for output headers:
>
> 1. Text is easier to work with, and
> 2. It's symmetric with using text for input headers.
>
> Both of which can still be had, by using the @encode_headers decorator.

Sure, anything can be fixed in a library.  But @encode_headers is just another library.  And it also can't magically appear with 2to3, instead it requires yet more patches and weird workarounds.

Also, what you are proposing hasn't been considered for PEP 444, though other combinations of bytes and text have (all symmetric).  So it doesn't seem to have any clean way to translate into the next version of the specification.
 
> I'm a little bit on the fence on this one, because 1) it does seem a little pointless (if harmless) to shuffle headers around in bytes form, and 2) Location and Set-Cookie are very likely the only headers where any kind of damage could ever happen.

Set-Cookie only, Location is clean.  The entirety of hand-wringing over bytes is all just about freakin' cookies.  Or the theory of cookies, I don't know that anyone has yet encountered any concrete and vexing problems.

> But, since it *can* happen, and because it is also really easy to fix the API issue with a decorator, I'm still leaning in favor of "output is bytes" over "headers are text, bodies are bytes", unless somebody can come up with either some actually-bad consequence of using bytes, or some extra-good consequence of using text (that isn't addressed by just using the decorator).
>
> (Note, by the way, that WSGI design has always leaned in the direction of "any convenience that can be handled by a library should be", if it keeps the spec simpler and more verifiable.  So, this seems like a good use of that principle.)

It only fixes the one case of non-Latin1 characters, there are still many other values you can put into a header (a newline or control character for instance), and innumerable header-specific issues.  It seems to be adding complexity for one of the least problematic cases.
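
To illustrate, a validator along these lines could check type, encodability and control characters in one pass, whichever string type the spec settles on. This is only a sketch assuming Python 3 text headers; check_header is a made-up name, and the choice to allow tab while rejecting other control characters is an assumption of the example.

```python
import re

# Control characters (including CR and LF) other than tab.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0a-\x1f\x7f]')


def check_header(name, value):
    """Return a list of problems found in one (name, value) header pair."""
    problems = []
    for label, item in (('name', name), ('value', value)):
        if not isinstance(item, str):
            # Catches e.g. ('Content-Length', len(body)) with an int value.
            problems.append('%s %r is %s, not str'
                            % (label, item, type(item).__name__))
            continue
        try:
            item.encode('latin-1')  # the "test encoding"
        except UnicodeEncodeError:
            problems.append('%s %r is not Latin-1 encodable' % (label, item))
        if _CONTROL_CHARS.search(item):
            problems.append('%s %r contains a control character'
                            % (label, item))
    return problems
```

The Latin-1 check is just one of the three tests here, and the header-specific issues are not covered at all.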

--
Ian Bicking  |  http://blog.ianbicking.org


Re: Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

PJ Eby
At 03:48 PM 9/23/2010 -0500, Ian Bicking wrote:
>It only fixes the one case of non-Latin1 characters, there are still
>many other values you can put into a header (a newline or control
>character for instance), and innumerable header-specific
>issues.  It seems to be adding complexity for one of the least
>problematic cases.

Ok, you found one that convinces me.  ;-)  "Headers are text, bodies
are bytes" shall be the rule.  I'll rewrite the "note about string
types" and change the way I'm updating the spec accordingly.
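
For reference, a hello world under that rule might look roughly like this on Python 3. It is a sketch of the rule as stated here, not text from the spec, and the particular headers are just an example.

```python
def app(environ, start_response):
    # Bodies are bytes.
    body = 'Hello World'.encode('utf-8')
    # Status and headers are text (native str).
    headers = [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ]
    start_response('200 OK', headers)
    return [body]
```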
