WSGI for Python 3

WSGI for Python 3

ianb
So... there's been some discussion of WSGI on Python 3 lately.  I'm not feeling as pessimistic as some people, I feel like we were close but just didn't *quite* get there.

Here's my thoughts:

* Everyone agrees keys in the environ should be native strings
* Bodies should stay bytes
* Can we make all "standard" values that are str on Python 2, str on Python 3 with a Latin1 encoding?  This is basically what wsgiref did.  This means HTTP_*, SERVER_NAME, etc.  Everything CGIish, and everything with an all-caps key.  There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO, and HTTP_COOKIE.
* I propose we let libraries handle HTTP_COOKIE however they want; don't bother transcoding *into* the environ, just do so when you parse the cookie (if you so choose).  Happy developers will just urlencode all their cookie values to keep their cookies ASCII-clean.  Unhappy developers who have to handle legacy cookies will just run environ['HTTP_COOKIE'].decode('latin1') and then do whatever sad magic they are forced to do.
* I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them exclusively with encoded versions (that represent the original request URI).  We use Latin1 encoding, but it should be ASCII anyway, like most of the headers.
* I'm terrible at naming, but let's say these new values are RAW_SCRIPT_NAME and RAW_PATH_INFO.
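
A minimal sketch of the Latin1 idea in the list above, assuming a server that receives header values as raw bytes. The helper names (to_native, from_native) are made up for illustration and are not part of any spec or library. Latin1 maps every byte to exactly one code point, so the original bytes can always be recovered, whatever encoding the client really used:

```python
def to_native(raw_value: bytes) -> str:
    """Decode raw header bytes into a native str without losing any byte."""
    return raw_value.decode("latin-1")


def from_native(value: str) -> bytes:
    """Recover the original bytes from a Latin1-decoded native string."""
    return value.encode("latin-1")


# A cookie value that the client actually sent as UTF-8.
wire_bytes = "name=caf\u00e9".encode("utf-8")

# What the server would place in the environ under this scheme.
environ = {"HTTP_COOKIE": to_native(wire_bytes)}

# Read directly as text, the native string is not what the user typed ...
looks_wrong = environ["HTTP_COOKIE"] != "name=caf\u00e9"

# ... but a library that knows the real encoding can undo the Latin1 step.
cookie_text = from_native(environ["HTTP_COOKIE"]).decode("utf-8")
```

The environ value is only a carrier for bytes here; choosing the real encoding is left to whoever parses the cookie.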

Does this solve everything?  There's broken stuff in the stdlib, but we shouldn't bother ourselves with that -- if we need working code we should just write it and ignore the stdlib or submit our stuff as patches to the stdlib.

Some environments will have a hard time constructing RAW_SCRIPT_NAME and RAW_PATH_INFO, but in my opinion they can just encode SCRIPT_NAME and PATH_INFO and be done with it; it's not as accurate, but it's no less accurate than what we have now.

Actual transcoding in the environ is not supported or encouraged in this scheme.  If you want to adjust an encoding you should do it in your application/library code.

There's some other topics, like chunked responses, unknown request body lengths, start_response, and maybe some other things, but these aren't Python 3 issues, they are just... generic issues.  app_iter.close() might be worth thinking about given new iterator semantics introduced since WSGI was written.

--
Ian Bicking  |  http://blog.ianbicking.org

_______________________________________________
Web-SIG mailing list
[hidden email]
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/lists%40nabble.com

Re: WSGI for Python 3

Graham Dumpleton-2
On 14 July 2010 14:43, Ian Bicking <[hidden email]> wrote:
> So... there's been some discussion of WSGI on Python 3 lately.  I'm not
> feeling as pessimistic as some people, I feel like we were close but just
> didn't *quite* get there.

What I took from the discussion wasn't that one couldn't specify a
WSGI interface, and as you say we more or less have one now, the issue
is more about how practical that is from a usability perspective for
those who have to code stuff on top.

The concern seems to be that although it may be easy to work with the
specification for those who at the lowest layer immediately wrap it in
a higher level abstraction that normalises stuff into something that
is then used consistently in that way, for those who use lower level
raw WSGI right through the stack, especially in the context of
stackable WSGI middleware, that repetitive task of having to deal with
the byte/unicode issues at every point is just a big PITA.

That said, my job in writing the WSGI adapter is really easy as I
don't have to worry about these issues. This is why I don't seem to
really appreciate the concerns people are expressing. The above is how
I read things though.

> Here's my thoughts:
>
> * Everyone agrees keys in the environ should be native strings
> * Bodies should stay bytes
> * Can we make all "standard" values that are str on Python 2, str on Python
> 3 with a Latin1 encoding?  This is basically what wsgiref did.  This means
> HTTP_*, SERVER_NAME, etc.  Everything CGIish, and everything with an
> all-caps key.  There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.
> * I propose we let libraries handle HTTP_COOKIE however they want; don't
> bother transcoding *into* the environ, just do so when you parse the cookie
> (if you so choose).  Happy developers will just urlencode all their cookie
> values to keep their cookies ASCII-clean.  Unhappy developers who have to
> handle legacy cookies will just run environ['HTTP_COOKIE'].decode('latin1')
> and then do whatever sad magic they are forced to do.
> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
> exclusively with encoded versions (that represent the original request
> URI).  We use Latin1 encoding, but it should be ASCII anyway, like most of
> the headers.
> * I'm terrible at naming, but let's say these new values are RAW_SCRIPT_NAME
> and RAW_PATH_INFO.

My prior suggestion on that since upper case keys for now effectively
derive from CGI, was to make them wsgi.script_name and wsgi.path_info.
Ie., push them into the wsgi namespace.

> Does this solve everything?  There's broken stuff in the stdlib, but we
> shouldn't bother ourselves with that -- if we need working code we should
> just write it and ignore the stdlib or submit our stuff as patches to the
> stdlib.

The quick summary of what I suggested before is at:

  http://code.google.com/p/modwsgi/wiki/SupportForPython3X

I believe the only difference I see is the raw SCRIPT_NAME and
PATH_INFO, which got discussed to death previously with no consensus.

> Some environments will have a hard time constructing RAW_SCRIPT_NAME and
> RAW_PATH_INFO, but in my opinion they can just encode SCRIPT_NAME and
> PATH_INFO and be done with it; it's not as accurate, but it's no less
> accurate than what we have now.
>
> Actual transcoding in the environ is not supported or encouraged in this
> scheme.  If you want to adjust an encoding you should do it in your
> application/library code.
>
> There's some other topics, like chunked responses, unknown request body
> lengths, start_response, and maybe some other things, but these aren't
> Python 3 issues, they are just... generic issues.  app_iter.close() might be
> worth thinking about given new iterator semantics introduced since WSGI was
> written.

Graham

Re: WSGI for Python 3

ianb
On Wed, Jul 14, 2010 at 12:04 AM, Graham Dumpleton <[hidden email]> wrote:
> On 14 July 2010 14:43, Ian Bicking <[hidden email]> wrote:
>> So... there's been some discussion of WSGI on Python 3 lately.  I'm not
>> feeling as pessimistic as some people, I feel like we were close but just
>> didn't *quite* get there.
>
> What I took from the discussion wasn't that one couldn't specify a
> WSGI interface, and as you say we more or less have one now, the issue
> is more about how practical that is from a usability perspective for
> those who have to code stuff on top.

My intuition is that it won't be that bad.  At least compared to any library that is dealing with str/unicode porting issues; which aren't easy, but so it goes.


>> * I'm terrible at naming, but let's say these new values are RAW_SCRIPT_NAME
>> and RAW_PATH_INFO.
>
> My prior suggestion on that since upper case keys for now effectively
> derive from CGI, was to make them wsgi.script_name and wsgi.path_info.
> Ie., push them into the wsgi namespace.

That's fine with me too.
 
>> Does this solve everything?  There's broken stuff in the stdlib, but we
>> shouldn't bother ourselves with that -- if we need working code we should
>> just write it and ignore the stdlib or submit our stuff as patches to the
>> stdlib.
>
> The quick summary of what I suggested before is at:
>
>  http://code.google.com/p/modwsgi/wiki/SupportForPython3X
>
> I believe the only difference I see is the raw SCRIPT_NAME and
> PATH_INFO, which got discussed to death previously with no consensus.

Thanks, I was looking for that.  I remember the primary objection to a SCRIPT_NAME/PATH_INFO change was from you.  Do you still feel that way?

I generally agree with your interpretation, except I would want to strictly disallow unicode (Python 3 str) from response bodies.  Latin1/ISO-8859-1 is an okay encoding for headers and status and raw SCRIPT_NAME/PATH_INFO, but for bodies it doesn't have any particular validity.

I forgot to mention the response, which you cover; I guess I'm okay with being lenient on types there (allowing both bytes and str in Python 3)... though I'm not really that happy with it.  I'd rather just keep it symmetric with the request, requiring native strings everywhere.

--
Ian Bicking  |  http://blog.ianbicking.org


Re: WSGI for Python 3

Graham Dumpleton-2
In reply to this post by Graham Dumpleton-2
On 14 July 2010 15:04, Graham Dumpleton <[hidden email]> wrote:

> On 14 July 2010 14:43, Ian Bicking <[hidden email]> wrote:
>> So... there's been some discussion of WSGI on Python 3 lately.  I'm not
>> feeling as pessimistic as some people, I feel like we were close but just
>> didn't *quite* get there.
>
> What I took from the discussion wasn't that one couldn't specify a
> WSGI interface, and as you say we more or less have one now, the issue
> is more about how practical that is from a usability perspective for
> those who have to code stuff on top.
>
> The concern seems to be that although it may be easy to work with the
> specification for those who at the lowest layer immediately wrap it in
> a higher level abstraction that normalises stuff into something that
> is then used consistently in that way, for those who use lower level
> raw WSGI right through the stack, especially in the context of
> stackable WSGI middleware, that repetitive task of having to deal with
> the byte/unicode issues at every point is just a big PITA.
>
> That said, my job in writing the WSGI adapter is really easy as I
> don't have to worry about these issues. This is why I don't seem to
> really appreciate the concerns people are expressing. The above is how
> I read things though.
>
>> Here's my thoughts:
>>
>> * Everyone agrees keys in the environ should be native strings
>> * Bodies should stay bytes
>> * Can we make all "standard" values that are str on Python 2, str on Python
>> 3 with a Latin1 encoding?  This is basically what wsgiref did.  This means
>> HTTP_*, SERVER_NAME, etc.  Everything CGIish, and everything with an
>> all-caps key.  There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
>> and HTTP_COOKIE.
>> * I propose we let libraries handle HTTP_COOKIE however they want; don't
>> bother transcoding *into* the environ, just do so when you parse the cookie
>> (if you so choose).  Happy developers will just urlencode all their cookie
>> values to keep their cookies ASCII-clean.  Unhappy developers who have to
>> handle legacy cookies will just run environ['HTTP_COOKIE'].decode('latin1')
>> and then do whatever sad magic they are forced to do.
>> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
>> exclusively with encoded versions (that represent the original request
>> URI).  We use Latin1 encoding, but it should be ASCII anyway, like most of
>> the headers.

BTW, it should be highlighted whether this change is relevant to
Python 3 or, like some of the other things you relegated as out of
scope, purely a wish list item.

Graham


Re: WSGI for Python 3

Graham Dumpleton-2
In reply to this post by ianb
On 14 July 2010 15:18, Ian Bicking <[hidden email]> wrote:

> On Wed, Jul 14, 2010 at 12:04 AM, Graham Dumpleton
> <[hidden email]> wrote:
>>
>> On 14 July 2010 14:43, Ian Bicking <[hidden email]> wrote:
>> > So... there's been some discussion of WSGI on Python 3 lately.  I'm not
>> > feeling as pessimistic as some people, I feel like we were close but
>> > just
>> > didn't *quite* get there.
>>
>> What I took from the discussion wasn't that one couldn't specify a
>> WSGI interface, and as you say we more or less have one now, the issue
>> is more about how practical that is from a usability perspective for
>> those who have to code stuff on top.
>
> My intuition is that it won't be that bad.  At least compared to any library
> that is dealing with str/unicode porting issues; which aren't easy, but so
> it goes.
>
>>
>> > * I'm terrible at naming, but let's say these new values are
>> > RAW_SCRIPT_NAME
>> > and RAW_PATH_INFO.
>>
>> My prior suggestion on that since upper case keys for now effectively
>> derive from CGI, was to make them wsgi.script_name and wsgi.path_info.
>> Ie., push them into the wsgi namespace.
>
> That's fine with me too.
>
>>
>> > Does this solve everything?  There's broken stuff in the stdlib, but we
>> > shouldn't bother ourselves with that -- if we need working code we
>> > should
>> > just write it and ignore the stdlib or submit our stuff as patches to
>> > the
>> > stdlib.
>>
>> The quick summary of what I suggested before is at:
>>
>>  http://code.google.com/p/modwsgi/wiki/SupportForPython3X
>>
>> I believe the only difference I see is the raw SCRIPT_NAME and
>> PATH_INFO, which got discussed to death previously with no consensus.
>
> Thanks, I was looking for that.  I remember the primary objection to a
> SCRIPT_NAME/PATH_INFO change was from you.  Do you still feel that way?

I accept that access to the raw information may help for people who
want access to repeating slashes or other encoded information that an
underlying web server may alter, but I can't remember in what way this
helps with the Python 3 issues. That is why I just made the comment in
the other email.

Perhaps you can cover how this helps with Python 3.

> I generally agree with your interpretation, except I would want to strictly
> disallow unicode (Python 3 str) from response bodies.  Latin1/ISO-8859-1 is
> an okay encoding for headers and status and raw SCRIPT_NAME/PATH_INFO, but
> for bodies it doesn't have any particular validity.
>
> I forgot to mention the response, which you cover; I guess I'm okay with
> being lenient on types there (allowing both bytes and str in Python 3)...
> though I'm not really that happy with it.  I'd rather just keep it symmetric
> with the request, requiring native strings everywhere.

The reason for allowing it in the response content was so the
canonical WSGI hello world still works unmodified.
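
For reference, a sketch of the usual hello world written so that it would also satisfy a stricter bytes-only rule for response bodies. Only the b prefix on the body differs from the form most tutorials show; the status and header values stay native strings. fake_start_response is a stand-in written for this example and is not part of any server:

```python
def application(environ, start_response):
    # The body is bytes; status and headers remain native strings.
    body = b"Hello world!\n"
    status = "200 OK"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


# Stand-in for a server, only to show what the application hands back.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


result = application({}, fake_start_response)
```

Allowing str bodies as well would let the version without the b prefix keep running, which is the leniency being discussed here.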

Graham

Re: WSGI for Python 3

ianb
In reply to this post by Graham Dumpleton-2
On Wed, Jul 14, 2010 at 12:19 AM, Graham Dumpleton <[hidden email]> wrote:
>> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
>> exclusively with encoded versions (that represent the original request
>> URI).  We use Latin1 encoding, but it should be ASCII anyway, like most of
>> the headers.

> BTW, it should be highlighted whether this change is relevant to
> Python 3 or, like some of the other things you relegated as out of
> scope, purely a wish list item.

Certainly; most headers and metadata are pretty much constrained to ASCII, and any use of non-ASCII is... at least peculiar, and presumably application-specific.  For instance, there's no reason you'd have anything but ASCII in Cache-Control.  The one place encoded information happens regularly in headers (that I know of) is Cookie.  The request URI path is generally ASCII, but SCRIPT_NAME and PATH_INFO *aren't* the request URI path, they are URL decoded versions of the request URI path.  And they are usually encoded in UTF8... but UTF8 is a lossy encoding, so decoding them is problematic (though we could define that they must be decoded with surrogateescape).  And while they are usually UTF8, they are sometimes not in any valid encoding at all, because anyone can assemble any set of characters they want and web browsers will accept it.

By avoiding URL-unquoting of these values, we can also stick to Latin1 and get something reasonable.  It's not very attractive to me that we take something that is probably *not* Latin1, and may reasonably not be ASCII, and decode it as Latin1.
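
A small sketch of the difference, using only the standard urllib.parse module. The sample paths are invented for illustration. The still-quoted path is plain ASCII whatever the client meant, while the unquoted bytes may or may not be valid UTF-8:

```python
from urllib.parse import quote, unquote_to_bytes

# The same visible path, percent-encoded by two different clients.
raw_utf8 = quote("/caf\u00e9".encode("utf-8"))
raw_latin1 = quote("/caf\u00e9".encode("latin-1"))

# Both raw forms are pure ASCII, so treating them as text is safe.
raw_utf8.encode("ascii")
raw_latin1.encode("ascii")

# URL-unquoting is where the encoding question appears.
decoded_bytes = unquote_to_bytes(raw_latin1)

try:
    decoded_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    # The Latin1-encoded path is not valid UTF-8.
    strict_failed = True

# surrogateescape keeps the undecodable byte so it can be restored later.
escaped = decoded_bytes.decode("utf-8", "surrogateescape")
roundtrip = escaped.encode("utf-8", "surrogateescape")
```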

--
Ian Bicking  |  http://blog.ianbicking.org


Re: WSGI for Python 3

and-py
In reply to this post by ianb
On 07/14/2010 06:43 AM, Ian Bicking wrote:

> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.

(And of those, PATH_INFO is the only one that really matters, in that
no-one really uses non-ASCII script filenames, and non-ASCII characters
in Cookie/Set-Cookie are still handled so differently/brokenly across
browsers that you can't rely on them at all.)

> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
> exclusively with encoded versions

For compatibility with existing apps, how about keeping the existing
SCRIPT_NAME and PATH_INFO as-is (with all their problems), and
specifying that the new 'raw' versions (whatever they are called) are
added only if they really are raw, not reconstructed.

Then existing scripts that don't care about non-ASCII and slashes can
carry on as before, and for apps that do care about them, they'll be
able to be *sure* the input is correct. Or they can fall back to
PATH_INFO when not present, and avoid producing these kinds of URLs in
response.

(Or an app might have enough special knowledge to try other fallback
mechanisms when the raw versions are unavailable, such as REQUEST_URI or
Windows ctypes envvar hacking. But if the server/gateway has good raw
paths it shouldn't bother using these.)
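
A rough sketch of how an application could consume such an optional key. The key name wsgi.raw_path_info and the helper raw_path_info are assumptions made for this example; no name had been agreed at this point in the thread. The second return value tells the caller whether the path is the real raw one or only a reconstruction:

```python
from urllib.parse import quote

# Hypothetical key name; the thread had not settled on one.
RAW_KEY = "wsgi.raw_path_info"


def raw_path_info(environ):
    """Return (path, is_exact) for the still-quoted request path."""
    raw = environ.get(RAW_KEY)
    if raw is not None:
        # The server supplied the genuine raw value.
        return raw, True
    # Fallback: re-quote the decoded PATH_INFO.  Information such as
    # %2F versus "/" has already been lost at this point.
    decoded = environ.get("PATH_INFO", "")
    return quote(decoded.encode("latin-1")), False
```

An application that cares about encoded slashes could refuse to generate such URLs whenever is_exact is False.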

--
And Clover
mailto:[hidden email]
http://www.doxdesk.com/

Re: WSGI for Python 3

Graham Dumpleton-2
On Friday, July 16, 2010, And Clover <[hidden email]> wrote:

> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
>> and HTTP_COOKIE.
>
> (And of those, PATH_INFO is the only one that really matters, in that no-one really uses non-ASCII script filenames, and non-ASCII characters in Cookie/Set-Cookie are still handled so differently/brokenly across browsers that you can't rely on them at all.)
>
>> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
>> exclusively with encoded versions
>
> For compatibility with existing apps, how about keeping the existing SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying that the new 'raw' versions (whatever they are called) are added only if they really are raw, not reconstructed.
>
> Then existing scripts that don't care about non-ASCII and slashes can carry on as before, and for apps that do care about them, they'll be able to be *sure* the input is correct. Or they can fall back to PATH_INFO when not present, and avoid producing these kinds of URLs in response.
>
> (Or an app might have enough special knowledge to try other fallback mechanisms when the raw versions are unavailable, such as REQUEST_URI or Windows ctypes envvar hacking. But if the server/gateway has good raw paths it shouldn't bother using these.)

Which is exactly what I have suggested in the past. If you do that,
one has to ask the question, given it is more convention than
anything, why it isn't just a x-wsgiorg extension specification like
routing args is rather than a core part of the WSGI specification.
Servers could still implement the extension as they are able to and
don't have to worry about changing core specification then and what we
have now stands.

Graham


Re: WSGI for Python 3

and-py
On 07/16/2010 12:07 PM, Graham Dumpleton wrote:

> If you do that, one has to ask the question, given it is more convention than
> anything, why it isn't just a x-wsgiorg extension specification

Yes, fine by me either way.

I just want to be able to say "this application can use Unicode paths
when run on a server/gateway that supports <standardised feature X>",
rather than the current mess of "you can have Unicode paths if you use
one of the dozen different server-and-platform combinations we've
specifically coded workarounds for".

--
And Clover
mailto:[hidden email]
http://www.doxdesk.com/

Re: WSGI for Python 3

ianb
In reply to this post by and-py
On Fri, Jul 16, 2010 at 4:33 AM, And Clover <[hidden email]> wrote:
> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
>> and HTTP_COOKIE.
>
> (And of those, PATH_INFO is the only one that really matters, in that no-one really uses non-ASCII script filenames, and non-ASCII characters in Cookie/Set-Cookie are still handled so differently/brokenly across browsers that you can't rely on them at all.)
>
>> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
>> exclusively with encoded versions
>
> For compatibility with existing apps, how about keeping the existing SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying that the new 'raw' versions (whatever they are called) are added only if they really are raw, not reconstructed.

Having two ways of expressing the same information will lead to bugs related to which data is canonical.  If an application is using SCRIPT_NAME/PATH_INFO and then updates those values in any way, and wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be weird bugs and code will disagree about which one is correct.  Since %2f can exist in the raw versions, there isn't even a way to chunk the two variables in the same way.
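
The %2f point can be seen with a small sketch; the sample path is invented for illustration. Splitting the raw path and splitting the unquoted path give a different number of segments, so the two representations cannot be kept in step segment by segment:

```python
from urllib.parse import unquote

# One path segment contains an encoded slash.
raw_path = "/files/a%2Fb/edit"

# Splitting before unquoting keeps "a%2Fb" together as one segment.
raw_segments = raw_path.split("/")

# Splitting after unquoting breaks that segment in two.
decoded_segments = unquote(raw_path).split("/")

# Unquoting each raw segment on its own preserves the intended structure.
intended = [unquote(segment) for segment in raw_segments]
```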

> Then existing scripts that don't care about non-ASCII and slashes can carry on as before, and for apps that do care about them, they'll be able to be *sure* the input is correct. Or they can fall back to PATH_INFO when not present, and avoid producing these kinds of URLs in response.

I don't think it works to imagine you can just not care about non-ASCII.  Requests come in.  WSGI should represent those requests.  If a request comes in with non-ASCII bytes then WSGI needs to do *something* with it.  I don't want to have to configure servers with application policy; servers should just work.

And this doesn't help with Python 3: either we have byte values of SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think bytes will be more awkward to port to than text, and inconsistent with other WSGI values.  If we have text then we have to choose an encoding.  Latin1 will work, but it will be the exact wrong encoding most of the time as UTF-8 is the typical one (unlike other headers, where Latin1 will mostly be an okay encoding, or as good a guess as we have).  If we firmly remove these keys then we can avoid this choice entirely... and we conveniently also get a better representation of the request.

Note that libraries can smooth over this change; WebOb for instance will certainly still support req.script_name/req.path_info by decoding the raw values.  Admittedly lots of code uses these values directly... but at least if they get a KeyError the port/fix will be obvious (as opposed to out of sync values, which will only emerge as a problem occasionally -- I'd rather not invite more occasional bugs).
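
A hypothetical request wrapper showing how a library might do that smoothing. This is not WebOb's actual implementation; the class, the charset parameter, and the wsgi.raw_path_info key name are all assumptions made for this sketch:

```python
from urllib.parse import unquote


class Request:
    """Illustrative wrapper around an environ that only has the raw path."""

    def __init__(self, environ, charset="utf-8"):
        self.environ = environ
        self.charset = charset

    @property
    def path_info(self):
        # A missing key raises KeyError, which makes an unported
        # server or application easy to spot.
        raw = self.environ["wsgi.raw_path_info"]
        # The library, not the server, picks the charset here.
        return unquote(raw, encoding=self.charset, errors="replace")


req = Request({"wsgi.raw_path_info": "/caf%C3%A9"})
decoded_path = req.path_info
```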

--
Ian Bicking  |  http://blog.ianbicking.org


Re: WSGI for Python 3

Chris McDonough
On Fri, 2010-07-16 at 11:07 -0500, Ian Bicking wrote:

> And this doesn't help with Python 3: either we have byte values of
> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
> think bytes will be more awkward to port to than text, and
> inconsistent with other WSGI values.  If we have text then we have to
> choose an encoding.  Latin1 will work, but it will be the exact wrong
> encoding most of the time as UTF-8 is the typical one (unlike other
> headers, where Latin1 will mostly be an okay encoding, or as good a
> guess as we have).  If we firmly remove these keys then we can avoid
> this choice entirely... and we conveniently also get a better
> representation of the request.

My $.02: I'd rather lobby the core folks for a string ABC (which we can
hook with a stringlike bytes type) and consider all 3.X releases made so
far "dead to WSGI" than to have to tunnel arbitrary bytes through some
misleading Unicode encoding.

- C



Re: WSGI for Python 3

ianb
On Fri, Jul 16, 2010 at 12:28 PM, Chris McDonough <[hidden email]> wrote:
> On Fri, 2010-07-16 at 11:07 -0500, Ian Bicking wrote:
>
>> And this doesn't help with Python 3: either we have byte values of
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
>> think bytes will be more awkward to port to than text, and
>> inconsistent with other WSGI values.  If we have text then we have to
>> choose an encoding.  Latin1 will work, but it will be the exact wrong
>> encoding most of the time as UTF-8 is the typical one (unlike other
>> headers, where Latin1 will mostly be an okay encoding, or as good a
>> guess as we have).  If we firmly remove these keys then we can avoid
>> this choice entirely... and we conveniently also get a better
>> representation of the request.
>
> My $.02: I'd rather lobby the core folks for a string ABC (which we can
> hook with a stringlike bytes type) and consider all 3.X releases made so
> far "dead to WSGI" than to have to tunnel arbitrary bytes through some
> misleading Unicode encoding.

While I think it would be generally useful, it's also a long way off at best, with serious performance dangers that could torpedo the whole thing.  But... I'm also unsure how it would help here, except perhaps we could incrementally annotate bytes with an encoding?  Well, I don't really know.  Treating the raw request path as text is easy enough, as it should always be ASCII anyway.  We don't have to worry what is "right" or "wrong" in this case.

We could make everything bytes and be done with it, but it would make it much harder to port Python 2 WSGI code to Python 3.

--
Ian Bicking  |  http://blog.ianbicking.org


Re: WSGI for Python 3

Stephan Richter-2
On Friday, July 16, 2010, Ian Bicking wrote:
> We could make everything bytes and be done with it, but it would make it
> much harder to port Python 2 WSGI code to Python 3.

I think this might be best having seen all of the discussion. One could easily
write a compatibility middleware that makes porting Python 2 applications easy
or even completely transparent (from a WSGI spec point of view).
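
A sketch of what such a middleware might look like, assuming a hypothetical all-bytes variant of WSGI in which environ values, the status line, and header names and values are bytes while environ keys stay native strings. Nothing here is specified anywhere; native_string_adapter, legacy_app, and server_start_response are names invented for this example:

```python
def native_string_adapter(app, encoding="latin-1"):
    """Wrap a native-string app so it can run under a bytes-only server."""

    def adapted(environ, start_response):
        # Present bytes values to the wrapped app as native strings.
        native_environ = {
            key: (value.decode(encoding) if isinstance(value, bytes) else value)
            for key, value in environ.items()
        }

        def bytes_start_response(status, headers, exc_info=None):
            # Convert the app's native strings back to bytes for the server.
            return start_response(
                status.encode(encoding),
                [(n.encode(encoding), v.encode(encoding)) for n, v in headers],
                exc_info,
            )

        return app(native_environ, bytes_start_response)

    return adapted


def legacy_app(environ, start_response):
    # Written against native strings, as a ported Python 2 app might be.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["PATH_INFO"].encode("latin-1")]


seen = {}


def server_start_response(status, headers, exc_info=None):
    seen["status"] = status
    seen["headers"] = headers


body = native_string_adapter(legacy_app)(
    {"PATH_INFO": b"/caf\xc3\xa9"}, server_start_response
)
```

Latin1 is used only because it round-trips every byte; the wrapped application still has to know the real encoding of anything it treats as text.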

Regards,
Stephan
--
Entrepreneur and Software Geek
Google me. "Zope Stephan Richter"

Re: WSGI for Python 3

PJ Eby
In reply to this post by ianb
At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>And this doesn't help with Python 3: either we have byte values of
>SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
>think bytes will be more awkward to port to than text, and
>inconsistent with other WSGI values.

OTOH, it has the tremendous advantage of pushing the encoding
question onto the app (or framework) developer...  who's really the
only one who can make the right decision for their particular
application.  And personally, I'd rather have clear boundaries
between text and bytes, such that porting (even if tedious or
awkward) is *consistent*, and clear as to when you're finished, not,
"oh, did I check to make sure I converted SCRIPT_NAME and
PATH_INFO...  not just in my app code, but in all the library code I
call *from* my app?"
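
To make "pushing the encoding question onto the app" concrete, here is a rough sketch of an application under a hypothetical bytes-only WSGI. The helper name decode_path and the UTF-8-then-Latin-1 policy are assumptions chosen for illustration; a real application would pick whatever policy suits it:

```python
def decode_path(path_bytes, preferred="utf-8", fallback="latin-1"):
    """App-level policy: try the preferred charset, fall back to Latin-1."""
    try:
        return path_bytes.decode(preferred)
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return path_bytes.decode(fallback)


def application(environ, start_response):
    # Assumes a bytes-only WSGI: environ values, status and headers are bytes.
    path = decode_path(environ.get("PATH_INFO", b""))
    body = ("You asked for %s\n" % path).encode("utf-8")
    start_response(b"200 OK", [(b"Content-Type", b"text/plain; charset=utf-8")])
    return [body]
```

The boundary is explicit: bytes come in, the app decodes once, and bytes go out.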

IOW, the bytes/string discussion on Python-dev has kind of led me to
realize that we might just as well make the *entire* stack bytes
(incoming and outgoing headers *and* streams), and rewrite that bit
in PEP 333 about using str on "Python 3000" to say we go with bytes
on Python 3+ for everything that's a str in today's WSGI.

Or, to put it another way, if I knew then what I know *now*, I think
I'd have written the PEP the other way around, such that the use of
'str' in WSGI would be a substitute for the future 'bytes' type,
rather than viewing some byte strings as a forward-compatible
substitute for Py3K unicode strings.

Of course, this would be a WSGI 2 change, but IMO we're better off
making a clean break with backward compatibility here anyway, rather
than having conditionals.  Also, going with bytes everywhere means we
don't have to rename SCRIPT_NAME and PATH_INFO, which in turn avoids
deeper rewrites being required in today's apps.

(Hm.  Although actually, I suppose we *could* just borrow the time
machine and pretend that WSGI called for "byte-strings everywhere"
all along...)


Unicode fundamentals

travis+ml-python-web
In reply to this post by Chris McDonough
BTW, if you're a noob like me and can't follow the Unicode stuff,
I once read this:

http://www.joelonsoftware.com/articles/Unicode.html

I need to read it again before commenting, but I seem to recall it
being edifying, if not particularly memorable. ;-)
--
A Weapon of Mass Construction
My emails do not have attachments; it's a digital signature that your mail
program doesn't understand. | http://www.subspacefield.org/~travis/ 
If you are a spammer, please email [hidden email] to get blacklisted.


Re: WSGI for Python 3

Gustavo Narea
In reply to this post by ianb
Hello,

Ian said:
> Having two ways of expressing the same information will lead to bugs
> related to which data is canonical.  If an application is using
> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
> weird bugs and code will disagree about which one is correct.  Since %2f
> can exist in the raw versions, there isn't even a way to chunk the two
> variables in the same way.

I can't agree more.
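
A small illustration of the %2f point, using only urllib.parse from the standard library; the example path is made up:

```python
from urllib.parse import unquote

raw_path = "/files/a%2Fb/report"

# Splitting the raw path first keeps "a%2Fb" together as one segment.
raw_segments = [unquote(segment) for segment in raw_path.split("/")]

# Decoding first turns %2F into a real slash, so the same split yields
# an extra segment and the two views can no longer be lined up.
decoded_segments = unquote(raw_path).split("/")
```

Here raw_segments has four entries and decoded_segments has five, so code working from the decoded value cannot tell where the raw segments began.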

I would propose the following, and excuse me in advance if this has already
been proposed and discarded -- I've tried to follow this topic on the mailing
list over the past few months, until it became an endless discussion.

I think only the raw values should be available. Even if a middleware changes
them, it must store them as raw values. And because you cannot change those
values without knowing what encoding the request uses, the character encoding
*must* be present.

I know that sounds easy but it's not, because browsers don't specify the
charset in the Content-Type and instead they generate a new request using the
charset from the previous response. So the charset is unknown to the
server/gateway and the middleware stack.

So, what we could do is introduce a mandatory variable called, say,
wsgi.charset, which would be used as follows:
 - It MUST be set by the server or gateway on every request.
 - Every middleware or application that reads or writes these values MUST use
the charset specified in wsgi.charset.
 - If a server, gateway, middleware or application wants to change the charset
and it is possible*, it MUST convert the *entire* request into that charset
and update wsgi.charset accordingly.
 - When the charset is not specified in the HTTP request, UTF-8 MUST be
assumed by the server/gateway, unless another default charset has been
specified by the user.
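
A rough sketch of the server/gateway side of these rules. wsgi.charset is only the key proposed above, and the function names are hypothetical; the charset parameter is parsed with the standard library's email.message.Message:

```python
from email.message import Message


def request_charset(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    if not content_type:
        return default
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter, or None.
    return message.get_content_charset() or default


def add_wsgi_charset(environ, default="utf-8"):
    """Server-side step: always set the proposed wsgi.charset key."""
    environ["wsgi.charset"] = request_charset(environ.get("CONTENT_TYPE"), default)
    return environ
```

The default argument stands in for the "default charset specified by the user"; it falls back to UTF-8 when the request says nothing.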

I think/hope that will solve all the problems.

What happens when a WSGI application is actually made up of two WSGI applications
and they send the responses in different charsets? If it's not possible to
configure them so that they both use the same charsets, then one of them would
have to be wrapped by a middleware which:
 - On egress, converts the responses using the charset used by the other
application.
 - On ingress, if the charset is not specified in the request, it will assume
it's the one used by the other application, and thus it will convert the
request using the charset supported by the wrapped application.

It would look like this:
===
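import trac.web.main  # Trac's WSGI entry point

# CharsetNormalizer is the proposed (not yet written) middleware;
# myapp stands for the other application.

def application(environ, start_response):
    # environ is a dict, so dispatch on the request path stored in it.
    if environ.get("PATH_INFO", "").startswith("/trac/"):
        # Say Trac only supports Latin-1 and we want responses to use UTF-8:
        app = trac.web.main.dispatch_request
        app = CharsetNormalizer(app, response="latin-1", request="utf8")
    else:
        # myapp uses UTF-8
        app = myapp
    return app(environ, start_response)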
===
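
For the egress half, a minimal sketch of how such a wrapper might transcode response bodies. The class name ResponseTranscoder and its parameter names are assumptions for illustration, and the sketch deliberately leaves out rewriting Content-Type and Content-Length as well as forwarding close():

```python
import codecs


class ResponseTranscoder:
    """Hypothetical egress half of a charset-normalizing middleware."""

    def __init__(self, app, app_charset, target_charset):
        self.app = app
        self.app_charset = app_charset        # what the wrapped app emits
        self.target_charset = target_charset  # what should go on the wire

    def __call__(self, environ, start_response):
        decoder = codecs.getincrementaldecoder(self.app_charset)()
        for chunk in self.app(environ, start_response):
            # The incremental decoder copes with characters split across chunks.
            text = decoder.decode(chunk)
            if text:
                yield text.encode(self.target_charset)
        # Flush anything still buffered; raises if the stream ended mid-character.
        tail = decoder.decode(b"", final=True)
        if tail:
            yield tail.encode(self.target_charset)
```

Encoding to the target charset can still fail for characters it cannot represent, which is the limitation noted in the footnote below.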

Then there's the string vs bytes issue. Bytes would be the natural choice to
represent these raw values, but they would probably cause more trouble than they
solve. So, I think they should be strings that contain the ASCII raw
encoded values (i.e., str on both versions of Python).

What do you think about this? Again, sorry if this has been discarded before!
:)

* For example, you can always convert Latin-1 to UTF-8, but not every UTF-8
string can be converted to Latin-1.
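
The asymmetry in this footnote can be checked directly in Python 3; the sample strings are arbitrary:

```python
# Latin-1 -> UTF-8 always works: every Latin-1 byte is a valid code point.
latin1_bytes = "caf\u00e9".encode("latin-1")
as_utf8 = latin1_bytes.decode("latin-1").encode("utf-8")

# The reverse can fail: the euro sign (U+20AC) has no Latin-1 code point.
try:
    "\u20ac10".encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
```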
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: WSGI for Python 3

ianb
In reply to this post by PJ Eby
On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <[hidden email]> wrote:
> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
> > And this doesn't help with Python 3: either we have byte values of SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think bytes will be more awkward to port to than text, and inconsistent with other WSGI values.
>
> OTOH, it has the tremendous advantage of pushing the encoding question onto the app (or framework) developer...  who's really the only one who can make the right decision for their particular application.  And personally, I'd rather have clear boundaries between text and bytes, such that porting (even if tedious or awkward) is *consistent*, and clear as to when you're finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and PATH_INFO...  not just in my app code, but in all the library code I call *from* my app?"
>
> IOW, the bytes/string discussion on Python-dev has kind of led me to realize that we might just as well make the *entire* stack bytes (incoming and outgoing headers *and* streams), and rewrite that bit in PEP 333 about using str on "Python 3000" to say we go with bytes on Python 3+ for everything that's a str in today's WSGI.

This was my first intuition too, until I started thinking in more detail about the particular values involved.  Some obviously are textish, like environ['SERVER_NAME'].  Not a very useful value, but definitely text.

Basically all the internal strings are textish, so we're left with:

wsgi.url_scheme
SCRIPT_NAME/PATH_INFO
QUERY_STRING
HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
response status
response headers (name and value)

And there's a few things like REMOTE_USER that are kind of in the middle.  Everyone is in agreement that bodies should be bytes.

One initial problem is that the Python 3 stdlib handles bytes poorly, so for instance there's no good way to reconstruct the URL using the stdlib.  That explains certain tensions, but I think we should ignore that, and in fact that's what Python-Dev seemed to say pretty clearly.

Now, the other keys:

wsgi.url_scheme: clearly ASCII

SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old legacy encoding.
raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL encoding happens at the byte layer, so a server could reasonably URL encode any non-ASCII characters without imposing any encoding.
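
As a sketch of what "URL encode without imposing any encoding" could mean for a server, the hypothetical helper below works purely on byte values and never decodes the path:

```python
def escape_non_ascii(raw_path):
    """Percent-encode bytes >= 0x80; leave ASCII (including existing %XX) alone."""
    return "".join(
        chr(byte) if byte < 0x80 else "%%%02X" % byte
        for byte in raw_path
    )
```

The result is an ASCII-only str regardless of whether the client sent UTF-8, Latin-1 or something else, and existing escapes such as %2F pass through unchanged.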

QUERY_STRING: should be ASCII, same as raw request path

headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by the specification.  The spec also implies you have to use the RFC2047 inline encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and supporting it would probably be a bad idea for security reasons.  The Atompub spec (reasonably modern) specifically says Title headers should be encoded with RFC2047 (if they are not ISO-8859-1): http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- decoding this kind of encoding at the application layer seems reasonable to me.
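
For an application that does want to decode RFC2047 encoded words itself, a minimal sketch using the standard library's email.header module; the helper name decode_rfc2047 is made up:

```python
from email.header import decode_header, make_header


def decode_rfc2047(value):
    """Decode RFC 2047 encoded words in a header value; plain text passes through."""
    return str(make_header(decode_header(value)))
```

An application would call this only on the specific headers where it expects encoded words, such as the Atompub Title header mentioned above.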

cookie header: this specific header can easily have multiple encodings, as the browser encodes data then treats it as opaque bytes, so a cookie can be set via UTF-8 one place, Latin1 another, and those coexist in one header.  That is, there is no real encoding and this should be treated as bytes.  (Latin1 is an approximation of bytes... a spotty way to treat bytes, but entirely workable.)
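
A rough sketch of the "Latin1 as an approximation of bytes" idea for cookies. It assumes the server has decoded the Cookie header as Latin-1, and the UTF-8-then-Latin-1 guess per value is just one possible application policy; split_cookie is a hypothetical name:

```python
def split_cookie(header_text):
    """Parse a Latin-1-decoded Cookie header, guessing an encoding per value."""
    cookies = {}
    for pair in header_text.split(";"):
        # Strip only spaces and tabs: plain strip() would also eat U+00A0,
        # which can be a legitimate byte of a multi-byte value.
        name, sep, value = pair.strip(" \t").partition("=")
        if not sep:
            continue
        raw = value.encode("latin-1")  # recovers the original bytes exactly
        try:
            cookies[name] = raw.decode("utf-8")
        except UnicodeDecodeError:
            cookies[name] = raw.decode("latin-1")
    return cookies
```

Because Latin-1 round-trips every byte, two cookies set with different encodings can be recovered independently from the same header.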

response status: I believe the spec says this must be Latin1/ISO-8859-1.  In practice it is almost always ASCII, and since it is not user-visible it's not something that really needs localization.

response headers: the spec implies Latin1, in practice the Set-Cookie header is bytes (since interoperation with wonky legacy systems is not uncommon).  I'm not sure of any other exceptions?


So... to me it seems pretty reasonable for HTTP specifically that text can work.  And it feels weird that, say, environ['SERVER_NAME'] be text and environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] should be in that mode.  And it would also be weird if environ['SERVER_NAME'] was bytes.

In the past when we've gotten down to specifics, the only holdup has been SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

--
Ian Bicking  |  http://blog.ianbicking.org


Re: WSGI for Python 3

Gustavo Narea
In reply to this post by Gustavo Narea
Gustavo said:
>  - On ingress, if the charset is not specified in the request, it will
> assume  it's the one used by the other application, and thus it will
> convert the request using the charset supported by the wrapped
> application.

That should actually be:

"On ingress, if the charset in wsgi.charset differs from the charset supported
by the wrapped application, the request will be converted into the charset
supported by the wrapped application."
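
A sketch of what that ingress conversion might look like for a raw (percent-encoded) path, using urllib.parse from the standard library. The function name is hypothetical, and the same idea would have to be applied to the query string and body:

```python
from urllib.parse import quote, unquote_to_bytes


def transcode_raw_path(raw_path, source, target):
    """Re-encode the percent-escaped bytes of a raw path from source to target."""
    segments = []
    for segment in raw_path.split("/"):
        text = unquote_to_bytes(segment).decode(source)
        # safe="" keeps an escaped slash (%2F) escaped after the round trip.
        segments.append(quote(text.encode(target), safe=""))
    return "/".join(segments)
```

Note that quote() also escapes reserved characters that were literal in the input (for example "+"), and that encoding to the target charset raises an error for characters it cannot represent.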
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: WSGI for Python 3

Tres Seaver
In reply to this post by PJ Eby

P.J. Eby wrote:

> (Hm.  Although actually, I suppose we *could* just borrow the time
> machine and pretend that WSGI called for "byte-strings everywhere"
> all along...)

I like the idea of pushing responsibility for decoding stuff into the
framework / app writer's hands.  OTOH, doesn't that hose authors of
existing middleware, due to the borkedness of working with bytes in Python 3?


Tres.
--
===================================================================
Tres Seaver          +1 540-429-0999          [hidden email]
Palladion Software   "Excellence by Design"    http://palladion.com

Re: WSGI for Python 3

Tres Seaver
In reply to this post by ianb

Ian Bicking wrote:

>> IOW, the bytes/string discussion on Python-dev has kind of led me to
>> realize that we might just as well make the *entire* stack bytes (incoming
>> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
>> using str on "Python 3000" to say we go with bytes on Python 3+ for
>> everything that's a str in today's WSGI.
>>
>
> This was my first intuition too, until I started thinking in more detail
> about the particular values involved.  Some obviously are textish, like
> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>
> Basically all the internal strings are textish, so we're left with:

What do you mean by "internal"?  Anything in the headers or the CGI
environment is intrinsically "bytes-ish" to me.  Do you mean that you
want application programmers to have them transparently decoded?  If so,
we can make that the responsibility of the non-middleware framework /
application.

> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
>
> And there's a few things like REMOTE_USER that are kind of in the middle.
> Everyone is in agreement that bodies should be bytes.
>
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for
> instance there's no good way to reconstruct the URL using the stdlib.  That
> explains certain tensions, but I think we should ignore that, and in fact
> that's what Python-Dev seemed to say pretty clearly.

python-dev seems to me to be coming to the realization that they should
have tried harder to make real-world apps work before they froze their
choices.

> Now, the other keys:
>
> wsgi.url_scheme: clearly ASCII
>
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL
> encoding happens at the byte layer, so a server could reasonably URL encode
> any non-ASCII characters without imposing any encoding.
>
> QUERY_STRING: should be ASCII, same as raw request path
>
> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by
> the specification.  The spec also implies you have to use the RFC2047 inline
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
> supporting it would probably be a bad idea for security reasons.  The
> Atompub spec (reasonably modern) specifically says Title headers should be
> encoded with RFC2047 (if they are not ISO-8859-1):
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
> decoding this kind of encoding at the application layer seems reasonable to
> me.
>
> cookie header: this specific header can easily have multiple encodings, as
> the browser encodes data then treats it as opaque bytes, so a cookie can be
> set via UTF-8 one place, Latin1 another, and those coexist in one header.
> That is, there is no real encoding and this should be treated as bytes.
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but
> entirely workable.)
>
> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In
> practice it is almost always ASCII, and since it is not user-visible it's
> not something that really needs localization.
>
> response headers: the spec implies Latin1, in practice the Set-Cookie header
> is bytes (since interoperation with wonky legacy systems is not uncommon).
> I'm not sure of any other exceptions?
>
>
> So... to me it seems pretty reasonable for HTTP specifically that text can
> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
> should be in that mode.  And it would also be weird if
> environ['SERVER_NAME'] was bytes.


> In the past when we've gotten down to specifics, the only holdup has been
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

I think I favor PJE's suggestion:  let WSGI deal only in bytes.



Tres.
--
===================================================================
Tres Seaver          +1 540-429-0999          [hidden email]
Palladion Software   "Excellence by Design"    http://palladion.com