String Types in WSGI [Graham's WSGI for py3]

Armin Ronacher
Hi,

Graham currently proposes[1] the following behaviors for Strings in WSGI
(Python version independent).  However this mail only covers the Python
3 part which I assume becomes a separate section in the PEP or even WSGI
version.

Terminology:

  byte string == contains bytes
  unicode string == contains unicode code points*
  native string == what the python version uses as a string
                   (bytes in python 2, unicode in python 3)

  * ucs2 / ucs4 is ignored here.  You might still have problems
    with surrogate pairs in ucs2 python builds and jython.

> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
> environment, the value of the variable should be a native string.

URLs in general are a tricky topic.  For this particular field it does
not matter if we decide on bytes or unicode because it will always only
contain ASCII characters.  This should be picked consistently with the
type of PATH_INFO and SCRIPT_NAME.

> 3. For the CGI variables contained in the WSGI environment, the values
> of the variables are byte strings.
\o/ Totally agree with that.

> 4. The WSGI input stream 'wsgi.input' contained in the WSGI
> environment and from which request content is read, should yield byte
> strings.
Same thing.

> 5. The status line specified by the WSGI application must be a byte
> string.
Ditto.

> 6. The list of response headers specified by the WSGI application must
> contain tuples consisting of two values, where each value is a byte
> string.
Makes sense because people stuff a lot of non-latin1 stuff in there.
However, I'm fine with latin1 for headers here as well, but that would
probably only affect cookie and custom headers.

> 7. The iterable returned by the application and from which response
> content is derived, must yield byte strings.
I totally agree.


However Graham moves further away from that in the rest of the blog post
because he wants to point out that people use WSGI directly and that
explicit bytestrings in Python 3 confuse people.  The latest iteration
in the blog post is to not use bytestrings anywhere except for headers
and the input stream.

I thought a lot about this in the past and I welcome the step to make
WSGI harder to use!  This might sound absurd, but once encodings are
really explicit, people will think about it.  I think we should
discourage *applications* written in WSGI and link to implementations in
the PEP.

The big problems are always PATH_INFO and SCRIPT_NAME.  Those are the
only values that are in the dict URL-decoded and might contain non-ASCII
characters (except for headers, but that's a different story because
the only real-world problem there is cookie headers, and those are
troubling for more reasons than just character sets).

My latest change to the WSGI sandbox hg repo [2] was that I added a
notice that later PEP revisions might document a RAW_SCRIPT_NAME or
something that contains the URL quoted values.  It however turns out
that this value is not available from within a webserver context (We're
talking about Apache and IIS here) so that the problem of unquoted
values will not go away.


It also introduces the concept of URI encodings.  I'm especially unhappy
with this part.  It would mean that implementations would have to follow
the WSGI URI encoding if set.  Most applications use either latin1 or
UTF-8 URLs; I would leave that, including the decoding of *all*
incoming data, to the user.

So yes, I'm all for definition #1 in the blog post where Graham says:

> The first is that although WSGI 1.0 on Python 3.X should strictly be
> bytes everywhere as per Definition #1, it is probably too late to
> enforce this now.
I don't think so.  Reasoning: Python 3.0 does not work and is considered
outdated.  Python 3.1 might ship with a wsgiref that conflicts with a
revised spec, but cgi.FieldStorage is still broken there, making it
impossible to use for anything but small applications.


Regards,
Armin

[1]:
http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html
[2]: http://bitbucket.org/ianb/wsgi-peps/
_______________________________________________
Web-SIG mailing list
[hidden email]
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/lists%40nabble.com

Re: String Types in WSGI [Graham's WSGI for py3]

Graham Dumpleton-2
2009/9/18 Armin Ronacher <[hidden email]>:

> Hi,
>
> Graham currently proposes[1] the following behaviors for Strings in WSGI
> (Python version independent).  However this mail only covers the Python
> 3 part which I assume becomes a separate section in the PEP or even WSGI
> version.
>
> Terminology:
>
>  byte string == contains bytes
>  unicode string == contains unicode code points*
>  native string == what the python version uses as a string
>                   (bytes in python 2, unicode in python 3)
>
>  * ucs2 / ucs4 is ignored here.  You might still have problems
>    with surrogate pairs in ucs2 python builds and jython.
>
>> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
>> environment, the value of the variable should be a native string.
>
> URLs in general are a tricky topic.  For this particular field it does
> not matter if we decide on bytes or unicode because it will always only
> contain ASCII characters.  This should be picked consistently with the
> type of PATH_INFO and SCRIPT_NAME.

I believe it does matter, and the fact that it contains only ASCII
doesn't necessarily make it simpler. The reason is that the URL
reconstruction recipe from the WSGI PEP has to work, i.e.:

# Python 2 import; in Python 3 this function is urllib.parse.quote
from urllib import quote
url = environ['wsgi.url_scheme']+'://'

if environ.get('HTTP_HOST'):
    url += environ['HTTP_HOST']
else:
    url += environ['SERVER_NAME']

    # Append the port only when it is not the default for the scheme
    if environ['wsgi.url_scheme'] == 'https':
        if environ['SERVER_PORT'] != '443':
            url += ':' + environ['SERVER_PORT']
    else:
        if environ['SERVER_PORT'] != '80':
            url += ':' + environ['SERVER_PORT']

url += quote(environ.get('SCRIPT_NAME',''))
url += quote(environ.get('PATH_INFO',''))
if environ.get('QUERY_STRING'):
    url += '?' + environ['QUERY_STRING']

In Python 2.X you can concatenate byte strings and unicode strings:

>>> 'http' + u'://'
u'http://'

In Python 3.X you cannot concatenate byte strings and unicode strings:

>>> b'http'+'://'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str

Since SCRIPT_NAME, PATH_INFO and QUERY_STRING, when used by a user in
Python 3.X, were likely to be held as unicode strings, I saw
wsgi.url_scheme as needing to be of the same type, albeit specified as
a native string so that it is still a byte string, as we are
accustomed to, in Python 2.X.

This is also why all the other CGI variables are similarly made to be
unicode strings. That is, so they are all the same type and stuff like
URL reconstruction will work.
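
As a rough illustration of the point above, here is a minimal sketch of the same recipe on Python 3, assuming every environ value involved is a native (unicode) string. The function name reconstruct_url and the sample environ values are made up for this example; only urllib.parse.quote is a real library call.

```python
from urllib.parse import quote


def reconstruct_url(environ):
    """Rebuild the request URL, assuming every value is a native str."""
    scheme = environ['wsgi.url_scheme']
    url = scheme + '://'

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']
        # Append the port only when it is not the default for the scheme
        default_port = '443' if scheme == 'https' else '80'
        if environ['SERVER_PORT'] != default_port:
            url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))
    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']
    return url


# Hypothetical environ, for illustration only
example_environ = {
    'wsgi.url_scheme': 'http',
    'SERVER_NAME': 'example.org',
    'SERVER_PORT': '8080',
    'SCRIPT_NAME': '/app',
    'PATH_INFO': '/some page',
    'QUERY_STRING': 'q=1',
}
print(reconstruct_url(example_environ))
```

If wsgi.url_scheme were bytes while the other values were str, the very first concatenation would raise TypeError, which is the mixing problem described above.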

If bytes is used, you could potentially end up with messy situations
where you have to perform URL reconstruction as bytes, but then
convert it to unicode strings to stuff it in as a parameter into some
templating system where the template text is unicode.

If SCRIPT_NAME, PATH_INFO and QUERY_STRING are in bytes form and they
needed different encodings, how do you easily convert your byte
strings to the unicode string needed to stuff in the template? I can't
see how you could; they really need to be in unicode if everything
else in the system is going to be unicode. Or are templating systems
now going to be expected to drop down and use bytes all the time as
well?

> However Graham moves further away from that in the rest of the blog post
> because he wants to point out that people use WSGI directly and that
> explicit bytestrings in Python 3 confuse people.  The latest iteration
> in the blog post is not to use bytestrings in a single location except
> for headers and the input stream.

Plus the response content would need to be bytes, albeit allowing an
ISO-8859-1 fallback if it is unicode, like other response items. The
use of unicode exclusively is only really a big factor in WSGI
environment variables.

> I thought a lot about this in the past and I welcome the step to make
> WSGI harder to use!  This might sound absurd, but once encodings are
> really explicit, people will think about it.  I think we should
> discourage *applications* written in WSGI and link to implementations in
> the PEP.

As a way of deterring a lot of users, making it harder to use, or at
least making it more obvious that thought is required, would be quite
effective.

This would also be good in pushing people to use existing
frameworks/toolkits which deal with all this stuff internally and hide
it and instead present unicode strings at a higher level after doing
everything correctly.

So, it may well curtail the NIH issue that is becoming a problem, but
I am not sure that doing that, and making it harder for users who want
to work at that level, is a good idea.

As others have pointed out, the likes of rack and jack, not sure about
the new Perl variant, don't seem to have an issue with using unicode.

> The big problems are always PATH_INFO and SCRIPT_NAME.  Those are the
> only values that are in the dict URL-decoded and might contain non-ASCII
> characters (except for headers, but that's a different story because
> the only real-world problem there is cookie headers, and those are
> troubling for more reasons than just character sets).
>
> My latest change to the WSGI sandbox hg repo [2] was that I added a
> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
> something that contains the URL quoted values.  It however turns out
> that this value is not available from within a webserver context (We're
> talking about Apache and IIS here) so that the problem of unquoted
> values will not go away.

I am still waiting for the good explanation of why access to the raw
URL quoted values is so important. Can you please explain what the
requirement is?

The only example I recall was related to web servers eliminating
repeating slashes thereby effectively not making it possible to have
URLs in query strings without a custom encoding string. Since there
are alternatives, I don't find that alone a compelling argument.

> It also introduces the concept of URI encodings.  I'm especially unhappy
> with this part.  It would mean that implementations would have to follow
> the WSGI URI encoding if set.

No it doesn't. The whole point of providing wsgi.uri_encoding was so
that a WSGI application would know the encoding so as to be able to
reverse it to bytes and convert it to something else. Given that you
accept below that most of the time latin1 or UTF-8 would be used, then
the typical case would be handled automatically and so that
transcoding wouldn't be required.
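
To make the "reverse it to bytes and convert it to something else" step concrete, here is a minimal sketch. It assumes the server decoded the raw path with the codec it advertises under the proposed 'wsgi.uri_encoding' key; the helper name path_as and the latin-1 default when the key is missing are assumptions made for this example, not part of any spec.

```python
def path_as(environ, wanted_encoding):
    """Return PATH_INFO re-decoded with wanted_encoding.

    Assumes the server decoded the raw path bytes with the codec named
    in environ['wsgi.uri_encoding'] (latin-1 if absent, an assumption).
    """
    server_encoding = environ.get('wsgi.uri_encoding', 'latin-1')
    raw = environ['PATH_INFO'].encode(server_encoding)  # undo server decoding
    return raw.decode(wanted_encoding)                  # apply the app's codec


# The client sent UTF-8 bytes, but the server decoded them as latin-1
environ = {
    'PATH_INFO': b'/caf\xc3\xa9'.decode('latin-1'),
    'wsgi.uri_encoding': 'latin-1',
}
utf8_path = path_as(environ, 'utf-8')
```

The round trip is lossless here because latin-1 maps every byte to exactly one code point; with other server codecs that may not hold.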

> Most of the applications are using either
> latin1 or UTF-8 URLs, I would leave that including the decoding of *all*
> incoming data to the user.
>
> So yes, I'm all for definition #1 in the blog post where Graham says:
>
>> The first is that although WSGI 1.0 on Python 3.X should strictly be
>> bytes everywhere as per Definition #1, it is probably too late to
>> enforce this now.
> I don't think so.  Reasoning: Python 3.0 does not work and is considered
> outdated, Python 3.1 might ship with a wsgiref that's against a
> revisioned spec, but cgi.FieldStorage is still broken there, making it
> impossible to use for anything but small applications.

I'll summarise where people are falling in respect of which definition
they want in a later post, after more of the key figures have indicated
their choices.

Graham

Re: String Types in WSGI [Graham's WSGI for py3]

René Dudfield
On Fri, Sep 18, 2009 at 8:56 AM, Graham Dumpleton
<[hidden email]> wrote:

>> The big problems are always PATH_INFO and SCRIPT_NAME.  Those are the
>> only values that are in the dict URL-decoded and might contain non-ASCII
>> characters (except for headers, but that's a different story because
>> the only real-world problem there is cookie headers, and those are
>> troubling for more reasons than just character sets).
>>
>> My latest change to the WSGI sandbox hg repo [2] was that I added a
>> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
>> something that contains the URL quoted values.  It however turns out
>> that this value is not available from within a webserver context (We're
>> talking about Apache and IIS here) so that the problem of unquoted
>> values will not go away.
>
> I am still waiting for the good explanation of why access to the raw
> URL quoted values is so important. Can you please explain what the
> requirement is?
>
> The only example I recall was related to web servers eliminating
> repeating slashes thereby effectively not making it possible to have
> URLs in query strings without a custom encoding string. Since there
> are alternatives, I don't find that alone a compelling argument.
>

Why is the raw url needed (very rarely)?

Sometimes there are bugs.  Access to the raw string lets you work
around those bugs... if you need to.  Dropping to a lower level is
needed sometimes.

Some APIs require you to send back an exact copy of the input url.  Or
sometimes you want to know what input url was used... not the cleaned
up version of it.  Sometimes clients calling the wsgi code will be
buggy... and looking at the unquoted url is needed in those cases to
work around buggy clients.

Re: String Types in WSGI [Graham's WSGI for py3]

Benoit Chesneau-5

On Sep 18, 2009, at 10:12 AM, René Dudfield wrote:

> Why is the raw url needed (very rarely)?
>
> Sometimes there are bugs.  Access to the raw string lets you work
> around those bugs... if you need to.  Dropping to a lower level is
> needed sometimes.
>
> Some APIs require you to send back an exact copy of the input url.  Or
> sometimes you want to know what input url was used... not the cleaned
> up version of it.  Sometimes clients calling the wsgi code will be
> buggy... and looking at the unquoted url is needed in those cases to
> work around buggy clients.


And sometimes you need to support the full URI spec. For example, %2F
is different from /. If the whole URL is decoded, you don't know
whether the client requested %2F or /; you just get a /, which is
annoying. It causes problems with some APIs. I'm thinking of CouchDB,
for example, which accepts a db name with %2F inside to allow creation
of a folder on the user's system.
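
A small sketch of the %2F problem, using only urllib.parse.unquote from the Python 3 standard library. The two paths are made-up examples, and the segment-splitting approach is just one way an application could keep the distinction if it had the raw value.

```python
from urllib.parse import unquote

raw_a = '/db%2Fname'   # one path segment containing an encoded slash
raw_b = '/db/name'     # two separate path segments

# Once decoded, the two requests are indistinguishable
decoded_a = unquote(raw_a)
decoded_b = unquote(raw_b)

# Splitting on '/' before decoding keeps the distinction
segments_a = [unquote(s) for s in raw_a.split('/')[1:]]
segments_b = [unquote(s) for s in raw_b.split('/')[1:]]
```

This only works when the still-quoted value is available, which is the reason the raw URL keeps coming up in this thread.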


- benoit


Re: String Types in WSGI [Graham's WSGI for py3]

Graham Dumpleton-2
In reply to this post by René Dudfield
2009/9/18 René Dudfield <[hidden email]>:

> On Fri, Sep 18, 2009 at 8:56 AM, Graham Dumpleton
> <[hidden email]> wrote:
>>> The big problems are always PATH_INFO and SCRIPT_NAME.  Those are the
>>> only values that are in the dict URL-decoded and might contain non-ASCII
>>> characters (except for headers, but that's a different story because
>>> the only real-world problem there is cookie headers, and those are
>>> troubling for more reasons than just character sets).
>>>
>>> My latest change to the WSGI sandbox hg repo [2] was that I added a
>>> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
>>> something that contains the URL quoted values.  It however turns out
>>> that this value is not available from within a webserver context (We're
>>> talking about Apache and IIS here) so that the problem of unquoted
>>> values will not go away.
>>
>> I am still waiting for the good explanation of why access to the raw
>> URL quoted values is so important. Can you please explain what the
>> requirement is?
>>
>> The only example I recall was related to web servers eliminating
>> repeating slashes thereby effectively not making it possible to have
>> URLs in query strings without a custom encoding string. Since there
>> are alternatives, I don't find that alone a compelling argument.
>>
>
> Why is the raw url needed(very rarely)?
>
> Sometimes there are bugs.  Access to the raw string lets you work
> around those bugs... if you need to.  Dropping to a lower level is
> needed sometimes.
>
> Some APIs require you to send back an exact copy of the input url.
> Or sometimes you want to know what input url was used... not the cleaned
> up version of it.

What APIs? Can we have some concrete examples in common use rather
than theoretical possibilities?

> Sometimes clients calling the wsgi code will be
> buggy... and looking at the unquoted url is needed in those cases to
> work around buggy clients.

Bugs in WSGI adapters aren't a good reason for why it is needed. If
the WSGI adapters are broken, fix the WSGI adapters.

Graham

Re: String Types in WSGI [Graham's WSGI for py3]

Graham Dumpleton-2
In reply to this post by Benoit Chesneau-5
2009/9/18 Benoit Chesneau <[hidden email]>:
> And sometimes you need to support full uri spec. For example %2F is
> different from / . Actually if all url is decoded you don't know if the
> client request was %2F or /, you just get a /. Which is annoying. It causes
> some problem with some api ,I'm  thinking to couchdb for example who accept
> db name with a %2F inside to allow creation of folder on user system.

Which happens because of the way the HTTP URL processing rules say it
has to be done.

Are there any other real world examples besides repeating slashes and
slash encoding issues?

Does the desire to bypass the traditional SCRIPT_NAME and PATH_INFO and
go direct to REQUEST_URI all come down to these slash encoding and path
normalising issues?

Graham

Re: String Types in WSGI [Graham's WSGI for py3]

Armin Ronacher
In reply to this post by Graham Dumpleton-2
Hi,

Graham Dumpleton wrote:
> I believe it does matter and that it contains ASCII possibly doesn't
> mean it is somehow simpler. The reason is that URL reconstruction
> recipe as per WSGI PEP has to work. Ie.
>  *snip*
That of course will not work and is not something we should aim for.
There is a lot of stuff that will break as well, and libraries are
supposed to fix that on the 2.x -> 3.x transition.  Actually in 2.6 you
can use bytestring literals that will fix that problem for you.  The
only problem left is wsgi.url_scheme, and for that one just has to use
an explicit .encode() call.  No big deal.
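
For comparison, a minimal sketch of what the bytes-based variant could look like on Python 3, assuming wsgi.url_scheme is a native string and every other value is bytes. The function name reconstruct_url_bytes and the sample environ are invented for this example, and for brevity it assumes HTTP_HOST is always present.

```python
from urllib.parse import quote_from_bytes


def reconstruct_url_bytes(environ):
    """Rebuild the URL as bytes; only the scheme needs an explicit encode."""
    # Assumption: url_scheme is a native str, everything else is bytes
    scheme = environ['wsgi.url_scheme'].encode('ascii')
    url = scheme + b'://' + environ['HTTP_HOST']
    # quote_from_bytes returns str, so encode the ASCII result back to bytes
    url += quote_from_bytes(environ.get('SCRIPT_NAME', b'')).encode('ascii')
    url += quote_from_bytes(environ.get('PATH_INFO', b'')).encode('ascii')
    if environ.get('QUERY_STRING'):
        url += b'?' + environ['QUERY_STRING']
    return url


# Hypothetical environ, for illustration only
bytes_environ = {
    'wsgi.url_scheme': 'https',
    'HTTP_HOST': b'example.org',
    'SCRIPT_NAME': b'',
    'PATH_INFO': b'/caf\xc3\xa9',
    'QUERY_STRING': b'',
}
print(reconstruct_url_bytes(bytes_environ))
```

The result never needs to know which charset the path bytes were in, since percent-quoting works on the raw bytes.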

> This is also why all the other CGI variables are similarly made to be
> unicode strings. That is, so all the same type and stuff like URL
> reconstruction will work.
In an ideal world, maybe.  But the only thing more evil than
UnicodeErrors are silent encoding errors that are hard to track down.
(What just destroyed my charset information? Oh, it was the WSGI gateway
in combination with an ancient internet explorer version)

> If bytes is used, you could potentially end up with messy situations
> where you have to perform URL reconstruction as bytes, but then
> convert it to unicode strings to stuff it in as a parameter into some
> templating system where the template text is unicode.
URLs are ASCII only, IRIs are not.  If you are working with Python 3 you
would probably start using IRIs internally after a while because "it
makes sense".

> If SCRIPT_NAME, PATH_INFO and QUERY_STRING are in bytes form and they
> needed different encodings, how do you easily convert your bytes
> strings to the unicode string needed to stuff in the template. Can't
> see how you could, they really need to be in unicode if everything
> else in the system is going to be unicode. Or are templating systems
> now going to be expected to drop down and use bytes all the time as
> well.
I still defend my point that charsets are a complex topic and it's the
framework / library that should deal with that.  WebOb does, Werkzeug
does, Django does, I'm sure web.py and other libraries do too.  If one
wants to shoot himself in the foot by implementing his own library
based on WSGI we should not stop him.

> As a way of deterring a lot of users, making it harder to use, or at
> least making it more obvious that thought is required, would be quite
> effective.
>
> This would also be good in pushing people to use existing
> frameworks/toolkits which deal with all this stuff internally and hide
> it and instead present unicode strings at a higher level after doing
> everything correctly.
I like that idea a lot :)

> As others have pointed out, the likes of rack and jack, not sure about
> the new Perl variant, don't seem to have an issue with using unicode.
Ruby does not use unicode internally, it uses encoding marked strings.
That is, a string comes in and is iso-8859-15, it's marked as such and
ruby knows how to deal with it.  As far as I know Rack does not specify
charsets at all which probably means that it's up to the implementation
to decide what to use.  Rack will have the problem with charsets soon
enough, they just don't care about unicode enough (yet?).

> I am still waiting for the good explanation of why access to the raw
> URL quoted values is so important. Can you please explain what the
> requirement is?
Knowing the difference between "foo/bar" and "foo%2fbar" I guess.  To be
honest, I never had the problem, but apparently some other people have.
And of course that you suddenly have non-ASCII stuff in a dict value ;)

> The only example I recall was related to web servers eliminating
> repeating slashes thereby effectively not making it possible to have
> URLs in query strings without a custom encoding string. Since there
> are alternatives, I don't find that alone a compelling argument.
I don't need unquoted strings, I just think it would make sense to have
them *if possible*.


Regards,
Armin

Re: String Types in WSGI [Graham's WSGI for py3]

René Dudfield
In reply to this post by Graham Dumpleton-2
On Fri, Sep 18, 2009 at 11:21 AM, Graham Dumpleton
<[hidden email]> wrote:

> 2009/9/18 Benoit Chesneau <[hidden email]>:
>> And sometimes you need to support full uri spec. For example %2F is
>> different from / . Actually if all url is decoded you don't know if the
>> client request was %2F or /, you just get a /. Which is annoying. It causes
>> some problem with some api ,I'm  thinking to couchdb for example who accept
>> db name with a %2F inside to allow creation of folder on user system.
>
> Which happens because of the way the HTTP URL processing rules says it
> has to be done.
>
> Are there any other real world examples besides repeating slashes and
> slash encoding issues?
>
> Is the desire to bypass traditional SCRIPT_NAME and PATH_INFO and go
> direct to REQUEST_URI all come down to these slash encoding and path
> normalising issues?
>

hello again,

No, slash encoding and normalising are not the only issues.

As mentioned before sometimes you need the exact bytes.

1. buggy clients.  If a client sends something that doesn't work
correctly, you can still sometimes make sense of it in the raw version
of the url.
2. client APIs that require the server to know the exact url.
3. buggy servers that don't do their job properly.
4. extensibility.  A url scheme changes a tiny bit, and you want to
support the change.  Having the raw url allows you to support it on
old servers.

In all APIs it's handy to go to lower levels when the higher levels
don't work right.  Especially when wsgi only handles one side of
things, and urls can be generated by anything.


cheers,

Re: String Types in WSGI [Graham's WSGI for py3]

Graham Dumpleton-2
2009/9/18 René Dudfield <[hidden email]>:

> On Fri, Sep 18, 2009 at 11:21 AM, Graham Dumpleton
> <[hidden email]> wrote:
>> 2009/9/18 Benoit Chesneau <[hidden email]>:
>>> And sometimes you need to support full uri spec. For example %2F is
>>> different from / . Actually if all url is decoded you don't know if the
>>> client request was %2F or /, you just get a /. Which is annoying. It causes
>>> some problem with some api ,I'm  thinking to couchdb for example who accept
>>> db name with a %2F inside to allow creation of folder on user system.
>>
>> Which happens because of the way the HTTP URL processing rules says it
>> has to be done.
>>
>> Are there any other real world examples besides repeating slashes and
>> slash encoding issues?
>>
>> Is the desire to bypass traditional SCRIPT_NAME and PATH_INFO and go
>> direct to REQUEST_URI all come down to these slash encoding and path
>> normalising issues?
>>
>
> hello again,
>
> No, slash encoding and normalising are not the only issues.
>
> As mentioned before sometimes you need the exact bytes.
>
> 1. buggy clients.  If a client sends something that doesn't work
> correctly, you can still sometimes make sense of it in the raw version
> of the url.
> 2. client APIs that require the server to know the exact url.
> 3. buggy servers that don't do their job properly.
> 4. extensibility.  A url scheme changes a tiny bit, and you want to
> support the change.  Having the raw url allows you to support it on
> old servers.
>
> In all APIs it's handy to go to lower levels when the higher levels
> don't work right.  Especially when wsgi only handles one side of
> things, and urls can be generated by anything.

This is where it all comes down to me not having the real world
experience in writing web applications to know best.

What I would like to hear from is PJE (who tends towards #3) and Robert
Brewer (who tends towards #4). Can you guys give counter explanations
as to why these arguments for bytes aren't valid? Ian, I don't think
you have yet expressed your leaning, but I would like to hear your
point as well.

On top of the issues above, Armin believes 2to3 gives better results
where the bytes everywhere interpretation is used. Has anyone else
actually tried 2to3 and have experience with it?

Graham

Re: String Types in WSGI [Graham's WSGI for py3]

Armin Ronacher
Hi,

Graham Dumpleton wrote:
> On top of the issues above, Armin believes 2to3 gives better results
> where the bytes everywhere interpretation is used. Has anyone else
> actually tried 2to3 and have experience with it?
You slightly misquoted me.  I said that 2to3 gives good results on high
level transformations (eg, a django app between 2 and 3) because both
"foo" and u"foo" becomes "foo".  Werkzeug, WebOb, Django all use unicode
by default, so the application will not notice any changes.

That would not change if we had unicode in the WSGI dict and the
framework were changed to treat it properly and do an encode/decode
dance if necessary.

The reason I brought it up is that 2to3 does not work at all on the raw
WSGI layer currently because it converts bytes to unicode which in my
opinion is just wrong.


Regards,
Armin

Re: String Types in WSGI [Graham's WSGI for py3]

Armin Ronacher
In reply to this post by Armin Ronacher
Hi,

Let me back up a bit here.

We have to focus on two different use cases for WSGI on Python 3.  One
is the application that should continue to work on Python 3, the
other is the application that was designed for Python 3.

In both cases let's just assume that this application is using
WebOb/Werkzeug/Django or whatever library is in use.

2to3 converts "foo" and u"foo" to "foo".  However in Python 3 "foo" is
unicode, so that's fine if the library exposes unicode data only.  This
is the case for all the frameworks and libraries.  Template engines,
database adapters, frameworks, they all use unicode internally which is
great.

Whether the WSGI server or the library figures out charsets, the data
forwarded to the application is always unicode.  So what would we gain
from doing the decoding in the server?

On the bright side, 2to3 would probably start working for some raw WSGI
applications but would still break many.  On the other hand, the
frameworks would still have to perform encoding detection for stuff like
multipart or form encoded form data.  Even worse: they would have to
apply different decode rules for form data and stuff like path info.

It already caused confusion that path info was unquoted in the past with
many people quoting that value, it would be even worse in the future if
path info was proper unicode, query string looked like unicode but is
actually url encoded data with a different encoding etc.  I can see some
major confusion coming up there, and it would not remove any complexity
for real-world implementations of WSGI.
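
A rough sketch of the mixed situation described above, assuming a server that hands over PATH_INFO already decoded while QUERY_STRING is still percent-encoded in a different charset. The environ values are invented for this example; parse_qs and its encoding parameter are from the Python 3 standard library.

```python
from urllib.parse import parse_qs

# Hypothetical environ: the path is already unicode, the query is not
mixed_environ = {
    'PATH_INFO': '/caf\u00e9',        # assumed to be decoded by the server
    'QUERY_STRING': 'name=caf%E9',    # still quoted, and latin-1 underneath
}

# parse_qs defaults to UTF-8 with errors='replace', so a wrong guess
# does not raise; it silently produces a replacement character
wrong = parse_qs(mixed_environ['QUERY_STRING'])
right = parse_qs(mixed_environ['QUERY_STRING'], encoding='latin-1')
```

Both values look like ordinary strings to the application, yet only one of them has actually been decoded, and the other needs a charset the server never told it about.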


Regards,
Armin

Re: String Types in WSGI [Graham's WSGI for py3]

ianb
In reply to this post by Graham Dumpleton-2
On Fri, Sep 18, 2009 at 2:56 AM, Graham Dumpleton
<[hidden email]> wrote:
> As others have pointed out, the likes of rack and jack, not sure about
> the new Perl variant, don't seem to have an issue with using unicode.

I looked up Jack and Rack: http://jackjs.org/jsgi-spec.html and
http://rack.rubyforge.org/doc/files/SPEC.html

They don't have an issue with unicode because they don't mention it
and don't specify anything at all.  Basically they punt on the issue.

In Jack's case specifically, most things in JavaScript have to be unicode.
The response body iterator must have items that respond to
toByteString, which includes String and Binary.  I'm assuming Strings
always use UTF-8 in JavaScript, as JSON acts that way.  jsgi.input is
only specified as an "input stream", which is very unspecified,
especially since jsgi.errors is an "output stream", though presumably
one should be binary and the other text.

Ruby's unicode is kind of funny (as I understand it), in a way that
might help them.  Strings are stored as binary with an attached
encoding.  So there's no "unicode", only binary strings with
encodings; so you can change the encoding, or transcoding happens
implicitly when you combine strings from different encodings.  So
basically there's no mention of unicode because they've dodged that
whole bullet.  But it also seems to be unspecified what encoding might
be attached to strings, if any at all.

Another example: neither spec even indicates whether SCRIPT_NAME/PATH_INFO
are url-decoded (or that they aren't decoded).  So, in summary: I
don't see anything we can learn from these specs, and there's no
reason we should feel like we've somehow been leapfrogged; instead,
these other specifications are underspecified.  I also think on
Web-SIG we are approaching this with more robust and general
applications in mind than for Jack and Rack -- for instance, I would
like WSGI to be a reasonable basis for an HTTP proxy, where you can't
enforce UTF-8 everywhere.  If all we wanted for WSGI was to be a layer
for serving monolithic applications then these issues wouldn't be so
important.
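
To illustrate the proxy case, here is a rough sketch, assuming a proxy
that receives a header value as raw bytes from some upstream client.
The helper names forward_header_value and is_valid_utf8 and the sample
value are hypothetical:

```python
def forward_header_value(raw: bytes) -> bytes:
    """Pass a value through a text layer without altering its bytes.

    latin-1 maps every byte 0-255 to exactly one code point, so this
    round trip is lossless for any input.
    """
    as_text = raw.decode("latin-1")
    return as_text.encode("latin-1")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes would survive a strict UTF-8 decode."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


upstream_value = b"filename=caf\xe9.txt"  # latin-1 bytes, not valid UTF-8
print(is_valid_utf8(upstream_value))
print(forward_header_value(upstream_value) == upstream_value)
```

A layer that insisted on strict UTF-8 would have to reject or mangle
this value, whereas a proxy just needs to hand the same bytes on.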

--
Ian Bicking  |  http://blog.ianbicking.org  |  http://topplabs.org/civichacker

Re: String Types in WSGI [Graham's WSGI for py3]

PJ Eby
In reply to this post by Graham Dumpleton-2
At 08:06 PM 9/18/2009 +1000, Graham Dumpleton wrote:
> > Sometimes clients calling the wsgi code will be
> > buggy... and looking at the unquoted url is needed in those cases to
> > work around buggy clients.
>
>Bugs in WSGI adapters aren't a good reason for why it is needed. If
>the WSGI adapters are broken, fix the WSGI adapters.

"client" = "HTTP client" = browser/web spider/other script.


Re: String Types in WSGI [Graham's WSGI for py3]

René Dudfield
In reply to this post by Graham Dumpleton-2
On Fri, Sep 18, 2009 at 12:55 PM, Graham Dumpleton
<[hidden email]> wrote:

> What I would like to hear is PJE (who tends towards #3) and Robert
> Brewer (who tends towards #4). Can you guys give counter explanations
> as to why their arguments for bytes aren't valid. Ian, I don't think
> you have yet expressed your leaning, but would like to hear your point
> as well.
>
> On top of the issues above, Armin believes 2to3 gives better results
> where bytes everywhere interpretation is used. Has anyone else
> actually tried 2to3 and had experience with it?
>
> Graham


Here's a small wsgi server converted in the other thread.

I've also applied 2to3 to it so you can see what it does.  Below are
links to diffs as well.

Note, this doesn't show the following things converted with 2to3:
    - wsgi application.
    - an application from a framework layered on top(eg cherrypy).
    - wsgi middleware.

sneaky.py
    - original python 2.x wsgi 1.0 server.
    http://pastebin.com/f5c2cdd3b
sneaky3.py
    - conversion done by hand.
    http://pastebin.com/f7ae33d81
sneaky3_from2to3.py
    - conversion from 2to3 (python 3.1 version of the script)
    http://pastebin.com/f62a7d83a



(diffs for your comparison).

sneaky_2to3.diff
    - a diff between sneaky.py and the output of the 2to3 tool.
    http://pastebin.com/f6d0430fa

sneaky_sneaky3.diff
    - a diff between sneaky.py and sneaky3.py
    http://pastebin.com/f23cadbb0




cu,

Re: String Types in WSGI [Graham's WSGI for py3]

Robert Brewer-4
In reply to this post by Graham Dumpleton-2
René Dudfield wrote:
> No, slash encoding and normalising are not the only issues.
> As mentioned before sometimes you need the exact bytes.
>
> 1. buggy clients.  If a client sends something that doesn't work
> correctly, you can still sometimes make sense of it in the raw
> version of the url.
> 2. client APIs that require the server to know the exact url.
> 3. buggy servers that don't do their job properly.
> 4. extensibility.  A url scheme changes a tiny bit, and you want to
> support the change.  Having the raw url allows you to support it
> on old servers.
>
> In all APIs it's handy to go to lower levels when the higher levels
> don't work right.  Especially when wsgi only handles one side of
> things, and urls can be generated by anything.

and Graham Dumpleton replied:
> This is where it all comes down to me not having the real-world
> experience in writing web applications to know best.
>
> What I would like to hear is PJE (who tends towards #3) and Robert
> Brewer (who tends towards #4). Can you guys give counter explanations
> as to why their arguments for bytes aren't valid. Ian, I don't think
> you have yet expressed your leaning, but would like to hear your point
> as well.

No; in fact, I agree that REQUEST_URI should be mandated as bytes. IIRC, I'm the one who proposed it ;)
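
As a small illustration of why the exact bytes can matter, here is a
sketch, assuming the server supplies the raw request URI as a byte
string.  The helper name path_segments and the sample URIs are
hypothetical:

```python
from urllib.parse import unquote_to_bytes


def path_segments(request_uri: bytes) -> list:
    """Split the raw path on '/' first, then unquote each segment.

    Doing it in this order keeps an encoded slash (%2F) inside its
    segment instead of turning it into a path separator.
    """
    path = request_uri.split(b"?", 1)[0]
    return [unquote_to_bytes(seg) for seg in path.split(b"/") if seg]


raw_a = b"/files/a%2Fb?x=1"
raw_b = b"/files/a/b?x=1"

# Unquoting the whole path first makes the two requests indistinguishable.
print(unquote_to_bytes(raw_a) == unquote_to_bytes(raw_b))

# Working from the raw bytes keeps them apart.
print(path_segments(raw_a))
print(path_segments(raw_b))
```

Once only the unquoted path is available, that distinction cannot be
recovered by the application.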


Robert Brewer
[hidden email]