|
Manlio Perillo wrote:
> Words of *TEXT MAY contain characters from character sets other than
> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047

Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to
RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself
specifically denies that an encoded-word can go in a quoted-string.

RFC2047 encoded-words are not on-topic in an HTTP header(*); this has
been confirmed by newer development work on HTTPbis by Reschke et al.
(http://tools.ietf.org/wg/httpbis/).

The "correct" way of escaping header parameters in an RFC*822-family
protocol would be RFC2231's complex encoding scheme, but HTTP is
explicitly not an 822-family protocol despite sharing many of the same
constructs. See
http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a
strategy for how 2231 should interact with HTTP, but note that for now
RFC2231-in-HTTP simply does not exist in any deployed tools.

So for now there is basically nothing useful WSGI can do other than
provide direct, byte-oriented (even if wrapped in 8859-1 unicode
strings) access to headers.

--
And Clover
mailto:[hidden email]
http://www.doxdesk.com/

_______________________________________________
Web-SIG mailing list
[hidden email]
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/lists%40nabble.com |
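The idea of byte-oriented access "wrapped in 8859-1 unicode strings" can be illustrated with a minimal Python 3 sketch. The header value below is a made-up example, and the choice of UTF-8 as the "real" encoding is an assumption for illustration only; nothing in the thread says which encoding a given client uses.

```python
# Hypothetical raw header value as it might arrive on the wire
# (UTF-8 encoded here, purely as an assumption for the example).
raw = 'filename="caf\u00e9.txt"'.encode("utf-8")

# What a server following the latin-1 convention would hand to the
# application: every byte becomes the code point with the same number.
wrapped = raw.decode("latin-1")

# The application can always recover the exact original bytes...
recovered = wrapped.encode("latin-1")

# ...and then apply whatever decoding it believes is appropriate.
decoded = recovered.decode("utf-8")
```

The wrapped string is not meaningful text in this case (it contains mojibake), which is the sense in which the access is still byte-oriented.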
|
In reply to this post by Henry Precheur
Henry Precheur wrote:
> On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote:
>> There is something that I don't understand.
>>
>> Some HTTP headers, like Accept-Language, contain data described as
>> `token`, where:
>>
>>     token = 1*<any CHAR except CTLs or separators>
>>
>> So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
>> In Python 3.x it SHOULD be a byte string.
>
> I think this is more an issue that frameworks should deal with. By
> decoding every header value to latin-1:
>
> * It keeps WSGI simple. Simple is good.
>

It is just as simple as using byte strings, IMHO.
It is not simple; it is convenient because of (if I understand
correctly) how code is converted by 2to3.

> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
>   says. WSGI is about HTTP, but that doesn't necessarily include all
>   other standards extending HTTP.
>

HTTP never says to consider whole headers as latin-1 text, IMHO.

> * It's possible to convert latin-1 strings to bytes without losing data.
>

Yes, but it is quite stupid to first convert to Unicode and then convert
again to a byte string.

It is true, however, that this does not happen often, but only for:

- WSGI applications that implement an HTTP proxy
- WSGI applications that need to support HTTP Digest Authentication
- WSGI applications that store encoded data in cookies


Regards  Manlio |
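The `token` rule quoted above can be checked directly on bytes, without decoding. The following is a sketch based on the RFC 2616 section 2.2 definitions (CHAR is US-ASCII 0-127, CTLs are 0-31 and 127, and the separators are the listed punctuation plus SP and HT); the function name `is_token` is invented for this example.

```python
# Separators from RFC 2616 section 2.2. SP and HT are also separators,
# but they fall outside the 33..126 range used below.
_SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={}')

# token = 1*<any CHAR except CTLs or separators>
# So a token byte is any of 33..126 that is not a separator.
_TOKEN_BYTES = frozenset(range(33, 127)) - _SEPARATORS


def is_token(value: bytes) -> bool:
    """Return True if value is a non-empty RFC 2616 token."""
    return len(value) > 0 and all(b in _TOKEN_BYTES for b in value)
```

For example, `is_token(b"en-US")` holds, while `b"en;q=0.8"` is rejected because `;` and `=` are separators.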
|
In reply to this post by and-py
And Clover wrote:
> Manlio Perillo wrote:
>
>> Words of *TEXT MAY contain characters from character sets other than
>> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047
>
> Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to
> RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself
> specifically denies that an encoded-word can go in a quoted-string.
>
> RFC2047 encoded-words are not on-topic in an HTTP header(*); this has
> been confirmed by newer development work on HTTPbis by Reschke et al.
> (http://tools.ietf.org/wg/httpbis/).
>

Thanks.

HTTPbis seems to fix all these problems:

   "Historically, HTTP has allowed field content with text in the
   ISO-8859-1 [ISO-8859-1] character encoding and supported other
   character sets only through use of [RFC2047] encoding.  In practice,
   most HTTP header field values use only a subset of the US-ASCII
   character encoding [USASCII].  Newly defined header fields SHOULD
   limit their field values to US-ASCII characters.  Recipients SHOULD
   treat other (obs-text) octets in field content as opaque data."

This is the new rule for `quoted-string`:

   quoted-string  = DQUOTE *( qdtext / quoted-pair ) DQUOTE
   qdtext         = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
                  ; OWS / <VCHAR except DQUOTE and "\"> / obs-text
   obs-text       = %x80-FF
   quoted-pair    = "\" ( WSP / VCHAR / obs-text )

> The "correct" way of escaping header parameters in an RFC*822-family
> protocol would be RFC2231's complex encoding scheme, but HTTP is
> explicitly not an 822-family protocol despite sharing many of the same
> constructs. See
> http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a
> strategy for how 2231 should interact with HTTP, but note that for now
> RFC2231-in-HTTP simply does not exist in any deployed tools.
>

It seems reasonable.

> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.
>

Yes, this is what I think.
I have some doubts about wrapping the headers in 8859-1 unicode strings,
but luckily there is surrogateescape.


Regards  Manlio |
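For readers unfamiliar with the `surrogateescape` error handler mentioned above, here is a small sketch of how it lets opaque (obs-text) octets survive a round trip through a Python 3 `str`. The header line is a made-up example, and decoding as ASCII is only one possible choice.

```python
# Hypothetical header line containing one non-ASCII (obs-text) octet.
raw = b"X-Note: caf\xe9"

# Undecodable bytes are mapped to lone surrogates U+DC80..U+DCFF
# instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original octets exactly.
restored = text.encode("ascii", errors="surrogateescape")

# Unlike a latin-1 wrapped string, the escaped data cannot be encoded
# by accident: a strict encode refuses the lone surrogate.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

The trade-off compared with latin-1 wrapping is that opaque bytes stay visibly "not text" inside the string.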
|
In reply to this post by Manlio Perillo-3
On Fri, Dec 04, 2009 at 10:17:09AM +0100, Manlio Perillo wrote:
> It is just as simple as using byte strings, IMHO.

No, it's not. There were lots of discussions regarding this on the
mailing list. One of the main issues is that the standard library
supports bytes poorly. urllib for example expects strings not bytes.

> > * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
> >   says. WSGI is about HTTP, but that doesn't necessarily include all
> >   other standards extending HTTP.
> >
>
> HTTP never says to consider whole headers as latin-1 text, IMHO.

It does:

    When no explicit charset parameter is provided by the sender, media
    subtypes of the "text" type are defined to have a default charset value
    of "ISO-8859-1" when received via HTTP.

    http://tools.ietf.org/html/rfc2616#section-3.7.1

> Yes, but it is quite stupid to first convert to Unicode and then convert
> again to byte string.

99% of the time latin-1 will work. And converting from Unicode to bytes
is not costly.

6 months ago I was a big fan of bytes, but bytes create more problems
than they solve.

--
Henry Prêcheur |
|
Henry Precheur wrote:
> On Fri, Dec 04, 2009 at 10:17:09AM +0100, Manlio Perillo wrote:
>> It is just as simple as using byte strings, IMHO.
>
> No, it's not. There were lots of discussions regarding this on the
> mailing list. One of the main issues is that the standard library
> supports bytes poorly. urllib for example expects strings not bytes.
>

I read last month's discussions 3 days ago!

The quote function supports byte strings, as an example.
Which functions do not work with byte strings?

>>> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
>>>   says. WSGI is about HTTP, but that doesn't necessarily include all
>>>   other standards extending HTTP.
>>>
>> HTTP never says to consider whole headers as latin-1 text, IMHO.
>
> It does:
>
>     When no explicit charset parameter is provided by the sender, media
>     subtypes of the "text" type are defined to have a default charset value
>     of "ISO-8859-1" when received via HTTP.
>
>     http://tools.ietf.org/html/rfc2616#section-3.7.1
>

This is not correct.
First of all, HTTP never says that whole headers are of type TEXT.
Only specific components are of type TEXT.

Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
and non-ASCII characters are instead to be considered opaque.

Do you really want to define the new WSGI specification to be "against"
the new (possible) HTTP spec?

Of course it will work; but since some code in the standard library
needs to be fixed (wsgiref.util.application_uri, as an example),
maybe it is better to fix it to work with byte strings.

Just my two cents.

> [...]


Regards  Manlio |
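The remark that `quote` supports byte strings can be illustrated with a short sketch. It reflects the behavior of `urllib.parse` in current Python 3 releases; the path value is a made-up example.

```python
from urllib.parse import quote, quote_from_bytes, unquote_to_bytes

# Hypothetical path containing a raw latin-1 byte and a space.
raw_path = b"/caf\xe9 menu"

# quote() accepts bytes directly and percent-encodes each unsafe octet;
# "/" is left alone because it is in the default safe set.
quoted = quote(raw_path)

# quote_from_bytes() is the explicit bytes-only spelling of the same thing.
quoted_explicit = quote_from_bytes(raw_path)

# unquote_to_bytes() goes back to the original octets without
# assuming any character encoding.
round_tripped = unquote_to_bytes(quoted)
```

Note that the result of quoting is a `str` in both cases, which is safe because a percent-encoded value is pure ASCII.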
|
On Fri, Dec 04, 2009 at 07:40:55PM +0100, Manlio Perillo wrote:
> Which functions do not work with byte strings?

Just to make things clear, I was talking about Python 3.

All the functions I tried not ending with _from_bytes raise an exception
with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
which are rather critical ...

> First of all, HTTP never says that whole headers are of type TEXT.
> Only specific components are of type TEXT.

If parts of a header contain latin-1 characters, that means its
encoding is latin-1 (at least partially).

> Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
> and non-ASCII characters are instead to be considered opaque.

Yes, but the HTTPbis draft also says:

    Historically, HTTP has allowed field content with text in the
    ISO-8859-1 character encoding.

And WSGI is not about HTTP in a distant future, it's about HTTP right
now.

> Do you really want to define the new WSGI specification to be "against"
> the new (possible) HTTP spec?

I don't know why it would be "against" it. WSGI aims to handle HTTP in
the real world. Just because the HTTPbis spec is released won't take all
the garbage out of the web. There will still be latin-1 strings in
headers passed around for the next 10 years.

--
Henry Prêcheur |
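One way to use the str-oriented parsing functions on data that arrived as bytes, under the latin-1 convention discussed in this thread, is sketched below. The query string is a made-up example, and the assumption that its percent-escapes encode UTF-8 is only that, an assumption.

```python
from urllib.parse import parse_qs

# Hypothetical QUERY_STRING as raw bytes from the request line.
raw_query = b"name=caf%C3%A9&lang=fr"

# Wrap the bytes in a str without interpreting them (latin-1 maps each
# byte to the code point with the same value).
query = raw_query.decode("latin-1")

# parse_qs() works on str and decodes percent-escapes as UTF-8 by default.
params = parse_qs(query)
```

This keeps the application on `str` throughout, at the cost of one extra decode step for each value taken from the server.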
|
Henry Precheur wrote:
> On Fri, Dec 04, 2009 at 07:40:55PM +0100, Manlio Perillo wrote:
>> Which functions do not work with byte strings?
>
> Just to make things clear, I was talking about Python 3.
>

I know.
Unfortunately I don't have Python 3 installed, I'm just reading the
code.

> All the functions I tried not ending with _from_bytes raise an exception
> with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
> which are rather critical ...
>

Ah, ok.
Can you show me the traceback of parse_qs?

Thanks.

>> First of all, HTTP never says that whole headers are of type TEXT.
>> Only specific components are of type TEXT.
>
> If parts of a header contain latin-1 characters, that means its
> encoding is latin-1 (at least partially).
>

This is not completely true.

> [...]
> And WSGI is not about HTTP in a distant future, it's about HTTP right
> now.
>
>> Do you really want to define the new WSGI specification to be "against"
>> the new (possible) HTTP spec?
>
> I don't know why it would be "against" it.

Well, I have quoted it for this reason.

What I mean is that, IMHO:
- Using Unicode strings in WSGI is an abuse of Unicode strings
- This abuse is not justified by the HTTP spec

> [...]


Regards  Manlio |
|
In reply to this post by and-py
On 12/4/09 12:50 AM, And Clover wrote:
> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.

You could argue that this is perhaps a good reason to replace
``environ`` with something that interprets the headers according to how
HTTP is actually used in the real world.

It may be that WSGI should use bytes everywhere and the recommended
usage would be via a decorator (which could cache computations on the
environ dictionary): e.g. the raw application handler versus one
decorated with an imaginary ``webob`` function.

    def app(environ, start_response):
        ...

    @webob
    def app(request):
        ...

It is often said that WSGI should be practical, but in actual usage, I
think most developers use a request/response abstraction layer.
Middlewares are usually shrink-wrapped library code that could handle a
bytes-based environ dict (they'd have to explicitly decode the headers
of interest).

\malthe |
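The decorator idea above could look roughly like the following sketch. Everything here is hypothetical: `Request`, `request_adapter`, the `(status, headers, body)` return convention, and the assumption that the environ has str keys with bytes values are all invented for illustration and are not the real WebOb API or any WSGI proposal.

```python
class Request:
    """Hypothetical wrapper over an environ whose header values are bytes."""

    def __init__(self, environ):
        self.environ = environ
        self._cache = {}  # decoded values, so each header is decoded once

    def header(self, name, encoding="latin-1"):
        """Decode one request header explicitly, caching the result."""
        key = (name, encoding)
        if key not in self._cache:
            raw = self.environ.get("HTTP_" + name.upper().replace("-", "_"))
            self._cache[key] = None if raw is None else raw.decode(encoding)
        return self._cache[key]


def request_adapter(handler):
    """Turn a handler taking a Request into a plain WSGI callable."""
    def wsgi_app(environ, start_response):
        status, headers, body = handler(Request(environ))
        start_response(status, headers)
        return [body]
    return wsgi_app


@request_adapter
def app(request):
    # The application decides which headers to decode, and how.
    lang = request.header("Accept-Language") or "unknown"
    return "200 OK", [("Content-Type", "text/plain")], lang.encode("latin-1")
```

With this shape, the server and middleware only ever see bytes, and decoding happens once, at the point where the application asks for a specific header.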
