Python 3: Form data encoding issues in cgi and urllib modules

Miles Kaufmann
Hi everyone,

I read through the recent archives, and I've seen some discussion on
similar topics, but not this exact topic recently, so if the solution
to these issues has already been decided, please point me to the
relevant messages.  (Also, if this isn't the most appropriate list,
please let me know!)

The first issue is that there doesn't seem to be a way to parse
x-www-form-urlencoded query strings in a character set other than
UTF-8, for example:

'premier=un&deuxi%E8me=deux' # latin-1

The urllib.parse.unquote* functions take encoding and errors
parameters, but none of the higher-level functions do.  The solution to
me seems to be that the functions that build on top of
them--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
constructor--should grow encoding and errors parameters that they pass
through to the lower-level functions.
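
To make the first issue concrete, here is a minimal sketch of what an encoding-aware parser could look like. It relies only on urllib.parse.unquote_plus, which does accept encoding and errors; the helper name parse_qsl_with_encoding is hypothetical and is not part of urllib or cgi.

```python
from urllib.parse import unquote_plus


def parse_qsl_with_encoding(qs, encoding="utf-8", errors="replace"):
    """Hypothetical helper: split a query string into (name, value) pairs
    and percent-decode each piece using the given character encoding."""
    pairs = []
    for field in qs.split("&"):
        if not field:
            continue  # skip empty segments such as a trailing "&"
        name, _, value = field.partition("=")
        pairs.append((
            unquote_plus(name, encoding=encoding, errors=errors),
            unquote_plus(value, encoding=encoding, errors=errors),
        ))
    return pairs


query = "premier=un&deuxi%E8me=deux"  # latin-1

# Decoding as latin-1 recovers the intended field name.
latin1_pairs = parse_qsl_with_encoding(query, encoding="latin-1")

# Decoding as UTF-8 cannot interpret the lone %E8 byte and substitutes
# the replacement character instead.
utf8_pairs = parse_qsl_with_encoding(query)
```

With latin-1 the second field name comes back as 'deuxième'; with the UTF-8 default the byte is lost to U+FFFD, which is the behavior described above.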

The second issue is that the FieldStorage classes work with text input
streams.  However, with multipart/form-data posts, posted files aren't
necessarily in the same encoding as form fields, or may be binary and
not text at all.  I would suggest that FieldStorage should be changed
to take a binary input stream. For multipart forms, it should only
attempt to decode a part with the passed-in FieldStorage encoding if
the part's content type is text/plain and the content-disposition does
not specify a filename; otherwise, field.file would be a binary file,
and field.value should be bytes or non-existent.
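
The decoding rule proposed above can be sketched as a small function. This is only an illustration under stated assumptions: decode_part and its parameters are hypothetical names, and a part without a Content-Type header is assumed to default to text/plain.

```python
def decode_part(payload, content_type=None, filename=None,
                encoding="utf-8", errors="replace"):
    """Hypothetical helper: return str for plain form fields, bytes otherwise.

    payload      -- the raw bytes of one multipart/form-data part
    content_type -- value of the part's Content-Type header, or None
    filename     -- filename from Content-Disposition, or None
    """
    # Assumption: a part with no Content-Type header is text/plain.
    main_type = (content_type or "text/plain").split(";")[0].strip().lower()
    if main_type == "text/plain" and filename is None:
        return payload.decode(encoding, errors)
    # Uploaded files and non-text parts are left as bytes.
    return payload


field_value = decode_part(b"\xc2\xa1ol\xc3\xa9!")
text_upload = decode_part(b"Oh l\xe0 l\xe0!", "text/plain", "latin1.txt")
binary_upload = decode_part(b"\x80\x81", "application/octet-stream", "binary")
```

Only the first call yields a str; both uploads stay as bytes, so the application can choose how (or whether) to decode them.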

Here is an example form submission that is currently difficult to
handle with the cgi module, posted from a page with a charset of UTF-8
and two attached files; this is similar to how a real form submission
from Safari or Firefox would look:

post_input = b"""---123
Content-Disposition: form-data; name="utf8text"

\xc2\xa1ol\xc3\xa9!
---123
Content-Disposition: form-data; name="file1"; filename="latin1.txt"
Content-Type: text/plain

Oh l\xe0 l\xe0!
---123
Content-Disposition: form-data; name="file2"; filename="binary"
Content-Type: application/octet-stream

\x80\x81\x82\x83\x84\x85\x86\x87\xad\xf0
---123--
"""

# The delimiter lines above are "--" followed by the boundary, hence "---123".
environ = {
    'CONTENT_LENGTH': str(len(post_input)),
    'CONTENT_TYPE': 'multipart/form-data; boundary=-123',
    'REQUEST_METHOD': 'POST',
}
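
For comparison, here is a rough sketch of how this submission could be handled when the input is treated as bytes throughout. It is not how the cgi module works; it assumes an interpreter where email.message_from_bytes is available, and parse_form is a hypothetical name used only for illustration.

```python
import email

post_input = b"""---123
Content-Disposition: form-data; name="utf8text"

\xc2\xa1ol\xc3\xa9!
---123
Content-Disposition: form-data; name="file1"; filename="latin1.txt"
Content-Type: text/plain

Oh l\xe0 l\xe0!
---123
Content-Disposition: form-data; name="file2"; filename="binary"
Content-Type: application/octet-stream

\x80\x81\x82\x83\x84\x85\x86\x87\xad\xf0
---123--
"""

environ = {
    'CONTENT_LENGTH': str(len(post_input)),
    'CONTENT_TYPE': 'multipart/form-data; boundary=-123',
    'REQUEST_METHOD': 'POST',
}


def parse_form(body, content_type, encoding="utf-8"):
    """Hypothetical helper: parse a multipart/form-data body given as bytes."""
    # Prepend the Content-Type header so the email parser knows the boundary.
    header = ("Content-Type: %s\n\n" % content_type).encode("ascii")
    message = email.message_from_bytes(header + body)
    fields = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        raw = part.get_payload(decode=True)  # raw bytes of the part
        # Decode only plain-text parts that are not file uploads.
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            fields[name] = raw.decode(encoding)
        else:
            fields[name] = raw
    return fields


fields = parse_form(post_input, environ['CONTENT_TYPE'])
```

Here fields['utf8text'] is a str decoded as UTF-8, while fields['file1'] and fields['file2'] remain bytes.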

It's possible that the email.mime and http packages might also need
some changes made, but I haven't looked into those as much.  Also,
cgi.parse_multipart seems to be broken currently, since it uses
http.client.parse_headers which expects a bytes stream.

If there's agreement on these points, I think it would be important to
get these changes (or perhaps alternate fixes) into Python 3.1; I know
that some of the changes are backwards incompatible with 3.0, but I
think that the encoding issues in the current cgi module make it very
difficult to work with.  I'm willing to take responsibility for
submitting bug reports and patches, but could probably use a more
experienced mentor to let me know if I'm doing it wrong.

If you don't think that these changes are reasonable, I'm interested
to hear your alternate suggestions.  I strongly believe that the
current behavior is broken and needs to be changed for 3.1.

Thanks for your consideration,
Miles Kaufmann
_______________________________________________
Web-SIG mailing list
[hidden email]
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/lists%40nabble.com

Re: Python 3: Form data encoding issues in cgi and urllib modules

Miles Kaufmann
On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
> ...
> It's possible that the email.mime and http packages might also need
> some changes made, but I haven't looked into those as much.
> ...

Apparently there's been some discussion on the python-dev and
email-sig lists in the past couple of days since I last checked, about
the email package and strings and bytes.  So it might be the case that
the cgi module will build on top of those decisions.  But I want to
make sure that the cgi module isn't left behind, and I think that
building FieldStorage from string streams instead of byte
streams is a mistake that should be rectified ASAP.

On Fri, Apr 10, 2009 at 12:35 PM, Bill Janssen wrote [1]:
> Barry Warsaw <[hidden email]> wrote:
>> In that case, we really need the
>> bytes-in-bytes-out-bytes-in-the-chewy-
>> center API first, and build things on top of that.
>
> Yep.

-Miles Kaufmann

[1] http://mail.python.org/pipermail/email-sig/2009-April/000438.html

Re: Python 3: Form data encoding issues in cgi and urllib modules

Miles Kaufmann
On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:

> The first issue is that there doesn't seem to be a way to parse
> x-www-form-urlencoded query strings in a character set other than
> UTF-8, for example:
>
> 'premier=un&deuxi%E8me=deux' # latin-1
>
> The urllib.parse.unquote* functions take encoding and errors
> parameters, but none of the higher-level functions do.  The solution to
> me seems to be that the functions that build on top of
> them--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
> constructor--should grow encoding and errors parameters that they pass
> through to the lower-level functions.
>
> The second issue is that the FieldStorage classes work with text input
> streams.  However, with multipart/form-data posts, posted files aren't
> necessarily in the same encoding as form fields, or may be binary and
> not text at all.  I would suggest that FieldStorage should be changed
> to take a binary input stream.
>
> [...]

I'm not quite sure how to interpret the lack of response I've gotten
on this topic.  Is it just that there's little interest in the cgi
module?  Should I raise this issue on the python-dev list, or just
open a bug report and start submitting patches?

There's been a lot of discussion recently about bytes vs. str in email
headers and WSGI environ variables, but I haven't been able to find a
substantive discussion on this specific topic.  Here are some of the
related quotes I've come across.

Martin v. Löwis wrote [1]:
> In a CGI application, you shouldn't be using sys.stdin or print().
> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
> if you use TextIO, there likely will be bugs (e.g. if you have
> attachments of type application/octet-stream).
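
The binary I/O approach described in that quote might look roughly like the following. This is a sketch, not a recommended implementation: read_post_body and write_response are hypothetical names, and in-memory streams stand in for sys.stdin.buffer and sys.stdout.buffer so the example runs on its own.

```python
import io


def read_post_body(environ, stream):
    """Read exactly CONTENT_LENGTH bytes of request body from a binary stream."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return stream.read(length)


def write_response(stream, body, content_type="application/octet-stream"):
    """Write a minimal CGI response (headers plus body) as bytes."""
    header = "Content-Type: %s\r\n\r\n" % content_type
    stream.write(header.encode("ascii"))
    stream.write(body)


# In a real CGI script these would be sys.stdin.buffer and sys.stdout.buffer.
fake_stdin = io.BytesIO(b"name=\xe9&trailing data that is not part of the body")
fake_stdout = io.BytesIO()

body = read_post_body({"CONTENT_LENGTH": "6"}, fake_stdin)
write_response(fake_stdout, body)
```

Because nothing is decoded on the way in or out, a byte such as \xe9 passes through unchanged regardless of the locale's default encoding.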

bobince wrote [2]:

> Evan Fosmark wrote:
>> bobince wrote:
>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>> that hasn't been fixed properly for the new string model. It should
>>> be converting the incoming byte stream to characters before
>>> passing them to urllib.
>>>
>>> Did I mention Python 3.0's libraries (especially web-related
>>> ones) still being rather shonky? :-)
>>
>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>> wsgiref. I hope they get fixed soon. :(
>
> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
> seems to want ownership of the issue. Very disappointing.

There's also this bug report[3], but it doesn't directly propose the
changes that I have.

So: does anyone agree, or disagree, that cgi.FieldStorage should be
changed to take byte streams, and many of the cgi and urllib.parse
functions should become encoding-aware, preferably in time for Python
3.1?  The byte-stream change will break compatibility with Python
3.0, but I strongly feel that treating POST data as text is wrong and
should not continue to be supported.

-Miles Kaufmann

[1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
[2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
[3]: http://bugs.python.org/issue4953

Re: Python 3: Form data encoding issues in cgi and urllib modules

Graham Dumpleton
2009/4/16 Miles Kaufmann <[hidden email]>:

> On Sat, Apr 11, 2009 at 8:48 PM, Miles Kaufmann wrote:
>> The first issue is that there doesn't seem to be a way to parse
>> x-www-form-urlencoded query strings in a character set other than
>> UTF-8, for example:
>>
>> 'premier=un&deuxi%E8me=deux' # latin-1
>>
>> The urllib.parse.unquote* functions take encoding and errors
>> parameters, but none of the higher-level functions do.  The solution to
>> me seems to be that the functions that build on top of
>> them--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
>> constructor--should grow encoding and errors parameters that they pass
>> through to the lower-level functions.
>>
>> The second issue is that the FieldStorage classes work with text input
>> streams.  However, with multipart/form-data posts, posted files aren't
>> necessarily in the same encoding as form fields, or may be binary and
>> not text at all.  I would suggest that FieldStorage should be changed
>> to take a binary input stream.
>>
>> [...]
>
> I'm not quite sure how to interpret the lack of response I've gotten
> on this topic.  Is it just that there's little interest in the cgi
> module?  Should I raise this issue on the python-dev list, or just
> open a bug report and start submitting patches?
>
> There's been a lot of discussion recently about bytes vs. str in email
> headers and WSGI environ variables, but I haven't been able to find a
> substantive discussion on this specific topic.  Here are some of the
> related quotes I've come across.
>
> Martin v. Löwis wrote [1]:
>> In a CGI application, you shouldn't be using sys.stdin or print().
>> Instead, you should be using sys.stdin.buffer (or sys.stdin.buffer.raw),
>> and sys.stdout.buffer.raw. A CGI script essentially does binary IO;
>> if you use TextIO, there likely will be bugs (e.g. if you have
>> attachments of type application/octet-stream).
>
> bobince wrote [2]:
>> Evan Fosmark wrote:
>>> bobince wrote:
>>>> So yeah, it's a bug in cgi.py, yet another victim of 2to3 conversion
>>>> that hasn't been fixed properly for the new string model. It should
>>>> be converting the incoming byte stream to characters before
>>>> passing them to urllib.
>>>>
>>>> Did I mention Python 3.0's libraries (especially web-related
>>>> ones) still being rather shonky? :-)
>>>
>>> Yeah. So far I've noticed huge problems with cgi, urllib, and
>>> wsgiref. I hope they get fixed soon. :(
>>
>> Indeed. Momentum in WEB-SIG seems to have ground to a halt; no-one
>> seems to want ownership of the issue. Very disappointing.
>
> There's also this bug report[3], but it doesn't directly propose the
> changes that I have.
>
> So: does anyone agree, or disagree, that cgi.FieldStorage should be
> changed to take byte streams, and many of the cgi and urllib.parse
> functions should become encoding-aware, preferably in time for Python
> 3.1?  The byte-stream change will break compatibility with Python
> 3.0, but I strongly feel that treating POST data as text is wrong and
> should not continue to be supported.
>
> -Miles Kaufmann
>
> [1]: http://mail.python.org/pipermail/python-dev/2009-April/088727.html
> [2]: http://stackoverflow.com/questions/540342/python-3-0-urllib
> [3]: http://bugs.python.org/issue4953

Have you read:

  http://bugs.python.org/issue3300

This was referenced in a prior post here and is likely relevant. A lot
of the discussion about it took place on the developers' list for
Python 3.0.

I'm not sure why someone took issue with the WEB-SIG list over cgi
FieldStorage, as I don't recollect us having any substantive
discussion about it or any problems it has.

Graham

Re: Python 3: Form data encoding issues in cgi and urllib modules

Miles Kaufmann
On Wed, Apr 15, 2009 at 5:23 PM, Graham Dumpleton wrote:

> 2009/4/16 Miles Kaufmann <[hidden email]>:
>> So: does anyone agree, or disagree, that cgi.FieldStorage should be
>> changed to take byte streams, and many of the cgi and urllib.parse
>> functions should become encoding-aware, preferably in time for Python
>> 3.1?  The byte-stream change will break compatibility with Python
>> 3.0, but I strongly feel that treating POST data as text is wrong and
>> should not continue to be supported.
>
> Have you read:
>
>  http://bugs.python.org/issue3300
>
> This was referenced in a prior post here and is likely relevant. A lot
> of the discussion about it took place on the developers' list for
> Python 3.0.

I hadn't. Thanks for the link! That was a long read, so apologies if I
missed anything, but that discussion seems to pertain almost entirely
to the urllib.parse.[un]quote* functions; there was only one point
where it was mentioned that there would be issues with non-UTF-8 data
for higher-level functions[1], and nothing followed from that.

I don't think it should be a controversial move to add encoding and
errors parameters to the following functions:

* urllib.parse.parse_qs
* urllib.parse.parse_qsl
* urllib.parse.urlencode

which, I feel, would be in line with the outcome of the discussion you
referenced, shouldn't break any existing code, and would make it
possible to parse the "quite prevalent"[2] instances of non-UTF-8
query strings like the following:

'premier=un&deuxi%E8me=deux' # latin-1

The parameters would also need to be added to cgi.parse,
cgi.parse_multipart, and cgi.FieldStorage, if they were in fact
changed to expect a bytes file input, as I suggest.
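
For the encoding direction, a sketch of what an encoding-aware urlencode could do is shown below. It builds on urllib.parse.quote_plus, which accepts encoding and errors; the name urlencode_with_encoding is hypothetical and only illustrates the proposed parameters.

```python
from urllib.parse import quote_plus


def urlencode_with_encoding(pairs, encoding="utf-8", errors="strict"):
    """Hypothetical helper: build a query string from (name, value) pairs,
    percent-encoding each piece in the given character encoding."""
    return "&".join(
        quote_plus(name, encoding=encoding, errors=errors)
        + "="
        + quote_plus(value, encoding=encoding, errors=errors)
        for name, value in pairs
    )


pairs = [("premier", "un"), ("deuxi\xe8me", "deux")]

latin1_query = urlencode_with_encoding(pairs, encoding="latin-1")
utf8_query = urlencode_with_encoding(pairs)
```

Encoding as latin-1 reproduces the query string from the example above, while the UTF-8 default produces the two-byte form of the same character.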

> I'm not sure why someone took issue with the WEB-SIG list over cgi
> FieldStorage, as I don't recollect us having any substantive
> discussion about it or any problems it has.

Exactly; that person's issue was that there hasn't been substantive
discussion.  Which is what I'm trying to create now. :)

-Miles Kaufmann

[1]: http://bugs.python.org/msg70970
[2]: http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html