Security implications of pep 383

Michael Foord-3
Hey all,

Not sure how real the security risk is here:

     http://blog.omega-prime.co.uk/?p=107

Basically, he is saying that if you store a list of blacklisted files
with names encoded in Big5 (or some other encoding that is not
UTF-8-compatible), and those names are then passed on the command line, or
otherwise read in and decoded from an assumed-UTF-8 source with surrogate
escaping, the surrogate-escaped names will not match the properly decoded
blacklisted names.
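
To make the mismatch concrete, here is a minimal sketch. It assumes the file name is the two-character string that appears later in this thread, and it simulates the argv decoding with an explicit UTF-8 plus surrogateescape decode rather than a real command line; the variable names are illustrative only.

```python
# Big5 bytes for the two-character name used as the example in this thread.
raw = "\u4f60\u597d".encode("big5")

# How the blacklist file is read: decoded with the correct codec.
from_blacklist = raw.decode("big5")

# How the same bytes arrive via argv under an assumed UTF-8 locale:
# undecodable bytes are turned into lone surrogates (PEP 383).
from_argv = raw.decode("utf-8", errors="surrogateescape")

# The two strings differ, so a blacklist lookup silently misses.
print(from_blacklist == from_argv)  # False

# The escaped form still round-trips to the original bytes.
print(from_argv.encode("utf-8", errors="surrogateescape") == raw)  # True
```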

All the best,

Michael Foord

--
http://www.voidspace.org.uk/

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%40nabble.com

Re: Security implications of pep 383

Antoine Pitrou
On Tue, 29 Mar 2011 19:23:25 +0100
Michael Foord <[hidden email]> wrote:

> Hey all,
>
> Not sure how real the security risk is here:
>
>      http://blog.omega-prime.co.uk/?p=107
>
> Basically  he is saying that if you store a list of blacklisted files
> with names encoded in big-5 (or some other non-utf8 compatible encoding)
> if those names are passed at the command line, or otherwise read in and
> decoded from an assumed-utf8 source with surrogate escaping, the
> surrogate escape decoded names will not match the properly decoded
> blacklisted names.

This has nothing to do specifically with PEP 383. The same issues can
arise without PEP 383 if you replace utf-8 with, say, latin-1 in the
above example.

Basically, what this says is if you are decoding the same bytestring
using two different encodings, you get two different unicode strings
(which therefore compare unequal).

Another observation is that, in the script which is presented, if the
user were to extract a filename from the blacklist and call open() on
it, they wouldn't actually open one of the blacklisted files, since the
encoded representation using the filesystem encoding (e.g. utf-8 or
latin-1) would be different from the Big-5 representation.

A solution would be to open the blacklist file in binary mode and call
os.fsdecode() on the result.
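
A minimal sketch of that suggestion, assuming a POSIX system (where os.fsdecode() uses the surrogateescape handler) and a blacklist file with one raw byte name per line. The temporary file and the simulated argument stand in for the real blacklist and sys.argv.

```python
import os
import tempfile

raw = "\u4f60\u597d".encode("big5")  # stand-in for a Big5-encoded file name

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical blacklist file holding the raw bytes of each name.
    path = os.path.join(tmp, "blacklist.big5")
    with open(path, "wb") as f:
        f.write(raw + b"\n")

    # Read bytes, then decode each name the same way Python decodes argv.
    with open(path, "rb") as f:
        blacklist = {os.fsdecode(name) for name in f.read().split()}

# Simulate the argument as Python would have decoded it from the command line.
argument = os.fsdecode(raw)
print(argument in blacklist)  # True
```

Both sides now go through the same decoding step, so the comparison no longer depends on which encoding the names were really written in.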

Regards

Antoine.



Re: Security implications of pep 383

"Martin v. Löwis"
In reply to this post by Michael Foord-3
> Not sure how real the security risk is here:
>
>     http://blog.omega-prime.co.uk/?p=107
>
> Basically  he is saying that if you store a list of blacklisted files
> with names encoded in big-5 (or some other non-utf8 compatible encoding)
> if those names are passed at the command line, or otherwise read in and
> decoded from an assumed-utf8 source with surrogate escaping, the
> surrogate escape decoded names will not match the properly decoded
> blacklisted names.

As described, I find the problem a little bit artificial: supposedly,
he was passing the file name on the command line. However, since his
terminal is in UTF-8 and the file name in Big5, the console didn't
display the file name in a meaningful way when he ran the program. So
whoever ran the program ignored the mojibake, and didn't wonder whether
it could have any effect on proper functioning of the program. In
addition, if he did ls(1) on the directory, it would have displayed
question marks throughout. This should alert the user that something bad
is going on.

Notice that this isn't really PEP-383's fault. If the file system
encoding was UTF-8, and the blacklist was UTF-8, and the program
ran in a Latin-1 locale, it would have decoded the file name nicely
(without surrogates), but the blacklist check would still have failed.

He should have opened the file in the locale's encoding (i.e. giving no
encoding), using the surrogate escape handler.
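
As a rough sketch of that approach: open() is called without an encoding argument, so the locale's encoding applies, and errors="surrogateescape" keeps undecodable bytes instead of raising. The temporary file is a stand-in for the real blacklist, and the comparison value is built with locale.getpreferredencoding(False), which is assumed here to match what open() uses by default.

```python
import locale
import os
import tempfile

raw = "\u4f60\u597d".encode("big5")  # stand-in for a Big5-encoded file name

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blacklist.big5")
    with open(path, "wb") as f:
        f.write(raw + b"\n")

    # No encoding given: the locale's encoding is used. The error handler
    # maps bytes that cannot be decoded to lone surrogates.
    with open(path, errors="surrogateescape") as f:
        blacklist = f.read().split()

# The same bytes decoded the way a command-line argument would be.
expected = raw.decode(locale.getpreferredencoding(False),
                      errors="surrogateescape")
print(blacklist == [expected])
```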

Regards,
Martin

Re: Security implications of pep 383

Laura Creighton-2
In reply to this post by Michael Foord-3
In a message of Tue, 29 Mar 2011 19:23:25 BST, Michael Foord writes:

>Hey all,
>
>Not sure how real the security risk is here:
>
>     http://blog.omega-prime.co.uk/?p=107
>
>Basically  he is saying that if you store a list of blacklisted files
>with names encoded in big-5 (or some other non-utf8 compatible encoding)
>if those names are passed at the command line, or otherwise read in and
>decoded from an assumed-utf8 source with surrogate escaping, the
>surrogate escape decoded names will not match the properly decoded
>blacklisted names.

>All the best,
>
>Michael Foord
>

I am not sure there are any security related gotchas here.  All he is
saying is that if you decode the same bytestring using two different
encodings, you will get two different unicode strings (which therefore
will compare unequal).  Where's the problem, except in that you might
have unrealistic expectations?

Laura

Re: Security implications of pep 383

Toshio Kuratomi-2
In reply to this post by Michael Foord-3
On Tue, Mar 29, 2011 at 07:23:25PM +0100, Michael Foord wrote:

> Hey all,
>
> Not sure how real the security risk is here:
>
>     http://blog.omega-prime.co.uk/?p=107
>
> Basically  he is saying that if you store a list of blacklisted files
> with names encoded in big-5 (or some other non-utf8 compatible
> encoding) if those names are passed at the command line, or otherwise
> read in and decoded from an assumed-utf8 source with surrogate
> escaping, the surrogate escape decoded names will not match the
> properly decoded blacklisted names.
>
The example is correct.  The security risk is real.  However, there's a flaw
in the program, and whether there's also a flaw in python is not so certain.

Here's the line I'd say is contentious::
  blacklist = open("blacklist.big5", encoding='big5').read().split()

The blacklist file contains a list of filenames.  However, this code treats
it as a list of strings.  This is a logic error in the program, and he should
really be doing this::
  blacklist = open("blacklist.big5", 'rb').read().split()

Then, when comparing it against the values of sys.argv, either sys.argv gets
converted into bytes (using the system locale, since that's what was used to
decode it to unicode) or the items in blacklist get converted to unicode with
surrogateescape.
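
A rough sketch of the first option, comparing as bytes. It assumes a POSIX system, where os.fsencode() reverses the surrogateescape decoding applied to argv; `blacklisted_args`, `blacklist_bytes` and `fake_argv` are hypothetical names used only for illustration.

```python
import os

def blacklisted_args(argv, blacklist_bytes):
    """Return the arguments whose byte form appears in the blacklist."""
    # os.fsencode() recovers the bytes that were originally passed in,
    # including bytes that were decoded to lone surrogates.
    return [arg for arg in argv if os.fsencode(arg) in blacklist_bytes]

raw = "\u4f60\u597d".encode("big5")

# As it would be read with open("blacklist.big5", 'rb').read().split()
blacklist_bytes = {raw}

# Stand-in for sys.argv[1:], decoded the way Python decodes arguments.
fake_argv = [os.fsdecode(raw), "harmless.txt"]

hits = blacklisted_args(fake_argv, blacklist_bytes)
print(len(hits))  # 1
```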

The possible flaw in python is this:  Code like the blog poster wrote passes
python3 without an error or a warning.  This gives the programmer no
feedback that they're doing something wrong until it actually bites them in
the foot in deployed code.

-Toshio


Re: Security implications of pep 383

Glenn Linderman-3
On 3/29/2011 12:10 PM, Toshio Kuratomi wrote:
> The possible flaw in python is this:  Code like the blog poster wrote passes
> python3 without an error or a warning.  This gives the programmer no
> feedback that they're doing something wrong until it actually bites them in
> the foot in deployed code.

Yes, there is a certain level of knowledge required of the system configuration and python defaults for accessing the system for things like filenames.  It can be coded in any of several ways.

But by the above definition of "possible flaw", that seems equivalent to saying that Python should give a warning for things like

os.unlink("my-most-important-file.doc")


Re: Security implications of pep 383

Victor STINNER
In reply to this post by Michael Foord-3
Le mardi 29 mars 2011 à 19:23 +0100, Michael Foord a écrit :

> Hey all,
>
> Not sure how real the security risk is here:
>
>      http://blog.omega-prime.co.uk/?p=107
>
> Basically  he is saying that if you store a list of blacklisted files
> with names encoded in big-5 (or some other non-utf8 compatible encoding)
> if those names are passed at the command line, or otherwise read in and
> decoded from an assumed-utf8 source with surrogate escaping, the
> surrogate escape decoded names will not match the properly decoded
> blacklisted names.

Yes, if you decode two byte strings from two different encodings, you
get different unicode strings. It's not related to surrogateescape (PEP
383).

Sorry, '\u4f60\u597d'.encode('big5').decode('latin1') doesn't give you
'\u4f60\u597d' but '§A¦n', and it doesn't warn you that latin1 is not
big5 (there is no UnicodeDecodeError, even if the error handler is
strict).
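
The same point as a short runnable sketch (variable names are illustrative): decoding Big5 bytes as Latin-1 succeeds under the default strict handler and yields a wrong string that contains no surrogates at all.

```python
raw = "\u4f60\u597d".encode("big5")

# Latin-1 maps every byte to a code point, so this never raises,
# even with the default strict error handler.
wrong = raw.decode("latin-1")

print(wrong == "\u4f60\u597d")  # False
# No lone surrogates are involved in the mismatch.
print(any(0xDC80 <= ord(c) <= 0xDCFF for c in wrong))  # False
```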

I think that the example has two issues:

 - security using blacklists doesn't work (it is better to use
   a whitelist)
 - if filenames are stored as Big5, they must be decoded from Big5,
   and so the locale encoding must be Big5

I don't understand the last paragraph:

"P.P.S I will further note that you get the same issue even if the
blacklist and filename had been in UTF-8, but this time it gets broken
from a terminal in the Big5 locale. I didn’t show it this way around
because I understand that Python 3 may only have just recently started
using the locale to decode argv, rather than being hardcoded to UTF-8."

Python's filesystem encoding is only hardcoded to UTF-8 on Mac OS X; on
other operating systems, it is the locale encoding.

Victor


Re: Security implications of pep 383

Lennart Regebro-2
The lesson here seems to be "if you have to use blacklists, and you
use unicode strings for those blacklists, also make sure the string
you compare with doesn't have surrogates".

//Lennart

Re: Security implications of pep 383

Lennart Regebro-2
On Tue, Mar 29, 2011 at 22:40, Lennart Regebro <[hidden email]> wrote:
> The lesson here seems to be "if you have to use blacklists, and you
> use unicode strings for those blacklists, also make sure the string
> you compare with doesn't have surrogates".
>

For that matter, what happens with combining characters?

'\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' !=
'\N{LATIN SMALL LETTER O WITH DIAERESIS}'

I guess the filesystem shouldn't treat these as the same (even though
they are), but what if some webservice does? I suspect you should
normalize both strings before comparing them in any blacklist, and
what happens with surrogates when you normalize?
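
A small sketch of normalizing both sides before comparing, using unicodedata.normalize() from the standard library. The helper name `same_text` and the choice of NFC as the default form are assumptions made for illustration.

```python
import unicodedata

decomposed = "\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}"
precomposed = "\N{LATIN SMALL LETTER O WITH DIAERESIS}"

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(decomposed == precomposed)           # False: different code points
print(same_text(decomposed, precomposed))  # True: equal once normalized
```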

//Lennart

Re: Security implications of pep 383

Victor STINNER
In reply to this post by Lennart Regebro-2
Le mardi 29 mars 2011 à 22:40 +0200, Lennart Regebro a écrit :
> The lesson here seems to be "if you have to use blacklists, and you
> use unicode strings for those blacklists, also make sure the string
> you compare with doesn't have surrogates".

No. '\u4f60\u597d'.encode('big5').decode('latin1') gives '§A¦n' which
doesn't contain any surrogate character.

The lesson is: if you compare Unicode filenames on UNIX, make sure that
your system is correctly configured (the locale encoding must be the
filesystem encoding).

Victor


Re: Security implications of pep 383

Antoine Pitrou
In reply to this post by Lennart Regebro-2
On Tue, 29 Mar 2011 22:40:01 +0200
Lennart Regebro <[hidden email]> wrote:
> The lesson here seems to be "if you have to use blacklists, and you
> use unicode strings for those blacklists, also make sure the string
> you compare with doesn't have surrogates".

Not really. As everyone said, this can happen even without surrogates.

Regards

Antoine.



Re: Security implications of pep 383

Victor STINNER
In reply to this post by Lennart Regebro-2
Le mardi 29 mars 2011 à 22:45 +0200, Lennart Regebro a écrit :

> On Tue, Mar 29, 2011 at 22:40, Lennart Regebro <[hidden email]> wrote:
> > The lesson here seems to be "if you have to use blacklists, and you
> > use unicode strings for those blacklists, also make sure the string
> > you compare with doesn't have surrogates".
> >
>
> For that matter, what happens with combining characters?
>
> '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' !=
> '\N{LATIN SMALL LETTER O WITH DIAERESIS}'
>
> I guess the filesystem shouldn't treat these as the same (even though
> they are), but what if some webservice does?

Mac OS X does normalize filenames to a variant of the D (decomposed)
form.
http://www.haypocalc.com/tmp/unicode-2011-03-25/html/operating_systems.html#mac-os-x

> I suspect you should normalize both strings before comparing them in any blacklist,

Yes, but a blacklist is not safe: use a whitelist.

> and what happens with surrogates when you normalize?

Surrogates are left unchanged by forms C, D, KC and KD.

>>> import unicodedata
>>> all(unicodedata.normalize(form, '\uDC80') == '\uDC80'
...     for form in ('NFC', 'NFD', 'NFKC', 'NFKD'))
True

Victor


Re: Security implications of pep 383

"Martin v. Löwis"
In reply to this post by Lennart Regebro-2
> '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' !=
> '\N{LATIN SMALL LETTER O WITH DIAERESIS}'
>
> I guess the filesystem shouldn't treat these as the same (even though
> they are), but what if some webservice does? I suspect you should
> normalize both strings before comparing them in any blacklist, and
> what happens with surrogates when you normalize?

I think the whole blacklist example is artificial. The string in the
blacklist is actually a Chinese "hello" greeting, so it surely isn't
the string being blacklisted. For proper blacklisting, you would likely
use substring searches, case-insensitivity, transliterations, and
perhaps even regular expressions and word stemming. If you consider all
these things, proper or alternative encodings of the same text are just
another issue to consider.

Regards,
Martin



Re: Security implications of pep 383

Terry Reedy
In reply to this post by Michael Foord-3
On 3/29/2011 2:23 PM, Michael Foord wrote:

> Not sure how real the security risk is here:
>
> http://blog.omega-prime.co.uk/?p=107
>
> Basically he is saying that if you store a list of blacklisted files
> with names encoded in big-5 (or some other non-utf8 compatible encoding)
> if those names are passed at the command line, or otherwise read in and
> decoded from an assumed-utf8 source with surrogate escaping, the
> surrogate escape decoded names will not match the properly decoded
> blacklisted names.

I posted a link to this thread as a comment on the blog, with my summary
of the thread.

--
Terry Jan Reedy


Re: Security implications of pep 383

Toshio Kuratomi-2
In reply to this post by Victor STINNER
On Tue, Mar 29, 2011 at 10:55:47PM +0200, Victor Stinner wrote:

> Le mardi 29 mars 2011 à 22:40 +0200, Lennart Regebro a écrit :
> > The lesson here seems to be "if you have to use blacklists, and you
> > use unicode strings for those blacklists, also make sure the string
> > you compare with doesn't have surrogates".
>
> No. '\u4f60\u597d'.encode('big5').decode('latin1') gives '§A¦n' which
> doesn't contain any surrogate character.
>
> The lesson is: if you compare Unicode filenames on UNIX, make sure that
> your system is correctly configured (the locale encoding must be the
> filesystem encoding).
>
You're both wrong :-)

Lennart is missing that you just need to use the same encoding
+ surrogateescape (or stick with bytes) for decoding the byte strings that
you are comparing.

You're missing that on UNIX there is no filesystem encoding so the idea of
locale and filesystem encoding matching is false (and unnecessary -- the
encodings that you use within python just need to be the same.  They don't
even need to match up to the reality of what's used on the filesystem or the
user's locale.)
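
A minimal sketch of that claim, assuming nothing about the real encoding of the bytes: as long as both byte strings are decoded with the same codec plus surrogateescape, equal bytes give equal strings and different bytes give different strings. `decode_name` is a hypothetical helper.

```python
def decode_name(data, encoding="ascii"):
    """Decode bytes losslessly with the given codec and surrogateescape."""
    return data.decode(encoding, errors="surrogateescape")

blacklist_entry = "\u4f60\u597d".encode("big5")  # bytes from the blacklist
argument = bytes(blacklist_entry)                # same bytes, from argv
other = b"harmless.txt"

# Neither ASCII nor UTF-8 is the "real" encoding of these bytes, yet the
# comparison behaves exactly like a comparison of the raw bytes.
for enc in ("ascii", "utf-8"):
    assert decode_name(blacklist_entry, enc) == decode_name(argument, enc)
    assert decode_name(blacklist_entry, enc) != decode_name(other, enc)
    # The decoding is reversible, so no information is lost.
    round_trip = decode_name(argument, enc).encode(enc, errors="surrogateescape")
    assert round_trip == argument
```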

-Toshio


Re: Security implications of pep 383

Lennart Regebro-2
On Wed, Mar 30, 2011 at 07:54, Toshio Kuratomi <[hidden email]> wrote:
> Lennart is missing that you just need to use the same encoding
> + surrogateescape (or stick with bytes) for decoding the byte strings that
> you are comparing.

You lost me here. I need to do this for what?

//Lennart

Re: Security implications of pep 383

Lennart Regebro-2
In reply to this post by "Martin v. Löwis"
On Tue, Mar 29, 2011 at 23:17, "Martin v. Löwis" <[hidden email]> wrote:
> I think the whole blacklist example is artificial. The string in the
> blacklist is actually a Chinese "hello" greeting, so it surely isn't
> the string being blacklisted. For proper blacklisting, you would likely
> use substring searches, case-insensitivity, transliterations, and
> perhaps even regular expressions and word stemming.

Good point.

//Lennart

Re: Security implications of pep 383

Gregory P. Smith-3
In reply to this post by Terry Reedy


On Tue, Mar 29, 2011 at 4:07 PM, Terry Reedy <[hidden email]> wrote:
> On 3/29/2011 2:23 PM, Michael Foord wrote:
>
>> Not sure how real the security risk is here:
>>
>> http://blog.omega-prime.co.uk/?p=107
>>
>> Basically he is saying that if you store a list of blacklisted files
>> with names encoded in big-5 (or some other non-utf8 compatible encoding)
>> if those names are passed at the command line, or otherwise read in and
>> decoded from an assumed-utf8 source with surrogate escaping, the
>> surrogate escape decoded names will not match the properly decoded
>> blacklisted names.
>
> I posted link to this as comment, with my summary of thread.
>
> --
> Terry Jan Reedy

I don't see your comment on the blog post.  So either the author is moderating comments and hasn't seen yours yet (likely) or they don't want disagreement in their comments. ;)

Regardless, it isn't a bug with Python or PEP 383.  If someone is dealing with security and does not know what formats the various inputs to their program that are used to make the security check can come in, they shouldn't be writing security-oriented code at all...  (But that's never stopped anyone.)

-gps



Re: Security implications of pep 383

Nick Coghlan
On Wed, Mar 30, 2011 at 4:57 PM, Gregory P. Smith <[hidden email]> wrote:
> I don't see your comment on the blog post.  So either the author is
> moderating comments and hasn't seen yours yet (likely) or they don't want
> disagreement in their comments. ;)

My comment was sitting in the moderation queue last time I looked as well.

While Toshio is correct that there is no one correct "filesystem
encoding" on Linux systems, Python still does its best to guess one
(even though it may be wrong for some of the mounted filesystems).
That's what it will use when encoding Unicode strings to pass to
bytes-oriented POSIX APIs, so you can always "pre-check" values by
using os.fsencode to get everything into the bytes format that will
actually be passed to the underlying OS API.

Python 3.2 provides the tools to do this kind of thing correctly, but
it is finicky enough that there isn't really any way for us to make it
easy.

Cheers,
Nick.

--
Nick Coghlan   |   [hidden email]   |   Brisbane, Australia

Re: Security implications of pep 383

Terry Reedy
In reply to this post by Gregory P. Smith-3
On 3/30/2011 2:57 AM, Gregory P. Smith wrote:

>>> http://blog.omega-prime.co.uk/?p=107

>> I posted link to this as comment, with my summary of thread.

> I don't see your comment on the blog post.  So either the author is
> moderating comments and hasn't seen yours yet (likely)

My comment and Nick's are now both posted. Blogger Max replied:

"Nick, thanks for that info. It is certainly nice that there is a work
around, and perhaps this is indeed the best that can be done if you still
want the convenience of representing filenames as strings.

Terry: thanks also for the link to the mailing list thread. It is
certainly interesting, and the argument regarding latin1 is a compelling
one; this issue is indeed not specific to PEP383. So the dangerous
operation seems to be comparing strings that were originally created
from byte strings in two different encodings. It’s not clear to me that
it would be sensible for the language to check this (perhaps by throwing
an exception if you try it)."

The other 2 comments are also followed by responses.

--
Terry Jan Reedy

