Status of the fix for the hash collision vulnerability

Status of the fix for the hash collision vulnerability

Victor STINNER
Many people proposed their own ideas to fix the vulnerability, but only
three wrote a patch:

- Glenn Linderman proposes to fix the vulnerability by adding a new
"safe" dict type (accepting only string keys). His proof of concept
(SafeDict.py) uses a secret of 64 random bits to compute the hash of a
key.
- Marc-Andre Lemburg proposes to fix the vulnerability directly in
dict (for any key type). His patch raises an exception if a lookup
causes more than 1000 collisions.
- I propose to fix the vulnerability only in the Unicode hash (not for
other types). My patch adds a random secret initialized at startup (it
can be disabled or fixed using an environment variable).
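For illustration, the third approach can be sketched in pure Python. This is a toy model: the constants mirror CPython's multiply-XOR string hash, but the name randomized_hash and the way the secret is mixed in are illustrative, not the actual patch (which modifies the C-level Unicode hash):

```python
import os

# Process-wide secret, drawn once at startup (the real patch reads
# random bytes at interpreter startup and lets an environment
# variable fix or disable them).
_SECRET = int.from_bytes(os.urandom(8), "little")

def randomized_hash(s: str, mask: int = 2**64 - 1) -> int:
    """Toy multiply-XOR string hash seeded with a random secret, so
    an attacker cannot precompute colliding keys offline."""
    if not s:
        return 0
    x = (_SECRET ^ (ord(s[0]) << 7)) & mask
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    return x
```

Within one process the hashes are stable; a fresh process draws a fresh secret, so any collision set built against one run is useless against the next.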

--

I consider Glenn's proposal not applicable in practice, because every
application and every library would have to be patched to use the new
"safe" dict type.

Some people are concerned about possible regressions introduced by
Marc's proposal: his patch may raise an exception for legitimate data.

My proposal tries to be "just secure enough" with a low runtime
performance overhead. My patch has become huge (and so backporting is
more complex), whereas Marc's patch is very simple and therefore
trivial to backport.
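Marc's collision-counting idea can be sketched with a toy open-addressing table (illustrative code; CPython's real dict lives in C and probes with a perturbed sequence rather than linear steps):

```python
class TooManyCollisions(Exception):
    pass

class CountingDict:
    """Toy open-addressing table that counts collisions during a single
    insertion and bails out past a limit, as in Marc's patch (which
    uses a limit of 1000)."""
    LIMIT = 1000

    def __init__(self, size=8192):  # size must be a power of two
        self._slots = [None] * size  # each slot: (key, value) or None

    def __setitem__(self, key, value):
        mask = len(self._slots) - 1
        i = hash(key) & mask
        collisions = 0
        while self._slots[i] is not None and self._slots[i][0] != key:
            collisions += 1
            if collisions > self.LIMIT:
                raise TooManyCollisions(f"inserting {key!r}")
            i = (i + 1) & mask  # linear probing, for simplicity
        self._slots[i] = (key, value)

    def __getitem__(self, key):
        mask = len(self._slots) - 1
        i = hash(key) & mask
        while self._slots[i] is not None:
            if self._slots[i][0] == key:
                return self._slots[i][1]
            i = (i + 1) & mask
        raise KeyError(key)
```

Marc's bet is that legitimate data virtually never chains 1000 collisions on one lookup, while keys crafted to share a slot trip the limit immediately.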

--

It is still unclear to me whether the fix should be enabled by default
for Python < 3.3. Because the overhead (of my patch) is low, I would
prefer to enable the fix by default, to protect everyone with a simple
Python upgrade.

I would rather explain to people having trouble with randomized hashing
how to disable it explicitly (PYTHONHASHSEED=0), or how to fix their
application's bugs, than leave the hole open by default.
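The feature eventually shipped under the name PYTHONHASHSEED (off by default before Python 3.3); its effect can be checked by spawning fresh interpreters, for example:

```python
import os
import subprocess
import sys

def hash_in_subprocess(seed: str) -> str:
    """Print hash('abc') from a fresh interpreter with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# PYTHONHASHSEED=0 disables randomization: every process agrees.
assert hash_in_subprocess("0") == hash_in_subprocess("0")

# A fixed nonzero seed is reproducible too -- useful for processes
# that share data and need identical hash values.
assert hash_in_subprocess("12345") == hash_in_subprocess("12345")
```

With PYTHONHASHSEED unset (or set to "random"), separate runs generally disagree, which is exactly what breaks code relying on dict ordering.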

--

We might change hash() for types other than str, but it looks like web
servers are only affected through dicts with string keys.

We may use Paul's hash function if mine is not secure enough.

My patch doesn't fix the DoS, it just makes the attack more complex.
The attacker cannot pregenerate data for an attack: (s)he first has to
compute the hash secret, and then compute hash collisions using the
secret. The hash secret is at least 64 bits long (128 bits on a 64-bit
system). So I hope that computing collisions requires a lot of CPU
time (is slow), making the attack ineffective with today's computers.

--

I plan to write a polished patch for Python 3.3, then a simpler patch
for 3.1 and 3.2 (duplicate the os.urandom code to keep it unchanged;
maybe don't create a new random.c file; maybe don't touch the test
suite, even though the patch breaks many tests), and finally write
patches for Python 2.6 and 2.7.

Details about my patch:

- I tested it on Linux (32 and 64 bits) and Windows 7 (64 bits)
- a new PYTHONSEED environment variable allows control of the
randomized hash: PYTHONSEED=0 disables randomization completely
(restoring the previous behaviour), and PYTHONSEED=value uses a fixed
seed for processes that share data and need identical hash values
(multiprocessing users?)
- no overhead on hash(str)
- no startup overhead on Linux
- startup overhead is 10% on Windows (see the issue; I propose another
solution with a startup overhead of 1%)

The patch is not finished; some tests are still failing because of the
randomized hash.

--

FYI, PHP released version 5.3.9, adding a "max_input_vars directive to
prevent attacks based on hash collisions (CVE-2011-4885)".

Victor
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com

Re: Status of the fix for the hash collision vulnerability

Guido van Rossum
Hm... I started out as a big fan of the randomized hash, but thinking more about it, I actually believe that the chances of some legitimate app having >1000 collisions are way smaller than the chances that somebody's code will break due to the variable hashing. In fact we know for a fact that the latter will break code, since it changes the order of items in a dict. This affects many tests written without this in mind, and I assume there will be some poor sap out there who uses Python's hash() function to address some external persistent hash table or some other external data structure. How pathological does the data need to be before the collision counter triggers? I'd expect *very* pathological.

This depends on how the counting is done (I didn't look at MAL's patch), and assumes that increasing the hash table size will generally reduce collisions if items collide but their hashes are different.

That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable.

--Guido

On Thu, Jan 12, 2012 at 5:24 PM, Victor Stinner <[hidden email]> wrote:
> [...]



--
--Guido van Rossum (python.org/~guido)


Re: Status of the fix for the hash collision vulnerability

Frank Sievertsen
In reply to this post by Victor STINNER
On 13.01.2012 02:24, Victor Stinner wrote:
> My patch doesn't fix the DoS, it just make the attack more complex.
> The attacker cannot pregenerate data for an attack: (s)he has first to
> compute the hash secret, and then compute hash collisions using the
> secret. The hash secret is a least 64 bits long (128 bits on a 64 bit
> system). So I hope that computing collisions requires a lot of CPU
> time (is slow) to make the attack ineffective with today computers.
Unfortunately it requires only a few seconds to compute enough 32-bit
collisions on one core, with no precomputed data.  I'm sure it's
possible to get this below one second.

In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X)
^ suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail)
is possible.
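Frank's observation is easy to verify against a toy hash whose only secret is a suffix XORed in at the end (illustrative code, a simplification of the patched hash):

```python
import random

MASK = 2**32 - 1  # 32-bit hash, as in the builds Frank is attacking

def toy_hash(s: str, suffix: int) -> int:
    """Multiply-XOR string hash whose only secret is a suffix
    XORed in at the very end."""
    if not s:
        return suffix
    x = ord(s[0]) << 7
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & MASK
    x ^= len(s)
    return x ^ suffix

s1, s2 = random.getrandbits(32), random.getrandbits(32)

# The suffix drops out of any XOR of two hashes ...
assert toy_hash("spam", s1) ^ toy_hash("eggs", s1) == \
       toy_hash("spam", s2) ^ toy_hash("eggs", s2)

# ... so whether two strings collide does not depend on it at all:
# hash(X) ^ suffix == hash(Y) ^ suffix  iff  hash(X) == hash(Y).
assert (toy_hash("spam", s1) == toy_hash("eggs", s1)) == \
       (toy_hash("spam", s2) == toy_hash("eggs", s2))
```

A suffix-only secret therefore adds nothing against collision attacks; it only obscures the raw hash values.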

So the question is: How difficult is it to guess the seed?

Frank

Re: Status of the fix for the hash collision vulnerability

Victor STINNER
> Unfortunately it requires only a few seconds to compute enough 32bit
> collisions on one core with no precomputed data.

Are you running the hash function "backward" to generate strings with
the same value, or are you trying something more like brute force?

And how do you get the hash secret? You need it to run an attack.

> In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) ^
> suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) is
> possible.

My change also adds a prefix (so: a prefix and a suffix). I don't know
if that changes anything for generating collisions.

> So the question is: How difficult is it to guess the seed?

I wrote some remarks about that in the issue. For example:

(hash("\0") ^ 1) ^ (hash("\0\0") ^ 2) gives ((prefix * 1000003) &
HASH_MASK) ^ ((prefix * 1000003**2) & HASH_MASK)

I suppose that in practice you don't get the full output of hash(str)
directly, but hash(str) & DICT_MASK, where DICT_MASK is the size of the
internal dict array minus 1. For example, for a dictionary of 65,536
items, the mask is 0x1ffff and so cannot give you more than 17 bits of
hash(str) output. I still don't know how difficult it is to retrieve
hash(str) bits from repr(dict).
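That identity can be checked against a toy model of the patched hash (illustrative constants; a random prefix mixed in at the start, a random suffix XORed in at the end): the key bytes of "\0" are zero and the suffix cancels in the XOR, leaving an expression that depends only on the prefix.

```python
import random

MASK = 2**32 - 1  # HASH_MASK for a toy 32-bit build

def patched_hash(s: str, prefix: int, suffix: int) -> int:
    """Toy model of the patched string hash: random prefix mixed in
    at the start, random suffix XORed in at the end."""
    if not s:
        return 0
    x = (prefix ^ (ord(s[0]) << 7)) & MASK
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & MASK
    x ^= len(s)
    return x ^ suffix

prefix, suffix = random.getrandbits(32), random.getrandbits(32)
leak = (patched_hash("\0", prefix, suffix) ^ 1) ^ \
       (patched_hash("\0\0", prefix, suffix) ^ 2)
# The suffix cancels; what remains depends only on the prefix:
assert leak == ((prefix * 1000003) & MASK) ^ ((prefix * 1000003**2) & MASK)
```

So an attacker who can observe enough hash outputs gets equations in the prefix alone, which is why the prefix is the part that must stay secret.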

Victor

Re: Status of the fix for the hash collision vulnerability

Lennart Regebro
In reply to this post by Victor STINNER
On Fri, Jan 13, 2012 at 02:24, Victor Stinner
<[hidden email]> wrote:
> - Glenn Linderman proposes to fix the vulnerability by adding a new
> "safe" dict type (only accepting string keys). His proof-of-concept
> (SafeDict.py) uses a secret of 64 random bits and uses it to compute
> the hash of a key.

This is my preferred solution. The vulnerability is basically only in
the dictionary where you keep the form data you get from a request.
This solves it easily and nicely. It can also be a separate module
installable for Python 2, which many web frameworks still use, so it
can be practically implemented now, and not in a couple of years.

Then again, nothing prevents us from having both this, *and* one of
the other solutions.  :-)

//Lennart

Re: Status of the fix for the hash collision vulnerability

Frank Sievertsen
In reply to this post by Victor STINNER

>> Unfortunately it requires only a few seconds to compute enough 32bit
>> collisions on one core with no precomputed data.
> Are you running the hash function "backward" to generate strings with
> the same value, or you are more trying something like brute forcing?

If you try brute force to hit a specific target, you'll find only one
good string every 4 billion tries. That's why you first blow up your
target:

You start backward from an arbitrary target value. You brute-force over
3 characters, for example; this gives you 16 million intermediate
values which you know will all end up in your target value.

Those 16 million values are a huge target for brute-forcing forward:
every 256 tries you'll hit one of these values.
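As an illustration, here is a toy reconstruction of the meet-in-the-middle attack Frank describes, against an *unrandomized* multiply-XOR hash shrunk to 16 bits so it runs instantly (the real attack works the same way on the 32-bit hash, where 3-character tails give the 16 million intermediate values):

```python
MOD = 2**16                   # toy 16-bit hash state; real attack: 2**32
INV = pow(1000003, -1, MOD)   # 1000003 is odd, hence invertible mod 2**16
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def toy_hash(s):
    x = ord(s[0]) << 7
    for ch in s:
        x = (1000003 * x ^ ord(ch)) % MOD
    return x ^ len(s)

def forward3(c0, c1, c2):
    """Hash state after the first three characters."""
    x = ord(c0) << 7
    for ch in (c0, c1, c2):
        x = (1000003 * x ^ ord(ch)) % MOD
    return x

TARGET, LENGTH = 0x1234, 6

# Step 1 ("blow up the target"): walk *backward* from the target through
# every 3-character tail, recording which mid-states reach the target.
tails = {}
for c3 in ALPHABET:
    for c4 in ALPHABET:
        for c5 in ALPHABET:
            x = TARGET ^ LENGTH          # undo the final length XOR
            for ch in (c5, c4, c3):      # undo the last three steps
                x = (x ^ ord(ch)) * INV % MOD
            tails.setdefault(x, c3 + c4 + c5)

# Step 2: brute-force 3-character heads *forward*; any head whose
# mid-state is in the table completes a string hashing to TARGET.
collisions = [c0 + c1 + c2 + tails[mid]
              for c0 in ALPHABET for c1 in ALPHABET for c2 in ALPHABET
              if (mid := forward3(c0, c1, c2)) in tails]

assert collisions and all(toy_hash(s) == TARGET for s in collisions)
print(len(collisions), "colliding strings, e.g.", collisions[:3])
```

Each forward step is invertible (multiply by the modular inverse, undo the XOR), which is what makes the backward precomputation from the tail possible.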

> And how do you get the hash secret? You need it to run an attack.

I don't know. This was meant as an answer to the quoted text "So I hope
that computing collisions requires a lot of CPU time (is slow) to make
the attack ineffective with today's computers."

What I wanted to say is: the security relies on the fact that the
attacker can't guess the prefix, not on precomputation being impossible
or on computing the collisions taking hours or days. If the prefix
leaks out of the application, then the rest is trivial and done in a
few seconds. The suffix is not important for collision prevention, but
it will probably make the prefix much harder to guess.

I don't know an effective way to get the prefix either (if the
application doesn't leak full hash(X) values).

Frank

Re: Status of the fix for the hash collision vulnerability

And Clover
In reply to this post by Lennart Regebro-2
On 2012-01-13 11:20, Lennart Regebro wrote:
> The vulnerability is basically only in the dictionary you keep the
> form data you get from a request.

I'd have to disagree with this statement. The vulnerability is anywhere
that creates a dictionary (or set) from attacker-provided keys. That
would include HTTP headers, RFC822-family subheaders and parameters, the
environ, input taken from JSON or XML, and so on -- and indeed hash
collision attacks are not at all web-specific.

The problem with having two dict implementations is that a caller would
have to tell libraries that use dictionaries which implementation to
use. So, for example, an argument would have to be passed to
json.load[s] to specify whether the input was known-sane or potentially
hostile.

Any library that could ever use dictionaries to process untrusted
input, *or any library that uses another library that does*, would have
to pass such a flag through, which would quickly get very unwieldy
indeed... or else they'd have to just always use safedict, in which
case we're in pretty much the same position as we are with changing
dict anyway.

--
And Clover
mailto:[hidden email]
http://www.doxdesk.com/
gtalk:chat?jid=[hidden email]

Re: Status of the fix for the hash collision vulnerability

Mark Dickinson
In reply to this post by Guido van Rossum
On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum <[hidden email]> wrote:
> How
> pathological the data needs to be before the collision counter triggers? I'd
> expect *very* pathological.

How pathological do you consider the set

   {1 << n for n in range(2000)}

to be?  What about the set:

   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}

?  The > 2000 elements of the latter set have only 61 distinct hash
values on a 64-bit machine, so there will be over 2000 total collisions
involved in creating this set (though admittedly only around 30
collisions per hash value).
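Mark's count is easy to reproduce on a 64-bit CPython (3.2+), where numeric hashing is modular:

```python
# CPython's numeric hash is modular: on a 64-bit build the modulus is
# P = 2**61 - 1, and hash(2.0**n) == 2**(n % 61) % P.  So however many
# powers of two the set holds, their hashes cycle through just 61 values.
powers = {2.0**n for n in range(-1074, 1024)}
distinct_hashes = {hash(x) for x in powers}

print(len(powers), "elements,", len(distinct_hashes), "distinct hash values")
```

On a 32-bit build the modulus is 2**31 - 1 instead, so the cycle is even shorter (31 values) and the clustering correspondingly worse.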

--
Mark

Re: Status of the fix for the hash collision vulnerability

Guido van Rossum
On Fri, Jan 13, 2012 at 9:08 AM, Mark Dickinson <[hidden email]> wrote:
> [...]

Hm... So how does the collision counting work for this case?

--
--Guido van Rossum (python.org/~guido)


Re: Status of the fix for the hash collision vulnerability

Mark Dickinson
On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum <[hidden email]> wrote:

>> How pathological do you consider the set
>>
>>   {1 << n for n in range(2000)}
>>
>> to be?  What about the set:
>>
>>   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}
>>
>> ?  The > 2000 elements of the latter set have only 61 distinct hash
>> values on 64-bit machine, so there will be over 2000 total collisions
>> involved in creating this set (though admittedly only around 30
>> collisions per hash value).
>
> Hm... So how does the collision counting work for this case?

Ah, my bad.  It looks like the ieee754_powers_of_two is safe---IIUC,
it's the number of collisions involved in a single key-set operation
that's limited.  So a dictionary with keys {1<<n for n in range(2000)}
is fine, but a dictionary with keys  {1<<(61*n) for n in range(2000)}
is not:

>>> {1<<(n*61):True for n in range(2000)}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
KeyError: 'too many hash collisions'
[67961 refs]

I'd still not consider this particularly pathological, though.

--
Mark

Re: Status of the fix for the hash collision vulnerability

Guido van Rossum
On Fri, Jan 13, 2012 at 10:13 AM, Mark Dickinson <[hidden email]> wrote:
> [...]
> I'd still not consider this particularly pathological, though.

Really? Even though you came up with it specifically to prove me wrong?

--
--Guido van Rossum (python.org/~guido)


Re: Status of the fix for the hash collision vulnerability

Benjamin Peterson
2012/1/13 Guido van Rossum <[hidden email]>:
> Really? Even though you came up with specifically to prove me wrong?

Coming up with a counterexample now invalidates it?



--
Regards,
Benjamin

Re: Status of the fix for the hash collision vulnerability

Antoine Pitrou
In reply to this post by Guido van Rossum
On Thu, 12 Jan 2012 18:57:42 -0800
Guido van Rossum <[hidden email]> wrote:
> Hm... I started out as a big fan of the randomized hash, but thinking more
> about it, I actually believe that the chances of some legitimate app having
> >1000 collisions are way smaller than the chances that somebody's code will
> break due to the variable hashing.

Breaking due to variable hashing is deterministic: you notice it as
soon as you upgrade (and then you use PYTHONHASHSEED to disable
variable hashing). That seems better than unpredictable breaking when
some legitimate collision chain happens.

Regards

Antoine.



Re: Status of the fix for the hash collision vulnerability

Victor STINNER
In reply to this post by Victor STINNER
> - Glenn Linderman proposes to fix the vulnerability by adding a new
> "safe" dict type (only accepting string keys). His proof-of-concept
> (SafeDict.py) uses a secret of 64 random bits and uses it to compute
> the hash of a key.

We could mix Marc's collision counter with the SafeDict idea (being
able to use a different secret for each dict): use hash(key, secret)
(a simple example: hash(secret+key)) instead of hash(key) in dict (and
set), and change the secret if we get more than N collisions. But it
would slow down all dict lookups (dict creation, get, set, del, ...).
And getting new random data can also be slow.

SafeDict and hash(secret+key) lose the benefit of the cached hash
result. Because the hash result depends on an argument, we cannot cache
the result anymore, and we have to recompute the hash for each lookup
(even if you look up the same key twice or more).

Victor

Re: Status of the fix for the hash collision vulnerability

Guido van Rossum
In reply to this post by Antoine Pitrou
On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <[hidden email]> wrote:
> [...]

Fair enough. But I'm now uncomfortable with turning this on for bugfix releases. I'm fine with making this the default in 3.3, just not in 3.2, 3.1 or 2.x -- it will break too much code and organizations will have to roll back the release or do extensive testing before installing a bugfix release -- exactly what we *don't* want for those.

FWIW, I don't believe in the SafeDict solution -- you never know which dicts you have to change.

--
--Guido van Rossum (python.org/~guido)


Re: Status of the fix for the hash collision vulnerability

Gregory P. Smith

On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum <[hidden email]> wrote:
> [...]
> FWIW, I don't believe in the SafeDict solution -- you never know which
> dicts you have to change.


Agreed.

Of the three options Victor listed only one is good.

I don't like SafeDict.  -1.  It puts the onus on the coder to always get everything right with regard to data that came from outside the process, ensuring it never ends up hashed in a non-safe dict or set *anywhere*.  "Safe" needs to be the default option for all hash tables.

I don't like the "too many hash collisions" exception. -1. It provides non-deterministic application behavior for data-driven applications, with no way for them to predict when or where it'll happen, or to prepare for it. It may work in practice for many applications, but it is simply odd behavior.

I do like randomly seeding the hash. +1. This is easy. It can easily be backported to any Python version.

It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken. We will provide a flag and/or environment variable that can be set to turn the feature off at their own peril which they can use in their test harnesses that are stupid enough to use doctests with order dependencies.

This approach worked fine for Perl 9 years ago.  https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371

-gps


Re: Status of the fix for the hash collision vulnerability

Glenn Linderman
In reply to this post by Victor STINNER
On 1/13/2012 5:35 PM, Victor Stinner wrote:
> [...]
> SafeDict and hash(secret+key) lose the benefit of the cached hash
> result.

So integrating SafeDict into dict, so it could be automatically converted, would mean changing the data structures underneath dict.  Given that, a technique for hash caching could be created that isn't quite as good as the one in place, but may be less expensive than not caching the hashes at all.  It would also take more space: a second dict internally, as well as the secret.

So once the collision counter reaches some threshold (since there would be a functional fallback, it could be much lower than 1000), the secret is obtained and the keys are rehashed using hash(secret+key).  Now when lookups occur, the object id of the key and the hash of the key are used as the index, and hash(secret+key) is stored as a cached value.  This would only benefit lookups by the same object; other objects with the same key value would be recalculated (at least the first time).  Some limit on the number of cached values would probably be appropriate.  This would add complexity, of course, in trying to save time.

An alternate solution would be to convert a dict to a tree once the number of collisions produces poor performance.  Converting to a tree would result in O(log N) instead of O(1) lookup performance, but that is better than the degenerate case of O(N) which is produced by the excessive number of collisions resulting from an attack.  This would require new tree code to be included in the core, of course, probably a red-black tree, which stays balanced.

In either of these cases, the conversion is expensive: a collision threshold must first be reached to determine the need for conversion, so the dict could already contain lots of data.  If the conversion were too expensive, the attack could still be effective.

Another solution would be to change the collision code, so that colliding keys don't produce O(N) behavior, but some other behavior.  Each colliding entry could convert that entry to a tree of entries, perhaps.  This would require no conversion of "bad dicts", and an attack could at worst convert O(1) performance to O(log N).
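The per-bucket idea can be sketched with a chained table whose buckets are kept sorted and searched by bisection (illustrative code standing in for the balanced tree Glenn describes; keys within a bucket must be mutually orderable):

```python
import bisect

class ChainedDict:
    """Toy hash table whose buckets are sorted lists searched by
    bisection: even if an attacker forces every key into one bucket,
    lookups degrade to O(log N) rather than O(N).  (A red-black tree
    would give the same bound with cheaper insertion than list.insert.)"""

    def __init__(self, nbuckets=64):
        self._buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, (key,))
        if i < len(bucket) and bucket[i][0] == key:
            bucket[i] = (key, value)       # overwrite existing key
        else:
            bucket.insert(i, (key, value)) # keep the bucket sorted

    def __getitem__(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, (key,))
        if i < len(bucket) and bucket[i][0] == key:
            return bucket[i][1]
        raise KeyError(key)
```

This is roughly the shape of the "treeify a hot bucket" mitigation: no global conversion is ever needed, because each bucket degrades gracefully on its own.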

Clearly these ideas are more complex than adding randomization, but adding randomization doesn't seem to produce immunity from attack, when data about the randomness is leaked.

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com

Re: Status of the fix for the hash collision vulnerability

Gregory P. Smith-3

> Clearly these ideas are more complex than adding randomization, but adding randomization doesn't seem to produce immunity from attack, when data about the randomness is leaked.

Which will not normally happen.

I'm firmly in the camp that believes the random seed can be probed and determined by creatively injecting values and measuring timing of things.  But doing that is difficult and time- and bandwidth-intensive, so the per-process random hash seed is good enough.

There's another elephant in the room here: if you want to avoid this attack, use a 64-bit Python build, as it uses 64-bit hash values that are significantly more difficult to force a collision on.
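(On Python 3.2 and later, the hash width of a given build can be checked directly:)

```python
import sys

# On a 64-bit build, sys.hash_info.width is 64; brute-forcing string
# collisions against a 64-bit hash is vastly more expensive than
# against the 32-bit hashes of a 32-bit build.
print(sys.hash_info.width)
```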

-gps


Re: Status of the fix for the hash collision vulnerability

Gregory P. Smith-3
btw, Tim's commit message on this one is amusingly relevant. :)



On Fri, Jan 13, 2012 at 6:25 PM, Gregory P. Smith <[hidden email]> wrote:

> > Clearly these ideas are more complex than adding randomization, but adding randomization doesn't seem to produce immunity from attack, when data about the randomness is leaked.
>
> Which will not normally happen.
>
> I'm firmly in the camp that believes the random seed can be probed and determined by creatively injecting values and measuring timing of things.  But doing that is difficult and time- and bandwidth-intensive, so the per-process random hash seed is good enough.
>
> There's another elephant in the room here: if you want to avoid this attack, use a 64-bit Python build, as it uses 64-bit hash values that are significantly more difficult to force a collision on.
>
> -gps



Re: Status of the fix for the hash collision vulnerability

Steven D'Aprano-8
In reply to this post by Gregory P. Smith-3
On 14/01/12 12:58, Gregory P. Smith wrote:

> I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be
> back ported to any Python version.
>
> It is perfectly okay to break existing users who had anything depending on
> ordering of internal hash tables. Their code was already broken.

For the record:

steve@runes:~$ python -c "print(hash('spam ham'))"
-376510515
steve@runes:~$ jython -c "print(hash('spam ham'))"
2054637885

So it is already the case that Python code that assumes stable hashing is broken.
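Conversely, under the proposal the randomization can be pinned for reproducibility.  A small check of that behaviour, assuming an interpreter that honours the proposed PYTHONHASHSEED variable:

```python
import os
import subprocess
import sys

# With PYTHONHASHSEED pinned to 0, hash('spam ham') is stable across
# interpreter runs; with randomization enabled, it generally is not.
env = dict(os.environ, PYTHONHASHSEED="0")
cmd = [sys.executable, "-c", "print(hash('spam ham'))"]
a = subprocess.check_output(cmd, env=env)
b = subprocess.check_output(cmd, env=env)
assert a == b
```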

For what it's worth, I'm not convinced that we should be overly concerned by
"poor saps" (Guido's words) who rely on accidents of implementation regarding
hash. We shouldn't break their code unless we have a good reason, but this
strikes me as a good reason. The documentation for hash certainly makes no
promise about stability, and relying on it strikes me as about as sensible as
relying on the stability of error messages.

I'm also not convinced that the option to raise an exception after 1000
collisions actually solves the problem. That relies on the application being
re-written to catch the exception and recover from it (how?). Otherwise, all
it does is change the attack vector from "cause an indefinite number of hash
collisions" to "cause 999 hash collisions followed by crashing the application
with an exception", which doesn't strike me as much of an improvement.

+1 on random seeding. Default to on in 3.3+ and default to off in older
versions, which allows people to avoid breaking their code until they're ready
for it to be broken.



--
Steven