Hashing proposal: change only string-only dicts

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Hashing proposal: change only string-only dicts

"Martin v. Löwis"
I'd like to propose a different approach to seeding the string hashes:
only do so for dictionaries involving only strings, and leave the
tp_hash slot of strings unchanged.

Each string would get two hashes: the "public" hash, which is constant
across runs and bugfix releases, and the dict-hash, which is only used
by the dictionary implementation, and only if all keys to the dict are
strings. In order to allow caching of the hash, all dicts should use
the same hash (if caching wasn't necessary, each dict could use its own
seed).

There are several variants of that approach wrt. caching of the hash
1. add an additional field to all string objects, to cache the second
   hash value.
   a) variant: in 3.3, drop the extra field, and declare that hashes
   may change across runs
2. only cache the dict-hash, recomputing the public hash each time
3. on a per-string choice, cache either the dict-hash or the public
   hash, depending on which one gets computed first, and recompute
   the other one every time it's needed.

As you can see, 1 vs. 2/3 is a classical time-space-tradeoff.

What do you think?

Regards,
Martin
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Antoine Pitrou
On Tue, 17 Jan 2012 21:59:28 +0100
"Martin v. Löwis" <[hidden email]> wrote:
> I'd like to propose a different approach to seeding the string hashes:
> only do so for dictionaries involving only strings, and leave the
> tp_hash slot of strings unchanged.

I think Python 3 would be better with a clean fix (all hashes
randomized).
Now for Python 2... The problem with this idea is that it only
addresses str dicts. Unicode dicts, and any other dicts, are left
vulnerable. Unicode dicts are quite likely in Web
frameworks/applications and other places which have well-thought text
semantics.

That said, here's a suggestion to squeeze those bits:

> 1. add an additional field to all string objects, to cache the second
>    hash value.
>    a) variant: in 3.3, drop the extra field, and declare that hashes
>    may change across runs

In 2.7, a string object has the following fields:

    long ob_shash;
    int ob_sstate;

Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
could cache a "hash perturbation" computed from the string and the
random bits:

- hash() would use ob_shash
- dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))

This way, you cache almost all computations, adding only a computation
and a couple logical ops when looking up a string in a dict.

Regards

Antoine.


_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Victor STINNER
In reply to this post by "Martin v. Löwis"
2012/1/17 "Martin v. Löwis" <[hidden email]>:
> I'd like to propose a different approach to seeding the string hashes:
> only do so for dictionaries involving only strings, and leave the
> tp_hash slot of strings unchanged.

The real problem is in dict (or any structure using an hash table), so
if it is possible, it would also prefer to fix the problem directly in
dict.

> There are several variants of that approach wrt. caching of the hash
> 1. add an additional field to all string objects, to cache the second
>   hash value.
>   a) variant: in 3.3, drop the extra field, and declare that hashes
>   may change across runs
> 2. only cache the dict-hash, recomputing the public hash each time
> 3. on a per-string choice, cache either the dict-hash or the public
>   hash, depending on which one gets computed first, and recompute
>   the other one every time it's needed.

There is a simpler solution:

bucket_index = (hash(str) ^ secret) & DICT_MASK.

Remark: set must also be fixed.

Victor
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Victor STINNER
> There is a simpler solution:
>
> bucket_index = (hash(str) ^ secret) & DICT_MASK.

Oops, hash^secret doesn't add any security.

Victor
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Victor STINNER
In reply to this post by "Martin v. Löwis"
> Each string would get two hashes: the "public" hash, which is constant
> across runs and bugfix releases, and the dict-hash, which is only used
> by the dictionary implementation, and only if all keys to the dict are
> strings.

The distinction between secret (private, secure) and "public" hash
(deterministic) is not clear to me.

Example: collections.UserDict implements __hash__() using
hash(self.data). Should it use the public or the private hash?
collections.abc.Set computes its hash using hash(x) of each item. Same
question.

If we need to use the secret hash, it should be exposed in Python.
Which function/method would be used? I suppose that we cannot add
anything to stable releases like 2.7.

Victor
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

"Martin v. Löwis"

Zitat von Victor Stinner <[hidden email]>:

>> Each string would get two hashes: the "public" hash, which is constant
>> across runs and bugfix releases, and the dict-hash, which is only used
>> by the dictionary implementation, and only if all keys to the dict are
>> strings.
>
> The distinction between secret (private, secure) and "public" hash
> (deterministic) is not clear to me.

It's not about privacy or security. It's about compatibility. The
dict-hash is only used in the dict implementation, and never exposed,
leaving the tp_hash unmodified.

> Example: collections.UserDict implements __hash__() using
> hash(self.data).

Are you sure? I only see that used for UserString, not UserDict.

> collections.abc.Set computes its hash using hash(x) of each item. Same
> question.

The hash of the Set should most certainly use the element's tp_hash.
That *is* the hash of the objects, and it may collide for strings
just fine due to the vulnerability.

> If we need to use the secret hash, it should be exposed in Python.

It's not secret, just specific. I don't mind it being exposed. However,
that would be a new feature, which cannot be added in a security fix
or bug fix release.

> Which function/method would be used? I suppose that we cannot add
> anything to stable releases like 2.7.

Right. Nor do I see any need to expose it. It fixes the vulnerability
just fine without being exposed.

Regards,
Martin

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

"Martin v. Löwis"
In reply to this post by Antoine Pitrou
Am 17.01.2012 22:26, schrieb Antoine Pitrou:

> On Tue, 17 Jan 2012 21:59:28 +0100
> "Martin v. Löwis" <[hidden email]> wrote:
>> I'd like to propose a different approach to seeding the string hashes:
>> only do so for dictionaries involving only strings, and leave the
>> tp_hash slot of strings unchanged.
>
> I think Python 3 would be better with a clean fix (all hashes
> randomized).
> Now for Python 2... The problem with this idea is that it only
> addresses str dicts. Unicode dicts, and any other dicts, are left
> vulnerable.

No, you misunderstood. I meant to propose that this applies to both
kinds of string (unicode and byte strings); for 2.x also dictionaries
including a mix of them.

> Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
> could cache a "hash perturbation" computed from the string and the
> random bits:
>
> - hash() would use ob_shash
> - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))
>
> This way, you cache almost all computations, adding only a computation
> and a couple logical ops when looking up a string in a dict.

That's a good idea. For Unicode, it might be best to add another slot
into the object, even though this increases the object size.

Regards,
Martin
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Gregory P. Smith-3
In reply to this post by "Martin v. Löwis"

On Tue, Jan 17, 2012 at 12:59 PM, "Martin v. Löwis" <[hidden email]> wrote:
I'd like to propose a different approach to seeding the string hashes:
only do so for dictionaries involving only strings, and leave the
tp_hash slot of strings unchanged.

Each string would get two hashes: the "public" hash, which is constant
across runs and bugfix releases, and the dict-hash, which is only used
by the dictionary implementation, and only if all keys to the dict are
strings. In order to allow caching of the hash, all dicts should use
the same hash (if caching wasn't necessary, each dict could use its own
seed).

There are several variants of that approach wrt. caching of the hash
1. add an additional field to all string objects, to cache the second
  hash value.

yuck, our objects are large enough as it is.
 
  a) variant: in 3.3, drop the extra field, and declare that hashes
  may change across runs

+1 Absolutely.  We can and should make 3.3 change hashes across runs (behavior that can be disabled via a flag or environment variable).

I think the issue of doctests and such breaking even in 2.7 due to hash order changes is a being overblown.  Code like that has already needs to fix its tests at least once when they want tests to pass on on both 32-bit and 64-bit python VMs (they have different hashes).  Do we have _any_ measure of how big a deal this will be before going too far here?

-gps

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

"Martin v. Löwis"
> +1 Absolutely.  We can and should make 3.3 change hashes across runs
> (behavior that can be disabled via a flag or environment variable).
>
> I think the issue of doctests and such breaking even in 2.7 due to hash
> order changes is a being overblown.  Code like that has already needs to
> fix its tests at least once when they want tests to pass on on both
> 32-bit and 64-bit python VMs (they have different hashes).  Do we have
> _any_ measure of how big a deal this will be before going too far here?

My concern is not about breaking doctests: this proposal will also break
them. My concern is about applications that assume that hash(s) is
stable across runs, and we do have reports that it will break
applications.

Regards,
Martin
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Barry Warsaw
On Jan 18, 2012, at 08:19 AM, Martin v. Löwis wrote:

>My concern is not about breaking doctests: this proposal will also break
>them. My concern is about applications that assume that hash(s) is
>stable across runs, and we do have reports that it will break
>applications.

I am a proponent of doctests, and thus use them heavily.  I can tell you that
the issue of dict hashing (non-)order has been well known for *years* and I
have convenience functions in my own doctests to sort and print dict
elements.  Back in my Launchpad days (which has oodles of doctests), many
years ago we went on a tear to fix dict printing when some change in Python
caused them to break.  So I'm not personally worried that such a change would
break any of my own code.

Even though I hope anybody who uses doctests has their own workarounds for
this, I still support being conservative in default behavior for stable
releases, because it's the right thing to do for our users.

-Barry
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

PJ Eby
In reply to this post by "Martin v. Löwis"
On Tue, Jan 17, 2012 at 7:58 PM, "Martin v. Löwis" <[hidden email]> wrote:
Am 17.01.2012 22:26, schrieb Antoine Pitrou:
> Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
> could cache a "hash perturbation" computed from the string and the
> random bits:
>
> - hash() would use ob_shash
> - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))
>
> This way, you cache almost all computations, adding only a computation
> and a couple logical ops when looking up a string in a dict.

That's a good idea. For Unicode, it might be best to add another slot
into the object, even though this increases the object size.

Wouldn't that break the ABI in 2.x?

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

"Martin v. Löwis"
In reply to this post by Barry Warsaw
Am 18.01.2012 13:30, schrieb Barry Warsaw:

> On Jan 18, 2012, at 08:19 AM, Martin v. Löwis wrote:
>
>> My concern is not about breaking doctests: this proposal will also break
>> them. My concern is about applications that assume that hash(s) is
>> stable across runs, and we do have reports that it will break
>> applications.
>
> I am a proponent of doctests, and thus use them heavily.  I can tell you that
> the issue of dict hashing (non-)order has been well known for *years* and I
> have convenience functions in my own doctests to sort and print dict
> elements.

Indeed. So that breakage may actually be less than people expect.

As for cases that still rely on dict order: none of the proposed
solutions preserve full compatibility in dict order. The only solution
(not actually proposed so far) is to add an AVL tree into the hash
table, to track keys that collide on hash values (rather than hash
slots). Such a tree would be only used if there is an actual collision,
which, in practical dict usage, never occurs.

I've been seriously considering implementing a balanced tree inside
the dict (again for string-only dicts, as ordering can't be guaranteed
otherwise). However, this would be a lot of code for a security fix.
It *would* solve the issue for good, though.

Regards,
Martin
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

"Martin v. Löwis"
In reply to this post by PJ Eby
Am 18.01.2012 17:01, schrieb PJ Eby:

> On Tue, Jan 17, 2012 at 7:58 PM, "Martin v. Löwis" <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Am 17.01.2012 22:26, schrieb Antoine Pitrou:
>     > Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
>     > could cache a "hash perturbation" computed from the string and the
>     > random bits:
>     >
>     > - hash() would use ob_shash
>     > - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))
>     >
>     > This way, you cache almost all computations, adding only a computation
>     > and a couple logical ops when looking up a string in a dict.
>
>     That's a good idea. For Unicode, it might be best to add another slot
>     into the object, even though this increases the object size.
>
>
> Wouldn't that break the ABI in 2.x?

I was thinking about adding the field at the end, so I thought it
shouldn't. However, if somebody inherits from PyUnicodeObject, it still
might - so my new proposal is to add the extra hash into the str block,
either at str[-1], or after the terminating 0. This would cause an
average increase of four bytes of the storage (0 bytes in 50% of the
cases, 8 bytes because of padding in the other 50%).

What do you think?

Regards,
Martin
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Glenn Linderman-3
In reply to this post by "Martin v. Löwis"
On 1/18/2012 9:52 AM, "Martin v. Löwis" wrote:
I've been seriously considering implementing a balanced tree inside
the dict (again for string-only dicts, as ordering can't be guaranteed
otherwise). However, this would be a lot of code for a security fix.
It *would* solve the issue for good, though.

To handle keys containing non-orderable keys along with strings, which are equally vulnerable to string-only keys, especially if the non-string components can have fixed values during an attack, you could simply use their hash value as an orderable proxy for the non-orderable key components.

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

PJ Eby
In reply to this post by "Martin v. Löwis"

On Jan 18, 2012 12:55 PM, Martin v. Löwis <[hidden email]> wrote:
>
> Am 18.01.2012 17:01, schrieb PJ Eby:
> > On Tue, Jan 17, 2012 at 7:58 PM, "Martin v. Löwis" <[hidden email]
> > <mailto:[hidden email]>> wrote:
> >
> >     Am 17.01.2012 22:26, schrieb Antoine Pitrou:
> >     > Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
> >     > could cache a "hash perturbation" computed from the string and the
> >     > random bits:
> >     >
> >     > - hash() would use ob_shash
> >     > - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))
> >     >
> >     > This way, you cache almost all computations, adding only a computation
> >     > and a couple logical ops when looking up a string in a dict.
> >
> >     That's a good idea. For Unicode, it might be best to add another slot
> >     into the object, even though this increases the object size.
> >
> >
> > Wouldn't that break the ABI in 2.x?
>
> I was thinking about adding the field at the end, so I thought it
> shouldn't. However, if somebody inherits from PyUnicodeObject, it still
> might - so my new proposal is to add the extra hash into the str block,
> either at str[-1], or after the terminating 0. This would cause an
> average increase of four bytes of the storage (0 bytes in 50% of the
> cases, 8 bytes because of padding in the other 50%).
>
> What do you think?

So far it sounds like the very best solution of all, as far as backward compatibility is concerned.  If the extra bits are only used when two strings have a matching hash value, the only doctests that could be affected are ones testing for this issue.  ;-)


_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Gregory P. Smith-3
In reply to this post by "Martin v. Löwis"

On Wed, Jan 18, 2012 at 9:55 AM, "Martin v. Löwis" <[hidden email]> wrote:
Am 18.01.2012 17:01, schrieb PJ Eby:
> On Tue, Jan 17, 2012 at 7:58 PM, "Martin v. Löwis" <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Am 17.01.2012 22:26, schrieb Antoine Pitrou:
>     > Only 2 bits are used in ob_sstate, meaning 30 are left. These 30 bits
>     > could cache a "hash perturbation" computed from the string and the
>     > random bits:
>     >
>     > - hash() would use ob_shash
>     > - dict_lookup() would use ((ob_shash * 1000003) ^ (ob_sstate & ~3))
>     >
>     > This way, you cache almost all computations, adding only a computation
>     > and a couple logical ops when looking up a string in a dict.
>
>     That's a good idea. For Unicode, it might be best to add another slot
>     into the object, even though this increases the object size.
>
> Wouldn't that break the ABI in 2.x?

I was thinking about adding the field at the end, so I thought it
shouldn't. However, if somebody inherits from PyUnicodeObject, it still
might - so my new proposal is to add the extra hash into the str block,
either at str[-1], or after the terminating 0. This would cause an
average increase of four bytes of the storage (0 bytes in 50% of the
cases, 8 bytes because of padding in the other 50%).

What do you think?

str[-1] is not likely to work if you want to maintain ABI compatibility.  Appending it to the data after the terminating \0 is more likely to be possible, but if there is any possibility that existing compiled extension modules have somehow inlined code to do allocation of the str field even that is questionable (i don't think there are?).

I'd also be concerned about C API code that uses PyUnicode_Resize(). How do you keep track of if you have filled in these extra bytes at the end in or not?  allocation and resize fill it with a magic value indicating "not filled in" similar to a tp_hash of -1?

Regardless of all of this, I don't think this fully addresses the overall issue as strings within other hashable data structures like tuples would not be treated this way, only strings directly stored in a dict.  Sure you can continue on and "fix" tuples and such in a similar manner but then what about user defined classes that implement __hash__ based on the return value of hash() on some strings they contain?

I don't see anything I'd consider a real complete fix unless we also backport the randomized hash code so that people who need a guaranteed fix can enable it and use it.

-gps

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Hrvoje Niksic-2
In reply to this post by "Martin v. Löwis"
On 01/18/2012 06:55 PM, "Martin v. Löwis" wrote:
> I was thinking about adding the field at the end,

Will this make all strings larger, or only those that create dict
collisions?  Making all strings larger to fix this issue sounds like a
really bad idea.

Also, would it be acceptable to simply not cache the alternate hash?
The cached string hash is an optimization anyway.

Hrvoje
_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com
Reply | Threaded
Open this post in threaded view
|

Re: Hashing proposal: change only string-only dicts

Amaury Forgeot d'Arc
In reply to this post by Gregory P. Smith-3
2012/1/19 Gregory P. Smith <[hidden email]>
str[-1] is not likely to work if you want to maintain ABI compatibility.  Appending it to the data after the terminating \0 is more likely to be possible, but if there is any possibility that existing compiled extension modules have somehow inlined code to do allocation of the str field even that is questionable (i don't think there are?).

There are. Unfortunately.
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/scalarapi.c#L710

--
Amaury Forgeot d'Arc

_______________________________________________
Python-Dev mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/lists%2B1324100855712-1801473%40n6.nabble.com