Testing random


Cecil Westerhof
I wrote a very simple function to test random:
    import random

    def test_random(length, multiplier = 10000):
        number_list = length * [0]
        for i in range(length * multiplier):
            number_list[random.randint(0, length - 1)] += 1
        minimum = min(number_list)
        maximum = max(number_list)
        return (minimum, maximum, minimum / maximum)

When running:
    for i in range(1, 7):
        print(test_random(100, 10 ** i))
I get:
    (3, 17, 0.17647058823529413)
    (73, 127, 0.5748031496062992)
    (915, 1086, 0.8425414364640884)
    (9723, 10195, 0.9537027954879843)
    (99348, 100620, 0.9873583780560524)
    (997198, 1002496, 0.9947151908835546)

and when running:
    for i in range(1, 7):
        print(test_random(1000, 10 ** i))
I get:
    (2, 20, 0.1)
    (69, 138, 0.5)
    (908, 1098, 0.8269581056466302)
    (9684, 10268, 0.9431242695753799)
    (99046, 100979, 0.9808574059953059)
    (996923, 1003598, 0.9933489305478886)

It shows that when the first parameter increases, the deviation
increases, and when the second parameter increases, the deviation
decreases. Exactly what you would expect. But what ranges would you
expect with a good random function? Then this could be used to
test a random function.
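
One rough way to estimate what to expect (a sketch; it assumes the counts
are roughly binomial, and the sqrt(2 * ln(n)) extreme-value rule of thumb
is only an approximation):

    from math import log, sqrt

    def expected_range(length, multiplier):
        # Each counter is roughly Binomial(length * multiplier, 1 / length):
        # mean = multiplier, standard deviation about sqrt(multiplier).
        mean = multiplier
        std = sqrt(multiplier * (1 - 1 / length))
        # The most extreme of `length` near-normal counters sits roughly
        # sqrt(2 * ln(length)) standard deviations from the mean.
        spread = sqrt(2 * log(length)) * std
        return (mean - spread, mean + spread,
                (mean - spread) / (mean + spread))

    print(expected_range(100, 10000))  # roughly (9700, 10300, 0.94)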

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof


Thomas 'PointedEars' Lahn
Cecil Westerhof wrote:

> I wrote a very simple function to test random:
>     def test_random(length, multiplier = 10000):
>         number_list = length * [0]
>         for i in range(length * multiplier):
>             number_list[random.randint(0, length - 1)] += 1
>         minimum = min(number_list)
>         maximum = max(number_list)
>         return (minimum, maximum, minimum / maximum)

As there is no guarantee that every number will occur randomly, using a
dictionary at first should be more efficient than a list:

import random

def test_random (length, multiplier=10000):
    number_list = {}
    for i in range(length * multiplier):
        r = random.randint(0, length - 1)
        number_list[r] = number_list[r] + 1 if r in number_list else 1
    values = number_list.values()
    minimum = min(values)
    maximum = max(values)
    return (minimum, maximum, minimum / maximum)
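
The counting line can also be spelled more idiomatically with dict.get
(a minor variant, same behaviour):

number_list[r] = number_list.get(r, 0) + 1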

[Tuple literals are going to be introduced with C# 7 ;-)]


Since Python uses indentation as a syntax element, it is better not to
indent Python code by default when posting, to make it easier to test.  If
necessary, code sections can be delimited in an unambiguous way that is
syntactically correct by lines like

#------------------------------------------------------------------------

The line length then should not exceed 72 to 78 characters, to allow for
quoting.

Although the <HT> character can be used for indentation, 4 spaces appear to
be a sensible, device-independent default (8 is too much for more complex
snippets that need to fit within the conventional 80-characters-per-line
limit).

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


Jonas Wielicki
In reply to this post by Cecil Westerhof
On 07.06.2015 08:27, Cecil Westerhof wrote:

> I wrote a very simple function to test random:
>     def test_random(length, multiplier = 10000):
>         number_list = length * [0]
>         for i in range(length * multiplier):
>             number_list[random.randint(0, length - 1)] += 1
>         minimum = min(number_list)
>         maximum = max(number_list)
>         return (minimum, maximum, minimum / maximum)
>
> When running:
>     for i in range(1, 7):
>         print(test_random(100, 10 ** i))
> I get:
>     (3, 17, 0.17647058823529413)
>     (73, 127, 0.5748031496062992)
>     (915, 1086, 0.8425414364640884)
>     (9723, 10195, 0.9537027954879843)
>     (99348, 100620, 0.9873583780560524)
>     (997198, 1002496, 0.9947151908835546)
>
> and when running:
>     for i in range(1, 7):
>         print(test_random(1000, 10 ** i))
> I get:
>     (2, 20, 0.1)
>     (69, 138, 0.5)
>     (908, 1098, 0.8269581056466302)
>     (9684, 10268, 0.9431242695753799)
>     (99046, 100979, 0.9808574059953059)
>     (996923, 1003598, 0.9933489305478886)
>
> It shows that when the first parameter increases the deviation
> increases and when the second parameter increases the deviation
> decreases. Exactly what you would expect. But what are the ranges you
> would expect with a good random function.

That really depends on the number of samples. Apparently, a good RNG would
come close to 1.0 for the ratio.

> Then it could be used to test a random function.

This is an interesting test (more interesting to me than it looks at
first sight, and certainly better than what I had come up with), but
unfortunately, there is more to testing randomness.

The test clearly suggests that random functions should have a min/max
ratio of about 1.0. Think of a "random" function like this:

    def fake_random(minv, maxv, _cache={}):
        # Not random at all: cycles through minv..maxv in order, keeping
        # one iterator per (minv, maxv) pair in the shared default dict.
        try:
            return next(_cache[minv, maxv])
        except (StopIteration, KeyError):
            iterator = iter(range(minv, maxv+1))
            _cache[minv, maxv] = iterator
            return next(iterator)

(If that is hard to read, I agree; it returns the sequence from minv to
maxv (inclusive) over and over again for equal minv and maxv. Don't pass
anything to _cache :), just call it like random.randint.)

This gives a "perfect" ratio of 1.0, but is not very random. This kind
of function would probably be ruled out by a compression or correlation
test. If you want to dig deeper into the algorithms for random testing,
the dieharder test suite [1] is probably worth a look.

It all boils down to the use case. For some use cases, the
``fake_random`` might be good enough (unit testing would be such a case:
it is certainly uniformly distributed and allows full coverage of the
tested range); for others it would fail catastrophically (don't generate
your cryptographic keys with this!).
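
For instance, a quick demonstration (this exact output follows from the
definition above):

    >>> [fake_random(1, 3) for _ in range(7)]
    [1, 2, 3, 1, 2, 3, 1]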

cheers,
Jonas

   [1]: http://www.phy.duke.edu/~rgb/General/dieharder.php




Chris Angelico
In reply to this post by Thomas 'PointedEars' Lahn
On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
<PointedEars at web.de> wrote:

> Cecil Westerhof wrote:
>
>> I wrote a very simple function to test random:
>>     def test_random(length, multiplier = 10000):
>>         number_list = length * [0]
>>         for i in range(length * multiplier):
>>             number_list[random.randint(0, length - 1)] += 1
>>         minimum = min(number_list)
>>         maximum = max(number_list)
>>         return (minimum, maximum, minimum / maximum)
>
> As there is no guarantee that every number will occur randomly, using a
> dictionary at first should be more efficient than a list:

Hmm, I'm not sure that's actually so. His code is aiming to get
'multiplier' values in each box; for any serious multiplier (he starts
with 10 in the main code), you can be fairly confident that every
number will come up at least once. The distribution of numbers won't
ever be perfectly even, but you'd expect it to be reasonably close. I
have a similar routine on my Dungeons & Dragons server; since a roll
of a twenty-sided die is crucial to most of the game, I have a simple
tester that demonstrates to people that the built-in dice roller is fair.
Its output looks like this:

> roll test
1: 10017
2: 10003
3: 9966
4: 9728
5: 10088
6: 9888
7: 10087
8: 9971
9: 10052
10: 10061
11: 10130
12: 9942
13: 10062
14: 10075
15: 10050
16: 9948
17: 9880
18: 10052
19: 9995
20: 10005
Standard deviation: 90.18 (0.90%)

This is about equivalent to test_random(20), and as you see, he and I
both picked a multiplier of 10K to use by default. (I call the
parameters "max" and "avg" rather than "length" and "multiplier", but
they have the exact same semantics.) The hard part is figuring out
what "looks reasonable"; a true RNG could legitimately produce nothing
but 7s for the entire run, it's just extremely unlikely.
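
Something like this minimal sketch reproduces that output (a
reconstruction for illustration, not the actual server code):

import random
from statistics import pstdev

def roll_test(sides=20, avg=10000):
    counts = [0] * sides
    for _ in range(sides * avg):
        counts[random.randint(0, sides - 1)] += 1
    for face, count in enumerate(counts, start=1):
        print("{0}: {1}".format(face, count))
    sd = pstdev(counts)
    print("Standard deviation: {0:.2f} ({1:.2%})".format(sd, sd / avg))

roll_test()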

ChrisA


Steven D'Aprano-8
In reply to this post by Cecil Westerhof
On Sun, 7 Jun 2015 04:27 pm, Cecil Westerhof wrote:

> I wrote a very simple function to test random:
>     def test_random(length, multiplier = 10000):
>         number_list = length * [0]
>         for i in range(length * multiplier):
>             number_list[random.randint(0, length - 1)] += 1
>         minimum = min(number_list)
>         maximum = max(number_list)
>         return (minimum, maximum, minimum / maximum)

Putting aside the timing aspects, your frequency calculations are not done
in a very Pythonic manner. A better way might be:


from collections import Counter
from random import randint

def test_random(length, multiplier = 10000):
    freq = Counter(
        randint(0, length - 1) for i in range(length * multiplier)
        )
    minimum = min(freq[i] for i in range(length))
    maximum = max(freq.values())
    return (minimum, maximum, minimum / maximum)


Note carefully the way I calculate the minimum value. That way, if by chance
some number has a frequency of zero, the minimum returned will be zero. For
the maximum, that doesn't matter, since any missing numbers will not be the
most frequent number.

Better than returning the min and max, you should return the counter object
itself. Before doing so, I make sure that any missing numbers are given a
value of zero.


def test_random(size, samples=1000000):
    """Return a sample of `samples` random numbers from 0 to size-1."""
    freq = Counter(randint(0, size-1) for i in range(samples))
    # Avoid missing numbers. (Likely for small samples.)
    freq.subtract(dict.fromkeys(range(size), 0))
    return freq


Now once you have the frequencies, you can look at many interesting
statistics, and verify that you have a valid sample.

py> f = test_random(100)
py> sum(f.values()) == 1000000  # Total number of samples is right?
True
py> sorted(f) == list(range(100))  # Any missing numbers?
True
py> max(f) - min(f)  # statistical range
99
py> min(f.values())  # Lowest frequency?
9775
py> max(f.values())  # And the highest?
10313

The ratio of those two frequencies isn't very interesting, but we can
calculate it:

py> min(f.values())/max(f.values())
0.9478328323475226

For a uniformly distributed population, that ratio will be 1.0 exactly, but
for any finite sample, it could be any number between 0 and 1. We expect
that if we have a large random sample, the ratio should approach 1, but
moderate deviations from that are expected. In fact, we should be surprised
if the ratio is exactly 1, and suspect shenanigans if it is.

We can extract the mode, and its frequency:

py> f.most_common(1)  # mode
[(85, 10313)]

In Python 3.4 or better, we can look at some more statistics:

py> import statistics
py> statistics.median(f.elements())
50.0
py> statistics.mean(f.elements())
49.531949
py> statistics.stdev(f.elements())
28.873178122988367

And verify that the mode is correct:

py> statistics.mode(f.elements())
85

We can see what happens if the data is fiddled with:

py> g = f.copy()
py> g[42] += g[23]
py> g[23] = 0
py> statistics.median(g.elements())
50.0
py> statistics.mean(g.elements())
49.7263
py> statistics.stdev(g.elements())
28.757647388343212
py> g.most_common(3)
[(42, 20258), (85, 10313), (56, 10208)]

The median is unchanged, the mean shifts slightly higher, and the standard
deviation increases. But as you can see, these are not particularly
powerful tests of randomness: only the mode shows an obvious deviation from
uniformity.


> But what are the ranges you
> would expect with a good random function. Then it could be used to
> test a random function.


Python's pseudo-random number generator is based on the Mersenne Twister.
This is one of the most high-quality and extensively tested PRNGs
available, with a period of 2**19937-1 before it repeats. Its randomness
properties are very good indeed. (Although of course that isn't to say that
there aren't bugs in the Python implementation.)

The Mersenne Twister is not suitable for cryptographic work, but apart
from that, it is pretty much as indistinguishable from random as any
deterministic computer program can be.
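
If you do need unpredictability, the standard library offers an
alternative (a minimal sketch): random.SystemRandom draws from the OS
entropy pool instead of the Mersenne Twister.

import random

# SystemRandom wraps os.urandom(); it is not seedable or reproducible,
# which is what you want for security-sensitive randomness.
sysrand = random.SystemRandom()
print(sysrand.randint(0, 99))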



--
Steven



Christian Gollwitzer
In reply to this post by Cecil Westerhof
Am 07.06.15 um 08:27 schrieb Cecil Westerhof:
> I wrote a very simple function to test random:
>      def test_random(length, multiplier = 10000):
>          number_list = length * [0]
>          for i in range(length * multiplier):
>              number_list[random.randint(0, length - 1)] += 1
>          minimum = min(number_list)
>          maximum = max(number_list)
>          return (minimum, maximum, minimum / maximum)
>

> It shows that when the first parameter increases the deviation
> increases and when the second parameter increases the deviation
> decreases. Exactly what you would expect. But what are the ranges you
> would expect with a good random function. Then it could be used to
> test a random function.

Random number generators (RNGs or PRNGs) are usually tested using similar
methods, where the outcome can be assigned a probability distribution.
For example, Diehard is a famous suite of tests for RNGs.

http://en.wikipedia.org/wiki/Diehard_tests

You are only checking for uniform distribution. This cannot detect
correlation, for instance if the RNG simply returned increasing numbers.
A better check for uniformity would be the chi-square test or
Kolmogorov-Smirnov. There are tables which relate the deviations to
significance.

        Christian


Steven D'Aprano-8
In reply to this post by Steven D'Aprano-8
On Sun, 7 Jun 2015 10:52 pm, Steven D'Aprano wrote:

> The median is unchanged, the mean shifts slightly higher, and the standard
> deviation increases. But as you can see, these are not particularly
> powerful tests of randomness: only the mode shows an obvious deviation
> from uniformity.

Oh, I forgot: we can look at the frequencies themselves as well. If our
sample were perfectly uniformly distributed, every element would have a
frequency of exactly 10000. We don't necessarily expect this from a random
sample, but if we're too far from that, we should be concerned. Here are
the results from the fair sample:

py> statistics.mean(f.values())
10000.0
py> statistics.median(f.values())
9995.0
py> statistics.stdev(f.values())
91.8027089673141

If I do the same thing with the biased sample, the standard deviation stands
out like a sore thumb:

py> statistics.mean(g.values())
10000.0
py> statistics.median(g.values())
9993.0
py> statistics.stdev(g.values())
1442.527341617181



--
Steven



Peter Otten
In reply to this post by Steven D'Aprano-8
Steven D'Aprano wrote:


>> I wrote a very simple function to test random:
>>     def test_random(length, multiplier = 10000):
>>         number_list = length * [0]
>>         for i in range(length * multiplier):
>>             number_list[random.randint(0, length - 1)] += 1
>>         minimum = min(number_list)
>>         maximum = max(number_list)
>>         return (minimum, maximum, minimum / maximum)
>
> Putting aside the timing aspects, your frequency calculations are not done
> in a very Pythonic manner.

I would agree if the frequency table were sparse, i.e. many indices with

number_list[index] == 0

but that's not the case with on average 10000 hits per index.

> A better way might be:

I'm too lazy to measure, but this will likely be a tad slower. Did you mean
to convey this by "Putting aside the timing aspects"?

> from collections import Counter
> from random import randint
>
> def test_random(length, multiplier = 10000):
>     freq = Counter(
>         randint(0, length - 1) for i in range(length * multiplier)
>         )
>     minimum = min(freq[i] for i in range(length))

How about

      if len(freq) < length:
          minimum = 0
      else:
          minimum = min(freq.values())

Not as an optimisation (replacing randint() with randrange() is probably
more effective in that department), but because it's closer to being self-
explanatory.

>     maximum = max(freq.values())
>     return (minimum, maximum, minimum / maximum)
 



Thomas 'PointedEars' Lahn
In reply to this post by Thomas 'PointedEars' Lahn
Chris Angelico wrote:

> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>> Cecil Westerhof wrote:
>>> I wrote a very simple function to test random:
>>>     def test_random(length, multiplier = 10000):
>>>         number_list = length * [0]
>>>         for i in range(length * multiplier):
>>>             number_list[random.randint(0, length - 1)] += 1
>>>         minimum = min(number_list)
>>>         maximum = max(number_list)
>>>         return (minimum, maximum, minimum / maximum)
>>
>> As there is no guarantee that every number will occur randomly, using a
>> dictionary at first should be more efficient than a list:
>
> Hmm, I'm not sure that's actually so. His code is aiming to get
> 'multiplier' values in each box; for any serious multiplier (he starts
> with 10 in the main code), you can be fairly confident that every
> number will come up at least once.

The wording shows a common misconception: that random distribution would
mean that it is guaranteed or more probable that every element of the set
will occur at least once.  It is another common misconception that
increasing the number of trials would increase the probability of that
happening.  But that is not so.

The law of large numbers only says that, as you increase the number of
trials, the relative frequency *approaches* the probability for each
value of the random variable, or IOW "the average of the results
obtained from a large number of trials should be close to the expected
value, and will *tend to become closer* as more trials are performed."
(<http://en.wikipedia.org/wiki/Law_of_large_numbers>; emphasis by me)

> [...] a true RNG could legitimately produce nothing but 7s for the entire
> run, it's just extremely unlikely.

That reasoning is precisely a result of the misconception described above.
Because people think that every value must occur, they do not think it
possible that (much) repetition could occur with a (pseudo-)random
generator, and when they want to mince words they say "(extremely)
unlikely" instead.  For example, when people see

  6 5 8 7 9 3 7 8 4 7 5 6 8 8 1 2 8 3 5 7 5 4 1 2 4 8 8 7 5 1

and are told that this is a random sequence, they find it hard to believe.
They think something like: "Look at all those repeated 8s, and '7, 5'
occurs twice.  4 occurs more often than 2, and there are many more 5s than
1s.  That cannot possibly be random!"

Yes, it *can*.  I have just produced it with

#------------------------------------------------------------------
import random
print(" ".join([str(random.randint(1, 9)) for i in range(0, 30)]))
#------------------------------------------------------------------

But if you think this *Mersenne Twister* PRNG, which "generates numbers
with nearly uniform distribution and a large period", is flawed, use a
proper die as RNG and a sheet of paper to record the outcomes, and do the
experiment.  The outcome is not going to be different.

If the distribution is even and the game is fair, every outcome *always* has
the same probability of occurring.  As a result, every sequence of outcomes
of equal length *always* has the same probability of occurring, and the
probability for a particular sequence of equal length does _not_ increase or
decrease based on the number of occurrences of previous outcomes.  Those are
*independent* events.

See also:
<http://www.teacherlink.org/content/math/interactive/probability/numbersense/misconceptions/home.html>,
in particular the section "Representativeness".

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


Chris Angelico
On Mon, Jun 8, 2015 at 1:51 AM, Thomas 'PointedEars' Lahn
<PointedEars at web.de> wrote:

> Chris Angelico wrote:
>
>> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
>> <PointedEars at web.de> wrote:
>>> Cecil Westerhof wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>>
>>> As there is no guarantee that every number will occur randomly, using a
>>> dictionary at first should be more efficient than a list:
>>
>> Hmm, I'm not sure that's actually so. His code is aiming to get
>> 'multiplier' values in each box; for any serious multiplier (he starts
>> with 10 in the main code), you can be fairly confident that every
>> number will come up at least once.
>
> The wording shows a common misconception: that random distribution would
> mean that it is guaranteed or more probable that every element of the set
> will occur at least once.  It is another common misconception that
> increasing the number of trials would increase the probability of that
> happening.  But that is not so.

The greater the multiplier, the lower the chance that any element will
have no hits. With uniform distribution, a length of 10, and a
multiplier of 10, there are 100 attempts with a 90% chance each that
any given number will be avoided - which works out to 0.9**100 ==
2.6561398887587544e-05 probability that (say) there'll be no 4s in the
list. Invert that and raise to the 10th power, and you get a
probability of 0.9997344177567317 that there'll be at least one in
every bucket. Raise the multiplier to 100, and the probability of a
single whiff becomes 1.7478712517226947e-46; invert and raise to the
tenth power, and it becomes closer to certainty than IEEE double
precision can represent. Raise the length to 100 and the numbers get
lower again; with multiplier 10, probability 0.9956920878572284 of
having one in every bucket (that's a half a percent chance of a zero
anywhere), and at multiplier 100, still underflows to certainty.
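
Those figures are easy to reproduce (a sketch using the same
independence approximation as above):

def p_all_buckets_hit(length, multiplier):
    # Chance that one given bucket gets no hits at all...
    p_miss_one = (1 - 1 / length) ** (length * multiplier)
    # ...then treat the `length` buckets as approximately independent.
    return (1 - p_miss_one) ** length

print(p_all_buckets_hit(10, 10))    # 0.99973...
print(p_all_buckets_hit(10, 100))   # rounds to 1.0
print(p_all_buckets_hit(100, 10))   # 0.99569...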

But you'll notice that I wasn't actually talking about certainty here.
I was talking about confidence, at levels sufficient to make data-type
decisions on. Sure, there's no guarantee that every number will occur;
but if there's a 0.4% chance that any number will be omitted, I think
the list is going to work out more efficient.

You'll notice that some of the other posts have been concerned more
about correctness (for instance, using collections.Counter and then
making sure there's a zero slot for every element - otherwise empty
slots would be omitted), and then they _do_ acknowledge the chance
that something will underflow. But with the numbers the OP gave, I
would be fully satisfied with optimizing for the case where every
bucket gets at least something.

ChrisA


Thomas 'PointedEars' Lahn
In reply to this post by Steven D'Aprano-8
Peter Otten wrote:

> Steven D'Aprano wrote:
>>> I wrote a very simple function to test random:
>>>     def test_random(length, multiplier = 10000):
>>>         number_list = length * [0]
>>>         for i in range(length * multiplier):
>>>             number_list[random.randint(0, length - 1)] += 1
>>>         minimum = min(number_list)
>>>         maximum = max(number_list)
>>>         return (minimum, maximum, minimum / maximum)
>>
>> Putting aside the timing aspects, your frequency calculations are not
>> done in a very Pythonic manner.
>
> I would agree if the frequency table were sparse, i. e. many indices with
>
> number_list[index] == 0
>
> but that's not the case with on average 10000 hits per index.

A common misconception.

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


Thomas 'PointedEars' Lahn
In reply to this post by Thomas 'PointedEars' Lahn
Chris Angelico wrote:

> On Mon, Jun 8, 2015 at 1:51 AM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>> Chris Angelico wrote:
>>
>>> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
>>> <PointedEars at web.de> wrote:
>>>> Cecil Westerhof wrote:
>>>>> I wrote a very simple function to test random:
>>>>>     def test_random(length, multiplier = 10000):
>>>>>         number_list = length * [0]
>>>>>         for i in range(length * multiplier):
>>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>>         minimum = min(number_list)
>>>>>         maximum = max(number_list)
>>>>>         return (minimum, maximum, minimum / maximum)
>>>>
>>>> As there is no guarantee that every number will occur randomly, using a
>>>> dictionary at first should be more efficient than a list:
>>>
>>> Hmm, I'm not sure that's actually so. His code is aiming to get
>>> 'multiplier' values in each box; for any serious multiplier (he starts
>>> with 10 in the main code), you can be fairly confident that every
>>> number will come up at least once.
>>
>> The wording shows a common misconception: that random distribution would
>> mean that it is guaranteed or more probable that every element of the set
>> will occur at least once.  It is another common misconception that
>> increasing the number of trials would increase the probability of that
>> happening.  But that is not so.
>
> The greater the multiplier, the lower the chance that any element will
> have no hits.

Wrong.

> [ex falso quodlibet]

--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


Chris Angelico
On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
<PointedEars at web.de> wrote:
>> The greater the multiplier, the lower the chance that any element will
>> have no hits.
>
> Wrong.
>
>> [ex falso quodlibet]

Huh. Do you want to explain how, mathematically, I am wrong, or do you
want to join the RUE in my ignore list?

ChrisA


Peter Otten
In reply to this post by Thomas 'PointedEars' Lahn
Thomas 'PointedEars' Lahn wrote:

> Peter Otten wrote:
>
>> Steven D'Aprano wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>>
>>> Putting aside the timing aspects, your frequency calculations are not
>>> done in a very Pythonic manner.
>>
>> I would agree if the frequency table were sparse, i. e. many indices with
>>
>> number_list[index] == 0
>>
>> but that's not the case with on average 10000 hits per index.
>
> A common misconception.

A pointless remark. Not counting the ears ;)




Ian Kelly-2
In reply to this post by Chris Angelico
On Sun, Jun 7, 2015 at 10:44 AM, Chris Angelico <rosuav at gmail.com> wrote:

> On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>>> The greater the multiplier, the lower the chance that any element will
>>> have no hits.
>>
>> Wrong.
>>
>>> [ex falso quodlibet]
>
> Huh. Do you want to explain how, mathematically, I am wrong, or do you
> want to join the RUE in my ignore list?

My best speculation is that he's either objecting to the generality of
your statement (it's false if the probability of some element
occurring is zero or eventually degrades to zero), or misreading the
word "multiplier" to the conclusion that the value of each element is
being multiplied rather than the number of trials. Or trolling; I
suppose that's always an option too.


Chris Angelico
On Mon, Jun 8, 2015 at 3:07 AM, Ian Kelly <ian.g.kelly at gmail.com> wrote:

> On Sun, Jun 7, 2015 at 10:44 AM, Chris Angelico <rosuav at gmail.com> wrote:
>> On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
>> <PointedEars at web.de> wrote:
>>>> The greater the multiplier, the lower the chance that any element will
>>>> have no hits.
>>>
>>> Wrong.
>>>
>>>> [ex falso quodlibet]
>>
>> Huh. Do you want to explain how, mathematically, I am wrong, or do you
>> want to join the RUE in my ignore list?
>
> My best speculation is that he's either objecting to the generality of
> your statement (it's false if the probability of some element
> occurring is zero or eventually degrades to zero), or misreading the
> word "multiplier" to the conclusion that the value of each element is
> being multiplied rather than the number of trials. Or trolling; I
> suppose that's always an option too.

My first thought was your first option, which is why I specifically
responded with an explanation of how data-type selection can viably be
about expectations rather than certainties, but he responded with a
one-word "Wrong", so I must have been answering the wrong question.
I've no idea about the misreading of "multiplier". A fourth
possibility is that mathematics works differently for him and for us,
which I suppose is possible; when I visited sci.math a while ago, I
found some people for whom everything I'd learned in grade school was
clearly wrong, and they were doing their best to enlighten the world
about the new truths of mathematics that they'd found.

ChrisA


C.D. Reimer
On 6/7/2015 10:20 AM, Chris Angelico wrote:
> A fourth
> possibility is that mathematics works differently for him and for us,
> which I suppose is possible; when I visited sci.math a while ago, I
> found some people for whom everything I'd learned in grade school was
> clearly wrong, and they were doing their best to enlighten the world
> about the new truths of mathematics that they'd found.

I had the unfortunate luck of taking "Harvard Calculus" in college
(circa 1995). The textbook was nothing but word problems from end to
end. The goal was to get the students away from the symbolic thinking of
traditional calculus and into thinking about real-world problems that use
calculus. (Never mind that the majority of students don't use calculus
after they get out of school.) I bailed out after two weeks. Taking the
class at 7:30AM probably didn't help.

Chris R.


Thomas 'PointedEars' Lahn
In reply to this post by Thomas 'PointedEars' Lahn
Chris Angelico wrote:

> On Mon, Jun 8, 2015 at 2:36 AM, Thomas 'PointedEars' Lahn
> <PointedEars at web.de> wrote:
>>> The greater the multiplier, the lower the chance that any element will
>>> have no hits.
>> Wrong.
>>
>>> [ex falso quodlibet]
>
> Huh. Do you want to explain how, mathematically, I am wrong, or do you
> want to join the RUE in my ignore list?

I already did; you have overlooked it.  In a nutshell, the probability of

  1 1 1 1 1 1 1 1 1 1 1

is *the same* as that of

  1 2 3 4 5 6 7 8 9 1 2

and the same as that of

  8 3 6 3 1 2 6 8 2 1 6.

If the set to choose from is the integers from 1 to 9, then *each* of
those sequences has *the same* probability: (1/9)^11 ≈ 3.1866355 × 10^-11.
Repeating that random experiment one more time, the probability of

  1 1 1 1 1 1 1 1 1 1 1 2

is the same as that of

  1 2 3 4 5 6 7 8 9 1 2 2

and the same as that of

  8 3 6 3 1 2 6 8 2 1 6 2

(that the next outcome is a 2): (1/9)^12 ≈ 3.5407062 × 10^-12.  And the
probability of

  1 1 1 1 1 1 1 1 1 1 1 1

(only 1s) is the *same* as that of

  1 1 1 1 1 1 1 1 1 1 1 2

(only 1s and one 2) and the same as that of

  1 1 1 1 1 1 1 1 1 1 1 3

(only 1s and a 3) and that of

  8 3 6 3 1 2 6 8 2 1 6 4

(that the next outcome is a 4, and there are no 5s at all).

AISB, those are *independent* events; the number of occurrences of an
outcome before *does not matter* for the probability of the next one.  And
so the probability of getting a certain number does _not_ change depending
on the number of times you repeat the experiment.  It is always the same;
in this example, it is always 1/9.  And the probability of _not_ getting a
certain number is always the same; in this example, it is always
1 - 1/9 = 8/9.


So, I am sorry to tell you this, but you do _not_ understand probability.  
And you *cannot* understand it intuitively, like you tried to.  Probability
is counter-intuitive.

That should not surprise or worry you.  Because, as I and others have
pointed out in this thread, the majority of the human population,
including the OP, does not understand probability until it has been
explained to them.  I am not even sure I understand probability in every
instance; or rather, I am sure that *intuitively* I do not.

It is why humans play the lottery using their "lucky numbers" whereas
other numbers have exactly the same probability of being drawn; why they
think a baby boy "is due" when they have two girls already; why they are
surprised if two people in the room have the same birthday; why *in the
short run* you can make a fortune by running a casino even if the tables
are not rigged; why the "gate three" trick works (for a while); why they
pay money for horoscopes and soothsayers, not recognizing that their
wording is so ambiguous that it is highly probable to apply to everyone;
and why many of them believe in a deity or fate.

A big part of the reason is psychology, how the human mind works: the
human mind has evolved to recognize patterns.  It is an evolutionary
advantage to be able to recognize patterns, to tell a mammoth from a
sabre-tooth lion at a distance, a poisonous plant from an edible one, and
an X from a U.  So humans try to see patterns everywhere, even in pure
randomness: clouds, texts, numbers, you name it.  And when they are told
that something is random, and they find patterns regardless, they do not
believe that it is random.

An extension of that misconception is emphasized by an anecdote (which may
be apocryphal) told about Richard Feynman (I heard it from Lawrence Krauss
in "A Universe from Nothing"; he can tell it in a much funnier way than
I am able to reproduce here [1]):

  Richard Feynman used to go up to people all the time and he'd say:

    "You won't believe what happened to me today!
     You won't believe what happened to me today!"

  And people would say:

    "What?"

  And he would reply:

    "Absolutely nothing!"

Because humans are evolutionarily driven to look for patterns, and they
believe that everything that happens to them is important and has
significance in "the greater scheme of things" (whose mere existence
is a misconception, too).

HTH

________
[1] Lawrence Krauss (2009). "A Universe from Nothing". The Marriott Hotel,
    Burbank. 1:03:12.  <https://youtu.be/-EilZ4VY5Vs?t=3792>
--
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


Steven D'Aprano-8
In reply to this post by Thomas 'PointedEars' Lahn
On Mon, 8 Jun 2015 01:51 am, Thomas 'PointedEars' Lahn wrote:

> Chris Angelico wrote:
>
>> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
>> <PointedEars at web.de> wrote:
>>> Cecil Westerhof wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>>
>>> As there is no guarantee that every number will occur randomly, using a
>>> dictionary at first should be more efficient than a list:
>>
>> Hmm, I'm not sure that's actually so. His code is aiming to get
>> 'multiplier' values in each box; for any serious multiplier (he starts
>> with 10 in the main code), you can be fairly confident that every
>> number will come up at least once.
>
> The wording shows a common misconception: that random distribution would
> mean that it is guaranteed or more probable that every element of the set
> will occur at least once.  It is another common misconception that
> increasing the number of trials would increase the probability of that
> happening.  But that is not so.

You are badly mistaken.

Of course you are correct that a random distribution does not *guarantee*
that every element of the set is seen at least once. But as soon as you
call it "a common misconception" that increasing the number of trials
increases the probability of seeing every element, you are simply mistaken.
It's not a misconception at all, it is true. See below for calculations.


[...]
> If the distribution is even and the game is fair, every outcome *always*
> has
> the same probability of occurring.  As a result, every sequence of
> outcomes of equal length *always* has the same probability of occurring,
> and the probability for a particular sequence of equal length does _not_
> increase or
> decrease based on the number of occurrences of previous outcomes.  Those
> are *independent* events.

You are confused. Think about it more carefully.

To make it simple, let's randomly generate from the set 1, 2, 3, and do just
three samples. That's small enough to enumerate all the possibilities:

111 112 113 121 122 123 131 132 133
211 212 213 221 222 223 231 232 233
311 312 313 321 322 323 331 332 333

Each event (111 etc) has a probability of 1/27, but that's not important.
We're comparing the probability of all elements being seen, versus at least
one element being missed. I'm going to tag the events that contain all
three elements:

111 112 113 121 122 123* 131 132* 133
211 212 213* 221 222 223 231* 232 233
311 312* 313 321* 322 323 331 332 333

So there are six events out of 27, or 2/9, where all elements are seen in a
trial, versus 7/9 where at least one element will have a count of zero.

Putting it another way:

Pr(at least one element has zero count)
= Pr(1 has zero count) + Pr(2 has zero count) + Pr(3 has zero count)
  - Pr(1 and 2 both have zero counts)
  - Pr(1 and 3 both have zero counts)
  - Pr(2 and 3 both have zero counts)

= (2/3)**3 * 3 - (1/3)**3 * 3

which gives us 7/9 as confirmed by direct enumeration of all the
possibilities.

With four samples per trial:
Pr(at least one element has zero count) = 5/9

With five samples:
Pr(at least one element has zero count) = 31/81

Ten samples:
Pr(at least one element has zero count) = 341/6561
or approximately 0.05

Twenty samples:
Pr(at least one element has zero count) = 349525/387420489
or approximately 0.0009

For 100 samples:
Pr(at least one element has zero count) = 7.4e-18 (approx)


which means that with 100 samples from the set (1, 2, 3), if you could run
one million trials every second, it would on average take you almost 4300
years to see a trial that had a zero count for one or more of the elements.

The probability of any element *not* showing up in a random trial with N
samples decreases rapidly as N increases. Elements with zero count are only
a factor when the number of samples is small, or the number of elements is
large.
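
All of those figures can be checked with a few lines of
inclusion-exclusion (a sketch for k equally likely elements and n
samples; math.comb needs Python 3.8+):

from math import comb

def p_some_missing(k, n):
    # Pr(at least one element gets a zero count in n samples), by
    # inclusion-exclusion over which elements are missing.
    return sum((-1) ** (j + 1) * comb(k, j) * ((k - j) / k) ** n
               for j in range(1, k + 1))

print(p_some_missing(3, 3))    # 7/9 = 0.777...
print(p_some_missing(3, 10))   # 341/6561, approximately 0.05
print(p_some_missing(3, 100))  # approximately 7.4e-18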




--
Steven



Steven D'Aprano-8
In reply to this post by Thomas 'PointedEars' Lahn
On Mon, 8 Jun 2015 02:36 am, Thomas 'PointedEars' Lahn wrote:

> Chris Angelico wrote:

>> The greater the multiplier, the lower the chance that any element will
>> have no hits.
>
> Wrong.

No, Chris is correct. The "multiplier" increases the number of samples. The
larger the number of samples, the lower the chance that any element is not
represented.

If my previous post didn't convince you, consider an even simpler random
distribution: tossing a fair coin. The probability of getting a head is
exactly 1/2 whether you toss the coin once or a thousand times. But if you
toss the coin once, your chances of getting "no heads" is 1/2. If you toss
it a thousand times, your chances of getting "no heads" is 1/2**1000.
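
Or, in code (a trivial sketch):

def p_no_heads(tosses):
    # Each independent fair toss misses "heads" with probability 1/2.
    return 0.5 ** tosses

print(p_no_heads(1))     # 0.5
print(p_no_heads(1000))  # about 9.33e-302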



--
Steven

