File 2967-Write-a-section-about-range-capping.patch of Package erlang

From 76686e648a85ba0e0795c33ef18dd8534d2bf7da Mon Sep 17 00:00:00 2001
From: Raimo Niskanen <raimo@erlang.org>
Date: Wed, 11 May 2022 14:29:28 +0200
Subject: [PATCH 7/8] Write a section about range capping

Describe different approaches for how to generate numbers in a range
related to the Niche algorithms API, and point to that from
the algorithm descriptions.
---
 lib/stdlib/doc/src/rand.xml | 225 +++++++++++++++++++++++++++++++++---
 1 file changed, 209 insertions(+), 16 deletions(-)

diff --git a/lib/stdlib/doc/src/rand.xml b/lib/stdlib/doc/src/rand.xml
index 8b9b924366..471a23f6b9 100644
--- a/lib/stdlib/doc/src/rand.xml
+++ b/lib/stdlib/doc/src/rand.xml
@@ -806,6 +806,175 @@ end.</pre>
 <fsdescription>
 <marker id="niche_algorithms"/>
 <title>Niche algorithms API</title>
+ 
+ This section contains special purpose algorithms
+ that do not use the
+ <seeerl marker="#plug_in_api">plug-in framework API</seeerl>,
+ for example for speed reasons.
+ 
+ 
+ Since these algorithms lack the plug-in framework support,
+ generating numbers in a range other than the
+ generator's own generated range may become a problem.
+ 
+ 
+ There are at least 4 ways to do this, assuming that
+ the range is less than the generator's range:
+ 
+ <taglist>
+ <tag>Modulo</tag>
+ <item>
+ 
+ To generate a number <c>V</c> in the range 0..<c>Range</c>-1:
+ 
+ <list type="bulleted">
+ <item>Generate a number <c>X</c>.</item>
+ <item>
+ Use <c>V&nbsp;=&nbsp;X&nbsp;rem&nbsp;Range</c> as your value.
+ </item>
+ </list>
+ 
+ This method uses <c>rem</c>, that is, the remainder of
+ an integer division, which is a slow operation.
+ 
+ 
+ Low bits from the generator propagate straight through
+ to the generated value, so if the generator has got
+ weaknesses in the low bits this method propagates
+ them too.
+ 
+ 
+ If <c>Range</c> is not a divisor of the generator range,
+ the generated numbers have a bias.
+ Example:
+ 
+ 
+ Say the generator generates a byte, that is,
+ the generator range is 0..255,
+ and the desired range is 0..99 (<c>Range=100</c>).
+ Then there are 3 generator outputs that produce the value 0,
+ namely 0, 100 and 200. But there are only
+ 2 generator outputs that produce the value 99,
+ namely 99 and 199. So the probability for
+ a value <c>V</c> in 0..55 is 3/2 times
+ the probability for the other values 56..99.
+ 
+ 
+ If <c>Range</c> is much smaller than the generator range,
+ then this bias gets hard to detect. The rule of thumb is
+ that if <c>Range</c> is smaller than the square root
+ of the generator range, the bias is small enough.
+ Example:
+ 
+ 
+ A byte generator when <c>Range=20</c>.
+ There are 12 (<c>256&nbsp;div&nbsp;20</c>)
+ possibilities to generate the highest numbers
+ and one more to generate a number
+ <c>V</c>&nbsp;&lt;&nbsp;16 (<c>256&nbsp;rem&nbsp;20</c>).
+ So a low number is 13/12 times as probable
+ as a high one. To detect that difference
+ with some confidence you would need to generate
+ a lot more numbers than the generator range,
+ 256 in this small example.
+ 
+ </item>
+ <tag>Truncated multiplication</tag>
+ <item>
+ 
+ To generate a number <c>V</c> in the range 0..<c>Range</c>-1,
+ when you have a generator with the range
+ 0..2^<c>Bits</c>-1:
+ 
+ <list type="bulleted">
+ <item>Generate a number <c>X</c>.</item>
+ <item>
+ Use <c>V&nbsp;=&nbsp;X*Range&nbsp;bsr&nbsp;Bits</c>
+ as your value.
+ </item>
+ </list>
+ 
+ If the multiplication <c>X*Range</c> creates a bignum
+ this method becomes very slow.
+ 
+ 
+ High bits from the generator propagate through
+ to the generated value, so if the generator has got
+ weaknesses in the high bits this method propagates
+ them too.
+ 
+ 
+ If <c>Range</c> is not a divisor of the generator range,
+ the generated numbers have a bias,
+ pretty much as for the Modulo method above.
+ 
+ </item>
+ <tag>Shift or mask</tag>
+ <item>
+ 
+ To generate a number in the range 0..2^<c>RBits</c>-1,
+ when you have a generator with the range 0..2^<c>Bits</c>-1:
+ 
+ <list type="bulleted">
+ <item>Generate a number <c>X</c>.</item>
+ <item>
+ Use <c>V&nbsp;=&nbsp;X&nbsp;band&nbsp;((1&nbsp;bsl&nbsp;RBits)-1)</c>
+ or <c>V&nbsp;=&nbsp;X&nbsp;bsr&nbsp;(Bits-RBits)</c>
+ as your value.
+ </item>
+ </list>
+ 
+ Masking with <c>band</c> preserves the low bits,
+ and right shifting with <c>bsr</c> preserves the high bits,
+ so if the generator has got weaknesses in its high or low
+ bits, choose the operator that discards the weak ones.
+ 
+ 
+ If the generator has got a range that is not a power of 2
+ and this method is used anyway, it introduces bias
+ in the same way as for the Modulo method above.
+ 
+ </item>
+ <tag>Rejection</tag>
+ <item>
+ <list type="bulleted">
+ <item>Generate a number <c>X</c>.</item>
+ <item>
+ If <c>X</c> is in the range, use <c>V&nbsp;=&nbsp;X</c>
+ as your value, otherwise reject it and repeat.
+ </item>
+ </list>
+ 
+ In theory it is not certain that this method
+ will ever complete, but in practice you ensure
+ that the probability of rejection is low.
+ Then the probability for yet another iteration
+ decreases exponentially so the expected mean
+ number of iterations will often be between 1 and 2.
+ Also, since the base generator is a full length generator,
+ a value that will break the loop must eventually
+ be generated.
+ 
+ </item>
+ </taglist>
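As an illustration only, the four recipes above could be sketched in Erlang as follows. The module name `range_cap`, the function names, and the state-passing `Next` fun (which takes a generator state and returns `{X, NewState}`) are assumptions made for this sketch; none of them is part of the `rand` API.

```erlang
-module(range_cap).
-export([modulo/2, trunc_mul/3, mask/2, shift/3, reject/3]).

%% Modulo: V = X rem Range.
%% The low bits of X pass straight through to V.
modulo(X, Range) when is_integer(X), X >= 0, is_integer(Range), Range > 0 ->
    X rem Range.

%% Truncated multiplication: X is in 0..2^Bits-1.
%% The high bits of X dominate V. Slow if X*Range becomes a bignum.
trunc_mul(X, Range, Bits) when X >= 0, X < (1 bsl Bits), Range > 0 ->
    (X * Range) bsr Bits.

%% Mask: keep the RBits lowest bits of X.
mask(X, RBits) when X >= 0, RBits >= 0 ->
    X band ((1 bsl RBits) - 1).

%% Shift: keep the RBits highest bits of a Bits-bit value X.
shift(X, Bits, RBits) when X >= 0, RBits =< Bits ->
    X bsr (Bits - RBits).

%% Rejection: draw values until one falls in 0..Range-1.
%% Next is a hypothetical fun(State) -> {X, NewState}.
reject(Next, State0, Range) when is_function(Next, 1), Range > 0 ->
    case Next(State0) of
        {X, State1} when X < Range -> {X, State1};
        {_, State1} -> reject(Next, State1, Range)
    end.
```

For a byte generator, `range_cap:modulo(200, 100)` and `range_cap:trunc_mul(0, 100, 8)` both map to 0, but they reach that value from different generator outputs, which is why the two methods are sensitive to weaknesses in different bits.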
+ 
+ These methods can be combined, such as using the Modulo
+ method and only if the generator value would create bias
+ use Rejection. Or using Shift or mask
+ to reduce the size of a generator value so that
+ Truncated multiplication will not create a bignum.
+ 
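The two combinations mentioned above might look like the sketch below. It assumes a generator with the range 0..2^`Bits`-1 that is driven through a hypothetical state-passing fun `Next(State) -> {X, NewState}`; the module and function names are invented for illustration.

```erlang
-module(range_combo).
-export([unbiased/4, shift_mul/4]).

%% Modulo + Rejection: reject only the generator values that would
%% create bias, that is, the incomplete last block of Range values
%% at the top of the generator range.
unbiased(Next, State0, Range, Bits) when is_function(Next, 1), Range > 0 ->
    GenRange = 1 bsl Bits,
    Limit = GenRange - (GenRange rem Range),  % a multiple of Range
    unbiased_loop(Next, State0, Range, Limit).

unbiased_loop(Next, State0, Range, Limit) ->
    case Next(State0) of
        {X, State1} when X < Limit -> {X rem Range, State1};
        {_, State1} -> unbiased_loop(Next, State1, Range, Limit)
    end.

%% Shift + Truncated multiplication: keep only the UseBits highest
%% bits of a Bits-bit value X, so that the product stays small.
%% The caller picks UseBits so that Range * 2^UseBits is not a bignum.
shift_mul(X, Range, Bits, UseBits) when X >= 0, UseBits =< Bits, Range > 0 ->
    (Range * (X bsr (Bits - UseBits))) bsr UseBits.
```

With a byte generator and `Range = 100`, `Limit` becomes 200, so the outputs 200..255 are rejected and each value in 0..99 is produced by exactly 2 generator outputs. The expected number of iterations is then 256/200, that is, between 1 and 2.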
+ 
+ The recommended way to generate a floating point number
+ (IEEE 754 double, that has got a 53-bit mantissa)
+ in the range 0..1, that is
+ 0.0&nbsp;=&lt;&nbsp;<c>V</c>&nbsp;&lt;&nbsp;1.0
+ is to generate a 53-bit number <c>X</c> and then use
+ <c>V&nbsp;=&nbsp;X&nbsp;*&nbsp;(1.0/(1&nbsp;bsl&nbsp;53))</c>
+ as your value. This will create a value of the form
+ <c>N</c>*2^-53 with equal probability for every
+ possible <c>N</c> for the range.
+ 
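A minimal sketch of the floating point recipe above. The module and function names are made up for illustration, and taking the 53 highest bits of a 64-bit generator value in `from64/1` is an assumption of this sketch; whether the high or the low bits are preferable depends on the generator in question.

```erlang
-module(float_from_bits).
-export([float53/1, from64/1]).

%% X is a 53-bit integer, 0 =< X < 2^53.
%% Returns X * 2^-53, a float V with 0.0 =< V < 1.0.
float53(X) when is_integer(X), X >= 0, X < (1 bsl 53) ->
    X * (1.0 / (1 bsl 53)).

%% Assumption: X64 is a 64-bit generator value whose high bits are
%% good enough; shift out the 11 lowest bits to keep 53 bits.
from64(X64) when is_integer(X64), X64 >= 0, X64 < (1 bsl 64) ->
    float53(X64 bsr 11).
```

Since every 53-bit integer is exactly representable as a double and 2^-53 is a power of 2, the multiplication does not round, so the largest possible result is 1.0 - 2^-53.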
 </fsdescription>
 <func>
 <name name="splitmix64_next" arity="1" since="OTP 25.0"/>
@@ -861,6 +1030,11 @@ end.</pre>
 on a selected range, nor in generating a floating point number.
 It is easy to accidentally mess up the fairly good
 statistical properties of this generator when doing either.
+ See the recipes at the start of this
+ <seeerl marker="#niche_algorithms">
+ Niche algorithms API
+ </seeerl>
+ description.
 Note also the caveat about weak low bits that
 this generator suffers from.
 The generator is exported in this form
@@ -917,8 +1091,8 @@ end.</pre>
 the generator state.
 
 
- To create an output value, the quality improves much
- if the state is scrambled.
+ The quality of the output value improves much by using
+ a scrambler instead of just taking the low bits.
 Function
 <seemfa marker="#mwc59_value32/1">
 <c>mwc59_value32</c>
@@ -934,12 +1108,17 @@ end.</pre>
 
 
 The low bits of the base generator are surprisingly good,
- so the lowest 16 bits actually passes fairly strict PRNG tests,
- despite the generator's weaknesses that lies in the high
+ so the lowest 16 bits actually pass fairly strict PRNG tests,
+ despite the generator's weaknesses that lie in the high
 bits of the 32-bit MWC "digit". It is recommended
 to use <c>rem</c> on the the generator state,
- or bit mask on the lowest bits to produce numbers
+ or bit mask extracting the lowest bits to produce numbers
 in a range 16 bits or less.
+ See the recipes at the start of this
+ <seeerl marker="#niche_algorithms">
+ Niche algorithms API
+ </seeerl>
+ description.
 
 
 On a typical 64 bit Erlang VM this generator executes
@@ -993,14 +1172,25 @@ end.</pre>
 birthday spacing and collision tests show through.
 
 
- To extract a power of two number it is recommended
- to use the high bits which helps in hiding
- the remaining base generator problems.
+ When using this scrambler it is in general better to use
+ the high bits of the value than the low.
+ The lowest 8 bits are of good quality and pass right through
+ from the base generator. They are combined with the next 8
+ in the xorshift making the low 16 good quality,
+ but in the range 16..31 bits there are weaker bits
+ that you do not want to have as the high bits
+ of your generated values.
+ Therefore it is in general safer to shift out low bits.
+ See the recipes at the start of this
+ <seeerl marker="#niche_algorithms">
+ Niche algorithms API
+ </seeerl>
+ description.
 
 
- For a small arbitrary range less than about 16 bits
+ For a non power of 2 range less than about 16 bits
 (to not get too much bias and to avoid bignums)
- multiply-and-shift can be used,
+ truncated multiplication can be used,
 which is much faster than using <c>rem</c>:
 <c>(Range*<anno>V</anno>)&nbsp;bsr&nbsp;32</c>.
 
@@ -1024,20 +1214,23 @@ end.</pre>
 when handling the value <c><anno>V</anno></c>.
 
 
- To extract a power of two number it is slightly better
- to shift down the high bits than to mask the low.
+ It is in general better to use the high bits
+ from this scrambler than the low.
+ See the recipes at the start of this
+ <seeerl marker="#niche_algorithms">
+ Niche algorithms API
+ </seeerl>
+ description.
 
 
- For an arbitrary range less than about 29 bits
+ For a non power of 2 range less than about 29 bits
 (to not get too much bias and to avoid bignums)
- multiply-and-shift can be used,
+ truncated multiplication can be used,
 which is much faster than using <c>rem</c>.
 Example for range 1'000'000'000;
 the range is 30 bits, we use 29 bits from the generator,
 adding up to 59 bits, which is not a bignum:
 <c>(1000000000&nbsp;*&nbsp;(<anno>V</anno>&nbsp;bsr&nbsp;(59-29)))&nbsp;bsr&nbsp;29</c>.
- 
- 
 
 </desc>
 </func>
-- 
2.35.3
