Background
I needed to generate large amounts of "well-behaved" test data that was approximately normally distributed but could not contain extreme values (so, of course, it isn't truly a normal distribution!).
The method described below is similar to scipy.stats.truncnorm, but I wanted to implement this directly and use the number of standard deviations. The cut-off values for truncnorm are relative to a reference mean of 0.0 and standard deviation of 1.0, so it is possible to convert easily between one and the other.
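As a rough sketch of that conversion: since truncnorm takes its cut-offs a and b in units of standard deviations from loc, a symmetric cut-off of n_sd standard deviations maps to a = -n_sd and b = n_sd. The helper name truncnorm_from_sd and the example values below are my own, purely for illustration.

```python
from scipy.stats import truncnorm


def truncnorm_from_sd(mean, sd, n_sd):
    """Frozen truncnorm cut off n_sd standard deviations either side of mean.

    Hypothetical helper, not part of scipy.
    """
    # a and b are measured in standard deviations from loc, so a symmetric
    # cut-off is simply -n_sd and +n_sd.
    return truncnorm(a=-n_sd, b=n_sd, loc=mean, scale=sd)


# Example values (assumed): mean 10, SD 3, cut off at 2 SDs, i.e. the range [4, 16]
dist = truncnorm_from_sd(mean=10.0, sd=3.0, n_sd=2)
samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())
```

Going the other way, absolute cut-offs low and high would convert as a = (low - mean) / sd and b = (high - mean) / sd.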
In a normal distribution, the percentages falling within a certain number of standard deviations of the mean (as detailed here in Wikipedia) are:
- 2 SD: 95.5%
- 3 SD: 99.7%
- 4 SD: 99.99%
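These figures can be checked with the standard library alone, using the identity that the fraction of a normal distribution within k standard deviations of the mean is erf(k / sqrt(2)). The function name fraction_within is just an illustrative choice.

```python
import math


def fraction_within(n_sd):
    """Fraction of a normal distribution within n_sd standard deviations of the mean."""
    return math.erf(n_sd / math.sqrt(2))


for n_sd in (2, 3, 4):
    print(f"{n_sd} SD: {100 * fraction_within(n_sd):.2f}%")
```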
Code sample
The method is simple: generate a value from the underlying normal distribution and either return it (if in the acceptable range) or generate a new one recursively.
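A minimal sketch of the method as described might look like this. The function name, parameter names and defaults are assumptions for illustration, and I pass in a NumPy Generator so that runs can be made reproducible.

```python
import numpy as np


def truncated_normal(mean=0.0, sd=1.0, n_sd=3.0, rng=None):
    """Draw one value from a normal distribution, redrawing if it falls
    more than n_sd standard deviations from the mean."""
    rng = np.random.default_rng() if rng is None else rng
    value = rng.normal(mean, sd)
    if abs(value - mean) <= n_sd * sd:
        return value
    # Out of range: try again recursively
    return truncated_normal(mean, sd, n_sd, rng)


rng = np.random.default_rng(1)
values = [truncated_normal(5.0, 2.0, 2.0, rng) for _ in range(1000)]
print(min(values), max(values))
```

With a very narrow cut-off most draws are rejected, so the recursion could in principle hit Python's recursion limit; a while loop would avoid that at the cost of a slightly less direct reading.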
Of course, compared to the native numpy.random.normal method, which can generate a set of values in a single call, this will be significantly less performant.
As an example, using timeit I found it took around 2 seconds to produce a list of 1 million values, compared to 1.7 seconds calling random.normal 1 million times in a list comprehension, and 0.01 seconds (!) calling random.normal with a size parameter of 1000000. These timings were typical of a few runs.
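A comparison of this kind could be set up roughly as follows. This is a sketch rather than the exact benchmark: the helper time_methods and the labels are hypothetical, and the absolute numbers will vary by machine.

```python
import timeit

import numpy as np


def truncated_normal(mean=0.0, sd=1.0, n_sd=3.0):
    """Recursive rejection: redraw until the value is within n_sd SDs of the mean."""
    value = np.random.normal(mean, sd)
    if abs(value - mean) <= n_sd * sd:
        return value
    return truncated_normal(mean, sd, n_sd)


def time_methods(n):
    """Time three ways of producing n values; returns seconds per approach."""
    return {
        "truncated, one call per value": timeit.timeit(
            lambda: [truncated_normal() for _ in range(n)], number=1
        ),
        "random.normal, one call per value": timeit.timeit(
            lambda: [np.random.normal() for _ in range(n)], number=1
        ),
        "random.normal with size": timeit.timeit(
            lambda: np.random.normal(size=n), number=1
        ),
    }


# Smaller n than the 1 million used in the text, to keep the run short
for label, seconds in time_methods(100_000).items():
    print(f"{label}: {seconds:.3f} s")
```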
Statistical properties
Standard deviation, mean and median
Since the ends of the distribution are "cut off", the standard deviation of the resulting data is smaller than that of the underlying normal distribution it was generated from (with a greater effect the more data is cut off). Because the cut-off is symmetric about the mean, the mean and median don't change. np.mean, np.median and np.std demonstrate this.
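A quick way to see this, as a sketch: the truncated sample here is produced by filtering a large normal sample, which gives the same distribution as redrawing rejected values one at a time. The seed, sample size and 2 SD cut-off are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(42)
mean, sd, n_sd = 0.0, 1.0, 2.0

normal = rng.normal(mean, sd, size=200_000)
# Keeping only in-range draws is the vectorised equivalent of
# rejecting and redrawing out-of-range values.
truncated = normal[np.abs(normal - mean) <= n_sd * sd]

for label, data in (("normal", normal), ("truncated", truncated)):
    print(
        f"{label}: mean={np.mean(data):.3f} "
        f"median={np.median(data):.3f} std={np.std(data):.3f}"
    )
```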
Kurtosis
The "truncated" distribution shows a negative (excess) kurtosis, meaning the 'tails' are thinner than in the normal distribution, as expected. scipy.stats.kurtosis, which by default reports excess kurtosis (0 for a normal distribution), demonstrates this.
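As an illustrative check, again assuming a 2 SD cut-off and building the truncated sample by filtering a normal one:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(7)
normal = rng.normal(0.0, 1.0, size=200_000)
# Values beyond 2 SDs are dropped, mimicking the truncated generator
truncated = normal[np.abs(normal) <= 2.0]

# kurtosis() defaults to fisher=True, i.e. excess kurtosis
normal_kurtosis = kurtosis(normal)
truncated_kurtosis = kurtosis(truncated)
print(f"normal:    {normal_kurtosis:.3f}")
print(f"truncated: {truncated_kurtosis:.3f}")
```

The normal sample should come out close to zero and the truncated one clearly negative.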
Histogram
Plotting the distributions overlaid shows the difference clearly. (I used SD = 3 and cut-off at 2 SDs from the mean, to more clearly show the distinction.)
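An overlaid plot along these lines can be sketched with matplotlib. The mean of 0, the bin edges, the use of density rather than raw counts, and the function name plot_overlaid are all assumptions on my part.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_overlaid(mean=0.0, sd=3.0, n_sd=2.0, size=100_000, seed=0):
    """Overlay histograms of a normal sample and its truncated counterpart."""
    rng = np.random.default_rng(seed)
    normal = rng.normal(mean, sd, size=size)
    truncated = normal[np.abs(normal - mean) <= n_sd * sd]

    # Shared bin edges so the two histograms line up
    bins = np.linspace(mean - 4 * sd, mean + 4 * sd, 81)

    fig, ax = plt.subplots()
    ax.hist(normal, bins=bins, density=True, alpha=0.5, label="normal")
    ax.hist(truncated, bins=bins, density=True, alpha=0.5,
            label=f"truncated at {n_sd:g} SD")
    ax.set_xlabel("value")
    ax.set_ylabel("density")
    ax.legend()
    return fig


fig = plot_overlaid()
```

In a notebook the figure is displayed inline; in a script, fig.savefig("histogram.png") would write it to a file.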
Jupyter notebook
The complete Jupyter notebook for the above can be found here (GitHub Gist).