Truncated (by number of standard deviations) normal distribution in Python

Background

I needed to generate large amounts of "well-behaved" test data that was approximately normally distributed but could not contain extreme values (so, strictly speaking, it isn't the normal distribution at all!).

The method described below is similar to scipy.stats.truncnorm, but I wanted to implement it directly and specify the cut-off as a number of standard deviations. The cut-off values for truncnorm are given relative to a reference mean of 0.0 and standard deviation of 1.0, i.e. they are already expressed in standard deviations, so it is easy to convert between the two.
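To make the conversion concrete, here is a minimal sketch using scipy.stats.truncnorm. The helper names truncnorm_from_sds and standardised_bounds are my own, chosen for illustration; only truncnorm itself and its a, b, loc and scale parameters come from SciPy.

```python
from scipy.stats import truncnorm


def truncnorm_from_sds(mean, sd, n_sds):
    """Frozen truncnorm cut off at mean +/- n_sds * sd (hypothetical helper).

    truncnorm expects its bounds a and b in standard deviations from loc,
    so a symmetric cut-off is simply (-n_sds, +n_sds).
    """
    return truncnorm(-n_sds, n_sds, loc=mean, scale=sd)


def standardised_bounds(lower, upper, mean, sd):
    """Convert absolute cut-off values into the a, b that truncnorm expects."""
    return (lower - mean) / sd, (upper - mean) / sd


# Example: mean 10, SD 3, cut off at 2 SDs, so values lie in [4, 16].
dist = truncnorm_from_sds(10.0, 3.0, 2)
samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())
print(standardised_bounds(4.0, 16.0, 10.0, 3.0))
```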

In a normal distribution, the percentages of values falling within a given number of standard deviations of the mean (as detailed on Wikipedia) are:

2 SD: 95.5%
3 SD: 99.7%
4 SD: 99.99%
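These percentages can be checked with the standard library alone, since the fraction within n standard deviations of the mean is erf(n / sqrt(2)). The function name fraction_within is just an illustrative choice.

```python
import math


def fraction_within(n_sds):
    """Probability that a normal value lies within n_sds SDs of its mean."""
    return math.erf(n_sds / math.sqrt(2))


for n in (2, 3, 4):
    print(f"{n} SD: {100 * fraction_within(n):.3f}%")
```

One minus this fraction is the share of generated values that a cut-off at n standard deviations will reject.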

Code sample

The method is simple: generate a value from the underlying normal distribution and either return it (if in the acceptable range) or generate a new one recursively.
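A minimal sketch of the approach described above might look like the following. The function name, parameter names and the optional rng argument are assumptions for illustration, not necessarily what the original notebook used.

```python
import numpy as np


def truncated_normal(mean, sd, max_sds, rng=None):
    """Draw one normal value, rejecting anything beyond max_sds SDs of the mean."""
    if rng is None:
        rng = np.random.default_rng()
    value = rng.normal(mean, sd)
    if abs(value - mean) <= max_sds * sd:
        return value
    # Out of range: discard the value and try again recursively.
    return truncated_normal(mean, sd, max_sds, rng)


rng = np.random.default_rng(0)
values = [truncated_normal(10.0, 3.0, 2, rng) for _ in range(2000)]
print(min(values), max(values))
```

With a cut-off of 2 SDs or more, a retry is needed less than 5% of the time, so the recursion stays shallow. For a very tight cut-off, a while loop would avoid any risk of hitting Python's recursion limit.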

Of course, compared to the native numpy.random.normal method that can generate a set of values in a single call, this will be significantly less performant.

As an example, using timeit I found it took around 2 seconds to produce a list of 1 million values, compared to 1.7 seconds calling numpy.random.normal 1 million times in a list comprehension, and 0.01 seconds (!) calling numpy.random.normal once with a size parameter of 1000000. These figures were typical of a few runs.
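A comparison of this kind can be set up with timeit roughly as follows. This is a sketch: the function names and the smaller default of 100,000 values are my own choices to keep it quick, and the timings will vary by machine.

```python
import timeit

import numpy as np


def truncated_normal(mean, sd, max_sds):
    """Rejection sampler: redraw until the value is within max_sds SDs."""
    value = np.random.normal(mean, sd)
    if abs(value - mean) <= max_sds * sd:
        return value
    return truncated_normal(mean, sd, max_sds)


def time_generators(n, mean=0.0, sd=1.0, max_sds=3):
    """Time a single run of each approach producing n values (in seconds)."""
    return {
        "truncated, one call per value": timeit.timeit(
            lambda: [truncated_normal(mean, sd, max_sds) for _ in range(n)],
            number=1,
        ),
        "normal, one call per value": timeit.timeit(
            lambda: [np.random.normal(mean, sd) for _ in range(n)],
            number=1,
        ),
        "normal, single call with size": timeit.timeit(
            lambda: np.random.normal(mean, sd, size=n),
            number=1,
        ),
    }


timings = time_generators(100_000)
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```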

Statistical properties

Standard deviation, mean and median

Since the ends of the distribution are "cut off", the standard deviation of the resulting data is smaller than that of the underlying normal distribution it was generated from (with a greater effect the more data is cut off). The mean and median don't change, because the cut-off is symmetric about the mean. np.mean, np.median and np.std demonstrate this.
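As a quick check, the sketch below compares the three statistics for a normal sample and a truncated one. The mean of 10, SD of 3 and cut-off of 2 SDs are arbitrary example values, and filtering a large sample with a mask is used here as a shortcut that gives the same distribution as rejecting values one at a time.

```python
import numpy as np

rng = np.random.default_rng(42)
mean, sd, max_sds = 10.0, 3.0, 2  # example values only

normal = rng.normal(mean, sd, size=200_000)
# Keep only the values within max_sds standard deviations of the mean.
truncated = normal[np.abs(normal - mean) <= max_sds * sd]

for name, data in (("normal", normal), ("truncated", truncated)):
    print(
        f"{name:9s} mean={np.mean(data):.3f} "
        f"median={np.median(data):.3f} std={np.std(data):.3f}"
    )
```

For a cut-off at 2 SDs, the standard deviation of the truncated data should come out at roughly 0.88 times that of the underlying distribution.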

Kurtosis

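The "truncated" distribution shows a negative excess kurtosis (i.e. relative to the normal distribution, for which it is zero), meaning the 'tails' are thinner than in the normal distribution, as expected. scipy.stats.kurtosis, which returns the excess kurtosis by default, demonstrates this.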
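This can be checked with a few lines. The sample size and the cut-off of 2 SDs are example values, and the truncated sample is again produced by filtering, which is equivalent in distribution to rejecting values one by one.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)
max_sds = 2  # example cut-off

normal = rng.normal(0.0, 1.0, size=200_000)
truncated = normal[np.abs(normal) <= max_sds]

# scipy.stats.kurtosis uses the Fisher definition by default,
# so a normal sample gives a value close to 0.
print("normal:   ", kurtosis(normal))
print("truncated:", kurtosis(truncated))
```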

Histogram

Plotting the distributions overlaid shows the difference clearly. (I used SD = 3 and cut-off at 2 SDs from the mean, to more clearly show the distinction.)
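A plot along these lines can be produced with matplotlib as sketched below. The SD of 3 and the cut-off at 2 SDs follow the text; the mean of 0, the sample size, the bin layout and the use of density=True are my assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
mean, sd, max_sds = 0.0, 3.0, 2  # mean of 0 is an assumption

normal = rng.normal(mean, sd, size=100_000)
truncated = normal[np.abs(normal - mean) <= max_sds * sd]

# Shared bins so the two histograms line up; density=True puts both
# on the same scale even though the truncated sample is smaller.
bins = np.linspace(mean - 4 * sd, mean + 4 * sd, 81)

fig, ax = plt.subplots()
ax.hist(normal, bins=bins, density=True, alpha=0.5, label="normal")
ax.hist(truncated, bins=bins, density=True, alpha=0.5, label="truncated at 2 SDs")
ax.set_xlabel("value")
ax.set_ylabel("density")
ax.legend()
# In a notebook the figure renders automatically; in a script, call plt.show().
```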

Jupyter notebook

The complete Jupyter notebook for the above can be found here (GitHub Gist).
