Python examples for visualising normal and skewed normal distributions with histograms

Today I'm showing some sample Python code for visualising the normal and skewed normal distributions with histograms.


  • Generate the data using norm and skewnorm from the scipy.stats module.
    • Normal distribution
    • Skewed normal distribution with positive skew
    • Skewed normal distribution with negative skew
  • Bin the data into a selected number of buckets using histogram from numpy
  • Visualise the 3 sets of data as histograms using matplotlib / pyplot bar charts
  • Annotate the histograms with mean, median, 1st and 3rd quartile using vlines

Walkthrough and example code:

First I import the needed libraries and set some constants. RANDOM_SEED is used to initialise the random number generator in a reproducible state. The others (NUM_VALUES, MEAN, ST_DEV and NUM_BINS) define the number of values, the mean and standard deviation of the distribution, and the number of "buckets" to group the values into.
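A minimal sketch of what that setup cell could look like. The constant names are the ones used later in this post and NUM_BINS is 20 as stated below; the other values are assumptions chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm

RANDOM_SEED = 42     # assumed value; any fixed integer gives a reproducible run
NUM_VALUES = 10000   # assumed: how many random values to generate
MEAN = 100           # assumed: centre of the distribution
ST_DEV = 15          # assumed: spread of the distribution
NUM_BINS = 20        # from the post: number of histogram buckets
```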

Generate and bin the data

Now I generate and bin the data. There are 3 sections here, each essentially the same, putting a normal distribution, a positively skewed distribution and a negatively skewed distribution into data1, data2 and data3.

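First generate the random values using norm.rvs / skewnorm.rvs. These take a location (loc) and scale, which for norm are the mean and standard deviation, the number of values to generate (size) and optionally a seed (random_state, here RANDOM_SEED), which can be passed as None or omitted for a new, non-reproducible result each time. skewnorm.rvs additionally takes a shape parameter a that sets the direction and strength of the skew; once a is non-zero, loc and scale no longer equal the mean and standard deviation of the result.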

data1 = norm.rvs(loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
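The skewed datasets follow the same pattern. This is a sketch rather than the exact code from the notebook: SKEW_SHAPE is a hypothetical constant, and its value of 5 (like the other constant values) is an assumption. A positive shape parameter gives a right-hand tail, a negative one a left-hand tail.

```python
from scipy.stats import skew, skewnorm

RANDOM_SEED = 42    # assumed value
NUM_VALUES = 10000  # assumed value
MEAN = 100          # assumed value
ST_DEV = 15         # assumed value
SKEW_SHAPE = 5      # hypothetical constant: strength of the skew

# Positive shape parameter: long tail to the right (positive skew).
data2 = skewnorm.rvs(a=SKEW_SHAPE, loc=MEAN, scale=ST_DEV,
                     size=NUM_VALUES, random_state=RANDOM_SEED)

# Negative shape parameter: long tail to the left (negative skew).
data3 = skewnorm.rvs(a=-SKEW_SHAPE, loc=MEAN, scale=ST_DEV,
                     size=NUM_VALUES, random_state=RANDOM_SEED)

# Sample skewness should be positive for data2 and negative for data3.
print(skew(data2), skew(data3))
```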

Then the data is binned into the needed number of buckets using histogram from numpy. In this case I've specified 20 in the constants so I will end up with 20 buckets.

data_binned1 = np.histogram(data1, bins=NUM_BINS)

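Now I extract the x values (boundaries for the histogram buckets) and y values (frequency, i.e. the number of values in each bucket) from the binned data. np.histogram returns a tuple of (counts, bin edges), so the counts are at index 0 and the boundaries at index 1.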

There's one more boundary value than the number of buckets - because they represent both the beginning and end of the range (similar to fence posts!). To get the "left" boundary of all 20 buckets I omit the last value of the list.

x_values1 = data_binned1[1][:-1]
y_values1 = data_binned1[0]
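To make the fence-post point concrete, here is a small self-contained check. It uses NumPy's own random generator and assumed values in place of the data above, and unpacks the tuple directly instead of indexing it.

```python
import numpy as np

NUM_BINS = 20
rng = np.random.default_rng(42)                   # assumed seed
data = rng.normal(loc=100, scale=15, size=10000)  # assumed stand-in data

# np.histogram returns (counts, bin_edges); unpacking avoids the [0]/[1] indexing.
counts, edges = np.histogram(data, bins=NUM_BINS)

print(len(counts))  # one count per bucket
print(len(edges))   # one more edge than buckets (the fence posts)

# Dropping the last edge leaves the left boundary of each bucket.
left_edges = edges[:-1]
bucket_width = edges[1] - edges[0]
```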

Lastly, I calculate the minimum, 1st quartile, median, 3rd quartile, max and mean for the created data.

data1_min = np.min(data1)
data1_q1 = np.quantile(data1, 0.25)
data1_median = np.quantile(data1, 0.5)
data1_q3 = np.quantile(data1, 0.75)
data1_max = np.max(data1)
data1_mean = np.mean(data1)

Here's the complete code for the part described above:
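A minimal sketch of how the steps above could fit together. The constant values, SKEW_SHAPE and the helper bin_and_summarise are assumptions introduced for illustration, not the original notebook code, which repeats the steps once per dataset.

```python
import numpy as np
from scipy.stats import norm, skewnorm

RANDOM_SEED = 42    # assumed value
NUM_VALUES = 10000  # assumed value
MEAN = 100          # assumed value
ST_DEV = 15         # assumed value
NUM_BINS = 20
SKEW_SHAPE = 5      # hypothetical constant: strength of the skew


def bin_and_summarise(data, num_bins):
    """Hypothetical helper: return left bucket edges, counts and summary stats."""
    counts, edges = np.histogram(data, bins=num_bins)
    q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])
    stats = {
        "min": np.min(data), "q1": q1, "median": median,
        "q3": q3, "max": np.max(data), "mean": np.mean(data),
    }
    return edges[:-1], counts, stats


# Normal, positively skewed and negatively skewed data.
data1 = norm.rvs(loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
data2 = skewnorm.rvs(SKEW_SHAPE, loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
data3 = skewnorm.rvs(-SKEW_SHAPE, loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)

x_values1, y_values1, stats1 = bin_and_summarise(data1, NUM_BINS)
x_values2, y_values2, stats2 = bin_and_summarise(data2, NUM_BINS)
x_values3, y_values3, stats3 = bin_and_summarise(data3, NUM_BINS)
```

With skewed data the mean is pulled towards the long tail, so it sits above the median for positive skew and below it for negative skew.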

Visualise as a histogram (example with normally distributed data)

Now we can visualise the created data as a histogram. Since the data is already grouped into buckets, I plot it directly as a bar chart, with the bar heights being the frequencies. This gives the same picture as plt.hist(), which does the binning and the drawing in one step.
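A sketch of that single histogram, assuming illustrative values for the constants. The bars are aligned on their left edge and given the bucket width so that they tile the x axis without gaps.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Assumed values for mean, standard deviation, sample size and seed.
data1 = norm.rvs(loc=100, scale=15, size=10000, random_state=42)

y_values1, edges1 = np.histogram(data1, bins=20)
x_values1 = edges1[:-1]  # left boundary of each bucket

fig, ax = plt.subplots()
bars = ax.bar(
    x_values1,
    height=y_values1,
    align='edge',            # x positions are left edges, not centres
    width=np.diff(edges1),   # each bar spans exactly one bucket
    facecolor='#E5E7E9',
    edgecolor='white',
)
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
```

In a Jupyter notebook the figure renders when the cell finishes; in a script, call plt.show() afterwards.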

Visualise the 3 histograms to compare

Using plt.subplots, I plot 3 histograms in a column, one for each of the datasets created above. Since these distributions will have different min and max x values and likely a different maximum frequency (y) value, I set the sharex and sharey properties so that all three plots use the same axis ranges and can be compared directly.

fig, (ax1, ax2, ax3) = plt.subplots(3, sharex='col', sharey='col')
fig.set_size_inches(10, 7.5)
ax1.bar(x_values1, height=y_values1, align='edge', width=(x_values1[1]-x_values1[0]), facecolor='#E5E7E9')
ax2.bar(x_values2, height=y_values2, align='edge', width=(x_values2[1]-x_values2[0]), facecolor='#E5E7E9')
ax3.bar(x_values3, height=y_values3, align='edge', width=(x_values3[1]-x_values3[0]), facecolor='#E5E7E9')

To add the markers for mean, median and quartiles I use Axes.vlines, passing a list of x positions for the lines together with their vertical extent and colours.

I chose to show the median/Q1/Q3 extending from the top of the chart, and the mean from the bottom. The min_y and max_y for those are set accordingly so that they will reach 35% or 65% of the way on the plot.
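One way the three annotated panels could be put together. This is a sketch under assumptions: the constant values, the skew shape of 5 and the line colours are illustrative, and the 35% / 65% extents are interpreted as the quartile lines running from 35% of the tallest bar up to the top and the mean line running from the bottom up to 65%.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm

SEED, N, BINS = 42, 10000, 20  # assumed values

datasets = [
    norm.rvs(loc=100, scale=15, size=N, random_state=SEED),
    skewnorm.rvs(5, loc=100, scale=15, size=N, random_state=SEED),
    skewnorm.rvs(-5, loc=100, scale=15, size=N, random_state=SEED),
]
binned = [np.histogram(d, bins=BINS) for d in datasets]

# Tallest bar across all panels; the y axis is shared, so use one reference height.
y_top = max(counts.max() for counts, _ in binned)

fig, axes = plt.subplots(3, sharex='col', sharey='col')
fig.set_size_inches(10, 7.5)

for ax, data, (counts, edges) in zip(axes, datasets, binned):
    ax.bar(edges[:-1], height=counts, align='edge',
           width=np.diff(edges), facecolor='#E5E7E9')

    q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])

    # Q1 / median / Q3 hang down from the top of the chart.
    ax.vlines([q1, median, q3], ymin=0.35 * y_top, ymax=y_top,
              colors=['tab:blue', 'tab:red', 'tab:blue'])

    # The mean rises from the bottom, so it stays readable next to the median.
    ax.vlines(np.mean(data), ymin=0, ymax=0.65 * y_top, colors='tab:green')
```

Drawing the mean and median from opposite ends keeps them distinguishable even in the normal case, where the two nearly coincide.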

Jupyter notebook

The complete Jupyter notebook for the above can be found here (GitHub Gist).