Python examples for visualising normal and skewed normal distributions with histograms

Some code snippets for generating and visualising normal and skewed data, and comparing their mean, quartiles and median.

Today I'm showing some sample Python code for visualising the normal and skewed normal distributions with histograms.

Approach:

  • Generate the data using norm and skewnorm from the scipy.stats module.
    • Normal distribution
    • Skewed normal distribution with positive skew
    • Skewed normal distribution with negative skew
  • Bin the data into a selected number of buckets using histogram from numpy
  • Visualise the 3 sets of data as histograms using matplotlib / pyplot bar charts
  • Annotate the histograms with mean, median, 1st and 3rd quartile using vlines

Walkthrough and example code:

First I've imported the needed libraries and set some "constants". RANDOM_SEED is used to initialise the random number generator in a reproducible state. The others define the number of values, mean and standard deviation of the distribution and the number of "buckets" to group into.
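A minimal sketch of what that setup might look like. The constant names come from the text and NUM_BINS is 20 as stated further down; every other value here is an illustrative assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm

# All values are assumptions for illustration, except NUM_BINS (20 in the text).
RANDOM_SEED = 42     # any integer gives a reproducible sequence
NUM_VALUES = 10000   # how many random values to draw
MEAN = 0             # centre (loc) of the distribution
ST_DEV = 1           # spread (scale) of the distribution
NUM_BINS = 20        # number of histogram buckets
```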

Generate and bin the data

Now I generate and bin the data. There are three sections here, each essentially the same, but putting a normal distribution, a positively skewed distribution and a negatively skewed distribution into data1, data2 and data3.

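First generate the random values using norm.rvs / skewnorm.rvs. These take the location (loc), scale, number of values to generate and optionally the RANDOM_SEED (as random_state), which can be passed as None or omitted for a new, non-reproducible result each time. For norm, loc and scale are the mean and standard deviation. skewnorm also takes a shape parameter a that sets the direction and strength of the skew, and once a is non-zero its loc and scale no longer equal the mean and standard deviation of the generated data.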


data1 = norm.rvs(loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
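The skewed datasets can be generated the same way with skewnorm.rvs, which needs the extra shape parameter a. The sketch below assumes a value of 5 for the positive skew and -5 for the negative skew; SKEW is a hypothetical constant name and the other constant values are placeholders too.

```python
import numpy as np
from scipy.stats import skewnorm

# Placeholder constants (assumed values, not taken from the original post).
RANDOM_SEED = 42
NUM_VALUES = 10000
MEAN = 0
ST_DEV = 1
SKEW = 5  # hypothetical name: a > 0 gives a longer right tail, a < 0 a longer left tail

# Positive skew
data2 = skewnorm.rvs(a=SKEW, loc=MEAN, scale=ST_DEV,
                     size=NUM_VALUES, random_state=RANDOM_SEED)
# Negative skew
data3 = skewnorm.rvs(a=-SKEW, loc=MEAN, scale=ST_DEV,
                     size=NUM_VALUES, random_state=RANDOM_SEED)

# With positive skew the mean sits above the median; with negative skew, below.
print(np.mean(data2) > np.median(data2))
print(np.mean(data3) < np.median(data3))
```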

Then the data is binned into the needed number of buckets using histogram from numpy. In this case I've specified 20 in the constants so I will end up with 20 buckets.


data_binned1 = np.histogram(data1, bins=NUM_BINS)

Now I extract the x values (boundaries for the histogram buckets) and y values (frequency i.e. number of values in that bucket) from the binned data.

There's one more boundary value than the number of buckets - because they represent both the beginning and end of the range (similar to fence posts!). To get the "left" boundary of all 20 buckets I omit the last value of the list.


x_values1 = data_binned1[1][:-1]
y_values1 = data_binned1[0]
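To make the fence-post point concrete, here is a small self-contained check. It unpacks the tuple returned by np.histogram rather than indexing it, which is just a stylistic alternative; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary sample purely for demonstration.
data = norm.rvs(loc=0, scale=1, size=1000, random_state=42)

# np.histogram returns (counts, bin_edges).
counts, edges = np.histogram(data, bins=20)

print(len(counts))   # 20 buckets
print(len(edges))    # 21 boundaries: one more than the number of buckets
print(counts.sum())  # 1000: every value falls into exactly one bucket

left_edges = edges[:-1]   # "left" boundary of each bucket
bin_widths = np.diff(edges)  # width of each bucket (all equal here)
```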

Lastly, I calculate the minimum, 1st quartile, median, 3rd quartile, max and mean for the created data.


data1_min, data1_max = np.min(data1), np.max(data1)
data1_q1, data1_median, data1_q3 = np.quantile(data1, [0.25, 0.5, 0.75])
data1_mean = np.mean(data1)

Here's the complete code for the part described above:
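One way the steps above could be combined is sketched below. It is not necessarily identical to the notebook: the helper bin_and_summarise and the SKEW constant are hypothetical, and the constant values are assumptions apart from NUM_BINS.

```python
import numpy as np
from scipy.stats import norm, skewnorm

# Assumed values, except NUM_BINS which is 20 in the text.
RANDOM_SEED = 42
NUM_VALUES = 10000
MEAN = 0
ST_DEV = 1
NUM_BINS = 20
SKEW = 5  # hypothetical shape parameter for skewnorm


def bin_and_summarise(data, num_bins=NUM_BINS):
    """Bin the data and compute the summary statistics used for annotation."""
    counts, edges = np.histogram(data, bins=num_bins)
    q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])
    return {
        "x_values": edges[:-1],  # left boundary of each bucket
        "y_values": counts,      # frequency in each bucket
        "min": np.min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": np.max(data),
        "mean": np.mean(data),
    }


# Normal, positive skew and negative skew.
data1 = norm.rvs(loc=MEAN, scale=ST_DEV,
                 size=NUM_VALUES, random_state=RANDOM_SEED)
data2 = skewnorm.rvs(a=SKEW, loc=MEAN, scale=ST_DEV,
                     size=NUM_VALUES, random_state=RANDOM_SEED)
data3 = skewnorm.rvs(a=-SKEW, loc=MEAN, scale=ST_DEV,
                     size=NUM_VALUES, random_state=RANDOM_SEED)

summaries = [bin_and_summarise(d) for d in (data1, data2, data3)]

for name, s in zip(("normal", "positive skew", "negative skew"), summaries):
    print(f"{name}: mean={s['mean']:.3f} median={s['median']:.3f}")
```

Using a helper avoids repeating the same three blocks with different variable suffixes, at the cost of accessing results by dictionary key rather than by name.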

Visualise as a histogram (example with normally distributed data)

Now we can visualise the created data as a histogram. Since the data is already grouped into buckets, I plot it directly as a bar chart with the frequencies as the bar heights (this gives the same picture as calling plt.hist() on the raw data with the same number of bins).
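A minimal sketch of a single histogram drawn this way, assuming placeholder data and the same bar styling used later in the post. The axis labels are my own addition.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Placeholder data; the sample size and seed are assumptions.
data1 = norm.rvs(loc=0, scale=1, size=10000, random_state=42)
y_values1, edges1 = np.histogram(data1, bins=20)
x_values1 = edges1[:-1]  # left boundary of each bucket

fig, ax = plt.subplots()
# align='edge' anchors each bar at its left boundary instead of centring it.
bars = ax.bar(x=x_values1, height=y_values1, align='edge',
              width=np.diff(edges1), facecolor='#E5E7E9')
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
```

Call plt.show() afterwards to display the figure when running interactively.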

Visualise the 3 histograms to compare

Using plt.subplots, plot 3 histograms in a column, one for each of the datasets created above. Since these distributions will have different min and max x values and likely a different maximum frequency (y) value, I set the sharex and sharey properties so that all three plots use the same axis ranges and can be compared directly.


fig, (ax1, ax2, ax3) = plt.subplots(3, sharex='col', sharey='col')
fig.set_size_inches(10,7.5)
ax1.bar(x=x_values1, height=y_values1, align='edge', width=(x_values1[1]-x_values1[0]), facecolor='#E5E7E9')
ax2.bar(x=x_values2, height=y_values2, align='edge', width=(x_values2[1]-x_values2[0]), facecolor='#E5E7E9')
ax3.bar(x=x_values3, height=y_values3, align='edge', width=(x_values3[1]-x_values3[0]), facecolor='#E5E7E9')
plt.show()

To add the markers for mean, median and quartiles I use Axes.vlines, passing a list of x positions for the lines to add along with their colours.

I chose to show the median/Q1/Q3 extending from the top of the chart, and the mean from the bottom. The min_y and max_y values (passed to vlines as ymin and ymax) are set accordingly so that the lines reach 35% or 65% of the way up the plot.
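A sketch of how those markers might be added to one of the plots. The colours and line style are my own assumptions, as is the reading of the percentages: the quartile lines run from the top down to 35% of the plot height and the mean line runs from the bottom up to 65%.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Placeholder data; the sample size and seed are assumptions.
data1 = norm.rvs(loc=0, scale=1, size=10000, random_state=42)
y_values1, edges1 = np.histogram(data1, bins=20)
q1, median, q3 = np.quantile(data1, [0.25, 0.5, 0.75])
mean = np.mean(data1)

fig, ax = plt.subplots()
ax.bar(x=edges1[:-1], height=y_values1, align='edge',
       width=np.diff(edges1), facecolor='#E5E7E9')

# vlines takes ymin/ymax in data coordinates, so scale from the current top.
y_top = ax.get_ylim()[1]

# Q1, median and Q3 hang from the top of the chart (colours are hypothetical).
quartile_lines = ax.vlines(x=[q1, median, q3], ymin=0.35 * y_top, ymax=y_top,
                           colors=['#2E86C1', '#CB4335', '#2E86C1'],
                           linestyles='dashed')
# The mean rises from the bottom of the chart.
mean_line = ax.vlines(x=[mean], ymin=0, ymax=0.65 * y_top, colors='#229954')

# Pin the y-axis so adding the lines does not rescale the plot.
ax.set_ylim(0, y_top)
```

Call plt.show() afterwards to display the figure. Drawing the two groups from opposite ends keeps the mean and median distinguishable when they nearly coincide, as they do for the normal distribution.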

Jupyter notebook

The complete Jupyter notebook for the above can be found here (GitHub Gist).