Today I'm showing some sample Python code for visualising the normal and skewed normal distributions with histograms.
Approach:
- Generate the data using norm and skewnorm from the scipy.stats module.
  - Normal distribution
  - Skewed normal distribution with positive skew
  - Skewed normal distribution with negative skew
- Bin the data into a selected number of buckets using histogram from numpy
- Visualise the 3 sets of data as histograms using matplotlib/pyplot bar charts
- Annotate the histograms with mean, median, 1st and 3rd quartile using vlines
Walkthrough and example code:
First I've imported the needed libraries and set some "constants". RANDOM_SEED is used to initialise the random number generator in a reproducible state. The others define the number of values, the mean and standard deviation of the distribution, and the number of "buckets" to group into.
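As a minimal sketch, the setup could look like the following. The constant names are the ones used in this walkthrough; the values are assumptions chosen for illustration, apart from NUM_BINS, which is 20 as described further down.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm

RANDOM_SEED = 42    # assumed value; any fixed integer gives a reproducible run
NUM_VALUES = 10000  # assumed: how many random values to generate per dataset
MEAN = 0            # assumed: centre of the distribution
ST_DEV = 1          # assumed: spread of the distribution
NUM_BINS = 20       # number of buckets used for the histograms
```

Passing the same integer seed to two separate calls gives identical values, which is what makes the charts reproducible.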
Generate and bin the data
Now I generate and bin the data. There are three sections here, each essentially the same, putting a normal distribution, a positively skewed distribution and a negatively skewed distribution into data1, data2 and data3.
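First generate the random values using norm.rvs / skewnorm.rvs. Both take the location (loc) and scale, the number of values to generate and optionally the RANDOM_SEED, which can be passed as None or omitted for a new, non-reproducible result each time. For norm, loc and scale are the mean and standard deviation. skewnorm.rvs additionally takes a shape parameter a as its first argument (positive for positive skew, negative for negative skew), and with a non-zero a the mean and standard deviation of the result are no longer exactly loc and scale.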
data1 = norm.rvs(loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
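The two skewed datasets can be generated in the same way. This is a sketch only: SKEW is a hypothetical constant introduced here for illustration, and the other values are assumed.

```python
import numpy as np
from scipy.stats import skew, skewnorm

RANDOM_SEED = 42    # assumed values, for illustration only
NUM_VALUES = 10000
MEAN = 0
ST_DEV = 1
SKEW = 5            # hypothetical shape parameter; larger magnitude = stronger skew

# The shape parameter is the first argument: positive skews right, negative skews left
data2 = skewnorm.rvs(SKEW, loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
data3 = skewnorm.rvs(-SKEW, loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)

# Sample skewness confirms the direction; the sample means are pulled away from loc
print(skew(data2), skew(data3))
print(np.mean(data2), np.mean(data3))
```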
Then the data is binned into the needed number of buckets using histogram from numpy. In this case I've specified 20 in the constants so I will end up with 20 buckets.
data_binned1 = np.histogram(data1, bins=NUM_BINS)
Now I extract the x values (boundaries for the histogram buckets) and y values (frequency i.e. number of values in that bucket) from the binned data.
There's one more boundary value than the number of buckets, because the boundaries mark both the beginning and the end of each range (similar to fence posts!). To get the "left" boundary of all 20 buckets I omit the last value of the array.
x_values1 = data_binned1[1][:-1]
y_values1 = data_binned1[0]
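The fence-post relationship is easy to check on a small example. This sketch uses assumed values (1000 standard normal values, seed 42) and unpacks the two arrays returned by np.histogram directly.

```python
import numpy as np
from scipy.stats import norm

data = norm.rvs(loc=0, scale=1, size=1000, random_state=42)  # assumed sample

# np.histogram returns a tuple: (frequencies, bucket boundaries)
counts, edges = np.histogram(data, bins=20)
left_edges = edges[:-1]  # drop the final boundary to keep one value per bucket

print(len(counts), len(edges), len(left_edges))  # 20 21 20
print(counts.sum())                              # 1000: every value lands in a bucket
```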
Lastly, I calculate the minimum, 1st quartile, median, 3rd quartile, max and mean for the created data.
data1_min = np.min(data1)
data1_q1 = np.quantile(data1, 0.25)
data1_median = np.quantile(data1, 0.5)
data1_q3 = np.quantile(data1, 0.75)
data1_max = np.max(data1)
data1_mean = np.mean(data1)
Here's the complete code for the part described above:
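The sketch below pulls the steps together for all three datasets. It is not necessarily identical to the notebook: the helper bin_and_summarise and the SKEW constant are hypothetical names introduced to avoid repeating the same lines three times, and the constant values other than NUM_BINS are assumed.

```python
import numpy as np
from scipy.stats import norm, skewnorm

RANDOM_SEED, NUM_VALUES, MEAN, ST_DEV = 42, 10000, 0, 1  # assumed values
NUM_BINS = 20
SKEW = 5  # hypothetical shape parameter for the skewed datasets


def bin_and_summarise(data, num_bins):
    """Return left bucket boundaries, frequencies and summary statistics."""
    counts, edges = np.histogram(data, bins=num_bins)
    q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])
    summary = {
        "min": np.min(data), "q1": q1, "median": median,
        "q3": q3, "max": np.max(data), "mean": np.mean(data),
    }
    return edges[:-1], counts, summary


# Normal, positively skewed and negatively skewed data
data1 = norm.rvs(loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
data2 = skewnorm.rvs(SKEW, loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)
data3 = skewnorm.rvs(-SKEW, loc=MEAN, scale=ST_DEV, size=NUM_VALUES, random_state=RANDOM_SEED)

x_values1, y_values1, summary1 = bin_and_summarise(data1, NUM_BINS)
x_values2, y_values2, summary2 = bin_and_summarise(data2, NUM_BINS)
x_values3, y_values3, summary3 = bin_and_summarise(data3, NUM_BINS)
```

With positive skew the mean sits above the median, and with negative skew below it, which is what the annotated histograms later make visible.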
Visualise as a histogram (example with normally distributed data)
Now we can visualise the created data as a histogram. Since the data is already grouped into buckets, I plot it directly as a bar chart with the heights being the frequencies (this is equivalent to the way plt.hist() does it).
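A minimal sketch of this step, with assumed data values, draws the pre-binned bar chart next to a hist() call on the raw data so the two can be compared. The Agg backend line is only there so the sketch runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when running in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

NUM_BINS = 20
data1 = norm.rvs(loc=0, scale=1, size=10000, random_state=42)  # assumed sample

# Bin first, then keep the left boundary of each bucket
y_values1, edges = np.histogram(data1, bins=NUM_BINS)
x_values1 = edges[:-1]

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, sharey=True)

# Left: bar chart from the binned data, bars aligned on their left edge
bars = ax_bar.bar(x=x_values1, height=y_values1, align='edge',
                  width=(x_values1[1] - x_values1[0]), facecolor='#E5E7E9')

# Right: hist() bins the raw data itself and returns the same frequencies
hist_counts, hist_edges, _ = ax_hist.hist(data1, bins=NUM_BINS, facecolor='#E5E7E9')

# plt.show()  # uncomment when running interactively
```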
Visualise the 3 histograms to compare
Using plt.subplots, plot 3 histograms in a column, one for each of the datasets created above. Since these distributions will have different min and max x values and likely a different maximum frequency (y) value, I set the sharex and sharey properties for the plot so that all three histograms use the same axis ranges and can be compared directly.
fig, (ax1, ax2, ax3) = plt.subplots(3, sharex='col', sharey='col')
fig.set_size_inches(10,7.5)
ax1.bar(x=x_values1, height=y_values1, align='edge', width=(x_values1[1]-x_values1[0]), facecolor='#E5E7E9')
ax2.bar(x=x_values2, height=y_values2, align='edge', width=(x_values2[1]-x_values2[0]), facecolor='#E5E7E9')
ax3.bar(x=x_values3, height=y_values3, align='edge', width=(x_values3[1]-x_values3[0]), facecolor='#E5E7E9')
plt.show()
To add the markers for mean, median and quartiles I use Axes.vlines with a list of lines to add, their positions and their colours.
I chose to show the median/Q1/Q3 extending from the top of the chart, and the mean from the bottom. The min_y and max_y for those are set accordingly so that they will reach 35% or 65% of the way on the plot.
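One way to sketch the annotation for a single histogram is shown below. The colours, line styles and data values are assumptions, and so is the reading of the percentages: here the median/Q1/Q3 lines run from the top of the chart down to 35% of its height, and the mean runs from the bottom up to 65%.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when running in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

data1 = norm.rvs(loc=0, scale=1, size=10000, random_state=42)  # assumed sample
y_values1, edges = np.histogram(data1, bins=20)
x_values1 = edges[:-1]
q1, median, q3 = np.quantile(data1, [0.25, 0.5, 0.75])
mean = np.mean(data1)

fig, ax = plt.subplots()
ax.bar(x=x_values1, height=y_values1, align='edge',
       width=(x_values1[1] - x_values1[0]), facecolor='#E5E7E9')

top = ax.get_ylim()[1]  # current top of the y axis, in data units

# Median and quartiles hang from the top of the chart
quartile_lines = ax.vlines(x=[q1, median, q3], ymin=0.35 * top, ymax=top,
                           colors=['tab:blue', 'tab:red', 'tab:blue'],
                           linestyles='dashed')
# The mean rises from the bottom of the chart
mean_line = ax.vlines(x=[mean], ymin=0, ymax=0.65 * top, colors=['tab:green'])

ax.set_ylim(0, top)  # pin the limits so the lines do not rescale the axis

# plt.show()  # uncomment when running interactively
```

Because the lines are given in data units, the y limits are read before drawing and pinned afterwards; otherwise autoscaling would add a margin above the lines.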
Jupyter notebook
The complete Jupyter notebook for the above can be found here (GitHub Gist).