Context
Today I've been exploring geopandas to create some choropleth maps of the UK using government data.
Concepts covered in this walkthrough:
- Read an Excel sheet into `pandas`, choosing a sheet name and columns, excluding some header and footer rows, and setting an index
- Read a GeoJSON file into `geopandas`, choosing the index column and geometry column
- Join numerical data in a `pandas` dataframe with geographical boundaries in a `geopandas` dataframe
- Plot data as a choropleth map using `geopandas`
- Plot multiple `geopandas` subplots and customise the colormap, legend and display properties
NB: `geopandas` has a lot of dependencies, and I have struggled with a "dependency cascade" trying to install it before now. The easiest way to work with it is to use Anaconda and create a new environment for this, adding any extra libraries as you go. For example, I had to install a library for reading Excel sheets which `pandas` didn't pull in by default.
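A setup along these lines works for me (the environment name is arbitrary, and `xlrd` is the extra library `pandas` needs for legacy `.xls` workbooks like the one used below):

```shell
# Create and activate a fresh conda environment for geospatial work
conda create -n geo-maps python=3.10
conda activate geo-maps

# conda-forge resolves geopandas's heavy dependencies (GDAL, fiona, shapely, pyproj)
conda install -c conda-forge geopandas matplotlib

# Extra engine pandas needs to read legacy .xls workbooks
conda install -c conda-forge xlrd
```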
Data sources
- Statistical data: Annual Survey of Hours and Earnings (ASHE) Earnings and hours worked, home geography, published by the Office for National Statistics. Table 33, 2022 provisional edition has been used for this walkthrough. Data accessed 2023-01-21.
- Geo-data: International Territorial Level 2 (January 2021) UK BUC V2 published by ONS Geography (Office for National Statistics) and available on the Open Geography Portal. Data accessed 2023-01-21.
Exploration and import of the statistical data
This dataset comes as an Excel workbook and includes (among other data) mean number of hours worked per week according to the Annual Survey of Hours and Earnings (ASHE), split by work type (full/part time) and gender (male/female). The other data released with it, which I haven't used here, relates to pay including basic, overtime, hourly, weekly and annual.
The data is arranged into a number of sheets, all with the same structure, one for each of 'Male - Full Time', 'Part-Time' and so on. The "Code" column (Column B) holds the ITL Level 1, 2 and 3 UK region codes. The data itself starts from the 6th row.
Similarly, there are 5 rows of "footnote" information that we don't need to import.
The Python code for importing a sheet like this is below (D: is just the drive I downloaded it to). I exclude the 5 header rows and 5 footer rows as described above, select just the needed columns and rename them. The `read_excel()` method imports a specific sheet, so we will need an import (shown later) for each sub-dataset (male full-time, etc).
```python
import pandas as pd

all_hours_data = pd.read_excel(
    'D:\\ashetable332022provisional\\PROV - Home ITL (3) Table 33.9a  Paid hours worked - Total 2022.xls',
    sheet_name='Full-Time',
    header=5,            # data starts from the 6th row
    usecols='A,B,D,F',   # select only the needed columns
    names=['Location', 'ITL_Code', 'Median_Hours', 'Mean_Hours'],
    skipfooter=5,        # drop the 5 footnote rows
    index_col=1          # use ITL_Code as the index
)
```
This results in a dataframe with `ITL_Code` as the index (because we told it to use column 1 as the `index_col`). This is important as we will use the index column to join to the geo-boundaries data.
Exploration and import of the GeoJSON UK ITL2 boundaries
Next I use the geopandas `read_file` method to import the GeoJSON boundary data.
Here's an example of the raw GeoJSON:
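The snippet below is an illustrative sketch of the structure only - the property names (`ITL221CD`, `ITL221NM`) are my assumption about the ONS field naming, so check the downloaded file:

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "ITL221CD": "TLC1",
        "ITL221NM": "Tees Valley and Durham"
      },
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [ ... ]
      }
    }
  ]
}
```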
I use the `set_index()` method to ensure the ITL codes are the 'key' for this dataframe so they can be joined easily (as mentioned above in relation to the statistical data). I also select a geometry column using `set_geometry()`. This isn't actually required when there is only one geometry column in the data, as in this case, since it will be chosen by default - but I've made it explicit. Specifying the geometry column tells geopandas which one to use in its geo-calculations.
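A minimal sketch of that step is below. The ITL code column name (`ITL221CD`) is an assumption about the ONS file, and a tiny in-memory GeoDataFrame stands in for the real `read_file` call so the snippet is self-contained:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# In practice: boundaries = gpd.read_file('path/to/ITL2 boundaries.geojson')
# Here a toy GeoDataFrame stands in for the real file; the column name
# ITL221CD is an assumption about the ONS field naming.
boundaries = gpd.GeoDataFrame({
    'ITL221CD': ['TLC1', 'TLC2'],
    'ITL221NM': ['Tees Valley and Durham', 'Northumberland and Tyne and Wear'],
    'geometry': [
        Polygon([(0, 0), (1, 0), (1, 1)]),
        Polygon([(1, 1), (2, 1), (2, 2)]),
    ],
})

# Make the ITL code the index (the join key) and be explicit about geometry
boundaries = boundaries.set_index('ITL221CD').set_geometry('geometry')
```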
There's also a data wrangling step added ad-hoc when I discovered that one of the ITL codes (Northern Ireland) doesn't have any sub-regions at ITL2 and ITL3 level - it is all just grouped as Northern Ireland - and the formatting of its code differs between the two sources! Of course it is common to need data wrangling steps like this, so I'm grateful that this was the only mismatch.
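The fix amounts to renaming one index entry so the two sources agree before joining. The specific codes below are hypothetical - the exact mismatch isn't recorded here - but the pattern is the same:

```python
import pandas as pd

# Hypothetical codes: suppose the boundary data indexes Northern Ireland
# as 'TLN0' while the statistical data uses 'TLN' - align them before joining
boundaries = pd.DataFrame({'name': ['Northern Ireland']}, index=['TLN0'])
boundaries = boundaries.rename(index={'TLN0': 'TLN'})
```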
Now I can join the two datasets by their index, and get `geopandas` to produce a plot!
I use a dataframe `join` to combine the boundaries and statistical dataframes, choosing only the needed columns from the boundaries data as I go. In this case it is an inner join, returning just the matched rows. The default plot draws a map against the x and y axes, with `legend=True` adding a colorbar. By default the `viridis` colormap is used (this is just matplotlib under the hood).
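A condensed sketch of the join-and-plot step, with toy stand-ins for the two dataframes (the region codes and hours values are placeholder data, not the real ASHE figures):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy stand-ins for the boundaries and statistical dataframes
boundaries = gpd.GeoDataFrame(
    {'geometry': [Polygon([(0, 0), (1, 0), (1, 1)]),
                  Polygon([(1, 1), (2, 1), (2, 2)])]},
    index=pd.Index(['TLC1', 'TLC2'], name='ITL_Code'),
)
hours = pd.DataFrame({'Mean_Hours': [37.2, 36.8]},
                     index=pd.Index(['TLC1', 'TLC2'], name='ITL_Code'))

# Inner join on the shared index: only matched rows survive
joined = boundaries.join(hours, how='inner')

# Choropleth: colour each region by Mean_Hours; legend=True adds a colorbar
ax = joined.plot(column='Mean_Hours', legend=True)
```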
Having worked through all the individual parts I can now create the final plot...
Creating geopandas subplots and formatting them
In this final section I've put together all of the above to create a plot with 6 subplots in a 2 row x 3 column layout. The rows will be for full-time and part-time work, and the columns for All, Male and Female genders.
Points to note:
- I make use of a "helper" function `read_data` to abstract the process of loading data from the Excel sheet. This makes sense to do in this case since I have 6 sheets with an identical structure.
- The dataframes `full_time_data` and `part_time_data` are created by joining the datasets.
- I then calculate the minimum and maximum value across the 3 (All, Male and Female) columns so that the color mapping and legend are on the same scale. Of course, full-time and part-time hours are very different so I have a scale for each.
- I loop over the full-time and then part-time data, creating a `geopandas` plot for each and placing it in its spot in the grid. I used a Blues colormap for full time and Greens for part time to make them visually distinct without really implying an "order".
- I apply some formatting to the legend. The geopandas legend is a matplotlib colorbar created as an additional axis, so to get a reference to it I access `ax.figure.axes[i]`, where `ax` is a reference to the subplot, `figure` is the plot as a whole, and the additional axes 6 through 11 (where 0-5 are created by the maps) are the legends.
Jupyter notebook
The complete Jupyter notebook for the above can be found here (Github Gist).