Creating a choropleth map with geopandas and using multiple subplots

An exploration of basic plotting with geopandas, collating the plots into subplots, and formatting them and their colormap legends.

Context

Today I've been exploring geopandas to create some choropleth maps of the UK using government data.

Concepts covered in this walkthrough:

  • Read an Excel sheet into pandas, choosing a sheet name and columns, excluding some header and footer rows, and setting an index
  • Read a GeoJSON file into geopandas, choosing the index column and geometry column
  • Join numerical data in a pandas dataframe with geographical boundaries in a geopandas dataframe
  • Plot data as a choropleth map using geopandas
  • Plot multiple geopandas subplots and customise the colormap, legend and display properties

NB: geopandas has a lot of dependencies and I have struggled with a "dependency cascade" when trying to install it in the past. I suggest the easiest way to work with it is to use Anaconda and create a new environment for it, adding any extra libraries as you go. For example, I had to install a library for reading Excel sheets which pandas didn't pull in by default.

Data sources

Exploration and import of the statistical data

This dataset comes as an Excel workbook and includes (among other data) mean number of hours worked per week according to the Annual Survey of Hours and Earnings (ASHE), split by work type (full/part time) and gender (male/female). The other data released with it, which I haven't used here, relates to pay including basic, overtime, hourly, weekly and annual.

The data is arranged into a number of sheets, all with the same structure, one for each of 'Male - Full Time', 'Part-Time' and so on. The "Code" column (Column B) holds the ITL Level 1, 2 and 3 UK region codes. The data itself starts from the 6th row.

Similarly, there are 5 rows of "footnote" information that we don't need to import.

The Python code for importing a sheet like this is below (D: is just the drive I downloaded it to). I exclude the 5 header rows and 5 footer rows described above, select just the needed columns and rename them. The read_excel() method imports one specific sheet, so we will need a separate import (shown later) for each sub-dataset (male full-time, etc.).


import pandas as pd

# Read one sub-dataset from the ASHE workbook: skip the preamble rows,
# drop the footnote rows, keep only the needed columns, rename them and
# index the result by ITL code.
all_hours_data = pd.read_excel(
    'D:\\ashetable332022provisional\\PROV - Home ITL (3) Table 33.9a   Paid hours worked - Total 2022.xls',
    sheet_name='Full-Time',
    header=5,                # skip the 5 preamble rows at the top of the sheet
    usecols='A,B,D,F',       # Location, Code, median and mean hours columns
    names=['Location', 'ITL_Code', 'Median_Hours', 'Mean_Hours'],
    skipfooter=5,            # drop the 5 footnote rows at the bottom
    index_col=1              # second selected column (ITL_Code) becomes the index
    )

This results in a dataframe with ITL_Code as the index (because we told read_excel to use column 1 - the second of the selected columns - as the index_col). This is important as we will use the index column to join to the geo-boundaries data.

Exploration and import of the GeoJSON UK ITL2 boundaries

Next I use the geopandas read_file method to import the GeoJSON boundary data.

The raw GeoJSON is a FeatureCollection: each feature has a properties object holding attributes such as the region's ITL code, and a geometry object holding the boundary polygon itself.

I use the set_index() method to ensure the ITL codes are the 'key' for this dataframe so they can be joined easily (as mentioned above in relation to the statistical data). I also select a geometry column using set_geometry(). This isn't actually required if there is only one geometry column in the data, as in this case, since it will be chosen by default - but I've made it explicit. The purpose of specifying the geometry column is so geopandas knows which one to use in its geo-calculations.
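
As a rough sketch, assuming a locally downloaded boundary file and an ITL code column called ITL221CD (both the path and the column name are my guesses - use whatever your GeoJSON actually contains), this step looks something like:

import geopandas as gpd

# Read the ITL2 boundaries; each feature's properties become columns and
# the boundary polygon becomes the 'geometry' column.
boundaries = gpd.read_file('D:\\boundaries\\itl2_uk_boundaries.geojson')

# Index by ITL code so the frame can be joined to the ASHE data, and make
# the geometry column explicit (optional here as there is only one).
boundaries = boundaries.set_index('ITL221CD')
boundaries = boundaries.set_geometry('geometry')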

There's also a data wrangling step added ad-hoc when I discovered that one of the ITL codes (Northern Ireland) doesn't have any sub-regions at ITL2 and ITL3 level - it is all just grouped as Northern Ireland - and the formatting of its code differs between the two sources! Of course it is common to need data wrangling steps like this, so I'm grateful that this was the only mismatch.
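
As an illustration only (the code values below are placeholders, not the real ones from either source), this kind of fix can be a one-line index rename before joining:

# Align the Northern Ireland code in the boundaries data with the code
# format used in the ASHE workbook (placeholder values).
boundaries = boundaries.rename(index={'NI_CODE_IN_GEOJSON': 'NI_CODE_IN_ASHE'})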

Now I can join the two datasets by their index, and get geopandas to produce a plot!

I use a dataframe join to combine the boundaries and statistical dataframes, choosing only the needed columns from the boundaries data as I go. In this case it is an inner join, returning just the matched rows. The default plot draws a map against the x and y axes, with legend=True adding a colorbar. By default the viridis colormap is used (this is just matplotlib under the hood).
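
As a rough sketch, using the column names from the earlier import and keeping only the geometry column from the boundaries:

# Inner join on the shared ITL-code index, keeping only the geometry
# column from the boundaries frame.
full_time_data = boundaries[['geometry']].join(all_hours_data, how='inner')

# A first choropleth: colour by mean hours, with legend=True adding a
# colorbar (viridis by default).
ax = full_time_data.plot(column='Mean_Hours', legend=True, figsize=(8, 8))
ax.set_title('Mean weekly hours worked - full-time')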

Having worked through all the individual parts I can now create the final plot...

Creating geopandas subplots and formatting them

In this final section I've put together all of the above to create a plot with 6 subplots in a 2 row x 3 column layout. The rows will be for full-time and part-time work, and the columns for All, Male and Female genders.

Points to note:

  • I make use of a "helper" function read_data to abstract the process of loading data from the Excel sheet. This makes sense to do in this case since I have 6 sheets with an identical structure.
  • The dataframes full_time_data and part_time_data are created by joining the datasets.
  • I then calculate the minimum and maximum value across the 3 (All, Male and Female) columns so that the color mapping and legend are on the same scale. Of course, full-time and part-time hours are very different so I have a scale for each.
  • I loop over the full-time and then the part-time data, creating a geopandas plot for each and placing it in its spot in the grid. I used a Blues colormap for full-time and Greens for part-time to make them visually distinct without really implying an "order".
  • I apply some formatting to the legend. The geopandas legend is a matplotlib colorbar, created as an additional axis, so to get a reference to it I use ax.figure.axes[i], where ax is a reference to the subplot: figure is the plot as a whole, and the additional axes 6 through 11 (0-5 being the map subplots) are the legends. A rough sketch of the whole loop follows this list.
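
Putting those points together, a minimal sketch of the final figure might look like the following. It assumes full_time_data and part_time_data have already been built by joining the boundaries with the read_data output, and that the joined mean-hours columns are named 'All', 'Male' and 'Female' - those column names are my assumption, as is the formatting applied to the colorbars:

import matplotlib.pyplot as plt

# full_time_data and part_time_data are GeoDataFrames built by joining
# the boundaries with the output of the read_data helper.
datasets = [
    ('Full-time', full_time_data, 'Blues'),
    ('Part-time', part_time_data, 'Greens'),
]

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

for row, (label, data, cmap) in enumerate(datasets):
    # One colour scale per row so All/Male/Female share the same range.
    vmin = data[['All', 'Male', 'Female']].min().min()
    vmax = data[['All', 'Male', 'Female']].max().max()

    for col, column in enumerate(['All', 'Male', 'Female']):
        ax = axes[row, col]
        data.plot(column=column, cmap=cmap, vmin=vmin, vmax=vmax,
                  legend=True, ax=ax)
        ax.set_title(f'{label} - {column}')
        ax.set_axis_off()

# Each legend is a colorbar created as an extra axis after the six map
# axes, so they can be reached via fig.axes[6:] for further formatting.
for cbar_ax in fig.axes[6:]:
    cbar_ax.tick_params(labelsize=8)

plt.tight_layout()
plt.show()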

Jupyter notebook

The complete Jupyter notebook for the above can be found here (GitHub Gist).