Mapping UK shopping centres using data wrangling with BeautifulSoup and plotting with Plotly Express and Mapbox

I step through extracting geo-data from a Wikipedia table (or any HTML source) and plotting it on an interactive canvas.

Context

I wanted to work a bit with the mapping functionality of Plotly Express, so I decided to use Wikipedia to source some data, which turned into a tour through the shopping centres of the UK. As well as the visualisation itself there's a data wrangling component to this (admittedly not my favourite part; I tend to like starting with 'nice' datasets, but don't we all!), but nothing too elaborate as the Wikipedia data was well formed.

Concepts covered in this walkthrough:

  • Traverse HTML using BeautifulSoup to extract place data from a Wikipedia table, following each entry's linked page to find its geographic coordinates
  • Plot the discovered data on a Mapbox scatter plot using Plotly Express

Final result: (the plot produced in the notebook itself is interactive, i.e. zoomable etc.)

Obtaining the data from Wikipedia using BeautifulSoup

First use the requests library to download the content from the relevant page (requests.get() returns a Response object, from which we extract the content) - in this case a list of shopping centres in the UK by size.

I create a BeautifulSoup object from the downloaded content and specify that we will parse it using the html parser. As there is only one table on this page I've gone directly to s.table; if there were multiple I would have used find_all() and accessed the right one by index, as I've done in the following part to find_all() 'tr' (table row) items in that table. I excluded the first one as it contained the header cells.
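A minimal sketch of those steps - in the notebook the HTML comes from the live Wikipedia page via requests, but here a stand-in table snippet keeps the example self-contained (the URL and sample row are placeholders, not the page's exact content):

```python
from bs4 import BeautifulSoup

# In the notebook the content is downloaded first, e.g.:
#   import requests
#   html = requests.get(wiki_list_url).content   # wiki_list_url: the list page's URL
# A stand-in snippet with the same shape as the Wikipedia table:
html = """
<html><body><table>
  <tr><th>Rank</th><th>Name</th><th>Location</th><th>Region</th><th>Size</th></tr>
  <tr><td>1</td><td><a href="/wiki/Westfield_London">Westfield London</a></td>
      <td>London</td><td>Greater London</td><td>241,000</td></tr>
</table></body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # parse with the built-in html parser
table = soup.table                         # only one table, so go straight to it
rows = table.find_all("tr")[1:]            # skip the first row (header cells)
```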

The table contains links to the Wikipedia pages for the individual shopping centres as well as their location (town/city), region and area.

Starting with an empty list I loop over the table rows (which at the time of writing cover 44 centres of more than 65,000 square metres) and extract:

  • name (cells[1].a.contents[0])
  • location (cells[2].get_text().replace('\n','')) - some of these were imported with newlines so I had to remove those
  • region (cells[3].get_text().replace('\n','')) - again there were some newlines here
  • size in square metres (int(cells[4].contents[0].replace(',',''))) - here I've removed the "thousands separator" and converted to an int
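Put together, the per-row extraction might look like this sketch (the cell indices follow the list above; the sample row is a stand-in for one row of the Wikipedia table):

```python
from bs4 import BeautifulSoup

# A stand-in table row: rank, linked name, location, region, size.
row_html = """
<tr><td>1</td><td><a href="/wiki/Westfield_London">Westfield London</a></td>
    <td>London\n</td><td>Greater London\n</td><td>241,000</td></tr>
"""
row = BeautifulSoup(row_html, "html.parser").tr

cells = row.find_all("td")
name = cells[1].a.contents[0]                         # text of the link
location = cells[2].get_text().replace("\n", "")      # strip stray newlines
region = cells[3].get_text().replace("\n", "")
size_m2 = int(cells[4].contents[0].replace(",", ""))  # drop thousands separator
```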

The coordinates for each centre are listed on the page for the specific centre, so I've navigated to that page via the href of the link and scraped the coordinates. After some digging through the HTML I found that the geo class was a suitable way to find the coordinates on each of the pages:

I wrapped that logic into a helper function get_coords_from_wiki_page() that takes a URL and again uses BeautifulSoup to find the needed coordinates. This code isn't particularly robust since it is just for a one-off data-extraction script; if putting something like this into 'production' it would obviously need to have error handling and so on.
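A sketch of that helper - the notebook version takes a URL and downloads the page with requests before parsing; here it takes the HTML directly so the example is self-contained. The 'lat; lon' decimal format inside the geo span matches how Wikipedia's coordinates markup renders, but treat the exact structure as an assumption:

```python
from bs4 import BeautifulSoup

def get_coords_from_wiki_page(html):
    """Extract (lat, lon) from a centre's Wikipedia page via the 'geo' class.

    In the notebook this takes a URL and fetches it with requests;
    passing HTML directly keeps this sketch self-contained.
    """
    soup = BeautifulSoup(html, "html.parser")
    geo = soup.find(class_="geo")            # e.g. "51.507; -0.221"
    lat, lon = geo.get_text().split(";")
    return float(lat), float(lon)
```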

Each row of the table gets its data accumulated into the list.

Finally, with the list populated I could create a pandas DataFrame from it. (I haven't shown it here, but there should be some steps to ensure that all the data has been imported correctly and there are no missing / erroneous values etc. In this case it was easy just to visually inspect.)
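The DataFrame step might look like this (the rows, column names and figures are illustrative placeholders, not the notebook's actual values):

```python
import pandas as pd

# Stand-in rows in the shape accumulated by the scraping loop:
# (name, location, region, size in square metres, latitude, longitude)
rows = [
    ("Westfield London", "London", "Greater London", 241000, 51.507, -0.221),
    ("Metrocentre", "Gateshead", "North East England", 190500, 54.957, -1.665),
]

df = pd.DataFrame(rows, columns=["name", "location", "region", "size_m2", "lat", "lon"])
```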

pandas has a read_html() function, but I didn't use it here as I wanted to control the elements extracted and traverse to the linked pages before combining the data.

Plotting the data with Plotly Express and Mapbox

The code for this part is fairly long but only because I did a number of customisations relative to the default plot - it's conceptually simple...

I plotted the data in these steps:

  • Set the renderer to 'notebook' - this is specific to running it in a VS Code-like environment so the plot will display.

  • Using the DataFrame created above, call the scatter_mapbox() function with some initial arguments to create the plot.

    • The DataFrame itself, of course
    • lat and lon mappings to the fields in my data to give the location for the points
    • custom_data - a list of fields to work with. I initially used hover_name and hover_data (shown here commented out) which work nicely with the default plot, but I wanted to customise the hover behaviour further
    • color, opacity, size etc for the points
    • mapbox_style, which I have defaulted to 'carto-positron'. The Mapbox integration offers a number of different "tiles" for generating a map, some of which require an API key to render. The one I've selected doesn't require a key, for ease of use, although I do have one and have had this working with key-requiring map styles (shown commented out). Obtaining an API key follows the usual process: sign up with Mapbox and generate a token. They have a free pricing tier.
    • width, height and zoom level for the rendered map
  • With a reference (fig) to the figure, call update_layout() on it to set the appearance for the hover labels (when hovering on a scatter point), the title and the margin.

  • Update the color axis (the bar on the right in the finished chart) to set some appearance properties for the scale itself and the title.
  • Call update_traces() to set the 'template' for what will be displayed when hovering.
  • Update the center point of the plot i.e. the coordinate on which the view will be centred by default.
  • Add an annotation to credit the Wikipedia page.

Interacting with the rendered map

The resulting map can be scrolled around, zoomed and hovered on (a bit difficult to capture in screenshots, but I've made an attempt below!).

Initial view:

Hovering on a data point:

Zooming and panning:

Jupyter notebook

The complete Jupyter notebook for the above can be found here (GitHub Gist). There are a couple of fonts (Bebas Neue and Ubuntu Light) that aren't default system fonts, so those could be installed or exchanged for something else.