Identifying different types of green space deprivation with K means clustering and TidyModels
My role: analysis design, analysis, data visualisation, reporting
Techniques used: k means clustering, feature engineering, parameter tuning, exploratory data analysis, unsupervised learning, data visualisation.
In brief:
In this notebook, I run a cluster analysis to identify different types of green space deprivation in England. This notebook focuses on the feature engineering and parameter tuning. The analysis and presentation of the clustering results appear in another notebook.
- Objective: to identify different types of green space deprivation experienced by different populations across England.
- My initial hypothesis:
- different areas of England experience different forms of green space deprivation.
- One of these forms might look something like: an area tending to have relatively large amounts of garden space per person, but relatively small amounts of public green space per person in easy walking distance.
- An unsupervised learning approach might help to identify different forms of green space deprivation by identifying structures within green space and demographic data.
- Geospatial scale: identify forms of green space deprivation at the local scale, the Middle Super Output Area (MSOA) scale in the jargon of UK administrative geography. To give a sense of scale, the mean population of an MSOA is approximately 7,200.
- Data: the dataset used was released by Friends of the Earth in 2020, as part of their research on ‘England’s Green Space Gap’. The dataset includes some UK Government data, and some data which Friends of the Earth derived from analysis of the UK Government data.
Setting up the notebook
# ------------------------------------------------------------------------------
# Import packages used in the notebook
# ------------------------------------------------------------------------------
# for data manipulation and visualisation
library(dplyr)
library(ggplot2)
library(tidyr)
# for string manipulation
library(stringr)
# for functional programming
library(purrr)
# for reading in data xslx files
library(readxl)
# for cluster analysis and feature engineering within a tidy models workflow
library(tidyclust)
library(tidymodels)
# for umap (a dimension reduction technique used in the feature engineering)
library(embed)
# for skewedness and kurtosis tests used in exploratory analysis
library(moments)
# for table formatting
library(knitr)

# ------------------------------------------------------------------------------
# Imported packages used in the notebook
# ------------------------------------------------------------------------------
subset(
  data.frame(sessioninfo::package_info()),
  attached == TRUE,
  c(package, loadedversion)
) %>%
  kable()

| | package | loadedversion |
|---|---|---|
| broom | broom | 1.0.0 |
| dials | dials | 1.1.0 |
| dplyr | dplyr | 1.0.9 |
| embed | embed | 1.0.0 |
| ggplot2 | ggplot2 | 3.4.0 |
| infer | infer | 1.0.4 |
| knitr | knitr | 1.39 |
| modeldata | modeldata | 1.0.1 |
| moments | moments | 0.14.1 |
| parsnip | parsnip | 1.0.3 |
| purrr | purrr | 0.3.4 |
| readxl | readxl | 1.4.0 |
| recipes | recipes | 1.0.3 |
| rsample | rsample | 1.1.1 |
| scales | scales | 1.2.1 |
| stringr | stringr | 1.4.1 |
| tibble | tibble | 3.1.8 |
| tidyclust | tidyclust | 0.1.0 |
| tidymodels | tidymodels | 1.0.0 |
| tidyr | tidyr | 1.2.0 |
| tune | tune | 1.0.1 |
| workflows | workflows | 1.1.2 |
| workflowsets | workflowsets | 1.0.0 |
| yardstick | yardstick | 1.1.0 |
# ------------------------------------------------------------------------------
# Set up notebook output and data viz styling
# ------------------------------------------------------------------------------
# set default Rmd Chunk options
knitr::opts_chunk$set(echo = TRUE,
                      cache = TRUE,
                      warning = FALSE,
                      message = FALSE)
# set default theme for exploratory plots
theme_set(theme_light())
# define colour palette for plots
colours <- c("#374E83", "#C94A54")

Reading in the data
The dataset used is from Friends of the Earth (2020). It was released as part of their research on ‘England’s Green Space Gap’. Data at different spatial resolutions (LSOA, MSOA and Local Authority) is included, but I focus only on the MSOA (Middle Super Output Area) scale. This was the finest-grained resolution at which the data required for the analysis was available.
The code chunk below reads the data in from a local Excel file, sorts out the variable naming conventions and outputs an overview of the structure of the data. This let me see all the variables I had to work with, and check whether the variable types (character, numeric, logical etc.) looked correct.
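As a small illustration of what that renaming does (a toy example of my own, not part of the analysis data), janitor::clean_names() converts inconsistent column names to snake_case:

# toy example only: clean_names() standardises awkward column names to snake_case
janitor::clean_names(
  data.frame("MSOA Code" = 1, "Green Space AREA" = 2, check.names = FALSE)
)
# the columns become msoa_code and green_space_area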
foe_green_space_raw <- read_excel("../Data/(FOE) Green Space Consolidated Data - England - Version 2.1.xlsx",
  sheet = "MSOAs V2.1"
) %>%
  # inconsistent naming conventions for variables are used in the source data
  # so clean the names for consistency (variable names are now snake_case)
  janitor::clean_names()
str(foe_green_space_raw)

## tibble [6,791 × 38] (S3: tbl_df/tbl/data.frame)
## $ x1 : num [1:6791] 1 2 3 4 5 6 7 8 9 10 ...
## $ msoa_code : chr [1:6791] "E02000001" "E02000002" "E02000003" "E02000004" ...
## $ msoa_name : chr [1:6791] "City of London 001" "Barking and Dagenham 001" "Barking and Dagenham 002" "Barking and Dagenham 003" ...
## $ msoa_name_house_of_commons : chr [1:6791] "City of London" "Marks Gate" "Chadwell Heath East" "Eastbrookend" ...
## $ la_code : chr [1:6791] "E09000001" "E09000002" "E09000002" "E09000002" ...
## $ la_name : chr [1:6791] "City of London" "Barking and Dagenham" "Barking and Dagenham" "Barking and Dagenham" ...
## $ la_name_for_readability : chr [1:6791] "City of London" "Barking and Dagenham" "Barking and Dagenham" "Barking and Dagenham" ...
## $ area : num [1:6791] 2905505 2165730 2143624 2490295 1186203 ...
## $ population_imd : num [1:6791] 6687 7379 10720 6536 9243 ...
## $ population_density : num [1:6791] 0.0023 0.00341 0.005 0.00262 0.00779 ...
## $ total_pop_from_ethnicity_data : num [1:6791] 7375 6775 10045 6182 8562 ...
## $ white_pop : num [1:6791] 5799 4403 5486 5006 5674 ...
## $ mixed_multiple_ethnic_group_pop : num [1:6791] 289 330 433 186 313 307 478 366 280 202 ...
## $ asian_asian_british_pop : num [1:6791] 940 820 2284 313 1050 ...
## $ black_african_caribbean_black_british_pop: num [1:6791] 193 1133 1618 649 1445 ...
## $ other_ethnic_group_pop : num [1:6791] 154 89 224 28 80 55 141 92 94 51 ...
## $ bame_pop : num [1:6791] 1576 2372 4559 1176 2888 ...
## $ income_index : num [1:6791] 7.85 2.71 3.95 4.13 3.73 ...
## $ green_space_area : num [1:6791] 18805 98083 145909 1228232 38041 ...
## $ green_space_area_per_capita : num [1:6791] 2.81 13.29 13.61 187.92 4.12 ...
## $ unbuffered_go_space_area : num [1:6791] 0 78286 138734 1228232 37415 ...
## $ buffrd_go_space_area : num [1:6791] 152351 725988 797423 2328455 291124 ...
## $ unbuffered_go_space_per_capita : num [1:6791] 0 10.61 12.94 187.92 4.05 ...
## $ pop_area : num [1:6791] 2905505 2087444 2004890 1262063 1148789 ...
## $ pop_area_with_go_space_access : num [1:6791] 152351 647702 658689 1100223 253709 ...
## $ pcnt_pop_area_with_go_space_access : num [1:6791] 5.24 31.03 32.85 87.18 22.08 ...
## $ pcnt_pop_with_go_space_access : num [1:6791] 7.88 29.07 41.58 84.74 27.03 ...
## $ pcnt_pop_without_go_space_access : num [1:6791] 92.1 70.9 58.4 15.3 73 ...
## $ pop_without_go_space_access : num [1:6791] 6160 5234 6263 997 6744 ...
## $ garden_area_houses : num [1:6791] 380 1199811 433155 389959 410321 ...
## $ garden_area_flats : num [1:6791] 3294 36880 69334 16058 27838 ...
## $ garden_area_total : num [1:6791] 3674 1236691 502489 406018 438159 ...
## $ garden_area_per_capita : num [1:6791] 0.549 167.596 46.874 62.12 47.404 ...
## $ gsdi_avg_area : num [1:6791] 1 2 2 4 1 4 2 1 2 3 ...
## $ gsdi_access : num [1:6791] 1 2 2 4 2 2 2 3 2 4 ...
## $ gsdi_garden : num [1:6791] 1 4 1 2 1 1 1 1 1 1 ...
## $ green_space_deprivation_score : num [1:6791] 1 4 2 4 1 4 2 1 2 4 ...
## $ green_space_deprivation_rating : chr [1:6791] "E" "B" "D" "B" ...
Clean the data
There are many variables within the dataset, some of which are derived from others. I have previous experience of working with the dataset, and had performed some earlier exploratory clustering with it too. So, I had some initial ideas about which variables would be interesting to work with.
I chose some demographic and some green space variables, as this made sense given the analysis objectives. I avoided pairs of variables that would in some way duplicate information. For example, I included only population density for each neighbourhood, rather than population and area.
# -----------------------------------------------------------------------------
# Select variables of interest from the wider dataset
# -----------------------------------------------------------------------------
foe_green_space_focus <- foe_green_space_raw %>%
  select(
    # id
    msoa_code,
    # demographic variables
    population_density, bame_pop, total_pop_from_ethnicity_data, income_index,
    # green space variables
    pcnt_pop_without_go_space_access, garden_area_per_capita,
    green_space_area_per_capita
  )
# -----------------------------------------------------------------------------
# Create (derived) variables for use in clustering
# -----------------------------------------------------------------------------
foe_green_space_focus <- foe_green_space_focus %>%
  mutate(pcnt_bame_pop = bame_pop / total_pop_from_ethnicity_data * 100) %>%
  select(-c(bame_pop, total_pop_from_ethnicity_data))

Exploring the data
Exploratory plotting
A quick look at the distributions of each variable shows we will have some heavily skewed data. I’ll pick up dealing with this in the data pre-processing, as it might prove beneficial (in terms of clustering results) to transform some variables to reduce the skew.
# reshape data for plotting
foe_green_space_focus_long <- foe_green_space_focus %>%
  pivot_longer(
    cols = population_density:pcnt_bame_pop,
    names_to = "variable",
    values_to = "value"
  )
# get a sense of distributions
ggplot(
  foe_green_space_focus_long,
  aes(value)
) +
  geom_density() +
  facet_wrap(~variable, scales = "free") +
  theme(
    strip.background = element_rect(fill = "white"),
    strip.text = element_text(colour = "black")
  )
Just how skewed is the data?
Having looked at the plot above, I wanted some quantitative measures of how far the distributions of the variables were from normal. So, I ran some quick skewness and kurtosis tests.
The skewness test proved helpful later on in thinking about the type and strength of transformation to use on each variable. A skewness score close to 0 means the distribution of a variable is similar to a normal distribution.
The kurtosis test yields insight into the extent to which each variable is affected by outliers (in comparison to a normal distribution). A kurtosis score higher than 3 suggests more outliers than is typical for a normally distributed variable. Conversely, a kurtosis score lower than 3 suggests fewer outliers than a normal distribution.
Skew and outliers can cause problems for the k-means algorithm, because its assumption that clusters are roughly spherical needs to be satisfied to some extent.
Information on how to run the skew and kurtosis tests in R came from this article.
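As a quick guide to reading these scores (a toy sketch of my own, not derived from the analysis data), a heavily right-skewed variable scores well above 0 for skewness and well above 3 for kurtosis, and a log transform pulls both back towards the values expected under a normal distribution:

# toy example only: skewness and kurtosis before and after a log transform
set.seed(42)
skewed_var <- rlnorm(1000, meanlog = 0, sdlog = 1)
moments::skewness(skewed_var)       # well above 0
moments::kurtosis(skewed_var)       # well above 3
moments::skewness(log(skewed_var))  # close to 0
moments::kurtosis(log(skewed_var))  # close to 3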
#' Get a nice looking table of skew and kurtosis test results
#'
#' @param df a dataframe of variables
#'
#' @return a dataframe of test results
#' @export
#'
#' @examples
test_skew_kurtosis <- function(df) {
  df %>%
    # calculate summary statistics
    summarise(across(dplyr::everything(),
      list(skewness = skewness, kurtosis = kurtosis),
      na.rm = TRUE
    )) %>%
    # rearrange dataframe into a nice format
    pivot_longer(cols = everything()) %>%
    mutate(
      test = if_else(str_detect(name, "skewness"),
        "skewness", "kurtosis"
      ),
      name = str_remove(name, c("_skewness", "_kurtosis"))
    ) %>%
    rename(variable = name) %>%
    pivot_wider(names_from = "test", values_from = "value") %>%
    arrange(desc(skewness), desc(kurtosis))
}
skew_kurtosis_test_res <- test_skew_kurtosis(select(
  foe_green_space_focus,
  -msoa_code
))
kable(skew_kurtosis_test_res)

| variable | skewness | kurtosis |
|---|---|---|
| green_space_area_per_capita | 24.8523268 | 822.028067 |
| garden_area_per_capita | 15.5339581 | 667.133444 |
| population_density | 2.0696404 | 8.506217 |
| pcnt_bame_pop | 2.0270140 | 6.706167 |
| income_index | -0.1306751 | 1.971792 |
| pcnt_pop_without_go_space_access | -0.4481288 | 2.416562 |
Remove some outliers
The garden_area_per_capita and green_space_area_per_capita variables show very high skewness and kurtosis test scores. Transforming these variables to address the skew can be done in the feature engineering below. However, due to the way the TidyModels packages work, it is easier to deal with outliers here, ahead of starting the feature engineering and algorithm parameter tuning.
The next chunk of code looks at what proportion of data would be lost if observations above certain thresholds were omitted as outliers.
prop_above_threshold <- function(variable, threshold) {
  obs_above_thres <- length(which(foe_green_space_focus[variable] > threshold))
  total_obs <- length(foe_green_space_focus[[variable]])
  obs_above_thres / total_obs
}
# -----------------------------------------------------------------------------
# garden_area_per_capita: Look at proportions of observations above thresholds
# -----------------------------------------------------------------------------
cut_off_thresholds <- c(1, 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
prop_omitted <- tibble(threshold = cut_off_thresholds) %>%
  mutate(
    prop_omitted_garden_area_per_capita = map_dbl(
      threshold,
      ~ prop_above_threshold("garden_area_per_capita", .)
    )
  )
kable(prop_omitted)

| threshold | prop_omitted_garden_area_per_capita |
|---|---|
| 1 | 0.9991165 |
| 10 | 0.9768812 |
| 50 | 0.7916360 |
| 100 | 0.4080401 |
| 200 | 0.1546164 |
| 300 | 0.0672949 |
| 400 | 0.0251804 |
| 500 | 0.0067737 |
| 600 | 0.0023561 |
| 700 | 0.0010308 |
| 800 | 0.0004418 |
| 900 | 0.0001473 |
| 1000 | 0.0001473 |
# -----------------------------------------------------------------------------
# green_space_area_per_capita: Look at proportions of observations above thresholds
# -----------------------------------------------------------------------------
cut_off_thresholds <- c(1, 10, 50, 100, 200, 300, 400, 500, 1000, 5000, 8000, 10000, 100000)
prop_omitted <- tibble(threshold = cut_off_thresholds) %>%
  mutate(
    prop_omitted_green_space_area_per_capita = map_dbl(
      threshold,
      ~ prop_above_threshold("green_space_area_per_capita", .)
    )
  )
kable(prop_omitted)

| threshold | prop_omitted_green_space_area_per_capita |
|---|---|
| 1e+00 | 0.9447799 |
| 1e+01 | 0.7025475 |
| 5e+01 | 0.2412016 |
| 1e+02 | 0.1434251 |
| 2e+02 | 0.0868797 |
| 3e+02 | 0.0640554 |
| 4e+02 | 0.0524223 |
| 5e+02 | 0.0424091 |
| 1e+03 | 0.0263584 |
| 5e+03 | 0.0082462 |
| 8e+03 | 0.0058901 |
| 1e+04 | 0.0050066 |
| 1e+05 | 0.0001473 |
Having inspected the table above, I decided to remove outliers in the following way.
- garden_area_per_capita: remove observations above 500 m2, losing less than 1% of observations.
- green_space_area_per_capita: remove observations above 5000 m2, losing less than 1% of observations.
I also reran the skew and kurtosis tests. green_space_area_per_capita retained a very high kurtosis score, which I’ll look to deal with in the Univariate transformations section of the feature engineering.
# retain outliers for inspection and use later on
foe_green_space_outliers <- foe_green_space_focus %>%
  filter(
    garden_area_per_capita > 500 |
      green_space_area_per_capita > 5000
  )
# remove outliers ahead of feature engineering workflow
foe_green_space_focus <- foe_green_space_focus %>%
  filter(
    garden_area_per_capita <= 500,
    green_space_area_per_capita <= 5000
  )
# rerun the skew and kurtosis tests for the variables where I've removed some
# outliers
test_skew_kurtosis(select(
  foe_green_space_focus, garden_area_per_capita, green_space_area_per_capita
)) %>%
  kable()

| variable | skewness | kurtosis |
|---|---|---|
| green_space_area_per_capita | 8.586878 | 91.509741 |
| garden_area_per_capita | 1.701253 | 5.856855 |
Feature engineering
Before doing the cluster analysis itself, I did a lot of feature engineering work. The aim was to get the clustering input data into as good a state as possible within the time available.
Setting up the feature engineering workflow
First, I set up some feature engineering workflow functions. These functions allowed me to easily and repeatedly run, and view the results of, clustering with different data pre-processing / feature engineering steps (i.e. recipes in TidyModels terminology) and different values of k.
The metric I use to assess the performance of the clustering is the ratio of WSS to TSS (a minimal illustration follows the definitions below), where:
- WSS is the within-cluster sum of squared error (this measures the distance of each observation from its cluster centroid).
- TSS is the total sum of squared error (this measures the distance of all observations from a global centroid).
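Here is that minimal illustration of the ratio (a sketch of my own using base stats::kmeans on a built-in dataset, rather than the tidyclust workflow used in the rest of the notebook):

# toy example only: the sse_ratio metric is within-cluster SS divided by total SS
example_fit <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 3, nstart = 25)
example_fit$tot.withinss / example_fit$totss # lower values indicate tighter clusters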
set.seed(1289473)
# ------------------------------------------------------------------------------
# key parameters
# ------------------------------------------------------------------------------
#' Run the K means cluster workflow with a single recipe (k is a tuning parameter)
#'
#' @param data_df the dataset to be clustered
#' @param num_cluster_levels number of levels created between 1 and 10 in the parameter (i.e. k) tuning grid
#' @param num_cross_validation_folds number of cv folds
#' @param recipe the recipe used to pre-process the data ahead of loading it into the clustering workflow
#'
#' @return a dataframe of the results of the tuning process
#' @export
#'
#' @examples
run_cluster_wf <- function(data_df, num_cluster_levels = 5,
num_cross_validation_folds = 2,
recipe) {
# ---------------------------------------------------------------------------
# create cross-validation samples
green_space_cv <- vfold_cv(data_df, v = num_cross_validation_folds)
# ---------------------------------------------------------------------------
# ---------------------------------------------------------------------------
# set up clustering algorithm
# ---------------------------------------------------------------------------
kmeans_spec <- k_means(num_clusters = tune()) %>%
set_engine("stats")
# ---------------------------------------------------------------------------
# set up workflow which brings together the clustering algorithm and
# the data pre-processing (i.e. the recipe) for execution
# ---------------------------------------------------------------------------
kmeans_wflow <- workflow(recipe, kmeans_spec)
# ---------------------------------------------------------------------------
# create parameter tuning grid - for k (i.e. the number of clusters)
# ---------------------------------------------------------------------------
num_clusters_grid <- grid_regular(num_clusters(), levels = num_cluster_levels)
# ---------------------------------------------------------------------------
# run tuning
# ---------------------------------------------------------------------------
res <- tune_cluster(kmeans_wflow,
resamples = green_space_cv,
grid = num_clusters_grid,
control = control_grid(save_pred = TRUE, extract = identity),
metrics = cluster_metric_set(
sse_within_total, sse_total,
sse_ratio
)
)
return(res)
}
#' Run the K means cluster workflow for multiple recipes (k is a tuning parameter)
#'
#' @param data_df the dataset to be clustered
#' @param num_cluster_levels number of levels created between 1 and 10 in the parameter (i.e. k) tuning grid
#' @param num_cross_validation_folds number of cv folds
#' @param recipes a named list of recipes
#'
#' @return a ggplot graph showing the performance (i.e. sse_ratio) of clustering with each recipe
#' @export
#'
#' @examples
run_wf_with_recipes <- function(data_df,
num_cluster_levels = 5,
num_cross_validation_folds = 2,
recipes) {
# ---------------------------------------------------------------------------
# create a dataframe with each recipe
# to store the objects create at each stage of the clustering process
# ---------------------------------------------------------------------------
clustering_df <- tibble(
recipe_id = names(recipes), # for meaningful names in plotting
recipe = recipes
)
# ---------------------------------------------------------------------------
# Run the workflow with each recipe and get the performance metrics for
# each
# ---------------------------------------------------------------------------
metric_df <- clustering_df %>%
mutate(
# run a clustering workflow for each recipe
workflow = map(recipe, ~ run_cluster_wf(
recipe = .,
data_df = data_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)),
# get the performance metrics for each workflow
metrics = map(workflow, ~ collect_metrics(.))
)
# ---------------------------------------------------------------------------
# transform the metric data for use in ggplot2
# including selecting the metric of interest
# ---------------------------------------------------------------------------
metric_df_focus <- metric_df %>%
# expand the nested metrics tables
select(recipe_id, metrics) %>%
unnest(metrics) %>%
# focus on metric of interest
filter(.metric == "sse_ratio") %>%
# so that different coloured lines can be plotted for each recipe
mutate(recipe_id = factor(recipe_id))
# ---------------------------------------------------------------------------
# plot the performance metric for each recipe
# ---------------------------------------------------------------------------
p <- ggplot(
data = metric_df_focus,
mapping = aes(
x = num_clusters, y = mean,
colour = recipe_id, group = recipe_id
)
) +
geom_line() +
scale_color_viridis_d() +
theme_minimal() +
ylab("Performance (mean WSS/TSS ratio)") +
xlab("Number of clusters") +
scale_x_continuous(breaks = 1:10)
# make plot interactive
plotly::ggplotly(p)
}

Doing the feature engineering
In this section I do the feature engineering itself, using the workflow set up above. My choices around the order of feature engineering steps were based on recommendations made by the authors of TidyModels.
Checking for missing data
The first step in the feature engineering was to check for, and if necessary impute, any missing data. Fortunately, there is no missing data in the Friends of the Earth dataset.
visdat::vis_miss(foe_green_space_focus)
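If there had been missing values, an imputation step could have been added to the recipes used later on. A minimal sketch (not used in this analysis) might look like this:

# sketch only: impute missing numeric values with the median of each variable
recipe(~., data = foe_green_space_focus) %>%
  step_impute_median(all_numeric_predictors())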
Handling qualitative variables
The next step would have been to think through if and how to use any qualitative variables. However, there aren’t any in the dataset I’m using, so we can just move on.
Standardising the scale of the variables
K means is very sensitive to the distances between points. So, the first thing to do was to put all the variables on the same scale (between 0 and 1). Running the clustering workflow at this point gives us the baseline performance of the clustering.
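To make the rescaling concrete, here is a toy sketch of my own (not part of the analysis) showing what step_range() does once a recipe is prepped and baked:

# toy example only: step_range() rescales each numeric variable to lie between 0 and 1
recipe(~., data = data.frame(x = c(2, 5, 10, 20))) %>%
  step_range(all_numeric(), min = 0, max = 1) %>%
  prep() %>%
  bake(new_data = NULL)
# x becomes 0.00, 0.17, 0.44, 1.00 (approximately)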
This is the first section of actual feature engineering that I run, so before doing that I need to set a few parameters for the clustering workflow.
# workflow tuning parameters used throughout feature engineering
num_cluster_levels <- 10
num_cross_validation_folds <- 2
# drop identifier column as not relevant to clustering
cluster_in_df <- select(foe_green_space_focus, -msoa_code) %>%
  drop_na()

So, now we can get on with standardising the scales of the variables.
# ------------------------------------------------------------------------------
# create range recipes
# ------------------------------------------------------------------------------
range_recipe_0 <- recipe(~., data = cluster_in_df) %>%
step_range(all_numeric(), min = 0, max = 1)
# add recipes to a list
range_recipes <- list(base = range_recipe_0)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
  recipes = range_recipes,
  data_df = cluster_in_df,
  num_cluster_levels = num_cluster_levels,
  num_cross_validation_folds = num_cross_validation_folds
)

Univariate transformations
The next step in the feature engineering is to transform the input data to try to address some of the skew and kurtosis identified in the EDA (Exploratory Data Analysis) section above. This was probably the most complex step in the feature engineering, as it involved comparing various approaches to transforming the data.
# ------------------------------------------------------------------------------
# a general transformation across all variables
# ------------------------------------------------------------------------------
trans_recipe_yeo <- range_recipe_0 %>%
step_YeoJohnson(all_numeric())
# ------------------------------------------------------------------------------
# stronger custom transformation (based on EDA observations)
# ------------------------------------------------------------------------------
trans_recipe_custom_strong <- range_recipe_0 %>%
step_inverse(
# for right skewed data
green_space_area_per_capita,
garden_area_per_capita,
offset = 1
) %>%
step_log(
# for right skewed data
population_density, pcnt_bame_pop,
# for left skewed data
pcnt_pop_without_go_space_access,
offset = 1
)
# ------------------------------------------------------------------------------
# gentler custom transformation (based on EDA observations)
# ------------------------------------------------------------------------------
trans_recipe_custom_gentle <- range_recipe_0 %>%
step_log(
# for right skewed data
green_space_area_per_capita,
garden_area_per_capita,
offset = 1
) %>%
step_sqrt(
# for right skewed data
population_density, pcnt_bame_pop,
# for left skewed data
pcnt_pop_without_go_space_access
)
# ------------------------------------------------------------------------------
# custom transformation (based on a mix of the three recipes above)
# ------------------------------------------------------------------------------
trans_recipe_custom_mixed_1 <- range_recipe_0 %>%
step_inverse(
# for right skewed data
green_space_area_per_capita,
garden_area_per_capita,
offset = 1
) %>%
step_YeoJohnson(
# for right skewed data
population_density, pcnt_bame_pop,
# for left skewed data
pcnt_pop_without_go_space_access
)
trans_recipe_custom_mixed_2 <- range_recipe_0 %>%
step_inverse(
# for extreme right skewed data
green_space_area_per_capita,
offset = 1
) %>%
step_log(
# for right skewed data
population_density, pcnt_bame_pop, garden_area_per_capita,
offset = 1
) %>%
step_YeoJohnson(
# for left skewed data
pcnt_pop_without_go_space_access
)
# add recipes to a list
trans_recipes <- list(
base = range_recipe_0,
YeoJohnson = trans_recipe_yeo,
custom_strong = trans_recipe_custom_strong,
custom_gentle = trans_recipe_custom_gentle,
custom_mixed_1 = trans_recipe_custom_mixed_1,
custom_mixed_2 = trans_recipe_custom_mixed_2
)
# ------------------------------------------------------------------------------
# View performance of clustering with each recipe
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
run_wf_with_recipes(
recipes = trans_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

In the end, the best performing set of transformations was a custom mix of transformations of varying strengths: an inverse transformation for the most skewed variables, a log transformation for the other skewed variables, and a Yeo-Johnson transformation for the remaining variables. So, this chunk of code just renames the appropriate recipe (i.e. set of transformations), so we can build on it in the next feature engineering step.
trans_recipe_preferred <- trans_recipe_custom_mixed_2

Creating interactions
I then looked at creating interactions between the variables. I tried some theoretically plausible interactions in the code chunk below, but none produced any major improvement in the performance of the clustering. So, for the sake of simplicity, none of these recipes will be taken forward.
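For reference, here is a toy sketch of my own (the interaction shown is purely for illustration) of what step_interact() produces; the interaction appears as a new column whose name uses an _x_ separator:

# toy example only: step_interact() adds a column such as income_index_x_pcnt_bame_pop
recipe(~., data = cluster_in_df) %>%
  step_interact(terms = ~ income_index:pcnt_bame_pop) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()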
# ------------------------------------------------------------------------------
# create interaction recipes
# ------------------------------------------------------------------------------
interactions_recipe_1 <- trans_recipe_preferred %>%
step_interact(terms = ~ income_index * pcnt_bame_pop)
interactions_recipe_2 <- trans_recipe_preferred %>%
step_interact(terms = ~ garden_area_per_capita * green_space_area_per_capita)
interactions_recipe_3 <- trans_recipe_preferred %>%
step_interact(terms = ~ population_density * green_space_area_per_capita)
interactions_recipe_4 <- trans_recipe_preferred %>%
step_interact(terms = ~ population_density * garden_area_per_capita)
# add recipes to a list
interactions_recipes <- list(
base_after_trans = trans_recipe_preferred,
interact_1 = interactions_recipe_1,
interact_2 = interactions_recipe_2,
interact_3 = interactions_recipe_3,
interact_4 = interactions_recipe_4
)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
recipes = interactions_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

Normalise
I also tried normalising all the variables, but this reduced the performance of the clustering. So, the recipes below won’t be taken forward.
# ------------------------------------------------------------------------------
# create normalisation recipes
# ------------------------------------------------------------------------------
normalisation_recipe <- trans_recipe_preferred %>%
step_normalize(all_numeric())
# add recipes to a list
normalisation_recipes <- list(
base_after_trans = trans_recipe_preferred,
norm = normalisation_recipe
)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
recipes = normalisation_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

Multivariate transformation
The last feature engineering step I tried was dimensionality reduction. I compared the impact of different dimension reduction algorithms (PCA, UMAP and ICA). Both UMAP and PCA (with 2 components) produced improvements in the performance of the clustering. This would be expected, as in both cases I’ve reduced the number of dimensions of the data from 6 to 2, and k means is a distance-based algorithm.
I took the best performing recipe (PCA with 2 components) forward as the final set of feature engineering steps. It won’t be until I look at the results of the clustering in the next section that I’ll be able to tell to what extent the dimension reduction has removed important information, and/or removed redundant data and led to meaningful clustering results.
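Once the final recipe (pca_recipe_2, defined in the chunk below) is chosen, it can also be worth checking which variables drive each principal component. A sketch of how that could be done, assuming the step order used above (range, inverse, log, Yeo-Johnson, PCA), which makes step_pca() the fifth step in the recipe:

# sketch only: inspect the PCA loadings of the prepped recipe
pca_prep <- prep(pca_recipe_2, training = cluster_in_df)
tidy(pca_prep, number = 5) %>% # step 5 is assumed to be step_pca()
  filter(component %in% c("PC1", "PC2")) %>%
  arrange(component, desc(abs(value)))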
# ------------------------------------------------------------------------------
# create multivariate transformation recipes
# ------------------------------------------------------------------------------
pca_recipe_2 <- trans_recipe_preferred %>%
step_pca(all_numeric(), num_comp = 2)
pca_recipe_4 <- trans_recipe_preferred %>%
step_pca(all_numeric(), num_comp = 4)
umap_recipe <- trans_recipe_preferred %>%
step_umap(all_numeric())
# add recipes to a list
normalisation_recipes <- list(
base_after_trans = trans_recipe_preferred,
pca_2 = pca_recipe_2,
pca_4 = pca_recipe_4,
umap = umap_recipe
)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
recipes = normalisation_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

Clustering results
Having decided on the feature engineering steps to use ahead of running the clustering, which are codified in pca_recipe_2, this final section of the notebook looks at the results of the clustering. In the process, I decide what value of k (i.e. what number of clusters) yields the most interesting results.
Setting up the clustering results workflow
I approached the coding for the results in a similar way to the coding for the feature engineering. I started by setting up a couple of functions to get and visualise the clustering results (for a given value of k and pre-processing recipe).
#' Having decided on a recipe (set of pre-processing steps) and k, run the
#' clustering workflow
#'
#' @param k the number of clusters to identified by the k means clustering
#' algorithm
#' @param recipe the set of data pre-processing steps (a TidyModels object)
#' @param data_df the main dataset used by the clustering algorithm
#' @param outlier_df any outlier observations that you do not want to influence
#' the clustering algorithm. The clusters for these observations will be
#' 'predicted' once the cluster algorithm has been run.
#'
#'
#' @return a dataframe including observation from both `data_df` and
#' `outlier_df` including a cluster assignment for each observation
get_clustering_results <- function(k, recipe, data_df, outlier_df) {
# ---------------------------------------------------------------------------
# set up clustering algorithm
# ---------------------------------------------------------------------------
kmeans_spec <- k_means(num_clusters = k) %>%
set_engine("stats")
# ---------------------------------------------------------------------------
# set up workflow which brings together the clustering algorithm and
# the data pre-processing (i.e. the recipe) for execution
# ---------------------------------------------------------------------------
kmeans_wflow <- workflow(recipe, kmeans_spec)
# ---------------------------------------------------------------------------
# Main data used for determining how observations should be classified
# ---------------------------------------------------------------------------
kmeans_fit <- kmeans_wflow %>%
fit(data = select(data_df, -msoa_code))
cluster_assignments <- extract_cluster_assignment(kmeans_fit)
main <- data_df %>%
bind_cols(cluster_assignments)
# ---------------------------------------------------------------------------
# Outliers data not involved in determining how observations should be
# classified
# ---------------------------------------------------------------------------
outliers <- kmeans_fit %>%
augment(new_data = outlier_df) %>%
rename(.cluster = .pred_cluster)
# ---------------------------------------------------------------------------
# return all the data together including cluster classifications
# ---------------------------------------------------------------------------
bind_rows(main, outliers)
}
# a quick test things are working as expected
clustering_res <- get_clustering_results(
k = 6,
recipe = pca_recipe_2,
data_df = foe_green_space_focus,
outlier_df = foe_green_space_outliers
)
# confirm the returned dataframe has the expected number of rows
assertthat::assert_that(nrow(clustering_res) == nrow(foe_green_space_focus) +
  nrow(foe_green_space_outliers))

## [1] TRUE
#' View some visualisations of FoE green space clustering results
#'
#' @param df a dataframe of observations including a `.cluster` variable
#' @param k an integer the number of clusters identified
#' (for inclusion in plot titles)
#' @param recipe_name a string naming the recipe used to preprocess the clustering
#' input data (for inclusion in plot titles)
#'
#' @return a list of three ggplotly objects (interactive ggplot charts created
#' using plotly::ggplotly(p)) showing clustering results.
view_cluster_dist <- function(df, k, recipe_name) {
#-----------------------------------------------------------------------------
# set up common features used across multiple plots
#-----------------------------------------------------------------------------
common_plot_elements <- function(facet = TRUE) {
background_grey <- "grey90"
list(
# colour palette
scale_colour_viridis_d(),
# facet styling
if(facet)
facet_wrap(~variable, scales = "free", ncol = 2),
theme(
strip.background = element_rect(fill = background_grey),
strip.text = element_text(colour = "black"),
# add a neutral background (rather than stark white)
plot.background = element_rect(fill = background_grey),
panel.background = element_rect(fill = background_grey),
legend.background = element_blank()
),
# plot labeling
ggtitle(glue::glue("{k} clusters. Preprocessing: {recipe_name}"))
)
}
# ---------------------------------------------------------------------------
# process data for plotting
# ---------------------------------------------------------------------------
# put data in long format - required for facetted plots
plot_df <- df %>%
select(-c(msoa_code)) %>%
pivot_longer(
cols = population_density:pcnt_bame_pop,
names_to = "variable"
)
# split data for use in different plots
standard_scale_vars <- c(
"income_index",
"pcnt_pop_without_go_space_access"
)
# plots with a standard x axis scale
standard_plot <- plot_df %>%
filter(variable %in% standard_scale_vars)
# plots with a log x axis scale
log_plot <- plot_df %>%
filter(!(variable %in% standard_scale_vars) & variable != "green_space_area_per_capita")
gsapc_plot <- plot_df %>% # more customisation is needed to get a good view of the
# distribution of this variable
filter(variable == "green_space_area_per_capita")
#-----------------------------------------------------------------------------
# produce plots
#-----------------------------------------------------------------------------
# on standard x axis scale
p1 <- standard_plot %>%
ggplot(aes(value, colour = .cluster, group = .cluster)) +
geom_density() +
common_plot_elements() +
labs(x = NULL)
# on log x axis scale
p2 <- log_plot %>%
# + 1e-10 an offset to avoid log(0)
ggplot(aes(value + 1e-10, colour = .cluster)) +
geom_density() +
scale_x_log10() +
common_plot_elements() +
labs(x = NULL)
# more customisation is needed to get a good view of the
# distribution of this variable
p3 <- gsapc_plot %>%
# + 1e-10 an offset to avoid log(0)
ggplot(aes(value + 1e-10, colour = .cluster)) +
geom_density() +
scale_x_log10(limits = c(1, 1000)) +
common_plot_elements(facet = FALSE) +
labs(x = "log green_space_area_per_capita")
# return plots
list(p1,
p2,
p3)
}

Results taken forward for further analysis
I explored the results of the clustering with various values of k, and settled on k = 6 as yielding the most interesting results in terms of identifying distinct forms of green space deprivation. Based on an initial inspection, I was happy that the dimension reduction techniques used in the feature engineering hadn’t caused any major issues with the interpretability of the results.
I take these results forward, and interpret and present them more fully in another notebook.
# set number of clusters as used in two functions below
k <- 6
res <- get_clustering_results(
k = k,
recipe = pca_recipe_2,
data_df = foe_green_space_focus,
outlier_df = foe_green_space_outliers
)
# output for inspection
kable(head(res))

| msoa_code | population_density | income_index | pcnt_pop_without_go_space_access | garden_area_per_capita | green_space_area_per_capita | pcnt_bame_pop | .cluster |
|---|---|---|---|---|---|---|---|
| E02000001 | 0.0023015 | 7.851353 | 92.12428 | 0.549499 | 2.812213 | 21.36949 | Cluster_1 |
| E02000002 | 0.0034072 | 2.706193 | 70.92668 | 167.596029 | 13.292218 | 35.01107 | Cluster_2 |
| E02000003 | 0.0050009 | 3.952799 | 58.41962 | 46.873937 | 13.610908 | 45.38576 | Cluster_3 |
| E02000004 | 0.0026246 | 4.131273 | 15.26157 | 62.120196 | 187.917965 | 19.02297 | Cluster_4 |
| E02000005 | 0.0077921 | 3.733204 | 72.96830 | 47.404414 | 4.115677 | 33.73044 | Cluster_2 |
| E02000007 | 0.0055733 | 2.269247 | 58.70080 | 36.161931 | 52.393113 | 32.81765 | Cluster_3 |
# view results charts
res %>%
  view_cluster_dist(k = k,
    # get recipe_name as string
    recipe_name = deparse(substitute(pca_recipe_2)))

## [[1]]
##
## [[2]]
##
## [[3]]
# output for use in other notebooks
readr::write_csv(res, "../data_out/clusters_cv_5_k_6_pca_recipe_2.csv")Some of the other results I looked at
Just for reference, here are the results for some of the other values of k (besides 6) that I explored.
explore_cluster_results_pca_recipe_2 <- function(k){
res <- get_clustering_results(
k = k,
recipe = pca_recipe_2,
data_df = foe_green_space_focus,
outlier_df = foe_green_space_outliers
)
res %>%
view_cluster_dist(k = k,
# get recipe_name as string
recipe_name = deparse(substitute(pca_recipe_2)))
}
explore_cluster_results_pca_recipe_2(k = 3)

## [[1]]
##
## [[2]]
##
## [[3]]
explore_cluster_results_pca_recipe_2(k = 5)

## [[1]]
##
## [[2]]
##
## [[3]]
explore_cluster_results_pca_recipe_2(k = 10)

## [[1]]
##
## [[2]]
##
## [[3]]
If I were continuing to develop this notebook
If I were investing more time in this analysis, I would:
- look at the performance of the code.
- try clustering techniques other than k means and compare the results.