Identifying different types of green space deprivation with K means clustering and TidyModels
My role: analysis design, analysis, data visualisation, reporting
Techniques used: k means clustering, feature engineering, parameter tuning, exploratory data analysis, unsupervised learning, data visualisation.
In brief:
In this notebook, I run a cluster analysis to identify different types of green space deprivation in England. This notebook focuses on the feature engineering and parameter tuning. The analysis and presentation of the clustering results appear in another notebook.
- Objective: to identify different types of green space deprivation experienced by different populations across England.
- My initial hypothesis:
- different areas of England experience different forms of green space deprivation.
- One of these forms might look something like: an area tending to have relatively large amounts of garden space per person, but relatively small amounts of public green space per person in easy walking distance.
- An unsupervised learning approach might help to identify different forms of green space deprivation by identifying structures within green space and demographic data.
- Geospatial scale: identify forms of green space deprivation at the local scale, the Middle Super Output Area (MSOA) scale in the jargon of UK administrative geography. To give a sense of scale, the mean population of an MSOA is approximately 7,200.
- Data: the dataset used was released by Friends of the Earth in 2020, as part of their research on ‘England’s Green Space Gap’. The dataset includes some UK Government data, and some data which Friends of the Earth derived from analysis of the UK Government data.
Setting up the notebook
# ------------------------------------------------------------------------------
# Import packages used in the notebook
# ------------------------------------------------------------------------------
# for data manipulation and visualisation
library(dplyr)
library(ggplot2)
library(tidyr)
# for string manipulation
library(stringr)
# for functional programming
library(purrr)
# for reading in data xslx files
library(readxl)
# for cluster analysis and feature engineering within a tidy models workflow
library(tidyclust)
library(tidymodels)
# for umap (a dimension reduction technique used in the feature engineering)
library(embed)
# for skewedness and kurtosis tests used in exploratory analysis
library(moments)
# for table formatting
library(knitr)

# ------------------------------------------------------------------------------
# Imported packages used in the notebook
# ------------------------------------------------------------------------------
subset(
  data.frame(sessioninfo::package_info()),
  attached == TRUE,
  c(package, loadedversion)
) %>%
  kable()

| | package | loadedversion |
|---|---|---|
| broom | broom | 1.0.0 |
| dials | dials | 1.1.0 |
| dplyr | dplyr | 1.0.9 |
| embed | embed | 1.0.0 |
| ggplot2 | ggplot2 | 3.4.0 |
| infer | infer | 1.0.4 |
| knitr | knitr | 1.39 |
| modeldata | modeldata | 1.0.1 |
| moments | moments | 0.14.1 |
| parsnip | parsnip | 1.0.3 |
| purrr | purrr | 0.3.4 |
| readxl | readxl | 1.4.0 |
| recipes | recipes | 1.0.3 |
| rsample | rsample | 1.1.1 |
| scales | scales | 1.2.1 |
| stringr | stringr | 1.4.1 |
| tibble | tibble | 3.1.8 |
| tidyclust | tidyclust | 0.1.0 |
| tidymodels | tidymodels | 1.0.0 |
| tidyr | tidyr | 1.2.0 |
| tune | tune | 1.0.1 |
| workflows | workflows | 1.1.2 |
| workflowsets | workflowsets | 1.0.0 |
| yardstick | yardstick | 1.1.0 |
# ------------------------------------------------------------------------------
# Set up notebook output and data viz styling
# ------------------------------------------------------------------------------
# set default Rmd Chunk options
knitr::opts_chunk$set(echo = TRUE,
                      cache = TRUE,
                      warning = FALSE,
                      message = FALSE)
# set default theme for exploratory plots
theme_set(theme_light())
# define colour palette for plots
colours <- c("#374E83", "#C94A54")

Reading in the data
The dataset used is from Friends of the Earth (2020). It was released as part of their research on ‘England’s Green Space Gap’. Data at different spatial resolutions (LSOA, MSOA and Local Authority) is included, but I focus only on the MSOA (Middle Super Output Area) scale. This was the finest-grained resolution at which the data required for the analysis was available.
The code chunk below reads the data in from a local Excel file, sorts out the variable naming conventions and outputs an overview of the structure of the data. This let me see all the variables I had to work with, and check whether the variable types (character, numeric, logical etc.) looked correct.
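As a small illustration of what that renaming does (a toy example of my own, not part of the analysis data), janitor::clean_names() converts inconsistent column names to snake_case:

# toy example only: clean_names() standardises awkward column names to snake_case
janitor::clean_names(
  data.frame("MSOA Code" = 1, "Green Space AREA" = 2, check.names = FALSE)
)
# the columns become msoa_code and green_space_area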
foe_green_space_raw <- read_excel("../Data/(FOE) Green Space Consolidated Data - England - Version 2.1.xlsx",
  sheet = "MSOAs V2.1"
) %>%
  # inconsistent naming conventions for variables are used in the source data
  # so clean the names for consistency (variable names are now snake_case)
  janitor::clean_names()
str(foe_green_space_raw)

## tibble [6,791 × 38] (S3: tbl_df/tbl/data.frame)
## $ x1 : num [1:6791] 1 2 3 4 5 6 7 8 9 10 ...
## $ msoa_code : chr [1:6791] "E02000001" "E02000002" "E02000003" "E02000004" ...
## $ msoa_name : chr [1:6791] "City of London 001" "Barking and Dagenham 001" "Barking and Dagenham 002" "Barking and Dagenham 003" ...
## $ msoa_name_house_of_commons : chr [1:6791] "City of London" "Marks Gate" "Chadwell Heath East" "Eastbrookend" ...
## $ la_code : chr [1:6791] "E09000001" "E09000002" "E09000002" "E09000002" ...
## $ la_name : chr [1:6791] "City of London" "Barking and Dagenham" "Barking and Dagenham" "Barking and Dagenham" ...
## $ la_name_for_readability : chr [1:6791] "City of London" "Barking and Dagenham" "Barking and Dagenham" "Barking and Dagenham" ...
## $ area : num [1:6791] 2905505 2165730 2143624 2490295 1186203 ...
## $ population_imd : num [1:6791] 6687 7379 10720 6536 9243 ...
## $ population_density : num [1:6791] 0.0023 0.00341 0.005 0.00262 0.00779 ...
## $ total_pop_from_ethnicity_data : num [1:6791] 7375 6775 10045 6182 8562 ...
## $ white_pop : num [1:6791] 5799 4403 5486 5006 5674 ...
## $ mixed_multiple_ethnic_group_pop : num [1:6791] 289 330 433 186 313 307 478 366 280 202 ...
## $ asian_asian_british_pop : num [1:6791] 940 820 2284 313 1050 ...
## $ black_african_caribbean_black_british_pop: num [1:6791] 193 1133 1618 649 1445 ...
## $ other_ethnic_group_pop : num [1:6791] 154 89 224 28 80 55 141 92 94 51 ...
## $ bame_pop : num [1:6791] 1576 2372 4559 1176 2888 ...
## $ income_index : num [1:6791] 7.85 2.71 3.95 4.13 3.73 ...
## $ green_space_area : num [1:6791] 18805 98083 145909 1228232 38041 ...
## $ green_space_area_per_capita : num [1:6791] 2.81 13.29 13.61 187.92 4.12 ...
## $ unbuffered_go_space_area : num [1:6791] 0 78286 138734 1228232 37415 ...
## $ buffrd_go_space_area : num [1:6791] 152351 725988 797423 2328455 291124 ...
## $ unbuffered_go_space_per_capita : num [1:6791] 0 10.61 12.94 187.92 4.05 ...
## $ pop_area : num [1:6791] 2905505 2087444 2004890 1262063 1148789 ...
## $ pop_area_with_go_space_access : num [1:6791] 152351 647702 658689 1100223 253709 ...
## $ pcnt_pop_area_with_go_space_access : num [1:6791] 5.24 31.03 32.85 87.18 22.08 ...
## $ pcnt_pop_with_go_space_access : num [1:6791] 7.88 29.07 41.58 84.74 27.03 ...
## $ pcnt_pop_without_go_space_access : num [1:6791] 92.1 70.9 58.4 15.3 73 ...
## $ pop_without_go_space_access : num [1:6791] 6160 5234 6263 997 6744 ...
## $ garden_area_houses : num [1:6791] 380 1199811 433155 389959 410321 ...
## $ garden_area_flats : num [1:6791] 3294 36880 69334 16058 27838 ...
## $ garden_area_total : num [1:6791] 3674 1236691 502489 406018 438159 ...
## $ garden_area_per_capita : num [1:6791] 0.549 167.596 46.874 62.12 47.404 ...
## $ gsdi_avg_area : num [1:6791] 1 2 2 4 1 4 2 1 2 3 ...
## $ gsdi_access : num [1:6791] 1 2 2 4 2 2 2 3 2 4 ...
## $ gsdi_garden : num [1:6791] 1 4 1 2 1 1 1 1 1 1 ...
## $ green_space_deprivation_score : num [1:6791] 1 4 2 4 1 4 2 1 2 4 ...
## $ green_space_deprivation_rating : chr [1:6791] "E" "B" "D" "B" ...
Clean the data
There are many variables within the dataset, some of which are derived from others. I have previous experience of working with the dataset, and had performed some earlier exploratory clustering with it too. So, I had some initial ideas about which variables would be interesting to work with.
I chose some demographic and some green space variables, as this made sense given the analysis objectives. I avoided pairs of variables that would in some way duplicate information. For example, I included only population density for each neighbourhood, rather than population and area.
# -----------------------------------------------------------------------------
# Select variables of interest from the wider dataset
# -----------------------------------------------------------------------------
foe_green_space_focus <- foe_green_space_raw %>%
  select(
    # id
    msoa_code,
    # demographic variables
    population_density, bame_pop, total_pop_from_ethnicity_data, income_index,
    # green space variables
    pcnt_pop_without_go_space_access, garden_area_per_capita,
    green_space_area_per_capita
  )
# -----------------------------------------------------------------------------
# Create (derived) variables for use in clustering
# -----------------------------------------------------------------------------
foe_green_space_focus <- foe_green_space_focus %>%
  mutate(pcnt_bame_pop = bame_pop / total_pop_from_ethnicity_data * 100) %>%
  select(-c(bame_pop, total_pop_from_ethnicity_data))

Exploring the data
Exploratory plotting
A quick look at the distributions of each variable shows we will have some heavily skewed data. I’ll pick up dealing with this in the data pre-processing, as it might prove beneficial (in terms of clustering results) to transform some variables to reduce the skew.
# reshape data for plotting
foe_green_space_focus_long <- foe_green_space_focus %>%
  pivot_longer(
    cols = population_density:pcnt_bame_pop,
    names_to = "variable",
    values_to = "value"
  )
# get a sense of distributions
ggplot(
  foe_green_space_focus_long,
  aes(value)
) +
  geom_density() +
  facet_wrap(~variable, scales = "free") +
  theme(
    strip.background = element_rect(fill = "white"),
    strip.text = element_text(colour = "black")
  )
Just how skewed is the data?
Having looked at the plot above, I wanted some quantitative measures of how far the distributions of the variables were from normal. So, I ran some quick skewness and kurtosis tests.
The skewness test proved helpful later on in thinking about the type and strength of transformation to use on each variable. A skewness score close to 0 means the distribution of a variable is similar to a normal distribution.
The kurtosis test yields insight into the extent to which each variable is affected by outliers (in comparison to a normal distribution). A kurtosis score higher than 3 suggests more outliers than is typical for a normally distributed variable. Conversely, a kurtosis score lower than 3 suggests fewer outliers than a normal distribution.
Skew and outliers can cause problems for the k-means algorithm, because its assumption that clusters are roughly spherical needs to be satisfied to some extent.
Information on how to run the skew and kurtosis tests in R came from this article.
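As a quick guide to reading these scores (a toy sketch of my own, not derived from the analysis data), a heavily right-skewed variable scores well above 0 for skewness and well above 3 for kurtosis, and a log transform pulls both back towards the values expected under a normal distribution:

# toy example only: skewness and kurtosis before and after a log transform
set.seed(42)
skewed_var <- rlnorm(1000, meanlog = 0, sdlog = 1)
moments::skewness(skewed_var)       # well above 0
moments::kurtosis(skewed_var)       # well above 3
moments::skewness(log(skewed_var))  # close to 0
moments::kurtosis(log(skewed_var))  # close to 3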
#' Get a nice looking table of skew and kurtosis test results
#'
#' @param df a dataframe of variables
#'
#' @return a dataframe of test results
#' @export
#'
#' @examples
test_skew_kurtosis <- function(df) {
  df %>%
    # calculate summary statistics
    summarise(across(dplyr::everything(),
      list(skewness = skewness, kurtosis = kurtosis),
      na.rm = TRUE
    )) %>%
    # rearrange dataframe into a nice format
    pivot_longer(cols = everything()) %>%
    mutate(
      test = if_else(str_detect(name, "skewness"),
        "skewness", "kurtosis"
      ),
      name = str_remove(name, c("_skewness", "_kurtosis"))
    ) %>%
    rename(variable = name) %>%
    pivot_wider(names_from = "test", values_from = "value") %>%
    arrange(desc(skewness), desc(kurtosis))
}
skew_kurtosis_test_res <- test_skew_kurtosis(select(
  foe_green_space_focus,
  -msoa_code
))
kable(skew_kurtosis_test_res)

| variable | skewness | kurtosis |
|---|---|---|
| green_space_area_per_capita | 24.8523268 | 822.028067 |
| garden_area_per_capita | 15.5339581 | 667.133444 |
| population_density | 2.0696404 | 8.506217 |
| pcnt_bame_pop | 2.0270140 | 6.706167 |
| income_index | -0.1306751 | 1.971792 |
| pcnt_pop_without_go_space_access | -0.4481288 | 2.416562 |
Remove some outliers
The garden_area_per_capita and green_space_area_per_capita variables show very high skewness and kurtosis test scores. Transforming these variables to address the skew can be done in the feature engineering below. However, due to the way the TidyModels packages work, it is easier to deal with outliers here, ahead of starting the feature engineering and algorithm parameter tuning.
The next chunk of code looks at what proportion of data would be lost if observations above certain thresholds were omitted as outliers.
prop_above_threshold <- function(variable, threshold) {
  obs_above_thres <- length(which(foe_green_space_focus[variable] > threshold))
  total_obs <- length(foe_green_space_focus[[variable]])
  obs_above_thres / total_obs
}
# -----------------------------------------------------------------------------
# garden_area_per_capita: Look at proportions of observations above thresholds
# -----------------------------------------------------------------------------
cut_off_thresholds <- c(1, 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
prop_omitted <- tibble(threshold = cut_off_thresholds) %>%
  mutate(
    prop_omitted_garden_area_per_capita = map_dbl(
      threshold,
      ~ prop_above_threshold("garden_area_per_capita", .)
    )
  )
kable(prop_omitted)

| threshold | prop_omitted_garden_area_per_capita |
|---|---|
| 1 | 0.9991165 |
| 10 | 0.9768812 |
| 50 | 0.7916360 |
| 100 | 0.4080401 |
| 200 | 0.1546164 |
| 300 | 0.0672949 |
| 400 | 0.0251804 |
| 500 | 0.0067737 |
| 600 | 0.0023561 |
| 700 | 0.0010308 |
| 800 | 0.0004418 |
| 900 | 0.0001473 |
| 1000 | 0.0001473 |
# -----------------------------------------------------------------------------
# green_space_area_per_capita: Look at proportions of observations above thresholds
# -----------------------------------------------------------------------------
cut_off_thresholds <- c(1, 10, 50, 100, 200, 300, 400, 500, 1000, 5000, 8000, 10000, 100000)
prop_omitted <- tibble(threshold = cut_off_thresholds) %>%
  mutate(
    prop_omitted_green_space_area_per_capita = map_dbl(
      threshold,
      ~ prop_above_threshold("green_space_area_per_capita", .)
    )
  )
kable(prop_omitted)

| threshold | prop_omitted_green_space_area_per_capita |
|---|---|
| 1e+00 | 0.9447799 |
| 1e+01 | 0.7025475 |
| 5e+01 | 0.2412016 |
| 1e+02 | 0.1434251 |
| 2e+02 | 0.0868797 |
| 3e+02 | 0.0640554 |
| 4e+02 | 0.0524223 |
| 5e+02 | 0.0424091 |
| 1e+03 | 0.0263584 |
| 5e+03 | 0.0082462 |
| 8e+03 | 0.0058901 |
| 1e+04 | 0.0050066 |
| 1e+05 | 0.0001473 |
Having inspected the table above, I decided to remove outliers in the following way.
- garden_area_per_capita: remove observations above 500 m2, losing less than 1% of observations.
- green_space_area_per_capita: remove observations above 5000 m2, losing less than 1% of observations.
I also reran the skew and kurtosis tests. green_space_area_per_capita retained a very high kurtosis score, which I’ll look to deal with in the Univariate transformations section of the feature engineering.
# retain outliers for inspection and use later on
foe_green_space_outliers <- foe_green_space_focus %>%
  filter(
    garden_area_per_capita > 500 |
      green_space_area_per_capita > 5000
  )
# remove outliers ahead of feature engineering workflow
foe_green_space_focus <- foe_green_space_focus %>%
  filter(
    garden_area_per_capita <= 500,
    green_space_area_per_capita <= 5000
  )
# rerun the skew and kurtosis tests for the variables where I've removed some
# outliers
test_skew_kurtosis(select(
  foe_green_space_focus, garden_area_per_capita, green_space_area_per_capita
)) %>%
  kable()

| variable | skewness | kurtosis |
|---|---|---|
| green_space_area_per_capita | 8.586878 | 91.509741 |
| garden_area_per_capita | 1.701253 | 5.856855 |
Feature engineering
Before doing the cluster analysis itself, I did a lot of feature engineering work. The aim was to get the clustering input data into as good a state as possible within the time available.
Setting up the feature engineering workflow
First, I set up some feature engineering workflow functions. These functions allowed me to easily and repeatedly run, and view the results of, clustering with different data pre-processing / feature engineering steps (i.e. recipes in TidyModels terminology) and different values of k.
The metric I use to assess the performance of the clustering is the ratio of WSS to TSS (a minimal illustration follows the definitions below), where:
- WSS is the within-cluster sum of squared error (this measures the distance of each observation from its cluster centroid).
- TSS is the total sum of squared error (this measures the distance of all observations from a global centroid).
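Here is that minimal illustration of the ratio (a sketch of my own using base stats::kmeans on a built-in dataset, rather than the tidyclust workflow used in the rest of the notebook):

# toy example only: the sse_ratio metric is within-cluster SS divided by total SS
example_fit <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 3, nstart = 25)
example_fit$tot.withinss / example_fit$totss # lower values indicate tighter clusters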
set.seed(1289473)
# ------------------------------------------------------------------------------
# key parameters
# ------------------------------------------------------------------------------
#' Run the K means cluster workflow with a single recipe (k is a tuning parameter)
#'
#' @param data_df the dataset to be clustered
#' @param num_cluster_levels number of levels created between 1 and 10 in the parameter (i.e. k) tuning grid
#' @param num_cross_validation_folds number of cv folds
#' @param recipe the recipe used to pre-process the data ahead of loading it into the clustering workflow
#'
#' @return a dataframe of the results of the tuning process
#' @export
#'
#' @examples
run_cluster_wf <- function(data_df, num_cluster_levels = 5,
num_cross_validation_folds = 2,
recipe) {
# ---------------------------------------------------------------------------
# create cross-validation samples
green_space_cv <- vfold_cv(data_df, v = num_cross_validation_folds)
# ---------------------------------------------------------------------------
# ---------------------------------------------------------------------------
# set up clustering algorithm
# ---------------------------------------------------------------------------
kmeans_spec <- k_means(num_clusters = tune()) %>%
set_engine("stats")
# ---------------------------------------------------------------------------
# set up workflow which brings together the clustering algorithm and
# the data pre-processing (i.e. the recipe) for execution
# ---------------------------------------------------------------------------
kmeans_wflow <- workflow(recipe, kmeans_spec)
# ---------------------------------------------------------------------------
# create parameter tuning grid - for k (i.e. the number of clusters)
# ---------------------------------------------------------------------------
num_clusters_grid <- grid_regular(num_clusters(), levels = num_cluster_levels)
# ---------------------------------------------------------------------------
# run tuning
# ---------------------------------------------------------------------------
res <- tune_cluster(kmeans_wflow,
resamples = green_space_cv,
grid = num_clusters_grid,
control = control_grid(save_pred = TRUE, extract = identity),
metrics = cluster_metric_set(
sse_within_total, sse_total,
sse_ratio
)
)
return(res)
}
#' Run the K means cluster workflow for multiple recipes (k is a tuning parameter)
#'
#' @param data_df the dataset to be clustered
#' @param num_cluster_levels number of levels created between 1 and 10 in the parameter (i.e. k) tuning grid
#' @param num_cross_validation_folds number of cv folds
#' @param recipes a named list of recipes
#'
#' @return a ggplot graph showing the performance (i.e. sse_ratio) of clustering with each recipe
#' @export
#'
#' @examples
run_wf_with_recipes <- function(data_df,
num_cluster_levels = 5,
num_cross_validation_folds = 2,
recipes) {
# ---------------------------------------------------------------------------
# create a dataframe with each recipe
# to store the objects create at each stage of the clustering process
# ---------------------------------------------------------------------------
clustering_df <- tibble(
recipe_id = names(recipes), # for meaningful names in plotting
recipe = recipes
)
# ---------------------------------------------------------------------------
# Run the workflow with each recipe and get the performance metrics for
# each
# ---------------------------------------------------------------------------
metric_df <- clustering_df %>%
mutate(
# run a clustering workflow for each recipe
workflow = map(recipe, ~ run_cluster_wf(
recipe = .,
data_df = data_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)),
# get the performance metrics for each workflow
metrics = map(workflow, ~ collect_metrics(.))
)
# ---------------------------------------------------------------------------
# transform the metric data for use in ggplot2
# including selecting the metric of interest
# ---------------------------------------------------------------------------
metric_df_focus <- metric_df %>%
# expand the nested metrics tables
select(recipe_id, metrics) %>%
unnest(metrics) %>%
# focus on metric of interest
filter(.metric == "sse_ratio") %>%
# so that different coloured lines can be plotted for each recipe
mutate(recipe_id = factor(recipe_id))
# ---------------------------------------------------------------------------
# plot the performance metric for each recipe
# ---------------------------------------------------------------------------
p <- ggplot(
data = metric_df_focus,
mapping = aes(
x = num_clusters, y = mean,
colour = recipe_id, group = recipe_id
)
) +
geom_line() +
scale_color_viridis_d() +
theme_minimal() +
ylab("Performance (mean WSS/TSS ratio)") +
xlab("Number of clusters") +
scale_x_continuous(breaks = 1:10)
# make plot interactive
plotly::ggplotly(p)
}

Doing the feature engineering
In this section I do the feature engineering itself, using the workflow set up above. My choices around the order of feature engineering steps were based on recommendations made by the authors of TidyModels.
Checking for missing data
The first step in the feature engineering was to check for, and if necessary impute, any missing data. Fortunately, there is no missing data in the Friends of the Earth dataset.
visdat::vis_miss(foe_green_space_focus)
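If there had been missing values, an imputation step could have been added to the recipes used later on. A minimal sketch (not used in this analysis) might look like this:

# sketch only: impute missing numeric values with the median of each variable
recipe(~., data = foe_green_space_focus) %>%
  step_impute_median(all_numeric_predictors())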
Handling qualitative variables
The next step would have been to think through if and how to use any qualitative variables. However, there aren’t any in the dataset I’m using, so we can just move on.
Standardising the scale of the variables
K means is very sensitive to the distances between points. So, the first thing to do was to put all the variables on the same scale (between 0 and 1). Running the clustering workflow at this point gives us the baseline performance of the clustering.
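To make the rescaling concrete, here is a toy sketch of my own (not part of the analysis) showing what step_range() does once a recipe is prepped and baked:

# toy example only: step_range() rescales each numeric variable to lie between 0 and 1
recipe(~., data = data.frame(x = c(2, 5, 10, 20))) %>%
  step_range(all_numeric(), min = 0, max = 1) %>%
  prep() %>%
  bake(new_data = NULL)
# x becomes 0.00, 0.17, 0.44, 1.00 (approximately)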
This is the first section of actual feature engineering that I run, so before doing that I need to set a few parameters for the clustering workflow.
# workflow tuning parameters used throughout feature engineering
num_cluster_levels <- 10
num_cross_validation_folds <- 2
# drop identifier column as not relevant to clustering
cluster_in_df <- select(foe_green_space_focus, -msoa_code) %>%
  drop_na()

So, now we can get on with standardising the scales of the variables.
# ------------------------------------------------------------------------------
# create range recipes
# ------------------------------------------------------------------------------
range_recipe_0 <- recipe(~., data = cluster_in_df) %>%
step_range(all_numeric(), min = 0, max = 1)
# add recipes to a list
range_recipes <- list(base = range_recipe_0)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
  recipes = range_recipes,
  data_df = cluster_in_df,
  num_cluster_levels = num_cluster_levels,
  num_cross_validation_folds = num_cross_validation_folds
)

Univariate transformations
The next step in the feature engineering is to transform the input data to try to address some of the skew and kurtosis identified in the EDA (Exploratory Data Analysis) section above. This was probably the most complex step in the feature engineering, as it involved comparing various approaches to transforming the data.
# ------------------------------------------------------------------------------
# a general transformation across all variables
# ------------------------------------------------------------------------------
trans_recipe_yeo <- range_recipe_0 %>%
step_YeoJohnson(all_numeric())
# ------------------------------------------------------------------------------
# stronger custom transformation (based on EDA observations)
# ------------------------------------------------------------------------------
trans_recipe_custom_strong <- range_recipe_0 %>%
step_inverse(
# for right skewed data
green_space_area_per_capita,
garden_area_per_capita,
offset = 1
) %>%
step_log(
# for right skewed data
population_density, pcnt_bame_pop,
# for left skewed data
pcnt_pop_without_go_space_access,
offset = 1
)
# ------------------------------------------------------------------------------
# gentler custom transformation (based on EDA observations)
# ------------------------------------------------------------------------------
trans_recipe_custom_gentle <- range_recipe_0 %>%
step_log(
# for right skewed data
green_space_area_per_capita,
garden_area_per_capita,
offset = 1
) %>%
step_sqrt(
# for right skewed data
population_density, pcnt_bame_pop,
# for left skewed data
pcnt_pop_without_go_space_access
)
# ------------------------------------------------------------------------------
# custom transformation (based on a mix of the three recipes above)
# ------------------------------------------------------------------------------
trans_recipe_custom_mixed_1 <- range_recipe_0 %>%
step_inverse(
# for right skewed data
green_space_area_per_capita,
garden_area_per_capita,
offset = 1
) %>%
step_YeoJohnson(
# for right skewed data
population_density, pcnt_bame_pop,
# for left skewed data
pcnt_pop_without_go_space_access
)
trans_recipe_custom_mixed_2 <- range_recipe_0 %>%
step_inverse(
# for extreme right skewed data
green_space_area_per_capita,
offset = 1
) %>%
step_log(
# for right skewed data
population_density, pcnt_bame_pop, garden_area_per_capita,
offset = 1
) %>%
step_YeoJohnson(
# for left skewed data
pcnt_pop_without_go_space_access
)
# add recipes to a list
trans_recipes <- list(
base = range_recipe_0,
YeoJohnson = trans_recipe_yeo,
custom_strong = trans_recipe_custom_strong,
custom_gentle = trans_recipe_custom_gentle,
custom_mixed_1 = trans_recipe_custom_mixed_1,
custom_mixed_2 = trans_recipe_custom_mixed_2
)
# ------------------------------------------------------------------------------
# View performance of clustering with each recipe
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
run_wf_with_recipes(
recipes = trans_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

In the end, the best performing set of transformations was a custom mix of transformations of varying strengths: an inverse transformation for the most skewed variables, a log transformation for the other skewed variables, and a Yeo-Johnson transformation for the remaining variables. So, this chunk of code just renames the appropriate recipe (i.e. set of transformations), so we can build on it in the next feature engineering step.
trans_recipe_preferred <- trans_recipe_custom_mixed_2

Creating interactions
I then looked at creating interactions between the variables. I tried some theoretically plausible interactions in the code chunk below, but none produced any major improvement in the performance of the clustering. So, for the sake of simplicity, none of these recipes will be taken forward.
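For reference, here is a toy sketch of my own (the interaction shown is purely for illustration) of what step_interact() produces; the interaction appears as a new column whose name uses an _x_ separator:

# toy example only: step_interact() adds a column such as income_index_x_pcnt_bame_pop
recipe(~., data = cluster_in_df) %>%
  step_interact(terms = ~ income_index:pcnt_bame_pop) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()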
# ------------------------------------------------------------------------------
# create interaction recipes
# ------------------------------------------------------------------------------
interactions_recipe_1 <- trans_recipe_preferred %>%
step_interact(terms = ~ income_index * pcnt_bame_pop)
interactions_recipe_2 <- trans_recipe_preferred %>%
step_interact(terms = ~ garden_area_per_capita * green_space_area_per_capita)
interactions_recipe_3 <- trans_recipe_preferred %>%
step_interact(terms = ~ population_density * green_space_area_per_capita)
interactions_recipe_4 <- trans_recipe_preferred %>%
step_interact(terms = ~ population_density * garden_area_per_capita)
# add recipes to a list
interactions_recipes <- list(
base_after_trans = trans_recipe_preferred,
interact_1 = interactions_recipe_1,
interact_2 = interactions_recipe_2,
interact_3 = interactions_recipe_3,
interact_4 = interactions_recipe_4
)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
recipes = interactions_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

Normalise
I also tried normalising all the variables, but this reduced the performance of the clustering. So, the recipes below won’t be taken forward.
# ------------------------------------------------------------------------------
# create normalisation recipes
# ------------------------------------------------------------------------------
normalisation_recipe <- trans_recipe_preferred %>%
step_normalize(all_numeric())
# add recipes to a list
normalisation_recipes <- list(
base_after_trans = trans_recipe_preferred,
norm = normalisation_recipe
)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
recipes = normalisation_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

Multivariate transformation
The last feature engineering step I tried was dimensionality reduction. I compared the impact of different dimension reduction algorithms (PCA, UMAP and ICA). Both UMAP and PCA (with 2 components) produced improvements in the performance of the clustering. This would be expected, as in both cases I’ve reduced the number of dimensions of the data from 6 to 2, and k means is a distance-based algorithm.
I took the best performing recipe (PCA with 2 components) forward as the final set of feature engineering steps. It won’t be until I look at the results of the clustering in the next section that I’ll be able to tell to what extent the dimension reduction has removed important information, and/or removed redundant data and led to meaningful clustering results.
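Once the final recipe (pca_recipe_2, defined in the chunk below) is chosen, it can also be worth checking which variables drive each principal component. A sketch of how that could be done, assuming the step order used above (range, inverse, log, Yeo-Johnson, PCA), which makes step_pca() the fifth step in the recipe:

# sketch only: inspect the PCA loadings of the prepped recipe
pca_prep <- prep(pca_recipe_2, training = cluster_in_df)
tidy(pca_prep, number = 5) %>% # step 5 is assumed to be step_pca()
  filter(component %in% c("PC1", "PC2")) %>%
  arrange(component, desc(abs(value)))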
# ------------------------------------------------------------------------------
# create multivariate transformation recipes
# ------------------------------------------------------------------------------
pca_recipe_2 <- trans_recipe_preferred %>%
step_pca(all_numeric(), num_comp = 2)
pca_recipe_4 <- trans_recipe_preferred %>%
step_pca(all_numeric(), num_comp = 4)
umap_recipe <- trans_recipe_preferred %>%
step_umap(all_numeric())
# add recipes to a list
normalisation_recipes <- list(
base_after_trans = trans_recipe_preferred,
pca_2 = pca_recipe_2,
pca_4 = pca_recipe_4,
umap = umap_recipe
)
# ------------------------------------------------------------------------------
# run the modelling workflow with the recipes
# ------------------------------------------------------------------------------
run_wf_with_recipes(
recipes = normalisation_recipes,
data_df = cluster_in_df,
num_cluster_levels = num_cluster_levels,
num_cross_validation_folds = num_cross_validation_folds
)

Clustering results
Having decided on the feature engineering steps to use ahead of running the clustering, which are codified in pca_recipe_2, this final section of the notebook looks at the results of the clustering. In the process, I decide what value of k (i.e. what number of clusters) yields the most interesting results.
Setting up the clustering results workflow
I approached the coding for the results in a similar way to the coding for the feature engineering. I started by setting up a couple of functions to get and visualise the clustering results (for a given value of k and pre-processing recipe).
#' Having decided on a recipe (set of pre-processing steps) and k, run the
#' clustering workflow
#'
#' @param k the number of clusters to identified by the k means clustering
#' algorithm
#' @param recipe the set of data pre-processing steps (a TidyModels object)
#' @param data_df the main dataset used by the clustering algorithm
#' @param outlier_df any outlier observations that you do not want to influence
#' the clustering algorithm. The clusters for these observations will be
#' 'predicted' once the cluster algorithm has been run.
#'
#'
#' @return a dataframe including observation from both `data_df` and
#' `outlier_df` including a cluster assignment for each observation
get_clustering_results <- function(k, recipe, data_df, outlier_df) {
# ---------------------------------------------------------------------------
# set up clustering algorithm
# ---------------------------------------------------------------------------
kmeans_spec <- k_means(num_clusters = k) %>%
set_engine("stats")
# ---------------------------------------------------------------------------
# set up workflow which brings together the clustering algorithm and
# the data pre-processing (i.e. the recipe) for execution
# ---------------------------------------------------------------------------
kmeans_wflow <- workflow(recipe, kmeans_spec)
# ---------------------------------------------------------------------------
# Main data used for determining how observations should be classified
# ---------------------------------------------------------------------------
kmeans_fit <- kmeans_wflow %>%
fit(data = select(data_df, -msoa_code))
cluster_assignments <- extract_cluster_assignment(kmeans_fit)
main <- data_df %>%
bind_cols(cluster_assignments)
# ---------------------------------------------------------------------------
# Outliers data not involved in determining how observations should be
# classified
# ---------------------------------------------------------------------------
outliers <- kmeans_fit %>%
augment(new_data = outlier_df) %>%
rename(.cluster = .pred_cluster)
# ---------------------------------------------------------------------------
# return all the data together including cluster classifications
# ---------------------------------------------------------------------------
bind_rows(main, outliers)
}
# a quick test things are working as expected
clustering_res <- get_clustering_results(
k = 6,
recipe = pca_recipe_2,
data_df = foe_green_space_focus,
outlier_df = foe_green_space_outliers
)
# confirm the returned dataframe has the expected number of rows
assertthat::assert_that(nrow(clustering_res) == nrow(foe_green_space_focus) +
  nrow(foe_green_space_outliers))

## [1] TRUE
#' View some visualisations of FoE green space clustering results
#'
#' @param df a dataframe of observations including a `.cluster` variable
#' @param k an integer the number of clusters identified
#' (for inclusion in plot titles)
#' @param recipe_name a string naming the recipe used to preprocess the clustering
#' input data (for inclusion in plot titles)
#'
#' @return a list of three ggplotly objects (interactive ggplot charts created
#' using plotly::ggplotly(p)) showing clustering results.
view_cluster_dist <- function(df, k, recipe_name) {
#-----------------------------------------------------------------------------
# set up common features used across multiple plots
#-----------------------------------------------------------------------------
common_plot_elements <- function(facet = TRUE) {
background_grey <- "grey90"
list(
# colour palette
scale_colour_viridis_d(),
# facet styling
if(facet)
facet_wrap(~variable, scales = "free", ncol = 2),
theme(
strip.background = element_rect(fill = background_grey),
strip.text = element_text(colour = "black"),
# add a neutral background (rather than stark white)
plot.background = element_rect(fill = background_grey),
panel.background = element_rect(fill = background_grey),
legend.background = element_blank()
),
# plot labeling
ggtitle(glue::glue("{k} clusters. Preprocessing: {recipe_name}"))
)
}
# ---------------------------------------------------------------------------
# process data for plotting
# ---------------------------------------------------------------------------
# put data in long format - required for facetted plots
plot_df <- df %>%
select(-c(msoa_code)) %>%
pivot_longer(
cols = population_density:pcnt_bame_pop,
names_to = "variable"
)
# split data for use in different plots
standard_scale_vars <- c(
"income_index",
"pcnt_pop_without_go_space_access"
)
# plots with a standard x axis scale
standard_plot <- plot_df %>%
filter(variable %in% standard_scale_vars)
# plots with a log x axis scale
log_plot <- plot_df %>%
filter(!(variable %in% standard_scale_vars) & variable != "green_space_area_per_capita")
gsapc_plot <- plot_df %>% # more customisation is needed to get a good view of the
# distribution of this variable
filter(variable == "green_space_area_per_capita")
#-----------------------------------------------------------------------------
# produce plots
#-----------------------------------------------------------------------------
# on standard x axis scale
p1 <- standard_plot %>%
ggplot(aes(value, colour = .cluster, group = .cluster)) +
geom_density() +
common_plot_elements() +
labs(x = NULL)
# on log x axis scale
p2 <- log_plot %>%
# + 1e-10 an offset to avoid log(0)
ggplot(aes(value + 1e-10, colour = .cluster)) +
geom_density() +
scale_x_log10() +
common_plot_elements() +
labs(x = NULL)
# more customisation is needed to get a good view of the
# distribution of this variable
p3 <- gsapc_plot %>%
# + 1e-10 an offset to avoid log(0)
ggplot(aes(value + 1e-10, colour = .cluster)) +
geom_density() +
scale_x_log10(limits = c(1, 1000)) +
common_plot_elements(facet = FALSE) +
labs(x = "log green_space_area_per_capita")
# return plots
list(p1,
p2,
p3)
}

Results taken forward for further analysis
I explored the results of the clustering with various values of k, and settled on k = 6 as yielding the most interesting results in terms of identifying distinct forms of green space deprivation. Based on an initial inspection, I was happy that the dimension reduction techniques used in the feature engineering hadn’t caused any major issues with the interpretability of the results.
I take these results forward, and interpret and present them more fully in another notebook.
# set number of clusters as used in two functions below
k <- 6
res <- get_clustering_results(
k = k,
recipe = pca_recipe_2,
data_df = foe_green_space_focus,
outlier_df = foe_green_space_outliers
)
# output for inspection
kable(head(res))

| msoa_code | population_density | income_index | pcnt_pop_without_go_space_access | garden_area_per_capita | green_space_area_per_capita | pcnt_bame_pop | .cluster |
|---|---|---|---|---|---|---|---|
| E02000001 | 0.0023015 | 7.851353 | 92.12428 | 0.549499 | 2.812213 | 21.36949 | Cluster_1 |
| E02000002 | 0.0034072 | 2.706193 | 70.92668 | 167.596029 | 13.292218 | 35.01107 | Cluster_2 |
| E02000003 | 0.0050009 | 3.952799 | 58.41962 | 46.873937 | 13.610908 | 45.38576 | Cluster_3 |
| E02000004 | 0.0026246 | 4.131273 | 15.26157 | 62.120196 | 187.917965 | 19.02297 | Cluster_4 |
| E02000005 | 0.0077921 | 3.733204 | 72.96830 | 47.404414 | 4.115677 | 33.73044 | Cluster_2 |
| E02000007 | 0.0055733 | 2.269247 | 58.70080 | 36.161931 | 52.393113 | 32.81765 | Cluster_3 |
# view results charts
res %>%
  view_cluster_dist(k = k,
    # get recipe_name as string
    recipe_name = deparse(substitute(pca_recipe_2)))

## [[1]]
##
## [[2]]
##
## [[3]]
# output for use in other notebooks
readr::write_csv(res, "../data_out/clusters_cv_5_k_6_pca_recipe_2.csv")Some of the other results I looked at
Just for reference, here are the results for some of the other values of k (besides 6) that I explored.
explore_cluster_results_pca_recipe_2 <- function(k){
res <- get_clustering_results(
k = k,
recipe = pca_recipe_2,
data_df = foe_green_space_focus,
outlier_df = foe_green_space_outliers
)
res %>%
view_cluster_dist(k = k,
# get recipe_name as string
recipe_name = deparse(substitute(pca_recipe_2)))
}
explore_cluster_results_pca_recipe_2(k = 3)

## [[1]]
##
## [[2]]
##
## [[3]]
explore_cluster_results_pca_recipe_2(k = 5)

## [[1]]
##
## [[2]]
##
## [[3]]
explore_cluster_results_pca_recipe_2(k = 10)

## [[1]]
##
## [[2]]
##
## [[3]]
If I were continuing to develop this notebook
If I were investing more time in this analysis, I would:
- look at the performance of the code.
- try clustering techniques other than k means and compare the results.