Identifying demographic and neighborhood characteristics associated with green space deprivation in England (Tidymodels and LASSO Regression)
My role: analysis design, analysis, data visualisation, reporting
Techniques used: linear regression (LASSO), feature engineering, parameter tuning, exploratory data analysis, data visualisation.
In summary
In this notebook, I run a regression analysis in an attempt to identify demographic and neighborhood characteristics associated with green space deprivation in England. This notebook focuses on the feature engineering and modelling.
Objective: to identify demographic and neighborhood characteristics associated with green space deprivation in England.
My initial (informal) hypothesis: green space deprivation might be associated with one or more demographic (e.g. percentage of the population from minority ethnic backgrounds) and neighborhood characteristics (e.g. population density of the neighborhood).
Terminology: In terms of green space deprivation, I focus on the percentage of the population within each neighborhood without easy walking access to public green space; more specifically, the percentage of the population without 2 ha of public green space within 300m. This would be approximately a five minute walk for someone walking slowly or with children.
Geospatial scale: In this analysis I work at the neighborhood scale, the Lower Super Output Area (LSOA) scale in the jargon of UK administrative terminology. To give a sense of scale, the average population of an LSOA is approximately 1,600. This was the smallest geospatial scale at which data on green space deprivation was available.
Data: The dataset used was released by Friends of the Earth in 2020, as part of their research on ‘England’s Green Space Gap’. The dataset includes some UK Government data, and some data which Friends of the Earth derived from analysis of the UK Government data.
Resources: I used a range of resources and information sources when putting together this notebook.
Key findings:
- Whether a neighborhood/LSOA was urban or rural was the most important variable in estimating the proportion of its population without good access to green space. The model estimated that, holding other explanatory variables constant, an urban neighborhood would have a roughly 25 percentage point lower share of its population without good access to green space.
- The amount of green space per capita in a neighborhood was the second most important variable in estimating the proportion of its population without good access to green space. The model predicts that neighborhoods with more green space per capita have higher percentages of people without good access to green space.
- Other variables positively associated with higher percentages of a neighborhood's population being without good access to public green space include:
- average income of the population;
- total population of the neighborhood;
- the percentage of the population from a black African, Caribbean or black British ethnic background.
- Applying the model to unseen data:
- the predictive accuracy of the model was relatively poor. The RMSE was approximately 24, on an outcome variable that ranges between 0 and 100.
- the R2 value was 0.46, which could be considered reasonably good for the problem domain and a first iteration of model development.
Next steps: If I invest more time in this analysis, I think exploring the following approaches/issues would be interesting/useful.
- Creating separate models for urban and rural neighborhoods/LSOAs.
- Consider what other variables could help improve the explanatory/predictive capabilities of the model.
- Consider whether alternative (non regression based) analytical methods could help in understanding the data better.
Setting up the notebook
This section of the notebook includes all the setup code including importing and documenting the packages used, and setting notebook options.
# ------------------------------------------------------------------------------
# Import packages used in the notebook
# ------------------------------------------------------------------------------
# for data manipulation and visualisation
library(dplyr)
library(ggplot2)
library(tidyr)
# for modeling
library(tidymodels)
# for string manipulation
library(stringr)
# for functional programming
library(purrr)
# for reading in data xslx files
library(readxl)
# for table formatting
library(knitr)
# for skewedness and kurtosis tests used in exploratory analysis
library(moments)
# for analysing the importance of explanatory/predictor variables
library(vip)
# for handling factor variables
library(forcats)

# ------------------------------------------------------------------------------
# Imported packages used in the notebook
# ------------------------------------------------------------------------------
subset(
  data.frame(sessioninfo::package_info()),
  attached == TRUE,
  c(package, loadedversion)
) %>%
  kable()

| | package | loadedversion |
|---|---|---|
| broom | broom | 1.0.3 |
| dials | dials | 1.1.0 |
| dplyr | dplyr | 1.1.0 |
| forcats | forcats | 1.0.0 |
| ggplot2 | ggplot2 | 3.4.0 |
| infer | infer | 1.0.4 |
| knitr | knitr | 1.42 |
| modeldata | modeldata | 1.1.0 |
| moments | moments | 0.14.1 |
| parsnip | parsnip | 1.0.3 |
| purrr | purrr | 1.0.1 |
| readxl | readxl | 1.4.1 |
| recipes | recipes | 1.0.4 |
| rsample | rsample | 1.1.1 |
| scales | scales | 1.2.1 |
| stringr | stringr | 1.5.0 |
| tibble | tibble | 3.1.8 |
| tidymodels | tidymodels | 1.0.0 |
| tidyr | tidyr | 1.3.0 |
| tune | tune | 1.0.1 |
| vip | vip | 0.3.2 |
| workflows | workflows | 1.1.2 |
| workflowsets | workflowsets | 1.0.0 |
| yardstick | yardstick | 1.1.0 |
# ------------------------------------------------------------------------------
# Set up notebook output and data viz styling
# ------------------------------------------------------------------------------
# set default Rmd Chunk options
knitr::opts_chunk$set(
echo = TRUE,
cache = TRUE,
warning = FALSE,
message = FALSE
)
# set default theme for exploratory plots
theme_set(theme_light())
# define colour palette for plots
colours <- c("#374E83", "#C94A54")

Reading in the data
The dataset used is from Friends of the Earth, released in 2020 as part of their research on ‘England’s Green Space Gap’. Data at different spatial resolutions (LSOA, MSOA and Local Authority) is included, but I focus only on the LSOA (Lower Super Output Area) scale. This was the finest grain resolution at which the data required for the analysis was available.
The code chunk below reads the data in from a local Excel file, sorts out the variable naming conventions and outputs an overview of the structure of the data, so I could see all the variables I had to work with and check that the variable types (character, numeric, logical etc.) looked correct.
foe_green_space_raw <- read_excel("../Data/(FOE) Green Space Consolidated Data - England - Version 2.1.xlsx",
sheet = "LSOAs V2.1"
) %>%
  # inconsistent naming conventions are used for variables in the source data,
  # so clean the names for consistency (variable names are now in snake_case)
  janitor::clean_names()

str(foe_green_space_raw)

## tibble [32,844 x 31] (S3: tbl_df/tbl/data.frame)
## $ x1 : num [1:32844] 1 2 3 4 5 6 7 8 9 10 ...
## $ lsoa_code : chr [1:32844] "E01000001" "E01000002" "E01000003" "E01000005" ...
## $ lsoa_name : chr [1:32844] "City of London 001A" "City of London 001B" "City of London 001C" "City of London 001E" ...
## $ msoa_code : chr [1:32844] "E02000001" "E02000001" "E02000001" "E02000001" ...
## $ msoa_name : chr [1:32844] "City of London 001" "City of London 001" "City of London 001" "City of London 001" ...
## $ msoa_name_house_of_commons : chr [1:32844] "City of London" "City of London" "City of London" "City of London" ...
## $ la_code : chr [1:32844] "E09000001" "E09000001" "E09000001" "E09000001" ...
## $ la_name : chr [1:32844] "City of London" "City of London" "City of London" "City of London" ...
## $ la_name_for_readability : chr [1:32844] "City of London" "City of London" "City of London" "City of London" ...
## $ area : num [1:32844] 133326 226199 57305 190745 144197 ...
## $ imd_st_areasha : num [1:32844] 133321 226191 57303 190739 144196 ...
## $ population_imd : num [1:32844] 1296 1156 1350 1121 2040 ...
## $ population_density : num [1:32844] 0.00972 0.00511 0.02356 0.00588 0.01415 ...
## $ total_pop_from_ethnicity_data : num [1:32844] 1465 1436 1346 985 1703 ...
## $ white_pop : num [1:32844] 1238 1274 1055 506 557 ...
## $ mixed_multiple_ethnic_group_pop : num [1:32844] 54 54 55 59 58 88 83 62 149 28 ...
## $ asian_asian_british_pop : num [1:32844] 128 95 168 274 861 ...
## $ black_african_caribbean_black_british_pop: num [1:32844] 11 4 45 100 177 389 574 228 580 143 ...
## $ other_ethnic_group_pop : num [1:32844] 34 9 23 46 50 28 53 49 94 47 ...
## $ bame_pop : num [1:32844] 227 162 291 479 1146 ...
## $ income_decile : num [1:32844] 10 10 6 2 5 2 2 3 3 3 ...
## $ unbuffrd_go_space_area : num [1:32844] 0 0 0 0 0 ...
## $ buffrd_go_space_area : num [1:32844] 0 80106 1858 0 27147 ...
## $ unbuffered_go_space_per_capita : num [1:32844] 0 0 0 0 0 ...
## $ pop_area : num [1:32844] 133326 226199 57305 190745 144197 ...
## $ pop_area_with_go_space_access : num [1:32844] 0 80106 1858 0 27147 ...
## $ pcnt_pop_area_with_go_space_access : num [1:32844] 0 35.41 3.24 0 18.83 ...
## $ pcnt_pop_without_go_space_access : num [1:32844] 100 64.6 96.8 100 81.2 ...
## $ pop_without_go_space_access : num [1:32844] 1296 747 1306 1121 1656 ...
## $ gsdi_avg_area : num [1:32844] 1 1 1 1 1 1 1 1 2 1 ...
## $ gsdi_access : num [1:32844] 1 2 1 1 1 2 2 3 3 4 ...
We also need some additional data on whether each neighborhood (i.e. LSOA) is urban or rural. This data comes from the Office for National Statistics.
urban_rural <- read_excel("../Data/Rural_Urban_Classification_2011_lookup_tables_for_small_area_geographies.xlsx",
sheet = "LSOA11", skip = 2
) %>%
janitor::clean_names() %>%
  select(-lower_super_output_area_2011_name)

kable(head(urban_rural))

| lower_super_output_area_2011_code | rural_urban_classification_2011_code | rural_urban_classification_2011_10_fold | rural_urban_classification_2011_2_fold |
|---|---|---|---|
| E01000001 | A1 | Urban major conurbation | Urban |
| E01000002 | A1 | Urban major conurbation | Urban |
| E01000003 | A1 | Urban major conurbation | Urban |
| E01000005 | A1 | Urban major conurbation | Urban |
| E01000006 | A1 | Urban major conurbation | Urban |
| E01000007 | A1 | Urban major conurbation | Urban |
Now, we just need to join the two datasets together.
foe_green_space_urban_rural <- foe_green_space_raw %>%
left_join(urban_rural, by = c("lsoa_code" = "lower_super_output_area_2011_code"))
kable(head(foe_green_space_urban_rural))

| x1 | lsoa_code | lsoa_name | msoa_code | msoa_name | msoa_name_house_of_commons | la_code | la_name | la_name_for_readability | area | imd_st_areasha | population_imd | population_density | total_pop_from_ethnicity_data | white_pop | mixed_multiple_ethnic_group_pop | asian_asian_british_pop | black_african_caribbean_black_british_pop | other_ethnic_group_pop | bame_pop | income_decile | unbuffrd_go_space_area | buffrd_go_space_area | unbuffered_go_space_per_capita | pop_area | pop_area_with_go_space_access | pcnt_pop_area_with_go_space_access | pcnt_pop_without_go_space_access | pop_without_go_space_access | gsdi_avg_area | gsdi_access | rural_urban_classification_2011_code | rural_urban_classification_2011_10_fold | rural_urban_classification_2011_2_fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | E01000001 | City of London 001A | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 133325.89 | 133320.77 | 1296 | 0.0097205 | 1465 | 1238 | 54 | 128 | 11 | 34 | 227 | 10 | 0 | 0.000 | 0 | 133325.89 | 0.000 | 0.000000 | 100.00000 | 1296.0000 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 2 | E01000002 | City of London 001B | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 226199.38 | 226191.27 | 1156 | 0.0051105 | 1436 | 1274 | 54 | 95 | 4 | 9 | 162 | 10 | 0 | 80106.352 | 0 | 226199.38 | 80106.352 | 35.414046 | 64.58595 | 746.6136 | 1 | 2 | A1 | Urban major conurbation | Urban |
| 3 | E01000003 | City of London 001C | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 57305.11 | 57302.97 | 1350 | 0.0235581 | 1346 | 1055 | 55 | 168 | 45 | 23 | 291 | 6 | 0 | 1857.635 | 0 | 57305.11 | 1857.635 | 3.241657 | 96.75834 | 1306.2376 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 4 | E01000005 | City of London 001E | E02000001 | City of London 001 | City of London | E09000001 | City of London | City of London | 190745.29 | 190738.76 | 1121 | 0.0058769 | 985 | 506 | 59 | 274 | 100 | 46 | 479 | 2 | 0 | 0.000 | 0 | 190745.29 | 0.000 | 0.000000 | 100.00000 | 1121.0000 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 5 | E01000006 | Barking and Dagenham 016A | E02000017 | Barking and Dagenham 016 | Barking East | E09000002 | Barking and Dagenham | Barking and Dagenham | 144196.94 | 144195.85 | 2040 | 0.0141473 | 1703 | 557 | 58 | 861 | 177 | 50 | 1146 | 5 | 0 | 27146.718 | 0 | 144196.94 | 27146.718 | 18.826140 | 81.17386 | 1655.9467 | 1 | 1 | A1 | Urban major conurbation | Urban |
| 6 | E01000007 | Barking and Dagenham 015A | E02000016 | Barking and Dagenham 015 | Barking Central | E09000002 | Barking and Dagenham | Barking and Dagenham | 198136.98 | 198134.81 | 2101 | 0.0106038 | 1391 | 439 | 88 | 447 | 389 | 28 | 952 | 2 | 0 | 83933.435 | 0 | 198136.98 | 83933.435 | 42.361317 | 57.63868 | 1210.9887 | 1 | 2 | A1 | Urban major conurbation | Urban |
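Before moving on, a quick check (a sketch) that every neighborhood matched an urban-rural classification in the join:
# count LSOAs left without an urban-rural classification after the join
foe_green_space_urban_rural %>%
  filter(is.na(rural_urban_classification_2011_2_fold)) %>%
  nrow()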
Data cleaning
Happily, there is no missing data. I was hoping this would be the case, given the data is mostly either from, or derived from, open government data.
visdat::vis_miss(foe_green_space_urban_rural, warn_large_data = FALSE)

Exploratory data analysis
I know this dataset pretty well from previous analysis, so I’ll do a fairly minimal exploratory analysis focusing on:
- identifying multicollinearity amongst the predictors;
- understanding how skewed the distribution of each variable is.
Multicollinearity
The correlation plot below helped me identify some collinearity between variables.
foe_green_space_variables <- foe_green_space_urban_rural %>%
select(
-contains(c("name", "code", "gsdi")),
-x1
)
foe_green_space_variables %>%
GGally::ggcorr(hjust = 1, layout.exp = 4, size = 3)
Based on inspection of the correlation plot above, in the code chunk below I drop some variables that I suspect we don’t need for the analysis. Variables are dropped either because they are highly correlated with another variable or because they are duplicate measures of the same underlying quantity from different datasets.
foe_green_space_urban_rural_focus <- foe_green_space_urban_rural %>%
# explicitly remove variables so I know what has been dropped
select(
-c(
x1, imd_st_areasha, population_imd, unbuffrd_go_space_area, buffrd_go_space_area,
pop_area, pop_area_with_go_space_access, pcnt_pop_area_with_go_space_access,
pop_without_go_space_access
),
# foe green space deprivation ratings are not going to help in the analysis
-contains("gsdi")
  )
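As a quick spot-check on the duplicate measures being dropped (a sketch; the variable pairs below are my reading of which columns duplicate each other), the correlations can be computed directly:
# spot-check: the dropped measures should correlate almost perfectly with the
# variables kept (e.g. imd_st_areasha vs area)
foe_green_space_urban_rural %>%
  summarise(
    cor_area = cor(area, imd_st_areasha),
    cor_population = cor(population_imd, total_pop_from_ethnicity_data)
  )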
Skew
I know this dataset pretty well from previous analysis, and remembered that a lot of the variables are highly skewed, an issue that I’ll pick up in the feature engineering. So, the code chunk below runs a skewness test for each numeric variable, and I’ll store the results as they will probably be useful later on.
# look at skewness for each variable
skew_tests <- foe_green_space_urban_rural_focus %>%
  # drop non numeric variables
  select(-contains(c("name", "code", "gsdi", "rural_urban"))) %>%
  # calculate skewness
  summarise(across(
    dplyr::everything(),
    \(x) skewness(x, na.rm = TRUE)
  )) %>%
  # make the table format more readable
  pivot_longer(cols = everything()) %>%
  rename(
    variable = name,
    skewness = value
  ) %>%
  arrange(skewness) %>%
  # classify variables by amount of skew (positive skewness indicates a right
  # skew, i.e. a long right tail; negative skewness indicates a left skew)
  mutate(skew_group = case_when(
    skewness > 5 ~ "extreme right",
    skewness > 2 ~ "strong right",
    skewness > 0.5 ~ "right",
    skewness > -0.5 ~ "limited",
    skewness > -2 ~ "left",
    skewness > -5 ~ "strong left",
    TRUE ~ "extreme left"
  ))
skew_tests %>%
  # format for rmarkdown
  kable()

| variable | skewness | skew_group |
|---|---|---|
| pcnt_pop_without_go_space_access | -0.5780064 | left |
| income_decile | -0.0000103 | limited |
| white_pop | 0.0229443 | limited |
| mixed_multiple_ethnic_group_pop | 1.8827047 | right |
| total_pop_from_ethnicity_data | 2.2291027 | strong right |
| bame_pop | 2.4637131 | strong right |
| population_density | 2.5508460 | strong right |
| asian_asian_british_pop | 3.7868011 | strong right |
| black_african_caribbean_black_british_pop | 3.8526981 | strong right |
| other_ethnic_group_pop | 4.9348319 | strong right |
| area | 12.0015978 | extreme right |
| unbuffered_go_space_per_capita | 39.8380888 | extreme right |
LASSO regression analysis
We can now move on to running the regression analysis itself. Some key details of the analysis approach are presented below, before getting into the modelling workflow itself.
Outcome variable: pcnt_pop_without_go_space_access, the percentage of the population (in a neighbourhood/LSOA) without 2 ha of public green space within 300m.
Predictor variables: all other relevant variables available.
Methods:
- I used LASSO regression to penalise the inclusion of variables, in case I missed any multicollinearity in the EDA above (see the objective function sketched after this list).
- I split the data into test and train sets, as I wanted to get a sense of how the model performs on unseen data.
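For reference, LASSO estimates coefficients by minimising the least-squares loss plus an L1 penalty on the coefficients (written here in its textbook form; glmnet scales the loss term slightly differently):

$$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

The penalty $\lambda$ is the parameter tuned later in the notebook; at $\lambda = 0$ this reduces to ordinary least squares.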
Approach to coding/implementation: I work within the Tidymodels framework.
Preparing the data
In the code chunk below I do some final data preparation ahead of
splitting into test and train sets. The main data transformation is to
convert population variables from raw counts to percentages. For
example, moving from having a raw count variable for the BAME population
(bame_pop) in each neighborhood, to have a percentage BAME
population (perc_bame_pop).
foe_green_space_urban_rural_model <- foe_green_space_urban_rural_focus %>%
# keep only the variables needed for modelling
select(
lsoa_code,
where(is.numeric),
rural_urban_classification_2011_2_fold
) %>%
# calculate demographic variables as percentage of population rather than
# raw counts
mutate(
perc_bame_pop = bame_pop / total_pop_from_ethnicity_data,
perc_mixed_multiple_ethnic_group_pop = mixed_multiple_ethnic_group_pop / total_pop_from_ethnicity_data,
perc_asian_asian_british_pop = asian_asian_british_pop / total_pop_from_ethnicity_data,
perc_black_african_caribbean_black_british_pop = black_african_caribbean_black_british_pop / total_pop_from_ethnicity_data,
    perc_other_ethnic_group_pop = other_ethnic_group_pop / total_pop_from_ethnicity_data
) %>%
# remove population count data
select(-c(
white_pop, mixed_multiple_ethnic_group_pop,
asian_asian_british_pop,
black_african_caribbean_black_british_pop,
other_ethnic_group_pop,
bame_pop
  ))
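Before splitting, a quick sanity check (a sketch) that the new perc_* variables are proportions lying between 0 and 1:
# the perc_* variables are proportions of the neighbourhood population,
# so should all fall within [0, 1]
foe_green_space_urban_rural_model %>%
  select(starts_with("perc")) %>%
  summary()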
Splitting the data
Next, I split the data into test and training sets. I want each split to have a representative proportion of rural and urban neighborhoods, because I think, based on domain knowledge, that the urban-rural classification will be important. So, I stratify the sampling on this basis.
data_split <- initial_split(foe_green_space_urban_rural_model,
strata = rural_urban_classification_2011_2_fold
)
train_df <- training(data_split)
test_df <- testing(data_split)
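A quick check (a sketch) that the stratified sampling has given each split a similar urban/rural mix:
# urban/rural proportions should be near identical across the two splits
list(train = train_df, test = test_df) %>%
  map(~ count(.x, rural_urban_classification_2011_2_fold) %>%
    mutate(prop = n / sum(n)))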
Feature engineering
Next, I set up the feature engineering, which can be run within the modelling workflow. Mostly, this involves trying to reduce the skew of the explanatory variables, and making sure some of the prerequisites for LASSO regression are met.
green_space_recipe <- recipe(pcnt_pop_without_go_space_access ~ .,
data = train_df
) %>%
# allows lsoa_code to be retained as an id without affecting model results
update_role(lsoa_code, new_role = "ID") %>%
# for extreme right skewed data
step_inverse(
unbuffered_go_space_per_capita, area,
offset = 1
) %>%
# for right skewed data
step_log(
contains("perc"), population_density,
offset = 1
) %>%
# normalisation is required for LASSO
step_normalize(all_numeric(), -all_outcomes()) %>%
# transform urban-rural classification from a string into a format that can
# be used in the model
# 0 = rural, 1 = urban
step_string2factor(rural_urban_classification_2011_2_fold) %>%
step_dummy(rural_urban_classification_2011_2_fold)
# check what training data looks like after running the preprocessing recipe
green_space_prep <- green_space_recipe %>%
  prep(strings_as_factors = FALSE)
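The comment above mentions checking what the training data looks like after preprocessing; one way to do that (a sketch; bake() with new_data = NULL returns the training set retained by prep()) is:
# view the training data after the recipe steps have been applied
green_space_prep %>%
  bake(new_data = NULL) %>%
  glimpse()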
Parameter tuning
Everything is now set up with the data and the feature engineering specification. So, the next step is to tune a key parameter of the LASSO algorithm: the penalty for including a variable in the model. The penalty is non-negative (the default tuning grid used below covers values from effectively 0 up to 1), with a penalty of 0 meaning that a standard linear regression will be run. The modelling workflow is run for multiple values of the penalty, each evaluated on bootstrap resamples of the training set. Then all the performance metrics are collected, so we can see which value of the penalty to use in the final model fit.
# for reproducibility
set.seed(2023)
# bootstrap resamples required for tuning penalty parameter
green_space_boot <- bootstraps(train_df,
strata = rural_urban_classification_2011_2_fold
)
# specify the model
lasso_tuning_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# create a tuning grid
lambda_grid <- grid_regular(penalty(), levels = 50)
# setup the modelling workflow
wf <- workflow() %>%
add_recipe(green_space_recipe) %>%
add_model(lasso_tuning_spec)
# run the parameter tuning
lasso_grid <- tune_grid(
wf,
resamples = green_space_boot,
grid = lambda_grid
)
# view the performance of the model on the training set with different values
# of penalty
lasso_grid %>%
collect_metrics() %>%
ggplot(aes(penalty, mean, color = .metric)) +
geom_line() +
facet_wrap(~.metric, nrow = 2, scales = "free")

The chart above shows the model performing worse as the penalty is increased. I’ll come back to the implications of this later on in the notebook.
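As a numeric cross-check of the chart (a quick sketch), the best-performing penalty values can also be listed directly:
# the grid values with the lowest mean RMSE across the bootstrap resamples
lasso_grid %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  slice_min(mean, n = 5)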
Training the model
Everything is now in place to fit the model on the training data. Before doing that, I need to get the value of the penalty that gives the best model performance. This is very close to zero, so we’ll effectively be running a standard linear regression model.
# get the best performing value of penalty
lowest_rmse <- lasso_grid %>%
  select_best(metric = "rmse")
kable(lowest_rmse)

| penalty | .config |
|---|---|
| 0 | Preprocessor1_Model01 |
With the value of penalty decided upon, the chunk of code below runs the final model fit on the training data, and we can take a look at the model coefficients.
# set up the final modelling workflow (post tuning)
final_lasso <- finalize_workflow(
wf, lowest_rmse
)
# fit the model
lasso_fit <- final_lasso %>%
fit(data = train_df) %>%
extract_fit_parsnip()
# view model coefficients fit
lasso_fit %>%
tidy() %>%
  kable()

| term | estimate | penalty |
|---|---|---|
| (Intercept) | 85.912385 | 0 |
| area | -7.250327 | 0 |
| population_density | -1.148555 | 0 |
| total_pop_from_ethnicity_data | 1.473266 | 0 |
| income_decile | 2.611268 | 0 |
| unbuffered_go_space_per_capita | 21.023485 | 0 |
| perc_bame_pop | -2.177079 | 0 |
| perc_mixed_multiple_ethnic_group_pop | -2.456070 | 0 |
| perc_asian_asian_british_pop | 0.000000 | 0 |
| perc_black_african_caribbean_black_british_pop | 1.395428 | 0 |
| perc_other_ethnic_group_pop | 1.829372 | 0 |
| rural_urban_classification_2011_2_fold_Urban | -25.487460 | 0 |
Finally, in the analysis stage we can look at the importance of each variable included in the model. I’ll come back to the interpretation of these results in the Key findings section below.
p <- lasso_fit %>%
# get the importance of each variable
vi(lambda = lowest_rmse$penalty) %>%
# tidy up variables names for plot
mutate(
importance = abs(Importance),
variable = fct_reorder(Variable, Importance)
) %>%
# plot
ggplot(aes(importance, variable, fill = Sign)) +
geom_col() +
labs(y = NULL)
p

Running the model with test data
Having fit the model to the training data, it is straightforward to run it on the test data and take a look at its performance on unseen data. Again, I’ll come back to the interpretation of these results in the Key findings section immediately below.
metrics <- last_fit(
final_lasso,
data_split
) %>%
collect_metrics()
kable(metrics)

| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 23.6468916 | Preprocessor1_Model1 |
| rsq | standard | 0.4630332 | Preprocessor1_Model1 |
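To see where the error comes from, a quick plot of predicted against observed values on the test set can help (a sketch reusing the objects defined above):
# predicted vs observed values for the test set; points on the dashed line
# would be perfect predictions
last_fit(final_lasso, data_split) %>%
  collect_predictions() %>%
  ggplot(aes(pcnt_pop_without_go_space_access, .pred)) +
  geom_point(alpha = 0.1, colour = colours[1]) +
  geom_abline(linetype = "dashed") +
  labs(x = "Observed (%)", y = "Predicted (%)")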
Key findings
- Whether a neighborhood/LSOA was urban or rural was the most important variable in estimating the proportion of its population without good access to green space. The model estimated that, holding other explanatory variables constant, an urban neighborhood would have a roughly 25 percentage point lower share of its population without good access to green space.
- The amount of green space per capita in a neighborhood was the second most important variable in estimating the proportion of its population without good access to green space. The model predicts that neighborhoods with more green space per capita have higher percentages of people without good access to green space. Similar to the importance of the urban-rural classification noted above, this is likely to be a result of larger, rural LSOAs being less dense, so the distances to travel to publicly accessible green space are likely to be higher than in urban areas.
- Other variables positively associated with higher percentages of a neighborhood's population being without good access to public green space include:
- average income of the population;
- total population of the neighborhood;
- the percentage of the population from a black African, Caribbean or black British ethnic background.
- Applying the model to unseen data:
- the predictive accuracy of the model was relatively poor. The RMSE was approximately 24, on an outcome variable that ranges between 0 and 100.
- the R2 value was 0.46, which could be considered reasonably good for the problem domain and a first iteration of model development.
Next steps
Options to consider in further developing this analysis.
- Creating separate models for urban and rural neighborhoods/LSOAs (see the sketch after this list).
- Consider what other variables could help improve the explanatory/predictive capabilities of the model.
- Consider whether alternative (non regression based) analytical methods could help in understanding the data better.
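As a starting point for the first of these, a minimal sketch of fitting separate models to the urban and rural subsets is below. It is untested against this data and makes some assumptions: the urban-rural dummy has to be dropped from the recipe (it is constant within each subset), I skip the skew transformations, and I reuse the tuned penalty from above.
# a minimal sketch: fit a separate model to the urban and the rural subsets
fit_subset <- function(df) {
  subset_split <- initial_split(df)
  # rebuild the recipe without the urban-rural classification, which is
  # constant within each subset
  subset_recipe <- recipe(pcnt_pop_without_go_space_access ~ .,
    data = training(subset_split)
  ) %>%
    update_role(lsoa_code, new_role = "ID") %>%
    step_rm(rural_urban_classification_2011_2_fold) %>%
    step_normalize(all_numeric_predictors())
  workflow() %>%
    add_recipe(subset_recipe) %>%
    add_model(linear_reg(penalty = lowest_rmse$penalty, mixture = 1) %>%
      set_engine("glmnet")) %>%
    last_fit(subset_split) %>%
    collect_metrics()
}
# compare test-set performance for the two subsets
foe_green_space_urban_rural_model %>%
  split(.$rural_urban_classification_2011_2_fold) %>%
  map(fit_subset)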