Are natural green spaces less accessible to people from minority ethnic backgrounds?
By: Dr. Chris Martin
Tools used: R, R Markdown, tidyverse packages (inc. dplyr and ggplot2), easystats packages (inc. modelbased) and tidymodels packages (inc. infer).
Techniques used: data cleaning, exploratory data analysis, descriptive statistics, hypothesis testing, confidence intervals, simulation, marginal means and contrast analysis (for testing pairwise differences between groups).
Introduction
There are widespread concerns in the UK that natural green spaces are less accessible to people from minority ethnic backgrounds. In this notebook I use statistical techniques to assess whether some of the available evidence supports these concerns. To do this, I analyse data detailing how often people have visited green spaces. The data was collected between April 2020 and March 2022, and is drawn from the People and Nature Survey, which is run by Natural England (part of the UK Government). My analysis finds statistical evidence which supports concerns that natural green spaces are less accessible to people from minority ethnic backgrounds.
In this notebook I work through the typical stages of a statistical analysis: importing, cleaning and exploring the data, before then analyzing the data and reporting the findings.
Setting up the notebook
This section runs the setup code for the notebook, including importing libraries and setting global notebook parameters. A couple of convenience functions used throughout the notebook are also included in the setup code. The versions of the packages used are listed for reproducibility purposes.
library(tidyverse)
library(glue) # for string literals
library(skimr) # for descriptive stats
library(infer) # from TidyModels
# for computational methods for hypothesis tests
# and confidence intervals
library(modelbased) # for marginal means modelling and contrast analysis tests
library(see) # from EasyStats
# for lighthouse plots and contrast analysis
library(knitr) # for table formatting
# set default Rmd chunk options
knitr::opts_chunk$set(echo = TRUE)
# set default theme for exploratory plots
theme_set(theme_light())
# define colour palette for plots
colours <- c("#374E83", "#C94A54")
# for styling faceted plots
style_facets <- function(){
theme(strip.background = element_rect(fill = "white"),
strip.text = element_text(colour = "black",
face = "bold"))
}
# formatting string in the ethnicity variable for table output
format_ethnicity_str <- function(df){
  df %>%
    mutate(ethnicity = if_else(ethnicity ==
                                 'Any other ethnic group\nor background',
                               'Any other ethnic group or background',
                               ethnicity))
}
installed.packages()[names(sessionInfo()$otherPkgs), "Version"]
## knitr see modelbased infer skimr glue forcats
## "1.39" "0.7.4" "0.8.5" "1.0.4" "2.1.4" "1.6.2" "0.5.1"
## stringr dplyr purrr readr tidyr tibble ggplot2
## "1.4.1" "1.0.9" "0.3.4" "2.1.2" "1.2.0" "3.1.8" "3.4.0"
## tidyverse
## "1.3.2"
Importing the data
The survey data is released by Natural England as an Excel spreadsheet. The code below reads in the data, and then focuses on the variables needed in the analysis. The People and Nature Survey is broad in scope and only a few variables from it are needed, so many of the variables in the raw dataset are dropped to make the dataframe easier to work with.
file_path_prefix <- "../"
# read in the data
pan_raw <- readxl::read_xlsx(glue("{file_path_prefix}datasets/PANS_Y2_Q1_Q4.xlsx"),
sheet = "Data",
guess_max = 26000) %>% # to account for some
# sparse columns
janitor::clean_names() # standardise naming conventions
# focus in on the data to be used in this analysis
pan_focus <- pan_raw %>%
select(respondent_id, wave, weight_percent, any_visits_14, ethnicity,
         no_of_visits, gender, age_band)
Cleaning the data
This section of the code cleans the data. First, it identifies the pattern of missing data. We can see that there are three variables with missing data: any_visits_14, ethnicity and no_of_visits. It looks like when data is missing for any_visits_14 it is also missing for no_of_visits (a quick check of this follows the list below). This makes sense as:
- any_visits_14 records if the survey respondent has visited a green space in the 14 days preceding their participation in the survey;
- no_of_visits records how many times the respondent has visited a green space in the 14 days preceding their participation in the survey.
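As a quick check of this reading (a small addition, not in the original analysis), the snippet below counts rows where one of the two variables is missing but the other is not; both counts should be zero (or close to it) if the two variables always go missing together.
# count rows where only one of any_visits_14 / no_of_visits is missing
pan_focus %>%
  summarise(missing_visits_flag_only = sum(is.na(any_visits_14) & !is.na(no_of_visits)),
            missing_visit_count_only = sum(!is.na(any_visits_14) & is.na(no_of_visits)))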
# ------------------------------------------------------------------------------
# Check patterns of missing data
# ------------------------------------------------------------------------------
# visual overview
visdat::vis_miss(pan_focus)
In the code below, I check that a non-response (i.e. NA value) for any_visits_14 is independent of a non-response for ethnicity. A quick test suggests independence, which gives some reassurance that there is not a systematic bias at play, for example that people from a certain ethnic background are less likely to respond to questions about visiting green space.
# ------------------------------------------------------------------------------
# Check for independence
# ------------------------------------------------------------------------------
# count number of missing pieces of data for the two variables
missing <- pan_focus %>%
mutate(missing_any_visits = is.na(any_visits_14),
missing_ethnicity = is.na(ethnicity),
missing_both = is.na(any_visits_14) & is.na(ethnicity)) %>%
summarise(across(where(is.logical), list(sum = sum)))
# calculate how many observations would be expected to be missing both if
# the assumption of independence holds
prop_missing_any_visits <- missing$missing_any_visits_sum / nrow(pan_focus)
prop_missing_ethnicity <- missing$missing_ethnicity_sum / nrow(pan_focus)
expected_prop_missing_both <- prop_missing_any_visits * prop_missing_ethnicity
expected_num_missing_both <- round(expected_prop_missing_both * nrow(pan_focus))
actual_num_missing_both <- missing$missing_both_sum
# output expected (if independent) and actual number missing both
print(glue("If independent expected missing both: {expected_num_missing_both}"))
## If independent expected missing both: 136
print(glue("Actual number missing both: {actual_num_missing_both}"))
## Actual number missing both: 137
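As an additional cross-check (an aside, not part of the original analysis), a chi-squared test of independence on the two missingness indicators offers another way to probe the same question.
# cross-tabulate the two missingness indicators and test for independence;
# a large p-value is consistent with the missingness being independent
missing_tab <- table(is.na(pan_focus$any_visits_14),
                     is.na(pan_focus$ethnicity))
chisq.test(missing_tab)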
# they are independent, so proceed with dropping missing data
pan_clean <- pan_focus %>%
  na.omit() # removes about 10% of observations
Finally in this data cleaning section, I confirm that the respondent IDs (respondent_id) are unique. So, now we know that people are not participating in the survey multiple times.
# confirm respondent_id values are unique
assertthat::assert_that(length(unique(pan_focus$respondent_id)) == nrow(pan_focus))
## [1] TRUE
Exploring the data
First, I take a look at a summary of the descriptive stats for each variable. In the process I noticed that no_of_visits is of character type when it should be numeric. I'll come back to this below when I take a look at the no_of_visits variable in more detail.
# descriptive stats for all variables
skimr::skim(select(pan_clean, -respondent_id))
| Name | select(pan_clean, -respon… |
| Number of rows | 43717 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| wave | 0 | 1 | 17 | 24 | 0 | 24 | 0 |
| any_visits_14 | 0 | 1 | 9 | 17 | 0 | 4 | 0 |
| ethnicity | 0 | 1 | 5 | 36 | 0 | 7 | 0 |
| no_of_visits | 0 | 1 | 1 | 17 | 0 | 59 | 0 |
| gender | 0 | 1 | 4 | 24 | 0 | 3 | 0 |
| age_band | 0 | 1 | 3 | 5 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weight_percent | 0 | 1 | 1 | 0.68 | 0.14 | 0.57 | 0.81 | 1.22 | 10.42 | ▇▁▁▁▁ |
Next, I explore the data for the three main variables used in the analysis:
- ethnicity, which records the ethnic background of the survey respondent;
- any_visits_14, which records if the survey respondent has visited a green space in the 14 days preceding their participation in the survey;
- no_of_visits, which records how many times the respondent has visited a green space in the 14 days preceding their participation in the survey.
Before doing that, I quickly make some tweaks to the format of strings within the dataframe so that everything looks a bit neater when it appears in exploratory plots.
pan_clean <- pan_clean %>%
# to improve plot labelling
mutate(ethnicity = if_else(ethnicity == 'Any other ethnic group or background',
'Any other ethnic group\nor background',
                             ethnicity))
ethnicity: What is your ethnic background?
This chunk of code looks at the ethnicity variable, confirming that the frequency distribution, when grouping respondents by ethnic background, looks roughly as one would expect. See the plot (showing frequencies) and dataframe (showing proportions of the sample) below for details.
# prepare data for plotting
plot_df <- pan_clean %>%
count(ethnicity) %>%
# to order groups within the plot
mutate(ethnicity = fct_reorder(ethnicity, n))
# will be useful for ordering facets in a plot later on
ethnicity_levels <- rev(levels(plot_df$ethnicity))
# plot frequency distribution
ggplot(plot_df, aes(ethnicity, n)) +
geom_col(fill = colours[1], width = 0.8) +
geom_text(aes(label = n), nudge_y = 2000) +
coord_flip() +
  labs(y = "Number of survey respondents",
       x = NULL)
# format strings for table output
levels(plot_df$ethnicity) <- rev(str_replace(ethnicity_levels, "\n", " "))
# look at proportion of survey sample in each ethnicity group
plot_df %>%
mutate(prop = round(n / sum(n), 3)) %>%
  kable()
| ethnicity | n | prop |
|---|---|---|
| Any other ethnic group or background | 400 | 0.009 |
| Asian or Asian British | 2989 | 0.068 |
| Black or Black British | 1314 | 0.030 |
| Don’t know | 1 | 0.000 |
| Mixed | 1012 | 0.023 |
| Prefer not to say | 1 | 0.000 |
| White | 38000 | 0.869 |
The survey dataset from Natural England includes weighting factors, so we can look at how different the weighted group counts are from the raw group counts above. The plot below shows that the weighting does not make a huge difference to the group counts. So, in this notebook I'll be flexible about where and when to use the weighting factor, as this opens up the opportunity to conduct analysis on the raw data where it would be challenging to use the weighted data.
# -----------------------------------------------------------------------------
# compare weighted and unweighted counts
# -----------------------------------------------------------------------------
# unweighted and weighted group counts
ethnicity <- pan_clean %>%
count(ethnicity) %>%
mutate(weighted = FALSE)
ethnicity_wt <- pan_clean %>%
count(ethnicity, wt = weight_percent) %>%
mutate(weighted = TRUE)
# combine unweighted and weighted group counts for plotting
plot_df <- bind_rows(ethnicity, ethnicity_wt)
# produce the plot
ggplot(plot_df, aes(ethnicity, n,
fill = weighted)) +
geom_col(position = "dodge", width = 0.6) +
scale_fill_manual(values = colours) +
coord_flip() +
labs(x = NULL,
y= "Number of survey respondents",
       fill = "With weighting?") +
facet_wrap(~factor(ethnicity, ethnicity_levels), scales = "free") +
theme(axis.text.y = element_blank()) +
  style_facets()
any_visits_14: Have you visited green space in the last 14 days?
The code below calculates and shows the proportion of survey respondents giving each response to the question ‘Have you visited green space in the last 14 days?’. We can see in the chart and table that around 64% of respondents had made a visit to green space. Of the remainder, around 9% stated that they didn’t know whether they had made a visit or not.
# any_visits
any_visits <- pan_clean %>%
count(any_visits_14) %>%
mutate(weighted = FALSE,
prop = round(n / sum(n), 3))
# plot frequency distribution
ggplot(any_visits, aes(fct_reorder(any_visits_14, n),
prop)) +
geom_col(fill = colours[1], width = 0.6) +
geom_text(aes(label = prop), nudge_y = 0.03) +
coord_flip() +
  labs(y = "Proportion of survey respondents",
       x = NULL,
       subtitle = "Have you visited green space in the last 14 days?")
# show table including proportions
select(any_visits, -weighted) %>%
  kable()
| any_visits_14 | n | prop |
|---|---|---|
| Any visits | 28089 | 0.643 |
| Don’t know | 4121 | 0.094 |
| No visits | 11148 | 0.255 |
| Prefer not to say | 359 | 0.008 |
We can also compare the frequency of each survey response between the raw and weighted data. Again, for any_visits_14 the relatively small impact of applying the weighting factors suggests that we can be flexible about whether or not to use the weighting factors in the analysis.
# ----------------------------------------------------------------------------
# compare weighted and unweighted counts
# ----------------------------------------------------------------------------
any_visits_wt <- pan_clean %>%
count(any_visits_14, wt = weight_percent) %>%
mutate(weighted = TRUE)
plot_df <- bind_rows(any_visits, any_visits_wt)
ggplot(plot_df, aes(any_visits_14, n, fill = weighted)) +
geom_col(position = "dodge", width = 0.6) +
scale_fill_manual(values = colours) +
labs(x = NULL,
y= "Number of survey respondents",
       fill = "With weighting?") +
coord_flip() +
facet_wrap(~any_visits_14, scales = "free") +
style_facets() +
  theme(axis.text.y = element_blank())
no_of_visits: How often have you visited green space in the last 14 days?
Finally in this exploratory data analysis section of the notebook, I am going to look at the distribution of the no_of_visits variable. But first I need to address the issue of this variable being a character type when I would expect a numeric one. The first step is to look at all the values of no_of_visits, where we can see that there are some text responses (“Don’t know” and “Prefer not to say”). So, I am going to remove these responses, because it is not clear that either could simply be replaced by a 0. Also, I am keen to see some initial analysis results before I invest too much time in data preprocessing (inc. missing data imputation).
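Before dropping these responses, it is worth knowing how many are affected. The quick count below is a small addition to the original code; it flags every value of no_of_visits that cannot be parsed as a number.
# count the text-based responses that will be dropped
pan_clean %>%
  filter(is.na(suppressWarnings(as.numeric(no_of_visits)))) %>%
  count(no_of_visits)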
# check why data has been read in as character
unique(pan_clean$no_of_visits) # "Don't know" etc. values
## [1] "0" "12" "14"
## [4] "1" "Don’t Know" "4"
## [7] "2" "10" "8"
## [10] "3" "6" "7"
## [13] "19" "5" "Prefer not to say"
## [16] "28" "9" "25"
## [19] "43" "15" "13"
## [22] "20" "11" "22"
## [25] "21" "18" "16"
## [28] "24" "42" "23"
## [31] "55" "50" "17"
## [34] "36" "35" "30"
## [37] "52" "34" "26"
## [40] "40" "45" "71"
## [43] "44" "56" "29"
## [46] "62" "46" "27"
## [49] "33" "100" "69"
## [52] "48" "31" "32"
## [55] "54" "60" "37"
## [58] "89" "41"
# remove text based responses
pan_clean_no_of_visits <- pan_clean %>%
filter(!(no_of_visits %in% c("Don’t Know", "Prefer not to say"))) %>%
  mutate(no_of_visits = as.numeric(no_of_visits))
Having dealt with the missing data issue, the plot below shows the frequency distribution for no_of_visits. The top part shows the full distribution, which is heavily right skewed, as some respondents have reported up to 100 visits to green space in the last 14 days. The bottom part zooms in to focus on responses of up to 20 visits in the last 14 days to provide a little more detail.
p <- pan_clean_no_of_visits %>%
# calculate frequencies
count(no_of_visits) %>%
# produce plot
ggplot(aes(no_of_visits, n)) +
geom_col(fill = colours[1]) +
labs(x = "Number of visits to green space in the last 14 days",
subtitle = "Number of survey respondents",
y = NULL) +
ggforce::facet_zoom(x = no_of_visits < 20)
p
Also, here are the descriptive stats for the no_of_visits variable.
s <- skim(select(pan_clean_no_of_visits,
no_of_visits))
s
| Name | select(pan_clean_no_of_vi… |
| Number of rows | 39237 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| no_of_visits | 0 | 1 | 4.15 | 5.28 | 0 | 0 | 2 | 6 | 100 | ▇▁▁▁▁ |
Analyzing the data
Having explored the data and become familiar with it, I had two main objectives for the analysis:
- To understand if people from a white (i.e. majority) ethnic background are more likely to have visited green space in the last 14 days than people from a minority ethnic background.
- To understand if people from different ethnic backgrounds visit green space with different frequencies. In other words, do people with certain ethnic backgrounds tend to make more or fewer visits to green spaces?
So, the analysis section of this notebook is split into two sub-sections, one addressing each of these two objectives. In both cases I am analyzing the survey sample to infer something about the wider population (adults in England).
Have you visited green space in the last 14 days?
In this section of the notebook I try to answer the question: do the proportions of people who have visited green space in the last 14 days differ between people from an ethnic minority background and people from a white ethnic background?
I am going to use a hypothesis test for the difference between two proportions to assess whether we can be confident that there is a difference in the wider population. But first, let's confirm that we see a difference within the survey sample itself.
The code below calculates the number of survey respondents in each group, for both the raw data and the data with the weighting applied.
# ---------------------------------------------------------------------------------
# Prepare the data
# ---------------------------------------------------------------------------------
diff_in_props_df <- pan_clean %>%
# remove ambiguous answers
filter(ethnicity != "Don't know", ethnicity != "Prefer not to say") %>%
filter(any_visits_14 != "Don't know", any_visits_14 != "Prefer not to say") %>%
mutate(ethnicity = if_else(ethnicity == "White", ethnicity, "Minority Ethnic Background"))
# ---------------------------------------------------------------------------------
# simulate a representative sample based on the survey weightings
# ---------------------------------------------------------------------------------
set.seed(1683763) # so we get the same sample each time the code block runs
# what would group proportions look like if we applied the weighting factor
group_props_weighted <- diff_in_props_df %>%
  count(ethnicity, any_visits_14,
        wt = weight_percent) %>%
  mutate(prop = n / sum(n))
num_resp <- nrow(diff_in_props_df)
sim_diff_in_props_df <- group_props_weighted %>%
# simulate respondents based on weighted group proportions
slice_sample(weight_by = prop, n = num_resp, replace = TRUE) %>%
select(-c(n, prop))
sim_diff_in_props_counts <- sim_diff_in_props_df %>%
# calculated proportions based on simulation of a sample using weightings
count(ethnicity, any_visits_14) %>%
mutate(prop_weighted = round(n / sum(n), 3)) %>%
rename(n_weighted = n)
# calculate actual proportions in the raw data for comparison
acutal_diff_in_props_counts <- diff_in_props_df %>%
  count(ethnicity, any_visits_14) %>%
  mutate(prop = round(n / sum(n), 3))
The table below compares the unweighted and weighted samples. We can see that simulating a sample (using the weighting factor) results in counts similar to those from the actual survey sample. However, for the sake of completeness I use the weighted sample in this part of the analysis. Note: In the table below prop refers to the proportion of the overall sample.
acutal_diff_in_props_counts %>%
left_join(sim_diff_in_props_counts) %>%
  kable()
## Joining, by = c("ethnicity", "any_visits_14")
| ethnicity | any_visits_14 | n | prop | n_weighted | prop_weighted |
|---|---|---|---|---|---|
| Minority Ethnic Background | Any visits | 3284 | 0.084 | 3311 | 0.084 |
| Minority Ethnic Background | No visits | 1547 | 0.039 | 1565 | 0.040 |
| White | Any visits | 24804 | 0.632 | 24330 | 0.620 |
| White | No visits | 9601 | 0.245 | 10030 | 0.256 |
We are actually interested in what proportion of each of the two groups (Minority Ethnic Background and White) has made any visits to green space in the 14 days preceding participation in the survey. The table below confirms, that for the sample at least, there is a difference in the proportion visiting across the groups.
sim_diff_in_props_df %>%
count(ethnicity, any_visits_14) %>%
group_by(ethnicity) %>%
mutate(prop_of_ethnicity_group = n / sum(n)) %>%
  kable()
| ethnicity | any_visits_14 | n | prop_of_ethnicity_group |
|---|---|---|---|
| Minority Ethnic Background | Any visits | 3311 | 0.6790402 |
| Minority Ethnic Background | No visits | 1565 | 0.3209598 |
| White | Any visits | 24330 | 0.7080908 |
| White | No visits | 10030 | 0.2919092 |
So, now we can move on to the formal statistical testing. This will involve assessing whether we can infer from our weighted sample that there is a difference in the proportions of people from white and minority ethnic backgrounds visiting green spaces in the wider population (adults in England). The first thing we need to do is set up the hypotheses to be tested.
- H0: The proportions of people from white and minority ethnic backgrounds visiting green spaces are the same.
- HA: The proportions of people from white and minority ethnic backgrounds visiting green spaces are not the same.
I am going to use a 99% confidence level within this hypothesis test.
I also use a computational approach implemented within the
infer package (part of TidyModels). See here
for more details.
The first step is to calculate the observed statistic, the difference in proportions in the sample at hand (referred to as d_hat in the code chunk below). Then we simulate the null distribution for the difference in proportions. This is the sampling distribution we would expect if H0 is correct. In practice this is done by repeating this process: randomly shuffling values in the any_visits_14 column, while holding the ethnicity column fixed, and then calculating the difference in the proportions each time. This is because shuffling in this way would be of no consequence if the null hypothesis holds (i.e. ethnicity and any_visits_14 are independent).
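For intuition, here is a minimal sketch (not part of the original analysis) of what a single permutation replicate looks like, using a base R shuffle of the weighted sample created above.
# shuffle the any_visits_14 responses while holding ethnicity fixed,
# then recompute the difference in proportions; generate(type = "permute")
# automates many repetitions of exactly this step
one_perm <- sim_diff_in_props_df %>%
  mutate(any_visits_14 = sample(any_visits_14)) %>%
  group_by(ethnicity) %>%
  summarise(prop_visits = mean(any_visits_14 == "Any visits"), .groups = "drop")
one_perm$prop_visits[one_perm$ethnicity == "Minority Ethnic Background"] -
  one_perm$prop_visits[one_perm$ethnicity == "White"]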
set.seed(16833) # so we get the same results from the permuation test each time
# calculate the observed difference in the proportions of people with and without
# Minority Ethnic backgrounds who have visited green space in the last 14 days
d_hat <- sim_diff_in_props_df %>%
specify(any_visits_14 ~ ethnicity, success = "Any visits") %>%
calculate(stat = "diff in props", order = c("Minority Ethnic Background",
"White"))
# generate the null distribution
null_dist <- sim_diff_in_props_df %>%
specify(any_visits_14 ~ ethnicity, success = "Any visits") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("Minority Ethnic Background",
                                              "White"))
Next we compare the observed statistic (difference in proportions) to the null distribution. We can see that the observed statistic (the red line) falls outside the range of the null distribution.
label_text <- "Red line shows\ndifference in proportions\nin the actual survey sample"
# visualise the null distribution and the observed difference
visualise(null_dist) +
shade_p_value(obs_stat = d_hat, direction = "two-sided") +
geom_label(x = -0.0185, y = 100,
label = label_text,
colour = "red") +
labs(x = "Difference in proportions between groups",
       y = "Number of simulations")
This is confirmed by calculating the p-value associated with the observed statistic. This is approximately 0, so we can reject H0: there is statistical evidence that the proportions of people from white and minority ethnic backgrounds visiting green spaces are different.
null_dist %>%
get_p_value(obs_stat = d_hat, direction = "two-sided") %>%
  kable()
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
| p_value |
|---|
| 0 |
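As an aside (not part of the original analysis), a classical two-sample test of proportions on the weighted group counts reported earlier in this section offers another route to the same question.
# two-sample test for equality of proportions, using the weighted counts
# from the earlier table: 3311 of (3311 + 1565) minority ethnic respondents
# and 24330 of (24330 + 10030) white respondents reported any visits
prop.test(x = c(3311, 24330),
          n = c(3311 + 1565, 24330 + 10030),
          conf.level = 0.99)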
Below I output the observed difference in proportions and the confidence interval (at a 99% confidence level) to show the size of the difference. The fact that the observed difference and confidence interval are negative indicates that the proportion of people visiting green spaces is lower for people from a minority ethnic background than for people from a white background.
bootstrap_dist <- sim_diff_in_props_df %>%
specify(any_visits_14 ~ ethnicity, success = "Any visits") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in props", order = c("Minority Ethnic Background",
"White"))
# get confidence intervals
percentile_ci <- get_ci(bootstrap_dist, level = 0.99)
# output and visualise confidence intervals
percentile_ci %>%
mutate(observed_diff_in_proportions = d_hat$stat) %>%
relocate(observed_diff_in_proportions) %>%
kable()| observed_diff_in_proportions | lower_ci | upper_ci |
|---|---|---|
| -0.0290506 | -0.0480605 | -0.0103446 |
Finally in the analysis of the any_visits_14 survey response, we can visualize the confidence interval over the bootstrap distribution. The latter simulates the sampling distribution we would expect based on the survey sample at hand. We can see that the confidence interval does not contain 0, hence again confirming that H0 can be rejected.
visualize(bootstrap_dist) +
shade_confidence_interval(endpoints = percentile_ci) +
geom_label(x = -0.0185, y = 150,
label = "99% CI\nin blue",
colour = "grey20") +
labs(x = "Difference in proportions between groups",
       y = "Number of simulations")
How often have you visited green space in the last 14 days?
Moving on now to the second part of the analysis, which compares green space visiting behavior between groups of people with different ethnic backgrounds. Specifically, the analysis compares the number of visits made to green space in the 14 days preceding participation in the survey (no_of_visits) across groups of people with different ethnic backgrounds. The groups, as defined within the survey data, are:
- Asian or Asian British;
- Black or Black British;
- Mixed;
- White;
- and, any other ethnic group or background.
The first thing to do is to look at the mean number of visits (no_of_visits) for each group in the raw survey sample data. The table below suggests there may well be some differences in the means between groups.
raw_means_comp <- pan_clean_no_of_visits %>%
# remove outliers
  # assume that visiting green space more than twice per day
# i.e. 28 times in the last 14 days is an outlier
filter(no_of_visits <= 28) %>%
# remove ambiguous survey responses
filter(!(ethnicity %in% c("Don’t know", "Prefer not to say")))
skim_df <- raw_means_comp %>%
select(ethnicity, no_of_visits) %>%
  # for output table formatting
mutate(ethnicity = str_replace(ethnicity, '\n', ' ')) %>%
# for group-wise summary statistics
group_by(ethnicity) %>%
skim()
# reorder columns for readability
raw_means <- skim_df %>%
select(skim_variable:numeric.sd) %>%
rename(mean = numeric.mean,
sd = numeric.sd) %>%
relocate(mean, .after = ethnicity)
# output to notebook
kable(raw_means)
| skim_variable | ethnicity | mean | n_missing | complete_rate | sd |
|---|---|---|---|---|---|
| no_of_visits | Any other ethnic group or background | 3.406728 | 0 | 1 | 4.306831 |
| no_of_visits | Asian or Asian British | 2.886582 | 0 | 1 | 3.534872 |
| no_of_visits | Black or Black British | 2.623886 | 0 | 1 | 3.793191 |
| no_of_visits | Mixed | 3.610660 | 0 | 1 | 4.311962 |
| no_of_visits | White | 4.173403 | 0 | 1 | 4.854922 |
We can also take a look at a violin plot to get a sense of the differences in the distributions across groups.
# create plot
raw_means_comp %>%
ggplot(aes(no_of_visits, ethnicity)) +
geom_violin() +
labs(y = NULL,
x = "Number of Visits to Green Space in the last 14 days",
       subtitle = "Probability densities")
Given there are some big differences between the number of people in each of the ethnicity groups, I wanted to confirm that this wasn't having too much of an impact on the estimation of the group means. To do this I used a marginal means calculation, which essentially models the means using linear regression. See here for more details. See the table below for the results.
# create model
model <- lm(no_of_visits ~ ethnicity,
data = raw_means_comp)
# calculate marginal means
marginal_means <- estimate_means(model, ci = 0.99)
## We selected `at = c("ethnicity")`.
# format marginal means for notebook output
marginal_means %>%
format_ethnicity_str() %>%
  kable()
| ethnicity | Mean | SE | CI_low | CI_high |
|---|---|---|---|---|
| White | 4.173403 | 0.0255853 | 4.107497 | 4.239310 |
| Asian or Asian British | 2.886582 | 0.0946797 | 2.642691 | 3.130472 |
| Mixed | 3.610660 | 0.1612756 | 3.195222 | 4.026099 |
| Black or Black British | 2.623886 | 0.1414418 | 2.259538 | 2.988234 |
| Any other ethnic group or background | 3.406728 | 0.2619994 | 2.731829 | 4.081626 |
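As an aside (not part of the original analysis), with a single categorical predictor the marginal means are simply the model's fitted values for each group, so they can be reproduced directly with predict().
# predicted mean number of visits for each ethnicity group from the linear
# model; these correspond to the Mean column in the table above
predict(model,
        newdata = data.frame(ethnicity = sort(unique(raw_means_comp$ethnicity))))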
Comparing the raw means for each group with the marginal means for each group, we can see that they are pretty similar.
# format ethnicity strings for plotting
raw_means <- raw_means %>%
mutate(ethnicity = if_else(ethnicity == 'Any other ethnic group or background',
'Any other ethnic group\nor background',
ethnicity))
# create plot
ggplot() +
# raw means
geom_point(data = raw_means,
mapping = aes(mean, ethnicity),
colour = "red",
position = position_nudge(y = -0.1)) +
# marginal means
geom_pointrange(data = marginal_means,
mapping = aes(Mean, ethnicity,
xmin = CI_low, xmax = CI_high),
position = position_nudge(y = 0.1)) +
# tidy up text
labs(subtitle = "Red points are raw means.
Black point-lines show marginal means
plus confidence intervals",
y = NULL,
       x = "Mean number of visits in the last 14 days")
I continue to use the marginal means, as this allows us to test the statistical significance of the differences in means between groups using contrast analysis. In the table below we can see results for the full list of inter-group comparisons.
contrasts <- estimate_contrasts(model, contrast = "ethnicity")
contrasts_full_res <- contrasts %>%
# format strings for table output
mutate(across(where(is.character), str_replace, '\n', ' ')) %>%
mutate(across(where(is.numeric),
~round(., 3))
)
contrasts_full_res %>%
  kable()
| Level1 | Level2 | Difference | CI_low | CI_high | SE | df | t | p |
|---|---|---|---|---|---|---|---|---|
| Asian or Asian British | Any other ethnic group or background | -0.520 | -1.302 | 0.262 | 0.279 | 39101 | -1.867 | 0.335 |
| Asian or Asian British | Black or Black British | 0.263 | -0.215 | 0.740 | 0.170 | 39101 | 1.543 | 0.534 |
| Asian or Asian British | Mixed | -0.724 | -1.249 | -0.199 | 0.187 | 39101 | -3.872 | 0.001 |
| Black or Black British | Any other ethnic group or background | -0.783 | -1.619 | 0.053 | 0.298 | 39101 | -2.629 | 0.065 |
| Mixed | Any other ethnic group or background | 0.204 | -0.660 | 1.068 | 0.308 | 39101 | 0.663 | 0.964 |
| Mixed | Black or Black British | 0.987 | 0.385 | 1.589 | 0.215 | 39101 | 4.600 | 0.000 |
| White | Any other ethnic group or background | 0.767 | 0.028 | 1.506 | 0.263 | 39101 | 2.912 | 0.030 |
| White | Asian or Asian British | 1.287 | 1.012 | 1.562 | 0.098 | 39101 | 13.121 | 0.000 |
| White | Black or Black British | 1.550 | 1.146 | 1.953 | 0.144 | 39101 | 10.780 | 0.000 |
| White | Mixed | 0.563 | 0.104 | 1.021 | 0.163 | 39101 | 3.446 | 0.005 |
Focusing in on where there are significant differences between the means (i.e. p-values less than 0.01, as we are working at a 99% confidence level), we get the table and lighthouse plot below.
These are showing that there are significant differences in the
average number of visits to green space (no_of_visits)
between the following groups:
- Asian or Asian British and Mixed
- Mixed and Black or Black British
- White and each of Asian or Asian British, Black or Black British and Mixed
So, we can be confident that we will see similar differences in the general population, assuming that our survey sample is not systematically biased in some way.
contrasts_full_res %>%
filter(p < 0.01) %>%
  kable()
| Level1 | Level2 | Difference | CI_low | CI_high | SE | df | t | p |
|---|---|---|---|---|---|---|---|---|
| Asian or Asian British | Mixed | -0.724 | -1.249 | -0.199 | 0.187 | 39101 | -3.872 | 0.001 |
| Mixed | Black or Black British | 0.987 | 0.385 | 1.589 | 0.215 | 39101 | 4.600 | 0.000 |
| White | Asian or Asian British | 1.287 | 1.012 | 1.562 | 0.098 | 39101 | 13.121 | 0.000 |
| White | Black or Black British | 1.550 | 1.146 | 1.953 | 0.144 | 39101 | 10.780 | 0.000 |
| White | Mixed | 0.563 | 0.104 | 1.021 | 0.163 | 39101 | 3.446 | 0.005 |
plot(filter(contrasts, p < 0.01), estimate_means(model)) +
coord_flip() +
labs(x = NULL,
       y = "Mean number of visits in the last 14 days")
## We selected `at = c("ethnicity")`.
If I was extending this analysis
If I was spending more time on this analysis, I would explore the validity of some of the key assumptions that I have made so far. This would include testing whether the following are valid assumptions:
- That we can use the unweighted data in comparisons between groups.
- That the relationships identified hold across the different rounds of the survey. Did people participating in the survey on different dates make any difference? (A rough sketch of how this check might begin is given below.)
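As a rough sketch (not part of the original analysis) of how that second check might begin, and assuming the wave column distinguishes survey rounds, one could compare the proportion of each group reporting any visits by wave:
# proportion of each group reporting any visits, broken down by survey wave
pan_clean %>%
  filter(any_visits_14 %in% c("Any visits", "No visits")) %>%
  filter(!str_detect(ethnicity, "know|say")) %>% # drop ambiguous responses
  mutate(ethnicity = if_else(ethnicity == "White",
                             "White", "Minority Ethnic Background")) %>%
  group_by(wave, ethnicity) %>%
  summarise(prop_any_visits = mean(any_visits_14 == "Any visits"),
            n = n(),
            .groups = "drop")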