Are natural green spaces less accessible to people from minority ethnic backgrounds?
By: Dr. Chris Martin
Tools used: R, R Markdown, tidyverse packages (inc. dplyr and ggplot2), easystats packages (inc. modelbased) and tidymodels packages (inc. infer).
Techniques used: data cleaning, exploratory data analysis, descriptive statistics, hypothesis testing, confidence intervals, simulation, marginal means and contrast analysis (for testing pairwise differences between groups).
Introduction
There are widespread concerns in the UK that natural green spaces are less accessible to people from minority ethnic backgrounds. In this notebook I use statistical techniques to assess whether some of the available evidence supports these concerns. To do this, I analyse data detailing how often people have visited green spaces. The data was collected between April 2020 and March 2022, and is drawn from the People and Nature Survey, which is run by Natural England (part of the UK Government). My analysis finds statistical evidence which supports concerns that natural green spaces are less accessible to people from minority ethnic backgrounds.
In this notebook I work through the typical stages of a statistical analysis: importing, cleaning and exploring the data, before then analyzing the data and reporting the findings.
Setting up the notebook
This section runs the setup code for the notebook, including importing libraries and setting global notebook parameters. A couple of convenience functions used throughout the notebook are also included in the setup code. The versions of the packages used are listed for reproducibility purposes.
library(tidyverse)
library(glue) # for string literals
library(skimr) # for descriptive stats
library(infer) # from TidyModels
# for computational methods for hypothesis tests
# and confidence intervals
library(modelbased) # for marginal means modelling and contrast analysis tests
library(see) # from EasyStats
# for lighthouse plots and contrast analysis
library(knitr) # for table formatting
# set default Rmd chunk options
knitr::opts_chunk$set(echo = TRUE)
# set default theme for exploratory plots
theme_set(theme_light())
# define colour palette for plots
colours <- c("#374E83", "#C94A54")
# for styling faceted plots
style_facets <- function(){
theme(strip.background = element_rect(fill = "white"),
strip.text = element_text(colour = "black",
face = "bold"))
}
# formatting string in the ethnicity variable for table output
format_ethnicity_str <- function(df){
  df %>%
    mutate(ethnicity = if_else(ethnicity ==
                                 'Any other ethnic group\nor background',
                               'Any other ethnic group or background',
                               ethnicity))
}
installed.packages()[names(sessionInfo()$otherPkgs), "Version"]
## knitr see modelbased infer skimr glue forcats
## "1.39" "0.7.4" "0.8.5" "1.0.4" "2.1.4" "1.6.2" "0.5.1"
## stringr dplyr purrr readr tidyr tibble ggplot2
## "1.4.1" "1.0.9" "0.3.4" "2.1.2" "1.2.0" "3.1.8" "3.4.0"
## tidyverse
## "1.3.2"
Importing the data
The survey data is released by Natural England as an Excel spreadsheet. The code below reads in the data, and then focuses on the variables needed in the analysis. The People and Nature Survey is broad in scope and only a few variables from it are needed, so many of the variables in the raw dataset are dropped to make the dataframe easier to work with.
file_path_prefix <- "../"
# read in the data
pan_raw <- readxl::read_xlsx(glue("{file_path_prefix}datasets/PANS_Y2_Q1_Q4.xlsx"),
sheet = "Data",
guess_max = 26000) %>% # to account for some
# sparse columns
janitor::clean_names() # standardise naming conventions
# focus in on the data to be used in this analysis
pan_focus <- pan_raw %>%
select(respondent_id, wave, weight_percent, any_visits_14, ethnicity,
         no_of_visits, gender, age_band)
Cleaning the data
This section of the code cleans the data. First, it identifies the pattern of missing data. We can see that there are three variables with missing data: any_visits_14, ethnicity and no_of_visits. It looks like when data is missing for any_visits_14 it is also missing for no_of_visits (a quick check of this follows the list below). This makes sense as:
- any_visits_14 records if the survey respondent has visited a green space in the 14 days preceding their participation in the survey;
- no_of_visits records how many times the respondent has visited a green space in the 14 days preceding their participation in the survey.
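As a quick check of this reading (a small addition, not in the original analysis), the snippet below counts rows where one of the two variables is missing but the other is not; both counts should be zero (or close to it) if the two variables always go missing together.
# count rows where only one of any_visits_14 / no_of_visits is missing
pan_focus %>%
  summarise(missing_visits_flag_only = sum(is.na(any_visits_14) & !is.na(no_of_visits)),
            missing_visit_count_only = sum(!is.na(any_visits_14) & is.na(no_of_visits)))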
# ------------------------------------------------------------------------------
# Check patterns of missing data
# ------------------------------------------------------------------------------
# visual overview
visdat::vis_miss(pan_focus)
In the code below, I check that a non-response (i.e. NA value) for any_visits_14 is independent of a non-response for ethnicity. A quick test suggests independence, which gives some reassurance that there is not a systematic bias at play, for example that people from a certain ethnic background are less likely to respond to questions about visiting green space.
# ------------------------------------------------------------------------------
# Check for independence
# ------------------------------------------------------------------------------
# count number of missing pieces of data for the two variables
missing <- pan_focus %>%
mutate(missing_any_visits = is.na(any_visits_14),
missing_ethnicity = is.na(ethnicity),
missing_both = is.na(any_visits_14) & is.na(ethnicity)) %>%
summarise(across(where(is.logical), list(sum = sum)))
# calculate how many observations would be expected to be missing both if
# the assumption of independence holds
prop_missing_any_visits <- missing$missing_any_visits_sum / nrow(pan_focus)
prop_missing_ethnicity <- missing$missing_ethnicity_sum / nrow(pan_focus)
expected_prop_missing_both <- prop_missing_any_visits * prop_missing_ethnicity
expected_num_missing_both <- round(expected_prop_missing_both * nrow(pan_focus))
actual_num_missing_both <- missing$missing_both_sum
# output expected (if independent) and actual number missing both
print(glue("If independent expected missing both: {expected_num_missing_both}"))
## If independent expected missing both: 136
print(glue("Actual number missing both: {actual_num_missing_both}"))
## Actual number missing both: 137
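As an additional cross-check (an aside, not part of the original analysis), a chi-squared test of independence on the two missingness indicators offers another way to probe the same question.
# cross-tabulate the two missingness indicators and test for independence;
# a large p-value is consistent with the missingness being independent
missing_tab <- table(is.na(pan_focus$any_visits_14),
                     is.na(pan_focus$ethnicity))
chisq.test(missing_tab)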
# they are independent, so proceed with dropping missing data
pan_clean <- pan_focus %>%
  na.omit() # removes about 10% of observations
Finally in this data cleaning section, I confirm that the respondent IDs (respondent_id) are unique. So, now we know that people are not participating in the survey multiple times.
# confirm respondent_id values are unique
assertthat::assert_that(length(unique(pan_focus$respondent_id)) == nrow(pan_focus))
## [1] TRUE
Exploring the data
First, I take a look at a summary of the descriptive stats for each variable. In the process I noticed that no_of_visits is of character type when it should be numeric. I'll come back to this below when I take a look at the no_of_visits variable in more detail.
# descriptive stats for all variables
skimr::skim(select(pan_clean, -respondent_id))
| Name | select(pan_clean, -respon… |
| Number of rows | 43717 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| wave | 0 | 1 | 17 | 24 | 0 | 24 | 0 |
| any_visits_14 | 0 | 1 | 9 | 17 | 0 | 4 | 0 |
| ethnicity | 0 | 1 | 5 | 36 | 0 | 7 | 0 |
| no_of_visits | 0 | 1 | 1 | 17 | 0 | 59 | 0 |
| gender | 0 | 1 | 4 | 24 | 0 | 3 | 0 |
| age_band | 0 | 1 | 3 | 5 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weight_percent | 0 | 1 | 1 | 0.68 | 0.14 | 0.57 | 0.81 | 1.22 | 10.42 | ▇▁▁▁▁ |
Next, I explore the data for the three main variables used in the analysis:
- ethnicity, which records the ethnic background of the survey respondent;
- any_visits_14, which records if the survey respondent has visited a green space in the 14 days preceding their participation in the survey;
- no_of_visits, which records how many times the respondent has visited a green space in the 14 days preceding their participation in the survey.
Before doing that, I quickly make some tweaks to the format of strings within the dataframe so that everything looks a bit neater when it appears in exploratory plots.
pan_clean <- pan_clean %>%
# to improve plot labelling
mutate(ethnicity = if_else(ethnicity == 'Any other ethnic group or background',
'Any other ethnic group\nor background',
                             ethnicity))
ethnicity: What is your ethnic background?
This chunk of code looks at the ethnicity variable, confirming that the frequency distribution, when grouping respondents by ethnic background, looks roughly as one would expect. See the plot (showing frequencies) and dataframe (showing proportions of the sample) below for details.
# prepare data for plotting
plot_df <- pan_clean %>%
count(ethnicity) %>%
# to order groups within the plot
mutate(ethnicity = fct_reorder(ethnicity, n))
# will be useful for ordering facets in a plot later on
ethnicity_levels <- rev(levels(plot_df$ethnicity))
# plot frequency distribution
ggplot(plot_df, aes(ethnicity, n)) +
geom_col(fill = colours[1], width = 0.8) +
geom_text(aes(label = n), nudge_y = 2000) +
coord_flip() +
  labs(y = "Number of survey respondents",
       x = NULL)
# format strings for table output
levels(plot_df$ethnicity) <- rev(str_replace(ethnicity_levels, "\n", " "))
# look at proportion of survey sample in each ethnicity group
plot_df %>%
mutate(prop = round(n / sum(n), 3)) %>%
  kable()
| ethnicity | n | prop |
|---|---|---|
| Any other ethnic group or background | 400 | 0.009 |
| Asian or Asian British | 2989 | 0.068 |
| Black or Black British | 1314 | 0.030 |
| Don’t know | 1 | 0.000 |
| Mixed | 1012 | 0.023 |
| Prefer not to say | 1 | 0.000 |
| White | 38000 | 0.869 |
The survey dataset from Natural England includes weighting factors, so we can look at how different the weighted group counts are from the raw group counts above. The plot below shows that the weighting does not make a huge difference to the group counts. So, in this notebook I'll be flexible about where and when to use the weighting factor, as this opens up the opportunity to conduct analysis on the raw data where it would be challenging to use the weighted data.
# -----------------------------------------------------------------------------
# compare weighted and unweighted counts
# -----------------------------------------------------------------------------
# unweighted and weighted group counts
ethnicity <- pan_clean %>%
count(ethnicity) %>%
mutate(weighted = FALSE)
ethnicity_wt <- pan_clean %>%
count(ethnicity, wt = weight_percent) %>%
mutate(weighted = TRUE)
# combine unweighted and weighted group counts for plotting
plot_df <- bind_rows(ethnicity, ethnicity_wt)
# produce the plot
ggplot(plot_df, aes(ethnicity, n,
fill = weighted)) +
geom_col(position = "dodge", width = 0.6) +
scale_fill_manual(values = colours) +
coord_flip() +
labs(x = NULL,
y= "Number of survey respondents",
       fill = "With weighting?") +
facet_wrap(~factor(ethnicity, ethnicity_levels), scales = "free") +
theme(axis.text.y = element_blank()) +
  style_facets()
any_visits_14: Have you visited green space in the last 14 days?
The code below calculates and shows the proportion of survey respondents giving each response to the question ‘Have you visited green space in the last 14 days?’. We can see in the chart and table that around 64% of respondents had made a visit to green space. Of the remainder, around 9% stated that they didn’t know whether they had made a visit or not.
# any_visits
any_visits <- pan_clean %>%
count(any_visits_14) %>%
mutate(weighted = FALSE,
prop = round(n / sum(n), 3))
# plot frequency distribution
ggplot(any_visits, aes(fct_reorder(any_visits_14, n),
prop)) +
geom_col(fill = colours[1], width = 0.6) +
geom_text(aes(label = prop), nudge_y = 0.03) +
coord_flip() +
  labs(y = "Proportion of survey respondents",
       x = NULL,
       subtitle = "Have you visited green space in the last 14 days?")
# show table including proportions
select(any_visits, -weighted) %>%
  kable()
| any_visits_14 | n | prop |
|---|---|---|
| Any visits | 28089 | 0.643 |
| Don’t know | 4121 | 0.094 |
| No visits | 11148 | 0.255 |
| Prefer not to say | 359 | 0.008 |
We can also compare the frequency of each survey response between the raw and weighted data. Again, for any_visits_14 the relatively small impact of applying the weighting factors suggests that we can be flexible about whether or not to use the weighting factors in the analysis.
# ----------------------------------------------------------------------------
# compare weighted and unweighted counts
# ----------------------------------------------------------------------------
any_visits_wt <- pan_clean %>%
count(any_visits_14, wt = weight_percent) %>%
mutate(weighted = TRUE)
plot_df <- bind_rows(any_visits, any_visits_wt)
ggplot(plot_df, aes(any_visits_14, n, fill = weighted)) +
geom_col(position = "dodge", width = 0.6) +
scale_fill_manual(values = colours) +
labs(x = NULL,
y= "Number of survey respondents",
       fill = "With weighting?") +
coord_flip() +
facet_wrap(~any_visits_14, scales = "free") +
style_facets() +
  theme(axis.text.y = element_blank())
no_of_visits: How often have you visited green space in the last 14 days?
Finally in this exploratory data analysis section of the notebook, I am going to look at the distribution of the no_of_visits variable. But first I need to address the issue of this variable being a character type when I would expect a numeric one. The first step is to look at all the values of no_of_visits, where we can see that there are some text responses (“Don’t know” and “Prefer not to say”). So, I am going to remove these responses, because it is not clear that either could simply be replaced by a 0. Also, I am keen to see some initial analysis results before I invest too much time in data preprocessing (inc. missing data imputation).
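Before dropping these responses, it is worth knowing how many are affected. The quick count below is a small addition to the original code; it flags every value of no_of_visits that cannot be parsed as a number.
# count the text-based responses that will be dropped
pan_clean %>%
  filter(is.na(suppressWarnings(as.numeric(no_of_visits)))) %>%
  count(no_of_visits)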
# check why data has been read in as character
unique(pan_clean$no_of_visits) # "Don't know" etc. values
## [1] "0" "12" "14"
## [4] "1" "Don’t Know" "4"
## [7] "2" "10" "8"
## [10] "3" "6" "7"
## [13] "19" "5" "Prefer not to say"
## [16] "28" "9" "25"
## [19] "43" "15" "13"
## [22] "20" "11" "22"
## [25] "21" "18" "16"
## [28] "24" "42" "23"
## [31] "55" "50" "17"
## [34] "36" "35" "30"
## [37] "52" "34" "26"
## [40] "40" "45" "71"
## [43] "44" "56" "29"
## [46] "62" "46" "27"
## [49] "33" "100" "69"
## [52] "48" "31" "32"
## [55] "54" "60" "37"
## [58] "89" "41"
# remove text based responses
pan_clean_no_of_visits <- pan_clean %>%
filter(!(no_of_visits %in% c("Don’t Know", "Prefer not to say"))) %>%
  mutate(no_of_visits = as.numeric(no_of_visits))
Having dealt with the missing data issue, the plot below shows the frequency distribution for no_of_visits. The top part shows the full distribution, which is heavily right skewed, as some respondents have reported up to 100 visits to green space in the last 14 days. The bottom part zooms in to focus on responses of up to 20 visits in the last 14 days to provide a little more detail.
p <- pan_clean_no_of_visits %>%
# calculate frequencies
count(no_of_visits) %>%
# produce plot
ggplot(aes(no_of_visits, n)) +
geom_col(fill = colours[1]) +
labs(x = "Number of visits to green space in the last 14 days",
subtitle = "Number of survey respondents",
y = NULL) +
ggforce::facet_zoom(x = no_of_visits < 20)
p
Also, here are the descriptive stats for the no_of_visits variable.
s <- skim(select(pan_clean_no_of_visits,
no_of_visits))
s
| Name | select(pan_clean_no_of_vi… |
| Number of rows | 39237 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| no_of_visits | 0 | 1 | 4.15 | 5.28 | 0 | 0 | 2 | 6 | 100 | ▇▁▁▁▁ |
Analyzing the data
Having explored the data and become familiar with it, I had two main objectives for the analysis:
- To understand if people from a white (i.e. majority) ethnic background are more likely to have visited green space in the last 14 days than people from a minority ethnic background.
- To understand if people from different ethnic backgrounds visit green space with different frequencies. In other words, do people with certain ethnic backgrounds tend to make more or fewer visits to green spaces?
So, the analysis section of this notebook is split into two sub-sections, one addressing each of these two objectives. In both cases I am analyzing the survey sample to infer something about the wider population (adults in England).
Have you visited green space in the last 14 days?
In this section of the notebook I try to answer the question: do the proportions of people who have visited green space in the last 14 days differ between people from an ethnic minority background and people from a white ethnic background?
I am going to use a hypothesis test for the difference between two proportions to assess whether we can be confident that there is a difference in the wider population. But first, let's confirm that we see a difference within the survey sample itself.
The code below calculates the number of survey respondents in each group, for both the raw data and the data with the weighting applied.
# ---------------------------------------------------------------------------------
# Prepare the data
# ---------------------------------------------------------------------------------
diff_in_props_df <- pan_clean %>%
# remove ambiguous answers
filter(ethnicity != "Don't know", ethnicity != "Prefer not to say") %>%
filter(any_visits_14 != "Don't know", any_visits_14 != "Prefer not to say") %>%
mutate(ethnicity = if_else(ethnicity == "White", ethnicity, "Minority Ethnic Background"))
# ---------------------------------------------------------------------------------
# simulate a representative sample based on the survey weightings
# ---------------------------------------------------------------------------------
set.seed(1683763) # so we get the same sample each time the code block runs
# what would group proportions look like if we applied the weighting factor
group_props_weighted <- diff_in_props_df %>%
  count(ethnicity, any_visits_14,
        wt = weight_percent) %>%
  mutate(prop = n / sum(n))
num_resp <- nrow(diff_in_props_df)
sim_diff_in_props_df <- group_props_weighted %>%
# simulate respondents based on weighted group proportions
slice_sample(weight_by = prop, n = num_resp, replace = TRUE) %>%
select(-c(n, prop))
sim_diff_in_props_counts <- sim_diff_in_props_df %>%
# calculated proportions based on simulation of a sample using weightings
count(ethnicity, any_visits_14) %>%
mutate(prop_weighted = round(n / sum(n), 3)) %>%
rename(n_weighted = n)
# calculate actual proportions in the raw data for comparison
acutal_diff_in_props_counts <- diff_in_props_df %>%
  count(ethnicity, any_visits_14) %>%
  mutate(prop = round(n / sum(n), 3))
The table below compares the unweighted and weighted samples. We can see that simulating a sample (using the weighting factor) results in counts similar to those from the actual survey sample. However, for the sake of completeness I use the weighted sample in this part of the analysis. Note: In the table below prop refers to the proportion of the overall sample.
acutal_diff_in_props_counts %>%
left_join(sim_diff_in_props_counts) %>%
  kable()
## Joining, by = c("ethnicity", "any_visits_14")
| ethnicity | any_visits_14 | n | prop | n_weighted | prop_weighted |
|---|---|---|---|---|---|
| Minority Ethnic Background | Any visits | 3284 | 0.084 | 3311 | 0.084 |
| Minority Ethnic Background | No visits | 1547 | 0.039 | 1565 | 0.040 |
| White | Any visits | 24804 | 0.632 | 24330 | 0.620 |
| White | No visits | 9601 | 0.245 | 10030 | 0.256 |
We are actually interested in what proportion of each of the two groups (Minority Ethnic Background and White) has made any visits to green space in the 14 days preceding participation in the survey. The table below confirms, that for the sample at least, there is a difference in the proportion visiting across the groups.
sim_diff_in_props_df %>%
count(ethnicity, any_visits_14) %>%
group_by(ethnicity) %>%
mutate(prop_of_ethnicity_group = n / sum(n)) %>%
  kable()
| ethnicity | any_visits_14 | n | prop_of_ethnicity_group |
|---|---|---|---|
| Minority Ethnic Background | Any visits | 3311 | 0.6790402 |
| Minority Ethnic Background | No visits | 1565 | 0.3209598 |
| White | Any visits | 24330 | 0.7080908 |
| White | No visits | 10030 | 0.2919092 |
So, now we can move on to the formal statistical testing. This will involve assessing whether we can infer from our weighted sample that there is a difference in the proportions of people from white and minority ethnic backgrounds visiting green spaces in the wider population (adults in England). The first thing we need to do is set up the hypotheses to be tested.
- H0: The proportions of people from white and minority ethnic backgrounds visiting green spaces are the same.
- HA: The proportions of people from white and minority ethnic backgrounds visiting green spaces are not the same.
I am going to use a 99% confidence level within this hypothesis test.
I also use a computational approach implemented within the
infer package (part of TidyModels). See here
for more details.
The first step is to calculate the observed statistic, the difference in proportions in the sample at hand (referred to as d_hat in the code chunk below). Then we simulate the null distribution for the difference in proportions. This is the sampling distribution we would expect if H0 is correct. In practice this is done by repeating this process: randomly shuffling values in the any_visits_14 column, while holding the ethnicity column fixed, and then calculating the difference in the proportions each time. This is because shuffling in this way would be of no consequence if the null hypothesis holds (i.e. ethnicity and any_visits_14 are independent).
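For intuition, here is a minimal sketch (not part of the original analysis) of what a single permutation replicate looks like, using a base R shuffle of the weighted sample created above.
# shuffle the any_visits_14 responses while holding ethnicity fixed,
# then recompute the difference in proportions; generate(type = "permute")
# automates many repetitions of exactly this step
one_perm <- sim_diff_in_props_df %>%
  mutate(any_visits_14 = sample(any_visits_14)) %>%
  group_by(ethnicity) %>%
  summarise(prop_visits = mean(any_visits_14 == "Any visits"), .groups = "drop")
one_perm$prop_visits[one_perm$ethnicity == "Minority Ethnic Background"] -
  one_perm$prop_visits[one_perm$ethnicity == "White"]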
set.seed(16833) # so we get the same results from the permuation test each time
# calculate the observed difference in the proportions of people with and without
# Minority Ethnic backgrounds who have visited green space in the last 14 days
d_hat <- sim_diff_in_props_df %>%
specify(any_visits_14 ~ ethnicity, success = "Any visits") %>%
calculate(stat = "diff in props", order = c("Minority Ethnic Background",
"White"))
# generate the null distribution
null_dist <- sim_diff_in_props_df %>%
specify(any_visits_14 ~ ethnicity, success = "Any visits") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("Minority Ethnic Background",
                                              "White"))
Next we compare the observed statistic (difference in proportions) to the null distribution. We can see that the observed statistic (the red line) falls outside the range of the null distribution.
label_text <- "Red line shows\ndifference in proportions\nin the actual survey sample"
# visualise the null distribution and the observed difference
visualise(null_dist) +
shade_p_value(obs_stat = d_hat, direction = "two-sided") +
geom_label(x = -0.0185, y = 100,
label = label_text,
colour = "red") +
labs(x = "Difference in proportions between groups",
       y = "Number of simulations")
This is confirmed by calculating the p-value associated with the observed statistic. This is approximately 0, so we can reject H0: there is statistical evidence that the proportions of people from white and minority ethnic backgrounds visiting green spaces are different.
null_dist %>%
get_p_value(obs_stat = d_hat, direction = "two-sided") %>%
  kable()
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
| p_value |
|---|
| 0 |
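As an aside (not part of the original analysis), a classical two-sample test of proportions on the weighted group counts reported earlier in this section offers another route to the same question.
# two-sample test for equality of proportions, using the weighted counts
# from the earlier table: 3311 of (3311 + 1565) minority ethnic respondents
# and 24330 of (24330 + 10030) white respondents reported any visits
prop.test(x = c(3311, 24330),
          n = c(3311 + 1565, 24330 + 10030),
          conf.level = 0.99)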
Below I output the observed difference in proportions and the confidence interval (at a 99% confidence level) to show the size of the difference. The fact that the observed difference and confidence interval are negative indicates that the proportion of people visiting green spaces is lower for people from a minority ethnic background than for people from a white background.
bootstrap_dist <- sim_diff_in_props_df %>%
specify(any_visits_14 ~ ethnicity, success = "Any visits") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in props", order = c("Minority Ethnic Background",
"White"))
# get confidence intervals
percentile_ci <- get_ci(bootstrap_dist, level = 0.99)
# output and visualise confidence intervals
percentile_ci %>%
mutate(observed_diff_in_proportions = d_hat$stat) %>%
relocate(observed_diff_in_proportions) %>%
kable()| observed_diff_in_proportions | lower_ci | upper_ci |
|---|---|---|
| -0.0290506 | -0.0480605 | -0.0103446 |
Finally in the analysis of the any_visits_14 survey response, we can visualize the confidence interval over the bootstrap distribution. The latter simulates the sampling distribution we would expect based on the survey sample at hand. We can see that the confidence interval does not contain 0, hence again confirming that H0 can be rejected.
visualize(bootstrap_dist) +
shade_confidence_interval(endpoints = percentile_ci) +
geom_label(x = -0.0185, y = 150,
label = "99% CI\nin blue",
colour = "grey20") +
labs(x = "Difference in proportions between groups",
       y = "Number of simulations")
How often have you visited green space in the last 14 days?
Moving on now to the second part of the analysis, which compares green space visiting behavior between groups of people with different ethnic backgrounds. Specifically, the analysis compares the number of visits made to green space in the 14 days preceding participation in the survey (no_of_visits) across groups of people with different ethnic backgrounds. The groups, as defined within the survey data, are:
- Asian or Asian British;
- Black or Black British;
- Mixed;
- White;
- and, any other ethnic group or background.
The first thing to do is to look at the mean number of visits (no_of_visits) for each group in the raw survey sample data. The table below suggests there may well be some differences in the means between groups.
raw_means_comp <- pan_clean_no_of_visits %>%
# remove outliers
  # assume that visiting green space more than twice per day
# i.e. 28 times in the last 14 days is an outlier
filter(no_of_visits <= 28) %>%
# remove ambiguous survey responses
filter(!(ethnicity %in% c("Don’t know", "Prefer not to say")))
skim_df <- raw_means_comp %>%
select(ethnicity, no_of_visits) %>%
  # for output table formatting
mutate(ethnicity = str_replace(ethnicity, '\n', ' ')) %>%
# for group-wise summary statistics
group_by(ethnicity) %>%
skim()
# reorder columns for readability
raw_means <- skim_df %>%
select(skim_variable:numeric.sd) %>%
rename(mean = numeric.mean,
sd = numeric.sd) %>%
relocate(mean, .after = ethnicity)
# output to notebook
kable(raw_means)
| skim_variable | ethnicity | mean | n_missing | complete_rate | sd |
|---|---|---|---|---|---|
| no_of_visits | Any other ethnic group or background | 3.406728 | 0 | 1 | 4.306831 |
| no_of_visits | Asian or Asian British | 2.886582 | 0 | 1 | 3.534872 |
| no_of_visits | Black or Black British | 2.623886 | 0 | 1 | 3.793191 |
| no_of_visits | Mixed | 3.610660 | 0 | 1 | 4.311962 |
| no_of_visits | White | 4.173403 | 0 | 1 | 4.854922 |
We can also take a look at a violin plot to get a sense of the differences in the distributions across groups.
# create plot
raw_means_comp %>%
ggplot(aes(no_of_visits, ethnicity)) +
geom_violin() +
labs(y = NULL,
x = "Number of Visits to Green Space in the last 14 days",
       subtitle = "Probability densities")
Given there are some big differences between the number of people in each of the ethnicity groups, I wanted to confirm that this wasn't having too much of an impact on the estimation of the group means. To do this I used a marginal means calculation, which essentially models the means using linear regression. See here for more details. See the table below for the results.
# create model
model <- lm(no_of_visits ~ ethnicity,
data = raw_means_comp)
# calculate marginal means
marginal_means <- estimate_means(model, ci = 0.99)
## We selected `at = c("ethnicity")`.
# format marginal means for notebook output
marginal_means %>%
format_ethnicity_str() %>%
  kable()
| ethnicity | Mean | SE | CI_low | CI_high |
|---|---|---|---|---|
| White | 4.173403 | 0.0255853 | 4.107497 | 4.239310 |
| Asian or Asian British | 2.886582 | 0.0946797 | 2.642691 | 3.130472 |
| Mixed | 3.610660 | 0.1612756 | 3.195222 | 4.026099 |
| Black or Black British | 2.623886 | 0.1414418 | 2.259538 | 2.988234 |
| Any other ethnic group or background | 3.406728 | 0.2619994 | 2.731829 | 4.081626 |
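As an aside (not part of the original analysis), with a single categorical predictor the marginal means are simply the model's fitted values for each group, so they can be reproduced directly with predict().
# predicted mean number of visits for each ethnicity group from the linear
# model; these correspond to the Mean column in the table above
predict(model,
        newdata = data.frame(ethnicity = sort(unique(raw_means_comp$ethnicity))))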
Comparing the raw means for each group with the marginal means for each group, we can see that they are pretty similar.
# format ethnicity strings for plotting
raw_means <- raw_means %>%
mutate(ethnicity = if_else(ethnicity == 'Any other ethnic group or background',
'Any other ethnic group\nor background',
ethnicity))
# create plot
ggplot() +
# raw means
geom_point(data = raw_means,
mapping = aes(mean, ethnicity),
colour = "red",
position = position_nudge(y = -0.1)) +
# marginal means
geom_pointrange(data = marginal_means,
mapping = aes(Mean, ethnicity,
xmin = CI_low, xmax = CI_high),
position = position_nudge(y = 0.1)) +
# tidy up text
labs(subtitle = "Red points are raw means.
Black point-lines show marginal means
plus confidence intervals",
y = NULL,
       x = "Mean number of visits in the last 14 days")
I continue to use the marginal means, as this allows us to test the statistical significance of the differences in means between groups using contrast analysis. In the table below we can see results for the full list of inter-group comparisons.
contrasts <- estimate_contrasts(model, contrast = "ethnicity")
contrasts_full_res <- contrasts %>%
# format strings for table output
mutate(across(where(is.character), str_replace, '\n', ' ')) %>%
mutate(across(where(is.numeric),
~round(., 3))
)
contrasts_full_res %>%
  kable()
| Level1 | Level2 | Difference | CI_low | CI_high | SE | df | t | p |
|---|---|---|---|---|---|---|---|---|
| Asian or Asian British | Any other ethnic group or background | -0.520 | -1.302 | 0.262 | 0.279 | 39101 | -1.867 | 0.335 |
| Asian or Asian British | Black or Black British | 0.263 | -0.215 | 0.740 | 0.170 | 39101 | 1.543 | 0.534 |
| Asian or Asian British | Mixed | -0.724 | -1.249 | -0.199 | 0.187 | 39101 | -3.872 | 0.001 |
| Black or Black British | Any other ethnic group or background | -0.783 | -1.619 | 0.053 | 0.298 | 39101 | -2.629 | 0.065 |
| Mixed | Any other ethnic group or background | 0.204 | -0.660 | 1.068 | 0.308 | 39101 | 0.663 | 0.964 |
| Mixed | Black or Black British | 0.987 | 0.385 | 1.589 | 0.215 | 39101 | 4.600 | 0.000 |
| White | Any other ethnic group or background | 0.767 | 0.028 | 1.506 | 0.263 | 39101 | 2.912 | 0.030 |
| White | Asian or Asian British | 1.287 | 1.012 | 1.562 | 0.098 | 39101 | 13.121 | 0.000 |
| White | Black or Black British | 1.550 | 1.146 | 1.953 | 0.144 | 39101 | 10.780 | 0.000 |
| White | Mixed | 0.563 | 0.104 | 1.021 | 0.163 | 39101 | 3.446 | 0.005 |
Focusing in on where there are significant differences between the means (i.e. p-values less than 0.01, as we are working at a 99% confidence level), we get the table and lighthouse plot below.
These are showing that there are significant differences in the
average number of visits to green space (no_of_visits)
between the following groups:
- Asian or Asian British and Mixed
- Mixed and Black or Black British
- White and each of Asian or Asian British, Black or Black British and Mixed
So, we can be confident that we will see similar differences in the general population, assuming that our survey sample is not systematically biased in some way.
contrasts_full_res %>%
filter(p < 0.01) %>%
  kable()
| Level1 | Level2 | Difference | CI_low | CI_high | SE | df | t | p |
|---|---|---|---|---|---|---|---|---|
| Asian or Asian British | Mixed | -0.724 | -1.249 | -0.199 | 0.187 | 39101 | -3.872 | 0.001 |
| Mixed | Black or Black British | 0.987 | 0.385 | 1.589 | 0.215 | 39101 | 4.600 | 0.000 |
| White | Asian or Asian British | 1.287 | 1.012 | 1.562 | 0.098 | 39101 | 13.121 | 0.000 |
| White | Black or Black British | 1.550 | 1.146 | 1.953 | 0.144 | 39101 | 10.780 | 0.000 |
| White | Mixed | 0.563 | 0.104 | 1.021 | 0.163 | 39101 | 3.446 | 0.005 |
plot(filter(contrasts, p < 0.01), estimate_means(model)) +
coord_flip() +
labs(x = NULL,
       y = "Mean number of visits in the last 14 days")
## We selected `at = c("ethnicity")`.
If I was extending this analysis
If I was spending more time on this analysis, I would explore the validity of some of the key assumptions that I have made so far. This would include testing whether the following are valid assumptions:
- That we can use the unweighted data in comparisons between groups.
- That the relationships identified hold across the different rounds of the survey. Did people participating in the survey on different dates make any difference? (A rough sketch of how this check might begin is given below.)
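As a rough sketch (not part of the original analysis) of how that second check might begin, and assuming the wave column distinguishes survey rounds, one could compare the proportion of each group reporting any visits by wave:
# proportion of each group reporting any visits, broken down by survey wave
pan_clean %>%
  filter(any_visits_14 %in% c("Any visits", "No visits")) %>%
  filter(!str_detect(ethnicity, "know|say")) %>% # drop ambiguous responses
  mutate(ethnicity = if_else(ethnicity == "White",
                             "White", "Minority Ethnic Background")) %>%
  group_by(wave, ethnicity) %>%
  summarise(prop_any_visits = mean(any_visits_14 == "Any visits"),
            n = n(),
            .groups = "drop")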