Marine Environment analysis

We will look to see if there is any evidence in the sample data that use of the marine gym makes a difference to scores on the wellness, nr6 and nep scales.

In doing this we will acknowledge that the data is ordinal in character but nevertheless use a repeated measures ANOVA test coupled with pairwise Mann-Whitney-Wilcox tests. There is substantial support in the literature for this approach.

Inferences from this procedure are compared with inferences made when confidence intervals on estimates of the difference made on each scale are estimated using the bootstrap method. In that method no assumptions are made of the data other than that it is at least ordinal.

Preliminaries

Load packages

library(tidyverse)
library(here)
library(readxl)
library(kableExtra)
library(rstatix)

Load data

filepath<-here("data","ailish.xlsx")
all_data<-read_excel(filepath,"original")
wellness<-read_excel(filepath,"wellness")
nr6<-read_excel(filepath,"nr6")
nep<-read_excel(filepath,"nep")

Prepare the data

In any study, this is often the most time-consuming part of the analysis.

Functions to replace verbal responses with numerical responses

ans_frequency<-function(x) {
  x_trim<-str_trim(str_squish(x),side="both") # remove any leading or trailing white space, plus any double spaces.
  case_when(
  x_trim=="None the Time" ~ 1,
  x_trim=="Rarely" ~ 2,
  x_trim=="Some of the Time" ~ 3,
  x_trim=="Often" ~ 4,
  x_trim=="All of the Time" ~ 5
)
}

ans_agreement<-function(x) {
  x_trim<-str_trim(str_squish(x),side="both")
  case_when(
  x_trim=="Strongly Disagree" ~ 1,
  x_trim=="Disagree" ~ 2,
  x_trim=="Neither Agree or Disagree" ~ 3,
  x_trim=="Agree" ~ 4,
  x_trim=="Strongly Agree" ~ 5
)
}

Summary function, with verbal answers replaced by numbers, 1 - 5.

This will produce a tidy data set for each scale, with total, mean and median scores for each participant.

make_summary<-function(scale,data_df,n_questions,response_fun){
  response_cols<-LETTERS[1:n_questions]
  data_summary<- data_df %>% 
    mutate(across(all_of(response_cols),response_fun)) %>% # convert verbal responses to numbers
    rowwise() %>% 
    summarise(id=as.factor(id),
              marine=as.factor(BG_User),
              frequency=Frequency,
              total_score=sum(c_across(all_of(response_cols))),
              mean_score=mean(c_across(all_of(response_cols))),
              median_score=median(c_across(all_of(response_cols)))) %>%
    mutate(scale=as.factor(rep(scale,nrow(data_df)))) %>% # which scale was used
    select(id,scale,marine,frequency,total_score,mean_score,median_score)
  return(data_summary)
}

Make summaries for each scale

wellness_summary <- make_summary("wellness",wellness,13,ans_frequency)
nr6_summary <- make_summary("nr6",nr6,6,ans_agreement)
nep_summary <- make_summary("nep",nep,15,ans_agreement)

Combine summaries for each scale into one big tidy dataset

overall_summary<-wellness_summary %>%
  add_row(nr6_summary) %>%
  add_row(nep_summary)

Summary table

Let us look at summaries of the data, by gym use and by scale

overall_summary %>%
  group_by(scale,marine) %>%
  summarise(n=n(),mean.score=mean(total_score),
            se.score=sqrt(var(total_score)/n()),
            median.score=median(total_score)) %>%
  kbl(digits=2) %>%
  kable_styling(full_width=F)

scale	marine	n	mean.score	se.score	median.score
wellness	Yes	288	46.00	0.42	47.0
wellness	No	86	41.81	0.93	42.5
nr6	Yes	288	25.44	0.20	26.0
nr6	No	86	23.48	0.47	24.0
nep	Yes	288	50.24	0.23	50.0
nep	No	86	50.90	0.38	51.0

The means and standard errors might suggest that there is evidence for a significant increase in score if the gym is used and that score is measured using the wellness or nr6 scales, but not if the nep scale is used.

The difference made does not look not very great however (but what do I know?). It is about 10%. Be careful to note results that are statistically significant but not of practical importance. With very large samples it is possible to get results that are statistically significant but not practically important.

Whether the differences observed here, if real, are of practical importance is for a subject specialist (you!) to decide.

A bigger issue however is that the data are ordinal, so that we should not really be taking means and calculating standard errors. There is a big debate in the statistics literature about this however, so we will not dismiss them entirely, but we will look at the median scores for each scale.

The total median score is higher for gym users if the wellness and nr6 scales are used, but not if the nep scale is used. In other words, the same conclusion as when we looked at the means and standard errors. As we go on, we will see this pattern again and again with this data - the conclusion does not seem to alter whether we use parametric or non-parametric methods of analysis.

One thing lacking in this table is any idea of the precision of the medians, or in particular of the difference in medians between the gym users and the non-gym users. We need this if are to be able to make inferences to the wider population from what we observe to be true for the sample.

What we have here are our estimates of what the median scores would be among the wider population of people ‘out there’. Our null hypothesis might be that gym use makes no difference to median score by a particular scale. If we find a difference in our sample, this is our estimate of the difference in the score among the wider population. We need to know the precision in this estimate in order to know whether to reject the null hypothesis or not.

That precision is best expressed by calculating a 95% confidence interval around our estimate. If the confidence interval for the difference in median scores does not encompass zero then we can reject the null hypothesis. If it does, we can’t. There is a technique called the bootstrap which we will use later that calculates this interval while making minimal assumptions about the data.

Plot the data

Let us plot the data using a box and whisker plot. Does this confirm what the table appears to be telling us?

overall_summary %>%
  ggplot(aes(x=scale,y=mean_score,fill=scale,alpha=0.3)) +
  geom_boxplot() +
  geom_jitter(aes(colour=scale),width=0.2) +
  #geom_histogram(binwidth=0.2) +
  labs(x="Scale",
       y="Mean Score") +
  facet_wrap(~marine) +
  #facet_grid(vars(marine), vars(scale)) +
  theme_classic() +
  theme(legend.position = "none")

If there are any differences made by gym use, they appear to be small compare to the spread of values within each set of conditions.

We now do statistical tests to get a more objective answer.

Friedman repeated measures test

The study design is a repeated measures design where each individual is measured by each scale, but there is an additional between-subject variable that is whether they used the gym or not. They either did or they did not.

For ordinal data the usual test for a repeated measures design is the Friedman test but this, as far as I can tell, can only be used with within-sample variables (scale, in our case) and does not allow for inclusion of between-subject variables.

Here it is being used:

library(rstatix)
res.fried <- overall_summary %>% friedman_test(mean_score ~ scale  | id )
res.fried

This tells us that the score is affected by the scale used, but takes no account of gym use and so is not much help to us.

I may be wrong about the impossibility of including gym use as a between-subject variable. Research continues….what can you find on this?

Meanwhile….

Pair-wise comparisons.

Instead we can do pairwise comparisons for each scale between those who used the gym and those who did not. We use the Bonferroni correction to compensate for multiple testing. This works here (I think) by multiplying the calculated p-value by the number of tests carried out, so that it is harder for a test to give a result that is ‘significant’. You need to do something like this whenever you do multiple pairwise tests or the possibility of a Type One error (ie spotting an effect when there isn’t one, ie getting false positives) rapidly becomes very large.

Non-parametric

First we use the non-parametric Mann-Whitney (Wilcox) test, since we have ordinal data.

# pairwise comparisons
pwc <- overall_summary %>%
  group_by(scale) %>%
  wilcox_test(total_score ~ marine, paired = FALSE, p.adjust.method = "bonferroni")
pwc

The data provide evidence that the use of the marine gym does affect scores in the wellness and nr6 scales (p<0.001, n=288,86), but provides no evidence of such an effect on the score using the nep scale.

Parametric

Anticipating our discussion below about the use of parametric tests with Likert scale data, we also carry out the same pairwise tests but now using the parametric t-test.

# pairwise comparisons - parametric
pwc <- overall_summary %>%
  group_by(scale) %>%
  t_test(total_score ~ marine, paired = FALSE, p.adjust.method = "bonferroni")
pwc

The result is the same as given by the Mann-Whitney tests.

Two-way repeated measures ANOVA

Back to trying to do a holistic analysis a la Friedman. The Friedman test is for repeated measures designs, but only for one-way designs. If we can’t use that, is there a parametric test we can use? Yes there is, it is the 2-way repeated measures ANOVA. But wait, I thought we werent supposed to use that since we have ordinal data? Read on…

There is much empirical evidence to suggest that it is robust to use parametric analysis schemes for Likert scales, comprised of many outcomes of individual Likert questions. See, for example, these two papers:

de Winter, J. F. C., Dodou, D. (2010). Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon (Addendum added October 2012). Practical Assessment, Research, and Evaluation, 15, 1–16. https://doi.org/10.7275/bj1p-ts64
Norman, G. (2010). Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education, 15(5), 625–632. https://doi.org/10.1007/s10459-010-9222-y

This is welcome news, since parametric tests have greater power, in general, than non-parametric tests. That means they are better able to detect differences if any exist.

Despite what the papers cited above tell us, let us just check how far off our data are from being normally distributed, which is one of the conditions that data are supposed to satisfy before parametric tests such as ANOVAs are used:

Normality check: quantile-quantile plots

overall_summary %>%
  ggplot(aes(sample=mean_score,colour=scale)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~marine) +
  #facet_grid(vars(marine), vars(scale)) +
  theme_classic()

Normally distributed data should give plots that are approximately straight. These plots are really not too far off that! This is doubly good news since if we analyse our data using some kind of ANOVA we are looking for evidence for a difference between the mean values of the scores for each set of conditions. This only makes sense if the mean values are ‘typical’ for each dataset, which they would not be if the data were highly skewed or bimodal. Since our data are not too far off being normally distributed, the means under each set of conditions will be typical of the rest of the data, and so we should be good to go in using an ANOVA. Hooray!

Accordingly, we will analyse our data using a two-way repeated measures ANOVA scheme.

We use the anova_test() function from the rstatix package, because it is easier to understand what each argument is for than in other versions of ANOVA that are available in R.

These are the arguments:

dv denotes the dependent variable, which is mean_score in this case.
scale is a within-subject variable since every individual was measured on each of the scales.
marine is a between-subject variable since each individual either went to the gym or they did not.
id is the unique identifier for each individual.

res.aov<-overall_summary %>% anova_test(dv = median_score,
                               wid=id,
                               within = scale,
                               between = marine)
res.aov

## ANOVA Table (type III tests)
## 
## $ANOVA
##         Effect DFn DFd      F        p p<.05   ges
## 1       marine   1 372  9.594 2.00e-03     * 0.010
## 2        scale   2 744 87.526 7.29e-35     * 0.124
## 3 marine:scale   2 744  8.736 1.78e-04     * 0.014
## 
## $`Mauchly's Test for Sphericity`
##         Effect     W     p p<.05
## 1        scale 0.984 0.052      
## 2 marine:scale 0.984 0.052      
## 
## $`Sphericity Corrections`
##         Effect   GGe      DF[GG]    p[GG] p[GG]<.05  HFe       DF[HF]    p[HF]
## 1        scale 0.984 1.97, 732.4 2.30e-34         * 0.99 1.98, 736.26 1.57e-34
## 2 marine:scale 0.984 1.97, 732.4 1.95e-04         * 0.99 1.98, 736.26 1.89e-04
##   p[HF]<.05
## 1         *
## 2         *

The overall picture from the ANOVA, whether we use the collections of medians or means, is that both scale and marine have a main effect, but so also does their interaction. In other words, the score of an individual is affected by which scale is used, but the size of this effect is in turn affected by whether or not the individual uses the marine gym.

This is in accordance with the pair-wise Mann-Whitney-Wilcox tests which provided evidence that gym use has an effect on the scores in the Wellness and nr6 scales but not for whether it does if the nep scale is used. In other words, the effect of gym use on score depends on the scale used.

What the effect is, up, down, this scale, that scale is not revealed by the ANOVA, but we can go back to the original estimates of the scores for gym users and non-gym users plus the Mann-Whitney tests to see where and in what direction tose diffdrences were.

The Mauchly test for Sphericity is a multi-dimensional version of the tests for constant variance that is used to validate repeated measures ANOVA calculations. If we use the median scores we pass this test (p>0.05), if we use mean scores we do not. Either way, a ‘sphericity correction’ can be applied to the p-values for the main effects and the interaction, given the outcome of this test.

All told, the outcome of the sphericity tests and the size of the corrections that need to be applied given thier outcome suggests that is not unreasonable to be using a repeated measures ANOVA with this data.

The final table of _p_values after applying this correction is presented below.

get_anova_table(res.aov, correction = "HF")

Nothing substantive is changed by applying the corrections. All p values are less than 0.001, so there is evidence of a main effect of both scale and gym use (marine), and also of an interaction between them.

This conclusion is not changed, whether we use mean scores or median scores.

The bootstrap

While the repeated measures ANOVA ‘works’ in that it gives results that are consistent with the non-parametric Mann-Whitney-Wilcox tests, it still does feel insecure to be using ordinal data in a procedure designed for interval (ie numeric) data. Further, the Mann-Whitney-Wilcox test uses only the ranks of the data, and not the actual scores, so some information is lost. That means it has low power - it might not spot an effect (eg of gym use on score) that did actually exist. So although it didn’t see evidence for a difference if the nr6 scale was used, it might be mistaken.

There is an alternative approach that is conceptually simpler than eoither the ANOVA or the Mann-Whitney-Wilcox tests, works perfectly well with ordinal data and is now in widespread use in clinical settings and elsewhere where ordinal data such as pain scores are common. Unlike the ANOVA test it makes no assumptions about the distribution or variance of the data, and unlike the Mann-Whitney-Wilcox test it does not throw away the information of the actual scores.

It is called the bootstrap. Given the estimates from our sample of the differences made by gym use to scores on each of the scales, it calculates the 95% confidence interval for these estimate. If this interval does not encompass zero for a given scale we can reject the null hypothesis that gym use makes no difference to score on that scale. If it does, we can’t.

What you do is create multiple new samples of scores by drawing from your original sample with replacement (that is, any of the original sample may be drawn once, more than once or not at all).

For each scale, for example, we take the scores from that scale and separate these scores into the Yes and No camps. Fo the wellness scale, you end up with 288 Yes scores and 86 No scores. We calcuate our estimate of the difference made by gym use to score by taking the difference between the median total scores of the Yes and No camps. We treat those 288 Yes scores and 86 No scores as our original samples and from them create a a large number (2000 seems to be enough) of ‘bootstrap’ copies of them. For each pair of copies we find the difference between the medians.

Finally we end up with 2000 differences between the medians of the Yes and No bootstrap copies.

Remember that the null hypothesis here is that there is no difference between the medians ie use of the marine gym makes no difference to the score on whichever scale we are investigating.

From our large collection of bootstrapped differences we can use the so-called percentile method to find a 95% confidence interval for what the difference actually is. (You just put the data in order and find the 2.5% and 97.5% percentiles). If this interval does not encompass zero then we can reject the null hypothesis at the 95% confidence level. If it does, we can’t.

# bootstrap function

bs<-function(overall_summary,which_scale,N,fun){
  
  # create vector of total scores of the Yes people
  sampleYes<-overall_summary %>%
    filter(scale==which_scale) %>%
    filter(marine=="Yes") %>% 
    select(total_score) %>% 
    pull()
  
  # create vector of total scores of the No people
  sampleNo<-overall_summary %>%
    filter(scale==which_scale) %>%
    filter(marine=="No") %>% 
    select(total_score) %>% 
    pull()
  
  estimator<-fun(sampleYes) - fun(sampleNo)
  
  # how many of each
  nYes<-length(sampleYes)
  nNo<-length(sampleNo)
  
  # create bootstrap sample of difference in the medians/means between Yes and No people
    diffs<-N %>% 
    rerun(fun(sample(sampleYes,size=nYes,replace=TRUE))-fun(sample(sampleNo,size=nNo,replace=TRUE))) %>%
    simplify()
  
  # use perentile method to find 95% confidence interval of this difference
    ci <- quantile(diffs,c(0.025,0.975)) 
  
  return (c(estimator,ci)) # return the 95% confidence interval of the difference in scores
}

# apply the bootstrap to each scale
N<-20000 # Carpenter & Bithel (2000) suggest that 2000 is a sufficient number of times to take a bootstrap sample
ci_wellness<-bs(overall_summary,"wellness",N,median)
ci_nr6<-bs(overall_summary,"nr6",N,median)
ci_nep<-bs(overall_summary,"nep",N,median)

# Summarise the results
df<-tibble(ci_wellness,ci_nr6,ci_nep)
df <- as_tibble(cbind(nms = names(df), t(df))) %>%
  rename ("Scale"=nms,"Estimator"=V2,"CI95 LB"=V3,"CI95 UB"=V4) %>%
  mutate(Scale=c("Wellness","nr6","nep")) %>%
  kbl(digits=3) %>%
  kable_styling(full_width=F)
df

Scale	Estimator	CI95 LB	CI95 UB
Wellness	4.5	1	8
nr6	2	1	3
nep	-1	-2	0

These are our estimates of the difference made by gym use to the total score as measured by the wellness, nr6 and nep scales, plus the 95% confidence intervals for those differences.

The 95% confidence intervals for the difference between the median total scores of the Yes and No camps do not encompass zero where the wellness and nr6 scales are used. They are wholly positive. Thus we find evidence at the 95% confidence level for an increase in score as measured by those scales for gym users compared to non-gym users.

Conversely, we find no such evidence if the score is measured using the nep scale since the 95% confidence interval for the difference in score does (just) encompass zero.

Concluding cautionary remarks

All methods of analysis used give consistent outputs. There is evidence at the 95% confidence level that gym use increases score as measured by the wellness or nr6 scales, but not if measured by the nep scale.

Whether these scores correspond to any real attribute of the people being surveyed is another question entirely.

Even if they do, we conclude by noting that this evidence for a significant increase in score as measured by the wellness or nrp scales if an individual uses the gym does not necessarily mean that the increase is large enough to be of practical importance. It could be, but significance, even for tiny effects, can be an artifact of having large data sets. Do you, the researcher, think the differences you have detected are large enough to be of practical importance? ## References

Carpenter, J. & Bithell, J. (2000). Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19(9) 1141-1164.
de Winter, J. F. C., Dodou, D. (2010). Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon (Addendum added October 2012). Practical Assessment, Research, and Evaluation, 15, 1–16.
Norman, G. (2010). Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education, 15(5), 625–632.