Data Analysis

Data analysis

We have collect our data, fantastic! What are we going to do with it! Well first lest read the data in. We have two files, the first is our participant information, stature, age, mass etc. and the second is our data file “Jump_data”. Make sure both these files are save neatly in your Human Movement folder and that you have set this folder as your working directory (in files click on more and then “set working directory” - you can copy the code into your script like I have done below:

setwd("~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/Human Movement/2024/Analysis")

Installing / loading packages

Remember we need some key packages to do something in R and once installed we need to “load” these. The code before is naming the packages we needed and installing any that are not installed and loading all of them for you - check for any error messages or if it asks you to restart R :

library(tidyverse)
library(janitor)
library(kableExtra)

You can now read your data in (here we use the code read_csv():

data <- read_csv("Jump_data.csv")

Rows: 48 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): ID, Condition
dbl  (7): BW [KG], Mean Contact Time [ms], Mean Jump Height (Flight Time) [c...
time (1): Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

# A tibble: 6 × 10
  ID    Condition Time   BW [K…¹ Mean …² Mean …³ Mean …⁴ Mean …⁵ Mean …⁶ Mean …⁷
  <chr> <chr>     <time>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 P1    Control   09:34     74       214    15.8    1.69    0.75     414  31490.
2 P1    Lava      09:45     74       226    18      1.69    0.8      439  25030.
3 P1    Wall      09:56     74       229    18.4    1.69    0.8      430  25548.
4 P2    Control   09:24     68.7     250    21      1.65    0.84     440  17978 
5 P2    Lava      09:37     68.9     569    20.8    0.75    0.38     647  61681.
6 P2    Wall      09:48     68.9     201    14.2    1.69    0.7      364  29016 
# … with abbreviated variable names ¹`BW [KG]`, ²`Mean Contact Time [ms]`,
#   ³`Mean Jump Height (Flight Time) [cm]`, ⁴`Mean RSI (Flight/Contact Time)`,
#   ⁵`Mean RSI (Jump Height/Contact Time) [m/s]`, ⁶`Mean Impulse [N s]`,
#   ⁷`Mean Active Stiffness [N/m]`

Data cleaning

One of the problems with this data is that the column headers that Vald gives the data are not great to use when coding (see Bowman and Loo for more details here). Lucky for us we can use some simple code using the “janitor” package to get clean names:

Broman, Karl W., and Kara H. Woo. “Data organization in spreadsheets.” The American Statistician 72.1 (2018): 2-10.

names<-names(data%>%clean_names)
colnames(data) <- names
head(data)

# A tibble: 6 × 10
  id    condition time   bw_kg mean_co…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵ mean_…⁶
  <chr> <chr>     <time> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 P1    Control   09:34   74         214    15.8    1.69    0.75     414  31490.
2 P1    Lava      09:45   74         226    18      1.69    0.8      439  25030.
3 P1    Wall      09:56   74         229    18.4    1.69    0.8      430  25548.
4 P2    Control   09:24   68.7       250    21      1.65    0.84     440  17978 
5 P2    Lava      09:37   68.9       569    20.8    0.75    0.38     647  61681.
6 P2    Wall      09:48   68.9       201    14.2    1.69    0.7      364  29016 
# … with abbreviated variable names ¹mean_contact_time_ms,
#   ²mean_jump_height_flight_time_cm, ³mean_rsi_flight_contact_time,
#   ⁴mean_rsi_jump_height_contact_time_m_s, ⁵mean_impulse_n_s,
#   ⁶mean_active_stiffness_n_m

I can now select the data columns I want and also change the names to something a little more manageable if I want. The script below will keep: id, condition, mean_contact_time_ms, mean_jump_height_flight_time_cm, mean_rsi_jump_height_contact_time_m_s. However, I will change the names to: id, condition, ct, ht, rsi (of course you can change the code below to choose whichever columns you want and name them whatever works for you).

data <- data %>%
  subset(select = c(id, condition, 
                    mean_contact_time_ms, 
                    mean_jump_height_flight_time_cm,
                    mean_rsi_jump_height_contact_time_m_s))%>%
  rename(ct =  mean_contact_time_ms, 
         ht = mean_jump_height_flight_time_cm,
         rsi = mean_rsi_jump_height_contact_time_m_s)

head(data)

# A tibble: 6 × 5
  id    condition    ct    ht   rsi
  <chr> <chr>     <dbl> <dbl> <dbl>
1 P1    Control     214  15.8  0.75
2 P1    Lava        226  18    0.8 
3 P1    Wall        229  18.4  0.8 
4 P2    Control     250  21    0.84
5 P2    Lava        569  20.8  0.38
6 P2    Wall        201  14.2  0.7

Summary statistics (central tendency)

We can use some simple code to work out mean, standard deviation or an other measures of central tendency either overall or broken down by our conditions.

mean(data$rsi)

[1] 0.9710417

tapply(data$rsi, list(data$condition), mean)

  Control      Lava      Wall 
0.9152941 0.9475000 1.0593333

sd(data$rsi)

[1] 0.3420914

tapply(data$rsi, list(data$condition), sd)

  Control      Lava      Wall 
0.3452919 0.3874618 0.2881088

Or we can create a table combining means and standard deviations:

summary_table <- data %>%
  group_by(condition) %>%
  select(-id) %>%
  summarise_all(funs(summary = paste0(round(mean(., na.rm = TRUE), 1), " ± ", round(sd(., na.rm = TRUE), 1)))) 

kable(summary_table, caption = "Summary statistics for each condition (mean ± sd)", align = "c") %>%
  kable_styling(full_width = FALSE)

Summary statistics for each condition (mean ± sd)
condition	ct_summary	ht_summary	rsi_summary
Control	209.3 ± 66.6	18.1 ± 5.3	0.9 ± 0.3
Lava	210.8 ± 97.6	18.2 ± 5.4	0.9 ± 0.4
Wall	202.4 ± 22.5	21.3 ± 5.8	1.1 ± 0.3

Data visualization and assumption checks

So now we have data ready for analysis. Let’s just take a look at our data first, lets plot a simple box plot:

ggplot(data, aes(condition, rsi, fill = condition))+
  geom_boxplot()

You can make this look nice by adding different elements to it, including colour or a colour fill. Above I’ve added fill = condition. You can change colour schemes, adding the line of code” scale_fill_viridis_d(option ="viridis", direction = 1) uses a really nice colour palette for example. Also play with adding themes like theme_classic() .

Assumption checking

We can see that there are a few outlines in our data. The first stage of any analysis is to check the data fits our assumptions of “normality”. For this we will use a histogram and look to see if it looks close to a normal “bell shaped curve”:

hist(data$rsi)

We can also use ggplot to graph this data and can use both a histogram and density plot:

ggplot(data, aes(x = rsi)) + 
  geom_histogram(aes(y = ..density..),
                 colour = 1, fill = "lightgreen") +
  geom_density()+theme_classic()+ labs(title="Histogram / density plot", x="Reactive Strength Index (m/s)", y = "Density")

Q-Q plot

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential. For example, if we run a statistical analysis that assumes our dependent variable is Normally distributed, we can use a Normal Q-Q plot to check that assumption. It’s just a visual check, not an air-tight proof, so it is somewhat subjective. But itallows us to see at-a-glance if our assumption is plausible, and if not, how the assumption is violated and what data points contribute to the violation.

A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.

We can use base R or library(car) so you might need to install the package if you don’t have it already:

library(car)
qqPlot(data$rsi)

[1] 45 44

#or in base R
qqnorm(data$rsi )
qqline(data$rsi )

In this case it looks like our data might not be normally distributed, although most of the points are within the blue shaded area, but we might want a formal test to check.

Shapiro-Wilk normality test

The Shapiro–Wilk test is a test of normality and simply tells us if the data is significantly different from normal. Here the null hypothesis is that the data will be normal and the test gives us a p-value for the likelihood that we would see the distribution we do if the null hypothesis is true. Remember p-values are crude and in reality why is p=0.51 any better than p=0.49. As such this is also not “air-tight”.

The code below runs the Shapiro-Wilk test for RSI and we get W = 0.939, p = 0.016)

shapiro.test(data$rsi)


    Shapiro-Wilk normality test

data:  data$rsi
W = 0.93992, p-value = 0.01612

Okay so we could say that our data is not normally distributed and we might want to run a non-parametric test (traditional approach).

Traditional approach

Non-parametric statistics

So if we are following the undergraduate textbook we would need to run a test to see if we can reject the null hypothesis. That test would need to be one that did not assume the data to be normally distributed: We have three conditions so usually we would use a test called and Analysis of Variance (ANOVA) but here we want to find it’s non-parametric alternative.

The Kruskal-Wallis test

The Kruskal-Wallis H test (sometimes also called the “one-way ANOVA on ranks”) is a rank-based nonparametric test that can be used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable. It is considered the nonparametric alternative to the one-way ANOVA, and an extension of the Mann-Whitney U test to allow the comparison of more than two independent groups. As the Kruskal-Wallis H test does not assume normality in the data and is much less sensitive to outliers, it can be used when these assumptions have been violated and the use of a one-way ANOVA is inappropriate. In addition, if your data is ordinal, a one-way ANOVA is inappropriate, but the Kruskal-Wallis H test is not.

library(ggpubr)
library(rstatix)
res.kruskal <- data %>% kruskal_test(rsi ~ condition)
res.kruskal

# A tibble: 1 × 6
  .y.       n statistic    df     p method        
* <chr> <int>     <dbl> <int> <dbl> <chr>         
1 rsi      48      2.05     2 0.359 Kruskal-Wallis

Here we get a non-significant result (p = 0.359) so we **cannot** reject the null hypothesis here.

BUT

The Kruskal-Wallis test for within-participant designs, but it’s important to note that the Kruskal-Wallis test does not directly handle repeated measures or within-participant designs. It’s a non-parametric alternative to the one-way ANOVA, which assumes independence of observations.

For within-participant designs, you might consider using Friedman’s test, which is a non-parametric alternative to repeated measures ANOVA. However, if your data meet the assumptions of parametric tests (normality, homogeneity of variances), you may still prefer to use repeated measures ANOVA.

However, we might be better to consider using a linear mixed-effects model (LMM) to analyze our within-participant design. LMMs can handle unbalanced designs and account for the correlation between repeated measurements from the same participant. Here’s how you can perform such an analysis using the lmer() function from the lme4 package

library(lme4)

m<-lmer(rsi ~ 1  + condition + (1| id), data)
# Perform ANOVA
anova(m)

Analysis of Variance Table
          npar   Sum Sq  Mean Sq F value
condition    2 0.099876 0.049938  2.2732

summary(m)

Linear mixed model fit by REML ['lmerMod']
Formula: rsi ~ 1 + condition + (1 | id)
   Data: data

REML criterion at convergence: 3.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7995 -0.4151 -0.0350  0.6527  2.0579 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.09656  0.3107  
 Residual             0.02197  0.1482  
Number of obs: 48, groups:  id, 16

Fixed effects:
              Estimate Std. Error t value
(Intercept)    0.93304    0.08573  10.883
conditionLava  0.01446    0.05185   0.279
conditionWall  0.10619    0.05349   1.985

Correlation of Fixed Effects:
            (Intr) cndtnL
conditionLv -0.296       
conditinWll -0.292  0.484

Here we can report the F value of 2.273 but this package does not provide us with a p-value. It does however give us the variation in our data. The variance associated with the random intercept is the between-participant variation - 0.3107 SDs & the residual standard deviation represents the within-participant variation 0.1482 SDs. So we are seeing more variation between each of us on the testing than within (i.e. variation between conditions).

We also get the RSI values for the control (Intercept) and the Lava and Wall condition in our “Fixed effects” table. However, this doesn’t tell us the between condition difference and confidence intervals which is what we really want to know. To do this we need another package:

library(emmeans)

emm  <- emmeans(m, pairwise ~ condition)
emm

$emmeans
 condition emmean     SE   df lower.CL upper.CL
 Control    0.933 0.0857 19.0    0.754     1.11
 Lava       0.948 0.0861 19.3    0.768     1.13
 Wall       1.039 0.0868 19.9    0.858     1.22

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 

$contrasts
 contrast       estimate     SE   df t.ratio p.value
 Control - Lava  -0.0145 0.0519 30.0  -0.279  0.9581
 Control - Wall  -0.1062 0.0535 30.2  -1.984  0.1335
 Lava - Wall     -0.0917 0.0536 30.1  -1.713  0.2170

Degrees-of-freedom method: kenward-roger 
P value adjustment: tukey method for comparing a family of 3 estimates

confint(emm , level = 0.95)

$emmeans
 condition emmean     SE   df lower.CL upper.CL
 Control    0.933 0.0857 19.0    0.754     1.11
 Lava       0.948 0.0861 19.3    0.768     1.13
 Wall       1.039 0.0868 19.9    0.858     1.22

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 

$contrasts
 contrast       estimate     SE   df lower.CL upper.CL
 Control - Lava  -0.0145 0.0519 30.0   -0.142   0.1134
 Control - Wall  -0.1062 0.0535 30.2   -0.238   0.0257
 Lava - Wall     -0.0917 0.0536 30.1   -0.224   0.0403

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates

Here we get the between group differences and their confidence intervals an p-values: e.g., Control - Lava = 0,0145 (95% CI -0.142 to 0.1134, p = 0.9581.

But should we not check our model first?

Yes we probably should, we can use the performance package for this:

library(performance)
check_model(m)

check_normality(m)     # shapiro.test, however, visual inspection (e.g. Q-Q plots) are preferable

OK: residuals appear as normally distributed (p = 0.472).

check_heteroscedasticity(m)

Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

So the model runs okay but there is some issues with “error variance” there is differences in the variations between those that jump well and those that don’t. What we could do here is run a robust model instead. In a nutshell robust methods are less influenced by outlines and thus better in your situation (probably generally better to use these according to Field and Wilcox see link and reference:

Robust statistical methods

Field, A. P., & Wilcox, R. R. (2017). Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behaviour research and therapy, 98, 19-38.

library(robustlmm) #for robust linear models

m<-rlmer(rsi ~ 1  + condition + (1| id), data)
# Perform ANOVA
summary(m)

Robust linear mixed model fit by DAStau 
Formula: rsi ~ 1 + condition + (1 | id) 
   Data: data 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.37961 -0.42890 -0.01068  0.58507  1.89373 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.08415  0.2901  
 Residual             0.01811  0.1346  
Number of obs: 48, groups: id, 16

Fixed effects:
              Estimate Std. Error t value
(Intercept)    0.90197    0.08169  11.042
conditionLava  0.01932    0.04828   0.400
conditionWall  0.11436    0.04981   2.296

Correlation of Fixed Effects:
            (Intr) cndtnL
conditionLv -0.289       
conditinWll -0.286  0.484

Robustness weights for the residuals: 
 40 weights are ~= 1. The remaining 8 ones are
    4     5    30    35    36    39    43    45 
0.892 0.648 0.701 0.974 0.976 0.775 0.565 0.709 

Robustness weights for the random effects: 
 14 weights are ~= 1. The remaining 2 ones are
    7     8 
0.479 0.800 

Rho functions used for fitting:
  Residuals:
    eff: smoothed Huber (k = 1.345, s = 10) 
    sig: smoothed Huber, Proposal 2 (k = 1.345, s = 10) 
  Random Effects, variance component 1 (id):
    eff: smoothed Huber (k = 1.345, s = 10) 
    vcp: smoothed Huber, Proposal 2 (k = 1.345, s = 10)

Notice a slight decrease in our between-participant variation now: 0.29 SDs

emm  <- emmeans(m, pairwise ~ condition)
emm

$emmeans
 condition emmean     SE  df asymp.LCL asymp.UCL
 Control    0.902 0.0817 Inf     0.742      1.06
 Lava       0.921 0.0820 Inf     0.761      1.08
 Wall       1.016 0.0826 Inf     0.854      1.18

Confidence level used: 0.95 

$contrasts
 contrast       estimate     SE  df z.ratio p.value
 Control - Lava  -0.0193 0.0483 Inf  -0.400  0.9155
 Control - Wall  -0.1144 0.0498 Inf  -2.296  0.0564
 Lava - Wall     -0.0950 0.0499 Inf  -1.906  0.1369

P value adjustment: tukey method for comparing a family of 3 estimates

confint(emm , level = 0.95)

$emmeans
 condition emmean     SE  df asymp.LCL asymp.UCL
 Control    0.902 0.0817 Inf     0.742      1.06
 Lava       0.921 0.0820 Inf     0.761      1.08
 Wall       1.016 0.0826 Inf     0.854      1.18

Confidence level used: 0.95 

$contrasts
 contrast       estimate     SE  df asymp.LCL asymp.UCL
 Control - Lava  -0.0193 0.0483 Inf    -0.132   0.09383
 Control - Wall  -0.1144 0.0498 Inf    -0.231   0.00238
 Lava - Wall     -0.0950 0.0499 Inf    -0.212   0.02182

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates

Here our confidence intervals are narrower and the p-values altered slightly: e.g., Control - Lava = 0.0193 (95% CI -0.132 to 0.09383, p = 0.9155. Still no significant differences but p = 0.0564 for Control versus Wall, so a little closer!

You will need to run these for all your outcome measures.