setwd("~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/Human Movement/2024/Analysis")Data Analysis
Data analysis
We have collect our data, fantastic! What are we going to do with it! Well first lest read the data in. We have two files, the first is our participant information, stature, age, mass etc. and the second is our data file “Jump_data”. Make sure both these files are save neatly in your Human Movement folder and that you have set this folder as your working directory (in files click on more and then “set working directory” - you can copy the code into your script like I have done below:
Installing / loading packages
Remember we need some key packages to do something in R and once installed we need to “load” these. The code before is naming the packages we needed and installing any that are not installed and loading all of them for you - check for any error messages or if it asks you to restart R :
library(tidyverse)
library(janitor)
library(kableExtra)You can now read your data in (here we use the code read_csv():
data <- read_csv("Jump_data.csv")Rows: 48 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): ID, Condition
dbl (7): BW [KG], Mean Contact Time [ms], Mean Jump Height (Flight Time) [c...
time (1): Time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)# A tibble: 6 × 10
ID Condition Time BW [K…¹ Mean …² Mean …³ Mean …⁴ Mean …⁵ Mean …⁶ Mean …⁷
<chr> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P1 Control 09:34 74 214 15.8 1.69 0.75 414 31490.
2 P1 Lava 09:45 74 226 18 1.69 0.8 439 25030.
3 P1 Wall 09:56 74 229 18.4 1.69 0.8 430 25548.
4 P2 Control 09:24 68.7 250 21 1.65 0.84 440 17978
5 P2 Lava 09:37 68.9 569 20.8 0.75 0.38 647 61681.
6 P2 Wall 09:48 68.9 201 14.2 1.69 0.7 364 29016
# … with abbreviated variable names ¹`BW [KG]`, ²`Mean Contact Time [ms]`,
# ³`Mean Jump Height (Flight Time) [cm]`, ⁴`Mean RSI (Flight/Contact Time)`,
# ⁵`Mean RSI (Jump Height/Contact Time) [m/s]`, ⁶`Mean Impulse [N s]`,
# ⁷`Mean Active Stiffness [N/m]`
Data cleaning
One of the problems with this data is that the column headers that Vald gives the data are not great to use when coding (see Bowman and Loo for more details here). Lucky for us we can use some simple code using the “janitor” package to get clean names:
Broman, Karl W., and Kara H. Woo. “Data organization in spreadsheets.” The American Statistician 72.1 (2018): 2-10.
names<-names(data%>%clean_names)
colnames(data) <- names
head(data)# A tibble: 6 × 10
id condition time bw_kg mean_co…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵ mean_…⁶
<chr> <chr> <time> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P1 Control 09:34 74 214 15.8 1.69 0.75 414 31490.
2 P1 Lava 09:45 74 226 18 1.69 0.8 439 25030.
3 P1 Wall 09:56 74 229 18.4 1.69 0.8 430 25548.
4 P2 Control 09:24 68.7 250 21 1.65 0.84 440 17978
5 P2 Lava 09:37 68.9 569 20.8 0.75 0.38 647 61681.
6 P2 Wall 09:48 68.9 201 14.2 1.69 0.7 364 29016
# … with abbreviated variable names ¹mean_contact_time_ms,
# ²mean_jump_height_flight_time_cm, ³mean_rsi_flight_contact_time,
# ⁴mean_rsi_jump_height_contact_time_m_s, ⁵mean_impulse_n_s,
# ⁶mean_active_stiffness_n_m
I can now select the data columns I want and also change the names to something a little more manageable if I want. The script below will keep: id, condition, mean_contact_time_ms, mean_jump_height_flight_time_cm, mean_rsi_jump_height_contact_time_m_s. However, I will change the names to: id, condition, ct, ht, rsi (of course you can change the code below to choose whichever columns you want and name them whatever works for you).
data <- data %>%
subset(select = c(id, condition,
mean_contact_time_ms,
mean_jump_height_flight_time_cm,
mean_rsi_jump_height_contact_time_m_s))%>%
rename(ct = mean_contact_time_ms,
ht = mean_jump_height_flight_time_cm,
rsi = mean_rsi_jump_height_contact_time_m_s)
head(data)# A tibble: 6 × 5
id condition ct ht rsi
<chr> <chr> <dbl> <dbl> <dbl>
1 P1 Control 214 15.8 0.75
2 P1 Lava 226 18 0.8
3 P1 Wall 229 18.4 0.8
4 P2 Control 250 21 0.84
5 P2 Lava 569 20.8 0.38
6 P2 Wall 201 14.2 0.7
Summary statistics (central tendency)
We can use some simple code to work out mean, standard deviation or an other measures of central tendency either overall or broken down by our conditions.
mean(data$rsi)[1] 0.9710417
tapply(data$rsi, list(data$condition), mean) Control Lava Wall
0.9152941 0.9475000 1.0593333
sd(data$rsi)[1] 0.3420914
tapply(data$rsi, list(data$condition), sd) Control Lava Wall
0.3452919 0.3874618 0.2881088
Or we can create a table combining means and standard deviations:
summary_table <- data %>%
group_by(condition) %>%
select(-id) %>%
summarise_all(funs(summary = paste0(round(mean(., na.rm = TRUE), 1), " ± ", round(sd(., na.rm = TRUE), 1))))
kable(summary_table, caption = "Summary statistics for each condition (mean ± sd)", align = "c") %>%
kable_styling(full_width = FALSE)| condition | ct_summary | ht_summary | rsi_summary |
|---|---|---|---|
| Control | 209.3 ± 66.6 | 18.1 ± 5.3 | 0.9 ± 0.3 |
| Lava | 210.8 ± 97.6 | 18.2 ± 5.4 | 0.9 ± 0.4 |
| Wall | 202.4 ± 22.5 | 21.3 ± 5.8 | 1.1 ± 0.3 |
Data visualization and assumption checks
So now we have data ready for analysis. Let’s just take a look at our data first, lets plot a simple box plot:
ggplot(data, aes(condition, rsi, fill = condition))+
geom_boxplot() You can make this look nice by adding different elements to it, including colour or a colour fill. Above I’ve added fill = condition. You can change colour schemes, adding the line of code” scale_fill_viridis_d(option ="viridis", direction = 1) uses a really nice colour palette for example. Also play with adding themes like theme_classic() .
Assumption checking
We can see that there are a few outlines in our data. The first stage of any analysis is to check the data fits our assumptions of “normality”. For this we will use a histogram and look to see if it looks close to a normal “bell shaped curve”:
hist(data$rsi)We can also use ggplot to graph this data and can use both a histogram and density plot:
ggplot(data, aes(x = rsi)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "lightgreen") +
geom_density()+theme_classic()+ labs(title="Histogram / density plot", x="Reactive Strength Index (m/s)", y = "Density")Q-Q plot
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential. For example, if we run a statistical analysis that assumes our dependent variable is Normally distributed, we can use a Normal Q-Q plot to check that assumption. It’s just a visual check, not an air-tight proof, so it is somewhat subjective. But itallows us to see at-a-glance if our assumption is plausible, and if not, how the assumption is violated and what data points contribute to the violation.
A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.
We can use base R or library(car) so you might need to install the package if you don’t have it already:
library(car)
qqPlot(data$rsi)[1] 45 44
#or in base R
qqnorm(data$rsi )
qqline(data$rsi )In this case it looks like our data might not be normally distributed, although most of the points are within the blue shaded area, but we might want a formal test to check.
Shapiro-Wilk normality test
The Shapiro–Wilk test is a test of normality and simply tells us if the data is significantly different from normal. Here the null hypothesis is that the data will be normal and the test gives us a p-value for the likelihood that we would see the distribution we do if the null hypothesis is true. Remember p-values are crude and in reality why is p=0.51 any better than p=0.49. As such this is also not “air-tight”.
The code below runs the Shapiro-Wilk test for RSI and we get W = 0.939, p = 0.016)
shapiro.test(data$rsi)
Shapiro-Wilk normality test
data: data$rsi
W = 0.93992, p-value = 0.01612
Okay so we could say that our data is not normally distributed and we might want to run a non-parametric test (traditional approach).
Traditional approach
Non-parametric statistics
So if we are following the undergraduate textbook we would need to run a test to see if we can reject the null hypothesis. That test would need to be one that did not assume the data to be normally distributed: We have three conditions so usually we would use a test called and Analysis of Variance (ANOVA) but here we want to find it’s non-parametric alternative.
The Kruskal-Wallis test
The Kruskal-Wallis H test (sometimes also called the “one-way ANOVA on ranks”) is a rank-based nonparametric test that can be used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable. It is considered the nonparametric alternative to the one-way ANOVA, and an extension of the Mann-Whitney U test to allow the comparison of more than two independent groups. As the Kruskal-Wallis H test does not assume normality in the data and is much less sensitive to outliers, it can be used when these assumptions have been violated and the use of a one-way ANOVA is inappropriate. In addition, if your data is ordinal, a one-way ANOVA is inappropriate, but the Kruskal-Wallis H test is not.
library(ggpubr)
library(rstatix)
res.kruskal <- data %>% kruskal_test(rsi ~ condition)
res.kruskal# A tibble: 1 × 6
.y. n statistic df p method
* <chr> <int> <dbl> <int> <dbl> <chr>
1 rsi 48 2.05 2 0.359 Kruskal-Wallis
Here we get a non-significant result (p = 0.359) so we **cannot** reject the null hypothesis here.
BUT
The Kruskal-Wallis test for within-participant designs, but it’s important to note that the Kruskal-Wallis test does not directly handle repeated measures or within-participant designs. It’s a non-parametric alternative to the one-way ANOVA, which assumes independence of observations.
For within-participant designs, you might consider using Friedman’s test, which is a non-parametric alternative to repeated measures ANOVA. However, if your data meet the assumptions of parametric tests (normality, homogeneity of variances), you may still prefer to use repeated measures ANOVA.
However, we might be better to consider using a linear mixed-effects model (LMM) to analyze our within-participant design. LMMs can handle unbalanced designs and account for the correlation between repeated measurements from the same participant. Here’s how you can perform such an analysis using the lmer() function from the lme4 package
library(lme4)
m<-lmer(rsi ~ 1 + condition + (1| id), data)
# Perform ANOVA
anova(m)Analysis of Variance Table
npar Sum Sq Mean Sq F value
condition 2 0.099876 0.049938 2.2732
summary(m)Linear mixed model fit by REML ['lmerMod']
Formula: rsi ~ 1 + condition + (1 | id)
Data: data
REML criterion at convergence: 3.9
Scaled residuals:
Min 1Q Median 3Q Max
-1.7995 -0.4151 -0.0350 0.6527 2.0579
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.09656 0.3107
Residual 0.02197 0.1482
Number of obs: 48, groups: id, 16
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.93304 0.08573 10.883
conditionLava 0.01446 0.05185 0.279
conditionWall 0.10619 0.05349 1.985
Correlation of Fixed Effects:
(Intr) cndtnL
conditionLv -0.296
conditinWll -0.292 0.484
Here we can report the F value of 2.273 but this package does not provide us with a p-value. It does however give us the variation in our data. The variance associated with the random intercept is the between-participant variation - 0.3107 SDs & the residual standard deviation represents the within-participant variation 0.1482 SDs. So we are seeing more variation between each of us on the testing than within (i.e. variation between conditions).
We also get the RSI values for the control (Intercept) and the Lava and Wall condition in our “Fixed effects” table. However, this doesn’t tell us the between condition difference and confidence intervals which is what we really want to know. To do this we need another package:
library(emmeans)
emm <- emmeans(m, pairwise ~ condition)
emm$emmeans
condition emmean SE df lower.CL upper.CL
Control 0.933 0.0857 19.0 0.754 1.11
Lava 0.948 0.0861 19.3 0.768 1.13
Wall 1.039 0.0868 19.9 0.858 1.22
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
Control - Lava -0.0145 0.0519 30.0 -0.279 0.9581
Control - Wall -0.1062 0.0535 30.2 -1.984 0.1335
Lava - Wall -0.0917 0.0536 30.1 -1.713 0.2170
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 3 estimates
confint(emm , level = 0.95)$emmeans
condition emmean SE df lower.CL upper.CL
Control 0.933 0.0857 19.0 0.754 1.11
Lava 0.948 0.0861 19.3 0.768 1.13
Wall 1.039 0.0868 19.9 0.858 1.22
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
$contrasts
contrast estimate SE df lower.CL upper.CL
Control - Lava -0.0145 0.0519 30.0 -0.142 0.1134
Control - Wall -0.1062 0.0535 30.2 -0.238 0.0257
Lava - Wall -0.0917 0.0536 30.1 -0.224 0.0403
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
Here we get the between group differences and their confidence intervals an p-values: e.g., Control - Lava = 0,0145 (95% CI -0.142 to 0.1134, p = 0.9581.
But should we not check our model first?
Yes we probably should, we can use the performance package for this:
library(performance)
check_model(m)check_normality(m) # shapiro.test, however, visual inspection (e.g. Q-Q plots) are preferableOK: residuals appear as normally distributed (p = 0.472).
check_heteroscedasticity(m)Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
So the model runs okay but there is some issues with “error variance” there is differences in the variations between those that jump well and those that don’t. What we could do here is run a robust model instead. In a nutshell robust methods are less influenced by outlines and thus better in your situation (probably generally better to use these according to Field and Wilcox see link and reference:
Field, A. P., & Wilcox, R. R. (2017). Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behaviour research and therapy, 98, 19-38.
library(robustlmm) #for robust linear models
m<-rlmer(rsi ~ 1 + condition + (1| id), data)
# Perform ANOVA
summary(m)Robust linear mixed model fit by DAStau
Formula: rsi ~ 1 + condition + (1 | id)
Data: data
Scaled residuals:
Min 1Q Median 3Q Max
-2.37961 -0.42890 -0.01068 0.58507 1.89373
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08415 0.2901
Residual 0.01811 0.1346
Number of obs: 48, groups: id, 16
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.90197 0.08169 11.042
conditionLava 0.01932 0.04828 0.400
conditionWall 0.11436 0.04981 2.296
Correlation of Fixed Effects:
(Intr) cndtnL
conditionLv -0.289
conditinWll -0.286 0.484
Robustness weights for the residuals:
40 weights are ~= 1. The remaining 8 ones are
4 5 30 35 36 39 43 45
0.892 0.648 0.701 0.974 0.976 0.775 0.565 0.709
Robustness weights for the random effects:
14 weights are ~= 1. The remaining 2 ones are
7 8
0.479 0.800
Rho functions used for fitting:
Residuals:
eff: smoothed Huber (k = 1.345, s = 10)
sig: smoothed Huber, Proposal 2 (k = 1.345, s = 10)
Random Effects, variance component 1 (id):
eff: smoothed Huber (k = 1.345, s = 10)
vcp: smoothed Huber, Proposal 2 (k = 1.345, s = 10)
Notice a slight decrease in our between-participant variation now: 0.29 SDs
emm <- emmeans(m, pairwise ~ condition)
emm$emmeans
condition emmean SE df asymp.LCL asymp.UCL
Control 0.902 0.0817 Inf 0.742 1.06
Lava 0.921 0.0820 Inf 0.761 1.08
Wall 1.016 0.0826 Inf 0.854 1.18
Confidence level used: 0.95
$contrasts
contrast estimate SE df z.ratio p.value
Control - Lava -0.0193 0.0483 Inf -0.400 0.9155
Control - Wall -0.1144 0.0498 Inf -2.296 0.0564
Lava - Wall -0.0950 0.0499 Inf -1.906 0.1369
P value adjustment: tukey method for comparing a family of 3 estimates
confint(emm , level = 0.95)$emmeans
condition emmean SE df asymp.LCL asymp.UCL
Control 0.902 0.0817 Inf 0.742 1.06
Lava 0.921 0.0820 Inf 0.761 1.08
Wall 1.016 0.0826 Inf 0.854 1.18
Confidence level used: 0.95
$contrasts
contrast estimate SE df asymp.LCL asymp.UCL
Control - Lava -0.0193 0.0483 Inf -0.132 0.09383
Control - Wall -0.1144 0.0498 Inf -0.231 0.00238
Lava - Wall -0.0950 0.0499 Inf -0.212 0.02182
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
Here our confidence intervals are narrower and the p-values altered slightly: e.g., Control - Lava = 0.0193 (95% CI -0.132 to 0.09383, p = 0.9155. Still no significant differences but p = 0.0564 for Control versus Wall, so a little closer!
You will need to run these for all your outcome measures.