library(data.table)
library(ggplot2)
library(ggpubr)Correlation
Correlations
This work book simulates some data for the Athlete Introductory Movement Screen (AIMS) and a force measure (e.g., CMJ). This will show you how to run a basic Pearson’s correlation, graph the association and check assumptions.
N.B. I’ve not simulated these to be in anyway associated but purely to create some data to play with
Below are a couple of useful papers on interpreting correlations and cover some of the maths behind the statistical tests.
Stovitz SD, Verhagen E, Shrier I. Distinguishing between causal and non-causal associations: implications for sports medicine cliniciansBritish Journal of Sports Medicine 2019;53:398-399.
Altman, N., Krzywinski, M. Association, correlation and causation. Nat Methods 12, 899–900 (2015). https://doi.org/10.1038/nmeth.3587
You will need the following packages to get started
Simulate your data
This code created two independent normally distributed, you can play around with changing the number of participants, the mean value and the standard deviation. The “set seed” means you will always simulate the same data unless you change the numbers from “1 2 3” to some other combination.
set.seed(123)
# Create data for two independent normally distributed varriables (23 particpants, mean and standard deviation)
AIMS <- rnorm(23, 42, 4)
force<- rnorm(23, 26, 4)
data<-data.table(AIMS, force)
cor.test(AIMS, force)
Pearson's product-moment correlation
data: AIMS and force
t = -0.61867, df = 21, p-value = 0.5428
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5174558 0.2946617
sample estimates:
cor
-0.1337907
As suggested by Altman and Krzywinski 2015 you might want present the Pearson’s correlation along with the 95% confidence intervals and / or the p-value e.g., (r = -0.13, -0.52 to 0.29, p = 0.543). Be aware that the p-value is effected by the sample size and thus significant correlations do not tell you anything about the magnitude of the association. If you want try increase your sample size substantially and see what happens to the p-value.
Graphing your data
I like to use a gg-scatter to graph correlations, the code below can be change to fit your purposes or to change the look. Here we add the r-value and the p-value to the graph
Check normality / assumptions
You could check if both data sets are normally distributed (given these have been simulated to be normally distributed we already know the results) but we can graph these with histograms and QQ-Plots and we can also run a Shapiro- Wilk test. This isn’t the best thing to do but let’s go with it!
Here is a simple histogram
hist(AIMS)The following gg-plot adds a histogram and density curve.
ggplot(data, aes(x = AIMS)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "grey") +
geom_density()+theme_classic()Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The Shapiro-Wilk test
The Shapiro–Wilk test is a test of normality and simply tells us if the data is significantly different from normal. Here the null hypothesis is that the data will be normal and the test gives us a p-value for the likelihood that we would see the distribution we do if the null hypothesis is true. Remember p-values are crude and in reality why is p=0.51 any better than p=0.49. As such this is also not "air-tight".
The code below runs the Shapiro-Wilk test for Twenty meter time and we get W = 0.43, p = <0.0001) Note: e-14 means there are 14 0's in front of the 6.82!
shapiro.test(AIMS)
Shapiro-Wilk normality test
data: AIMS
W = 0.97409, p-value = 0.7855
A better way
We are not really interested if the data is normally distributed, we want to know if the residuals left over from the model are normally distributed (the vertical distance the dots are from the line of best fit in our scatter plot). When we put a line through data we are fitting a linear model. So we can run our Pearson’s correlation through a different line of code (we need a new packages for this too, called lme4)
Here we run the code to run our model and give it a name (l.model)
l.model<-lm(AIMS ~ 1 + force, data)
We then use the code summary(l.model)
library(lme4)Loading required package: Matrix
l.model<-lm(AIMS ~ 1 + force , data)
summary(l.model)
Call:
lm(formula = AIMS ~ 1 + force, data = data)
Residuals:
Min 1Q Median 3Q Max
-8.363 -2.726 0.477 2.123 6.860
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.6213 5.7655 7.913 9.81e-08 ***
force -0.1346 0.2175 -0.619 0.543
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.927 on 21 degrees of freedom
Multiple R-squared: 0.0179, Adjusted R-squared: -0.02887
F-statistic: 0.3828 on 1 and 21 DF, p-value: 0.5428
This gives us lots of data but we have a p-value of 0.543 as before and our r-value of -0.134 - the same as our standard correlation above. But we can now check the fit of the model and the “performance package” is ideal for this.
library(performance)
check_model(l.model)# comprehensive visual inspectioncheck_normality(l.model) # shapiro.test, however, visual inspection (e.g. Q-Q plots) are preferableOK: residuals appear as normally distributed (p = 0.812).
check_heteroscedasticity(l.model)# checks error varrianceOK: Error variance appears to be homoscedastic (p = 0.103).