Correlation
Measuring relationships between variables
[Scatterplot examples: a strong positive relationship, a weak positive relationship, no relationship, a negative relationship]
Measuring relationships
Remember the notion of variance? It measures the variation within a single variable:
\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)
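As a quick check, here is a minimal R sketch (with a made-up data vector) showing that this formula matches the built-in var():

x = c(2, 4, 4, 4, 5, 5, 7, 9)         # made-up example data
n = length(x)
sum((x - mean(x))^2) / (n - 1)        # variance by the formula: 4.571429
var(x)                                # built-in function gives the same value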
Now we apply the same idea to two variables simultaneously
Covariance
- We want to characterise how the two variables are related
- We need to see whether as one variable increases, the other one increases, decreases or stays the same
- We are not drawing any conclusions about causality!
The metric we use is the covariance:
\(cov(x,y) = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}\)
Can we understand this formula?
\(cov(x,y) = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}\)
When x and y tend to lie on the same side of their respective means, the products \((x_i-\bar{x})(y_i-\bar{y})\) are mostly positive, giving a positive covariance; when they tend to lie on opposite sides, the products are mostly negative, giving a negative covariance.
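As a quick numerical check, a minimal R sketch (made-up vectors) confirming the formula against the built-in cov():

x = c(1, 2, 3, 4, 5)                           # made-up example data
y = c(2, 4, 5, 4, 6)
n = length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # covariance by the formula: 2
cov(x, y)                                      # built-in function gives the same value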
From the covariance to the correlation coefficient
The covariance depends on the units of measurement; we therefore standardise by dividing by the standard deviations of both variables to obtain the correlation coefficient:
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{(n-1)s_xs_y}\)
- Independent of units
- Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
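Continuing the sketch from above (vectors redefined so the snippet runs on its own), cor() agrees with dividing the covariance by the two standard deviations:

x = c(1, 2, 3, 4, 5)             # same made-up vectors as before
y = c(2, 4, 5, 4, 6)
cov(x, y) / (sd(x) * sd(y))      # correlation by the formula: 0.8528029
cor(x, y)                        # built-in function gives the same value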
Things to know about the correlation coefficient R
- It is an effect size
- ± 0.1 = small effect
- ± 0.3 = medium effect
- ± 0.5 = large effect
- The coefficient of determination, \(R^2\) is the proportion of variance in one variable shared/explained by the other (more on this in the video on regression analysis)
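As a one-line illustration (again with the made-up vectors from above), squaring the correlation coefficient gives the proportion of shared variance:

x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)
cor(x, y)^2    # coefficient of determination: about 0.73, i.e. ~73% of the variance is shared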
Calculating correlation coefficients in R
The correlation test
You can use cor.test() for most applications:
set.seed(0)
cor.test(rnorm(20), rnorm(20))
Pearson's product-moment correlation
data: rnorm(20) and rnorm(20)
t = 0.33035, df = 18, p-value = 0.7449
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3778722 0.5028752
sample estimates:
cor
0.07762955
How do we interpret this output? The estimated correlation (r = 0.078) is close to zero and the p-value (0.74) is far above 0.05, so there is no evidence of a correlation; this is as expected, since the two samples were drawn independently.
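Note that cor.test() returns its results as a list (an 'htest' object), so the individual numbers can also be extracted programmatically; a minimal sketch reproducing the values above:

set.seed(0)
ct = cor.test(rnorm(20), rnorm(20))
ct$estimate    # sample correlation: 0.07762955
ct$p.value     # p-value: 0.7449 (as printed above)
ct$conf.int    # 95% confidence interval: -0.378 to 0.503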
Options to select from when testing or calculating correlations
The function cor.test() has several options (arguments) you can set.
- ‘method’: you can select from ‘pearson’ (the default), ‘spearman’ and ‘kendall’ (note that R expects these in lower case)
- use ‘pearson’ for normally distributed data
- use ‘spearman’ or ‘kendall’ for non-normal data; the latter is particularly good for small sample sizes
- ‘alternative’: use ‘two.sided’ (the default) if testing for either a positive or a negative correlation, ‘greater’ for a positive correlation, and ‘less’ for a negative correlation
Correlation, example 2:
We would like to see whether there is a positive correlation between the tenderness of tuna flesh (x) and the consumer panel scores (y). This time we make sure to check the normality assumption first.
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y); qqnorm(x); qqline(x); qqnorm(y); qqline(y)
The assumption of normality seems violated (check also using
shapiro.test()) and the sample size is small, so we will use the non-parametric (rank-based) correlation tests.
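For instance, the Shapiro-Wilk test mentioned above can be run on each variable directly (a small p-value suggests a departure from normality):

x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
shapiro.test(x)    # tests H0: the data come from a normal distribution
shapiro.test(y)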
Correlation, example 2 (continued)
cor.test(x, y, method = 'kendall', alternative = 'greater')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.4444444
cor.test(x, y, method = 'spearman', alternative = 'greater')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6
Correlation, example 2 (continued)
What would happen to the p-value if we tested two-sided?
cor.test(x, y, method = 'kendall')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.1194
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.4444444
cor.test(x, y, method = 'spearman')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0968
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6
- The p-values are now both non-significant. Why?
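The reason is that the null distributions of these test statistics are symmetric, so the two-sided p-value is simply twice the one-sided one (0.0484 becomes 0.0968, and 0.05972 becomes 0.1194), pushing both above 0.05. This can be verified in R:

x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
2 * cor.test(x, y, method = 'kendall', alternative = 'greater')$p.value   # 0.1194
cor.test(x, y, method = 'kendall')$p.value                                # the same value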
Reporting statistical results in a text
Correlation test:
- Significant: E.g. ‘Antibody levels were significantly correlated with rat size (Pearson’s correlation test, p = 0.01545)’
- Non-significant: E.g. ‘Antibody levels were not significantly correlated with rat size (Pearson’s correlation test, p = 0.92)’
- Non-parametric correlation tests: ditto, but ‘…(Kendall’s test, p = …)’
P-values
- Best to indicate exact p-value
- Sometimes, asterisks are used to designate the significance in a plot: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001
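A small helper for this convention could look as follows (p_stars is a hypothetical name of our own, not a standard function; base R's symnum() provides similar functionality):

p_stars = function(p) {
  # map a p-value to the conventional significance asterisks
  if (p < 0.001) "***"
  else if (p < 0.01) "**"
  else if (p < 0.05) "*"
  else ""
}
p_stars(0.0484)    # returns "*"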
In a nutshell
- Covariance extends the idea of variance to two variables; its standardised counterpart is the correlation coefficient
- The (Pearson) correlation test uses the t-statistic; non-parametric alternatives are Spearman's and Kendall's tests
- A correlation coefficient of 0.2 can be considered weak, one above 0.5 strong, depending on the subject matter