Correlation
Measuring relationships between variables
[Scatterplot examples: a strong positive relationship, a weak positive relationship, no relationship, a negative relationship]
Measuring relationships
Remember the notion of variance? It measures the variation within a single variable:
\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)
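As a quick check, here is a minimal R sketch (with a made-up data vector) showing that this formula matches the built-in var():

x = c(2, 4, 4, 4, 5, 5, 7, 9)         # made-up example data
n = length(x)
sum((x - mean(x))^2) / (n - 1)        # variance by the formula: 4.571429
var(x)                                # built-in function gives the same value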
Now we apply the same idea to two variables simultaneously
Covariance
- We want to characterise how the two variables are related
- We need to see whether as one variable increases, the other one increases, decreases or stays the same
- We are not drawing any conclusions about causality!
The metric we use is the covariance:
\(cov(x,y) = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}\)
Can we understand this formula?
\(cov(x,y) = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}\)
When x and y tend to lie on the same side of their respective means, the products \((x_i-\bar{x})(y_i-\bar{y})\) are mostly positive, giving a positive covariance; when they tend to lie on opposite sides, the products are mostly negative, giving a negative covariance.
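As a quick numerical check, a minimal R sketch (made-up vectors) confirming the formula against the built-in cov():

x = c(1, 2, 3, 4, 5)                           # made-up example data
y = c(2, 4, 5, 4, 6)
n = length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # covariance by the formula: 2
cov(x, y)                                      # built-in function gives the same value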
From the covariance to the correlation coefficient
The covariance depends on the units of measurement; we therefore standardise by dividing by the standard deviations of both variables to obtain the correlation coefficient:
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{(n-1)s_xs_y}\)
- Independent of units
- Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
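Continuing the sketch from above (vectors redefined so the snippet runs on its own), cor() agrees with dividing the covariance by the two standard deviations:

x = c(1, 2, 3, 4, 5)             # same made-up vectors as before
y = c(2, 4, 5, 4, 6)
cov(x, y) / (sd(x) * sd(y))      # correlation by the formula: 0.8528029
cor(x, y)                        # built-in function gives the same value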
Things to know about the correlation coefficient R
- It is an effect size
- ± 0.1 = small effect
- ± 0.3 = medium effect
- ± 0.5 = large effect
- The coefficient of determination, \(R^2\) is the proportion of variance in one variable shared/explained by the other (more on this in the video on regression analysis)
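As a one-line illustration (again with the made-up vectors from above), squaring the correlation coefficient gives the proportion of shared variance:

x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)
cor(x, y)^2    # coefficient of determination: about 0.73, i.e. ~73% of the variance is shared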
Calculating correlation coefficients in R
The correlation test
You can use cor.test() for most applications:
set.seed(0)
cor.test(rnorm(20), rnorm(20))
Pearson's product-moment correlation
data: rnorm(20) and rnorm(20)
t = 0.33035, df = 18, p-value = 0.7449
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3778722 0.5028752
sample estimates:
cor
0.07762955
How do we interpret this output? The estimated correlation (r = 0.078) is close to zero and the p-value (0.74) is far above 0.05, so there is no evidence of a correlation; this is as expected, since the two samples were drawn independently.
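Note that cor.test() returns its results as a list (an 'htest' object), so the individual numbers can also be extracted programmatically; a minimal sketch reproducing the values above:

set.seed(0)
ct = cor.test(rnorm(20), rnorm(20))
ct$estimate    # sample correlation: 0.07762955
ct$p.value     # p-value: 0.7449 (as printed above)
ct$conf.int    # 95% confidence interval: -0.378 to 0.503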
Options to select from when testing or calculating correlations
The function cor.test() has several options (arguments) you can set.
- ‘method’: you can select from ‘pearson’ (the default), ‘spearman’ and ‘kendall’ (note that R expects these in lower case)
- use ‘pearson’ for normally distributed data
- use ‘spearman’ or ‘kendall’ for non-normal data; the latter is particularly good for small sample sizes
- ‘alternative’: use ‘two.sided’ (the default) if testing for either a positive or a negative correlation, ‘greater’ for a positive correlation, and ‘less’ for a negative correlation
Correlation, example 2:
We would like to see whether there is a positive correlation between the tenderness of tuna flesh (x) and the consumer panel scores (y). This time we make sure to check the normality assumption first.
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y); qqnorm(x); qqline(x); qqnorm(y); qqline(y)
The assumption of normality seems violated (check also using
shapiro.test()) and the sample size is small, so we will use the non-parametric (rank-based) correlation tests.
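For instance, the Shapiro-Wilk test mentioned above can be run on each variable directly (a small p-value suggests a departure from normality):

x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
shapiro.test(x)    # tests H0: the data come from a normal distribution
shapiro.test(y)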
Correlation, example 2 (continued)
cor.test(x, y, method = 'kendall', alternative = 'greater')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.4444444
cor.test(x, y, method = 'spearman', alternative = 'greater')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6
Correlation, example 2 (continued)
What would happen to the p-value if we tested two-sided?
cor.test(x, y, method = 'kendall')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.1194
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.4444444
cor.test(x, y, method = 'spearman')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0968
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6
- The p-values are now both non-significant. Why?
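The reason is that the null distributions of these test statistics are symmetric, so the two-sided p-value is simply twice the one-sided one (0.0484 becomes 0.0968, and 0.05972 becomes 0.1194), pushing both above 0.05. This can be verified in R:

x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
2 * cor.test(x, y, method = 'kendall', alternative = 'greater')$p.value   # 0.1194
cor.test(x, y, method = 'kendall')$p.value                                # the same value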
Reporting statistical results in a text
Correlation test:
- Significant: E.g. ‘Antibody levels were significantly correlated with rat size (Pearson’s correlation test, p = 0.01545)’
- Non-significant: E.g. ‘Antibody levels were not significantly correlated with rat size (Pearson’s correlation test, p = 0.92)’
- Non-parametric correlation tests: ditto, but ‘…(Kendall’s test, p = …)’
P-values
- Best to indicate exact p-value
- Sometimes, asterisks are used to designate the significance in a plot: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001
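A small helper for this convention could look as follows (p_stars is a hypothetical name of our own, not a standard function; base R's symnum() provides similar functionality):

p_stars = function(p) {
  # map a p-value to the conventional significance asterisks
  if (p < 0.001) "***"
  else if (p < 0.01) "**"
  else if (p < 0.05) "*"
  else ""
}
p_stars(0.0484)    # returns "*"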
In a nutshell
- Covariance extends the idea of variance to two variables; its standardised counterpart is the correlation coefficient
- The (Pearson) correlation test uses the t-statistic; non-parametric alternatives are Spearman's and Kendall's tests
- A correlation coefficient of 0.2 can be considered weak, one above 0.5 strong, depending on the subject matter