Correlation coefficients

Correlation analysis for different data sets

Below are plots of four data sets: top left (a), a linear relationship between time and score; top right (b), a nonlinear relationship; bottom left (c), a linear relationship with one outlier; bottom right (d), a nonlinear relationship, using the R data set “pressure”.

R code for calculation of the different correlations:

cor(x, y, method = c("pearson", "kendall", "spearman"))
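As a rough illustration of how the method argument changes the result, the sketch below uses a small made-up data set (not one of the four plotted data sets): a perfectly linear relationship in which the last point is replaced by an outlier. The vectors x and y are assumptions made only for this example.

```r
# Made-up data: y = 2 * x, with the last observation replaced by an outlier
x <- 1:10
y <- 2 * x
y[10] <- -20

# cor() uses the first element of `method` ("pearson") when none is given
r_pearson  <- cor(x, y)
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

round(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall), 3)
```

In this toy case the single outlier pulls the Pearson correlation to roughly zero, while the rank-based coefficients stay clearly positive (Spearman 5/11, Kendall 0.6).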

Calculation for correlation

Download the data for plot a and load the data … . Suppose the data for (a) is loaded into the data frame “dat” in R and the data for (d) is loaded into “dat1”.

#pearson correlation
cor(dat$time, dat$score, method = "pearson")

## [1] 0.951577

#spearman correlation
cor(dat$time, dat$score, method = "spearman")

## [1] 0.9537421

#kendall's tau correlation
cor(dat$time, dat$score, method = "kendall")

## [1] 0.8390531

For the data set shown in a: according to its scatter plot (linear, no outliers), the Pearson correlation is the best choice. The sample Pearson correlation is 0.95.

For the data set shown in d: according to its scatter plot (a nonlinear but monotonic relationship), the Pearson correlation may not be a good choice. Use the Spearman correlation or Kendall's tau instead. Note that the Spearman (0.95) and Kendall's tau (0.84) values printed above were computed from dat, that is, data set a; for data set d, rerun the same calls on the columns of dat1.
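The column names of dat1 are not given here, so the following sketch works directly on R's built-in pressure data set (columns temperature and pressure), which is assumed to match the data plotted in d.

```r
# Built-in data set with columns `temperature` and `pressure`
data(pressure)

p_pearson  <- cor(pressure$temperature, pressure$pressure, method = "pearson")
p_spearman <- cor(pressure$temperature, pressure$pressure, method = "spearman")
p_kendall  <- cor(pressure$temperature, pressure$pressure, method = "kendall")

c(pearson = p_pearson, spearman = p_spearman, kendall = p_kendall)
```

Because pressure rises strictly with temperature in this data set, both rank-based coefficients equal 1, while the Pearson correlation is below 1 since the relationship is not linear.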

Correlation tests

A general form of the hypotheses for a correlation test is:

H0: No [monotonic, linear, …] association between the two variables. (\(H_0: \rho =0\))

H1: Two variables are related in a [monotonic, linear, …] way (\(H_1: \rho \ne 0\))

Note: for these correlation tests, a significant result (p-value < the level of significance) does not by itself indicate the strength of the association. For example, a situation with a p-value of 0.001 does not necessarily have a stronger association than a situation with a p-value of 0.04, because the p-value also depends on the sample size.
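The point of the note can be checked numerically. The sketch below uses the standard t statistic of the Pearson test, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The helper pearson_p and the chosen values of r and n are hypothetical, picked only for illustration.

```r
# Two-sided p-value of the Pearson test for a given sample correlation r
# and sample size n
pearson_p <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
}

p_weak_large   <- pearson_p(r = 0.1, n = 2000)  # weak association, large sample
p_strong_small <- pearson_p(r = 0.6, n = 10)    # stronger association, small sample

c(weak_large = p_weak_large, strong_small = p_strong_small)
```

The weak correlation (r = 0.1) is highly significant because of the large sample, while the stronger one (r = 0.6) is not significant at the 5% level with only 10 observations.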

Example: Use the R cars data. Test if dist and speed are related.

plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")

The two quantitative variables are linearly related and there are no outliers, so the Pearson correlation is appropriate for this problem.

rout <- cor.test(cars$speed, cars$dist)
rout

## 
##  Pearson's product-moment correlation
## 
## data:  cars$speed and cars$dist
## t = 9.464, df = 48, p-value = 1.49e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6816422 0.8862036
## sample estimates:
##       cor 
## 0.8068949

The sample (Pearson) correlation coefficient between dist and speed is 0.8068949. The test statistic is t = 9.464 with 48 degrees of freedom, and the corresponding p-value is 1.49e-12, far below the usual significance levels. We can conclude that the two variables are linearly related.
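The numbers in the printed output can also be read from the object returned by cor.test, which avoids copying them by hand. This is a minimal sketch; the variable names r_hat, t_stat, dof, p_val, ci, and t_manual are chosen here for illustration.

```r
rout <- cor.test(cars$speed, cars$dist)

r_hat  <- unname(rout$estimate)    # sample correlation
t_stat <- unname(rout$statistic)   # t statistic
dof    <- unname(rout$parameter)   # degrees of freedom (n - 2)
p_val  <- rout$p.value             # two-sided p-value
ci     <- rout$conf.int            # 95% confidence interval for the correlation

# The t statistic can be recovered from the sample correlation
t_manual <- r_hat * sqrt(dof) / sqrt(1 - r_hat^2)

c(r = r_hat, t = t_stat, t_manual = t_manual, df = dof, p = p_val)
```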

R code for other similar cases:

# Test if dist and speed are positively related based on the kendall tau correlation.
cor.test(cars$speed,  cars$dist, alternative = "greater",
         method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  cars$speed and cars$dist
## z = 6.6655, p-value = 1.319e-11
## alternative hypothesis: true tau is greater than 0
## sample estimates:
##       tau 
## 0.6689901

# Test if dist and speed are positively related based on the spearman correlation.
cor.test(cars$speed,  cars$dist, alternative = "greater",
         method = "spearman")

## Warning in cor.test.default(cars$speed, cars$dist, alternative = "greater", :
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  cars$speed and cars$dist
## S = 3532.8, p-value = 4.412e-14
## alternative hypothesis: true rho is greater than 0
## sample estimates:
##       rho 
## 0.8303568
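The warning above appears because speed and dist contain ties, so R falls back from the exact p-value to an asymptotic approximation. Asking for the approximation explicitly with exact = FALSE should give the same result without the warning. The object name sp is chosen here for illustration.

```r
# exact = FALSE requests the asymptotic approximation up front,
# so no warning about ties is raised
sp <- cor.test(cars$speed, cars$dist, alternative = "greater",
               method = "spearman", exact = FALSE)

sp$estimate
sp$p.value
```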

Reference:

Edwin van den Heuvel & Zhuozhao Zhan (2022) Myths About Linear and Monotonic Associations: Pearson’s r, Spearman’s ρ, and Kendall’s τ, The American Statistician, 76:1, 44-52, DOI: 10.1080/00031305.2021.2004922