The Pearson correlation is valid if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.

The Spearman and the Kendal correlations are non-parametric measures.22 See discussion of parametric vs nonparametric at Statistics How To. They are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only these correlations if Pearson does not apply.

Pearson’s Product Moment Correlation

The Pearson product moment correlation coefficient r estimates population correlation \(\rho\). The test evaluates \(H_0: \rho = 0\) and constructs a \((1 - \alpha)\%\) confidence interval around \(r\):

\(r = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}}\)

\(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship, and \(r = 0\) when there is no linear relationship. A correlation in the range \((.1, .3)\) is small, \((.3, .5)\) is medium, and \((.5, 1.0)\) is large.

Only use the Pearson correlation when

the variables are interval or ratio,
the variables are normally distributed,
the variables are linearly related,
there are minimal outliers (or they are removed entirely), and
the residuals are homoscedastic.

Test \(H_0: \rho = 0\) with test statistic \(T = r \sqrt{\frac{n-2}{1-r^2}}\). The test also defines a \((1 - \alpha)\%\) confidence interval.33 I found an explanation of the confidence interval here.

Pearson’s Product Moment Correlation Example

The nascard data set consists of 898 races from 1975 - 2003. In race 1 of 1975, what was the correlation between the driver’s finishing position and prize?

library(dplyr)  # for glimpse()
nascard <- read.fwf(file = url("http://jse.amstat.org/datasets/nascard.dat.txt"),
                    widths = c(5, 6, 4, 4, 4, 5, 9, 4, 11, 30),
                    col.names = c('series_race', 'year', 'race_year',
                                  'finish_pos', 'start_pos',
                                  'laps_comp', 'winnings', 'num_cars',
                                  'car_make', 'driver'))
nascard_sr1 <- nascard[nascard$series_race == 1,]
glimpse(nascard_sr1)

## Observations: 35
## Variables: 10
## $ series_race <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ year        <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 19...
## $ race_year   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ finish_pos  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ start_pos   <dbl> 1, 2, 25, 27, 22, 35, 3, 21, 33, 10, 17, 24, 6, 28...
## $ laps_comp   <dbl> 191, 191, 184, 183, 178, 175, 172, 168, 167, 166, ...
## $ winnings    <dbl> 12035, 8135, 6535, 5035, 3835, 2885, 3185, 2485, 2...
## $ num_cars    <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35...
## $ car_make    <fct> Matador    , Mercury    , Chevrolet  , Dodge      ...
## $ driver      <fct> BobbyAllison                  , DavidPearson      ...

Explore the relationship with a scatterplot. As one would expect, there is a negative relationship between finish position and winnings, but it is not linear.

library(ggplot2)
ggplot(data = nascard_sr1, aes(x = finish_pos, y = winnings)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

However, 1/winnings may be linearly related to finish position.44 I tried several variable tranformations from Stat Trek, and chose the tranformation with the highest \(R^2\) that also had a normal distribution in each variable.

ggplot(data = nascard_sr1, aes(x = finish_pos, y = 1/winnings)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

#summary(lm(winnings ~ finish_pos, data = nascard_sr1)) # r2 = .5474
#summary(lm(log(winnings) ~ finish_pos, data = nascard_sr1)) # r2 = .9015
summary(lm(1/winnings ~ finish_pos, data = nascard_sr1)) # r2 = .9509

## 
## Call:
## lm(formula = 1/winnings ~ finish_pos, data = nascard_sr1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.343e-04 -7.449e-05 -1.710e-06  4.575e-05  3.623e-04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.829e-05  3.682e-05  -0.497    0.623    
## finish_pos   4.512e-05  1.784e-06  25.292   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0001066 on 33 degrees of freedom
## Multiple R-squared:  0.9509, Adjusted R-squared:  0.9495 
## F-statistic: 639.7 on 1 and 33 DF,  p-value: < 2.2e-16

#summary(lm(winnings ~ log(finish_pos), data = nascard_sr1)) # r2 = .883
#summary(lm(log(winnings) ~ log(finish_pos), data = nascard_sr1)) # r2 = .9766
#summary(lm(winnings ~ log(finish_pos), data = nascard_sr1)) # r2 = .883

The finish position and prize are quantitative ratio variables, so the Pearson’s product moment correlation should apply. First check whether each variable is normally distributed. The normal distribution plots look good and the Anderson-Darling tests both fail to reject the normality null hypothesis.

qqnorm(nascard_sr1$finish_pos)
qqline(nascard_sr1$finish_pos)

qqnorm(1/(nascard_sr1$winnings))
qqline(1/(nascard_sr1$winnings))

library(nortest)  # for Anderson-Darling 
ad.test(nascard_sr1$finish_pos)

## 
##  Anderson-Darling normality test
## 
## data:  nascard_sr1$finish_pos
## A = 0.37344, p-value = 0.3989

ad.test(1/nascard_sr1$winnings)

## 
##  Anderson-Darling normality test
## 
## data:  1/nascard_sr1$winnings
## A = 0.3292, p-value = 0.505

The Pearson product moment correlation is \(r = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}} = .9752\)

x <- nascard_sr1$finish_pos
y <- 1/nascard_sr1$winnings
(r = sum((x - mean(x)) * (y - mean(y))) /
   sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)))

## [1] 0.9751633

The test statistic is \(T = r \sqrt{\frac{n-2}{1-r^2}} = .9752 \sqrt{\frac{35-2}{1-.9752^2}} = 25.292\). \(P(T>.9752) < .0001\), so reject \(H_0\) that the correlation is zero.

n <- nrow(nascard_sr1)
(t = r * sqrt((n-2) / (1 - r^2)))

## [1] 25.29218

pt(q = t, df = n-2, lower.tail = FALSE) * 2

## [1] 3.516944e-23

R function cor.test performs these calculations. Specify method = "pearson".

cor.test(x = nascard_sr1$finish_pos,
         y = 1/nascard_sr1$winnings,
         alternative = "two.sided",
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  nascard_sr1$finish_pos and 1/nascard_sr1$winnings
## t = 25.292, df = 33, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9509453 0.9875017
## sample estimates:
##       cor 
## 0.9751633

Spearman’s Rank-Order Correlation

Spearman’s \(\rho\) is the Pearson’s product moment correlation coefficient applied to the rank of each variable in the sample observations. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs. The mean rank for each is \(\bar{X} = \bar{Y} = (n+1)/2\). Calculate Pearson’s product moment correlation coefficient with these ranks.

\(\hat{\rho} = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}}\)

Test \(H_0: \rho = 0\) with test statistic \(Z = \hat{\rho} \sqrt{\frac{n-2}{1 - \hat{\rho}^2}}\). As a non-parametric test, Spearman’s \(\rho\) does not produce a confidence interval.

Example Using Spearman’s Rho

From the nascard data set above, in race 1 of 1975, what was the correlation between the driver’s starting position start_pos and finishing position finish_pos?

Start with a visual exploration of the relationship with a scatterplot. There does not appear to be much of a relationship.

library(dplyr)
library(ggplot2)

nascard_sr1 %>%
  ggplot(aes(x = start_pos, y = finish_pos)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Both star_pos and finish_pos are ordinal variables, so use Spearman’s rho instead of Pearson.

Replace the variable values with their ranks. In this case, the value and their ranks are the same because the values are their ranks. Then calculate Spearman’s \(\rho\) the same as Pearson. \(\hat{\rho} = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}} = -.0384\).

x <- rank(nascard_sr1$start_pos)
y <- rank(nascard_sr1$finish_pos)

(rho = sum((x - mean(x)) * (y - mean(y))) /
   sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)))

## [1] -0.03837535

The test statistic is \(Z = \hat{\rho} \sqrt{\frac{n-2}{1 - \hat{\rho}^2}} = -.2206\). \(P(Z < -.2206) = .8268\), so do not reject \(H_0\) that the correlation is zero.

n <- nrow(nascard_sr1)
(z = rho * sqrt(n-2) / sqrt(1 - rho^2))

## [1] -0.2206121

pt(q = abs(z), df = n-2, lower.tail = FALSE) * 2

## [1] 0.8267537

R function cor.test performs these calculations. Specify method = "spearman".

cor.test(x = nascard_sr1$start_pos,
         y = nascard_sr1$finish_pos,
         alternative = "two.sided",
         method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  nascard_sr1$start_pos and nascard_sr1$finish_pos
## S = 7414, p-value = 0.8264
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.03837535

Kendall’s Tau

The Kendall Rank correlation coefficient (Kendall’s Tau) measures the relationship between ranked variables.55 See article at Statistics How To Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs. For each \(X_i\), count \(Y > X_i\). The total count is \(k\). Kendall’s tau is

\(\hat{\tau} = \frac{4k}{n(n-1)} -1\)

Note that if \(X\) and \(Y\) have the same rank orders, \(\hat{\tau} = 1\), and if they have opposite rank orders \(\hat{\tau} = -1\). Test \(H_0: \tau = 0\) with test statistic \(Z = \hat{\tau} \sqrt{\frac{9n(n - 1)}{2(2n + 5)}}\). As a non-parametric test, Kendall’s \(\tau\) does not produce a confidence interval.

Example Using Kendall’s Tau

From the nascard data set above, in race 1 of 1975, what was the correlation between the driver’s starting position start_pos and finishing position finish_pos?

Again, because both start_pos and finish_pos are ordinal variables, we use a non-parametric test. This time use Kendall’s tau.

For each observation, count the number of other observations where both start_pos and finish_pos is larger. The sum is \(k = 293\). Then calculate Kendall’s \(\tau\) is \(\hat{\tau} = \frac{4 * 293}{35(35-1)} -1 = -.0151\).

n <- nrow(nascard_sr1)
k <- 0
for(i in 1:n) {
  start_pos_i = nascard_sr1[i, 'start_pos']
  finish_pos_i = nascard_sr1[i, 'finish_pos']
  for (j in 1:n) {
    start_pos_j = nascard_sr1[j, 'start_pos']
    finish_pos_j = nascard_sr1[j, 'finish_pos']
    if(finish_pos_j > finish_pos_i && 
        start_pos_j > start_pos_i) {
      k <- k + 1
    }
  }
}
(k)

## [1] 293

(tau = 4.0*k / (n * (n - 1)) - 1)

## [1] -0.01512605

The test statistic is \(Z = \hat{\tau} \sqrt{\frac{9n(n - 1)}{2(2n + 5)}} = -0.0151 \sqrt{\frac{9(35)(35 - 1)}{2(2(35) + 5)}} = -0.1278\). \(P(Z < -.1278) = .8991\), so do not reject \(H_0\) that the correlation is zero.

(z = tau * sqrt(9 * n * (n-1) / (2 * (2*n + 5))))

## [1] -0.1278129

pt(q = abs(z), df = n-2, lower.tail = FALSE) * 2

## [1] 0.8990728

R function cor.test performs these calculations. Specify method = "kendall".

cor.test(x = nascard_sr1$start_pos,
         y = nascard_sr1$finish_pos,
         alternative = "two.sided",
         method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  nascard_sr1$start_pos and nascard_sr1$finish_pos
## T = 293, p-value = 0.9102
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.01512605

Tie it all together!

Calculate both Spearman’s rho and Kendall’s tau for all 898 races in the nascard data set to compare the results.

series_race <- nascard %>% 
  distinct(nascard$series_race, nascard$year)
spearman <- sapply(series_race[,1], function(series_race) {
  cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
         y = nascard[nascard$series_race == series_race, 'finish_pos'],
         alternative = "two.sided",
         method = "spearman")$estimate
})
spearman.p <- sapply(series_race[,1], function(series_race) {
  cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
         y = nascard[nascard$series_race == series_race, 'finish_pos'],
         alternative = "two.sided",
         method = "spearman")$p.value
})
kendall <- sapply(series_race[,1], function(series_race) {
  cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
         y = nascard[nascard$series_race == series_race, 'finish_pos'],
         alternative = "two.sided",
         method = "kendall")$estimate
})
kendall.p <- sapply(series_race[,1], function(series_race) {
  cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
         y = nascard[nascard$series_race == series_race, 'finish_pos'],
         alternative = "two.sided",
         method = "kendall")$p.value
})
nascard_smry <- data.frame(series_race = series_race$`nascard$series_race`, 
                           year = series_race$`nascard$year`,
                           spearman, spearman.p,
                           kendall, kendall.p)

library(tidyr)
nascard_smry2 <- nascard_smry %>% 
  group_by(year) %>%
  summarize(spearman_avg = mean(spearman),
            kendall_avg = mean(kendall)) %>%
  gather(key = method, value = estimate, c(spearman_avg, kendall_avg))
ggplot(data = nascard_smry2, aes(x = year, color = method)) +
  geom_line(aes(y = estimate)) +
  geom_line(aes(y = estimate)) +
  expand_limits(y = 0) +
  labs(title = "Average Rank Correlations vs Year")

Summarize the results in a contingency table. The two tests usually return the same results in terms of significance.

nascard_smry$spearman_assoc <- 
  ifelse(nascard_smry$spearman.p < .05, "sig", "insig")
nascard_smry$kendall_assoc <- 
  ifelse(nascard_smry$kendall.p < .05, "sig", "insig")
table(nascard_smry$spearman_assoc, nascard_smry$kendall_assoc)

##        
##         insig sig
##   insig   344  18
##   sig      11 525

We can calculate the Pearson coefficient to measure the relationship between Spearman’s rho and Kendall’s tau.

ggplot(data = nascard_smry, aes(x = kendall, y = spearman)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Spearman's rho vs Kendall's tau")

The scatterplot with fitted line has r-squared equal to the Pearson coefficient squared.

cor.test(x = nascard_smry$spearman,
         y = nascard_smry$kendall,
         alternative = "two.sided",
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  nascard_smry$spearman and nascard_smry$kendall
## t = 219.64, df = 896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9895653 0.9919612
## sample estimates:
##       cor 
## 0.9908409

summary(lm(kendall~spearman, data = nascard_smry))

## 
## Call:
## lm(formula = kendall ~ spearman, data = nascard_smry)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.061417 -0.012363 -0.001883  0.011648  0.074801 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006944   0.001369  -5.074 4.74e-07 ***
## spearman     0.741603   0.003376 219.641  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01904 on 896 degrees of freedom
## Multiple R-squared:  0.9818, Adjusted R-squared:  0.9817 
## F-statistic: 4.824e+04 on 1 and 896 DF,  p-value: < 2.2e-16

```

Correlation Test

Measuring the relationship between variables with Pearson’s Correlation, Spearman’s Rho, and Kendall’s Tau using R.

Michael Foley

2019-03-19