Correlation measures the strength and direction of association between two variables.1 See explanations at Statistics Solutions, PSU STAT 500, and NASCAR Winston Cup Race Results for 1975-2003 by Larry Winner in Journal of Statistics Education. There are three common correlation tests: the Pearson product moment correlation (Pearson), Kendall’s tau correlation (Kendall’s tau), and Spearman’s rank-order correlation (Spearman rho).
The Pearson correlation is valid if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.
The Spearman and the Kendal correlations are non-parametric measures.2 See discussion of parametric vs nonparametric at Statistics How To. They are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only these correlations if Pearson does not apply.
The Pearson product moment correlation coefficient r estimates population correlation \(\rho\). The test evaluates \(H_0: \rho = 0\) and constructs a \((1 - \alpha)\%\) confidence interval around \(r\):
\(r = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}}\)
\(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship, and \(r = 0\) when there is no linear relationship. A correlation in the range \((.1, .3)\) is small, \((.3, .5)\) is medium, and \((.5, 1.0)\) is large.
Only use the Pearson correlation when
Test \(H_0: \rho = 0\) with test statistic \(T = r \sqrt{\frac{n-2}{1-r^2}}\). The test also defines a \((1 - \alpha)\%\) confidence interval.3 I found an explanation of the confidence interval here.
The nascard data set consists of 898 races from 1975 - 2003. In race 1 of 1975, what was the correlation between the driver’s finishing position and prize?
library(dplyr) # for glimpse()
nascard <- read.fwf(file = url("http://jse.amstat.org/datasets/nascard.dat.txt"),
widths = c(5, 6, 4, 4, 4, 5, 9, 4, 11, 30),
col.names = c('series_race', 'year', 'race_year',
'finish_pos', 'start_pos',
'laps_comp', 'winnings', 'num_cars',
'car_make', 'driver'))
nascard_sr1 <- nascard[nascard$series_race == 1,]
glimpse(nascard_sr1)
## Observations: 35
## Variables: 10
## $ series_race <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ year <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 19...
## $ race_year <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ finish_pos <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ start_pos <dbl> 1, 2, 25, 27, 22, 35, 3, 21, 33, 10, 17, 24, 6, 28...
## $ laps_comp <dbl> 191, 191, 184, 183, 178, 175, 172, 168, 167, 166, ...
## $ winnings <dbl> 12035, 8135, 6535, 5035, 3835, 2885, 3185, 2485, 2...
## $ num_cars <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35...
## $ car_make <fct> Matador , Mercury , Chevrolet , Dodge ...
## $ driver <fct> BobbyAllison , DavidPearson ...
Explore the relationship with a scatterplot. As one would expect, there is a negative relationship between finish position and winnings, but it is not linear.
library(ggplot2)
ggplot(data = nascard_sr1, aes(x = finish_pos, y = winnings)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
However, 1/winnings may be linearly related to finish position.4 I tried several variable tranformations from Stat Trek, and chose the tranformation with the highest \(R^2\) that also had a normal distribution in each variable.
ggplot(data = nascard_sr1, aes(x = finish_pos, y = 1/winnings)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
#summary(lm(winnings ~ finish_pos, data = nascard_sr1)) # r2 = .5474
#summary(lm(log(winnings) ~ finish_pos, data = nascard_sr1)) # r2 = .9015
summary(lm(1/winnings ~ finish_pos, data = nascard_sr1)) # r2 = .9509
##
## Call:
## lm(formula = 1/winnings ~ finish_pos, data = nascard_sr1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.343e-04 -7.449e-05 -1.710e-06 4.575e-05 3.623e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.829e-05 3.682e-05 -0.497 0.623
## finish_pos 4.512e-05 1.784e-06 25.292 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0001066 on 33 degrees of freedom
## Multiple R-squared: 0.9509, Adjusted R-squared: 0.9495
## F-statistic: 639.7 on 1 and 33 DF, p-value: < 2.2e-16
#summary(lm(winnings ~ log(finish_pos), data = nascard_sr1)) # r2 = .883
#summary(lm(log(winnings) ~ log(finish_pos), data = nascard_sr1)) # r2 = .9766
#summary(lm(winnings ~ log(finish_pos), data = nascard_sr1)) # r2 = .883
The finish position and prize are quantitative ratio variables, so the Pearson’s product moment correlation should apply. First check whether each variable is normally distributed. The normal distribution plots look good and the Anderson-Darling tests both fail to reject the normality null hypothesis.
qqnorm(nascard_sr1$finish_pos)
qqline(nascard_sr1$finish_pos)
qqnorm(1/(nascard_sr1$winnings))
qqline(1/(nascard_sr1$winnings))
library(nortest) # for Anderson-Darling
ad.test(nascard_sr1$finish_pos)
##
## Anderson-Darling normality test
##
## data: nascard_sr1$finish_pos
## A = 0.37344, p-value = 0.3989
ad.test(1/nascard_sr1$winnings)
##
## Anderson-Darling normality test
##
## data: 1/nascard_sr1$winnings
## A = 0.3292, p-value = 0.505
The Pearson product moment correlation is \(r = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}} = .9752\)
x <- nascard_sr1$finish_pos
y <- 1/nascard_sr1$winnings
(r = sum((x - mean(x)) * (y - mean(y))) /
sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)))
## [1] 0.9751633
The test statistic is \(T = r \sqrt{\frac{n-2}{1-r^2}} = .9752 \sqrt{\frac{35-2}{1-.9752^2}} = 25.292\). \(P(T>.9752) < .0001\), so reject \(H_0\) that the correlation is zero.
n <- nrow(nascard_sr1)
(t = r * sqrt((n-2) / (1 - r^2)))
## [1] 25.29218
pt(q = t, df = n-2, lower.tail = FALSE) * 2
## [1] 3.516944e-23
R function cor.test performs these calculations. Specify method = "pearson".
cor.test(x = nascard_sr1$finish_pos,
y = 1/nascard_sr1$winnings,
alternative = "two.sided",
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nascard_sr1$finish_pos and 1/nascard_sr1$winnings
## t = 25.292, df = 33, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9509453 0.9875017
## sample estimates:
## cor
## 0.9751633
Spearman’s \(\rho\) is the Pearson’s product moment correlation coefficient applied to the rank of each variable in the sample observations. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs. The mean rank for each is \(\bar{X} = \bar{Y} = (n+1)/2\). Calculate Pearson’s product moment correlation coefficient with these ranks.
\(\hat{\rho} = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}}\)
Test \(H_0: \rho = 0\) with test statistic \(Z = \hat{\rho} \sqrt{\frac{n-2}{1 - \hat{\rho}^2}}\). As a non-parametric test, Spearman’s \(\rho\) does not produce a confidence interval.
From the nascard data set above, in race 1 of 1975, what was the correlation between the driver’s starting position start_pos and finishing position finish_pos?
Start with a visual exploration of the relationship with a scatterplot. There does not appear to be much of a relationship.
library(dplyr)
library(ggplot2)
nascard_sr1 %>%
ggplot(aes(x = start_pos, y = finish_pos)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Both star_pos and finish_pos are ordinal variables, so use Spearman’s rho instead of Pearson.
Replace the variable values with their ranks. In this case, the value and their ranks are the same because the values are their ranks. Then calculate Spearman’s \(\rho\) the same as Pearson. \(\hat{\rho} = \frac{\sum_{i=1}^n{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum_{i=1}^n{(X_i - \bar{X})^2 \sum_{i=1}^n{(Y_i - \bar{Y})^2}}}} = -.0384\).
x <- rank(nascard_sr1$start_pos)
y <- rank(nascard_sr1$finish_pos)
(rho = sum((x - mean(x)) * (y - mean(y))) /
sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)))
## [1] -0.03837535
The test statistic is \(Z = \hat{\rho} \sqrt{\frac{n-2}{1 - \hat{\rho}^2}} = -.2206\). \(P(Z < -.2206) = .8268\), so do not reject \(H_0\) that the correlation is zero.
n <- nrow(nascard_sr1)
(z = rho * sqrt(n-2) / sqrt(1 - rho^2))
## [1] -0.2206121
pt(q = abs(z), df = n-2, lower.tail = FALSE) * 2
## [1] 0.8267537
R function cor.test performs these calculations. Specify method = "spearman".
cor.test(x = nascard_sr1$start_pos,
y = nascard_sr1$finish_pos,
alternative = "two.sided",
method = "spearman")
##
## Spearman's rank correlation rho
##
## data: nascard_sr1$start_pos and nascard_sr1$finish_pos
## S = 7414, p-value = 0.8264
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.03837535
The Kendall Rank correlation coefficient (Kendall’s Tau) measures the relationship between ranked variables.5 See article at Statistics How To Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs. For each \(X_i\), count \(Y > X_i\). The total count is \(k\). Kendall’s tau is
\(\hat{\tau} = \frac{4k}{n(n-1)} -1\)
Note that if \(X\) and \(Y\) have the same rank orders, \(\hat{\tau} = 1\), and if they have opposite rank orders \(\hat{\tau} = -1\). Test \(H_0: \tau = 0\) with test statistic \(Z = \hat{\tau} \sqrt{\frac{9n(n - 1)}{2(2n + 5)}}\). As a non-parametric test, Kendall’s \(\tau\) does not produce a confidence interval.
From the nascard data set above, in race 1 of 1975, what was the correlation between the driver’s starting position start_pos and finishing position finish_pos?
Again, because both start_pos and finish_pos are ordinal variables, we use a non-parametric test. This time use Kendall’s tau.
For each observation, count the number of other observations where both start_pos and finish_pos is larger. The sum is \(k = 293\). Then calculate Kendall’s \(\tau\) is \(\hat{\tau} = \frac{4 * 293}{35(35-1)} -1 = -.0151\).
n <- nrow(nascard_sr1)
k <- 0
for(i in 1:n) {
start_pos_i = nascard_sr1[i, 'start_pos']
finish_pos_i = nascard_sr1[i, 'finish_pos']
for (j in 1:n) {
start_pos_j = nascard_sr1[j, 'start_pos']
finish_pos_j = nascard_sr1[j, 'finish_pos']
if(finish_pos_j > finish_pos_i &&
start_pos_j > start_pos_i) {
k <- k + 1
}
}
}
(k)
## [1] 293
(tau = 4.0*k / (n * (n - 1)) - 1)
## [1] -0.01512605
The test statistic is \(Z = \hat{\tau} \sqrt{\frac{9n(n - 1)}{2(2n + 5)}} = -0.0151 \sqrt{\frac{9(35)(35 - 1)}{2(2(35) + 5)}} = -0.1278\). \(P(Z < -.1278) = .8991\), so do not reject \(H_0\) that the correlation is zero.
(z = tau * sqrt(9 * n * (n-1) / (2 * (2*n + 5))))
## [1] -0.1278129
pt(q = abs(z), df = n-2, lower.tail = FALSE) * 2
## [1] 0.8990728
R function cor.test performs these calculations. Specify method = "kendall".
cor.test(x = nascard_sr1$start_pos,
y = nascard_sr1$finish_pos,
alternative = "two.sided",
method = "kendall")
##
## Kendall's rank correlation tau
##
## data: nascard_sr1$start_pos and nascard_sr1$finish_pos
## T = 293, p-value = 0.9102
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.01512605
Calculate both Spearman’s rho and Kendall’s tau for all 898 races in the nascard data set to compare the results.
series_race <- nascard %>%
distinct(nascard$series_race, nascard$year)
spearman <- sapply(series_race[,1], function(series_race) {
cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
y = nascard[nascard$series_race == series_race, 'finish_pos'],
alternative = "two.sided",
method = "spearman")$estimate
})
spearman.p <- sapply(series_race[,1], function(series_race) {
cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
y = nascard[nascard$series_race == series_race, 'finish_pos'],
alternative = "two.sided",
method = "spearman")$p.value
})
kendall <- sapply(series_race[,1], function(series_race) {
cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
y = nascard[nascard$series_race == series_race, 'finish_pos'],
alternative = "two.sided",
method = "kendall")$estimate
})
kendall.p <- sapply(series_race[,1], function(series_race) {
cor.test(x = nascard[nascard$series_race == series_race, 'start_pos'],
y = nascard[nascard$series_race == series_race, 'finish_pos'],
alternative = "two.sided",
method = "kendall")$p.value
})
nascard_smry <- data.frame(series_race = series_race$`nascard$series_race`,
year = series_race$`nascard$year`,
spearman, spearman.p,
kendall, kendall.p)
library(tidyr)
nascard_smry2 <- nascard_smry %>%
group_by(year) %>%
summarize(spearman_avg = mean(spearman),
kendall_avg = mean(kendall)) %>%
gather(key = method, value = estimate, c(spearman_avg, kendall_avg))
ggplot(data = nascard_smry2, aes(x = year, color = method)) +
geom_line(aes(y = estimate)) +
geom_line(aes(y = estimate)) +
expand_limits(y = 0) +
labs(title = "Average Rank Correlations vs Year")
Summarize the results in a contingency table. The two tests usually return the same results in terms of significance.
nascard_smry$spearman_assoc <-
ifelse(nascard_smry$spearman.p < .05, "sig", "insig")
nascard_smry$kendall_assoc <-
ifelse(nascard_smry$kendall.p < .05, "sig", "insig")
table(nascard_smry$spearman_assoc, nascard_smry$kendall_assoc)
##
## insig sig
## insig 344 18
## sig 11 525
We can calculate the Pearson coefficient to measure the relationship between Spearman’s rho and Kendall’s tau.
ggplot(data = nascard_smry, aes(x = kendall, y = spearman)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
labs(title = "Spearman's rho vs Kendall's tau")
The scatterplot with fitted line has r-squared equal to the Pearson coefficient squared.
cor.test(x = nascard_smry$spearman,
y = nascard_smry$kendall,
alternative = "two.sided",
method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nascard_smry$spearman and nascard_smry$kendall
## t = 219.64, df = 896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9895653 0.9919612
## sample estimates:
## cor
## 0.9908409
summary(lm(kendall~spearman, data = nascard_smry))
##
## Call:
## lm(formula = kendall ~ spearman, data = nascard_smry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.061417 -0.012363 -0.001883 0.011648 0.074801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006944 0.001369 -5.074 4.74e-07 ***
## spearman 0.741603 0.003376 219.641 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01904 on 896 degrees of freedom
## Multiple R-squared: 0.9818, Adjusted R-squared: 0.9817
## F-statistic: 4.824e+04 on 1 and 896 DF, p-value: < 2.2e-16
```