Spearman Rank Correlation

In a previous example, linear correlation was examined with Pearson’s \(r\). The cars dataset that was examined exhibited a strong linear relationship, and thus Pearson’s correlation was ideally suited to measure how the variables depend on each other. However, there are times when data exhibits some relationship but not linear correlation. In these cases, Pearson’s correlation may report inaccurate and misleading values. Spearman’s rank correlation coefficient, or Spearman’s correlation coefficient, as the name suggests, is a nonparametric approach to measuring correlation using rank values.

Spearman’s correlation coefficient is often denoted rho \(\rho\) and measures the monotonic relationship of the variables rather than the linear association in the Pearson setting. Thus, Spearman’s correlation coefficient is more reliable with non-linear data compared to Pearson’s \(r\).

This post will explore Spearman’s correlation coefficient and its calculation. Spearman’s correlation coefficient can be considered Pearson’s \(r\) on ranked values, so several of the foundations of Spearman’s rank coefficient have already been investigated in a previous post.

Start by loading the packages that will be used.

library(ggplot2)
data("cars")

The cars dataset from the previous post will be used to see if Spearman’s \(\rho\) reports a different value than Pearson’s \(r\). The data exhibited a linear relationship, as shown in the scatterplot below, but was the correlation as strong as Pearson’s \(r\) reported (about .80)? It could be the lone value in the upper-right corner of the graph is inflating the association between the two variables.

ggplot(cars, aes(x=speed, y=dist)) + 
  geom_point(color='#2980B9', size = 4) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='#2C3E50')

Aside: Monotonic Functions

A monotonic function is a function of a dependent variable that is always increasing or decreasing as its independent variable increases.

Examples of monotonic functions:

x <- 1:100
x2 <- 1:5

y <- (1/4) * x^2 # Monotonically Increasing Function
y2 <- exp(-x2) # Montonically Decreasing
y3 <- sin(x / 5) # Not Monotonic

par(mfrow=c(1,3))
plot(x2, y2, type = 'l', main = 'Monotonically Decreasing', frame.plot = FALSE)
plot(x, y, type = 'l', main = 'Monotonically Increasing', frame.plot = FALSE)
plot(x, y3, type = 'l', main = 'Not Monotonic', frame.plot = FALSE)

Judging from the monotonically increasing and decreasing graphs, it can be seen that the y variable has a positive or negative relationship with the independent x variable. In the example non-monotonic function \(\sin{\frac{x}{5}}\), y always stays between -1 and 1 regardless of how much x increases. The graphs exhibit similarities to positively, negatively, and non-correlated variables. Hence, the motivation to examine the monotonic relationship of the variables to assess the strength of the correlation, if any.

Spearman’s Correlation Coefficient in R

Spearman’s rank correlation coefficient can easily be calculated by again using the cor.test() function in R. The only difference from the Pearson example is the method argument.

corr <- cor.test(x=cars$speed, y=cars$dist, method = 'spearman')

## Warning in cor.test.default(x = cars$speed, y = cars$dist, method =
## "spearman"): Cannot compute exact p-value with ties

corr

## 
##  Spearman's rank correlation rho
## 
## data:  cars$speed and cars$dist
## S = 3532.8, p-value = 8.825e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8303568

Spearman’s \(\rho\) is reported as .83, slightly higher than Pearson’s \(r\) of .80, indicating a stronger correlation than previously estimated with Pearson’s correlation. Since Pearson’s \(r\) measures the linear association between two variables, it is likely the outlier in the top right corner of the first graph was causing under-reported values as linear relationships can be extremely affected by outliers. A great answer on CrossValidated has more explanation on the effect of outliers on \(r\).

Manually Calculating Spearman’s \(\rho\)

When the data contains no ties, \(\rho\) can be found by taking the difference of the ranked values using the following equation:

\[ r_s = \frac{6 \sum d_i^2}{n(n^2 - 1)} \]

Where \(d_i\) is the difference of the ranked variables and \(n\) is the number of samples.

Since the cars dataset contains ties, the above formula will report incorrect values. When the data contain ties, the following equation can be used:

\[ r_s = \rho_{rg_x, rg_y} = \frac{cov(rg_x, rg_y)}{\sigma_{rg_x} \sigma_{rg_y}} \]

Where \(\rho\) denotes the usual Pearson coefficient but on the ranked variables. \(cov(rg_x, rg_y)\) is the covariance and \(\sigma_{rg_x}, \sigma_{rg_y}\) are the standard deviations of the ranked variables.

To calculate Spearman’s \(\rho\) in R, first, rank the \(x\) and \(y\) variables. A new data.frame is created to keep the ranked variables.

cars.ranked <- data.frame(cbind(rank(cars$speed, ties.method = 'average'),
                                rank(cars$dist, ties.method = 'average')))

colnames(cars.ranked) <- c('speed', 'dist')

Take the covariance of the variables and divide by the product of the \(x\) and \(y\) variables’ standard deviations to find Spearman’s \(\rho\).

rho <- cov(cars.ranked) / (sd(cars.ranked$speed) * sd(cars.ranked$dist))
rho[[2]]

## [1] 0.8303568

corr$estimate

##       rho 
## 0.8303568

The S-statistic reported from the cor.test() function is defined as:

\[ (n^3 - n) \frac{1 - r}{6} \]

Where \(r\) is Pearson’s correlation coefficient on the ranked variables (StackOverflow) and \(n\) is equal to the sample size of variables. Therefore, find \(r\) and \(n\) and enter the values into the equation to get S.

n <- length(cars.ranked$speed)
r <- cor(x = cars.ranked$speed, y = cars.ranked$dist, method = 'pearson')
s <- (n^3 - n) * (1 - r) / 6
s

## [1] 3532.819

corr$statistic

##        S 
## 3532.819

Though the p-value cannot be exactly computed due to ties, it can be closely approximated as it is distributed approximately to the Student’s t-distribution. Similar to Pearson’s coefficient, the t-value is found by:

\[ t = r \sqrt{\frac{n - 2}{1 - r^2}} \]

The test is two-tailed since it can take positive or negative values and is thus multiplied by 2.

t <- r * sqrt((n - 2) / (1 - r^2))
p <- 2 * (1-pt(q = t, df = n - 2))
p

## [1] 8.837375e-14

corr$p.value

## [1] 8.824558e-14

The calculated p-value is very close to the output of the cor.test() function. I’m not exactly sure how the p-value is computed in the cor.test() function, it’s likely an asymptotic approach due to the sample size. However, both p-values are so similar it can be agreed the manual calculation verifies the output.

Summary

Spearman’s \(\rho\) was examined by using the cor.test() function available in base R as well as manual calculations. \(\rho\) was slightly higher compared to Pearson’s \(r\) coefficient, indicating the correlation between speed and stopping distance is slightly more strong than originally thought. As assumed at the beginning of this post, it is likely the outlier in the data caused Pearson’s \(r\) to report a slightly lower correlation. In either case, the result of the strong correlation between the two variables is the same. Other methods of calculating correlation such as Kendall’s \(\tau\) will be examined in future examples.

References

Test statistic for Spearman rank correlation. (n.d.), from http://stackoverflow.com/questions/15407656/test-statistic-for-spearman-rank-correlation

Monotonic function. (n.d.), from https://en.wikipedia.org/wiki/Monotonic_function

Spearman’s rank correlation coefficient (2016) from https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient