Correlations - Pearson, Spearman, Kendall

3 Types of Correlations

Pearson

For a population

\[ \rho_{XY}=\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y} \]

For a sample

\[ r_{XY}=\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}} \]
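As a quick check, the sample formula can be evaluated by hand and compared with R's built-in `cor()`. This is a minimal sketch; the vectors `x` and `y` are made-up toy data, not taken from any data set used in this post.

```r
# Hypothetical toy data, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Deviations from the sample means
dx <- x - mean(x)
dy <- y - mean(y)

# Sample Pearson correlation, term by term from the formula
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Built-in equivalent
r_builtin <- cor(x, y, method = "pearson")

# The two values should agree up to floating-point error
print(all.equal(r_manual, r_builtin))
```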

Properties:
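1. \(\rho_{XY}\in [-1,1]\), and likewise \(r_{XY}\in [-1,1]\).
2. A larger value of \(\vert r_{XY}\vert\) indicates a stronger linear relationship.
3. In linear regression with an intercept, the squared correlation between the observed values \(y\) and the fitted values \(\hat{y}\) equals the coefficient of determination: \[ [r(y,\hat{y})]^2=\frac{\sum_{i=1}^n (\hat{y}_i-\bar{y})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}= \text{proportion of variance in Y explained by a linear function of X.} \]
4. For simple linear regression \(y = a +\beta x +\epsilon\), the least-squares slope is \[ \hat{\beta}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}=r(x,y)\frac{sd(y)}{sd(x)}, \] where \(sd\) denotes the sample standard deviation.
5. Pearson’s \(r\) is not robust to skewed data or outliers.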
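The two regression identities in properties 3 and 4 can be checked numerically. The sketch below uses the built-in `iris` data with `Petal.Width` regressed on `Petal.Length`; that choice of variables is an assumption made only for illustration.

```r
# Simple linear regression on two iris columns (illustrative choice)
fit   <- lm(Petal.Width ~ Petal.Length, data = iris)
x     <- iris$Petal.Length
y     <- iris$Petal.Width
y_hat <- fitted(fit)

# Property 3: squared correlation between y and fitted values equals R^2
r2_from_cor <- cor(y, y_hat)^2
r2_from_ss  <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)
r2_from_lm  <- summary(fit)$r.squared

# Property 4: least-squares slope equals r * sd(y) / sd(x)
slope_from_r  <- cor(x, y) * sd(y) / sd(x)
slope_from_lm <- unname(coef(fit)["Petal.Length"])

print(c(r2_from_cor, r2_from_ss, r2_from_lm))
print(c(slope_from_r, slope_from_lm))
```

Each `print` call should show matching values up to floating-point error.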

Spearman

\[ \rho = r(rank(X), rank(Y)) \]

A perfect Spearman correlation (\(\rho=\pm 1\)) indicates a perfectly monotonic relationship, whereas a perfect Pearson correlation (\(r=\pm 1\)) indicates a perfectly linear relationship.
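To illustrate the difference, the sketch below uses a strictly increasing but nonlinear relationship, \(y = e^x\), on made-up data. Spearman's \(\rho\) is computed both with `cor(method = "spearman")` and as the Pearson correlation of the ranks.

```r
# Hypothetical data: y is a strictly increasing, nonlinear function of x
x <- 1:10
y <- exp(x)

# Spearman's rho: built-in, and as Pearson's r applied to the ranks
rho_builtin <- cor(x, y, method = "spearman")
rho_ranks   <- cor(rank(x), rank(y), method = "pearson")

# Pearson's r on the raw values
r_pearson <- cor(x, y, method = "pearson")

print(c(spearman = rho_builtin, spearman_via_ranks = rho_ranks,
        pearson = r_pearson))
```

Both Spearman values equal 1 because the relationship is monotonic, while Pearson's \(r\) stays below 1 because the relationship is not linear.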

Kendall

Assuming there are no tied pairs, so that \(n_c+n_d=\binom{n}{2}\):

\[ \tau=\frac{n_c-n_d}{n_c+n_d}=\frac{n_c-n_d}{\binom{n}{2}}=\frac{1}{n(n-1)}\sum_{i\neq j}\operatorname{sgn}(y_i-y_j)\operatorname{sgn}(x_i-x_j) \]

\(n_c\): number of concordant pairs
\(n_d\): number of discordant pairs
\(\binom{n}{2}\): total number of possible pairs

\[ \Rightarrow \frac{1+\tau}{2}=\frac{n_c}{n_c+n_d}=\text{proportion of concordant pairs} \]
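The pair-counting definition can be sketched directly in R. The helper `kendall_by_pairs` below is a hypothetical name introduced for illustration; it assumes no ties and uses the direct double loop over pairs, which is not how `cor()` is implemented internally. The toy vectors are made up.

```r
# Count concordant and discordant pairs directly (assumes no ties)
kendall_by_pairs <- function(x, y) {
  n <- length(x)
  n_c <- 0
  n_d <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) n_c <- n_c + 1   # concordant pair
      if (s < 0) n_d <- n_d + 1   # discordant pair
    }
  }
  list(n_c = n_c, n_d = n_d, tau = (n_c - n_d) / (n_c + n_d))
}

# Hypothetical toy data without ties
x <- c(1, 2, 3, 4, 5)
y <- c(1, 3, 2, 5, 4)

res <- kendall_by_pairs(x, y)
print(res)                              # n_c = 8, n_d = 2, tau = 0.6
print(cor(x, y, method = "kendall"))    # 0.6
print((1 + res$tau) / 2)                # 0.8, the proportion of concordant pairs
```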

Spearman’s \(\rho\) versus Kendall’s \(\tau\):
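1. Kendall’s \(\tau\) is more directly interpretable: \(\frac{1+\tau}{2}\) is the proportion of concordant pairs.
2. The sampling distribution of Kendall’s \(\tau\) approaches a normal distribution more rapidly as the sample size increases; thus, for small data sets, Kendall’s \(\tau\) is preferred.
3. Computational complexity (\(n\) = sample size):
  - Kendall’s \(\tau\): \(O(n^2)\) when every pair is compared directly (sort-based algorithms can reach \(O(n\log n)\))
  - Spearman’s \(\rho\): \(O(n\log n)\), dominated by the sorting needed to compute ranks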

Correlation Plot

Define a corPlot function

corPlot = function(dt, group){
  # Customize the lower panel: correlation coefficients
  correlation.panel = function(x, y){
    # Save the user coordinates and restore them when the panel exits
    usr = par("usr"); on.exit(par(usr = usr))
    par(usr = c(0, 1, 0, 1))
    
    # 3 types of correlation
    rp = round(cor(x, y, method = "pearson", use = "complete.obs"), 
               digits = 2)
    text(0.5, 0.75, paste("Pearson's r = ", rp))
    
    rs = round(cor(x, y, method = "spearman", use = "complete.obs"), 
               digits = 2)
    text(0.5, 0.5, bquote("Spearman's "~rho~" = "~.(rs)))
    
    rk = round(cor(x, y, method = "kendall", use = "complete.obs"), 
               digits = 2)
    text(0.5, 0.25, bquote("Kendall's "~tau~" = "~.(rk)))
  }
  
  # Customize the upper panel: scatter plots colored by group
  scatterplot.panel = function(x, y){
    points(x, y, pch = 16,
           col = group)
  }
  
  pairs(dt, 
        lower.panel = correlation.panel, 
        upper.panel = scatterplot.panel)
}

Example: plot iris

corPlot(iris[1:4], iris$Species)