3 Types of Correlations

Pearson

  • For a population

\[ \rho_{XY}=\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y} \]

  • For a sample

\[ r_{XY}=\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}} \]

  • Properties:
    1. \(\rho_{XY}\in [-1,1]\).
    2. A larger value of \(\vert r_{XY}\vert\) indicates a stronger linear relationship.
    3. In least-squares linear regression with an intercept: \[ [r(y,\hat{y})]^2=\frac{\sum_{i=1}^n (\hat{y}_i-\bar{y})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}= \text{proportion of variance in } Y \text{ explained by a linear function of } X. \]
    4. For simple linear regression, \(y = \alpha +\beta x +\epsilon\): \[ \hat{\beta}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}=r(x,y)\frac{sd(y)}{sd(x)} \]
    5. Pearson’s \(r\) is not robust to skewed data or outliers.
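
The sample formula and properties 3 and 4 can be checked numerically. The following is a minimal sketch on made-up toy data (the values of `x` and `y` are hypothetical and chosen only for illustration); it builds \(r\) from the sums in the formula and compares it with R's built-in `cor()` and `lm()`.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Sample Pearson correlation, written out from the sums in the formula
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Property 3: squared correlation between y and the fitted values
# equals explained variation over total variation
fit <- lm(y ~ x)
y_hat <- fitted(fit)
r2_fitted <- cor(y, y_hat)^2
r2_ratio <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)

# Property 4: the least-squares slope equals r * sd(y) / sd(x)
slope_manual <- r_manual * sd(y) / sd(x)

# Each comparison should return TRUE (up to floating-point tolerance)
all.equal(r_manual, cor(x, y))
all.equal(r2_fitted, r2_ratio)
all.equal(unname(coef(fit)["x"]), slope_manual)
```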

Spearman

\[ \rho = r(rank(X), rank(Y)) \]

  • A perfect Spearman correlation (\(\rho=\pm 1\)) indicates a strictly monotonic relationship, whereas a perfect Pearson correlation (\(r=\pm 1\)) indicates an exactly linear relationship.
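
To illustrate the difference, here is a small sketch on hypothetical data where `y` is a strictly increasing but nonlinear function of `x`. Spearman's \(\rho\) is computed as Pearson's \(r\) on the ranks and compared with `cor(..., method = "spearman")`.

```r
# Hypothetical data: y increases strictly with x, but not linearly
x <- c(0.5, 1.2, 2.0, 2.7, 3.5, 4.1)
y <- exp(x)

# Spearman's rho is Pearson's r applied to the ranks
rho_manual  <- cor(rank(x), rank(y))
rho_builtin <- cor(x, y, method = "spearman")

# Pearson's r on the raw values, for comparison
r_pearson <- cor(x, y)

rho_manual   # equals 1 (up to floating point): the relationship is monotonic
rho_builtin  # agrees with rho_manual
r_pearson    # below 1: the relationship is not linear
```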

Kendall

\[ \tau=\frac{n_c-n_d}{n_c+n_d}=\frac{n_c-n_d}{\binom{n}{2}}=\frac{1}{n(n-1)}\sum_{i\neq j}sgn(y_i-y_j)sgn(x_i-x_j) \]

  • \(n_c\): number of concordant pairs
  • \(n_d\): number of discordant pairs
  • total number of possible pairings: \(\binom{n}{2}\), which equals \(n_c+n_d\) when there are no ties

\[ \Rightarrow \frac{1+\tau}{2}=\frac{n_c}{n_c+n_d}=\text{proportion of concordant pairs} \]
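
The pair counting can be sketched directly in R. The data below are hypothetical and contain no ties, so the built-in `cor(..., method = "kendall")` should agree with the manual count.

```r
# Hypothetical data without ties
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

n <- length(x)
idx <- combn(n, 2)  # each column holds one pair (i, j) with i < j

# Product of signs: +1 for a concordant pair, -1 for a discordant pair
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])

n_c <- sum(s > 0)   # 8 concordant pairs
n_d <- sum(s < 0)   # 2 discordant pairs

tau_manual <- (n_c - n_d) / choose(n, 2)   # (8 - 2) / 10 = 0.6
tau_builtin <- cor(x, y, method = "kendall")

# Proportion of concordant pairs, computed two ways: both give 0.8
(1 + tau_manual) / 2
n_c / (n_c + n_d)
```

This direct count visits every pair once, which is where the \(O(n^2)\) cost of the naive computation comes from.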

  • Spearman’s \(\rho\) versus Kendall’s \(\tau\):
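    1. Kendall’s \(\tau\) is more directly interpretable: \(\frac{1+\tau}{2}\) is the proportion of concordant pairs.
    2. The sampling distribution of Kendall’s \(\tau\) approaches a normal distribution more rapidly as the sample size increases; thus, for small data sets, Kendall’s \(\tau\) is preferred.
    3. Computational complexity (\(n\) = sample size):
      • Kendall’s \(\tau\): \(O(n^2)\) when every pair is compared directly
      • Spearman’s \(\rho\): \(O(n\log n)\), dominated by sorting to obtain the ranks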

Correlation Plot

corPlot = function(dt, group){
  # Customize the lower panel: correlation coefficients
  correlation.panel = function(x, y){
    # Use unit coordinates for text placement; restore the old ones on exit
    old = par(usr = c(0, 1, 0, 1)); on.exit(par(old))
    
    # 3 types of correlation
    rp = round(cor(x, y, method = "pearson", use = "complete.obs"), 
               digits = 2)
    text(0.5, 0.75, paste("Pearson's r =", rp))
    
    rs = round(cor(x, y, method = "spearman", use = "complete.obs"), 
               digits = 2)
    text(0.5, 0.5, bquote("Spearman's "~rho~" = "~.(rs)))
    
    rk = round(cor(x, y, method = "kendall", use = "complete.obs"), 
               digits = 2)
    text(0.5, 0.25, bquote("Kendall's "~tau~" = "~.(rk)))
  }
  
  # Customize upper panel: scatter plots
  scatterplot.panel = function(x, y){
    points(x, y, pch = 16,
           col = group)
  }
  
  pairs(dt, 
        lower.panel = correlation.panel, 
        upper.panel = scatterplot.panel)
}
corPlot(iris[1:4], iris$Species)
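
As a numeric companion to the plot, the same three coefficients can be collected as matrices. This is a small sketch using the same `iris` columns; the names `num`, `methods`, and `cor_list` are introduced here for illustration only.

```r
# Numeric columns of iris, as passed to corPlot() above
num <- iris[1:4]

# One rounded correlation matrix per method
methods <- c("pearson", "spearman", "kendall")
cor_list <- lapply(methods, function(m) {
  round(cor(num, method = m, use = "complete.obs"), digits = 2)
})
names(cor_list) <- methods

# Inspect one matrix at a time, e.g. the Pearson correlations
cor_list$pearson
```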
