Statistical correlations

Bongani Ncube

2023-11-03


 

Data Scientist Volucentric Consultancy

 

Correlation

Correlation measures the strength and direction of association between two variables. There are three common correlation tests:

  • the Pearson product moment (Pearson’s r),
  • Spearman’s rank-order (Spearman’s rho),
  • and Kendall’s tau (Kendall’s tau).

Use the Pearson’s r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.

The Spearman’s rho and Kendal’s tao correlations are measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson does not apply.

Pearson’s r

Pearson’s \(r\)

\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]

estimates the population correlation \(\rho\). Pearson’s \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship, and \(r = 0\) when there is no linear relationship. A correlation in the range \((.1, .3)\) is condidered small, \((.3, .5)\) medium, and \((.5, 1.0)\) large.

Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.

library Setup

library(tidyverse)
library(glue)
library(flextable)
library(tvthemes)
library(flextable)
# Dummy set containing a feature and label column
df <- tibble(
  Height = c(115, 126, 137, 140, 152, 156, 114, 129),
  Weight = c(56, 61, 67, 72, 76, 82, 54, 62)
  
)

# Display the data set
df  %>%
  flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Weight and height of a random sample of people.")  %>%
  flextable::border_outer()
Weight and height of a random sample of people.

Height

Weight

115

56

126

61

137

67

140

72

152

76

156

82

114

54

129

62

Types of correlations

Almost perfect correlations

p1 <- ggplot(data = df, aes(x = Height, y = Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  
  labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")
p2 <- ggplot(data = df, aes(x = Height, y = 1/Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
 
  labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")
gridExtra::grid.arrange(p1, p2, nrow = 1)

Some weak correlations

  • here is some fake dataset producing weak correlations
df2 <- tibble(
  Height = c(115, 101, 99, 107, 118, 127, 120, 129),
  Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
  
)

Visualizing weak correlations

p1 <- ggplot(data = df2, aes(x = Height, y = Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  
  labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")
p2 <- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
 
  labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")
gridExtra::grid.arrange(p1, p2, nrow = 1)

Calculations manually

we are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\) , \(\sum{(X_i - \bar{X})^2}\) and \(\sum{(Y_i - \bar{Y})^2}\)

dummy_mse <- df %>% 
  mutate("$(X_i - \\bar{X})$" = (Height- mean(Height))) %>% 
  mutate("$(Y_i - \\bar{Y})$" = (Weight- mean(Weight)))%>% 
  mutate("$(X_i - \\bar{X})^2$" = (Height- mean(Height))^2) %>% 
  mutate("$(Y_i - \\bar{Y})^2$" = (Weight- mean(Weight))^2) %>%  
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight- mean(Weight))*(Height- mean(Height)))

dummy_mse %>% 
  knitr::kable()
Height Weight \((X_i - \bar{X})\) \((Y_i - \bar{Y})\) \((X_i - \bar{X})^2\) \((Y_i - \bar{Y})^2\) \((Y_i - \bar{Y})(X_i - \bar{X})\)
115 56 -18.625 -10.25 346.89062 105.0625 190.90625
126 61 -7.625 -5.25 58.14062 27.5625 40.03125
137 67 3.375 0.75 11.39062 0.5625 2.53125
140 72 6.375 5.75 40.64062 33.0625 36.65625
152 76 18.375 9.75 337.64062 95.0625 179.15625
156 82 22.375 15.75 500.64062 248.0625 352.40625
114 54 -19.625 -12.25 385.14062 150.0625 240.40625
129 62 -4.625 -4.25 21.39062 18.0625 19.65625

Summations and calculating correlation

(dummy_mse1<-dummy_mse %>% 
  summarise("$\\sum Height$"=sum(.[1]),
            "$\\sum Weight$"=sum(.[2]),
            "$\\sum(X_i - \\bar{X})$"=sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%  
   mutate(Pearson_R = (.[7]/sqrt(.[5]*.[6]))) %>%  ##seventh index/sqrt(5th times 6th index) 
   relocate(Pearson_R)%>% 
   knitr::kable())
Pearson_R \(\sum Height\) \(\sum Weight\) \(\sum(X_i - \bar{X})\) \(\sum(Y_i - \bar{Y})\) \(\sum(X_i - \bar{X})^2\) \(\sum(Y_i - \bar{Y})^2\) \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\)
0.9887894 1069 530 0 0 1701.875 677.5 1061.75

In R

  • While it is important to know the mathematical theory behind , R has a built in function for this
  • we use the cor(x,y) function
cor(df$Height,df$Weight)
#> [1] 0.9887894

Spearman’s Rho

Spearman’s \(\rho\) is the Pearson’s r applied to the sample variable ranks. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is

\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)

Spearman’s rho is a non-parametric test, so there is no associated confidence interval.

Doing the calculations manually

  • basically we repeat the same procedure except that we will be using ranks ,we will use the rank function in R

dummy_mse <- df %>% 
  mutate(Height_rank =rank(Height), ## define ranks
         Weight_rank =rank(Weight))%>% 
  select(-Weight,-Height) %>%  
  mutate("$(X_i - \\bar{X})$" = (Height_rank- mean(Height_rank))) %>% 
  mutate("$(Y_i - \\bar{Y})$" = (Weight_rank- mean(Weight_rank))) %>% 
  mutate("$(X_i - \\bar{X})^2$" = (Height_rank- mean(Height_rank))^2) %>% 
  mutate("$(Y_i - \\bar{Y})^2$" = (Weight_rank- mean(Weight_rank))^2) %>% 
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight_rank- mean(Weight_rank))*(Height_rank- mean(Height_rank)))

dummy_mse %>% 
  knitr::kable()
Height_rank Weight_rank \((X_i - \bar{X})\) \((Y_i - \bar{Y})\) \((X_i - \bar{X})^2\) \((Y_i - \bar{Y})^2\) \((Y_i - \bar{Y})(X_i - \bar{X})\)
2 2 -2.5 -2.5 6.25 6.25 6.25
3 3 -1.5 -1.5 2.25 2.25 2.25
5 5 0.5 0.5 0.25 0.25 0.25
6 6 1.5 1.5 2.25 2.25 2.25
7 7 2.5 2.5 6.25 6.25 6.25
8 8 3.5 3.5 12.25 12.25 12.25
1 1 -3.5 -3.5 12.25 12.25 12.25
4 4 -0.5 -0.5 0.25 0.25 0.25

Summations and calculations

dummy_mse %>% 
  summarise("$\\sum Height_rank$"=sum(.[1]),
            "$\\sum Weight_rank$"=sum(.[2]),
            "$\\sum(X_i - \\bar{X})$"=sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>% 
  mutate(Spearman_Rho = (.[7]/sqrt(.[5]*.[6]))) %>% 
  relocate(Spearman_Rho)%>% 
  knitr::kable()
Spearman_Rho \(\sum Height_rank\) \(\sum Weight_rank\) \(\sum(X_i - \bar{X})\) \(\sum(Y_i - \bar{Y})\) \(\sum(X_i - \bar{X})^2\) \(\sum(Y_i - \bar{Y})^2\) \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\)
1 36 36 0 0 42 42 42

Doing it in R

  • We use cor(x,y,method="spearman")
cor(dummy_mse$Height_rank,dummy_mse$Weight_rank,method="spearman")
#> [1] 1