Statistical correlations

Bongani Ncube

2023-11-03

r195334vncube@gmail.com

Correlation

Correlation measures the strength and direction of association between two variables. There are three common correlation tests:

the Pearson product moment (Pearson’s r),
Spearman’s rank-order (Spearman’s rho),
and Kendall’s tau (Kendall’s tau).

Use the Pearson’s r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.

The Spearman’s rho and Kendal’s tao correlations are measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson does not apply.

Pearson’s r

Pearson’s \(r\)

\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]

estimates the population correlation \(\rho\). Pearson’s \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship, and \(r = 0\) when there is no linear relationship. A correlation in the range \((.1, .3)\) is condidered small, \((.3, .5)\) medium, and \((.5, 1.0)\) large.

Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.

library Setup

library(tidyverse)
library(glue)
library(flextable)
library(tvthemes)

library(flextable)
# Dummy set containing a feature and label column
df <- tibble(
  Height = c(115, 126, 137, 140, 152, 156, 114, 129),
  Weight = c(56, 61, 67, 72, 76, 82, 54, 62)
  
)

# Display the data set
df  %>%
  flextable() %>%
  flextable::set_table_properties(width = .75, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "Weight and height of a random sample of people.")  %>%
  flextable::border_outer()

Weight and height of a random sample of people.
Height	Weight
115	56
126	61
137	67
140	72
152	76
156	82
114	54
129	62

Types of correlations

Almost perfect correlations

p1 <- ggplot(data = df, aes(x = Height, y = Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  
  labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")
p2 <- ggplot(data = df, aes(x = Height, y = 1/Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
 
  labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")
gridExtra::grid.arrange(p1, p2, nrow = 1)

Some weak correlations

here is some fake dataset producing weak correlations

df2 <- tibble(
  Height = c(115, 101, 99, 107, 118, 127, 120, 129),
  Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
  
)

Visualizing weak correlations

p1 <- ggplot(data = df2, aes(x = Height, y = Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  
  labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")
p2 <- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
 
  labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")
gridExtra::grid.arrange(p1, p2, nrow = 1)

Calculations manually

we are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\) , \(\sum{(X_i - \bar{X})^2}\) and \(\sum{(Y_i - \bar{Y})^2}\)

dummy_mse <- df %>% 
  mutate("$(X_i - \\bar{X})$" = (Height- mean(Height))) %>% 
  mutate("$(Y_i - \\bar{Y})$" = (Weight- mean(Weight)))%>% 
  mutate("$(X_i - \\bar{X})^2$" = (Height- mean(Height))^2) %>% 
  mutate("$(Y_i - \\bar{Y})^2$" = (Weight- mean(Weight))^2) %>%  
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight- mean(Weight))*(Height- mean(Height)))

dummy_mse %>% 
  knitr::kable()

Height	Weight	\((X_i - \bar{X})\)	\((Y_i - \bar{Y})\)	\((X_i - \bar{X})^2\)	\((Y_i - \bar{Y})^2\)	\((Y_i - \bar{Y})(X_i - \bar{X})\)
115	56	-18.625	-10.25	346.89062	105.0625	190.90625
126	61	-7.625	-5.25	58.14062	27.5625	40.03125
137	67	3.375	0.75	11.39062	0.5625	2.53125
140	72	6.375	5.75	40.64062	33.0625	36.65625
152	76	18.375	9.75	337.64062	95.0625	179.15625
156	82	22.375	15.75	500.64062	248.0625	352.40625
114	54	-19.625	-12.25	385.14062	150.0625	240.40625
129	62	-4.625	-4.25	21.39062	18.0625	19.65625

Summations and calculating correlation

(dummy_mse1<-dummy_mse %>% 
  summarise("$\\sum Height$"=sum(.[1]),
            "$\\sum Weight$"=sum(.[2]),
            "$\\sum(X_i - \\bar{X})$"=sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%  
   mutate(Pearson_R = (.[7]/sqrt(.[5]*.[6]))) %>%  ##seventh index/sqrt(5th times 6th index) 
   relocate(Pearson_R)%>% 
   knitr::kable())

Pearson_R	\(\sum Height\)	\(\sum Weight\)	\(\sum(X_i - \bar{X})\)	\(\sum(Y_i - \bar{Y})\)	\(\sum(X_i - \bar{X})^2\)	\(\sum(Y_i - \bar{Y})^2\)	\(\sum(Y_i - \bar{Y})(X_i - \bar{X})\)
0.9887894	1069	530	0	0	1701.875	677.5	1061.75

In R

While it is important to know the mathematical theory behind , R has a built in function for this
we use the cor(x,y) function

cor(df$Height,df$Weight)
#> [1] 0.9887894

Spearman’s Rho

Spearman’s \(\rho\) is the Pearson’s r applied to the sample variable ranks. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is

\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)

Spearman’s rho is a non-parametric test, so there is no associated confidence interval.

Doing the calculations manually

basically we repeat the same procedure except that we will be using ranks ,we will use the rank function in R


dummy_mse <- df %>% 
  mutate(Height_rank =rank(Height), ## define ranks
         Weight_rank =rank(Weight))%>% 
  select(-Weight,-Height) %>%  
  mutate("$(X_i - \\bar{X})$" = (Height_rank- mean(Height_rank))) %>% 
  mutate("$(Y_i - \\bar{Y})$" = (Weight_rank- mean(Weight_rank))) %>% 
  mutate("$(X_i - \\bar{X})^2$" = (Height_rank- mean(Height_rank))^2) %>% 
  mutate("$(Y_i - \\bar{Y})^2$" = (Weight_rank- mean(Weight_rank))^2) %>% 
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight_rank- mean(Weight_rank))*(Height_rank- mean(Height_rank)))

dummy_mse %>% 
  knitr::kable()

Height_rank	Weight_rank	\((X_i - \bar{X})\)	\((Y_i - \bar{Y})\)	\((X_i - \bar{X})^2\)	\((Y_i - \bar{Y})^2\)	\((Y_i - \bar{Y})(X_i - \bar{X})\)
2	2	-2.5	-2.5	6.25	6.25	6.25
3	3	-1.5	-1.5	2.25	2.25	2.25
5	5	0.5	0.5	0.25	0.25	0.25
6	6	1.5	1.5	2.25	2.25	2.25
7	7	2.5	2.5	6.25	6.25	6.25
8	8	3.5	3.5	12.25	12.25	12.25
1	1	-3.5	-3.5	12.25	12.25	12.25
4	4	-0.5	-0.5	0.25	0.25	0.25

Summations and calculations

dummy_mse %>% 
  summarise("$\\sum Height_rank$"=sum(.[1]),
            "$\\sum Weight_rank$"=sum(.[2]),
            "$\\sum(X_i - \\bar{X})$"=sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>% 
  mutate(Spearman_Rho = (.[7]/sqrt(.[5]*.[6]))) %>% 
  relocate(Spearman_Rho)%>% 
  knitr::kable()

Spearman_Rho	\(\sum Height_rank\)	\(\sum Weight_rank\)	\(\sum(X_i - \bar{X})\)	\(\sum(Y_i - \bar{Y})\)	\(\sum(X_i - \bar{X})^2\)	\(\sum(Y_i - \bar{Y})^2\)	\(\sum(Y_i - \bar{Y})(X_i - \bar{X})\)
1	36	36	0	0	42	42	42

Doing it in R

We use cor(x,y,method="spearman")

cor(dummy_mse$Height_rank,dummy_mse$Weight_rank,method="spearman")
#> [1] 1