Data Scientist Volucentric Consultancy
Correlation measures the strength and direction of association between two variables. There are three common correlation tests:
Use the Pearson’s r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.
The Spearman’s rho and Kendal’s tao correlations are measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson does not apply.
Pearson’s \(r\)
\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]
estimates the population correlation \(\rho\). Pearson’s \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship, and \(r = 0\) when there is no linear relationship. A correlation in the range \((.1, .3)\) is condidered small, \((.3, .5)\) medium, and \((.5, 1.0)\) large.
Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.
library(tidyverse)
library(glue)
library(flextable)
library(tvthemes)
library(flextable)
# Dummy set containing a feature and label column
<- tibble(
df Height = c(115, 126, 137, 140, 152, 156, 114, 129),
Weight = c(56, 61, 67, 72, 76, 82, 54, 62)
)
# Display the data set
%>%
df flextable() %>%
::set_table_properties(width = .75, layout = "autofit") %>%
flextable::theme_zebra() %>%
flextable::fontsize(size = 12) %>%
flextable::fontsize(size = 12, part = "header") %>%
flextable::align_text_col(align = "center") %>%
flextable::set_caption(caption = "Weight and height of a random sample of people.") %>%
flextable::border_outer() flextable
Height | Weight |
---|---|
115 | 56 |
126 | 61 |
137 | 67 |
140 | 72 |
152 | 76 |
156 | 82 |
114 | 54 |
129 | 62 |
<- ggplot(data = df, aes(x = Height, y = Weight)) +
p1 geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")
<- ggplot(data = df, aes(x = Height, y = 1/Weight)) +
p2 geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")
::grid.arrange(p1, p2, nrow = 1) gridExtra
<- tibble(
df2 Height = c(115, 101, 99, 107, 118, 127, 120, 129),
Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
)
<- ggplot(data = df2, aes(x = Height, y = Weight)) +
p1 geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")
<- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
p2 geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")
::grid.arrange(p1, p2, nrow = 1) gridExtra
we are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\) , \(\sum{(X_i - \bar{X})^2}\) and \(\sum{(Y_i - \bar{Y})^2}\)
<- df %>%
dummy_mse mutate("$(X_i - \\bar{X})$" = (Height- mean(Height))) %>%
mutate("$(Y_i - \\bar{Y})$" = (Weight- mean(Weight)))%>%
mutate("$(X_i - \\bar{X})^2$" = (Height- mean(Height))^2) %>%
mutate("$(Y_i - \\bar{Y})^2$" = (Weight- mean(Weight))^2) %>%
mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight- mean(Weight))*(Height- mean(Height)))
%>%
dummy_mse ::kable() knitr
Height | Weight | \((X_i - \bar{X})\) | \((Y_i - \bar{Y})\) | \((X_i - \bar{X})^2\) | \((Y_i - \bar{Y})^2\) | \((Y_i - \bar{Y})(X_i - \bar{X})\) |
---|---|---|---|---|---|---|
115 | 56 | -18.625 | -10.25 | 346.89062 | 105.0625 | 190.90625 |
126 | 61 | -7.625 | -5.25 | 58.14062 | 27.5625 | 40.03125 |
137 | 67 | 3.375 | 0.75 | 11.39062 | 0.5625 | 2.53125 |
140 | 72 | 6.375 | 5.75 | 40.64062 | 33.0625 | 36.65625 |
152 | 76 | 18.375 | 9.75 | 337.64062 | 95.0625 | 179.15625 |
156 | 82 | 22.375 | 15.75 | 500.64062 | 248.0625 | 352.40625 |
114 | 54 | -19.625 | -12.25 | 385.14062 | 150.0625 | 240.40625 |
129 | 62 | -4.625 | -4.25 | 21.39062 | 18.0625 | 19.65625 |
<-dummy_mse %>%
(dummy_mse1summarise("$\\sum Height$"=sum(.[1]),
"$\\sum Weight$"=sum(.[2]),
"$\\sum(X_i - \\bar{X})$"=sum(.[3]),
"$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
"$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
"$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
"$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%
mutate(Pearson_R = (.[7]/sqrt(.[5]*.[6]))) %>% ##seventh index/sqrt(5th times 6th index)
relocate(Pearson_R)%>%
::kable()) knitr
Pearson_R | \(\sum Height\) | \(\sum Weight\) | \(\sum(X_i - \bar{X})\) | \(\sum(Y_i - \bar{Y})\) | \(\sum(X_i - \bar{X})^2\) | \(\sum(Y_i - \bar{Y})^2\) | \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\) |
---|---|---|---|---|---|---|---|
0.9887894 | 1069 | 530 | 0 | 0 | 1701.875 | 677.5 | 1061.75 |
cor(x,y)
functioncor(df$Height,df$Weight)
#> [1] 0.9887894
Spearman’s \(\rho\) is the Pearson’s r applied to the sample variable ranks. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is
\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)
Spearman’s rho is a non-parametric test, so there is no associated confidence interval.
rank
function in R
<- df %>%
dummy_mse mutate(Height_rank =rank(Height), ## define ranks
Weight_rank =rank(Weight))%>%
select(-Weight,-Height) %>%
mutate("$(X_i - \\bar{X})$" = (Height_rank- mean(Height_rank))) %>%
mutate("$(Y_i - \\bar{Y})$" = (Weight_rank- mean(Weight_rank))) %>%
mutate("$(X_i - \\bar{X})^2$" = (Height_rank- mean(Height_rank))^2) %>%
mutate("$(Y_i - \\bar{Y})^2$" = (Weight_rank- mean(Weight_rank))^2) %>%
mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight_rank- mean(Weight_rank))*(Height_rank- mean(Height_rank)))
%>%
dummy_mse ::kable() knitr
Height_rank | Weight_rank | \((X_i - \bar{X})\) | \((Y_i - \bar{Y})\) | \((X_i - \bar{X})^2\) | \((Y_i - \bar{Y})^2\) | \((Y_i - \bar{Y})(X_i - \bar{X})\) |
---|---|---|---|---|---|---|
2 | 2 | -2.5 | -2.5 | 6.25 | 6.25 | 6.25 |
3 | 3 | -1.5 | -1.5 | 2.25 | 2.25 | 2.25 |
5 | 5 | 0.5 | 0.5 | 0.25 | 0.25 | 0.25 |
6 | 6 | 1.5 | 1.5 | 2.25 | 2.25 | 2.25 |
7 | 7 | 2.5 | 2.5 | 6.25 | 6.25 | 6.25 |
8 | 8 | 3.5 | 3.5 | 12.25 | 12.25 | 12.25 |
1 | 1 | -3.5 | -3.5 | 12.25 | 12.25 | 12.25 |
4 | 4 | -0.5 | -0.5 | 0.25 | 0.25 | 0.25 |
%>%
dummy_mse summarise("$\\sum Height_rank$"=sum(.[1]),
"$\\sum Weight_rank$"=sum(.[2]),
"$\\sum(X_i - \\bar{X})$"=sum(.[3]),
"$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
"$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
"$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
"$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%
mutate(Spearman_Rho = (.[7]/sqrt(.[5]*.[6]))) %>%
relocate(Spearman_Rho)%>%
::kable() knitr
Spearman_Rho | \(\sum Height_rank\) | \(\sum Weight_rank\) | \(\sum(X_i - \bar{X})\) | \(\sum(Y_i - \bar{Y})\) | \(\sum(X_i - \bar{X})^2\) | \(\sum(Y_i - \bar{Y})^2\) | \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\) |
---|---|---|---|---|---|---|---|
1 | 36 | 36 | 0 | 0 | 42 | 42 | 42 |
cor(x,y,method="spearman")
cor(dummy_mse$Height_rank,dummy_mse$Weight_rank,method="spearman")
#> [1] 1