Data Scientist Volucentric Consultancy
Correlation measures the strength and direction of association between two variables. There are three common correlation tests:
Use the Pearson’s r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.
The Spearman’s rho and Kendal’s tao correlations are measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson does not apply.
Pearson’s \(r\)
\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]
estimates the population correlation \(\rho\). Pearson’s \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship, and \(r = 0\) when there is no linear relationship. A correlation in the range \((.1, .3)\) is condidered small, \((.3, .5)\) medium, and \((.5, 1.0)\) large.
Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.
library(tidyverse)
library(glue)
library(flextable)
library(tvthemes)library(flextable)
# Dummy set containing a feature and label column
df <- tibble(
Height = c(115, 126, 137, 140, 152, 156, 114, 129),
Weight = c(56, 61, 67, 72, 76, 82, 54, 62)
)
# Display the data set
df %>%
flextable() %>%
flextable::set_table_properties(width = .75, layout = "autofit") %>%
flextable::theme_zebra() %>%
flextable::fontsize(size = 12) %>%
flextable::fontsize(size = 12, part = "header") %>%
flextable::align_text_col(align = "center") %>%
flextable::set_caption(caption = "Weight and height of a random sample of people.") %>%
flextable::border_outer()Height | Weight |
|---|---|
115 | 56 |
126 | 61 |
137 | 67 |
140 | 72 |
152 | 76 |
156 | 82 |
114 | 54 |
129 | 62 |
p1 <- ggplot(data = df, aes(x = Height, y = Weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")
p2 <- ggplot(data = df, aes(x = Height, y = 1/Weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")
gridExtra::grid.arrange(p1, p2, nrow = 1)df2 <- tibble(
Height = c(115, 101, 99, 107, 118, 127, 120, 129),
Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
)p1 <- ggplot(data = df2, aes(x = Height, y = Weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")
p2 <- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")
gridExtra::grid.arrange(p1, p2, nrow = 1)we are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\) , \(\sum{(X_i - \bar{X})^2}\) and \(\sum{(Y_i - \bar{Y})^2}\)
dummy_mse <- df %>%
mutate("$(X_i - \\bar{X})$" = (Height- mean(Height))) %>%
mutate("$(Y_i - \\bar{Y})$" = (Weight- mean(Weight)))%>%
mutate("$(X_i - \\bar{X})^2$" = (Height- mean(Height))^2) %>%
mutate("$(Y_i - \\bar{Y})^2$" = (Weight- mean(Weight))^2) %>%
mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight- mean(Weight))*(Height- mean(Height)))
dummy_mse %>%
knitr::kable()| Height | Weight | \((X_i - \bar{X})\) | \((Y_i - \bar{Y})\) | \((X_i - \bar{X})^2\) | \((Y_i - \bar{Y})^2\) | \((Y_i - \bar{Y})(X_i - \bar{X})\) |
|---|---|---|---|---|---|---|
| 115 | 56 | -18.625 | -10.25 | 346.89062 | 105.0625 | 190.90625 |
| 126 | 61 | -7.625 | -5.25 | 58.14062 | 27.5625 | 40.03125 |
| 137 | 67 | 3.375 | 0.75 | 11.39062 | 0.5625 | 2.53125 |
| 140 | 72 | 6.375 | 5.75 | 40.64062 | 33.0625 | 36.65625 |
| 152 | 76 | 18.375 | 9.75 | 337.64062 | 95.0625 | 179.15625 |
| 156 | 82 | 22.375 | 15.75 | 500.64062 | 248.0625 | 352.40625 |
| 114 | 54 | -19.625 | -12.25 | 385.14062 | 150.0625 | 240.40625 |
| 129 | 62 | -4.625 | -4.25 | 21.39062 | 18.0625 | 19.65625 |
(dummy_mse1<-dummy_mse %>%
summarise("$\\sum Height$"=sum(.[1]),
"$\\sum Weight$"=sum(.[2]),
"$\\sum(X_i - \\bar{X})$"=sum(.[3]),
"$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
"$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
"$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
"$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%
mutate(Pearson_R = (.[7]/sqrt(.[5]*.[6]))) %>% ##seventh index/sqrt(5th times 6th index)
relocate(Pearson_R)%>%
knitr::kable())| Pearson_R | \(\sum Height\) | \(\sum Weight\) | \(\sum(X_i - \bar{X})\) | \(\sum(Y_i - \bar{Y})\) | \(\sum(X_i - \bar{X})^2\) | \(\sum(Y_i - \bar{Y})^2\) | \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\) |
|---|---|---|---|---|---|---|---|
| 0.9887894 | 1069 | 530 | 0 | 0 | 1701.875 | 677.5 | 1061.75 |
cor(x,y) functioncor(df$Height,df$Weight)
#> [1] 0.9887894Spearman’s \(\rho\) is the Pearson’s r applied to the sample variable ranks. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is
\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)
Spearman’s rho is a non-parametric test, so there is no associated confidence interval.
rank function in R
dummy_mse <- df %>%
mutate(Height_rank =rank(Height), ## define ranks
Weight_rank =rank(Weight))%>%
select(-Weight,-Height) %>%
mutate("$(X_i - \\bar{X})$" = (Height_rank- mean(Height_rank))) %>%
mutate("$(Y_i - \\bar{Y})$" = (Weight_rank- mean(Weight_rank))) %>%
mutate("$(X_i - \\bar{X})^2$" = (Height_rank- mean(Height_rank))^2) %>%
mutate("$(Y_i - \\bar{Y})^2$" = (Weight_rank- mean(Weight_rank))^2) %>%
mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (Weight_rank- mean(Weight_rank))*(Height_rank- mean(Height_rank)))
dummy_mse %>%
knitr::kable()| Height_rank | Weight_rank | \((X_i - \bar{X})\) | \((Y_i - \bar{Y})\) | \((X_i - \bar{X})^2\) | \((Y_i - \bar{Y})^2\) | \((Y_i - \bar{Y})(X_i - \bar{X})\) |
|---|---|---|---|---|---|---|
| 2 | 2 | -2.5 | -2.5 | 6.25 | 6.25 | 6.25 |
| 3 | 3 | -1.5 | -1.5 | 2.25 | 2.25 | 2.25 |
| 5 | 5 | 0.5 | 0.5 | 0.25 | 0.25 | 0.25 |
| 6 | 6 | 1.5 | 1.5 | 2.25 | 2.25 | 2.25 |
| 7 | 7 | 2.5 | 2.5 | 6.25 | 6.25 | 6.25 |
| 8 | 8 | 3.5 | 3.5 | 12.25 | 12.25 | 12.25 |
| 1 | 1 | -3.5 | -3.5 | 12.25 | 12.25 | 12.25 |
| 4 | 4 | -0.5 | -0.5 | 0.25 | 0.25 | 0.25 |
dummy_mse %>%
summarise("$\\sum Height_rank$"=sum(.[1]),
"$\\sum Weight_rank$"=sum(.[2]),
"$\\sum(X_i - \\bar{X})$"=sum(.[3]),
"$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
"$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
"$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
"$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%
mutate(Spearman_Rho = (.[7]/sqrt(.[5]*.[6]))) %>%
relocate(Spearman_Rho)%>%
knitr::kable()| Spearman_Rho | \(\sum Height_rank\) | \(\sum Weight_rank\) | \(\sum(X_i - \bar{X})\) | \(\sum(Y_i - \bar{Y})\) | \(\sum(X_i - \bar{X})^2\) | \(\sum(Y_i - \bar{Y})^2\) | \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\) |
|---|---|---|---|---|---|---|---|
| 1 | 36 | 36 | 0 | 0 | 42 | 42 | 42 |
cor(x,y,method="spearman")cor(dummy_mse$Height_rank,dummy_mse$Weight_rank,method="spearman")
#> [1] 1