#install.packages("ggpubr")
library("ggpubr")
## Loading required package: ggplot2
Methods for correlation analyses
There are different methods to perform correlation analysis:
Pearson correlation (r), which measures a linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends on the distribution of the data. It should be used only when x and y are approximately normally distributed. The fitted line y = f(x) is called the linear regression line.
Kendall tau and Spearman rho, which are rank-based correlation coefficients (non-parametric).
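To illustrate how the linear and the rank-based coefficients can differ, here is a minimal sketch on made-up data. The vectors x and y are hypothetical and are not part of the class data; y increases with x, but not along a straight line.

```r
# Hypothetical data: y grows monotonically, but not linearly, with x
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")   # measures linear dependence
spearman <- cor(x, y, method = "spearman")  # based on the ranks of x and y
kendall  <- cor(x, y, method = "kendall")   # based on concordant/discordant pairs

round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3)
```

Because the ranks of x and y agree perfectly, both rank-based coefficients equal 1, while Pearson's r stays below 1.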
| Theory (Optional) [Refer to class and classnote] |
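For correlation coefficients, under the null hypothesis that the population correlation coefficient equals 0, the sample correlation r is approximately normally distributed with standard error

$$SE_r = \sqrt{\frac{1 - r^2}{n - 2}},$$

where n is the number of paired observations. This standard error is itself estimated from the data, and its square is proportional to a χ2-distributed quantity. The t-statistic is obtained by dividing the sample correlation coefficient r by this standard error:

$$t = \frac{r}{SE_r} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

Note that dividing an approximately normally distributed variable by a standard error derived from a χ2-distributed quantity is what gives the statistic a t distribution, here with n − 2 degrees of freedom.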
The p-value (significance level) of the correlation can be determined:
by using the correlation coefficient table for the degrees of freedom df = n − 2, where n is the number of observations in the x and y variables;
or by calculating the t value as t = r √(n − 2) / √(1 − r²) and reading the corresponding p-value from the t distribution table for df = n − 2.
If the p-value is less than 5% (0.05), the correlation between x and y is statistically significant at the 5% level.
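The calculation described above can be sketched by hand in R. This example assumes the built-in mtcars data set (the same wt and mpg variables used in Example 1 below) and uses the t distribution function pt() in place of a printed table. The variable names are chosen here for illustration only.

```r
# Built-in data set; wt and mpg are the variables used in Example 1
x <- mtcars$wt
y <- mtcars$mpg

n <- length(x)                               # number of observations
r <- cor(x, y, method = "pearson")           # sample correlation coefficient
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)    # t statistic with n - 2 df
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

c(t = t_stat, df = n - 2, p.value = p_value)
```

These values should agree with the t, df and p-value reported by cor.test(x, y, method = "pearson").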
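# Correlation coefficient only:
# cor(x, y, method = c("pearson", "kendall", "spearman"))
# Correlation coefficient with a significance test:
# cor.test(x, y, method = c("pearson", "kendall", "spearman"))
# If the data contain missing values, use only the complete pairs:
# cor(x, y, method = "pearson", use = "complete.obs")
# x, y: numeric vectors of the same length
# method: correlation method (the default is "pearson")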
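As a small illustration of the use argument, the sketch below uses two made-up vectors that each contain one missing value. The data are hypothetical and only serve to show the behaviour.

```r
# Hypothetical vectors with one missing value each
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, NA)

r_default  <- cor(x, y)                        # NA: missing values propagate
r_complete <- cor(x, y, use = "complete.obs")  # uses only the 4 complete pairs

r_default
r_complete
```

With the default setting the result is NA, whereas use = "complete.obs" drops every pair in which x or y is missing before computing the coefficient.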
EXAMPLE 1 : USING REAL DATA —————————–
Here, we’ll use the built-in R data set mtcars as an example.
my_data <- mtcars
head(my_data, 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Scatterplot Visualization of Data ———————————-
library("ggpubr")
ggscatter(my_data, x = "mpg", y = "wt",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
## `geom_smooth()` using formula 'y ~ x'
Initial Checks —————–
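Before running the Pearson test, check its two assumptions.
Is the relationship linear? If the scatter plot shows a curved pattern, the association between the two variables is nonlinear and Pearson correlation is not appropriate.
Is each variable approximately normally distributed? Use the Shapiro-Wilk normality test (R function: shapiro.test()) and inspect the normality plot (R function: ggpubr::ggqqplot()).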
# Shapiro-Wilk normality test for mpg
shapiro.test(my_data$mpg) # => p = 0.1229
##
## Shapiro-Wilk normality test
##
## data: my_data$mpg
## W = 0.94756, p-value = 0.1229
# Shapiro-Wilk normality test for wt
shapiro.test(my_data$wt) # => p = 0.09
##
## Shapiro-Wilk normality test
##
## data: my_data$wt
## W = 0.94326, p-value = 0.09265
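In the output above, both p-values (0.1229 for mpg and 0.09265 for wt) are greater than 0.05, so neither variable departs significantly from a normal distribution at the 5% level. When several variables need checking, the same test can be applied to each column in one call. The following is a minimal sketch; the name shapiro_p is introduced here for illustration.

```r
my_data <- mtcars

# Run the Shapiro-Wilk test on each column and keep only the p-value
shapiro_p <- sapply(my_data[, c("mpg", "wt")],
                    function(v) shapiro.test(v)$p.value)

shapiro_p
shapiro_p > 0.05  # TRUE: no significant departure from normality at the 5% level
```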
Visual inspection of the data normality using Q-Q plots (quantile-quantile plots)
A Q-Q plot compares the quantiles of a given sample with those of the normal distribution; points lying close to the reference line suggest that the sample is approximately normal.
library("ggpubr")
# mpg
ggqqplot(my_data$mpg, ylab = "MPG")
# wt
ggqqplot(my_data$wt, ylab = "WT")
| Special Note: |
|---|
| Note that if the data are not normally distributed, it’s recommended to use a non-parametric correlation test, such as the Spearman or Kendall rank-based correlation test. |
Pearson correlation test
Correlation test between mpg and wt variables:
res <- cor.test(my_data$wt, my_data$mpg,
                method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: my_data$wt and my_data$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
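In the output above, t is the test statistic, df the degrees of freedom, and cor the sample correlation coefficient. The p-value (1.294e-10) is well below 0.05, so wt and mpg are significantly correlated, with a coefficient of about -0.87. The object returned by cor.test() is a list, so the individual values can also be extracted by name, as in this short sketch:

```r
my_data <- mtcars
res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")

res$statistic  # t statistic
res$parameter  # degrees of freedom
res$p.value    # p-value of the test
res$conf.int   # 95% confidence interval for the correlation
res$estimate   # sample correlation coefficient
```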
| Spearman rank correlation coefficient and Test |
|---|
| Spearman’s rho statistic is used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution. |
res2 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties
res2
##
## Spearman's rank correlation rho
##
## data: my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.886422
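The warning above appears because wt and mpg contain tied values, so R falls back to an approximate p-value. One way to request the approximation explicitly, which should avoid the warning, is the exact argument of cor.test(). This is a sketch; the name res2_approx is chosen here for illustration.

```r
my_data <- mtcars

# exact = FALSE asks for the asymptotic p-value, so no tie warning is raised
res2_approx <- cor.test(my_data$wt, my_data$mpg,
                        method = "spearman", exact = FALSE)

res2_approx$estimate  # Spearman's rho
res2_approx$p.value   # approximate p-value
```

The estimate and p-value should match those printed above, since R uses the same approximation when it cannot compute an exact p-value.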
Kendall rank correlation test —————————–
The Kendall rank correlation coefficient or Kendall’s tau statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.
res3 <- cor.test(my_data$wt, my_data$mpg, method = "kendall")
## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "kendall"): Cannot
## compute exact p-value with ties
res3
##
## Kendall's rank correlation tau
##
## data: my_data$wt and my_data$mpg
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7278321
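To compare the three coefficients for the same pair of variables side by side, they can be computed in one call. A minimal sketch follows; the names methods and coefs are introduced here for illustration.

```r
my_data <- mtcars
methods <- c("pearson", "spearman", "kendall")

# One coefficient per method for the same pair of variables
coefs <- sapply(methods, function(m) cor(my_data$wt, my_data$mpg, method = m))
round(coefs, 4)
```

All three coefficients are negative, which indicates that heavier cars tend to have lower fuel economy. Kendall's tau is smaller in magnitude than the other two here; the coefficients are on different scales, so they should not be compared directly.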
| Home Work/ Class Work |
Perform a hypothesis test for correlation between mpg and qsec in the mtcars data set above, using R.
References
http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r
https://egyankosh.ac.in/bitstream/123456789/20447/1/Unit-6.pdf
https://online.stat.psu.edu/stat501/lesson/1/1.9