#install.packages("ggpubr")
library("ggpubr")
## Loading required package: ggplot2

Methods for correlation analyses

There are different methods to perform correlation analysis:

Pearson correlation (r), which measures the linear dependence between two variables (x and y). It is also known as a parametric correlation test because it depends on the distribution of the data. It can be used only when x and y come from a normal distribution. The plot of y = f(x) is called the linear regression line.

Kendall's tau and Spearman's rho, which are rank-based (non-parametric) correlation coefficients. A quick side-by-side comparison of the three methods is sketched below.
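For a quick illustration (a minimal sketch, using the built-in mtcars data set that also appears in Example 1 below), the three coefficients can be computed side by side with cor(); the values in the comments match the outputs reported later in this note:

# Compare the three correlation methods on the same pair of variables
cor(mtcars$mpg, mtcars$wt, method = "pearson")   # => -0.8676594 (linear, parametric)
cor(mtcars$mpg, mtcars$wt, method = "kendall")   # => -0.7278321 (rank-based, Kendall's tau)
cor(mtcars$mpg, mtcars$wt, method = "spearman")  # => -0.886422  (rank-based, Spearman's rho)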

Pearson correlation formula

$$ r = \frac{\sum{(x - m_x)(y - m_y)}}{\sqrt{\sum{(x - m_x)^2}\sum{(y - m_y)^2}}} $$

where $m_x$ and $m_y$ are the means of the x and y variables, respectively.
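To make the formula concrete, here is a minimal sketch that computes r by hand and checks it against R's cor(); it assumes the mtcars variables mpg and wt that are used in Example 1 below.

# Hand computation of Pearson's r from the formula above
x   <- mtcars$mpg
y   <- mtcars$wt
m_x <- mean(x)                     # m_x: mean of x
m_y <- mean(y)                     # m_y: mean of y

r_manual <- sum((x - m_x) * (y - m_y)) /
  sqrt(sum((x - m_x)^2) * sum((y - m_y)^2))

r_manual        # => -0.8676594
cor(x, y)       # same value from the built-in function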

Theory (Optional) [Refer to class and class notes]

For correlation coefficients, under the null hypothesis that the population correlation coefficient equals 0, the sample correlation r is approximately Normally distributed, with standard error:

$$ SE(r) = \sqrt{\frac{1 - r^2}{n - 2}} $$

Under the same null hypothesis, this standard error is based on a χ²-distributed residual sum of squares with n − 2 degrees of freedom. Thus, the t-statistic is obtained by dividing the sample correlation coefficient r by this standard error:

$$ t = \frac{r}{SE(r)} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} $$

Note that we get the t-statistic by dividing an (approximately) Normally distributed variable by a standard error based on a χ²-distributed quantity; this is exactly the construction of a t-distributed statistic, so t follows a t-distribution with n − 2 degrees of freedom under the null hypothesis.
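One way to see this empirically is the short simulation below (a sketch, not part of the original example): when x and y are independent Normal samples, the statistic r√(n − 2)/√(1 − r²) has quantiles close to those of a t-distribution with n − 2 degrees of freedom.

# Simulate the t-statistic under the null hypothesis of zero correlation
set.seed(123)
n     <- 20
t_sim <- replicate(5000, {
  x <- rnorm(n)
  y <- rnorm(n)                      # independent of x, so the true correlation is 0
  r <- cor(x, y)
  r * sqrt(n - 2) / sqrt(1 - r^2)    # the t-statistic defined above
})

# Simulated quantiles vs. theoretical t quantiles with df = n - 2
quantile(t_sim, probs = c(0.025, 0.5, 0.975))
qt(c(0.025, 0.5, 0.975), df = n - 2)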


The p-value (significance level) of the correlation can be determined:

by using the correlation coefficient table with degrees of freedom df = n − 2, where n is the number of observations in the x and y variables;

or by calculating the t value as follows:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} $$

The corresponding p-value is then determined using the t-distribution table with df = n − 2.

If the p-value is less than 5% (0.05), then the correlation between x and y is considered significant.
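As a sketch of this calculation in R (anticipating Example 1 below and assuming the mtcars variables mpg and wt), the t value and its two-sided p-value can be computed by hand and compared with cor.test():

# Manual t value and p-value for the correlation between mpg and wt
x <- mtcars$mpg
y <- mtcars$wt
n <- length(x)                                 # number of observations
r <- cor(x, y)                                 # Pearson correlation coefficient

t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)     # t statistic
p_value <- 2 * pt(-abs(t_value), df = n - 2)   # two-sided p-value with df = n - 2

t_value                    # => about -9.56 (compare with the cor.test output in Example 1)
p_value                    # => about 1.29e-10
cor.test(x, y)$p.value     # same p-value from the built-in test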

# cor(x, y, method = c("pearson", "kendall", "spearman"))
# cor.test(x, y, method = c("pearson", "kendall", "spearman"))
# cor(x, y, method = "pearson", use = "complete.obs")

# x, y   : numeric vectors of the same length
# method : correlation method ("pearson", "kendall", or "spearman")
# use    : how missing values are handled; "complete.obs" keeps only complete (non-NA) pairs

EXAMPLE 1: USING REAL DATA —————————–

Here, we’ll use the built-in R data set mtcars as an example.

my_data <- mtcars
head(my_data, 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Scatterplot Visualization of Data ———————————-

library("ggpubr")
ggscatter(my_data, x = "mpg", y = "wt", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
## `geom_smooth()` using formula 'y ~ x'

Initial Checks —————–

  1. Is the covariation linear? (Yes, from the plot above, the relationship is linear.)

If the scatter plot shows a curved pattern, we are dealing with a nonlinear association between the two variables.

  2. Do the data from each of the two variables (x, y) follow a normal distribution?

Use the Shapiro-Wilk normality test (R function: shapiro.test()) and look at the normality plot (R function: ggpubr::ggqqplot()).

# Shapiro-Wilk normality test for mpg
shapiro.test(my_data$mpg) # => p = 0.1229
## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$mpg
## W = 0.94756, p-value = 0.1229
# Shapiro-Wilk normality test for wt
shapiro.test(my_data$wt) # => p = 0.09
## 
##  Shapiro-Wilk normality test
## 
## data:  my_data$wt
## W = 0.94326, p-value = 0.09265

The two p-values are greater than the significance level 0.05, implying that the distributions of mpg and wt are not significantly different from a normal distribution. In other words, we can assume normality.

Visual inspection of the data normality using Q-Q plots ——————————-

A Q-Q (quantile-quantile) plot draws the correlation between a given sample and the normal distribution.

library("ggpubr")
# mpg
ggqqplot(my_data$mpg, ylab = "MPG")

# wt
ggqqplot(my_data$wt, ylab = "WT")

Special Note:
(Note that, if the data are not normally distributed, it is recommended to use a non-parametric correlation, such as the Spearman or Kendall rank-based correlation tests.)

Pearson correlation test

Correlation test between mpg and wt variables:


res <- cor.test(my_data$wt, my_data$mpg, 
                    method = "pearson")
res
## 
##  Pearson's product-moment correlation
## 
## data:  my_data$wt and my_data$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

The p-value of the test is 1.294e-10, which is less than the significance level of 0.05. We can conclude that wt and mpg are significantly (negatively) correlated, with a correlation coefficient of about -0.87.

Spearman rank correlation coefficient and test —————————–

Spearman's rho statistic is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.

res2 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties
res2
## 
##  Spearman's rank correlation rho
## 
## data:  my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422

Kendall rank correlation test —————————–

The Kendall rank correlation coefficient or Kendall’s tau statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.

res2 <- cor.test(my_data$wt, my_data$mpg, method = "kendall")
## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "kendall"): Cannot
## compute exact p-value with ties
res2
## 
##  Kendall's rank correlation tau
## 
## data:  my_data$wt and my_data$mpg
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7278321

Homework / Class Work

Perform a hypothesis test for the correlation between mpg and qsec in the mtcars data set above, using R.

