R Notebook

# Correlation

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

# Correlation between two continuous variables

head(mtcars) # Built-in dataset

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

head(iris) # Built-in dataset

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Visually look at correlation between two continuous variables

library(gridExtra)
library(ggplot2)

p1 <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(color = 2) + theme_bw() + ylab("Petal Length") + xlab("Sepal Length")

p2 <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point(color = 4) + theme_bw() + ylab("mpg") + xlab("wt")

grid.arrange(p1,p2,ncol=2)

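The plots suggest a strong relationship within each pair of variables: positive for sepal length and petal length, negative for wt and mpg. Before testing whether the correlation is significant with Pearson's test, we should check whether each variable is normally distributed, since the test assumes this.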
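As a numeric complement to the scatter plots, the correlation coefficients can be computed directly with base R's `cor()`. This is only a small sketch; the variable names `r_iris` and `r_mtcars` are illustrative and not part of the original notebook.

```r
# Pearson correlation coefficients for the two pairs plotted above
# (cor() uses method = "pearson" by default)
r_iris   <- cor(iris$Sepal.Length, iris$Petal.Length)
r_mtcars <- cor(mtcars$wt, mtcars$mpg)

# Round for display; the sign gives the direction of the relationship
round(c(iris = r_iris, mtcars = r_mtcars), 3)
```

A coefficient close to 1 or -1 indicates a strong linear relationship, but it says nothing yet about statistical significance.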

Assumption

H0: the data are normally distributed

Ha: the data are not normally distributed

# Shapiro-Wilk test for normality

shapiro.test(mtcars$mpg)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

# p > 0.05 => we fail to reject H0; mpg does not deviate significantly from a normal distribution
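The test above covers only `mpg`, while the correlation test involves `wt` as well. One possible way to check both variables in a single step is sketched below; the names `vars`, `p_values` and `normal_ok` are illustrative, and the 0.05 threshold is the conventional significance level assumed throughout this notebook.

```r
# Run the Shapiro-Wilk test on both variables used in the correlation
vars <- c("mpg", "wt")
p_values <- sapply(mtcars[vars], function(x) shapiro.test(x)$p.value)
p_values

# TRUE where we fail to reject normality at the 5% level
normal_ok <- p_values > 0.05
normal_ok
```

If either entry of `normal_ok` were `FALSE`, the normality assumption would be in doubt for that variable.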

Is the correlation statistically significant?

H0: there is no association between the two variables (the true correlation is 0)

Ha: the true correlation is not equal to 0

cor.test(mtcars$mpg,mtcars$wt)

## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

# p < 0.05 => we reject H0; there is strong evidence that mpg and wt are correlated (r = -0.87, a strong negative correlation)
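The object returned by `cor.test()` is a list of class `htest`, so the numbers printed above can also be extracted for reporting. The sketch below shows one way to do this; the names `res`, `r`, `ci`, `p` and `res_s` are illustrative. The Spearman call at the end is not part of the original analysis: it is included only as a commonly used rank-based alternative for cases where the normality check fails.

```r
# Pearson correlation test (the default method), stored for later use
res <- cor.test(mtcars$mpg, mtcars$wt)

r  <- unname(res$estimate)  # correlation coefficient
ci <- res$conf.int          # 95% confidence interval
p  <- res$p.value           # p-value of the test

cat(sprintf("r = %.3f, 95%% CI [%.3f, %.3f], p = %.3g\n", r, ci[1], ci[2], p))

# Rank-based alternative (assumption: used only if normality were rejected);
# exact = FALSE avoids the warning about ties in the data
res_s <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)
unname(res_s$estimate)
```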