# Correlation
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Correlation between two continuous variables
head(mtcars) # Built-in dataset
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
head(iris) # Built-in dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Visually look at correlation between two continuous variables
library(gridExtra)
library(ggplot2)
p1 <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(color = 2) + theme_bw() + ylab("Petal Length") + xlab("Sepal Length")
p2 <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_point(color = 4) + theme_bw() + ylab("mpg") + xlab("wt")
grid.arrange(p1,p2,ncol=2)
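The strength of each relationship can also be summarised numerically. The following is a minimal sketch using base R's `cor()`, which computes Pearson's coefficient by default; the variable names `r_iris` and `r_mtcars` are illustrative only.

```r
# Pearson correlation coefficient for each plotted pair
# (cor() uses method = "pearson" unless told otherwise)
r_iris   <- cor(iris$Sepal.Length, iris$Petal.Length)
r_mtcars <- cor(mtcars$wt, mtcars$mpg)

# A value near +1 or -1 indicates a strong linear relationship;
# the sign gives the direction
print(c(iris = r_iris, mtcars = r_mtcars))
```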
Both plots show a strong linear relationship between the two variables (positive for the iris pair, negative for wt and mpg), but we should check whether each variable is normally distributed to see whether the normality assumption is met.
H0: the data are normally distributed
Ha: the data are not normally distributed
# Shapiro-Wilk test for normality
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
# p > 0.05 => we fail to reject H0; mpg is consistent with a normal distribution
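Only `mpg` is tested above, while the correlation test also involves `wt`. As a minimal sketch, the same check can be run on both variables at once; the names `vars` and `p_values` are illustrative and not part of the original analysis.

```r
# Run the Shapiro-Wilk test on both variables used in the correlation
vars <- c("mpg", "wt")

# shapiro.test() returns an "htest" object; $p.value holds the p-value
p_values <- sapply(vars, function(v) shapiro.test(mtcars[[v]])$p.value)

# Compare each p-value with the 0.05 threshold used in this document
print(round(p_values, 4))
print(p_values > 0.05)
```

A p-value above 0.05 here only means the test found no evidence against normality, which is weaker than showing the data are normal.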
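Is the correlation statistically significant?
H0: there is no association between the two variables (the true correlation is 0)
Ha: there is an association between the two variables (the true correlation is not equal to 0)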
cor.test(mtcars$mpg,mtcars$wt)
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
# p < 0.05 => we reject H0; there is strong evidence that the two variables are correlated
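To see where the reported `t` and `p-value` come from, the sketch below recomputes them from the sample correlation `r` and the sample size `n`, using the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The names `res`, `t_manual`, and `p_manual` are illustrative.

```r
# Store the test result so its components can be inspected
res <- cor.test(mtcars$mpg, mtcars$wt)

r <- unname(res$estimate)   # sample Pearson correlation
n <- nrow(mtcars)           # number of paired observations

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

# These should agree with the values printed by cor.test()
print(c(t = t_manual, t_reported = unname(res$statistic)))
print(c(p = p_manual, p_reported = res$p.value))
```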