Open a blank R script, set the working directory, keep the dataset and script in the same folder. Then start scripting…
library(readxl)
library(GGally)
library(corrplot)
library(ggcorrplot)
library(car)
library(tidyverse)
# This data set has been copied from mtcars in R datasets package.
data = read_xlsx('datasets.xlsx', sheet = 'wrangling', range = 'V7:AG39')
# First 6 rows
head(data)
## # A tibble: 6 × 12
## model mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# Last 6 rows
tail(data)
## # A tibble: 6 × 12
## model mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Porsche 914… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
## 2 Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
## 3 Ford Panter… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 4 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 5 Maserati Bo… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
## 6 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# Select the variables to create correlation matrix
car = data %>% select(mpg, disp, hp, drat, wt, qsec)
head(car)
## # A tibble: 6 × 6
## mpg disp hp drat wt qsec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 160 110 3.9 2.62 16.5
## 2 21 160 110 3.9 2.88 17.0
## 3 22.8 108 93 3.85 2.32 18.6
## 4 21.4 258 110 3.08 3.22 19.4
## 5 18.7 360 175 3.15 3.44 17.0
## 6 18.1 225 105 2.76 3.46 20.2
We know that correlation means the relationship between two variables: e.g. mpg (mile per gallon) and hp (horse power). Both the variables should be continous and have normal distribution. The Pearson correlation coefficient varies between -1 and 1, where 0 indicates no relationship, 1 indicates perfect positive relationship, -1 indicates perfect negative relationship. That means correlation coefficient indicates the changes in one variable as changes take place in another variable. We want to see whether mpg changes with the changes in hp of cars.
For mpg and hp, the correlation coefficient (r) is -0.78 with p < 0.05, which means that increase in hp decreases the mpg. It is understand that increased hp needs more fuel to burn. \(H_0: r = 0\). The very small p-value cannot support the \(H_0\), which is rejected.
cor.test(car$mpg, car$hp)
##
## Pearson's product-moment correlation
##
## data: car$mpg and car$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
The variables cyl (number of cylinder, discrete variables with a few levels, rank data) and mpg (miles per gallon) do not comply with the assumptions of Pearson correlation test. We estimate Spearman rank correlation coefficient (\(\rho\)) in this case. The value of \(\rho\) is -0.91 with p < 0.001 suggests strong negative correlation between mpg and cyl. We can conclude that increasing the number of cylinders in a car will increase the fuel cost.
cor.test(data$mpg, data$cyl, method = 'spearman')
## Warning in cor.test.default(data$mpg, data$cyl, method = "spearman"): Cannot
## compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$mpg and data$cyl
## S = 10425, p-value = 4.69e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.9108013
Please note that if both the variables are categorical, we use chi-square test or Fisher’s exact test to assess the association or relationship. For example, if a researcher is interested to know whether the number of gears (gear) and number of carburetors (carb) are related to each other, chi-square test can be done. First, you need to create a contingency table.
We see that many cells (>20%) have frequencies <5. Therefore, Fisher’s exact test is suggested. We also notice that Chi-square value is significant but Fisher’s test indicates that there is no association between the number of gears and the number of carburetors in cars as the p-value is larger than 0.05.
cont_table = table(data$gear, data$carb)
cont_table
##
## 1 2 3 4 6 8
## 3 3 4 3 5 0 0
## 4 4 4 0 4 0 0
## 5 0 2 0 1 1 1
chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 16.518, df = 10, p-value = 0.08573
fisher.test(cont_table)
##
## Fisher's Exact Test for Count Data
##
## data: cont_table
## p-value = 0.2434
## alternative hypothesis: two.sided
So far, we have done correlation test as a bivariate combination. For more than two continuous variables, how should we present the analysis and visualize the results. We have to create a matrix with r and p values.
Let us check the scatter plot to guess the relationship between different variables.
plot(car, gap = 1/10, main = 'Pair-wise correlation matrix')
# For saving the plots created by R base functions
# png('Pairs.png', width = 8, height = 8, units = 'in', res = 600)
# plot(car, gap = 1/10, main = 'Pair-wise correlation matrix')
# dev.off()
scatterplotMatrix(formula = ~mpg+disp+hp+drat+wt+qsec, data = car,
smooth = FALSE, regLine = TRUE, ellipse = FALSE)
cor_mat = cor(car)
round(cor_mat, 2)
## mpg disp hp drat wt qsec
## mpg 1.00 -0.85 -0.78 0.68 -0.87 0.42
## disp -0.85 1.00 0.79 -0.71 0.89 -0.43
## hp -0.78 0.79 1.00 -0.45 0.66 -0.71
## drat 0.68 -0.71 -0.45 1.00 -0.71 0.09
## wt -0.87 0.89 0.66 -0.71 1.00 -0.17
## qsec 0.42 -0.43 -0.71 0.09 -0.17 1.00
p_mat = cor_pmat(car)
round(p_mat, 2)
## mpg disp hp drat wt qsec
## mpg 0.00 0.00 0.00 0.00 0.00 0.02
## disp 0.00 0.00 0.00 0.00 0.00 0.01
## hp 0.00 0.00 0.00 0.01 0.00 0.00
## drat 0.00 0.00 0.01 0.00 0.00 0.62
## wt 0.00 0.00 0.00 0.00 0.00 0.34
## qsec 0.02 0.01 0.00 0.62 0.34 0.00
ggcorrplot(cor_mat)
ggcorrplot(corr = cor_mat, p.mat = p_mat,
method = 'square',
type = 'lower', insig = 'pch',
ggtheme = ggplot2::theme_test)
corrplot(cor_mat)
corrplot(corr = cor_mat, p.mat = p_mat,
method = 'pie',
type = 'lower',
insig = 'pch',
diag = FALSE,
bg = 'grey95',
tl.col = 'black',
title = "Correlation matrix based on Pearson's r",
mar = c(1,1,2,1))
ggpairs(car) + theme_test()