Load required libraries

Open a blank R script, set the working directory, keep the dataset and script in the same folder. Then start scripting…

library(readxl)
library(GGally)
library(corrplot)
library(ggcorrplot)
library(car)
library(tidyverse)

Read the data set

# This data set has been copied from mtcars in R datasets package.
data = read_xlsx('datasets.xlsx', sheet = 'wrangling', range = 'V7:AG39')
# First 6 rows
head(data)
## # A tibble: 6 × 12
##   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# Last 6 rows
tail(data)
## # A tibble: 6 × 12
##   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Porsche 914…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
## 2 Lotus Europa  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
## 3 Ford Panter…  15.8     8 351     264  4.22  3.17  14.5     0     1     5     4
## 4 Ferrari Dino  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
## 5 Maserati Bo…  15       8 301     335  3.54  3.57  14.6     0     1     5     8
## 6 Volvo 142E    21.4     4 121     109  4.11  2.78  18.6     1     1     4     2
# Select the variables to create correlation matrix
car = data %>% select(mpg, disp, hp, drat, wt, qsec)
head(car)
## # A tibble: 6 × 6
##     mpg  disp    hp  drat    wt  qsec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21     160   110  3.9   2.62  16.5
## 2  21     160   110  3.9   2.88  17.0
## 3  22.8   108    93  3.85  2.32  18.6
## 4  21.4   258   110  3.08  3.22  19.4
## 5  18.7   360   175  3.15  3.44  17.0
## 6  18.1   225   105  2.76  3.46  20.2

Pearson correlation test between mpg and hp

We know that correlation means the relationship between two variables: e.g. mpg (mile per gallon) and hp (horse power). Both the variables should be continous and have normal distribution. The Pearson correlation coefficient varies between -1 and 1, where 0 indicates no relationship, 1 indicates perfect positive relationship, -1 indicates perfect negative relationship. That means correlation coefficient indicates the changes in one variable as changes take place in another variable. We want to see whether mpg changes with the changes in hp of cars.

For mpg and hp, the correlation coefficient (r) is -0.78 with p < 0.05, which means that increase in hp decreases the mpg. It is understand that increased hp needs more fuel to burn. \(H_0: r = 0\). The very small p-value cannot support the \(H_0\), which is rejected.

cor.test(car$mpg, car$hp)
## 
##  Pearson's product-moment correlation
## 
## data:  car$mpg and car$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684

Spearman correlation test between cyl and mpg

The variables cyl (number of cylinder, discrete variables with a few levels, rank data) and mpg (miles per gallon) do not comply with the assumptions of Pearson correlation test. We estimate Spearman rank correlation coefficient (\(\rho\)) in this case. The value of \(\rho\) is -0.91 with p < 0.001 suggests strong negative correlation between mpg and cyl. We can conclude that increasing the number of cylinders in a car will increase the fuel cost.

cor.test(data$mpg, data$cyl, method = 'spearman')
## Warning in cor.test.default(data$mpg, data$cyl, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  data$mpg and data$cyl
## S = 10425, p-value = 4.69e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.9108013

Please note that if both the variables are categorical, we use chi-square test or Fisher’s exact test to assess the association or relationship. For example, if a researcher is interested to know whether the number of gears (gear) and number of carburetors (carb) are related to each other, chi-square test can be done. First, you need to create a contingency table.

We see that many cells (>20%) have frequencies <5. Therefore, Fisher’s exact test is suggested. We also notice that Chi-square value is significant but Fisher’s test indicates that there is no association between the number of gears and the number of carburetors in cars as the p-value is larger than 0.05.

cont_table = table(data$gear, data$carb)
cont_table
##    
##     1 2 3 4 6 8
##   3 3 4 3 5 0 0
##   4 4 4 0 4 0 0
##   5 0 2 0 1 1 1
chisq.test(cont_table)
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 16.518, df = 10, p-value = 0.08573
fisher.test(cont_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_table
## p-value = 0.2434
## alternative hypothesis: two.sided

Correlation matrix and visualization

So far, we have done correlation test as a bivariate combination. For more than two continuous variables, how should we present the analysis and visualize the results. We have to create a matrix with r and p values.

Let us check the scatter plot to guess the relationship between different variables.

plot(car, gap = 1/10, main = 'Pair-wise correlation matrix')

# For saving the plots created by R base functions
# png('Pairs.png', width = 8, height = 8, units = 'in', res = 600)
# plot(car, gap = 1/10, main = 'Pair-wise correlation matrix')
# dev.off()
scatterplotMatrix(formula = ~mpg+disp+hp+drat+wt+qsec, data = car,
                  smooth = FALSE, regLine = TRUE, ellipse = FALSE)

Create the r matrix

cor_mat = cor(car)
round(cor_mat, 2)
##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00

Create the p matrix

p_mat = cor_pmat(car)
round(p_mat, 2)
##       mpg disp   hp drat   wt qsec
## mpg  0.00 0.00 0.00 0.00 0.00 0.02
## disp 0.00 0.00 0.00 0.00 0.00 0.01
## hp   0.00 0.00 0.00 0.01 0.00 0.00
## drat 0.00 0.00 0.01 0.00 0.00 0.62
## wt   0.00 0.00 0.00 0.00 0.00 0.34
## qsec 0.02 0.01 0.00 0.62 0.34 0.00

Visualization

ggcorrplot(cor_mat)

ggcorrplot(corr = cor_mat, p.mat = p_mat,
           method = 'square',
           type = 'lower', insig = 'pch', 
           ggtheme = ggplot2::theme_test)

corrplot(cor_mat)

corrplot(corr = cor_mat, p.mat = p_mat,
         method = 'pie',
         type = 'lower',
         insig = 'pch',
         diag = FALSE,
         bg = 'grey95',
         tl.col = 'black',
         title = "Correlation matrix based on Pearson's r",
         mar = c(1,1,2,1))

ggpairs(car) + theme_test()