Computing and visualizing a correlation matrix

ggcorrplot: Visualization of a correlation matrix using ggplot2

The ggcorrplot package can be used to easily visualize a correlation matrix using ggplot2. It provides a solution for reordering the correlation matrix and displaying the significance level on the correlogram. It also includes a function for computing a matrix of correlation p-values.

Installation and loading

ggcorrplot can be installed from CRAN as follows:

#install.packages("ggcorrplot")
# Loading
library(ggcorrplot)
## Loading required package: ggplot2
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Compute a correlation matrix

The mtcars data set will be used in the following R code. cor() returns the correlation matrix; cor_pmat() in the ggcorrplot package computes a matrix of correlation p-values.

# Compute a correlation matrix

corr <- round(cor(mtcars), 1) # rounded to one decimal place
head(corr[, 1:6]) # show first six rows of only the first six columns
##       mpg  cyl disp   hp drat   wt
## mpg   1.0 -0.9 -0.8 -0.8  0.7 -0.9
## cyl  -0.9  1.0  0.9  0.8 -0.7  0.8
## disp -0.8  0.9  1.0  0.8 -0.7  0.9
## hp   -0.8  0.8  0.8  1.0 -0.4  0.7
## drat  0.7 -0.7 -0.7 -0.4  1.0 -0.7
## wt   -0.9  0.8  0.9  0.7 -0.7  1.0
# Compute a matrix of correlation p-values
p.mat <- cor_pmat(mtcars)
head(p.mat[, 1:4])
##               mpg          cyl         disp           hp
## mpg  0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07
## cyl  6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09
## disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03
## wt   1.293959e-10 1.217567e-07 1.222320e-11 4.145827e-05

Correlation matrix visualization

# Visualize the correlation matrix
# --------------------------------
# method = "square" (default)
ggcorrplot(corr)

ggcorrplot(cor(mtcars))

# method = "circle"
ggcorrplot(corr, method = "circle")

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
ggcorrplot(cor(mtcars),
           method = "circle",
           type = "lower",
           outline.color = "black",
           lab_size = 6) # lab_size only takes effect when lab = TRUE

# Reordering the correlation matrix
# --------------------------------
# using hierarchical clustering
ggcorrplot(corr, hc.order = TRUE, outline.color = "white")

# Types of correlogram layout
# --------------------------------
# Get the lower triangle
ggcorrplot(corr,
           hc.order = TRUE,
           type = "lower",
           outline.color = "white")

# Get the upper triangle
ggcorrplot(corr,
           hc.order = TRUE,
           type = "upper",
           outline.color = "white")

# Change colors and theme
# --------------------------------
# Argument colors
ggcorrplot(
  corr,
  hc.order = TRUE,
  type = "lower",
  outline.color = "white",
  ggtheme = ggplot2::theme_gray,
  colors = c("#6D9EC1", "white", "#E46726")
)

# Add correlation coefficients
# --------------------------------
# argument lab = TRUE
ggcorrplot(corr,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

# Add correlation significance level
# --------------------------------
# Argument p.mat
# Non-significant coefficients are crossed out by default
ggcorrplot(corr,
           hc.order = TRUE,
           type = "lower",
           p.mat = p.mat)

library(ggcorrplot)
ggcorrplot(corr,
           method = "circle",   # "square" (default) or "circle"
           type = "lower",      # "full" (default), "lower" or "upper" display
           hc.order = TRUE,     # if TRUE, the correlation matrix is reordered using hclust
           colors = c("red", "white", "green"), # change colors
           lab = TRUE           # add correlation coefficients
           )

# Leave non-significant coefficients blank
ggcorrplot(
  corr,
  p.mat = p.mat,
  hc.order = TRUE,
  type = "lower",
  insig = "blank"
)

Using the corrplot package: correlation plots in R

#install.packages("corrplot")
library(corrplot)
## corrplot 0.92 loaded
#loading the dataset
data(mtcars)
# make the correlation matrix plot
corrplot(cor(mtcars)) # plots the correlation matrix

It represents the correlation coefficient, the value of R, i.e., the degree of the linear relationship. R ranges from -1 to +1. +1 means a perfect positive correlation: if one variable increases, the other also increases, and if one decreases, the other also decreases. -1 means a perfect negative correlation: if one increases, the other decreases. And 0 means no linear relationship. The size of the circles is proportional to the magnitude of the correlation.

You can also change the method to "square", "circle", or "number" to change the way the correlation matrix is represented. You can also change the type to show the upper or the lower section of the matrix (remember the upper and lower triangles are mirror images of each other), so you can show the upper, the lower, or the full matrix based on your preference.

corrplot(
 cor(mtcars),
 method = "square",
 type = "upper",
 tl.col = "black",
 tl.cex = 2,
 col = colorRampPalette(c("purple", "darkgreen"))(200)
)
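
The "number" method mentioned above is not demonstrated in the examples; a minimal sketch (reusing the same mtcars correlation matrix) would be:

corrplot(cor(mtcars),
 method = "number", # print the coefficients instead of shapes
 type = "lower",    # show only the lower triangle
 tl.col = "black")  # black variable labels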

You can also create a mixed-type matrix using the following code. Here the upper section uses squares as the method and the lower section uses numbers (the correlation coefficients).

corrplot.mixed(cor(mtcars),
 upper = "square",
 lower = "number",
 addgrid.col = "black",
 tl.col = "black")

Using the ggstatsplot package

An alternative to the correlogram presented above is possible with the ggcorrmat() function from the {ggstatsplot} package:

#install.packages("ggstatsplot")
# load package
library(ggstatsplot)
## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
# correlogram
ggstatsplot::ggcorrmat(
  data = mtcars,
  type = "parametric", # parametric for Pearson, nonparametric for Spearman's correlation
  colors = c("darkred", "white", "steelblue") # change default colors
)

dat <- mtcars[, c(1, 3:7)]
# correlogram
ggstatsplot::ggcorrmat(
  data = dat,
  type = "parametric", # parametric for Pearson, nonparametric for Spearman's correlation
  colors = c("darkred", "white", "steelblue") # change default colors
)

dat <- mtcars[, c(1, 3:7)]

corrplot2 <- function(data,
                      method = "pearson",
                      sig.level = 0.05,
                      order = "original",
                      diag = FALSE,
                      type = "upper",
                      tl.srt = 90,
                      number.font = 1,
                      number.cex = 1,
                      mar = c(0, 0, 0, 0)) {
  library(corrplot)
  data_incomplete <- data
  data <- data[complete.cases(data), ]
  mat <- cor(data, method = method)
  cor.mtest <- function(mat, method) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        tmp <- cor.test(mat[, i], mat[, j], method = method)
        p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      }
    }
    colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
    p.mat
  }
  p.mat <- cor.mtest(data, method = method)
  col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
  corrplot(mat,
    method = "color", col = col(200), number.font = number.font,
    mar = mar, number.cex = number.cex,
    type = type, order = order,
    addCoef.col = "black", # add correlation coefficient
    tl.col = "black", tl.srt = tl.srt, # rotation of text labels
    # combine with significance level
    p.mat = p.mat, sig.level = sig.level, insig = "blank",
    # hide correlation coefficients on the diagonal
    diag = diag
  )
}

corrplot2(
  data = dat,
  method = "pearson",
  sig.level = 0.05,
  order = "original",
  diag = FALSE,
  type = "upper",
  tl.srt = 75
)

Using another fun package called "lares"

It ranks the correlations and displays them in gradually decreasing order, which is really useful for analyzing the most correlated pairs of variables.

#install.packages("lares")
library(lares)
corr_cross(mtcars, rm.na = TRUE, max_pvalue = 0.05, top = 15, grid = TRUE)
## Returning only the top 15. You may override with the 'top' argument
## Warning in .font_global(font, quiet = FALSE): Font 'Arial Narrow' is not
## installed, has other name, or can't be found

Negative correlations are represented in red and positive correlations in blue.

All possible correlations using the lares package

Use the corr_cross() function if you want to compute all correlations and return the highest and significant ones in a plot:

# devtools::install_github("laresbernardo/lares")
library(lares)

corr_cross(dat, # name of dataset
  max_pvalue = 0.05, # display only significant correlations (at 5% level)
  top = 10 # display top 10 couples of variables (by correlation coefficient)
)
## Returning only the top 10. You may override with the 'top' argument

Negative correlations are represented in red and positive correlations in blue.

Correlation of one variable against all others using the lares package

Use the corr_var() function if you want to focus on the correlation of one variable against all others, and return the highest ones in a plot:

corr_var(dat, # name of dataset
  mpg, # name of variable to focus on
  top = 5 # display top 5 correlations
)

corr_var(mtcars, # name of dataset
  mpg, # name of variable to focus on
  top = 7 # display top 7 correlations
)

Using heatmap()

corr <- round(cor(mtcars), 2) #Compute a correlation matrix
col <- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = corr, col = col, symm = TRUE)

Now let's calculate a correlation matrix (and its p-values) with the rstatix package, and then we will see how to create the visualization ourselves using ggplot2.

#install.packages("rstatix")
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:ggcorrplot':
## 
##     cor_pmat
## The following object is masked from 'package:stats':
## 
##     filter
cor_test <- cor_mat(mtcars) #to create the correlation matrix
cor_test
## # A tibble: 11 × 12
##    rowname   mpg   cyl  disp    hp   drat    wt   qsec    vs     am  gear   carb
##  * <chr>   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
##  1 mpg      1    -0.85 -0.85 -0.78  0.68  -0.87  0.42   0.66  0.6    0.48 -0.55 
##  2 cyl     -0.85  1     0.9   0.83 -0.7    0.78 -0.59  -0.81 -0.52  -0.49  0.53 
##  3 disp    -0.85  0.9   1     0.79 -0.71   0.89 -0.43  -0.71 -0.59  -0.56  0.39 
##  4 hp      -0.78  0.83  0.79  1    -0.45   0.66 -0.71  -0.72 -0.24  -0.13  0.75 
##  5 drat     0.68 -0.7  -0.71 -0.45  1     -0.71  0.091  0.44  0.71   0.7  -0.091
##  6 wt      -0.87  0.78  0.89  0.66 -0.71   1    -0.17  -0.55 -0.69  -0.58  0.43 
##  7 qsec     0.42 -0.59 -0.43 -0.71  0.091 -0.17  1      0.74 -0.23  -0.21 -0.66 
##  8 vs       0.66 -0.81 -0.71 -0.72  0.44  -0.55  0.74   1     0.17   0.21 -0.57 
##  9 am       0.6  -0.52 -0.59 -0.24  0.71  -0.69 -0.23   0.17  1      0.79  0.058
## 10 gear     0.48 -0.49 -0.56 -0.13  0.7   -0.58 -0.21   0.21  0.79   1     0.27 
## 11 carb    -0.55  0.53  0.39  0.75 -0.091  0.43 -0.66  -0.57  0.058  0.27  1
cor_p <- cor_pmat(mtcars)
cor_p
## # A tibble: 11 × 12
##    rowname      mpg      cyl     disp            hp       drat        wt    qsec
##    <chr>      <dbl>    <dbl>    <dbl>         <dbl>      <dbl>     <dbl>   <dbl>
##  1 mpg     0        6.11e-10 9.38e-10 0.000000179   0.0000178  1.29e- 10 1.71e-2
##  2 cyl     6.11e-10 0        1.8 e-12 0.00000000348 0.00000824 1.22e-  7 3.66e-4
##  3 disp    9.38e-10 1.8 e-12 0        0.0000000714  0.00000528 1.22e- 11 1.31e-2
##  4 hp      1.79e- 7 3.48e- 9 7.14e- 8 0             0.00999    4.15e-  5 5.77e-6
##  5 drat    1.78e- 5 8.24e- 6 5.28e- 6 0.00999       0          4.78e-  6 6.2 e-1
##  6 wt      1.29e-10 1.22e- 7 1.22e-11 0.0000415     0.00000478 2.27e-236 3.39e-1
##  7 qsec    1.71e- 2 3.66e- 4 1.31e- 2 0.00000577    0.62       3.39e-  1 0      
##  8 vs      3.42e- 5 1.84e- 8 5.24e- 6 0.00000294    0.0117     9.8 e-  4 1.03e-6
##  9 am      2.85e- 4 2.15e- 3 3.66e- 4 0.18          0.00000473 1.13e-  5 2.06e-1
## 10 gear    5.4 e- 3 4.17e- 3 9.64e- 4 0.493         0.00000836 4.59e-  4 2.43e-1
## 11 carb    1.08e- 3 1.94e- 3 2.53e- 2 0.000000783   0.621      1.46e-  2 4.54e-5
## # … with 4 more variables: vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>

Now that we have the matrix of r values, we can gather the data into long format, with the variable names in one column and the corresponding r value in another, using the following code.

library(tidyverse) # assuming tidyverse (for gather() and %>%), if not already loaded
df <- cor_test %>% gather(-rowname, key = cor_var, value = r)
df
## # A tibble: 121 × 3
##    rowname cor_var     r
##    <chr>   <chr>   <dbl>
##  1 mpg     mpg      1   
##  2 cyl     mpg     -0.85
##  3 disp    mpg     -0.85
##  4 hp      mpg     -0.78
##  5 drat    mpg      0.68
##  6 wt      mpg     -0.87
##  7 qsec    mpg      0.42
##  8 vs      mpg      0.66
##  9 am      mpg      0.6 
## 10 gear    mpg      0.48
## # … with 111 more rows
df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
 labs(x = "variables", y = "variables")

Now let's say we want to customize it; we can use the basic ggplot2 functions to do so. For example:

df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
 labs(x = "variables", y = "variables") +
 scale_fill_gradient(low = "blue", high = "red")

Now let's say you want to add the actual values to your plot; you can use the following code:

df %>% ggplot(aes(rowname, cor_var, fill = r)) + geom_tile() +
 labs(x = "variables", y = "variables") +
 scale_fill_gradient(low = "blue", high = "red") +
  geom_text(aes(label = r))

Another informative package is PerformanceAnalytics, which gives you the p-values, the distributions (histograms), and the correlation coefficients.

#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(mtcars) # pass the raw data (not cor(mtcars)) so the scatter plots and p-values are meaningful
## Warning in par(usr): argument 1 does not name a graphical parameter


The red stars in the figure denote the level of significance: * = 0.05, ** = 0.01, *** = 0.001.

my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
## Warning in par(usr): argument 1 does not name a graphical parameter


Notice!

  1. The distribution of each variable is shown on the diagonal.
  2. Bottom left of the diagonal: the bivariate scatter plots with a fitted line are displayed.
  3. Top right of the diagonal: the value of the correlation plus the significance level as stars.
  4. Each significance level is associated with a symbol: p-values (0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols ("***", "**", "*", ".", " ") (see the short sketch below).
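
As a rough illustration of that mapping, base R's symnum() function encodes p-values into the same significance symbols (the p-values below are hypothetical, chosen only to show one value per symbol):

p <- c(0.0002, 0.004, 0.03, 0.08, 0.4) # hypothetical p-values
symnum(p, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
       symbols = c("***", "**", "*", ".", " "))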

The ggpairs() function of the GGally package

  • allows you to build a great scatterplot matrix.
# Quick display of two capabilities of GGally, to assess the distribution and correlation of variables
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Check correlations (as scatterplots), distributions and print correlation coefficients
ggpairs(mtcars, columns = 1:8, title="correlogram with ggpairs()") 

# Nice visualization of correlations
ggpairs(mtcars, columns = 2:4, ggplot2::aes(colour=as.character(am)))

# Quick display of two capabilities of GGally, to assess the distribution and correlation of variables
library(GGally)
 
# From the help page:
data(mtcars)
ggpairs(
  mtcars[, c(1, 3, 4, 2)],
  upper = list(continuous = "density", combo = "box_no_facet"),
  lower = list(continuous = "points", combo = "dot_no_facet")
)

Correlation coefficient

The correlation coefficient is a quantity that measures the strength of the association (or dependence) between two variables.

Types of correlation coefficient

Pearson r: a parametric correlation test, as it depends on the distribution (normal distribution) of the data. It measures the linear dependence between two variables; the plot of y = f(x) is the linear regression line. This is the most commonly used method.

Kendall tau: a rank-based correlation coefficient (non-parametric method). Recommended if the data do not come from a bivariate normal distribution.

Spearman rho: a rank-based correlation coefficient (non-parametric method). Recommended if the data do not come from a bivariate normal distribution.
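
To see how the choice of method changes the coefficient, cor() accepts all three; a quick sketch using the wt and mpg variables from mtcars (the same pair tested further down):

cor(mtcars$wt, mtcars$mpg, method = "pearson")  # linear dependence (r)
cor(mtcars$wt, mtcars$mpg, method = "kendall")  # rank-based (tau)
cor(mtcars$wt, mtcars$mpg, method = "spearman") # rank-based (rho)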

Preliminary tests to check the test assumptions

  1. Are the data normally distributed?
  2. Is the covariation linear? Yes, from the plot, the relationship is linear. In situations where the scatter plot shows curved patterns, we are dealing with a nonlinear association between the two variables.

Do the data from each of the two variables (x, y) follow a normal distribution?

Use the Shapiro-Wilk normality test (R function: shapiro.test()) and look at the normality plot (R function: ggpubr::ggqqplot()).

The Shapiro-Wilk test can be performed as follows:

Null hypothesis: the data are normally distributed. Alternative hypothesis: the data are not normally distributed.

#install.packages("ggpubr")
library(ggpubr)
ggscatter(mtcars, x = "mpg", y = "wt", 
          add = "reg.line", 
          conf.int = TRUE, 
          cor.coef = TRUE, 
          cor.method = "pearson",
          xlab = "Miles/(US) gallon", 
          ylab = "Weight (1000 lbs)")
## `geom_smooth()` using formula = 'y ~ x'

#Shapiro-Wilk normality test for mpg and wt

shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229
shapiro.test(mtcars$wt) 
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$wt
## W = 0.94326, p-value = 0.09265
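
Both p-values (0.1229 and 0.09265) are greater than the significance level 0.05, implying that the distributions of mpg and wt are not significantly different from a normal distribution. In other words, we can assume normality and use the Pearson correlation.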

#Visual inspection of the data normality using Q-Q plots (quantile-quantile plots)

#Q-Q plot draws the correlation between a given sample and the normal distribution.
ggqqplot(mtcars$mpg, ylab = "MPG")
## Warning: The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

ggqqplot(mtcars$wt, ylab = "WT")
## Warning: The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

#Pearson correlation test
cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

If the data are not normally distributed

If the data are not normally distributed, it's recommended to use non-parametric correlation tests, such as the Spearman and Kendall rank-based correlation tests.

#Spearman rank correlation coefficient
cor.test(mtcars$wt, mtcars$mpg,  method = "spearman")
## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  mtcars$wt and mtcars$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.886422
#Kendall rank correlation test
res <- cor.test(mtcars$wt, mtcars$mpg,  method="kendall")
## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "kendall"): Cannot
## compute exact p-value with ties
res
## 
##  Kendall's rank correlation tau
## 
## data:  mtcars$wt and mtcars$mpg
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7278321
#Extract the p.value and the correlation coefficient
res$p.value
## [1] 6.70577e-09
res$estimate
##        tau 
## -0.7278321

Interpreting the correlation coefficient

The value of the correlation coefficient can be negative or positive, ranging over [-1, 1]:

  • -1: strong negative correlation
  • 0: no relationship between the two variables (x and y)
  • 1: strong positive correlation
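
For example, with the wt and mpg variables tested above:

cor(mtcars$wt, mtcars$mpg) # about -0.87: a strong negative correlation between weight and mileage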