A correlogram, or correlation matrix, lets you analyse the relationship between each pair of numeric variables in a dataset.

It gives a quick overview of the whole dataset and is used more for exploratory than for explanatory purposes.
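The numbers behind a correlogram can be inspected directly before any plotting. A minimal sketch in base R, assuming the built-in mtcars data and an arbitrary choice of four columns (neither is part of the analysis that follows):

```r
# Correlation matrix for a few numeric columns of the built-in mtcars data
vars <- c("mpg", "disp", "hp", "wt")   # illustrative choice of columns
cor_mat <- cor(mtcars[, vars])         # Pearson correlation by default

# The matrix is symmetric, with ones on the diagonal
round(cor_mat, 2)

# List each pair once (upper triangle), sorted by absolute correlation
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
pairs_df[order(-abs(pairs_df$r)), ]
```

A correlogram is a graphical rendering of this matrix, which is why four variables give six distinct pairs to read.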

1. USING THE GGALLY PACKAGE

library(GGally)


The ggpairs() function of the GGally package builds a scatterplot matrix.

Scatterplots of each pair of numeric variables are drawn in the lower-left part of the figure.

The Pearson correlation of each pair is displayed in the upper-right part.

The distribution of each variable is shown on the diagonal.

library(tidyverse)


bank <- read.csv("bank_cleaned.csv")

num_bank <- bank %>% 
  select(where(is.numeric)) %>% 
  select(-c(X, response_binary))

str(num_bank)

## 'data.frame':    40841 obs. of  7 variables:
##  $ age     : int  58 44 33 35 28 42 58 43 41 29 ...
##  $ balance : int  2143 29 2 231 447 2 121 593 270 390 ...
##  $ day     : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ duration: num  4.35 2.52 1.27 2.32 3.62 6.33 0.83 0.92 3.7 2.28 ...
##  $ campaign: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays   : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous: int  0 0 0 0 0 0 0 0 0 0 ...

Full signature of ggpairs(), from https://www.rdocumentation.org/packages/GGally/versions/1.5.0/topics/ggpairs:

ggpairs(
  data,
  mapping = NULL,
  columns = 1:ncol(data),
  title = NULL,
  upper = list(continuous = "cor", combo = "box_no_facet", discrete = "facetbar", na = "na"),
  lower = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "na"),
  diag = list(continuous = "densityDiag", discrete = "barDiag", na = "naDiag"),
  params = NULL,
  xlab = NULL,
  ylab = NULL,
  axisLabels = c("show", "internal", "none"),
  columnLabels = colnames(data[columns]),
  labeller = "label_value",
  switch = NULL,
  showStrips = NULL,
  legend = NULL,
  cardinality_threshold = 15,
  progress = NULL,
  legends = stop("deprecated")
)

ggpairs(num_bank, title="correlogram with ggpairs()")

ggpairs(iris, 
        mapping = aes(color = Species),
        columns = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'), 
        columnLabels = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species')) + 
  scale_colour_manual(values=c('red','blue','orange')) +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

# banknote is not part of MASS; it is loaded from uskewFactors below.

library(uskewFactors)


data(banknote)

# Recode the 0/1 indicator Y as a readable label
swissbank <- banknote %>% 
  mutate(status = case_when(
    Y == 1 ~ "counterfeit", 
    Y == 0 ~ "genuine"))

# Drop the original indicator column
swissbank <- subset(swissbank, select = -c(Y))

The banknote dataset contains measurements on 200 Swiss banknotes: 100 genuine and 100 counterfeit. The variables are the length of the bill, the width of the left edge, the width of the right edge, the bottom margin width, the top margin width and the length of the diagonal. All measurements are in millimetres. This data is also available in the alr package in R.

str(swissbank)

## 'data.frame':    200 obs. of  7 variables:
##  $ Length  : num  215 215 215 215 215 ...
##  $ Left    : num  131 130 130 130 130 ...
##  $ Right   : num  131 130 130 130 130 ...
##  $ Bottom  : num  9 8.1 8.7 7.5 10.4 9 7.9 7.2 8.2 9.2 ...
##  $ Top     : num  9.7 9.5 9.6 10.4 7.7 10.1 9.6 10.7 11 10 ...
##  $ Diagonal: num  141 142 142 142 142 ...
##  $ status  : chr  "genuine" "genuine" "genuine" "genuine" ...

ggpairs(swissbank, mapping = aes(col = status)) +
  scale_colour_manual(values = c('blue', 'orange')) +
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scatterplot matrix with ggpairs()

The ggcorr() function visualizes the correlation of each pair of variables as a square. The method argument lets you pick the correlation type.

# Nice visualization of correlations
ggcorr(num_bank, method = c("everything", "pearson"))
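The two strings passed to method correspond to the use and method arguments of base R's cor(): the first controls how missing values are handled, the second selects the coefficient ("pearson", "kendall" or "spearman"). A sketch of that correspondence in base R only, assuming the built-in mtcars data as a stand-in for num_bank:

```r
x <- mtcars[, c("mpg", "disp", "hp", "wt")]   # stand-in data, not num_bank

# Same handling of missing values, three different coefficients
pearson  <- cor(x, use = "everything", method = "pearson")
spearman <- cor(x, use = "everything", method = "spearman")
kendall  <- cor(x, use = "everything", method = "kendall")

# Compare the coefficients for one pair of variables
c(pearson  = pearson["mpg", "wt"],
  spearman = spearman["mpg", "wt"],
  kendall  = kendall["mpg", "wt"])

# With missing values, "everything" propagates NA, whereas
# "pairwise.complete.obs" drops the incomplete pairs
x_na <- x
x_na$hp[1:3] <- NA
cor(x_na, use = "everything")["mpg", "hp"]             # NA
cor(x_na, use = "pairwise.complete.obs")["mpg", "hp"]  # a number
```

Rank-based coefficients (Spearman, Kendall) are usually the safer choice when variables are heavily skewed or contain outliers.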

Split by group

# From the help page:
data(flea)
ggpairs(flea, columns = 2:4, ggplot2::aes(colour=species))
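Colouring by group raises the question of how the correlation differs between groups. A minimal base R sketch of that per-group computation, assuming the built-in iris data in place of flea:

```r
# Split two numeric columns by group, then compute one correlation per group
by_species <- split(iris[, c("Sepal.Length", "Sepal.Width")], iris$Species)
group_cor <- sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))

# Correlation over the pooled data, ignoring the grouping
overall_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)

round(c(overall = overall_cor, group_cor), 2)
```

In iris the pooled correlation is negative while every within-species correlation is positive, which is the kind of pattern a grouped correlogram helps reveal.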

Change plot types

You can change the type of plot used in each part of the correlogram with the upper and lower arguments.

library(reshape)


data(tips)
ggpairs(tips[, c(1, 3, 4, 2)],
  upper = list(continuous = "density", combo = "box_no_facet"),
  lower = list(continuous = "points", combo = "dot_no_facet")
)

ggpairs(tips,
        upper = list(continuous = "density", combo = "box_no_facet"),
        lower = list(continuous = "points", combo = "dot_no_facet"))

2. USING THE CORRGRAM PACKAGE

library(corrgram)

corrgram(mtcars, order = TRUE, 
         lower.panel = panel.shade, 
         upper.panel = panel.pie, 
         text.panel = panel.txt, 
         main = "Car Mileage Data in PC2/PC1 Order")

corrgram(mtcars, 
         order = TRUE, 
         lower.panel = panel.ellipse, 
         upper.panel = panel.pts, 
         text.panel = panel.txt, 
         diag.panel = panel.minmax, 
         main = "Car Mileage Data in PC2/PC1 Order")

corrgram(mtcars, order = NULL, 
         lower.panel = panel.shade, 
         upper.panel = NULL, 
         text.panel = panel.txt, 
         main = "Car Mileage Data (unsorted)")
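The "PC2/PC1 order" in the titles refers to reordering the variables with a principal component analysis of the correlation matrix, so that similar variables end up next to each other. The sketch below shows one way to compute such an ordering in base R, by sorting the variables on the angle of their loadings on the first two components. It illustrates the idea and is not a copy of corrgram's internals; the names e1, e2, angle and pca_order are made up for the example.

```r
cor_mat <- cor(mtcars)

# Loadings of each variable on the first two principal components
eig <- eigen(cor_mat)
e1 <- eig$vectors[, 1]
e2 <- eig$vectors[, 2]

# Angle of each variable in the PC1/PC2 plane (assumed ordering criterion)
angle <- ifelse(e1 > 0, atan(e2 / e1), atan(e2 / e1) + pi)

# Sort the variables by that angle and reorder the matrix accordingly
pca_order <- order(angle)
cor_ord <- cor_mat[pca_order, pca_order]
colnames(cor_ord)
```

Reordering only permutes rows and columns, so the correlations themselves are unchanged; only their position in the plot differs from the unsorted version.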

3. OTHER METHODS

There are lesser-known ways to build a correlogram with R, such as the ellipse package, the plot() function and the car package.

Package ellipse

library(ellipse)


library(RColorBrewer)

Use the mtcars dataset that ships with R.

data <- cor(mtcars)

# Build a palette of 100 colors with RColorBrewer
my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)
 
# Order the variables by their correlation with the first one (mpg)
ord <- order(data[1, ])

data_ord <- data[ord, ord]

plotcorr(data_ord, col = my_colors[data_ord * 50 + 50], mar = c(1, 1, 1, 1))
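The index data_ord * 50 + 50 maps a correlation in [-1, 1] to a position in the 100-colour palette. It gives 0, which is not a valid index, for a correlation of exactly -1, and it relies on R truncating fractional indices. mtcars has no correlation of -1, so the call works here, but a more defensive mapping may be preferable for other data. One possible sketch, where cor_to_index is a hypothetical helper and the palette is built from three arbitrary base colours instead of RColorBrewer:

```r
# Map correlations in [-1, 1] onto integer indices 1..n
cor_to_index <- function(r, n = 100) {
  idx <- round((r + 1) / 2 * (n - 1)) + 1
  pmin(pmax(idx, 1), n)   # clamp in case of rounding error outside [-1, 1]
}

# Illustrative diverging palette (not the Spectral palette used above)
palette_100 <- colorRampPalette(c("red", "white", "blue"))(100)

cor_to_index(c(-1, 1))   # 1 and 100: both ends stay inside the palette

# One colour per cell of the correlation matrix
cor_mat <- cor(mtcars)
cell_cols <- palette_100[cor_to_index(cor_mat)]
head(cell_cols)
```

The resulting vector could be passed as the col argument in place of my_colors[data_ord * 50 + 50].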

The native plot() function

The native plot() function does the job well as long as you only need to display scatterplots.

# Plot
plot(mtcars , pch=20 , cex=1.5 , col="blue")
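Called on a data frame, plot() draws the same scatterplot matrix as pairs(), and pairs() accepts custom panel functions. The sketch below uses that to print the correlation in the upper panels. panel_cor is a hypothetical helper modelled on the example in the pairs() help page, and the four columns are an arbitrary choice.

```r
# Upper-panel function: print the correlation instead of a scatterplot
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))   # temporary 0-1 coordinate system
  on.exit(par(old))                 # restore the panel's coordinates
  text(0.5, 0.5, format(cor(x, y), digits = 2), cex = 1.5)
}

# Scatterplots below the diagonal, correlations above it
pairs(mtcars[, c("mpg", "disp", "hp", "wt")],
      pch = 20, col = "blue",
      upper.panel = panel_cor)
```

This gives a layout close to the default ggpairs() output without any extra package.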

Correlograms