1 Objective

Here we are trying to identify correlated features/variables in a data set by using heatmap.

2 Data

Here we use a data set called churn data. This contains data of customer phone calles and customer details. For a example customer state, phone number, total day minute calls…etc.

Let’s see how to identify correlated variables on this data.

2.1 Reading Data

library(readr)
churndata <- read_csv("P:/uop-PINK/RMyProjects/R/Data/churndata.csv")

2.2 Data Pre-processing

In data pre-processing we do identify outliers, NA values and get rid of them. Also here let’s remove unnecessary variables from the data set.
Since we just need numerical variables we remove other variables from the data set.

head(churndata)
churnData <- churndata[8:19]
head(churnData)
str(churnData)
## tibble [3,333 x 12] (S3: tbl_df/tbl/data.frame)
##  $ total day minutes  : num [1:3333] 265 162 243 299 167 ...
##  $ total day calls    : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
##  $ total day charge   : num [1:3333] 45.1 27.5 41.4 50.9 28.3 ...
##  $ total eve minutes  : num [1:3333] 197.4 195.5 121.2 61.9 148.3 ...
##  $ total eve calls    : num [1:3333] 99 103 110 88 122 101 108 94 80 111 ...
##  $ total eve charge   : num [1:3333] 16.78 16.62 10.3 5.26 12.61 ...
##  $ total night minutes: num [1:3333] 245 254 163 197 187 ...
##  $ total night calls  : num [1:3333] 91 103 104 89 121 118 118 96 90 97 ...
##  $ total night charge : num [1:3333] 11.01 11.45 7.32 8.86 8.41 ...
##  $ total intl minutes : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ total intl calls   : num [1:3333] 3 3 5 7 3 6 7 6 4 5 ...
##  $ total intl charge  : num [1:3333] 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...

So we have 12 numerical variables in newly created churnData data set.

3 Head Map

Simple intorduction to Heat Map.
A heat map is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

4 Process for a inbuid data

4.1 Data and Correlation Matrix

### Data
mydata <- mtcars[, c(1,3,4,5,6,7)]
dim(mydata)
## [1] 32  6
head(mydata)
### Correlation Matrix
cormat <- round(x = cor(mydata), digits = 2)
# round use to just to get first two digits after decimal point.
head(cormat)
##        mpg  disp    hp  drat    wt  qsec
## mpg   1.00 -0.85 -0.78  0.68 -0.87  0.42
## disp -0.85  1.00  0.79 -0.71  0.89 -0.43
## hp   -0.78  0.79  1.00 -0.45  0.66 -0.71
## drat  0.68 -0.71 -0.45  1.00 -0.71  0.09
## wt   -0.87  0.89  0.66 -0.71  1.00 -0.17
## qsec  0.42 -0.43 -0.71  0.09 -0.17  1.00

By looking at the cormat (Correlation matrix) we can see some correlations among the variables. But that’s not our objective. Let’s go for heatmaps.

4.2 Creating Corralation Heat Map.

We use ggplot2, reshape2 packages to perform this task.

### Libraries
library(ggplot2)
library(reshape2)

### Let's melt the corralation matrix
melted_cormat <- melt(cormat)
head(melted_cormat)
# What this does is since we have 6 variables.
# melt does provide us a list view of correlation 
# of each 36 pair of correlations.

### Let's plot
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill = value)) +
  geom_tile()

By looking at the plot, it is quite ugly to see less correlated variables are shown in dark colors and higher’s are in bright colors. So let’s make it clear.

4.3 Understandable version of plot

### Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
### Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)
upper_tri
##      mpg  disp    hp  drat    wt  qsec
## mpg    1 -0.85 -0.78  0.68 -0.87  0.42
## disp  NA  1.00  0.79 -0.71  0.89 -0.43
## hp    NA    NA  1.00 -0.45  0.66 -0.71
## drat  NA    NA    NA  1.00 -0.71  0.09
## wt    NA    NA    NA    NA  1.00 -0.17
## qsec  NA    NA    NA    NA    NA  1.00
### Melt
melted_cormat <- melt(upper_tri, na.rm = TRUE)

### Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()

5 Using heatmaply library

### Library
library(heatmaply)

### Data
df <- mtcars

### Let's Plot
heatmaply_cor(x = cor(df),
              xlab = "Features",
              ylab = "Features",
              k_col = 2,
              k_row = 2)

6 Using ggcorplot, ggplot2

library(ggplot2)
library(ggcorrplot)

ggcorrplot::ggcorrplot(cor(mtcars))

7 For my data

library(heatmaply)
library(ggplot2)
heatmaply_cor(x = cor(churnData),
              xlab = "Features",
              ylab = "Features",
              k_col = 2,
              k_row = 2)
library(ggcorrplot)
ggcorrplot::ggcorrplot(cor(churnData))

get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

upper_tri <- get_upper_tri(round(x = cor(churnData), digits = 2))
# upper_tri


melted_cormat <- melt(upper_tri, na.rm = TRUE)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()