Here we are trying to identify correlated features/variables in a data set by using heatmap.
Here we use a data set called churn data. This contains data of customer phone calles and customer details. For a example customer state, phone number, total day minute calls…etc.
Let’s see how to identify correlated variables on this data.
library(readr)
churndata <- read_csv("P:/uop-PINK/RMyProjects/R/Data/churndata.csv")
In data pre-processing we do identify outliers, NA values and get rid of them. Also here let’s remove unnecessary variables from the data set.
Since we just need numerical variables we remove other variables from the data set.
head(churndata)
churnData <- churndata[8:19]
head(churnData)
str(churnData)
## tibble [3,333 x 12] (S3: tbl_df/tbl/data.frame)
## $ total day minutes : num [1:3333] 265 162 243 299 167 ...
## $ total day calls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
## $ total day charge : num [1:3333] 45.1 27.5 41.4 50.9 28.3 ...
## $ total eve minutes : num [1:3333] 197.4 195.5 121.2 61.9 148.3 ...
## $ total eve calls : num [1:3333] 99 103 110 88 122 101 108 94 80 111 ...
## $ total eve charge : num [1:3333] 16.78 16.62 10.3 5.26 12.61 ...
## $ total night minutes: num [1:3333] 245 254 163 197 187 ...
## $ total night calls : num [1:3333] 91 103 104 89 121 118 118 96 90 97 ...
## $ total night charge : num [1:3333] 11.01 11.45 7.32 8.86 8.41 ...
## $ total intl minutes : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total intl calls : num [1:3333] 3 3 5 7 3 6 7 6 4 5 ...
## $ total intl charge : num [1:3333] 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
So we have 12 numerical variables in newly created churnData data set.
Simple intorduction to Heat Map.
A heat map is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.
### Data
mydata <- mtcars[, c(1,3,4,5,6,7)]
dim(mydata)
## [1] 32 6
head(mydata)
### Correlation Matrix
cormat <- round(x = cor(mydata), digits = 2)
# round use to just to get first two digits after decimal point.
head(cormat)
## mpg disp hp drat wt qsec
## mpg 1.00 -0.85 -0.78 0.68 -0.87 0.42
## disp -0.85 1.00 0.79 -0.71 0.89 -0.43
## hp -0.78 0.79 1.00 -0.45 0.66 -0.71
## drat 0.68 -0.71 -0.45 1.00 -0.71 0.09
## wt -0.87 0.89 0.66 -0.71 1.00 -0.17
## qsec 0.42 -0.43 -0.71 0.09 -0.17 1.00
By looking at the cormat (Correlation matrix) we can see some correlations among the variables. But that’s not our objective. Let’s go for heatmaps.
We use ggplot2, reshape2 packages to perform this task.
### Libraries
library(ggplot2)
library(reshape2)
### Let's melt the corralation matrix
melted_cormat <- melt(cormat)
head(melted_cormat)
# What this does is since we have 6 variables.
# melt does provide us a list view of correlation
# of each 36 pair of correlations.
### Let's plot
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill = value)) +
geom_tile()
By looking at the plot, it is quite ugly to see less correlated variables are shown in dark colors and higher’s are in bright colors. So let’s make it clear.
### Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
### Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
## mpg disp hp drat wt qsec
## mpg 1 -0.85 -0.78 0.68 -0.87 0.42
## disp NA 1.00 0.79 -0.71 0.89 -0.43
## hp NA NA 1.00 -0.45 0.66 -0.71
## drat NA NA NA 1.00 -0.71 0.09
## wt NA NA NA NA 1.00 -0.17
## qsec NA NA NA NA NA 1.00
### Melt
melted_cormat <- melt(upper_tri, na.rm = TRUE)
### Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
### Library
library(heatmaply)
### Data
df <- mtcars
### Let's Plot
heatmaply_cor(x = cor(df),
xlab = "Features",
ylab = "Features",
k_col = 2,
k_row = 2)
library(ggplot2)
library(ggcorrplot)
ggcorrplot::ggcorrplot(cor(mtcars))
library(heatmaply)
library(ggplot2)
heatmaply_cor(x = cor(churnData),
xlab = "Features",
ylab = "Features",
k_col = 2,
k_row = 2)
library(ggcorrplot)
ggcorrplot::ggcorrplot(cor(churnData))
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(round(x = cor(churnData), digits = 2))
# upper_tri
melted_cormat <- melt(upper_tri, na.rm = TRUE)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()