Load Libraries

library(tidyverse)


(Continued from the last lecture)

2. Pair Plots

To summarise variation and covariation across several variables at once, we usually create a pair plot during the exploration stage.

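# Read the bank customer data
bank_data <- read_csv("BankChurners.csv")
library(GGally)  # provides ggpairs()

# Keep a few variables of mixed type, then draw the pair plot
bank_data2 <- select(bank_data, Attrition_Flag, Customer_Age, Gender, Credit_Limit)
ggpairs(bank_data2)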


Graph types from ggpairs()

The ggpairs() function automatically creates the following graphs in a pair plot:

  • One categorical variable: bar plot
  • One numeric variable: density plot
  • Two categorical variables: a bubble plot and a grouped bar plot
  • Two numeric variables: a scatter plot and the correlation coefficient
  • One categorical variable and one numeric variable: side-by-side box plots and faceted histograms
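
As a minimal sketch of where these panel types come from, the call below spells them out through the upper, lower and diag arguments of ggpairs(). It assumes a recent GGally (2.x), where these strings correspond to the defaults; the small demo_data table built from the built-in mpg data is only an illustration and is not part of the bank data.

```r
library(tidyverse)
library(GGally)

# Illustrative data only: two categorical and two numeric columns from mpg
demo_data <- mpg %>%
  transmute(drv = factor(drv), year = factor(year), displ, hwy)

# Name the panel type for each kind of variable pair explicitly
# (assumed to match the GGally 2.x defaults)
pair_plot <- ggpairs(
  demo_data,
  upper = list(continuous = "cor", combo = "box_no_facet", discrete = "count"),
  lower = list(continuous = "points", combo = "facethist", discrete = "facetbar"),
  diag  = list(continuous = "densityDiag", discrete = "barDiag")
)
pair_plot
```

Changing one of these strings changes every panel of that kind, which is handy when a default panel is hard to read.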

These cover nearly all the graph types we have seen so far, except the bubble plot. In ggplot2, a bubble plot is created with geom_count(), where the area of each bubble is proportional to the number of observations in each combination of categories:

ggplot(mpg) + geom_count(aes(as.factor(year), class)) +
  labs(title = "Vehicle class by year of manufacture", x = "Year of Manufacture", y = "Vehicle Class") + 
  theme(plot.title = element_text(hjust = 0.5))  # centre the title
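
To see the numbers behind the bubbles, one option is to tabulate the joint categories directly. This is a small sketch using dplyr's count(); the name joint_counts is just an illustrative choice.

```r
library(tidyverse)

# One row per (year, class) combination; n is the count that sets the bubble area
joint_counts <- count(mpg, year, class)
joint_counts
```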

We can change the bubble shape with the shape argument of geom_count() and control the bubble size with the function scale_size_area().

ggplot(bank_data) +
  geom_count(aes(Attrition_Flag, Education_Level), shape = 'square') +
  scale_size_area(max_size = 16)

Still, a bubble plot is not as rigorous as a chi-square test for checking the dependence between two categorical variables. When the bubble plot alone does not give a clear picture, you should perform a chi-square test.
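
A chi-square test of this kind might look like the sketch below. It uses the built-in mpg data so that it runs without BankChurners.csv; the choice of year and drv as the two categorical variables is only an assumption for illustration.

```r
library(tidyverse)

# Contingency table of the two categorical variables
tab <- table(mpg$year, mpg$drv)

# Pearson's chi-square test of independence
res <- chisq.test(tab)
res

# Expected counts under independence; very small values make the test unreliable
res$expected
```

For the bank data, the same pattern would be chisq.test(table(bank_data$Attrition_Flag, bank_data$Education_Level)). A small p-value suggests the two variables are dependent.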


Read the pair plot: what can we learn from the plot below?

bank_data2 <- select(bank_data, Attrition_Flag, Total_Relationship_Count, Months_Inactive_12_mon)
ggpairs(bank_data2)

Here we may draw the following preliminary conclusions:

  • Existing customers seem to have more relationships overall than churning customers.

  • Existing customers seem to have fewer inactive months than churning customers.

  • The relationship count and the number of inactive months have little correlation with each other.
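
Visual impressions like these can be backed up with numeric summaries. The sketch below shows one way to do it: group means by Attrition_Flag and a correlation coefficient. The toy table contains made-up values purely to make the code runnable, and its Attrition_Flag labels are assumed; replace toy with bank_data to check the real data.

```r
library(tidyverse)

# Made-up values for illustration only (not taken from BankChurners.csv)
toy <- tibble(
  Attrition_Flag = rep(c("Existing Customer", "Attrited Customer"), each = 3),
  Total_Relationship_Count = c(4, 5, 3, 2, 3, 1),
  Months_Inactive_12_mon = c(1, 2, 2, 3, 3, 4)
)

# Mean of each numeric variable within each customer group
group_means <- toy %>%
  group_by(Attrition_Flag) %>%
  summarise(
    mean_relationships = mean(Total_Relationship_Count),
    mean_inactive = mean(Months_Inactive_12_mon)
  )
group_means

# Correlation between the two numeric variables
r <- cor(toy$Total_Relationship_Count, toy$Months_Inactive_12_mon)
r
```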


Lab Exercise:

1. Find two numeric variables that are highly correlated by checking the correlation coefficient. Then create a graph to illustrate that.

2. Find two categorical variables (other than Attrition_Flag) that are strongly dependent on each other. Then create a graph to illustrate that.

3. Find all variables that have non-negligible correlation or dependence with Attrition_Flag.