library(tidyverse)
(Continued from the last lecture)
To summarise variation and covariation more efficiently, it is common to create a pair plot of several variables at the exploration stage.
bank_data <- read_csv("BankChurners.csv")
library(GGally)
bank_data2 <- select(bank_data, Attrition_Flag, Customer_Age, Gender, Credit_Limit)
ggpairs(bank_data2)
ggpairs()
The ggpairs() function automatically chooses a graph for each panel of the pair plot, depending on whether the two variables involved are numeric or categorical.
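To make the panel choices visible, the sketch below spells out the `upper`, `lower` and `diag` arguments of ggpairs() instead of relying on the defaults. It assumes GGally 2.0 or later (where the "count" panel type exists) and uses the built-in mpg data rather than the bank data so that it runs on its own; the object names `mpg_small` and `pair_plot` are only illustrative.

```r
library(ggplot2)
library(GGally)

# Small mixed-type example: one categorical column (drv)
# and two numeric columns (cty, hwy) from the built-in mpg data.
mpg_small <- data.frame(
  drv = factor(mpg$drv),
  cty = mpg$cty,
  hwy = mpg$hwy
)

# Name the panel type for each combination of variable types explicitly.
# "continuous" = numeric vs numeric, "combo" = numeric vs categorical,
# "discrete" = categorical vs categorical.
pair_plot <- ggpairs(
  mpg_small,
  upper = list(continuous = "cor", combo = "box_no_facet", discrete = "count"),
  lower = list(continuous = "points", combo = "facethist", discrete = "facetbar"),
  diag  = list(continuous = "densityDiag", discrete = "barDiag")
)
pair_plot
```

Swapping one of the strings (for example `continuous = "smooth"` in `lower`) changes only the panels of that type, which is a convenient way to see which panel belongs to which combination.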
These graphs cover nearly all the graph types we have seen so far, with the possible exception of the bubble plot. In ggplot2, a bubble plot can be created with geom_count(), where the area of each bubble grows with the count in each joint category:
ggplot(mpg) + geom_count(aes(as.factor(year), class)) +
  labs(title = "Vehicle Class by Year of Manufacture", x = "Year of Manufacture", y = "Vehicle Class") +
  theme(plot.title = element_text(hjust = 0.5))
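To see what the bubbles encode, it can help to compute the joint counts by hand. The sketch below tabulates year against class with dplyr::count() and then draws roughly the same picture with geom_point(); the names `joint_counts` and `manual_bubbles` are illustrative, and this is an approximation of what geom_count() does internally rather than its actual implementation.

```r
library(dplyr)
library(ggplot2)

# One row per (year, class) combination, with the number of vehicles in n.
joint_counts <- count(mpg, year, class, name = "n")
head(joint_counts)

# Map the count to bubble size; scale_size_area() makes the bubble area
# proportional to n, so a count of zero would have zero area.
manual_bubbles <- ggplot(joint_counts) +
  geom_point(aes(as.factor(year), class, size = n)) +
  scale_size_area()
manual_bubbles
```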
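We can change the shape of the bubbles with the shape argument of geom_count(), and control their size with scale_size_area(), which makes the bubble area proportional to the count and caps the largest bubble at max_size.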
ggplot(bank_data) + geom_count(aes(Attrition_Flag, Education_Level), shape = 'square') + scale_size_area(max_size = 16)
Still, a bubble plot is not as rigorous as a chi-square test for checking the dependence between two categorical variables. We recommend performing a chi-square test whenever the bubble plot alone does not give a clear picture.
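As a minimal sketch of such a test, the code below applies base R's chisq.test() to a contingency table built with table(). It uses the built-in mpg data (drive type against year of manufacture) so that it runs without the bank data file; the names `drv_by_year` and `test_result` are illustrative.

```r
library(ggplot2)  # loaded only for the built-in mpg data

# Contingency table of drive type against year of manufacture.
drv_by_year <- table(mpg$drv, mpg$year)
drv_by_year

# Pearson's chi-square test of independence between the two variables.
test_result <- chisq.test(drv_by_year)
test_result

# The p-value on its own, for use in further code.
test_result$p.value
```

A small p-value suggests that the two variables are dependent. The same pattern should apply to the bank data, for instance with table(bank_data$Attrition_Flag, bank_data$Education_Level), assuming those columns were read in as shown earlier.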
bank_data2 <- select(bank_data, Attrition_Flag, Total_Relationship_Count, Months_Inactive_12_mon)
ggpairs(bank_data2)
Here we may draw the following preliminary conclusions:
Existing customers seem to have more relationship counts overall than churning customers.
Existing customers seem to have fewer inactive months than churning customers.
The relationship count and the number of inactive months have little correlation with each other.
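These visual impressions can be backed up with a few numbers. The sketch below defines a hypothetical helper, `summarise_by_attrition()` (not part of the lecture), that computes the group means of the two numeric variables and their overall correlation. It assumes a data frame with the columns Attrition_Flag, Total_Relationship_Count and Months_Inactive_12_mon and no missing values in them.

```r
library(dplyr)

# Hypothetical helper: per-group means plus the overall Pearson correlation
# between the two numeric columns.
summarise_by_attrition <- function(data) {
  group_means <- data %>%
    group_by(Attrition_Flag) %>%
    summarise(
      mean_relationship_count = mean(Total_Relationship_Count),
      mean_months_inactive = mean(Months_Inactive_12_mon),
      .groups = "drop"
    )
  correlation <- cor(data$Total_Relationship_Count, data$Months_Inactive_12_mon)
  list(group_means = group_means, correlation = correlation)
}
```

Calling summarise_by_attrition(bank_data) should return one row of means per attrition group, and a correlation close to zero would agree with the third conclusion above.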
1. Find two numeric variables that are highly correlated by checking the correlation coefficient. Then create a graph to illustrate that.
2. Find two categorical variables (other than Attrition_Flag) that are strongly dependent on each other. Then create a graph to illustrate that.
3. Find all variables that have non-negligible correlation or dependence with Attrition_Flag.