Lecture 10


Lab Exercise:

For the data set of Bank Customers:

1. Find two numeric variables that are highly correlated by checking the correlation coefficient. Then create a graph to illustrate that.

2. Find two categorical variables (other than Attrition_Flag) that are strongly dependent on each other. Then create a graph to illustrate that.

3. Find all variables that have non-negligible correlation or dependence with Attrition_Flag.


Import data and libraries

library(tidyverse)
library(GGally)

bank_data <- read_csv("BankChurners.csv")


Exercise 1

my_data <- select(bank_data, where(is.numeric)) # where(is.numeric) keeps only the numeric columns

# The plot may be too small to read here. Run this on your own laptop and enlarge the picture to see all the numbers clearly.
ggpairs(my_data)

The largest coefficient in the correlation matrix is between Avg_Open_To_Buy and Credit_Limit. This is expected, since Avg_Open_To_Buy is by definition the part of the credit limit that is still available.

To visualise this, we can draw a scatter plot with a smoothed trend line:

ggplot(my_data, aes(Avg_Open_To_Buy, Credit_Limit)) +
  geom_point(size = 0.5, shape = 1, color = "blue") + # small hollow circles reduce overplotting
  geom_smooth(color = "magenta", linewidth = 1.5) # smoothed trend line
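
Reading the largest coefficient off a ggpairs plot is error-prone when there are many variables. As a minimal sketch, the pairs can also be ranked numerically with base R's cor(). The helper name top_correlations and the toy data frame below are made up for illustration and are not part of the lab; with the lab data one would call top_correlations(bank_data).

```r
# Hypothetical helper (not part of the lab): rank variable pairs by |correlation|.
top_correlations <- function(df, n = 5) {
  num <- df[vapply(df, is.numeric, logical(1))]   # keep numeric columns only
  cm <- cor(num, use = "pairwise.complete.obs")   # Pearson correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)     # each pair once, diagonal excluded
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out <- out[order(-abs(out$r)), ]                # strongest pairs first
  head(out, n)
}

# Toy data for illustration only (NOT the bank data)
set.seed(1)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.1)             # b is almost a copy of a
toy$c <- rnorm(100)                               # c is unrelated noise
ranked <- top_correlations(toy, 3)
print(ranked)
```

If the reasoning in this exercise holds, the pair Avg_Open_To_Buy and Credit_Limit should appear in the first row when the helper is applied to bank_data.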


Exercise 2

my_data <- select(bank_data, !where(is.numeric)) # Select all non-numeric variables

ggpairs(my_data)

In the bubble plots, the income category distribution looks very different for males and females, so Gender and Income_Category are likely to be strongly dependent. We can check this with a chi-square test.

chisq.test(bank_data$Gender, bank_data$Income_Category)
## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$Gender and bank_data$Income_Category
## X-squared = 7138.4, df = 5, p-value < 2.2e-16

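The extremely small p-value leads us to reject independence, which supports our guess that Gender and Income_Category are dependent. We can show this further with a grouped bar chart. Before plotting, we convert Income_Category to a factor and reorder its levels so that the categories appear in income order. This step is optional; factors will be explained in more detail in the factor session.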

income_levels <- levels(as.factor(bank_data$Income_Category))

my_data <- bank_data %>%
  mutate(Income_Category = as.factor(Income_Category)) %>%
  mutate(Income_Category = fct_relevel(Income_Category, income_levels[c(1,4,3,2,5,6)]))
  
levels(my_data$Income_Category)
## [1] "$120K +"        "$80K - $120K"   "$60K - $80K"    "$40K - $60K"   
## [5] "Less than $40K" "Unknown"

ggplot(my_data) +
  geom_bar(aes(Gender, fill = Income_Category), position = "dodge")

The chart makes it evident that female customers fall into the lower income categories far more often than male customers.
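
A small p-value says that the dependence is statistically detectable, not how strong it is, and with thousands of rows even a weak dependence can give a tiny p-value. One common way to put a number on the strength is Cramér's V, which rescales the chi-square statistic to lie between 0 (independent) and 1 (perfectly dependent). This is an addition to the method used above, not something the exercise requires; the helper name cramers_v and the toy vectors below are hypothetical.

```r
# Hypothetical helper (not part of the lab): Cramér's V for two categorical vectors.
# V = sqrt(chi2 / (n * (min(rows, cols) - 1))), using the uncorrected chi-square statistic.
cramers_v <- function(x, y) {
  tab  <- table(x, y)                                    # contingency table
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)                                       # number of observations
  k    <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Toy vectors for illustration only (NOT the bank data)
g         <- rep(c("A", "B"), each = 50)
identical <- g                                           # perfectly dependent on g
balanced  <- rep(c("u", "v"), times = 50)                # evenly spread across g

v_dependent   <- cramers_v(g, identical)
v_independent <- cramers_v(g, balanced)
print(c(v_dependent, v_independent))
```

With the lab data, the call would be cramers_v(bank_data$Gender, bank_data$Income_Category).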


Exercise 3

For this question there is no shortcut: we have to examine the relationship between Attrition_Flag and each of the other variables in turn.

If the other variable is categorical, we can simply use the chi-square test.
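
Since the test has to be repeated for every categorical variable, a loop saves typing. The following is a minimal sketch under the assumption that categorical columns are stored as character or factor (as they are after read_csv); the helper name chisq_screen and the toy data frame are hypothetical and not part of the lab.

```r
# Hypothetical helper (not part of the lab): chi-square p-value of every
# categorical column against a target column, smallest p-value first.
chisq_screen <- function(df, target) {
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col), logical(1))
  vars   <- setdiff(names(df)[is_cat], target)           # exclude the target itself
  p <- vapply(vars, function(v) {
    suppressWarnings(chisq.test(df[[v]], df[[target]]))$p.value
  }, numeric(1))
  res <- data.frame(variable = vars, p_value = unname(p), stringsAsFactors = FALSE)
  res[order(res$p_value), ]
}

# Toy data for illustration only (NOT the bank data)
toy_cat <- data.frame(
  flag      = rep(c("stay", "leave"), each = 100),
  related   = rep(c("x", "y"), each = 100),              # moves together with flag
  unrelated = rep(c("u", "v"), times = 100),             # evenly spread across flag
  stringsAsFactors = FALSE
)
cat_result <- chisq_screen(toy_cat, "flag")
print(cat_result)
```

With the lab data, the call would be chisq_screen(bank_data, "Attrition_Flag"). Because many tests are run at once, the p-values could be adjusted with p.adjust() before deciding which variables matter.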

If the other variable is numeric, we can use a two-sample t-test to compare its mean between the two customer groups. For example, Attrition_Flag is associated with Months_Inactive_12_mon. We can show this with the following code:

data1 <- bank_data$Months_Inactive_12_mon[bank_data$Attrition_Flag != "Existing Customer"]
data2 <- bank_data$Months_Inactive_12_mon[bank_data$Attrition_Flag == "Existing Customer"]
t.test(data1, data2)
## 
##  Welch Two Sample t-test
## 
## data:  data1 and data2
## t = 16.862, df = 2489.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3707468 0.4683249
## sample estimates:
## mean of x mean of y 
##  2.693301  2.273765

The p-value is very small, so we reject the hypothesis that the two groups (existing and churned customers) have the same mean number of inactive months. We can also draw a boxplot to show this.

ggplot(bank_data) + 
  geom_boxplot(aes(Attrition_Flag, Months_Inactive_12_mon))
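
The same t-test can be repeated for every numeric variable. As a minimal sketch, assuming the target column has exactly two groups (as Attrition_Flag does), the loop below collects the Welch t-test p-value for each numeric column. The helper name ttest_screen and the toy data frame are hypothetical and not part of the lab.

```r
# Hypothetical helper (not part of the lab): Welch t-test p-value of every
# numeric column between the two groups of a target column, smallest first.
ttest_screen <- function(df, target) {
  groups <- unique(df[[target]])
  stopifnot(length(groups) == 2)                         # t-test compares two groups
  is_num <- vapply(df, is.numeric, logical(1))
  vars   <- names(df)[is_num]
  p <- vapply(vars, function(v) {
    t.test(df[[v]][df[[target]] == groups[1]],
           df[[v]][df[[target]] == groups[2]])$p.value
  }, numeric(1))
  res <- data.frame(variable = vars, p_value = unname(p), stringsAsFactors = FALSE)
  res[order(res$p_value), ]
}

# Toy data for illustration only (NOT the bank data)
set.seed(1)
toy_num <- data.frame(
  flag    = rep(c("stay", "leave"), each = 100),
  shifted = c(rnorm(100, mean = 0), rnorm(100, mean = 3)),  # group means differ
  noise   = rnorm(200),                                     # same distribution in both groups
  stringsAsFactors = FALSE
)
num_result <- ttest_screen(toy_num, "flag")
print(num_result)
```

With the lab data, the call would be ttest_screen(bank_data, "Attrition_Flag"). Note that identifier columns such as a customer number are also numeric and will be tested, so the result should be read with the meaning of each variable in mind.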


Summary

With what we have learned from data exploration, we can move on to the modeling step to explain our data or to predict future customer behavior. If necessary, we can also return to the transformation step and transform the data further before modeling.