For the data set of Bank Customers:
1. Find two numeric variables that are highly correlated by checking the correlation coefficient. Then create a graph to illustrate that.
2. Find two categorical variables (other than Attrition_Flag) that are strongly dependent on each other. Then create a graph to illustrate that.
3. Find all variables that have non-negligible correlation or dependence with Attrition_Flag.
library(tidyverse)
library(GGally)
bank_data <- read_csv("BankChurners.csv")
my_data <- select(bank_data, where(is.numeric)) # where(is.numeric) keeps all numeric variables
ggpairs(my_data) # The plot may be too small to read. Run this on your own laptop and enlarge the picture to see all numbers clearly.
The highest correlation coefficient in the matrix is between Avg_Open_To_Buy and Credit_Limit, which is unsurprising since the former is, by definition, the part of the credit limit that is still available.
To visualise this, we can do the following:
ggplot(my_data, aes(Avg_Open_To_Buy, Credit_Limit)) +
  geom_point(size = rel(0.5), shape = 1, color = "blue") +
  geom_smooth(color = "magenta", linewidth = 1.5)
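Reading the largest coefficient off a dense pairs plot can be error-prone, so the pairs can also be ranked numerically. The following is a minimal sketch, not part of the original solution: `top_correlations` is a hypothetical helper name, and the built-in `mtcars` data stands in for the bank data so that the block runs on its own.

```r
# Hypothetical helper (an assumption, not from the original solution):
# rank all pairs of numeric columns by absolute Pearson correlation.
top_correlations <- function(df, n = 5) {
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns only
  cm  <- cor(num, use = "pairwise.complete.obs") # correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)    # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  pairs <- pairs[order(-abs(pairs$r)), ]         # strongest pairs first
  rownames(pairs) <- NULL
  head(pairs, n)
}

# mtcars is a stand-in data set; with the bank data one would call
# top_correlations(bank_data) instead.
top_pairs <- top_correlations(mtcars, n = 3)
print(top_pairs)
```

Sorting by the absolute value keeps strong negative correlations near the top as well, which a plain sort on `r` would push to the bottom.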
my_data <- select(bank_data, !where(is.numeric)) # Select all non-numeric variables
ggpairs(my_data)
Looking at the bubble plots, we can see that the distribution of income category is very different for males and females, so the two variables are likely to be strongly dependent. We can check this with a chi-square test.
chisq.test(bank_data$Gender, bank_data$Income_Category)
##
## Pearson's Chi-squared test
##
## data: bank_data$Gender and bank_data$Income_Category
## X-squared = 7138.4, df = 5, p-value < 2.2e-16
The extremely small p-value leads us to reject independence, which supports our suspicion. We can illustrate the dependence further with a grouped bar chart. Before that, we make Income_Category a factor and relevel it so that the income bands appear in order. This step is optional; factors are explained in more detail in the factor session.
income_levels <- levels(as.factor(bank_data$Income_Category))
my_data <- bank_data %>%
  mutate(Income_Category = as.factor(Income_Category)) %>%
  mutate(Income_Category = fct_relevel(Income_Category, income_levels[c(1, 4, 3, 2, 5, 6)]))
levels(my_data$Income_Category)
## [1] "$120K +" "$80K - $120K" "$60K - $80K" "$40K - $60K"
## [5] "Less than $40K" "Unknown"
ggplot(my_data) +
  geom_bar(aes(Gender, fill = Income_Category), position = "dodge")
The chart makes it evident that females in this data set fall into much lower income categories than males on the whole.
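A small p-value shows that two variables are dependent, but with a large sample it says little about how strong the dependence is. One common effect size for a contingency table is Cramér's V, which the original solution does not use; the sketch below adds it as an assumption. `cramers_v` is a hypothetical helper name, and `mtcars` again serves as stand-in data.

```r
# Hypothetical helper (an assumption, not from the original solution):
# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
cramers_v <- function(x, y) {
  tab  <- table(x, y)                                   # contingency table
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)                                      # number of observations
  k    <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Stand-in example; with the bank data one would call
# cramers_v(bank_data$Gender, bank_data$Income_Category).
v <- cramers_v(mtcars$cyl, mtcars$gear)
print(v)
```

Values near 0 suggest little association, while values near 1 suggest a very strong one.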
For this question, there is no real shortcut: we have to examine the relationship between Attrition_Flag and every other variable in turn.
When the other variable is categorical, we can simply run a chi-square test.
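As a rough sketch of how the chi-square check could be run over every categorical variable at once, the helper below tests a target column against each non-numeric column and returns the p-values in increasing order. The name `chisq_screen` and the `demo` data frame (a few `mtcars` columns recoded as categories) are assumptions made for illustration only.

```r
# Hypothetical helper (an assumption, not from the original solution):
# chi-square test of `target` against every other non-numeric column.
chisq_screen <- function(df, target) {
  cats <- names(df)[!vapply(df, is.numeric, logical(1))]
  cats <- setdiff(cats, target)                  # do not test target against itself
  p <- vapply(cats, function(v) {
    # warnings about small expected counts are silenced in this sketch
    suppressWarnings(chisq.test(df[[v]], df[[target]]))$p.value
  }, numeric(1))
  sort(p)                                        # smallest p-values first
}

# Stand-in data: some mtcars columns turned into factors.
demo <- data.frame(
  am   = factor(mtcars$am, labels = c("automatic", "manual")),
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear),
  vs   = factor(mtcars$vs),
  mpg  = mtcars$mpg                              # numeric, so it is skipped
)
p_cat <- chisq_screen(demo, target = "am")
print(p_cat)
```

With the bank data the call would be `chisq_screen(bank_data, "Attrition_Flag")`. Since many tests are run at once, adjusting the p-values with `p.adjust()` may be worth considering.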
When the other variable is numeric, we can use a t-test to compare its mean between the two customer groups. For example, Attrition_Flag is associated with Months_Inactive_12_mon, as the following code shows:
data1 <- bank_data$Months_Inactive_12_mon[bank_data$Attrition_Flag != "Existing Customer"]
data2 <- bank_data$Months_Inactive_12_mon[bank_data$Attrition_Flag == "Existing Customer"]
t.test(data1, data2)
##
## Welch Two Sample t-test
##
## data: data1 and data2
## t = 16.862, df = 2489.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3707468 0.4683249
## sample estimates:
## mean of x mean of y
## 2.693301 2.273765
The p-value is very small, so we reject the hypothesis that the two groups (existing and churning customers) have the same mean number of inactive months. We can also recreate the boxplot to show this.
ggplot(bank_data) +
  geom_boxplot(aes(Attrition_Flag, Months_Inactive_12_mon))
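The single t-test above can be repeated for every numeric variable. The following is a minimal sketch under stated assumptions: `ttest_screen` is a hypothetical helper name, the target is assumed to have exactly two groups, and `mtcars` is used as stand-in data so the block is self-contained.

```r
# Hypothetical helper (an assumption, not from the original solution):
# Welch t-test of every numeric column between the two groups of `target`.
ttest_screen <- function(df, target) {
  g <- as.factor(df[[target]])
  stopifnot(nlevels(g) == 2)                     # a t-test compares two groups
  nums <- names(df)[vapply(df, is.numeric, logical(1))]
  nums <- setdiff(nums, target)
  p <- vapply(nums, function(v) {
    x <- df[[v]]
    t.test(x[g == levels(g)[1]], x[g == levels(g)[2]])$p.value
  }, numeric(1))
  sort(p)                                        # smallest p-values first
}

# Stand-in data: transmission type as the two-level target.
demo_num <- data.frame(
  am = factor(mtcars$am, labels = c("automatic", "manual")),
  mtcars[c("mpg", "hp", "wt", "qsec")]
)
p_num <- ttest_screen(demo_num, target = "am")
print(p_num)
```

With the bank data the call would be `ttest_screen(bank_data, "Attrition_Flag")`; variables with small p-values are candidates for a non-negligible relationship with Attrition_Flag, and a boxplot like the one above can then confirm each of them visually.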
With all the knowledge we have gained from data exploration, we may move forward to the modeling step to explain our data or predict customer behavior in the future. If necessary, we can also go back to the transformation step to further transform our data before the modeling step.