Background

We will dive into bank customer churn records to see who are leaving and potentially find reasons why. The purpose is due to the fact that it is easier to keep current customers rather than finding new customers. Overall, if we could prevent churn we can develop customer loyalty and create incentives like loyalty programs and retention campaigns to keep them.

Data Overview

Notice the first three columns (Row, ID, and Surname) do not have any use as predictors, thus will not be included in any of the plots, thus they will be deleted.

names(CCR)
##  [1] "RowNumber"          "CustomerId"         "Surname"           
##  [4] "CreditScore"        "Geography"          "Gender"            
##  [7] "Age"                "Tenure"             "Balance"           
## [10] "NumOfProducts"      "HasCrCard"          "IsActiveMember"    
## [13] "EstimatedSalary"    "Exited"             "Complain"          
## [16] "Satisfaction Score" "Card Type"          "Point Earned"
kable(head(CCR,10),"simple",
    col.names = c("Row", "ID", "Surname", "CreditScore", "Geo", "Gender", "Age", 
                  "Tenure", "Balance", "#Products","CrCard","Active","Salary",
                  "Exited", "Complain","Satisfaction Score", "Card Type", "Points")) 
Row ID Surname CreditScore Geo Gender Age Tenure Balance #Products CrCard Active Salary Exited Complain Satisfaction Score Card Type Points
1 15634602 Hargrave 619 France Female 42 2 0.00 1 1 1 101348.88 1 1 2 DIAMOND 464
2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0 1 112542.58 0 1 3 DIAMOND 456
3 15619304 Onio 502 France Female 42 8 159660.80 3 1 0 113931.57 1 1 3 DIAMOND 377
4 15701354 Boni 699 France Female 39 1 0.00 2 0 0 93826.63 0 0 5 GOLD 350
5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1 1 79084.10 0 0 5 GOLD 425
6 15574012 Chu 645 Spain Male 44 8 113755.78 2 1 0 149756.71 1 1 5 DIAMOND 484
7 15592531 Bartlett 822 France Male 50 7 0.00 2 1 1 10062.80 0 0 2 SILVER 206
8 15656148 Obinna 376 Germany Female 29 4 115046.74 4 1 0 119346.88 1 1 2 DIAMOND 282
9 15792365 He 501 France Male 44 4 142051.07 2 0 1 74940.50 0 0 3 GOLD 251
10 15592389 H? 684 France Male 27 2 134603.88 1 1 1 71725.73 0 0 3 GOLD 342
CCR <-CCR[,-c(1:3)] #delete Row, ID, and Surname

Scatter Plot

p <- ggplot(CCR, aes(x = CreditScore , y = factor(Exited), col = factor(Exited))) + 
  geom_point() +
  labs(x = "Credit Score", y = "Exited",
       title = "Exited vs. Credit Score",
       caption = "Voronyak 2023") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_color_discrete(name = "Exited", labels = c("No", "Yes")) 

p <- ggplot(CCR, aes(x = CreditScore, fill = factor(Exited))) +
  geom_histogram(bins = 11, col = "black")+
  labs(x = "Credit Score", 
       y = "Frequency", 
       title = "The Affect of Credit Score on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) 

The credit score seems to have an effect on leaving if the credit score is around or below 400. Based off of the histogram it doesn’t appear to have much affect on the other score. The proportions look similar, but without calculating the actual proportions, it is hard to tell.

p <- ggplot(CCR, aes(x = Geography, fill = factor(Exited))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "Geography", 
       y = "Frequency", 
       title = "The Affect of Geography on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) 

It appears that Germany has a lot higher proportion of people whom exited. The proportion seems to be that 1/3 of the people have exited there vs. France, which has roughly the same number of people exiting, but well over double the total people staying.

p <- ggplot(CCR, aes(x = factor(Exited), fill = Gender)) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "", 
       y = "Frequency", 
       title = "The Affect of Geography on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Gender", labels = c("Female", "Male")) + scale_x_discrete(labels=c("Not Exiting", "Exiting"))

It appears that Male customers are more loyal and stay with the bank and that Female customers seems to exit at a high proportion.

p <- ggplot(CCR, aes(x = Age , y = factor(Exited), col = factor(Exited))) + 
  geom_point() +
  labs(x = "Age", y = "Exited",
       title = "The Affect of Age on Churn",
       caption = "Voronyak 2023") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_color_discrete(name = "Exited", labels = c("No", "Yes")) 

It appears that there is a cutoff at around 74 years of age that if you are part of a bank, you tend not to leave. Thus, ensuring that people are member before that age cutoff would beneficial. Based off of some quick research, these members have the most money in savings, thus would be the most profitable for the bank.

p <- ggplot(CCR, aes(x = Tenure, fill = factor(Exited))) +
  geom_histogram(bins = 11, col = "black")+  # need 11 bins for 0-10 years of tenure
  labs(x = "Tenure", 
       y = "Frequency", 
       title = "The Affect of Tenure on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) + scale_x_continuous(breaks = seq(0, 10, 1))

I am not seeing anything glaring at me other than it is pretty consistent across the board. You would think that higher tenure would be higher loyalty, but it isn’t apparent here. It fact 1 year vs. 9 years are almost identical.

p <- ggplot(CCR, aes(x = Balance, y = factor(Exited), col = factor(Exited))) +  #considered using log - transformation, but it wasn't as descriptive
  geom_point() +
  labs(x = "Balance", y = "Exited",
       title = "The Affect of Balance on Churn",
       caption = "Voronyak 2023") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_color_discrete(name = "Exited", labels = c("No", "Yes")) 

The balance in an account doesn’t seem to be a tell tail sign of bank loyalty either. You would expect to have higher loyalty with a higher balance.

p <- ggplot(CCR, aes(x = factor(Exited), fill = factor(NumOfProducts))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "", 
       y = "Frequency", 
       title = "The Affect of Number of Products Purchased on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Number of Products") + scale_x_discrete(labels=c("Not Exiting", "Exiting"))

p <- ggplot(CCR, aes(x = NumOfProducts, fill = factor(Exited))) +
  geom_histogram(bins = 4, col = "black")+  
  labs(x = "Tenure", 
       y = "Frequency", 
       title = "The Affect of Number of Products Purchased on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) + scale_x_continuous(breaks = seq(1, 4, 1))

If the bank is able to get a customer to get 4 products through the bank, then it appears that they will stay. Interestingly, it goes the other direction when they have 3 products, they are more likely to leave then stay. Also, if they have 2 products that have a better probability of staying with the bank. So, try to have customers either have 2 products or 4 products. Could use some sort of bundling strategy.

p <- ggplot(CCR, aes(x = factor(Exited), fill = factor(HasCrCard))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "", 
       y = "Frequency", 
       title = "The Affect of Having a Credit Card on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Has Credit Card") + scale_x_discrete(labels=c("Not Exiting", "Exiting"))+ scale_fill_discrete(name = "Credit Card", labels = c("No", "Yes"))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

p <- ggplot(CCR, aes(x = HasCrCard, fill = factor(Exited))) +
  geom_histogram(bins = 2, col = "black")+  
  labs(x = "Has Credit Card", 
       y = "Frequency", 
       title = "The Affect of Having a Credit Card on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) + scale_x_continuous(breaks = seq(0, 1, 1))

The results of having a credit card or not seem inconclusive based off of the above visualizations. A more in depth analysis based off of the numbers would be beneficial.

p <- ggplot(CCR, aes(x = factor(Exited), fill = factor(IsActiveMember))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "", 
       y = "Frequency", 
       title = "The Affect of Being a Memeber on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Is a Member") + scale_x_discrete(labels=c("Not Exiting", "Exiting"))+ scale_fill_discrete(name = "Member", labels = c("No", "Yes"))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

p <- ggplot(CCR, aes(x = IsActiveMember, fill = factor(Exited))) +
  geom_histogram(bins = 2, col = "black")+  
  labs(x = "Is a Member", 
       y = "Frequency", 
       title = "The Affect of Being a Memeber on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) + scale_x_continuous(breaks = seq(0, 1, 1)) 

There seems to be a high correlation here. If you are a member you are less likely to leave and thus higher loyalty. My guess is that if you are willing to be become a member you’d be less likely to leave and the visualization appear to collaborate with my hypothesis.

p <- ggplot(CCR, aes(x = log(EstimatedSalary), fill = factor(Exited))) + #Use log transformation
  geom_histogram(bins = 10, col = "black")+  
  labs(x = "Log of Estimated Salary", 
       y = "Frequency", 
       title = "The Affect of Number of Products Purchased on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) + scale_x_continuous(breaks = seq(3, 12, 1))

Using a log transformation on the estimated salary gives us a better view of the data. This appears to no have too great of an effect on the churn, but we’d need a bit more detailed analysis of the numbers to be sure. It looks like the higher the salary may play a slight roll and to my eyes it appears that the lower salary people actually have higher loyalty.

p <- ggplot(CCR, aes(x = factor(Exited), fill = factor(Complain))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "", 
       y = "Frequency", 
       title = "The Affect of Complaints on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Complains or Not") + scale_x_discrete(labels=c("Not Exiting", "Exiting"))+ scale_fill_discrete(name = "Complained", labels = c("No", "Yes"))
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

The data is very clear that if you are exiting you filed a complaint. This is very evident based on the above visualization.

p <- ggplot(CCR, aes(x = factor(Exited), fill = factor(`Satisfaction Score`))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "", 
       y = "Frequency", 
       title = "The Affect Complaint Resolution on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Satisfaction Score") + scale_x_discrete(labels=c("Not Exiting", "Exiting"))

p <- ggplot(CCR, aes(x = `Satisfaction Score`, fill = factor(Exited))) +
  geom_histogram(bins = 5, col = "black")+  
  labs(x = "Satisfaction Score", 
       y = "Frequency", 
       title = "The Affect Complaint Resolution on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) + scale_x_continuous(breaks = seq(1, 5, 1))

The satisfaction score provided by the customer for their complaint resolution appears to have no affect. This, to me, is strange. I would think that having a high satisfaction score would keep more people.

p <- ggplot(CCR, aes(x = `Card Type`, fill = factor(Exited))) +
  geom_bar(position = "dodge", col = "black")+
  labs(x = "Card Type", 
       y = "Frequency", 
       title = "The Affect of Card Type on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) 

It appears that the type of card also has no affect or very minor.

p <- ggplot(CCR, aes(x = `Point Earned` , y = factor(Exited), col = factor(Exited))) + 
  geom_point() +
  labs(x = "Points Earned", y = "Exited",
       title = "Exited vs. Points Earned",
       caption = "Voronyak 2023") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_color_discrete(name = "Exited", labels = c("No", "Yes")) 

p <- ggplot(CCR, aes(x = `Point Earned`, fill = factor(Exited))) +
  geom_histogram(bins = 10, col = "black")+
  labs(x = "Points Earned", 
       y = "Frequency", 
       title = "The Affect of Points Earend on Churn",
       caption = "Voronyak 2023") + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

p + scale_fill_discrete(name = "Exited", labels = c("No", "Yes")) 

Points Earned also appear to not have a high correlation for churn. The dot plot seems to not be very useful, but the histogram seems to be better in showing that there appears to be very little variation based on the points earned.