Project 1: Analysis of churners of the bank

The largest differences for churners and non-churners is traced in Revolving balance, Number of transactions in last 12 months and Total amount of these transactions
Single people and women are less likely to stay in the bank
Segments
Models for predicting churn
Churn prevention strategies
Appendix

This project aims to explore the data about churners of the bank, to segment the customers and then predict churn in these segments. Finally, churn prevention strategies will be introduced and discussed.

As a result, the overall churn rate is 16%. Three segments are managed by Revolving balance, Number of transactions in last 12 months and Total amount of these transactions. The churn of 30% in Sleeping segment is predicted with the Model1 with accuracy of 0.85 - good estimate.

Preprocessing

In this section skimming and cleaning the data for the following analysis are made.

library(readr)
library(tidyverse)
library(rmarkdown)
df <- read_csv("bank_churners.csv")

All the variables should be managed to appropriate types for further manipulations.

#factor variables
df$still_customer <- as.factor(df$still_customer)
df$gender <- as.factor(df$gender)
df$education_level <- as.factor(df$education_level)
df$marital_status <- as.factor(df$marital_status)
df$income_category <- as.factor(df$income_category)
df$card_category <- as.factor(df$card_category)

levels(df$still_customer) <- c("No", "Yes")

16% of churners

All in all, we have 1627 customers who churned and 8500 remaining customers.

table(df$still_customer)

## 
##   No  Yes 
## 8500 1627

ggplot(data = df) + geom_bar(aes( x = still_customer, fill = still_customer), show.legend = FALSE) + 
  ggtitle("Number of customers who stayed/churned") +
  xlab("Churned") +
  ylab("Number of customers") + 
  theme_minimal()

In percentage points it’s 16% of people churning and 84% staying.

churn  <- df %>% group_by(still_customer) %>% 
 summarize(count = n()) %>% 
 mutate(pct = count/sum(count))  

ggplot(churn, aes(still_customer, pct, fill = still_customer)) + 
  geom_bar(stat='identity', show.legend = FALSE) + 
  geom_text(aes(label=scales::percent(pct)), position = position_stack(vjust = .5))+
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Number of customers who churned/stayed in %") +
  xlab("Churned") +
  ylab("Percent of customers") + 
  theme_minimal()

Percentage of churners - 16%! This rate is not very large, but not perfect either (it is better to have 5-10%). So it is indeed problematic for the bank and should be dealt with.

You can trace data distributions in more detail in the Appendix section at the end of the report.

The largest differences for churners and non-churners is traced in Revolving balance, Number of transactions in last 12 months and Total amount of these transactions

In this section some associations of churn with other variables are found.

Only 7 of the 14 continuous variables have visible differences for people who churned and who stayed. These are differences in mean values.

For example, people with small revolving balance on average may churn. People who are inactive for many months during the year are more likely to churn, as well. Other relationships can be seen on the graphs.

All in all, after analyzing all the graphs and doing some statistical tests, the largest differences for churners and non-churners is traced in Revolving balance estimate, Number of transactions in last 12 months and total amount of these transactions.

Hence, these measurements seem to be the most interesting and useful for further segmentation and predictions.

Rest of the graphs and statistical tests could be found in the Appendix.

bp1 <- ggplot(df) + geom_boxplot(aes(still_customer, total_revolving_bal, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Total revolving balance") +
  xlab("Churned") +
  ylab("Total revolving balance")
bp2 <- ggplot(df) + geom_boxplot(aes(still_customer, months_inactive_12_mon, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Months inactive \n(last 12 months)") +
  xlab("Churned") +
  ylab("Months inactive")
bp3 <- ggplot(df) + geom_boxplot(aes(still_customer, total_relationship_count, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Total relationship count") +
  xlab("Churned") +
  ylab("Total relationship count")
bp4 <- ggplot(df) + geom_boxplot(aes(still_customer, contacts_count_12_mon, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Contacts count \n(last 12 months)") +
  xlab("Churned") +
  ylab("Contacts count")

ggpubr::ggarrange(bp1, bp2, bp3, bp4,
          ncol = 2, nrow = 2)

bp5 <- ggplot(df) + geom_boxplot(aes(still_customer, total_trans_amt, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Total \ntransaction amount") +
  xlab("Churned") +
  ylab("Total transaction amount")

bp6 <- ggplot(df) + geom_boxplot(aes(still_customer, total_trans_ct, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Total \ntransaction count") +
  xlab("Churned") +
  ylab("Total transaction count")

bp7 <- ggplot(df) + geom_boxplot(aes(still_customer, avg_utilization_ratio, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Average \nutilization ratio") +
  xlab("Churned") +
  ylab("Average utilization ratio")

ggpubr::ggarrange(bp5, bp6, bp7,
          ncol = 3, nrow = 1)

Also, other variables like number of inactive months, contacts count, total relationship count, and average utilization ratio might to be useful for further churn prediction, as they are also statistically significant. But for segmentation three mentioned variables will be enough.

Further, churn appears to have slight association with gender, females are a bit more likely to churn. The majority of people who stay have Graduate level of education, however, among churners this category prevails too. Married people are not likely to churn as much as single people do. People with income less than 40K $ are more likely to churn, as well. Card categories have no association with churn at all.

Additional statistical tests could be found in the Appendix.

Single people and women are less likely to stay in the bank

bar1 <- ggplot(df) + geom_bar(aes(gender, fill = still_customer), 
  position="dodge", show.legend = F) +
  ggtitle("Churn by Gender") +
  xlab("Gender") +
  ylab("Number of people")

bar2 <- ggplot(df) + geom_bar(aes(education_level, fill = still_customer), 
  position="dodge", show.legend = F) +
  ggtitle("Churn by Education level") +
  xlab("Education level") +
  ylab("Number of people")

bar3 <- ggplot(df) + geom_bar(aes(marital_status, fill = still_customer), 
  position="dodge", show.legend = F) + ggtitle("Churn by Marital status") +
  xlab("Marital status") +
  ylab("Number of people")

bar4 <- ggplot(df) + geom_bar(aes(income_category, fill = still_customer), 
  position="dodge", show.legend = F) + 
  ggtitle("Churn by Income category") +
  xlab("Income category") +
  ylab("Number of people")

bar5 <- ggplot(df) + geom_bar(aes(card_category, fill = still_customer), 
  position="dodge", show.legend = F) +
  ggtitle("Churn by Card category") +
  xlab("Card category") +
  ylab("Number of people")

ggpubr::ggarrange(bar1, bar2,
          ncol = 1, nrow = 2)

ggpubr::ggarrange(bar3, bar4, 
          ncol = 1, nrow = 2)

bar5

So gender, marital status and income might be good predictors for churn. However, for segmenting they will not be used.

Before, it is reasonable to normalize - to have similar scale from 0 to 1 - all the variables. This gives a right to compare estimates which are measured in different units (dollars, number of time, people, etc.)

df_scaled <- df
df_scaled$still_customer <- ifelse(df_scaled$still_customer == "Yes", 1, 0)

scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}

df_scaled <- df_scaled %>% 
  mutate_if(sapply(df, is.numeric), scale_01)


library('fastDummies')
df_scaled <- dummy_cols(df_scaled, select_columns = 'gender') %>% dplyr::select(-gender)
df_scaled <- dummy_cols(df_scaled, select_columns = 'education_level') %>% dplyr::select(-education_level)
df_scaled <- dummy_cols(df_scaled, select_columns = 'income_category') %>% dplyr::select(-income_category)
df_scaled <- dummy_cols(df_scaled, select_columns = 'card_category') %>% dplyr::select(-card_category)
df_scaled <- dummy_cols(df_scaled, select_columns = 'marital_status') %>% dplyr::select(-marital_status)

Segments

So for segmentation we came up to three measurements. If we look at the graphs of their associations with each other, we can already see three distinct groups. One of them seems to have no churners at all. So this assumption with be now checked in the following section.

sp1 <- ggplot(df_scaled) + geom_point(aes(total_revolving_bal, total_trans_ct, color = factor(still_customer)), show.legend = F) +
  xlab("Total revolving balance") +
  ylab("Total transaction count")
  
sp2 <- ggplot(df_scaled) + geom_point(aes(total_revolving_bal, total_trans_amt, color = factor(still_customer)), show.legend = F) +
  xlab("Total revolving balance") +
  ylab("Total transaction amount")

sp3 <- ggplot(df_scaled) + geom_point(aes(total_trans_ct, total_trans_amt, color = factor(still_customer)), show.legend = F) +
  xlab("Total transaction count") +
  ylab("Total transaction amount")

ggpubr::ggarrange(sp1, sp2, sp3,
          ncol = 3, nrow = 1)

df_segment <- df_scaled %>% dplyr::select(total_trans_ct, total_revolving_bal, total_trans_amt)

The graph which describes optimal number of clusters confirms that 3 clusters is the best option.

library(NbClust)
library(factoextra)

fviz_nbclust(df_segment, kmeans, method = "silhouette")

So the segmentation is performed with K-means algorithm.

set.seed(12)
model_clust <- kmeans(df_segment, centers = 3, nstart = 25)

df_clustered <- mutate(df_scaled, cluster = model_clust$cluster)

df_clustered %>% group_by(cluster) %>% summarize(bal = mean(total_revolving_bal), cnt = mean(total_trans_ct), amt = mean(total_trans_amt))

## # A tibble: 3 x 4
##   cluster   bal   cnt   amt
##     <int> <dbl> <dbl> <dbl>
## 1       1 0.547 0.758 0.722
## 2       2 0.684 0.396 0.165
## 3       3 0.107 0.387 0.167

First segment is about customers with high values of revolving balance, transaction counts and amount of transactions. This segment consists almost only of customers who stay in the bank, so segment with no churners. - “Good customers” segment

Second segment is about customers with high revolving balance, with average count of transactions and small total amount of transactions. - “In debt” segment

Third segment is about customers with small-to-medium estimates in all three measurements, so with small revolving balance, with average count of transactions and small total amount of transactions. - “Free to go” or “Sleeping” segment.

Hence, churn will be predicted with the help of models in the second and third segments only, as the first one has no problems with churners.

ggplot(df_clustered) + geom_bar(aes(cluster, fill = factor(still_customer)), show.legend = F) +
  ggtitle("Churners in clusters") +
  xlab("Cluster") +
  ylab("Number of customers")

df_clustered %>% group_by(cluster, still_customer) %>% summarize(count = n()) %>% 
  mutate(pct = count/sum(count))

The churn is 8% in the second cluster and 30% in the third cluster.

df_predict1 <- df_clustered %>% filter(cluster==3)
df_predict2 <- df_clustered %>% filter(cluster==2)

Special attention will be paid for dealing with the churn of 30%, but also I will take a look at the 8% churn in the “In debt” segment.

Models for predicting churn

Model1

library(tidymodels)
set.seed(1111)
df_split1 <- initial_split(df_predict1, prop = .8)
train1 <- training(df_split1)
test1 <- testing(df_split1)

The first large model uses all the estimates in the dataset. It explains much of the data, however, it is very large and hard to interpret.

The full model coefficients could be found in Appendix.

model1 <- glm(still_customer ~., data = train1, family = binomial)

library(descr)
LogRegR2(model1)

## Chi2                 2055.646 
## Df                   31 
## Sig.                 0 
## Cox and Snell Index  0.5034901 
## Nagelkerke Index     0.7063052 
## McFadden's R2        0.5611303

From the previous model it is seen that some measurements have more influence, so noting them, we arrive at the final model.

model2 <- glm(still_customer ~ gender_F + marital_status_Single + total_relationship_count + months_inactive_12_mon + total_revolving_bal  + total_trans_amt + total_trans_ct, data = train1, family = binomial)

LogRegR2(model2)

## Chi2                 1747.117 
## Df                   7 
## Sig.                 0 
## Cox and Snell Index  0.4484744 
## Nagelkerke Index     0.6291282 
## McFadden's R2        0.476911

The final model explains 47% of our data - great estimate!

pred1 <- predict(model2, newdata = test1, type = "response")

library(caret)
library(e1071)

pred1 = as.data.frame(pred1)
pred1$pred1 <- if_else(pred1 > 0.5,1,0)
p.class = factor(pred1$pred1, levels = c(1,0))
actual.factor = factor(test1$still_customer, levels = c(1,0))
confusionMatrix(p.class, actual.factor)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 175  54
##          0  55 451
##                                           
##                Accuracy : 0.8517          
##                  95% CI : (0.8239, 0.8766)
##     No Information Rate : 0.6871          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6547          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7609          
##             Specificity : 0.8931          
##          Pos Pred Value : 0.7642          
##          Neg Pred Value : 0.8913          
##              Prevalence : 0.3129          
##          Detection Rate : 0.2381          
##    Detection Prevalence : 0.3116          
##       Balanced Accuracy : 0.8270          
##                                           
##        'Positive' Class : 1               
##

As for predictive power, the model works good. It predicts true churners with relatively little errors (sensitivity of 0.8). The model is accurate and have good predictions on the test data!

Interpretation

summary(model2)$coef

##                             Estimate Std. Error    z value      Pr(>|z|)
## (Intercept)                3.6829437  0.2652197  13.886386  7.660396e-44
## gender_F                   0.5389501  0.1183714   4.553043  5.287534e-06
## marital_status_Single      0.3478942  0.1184893   2.936081  3.323873e-03
## total_relationship_count  -2.0516887  0.2017294 -10.170498  2.685315e-24
## months_inactive_12_mon     3.3025868  0.3650315   9.047401  1.464111e-19
## total_revolving_bal       -5.2441284  0.4565431 -11.486600  1.540564e-30
## total_trans_amt           17.6607656  1.0886980  16.221915  3.530158e-59
## total_trans_ct           -21.4317737  0.8856065 -24.200108 2.218929e-129

From the model it appears that females and single persons have larger chances to churn. Also with smaller number of products held, with smaller revolving balance amount people will churn more frequently. And with larger number of inactive months, larger amounts of transactions in last 12 months people are predicted by the model to churn with bigger chances.

Most powerful predictors: number of transaction in last 12 months, amount of these transactions and amount of revolving balance.

Model2

All the same is done for the model for the second cluster. But this final model will have other measurements for prediction.

set.seed(1111)
df_split2 <- initial_split(df_predict2, prop = .8)
train2 <- training(df_split2)
test2 <- testing(df_split2)

model3 <- glm(still_customer ~., data = train2, family = binomial)

The full model coefficients could be found in Appendix.

model4 <- glm(still_customer ~ marital_status_Single + total_relationship_count + months_inactive_12_mon + contacts_count_12_mon + total_trans_amt + total_revolving_bal + total_trans_ct + total_ct_chng_q4_q1, data = train2, family = binomial)

LogRegR2(model4)

## Chi2                 1102.322 
## Df                   8 
## Sig.                 0 
## Cox and Snell Index  0.2204198 
## Nagelkerke Index     0.5048573 
## McFadden's R2        0.4339771

This model explains 43% of the our data, that is very good.

pred2 <- predict(model4, newdata = test2, type = "response")

pred2 = as.data.frame(pred2)
pred2$pred2 <- if_else(pred2 > 0.5,1,0)
p.class = factor(pred2$pred2,levels = c(1,0))
actual.factor = factor(test2$still_customer, levels = c(1,0))
confusionMatrix(p.class, actual.factor)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1  33  16
##          0  59 999
##                                           
##                Accuracy : 0.9322          
##                  95% CI : (0.9158, 0.9463)
##     No Information Rate : 0.9169          
##     P-Value [Acc > NIR] : 0.03334         
##                                           
##                   Kappa : 0.4355          
##                                           
##  Mcnemar's Test P-Value : 1.236e-06       
##                                           
##             Sensitivity : 0.35870         
##             Specificity : 0.98424         
##          Pos Pred Value : 0.67347         
##          Neg Pred Value : 0.94423         
##              Prevalence : 0.08311         
##          Detection Rate : 0.02981         
##    Detection Prevalence : 0.04426         
##       Balanced Accuracy : 0.67147         
##                                           
##        'Positive' Class : 1               
##

The model predicts non-churners with great success, churners are predicted worse, but more measurements will not improve the model significantly and certainly they will complicate the interpretation.

Unfortunately, the model for predicting churn in the second segment “In debt” is almost useless, as it predicts churners poorly - misses a lot of true churners.

All in all, the model for predicting churn in the segment of sleeping customers performs well. Further section will include possible churn prevention strategies for this segment.

Churn prevention strategies

In general, customers searching for bank services mostly crave for small loan rate and high deposit rate, so that they can minimize their costs or maximize profit, as all rational people want to do. So in order to be the most attractive bank for customers, the rates should be managed in customer-centered traditions. This also includes bonus programs or special offers. Customer-oriented approached helps to build trust and loyalty.

The analysis of sleeping segment revealed that customers almost do not use the services of the bank: they hold few bank products, they do not perform a lot of transactions and rarely do they appear at the bank in general.

Sleeping customers or at risk customers are needed to be talked to, offered some renewals, so reactivated.

As I already mentioned, bonus programs or special offers are good tools for motivating customers, especially the sleeping ones, as in our case. The recommendation is to introduce or increase (if exists) the monetary cashback or bonuses for being an active customer. This would not be so costly as changing the rates, but it will be effective too.

As for special offers, customers will take more bank products or perform more transactions, if they know that each following product/transaction have benefits for them. For example, running a new saving account will have a larger rate compared to existing account. Or having certain amount of transactions during several months will give larger cashback for customers.

Special offers might be addressed also to “good customers” segment. For example, if they give recommendations to their friends, they will receive certain monetary reward. (P.S. Alpha bank has this offer)

The analysis also showed that churners are a bit more likely to be single and female. So it might be beneficial to pay attention to them, as well. It might be informational help like targeted messages with explanations about certain procedures and notifications about more beneficial manipulations with money for females. Ladies may be not fully informed. Special tariffs for single persons might also stimulate them to use bank services.

Another way is to ask the churning customers for their feedback, so that to upgrade the existing system and improve services. Better explanations might be not those which are predicted, but those which are gained after asking the leaving customers directly. So for example, to ask customers to describe their reasons for stopping relationships with the bank.

Appendix

Skimming the data.

library(skimr)
skim(df)

Data summary
Name	df
Number of rows	10127
Number of columns	20
_______________________
Column type frequency:
factor	6
numeric	14
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
still_customer	1	FALSE	2	No: 8500, Yes: 1627
gender	1	FALSE	2	F: 5358, M: 4769
education_level	1	FALSE	7	Gra: 3128, Hig: 2013, Unk: 1519, Une: 1487
marital_status	1	FALSE	4	Mar: 4687, Sin: 3943, Unk: 749, Div: 748
income_category	1	FALSE	6	Les: 3561, $40: 1790, $80: 1535, $60: 1402
card_category	1	FALSE	4	Blu: 9436, Sil: 555, Gol: 116, Pla: 20

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
customer_age	1	46.33	8.02	26.0	41.00	46.00	52.00	73.00	▂▆▇▃▁
dependent_count	1	2.35	1.30	0.0	1.00	2.00	3.00	5.00	▇▇▇▅▁
months_on_book	1	35.93	7.99	13.0	31.00	36.00	40.00	56.00	▁▃▇▃▂
total_relationship_count	1	3.81	1.55	1.0	3.00	4.00	5.00	6.00	▇▇▆▆▆
months_inactive_12_mon	1	2.34	1.01	0.0	2.00	2.00	3.00	6.00	▅▇▇▁▁
contacts_count_12_mon	1	2.46	1.11	0.0	2.00	2.00	3.00	6.00	▅▇▇▃▁
credit_limit	1	8631.95	9088.78	1438.3	2555.00	4549.00	11067.50	34516.00	▇▂▁▁▁
total_revolving_bal	1	1162.81	814.99	0.0	359.00	1276.00	1784.00	2517.00	▇▅▇▇▅
avg_open_to_buy	1	7469.14	9090.69	3.0	1324.50	3474.00	9859.00	34516.00	▇▂▁▁▁
total_amt_chng_q4_q1	1	0.76	0.22	0.0	0.63	0.74	0.86	3.40	▅▇▁▁▁
total_trans_amt	1	4404.09	3397.13	510.0	2155.50	3899.00	4741.00	18484.00	▇▅▁▁▁
total_trans_ct	1	64.86	23.47	10.0	45.00	67.00	81.00	139.00	▂▅▇▂▁
total_ct_chng_q4_q1	1	0.71	0.24	0.0	0.58	0.70	0.82	3.71	▇▆▁▁▁
avg_utilization_ratio	1	0.27	0.28	0.0	0.02	0.18	0.50	1.00	▇▂▂▂▁

summary(df)

##  still_customer  customer_age   gender   dependent_count      education_level
##  No :8500       Min.   :26.00   F:5358   Min.   :0.000   College      :1013  
##  Yes:1627       1st Qu.:41.00   M:4769   1st Qu.:1.000   Doctorate    : 451  
##                 Median :46.00            Median :2.000   Graduate     :3128  
##                 Mean   :46.33            Mean   :2.346   High School  :2013  
##                 3rd Qu.:52.00            3rd Qu.:3.000   Post-Graduate: 516  
##                 Max.   :73.00            Max.   :5.000   Uneducated   :1487  
##                                                          Unknown      :1519  
##   marital_status       income_category  card_category  months_on_book 
##  Divorced: 748   $120K +       : 727   Blue    :9436   Min.   :13.00  
##  Married :4687   $40K - $60K   :1790   Gold    : 116   1st Qu.:31.00  
##  Single  :3943   $60K - $80K   :1402   Platinum:  20   Median :36.00  
##  Unknown : 749   $80K - $120K  :1535   Silver  : 555   Mean   :35.93  
##                  Less than $40K:3561                   3rd Qu.:40.00  
##                  Unknown       :1112                   Max.   :56.00  
##                                                                       
##  total_relationship_count months_inactive_12_mon contacts_count_12_mon
##  Min.   :1.000            Min.   :0.000          Min.   :0.000        
##  1st Qu.:3.000            1st Qu.:2.000          1st Qu.:2.000        
##  Median :4.000            Median :2.000          Median :2.000        
##  Mean   :3.813            Mean   :2.341          Mean   :2.455        
##  3rd Qu.:5.000            3rd Qu.:3.000          3rd Qu.:3.000        
##  Max.   :6.000            Max.   :6.000          Max.   :6.000        
##                                                                       
##   credit_limit   total_revolving_bal avg_open_to_buy total_amt_chng_q4_q1
##  Min.   : 1438   Min.   :   0        Min.   :    3   Min.   :0.0000      
##  1st Qu.: 2555   1st Qu.: 359        1st Qu.: 1324   1st Qu.:0.6310      
##  Median : 4549   Median :1276        Median : 3474   Median :0.7360      
##  Mean   : 8632   Mean   :1163        Mean   : 7469   Mean   :0.7599      
##  3rd Qu.:11068   3rd Qu.:1784        3rd Qu.: 9859   3rd Qu.:0.8590      
##  Max.   :34516   Max.   :2517        Max.   :34516   Max.   :3.3970      
##                                                                          
##  total_trans_amt total_trans_ct   total_ct_chng_q4_q1 avg_utilization_ratio
##  Min.   :  510   Min.   : 10.00   Min.   :0.0000      Min.   :0.0000       
##  1st Qu.: 2156   1st Qu.: 45.00   1st Qu.:0.5820      1st Qu.:0.0230       
##  Median : 3899   Median : 67.00   Median :0.7020      Median :0.1760       
##  Mean   : 4404   Mean   : 64.86   Mean   :0.7122      Mean   :0.2749       
##  3rd Qu.: 4741   3rd Qu.: 81.00   3rd Qu.:0.8180      3rd Qu.:0.5030       
##  Max.   :18484   Max.   :139.00   Max.   :3.7140      Max.   :0.9990       
##

Churn with other continuous variables (which have almost no associations with churn)

bp8 <- ggplot(df) + geom_boxplot(aes(still_customer, customer_age, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Age") +
  xlab("Churned") +
  ylab("Age")
bp9 <- ggplot(df) + geom_boxplot(aes(still_customer, dependent_count, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Dependent count") +
  xlab("Churned") +
  ylab("Dependent count")
bp10 <- ggplot(df) + geom_boxplot(aes(still_customer, months_on_book, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Months on book") +
  xlab("Churned") +
  ylab("Months on book")
bp11 <- ggplot(df) + geom_boxplot(aes(still_customer, credit_limit, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Credit limit") +
  xlab("Churned") +
  ylab("Credit limit")

ggpubr::ggarrange(bp8, bp9, bp10, bp11,
          ncol = 2, nrow = 2)

bp12 <- ggplot(df) + geom_boxplot(aes(still_customer, avg_open_to_buy, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Average \nopen to buy") +
  xlab("Churned") +
  ylab("Average open to buy")

bp13 <- ggplot(df) + geom_boxplot(aes(still_customer, total_amt_chng_q4_q1, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Total \namount change") +
  xlab("Churned") +
  ylab("Total amount change")

bp14 <- ggplot(df) + geom_boxplot(aes(still_customer, total_ct_chng_q4_q1, fill = still_customer), show.legend = FALSE)+
  ggtitle("Churn by Total count \nchange") +
  xlab("Churned") +
  ylab("Total count change")


ggpubr::ggarrange(bp12, bp13, bp14,
          ncol = 3, nrow = 1)

T-tests for numeric variables

t.test(df$total_revolving_bal ~ df$still_customer)$estimate #large difference

##  mean in group No mean in group Yes 
##          1256.604           672.823

t.test(df$months_inactive_12_mon ~ df$still_customer)$estimate #small difference

##  mean in group No mean in group Yes 
##          2.273765          2.693301

t.test(df$total_relationship_count~ df$still_customer)$estimate #small difference

##  mean in group No mean in group Yes 
##          3.914588          3.279656

t.test(df$contacts_count_12_mon ~ df$still_customer)$estimate #small difference

##  mean in group No mean in group Yes 
##          2.356353          2.972342

t.test(df$total_trans_amt ~ df$still_customer)$estimate #large difference

##  mean in group No mean in group Yes 
##          4654.656          3095.026

t.test(df$total_trans_ct ~ df$still_customer)$estimate #large difference

##  mean in group No mean in group Yes 
##          68.67259          44.93362

t.test(df$avg_utilization_ratio ~ df$still_customer)$estimate #small difference

##  mean in group No mean in group Yes 
##         0.2964118         0.1624751

Chi-Tests for categories

chisq.test(x = df$gender, y = df$still_customer)$res

##          df$still_customer
## df$gender        No       Yes
##         F -1.031683  2.358098
##         M  1.093538 -2.499479

chisq.test(x = df$education_level, y = df$still_customer)$res

##                   df$still_customer
## df$education_level          No         Yes
##      College        0.30001628 -0.68574161
##      Doctorate     -1.15863047  2.64826006
##      Graduate       0.30334777 -0.69335634
##      High School    0.42350004 -0.96798615
##      Post-Graduate -0.43725023  0.99941470
##      Uneducated     0.05380534 -0.12298186
##      Unknown       -0.33489742  0.76546879

chisq.test(x = df$card_category, y = df$still_customer)$res

##                 df$still_customer
## df$card_category          No         Yes
##         Blue     -0.03388749  0.07745601
##         Gold     -0.23952705  0.54748252
##         Platinum -0.43610767  0.99680316
##         Silver    0.33202191 -0.75889630

chisq.test(x = df$marital_status, y = df$still_customer)$res

##                  df$still_customer
## df$marital_status          No         Yes
##          Divorced -0.03298957  0.07540365
##          Married   0.70169945 -1.60386135
##          Single   -0.60003573  1.37149048
##          Unknown  -0.34562555  0.78998987

chisq.test(x = df$income_category, y = df$still_customer)$res

##                   df$still_customer
## df$income_category         No        Yes
##     $120K +        -0.3724543  0.8513119
##     $40K - $60K     0.4277676 -0.9777405
##     $60K - $80K     1.0565804 -2.4150061
##     $80K - $120K    0.1285035 -0.2937182
##     Less than $40K -0.7296604  1.6677712
##     Unknown        -0.2732014  0.6244513

The detailed information about the models for churn prediction

summary(model1)

## 
## Call:
## glm(formula = still_customer ~ ., family = binomial, data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6066  -0.3716  -0.1264   0.2657   3.0209  
## 
## Coefficients: (7 not defined because of singularities)
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        5.11671    0.73318   6.979 2.98e-12 ***
## customer_age                       0.74579    0.62873   1.186 0.235551    
## dependent_count                    0.70571    0.25665   2.750 0.005964 ** 
## months_on_book                    -0.73767    0.57354  -1.286 0.198385    
## total_relationship_count          -1.92634    0.22452  -8.580  < 2e-16 ***
## months_inactive_12_mon             3.25319    0.40612   8.010 1.14e-15 ***
## contacts_count_12_mon              2.95921    0.36764   8.049 8.33e-16 ***
## credit_limit                      -0.52796    0.36340  -1.453 0.146271    
## total_revolving_bal               -5.62758    0.78401  -7.178 7.08e-13 ***
## avg_open_to_buy                         NA         NA      NA       NA    
## total_amt_chng_q4_q1              -4.74458    1.20763  -3.929 8.54e-05 ***
## total_trans_amt                   21.37877    1.27016  16.832  < 2e-16 ***
## total_trans_ct                   -22.47128    1.01192 -22.207  < 2e-16 ***
## total_ct_chng_q4_q1              -11.78777    1.20404  -9.790  < 2e-16 ***
## avg_utilization_ratio              1.16758    0.73774   1.583 0.113504    
## gender_F                           0.88772    0.24851   3.572 0.000354 ***
## gender_M                                NA         NA      NA       NA    
## education_level_College           -0.23667    0.26102  -0.907 0.364557    
## education_level_Doctorate          0.03862    0.34397   0.112 0.910611    
## education_level_Graduate          -0.02616    0.19937  -0.131 0.895602    
## `education_level_High School`     -0.13327    0.21674  -0.615 0.538632    
## `education_level_Post-Graduate`   -0.20093    0.32333  -0.621 0.534301    
## education_level_Uneducated        -0.36359    0.23454  -1.550 0.121081    
## education_level_Unknown                 NA         NA      NA       NA    
## `income_category_$120K +`          1.09928    0.39991   2.749 0.005982 ** 
## `income_category_$40K - $60K`      0.19614    0.27279   0.719 0.472125    
## `income_category_$60K - $80K`      0.51113    0.35019   1.460 0.144408    
## `income_category_$80K - $120K`     0.69730    0.35178   1.982 0.047459 *  
## `income_category_Less than $40K`   0.01393    0.22828   0.061 0.951353    
## income_category_Unknown                 NA         NA      NA       NA    
## card_category_Blue                -0.28733    0.37876  -0.759 0.448086    
## card_category_Gold                -0.25022    0.81475  -0.307 0.758753    
## card_category_Platinum             0.56834    1.33378   0.426 0.670027    
## card_category_Silver                    NA         NA      NA       NA    
## marital_status_Divorced            0.17276    0.33288   0.519 0.603777    
## marital_status_Married            -0.26106    0.24245  -1.077 0.281575    
## marital_status_Single              0.08625    0.24430   0.353 0.724054    
## marital_status_Unknown                  NA         NA      NA       NA    
## cluster                                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3663.4  on 2935  degrees of freedom
## Residual deviance: 1607.8  on 2904  degrees of freedom
## AIC: 1671.8
## 
## Number of Fisher Scoring iterations: 6

summary(model3)

## 
## Call:
## glm(formula = still_customer ~ ., family = binomial, data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4662  -0.2506  -0.1048  -0.0411   3.3533  
## 
## Coefficients: (7 not defined because of singularities)
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -0.885337   0.835252  -1.060 0.289161    
## customer_age                      -0.800914   0.704955  -1.136 0.255906    
## dependent_count                    1.090158   0.290916   3.747 0.000179 ***
## months_on_book                    -0.110580   0.634969  -0.174 0.861748    
## total_relationship_count          -2.282620   0.267556  -8.531  < 2e-16 ***
## months_inactive_12_mon             3.263469   0.437054   7.467 8.21e-14 ***
## contacts_count_12_mon              3.470256   0.432501   8.024 1.03e-15 ***
## credit_limit                       0.324924   0.513156   0.633 0.526611    
## total_revolving_bal                3.394755   0.435856   7.789 6.77e-15 ***
## avg_open_to_buy                          NA         NA      NA       NA    
## total_amt_chng_q4_q1              -3.864084   1.284732  -3.008 0.002632 ** 
## total_trans_amt                   19.393359   1.404098  13.812  < 2e-16 ***
## total_trans_ct                   -19.876850   1.143142 -17.388  < 2e-16 ***
## total_ct_chng_q4_q1              -14.189657   1.433505  -9.899  < 2e-16 ***
## avg_utilization_ratio              1.142995   0.438274   2.608 0.009109 ** 
## gender_F                           0.886846   0.281641   3.149 0.001639 ** 
## gender_M                                 NA         NA      NA       NA    
## education_level_College            0.450259   0.299888   1.501 0.133246    
## education_level_Doctorate          0.772615   0.368162   2.099 0.035855 *  
## education_level_Graduate          -0.020227   0.245389  -0.082 0.934307    
## `education_level_High School`      0.164649   0.256791   0.641 0.521406    
## `education_level_Post-Graduate`    0.880607   0.365570   2.409 0.016002 *  
## education_level_Uneducated         0.508716   0.272348   1.868 0.061778 .  
## education_level_Unknown                  NA         NA      NA       NA    
## `income_category_$120K +`          0.339519   0.451967   0.751 0.452531    
## `income_category_$40K - $60K`     -0.147841   0.299560  -0.494 0.621641    
## `income_category_$60K - $80K`      0.007713   0.411116   0.019 0.985032    
## `income_category_$80K - $120K`     0.167835   0.393778   0.426 0.669949    
## `income_category_Less than $40K`  -0.027069   0.256045  -0.106 0.915805    
## income_category_Unknown                  NA         NA      NA       NA    
## card_category_Blue                -0.012979   0.386644  -0.034 0.973222    
## card_category_Gold                 0.793328   0.595773   1.332 0.182993    
## card_category_Platinum           -12.104277 407.554700  -0.030 0.976307    
## card_category_Silver                     NA         NA      NA       NA    
## marital_status_Divorced            0.019636   0.370180   0.053 0.957697    
## marital_status_Married            -0.725374   0.289423  -2.506 0.012201 *  
## marital_status_Single              0.170768   0.288687   0.592 0.554163    
## marital_status_Unknown                   NA         NA      NA       NA    
## cluster                                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2540.0  on 4426  degrees of freedom
## Residual deviance: 1328.2  on 4395  degrees of freedom
## AIC: 1392.2
## 
## Number of Fisher Scoring iterations: 13