This project aims to explore the data about churners of the bank, to segment the customers and then predict churn in these segments. Finally, churn prevention strategies will be introduced and discussed.
As a result, the overall churn rate is 16%. Three segments are managed by Revolving balance, Number of transactions in last 12 months and Total amount of these transactions. The churn of 30% in Sleeping segment is predicted with the Model1 with accuracy of 0.85 - good estimate.
In this section skimming and cleaning the data for the following analysis are made.
library(readr)
library(tidyverse)
library(rmarkdown)
df <- read_csv("bank_churners.csv")
All the variables should be managed to appropriate types for further manipulations.
#factor variables
df$still_customer <- as.factor(df$still_customer)
df$gender <- as.factor(df$gender)
df$education_level <- as.factor(df$education_level)
df$marital_status <- as.factor(df$marital_status)
df$income_category <- as.factor(df$income_category)
df$card_category <- as.factor(df$card_category)
levels(df$still_customer) <- c("No", "Yes")
All in all, we have 1627 customers who churned and 8500 remaining customers.
table(df$still_customer)
##
## No Yes
## 8500 1627
ggplot(data = df) + geom_bar(aes( x = still_customer, fill = still_customer), show.legend = FALSE) +
ggtitle("Number of customers who stayed/churned") +
xlab("Churned") +
ylab("Number of customers") +
theme_minimal()
In percentage points it’s 16% of people churning and 84% staying.
churn <- df %>% group_by(still_customer) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count))
ggplot(churn, aes(still_customer, pct, fill = still_customer)) +
geom_bar(stat='identity', show.legend = FALSE) +
geom_text(aes(label=scales::percent(pct)), position = position_stack(vjust = .5))+
scale_y_continuous(labels = scales::percent) +
ggtitle("Number of customers who churned/stayed in %") +
xlab("Churned") +
ylab("Percent of customers") +
theme_minimal()
Percentage of churners - 16%! This rate is not very large, but not perfect either (it is better to have 5-10%). So it is indeed problematic for the bank and should be dealt with.
You can trace data distributions in more detail in the Appendix section at the end of the report.
In this section some associations of churn with other variables are found.
Only 7 of the 14 continuous variables have visible differences for people who churned and who stayed. These are differences in mean values.
For example, people with small revolving balance on average may churn. People who are inactive for many months during the year are more likely to churn, as well. Other relationships can be seen on the graphs.
All in all, after analyzing all the graphs and doing some statistical tests, the largest differences for churners and non-churners is traced in Revolving balance estimate, Number of transactions in last 12 months and total amount of these transactions.
Hence, these measurements seem to be the most interesting and useful for further segmentation and predictions.
Rest of the graphs and statistical tests could be found in the Appendix.
bp1 <- ggplot(df) + geom_boxplot(aes(still_customer, total_revolving_bal, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Total revolving balance") +
xlab("Churned") +
ylab("Total revolving balance")
bp2 <- ggplot(df) + geom_boxplot(aes(still_customer, months_inactive_12_mon, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Months inactive \n(last 12 months)") +
xlab("Churned") +
ylab("Months inactive")
bp3 <- ggplot(df) + geom_boxplot(aes(still_customer, total_relationship_count, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Total relationship count") +
xlab("Churned") +
ylab("Total relationship count")
bp4 <- ggplot(df) + geom_boxplot(aes(still_customer, contacts_count_12_mon, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Contacts count \n(last 12 months)") +
xlab("Churned") +
ylab("Contacts count")
ggpubr::ggarrange(bp1, bp2, bp3, bp4,
ncol = 2, nrow = 2)
bp5 <- ggplot(df) + geom_boxplot(aes(still_customer, total_trans_amt, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Total \ntransaction amount") +
xlab("Churned") +
ylab("Total transaction amount")
bp6 <- ggplot(df) + geom_boxplot(aes(still_customer, total_trans_ct, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Total \ntransaction count") +
xlab("Churned") +
ylab("Total transaction count")
bp7 <- ggplot(df) + geom_boxplot(aes(still_customer, avg_utilization_ratio, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Average \nutilization ratio") +
xlab("Churned") +
ylab("Average utilization ratio")
ggpubr::ggarrange(bp5, bp6, bp7,
ncol = 3, nrow = 1)
Also, other variables like number of inactive months, contacts count, total relationship count, and average utilization ratio might to be useful for further churn prediction, as they are also statistically significant. But for segmentation three mentioned variables will be enough.
Further, churn appears to have slight association with gender, females are a bit more likely to churn. The majority of people who stay have Graduate level of education, however, among churners this category prevails too. Married people are not likely to churn as much as single people do. People with income less than 40K $ are more likely to churn, as well. Card categories have no association with churn at all.
Additional statistical tests could be found in the Appendix.
bar1 <- ggplot(df) + geom_bar(aes(gender, fill = still_customer),
position="dodge", show.legend = F) +
ggtitle("Churn by Gender") +
xlab("Gender") +
ylab("Number of people")
bar2 <- ggplot(df) + geom_bar(aes(education_level, fill = still_customer),
position="dodge", show.legend = F) +
ggtitle("Churn by Education level") +
xlab("Education level") +
ylab("Number of people")
bar3 <- ggplot(df) + geom_bar(aes(marital_status, fill = still_customer),
position="dodge", show.legend = F) + ggtitle("Churn by Marital status") +
xlab("Marital status") +
ylab("Number of people")
bar4 <- ggplot(df) + geom_bar(aes(income_category, fill = still_customer),
position="dodge", show.legend = F) +
ggtitle("Churn by Income category") +
xlab("Income category") +
ylab("Number of people")
bar5 <- ggplot(df) + geom_bar(aes(card_category, fill = still_customer),
position="dodge", show.legend = F) +
ggtitle("Churn by Card category") +
xlab("Card category") +
ylab("Number of people")
ggpubr::ggarrange(bar1, bar2,
ncol = 1, nrow = 2)
ggpubr::ggarrange(bar3, bar4,
ncol = 1, nrow = 2)
bar5
So gender, marital status and income might be good predictors for churn. However, for segmenting they will not be used.
Before, it is reasonable to normalize - to have similar scale from 0 to 1 - all the variables. This gives a right to compare estimates which are measured in different units (dollars, number of time, people, etc.)
df_scaled <- df
df_scaled$still_customer <- ifelse(df_scaled$still_customer == "Yes", 1, 0)
scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}
df_scaled <- df_scaled %>%
mutate_if(sapply(df, is.numeric), scale_01)
library('fastDummies')
df_scaled <- dummy_cols(df_scaled, select_columns = 'gender') %>% dplyr::select(-gender)
df_scaled <- dummy_cols(df_scaled, select_columns = 'education_level') %>% dplyr::select(-education_level)
df_scaled <- dummy_cols(df_scaled, select_columns = 'income_category') %>% dplyr::select(-income_category)
df_scaled <- dummy_cols(df_scaled, select_columns = 'card_category') %>% dplyr::select(-card_category)
df_scaled <- dummy_cols(df_scaled, select_columns = 'marital_status') %>% dplyr::select(-marital_status)
So for segmentation we came up to three measurements. If we look at the graphs of their associations with each other, we can already see three distinct groups. One of them seems to have no churners at all. So this assumption with be now checked in the following section.
sp1 <- ggplot(df_scaled) + geom_point(aes(total_revolving_bal, total_trans_ct, color = factor(still_customer)), show.legend = F) +
xlab("Total revolving balance") +
ylab("Total transaction count")
sp2 <- ggplot(df_scaled) + geom_point(aes(total_revolving_bal, total_trans_amt, color = factor(still_customer)), show.legend = F) +
xlab("Total revolving balance") +
ylab("Total transaction amount")
sp3 <- ggplot(df_scaled) + geom_point(aes(total_trans_ct, total_trans_amt, color = factor(still_customer)), show.legend = F) +
xlab("Total transaction count") +
ylab("Total transaction amount")
ggpubr::ggarrange(sp1, sp2, sp3,
ncol = 3, nrow = 1)
df_segment <- df_scaled %>% dplyr::select(total_trans_ct, total_revolving_bal, total_trans_amt)
The graph which describes optimal number of clusters confirms that 3 clusters is the best option.
library(NbClust)
library(factoextra)
fviz_nbclust(df_segment, kmeans, method = "silhouette")
So the segmentation is performed with K-means algorithm.
set.seed(12)
model_clust <- kmeans(df_segment, centers = 3, nstart = 25)
df_clustered <- mutate(df_scaled, cluster = model_clust$cluster)
df_clustered %>% group_by(cluster) %>% summarize(bal = mean(total_revolving_bal), cnt = mean(total_trans_ct), amt = mean(total_trans_amt))
## # A tibble: 3 x 4
## cluster bal cnt amt
## <int> <dbl> <dbl> <dbl>
## 1 1 0.547 0.758 0.722
## 2 2 0.684 0.396 0.165
## 3 3 0.107 0.387 0.167
First segment is about customers with high values of revolving balance, transaction counts and amount of transactions. This segment consists almost only of customers who stay in the bank, so segment with no churners. - “Good customers” segment
Second segment is about customers with high revolving balance, with average count of transactions and small total amount of transactions. - “In debt” segment
Third segment is about customers with small-to-medium estimates in all three measurements, so with small revolving balance, with average count of transactions and small total amount of transactions. - “Free to go” or “Sleeping” segment.
Hence, churn will be predicted with the help of models in the second and third segments only, as the first one has no problems with churners.
ggplot(df_clustered) + geom_bar(aes(cluster, fill = factor(still_customer)), show.legend = F) +
ggtitle("Churners in clusters") +
xlab("Cluster") +
ylab("Number of customers")
df_clustered %>% group_by(cluster, still_customer) %>% summarize(count = n()) %>%
mutate(pct = count/sum(count))
The churn is 8% in the second cluster and 30% in the third cluster.
df_predict1 <- df_clustered %>% filter(cluster==3)
df_predict2 <- df_clustered %>% filter(cluster==2)
Special attention will be paid for dealing with the churn of 30%, but also I will take a look at the 8% churn in the “In debt” segment.
library(tidymodels)
set.seed(1111)
df_split1 <- initial_split(df_predict1, prop = .8)
train1 <- training(df_split1)
test1 <- testing(df_split1)
The first large model uses all the estimates in the dataset. It explains much of the data, however, it is very large and hard to interpret.
The full model coefficients could be found in Appendix.
model1 <- glm(still_customer ~., data = train1, family = binomial)
library(descr)
LogRegR2(model1)
## Chi2 2055.646
## Df 31
## Sig. 0
## Cox and Snell Index 0.5034901
## Nagelkerke Index 0.7063052
## McFadden's R2 0.5611303
From the previous model it is seen that some measurements have more influence, so noting them, we arrive at the final model.
model2 <- glm(still_customer ~ gender_F + marital_status_Single + total_relationship_count + months_inactive_12_mon + total_revolving_bal + total_trans_amt + total_trans_ct, data = train1, family = binomial)
LogRegR2(model2)
## Chi2 1747.117
## Df 7
## Sig. 0
## Cox and Snell Index 0.4484744
## Nagelkerke Index 0.6291282
## McFadden's R2 0.476911
The final model explains 47% of our data - great estimate!
pred1 <- predict(model2, newdata = test1, type = "response")
library(caret)
library(e1071)
pred1 = as.data.frame(pred1)
pred1$pred1 <- if_else(pred1 > 0.5,1,0)
p.class = factor(pred1$pred1, levels = c(1,0))
actual.factor = factor(test1$still_customer, levels = c(1,0))
confusionMatrix(p.class, actual.factor)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 175 54
## 0 55 451
##
## Accuracy : 0.8517
## 95% CI : (0.8239, 0.8766)
## No Information Rate : 0.6871
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6547
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7609
## Specificity : 0.8931
## Pos Pred Value : 0.7642
## Neg Pred Value : 0.8913
## Prevalence : 0.3129
## Detection Rate : 0.2381
## Detection Prevalence : 0.3116
## Balanced Accuracy : 0.8270
##
## 'Positive' Class : 1
##
As for predictive power, the model works good. It predicts true churners with relatively little errors (sensitivity of 0.8). The model is accurate and have good predictions on the test data!
summary(model2)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.6829437 0.2652197 13.886386 7.660396e-44
## gender_F 0.5389501 0.1183714 4.553043 5.287534e-06
## marital_status_Single 0.3478942 0.1184893 2.936081 3.323873e-03
## total_relationship_count -2.0516887 0.2017294 -10.170498 2.685315e-24
## months_inactive_12_mon 3.3025868 0.3650315 9.047401 1.464111e-19
## total_revolving_bal -5.2441284 0.4565431 -11.486600 1.540564e-30
## total_trans_amt 17.6607656 1.0886980 16.221915 3.530158e-59
## total_trans_ct -21.4317737 0.8856065 -24.200108 2.218929e-129
From the model it appears that females and single persons have larger chances to churn. Also with smaller number of products held, with smaller revolving balance amount people will churn more frequently. And with larger number of inactive months, larger amounts of transactions in last 12 months people are predicted by the model to churn with bigger chances.
Most powerful predictors: number of transaction in last 12 months, amount of these transactions and amount of revolving balance.
All the same is done for the model for the second cluster. But this final model will have other measurements for prediction.
set.seed(1111)
df_split2 <- initial_split(df_predict2, prop = .8)
train2 <- training(df_split2)
test2 <- testing(df_split2)
model3 <- glm(still_customer ~., data = train2, family = binomial)
The full model coefficients could be found in Appendix.
model4 <- glm(still_customer ~ marital_status_Single + total_relationship_count + months_inactive_12_mon + contacts_count_12_mon + total_trans_amt + total_revolving_bal + total_trans_ct + total_ct_chng_q4_q1, data = train2, family = binomial)
LogRegR2(model4)
## Chi2 1102.322
## Df 8
## Sig. 0
## Cox and Snell Index 0.2204198
## Nagelkerke Index 0.5048573
## McFadden's R2 0.4339771
This model explains 43% of the our data, that is very good.
pred2 <- predict(model4, newdata = test2, type = "response")
pred2 = as.data.frame(pred2)
pred2$pred2 <- if_else(pred2 > 0.5,1,0)
p.class = factor(pred2$pred2,levels = c(1,0))
actual.factor = factor(test2$still_customer, levels = c(1,0))
confusionMatrix(p.class, actual.factor)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 33 16
## 0 59 999
##
## Accuracy : 0.9322
## 95% CI : (0.9158, 0.9463)
## No Information Rate : 0.9169
## P-Value [Acc > NIR] : 0.03334
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 1.236e-06
##
## Sensitivity : 0.35870
## Specificity : 0.98424
## Pos Pred Value : 0.67347
## Neg Pred Value : 0.94423
## Prevalence : 0.08311
## Detection Rate : 0.02981
## Detection Prevalence : 0.04426
## Balanced Accuracy : 0.67147
##
## 'Positive' Class : 1
##
The model predicts non-churners with great success, churners are predicted worse, but more measurements will not improve the model significantly and certainly they will complicate the interpretation.
Unfortunately, the model for predicting churn in the second segment “In debt” is almost useless, as it predicts churners poorly - misses a lot of true churners.
All in all, the model for predicting churn in the segment of sleeping customers performs well. Further section will include possible churn prevention strategies for this segment.
In general, customers searching for bank services mostly crave for small loan rate and high deposit rate, so that they can minimize their costs or maximize profit, as all rational people want to do. So in order to be the most attractive bank for customers, the rates should be managed in customer-centered traditions. This also includes bonus programs or special offers. Customer-oriented approached helps to build trust and loyalty.
The analysis of sleeping segment revealed that customers almost do not use the services of the bank: they hold few bank products, they do not perform a lot of transactions and rarely do they appear at the bank in general.
Sleeping customers or at risk customers are needed to be talked to, offered some renewals, so reactivated.
As I already mentioned, bonus programs or special offers are good tools for motivating customers, especially the sleeping ones, as in our case. The recommendation is to introduce or increase (if exists) the monetary cashback or bonuses for being an active customer. This would not be so costly as changing the rates, but it will be effective too.
As for special offers, customers will take more bank products or perform more transactions, if they know that each following product/transaction have benefits for them. For example, running a new saving account will have a larger rate compared to existing account. Or having certain amount of transactions during several months will give larger cashback for customers.
Special offers might be addressed also to “good customers” segment. For example, if they give recommendations to their friends, they will receive certain monetary reward. (P.S. Alpha bank has this offer)
The analysis also showed that churners are a bit more likely to be single and female. So it might be beneficial to pay attention to them, as well. It might be informational help like targeted messages with explanations about certain procedures and notifications about more beneficial manipulations with money for females. Ladies may be not fully informed. Special tariffs for single persons might also stimulate them to use bank services.
Another way is to ask the churning customers for their feedback, so that to upgrade the existing system and improve services. Better explanations might be not those which are predicted, but those which are gained after asking the leaving customers directly. So for example, to ask customers to describe their reasons for stopping relationships with the bank.
Skimming the data.
library(skimr)
skim(df)
| Name | df |
| Number of rows | 10127 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| factor | 6 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| still_customer | 0 | 1 | FALSE | 2 | No: 8500, Yes: 1627 |
| gender | 0 | 1 | FALSE | 2 | F: 5358, M: 4769 |
| education_level | 0 | 1 | FALSE | 7 | Gra: 3128, Hig: 2013, Unk: 1519, Une: 1487 |
| marital_status | 0 | 1 | FALSE | 4 | Mar: 4687, Sin: 3943, Unk: 749, Div: 748 |
| income_category | 0 | 1 | FALSE | 6 | Les: 3561, $40: 1790, $80: 1535, $60: 1402 |
| card_category | 0 | 1 | FALSE | 4 | Blu: 9436, Sil: 555, Gol: 116, Pla: 20 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| customer_age | 0 | 1 | 46.33 | 8.02 | 26.0 | 41.00 | 46.00 | 52.00 | 73.00 | ▂▆▇▃▁ |
| dependent_count | 0 | 1 | 2.35 | 1.30 | 0.0 | 1.00 | 2.00 | 3.00 | 5.00 | ▇▇▇▅▁ |
| months_on_book | 0 | 1 | 35.93 | 7.99 | 13.0 | 31.00 | 36.00 | 40.00 | 56.00 | ▁▃▇▃▂ |
| total_relationship_count | 0 | 1 | 3.81 | 1.55 | 1.0 | 3.00 | 4.00 | 5.00 | 6.00 | ▇▇▆▆▆ |
| months_inactive_12_mon | 0 | 1 | 2.34 | 1.01 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 | ▅▇▇▁▁ |
| contacts_count_12_mon | 0 | 1 | 2.46 | 1.11 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 | ▅▇▇▃▁ |
| credit_limit | 0 | 1 | 8631.95 | 9088.78 | 1438.3 | 2555.00 | 4549.00 | 11067.50 | 34516.00 | ▇▂▁▁▁ |
| total_revolving_bal | 0 | 1 | 1162.81 | 814.99 | 0.0 | 359.00 | 1276.00 | 1784.00 | 2517.00 | ▇▅▇▇▅ |
| avg_open_to_buy | 0 | 1 | 7469.14 | 9090.69 | 3.0 | 1324.50 | 3474.00 | 9859.00 | 34516.00 | ▇▂▁▁▁ |
| total_amt_chng_q4_q1 | 0 | 1 | 0.76 | 0.22 | 0.0 | 0.63 | 0.74 | 0.86 | 3.40 | ▅▇▁▁▁ |
| total_trans_amt | 0 | 1 | 4404.09 | 3397.13 | 510.0 | 2155.50 | 3899.00 | 4741.00 | 18484.00 | ▇▅▁▁▁ |
| total_trans_ct | 0 | 1 | 64.86 | 23.47 | 10.0 | 45.00 | 67.00 | 81.00 | 139.00 | ▂▅▇▂▁ |
| total_ct_chng_q4_q1 | 0 | 1 | 0.71 | 0.24 | 0.0 | 0.58 | 0.70 | 0.82 | 3.71 | ▇▆▁▁▁ |
| avg_utilization_ratio | 0 | 1 | 0.27 | 0.28 | 0.0 | 0.02 | 0.18 | 0.50 | 1.00 | ▇▂▂▂▁ |
summary(df)
## still_customer customer_age gender dependent_count education_level
## No :8500 Min. :26.00 F:5358 Min. :0.000 College :1013
## Yes:1627 1st Qu.:41.00 M:4769 1st Qu.:1.000 Doctorate : 451
## Median :46.00 Median :2.000 Graduate :3128
## Mean :46.33 Mean :2.346 High School :2013
## 3rd Qu.:52.00 3rd Qu.:3.000 Post-Graduate: 516
## Max. :73.00 Max. :5.000 Uneducated :1487
## Unknown :1519
## marital_status income_category card_category months_on_book
## Divorced: 748 $120K + : 727 Blue :9436 Min. :13.00
## Married :4687 $40K - $60K :1790 Gold : 116 1st Qu.:31.00
## Single :3943 $60K - $80K :1402 Platinum: 20 Median :36.00
## Unknown : 749 $80K - $120K :1535 Silver : 555 Mean :35.93
## Less than $40K:3561 3rd Qu.:40.00
## Unknown :1112 Max. :56.00
##
## total_relationship_count months_inactive_12_mon contacts_count_12_mon
## Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:2.000
## Median :4.000 Median :2.000 Median :2.000
## Mean :3.813 Mean :2.341 Mean :2.455
## 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :6.000 Max. :6.000 Max. :6.000
##
## credit_limit total_revolving_bal avg_open_to_buy total_amt_chng_q4_q1
## Min. : 1438 Min. : 0 Min. : 3 Min. :0.0000
## 1st Qu.: 2555 1st Qu.: 359 1st Qu.: 1324 1st Qu.:0.6310
## Median : 4549 Median :1276 Median : 3474 Median :0.7360
## Mean : 8632 Mean :1163 Mean : 7469 Mean :0.7599
## 3rd Qu.:11068 3rd Qu.:1784 3rd Qu.: 9859 3rd Qu.:0.8590
## Max. :34516 Max. :2517 Max. :34516 Max. :3.3970
##
## total_trans_amt total_trans_ct total_ct_chng_q4_q1 avg_utilization_ratio
## Min. : 510 Min. : 10.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2156 1st Qu.: 45.00 1st Qu.:0.5820 1st Qu.:0.0230
## Median : 3899 Median : 67.00 Median :0.7020 Median :0.1760
## Mean : 4404 Mean : 64.86 Mean :0.7122 Mean :0.2749
## 3rd Qu.: 4741 3rd Qu.: 81.00 3rd Qu.:0.8180 3rd Qu.:0.5030
## Max. :18484 Max. :139.00 Max. :3.7140 Max. :0.9990
##
Churn with other continuous variables (which have almost no associations with churn)
bp8 <- ggplot(df) + geom_boxplot(aes(still_customer, customer_age, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Age") +
xlab("Churned") +
ylab("Age")
bp9 <- ggplot(df) + geom_boxplot(aes(still_customer, dependent_count, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Dependent count") +
xlab("Churned") +
ylab("Dependent count")
bp10 <- ggplot(df) + geom_boxplot(aes(still_customer, months_on_book, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Months on book") +
xlab("Churned") +
ylab("Months on book")
bp11 <- ggplot(df) + geom_boxplot(aes(still_customer, credit_limit, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Credit limit") +
xlab("Churned") +
ylab("Credit limit")
ggpubr::ggarrange(bp8, bp9, bp10, bp11,
ncol = 2, nrow = 2)
bp12 <- ggplot(df) + geom_boxplot(aes(still_customer, avg_open_to_buy, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Average \nopen to buy") +
xlab("Churned") +
ylab("Average open to buy")
bp13 <- ggplot(df) + geom_boxplot(aes(still_customer, total_amt_chng_q4_q1, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Total \namount change") +
xlab("Churned") +
ylab("Total amount change")
bp14 <- ggplot(df) + geom_boxplot(aes(still_customer, total_ct_chng_q4_q1, fill = still_customer), show.legend = FALSE)+
ggtitle("Churn by Total count \nchange") +
xlab("Churned") +
ylab("Total count change")
ggpubr::ggarrange(bp12, bp13, bp14,
ncol = 3, nrow = 1)
T-tests for numeric variables
t.test(df$total_revolving_bal ~ df$still_customer)$estimate #large difference
## mean in group No mean in group Yes
## 1256.604 672.823
t.test(df$months_inactive_12_mon ~ df$still_customer)$estimate #small difference
## mean in group No mean in group Yes
## 2.273765 2.693301
t.test(df$total_relationship_count~ df$still_customer)$estimate #small difference
## mean in group No mean in group Yes
## 3.914588 3.279656
t.test(df$contacts_count_12_mon ~ df$still_customer)$estimate #small difference
## mean in group No mean in group Yes
## 2.356353 2.972342
t.test(df$total_trans_amt ~ df$still_customer)$estimate #large difference
## mean in group No mean in group Yes
## 4654.656 3095.026
t.test(df$total_trans_ct ~ df$still_customer)$estimate #large difference
## mean in group No mean in group Yes
## 68.67259 44.93362
t.test(df$avg_utilization_ratio ~ df$still_customer)$estimate #small difference
## mean in group No mean in group Yes
## 0.2964118 0.1624751
Chi-Tests for categories
chisq.test(x = df$gender, y = df$still_customer)$res
## df$still_customer
## df$gender No Yes
## F -1.031683 2.358098
## M 1.093538 -2.499479
chisq.test(x = df$education_level, y = df$still_customer)$res
## df$still_customer
## df$education_level No Yes
## College 0.30001628 -0.68574161
## Doctorate -1.15863047 2.64826006
## Graduate 0.30334777 -0.69335634
## High School 0.42350004 -0.96798615
## Post-Graduate -0.43725023 0.99941470
## Uneducated 0.05380534 -0.12298186
## Unknown -0.33489742 0.76546879
chisq.test(x = df$card_category, y = df$still_customer)$res
## df$still_customer
## df$card_category No Yes
## Blue -0.03388749 0.07745601
## Gold -0.23952705 0.54748252
## Platinum -0.43610767 0.99680316
## Silver 0.33202191 -0.75889630
chisq.test(x = df$marital_status, y = df$still_customer)$res
## df$still_customer
## df$marital_status No Yes
## Divorced -0.03298957 0.07540365
## Married 0.70169945 -1.60386135
## Single -0.60003573 1.37149048
## Unknown -0.34562555 0.78998987
chisq.test(x = df$income_category, y = df$still_customer)$res
## df$still_customer
## df$income_category No Yes
## $120K + -0.3724543 0.8513119
## $40K - $60K 0.4277676 -0.9777405
## $60K - $80K 1.0565804 -2.4150061
## $80K - $120K 0.1285035 -0.2937182
## Less than $40K -0.7296604 1.6677712
## Unknown -0.2732014 0.6244513
The detailed information about the models for churn prediction
summary(model1)
##
## Call:
## glm(formula = still_customer ~ ., family = binomial, data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6066 -0.3716 -0.1264 0.2657 3.0209
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.11671 0.73318 6.979 2.98e-12 ***
## customer_age 0.74579 0.62873 1.186 0.235551
## dependent_count 0.70571 0.25665 2.750 0.005964 **
## months_on_book -0.73767 0.57354 -1.286 0.198385
## total_relationship_count -1.92634 0.22452 -8.580 < 2e-16 ***
## months_inactive_12_mon 3.25319 0.40612 8.010 1.14e-15 ***
## contacts_count_12_mon 2.95921 0.36764 8.049 8.33e-16 ***
## credit_limit -0.52796 0.36340 -1.453 0.146271
## total_revolving_bal -5.62758 0.78401 -7.178 7.08e-13 ***
## avg_open_to_buy NA NA NA NA
## total_amt_chng_q4_q1 -4.74458 1.20763 -3.929 8.54e-05 ***
## total_trans_amt 21.37877 1.27016 16.832 < 2e-16 ***
## total_trans_ct -22.47128 1.01192 -22.207 < 2e-16 ***
## total_ct_chng_q4_q1 -11.78777 1.20404 -9.790 < 2e-16 ***
## avg_utilization_ratio 1.16758 0.73774 1.583 0.113504
## gender_F 0.88772 0.24851 3.572 0.000354 ***
## gender_M NA NA NA NA
## education_level_College -0.23667 0.26102 -0.907 0.364557
## education_level_Doctorate 0.03862 0.34397 0.112 0.910611
## education_level_Graduate -0.02616 0.19937 -0.131 0.895602
## `education_level_High School` -0.13327 0.21674 -0.615 0.538632
## `education_level_Post-Graduate` -0.20093 0.32333 -0.621 0.534301
## education_level_Uneducated -0.36359 0.23454 -1.550 0.121081
## education_level_Unknown NA NA NA NA
## `income_category_$120K +` 1.09928 0.39991 2.749 0.005982 **
## `income_category_$40K - $60K` 0.19614 0.27279 0.719 0.472125
## `income_category_$60K - $80K` 0.51113 0.35019 1.460 0.144408
## `income_category_$80K - $120K` 0.69730 0.35178 1.982 0.047459 *
## `income_category_Less than $40K` 0.01393 0.22828 0.061 0.951353
## income_category_Unknown NA NA NA NA
## card_category_Blue -0.28733 0.37876 -0.759 0.448086
## card_category_Gold -0.25022 0.81475 -0.307 0.758753
## card_category_Platinum 0.56834 1.33378 0.426 0.670027
## card_category_Silver NA NA NA NA
## marital_status_Divorced 0.17276 0.33288 0.519 0.603777
## marital_status_Married -0.26106 0.24245 -1.077 0.281575
## marital_status_Single 0.08625 0.24430 0.353 0.724054
## marital_status_Unknown NA NA NA NA
## cluster NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3663.4 on 2935 degrees of freedom
## Residual deviance: 1607.8 on 2904 degrees of freedom
## AIC: 1671.8
##
## Number of Fisher Scoring iterations: 6
summary(model3)
##
## Call:
## glm(formula = still_customer ~ ., family = binomial, data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4662 -0.2506 -0.1048 -0.0411 3.3533
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.885337 0.835252 -1.060 0.289161
## customer_age -0.800914 0.704955 -1.136 0.255906
## dependent_count 1.090158 0.290916 3.747 0.000179 ***
## months_on_book -0.110580 0.634969 -0.174 0.861748
## total_relationship_count -2.282620 0.267556 -8.531 < 2e-16 ***
## months_inactive_12_mon 3.263469 0.437054 7.467 8.21e-14 ***
## contacts_count_12_mon 3.470256 0.432501 8.024 1.03e-15 ***
## credit_limit 0.324924 0.513156 0.633 0.526611
## total_revolving_bal 3.394755 0.435856 7.789 6.77e-15 ***
## avg_open_to_buy NA NA NA NA
## total_amt_chng_q4_q1 -3.864084 1.284732 -3.008 0.002632 **
## total_trans_amt 19.393359 1.404098 13.812 < 2e-16 ***
## total_trans_ct -19.876850 1.143142 -17.388 < 2e-16 ***
## total_ct_chng_q4_q1 -14.189657 1.433505 -9.899 < 2e-16 ***
## avg_utilization_ratio 1.142995 0.438274 2.608 0.009109 **
## gender_F 0.886846 0.281641 3.149 0.001639 **
## gender_M NA NA NA NA
## education_level_College 0.450259 0.299888 1.501 0.133246
## education_level_Doctorate 0.772615 0.368162 2.099 0.035855 *
## education_level_Graduate -0.020227 0.245389 -0.082 0.934307
## `education_level_High School` 0.164649 0.256791 0.641 0.521406
## `education_level_Post-Graduate` 0.880607 0.365570 2.409 0.016002 *
## education_level_Uneducated 0.508716 0.272348 1.868 0.061778 .
## education_level_Unknown NA NA NA NA
## `income_category_$120K +` 0.339519 0.451967 0.751 0.452531
## `income_category_$40K - $60K` -0.147841 0.299560 -0.494 0.621641
## `income_category_$60K - $80K` 0.007713 0.411116 0.019 0.985032
## `income_category_$80K - $120K` 0.167835 0.393778 0.426 0.669949
## `income_category_Less than $40K` -0.027069 0.256045 -0.106 0.915805
## income_category_Unknown NA NA NA NA
## card_category_Blue -0.012979 0.386644 -0.034 0.973222
## card_category_Gold 0.793328 0.595773 1.332 0.182993
## card_category_Platinum -12.104277 407.554700 -0.030 0.976307
## card_category_Silver NA NA NA NA
## marital_status_Divorced 0.019636 0.370180 0.053 0.957697
## marital_status_Married -0.725374 0.289423 -2.506 0.012201 *
## marital_status_Single 0.170768 0.288687 0.592 0.554163
## marital_status_Unknown NA NA NA NA
## cluster NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2540.0 on 4426 degrees of freedom
## Residual deviance: 1328.2 on 4395 degrees of freedom
## AIC: 1392.2
##
## Number of Fisher Scoring iterations: 13