In this project I will analyze the credit data in order to identify the most important factors for predicting customer churn as well as estimating time period before churning.
First, I will do a short exploratory analysis to get acquainted with the data at hand. Some features of gender composition and geographical location will be presented in this inroductory part.
Second, the information about churners will be presented into more detail.
Third, I will use logistic regression to predict churn. Along the way, the most important factors for predicting churn will be outlined.
Fourth, a decision tree model will be built as an alternative solution for the task. Both models` performances will be compared.
Finally, I will build a model to predict time to churn.
Overall statistics:
cust_stat <- credit %>% group_by(Gender) %>% summarise(num_customers = n(), mean_crScore = round(mean(CreditScore),2), mean_age = round(mean(Age),2), mean_tenure = round(mean(Tenure),2), mean_balance = round(mean(Balance),2), mean_salary = round(mean(EstimatedSalary),2))
formattable(cust_stat,
align =c("l","c","c","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
| Gender | num_customers | mean_crScore | mean_age | mean_tenure | mean_balance | mean_salary |
|---|---|---|---|---|---|---|
| Female | 4543 | 650.83 | 39.24 | 4.97 | 75659.37 | 100601.54 |
| Male | 5457 | 650.28 | 38.66 | 5.05 | 77173.97 | 99664.58 |
As for the overview information about the pool of clients as hand, there is almost equal proportions of male anf female customers in the dataset. The mean age of clients is approximately 39, whereas women are a little bit older than men on average. Tenure time for both gender groups is about 5 months on average. As for the amount of money currently being stored in the bank, males keep a bigger sums on average, meanwhile the average salary is greater for females for this particular dataset.
ggplot(credit,aes(x = Geography, group = Gender)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count", color = "black", alpha = 0.7) +
geom_text(aes(label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", position = position_stack(vjust = 0.5), size = 4.5) +
facet_grid(~Gender) +
scale_y_continuous(labels = scales::percent)+
guides(fill = F)+
labs(title = "Gender and geographical composition
",
x = "", y = "Percentage")+
theme_minimal()+
scale_fill_brewer(palette="Set2")+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 12))
As it can be seen from the graph, geographical distribution of customers for males and females is very similar. Approximately equal amounts of customers come from all the 3 different countries: about a half comes from France, one quarter comes from Germany and another quarter - from Spain. This holds true for both males and females.
salary = credit %>% group_by(Gender, Geography) %>% summarise(mean_sal = round(mean(EstimatedSalary),2))
ggplot(salary,aes(x = Gender, y = mean_sal, fill = Gender)) +
geom_bar(stat = "identity",color = "black", alpha = 0.7) +
geom_text(aes(label = mean_sal, vjust = -.5), size = 4)+
facet_grid(~ Geography) +
labs(title = "Gender and geographical composition of salaries
",
x = "", y = "Mean salery")+
theme_minimal()+
coord_cartesian(ylim=c(95000,103000))+
scale_y_continuous(breaks = seq(95000,103000, 1000))+
guides(fill = F)+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 12))
Another interesting thing to explore is geographical distribution of salaries among males and females. As it can be noticed, the mean salary in Germany is greater than in 2 other countries, with female`s average salary being the biggest one among both males and females. There is also a greater mean salary for females in Spain, whereas in France the situation is the opposite.
a <- ggplot(credit, aes(x = Exited, y = ..count../sum(..count..), fill = Exited)) +
geom_bar(color = "black", alpha = 0.7) +
geom_text(aes(label = percent(..count../sum(..count..))),
size = 4, stat= "count", position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = percent) +
labs(title = "Churn proportions", y = "Percentage") +
scale_x_discrete(name = "Did the client churn?",
breaks=c("0", "1"),
labels=c("No", "Yes"))+
theme_minimal() +
scale_fill_brewer(palette="Set2")+
guides(fill = F)+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
b <- ggplot(credit,aes(x = Exited, group = Gender)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)),
stat="count", color = "black", alpha = 0.7) +
geom_text(aes(label = scales::percent(..prop..),
y= ..prop.. ), stat= "count", position = position_stack(vjust = 0.5), size = 3) +
facet_grid(Gender ~ Geography) +
scale_y_continuous(labels = scales::percent)+
scale_x_discrete(name = "Did the client churn?",
breaks=c("0", "1"),
labels=c("No", "Yes"))+
guides(fill = F)+
labs(title = "Churn proportions by gender and geographical location
", y = "Percentage")+
theme_minimal()+
scale_fill_brewer(palette="Set2")+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
grid.arrange(a,b, ncol = 2)
This small dashboard indicates the most important information about churn rates among the customers. It can be said that every 5th client churned for this particular dataset. Additionally, it can be noticed that the churn rates in Germany are the highest among other countires, which holds true both for males and females. The least amount of churners can be allocated to France, whereas Spain occupies intermediate position.
duration = credit %>% group_by(Exited) %>% summarise(mean_time = round(mean(Tenure),2))
c<- ggplot(duration,aes(x = Exited, y = mean_time, fill = Exited)) +
geom_bar(stat = "identity",color = "black", alpha = 0.7) +
geom_text(aes(label = mean_time, vjust = -.5), size = 4)+
scale_x_discrete(name = "Did the client churn?",
breaks=c("0", "1"),
labels=c("No", "Yes"))+
theme_minimal() +
scale_fill_brewer(palette="Set2")+
coord_cartesian(ylim=c(4.7,5.1))+
scale_y_continuous(breaks = seq(4.8,5.1, 0.1))+
guides(fill = F)+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8)) +
labs(title = "Tenure by customer status
", x = "", y = "Mean tenure")
duration2 = credit %>% group_by(Exited, Gender, Geography) %>% summarise(mean_time = round(mean(Tenure),2))
d <-ggplot(duration2,aes(x = Exited, y = mean_time, group = Gender, fill = Exited)) +
geom_bar(stat = "identity",color = "black", alpha = 0.7) +
geom_text(aes(label = mean_time, vjust = -.5), size = 4)+
scale_x_discrete(name = "Did the client churn?",
breaks=c("0", "1"),
labels=c("No", "Yes"))+
facet_grid(Gender ~ Geography)+
theme_minimal() +
scale_fill_brewer(palette="Set2")+
coord_cartesian(ylim=c(4.5,5.2))+
scale_y_continuous(breaks = seq(4.5,5.2, 0.1))+
guides(fill = F)+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8)) +
labs(title = "Tenure by customer status, gender and geographical location
", x = "", y = "Mean tenure")
grid.arrange(c,d,ncol = 2)
This small dashboard indicates the most important information about tenure time. As we can see, non-churners last a little bit longer as clients of the bank compared to churners, however the difference is not that significant. Additionally, apart from the general trend, it can be outlined that for females in Germany and males in France - the average tenure time is greater for churners rather than for non-churners. The shortest tenure time can be observed for churners in Spain, whereas the longest - for France.
Building a preliminary model:
modeling_data = credit %>% dplyr::select(-RowNumber, -CustomerId, -Surname)
modeling_data$IsActiveMember = ifelse(modeling_data$IsActiveMember == "1", "yes", "no")
modeling_data$IsActiveMember = as.factor(modeling_data$IsActiveMember)
my_log_model <- glm(Exited ~., family = binomial, data = modeling_data)
summary(my_log_model)
##
## Call:
## glm(formula = Exited ~ ., family = binomial, data = modeling_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5188 -0.5867 -0.3611 -0.1787 3.2691
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.804e+00 2.519e-01 -11.133 <2e-16 ***
## CreditScore -6.942e-04 3.033e-04 -2.289 0.0221 *
## GeographyGermany 9.518e-01 7.293e-02 13.051 <2e-16 ***
## GeographySpain 6.004e-02 7.604e-02 0.790 0.4297
## GenderMale -5.226e-01 5.897e-02 -8.861 <2e-16 ***
## Age 7.126e-02 2.761e-03 25.808 <2e-16 ***
## Tenure -1.959e-02 1.010e-02 -1.940 0.0524 .
## Balance -7.027e-07 5.697e-07 -1.233 0.2174
## NumOfProducts2 -1.547e+00 7.122e-02 -21.716 <2e-16 ***
## NumOfProducts3 2.568e+00 1.798e-01 14.278 <2e-16 ***
## NumOfProducts4 1.633e+01 1.754e+02 0.093 0.9258
## HasCrCard1 -6.165e-02 6.416e-02 -0.961 0.3366
## IsActiveMemberyes -1.106e+00 6.235e-02 -17.736 <2e-16 ***
## EstimatedSalary 4.393e-07 5.135e-07 0.856 0.3923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 7428.6 on 9986 degrees of freedom
## AIC: 7456.6
##
## Number of Fisher Scoring iterations: 14
From the summary table it can be seen that the most statistically significant variables are:
Further I will give an interpretation to these statistically significant coefficients.
Exponentiating coefficient to give interpretation:
coefs <- coef(my_log_model) %>% exp() %>% round(2)
coefs
## (Intercept) CreditScore GeographyGermany GeographySpain
## 0.06 1.00 2.59 1.06
## GenderMale Age Tenure Balance
## 0.59 1.07 0.98 1.00
## NumOfProducts2 NumOfProducts3 NumOfProducts4 HasCrCard1
## 0.21 13.04 12404416.37 0.94
## IsActiveMemberyes EstimatedSalary
## 0.33 1.00
Interpretation table:
Coefficient <- c("GeographyGermany","GenderMale", "Age", "NumOfProducts2", "NumOfProducts3","IsActiveMemberyes")
Interpretation <- c("Given that the country of residence for a client is Germany, the hazard to churn increases by a factor of 2.59 or by 159 % for German residents compared to French residents. So to say, Germans are more likely to decide to churn in contrast to French and Spanish clients.", "Given that the client is male, the risk of churning decreases by a factor of 0.59 or by 41% compared to female clients. So to say, men are 41% more likely to stick to one bank than women.", "A one year increase in age of a client increases the hazard to churn by a factor of 1.07 or by 7%.","Given that the client has 2 accounts/ bank affiliated products in the bank, the hazard to churn decreases by a factor of 0.21 or by 79%. So to say, people who have, for example, 2 bank accounts are 79% more likely to stay.","Interestingly, but given that the client has 3 accounts/ bank affiliated products in the bank, the hazard to churn increases by a factor of 13.04 or 1200%. So to say, if a person has more than 2 accounts/ services used, he or she is extremely likely to churn.", "If a person is an active member of a bank system, the hazard to churn decreases by a factor of 0.33 or by 77%. So to say, active bank cliens are 77% more likely to stay.")
Coef_tab <- data.frame(Coefficient, Interpretation)
formattable(Coef_tab,
align =c("l","l"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
| Coefficient | Interpretation |
|---|---|
| GeographyGermany | Given that the country of residence for a client is Germany, the hazard to churn increases by a factor of 2.59 or by 159 % for German residents compared to French residents. So to say, Germans are more likely to decide to churn in contrast to French and Spanish clients. |
| GenderMale | Given that the client is male, the risk of churning decreases by a factor of 0.59 or by 41% compared to female clients. So to say, men are 41% more likely to stick to one bank than women. |
| Age | A one year increase in age of a client increases the hazard to churn by a factor of 1.07 or by 7%. |
| NumOfProducts2 | Given that the client has 2 accounts/ bank affiliated products in the bank, the hazard to churn decreases by a factor of 0.21 or by 79%. So to say, people who have, for example, 2 bank accounts are 79% more likely to stay. |
| NumOfProducts3 | Interestingly, but given that the client has 3 accounts/ bank affiliated products in the bank, the hazard to churn increases by a factor of 13.04 or 1200%. So to say, if a person has more than 2 accounts/ services used, he or she is extremely likely to churn. |
| IsActiveMemberyes | If a person is an active member of a bank system, the hazard to churn decreases by a factor of 0.33 or by 77%. So to say, active bank cliens are 77% more likely to stay. |
Since it seemed strange to me to give an interpretation of coefficients only on a part of data (only using model built on the training or test subset), I will proceed with creating data partition, model building (again) and assessing model quality.
Creating train and test subsets:
modeling_data$Exited = as.factor(modeling_data$Exited)
modeling_data$churn = ifelse(modeling_data$Exited == 1,"Churns","Stays") #for clarity
modeling_data$churn = as.factor(modeling_data$churn)
modeling_data = modeling_data %>% dplyr::select(-Exited) #removing the column with numbers, otherwise the pridiction is obvious
set.seed(1)
test.ind = createDataPartition(modeling_data$churn, p = 0.2, list = FALSE) #classic 80/20 train-test partition
credit.test = modeling_data[test.ind,]
credit.train = modeling_data[-test.ind,]
Building a model on train data, predicting on test data, presenting confusion matrix:
model.train = train(churn ~., data = credit.train, method = "glm", family = binomial(link = "logit"))
predTrain = predict(model.train, credit.train)
predTest = predict(model.train, credit.test)
For train subset:
confusionMatrix(predTrain, credit.train$churn, positive = "Churns")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churns Stays
## Churns 619 251
## Stays 1010 6119
##
## Accuracy : 0.8424
## 95% CI : (0.8342, 0.8503)
## No Information Rate : 0.7963
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.412
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.37999
## Specificity : 0.96060
## Pos Pred Value : 0.71149
## Neg Pred Value : 0.85833
## Prevalence : 0.20365
## Detection Rate : 0.07738
## Detection Prevalence : 0.10876
## Balanced Accuracy : 0.67029
##
## 'Positive' Class : Churns
##
Model accuracy is 84%, which is pretty good. However, the sensitivity parameter of the model is only about 37 %, which means that there is a high probability for type II error (False Negative), meaning that the model is more likely to missclassify churners as those who would not churn (even though in reality they churned).
For test subset:
confusionMatrix(predTest, credit.test$churn, positive = "Churns")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churns Stays
## Churns 156 81
## Stays 252 1512
##
## Accuracy : 0.8336
## 95% CI : (0.8165, 0.8497)
## No Information Rate : 0.7961
## P-Value [Acc > NIR] : 1.165e-05
##
## Kappa : 0.3927
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.38235
## Specificity : 0.94915
## Pos Pred Value : 0.65823
## Neg Pred Value : 0.85714
## Prevalence : 0.20390
## Detection Rate : 0.07796
## Detection Prevalence : 0.11844
## Balanced Accuracy : 0.66575
##
## 'Positive' Class : Churns
##
As can be seen from the table that the accuracy of the model on the test subset (Accuracy = 0.83) is not that different compared to the model on train subset (Accuracy = 0.84), which means that the model was trained properly (no overfitting) and performs almost as equally good as on the training set, which is great.
model = glm(churn~., family = binomial, data = credit.train)
pred = predict(model, newdata = credit.test, type = "response")
ROC = roc(response = credit.test$churn, predictor = pred)
ggplot() + geom_path(aes(y=ROC$sensitivities, x=1-ROC$specificities))+
labs(x = "FPR", y = "TPR", title="ROC curve")+
theme_minimal()
pROC::auc(ROC)
## Area under the curve: 0.8353
The ROC and AUC metrics confirm the result from confusion matrix.
Building one more model but using cross-validation along the way. Looking at the accuracy of such a model:
train_control <- trainControl(method="cv", number=10) #10-fold cross validation
model_cv <- caret::train(churn~., data=credit.train, trControl=train_control, method = "glm", family = binomial(link = "logit"))
print(model_cv)
## Generalized Linear Model
##
## 7999 samples
## 10 predictors
## 2 classes: 'Churns', 'Stays'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 7199, 7199, 7200, 7199, 7199, 7199, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8408559 0.4065872
The accuracy of the model using 10-fold cross-validation is almost the same as the initial accuracy. Let`s look at the accuracy of the prediction on the test subset:
predTest.cv <- predict(model_cv, credit.test)
cmTest.cv = confusionMatrix(predTest.cv, credit.test$churn)
cmTest.cv
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churns Stays
## Churns 156 81
## Stays 252 1512
##
## Accuracy : 0.8336
## 95% CI : (0.8165, 0.8497)
## No Information Rate : 0.7961
## P-Value [Acc > NIR] : 1.165e-05
##
## Kappa : 0.3927
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.38235
## Specificity : 0.94915
## Pos Pred Value : 0.65823
## Neg Pred Value : 0.85714
## Prevalence : 0.20390
## Detection Rate : 0.07796
## Detection Prevalence : 0.11844
## Balanced Accuracy : 0.66575
##
## 'Positive' Class : Churns
##
The accuracy of the model on the test subset, where the 10-fold cross validation is used, is really close to the accuracy of the initial model. Therefore, it can be concluded that the initial division (train/test) was properly done. Which is expectable, as I used createDataPartition function, which takes into account the initial distribution of variables in order to create samples with similar to initial data distributions.
Simple variable importance analysis: by the absolute value of z-value in regression model:
importance <- varImp(model.train, scale=FALSE)
plot(importance)
As it can be seen from the plot, the top 5 most important variables for churn prediction appeared to be:
Relative importance, e.g. all other predictors compared to Age:
plot(varImp(model.train))
tree <- ctree(churn~., data = credit.train)
treePredTrain <- predict(tree, credit.train, type = "response")
confusionMatrix(treePredTrain,credit.train$churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churns Stays
## Churns 739 181
## Stays 890 6189
##
## Accuracy : 0.8661
## 95% CI : (0.8584, 0.8735)
## No Information Rate : 0.7963
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5074
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.45365
## Specificity : 0.97159
## Pos Pred Value : 0.80326
## Neg Pred Value : 0.87428
## Prevalence : 0.20365
## Detection Rate : 0.09239
## Detection Prevalence : 0.11501
## Balanced Accuracy : 0.71262
##
## 'Positive' Class : Churns
##
The accuracy of the model on training sudset is 86%, which is a little better than logistic regression performance. It is also can be noticed that sensitivity parameter is better (higher) in case of a tree compared to logistic reggression. The Kappa is also way better than for regression.
Let`s look at test subset performance:
treePredTest <- predict(tree, credit.test, type = "response")
confusionMatrix(treePredTest,credit.test$churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churns Stays
## Churns 183 71
## Stays 225 1522
##
## Accuracy : 0.8521
## 95% CI : (0.8358, 0.8674)
## No Information Rate : 0.7961
## P-Value [Acc > NIR] : 6.684e-11
##
## Kappa : 0.4699
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.44853
## Specificity : 0.95543
## Pos Pred Value : 0.72047
## Neg Pred Value : 0.87121
## Prevalence : 0.20390
## Detection Rate : 0.09145
## Detection Prevalence : 0.12694
## Balanced Accuracy : 0.70198
##
## 'Positive' Class : Churns
##
We can see that the accuracy on the test set is 85% for decision tree (compared to 83% for regression model). Sensitivity and Kappa parameters are also higher.
train_control_tree <- trainControl(method="cv", number=10)
model_tree <- caret::train(churn~., data=credit.train, trControl=train_control_tree, method="ctree")
print(model_tree)
## Conditional Inference Tree
##
## 7999 samples
## 10 predictors
## 2 classes: 'Churns', 'Stays'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 7199, 7199, 7199, 7199, 7199, 7200, ...
## Resampling results across tuning parameters:
##
## mincriterion Accuracy Kappa
## 0.01 0.8359817 0.4432486
## 0.50 0.8519831 0.4651987
## 0.99 0.8551084 0.4658944
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.99.
Predictions on the test subset:
predTest_tree <- predict(model_tree, credit.test)
tree_cv = confusionMatrix(predTest_tree, credit.test$churn)
tree_cv
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churns Stays
## Churns 173 67
## Stays 235 1526
##
## Accuracy : 0.8491
## 95% CI : (0.8326, 0.8645)
## No Information Rate : 0.7961
## P-Value [Acc > NIR] : 6.687e-10
##
## Kappa : 0.451
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.42402
## Specificity : 0.95794
## Pos Pred Value : 0.72083
## Neg Pred Value : 0.86655
## Prevalence : 0.20390
## Detection Rate : 0.08646
## Detection Prevalence : 0.11994
## Balanced Accuracy : 0.69098
##
## 'Positive' Class : Churns
##
The accuracy of prediction with a decision tree on the test set is still a little bit better (~85%) than the same accuracy for logistic regression (~83%).
In order to assess the variable importance in this case, I will create random forest.
Accuracy of Random Forest model on test subset (the same as for ordinary tree):
library(randomForest)
set.seed(4)
model.rf = randomForest(churn ~., data=credit.train, mtry=5, importance=TRUE)
predTrain.rf = predict(model.rf, credit.train)
predTest.rf = predict(model.rf, credit.test)
confusionMatrix(predTest.rf, credit.test$churn,
positive = "Churns", mode = "prec_recall")$overall["Accuracy"]
## Accuracy
## 0.8570715
Variable importance for the RF model (scaled):
library(vip)
vip(model.rf, scale = T) + labs(y = "Relative importance", title = "Relative importance of predictors in Random Forest model")+
theme_minimal()
As it can be seen, the top 5 most important variables for RF is close to top 5 of a logistic regression. However, variable Balance (how much money is currently beeing kept on a client`s account) appeared to be quiet important for the RF model compared to LR model. As for the rest, the number of bank accounts, age, activeness and geographic location of a clien still constitute a set of variables of high importance.
In other words, these variables are the most useful ones in predicting customers` churning behavior in both models:
As a result, a decision tree (or Random Forest) model would be chosen to work with this particular dataset, as it showed a better performance (higher accuracy) on the training set.
library(sjPlot)
sjp.xtab(modeling_data$Tenure, modeling_data$churn, margin = "row", bar.pos = "stack",
show.summary = F, coord.flip = T,
axis.titles = "Churn proportions", legend.title = "Did the customer churn?",
geom.colors = c("#f3c6f3","lightblue"))+
theme_minimal()+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 12))+
labs(title = "Churn proportions by tenure
", x = "Tetunre, months", y = "Percentage, %")
credit$Exited = as.numeric(credit$Exited)
library(survival)
fitKM <- survfit(Surv(credit$Tenure, credit$Exited) ~ 1,
type = "kaplan-meier")
print(fitKM)
## Call: survfit(formula = Surv(credit$Tenure, credit$Exited) ~ 1, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## 10000 2037 10 10 NA
From the output it can be seen that among 10000 customers, about ~ 2000 churned under the time of the observation. The median survival time is 10 months, which means that about 50 % of the customers do not churn before they reach a tenure duration of 10 months.
Important predictors as covariates:
fitKMstr1 <- survfit(Surv(Tenure, Exited) ~ Geography,
data = credit)
print(fitKMstr1)
## Call: survfit(formula = Surv(Tenure, Exited) ~ Geography, data = credit)
##
## n events median 0.95LCL 0.95UCL
## Geography=France 5014 810 NA NA NA
## Geography=Germany 2509 814 9 9 9
## Geography=Spain 2477 413 NA NA NA
As it can be seen, about a half of customers from Germany do not churn until 9 months of tenure duration. However, customers from France and Spain have a higher median survival time which is bigger than 10 months (that is why there are NA).
fitKMstr2 <- survfit(Surv(Tenure, Exited) ~ IsActiveMember,
data = credit)
print(fitKMstr2)
## Call: survfit(formula = Surv(Tenure, Exited) ~ IsActiveMember, data = credit)
##
## n events median 0.95LCL 0.95UCL
## IsActiveMember=0 4849 1302 10 9 10
## IsActiveMember=1 5151 735 NA NA NA
Similarly, it can be seen that among customers who are not active members of banking system median survival time is 10 months, whereas for those who are active members - median survival time is greater than 10 (than 10, which is median survival time in the initial model with no covariates used).
All in all, it can be concluded that for predicting churn for this particular dataset it is better to use decision tree (Randome Forest), rather than logistic regression, judging by the model performance for both methods. However, the difference in accuracy is not really significant (1-3%) even with the application of cross-validation procedure. As for the most important predictors for churn, there can be named the following: age, country of residence, activeness as a banking system user and the number of bank accounts or bank affiliated products. In some cases, the current balance of the client can help to predict churn as well. Speaking about the tenure time, about 50 % of the customers presumably are not going to churn withing 10 months. Still, if taking into consideration covariates, being a German citizen increases the probability to churn, lowering the survival time to only 9 months. However, being a citizen of another country or/and being an active member of banking system decreases the probability to churn, yielding 10+ months within which customers are not expected to quit.