Business Analytics: Project 1

Introduction

In this project I will analyze the credit data in order to identify the most important factors for predicting customer churn as well as estimating time period before churning.

First, I will do a short exploratory analysis to get acquainted with the data at hand. Some features of gender composition and geographical location will be presented in this inroductory part.

Second, the information about churners will be presented into more detail.

Third, I will use logistic regression to predict churn. Along the way, the most important factors for predicting churn will be outlined.

Fourth, a decision tree model will be built as an alternative solution for the task. Both models` performances will be compared.

Finally, I will build a model to predict time to churn.

Data overview

Overall statistics:

cust_stat <- credit %>% group_by(Gender) %>% summarise(num_customers = n(), mean_crScore = round(mean(CreditScore),2), mean_age = round(mean(Age),2), mean_tenure = round(mean(Tenure),2), mean_balance = round(mean(Balance),2), mean_salary = round(mean(EstimatedSalary),2))

formattable(cust_stat, 
            align =c("l","c","c","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Gender	num_customers	mean_crScore	mean_age	mean_tenure	mean_balance	mean_salary
Female	4543	650.83	39.24	4.97	75659.37	100601.54
Male	5457	650.28	38.66	5.05	77173.97	99664.58

As for the overview information about the pool of clients as hand, there is almost equal proportions of male anf female customers in the dataset. The mean age of clients is approximately 39, whereas women are a little bit older than men on average. Tenure time for both gender groups is about 5 months on average. As for the amount of money currently being stored in the bank, males keep a bigger sums on average, meanwhile the average salary is greater for females for this particular dataset.

ggplot(credit,aes(x = Geography, group = Gender)) + 
    geom_bar(aes(y = ..prop.., fill = factor(..x..)), 
    stat="count", color = "black", alpha = 0.7) +
    geom_text(aes(label = scales::percent(..prop..),
    y= ..prop.. ), stat= "count", position = position_stack(vjust = 0.5), size = 4.5) +
    facet_grid(~Gender) +
    scale_y_continuous(labels = scales::percent)+
    guides(fill = F)+
    labs(title = "Gender and geographical composition
         ", 
    x = "", y = "Percentage")+
    theme_minimal()+
    scale_fill_brewer(palette="Set2")+
    theme(text = element_text(size = 12), 
    plot.title = element_text(hjust = 0.5), title = element_text(size = 12))

As it can be seen from the graph, geographical distribution of customers for males and females is very similar. Approximately equal amounts of customers come from all the 3 different countries: about a half comes from France, one quarter comes from Germany and another quarter - from Spain. This holds true for both males and females.

salary = credit %>% group_by(Gender, Geography) %>% summarise(mean_sal = round(mean(EstimatedSalary),2))


ggplot(salary,aes(x = Gender, y = mean_sal, fill = Gender)) + 
    geom_bar(stat = "identity",color = "black", alpha = 0.7) +
    geom_text(aes(label = mean_sal, vjust = -.5), size = 4)+
    facet_grid(~ Geography) +
    labs(title = "Gender and geographical composition of salaries
         ", 
    x = "", y = "Mean salery")+
    theme_minimal()+
    coord_cartesian(ylim=c(95000,103000))+
    scale_y_continuous(breaks = seq(95000,103000, 1000))+
    guides(fill = F)+
    theme(text = element_text(size = 12), 
    plot.title = element_text(hjust = 0.5), title = element_text(size = 12))

Another interesting thing to explore is geographical distribution of salaries among males and females. As it can be noticed, the mean salary in Germany is greater than in 2 other countries, with female`s average salary being the biggest one among both males and females. There is also a greater mean salary for females in Spain, whereas in France the situation is the opposite.

How many customers churned at all?

a <- ggplot(credit, aes(x = Exited, y = ..count../sum(..count..), fill = Exited)) + 
          geom_bar(color = "black", alpha = 0.7) +
          geom_text(aes(label = percent(..count../sum(..count..))), 
          size = 4, stat= "count", position = position_stack(vjust = 0.5)) +
          scale_y_continuous(labels = percent) +
          labs(title = "Churn proportions", y  = "Percentage") +
          scale_x_discrete(name  = "Did the client churn?",
                          breaks=c("0", "1"),
                          labels=c("No", "Yes"))+
          theme_minimal() +
          scale_fill_brewer(palette="Set2")+
          guides(fill = F)+
          theme(text = element_text(size = 12), 
          plot.title = element_text(hjust = 0.5), title = element_text(size = 8))

b <- ggplot(credit,aes(x = Exited, group = Gender)) + 
    geom_bar(aes(y = ..prop.., fill = factor(..x..)), 
    stat="count", color = "black", alpha = 0.7) +
    geom_text(aes(label = scales::percent(..prop..),
    y= ..prop.. ), stat= "count", position = position_stack(vjust = 0.5), size = 3) +
    facet_grid(Gender ~ Geography) +
    scale_y_continuous(labels = scales::percent)+
    scale_x_discrete(name  = "Did the client churn?",
                          breaks=c("0", "1"),
                          labels=c("No", "Yes"))+
    guides(fill = F)+
    labs(title = "Churn proportions by gender and geographical location
         ", y = "Percentage")+
    theme_minimal()+
    scale_fill_brewer(palette="Set2")+
    theme(text = element_text(size = 12), 
    plot.title = element_text(hjust = 0.5), title = element_text(size = 8))

grid.arrange(a,b, ncol = 2)

This small dashboard indicates the most important information about churn rates among the customers. It can be said that every 5th client churned for this particular dataset. Additionally, it can be noticed that the churn rates in Germany are the highest among other countires, which holds true both for males and females. The least amount of churners can be allocated to France, whereas Spain occupies intermediate position.

How long on average churners and non-churners last as a bank customers?

duration = credit %>% group_by(Exited) %>% summarise(mean_time = round(mean(Tenure),2))


c<- ggplot(duration,aes(x = Exited, y = mean_time, fill = Exited)) + 
    geom_bar(stat = "identity",color = "black", alpha = 0.7) +
    geom_text(aes(label = mean_time, vjust = -.5), size = 4)+
    scale_x_discrete(name  = "Did the client churn?",
                          breaks=c("0", "1"),
                          labels=c("No", "Yes"))+
    theme_minimal() +
    scale_fill_brewer(palette="Set2")+
    coord_cartesian(ylim=c(4.7,5.1))+
    scale_y_continuous(breaks = seq(4.8,5.1, 0.1))+
    guides(fill = F)+
    theme(text = element_text(size = 12), 
          plot.title = element_text(hjust = 0.5), title = element_text(size = 8)) +
    labs(title = "Tenure by customer status
         ", x = "", y = "Mean tenure")


duration2 = credit %>% group_by(Exited, Gender, Geography) %>% summarise(mean_time = round(mean(Tenure),2))


d <-ggplot(duration2,aes(x = Exited, y = mean_time, group = Gender, fill = Exited)) + 
    geom_bar(stat = "identity",color = "black", alpha = 0.7) +
    geom_text(aes(label = mean_time, vjust = -.5), size = 4)+
    scale_x_discrete(name  = "Did the client churn?",
                          breaks=c("0", "1"),
                          labels=c("No", "Yes"))+
    facet_grid(Gender ~ Geography)+
    theme_minimal() +
    scale_fill_brewer(palette="Set2")+
    coord_cartesian(ylim=c(4.5,5.2))+
    scale_y_continuous(breaks = seq(4.5,5.2, 0.1))+
    guides(fill = F)+
    theme(text = element_text(size = 12), 
          plot.title = element_text(hjust = 0.5), title = element_text(size = 8)) +
    labs(title = "Tenure by customer status, gender and geographical location
         ", x = "", y = "Mean tenure")

grid.arrange(c,d,ncol = 2)

This small dashboard indicates the most important information about tenure time. As we can see, non-churners last a little bit longer as clients of the bank compared to churners, however the difference is not that significant. Additionally, apart from the general trend, it can be outlined that for females in Germany and males in France - the average tenure time is greater for churners rather than for non-churners. The shortest tenure time can be observed for churners in Spain, whereas the longest - for France.

Binary Logistic regression for churn prediction

Building a preliminary model:

modeling_data = credit %>% dplyr::select(-RowNumber, -CustomerId, -Surname)
modeling_data$IsActiveMember = ifelse(modeling_data$IsActiveMember == "1", "yes", "no")
modeling_data$IsActiveMember = as.factor(modeling_data$IsActiveMember)

my_log_model <- glm(Exited ~., family = binomial, data = modeling_data)
summary(my_log_model)

## 
## Call:
## glm(formula = Exited ~ ., family = binomial, data = modeling_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5188  -0.5867  -0.3611  -0.1787   3.2691  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.804e+00  2.519e-01 -11.133   <2e-16 ***
## CreditScore       -6.942e-04  3.033e-04  -2.289   0.0221 *  
## GeographyGermany   9.518e-01  7.293e-02  13.051   <2e-16 ***
## GeographySpain     6.004e-02  7.604e-02   0.790   0.4297    
## GenderMale        -5.226e-01  5.897e-02  -8.861   <2e-16 ***
## Age                7.126e-02  2.761e-03  25.808   <2e-16 ***
## Tenure            -1.959e-02  1.010e-02  -1.940   0.0524 .  
## Balance           -7.027e-07  5.697e-07  -1.233   0.2174    
## NumOfProducts2    -1.547e+00  7.122e-02 -21.716   <2e-16 ***
## NumOfProducts3     2.568e+00  1.798e-01  14.278   <2e-16 ***
## NumOfProducts4     1.633e+01  1.754e+02   0.093   0.9258    
## HasCrCard1        -6.165e-02  6.416e-02  -0.961   0.3366    
## IsActiveMemberyes -1.106e+00  6.235e-02 -17.736   <2e-16 ***
## EstimatedSalary    4.393e-07  5.135e-07   0.856   0.3923    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  7428.6  on 9986  degrees of freedom
## AIC: 7456.6
## 
## Number of Fisher Scoring iterations: 14

From the summary table it can be seen that the most statistically significant variables are:

Geography (Germany compared to France),
Gender (Male compared to Female),
Age,
Number of bank accounts/products (2 and 3 compared to 1) and
Wether a client is an active member (yes compared to no).

Further I will give an interpretation to these statistically significant coefficients.

Interpretation of model coefficients

Exponentiating coefficient to give interpretation:

coefs <- coef(my_log_model) %>% exp() %>% round(2)
coefs

##       (Intercept)       CreditScore  GeographyGermany    GeographySpain 
##              0.06              1.00              2.59              1.06 
##        GenderMale               Age            Tenure           Balance 
##              0.59              1.07              0.98              1.00 
##    NumOfProducts2    NumOfProducts3    NumOfProducts4        HasCrCard1 
##              0.21             13.04       12404416.37              0.94 
## IsActiveMemberyes   EstimatedSalary 
##              0.33              1.00

Interpretation table:

Coefficient <- c("GeographyGermany","GenderMale", "Age", "NumOfProducts2", "NumOfProducts3","IsActiveMemberyes")

Interpretation <- c("Given that the country of residence for a client is Germany, the hazard to churn increases by a factor of 2.59 or by 159 % for German residents compared to French residents. So to say, Germans are more likely to decide to churn in contrast to French and Spanish clients.", "Given that the client is male, the risk of churning decreases by a factor of 0.59 or by 41% compared to female clients. So to say, men are 41% more likely to stick to one bank than women.", "A one year increase in age of a client increases the hazard to churn by a factor of 1.07 or by 7%.","Given that the client has 2 accounts/ bank affiliated products in the bank, the hazard to churn decreases by a factor of 0.21 or by 79%. So to say, people who have, for example, 2 bank accounts are 79% more likely to stay.","Interestingly, but given that the client has 3 accounts/ bank affiliated products in the bank, the hazard to churn increases by a factor of 13.04 or 1200%. So to say, if a person has more than 2 accounts/ services used, he or she is extremely likely to churn.", "If a person is an active member of a bank system, the hazard to churn decreases by a factor of 0.33 or by 77%. So to say, active bank cliens are 77% more likely to stay.")

Coef_tab <- data.frame(Coefficient, Interpretation)

formattable(Coef_tab, 
            align =c("l","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Coefficient	Interpretation
GeographyGermany	Given that the country of residence for a client is Germany, the hazard to churn increases by a factor of 2.59 or by 159 % for German residents compared to French residents. So to say, Germans are more likely to decide to churn in contrast to French and Spanish clients.
GenderMale	Given that the client is male, the risk of churning decreases by a factor of 0.59 or by 41% compared to female clients. So to say, men are 41% more likely to stick to one bank than women.
Age	A one year increase in age of a client increases the hazard to churn by a factor of 1.07 or by 7%.
NumOfProducts2	Given that the client has 2 accounts/ bank affiliated products in the bank, the hazard to churn decreases by a factor of 0.21 or by 79%. So to say, people who have, for example, 2 bank accounts are 79% more likely to stay.
NumOfProducts3	Interestingly, but given that the client has 3 accounts/ bank affiliated products in the bank, the hazard to churn increases by a factor of 13.04 or 1200%. So to say, if a person has more than 2 accounts/ services used, he or she is extremely likely to churn.
IsActiveMemberyes	If a person is an active member of a bank system, the hazard to churn decreases by a factor of 0.33 or by 77%. So to say, active bank cliens are 77% more likely to stay.

Since it seemed strange to me to give an interpretation of coefficients only on a part of data (only using model built on the training or test subset), I will proceed with creating data partition, model building (again) and assessing model quality.

Data partition, modelling again

Creating train and test subsets:

modeling_data$Exited = as.factor(modeling_data$Exited)
modeling_data$churn = ifelse(modeling_data$Exited == 1,"Churns","Stays") #for clarity
modeling_data$churn = as.factor(modeling_data$churn)
modeling_data = modeling_data %>% dplyr::select(-Exited) #removing the column with numbers, otherwise the pridiction is obvious

set.seed(1)
test.ind = createDataPartition(modeling_data$churn, p = 0.2, list = FALSE) #classic 80/20 train-test partition
credit.test = modeling_data[test.ind,]
credit.train = modeling_data[-test.ind,]

Building a model on train data, predicting on test data, presenting confusion matrix:

model.train = train(churn ~., data = credit.train, method = "glm", family = binomial(link = "logit"))

predTrain = predict(model.train, credit.train)
predTest = predict(model.train, credit.test)

Confusion Matrices

For train subset:

confusionMatrix(predTrain, credit.train$churn, positive = "Churns")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churns Stays
##     Churns    619   251
##     Stays    1010  6119
##                                           
##                Accuracy : 0.8424          
##                  95% CI : (0.8342, 0.8503)
##     No Information Rate : 0.7963          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.412           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.37999         
##             Specificity : 0.96060         
##          Pos Pred Value : 0.71149         
##          Neg Pred Value : 0.85833         
##              Prevalence : 0.20365         
##          Detection Rate : 0.07738         
##    Detection Prevalence : 0.10876         
##       Balanced Accuracy : 0.67029         
##                                           
##        'Positive' Class : Churns          
##

Model accuracy is 84%, which is pretty good. However, the sensitivity parameter of the model is only about 37 %, which means that there is a high probability for type II error (False Negative), meaning that the model is more likely to missclassify churners as those who would not churn (even though in reality they churned).

For test subset:

confusionMatrix(predTest, credit.test$churn, positive = "Churns")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churns Stays
##     Churns    156    81
##     Stays     252  1512
##                                           
##                Accuracy : 0.8336          
##                  95% CI : (0.8165, 0.8497)
##     No Information Rate : 0.7961          
##     P-Value [Acc > NIR] : 1.165e-05       
##                                           
##                   Kappa : 0.3927          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.38235         
##             Specificity : 0.94915         
##          Pos Pred Value : 0.65823         
##          Neg Pred Value : 0.85714         
##              Prevalence : 0.20390         
##          Detection Rate : 0.07796         
##    Detection Prevalence : 0.11844         
##       Balanced Accuracy : 0.66575         
##                                           
##        'Positive' Class : Churns          
##

As can be seen from the table that the accuracy of the model on the test subset (Accuracy = 0.83) is not that different compared to the model on train subset (Accuracy = 0.84), which means that the model was trained properly (no overfitting) and performs almost as equally good as on the training set, which is great.

ROC curve and AUC for prediction model on test set

model = glm(churn~., family = binomial, data = credit.train)
pred = predict(model, newdata = credit.test, type = "response")
ROC = roc(response = credit.test$churn, predictor = pred)

ggplot() + geom_path(aes(y=ROC$sensitivities, x=1-ROC$specificities))+
  labs(x = "FPR", y = "TPR", title="ROC curve")+
  theme_minimal()

pROC::auc(ROC)

## Area under the curve: 0.8353

The ROC and AUC metrics confirm the result from confusion matrix.

Cross-validation

Building one more model but using cross-validation along the way. Looking at the accuracy of such a model:

train_control <- trainControl(method="cv", number=10) #10-fold cross validation
model_cv <- caret::train(churn~., data=credit.train, trControl=train_control, method = "glm", family = binomial(link = "logit"))
print(model_cv)

## Generalized Linear Model 
## 
## 7999 samples
##   10 predictors
##    2 classes: 'Churns', 'Stays' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 7199, 7199, 7200, 7199, 7199, 7199, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8408559  0.4065872

The accuracy of the model using 10-fold cross-validation is almost the same as the initial accuracy. Let`s look at the accuracy of the prediction on the test subset:

predTest.cv <- predict(model_cv, credit.test)
cmTest.cv = confusionMatrix(predTest.cv, credit.test$churn)
cmTest.cv

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churns Stays
##     Churns    156    81
##     Stays     252  1512
##                                           
##                Accuracy : 0.8336          
##                  95% CI : (0.8165, 0.8497)
##     No Information Rate : 0.7961          
##     P-Value [Acc > NIR] : 1.165e-05       
##                                           
##                   Kappa : 0.3927          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.38235         
##             Specificity : 0.94915         
##          Pos Pred Value : 0.65823         
##          Neg Pred Value : 0.85714         
##              Prevalence : 0.20390         
##          Detection Rate : 0.07796         
##    Detection Prevalence : 0.11844         
##       Balanced Accuracy : 0.66575         
##                                           
##        'Positive' Class : Churns          
##

The accuracy of the model on the test subset, where the 10-fold cross validation is used, is really close to the accuracy of the initial model. Therefore, it can be concluded that the initial division (train/test) was properly done. Which is expectable, as I used createDataPartition function, which takes into account the initial distribution of variables in order to create samples with similar to initial data distributions.

Looking for the most important churn predictors

Simple variable importance analysis: by the absolute value of z-value in regression model:

importance <- varImp(model.train, scale=FALSE)
plot(importance)

As it can be seen from the plot, the top 5 most important variables for churn prediction appeared to be:

age of the client,
the fact of having 2 bank accounts or bank affiliated products,
the fact of being an active member of bank system
the fact of having 3 bank accounts or bank affiliated products,
the fact that a client lives in Germany.

Relative importance, e.g. all other predictors compared to Age:

plot(varImp(model.train))

Predicting churn with a decision tree

tree <- ctree(churn~., data = credit.train)

treePredTrain <- predict(tree, credit.train, type = "response")

confusionMatrix(treePredTrain,credit.train$churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churns Stays
##     Churns    739   181
##     Stays     890  6189
##                                           
##                Accuracy : 0.8661          
##                  95% CI : (0.8584, 0.8735)
##     No Information Rate : 0.7963          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5074          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.45365         
##             Specificity : 0.97159         
##          Pos Pred Value : 0.80326         
##          Neg Pred Value : 0.87428         
##              Prevalence : 0.20365         
##          Detection Rate : 0.09239         
##    Detection Prevalence : 0.11501         
##       Balanced Accuracy : 0.71262         
##                                           
##        'Positive' Class : Churns          
##

The accuracy of the model on training sudset is 86%, which is a little better than logistic regression performance. It is also can be noticed that sensitivity parameter is better (higher) in case of a tree compared to logistic reggression. The Kappa is also way better than for regression.

Let`s look at test subset performance:

treePredTest <- predict(tree, credit.test, type = "response")

confusionMatrix(treePredTest,credit.test$churn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churns Stays
##     Churns    183    71
##     Stays     225  1522
##                                           
##                Accuracy : 0.8521          
##                  95% CI : (0.8358, 0.8674)
##     No Information Rate : 0.7961          
##     P-Value [Acc > NIR] : 6.684e-11       
##                                           
##                   Kappa : 0.4699          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.44853         
##             Specificity : 0.95543         
##          Pos Pred Value : 0.72047         
##          Neg Pred Value : 0.87121         
##              Prevalence : 0.20390         
##          Detection Rate : 0.09145         
##    Detection Prevalence : 0.12694         
##       Balanced Accuracy : 0.70198         
##                                           
##        'Positive' Class : Churns          
##

We can see that the accuracy on the test set is 85% for decision tree (compared to 83% for regression model). Sensitivity and Kappa parameters are also higher.

Cross-validation for decision tree

train_control_tree <- trainControl(method="cv", number=10)
model_tree <- caret::train(churn~., data=credit.train, trControl=train_control_tree, method="ctree")

print(model_tree)

## Conditional Inference Tree 
## 
## 7999 samples
##   10 predictors
##    2 classes: 'Churns', 'Stays' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 7199, 7199, 7199, 7199, 7199, 7200, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy   Kappa    
##   0.01          0.8359817  0.4432486
##   0.50          0.8519831  0.4651987
##   0.99          0.8551084  0.4658944
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.99.

Predictions on the test subset:

predTest_tree <- predict(model_tree, credit.test)
tree_cv = confusionMatrix(predTest_tree, credit.test$churn)
tree_cv

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churns Stays
##     Churns    173    67
##     Stays     235  1526
##                                           
##                Accuracy : 0.8491          
##                  95% CI : (0.8326, 0.8645)
##     No Information Rate : 0.7961          
##     P-Value [Acc > NIR] : 6.687e-10       
##                                           
##                   Kappa : 0.451           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.42402         
##             Specificity : 0.95794         
##          Pos Pred Value : 0.72083         
##          Neg Pred Value : 0.86655         
##              Prevalence : 0.20390         
##          Detection Rate : 0.08646         
##    Detection Prevalence : 0.11994         
##       Balanced Accuracy : 0.69098         
##                                           
##        'Positive' Class : Churns          
##

The accuracy of prediction with a decision tree on the test set is still a little bit better (~85%) than the same accuracy for logistic regression (~83%).

Does the importance of predictors vary in different types of models?

In order to assess the variable importance in this case, I will create random forest.

Accuracy of Random Forest model on test subset (the same as for ordinary tree):

library(randomForest)
set.seed(4)
model.rf = randomForest(churn ~., data=credit.train, mtry=5, importance=TRUE)

predTrain.rf = predict(model.rf, credit.train)
predTest.rf = predict(model.rf, credit.test)

confusionMatrix(predTest.rf, credit.test$churn, 
                positive = "Churns", mode = "prec_recall")$overall["Accuracy"]

##  Accuracy 
## 0.8570715

Variable importance for the RF model (scaled):

library(vip)

vip(model.rf, scale = T) + labs(y = "Relative importance", title = "Relative importance of predictors in Random Forest model")+
  theme_minimal()

As it can be seen, the top 5 most important variables for RF is close to top 5 of a logistic regression. However, variable Balance (how much money is currently beeing kept on a client`s account) appeared to be quiet important for the RF model compared to LR model. As for the rest, the number of bank accounts, age, activeness and geographic location of a clien still constitute a set of variables of high importance.

In other words, these variables are the most useful ones in predicting customers` churning behavior in both models:

Number of bank accounts/bank affiliated products
Age of a client
Country of residence of a client
The amount of money which is currently being stored in the bank
Activeness as a bank system user

As a result, a decision tree (or Random Forest) model would be chosen to work with this particular dataset, as it showed a better performance (higher accuracy) on the training set.

Predicting customer time to churn

library(sjPlot)
sjp.xtab(modeling_data$Tenure, modeling_data$churn, margin = "row", bar.pos = "stack",
show.summary = F, coord.flip = T,
axis.titles = "Churn proportions", legend.title = "Did the customer churn?",
geom.colors = c("#f3c6f3","lightblue"))+
  theme_minimal()+
          theme(text = element_text(size = 12), 
          plot.title = element_text(hjust = 0.5), title = element_text(size = 12))+
  labs(title = "Churn proportions by tenure
       ", x = "Tetunre, months", y = "Percentage, %")

Estimation of the survival time independent of any covariates

credit$Exited = as.numeric(credit$Exited)
library(survival)
fitKM <- survfit(Surv(credit$Tenure, credit$Exited) ~ 1, 
                 type = "kaplan-meier")
print(fitKM)

## Call: survfit(formula = Surv(credit$Tenure, credit$Exited) ~ 1, type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##   10000    2037      10      10      NA

From the output it can be seen that among 10000 customers, about ~ 2000 churned under the time of the observation. The median survival time is 10 months, which means that about 50 % of the customers do not churn before they reach a tenure duration of 10 months.

Estimation of survival time depending on covariates

Important predictors as covariates:

fitKMstr1 <- survfit(Surv(Tenure, Exited) ~ Geography, 
                    data = credit)
print(fitKMstr1)

## Call: survfit(formula = Surv(Tenure, Exited) ~ Geography, data = credit)
## 
##                      n events median 0.95LCL 0.95UCL
## Geography=France  5014    810     NA      NA      NA
## Geography=Germany 2509    814      9       9       9
## Geography=Spain   2477    413     NA      NA      NA

As it can be seen, about a half of customers from Germany do not churn until 9 months of tenure duration. However, customers from France and Spain have a higher median survival time which is bigger than 10 months (that is why there are NA).

fitKMstr2 <- survfit(Surv(Tenure, Exited) ~ IsActiveMember, 
                    data = credit)
print(fitKMstr2)

## Call: survfit(formula = Surv(Tenure, Exited) ~ IsActiveMember, data = credit)
## 
##                     n events median 0.95LCL 0.95UCL
## IsActiveMember=0 4849   1302     10       9      10
## IsActiveMember=1 5151    735     NA      NA      NA

Similarly, it can be seen that among customers who are not active members of banking system median survival time is 10 months, whereas for those who are active members - median survival time is greater than 10 (than 10, which is median survival time in the initial model with no covariates used).

Conclusion

All in all, it can be concluded that for predicting churn for this particular dataset it is better to use decision tree (Randome Forest), rather than logistic regression, judging by the model performance for both methods. However, the difference in accuracy is not really significant (1-3%) even with the application of cross-validation procedure. As for the most important predictors for churn, there can be named the following: age, country of residence, activeness as a banking system user and the number of bank accounts or bank affiliated products. In some cases, the current balance of the client can help to predict churn as well. Speaking about the tenure time, about 50 % of the customers presumably are not going to churn withing 10 months. Still, if taking into consideration covariates, being a German citizen increases the probability to churn, lowering the survival time to only 9 months. However, being a citizen of another country or/and being an active member of banking system decreases the probability to churn, yielding 10+ months within which customers are not expected to quit.