2022-10-24

About Data

Data Description

The data is obtained from Kaggle. The following is a description of each feature.

1. customer_id: account number.
2. credit_score: credit score.
3. country: country of residence.
4. gender: gender.
5. age: age.
6. tenure: number of years the customer has held the account.
7. balance: account balance.
8. products_number: number of bank products the customer uses.
9. credit_card: whether the account holder has an associated credit card.
10. active_member: whether the account holder is an active member.
11. estimated_salary: salary of account holder.
12. churn: churn status (1 = churned, 0 = stayed); the target variable.

Since customer_id is only for identification and will not provide any useful information for the analysis, I am going to drop that column. Also, some features indicate an “on/off” characteristic, so I am going to change their data type to factors.
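Below is a minimal sketch of this cleaning step; the data frame name `bank`, the placeholder file path, and the use of dplyr are my assumptions, since the original code is not shown.

```r
# Sketch of the cleaning step; `bank` and the file path are placeholder assumptions.
library(dplyr)

bank <- read.csv("bank_churn.csv")   # placeholder path for the Kaggle file

bank <- bank %>%
  select(-customer_id) %>%                           # drop the identifier column
  mutate(across(c(country, gender, credit_card,
                  active_member, churn), as.factor)) # binary/categorical features as factors

glimpse(bank)
```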

## Rows: 10,000
## Columns: 11
## $ credit_score     <dbl> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528…
## $ country          <fct> France, Spain, France, France, Spain, Spain, France, …
## $ gender           <fct> Female, Female, Female, Female, Female, Male, Male, F…
## $ age              <dbl> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 2…
## $ tenure           <dbl> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9…
## $ balance          <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78…
## $ products_number  <dbl> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2,…
## $ credit_card      <fct> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
## $ active_member    <fct> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ estimated_salary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, …
## $ churn            <fct> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…

Summary Statistics

There are no missing values in the data; about 20% of customers eventually churn and 80% remain with the bank.
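A quick check along these lines might look like the sketch below (again assuming the data frame is named `bank`).

```r
# Confirm there are no missing values and inspect the churn rate.
colSums(is.na(bank))            # missing values per column (all zero)
prop.table(table(bank$churn))   # roughly 80% stay (0) vs 20% churn (1)
summary(bank)
```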

##   credit_score      country        gender          age            tenure      
##  Min.   :350.0   France :5014   Female:4543   Min.   :18.00   Min.   : 0.000  
##  1st Qu.:584.0   Germany:2509   Male  :5457   1st Qu.:32.00   1st Qu.: 3.000  
##  Median :652.0   Spain  :2477                 Median :37.00   Median : 5.000  
##  Mean   :650.5                                Mean   :38.92   Mean   : 5.013  
##  3rd Qu.:718.0                                3rd Qu.:44.00   3rd Qu.: 7.000  
##  Max.   :850.0                                Max.   :92.00   Max.   :10.000  
##     balance       products_number credit_card active_member estimated_salary   
##  Min.   :     0   Min.   :1.00    0:2945      0:4849        Min.   :    11.58  
##  1st Qu.:     0   1st Qu.:1.00    1:7055      1:5151        1st Qu.: 51002.11  
##  Median : 97199   Median :1.00                              Median :100193.91  
##  Mean   : 76486   Mean   :1.53                              Mean   :100090.24  
##  3rd Qu.:127644   3rd Qu.:2.00                              3rd Qu.:149388.25  
##  Max.   :250898   Max.   :4.00                              Max.   :199992.48  
##  churn   
##  0:7963  
##  1:2037  
##          
##          
##          
## 

Visualization

To understand how the features map onto the target variable, we need to visualize the data to gain more information and build some intuition.

Numerical vs Target

age: middle-aged customers seem more likely to churn

balance and credit_score do not seem to provide much useful information

estimated_salary: customers with an estimated salary below 125,000 seem less likely to churn, whereas those with an estimated salary above 125,000 seem more likely to churn

products_number: the more products a customer uses, the less likely they are to churn

tenure: the longer a customer stays, the less likely they are to churn
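A minimal ggplot2 sketch of these numeric-vs-churn views is below; the exact plot type (boxplots faceted by feature) is my assumption about the original figures.

```r
# Boxplots of each numeric feature split by churn status (sketch).
library(dplyr)
library(tidyr)
library(ggplot2)

bank %>%
  pivot_longer(c(credit_score, age, tenure, balance,
                 products_number, estimated_salary),
               names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = churn, y = value, fill = churn)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")
```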

### Correlations

No strong correlation between the numeric variables, which is a good sign.
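One way to produce this check, assuming the corrplot package (my choice, not necessarily the original one):

```r
# Correlation matrix of the numeric variables (sketch).
library(dplyr)
library(corrplot)

bank %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(method = "number")
```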

### Categorical vs Target

To see the relationship between these categorical variables and the target variable, I look at the two kinds of graphs shown below. I found:

active_member: there are more active members than inactive members, and inactive members are more likely to churn

country: France has more customers than the other two countries, but the churn proportions in France and Spain are roughly the same. Germany has the highest churn rate, followed by Spain, while customers in France are the least likely to churn.

credit_card: more customers have an associated credit card, but the churn proportions for customers with and without a credit card are similar; if anything, customers without an associated credit card appear slightly more likely to churn.

gender: female customers are more likely to churn than male ones.
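The two kinds of graphs referred to above could be produced roughly as follows (country is used as the example; the same pattern applies to the other categorical features).

```r
# Counts and within-group churn proportions for a categorical feature (sketch).
library(ggplot2)

ggplot(bank, aes(x = country, fill = churn)) +
  geom_bar(position = "dodge")    # raw counts per country, split by churn

ggplot(bank, aes(x = country, fill = churn)) +
  geom_bar(position = "fill") +   # churn proportion within each country
  labs(y = "proportion")
```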

# Preprocessing

One-Hot Encoding
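A minimal sketch of the encoding step with caret::dummyVars; which variables were encoded is inferred from the dummy column names (country.Germany, gender.Male, etc.) in the importance table later, and the object names are my assumptions.

```r
library(caret)
library(dplyr)

# Encode only country and gender; the 0/1 factor features are kept as they are.
dmy      <- dummyVars(~ country + gender, data = bank, fullRank = FALSE)
bank.enc <- bank %>%
  select(-country, -gender) %>%
  bind_cols(as.data.frame(predict(dmy, newdata = bank)))
```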

Data Split
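A sketch of the split plus upsampling follows; the 75% split proportion, the seed, and the object names are my assumptions, while the upsampling step itself comes from the conclusion. Note that caret::upSample() renames the target to Class, which matches the randomForest call shown later.

```r
library(caret)

set.seed(2022)                              # arbitrary seed for reproducibility
idx   <- createDataPartition(bank.enc$churn, p = 0.75, list = FALSE)
train <- bank.enc[idx, ]
test  <- bank.enc[-idx, ]

# Upsample the minority class so the training set is balanced.
train.up <- upSample(x = subset(train, select = -churn), y = train$churn)
table(train.up$Class)                       # both classes now have equal counts
```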

## 
##    0    1 
## 5994 5994

Modeling

First Decision Tree

I am going to use all variables to see how the decision tree performs. As the tree structure shows, the variables credit_card, credit_score, and estimated_salary are not used in the prediction. Double-checking the variable importance determined by the tree, this model relies on only the first five variables, and the accuracy of the predictions is 85.4%.
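A minimal sketch of how this first tree might be fit and inspected, assuming rpart and caret::varImp (the original code is not shown, so the exact functions are assumptions):

```r
library(rpart)
library(rpart.plot)
library(caret)

# Fit a classification tree on all predictors of the upsampled training set.
tree1 <- rpart(Class ~ ., data = train.up, method = "class")

rpart.plot(tree1)   # shows which variables the splits actually use
varImp(tree1)       # variable importance, as reported below
```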

##                     Overall
## age              1202.41515
## products_number   870.35088
## country.Germany   429.48659
## active_member     408.85479
## balance           251.33315
## country.France     40.53116
## gender.Female      12.15136
## credit_score        0.00000
## tenure              0.00000
## credit_card         0.00000
## estimated_salary    0.00000
## country.Spain       0.00000
## gender.Male         0.00000

Second Decision Tree

In the second tree, I pick the most important variables based on my intuition. The resulting tree structure looks different from the first tree.
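A sketch of the second tree on a hand-picked subset of predictors; this particular subset is my assumption, guided by the importance table above.

```r
# Second tree: only the intuitively most important predictors.
tree2 <- rpart(Class ~ age + products_number + country.Germany +
                 active_member + balance,
               data = train.up, method = "class")
rpart.plot(tree2)
```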

Random Forest

The last model is a random forest. I expected the accuracy of the ensemble model to be higher than that of a single decision tree; however, the accuracy of the predictions is approximately the same as the first decision tree's.
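A sketch of the fit and its test-set evaluation; the randomForest call matches the printed output below, while the evaluation step (predict plus confusionMatrix on the test set) is my assumption.

```r
library(randomForest)
library(caret)

set.seed(2022)
rf <- randomForest(Class ~ ., data = train.up)
rf                               # prints the call, OOB error, and confusion matrix below

pred.rf <- predict(rf, newdata = test)
confusionMatrix(pred.rf, test$churn)
```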

## 
## Call:
##  randomForest(formula = Class ~ ., data = train.up) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 5.71%
## Confusion matrix:
##      0    1 class.error
## 0 5377  617  0.10293627
## 1   67 5927  0.01117784

Model Comparison

tree1: first decision tree

tree2: second decision tree

tree3: random forest
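The comparison table below could be assembled from caret::confusionMatrix results along these lines; the prediction objects and which class counts as positive are assumptions.

```r
library(caret)

# Test-set predictions from the three models fitted above.
pred1   <- predict(tree1, newdata = test, type = "class")
pred2   <- predict(tree2, newdata = test, type = "class")
pred.rf <- predict(rf,    newdata = test)

# Collect the usual classification metrics for one set of predictions.
metrics <- function(pred, truth) {
  cm <- confusionMatrix(pred, truth)
  c(accuracy    = unname(cm$overall["Accuracy"]),
    sensitivity = unname(cm$byClass["Sensitivity"]),
    specificity = unname(cm$byClass["Specificity"]),
    precision   = unname(cm$byClass["Precision"]),
    recall      = unname(cm$byClass["Recall"]))
}

rbind(tree1 = metrics(pred1,   test$churn),
      tree2 = metrics(pred2,   test$churn),
      tree3 = metrics(pred.rf, test$churn))
```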

##   model accuracy sensitivity specificity precision recall
## 1 tree1   0.7088       0.711      0.7006     0.898  0.711
## 2 tree2   0.7876      0.8294      0.6328    0.8933 0.8294
## 3 tree3    0.838      0.9045      0.5913    0.8914  0.711

Conclusion

I did this homework twice. The first time, I forgot that an imbalanced class will make the tree models unstable. After reviewing the code, I realized the problem and added upsampling after splitting the data, which makes sure that the target class is well balanced in the training set. Then the model performance was what I expected, with the random forest performing the best. I think the further step to be taken is to tune the parameters of these tree models and see where the optimal values are.
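As a sketch of that next step, parameter tuning could be done with caret::train, for example over the tree's complexity parameter cp and the forest's mtry; the grids and resampling settings here are purely illustrative assumptions.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

# Tune the complexity parameter of the decision tree.
tree.tuned <- train(Class ~ ., data = train.up, method = "rpart",
                    trControl = ctrl,
                    tuneGrid = expand.grid(cp = seq(0.001, 0.05, by = 0.005)))

# Tune mtry (number of variables tried at each split) for the random forest.
rf.tuned <- train(Class ~ ., data = train.up, method = "rf",
                  trControl = ctrl,
                  tuneGrid = expand.grid(mtry = c(2, 3, 4, 6)))
```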