# About the Data

## Data Description
The data is obtained from Kaggle. Each feature is described below.
1. customer_id: account number.
2. credit_score: credit score.
3. country: country of residence.
4. gender: gender.
5. age: age.
6. tenure: number of years the customer has held the account.
7. balance: account balance.
8. products_number: number of bank products the customer uses.
9. credit_card: whether the account holder has an associated credit card.
10. active_member: whether the account holder is an active member.
11. estimated_salary: estimated salary of the account holder.
12. churn: churn status (1 = churned, 0 = retained).
Obviously, customer_id is only for identification and will not provide any useful information for further investigation, so I am going to drop the column. Also, some features have an “on/off” characteristic, so I am going to change their data type to factors.
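A minimal sketch of this cleanup, assuming the raw data frame read from the Kaggle CSV is named `bank` (a name chosen here for illustration) and using dplyr:

```r
library(dplyr)

# Drop the identifier and turn the binary/categorical columns into factors.
# `bank` is an assumed name for the data frame read from the Kaggle CSV.
bank <- bank %>%
  select(-customer_id) %>%
  mutate(across(c(country, gender, credit_card, active_member, churn), as.factor))

glimpse(bank)
```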
## Rows: 10,000
## Columns: 11
## $ credit_score <dbl> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 528…
## $ country <fct> France, Spain, France, France, Spain, Spain, France, …
## $ gender <fct> Female, Female, Female, Female, Female, Male, Male, F…
## $ age <dbl> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, 2…
## $ tenure <dbl> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, 9…
## $ balance <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.78…
## $ products_number <dbl> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2,…
## $ credit_card <fct> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
## $ active_member <fct> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ estimated_salary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10, …
## $ churn <fct> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## Summary Statistics
There are no missing values in the data; about 20% of customers eventually churn, and 80% remain with the bank.
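This can be checked directly; a small sketch, again assuming the `bank` data frame:

```r
sum(is.na(bank))               # 0, so no missing values
prop.table(table(bank$churn))  # roughly 80% retained vs. 20% churned
summary(bank)
```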
## credit_score country gender age tenure
## Min. :350.0 France :5014 Female:4543 Min. :18.00 Min. : 0.000
## 1st Qu.:584.0 Germany:2509 Male :5457 1st Qu.:32.00 1st Qu.: 3.000
## Median :652.0 Spain :2477 Median :37.00 Median : 5.000
## Mean :650.5 Mean :38.92 Mean : 5.013
## 3rd Qu.:718.0 3rd Qu.:44.00 3rd Qu.: 7.000
## Max. :850.0 Max. :92.00 Max. :10.000
## balance products_number credit_card active_member estimated_salary
## Min. : 0 Min. :1.00 0:2945 0:4849 Min. : 11.58
## 1st Qu.: 0 1st Qu.:1.00 1:7055 1:5151 1st Qu.: 51002.11
## Median : 97199 Median :1.00 Median :100193.91
## Mean : 76486 Mean :1.53 Mean :100090.24
## 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:149388.25
## Max. :250898 Max. :4.00 Max. :199992.48
## churn
## 0:7963
## 1:2037
## Visualization
To understand how the features map to the target variable, we need to visualize the data to gather more information and build some intuition.
### Numerical vs Target
- age: middle-aged customers seem more likely to churn.
- balance and credit_score: these do not seem to provide much useful information.
- estimated_salary: customers with an estimated salary below 125,000 seem less likely to churn, whereas those with an estimated salary above 125,000 seem more likely to churn.
- products_number: the more products a customer uses, the less likely they are to churn.
- tenure: the longer a customer stays, the less likely they are to churn.
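A sketch of how these per-feature distributions can be plotted, assuming ggplot2 and tidyr (the original plotting code is not shown):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Density of each numeric feature, split by churn status.
bank %>%
  pivot_longer(c(credit_score, age, tenure, balance, products_number, estimated_salary),
               names_to = "feature", values_to = "value") %>%
  ggplot(aes(value, fill = churn)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ feature, scales = "free")
```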
### Correlations
There is no strong correlation between the numeric variables, which is a good sign.
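A sketch of this check, continuing with the `bank` data frame and dplyr:

```r
# Correlation matrix of the numeric predictors, rounded for readability.
bank %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)
```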
### Categorical vs Target
To see the relationship between these categorical variables and the target variable, I look at the graphs shown below. I found the following:
- active_member: there are more active members than inactive members, and inactive members are more likely to churn.
- country: France has more customers than the other two countries, and the churn proportions in France and Spain are roughly the same. Germany is the most common churning country, followed by Spain, while France is the least likely to churn.
- credit_card: more customers have an associated credit card, and the churn proportions between customers with and without a credit card are similar; if anything, customers without an associated credit card are slightly more likely to churn.
- gender: female customers are more likely to churn than male ones.
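A sketch of how the churn proportions per category can be charted, under the same assumptions as the numeric plots:

```r
# Within-level churn proportions for each categorical feature.
bank %>%
  mutate(across(c(country, gender, credit_card, active_member), as.character)) %>%
  pivot_longer(c(country, gender, credit_card, active_member),
               names_to = "feature", values_to = "level") %>%
  ggplot(aes(level, fill = churn)) +
  geom_bar(position = "fill") +
  facet_wrap(~ feature, scales = "free_x") +
  labs(y = "proportion")
```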
# Preprocessing
## One-Hot Encoding
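The dummy-variable names appearing later (e.g. `country.Germany`, `gender.Female`) match caret's `dummyVars()` with its default `.` separator; a sketch under that assumption, encoding only the multi-level factors:

```r
library(caret)

# One-hot encode country and gender; the binary factors stay as they are.
dmy  <- dummyVars(~ country + gender, data = bank, sep = ".")
ohe  <- as.data.frame(predict(dmy, newdata = bank))
bank <- cbind(bank[, setdiff(names(bank), c("country", "gender"))], ohe)
```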
## Data Split
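The class counts below come from the upsampled training set. A sketch of the split and upsampling; the split proportion and seed are assumptions:

```r
library(caret)

set.seed(123)  # assumed seed for reproducibility
idx   <- createDataPartition(bank$churn, p = 0.75, list = FALSE)  # assumed proportion
train <- bank[idx, ]
test  <- bank[-idx, ]

# Upsample the minority class so the training classes are balanced.
# upSample() names the target column "Class", matching the model calls below.
train.up <- upSample(x = subset(train, select = -churn), y = train$churn)
table(train.up$Class)
```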
##
## 0 1
## 5994 5994
# Modeling
## First Decision Tree
I use all variables to see how the decision tree performs. As the tree structure shows, the variables credit_card, credit_score, and estimated_salary are not used in the prediction. Double-checking the variable importance determined by the tree confirms that this model relies on only the first five variables, and the accuracy of the prediction is 85.4%.
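The importance listing below is in the format of caret's `varImp()` applied to an rpart tree; a sketch under that assumption:

```r
library(rpart)
library(caret)

# Fit a classification tree on all predictors (default rpart settings assumed).
tree1 <- rpart(Class ~ ., data = train.up, method = "class")

# Predictions on the held-out test set, used for evaluation later.
pred1 <- predict(tree1, newdata = test, type = "class")

# Variable importance as determined by the tree (output below).
varImp(tree1)
```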
## Overall
## age 1202.41515
## products_number 870.35088
## country.Germany 429.48659
## active_member 408.85479
## balance 251.33315
## country.France 40.53116
## gender.Female 12.15136
## credit_score 0.00000
## tenure 0.00000
## credit_card 0.00000
## estimated_salary 0.00000
## country.Spain 0.00000
## gender.Male 0.00000
## Second Decision Tree
In the second tree, I pick the most important variables based on my intuition. The resulting tree structure looks different from the first tree.
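The exact subset of variables picked is not stated; the sketch below uses the top of the importance ranking above as an illustrative stand-in:

```r
# Second tree on a hand-picked subset of variables (illustrative choice only).
tree2 <- rpart(Class ~ age + products_number + country.Germany + active_member + balance,
               data = train.up, method = "class")
pred2 <- predict(tree2, newdata = test, type = "class")
```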
## Random Forest
The last model is a random forest. I expected the accuracy of the ensemble model to be higher than that of a single decision tree; however, the predictive accuracy is approximately the same as that of the first decision tree.
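The fitted call is printed below; a sketch reproducing it (the seed is an assumption):

```r
library(randomForest)

set.seed(123)  # assumed seed
tree3 <- randomForest(Class ~ ., data = train.up)
pred3 <- predict(tree3, newdata = test)
tree3  # printing the model gives the summary shown below
```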
##
## Call:
## randomForest(formula = Class ~ ., data = train.up)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 5.71%
## Confusion matrix:
## 0 1 class.error
## 0 5377 617 0.10293627
## 1 67 5927 0.01117784
## Model Comparison
- tree1: first decision tree
- tree2: second decision tree
- tree3: random forest
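A sketch of how a comparison table like the one below can be assembled from caret's `confusionMatrix()`, assuming the predictions `pred1`, `pred2`, `pred3` from the models above and `1` as the positive class:

```r
# Test-set metrics for one model's predictions against the true labels.
metrics <- function(pred, truth) {
  cm <- confusionMatrix(pred, truth, positive = "1", mode = "everything")
  c(accuracy    = cm$overall[["Accuracy"]],
    sensitivity = cm$byClass[["Sensitivity"]],
    specificity = cm$byClass[["Specificity"]],
    precision   = cm$byClass[["Precision"]],
    recall      = cm$byClass[["Recall"]])
}

rbind(tree1 = metrics(pred1, test$churn),
      tree2 = metrics(pred2, test$churn),
      tree3 = metrics(pred3, test$churn))
```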
## model accuracy sensitivity specificity precision recall
## 1 tree1 0.7088 0.711 0.7006 0.898 0.711
## 2 tree2 0.7876 0.8294 0.6328 0.8933 0.8294
## 3 tree3 0.838 0.9045 0.5913 0.8914 0.711
# Conclusion
I did this homework twice. The first time, I forgot that an imbalanced class will produce an unstable tree model. After reviewing the code, I realized the problem and added upsampling after splitting the data, which ensures that the target class is well balanced in the training set. With that fix, model performance is just what I expected, with the random forest performing best. A further step worth taking is to tune the parameters of these tree models and see where the optimal values are.
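A sketch of one such tuning step, cross-validating `mtry` for the random forest with caret; the grid and fold count are assumptions:

```r
# 5-fold cross-validation over a small mtry grid (values chosen for illustration).
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(Class ~ ., data = train.up, method = "rf",
                  trControl = ctrl, tuneGrid = data.frame(mtry = 2:6))
rf_tuned$bestTune  # the mtry value with the best cross-validated accuracy
```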