Homework #4 Assignment:
You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homework assignments. You can work on a problem from your job or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10 and one methodology from weeks 11-15 to apply to the newly selected dataset. To complete this task:
Your final presentation can be the traditional R or Python file and essay, or it can be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (YouTube, Free Cam). If you select the presentation, it should be a 5- to 8-minute recording.
The dataset for this assignment is the Bank Loan Modelling data from Kaggle. The data consists of customer information for a campaign to get customers to open bank loans. The goal is to create a model to identify customers who have a higher probability of opening a loan. The bank collected customer demographic information and the types of accounts the customer has with the bank.
#libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(rcompanion)
library(datasets)
library(caTools)
library(party)
library(caret)
library(ggcorrplot)
library(randomForest)
library(e1071)
bank_df <- read.csv("https://raw.githubusercontent.com/ltcancel/DATA622/main/Homework4/Data/Bank_Personal_Loan_Modelling.csv")
str(bank_df)
## 'data.frame': 5000 obs. of 14 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
## $ Family : int 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Securities.Account: int 1 1 0 0 0 0 0 0 0 0 ...
## $ CD.Account : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Online : int 0 0 0 0 0 1 1 0 1 0 ...
## $ CreditCard : int 0 0 0 0 1 0 0 1 0 0 ...
If we look at the summary of the data frame, we see that there are no NA values. Some columns have a minimum value of zero, but the zero is meaningful, so we do not have to clean or remove any of the values.
summary(bank_df)
## ID Age Experience Income ZIP.Code
## Min. : 1 Min. :23.00 Min. :-3.0 Min. : 8.00 Min. : 9307
## 1st Qu.:1251 1st Qu.:35.00 1st Qu.:10.0 1st Qu.: 39.00 1st Qu.:91911
## Median :2500 Median :45.00 Median :20.0 Median : 64.00 Median :93437
## Mean :2500 Mean :45.34 Mean :20.1 Mean : 73.77 Mean :93153
## 3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0 3rd Qu.: 98.00 3rd Qu.:94608
## Max. :5000 Max. :67.00 Max. :43.0 Max. :224.00 Max. :96651
## Family CCAvg Education Mortgage
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.: 0.700 1st Qu.:1.000 1st Qu.: 0.0
## Median :2.000 Median : 1.500 Median :2.000 Median : 0.0
## Mean :2.396 Mean : 1.938 Mean :1.881 Mean : 56.5
## 3rd Qu.:3.000 3rd Qu.: 2.500 3rd Qu.:3.000 3rd Qu.:101.0
## Max. :4.000 Max. :10.000 Max. :3.000 Max. :635.0
## Personal.Loan Securities.Account CD.Account Online
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.096 Mean :0.1044 Mean :0.0604 Mean :0.5968
## 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## CreditCard
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.294
## 3rd Qu.:1.000
## Max. :1.000
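As a quick sanity check on the missing-value claim above (a small addition to the original write-up), we can count NAs per column directly.
# Count missing values in every column
colSums(is.na(bank_df))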
The columns are a mix of continuous and categorical variables, so we will convert the categorical columns to factors.
bank_df$Education <- as.factor(bank_df$Education)
bank_df$Personal.Loan <- as.factor(bank_df$Personal.Loan)
bank_df$Securities.Account <- as.factor(bank_df$Securities.Account)
bank_df$CD.Account <- as.factor(bank_df$CD.Account)
bank_df$Online <- as.factor(bank_df$Online)
bank_df$CreditCard <- as.factor(bank_df$CreditCard)
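The same conversion can be written more compactly with dplyr, which is already loaded. This is just an equivalent sketch (assuming dplyr >= 1.0 for across()) and gives the same result as the base R calls above.
# Convert all categorical columns to factors in one step
bank_df <- bank_df %>%
  mutate(across(c(Education, Personal.Loan, Securities.Account,
                  CD.Account, Online, CreditCard), as.factor))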
Next, we graph the categorical variables. After converting Education, Personal.Loan, Securities.Account, CD.Account, Online, and CreditCard to factors, we can explore the counts for each column. Most of the customers are in a household of 1 or 2. The most common highest education level is Undergrad (1), followed by Advanced/Professional (3), and then Graduate (2). The last four charts show columns that contain only Boolean values, a one or a zero (Yes/No). Most of the customers did not accept the Personal Loan offer, do not have a Securities Account or a CD account, and do not own a credit card from UniversalBank.
hist(bank_df$Family, main = "Family Size of Customer", col = "orange")
plot(bank_df$Education, main = "Educational Level of Customer", col = "orange")
plot(bank_df$Personal.Loan, main = "Customers Accepting Personal Loan Offer", col = "orange")
plot(bank_df$Securities.Account, main = "Security Account Holder", col = "orange")
plot(bank_df$CD.Account, main = "CD Account Holder", col = "orange")
plot(bank_df$Online, main = "Internet Banking", col = "orange")
plot(bank_df$CreditCard, main = "Owns a Credit Card by UniversalBank", col = "orange")
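As an optional alternative to the base R plots, the same counts can be drawn as a single panel of ggplot2 bar charts arranged with gridExtra (both loaded above, assuming a recent ggplot2 with the .data pronoun); this is a sketch, not part of the original analysis.
# Bar charts of the categorical columns arranged in one grid
cat_cols <- c("Family", "Education", "Personal.Loan", "Securities.Account",
              "CD.Account", "Online", "CreditCard")
cat_plots <- lapply(cat_cols, function(col) {
  ggplot(bank_df, aes(x = .data[[col]])) +
    geom_bar(fill = "orange") +
    labs(title = col, x = NULL)
})
grid.arrange(grobs = cat_plots, ncol = 3)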
We will use boxplots to explore the columns with continuous variables. Customer age and years of professional experience have roughly symmetric, approximately normal distributions. The median customer age is in the mid-40s and the median years of professional experience is 20. The last three boxplots contain outliers, with home mortgage having the most.
boxplot(bank_df$Age, main = "Customer Age")
boxplot(bank_df$Experience, main = "Customer Years of Professional Experience")
boxplot(bank_df$Income, main = "Customer Annual Income")
boxplot(bank_df$CCAvg, main = "Customer Credit Card Average Monthly Spending")
boxplot(bank_df$Mortgage, main = "Customer Home Mortgage")
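To back up the observation about outliers, a quick optional check counts the points flagged by the standard 1.5 * IQR boxplot rule for each continuous column.
# Number of boxplot outliers per continuous column
cont_cols <- c("Age", "Experience", "Income", "CCAvg", "Mortgage")
sapply(bank_df[cont_cols], function(x) length(boxplot.stats(x)$out))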
The model used for this project is a decision tree. We used an 80/20 split for the train and test data frames, with the Personal.Loan column as the target variable. A zero means the customer did not accept the Personal Loan offer and a one means the customer did accept it. We know from the bar chart above that most customers did not accept the offer. If we follow one of the branches that leads to a 1 (accepted the loan offer), it contains customers with an income over $114,000, an education level above Undergraduate, and an income over $117,000 (there are two Income nodes on the right side of the decision tree). A simplified interpretation is that customers with a higher level of education and a higher income were more likely to accept the Personal Loan offer.
# Create the train/test sets
set.seed(123)
# Sequential (non-random) split: the first `size` fraction of rows becomes the
# training set and the remaining rows become the test set.
create_train_test <- function(data, size = 0.8, train = TRUE) {
  n_row <- nrow(data)
  total_row <- size * n_row
  train_sample <- 1:total_row
  if (train == TRUE) {
    return(data[train_sample, ])
  } else {
    return(data[-train_sample, ])
  }
}
# Test the function and check the dimension
data_train <- create_train_test(bank_df, 0.8, train = TRUE)
data_test <- create_train_test(bank_df, 0.8, train = FALSE)
dim(data_train)
## [1] 4000 14
dim(data_test)
## [1] 1000 14
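As an aside, a randomized, stratified 80/20 split could be made with caret's createDataPartition(), which is already loaded. This sketch is not used below; swapping it in would change the exact figures reported for the model, which are based on the sequential split above.
# Alternative: stratified random split on the target variable
set.seed(123)
train_idx <- createDataPartition(bank_df$Personal.Loan, p = 0.8, list = FALSE)
dim(bank_df[train_idx, ])
dim(bank_df[-train_idx, ])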
# Decision tree with Personal.Loan as the target variable
control <- rpart.control(minsplit = 5L, maxdepth = 5L, minbucket = 5, cp = 0.002, maxsurrogate = 4)
fit <- rpart(Personal.Loan~., data = data_train, method = 'class', control = control)
rpart.plot(fit, extra = "auto")
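As an optional check on the interpretation above, rpart stores variable importance scores for the fitted tree, which can be inspected alongside the plot.
# Variable importance scores from the fitted tree
round(fit$variable.importance, 2)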
A confusion matrix was used to test the accuracy of the decision tree; it has an accuracy of 98.5%, which is above the no-information rate of 91.7%.
predict_df <- predict(fit, data_test, type = "class")
print("Decision Tree Confusion Matrix")
## [1] "Decision Tree Confusion Matrix"
confusionMatrix(predict_df, data_test$Personal.Loan)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 910 8
## 1 7 75
##
## Accuracy : 0.985
## 95% CI : (0.9754, 0.9916)
## No Information Rate : 0.917
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9009
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9924
## Specificity : 0.9036
## Pos Pred Value : 0.9913
## Neg Pred Value : 0.9146
## Prevalence : 0.9170
## Detection Rate : 0.9100
## Detection Prevalence : 0.9180
## Balanced Accuracy : 0.9480
##
## 'Positive' Class : 0
##
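Note that caret treats 0 (did not accept) as the positive class by default. As a small optional variation, passing positive = "1" reports sensitivity and specificity from the perspective of the customers who accepted the loan offer.
# Same confusion matrix, with the accepted-loan class treated as positive
confusionMatrix(predict_df, data_test$Personal.Loan, positive = "1")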
Although the confusion matrix shows a high accuracy, we will experiment with pruning the tree to see if it makes a difference. A complexity parameter (CP) table is used to select the optimal size of the decision tree. Rows 6 and 7 have the equal lowest xerror (0.13854).
printcp(fit)
##
## Classification tree:
## rpart(formula = Personal.Loan ~ ., data = data_train, method = "class",
## control = control)
##
## Variables actually used in tree construction:
## [1] Age CCAvg CD.Account Education Family Income
##
## Root node error: 397/4000 = 0.09925
##
## n= 4000
##
## CP nsplit rel error xerror xstd
## 1 0.3261965 0 1.000000 1.00000 0.047633
## 2 0.1335013 2 0.347607 0.35768 0.029478
## 3 0.0163728 3 0.214106 0.21159 0.022842
## 4 0.0100756 7 0.138539 0.17884 0.021035
## 5 0.0075567 9 0.118388 0.14106 0.018717
## 6 0.0041982 10 0.110831 0.13854 0.018552
## 7 0.0020000 13 0.098237 0.13854 0.018552
plotcp(fit)
#compute accuracy of the original (unpruned) tree
base_accuracy <- mean(predict_df == data_test$Personal.Loan)
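The CP with the lowest cross-validated error can also be pulled directly from the CP table rather than read off by eye; given the table above this is the row-6 value, 0.0041982, which is used for pruning below (which.min returns the first row when two rows tie).
# Select the CP with the minimum xerror from the CP table
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
best_cp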
Using the CP value from row 6 (0.0041982), we prune the tree and run the prediction one more time. Comparing the accuracy of the original prediction to the post-pruning accuracy, the values are exactly the same, so pruning the tree did not change the accuracy.
model_pruned <- prune(fit, cp = 0.0041982)
#accuracy of pruned tree
predict_df <- predict(model_pruned, data_test, type = "class")
accuracy_postrun <- mean(predict_df == data_test$Personal.Loan)
data.frame(base_accuracy, accuracy_postrun)
## base_accuracy accuracy_postrun
## 1 0.985 0.985
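Although the accuracy is unchanged, a quick optional comparison of tree sizes shows whether pruning at least produced a smaller tree.
# Number of splits before and after pruning
c(original_splits = sum(fit$frame$var != "<leaf>"),
  pruned_splits = sum(model_pruned$frame$var != "<leaf>"))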
If I were to do this analysis again, I would filter the data to only the customers who accepted the loan. The current decision tree mostly displays information for customers who did not accept the loan, which is still useful: if the bank wanted to run this campaign again, it could see the characteristics of the customers who said no. However, I think it would also be useful to have a tree that highlights the characteristics of the customers who said yes.