Homework #4 Assignment:

You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homework You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.

Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:.

  • describe the problem you are trying to solve.
  • describe your datases and what you did to prepare the data for analysis.
  • methodologies you used for analyzing the data
  • what’s the purpose of the analysis performed
  • make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Your final presentation could be the traditional R file or Python file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.

Introduction

The dataset for this assignment is the Bank Loan Modelling data from Kaggle. The data consists of customer information for a campaign to get customers to open bank loans. The goal is to create a model to identify customers who have a higher probability of opening a loan. The bank collected customer demographic information and the types of accounts the customer has with the bank.

#libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(rcompanion)
library(datasets)
library(caTools)
library(party)
library(caret)
library(ggcorrplot)
library(randomForest)
library(e1071)

Data Import

bank_df <- read.csv("https://raw.githubusercontent.com/ltcancel/DATA622/main/Homework4/Data/Bank_Personal_Loan_Modelling.csv")

str(bank_df)
## 'data.frame':    5000 obs. of  14 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...

Data Exploration

If we look at the summary of the dataframe we see that there are no N/A values. Some columns have a minimum value of zero but the zero has a meaning so we do not have to do not have to clean or remove any of the values.

summary(bank_df)
##        ID            Age          Experience       Income          ZIP.Code    
##  Min.   :   1   Min.   :23.00   Min.   :-3.0   Min.   :  8.00   Min.   : 9307  
##  1st Qu.:1251   1st Qu.:35.00   1st Qu.:10.0   1st Qu.: 39.00   1st Qu.:91911  
##  Median :2500   Median :45.00   Median :20.0   Median : 64.00   Median :93437  
##  Mean   :2500   Mean   :45.34   Mean   :20.1   Mean   : 73.77   Mean   :93153  
##  3rd Qu.:3750   3rd Qu.:55.00   3rd Qu.:30.0   3rd Qu.: 98.00   3rd Qu.:94608  
##  Max.   :5000   Max.   :67.00   Max.   :43.0   Max.   :224.00   Max.   :96651  
##      Family          CCAvg          Education        Mortgage    
##  Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :  0.0  
##  1st Qu.:1.000   1st Qu.: 0.700   1st Qu.:1.000   1st Qu.:  0.0  
##  Median :2.000   Median : 1.500   Median :2.000   Median :  0.0  
##  Mean   :2.396   Mean   : 1.938   Mean   :1.881   Mean   : 56.5  
##  3rd Qu.:3.000   3rd Qu.: 2.500   3rd Qu.:3.000   3rd Qu.:101.0  
##  Max.   :4.000   Max.   :10.000   Max.   :3.000   Max.   :635.0  
##  Personal.Loan   Securities.Account   CD.Account         Online      
##  Min.   :0.000   Min.   :0.0000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000     Median :0.0000   Median :1.0000  
##  Mean   :0.096   Mean   :0.1044     Mean   :0.0604   Mean   :0.5968  
##  3rd Qu.:0.000   3rd Qu.:0.0000     3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000     Max.   :1.0000   Max.   :1.0000  
##    CreditCard   
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.294  
##  3rd Qu.:1.000  
##  Max.   :1.000

The columns are a mix of continuous and categorical so we will convert the categorical columns to factors.

bank_df$Education <- as.factor(bank_df$Education)
bank_df$Personal.Loan <- as.factor(bank_df$Personal.Loan)
bank_df$Securities.Account <- as.factor(bank_df$Securities.Account)
bank_df$CD.Account <- as.factor(bank_df$CD.Account)
bank_df$Online <- as.factor(bank_df$Online)
bank_df$CreditCard <- as.factor(bank_df$CreditCard)

Graphs of categorical variables. After converting Education, Personal Loan, Securities Account, CD Account, Online, and CreditCard to factors, we can explore the counts for each column. Most of the customers are a household of 1 or 2. The highest education level for most customers is Undergrad (1), then Advanced/Professional (3), and last is Graduate (2). The last four charts are columns that only contain Boolean values, a one or a zero (Yes/No). Most of the customers did not accept the Personal Loan offer, they do not have a Security Account or a CD account, and most do not own a credit card from UniversalBank.

hist(bank_df$Family, main = "Family Size of Customer", col = "orange")

plot(bank_df$Education, main = "Educational Level of Customer", col = "orange")

plot(bank_df$Personal.Loan, main = "Customers Accepting Personal Loan Offer", col = "orange")

plot(bank_df$Securities.Account, main = "Security Account Holder", col = "orange")

plot(bank_df$CD.Account, main = "CD Account Holder", col = "orange")

plot(bank_df$Online, main = "Internet Banking", col = "orange")

plot(bank_df$CreditCard, main = "Owns a Credit Card by UniveralBank", col = "orange")

We will use a Boxplot to explore the columns with continuous variables. Customer age and years of professional experience have a normal, symmetrical distribution. The median customer age is in the 40’s and the median years of professional experience is 20 years. The last three boxplots contain some outliers with Home Mortgage having the most outliers.

boxplot(bank_df$Age, main = "Customer Age")

boxplot(bank_df$Experience, main = "Customer Years of Professional Experience")

boxplot(bank_df$Income, main = "Customer Annual Income")

boxplot(bank_df$CCAvg, main = "Customer Credit Card Average Monthly Spending")

boxplot(bank_df$Mortgage, main = "Customer Home Mortgage")

Decision Tree

The model used for this project is a decision tree. We used an 80/20 split for the train and test dataframes and used the Personal Loan column as the predictor. A zero means the customer did not accept the Personal Loan offer and a one means the customer did accept the personal loan offer. We know from looking at the bar chart from earlier that most customers did not accept the Personal Loan offer. If we follow one of the branches that leads to a 1 (accepted loan offer), customers with an income over $114,00, Education Level above Undergraduate, and income over $117,000 (there are two Income nodes on the right side of the decision tree). A simplified interpretation of that is customers with a higher level of education and a higher income were more likely to accept the Personal Loan offer.

# Create the train/test sets
set.seed(123)
create_train_test <- function(data, size = 0.8, train = TRUE){
  n_row = nrow(data)
  total_row = size * n_row
  train_sample <- 1: total_row
  if (train == TRUE){
    return(data[train_sample, ])
  } else {
    return(data[-train_sample, ])
  }
}
# Test the function and check the dimension
data_train <- create_train_test(bank_df, 0.8, train = TRUE)
data_test <- create_train_test(bank_df, 0.8, train = FALSE)
dim(data_train)
## [1] 4000   14
dim(data_test)
## [1] 1000   14
# Decision Tree using Personal Loan as the predictor
control <- rpart.control(minsplit = 5L, maxdepth = 5L, minbucket = 5, cp = 0.002, maxsurrogate = 4)
fit <- rpart(Personal.Loan~., data = data_train, method = 'class', control = control)
rpart.plot(fit, extra = "auto")

A Confusion Matrix was used to test the accuracy of the Decision Tree, so we can see that is has a 98.5% accuracy.

predict_df <- predict(fit, data_test, type = "class")


#?predict
print("Decision Tree Confusion Matrix")
## [1] "Decision Tree Confusion Matrix"
confusionMatrix(predict_df, data_test$Personal.Loan)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 910   8
##          1   7  75
##                                           
##                Accuracy : 0.985           
##                  95% CI : (0.9754, 0.9916)
##     No Information Rate : 0.917           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9009          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9924          
##             Specificity : 0.9036          
##          Pos Pred Value : 0.9913          
##          Neg Pred Value : 0.9146          
##              Prevalence : 0.9170          
##          Detection Rate : 0.9100          
##    Detection Prevalence : 0.9180          
##       Balanced Accuracy : 0.9480          
##                                           
##        'Positive' Class : 0               
## 

Although the confusion matrix shows a high accuracy, we will experiment with pruning the tree to see if it makes a difference. A Complexity Parameter (CP) table is used to select the optimal size of the decision tree. CP 5 & 6 have equal and lowest xerror.

printcp(fit)
## 
## Classification tree:
## rpart(formula = Personal.Loan ~ ., data = data_train, method = "class", 
##     control = control)
## 
## Variables actually used in tree construction:
## [1] Age        CCAvg      CD.Account Education  Family     Income    
## 
## Root node error: 397/4000 = 0.09925
## 
## n= 4000 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.3261965      0  1.000000 1.00000 0.047633
## 2 0.1335013      2  0.347607 0.35768 0.029478
## 3 0.0163728      3  0.214106 0.21159 0.022842
## 4 0.0100756      7  0.138539 0.17884 0.021035
## 5 0.0075567      9  0.118388 0.14106 0.018717
## 6 0.0041982     10  0.110831 0.13854 0.018552
## 7 0.0020000     13  0.098237 0.13854 0.018552
plotcp(fit)

#compute accuracy of pruned tree
base_accuracy <- mean(predict_df == data_test$Personal.Loan)

Using the cp value for row 5 (0.0041982) we run the tree and prediction one more time. If we compare the accuracy from the original prediction to the postruning, we see that the values are exactly the same. So, pruning the tree did not make a difference.

model_pruned <- prune(fit, cp = 0.0041982)
#accuracy of pruned tree
predict_df <- predict(model_pruned, data_test, type = "class")
accuracy_postrun <- mean(predict_df == data_test$Personal.Loan)
data.frame(base_accuracy, accuracy_postrun)
##   base_accuracy accuracy_postrun
## 1         0.985            0.985

If I were to do this analysis again, I would filter the data by customers who accepted the loan only. The current decision tree mostly displays information for customers who did not accept the loan which is useful information. If the bank wanted to run this campaign again, they can see the characteristics of the customers who said no. However, I think it would also be useful to have a tree that displays the characteristics of the customers who said yes.