Authorship

Group 5:

  • Don (Geeth) Padmaperuma,
  • Subhalaxmi Rout,
  • Isabel Ramesar, and
  • Magnus Skonberg

Background

The purpose of our Final Project was to explore the application of Neural Networks to loan approval data and then compare model performance against a variety of Classification algorithms (i.e., KNN, DT, RF, GBM).

Neural Networks

Neural networks form the basis of Deep Learning, an extension of Machine Learning, where algorithms are inspired by the structure of the human brain. They take in data, train themselves to recognize patterns therein, and then predict outputs for similar, unseen data.

download.file(
    url='https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png',
    destfile='image1.jpg',
    mode='wb')
knitr::include_graphics(path='image1.jpg')

Neural networks are made up of layers of nodes. They contain an input layer, one or more hidden layers, and an output layer. Nodes are interconnected with associated weights and thresholds. When a node is above its specified threshold, the node is activated and data is sent to the next layer of the network. Otherwise, data is not fed forward.

The power of a neural network lies in its ability to fine-tune upon countless iterations. Back-propagation allows for continuous model accuracy improvement. Weights are adjusted based on the magnitude of error at the output layer, and continuous refinement allows for predictive accuracy improvements.
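Concretely, back-propagation applies gradient descent to each weight. A textbook form of the update rule (a general formulation, not specific to any one package) is \(w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}\), where \(\eta\) is the learning rate and \(E\) is the error measured at the output layer.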

Our Approach

We’ll start by (re) exploring and preparing the loan dataset, progress to building our neural network model, and then compare and contrast loan approval status prediction accuracy for our neural network model vs. decision tree, random forest, and gradient boosting models.


Loan Approval Data

A loan is when money is transferred from one party to another under the expectation that the lender (loan giver) will be repaid in full, with interest, by the lendee (loan receiver).

The lender’s profit comes from the interest paid by the lendee; thus, as a core part of their business model, it’s important for banks and credit companies alike to be able to depend on their loan (and interest) being repaid in full.

With this motivation in mind, we (re)explore and prepare our loan approval dataset in order to construct a more precise neural network model later. Since we’ve explored this data before, we build upon the core takeaways of our past exploration while pushing our understanding to a deeper level. Rudimentary (early) EDA steps are summarized or moved to the Appendix in favor of output that provides greater context and insight.

Prior to commencing EDA, we revisit the corresponding data directory:

  • Loan_ID: unique loan ID
  • Gender: applicant gender (Male/Female)
  • Married: applicant marriage status (Yes/No)
  • Dependents: number of dependents for applicant (0, 1, 2, 3+)
  • Education: applicant college education status (Graduate / Not Graduate)
  • Self_Employed: applicant self-employment status (Yes/No)
  • ApplicantIncome: applicant income level
  • CoapplicantIncome: co-applicant income level (if applicable)
  • LoanAmount: loan amount requested (in thousands)
  • Loan_Amount_Term: loan term (in months)
  • Credit_History: credit history meets guidelines (1/0)
  • Property_Area: property location (Urban/Semiurban/Rural)
  • Loan_Status: loan approved (Yes/No). target variable

Data Exploration & Preparation

To start, we load in our data, replace empty strings with NAs, observe the first 6 observations of our dataset to refamiliarize ourselves with its format, and then use dplyr’s glimpse() and base R’s summary() functions to revisit data types and value ranges.

We’re dealing with a 614 observation x 13 variable dataframe with Loan_Status as our dependent, categorical variable; ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History as independent, numeric variables; and Loan_ID, Gender, Married, Dependents, Education, Self_Employed, and Property_Area as independent, categorical variables.

From above, we conclude that Loan_ID can be dropped, ApplicantIncome and CoapplicantIncome can be combined to create a TotalIncome variable, and observations with a “3+” label in Dependents should be re-labelled “3” so that the data follows a consistent format and imputation can be performed as a next step.

NA Values

We pre-process our data (as described above), visualize and handle NA values:

#Pre-process dataset for easier interpretation
loan <- subset(loan, select = -c(1) ) #drop Loan_ID from consideration
loan$TotalIncome <- loan$CoapplicantIncome + loan$ApplicantIncome #create TotalIncome variable
loan <- subset(loan, select = -c(6,7) ) #drop CoapplicantIncome and ApplicantIncome
loan$Dependents <- revalue(loan$Dependents, c("3+"="3")) #relabel Dependents "3+" value as "3"

#Visualize NA counts
colSums(is.na(loan)) 
##           Gender          Married       Dependents        Education 
##               13                3               15                0 
##    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
##               32               22               14               50 
##    Property_Area      Loan_Status      TotalIncome 
##                0                0                0

We re-assign the “3+” value of the Dependents variable to provide consistent leveling and enable pmm (predictive mean matching). Predictive mean matching calculates a predicted value for the target variable and, for each missing value, forms a small set of “candidate donors” from the complete cases closest to that predicted value. A donor is then randomly chosen from the candidates and its observed value imputed where the value was missing. Applying pmm assumes that missing cells follow the same distribution as the observed data, so the approach becomes less reliable as the percentage of missing values grows.
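The imputation itself can be carried out with an imputation package; below is a minimal sketch assuming the mice package (our actual call may differ in m, maxit, or seed):

#Minimal sketch, assuming the mice package (actual call may differ)
#By default mice applies pmm to numeric columns and logistic/polytomous
#regression to factors; the seed is fixed for reproducibility
library(mice)
imp <- mice(loan, m = 5, maxit = 10, seed = 123, printFlag = FALSE)
loan <- complete(imp, 1) #keep the first completed dataset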

Once we’ve imputed missing values, we verify whether our operation was successful:

#verify absence of NA values in the dataset
colSums(is.na(loan))
##           Gender          Married       Dependents        Education 
##                0                0                0                0 
##    Self_Employed       LoanAmount Loan_Amount_Term   Credit_History 
##                0                0                0                0 
##    Property_Area      Loan_Status      TotalIncome 
##                0                0                0

Imputation was a success and data pre-processing is complete. From this point we proceed to examining feature correlation.

Correlation and Variable Importance

To identify feature correlation - how strongly independent variables are related to one another and how strongly they relate to our dependent variable (Loan_Status) - we consider a correlation matrix with a threshold of 0.3:

#Utilize custom-built correlation matrix generation function
plot_corr_matrix(loan, 0.3)
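plot_corr_matrix() is a helper defined outside this chunk; here is a minimal sketch of what it might look like, assuming the corrplot package (our actual implementation may differ):

#Sketch only: assumed implementation of the plot_corr_matrix() helper
library(corrplot)
plot_corr_matrix <- function(df, threshold) {
    num <- data.matrix(df) #coerce factors to numeric codes
    cm <- cor(num, use = "pairwise.complete.obs")
    cm[abs(cm) < threshold] <- 0 #blank out correlations below the threshold
    corrplot(cm, method = "color", type = "lower", tl.cex = 0.7)
}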

From the correlation matrix we can conclude that:

  • Credit_History is our strongest predictor / strongly correlated with Loan_Status, and
  • Gender and Married, Married and Dependents, LoanAmount and TotalIncome appear to be correlated with one another and indicative that multicollinearity may be a concern for our data.

We varied the threshold value for our correlation matrix and found that, aside from Credit_History, our other independent variables were relatively poor predictors of Loan_Status, making it worth exploring variable importance:

#NOTE: commented out due to long run time; the Boruta results are summarized below.

# Perform Boruta search
#boruta_output <- Boruta(Loan_Status ~ ., data=na.omit(loan), doTrace=0, maxRuns = 1000)
#Get significant variables including tentatives
#boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
#plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

Our use of the Boruta function for feature ranking and selection indicates that:

  • Credit_History, TotalIncome, LoanAmount, and Self_Employed are strong predictors,
  • Property_Area is a moderate predictor, and
  • Married, Loan_Amount_Term, Education, Gender, and Dependents are all poor predictors.

With feature importance in mind, we drop the poor predictors from consideration. Dropping these variables also addresses concerns of applicant discrimination (i.e., rejection based on Gender), so this single act of feature selection addresses two concerns at once.

Independent Variable Distributions

With our loan dataset properly subset, we proceed to observing the distributions of our independent variables. First we observe numeric distributions:

#convert CreditHistory to type factor
loan$Credit_History <- factor(loan$Credit_History)
#levels(loan$Credit_History) #verify

#Numeric distributions
loan %>%
    keep(is.numeric) %>%
    gather() %>% 
    ggplot(aes(value)) +
        facet_wrap(~ key, scales = "free", ncol=1) +
        geom_histogram(bins=90,color="darkblue", fill="lightblue")

From the above figures we observe that LoanAmount and TotalIncome appear to be right-skewed, with a number of noteworthy outliers in both distributions. From this, we note outlier handling and scaling as critical steps in building our neural network model.

Next, we explore our categorical variables:

#Categoric distributions
##Self_Employed
p1 <- loan %>% dplyr::select(1,5) %>% group_by(Loan_Status) %>% count() %>%
    ggplot(aes(x=Self_Employed, y=freq, fill=Loan_Status)) + 
        geom_bar(stat='identity', position="stack")
##Credit_History
p2 <- loan %>% dplyr::select(3,5) %>% group_by(Loan_Status) %>% count() %>%
    ggplot(aes(x=Credit_History, y=freq, fill=Loan_Status)) + 
        geom_bar(stat='identity', position="stack")
##Property_Area
p3 <- loan %>% dplyr::select(4,5) %>% group_by(Loan_Status) %>% count() %>%
    ggplot(aes(x=Property_Area, y=freq, fill=Loan_Status)) + 
        geom_bar(stat='identity', position="stack")

grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)

From the above figures we can conclude that:

  • non-self-employed applicants outnumber self-employed applicants roughly 5:1,
  • applicants whose credit history meets guidelines outnumber those whose does not roughly 5:1,
  • properties in semiurban areas make up a slight majority, and
  • with regard to loan approval, being self-employed, having a strong credit history, and living in a semiurban area all appear advantageous. The strongest categorical predictor appears to be a credit history that meets guidelines.

With a relatively thorough exploratory analysis under our belt, we move on to building our neural network model.


Model Building

We’ll utilize the holdout-validation method for evaluating model performance. We train-test split our data using a 75:25 partition, build our model on the training set and then evaluate its performance on the test set.

Neural Network (nnet)

To start, we compute our “barrier value” and then partition our data based on this value, with 75% of our data going in the training set and 25% of our data going in the test set.

set.seed(123) #for reproducibility

bar <- floor(0.75 * nrow(loan)) #compute "barrier value"
partition <- sample(seq_len(nrow(loan)), size = bar) #sample based on barrier value

#Subset: train-test split based on partition value
train <- loan[partition, ] 
test <- loan[-partition, ]

#print(dim(train)) #460 x 6
#print(dim(test)) #154 x 6

We set our training algorithm’s parameters and then train our model using caret’s train() function, with “nnet” passed in as the method and “center” and “scale” passed in as pre-processing steps so that numeric variables are standardized.
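A minimal sketch of the training call just described, assuming the caret package (our exact arguments may differ):

#Sketch only: caret train() with method "nnet" and standardization (assumed arguments)
library(caret)
set.seed(123)
nnet_model1 <- train(Loan_Status ~ ., data = train,
                     method = "nnet",
                     preProcess = c("center", "scale"), #standardize numeric predictors
                     trace = FALSE) #suppress nnet's iteration log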

With our “baseline model” trained, we proceed to model evaluation. We verify the baseline accuracy (0.676) and then evaluate our model’s performance against this “baseline”: we generate predictions on the training set, output them as a confusion matrix, and then do the same with our test data.

#round(prop.table(table(train$Loan_Status)),3)   #Baseline accuracy Y: 0.676

#Training predictions
nnPred_train <-predict(nnet_model1, train)
#Training confusion matrix
table(train$Loan_Status, nnPred_train)
##    nnPred_train
##       N   Y
##   N  77  72
##   Y   2 309
round((77+309)/nrow(train),3)                    
## [1] 0.839
#Test predictions
nnPred_test <-predict(nnet_model1, test)
#Test confusion matrix
table(test$Loan_Status, nnPred_test)
##    nnPred_test
##       N   Y
##   N  23  20
##   Y   5 106
round((23+106)/nrow(test),3) 
## [1] 0.838

From above, we observe a training accuracy of 83.9% and a test accuracy of 83.8%, an improvement of roughly 16 percentage points over our “baseline accuracy”.

By merely applying a neural network model to our dataset, we see major improvements in predictive capability. Next, we see whether we can take the model further by handling outliers and creating features prior to feeding the model.

Optimization Attempts

We explore the effects of outlier handling and feature creation on model performance to determine whether either step improves our model.

We start by re-visiting the distribution of outliers via boxplot:

#Observe the effect of outlier-handling on model performance (if any)

#Confirm the presence of influential observations
p4 <- ggplot(loan) +
  aes(x = Loan_Status, y = LoanAmount) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

p5 <- ggplot(loan) +
  aes(x = Loan_Status, y = TotalIncome) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

grid.arrange(p4, p5, nrow = 1, ncol = 2)

From above we can see that outliers appear to be a concern for our model.

Outlier Handling

To rectify the situation, we identify the outliers using the boxplot.stats() function, filter for corresponding observations, remove outliers from our dataset, and revisit model performance.
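A minimal sketch of this filter (the full version lives in the Appendix and may differ):

#Sketch only: drop observations flagged as boxplot outliers
out_amt <- boxplot.stats(loan$LoanAmount)$out #values beyond the whiskers
out_inc <- boxplot.stats(loan$TotalIncome)$out
loan_trim <- loan[!(loan$LoanAmount %in% out_amt | loan$TotalIncome %in% out_inc), ]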

Outlier handling led to a performance improvement on the training set (up to 85.6% accuracy) and a reduction on the test set (down to 70.8%). As such, we elected not to include outlier handling as an optimization step. Note: the corresponding code is included in the Appendix.

We proceeded to observe the effect of feature creation on model performance.

Feature Creation

We wanted to see if adding certain combinations of features would improve our predictive accuracy; a sketch of how such features might be constructed follows the list. We tested the inclusion of variables for:

  • self employed with high income,
  • semiurban property with qualified credit history,
  • not self employed with low loan amount, and
  • low income with high loan amount.
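A minimal sketch of these indicator features (the median thresholds here are hypothetical; the actual definitions live in the Appendix):

#Sketch only: candidate indicator features with hypothetical median thresholds
loan_fc <- loan #work on a copy so the main dataset is untouched
med_inc <- median(loan_fc$TotalIncome)
med_amt <- median(loan_fc$LoanAmount)
loan_fc$SE_HighInc <- as.numeric(loan_fc$Self_Employed == "Yes" & loan_fc$TotalIncome > med_inc)
loan_fc$Semi_GoodCred <- as.numeric(loan_fc$Property_Area == "Semiurban" & loan_fc$Credit_History == "1")
loan_fc$NSE_LowAmt <- as.numeric(loan_fc$Self_Employed == "No" & loan_fc$LoanAmount < med_amt)
loan_fc$LowInc_HighAmt <- as.numeric(loan_fc$TotalIncome < med_inc & loan_fc$LoanAmount > med_amt)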

The inclusion of these variables, and feature creation in general, slightly reduced the performance of our model, so we elected to exclude it as a modeling optimization step. Note: the corresponding code is included in the Appendix.

Since each of our optimization attempts was fruitless, we next opted to explore an alternative neural network implementation and compare predictive accuracy across packages / approaches.

Neural Network (keras)

Background / Intro

This model is a densely connected deep neural network with multiple hidden layers and an output layer. Training a neural network revolves around the following objects:

  • Layers, which are combined into a network (or model)
  • The input data and corresponding targets
  • The loss function, which defines the feedback signal used for learning
  • The optimizer, which determines how learning proceeds

We use all of these components to build the model with the keras package.

We convert all categorical variables to numeric and plot a neural network with 2 hidden layers: the first hidden layer with 4 neurons and the second with 2 neurons.

#Preprocess data
loan_keras <-  loan_keras[, c(1, 3, 4, 2, 6, 5)]

# Convert Credit_History numeric type
loan_keras$Credit_History <- as.numeric(loan_keras$Credit_History)

# Change Variables values
loan_keras$Self_Employed <- ifelse(loan_keras$Self_Employed == "Yes", 1, 0)
loan_keras$Loan_Status <- ifelse(loan_keras$Loan_Status=="Y", 1, 0)

loan_keras$Property_Area <- case_when(
  loan_keras$Property_Area == "Semiurban" ~ 2,
  loan_keras$Property_Area == "Urban" ~ 1,
  loan_keras$Property_Area == "Rural" ~ 0,
)

#loan_keras <- loan_keras %>% mutate_if(is.factor, as.numeric) # convert factor to numeric

# Neural Network Visualization
n <- neuralnet(Loan_Status ~ Self_Employed+Credit_History+Property_Area+LoanAmount+TotalIncome,
               data = loan_keras,
               hidden = c(4,2),
               linear.output = F,
               lifesign = 'full',
               rep=1)

plot(n,
     col.hidden = 'darkgreen',
     col.hidden.synapse = 'darkgreen',
     show.weights = F,
     information = F,
     fill = 'lightblue', rep = "best")

The above model has 5 inputs and 2 hidden layers (4 neurons in the first, 2 in the second), connected to a single output neuron. We convert the dataset to matrix format and split it into train and test sets.

# Matrix
loan_keras <- as.matrix(loan_keras)
dimnames(loan_keras) <- NULL

We split the data into train and test sets, with 80% of the data for training and 20% for testing. This yields 4 objects: training and testing feature sets, plus trainingtarget and testingtarget, which hold the output variable (Loan_Status). The feature sets contain all variables other than the output variable.

# Partition
set.seed(18)
ind <- sample(2, nrow(loan_keras), replace = T, prob = c(.8, .2))
training_keras <- loan_keras[ind==1,1:5]
testing_keras <- loan_keras[ind==2, 1:5]
trainingtarget <- loan_keras[ind==1, 6]
testingtarget <- loan_keras[ind==2, 6]

We normalize the independent variables by subtracting each column’s mean and dividing by its standard deviation, with both statistics computed on the training set. The scale() function applies this to both the train and test sets.

# Normalize
m <- colMeans(training_keras)
s <- apply(training_keras, 2, sd)
training_keras <- scale(training_keras, center = m, scale = s)
testing_keras <- scale(testing_keras, center = m, scale = s)

We create the model with keras_model_sequential() and add layers with layer_dense(); the same function can be called repeatedly to stack multiple hidden layers. The hidden layers use the relu (rectified linear unit) activation function (the default activation is linear). We also define the input shape: we have 5 input variables, so the shape is 5.

set.seed(77)
#Create model
model_keras <- keras_model_sequential()
model_keras %>% 
         layer_dense(units = 4, activation = 'relu', input_shape = c(5)) %>%
         layer_dense(units = 2, activation = 'relu') %>%
         layer_dense(units = 1)
set.seed(77)
#Compile model
model_keras %>% compile(loss = 'mse',
                  optimizer = optimizer_rmsprop(lr= 0.005),
                  metrics = c("accuracy"))

We compile the model by specifying a loss, an optimizer, and metrics. Other arguments can be passed, but here we choose these 3. The output takes 2 values (1/0) and the inputs are all numeric variables.

  • loss = If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of objectives. The loss value that will be minimized by the model will then be the sum of all individual losses.
  • optimizer = an optimizer instance; the default is rmsprop (lr stands for learning rate)
  • metrics = List of metrics to be evaluated by the model during training and testing. Here we pass accuracy.
set.seed(77)
#Fit model
model_keras_train <- model_keras %>%
         fit(training_keras,
             trainingtarget,
             epochs = 100,
             batch_size = 32,
             validation_split = 0.1
             )
plot(model_keras_train)

Before fitting the training data (passing the training data to the model), the model requires compilation. The loss function, optimizer, and metrics are specified during this step.

To train the model, we pass:

  • training_keras = Vector, matrix, or array of training data
  • trainingtarget = Vector, matrix, or array of target (label) data
  • epochs = Number of epochs to train the model
  • batch_size = Integer or NULL. Number of samples per gradient update. If unspecified, batch_size will default to 32.
  • validation_split = Fraction of the training data to be used as validation data (between 0 and 1).

The first graph shows loss and the second shows accuracy. The difference between loss and val_loss is minimal throughout, which indicates low training error and little divergence between training and validation.

Let’s have a look at model evaluation.

set.seed(77)
#Evaluate model
model_keras %>% evaluate(testing_keras, testingtarget)
##      loss  accuracy 
## 0.1627313 0.7818182
pred_keras <- model_keras %>% predict(testing_keras)

The test feature and target sets can be used to evaluate the model with the evaluate() function, which takes two arguments referencing the feature and target test sets and reports the overall loss and accuracy.

The table below is the confusion matrix and accuracy on the test data.

# Confusion matrix and misclassification 
pred_keras_1 <- ifelse(pred_keras>0.5, 1, 0)


tab4 <- table(Predicted = pred_keras_1,
      Actual = testingtarget)

tab4
##          Actual
## Predicted  0  1
##         0 16  2
##         1 22 70
sum(diag(tab4))/sum(tab4)
## [1] 0.7818182

Accuracy is 78%; the model classifies the loan-approved (Yes) category well but shows poor performance on the No category.

We have used 6 neurons across 2 hidden layers, so there is scope for improvement. We will try a different neuron count along with layer dropout, which helps reduce overfitting.

We apply the same code but with a different neuron count and hidden-layer configuration.

Fine-Tuning the Model

Here we apply a different neuron count and a layer_dropout() layer. Increasing the number of neurons tends to raise accuracy, though this does not hold in all cases. The learning rate can also be adjusted during training to improve model performance.

set.seed(88)
#Create model
model_keras_2 <- keras_model_sequential()
model_keras_2 %>% 
         layer_dense(units = 100, activation = 'relu', input_shape = c(5)) %>%
         layer_dropout(rate = 0.4) %>%
         layer_dense(units = 1)
set.seed(88)
#Compile model
model_keras_2 %>% compile(loss = 'mse',
                  optimizer = optimizer_rmsprop(lr= 0.005),
                  metrics = c("accuracy"))
set.seed(88)
#Fit model
model_keras_train2 <- model_keras_2 %>%
         fit(training_keras,
             trainingtarget,
             epochs = 100,
             batch_size = 32,
             validation_split = 0.2
             )
plot(model_keras_train2)

set.seed(88)
#Evaluate model
model_keras_2 %>% evaluate(testing_keras, testingtarget)
##      loss  accuracy 
## 0.1618941 0.7909091
pred_keras_2 <- model_keras_2 %>% predict(testing_keras)

Below are the confusion matrix and the accuracy after model tuning.

# Confusion matrix and misclassification 
pred_keras_2 <- ifelse(pred_keras_2>0.5, 1, 0)

tab5 <- table(Predicted = pred_keras_2,
      Actual = testingtarget)

tab5
##          Actual
## Predicted  0  1
##         0 16  1
##         1 22 71
sum(diag(tab5))/sum(tab5)
## [1] 0.7909091

Accuracy improved slightly, from 78% to 79%. This model performs much like the first keras model: better on the Yes category and poorly on the No category, with 23 misclassifications in total.

Neural Networks with normalization

The dataset consists of 6 variables, of which LoanAmount and TotalIncome are numeric; the other variables are categorical factors. We implement the changes below:

  • Self_Employed (Yes/No) <- Yes = 1, No = 0
  • Credit_History <- data type factor to numeric
  • Property_Area <- Semiurban = 2, Urban = 1, Rural = 0
  • Loan_Status <- data type factor to numeric

To fit the neural nets we use the neuralnet package. First we normalize the variables so that all values fall between 0 and 1. We leave the output variable (Loan_Status) as-is because its values are already 0 or 1.

# Convert Credit_History numeric type
loan_nn$Credit_History <- as.numeric(loan_nn$Credit_History)

# Change Variables values
loan_nn$Self_Employed <- ifelse(loan_nn$Self_Employed == "Yes", 1, 0)
loan_nn$Loan_Status <- ifelse(loan_nn$Loan_Status=="Y", 1, 0)

loan_nn$Property_Area <- case_when(
  loan_nn$Property_Area == "Semiurban" ~ 2,
  loan_nn$Property_Area == "Urban" ~ 1,
  loan_nn$Property_Area == "Rural" ~ 0,
)

# plot small multiples
ggplot(data = reshape2::melt(loan_nn), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")

Normalization

To apply min-max normalization to each column, we subtract the column minimum and divide by the range: \(x' = \frac{x - \min(x)}{\max(x) - \min(x)}\). We then plot the density of each variable to confirm the normalization; the plotted ranges lie between 0 and 1.

# Min-Max Normalization (helper applies (x - min)/(max - min) to a column)
normalize <- function(x) (x - min(x))/(max(x) - min(x))
loan_nn$Self_Employed <- normalize(loan_nn$Self_Employed)
loan_nn$LoanAmount <- normalize(loan_nn$LoanAmount)
loan_nn$Credit_History <- normalize(loan_nn$Credit_History)
loan_nn$Property_Area <- normalize(loan_nn$Property_Area)
loan_nn$TotalIncome <- normalize(loan_nn$TotalIncome)

# plot small multiples
ggplot(data = reshape2::melt(loan_nn), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")

Model Building

We split the data 80/20, with 80% for training and 20% for testing, and then fit the model on the training data. We use base R’s sample() function to split the data into train and test sets.

# Data Partition
set.seed(11)
ind <- sample(2, nrow(loan_nn), replace = TRUE, prob = c(0.8, 0.2))
training_nn <- loan_nn[ind==1,]
testing_nn <- loan_nn[ind==2,]

dim(training_nn)
## [1] 500   6
dim(testing_nn)
## [1] 114   6

Dimension of the training and testing data:

Training data : 500, 6
Testing data : 114, 6

Parameters applied to the neural network model:

  • hidden - the number of hidden neurons/layers; we can vary this to find the optimal model
  • err.fct - the error function; the strings ‘sse’ and ‘ce’, which stand for sum of squared errors and cross-entropy, can be used.
  • linear.output - logical. If act.fct should not be applied to the output neurons, set linear.output to TRUE, otherwise FALSE.

We plot the network and discuss how its input layer, hidden layer, and output layer work. The plot below shows one hidden layer.

# Neural Networks
library(neuralnet)
set.seed(13)

n1 <- neuralnet(Loan_Status~Self_Employed+LoanAmount+Credit_History+Property_Area+TotalIncome,
               data = training_nn,
               hidden = 1,
               err.fct = "ce",
               linear.output = FALSE
               )

plot(n1, rep = "best")

Node Output Calculations with Sigmoid Activation Function

Here we discuss how the neuralnet algorithm calculates the output node, manually reproducing the network’s output for the first row of the training data.

The diagram above shows 5 input nodes, call them N1 through N5; the middle (hidden) layer is N6, and the output layer is N7. To calculate the output we first need N6’s output, because N6’s output is the input to N7; and to get N6’s output we first compute N6’s input.

Note: the sigmoid activation is \(\sigma(x) = \frac{1}{1+e^{-x}}\).

Each node’s input is the weighted sum \(f(x) = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n\).

# first row
head(training_nn[1,])
##   Self_Employed LoanAmount Credit_History Property_Area Loan_Status TotalIncome
## 1             0  0.1432706              1           0.5           1  0.05539355

Above is the first row of the training data. We manually calculate the first value of the output node and compare it with the trained model’s output.

in6 <- 10.47375 + (0*-5.36828) + (0.1432706*2.7186) + (1*-13.27065) + (0.5 * -1.77825) + (0.05539355 * 8.03783) #N6 input: bias plus weighted inputs, weights read off the plotted network
out6 <- 1/(1+exp(-in6)) #N6 output: sigmoid activation
in7 <- 1.71561 +(-4.21959*out6) #N7 input: bias plus weighted N6 output
out7 <- 1/(1+exp(-in7)) #N7 output: sigmoid activation
paste0("Node 7 Output : " ,out7)
## [1] "Node 7 Output : 0.815350328873002"
output <- compute(n1, training_nn[,-5])
paste0("First predicted value of NN : ", head(output$net.result, n = 1))
## [1] "First predicted value of NN : 0.815351067634266"

The model’s prediction and the manual calculation agree; the last few decimal places differ only because of rounding. This walk-through shows how the neural net’s predictions are calculated.

Feature Selection

Feature selection is an important part of most learning algorithms: it selects the most relevant features from the data. A simple method for feature selection with feedforward neural networks is to start with one input neuron and add one input at a time until the desired classification accuracy is achieved or all attributes have been chosen.
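A minimal sketch of this forward search as a loop (a hypothetical helper; in practice we performed the search manually):

#Sketch only: add features one at a time, keeping those that improve training accuracy
feats <- c("Credit_History", "Property_Area", "TotalIncome", "LoanAmount", "Self_Employed")
selected <- c()
best_acc <- 0
for (f in feats) {
    trial <- c(selected, f)
    fml <- as.formula(paste("Loan_Status ~", paste(trial, collapse = " + ")))
    set.seed(12)
    fit <- neuralnet(fml, data = training_nn, hidden = 1, err.fct = "ce", linear.output = FALSE)
    pred <- ifelse(compute(fit, training_nn[, trial, drop = FALSE])$net.result > 0.5, 1, 0)
    acc <- mean(pred == training_nn$Loan_Status)
    if (acc > best_acc) { selected <- trial; best_acc <- acc } #keep only improvements
}
selected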

From EDA, we found Credit_History to be the most important feature for loan approval, so we start with that feature and then add other relevant features until classification accuracy no longer improves significantly.

Testing and training data show similar accuracy. We also varied other parameters in search of the optimal model:

  • hidden layers ranging from 1 to 5, multi-layer configurations such as c(3,2) and c(2,1), and repetition (the rep parameter sets how many times the model is re-trained)

  • lifesign: specifies how much the function prints during the calculation of the neural network (‘none’, ‘minimal’ or ‘full’); we tried ‘full’.

  • algorithm: the algorithm type used to calculate the neural network; we tried ‘rprop+’, which refers to resilient backpropagation with weight backtracking.

The model below proved optimal among all those we tried.

set.seed(12)

n <- neuralnet(Loan_Status ~ Credit_History+Property_Area,
               data = training_nn,
               hidden = 1,
               err.fct = "ce",
               linear.output = FALSE
               )

plot(n, rep = 'best')

Let’s have a look at the confusion matrix and accuracy on the training dataset.

# Confusion Matrix & Misclassification Error - training data
output <- compute(n, training_nn[,-5])
p1 <- output$net.result
pred1 <- ifelse(p1>0.5, 1, 0)
tab1 <- table(Prediction =  pred1, Actuals = training_nn$Loan_Status)
tab1
##           Actuals
## Prediction   0   1
##          0  73   6
##          1  85 336
paste0("Misclassification Error of training data: ", round(100 - sum(diag(tab1))/sum(tab1)*100,2))
## [1] "Misclassification Error of training data: 18.2"
paste0("Accuracy of training data: ", round(sum(diag(tab1))/sum(tab1) * 100,2))
## [1] "Accuracy of training data: 81.8"

There are 91 misclassifications in total on the training data: 6 people whose loans were actually approved are predicted as not approved, and 85 people whose loans were not approved are predicted as approved. On the training set, the model performs better on the loan-approved class and poorly on the not-approved class.

Confusion matrix and accuracy for the test dataset:

# Confusion Matrix & Misclassification Error - testing data
output <- compute(n, testing_nn[,-5])
p2 <- output$net.result
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(Prediction = pred2, Actual = testing_nn$Loan_Status)
tab2
##           Actual
## Prediction  0  1
##          0 17  1
##          1 17 79
paste0("Misclassification Error of testing data: ", round(100 - sum(diag(tab2))/sum(tab2)*100,2))
## [1] "Misclassification Error of testing data: 15.79"
paste0("Accuracy of testing data: ", round(sum(diag(tab2))/sum(tab2)*100,2))
## [1] "Accuracy of testing data: 84.21"

Training misclassification is 18.2% and testing misclassification is 15.79%, so the neural net shows reasonable consistency between training and testing data. On the test set the model again favors the loan-approved class: only 1 person with an actually approved loan is predicted as not approved, while 17 people whose loans were not approved are predicted as approved. In short, the model performs well on the loan-approved class and poorly on the not-approved class.

Logistic Regression

Logistic Regression is a powerful statistical way of modeling a binomial outcome with one or more explanatory variables.

For the loan dataset, Loan_Status is the dependent variable and the other variables are independent. To perform logistic regression we first check whether the data is balanced, then split it into train and test sets, fit the model on the training data, and generate predictions on the test data. Finally we examine performance metrics to see how the model is working.

# Convert Credit_History numeric type
loan_new2$Credit_History <- as.numeric(loan_new2$Credit_History)

# Change Variables values
loan_new2$Self_Employed <- ifelse(loan_new2$Self_Employed == "Yes", 1, 0)
loan_new2$Loan_Status <- ifelse(loan_new2$Loan_Status=="Y", 1, 0)

loan_new2$Property_Area <- case_when(
  loan_new2$Property_Area == "Semiurban" ~ 2,
  loan_new2$Property_Area == "Urban" ~ 1,
  loan_new2$Property_Area == "Rural" ~ 0,
)
loan_new2 <- subset(loan_new2, select = c(Self_Employed, LoanAmount, Credit_History, 
                                          Property_Area, Loan_Status, TotalIncome))
table(loan_new2$Loan_Status)
## 
##   0   1 
## 192 422

The table above shows the data is not balanced: there are more Yes than No observations. We therefore down-sample the majority class and fit logistic regression on the resulting training dataset.

# under sample 
set.seed(1)
smpl1 <- loan_new2 %>% filter(Loan_Status == 1) %>% sample_n(size = 192)
smpl2 <- loan_new2 %>% filter(Loan_Status == 0) 
smpl_192 <- rbind(smpl1, smpl2)

table(smpl_192$Loan_Status)
## 
##   0   1 
## 192 192

LR Model building

Performing an 80-20 split allocates 80% of our data for training the model and 20% for testing it. We use glm() from the stats package to perform logistic regression.

# Data split
set.seed(5)
sample = sample.split(smpl_192$Loan_Status, SplitRatio = 0.80)

loan_train = subset(smpl_192, sample == TRUE)
loan_test = subset(smpl_192, sample == FALSE)

dim(loan_train)
## [1] 308   6
dim(loan_test)
## [1] 76  6

Binomial Model 1

This model includes all the features.

set.seed(6)
bi_model1 <- glm(Loan_Status ~ ., data = loan_train, family=binomial(link="logit"))
summary(bi_model1)
## 
## Call:
## glm(formula = Loan_Status ~ ., family = binomial(link = "logit"), 
##     data = loan_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8528  -0.9975   0.2861   0.8709   2.7495  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.470e+00  6.702e-01  -5.177 2.25e-07 ***
## Self_Employed   2.345e-01  4.030e-01   0.582  0.56068    
## LoanAmount     -7.316e-04  1.800e-03  -0.406  0.68444    
## Credit_History  3.996e+00  6.136e-01   6.512 7.43e-11 ***
## Property_Area   4.772e-01  1.651e-01   2.890  0.00385 ** 
## TotalIncome    -3.904e-05  2.677e-05  -1.458  0.14472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 426.98  on 307  degrees of freedom
## Residual deviance: 310.88  on 302  degrees of freedom
## AIC: 322.88
## 
## Number of Fisher Scoring iterations: 6

From the model we see AIC: 322.88 and residual deviance: 310.88; lower AIC and residual deviance indicate a better model. Credit_History and Property_Area are the only statistically significant predictors.

Confusion Matrix (CM) and accuracy

The confusion matrix avoids “confusion” by measuring the actual and predicted values in a tabular format. Test the model using loan_test.

pred_1 <- predict(bi_model1, loan_test, type = "response") #predicted probabilities
pred_1 <- if_else(pred_1 > 0.5, 1, 0) #threshold at 0.5

cm <- table(true = loan_test$Loan_Status, pred_1)

print(cm)
##     pred_1
## true  0  1
##    0 24 14
##    1  7 31
TP <- cm[1]
FP <- cm[2]
FN <- cm[3]
TN <- cm[4]

# accuracy 
accuracy_bimod1 <- (TP + TN)/(TN + FN + TP + FP)
paste0("Accuracy : " ,round(accuracy_bimod1 * 100, 2))
## [1] "Accuracy : 72.37"

The model shows 21 misclassifications and predicts poorly for both the approved and not-approved classes.

The accuracy is not promising. A correlation plot helps exclude variables that are correlated with each other; from the correlation plot we see a strong relationship between TotalIncome and LoanAmount. We therefore keep only the statistically significant variables (those with low p-values): Credit_History and Property_Area. We build our second LR model with these two variables.

Binomial Model 2

This model includes just two variables: Credit_History and Property_Area.

set.seed(7)
bi_model2 <- glm(Loan_Status ~ Credit_History + Property_Area, data = loan_train, family=binomial(link="logit"))
summary(bi_model2)
## 
## Call:
## glm(formula = Loan_Status ~ Credit_History + Property_Area, family = binomial(link = "logit"), 
##     data = loan_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6719  -1.2520   0.2700   0.9189   2.7537  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.7686     0.6281  -6.000 1.98e-09 ***
## Credit_History   3.9424     0.6101   6.462 1.03e-10 ***
## Property_Area    0.4699     0.1622   2.897  0.00376 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 426.98  on 307  degrees of freedom
## Residual deviance: 314.76  on 305  degrees of freedom
## AIC: 320.76
## 
## Number of Fisher Scoring iterations: 5

From the second model, AIC: 320.76 and residual deviance: 314.76; the AIC improves slightly on Model 1.

Confusion Matrix (CM) and accuracy

pred_2 <- predict(bi_model2, loan_test, type = "response") #predicted probabilities
pred_2 <- if_else(pred_2 > 0.5, 1, 0) #threshold at 0.5

cm <- table(true = loan_test$Loan_Status, pred_2)

print(cm)
##     pred_2
## true  0  1
##    0 23 15
##    1  5 33
TP <- cm[1]
FP <- cm[2]
FN <- cm[3]
TN <- cm[4]


# accuracy 
accuracy_bimod2 <- (TP + TN)/(TN + FN + TP + FP)
paste0("Accuracy : " ,round(accuracy_bimod2 * 100, 2))
## [1] "Accuracy : 73.68"

Accuracy improved from 72% to 74%. The model has 20 misclassifications and still predicts imperfectly for both classes, though of the two, the not-approved class is now handled somewhat better than the approved class.

K-Nearest Neighbors Classification

The K-Nearest Neighbors Classification algorithm classifies observations by allowing the K observations in the training set that are nearest to the new observation to “vote” on its class (a toy sketch follows the list below). More specifically, the algorithm works as follows:

  • Calculate the distance between the new observation and EACH observation in the training set.
  • Distances are calculated within the feature space.
  • Select the K observations in the training set that are nearest to the new observation.
  • Determine the majority class within the K nearest neighbors.
  • Classify the new observation as this majority class.
  • If there is a tie in the vote, resolve it according to some pre-determined process.
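A toy, base-R sketch of this vote (illustrative only; the models below use the knn() function from the class package):

#Sketch only: brute-force KNN majority vote
euclid <- function(a, b) sqrt(sum((a - b)^2))
knn_vote <- function(train_x, train_y, new_x, k = 3) {
    d <- apply(train_x, 1, euclid, b = new_x) #distance to every training row
    nbr <- train_y[order(d)[1:k]] #labels of the k nearest neighbors
    names(which.max(table(nbr))) #majority class (ties: first maximum)
}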

The dataset has 6 columns and 614 observations, with Loan_Status as the target variable. The model predicts loan approval based on employment type, credit history, property area, loan amount, and total income.

Data preparation for kNN

We will now use the knn() function from the class package to build K-Nearest Neighbors models for the loan dataset.

For data preparation we convert categorical variables to numeric, since kNN works with numeric data and generally performs poorly when features are on different scales. For many machine learning algorithms it is important that features be on roughly the same scale before training, and this is especially true of distance-based algorithms such as K-Nearest Neighbors. We apply standard scaling: each column in the training set is scaled to have a mean of zero and a standard deviation of 1. If \(x_i\) is a single observation of a particular feature, its scaled value \(z_i\) is given by:

\(z_i = \frac{x_i - \bar{x}}{s_x}\)

After scaling, we split the data 80/20: train data - 80%, test data - 20%.

We create two pairs of datasets, one from the actual (unscaled) data and one from the scaled data:

  • Actual data - train set: train, test set: test
  • Scaled data - train set: training_knn, test set: testing_knn

The idea here is to show kNN model performance on both the actual and the scaled data.

loan_knn <-  loan_knn[, c(1, 3, 4, 2, 6, 5)]

# Convert Credit_History numeric type
loan_knn$Credit_History <- as.numeric(loan_knn$Credit_History)

# Change Variables values
loan_knn$Self_Employed <- ifelse(loan_knn$Self_Employed == "Yes", 1, 0)
loan_knn$Loan_Status <- ifelse(loan_knn$Loan_Status=="Y", 1, 0)

loan_knn$Property_Area <- case_when(
  loan_knn$Property_Area == "Semiurban" ~ 2,
  loan_knn$Property_Area == "Urban" ~ 1,
  loan_knn$Property_Area == "Rural" ~ 0,
)

# Partition
set.seed(26)
ind <- sample(2, nrow(loan_knn), replace = T, prob = c(.8, .2))
training_knn <- loan_knn[ind==1,1:5]
testing_knn <- loan_knn[ind==2, 1:5]
trainingtarget_knn <- loan_knn[ind==1, 6]
testingtarget_knn <- loan_knn[ind==2, 6]

# unscaled data
train <- loan_knn[ind==1,1:5]
test <- loan_knn[ind==2, 1:5]


# Normalize scales data
m <- colMeans(training_knn)
s <- apply(training_knn, 2, sd)
training_knn <- scale(training_knn, center = m, scale = s)
testing_knn <- scale(testing_knn, center = m, scale = s)

Selecting K

We will now build several KNN models. For each K from 1 to 100, we calculate training and validation accuracy on both the unscaled and the scaled data.

set.seed(26)

train_acc <- c()
valid_acc <- c()
train_acc_sc <- c()
valid_acc_sc <- c()

k_range <- 1:100
for (i in k_range) {
  #  Unscaled data
  set.seed(26)
  train_pred_a <- knn(train, train, trainingtarget_knn, k=i)
  train_acc <- c(train_acc, mean(train_pred_a == trainingtarget_knn))
  
  set.seed(26)
  valid_pred_a <- knn(train, test, trainingtarget_knn, k=i)
  valid_acc <- c(valid_acc, mean(valid_pred_a == testingtarget_knn))
  
  # Standard Scaling
  set.seed(26)
  train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k=i)
  train_acc_sc <- c(train_acc_sc, mean(train_pred == trainingtarget_knn))
  
  set.seed(26)
  valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k=i)
  valid_acc_sc <- c(valid_acc_sc, mean(valid_pred == testingtarget_knn))
}
print(max(valid_acc))
## [1] 0.75
print(max(valid_acc_sc))
## [1] 0.8712121

Training and Validation Curves for Actual (Unscaled) Data

plot(k_range, train_acc, pch='.', ylim=c(0.55, 1), col='salmon', ylab="")
lines(k_range, train_acc, lwd=2, col='salmon')
lines(k_range, valid_acc, lwd=2, col='cornflowerblue')
legend(65, 1, legend=c("Training Acc", "Validation Acc"),
       col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)

Training and Validation Curves for Standard Scaled Data

plot(k_range, train_acc_sc, pch='.', ylim=c(0.55, 1), col='salmon', ylab="")
lines(k_range, train_acc_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_sc, lwd=2, col='cornflowerblue')
legend(65, 1, legend=c("Training Acc", "Validation Acc"),
       col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)

Selecting the Final Model

Our best validation performance was obtained using scaled data. Let’s first determine the value of K behind the best unscaled model, then do the same for the scaled model.

acc <- which.max(valid_acc)
acc
## [1] 15

We will now recalculate the training and validation accuracies for this model, given k = 15. We can use the table function to create a quick confusion matrix.

set.seed(26)
train_pred <- knn(train, train, trainingtarget_knn, k=acc)
train_acc <- mean(train_pred == trainingtarget_knn)
  
set.seed(26)
valid_pred <- knn(train, test, trainingtarget_knn, k=acc)
valid_acc <- mean(valid_pred == testingtarget_knn)

cat('Training Accuracy:   ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
## Training Accuracy:   0.6804979
## Validation Accuracy: 0.75
tab_knn <-  table(Predicted = valid_pred, Actual = testingtarget_knn)
tab_knn
##          Actual
## Predicted  0  1
##         0  4  4
##         1 29 95

Confusion matrix and model prediction on Scaled data

The confusion matrix shows 33 misclassifications in total on the test data. The model predicts the Yes class far better than the No class; most misclassifications occur in the No class.

acc_sc <- which.max(valid_acc_sc)
acc_sc
## [1] 10
set.seed(26)
train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k=acc_sc)
train_acc <- mean(train_pred == trainingtarget_knn)
  
set.seed(26)
valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k=acc_sc)
valid_acc <- mean(valid_pred == testingtarget_knn)

cat('Training Accuracy:   ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
## Training Accuracy:   0.813278
## Validation Accuracy: 0.8712121
tab_knn_sc <-  table(Predicted = valid_pred, Actual = testingtarget_knn)
tab_knn_sc
##          Actual
## Predicted  0  1
##         0 16  0
##         1 17 99

The scaled model has high overall accuracy and classifies the Yes category perfectly (99 of 99), but it misclassifies roughly half of the No category (17 of 33), so it is not a model we would recommend for predicting loan rejections.

Decision Tree

Decision trees build classification or regression models in the form of a tree structure. Datasets are broken into smaller and smaller subsets along chosen parameters, as the associated decision tree is developed.

The result is a tree with nodes and branches. Nodes denote split points based on the attribute in question, while branches denote the corresponding outcomes. We start at a “root node”, terminate at “leaf nodes”, and use the leaf nodes to provide proportions for the resulting class labels.

From EDA with the Boruta package, we found the most important features. The table shows Credit_History has the highest importance, TotalIncome the second highest, and LoanAmount the third. We also found a strong correlation between TotalIncome and LoanAmount; however, decision trees and boosted trees are by nature robust to multicollinearity, since a tree choosing a split will pick only one of a set of highly correlated features. So multicollinearity should not be an issue here.

# feature impotance
boruta_output = Boruta(Loan_Status ~ ., data = loan_dt, doTrace = 0, maxRuns = 1000)  
roughFixMod = TentativeRoughFix(boruta_output)
importance = attStats(TentativeRoughFix(boruta_output))
importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
kable(head(importance[order(-importance$meanImp), ]), 
      caption = "Feature Importance of Loan Data") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
Feature Importance of Loan Data

                  meanImp   decision
Credit_History  94.485237  Confirmed
TotalIncome     16.222849  Confirmed
LoanAmount      13.022106  Confirmed
Self_Employed    4.175202  Confirmed

Data preparation

We split the data 80/20 into training and testing datasets. Using the rpart package (short for Recursive Partitioning and Regression Trees), we fit a classification tree and plot the resulting model.

set.seed(44)

# Split data into training and testing sets
sample_data = sample.split(loan_dt$Loan_Status, SplitRatio = 0.8) #sample.split expects the outcome vector
train_data <- subset(loan_dt, sample_data == TRUE)
test_data <- subset(loan_dt, sample_data == FALSE)


# Plot tree
tree_mod <- rpart(Loan_Status ~ ., data = train_data, method="class", 
                  control = rpart.control(minsplit = 4, 
                                          minbucket =  2, 
                                          cp = 0, 
                                          maxdepth = 6))
rpart.plot(tree_mod, extra=1, cex=0.8)

Baseline Model

train_pred <- predict(tree_mod, train_data, type="class")
test_pred <- predict(tree_mod, test_data, type="class")

paste("Confusin matrix of Test data")
## [1] "Confusin matrix of Test data"
table(Predicted = test_pred, Actual = test_data$Loan_Status)
##          Actual
## Predicted   N   Y
##         N  38  11
##         Y  32 124
cat('Training Accuracy: ', mean(train_pred == train_data$Loan_Status), '\n',
   'Test Set Accuracy: ', mean(test_pred == test_data$Loan_Status), sep='')
## Training Accuracy: 0.8533007
## Test Set Accuracy: 0.7902439

The confusion matrix on the test data shows 79% accuracy, with 43 misclassifications in total. The model is somewhat better at predicting the Yes category than the No category.

The next step is to prune the tree to its optimal size, which may improve model accuracy. To optimize the model we need to find the optimal depth and complexity parameter; their calculation is shown below.

Cross-Validation for Model Evaluation

The appropriate depth can be determined by evaluating the tree on held-out data via cross-validation. Essentially, cross-validation lets us alternate between training and testing so that a relatively small dataset still yields a reliable error estimate.

set.seed(1)

bc_tree_cv <- train(Loan_Status ~ ., loan_dt, method="rpart2",
                 trControl = trainControl(method="cv", number=10),
                 tuneGrid = expand.grid(maxdepth=c(6)),
                 control = rpart.control(minsplit = 4,
                                         minbucket =  2,
                                         cp = 0))
bc_tree_cv
## CART 
## 
## 614 samples
##   5 predictor
##   2 classes: 'N', 'Y' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 553, 553, 553, 551, 553, 553, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8159099  0.5129139
## 
## Tuning parameter 'maxdepth' was held constant at a value of 6

Accuracy : 0.8159099
Kappa : 0.5129139

Cross-Validation for Tuning Maximum Depth

In every classification technique there are parameters that can be tuned to optimize the classification. For a decision tree, these include the maximum depth of the tree and the complexity parameter (cp).
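For reference, cp drives cost-complexity pruning: conceptually, the pruned tree minimizes \(R_\alpha(T) = R(T) + \alpha\,|T|\), where \(R(T)\) is the tree’s misclassification cost, \(|T|\) is its number of terminal nodes, and \(\alpha\) is the complexity penalty that cp parameterizes.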

set.seed(1)

bc_tree_cv_md <- train(Loan_Status ~ ., loan_dt, method="rpart2", 
                       trControl = trainControl(method="cv", number=10),
                       tuneGrid = expand.grid(maxdepth=1:20),
                       control = rpart.control(minsplit = 4, 
                                               minbucket =  2, 
                                               cp = 0))

best_ix = which.max(bc_tree_cv_md$results$Accuracy)
bc_tree_cv_md$results[best_ix, ]
##   maxdepth Accuracy     Kappa AccuracySD   KappaSD
## 2        2  0.82244 0.5260808 0.03329859 0.1025391
plot(bc_tree_cv_md$results$maxdepth, bc_tree_cv_md$results$Accuracy, ylab = "Accuracy (cross -Validation)", xlab = "Max Tree Depth")
lines(1:20, bc_tree_cv_md$results$Accuracy)
abline(v=which.max(bc_tree_cv_md$results$Accuracy), col="red", lty=2, lwd=1)

The plot shows max tree depth vs. accuracy: greater depth reduces accuracy, with the highest accuracy at depth 2.

Cross-Validation for Tuning Complexity Parameter

set.seed(1)

bc_tree_cv_cp <- train(Loan_Status ~ ., loan_dt, method="rpart", 
                       trControl = trainControl(method="cv", number=10),
                       tuneGrid = expand.grid(cp=seq(0, 0.1, 0.001)),
                       control = rpart.control(minsplit = 4, 
                                               minbucket =  2, 
                                               maxdepth = 30))

best_ix = which.max(bc_tree_cv_cp$results$Accuracy)
bc_tree_cv_cp$results[best_ix, ]
##       cp  Accuracy   Kappa AccuracySD    KappaSD
## 36 0.035 0.8224392 0.52012  0.0305313 0.09393886
plot(bc_tree_cv_cp, pch="")

The plot shows the complexity parameter vs. accuracy; the cp value with the highest accuracy is 0.035.

Final Model DT

With the optimal depth and cp values found above, we can build our final model. The model splits primarily on Credit_History, and the tree looks convincing: it matches the feature importance table, where Credit_History dominates the other features.

tree_mod_final <- rpart(Loan_Status ~ ., data = train_data, method="class", 
                  control = rpart.control(minsplit = 2, 
                                          minbucket =  2, 
                                          cp = 0.035, 
                                          maxdepth = 2))

rpart.plot(tree_mod_final, extra=1, cex=0.8)

Confusion Matrix and accuracy

loanPred <- predict(tree_mod_final, test_data, type = "class")

tab_dt <- table(Predicted = loanPred, Actual = test_data$Loan_Status)

tab_dt
##          Actual
## Predicted   N   Y
##         N  35   3
##         Y  35 132
paste0("Accuracy : ", round(sum(diag(tab_dt))/sum(tab_dt) * 100,2))
## [1] "Accuracy : 81.46"

Accuracy has improved to 81%. There are 38 misclassifications in the test data: 3 loans that were actually approved are predicted as not approved, and 35 loans that were not approved are predicted as approved. The model predicts the loan-approved (Yes) category well and the No category poorly.

Comparing the baseline and final models, the baseline model is arguably better despite its lower accuracy, because the final model overcommits to the loan-approved (Yes) class.

Random Forest

To expand on our analysis of the loan approval dataset, we run a few random forest models to see whether they are a more effective way to classify whether or not a loan will be approved.

Data Preparation

We split the dataset into training (80%) and test (20%) sets. Random forest is a tree-based model and does not require feature scaling; one advantage of decision-tree-based methods like random forests is their ability to natively handle categorical predictors without first transforming them (e.g., via feature engineering techniques).

set.seed(66)

# Split data into training and testing sets
sample_data_rf = sample.split(loan_rf$Loan_Status, SplitRatio = 0.8) #sample.split expects the outcome vector
train_data_rf <- subset(loan_rf, sample_data_rf == TRUE)
test_data_rf <- subset(loan_rf, sample_data_rf == FALSE)

The resulting training set contains roughly 80% of the 614 observations and the testing set the remainder, each with 6 variables.

Random Forest Model

We use the randomForest package to build the model, fitting a forest with all variables as predictors.

set.seed(66)

# fit a Random Forest model with training data
model_rf <- randomForest(Loan_Status ~ .,
                           data=train_data_rf)
# display model details
model_rf
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = train_data_rf) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.59%
## Confusion matrix:
##    N   Y class.error
## N 56  70  0.55555556
## Y 14 268  0.04964539

Error Rate

This plot shows the class error rates of the random forest model. As the number of trees increases, the error rate for the Y class approaches zero; for the N class, the error rate drops over the first few trees and then remains roughly constant.

plot(model_rf, main = "Error rate of random forest")
legend("topright", cex =1, legend=colnames(model_rf$err.rate), lty=c(1,2,3), col=c(1,2,3))

Random Forest Evaluation

Predict the Test set results.

set.seed(66)

pred_rf <- predict(model_rf, test_data_rf, type="class")
tab_rf <- table(Predicted = pred_rf, Actual = test_data_rf$Loan_Status)
tab_rf
##          Actual
## Predicted   N   Y
##         N  44   3
##         Y  22 137
pred_rf <- as.factor(pred_rf)
test_data_rf$Loan_Status <- as.factor(test_data_rf$Loan_Status)
paste0("Accuracy : ",round(sum(diag(tab_rf))/sum(tab_rf)*100,2))
## [1] "Accuracy : 87.86"
RF_Model_1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$byClass
acc1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_1 <- data.frame(RF_Model_1)
RF_Model_1 <- rbind("Accuracy" = acc1, RF_Model_1)

Importance of Variables

The mean decrease in Gini coefficient measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest: variables that produce nodes with higher purity show a larger decrease in the Gini coefficient. The importance chart below displays the variables that affected the random forest, ordered from greatest impact (top) to least (bottom).
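For reference, the Gini impurity of a node with class proportions \(p_k\) is \(G = 1 - \sum_k p_k^2\); each split contributes the impurity decrease it achieves, accumulated per variable across all trees in the forest.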

varImpPlot(model_rf, main = "Feature Importance of RF Model")

Tuning on Number of Trees

m_opt_ntrees <- which.min(model_rf$err.rate[,'OOB'])
m_opt_err_rate <- min(model_rf$err.rate[,'OOB'])

cat("Optimal Number of Trees: ", m_opt_ntrees, "\n",
    "Minimum Error Rate:      ", m_opt_err_rate, sep="")
## Optimal Number of Trees: 51
## Minimum Error Rate:      0.1936275

Create another random forest model using Optimal Number of Trees.

RF Model 2 (Optimal tree)
set.seed(66)
model_rf_2 <- randomForest(Loan_Status ~ .,
                           data=train_data_rf, ntree=m_opt_ntrees, importance=TRUE)

Test data Confusion matrix and Accuracy

Predict the test data.

set.seed(66)

pred_rf_2 <- predict(model_rf_2, test_data_rf, type="class")
tab_rf_2 <- table(Predicted = pred_rf_2, Actual = test_data_rf$Loan_Status)
tab_rf_2
##          Actual
## Predicted   N   Y
##         N  44   4
##         Y  22 136
pred_rf_2 <- as.factor(pred_rf_2)
paste0("Accuracy : ",round(sum(diag(tab_rf_2))/sum(tab_rf_2)*100,2))
## [1] "Accuracy : 87.38"
RF_Model_2 <- confusionMatrix(pred_rf_2, test_data_rf$Loan_Status)$byClass
acc2 <- confusionMatrix(pred_rf_2, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_2 <- data.frame(RF_Model_2)
RF_Model_2 <- rbind("Accuracy" = acc2, RF_Model_2)

Tuning on mtry and ntree

Create another RF model with tuning mtry and ntree.

oob_acc_list <- c()
opt_ntree_list <- c()

for(i in 1:29){
  set.seed(66)
  temp_mod <- randomForest(Loan_Status ~ .,train_data_rf , ntree=500, importance=TRUE, mtry=i)
  oob_acc_list <- c(oob_acc_list, min(temp_mod$err.rate[,'OOB']))
  opt_ntree_list <- c(opt_ntree_list, which.min(temp_mod$err.rate[,'OOB']))
}

opt_mtry <- which.min(oob_acc_list)
opt_ntree <- opt_ntree_list[opt_mtry]
min_oob_acc <- min(oob_acc_list)

cat("Optimal Value of mtry:  ", opt_mtry, "\n",
    "Optimal Value of ntree: ", opt_ntree, "\n",
    "Minimum OOB Accuracy:   ", min_oob_acc, sep="")
## Optimal Value of mtry:  2
## Optimal Value of ntree: 80
## Minimum OOB Accuracy:   0.1911765
plot(1:29, oob_acc_list, xlab="Value of mtry", ylab="Minimum OOB Error Rate")
lines(1:29, oob_acc_list)
abline(v=which.min(oob_acc_list), col="red", lty=2, lwd=1)

RF Model 3 (Optimal mtry and ntree)
set.seed(66)

model_rf_3 <- randomForest(Loan_Status ~ .,
                           data=train_data_rf, ntree=opt_ntree, mtry=opt_mtry, importance=TRUE)

Test data Confusion matrix and Accuracy

Predict the test data.

set.seed(66)

pred_rf_3 <- predict(model_rf_3, test_data_rf, type="class")
tab_rf_3 <- table(Predicted = pred_rf_3, Actual = test_data_rf$Loan_Status)
tab_rf_3
##          Actual
## Predicted   N   Y
##         N  44   5
##         Y  22 135
pred_rf_3 <- as.factor(pred_rf_3)
paste0("Accuracy : ",round(sum(diag(tab_rf_3))/sum(tab_rf_3)*100,2))
## [1] "Accuracy : 86.89"
RF_Model_3 <- confusionMatrix(pred_rf_3, test_data_rf$Loan_Status)$byClass
acc3 <- confusionMatrix(pred_rf_3, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_3 <- data.frame(RF_Model_3)
RF_Model_3 <- rbind("Accuracy" = acc3, RF_Model_3)

Model Comparison

We compare all 3 models: the first RF model (baseline), the second RF model (tuned ntree), and the third RF model (tuned mtry and ntree).

tabularview <- data.frame(RF_Model_1, RF_Model_2, RF_Model_3)
kable(tabularview) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down") %>% column_spec(1, bold = T)
                      RF_Model_1  RF_Model_2  RF_Model_3
Accuracy               0.8786408   0.8737864   0.8689320
Sensitivity            0.6666667   0.6666667   0.6666667
Specificity            0.9785714   0.9714286   0.9642857
Pos Pred Value         0.9361702   0.9166667   0.8979592
Neg Pred Value         0.8616352   0.8607595   0.8598726
Precision              0.9361702   0.9166667   0.8979592
Recall                 0.6666667   0.6666667   0.6666667
F1                     0.7787611   0.7719298   0.7652174
Prevalence             0.3203883   0.3203883   0.3203883
Detection Rate         0.2135922   0.2135922   0.2135922
Detection Prevalence   0.2281553   0.2330097   0.2378641
Balanced Accuracy      0.8226190   0.8190476   0.8154762

The above table shows the comparative statistics of each model’s performance.

All 3 random forest models have accuracy above 85%. Each performs poorly on the loan-rejected class and very well on the loan-accepted class, likely due to the class imbalance in the data (the count of Y is higher than N). Other models may predict the rejected class better.

Model Selection

Final Model Discussion / Interpretation


Model Selection / Comparison

For this section we could re-include the HW3 models, improved based on the Professor’s feedback:

  1. KNN: justify the k-value actually picked and re-apply the model to the loan data (rather than the penguin data).
  2. DT: handle categorical variables natively, explore a longer tree, and show variable importance.
  3. RF: derive a different variable (i.e., combine applicant income and co-applicant income into total income), speak in greater depth about the higher class error in predicting the ‘no’ class, and discuss the model’s strengths and weaknesses there.


Comparison Table


Conclusion

Findings

Next Steps


Appendices