The purpose of our Final Project was to explore the application of neural networks to loan approval data and then compare model performance against a variety of classification algorithms (i.e., KNN, decision trees, random forests, and GBM).
Neural networks form the basis of Deep Learning, an extension of Machine Learning, where algorithms are inspired by the structure of the human brain. They take in data, train themselves to recognize patterns therein, and then predict outputs for similar, unseen data.
download.file(
url='https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png',
destfile='image1.jpg',
mode='wb')
knitr::include_graphics(path='image1.jpg')

Neural networks are made up of layers of nodes: an input layer, one or more hidden layers, and an output layer. Nodes are interconnected with associated weights and thresholds. When a node's output is above its specified threshold, the node is activated and data is sent to the next layer of the network; otherwise, data is not fed forward.
The power of a neural network lies in its ability to fine-tune itself over many iterations. Back-propagation adjusts the weights based on the magnitude of error at the output layer, and this continuous refinement steadily improves predictive accuracy.
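In symbols, each weight is nudged against the gradient of the error (a standard formulation, where \(E\) is the error at the output layer and \(\eta\) is the learning rate): \(w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}\).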
We’ll start by (re) exploring and preparing the loan dataset, progress to building our neural network model, and then compare and contrast loan approval status prediction accuracy for our neural network model vs. decision tree, random forest, and gradient boosting models.
A loan is a transfer of money from one party to another under the expectation that the lender (loan giver) will be repaid in full, with interest, by the borrower (loan receiver).
The lender's profit comes from the interest paid by the borrower, so as a core part of their business model, banks and credit companies alike need to be able to depend on their loans (and interest) being repaid in full.
With this motivation in mind, we (re)explore and prepare our loan approval dataset in order to construct a more precise neural network model later on. Since we've explored this data before, we build upon the core takeaways of our past exploration while pushing our understanding to a deeper level. Rudimentary (early) EDA steps are summarized or moved to the Appendix in favor of output that provides greater context and insight.
Prior to commencing EDA, we revisit the corresponding data directory:
- Loan_ID: unique loan ID
- Gender: applicant gender (Male/Female)
- Married: applicant marriage status (Yes/No)
- Dependents: number of dependents for applicant (0, 1, 2, 3+)
- Education: applicant college education status (Graduate/Not Graduate)
- Self_Employed: applicant self-employment status (Yes/No)
- ApplicantIncome: applicant income level
- CoapplicantIncome: co-applicant income level (if applicable)
- LoanAmount: loan amount requested (in thousands)
- Loan_Amount_Term: loan term (in months)
- Credit_History: credit history meets guidelines (1/0)
- Property_Area: property location (Urban/Semiurban/Rural)
- Loan_Status: loan approved (Yes/No); target variable

To start, we load in our data, replace empty strings with NAs, observe the first 6 observations of our dataset to refamiliarize ourselves with the format of our data, and then use the glimpse() and summary() functions to revisit data types and value ranges.
We're dealing with a 614 observation × 13 variable dataframe, with Loan_Status as our dependent, categorical variable; ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History as independent, numeric variables; and Loan_ID, Gender, Married, Dependents, Education, Self_Employed, and Property_Area as independent, categorical variables.
From the above, we conclude that Loan_ID can be dropped, ApplicantIncome and CoapplicantIncome can be combined into a TotalIncome variable, and observations labeled "3+" in Dependents should be relabeled "3" so that the data follows a consistent format and imputation can be performed as a next step.
We pre-process our data (as described above), visualize and handle NA values:
#Pre-process dataset for easier interpretation
loan <- subset(loan, select = -c(1) ) #drop Loan_ID from consideration
loan$TotalIncome <- loan$CoapplicantIncome + loan$ApplicantIncome #create TotalIncome variable
loan <- subset(loan, select = -c(6,7) ) #drop CoapplicantIncome and ApplicantIncome
loan$Dependents <- revalue(loan$Dependents, c("3+"="3")) #relabel Dependents "3+" value as "3"
#Visualize NA counts
colSums(is.na(loan))
## Gender Married Dependents Education
## 13 3 15 0
## Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 32 22 14 50
## Property_Area Loan_Status TotalIncome
## 0 0 0
We re-assigned the "3+" value of the Dependents variable to provide consistent leveling and enable pmm (predictive mean matching). Predictive mean matching calculates a predicted value for our target variable and, for each missing value, forms a small set of "candidate donors" from the complete cases whose predicted values are closest to that of the missing entry. A donor is then randomly chosen from the candidates and its observed value imputed where the value was missing. To apply pmm we assume that the distribution of the missing cells matches that of the observed data, so the approach becomes more limited as the percentage of missing values grows.
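The imputation chunk itself lives in the Appendix; a minimal sketch of the idea with the mice package (relying on mice's defaults, which use pmm for numeric columns) looks like:
#Sketch: impute missing values with predictive mean matching (actual chunk in Appendix)
library(mice)
set.seed(123) #for reproducibility
imp <- mice(loan, m = 5, printFlag = FALSE) #pmm is the default method for numeric columns
loan <- complete(imp) #extract a completed (imputed) dataset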
Once we’ve imputed missing values, we verify whether our operation was successful:
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 0 0 0 0
## Property_Area Loan_Status TotalIncome
## 0 0 0
Imputation was a success and data pre-processing is complete. From this point we proceed to examining feature correlation.
To identify feature correlation (how strongly the independent variables relate to one another and to our dependent variable, Loan_Status), we consider a correlation matrix with a threshold of 0.3:
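The matrix itself was generated in a hidden chunk; one way to produce it (our assumption, not the exact code) is to encode factors numerically, compute pairwise correlations, and blank out values below the threshold:
#Sketch: thresholded correlation matrix (assumed approach)
loan_num <- data.frame(lapply(loan, function(x)
  if (is.numeric(x)) x else as.numeric(as.factor(x)))) #encode factors numerically
corr <- cor(loan_num)
corr[abs(corr) < 0.3] <- NA #apply the 0.3 threshold
round(corr, 2)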
From the correlation matrix we can conclude that:
- Credit_History is our strongest predictor, being strongly correlated with Loan_Status, and
- Gender and Married, Married and Dependents, and LoanAmount and TotalIncome appear to be correlated with one another, indicating that multicollinearity may be a concern for our data.

We varied the threshold value for our correlation matrix and found that, aside from Credit_History, our other independent variables were relatively poor predictors of Loan_Status, making it worth exploring variable importance:
#NOTE: commented out due to long compilation time; results are summarized below.
# Perform Boruta search
#boruta_output <- Boruta(Loan_Status ~ ., data=na.omit(loan), doTrace=0, maxRuns = 1000)
# Get significant variables, including tentatives
#boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
#plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

Our use of the Boruta function for feature ranking and selection indicates that:
- Credit_History, TotalIncome, LoanAmount, and Self_Employed are strong predictors,
- Property_Area is a moderate predictor, and
- Married, Loan_Amount_Term, Education, Gender, and Dependents are all poor predictors.

With feature importance in mind, we drop the poor predictors from consideration (as sketched below). Dropping these variables also addresses concerns of applicant discrimination (i.e., rejection based on Gender), so this one act of feature selection addresses two concerns.
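The subsetting itself happens in a hidden chunk; a minimal equivalent (our reconstruction, based on the variables used from here on) would be:
#Sketch: retain only the stronger predictors plus the target variable
loan <- subset(loan, select = c(Self_Employed, LoanAmount, Credit_History,
                                Property_Area, Loan_Status, TotalIncome))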
With our loan dataset properly subset, we proceed to observing the distributions of our independent variables. First we observe numeric distributions:
#convert CreditHistory to type factor
loan$Credit_History <- factor(loan$Credit_History)
#levels(loan$Credit_History) #verify
#Numeric distributions
loan %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free", ncol=1) +
  geom_histogram(bins=90, color="darkblue", fill="lightblue")

From the above figures we observe that LoanAmount and TotalIncome appear to be right-skewed and that both distributions contain a number of noteworthy outliers. From this, we note that outlier handling and scaling will be critical steps in building our neural network model.
Next, we explore our categorical variables:
#Categoric distributions
##Self_Employed
p1 <- loan %>% dplyr::select(1,5) %>% group_by(Loan_Status) %>% count() %>%
  ggplot(aes(x=Self_Employed, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Credit_History
p2 <- loan %>% dplyr::select(3,5) %>% group_by(Loan_Status) %>% count() %>%
  ggplot(aes(x=Credit_History, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Property_Area
p3 <- loan %>% dplyr::select(4,5) %>% group_by(Loan_Status) %>% count() %>%
  ggplot(aes(x=Property_Area, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)

The above figures show how loan approval status breaks down within each categorical predictor.
With a relatively thorough exploratory analysis under our belt, we move on to building our neural network model.
We’ll utilize the holdout-validation method for evaluating model performance. We train-test split our data using a 75:25 partition, build our model on the training set and then evaluate its performance on the test set.
To start, we compute our “barrier value” and then partition our data based on this value, with 75% of our data going in the training set and 25% of our data going in the test set.
set.seed(123) #for reproducibility
bar <- floor(0.75 * nrow(loan)) #compute "barrier value"
partition <- sample(seq_len(nrow(loan)), size = bar) #sample based on barrier value
#Subset: train-test split based on partition value
train <- loan[partition, ]
test <- loan[-partition, ]
#print(dim(train)) #460 x 6
#print(dim(test)) #154 x 6

We set our training algorithm's parameters and then train our model using caret's train() function, with "nnet" passed in as the method and "center" and "scale" passed in so that numeric variables are standardized.
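The training call itself is not shown in the write-up; a minimal sketch of the call described above (tuning settings are assumptions) is:
#Sketch: baseline neural network via caret (exact tuning settings may differ)
library(caret)
set.seed(123)
nnet_model1 <- train(Loan_Status ~ ., data = train,
                     method = "nnet",
                     preProcess = c("center", "scale"),
                     trace = FALSE)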
With our "baseline model" trained, we proceed to model evaluation. We verify the baseline accuracy (0.676, the proportion of approved loans) and then evaluate our model's performance against it: we generate predictions on the training set, output them as a confusion matrix, and then repeat the process with our test data.
#round(prop.table(table(train$Loan_Status)),3) #Baseline accuracy Y: 0.676
#Training predictions
nnPred_train <-predict(nnet_model1, train)
#Training confusion matrix
table(train$Loan_Status, nnPred_train)
## nnPred_train
## N Y
## N 77 72
## Y 2 309
## [1] 0.839
#Test predictions
nnPred_test <-predict(nnet_model1, test)
#Test confusion matrix
table(test$Loan_Status, nnPred_test)
## nnPred_test
## N Y
## N 23 20
## Y 5 106
## [1] 0.844
From the above, we observe a training accuracy of 83.9% and a test accuracy of 84.4%, an improvement of roughly 17 percentage points over our baseline accuracy of 67.6%.
By merely applying a neural network model to our dataset, we see major improvements in predictive capability. Next, we see whether we can take the model further: can we improve performance by handling outliers and creating features before feeding data to the model?
We explore the effects of outlier handling and feature creation on model performance to determine whether either step improves our model.
We start by re-visiting the distribution of outliers via boxplot:
#Observe the effect of outlier-handling on model performance (if any)
#Confirm the presence of influential observations
p4 <- ggplot(loan) +
aes(x = Loan_Status, y = LoanAmount) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
p5 <- ggplot(loan) +
aes(x = Loan_Status, y = TotalIncome) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal()
grid.arrange(p4, p5, nrow = 1, ncol = 2)

From the above we can see that outliers appear to be a concern for our model.
To rectify the situation, we identify outliers using the boxplot.stats() function, filter for the corresponding observations, remove them from our dataset, and revisit model performance, as sketched below.
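The full outlier-handling chunk is in the Appendix; a minimal sketch of the boxplot.stats() approach (object names here are ours) is:
#Sketch: drop rows flagged as outliers by boxplot.stats() (full code in Appendix)
out_amount <- boxplot.stats(loan$LoanAmount)$out #points beyond the whiskers
out_income <- boxplot.stats(loan$TotalIncome)$out
loan_no_out <- loan[!(loan$LoanAmount %in% out_amount |
                      loan$TotalIncome %in% out_income), ]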
Outlier handling improved performance on the training set (up to 85.6% accuracy) but reduced it on the test set (down to 70.8%), so we elected not to include outlier handling as an optimization step. Note: the corresponding code has been included in the Appendix.
We then observed the effect of feature creation on model performance. We wanted to see whether adding certain engineered features would improve our predictive accuracy. The features we tested slightly reduced the performance of our model, so we excluded feature creation as an optimization step as well. Note: the corresponding code has been included in the Appendix.
Since neither optimization attempt bore fruit, we next explored an alternative approach to neural networks so that we could compare predictive accuracy across packages/approaches.
Background / Intro
This model is a densely connected deep neural network with multiple hidden layers and an output layer. Training such a network revolves around a few key objects: the layers that make up the network, the input data and corresponding targets, the loss function, and the optimizer.
We use the keras package to assemble these pieces and build the model.
First we convert all categorical variables to numeric and plot a network with 2 hidden layers: the first hidden layer with 4 neurons and the second with 2 neurons.
#Preprocess data
loan_keras <- loan_keras[, c(1, 3, 4, 2, 6, 5)]
# Convert Credit_History numeric type
loan_keras$Credit_History <- as.numeric(loan_keras$Credit_History)
# Change Variables values
loan_keras$Self_Employed <- ifelse(loan_keras$Self_Employed == "Yes", 1, 0)
loan_keras$Loan_Status <- ifelse(loan_keras$Loan_Status=="Y", 1, 0)
loan_keras$Property_Area <- case_when(
loan_keras$Property_Area == "Semiurban" ~ 2,
loan_keras$Property_Area == "Urban" ~ 1,
loan_keras$Property_Area == "Rural" ~ 0,
)
#loan_keras <- loan_keras %>% mutate_if(is.factor, as.numeric) # convert factor to numeric
# Neural Network Visualization
n <- neuralnet(Loan_Status ~ Self_Employed+Credit_History+Property_Area+LoanAmount+TotalIncome,
data = loan_keras,
hidden = c(4,2),
linear.output = F,
lifesign = 'full',
rep=1)
plot(n,
col.hidden = 'darkgreen',
col.hidden.synapse = 'darkgreen',
show.weights = F,
information = F,
     fill = 'lightblue', rep = "best")

The above model has 5 inputs and 2 hidden layers (4 neurons in the first layer, 2 in the second), connected to a single output neuron. Next we convert the dataset to matrix format and split it into train and test sets.
We split the data into train and test sets, with 80% of the data for training and 20% for testing. This creates 4 objects: training_keras and testing_keras, which contain every variable other than the output, and trainingtarget and testingtarget, which contain the output variable, Loan_Status.
# Partition
set.seed(18)
ind <- sample(2, nrow(loan_keras), replace = T, prob = c(.8, .2))
training_keras <- loan_keras[ind==1,1:5]
testing_keras <- loan_keras[ind==2, 1:5]
trainingtarget <- loan_keras[ind==1, 6]
testingtarget <- loan_keras[ind==2, 6]

We then normalize the independent variables by subtracting each column's mean and dividing by its standard deviation, using the scale() function on both the train and test sets (with the training set's statistics).
# Normalize
m <- colMeans(training_keras)
s <- apply(training_keras, 2, sd)
training_keras <- scale(training_keras, center = m, scale = s)
testing_keras <- scale(testing_keras, center = m, scale = s)

We create the model with keras_model_sequential() and add layers with layer_dense(); calling it repeatedly stacks multiple hidden layers. For the hidden layers, the activation function is relu (rectified linear unit); by default the activation is linear. We also need to define the input shape: we have 5 input variables, so the shape is 5.
set.seed(77)
#Create model
model_keras <- keras_model_sequential()
model_keras %>%
layer_dense(units = 4, activation = 'relu', input_shape = c(5)) %>%
layer_dense(units = 2, activation = 'relu') %>%
  layer_dense(units = 1)

set.seed(77)
#Compile model
model_keras %>% compile(loss = 'mse',
optimizer = optimizer_rmsprop(lr= 0.005),
metrics = c("accuracy"))For compile the model using loss, optimizer and metrics. There are other arrguments we can pass, but here we choose to pass 3 Arguments. The ouput has 2 values (1,0) and input are all numerical variables.
set.seed(77)
#Fit model
model_keras_train <- model_keras %>%
fit(training_keras,
trainingtarget,
epochs = 100,
batch_size = 32,
validation_split = 0.1
)
plot(model_keras_train)

Before fitting (i.e., passing the training data to the model), the model requires compilation; the loss function, optimizer, and metrics are specified during that step. To train the model, we pass the training features and targets to fit() with 100 epochs, a batch size of 32, and a 10% validation split.
The first plot shows loss and the second shows accuracy. The gap between loss and val_loss stays minimal throughout, which suggests the training error is low and the model is not badly overfitting.
Let's look at model evaluation.
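The evaluate() call itself was hidden in the chunk; it presumably looked like the following, applied to the test features and targets:
#Evaluate loss and accuracy on the held-out test set (sketch of the hidden step)
model_keras %>% evaluate(testing_keras, testingtarget)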
## loss accuracy
## 0.1627313 0.7818182
The test feature and target sets are used to evaluate the model: the evaluate() function takes two arguments referencing them and reports the overall loss and accuracy. Below are the confusion matrix and accuracy for the test data.
# Predictions on the test set (this step was hidden in the original chunk)
pred_keras <- model_keras %>% predict(testing_keras)
# Confusion matrix and misclassification
pred_keras_1 <- ifelse(pred_keras>0.5, 1, 0)
tab4 <- table(Predicted = pred_keras_1,
              Actual = testingtarget)
tab4
## Actual
## Predicted 0 1
## 0 16 2
## 1 22 70
## [1] 0.7818182
Accuracy is 78%; the model classifies the loan-approved (Yes) category well but performs poorly on the not-approved (No) category.
We used 6 neurons across 2 hidden layers, so there is scope for improvement. We next try a different neuron count along with layer dropout, which helps reduce overfitting.
We apply the same code but with a different number of neurons, a single hidden layer, and layer_dropout(). Increasing the number of neurons pushed accuracy up here, though this does not hold in all cases; the learning rate can also be adjusted to improve model performance.
set.seed(88)
#Create model
model_keras_2 <- keras_model_sequential()
model_keras_2 %>%
layer_dense(units = 100, activation = 'relu', input_shape = c(5)) %>%
layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1)

set.seed(88)
#Compile model
model_keras_2 %>% compile(loss = 'mse',
optimizer = optimizer_rmsprop(lr= 0.005),
metrics = c("accuracy"))set.seed(88)
#Fit model
model_keras_train2 <- model_keras_2 %>%
fit(training_keras,
trainingtarget,
epochs = 100,
batch_size = 32,
validation_split = 0.2
)
plot(model_keras_train2)

model_keras_2 %>% evaluate(testing_keras, testingtarget) #hidden in the original chunk
## loss accuracy
## 0.1618941 0.7909091
Below are the confusion matrix and the accuracy after model tuning.
# Predictions from the tuned model (this step was hidden in the original chunk)
pred_keras_2 <- model_keras_2 %>% predict(testing_keras)
# Confusion matrix and misclassification
pred_keras_2 <- ifelse(pred_keras_2>0.5, 1, 0)
tab5 <- table(Predicted = pred_keras_2,
              Actual = testingtarget)
tab5
## Actual
## Predicted 0 1
## 0 16 1
## 1 22 71
## [1] 0.7909091
Accuracy improved from 78% to 79%, and this model otherwise performs much like the first keras model: better on the Yes category and poorly on the No category, with 23 total misclassifications.
The dataset consists of 6 variables, of which LoanAmount and TotalIncome are numeric; the other variables are factors/categorical. We implement the following changes: convert Credit_History to numeric, encode Self_Employed and Loan_Status as 1/0, and map Property_Area to 2/1/0 (Semiurban/Urban/Rural).
To fit the neural nets we use the neuralnet package. We first normalize the variables so all values fall between 0 and 1, leaving the output variable (Loan_Status) as-is because its values are already 0 or 1.
# Convert Credit_History numeric type
loan_nn$Credit_History <- as.numeric(loan_nn$Credit_History)
# Change Variables values
loan_nn$Self_Employed <- ifelse(loan_nn$Self_Employed == "Yes", 1, 0)
loan_nn$Loan_Status <- ifelse(loan_nn$Loan_Status=="Y", 1, 0)
loan_nn$Property_Area <- case_when(
loan_nn$Property_Area == "Semiurban" ~ 2,
loan_nn$Property_Area == "Urban" ~ 1,
loan_nn$Property_Area == "Rural" ~ 0,
)
# plot small multiples
ggplot(data = reshape2::melt(loan_nn), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")To apply normalization on each columns, substract the minimum values of the column and devide by maximim - minimum. Plot density of all the variables to see the normalization. The below plot range loes between 0 and 1.
# Min-Max Normalization (equivalent to rescaling each predictor column in turn)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
norm_cols <- setdiff(names(loan_nn), "Loan_Status") #Loan_Status is already 0/1
loan_nn[norm_cols] <- lapply(loan_nn[norm_cols], min_max)
# plot small multiples
ggplot(data = reshape2::melt(loan_nn), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")Split the data in 80% and 20%, train data having 80% of the data and train data having 20% of the data. Then apply the model on train data. Use sample package to split the data in to train and test.
# Data Partition
set.seed(11)
ind <- sample(2, nrow(loan_nn), replace = TRUE, prob = c(0.8, 0.2))
training_nn <- loan_nn[ind==1,]
testing_nn <- loan_nn[ind==2,]
dim(training_nn)
## [1] 500 6
dim(testing_nn)
## [1] 114 6
Dimensions of the training and testing data: training, 500 × 6; testing, 114 × 6.
We fit the neural network with the following parameters: a single hidden neuron (hidden = 1), cross-entropy error (err.fct = "ce"), and a sigmoid output (linear.output = FALSE).
We plot the network below and discuss how the input layer, hidden layer, and output layer work together; the plot shows a single hidden layer.
# Neural Networks
library(neuralnet)
set.seed(13)
n1 <- neuralnet(Loan_Status~Self_Employed+LoanAmount+Credit_History+Property_Area+TotalIncome,
data = training_nn,
hidden = 1,
err.fct = "ce",
linear.output = FALSE
)
plot(n1, rep = "best")Node Output Calculations with Sigmoid Activation Function
Here we discuss how the neuralnet algorithm calculates the output node, manually computing the output for the first row of the training data from the weights in the network plot above.
In the diagram, the first 5 nodes are input nodes (call them N1 through N5), the middle layer is the hidden layer (N6), and the last node is the output layer (N7). To calculate the output node we first need N6's output, because N6's output is the input to N7; and to get N6's output we first compute N6's input.
Note that the sigmoid function is \(\sigma(x)=\frac{1}{1+e^{-x}}\), and a node's input is the weighted sum \(f(x)=b+w_1 x_1+w_2 x_2+\dots+w_n x_n\).
## Self_Employed LoanAmount Credit_History Property_Area Loan_Status TotalIncome
## 1 0 0.1432706 1 0.5 1 0.05539355
Above is the first row of the training data. We manually calculate the first value of the output node and compare it against the trained model's output.
in6 <- 10.47375 + (0*-5.36828) + (0.1432706*2.7186) + (1*-13.27065) + (0.5 * -1.77825) + (0.05539355 * 8.03783) #weighted sum into N6 (bias + weights * inputs)
out6 <- 1/(1+exp(-in6)) #sigmoid activation of N6
in7 <- 1.71561 + (-4.21959*out6) #weighted sum into N7
out7 <- 1/(1+exp(-in7)) #sigmoid activation of N7
paste0("Node 7 Output : ", out7)
## [1] "Node 7 Output : 0.815350328873002"
output <- compute(n1, training_nn[,-5]) #model predictions (column 5, Loan_Status, dropped)
paste0("First predicted value of NN : ", head(output$net.result, n = 1))
## [1] "First predicted value of NN : 0.815351067634266"
The manual calculation and the model's prediction agree; only the last few decimal places differ, due to rounding. This demonstrates how neural nets compute their predictions.
Feature selection is an important part of most learning pipelines: it selects the most relevant features from the data. A simple method for feature selection with feedforward neural networks is to start with one input neuron and add one input at a time until the desired classification accuracy has been achieved or all attributes have been used.
From EDA, we found Credit_History to be the most important feature for loan approval, so we start with that feature and then add other relevant features for as long as they significantly improve classification accuracy, as sketched below.
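A sketch of what this forward-selection loop could look like with neuralnet (a hypothetical helper, not our exact chunk):
#Sketch: greedy forward selection with neuralnet (hypothetical helper)
features <- c("Credit_History", "Property_Area", "TotalIncome",
              "LoanAmount", "Self_Employed")
selected <- c()
best_acc <- 0
for (f in features) {
  trial <- c(selected, f)
  fml <- as.formula(paste("Loan_Status ~", paste(trial, collapse = " + ")))
  set.seed(12)
  fit <- neuralnet(fml, data = training_nn, hidden = 1,
                   err.fct = "ce", linear.output = FALSE)
  p <- compute(fit, testing_nn[, trial, drop = FALSE])$net.result
  acc <- mean(ifelse(p > 0.5, 1, 0) == testing_nn$Loan_Status)
  if (acc > best_acc) { selected <- trial; best_acc <- acc } #keep the feature only if it helps
}
selected; best_acc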
Training and testing accuracy are similar. We also tried several other parameters in search of the optimal model:
- hidden layers in the range 1 to 5, multi-layer configurations such as c(3,2) and c(2,1), and repetitions (the rep parameter controls how many times the model is retrained);
- lifesign, which specifies how much the function prints during training ('none', 'minimal', or 'full'); we tried 'full';
- algorithm, which sets the algorithm used to train the network; we tried 'rprop+', resilient backpropagation with weight backtracking.
The model below was the optimal one among those we tried.
set.seed(12)
n <- neuralnet(Loan_Status ~ Credit_History+Property_Area,
data = training_nn,
hidden = 1,
err.fct = "ce",
linear.output = FALSE
)
plot(n, rep = 'best')

Let's look at the confusion matrix and accuracy on the training dataset.
# Confusion Matrix & Misclassification Error - training data
output <- compute(n, training_nn[,-5])
p1 <- output$net.result
pred1 <- ifelse(p1>0.5, 1, 0)
tab1 <- table(Prediction = pred1, Actuals = training_nn$Loan_Status)
tab1
## Actuals
## Prediction 0 1
## 0 73 6
## 1 85 336
## [1] "Misclassification Error of training data: 18.2"
## [1] "Accuracy of training data: 81.8"
There are 91 total misclassifications on the training data: 6 people whose loans were actually approved are predicted as not approved, and 85 people whose loans were not approved are predicted as approved. On the training set, the model performs well on the approved class and poorly on the not-approved class.
Confusion matrix and accuracy of test dataset.
# Confusion Matrix & Misclassification Error - testing data
output <- compute(n, testing_nn[,-5])
p2 <- output$net.result
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(Prediction = pred2, Actual = testing_nn$Loan_Status)
tab2
## Actual
## Prediction 0 1
## 0 17 1
## 1 17 79
## [1] "Misclassification Error of testing data: 15.79"
## [1] "Accuracy of testing data: 84.21"
Training misclassification is 18.2% and testing misclassification is 15.79%, so the neural net is fairly consistent between training and testing data. On the test data the model again performs well on the approved class and poorly on the not-approved class: 1 person whose loan was actually approved is predicted as not approved, while 17 people whose loans were not approved are predicted as approved.
Logistic Regression is a powerful statistical way of modeling a binomial outcome with one or more explanatory variables.
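Concretely, the model estimates the probability of the positive outcome through the logistic (sigmoid) link: \(P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}\).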
For the loan dataset, Loan_Status is the dependent variable and the other variables are independent. To perform logistic regression we first check whether the data is balanced, then split it into train and test sets, fit the model on the training data, and generate predictions on the test data. We then examine performance metrics to see how the model is working.
# Convert Credit_History numeric type
loan_new2$Credit_History <- as.numeric(loan_new2$Credit_History)
# Change Variables values
loan_new2$Self_Employed <- ifelse(loan_new2$Self_Employed == "Yes", 1, 0)
loan_new2$Loan_Status <- ifelse(loan_new2$Loan_Status=="Y", 1, 0)
loan_new2$Property_Area <- case_when(
loan_new2$Property_Area == "Semiurban" ~ 2,
loan_new2$Property_Area == "Urban" ~ 1,
loan_new2$Property_Area == "Rural" ~ 0,
)
loan_new2 <- subset(loan_new2, select = c(Self_Employed, LoanAmount, Credit_History,
Property_Area, Loan_Status, TotalIncome))
table(loan_new2$Loan_Status)
##
## 0 1
## 192 422
The table above shows the data is not balanced: there are more Yes than No observations. We therefore down-sample before applying logistic regression to the training dataset.
# under sample
set.seed(1)
smpl1 <- loan_new2 %>% filter(Loan_Status == 1) %>% sample_n(size = 192)
smpl2 <- loan_new2 %>% filter(Loan_Status == 0)
smpl_192 <- rbind(smpl1, smpl2)
table(smpl_192$Loan_Status)
##
## 0 1
## 192 192
Performing an 80/20 split allocates 80% of the data for training the model and 20% for testing it. We use glm() from the stats package to perform the logistic regression.
# Data split
set.seed(5)
sample = sample.split(smpl_192$Loan_Status, SplitRatio = 0.80)
loan_train = subset(smpl_192, sample == TRUE)
loan_test = subset(smpl_192, sample == FALSE)
dim(loan_train)
## [1] 308 6
dim(loan_test)
## [1] 76 6
This first model includes all the features.
set.seed(6)
bi_model1 <- glm(Loan_Status ~ ., data = loan_train, family=binomial(link="logit"))
summary(bi_model1)
##
## Call:
## glm(formula = Loan_Status ~ ., family = binomial(link = "logit"),
## data = loan_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8528 -0.9975 0.2861 0.8709 2.7495
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.470e+00 6.702e-01 -5.177 2.25e-07 ***
## Self_Employed 2.345e-01 4.030e-01 0.582 0.56068
## LoanAmount -7.316e-04 1.800e-03 -0.406 0.68444
## Credit_History 3.996e+00 6.136e-01 6.512 7.43e-11 ***
## Property_Area 4.772e-01 1.651e-01 2.890 0.00385 **
## TotalIncome -3.904e-05 2.677e-05 -1.458 0.14472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.98 on 307 degrees of freedom
## Residual deviance: 310.88 on 302 degrees of freedom
## AIC: 322.88
##
## Number of Fisher Scoring iterations: 6
From the model we see AIC: 322.88 and residual deviance: 310.88; lower values of both are better. Credit_History and Property_Area are the only statistically significant predictors.
The confusion matrix avoids "confusion" by laying out actual versus predicted values in tabular format. We test the model using loan_test.
pred_1 <- predict(bi_model1, loan_test) #predictions are on the link (log-odds) scale,
pred_1 <- if_else(pred_1 > 0.5, 1, 0)   #so a 0.5 cutoff corresponds to a probability of ~0.62
cm <- table(true = loan_test$Loan_Status, pred_1)
print(cm)
## pred_1
## true 0 1
## 0 24 14
## 1 7 31
TP <- cm[1] #note: with rows = true and cols = predicted, cm[1] and cm[4] are the
FP <- cm[2] #diagonal (correct) cells; these labels are nominal, but the accuracy
FN <- cm[3] #formula below uses only the diagonal plus the total, so it is unaffected
TN <- cm[4]
# accuracy
accuracy_bimod1 <- (TP + TN)/(TN + FN + TP + FP)
paste0("Accuracy : " ,round(accuracy_bimod1 * 100, 2))## [1] "Accuracy : 72.37"
The model makes 21 misclassifications and predicts poorly for both the approved and not-approved classes.
Since the accuracy is not promising, we turn to the correlation structure: a correlation plot helps exclude variables that are strongly related to each other, and it shows a strong relationship between TotalIncome and LoanAmount. We instead keep only the statistically significant variables; Credit_History and Property_Area are statistically significant (low p-values), so we build our second logistic regression model with these two variables.
This model includes just 2 variables: Credit_History and Property_Area.
set.seed(7)
bi_model2 <- glm(Loan_Status ~ Credit_History + Property_Area, data = loan_train, family=binomial(link="logit"))
summary(bi_model2)
##
## Call:
## glm(formula = Loan_Status ~ Credit_History + Property_Area, family = binomial(link = "logit"),
## data = loan_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6719 -1.2520 0.2700 0.9189 2.7537
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7686 0.6281 -6.000 1.98e-09 ***
## Credit_History 3.9424 0.6101 6.462 1.03e-10 ***
## Property_Area 0.4699 0.1622 2.897 0.00376 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.98 on 307 degrees of freedom
## Residual deviance: 314.76 on 305 degrees of freedom
## AIC: 320.76
##
## Number of Fisher Scoring iterations: 5
From the second model: AIC, 320.76; residual deviance, 314.76.
pred_2 <- predict(bi_model2, loan_test) #again on the link (log-odds) scale
pred_2 <- if_else(pred_2 > 0.5, 1, 0)
cm <- table(true = loan_test$Loan_Status, pred_2)
print(cm)
## pred_2
## true 0 1
## 0 23 15
## 1 5 33
TP <- cm[1] #as before, these labels are nominal; the accuracy formula uses only the diagonal cells
FP <- cm[2]
FN <- cm[3]
TN <- cm[4]
# accuracy
accuracy_bimod2 <- (TP + TN)/(TN + FN + TP + FP)
paste0("Accuracy : " ,round(accuracy_bimod2 * 100, 2))## [1] "Accuracy : 73.68"
Accuracy improved from 72% to 74%. The model makes 20 misclassifications; of the two classes, it now predicts the approved class better (33 of 38 correct) than the not-approved class (23 of 38).
The K-Nearest Neighbors classification algorithm classifies observations by allowing the K observations in the training set that are nearest to the new observation to "vote" on its class. More specifically, the algorithm computes the distance from the new observation to every training observation, selects the K nearest neighbors, and assigns the majority class among them.
The dataset has 6 columns and 614 observations, with Loan_Status as the target variable. The model predicts loan approval based on self-employment status, credit history, property area, loan amount, and total income.
We will use the knn() function from the class package to build our K-Nearest Neighbors models.
For data preparation we convert categorical variables to numeric, since kNN requires numeric inputs. For many machine learning algorithms it is important that the features be on roughly the same scale before training; this is especially true of distance-based algorithms such as K-Nearest Neighbors, which do not work well when features are on different scales. We therefore apply standard scaling: each column in the training set is scaled to have a mean of zero and a standard deviation of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(z_i\) is given by:
\(z_i= \frac {x_i−\bar x}{s_x}\)
After scaling, we split the data 80/20: 80% for training and 20% for testing.
We create 2 versions of the train and test sets, one from the actual (unscaled) data and one from the scaled data. Unscaled: train and test; scaled: training_knn and testing_knn.
The idea is to show kNN model performance on both the actual and the scaled data.
loan_knn <- loan_knn[, c(1, 3, 4, 2, 6, 5)]
# Convert Credit_History numeric type
loan_knn$Credit_History <- as.numeric(loan_knn$Credit_History)
# Change Variables values
loan_knn$Self_Employed <- ifelse(loan_knn$Self_Employed == "Yes", 1, 0)
loan_knn$Loan_Status <- ifelse(loan_knn$Loan_Status=="Y", 1, 0)
loan_knn$Property_Area <- case_when(
loan_knn$Property_Area == "Semiurban" ~ 2,
loan_knn$Property_Area == "Urban" ~ 1,
loan_knn$Property_Area == "Rural" ~ 0,
)
# Partition
set.seed(26)
ind <- sample(2, nrow(loan_knn), replace = T, prob = c(.8, .2))
training_knn <- loan_knn[ind==1,1:5]
testing_knn <- loan_knn[ind==2, 1:5]
trainingtarget_knn <- loan_knn[ind==1, 6]
testingtarget_knn <- loan_knn[ind==2, 6]
# unscaled data
train <- loan_knn[ind==1,1:5]
test <- loan_knn[ind==2, 1:5]
# Normalize scales data
m <- colMeans(training_knn)
s <- apply(training_knn, 2, sd)
training_knn <- scale(training_knn, center = m, scale = s)
testing_knn <- scale(testing_knn, center = m, scale = s)

Selecting K
We will now build several KNN models. For each K from 1 to 100, we will calculate training and validation accuracy for the KNN model, using scaled data.
set.seed(26)
train_acc <- c()
valid_acc <- c()
train_acc_sc <- c()
valid_acc_sc <- c()
k_range <- 1:100
for (i in k_range) {
# Unscaled data
set.seed(26)
train_pred_a <- knn(train, train, trainingtarget_knn, k=i)
train_acc <- c(train_acc, mean(train_pred_a == trainingtarget_knn))
set.seed(26)
valid_pred_a <- knn(train, test, trainingtarget_knn, k=i)
valid_acc <- c(valid_acc, mean(valid_pred_a == testingtarget_knn))
# Standard Scaling
set.seed(26)
train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k=i)
train_acc_sc <- c(train_acc_sc, mean(train_pred == trainingtarget_knn))
set.seed(26)
valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k=i)
valid_acc_sc <- c(valid_acc_sc, mean(valid_pred == testingtarget_knn))
}
print(max(valid_acc))
## [1] 0.75
print(max(valid_acc_sc)) #scaled-data maximum (this call was hidden in the original)
## [1] 0.8712121
Training and Validation Curves for Actual (Unscaled) Data
plot(k_range, train_acc, pch='.', ylim=c(0.55, 1), col='salmon', ylab="")
lines(k_range, train_acc, lwd=2, col='salmon')
lines(k_range, valid_acc, lwd=2, col='cornflowerblue')
legend(65, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)Training and Validation Curves for Standard Scaled Data
plot(k_range, train_acc_sc, pch='.', ylim=c(0.55, 1), col='salmon', ylab="")
lines(k_range, train_acc_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_sc, lwd=2, col='cornflowerblue')
legend(65, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)Selecting the Final Model
Our best validation performance was obtained using scaled data. Let’s determine the value of K used to create this particular model.
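The lookup itself was hidden in the original; it presumably took the index of the best unscaled validation accuracy (the index equals K here), storing it as acc for reuse below:
acc <- which.max(valid_acc) #K with the best unscaled validation accuracy
print(acc)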
## [1] 15
We will now recalculate the training and validation accuracies for this model, given k = 15. We can use the table function to create a quick confusion matrix.
set.seed(26)
train_pred <- knn(train, train, trainingtarget_knn, k=acc)
train_acc <- mean(train_pred == trainingtarget_knn)
set.seed(26)
valid_pred <- knn(train, test, trainingtarget_knn, k=acc)
valid_acc <- mean(valid_pred == testingtarget_knn)
cat('Training Accuracy: ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
## Training Accuracy: 0.6804979
## Validation Accuracy: 0.75
table(Predicted = valid_pred, Actual = testingtarget_knn) #quick confusion matrix (hidden in the original chunk)
## Actual
## Predicted 0 1
## 0 4 4
## 1 29 95
The confusion matrix shows 33 total misclassifications on the test data. The model predicts the loan-approval Yes class far better than the No class; most misclassifications occur in the No class.

Confusion matrix and model prediction on Scaled data
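As with the unscaled data, the best scaled K was presumably recovered by indexing the scaled validation accuracies, stored as acc_sc for reuse below:
acc_sc <- which.max(valid_acc_sc) #K with the best scaled validation accuracy
print(acc_sc)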
## [1] 10
set.seed(26)
train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k=acc_sc)
train_acc <- mean(train_pred == trainingtarget_knn)
set.seed(26)
valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k=acc_sc)
valid_acc <- mean(valid_pred == testingtarget_knn)
cat('Training Accuracy: ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
## Training Accuracy: 0.813278
## Validation Accuracy: 0.8712121
table(Predicted = valid_pred, Actual = testingtarget_knn) #quick confusion matrix (hidden in the original chunk)
## Actual
## Predicted 0 1
## 0 16 0
## 1 17 99
This model has high accuracy, but it is not a preferred model for predicting the loan-approval No category: it predicts the Yes category with 100% accuracy while missing roughly half of the No cases, so it is not a model we would recommend as-is.
Decision trees build classification or regression models in the form of a tree structure. Datasets are broken into smaller and smaller subsets along chosen parameters, as the associated decision tree is developed.
The result is a tree of nodes and branches. Nodes denote split points based on the attribute in question, while branches denote the corresponding outcomes. We start at a "root node", terminate at "leaf nodes", and use the leaf nodes to provide proportions for the resulting class labels.
From EDA with the Boruta package, we identified the most important features. The table shows that Credit_History has the highest importance, TotalIncome the second highest, and LoanAmount the third. We also found a strong correlation between TotalIncome and LoanAmount, but decision trees and boosted-tree algorithms are largely immune to multicollinearity by nature: at each split, the tree chooses only one of a set of highly correlated features. So multicollinearity should not be an issue here.
# feature impotance
boruta_output = Boruta(Loan_Status ~ ., data = loan_dt, doTrace = 0, maxRuns = 1000)
roughFixMod = TentativeRoughFix(boruta_output)
importance = attStats(roughFixMod) #reuse the rough-fixed object
importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
kable(head(importance[order(-importance$meanImp), ]),
caption = "Feature Importance of Loan Data") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE)| meanImp | decision | |
|---|---|---|
| Credit_History | 94.485237 | Confirmed |
| TotalIncome | 16.222849 | Confirmed |
| LoanAmount | 13.022106 | Confirmed |
| Self_Employed | 4.175202 | Confirmed |
Data preparation
We split the data 80/20 into training and testing sets. Using the rpart package (short for Recursive Partitioning and Regression Trees), we fit a classification tree and plot the resulting model.
set.seed(44)
# Split data into training and testing sets
sample_data = sample.split(loan_dt, SplitRatio = 0.8)
train_data <- subset(loan_dt, sample_data == TRUE)
test_data <- subset(loan_dt, sample_data == FALSE)
# Plot tree
tree_mod <- rpart(Loan_Status ~ ., data = train_data, method="class",
control = rpart.control(minsplit = 4,
minbucket = 2,
cp = 0,
maxdepth = 6))
rpart.plot(tree_mod, extra=1, cex=0.8)

train_pred <- predict(tree_mod, train_data, type="class")
test_pred <- predict(tree_mod, test_data, type="class")
paste("Confusion matrix of Test data")
## [1] "Confusion matrix of Test data"
table(Predicted = test_pred, Actual = test_data$Loan_Status) #this call was hidden in the original chunk
## Actual
## Predicted N Y
## N 38 11
## Y 32 124
cat('Training Accuracy: ', mean(train_pred == train_data$Loan_Status), '\n',
    'Test Set Accuracy: ', mean(test_pred == test_data$Loan_Status), sep='')
## Training Accuracy: 0.8533007
## Test Set Accuracy: 0.7902439
The test-data confusion matrix shows 79% accuracy, with 43 total misclassifications. The model is slightly better at predicting the Yes category than the No category.
The next step is to prune the tree to an optimal size, which may improve accuracy. To obtain the optimized model we need to find the optimal depth and complexity parameter; the calculations follow below.
Cross-Validation for Model Evaluation
The appropriate depth can be determined by evaluating the tree on held-out data via cross-validation. Essentially, cross-validation lets us alternate between training and testing so that a relatively small dataset still yields a reliable error estimate.
set.seed(1)
bc_tree_cv <- train(Loan_Status ~ ., loan_dt, method="rpart2",
trControl = trainControl(method="cv", number=10),
tuneGrid = expand.grid(maxdepth=c(6)),
control = rpart.control(minsplit = 4,
minbucket = 2,
cp = 0))
bc_tree_cv
## CART
##
## 614 samples
## 5 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 553, 553, 553, 551, 553, 553, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8159099 0.5129139
##
## Tuning parameter 'maxdepth' was held constant at a value of 6
Accuracy : 0.8159099
Kappa : 0.5129139
Cross-Validation for Tuning Maximum Depth
In every classification technique there are parameters that can be tuned to optimize the classification. For decision trees, these include the maximum depth of the tree and the complexity parameter (cp).
set.seed(1)
bc_tree_cv_md <- train(Loan_Status ~ ., loan_dt, method="rpart2",
trControl = trainControl(method="cv", number=10),
tuneGrid = expand.grid(maxdepth=1:20),
control = rpart.control(minsplit = 4,
minbucket = 2,
cp = 0))
best_ix = which.max(bc_tree_cv_md$results$Accuracy)
bc_tree_cv_md$results[best_ix, ]
## maxdepth Accuracy Kappa AccuracySD KappaSD
## 2 2 0.82244 0.5260808 0.03329859 0.1025391
plot(bc_tree_cv_md$results$maxdepth, bc_tree_cv_md$results$Accuracy, ylab = "Accuracy (cross -Validation)", xlab = "Max Tree Depth")
lines(1:20, bc_tree_cv_md$results$Accuracy)
abline(v=which.max(bc_tree_cv_md$results$Accuracy), col="red", lty=2, lwd=1)

The plot of max tree depth vs. accuracy shows that greater depth reduces accuracy; the highest accuracy occurs at depth 2.
Cross-Validation for Tuning Complexity Parameter
set.seed(1)
bc_tree_cv_cp <- train(Loan_Status ~ ., loan_dt, method="rpart",
trControl = trainControl(method="cv", number=10),
tuneGrid = expand.grid(cp=seq(0, 0.1, 0.001)),
control = rpart.control(minsplit = 4,
minbucket = 2,
maxdepth = 30))
best_ix = which.max(bc_tree_cv_cp$results$Accuracy)
bc_tree_cv_cp$results[best_ix, ]
## cp Accuracy Kappa AccuracySD KappaSD
## 36 0.035 0.8224392 0.52012 0.0305313 0.09393886
The results show that the complexity parameter with the highest cross-validated accuracy is cp = 0.035.
With the optimal depth and cp values in hand, we can build our final model. The model splits primarily on Credit_History; the tree looks convincing and matches the feature importance table, where Credit_History dominates the other features.
tree_mod_final <- rpart(Loan_Status ~ ., data = train_data, method="class",
control = rpart.control(minsplit = 2,
minbucket = 2,
cp = 0.035,
maxdepth = 2))
rpart.plot(tree_mod_final, extra=1, cex=0.8)

Confusion Matrix and accuracy
loanPred <- predict(tree_mod_final, test_data, type = "class")
tab_dt <- table(Predicted = loanPred, Actual = test_data$Loan_Status)
tab_dt
## Actual
## Predicted N Y
## N 35 3
## Y 35 132
## [1] "Accuracy : 81.46"
Accuracy has improved to 81%, with 38 total misclassifications in the test data: 3 loans that were actually approved are predicted as not approved, and 35 loans that were not approved are predicted as approved. The model predicts the Yes category well and the No category poorly.
Comparing the baseline and final models, the baseline may still be preferable despite its lower accuracy, because the final model over-predicts the loan-approved Yes class.
To expand on our analysis of the loan approval dataset, we run a few random forest models to see whether they are a more effective way to classify whether a loan will be approved.
Data Preparation
We split the dataset into training (80%) and test (20%) sets. Random forest is a tree-based model and does not require feature scaling; one advantage of decision-tree-based methods like random forests is their ability to natively handle categorical predictors without first transforming them (e.g., via feature engineering techniques).
set.seed(66)
# Split data into training and testing sets
sample_data_rf = sample.split(loan_rf, SplitRatio = 0.8)
train_data_rf <- subset(loan_rf, sample_data_rf == TRUE)
test_data_rf <- subset(loan_rf, sample_data_rf == FALSE)

The training set has 408 observations and 6 variables; the testing set has 206 observations and 6 variables.
We use the randomForest package to build the model, first fitting a baseline forest with all variables as predictors.
set.seed(66)
# fit a Random Forest model with training data
model_rf <- randomForest(Loan_Status ~ .,
data=train_data_rf)
# display model details
model_rf
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = train_data_rf)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.59%
## Confusion matrix:
## N Y class.error
## N 56 70 0.55555556
## Y 14 268 0.04964539
Error Rate
This plot shows the class error rates of the random forest model. As the number of trees increases, the error rate for the Y class approaches zero; for the N class, the error rate drops over the first few trees and then remains roughly constant.
plot(model_rf, main = "Error rate of random forest")
legend("topright", cex =1, legend=colnames(model_rf$err.rate), lty=c(1,2,3), col=c(1,2,3))Random Forest Evaluation
Predict the Test set results.
set.seed(66)
pred_rf <- predict(model_rf, test_data_rf, type="class")
tab_rf <- table(Predicted = pred_rf, Actual = test_data_rf$Loan_Status)
tab_rf
## Actual
## Predicted N Y
## N 44 3
## Y 22 137
pred_rf <- as.factor(pred_rf)
test_data_rf$Loan_Status <- as.factor(test_data_rf$Loan_Status)
paste0("Accuracy : ",round(sum(diag(tab_rf))/sum(tab_rf)*100,2))## [1] "Accuracy : 87.86"
RF_Model_1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$byClass
acc1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_1 <- data.frame(RF_Model_1)
RF_Model_1 <- rbind("Accuracy" = acc1, RF_Model_1)Importance of Variables
The mean decrease in Gini coefficient measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest: variables that produce nodes with higher purity have a higher mean decrease in Gini. The importance chart displays the variables that drove the random forest, from greatest impact (top) to least (bottom).
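The chart itself came from a hidden chunk; it was presumably produced with randomForest's varImpPlot():
#Sketch: variable importance chart (the plotting call was hidden in the original)
varImpPlot(model_rf, main = "Variable Importance")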
Tuning on Number of Trees
m_opt_ntrees <- which.min(model_rf$err.rate[,'OOB'])
m_opt_err_rate <- min(model_rf$err.rate[,'OOB'])
cat("Optimal Number of Trees: ", m_opt_ntrees, "\n",
"Minimum Error Rate: ", m_opt_err_rate, sep="")## Optimal Number of Trees: 51
## Minimum Error Rate: 0.1936275
Create another random forest model using Optimal Number of Trees.
set.seed(66)
model_rf_2 <- randomForest(Loan_Status ~ .,
                           data=train_data_rf, ntree=m_opt_ntrees, importance=TRUE)

Test data Confusion matrix and Accuracy
Predict the test data.
set.seed(66)
pred_rf_2 <- predict(model_rf_2, test_data_rf, type="class")
tab_rf_2 <- table(Predicted = pred_rf_2, Actual = test_data_rf$Loan_Status)
tab_rf_2
## Actual
## Predicted N Y
## N 44 4
## Y 22 136
pred_rf_2 <- as.factor(pred_rf_2)
paste0("Accuracy : ",round(sum(diag(tab_rf_2))/sum(tab_rf_2)*100,2))## [1] "Accuracy : 87.38"
RF_Model_2 <- confusionMatrix(pred_rf_2, test_data_rf$Loan_Status)$byClass
acc2 <- confusionMatrix(pred_rf_2, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_2 <- data.frame(RF_Model_2)
RF_Model_2 <- rbind("Accuracy" = acc2, RF_Model_2)Tuning on mtry and ntree
We create another RF model, this time tuning both mtry and ntree.
oob_acc_list <- c()
opt_ntree_list <- c()
#note: with only 5 predictors, randomForest resets mtry values above 5 (with a warning)
for(i in 1:29){
  set.seed(66)
  temp_mod <- randomForest(Loan_Status ~ ., train_data_rf, ntree=500, importance=TRUE, mtry=i)
  oob_acc_list <- c(oob_acc_list, min(temp_mod$err.rate[,'OOB']))
  opt_ntree_list <- c(opt_ntree_list, which.min(temp_mod$err.rate[,'OOB']))
}
opt_mtry <- which.min(oob_acc_list)
opt_ntree <- opt_ntree_list[opt_mtry]
min_oob_acc <- min(oob_acc_list) #minimum OOB error rate, despite the "Accuracy" label below
cat("Optimal Value of mtry: ", opt_mtry, "\n",
"Optimal Value of ntree: ", opt_ntree, "\n",
"Minimum OOB Accuracy: ", min_oob_acc, sep="")## Optimal Value of mtry: 2
## Optimal Value of ntree: 80
## Minimum OOB Accuracy: 0.1911765
plot(1:29, oob_acc_list, xlab="Value of mtry", ylab="Minimum OOB Accuracy Score")
lines(1:29, oob_acc_list)
abline(v=which.min(oob_acc_list), col="red", lty=2, lwd=1)

set.seed(66)
model_rf_3 <- randomForest(Loan_Status ~ .,
                           data=train_data_rf, ntree=opt_ntree, mtry=opt_mtry, importance=TRUE)

Test data Confusion matrix and Accuracy
Predict the test data.
set.seed(66)
pred_rf_3 <- predict(model_rf_3, test_data_rf, type="class")
tab_rf_3 <- table(Predicted = pred_rf_3, Actual = test_data_rf$Loan_Status)
tab_rf_3
## Actual
## Predicted N Y
## N 44 5
## Y 22 135
pred_rf_3 <- as.factor(pred_rf_3)
paste0("Accuracy : ",round(sum(diag(tab_rf_3))/sum(tab_rf_3)*100,2))## [1] "Accuracy : 86.89"
We now compare all 3 models: the first RF model (baseline), the second (tuned ntree), and the third (tuned mtry and ntree).
tabularview <- data.frame(RF_Model_1, RF_Model_2, RF_Model_3)
kable(tabularview) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down") %>% column_spec(1, bold = T)

|   | RF_Model_1 | RF_Model_2 | RF_Model_3 |
|---|---|---|---|
| Accuracy | 0.8786408 | 0.8737864 | 0.8689320 |
| Sensitivity | 0.6666667 | 0.6666667 | 0.6666667 |
| Specificity | 0.9785714 | 0.9714286 | 0.9642857 |
| Pos Pred Value | 0.9361702 | 0.9166667 | 0.8979592 |
| Neg Pred Value | 0.8616352 | 0.8607595 | 0.8598726 |
| Precision | 0.9361702 | 0.9166667 | 0.8979592 |
| Recall | 0.6666667 | 0.6666667 | 0.6666667 |
| F1 | 0.7787611 | 0.7719298 | 0.7652174 |
| Prevalence | 0.3203883 | 0.3203883 | 0.3203883 |
| Detection Rate | 0.2135922 | 0.2135922 | 0.2135922 |
| Detection Prevalence | 0.2281553 | 0.2330097 | 0.2378641 |
| Balanced Accuracy | 0.8226190 | 0.8190476 | 0.8154762 |
The above table shows the comparative statistics of each model’s performance.
All 3 random forest models have accuracy above 85%. They perform poorly on the loan-rejected class and very well on the loan-accepted class, likely due to class imbalance in the data: the count of Y is higher than N. Other models may predict the rejected class better.