The purpose of our Final Project was to explore the application of neural networks to loan approval data and then compare model performance against a variety of classification algorithms (KNN, decision tree, random forest, and gradient boosting).
Neural networks form the basis of Deep Learning, an extension of Machine Learning, where algorithms are inspired by the structure of the human brain. They take in data, train themselves to recognize patterns therein, and then predict outputs for similar, unseen data.
download.file(
url='https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png',
destfile='image1.jpg',
mode='wb')
knitr::include_graphics(path='image1.jpg')
Neural networks are made up of layers of nodes. They contain an input layer, one or more hidden layers, and an output layer. Nodes are interconnected with associated weights and thresholds. When a node is above its specified threshold, the node is activated and data is sent to the next layer of the network. Otherwise, data is not fed forward.
The power of a neural network lies in its ability to fine-tune upon countless iterations. Back-propagation allows for continuous model accuracy improvement. Weights are adjusted based on the magnitude of error at the output layer, and continuous refinement allows for predictive accuracy improvements.
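As a minimal illustration of what a single node computes (this snippet is not taken from the project code; the inputs, weights, and bias are made-up values):

sigmoid <- function(z) 1 / (1 + exp(-z))   # common activation function
x <- c(0.2, 0.7, 0.1)                      # example inputs from the previous layer
w <- c(0.5, -0.3, 0.8)                     # example weights on each connection
b <- 0.1                                   # bias term
node_output <- sigmoid(sum(w * x) + b)     # value the node passes forward to the next layer
node_output

During back-propagation, it is these weights and biases that get nudged in the direction that reduces the output-layer error.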
We’ll explore and prepare the loan dataset, train multiple neural network and alternative classification models, and then compare and contrast the top-performing model from each group in its ability to accurately classify Loan_Status (approval/rejection).
A loan is money transferred from one party to another with the expectation that the lender (loan giver) will be repaid in full, with interest, by the borrower (loan receiver).
The lender's profit comes from the interest paid by the borrower, and thus, as a core part of their business model, it's important for banks and credit companies alike to be able to depend on their loans (and interest) being repaid in full.
With this motivation in mind, we (re) explore and prepare our loan approval dataset in order to construct a more precise neural network model (later). Being that we’ve explored this data before, we build upon the core takeaways of our past exploration while simultaneously pushing the bounds of our understanding to a deeper level. Rudimentary (early) EDA steps will be summarized and/or excluded from the write up and included in the Appendix in favor of output that provides greater context and insight.
Prior to commencing EDA, we revisit the corresponding data directory:
- Loan_ID: unique loan ID
- Gender: applicant gender (Male/Female)
- Married: applicant marriage status (Yes/No)
- Dependents: number of dependents for applicant (0, 1, 2, 3+)
- Education: applicant college education status (Graduate / Not Graduate)
- Self_Employed: applicant self-employment status (Yes/No)
- ApplicantIncome: applicant income level
- CoapplicantIncome: co-applicant income level (if applicable)
- LoanAmount: loan amount requested (in thousands)
- Loan_Amount_Term: loan term (in months)
- Credit_History: credit history meets guidelines (1/0)
- Property_Area: property location (Urban / Semiurban / Rural)
- Loan_Status: loan approved (Yes/No); target variable

To start, we load in our data, replace empty strings with NAs, observe the first 6 observations of our dataset to refamiliarize ourselves with its format, and then use R’s built-in glimpse() and summary() functions to revisit data types and value ranges.
We’re dealing with a 614-observation x 13-variable dataframe with Loan_Status as our dependent, categoric variable; ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History as independent, numeric variables; and Loan_ID, Gender, Married, Dependents, Education, Self_Employed, and Property_Area as independent, categoric variables.
From above, we extend that Loan_ID can be dropped, ApplicantIncome and CoapplicantIncome can be combined to create a TotalIncome variable, and observations with a "3+" label in Dependents should be re-labelled as "3" so that data follows a consistent format and imputation can be performed as a next step.
We pre-process our data (as described above), visualize and handle NA values:
#Pre-process dataset for easier interpretation
loan <- subset(loan, select = -c(1)) #drop Loan_ID from consideration
loan$TotalIncome <- loan$CoapplicantIncome + loan$ApplicantIncome #create TotalIncome variable
loan <- subset(loan, select = -c(6,7)) #drop CoapplicantIncome and ApplicantIncome
loan$Dependents <- revalue(loan$Dependents, c("3+"="3")) #relabel Dependents "3+" value as "3"
#Visualize NA counts
colSums(is.na(loan))
## Gender Married Dependents Education
## 13 3 15 0
## Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 32 22 14 50
## Property_Area Loan_Status TotalIncome
## 0 0 0
We re-assign the "3+" value of the Dependents variable to provide consistent leveling and enable pmm (predictive mean matching). Predictive mean matching calculates the predicted value for our target variable, and, for missing values, forms a small set of "candidate donors" from the complete cases that are closest to the predicted value for our missing entry. Donors are then randomly chosen from candidates and imputed where values were once missing. To apply pmm we assume that the distribution is the same for missing cells as it is for observed data, and thus, the approach may be more limited when the % of missing values is higher.
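The imputation call itself is included in the Appendix; in condensed form (the intermediate mids object is renamed here for clarity):

#Handle NAs: apply predictive mean matching to the loan data
imputed <- mice(loan, m = 1, method = "pmm", seed = 500)  #build one pmm-imputed dataset
loan <- mice::complete(imputed, 1)                        #fill missing cells with donor values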
Once we’ve imputed missing values, we verify whether our operation was successful:
#verify absence of NA values in the dataset
colSums(is.na(loan))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed LoanAmount Loan_Amount_Term Credit_History
## 0 0 0 0
## Property_Area Loan_Status TotalIncome
## 0 0 0
Imputation was a success and data pre-processing is complete. From this point, we proceed to examine feature correlation.
To identify feature correlation (how strongly the independent variables are related to one another and to our dependent variable, Loan_Status), we consider a correlation matrix with a threshold of 0.3:
#Utilize custom-built correlation matrix generation function
plot_corr_matrix(loan, 0.3)
From the correlation matrix we can extend that:
- Credit_History is our strongest predictor / strongly correlated with Loan_Status, and
- Gender and Married, Married and Dependents, and LoanAmount and TotalIncome appear to be correlated with one another, indicating that multicollinearity may be a concern for our data.

We varied the threshold value for our correlation matrix and found that, aside from Credit_History, our other independent variables were relatively poor predictors of Loan_Status, making it worth exploring variable importance:
#NOTE: COMMENTED OUT BELOW DUE TO LONG COMPILATION TIME. UNCOMMENT BEFORE FINAL SUBMISSION.
# Perform Boruta search
boruta_output <- Boruta(Loan_Status ~ ., data=na.omit(loan), doTrace=0, maxRuns = 1000)
#Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
Our utilization of the Boruta function for feature ranking and selection indicates that:

- Credit_History, TotalIncome, LoanAmount, and Self_Employed are strong predictors,
- Property_Area is a moderate predictor, and
- Married, Loan_Amount_Term, Education, Gender, and Dependents are all poor predictors.

With feature importance in mind, we drop the poor predictors from consideration. Dropping these variables also addresses concerns of applicant discrimination (e.g., rejection based on Gender), and thus we address two concerns with this act of feature selection.
With our loan dataset properly subset, we proceed to observing the distributions of our independent variables. First we observe numeric distributions:
#convert Credit_History to type factor
loan$Credit_History <- factor(loan$Credit_History)
#levels(loan$Credit_History) #verify
#Numeric distributions
loan %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol=1) +
  geom_histogram(bins=90, color="darkblue", fill="lightblue")
From the above figures we observe that LoanAmount and TotalIncome appear to be right-skewed, with a number of noteworthy outliers in both distributions. From this, we note that outlier handling and scaling will be critical steps in building our neural network model.
Next, we explore our categorical variables:
#Categoric distributions
##Self_Employed
p1 <- loan %>% dplyr::select(1,5) %>% group_by(Self_Employed, Loan_Status) %>% count() %>%
  ggplot(aes(x=Self_Employed, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Credit_History
p2 <- loan %>% dplyr::select(3,5) %>% group_by(Credit_History, Loan_Status) %>% count() %>%
  ggplot(aes(x=Credit_History, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Property_Area
p3 <- loan %>% dplyr::select(4,5) %>% group_by(Property_Area, Loan_Status) %>% count() %>%
  ggplot(aes(x=Property_Area, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)
The above figures show how approvals and rejections are distributed across the levels of Self_Employed, Credit_History, and Property_Area.
With a relatively thorough exploratory analysis under our belt, we move on to building our neural network model.
We’ll utilize the holdout-validation method for evaluating model performance. We train-test split our data using a 75:25 partition, build our model on the training set and then evaluate its performance on the test set.
To start, we compute our “barrier value” and then partition our data based on this value, with 75% of our data going in the training set and 25% of our data going in the test set.
set.seed(123) #replicability
bar <- floor(0.75 * nrow(loan)) #compute "barrier value"
partition <- sample(seq_len(nrow(loan)), size = bar) #sample based on barrier value
#Subset: train-test split based on partition value
train <- loan[partition, ]
test <- loan[-partition, ]
#print(dim(train)) #460 x 6
#print(dim(test)) #154 x 6
We set our training algorithm's parameters and then train our model using the train() function, with "nnet" passed in as the method and "scale" and "center" passed to preProcess so that numeric variables are standardized.
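The corresponding training chunk appears in the Appendix; in brief:

#Specify training algorithm parameters: 10-fold CV repeated 5 times
train_params <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
#Train the neural net; preProcess standardizes the numeric predictors
nnet_model1 <- train(train[,-5], train$Loan_Status,
                     method = "nnet",
                     trControl = train_params,
                     preProcess = c("scale", "center"),
                     na.action = na.omit)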
With our "baseline model" trained, we proceed to model evaluation. We verify the baseline accuracy (0.676) and then evaluate our model's performance against this baseline. We generate predictions on the training set, output them as a confusion matrix, and then do the same with our test data.
set.seed(123) #replicability
#round(prop.table(table(train$Loan_Status)),3) #Baseline accuracy Y: 0.685
#Training predictions
nnPred_train <- predict(nnet_model1, train)
#Training confusion matrix
table(train$Loan_Status, nnPred_train)
## nnPred_train
## N Y
## N 79 70
## Y 4 307
round((310+78)/nrow(train),3)
## [1] 0.843
#Test predictions
nnPred_test <- predict(nnet_model1, test)
#Test confusion matrix
table(test$Loan_Status, nnPred_test)
## nnPred_test
## N Y
## N 23 20
## Y 6 105
round((106+21)/nrow(test),3)
## [1] 0.825
#Performance statistics for later comparison
NN_Model_1 <- confusionMatrix(as.factor(nnPred_test), as.factor(test$Loan_Status))$byClass
acc1 <- confusionMatrix(as.factor(nnPred_test), as.factor(test$Loan_Status))$overall['Accuracy']
NN_Model_1 <- data.frame(NN_Model_1)
NN_Model_1 <- rbind("Accuracy" = acc1, NN_Model_1)
From above, we observe a training accuracy of 84.3% and a test accuracy of 82.5% which is an improvement of more than 15% over our “baseline accuracy”.
By merely applying a neural network model to our dataset, we see major improvements in predictive capability. Next, we see whether we can take the model further by handling outliers and creating features before feeding data to the model.
We explore the effects of outlier handling and feature creation on model performance to determine whether either step improves our model.
We start by re-visiting the distribution of outliers via boxplot:
set.seed(123) #replicability
#Observe the affect of outlier-handling on model performance (if any)
#Confirm the presence of influential observations
p4 <- ggplot(loan) +
  aes(x = Loan_Status, y = LoanAmount) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
p5 <- ggplot(loan) +
  aes(x = Loan_Status, y = TotalIncome) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
grid.arrange(p4, p5, nrow = 1, ncol = 2)
From above we can see that outliers appear to be a concern for our model.
Outlier Handling
To rectify the situation, we identify the outliers using the boxplot.stats() function, filter for corresponding observations, remove outliers from our dataset, and revisit model performance.
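Condensed from the Appendix chunk, the identification and removal step looks like this:

#Identify outlier row indices for LoanAmount and TotalIncome, then drop them
out1 <- boxplot.stats(loan$LoanAmount)$out      #outlying LoanAmount values
outliers1 <- which(loan$LoanAmount %in% out1)
out2 <- boxplot.stats(loan$TotalIncome)$out     #outlying TotalIncome values
outliers2 <- which(loan$TotalIncome %in% out2)
outliers <- unique(c(outliers1, outliers2))     #merge lists and remove repeats
loan <- loan[-outliers,]                        #remove outlier rows before re-fitting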
Outlier-handling led to a slight performance reduction for training and significant performance reduction for test data. As such we elected not to include outlier-handling as an optimization step. Note: corresponding code has been included in the Appendix.
We proceeded to observe the effect of feature creation on model performance.
Feature Creation
We wanted to see if adding certain combinations of features would improve our predictive accuracy. We tested the inclusion of indicator variables for: a self-employed applicant with high total income, a semiurban property with a credit history of 1, a non-self-employed applicant with a low loan amount, and a low total income paired with a high loan amount (the exact cutoffs appear in the Appendix).
The inclusion of these variables, and feature creation in general, slightly reduced the performance of our model and so we elected to exclude it as a modeling optimization step. Note: corresponding code has been included in the Appendix.
Since each of our optimization attempts was fruitless, we next opted to explore two alternative neural network implementations and compare predictive accuracy across approaches.
For our first "alternative" neural network approach, we consider the keras package and explore the characteristics of the model and means of optimizing it:
We pre-process our data (converting variables to the proper type and changing categoric variable values from a character to a numeric-base) and then visualize the layers of our neural network:
set.seed(123) #replicability
#Preprocess data
loan_keras <- loan_keras[, c(1, 3, 4, 2, 6, 5)]
# Convert Credit_History to numeric type
loan_keras$Credit_History <- as.numeric(loan_keras$Credit_History)
# Change variable values
loan_keras$Self_Employed <- ifelse(loan_keras$Self_Employed == "Yes", 1, 0)
loan_keras$Loan_Status <- ifelse(loan_keras$Loan_Status == "Y", 1, 0)
loan_keras$Property_Area <- case_when(
  loan_keras$Property_Area == "Semiurban" ~ 2,
  loan_keras$Property_Area == "Urban" ~ 1,
  loan_keras$Property_Area == "Rural" ~ 0
)
#loan_keras <- loan_keras %>% mutate_if(is.factor, as.numeric) # convert factor to numeric
#Build initial keras NN model
n <- neuralnet(Loan_Status ~ Self_Employed + Credit_History + Property_Area + LoanAmount + TotalIncome,
               data = loan_keras,
               hidden = c(4,2),
               linear.output = F,
               lifesign = 'full',
               rep = 1)
#Visualize our initial keras NN
plot(n,
     col.hidden = 'darkgreen',
     col.hidden.synapse = 'darkgreen',
     show.weights = F,
     information = F,
     fill = 'lightblue', rep = "best")
From the above visualization, we observe 5 input nodes, 2 hidden layers (1 with 4 neurons and 1 with 2 neurons), and 1 output node. As is the case with neural networks, all nodes are interconnected.
As a next step, we convert our data into matrix form, train-test split (using an 80/20 partition), and normalize (center and scale). We then build, compile, and train our model using the keras_model_sequential(), compile(), and fit() functions respectively.
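The corresponding preparation chunk appears in the Appendix; condensed here:

# Matrix form, 80/20 partition, and normalization using the training-set means and SDs
loan_keras <- as.matrix(loan_keras)
dimnames(loan_keras) <- NULL
ind <- sample(2, nrow(loan_keras), replace = TRUE, prob = c(.8, .2))
training_keras <- loan_keras[ind == 1, 1:5]
testing_keras  <- loan_keras[ind == 2, 1:5]
trainingtarget <- loan_keras[ind == 1, 6]
testingtarget  <- loan_keras[ind == 2, 6]
m <- colMeans(training_keras)
s <- apply(training_keras, 2, sd)
training_keras <- scale(training_keras, center = m, scale = s)
testing_keras  <- scale(testing_keras, center = m, scale = s)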
We use the layer_dense() function to create the hidden layers for our network with “relu” (rectified linear unit) as our activation function, 5 inputs, 4 nodes in our 1st hidden layer, 2 nodes in our 2nd hidden layer, and 1 output node (as in our earlier visualization).
We compile our model using the following 3 arguments: "mse" (mean squared error) as the loss function, RMSprop with a learning rate of 0.005 as the optimizer, and accuracy as the evaluation metric.
Finally, we fit our model (with a validation proportion of 0.1 and 100 epochs) and visualize the model’s accuracy and loss for training and validation sets:
set.seed(123) #replicability
#Build model
model_keras <- keras::keras_model_sequential()
model_keras %>%
  layer_dense(units = 4, activation = 'relu', input_shape = c(5)) %>%
  layer_dense(units = 2, activation = 'relu') %>%
  layer_dense(units = 1)
#Compile model
model_keras %>% compile(loss = 'mse',
                        optimizer = optimizer_rmsprop(lr = 0.005),
                        metrics = c("accuracy"))
#Train model
model_keras_train <- model_keras %>%
  fit(training_keras,
      trainingtarget,
      epochs = 100,
      validation_split = 0.1)
#Visualize
plot(model_keras_train)
The first graph shows loss and the second shows accuracy. There is little difference in loss between the training and validation sets. For accuracy, training and validation both start low (which makes sense), diverge shortly before the 10th epoch, and settle with validation around ~0.85 and training around ~0.8.
This appears to indicate that over-training is not a concern, but we'll explore further via evaluation metrics:
set.seed(123) #replicability
#Evaluate model on training data
model_keras %>% evaluate(training_keras, trainingtarget)
## loss accuracy
## 0.1486293 0.8102041
pred_keras_train <- model_keras %>% predict(training_keras)
#Evaluate model on test data
model_keras %>% evaluate(testing_keras, testingtarget)
## loss accuracy
## 0.1249997 0.8548387
pred_keras_test <- model_keras %>% predict(testing_keras)
Our evaluation shows that the keras model actually performs better on (unseen) test data than on training data.
We explore accuracy metrics for test data in confusion matrix form for further interpretation:
set.seed(123) #replicability
# Confusion matrix and misclassification - test data
pred_keras_1 <- ifelse(pred_keras_test > 0.5, 1, 0)
tab4 <- table(Predicted = pred_keras_1,
              Actual = testingtarget)
tab4
## Actual
## Predicted 0 1
## 0 21 2
## 1 16 85
sum(diag(tab4))/sum(tab4)
## [1] 0.8548387
With an accuracy of ~85%, our keras model classifies loan approvals with a high degree of accuracy while misclassifying many loan rejections as approvals: 16 misclassified rejections vs. just 2 misclassified approvals.
As such, we’ll attempt to optimize this model.
We revisit our model with varied neurons / layer_dropout.
Increasing the number of neurons may improve model performance, but that is not always the case, so we also specify a dropout rate of 0.4, re-specify our validation split as 0.2, and revisit our visualization and evaluation metrics to observe the effect:
set.seed(123) #replicability
#Create model2
model_keras_2 <- keras_model_sequential()
model_keras_2 %>%
  layer_dense(units = 100, activation = 'relu', input_shape = c(5)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1)
#Compile model2
model_keras_2 %>% compile(loss = 'mse',
                          optimizer = optimizer_rmsprop(lr = 0.005),
                          metrics = c("accuracy"))
#Fit model2
model_keras_train2 <- model_keras_2 %>%
  fit(training_keras,
      trainingtarget,
      epochs = 100,
      batch_size = 32,
      validation_split = 0.2)
#Visualize
plot(model_keras_train2)
The loss plot's y-axis covers a different range: training loss sits around ~0.15 while validation loss stabilizes around ~0.17, which makes sense given that the training portion was reduced in size.
The accuracy plot's y-axis also covers a different range: training and validation start with low accuracies and track each other, with validation slightly higher (0.825 vs. 0.81), until the curves cross around epoch 75, after which training accuracy is higher.
We explore further via evaluation metrics and confusion matrix:
set.seed(123) #replicability
#Evaluate model on training data
model_keras_2 %>% evaluate(training_keras, trainingtarget)
## loss accuracy
## 0.1413085 0.8163266
pred_keras_train2 <- model_keras_2 %>% predict(training_keras) #loss: 0.1338, accuracy: 0.8390
#Evaluate model on test data
model_keras_2 %>% evaluate(testing_keras, testingtarget)
## loss accuracy
## 0.1178222 0.8709677
pred_keras_test2 <- model_keras_2 %>% predict(testing_keras) #loss: 0.1413, accuracy: 0.8632
# Confusion matrix and misclassification
pred_keras_2 <- ifelse(pred_keras_test2 > 0.5, 1, 0)
tab5 <- table(Predicted = pred_keras_2,
              Actual = testingtarget)
tab5
## Actual
## Predicted 0 1
## 0 24 3
## 1 13 84
sum(diag(tab5))/sum(tab5)
## [1] 0.8709677
#Performance statistics for later comparison
NN_Model_2 <- confusionMatrix(as.factor(pred_keras_2), as.factor(testingtarget))$byClass
acc1 <- confusionMatrix(as.factor(pred_keras_2), as.factor(testingtarget))$overall['Accuracy']
NN_Model_2 <- data.frame(NN_Model_2)
NN_Model_2 <- rbind("Accuracy" = acc1, NN_Model_2)
The impact on (unseen) test data accuracy is noted above, and we pay particular attention to the improvement in rejection misclassifications from 16 to 13.
For our second "alternative" neural network approach, we consider the neuralnet package and explore the characteristics of the model and means of optimizing it.
We start by pre-processing our data:

- converting Credit_History to type numeric,
- recoding Self_Employed as Yes = 1, No = 0,
- recoding Property_Area as Semiurban = 2, Urban = 1, and Rural = 0,
- recoding Loan_Status as Yes = 1, No = 0, and
- min-max normalizing the variables:

\[ Transformed.Values = \frac{(Values - Min)}{(Max - Min)} \]

With our data in proper form (consistent, normalized variables), we train-test split using an 80/20 partition, then train and visualize our model:
set.seed(123) #replicability
# Convert Credit_History to numeric type
loan_nn$Credit_History <- as.numeric(loan_nn$Credit_History)
# Change variable values
loan_nn$Self_Employed <- ifelse(loan_nn$Self_Employed == "Yes", 1, 0)
loan_nn$Loan_Status <- ifelse(loan_nn$Loan_Status == "Y", 1, 0)
loan_nn$Property_Area <- case_when(
  loan_nn$Property_Area == "Semiurban" ~ 2,
  loan_nn$Property_Area == "Urban" ~ 1,
  loan_nn$Property_Area == "Rural" ~ 0
)
# Min-Max Normalization
loan_nn$Self_Employed <- (loan_nn$Self_Employed - min(loan_nn$Self_Employed))/(max(loan_nn$Self_Employed) - min(loan_nn$Self_Employed))
loan_nn$LoanAmount <- (loan_nn$LoanAmount - min(loan_nn$LoanAmount))/(max(loan_nn$LoanAmount) - min(loan_nn$LoanAmount))
loan_nn$Credit_History <- (loan_nn$Credit_History - min(loan_nn$Credit_History))/(max(loan_nn$Credit_History) - min(loan_nn$Credit_History))
loan_nn$Property_Area <- (loan_nn$Property_Area - min(loan_nn$Property_Area))/(max(loan_nn$Property_Area) - min(loan_nn$Property_Area))
loan_nn$TotalIncome <- (loan_nn$TotalIncome - min(loan_nn$TotalIncome))/(max(loan_nn$TotalIncome) - min(loan_nn$TotalIncome))
# Train-test split
ind <- sample(2, nrow(loan_nn), replace = TRUE, prob = c(0.8, 0.2))
training_nn <- loan_nn[ind==1,]
testing_nn <- loan_nn[ind==2,]
#Train model
n1 <- neuralnet(Loan_Status ~ Self_Employed + LoanAmount + Credit_History + Property_Area + TotalIncome,
                data = training_nn,
                hidden = 1,
                err.fct = "ce",
                linear.output = FALSE)
#Visualize model
plot(n1, rep = "best")
The plot of our neural network above shows 5 input nodes, 1 hidden layer with 1 node, and 1 output node. The simplest neural network we’ve visited thus far.
After this point, we verified the predictive accuracy of our model via node output calculations with sigmoid activation function. Note: corresponding code has been included in the Appendix.
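The gist of that Appendix check (weights read off the plotted network, applied to the first training row; object names lightly renamed here):

sigmoid <- function(z) 1 / (1 + exp(-z))
#hidden node: intercept plus weighted inputs for the first training observation
in_hidden <- 10.47375 + (0 * -5.36828) + (0.1432706 * 2.7186) +
             (1 * -13.27065) + (0.5 * -1.77825) + (0.05539355 * 8.03783)
out_hidden <- sigmoid(in_hidden)
#output node: intercept plus weighted hidden-node activation
out_node <- sigmoid(1.71561 + (-4.21959 * out_hidden))
out_node   #matches head(compute(n1, training_nn[,-5])$net.result, 1)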
We then proceed to attempt to optimize our neuralnet model via feature selection.
Feature selection is used to select the most important features in a set. It's a balancing act: choosing the fewest features that provide the greatest representative capability of the data and thus the strongest predictive accuracy.
We utilized forward feature selection, starting with no features and adding them back in one-by-one to find the strongest combination. During our exploratory data analysis, we found that Credit_History was the most important feature, so we began with this feature and tested combinations with other features until we arrived at:
set.seed(123)
#Adding other variables did not help model performance:
#Credit_History + Property_Area (0.871 test)
#Credit_History + TotalIncome (0.871 test)
#Credit_History + LoanAmount (0.871 test)
#Credit_History + SelfEmployed (0.871 test)
n2 <- neuralnet(Loan_Status ~ Credit_History,
                data = training_nn,
                hidden = 1,
                err.fct = "ce",
                linear.output = FALSE)
plot(n2, rep = 'best')
An even simpler model than the one we had visited before: 1 input node, 1 hidden layer with 1 node, and 1 output node. From this we can derive that our strongest predictor, Credit_History, provides enough signal to accurately predict Loan_Status.
We then evaluate our model’s performance on training vs. test data and visit the corresponding confusion matrix:
# Confusion Matrix & Misclassification Error - training data
output <- compute(n2, training_nn[,-5])
p1 <- output$net.result
pred1 <- ifelse(p1 > 0.5, 1, 0)
tab1 <- table(Prediction = pred1, Actuals = training_nn$Loan_Status)
#tab1
paste0("Misclassification Error of training data: ", round(100 - sum(diag(tab1))/sum(tab1)*100,2))
## [1] "Misclassification Error of training data: 18.98"
paste0("Accuracy of training data: ", round(sum(diag(tab1))/sum(tab1) * 100,2))
## [1] "Accuracy of training data: 81.02"
# Confusion Matrix & Misclassification Error - testing data
output <- compute(n2, testing_nn[,-5])
p2 <- output$net.result
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Prediction = pred2, Actual = testing_nn$Loan_Status)
tab2
## Actual
## Prediction 0 1
## 0 23 2
## 1 14 85
paste0("Misclassification Error of testing data: ", round(100 - sum(diag(tab2))/sum(tab2)*100,2))
## [1] "Misclassification Error of testing data: 12.9"
paste0("Accuracy of testing data: ", round(sum(diag(tab2))/sum(tab2)*100,2))
## [1] "Accuracy of testing data: 87.1"
#Performance statistics for later comparison
NN_Model_3 <- confusionMatrix(as.factor(pred2), as.factor(testing_nn$Loan_Status))$byClass
acc1 <- confusionMatrix(as.factor(pred2), as.factor(testing_nn$Loan_Status))$overall['Accuracy']
NN_Model_3 <- data.frame(NN_Model_3)
NN_Model_3 <- rbind("Accuracy" = acc1, NN_Model_3)
From above, we observe our misclassification error and accuracy for test and training data. It appears that this model has performed as well as the prior model and, if we consider model simplicity as well, may be the strongest.
Next, we re-visit the performance of all of our neural network models to determine which approach produced the strongest model.
We pull all models into a dataframe and output their performance metrics side-by-side:
#Tabular view of model performance:
tabularview <- data.frame(NN_Model_1, NN_Model_2, NN_Model_3)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|                      | NN_Model_1 | NN_Model_2 | NN_Model_3 |
|----------------------|------------|------------|------------|
| Accuracy             | 0.8311688  | 0.8709677  | 0.8709677  |
| Sensitivity          | 0.5348837  | 0.6486486  | 0.6216216  |
| Specificity          | 0.9459459  | 0.9655172  | 0.9770115  |
| Pos Pred Value       | 0.7931034  | 0.8888889  | 0.9200000  |
| Neg Pred Value       | 0.8400000  | 0.8659794  | 0.8585859  |
| Precision            | 0.7931034  | 0.8888889  | 0.9200000  |
| Recall               | 0.5348837  | 0.6486486  | 0.6216216  |
| F1                   | 0.6388889  | 0.7500000  | 0.7419355  |
| Prevalence           | 0.2792208  | 0.2983871  | 0.2983871  |
| Detection Rate       | 0.1493506  | 0.1935484  | 0.1854839  |
| Detection Prevalence | 0.1883117  | 0.2177419  | 0.2016129  |
| Balanced Accuracy    | 0.7404148  | 0.8070829  | 0.7993166  |
We see that NN_Model_2 and NN_Model_3 compare closely on numerous metrics. Since we're dealing with an unbalanced dataset (far more approvals than rejections), we use the Balanced Accuracy metric as our determining factor and see that NN_Model_2 (keras) is our strongest.
Later, we’ll provide an in-depth interpretation of this model’s statistics vs. those of our alternative classification approaches.
Later, we’ll revisit pertinent statistics for our neural network models, but first we’ll explore the efficacy of alternatives to a neural network approach to compare and contrast methods. Being that the focus of this project was on neural networks, the section that follows provides a high level overview of each alternative approach (with corresponding code in the Appendix), output statistics near the conclusion, and a more thorough analysis of output statistics in our Model Comparison section.
Decision trees build classification or regression models in the form of a tree structure. Datasets are broken into smaller and smaller subsets along chosen parameters as the associated decision tree is developed. The result is a tree with nodes and branches. Nodes denote a split point based on the attribute in question while branches denote the corresponding outcome. We start at a "root node", terminate at "leaf nodes", and use those leaf nodes to provide proportions for the resulting class labels.
During EDA, we'd specified our most important features. We explored these features further (in tabular form), re-confirmed Credit_History as our most important feature, and retained all remaining features since tree algorithms are largely unaffected by multicollinearity.
We then train-test split our data using an 80/20 split and explored a "baseline model", which produced a training accuracy of 87% but a test accuracy of just 74.5%. Thus, we sought to optimize our model's performance.
We used cross-validation, alternating between training and test data to increase the precision of our error estimation while determining the optimal tree depth and tuning the complexity parameter; a sketch of such a search appears below.
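The tuning chunk itself is not reproduced in this write-up; the following is a hypothetical sketch of one way the depth/cp search could be run (the grid values and helper objects are assumptions, not the original code):

#Grid-search maxdepth and cp, scoring each candidate tree on the held-out test set
grid <- expand.grid(maxdepth = 1:5, cp = c(0.01, 0.02, 0.03, 0.05))
grid$acc <- apply(grid, 1, function(g) {
  fit <- rpart(Loan_Status ~ ., data = train_data, method = "class",
               control = rpart.control(minsplit = 2, minbucket = 2,
                                       cp = g[["cp"]], maxdepth = g[["maxdepth"]]))
  mean(predict(fit, test_data, type = "class") == test_data$Loan_Status)
})
grid[which.max(grid$acc), ]   #expected to land on maxdepth = 2, cp = 0.03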
We found that a depth of 2 and a cp value of 0.03 provided the highest corresponding accuracy and so we proceeded to finalize our model with optimal input parameters:
tree_mod_final <- rpart(Loan_Status ~ ., data = train_data, method="class",
                        control = rpart.control(minsplit = 2,
                                                minbucket = 2,
                                                cp = 0.03,
                                                maxdepth = 2))
rpart.plot(tree_mod_final, extra=1, cex=0.8)
The tree appears to confirm our earlier findings regarding Credit_History, and so we proceed to review its test data performance via confusion matrix and corresponding accuracy:
loanPred <- predict(tree_mod_final, test_data, type = "class")
tab_dt <- table(Predicted = loanPred, Actual = test_data$Loan_Status)
tab_dt
## Actual
## Predicted N Y
## N 26 3
## Y 45 130
paste0("Accuracy : ", round(sum(diag(tab_dt))/sum(tab_dt) * 100,2))
## [1] "Accuracy : 76.47"
We find that our optimal decision tree model has an accuracy of 76.5% on test data with a total of 48 misclassifications. The model predicts approvals precisely while mispredicting loan rejections at an overwhelming rate. This may be because of the model's dependence on Credit_History and that variable's strength in predicting approval and weakness in predicting rejection. We'll revisit corresponding statistics in the Overall Model Comparison section.
As the next natural progression from decision trees, we move on to modeling via random forest (a collection of trees).
Random forests are one of the most popular machine learning algorithms. They can be used for classification or regression, handle a large number of features, and are generally successful because they provide good predictive performance, a low incidence of over-fitting, and easy interpretability.
We train-test split our dataset using an 80/20 split and used the randomForest package to fit a simple random forest (using all variables as predictors). A plot of the class error rates showed that, as the number of trees increases, the error rates stabilize and the error for the Y class approaches 0. A sketch of the training step appears below, followed by the corresponding confusion matrix for test data:
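The chunk that builds model_rf is not reproduced in this write-up; a plausible sketch (the split call and reliance on randomForest defaults are assumptions):

set.seed(66)
#Assumed 80/20 split of the random forest copy of the data
ind_rf <- sample(2, nrow(loan_rf), replace = TRUE, prob = c(0.8, 0.2))
train_data_rf <- loan_rf[ind_rf == 1, ]
test_data_rf  <- loan_rf[ind_rf == 2, ]
#Simple random forest on all predictors (randomForest defaults: 500 trees, mtry = floor(sqrt(p)))
model_rf <- randomForest(Loan_Status ~ ., data = train_data_rf)
plot(model_rf)   #class error rates vs. number of trees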
set.seed(66)
pred_rf <- predict(model_rf, test_data_rf, type="class")
tab_rf <- table(Predicted = pred_rf, Actual = test_data_rf$Loan_Status)
tab_rf
## Actual
## Predicted N Y
## N 44 4
## Y 22 136
pred_rf <- as.factor(pred_rf)
test_data_rf$Loan_Status <- as.factor(test_data_rf$Loan_Status)
paste0("Accuracy : ", round(sum(diag(tab_rf))/sum(tab_rf)*100,2))
## [1] "Accuracy : 87.38"
RF_Model_1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$byClass
acc1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_1 <- data.frame(RF_Model_1)
RF_Model_1 <- rbind("Accuracy" = acc1, RF_Model_1)
We observed an accuracy of ~87.4% and proceeded to attempt to optimize the model further.
We plotted feature importance to observe which features most vs. least affected the model's predictive capability, tuned the number of trees to minimize the error rate, created a second model (RF_Model_2), and observed that accuracy had actually decreased by ~1% on test data.
We then tuned mtry and ntree (a sketch of this step follows below), created another model (RF_Model_3), and observed that its accuracy on test data was 86.89%. Thus our first model (RF_Model_1), the model without any optimization attempts, appears to have been our strongest.
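A hypothetical sketch of that tuning step (the exact call and parameter values are not shown in this write-up and are assumptions):

#Search for an mtry value that minimizes out-of-bag error, then refit with it
tuned <- tuneRF(train_data_rf[, setdiff(names(train_data_rf), "Loan_Status")],
                train_data_rf$Loan_Status,
                stepFactor = 1.5, improve = 0.01, ntreeTry = 500, trace = FALSE)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
model_rf_tuned <- randomForest(Loan_Status ~ ., data = train_data_rf,
                               mtry = best_mtry, ntree = 500)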
We visit a tabular output of the performance metrics for RF_Model_1 vs. RF_Model_2 and RF_Model_3 and confirm that RF_Model_1 was the strongest. We only show the tabular output for our random forest models because they were by far the strongest-performing alternative approaches to classification and we needed to determine which of them was strongest. The corresponding comparison table's code is available in the Appendix.
The KNN algorithm hinges on the idea that similar data will be near one another. When applying KNN, the distances between data points are used to classify data: those nearer one another are "batched" together. The number of batches we see and the distance between points are determined by the k value we select.
Due to the simplicity of the calculations involved and the ease of interpreting the result, KNN is a popular classifier. For our purposes, we're going to apply the k-nearest neighbor algorithm to the loan dataset to predict the Loan_Status variable.
We use the knn function from the class package to create a 3-Nearest Neighbors model for the loan dataset.
We convert categorical variables to numerical, apply standard scaling so that each variable in the dataset has a mean of 0 and standard deviation of 1 (using \(z_i = \frac{x_i - \bar{x}}{s_x}\)), and then train-test split our dataset.
We explore model performance using raw and scaled data, vary k over a range of 1 to 100 to determine the optimal value, and select our final model based on which was more accurate (scaled outperformed unscaled); a sketch of the k search appears below, followed by the final model's results.
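The loop that produced valid_acc_sc is not shown above; a sketch of how it could look, using the scaled objects referenced in the next chunk:

#Accuracy on the held-out set for k = 1..100 (scaled data)
valid_acc_sc <- sapply(1:100, function(k) {
  pred_k <- knn(training_knn, testing_knn, trainingtarget_knn, k = k)
  mean(pred_k == testingtarget_knn)
})
which.max(valid_acc_sc)   #k used for the final scaled model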
set.seed(26) #ensure reproducibility
#Select optimal k for SCALED data
acc_sc <- which.max(valid_acc_sc)
#acc_sc
#Confusion matrix and accuracy for scaled data model with k = 10
train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k=acc_sc)
train_acc <- mean(train_pred == trainingtarget_knn)

valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k=acc_sc)
valid_acc <- mean(valid_pred == testingtarget_knn)
cat('Training Accuracy: ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
## Training Accuracy: 0.8298755
## Validation Accuracy: 0.8333333
tab_knn_sc <- table(Predicted = valid_pred, Actual = testingtarget_knn)
tab_knn_sc
## Actual
## Predicted 0 1
## 0 17 6
## 1 16 93
From above we observe a relatively strong prediction rate for approvals and roughly a 50/50 split when it comes to predicting rejections, which is quite low. It would appear that this is a sub-optimal model; rejections may require more nuanced detection than grouping based on nearest neighbors.
Boosting is a method for creating an ensemble, or a combination of machine learning models. Gradient boosting relies on the belief that the best possible next model, when combined with prior models, reduces how often we misclassify our data.
It’s heralded as one of the most powerful techniques for building predictive models and offers numerous options for implementation. We’ll make use of the XGBoost (eXtreme Gradient Boosting) distributed gradient library due to its promise of flexibility, portability, and efficiency.
For our purposes, we're dealing with a categorical dependent variable, and will build a gradient boosting model from the loan approval dataset to provide Loan_Status predictions (i.e., approve or reject).
To prepare our data for gradient boosting, we verify the skewness of our numeric variables, apply a log() transform to the skewed ones, recode categoric variables onto a numeric range (starting at 0), train-test split our data, and create train and test matrices for the one-hot encoding of variables. We then build our model (a sketch of a possible setup follows) and review the confusion matrices and misclassification error rates:
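The chunk that constructs GB_model is not reproduced here; the following is a hypothetical sketch of a setup consistent with the multi-class softprob predictions used below (the one-hot matrix names, nrounds, and parameter values are assumptions):

#One-hot model matrices and labels (train_onehot / test_onehot are assumed names)
nc <- length(unique(train_label))                      #number of classes (2)
train_matrix <- xgb.DMatrix(data = train_onehot, label = train_label)
test_matrix  <- xgb.DMatrix(data = test_onehot,  label = test_label)
xgb_params <- list(objective = "multi:softprob",       #class probabilities per observation
                   num_class = nc,
                   eval_metric = "mlogloss")
GB_model <- xgb.train(params = xgb_params,
                      data = train_matrix,
                      nrounds = 100,                    #assumed value
                      watchlist = list(train = train_matrix, test = test_matrix),
                      verbose = 0)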
#prediction and confusion matrix from TRAIN data
p_train <- predict(GB_model, newdata = train_matrix)
pred_train <- matrix(p_train, nrow = nc, ncol = length(p_train)/nc) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = train_label, max_prob = max.col(., "last") - 1)
tab_train <- table(Prediction = pred_train$max_prob, Actual = pred_train$label)
print(tab_train)
## Actual
## Prediction 0 1
## 0 108 2
## 1 49 298
print(paste('Misclassification Error with Train data', round(1 - sum(diag(tab_train))/sum(tab_train),3))) #0.112
## [1] "Misclassification Error with Train data 0.112"
#prediction and confusion matrix from TEST data
p_test <- predict(GB_model, newdata = test_matrix)
pred_test <- matrix(p_test, nrow = nc, ncol = length(p_test)/nc) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(., "last") - 1)
tab_test <- table(Prediction = pred_test$max_prob, Actual = pred_test$label)
print(tab_test)
## Actual
## Prediction 0 1
## 0 15 8
## 1 20 114
print(paste('Misclassification Error with Test data', round(1 - sum(diag(tab_test))/sum(tab_test),3))) #0.178
## [1] "Misclassification Error with Test data 0.178"
#feature importance
imp <- xgb.importance(colnames(train_matrix), model=GB_model)
print(imp) #higher Gain means higher feature importance
## Feature Gain Cover Frequency
## 1: Credit_History 0.52509792 0.18382990 0.04666667
## 2: TotalIncome 0.25538857 0.43982782 0.46833333
## 3: LoanAmount 0.16711178 0.24865180 0.38666667
## 4: Property_Area 0.03417599 0.11209544 0.06166667
## 5: Self_Employed 0.01822575 0.01559505 0.03666667
From above we note misclassification error rates (the complement of accuracy) of 0.112 and 0.178 on training and test data respectively. This is a strong performance, though not quite as strong as the RF models we explored, and so we attempt to optimize.
From our feature importance scores, we observed that Credit_History, TotalIncome, and LoanAmount had the highest Gain scores, so we omitted all other features in our second model. After adapting our matrices, building our model, and optimizing input parameters (e.g., nrounds), we found that this feature selection slightly reduced our model accuracy, and so our first model outperformed our second.
With a rather in-depth exploration of alternative models under our belt, we proceed to compare the strongest performing of our models: random forest and neural network.
We put our strongest Random Forest and Neural Network models side-by-side to interpret their common classification metrics and determine which has the greatest predictive accuracy.
We consider the following classification metrics from each model's confusion matrix: accuracy, sensitivity, specificity, positive and negative predictive value, precision, recall, F1, prevalence, detection rate, detection prevalence, and balanced accuracy.
These models are all applied to the test data set and make use of the confusionMatrix function from the caret library. We present the corresponding common classification metrics as a kable table to glean more insight regarding the relative strength of each model’s performance:
#Tabular view of our 2 strongest models:
tabularview <- data.frame(RF_Model_1, NN_Model_2)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
|                      | RF_Model_1 | NN_Model_2 |
|----------------------|------------|------------|
| Accuracy             | 0.8737864  | 0.8709677  |
| Sensitivity          | 0.6666667  | 0.6486486  |
| Specificity          | 0.9714286  | 0.9655172  |
| Pos Pred Value       | 0.9166667  | 0.8888889  |
| Neg Pred Value       | 0.8607595  | 0.8659794  |
| Precision            | 0.9166667  | 0.8888889  |
| Recall               | 0.6666667  | 0.6486486  |
| F1                   | 0.7719298  | 0.7500000  |
| Prevalence           | 0.3203883  | 0.2983871  |
| Detection Rate       | 0.2135922  | 0.1935484  |
| Detection Prevalence | 0.2330097  | 0.2177419  |
| Balanced Accuracy    | 0.8190476  | 0.8070829  |
First, it’s worth noting that the (keras) neural network model from above outperformed all corresponding alternative approaches to classification. Thus, there’s certainly merit to the strength and capability of a neural network.
With that said, we consult the above table and observe that while both models performed well and had close statistics across the board, RF_Model_1 outperformed NN_Model_2 on all metrics aside from Neg Pred Value. It appears that NN_Model_2 is slightly better at identifying loan rejections, but other than that, and especially considering that we're dealing with an unbalanced dataset, RF_Model_1 would be the preferred model.
We explored nnet, keras, and neuralnet as neural network approaches. Our assumption was that the top-performing neural network model would be our strongest model overall. This assumption was based on hearing a great deal about the advantages and applications of deep learning models. In particular, we thought that back-propagation, as well as the ability to adapt hidden layers to further improve performance, would lead to a major performance advantage.
Our strongest neural network model outperformed all alternative models by a significant margin … aside from random forests. The accuracies and balanced accuracies were quite close but our strongest random forest model edged our strongest neural network model by a slight margin.
This surprised our entire group considering that we’d applied random forest, decision tree, and gradient boosting models in a past assignment to the same dataset and had the random forest model perform the worst and the gradient boosting model perform the best.
We explored and prepared the data in a different manner this time, though. We selected only pertinent variables and combined the income fields to create a TotalIncome variable. The data preparation we did, as well as our use of set seeds and the various arguments passed to the methods within each model, likely played a role. With just a couple of slight changes to the models or to data preparation we may have seen a different outcome, and this was our largest lesson learned.
When an approach is lauded, mystical, and popular, that doesn't necessarily mean it's applicable in every situation. There is no "one size fits all" in Data Science. In this case the random forest model may have performed better because of the data at hand, the way we applied our methods, or because it was simply most advantageous in this particular situation. If we were to apply each model with different seeds, the accuracies may have varied and we may have come to a different conclusion, and that's the beauty of Data Science.
There’s much to be learned.
#Standardize chunk-knitting
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
set.seed(123) #ensure reproducibility
#Load relevant libraries
library(tidyverse)
library(caret)
library(ggplot2)
library(MASS)
library(dplyr)
library(kableExtra)
library(plyr) #revalue 'Dependent'
library(mice) #pmm imputation
library(corrplot) #correlation matrix
library(Boruta) #Utilize Boruta for feature ranking and selection
library(gridExtra) #output plots via grid.arrange
library(car) #outlier handling
library(caTools)
library(keras) #NN approach 2
library(neuralnet) #NN approach 3
require(reshape2) # small multiple density plot
library(class) # kNN
library(rpart) # DT
library(rpart.plot)# DT
library(data.table)
library(randomForest) # RF
library(e1071) #skewness
library(gbm)
library(xgboost) # a faster implementation of gbm
library(Matrix)
#Utilize customized functions
plot_corr_matrix <- function(dataframe, significance_threshold){
  title <- paste0('Correlation Matrix for significance > ',
                  significance_threshold)
  df_cor <- dataframe %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  #run a correlation and drop the insignificant ones
  corr <- cor(df_cor)
  #prepare to drop duplicates and correlations of 1
  corr[lower.tri(corr, diag=TRUE)] <- NA
  #drop perfect correlations
  corr[corr == 1] <- NA
  #turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  #remove the NA values from above
  corr <- na.omit(corr)
  #select significant values
  corr <- subset(corr, abs(Freq) > significance_threshold)
  #sort by highest correlation
  corr <- corr[order(-abs(corr$Freq)),]
  #print table
  # print(corr)
  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  #plot correlations visually
  corrplot(mtx_corr,
           title=title,
           mar=c(0,0,1,0),
           method='color',
           tl.col="black",
           na.label= " ",
           addCoef.col = 'black',
           number.cex = .9)
}
download.file(
url='https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_01_03-DeepNeuralNetwork-WHITEBG.png',
destfile='image1.jpg',
mode='wb')
knitr::include_graphics(path='image1.jpg')
#Load in data
loan <- read.csv("https://raw.githubusercontent.com/SubhalaxmiRout002/Data-622-Group-5/main/Final_Project/Loan_approval.csv", stringsAsFactors = TRUE)
loan[loan==""] <- NA #replace empty strings with NAs
#head(loan) #verify 1st 6 observations
#Light EDA
glimpse(loan)
summary(loan)
#Pre-process dataset for easier interpretation
loan <- subset(loan, select = -c(1)) #drop Loan_ID from consideration
loan$TotalIncome <- loan$CoapplicantIncome + loan$ApplicantIncome #create TotalIncome variable
loan <- subset(loan, select = -c(6,7)) #drop CoapplicantIncome and ApplicantIncome
loan$Dependents <- revalue(loan$Dependents, c("3+"="3")) #relabel Dependents "3+" value as "3"
#Visualize NA counts
colSums(is.na(loan))
#Handle NAs: apply predictive mean matching to loan data
loan <- mice(loan, m = 1, method = "pmm", seed = 500)
loan <- mice::complete(loan, 1)
#verify absence of NA values in the dataset
colSums(is.na(loan))
#Utilize custom-built correlation matrix generation function
plot_corr_matrix(loan, 0.3)
#NOTE: COMMENTED OUT BELOW DUE TO LONG COMPILATION TIME. UNCOMMENT BEFORE FINAL SUBMISSION.
# Perform Boruta search
boruta_output <- Boruta(Loan_Status ~ ., data=na.omit(loan), doTrace=0, maxRuns = 1000)
#Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
#Subset data based on predictor strength
loan <- subset(loan, select = -c(1, 2, 3, 4, 7)) #drop poor predictors
#head(loan) #verify
# created copies of the dataset for LR, Neural nets, NN using keras, kNN, Decision Tree
loan_new2 <- loan
loan_nn <- loan
loan_keras <- loan
loan_knn <- loan
loan_dt <- loan
loan_rf <- loan
loan_GB <- loan
#convert Credit_History to type factor
loan$Credit_History <- factor(loan$Credit_History)
#levels(loan$Credit_History) #verify
#Numeric distributions
loan %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol=1) +
  geom_histogram(bins=90, color="darkblue", fill="lightblue")
#Categoric distributions
##Self_Employed
p1 <- loan %>% dplyr::select(1,5) %>% group_by(Self_Employed, Loan_Status) %>% count() %>%
  ggplot(aes(x=Self_Employed, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Credit_History
p2 <- loan %>% dplyr::select(3,5) %>% group_by(Credit_History, Loan_Status) %>% count() %>%
  ggplot(aes(x=Credit_History, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Property_Area
p3 <- loan %>% dplyr::select(4,5) %>% group_by(Property_Area, Loan_Status) %>% count() %>%
  ggplot(aes(x=Property_Area, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)
set.seed(123) #replicability
bar <- floor(0.75 * nrow(loan)) #compute "barrier value"
partition <- sample(seq_len(nrow(loan)), size = bar) #sample based on barrier value
#Subset: train-test split based on partition value
train <- loan[partition, ]
test <- loan[-partition, ]
#print(dim(train)) #460 x 6
#print(dim(test)) #154 x 6
set.seed(123) #replicability
#Specify training algorithm parameters
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5)
#Train neural net model and standardize variables via preProcess method
nnet_model1 <- train(train[,-5], train$Loan_Status,
                     method = "nnet",
                     trControl = train_params,
                     preProcess = c("scale","center"),
                     na.action = na.omit)
set.seed(123) #replicability
#round(prop.table(table(train$Loan_Status)),3) #Baseline accuracy Y: 0.685
#Training predictions
nnPred_train <- predict(nnet_model1, train)
#Training confusion matrix
table(train$Loan_Status, nnPred_train)
round((310+78)/nrow(train),3)
#Test predictions
nnPred_test <- predict(nnet_model1, test)
#Test confusion matrix
table(test$Loan_Status, nnPred_test)
round((106+21)/nrow(test),3)
#Performance statistics for later comparison
NN_Model_1 <- confusionMatrix(as.factor(nnPred_test), as.factor(test$Loan_Status))$byClass
acc1 <- confusionMatrix(as.factor(nnPred_test), as.factor(test$Loan_Status))$overall['Accuracy']
NN_Model_1 <- data.frame(NN_Model_1)
NN_Model_1 <- rbind("Accuracy" = acc1, NN_Model_1)
set.seed(123) #replicability
#Observe the affect of outlier-handling on model performance (if any)
#Confirm the presence of influential observations
p4 <- ggplot(loan) +
  aes(x = Loan_Status, y = LoanAmount) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
p5 <- ggplot(loan) +
  aes(x = Loan_Status, y = TotalIncome) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()
grid.arrange(p4, p5, nrow = 1, ncol = 2)
set.seed(123) #replicability
#Identify outlier locations for LoanAmount, TotalIncome
out1 <- boxplot.stats(loan$LoanAmount)$out
outliers1 <- which(loan$LoanAmount %in% c(out1))
out2 <- boxplot.stats(loan$TotalIncome)$out
outliers2 <- which(loan$TotalIncome %in% c(out2))
outliers <- c(outliers1, outliers2) #merge lists
outliers <- unique(outliers) #remove repeat values
#Remove outliers
loan <- loan[-outliers,]
#Observe affect on model performance
bar <- floor(0.75 * nrow(loan)) #compute "barrier value"
partition2 <- sample(seq_len(nrow(loan)), size = bar) #sample based on barrier value
#Subset: train-test split based on partition value
train2 <- loan[partition2, ]
test2 <- loan[-partition2, ]
#Train neural net model and standardize variables via preProcess method
nnet_model2 <- train(train2[,-5], train2$Loan_Status,
                     method = "nnet",
                     trControl = train_params,
                     preProcess = c("scale","center"),
                     na.action = na.omit)
#Training predictions
nnPred_train2 <- predict(nnet_model2, train2)
#Training confusion matrix
table(train2$Loan_Status, nnPred_train2)
round((282+63)/nrow(train2),3) #0.839
#Test predictions
nnPred_test2 <- predict(nnet_model2, test2)
#Test confusion matrix
table(test2$Loan_Status, nnPred_test2)
round((94+17)/nrow(test),3) #0.721
set.seed(123) #replicability
#Observe the affect of feature creation on model performance (if any)
#if self employed and income greater than
loan$hiINC_SE <- as.factor(ifelse(loan$TotalIncome >= 7522 & loan$Self_Employed == "Yes", 1, 0))
#if semiurban property and credit history 1
loan$SEMI_CH <- as.factor(ifelse(loan$Property_Area == "Semiurban" & loan$Credit_History == 1, 1, 0))
#if not self employed and loan amount below
loan$notSE_CH <- as.factor(ifelse(loan$Self_Employed == "No" & loan$LoanAmount <= 100.0, 1, 0))
#if income below and loan amount above
loan$loINC_hiAMT <- as.factor(ifelse(loan$TotalIncome <= 4166 & loan$LoanAmount >= 166.8, 1, 0))
#head(loan) #verify
set.seed(123) #replicability
bar <- floor(0.75 * nrow(loan)) #compute "barrier value"
partition2 <- sample(seq_len(nrow(loan)), size = bar) #sample based on barrier value
#Subset: train-test split based on partition value
train2 <- loan[partition2, ]
test2 <- loan[-partition2, ]
#Train neural net model and standardize variables via preProcess method
nnet_model2 <- train(train2[,-5], train2$Loan_Status,
                     method = "nnet",
                     trControl = train_params,
                     preProcess = c("scale","center"),
                     na.action = na.omit)
#Training predictions
nnPred_train2 <- predict(nnet_model2, train2)
#Training confusion matrix
table(train2$Loan_Status, nnPred_train2)
round((282+63)/nrow(train2),3) #0.839 - LOWER
#Test predictions
nnPred_test2 <- predict(nnet_model2, test2)
#Test confusion matrix
table(test2$Loan_Status, nnPred_test2)
round((94+17)/nrow(test),3) #0.721 - LOWER
set.seed(123) #replicability
#Preprocess data
loan_keras <- loan_keras[, c(1, 3, 4, 2, 6, 5)]
# Convert Credit_History numeric type
loan_keras$Credit_History <- as.numeric(loan_keras$Credit_History)
# Change Variables values
loan_keras$Self_Employed <- ifelse(loan_keras$Self_Employed == "Yes", 1, 0)
loan_keras$Loan_Status <- ifelse(loan_keras$Loan_Status == "Y", 1, 0)
loan_keras$Property_Area <- case_when(
  loan_keras$Property_Area == "Semiurban" ~ 2,
  loan_keras$Property_Area == "Urban" ~ 1,
  loan_keras$Property_Area == "Rural" ~ 0
)
#loan_keras <- loan_keras %>% mutate_if(is.factor, as.numeric) # convert factor to numeric
#Build initial keras NN model
n <- neuralnet(Loan_Status ~ Self_Employed + Credit_History + Property_Area + LoanAmount + TotalIncome,
               data = loan_keras,
               hidden = c(4,2),
               linear.output = F,
               lifesign = 'full',
               rep = 1)
#Visualize our initial keras NN
plot(n,
     col.hidden = 'darkgreen',
     col.hidden.synapse = 'darkgreen',
     show.weights = F,
     information = F,
     fill = 'lightblue', rep = "best")
set.seed(123) #replicability
# Matrix
loan_keras <- as.matrix(loan_keras)
dimnames(loan_keras) <- NULL
# Partition
ind <- sample(2, nrow(loan_keras), replace = T, prob = c(.8, .2))
training_keras <- loan_keras[ind==1, 1:5]
testing_keras <- loan_keras[ind==2, 1:5]
trainingtarget <- loan_keras[ind==1, 6]
testingtarget <- loan_keras[ind==2, 6]
# Normalize
m <- colMeans(training_keras)
s <- apply(training_keras, 2, sd)
training_keras <- scale(training_keras, center = m, scale = s)
testing_keras <- scale(testing_keras, center = m, scale = s)
set.seed(123) #replicability
#Build model
model_keras <- keras::keras_model_sequential()
model_keras %>%
  layer_dense(units = 4, activation = 'relu', input_shape = c(5)) %>%
  layer_dense(units = 2, activation = 'relu') %>%
  layer_dense(units = 1)
#Compile model
model_keras %>% compile(loss = 'mse',
                        optimizer = optimizer_rmsprop(lr = 0.005),
                        metrics = c("accuracy"))
#Train model
model_keras_train <- model_keras %>%
  fit(training_keras,
      trainingtarget,
      epochs = 100,
      validation_split = 0.1)
#Visualize
plot(model_keras_train)
set.seed(123) #replicability
#Evaluate model on training data
model_keras %>% evaluate(training_keras, trainingtarget)
pred_keras_train <- model_keras %>% predict(training_keras)
#Evaluate model on test data
model_keras %>% evaluate(testing_keras, testingtarget)
pred_keras_test <- model_keras %>% predict(testing_keras)
set.seed(123) #replicability
# Confusion matrix and misclassification - test data
pred_keras_1 <- ifelse(pred_keras_test > 0.5, 1, 0)
tab4 <- table(Predicted = pred_keras_1,
              Actual = testingtarget)
tab4
sum(diag(tab4))/sum(tab4)
set.seed(123) #replicability
#Create model2
model_keras_2 <- keras_model_sequential()
model_keras_2 %>%
  layer_dense(units = 100, activation = 'relu', input_shape = c(5)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1)
#Compile model2
model_keras_2 %>% compile(loss = 'mse',
                          optimizer = optimizer_rmsprop(lr = 0.005),
                          metrics = c("accuracy"))
#Fit model2
model_keras_train2 <- model_keras_2 %>%
  fit(training_keras,
      trainingtarget,
      epochs = 100,
      batch_size = 32,
      validation_split = 0.2)
#Visualize
plot(model_keras_train2)
set.seed(123) #replicability
#Evaluate model on training data
model_keras_2 %>% evaluate(training_keras, trainingtarget)
pred_keras_train2 <- model_keras_2 %>% predict(training_keras) #loss: 0.1338, accuracy: 0.8390
#Evaluate model on test data
model_keras_2 %>% evaluate(testing_keras, testingtarget)
pred_keras_test2 <- model_keras_2 %>% predict(testing_keras) #loss: 0.1413, accuracy: 0.8632
# Confusion matrix and misclassification
pred_keras_2 <- ifelse(pred_keras_test2 > 0.5, 1, 0)
tab5 <- table(Predicted = pred_keras_2,
              Actual = testingtarget)
tab5
sum(diag(tab5))/sum(tab5)
#Performance statistics for later comparison
NN_Model_2 <- confusionMatrix(as.factor(pred_keras_2), as.factor(testingtarget))$byClass
acc1 <- confusionMatrix(as.factor(pred_keras_2), as.factor(testingtarget))$overall['Accuracy']
NN_Model_2 <- data.frame(NN_Model_2)
NN_Model_2 <- rbind("Accuracy" = acc1, NN_Model_2)
set.seed(123) #replicability
# Convert Credit_History numeric type
loan_nn$Credit_History <- as.numeric(loan_nn$Credit_History)
# Change Variables values
loan_nn$Self_Employed <- ifelse(loan_nn$Self_Employed == "Yes", 1, 0)
loan_nn$Loan_Status <- ifelse(loan_nn$Loan_Status == "Y", 1, 0)
loan_nn$Property_Area <- case_when(
  loan_nn$Property_Area == "Semiurban" ~ 2,
  loan_nn$Property_Area == "Urban" ~ 1,
  loan_nn$Property_Area == "Rural" ~ 0
)
# Min-Max Normalization
loan_nn$Self_Employed <- (loan_nn$Self_Employed - min(loan_nn$Self_Employed))/(max(loan_nn$Self_Employed) - min(loan_nn$Self_Employed))
loan_nn$LoanAmount <- (loan_nn$LoanAmount - min(loan_nn$LoanAmount))/(max(loan_nn$LoanAmount) - min(loan_nn$LoanAmount))
loan_nn$Credit_History <- (loan_nn$Credit_History - min(loan_nn$Credit_History))/(max(loan_nn$Credit_History) - min(loan_nn$Credit_History))
loan_nn$Property_Area <- (loan_nn$Property_Area - min(loan_nn$Property_Area))/(max(loan_nn$Property_Area) - min(loan_nn$Property_Area))
loan_nn$TotalIncome <- (loan_nn$TotalIncome - min(loan_nn$TotalIncome))/(max(loan_nn$TotalIncome) - min(loan_nn$TotalIncome))
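The five nearly identical rescaling statements above can be collapsed into a single helper. A minimal sketch, where the min_max() helper and the cols vector are ours, introduced only for illustration:
# Hypothetical helper: rescale a numeric vector to the [0, 1] range (min-max normalization)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
cols <- c("Self_Employed", "LoanAmount", "Credit_History", "Property_Area", "TotalIncome")
loan_nn[cols] <- lapply(loan_nn[cols], min_max) # equivalent to the per-column lines above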
# Train-test split
ind <- sample(2, nrow(loan_nn), replace = TRUE, prob = c(0.8, 0.2))
training_nn <- loan_nn[ind==1,]
testing_nn <- loan_nn[ind==2,]
#Train model
n1 <- neuralnet(Loan_Status ~ Self_Employed + LoanAmount + Credit_History + Property_Area + TotalIncome,
                data = training_nn,
                hidden = 1,
                err.fct = "ce",
                linear.output = FALSE)
#Visualize model
plot(n1, rep = "best")
# values extracted for manual entry
head(training_nn[1,])
#manually calculate first value of output node and compare with training model output
in6 <- 10.47375 + (0*-5.36828) + (0.1432706*2.7186) + (1*-13.27065) + (0.5*-1.77825) + (0.05539355*8.03783)
out6 <- 1/(1+exp(-in6))
in7 <- 1.71561 + (-4.21959*out6)
out7 <- 1/(1+exp(-in7))
#Output actual node 7 output vs. predicted output for our 1st entry
paste0("Node 7 Output : ", out7)
output <- compute(n1, training_nn[,-5]) # column 5 (Loan_Status) is excluded from the covariates
paste0("First predicted value of NN : ", head(output$net.result, n = 1))
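The hand calculation above is simply the logistic activation applied twice, first at the hidden node and then at the output node. A minimal sketch of the same arithmetic wrapped in a function; the weights are the ones read off plot(n1), so the numbers are illustrative of our run only:
# Logistic (sigmoid) activation used by neuralnet when linear.output = FALSE
sigmoid <- function(z) 1 / (1 + exp(-z))
# Hidden node: bias + sum(weights * first training row), values read from plot(n1)
hidden_out <- sigmoid(10.47375 + sum(c(-5.36828, 2.7186, -13.27065, -1.77825, 8.03783) *
                                     c(0, 0.1432706, 1, 0.5, 0.05539355)))
# Output node: bias + weight * hidden activation; should match out7 above
sigmoid(1.71561 + (-4.21959 * hidden_out))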
set.seed(123)
#Adding other variables did not help model performance:
#Credit_History + Property_Area (0.871 test)
#Credit_History + TotalIncome (0.871 test)
#Credit_History + LoanAmount (0.871 test)
#Credit_History + SelfEmployed (0.871 test)
n2 <- neuralnet(Loan_Status ~ Credit_History,
                data = training_nn,
                hidden = 1,
                err.fct = "ce",
                linear.output = FALSE)
plot(n2, rep = 'best')
# Confusion Matrix & Misclassification Error - training data
output <- compute(n2, training_nn[,-5])
p1 <- output$net.result
pred1 <- ifelse(p1 > 0.5, 1, 0)
tab1 <- table(Prediction = pred1, Actuals = training_nn$Loan_Status)
#tab1
paste0("Misclassification Error of training data: ", round(100 - sum(diag(tab1))/sum(tab1)*100, 2))
paste0("Accuracy of training data: ", round(sum(diag(tab1))/sum(tab1)*100, 2))
# Confusion Matrix & Misclassification Error - testing data
output <- compute(n2, testing_nn[,-5])
p2 <- output$net.result
pred2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Prediction = pred2, Actual = testing_nn$Loan_Status)
tab2
paste0("Misclassification Error of testing data: ", round(100 - sum(diag(tab2))/sum(tab2)*100, 2))
paste0("Accuracy of testing data: ", round(sum(diag(tab2))/sum(tab2)*100, 2))
#Performance statistics for later comparison
NN_Model_3 <- confusionMatrix(as.factor(pred2), as.factor(testing_nn$Loan_Status))$byClass
acc1 <- confusionMatrix(as.factor(pred2), as.factor(testing_nn$Loan_Status))$overall['Accuracy']
NN_Model_3 <- data.frame(NN_Model_3)
NN_Model_3 <- rbind("Accuracy" = acc1, NN_Model_3)
#Tabular view of model performance:
tabularview <- data.frame(NN_Model_1, NN_Model_2, NN_Model_3)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down")
#Revisit feature importance
#boruta_output = Boruta(Loan_Status ~ ., data = loan_dt, doTrace = 0, maxRuns = 1000)
#roughFixMod = TentativeRoughFix(boruta_output)
#importance = attStats(TentativeRoughFix(boruta_output))
#importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
#kable(head(importance[order(-importance$meanImp), ]),
# caption = "Feature Importance of Loan Data") %>%
# kable_styling(bootstrap_options = "striped", full_width = TRUE)
set.seed(123) #for reproducibility
# Split data into training and testing sets
sample_data = sample.split(loan_dt, SplitRatio = 0.8)
train_data <- subset(loan_dt, sample_data == TRUE)
test_data <- subset(loan_dt, sample_data == FALSE)
# Plot tree
tree_mod <- rpart(Loan_Status ~ ., data = train_data, method = "class",
                  control = rpart.control(minsplit = 4,
                                          minbucket = 2,
                                          cp = 0,
                                          maxdepth = 6))
rpart.plot(tree_mod, extra = 1, cex = 0.8)
train_pred <- predict(tree_mod, train_data, type = "class")
test_pred <- predict(tree_mod, test_data, type = "class")
paste("Confusion matrix of Test data")
table(Predicted = test_pred, Actual = test_data$Loan_Status)
cat('Training Accuracy: ', mean(train_pred == train_data$Loan_Status), '\n',
    'Test Set Accuracy: ', mean(test_pred == test_data$Loan_Status), sep='')
#Determine optimal tree depth
set.seed(123) #for reproducibility
bc_tree_cv <- train(Loan_Status ~ ., loan_dt, method = "rpart2",
                    trControl = trainControl(method = "cv", number = 10),
                    tuneGrid = expand.grid(maxdepth = c(6)),
                    control = rpart.control(minsplit = 4,
                                            minbucket = 2,
                                            cp = 0))
bc_tree_cv
set.seed(123) #for reproducibility
bc_tree_cv_md <- train(Loan_Status ~ ., loan_dt, method = "rpart2",
                       trControl = trainControl(method = "cv", number = 10),
                       tuneGrid = expand.grid(maxdepth = 1:20),
                       control = rpart.control(minsplit = 4,
                                               minbucket = 2,
                                               cp = 0))
best_ix = which.max(bc_tree_cv_md$results$Accuracy)
bc_tree_cv_md$results[best_ix, ]
#Plot Accuracy vs. Max Tree Depth
plot(bc_tree_cv_md$results$maxdepth, bc_tree_cv_md$results$Accuracy, ylab = "Accuracy (Cross-Validation)", xlab = "Max Tree Depth")
lines(1:20, bc_tree_cv_md$results$Accuracy)
abline(v = which.max(bc_tree_cv_md$results$Accuracy), col = "red", lty = 2, lwd = 1)
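caret can draw the same resampling profile directly from the train object, which serves as a quick check on the manual plot above. A minimal sketch:
# Accuracy vs. maxdepth straight from caret's resampling results
plot(bc_tree_cv_md)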
#Determine optimal tuning complexity parameter
set.seed(123) #for reproducibility
bc_tree_cv_cp <- train(Loan_Status ~ ., loan_dt, method = "rpart",
                       trControl = trainControl(method = "cv", number = 10),
                       tuneGrid = expand.grid(cp = seq(0, 0.1, 0.001)),
                       control = rpart.control(minsplit = 4,
                                               minbucket = 2,
                                               maxdepth = 30))
best_ix = which.max(bc_tree_cv_cp$results$Accuracy)
bc_tree_cv_cp$results[best_ix, ]
#Plot Accuracy vs. Complexity Parameter
plot(bc_tree_cv_cp, pch = "")
tree_mod_final <- rpart(Loan_Status ~ ., data = train_data, method = "class",
                        control = rpart.control(minsplit = 2,
                                                minbucket = 2,
                                                cp = 0.03,
                                                maxdepth = 2))
rpart.plot(tree_mod_final, extra = 1, cex = 0.8)
loanPred <- predict(tree_mod_final, test_data, type = "class")
tab_dt <- table(Predicted = loanPred, Actual = test_data$Loan_Status)
tab_dt
paste0("Accuracy : ", round(sum(diag(tab_dt))/sum(tab_dt)*100, 2))
set.seed(66)
# Split data into training and testing sets
sample_data_rf = sample.split(loan_rf, SplitRatio = 0.8)
train_data_rf <- subset(loan_rf, sample_data_rf == TRUE)
test_data_rf <- subset(loan_rf, sample_data_rf == FALSE)
# fit a simple Random Forest model with training data
model_rf <- randomForest(Loan_Status ~ .,
                         data = train_data_rf)
# display model details
#model_rf
#Error rate plot
plot(model_rf, main = "Error rate of random forest")
legend("topright", cex = 1, legend = colnames(model_rf$err.rate), lty = c(1,2,3), col = c(1,2,3))
set.seed(66)
pred_rf <- predict(model_rf, test_data_rf, type = "class")
tab_rf <- table(Predicted = pred_rf, Actual = test_data_rf$Loan_Status)
tab_rf
pred_rf <- as.factor(pred_rf)
test_data_rf$Loan_Status <- as.factor(test_data_rf$Loan_Status)
paste0("Accuracy : ", round(sum(diag(tab_rf))/sum(tab_rf)*100, 2))
RF_Model_1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$byClass
acc1 <- confusionMatrix(pred_rf, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_1 <- data.frame(RF_Model_1)
RF_Model_1 <- rbind("Accuracy" = acc1, RF_Model_1)
varImpPlot(model_rf, main = "Feature Importance of RF Model")
m_opt_ntrees <- which.min(model_rf$err.rate[,'OOB'])
m_opt_err_rate <- min(model_rf$err.rate[,'OOB'])
cat("Optimal Number of Trees: ", m_opt_ntrees, "\n",
    "Minimum Error Rate: ", m_opt_err_rate, sep="")
set.seed(66)
#Build model
model_rf_2 <- randomForest(Loan_Status ~ .,
                           data = train_data_rf, ntree = m_opt_ntrees, importance = TRUE)
#Confusion matrix and accuracy statistics - test data
pred_rf_2 <- predict(model_rf_2, test_data_rf, type = "class")
tab_rf_2 <- table(Predicted = pred_rf_2, Actual = test_data_rf$Loan_Status)
tab_rf_2
pred_rf_2 <- as.factor(pred_rf_2)
paste0("Accuracy : ", round(sum(diag(tab_rf_2))/sum(tab_rf_2)*100, 2))
RF_Model_2 <- confusionMatrix(pred_rf_2, test_data_rf$Loan_Status)$byClass
acc2 <- confusionMatrix(pred_rf_2, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_2 <- data.frame(RF_Model_2)
RF_Model_2 <- rbind("Accuracy" = acc2, RF_Model_2)
# Tune mtry: for each candidate value, record the minimum OOB error rate and the tree count at which it occurs
oob_acc_list <- c()
opt_ntree_list <- c()
for(i in 1:29){
  set.seed(66)
  temp_mod <- randomForest(Loan_Status ~ ., train_data_rf, ntree = 500, importance = TRUE, mtry = i)
  oob_acc_list <- c(oob_acc_list, min(temp_mod$err.rate[,'OOB']))
  opt_ntree_list <- c(opt_ntree_list, which.min(temp_mod$err.rate[,'OOB']))
}
opt_mtry <- which.min(oob_acc_list)
opt_ntree <- opt_ntree_list[opt_mtry]
min_oob_acc <- min(oob_acc_list)
cat("Optimal Value of mtry: ", opt_mtry, "\n",
    "Optimal Value of ntree: ", opt_ntree, "\n",
    "Minimum OOB Error Rate: ", min_oob_acc, sep="")
plot(1:29, oob_acc_list, xlab = "Value of mtry", ylab = "Minimum OOB Error Rate")
lines(1:29, oob_acc_list)
abline(v = which.min(oob_acc_list), col = "red", lty = 2, lwd = 1)
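For reference, randomForest also ships its own mtry search helper, which expands and contracts mtry by a step factor for as long as the OOB error keeps improving. A minimal sketch, assuming Loan_Status is the only response column in train_data_rf; the grid-search loop above is what our reported results are based on:
set.seed(66)
# Alternative mtry search: tuneRF() adjusts mtry by stepFactor while OOB error improves
tuned <- tuneRF(x = subset(train_data_rf, select = -Loan_Status),
                y = as.factor(train_data_rf$Loan_Status),
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = FALSE)
tuned # matrix of mtry values and their OOB error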
set.seed(66)
#Build model
model_rf_3 <- randomForest(Loan_Status ~ .,
                           data = train_data_rf, ntree = opt_ntree, mtry = opt_mtry, importance = TRUE)
#Confusion matrix and accuracy - test data
pred_rf_3 <- predict(model_rf_3, test_data_rf, type = "class")
tab_rf_3 <- table(Predicted = pred_rf_3, Actual = test_data_rf$Loan_Status)
tab_rf_3
pred_rf_3 <- as.factor(pred_rf_3)
paste0("Accuracy : ", round(sum(diag(tab_rf_3))/sum(tab_rf_3)*100, 2))
RF_Model_3 <- confusionMatrix(pred_rf_3, test_data_rf$Loan_Status)$byClass
acc3 <- confusionMatrix(pred_rf_3, test_data_rf$Loan_Status)$overall['Accuracy']
RF_Model_3 <- data.frame(RF_Model_3)
RF_Model_3 <- rbind("Accuracy" = acc3, RF_Model_3)
#Tabular view of model performance: RF_Model_1 is the strongest
tabularview <- data.frame(RF_Model_1, RF_Model_2, RF_Model_3)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down")
set.seed(26) #ensure reproducibility
loan_knn <- loan_knn[, c(1, 3, 4, 2, 6, 5)]
# Convert Credit_History numeric type
loan_knn$Credit_History <- as.numeric(loan_knn$Credit_History)
# Change Variables values
loan_knn$Self_Employed <- ifelse(loan_knn$Self_Employed == "Yes", 1, 0)
loan_knn$Loan_Status <- ifelse(loan_knn$Loan_Status == "Y", 1, 0)
loan_knn$Property_Area <- case_when(
  loan_knn$Property_Area == "Semiurban" ~ 2,
  loan_knn$Property_Area == "Urban" ~ 1,
  loan_knn$Property_Area == "Rural" ~ 0
)
# Partition
ind <- sample(2, nrow(loan_knn), replace = T, prob = c(.8, .2))
training_knn <- loan_knn[ind==1, 1:5]
testing_knn <- loan_knn[ind==2, 1:5]
trainingtarget_knn <- loan_knn[ind==1, 6]
testingtarget_knn <- loan_knn[ind==2, 6]
# unscaled data
train <- loan_knn[ind==1, 1:5]
test <- loan_knn[ind==2, 1:5]
# Normalize (standard-scale) the data using training means/sds
m <- colMeans(training_knn)
s <- apply(training_knn, 2, sd)
training_knn <- scale(training_knn, center = m, scale = s)
testing_knn <- scale(testing_knn, center = m, scale = s)
#Determine optimal k
train_acc <- c()
valid_acc <- c()
train_acc_sc <- c()
valid_acc_sc <- c()
k_range <- 1:100
for (i in k_range) {
  # Unscaled data
  train_pred_a <- knn(train, train, trainingtarget_knn, k = i)
  train_acc <- c(train_acc, mean(train_pred_a == trainingtarget_knn))
  valid_pred_a <- knn(train, test, trainingtarget_knn, k = i)
  valid_acc <- c(valid_acc, mean(valid_pred_a == testingtarget_knn))
  # Standard Scaling
  train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k = i)
  train_acc_sc <- c(train_acc_sc, mean(train_pred == trainingtarget_knn))
  valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k = i)
  valid_acc_sc <- c(valid_acc_sc, mean(valid_pred == testingtarget_knn))
}
print(max(valid_acc))
print(max(valid_acc_sc))
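The same k search can also be delegated to caret, which handles the cross-validation and the centering/scaling internally. A minimal alternative sketch (illustrative only; the manual loop above is what our reported accuracies come from):
# Alternative: cross-validated search over k with built-in standard scaling
set.seed(26)
knn_cv <- train(x = loan_knn[ind == 1, 1:5],
                y = as.factor(trainingtarget_knn),
                method = "knn",
                preProcess = c("center", "scale"),
                tuneGrid = data.frame(k = 1:100),
                trControl = trainControl(method = "cv", number = 10))
knn_cv$bestTune # k chosen by cross-validation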
#Training and Validation Curves for UNSCALED data
plot(k_range, train_acc, pch='.', ylim=c(0.55, 1), col='salmon', ylab="")
lines(k_range, train_acc, lwd=2, col='salmon')
lines(k_range, valid_acc, lwd=2, col='cornflowerblue')
legend(65, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
#Training and Validation Curves for SCALED data
plot(k_range, train_acc_sc, pch='.', ylim=c(0.55, 1), col='salmon', ylab="")
lines(k_range, train_acc_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_sc, lwd=2, col='cornflowerblue')
legend(65, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
#Select optimal k for UNSCALED data (index equals k since k_range = 1:100)
acc <- which.max(valid_acc)
#Confusion matrix and accuracy statistics for UNSCALED data - weaker model
train_pred <- knn(train, train, trainingtarget_knn, k = acc)
train_acc <- mean(train_pred == trainingtarget_knn)
valid_pred <- knn(train, test, trainingtarget_knn, k = acc)
valid_acc <- mean(valid_pred == testingtarget_knn)
cat('Training Accuracy: ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
tab_knn <- table(Predicted = valid_pred, Actual = testingtarget_knn)
tab_knn
set.seed(26) #ensure reproducibility
#Select optimal k for SCALED data
acc_sc <- which.max(valid_acc_sc)
#acc_sc
#Confusion matrix and accuracy for scaled data model with k = 10
train_pred <- knn(training_knn, training_knn, trainingtarget_knn, k = acc_sc)
train_acc <- mean(train_pred == trainingtarget_knn)
valid_pred <- knn(training_knn, testing_knn, trainingtarget_knn, k = acc_sc)
valid_acc <- mean(valid_pred == testingtarget_knn)
cat('Training Accuracy: ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')
tab_knn_sc <- table(Predicted = valid_pred, Actual = testingtarget_knn)
tab_knn_sc
#Pre-process data: normalize independent numeric vars
#head(loan_GB) #revisit data
#calculate skewness prior to transformation
skewness(loan$TotalIncome, na.rm = TRUE)
skewness(loan$LoanAmount, na.rm = TRUE)
#transformation: account for outliers with log transform
loan$TotalIncome <- log10(loan$TotalIncome)
loan$LoanAmount <- log10(loan$LoanAmount)
#calculate skewness after transformation
skewness(loan$TotalIncome, na.rm = TRUE)
skewness(loan$LoanAmount, na.rm = TRUE) #drastically reduced
# Convert Credit_History to numeric type
loan_GB$Credit_History <- as.numeric(loan_GB$Credit_History) #0, 1
# Change Variables values
loan_GB$Self_Employed <- ifelse(loan_GB$Self_Employed == "Yes", 1, 0) #0,1
loan_GB$Loan_Status <- ifelse(loan_GB$Loan_Status == "Y", 1, 0) #0,1
loan_GB$Property_Area <- case_when( #0,1,2
  loan_GB$Property_Area == "Semiurban" ~ 2,
  loan_GB$Property_Area == "Urban" ~ 1,
  loan_GB$Property_Area == "Rural" ~ 0
)
#Convert Loan_Status to factor
#loan_GB$Loan_Status <- as.factor(loan_GB$Loan_Status)
head(loan_GB) #verify all numeric inputs
#Partition data
set.seed(1234)
#loan_GB
ind <- sample(2, nrow(loan_GB), replace = TRUE, prob = c(0.75, 0.25))
train_GB <- loan_GB[ind == 1, ]
test_GB <- loan_GB[ind == 2, ]
#dim(train_GB) #457 x 6
#dim(test_GB) #157 x 6
#Create train, test matrices - one hot encoding for factor variables
trainm <- sparse.model.matrix(Loan_Status ~ ., data = train_GB)
#head(trainm)
train_label <- train_GB[,"Loan_Status"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(Loan_Status ~ ., data = test_GB)
test_label <- test_GB[,"Loan_Status"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)
#Parameters
nc <- length(unique(train_label)) #number of classes
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc)
watchlist <- list(train = train_matrix, test = test_matrix)
#extreme Gradient Boosting Model
GB_model <- xgb.train(params = xgb_params,
                      data = train_matrix,
                      nrounds = 28, #run 100 iterations 1st then update based on test error value
                      watchlist = watchlist,
                      eta = 0.1, seed = 1234
                      #increasing the eta value improved accuracy by ~1 point
                      )
#error plot
#e <- data.frame(GB_model$evaluation_log)
#plot(e$iter, e$train_mlogloss)
#lines(e$iter, e$test_mlogloss, col = 'red')
#determine when test error was lowest
#min(e$test_mlogloss) #0.480478 lowest error
#e[e$test_mlogloss == 0.480478,] #28th iteration
#prediction and confusion matrix from TRAIN data
p_train <- predict(GB_model, newdata = train_matrix)
# predict() returns one probability per class per row; reshape to rows x classes,
# then take the class with the highest probability (max.col) as the predicted label
pred_train <- matrix(p_train, nrow = nc, ncol = length(p_train)/nc) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = train_label, max_prob = max.col(., "last") - 1)
tab_train <- table(Prediction = pred_train$max_prob, Actual = pred_train$label)
print(tab_train)
print(paste('Misclassification Error with Train data', round(1 - sum(diag(tab_train))/sum(tab_train), 3))) #0.112
#prediction and confusion matrix from TEST data
p_test <- predict(GB_model, newdata = test_matrix)
pred_test <- matrix(p_test, nrow = nc, ncol = length(p_test)/nc) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(., "last") - 1)
tab_test <- table(Prediction = pred_test$max_prob, Actual = pred_test$label)
print(tab_test)
print(paste('Misclassification Error with Test data', round(1 - sum(diag(tab_test))/sum(tab_test), 3))) #0.178
#feature importance
imp <- xgb.importance(colnames(train_matrix), model = GB_model)
print(imp) #higher Gain means higher feature importance
#Create train, test matrices - one hot encoding for factor variables
trainm2 <- sparse.model.matrix(Loan_Status ~ Credit_History + LoanAmount + TotalIncome, data = train_GB)
#head(trainm2)
train_label2 <- train_GB[,"Loan_Status"]
train_matrix2 <- xgb.DMatrix(data = as.matrix(trainm2), label = train_label2)
testm2 <- sparse.model.matrix(Loan_Status ~ Credit_History + LoanAmount + TotalIncome, data = test_GB)
test_label2 <- test_GB[,"Loan_Status"]
test_matrix2 <- xgb.DMatrix(data = as.matrix(testm2), label = test_label2)
#Parameters
nc2 <- length(unique(train_label2)) #number of classes
xgb_params2 <- list("objective" = "multi:softprob",
                    "eval_metric" = "mlogloss",
                    "num_class" = nc2)
watchlist2 <- list(train = train_matrix2, test = test_matrix2)
#extreme Gradient Boosting Model
GB_model2 <- xgb.train(params = xgb_params2,
                       data = train_matrix2,
                       nrounds = 23, #run 100 iterations 1st then update based on test error value
                       watchlist = watchlist2,
                       eta = 0.1, seed = 606
                       #increasing the eta value improved accuracy by ~1 point
                       )
#error plot
#e2 <- data.frame(GB_model2$evaluation_log)
#plot(e2$iter, e2$train_mlogloss)
#lines(e2$iter, e2$test_mlogloss, col = 'red')
#determine when test error was lowest
#min(e2$test_mlogloss) #0.481189 lowest error
#e2[e2$test_mlogloss == 0.481189,] #23rd iteration
#prediction and confusion matrix from train data
p_train2 <- predict(GB_model2, newdata = train_matrix2)
pred_train2 <- matrix(p_train2, nrow = nc2, ncol = length(p_train2)/nc2) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = train_label2, max_prob = max.col(., "last") - 1)
tab_train2 <- table(Prediction = pred_train2$max_prob, Actual = pred_train2$label)
print(tab_train2)
print(paste('Misclassification Error with Train data', round(1 - sum(diag(tab_train2))/sum(tab_train2), 3)))
#prediction and confusion matrix from test data
p_test2 <- predict(GB_model2, newdata = test_matrix2)
pred_test2 <- matrix(p_test2, nrow = nc2, ncol = length(p_test2)/nc2) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = test_label2, max_prob = max.col(., "last") - 1)
tab_test2 <- table(Prediction = pred_test2$max_prob, Actual = pred_test2$label)
print(tab_test2)
print(paste('Misclassification Error with Test data', round(1 - sum(diag(tab_test2))/sum(tab_test2), 3)))
#Tabular view of our 2 strongest models:
tabularview <- data.frame(RF_Model_1, NN_Model_2)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down")