Project 2

Introduction

In this project, I use a loan approval data set to predict whether a loan is approved based on applicant information. I compare Naive Bayes and k-NN in terms of accuracy and bias-variance behavior, and I also fit a decision tree for comparison.

Dataset

The data set is sourced from Kaggle. It contains financial and demographic information on 1,000 loan applicants. The primary goal is to predict loan approval. People often apply for loans with high hopes, only to be denied and to receive a new inquiry on their credit report. Knowing which factors raise the likelihood of approval would help: applicants could work on those key factors first and then apply with more confidence.

The key features in this data set are annual income, loan amount, and credit score. This analysis uses credit score and annual income as predictors. The target, loan_approved, is 1 if the loan was granted and 0 if the loan was denied. I chose this data set because personal finance is an interesting topic and because the binary approval outcome allows clear comparisons between classifiers.

dfloanapproval = read.csv("loanapproval.csv")
dfloanapproval = dfloanapproval[,c("credit_score","annual_income","loan_approved")]

head(dfloanapproval)

##   credit_score annual_income loan_approved
## 1          793        100073             1
## 2          789        112197             1
## 3          372         84429             0
## 4          808        124195             1
## 5          689         81627             1
## 6          594         53848             0

dfloanapproval$loan_approved = as.factor(dfloanapproval$loan_approved)

dim(dfloanapproval)

## [1] 1000    3

Data Analysis: Naive Bayes

Train-test split

70% training, 30% testing

We want to know how the model performs on future data, but since none is available we hold out a subset of the 1,000 observations. The data are split at random: 70% of the observations (700 rows) form the training set, traindata, and the remaining 30% (300 rows) form the test set, testdata.
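The split itself is not shown in this report, so the following is only a minimal sketch of one way to do it in base R. The seed value and the simulated stand-in data frame `demo` are assumptions for illustration; the original split may have used a different function (for example a sample-split helper) and a different seed.

```r
# Sketch of a reproducible 70/30 split using base R only.
# Assumption: the seed and the simulated data are illustrative, not the
# values used in the report.
set.seed(123)

# Stand-in for dfloanapproval with the same three column names
n <- 1000
demo <- data.frame(
  credit_score  = round(runif(n, 300, 850)),
  annual_income = round(runif(n, 20000, 150000)),
  loan_approved = factor(rbinom(n, 1, 0.7), levels = c(0, 1))
)

# Draw 70% of the row indices at random for training
train_idx <- sample(seq_len(n), size = round(0.7 * n))

traindata <- demo[train_idx, ]   # rows used to fit the models
testdata  <- demo[-train_idx, ]  # held-out rows used only for evaluation

dim(traindata)
dim(testdata)
```

Fixing the seed makes the split repeatable, so the accuracies reported for each model refer to the same held-out rows.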

Green dots represent that a loan was approved. Red dots represent that a loan was not approved.

par(mar = c(4,4,1,4)) #adjust margins around the plot 

xlim = range(dfloanapproval$credit_score)
ylim = range(dfloanapproval$annual_income)

plot(dfloanapproval$credit_score,
     dfloanapproval$annual_income,
     main = "Loan Approval Data Points",
     xlab = "Credit Score", 
     ylab = "Annual Income",
     xlim = xlim, 
     ylim = ylim)

points(dfloanapproval$credit_score,
       dfloanapproval$annual_income,
       pch = 21, 
       bg = ifelse(dfloanapproval$loan_approved == 1,
                   "green4", "red3"))

Next, we train a Naive Bayes model on the training data (70%). The fitted model divides the credit score and annual income plane into two regions, which lets us predict whether a future applicant would be granted a loan based on where they fall in that plane.

library(e1071) # provides naiveBayes()

mod = naiveBayes(x = traindata[,c("credit_score","annual_income")],
                 y = traindata$loan_approved)

y_pred = predict(mod, newdata = testdata[,c("credit_score","annual_income")])

## Begin plotting train data

data = traindata

minX1 = min(data[,1]); maxX1 = max(data[,1]); range1 = diff(range(data[,1]))
minX2 = min(data[,2]); maxX2 = max(data[,2]); range2 = diff(range(data[,2]))
len = 400

X1 = seq(from = minX1-0.1*range1, 
         to = maxX1+0.1*range1,length.out = len)

X2 = seq(from = minX2-0.1*range2, 
         to = maxX2+0.1*range2,length.out = len)

grid_data = expand.grid(X1, X2)
colnames(grid_data) = c("credit_score", "annual_income")

y_grid = predict(mod, newdata = grid_data)

Prediction: If an applicant's credit score and annual income place them in the green area, the model predicts that the loan will be approved (1). If they fall in the red area, the model predicts that the loan will not be approved (0).

par(mar = c(4,4,1,4)) #adjust margins around the plot 

plot(dfloanapproval$credit_score,
     dfloanapproval$annual_income,
     type = "n", #set up the axes only; the training points are added below
     main = "Loan Approval (Training Data)",
     xlab = "Credit Score", 
     ylab = "Annual Income",
     xlim = range(X1), 
     ylim = range(X2))

contour(X1, X2,
        matrix(as.numeric(y_grid), length(X1), length(X2)),
        add = TRUE)

points(grid_data,
       pch = '.',
       col = ifelse(y_grid == 1,
                   "green2", "red"))
points(data, pch = 21, bg = ifelse(data[,"loan_approved"] == 1, "green4", "red3"))

We now apply the same model to the test data (30%) to check its predictions on observations it has not seen.

## Begin plotting test data

data = testdata

minX1 = min(data[,1]); maxX1 = max(data[,1]); range1 = diff(range(data[,1]))
minX2 = min(data[,2]); maxX2 = max(data[,2]); range2 = diff(range(data[,2]))
len = 400

X1 = seq(from = minX1-0.1*range1, 
         to = maxX1+0.1*range1,length.out = len)

X2 = seq(from = minX2-0.1*range2, 
         to = maxX2+0.1*range2,length.out = len)

grid_data = expand.grid(X1, X2)
colnames(grid_data) = c("credit_score", "annual_income")

y_grid = predict(mod, newdata = grid_data)

## plot test data 

par(mar = c(4,4,1,4)) #adjust margins around the plot 

#xlim = range(dfloanapproval$credit_score)
#ylim = range(dfloanapproval$annual_income)

plot(dfloanapproval$credit_score,
     dfloanapproval$annual_income,
     type = "n", #set up the axes only; the test points are added below
     main = "Loan Approval (Test Data)",
     xlab = "Credit Score", 
     ylab = "Annual Income",
     xlim = range(X1), 
     ylim = range(X2))

contour(X1,
        X2,
        matrix(as.numeric(y_grid), length(X1), length(X2)),
        add = TRUE)

points(grid_data,
       pch = '.',
       col = ifelse(y_grid == 1,
                   "green2", "red"))
points(data, pch = 21, bg = ifelse(data[,"loan_approved"] == 1, "green4", "red3"))

Confusion Matrix

##    y_pred
##       0   1
##   0  21  60
##   1  16 203

## [1] 300   3

## [1] 0.7466667

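Rows are the actual classes and columns are the predicted classes. Of the 81 test applicants who were actually denied (loan approved = 0), 21 were correctly predicted as denied and 60 were incorrectly predicted as approved.

Of the 219 test applicants who were actually approved (loan approved = 1), 203 were correctly predicted as approved and 16 were incorrectly predicted as denied.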

Result

The accuracy on out-of-sample (test) data is 74.67%. We measure accuracy on out-of-sample data only, because the decision boundary was fit to the in-sample (training) data and in-sample accuracy would be optimistic.
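As a worked check of the figures above, the sketch below recomputes accuracy from the reported Naive Bayes confusion matrix and adds two per-class rates plus a majority-class baseline. It assumes rows are actual classes and columns are predicted classes, which matches the row totals of the k-NN table; the variable names are illustrative.

```r
# Confusion matrix from the Naive Bayes test results reported above.
# Assumption: rows = actual class, columns = predicted class.
confM <- matrix(c(21, 16, 60, 203), nrow = 2,
                dimnames = list(actual = c("0", "1"),
                                predicted = c("0", "1")))

accuracy    <- sum(diag(confM)) / sum(confM)        # (21 + 203) / 300
sensitivity <- confM["1", "1"] / sum(confM["1", ])  # approved cases predicted as approved
specificity <- confM["0", "0"] / sum(confM["0", ])  # denied cases predicted as denied
baseline    <- max(rowSums(confM)) / sum(confM)     # always predict the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 4)
```

On these numbers the model identifies approvals far better (about 93%) than denials (about 26%), and always predicting approval would already score 73%, so accuracy alone gives only part of the picture.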

Model Comparison: k-NN

train_x = scale(traindata[, c("credit_score", "annual_income")])
test_x  = scale(testdata[, c("credit_score", "annual_income")],
                center = attr(train_x, "scaled:center"),
                scale  = attr(train_x, "scaled:scale"))

train_y = traindata$loan_approved
test_y  = testdata$loan_approved

library(class) # provides knn()

kmod = knn(train = train_x,
           test  = test_x,
           cl    = train_y,
           k     = 5)


## confM 
confM_knn = table(test_y, kmod)
confM_knn

##       kmod
## test_y   0   1
##      0  33  48
##      1  28 191

sum(diag(confM_knn)) / sum(confM_knn)

## [1] 0.7466667

Model Comparison: Decision Tree

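A decision tree is a classification method that partitions the predictor space into distinct, non-overlapping regions. At each step it chooses the predictor and split point that most improve the purity of the resulting regions, for example by reducing classification error or deviance.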
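To make the split search concrete, here is a minimal sketch for a single numeric predictor. It uses misclassification error as the criterion for simplicity, and `best_split()` is a hypothetical helper written for this illustration, not a function from the tree package. The toy data are made up.

```r
# Sketch: exhaustive search for the best single split on one numeric predictor.
# Assumption: misclassification error is the criterion; best_split() is a
# hypothetical helper, not part of any package.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  err <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    # Each side predicts its majority class; everything else is an error
    errors <- (length(left) - max(table(left))) +
              (length(right) - max(table(right)))
    errors / length(y)
  })
  list(threshold = cuts[which.min(err)], error = min(err))
}

# Toy data: scores below 600 are denied (0), the rest approved (1)
score    <- c(520, 560, 580, 610, 650, 700, 720, 780)
approved <- factor(c(0, 0, 0, 1, 1, 1, 1, 1))

res <- best_split(score, approved)
res$threshold  # 595, the midpoint between 580 and 610
res$error      # 0, because this toy data is perfectly separable
```

A full tree repeats this search over every predictor, applies the best split, and then recurses within each resulting region.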

The fitted tree, printed below, determines the rules for classification. In this case, the rules are

If credit score is below 597.5 and annual income is below 38,669.5, deny the loan.
Otherwise, approve the loan. Approval is most likely (about 95% of training cases) when credit score is above 597.5 and annual income is above 37,296.

library(tree) # provides tree(), cv.tree() and prune.misclass()

tree_model = tree(loan_approved ~ credit_score + annual_income, data = traindata)

summary(tree_model)

## 
## Classification tree:
## tree(formula = loan_approved ~ credit_score + annual_income, 
##     data = traindata)
## Number of terminal nodes:  4 
## Residual mean deviance:  0.9216 = 641.4 / 696 
## Misclassification error rate: 0.2114 = 148 / 700

tree_model

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 700 818.50 1 ( 0.27143 0.72857 )  
##   2) credit_score < 597.5 383 521.20 1 ( 0.42037 0.57963 )  
##     4) annual_income < 38669.5 70  70.06 0 ( 0.80000 0.20000 ) *
##     5) annual_income > 38669.5 313 399.40 1 ( 0.33546 0.66454 ) *
##   3) credit_score > 597.5 317 194.00 1 ( 0.09148 0.90852 )  
##     6) annual_income < 37296 59  68.96 1 ( 0.27119 0.72881 ) *
##     7) annual_income > 37296 258 103.00 1 ( 0.05039 0.94961 ) *

plot(tree_model)
text(tree_model, pretty = 0)

## Predictions and Confusion Matrix 
tree_pred = predict(tree_model, newdata = testdata, type = "class")

confM_tree = table(testdata$loan_approved, tree_pred)
confM_tree

##    tree_pred
##       0   1
##   0  19  62
##   1   8 211

sum(diag(confM_tree)) / sum(confM_tree)

## [1] 0.7666667

Cross Validation

I use cross-validation to check whether the tree should be pruned. The plot shows cross-validated deviance against tree size (number of terminal nodes). Because trees can grow very complex, pruning is used to reduce over-fitting by simplifying the structure while maintaining predictive performance.
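For readers unfamiliar with the mechanics, the following is a rough base R sketch of K-fold cross-validation. It does not reproduce what cv.tree does internally: a simple one-threshold rule stands in for the tree, the data are simulated, and K = 5 and the seed are assumptions chosen for illustration.

```r
# Sketch of K-fold cross-validation in base R.
# Assumption: a one-threshold rule stands in for the tree, and the data,
# seed and K are illustrative.
set.seed(7)
n        <- 200
score    <- runif(n, 300, 850)
approved <- rbinom(n, 1, plogis((score - 575) / 50))

K     <- 5
folds <- sample(rep(seq_len(K), length.out = n))  # random fold label per row

fold_err <- sapply(seq_len(K), function(k) {
  train <- folds != k
  # "Fit" on the training folds: threshold halfway between the class means
  cut <- mean(c(mean(score[train & approved == 0]),
                mean(score[train & approved == 1])))
  # Evaluate on the held-out fold only
  pred <- as.integer(score[!train] >= cut)
  mean(pred != approved[!train])
})

mean(fold_err)  # cross-validated misclassification estimate
```

Each observation is held out exactly once, so the averaged error estimates performance on unseen data without touching the test set.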

cvtree = cv.tree(object = tree_model)
plot(cvtree$size, cvtree$dev, type = "b",
     xlab = "Tree Size",
     ylab = "CV Error")

## Pruning 
pruned_tree = prune.misclass(tree_model, best = 4)

plot(pruned_tree)
text(pruned_tree, pretty = 0)

## Accuracy of Pruned Tree  

pruned_pred = predict(pruned_tree, newdata = testdata, type = "class")

confM_pruned = table(testdata$loan_approved, pruned_pred)
confM_pruned

##    pruned_pred
##       0   1
##   0  19  62
##   1   8 211

sum(diag(confM_pruned)) / sum(confM_pruned)

## [1] 0.7666667

Cross-validation was used to determine the optimal tree size. The cross-validated error was minimized at a tree size of 4, which is the size of the full tree, so further pruning would increase error. As a result, the full tree was retained, which suggests that the model was not substantially over-fitting the training data.

A pruned tree was created to compare test accuracy. Because best = 4 matches the number of terminal nodes in the full tree, the pruned tree is the same as the original, and, as expected, its confusion matrix and test accuracy (76.67%) are unchanged.

Conclusion

Recap of Accuracy

Naive Bayes accuracy: 74.67%
k-NN accuracy: 74.67%
Decision Tree accuracy: 76.67%

Naive Bayes and k-Nearest Neighbors were used to evaluate a 70/30 split on a loan approval data set. Both achieved an accuracy of 74.67%.

These findings suggest that the relationship between the predictors (credit score and annual income) and loan approval is fairly smooth, since a rigid model and a flexible model perform equally well.

Naive Bayes

higher bias
lower variance
This is due to its independence assumptions
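The independence assumption can be seen directly in a hand-written version of the classifier. The sketch below is a simplified Gaussian Naive Bayes for numeric predictors, written for illustration on simulated data; `gnb_fit()` and `gnb_predict()` are hypothetical helpers, not the e1071 implementation.

```r
# Sketch of Gaussian Naive Bayes with numeric predictors.
# Assumption: each predictor is modelled as normal within each class;
# gnb_fit() and gnb_predict() are hypothetical helpers for illustration.
gnb_fit <- function(X, y) {
  classes <- levels(y)
  lapply(setNames(classes, classes), function(cl) {
    Xc <- X[y == cl, , drop = FALSE]
    list(prior = mean(y == cl),      # class proportion
         mean  = colMeans(Xc),       # per-feature mean within the class
         sd    = apply(Xc, 2, sd))   # per-feature sd within the class
  })
}

gnb_predict <- function(fit, X) {
  # Log-posterior up to a constant: log prior plus the SUM of per-feature
  # log densities. Summing them is the independence assumption.
  scores <- do.call(cbind, lapply(fit, function(p) {
    log(p$prior) +
      Reduce(`+`, lapply(seq_len(ncol(X)), function(j)
        dnorm(X[, j], mean = p$mean[j], sd = p$sd[j], log = TRUE)))
  }))
  colnames(scores)[max.col(scores, ties.method = "first")]
}

# Simulated two-class data with the report's column names
set.seed(1)
X <- data.frame(
  credit_score  = c(rnorm(50, 550, 60), rnorm(50, 700, 60)),
  annual_income = c(rnorm(50, 45000, 10000), rnorm(50, 80000, 15000))
)
y <- factor(rep(c(0, 1), each = 50))

fit  <- gnb_fit(X, y)
pred <- gnb_predict(fit, X)
mean(pred == y)  # in-sample accuracy on the simulated data
```

Because each class is summarized by only a mean and a standard deviation per feature, the model has few parameters to estimate, which is where its low variance and higher bias come from.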

k-NN

lower bias
higher variance
This is due to its predictions being based on nearby data observations
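One way to explore the bias-variance trade-off for k-NN is to vary k and watch test accuracy. The sketch below does this on simulated data using knn() from the class package; the data-generating rule, the seed, and the chosen values of k are assumptions for illustration and say nothing about the loan data itself.

```r
# Sketch: test accuracy of k-NN for several values of k.
# Assumption: simulated data and an arbitrary set of k values.
library(class)  # provides knn()

set.seed(42)
n <- 400
x <- data.frame(credit_score  = runif(n, 300, 850),
                annual_income = runif(n, 20000, 150000))
# Simulated labels: approval probability rises with credit score
p <- plogis((x$credit_score - 575) / 60)
y <- factor(rbinom(n, 1, p), levels = c(0, 1))

idx     <- sample(seq_len(n), size = round(0.7 * n))
train_x <- scale(x[idx, ])
# Scale the test set with the TRAINING means and sds to avoid leakage
test_x  <- scale(x[-idx, ],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

ks  <- c(1, 5, 15, 51)
acc <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = y[idx], k = k)
  mean(pred == y[-idx])
})

data.frame(k = ks, test_accuracy = acc)
```

Small k follows individual neighbors closely (low bias, high variance), while large k averages over many neighbors and behaves more like a smooth, rigid model.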

Both models yielded similar results. Neither model showed an advantage in accuracy on this data set, so the extra flexibility of k-NN did not translate into better predictions.

Decision Tree

Unlike Naive Bayes, the decision tree does not assume independence between predictors. Rather, it creates a clear hierarchy in which it is easy to calculate which region a value will fall into.

The decision tree outperformed both of the other methods with a test accuracy of 76.67%, a gain of 2 percentage points (6 of the 300 test cases). Cross-validation suggests that a tree with 4 terminal nodes was the best size for predictive classification. Pruning the tree did not change test accuracy, which suggests the model was not over-fitting.

Overall, the similarity in performance among all models indicates that loan approval in this data set is largely determined by a smooth relationship with credit score and annual income. Although the decision tree offered only a small improvement in accuracy, it offered improved interpretability through clear decision rules.

References

Link to Dataset: https://www.kaggle.com/datasets/amineipad/loan-approval-dataset