Should This Loan be Approved or Denied?
In this report, we explore attributes of the SBA's loan data, drawn from a large collection of small business loan records. The SBA has published data on these small businesses and their loan applications. The businesses in the dataset span a variety of industries, locations, sizes, and other demographic attributes. There are over 800,000 loan records with 27 variables, collected over a span of several decades.
Beyond what has been explored in class, we are interested in two questions: What are the five most important features for predicting the default rate? And, using those five features, does KNN or logistic regression predict the default rate better?
If this Rmd does not compile on the first try because some package is not installed on the grader's computer, you can view our published knit report at http://rpubs.com/dzhao97/540770
The dataset contains 27 variables. After carefully inspecting the variable descriptions in the "Should This Loan be Approved or Denied?" article, we dropped several variables because they either contain too many distinct values (which would lead to overfitting) or are consequences of the loan status (MIS_Status). The dropped variables are "LoanNr_ChkDgt", "Name", "City", "Zip", "NAICS", "ApprovalDate", "ApprovalFY", "Bank", "ChgOffDate", "DisbursementDate", and "ChgOffPrinGr". We then filtered out all rows containing null values or values outside the valid inputs listed in the dataset description for each variable. The resulting data frame has 348,049 observations.
## Load Original Data
library(readr)
SBAnational <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv", col_types = cols(BalanceGross = col_number(), DisbursementGross = col_number(), GrAppv = col_number(), SBA_Appv = col_number()))
# Drop variables with too many distinct values or that are unrelated to loan status
drops <- c("LoanNr_ChkDgt", "Name","City","Zip","NAICS", "ApprovalDate", "ApprovalFY","Bank", "ChgOffDate","DisbursementDate","ChgOffPrinGr")
Ourdata = SBAnational[ , !(names(SBAnational) %in% drops)]
# Use SQL to keep only rows with valid values for each variable
library(sqldf)
Ourdata = sqldf('select * from Ourdata where (MIS_Status = "P I F" or MIS_Status = "CHGOFF") and (NewExist = 1 or NewExist = 2) and (FranchiseCode = 0 or FranchiseCode = 1) and (UrbanRural = 1 or UrbanRural = 2) and (RevLineCr = "Y" or RevLineCr = "N") and (LowDoc = "Y" or LowDoc = "N") and (BankState <> "AN")')
# Remove all rows containing NA values
Ourdata = na.omit(Ourdata)
# Convert character variables to factors
Ourdata$MIS_Status = factor(Ourdata$MIS_Status)
Ourdata$State = factor(Ourdata$State)
Ourdata$BankState = factor(Ourdata$BankState)
Ourdata$RevLineCr = factor(Ourdata$RevLineCr)
Ourdata$LowDoc = factor(Ourdata$LowDoc)
summary(Ourdata)
## State BankState Term NoEmp
## CA : 47581 NC : 52293 Min. : 0.0 Min. : 0
## NY : 28483 IL : 36214 1st Qu.: 49.0 1st Qu.: 2
## TX : 21227 CA : 34245 Median : 84.0 Median : 3
## FL : 19707 RI : 32949 Mean : 81.7 Mean : 8
## PA : 15594 OH : 29566 3rd Qu.: 84.0 3rd Qu.: 8
## OH : 14999 VA : 17980 Max. :527.0 Max. :8000
## (Other):200458 (Other):144802
## NewExist CreateJob RetainedJob FranchiseCode UrbanRural
## Min. :1.00 Min. : 0 Min. : 0 Min. :0.000 Min. :1.00
## 1st Qu.:1.00 1st Qu.: 0 1st Qu.: 1 1st Qu.:0.000 1st Qu.:1.00
## Median :1.00 Median : 0 Median : 2 Median :0.000 Median :1.00
## Mean :1.26 Mean : 2 Mean : 6 Mean :0.499 Mean :1.18
## 3rd Qu.:2.00 3rd Qu.: 2 3rd Qu.: 6 3rd Qu.:1.000 3rd Qu.:1.00
## Max. :2.00 Max. :5085 Max. :4441 Max. :1.000 Max. :2.00
##
## RevLineCr LowDoc DisbursementGross BalanceGross MIS_Status
## N:159017 N:346622 Min. : 4000 Min. : 0 CHGOFF: 92573
## Y:189032 Y: 1427 1st Qu.: 27000 1st Qu.: 0 P I F :255476
## Median : 61000 Median : 0
## Mean : 157439 Mean : 6
## 3rd Qu.: 152899 3rd Qu.: 0
## Max. :11446325 Max. :996262
##
## GrAppv SBA_Appv
## Min. : 1000 Min. : 500
## 1st Qu.: 25000 1st Qu.: 12500
## Median : 50000 Median : 25000
## Mean : 133651 Mean : 97315
## 3rd Qu.: 100100 3rd Qu.: 72056
## Max. :5000000 Max. :4500000
##
# Fit a random forest model to predict MIS_Status
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
train <- sample(nrow(Ourdata), 0.7*nrow(Ourdata), replace = FALSE)
TrainSet <- Ourdata[train,]
ValidSet <- Ourdata[-train,]
model <- randomForest(MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = 5, importance = TRUE)
model
##
## Call:
## randomForest(formula = MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 5.65%
## Confusion matrix:
## CHGOFF P I F class.error
## CHGOFF 55849 8711 0.13493
## P I F 5058 174016 0.02825
# Predicting on Validation set
predValid <- predict(model, ValidSet, type = "class")
# Checking classification accuracy
accuracy = mean(predValid == ValidSet$MIS_Status)
# Confusion matrix
table(predValid,ValidSet$MIS_Status)
##
## predValid CHGOFF P I F
## CHGOFF 24278 2044
## P I F 3735 74358
The accuracy of our model on the validation set is 0.9447.
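We arrived at ntree = 100 and mtry = 5 after trying a few parameter combinations on local runs. A minimal sketch of that kind of comparison, based on out-of-bag error over a few illustrative mtry values (the object names and candidate values below are for illustration, not our exact runs), could look like this:
# Sketch: compare final OOB error for a few candidate mtry values (illustrative)
candidate_mtry <- c(3, 5, 7)
oob_error <- sapply(candidate_mtry, function(m) {
  fit <- randomForest(MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = m)
  fit$err.rate[nrow(fit$err.rate), "OOB"]  # OOB error after the last tree
})
names(oob_error) <- candidate_mtry
oob_error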
Across those tuning runs, the random forest's feature importance consistently identified five features as more important than the rest. The importance plot is shown below:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
varImpPlot(model)
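The same ranking can also be read off numerically with randomForest's importance() accessor; a small sketch (the object name imp is ours) sorting by mean decrease in accuracy:
# Sketch: top five variables by mean decrease in accuracy
imp <- importance(model)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 5)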
As a result, we use these five features as the inputs for the models that follow.
features = c("State","BankState","Term","FranchiseCode","DisbursementGross","MIS_Status")
temp = Ourdata[ , (names(Ourdata) %in% features)]
Ourdata = temp
We fit a logistic regression on the five selected features, repeating the train/validation split ten times (partitioning on State and BankState and taking their intersection) and keeping the best-performing run.
total_performance = 0
best_logistic = 0
best_score = 0
best_table = 0
# Repeat the train/validation split ten times and keep the best run
for (i in 1:10){
  # Partition on State and BankState and take the intersection for the training set
  index_1 = createDataPartition(Ourdata$State, p = 0.85, list = F)
  index_2 = createDataPartition(Ourdata$BankState, p = 0.85, list = F)
  train = intersect(index_1, index_2)
  TrainSet <- Ourdata[train,]
  ValidSet <- Ourdata[-train,]
  glm.fit <- glm(MIS_Status ~ State + BankState + Term + FranchiseCode + DisbursementGross, data = TrainSet, family = binomial)
  # glm models the probability of the second factor level, so probabilities above 0.5 are labeled "P I F"
  glm.probs <- predict(glm.fit, newdata = ValidSet, type = "response")
  glm.pred <- ifelse(glm.probs > 0.5, "P I F", "CHGOFF")
  cur_score = mean(glm.pred == ValidSet$MIS_Status)
  total_performance = total_performance + cur_score
  # Keep the model and confusion matrix from the best-performing split
  if (cur_score > best_score){
    best_score = cur_score
    best_logistic = glm.fit
    best_table = table(glm.pred, ValidSet$MIS_Status)
  }
}
summary(best_logistic)
##
## Call:
## glm(formula = MIS_Status ~ State + BankState + Term + FranchiseCode +
## DisbursementGross, family = binomial, data = TrainSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.379 -0.599 0.391 0.660 2.815
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.60e-01 3.45e-01 -2.20 0.02774 *
## StateAL -1.16e+00 2.13e-01 -5.42 5.9e-08 ***
## StateAR -1.80e+00 2.24e-01 -8.01 1.2e-15 ***
## StateAZ -1.23e+00 2.04e-01 -6.01 1.9e-09 ***
## StateCA -1.09e+00 2.02e-01 -5.39 6.9e-08 ***
## StateCO -1.19e+00 2.04e-01 -5.84 5.1e-09 ***
## StateCT -2.62e-01 2.08e-01 -1.26 0.20840
## StateDC -7.51e-01 2.39e-01 -3.14 0.00168 **
## StateDE -9.07e-01 2.28e-01 -3.98 6.8e-05 ***
## StateFL -1.46e+00 2.02e-01 -7.21 5.6e-13 ***
## StateGA -1.62e+00 2.04e-01 -7.96 1.8e-15 ***
## StateHI -1.61e+00 2.61e-01 -6.17 6.8e-10 ***
## StateIA -7.09e-01 2.26e-01 -3.14 0.00171 **
## StateID -8.32e-01 2.11e-01 -3.94 8.0e-05 ***
## StateIL -1.04e+00 2.03e-01 -5.15 2.6e-07 ***
## StateIN -8.71e-01 2.06e-01 -4.22 2.4e-05 ***
## StateKS -9.70e-01 2.19e-01 -4.43 9.3e-06 ***
## StateKY -9.99e-01 2.12e-01 -4.72 2.4e-06 ***
## StateLA -1.19e+00 2.13e-01 -5.62 1.9e-08 ***
## StateMA -4.96e-01 2.05e-01 -2.42 0.01556 *
## StateMD -9.16e-01 2.07e-01 -4.43 9.2e-06 ***
## StateME -9.28e-02 2.24e-01 -0.41 0.67913
## StateMI -1.03e+00 2.04e-01 -5.04 4.7e-07 ***
## StateMN -8.35e-01 2.07e-01 -4.04 5.4e-05 ***
## StateMO -1.01e+00 2.08e-01 -4.88 1.1e-06 ***
## StateMS -1.69e+00 2.21e-01 -7.66 1.9e-14 ***
## StateMT -3.69e-01 2.48e-01 -1.49 0.13576
## StateNC -6.56e-01 2.06e-01 -3.19 0.00143 **
## StateND 2.54e-01 2.82e-01 0.90 0.36705
## StateNE -5.69e-01 2.37e-01 -2.40 0.01632 *
## StateNH -4.90e-01 2.10e-01 -2.34 0.01941 *
## StateNJ -8.42e-01 2.04e-01 -4.14 3.5e-05 ***
## StateNM -5.37e-01 2.26e-01 -2.38 0.01728 *
## StateNV -1.67e+00 2.08e-01 -8.01 1.1e-15 ***
## StateNY -7.65e-01 2.02e-01 -3.78 0.00016 ***
## StateOH -6.91e-01 2.03e-01 -3.40 0.00067 ***
## StateOK -1.12e+00 2.16e-01 -5.18 2.2e-07 ***
## StateOR -5.80e-01 2.07e-01 -2.81 0.00502 **
## StatePA -7.17e-01 2.04e-01 -3.52 0.00044 ***
## StateRI -2.07e-01 2.10e-01 -0.99 0.32367
## StateSC -7.78e-01 2.14e-01 -3.63 0.00028 ***
## StateSD 1.83e-01 2.41e-01 0.76 0.44763
## StateTN -1.47e+00 2.09e-01 -7.02 2.3e-12 ***
## StateTX -7.84e-01 2.03e-01 -3.87 0.00011 ***
## StateUT -9.35e-01 2.07e-01 -4.51 6.5e-06 ***
## StateVA -6.03e-01 2.06e-01 -2.92 0.00346 **
## StateVT -1.02e-01 2.36e-01 -0.43 0.66593
## StateWA -9.40e-01 2.05e-01 -4.60 4.3e-06 ***
## StateWI -5.76e-01 2.09e-01 -2.76 0.00580 **
## StateWV -4.62e-01 2.40e-01 -1.93 0.05397 .
## StateWY -1.65e-01 3.12e-01 -0.53 0.59707
## BankStateAL -5.54e-01 4.02e-01 -1.38 0.16805
## BankStateAR 9.69e-01 4.13e-01 2.35 0.01899 *
## BankStateAZ 7.18e-01 4.37e-01 1.64 0.10059
## BankStateCA -1.13e+00 3.99e-01 -2.84 0.00455 **
## BankStateCO 4.29e-01 4.23e-01 1.01 0.31108
## BankStateCT -2.95e-01 4.07e-01 -0.72 0.46848
## BankStateDC -1.14e+00 4.92e-01 -2.31 0.02077 *
## BankStateDE -1.12e-01 3.99e-01 -0.28 0.77927
## BankStateFL -1.01e+00 4.01e-01 -2.52 0.01162 *
## BankStateGA 8.80e-01 4.04e-01 2.18 0.02957 *
## BankStateHI 1.86e+00 4.52e-01 4.12 3.7e-05 ***
## BankStateIA 3.12e-01 4.20e-01 0.74 0.45697
## BankStateID -8.60e-02 4.15e-01 -0.21 0.83591
## BankStateIL -6.22e-01 3.99e-01 -1.56 0.11860
## BankStateIN 1.13e+00 4.09e-01 2.75 0.00589 **
## BankStateKS 1.19e+00 4.21e-01 2.83 0.00471 **
## BankStateKY 8.52e-01 4.21e-01 2.03 0.04274 *
## BankStateLA 8.96e-01 4.27e-01 2.10 0.03578 *
## BankStateMA 7.11e-01 4.05e-01 1.76 0.07905 .
## BankStateMD 1.15e+00 4.12e-01 2.78 0.00543 **
## BankStateME 9.95e-01 4.32e-01 2.30 0.02129 *
## BankStateMI 1.00e+00 4.11e-01 2.44 0.01471 *
## BankStateMN 8.23e-01 4.05e-01 2.03 0.04222 *
## BankStateMO 3.89e-01 4.05e-01 0.96 0.33665
## BankStateMS 1.91e+00 4.18e-01 4.58 4.6e-06 ***
## BankStateMT 1.44e+00 4.36e-01 3.30 0.00095 ***
## BankStateNC -5.64e-01 3.99e-01 -1.42 0.15666
## BankStateND 5.98e-01 4.45e-01 1.34 0.17901
## BankStateNE 6.36e-01 4.28e-01 1.49 0.13747
## BankStateNH 3.44e-01 4.15e-01 0.83 0.40722
## BankStateNJ 3.13e-01 4.08e-01 0.77 0.44315
## BankStateNM 8.12e-01 4.27e-01 1.90 0.05724 .
## BankStateNV 1.87e-01 4.19e-01 0.45 0.65583
## BankStateNY 2.21e-02 3.99e-01 0.06 0.95593
## BankStateOH -7.39e-02 3.99e-01 -0.19 0.85295
## BankStateOK 7.80e-01 4.16e-01 1.88 0.06076 .
## BankStateOR -9.37e-01 4.01e-01 -2.34 0.01952 *
## BankStatePA 1.10e+00 4.03e-01 2.72 0.00659 **
## BankStatePR -2.90e+00 1.38e+00 -2.10 0.03585 *
## BankStateRI -3.80e-01 3.99e-01 -0.95 0.34028
## BankStateSC -2.52e+00 4.22e-01 -5.97 2.3e-09 ***
## BankStateSD -7.12e-01 3.98e-01 -1.79 0.07387 .
## BankStateTN -5.85e-02 4.18e-01 -0.14 0.88866
## BankStateTX -2.95e-01 4.01e-01 -0.74 0.46185
## BankStateUT -1.76e-01 4.02e-01 -0.44 0.66052
## BankStateVA -1.29e+00 3.99e-01 -3.25 0.00116 **
## BankStateVT 1.50e+00 4.37e-01 3.43 0.00061 ***
## BankStateWA 2.50e-01 4.09e-01 0.61 0.54098
## BankStateWI 9.45e-02 4.04e-01 0.23 0.81514
## BankStateWV -6.74e-01 4.60e-01 -1.46 0.14313
## BankStateWY 1.10e+00 5.42e-01 2.04 0.04140 *
## Term 4.20e-02 2.20e-04 191.22 < 2e-16 ***
## FranchiseCode 8.55e-01 1.18e-02 72.76 < 2e-16 ***
## DisbursementGross 3.48e-07 2.95e-08 11.81 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 291718 on 251498 degrees of freedom
## Residual deviance: 212594 on 251394 degrees of freedom
## AIC: 212804
##
## Number of Fisher Scoring iterations: 6
best_score
## [1] 0.8334
best_table
##
## glm.pred CHGOFF P I F
## CHGOFF 14005 4596
## P I F 11491 66458
total_performance/10
## [1] 0.8311
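As a usage illustration, the best fitted model can score a new application. The input values below (and the object name new_loan) are made up for the example, not taken from the dataset; the returned number is the predicted probability of "P I F" (paid in full).
# Hypothetical new application (illustrative values only)
new_loan <- data.frame(State = "IL", BankState = "IL", Term = 84,
                       FranchiseCode = 0, DisbursementGross = 50000)
# Predicted probability that the loan is paid in full ("P I F")
predict(best_logistic, newdata = new_loan, type = "response")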
In the article "Should This Loan be Approved or Denied?", the authors chose five features for their logistic regression and then kept three of them based on the summary table of the first run. Their model's accuracy was 0.6784, whereas the average accuracy of our model is around 0.83. We therefore consider our model more robust.
We run KNN classification on our five features using k = 1, 3, 5, 7, 10, and 100.
#load package
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(class)
# standardize all quantitative predictors so they have mean 0 and standard deviation of 1
# scale Term and DisbursementGross
standardized.X = data.frame(scale(Ourdata[ , -c(1, 2, 4, 6)]))
summary(standardized.X)
## Term DisbursementGross
## Min. :-1.407 Min. :-0.55
## 1st Qu.:-0.563 1st Qu.:-0.47
## Median : 0.039 Median :-0.35
## Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.039 3rd Qu.:-0.02
## Max. : 7.664 Max. :40.43
unstandardized.X = data.frame(Ourdata[, -c(3, 5)])
combined.X = cbind(unstandardized.X, standardized.X)
summary(combined.X)
## State BankState FranchiseCode MIS_Status
## CA : 47581 NC : 52293 Min. :0.000 CHGOFF: 92573
## NY : 28483 IL : 36214 1st Qu.:0.000 P I F :255476
## TX : 21227 CA : 34245 Median :0.000
## FL : 19707 RI : 32949 Mean :0.499
## PA : 15594 OH : 29566 3rd Qu.:1.000
## OH : 14999 VA : 17980 Max. :1.000
## (Other):200458 (Other):144802
## Term DisbursementGross
## Min. :-1.407 Min. :-0.55
## 1st Qu.:-0.563 1st Qu.:-0.47
## Median : 0.039 Median :-0.35
## Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.039 3rd Qu.:-0.02
## Max. : 7.664 Max. :40.43
##
# Convert factor predictors into numerics to be used in KNN classification
combined.X$State = as.integer(combined.X$State)
combined.X$BankState = as.integer(combined.X$BankState)
# Use the first 20% of rows as the test set and the rest for training
test.X = combined.X[, -4] %>% slice(1:69609)
train.X = combined.X[, -4] %>% slice(69610:348049)
test.Y = combined.X$MIS_Status[1:69609]
train.Y = combined.X$MIS_Status[69610:348049]
# First attempt with K = 1
set.seed(1)
knn_pred = knn(train.X, test.X, train.Y, k = 1)
k_1_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_1_error_rate)
## [1] 0.103
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18034 3171
## P I F 3997 44407
# Second Attempt with K = 3
set.seed(2)
knn_pred = knn(train.X, test.X, train.Y, k = 3)
k_3_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_3_error_rate)
## [1] 0.09492
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18200 2776
## P I F 3831 44802
# Third Attempt with K = 5
set.seed(3)
knn_pred = knn(train.X, test.X, train.Y, k = 5)
k_5_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_5_error_rate)
## [1] 0.09259
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18318 2732
## P I F 3713 44846
# Fourth Attempt with K = 7
set.seed(4)
knn_pred = knn(train.X, test.X, train.Y, k = 7)
k_7_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_7_error_rate)
## [1] 0.08995
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18354 2584
## P I F 3677 44994
# Fifth Attempt with K = 10
set.seed(5)
knn_pred = knn(train.X, test.X, train.Y, k = 10)
k_10_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_10_error_rate)
## [1] 0.09686
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18045 2756
## P I F 3986 44822
# Sixth Attempt with K = 100
set.seed(100)
knn_pred= knn(train.X, test.X, train.Y, k = 100)
k_100_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_100_error_rate)
## [1] 0.1273
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 16488 3317
## P I F 5543 44261
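Rather than repeating the call by hand for each k, the same comparison can be written as a single loop over k. This is a sketch using the objects defined above (the names ks and error_rates are ours); exact numbers may differ slightly from the runs above because knn() breaks ties at random.
# Sketch: test error rate for each candidate k in one loop
ks <- c(1, 3, 5, 7, 10, 100)
set.seed(1)
error_rates <- sapply(ks, function(k) {
  pred <- knn(train.X, test.X, train.Y, k = k)
  mean(test.Y != pred)
})
names(error_rates) <- ks
error_rates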
The model with the best accuracy uses k = 7, with a test error rate of 0.08995 (about 91% accuracy).
The five features we found most important for predicting loan status are "State", "BankState", "Term", "FranchiseCode", and "DisbursementGross". Logistic regression is the more consistent predictor of loan status, with an average accuracy of 83%, while KNN with k = 7 achieves the lowest error rate, 0.08995.