Introduction

Should This Loan be Approved or Denied?

In this report, we explore attributes of SBAs loan data from a variety of small business loan records. The sba has offered a collection of data on these small businesses and their loan applications. The businesses in the dataset represent a variety of industries locations, sizes, and other demographic attributes. There are over 800,000 businesses with 27 varables collected over a span of several decades.

We are interested in a few problems and questions besides what have been explored in the class. What are the most important five features in predicting default rate. Is KNN better or Logistic regression better when predicting default rate using the five most important features.

If this rmd does not compile at the first time becasue some package is not installed on the grader’s computer, you can check our published knit report on http://rpubs.com/dzhao97/540770

Analysis

Input and Preprocessing data

The dataset contains 27 variables. After carefully inspect the variable description in the “Should This Loan be Approved or Denied?” article, we dropped some variables because they contain too many distinct values (which will result overfitting) or they are consequences of Loan_Status. The variables we dropped are “LoanNr_ChkDgt”, “Name”, “City”, “Zip”, “NAICS”, “ApprovalDate”, “ApprovalFY”, “Bank”, “ChgOffDate”, “DisbursementDate”, “ChgOffPrinGr”. After that, according to all the valid input values for each variable in the dataset description, we filter out all rows containing null values or invalid input. The result dataframe has 348049 observations.

##load Original Data
library(readr)
SBAnational <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv", col_types = cols(BalanceGross = col_number(),  DisbursementGross = col_number(),  GrAppv = col_number(), SBA_Appv = col_number()))

#Drop properties with too much distinct values or unrelated
drops <- c("LoanNr_ChkDgt", "Name","City","Zip","NAICS", "ApprovalDate", "ApprovalFY","Bank", "ChgOffDate","DisbursementDate","ChgOffPrinGr")
Ourdata = SBAnational[ , !(names(SBAnational) %in% drops)]

# use sql to preprocess data with valid input
library(sqldf)
Ourdata = sqldf('select * from Ourdata where (MIS_Status = "P I F" or MIS_Status = "CHGOFF") and (NewExist = 1 or NewExist = 2) and (FranchiseCode = 0 or FranchiseCode = 1) and (UrbanRural = 1 or UrbanRural = 2) and (RevLineCr = "Y" or RevLineCr = "N") and (LowDoc = "Y" or LowDoc = "N") and (BankState <> "AN")')

#clean all NUlls
Ourdata = na.omit(Ourdata)

#change char data input to factor
Ourdata$MIS_Status = factor(Ourdata$MIS_Status)
Ourdata$State = factor(Ourdata$State)
Ourdata$BankState = factor(Ourdata$BankState)
Ourdata$RevLineCr = factor(Ourdata$RevLineCr)
Ourdata$LowDoc = factor(Ourdata$LowDoc)

summary(Ourdata)
##      State          BankState           Term           NoEmp     
##  CA     : 47581   NC     : 52293   Min.   :  0.0   Min.   :   0  
##  NY     : 28483   IL     : 36214   1st Qu.: 49.0   1st Qu.:   2  
##  TX     : 21227   CA     : 34245   Median : 84.0   Median :   3  
##  FL     : 19707   RI     : 32949   Mean   : 81.7   Mean   :   8  
##  PA     : 15594   OH     : 29566   3rd Qu.: 84.0   3rd Qu.:   8  
##  OH     : 14999   VA     : 17980   Max.   :527.0   Max.   :8000  
##  (Other):200458   (Other):144802                                 
##     NewExist      CreateJob     RetainedJob   FranchiseCode     UrbanRural  
##  Min.   :1.00   Min.   :   0   Min.   :   0   Min.   :0.000   Min.   :1.00  
##  1st Qu.:1.00   1st Qu.:   0   1st Qu.:   1   1st Qu.:0.000   1st Qu.:1.00  
##  Median :1.00   Median :   0   Median :   2   Median :0.000   Median :1.00  
##  Mean   :1.26   Mean   :   2   Mean   :   6   Mean   :0.499   Mean   :1.18  
##  3rd Qu.:2.00   3rd Qu.:   2   3rd Qu.:   6   3rd Qu.:1.000   3rd Qu.:1.00  
##  Max.   :2.00   Max.   :5085   Max.   :4441   Max.   :1.000   Max.   :2.00  
##                                                                             
##  RevLineCr  LowDoc     DisbursementGross   BalanceGross     MIS_Status    
##  N:159017   N:346622   Min.   :    4000   Min.   :     0   CHGOFF: 92573  
##  Y:189032   Y:  1427   1st Qu.:   27000   1st Qu.:     0   P I F :255476  
##                        Median :   61000   Median :     0                  
##                        Mean   :  157439   Mean   :     6                  
##                        3rd Qu.:  152899   3rd Qu.:     0                  
##                        Max.   :11446325   Max.   :996262                  
##                                                                           
##      GrAppv           SBA_Appv      
##  Min.   :   1000   Min.   :    500  
##  1st Qu.:  25000   1st Qu.:  12500  
##  Median :  50000   Median :  25000  
##  Mean   : 133651   Mean   :  97315  
##  3rd Qu.: 100100   3rd Qu.:  72056  
##  Max.   :5000000   Max.   :4500000  
## 

Feature Engineering

#run randomforest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
train <- sample(nrow(Ourdata), 0.7*nrow(Ourdata), replace = FALSE)
TrainSet <- Ourdata[train,]
ValidSet <- Ourdata[-train,]
model <- randomForest(MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = 5, importance = TRUE)
model
## 
## Call:
##  randomForest(formula = MIS_Status ~ ., data = TrainSet, ntree = 100,      mtry = 5, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 5.65%
## Confusion matrix:
##        CHGOFF  P I F class.error
## CHGOFF  55849   8711     0.13493
## P I F    5058 174016     0.02825
# Predicting on Validation set
predValid <- predict(model, ValidSet, type = "class")
# Checking classification accuracy
accuracy = mean(predValid == ValidSet$MIS_Status)
#confusion matrix
table(predValid,ValidSet$MIS_Status)
##          
## predValid CHGOFF P I F
##    CHGOFF  24278  2044
##    P I F    3735 74358

The accurqcy of our model is 0.9447.

After tune the parameters a few times on our local run, we found according to the feature importance of Random Forest Model, there are five features that show a higher significance than other features consistently. The importance graph is the following:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
varImpPlot(model) 

As a result, we choose to use these five above features as our input features for further training models.

features = c("State","BankState","Term","FranchiseCode","DisbursementGross","MIS_Status")
temp = Ourdata[ , (names(Ourdata) %in% features)]
Ourdata = temp

Logistic Regression

We run the Logistic Regression on the five features we selected earlier.

total_performance = 0
best_logistic = 0
best_score = 0
best_table = 0
for (i in 1:10){
  index_1 = createDataPartition(Ourdata$State, p = 0.85, list = F)
  index_2 = createDataPartition(Ourdata$BankState, p = 0.85, list = F)
  train = intersect(index_1, index_2)
  TrainSet <- Ourdata[train,]
  ValidSet <- Ourdata[-train,]
  glm.fit <- glm(MIS_Status ~ State + BankState + Term + FranchiseCode + DisbursementGross, data = TrainSet, family = binomial)
  #levels(ValidSet$MIS_Status)
  glm.probs <- predict(glm.fit, newdata = ValidSet, type = "response")
  glm.pred <- ifelse(glm.probs > 0.5, "P I F", "CHGOFF")
  cur_score = mean(glm.pred == ValidSet$MIS_Status)
  total_performance = total_performance + cur_score
  if (cur_score > best_score){
    best_score = cur_score
    best_logistic = glm.fit
    best_table = table(glm.pred, ValidSet$MIS_Status)
  }
}
summary(best_logistic)
## 
## Call:
## glm(formula = MIS_Status ~ State + BankState + Term + FranchiseCode + 
##     DisbursementGross, family = binomial, data = TrainSet)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.379  -0.599   0.391   0.660   2.815  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.60e-01   3.45e-01   -2.20  0.02774 *  
## StateAL           -1.16e+00   2.13e-01   -5.42  5.9e-08 ***
## StateAR           -1.80e+00   2.24e-01   -8.01  1.2e-15 ***
## StateAZ           -1.23e+00   2.04e-01   -6.01  1.9e-09 ***
## StateCA           -1.09e+00   2.02e-01   -5.39  6.9e-08 ***
## StateCO           -1.19e+00   2.04e-01   -5.84  5.1e-09 ***
## StateCT           -2.62e-01   2.08e-01   -1.26  0.20840    
## StateDC           -7.51e-01   2.39e-01   -3.14  0.00168 ** 
## StateDE           -9.07e-01   2.28e-01   -3.98  6.8e-05 ***
## StateFL           -1.46e+00   2.02e-01   -7.21  5.6e-13 ***
## StateGA           -1.62e+00   2.04e-01   -7.96  1.8e-15 ***
## StateHI           -1.61e+00   2.61e-01   -6.17  6.8e-10 ***
## StateIA           -7.09e-01   2.26e-01   -3.14  0.00171 ** 
## StateID           -8.32e-01   2.11e-01   -3.94  8.0e-05 ***
## StateIL           -1.04e+00   2.03e-01   -5.15  2.6e-07 ***
## StateIN           -8.71e-01   2.06e-01   -4.22  2.4e-05 ***
## StateKS           -9.70e-01   2.19e-01   -4.43  9.3e-06 ***
## StateKY           -9.99e-01   2.12e-01   -4.72  2.4e-06 ***
## StateLA           -1.19e+00   2.13e-01   -5.62  1.9e-08 ***
## StateMA           -4.96e-01   2.05e-01   -2.42  0.01556 *  
## StateMD           -9.16e-01   2.07e-01   -4.43  9.2e-06 ***
## StateME           -9.28e-02   2.24e-01   -0.41  0.67913    
## StateMI           -1.03e+00   2.04e-01   -5.04  4.7e-07 ***
## StateMN           -8.35e-01   2.07e-01   -4.04  5.4e-05 ***
## StateMO           -1.01e+00   2.08e-01   -4.88  1.1e-06 ***
## StateMS           -1.69e+00   2.21e-01   -7.66  1.9e-14 ***
## StateMT           -3.69e-01   2.48e-01   -1.49  0.13576    
## StateNC           -6.56e-01   2.06e-01   -3.19  0.00143 ** 
## StateND            2.54e-01   2.82e-01    0.90  0.36705    
## StateNE           -5.69e-01   2.37e-01   -2.40  0.01632 *  
## StateNH           -4.90e-01   2.10e-01   -2.34  0.01941 *  
## StateNJ           -8.42e-01   2.04e-01   -4.14  3.5e-05 ***
## StateNM           -5.37e-01   2.26e-01   -2.38  0.01728 *  
## StateNV           -1.67e+00   2.08e-01   -8.01  1.1e-15 ***
## StateNY           -7.65e-01   2.02e-01   -3.78  0.00016 ***
## StateOH           -6.91e-01   2.03e-01   -3.40  0.00067 ***
## StateOK           -1.12e+00   2.16e-01   -5.18  2.2e-07 ***
## StateOR           -5.80e-01   2.07e-01   -2.81  0.00502 ** 
## StatePA           -7.17e-01   2.04e-01   -3.52  0.00044 ***
## StateRI           -2.07e-01   2.10e-01   -0.99  0.32367    
## StateSC           -7.78e-01   2.14e-01   -3.63  0.00028 ***
## StateSD            1.83e-01   2.41e-01    0.76  0.44763    
## StateTN           -1.47e+00   2.09e-01   -7.02  2.3e-12 ***
## StateTX           -7.84e-01   2.03e-01   -3.87  0.00011 ***
## StateUT           -9.35e-01   2.07e-01   -4.51  6.5e-06 ***
## StateVA           -6.03e-01   2.06e-01   -2.92  0.00346 ** 
## StateVT           -1.02e-01   2.36e-01   -0.43  0.66593    
## StateWA           -9.40e-01   2.05e-01   -4.60  4.3e-06 ***
## StateWI           -5.76e-01   2.09e-01   -2.76  0.00580 ** 
## StateWV           -4.62e-01   2.40e-01   -1.93  0.05397 .  
## StateWY           -1.65e-01   3.12e-01   -0.53  0.59707    
## BankStateAL       -5.54e-01   4.02e-01   -1.38  0.16805    
## BankStateAR        9.69e-01   4.13e-01    2.35  0.01899 *  
## BankStateAZ        7.18e-01   4.37e-01    1.64  0.10059    
## BankStateCA       -1.13e+00   3.99e-01   -2.84  0.00455 ** 
## BankStateCO        4.29e-01   4.23e-01    1.01  0.31108    
## BankStateCT       -2.95e-01   4.07e-01   -0.72  0.46848    
## BankStateDC       -1.14e+00   4.92e-01   -2.31  0.02077 *  
## BankStateDE       -1.12e-01   3.99e-01   -0.28  0.77927    
## BankStateFL       -1.01e+00   4.01e-01   -2.52  0.01162 *  
## BankStateGA        8.80e-01   4.04e-01    2.18  0.02957 *  
## BankStateHI        1.86e+00   4.52e-01    4.12  3.7e-05 ***
## BankStateIA        3.12e-01   4.20e-01    0.74  0.45697    
## BankStateID       -8.60e-02   4.15e-01   -0.21  0.83591    
## BankStateIL       -6.22e-01   3.99e-01   -1.56  0.11860    
## BankStateIN        1.13e+00   4.09e-01    2.75  0.00589 ** 
## BankStateKS        1.19e+00   4.21e-01    2.83  0.00471 ** 
## BankStateKY        8.52e-01   4.21e-01    2.03  0.04274 *  
## BankStateLA        8.96e-01   4.27e-01    2.10  0.03578 *  
## BankStateMA        7.11e-01   4.05e-01    1.76  0.07905 .  
## BankStateMD        1.15e+00   4.12e-01    2.78  0.00543 ** 
## BankStateME        9.95e-01   4.32e-01    2.30  0.02129 *  
## BankStateMI        1.00e+00   4.11e-01    2.44  0.01471 *  
## BankStateMN        8.23e-01   4.05e-01    2.03  0.04222 *  
## BankStateMO        3.89e-01   4.05e-01    0.96  0.33665    
## BankStateMS        1.91e+00   4.18e-01    4.58  4.6e-06 ***
## BankStateMT        1.44e+00   4.36e-01    3.30  0.00095 ***
## BankStateNC       -5.64e-01   3.99e-01   -1.42  0.15666    
## BankStateND        5.98e-01   4.45e-01    1.34  0.17901    
## BankStateNE        6.36e-01   4.28e-01    1.49  0.13747    
## BankStateNH        3.44e-01   4.15e-01    0.83  0.40722    
## BankStateNJ        3.13e-01   4.08e-01    0.77  0.44315    
## BankStateNM        8.12e-01   4.27e-01    1.90  0.05724 .  
## BankStateNV        1.87e-01   4.19e-01    0.45  0.65583    
## BankStateNY        2.21e-02   3.99e-01    0.06  0.95593    
## BankStateOH       -7.39e-02   3.99e-01   -0.19  0.85295    
## BankStateOK        7.80e-01   4.16e-01    1.88  0.06076 .  
## BankStateOR       -9.37e-01   4.01e-01   -2.34  0.01952 *  
## BankStatePA        1.10e+00   4.03e-01    2.72  0.00659 ** 
## BankStatePR       -2.90e+00   1.38e+00   -2.10  0.03585 *  
## BankStateRI       -3.80e-01   3.99e-01   -0.95  0.34028    
## BankStateSC       -2.52e+00   4.22e-01   -5.97  2.3e-09 ***
## BankStateSD       -7.12e-01   3.98e-01   -1.79  0.07387 .  
## BankStateTN       -5.85e-02   4.18e-01   -0.14  0.88866    
## BankStateTX       -2.95e-01   4.01e-01   -0.74  0.46185    
## BankStateUT       -1.76e-01   4.02e-01   -0.44  0.66052    
## BankStateVA       -1.29e+00   3.99e-01   -3.25  0.00116 ** 
## BankStateVT        1.50e+00   4.37e-01    3.43  0.00061 ***
## BankStateWA        2.50e-01   4.09e-01    0.61  0.54098    
## BankStateWI        9.45e-02   4.04e-01    0.23  0.81514    
## BankStateWV       -6.74e-01   4.60e-01   -1.46  0.14313    
## BankStateWY        1.10e+00   5.42e-01    2.04  0.04140 *  
## Term               4.20e-02   2.20e-04  191.22  < 2e-16 ***
## FranchiseCode      8.55e-01   1.18e-02   72.76  < 2e-16 ***
## DisbursementGross  3.48e-07   2.95e-08   11.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291718  on 251498  degrees of freedom
## Residual deviance: 212594  on 251394  degrees of freedom
## AIC: 212804
## 
## Number of Fisher Scoring iterations: 6
best_score
## [1] 0.8334
best_table
##         
## glm.pred CHGOFF P I F
##   CHGOFF  14005  4596
##   P I F   11491 66458
total_performance/10
## [1] 0.8311

In the article “Should This Loan be Approved or Denied?”, the author chose the following five features when he or she run the logistics model and then he selected three of them according to the summary table of the first run. Their model’s accuracy is 0.6784. On the other hand, the average accuracy of our model is around 0.83. Therefore, we think our model is more robust than the author’s model.

KNN Classfication

We run a KNN classification our 5 features using k=1, 3, 5, 7, 10, and 100.

#load package
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(class)

# standardize all quantitative predictors so they have mean 0 and standard deviation of 1
# scale Term and DisbursementGross
standardized.X = data.frame(scale(Ourdata[ , -c(1, 2, 4, 6)]))
summary(standardized.X)
##       Term        DisbursementGross
##  Min.   :-1.407   Min.   :-0.55    
##  1st Qu.:-0.563   1st Qu.:-0.47    
##  Median : 0.039   Median :-0.35    
##  Mean   : 0.000   Mean   : 0.00    
##  3rd Qu.: 0.039   3rd Qu.:-0.02    
##  Max.   : 7.664   Max.   :40.43
unstandardized.X = data.frame(Ourdata[, -c(3, 5)])
combined.X = cbind(unstandardized.X, standardized.X)
summary(combined.X)
##      State          BankState      FranchiseCode    MIS_Status    
##  CA     : 47581   NC     : 52293   Min.   :0.000   CHGOFF: 92573  
##  NY     : 28483   IL     : 36214   1st Qu.:0.000   P I F :255476  
##  TX     : 21227   CA     : 34245   Median :0.000                  
##  FL     : 19707   RI     : 32949   Mean   :0.499                  
##  PA     : 15594   OH     : 29566   3rd Qu.:1.000                  
##  OH     : 14999   VA     : 17980   Max.   :1.000                  
##  (Other):200458   (Other):144802                                  
##       Term        DisbursementGross
##  Min.   :-1.407   Min.   :-0.55    
##  1st Qu.:-0.563   1st Qu.:-0.47    
##  Median : 0.039   Median :-0.35    
##  Mean   : 0.000   Mean   : 0.00    
##  3rd Qu.: 0.039   3rd Qu.:-0.02    
##  Max.   : 7.664   Max.   :40.43    
## 
# Convert factor predictors into numerics to be used in KNN classification
combined.X$State = as.integer(combined.X$State)
combined.X$BankState = as.integer(combined.X$BankState)

# use 20% of data for test set
test.X = combined.X[, -4] %>% slice(1:69609)
train.X = combined.X[, -4] %>% slice(69610:348049)
test.Y = combined.X$MIS_Status[1:69609]
train.Y = combined.X$MIS_Status[69610:348049]

# First attempt with K = 1
set.seed(1)
knn_pred = knn(train.X, test.X, train.Y, k = 1)
k_1_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_1_error_rate)
## [1] 0.103
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18034  3171
##   P I F    3997 44407
# Second Attempt with K = 3
set.seed(2)
knn_pred = knn(train.X, test.X, train.Y, k = 3)
k_3_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_3_error_rate)
## [1] 0.09492
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18200  2776
##   P I F    3831 44802
# Third Attempt with K = 5
set.seed(3)
knn_pred = knn(train.X, test.X, train.Y, k = 5)
k_5_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_5_error_rate)
## [1] 0.09259
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18318  2732
##   P I F    3713 44846
# Fourth Attempt with K = 7
set.seed(4)
knn_pred = knn(train.X, test.X, train.Y, k = 7)
k_7_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_7_error_rate)
## [1] 0.08995
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18354  2584
##   P I F    3677 44994
# Fifth Attempt with K = 10
set.seed(5)
knn_pred = knn(train.X, test.X, train.Y, k = 10)
k_10_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_10_error_rate)
## [1] 0.09686
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18045  2756
##   P I F    3986 44822
# Sixth Attempt with K = 100
set.seed(100)
knn_pred= knn(train.X, test.X, train.Y, k = 100)
k_100_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_100_error_rate)
## [1] 0.1273
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  16488  3317
##   P I F    5543 44261

The model with the best accuracy is k=7. The error rate is 0.08995.

Conclusion

The five features we found are most important in predicting loan status are “State”,“BankState”,“Term”,“FranchiseCode”,and “DisbursementGross”. Logistic Regression is more consistent with predicting loan status with average accuracy of 83%. KNN with k = 7 shows a lowest error rate of 0.08995.