Introduction

Should This Loan be Approved or Denied?

In this report, we explore attributes of SBA loan data drawn from a large collection of small business loan records. The SBA has released data on these small businesses and their loan applications. The businesses in the dataset span a variety of industries, locations, sizes, and other demographic attributes. There are over 800,000 businesses with 27 variables collected over several decades.

We are interested in a few questions beyond those explored in class. What are the five most important features for predicting default rate? Is KNN or logistic regression better at predicting default rate using those five features?

If this Rmd does not compile on the first try because some package is not installed on the grader’s computer, you can check our published knit report at http://rpubs.com/dzhao97/543225

Analysis

Input and Preprocessing data

The dataset contains 27 variables. After carefully inspecting the variable descriptions in the “Should This Loan be Approved or Denied?” article, we dropped some variables because they contain too many distinct values (which would lead to overfitting) or because they are consequences of Loan_Status. The variables we dropped are “LoanNr_ChkDgt”, “Name”, “City”, “Zip”, “NAICS”, “ApprovalDate”, “ApprovalFY”, “Bank”, “ChgOffDate”, “DisbursementDate”, and “ChgOffPrinGr”. Then, using the set of valid input values for each variable in the dataset description, we filtered out all rows containing null values or invalid input. The resulting data frame has 359,320 observations.

# Load the original data
library(readr)
SBAnational <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv", col_types = cols(BalanceGross = col_number(),  DisbursementGross = col_number(),  GrAppv = col_number(), SBA_Appv = col_number()))

# Drop variables with too many distinct values or that are consequences of Loan_Status
drops <- c("LoanNr_ChkDgt", "Name","City","Zip","NAICS", "ApprovalDate", "ApprovalFY","Bank", "ChgOffDate","DisbursementDate","ChgOffPrinGr")
Ourdata = SBAnational[ , !(names(SBAnational) %in% drops)]

# Use SQL to keep only rows with valid input values
library(sqldf)
Ourdata$FranchiseCode[Ourdata$FranchiseCode > 1] = 2 # collapse all franchise codes > 1 into one category
Ourdata = sqldf('select * from Ourdata where (MIS_Status = "P I F" or MIS_Status = "CHGOFF") and (NewExist = 1 or NewExist = 2) and (FranchiseCode = 0 or FranchiseCode = 1 or FranchiseCode = 2) and (UrbanRural = 1 or UrbanRural = 2) and (RevLineCr = "Y" or RevLineCr = "N") and (LowDoc = "Y" or LowDoc = "N") and (BankState <> "AN")')

# Remove all rows with missing values
Ourdata = na.omit(Ourdata)
Copy_Ourdata = Ourdata

# Convert character variables to factors
Ourdata$MIS_Status = factor(Ourdata$MIS_Status)
Ourdata$State = factor(Ourdata$State)
Ourdata$BankState = factor(Ourdata$BankState)
Ourdata$RevLineCr = factor(Ourdata$RevLineCr)
Ourdata$LowDoc = factor(Ourdata$LowDoc)
Ourdata$FranchiseCode = factor(Ourdata$FranchiseCode)

summary(Ourdata)
##      State          BankState           Term           NoEmp     
##  CA     : 48477   NC     : 53071   Min.   :  0.0   Min.   :   0  
##  NY     : 28837   IL     : 37020   1st Qu.: 50.0   1st Qu.:   2  
##  TX     : 22231   CA     : 35161   Median : 84.0   Median :   3  
##  FL     : 20296   RI     : 33377   Mean   : 82.7   Mean   :   8  
##  PA     : 16068   OH     : 30342   3rd Qu.: 84.0   3rd Qu.:   8  
##  OH     : 15623   VA     : 18213   Max.   :527.0   Max.   :8000  
##  (Other):207788   (Other):152136                                 
##     NewExist      CreateJob     RetainedJob   FranchiseCode   UrbanRural  
##  Min.   :1.00   Min.   :   0   Min.   :   0   0:174197      Min.   :1.00  
##  1st Qu.:1.00   1st Qu.:   0   1st Qu.:   1   1:173852      1st Qu.:1.00  
##  Median :1.00   Median :   0   Median :   2   2: 11271      Median :1.00  
##  Mean   :1.27   Mean   :   2   Mean   :   6                 Mean   :1.18  
##  3rd Qu.:2.00   3rd Qu.:   2   3rd Qu.:   6                 3rd Qu.:1.00  
##  Max.   :2.00   Max.   :5085   Max.   :7250                 Max.   :2.00  
##                                                                           
##  RevLineCr  LowDoc     DisbursementGross   BalanceGross     MIS_Status    
##  N:167459   N:357759   Min.   :    4000   Min.   :     0   CHGOFF: 94663  
##  Y:191861   Y:  1561   1st Qu.:   28000   1st Qu.:     0   P I F :264657  
##                        Median :   63482   Median :     0                  
##                        Mean   :  162311   Mean   :     6                  
##                        3rd Qu.:  160143   3rd Qu.:     0                  
##                        Max.   :11446325   Max.   :996262                  
##                                                                           
##      GrAppv           SBA_Appv      
##  Min.   :   1000   Min.   :    500  
##  1st Qu.:  25000   1st Qu.:  12500  
##  Median :  50000   Median :  25000  
##  Mean   : 139130   Mean   : 101986  
##  3rd Qu.: 120000   3rd Qu.:  75000  
##  Max.   :5000000   Max.   :4500000  
## 

Feature Engineering

# Run a random forest on a 70/30 train/validation split
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
train <- sample(nrow(Ourdata), 0.7*nrow(Ourdata), replace = FALSE)
TrainSet <- Ourdata[train,]
ValidSet <- Ourdata[-train,]
model <- randomForest(MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = 5, importance = TRUE)
model
## 
## Call:
##  randomForest(formula = MIS_Status ~ ., data = TrainSet, ntree = 100,      mtry = 5, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 5.74%
## Confusion matrix:
##        CHGOFF  P I F class.error
## CHGOFF  57169   9062     0.13682
## P I F    5386 179906     0.02907
# Predicting on Validation set
predValid <- predict(model, ValidSet, type = "class")
# Checking classification accuracy
accuracy = mean(predValid == ValidSet$MIS_Status)
# Confusion matrix on the validation set
table(predValid,ValidSet$MIS_Status)
##          
## predValid CHGOFF P I F
##    CHGOFF  24664  2188
##    P I F    3768 77177

The accuracy of our model is 0.9447.
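
As a quick sanity check, the same accuracy can be recomputed from the confusion matrix above (a sketch; conf is just an illustrative name for the table already printed):

# Recompute validation accuracy from the confusion matrix
conf = table(predValid, ValidSet$MIS_Status)
sum(diag(conf)) / sum(conf) # (24664 + 77177) / 107797, about 0.9447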

After tuning the parameters a few times locally, we found that, according to the random forest feature importance, five features consistently show higher importance than the rest. The importance plot is the following:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
varImpPlot(model) 
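
For a numeric view of the ranking behind the plot, the importance matrix can be inspected directly (a sketch using randomForest's importance() on the model above):

# Numeric importance scores, sorted by mean decrease in Gini impurity
imp = importance(model)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]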

As a result, we use these five features as the input features for the models that follow.

features = c("State","BankState","Term","FranchiseCode","DisbursementGross","MIS_Status")
temp = Ourdata[ , (names(Ourdata) %in% features)]
Ourdata = temp

Logistic Regression

We run logistic regression on the five features selected earlier.

# Repeat a random 85/15 split 10 times and keep the best fit
total_performance = 0
best_logistic = NULL
best_score = 0
best_table = NULL
for (i in 1:10){
  # Partition stratified on both State and BankState so every factor level
  # appears in the training set (predict() fails on unseen factor levels)
  index_1 = createDataPartition(Ourdata$State, p = 0.85, list = F)
  index_2 = createDataPartition(Ourdata$BankState, p = 0.85, list = F)
  train = intersect(index_1, index_2)
  TrainSet <- Ourdata[train,]
  ValidSet <- Ourdata[-train,]
  glm.fit <- glm(MIS_Status ~ State + BankState + Term + FranchiseCode + DisbursementGross, data = TrainSet, family = binomial)
  glm.probs <- predict(glm.fit, newdata = ValidSet, type = "response")
  # glm models P(MIS_Status == "P I F") because "P I F" is the second factor level
  glm.pred <- ifelse(glm.probs > 0.5, "P I F", "CHGOFF")
  cur_score = mean(glm.pred == ValidSet$MIS_Status)
  total_performance = total_performance + cur_score
  if (cur_score > best_score){
    best_score = cur_score
    best_logistic = glm.fit
    best_table = table(glm.pred, ValidSet$MIS_Status)
  }
}
summary(best_logistic)
## 
## Call:
## glm(formula = MIS_Status ~ State + BankState + Term + FranchiseCode + 
##     DisbursementGross, family = binomial, data = TrainSet)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.337  -0.586   0.387   0.659   2.820  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.61e-01   3.66e-01   -2.08  0.03768 *  
## StateAL           -1.12e+00   2.08e-01   -5.39  7.1e-08 ***
## StateAR           -1.64e+00   2.18e-01   -7.54  4.6e-14 ***
## StateAZ           -1.22e+00   1.99e-01   -6.16  7.1e-10 ***
## StateCA           -1.04e+00   1.96e-01   -5.30  1.2e-07 ***
## StateCO           -1.12e+00   1.99e-01   -5.63  1.8e-08 ***
## StateCT           -1.97e-01   2.02e-01   -0.97  0.33067    
## StateDC           -6.30e-01   2.33e-01   -2.70  0.00684 ** 
## StateDE           -8.37e-01   2.21e-01   -3.78  0.00016 ***
## StateFL           -1.40e+00   1.97e-01   -7.11  1.1e-12 ***
## StateGA           -1.55e+00   1.99e-01   -7.82  5.4e-15 ***
## StateHI           -1.43e+00   2.55e-01   -5.59  2.3e-08 ***
## StateIA           -6.61e-01   2.20e-01   -3.00  0.00269 ** 
## StateID           -8.13e-01   2.06e-01   -3.95  7.7e-05 ***
## StateIL           -9.67e-01   1.97e-01   -4.90  9.6e-07 ***
## StateIN           -8.26e-01   2.01e-01   -4.12  3.9e-05 ***
## StateKS           -9.38e-01   2.12e-01   -4.42  9.9e-06 ***
## StateKY           -9.34e-01   2.06e-01   -4.53  5.9e-06 ***
## StateLA           -1.16e+00   2.07e-01   -5.59  2.3e-08 ***
## StateMA           -4.57e-01   1.99e-01   -2.30  0.02169 *  
## StateMD           -8.33e-01   2.01e-01   -4.15  3.4e-05 ***
## StateME           -1.06e-01   2.19e-01   -0.48  0.62792    
## StateMI           -1.01e+00   1.98e-01   -5.10  3.5e-07 ***
## StateMN           -7.48e-01   2.01e-01   -3.72  0.00020 ***
## StateMO           -1.05e+00   2.02e-01   -5.21  1.9e-07 ***
## StateMS           -1.59e+00   2.16e-01   -7.38  1.6e-13 ***
## StateMT           -3.78e-01   2.44e-01   -1.55  0.12144    
## StateNC           -5.72e-01   2.00e-01   -2.86  0.00428 ** 
## StateND            3.29e-01   2.82e-01    1.17  0.24315    
## StateNE           -5.40e-01   2.31e-01   -2.34  0.01932 *  
## StateNH           -5.45e-01   2.04e-01   -2.67  0.00752 ** 
## StateNJ           -7.99e-01   1.98e-01   -4.03  5.5e-05 ***
## StateNM           -4.95e-01   2.20e-01   -2.25  0.02458 *  
## StateNV           -1.63e+00   2.03e-01   -8.07  7.0e-16 ***
## StateNY           -7.03e-01   1.97e-01   -3.58  0.00035 ***
## StateOH           -6.57e-01   1.97e-01   -3.33  0.00087 ***
## StateOK           -1.15e+00   2.10e-01   -5.48  4.3e-08 ***
## StateOR           -5.58e-01   2.01e-01   -2.77  0.00553 ** 
## StatePA           -6.43e-01   1.98e-01   -3.24  0.00118 ** 
## StateRI           -1.99e-01   2.04e-01   -0.98  0.32904    
## StateSC           -7.52e-01   2.08e-01   -3.61  0.00031 ***
## StateSD            3.83e-01   2.38e-01    1.61  0.10826    
## StateTN           -1.35e+00   2.04e-01   -6.62  3.6e-11 ***
## StateTX           -7.25e-01   1.97e-01   -3.68  0.00023 ***
## StateUT           -8.53e-01   2.02e-01   -4.22  2.4e-05 ***
## StateVA           -5.21e-01   2.01e-01   -2.60  0.00939 ** 
## StateVT           -3.58e-02   2.33e-01   -0.15  0.87787    
## StateWA           -8.79e-01   1.99e-01   -4.42  9.9e-06 ***
## StateWI           -4.71e-01   2.03e-01   -2.32  0.02034 *  
## StateWV           -4.60e-01   2.33e-01   -1.97  0.04868 *  
## StateWY           -2.98e-02   3.07e-01   -0.10  0.92272    
## BankStateAL       -5.01e-01   4.19e-01   -1.20  0.23159    
## BankStateAR        8.96e-01   4.29e-01    2.09  0.03667 *  
## BankStateAZ        9.35e-01   4.54e-01    2.06  0.03926 *  
## BankStateCA       -1.14e+00   4.15e-01   -2.75  0.00602 ** 
## BankStateCO        5.21e-01   4.39e-01    1.19  0.23494    
## BankStateCT       -3.96e-01   4.23e-01   -0.94  0.34849    
## BankStateDC       -1.46e+00   4.96e-01   -2.94  0.00324 ** 
## BankStateDE       -1.31e-01   4.16e-01   -0.31  0.75288    
## BankStateFL       -1.05e+00   4.17e-01   -2.50  0.01226 *  
## BankStateGA        7.80e-01   4.20e-01    1.86  0.06345 .  
## BankStateHI        1.82e+00   4.66e-01    3.91  9.2e-05 ***
## BankStateIA        3.50e-01   4.35e-01    0.80  0.42132    
## BankStateID       -7.08e-02   4.30e-01   -0.16  0.86924    
## BankStateIL       -6.26e-01   4.15e-01   -1.51  0.13151    
## BankStateIN        1.18e+00   4.26e-01    2.77  0.00555 ** 
## BankStateKS        1.15e+00   4.34e-01    2.66  0.00783 ** 
## BankStateKY        8.03e-01   4.35e-01    1.85  0.06454 .  
## BankStateLA        9.23e-01   4.42e-01    2.09  0.03670 *  
## BankStateMA        7.02e-01   4.21e-01    1.67  0.09554 .  
## BankStateMD        1.18e+00   4.29e-01    2.75  0.00593 ** 
## BankStateME        9.52e-01   4.46e-01    2.14  0.03269 *  
## BankStateMI        1.09e+00   4.27e-01    2.56  0.01033 *  
## BankStateMN        7.11e-01   4.21e-01    1.69  0.09105 .  
## BankStateMO        3.93e-01   4.21e-01    0.93  0.35062    
## BankStateMS        1.92e+00   4.33e-01    4.43  9.2e-06 ***
## BankStateMT        1.49e+00   4.52e-01    3.30  0.00098 ***
## BankStateNC       -5.75e-01   4.15e-01   -1.39  0.16602    
## BankStateND        5.78e-01   4.62e-01    1.25  0.21064    
## BankStateNE        5.80e-01   4.41e-01    1.32  0.18814    
## BankStateNH        4.50e-01   4.31e-01    1.05  0.29567    
## BankStateNJ        3.05e-01   4.24e-01    0.72  0.47200    
## BankStateNM        7.17e-01   4.42e-01    1.62  0.10497    
## BankStateNV        1.34e-01   4.33e-01    0.31  0.75618    
## BankStateNY       -1.45e-02   4.16e-01   -0.03  0.97215    
## BankStateOH       -9.26e-02   4.15e-01   -0.22  0.82354    
## BankStateOK        7.28e-01   4.31e-01    1.69  0.09104 .  
## BankStateOR       -9.29e-01   4.18e-01   -2.22  0.02611 *  
## BankStatePA        9.62e-01   4.20e-01    2.29  0.02182 *  
## BankStatePR       -2.54e-02   1.33e+00   -0.02  0.98470    
## BankStateRI       -3.78e-01   4.16e-01   -0.91  0.36337    
## BankStateSC       -2.58e+00   4.35e-01   -5.92  3.1e-09 ***
## BankStateSD       -7.28e-01   4.15e-01   -1.75  0.07963 .  
## BankStateTN       -1.81e-01   4.31e-01   -0.42  0.67453    
## BankStateTX       -4.04e-01   4.18e-01   -0.97  0.33405    
## BankStateUT       -1.77e-01   4.19e-01   -0.42  0.67282    
## BankStateVA       -1.31e+00   4.15e-01   -3.15  0.00163 ** 
## BankStateVT        1.59e+00   4.55e-01    3.50  0.00046 ***
## BankStateWA        1.57e-01   4.25e-01    0.37  0.71179    
## BankStateWI       -2.31e-02   4.20e-01   -0.05  0.95621    
## BankStateWV       -7.76e-01   4.73e-01   -1.64  0.10086    
## BankStateWY        8.56e-01   5.46e-01    1.57  0.11688    
## Term               4.15e-02   2.15e-04  193.05  < 2e-16 ***
## FranchiseCode1     8.57e-01   1.17e-02   73.27  < 2e-16 ***
## FranchiseCode2     4.06e-02   3.53e-02    1.15  0.24969    
## DisbursementGross  2.46e-07   2.82e-08    8.70  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 299658  on 259594  degrees of freedom
## Residual deviance: 218847  on 259489  degrees of freedom
## AIC: 219059
## 
## Number of Fisher Scoring iterations: 6
best_score
## [1] 0.8307
best_table
##         
## glm.pred CHGOFF P I F
##   CHGOFF  14102  4850
##   P I F   12037 68736
total_performance/10
## [1] 0.8297

In the article “Should This Loan be Approved or Denied?”, the author chose five features for the logistic model and then selected three of them based on the summary table of the first run. That model’s accuracy is 0.6784, while the average accuracy of our model is around 0.83. We therefore consider our model more robust than the author’s.
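
Because the classes are imbalanced, overall accuracy can hide weak performance on the minority class, so per-class recall on the best run is worth checking (a sketch; chgoff_recall and pif_recall are illustrative names, and best_table's rows are predictions while its columns are true labels):

# Per-class recall from the best run's confusion matrix
chgoff_recall = best_table["CHGOFF", "CHGOFF"] / sum(best_table[, "CHGOFF"]) # 14102 / 26139, about 0.54
pif_recall    = best_table["P I F", "P I F"] / sum(best_table[, "P I F"])    # 68736 / 73586, about 0.93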

KNN Classification

We run KNN classification on our five features using k = 1, 3, 5, 7, 10, and 100.

#load package
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(class)

# Standardize the quantitative predictors (Term and DisbursementGross)
# so each has mean 0 and standard deviation 1
standardized.X = data.frame(scale(Ourdata[ , -c(1, 2, 4, 6)])) # drop the factor columns
summary(standardized.X)
##       Term        DisbursementGross
##  Min.   :-1.402   Min.   :-0.55    
##  1st Qu.:-0.555   1st Qu.:-0.47    
##  Median : 0.022   Median :-0.35    
##  Mean   : 0.000   Mean   : 0.00    
##  3rd Qu.: 0.022   3rd Qu.:-0.01    
##  Max.   : 7.531   Max.   :39.43
unstandardized.X = data.frame(Ourdata[, -c(3, 5)])
combined.X = cbind(unstandardized.X, standardized.X)
summary(combined.X)
##      State          BankState      FranchiseCode  MIS_Status    
##  CA     : 48477   NC     : 53071   0:174197      CHGOFF: 94663  
##  NY     : 28837   IL     : 37020   1:173852      P I F :264657  
##  TX     : 22231   CA     : 35161   2: 11271                     
##  FL     : 20296   RI     : 33377                                
##  PA     : 16068   OH     : 30342                                
##  OH     : 15623   VA     : 18213                                
##  (Other):207788   (Other):152136                                
##       Term        DisbursementGross
##  Min.   :-1.402   Min.   :-0.55    
##  1st Qu.:-0.555   1st Qu.:-0.47    
##  Median : 0.022   Median :-0.35    
##  Mean   : 0.000   Mean   : 0.00    
##  3rd Qu.: 0.022   3rd Qu.:-0.01    
##  Max.   : 7.531   Max.   :39.43    
## 
# Convert factor predictors to integer codes (alphabetical level order) for KNN distance computation
combined.X$State = as.integer(combined.X$State)
combined.X$BankState = as.integer(combined.X$BankState)

# Use the first 20% of rows as the test set, the rest for training
test.X = combined.X[, -4] %>% slice(1:69609)
train.X = combined.X[, -4] %>% slice(69610:348049)
test.Y = combined.X$MIS_Status[1:69609]
train.Y = combined.X$MIS_Status[69610:348049]

# First attempt with K = 1
set.seed(1)
knn_pred = knn(train.X, test.X, train.Y, k = 1)
k_1_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_1_error_rate)
## [1] 0.1043
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  17820  3231
##   P I F    4027 44531
# Second Attempt with K = 3
set.seed(2)
knn_pred = knn(train.X, test.X, train.Y, k = 3)
k_3_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_3_error_rate)
## [1] 0.09651
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  17925  2796
##   P I F    3922 44966
# Third Attempt with K = 5
set.seed(3)
knn_pred = knn(train.X, test.X, train.Y, k = 5)
k_5_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_5_error_rate)
## [1] 0.09446
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18029  2757
##   P I F    3818 45005
# Fourth Attempt with K = 7
set.seed(4)
knn_pred = knn(train.X, test.X, train.Y, k = 7)
k_7_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_7_error_rate)
## [1] 0.09204
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  18082  2642
##   P I F    3765 45120
# Fifth Attempt with K = 10
set.seed(5)
knn_pred = knn(train.X, test.X, train.Y, k = 10)
k_10_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_10_error_rate)
## [1] 0.09783
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  17786  2749
##   P I F    4061 45013
# Sixth Attempt with K = 100
set.seed(100)
knn_pred= knn(train.X, test.X, train.Y, k = 100)
k_100_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_100_error_rate)
## [1] 0.1277
table(knn_pred, test.Y)
##         test.Y
## knn_pred CHGOFF P I F
##   CHGOFF  16227  3269
##   P I F    5620 44493

The model with the best accuracy uses k = 7, with an error rate of 0.09204.
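
The six runs above can also be written as one compact sweep over k (a sketch; error rates should match the individual runs up to seed choice and random tie-breaking in knn):

# Sweep the candidate k values in one loop
ks = c(1, 3, 5, 7, 10, 100)
error_rates = sapply(ks, function(k) {
  set.seed(k)
  mean(test.Y != knn(train.X, test.X, train.Y, k = k))
})
data.frame(k = ks, error_rate = error_rates)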

Further Individual Feature Analysis

Proposed Model: Term, State, BankState, FranchiseCode, DisbursementGross

Is including both Bank State and State redundant?

# Fraction of loans where the borrower's state matches the bank's state
X = Copy_Ourdata$State == Copy_Ourdata$BankState
same_rate = sum(X, na.rm = TRUE) / nrow(Copy_Ourdata) * 100

BankState and State match only about 37.97% of the time. How do their default rates differ?

new.df = data.frame("state" = Copy_Ourdata$BankState, "default" = Copy_Ourdata$MIS_Status)
new.df = na.omit(new.df)
new.df$dfr = new.df$default == "CHGOFF"
new.df = new.df[,-2]
my.df = aggregate.data.frame(x = new.df$dfr, by=list(new.df$state), FUN = mean)
colnames(my.df) = c("state", "DefaultRate")
usmap::plot_usmap(data = my.df, values = "DefaultRate", lines = "black")+
  scale_fill_continuous(low = "white", high = "blue", name = "Default Rates for Banks by State")

new.df = data.frame("state" = Copy_Ourdata$State, "default" = Copy_Ourdata$MIS_Status)
new.df = na.omit(new.df)
new.df$dfr = new.df$default == "CHGOFF"
new.df = new.df[,-2]
my.df = aggregate.data.frame(x = new.df$dfr, by=list(new.df$state), FUN = mean)
colnames(my.df) = c("state", "DefaultRate")
usmap::plot_usmap(data = my.df, values = "DefaultRate", lines = "black")+
  scale_fill_continuous(low = "white", high = "red", name = "Default Rates by State")

new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "Term_Length" = Copy_Ourdata$Term)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$Term_Length, by=list(new.df$Loan_Status), FUN = mean)
colnames(my.df) = c("Loan Status", "Average Term")
my.df
##   Loan Status Average Term
## 1      CHGOFF        48.39
## 2       P I F        95.01

Term length tends to be longer for loans paid in full, which our model reflects in the positive coefficient on the Term predictor.
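
To make the Term coefficient interpretable, it can be converted to an odds ratio per additional year (a sketch, assuming best_logistic from the logistic regression section is still in scope and that Term is measured in months, which the means above suggest):

# Odds multiplier for P I F per extra year of loan term
exp(coef(best_logistic)["Term"] * 12) # exp(0.0415 * 12), about 1.65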

new.df = data.frame("LoanStatus" = Copy_Ourdata$MIS_Status, "FranchiseCode" = Copy_Ourdata$FranchiseCode)
new.df = na.omit(new.df)
new.df$FranchiseCode[new.df$FranchiseCode > 1] = 2
my.df = aggregate.data.frame(x=new.df$FranchiseCode, by=list(new.df$LoanStatus), FUN = mean)
colnames(my.df) = c("Loan Status", "Average Franchise Code")
my.df
##   Loan Status Average Franchise Code
## 1      CHGOFF                 0.3963
## 2       P I F                 0.6003

Franchise code 2 is a catch-all for all franchise codes other than 0 or 1; in other words, 2 indicates a franchise business, some more recognizable than others. A code of 1 likely indicates a smaller or local franchise, since many businesses coded 2 have recognizable names while those coded 1 do not. It is natural for this variable to be part of the model, because franchises have brand recognition that in many cases may make them more likely to succeed.
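
Because an “average franchise code” is hard to read directly, a default-rate breakdown by code may be clearer (a sketch reusing the new.df built just above):

# Default rate within each collapsed franchise code
tab = table(new.df$FranchiseCode, new.df$LoanStatus)
prop.table(tab, margin = 1) # row-wise proportions: share CHGOFF vs P I F per code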

new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "dg" = Copy_Ourdata$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$dg, by=list(new.df$Loan_Status), FUN = mean)
colnames(my.df) = c("Loan Status", "Mean Loan Amount (Dollars)")
my.df
##   Loan Status Mean Loan Amount (Dollars)
## 1      CHGOFF                     101805
## 2       P I F                     183952
new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "dg" = Copy_Ourdata$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$dg, by=list(new.df$Loan_Status), FUN = median)
colnames(my.df) = c("Loan Status", "Median Loan Amount (Dollars)")
my.df
##   Loan Status Median Loan Amount (Dollars)
## 1      CHGOFF                        50000
## 2       P I F                        72209
new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "dg" = Copy_Ourdata$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$dg, by=list(new.df$Loan_Status), FUN = sd)
colnames(my.df) = c("Loan Status", "Standard Deviation Loan Amount (Dollars)")
my.df
##   Loan Status Standard Deviation Loan Amount (Dollars)
## 1      CHGOFF                                   170675
## 2       P I F                                   314665

The three tables above compare the mean, median, and standard deviation of loan amount by loan status.
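
The same three summaries can also be produced in one call with aggregate's formula interface (a sketch on the cleaned copy):

# Mean, median, and sd of disbursement gross by loan status in one call
aggregate(DisbursementGross ~ MIS_Status, data = Copy_Ourdata,
          FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x)))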

# Split the raw data by loan status for the time-series plots below
SBAPIF = SBAnational[SBAnational$MIS_Status == "P I F",]
SBACHGOFF = SBAnational[SBAnational$MIS_Status == "CHGOFF",]
new.df = data.frame("approvalyear" = SBAPIF$ApprovalFY, "disbursementgross" = SBAPIF$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x = new.df$disbursementgross, by=list(new.df$approvalyear), FUN = mean)
colnames(my.df) = c("Year", "AvgLoanAmt")
ggplot2::ggplot(data = my.df)+
  aes(x = Year, y = AvgLoanAmt)+
  geom_line()+
  labs(title = "Average Loan Amount Over Time for Loans Paid in Full", x = "Year", y = "Average Disbursement Gross")

new.df = data.frame("approvalyear" = SBACHGOFF$ApprovalFY, "disbursementgross" = SBACHGOFF$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x = new.df$disbursementgross, by=list(new.df$approvalyear), FUN = mean)
colnames(my.df) = c("Year", "AvgLoanAmt")
ggplot2::ggplot(data = my.df)+
  aes(x = Year, y = AvgLoanAmt)+
  geom_line()+
  labs(title = "Average Loan Amount Over Time for Defaulted Loans", x = "Year", y = "Average Disbursement Gross")

Conclusion

The five features we found most important for predicting loan status are “State”, “BankState”, “Term”, “FranchiseCode”, and “DisbursementGross”. Logistic regression predicts loan status more consistently, with an average accuracy of 83%, while KNN with k = 7 achieves the lowest error rate of 0.09204.