Should This Loan be Approved or Denied?
In this report, we explore attributes of SBAs loan data from a variety of small business loan records. The sba has offered a collection of data on these small businesses and their loan applications. The businesses in the dataset represent a variety of industries locations, sizes, and other demographic attributes. There are over 800,000 businesses with 27 varables collected over a span of several decades.
We are interested in a few problems and questions besides what have been explored in the class. What are the most important five features in predicting default rate. Is KNN better or Logistic regression better when predicting default rate using the five most important features.
If this rmd does not compile at the first time becasue some package is not installed on the grader’s computer, you can check our published knit report on http://rpubs.com/dzhao97/543225
The dataset contains 27 variables. After carefully inspect the variable description in the “Should This Loan be Approved or Denied?” article, we dropped some variables because they contain too many distinct values (which will result overfitting) or they are consequences of Loan_Status. The variables we dropped are “LoanNr_ChkDgt”, “Name”, “City”, “Zip”, “NAICS”, “ApprovalDate”, “ApprovalFY”, “Bank”, “ChgOffDate”, “DisbursementDate”, “ChgOffPrinGr”. After that, according to all the valid input values for each variable in the dataset description, we filter out all rows containing null values or invalid input. The result dataframe has 359320 observations.
##load Original Data
library(readr)
SBAnational <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv", col_types = cols(BalanceGross = col_number(), DisbursementGross = col_number(), GrAppv = col_number(), SBA_Appv = col_number()))
#Drop properties with too much distinct values or unrelated
drops <- c("LoanNr_ChkDgt", "Name","City","Zip","NAICS", "ApprovalDate", "ApprovalFY","Bank", "ChgOffDate","DisbursementDate","ChgOffPrinGr")
Ourdata = SBAnational[ , !(names(SBAnational) %in% drops)]
# use sql to preprocess data with valid input
library(sqldf)
Ourdata$FranchiseCode[Ourdata$FranchiseCode > 1] = 2 # all large franchise
Ourdata = sqldf('select * from Ourdata where (MIS_Status = "P I F" or MIS_Status = "CHGOFF") and (NewExist = 1 or NewExist = 2) and (FranchiseCode = 0 or FranchiseCode = 1 or FranchiseCode = 2) and (UrbanRural = 1 or UrbanRural = 2) and (RevLineCr = "Y" or RevLineCr = "N") and (LowDoc = "Y" or LowDoc = "N") and (BankState <> "AN")')
#clean all NUlls
Ourdata = na.omit(Ourdata)
Copy_Ourdata = Ourdata
#change char data input to factor
Ourdata$MIS_Status = factor(Ourdata$MIS_Status)
Ourdata$State = factor(Ourdata$State)
Ourdata$BankState = factor(Ourdata$BankState)
Ourdata$RevLineCr = factor(Ourdata$RevLineCr)
Ourdata$LowDoc = factor(Ourdata$LowDoc)
Ourdata$FranchiseCode = factor(Ourdata$FranchiseCode)
summary(Ourdata)
## State BankState Term NoEmp
## CA : 48477 NC : 53071 Min. : 0.0 Min. : 0
## NY : 28837 IL : 37020 1st Qu.: 50.0 1st Qu.: 2
## TX : 22231 CA : 35161 Median : 84.0 Median : 3
## FL : 20296 RI : 33377 Mean : 82.7 Mean : 8
## PA : 16068 OH : 30342 3rd Qu.: 84.0 3rd Qu.: 8
## OH : 15623 VA : 18213 Max. :527.0 Max. :8000
## (Other):207788 (Other):152136
## NewExist CreateJob RetainedJob FranchiseCode UrbanRural
## Min. :1.00 Min. : 0 Min. : 0 0:174197 Min. :1.00
## 1st Qu.:1.00 1st Qu.: 0 1st Qu.: 1 1:173852 1st Qu.:1.00
## Median :1.00 Median : 0 Median : 2 2: 11271 Median :1.00
## Mean :1.27 Mean : 2 Mean : 6 Mean :1.18
## 3rd Qu.:2.00 3rd Qu.: 2 3rd Qu.: 6 3rd Qu.:1.00
## Max. :2.00 Max. :5085 Max. :7250 Max. :2.00
##
## RevLineCr LowDoc DisbursementGross BalanceGross MIS_Status
## N:167459 N:357759 Min. : 4000 Min. : 0 CHGOFF: 94663
## Y:191861 Y: 1561 1st Qu.: 28000 1st Qu.: 0 P I F :264657
## Median : 63482 Median : 0
## Mean : 162311 Mean : 6
## 3rd Qu.: 160143 3rd Qu.: 0
## Max. :11446325 Max. :996262
##
## GrAppv SBA_Appv
## Min. : 1000 Min. : 500
## 1st Qu.: 25000 1st Qu.: 12500
## Median : 50000 Median : 25000
## Mean : 139130 Mean : 101986
## 3rd Qu.: 120000 3rd Qu.: 75000
## Max. :5000000 Max. :4500000
##
#run randomforest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
train <- sample(nrow(Ourdata), 0.7*nrow(Ourdata), replace = FALSE)
TrainSet <- Ourdata[train,]
ValidSet <- Ourdata[-train,]
model <- randomForest(MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = 5, importance = TRUE)
model
##
## Call:
## randomForest(formula = MIS_Status ~ ., data = TrainSet, ntree = 100, mtry = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 5.74%
## Confusion matrix:
## CHGOFF P I F class.error
## CHGOFF 57169 9062 0.13682
## P I F 5386 179906 0.02907
# Predicting on Validation set
predValid <- predict(model, ValidSet, type = "class")
# Checking classification accuracy
accuracy = mean(predValid == ValidSet$MIS_Status)
#confusion matrix
table(predValid,ValidSet$MIS_Status)
##
## predValid CHGOFF P I F
## CHGOFF 24664 2188
## P I F 3768 77177
The accurqcy of our model is 0.9447.
After tune the parameters a few times on our local run, we found according to the feature importance of Random Forest Model, there are five features that show a higher significance than other features consistently. The importance graph is the following:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
varImpPlot(model)
As a result, we choose to use these five above features as our input features for further training models.
features = c("State","BankState","Term","FranchiseCode","DisbursementGross","MIS_Status")
temp = Ourdata[ , (names(Ourdata) %in% features)]
Ourdata = temp
We run the Logistic Regression on the five features we selected earlier.
total_performance = 0
best_logistic = 0
best_score = 0
best_table = 0
for (i in 1:10){
index_1 = createDataPartition(Ourdata$State, p = 0.85, list = F)
index_2 = createDataPartition(Ourdata$BankState, p = 0.85, list = F)
train = intersect(index_1, index_2)
TrainSet <- Ourdata[train,]
ValidSet <- Ourdata[-train,]
glm.fit <- glm(MIS_Status ~ State + BankState + Term + FranchiseCode + DisbursementGross, data = TrainSet, family = binomial)
#levels(ValidSet$MIS_Status)
glm.probs <- predict(glm.fit, newdata = ValidSet, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "P I F", "CHGOFF")
cur_score = mean(glm.pred == ValidSet$MIS_Status)
total_performance = total_performance + cur_score
if (cur_score > best_score){
best_score = cur_score
best_logistic = glm.fit
best_table = table(glm.pred, ValidSet$MIS_Status)
}
}
summary(best_logistic)
##
## Call:
## glm(formula = MIS_Status ~ State + BankState + Term + FranchiseCode +
## DisbursementGross, family = binomial, data = TrainSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.337 -0.586 0.387 0.659 2.820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.61e-01 3.66e-01 -2.08 0.03768 *
## StateAL -1.12e+00 2.08e-01 -5.39 7.1e-08 ***
## StateAR -1.64e+00 2.18e-01 -7.54 4.6e-14 ***
## StateAZ -1.22e+00 1.99e-01 -6.16 7.1e-10 ***
## StateCA -1.04e+00 1.96e-01 -5.30 1.2e-07 ***
## StateCO -1.12e+00 1.99e-01 -5.63 1.8e-08 ***
## StateCT -1.97e-01 2.02e-01 -0.97 0.33067
## StateDC -6.30e-01 2.33e-01 -2.70 0.00684 **
## StateDE -8.37e-01 2.21e-01 -3.78 0.00016 ***
## StateFL -1.40e+00 1.97e-01 -7.11 1.1e-12 ***
## StateGA -1.55e+00 1.99e-01 -7.82 5.4e-15 ***
## StateHI -1.43e+00 2.55e-01 -5.59 2.3e-08 ***
## StateIA -6.61e-01 2.20e-01 -3.00 0.00269 **
## StateID -8.13e-01 2.06e-01 -3.95 7.7e-05 ***
## StateIL -9.67e-01 1.97e-01 -4.90 9.6e-07 ***
## StateIN -8.26e-01 2.01e-01 -4.12 3.9e-05 ***
## StateKS -9.38e-01 2.12e-01 -4.42 9.9e-06 ***
## StateKY -9.34e-01 2.06e-01 -4.53 5.9e-06 ***
## StateLA -1.16e+00 2.07e-01 -5.59 2.3e-08 ***
## StateMA -4.57e-01 1.99e-01 -2.30 0.02169 *
## StateMD -8.33e-01 2.01e-01 -4.15 3.4e-05 ***
## StateME -1.06e-01 2.19e-01 -0.48 0.62792
## StateMI -1.01e+00 1.98e-01 -5.10 3.5e-07 ***
## StateMN -7.48e-01 2.01e-01 -3.72 0.00020 ***
## StateMO -1.05e+00 2.02e-01 -5.21 1.9e-07 ***
## StateMS -1.59e+00 2.16e-01 -7.38 1.6e-13 ***
## StateMT -3.78e-01 2.44e-01 -1.55 0.12144
## StateNC -5.72e-01 2.00e-01 -2.86 0.00428 **
## StateND 3.29e-01 2.82e-01 1.17 0.24315
## StateNE -5.40e-01 2.31e-01 -2.34 0.01932 *
## StateNH -5.45e-01 2.04e-01 -2.67 0.00752 **
## StateNJ -7.99e-01 1.98e-01 -4.03 5.5e-05 ***
## StateNM -4.95e-01 2.20e-01 -2.25 0.02458 *
## StateNV -1.63e+00 2.03e-01 -8.07 7.0e-16 ***
## StateNY -7.03e-01 1.97e-01 -3.58 0.00035 ***
## StateOH -6.57e-01 1.97e-01 -3.33 0.00087 ***
## StateOK -1.15e+00 2.10e-01 -5.48 4.3e-08 ***
## StateOR -5.58e-01 2.01e-01 -2.77 0.00553 **
## StatePA -6.43e-01 1.98e-01 -3.24 0.00118 **
## StateRI -1.99e-01 2.04e-01 -0.98 0.32904
## StateSC -7.52e-01 2.08e-01 -3.61 0.00031 ***
## StateSD 3.83e-01 2.38e-01 1.61 0.10826
## StateTN -1.35e+00 2.04e-01 -6.62 3.6e-11 ***
## StateTX -7.25e-01 1.97e-01 -3.68 0.00023 ***
## StateUT -8.53e-01 2.02e-01 -4.22 2.4e-05 ***
## StateVA -5.21e-01 2.01e-01 -2.60 0.00939 **
## StateVT -3.58e-02 2.33e-01 -0.15 0.87787
## StateWA -8.79e-01 1.99e-01 -4.42 9.9e-06 ***
## StateWI -4.71e-01 2.03e-01 -2.32 0.02034 *
## StateWV -4.60e-01 2.33e-01 -1.97 0.04868 *
## StateWY -2.98e-02 3.07e-01 -0.10 0.92272
## BankStateAL -5.01e-01 4.19e-01 -1.20 0.23159
## BankStateAR 8.96e-01 4.29e-01 2.09 0.03667 *
## BankStateAZ 9.35e-01 4.54e-01 2.06 0.03926 *
## BankStateCA -1.14e+00 4.15e-01 -2.75 0.00602 **
## BankStateCO 5.21e-01 4.39e-01 1.19 0.23494
## BankStateCT -3.96e-01 4.23e-01 -0.94 0.34849
## BankStateDC -1.46e+00 4.96e-01 -2.94 0.00324 **
## BankStateDE -1.31e-01 4.16e-01 -0.31 0.75288
## BankStateFL -1.05e+00 4.17e-01 -2.50 0.01226 *
## BankStateGA 7.80e-01 4.20e-01 1.86 0.06345 .
## BankStateHI 1.82e+00 4.66e-01 3.91 9.2e-05 ***
## BankStateIA 3.50e-01 4.35e-01 0.80 0.42132
## BankStateID -7.08e-02 4.30e-01 -0.16 0.86924
## BankStateIL -6.26e-01 4.15e-01 -1.51 0.13151
## BankStateIN 1.18e+00 4.26e-01 2.77 0.00555 **
## BankStateKS 1.15e+00 4.34e-01 2.66 0.00783 **
## BankStateKY 8.03e-01 4.35e-01 1.85 0.06454 .
## BankStateLA 9.23e-01 4.42e-01 2.09 0.03670 *
## BankStateMA 7.02e-01 4.21e-01 1.67 0.09554 .
## BankStateMD 1.18e+00 4.29e-01 2.75 0.00593 **
## BankStateME 9.52e-01 4.46e-01 2.14 0.03269 *
## BankStateMI 1.09e+00 4.27e-01 2.56 0.01033 *
## BankStateMN 7.11e-01 4.21e-01 1.69 0.09105 .
## BankStateMO 3.93e-01 4.21e-01 0.93 0.35062
## BankStateMS 1.92e+00 4.33e-01 4.43 9.2e-06 ***
## BankStateMT 1.49e+00 4.52e-01 3.30 0.00098 ***
## BankStateNC -5.75e-01 4.15e-01 -1.39 0.16602
## BankStateND 5.78e-01 4.62e-01 1.25 0.21064
## BankStateNE 5.80e-01 4.41e-01 1.32 0.18814
## BankStateNH 4.50e-01 4.31e-01 1.05 0.29567
## BankStateNJ 3.05e-01 4.24e-01 0.72 0.47200
## BankStateNM 7.17e-01 4.42e-01 1.62 0.10497
## BankStateNV 1.34e-01 4.33e-01 0.31 0.75618
## BankStateNY -1.45e-02 4.16e-01 -0.03 0.97215
## BankStateOH -9.26e-02 4.15e-01 -0.22 0.82354
## BankStateOK 7.28e-01 4.31e-01 1.69 0.09104 .
## BankStateOR -9.29e-01 4.18e-01 -2.22 0.02611 *
## BankStatePA 9.62e-01 4.20e-01 2.29 0.02182 *
## BankStatePR -2.54e-02 1.33e+00 -0.02 0.98470
## BankStateRI -3.78e-01 4.16e-01 -0.91 0.36337
## BankStateSC -2.58e+00 4.35e-01 -5.92 3.1e-09 ***
## BankStateSD -7.28e-01 4.15e-01 -1.75 0.07963 .
## BankStateTN -1.81e-01 4.31e-01 -0.42 0.67453
## BankStateTX -4.04e-01 4.18e-01 -0.97 0.33405
## BankStateUT -1.77e-01 4.19e-01 -0.42 0.67282
## BankStateVA -1.31e+00 4.15e-01 -3.15 0.00163 **
## BankStateVT 1.59e+00 4.55e-01 3.50 0.00046 ***
## BankStateWA 1.57e-01 4.25e-01 0.37 0.71179
## BankStateWI -2.31e-02 4.20e-01 -0.05 0.95621
## BankStateWV -7.76e-01 4.73e-01 -1.64 0.10086
## BankStateWY 8.56e-01 5.46e-01 1.57 0.11688
## Term 4.15e-02 2.15e-04 193.05 < 2e-16 ***
## FranchiseCode1 8.57e-01 1.17e-02 73.27 < 2e-16 ***
## FranchiseCode2 4.06e-02 3.53e-02 1.15 0.24969
## DisbursementGross 2.46e-07 2.82e-08 8.70 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 299658 on 259594 degrees of freedom
## Residual deviance: 218847 on 259489 degrees of freedom
## AIC: 219059
##
## Number of Fisher Scoring iterations: 6
best_score
## [1] 0.8307
best_table
##
## glm.pred CHGOFF P I F
## CHGOFF 14102 4850
## P I F 12037 68736
total_performance/10
## [1] 0.8297
In the article “Should This Loan be Approved or Denied?”, the author chose the following five features when he or she run the logistics model and then he selected three of them according to the summary table of the first run. Their model’s accuracy is 0.6784. On the other hand, the average accuracy of our model is around 0.83. Therefore, we think our model is more robust than the author’s model.
We run a KNN classification our 5 features using k=1, 3, 5, 7, 10, and 100.
#load package
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(class)
# standardize all quantitative predictors so they have mean 0 and standard deviation of 1
# scale Term and DisbursementGross
standardized.X = data.frame(scale(Ourdata[ , -c(1, 2, 4, 6)]))
summary(standardized.X)
## Term DisbursementGross
## Min. :-1.402 Min. :-0.55
## 1st Qu.:-0.555 1st Qu.:-0.47
## Median : 0.022 Median :-0.35
## Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.022 3rd Qu.:-0.01
## Max. : 7.531 Max. :39.43
unstandardized.X = data.frame(Ourdata[, -c(3, 5)])
combined.X = cbind(unstandardized.X, standardized.X)
summary(combined.X)
## State BankState FranchiseCode MIS_Status
## CA : 48477 NC : 53071 0:174197 CHGOFF: 94663
## NY : 28837 IL : 37020 1:173852 P I F :264657
## TX : 22231 CA : 35161 2: 11271
## FL : 20296 RI : 33377
## PA : 16068 OH : 30342
## OH : 15623 VA : 18213
## (Other):207788 (Other):152136
## Term DisbursementGross
## Min. :-1.402 Min. :-0.55
## 1st Qu.:-0.555 1st Qu.:-0.47
## Median : 0.022 Median :-0.35
## Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.022 3rd Qu.:-0.01
## Max. : 7.531 Max. :39.43
##
# Convert factor predictors into numerics to be used in KNN classification
combined.X$State = as.integer(combined.X$State)
combined.X$BankState = as.integer(combined.X$BankState)
# use 20% of data for test set
test.X = combined.X[, -4] %>% slice(1:69609)
train.X = combined.X[, -4] %>% slice(69610:348049)
test.Y = combined.X$MIS_Status[1:69609]
train.Y = combined.X$MIS_Status[69610:348049]
# First attempt with K = 1
set.seed(1)
knn_pred = knn(train.X, test.X, train.Y, k = 1)
k_1_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_1_error_rate)
## [1] 0.1043
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 17820 3231
## P I F 4027 44531
# Second Attempt with K = 3
set.seed(2)
knn_pred = knn(train.X, test.X, train.Y, k = 3)
k_3_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_3_error_rate)
## [1] 0.09651
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 17925 2796
## P I F 3922 44966
# Third Attempt with K = 5
set.seed(3)
knn_pred = knn(train.X, test.X, train.Y, k = 5)
k_5_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_5_error_rate)
## [1] 0.09446
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18029 2757
## P I F 3818 45005
# Fourth Attempt with K = 7
set.seed(4)
knn_pred = knn(train.X, test.X, train.Y, k = 7)
k_7_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_7_error_rate)
## [1] 0.09204
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 18082 2642
## P I F 3765 45120
# Fifth Attempt with K = 10
set.seed(5)
knn_pred = knn(train.X, test.X, train.Y, k = 10)
k_10_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_10_error_rate)
## [1] 0.09783
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 17786 2749
## P I F 4061 45013
# Sixth Attempt with K = 100
set.seed(100)
knn_pred= knn(train.X, test.X, train.Y, k = 100)
k_100_error_rate = mean(test.Y != knn_pred) # KNN error rate
print(k_100_error_rate)
## [1] 0.1277
table(knn_pred, test.Y)
## test.Y
## knn_pred CHGOFF P I F
## CHGOFF 16227 3269
## P I F 5620 44493
The model with the best accuracy is k=7. The error rate is 0.09240.
Proposed Model: Term, State, BankState, FranchiseCode, DisbursementGross
Is including both Bank State and State redundant?
X = Copy_Ourdata$State == Copy_Ourdata$BankState
same_rate = sum(X, na.rm = TRUE) / nrow(Copy_Ourdata) *100
Bank State and State are only the same about 37.9695% of the time. How are their default rates different?
new.df = data.frame("state" = Copy_Ourdata$BankState, "default" = Copy_Ourdata$MIS_Status)
new.df = na.omit(new.df)
new.df$dfr = new.df$default == "CHGOFF"
new.df = new.df[,-2]
my.df = aggregate.data.frame(x = new.df$dfr, by=list(new.df$state), FUN = mean)
colnames(my.df) = c("state", "DefaultRate")
usmap::plot_usmap(data = my.df, values = "DefaultRate", lines = "black")+
scale_fill_continuous(low = "white", high = "blue", name = "Default Rates for Banks by State")
new.df = data.frame("state" = Copy_Ourdata$State, "default" = Copy_Ourdata$MIS_Status)
new.df = na.omit(new.df)
new.df$dfr = new.df$default == "CHGOFF"
new.df = new.df[,-2]
my.df = aggregate.data.frame(x = new.df$dfr, by=list(new.df$state), FUN = mean)
colnames(my.df) = c("state", "DefaultRate")
usmap::plot_usmap(data = my.df, values = "DefaultRate", lines = "black")+
scale_fill_continuous(low = "white", high = "red", name = "Default Rates by State")
new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "Term_Length" = Copy_Ourdata$Term)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$Term_Length, by=list(new.df$Loan_Status), FUN = mean)
colnames(my.df) = c("Loan Status", "Average Term")
my.df
## Loan Status Average Term
## 1 CHGOFF 48.39
## 2 P I F 95.01
Term length tends to be longer for loans PIF which is shown in our model by a positive coefficient on the term predictor.
new.df = data.frame("LoanStatus" = Copy_Ourdata$MIS_Status, "FranchiseCode" = Copy_Ourdata$FranchiseCode)
new.df = na.omit(new.df)
new.df$FranchiseCode[new.df$FranchiseCode > 1] = 2
my.df = aggregate.data.frame(x=new.df$FranchiseCode, by=list(new.df$LoanStatus), FUN = mean)
colnames(my.df) = c("Loan Status", "Average Franchise Code")
my.df
## Loan Status Average Franchise Code
## 1 CHGOFF 0.3963
## 2 P I F 0.6003
Franchise Code 2 is generalized to mean all franchise codes that aren’t 1 or 0. In other words, 2 indicates a franchise business, some more recognizable than others. It is likely that a franchise code of 1 indicates a smaller or local franchise since many businesses in “2” have recognizeable names while 1s do not. It is natural that this would be a part of the model because franchises have brand recognition which in many cases may make them more likely to succeed.
new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "dg" = Copy_Ourdata$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$dg, by=list(new.df$Loan_Status), FUN = mean)
colnames(my.df) = c("Loan Status", "Mean Loan Amount (Dollars)")
my.df
## Loan Status Mean Loan Amount (Dollars)
## 1 CHGOFF 101805
## 2 P I F 183952
new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "dg" = Copy_Ourdata$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$dg, by=list(new.df$Loan_Status), FUN = median)
colnames(my.df) = c("Loan Status", "Median Loan Amount (Dollars)")
my.df
## Loan Status Median Loan Amount (Dollars)
## 1 CHGOFF 50000
## 2 P I F 72209
new.df = data.frame("Loan_Status" = Copy_Ourdata$MIS_Status, "dg" = Copy_Ourdata$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x=new.df$dg, by=list(new.df$Loan_Status), FUN = sd)
colnames(my.df) = c("Loan Status", "Standard Deviation Loan Amount (Dollars)")
my.df
## Loan Status Standard Deviation Loan Amount (Dollars)
## 1 CHGOFF 170675
## 2 P I F 314665
The above three tables show the comparation between Loan Status and the mean,median, and standard deviation of loan amount,
SBAPIF = SBAnational[SBAnational$MIS_Status == "P I F",]
SBACHGOFF = SBAnational[SBAnational$MIS_Status == "CHGOFF",]
new.df = data.frame("approvalyear" = SBAPIF$ApprovalFY, "disbursementgross" = SBAPIF$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x = new.df$disbursementgross, by=list(new.df$approvalyear), FUN = mean)
colnames(my.df) = c("Year", "AvgLoanAmt")
ggplot2::ggplot(data = my.df)+
aes(x = Year, y = AvgLoanAmt)+
geom_line()+
labs(title = "Average Loan Amount Over Time for Loans Paid in Full", x = "Year", y = "Average Disbursement Gross")
new.df = data.frame("approvalyear" = SBACHGOFF$ApprovalFY, "disbursementgross" = SBACHGOFF$DisbursementGross)
new.df = na.omit(new.df)
my.df = aggregate.data.frame(x = new.df$disbursementgross, by=list(new.df$approvalyear), FUN = mean)
colnames(my.df) = c("Year", "AvgLoanAmt")
ggplot2::ggplot(data = my.df)+
aes(x = Year, y = AvgLoanAmt)+
geom_line()+
labs(title = "Average Loan Amount Over Time for Defaulted Loans", x = "Year", y = "Average Disbursement Gross")
The five features we found are most important in predicting loan status are “State”,“BankState”,“Term”,“FranchiseCode”,and “DisbursementGross”. Logistic Regression is more consistent with predicting loan status with average accuracy of 83%. KNN with k = 7 shows a lowest error rate of 0.09240.