Abstract

The purpose of this project was to use a subset of the Lending Club accepted loan application data to predict loan status. Multiple models were trained and their performance was measured in terms of Kappa. The algorithms of interest were k-NN, naiveBayes, C5.0, rpart, svm, and randomForest. Each model was tuned in an attempt to improve performance and, ultimately, C5.0 emerged as the “best” model.

Introduction

The data used is Lending Club's accepted loan application data from 2007 to 2018, although I was only interested in analyzing the years 2012 to 2015. The full dataset contains 2,260,701 observations and 151 features. The first few rows of the dataset already show many missing values, which could cause problems during model training.

Before removing the columns with missing values, features in “Month-Year” format were reduced to only the year, which made it possible to filter on the years of interest; the feature used for this filter is issue_d. I then removed features with a high percentage of missing values, with high meaning greater than 50%. The remaining features still had some missing values but, given that the dataset is relatively large, I omitted rows with missing values as well. To prevent unreasonable runtimes, factor features with more than 50 levels or with only one level were removed, along with identifying features. In total, this brought the dataset down to 319,773 observations with 79 features.
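
The appendix drops the high-cardinality and identifying columns by name. As a rough sketch only, the filtering rule described above could also be applied programmatically. The helper name `drop_uninformative`, its arguments, and the toy data frame are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: drop columns with too many missing values and
# categorical columns with a single level or too many levels, then
# omit any rows that still contain missing values.
drop_uninformative <- function(df, max_na = 0.5, max_levels = 50) {
  na_share <- colMeans(is.na(df))

  # Number of distinct non-missing values for categorical columns,
  # NA for every other column type
  n_levels <- vapply(df, function(col) {
    if (is.character(col) || is.factor(col)) {
      length(unique(na.omit(col)))
    } else {
      NA_integer_
    }
  }, integer(1))

  keep <- na_share <= max_na &
    (is.na(n_levels) | (n_levels > 1 & n_levels <= max_levels))

  na.omit(df[, keep, drop = FALSE])
}

# Toy data: one mostly-missing column and one constant column
toy <- data.frame(
  amount = c(100, 200, NA, 400),
  mostly_missing = c(NA, NA, NA, 1),
  constant = c("a", "a", "a", "a"),
  grade = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

cleaned <- drop_uninformative(toy)
cleaned
```

On the toy data this keeps only `amount` and `grade` and drops the row whose `amount` is missing.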

Considering my computational limitations, I created a separate dataset containing only 10% of the data (31,997 observations). This smaller dataset was then split into training and test data at a 75%-25% ratio.

The first model trained was k-NN on only the normalized numeric features with k = 155 (approximately the square root of the number of training observations). The evaluation metric used for all models was Kappa, which for this model was 0.6128. The remaining models gave Kappa = 0.3051 for naiveBayes; 0.9766 for C5.0; 0.9448 for the rpart classification tree (run with only numeric features due to long runtime when including factors); 0.9397 for SVM; and 0.9768 for randomForest.
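
For reference, the unweighted Cohen's Kappa reported by caret can be written in terms of a confusion matrix as below. The notation is introduced here for illustration: n_ii are the diagonal counts, n_i+ and n_+i are the row and column totals, and N is the total number of test observations.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_o = \frac{1}{N}\sum_{i} n_{ii}, \qquad
p_e = \frac{1}{N^2}\sum_{i} n_{i+}\, n_{+i}
```

Here p_o is the observed accuracy and p_e is the agreement expected by chance, so Kappa is 0 for chance-level predictions and 1 for perfect agreement. This matters for this data because roughly 81% of the loans are Fully Paid, which makes plain accuracy look high even for a weak model.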

In an attempt to improve the models, each was rerun with 10-fold cross-validation repeated twice. The exceptions were naiveBayes, which would not run with parameter tuning, and randomForest, which was run with 2-fold cross-validation and mtry set to the square root of the number of features.

The trainControl function from the caret package was used to set the cross-validation scheme and the selection function. The selection function was set to “best” in order to choose the model with the highest Kappa among those fitted during cross-validation. For k-NN, the data was standardized and the grid covered three values, k = 9, 111, 113, which improved Kappa to 0.7673. naiveBayes did worse, with Kappa = 0.2202. The grid for C5.0 set trials to 2, 4, 6, 8, and 10, and its Kappa came out to 0.9957. The tuned rpart had Kappa = 0.8925, SVM had Kappa = 0.9774, and randomForest had Kappa = 0.9695.
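
The tuning setup described above could look roughly like the following sketch. It is fitted on the built-in iris data rather than the loan data so that it runs quickly, and the `model` and `winnow` values in the grid are assumptions, since the text only specifies the values of `trials`.

```r
library(caret)  # the C50 package must also be installed

# 10-fold cross-validation repeated twice; "best" keeps the candidate
# with the highest value of the chosen metric
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 2,
                     selectionFunction = "best")

# trials values taken from the text; model and winnow are assumed
grid <- expand.grid(trials = c(2, 4, 6, 8, 10),
                    model = "tree",
                    winnow = FALSE,
                    stringsAsFactors = FALSE)

set.seed(652)
c50_tuned <- train(Species ~ ., data = iris,
                   method = "C5.0",
                   metric = "Kappa",
                   trControl = ctrl,
                   tuneGrid = grid)

# Parameter combination with the highest cross-validated Kappa
c50_tuned$bestTune
```

Setting `metric = "Kappa"` is what makes the “best” selection function compare candidates on Kappa rather than on accuracy.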

In the end, I chose C5.0 to predict loan_status for the year 2015 since it had the highest Kappa during cross-validation. The 2015 data was prepared in a similar way to the 2012-2014 data, which resulted in a dataset of 326,396 observations and 79 features. This time, C5.0 was run with 10-fold cross-validation and the number of trials set to 1, 5, and 10. The result was Kappa = 0.9806 and an accuracy of 0.9912.

Conclusion

Although a high Kappa was achieved for most algorithms, they were not all tuned the same way. This was because some algorithms, such as randomForest, take an unreasonable amount of time to tune without adequate computational resources. Ideally, all algorithms would have been trained on the full 2012-2014 dataset and the best one picked to classify the 2015 data.

Code Appendix

Loading the Data

The data being used is from LendingClub.

library(pacman)
p_load(tidyverse, tibble, lubridate,
       rpart, C50, randomForest, Amelia, naniar,
       Boruta, caret, class, trelliscopejs, tictoc,
       e1071, doParallel, purrr, neuralnet, kernlab,
       mosaicCore, future, ROCR)

accepted <- read_csv("C:\\Users\\fa_na\\OneDrive\\Documents\\Mathematics\\Statistics\\Statistical Machine Learning\\Project\\lending-club\\accepted_2007_to_2018q4.csv\\accepted_2007_to_2018Q4.csv")

Exploring and Preparing the Data

head(accepted)

## # A tibble: 6 x 151
##       id member_id loan_amnt funded_amnt funded_amnt_inv term  int_rate
##    <dbl> <lgl>         <dbl>       <dbl>           <dbl> <chr>    <dbl>
## 1 6.84e7 NA             3600        3600            3600 36 m~     14.0
## 2 6.84e7 NA            24700       24700           24700 36 m~     12.0
## 3 6.83e7 NA            20000       20000           20000 60 m~     10.8
## 4 6.63e7 NA            35000       35000           35000 60 m~     14.8
## 5 6.85e7 NA            10400       10400           10400 60 m~     22.4
## 6 6.84e7 NA            11950       11950           11950 36 m~     13.4
## # ... with 144 more variables: installment <dbl>, grade <chr>, sub_grade <chr>,
## #   emp_title <chr>, emp_length <chr>, home_ownership <chr>, annual_inc <dbl>,
## #   verification_status <chr>, issue_d <chr>, loan_status <chr>,
## #   pymnt_plan <chr>, url <chr>, desc <lgl>, purpose <chr>, title <chr>,
## #   zip_code <chr>, addr_state <chr>, dti <dbl>, delinq_2yrs <dbl>,
## #   earliest_cr_line <chr>, fico_range_low <dbl>, fico_range_high <dbl>,
## #   inq_last_6mths <dbl>, mths_since_last_delinq <dbl>,
## #   mths_since_last_record <dbl>, open_acc <dbl>, pub_rec <dbl>,
## #   revol_bal <dbl>, revol_util <dbl>, total_acc <dbl>,
## #   initial_list_status <chr>, out_prncp <dbl>, out_prncp_inv <dbl>,
## #   total_pymnt <dbl>, total_pymnt_inv <dbl>, total_rec_prncp <dbl>,
## #   total_rec_int <dbl>, total_rec_late_fee <dbl>, recoveries <dbl>,
## #   collection_recovery_fee <dbl>, last_pymnt_d <chr>, last_pymnt_amnt <dbl>,
## #   next_pymnt_d <chr>, last_credit_pull_d <chr>, last_fico_range_high <dbl>,
## #   last_fico_range_low <dbl>, collections_12_mths_ex_med <dbl>,
## #   mths_since_last_major_derog <dbl>, policy_code <dbl>,
## #   application_type <chr>, annual_inc_joint <dbl>, dti_joint <dbl>,
## #   verification_status_joint <chr>, acc_now_delinq <dbl>, tot_coll_amt <dbl>,
## #   tot_cur_bal <dbl>, open_acc_6m <dbl>, open_act_il <dbl>, open_il_12m <dbl>,
## #   open_il_24m <dbl>, mths_since_rcnt_il <dbl>, total_bal_il <dbl>,
## #   il_util <dbl>, open_rv_12m <dbl>, open_rv_24m <dbl>, max_bal_bc <dbl>,
## #   all_util <dbl>, total_rev_hi_lim <dbl>, inq_fi <dbl>, total_cu_tl <dbl>,
## #   inq_last_12m <dbl>, acc_open_past_24mths <dbl>, avg_cur_bal <dbl>,
## #   bc_open_to_buy <dbl>, bc_util <dbl>, chargeoff_within_12_mths <dbl>,
## #   delinq_amnt <dbl>, mo_sin_old_il_acct <dbl>, mo_sin_old_rev_tl_op <dbl>,
## #   mo_sin_rcnt_rev_tl_op <dbl>, mo_sin_rcnt_tl <dbl>, mort_acc <dbl>,
## #   mths_since_recent_bc <dbl>, mths_since_recent_bc_dlq <dbl>,
## #   mths_since_recent_inq <dbl>, mths_since_recent_revol_delinq <dbl>,
## #   num_accts_ever_120_pd <dbl>, num_actv_bc_tl <dbl>, num_actv_rev_tl <dbl>,
## #   num_bc_sats <dbl>, num_bc_tl <dbl>, num_il_tl <dbl>, num_op_rev_tl <dbl>,
## #   num_rev_accts <dbl>, num_rev_tl_bal_gt_0 <dbl>, num_sats <dbl>,
## #   num_tl_120dpd_2m <dbl>, num_tl_30dpd <dbl>, num_tl_90g_dpd_24m <dbl>,
## #   num_tl_op_past_12m <dbl>, ...

dim(accepted)

## [1] 2260701     151

# Capturing only the year for "date" variables
# and filtering for data between 2012-2014
accepted_2012_2014 <- accepted %>%
  mutate(issue_d = year(parse_date(issue_d, "%b-%Y")),
         earliest_cr_line = year(parse_date(earliest_cr_line, "%b-%Y")),
         last_pymnt_d = year(parse_date(last_pymnt_d, "%b-%Y")),
         last_credit_pull_d = year(parse_date(last_credit_pull_d, "%b-%Y"))) %>% 
  filter(between(issue_d, 2012, 2014), !is.na(issue_d))

# Drop identifier, variables with high proportion of NA's,
# as well as factors with too many levels,
# and omitting rows with missing values
accepted_2012_2014 <- accepted_2012_2014 %>%
  select(loan_status, 
         which(colMeans(is.na(accepted_2012_2014)) < 0.5),
         -id, -url, -emp_title, -policy_code, -application_type,
         -disbursement_method, -funded_amnt_inv,
         -title, -out_prncp_inv, -pub_rec_bankruptcies,
         -hardship_flag, -zip_code, -pymnt_plan, -addr_state) %>% 
  arrange(issue_d) %>%
  na.omit()

saveRDS(accepted, "accepted.Rds")
remove(accepted)

# Converts every character column of a data frame to a factor
char_to_factor <- function(x) {
  for (i in seq_len(ncol(x))) {
    if (is.character(x[[i]])) {
      x[[i]] <- as.factor(x[[i]])
    }
  }

  return(x)
}

# Changing character variables to factors
accepted_2012_2014 <- char_to_factor(accepted_2012_2014)

summary(accepted_2012_2014 %>%
          keep(is.factor))

##              loan_status            term        grade       sub_grade     
##  Charged Off       : 53840   36 months:224497   A:46201   B4     : 20887  
##  Current           :  9631   60 months: 95276   B:92232   B3     : 20694  
##  Default           :     1                      C:88951   C3     : 18507  
##  Fully Paid        :255784                      D:55069   B2     : 18348  
##  In Grace Period   :   171                      E:25448   C1     : 18204  
##  Late (16-30 days) :    59                      F: 9550   C2     : 18200  
##  Late (31-120 days):   287                      G: 2322   (Other):204933  
##      emp_length      home_ownership        verification_status
##  10+ years:112924   ANY     :     1   Not Verified   : 96483  
##  2 years  : 28934   MORTGAGE:172753   Source Verified:113374  
##  3 years  : 25654   NONE    :    29   Verified       :109916  
##  < 1 year : 24303   OTHER   :    36                           
##  5 years  : 21070   OWN     : 27181                           
##  1 year   : 20359   RENT    :119773                           
##  (Other)  : 86529                                             
##                purpose       initial_list_status debt_settlement_flag
##  debt_consolidation:195626   f:188008            N:314762            
##  credit_card       : 73981   w:131765            Y:  5011            
##  home_improvement  : 17562                                           
##  other             : 13456                                           
##  major_purchase    :  5279                                           
##  small_business    :  3360                                           
##  (Other)           : 10509

accepted_2012_2014 %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_trelliscope(~ key, scales = "free", path = "rmarkdown_files/trelli_one", self_contained = TRUE) +
    geom_histogram()

accepted_2012_2014 %>%
  keep(is.factor) %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_trelliscope(~ key, scales = "free", path = "rmarkdown_files/trelli_two", self_contained = TRUE) +
    geom_bar() +
    coord_flip()

## Warning: attributes are not identical across measure variables;
## they will be dropped

## using data from the first layer

accepted_2012_2014 <- accepted_2012_2014 %>%
  filter(last_fico_range_low != 0, last_fico_range_high != 0)

accepted_2012_2014 %>%
  ggplot() +
  geom_histogram(aes(last_fico_range_low), fill = "darkred")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

accepted_2012_2014 %>%
  ggplot() +
  geom_histogram(aes(last_fico_range_high), fill = "darkgreen")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

col_names <- colnames(accepted_2012_2014)
loan_dim <- tibble (
  Number_Rows = nrow(accepted_2012_2014),
  Number_Cols = ncol(accepted_2012_2014)
)

head(col_names)

## [1] "loan_status" "loan_amnt"   "funded_amnt" "term"        "int_rate"   
## [6] "installment"

loan_dim

## # A tibble: 1 x 2
##   Number_Rows Number_Cols
##         <int>       <int>
## 1      313184          79

set.seed(2)
rows <- sample(nrow(accepted_2012_2014), as.integer(0.1 * nrow(accepted_2012_2014)))

loan_sample <- accepted_2012_2014[rows, ]

tally(~ loan_status, loan_sample, "percent")

## loan_status
##        Charged Off            Current            Default         Fully Paid 
##        15.47991570         3.00466186         0.00000000        81.35257679 
##    In Grace Period  Late (16-30 days) Late (31-120 days) 
##         0.06066799         0.01915831         0.08301935

tally(~ loan_status, accepted_2012_2014, "percent")

## loan_status
##        Charged Off            Current            Default         Fully Paid 
##       1.565054e+01       3.072954e+00       3.193011e-04       8.111398e+01 
##    In Grace Period  Late (16-30 days) Late (31-120 days) 
##       5.460049e-02       1.883877e-02       8.876571e-02

saveRDS(accepted_2012_2014, "accepted_2012_2014.Rds")

remove(accepted_2012_2014)

set.seed(1)
rows <- sample(nrow(loan_sample), as.integer(0.75 * nrow(loan_sample)))

loan_train <- loan_sample[rows, ]
loan_test <- loan_sample[-rows, ]

tally(~ loan_status, loan_train, "percent")

## loan_status
##        Charged Off            Current            Default         Fully Paid 
##        15.42915531         3.01004768         0.00000000        81.39475477 
##    In Grace Period  Late (16-30 days) Late (31-120 days) 
##         0.05534741         0.01702997         0.09366485

tally(~ loan_status, loan_test, "percent")

## loan_status
##        Charged Off            Current            Default         Fully Paid 
##        15.63218391         2.98850575         0.00000000        81.22605364 
##    In Grace Period  Late (16-30 days) Late (31-120 days) 
##         0.07662835         0.02554278         0.05108557

# Dropping the unused level "Default" from training and test data
loan_train$loan_status <- droplevels(loan_train$loan_status, exclude = "Default")
loan_test$loan_status <- droplevels(loan_test$loan_status, exclude = "Default")

Model Training and Evaluation

kNN

normalize <- function(x) { 
  return ((x - min(x)) / (max(x) - min(x))) 
}

norm_train <- as.data.frame(lapply(loan_train %>%
                                    keep(is.numeric),
                                  normalize))

summary(norm_train[1:6])

##    loan_amnt       funded_amnt        int_rate       installment    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2353   1st Qu.:0.2353   1st Qu.:0.2488   1st Qu.:0.1807  
##  Median :0.3824   Median :0.3824   Median :0.3978   Median :0.2705  
##  Mean   :0.4151   Mean   :0.4151   Mean   :0.4015   Mean   :0.3087  
##  3rd Qu.:0.5588   3rd Qu.:0.5588   3rd Qu.:0.5479   3rd Qu.:0.4085  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    annual_inc          issue_d      
##  Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.008351   1st Qu.:0.5000  
##  Median :0.012088   Median :1.0000  
##  Mean   :0.014651   Mean   :0.7622  
##  3rd Qu.:0.017802   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.0000

norm_test <- as.data.frame(lapply(loan_test %>%
                                   keep(is.numeric),
                                 normalize))

summary(norm_test[1:6])

##    loan_amnt       funded_amnt        int_rate       installment    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2316   1st Qu.:0.2316   1st Qu.:0.2488   1st Qu.:0.1819  
##  Median :0.3757   Median :0.3757   Median :0.3978   Median :0.2768  
##  Mean   :0.4141   Mean   :0.4141   Mean   :0.4000   Mean   :0.3175  
##  3rd Qu.:0.5588   3rd Qu.:0.5588   3rd Qu.:0.5479   3rd Qu.:0.4222  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    annual_inc         issue_d      
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.02060   1st Qu.:0.5000  
##  Median :0.02935   Median :1.0000  
##  Mean   :0.03529   Mean   :0.7595  
##  3rd Qu.:0.04274   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000

set.seed(652)
knn_loan <- knn(norm_train, norm_test, loan_train$loan_status, 155)
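
In the code above, the test set is rescaled with its own minimum and maximum, so the same raw value can map to different normalized values in the training and test data (compare annual_inc in the two summaries). One possible alternative, sketched here on toy data, is to learn the scaling from the training set and reuse it for the test set. The helper names `fit_minmax` and `apply_minmax` are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: learn per-column minimum and range from training data
fit_minmax <- function(train_df) {
  mins <- vapply(train_df, min, numeric(1))
  ranges <- vapply(train_df, function(col) max(col) - min(col), numeric(1))
  list(mins = mins, ranges = ranges)
}

# Hypothetical helper: apply a previously learned scaling to any data frame
apply_minmax <- function(df, scaler) {
  scaled <- sweep(as.matrix(df), 2, scaler$mins, "-")
  scaled <- sweep(scaled, 2, scaler$ranges, "/")
  as.data.frame(scaled)
}

toy_train <- data.frame(income = c(20, 40, 100), rate = c(5, 10, 15))
toy_test <- data.frame(income = c(60, 120), rate = c(5, 20))

scaler <- fit_minmax(toy_train)
toy_train_scaled <- apply_minmax(toy_train, scaler)
toy_test_scaled <- apply_minmax(toy_test, scaler)

toy_test_scaled
```

With this approach test values can fall outside [0, 1] when they lie beyond the training range, but distances between training and test points are measured on one consistent scale.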

Naive Bayes

set.seed(652)
naive_loan <- naiveBayes(loan_train[, -1], loan_train$loan_status)
naive_pred <- predict(naive_loan, loan_test[, -1])

C5.0

set.seed(652)
c50_loan <- C5.0(loan_status ~ ., loan_train)
c50_pred <- predict(c50_loan, loan_test[-1])

Classification Tree (rpart)

set.seed(652)
rpart_loan <- rpart(loan_status ~ ., loan_train %>%
                      select(loan_status, colnames(loan_train %>%
                                                     keep(is.numeric))))
rpart_pred <- predict(rpart_loan, 
                      loan_test[-1] %>%
                        keep(is.numeric),
                      "class")

Support Vector Machine

set.seed(652)
svm_loan <- ksvm(loan_status ~ ., loan_train)
svm_pred <- predict(svm_loan, loan_test[-1], "response")

Random Forest

set.seed(652)
rf_loan <- randomForest(loan_status ~ ., loan_train)
rf_pred <- predict(rf_loan, loan_test[-1], "response")

Model Evaluation

kNN

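# Note: confusionMatrix() expects (data = predictions, reference = truth).
# Here the true labels are passed first, so in the table below the rows
# are the true classes and the columns are the k-NN predictions.
confusionMatrix(loan_test$loan_status, knn_loan)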

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off                763       0        461               0
##   Current                      0      62        172               0
##   Fully Paid                  23       0       6337               0
##   In Grace Period              0       1          5               0
##   Late (16-30 days)            0       0          2               0
##   Late (31-120 days)           0       2          2               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            0                  0
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9147          
##                  95% CI : (0.9083, 0.9208)
##     No Information Rate : 0.8913          
##     P-Value [Acc > NIR] : 3.653e-12       
##                                           
##                   Kappa : 0.672           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                     0.97074       0.953846            0.9080
## Specificity                     0.93455       0.977849            0.9730
## Pos Pred Value                  0.62337       0.264957            0.9964
## Neg Pred Value                  0.99652       0.999605            0.5633
## Prevalence                      0.10038       0.008301            0.8913
## Detection Rate                  0.09745       0.007918            0.8093
## Detection Prevalence            0.15632       0.029885            0.8123
## Balanced Accuracy               0.95265       0.965848            0.9405
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                              NA                       NA
## Specificity                       0.9992337                0.9997446
## Pos Pred Value                           NA                       NA
## Neg Pred Value                           NA                       NA
## Prevalence                        0.0000000                0.0000000
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0007663                0.0002554
## Balanced Accuracy                        NA                       NA
##                      Class: Late (31-120 days)
## Sensitivity                                 NA
## Specificity                          0.9994891
## Pos Pred Value                              NA
## Neg Pred Value                              NA
## Prevalence                           0.0000000
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0005109
## Balanced Accuracy                           NA
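
As a check on how the Kappa value above is obtained, the following sketch recomputes it directly from the k-NN confusion matrix. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals. The function name `cohen_kappa` is a hypothetical helper, and the counts are copied from the table above.

```r
# Confusion matrix copied from the k-NN output above
# (rows: first argument to confusionMatrix, columns: second argument)
classes <- c("Charged Off", "Current", "Fully Paid", "In Grace Period",
             "Late (16-30 days)", "Late (31-120 days)")
knn_table <- matrix(
  c(763,  0,  461, 0, 0, 0,
      0, 62,  172, 0, 0, 0,
     23,  0, 6337, 0, 0, 0,
      0,  1,    5, 0, 0, 0,
      0,  0,    2, 0, 0, 0,
      0,  2,    2, 0, 0, 0),
  nrow = 6, byrow = TRUE, dimnames = list(classes, classes)
)

# Hypothetical helper: unweighted Cohen's Kappa from a square count table
cohen_kappa <- function(tab) {
  n <- sum(tab)
  observed <- sum(diag(tab)) / n                      # overall accuracy
  expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (observed - expected) / (1 - expected)
}

knn_accuracy <- sum(diag(knn_table)) / sum(knn_table)
knn_kappa <- cohen_kappa(knn_table)

round(c(accuracy = knn_accuracy, kappa = knn_kappa), 4)
```

This reproduces the Accuracy of 0.9147 and the Kappa of 0.672 shown in the output above.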

Naive Bayes

confusionMatrix(naive_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off                263       0          0               0
##   Current                      1       1          6               0
##   Fully Paid                 110       1       3528               0
##   In Grace Period            336     132       1362               4
##   Late (16-30 days)          513      99       1441               2
##   Late (31-120 days)           1       1         23               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            0                  0
##   Fully Paid                         0                  0
##   In Grace Period                    1                  1
##   Late (16-30 days)                  1                  3
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4849          
##                  95% CI : (0.4738, 0.4961)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1652          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                     0.21487      0.0042735            0.5547
## Specificity                     1.00000      0.9990785            0.9245
## Pos Pred Value                  1.00000      0.1250000            0.9695
## Neg Pred Value                  0.87300      0.9702122            0.3243
## Prevalence                      0.15632      0.0298851            0.8123
## Detection Rate                  0.03359      0.0001277            0.4506
## Detection Prevalence            0.03359      0.0010217            0.4648
## Balanced Accuracy               0.60743      0.5016760            0.7396
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.6666667                0.5000000
## Specificity                       0.7658487                0.7370976
## Pos Pred Value                    0.0021786                0.0004857
## Neg Pred Value                    0.9996663                0.9998267
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0005109                0.0001277
## Detection Prevalence              0.2344828                0.2629630
## Balanced Accuracy                 0.7162577                0.6185488
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          0.9968055
## Pos Pred Value                       0.0000000
## Neg Pred Value                       0.9994875
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0031928
## Balanced Accuracy                    0.4984028

C5.0

confusionMatrix(c50_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               1170       0         12               0
##   Current                      0     231          0               5
##   Fully Paid                  54       3       6348               1
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  1
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.99            
##                  95% CI : (0.9876, 0.9921)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.968           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.9559        0.98718            0.9981
## Specificity                      0.9982        0.99895            0.9605
## Pos Pred Value                   0.9898        0.96653            0.9909
## Neg Pred Value                   0.9919        0.99960            0.9916
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1494        0.02950            0.8107
## Detection Prevalence             0.1510        0.03052            0.8181
## Balanced Accuracy                0.9770        0.99306            0.9793
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.7500000
## Specificity                          1.0000000
## Pos Pred Value                       1.0000000
## Neg Pred Value                       0.9998722
## Prevalence                           0.0005109
## Detection Rate                       0.0003831
## Detection Prevalence                 0.0003831
## Balanced Accuracy                    0.8750000

Classification Tree (rpart)

confusionMatrix(rpart_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               1021       0          0               0
##   Current                      0     231          0               5
##   Fully Paid                 203       3       6360               1
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  4
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9722          
##                  95% CI : (0.9683, 0.9757)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9064          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.8342        0.98718            1.0000
## Specificity                      1.0000        0.99855            0.8592
## Pos Pred Value                   1.0000        0.95455            0.9685
## Neg Pred Value                   0.9702        0.99960            1.0000
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1304        0.02950            0.8123
## Detection Prevalence             0.1304        0.03091            0.8387
## Balanced Accuracy                0.9171        0.99287            0.9296
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          1.0000000
## Pos Pred Value                             NaN
## Neg Pred Value                       0.9994891
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0000000
## Balanced Accuracy                    0.5000000

Support Vector Machine

confusionMatrix(svm_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               1110       0          0               0
##   Current                      0     211          0               4
##   Fully Paid                 114      23       6360               2
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  4
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.981           
##                  95% CI : (0.9777, 0.9839)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9372          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.9069        0.90171            1.0000
## Specificity                      1.0000        0.99868            0.9054
## Pos Pred Value                   1.0000        0.95475            0.9786
## Neg Pred Value                   0.9830        0.99698            1.0000
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1418        0.02695            0.8123
## Detection Prevalence             0.1418        0.02822            0.8300
## Balanced Accuracy                0.9534        0.95020            0.9527
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          1.0000000
## Pos Pred Value                             NaN
## Neg Pred Value                       0.9994891
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0000000
## Balanced Accuracy                    0.5000000
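
The Kappa value reported above can be reproduced by hand from the confusion matrix. The following is a minimal base R sketch, not part of the original analysis; the names `cm`, `po`, `pe`, and `kappa_svm` are illustrative. It computes observed agreement (accuracy), the agreement expected by chance from the row and column totals, and Cohen's Kappa as their normalized difference.

```r
# Confusion matrix copied from the SVM output above
# (rows = predictions, columns = reference).
classes <- c("Charged Off", "Current", "Fully Paid",
             "In Grace Period", "Late (16-30 days)", "Late (31-120 days)")
cm <- matrix(c(1110,   0,    0, 0, 0, 0,
                  0, 211,    0, 4, 2, 4,
                114,  23, 6360, 2, 0, 0,
                  0,   0,    0, 0, 0, 0,
                  0,   0,    0, 0, 0, 0,
                  0,   0,    0, 0, 0, 0),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_svm <- (po - pe) / (1 - pe)            # Cohen's Kappa

round(c(accuracy = po, kappa = kappa_svm), 4)
```

Rounded to four decimals this gives an accuracy of 0.9810 and a Kappa of 0.9372, in line with the `confusionMatrix()` output. Because "Fully Paid" makes up about 81% of the test set, chance agreement is already high, which is why Kappa sits noticeably below accuracy.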

Random Forest

confusionMatrix(rf_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               1172       0          0               0
##   Current                      0     230          0               5
##   Fully Paid                  52       4       6360               1
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  1
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9917          
##                  95% CI : (0.9894, 0.9936)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9732          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.9575        0.98291            1.0000
## Specificity                      1.0000        0.99895            0.9612
## Pos Pred Value                   1.0000        0.96639            0.9911
## Neg Pred Value                   0.9922        0.99947            1.0000
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1497        0.02937            0.8123
## Detection Prevalence             0.1497        0.03040            0.8195
## Balanced Accuracy                0.9788        0.99093            0.9806
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.7500000
## Specificity                          1.0000000
## Pos Pred Value                       1.0000000
## Neg Pred Value                       0.9998722
## Prevalence                           0.0005109
## Detection Rate                       0.0003831
## Detection Prevalence                 0.0003831
## Balanced Accuracy                    0.8750000
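
Several classes above report NaN for Pos Pred Value. This is what results when a class is never predicted: the share of correct predictions for that class is then 0 divided by 0. A small base R sketch, using the "In Grace Period" counts from the random forest matrix (the variable names `tp`, `fp`, and `fn` are illustrative):

```r
# Counts for "In Grace Period" taken from the random forest matrix above.
tp <- 0   # predicted In Grace Period and truly In Grace Period
fp <- 0   # predicted In Grace Period but truly another class
fn <- 6   # truly In Grace Period, predicted as Current (5) or Fully Paid (1)

sensitivity <- tp / (tp + fn)   # 0 / 6
ppv         <- tp / (tp + fp)   # 0 / 0

sensitivity
ppv
is.nan(ppv)
```

Sensitivity is a well-defined 0 because the class does occur in the reference labels, while the positive predictive value is undefined because no prediction was ever made for it.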

Improving the Models

registerDoParallel(cores = availableCores())

kNN

scale_train <- loan_train %>%
  keep(is.numeric) %>%
  scale()

summary(scale_train[, 1:6])

##    loan_amnt        funded_amnt         int_rate         installment     
##  Min.   :-1.6955   Min.   :-1.6955   Min.   :-1.84005   Min.   :-1.7407  
##  1st Qu.:-0.7345   1st Qu.:-0.7345   1st Qu.:-0.69990   1st Qu.:-0.7220  
##  Median :-0.1339   Median :-0.1339   Median :-0.01672   Median :-0.2154  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.5868   3rd Qu.: 0.5868   3rd Qu.: 0.67103   3rd Qu.: 0.5625  
##  Max.   : 2.3887   Max.   : 2.3887   Max.   : 2.74340   Max.   : 3.8972  
##    annual_inc         issue_d       
##  Min.   :-1.1795   Min.   :-2.4821  
##  1st Qu.:-0.5072   1st Qu.:-0.8539  
##  Median :-0.2064   Median : 0.7742  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2536   3rd Qu.: 0.7742  
##  Max.   :79.3226   Max.   : 0.7742

scale_test <- loan_test %>%
  keep(is.numeric) %>%
  scale()

summary(scale_test[, 1:6])

##    loan_amnt        funded_amnt         int_rate          installment     
##  Min.   :-1.6848   Min.   :-1.6848   Min.   :-1.837539   Min.   :-1.7183  
##  1st Qu.:-0.7425   1st Qu.:-0.7425   1st Qu.:-0.694748   1st Qu.:-0.7339  
##  Median :-0.1562   Median :-0.1562   Median :-0.009989   Median :-0.2205  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.5887   3rd Qu.: 0.5887   3rd Qu.: 0.679351   3rd Qu.: 0.5661  
##  Max.   : 2.3835   Max.   : 2.3835   Max.   : 2.756529   Max.   : 3.6929  
##    annual_inc         issue_d       
##  Min.   :-1.3027   Min.   :-2.4542  
##  1st Qu.:-0.5423   1st Qu.:-0.8384  
##  Median :-0.2192   Median : 0.7774  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2750   3rd Qu.: 0.7774  
##  Max.   :35.6126   Max.   : 0.7774
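
In the code above, `scale()` standardizes the test set with its own means and standard deviations, so the two sets end up on slightly different scales (compare the `annual_inc` maxima). A common alternative, sketched here on toy data and not taken from the original analysis, reuses the training centers and scales for the test set. The data frames `toy_train` and `toy_test` are hypothetical stand-ins for the numeric columns of `loan_train` and `loan_test`.

```r
# Toy numeric data; toy_train and toy_test are hypothetical stand-ins
# for the numeric columns of loan_train and loan_test.
toy_train <- data.frame(loan_amnt = c(5000, 10000, 15000, 20000),
                        int_rate  = c(7.5, 11.0, 13.5, 18.0))
toy_test  <- data.frame(loan_amnt = c(8000, 30000),
                        int_rate  = c(9.0, 21.0))

# Standardize the training data and keep its centers and scales.
train_scaled <- scale(toy_train)
train_center <- attr(train_scaled, "scaled:center")
train_scale  <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test data.
test_scaled <- scale(toy_test, center = train_center, scale = train_scale)

test_scaled
```

With this approach a test value equal to the training mean maps to 0, which keeps distances comparable between training and test observations for a distance-based method such as k-NN.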

set.seed(652)
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 2,
                     selectionFunction = "best")
grid <- expand.grid(k = c(9, 111, 113))
knn_tuned <- train(loan_status ~ .,
                   norm_train %>% 
                     cbind(loan_status = loan_train$loan_status),
                   method = "knn",
                   metric = "Kappa",
                   trControl = ctrl,
                   tuneGrid = grid)
knn_tuned_pred <- predict(knn_tuned, norm_test)
confusionMatrix(knn_tuned_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off                916       1         81               0
##   Current                      1     155          5               4
##   Fully Paid                 307      78       6274               2
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  3
##   Fully Paid                         0                  1
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9381          
##                  95% CI : (0.9325, 0.9433)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7852          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.7484        0.66239            0.9865
## Specificity                      0.9876        0.99803            0.7361
## Pos Pred Value                   0.9178        0.91176            0.9418
## Neg Pred Value                   0.9549        0.98969            0.9264
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1170        0.01980            0.8013
## Detection Prevalence             0.1275        0.02171            0.8508
## Balanced Accuracy                0.8680        0.83021            0.8613
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          1.0000000
## Pos Pred Value                             NaN
## Neg Pred Value                       0.9994891
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0000000
## Balanced Accuracy                    0.5000000

Naive Bayes

set.seed(652)
naive_tuned <- naiveBayes(loan_train[-1], loan_train$loan_status, laplace = 1)
naive_tuned_pred <- predict(naive_tuned, loan_test[, -1])
confusionMatrix(naive_tuned_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off                130       0          0               0
##   Current                      1       1          6               0
##   Fully Paid                  51       1       2791               0
##   In Grace Period            285     101       1250               4
##   Late (16-30 days)          756     129       2288               2
##   Late (31-120 days)           1       2         25               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            0                  0
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  2                  4
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3739          
##                  95% CI : (0.3632, 0.3848)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1084          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.1062      0.0042735            0.4388
## Specificity                      1.0000      0.9990785            0.9646
## Pos Pred Value                   1.0000      0.1250000            0.9817
## Neg Pred Value                   0.8579      0.9702122            0.2843
## Prevalence                       0.1563      0.0298851            0.8123
## Detection Rate                   0.0166      0.0001277            0.3564
## Detection Prevalence             0.0166      0.0010217            0.3631
## Balanced Accuracy                0.5531      0.5016760            0.7017
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.6666667                1.0000000
## Specificity                       0.7908998                0.5938937
## Pos Pred Value                    0.0024390                0.0006287
## Neg Pred Value                    0.9996769                1.0000000
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0005109                0.0002554
## Detection Prevalence              0.2094508                0.4062580
## Balanced Accuracy                 0.7287832                0.7969469
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          0.9964222
## Pos Pred Value                       0.0000000
## Neg Pred Value                       0.9994873
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0035760
## Balanced Accuracy                    0.4982111

C5.0

set.seed(652)
grid <- expand.grid(trials = c(2, 4, 6, 8, 10),
                    model = "tree",
                    winnow = FALSE)

c50_tuned <- train(loan_status ~ .,
                   loan_train,
                   method = "C5.0",
                   metric = "Kappa",
                   trControl = ctrl,
                   tuneGrid = grid)
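# Note: predictions here are made on loan_train, so the statistics below
# measure fit on the training data rather than on the held-out loan_test.
c50_tuned_pred <- predict(c50_tuned, loan_train[-1])
confusionMatrix(c50_tuned_pred, loan_train$loan_status)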

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               3592       0          0               0
##   Current                      0     701          0               8
##   Fully Paid                  32       5      19118               0
##   In Grace Period              0       1          0               5
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            4                 10
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                 12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9974          
##                  95% CI : (0.9967, 0.9981)
##     No Information Rate : 0.8139          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9918          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.9912        0.99151            1.0000
## Specificity                      1.0000        0.99903            0.9915
## Pos Pred Value                   1.0000        0.96957            0.9981
## Neg Pred Value                   0.9984        0.99974            1.0000
## Prevalence                       0.1543        0.03010            0.8139
## Detection Rate                   0.1529        0.02985            0.8139
## Detection Prevalence             0.1529        0.03078            0.8155
## Balanced Accuracy                0.9956        0.99527            0.9958
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.3846154                0.0000000
## Specificity                       0.9999574                1.0000000
## Pos Pred Value                    0.8333333                      NaN
## Neg Pred Value                    0.9996593                0.9998297
## Prevalence                        0.0005535                0.0001703
## Detection Rate                    0.0002129                0.0000000
## Detection Prevalence              0.0002554                0.0000000
## Balanced Accuracy                 0.6922864                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.5454545
## Specificity                          1.0000000
## Pos Pred Value                       1.0000000
## Neg Pred Value                       0.9995740
## Prevalence                           0.0009366
## Detection Rate                       0.0005109
## Detection Prevalence                 0.0005109
## Balanced Accuracy                    0.7727273

Regression Tree

set.seed(652)
rpart_tuned <- train(loan_status ~ .,
                     loan_train %>%
                       select(loan_status, 
                              colnames(loan_train %>%
                                         keep(is.numeric))),
                     method = "rpart",
                     metric = "Kappa",
                     trControl = ctrl)
rpart_tuned_pred <- predict(rpart_tuned, 
                            loan_test[-1] %>%
                              keep(is.numeric))
confusionMatrix(rpart_tuned_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off                974       0          0               0
##   Current                      0     231          0               5
##   Fully Paid                 250       3       6360               1
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  4
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9662          
##                  95% CI : (0.9619, 0.9701)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8847          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.7958        0.98718            1.0000
## Specificity                      1.0000        0.99855            0.8272
## Pos Pred Value                   1.0000        0.95455            0.9616
## Neg Pred Value                   0.9635        0.99960            1.0000
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1244        0.02950            0.8123
## Detection Prevalence             0.1244        0.03091            0.8447
## Balanced Accuracy                0.8979        0.99287            0.9136
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          1.0000000
## Pos Pred Value                             NaN
## Neg Pred Value                       0.9994891
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0000000
## Balanced Accuracy                    0.5000000

Support Vector Machine

set.seed(652)
svm_tuned <- train(loan_status ~ .,
                   loan_train,
                   method = "svmLinear",
                   metric = "Kappa",
                   trControl = ctrl)
svm_tuned_pred <- predict(svm_tuned, loan_test[-1])
confusionMatrix(svm_tuned_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               1213       1          0               0
##   Current                      0     220          1               4
##   Fully Paid                  11       7       6359               1
##   In Grace Period              0       5          0               1
##   Late (16-30 days)            0       1          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                            2                  1
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  3
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9957         
##                  95% CI : (0.9939, 0.997)
##     No Information Rate : 0.8123         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9861         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.9910        0.94017            0.9998
## Specificity                      0.9998        0.99895            0.9871
## Pos Pred Value                   0.9992        0.96491            0.9970
## Neg Pred Value                   0.9983        0.99816            0.9993
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1549        0.02810            0.8121
## Detection Prevalence             0.1550        0.02912            0.8146
## Balanced Accuracy                0.9954        0.96956            0.9935
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.1666667                0.0000000
## Specificity                       0.9993609                0.9998723
## Pos Pred Value                    0.1666667                0.0000000
## Neg Pred Value                    0.9993609                0.9997445
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0001277                0.0000000
## Detection Prevalence              0.0007663                0.0001277
## Balanced Accuracy                 0.5830138                0.4999361
##                      Class: Late (31-120 days)
## Sensitivity                          0.7500000
## Specificity                          1.0000000
## Pos Pred Value                       1.0000000
## Neg Pred Value                       0.9998722
## Prevalence                           0.0005109
## Detection Rate                       0.0003831
## Detection Prevalence                 0.0003831
## Balanced Accuracy                    0.8750000

Random Forest

ctrl <- trainControl(method = "cv", number = 2)
# mtry is a count of predictors, so round the square root to a whole number.
grid <- expand.grid(mtry = round(sqrt(loan_dim$Number_Cols)))
set.seed(652)
rf_tuned <- train(loan_status ~ .,
                  loan_train,
                  method = "rf",
                  metric = "Kappa",
                  trControl = ctrl,
                  tuneGrid = grid)
rf_tuned_pred <- predict(rf_tuned, loan_test[-1])
confusionMatrix(rf_tuned_pred, loan_test$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Fully Paid In Grace Period
##   Charged Off               1153       0          0               0
##   Current                      0     230          0               5
##   Fully Paid                  71       4       6360               1
##   In Grace Period              0       0          0               0
##   Late (16-30 days)            0       0          0               0
##   Late (31-120 days)           0       0          0               0
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  1
##   Current                            2                  3
##   Fully Paid                         0                  0
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 0                  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9889          
##                  95% CI : (0.9863, 0.9911)
##     No Information Rate : 0.8123          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.964           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity                      0.9420        0.98291            1.0000
## Specificity                      0.9998        0.99868            0.9483
## Pos Pred Value                   0.9991        0.95833            0.9882
## Neg Pred Value                   0.9894        0.99947            1.0000
## Prevalence                       0.1563        0.02989            0.8123
## Detection Rate                   0.1473        0.02937            0.8123
## Detection Prevalence             0.1474        0.03065            0.8220
## Balanced Accuracy                0.9709        0.99079            0.9741
##                      Class: In Grace Period Class: Late (16-30 days)
## Sensitivity                       0.0000000                0.0000000
## Specificity                       1.0000000                1.0000000
## Pos Pred Value                          NaN                      NaN
## Neg Pred Value                    0.9992337                0.9997446
## Prevalence                        0.0007663                0.0002554
## Detection Rate                    0.0000000                0.0000000
## Detection Prevalence              0.0000000                0.0000000
## Balanced Accuracy                 0.5000000                0.5000000
##                      Class: Late (31-120 days)
## Sensitivity                          0.0000000
## Specificity                          1.0000000
## Pos Pred Value                             NaN
## Neg Pred Value                       0.9994891
## Prevalence                           0.0005109
## Detection Rate                       0.0000000
## Detection Prevalence                 0.0000000
## Balanced Accuracy                    0.5000000

Predicting loan_status for 2015

loans_2015 <- readRDS("accepted.Rds") %>%
  mutate(issue_d = year(parse_date(issue_d, "%b-%Y")),
         earliest_cr_line = year(parse_date(earliest_cr_line, "%b-%Y")),
         last_pymnt_d = year(parse_date(last_pymnt_d, "%b-%Y")),
         last_credit_pull_d = year(parse_date(last_credit_pull_d, "%b-%Y"))) %>% 
  filter(issue_d == 2015, purpose != "educational") %>%
  select(all_of(col_names)) %>%
  char_to_factor() %>%
  na.omit()

levels(loans_2015$purpose)

##  [1] "car"                "credit_card"        "debt_consolidation"
##  [4] "home_improvement"   "house"              "major_purchase"    
##  [7] "medical"            "moving"             "other"             
## [10] "renewable_energy"   "small_business"     "vacation"          
## [13] "wedding"

levels(loan_sample$purpose)

##  [1] "car"                "credit_card"        "debt_consolidation"
##  [4] "home_improvement"   "house"              "major_purchase"    
##  [7] "medical"            "moving"             "other"             
## [10] "renewable_energy"   "small_business"     "vacation"          
## [13] "wedding"
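
The two `levels()` listings above are compared by eye to confirm that the 2015 data contains no `purpose` level the model has not seen. The same check can be sketched in code. The factors `old_purpose` and `new_purpose` below are hypothetical stand-ins for `loan_sample$purpose` and `loans_2015$purpose`, with an unseen level added on purpose to show what a mismatch looks like.

```r
# Hypothetical factors standing in for loan_sample$purpose and
# loans_2015$purpose.
old_purpose <- factor(c("car", "house", "wedding"))
new_purpose <- factor(c("car", "educational", "house"))

# Levels present in the new data but unseen in the training data.
unseen <- setdiff(levels(new_purpose), levels(old_purpose))
unseen

# TRUE only when both factors have exactly the same levels in the same order.
same_levels <- identical(levels(new_purpose), levels(old_purpose))
same_levels
```

An empty `unseen` vector means every level in the new data was available during training, which is the situation the `filter(purpose != "educational")` step is meant to produce.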

set.seed(652)
ctrl <- trainControl(method = "cv", 
                     number = 10)
grid <- expand.grid(model = "tree",
                    trials = c(1, 5, 10))
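# C5.0() is called directly rather than through caret::train(), so no
# resampling or tuning grid is applied (ctrl and grid above are not used);
# the model is a single tree with the default trials = 1.
c50_2015 <- C5.0(loan_status ~ ., 
                 readRDS("accepted_2012_2014.Rds"))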
c50_2015_pred <- predict(c50_2015, loans_2015[-1])
confusionMatrix(c50_2015_pred, loans_2015$loan_status)

## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Charged Off Current Default Fully Paid In Grace Period
##   Charged Off              58502       0       0         68               0
##   Current                      0   33819       0          0             501
##   Default                      0       0       0          0               0
##   Fully Paid                 966     157       0     231036               7
##   In Grace Period              0       0       0          0               0
##   Late (16-30 days)            0       0       0          0               0
##   Late (31-120 days)           0      12       0          0               3
##                     Reference
## Prediction           Late (16-30 days) Late (31-120 days)
##   Charged Off                        0                  0
##   Current                          222                599
##   Default                            0                  0
##   Fully Paid                         1                  1
##   In Grace Period                    0                  0
##   Late (16-30 days)                  0                  0
##   Late (31-120 days)                 2                500
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9922          
##                  95% CI : (0.9919, 0.9925)
##     No Information Rate : 0.708           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9828          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Charged Off Class: Current Class: Default
## Sensitivity                      0.9838         0.9950             NA
## Specificity                      0.9997         0.9955              1
## Pos Pred Value                   0.9988         0.9624             NA
## Neg Pred Value                   0.9964         0.9994             NA
## Prevalence                       0.1822         0.1041              0
## Detection Rate                   0.1792         0.1036              0
## Detection Prevalence             0.1794         0.1077              0
## Balanced Accuracy                0.9918         0.9953             NA
##                      Class: Fully Paid Class: In Grace Period
## Sensitivity                     0.9997               0.000000
## Specificity                     0.9881               1.000000
## Pos Pred Value                  0.9951                    NaN
## Neg Pred Value                  0.9993               0.998434
## Prevalence                      0.7080               0.001566
## Detection Rate                  0.7078               0.000000
## Detection Prevalence            0.7113               0.000000
## Balanced Accuracy               0.9939               0.500000
##                      Class: Late (16-30 days) Class: Late (31-120 days)
## Sensitivity                         0.0000000                  0.454545
## Specificity                         1.0000000                  0.999948
## Pos Pred Value                            NaN                  0.967118
## Neg Pred Value                      0.9993107                  0.998159
## Prevalence                          0.0006893                  0.003370
## Detection Rate                      0.0000000                  0.001532
## Detection Prevalence                0.0000000                  0.001584
## Balanced Accuracy                   0.5000000                  0.727247
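
The Accuracy and Kappa values reported above can be recomputed directly from the confusion matrix. The sketch below uses base R only, with the counts copied from the output; the object names (`cm`, `po`, `pe`, `kappa_value`) are illustrative and not part of the original analysis. Kappa compares the observed agreement `po` with the agreement `pe` expected by chance from the row and column totals: Kappa = (po - pe) / (1 - pe).

```r
classes <- c("Charged Off", "Current", "Default", "Fully Paid",
             "In Grace Period", "Late (16-30 days)", "Late (31-120 days)")

# Rows are predictions, columns are reference classes (counts from the output above)
cm <- matrix(c(
  58502,     0, 0,      68,   0,   0,   0,
      0, 33819, 0,       0, 501, 222, 599,
      0,     0, 0,       0,   0,   0,   0,
    966,   157, 0,  231036,   7,   1,   1,
      0,     0, 0,       0,   0,   0,   0,
      0,     0, 0,       0,   0,   0,   0,
      0,    12, 0,       0,   3,   2, 500
), nrow = 7, byrow = TRUE,
   dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_value <- (po - pe) / (1 - pe)

# Per-class sensitivity: correct predictions divided by reference totals.
# A class with no reference cases (here "Default") gives 0/0, which is NaN in R.
sensitivity <- diag(cm) / colSums(cm)

print(round(c(Accuracy = po, Kappa = kappa_value), 4))
print(round(sensitivity, 4))
```

With these counts the sketch reproduces the reported Accuracy of 0.9922 and Kappa of 0.9828. It also shows why the high overall figures coexist with zero sensitivity for "In Grace Period" and "Late (16-30 days)": those classes are rare, so missing them entirely barely changes `po`.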

Classification of Loan Status for Lending Club Loan Data

Luis Magana

3/16/2020
