The purpose of this project was to use a subset of the Lending Club accepted loan application data to predict loan status. Multiple models were run and their performance was measured in terms of Kappa. The algorithms of interest were k-NN, naiveBayes, C5.0, rpart, svm, and randomForest. Each model was tuned in an attempt to improve performance and, ultimately, C5.0 emerged as the “best” model.
The data used is the Lending Club accepted loan application data from 2007 to 2018, although I was only interested in analyzing the years 2012 to 2015. The full dataset contains 2,260,701 observations and 151 features. The first few rows of the dataset already show many missing values, which could cause problems for the models.
Before removing any columns, features in the form “Month-Year” were reduced to only the year, which made it possible to filter by the years of interest. In this case, the feature that contains this information is issue_d. Then, I removed features with a high percentage of missing values, where high means greater than 50%. The remaining features still had some missing values but, given that the data is relatively large, I omitted rows with missing values as well. To prevent unreasonable runtimes in the algorithms, factor features with more than 50 levels, or with no more than one level, were removed along with identifying features. In total, this brought the dataset down to 319,773 observations with 79 features.
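The level-count filter described above can be sketched as follows. This is an illustration on a made-up data frame, not the code used in the analysis; the helper name `drop_unhelpful_factors`, its `max_levels` argument, and the toy columns are all hypothetical.

```r
# Hypothetical helper: drop factor columns with too many levels or with a
# single level; non-factor columns are always kept.
drop_unhelpful_factors <- function(df, max_levels = 50) {
  keep <- vapply(df, function(col) {
    if (!is.factor(col)) return(TRUE)
    n_levels <- nlevels(droplevels(col))  # count only levels that occur
    n_levels > 1 && n_levels <= max_levels
  }, logical(1))
  df[, keep, drop = FALSE]
}

# Made-up data: `policy` has one level, `title` has one level per row
toy <- data.frame(
  amount = c(1000, 2500, 4000, 5500),
  grade  = factor(c("A", "B", "A", "C")),
  policy = factor(rep("1", 4)),
  title  = factor(paste0("title_", 1:4))
)

# A small threshold is used here only so the toy example triggers the filter
filtered <- drop_unhelpful_factors(toy, max_levels = 3)
names(filtered)
```

With the threshold of 3 used here, only `amount` and `grade` survive; the analysis itself describes a threshold of 50 levels.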
Considering my computational limitations, I created a separate dataset containing only 10% of the data (31,997 observations). The smaller dataset was then split into training and test data at a 75%-25% ratio.
The first model trained was k-NN on only the normalized numeric features with k = 155, roughly the square root of the number of training observations (75% of 31,997). The evaluation metric used for all models was Kappa, which for this model was 0.6128. For naiveBayes, Kappa = 0.3051; C5.0, Kappa = 0.9766; rpart (run with only numeric features due to long runtime when including factors), Kappa = 0.9448; SVM, Kappa = 0.9397; and randomForest, Kappa = 0.9768.
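Kappa measures how far the observed agreement between predictions and true labels exceeds the agreement expected by chance. As a minimal sketch of that calculation (the helper `cohen_kappa` and the 2x2 counts below are made up for illustration; the analysis itself relies on caret to report Kappa):

```r
# Hypothetical helper: Cohen's kappa from a square confusion matrix
cohen_kappa <- function(tab) {
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                      # overall accuracy
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# Made-up counts: rows are predictions, columns are true labels
toy_tab <- matrix(c(40, 5, 10, 45), nrow = 2,
                  dimnames = list(predicted = c("paid", "charged_off"),
                                  actual    = c("paid", "charged_off")))

kappa_value <- cohen_kappa(toy_tab)
round(kappa_value, 2)
```

Here the observed agreement is 0.85 and the chance agreement is 0.50, so Kappa is 0.70. With a dominant class, as in this dataset, accuracy alone can look high even for a weak model, which is why Kappa is the more informative metric.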
In an attempt to improve the models, each was tuned with 10-fold cross-validation repeated twice, with two exceptions: naiveBayes, which would not run with parameter tuning, and randomForest, which was run with 2-fold cross-validation and mtry set to the square root of the number of features.
The trainControl function from the caret package was used to set the cross-validation and selection function. The selection function was set to “best” in order to choose the model with the best Kappa among those fitted during cross-validation. For k-NN, the data was standardized and the grid covered three values, k = 9, 111, and 113. This improved Kappa to 0.7673. naiveBayes did worse, with Kappa = 0.2202. The grid for C5.0 set trials to 2, 4, 6, 8, and 10, and its Kappa came out to 0.9957. The tuned rpart had Kappa = 0.8925, SVM had Kappa = 0.9774, and randomForest had Kappa = 0.9695.
In the end, I chose C5.0 to predict the loan_status for the year 2015 since it had the highest Kappa during cross-validation. The 2015 data was created in a similar way to the data between 2012 - 2014. This resulted in a dataset of 326,396 observations and 79 features. This time, C5.0 was run with 10-fold cross-validation and the number of trials set to 1, 5, and 10. The result was a Kappa of 0.9806 and an accuracy of 0.9912.
Although a high Kappa was achieved for most algorithms, they were not all tuned the same way. This was because some algorithms, such as randomForest, take an unreasonable amount of time to tune without adequate computational resources. Ideally, all algorithms would have been trained on the full 2012-2014 dataset and the best one picked to classify the 2015 data.
The data being used is from LendingClub.
library(pacman)
p_load(tidyverse, tibble, lubridate,
rpart, C50, randomForest, Amelia, naniar,
Boruta, caret, class, trelliscopejs, tictoc,
e1071, doParallel, purrr, neuralnet, kernlab,
mosaicCore, future, ROCR)
accepted <- read_csv("C:\\Users\\fa_na\\OneDrive\\Documents\\Mathematics\\Statistics\\Statistical Machine Learning\\Project\\lending-club\\accepted_2007_to_2018q4.csv\\accepted_2007_to_2018Q4.csv")
head(accepted)
## # A tibble: 6 x 151
## id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate
## <dbl> <lgl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6.84e7 NA 3600 3600 3600 36 m~ 14.0
## 2 6.84e7 NA 24700 24700 24700 36 m~ 12.0
## 3 6.83e7 NA 20000 20000 20000 60 m~ 10.8
## 4 6.63e7 NA 35000 35000 35000 60 m~ 14.8
## 5 6.85e7 NA 10400 10400 10400 60 m~ 22.4
## 6 6.84e7 NA 11950 11950 11950 36 m~ 13.4
## # ... with 144 more variables: installment <dbl>, grade <chr>, sub_grade <chr>,
## # emp_title <chr>, emp_length <chr>, home_ownership <chr>, annual_inc <dbl>,
## # verification_status <chr>, issue_d <chr>, loan_status <chr>,
## # pymnt_plan <chr>, url <chr>, desc <lgl>, purpose <chr>, title <chr>,
## # zip_code <chr>, addr_state <chr>, dti <dbl>, delinq_2yrs <dbl>,
## # earliest_cr_line <chr>, fico_range_low <dbl>, fico_range_high <dbl>,
## # inq_last_6mths <dbl>, mths_since_last_delinq <dbl>,
## # mths_since_last_record <dbl>, open_acc <dbl>, pub_rec <dbl>,
## # revol_bal <dbl>, revol_util <dbl>, total_acc <dbl>,
## # initial_list_status <chr>, out_prncp <dbl>, out_prncp_inv <dbl>,
## # total_pymnt <dbl>, total_pymnt_inv <dbl>, total_rec_prncp <dbl>,
## # total_rec_int <dbl>, total_rec_late_fee <dbl>, recoveries <dbl>,
## # collection_recovery_fee <dbl>, last_pymnt_d <chr>, last_pymnt_amnt <dbl>,
## # next_pymnt_d <chr>, last_credit_pull_d <chr>, last_fico_range_high <dbl>,
## # last_fico_range_low <dbl>, collections_12_mths_ex_med <dbl>,
## # mths_since_last_major_derog <dbl>, policy_code <dbl>,
## # application_type <chr>, annual_inc_joint <dbl>, dti_joint <dbl>,
## # verification_status_joint <chr>, acc_now_delinq <dbl>, tot_coll_amt <dbl>,
## # tot_cur_bal <dbl>, open_acc_6m <dbl>, open_act_il <dbl>, open_il_12m <dbl>,
## # open_il_24m <dbl>, mths_since_rcnt_il <dbl>, total_bal_il <dbl>,
## # il_util <dbl>, open_rv_12m <dbl>, open_rv_24m <dbl>, max_bal_bc <dbl>,
## # all_util <dbl>, total_rev_hi_lim <dbl>, inq_fi <dbl>, total_cu_tl <dbl>,
## # inq_last_12m <dbl>, acc_open_past_24mths <dbl>, avg_cur_bal <dbl>,
## # bc_open_to_buy <dbl>, bc_util <dbl>, chargeoff_within_12_mths <dbl>,
## # delinq_amnt <dbl>, mo_sin_old_il_acct <dbl>, mo_sin_old_rev_tl_op <dbl>,
## # mo_sin_rcnt_rev_tl_op <dbl>, mo_sin_rcnt_tl <dbl>, mort_acc <dbl>,
## # mths_since_recent_bc <dbl>, mths_since_recent_bc_dlq <dbl>,
## # mths_since_recent_inq <dbl>, mths_since_recent_revol_delinq <dbl>,
## # num_accts_ever_120_pd <dbl>, num_actv_bc_tl <dbl>, num_actv_rev_tl <dbl>,
## # num_bc_sats <dbl>, num_bc_tl <dbl>, num_il_tl <dbl>, num_op_rev_tl <dbl>,
## # num_rev_accts <dbl>, num_rev_tl_bal_gt_0 <dbl>, num_sats <dbl>,
## # num_tl_120dpd_2m <dbl>, num_tl_30dpd <dbl>, num_tl_90g_dpd_24m <dbl>,
## # num_tl_op_past_12m <dbl>, ...
dim(accepted)
## [1] 2260701 151
# Capturing only the year for "date" variables
# and filtering for data between 2012-2014
accepted_2012_2014 <- accepted %>%
mutate(issue_d = year(parse_date(issue_d, "%b-%Y")),
earliest_cr_line = year(parse_date(earliest_cr_line, "%b-%Y")),
last_pymnt_d = year(parse_date(last_pymnt_d, "%b-%Y")),
last_credit_pull_d = year(parse_date(last_credit_pull_d, "%b-%Y"))) %>%
filter(between(issue_d, 2012, 2014), !is.na(issue_d))
# Drop identifier, variables with high proportion of NA's,
# as well as factors with too many levels,
# and omitting rows with missing values
accepted_2012_2014 <- accepted_2012_2014 %>%
select(loan_status,
which(colMeans(is.na(accepted_2012_2014)) < 0.5),
-id, -url, -emp_title, -policy_code, -application_type,
-disbursement_method, -funded_amnt_inv,
-title, -out_prncp_inv, -pub_rec_bankruptcies,
-hardship_flag, -zip_code, -pymnt_plan, -addr_state) %>%
arrange(issue_d) %>%
na.omit()
saveRDS(accepted, "accepted.Rds")
remove(accepted)
# Converts every character column of a data frame to a factor
char_to_factor <- function(x) {
for (i in seq_len(ncol(x))) {
if (is.character(x[[i]]))
x[[i]] <- as.factor(x[[i]])
}
return(x)
}
# Changing character variables to factors
accepted_2012_2014 <- char_to_factor(accepted_2012_2014)
summary(accepted_2012_2014 %>%
keep(is.factor))
## loan_status term grade sub_grade
## Charged Off : 53840 36 months:224497 A:46201 B4 : 20887
## Current : 9631 60 months: 95276 B:92232 B3 : 20694
## Default : 1 C:88951 C3 : 18507
## Fully Paid :255784 D:55069 B2 : 18348
## In Grace Period : 171 E:25448 C1 : 18204
## Late (16-30 days) : 59 F: 9550 C2 : 18200
## Late (31-120 days): 287 G: 2322 (Other):204933
## emp_length home_ownership verification_status
## 10+ years:112924 ANY : 1 Not Verified : 96483
## 2 years : 28934 MORTGAGE:172753 Source Verified:113374
## 3 years : 25654 NONE : 29 Verified :109916
## < 1 year : 24303 OTHER : 36
## 5 years : 21070 OWN : 27181
## 1 year : 20359 RENT :119773
## (Other) : 86529
## purpose initial_list_status debt_settlement_flag
## debt_consolidation:195626 f:188008 N:314762
## credit_card : 73981 w:131765 Y: 5011
## home_improvement : 17562
## other : 13456
## major_purchase : 5279
## small_business : 3360
## (Other) : 10509
accepted_2012_2014 %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_trelliscope(~ key, scales = "free", path = "rmarkdown_files/trelli_one", self_contained = TRUE) +
geom_histogram()
accepted_2012_2014 %>%
keep(is.factor) %>%
gather() %>%
ggplot(aes(value)) +
facet_trelliscope(~ key, scales = "free", path = "rmarkdown_files/trelli_two", self_contained = TRUE) +
geom_bar() +
coord_flip()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## using data from the first layer
accepted_2012_2014 <- accepted_2012_2014 %>%
filter(last_fico_range_low != 0, last_fico_range_high != 0)
accepted_2012_2014 %>%
ggplot() +
geom_histogram(aes(last_fico_range_low), fill = "darkred")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

accepted_2012_2014 %>%
ggplot() +
geom_histogram(aes(last_fico_range_high), fill = "darkgreen")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

col_names <- colnames(accepted_2012_2014)
loan_dim <- tibble (
Number_Rows = nrow(accepted_2012_2014),
Number_Cols = ncol(accepted_2012_2014)
)
head(col_names)
## [1] "loan_status" "loan_amnt" "funded_amnt" "term" "int_rate"
## [6] "installment"
loan_dim
## # A tibble: 1 x 2
## Number_Rows Number_Cols
## <int> <int>
## 1 313184 79
set.seed(2)
rows <- sample(nrow(accepted_2012_2014), as.integer(0.1 * nrow(accepted_2012_2014)))
loan_sample <- accepted_2012_2014[rows, ]
tally(~ loan_status, loan_sample, "percent")
## loan_status
## Charged Off Current Default Fully Paid
## 15.47991570 3.00466186 0.00000000 81.35257679
## In Grace Period Late (16-30 days) Late (31-120 days)
## 0.06066799 0.01915831 0.08301935
tally(~ loan_status, accepted_2012_2014, "percent")
## loan_status
## Charged Off Current Default Fully Paid
## 1.565054e+01 3.072954e+00 3.193011e-04 8.111398e+01
## In Grace Period Late (16-30 days) Late (31-120 days)
## 5.460049e-02 1.883877e-02 8.876571e-02
saveRDS(accepted_2012_2014, "accepted_2012_2014.Rds")
remove(accepted_2012_2014)
set.seed(1)
rows <- sample(nrow(loan_sample), as.integer(0.75 * nrow(loan_sample)))
loan_train <- loan_sample[rows, ]
loan_test <- loan_sample[-rows, ]
tally(~ loan_status, loan_train, "percent")
## loan_status
## Charged Off Current Default Fully Paid
## 15.42915531 3.01004768 0.00000000 81.39475477
## In Grace Period Late (16-30 days) Late (31-120 days)
## 0.05534741 0.01702997 0.09366485
tally(~ loan_status, loan_test, "percent")
## loan_status
## Charged Off Current Default Fully Paid
## 15.63218391 2.98850575 0.00000000 81.22605364
## In Grace Period Late (16-30 days) Late (31-120 days)
## 0.07662835 0.02554278 0.05108557
# Dropping the level "Default" from training and test data
loan_train$loan_status <- droplevels(loan_train$loan_status, "Default")
loan_test$loan_status <- droplevels(loan_test$loan_status, "Default")
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
norm_train <- as.data.frame(lapply(loan_train %>%
keep(is.numeric),
normalize))
summary(norm_train[1:6])
## loan_amnt funded_amnt int_rate installment
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2353 1st Qu.:0.2353 1st Qu.:0.2488 1st Qu.:0.1807
## Median :0.3824 Median :0.3824 Median :0.3978 Median :0.2705
## Mean :0.4151 Mean :0.4151 Mean :0.4015 Mean :0.3087
## 3rd Qu.:0.5588 3rd Qu.:0.5588 3rd Qu.:0.5479 3rd Qu.:0.4085
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## annual_inc issue_d
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.008351 1st Qu.:0.5000
## Median :0.012088 Median :1.0000
## Mean :0.014651 Mean :0.7622
## 3rd Qu.:0.017802 3rd Qu.:1.0000
## Max. :1.000000 Max. :1.0000
norm_test <- as.data.frame(lapply(loan_test %>%
keep(is.numeric),
normalize))
summary(norm_test[1:6])
## loan_amnt funded_amnt int_rate installment
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2316 1st Qu.:0.2316 1st Qu.:0.2488 1st Qu.:0.1819
## Median :0.3757 Median :0.3757 Median :0.3978 Median :0.2768
## Mean :0.4141 Mean :0.4141 Mean :0.4000 Mean :0.3175
## 3rd Qu.:0.5588 3rd Qu.:0.5588 3rd Qu.:0.5479 3rd Qu.:0.4222
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## annual_inc issue_d
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.02060 1st Qu.:0.5000
## Median :0.02935 Median :1.0000
## Mean :0.03529 Mean :0.7595
## 3rd Qu.:0.04274 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
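Note that `normalize` above rescales the test set with its own minimum and maximum, so the same raw value can map to different scaled values in the training and test data (compare the `annual_inc` quartiles in the two summaries). A common alternative, sketched here as an assumption rather than what this analysis did, reuses the training range for the test set. The helper `normalize_with` and the income values are hypothetical.

```r
# Hypothetical helper: min-max rescale `x` using the range of a reference vector
normalize_with <- function(x, ref) {
  (x - min(ref)) / (max(ref) - min(ref))
}

# Made-up incomes for illustration
train_income <- c(20000, 50000, 80000, 120000)
test_income  <- c(35000, 100000)

# Both sets are scaled with the training minimum and maximum
train_scaled <- normalize_with(train_income, train_income)
test_scaled  <- normalize_with(test_income, train_income)

train_scaled
test_scaled
```

With this approach, test values outside the training range fall below 0 or above 1, which is expected and keeps the two sets on the same scale.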
set.seed(652)
knn_loan <- knn(norm_train, norm_test, loan_train$loan_status, 155)
set.seed(652)
naive_loan <- naiveBayes(loan_train[, -1], loan_train$loan_status)
naive_pred <- predict(naive_loan, loan_test[, -1])
set.seed(652)
c50_loan <- C5.0(loan_status ~ ., loan_train)
c50_pred <- predict(c50_loan, loan_test[-1])
set.seed(652)
rpart_loan <- rpart(loan_status ~ ., loan_train %>%
select(loan_status, colnames(loan_train %>%
keep(is.numeric))))
rpart_pred <- predict(rpart_loan,
loan_test[-1] %>%
keep(is.numeric),
"class")
set.seed(652)
svm_loan <- ksvm(loan_status ~ ., loan_train)
svm_pred <- predict(svm_loan, loan_test[-1], "response")
set.seed(652)
rf_loan <- randomForest(loan_status ~ ., loan_train)
rf_pred <- predict(rf_loan, loan_test[-1], "response")
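# Note: the arguments are swapped here relative to the later calls (caret expects
# predictions first, then the reference), so the rows labelled "Prediction" below
# hold the true labels. Accuracy and Kappa are unaffected; the per-class
# statistics and the no-information rate are.
confusionMatrix(loan_test$loan_status, knn_loan)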
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 763 0 461 0
## Current 0 62 172 0
## Fully Paid 23 0 6337 0
## In Grace Period 0 1 5 0
## Late (16-30 days) 0 0 2 0
## Late (31-120 days) 0 2 2 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 0 0
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.9147
## 95% CI : (0.9083, 0.9208)
## No Information Rate : 0.8913
## P-Value [Acc > NIR] : 3.653e-12
##
## Kappa : 0.672
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.97074 0.953846 0.9080
## Specificity 0.93455 0.977849 0.9730
## Pos Pred Value 0.62337 0.264957 0.9964
## Neg Pred Value 0.99652 0.999605 0.5633
## Prevalence 0.10038 0.008301 0.8913
## Detection Rate 0.09745 0.007918 0.8093
## Detection Prevalence 0.15632 0.029885 0.8123
## Balanced Accuracy 0.95265 0.965848 0.9405
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity NA NA
## Specificity 0.9992337 0.9997446
## Pos Pred Value NA NA
## Neg Pred Value NA NA
## Prevalence 0.0000000 0.0000000
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0007663 0.0002554
## Balanced Accuracy NA NA
## Class: Late (31-120 days)
## Sensitivity NA
## Specificity 0.9994891
## Pos Pred Value NA
## Neg Pred Value NA
## Prevalence 0.0000000
## Detection Rate 0.0000000
## Detection Prevalence 0.0005109
## Balanced Accuracy NA
confusionMatrix(naive_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 263 0 0 0
## Current 1 1 6 0
## Fully Paid 110 1 3528 0
## In Grace Period 336 132 1362 4
## Late (16-30 days) 513 99 1441 2
## Late (31-120 days) 1 1 23 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 0 0
## Fully Paid 0 0
## In Grace Period 1 1
## Late (16-30 days) 1 3
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.4849
## 95% CI : (0.4738, 0.4961)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1652
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.21487 0.0042735 0.5547
## Specificity 1.00000 0.9990785 0.9245
## Pos Pred Value 1.00000 0.1250000 0.9695
## Neg Pred Value 0.87300 0.9702122 0.3243
## Prevalence 0.15632 0.0298851 0.8123
## Detection Rate 0.03359 0.0001277 0.4506
## Detection Prevalence 0.03359 0.0010217 0.4648
## Balanced Accuracy 0.60743 0.5016760 0.7396
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.6666667 0.5000000
## Specificity 0.7658487 0.7370976
## Pos Pred Value 0.0021786 0.0004857
## Neg Pred Value 0.9996663 0.9998267
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0005109 0.0001277
## Detection Prevalence 0.2344828 0.2629630
## Balanced Accuracy 0.7162577 0.6185488
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 0.9968055
## Pos Pred Value 0.0000000
## Neg Pred Value 0.9994875
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0031928
## Balanced Accuracy 0.4984028
confusionMatrix(c50_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 1170 0 12 0
## Current 0 231 0 5
## Fully Paid 54 3 6348 1
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 1
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 3
##
## Overall Statistics
##
## Accuracy : 0.99
## 95% CI : (0.9876, 0.9921)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.968
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.9559 0.98718 0.9981
## Specificity 0.9982 0.99895 0.9605
## Pos Pred Value 0.9898 0.96653 0.9909
## Neg Pred Value 0.9919 0.99960 0.9916
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1494 0.02950 0.8107
## Detection Prevalence 0.1510 0.03052 0.8181
## Balanced Accuracy 0.9770 0.99306 0.9793
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.7500000
## Specificity 1.0000000
## Pos Pred Value 1.0000000
## Neg Pred Value 0.9998722
## Prevalence 0.0005109
## Detection Rate 0.0003831
## Detection Prevalence 0.0003831
## Balanced Accuracy 0.8750000
confusionMatrix(rpart_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 1021 0 0 0
## Current 0 231 0 5
## Fully Paid 203 3 6360 1
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 4
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.9722
## 95% CI : (0.9683, 0.9757)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9064
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.8342 0.98718 1.0000
## Specificity 1.0000 0.99855 0.8592
## Pos Pred Value 1.0000 0.95455 0.9685
## Neg Pred Value 0.9702 0.99960 1.0000
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1304 0.02950 0.8123
## Detection Prevalence 0.1304 0.03091 0.8387
## Balanced Accuracy 0.9171 0.99287 0.9296
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9994891
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
confusionMatrix(svm_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 1110 0 0 0
## Current 0 211 0 4
## Fully Paid 114 23 6360 2
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 4
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.981
## 95% CI : (0.9777, 0.9839)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9372
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.9069 0.90171 1.0000
## Specificity 1.0000 0.99868 0.9054
## Pos Pred Value 1.0000 0.95475 0.9786
## Neg Pred Value 0.9830 0.99698 1.0000
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1418 0.02695 0.8123
## Detection Prevalence 0.1418 0.02822 0.8300
## Balanced Accuracy 0.9534 0.95020 0.9527
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9994891
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
confusionMatrix(rf_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 1172 0 0 0
## Current 0 230 0 5
## Fully Paid 52 4 6360 1
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 1
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 3
##
## Overall Statistics
##
## Accuracy : 0.9917
## 95% CI : (0.9894, 0.9936)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9732
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.9575 0.98291 1.0000
## Specificity 1.0000 0.99895 0.9612
## Pos Pred Value 1.0000 0.96639 0.9911
## Neg Pred Value 0.9922 0.99947 1.0000
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1497 0.02937 0.8123
## Detection Prevalence 0.1497 0.03040 0.8195
## Balanced Accuracy 0.9788 0.99093 0.9806
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.7500000
## Specificity 1.0000000
## Pos Pred Value 1.0000000
## Neg Pred Value 0.9998722
## Prevalence 0.0005109
## Detection Rate 0.0003831
## Detection Prevalence 0.0003831
## Balanced Accuracy 0.8750000
registerDoParallel(cores = availableCores())
scale_train <- loan_train %>%
keep(is.numeric) %>%
scale()
summary(scale_train[, 1:6])
## loan_amnt funded_amnt int_rate installment
## Min. :-1.6955 Min. :-1.6955 Min. :-1.84005 Min. :-1.7407
## 1st Qu.:-0.7345 1st Qu.:-0.7345 1st Qu.:-0.69990 1st Qu.:-0.7220
## Median :-0.1339 Median :-0.1339 Median :-0.01672 Median :-0.2154
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.5868 3rd Qu.: 0.5868 3rd Qu.: 0.67103 3rd Qu.: 0.5625
## Max. : 2.3887 Max. : 2.3887 Max. : 2.74340 Max. : 3.8972
## annual_inc issue_d
## Min. :-1.1795 Min. :-2.4821
## 1st Qu.:-0.5072 1st Qu.:-0.8539
## Median :-0.2064 Median : 0.7742
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2536 3rd Qu.: 0.7742
## Max. :79.3226 Max. : 0.7742
scale_test <- loan_test %>%
keep(is.numeric) %>%
scale()
summary(scale_test[, 1:6])
## loan_amnt funded_amnt int_rate installment
## Min. :-1.6848 Min. :-1.6848 Min. :-1.837539 Min. :-1.7183
## 1st Qu.:-0.7425 1st Qu.:-0.7425 1st Qu.:-0.694748 1st Qu.:-0.7339
## Median :-0.1562 Median :-0.1562 Median :-0.009989 Median :-0.2205
## Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 0.5887 3rd Qu.: 0.5887 3rd Qu.: 0.679351 3rd Qu.: 0.5661
## Max. : 2.3835 Max. : 2.3835 Max. : 2.756529 Max. : 3.6929
## annual_inc issue_d
## Min. :-1.3027 Min. :-2.4542
## 1st Qu.:-0.5423 1st Qu.:-0.8384
## Median :-0.2192 Median : 0.7774
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2750 3rd Qu.: 0.7774
## Max. :35.6126 Max. : 0.7774
set.seed(652)
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 2,
selectionFunction = "best")
grid <- expand.grid(k = c(9, 111, 113))
knn_tuned <- train(loan_status ~ .,
norm_train %>%
cbind(loan_status = loan_train$loan_status),
method = "knn",
metric = "Kappa",
trControl = ctrl,
tuneGrid = grid)
knn_tuned_pred <- predict(knn_tuned, norm_test)
confusionMatrix(knn_tuned_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 916 1 81 0
## Current 1 155 5 4
## Fully Paid 307 78 6274 2
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 3
## Fully Paid 0 1
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.9381
## 95% CI : (0.9325, 0.9433)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7852
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.7484 0.66239 0.9865
## Specificity 0.9876 0.99803 0.7361
## Pos Pred Value 0.9178 0.91176 0.9418
## Neg Pred Value 0.9549 0.98969 0.9264
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1170 0.01980 0.8013
## Detection Prevalence 0.1275 0.02171 0.8508
## Balanced Accuracy 0.8680 0.83021 0.8613
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9994891
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
set.seed(652)
naive_tuned <- naiveBayes(loan_train[-1], loan_train$loan_status, laplace = 1)
naive_tuned_pred <- predict(naive_tuned, loan_test[, -1])
confusionMatrix(naive_tuned_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 130 0 0 0
## Current 1 1 6 0
## Fully Paid 51 1 2791 0
## In Grace Period 285 101 1250 4
## Late (16-30 days) 756 129 2288 2
## Late (31-120 days) 1 2 25 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 0 0
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 2 4
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.3739
## 95% CI : (0.3632, 0.3848)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1084
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.1062 0.0042735 0.4388
## Specificity 1.0000 0.9990785 0.9646
## Pos Pred Value 1.0000 0.1250000 0.9817
## Neg Pred Value 0.8579 0.9702122 0.2843
## Prevalence 0.1563 0.0298851 0.8123
## Detection Rate 0.0166 0.0001277 0.3564
## Detection Prevalence 0.0166 0.0010217 0.3631
## Balanced Accuracy 0.5531 0.5016760 0.7017
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.6666667 1.0000000
## Specificity 0.7908998 0.5938937
## Pos Pred Value 0.0024390 0.0006287
## Neg Pred Value 0.9996769 1.0000000
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0005109 0.0002554
## Detection Prevalence 0.2094508 0.4062580
## Balanced Accuracy 0.7287832 0.7969469
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 0.9964222
## Pos Pred Value 0.0000000
## Neg Pred Value 0.9994873
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0035760
## Balanced Accuracy 0.4982111
set.seed(652)
grid <- expand.grid(trials = c(2, 4, 6, 8, 10),
model = "tree",
winnow = FALSE)
c50_tuned <- train(loan_status ~ .,
loan_train,
method = "C5.0",
metric = "Kappa",
trControl = ctrl,
tuneGrid = grid)
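# Note: this evaluates the tuned model on the training data (loan_train), unlike
# the other models, which are evaluated on loan_test
c50_tuned_pred <- predict(c50_tuned, loan_train[-1])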
confusionMatrix(c50_tuned_pred, loan_train$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 3592 0 0 0
## Current 0 701 0 8
## Fully Paid 32 5 19118 0
## In Grace Period 0 1 0 5
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 4 10
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 12
##
## Overall Statistics
##
## Accuracy : 0.9974
## 95% CI : (0.9967, 0.9981)
## No Information Rate : 0.8139
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9918
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.9912 0.99151 1.0000
## Specificity 1.0000 0.99903 0.9915
## Pos Pred Value 1.0000 0.96957 0.9981
## Neg Pred Value 0.9984 0.99974 1.0000
## Prevalence 0.1543 0.03010 0.8139
## Detection Rate 0.1529 0.02985 0.8139
## Detection Prevalence 0.1529 0.03078 0.8155
## Balanced Accuracy 0.9956 0.99527 0.9958
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.3846154 0.0000000
## Specificity 0.9999574 1.0000000
## Pos Pred Value 0.8333333 NaN
## Neg Pred Value 0.9996593 0.9998297
## Prevalence 0.0005535 0.0001703
## Detection Rate 0.0002129 0.0000000
## Detection Prevalence 0.0002554 0.0000000
## Balanced Accuracy 0.6922864 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.5454545
## Specificity 1.0000000
## Pos Pred Value 1.0000000
## Neg Pred Value 0.9995740
## Prevalence 0.0009366
## Detection Rate 0.0005109
## Detection Prevalence 0.0005109
## Balanced Accuracy 0.7727273
set.seed(652)
rpart_tuned <- train(loan_status ~ .,
                     loan_train %>%
                       select(loan_status,
                              colnames(loan_train %>%
                                         keep(is.numeric))),
                     method = "rpart",
                     metric = "Kappa",
                     trControl = ctrl)
rpart_tuned_pred <- predict(rpart_tuned,
                            loan_test[-1] %>%
                              keep(is.numeric))
confusionMatrix(rpart_tuned_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 974 0 0 0
## Current 0 231 0 5
## Fully Paid 250 3 6360 1
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 4
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.9662
## 95% CI : (0.9619, 0.9701)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8847
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.7958 0.98718 1.0000
## Specificity 1.0000 0.99855 0.8272
## Pos Pred Value 1.0000 0.95455 0.9616
## Neg Pred Value 0.9635 0.99960 1.0000
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1244 0.02950 0.8123
## Detection Prevalence 0.1244 0.03091 0.8447
## Balanced Accuracy 0.8979 0.99287 0.9136
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9994891
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
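The Kappa reported above can be reproduced by hand from the printed confusion matrix, which helps show why it sits well below the accuracy when one class dominates. The following is a minimal sketch in base R; `cohen_kappa()` is a hypothetical helper written for illustration, not a caret function, and the matrix is typed in from the rpart output above.

```r
# Confusion matrix copied from the rpart output above
# (rows = predictions, columns = reference labels).
classes <- c("Charged Off", "Current", "Fully Paid",
             "In Grace Period", "Late (16-30 days)", "Late (31-120 days)")
cm_rpart <- matrix(c(974,   0,    0, 0, 0, 0,
                       0, 231,    0, 5, 2, 4,
                     250,   3, 6360, 1, 0, 0,
                       0,   0,    0, 0, 0, 0,
                       0,   0,    0, 0, 0, 0,
                       0,   0,    0, 0, 0, 0),
                   nrow = 6, byrow = TRUE,
                   dimnames = list(Prediction = classes, Reference = classes))

# Hypothetical helper: Cohen's kappa from a square confusion matrix.
cohen_kappa <- function(m) {
  n <- sum(m)
  observed <- sum(diag(m)) / n                    # overall accuracy
  expected <- sum(rowSums(m) * colSums(m)) / n^2  # agreement expected by chance
  (observed - expected) / (1 - expected)
}

accuracy_rpart <- sum(diag(cm_rpart)) / sum(cm_rpart)
kappa_rpart <- cohen_kappa(cm_rpart)
round(c(Accuracy = accuracy_rpart, Kappa = kappa_rpart), 4)
# Accuracy    Kappa
#   0.9662   0.8847
```

Because about 81% of the loans are Fully Paid, a large share of the agreement is expected by chance, which is what Kappa discounts.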
set.seed(652)
svm_tuned <- train(loan_status ~ .,
                   loan_train,
                   method = "svmLinear",
                   metric = "Kappa",
                   trControl = ctrl)
svm_tuned_pred <- predict(svm_tuned, loan_test[-1])
confusionMatrix(svm_tuned_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 1213 1 0 0
## Current 0 220 1 4
## Fully Paid 11 7 6359 1
## In Grace Period 0 5 0 1
## Late (16-30 days) 0 1 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 2 1
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 3
##
## Overall Statistics
##
## Accuracy : 0.9957
## 95% CI : (0.9939, 0.997)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9861
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.9910 0.94017 0.9998
## Specificity 0.9998 0.99895 0.9871
## Pos Pred Value 0.9992 0.96491 0.9970
## Neg Pred Value 0.9983 0.99816 0.9993
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1549 0.02810 0.8121
## Detection Prevalence 0.1550 0.02912 0.8146
## Balanced Accuracy 0.9954 0.96956 0.9935
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.1666667 0.0000000
## Specificity 0.9993609 0.9998723
## Pos Pred Value 0.1666667 0.0000000
## Neg Pred Value 0.9993609 0.9997445
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0001277 0.0000000
## Detection Prevalence 0.0007663 0.0001277
## Balanced Accuracy 0.5830138 0.4999361
## Class: Late (31-120 days)
## Sensitivity 0.7500000
## Specificity 1.0000000
## Pos Pred Value 1.0000000
## Neg Pred Value 0.9998722
## Prevalence 0.0005109
## Detection Rate 0.0003831
## Detection Prevalence 0.0003831
## Balanced Accuracy 0.8750000
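The per-class statistics above follow from treating each class one-versus-rest. As a rough sketch (base R only, with the matrix typed in from the SVM output above and variable names chosen for illustration), sensitivity, specificity, and balanced accuracy can be recomputed like this:

```r
# Confusion matrix copied from the SVM output above
# (rows = predictions, columns = reference labels).
classes <- c("Charged Off", "Current", "Fully Paid",
             "In Grace Period", "Late (16-30 days)", "Late (31-120 days)")
cm_svm <- matrix(c(1213,   1,    0, 0, 0, 0,
                      0, 220,    1, 4, 2, 1,
                     11,   7, 6359, 1, 0, 0,
                      0,   5,    0, 1, 0, 0,
                      0,   1,    0, 0, 0, 0,
                      0,   0,    0, 0, 0, 3),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(Prediction = classes, Reference = classes))

true_pos  <- setNames(diag(cm_svm), classes)
false_neg <- colSums(cm_svm) - true_pos  # members of the class that were missed
false_pos <- rowSums(cm_svm) - true_pos  # other classes predicted as this class
true_neg  <- sum(cm_svm) - true_pos - false_neg - false_pos

sensitivity <- true_pos / (true_pos + false_neg)
specificity <- true_neg / (true_neg + false_pos)
balanced_accuracy <- (sensitivity + specificity) / 2

round(cbind(sensitivity, specificity, balanced_accuracy), 4)
```

With only 2 to 6 test observations in each of the grace-period and late classes, a single misclassification moves their sensitivity substantially, so those rows are far less stable than the overall Kappa.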
ctrl <- trainControl(method = "cv", number = 2)
# mtry is a count of predictors, so pass a whole number
grid <- expand.grid(mtry = round(sqrt(loan_dim$Number_Cols)))
set.seed(652)
rf_tuned <- train(loan_status ~ .,
                  loan_train,
                  method = "rf",
                  metric = "Kappa",
                  trControl = ctrl,
                  tuneGrid = grid)
rf_tuned_pred <- predict(rf_tuned, loan_test[-1])
confusionMatrix(rf_tuned_pred, loan_test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Fully Paid In Grace Period
## Charged Off 1153 0 0 0
## Current 0 230 0 5
## Fully Paid 71 4 6360 1
## In Grace Period 0 0 0 0
## Late (16-30 days) 0 0 0 0
## Late (31-120 days) 0 0 0 0
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 1
## Current 2 3
## Fully Paid 0 0
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 0 0
##
## Overall Statistics
##
## Accuracy : 0.9889
## 95% CI : (0.9863, 0.9911)
## No Information Rate : 0.8123
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.964
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Fully Paid
## Sensitivity 0.9420 0.98291 1.0000
## Specificity 0.9998 0.99868 0.9483
## Pos Pred Value 0.9991 0.95833 0.9882
## Neg Pred Value 0.9894 0.99947 1.0000
## Prevalence 0.1563 0.02989 0.8123
## Detection Rate 0.1473 0.02937 0.8123
## Detection Prevalence 0.1474 0.03065 0.8220
## Balanced Accuracy 0.9709 0.99079 0.9741
## Class: In Grace Period Class: Late (16-30 days)
## Sensitivity 0.0000000 0.0000000
## Specificity 1.0000000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.9992337 0.9997446
## Prevalence 0.0007663 0.0002554
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0000000 0.0000000
## Balanced Accuracy 0.5000000 0.5000000
## Class: Late (31-120 days)
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9994891
## Prevalence 0.0005109
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
loans_2015 <- readRDS("accepted.Rds") %>%
  mutate(issue_d = year(parse_date(issue_d, "%b-%Y")),
         earliest_cr_line = year(parse_date(earliest_cr_line, "%b-%Y")),
         last_pymnt_d = year(parse_date(last_pymnt_d, "%b-%Y")),
         last_credit_pull_d = year(parse_date(last_credit_pull_d, "%b-%Y"))) %>%
  filter(issue_d == 2015, purpose != "educational") %>%
  select(all_of(col_names)) %>%
  char_to_factor() %>%
  na.omit()
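The date columns are stored as "Mon-YYYY" strings and only the year is kept. As an illustration of that step, here is a base R alternative (not the readr/lubridate approach used in the chunk above) that avoids locale-dependent month parsing; the sample values are made up for the example.

```r
# Hypothetical sample values in the "Mon-YYYY" format.
issue_d <- c("Dec-2015", "Jan-2012", NA, "Aug-2014")

# Strip the three-letter month and the hyphen, keeping the four-digit year.
issue_year <- as.integer(sub("^[A-Za-z]{3}-", "", issue_d))
issue_year

# Rows issued in 2015; missing dates are treated as not matching.
is_2015 <- !is.na(issue_year) & issue_year == 2015
```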
levels(loans_2015$purpose)
## [1] "car" "credit_card" "debt_consolidation"
## [4] "home_improvement" "house" "major_purchase"
## [7] "medical" "moving" "other"
## [10] "renewable_energy" "small_business" "vacation"
## [13] "wedding"
levels(loan_sample$purpose)
## [1] "car" "credit_card" "debt_consolidation"
## [4] "home_improvement" "house" "major_purchase"
## [7] "medical" "moving" "other"
## [10] "renewable_energy" "small_business" "vacation"
## [13] "wedding"
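Comparing the two `levels()` listings by eye works for 13 levels, but the check can also be done programmatically. The sketch below uses small made-up factors (not the loan data) to show one way of finding levels that the training data never saw and re-declaring the new factor with the training levels.

```r
# Hypothetical toy factors standing in for the training and new data.
train_purpose <- factor(c("car", "house", "wedding", "car"))
new_purpose   <- factor(c("car", "educational", "house", "wedding"))

# Levels present in the new data but unseen during training.
unseen <- setdiff(levels(new_purpose), levels(train_purpose))
unseen

# Drop rows with unseen levels, then rebuild the factor on the training levels.
keep_rows <- !(new_purpose %in% unseen)
aligned <- factor(as.character(new_purpose[keep_rows]),
                  levels = levels(train_purpose))

identical(levels(aligned), levels(train_purpose))
```

Tree-based models generally cannot score a factor level they were not trained on, which is consistent with filtering out `purpose == "educational"` before predicting.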
set.seed(652)
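ctrl <- trainControl(method = "cv",
                     number = 10)
grid <- expand.grid(model = "tree",
                    trials = c(1, 5, 10))
# C5.0() fits a single model directly. metric, trControl, and tuneGrid are
# caret::train() arguments rather than C5.0() parameters, so they are dropped
# here. Tuning with ctrl and grid would require train(..., method = "C5.0")
# and a winnow column in grid.
c50_2015 <- C5.0(loan_status ~ .,
                 data = readRDS("accepted_2012_2014.Rds"))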
c50_2015_pred <- predict(c50_2015, loans_2015[-1])
confusionMatrix(c50_2015_pred, loans_2015$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Current Default Fully Paid In Grace Period
## Charged Off 58502 0 0 68 0
## Current 0 33819 0 0 501
## Default 0 0 0 0 0
## Fully Paid 966 157 0 231036 7
## In Grace Period 0 0 0 0 0
## Late (16-30 days) 0 0 0 0 0
## Late (31-120 days) 0 12 0 0 3
## Reference
## Prediction Late (16-30 days) Late (31-120 days)
## Charged Off 0 0
## Current 222 599
## Default 0 0
## Fully Paid 1 1
## In Grace Period 0 0
## Late (16-30 days) 0 0
## Late (31-120 days) 2 500
##
## Overall Statistics
##
## Accuracy : 0.9922
## 95% CI : (0.9919, 0.9925)
## No Information Rate : 0.708
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9828
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Charged Off Class: Current Class: Default
## Sensitivity 0.9838 0.9950 NA
## Specificity 0.9997 0.9955 1
## Pos Pred Value 0.9988 0.9624 NA
## Neg Pred Value 0.9964 0.9994 NA
## Prevalence 0.1822 0.1041 0
## Detection Rate 0.1792 0.1036 0
## Detection Prevalence 0.1794 0.1077 0
## Balanced Accuracy 0.9918 0.9953 NA
## Class: Fully Paid Class: In Grace Period
## Sensitivity 0.9997 0.000000
## Specificity 0.9881 1.000000
## Pos Pred Value 0.9951 NaN
## Neg Pred Value 0.9993 0.998434
## Prevalence 0.7080 0.001566
## Detection Rate 0.7078 0.000000
## Detection Prevalence 0.7113 0.000000
## Balanced Accuracy 0.9939 0.500000
## Class: Late (16-30 days) Class: Late (31-120 days)
## Sensitivity 0.0000000 0.454545
## Specificity 1.0000000 0.999948
## Pos Pred Value NaN 0.967118
## Neg Pred Value 0.9993107 0.998159
## Prevalence 0.0006893 0.003370
## Detection Rate 0.0000000 0.001532
## Detection Prevalence 0.0000000 0.001584
## Balanced Accuracy 0.5000000 0.727247