Task 1: We aim at finding a regression for estimating the interest rate applied to a particular lending request.
Solution: We used four linear model (LM) approaches and got the following results:
(regression_results <- read.csv("data/regression_methods_results.csv") %>% select(Model = model, Test_MSE, R2))
According to our results the linear regression model approach with best subset selection has the smallest test MSE (1.372) overall. The linear regression model with leave one out cross validation is nearly as good with a test MSE of 1.375. The ridge and the lasso regression perform noticeable worse (test MSE of 1.708 and 2.064). Details to the winnig model with best subset selection can be found below.
| Question | Answer |
|---|---|
| 1. Is at least one of the predictores X1, X2, …, Xp useful in predicting the response? | First, one could reduce the possible predictors in the data preprocessing part. One could remove 126 out of 147 possible predictors, because they would not explain the response at all (93 predictors with more than 5% missing values or just one level) or there were too many levels (more than 10). Some of the few 126 removed predictors were a human decision, especially if the predictors were not understable in regard to the response. Back to the original question: Yes, there were 31 predictors useful in predicting the response. Due to matrix transformation there are a lot of new dummy variables out of the 21 remaining predictors, for example grade, home_ownership or loan_status. |
| 2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? | No, only 31 of the preditors help to explain the response. One could try to only take 23 predictors because one can reduce most of the test error with them. |
| 3. How well does the model fit the data? How accurate is the prediction? | Quite well. The remaining test error (MSE) is 1.372. |
Task 2: We would like to find a classification model for the default status.
We performed the following methods and got the these results:
classification_results <- read.csv("data/classification_methods_results.csv")
classification_results %>% rename(Model = X) %>% arrange(Class.Defaulted.Test.Error)
If the objective is to reduce overall test error then the PCA in combination with a logistic regression would be the most appropriate method. If however the class error of the defaulted loans is the objective then a simple pruned tree model might be more appropriate. In addition, we should note that the PCA is an unsupervised method and is therefore inadequate for explaining models. Bottom line: The tree model offers the best solution that is accurate, explainable and understandable.
# load remaining packages
library(DataExplorer) # https://github.com/boxuancui/DataExplorer
library(glmnet)
library(janitor) # https://github.com/sfirke/janitor
library(plotly)
library(GGally)
library(corrplot)
library(PerformanceAnalytics)
library(caret)
library(leaps)
library(knitr)
library(pls)
library(ISLR)
library(dplyr)
library(randomForest)
library(tree)
library(gdata)
library(class)
# set printing preferences
options(scipen = 99) # penalty for displaying scientific notation
options(digits = 4) # suggested number of digits to display
The following code loads the data from Kaggle, filters the data to the date range from 2007 to 2015 and selects the subset for group 3 according to the lecturer instructions. One can skip the following part because our team provids the prepared data in the next section.
# preparation step: loading data from https://www.kaggle.com/wendykan/lending-club-loan-data
# load data
data_temp <- read_csv("data/loan.csv")
# select years 2007 - 2015
data_temp %<>%
mutate(year_temp = as.integer(str_sub(issue_d,-4,-1))) %>%
filter(year_temp >= 2007,
year_temp <= 2015)
# subset our group subset
# d <- data_temp[which(data_temp$id%%8+1==6),] # not working: no id values
data_temp_2 <- cbind(id_2 = rownames(data_temp), data_temp)
rownames(data_temp_2) <- 1:nrow(data_temp_2)
data_temp_2$id_2 <- as.numeric(data_temp_2$id_2)
dataset_3 <- data_temp_2[which(data_temp_2$id_2%%8+1==3),]
# we continue working with d for easier coding reasons
d <- dataset_3
# transform data frame into much nicer tibble format for working with
# huge datasets
d <- as_tibble(d)
# remove big data file for memory reasons
rm(data_temp)
rm(data_temp_2)
rm(dataset_3)
# write group 3 data
path_file <- "./data/data_group_3.csv"
if (!file.exists(path_file)) {
write_csv(d, path = "data/data_group_3.csv")
}
One can load the necessary data for the regression and classification in the following section.
# import already generated file for group 3
d <- read_csv("data/data_group_3.csv") #, n_max = 2000)
d <- as_tibble(d)
compare_df_cols(d) %>% arrange(d, column_name) %>% count(d) %>% arrange(desc(n))
d %>% remove_empty(c("rows", "cols")) # shows 14 empty columns
d %>% remove_constant() # shows no column with just single values
d %>% get_dupes() # there are no duplicates
# explore interest rate
d %>% tabyl(int_rate) %>% arrange(desc(n)) %>% plot_ly(x = ~n, type = "histogram")
d %>% tabyl(int_rate) %>% arrange(desc(n)) %>% plot_ly(x = ~n, type = "box")
# create copy of dataset
d_copy <- d
# data preparation --> check for NAs in int_rate and remove them if there are any
d %>%
select(int_rate) %>%
summarise_all(funs(sum(is.na(.))))
# there are no NAs --> nothing to do
#further NAs investigation
## too much variables to see all NAs attributes clearly --> split variables
plot_missing(d[1:30])
plot_missing(d[31:60])
plot_missing(d[61:90])
plot_missing(d[91:120])
plot_missing(d[121:147])
## there are a lot of variables with an NA-Quota bigger than 5%
# generate columns where NA quota is bigger or equal than 5% of rows
d_na_quoate_bigger_equals_0.05 <- d %>% summarise_all(funs(sum(is.na(.)))) %>% t() %>%
as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>%
select(count = V1, everything()) %>% arrange(desc(count)) %>%
mutate(quote = count/length(d$id)) %>% filter(quote >= 0.05)
(d_count_columns_na_quoate_bigger_equals_0.05 <- length(d_na_quoate_bigger_equals_0.05$column_names))
## [1] 93
# 93 columns with NA quote >= 5%
# remove 93 columns with NA quota >= 0.05%
d_filter_temp <- d %>% summarise_all(funs(sum(is.na(.)))) %>% t() %>%
as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>%
select(count = V1, everything()) %>% arrange(desc(count)) %>%
mutate(quote = count/length(d$id)) %>% filter(quote < 0.05)
d_filter_temp %<>% select(column_names)
d %<>% select(d_filter_temp$column_names)
dim(d)[2]
## [1] 54
# success: only 54 columns left!
# inspect remaining variables
plot_missing(d) # success!
# create backup copy containing NAs
d_with_nas <- d
# remove rows with NAs
d %<>% na.omit() # 2347 NAs removed
# create new data frame for part 2 (classification)
d_na_less_5_percent <- d
# write data for classification
path_file <- "./data/d_na_less_5_percent.csv"
if (!file.exists(path_file)) {
write_csv(d, path = "data/d_na_less_5_percent.csv")
}
introduce(d) %>% t()
## [,1]
## rows 108583
## columns 54
## discrete_columns 21
## continuous_columns 33
## all_missing_columns 0
## total_missing_values 0
## complete_rows 108583
## total_observations 5863482
## memory_usage 47747920
plot_intro(d)
plot_bar(d) # 7 columns ignored with more than 50 categories.
## 7 columns ignored with more than 50 categories.
## last_pymnt_d: 129 categories
## title: 9650 categories
## earliest_cr_line: 624 categories
## last_credit_pull_d: 127 categories
## issue_d: 99 categories
## zip_code: 874 categories
## addr_state: 51 categories
# last_pymnt_d: 129 categories
# title: 9652 categories
# earliest_cr_line: 624 categories
# last_credit_pull_d: 127 categories
# issue_d: 99 categories
# zip_code: 874 categories
# addr_state: 51 categories
# generate a list of unique counts per column
(d_unique_counts_per_column <- d %>% summarise_all(list(~n_distinct(.))) %>% t() %>% as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>%
select(count = V1, everything()) %>% arrange(desc(count)))
# assumption: select most important columns according to subjective evaluation
## filter e. g. columns with just one unique value, not understandable meaning,
## in general columns <= 10 distinct values,
d %<>% select(int_rate, loan_amnt, revol_bal, revol_util, funded_amnt_inv, annual_inc, pub_rec_bankruptcies,
term, pymnt_plan, initial_list_status, application_type, hardship_flag, debt_settlement_flag,
verification_status, home_ownership, acc_now_delinq, grade, collections_12_mths_ex_med, loan_status,
chargeoff_within_12_mths, year_temp)
# write file for easier regression analysis start
path_file <- "./data/data_group_3_regression.csv"
if (!file.exists(path_file)) {
write_csv(d, path = "data/data_group_3_regression.csv")
}
One who would just perfrom the regression models could directly start here. The necessary data are provided in the next code block.
# starting with the following data for regression task
d <- read_csv("data/data_group_3_regression.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## term = col_character(),
## pymnt_plan = col_character(),
## initial_list_status = col_character(),
## application_type = col_character(),
## hardship_flag = col_character(),
## debt_settlement_flag = col_character(),
## verification_status = col_character(),
## home_ownership = col_character(),
## grade = col_character(),
## loan_status = col_character()
## )
## See spec(...) for full column specifications.
The following regression models answer to following questions (from “An Introduction to Statistical Learning”, Springer, 2017, Page 75):
One finds the answers at the end of this regression analysis part.
ggpairs(d[, 1:3])
# ggpairs(d) # not really working for all data
# show scatterplots, distributions & correlations
set.seed(22)
d %>%
sample_n(1000) %>%
select(c(1:7, 16, 18, 20, 21)) %>%
na.omit() %>%
chart.Correlation(histogram = TRUE, pch = 19) # better than cor() %>%corrplot()
### show patterns ###
plot_bar(d, with = "int_rate")
plot_boxplot(d, by = "int_rate")
plot_histogram(d)
plot_scatterplot(split_columns(d)$continuous, by = "int_rate", sampled_rows = 1000L)
There are no strong correlations between int_rate and all other variables. One assumes that we need several predictors for a suitable model. Interesting: There is a very strong correlation between loan_amount and funded_amnt_inv.
# Split the data into training and test set
set.seed(22)
training_samples <- d$int_rate %>%
createDataPartition(p = 0.67, list = FALSE)
train_data <- d[training_samples, ]
test_data <- d[-training_samples, ]
# remove NONE values which cause problems regarding model generation
test_data %<>% filter(home_ownership != "NONE")
train_data %<>% filter(home_ownership != "NONE")
# Build the model first try
lm_model <- lm(int_rate ~ ., data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = int_rate ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.519 -0.872 0.036 0.865 4.305
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) 802.2066801977
## loan_amnt -0.0003086859
## revol_bal -0.0000020704
## revol_util 0.0056891673
## funded_amnt_inv 0.0003121423
## annual_inc -0.0000002850
## pub_rec_bankruptcies 0.0751329199
## term60 months 0.2132050242
## pymnt_plany -0.1552257175
## initial_list_statusw -0.0860336871
## application_typeJoint App 0.0207964906
## hardship_flagY NA
## debt_settlement_flagY -0.0415066104
## verification_statusSource Verified 0.0602749870
## verification_statusVerified 0.2818644813
## home_ownershipOTHER -0.9907899725
## home_ownershipOWN 0.0183835613
## home_ownershipRENT 0.0360787957
## acc_now_delinq 0.2050850016
## gradeB 3.4381638127
## gradeC 6.5588592529
## gradeD 9.6845363379
## gradeE 12.4093353062
## gradeF 15.9656363201
## gradeG 18.1417385984
## collections_12_mths_ex_med -0.0220253352
## loan_statusCurrent -0.3646700318
## loan_statusDoes not meet the credit policy. Status:Charged Off -3.2233222795
## loan_statusDoes not meet the credit policy. Status:Fully Paid -3.5624382781
## loan_statusFully Paid -0.0902041231
## loan_statusIn Grace Period -0.3305893808
## loan_statusLate (16-30 days) -0.0825789745
## loan_statusLate (31-120 days) -0.1998072911
## chargeoff_within_12_mths 0.0931754742
## year_temp -0.3948021639
## Std. Error
## (Intercept) 8.5549017340
## loan_amnt 0.0000114455
## revol_bal 0.0000002183
## revol_util 0.0001990838
## funded_amnt_inv 0.0000114537
## annual_inc 0.0000000667
## pub_rec_bankruptcies 0.0123966239
## term60 months 0.0127306969
## pymnt_plany 0.5338359837
## initial_list_statusw 0.0095813271
## application_typeJoint App 0.1645470952
## hardship_flagY NA
## debt_settlement_flagY 0.0328428767
## verification_statusSource Verified 0.0111819662
## verification_statusVerified 0.0119250363
## home_ownershipOTHER 0.3729366090
## home_ownershipOWN 0.0154055692
## home_ownershipRENT 0.0095771804
## acc_now_delinq 0.0531781521
## gradeB 0.0139086212
## gradeC 0.0147744261
## gradeD 0.0171087015
## gradeE 0.0213715895
## gradeF 0.0316785085
## gradeG 0.0594618845
## collections_12_mths_ex_med 0.0327539262
## loan_statusCurrent 0.0217741818
## loan_statusDoes not meet the credit policy. Status:Charged Off 0.2059161940
## loan_statusDoes not meet the credit policy. Status:Fully Paid 0.1322630433
## loan_statusFully Paid 0.0127855330
## loan_statusIn Grace Period 0.1361802734
## loan_statusLate (16-30 days) 0.2112354846
## loan_statusLate (31-120 days) 0.0975745521
## chargeoff_within_12_mths 0.0382852794
## year_temp 0.0042487679
## t value
## (Intercept) 93.77
## loan_amnt -26.97
## revol_bal -9.49
## revol_util 28.58
## funded_amnt_inv 27.25
## annual_inc -4.27
## pub_rec_bankruptcies 6.06
## term60 months 16.75
## pymnt_plany -0.29
## initial_list_statusw -8.98
## application_typeJoint App 0.13
## hardship_flagY NA
## debt_settlement_flagY -1.26
## verification_statusSource Verified 5.39
## verification_statusVerified 23.64
## home_ownershipOTHER -2.66
## home_ownershipOWN 1.19
## home_ownershipRENT 3.77
## acc_now_delinq 3.86
## gradeB 247.20
## gradeC 443.93
## gradeD 566.06
## gradeE 580.65
## gradeF 503.99
## gradeG 305.10
## collections_12_mths_ex_med -0.67
## loan_statusCurrent -16.75
## loan_statusDoes not meet the credit policy. Status:Charged Off -15.65
## loan_statusDoes not meet the credit policy. Status:Fully Paid -26.93
## loan_statusFully Paid -7.06
## loan_statusIn Grace Period -2.43
## loan_statusLate (16-30 days) -0.39
## loan_statusLate (31-120 days) -2.05
## chargeoff_within_12_mths 2.43
## year_temp -92.92
## Pr(>|t|)
## (Intercept) < 0.0000000000000002
## loan_amnt < 0.0000000000000002
## revol_bal < 0.0000000000000002
## revol_util < 0.0000000000000002
## funded_amnt_inv < 0.0000000000000002
## annual_inc 0.0000194774277
## pub_rec_bankruptcies 0.0000000013615
## term60 months < 0.0000000000000002
## pymnt_plany 0.77122
## initial_list_statusw < 0.0000000000000002
## application_typeJoint App 0.89943
## hardship_flagY NA
## debt_settlement_flagY 0.20631
## verification_statusSource Verified 0.0000000705292
## verification_statusVerified < 0.0000000000000002
## home_ownershipOTHER 0.00789
## home_ownershipOWN 0.23275
## home_ownershipRENT 0.00017
## acc_now_delinq 0.00012
## gradeB < 0.0000000000000002
## gradeC < 0.0000000000000002
## gradeD < 0.0000000000000002
## gradeE < 0.0000000000000002
## gradeF < 0.0000000000000002
## gradeG < 0.0000000000000002
## collections_12_mths_ex_med 0.50130
## loan_statusCurrent < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Charged Off < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Fully Paid < 0.0000000000000002
## loan_statusFully Paid 0.0000000000017
## loan_statusIn Grace Period 0.01520
## loan_statusLate (16-30 days) 0.69585
## loan_statusLate (31-120 days) 0.04059
## chargeoff_within_12_mths 0.01495
## year_temp < 0.0000000000000002
##
## (Intercept) ***
## loan_amnt ***
## revol_bal ***
## revol_util ***
## funded_amnt_inv ***
## annual_inc ***
## pub_rec_bankruptcies ***
## term60 months ***
## pymnt_plany
## initial_list_statusw ***
## application_typeJoint App
## hardship_flagY
## debt_settlement_flagY
## verification_statusSource Verified ***
## verification_statusVerified ***
## home_ownershipOTHER **
## home_ownershipOWN
## home_ownershipRENT ***
## acc_now_delinq ***
## gradeB ***
## gradeC ***
## gradeD ***
## gradeE ***
## gradeF ***
## gradeG ***
## collections_12_mths_ex_med
## loan_statusCurrent ***
## loan_statusDoes not meet the credit policy. Status:Charged Off ***
## loan_statusDoes not meet the credit policy. Status:Fully Paid ***
## loan_statusFully Paid ***
## loan_statusIn Grace Period *
## loan_statusLate (16-30 days)
## loan_statusLate (31-120 days) *
## chargeoff_within_12_mths *
## year_temp ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.17 on 72718 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.928
## F-statistic: 2.85e+04 on 33 and 72718 DF, p-value: <0.0000000000000002
# remove not significant predictors
train_data_lm <- train_data %>% select(-pymnt_plan, -collections_12_mths_ex_med)
test_data_lm <- test_data %>% select(-pymnt_plan, -collections_12_mths_ex_med)
# second try
lm_model <- lm(int_rate ~ ., data = train_data_lm)
summary(lm_model)
##
## Call:
## lm(formula = int_rate ~ ., data = train_data_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.518 -0.872 0.036 0.865 4.284
##
## Coefficients:
## Estimate
## (Intercept) 802.4465289657
## loan_amnt -0.0003087075
## revol_bal -0.0000020691
## revol_util 0.0056948126
## funded_amnt_inv 0.0003121712
## annual_inc -0.0000002854
## pub_rec_bankruptcies 0.0751867373
## term60 months 0.2133866484
## initial_list_statusw -0.0860427429
## application_typeJoint App 0.0211644755
## hardship_flagY -0.1546470906
## debt_settlement_flagY -0.0412172759
## verification_statusSource Verified 0.0601186230
## verification_statusVerified 0.2816950230
## home_ownershipOTHER -0.9907347072
## home_ownershipOWN 0.0183600927
## home_ownershipRENT 0.0361318887
## acc_now_delinq 0.2040971144
## gradeB 3.4379225999
## gradeC 6.5585298813
## gradeD 9.6841821464
## gradeE 12.4089177736
## gradeF 15.9652127988
## gradeG 18.1411912756
## loan_statusCurrent -0.3647086569
## loan_statusDoes not meet the credit policy. Status:Charged Off -3.2234223886
## loan_statusDoes not meet the credit policy. Status:Fully Paid -3.5624433792
## loan_statusFully Paid -0.0900978667
## loan_statusIn Grace Period -0.3301965231
## loan_statusLate (16-30 days) -0.0829198365
## loan_statusLate (31-120 days) -0.1998885589
## chargeoff_within_12_mths 0.0920427660
## year_temp -0.3949214992
## Std. Error
## (Intercept) 8.5474307795
## loan_amnt 0.0000114454
## revol_bal 0.0000002183
## revol_util 0.0001989059
## funded_amnt_inv 0.0000114536
## annual_inc 0.0000000667
## pub_rec_bankruptcies 0.0123963189
## term60 months 0.0127277835
## initial_list_statusw 0.0095812815
## application_typeJoint App 0.1645455654
## hardship_flagY 0.5338332795
## debt_settlement_flagY 0.0328399345
## verification_statusSource Verified 0.0111795061
## verification_statusVerified 0.0119223284
## home_ownershipOTHER 0.3729351952
## home_ownershipOWN 0.0154054716
## home_ownershipRENT 0.0095768189
## acc_now_delinq 0.0531576556
## gradeB 0.0139039425
## gradeC 0.0147662491
## gradeD 0.0171005273
## gradeE 0.0213624873
## gradeF 0.0316721277
## gradeG 0.0594560897
## loan_statusCurrent 0.0217740241
## loan_statusDoes not meet the credit policy. Status:Charged Off 0.2059153646
## loan_statusDoes not meet the credit policy. Status:Fully Paid 0.1322625449
## loan_statusFully Paid 0.0127845084
## loan_statusIn Grace Period 0.1361785073
## loan_statusLate (16-30 days) 0.2112340808
## loan_statusLate (31-120 days) 0.0975741097
## chargeoff_within_12_mths 0.0382480616
## year_temp 0.0042450441
## t value
## (Intercept) 93.88
## loan_amnt -26.97
## revol_bal -9.48
## revol_util 28.63
## funded_amnt_inv 27.26
## annual_inc -4.28
## pub_rec_bankruptcies 6.07
## term60 months 16.77
## initial_list_statusw -8.98
## application_typeJoint App 0.13
## hardship_flagY -0.29
## debt_settlement_flagY -1.26
## verification_statusSource Verified 5.38
## verification_statusVerified 23.63
## home_ownershipOTHER -2.66
## home_ownershipOWN 1.19
## home_ownershipRENT 3.77
## acc_now_delinq 3.84
## gradeB 247.26
## gradeC 444.16
## gradeD 566.31
## gradeE 580.87
## gradeF 504.08
## gradeG 305.12
## loan_statusCurrent -16.75
## loan_statusDoes not meet the credit policy. Status:Charged Off -15.65
## loan_statusDoes not meet the credit policy. Status:Fully Paid -26.93
## loan_statusFully Paid -7.05
## loan_statusIn Grace Period -2.42
## loan_statusLate (16-30 days) -0.39
## loan_statusLate (31-120 days) -2.05
## chargeoff_within_12_mths 2.41
## year_temp -93.03
## Pr(>|t|)
## (Intercept) < 0.0000000000000002
## loan_amnt < 0.0000000000000002
## revol_bal < 0.0000000000000002
## revol_util < 0.0000000000000002
## funded_amnt_inv < 0.0000000000000002
## annual_inc 0.0000190210480
## pub_rec_bankruptcies 0.0000000013240
## term60 months < 0.0000000000000002
## initial_list_statusw < 0.0000000000000002
## application_typeJoint App 0.89766
## hardship_flagY 0.77205
## debt_settlement_flagY 0.20945
## verification_statusSource Verified 0.0000000757283
## verification_statusVerified < 0.0000000000000002
## home_ownershipOTHER 0.00790
## home_ownershipOWN 0.23335
## home_ownershipRENT 0.00016
## acc_now_delinq 0.00012
## gradeB < 0.0000000000000002
## gradeC < 0.0000000000000002
## gradeD < 0.0000000000000002
## gradeE < 0.0000000000000002
## gradeF < 0.0000000000000002
## gradeG < 0.0000000000000002
## loan_statusCurrent < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Charged Off < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Fully Paid < 0.0000000000000002
## loan_statusFully Paid 0.0000000000018
## loan_statusIn Grace Period 0.01532
## loan_statusLate (16-30 days) 0.69465
## loan_statusLate (31-120 days) 0.04051
## chargeoff_within_12_mths 0.01611
## year_temp < 0.0000000000000002
##
## (Intercept) ***
## loan_amnt ***
## revol_bal ***
## revol_util ***
## funded_amnt_inv ***
## annual_inc ***
## pub_rec_bankruptcies ***
## term60 months ***
## initial_list_statusw ***
## application_typeJoint App
## hardship_flagY
## debt_settlement_flagY
## verification_statusSource Verified ***
## verification_statusVerified ***
## home_ownershipOTHER **
## home_ownershipOWN
## home_ownershipRENT ***
## acc_now_delinq ***
## gradeB ***
## gradeC ***
## gradeD ***
## gradeE ***
## gradeF ***
## gradeG ***
## loan_statusCurrent ***
## loan_statusDoes not meet the credit policy. Status:Charged Off ***
## loan_statusDoes not meet the credit policy. Status:Fully Paid ***
## loan_statusFully Paid ***
## loan_statusIn Grace Period *
## loan_statusLate (16-30 days)
## loan_statusLate (31-120 days) *
## chargeoff_within_12_mths *
## year_temp ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.17 on 72719 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.928
## F-statistic: 2.94e+04 on 32 and 72719 DF, p-value: <0.0000000000000002
table(d$pymnt_plan) # clear distribution > most of the values are no, only 9 values are yes > predictor can be removed
##
## n y
## 108574 9
table(d$collections_12_mths_ex_med) # more or less clear distribution > most of the values are in categorie 0 > predictor can be removed
##
## 0 1 2 3 4 5 6
## 107173 1305 83 14 6 1 1
# other not significant predictors in the lm-Modell can't be removed because they have categories which don't
# show a clearly recognisable distribution of the values, for example
table(d$loan_status) # not a clear distribution > predictor can't be removed
##
## Charged Off
## 18471
## Current
## 7075
## Does not meet the credit policy. Status:Charged Off
## 49
## Does not meet the credit policy. Status:Fully Paid
## 122
## Fully Paid
## 82475
## In Grace Period
## 122
## Late (16-30 days)
## 44
## Late (31-120 days)
## 225
table(d$home_ownership)
##
## MORTGAGE NONE OTHER OWN RENT
## 54498 1 14 10661 43409
# Make predictions and compute the R2, RMSE and MAE
lm_predictions <- lm_model %>% predict(test_data)
summary(lm_predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.05 10.46 13.51 13.24 16.65 27.74
na_filter <- lm_predictions[!is.na(lm_predictions)]
# show measures for model quality
(lm_evaluation_vsa <-
data.frame(
R2 = caret::R2(lm_predictions[na_filter], test_data$int_rate[na_filter]),
RMSE = caret::RMSE(lm_predictions[na_filter], test_data$int_rate[na_filter]) # ,
# MAE = MAE(lm_predictions[na_filter], test_data$int_rate[na_filter])
))
# Leave one out cross validation - LOOCV
# Define training control
train_control <- trainControl(method = "LOOCV")
# Train the model
set.seed(22)
d_sample <- d %>% sample_n(2000) %>% na.omit()
# check if there is just one unique value per column: hit with "pymnt_plan" and "hardship_flag"
d_sample %>% summarise_all(list(~n_distinct(.))) %>% t() %>% as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>%
select(count = V1, everything()) %>% arrange(count)
d_sample %<>% select(-c(pymnt_plan, hardship_flag, application_type))
## the following loocv code snippet works within this environment (one could have problems with mac)
# sessionInfo()
lm_model_loocv <- train(int_rate ~ ., data = d_sample, method = "lm",
trControl = train_control)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
# Summarize the results
(lm_evaluation_loocv <- lm_model_loocv)
## Linear Regression
##
## 2000 samples
## 17 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 1999, 1999, 1999, 1999, 1999, 1999, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.173 0.9263 0.9441
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print("MSE: ")
## [1] "MSE: "
(loocv_mse <- lm_evaluation_loocv$results$RMSE^2)
## [1] 1.375
# comparison to validation set approach: slightly worse at R2 and RMSE but better MAE
Model Evalution:
One advantage of the LOOCV approach is that almost the entire dataset (n-1 observations) can be used for training, which results in less bias and constant estimates for the test MSE. The model is fit n times with n-1 observations, whereas the one remaining observation is used to test the model. This results in n test MSE, whereas the LOOCV estimate for the overall test MSE is the average of these n test MSE.
LOOCV Test MSE: 1.375
dim(train_data) # 20 predictors (21-1). -1 because of int_rate
## [1] 72752 21
# creates a model matrix by expanding factors to a set of dummy variables.
# This matrix is later used to calculate the test MSE for various models
test_mat <- model.matrix(int_rate~.,data=test_data)
# when considering all dummy variables there are 34 predictors (35-1): -1 because of intercept.
dim(test_mat)
## [1] 35830 35
# create best subset selection with 1 up to all 35 predictors.
regfit_best <- regsubsets(int_rate~.,data=train_data, nvmax=34)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
reg_summary <- summary(regfit_best)
reg_summary # models with up to 33 predictors were included.
## Subset selection object
## Call: regsubsets.formula(int_rate ~ ., data = train_data, nvmax = 34)
## 34 Variables (and intercept)
## Forced in
## loan_amnt FALSE
## revol_bal FALSE
## revol_util FALSE
## funded_amnt_inv FALSE
## annual_inc FALSE
## pub_rec_bankruptcies FALSE
## term60 months FALSE
## pymnt_plany FALSE
## initial_list_statusw FALSE
## application_typeJoint App FALSE
## debt_settlement_flagY FALSE
## verification_statusSource Verified FALSE
## verification_statusVerified FALSE
## home_ownershipOTHER FALSE
## home_ownershipOWN FALSE
## home_ownershipRENT FALSE
## acc_now_delinq FALSE
## gradeB FALSE
## gradeC FALSE
## gradeD FALSE
## gradeE FALSE
## gradeF FALSE
## gradeG FALSE
## collections_12_mths_ex_med FALSE
## loan_statusCurrent FALSE
## loan_statusDoes not meet the credit policy. Status:Charged Off FALSE
## loan_statusDoes not meet the credit policy. Status:Fully Paid FALSE
## loan_statusFully Paid FALSE
## loan_statusIn Grace Period FALSE
## loan_statusLate (16-30 days) FALSE
## loan_statusLate (31-120 days) FALSE
## chargeoff_within_12_mths FALSE
## year_temp FALSE
## hardship_flagY FALSE
## Forced out
## loan_amnt FALSE
## revol_bal FALSE
## revol_util FALSE
## funded_amnt_inv FALSE
## annual_inc FALSE
## pub_rec_bankruptcies FALSE
## term60 months FALSE
## pymnt_plany FALSE
## initial_list_statusw FALSE
## application_typeJoint App FALSE
## debt_settlement_flagY FALSE
## verification_statusSource Verified FALSE
## verification_statusVerified FALSE
## home_ownershipOTHER FALSE
## home_ownershipOWN FALSE
## home_ownershipRENT FALSE
## acc_now_delinq FALSE
## gradeB FALSE
## gradeC FALSE
## gradeD FALSE
## gradeE FALSE
## gradeF FALSE
## gradeG FALSE
## collections_12_mths_ex_med FALSE
## loan_statusCurrent FALSE
## loan_statusDoes not meet the credit policy. Status:Charged Off FALSE
## loan_statusDoes not meet the credit policy. Status:Fully Paid FALSE
## loan_statusFully Paid FALSE
## loan_statusIn Grace Period FALSE
## loan_statusLate (16-30 days) FALSE
## loan_statusLate (31-120 days) FALSE
## chargeoff_within_12_mths FALSE
## year_temp FALSE
## hardship_flagY FALSE
## 1 subsets of each size up to 33
## Selection Algorithm: exhaustive
## loan_amnt revol_bal revol_util funded_amnt_inv annual_inc
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 10 ( 1 ) " " " " "*" " " " "
## 11 ( 1 ) " " " " "*" " " " "
## 12 ( 1 ) "*" " " "*" "*" " "
## 13 ( 1 ) "*" " " "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" " "
## 15 ( 1 ) "*" " " "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" " "
## 18 ( 1 ) "*" "*" "*" "*" " "
## 19 ( 1 ) "*" "*" "*" "*" " "
## 20 ( 1 ) "*" "*" "*" "*" " "
## 21 ( 1 ) "*" "*" "*" "*" "*"
## 22 ( 1 ) "*" "*" "*" "*" "*"
## 23 ( 1 ) "*" "*" "*" "*" "*"
## 24 ( 1 ) "*" "*" "*" "*" "*"
## 25 ( 1 ) "*" "*" "*" "*" "*"
## 26 ( 1 ) "*" "*" "*" "*" "*"
## 27 ( 1 ) "*" "*" "*" "*" "*"
## 28 ( 1 ) "*" "*" "*" "*" "*"
## 29 ( 1 ) "*" "*" "*" "*" "*"
## 30 ( 1 ) "*" "*" "*" "*" "*"
## 31 ( 1 ) "*" "*" "*" "*" "*"
## 32 ( 1 ) "*" "*" "*" "*" "*"
## 33 ( 1 ) "*" "*" "*" "*" "*"
## pub_rec_bankruptcies term60 months pymnt_plany
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) " " " " " "
## 10 ( 1 ) " " " " " "
## 11 ( 1 ) " " " " " "
## 12 ( 1 ) " " " " " "
## 13 ( 1 ) " " " " " "
## 14 ( 1 ) " " " " " "
## 15 ( 1 ) " " "*" " "
## 16 ( 1 ) " " "*" " "
## 17 ( 1 ) " " "*" " "
## 18 ( 1 ) " " "*" " "
## 19 ( 1 ) "*" "*" " "
## 20 ( 1 ) "*" "*" " "
## 21 ( 1 ) "*" "*" " "
## 22 ( 1 ) "*" "*" " "
## 23 ( 1 ) "*" "*" " "
## 24 ( 1 ) "*" "*" " "
## 25 ( 1 ) "*" "*" " "
## 26 ( 1 ) "*" "*" " "
## 27 ( 1 ) "*" "*" " "
## 28 ( 1 ) "*" "*" " "
## 29 ( 1 ) "*" "*" " "
## 30 ( 1 ) "*" "*" " "
## 31 ( 1 ) "*" "*" " "
## 32 ( 1 ) "*" "*" "*"
## 33 ( 1 ) "*" "*" "*"
## initial_list_statusw application_typeJoint App hardship_flagY
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) " " " " " "
## 10 ( 1 ) " " " " " "
## 11 ( 1 ) " " " " " "
## 12 ( 1 ) " " " " " "
## 13 ( 1 ) " " " " " "
## 14 ( 1 ) " " " " " "
## 15 ( 1 ) " " " " " "
## 16 ( 1 ) " " " " " "
## 17 ( 1 ) "*" " " " "
## 18 ( 1 ) "*" " " " "
## 19 ( 1 ) "*" " " " "
## 20 ( 1 ) "*" " " " "
## 21 ( 1 ) "*" " " " "
## 22 ( 1 ) "*" " " " "
## 23 ( 1 ) "*" " " " "
## 24 ( 1 ) "*" " " " "
## 25 ( 1 ) "*" " " " "
## 26 ( 1 ) "*" " " " "
## 27 ( 1 ) "*" " " " "
## 28 ( 1 ) "*" " " " "
## 29 ( 1 ) "*" " " " "
## 30 ( 1 ) "*" " " " "
## 31 ( 1 ) "*" " " " "
## 32 ( 1 ) "*" " " " "
## 33 ( 1 ) "*" "*" " "
## debt_settlement_flagY verification_statusSource Verified
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " " "
## 14 ( 1 ) " " " "
## 15 ( 1 ) " " " "
## 16 ( 1 ) " " " "
## 17 ( 1 ) " " " "
## 18 ( 1 ) " " " "
## 19 ( 1 ) " " " "
## 20 ( 1 ) " " "*"
## 21 ( 1 ) " " "*"
## 22 ( 1 ) " " "*"
## 23 ( 1 ) " " "*"
## 24 ( 1 ) " " "*"
## 25 ( 1 ) " " "*"
## 26 ( 1 ) " " "*"
## 27 ( 1 ) " " "*"
## 28 ( 1 ) "*" "*"
## 29 ( 1 ) "*" "*"
## 30 ( 1 ) "*" "*"
## 31 ( 1 ) "*" "*"
## 32 ( 1 ) "*" "*"
## 33 ( 1 ) "*" "*"
## verification_statusVerified home_ownershipOTHER
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) "*" " "
## 10 ( 1 ) "*" " "
## 11 ( 1 ) "*" " "
## 12 ( 1 ) "*" " "
## 13 ( 1 ) "*" " "
## 14 ( 1 ) "*" " "
## 15 ( 1 ) "*" " "
## 16 ( 1 ) "*" " "
## 17 ( 1 ) "*" " "
## 18 ( 1 ) "*" " "
## 19 ( 1 ) "*" " "
## 20 ( 1 ) "*" " "
## 21 ( 1 ) "*" " "
## 22 ( 1 ) "*" " "
## 23 ( 1 ) "*" " "
## 24 ( 1 ) "*" "*"
## 25 ( 1 ) "*" "*"
## 26 ( 1 ) "*" "*"
## 27 ( 1 ) "*" "*"
## 28 ( 1 ) "*" "*"
## 29 ( 1 ) "*" "*"
## 30 ( 1 ) "*" "*"
## 31 ( 1 ) "*" "*"
## 32 ( 1 ) "*" "*"
## 33 ( 1 ) "*" "*"
## home_ownershipOWN home_ownershipRENT acc_now_delinq gradeB
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " "*"
## 7 ( 1 ) " " " " " " "*"
## 8 ( 1 ) " " " " " " "*"
## 9 ( 1 ) " " " " " " "*"
## 10 ( 1 ) " " " " " " "*"
## 11 ( 1 ) " " " " " " "*"
## 12 ( 1 ) " " " " " " "*"
## 13 ( 1 ) " " " " " " "*"
## 14 ( 1 ) " " " " " " "*"
## 15 ( 1 ) " " " " " " "*"
## 16 ( 1 ) " " " " " " "*"
## 17 ( 1 ) " " " " " " "*"
## 18 ( 1 ) " " " " " " "*"
## 19 ( 1 ) " " " " " " "*"
## 20 ( 1 ) " " " " " " "*"
## 21 ( 1 ) " " " " " " "*"
## 22 ( 1 ) " " " " "*" "*"
## 23 ( 1 ) " " "*" "*" "*"
## 24 ( 1 ) " " "*" "*" "*"
## 25 ( 1 ) " " "*" "*" "*"
## 26 ( 1 ) " " "*" "*" "*"
## 27 ( 1 ) " " "*" "*" "*"
## 28 ( 1 ) " " "*" "*" "*"
## 29 ( 1 ) "*" "*" "*" "*"
## 30 ( 1 ) "*" "*" "*" "*"
## 31 ( 1 ) "*" "*" "*" "*"
## 32 ( 1 ) "*" "*" "*" "*"
## 33 ( 1 ) "*" "*" "*" "*"
## gradeC gradeD gradeE gradeF gradeG collections_12_mths_ex_med
## 1 ( 1 ) " " " " "*" " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " "
## 3 ( 1 ) " " "*" "*" "*" " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " "
## 5 ( 1 ) "*" "*" "*" "*" "*" " "
## 6 ( 1 ) "*" "*" "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" "*" " "
## 8 ( 1 ) "*" "*" "*" "*" "*" " "
## 9 ( 1 ) "*" "*" "*" "*" "*" " "
## 10 ( 1 ) "*" "*" "*" "*" "*" " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " "
## 13 ( 1 ) "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" " "
## 18 ( 1 ) "*" "*" "*" "*" "*" " "
## 19 ( 1 ) "*" "*" "*" "*" "*" " "
## 20 ( 1 ) "*" "*" "*" "*" "*" " "
## 21 ( 1 ) "*" "*" "*" "*" "*" " "
## 22 ( 1 ) "*" "*" "*" "*" "*" " "
## 23 ( 1 ) "*" "*" "*" "*" "*" " "
## 24 ( 1 ) "*" "*" "*" "*" "*" " "
## 25 ( 1 ) "*" "*" "*" "*" "*" " "
## 26 ( 1 ) "*" "*" "*" "*" "*" " "
## 27 ( 1 ) "*" "*" "*" "*" "*" " "
## 28 ( 1 ) "*" "*" "*" "*" "*" " "
## 29 ( 1 ) "*" "*" "*" "*" "*" " "
## 30 ( 1 ) "*" "*" "*" "*" "*" "*"
## 31 ( 1 ) "*" "*" "*" "*" "*" "*"
## 32 ( 1 ) "*" "*" "*" "*" "*" "*"
## 33 ( 1 ) "*" "*" "*" "*" "*" "*"
## loan_statusCurrent
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) " "
## 9 ( 1 ) " "
## 10 ( 1 ) " "
## 11 ( 1 ) " "
## 12 ( 1 ) " "
## 13 ( 1 ) " "
## 14 ( 1 ) " "
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
## 17 ( 1 ) "*"
## 18 ( 1 ) "*"
## 19 ( 1 ) "*"
## 20 ( 1 ) "*"
## 21 ( 1 ) "*"
## 22 ( 1 ) "*"
## 23 ( 1 ) "*"
## 24 ( 1 ) "*"
## 25 ( 1 ) "*"
## 26 ( 1 ) "*"
## 27 ( 1 ) "*"
## 28 ( 1 ) "*"
## 29 ( 1 ) "*"
## 30 ( 1 ) "*"
## 31 ( 1 ) "*"
## 32 ( 1 ) "*"
## 33 ( 1 ) "*"
## loan_statusDoes not meet the credit policy. Status:Charged Off
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) " "
## 9 ( 1 ) " "
## 10 ( 1 ) " "
## 11 ( 1 ) "*"
## 12 ( 1 ) " "
## 13 ( 1 ) "*"
## 14 ( 1 ) "*"
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
## 17 ( 1 ) "*"
## 18 ( 1 ) "*"
## 19 ( 1 ) "*"
## 20 ( 1 ) "*"
## 21 ( 1 ) "*"
## 22 ( 1 ) "*"
## 23 ( 1 ) "*"
## 24 ( 1 ) "*"
## 25 ( 1 ) "*"
## 26 ( 1 ) "*"
## 27 ( 1 ) "*"
## 28 ( 1 ) "*"
## 29 ( 1 ) "*"
## 30 ( 1 ) "*"
## 31 ( 1 ) "*"
## 32 ( 1 ) "*"
## 33 ( 1 ) "*"
## loan_statusDoes not meet the credit policy. Status:Fully Paid
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) "*"
## 9 ( 1 ) "*"
## 10 ( 1 ) "*"
## 11 ( 1 ) "*"
## 12 ( 1 ) "*"
## 13 ( 1 ) "*"
## 14 ( 1 ) "*"
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
## 17 ( 1 ) "*"
## 18 ( 1 ) "*"
## 19 ( 1 ) "*"
## 20 ( 1 ) "*"
## 21 ( 1 ) "*"
## 22 ( 1 ) "*"
## 23 ( 1 ) "*"
## 24 ( 1 ) "*"
## 25 ( 1 ) "*"
## 26 ( 1 ) "*"
## 27 ( 1 ) "*"
## 28 ( 1 ) "*"
## 29 ( 1 ) "*"
## 30 ( 1 ) "*"
## 31 ( 1 ) "*"
## 32 ( 1 ) "*"
## 33 ( 1 ) "*"
## loan_statusFully Paid loan_statusIn Grace Period
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " " "
## 14 ( 1 ) " " " "
## 15 ( 1 ) " " " "
## 16 ( 1 ) " " " "
## 17 ( 1 ) " " " "
## 18 ( 1 ) "*" " "
## 19 ( 1 ) "*" " "
## 20 ( 1 ) "*" " "
## 21 ( 1 ) "*" " "
## 22 ( 1 ) "*" " "
## 23 ( 1 ) "*" " "
## 24 ( 1 ) "*" " "
## 25 ( 1 ) "*" " "
## 26 ( 1 ) "*" "*"
## 27 ( 1 ) "*" "*"
## 28 ( 1 ) "*" "*"
## 29 ( 1 ) "*" "*"
## 30 ( 1 ) "*" "*"
## 31 ( 1 ) "*" "*"
## 32 ( 1 ) "*" "*"
## 33 ( 1 ) "*" "*"
## loan_statusLate (16-30 days) loan_statusLate (31-120 days)
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " " "
## 8 ( 1 ) " " " "
## 9 ( 1 ) " " " "
## 10 ( 1 ) " " " "
## 11 ( 1 ) " " " "
## 12 ( 1 ) " " " "
## 13 ( 1 ) " " " "
## 14 ( 1 ) " " " "
## 15 ( 1 ) " " " "
## 16 ( 1 ) " " " "
## 17 ( 1 ) " " " "
## 18 ( 1 ) " " " "
## 19 ( 1 ) " " " "
## 20 ( 1 ) " " " "
## 21 ( 1 ) " " " "
## 22 ( 1 ) " " " "
## 23 ( 1 ) " " " "
## 24 ( 1 ) " " " "
## 25 ( 1 ) " " " "
## 26 ( 1 ) " " " "
## 27 ( 1 ) " " "*"
## 28 ( 1 ) " " "*"
## 29 ( 1 ) " " "*"
## 30 ( 1 ) " " "*"
## 31 ( 1 ) "*" "*"
## 32 ( 1 ) "*" "*"
## 33 ( 1 ) "*" "*"
## chargeoff_within_12_mths year_temp
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) " " "*"
## 11 ( 1 ) " " "*"
## 12 ( 1 ) " " "*"
## 13 ( 1 ) " " "*"
## 14 ( 1 ) " " "*"
## 15 ( 1 ) " " "*"
## 16 ( 1 ) " " "*"
## 17 ( 1 ) " " "*"
## 18 ( 1 ) " " "*"
## 19 ( 1 ) " " "*"
## 20 ( 1 ) " " "*"
## 21 ( 1 ) " " "*"
## 22 ( 1 ) " " "*"
## 23 ( 1 ) " " "*"
## 24 ( 1 ) " " "*"
## 25 ( 1 ) "*" "*"
## 26 ( 1 ) "*" "*"
## 27 ( 1 ) "*" "*"
## 28 ( 1 ) "*" "*"
## 29 ( 1 ) "*" "*"
## 30 ( 1 ) "*" "*"
## 31 ( 1 ) "*" "*"
## 32 ( 1 ) "*" "*"
## 33 ( 1 ) "*" "*"
# Analysis of various metrics to assess the models
par(mfrow=c(2,2))
# plot R square
plot(reg_summary$rss ,xlab=" Number of Variables ",ylab="RSS", type="l")
# plot adusted R-square
plot(reg_summary$adjr2 ,xlab =" Number of Variables ", ylab="Adjusted RSq",type="l")
which.max (reg_summary$adjr2) # maximal adusted R-square: 29
## [1] 29
points (29, reg_summary$adjr2[29], col ="red",cex =2, pch =20)
# plot CP
plot(reg_summary$cp ,xlab =" Number of Variables ", ylab="CP",type="l")
which.min (reg_summary$cp ) # minimal CP with 27 predictors
## [1] 27
points (27, reg_summary$cp[27], col ="red",cex =2, pch =20)
# plot BIC
plot(reg_summary$bic ,xlab =" Number of Variables ", ylab="BIC",type="l")
which.min (reg_summary$bic ) # minimal BIC with 23 predictors
## [1] 23
points (23, reg_summary$bic[23], col ="red",cex =2, pch =20)
# model of does not seem to improve much anymore after 6 predictors are included.
coefi=coef(regfit_best ,id=6) # get coefficients
pred_6=test_mat[,names(coefi)]%*%coefi # predict value for each observation with model and test data
test_error_6=mean((test_data$int_rate-pred_6)^2) # calculate test error
# -> test error for model with 6 predictors: 3.015
# next step: compare the test MSE of the various models with 1 predictor up to 33 predictors.
test_errors=rep(NA ,33) # create empty list
# search for each size i, the best model of that size
for(i in 1:33){
coefi = coef(regfit_best ,id=i) # get coefficients
pred = test_mat[,names(coefi)]%*%coefi # predict value for each observation with model and test data
test_errors[i] = mean((test_data$int_rate-pred)^2) # calculate test error
}
test_errors # show all test errors
## [1] 16.257 12.090 11.085 6.458 6.459 3.015 3.015 3.013 3.013 2.901
## [11] 2.895 2.897 2.891 2.875 2.861 2.845 2.755 2.755 2.748 2.708
## [21] 2.700 1.543 1.543 1.543 1.379 1.379 1.379 1.378 1.378 1.372
## [31] 1.372 1.372 1.372
test_errors[which.min(test_errors)] # test error of best model: 1.372
## [1] 1.372
which.min(test_errors) # model with minimal test error: 31
## [1] 31
par(mfrow=c(1,1))
plot(test_errors, type="l", main="Test Error") # plot the test error of all modelss
points(31, test_errors[31], col ="red",cex =2, pch =20) # mark test error of model with 31 variables.
# Plausibility Check:
# the output of best subset selection (regsubset) was that the test error decreases with every new added predictor up to 31 variables.
# Therefore test the linear regression by including all parameters and compare the test errors:
lm_2 <- lm(int_rate~., data = train_data) # linear regression with all predictors
summary(lm_2)
##
## Call:
## lm(formula = int_rate ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.519 -0.872 0.036 0.865 4.305
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) 802.2066801977
## loan_amnt -0.0003086859
## revol_bal -0.0000020704
## revol_util 0.0056891673
## funded_amnt_inv 0.0003121423
## annual_inc -0.0000002850
## pub_rec_bankruptcies 0.0751329199
## term60 months 0.2132050242
## pymnt_plany -0.1552257175
## initial_list_statusw -0.0860336871
## application_typeJoint App 0.0207964906
## hardship_flagY NA
## debt_settlement_flagY -0.0415066104
## verification_statusSource Verified 0.0602749870
## verification_statusVerified 0.2818644813
## home_ownershipOTHER -0.9907899725
## home_ownershipOWN 0.0183835613
## home_ownershipRENT 0.0360787957
## acc_now_delinq 0.2050850016
## gradeB 3.4381638127
## gradeC 6.5588592529
## gradeD 9.6845363379
## gradeE 12.4093353062
## gradeF 15.9656363201
## gradeG 18.1417385984
## collections_12_mths_ex_med -0.0220253352
## loan_statusCurrent -0.3646700318
## loan_statusDoes not meet the credit policy. Status:Charged Off -3.2233222795
## loan_statusDoes not meet the credit policy. Status:Fully Paid -3.5624382781
## loan_statusFully Paid -0.0902041231
## loan_statusIn Grace Period -0.3305893808
## loan_statusLate (16-30 days) -0.0825789745
## loan_statusLate (31-120 days) -0.1998072911
## chargeoff_within_12_mths 0.0931754742
## year_temp -0.3948021639
## Std. Error
## (Intercept) 8.5549017340
## loan_amnt 0.0000114455
## revol_bal 0.0000002183
## revol_util 0.0001990838
## funded_amnt_inv 0.0000114537
## annual_inc 0.0000000667
## pub_rec_bankruptcies 0.0123966239
## term60 months 0.0127306969
## pymnt_plany 0.5338359837
## initial_list_statusw 0.0095813271
## application_typeJoint App 0.1645470952
## hardship_flagY NA
## debt_settlement_flagY 0.0328428767
## verification_statusSource Verified 0.0111819662
## verification_statusVerified 0.0119250363
## home_ownershipOTHER 0.3729366090
## home_ownershipOWN 0.0154055692
## home_ownershipRENT 0.0095771804
## acc_now_delinq 0.0531781521
## gradeB 0.0139086212
## gradeC 0.0147744261
## gradeD 0.0171087015
## gradeE 0.0213715895
## gradeF 0.0316785085
## gradeG 0.0594618845
## collections_12_mths_ex_med 0.0327539262
## loan_statusCurrent 0.0217741818
## loan_statusDoes not meet the credit policy. Status:Charged Off 0.2059161940
## loan_statusDoes not meet the credit policy. Status:Fully Paid 0.1322630433
## loan_statusFully Paid 0.0127855330
## loan_statusIn Grace Period 0.1361802734
## loan_statusLate (16-30 days) 0.2112354846
## loan_statusLate (31-120 days) 0.0975745521
## chargeoff_within_12_mths 0.0382852794
## year_temp 0.0042487679
## t value
## (Intercept) 93.77
## loan_amnt -26.97
## revol_bal -9.49
## revol_util 28.58
## funded_amnt_inv 27.25
## annual_inc -4.27
## pub_rec_bankruptcies 6.06
## term60 months 16.75
## pymnt_plany -0.29
## initial_list_statusw -8.98
## application_typeJoint App 0.13
## hardship_flagY NA
## debt_settlement_flagY -1.26
## verification_statusSource Verified 5.39
## verification_statusVerified 23.64
## home_ownershipOTHER -2.66
## home_ownershipOWN 1.19
## home_ownershipRENT 3.77
## acc_now_delinq 3.86
## gradeB 247.20
## gradeC 443.93
## gradeD 566.06
## gradeE 580.65
## gradeF 503.99
## gradeG 305.10
## collections_12_mths_ex_med -0.67
## loan_statusCurrent -16.75
## loan_statusDoes not meet the credit policy. Status:Charged Off -15.65
## loan_statusDoes not meet the credit policy. Status:Fully Paid -26.93
## loan_statusFully Paid -7.06
## loan_statusIn Grace Period -2.43
## loan_statusLate (16-30 days) -0.39
## loan_statusLate (31-120 days) -2.05
## chargeoff_within_12_mths 2.43
## year_temp -92.92
## Pr(>|t|)
## (Intercept) < 0.0000000000000002
## loan_amnt < 0.0000000000000002
## revol_bal < 0.0000000000000002
## revol_util < 0.0000000000000002
## funded_amnt_inv < 0.0000000000000002
## annual_inc 0.0000194774277
## pub_rec_bankruptcies 0.0000000013615
## term60 months < 0.0000000000000002
## pymnt_plany 0.77122
## initial_list_statusw < 0.0000000000000002
## application_typeJoint App 0.89943
## hardship_flagY NA
## debt_settlement_flagY 0.20631
## verification_statusSource Verified 0.0000000705292
## verification_statusVerified < 0.0000000000000002
## home_ownershipOTHER 0.00789
## home_ownershipOWN 0.23275
## home_ownershipRENT 0.00017
## acc_now_delinq 0.00012
## gradeB < 0.0000000000000002
## gradeC < 0.0000000000000002
## gradeD < 0.0000000000000002
## gradeE < 0.0000000000000002
## gradeF < 0.0000000000000002
## gradeG < 0.0000000000000002
## collections_12_mths_ex_med 0.50130
## loan_statusCurrent < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Charged Off < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Fully Paid < 0.0000000000000002
## loan_statusFully Paid 0.0000000000017
## loan_statusIn Grace Period 0.01520
## loan_statusLate (16-30 days) 0.69585
## loan_statusLate (31-120 days) 0.04059
## chargeoff_within_12_mths 0.01495
## year_temp < 0.0000000000000002
##
## (Intercept) ***
## loan_amnt ***
## revol_bal ***
## revol_util ***
## funded_amnt_inv ***
## annual_inc ***
## pub_rec_bankruptcies ***
## term60 months ***
## pymnt_plany
## initial_list_statusw ***
## application_typeJoint App
## hardship_flagY
## debt_settlement_flagY
## verification_statusSource Verified ***
## verification_statusVerified ***
## home_ownershipOTHER **
## home_ownershipOWN
## home_ownershipRENT ***
## acc_now_delinq ***
## gradeB ***
## gradeC ***
## gradeD ***
## gradeE ***
## gradeF ***
## gradeG ***
## collections_12_mths_ex_med
## loan_statusCurrent ***
## loan_statusDoes not meet the credit policy. Status:Charged Off ***
## loan_statusDoes not meet the credit policy. Status:Fully Paid ***
## loan_statusFully Paid ***
## loan_statusIn Grace Period *
## loan_statusLate (16-30 days)
## loan_statusLate (31-120 days) *
## chargeoff_within_12_mths *
## year_temp ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.17 on 72718 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.928
## F-statistic: 2.85e+04 on 33 and 72718 DF, p-value: <0.0000000000000002
lm_2_pred <- predict(lm_2, test_data) # predict int_rate based on test data
## Warning in predict.lm(lm_2, test_data): prediction from a rank-deficient
## fit may be misleading
test_errors_2 <- mean((test_data$int_rate-lm_2_pred)^2) # calculate test error
test_errors_2 # test error: 1.372
## [1] 1.372
# test error is equal to previous test error that was found with best subset selection
Model Evaluation:
Models with various complexity with 1 predictor up to 33 predictors were created with best subset selection and were compared with the validation set approach. By analysing the plots of metrics like adjusted R^2, Cp and BIC, ones would think that the model does not improve much anymore with more than 6 predictors. So in a first step the test MSE was calculated for a model with 6 predictors. The model with 6 predictors resulted in a test MSE of 3.015. In a next step the test MSE of all 33 models were compared. The plot of all test MSE shows that the test MSE decreases with every new predictor up to the model with 31 predictors. The test MSE of the model with 31 predictors is 1.372, which is much smaller than the test MSE that was found for a model with 6 predictors.
# using method from http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/
# creating matrix
x=model.matrix(int_rate~., d)[, -1]
y=d$int_rate
# Predictor variables train data
x_train <- model.matrix(int_rate~., train_data)[,-1]
# Outcome variable train data
y_train <- train_data$int_rate
# Find the best lambda using cross-validation
set.seed(22)
cv <- cv.glmnet(x_train, y_train, alpha = 0)
# Display the best lambda value using cross-validation
cv$lambda.min
## [1] 0.2158
# Fit the final model on the training data
model_train <- glmnet(x_train, y_train, alpha = 0, lambda = cv$lambda.min)
model_train
##
## Call: glmnet(x = x_train, y = y_train, alpha = 0, lambda = cv$lambda.min)
##
## Df %Dev Lambda
## [1,] 34 0.91 0.216
# Display regression coefficients
coef(model_train)
## 35 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 673.4197663501
## loan_amnt -0.0000046055
## revol_bal -0.0000044971
## revol_util 0.0126619106
## funded_amnt_inv 0.0000045518
## annual_inc -0.0000008879
## pub_rec_bankruptcies 0.2215014030
## term60 months 0.7320285124
## pymnt_plany -0.0981643392
## initial_list_statusw -0.2512668734
## application_typeJoint App 0.0556007472
## hardship_flagY -0.0122931063
## debt_settlement_flagY -0.0630768398
## verification_statusSource Verified 0.1401325336
## verification_statusVerified 0.4880856599
## home_ownershipOTHER -0.6575275493
## home_ownershipOWN 0.1088854645
## home_ownershipRENT 0.1433827279
## acc_now_delinq 0.4428531718
## gradeB 2.0721047336
## gradeC 4.9252150053
## gradeD 7.7942545127
## gradeE 10.2200880846
## gradeF 13.4885037816
## gradeG 15.5080386436
## collections_12_mths_ex_med 0.1223624955
## loan_statusCurrent -0.5565290140
## loan_statusDoes not meet the credit policy. Status:Charged Off -2.7720040355
## loan_statusDoes not meet the credit policy. Status:Fully Paid -3.1717638626
## loan_statusFully Paid -0.2917309393
## loan_statusIn Grace Period -0.4119790329
## loan_statusLate (16-30 days) -0.1106423141
## loan_statusLate (31-120 days) -0.2839238895
## chargeoff_within_12_mths 0.1845140558
## year_temp -0.3303316059
# Make predictions on the test data
x_test <- model.matrix(int_rate~., test_data)[,-1]
y_test <- test_data$int_rate
model_test <- model.matrix(int_rate ~., test_data)[,-1]
predictions <- model_train %>% predict(model_test) %>% as.vector()
# Model performance metrics
data.frame(
RMSE = caret::RMSE(predictions, test_data$int_rate),
Rsquare = caret::R2(predictions, test_data$int_rate)
)
# Mean squared error
mean((predictions - y_test)^2)
## [1] 1.708
# Refit ridge regression model on the full data set, using lambda chosen by cross-validation
modell_full_data_set <- glmnet(x, y, alpha = 0)
predict(modell_full_data_set, type="coefficients", s=cv$lambda.min)
## 36 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 666.0096710668
## loan_amnt -0.0000051354
## revol_bal -0.0000043680
## revol_util 0.0128394625
## funded_amnt_inv 0.0000050157
## annual_inc -0.0000009665
## pub_rec_bankruptcies 0.2287244653
## term60 months 0.7331639762
## pymnt_plany -0.2176853880
## initial_list_statusw -0.2564763252
## application_typeJoint App 0.0053602665
## hardship_flagY -0.2364632815
## debt_settlement_flagY -0.0616096764
## verification_statusSource Verified 0.1451127340
## verification_statusVerified 0.4829217869
## home_ownershipNONE -0.1701960729
## home_ownershipOTHER -0.5612204159
## home_ownershipOWN 0.1149555752
## home_ownershipRENT 0.1450360467
## acc_now_delinq 0.4480115481
## gradeB 2.0740732729
## gradeC 4.9276885532
## gradeD 7.7905392419
## gradeE 10.2029248536
## gradeF 13.4950919639
## gradeG 15.4736473629
## collections_12_mths_ex_med 0.0934593841
## loan_statusCurrent -0.5594135656
## loan_statusDoes not meet the credit policy. Status:Charged Off -2.6877638615
## loan_statusDoes not meet the credit policy. Status:Fully Paid -3.2454211614
## loan_statusFully Paid -0.2921106504
## loan_statusIn Grace Period -0.3971416022
## loan_statusLate (16-30 days) -0.2741674027
## loan_statusLate (31-120 days) -0.2827116135
## chargeoff_within_12_mths 0.1661220201
## year_temp -0.3266543517
print("MSE")
## [1] "MSE"
mean((predictions-test_data$int_rate)^2)
## [1] 1.708
Model Evaluation:
In general ridge regression reduces the variance of the ridge coefficient estimates by shrinking them towards zero. In order to do so, a good tuning parameter lambda needs to be selected to adjust the trade-off between coefficient shrinkage and RSS reduction. Lambda was optimized by using cross validation. The best lambda was determined to be 0.216.
The examination of the ridge coefficent estimates shows that the coefficents of the following predictors was shrinked to almost zero:
Highest coefficent were found for the following predictors:
The test MSE for the ridge regression model resulted in 1.708.
# using method from http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/
# Setup a grid range of lambda values:
lambda <- 10^seq(-3, 3, length = 100)
# Build the model
set.seed(22)
lasso <- train(
int_rate ~., data = train_data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = cv$lambda)
)
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
## 35 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 380.59983
## loan_amnt .
## revol_bal .
## revol_util 0.00816
## funded_amnt_inv .
## annual_inc .
## pub_rec_bankruptcies .
## term60 months 0.43160
## pymnt_plany .
## initial_list_statusw -0.03003
## application_typeJoint App .
## hardship_flagY .
## debt_settlement_flagY .
## verification_statusSource Verified .
## verification_statusVerified 0.14410
## home_ownershipOTHER .
## home_ownershipOWN .
## home_ownershipRENT .
## acc_now_delinq .
## gradeB 0.78062
## gradeC 3.82864
## gradeD 6.80696
## gradeE 9.22728
## gradeF 12.20369
## gradeG 12.86658
## collections_12_mths_ex_med .
## loan_statusCurrent .
## loan_statusDoes not meet the credit policy. Status:Charged Off .
## loan_statusDoes not meet the credit policy. Status:Fully Paid .
## loan_statusFully Paid .
## loan_statusIn Grace Period .
## loan_statusLate (16-30 days) .
## loan_statusLate (31-120 days) .
## chargeoff_within_12_mths .
## year_temp -0.18443
# Make predictions
predictions <- lasso %>% predict(test_data)
# Model prediction performance
data.frame(
RMSE = caret::RMSE(predictions, test_data$int_rate),
Rsquare = caret::R2(predictions, test_data$int_rate)
)
print("MSE")
## [1] "MSE"
mean((predictions-test_data$int_rate)^2)
## [1] 2.604
Model Evaluation:
Same as ridge regression, the lasso regression reduces the variance of the ridge coefficient estimates by shrinking them towards zero. In addition, lasso also has the side effect of reducing the number of predictors (Setting the coefficients to zero).
The lasso regression set the coefficents of the following parameters to zero (including dummy variables):
The test MSE for the lasso regression model resulted in 2.604.
model_comparison <- tibble(model = c("LM with LOOCV", "LM with Best Subset Selection", "Ridge Regression", "Lasso Regression"),
Test_MSE = c(1.375, 1.372, 1.708, 2.604),
R2 = c(0.9263, 0.928, 0.9174, 0.8867))
(model_comparison %<>% arrange(Test_MSE))
write.csv(model_comparison, file = "data/regression_methods_results.csv")
According to the table above the linear regression model with best subset selection has the smallest MSE (1.372) overall. The linear regression model with leave one out cross validation is nearly as good with a MSE of 1.375. The ridge and the lasso regression perform noticeable worse (MSE of 1.708 and 2.064). Details to the winnig model with best subset selection can be found below.
| Initial question | Answer |
|---|---|
| 1. Is at least one of the predictores X1, X2, …, Xp useful in predicting the response? | First, one could reduce the possible predictors in the data preprocessing part. One could remove 126 out of 147 possible predictors, because they would not explain the response at all (93 predictors with more than 5% missing values or just one level) or there were too many levels (more than 10). Some of the few 126 removed predictors were a human decision, especially if the predictors were not understable in regard to the response. Back to the original question: Yes, there were 31 predictors useful in predicting the response. Due to matrix transformation there are a lot of new dummy variables out of the 21 remaining predictors, for example grade, home_ownership or loan_status. |
| 2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? | No, only 31 of the preditors help to explain of the response. One could try to only take 23 predictors because one can reduce most of the test error with them. |
| 3. How well does the model fit the data? How accurate is the prediction? | Quite well. The remaining test error (MSE) is 1.375. |
One reasonable approach could involve testing non-linear regression models by checking the non-linearity of the response-predictor relationsships. Additionally one could further investigate in problems such as outliers, collinearity or residual analysis (correlation of error terms, non-constant variance of error terms, …). Finally one should better understand the business domain to improve model results and interpretation. Furthermore the numerical predictors could be standardized before applying various regression models, so that all predictors are on the same scale.
Our goal in the second part of the assignment is to predict if a new customer will be able to fully pay back their loans using a classification method. Therefore, we focus on the “concluded lends” in the data set, i.e. on all loans where the loan_status is not Current. To this end, we filter out all observations with loan_status == Current.
# load data
d_na_less_5_percent <- read.csv("data/d_na_less_5_percent.csv",stringsAsFactors = T)
# create subset with loan_status = current
d2_sub <- subset(d_na_less_5_percent,loan_status != "Current")
# check how much loan status varibales we have
d2_sub$loan_status <- gsub("Does not meet the credit policy. Status:Fully Paid", "Fully Paid", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Does not meet the credit policy. Status:Charged Off", "Fully Paid", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Current", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Charged Off", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("In Grace Period", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Late \\(31-120 days\\)", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Late \\(16-30 days\\)", "Defaulted", d2_sub$loan_status)
# create subset
d2_sub$loan_status <- as.factor(d2_sub$loan_status)
# replace home ownership "NONE" with "OTHER"
d2_sub$home_ownership <- as.factor(gsub("NONE", "OTHER", d2_sub$home_ownership))
d2_sub %<>% na.omit()
summary(d2_sub)
## funded_amnt_inv annual_inc pub_rec_bankruptcies last_pymnt_d
## Min. : 0 Min. : 5000 Min. :0.000 Mar-2017: 2550
## 1st Qu.: 8000 1st Qu.: 45000 1st Qu.:0.000 Mar-2016: 2460
## Median :12000 Median : 64000 Median :0.000 Jul-2016: 2280
## Mean :14379 Mean : 74643 Mean :0.122 Aug-2016: 2242
## 3rd Qu.:20000 3rd Qu.: 90000 3rd Qu.:0.000 Sep-2016: 2146
## Max. :35000 Max. :9500000 Max. :9.000 Jun-2016: 2145
## (Other) :87685
## revol_util title
## Min. : 0.0 Debt consolidation :47018
## 1st Qu.: 37.3 Credit card refinancing:18316
## Median : 55.6 Home improvement : 4474
## Mean : 54.8 Other : 3750
## 3rd Qu.: 73.2 Debt Consolidation : 1944
## Max. :155.3 Major purchase : 1421
## (Other) :24585
## collections_12_mths_ex_med chargeoff_within_12_mths tax_liens
## Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 0.00
## Median :0.000 Median :0.000 Median : 0.00
## Mean :0.014 Mean :0.009 Mean : 0.05
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.: 0.00
## Max. :6.000 Max. :9.000 Max. :48.00
##
## delinq_2yrs earliest_cr_line inq_last_6mths open_acc
## Min. : 0.00 Aug-2001: 772 Min. : 0.000 Min. : 1.0
## 1st Qu.: 0.00 Aug-2002: 745 1st Qu.: 0.000 1st Qu.: 8.0
## Median : 0.00 Aug-2000: 721 Median : 0.000 Median :11.0
## Mean : 0.31 Oct-2000: 720 Mean : 0.703 Mean :11.5
## 3rd Qu.: 0.00 Oct-2001: 693 3rd Qu.: 1.000 3rd Qu.:14.0
## Max. :29.00 Oct-1999: 677 Max. :12.000 Max. :90.0
## (Other) :97180
## pub_rec total_acc last_credit_pull_d acc_now_delinq
## Min. : 0.0 Min. : 2.0 Feb-2019:32989 Min. :0.000
## 1st Qu.: 0.0 1st Qu.: 17.0 Oct-2016: 5940 1st Qu.:0.000
## Median : 0.0 Median : 24.0 Jan-2019: 4273 Median :0.000
## Mean : 0.2 Mean : 25.3 Jul-2018: 3859 Mean :0.005
## 3rd Qu.: 0.0 3rd Qu.: 32.0 Aug-2018: 3399 3rd Qu.:0.000
## Max. :49.0 Max. :169.0 Oct-2018: 3300 Max. :5.000
## (Other) :47748
## delinq_amnt id_2 loan_amnt funded_amnt
## Min. : 0 Min. : 10 Min. : 725 Min. : 725
## 1st Qu.: 0 1st Qu.:229392 1st Qu.: 8000 1st Qu.: 8000
## Median : 0 Median :447734 Median :12000 Median :12000
## Mean : 15 Mean :445468 Mean :14404 Mean :14395
## 3rd Qu.: 0 3rd Qu.:662596 3rd Qu.:20000 3rd Qu.:20000
## Max. :65000 Max. :887434 Max. :35000 Max. :35000
##
## term int_rate installment grade
## 36 months:76047 Min. : 5.32 Min. : 22.5 A:17939
## 60 months:25461 1st Qu.: 9.80 1st Qu.: 255.0 B:29551
## Median :12.99 Median : 377.6 C:27724
## Mean :13.15 Mean : 435.0 D:15560
## 3rd Qu.:15.96 3rd Qu.: 571.6 E: 7594
## Max. :28.99 Max. :1408.1 F: 2560
## G: 580
## sub_grade emp_length home_ownership
## B3 : 6480 10+ years:33298 MORTGAGE:50469
## B4 : 6443 2 years : 9024 OTHER : 15
## C1 : 6202 < 1 year : 8155 OWN : 9929
## C2 : 5922 3 years : 8016 RENT :41095
## B2 : 5797 1 year : 6751
## C3 : 5643 5 years : 6396
## (Other):65021 (Other) :29868
## verification_status issue_d loan_status pymnt_plan
## Not Verified :29655 Oct-2015: 5361 Defaulted :18862 n:101499
## Source Verified:38237 Jul-2015: 5116 Fully Paid:82646 y: 9
## Verified :33616 Dec-2015: 4828
## Oct-2014: 4449
## Nov-2015: 4116
## Aug-2015: 3962
## (Other) :73676
## purpose zip_code addr_state
## debt_consolidation:60009 945xx : 1188 CA :15280
## credit_card :23344 112xx : 1088 NY : 8443
## home_improvement : 5871 750xx : 1057 TX : 8184
## other : 4955 606xx : 1011 FL : 6876
## major_purchase : 2011 300xx : 899 IL : 4071
## small_business : 1189 900xx : 895 NJ : 3753
## (Other) : 4129 (Other):95370 (Other):54901
## dti revol_bal initial_list_status out_prncp
## Min. : 0.0 Min. : 0 f:53757 Min. : 0
## 1st Qu.: 11.8 1st Qu.: 6319 w:47751 1st Qu.: 0
## Median : 17.5 Median : 11557 Median : 0
## Mean : 18.0 Mean : 16558 Mean : 28
## 3rd Qu.: 23.8 3rd Qu.: 20277 3rd Qu.: 0
## Max. :100.1 Max. :1298783 Max. :22039
##
## out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp
## Min. : 0 Min. : 36 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 7577 1st Qu.: 7559 1st Qu.: 6000
## Median : 0 Median :12956 Median :12942 Median :10000
## Mean : 28 Mean :15485 Mean :15468 Mean :12464
## 3rd Qu.: 0 3rd Qu.:21307 3rd Qu.:21290 3rd Qu.:17925
## Max. :22007 Max. :62708 Max. :62687 Max. :35000
##
## total_rec_int total_rec_late_fee recoveries
## Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 978 1st Qu.: 0.0 1st Qu.: 0
## Median : 1860 Median : 0.0 Median : 0
## Mean : 2786 Mean : 1.5 Mean : 234
## 3rd Qu.: 3487 3rd Qu.: 0.0 3rd Qu.: 0
## Max. :27708 Max. :864.0 Max. :32321
##
## collection_recovery_fee last_pymnt_amnt policy_code application_type
## Min. : 0 Min. : 0 Min. :1 Individual:101452
## 1st Qu.: 0 1st Qu.: 354 1st Qu.:1 Joint App : 56
## Median : 0 Median : 1103 Median :1
## Mean : 38 Mean : 4498 Mean :1
## 3rd Qu.: 0 3rd Qu.: 6743 3rd Qu.:1
## Max. :5432 Max. :36170 Max. :1
##
## hardship_flag disbursement_method debt_settlement_flag year_temp
## N:101499 Cash:101508 N:99364 Min. :2007
## Y: 9 Y: 2144 1st Qu.:2013
## Median :2014
## Mean :2014
## 3rd Qu.:2015
## Max. :2015
##
We have converted all loan_status that are not fully paid into defaulted. We have also replaced “home_ownership” NONE with OTHER.
We now transform the dataset using the Principal component analaysis (PCA). The following code prepares the data for the PCA. The PCA compution (prcomp) does not allow any variables to be factors or any NAs we therefore transform the factors to numeric and remove the NAs. Also we reduce the levels of factors to maximum 10. Then we create a test and training set and we transform the base of the training set using PCA.
# load previous df
df.pca <- d2_sub
# str(df.pca) # check
# remove factors larger than 10
plus10levels <- df.pca[, sapply(df.pca, function(col) class(col) == "factor")]
plus10levels <- plus10levels[, sapply(plus10levels, function(col) length(unique(col))) > 10]
rmv.col <- colnames(plus10levels)
str(rmv.col) # columns to be removed
## chr [1:10] "last_pymnt_d" "title" "earliest_cr_line" ...
df.pca <- df.pca[ , !(names(df.pca) %in% rmv.col)]
# str(df.pca) # check
# transform factors to numeric, except for loan_status
str(df.pca)
## 'data.frame': 101508 obs. of 44 variables:
## $ funded_amnt_inv : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ annual_inc : int 104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
## $ pub_rec_bankruptcies : int 0 0 0 1 0 0 0 0 0 0 ...
## $ revol_util : num 51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_2yrs : int 0 1 0 0 0 1 0 0 0 0 ...
## $ inq_last_6mths : int 2 3 1 2 2 1 0 0 1 0 ...
## $ open_acc : int 14 12 30 19 10 10 12 16 12 20 ...
## $ pub_rec : int 0 0 0 1 0 0 0 0 0 0 ...
## $ total_acc : int 31 35 47 36 16 34 27 24 17 42 ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ id_2 : int 777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
## $ loan_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ funded_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ term : Factor w/ 2 levels "36 months","60 months": 1 2 1 2 1 1 1 1 1 1 ...
## $ int_rate : num 10.78 22.45 5.32 16.59 13.99 ...
## $ installment : num 1097 290 452 246 144 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 6 1 4 3 1 2 1 1 1 ...
## $ home_ownership : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 1 4 4 4 1 1 1 3 1 ...
## $ verification_status : Factor w/ 3 levels "Not Verified",..: 2 2 1 2 2 1 1 1 1 2 ...
## $ loan_status : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ dti : num 14 25.4 20.2 30.8 31 ...
## $ revol_bal : int 28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
## $ initial_list_status : Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
## $ total_pymnt : num 34309 11740 16241 10758 5036 ...
## $ total_pymnt_inv : num 34309 11740 16241 10758 5036 ...
## $ total_rec_prncp : num 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ total_rec_int : num 709 1340 1241 758 836 ...
## $ total_rec_late_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ recoveries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ collection_recovery_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ last_pymnt_amnt : num 33263 10129 2244 9795 1743 ...
## $ policy_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ application_type : Factor w/ 2 levels "Individual","Joint App": 1 1 1 1 1 1 1 1 1 1 ...
## $ hardship_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ disbursement_method : Factor w/ 1 level "Cash": 1 1 1 1 1 1 1 1 1 1 ...
## $ debt_settlement_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ year_temp : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
for(col in 1:ncol(df.pca)){
if(class(df.pca[,col]) %in% c('factor') && colnames(df.pca)[col] != 'loan_status'){
df.pca[,col] <- as.numeric(df.pca[,col])
}
}
# str(df.pca) # check
# remove NAs
drop_na(df.pca)
str(df.pca)
## 'data.frame': 101508 obs. of 44 variables:
## $ funded_amnt_inv : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ annual_inc : int 104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
## $ pub_rec_bankruptcies : int 0 0 0 1 0 0 0 0 0 0 ...
## $ revol_util : num 51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_2yrs : int 0 1 0 0 0 1 0 0 0 0 ...
## $ inq_last_6mths : int 2 3 1 2 2 1 0 0 1 0 ...
## $ open_acc : int 14 12 30 19 10 10 12 16 12 20 ...
## $ pub_rec : int 0 0 0 1 0 0 0 0 0 0 ...
## $ total_acc : int 31 35 47 36 16 34 27 24 17 42 ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ id_2 : int 777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
## $ loan_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ funded_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ term : num 1 2 1 2 1 1 1 1 1 1 ...
## $ int_rate : num 10.78 22.45 5.32 16.59 13.99 ...
## $ installment : num 1097 290 452 246 144 ...
## $ grade : num 2 6 1 4 3 1 2 1 1 1 ...
## $ home_ownership : num 4 1 4 4 4 1 1 1 3 1 ...
## $ verification_status : num 2 2 1 2 2 1 1 1 1 2 ...
## $ loan_status : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ pymnt_plan : num 1 1 1 1 1 1 1 1 1 1 ...
## $ dti : num 14 25.4 20.2 30.8 31 ...
## $ revol_bal : int 28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
## $ initial_list_status : num 2 2 2 2 2 2 1 2 2 2 ...
## $ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
## $ total_pymnt : num 34309 11740 16241 10758 5036 ...
## $ total_pymnt_inv : num 34309 11740 16241 10758 5036 ...
## $ total_rec_prncp : num 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ total_rec_int : num 709 1340 1241 758 836 ...
## $ total_rec_late_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ recoveries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ collection_recovery_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ last_pymnt_amnt : num 33263 10129 2244 9795 1743 ...
## $ policy_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ application_type : num 1 1 1 1 1 1 1 1 1 1 ...
## $ hardship_flag : num 1 1 1 1 1 1 1 1 1 1 ...
## $ disbursement_method : num 1 1 1 1 1 1 1 1 1 1 ...
## $ debt_settlement_flag : num 1 1 1 1 1 1 1 1 1 1 ...
## $ year_temp : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
# check dimnames, mean and variance
dimnames(df.pca)[[2]]
## [1] "funded_amnt_inv" "annual_inc"
## [3] "pub_rec_bankruptcies" "revol_util"
## [5] "collections_12_mths_ex_med" "chargeoff_within_12_mths"
## [7] "tax_liens" "delinq_2yrs"
## [9] "inq_last_6mths" "open_acc"
## [11] "pub_rec" "total_acc"
## [13] "acc_now_delinq" "delinq_amnt"
## [15] "id_2" "loan_amnt"
## [17] "funded_amnt" "term"
## [19] "int_rate" "installment"
## [21] "grade" "home_ownership"
## [23] "verification_status" "loan_status"
## [25] "pymnt_plan" "dti"
## [27] "revol_bal" "initial_list_status"
## [29] "out_prncp" "out_prncp_inv"
## [31] "total_pymnt" "total_pymnt_inv"
## [33] "total_rec_prncp" "total_rec_int"
## [35] "total_rec_late_fee" "recoveries"
## [37] "collection_recovery_fee" "last_pymnt_amnt"
## [39] "policy_code" "application_type"
## [41] "hardship_flag" "disbursement_method"
## [43] "debt_settlement_flag" "year_temp"
apply(df.pca,2,var)
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
## funded_amnt_inv annual_inc
## 70379710.85512327 5044374313.91709137
## pub_rec_bankruptcies revol_util
## 0.13152107 564.14696066
## collections_12_mths_ex_med chargeoff_within_12_mths
## 0.01703437 0.01214732
## tax_liens delinq_2yrs
## 0.15101847 0.71781136
## inq_last_6mths open_acc
## 0.97899224 28.14901467
## pub_rec total_acc
## 0.34206417 141.58278152
## acc_now_delinq delinq_amnt
## 0.00656248 624566.97294738
## id_2 loan_amnt
## 63602829078.21736145 70458348.47642767
## funded_amnt term
## 70361904.80439132 0.18791493
## int_rate installment
## 19.34605973 61766.92356414
## grade home_ownership
## 1.70985230 2.04605086
## verification_status loan_status
## 0.62179392 NA
## pymnt_plan dti
## 0.00008866 68.00934810
## revol_bal initial_list_status
## 489147939.65753955 0.24912725
## out_prncp out_prncp_inv
## 273519.32762228 273208.22961914
## total_pymnt total_pymnt_inv
## 106393361.73141186 106374750.73708099
## total_rec_prncp total_rec_int
## 72449474.50050333 8253374.70288832
## total_rec_late_fee recoveries
## 129.69762222 893639.07984153
## collection_recovery_fee last_pymnt_amnt
## 27064.96883563 39291930.08692809
## policy_code application_type
## 0.00000000 0.00055138
## hardship_flag disbursement_method
## 0.00008866 0.00000000
## debt_settlement_flag year_temp
## 0.02067557 1.40756346
summary(df.pca)
## funded_amnt_inv annual_inc pub_rec_bankruptcies revol_util
## Min. : 0 Min. : 5000 Min. :0.000 Min. : 0.0
## 1st Qu.: 8000 1st Qu.: 45000 1st Qu.:0.000 1st Qu.: 37.3
## Median :12000 Median : 64000 Median :0.000 Median : 55.6
## Mean :14379 Mean : 74643 Mean :0.122 Mean : 54.8
## 3rd Qu.:20000 3rd Qu.: 90000 3rd Qu.:0.000 3rd Qu.: 73.2
## Max. :35000 Max. :9500000 Max. :9.000 Max. :155.3
## collections_12_mths_ex_med chargeoff_within_12_mths tax_liens
## Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.: 0.00
## Median :0.000 Median :0.000 Median : 0.00
## Mean :0.014 Mean :0.009 Mean : 0.05
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.: 0.00
## Max. :6.000 Max. :9.000 Max. :48.00
## delinq_2yrs inq_last_6mths open_acc pub_rec
## Min. : 0.00 Min. : 0.000 Min. : 1.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 8.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.000 Median :11.0 Median : 0.0
## Mean : 0.31 Mean : 0.703 Mean :11.5 Mean : 0.2
## 3rd Qu.: 0.00 3rd Qu.: 1.000 3rd Qu.:14.0 3rd Qu.: 0.0
## Max. :29.00 Max. :12.000 Max. :90.0 Max. :49.0
## total_acc acc_now_delinq delinq_amnt id_2
## Min. : 2.0 Min. :0.000 Min. : 0 Min. : 10
## 1st Qu.: 17.0 1st Qu.:0.000 1st Qu.: 0 1st Qu.:229392
## Median : 24.0 Median :0.000 Median : 0 Median :447734
## Mean : 25.3 Mean :0.005 Mean : 15 Mean :445468
## 3rd Qu.: 32.0 3rd Qu.:0.000 3rd Qu.: 0 3rd Qu.:662596
## Max. :169.0 Max. :5.000 Max. :65000 Max. :887434
## loan_amnt funded_amnt term int_rate
## Min. : 725 Min. : 725 Min. :1.00 Min. : 5.32
## 1st Qu.: 8000 1st Qu.: 8000 1st Qu.:1.00 1st Qu.: 9.80
## Median :12000 Median :12000 Median :1.00 Median :12.99
## Mean :14404 Mean :14395 Mean :1.25 Mean :13.15
## 3rd Qu.:20000 3rd Qu.:20000 3rd Qu.:2.00 3rd Qu.:15.96
## Max. :35000 Max. :35000 Max. :2.00 Max. :28.99
## installment grade home_ownership verification_status
## Min. : 22.5 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.: 255.0 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:1.00
## Median : 377.6 Median :3.00 Median :3.00 Median :2.00
## Mean : 435.0 Mean :2.76 Mean :2.41 Mean :2.04
## 3rd Qu.: 571.6 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:3.00
## Max. :1408.1 Max. :7.00 Max. :4.00 Max. :3.00
## loan_status pymnt_plan dti revol_bal
## Defaulted :18862 Min. :1 Min. : 0.0 Min. : 0
## Fully Paid:82646 1st Qu.:1 1st Qu.: 11.8 1st Qu.: 6319
## Median :1 Median : 17.5 Median : 11557
## Mean :1 Mean : 18.0 Mean : 16558
## 3rd Qu.:1 3rd Qu.: 23.8 3rd Qu.: 20277
## Max. :2 Max. :100.1 Max. :1298783
## initial_list_status out_prncp out_prncp_inv total_pymnt
## Min. :1.00 Min. : 0 Min. : 0 Min. : 36
## 1st Qu.:1.00 1st Qu.: 0 1st Qu.: 0 1st Qu.: 7577
## Median :1.00 Median : 0 Median : 0 Median :12956
## Mean :1.47 Mean : 28 Mean : 28 Mean :15485
## 3rd Qu.:2.00 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.:21307
## Max. :2.00 Max. :22039 Max. :22007 Max. :62708
## total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 7559 1st Qu.: 6000 1st Qu.: 978 1st Qu.: 0.0
## Median :12942 Median :10000 Median : 1860 Median : 0.0
## Mean :15468 Mean :12464 Mean : 2786 Mean : 1.5
## 3rd Qu.:21290 3rd Qu.:17925 3rd Qu.: 3487 3rd Qu.: 0.0
## Max. :62687 Max. :35000 Max. :27708 Max. :864.0
## recoveries collection_recovery_fee last_pymnt_amnt policy_code
## Min. : 0 Min. : 0 Min. : 0 Min. :1
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 354 1st Qu.:1
## Median : 0 Median : 0 Median : 1103 Median :1
## Mean : 234 Mean : 38 Mean : 4498 Mean :1
## 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.: 6743 3rd Qu.:1
## Max. :32321 Max. :5432 Max. :36170 Max. :1
## application_type hardship_flag disbursement_method debt_settlement_flag
## Min. :1 Min. :1 Min. :1 Min. :1.00
## 1st Qu.:1 1st Qu.:1 1st Qu.:1 1st Qu.:1.00
## Median :1 Median :1 Median :1 Median :1.00
## Mean :1 Mean :1 Mean :1 Mean :1.02
## 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1 3rd Qu.:1.00
## Max. :2 Max. :2 Max. :1 Max. :2.00
## year_temp
## Min. :2007
## 1st Qu.:2013
## Median :2014
## Mean :2014
## 3rd Qu.:2015
## Max. :2015
# looks like policy_code and disbursement_method have variance 0 and mean 1, we can remove these two variables
df.pca <- subset(df.pca, select = -c(policy_code,disbursement_method))
# str(df.pca) # check
# devide and conquer 80% train, 20% test data
p80 <- nrow(df.pca)*0.8
set.seed(13)
indices <- sort(sample(1:nrow(df.pca), p80)) # select 100 random samples
test.pca <- df.pca[-indices,]
train.pca <- df.pca[indices,]
train.log.pca <- train.pca
test.log.pca <- test.pca
# remove loan_status from test and train set
y.train.pca <- train.pca$loan_status
y.test.pca <- test.pca$loan_status
train.pca <- subset(train.pca, select = -c(loan_status))
test.pca <- subset(test.pca, select = -c(loan_status))
# str(df.pca) # check
# conduct PCA
pca.out = prcomp(train.pca, scale=T)
pca.out$rotation
## PC1 PC2 PC3 PC4
## funded_amnt_inv 0.3407943 0.012278 -0.01526673 0.052398
## annual_inc 0.1136536 0.093545 -0.00467066 0.114199
## pub_rec_bankruptcies -0.0464533 -0.077944 0.01267195 0.023158
## revol_util 0.0568166 -0.121046 -0.02000963 -0.162366
## collections_12_mths_ex_med -0.0061810 -0.014579 -0.00003231 0.033622
## chargeoff_within_12_mths 0.0016584 -0.005383 -0.00109124 0.020972
## tax_liens -0.0016403 -0.026564 0.01669991 0.038721
## delinq_2yrs 0.0074521 -0.026839 -0.00011600 0.060745
## inq_last_6mths 0.0019968 -0.126173 -0.02625197 -0.067290
## open_acc 0.0886699 0.010373 -0.00176069 0.303670
## pub_rec -0.0358108 -0.075615 0.02010241 0.044061
## total_acc 0.0987534 0.023219 -0.00039717 0.291341
## acc_now_delinq 0.0042213 -0.016788 -0.00191433 0.020997
## delinq_amnt 0.0003938 -0.006680 -0.00188756 0.012416
## id_2 -0.0112269 -0.016593 -0.04184998 -0.245852
## loan_amnt 0.3407600 0.012114 -0.01559505 0.050310
## funded_amnt 0.3408557 0.012304 -0.01545427 0.051249
## term 0.1539785 -0.252114 0.00853078 -0.096273
## int_rate 0.0791947 -0.426646 -0.03879709 -0.296917
## installment 0.3217352 0.043233 -0.02681402 0.049944
## grade 0.0788170 -0.426602 -0.03131353 -0.258908
## home_ownership -0.0814892 -0.067576 -0.00007015 -0.106877
## verification_status 0.1182896 -0.130401 -0.02250034 -0.074315
## pymnt_plan 0.0061424 -0.040921 0.47740147 0.030438
## dti 0.0268096 -0.129824 -0.00476301 0.116630
## revol_bal 0.1362472 0.067889 -0.01252753 0.119420
## initial_list_status 0.0215974 0.052381 0.04650319 0.215415
## out_prncp 0.0272044 -0.088659 0.49412730 0.009560
## out_prncp_inv 0.0272075 -0.088661 0.49417126 0.009569
## total_pymnt 0.3338835 0.099368 0.00329541 -0.074675
## total_pymnt_inv 0.3338595 0.099295 0.00344712 -0.073616
## total_rec_prncp 0.3093545 0.218345 0.00852315 -0.075873
## total_rec_int 0.2689073 -0.166832 0.02683691 -0.153373
## total_rec_late_fee 0.0335687 -0.096442 0.09072128 0.046985
## recoveries 0.0412889 -0.377419 -0.12300545 0.331975
## collection_recovery_fee 0.0417152 -0.370781 -0.12123666 0.341533
## last_pymnt_amnt 0.1900513 0.118362 -0.01883821 -0.108174
## application_type 0.0060405 -0.004824 0.00489678 0.005075
## hardship_flag 0.0061424 -0.040921 0.47740147 0.030438
## debt_settlement_flag 0.0106025 -0.229287 -0.05841299 0.241378
## year_temp 0.0259978 0.012367 0.05083276 0.318218
## PC5 PC6 PC7 PC8
## funded_amnt_inv -0.017297 0.089203 -0.0180491 0.008946
## annual_inc 0.008011 0.013898 -0.1062135 0.133794
## pub_rec_bankruptcies 0.456425 0.087693 -0.1034899 0.056843
## revol_util -0.125350 0.034861 0.1235473 -0.144925
## collections_12_mths_ex_med 0.042988 -0.034280 0.0269903 -0.035104
## chargeoff_within_12_mths 0.017255 -0.062798 0.0116576 -0.009641
## tax_liens 0.426521 0.182158 -0.1101874 0.082456
## delinq_2yrs 0.035671 -0.141093 0.0095877 0.017114
## inq_last_6mths 0.102780 -0.246345 -0.1841422 0.159520
## open_acc 0.071830 -0.487799 -0.0620000 0.096417
## pub_rec 0.613098 0.184108 -0.1440733 0.091651
## total_acc 0.101429 -0.474292 -0.1202433 0.149519
## acc_now_delinq 0.021787 -0.071341 0.0132533 -0.009723
## delinq_amnt 0.008099 -0.023356 0.0030506 -0.002916
## id_2 -0.124541 -0.047411 -0.3240296 0.331009
## loan_amnt -0.018519 0.088677 -0.0206835 0.011797
## funded_amnt -0.018012 0.088948 -0.0196445 0.010666
## term 0.034208 -0.083803 0.0859856 -0.072691
## int_rate 0.083001 -0.162254 0.0462808 -0.102524
## installment -0.019863 0.101305 -0.0478646 0.024157
## grade 0.104336 -0.157846 0.0853233 -0.139455
## home_ownership -0.025141 0.127318 0.0870005 -0.116249
## verification_status 0.076136 -0.040901 -0.0143728 -0.029156
## pymnt_plan -0.068016 -0.002952 -0.3975665 -0.325553
## dti -0.003603 -0.331337 0.1576499 -0.169634
## revol_bal -0.092387 -0.117515 0.0004494 0.014664
## initial_list_status 0.093662 0.048684 0.2366678 -0.239858
## out_prncp -0.018259 0.027454 0.3053913 0.356268
## out_prncp_inv -0.018265 0.027453 0.3053421 0.356226
## total_pymnt 0.020485 0.048167 -0.0063352 -0.002237
## total_pymnt_inv 0.021090 0.048424 -0.0049674 -0.003710
## total_rec_prncp 0.039844 0.037318 -0.0073048 -0.018075
## total_rec_int 0.011127 -0.006520 0.0405860 0.021043
## total_rec_late_fee -0.020605 0.061190 0.0750292 0.117555
## recoveries -0.168793 0.208003 -0.1268302 0.073338
## collection_recovery_fee -0.165390 0.212863 -0.1209136 0.068295
## last_pymnt_amnt 0.102709 -0.070254 0.0138976 -0.115969
## application_type 0.009669 0.003359 0.0262662 -0.019715
## hardship_flag -0.068016 -0.002952 -0.3975665 -0.325553
## debt_settlement_flag -0.109664 0.156859 -0.0492498 0.059989
## year_temp 0.192668 0.041539 0.3443643 -0.359657
## PC9 PC10 PC11 PC12
## funded_amnt_inv -0.008053 0.02641072 -0.029061 -0.045177
## annual_inc -0.111425 -0.20740525 0.315011 0.368963
## pub_rec_bankruptcies 0.059931 0.13784234 -0.076196 0.042912
## revol_util 0.260096 -0.51808011 0.127537 0.110010
## collections_12_mths_ex_med -0.217743 -0.01356013 0.175679 -0.321799
## chargeoff_within_12_mths -0.329826 -0.11437861 0.319184 -0.326300
## tax_liens 0.098964 -0.30713214 0.024342 -0.060901
## delinq_2yrs -0.490416 -0.23603147 0.223489 -0.127567
## inq_last_6mths -0.091416 0.15624991 0.128293 0.073729
## open_acc 0.124245 0.05110830 -0.086281 -0.092016
## pub_rec 0.101042 -0.11466744 -0.028563 -0.013883
## total_acc 0.040665 0.04919569 -0.020595 -0.032503
## acc_now_delinq -0.429680 -0.25436691 -0.447748 0.123925
## delinq_amnt -0.311247 -0.21531612 -0.552508 0.220290
## id_2 0.034050 -0.00009531 -0.063408 -0.001975
## loan_amnt -0.007653 0.02651344 -0.029480 -0.045155
## funded_amnt -0.007805 0.02644688 -0.029342 -0.045314
## term -0.051508 0.23909035 0.033155 0.266016
## int_rate -0.038776 -0.00786048 0.060568 0.061136
## installment 0.002893 -0.05218896 -0.036245 -0.132718
## grade -0.050284 0.00815234 0.070520 0.073235
## home_ownership 0.031574 0.05260739 -0.225847 -0.404481
## verification_status 0.024326 -0.06675839 -0.022898 -0.148805
## pymnt_plan -0.005291 -0.01393971 0.004192 -0.004205
## dti 0.328088 -0.17583629 -0.218207 -0.241862
## revol_bal 0.237416 -0.38364358 0.114505 0.232700
## initial_list_status -0.073699 0.15575840 0.078527 0.225932
## out_prncp 0.013724 0.02281877 -0.018197 0.017919
## out_prncp_inv 0.013720 0.02281824 -0.018199 0.017908
## total_pymnt -0.006424 0.02449277 -0.040861 -0.066093
## total_pymnt_inv -0.006655 0.02448290 -0.040601 -0.065959
## total_rec_prncp -0.004247 0.02678977 -0.041728 -0.064628
## total_rec_int -0.006647 -0.00462334 -0.020290 -0.052822
## total_rec_late_fee -0.060729 -0.14468610 0.101408 -0.166651
## recoveries -0.010968 0.04151167 -0.009639 0.023058
## collection_recovery_fee -0.011523 0.04418134 -0.011824 0.019165
## last_pymnt_amnt -0.062897 0.22300878 0.022419 0.189721
## application_type 0.035182 -0.01146990 -0.063404 -0.062173
## hardship_flag -0.005291 -0.01393971 0.004192 -0.004205
## debt_settlement_flag 0.007439 0.04350607 -0.049886 -0.034320
## year_temp -0.062051 0.03424975 0.068992 0.041827
## PC13 PC14 PC15 PC16
## funded_amnt_inv 0.005690 0.0004935 0.004970 0.015379
## annual_inc 0.217679 -0.1632597 0.090962 0.065295
## pub_rec_bankruptcies 0.016589 -0.0542558 0.029519 0.342323
## revol_util -0.072161 0.0319784 0.044861 0.105163
## collections_12_mths_ex_med 0.038654 0.0187004 0.808957 -0.319059
## chargeoff_within_12_mths -0.287934 0.1655876 -0.130843 0.462681
## tax_liens -0.111187 0.1386998 -0.033842 -0.327927
## delinq_2yrs -0.130761 0.0451098 -0.219508 -0.050387
## inq_last_6mths 0.344008 -0.3062071 0.025690 0.076764
## open_acc 0.033614 0.0353415 -0.030914 -0.073642
## pub_rec -0.062419 0.0550936 0.000748 0.006953
## total_acc -0.038226 0.0428841 -0.047470 -0.011202
## acc_now_delinq -0.010956 0.0134888 -0.031796 -0.035818
## delinq_amnt 0.069652 -0.0275310 0.149940 0.076283
## id_2 -0.265809 0.1503580 0.027574 -0.144313
## loan_amnt 0.004953 0.0009471 0.005328 0.014990
## funded_amnt 0.005459 0.0005764 0.005165 0.015296
## term -0.228929 0.1884706 -0.002476 -0.175663
## int_rate 0.052190 -0.0292966 -0.004262 -0.010284
## installment 0.089927 -0.0671848 0.004483 0.072554
## grade 0.064212 -0.0376278 -0.004972 -0.016604
## home_ownership 0.262901 -0.0277765 -0.036282 0.073327
## verification_status 0.158892 -0.1550404 0.130551 0.259149
## pymnt_plan 0.002631 -0.0035480 -0.010594 -0.011014
## dti -0.118839 0.0973486 -0.006108 0.003377
## revol_bal 0.052389 -0.0166253 0.155475 0.090927
## initial_list_status -0.130560 0.0680905 -0.020706 -0.191764
## out_prncp -0.030848 0.0210329 0.069423 0.085250
## out_prncp_inv -0.030844 0.0210247 0.069409 0.085230
## total_pymnt -0.011667 0.0148368 -0.027411 -0.014933
## total_pymnt_inv -0.011425 0.0147311 -0.027570 -0.014841
## total_rec_prncp 0.001198 -0.0001014 -0.019055 0.011591
## total_rec_int -0.028918 0.0451978 -0.057352 -0.103422
## total_rec_late_fee 0.316789 -0.2053185 -0.414554 -0.457968
## recoveries -0.054434 0.0285712 0.051106 0.050935
## collection_recovery_fee -0.054167 0.0292801 0.046311 0.049598
## last_pymnt_amnt -0.099397 0.0552343 0.028143 0.017469
## application_type -0.553849 -0.8117762 0.022387 -0.066305
## hardship_flag 0.002631 -0.0035480 -0.010594 -0.011014
## debt_settlement_flag -0.016198 0.0085391 -0.058878 -0.005923
## year_temp 0.104635 -0.0779260 -0.017726 0.042044
## PC17 PC18 PC19 PC20
## funded_amnt_inv 0.019224 0.0158430 -0.0031690 -0.0108530
## annual_inc 0.120686 -0.0997654 -0.2346729 -0.1292822
## pub_rec_bankruptcies -0.503907 0.1077371 -0.2396981 -0.1696128
## revol_util -0.141726 0.0292424 -0.1306092 -0.0481764
## collections_12_mths_ex_med -0.198311 0.0031451 -0.0688839 0.0101397
## chargeoff_within_12_mths -0.045439 -0.4583949 0.0749343 0.0690759
## tax_liens 0.448049 -0.1060648 0.1673885 0.1436238
## delinq_2yrs 0.072582 0.3795890 -0.1590973 -0.0576897
## inq_last_6mths 0.201787 -0.1229851 0.1425420 -0.0282915
## open_acc 0.086801 -0.0561477 -0.1171243 0.0012440
## pub_rec -0.032894 0.0014482 -0.0421234 -0.0142237
## total_acc -0.032223 -0.0057542 -0.1012056 0.0024636
## acc_now_delinq -0.002951 0.3026133 0.1171567 -0.3456290
## delinq_amnt -0.077309 -0.4554568 -0.0506766 0.3586387
## id_2 -0.083829 -0.1798509 -0.0085159 -0.3848294
## loan_amnt 0.018993 0.0152981 -0.0027429 -0.0112323
## funded_amnt 0.019224 0.0156125 -0.0031663 -0.0109012
## term -0.038045 -0.1083308 0.0124631 0.0279626
## int_rate 0.059186 0.0084273 -0.0883649 0.0271096
## installment 0.040120 0.0534655 -0.0176638 -0.0181687
## grade 0.070237 0.0092773 -0.0991535 0.0409776
## home_ownership 0.354434 -0.2320935 -0.4519466 -0.3726269
## verification_status -0.001138 0.0280120 0.6844315 -0.2047341
## pymnt_plan -0.010828 -0.0002611 -0.0029882 0.0040888
## dti -0.074396 0.0207044 0.0769411 -0.0220991
## revol_bal -0.023617 -0.1204071 -0.0498194 -0.1519156
## initial_list_status -0.010501 -0.3371165 0.1371582 -0.5446063
## out_prncp 0.067807 0.0344166 -0.0009233 -0.0028172
## out_prncp_inv 0.067792 0.0343948 -0.0009213 -0.0028455
## total_pymnt -0.027577 0.0066609 -0.0380455 0.0185584
## total_pymnt_inv -0.027549 0.0068645 -0.0380073 0.0185882
## total_rec_prncp -0.015381 0.0119776 -0.0260113 0.0213496
## total_rec_int -0.070339 -0.0237713 -0.0645095 0.0075668
## total_rec_late_fee -0.467632 -0.2288168 0.0828636 0.0201809
## recoveries 0.056103 0.0393787 0.0133899 -0.0130336
## collection_recovery_fee 0.049811 0.0387510 0.0057738 -0.0038723
## last_pymnt_amnt 0.051749 -0.0072241 -0.0188131 0.0911322
## application_type 0.061855 -0.0400113 -0.0805371 0.0003988
## hardship_flag -0.010828 -0.0002611 -0.0029882 0.0040888
## debt_settlement_flag -0.086188 -0.0432097 -0.0657936 0.0892126
## year_temp 0.042894 0.0404050 -0.0159810 0.0570448
## PC21 PC22 PC23 PC24
## funded_amnt_inv -0.049722 -0.0200163 0.09272632 0.033951
## annual_inc 0.354707 -0.0949816 -0.02419697 -0.147383
## pub_rec_bankruptcies -0.031966 -0.0486667 0.02006544 0.017384
## revol_util -0.209123 0.0216387 -0.09036820 0.086493
## collections_12_mths_ex_med 0.005765 0.0796263 -0.01520337 0.020965
## chargeoff_within_12_mths 0.084534 0.2705814 0.13402384 -0.020348
## tax_liens 0.020164 0.0643688 -0.02051509 0.012197
## delinq_2yrs -0.204352 -0.4164230 -0.24464415 0.101643
## inq_last_6mths -0.517177 0.2889405 -0.05403707 -0.007394
## open_acc 0.121101 -0.0391402 0.05641108 -0.059636
## pub_rec -0.010167 0.0119448 0.00219263 0.016386
## total_acc 0.087642 -0.0869982 -0.01092848 0.015755
## acc_now_delinq 0.163568 0.4962175 0.13186705 -0.036095
## delinq_amnt -0.145420 -0.3124764 -0.05748383 0.033603
## id_2 -0.135561 -0.0935236 -0.09363407 0.110624
## loan_amnt -0.048222 -0.0201426 0.09234489 0.034797
## funded_amnt -0.049401 -0.0196418 0.09238433 0.035000
## term 0.306090 -0.0368475 0.04629556 -0.045296
## int_rate 0.003152 0.0117982 0.02902556 0.004070
## installment -0.156260 -0.0089729 0.09279242 0.046190
## grade 0.007188 0.0217753 0.04069162 0.008628
## home_ownership 0.193617 -0.0684644 -0.08110608 0.131855
## verification_status 0.304889 -0.3527777 -0.21576567 0.015821
## pymnt_plan 0.005343 -0.0004441 -0.00001031 -0.008235
## dti -0.123859 0.0498631 -0.05908760 0.063686
## revol_bal 0.067442 0.1409719 -0.01341009 0.112752
## initial_list_status -0.304046 -0.1148164 -0.05471659 -0.100142
## out_prncp -0.026999 0.0089816 -0.02315409 0.051339
## out_prncp_inv -0.027003 0.0089776 -0.02316415 0.051337
## total_pymnt -0.040782 0.0336088 -0.06034406 -0.098832
## total_pymnt_inv -0.041118 0.0332801 -0.05995399 -0.099646
## total_rec_prncp -0.048829 0.0613267 -0.11920598 0.009868
## total_rec_int 0.011289 -0.0482747 0.08286738 -0.446256
## total_rec_late_fee 0.128700 0.0550558 0.06209495 0.256687
## recoveries -0.040592 -0.0406028 0.16433277 0.179108
## collection_recovery_fee -0.042435 -0.0251761 0.13322933 0.171387
## last_pymnt_amnt 0.126678 0.1699818 -0.31154928 0.673098
## application_type 0.072912 0.0036742 0.01746757 -0.004073
## hardship_flag 0.005343 -0.0004441 -0.00001031 -0.008235
## debt_settlement_flag 0.057323 0.2727025 -0.77616608 -0.288616
## year_temp -0.060454 0.0261255 0.01356063 0.001056
## PC25 PC26 PC27 PC28
## funded_amnt_inv -0.005237 -0.0157807 -0.0042771 0.01503415
## annual_inc -0.357813 -0.4155742 -0.0290351 -0.04586837
## pub_rec_bankruptcies 0.069029 -0.0452885 0.0325850 -0.04007643
## revol_util -0.331886 0.3014332 0.4396403 0.09437372
## collections_12_mths_ex_med -0.016441 0.0090788 0.0390581 -0.02000711
## chargeoff_within_12_mths -0.029207 -0.0058625 -0.0167000 -0.01033258
## tax_liens -0.017838 0.0218106 0.0219061 -0.01072277
## delinq_2yrs 0.242051 -0.0958819 0.0580850 -0.03087683
## inq_last_6mths 0.108071 -0.1376712 0.3187893 0.10189418
## open_acc -0.074272 0.2536187 -0.0591155 0.00354725
## pub_rec 0.030930 -0.0116490 0.0176739 -0.02304342
## total_acc -0.161935 0.2872426 0.1546405 0.00713898
## acc_now_delinq -0.071535 0.0406023 0.0488368 0.00630221
## delinq_amnt 0.014678 -0.0097733 -0.0062278 0.00133074
## id_2 -0.122487 -0.0783480 -0.2628345 0.50875945
## loan_amnt -0.004098 -0.0161914 -0.0027226 0.00862230
## funded_amnt -0.004798 -0.0160432 -0.0035173 0.01067739
## term 0.246312 -0.1523352 0.4517307 0.18054294
## int_rate -0.071150 0.1000939 -0.3054628 -0.11960153
## installment -0.099002 0.0474721 -0.1924296 -0.06293209
## grade -0.064958 0.0923353 -0.3449975 -0.06537945
## home_ownership 0.108133 0.0199799 0.1974660 0.03247743
## verification_status -0.011797 0.0903972 0.0589825 0.05477305
## pymnt_plan 0.003546 -0.0006678 -0.0002166 0.00006532
## dti -0.087748 -0.6893603 -0.0223261 -0.11485632
## revol_bal 0.706647 0.0773432 -0.2052722 -0.00020828
## initial_list_status -0.082157 0.1018341 -0.0480284 -0.33172991
## out_prncp -0.019215 0.0041879 -0.0154207 -0.01315008
## out_prncp_inv -0.019214 0.0041892 -0.0154197 -0.01317025
## total_pymnt 0.012158 -0.0093772 0.0337338 -0.00541456
## total_pymnt_inv 0.011777 -0.0091557 0.0330804 -0.00159799
## total_rec_prncp -0.021791 -0.0017860 -0.0006264 -0.01799829
## total_rec_int 0.113207 -0.0241519 0.1075436 0.04599481
## total_rec_late_fee 0.002415 -0.0119822 0.0147996 -0.00061268
## recoveries -0.013374 -0.0130119 0.0482189 -0.03594279
## collection_recovery_fee -0.013383 -0.0165239 0.0493835 -0.01761759
## last_pymnt_amnt -0.092320 -0.0168388 0.0164961 -0.08709568
## application_type 0.024237 0.0193659 0.0139839 -0.00965248
## hardship_flag 0.003546 -0.0006678 -0.0002166 0.00006532
## debt_settlement_flag -0.004204 0.0617942 -0.0974026 -0.01075087
## year_temp -0.055055 -0.0051183 -0.1679771 0.72173234
## PC29 PC30 PC31 PC32
## funded_amnt_inv 0.3046016 -0.0196764 0.0723893 -0.00410684
## annual_inc -0.0329418 -0.0149366 -0.0021221 0.00036507
## pub_rec_bankruptcies 0.0175684 -0.0685408 -0.0008813 0.48899326
## revol_util 0.0313139 -0.1167147 -0.0019387 -0.00677899
## collections_12_mths_ex_med 0.0055310 -0.0023425 -0.0014324 0.00404701
## chargeoff_within_12_mths 0.0017365 -0.0267296 0.0025030 0.00280397
## tax_liens 0.0058903 0.0371965 0.0029066 0.49609851
## delinq_2yrs 0.0159067 -0.0770948 -0.0037421 0.00011155
## inq_last_6mths 0.0073431 -0.0359874 0.0020411 0.00059111
## open_acc -0.0640153 -0.6899376 0.0023432 0.00543742
## pub_rec 0.0037944 -0.0289915 -0.0039237 -0.71678972
## total_acc 0.0345311 0.6785744 -0.0145805 -0.00573002
## acc_now_delinq 0.0004375 -0.0042567 0.0021177 0.00001424
## delinq_amnt -0.0026382 -0.0062735 -0.0004835 -0.00086343
## id_2 -0.0097057 -0.0109970 0.0026691 0.00307219
## loan_amnt 0.3074962 -0.0205079 0.0712560 -0.00516567
## funded_amnt 0.3061500 -0.0203476 0.0720240 -0.00456752
## term 0.2633226 -0.0759924 -0.2360419 -0.00530321
## int_rate 0.0083550 0.0472248 -0.0675115 0.01957996
## installment 0.2726635 0.0079155 0.0917273 0.00165354
## grade 0.0244020 0.0590059 -0.1683883 -0.00587875
## home_ownership 0.0055990 0.0594914 0.0010325 -0.00614498
## verification_status -0.0460194 -0.0272897 0.0044340 0.00295271
## pymnt_plan 0.0003129 -0.0025483 0.0006757 -0.00003051
## dti 0.0052395 0.0513468 -0.0027186 -0.00478984
## revol_bal -0.0361348 0.0819046 0.0098314 -0.00196776
## initial_list_status -0.0201480 0.0053719 0.0033340 0.00069918
## out_prncp -0.0134946 -0.0020887 -0.0164149 -0.00007395
## out_prncp_inv -0.0135008 -0.0021017 -0.0163978 -0.00008858
## total_pymnt -0.3082077 0.0146565 -0.2144029 0.00202860
## total_pymnt_inv -0.3090452 0.0151539 -0.2137604 0.00243473
## total_rec_prncp -0.2332989 0.0006795 -0.4733264 0.00264423
## total_rec_int -0.3530512 0.0512130 0.6445416 -0.00176336
## total_rec_late_fee -0.0117868 -0.0089600 -0.0218023 -0.00043468
## recoveries -0.1921833 -0.0009897 -0.0227909 0.00363873
## collection_recovery_fee -0.2898173 0.0037407 -0.0814040 -0.00097810
## last_pymnt_amnt -0.1752250 -0.0476290 0.3834094 -0.00263027
## application_type 0.0018112 -0.0021745 -0.0000388 0.00002569
## hardship_flag 0.0003129 -0.0025483 0.0006757 -0.00003051
## debt_settlement_flag 0.1862239 -0.0125999 0.0134592 -0.00144844
## year_temp -0.0784211 0.0507234 0.0234355 0.01450990
## PC33 PC34 PC35
## funded_amnt_inv -0.0119544 -0.01817155 0.30342814
## annual_inc 0.0038008 -0.00008514 -0.00092220
## pub_rec_bankruptcies -0.0119238 0.00054737 0.00170547
## revol_util -0.0201763 0.00253533 -0.00007815
## collections_12_mths_ex_med 0.0021889 0.00003947 0.00021299
## chargeoff_within_12_mths -0.0011337 0.00060606 0.00054240
## tax_liens -0.0083919 -0.00047264 0.00102878
## delinq_2yrs -0.0029023 0.00016951 -0.00032259
## inq_last_6mths -0.0057704 0.00227458 0.00179623
## open_acc -0.0026167 0.00132371 0.00116609
## pub_rec 0.0138214 0.00061460 -0.00079142
## total_acc 0.0007318 -0.00023781 0.00221052
## acc_now_delinq -0.0013506 -0.00020789 0.00080139
## delinq_amnt -0.0005810 0.00074543 0.00025116
## id_2 -0.0091085 0.00279402 0.00062448
## loan_amnt -0.0386443 -0.01772896 0.30404125
## funded_amnt -0.0270117 -0.01872758 0.27959479
## term 0.0376632 -0.01827435 -0.25592691
## int_rate 0.7002100 -0.13484447 0.10811561
## installment 0.0496507 -0.04393584 -0.81071119
## grade -0.6771437 0.12815007 0.00560313
## home_ownership -0.0009057 0.00084670 -0.00059272
## verification_status -0.0141567 -0.00003973 -0.00149826
## pymnt_plan -0.0002551 0.00007906 -0.00003260
## dti 0.0001803 0.00247332 -0.00067548
## revol_bal 0.0061098 -0.00038205 -0.00519227
## initial_list_status 0.0022881 -0.00315426 -0.00093849
## out_prncp 0.0028535 0.00095761 -0.00193787
## out_prncp_inv 0.0028822 0.00095914 -0.00174520
## total_pymnt 0.0151616 0.05689234 -0.00539233
## total_pymnt_inv 0.0282099 0.05720743 0.01571945
## total_rec_prncp 0.0225086 -0.00691717 0.00771423
## total_rec_int -0.0565466 -0.00636876 -0.03846962
## total_rec_late_fee 0.0042549 0.00237518 0.00098876
## recoveries 0.1334166 0.70161473 -0.01200572
## collection_recovery_fee -0.1264759 -0.68048475 0.00370655
## last_pymnt_amnt -0.0236500 0.00461753 -0.02810334
## application_type 0.0010625 -0.00079392 0.00037878
## hardship_flag -0.0002551 0.00007906 -0.00003260
## debt_settlement_flag 0.0019055 0.00975498 0.00215493
## year_temp 0.0678134 0.00705397 -0.00471761
## PC36 PC37 PC38
## funded_amnt_inv -0.6332917811 -0.0305243076 -0.5111481640
## annual_inc -0.0001506085 0.0000533107 0.0000124126
## pub_rec_bankruptcies -0.0000317023 -0.0000686171 -0.0000100089
## revol_util 0.0000571656 -0.0000905920 -0.0000099214
## collections_12_mths_ex_med -0.0001017341 -0.0000246378 0.0000011004
## chargeoff_within_12_mths -0.0000389266 -0.0000135233 -0.0000007689
## tax_liens 0.0001006288 -0.0000538784 -0.0000004081
## delinq_2yrs -0.0001110893 -0.0000215407 -0.0000040890
## inq_last_6mths -0.0002326328 -0.0002855994 -0.0000527436
## open_acc -0.0002582351 -0.0000473278 -0.0000114645
## pub_rec -0.0003523910 0.0000586506 0.0000019442
## total_acc 0.0001851207 0.0002086585 0.0000347687
## acc_now_delinq -0.0000093105 -0.0000009277 -0.0000014517
## delinq_amnt -0.0000065810 0.0000032650 0.0000013428
## id_2 0.0005529001 0.0003249821 0.0000068043
## loan_amnt 0.5650577698 -0.5839902969 -0.0043586242
## funded_amnt 0.0746635792 0.6361611564 0.5152503363
## term -0.0028389175 -0.0054682730 -0.0000618493
## int_rate 0.0161365592 0.0018724281 0.0001207049
## installment -0.0078225420 -0.0199204108 -0.0002507136
## grade -0.0145497375 0.0007630749 -0.0000585802
## home_ownership -0.0001462005 -0.0000579376 0.0000047273
## verification_status -0.0004316720 0.0003912111 0.0000060489
## pymnt_plan 0.0000044882 0.0000020053 0.0000007339
## dti -0.0002411812 -0.0001911755 0.0000085689
## revol_bal -0.0003210518 -0.0000758387 0.0000064807
## initial_list_status 0.0000006184 -0.0000607551 0.0000042124
## out_prncp 0.0000907762 -0.0000089687 0.0018346895
## out_prncp_inv -0.0002116945 -0.0002246866 -0.0018348842
## total_pymnt 0.1904113924 0.1822639947 -0.2492707837
## total_pymnt_inv -0.4358309452 -0.4199812370 0.5745627656
## total_rec_prncp 0.2047226223 0.1963820096 -0.2682587524
## total_rec_int 0.0684726042 0.0651415481 -0.0900811701
## total_rec_late_fee 0.0001094284 0.0002503714 -0.0003725457
## recoveries 0.0254633820 0.0218110817 -0.0297154545
## collection_recovery_fee -0.0023334576 -0.0004660227 -0.0000415097
## last_pymnt_amnt -0.0012627805 -0.0012392498 -0.0000503792
## application_type -0.0001008856 -0.0000090948 0.0000009640
## hardship_flag 0.0000044882 0.0000020053 0.0000007339
## debt_settlement_flag -0.0003087381 0.0000275720 -0.0000142676
## year_temp 0.0085837976 0.0000510832 0.0000572327
## PC39 PC40
## funded_amnt_inv -0.0015074221 -0.00000008655653
## annual_inc 0.0000002199 0.00000000012468
## pub_rec_bankruptcies 0.0000017559 -0.00000000007575
## revol_util 0.0000023115 -0.00000000047120
## collections_12_mths_ex_med -0.0000011378 -0.00000000035052
## chargeoff_within_12_mths -0.0000006881 0.00000000031061
## tax_liens 0.0000062198 0.00000000020097
## delinq_2yrs -0.0000028870 0.00000000082056
## inq_last_6mths -0.0000023596 0.00000000037435
## open_acc -0.0000090502 -0.00000000028306
## pub_rec -0.0000089896 0.00000000018528
## total_acc 0.0000038141 0.00000000005778
## acc_now_delinq -0.0000010345 -0.00000000034105
## delinq_amnt 0.0000001979 0.00000000069280
## id_2 -0.0000040032 -0.00000000065828
## loan_amnt -0.0000201590 -0.00000001242206
## funded_amnt 0.0014124714 0.00000009670161
## term 0.0000393725 0.00000000032607
## int_rate -0.0000219193 0.00000000271000
## installment 0.0001029559 0.00000000198891
## grade 0.0000147624 -0.00000000177291
## home_ownership -0.0000088164 -0.00000000003973
## verification_status -0.0000031963 0.00000000018589
## pymnt_plan -0.0000391314 0.00000000002970
## dti -0.0000027212 0.00000000023808
## revol_bal 0.0000007007 0.00000000060690
## initial_list_status -0.0000249066 0.00000000029615
## out_prncp -0.7070954570 -0.00000000050264
## out_prncp_inv 0.7071132788 0.00000000059748
## total_pymnt -0.0005783416 0.75218099549416
## total_pymnt_inv 0.0013304368 0.00000007868942
## total_rec_prncp -0.0006210324 -0.62127884198648
## total_rec_int -0.0002129456 -0.20850323079127
## total_rec_late_fee -0.0000193156 -0.00084014920469
## recoveries -0.0000690659 -0.06900758955536
## collection_recovery_fee 0.0000035427 0.00000000006858
## last_pymnt_amnt -0.0000002625 0.00000000140187
## application_type -0.0000039235 -0.00000000014170
## hardship_flag -0.0000391314 -0.00000000003482
## debt_settlement_flag -0.0000060360 0.00000000010788
## year_temp 0.0000121877 -0.00000000036488
## PC41
## funded_amnt_inv 0.000000000000000000000
## annual_inc -0.000000000000000049440
## pub_rec_bankruptcies 0.000000000000000935294
## revol_util -0.000000000000000061441
## collections_12_mths_ex_med -0.000000000000000101437
## chargeoff_within_12_mths 0.000000000000000284505
## tax_liens 0.000000000000000734963
## delinq_2yrs -0.000000000000000297496
## inq_last_6mths -0.000000000000000141597
## open_acc -0.000000000000000364687
## pub_rec -0.000000000000001197649
## total_acc 0.000000000000000721321
## acc_now_delinq -0.000000000000000284198
## delinq_amnt -0.000000000000000478910
## id_2 -0.000000000000000228325
## loan_amnt 0.000000000000002975558
## funded_amnt -0.000000000000001801706
## term -0.000000000000000043415
## int_rate 0.000000000000002002955
## installment -0.000000000000001657352
## grade -0.000000000000002022867
## home_ownership 0.000000000000000384457
## verification_status 0.000000000000000005009
## pymnt_plan 0.707106781186550237273
## dti -0.000000000000000383473
## revol_bal 0.000000000000000497409
## initial_list_status -0.000000000000000653937
## out_prncp -0.000000000000017347235
## out_prncp_inv 0.000000000000017554534
## total_pymnt -0.000000000034318298637
## total_pymnt_inv -0.000000000000003654195
## total_rec_prncp 0.000000000028349242658
## total_rec_int 0.000000000009513843706
## total_rec_late_fee 0.000000000000038216392
## recoveries 0.000000000003151915794
## collection_recovery_fee -0.000000000000002590376
## last_pymnt_amnt 0.000000000000000169136
## application_type 0.000000000000000028189
## hardship_flag -0.707106781186544575135
## debt_settlement_flag -0.000000000000000307913
## year_temp 0.000000000000000498299
pca.out$sdev
## [1] 2.830175242207880970 1.747375222580279219 1.546859648725075864
## [4] 1.504282222452379525 1.440913490912746342 1.350547919992009627
## [7] 1.301832520261502646 1.262336896904347805 1.147288436751347618
## [10] 1.106109123609336331 1.044481556691171376 1.033539758528099428
## [13] 1.005783212316659236 0.998232461153887196 0.984926986888527867
## [16] 0.975775140248013928 0.956927099753434374 0.941272539883864612
## [19] 0.916107387094604375 0.913423634545449614 0.894116307699023505
## [22] 0.888633672223303805 0.866868447384414376 0.833172771712352600
## [25] 0.788394997709253254 0.745894724384768892 0.715588409147793314
## [28] 0.634722759166007955 0.564806807240468389 0.525564163061816192
## [31] 0.338632953619651234 0.257734324165023232 0.197013896082651380
## [34] 0.180241759807571400 0.105411320861252961 0.036374359034614168
## [37] 0.022565222163519465 0.002737867910510237 0.001216232120532726
## [40] 0.000000154215179868 0.000000000000005506
# % of variance explained
pr.var <- pca.out$sdev ^2
pve=pr.var/sum(pr.var)
pve
## [1] 0.1953632171123477478946739438470104
## [2] 0.0744712236216441975100721606395382
## [3] 0.0583603603135076018371130146533687
## [4] 0.0551918293850297489444578502570948
## [5] 0.0506397972754709865816913350045070
## [6] 0.0444873093706025096727429968268552
## [7] 0.0413358027026921384661228842105629
## [8] 0.0388657180801478679965654805528175
## [9] 0.0321041648073980834254825822426938
## [10] 0.0298409120324826040993038134274684
## [11] 0.0266083346894631297963051963506587
## [12] 0.0260537666453243450803700653750639
## [13] 0.0246731675653169592321578562632567
## [14] 0.0243040986951542602179099361592307
## [15] 0.0236605163292998195434524433267143
## [16] 0.0232228566908782921296428014557023
## [17] 0.0223343774205487545503334700924825
## [18] 0.0216096096180341881720643470998766
## [19] 0.0204695791387630374180695014274534
## [20] 0.0203498228328341593251771257655491
## [21] 0.0194986334559345622641224338167376
## [22] 0.0192602391075379630880348713617423
## [23] 0.0183283147578206934147360129827575
## [24] 0.0169311431103128899600118728585585
## [25] 0.0151601627417795020019708474023901
## [26] 0.0135697302406101935612348086124257
## [27] 0.0124894334465038593651176057619523
## [28] 0.0098261702683731990543014589434279
## [29] 0.0077806519391503533786269564131999
## [30] 0.0067370168169478324843302274871348
## [31] 0.0027968848116382023089032671236964
## [32] 0.0016201702890926779516977962458668
## [33] 0.0009466945182845094798260165980253
## [34] 0.0007923680970373057083980250503430
## [35] 0.0002710133308710677016786783077862
## [36] 0.0000322705852482679941221205277557
## [37] 0.0000124192500314385050833726475972
## [38] 0.0000001828273340341834642987228765
## [39] 0.0000000360785505125731575373423943
## [40] 0.0000000000000005800566268729881272
## [41] 0.0000000000000000000000000000007394
# Plot
plot( cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained", type = "b" )
text(cumsum(pve), labels=round(cumsum(pve)*100, digits = 1), cex= 0.7, pos = 2)
# try PCA as prediction
# use 30 PCAs
pca.train.1_30 <- pca.out$x[,1:30]
# predict PCA on test data set
pred <- predict(pca.out, test.pca)
head(pred[,1:30])
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 1 5.7828 2.6681 -0.23686 0.06525 0.2372 0.35380 -0.5208 -0.1969
## 7 -3.2917 -0.4280 0.13261 0.17922 0.1920 -0.83022 0.8724 -0.9447
## 12 2.1772 2.2568 -0.15467 0.33938 -1.0290 0.04282 -0.6153 0.5584
## 14 4.0037 3.2149 0.11557 2.00725 -0.2400 0.58266 0.9263 -1.0651
## 19 0.7012 0.3014 0.01453 0.07402 0.2960 -0.29043 1.2685 -1.6427
## 22 -4.0796 0.2277 0.26020 0.40989 -0.6043 0.50158 1.8375 -1.7563
## PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
## 1 -0.27863 1.5954 -0.37404 -0.03361 0.48053 -0.2982 0.17298 0.34509
## 7 -0.04816 1.2542 -0.49635 -0.52856 0.75752 -0.4419 -0.15506 -0.21382
## 12 1.05401 -1.8134 0.80220 1.74448 -0.46758 0.2597 0.65498 0.26230
## 14 0.69012 -0.1750 -0.04525 0.55190 -0.18760 0.2017 0.06036 0.11262
## 19 -0.02052 0.4211 0.50052 1.06659 0.08716 -0.1655 -0.13471 -0.08576
## 22 0.68663 -0.8798 0.21020 0.43279 0.08706 0.1520 -0.12166 -0.07957
## PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
## 1 1.1842 -0.9112 -0.5639 -1.178202 -1.2311 1.0420 -1.35902 3.42971
## 7 0.9745 -0.7444 0.6911 -0.639940 -0.6715 0.3254 -0.12769 -0.11961
## 12 -0.3950 -0.1634 -1.2672 -0.003899 0.1019 1.0372 -0.33961 1.67011
## 14 -0.1623 0.1475 -0.4875 0.641470 -0.7976 1.0872 0.02990 1.05316
## 19 0.1830 0.2184 -0.3086 1.012318 -1.2412 1.1640 -0.06054 1.07812
## 22 0.1783 -0.3336 -0.8833 -0.141689 -0.2460 0.2972 0.07527 -0.09925
## PC25 PC26 PC27 PC28 PC29 PC30
## 1 -0.8011 0.21168 -0.39021 0.58175 0.34240 -0.18603
## 7 0.3118 -1.57075 -0.62418 -0.06206 -0.41471 0.12804
## 12 0.5308 0.06308 -1.15173 1.56909 -0.10367 0.03154
## 14 0.2606 0.03499 -0.09298 -0.91620 0.16353 -0.68291
## 19 -1.0437 0.34436 -0.60843 -1.03677 0.06011 -0.04914
## 22 -0.4478 0.39142 0.94814 -0.36157 -0.30349 -0.19256
pca.test.1_30 <- pred[,1:30]
log.pca.fit <- glm(train.log.pca$loan_status ~ ., as.data.frame(pca.train.1_30),family="binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log.pca.fit)
##
## Call:
## glm(formula = train.log.pca$loan_status ~ ., family = "binomial",
## data = as.data.frame(pca.train.1_30))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1081162023588628 235497 4590975283
## PC1 340252070685539 83210 4089079955
## PC2 610392735941376 134773 4529049089
## PC3 -27177827633057 152243 -178516001
## PC4 -475026671580247 156552 -3034302089
## PC5 310839100875539 163437 1901888488
## PC6 -262930590871878 174373 -1507865532
## PC7 21117352560508 180898 116736343
## PC8 -301360309735786 186558 -1615373031
## PC9 -124998732210273 205266 -608961341
## PC10 419078409049468 212907 1968361034
## PC11 34520620835027 225470 153105528
## PC12 355026763743441 227856 1558115972
## PC13 -209481380086400 234145 -894666770
## PC14 98574889496720 235916 417839493
## PC15 38701078098963 239103 161859651
## PC16 20249111734962 241345 83901017
## PC17 75918220844796 246099 308486650
## PC18 -21446226725691 250192 -85719135
## PC19 -39691569638562 257065 -154403150
## PC20 194211956059419 257820 753285669
## PC21 232136930086545 263387 881352723
## PC22 337970227872327 265012 1275301023
## PC23 -614016865381567 271666 -2260190233
## PC24 1323957744940502 282653 4684041037
## PC25 -171315995378799 298707 -573526213
## PC26 -52891716403152 315727 -167523869
## PC27 109294971538026 329098 332104666
## PC28 -211350893131724 371026 -569638923
## PC29 -1147093243930877 416954 -2751123853
## PC30 -55880339547491 448087 -124708564
## Pr(>|z|)
## (Intercept) <0.0000000000000002 ***
## PC1 <0.0000000000000002 ***
## PC2 <0.0000000000000002 ***
## PC3 <0.0000000000000002 ***
## PC4 <0.0000000000000002 ***
## PC5 <0.0000000000000002 ***
## PC6 <0.0000000000000002 ***
## PC7 <0.0000000000000002 ***
## PC8 <0.0000000000000002 ***
## PC9 <0.0000000000000002 ***
## PC10 <0.0000000000000002 ***
## PC11 <0.0000000000000002 ***
## PC12 <0.0000000000000002 ***
## PC13 <0.0000000000000002 ***
## PC14 <0.0000000000000002 ***
## PC15 <0.0000000000000002 ***
## PC16 <0.0000000000000002 ***
## PC17 <0.0000000000000002 ***
## PC18 <0.0000000000000002 ***
## PC19 <0.0000000000000002 ***
## PC20 <0.0000000000000002 ***
## PC21 <0.0000000000000002 ***
## PC22 <0.0000000000000002 ***
## PC23 <0.0000000000000002 ***
## PC24 <0.0000000000000002 ***
## PC25 <0.0000000000000002 ***
## PC26 <0.0000000000000002 ***
## PC27 <0.0000000000000002 ***
## PC28 <0.0000000000000002 ***
## PC29 <0.0000000000000002 ***
## PC30 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 78035 on 81205 degrees of freedom
## Residual deviance: 44622 on 81175 degrees of freedom
## AIC: 44684
##
## Number of Fisher Scoring iterations: 25
log.pca.prob <- predict(log.pca.fit, newdata = as.data.frame(pca.test.1_30), type = "response")
log.pca.pred <- ifelse(log.pca.prob>0.5,"Fully Paid","Defaulted")
# Test errors and confusion matrix
pca.matrix <- table(log.pca.pred,test.log.pca$loan_status)
pca.overall.test.err = mean(log.pca.pred != test.log.pca$loan_status)
pca.default.test.err= pca.matrix[1,2]/(pca.matrix[1,1]+pca.matrix[1,2]) # Class Test Error
pca.paid.test.err= pca.matrix[2,1]/(pca.matrix[2,2]+pca.matrix[2,1]) # Class Test Error
pca.matrix
##
## log.pca.pred Defaulted Fully Paid
## Defaulted 3690 108
## Fully Paid 62 16442
pca.default.test.err
## [1] 0.02844
pca.paid.test.err
## [1] 0.003757
pca.overall.test.err
## [1] 0.008374
In the graph we can see the cumulative amount of variance explained rises with each additional Principal Component (PC). With 30 PCs we have reached over 99% of Variance Explained and the curve flattens off. We decided to take the smallest amount of PC that could nevertheless explain all most all variance in the data.
To validate our approach we used Logistic Regression. We used the first 30 PCs and then trained a logistical model on it. We validated the fitted model on the test set, where we applied the same PCA transformation. It resulted in terribly low test error rates for both the classes “Defaulted” (0.66%) and “Fully Paid” (0.77%), as well as the overall error (0.75%)
# load previous df
df.pls <- d2_sub
# str(df.pls) #check
# remove factors larger than 10
plus10levels <- df.pls[, sapply(df.pls, function(col) class(col) == "factor")]
plus10levels <- plus10levels[, sapply(plus10levels, function(col) length(unique(col))) > 10]
rmv.col <- colnames(plus10levels)
str(rmv.col) # columns to be removed
## chr [1:10] "last_pymnt_d" "title" "earliest_cr_line" ...
df.pls <- df.pls[ , !(names(df.pls) %in% rmv.col)]
# str(df.pls) #check
# transform all factors to numeric, also loan_status, otherwise pls function doesn't work
str(df.pls)
## 'data.frame': 101508 obs. of 44 variables:
## $ funded_amnt_inv : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ annual_inc : int 104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
## $ pub_rec_bankruptcies : int 0 0 0 1 0 0 0 0 0 0 ...
## $ revol_util : num 51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_2yrs : int 0 1 0 0 0 1 0 0 0 0 ...
## $ inq_last_6mths : int 2 3 1 2 2 1 0 0 1 0 ...
## $ open_acc : int 14 12 30 19 10 10 12 16 12 20 ...
## $ pub_rec : int 0 0 0 1 0 0 0 0 0 0 ...
## $ total_acc : int 31 35 47 36 16 34 27 24 17 42 ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ id_2 : int 777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
## $ loan_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ funded_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ term : Factor w/ 2 levels "36 months","60 months": 1 2 1 2 1 1 1 1 1 1 ...
## $ int_rate : num 10.78 22.45 5.32 16.59 13.99 ...
## $ installment : num 1097 290 452 246 144 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 6 1 4 3 1 2 1 1 1 ...
## $ home_ownership : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 1 4 4 4 1 1 1 3 1 ...
## $ verification_status : Factor w/ 3 levels "Not Verified",..: 2 2 1 2 2 1 1 1 1 2 ...
## $ loan_status : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ dti : num 14 25.4 20.2 30.8 31 ...
## $ revol_bal : int 28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
## $ initial_list_status : Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
## $ total_pymnt : num 34309 11740 16241 10758 5036 ...
## $ total_pymnt_inv : num 34309 11740 16241 10758 5036 ...
## $ total_rec_prncp : num 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ total_rec_int : num 709 1340 1241 758 836 ...
## $ total_rec_late_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ recoveries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ collection_recovery_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ last_pymnt_amnt : num 33263 10129 2244 9795 1743 ...
## $ policy_code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ application_type : Factor w/ 2 levels "Individual","Joint App": 1 1 1 1 1 1 1 1 1 1 ...
## $ hardship_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ disbursement_method : Factor w/ 1 level "Cash": 1 1 1 1 1 1 1 1 1 1 ...
## $ debt_settlement_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ year_temp : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
for(col in 1:ncol(df.pls)){
if(class(df.pls[,col]) %in% c('factor')){
df.pls[,col] <- as.numeric(df.pls[,col])
}
}
# str(df.pls) #check
# remove NAs
drop_na(df.pls)
# str(df.pls) #check
# check dimnames, mean and variance
dimnames(df.pls)[[2]]
## [1] "funded_amnt_inv" "annual_inc"
## [3] "pub_rec_bankruptcies" "revol_util"
## [5] "collections_12_mths_ex_med" "chargeoff_within_12_mths"
## [7] "tax_liens" "delinq_2yrs"
## [9] "inq_last_6mths" "open_acc"
## [11] "pub_rec" "total_acc"
## [13] "acc_now_delinq" "delinq_amnt"
## [15] "id_2" "loan_amnt"
## [17] "funded_amnt" "term"
## [19] "int_rate" "installment"
## [21] "grade" "home_ownership"
## [23] "verification_status" "loan_status"
## [25] "pymnt_plan" "dti"
## [27] "revol_bal" "initial_list_status"
## [29] "out_prncp" "out_prncp_inv"
## [31] "total_pymnt" "total_pymnt_inv"
## [33] "total_rec_prncp" "total_rec_int"
## [35] "total_rec_late_fee" "recoveries"
## [37] "collection_recovery_fee" "last_pymnt_amnt"
## [39] "policy_code" "application_type"
## [41] "hardship_flag" "disbursement_method"
## [43] "debt_settlement_flag" "year_temp"
apply(df.pls,2,mean)
## funded_amnt_inv annual_inc
## 14379.166529 74643.055336
## pub_rec_bankruptcies revol_util
## 0.122276 54.747690
## collections_12_mths_ex_med chargeoff_within_12_mths
## 0.013999 0.008856
## tax_liens delinq_2yrs
## 0.049385 0.309542
## inq_last_6mths open_acc
## 0.703373 11.508285
## pub_rec total_acc
## 0.199649 25.264501
## acc_now_delinq delinq_amnt
## 0.005310 15.156086
## id_2 loan_amnt
## 445468.355519 14404.126768
## funded_amnt term
## 14395.091274 1.250828
## int_rate installment
## 13.150106 434.951229
## grade home_ownership
## 2.756857 2.410312
## verification_status loan_status
## 2.039022 1.814182
## pymnt_plan dti
## 1.000089 17.984777
## revol_bal initial_list_status
## 16558.096544 1.470416
## out_prncp out_prncp_inv
## 28.002359 27.986320
## total_pymnt total_pymnt_inv
## 15484.977656 15467.948131
## total_rec_prncp total_rec_int
## 12464.129846 2785.762868
## total_rec_late_fee recoveries
## 1.511362 233.573568
## collection_recovery_fee last_pymnt_amnt
## 37.995335 4497.802159
## policy_code application_type
## 1.000000 1.000552
## hardship_flag disbursement_method
## 1.000089 1.000000
## debt_settlement_flag year_temp
## 1.021121 2014.023791
apply(df.pls,2,var)
## funded_amnt_inv annual_inc
## 70379710.85512327 5044374313.91709137
## pub_rec_bankruptcies revol_util
## 0.13152107 564.14696066
## collections_12_mths_ex_med chargeoff_within_12_mths
## 0.01703437 0.01214732
## tax_liens delinq_2yrs
## 0.15101847 0.71781136
## inq_last_6mths open_acc
## 0.97899224 28.14901467
## pub_rec total_acc
## 0.34206417 141.58278152
## acc_now_delinq delinq_amnt
## 0.00656248 624566.97294738
## id_2 loan_amnt
## 63602829078.21736145 70458348.47642767
## funded_amnt term
## 70361904.80439132 0.18791493
## int_rate installment
## 19.34605973 61766.92356414
## grade home_ownership
## 1.70985230 2.04605086
## verification_status loan_status
## 0.62179392 0.15129108
## pymnt_plan dti
## 0.00008866 68.00934810
## revol_bal initial_list_status
## 489147939.65753955 0.24912725
## out_prncp out_prncp_inv
## 273519.32762228 273208.22961914
## total_pymnt total_pymnt_inv
## 106393362.09431723 106374750.73708099
## total_rec_prncp total_rec_int
## 72449474.50050333 8253374.70288832
## total_rec_late_fee recoveries
## 129.69762222 893639.07920437
## collection_recovery_fee last_pymnt_amnt
## 27064.96883561 39291930.08685572
## policy_code application_type
## 0.00000000 0.00055138
## hardship_flag disbursement_method
## 0.00008866 0.00000000
## debt_settlement_flag year_temp
## 0.02067557 1.40756346
# looks like policy_code and disbursement_method have variance 0 and mean 1, we can remove these two variables
df.pls <- subset(df.pls, select = -c(policy_code,disbursement_method))
# str(df.pls) #check
# devide and conquer 80% train, 20% test data
p80 <- nrow(d2_sub)*0.8
set.seed(13)
indices <- sort(sample(1:nrow(df.pls), p80)) # select 100 random samples
test.pls <- df.pls[-indices,]
train.pls <- df.pls[indices,]
# pls function
pls.fit <- plsr(loan_status ~ ., data=train.pls, scale=TRUE, validation ="CV")
summary(pls.fit)
## Data: X dimension: 81206 41
## Y dimension: 81206 1
## Fit method: kernelpls
## Number of components considered: 41
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.3892 0.2868 0.276 0.2492 0.2302 0.2247 0.2225
## adjCV 0.3892 0.2868 0.276 0.2491 0.2302 0.2247 0.2225
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.2201 0.219 0.2189 0.2189 0.2189 0.2189 0.2189
## adjCV 0.2201 0.219 0.2189 0.2189 0.2189 0.2189 0.2189
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 0.2189 0.2189 0.2189 0.2189 0.2189 0.2189
## adjCV 0.2189 0.2188 0.2188 0.2188 0.2188 0.2188
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## CV 0.2189 0.2189 0.2189 0.2188 0.2188 0.2188
## adjCV 0.2188 0.2188 0.2188 0.2188 0.2188 0.2188
## 26 comps 27 comps 28 comps 29 comps 30 comps 31 comps
## CV 0.2188 0.2188 0.2188 0.2188 0.2188 0.2188
## adjCV 0.2188 0.2188 0.2188 0.2188 0.2188 0.2188
## 32 comps 33 comps 34 comps 35 comps 36 comps 37 comps
## CV 0.2188 0.2188 0.2188 0.2188 0.2188 0.2188
## adjCV 0.2188 0.2188 0.2188 0.2188 0.2188 0.2188
## 38 comps 39 comps 40 comps 41 comps
## CV 0.2188 0.2188 0.2188 0.2188
## adjCV 0.2188 0.2188 0.2188 0.2188
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 7.479 24.06 29.77 33.26 35.99 39.25 41.44
## loan_status 45.744 49.75 59.12 65.07 66.70 67.36 68.06
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 44.43 47.67 51.40 55.10 58.37 60.50
## loan_status 68.38 68.41 68.41 68.42 68.42 68.42
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## X 62.01 64.09 66.81 68.50 70.32 72.15
## loan_status 68.42 68.42 68.42 68.42 68.42 68.42
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## X 73.55 75.02 76.76 78.13 79.10 80.86
## loan_status 68.42 68.42 68.42 68.42 68.42 68.42
## 26 comps 27 comps 28 comps 29 comps 30 comps 31 comps
## X 82.63 83.20 84.79 86.42 87.13 88.94
## loan_status 68.43 68.43 68.43 68.43 68.43 68.43
## 32 comps 33 comps 34 comps 35 comps 36 comps 37 comps
## X 91.31 93.52 93.71 95.76 97.96 100.00
## loan_status 68.43 68.43 68.43 68.43 68.43 68.43
## 38 comps 39 comps 40 comps 41 comps
## X 100.00 100.00 100.00 100.00
## loan_status 68.44 68.44 68.44 68.44
pls.prob=predict(pls.fit, test.pls ,ncomp=30)
head(pls.prob)
## [1] 2.127 1.845 2.024 2.088 1.976 1.842
summary(pls.prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.279 1.864 1.910 1.813 1.963 2.241
pls.pred <- ifelse(pls.prob>0.5,"Fully Paid","Defaulted")
The partial least squares method is generally used for numeric continous variables. There is a method namly partial least squares discriminant analysis, that can be used for categorical variables. This however was not further investigated due to time constrains. We convert the loan status variable into a numeric variable. However, this gave us results that we cannot interperate in a useful way. In conclusion to the question asked in Part 2, we must give preference to PCA over PLS, since is not suited for this classification problem.
# devide and conquer 80% train.log, 20% test.data
set.seed(13)
indices <- sort(sample(1:nrow(d2_sub), p80)) # select 100 random samples
test.log <- d2_sub[-indices,]
train.log <- d2_sub[indices,]
#drop NAs & columns with more than 50 levels or less than 2
drop_na(train.log)
less2levels <- train.log[, sapply(train.log, function(col) length(unique(col))) < 2]
plus50levels <- train.log[, sapply(train.log, function(col) length(unique(col))) > 50]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "numeric")]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "integer")]
rmv.col <- c(colnames(less2levels),colnames(plus50levels))
rmv.col
## [1] "policy_code" "disbursement_method" "last_pymnt_d"
## [4] "title" "earliest_cr_line" "last_credit_pull_d"
## [7] "issue_d" "zip_code" "addr_state"
train.log <- train.log[ , !(names(train.log) %in% rmv.col)]
# str(train.log) # check
log.fit <- glm(loan_status ~ ., family="binomial", data=train.log)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log.fit)
##
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train.log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.942 0.000 0.009 0.058 8.490
##
## Coefficients: (6 not defined because of singularities)
## Estimate
## (Intercept) 1596.1175906646
## funded_amnt_inv -0.0042239504
## annual_inc -0.0000003691
## pub_rec_bankruptcies 0.1074820796
## revol_util -0.0049668737
## collections_12_mths_ex_med -0.6643459174
## chargeoff_within_12_mths 1.0459590754
## tax_liens 0.2420755916
## delinq_2yrs 0.0265023207
## inq_last_6mths 0.3400357473
## open_acc -0.0253205900
## pub_rec -0.3047168281
## total_acc 0.0024553198
## acc_now_delinq -0.1885208017
## delinq_amnt -0.0000348501
## id_2 0.0000000716
## loan_amnt -0.0002996871
## funded_amnt -0.0015263423
## term60 months 0.4289031765
## int_rate -0.6904382902
## installment 0.0015284132
## gradeB 3.7380610475
## gradeC 5.6530027018
## gradeD 8.2266771355
## gradeE 8.9640706447
## gradeF 14.2810638801
## gradeG -2863774854855.7319335937
## sub_gradeA2 0.3527068019
## sub_gradeA3 0.6494456692
## sub_gradeA4 0.9017209014
## sub_gradeA5 0.8219715707
## sub_gradeB1 -2.2258883645
## sub_gradeB2 -1.1620605272
## sub_gradeB3 -0.9769213198
## sub_gradeB4 -0.8042032834
## sub_gradeB5 NA
## sub_gradeC1 -1.6006359969
## sub_gradeC2 -0.8153221739
## sub_gradeC3 -0.8078126019
## sub_gradeC4 -0.2678132502
## sub_gradeC5 NA
## sub_gradeD1 -1.6332531416
## sub_gradeD2 -1.5914486352
## sub_gradeD3 -0.8900854357
## sub_gradeD4 -0.3756871407
## sub_gradeD5 NA
## sub_gradeE1 -0.7709186141
## sub_gradeE2 0.2480298091
## sub_gradeE3 2.0829021345
## sub_gradeE4 1.2527044116
## sub_gradeE5 NA
## sub_gradeF1 -3.6999911377
## sub_gradeF2 -3.4057448244
## sub_gradeF3 -2.8351958248
## sub_gradeF4 -0.8594286203
## sub_gradeF5 NA
## sub_gradeG1 2863774854867.3969726562
## sub_gradeG2 2863774854866.9052734375
## sub_gradeG3 2863774854890.2983398437
## sub_gradeG4 2863774854884.0180664062
## sub_gradeG5 2863774854882.6865234375
## emp_length1 year -0.0782523741
## emp_length10+ years -0.3513320315
## emp_length2 years -0.1509637608
## emp_length3 years -0.5054641077
## emp_length4 years -0.1745352288
## emp_length5 years 0.0186843203
## emp_length6 years 0.0429527711
## emp_length7 years -0.3932234899
## emp_length8 years -0.5618624586
## emp_length9 years 1.5856241105
## emp_lengthn/a -0.7721557092
## home_ownershipOTHER 36.4922215994
## home_ownershipOWN 0.0696397792
## home_ownershipRENT -0.1062975061
## verification_statusSource Verified 0.1155058081
## verification_statusVerified 0.1955563216
## pymnt_plany 237.1805559180
## purposecredit_card -0.1382151674
## purposedebt_consolidation -0.2699089168
## purposeeducational 10.7206893887
## purposehome_improvement -0.1933713125
## purposehouse -0.4066442714
## purposemajor_purchase -1.0884230523
## purposemedical -0.1155378053
## purposemoving -0.4802372028
## purposeother -0.7806814423
## purposerenewable_energy -0.7040382302
## purposesmall_business 0.4615832462
## purposevacation -0.1367284913
## purposewedding -0.9761105537
## dti 0.0047562573
## revol_bal 0.0000133299
## initial_list_statusw -0.1292219908
## out_prncp 1.8740333621
## out_prncp_inv -1.9973199826
## total_pymnt -6.5260124007
## total_pymnt_inv 0.0037202401
## total_rec_prncp 6.5284597768
## total_rec_int 6.5220315626
## total_rec_late_fee 6.5108133457
## recoveries 6.4832454925
## collection_recovery_fee 0.1233288790
## last_pymnt_amnt 0.0008963066
## application_typeJoint App 11.5416281238
## hardship_flagY NA
## debt_settlement_flagY -19.5103850855
## year_temp -0.7878995473
## Std. Error z value
## (Intercept) 200.1768720510 7.97
## funded_amnt_inv 0.0004097675 -10.31
## annual_inc 0.0000005697 -0.65
## pub_rec_bankruptcies 0.2647733045 0.41
## revol_util 0.0030787880 -1.61
## collections_12_mths_ex_med 0.2799704595 -2.37
## chargeoff_within_12_mths 1.0781461234 0.97
## tax_liens 0.2520385718 0.96
## delinq_2yrs 0.0802100235 0.33
## inq_last_6mths 0.0767800802 4.43
## open_acc 0.0186402745 -1.36
## pub_rec 0.2213926463 -1.38
## total_acc 0.0084265243 0.29
## acc_now_delinq 0.7954176587 -0.24
## delinq_amnt 0.0000290784 -1.20
## id_2 0.0000002950 0.24
## loan_amnt 0.0001563432 -1.92
## funded_amnt 0.0004092574 -3.73
## term60 months 0.6270179517 0.68
## int_rate 0.1072614761 -6.44
## installment 0.0039223119 0.39
## gradeB 0.9709617220 3.85
## gradeC 1.2564334848 4.50
## gradeD 1.6143429016 5.10
## gradeE 1.9639001687 4.56
## gradeF 2.7307779117 5.23
## gradeG 58271573085858.6562500000 -0.05
## sub_gradeA2 0.7924181985 0.45
## sub_gradeA3 0.7963899549 0.82
## sub_gradeA4 0.7538923593 1.20
## sub_gradeA5 0.7207798046 1.14
## sub_gradeB1 0.5228648144 -4.26
## sub_gradeB2 0.4785863644 -2.43
## sub_gradeB3 0.3819796432 -2.56
## sub_gradeB4 0.3237502345 -2.48
## sub_gradeB5 NA NA
## sub_gradeC1 0.4483636838 -3.57
## sub_gradeC2 0.4431324428 -1.84
## sub_gradeC3 0.4013020329 -2.01
## sub_gradeC4 0.3869795852 -0.69
## sub_gradeC5 NA NA
## sub_gradeD1 0.6213867035 -2.63
## sub_gradeD2 0.5741310440 -2.77
## sub_gradeD3 0.5887401083 -1.51
## sub_gradeD4 0.6379061130 -0.59
## sub_gradeD5 NA NA
## sub_gradeE1 0.7358855420 -1.05
## sub_gradeE2 0.8277417025 0.30
## sub_gradeE3 0.9853244469 2.11
## sub_gradeE4 1.0043178822 1.25
## sub_gradeE5 NA NA
## sub_gradeF1 1.6588682361 -2.23
## sub_gradeF2 1.7218719955 -1.98
## sub_gradeF3 1.7100012716 -1.66
## sub_gradeF4 2.0860686477 -0.41
## sub_gradeF5 NA NA
## sub_gradeG1 58271573085857.8125000000 0.05
## sub_gradeG2 58271573085858.4765625000 0.05
## sub_gradeG3 58271573085864.8593750000 0.05
## sub_gradeG4 58271573085855.3125000000 0.05
## sub_gradeG5 58271573085853.6875000000 0.05
## emp_length1 year 0.3469431132 -0.23
## emp_length10+ years 0.2703057471 -1.30
## emp_length2 years 0.3248284750 -0.46
## emp_length3 years 0.3101940738 -1.63
## emp_length4 years 0.3584759464 -0.49
## emp_length5 years 0.3758273682 0.05
## emp_length6 years 0.4207274438 0.10
## emp_length7 years 0.3683264855 -1.07
## emp_length8 years 0.3516253685 -1.60
## emp_length9 years 0.6318478250 2.51
## emp_lengthn/a 0.3177068206 -2.43
## home_ownershipOTHER 335289.2193462860 0.00
## home_ownershipOWN 0.2326148244 0.30
## home_ownershipRENT 0.1480845279 -0.72
## verification_statusSource Verified 0.1586177976 0.73
## verification_statusVerified 0.1756149123 1.11
## pymnt_plany 80956.5381584926 0.00
## purposecredit_card 0.6172423191 -0.22
## purposedebt_consolidation 0.6056153770 -0.45
## purposeeducational 2.5673167940 4.18
## purposehome_improvement 0.6597943666 -0.29
## purposehouse 1.0486504641 -0.39
## purposemajor_purchase 0.6623449307 -1.64
## purposemedical 0.7387955506 -0.16
## purposemoving 0.7410327760 -0.65
## purposeother 0.6212401941 -1.26
## purposerenewable_energy 1.8897410820 -0.37
## purposesmall_business 0.9627450244 0.48
## purposevacation 0.8184546256 -0.17
## purposewedding 1.2153901772 -0.80
## dti 0.0084608777 0.56
## revol_bal 0.0000063358 2.10
## initial_list_statusw 0.1373119196 -0.94
## out_prncp 539.2367633312 0.00
## out_prncp_inv 540.9404325318 0.00
## total_pymnt 26.7832151827 -0.24
## total_pymnt_inv 0.0005719240 6.50
## total_rec_prncp 26.7832143605 0.24
## total_rec_int 26.7832142871 0.24
## total_rec_late_fee 26.7831480941 0.24
## recoveries 26.7832377605 0.24
## collection_recovery_fee 0.0125178208 9.85
## last_pymnt_amnt 0.0001122375 7.99
## application_typeJoint App 719.0515385641 0.02
## hardship_flagY NA NA
## debt_settlement_flagY 3.5688709568 -5.47
## year_temp 0.0992217319 -7.94
## Pr(>|z|)
## (Intercept) 0.0000000000000015 ***
## funded_amnt_inv < 0.0000000000000002 ***
## annual_inc 0.51711
## pub_rec_bankruptcies 0.68479
## revol_util 0.10669
## collections_12_mths_ex_med 0.01765 *
## chargeoff_within_12_mths 0.33197
## tax_liens 0.33682
## delinq_2yrs 0.74109
## inq_last_6mths 0.0000094803890652 ***
## open_acc 0.17434
## pub_rec 0.16871
## total_acc 0.77076
## acc_now_delinq 0.81265
## delinq_amnt 0.23073
## id_2 0.80816
## loan_amnt 0.05526 .
## funded_amnt 0.00019 ***
## term60 months 0.49395
## int_rate 0.0000000001218866 ***
## installment 0.69678
## gradeB 0.00012 ***
## gradeC 0.0000068195067174 ***
## gradeD 0.0000003469213388 ***
## gradeE 0.0000050087054269 ***
## gradeF 0.0000001698140433 ***
## gradeG 0.96080
## sub_gradeA2 0.65625
## sub_gradeA3 0.41479
## sub_gradeA4 0.23166
## sub_gradeA5 0.25412
## sub_gradeB1 0.0000207094740188 ***
## sub_gradeB2 0.01518 *
## sub_gradeB3 0.01054 *
## sub_gradeB4 0.01299 *
## sub_gradeB5 NA
## sub_gradeC1 0.00036 ***
## sub_gradeC2 0.06578 .
## sub_gradeC3 0.04412 *
## sub_gradeC4 0.48890
## sub_gradeC5 NA
## sub_gradeD1 0.00858 **
## sub_gradeD2 0.00557 **
## sub_gradeD3 0.13057
## sub_gradeD4 0.55590
## sub_gradeD5 NA
## sub_gradeE1 0.29482
## sub_gradeE2 0.76445
## sub_gradeE3 0.03452 *
## sub_gradeE4 0.21228
## sub_gradeE5 NA
## sub_gradeF1 0.02572 *
## sub_gradeF2 0.04794 *
## sub_gradeF3 0.09732 .
## sub_gradeF4 0.68035
## sub_gradeF5 NA
## sub_gradeG1 0.96080
## sub_gradeG2 0.96080
## sub_gradeG3 0.96080
## sub_gradeG4 0.96080
## sub_gradeG5 0.96080
## emp_length1 year 0.82155
## emp_length10+ years 0.19368
## emp_length2 years 0.64211
## emp_length3 years 0.10321
## emp_length4 years 0.62634
## emp_length5 years 0.96035
## emp_length6 years 0.91868
## emp_length7 years 0.28570
## emp_length8 years 0.11007
## emp_length9 years 0.01209 *
## emp_lengthn/a 0.01508 *
## home_ownershipOTHER 0.99991
## home_ownershipOWN 0.76465
## home_ownershipRENT 0.47287
## verification_statusSource Verified 0.46649
## verification_statusVerified 0.26547
## pymnt_plany 0.99766
## purposecredit_card 0.82282
## purposedebt_consolidation 0.65583
## purposeeducational 0.0000296895954881 ***
## purposehome_improvement 0.76946
## purposehouse 0.69818
## purposemajor_purchase 0.10032
## purposemedical 0.87573
## purposemoving 0.51694
## purposeother 0.20888
## purposerenewable_energy 0.70948
## purposesmall_business 0.63162
## purposevacation 0.86733
## purposewedding 0.42190
## dti 0.57402
## revol_bal 0.03539 *
## initial_list_statusw 0.34666
## out_prncp 0.99723
## out_prncp_inv 0.99705
## total_pymnt 0.80749
## total_pymnt_inv 0.0000000000778071 ***
## total_rec_prncp 0.80742
## total_rec_int 0.80761
## total_rec_late_fee 0.80793
## recoveries 0.80873
## collection_recovery_fee < 0.0000000000000002 ***
## last_pymnt_amnt 0.0000000000000014 ***
## application_typeJoint App 0.98719
## hardship_flagY NA
## debt_settlement_flagY 0.0000000458173689 ***
## year_temp 0.0000000000000020 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 78034.7 on 81205 degrees of freedom
## Residual deviance: 2231.3 on 81104 degrees of freedom
## AIC: 2435
##
## Number of Fisher Scoring iterations: 25
# "Predicting" the train data
log.prob <- predict(log.fit, newdata = test.log, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
log.pred <- ifelse(log.prob>0.5,"Fully Paid","Defaulted")
# Test errors and confusion matrix
log.matrix <- table(y.test.pca,log.pred)
log.overall.test.err = mean(log.pred != test.log$loan_status)
log.default.test.err= log.matrix[1,2]/(log.matrix[1,1]+log.matrix[1,2]) # Class Test Error
log.paid.test.err= log.matrix[2,1]/(log.matrix[2,2]+log.matrix[2,1]) # Class Test Error
log.matrix
## log.pred
## y.test.pca Defaulted Fully Paid
## Defaulted 3688 64
## Fully Paid 10 16540
log.default.test.err
## [1] 0.01706
log.paid.test.err
## [1] 0.0006042
log.overall.test.err
## [1] 0.003645
For the Logisitic Regression we used the dataset removing the categorical values with more than 50 levels and those with less than 2. The defaulted test error rate is around 1.7 %, the fully paid test error rate is 0.04 % and the overall test error rate is 0.36 %.
In this following section we create a random forest model. We will try different amounts of variables chosen randomly at each try (mtry). In theory we could use mtry that is equal to the amount of predictors in the dataset. In practice we will choose only compare mtry between 1 and 10 (the section where we try different mtrys is commented, as it is computionally expensive and time consuming) .
# devide and conquer 80% train.rf, 20% test.data
set.seed(13)
indices <- sort(sample(1:nrow(d2_sub), p80)) # select 100 random samples
test.rf <- d2_sub[-indices,]
train.rf <- d2_sub[indices,]
#drop NAs & columns with more than 53 categories on train data
drop_na(train.rf)
plus50levels <- train.rf[, sapply(train.rf, function(col) length(unique(col))) > 50]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "numeric")]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "integer")]
rmv.col <- colnames(plus50levels)
train.rf <- train.rf[ , !(names(train.rf) %in% rmv.col)]
# str(train.rf) check
#d2_sub <- as.data.frame(d2_sub)
rf.fit <- randomForest(loan_status ~ ., data= train.rf,ntree=400, mtry=4)
rf.fit
##
## Call:
## randomForest(formula = loan_status ~ ., data = train.rf, ntree = 400, mtry = 4)
## Type of random forest: classification
## Number of trees: 400
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.58%
## Confusion matrix:
## Defaulted Fully Paid class.error
## Defaulted 14669 441 0.0291860
## Fully Paid 32 66064 0.0004841
importance(rf.fit)
## MeanDecreaseGini
## funded_amnt_inv 775.4057
## annual_inc 71.8560
## pub_rec_bankruptcies 10.4872
## revol_util 67.4561
## collections_12_mths_ex_med 5.3176
## chargeoff_within_12_mths 1.9668
## tax_liens 6.4099
## delinq_2yrs 15.3960
## inq_last_6mths 32.5806
## open_acc 55.7434
## pub_rec 14.0838
## total_acc 60.6186
## acc_now_delinq 1.6115
## delinq_amnt 3.0891
## id_2 103.0269
## loan_amnt 857.3068
## funded_amnt 817.2438
## term 373.9342
## int_rate 221.2365
## installment 684.2372
## grade 208.6517
## sub_grade 333.6614
## emp_length 78.4711
## home_ownership 19.5888
## verification_status 40.7899
## pymnt_plan 0.8561
## purpose 56.4789
## dti 89.3934
## revol_bal 81.4486
## initial_list_status 16.7607
## out_prncp 151.5170
## out_prncp_inv 159.2229
## total_pymnt 1330.1439
## total_pymnt_inv 1273.2038
## total_rec_prncp 3268.4156
## total_rec_int 377.7770
## total_rec_late_fee 203.3534
## recoveries 5789.3385
## collection_recovery_fee 4622.2460
## last_pymnt_amnt 1580.4862
## policy_code 0.0000
## application_type 0.2693
## hardship_flag 0.8913
## disbursement_method 0.0000
## debt_settlement_flag 558.6815
## year_temp 73.9763
varImpPlot(rf.fit)
# Uncomment this section only if you have time
# check which RF is the most promising, mtry 10 should be sufficient
# default.train.err=double(10)
# paid.train.err=double(10)
# default.test.err=double(10)
# paid.test.err=double(10)
#
#mtry is no of Variables randomly chosen at each split
# for(mtry in 1:10)
# {
# rf=randomForest(loan_status ~ . , data= train.rf,mtry=mtry,ntree=400)
# default.train.err[mtry] = rf$confusion[1,3]
# paid.train.err[mtry] = rf$confusion[2,3]
#
# pred<-predict(rf,test.rf) #Predictions on Test Set for each Tree
# default.test.err[mtry]= table(test.rf$loan_status,pred)[1,2]/(table(test.rf$loan_status,pred)[1,1]+table(test.rf$loan_status,pred)[1,2]) # Class Test Error
# paid.test.err[mtry]= table(test.rf$loan_status,pred)[2,1]/(table(test.rf$loan_status,pred)[2,2]+table(test.rf$loan_status,pred)[2,1]) # Class Test Error
#
# cat(mtry," ") #printing the output to the console
# }
# plot the results
# matplot(1:mtry , cbind(default.train.err,paid.train.err,default.test.err,paid.test.err), pch=19 , col=c("red","blue","darkred","darkblue"),type="b",ylab="Classification Error",xlab="Number of Predictors Considered at each Split")
# legend("left",legend=c("Default Train Class.Error","Paid Train Class.Error","Default Test Class.Error","Paid Test Class.Error"),pch=19, col=c("red","blue","darkred","darkblue"))
rf.pred <- predict(rf.fit, newdata= test.rf)
table(test.rf$loan_status,rf.pred)
## rf.pred
## Defaulted Fully Paid
## Defaulted 3647 105
## Fully Paid 11 16539
# Test errors and confusion matrix
rf.matrix <- table(test.rf$loan_status,rf.pred)
rf.overall.test.err = mean(rf.pred != test.rf$loan_status)
rf.default.test.err= rf.matrix[1,2]/(rf.matrix[1,1]+rf.matrix[1,2]) # Class Test Error
rf.paid.test.err= rf.matrix[2,1]/(rf.matrix[2,2]+rf.matrix[2,1]) # Class Test Error
rf.matrix
## rf.pred
## Defaulted Fully Paid
## Defaulted 3647 105
## Fully Paid 11 16539
rf.default.test.err
## [1] 0.02799
rf.paid.test.err
## [1] 0.0006647
rf.overall.test.err
## [1] 0.005714
The amount of mtry used in this approach was 4 as we noted that above 4 there was no great increase in accuracy. The most important predictor variables in the random forest are recoveries, followed by collection_recovery_fee. The error rate for the class defaulted is higher (3.14%) than the models seen before. We assume that the random forest did not train so well on the unbalanced data set.
Now we model the data using K-Nearest-Neighbour
# load previous df
df.knn <- d2_sub
#str(df.knn) # Check
#drop NAs & columns with more than 50 levels or less than 2
drop_na(df.knn)
less2levels <- df.knn[, sapply(df.knn, function(col) length(unique(col))) < 2]
plus50levels <- df.knn[, sapply(df.knn, function(col) length(unique(col))) > 50]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "numeric")]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "integer")]
rmv.col <- c(colnames(less2levels),colnames(plus50levels))
# rmv.col # Check
df.knn <- df.knn[ , !(names(df.knn) %in% rmv.col)]
#str(df.knn) # Check
# transform factors to numeric, except for loan_status
str(df.knn)
## 'data.frame': 101508 obs. of 45 variables:
## $ funded_amnt_inv : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ annual_inc : int 104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
## $ pub_rec_bankruptcies : int 0 0 0 1 0 0 0 0 0 0 ...
## $ revol_util : num 51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
## $ collections_12_mths_ex_med: int 0 0 0 0 0 0 0 0 0 0 ...
## $ chargeoff_within_12_mths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tax_liens : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_2yrs : int 0 1 0 0 0 1 0 0 0 0 ...
## $ inq_last_6mths : int 2 3 1 2 2 1 0 0 1 0 ...
## $ open_acc : int 14 12 30 19 10 10 12 16 12 20 ...
## $ pub_rec : int 0 0 0 1 0 0 0 0 0 0 ...
## $ total_acc : int 31 35 47 36 16 34 27 24 17 42 ...
## $ acc_now_delinq : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delinq_amnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ id_2 : int 777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
## $ loan_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ funded_amnt : int 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ term : Factor w/ 2 levels "36 months","60 months": 1 2 1 2 1 1 1 1 1 1 ...
## $ int_rate : num 10.78 22.45 5.32 16.59 13.99 ...
## $ installment : num 1097 290 452 246 144 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 6 1 4 3 1 2 1 1 1 ...
## $ sub_grade : Factor w/ 35 levels "A1","A2","A3",..: 9 26 1 17 14 4 7 2 1 1 ...
## $ emp_length : Factor w/ 12 levels "< 1 year","1 year",..: 10 5 12 1 4 3 3 3 11 1 ...
## $ home_ownership : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 1 4 4 4 1 1 1 3 1 ...
## $ verification_status : Factor w/ 3 levels "Not Verified",..: 2 2 1 2 2 1 1 1 1 2 ...
## $ loan_status : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 14 levels "car","credit_card",..: 3 7 3 3 10 2 3 3 3 2 ...
## $ dti : num 14 25.4 20.2 30.8 31 ...
## $ revol_bal : int 28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
## $ initial_list_status : Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
## $ total_pymnt : num 34309 11740 16241 10758 5036 ...
## $ total_pymnt_inv : num 34309 11740 16241 10758 5036 ...
## $ total_rec_prncp : num 33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
## $ total_rec_int : num 709 1340 1241 758 836 ...
## $ total_rec_late_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ recoveries : num 0 0 0 0 0 0 0 0 0 0 ...
## $ collection_recovery_fee : num 0 0 0 0 0 0 0 0 0 0 ...
## $ last_pymnt_amnt : num 33263 10129 2244 9795 1743 ...
## $ application_type : Factor w/ 2 levels "Individual","Joint App": 1 1 1 1 1 1 1 1 1 1 ...
## $ hardship_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ debt_settlement_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ year_temp : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
for(col in 1:ncol(df.knn)){
if(class(df.knn[,col]) %in% c('factor') && colnames(df.knn)[col] != 'loan_status'){
df.knn[,col] <- as.numeric(df.knn[,col])
}
}
#str(df.knn) # Check
# devide and conquer 80% train.log, 20% test.data
set.seed(13)
indices <- sort(sample(1:nrow(df.knn), p80)) # select 100 random samples
test.knn <- df.knn[-indices,]
train.knn <- df.knn[indices,]
# remove loan_status from test and train set
y.train.knn <- train.knn$loan_status
y.test.knn <- test.knn$loan_status
train.knn <- subset(train.knn, select = -c(loan_status))
test.knn <- subset(test.knn, select = -c(loan_status))
#str(df.knn) # Check
set.seed (1)
knn.pred <- knn(train.knn,test.knn, y.train.knn,k=1)
# Test errors and confusion matrix
knn.matrix <- table(y.test.knn,knn.pred)
knn.overall.test.err = mean(knn.pred != y.test.knn)
knn.default.test.err= knn.matrix[1,2]/(knn.matrix[1,1]+knn.matrix[1,2]) # Class Test Error
knn.paid.test.err= knn.matrix[2,1]/(knn.matrix[2,2]+knn.matrix[2,1]) # Class Test Error
knn.matrix
## knn.pred
## y.test.knn Defaulted Fully Paid
## Defaulted 2761 991
## Fully Paid 487 16063
knn.default.test.err
## [1] 0.2641
knn.paid.test.err
## [1] 0.02943
knn.overall.test.err
## [1] 0.0728
The results of the KNN model do not look as accurate as the previous ones. We have a defaulted error rate of 26.3 %, a fully paid error rate of 2.8 %. The overall test error rate is 7.2 %. Thus, KNN does not seem as accurate as some previous models.
Finally, we model data using a simple decision tree.
# devide and conquer 80% train 20% test data
set.seed(13)
p80 <- nrow(df.pca)*0.8
indices <- sort(sample(1:nrow(df.pca), p80)) # select 100 random samples
test.tree <- df.pca[-indices,]
train.tree <- df.pca[indices,]
# fit the tree
tree.fit = tree(loan_status ~ . , data = train.tree)
summary(tree.fit)
##
## Classification tree:
## tree(formula = loan_status ~ ., data = train.tree)
## Variables actually used in tree construction:
## [1] "recoveries" "last_pymnt_amnt" "term" "total_rec_prncp"
## [5] "funded_amnt" "funded_amnt_inv"
## Number of terminal nodes: 8
## Residual mean deviance: 0.141 = 11400 / 81200
## Misclassification error rate: 0.018 = 1459 / 81206
plot(tree.fit)
text(tree.fit, pretty = 0)
tree.fit
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 81206 80000 Fully Paid ( 0.186 0.814 )
## 2) recoveries < 0.005 69797 30000 Fully Paid ( 0.054 0.946 )
## 4) last_pymnt_amnt < 1269.81 30758 20000 Fully Paid ( 0.120 0.880 )
## 8) term < 1.5 26963 10000 Fully Paid ( 0.066 0.934 )
## 16) total_rec_prncp < 4998.13 5275 5000 Fully Paid ( 0.204 0.796 )
## 32) funded_amnt < 4987.5 4423 2000 Fully Paid ( 0.051 0.949 ) *
## 33) funded_amnt > 4987.5 852 0 Defaulted ( 1.000 0.000 ) *
## 17) total_rec_prncp > 4998.13 21688 6000 Fully Paid ( 0.033 0.967 ) *
## 9) term > 1.5 3795 5000 Defaulted ( 0.501 0.499 )
## 18) total_rec_prncp < 9989.65 1636 1000 Defaulted ( 0.896 0.104 )
## 36) funded_amnt_inv < 9725 174 40 Fully Paid ( 0.023 0.977 ) *
## 37) funded_amnt_inv > 9725 1462 0 Defaulted ( 1.000 0.000 ) *
## 19) total_rec_prncp > 9989.65 2159 2000 Fully Paid ( 0.202 0.798 ) *
## 5) last_pymnt_amnt > 1269.81 39039 700 Fully Paid ( 0.001 0.999 ) *
## 3) recoveries > 0.005 11409 500 Defaulted ( 0.997 0.003 ) *
tree.pred <- predict(tree.fit,newdata = test.tree, type = "class")
with(test.tree, table(tree.pred,test.tree$loan_status))
##
## tree.pred Defaulted Fully Paid
## Defaulted 3395 13
## Fully Paid 357 16537
# decision tree with CV
cvtree.fit = cv.tree(tree.fit, FUN=prune.misclass)
plot(cvtree.fit)
tree.prune.fit <- prune.misclass(tree.fit,best=3)
tree.prune.fit
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 81206 80000 Fully Paid ( 0.186 0.814 )
## 2) recoveries < 0.005 69797 30000 Fully Paid ( 0.054 0.946 )
## 4) last_pymnt_amnt < 1269.81 30758 20000 Fully Paid ( 0.120 0.880 )
## 8) term < 1.5 26963 10000 Fully Paid ( 0.066 0.934 ) *
## 9) term > 1.5 3795 5000 Defaulted ( 0.501 0.499 )
## 18) total_rec_prncp < 9989.65 1636 1000 Defaulted ( 0.896 0.104 ) *
## 19) total_rec_prncp > 9989.65 2159 2000 Fully Paid ( 0.202 0.798 ) *
## 5) last_pymnt_amnt > 1269.81 39039 700 Fully Paid ( 0.001 0.999 ) *
## 3) recoveries > 0.005 11409 500 Defaulted ( 0.997 0.003 ) *
tree.prune.pred <- predict(tree.prune.fit,newdata = test.tree, type = "class")
# Test errors and confusion matrix for the pruned tree
tree.matrix <- with(test.tree, table(tree.prune.pred,test.tree$loan_status))
tree.overall.test.err = mean(tree.prune.pred != test.tree$loan_status)
tree.default.test.err= tree.matrix[1,2]/(tree.matrix[1,1]+tree.matrix[1,2]) # Class Test Error
tree.paid.test.err= tree.matrix[2,1]/(tree.matrix[2,2]+tree.matrix[2,1]) # Class Test Error
tree.matrix
##
## tree.prune.pred Defaulted Fully Paid
## Defaulted 3188 62
## Fully Paid 564 16488
tree.overall.test.err
## [1] 0.03083
tree.default.test.err
## [1] 0.01908
tree.paid.test.err
## [1] 0.03308
The results of the decision tree look surprisingly good. The overall test error rate is 3.1 %, the defaulted test error rate is 1.4 % and the fully paid test error rate is 3.4%. The great advantage of this approach is how easy we can explain the results. By far the most important indicator in the pruned tree is the recoveries. If this value is above 0.01, then the model will forecast the loan to Default.
PCA <- c(pca.default.test.err,pca.paid.test.err,pca.overall.test.err)
LogRegression <- c(log.default.test.err,log.paid.test.err,log.overall.test.err)
RF <- c(rf.default.test.err,rf.paid.test.err,rf.overall.test.err)
KNN <- c(knn.default.test.err,knn.paid.test.err,knn.overall.test.err)
Tree <- c(tree.default.test.err,tree.paid.test.err,tree.overall.test.err)
row.names <- c("Class:Defaulted Test Error", "Class:Paid Test Error", "Total:Test Error")
all <- data.frame(PCA,LogRegression,RF,KNN,Tree,row.names = row.names )
all
all_transposed <- as.data.frame(t(all))
all_transposed[order(all_transposed$`Total:Test Error`),]
write.csv(all_transposed, file = "data/classification_methods_results.csv")
If the objective is to reduce overall test error then the PCA in combination with a logistic regression would be the most appropriate method. If however the class error of the defaulted loans is the objective then a simple pruned tree model might be more appropriate. In addition, we should note that the PCA is an unsupervised method and is therefore inadequate for explaining models. Bottom line: The tree model offers the best solution that is accurate, explainable and understandable.
Following outlook could be considered. Further analysis could be done with PLS discriminant analysis which would allow computing for a categorical target variable. Because of time constraints, as well as lacking knowledge in this area, this was not further explored. A balanced data set could have been set up for the training of the models. This would imply that roughly 50% of the outcome would be “Defaulted” and 50% “Fully Paid”. We would expect some models to perform better with a balanced training set. The team would have also liked more time to implement a Boosting model, which we would have expected to outperform the RF.