Abstract

Task 1: We aim at finding a regression for estimating the interest rate applied to a particular lending request.
Solution: We used four linear model (LM) approaches and got the following results:

(regression_results <- read.csv("data/regression_methods_results.csv") %>% select(Model = model, Test_MSE, R2))

According to our results the linear regression model approach with best subset selection has the smallest test MSE (1.372) overall. The linear regression model with leave one out cross validation is nearly as good with a test MSE of 1.375. The ridge and the lasso regression perform noticeable worse (test MSE of 1.708 and 2.064). Details to the winnig model with best subset selection can be found below.

Question Answer
1. Is at least one of the predictores X1, X2, …, Xp useful in predicting the response? First, one could reduce the possible predictors in the data preprocessing part. One could remove 126 out of 147 possible predictors, because they would not explain the response at all (93 predictors with more than 5% missing values or just one level) or there were too many levels (more than 10). Some of the few 126 removed predictors were a human decision, especially if the predictors were not understable in regard to the response. Back to the original question: Yes, there were 31 predictors useful in predicting the response. Due to matrix transformation there are a lot of new dummy variables out of the 21 remaining predictors, for example grade, home_ownership or loan_status.
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? No, only 31 of the preditors help to explain the response. One could try to only take 23 predictors because one can reduce most of the test error with them.
3. How well does the model fit the data? How accurate is the prediction? Quite well. The remaining test error (MSE) is 1.372.



Task 2: We would like to find a classification model for the default status.

We performed the following methods and got the these results:

classification_results <- read.csv("data/classification_methods_results.csv")
classification_results %>% rename(Model = X) %>% arrange(Class.Defaulted.Test.Error)

If the objective is to reduce overall test error then the PCA in combination with a logistic regression would be the most appropriate method. If however the class error of the defaulted loans is the objective then a simple pruned tree model might be more appropriate. In addition, we should note that the PCA is an unsupervised method and is therefore inadequate for explaining models. Bottom line: The tree model offers the best solution that is accurate, explainable and understandable.



Data Preparation

Load Packages

# load remaining packages
library(DataExplorer) # https://github.com/boxuancui/DataExplorer
library(glmnet)
library(janitor) # https://github.com/sfirke/janitor
library(plotly)
library(GGally)
library(corrplot)
library(PerformanceAnalytics)
library(caret)
library(leaps)
library(knitr)
library(pls)
library(ISLR)
library(dplyr)
library(randomForest)
library(tree)
library(gdata)
library(class)



Set Options

# set printing preferences
options(scipen = 99) # penalty for displaying scientific notation
options(digits = 4) # suggested number of digits to display



Load Data

The following code loads the data from Kaggle, filters the data to the date range from 2007 to 2015 and selects the subset for group 3 according to the lecturer instructions. One can skip the following part because our team provids the prepared data in the next section.

# preparation step: loading data from https://www.kaggle.com/wendykan/lending-club-loan-data
# load data
data_temp <- read_csv("data/loan.csv")

# select years 2007 - 2015
data_temp %<>% 
  mutate(year_temp = as.integer(str_sub(issue_d,-4,-1))) %>%
  filter(year_temp >= 2007,
         year_temp <= 2015)

# subset our group subset
# d <- data_temp[which(data_temp$id%%8+1==6),] # not working: no id values

data_temp_2 <- cbind(id_2 = rownames(data_temp), data_temp)
rownames(data_temp_2) <- 1:nrow(data_temp_2)
data_temp_2$id_2 <- as.numeric(data_temp_2$id_2)
dataset_3 <- data_temp_2[which(data_temp_2$id_2%%8+1==3),]

# we continue working with d for easier coding reasons
d <- dataset_3 

# transform data frame into much nicer tibble format for working with
# huge datasets
d <- as_tibble(d)

# remove big data file for memory reasons
rm(data_temp)
rm(data_temp_2)
rm(dataset_3)

# write group 3 data
path_file <- "./data/data_group_3.csv"
if (!file.exists(path_file)) {
  write_csv(d, path = "data/data_group_3.csv")
}

One can load the necessary data for the regression and classification in the following section.

# import already generated file for group 3 
d <- read_csv("data/data_group_3.csv") #, n_max = 2000) 
d <- as_tibble(d)

Data Exploration

compare_df_cols(d) %>% arrange(d, column_name) %>% count(d) %>% arrange(desc(n))
d %>% remove_empty(c("rows", "cols")) # shows 14 empty columns
d %>% remove_constant() # shows no column with just single values
d %>% get_dupes() # there are no duplicates
# explore interest rate
d %>% tabyl(int_rate) %>% arrange(desc(n)) %>% plot_ly(x = ~n, type = "histogram")
d %>% tabyl(int_rate) %>% arrange(desc(n)) %>% plot_ly(x = ~n, type = "box")







Task 1: Regression

Preparation

  • NAs identification and removal of 93 columns containing equal and more than 5% of NA values.
# create copy of dataset
d_copy <- d

# data preparation --> check for NAs in int_rate and remove them if there are any
d %>%
  select(int_rate) %>%  
  summarise_all(funs(sum(is.na(.))))
# there are no NAs --> nothing to do

#further NAs investigation
## too much variables to see all NAs attributes clearly --> split variables
plot_missing(d[1:30])

plot_missing(d[31:60])

plot_missing(d[61:90])

plot_missing(d[91:120])

plot_missing(d[121:147])

## there are a lot of variables with an NA-Quota bigger than 5%

# generate columns where NA quota is bigger or equal than 5% of rows
d_na_quoate_bigger_equals_0.05 <- d %>% summarise_all(funs(sum(is.na(.)))) %>% t() %>% 
  as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>% 
  select(count = V1, everything()) %>% arrange(desc(count)) %>% 
  mutate(quote = count/length(d$id)) %>% filter(quote >= 0.05)
(d_count_columns_na_quoate_bigger_equals_0.05 <- length(d_na_quoate_bigger_equals_0.05$column_names))
## [1] 93
# 93 columns with NA quote >= 5%

# remove 93 columns with NA quota >= 0.05%
d_filter_temp <- d %>% summarise_all(funs(sum(is.na(.)))) %>% t() %>% 
  as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>% 
  select(count = V1, everything()) %>% arrange(desc(count)) %>% 
  mutate(quote = count/length(d$id)) %>% filter(quote < 0.05)

d_filter_temp %<>% select(column_names)
d %<>% select(d_filter_temp$column_names)
dim(d)[2]
## [1] 54
# success: only 54 columns left!



Prepare valid dataset for regression and classification

  • remove remaining rows with NAs
# inspect remaining variables
plot_missing(d) # success!

# create backup copy containing NAs
d_with_nas <- d

# remove rows with NAs
d %<>% na.omit() # 2347 NAs removed

# create new data frame for part 2 (classification)
d_na_less_5_percent <- d
# write data for classification
path_file <- "./data/d_na_less_5_percent.csv"
if (!file.exists(path_file)) {
  write_csv(d, path = "data/d_na_less_5_percent.csv")
}



Prepare Dataset for regression task

  • remove categorial variables with more than 10 levels
  • subjective removal of variables: columns with just one value and not understandable variables
introduce(d) %>%  t()
##                          [,1]
## rows                   108583
## columns                    54
## discrete_columns           21
## continuous_columns         33
## all_missing_columns         0
## total_missing_values        0
## complete_rows          108583
## total_observations    5863482
## memory_usage         47747920
plot_intro(d)

plot_bar(d) # 7 columns ignored with more than 50 categories.
## 7 columns ignored with more than 50 categories.
## last_pymnt_d: 129 categories
## title: 9650 categories
## earliest_cr_line: 624 categories
## last_credit_pull_d: 127 categories
## issue_d: 99 categories
## zip_code: 874 categories
## addr_state: 51 categories

# last_pymnt_d: 129 categories
# title: 9652 categories
# earliest_cr_line: 624 categories
# last_credit_pull_d: 127 categories
# issue_d: 99 categories
# zip_code: 874 categories
# addr_state: 51 categories

# generate a list of unique counts per column
(d_unique_counts_per_column <- d %>% summarise_all(list(~n_distinct(.))) %>% t() %>% as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>% 
    select(count = V1, everything()) %>% arrange(desc(count)))
# assumption: select most important columns according to subjective evaluation
## filter e. g. columns with just one unique value, not understandable meaning, 
## in general columns <= 10 distinct values, 
d %<>% select(int_rate, loan_amnt, revol_bal, revol_util, funded_amnt_inv, annual_inc, pub_rec_bankruptcies, 
            term, pymnt_plan, initial_list_status, application_type, hardship_flag, debt_settlement_flag, 
            verification_status, home_ownership, acc_now_delinq, grade, collections_12_mths_ex_med, loan_status,
            chargeoff_within_12_mths, year_temp)

# write file for easier regression analysis start
path_file <- "./data/data_group_3_regression.csv"
if (!file.exists(path_file)) {
  write_csv(d, path = "data/data_group_3_regression.csv")
}



Load prepared data

One who would just perfrom the regression models could directly start here. The necessary data are provided in the next code block.

# starting with the following data for regression task
d <- read_csv("data/data_group_3_regression.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   term = col_character(),
##   pymnt_plan = col_character(),
##   initial_list_status = col_character(),
##   application_type = col_character(),
##   hardship_flag = col_character(),
##   debt_settlement_flag = col_character(),
##   verification_status = col_character(),
##   home_ownership = col_character(),
##   grade = col_character(),
##   loan_status = col_character()
## )
## See spec(...) for full column specifications.



Regression data exploration

The following regression models answer to following questions (from “An Introduction to Statistical Learning”, Springer, 2017, Page 75):

  1. Is at least one of the predictores X1, X2, …, Xp useful in predicting the response?
  2. Do all thr predictors help to explain Y, or is only a subset of the predictors useful?
  3. How well does the model fit the data? How accurate is the prediction?

One finds the answers at the end of this regression analysis part.

ggpairs(d[, 1:3]) 

# ggpairs(d) # not really working for all data

# show scatterplots, distributions & correlations
set.seed(22)
d %>%
  sample_n(1000) %>% 
  select(c(1:7, 16, 18, 20, 21)) %>% 
  na.omit() %>% 
  chart.Correlation(histogram = TRUE, pch = 19) # better than cor() %>%corrplot()

### show patterns ###
plot_bar(d, with = "int_rate")

plot_boxplot(d, by = "int_rate")

plot_histogram(d)

plot_scatterplot(split_columns(d)$continuous, by = "int_rate", sampled_rows = 1000L)

There are no strong correlations between int_rate and all other variables. One assumes that we need several predictors for a suitable model. Interesting: There is a very strong correlation between loan_amount and funded_amnt_inv.


Validation set approach

  • Split data in training (67%) and test set (33%).
  • Removal of rows with troublesome value “NONE”
# Split the data into training and test set
set.seed(22)
training_samples <- d$int_rate %>%
  createDataPartition(p = 0.67, list = FALSE)
train_data  <- d[training_samples, ]
test_data <- d[-training_samples, ]
# remove NONE values which cause problems regarding model generation
test_data %<>% filter(home_ownership != "NONE")
train_data %<>% filter(home_ownership != "NONE")



Models

Exploratory linear regression

  • try simple linear model as a first exploratory approach
  • improve first model
# Build the model first try
lm_model <- lm(int_rate ~ ., data = train_data)
summary(lm_model)
## 
## Call:
## lm(formula = int_rate ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.519  -0.872   0.036   0.865   4.305 
## 
## Coefficients: (1 not defined because of singularities)
##                                                                      Estimate
## (Intercept)                                                    802.2066801977
## loan_amnt                                                       -0.0003086859
## revol_bal                                                       -0.0000020704
## revol_util                                                       0.0056891673
## funded_amnt_inv                                                  0.0003121423
## annual_inc                                                      -0.0000002850
## pub_rec_bankruptcies                                             0.0751329199
## term60 months                                                    0.2132050242
## pymnt_plany                                                     -0.1552257175
## initial_list_statusw                                            -0.0860336871
## application_typeJoint App                                        0.0207964906
## hardship_flagY                                                             NA
## debt_settlement_flagY                                           -0.0415066104
## verification_statusSource Verified                               0.0602749870
## verification_statusVerified                                      0.2818644813
## home_ownershipOTHER                                             -0.9907899725
## home_ownershipOWN                                                0.0183835613
## home_ownershipRENT                                               0.0360787957
## acc_now_delinq                                                   0.2050850016
## gradeB                                                           3.4381638127
## gradeC                                                           6.5588592529
## gradeD                                                           9.6845363379
## gradeE                                                          12.4093353062
## gradeF                                                          15.9656363201
## gradeG                                                          18.1417385984
## collections_12_mths_ex_med                                      -0.0220253352
## loan_statusCurrent                                              -0.3646700318
## loan_statusDoes not meet the credit policy. Status:Charged Off  -3.2233222795
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -3.5624382781
## loan_statusFully Paid                                           -0.0902041231
## loan_statusIn Grace Period                                      -0.3305893808
## loan_statusLate (16-30 days)                                    -0.0825789745
## loan_statusLate (31-120 days)                                   -0.1998072911
## chargeoff_within_12_mths                                         0.0931754742
## year_temp                                                       -0.3948021639
##                                                                    Std. Error
## (Intercept)                                                      8.5549017340
## loan_amnt                                                        0.0000114455
## revol_bal                                                        0.0000002183
## revol_util                                                       0.0001990838
## funded_amnt_inv                                                  0.0000114537
## annual_inc                                                       0.0000000667
## pub_rec_bankruptcies                                             0.0123966239
## term60 months                                                    0.0127306969
## pymnt_plany                                                      0.5338359837
## initial_list_statusw                                             0.0095813271
## application_typeJoint App                                        0.1645470952
## hardship_flagY                                                             NA
## debt_settlement_flagY                                            0.0328428767
## verification_statusSource Verified                               0.0111819662
## verification_statusVerified                                      0.0119250363
## home_ownershipOTHER                                              0.3729366090
## home_ownershipOWN                                                0.0154055692
## home_ownershipRENT                                               0.0095771804
## acc_now_delinq                                                   0.0531781521
## gradeB                                                           0.0139086212
## gradeC                                                           0.0147744261
## gradeD                                                           0.0171087015
## gradeE                                                           0.0213715895
## gradeF                                                           0.0316785085
## gradeG                                                           0.0594618845
## collections_12_mths_ex_med                                       0.0327539262
## loan_statusCurrent                                               0.0217741818
## loan_statusDoes not meet the credit policy. Status:Charged Off   0.2059161940
## loan_statusDoes not meet the credit policy. Status:Fully Paid    0.1322630433
## loan_statusFully Paid                                            0.0127855330
## loan_statusIn Grace Period                                       0.1361802734
## loan_statusLate (16-30 days)                                     0.2112354846
## loan_statusLate (31-120 days)                                    0.0975745521
## chargeoff_within_12_mths                                         0.0382852794
## year_temp                                                        0.0042487679
##                                                                t value
## (Intercept)                                                      93.77
## loan_amnt                                                       -26.97
## revol_bal                                                        -9.49
## revol_util                                                       28.58
## funded_amnt_inv                                                  27.25
## annual_inc                                                       -4.27
## pub_rec_bankruptcies                                              6.06
## term60 months                                                    16.75
## pymnt_plany                                                      -0.29
## initial_list_statusw                                             -8.98
## application_typeJoint App                                         0.13
## hardship_flagY                                                      NA
## debt_settlement_flagY                                            -1.26
## verification_statusSource Verified                                5.39
## verification_statusVerified                                      23.64
## home_ownershipOTHER                                              -2.66
## home_ownershipOWN                                                 1.19
## home_ownershipRENT                                                3.77
## acc_now_delinq                                                    3.86
## gradeB                                                          247.20
## gradeC                                                          443.93
## gradeD                                                          566.06
## gradeE                                                          580.65
## gradeF                                                          503.99
## gradeG                                                          305.10
## collections_12_mths_ex_med                                       -0.67
## loan_statusCurrent                                              -16.75
## loan_statusDoes not meet the credit policy. Status:Charged Off  -15.65
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -26.93
## loan_statusFully Paid                                            -7.06
## loan_statusIn Grace Period                                       -2.43
## loan_statusLate (16-30 days)                                     -0.39
## loan_statusLate (31-120 days)                                    -2.05
## chargeoff_within_12_mths                                          2.43
## year_temp                                                       -92.92
##                                                                            Pr(>|t|)
## (Intercept)                                                    < 0.0000000000000002
## loan_amnt                                                      < 0.0000000000000002
## revol_bal                                                      < 0.0000000000000002
## revol_util                                                     < 0.0000000000000002
## funded_amnt_inv                                                < 0.0000000000000002
## annual_inc                                                          0.0000194774277
## pub_rec_bankruptcies                                                0.0000000013615
## term60 months                                                  < 0.0000000000000002
## pymnt_plany                                                                 0.77122
## initial_list_statusw                                           < 0.0000000000000002
## application_typeJoint App                                                   0.89943
## hardship_flagY                                                                   NA
## debt_settlement_flagY                                                       0.20631
## verification_statusSource Verified                                  0.0000000705292
## verification_statusVerified                                    < 0.0000000000000002
## home_ownershipOTHER                                                         0.00789
## home_ownershipOWN                                                           0.23275
## home_ownershipRENT                                                          0.00017
## acc_now_delinq                                                              0.00012
## gradeB                                                         < 0.0000000000000002
## gradeC                                                         < 0.0000000000000002
## gradeD                                                         < 0.0000000000000002
## gradeE                                                         < 0.0000000000000002
## gradeF                                                         < 0.0000000000000002
## gradeG                                                         < 0.0000000000000002
## collections_12_mths_ex_med                                                  0.50130
## loan_statusCurrent                                             < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Charged Off < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Fully Paid  < 0.0000000000000002
## loan_statusFully Paid                                               0.0000000000017
## loan_statusIn Grace Period                                                  0.01520
## loan_statusLate (16-30 days)                                                0.69585
## loan_statusLate (31-120 days)                                               0.04059
## chargeoff_within_12_mths                                                    0.01495
## year_temp                                                      < 0.0000000000000002
##                                                                   
## (Intercept)                                                    ***
## loan_amnt                                                      ***
## revol_bal                                                      ***
## revol_util                                                     ***
## funded_amnt_inv                                                ***
## annual_inc                                                     ***
## pub_rec_bankruptcies                                           ***
## term60 months                                                  ***
## pymnt_plany                                                       
## initial_list_statusw                                           ***
## application_typeJoint App                                         
## hardship_flagY                                                    
## debt_settlement_flagY                                             
## verification_statusSource Verified                             ***
## verification_statusVerified                                    ***
## home_ownershipOTHER                                            ** 
## home_ownershipOWN                                                 
## home_ownershipRENT                                             ***
## acc_now_delinq                                                 ***
## gradeB                                                         ***
## gradeC                                                         ***
## gradeD                                                         ***
## gradeE                                                         ***
## gradeF                                                         ***
## gradeG                                                         ***
## collections_12_mths_ex_med                                        
## loan_statusCurrent                                             ***
## loan_statusDoes not meet the credit policy. Status:Charged Off ***
## loan_statusDoes not meet the credit policy. Status:Fully Paid  ***
## loan_statusFully Paid                                          ***
## loan_statusIn Grace Period                                     *  
## loan_statusLate (16-30 days)                                      
## loan_statusLate (31-120 days)                                  *  
## chargeoff_within_12_mths                                       *  
## year_temp                                                      ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.17 on 72718 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.928 
## F-statistic: 2.85e+04 on 33 and 72718 DF,  p-value: <0.0000000000000002
# remove not significant predictors
train_data_lm <- train_data %>% select(-pymnt_plan, -collections_12_mths_ex_med)
test_data_lm <- test_data %>% select(-pymnt_plan, -collections_12_mths_ex_med)

# second try
lm_model <- lm(int_rate ~ ., data = train_data_lm)
summary(lm_model)
## 
## Call:
## lm(formula = int_rate ~ ., data = train_data_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.518  -0.872   0.036   0.865   4.284 
## 
## Coefficients:
##                                                                      Estimate
## (Intercept)                                                    802.4465289657
## loan_amnt                                                       -0.0003087075
## revol_bal                                                       -0.0000020691
## revol_util                                                       0.0056948126
## funded_amnt_inv                                                  0.0003121712
## annual_inc                                                      -0.0000002854
## pub_rec_bankruptcies                                             0.0751867373
## term60 months                                                    0.2133866484
## initial_list_statusw                                            -0.0860427429
## application_typeJoint App                                        0.0211644755
## hardship_flagY                                                  -0.1546470906
## debt_settlement_flagY                                           -0.0412172759
## verification_statusSource Verified                               0.0601186230
## verification_statusVerified                                      0.2816950230
## home_ownershipOTHER                                             -0.9907347072
## home_ownershipOWN                                                0.0183600927
## home_ownershipRENT                                               0.0361318887
## acc_now_delinq                                                   0.2040971144
## gradeB                                                           3.4379225999
## gradeC                                                           6.5585298813
## gradeD                                                           9.6841821464
## gradeE                                                          12.4089177736
## gradeF                                                          15.9652127988
## gradeG                                                          18.1411912756
## loan_statusCurrent                                              -0.3647086569
## loan_statusDoes not meet the credit policy. Status:Charged Off  -3.2234223886
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -3.5624433792
## loan_statusFully Paid                                           -0.0900978667
## loan_statusIn Grace Period                                      -0.3301965231
## loan_statusLate (16-30 days)                                    -0.0829198365
## loan_statusLate (31-120 days)                                   -0.1998885589
## chargeoff_within_12_mths                                         0.0920427660
## year_temp                                                       -0.3949214992
##                                                                    Std. Error
## (Intercept)                                                      8.5474307795
## loan_amnt                                                        0.0000114454
## revol_bal                                                        0.0000002183
## revol_util                                                       0.0001989059
## funded_amnt_inv                                                  0.0000114536
## annual_inc                                                       0.0000000667
## pub_rec_bankruptcies                                             0.0123963189
## term60 months                                                    0.0127277835
## initial_list_statusw                                             0.0095812815
## application_typeJoint App                                        0.1645455654
## hardship_flagY                                                   0.5338332795
## debt_settlement_flagY                                            0.0328399345
## verification_statusSource Verified                               0.0111795061
## verification_statusVerified                                      0.0119223284
## home_ownershipOTHER                                              0.3729351952
## home_ownershipOWN                                                0.0154054716
## home_ownershipRENT                                               0.0095768189
## acc_now_delinq                                                   0.0531576556
## gradeB                                                           0.0139039425
## gradeC                                                           0.0147662491
## gradeD                                                           0.0171005273
## gradeE                                                           0.0213624873
## gradeF                                                           0.0316721277
## gradeG                                                           0.0594560897
## loan_statusCurrent                                               0.0217740241
## loan_statusDoes not meet the credit policy. Status:Charged Off   0.2059153646
## loan_statusDoes not meet the credit policy. Status:Fully Paid    0.1322625449
## loan_statusFully Paid                                            0.0127845084
## loan_statusIn Grace Period                                       0.1361785073
## loan_statusLate (16-30 days)                                     0.2112340808
## loan_statusLate (31-120 days)                                    0.0975741097
## chargeoff_within_12_mths                                         0.0382480616
## year_temp                                                        0.0042450441
##                                                                t value
## (Intercept)                                                      93.88
## loan_amnt                                                       -26.97
## revol_bal                                                        -9.48
## revol_util                                                       28.63
## funded_amnt_inv                                                  27.26
## annual_inc                                                       -4.28
## pub_rec_bankruptcies                                              6.07
## term60 months                                                    16.77
## initial_list_statusw                                             -8.98
## application_typeJoint App                                         0.13
## hardship_flagY                                                   -0.29
## debt_settlement_flagY                                            -1.26
## verification_statusSource Verified                                5.38
## verification_statusVerified                                      23.63
## home_ownershipOTHER                                              -2.66
## home_ownershipOWN                                                 1.19
## home_ownershipRENT                                                3.77
## acc_now_delinq                                                    3.84
## gradeB                                                          247.26
## gradeC                                                          444.16
## gradeD                                                          566.31
## gradeE                                                          580.87
## gradeF                                                          504.08
## gradeG                                                          305.12
## loan_statusCurrent                                              -16.75
## loan_statusDoes not meet the credit policy. Status:Charged Off  -15.65
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -26.93
## loan_statusFully Paid                                            -7.05
## loan_statusIn Grace Period                                       -2.42
## loan_statusLate (16-30 days)                                     -0.39
## loan_statusLate (31-120 days)                                    -2.05
## chargeoff_within_12_mths                                          2.41
## year_temp                                                       -93.03
##                                                                            Pr(>|t|)
## (Intercept)                                                    < 0.0000000000000002
## loan_amnt                                                      < 0.0000000000000002
## revol_bal                                                      < 0.0000000000000002
## revol_util                                                     < 0.0000000000000002
## funded_amnt_inv                                                < 0.0000000000000002
## annual_inc                                                          0.0000190210480
## pub_rec_bankruptcies                                                0.0000000013240
## term60 months                                                  < 0.0000000000000002
## initial_list_statusw                                           < 0.0000000000000002
## application_typeJoint App                                                   0.89766
## hardship_flagY                                                              0.77205
## debt_settlement_flagY                                                       0.20945
## verification_statusSource Verified                                  0.0000000757283
## verification_statusVerified                                    < 0.0000000000000002
## home_ownershipOTHER                                                         0.00790
## home_ownershipOWN                                                           0.23335
## home_ownershipRENT                                                          0.00016
## acc_now_delinq                                                              0.00012
## gradeB                                                         < 0.0000000000000002
## gradeC                                                         < 0.0000000000000002
## gradeD                                                         < 0.0000000000000002
## gradeE                                                         < 0.0000000000000002
## gradeF                                                         < 0.0000000000000002
## gradeG                                                         < 0.0000000000000002
## loan_statusCurrent                                             < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Charged Off < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Fully Paid  < 0.0000000000000002
## loan_statusFully Paid                                               0.0000000000018
## loan_statusIn Grace Period                                                  0.01532
## loan_statusLate (16-30 days)                                                0.69465
## loan_statusLate (31-120 days)                                               0.04051
## chargeoff_within_12_mths                                                    0.01611
## year_temp                                                      < 0.0000000000000002
##                                                                   
## (Intercept)                                                    ***
## loan_amnt                                                      ***
## revol_bal                                                      ***
## revol_util                                                     ***
## funded_amnt_inv                                                ***
## annual_inc                                                     ***
## pub_rec_bankruptcies                                           ***
## term60 months                                                  ***
## initial_list_statusw                                           ***
## application_typeJoint App                                         
## hardship_flagY                                                    
## debt_settlement_flagY                                             
## verification_statusSource Verified                             ***
## verification_statusVerified                                    ***
## home_ownershipOTHER                                            ** 
## home_ownershipOWN                                                 
## home_ownershipRENT                                             ***
## acc_now_delinq                                                 ***
## gradeB                                                         ***
## gradeC                                                         ***
## gradeD                                                         ***
## gradeE                                                         ***
## gradeF                                                         ***
## gradeG                                                         ***
## loan_statusCurrent                                             ***
## loan_statusDoes not meet the credit policy. Status:Charged Off ***
## loan_statusDoes not meet the credit policy. Status:Fully Paid  ***
## loan_statusFully Paid                                          ***
## loan_statusIn Grace Period                                     *  
## loan_statusLate (16-30 days)                                      
## loan_statusLate (31-120 days)                                  *  
## chargeoff_within_12_mths                                       *  
## year_temp                                                      ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.17 on 72719 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.928 
## F-statistic: 2.94e+04 on 32 and 72719 DF,  p-value: <0.0000000000000002
table(d$pymnt_plan) # clear distribution > most of the values are no, only 9 values are yes > predictor can be removed
## 
##      n      y 
## 108574      9
table(d$collections_12_mths_ex_med) # more or less clear distribution > most of the values are in categorie 0 > predictor can be removed
## 
##      0      1      2      3      4      5      6 
## 107173   1305     83     14      6      1      1
# other not significant predictors in the lm-Modell can't be removed because they have categories which don't
# show a clearly recognisable distribution of the values, for example
table(d$loan_status) # not a clear distribution > predictor can't be removed
## 
##                                         Charged Off 
##                                               18471 
##                                             Current 
##                                                7075 
## Does not meet the credit policy. Status:Charged Off 
##                                                  49 
##  Does not meet the credit policy. Status:Fully Paid 
##                                                 122 
##                                          Fully Paid 
##                                               82475 
##                                     In Grace Period 
##                                                 122 
##                                   Late (16-30 days) 
##                                                  44 
##                                  Late (31-120 days) 
##                                                 225
table(d$home_ownership)
## 
## MORTGAGE     NONE    OTHER      OWN     RENT 
##    54498        1       14    10661    43409
# Make predictions and compute the R2, RMSE and MAE
lm_predictions <- lm_model %>% predict(test_data)
summary(lm_predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.05   10.46   13.51   13.24   16.65   27.74
na_filter <- lm_predictions[!is.na(lm_predictions)]
# show measures for model quality
(lm_evaluation_vsa <-
  data.frame(
    R2 = caret::R2(lm_predictions[na_filter], test_data$int_rate[na_filter]),
    RMSE = caret::RMSE(lm_predictions[na_filter], test_data$int_rate[na_filter]) # ,
    # MAE = MAE(lm_predictions[na_filter], test_data$int_rate[na_filter])
  ))



Leave one out cross validation (LOOCV)

  • Perform LOOCV
# Leave one out cross validation - LOOCV
# Define training control
train_control <- trainControl(method = "LOOCV")
# Train the model
set.seed(22)
d_sample <- d %>% sample_n(2000) %>% na.omit()
# check if there is just one unique value per column: hit with "pymnt_plan" and "hardship_flag"
d_sample %>% summarise_all(list(~n_distinct(.))) %>% t() %>% as.data.frame() %>% rownames_to_column("column_names") %>% as_tibble() %>% 
  select(count = V1, everything()) %>% arrange(count)
d_sample %<>% select(-c(pymnt_plan, hardship_flag, application_type))
## the following loocv code snippet works within this environment (one could have problems with mac)
# sessionInfo()
lm_model_loocv <- train(int_rate ~ ., data = d_sample, method = "lm",
               trControl = train_control)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
# Summarize the results
(lm_evaluation_loocv <- lm_model_loocv)
## Linear Regression 
## 
## 2000 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 1999, 1999, 1999, 1999, 1999, 1999, ... 
## Resampling results:
## 
##   RMSE   Rsquared  MAE   
##   1.173  0.9263    0.9441
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
print("MSE: ")
## [1] "MSE: "
(loocv_mse <- lm_evaluation_loocv$results$RMSE^2)
## [1] 1.375
# comparison to validation set approach: slightly worse at R2 and RMSE but better MAE


Model Evalution:

One advantage of the LOOCV approach is that almost the entire dataset (n-1 observations) can be used for training, which results in less bias and constant estimates for the test MSE. The model is fit n times with n-1 observations, whereas the one remaining observation is used to test the model. This results in n test MSE, whereas the LOOCV estimate for the overall test MSE is the average of these n test MSE.

LOOCV Test MSE: 1.375


Linear regression with best subset selection

  • comparing models with validation set approach
  • using MSE as a measure of validation set error
dim(train_data) # 20 predictors (21-1). -1 because of int_rate
## [1] 72752    21
# creates a model matrix by expanding factors to a set of dummy variables.
# This matrix is later used to calculate the test MSE for various models
test_mat <- model.matrix(int_rate~.,data=test_data)

# when considering all dummy variables there are 34 predictors (35-1): -1 because of intercept.
dim(test_mat)
## [1] 35830    35
# create best subset selection with 1 up to all 35 predictors.
regfit_best <- regsubsets(int_rate~.,data=train_data, nvmax=34)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
reg_summary <- summary(regfit_best)
reg_summary # models with up to 33 predictors were included.
## Subset selection object
## Call: regsubsets.formula(int_rate ~ ., data = train_data, nvmax = 34)
## 34 Variables  (and intercept)
##                                                                Forced in
## loan_amnt                                                          FALSE
## revol_bal                                                          FALSE
## revol_util                                                         FALSE
## funded_amnt_inv                                                    FALSE
## annual_inc                                                         FALSE
## pub_rec_bankruptcies                                               FALSE
## term60 months                                                      FALSE
## pymnt_plany                                                        FALSE
## initial_list_statusw                                               FALSE
## application_typeJoint App                                          FALSE
## debt_settlement_flagY                                              FALSE
## verification_statusSource Verified                                 FALSE
## verification_statusVerified                                        FALSE
## home_ownershipOTHER                                                FALSE
## home_ownershipOWN                                                  FALSE
## home_ownershipRENT                                                 FALSE
## acc_now_delinq                                                     FALSE
## gradeB                                                             FALSE
## gradeC                                                             FALSE
## gradeD                                                             FALSE
## gradeE                                                             FALSE
## gradeF                                                             FALSE
## gradeG                                                             FALSE
## collections_12_mths_ex_med                                         FALSE
## loan_statusCurrent                                                 FALSE
## loan_statusDoes not meet the credit policy. Status:Charged Off     FALSE
## loan_statusDoes not meet the credit policy. Status:Fully Paid      FALSE
## loan_statusFully Paid                                              FALSE
## loan_statusIn Grace Period                                         FALSE
## loan_statusLate (16-30 days)                                       FALSE
## loan_statusLate (31-120 days)                                      FALSE
## chargeoff_within_12_mths                                           FALSE
## year_temp                                                          FALSE
## hardship_flagY                                                     FALSE
##                                                                Forced out
## loan_amnt                                                           FALSE
## revol_bal                                                           FALSE
## revol_util                                                          FALSE
## funded_amnt_inv                                                     FALSE
## annual_inc                                                          FALSE
## pub_rec_bankruptcies                                                FALSE
## term60 months                                                       FALSE
## pymnt_plany                                                         FALSE
## initial_list_statusw                                                FALSE
## application_typeJoint App                                           FALSE
## debt_settlement_flagY                                               FALSE
## verification_statusSource Verified                                  FALSE
## verification_statusVerified                                         FALSE
## home_ownershipOTHER                                                 FALSE
## home_ownershipOWN                                                   FALSE
## home_ownershipRENT                                                  FALSE
## acc_now_delinq                                                      FALSE
## gradeB                                                              FALSE
## gradeC                                                              FALSE
## gradeD                                                              FALSE
## gradeE                                                              FALSE
## gradeF                                                              FALSE
## gradeG                                                              FALSE
## collections_12_mths_ex_med                                          FALSE
## loan_statusCurrent                                                  FALSE
## loan_statusDoes not meet the credit policy. Status:Charged Off      FALSE
## loan_statusDoes not meet the credit policy. Status:Fully Paid       FALSE
## loan_statusFully Paid                                               FALSE
## loan_statusIn Grace Period                                          FALSE
## loan_statusLate (16-30 days)                                        FALSE
## loan_statusLate (31-120 days)                                       FALSE
## chargeoff_within_12_mths                                            FALSE
## year_temp                                                           FALSE
## hardship_flagY                                                      FALSE
## 1 subsets of each size up to 33
## Selection Algorithm: exhaustive
##           loan_amnt revol_bal revol_util funded_amnt_inv annual_inc
## 1  ( 1 )  " "       " "       " "        " "             " "       
## 2  ( 1 )  " "       " "       " "        " "             " "       
## 3  ( 1 )  " "       " "       " "        " "             " "       
## 4  ( 1 )  " "       " "       " "        " "             " "       
## 5  ( 1 )  " "       " "       " "        " "             " "       
## 6  ( 1 )  " "       " "       " "        " "             " "       
## 7  ( 1 )  " "       " "       " "        " "             " "       
## 8  ( 1 )  " "       " "       " "        " "             " "       
## 9  ( 1 )  " "       " "       " "        " "             " "       
## 10  ( 1 ) " "       " "       "*"        " "             " "       
## 11  ( 1 ) " "       " "       "*"        " "             " "       
## 12  ( 1 ) "*"       " "       "*"        "*"             " "       
## 13  ( 1 ) "*"       " "       "*"        "*"             " "       
## 14  ( 1 ) "*"       "*"       "*"        "*"             " "       
## 15  ( 1 ) "*"       " "       "*"        "*"             " "       
## 16  ( 1 ) "*"       "*"       "*"        "*"             " "       
## 17  ( 1 ) "*"       "*"       "*"        "*"             " "       
## 18  ( 1 ) "*"       "*"       "*"        "*"             " "       
## 19  ( 1 ) "*"       "*"       "*"        "*"             " "       
## 20  ( 1 ) "*"       "*"       "*"        "*"             " "       
## 21  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 22  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 23  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 24  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 25  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 26  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 27  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 28  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 29  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 30  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 31  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 32  ( 1 ) "*"       "*"       "*"        "*"             "*"       
## 33  ( 1 ) "*"       "*"       "*"        "*"             "*"       
##           pub_rec_bankruptcies term60 months pymnt_plany
## 1  ( 1 )  " "                  " "           " "        
## 2  ( 1 )  " "                  " "           " "        
## 3  ( 1 )  " "                  " "           " "        
## 4  ( 1 )  " "                  " "           " "        
## 5  ( 1 )  " "                  " "           " "        
## 6  ( 1 )  " "                  " "           " "        
## 7  ( 1 )  " "                  " "           " "        
## 8  ( 1 )  " "                  " "           " "        
## 9  ( 1 )  " "                  " "           " "        
## 10  ( 1 ) " "                  " "           " "        
## 11  ( 1 ) " "                  " "           " "        
## 12  ( 1 ) " "                  " "           " "        
## 13  ( 1 ) " "                  " "           " "        
## 14  ( 1 ) " "                  " "           " "        
## 15  ( 1 ) " "                  "*"           " "        
## 16  ( 1 ) " "                  "*"           " "        
## 17  ( 1 ) " "                  "*"           " "        
## 18  ( 1 ) " "                  "*"           " "        
## 19  ( 1 ) "*"                  "*"           " "        
## 20  ( 1 ) "*"                  "*"           " "        
## 21  ( 1 ) "*"                  "*"           " "        
## 22  ( 1 ) "*"                  "*"           " "        
## 23  ( 1 ) "*"                  "*"           " "        
## 24  ( 1 ) "*"                  "*"           " "        
## 25  ( 1 ) "*"                  "*"           " "        
## 26  ( 1 ) "*"                  "*"           " "        
## 27  ( 1 ) "*"                  "*"           " "        
## 28  ( 1 ) "*"                  "*"           " "        
## 29  ( 1 ) "*"                  "*"           " "        
## 30  ( 1 ) "*"                  "*"           " "        
## 31  ( 1 ) "*"                  "*"           " "        
## 32  ( 1 ) "*"                  "*"           "*"        
## 33  ( 1 ) "*"                  "*"           "*"        
##           initial_list_statusw application_typeJoint App hardship_flagY
## 1  ( 1 )  " "                  " "                       " "           
## 2  ( 1 )  " "                  " "                       " "           
## 3  ( 1 )  " "                  " "                       " "           
## 4  ( 1 )  " "                  " "                       " "           
## 5  ( 1 )  " "                  " "                       " "           
## 6  ( 1 )  " "                  " "                       " "           
## 7  ( 1 )  " "                  " "                       " "           
## 8  ( 1 )  " "                  " "                       " "           
## 9  ( 1 )  " "                  " "                       " "           
## 10  ( 1 ) " "                  " "                       " "           
## 11  ( 1 ) " "                  " "                       " "           
## 12  ( 1 ) " "                  " "                       " "           
## 13  ( 1 ) " "                  " "                       " "           
## 14  ( 1 ) " "                  " "                       " "           
## 15  ( 1 ) " "                  " "                       " "           
## 16  ( 1 ) " "                  " "                       " "           
## 17  ( 1 ) "*"                  " "                       " "           
## 18  ( 1 ) "*"                  " "                       " "           
## 19  ( 1 ) "*"                  " "                       " "           
## 20  ( 1 ) "*"                  " "                       " "           
## 21  ( 1 ) "*"                  " "                       " "           
## 22  ( 1 ) "*"                  " "                       " "           
## 23  ( 1 ) "*"                  " "                       " "           
## 24  ( 1 ) "*"                  " "                       " "           
## 25  ( 1 ) "*"                  " "                       " "           
## 26  ( 1 ) "*"                  " "                       " "           
## 27  ( 1 ) "*"                  " "                       " "           
## 28  ( 1 ) "*"                  " "                       " "           
## 29  ( 1 ) "*"                  " "                       " "           
## 30  ( 1 ) "*"                  " "                       " "           
## 31  ( 1 ) "*"                  " "                       " "           
## 32  ( 1 ) "*"                  " "                       " "           
## 33  ( 1 ) "*"                  "*"                       " "           
##           debt_settlement_flagY verification_statusSource Verified
## 1  ( 1 )  " "                   " "                               
## 2  ( 1 )  " "                   " "                               
## 3  ( 1 )  " "                   " "                               
## 4  ( 1 )  " "                   " "                               
## 5  ( 1 )  " "                   " "                               
## 6  ( 1 )  " "                   " "                               
## 7  ( 1 )  " "                   " "                               
## 8  ( 1 )  " "                   " "                               
## 9  ( 1 )  " "                   " "                               
## 10  ( 1 ) " "                   " "                               
## 11  ( 1 ) " "                   " "                               
## 12  ( 1 ) " "                   " "                               
## 13  ( 1 ) " "                   " "                               
## 14  ( 1 ) " "                   " "                               
## 15  ( 1 ) " "                   " "                               
## 16  ( 1 ) " "                   " "                               
## 17  ( 1 ) " "                   " "                               
## 18  ( 1 ) " "                   " "                               
## 19  ( 1 ) " "                   " "                               
## 20  ( 1 ) " "                   "*"                               
## 21  ( 1 ) " "                   "*"                               
## 22  ( 1 ) " "                   "*"                               
## 23  ( 1 ) " "                   "*"                               
## 24  ( 1 ) " "                   "*"                               
## 25  ( 1 ) " "                   "*"                               
## 26  ( 1 ) " "                   "*"                               
## 27  ( 1 ) " "                   "*"                               
## 28  ( 1 ) "*"                   "*"                               
## 29  ( 1 ) "*"                   "*"                               
## 30  ( 1 ) "*"                   "*"                               
## 31  ( 1 ) "*"                   "*"                               
## 32  ( 1 ) "*"                   "*"                               
## 33  ( 1 ) "*"                   "*"                               
##           verification_statusVerified home_ownershipOTHER
## 1  ( 1 )  " "                         " "                
## 2  ( 1 )  " "                         " "                
## 3  ( 1 )  " "                         " "                
## 4  ( 1 )  " "                         " "                
## 5  ( 1 )  " "                         " "                
## 6  ( 1 )  " "                         " "                
## 7  ( 1 )  " "                         " "                
## 8  ( 1 )  " "                         " "                
## 9  ( 1 )  "*"                         " "                
## 10  ( 1 ) "*"                         " "                
## 11  ( 1 ) "*"                         " "                
## 12  ( 1 ) "*"                         " "                
## 13  ( 1 ) "*"                         " "                
## 14  ( 1 ) "*"                         " "                
## 15  ( 1 ) "*"                         " "                
## 16  ( 1 ) "*"                         " "                
## 17  ( 1 ) "*"                         " "                
## 18  ( 1 ) "*"                         " "                
## 19  ( 1 ) "*"                         " "                
## 20  ( 1 ) "*"                         " "                
## 21  ( 1 ) "*"                         " "                
## 22  ( 1 ) "*"                         " "                
## 23  ( 1 ) "*"                         " "                
## 24  ( 1 ) "*"                         "*"                
## 25  ( 1 ) "*"                         "*"                
## 26  ( 1 ) "*"                         "*"                
## 27  ( 1 ) "*"                         "*"                
## 28  ( 1 ) "*"                         "*"                
## 29  ( 1 ) "*"                         "*"                
## 30  ( 1 ) "*"                         "*"                
## 31  ( 1 ) "*"                         "*"                
## 32  ( 1 ) "*"                         "*"                
## 33  ( 1 ) "*"                         "*"                
##           home_ownershipOWN home_ownershipRENT acc_now_delinq gradeB
## 1  ( 1 )  " "               " "                " "            " "   
## 2  ( 1 )  " "               " "                " "            " "   
## 3  ( 1 )  " "               " "                " "            " "   
## 4  ( 1 )  " "               " "                " "            " "   
## 5  ( 1 )  " "               " "                " "            " "   
## 6  ( 1 )  " "               " "                " "            "*"   
## 7  ( 1 )  " "               " "                " "            "*"   
## 8  ( 1 )  " "               " "                " "            "*"   
## 9  ( 1 )  " "               " "                " "            "*"   
## 10  ( 1 ) " "               " "                " "            "*"   
## 11  ( 1 ) " "               " "                " "            "*"   
## 12  ( 1 ) " "               " "                " "            "*"   
## 13  ( 1 ) " "               " "                " "            "*"   
## 14  ( 1 ) " "               " "                " "            "*"   
## 15  ( 1 ) " "               " "                " "            "*"   
## 16  ( 1 ) " "               " "                " "            "*"   
## 17  ( 1 ) " "               " "                " "            "*"   
## 18  ( 1 ) " "               " "                " "            "*"   
## 19  ( 1 ) " "               " "                " "            "*"   
## 20  ( 1 ) " "               " "                " "            "*"   
## 21  ( 1 ) " "               " "                " "            "*"   
## 22  ( 1 ) " "               " "                "*"            "*"   
## 23  ( 1 ) " "               "*"                "*"            "*"   
## 24  ( 1 ) " "               "*"                "*"            "*"   
## 25  ( 1 ) " "               "*"                "*"            "*"   
## 26  ( 1 ) " "               "*"                "*"            "*"   
## 27  ( 1 ) " "               "*"                "*"            "*"   
## 28  ( 1 ) " "               "*"                "*"            "*"   
## 29  ( 1 ) "*"               "*"                "*"            "*"   
## 30  ( 1 ) "*"               "*"                "*"            "*"   
## 31  ( 1 ) "*"               "*"                "*"            "*"   
## 32  ( 1 ) "*"               "*"                "*"            "*"   
## 33  ( 1 ) "*"               "*"                "*"            "*"   
##           gradeC gradeD gradeE gradeF gradeG collections_12_mths_ex_med
## 1  ( 1 )  " "    " "    "*"    " "    " "    " "                       
## 2  ( 1 )  " "    "*"    "*"    " "    " "    " "                       
## 3  ( 1 )  " "    "*"    "*"    "*"    " "    " "                       
## 4  ( 1 )  "*"    "*"    "*"    "*"    " "    " "                       
## 5  ( 1 )  "*"    "*"    "*"    "*"    "*"    " "                       
## 6  ( 1 )  "*"    "*"    "*"    "*"    "*"    " "                       
## 7  ( 1 )  "*"    "*"    "*"    "*"    "*"    " "                       
## 8  ( 1 )  "*"    "*"    "*"    "*"    "*"    " "                       
## 9  ( 1 )  "*"    "*"    "*"    "*"    "*"    " "                       
## 10  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 11  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 12  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 13  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 14  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 15  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 16  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 17  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 18  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 19  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 20  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 21  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 22  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 23  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 24  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 25  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 26  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 27  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 28  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 29  ( 1 ) "*"    "*"    "*"    "*"    "*"    " "                       
## 30  ( 1 ) "*"    "*"    "*"    "*"    "*"    "*"                       
## 31  ( 1 ) "*"    "*"    "*"    "*"    "*"    "*"                       
## 32  ( 1 ) "*"    "*"    "*"    "*"    "*"    "*"                       
## 33  ( 1 ) "*"    "*"    "*"    "*"    "*"    "*"                       
##           loan_statusCurrent
## 1  ( 1 )  " "               
## 2  ( 1 )  " "               
## 3  ( 1 )  " "               
## 4  ( 1 )  " "               
## 5  ( 1 )  " "               
## 6  ( 1 )  " "               
## 7  ( 1 )  " "               
## 8  ( 1 )  " "               
## 9  ( 1 )  " "               
## 10  ( 1 ) " "               
## 11  ( 1 ) " "               
## 12  ( 1 ) " "               
## 13  ( 1 ) " "               
## 14  ( 1 ) " "               
## 15  ( 1 ) "*"               
## 16  ( 1 ) "*"               
## 17  ( 1 ) "*"               
## 18  ( 1 ) "*"               
## 19  ( 1 ) "*"               
## 20  ( 1 ) "*"               
## 21  ( 1 ) "*"               
## 22  ( 1 ) "*"               
## 23  ( 1 ) "*"               
## 24  ( 1 ) "*"               
## 25  ( 1 ) "*"               
## 26  ( 1 ) "*"               
## 27  ( 1 ) "*"               
## 28  ( 1 ) "*"               
## 29  ( 1 ) "*"               
## 30  ( 1 ) "*"               
## 31  ( 1 ) "*"               
## 32  ( 1 ) "*"               
## 33  ( 1 ) "*"               
##           loan_statusDoes not meet the credit policy. Status:Charged Off
## 1  ( 1 )  " "                                                           
## 2  ( 1 )  " "                                                           
## 3  ( 1 )  " "                                                           
## 4  ( 1 )  " "                                                           
## 5  ( 1 )  " "                                                           
## 6  ( 1 )  " "                                                           
## 7  ( 1 )  " "                                                           
## 8  ( 1 )  " "                                                           
## 9  ( 1 )  " "                                                           
## 10  ( 1 ) " "                                                           
## 11  ( 1 ) "*"                                                           
## 12  ( 1 ) " "                                                           
## 13  ( 1 ) "*"                                                           
## 14  ( 1 ) "*"                                                           
## 15  ( 1 ) "*"                                                           
## 16  ( 1 ) "*"                                                           
## 17  ( 1 ) "*"                                                           
## 18  ( 1 ) "*"                                                           
## 19  ( 1 ) "*"                                                           
## 20  ( 1 ) "*"                                                           
## 21  ( 1 ) "*"                                                           
## 22  ( 1 ) "*"                                                           
## 23  ( 1 ) "*"                                                           
## 24  ( 1 ) "*"                                                           
## 25  ( 1 ) "*"                                                           
## 26  ( 1 ) "*"                                                           
## 27  ( 1 ) "*"                                                           
## 28  ( 1 ) "*"                                                           
## 29  ( 1 ) "*"                                                           
## 30  ( 1 ) "*"                                                           
## 31  ( 1 ) "*"                                                           
## 32  ( 1 ) "*"                                                           
## 33  ( 1 ) "*"                                                           
##           loan_statusDoes not meet the credit policy. Status:Fully Paid
## 1  ( 1 )  " "                                                          
## 2  ( 1 )  " "                                                          
## 3  ( 1 )  " "                                                          
## 4  ( 1 )  " "                                                          
## 5  ( 1 )  " "                                                          
## 6  ( 1 )  " "                                                          
## 7  ( 1 )  " "                                                          
## 8  ( 1 )  "*"                                                          
## 9  ( 1 )  "*"                                                          
## 10  ( 1 ) "*"                                                          
## 11  ( 1 ) "*"                                                          
## 12  ( 1 ) "*"                                                          
## 13  ( 1 ) "*"                                                          
## 14  ( 1 ) "*"                                                          
## 15  ( 1 ) "*"                                                          
## 16  ( 1 ) "*"                                                          
## 17  ( 1 ) "*"                                                          
## 18  ( 1 ) "*"                                                          
## 19  ( 1 ) "*"                                                          
## 20  ( 1 ) "*"                                                          
## 21  ( 1 ) "*"                                                          
## 22  ( 1 ) "*"                                                          
## 23  ( 1 ) "*"                                                          
## 24  ( 1 ) "*"                                                          
## 25  ( 1 ) "*"                                                          
## 26  ( 1 ) "*"                                                          
## 27  ( 1 ) "*"                                                          
## 28  ( 1 ) "*"                                                          
## 29  ( 1 ) "*"                                                          
## 30  ( 1 ) "*"                                                          
## 31  ( 1 ) "*"                                                          
## 32  ( 1 ) "*"                                                          
## 33  ( 1 ) "*"                                                          
##           loan_statusFully Paid loan_statusIn Grace Period
## 1  ( 1 )  " "                   " "                       
## 2  ( 1 )  " "                   " "                       
## 3  ( 1 )  " "                   " "                       
## 4  ( 1 )  " "                   " "                       
## 5  ( 1 )  " "                   " "                       
## 6  ( 1 )  " "                   " "                       
## 7  ( 1 )  " "                   " "                       
## 8  ( 1 )  " "                   " "                       
## 9  ( 1 )  " "                   " "                       
## 10  ( 1 ) " "                   " "                       
## 11  ( 1 ) " "                   " "                       
## 12  ( 1 ) " "                   " "                       
## 13  ( 1 ) " "                   " "                       
## 14  ( 1 ) " "                   " "                       
## 15  ( 1 ) " "                   " "                       
## 16  ( 1 ) " "                   " "                       
## 17  ( 1 ) " "                   " "                       
## 18  ( 1 ) "*"                   " "                       
## 19  ( 1 ) "*"                   " "                       
## 20  ( 1 ) "*"                   " "                       
## 21  ( 1 ) "*"                   " "                       
## 22  ( 1 ) "*"                   " "                       
## 23  ( 1 ) "*"                   " "                       
## 24  ( 1 ) "*"                   " "                       
## 25  ( 1 ) "*"                   " "                       
## 26  ( 1 ) "*"                   "*"                       
## 27  ( 1 ) "*"                   "*"                       
## 28  ( 1 ) "*"                   "*"                       
## 29  ( 1 ) "*"                   "*"                       
## 30  ( 1 ) "*"                   "*"                       
## 31  ( 1 ) "*"                   "*"                       
## 32  ( 1 ) "*"                   "*"                       
## 33  ( 1 ) "*"                   "*"                       
##           loan_statusLate (16-30 days) loan_statusLate (31-120 days)
## 1  ( 1 )  " "                          " "                          
## 2  ( 1 )  " "                          " "                          
## 3  ( 1 )  " "                          " "                          
## 4  ( 1 )  " "                          " "                          
## 5  ( 1 )  " "                          " "                          
## 6  ( 1 )  " "                          " "                          
## 7  ( 1 )  " "                          " "                          
## 8  ( 1 )  " "                          " "                          
## 9  ( 1 )  " "                          " "                          
## 10  ( 1 ) " "                          " "                          
## 11  ( 1 ) " "                          " "                          
## 12  ( 1 ) " "                          " "                          
## 13  ( 1 ) " "                          " "                          
## 14  ( 1 ) " "                          " "                          
## 15  ( 1 ) " "                          " "                          
## 16  ( 1 ) " "                          " "                          
## 17  ( 1 ) " "                          " "                          
## 18  ( 1 ) " "                          " "                          
## 19  ( 1 ) " "                          " "                          
## 20  ( 1 ) " "                          " "                          
## 21  ( 1 ) " "                          " "                          
## 22  ( 1 ) " "                          " "                          
## 23  ( 1 ) " "                          " "                          
## 24  ( 1 ) " "                          " "                          
## 25  ( 1 ) " "                          " "                          
## 26  ( 1 ) " "                          " "                          
## 27  ( 1 ) " "                          "*"                          
## 28  ( 1 ) " "                          "*"                          
## 29  ( 1 ) " "                          "*"                          
## 30  ( 1 ) " "                          "*"                          
## 31  ( 1 ) "*"                          "*"                          
## 32  ( 1 ) "*"                          "*"                          
## 33  ( 1 ) "*"                          "*"                          
##           chargeoff_within_12_mths year_temp
## 1  ( 1 )  " "                      " "      
## 2  ( 1 )  " "                      " "      
## 3  ( 1 )  " "                      " "      
## 4  ( 1 )  " "                      " "      
## 5  ( 1 )  " "                      " "      
## 6  ( 1 )  " "                      " "      
## 7  ( 1 )  " "                      "*"      
## 8  ( 1 )  " "                      "*"      
## 9  ( 1 )  " "                      "*"      
## 10  ( 1 ) " "                      "*"      
## 11  ( 1 ) " "                      "*"      
## 12  ( 1 ) " "                      "*"      
## 13  ( 1 ) " "                      "*"      
## 14  ( 1 ) " "                      "*"      
## 15  ( 1 ) " "                      "*"      
## 16  ( 1 ) " "                      "*"      
## 17  ( 1 ) " "                      "*"      
## 18  ( 1 ) " "                      "*"      
## 19  ( 1 ) " "                      "*"      
## 20  ( 1 ) " "                      "*"      
## 21  ( 1 ) " "                      "*"      
## 22  ( 1 ) " "                      "*"      
## 23  ( 1 ) " "                      "*"      
## 24  ( 1 ) " "                      "*"      
## 25  ( 1 ) "*"                      "*"      
## 26  ( 1 ) "*"                      "*"      
## 27  ( 1 ) "*"                      "*"      
## 28  ( 1 ) "*"                      "*"      
## 29  ( 1 ) "*"                      "*"      
## 30  ( 1 ) "*"                      "*"      
## 31  ( 1 ) "*"                      "*"      
## 32  ( 1 ) "*"                      "*"      
## 33  ( 1 ) "*"                      "*"
# Analysis of various metrics to assess the models
par(mfrow=c(2,2))
# plot R square
plot(reg_summary$rss ,xlab=" Number of Variables ",ylab="RSS", type="l")
# plot adusted R-square
plot(reg_summary$adjr2 ,xlab =" Number of Variables ", ylab="Adjusted RSq",type="l")
which.max (reg_summary$adjr2)  # maximal adusted R-square: 29
## [1] 29
points (29, reg_summary$adjr2[29], col ="red",cex =2, pch =20)
# plot CP
plot(reg_summary$cp ,xlab =" Number of Variables ", ylab="CP",type="l")
which.min (reg_summary$cp ) # minimal CP with 27 predictors
## [1] 27
points (27, reg_summary$cp[27], col ="red",cex =2, pch =20)
# plot BIC
plot(reg_summary$bic ,xlab =" Number of Variables ", ylab="BIC",type="l")
which.min (reg_summary$bic ) # minimal BIC with 23 predictors
## [1] 23
points (23, reg_summary$bic[23], col ="red",cex =2, pch =20)

# model of does not seem to improve much anymore after 6 predictors are included.
coefi=coef(regfit_best ,id=6) # get coefficients
pred_6=test_mat[,names(coefi)]%*%coefi # predict value for each observation with model and test data
test_error_6=mean((test_data$int_rate-pred_6)^2) # calculate test error
# -> test error for model with 6 predictors: 3.015

# next step: compare the test MSE of the various models with 1 predictor up to 33 predictors.

test_errors=rep(NA ,33) # create empty list
# search for each size i, the best model of that size
for(i in 1:33){ 
  coefi = coef(regfit_best ,id=i) # get coefficients
  pred = test_mat[,names(coefi)]%*%coefi # predict value for each observation with model and test data
  test_errors[i] = mean((test_data$int_rate-pred)^2) # calculate test error
}
test_errors # show all test errors
##  [1] 16.257 12.090 11.085  6.458  6.459  3.015  3.015  3.013  3.013  2.901
## [11]  2.895  2.897  2.891  2.875  2.861  2.845  2.755  2.755  2.748  2.708
## [21]  2.700  1.543  1.543  1.543  1.379  1.379  1.379  1.378  1.378  1.372
## [31]  1.372  1.372  1.372
test_errors[which.min(test_errors)] # test error of best model: 1.372
## [1] 1.372
which.min(test_errors) # model with minimal test error: 31
## [1] 31
par(mfrow=c(1,1))
plot(test_errors, type="l", main="Test Error") # plot the test error of all modelss
points(31, test_errors[31], col ="red",cex =2, pch =20) # mark test error of model with 31 variables.

# Plausibility Check:
# the output of best subset selection (regsubset) was that the test error decreases with every new added predictor up to 31 variables.
# Therefore test the linear regression by including all parameters and compare the test errors:
lm_2 <- lm(int_rate~., data = train_data) # linear regression with all predictors
summary(lm_2)
## 
## Call:
## lm(formula = int_rate ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.519  -0.872   0.036   0.865   4.305 
## 
## Coefficients: (1 not defined because of singularities)
##                                                                      Estimate
## (Intercept)                                                    802.2066801977
## loan_amnt                                                       -0.0003086859
## revol_bal                                                       -0.0000020704
## revol_util                                                       0.0056891673
## funded_amnt_inv                                                  0.0003121423
## annual_inc                                                      -0.0000002850
## pub_rec_bankruptcies                                             0.0751329199
## term60 months                                                    0.2132050242
## pymnt_plany                                                     -0.1552257175
## initial_list_statusw                                            -0.0860336871
## application_typeJoint App                                        0.0207964906
## hardship_flagY                                                             NA
## debt_settlement_flagY                                           -0.0415066104
## verification_statusSource Verified                               0.0602749870
## verification_statusVerified                                      0.2818644813
## home_ownershipOTHER                                             -0.9907899725
## home_ownershipOWN                                                0.0183835613
## home_ownershipRENT                                               0.0360787957
## acc_now_delinq                                                   0.2050850016
## gradeB                                                           3.4381638127
## gradeC                                                           6.5588592529
## gradeD                                                           9.6845363379
## gradeE                                                          12.4093353062
## gradeF                                                          15.9656363201
## gradeG                                                          18.1417385984
## collections_12_mths_ex_med                                      -0.0220253352
## loan_statusCurrent                                              -0.3646700318
## loan_statusDoes not meet the credit policy. Status:Charged Off  -3.2233222795
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -3.5624382781
## loan_statusFully Paid                                           -0.0902041231
## loan_statusIn Grace Period                                      -0.3305893808
## loan_statusLate (16-30 days)                                    -0.0825789745
## loan_statusLate (31-120 days)                                   -0.1998072911
## chargeoff_within_12_mths                                         0.0931754742
## year_temp                                                       -0.3948021639
##                                                                    Std. Error
## (Intercept)                                                      8.5549017340
## loan_amnt                                                        0.0000114455
## revol_bal                                                        0.0000002183
## revol_util                                                       0.0001990838
## funded_amnt_inv                                                  0.0000114537
## annual_inc                                                       0.0000000667
## pub_rec_bankruptcies                                             0.0123966239
## term60 months                                                    0.0127306969
## pymnt_plany                                                      0.5338359837
## initial_list_statusw                                             0.0095813271
## application_typeJoint App                                        0.1645470952
## hardship_flagY                                                             NA
## debt_settlement_flagY                                            0.0328428767
## verification_statusSource Verified                               0.0111819662
## verification_statusVerified                                      0.0119250363
## home_ownershipOTHER                                              0.3729366090
## home_ownershipOWN                                                0.0154055692
## home_ownershipRENT                                               0.0095771804
## acc_now_delinq                                                   0.0531781521
## gradeB                                                           0.0139086212
## gradeC                                                           0.0147744261
## gradeD                                                           0.0171087015
## gradeE                                                           0.0213715895
## gradeF                                                           0.0316785085
## gradeG                                                           0.0594618845
## collections_12_mths_ex_med                                       0.0327539262
## loan_statusCurrent                                               0.0217741818
## loan_statusDoes not meet the credit policy. Status:Charged Off   0.2059161940
## loan_statusDoes not meet the credit policy. Status:Fully Paid    0.1322630433
## loan_statusFully Paid                                            0.0127855330
## loan_statusIn Grace Period                                       0.1361802734
## loan_statusLate (16-30 days)                                     0.2112354846
## loan_statusLate (31-120 days)                                    0.0975745521
## chargeoff_within_12_mths                                         0.0382852794
## year_temp                                                        0.0042487679
##                                                                t value
## (Intercept)                                                      93.77
## loan_amnt                                                       -26.97
## revol_bal                                                        -9.49
## revol_util                                                       28.58
## funded_amnt_inv                                                  27.25
## annual_inc                                                       -4.27
## pub_rec_bankruptcies                                              6.06
## term60 months                                                    16.75
## pymnt_plany                                                      -0.29
## initial_list_statusw                                             -8.98
## application_typeJoint App                                         0.13
## hardship_flagY                                                      NA
## debt_settlement_flagY                                            -1.26
## verification_statusSource Verified                                5.39
## verification_statusVerified                                      23.64
## home_ownershipOTHER                                              -2.66
## home_ownershipOWN                                                 1.19
## home_ownershipRENT                                                3.77
## acc_now_delinq                                                    3.86
## gradeB                                                          247.20
## gradeC                                                          443.93
## gradeD                                                          566.06
## gradeE                                                          580.65
## gradeF                                                          503.99
## gradeG                                                          305.10
## collections_12_mths_ex_med                                       -0.67
## loan_statusCurrent                                              -16.75
## loan_statusDoes not meet the credit policy. Status:Charged Off  -15.65
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -26.93
## loan_statusFully Paid                                            -7.06
## loan_statusIn Grace Period                                       -2.43
## loan_statusLate (16-30 days)                                     -0.39
## loan_statusLate (31-120 days)                                    -2.05
## chargeoff_within_12_mths                                          2.43
## year_temp                                                       -92.92
##                                                                            Pr(>|t|)
## (Intercept)                                                    < 0.0000000000000002
## loan_amnt                                                      < 0.0000000000000002
## revol_bal                                                      < 0.0000000000000002
## revol_util                                                     < 0.0000000000000002
## funded_amnt_inv                                                < 0.0000000000000002
## annual_inc                                                          0.0000194774277
## pub_rec_bankruptcies                                                0.0000000013615
## term60 months                                                  < 0.0000000000000002
## pymnt_plany                                                                 0.77122
## initial_list_statusw                                           < 0.0000000000000002
## application_typeJoint App                                                   0.89943
## hardship_flagY                                                                   NA
## debt_settlement_flagY                                                       0.20631
## verification_statusSource Verified                                  0.0000000705292
## verification_statusVerified                                    < 0.0000000000000002
## home_ownershipOTHER                                                         0.00789
## home_ownershipOWN                                                           0.23275
## home_ownershipRENT                                                          0.00017
## acc_now_delinq                                                              0.00012
## gradeB                                                         < 0.0000000000000002
## gradeC                                                         < 0.0000000000000002
## gradeD                                                         < 0.0000000000000002
## gradeE                                                         < 0.0000000000000002
## gradeF                                                         < 0.0000000000000002
## gradeG                                                         < 0.0000000000000002
## collections_12_mths_ex_med                                                  0.50130
## loan_statusCurrent                                             < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Charged Off < 0.0000000000000002
## loan_statusDoes not meet the credit policy. Status:Fully Paid  < 0.0000000000000002
## loan_statusFully Paid                                               0.0000000000017
## loan_statusIn Grace Period                                                  0.01520
## loan_statusLate (16-30 days)                                                0.69585
## loan_statusLate (31-120 days)                                               0.04059
## chargeoff_within_12_mths                                                    0.01495
## year_temp                                                      < 0.0000000000000002
##                                                                   
## (Intercept)                                                    ***
## loan_amnt                                                      ***
## revol_bal                                                      ***
## revol_util                                                     ***
## funded_amnt_inv                                                ***
## annual_inc                                                     ***
## pub_rec_bankruptcies                                           ***
## term60 months                                                  ***
## pymnt_plany                                                       
## initial_list_statusw                                           ***
## application_typeJoint App                                         
## hardship_flagY                                                    
## debt_settlement_flagY                                             
## verification_statusSource Verified                             ***
## verification_statusVerified                                    ***
## home_ownershipOTHER                                            ** 
## home_ownershipOWN                                                 
## home_ownershipRENT                                             ***
## acc_now_delinq                                                 ***
## gradeB                                                         ***
## gradeC                                                         ***
## gradeD                                                         ***
## gradeE                                                         ***
## gradeF                                                         ***
## gradeG                                                         ***
## collections_12_mths_ex_med                                        
## loan_statusCurrent                                             ***
## loan_statusDoes not meet the credit policy. Status:Charged Off ***
## loan_statusDoes not meet the credit policy. Status:Fully Paid  ***
## loan_statusFully Paid                                          ***
## loan_statusIn Grace Period                                     *  
## loan_statusLate (16-30 days)                                      
## loan_statusLate (31-120 days)                                  *  
## chargeoff_within_12_mths                                       *  
## year_temp                                                      ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.17 on 72718 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.928 
## F-statistic: 2.85e+04 on 33 and 72718 DF,  p-value: <0.0000000000000002
lm_2_pred <- predict(lm_2, test_data) # predict int_rate based on test data
## Warning in predict.lm(lm_2, test_data): prediction from a rank-deficient
## fit may be misleading
test_errors_2 <- mean((test_data$int_rate-lm_2_pred)^2) # calculate test error
test_errors_2 # test error: 1.372
## [1] 1.372
# test error is equal to previous test error that was found with best subset selection


Model Evaluation:

Models with various complexity with 1 predictor up to 33 predictors were created with best subset selection and were compared with the validation set approach. By analysing the plots of metrics like adjusted R^2, Cp and BIC, ones would think that the model does not improve much anymore with more than 6 predictors. So in a first step the test MSE was calculated for a model with 6 predictors. The model with 6 predictors resulted in a test MSE of 3.015. In a next step the test MSE of all 33 models were compared. The plot of all test MSE shows that the test MSE decreases with every new predictor up to the model with 31 predictors. The test MSE of the model with 31 predictors is 1.372, which is much smaller than the test MSE that was found for a model with 6 predictors.



Ridge regression

# using method from http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/
# creating matrix
x=model.matrix(int_rate~., d)[, -1]
y=d$int_rate

# Predictor variables train data
x_train <- model.matrix(int_rate~., train_data)[,-1]

# Outcome variable train data
y_train <- train_data$int_rate


# Find the best lambda using cross-validation
set.seed(22) 
cv <- cv.glmnet(x_train, y_train, alpha = 0)

# Display the best lambda value using cross-validation
cv$lambda.min
## [1] 0.2158
# Fit the final model on the training data
model_train <- glmnet(x_train, y_train, alpha = 0, lambda = cv$lambda.min)
model_train
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 0, lambda = cv$lambda.min) 
## 
##      Df %Dev Lambda
## [1,] 34 0.91  0.216
# Display regression coefficients
coef(model_train)
## 35 x 1 sparse Matrix of class "dgCMatrix"
##                                                                            s0
## (Intercept)                                                    673.4197663501
## loan_amnt                                                       -0.0000046055
## revol_bal                                                       -0.0000044971
## revol_util                                                       0.0126619106
## funded_amnt_inv                                                  0.0000045518
## annual_inc                                                      -0.0000008879
## pub_rec_bankruptcies                                             0.2215014030
## term60 months                                                    0.7320285124
## pymnt_plany                                                     -0.0981643392
## initial_list_statusw                                            -0.2512668734
## application_typeJoint App                                        0.0556007472
## hardship_flagY                                                  -0.0122931063
## debt_settlement_flagY                                           -0.0630768398
## verification_statusSource Verified                               0.1401325336
## verification_statusVerified                                      0.4880856599
## home_ownershipOTHER                                             -0.6575275493
## home_ownershipOWN                                                0.1088854645
## home_ownershipRENT                                               0.1433827279
## acc_now_delinq                                                   0.4428531718
## gradeB                                                           2.0721047336
## gradeC                                                           4.9252150053
## gradeD                                                           7.7942545127
## gradeE                                                          10.2200880846
## gradeF                                                          13.4885037816
## gradeG                                                          15.5080386436
## collections_12_mths_ex_med                                       0.1223624955
## loan_statusCurrent                                              -0.5565290140
## loan_statusDoes not meet the credit policy. Status:Charged Off  -2.7720040355
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -3.1717638626
## loan_statusFully Paid                                           -0.2917309393
## loan_statusIn Grace Period                                      -0.4119790329
## loan_statusLate (16-30 days)                                    -0.1106423141
## loan_statusLate (31-120 days)                                   -0.2839238895
## chargeoff_within_12_mths                                         0.1845140558
## year_temp                                                       -0.3303316059
# Make predictions on the test data
x_test <- model.matrix(int_rate~., test_data)[,-1]
y_test <- test_data$int_rate

model_test <- model.matrix(int_rate ~., test_data)[,-1]
predictions <- model_train %>% predict(model_test) %>% as.vector()


# Model performance metrics
data.frame(
  RMSE = caret::RMSE(predictions, test_data$int_rate),
  Rsquare = caret::R2(predictions, test_data$int_rate)
)
# Mean squared error
mean((predictions - y_test)^2)
## [1] 1.708
# Refit ridge regression model on the full data set, using lambda chosen by cross-validation
modell_full_data_set <- glmnet(x, y, alpha = 0)
predict(modell_full_data_set, type="coefficients", s=cv$lambda.min)
## 36 x 1 sparse Matrix of class "dgCMatrix"
##                                                                             1
## (Intercept)                                                    666.0096710668
## loan_amnt                                                       -0.0000051354
## revol_bal                                                       -0.0000043680
## revol_util                                                       0.0128394625
## funded_amnt_inv                                                  0.0000050157
## annual_inc                                                      -0.0000009665
## pub_rec_bankruptcies                                             0.2287244653
## term60 months                                                    0.7331639762
## pymnt_plany                                                     -0.2176853880
## initial_list_statusw                                            -0.2564763252
## application_typeJoint App                                        0.0053602665
## hardship_flagY                                                  -0.2364632815
## debt_settlement_flagY                                           -0.0616096764
## verification_statusSource Verified                               0.1451127340
## verification_statusVerified                                      0.4829217869
## home_ownershipNONE                                              -0.1701960729
## home_ownershipOTHER                                             -0.5612204159
## home_ownershipOWN                                                0.1149555752
## home_ownershipRENT                                               0.1450360467
## acc_now_delinq                                                   0.4480115481
## gradeB                                                           2.0740732729
## gradeC                                                           4.9276885532
## gradeD                                                           7.7905392419
## gradeE                                                          10.2029248536
## gradeF                                                          13.4950919639
## gradeG                                                          15.4736473629
## collections_12_mths_ex_med                                       0.0934593841
## loan_statusCurrent                                              -0.5594135656
## loan_statusDoes not meet the credit policy. Status:Charged Off  -2.6877638615
## loan_statusDoes not meet the credit policy. Status:Fully Paid   -3.2454211614
## loan_statusFully Paid                                           -0.2921106504
## loan_statusIn Grace Period                                      -0.3971416022
## loan_statusLate (16-30 days)                                    -0.2741674027
## loan_statusLate (31-120 days)                                   -0.2827116135
## chargeoff_within_12_mths                                         0.1661220201
## year_temp                                                       -0.3266543517
print("MSE")
## [1] "MSE"
mean((predictions-test_data$int_rate)^2)
## [1] 1.708


Model Evaluation:

In general ridge regression reduces the variance of the ridge coefficient estimates by shrinking them towards zero. In order to do so, a good tuning parameter lambda needs to be selected to adjust the trade-off between coefficient shrinkage and RSS reduction. Lambda was optimized by using cross validation. The best lambda was determined to be 0.216.
The examination of the ridge coefficent estimates shows that the coefficents of the following predictors was shrinked to almost zero:

  • loan_amnt
  • revol_bal
  • funded_amnt_inv
  • annual_inc


Highest coefficent were found for the following predictors:

  • grade
  • loan_status


The test MSE for the ridge regression model resulted in 1.708.


Lasso regression with 10-fold cross validation

# using method from http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/
# Setup a grid range of lambda values:
lambda <- 10^seq(-3, 3, length = 100)


# Build the model
set.seed(22)
lasso <- train(
  int_rate ~., data = train_data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = cv$lambda)
)

# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
## 35 x 1 sparse Matrix of class "dgCMatrix"
##                                                                        1
## (Intercept)                                                    380.59983
## loan_amnt                                                        .      
## revol_bal                                                        .      
## revol_util                                                       0.00816
## funded_amnt_inv                                                  .      
## annual_inc                                                       .      
## pub_rec_bankruptcies                                             .      
## term60 months                                                    0.43160
## pymnt_plany                                                      .      
## initial_list_statusw                                            -0.03003
## application_typeJoint App                                        .      
## hardship_flagY                                                   .      
## debt_settlement_flagY                                            .      
## verification_statusSource Verified                               .      
## verification_statusVerified                                      0.14410
## home_ownershipOTHER                                              .      
## home_ownershipOWN                                                .      
## home_ownershipRENT                                               .      
## acc_now_delinq                                                   .      
## gradeB                                                           0.78062
## gradeC                                                           3.82864
## gradeD                                                           6.80696
## gradeE                                                           9.22728
## gradeF                                                          12.20369
## gradeG                                                          12.86658
## collections_12_mths_ex_med                                       .      
## loan_statusCurrent                                               .      
## loan_statusDoes not meet the credit policy. Status:Charged Off   .      
## loan_statusDoes not meet the credit policy. Status:Fully Paid    .      
## loan_statusFully Paid                                            .      
## loan_statusIn Grace Period                                       .      
## loan_statusLate (16-30 days)                                     .      
## loan_statusLate (31-120 days)                                    .      
## chargeoff_within_12_mths                                         .      
## year_temp                                                       -0.18443
# Make predictions
predictions <- lasso %>% predict(test_data)
# Model prediction performance
data.frame(
  RMSE = caret::RMSE(predictions, test_data$int_rate),
  Rsquare = caret::R2(predictions, test_data$int_rate)
)
print("MSE")
## [1] "MSE"
mean((predictions-test_data$int_rate)^2)
## [1] 2.604

Model Evaluation:

Same as ridge regression, the lasso regression reduces the variance of the ridge coefficient estimates by shrinking them towards zero. In addition, lasso also has the side effect of reducing the number of predictors (Setting the coefficients to zero).
The lasso regression set the coefficents of the following parameters to zero (including dummy variables):

  • loan_amnt
  • revol_bal
  • funded_amnt_inv
  • annual_inc
  • pub_rec_bankruptcies
  • pymnt_plany
  • application_typeJoint App
  • hardship_flagY
  • debt_settlement_flagY
  • verification_statusSource Verified
  • home_ownershipOTHER
  • home_ownershipOWN
  • home_ownershipRENT
  • acc_now_delinq
  • collections_12_mths_ex_med
  • loan_statusCurrent
  • loan_statusDoes not meet the credit policy. Status:Charged Off
  • loan_statusDoes not meet the credit policy. Status:Fully Paid
  • loan_statusFully Paid
  • loan_statusIn Grace Period
  • loan_statusLate (16-30 days)
  • loan_statusLate (31-120 days)
  • chargeoff_within_12_mths


The test MSE for the lasso regression model resulted in 2.604.


Discussion / Model comparison and evaluation

model_comparison <-  tibble(model = c("LM with LOOCV", "LM with Best Subset Selection", "Ridge Regression", "Lasso Regression"),
       Test_MSE = c(1.375, 1.372, 1.708, 2.604),
       R2 = c(0.9263, 0.928, 0.9174, 0.8867))

(model_comparison %<>% arrange(Test_MSE))
write.csv(model_comparison, file = "data/regression_methods_results.csv")


According to the table above the linear regression model with best subset selection has the smallest MSE (1.372) overall. The linear regression model with leave one out cross validation is nearly as good with a MSE of 1.375. The ridge and the lasso regression perform noticeable worse (MSE of 1.708 and 2.064). Details to the winnig model with best subset selection can be found below.

Initial question Answer
1. Is at least one of the predictores X1, X2, …, Xp useful in predicting the response? First, one could reduce the possible predictors in the data preprocessing part. One could remove 126 out of 147 possible predictors, because they would not explain the response at all (93 predictors with more than 5% missing values or just one level) or there were too many levels (more than 10). Some of the few 126 removed predictors were a human decision, especially if the predictors were not understable in regard to the response. Back to the original question: Yes, there were 31 predictors useful in predicting the response. Due to matrix transformation there are a lot of new dummy variables out of the 21 remaining predictors, for example grade, home_ownership or loan_status.
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? No, only 31 of the preditors help to explain of the response. One could try to only take 23 predictors because one can reduce most of the test error with them.
3. How well does the model fit the data? How accurate is the prediction? Quite well. The remaining test error (MSE) is 1.375.

Possible next steps

One reasonable approach could involve testing non-linear regression models by checking the non-linearity of the response-predictor relationsships. Additionally one could further investigate in problems such as outliers, collinearity or residual analysis (correlation of error terms, non-constant variance of error terms, …). Finally one should better understand the business domain to improve model results and interpretation. Furthermore the numerical predictors could be standardized before applying various regression models, so that all predictors are on the same scale.







Taks 2: Classification


Data Preperation

Our goal in the second part of the assignment is to predict if a new customer will be able to fully pay back their loans using a classification method. Therefore, we focus on the “concluded lends” in the data set, i.e. on all loans where the loan_status is not Current. To this end, we filter out all observations with loan_status == Current.

# load data
d_na_less_5_percent <- read.csv("data/d_na_less_5_percent.csv",stringsAsFactors = T)

# create subset with loan_status = current
d2_sub <- subset(d_na_less_5_percent,loan_status != "Current")

# check how much loan status varibales we have
d2_sub$loan_status <- gsub("Does not meet the credit policy. Status:Fully Paid", "Fully Paid", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Does not meet the credit policy. Status:Charged Off", "Fully Paid", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Current", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Charged Off", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("In Grace Period", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Late \\(31-120 days\\)", "Defaulted", d2_sub$loan_status)
d2_sub$loan_status <- gsub("Late \\(16-30 days\\)", "Defaulted", d2_sub$loan_status)

# create subset
d2_sub$loan_status <- as.factor(d2_sub$loan_status)


# replace home ownership "NONE" with "OTHER"
d2_sub$home_ownership <- as.factor(gsub("NONE", "OTHER", d2_sub$home_ownership))
d2_sub %<>% na.omit()
summary(d2_sub)
##  funded_amnt_inv   annual_inc      pub_rec_bankruptcies   last_pymnt_d  
##  Min.   :    0   Min.   :   5000   Min.   :0.000        Mar-2017: 2550  
##  1st Qu.: 8000   1st Qu.:  45000   1st Qu.:0.000        Mar-2016: 2460  
##  Median :12000   Median :  64000   Median :0.000        Jul-2016: 2280  
##  Mean   :14379   Mean   :  74643   Mean   :0.122        Aug-2016: 2242  
##  3rd Qu.:20000   3rd Qu.:  90000   3rd Qu.:0.000        Sep-2016: 2146  
##  Max.   :35000   Max.   :9500000   Max.   :9.000        Jun-2016: 2145  
##                                                         (Other) :87685  
##    revol_util                        title      
##  Min.   :  0.0   Debt consolidation     :47018  
##  1st Qu.: 37.3   Credit card refinancing:18316  
##  Median : 55.6   Home improvement       : 4474  
##  Mean   : 54.8   Other                  : 3750  
##  3rd Qu.: 73.2   Debt Consolidation     : 1944  
##  Max.   :155.3   Major purchase         : 1421  
##                  (Other)                :24585  
##  collections_12_mths_ex_med chargeoff_within_12_mths   tax_liens    
##  Min.   :0.000              Min.   :0.000            Min.   : 0.00  
##  1st Qu.:0.000              1st Qu.:0.000            1st Qu.: 0.00  
##  Median :0.000              Median :0.000            Median : 0.00  
##  Mean   :0.014              Mean   :0.009            Mean   : 0.05  
##  3rd Qu.:0.000              3rd Qu.:0.000            3rd Qu.: 0.00  
##  Max.   :6.000              Max.   :9.000            Max.   :48.00  
##                                                                     
##   delinq_2yrs    earliest_cr_line inq_last_6mths      open_acc   
##  Min.   : 0.00   Aug-2001:  772   Min.   : 0.000   Min.   : 1.0  
##  1st Qu.: 0.00   Aug-2002:  745   1st Qu.: 0.000   1st Qu.: 8.0  
##  Median : 0.00   Aug-2000:  721   Median : 0.000   Median :11.0  
##  Mean   : 0.31   Oct-2000:  720   Mean   : 0.703   Mean   :11.5  
##  3rd Qu.: 0.00   Oct-2001:  693   3rd Qu.: 1.000   3rd Qu.:14.0  
##  Max.   :29.00   Oct-1999:  677   Max.   :12.000   Max.   :90.0  
##                  (Other) :97180                                  
##     pub_rec       total_acc     last_credit_pull_d acc_now_delinq 
##  Min.   : 0.0   Min.   :  2.0   Feb-2019:32989     Min.   :0.000  
##  1st Qu.: 0.0   1st Qu.: 17.0   Oct-2016: 5940     1st Qu.:0.000  
##  Median : 0.0   Median : 24.0   Jan-2019: 4273     Median :0.000  
##  Mean   : 0.2   Mean   : 25.3   Jul-2018: 3859     Mean   :0.005  
##  3rd Qu.: 0.0   3rd Qu.: 32.0   Aug-2018: 3399     3rd Qu.:0.000  
##  Max.   :49.0   Max.   :169.0   Oct-2018: 3300     Max.   :5.000  
##                                 (Other) :47748                    
##   delinq_amnt         id_2          loan_amnt      funded_amnt   
##  Min.   :    0   Min.   :    10   Min.   :  725   Min.   :  725  
##  1st Qu.:    0   1st Qu.:229392   1st Qu.: 8000   1st Qu.: 8000  
##  Median :    0   Median :447734   Median :12000   Median :12000  
##  Mean   :   15   Mean   :445468   Mean   :14404   Mean   :14395  
##  3rd Qu.:    0   3rd Qu.:662596   3rd Qu.:20000   3rd Qu.:20000  
##  Max.   :65000   Max.   :887434   Max.   :35000   Max.   :35000  
##                                                                  
##         term          int_rate      installment     grade    
##  36 months:76047   Min.   : 5.32   Min.   :  22.5   A:17939  
##  60 months:25461   1st Qu.: 9.80   1st Qu.: 255.0   B:29551  
##                    Median :12.99   Median : 377.6   C:27724  
##                    Mean   :13.15   Mean   : 435.0   D:15560  
##                    3rd Qu.:15.96   3rd Qu.: 571.6   E: 7594  
##                    Max.   :28.99   Max.   :1408.1   F: 2560  
##                                                     G:  580  
##    sub_grade         emp_length     home_ownership 
##  B3     : 6480   10+ years:33298   MORTGAGE:50469  
##  B4     : 6443   2 years  : 9024   OTHER   :   15  
##  C1     : 6202   < 1 year : 8155   OWN     : 9929  
##  C2     : 5922   3 years  : 8016   RENT    :41095  
##  B2     : 5797   1 year   : 6751                   
##  C3     : 5643   5 years  : 6396                   
##  (Other):65021   (Other)  :29868                   
##       verification_status     issue_d          loan_status    pymnt_plan
##  Not Verified   :29655    Oct-2015: 5361   Defaulted :18862   n:101499  
##  Source Verified:38237    Jul-2015: 5116   Fully Paid:82646   y:     9  
##  Verified       :33616    Dec-2015: 4828                                
##                           Oct-2014: 4449                                
##                           Nov-2015: 4116                                
##                           Aug-2015: 3962                                
##                           (Other) :73676                                
##                purpose         zip_code       addr_state   
##  debt_consolidation:60009   945xx  : 1188   CA     :15280  
##  credit_card       :23344   112xx  : 1088   NY     : 8443  
##  home_improvement  : 5871   750xx  : 1057   TX     : 8184  
##  other             : 4955   606xx  : 1011   FL     : 6876  
##  major_purchase    : 2011   300xx  :  899   IL     : 4071  
##  small_business    : 1189   900xx  :  895   NJ     : 3753  
##  (Other)           : 4129   (Other):95370   (Other):54901  
##       dti          revol_bal       initial_list_status   out_prncp    
##  Min.   :  0.0   Min.   :      0   f:53757             Min.   :    0  
##  1st Qu.: 11.8   1st Qu.:   6319   w:47751             1st Qu.:    0  
##  Median : 17.5   Median :  11557                       Median :    0  
##  Mean   : 18.0   Mean   :  16558                       Mean   :   28  
##  3rd Qu.: 23.8   3rd Qu.:  20277                       3rd Qu.:    0  
##  Max.   :100.1   Max.   :1298783                       Max.   :22039  
##                                                                       
##  out_prncp_inv    total_pymnt    total_pymnt_inv total_rec_prncp
##  Min.   :    0   Min.   :   36   Min.   :    0   Min.   :    0  
##  1st Qu.:    0   1st Qu.: 7577   1st Qu.: 7559   1st Qu.: 6000  
##  Median :    0   Median :12956   Median :12942   Median :10000  
##  Mean   :   28   Mean   :15485   Mean   :15468   Mean   :12464  
##  3rd Qu.:    0   3rd Qu.:21307   3rd Qu.:21290   3rd Qu.:17925  
##  Max.   :22007   Max.   :62708   Max.   :62687   Max.   :35000  
##                                                                 
##  total_rec_int   total_rec_late_fee   recoveries   
##  Min.   :    0   Min.   :  0.0      Min.   :    0  
##  1st Qu.:  978   1st Qu.:  0.0      1st Qu.:    0  
##  Median : 1860   Median :  0.0      Median :    0  
##  Mean   : 2786   Mean   :  1.5      Mean   :  234  
##  3rd Qu.: 3487   3rd Qu.:  0.0      3rd Qu.:    0  
##  Max.   :27708   Max.   :864.0      Max.   :32321  
##                                                    
##  collection_recovery_fee last_pymnt_amnt  policy_code   application_type 
##  Min.   :   0            Min.   :    0   Min.   :1    Individual:101452  
##  1st Qu.:   0            1st Qu.:  354   1st Qu.:1    Joint App :    56  
##  Median :   0            Median : 1103   Median :1                       
##  Mean   :  38            Mean   : 4498   Mean   :1                       
##  3rd Qu.:   0            3rd Qu.: 6743   3rd Qu.:1                       
##  Max.   :5432            Max.   :36170   Max.   :1                       
##                                                                          
##  hardship_flag disbursement_method debt_settlement_flag   year_temp   
##  N:101499      Cash:101508         N:99364              Min.   :2007  
##  Y:     9                          Y: 2144              1st Qu.:2013  
##                                                         Median :2014  
##                                                         Mean   :2014  
##                                                         3rd Qu.:2015  
##                                                         Max.   :2015  
## 

We have converted all loan_status that are not fully paid into defaulted. We have also replaced “home_ownership” NONE with OTHER.



PCA

  1. Use Principal Component Analysis for base transformation and then compare it with the Partial Least Squares Regression result. Select the best base with cross validation, using the better of the two approaches.

We now transform the dataset using the Principal component analaysis (PCA). The following code prepares the data for the PCA. The PCA compution (prcomp) does not allow any variables to be factors or any NAs we therefore transform the factors to numeric and remove the NAs. Also we reduce the levels of factors to maximum 10. Then we create a test and training set and we transform the base of the training set using PCA.

# load previous df
df.pca <- d2_sub
# str(df.pca)                                             # check

# remove factors larger than 10
plus10levels <- df.pca[, sapply(df.pca, function(col) class(col) == "factor")]
plus10levels <-  plus10levels[, sapply(plus10levels, function(col) length(unique(col))) > 10]
rmv.col <- colnames(plus10levels)
str(rmv.col) # columns to be removed
##  chr [1:10] "last_pymnt_d" "title" "earliest_cr_line" ...
df.pca <- df.pca[ , !(names(df.pca) %in% rmv.col)]
# str(df.pca)                                             # check

# transform factors to numeric, except for loan_status
str(df.pca)
## 'data.frame':    101508 obs. of  44 variables:
##  $ funded_amnt_inv           : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ annual_inc                : int  104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
##  $ pub_rec_bankruptcies      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ revol_util                : num  51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
##  $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_liens                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_2yrs               : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ inq_last_6mths            : int  2 3 1 2 2 1 0 0 1 0 ...
##  $ open_acc                  : int  14 12 30 19 10 10 12 16 12 20 ...
##  $ pub_rec                   : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ total_acc                 : int  31 35 47 36 16 34 27 24 17 42 ...
##  $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ id_2                      : int  777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
##  $ loan_amnt                 : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ funded_amnt               : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ term                      : Factor w/ 2 levels "36 months","60 months": 1 2 1 2 1 1 1 1 1 1 ...
##  $ int_rate                  : num  10.78 22.45 5.32 16.59 13.99 ...
##  $ installment               : num  1097 290 452 246 144 ...
##  $ grade                     : Factor w/ 7 levels "A","B","C","D",..: 2 6 1 4 3 1 2 1 1 1 ...
##  $ home_ownership            : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 1 4 4 4 1 1 1 3 1 ...
##  $ verification_status       : Factor w/ 3 levels "Not Verified",..: 2 2 1 2 2 1 1 1 1 2 ...
##  $ loan_status               : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pymnt_plan                : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dti                       : num  14 25.4 20.2 30.8 31 ...
##  $ revol_bal                 : int  28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
##  $ initial_list_status       : Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
##  $ out_prncp                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ out_prncp_inv             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_pymnt               : num  34309 11740 16241 10758 5036 ...
##  $ total_pymnt_inv           : num  34309 11740 16241 10758 5036 ...
##  $ total_rec_prncp           : num  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ total_rec_int             : num  709 1340 1241 758 836 ...
##  $ total_rec_late_fee        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ recoveries                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ collection_recovery_fee   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ last_pymnt_amnt           : num  33263 10129 2244 9795 1743 ...
##  $ policy_code               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ application_type          : Factor w/ 2 levels "Individual","Joint App": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hardship_flag             : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ disbursement_method       : Factor w/ 1 level "Cash": 1 1 1 1 1 1 1 1 1 1 ...
##  $ debt_settlement_flag      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year_temp                 : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
for(col in 1:ncol(df.pca)){
  if(class(df.pca[,col]) %in% c('factor') && colnames(df.pca)[col] != 'loan_status'){
    df.pca[,col] <- as.numeric(df.pca[,col])
  }
}
# str(df.pca)                                             # check

# remove NAs
drop_na(df.pca)
str(df.pca)
## 'data.frame':    101508 obs. of  44 variables:
##  $ funded_amnt_inv           : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ annual_inc                : int  104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
##  $ pub_rec_bankruptcies      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ revol_util                : num  51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
##  $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_liens                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_2yrs               : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ inq_last_6mths            : int  2 3 1 2 2 1 0 0 1 0 ...
##  $ open_acc                  : int  14 12 30 19 10 10 12 16 12 20 ...
##  $ pub_rec                   : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ total_acc                 : int  31 35 47 36 16 34 27 24 17 42 ...
##  $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ id_2                      : int  777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
##  $ loan_amnt                 : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ funded_amnt               : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ term                      : num  1 2 1 2 1 1 1 1 1 1 ...
##  $ int_rate                  : num  10.78 22.45 5.32 16.59 13.99 ...
##  $ installment               : num  1097 290 452 246 144 ...
##  $ grade                     : num  2 6 1 4 3 1 2 1 1 1 ...
##  $ home_ownership            : num  4 1 4 4 4 1 1 1 3 1 ...
##  $ verification_status       : num  2 2 1 2 2 1 1 1 1 2 ...
##  $ loan_status               : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pymnt_plan                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ dti                       : num  14 25.4 20.2 30.8 31 ...
##  $ revol_bal                 : int  28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
##  $ initial_list_status       : num  2 2 2 2 2 2 1 2 2 2 ...
##  $ out_prncp                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ out_prncp_inv             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_pymnt               : num  34309 11740 16241 10758 5036 ...
##  $ total_pymnt_inv           : num  34309 11740 16241 10758 5036 ...
##  $ total_rec_prncp           : num  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ total_rec_int             : num  709 1340 1241 758 836 ...
##  $ total_rec_late_fee        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ recoveries                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ collection_recovery_fee   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ last_pymnt_amnt           : num  33263 10129 2244 9795 1743 ...
##  $ policy_code               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ application_type          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ hardship_flag             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ disbursement_method       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ debt_settlement_flag      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ year_temp                 : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
# check dimnames, mean and variance
dimnames(df.pca)[[2]]
##  [1] "funded_amnt_inv"            "annual_inc"                
##  [3] "pub_rec_bankruptcies"       "revol_util"                
##  [5] "collections_12_mths_ex_med" "chargeoff_within_12_mths"  
##  [7] "tax_liens"                  "delinq_2yrs"               
##  [9] "inq_last_6mths"             "open_acc"                  
## [11] "pub_rec"                    "total_acc"                 
## [13] "acc_now_delinq"             "delinq_amnt"               
## [15] "id_2"                       "loan_amnt"                 
## [17] "funded_amnt"                "term"                      
## [19] "int_rate"                   "installment"               
## [21] "grade"                      "home_ownership"            
## [23] "verification_status"        "loan_status"               
## [25] "pymnt_plan"                 "dti"                       
## [27] "revol_bal"                  "initial_list_status"       
## [29] "out_prncp"                  "out_prncp_inv"             
## [31] "total_pymnt"                "total_pymnt_inv"           
## [33] "total_rec_prncp"            "total_rec_int"             
## [35] "total_rec_late_fee"         "recoveries"                
## [37] "collection_recovery_fee"    "last_pymnt_amnt"           
## [39] "policy_code"                "application_type"          
## [41] "hardship_flag"              "disbursement_method"       
## [43] "debt_settlement_flag"       "year_temp"
apply(df.pca,2,var)
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
##            funded_amnt_inv                 annual_inc 
##          70379710.85512327        5044374313.91709137 
##       pub_rec_bankruptcies                 revol_util 
##                 0.13152107               564.14696066 
## collections_12_mths_ex_med   chargeoff_within_12_mths 
##                 0.01703437                 0.01214732 
##                  tax_liens                delinq_2yrs 
##                 0.15101847                 0.71781136 
##             inq_last_6mths                   open_acc 
##                 0.97899224                28.14901467 
##                    pub_rec                  total_acc 
##                 0.34206417               141.58278152 
##             acc_now_delinq                delinq_amnt 
##                 0.00656248            624566.97294738 
##                       id_2                  loan_amnt 
##       63602829078.21736145          70458348.47642767 
##                funded_amnt                       term 
##          70361904.80439132                 0.18791493 
##                   int_rate                installment 
##                19.34605973             61766.92356414 
##                      grade             home_ownership 
##                 1.70985230                 2.04605086 
##        verification_status                loan_status 
##                 0.62179392                         NA 
##                 pymnt_plan                        dti 
##                 0.00008866                68.00934810 
##                  revol_bal        initial_list_status 
##         489147939.65753955                 0.24912725 
##                  out_prncp              out_prncp_inv 
##            273519.32762228            273208.22961914 
##                total_pymnt            total_pymnt_inv 
##         106393361.73141186         106374750.73708099 
##            total_rec_prncp              total_rec_int 
##          72449474.50050333           8253374.70288832 
##         total_rec_late_fee                 recoveries 
##               129.69762222            893639.07984153 
##    collection_recovery_fee            last_pymnt_amnt 
##             27064.96883563          39291930.08692809 
##                policy_code           application_type 
##                 0.00000000                 0.00055138 
##              hardship_flag        disbursement_method 
##                 0.00008866                 0.00000000 
##       debt_settlement_flag                  year_temp 
##                 0.02067557                 1.40756346
summary(df.pca)
##  funded_amnt_inv   annual_inc      pub_rec_bankruptcies   revol_util   
##  Min.   :    0   Min.   :   5000   Min.   :0.000        Min.   :  0.0  
##  1st Qu.: 8000   1st Qu.:  45000   1st Qu.:0.000        1st Qu.: 37.3  
##  Median :12000   Median :  64000   Median :0.000        Median : 55.6  
##  Mean   :14379   Mean   :  74643   Mean   :0.122        Mean   : 54.8  
##  3rd Qu.:20000   3rd Qu.:  90000   3rd Qu.:0.000        3rd Qu.: 73.2  
##  Max.   :35000   Max.   :9500000   Max.   :9.000        Max.   :155.3  
##  collections_12_mths_ex_med chargeoff_within_12_mths   tax_liens    
##  Min.   :0.000              Min.   :0.000            Min.   : 0.00  
##  1st Qu.:0.000              1st Qu.:0.000            1st Qu.: 0.00  
##  Median :0.000              Median :0.000            Median : 0.00  
##  Mean   :0.014              Mean   :0.009            Mean   : 0.05  
##  3rd Qu.:0.000              3rd Qu.:0.000            3rd Qu.: 0.00  
##  Max.   :6.000              Max.   :9.000            Max.   :48.00  
##   delinq_2yrs    inq_last_6mths      open_acc       pub_rec    
##  Min.   : 0.00   Min.   : 0.000   Min.   : 1.0   Min.   : 0.0  
##  1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 8.0   1st Qu.: 0.0  
##  Median : 0.00   Median : 0.000   Median :11.0   Median : 0.0  
##  Mean   : 0.31   Mean   : 0.703   Mean   :11.5   Mean   : 0.2  
##  3rd Qu.: 0.00   3rd Qu.: 1.000   3rd Qu.:14.0   3rd Qu.: 0.0  
##  Max.   :29.00   Max.   :12.000   Max.   :90.0   Max.   :49.0  
##    total_acc     acc_now_delinq   delinq_amnt         id_2       
##  Min.   :  2.0   Min.   :0.000   Min.   :    0   Min.   :    10  
##  1st Qu.: 17.0   1st Qu.:0.000   1st Qu.:    0   1st Qu.:229392  
##  Median : 24.0   Median :0.000   Median :    0   Median :447734  
##  Mean   : 25.3   Mean   :0.005   Mean   :   15   Mean   :445468  
##  3rd Qu.: 32.0   3rd Qu.:0.000   3rd Qu.:    0   3rd Qu.:662596  
##  Max.   :169.0   Max.   :5.000   Max.   :65000   Max.   :887434  
##    loan_amnt      funded_amnt         term         int_rate    
##  Min.   :  725   Min.   :  725   Min.   :1.00   Min.   : 5.32  
##  1st Qu.: 8000   1st Qu.: 8000   1st Qu.:1.00   1st Qu.: 9.80  
##  Median :12000   Median :12000   Median :1.00   Median :12.99  
##  Mean   :14404   Mean   :14395   Mean   :1.25   Mean   :13.15  
##  3rd Qu.:20000   3rd Qu.:20000   3rd Qu.:2.00   3rd Qu.:15.96  
##  Max.   :35000   Max.   :35000   Max.   :2.00   Max.   :28.99  
##   installment         grade      home_ownership verification_status
##  Min.   :  22.5   Min.   :1.00   Min.   :1.00   Min.   :1.00       
##  1st Qu.: 255.0   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:1.00       
##  Median : 377.6   Median :3.00   Median :3.00   Median :2.00       
##  Mean   : 435.0   Mean   :2.76   Mean   :2.41   Mean   :2.04       
##  3rd Qu.: 571.6   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:3.00       
##  Max.   :1408.1   Max.   :7.00   Max.   :4.00   Max.   :3.00       
##      loan_status      pymnt_plan      dti          revol_bal      
##  Defaulted :18862   Min.   :1    Min.   :  0.0   Min.   :      0  
##  Fully Paid:82646   1st Qu.:1    1st Qu.: 11.8   1st Qu.:   6319  
##                     Median :1    Median : 17.5   Median :  11557  
##                     Mean   :1    Mean   : 18.0   Mean   :  16558  
##                     3rd Qu.:1    3rd Qu.: 23.8   3rd Qu.:  20277  
##                     Max.   :2    Max.   :100.1   Max.   :1298783  
##  initial_list_status   out_prncp     out_prncp_inv    total_pymnt   
##  Min.   :1.00        Min.   :    0   Min.   :    0   Min.   :   36  
##  1st Qu.:1.00        1st Qu.:    0   1st Qu.:    0   1st Qu.: 7577  
##  Median :1.00        Median :    0   Median :    0   Median :12956  
##  Mean   :1.47        Mean   :   28   Mean   :   28   Mean   :15485  
##  3rd Qu.:2.00        3rd Qu.:    0   3rd Qu.:    0   3rd Qu.:21307  
##  Max.   :2.00        Max.   :22039   Max.   :22007   Max.   :62708  
##  total_pymnt_inv total_rec_prncp total_rec_int   total_rec_late_fee
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :  0.0     
##  1st Qu.: 7559   1st Qu.: 6000   1st Qu.:  978   1st Qu.:  0.0     
##  Median :12942   Median :10000   Median : 1860   Median :  0.0     
##  Mean   :15468   Mean   :12464   Mean   : 2786   Mean   :  1.5     
##  3rd Qu.:21290   3rd Qu.:17925   3rd Qu.: 3487   3rd Qu.:  0.0     
##  Max.   :62687   Max.   :35000   Max.   :27708   Max.   :864.0     
##    recoveries    collection_recovery_fee last_pymnt_amnt  policy_code
##  Min.   :    0   Min.   :   0            Min.   :    0   Min.   :1   
##  1st Qu.:    0   1st Qu.:   0            1st Qu.:  354   1st Qu.:1   
##  Median :    0   Median :   0            Median : 1103   Median :1   
##  Mean   :  234   Mean   :  38            Mean   : 4498   Mean   :1   
##  3rd Qu.:    0   3rd Qu.:   0            3rd Qu.: 6743   3rd Qu.:1   
##  Max.   :32321   Max.   :5432            Max.   :36170   Max.   :1   
##  application_type hardship_flag disbursement_method debt_settlement_flag
##  Min.   :1        Min.   :1     Min.   :1           Min.   :1.00        
##  1st Qu.:1        1st Qu.:1     1st Qu.:1           1st Qu.:1.00        
##  Median :1        Median :1     Median :1           Median :1.00        
##  Mean   :1        Mean   :1     Mean   :1           Mean   :1.02        
##  3rd Qu.:1        3rd Qu.:1     3rd Qu.:1           3rd Qu.:1.00        
##  Max.   :2        Max.   :2     Max.   :1           Max.   :2.00        
##    year_temp   
##  Min.   :2007  
##  1st Qu.:2013  
##  Median :2014  
##  Mean   :2014  
##  3rd Qu.:2015  
##  Max.   :2015
# looks like policy_code and disbursement_method have variance 0 and mean 1, we can remove these two variables
df.pca <- subset(df.pca, select = -c(policy_code,disbursement_method))
# str(df.pca)                                             # check

# devide and conquer 80% train, 20% test data
p80 <- nrow(df.pca)*0.8
set.seed(13)
indices <- sort(sample(1:nrow(df.pca), p80)) # select 100 random samples
test.pca <- df.pca[-indices,]
train.pca <- df.pca[indices,]
train.log.pca <- train.pca
test.log.pca <- test.pca

# remove loan_status from test and train set
y.train.pca <- train.pca$loan_status
y.test.pca <- test.pca$loan_status
train.pca <- subset(train.pca, select = -c(loan_status))
test.pca <- subset(test.pca, select = -c(loan_status))
# str(df.pca)                                             # check

# conduct PCA
pca.out = prcomp(train.pca,  scale=T)
pca.out$rotation
##                                   PC1       PC2         PC3       PC4
## funded_amnt_inv             0.3407943  0.012278 -0.01526673  0.052398
## annual_inc                  0.1136536  0.093545 -0.00467066  0.114199
## pub_rec_bankruptcies       -0.0464533 -0.077944  0.01267195  0.023158
## revol_util                  0.0568166 -0.121046 -0.02000963 -0.162366
## collections_12_mths_ex_med -0.0061810 -0.014579 -0.00003231  0.033622
## chargeoff_within_12_mths    0.0016584 -0.005383 -0.00109124  0.020972
## tax_liens                  -0.0016403 -0.026564  0.01669991  0.038721
## delinq_2yrs                 0.0074521 -0.026839 -0.00011600  0.060745
## inq_last_6mths              0.0019968 -0.126173 -0.02625197 -0.067290
## open_acc                    0.0886699  0.010373 -0.00176069  0.303670
## pub_rec                    -0.0358108 -0.075615  0.02010241  0.044061
## total_acc                   0.0987534  0.023219 -0.00039717  0.291341
## acc_now_delinq              0.0042213 -0.016788 -0.00191433  0.020997
## delinq_amnt                 0.0003938 -0.006680 -0.00188756  0.012416
## id_2                       -0.0112269 -0.016593 -0.04184998 -0.245852
## loan_amnt                   0.3407600  0.012114 -0.01559505  0.050310
## funded_amnt                 0.3408557  0.012304 -0.01545427  0.051249
## term                        0.1539785 -0.252114  0.00853078 -0.096273
## int_rate                    0.0791947 -0.426646 -0.03879709 -0.296917
## installment                 0.3217352  0.043233 -0.02681402  0.049944
## grade                       0.0788170 -0.426602 -0.03131353 -0.258908
## home_ownership             -0.0814892 -0.067576 -0.00007015 -0.106877
## verification_status         0.1182896 -0.130401 -0.02250034 -0.074315
## pymnt_plan                  0.0061424 -0.040921  0.47740147  0.030438
## dti                         0.0268096 -0.129824 -0.00476301  0.116630
## revol_bal                   0.1362472  0.067889 -0.01252753  0.119420
## initial_list_status         0.0215974  0.052381  0.04650319  0.215415
## out_prncp                   0.0272044 -0.088659  0.49412730  0.009560
## out_prncp_inv               0.0272075 -0.088661  0.49417126  0.009569
## total_pymnt                 0.3338835  0.099368  0.00329541 -0.074675
## total_pymnt_inv             0.3338595  0.099295  0.00344712 -0.073616
## total_rec_prncp             0.3093545  0.218345  0.00852315 -0.075873
## total_rec_int               0.2689073 -0.166832  0.02683691 -0.153373
## total_rec_late_fee          0.0335687 -0.096442  0.09072128  0.046985
## recoveries                  0.0412889 -0.377419 -0.12300545  0.331975
## collection_recovery_fee     0.0417152 -0.370781 -0.12123666  0.341533
## last_pymnt_amnt             0.1900513  0.118362 -0.01883821 -0.108174
## application_type            0.0060405 -0.004824  0.00489678  0.005075
## hardship_flag               0.0061424 -0.040921  0.47740147  0.030438
## debt_settlement_flag        0.0106025 -0.229287 -0.05841299  0.241378
## year_temp                   0.0259978  0.012367  0.05083276  0.318218
##                                  PC5       PC6        PC7       PC8
## funded_amnt_inv            -0.017297  0.089203 -0.0180491  0.008946
## annual_inc                  0.008011  0.013898 -0.1062135  0.133794
## pub_rec_bankruptcies        0.456425  0.087693 -0.1034899  0.056843
## revol_util                 -0.125350  0.034861  0.1235473 -0.144925
## collections_12_mths_ex_med  0.042988 -0.034280  0.0269903 -0.035104
## chargeoff_within_12_mths    0.017255 -0.062798  0.0116576 -0.009641
## tax_liens                   0.426521  0.182158 -0.1101874  0.082456
## delinq_2yrs                 0.035671 -0.141093  0.0095877  0.017114
## inq_last_6mths              0.102780 -0.246345 -0.1841422  0.159520
## open_acc                    0.071830 -0.487799 -0.0620000  0.096417
## pub_rec                     0.613098  0.184108 -0.1440733  0.091651
## total_acc                   0.101429 -0.474292 -0.1202433  0.149519
## acc_now_delinq              0.021787 -0.071341  0.0132533 -0.009723
## delinq_amnt                 0.008099 -0.023356  0.0030506 -0.002916
## id_2                       -0.124541 -0.047411 -0.3240296  0.331009
## loan_amnt                  -0.018519  0.088677 -0.0206835  0.011797
## funded_amnt                -0.018012  0.088948 -0.0196445  0.010666
## term                        0.034208 -0.083803  0.0859856 -0.072691
## int_rate                    0.083001 -0.162254  0.0462808 -0.102524
## installment                -0.019863  0.101305 -0.0478646  0.024157
## grade                       0.104336 -0.157846  0.0853233 -0.139455
## home_ownership             -0.025141  0.127318  0.0870005 -0.116249
## verification_status         0.076136 -0.040901 -0.0143728 -0.029156
## pymnt_plan                 -0.068016 -0.002952 -0.3975665 -0.325553
## dti                        -0.003603 -0.331337  0.1576499 -0.169634
## revol_bal                  -0.092387 -0.117515  0.0004494  0.014664
## initial_list_status         0.093662  0.048684  0.2366678 -0.239858
## out_prncp                  -0.018259  0.027454  0.3053913  0.356268
## out_prncp_inv              -0.018265  0.027453  0.3053421  0.356226
## total_pymnt                 0.020485  0.048167 -0.0063352 -0.002237
## total_pymnt_inv             0.021090  0.048424 -0.0049674 -0.003710
## total_rec_prncp             0.039844  0.037318 -0.0073048 -0.018075
## total_rec_int               0.011127 -0.006520  0.0405860  0.021043
## total_rec_late_fee         -0.020605  0.061190  0.0750292  0.117555
## recoveries                 -0.168793  0.208003 -0.1268302  0.073338
## collection_recovery_fee    -0.165390  0.212863 -0.1209136  0.068295
## last_pymnt_amnt             0.102709 -0.070254  0.0138976 -0.115969
## application_type            0.009669  0.003359  0.0262662 -0.019715
## hardship_flag              -0.068016 -0.002952 -0.3975665 -0.325553
## debt_settlement_flag       -0.109664  0.156859 -0.0492498  0.059989
## year_temp                   0.192668  0.041539  0.3443643 -0.359657
##                                  PC9        PC10      PC11      PC12
## funded_amnt_inv            -0.008053  0.02641072 -0.029061 -0.045177
## annual_inc                 -0.111425 -0.20740525  0.315011  0.368963
## pub_rec_bankruptcies        0.059931  0.13784234 -0.076196  0.042912
## revol_util                  0.260096 -0.51808011  0.127537  0.110010
## collections_12_mths_ex_med -0.217743 -0.01356013  0.175679 -0.321799
## chargeoff_within_12_mths   -0.329826 -0.11437861  0.319184 -0.326300
## tax_liens                   0.098964 -0.30713214  0.024342 -0.060901
## delinq_2yrs                -0.490416 -0.23603147  0.223489 -0.127567
## inq_last_6mths             -0.091416  0.15624991  0.128293  0.073729
## open_acc                    0.124245  0.05110830 -0.086281 -0.092016
## pub_rec                     0.101042 -0.11466744 -0.028563 -0.013883
## total_acc                   0.040665  0.04919569 -0.020595 -0.032503
## acc_now_delinq             -0.429680 -0.25436691 -0.447748  0.123925
## delinq_amnt                -0.311247 -0.21531612 -0.552508  0.220290
## id_2                        0.034050 -0.00009531 -0.063408 -0.001975
## loan_amnt                  -0.007653  0.02651344 -0.029480 -0.045155
## funded_amnt                -0.007805  0.02644688 -0.029342 -0.045314
## term                       -0.051508  0.23909035  0.033155  0.266016
## int_rate                   -0.038776 -0.00786048  0.060568  0.061136
## installment                 0.002893 -0.05218896 -0.036245 -0.132718
## grade                      -0.050284  0.00815234  0.070520  0.073235
## home_ownership              0.031574  0.05260739 -0.225847 -0.404481
## verification_status         0.024326 -0.06675839 -0.022898 -0.148805
## pymnt_plan                 -0.005291 -0.01393971  0.004192 -0.004205
## dti                         0.328088 -0.17583629 -0.218207 -0.241862
## revol_bal                   0.237416 -0.38364358  0.114505  0.232700
## initial_list_status        -0.073699  0.15575840  0.078527  0.225932
## out_prncp                   0.013724  0.02281877 -0.018197  0.017919
## out_prncp_inv               0.013720  0.02281824 -0.018199  0.017908
## total_pymnt                -0.006424  0.02449277 -0.040861 -0.066093
## total_pymnt_inv            -0.006655  0.02448290 -0.040601 -0.065959
## total_rec_prncp            -0.004247  0.02678977 -0.041728 -0.064628
## total_rec_int              -0.006647 -0.00462334 -0.020290 -0.052822
## total_rec_late_fee         -0.060729 -0.14468610  0.101408 -0.166651
## recoveries                 -0.010968  0.04151167 -0.009639  0.023058
## collection_recovery_fee    -0.011523  0.04418134 -0.011824  0.019165
## last_pymnt_amnt            -0.062897  0.22300878  0.022419  0.189721
## application_type            0.035182 -0.01146990 -0.063404 -0.062173
## hardship_flag              -0.005291 -0.01393971  0.004192 -0.004205
## debt_settlement_flag        0.007439  0.04350607 -0.049886 -0.034320
## year_temp                  -0.062051  0.03424975  0.068992  0.041827
##                                 PC13       PC14      PC15      PC16
## funded_amnt_inv             0.005690  0.0004935  0.004970  0.015379
## annual_inc                  0.217679 -0.1632597  0.090962  0.065295
## pub_rec_bankruptcies        0.016589 -0.0542558  0.029519  0.342323
## revol_util                 -0.072161  0.0319784  0.044861  0.105163
## collections_12_mths_ex_med  0.038654  0.0187004  0.808957 -0.319059
## chargeoff_within_12_mths   -0.287934  0.1655876 -0.130843  0.462681
## tax_liens                  -0.111187  0.1386998 -0.033842 -0.327927
## delinq_2yrs                -0.130761  0.0451098 -0.219508 -0.050387
## inq_last_6mths              0.344008 -0.3062071  0.025690  0.076764
## open_acc                    0.033614  0.0353415 -0.030914 -0.073642
## pub_rec                    -0.062419  0.0550936  0.000748  0.006953
## total_acc                  -0.038226  0.0428841 -0.047470 -0.011202
## acc_now_delinq             -0.010956  0.0134888 -0.031796 -0.035818
## delinq_amnt                 0.069652 -0.0275310  0.149940  0.076283
## id_2                       -0.265809  0.1503580  0.027574 -0.144313
## loan_amnt                   0.004953  0.0009471  0.005328  0.014990
## funded_amnt                 0.005459  0.0005764  0.005165  0.015296
## term                       -0.228929  0.1884706 -0.002476 -0.175663
## int_rate                    0.052190 -0.0292966 -0.004262 -0.010284
## installment                 0.089927 -0.0671848  0.004483  0.072554
## grade                       0.064212 -0.0376278 -0.004972 -0.016604
## home_ownership              0.262901 -0.0277765 -0.036282  0.073327
## verification_status         0.158892 -0.1550404  0.130551  0.259149
## pymnt_plan                  0.002631 -0.0035480 -0.010594 -0.011014
## dti                        -0.118839  0.0973486 -0.006108  0.003377
## revol_bal                   0.052389 -0.0166253  0.155475  0.090927
## initial_list_status        -0.130560  0.0680905 -0.020706 -0.191764
## out_prncp                  -0.030848  0.0210329  0.069423  0.085250
## out_prncp_inv              -0.030844  0.0210247  0.069409  0.085230
## total_pymnt                -0.011667  0.0148368 -0.027411 -0.014933
## total_pymnt_inv            -0.011425  0.0147311 -0.027570 -0.014841
## total_rec_prncp             0.001198 -0.0001014 -0.019055  0.011591
## total_rec_int              -0.028918  0.0451978 -0.057352 -0.103422
## total_rec_late_fee          0.316789 -0.2053185 -0.414554 -0.457968
## recoveries                 -0.054434  0.0285712  0.051106  0.050935
## collection_recovery_fee    -0.054167  0.0292801  0.046311  0.049598
## last_pymnt_amnt            -0.099397  0.0552343  0.028143  0.017469
## application_type           -0.553849 -0.8117762  0.022387 -0.066305
## hardship_flag               0.002631 -0.0035480 -0.010594 -0.011014
## debt_settlement_flag       -0.016198  0.0085391 -0.058878 -0.005923
## year_temp                   0.104635 -0.0779260 -0.017726  0.042044
##                                 PC17       PC18       PC19       PC20
## funded_amnt_inv             0.019224  0.0158430 -0.0031690 -0.0108530
## annual_inc                  0.120686 -0.0997654 -0.2346729 -0.1292822
## pub_rec_bankruptcies       -0.503907  0.1077371 -0.2396981 -0.1696128
## revol_util                 -0.141726  0.0292424 -0.1306092 -0.0481764
## collections_12_mths_ex_med -0.198311  0.0031451 -0.0688839  0.0101397
## chargeoff_within_12_mths   -0.045439 -0.4583949  0.0749343  0.0690759
## tax_liens                   0.448049 -0.1060648  0.1673885  0.1436238
## delinq_2yrs                 0.072582  0.3795890 -0.1590973 -0.0576897
## inq_last_6mths              0.201787 -0.1229851  0.1425420 -0.0282915
## open_acc                    0.086801 -0.0561477 -0.1171243  0.0012440
## pub_rec                    -0.032894  0.0014482 -0.0421234 -0.0142237
## total_acc                  -0.032223 -0.0057542 -0.1012056  0.0024636
## acc_now_delinq             -0.002951  0.3026133  0.1171567 -0.3456290
## delinq_amnt                -0.077309 -0.4554568 -0.0506766  0.3586387
## id_2                       -0.083829 -0.1798509 -0.0085159 -0.3848294
## loan_amnt                   0.018993  0.0152981 -0.0027429 -0.0112323
## funded_amnt                 0.019224  0.0156125 -0.0031663 -0.0109012
## term                       -0.038045 -0.1083308  0.0124631  0.0279626
## int_rate                    0.059186  0.0084273 -0.0883649  0.0271096
## installment                 0.040120  0.0534655 -0.0176638 -0.0181687
## grade                       0.070237  0.0092773 -0.0991535  0.0409776
## home_ownership              0.354434 -0.2320935 -0.4519466 -0.3726269
## verification_status        -0.001138  0.0280120  0.6844315 -0.2047341
## pymnt_plan                 -0.010828 -0.0002611 -0.0029882  0.0040888
## dti                        -0.074396  0.0207044  0.0769411 -0.0220991
## revol_bal                  -0.023617 -0.1204071 -0.0498194 -0.1519156
## initial_list_status        -0.010501 -0.3371165  0.1371582 -0.5446063
## out_prncp                   0.067807  0.0344166 -0.0009233 -0.0028172
## out_prncp_inv               0.067792  0.0343948 -0.0009213 -0.0028455
## total_pymnt                -0.027577  0.0066609 -0.0380455  0.0185584
## total_pymnt_inv            -0.027549  0.0068645 -0.0380073  0.0185882
## total_rec_prncp            -0.015381  0.0119776 -0.0260113  0.0213496
## total_rec_int              -0.070339 -0.0237713 -0.0645095  0.0075668
## total_rec_late_fee         -0.467632 -0.2288168  0.0828636  0.0201809
## recoveries                  0.056103  0.0393787  0.0133899 -0.0130336
## collection_recovery_fee     0.049811  0.0387510  0.0057738 -0.0038723
## last_pymnt_amnt             0.051749 -0.0072241 -0.0188131  0.0911322
## application_type            0.061855 -0.0400113 -0.0805371  0.0003988
## hardship_flag              -0.010828 -0.0002611 -0.0029882  0.0040888
## debt_settlement_flag       -0.086188 -0.0432097 -0.0657936  0.0892126
## year_temp                   0.042894  0.0404050 -0.0159810  0.0570448
##                                 PC21       PC22        PC23      PC24
## funded_amnt_inv            -0.049722 -0.0200163  0.09272632  0.033951
## annual_inc                  0.354707 -0.0949816 -0.02419697 -0.147383
## pub_rec_bankruptcies       -0.031966 -0.0486667  0.02006544  0.017384
## revol_util                 -0.209123  0.0216387 -0.09036820  0.086493
## collections_12_mths_ex_med  0.005765  0.0796263 -0.01520337  0.020965
## chargeoff_within_12_mths    0.084534  0.2705814  0.13402384 -0.020348
## tax_liens                   0.020164  0.0643688 -0.02051509  0.012197
## delinq_2yrs                -0.204352 -0.4164230 -0.24464415  0.101643
## inq_last_6mths             -0.517177  0.2889405 -0.05403707 -0.007394
## open_acc                    0.121101 -0.0391402  0.05641108 -0.059636
## pub_rec                    -0.010167  0.0119448  0.00219263  0.016386
## total_acc                   0.087642 -0.0869982 -0.01092848  0.015755
## acc_now_delinq              0.163568  0.4962175  0.13186705 -0.036095
## delinq_amnt                -0.145420 -0.3124764 -0.05748383  0.033603
## id_2                       -0.135561 -0.0935236 -0.09363407  0.110624
## loan_amnt                  -0.048222 -0.0201426  0.09234489  0.034797
## funded_amnt                -0.049401 -0.0196418  0.09238433  0.035000
## term                        0.306090 -0.0368475  0.04629556 -0.045296
## int_rate                    0.003152  0.0117982  0.02902556  0.004070
## installment                -0.156260 -0.0089729  0.09279242  0.046190
## grade                       0.007188  0.0217753  0.04069162  0.008628
## home_ownership              0.193617 -0.0684644 -0.08110608  0.131855
## verification_status         0.304889 -0.3527777 -0.21576567  0.015821
## pymnt_plan                  0.005343 -0.0004441 -0.00001031 -0.008235
## dti                        -0.123859  0.0498631 -0.05908760  0.063686
## revol_bal                   0.067442  0.1409719 -0.01341009  0.112752
## initial_list_status        -0.304046 -0.1148164 -0.05471659 -0.100142
## out_prncp                  -0.026999  0.0089816 -0.02315409  0.051339
## out_prncp_inv              -0.027003  0.0089776 -0.02316415  0.051337
## total_pymnt                -0.040782  0.0336088 -0.06034406 -0.098832
## total_pymnt_inv            -0.041118  0.0332801 -0.05995399 -0.099646
## total_rec_prncp            -0.048829  0.0613267 -0.11920598  0.009868
## total_rec_int               0.011289 -0.0482747  0.08286738 -0.446256
## total_rec_late_fee          0.128700  0.0550558  0.06209495  0.256687
## recoveries                 -0.040592 -0.0406028  0.16433277  0.179108
## collection_recovery_fee    -0.042435 -0.0251761  0.13322933  0.171387
## last_pymnt_amnt             0.126678  0.1699818 -0.31154928  0.673098
## application_type            0.072912  0.0036742  0.01746757 -0.004073
## hardship_flag               0.005343 -0.0004441 -0.00001031 -0.008235
## debt_settlement_flag        0.057323  0.2727025 -0.77616608 -0.288616
## year_temp                  -0.060454  0.0261255  0.01356063  0.001056
##                                 PC25       PC26       PC27        PC28
## funded_amnt_inv            -0.005237 -0.0157807 -0.0042771  0.01503415
## annual_inc                 -0.357813 -0.4155742 -0.0290351 -0.04586837
## pub_rec_bankruptcies        0.069029 -0.0452885  0.0325850 -0.04007643
## revol_util                 -0.331886  0.3014332  0.4396403  0.09437372
## collections_12_mths_ex_med -0.016441  0.0090788  0.0390581 -0.02000711
## chargeoff_within_12_mths   -0.029207 -0.0058625 -0.0167000 -0.01033258
## tax_liens                  -0.017838  0.0218106  0.0219061 -0.01072277
## delinq_2yrs                 0.242051 -0.0958819  0.0580850 -0.03087683
## inq_last_6mths              0.108071 -0.1376712  0.3187893  0.10189418
## open_acc                   -0.074272  0.2536187 -0.0591155  0.00354725
## pub_rec                     0.030930 -0.0116490  0.0176739 -0.02304342
## total_acc                  -0.161935  0.2872426  0.1546405  0.00713898
## acc_now_delinq             -0.071535  0.0406023  0.0488368  0.00630221
## delinq_amnt                 0.014678 -0.0097733 -0.0062278  0.00133074
## id_2                       -0.122487 -0.0783480 -0.2628345  0.50875945
## loan_amnt                  -0.004098 -0.0161914 -0.0027226  0.00862230
## funded_amnt                -0.004798 -0.0160432 -0.0035173  0.01067739
## term                        0.246312 -0.1523352  0.4517307  0.18054294
## int_rate                   -0.071150  0.1000939 -0.3054628 -0.11960153
## installment                -0.099002  0.0474721 -0.1924296 -0.06293209
## grade                      -0.064958  0.0923353 -0.3449975 -0.06537945
## home_ownership              0.108133  0.0199799  0.1974660  0.03247743
## verification_status        -0.011797  0.0903972  0.0589825  0.05477305
## pymnt_plan                  0.003546 -0.0006678 -0.0002166  0.00006532
## dti                        -0.087748 -0.6893603 -0.0223261 -0.11485632
## revol_bal                   0.706647  0.0773432 -0.2052722 -0.00020828
## initial_list_status        -0.082157  0.1018341 -0.0480284 -0.33172991
## out_prncp                  -0.019215  0.0041879 -0.0154207 -0.01315008
## out_prncp_inv              -0.019214  0.0041892 -0.0154197 -0.01317025
## total_pymnt                 0.012158 -0.0093772  0.0337338 -0.00541456
## total_pymnt_inv             0.011777 -0.0091557  0.0330804 -0.00159799
## total_rec_prncp            -0.021791 -0.0017860 -0.0006264 -0.01799829
## total_rec_int               0.113207 -0.0241519  0.1075436  0.04599481
## total_rec_late_fee          0.002415 -0.0119822  0.0147996 -0.00061268
## recoveries                 -0.013374 -0.0130119  0.0482189 -0.03594279
## collection_recovery_fee    -0.013383 -0.0165239  0.0493835 -0.01761759
## last_pymnt_amnt            -0.092320 -0.0168388  0.0164961 -0.08709568
## application_type            0.024237  0.0193659  0.0139839 -0.00965248
## hardship_flag               0.003546 -0.0006678 -0.0002166  0.00006532
## debt_settlement_flag       -0.004204  0.0617942 -0.0974026 -0.01075087
## year_temp                  -0.055055 -0.0051183 -0.1679771  0.72173234
##                                  PC29       PC30       PC31        PC32
## funded_amnt_inv             0.3046016 -0.0196764  0.0723893 -0.00410684
## annual_inc                 -0.0329418 -0.0149366 -0.0021221  0.00036507
## pub_rec_bankruptcies        0.0175684 -0.0685408 -0.0008813  0.48899326
## revol_util                  0.0313139 -0.1167147 -0.0019387 -0.00677899
## collections_12_mths_ex_med  0.0055310 -0.0023425 -0.0014324  0.00404701
## chargeoff_within_12_mths    0.0017365 -0.0267296  0.0025030  0.00280397
## tax_liens                   0.0058903  0.0371965  0.0029066  0.49609851
## delinq_2yrs                 0.0159067 -0.0770948 -0.0037421  0.00011155
## inq_last_6mths              0.0073431 -0.0359874  0.0020411  0.00059111
## open_acc                   -0.0640153 -0.6899376  0.0023432  0.00543742
## pub_rec                     0.0037944 -0.0289915 -0.0039237 -0.71678972
## total_acc                   0.0345311  0.6785744 -0.0145805 -0.00573002
## acc_now_delinq              0.0004375 -0.0042567  0.0021177  0.00001424
## delinq_amnt                -0.0026382 -0.0062735 -0.0004835 -0.00086343
## id_2                       -0.0097057 -0.0109970  0.0026691  0.00307219
## loan_amnt                   0.3074962 -0.0205079  0.0712560 -0.00516567
## funded_amnt                 0.3061500 -0.0203476  0.0720240 -0.00456752
## term                        0.2633226 -0.0759924 -0.2360419 -0.00530321
## int_rate                    0.0083550  0.0472248 -0.0675115  0.01957996
## installment                 0.2726635  0.0079155  0.0917273  0.00165354
## grade                       0.0244020  0.0590059 -0.1683883 -0.00587875
## home_ownership              0.0055990  0.0594914  0.0010325 -0.00614498
## verification_status        -0.0460194 -0.0272897  0.0044340  0.00295271
## pymnt_plan                  0.0003129 -0.0025483  0.0006757 -0.00003051
## dti                         0.0052395  0.0513468 -0.0027186 -0.00478984
## revol_bal                  -0.0361348  0.0819046  0.0098314 -0.00196776
## initial_list_status        -0.0201480  0.0053719  0.0033340  0.00069918
## out_prncp                  -0.0134946 -0.0020887 -0.0164149 -0.00007395
## out_prncp_inv              -0.0135008 -0.0021017 -0.0163978 -0.00008858
## total_pymnt                -0.3082077  0.0146565 -0.2144029  0.00202860
## total_pymnt_inv            -0.3090452  0.0151539 -0.2137604  0.00243473
## total_rec_prncp            -0.2332989  0.0006795 -0.4733264  0.00264423
## total_rec_int              -0.3530512  0.0512130  0.6445416 -0.00176336
## total_rec_late_fee         -0.0117868 -0.0089600 -0.0218023 -0.00043468
## recoveries                 -0.1921833 -0.0009897 -0.0227909  0.00363873
## collection_recovery_fee    -0.2898173  0.0037407 -0.0814040 -0.00097810
## last_pymnt_amnt            -0.1752250 -0.0476290  0.3834094 -0.00263027
## application_type            0.0018112 -0.0021745 -0.0000388  0.00002569
## hardship_flag               0.0003129 -0.0025483  0.0006757 -0.00003051
## debt_settlement_flag        0.1862239 -0.0125999  0.0134592 -0.00144844
## year_temp                  -0.0784211  0.0507234  0.0234355  0.01450990
##                                  PC33        PC34        PC35
## funded_amnt_inv            -0.0119544 -0.01817155  0.30342814
## annual_inc                  0.0038008 -0.00008514 -0.00092220
## pub_rec_bankruptcies       -0.0119238  0.00054737  0.00170547
## revol_util                 -0.0201763  0.00253533 -0.00007815
## collections_12_mths_ex_med  0.0021889  0.00003947  0.00021299
## chargeoff_within_12_mths   -0.0011337  0.00060606  0.00054240
## tax_liens                  -0.0083919 -0.00047264  0.00102878
## delinq_2yrs                -0.0029023  0.00016951 -0.00032259
## inq_last_6mths             -0.0057704  0.00227458  0.00179623
## open_acc                   -0.0026167  0.00132371  0.00116609
## pub_rec                     0.0138214  0.00061460 -0.00079142
## total_acc                   0.0007318 -0.00023781  0.00221052
## acc_now_delinq             -0.0013506 -0.00020789  0.00080139
## delinq_amnt                -0.0005810  0.00074543  0.00025116
## id_2                       -0.0091085  0.00279402  0.00062448
## loan_amnt                  -0.0386443 -0.01772896  0.30404125
## funded_amnt                -0.0270117 -0.01872758  0.27959479
## term                        0.0376632 -0.01827435 -0.25592691
## int_rate                    0.7002100 -0.13484447  0.10811561
## installment                 0.0496507 -0.04393584 -0.81071119
## grade                      -0.6771437  0.12815007  0.00560313
## home_ownership             -0.0009057  0.00084670 -0.00059272
## verification_status        -0.0141567 -0.00003973 -0.00149826
## pymnt_plan                 -0.0002551  0.00007906 -0.00003260
## dti                         0.0001803  0.00247332 -0.00067548
## revol_bal                   0.0061098 -0.00038205 -0.00519227
## initial_list_status         0.0022881 -0.00315426 -0.00093849
## out_prncp                   0.0028535  0.00095761 -0.00193787
## out_prncp_inv               0.0028822  0.00095914 -0.00174520
## total_pymnt                 0.0151616  0.05689234 -0.00539233
## total_pymnt_inv             0.0282099  0.05720743  0.01571945
## total_rec_prncp             0.0225086 -0.00691717  0.00771423
## total_rec_int              -0.0565466 -0.00636876 -0.03846962
## total_rec_late_fee          0.0042549  0.00237518  0.00098876
## recoveries                  0.1334166  0.70161473 -0.01200572
## collection_recovery_fee    -0.1264759 -0.68048475  0.00370655
## last_pymnt_amnt            -0.0236500  0.00461753 -0.02810334
## application_type            0.0010625 -0.00079392  0.00037878
## hardship_flag              -0.0002551  0.00007906 -0.00003260
## debt_settlement_flag        0.0019055  0.00975498  0.00215493
## year_temp                   0.0678134  0.00705397 -0.00471761
##                                     PC36          PC37          PC38
## funded_amnt_inv            -0.6332917811 -0.0305243076 -0.5111481640
## annual_inc                 -0.0001506085  0.0000533107  0.0000124126
## pub_rec_bankruptcies       -0.0000317023 -0.0000686171 -0.0000100089
## revol_util                  0.0000571656 -0.0000905920 -0.0000099214
## collections_12_mths_ex_med -0.0001017341 -0.0000246378  0.0000011004
## chargeoff_within_12_mths   -0.0000389266 -0.0000135233 -0.0000007689
## tax_liens                   0.0001006288 -0.0000538784 -0.0000004081
## delinq_2yrs                -0.0001110893 -0.0000215407 -0.0000040890
## inq_last_6mths             -0.0002326328 -0.0002855994 -0.0000527436
## open_acc                   -0.0002582351 -0.0000473278 -0.0000114645
## pub_rec                    -0.0003523910  0.0000586506  0.0000019442
## total_acc                   0.0001851207  0.0002086585  0.0000347687
## acc_now_delinq             -0.0000093105 -0.0000009277 -0.0000014517
## delinq_amnt                -0.0000065810  0.0000032650  0.0000013428
## id_2                        0.0005529001  0.0003249821  0.0000068043
## loan_amnt                   0.5650577698 -0.5839902969 -0.0043586242
## funded_amnt                 0.0746635792  0.6361611564  0.5152503363
## term                       -0.0028389175 -0.0054682730 -0.0000618493
## int_rate                    0.0161365592  0.0018724281  0.0001207049
## installment                -0.0078225420 -0.0199204108 -0.0002507136
## grade                      -0.0145497375  0.0007630749 -0.0000585802
## home_ownership             -0.0001462005 -0.0000579376  0.0000047273
## verification_status        -0.0004316720  0.0003912111  0.0000060489
## pymnt_plan                  0.0000044882  0.0000020053  0.0000007339
## dti                        -0.0002411812 -0.0001911755  0.0000085689
## revol_bal                  -0.0003210518 -0.0000758387  0.0000064807
## initial_list_status         0.0000006184 -0.0000607551  0.0000042124
## out_prncp                   0.0000907762 -0.0000089687  0.0018346895
## out_prncp_inv              -0.0002116945 -0.0002246866 -0.0018348842
## total_pymnt                 0.1904113924  0.1822639947 -0.2492707837
## total_pymnt_inv            -0.4358309452 -0.4199812370  0.5745627656
## total_rec_prncp             0.2047226223  0.1963820096 -0.2682587524
## total_rec_int               0.0684726042  0.0651415481 -0.0900811701
## total_rec_late_fee          0.0001094284  0.0002503714 -0.0003725457
## recoveries                  0.0254633820  0.0218110817 -0.0297154545
## collection_recovery_fee    -0.0023334576 -0.0004660227 -0.0000415097
## last_pymnt_amnt            -0.0012627805 -0.0012392498 -0.0000503792
## application_type           -0.0001008856 -0.0000090948  0.0000009640
## hardship_flag               0.0000044882  0.0000020053  0.0000007339
## debt_settlement_flag       -0.0003087381  0.0000275720 -0.0000142676
## year_temp                   0.0085837976  0.0000510832  0.0000572327
##                                     PC39              PC40
## funded_amnt_inv            -0.0015074221 -0.00000008655653
## annual_inc                  0.0000002199  0.00000000012468
## pub_rec_bankruptcies        0.0000017559 -0.00000000007575
## revol_util                  0.0000023115 -0.00000000047120
## collections_12_mths_ex_med -0.0000011378 -0.00000000035052
## chargeoff_within_12_mths   -0.0000006881  0.00000000031061
## tax_liens                   0.0000062198  0.00000000020097
## delinq_2yrs                -0.0000028870  0.00000000082056
## inq_last_6mths             -0.0000023596  0.00000000037435
## open_acc                   -0.0000090502 -0.00000000028306
## pub_rec                    -0.0000089896  0.00000000018528
## total_acc                   0.0000038141  0.00000000005778
## acc_now_delinq             -0.0000010345 -0.00000000034105
## delinq_amnt                 0.0000001979  0.00000000069280
## id_2                       -0.0000040032 -0.00000000065828
## loan_amnt                  -0.0000201590 -0.00000001242206
## funded_amnt                 0.0014124714  0.00000009670161
## term                        0.0000393725  0.00000000032607
## int_rate                   -0.0000219193  0.00000000271000
## installment                 0.0001029559  0.00000000198891
## grade                       0.0000147624 -0.00000000177291
## home_ownership             -0.0000088164 -0.00000000003973
## verification_status        -0.0000031963  0.00000000018589
## pymnt_plan                 -0.0000391314  0.00000000002970
## dti                        -0.0000027212  0.00000000023808
## revol_bal                   0.0000007007  0.00000000060690
## initial_list_status        -0.0000249066  0.00000000029615
## out_prncp                  -0.7070954570 -0.00000000050264
## out_prncp_inv               0.7071132788  0.00000000059748
## total_pymnt                -0.0005783416  0.75218099549416
## total_pymnt_inv             0.0013304368  0.00000007868942
## total_rec_prncp            -0.0006210324 -0.62127884198648
## total_rec_int              -0.0002129456 -0.20850323079127
## total_rec_late_fee         -0.0000193156 -0.00084014920469
## recoveries                 -0.0000690659 -0.06900758955536
## collection_recovery_fee     0.0000035427  0.00000000006858
## last_pymnt_amnt            -0.0000002625  0.00000000140187
## application_type           -0.0000039235 -0.00000000014170
## hardship_flag              -0.0000391314 -0.00000000003482
## debt_settlement_flag       -0.0000060360  0.00000000010788
## year_temp                   0.0000121877 -0.00000000036488
##                                                PC41
## funded_amnt_inv             0.000000000000000000000
## annual_inc                 -0.000000000000000049440
## pub_rec_bankruptcies        0.000000000000000935294
## revol_util                 -0.000000000000000061441
## collections_12_mths_ex_med -0.000000000000000101437
## chargeoff_within_12_mths    0.000000000000000284505
## tax_liens                   0.000000000000000734963
## delinq_2yrs                -0.000000000000000297496
## inq_last_6mths             -0.000000000000000141597
## open_acc                   -0.000000000000000364687
## pub_rec                    -0.000000000000001197649
## total_acc                   0.000000000000000721321
## acc_now_delinq             -0.000000000000000284198
## delinq_amnt                -0.000000000000000478910
## id_2                       -0.000000000000000228325
## loan_amnt                   0.000000000000002975558
## funded_amnt                -0.000000000000001801706
## term                       -0.000000000000000043415
## int_rate                    0.000000000000002002955
## installment                -0.000000000000001657352
## grade                      -0.000000000000002022867
## home_ownership              0.000000000000000384457
## verification_status         0.000000000000000005009
## pymnt_plan                  0.707106781186550237273
## dti                        -0.000000000000000383473
## revol_bal                   0.000000000000000497409
## initial_list_status        -0.000000000000000653937
## out_prncp                  -0.000000000000017347235
## out_prncp_inv               0.000000000000017554534
## total_pymnt                -0.000000000034318298637
## total_pymnt_inv            -0.000000000000003654195
## total_rec_prncp             0.000000000028349242658
## total_rec_int               0.000000000009513843706
## total_rec_late_fee          0.000000000000038216392
## recoveries                  0.000000000003151915794
## collection_recovery_fee    -0.000000000000002590376
## last_pymnt_amnt             0.000000000000000169136
## application_type            0.000000000000000028189
## hardship_flag              -0.707106781186544575135
## debt_settlement_flag       -0.000000000000000307913
## year_temp                   0.000000000000000498299
pca.out$sdev
##  [1] 2.830175242207880970 1.747375222580279219 1.546859648725075864
##  [4] 1.504282222452379525 1.440913490912746342 1.350547919992009627
##  [7] 1.301832520261502646 1.262336896904347805 1.147288436751347618
## [10] 1.106109123609336331 1.044481556691171376 1.033539758528099428
## [13] 1.005783212316659236 0.998232461153887196 0.984926986888527867
## [16] 0.975775140248013928 0.956927099753434374 0.941272539883864612
## [19] 0.916107387094604375 0.913423634545449614 0.894116307699023505
## [22] 0.888633672223303805 0.866868447384414376 0.833172771712352600
## [25] 0.788394997709253254 0.745894724384768892 0.715588409147793314
## [28] 0.634722759166007955 0.564806807240468389 0.525564163061816192
## [31] 0.338632953619651234 0.257734324165023232 0.197013896082651380
## [34] 0.180241759807571400 0.105411320861252961 0.036374359034614168
## [37] 0.022565222163519465 0.002737867910510237 0.001216232120532726
## [40] 0.000000154215179868 0.000000000000005506
# % of variance explained
pr.var <- pca.out$sdev ^2
pve=pr.var/sum(pr.var)
pve
##  [1] 0.1953632171123477478946739438470104
##  [2] 0.0744712236216441975100721606395382
##  [3] 0.0583603603135076018371130146533687
##  [4] 0.0551918293850297489444578502570948
##  [5] 0.0506397972754709865816913350045070
##  [6] 0.0444873093706025096727429968268552
##  [7] 0.0413358027026921384661228842105629
##  [8] 0.0388657180801478679965654805528175
##  [9] 0.0321041648073980834254825822426938
## [10] 0.0298409120324826040993038134274684
## [11] 0.0266083346894631297963051963506587
## [12] 0.0260537666453243450803700653750639
## [13] 0.0246731675653169592321578562632567
## [14] 0.0243040986951542602179099361592307
## [15] 0.0236605163292998195434524433267143
## [16] 0.0232228566908782921296428014557023
## [17] 0.0223343774205487545503334700924825
## [18] 0.0216096096180341881720643470998766
## [19] 0.0204695791387630374180695014274534
## [20] 0.0203498228328341593251771257655491
## [21] 0.0194986334559345622641224338167376
## [22] 0.0192602391075379630880348713617423
## [23] 0.0183283147578206934147360129827575
## [24] 0.0169311431103128899600118728585585
## [25] 0.0151601627417795020019708474023901
## [26] 0.0135697302406101935612348086124257
## [27] 0.0124894334465038593651176057619523
## [28] 0.0098261702683731990543014589434279
## [29] 0.0077806519391503533786269564131999
## [30] 0.0067370168169478324843302274871348
## [31] 0.0027968848116382023089032671236964
## [32] 0.0016201702890926779516977962458668
## [33] 0.0009466945182845094798260165980253
## [34] 0.0007923680970373057083980250503430
## [35] 0.0002710133308710677016786783077862
## [36] 0.0000322705852482679941221205277557
## [37] 0.0000124192500314385050833726475972
## [38] 0.0000001828273340341834642987228765
## [39] 0.0000000360785505125731575373423943
## [40] 0.0000000000000005800566268729881272
## [41] 0.0000000000000000000000000000007394
# Plot
plot( cumsum(pve), xlab = "Principal Component",
      ylab = "Cumulative Proportion of Variance Explained", type = "b" )
text(cumsum(pve), labels=round(cumsum(pve)*100, digits = 1), cex= 0.7, pos = 2)

# try PCA as prediction
# use 30 PCAs
pca.train.1_30 <- pca.out$x[,1:30]

# predict PCA on test data set
pred <- predict(pca.out, test.pca)
head(pred[,1:30])
##        PC1     PC2      PC3     PC4     PC5      PC6     PC7     PC8
## 1   5.7828  2.6681 -0.23686 0.06525  0.2372  0.35380 -0.5208 -0.1969
## 7  -3.2917 -0.4280  0.13261 0.17922  0.1920 -0.83022  0.8724 -0.9447
## 12  2.1772  2.2568 -0.15467 0.33938 -1.0290  0.04282 -0.6153  0.5584
## 14  4.0037  3.2149  0.11557 2.00725 -0.2400  0.58266  0.9263 -1.0651
## 19  0.7012  0.3014  0.01453 0.07402  0.2960 -0.29043  1.2685 -1.6427
## 22 -4.0796  0.2277  0.26020 0.40989 -0.6043  0.50158  1.8375 -1.7563
##         PC9    PC10     PC11     PC12     PC13    PC14     PC15     PC16
## 1  -0.27863  1.5954 -0.37404 -0.03361  0.48053 -0.2982  0.17298  0.34509
## 7  -0.04816  1.2542 -0.49635 -0.52856  0.75752 -0.4419 -0.15506 -0.21382
## 12  1.05401 -1.8134  0.80220  1.74448 -0.46758  0.2597  0.65498  0.26230
## 14  0.69012 -0.1750 -0.04525  0.55190 -0.18760  0.2017  0.06036  0.11262
## 19 -0.02052  0.4211  0.50052  1.06659  0.08716 -0.1655 -0.13471 -0.08576
## 22  0.68663 -0.8798  0.21020  0.43279  0.08706  0.1520 -0.12166 -0.07957
##       PC17    PC18    PC19      PC20    PC21   PC22     PC23     PC24
## 1   1.1842 -0.9112 -0.5639 -1.178202 -1.2311 1.0420 -1.35902  3.42971
## 7   0.9745 -0.7444  0.6911 -0.639940 -0.6715 0.3254 -0.12769 -0.11961
## 12 -0.3950 -0.1634 -1.2672 -0.003899  0.1019 1.0372 -0.33961  1.67011
## 14 -0.1623  0.1475 -0.4875  0.641470 -0.7976 1.0872  0.02990  1.05316
## 19  0.1830  0.2184 -0.3086  1.012318 -1.2412 1.1640 -0.06054  1.07812
## 22  0.1783 -0.3336 -0.8833 -0.141689 -0.2460 0.2972  0.07527 -0.09925
##       PC25     PC26     PC27     PC28     PC29     PC30
## 1  -0.8011  0.21168 -0.39021  0.58175  0.34240 -0.18603
## 7   0.3118 -1.57075 -0.62418 -0.06206 -0.41471  0.12804
## 12  0.5308  0.06308 -1.15173  1.56909 -0.10367  0.03154
## 14  0.2606  0.03499 -0.09298 -0.91620  0.16353 -0.68291
## 19 -1.0437  0.34436 -0.60843 -1.03677  0.06011 -0.04914
## 22 -0.4478  0.39142  0.94814 -0.36157 -0.30349 -0.19256
pca.test.1_30 <- pred[,1:30]

log.pca.fit <- glm(train.log.pca$loan_status ~ ., as.data.frame(pca.train.1_30),family="binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log.pca.fit)
## 
## Call:
## glm(formula = train.log.pca$loan_status ~ ., family = "binomial", 
##     data = as.data.frame(pca.train.1_30))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -8.49    0.00    0.00    0.00    8.49  
## 
## Coefficients:
##                      Estimate        Std. Error     z value
## (Intercept)  1081162023588628            235497  4590975283
## PC1           340252070685539             83210  4089079955
## PC2           610392735941376            134773  4529049089
## PC3           -27177827633057            152243  -178516001
## PC4          -475026671580247            156552 -3034302089
## PC5           310839100875539            163437  1901888488
## PC6          -262930590871878            174373 -1507865532
## PC7            21117352560508            180898   116736343
## PC8          -301360309735786            186558 -1615373031
## PC9          -124998732210273            205266  -608961341
## PC10          419078409049468            212907  1968361034
## PC11           34520620835027            225470   153105528
## PC12          355026763743441            227856  1558115972
## PC13         -209481380086400            234145  -894666770
## PC14           98574889496720            235916   417839493
## PC15           38701078098963            239103   161859651
## PC16           20249111734962            241345    83901017
## PC17           75918220844796            246099   308486650
## PC18          -21446226725691            250192   -85719135
## PC19          -39691569638562            257065  -154403150
## PC20          194211956059419            257820   753285669
## PC21          232136930086545            263387   881352723
## PC22          337970227872327            265012  1275301023
## PC23         -614016865381567            271666 -2260190233
## PC24         1323957744940502            282653  4684041037
## PC25         -171315995378799            298707  -573526213
## PC26          -52891716403152            315727  -167523869
## PC27          109294971538026            329098   332104666
## PC28         -211350893131724            371026  -569638923
## PC29        -1147093243930877            416954 -2751123853
## PC30          -55880339547491            448087  -124708564
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## PC1         <0.0000000000000002 ***
## PC2         <0.0000000000000002 ***
## PC3         <0.0000000000000002 ***
## PC4         <0.0000000000000002 ***
## PC5         <0.0000000000000002 ***
## PC6         <0.0000000000000002 ***
## PC7         <0.0000000000000002 ***
## PC8         <0.0000000000000002 ***
## PC9         <0.0000000000000002 ***
## PC10        <0.0000000000000002 ***
## PC11        <0.0000000000000002 ***
## PC12        <0.0000000000000002 ***
## PC13        <0.0000000000000002 ***
## PC14        <0.0000000000000002 ***
## PC15        <0.0000000000000002 ***
## PC16        <0.0000000000000002 ***
## PC17        <0.0000000000000002 ***
## PC18        <0.0000000000000002 ***
## PC19        <0.0000000000000002 ***
## PC20        <0.0000000000000002 ***
## PC21        <0.0000000000000002 ***
## PC22        <0.0000000000000002 ***
## PC23        <0.0000000000000002 ***
## PC24        <0.0000000000000002 ***
## PC25        <0.0000000000000002 ***
## PC26        <0.0000000000000002 ***
## PC27        <0.0000000000000002 ***
## PC28        <0.0000000000000002 ***
## PC29        <0.0000000000000002 ***
## PC30        <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78035  on 81205  degrees of freedom
## Residual deviance: 44622  on 81175  degrees of freedom
## AIC: 44684
## 
## Number of Fisher Scoring iterations: 25
log.pca.prob <- predict(log.pca.fit, newdata = as.data.frame(pca.test.1_30), type = "response")
log.pca.pred <- ifelse(log.pca.prob>0.5,"Fully Paid","Defaulted")

# Test errors and confusion matrix
pca.matrix <- table(log.pca.pred,test.log.pca$loan_status)
pca.overall.test.err = mean(log.pca.pred != test.log.pca$loan_status)
pca.default.test.err= pca.matrix[1,2]/(pca.matrix[1,1]+pca.matrix[1,2]) # Class Test Error 
pca.paid.test.err= pca.matrix[2,1]/(pca.matrix[2,2]+pca.matrix[2,1]) # Class Test Error 
pca.matrix
##             
## log.pca.pred Defaulted Fully Paid
##   Defaulted       3690        108
##   Fully Paid        62      16442
pca.default.test.err
## [1] 0.02844
pca.paid.test.err
## [1] 0.003757
pca.overall.test.err
## [1] 0.008374

In the graph we can see the cumulative amount of variance explained rises with each additional Principal Component (PC). With 30 PCs we have reached over 99% of Variance Explained and the curve flattens off. We decided to take the smallest amount of PC that could nevertheless explain all most all variance in the data.

To validate our approach we used Logistic Regression. We used the first 30 PCs and then trained a logistical model on it. We validated the fitted model on the test set, where we applied the same PCA transformation. It resulted in terribly low test error rates for both the classes “Defaulted” (0.66%) and “Fully Paid” (0.77%), as well as the overall error (0.75%)



PLS

# load previous df
df.pls <- d2_sub
# str(df.pls)                                 #check

# remove factors larger than 10
plus10levels <- df.pls[, sapply(df.pls, function(col) class(col) == "factor")]
plus10levels <-  plus10levels[, sapply(plus10levels, function(col) length(unique(col))) > 10]
rmv.col <- colnames(plus10levels)
str(rmv.col) # columns to be removed
##  chr [1:10] "last_pymnt_d" "title" "earliest_cr_line" ...
df.pls <- df.pls[ , !(names(df.pls) %in% rmv.col)]
# str(df.pls)                                 #check

# transform all factors to numeric, also loan_status, otherwise pls function doesn't work
str(df.pls)
## 'data.frame':    101508 obs. of  44 variables:
##  $ funded_amnt_inv           : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ annual_inc                : int  104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
##  $ pub_rec_bankruptcies      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ revol_util                : num  51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
##  $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_liens                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_2yrs               : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ inq_last_6mths            : int  2 3 1 2 2 1 0 0 1 0 ...
##  $ open_acc                  : int  14 12 30 19 10 10 12 16 12 20 ...
##  $ pub_rec                   : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ total_acc                 : int  31 35 47 36 16 34 27 24 17 42 ...
##  $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ id_2                      : int  777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
##  $ loan_amnt                 : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ funded_amnt               : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ term                      : Factor w/ 2 levels "36 months","60 months": 1 2 1 2 1 1 1 1 1 1 ...
##  $ int_rate                  : num  10.78 22.45 5.32 16.59 13.99 ...
##  $ installment               : num  1097 290 452 246 144 ...
##  $ grade                     : Factor w/ 7 levels "A","B","C","D",..: 2 6 1 4 3 1 2 1 1 1 ...
##  $ home_ownership            : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 1 4 4 4 1 1 1 3 1 ...
##  $ verification_status       : Factor w/ 3 levels "Not Verified",..: 2 2 1 2 2 1 1 1 1 2 ...
##  $ loan_status               : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pymnt_plan                : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dti                       : num  14 25.4 20.2 30.8 31 ...
##  $ revol_bal                 : int  28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
##  $ initial_list_status       : Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
##  $ out_prncp                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ out_prncp_inv             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_pymnt               : num  34309 11740 16241 10758 5036 ...
##  $ total_pymnt_inv           : num  34309 11740 16241 10758 5036 ...
##  $ total_rec_prncp           : num  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ total_rec_int             : num  709 1340 1241 758 836 ...
##  $ total_rec_late_fee        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ recoveries                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ collection_recovery_fee   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ last_pymnt_amnt           : num  33263 10129 2244 9795 1743 ...
##  $ policy_code               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ application_type          : Factor w/ 2 levels "Individual","Joint App": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hardship_flag             : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ disbursement_method       : Factor w/ 1 level "Cash": 1 1 1 1 1 1 1 1 1 1 ...
##  $ debt_settlement_flag      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year_temp                 : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
for(col in 1:ncol(df.pls)){
  if(class(df.pls[,col]) %in% c('factor')){
    df.pls[,col] <- as.numeric(df.pls[,col])
  }
}
# str(df.pls)                                 #check

# remove NAs
drop_na(df.pls)
# str(df.pls)                                 #check

# check dimnames, mean and variance
dimnames(df.pls)[[2]]
##  [1] "funded_amnt_inv"            "annual_inc"                
##  [3] "pub_rec_bankruptcies"       "revol_util"                
##  [5] "collections_12_mths_ex_med" "chargeoff_within_12_mths"  
##  [7] "tax_liens"                  "delinq_2yrs"               
##  [9] "inq_last_6mths"             "open_acc"                  
## [11] "pub_rec"                    "total_acc"                 
## [13] "acc_now_delinq"             "delinq_amnt"               
## [15] "id_2"                       "loan_amnt"                 
## [17] "funded_amnt"                "term"                      
## [19] "int_rate"                   "installment"               
## [21] "grade"                      "home_ownership"            
## [23] "verification_status"        "loan_status"               
## [25] "pymnt_plan"                 "dti"                       
## [27] "revol_bal"                  "initial_list_status"       
## [29] "out_prncp"                  "out_prncp_inv"             
## [31] "total_pymnt"                "total_pymnt_inv"           
## [33] "total_rec_prncp"            "total_rec_int"             
## [35] "total_rec_late_fee"         "recoveries"                
## [37] "collection_recovery_fee"    "last_pymnt_amnt"           
## [39] "policy_code"                "application_type"          
## [41] "hardship_flag"              "disbursement_method"       
## [43] "debt_settlement_flag"       "year_temp"
apply(df.pls,2,mean)
##            funded_amnt_inv                 annual_inc 
##               14379.166529               74643.055336 
##       pub_rec_bankruptcies                 revol_util 
##                   0.122276                  54.747690 
## collections_12_mths_ex_med   chargeoff_within_12_mths 
##                   0.013999                   0.008856 
##                  tax_liens                delinq_2yrs 
##                   0.049385                   0.309542 
##             inq_last_6mths                   open_acc 
##                   0.703373                  11.508285 
##                    pub_rec                  total_acc 
##                   0.199649                  25.264501 
##             acc_now_delinq                delinq_amnt 
##                   0.005310                  15.156086 
##                       id_2                  loan_amnt 
##              445468.355519               14404.126768 
##                funded_amnt                       term 
##               14395.091274                   1.250828 
##                   int_rate                installment 
##                  13.150106                 434.951229 
##                      grade             home_ownership 
##                   2.756857                   2.410312 
##        verification_status                loan_status 
##                   2.039022                   1.814182 
##                 pymnt_plan                        dti 
##                   1.000089                  17.984777 
##                  revol_bal        initial_list_status 
##               16558.096544                   1.470416 
##                  out_prncp              out_prncp_inv 
##                  28.002359                  27.986320 
##                total_pymnt            total_pymnt_inv 
##               15484.977656               15467.948131 
##            total_rec_prncp              total_rec_int 
##               12464.129846                2785.762868 
##         total_rec_late_fee                 recoveries 
##                   1.511362                 233.573568 
##    collection_recovery_fee            last_pymnt_amnt 
##                  37.995335                4497.802159 
##                policy_code           application_type 
##                   1.000000                   1.000552 
##              hardship_flag        disbursement_method 
##                   1.000089                   1.000000 
##       debt_settlement_flag                  year_temp 
##                   1.021121                2014.023791
apply(df.pls,2,var)
##            funded_amnt_inv                 annual_inc 
##          70379710.85512327        5044374313.91709137 
##       pub_rec_bankruptcies                 revol_util 
##                 0.13152107               564.14696066 
## collections_12_mths_ex_med   chargeoff_within_12_mths 
##                 0.01703437                 0.01214732 
##                  tax_liens                delinq_2yrs 
##                 0.15101847                 0.71781136 
##             inq_last_6mths                   open_acc 
##                 0.97899224                28.14901467 
##                    pub_rec                  total_acc 
##                 0.34206417               141.58278152 
##             acc_now_delinq                delinq_amnt 
##                 0.00656248            624566.97294738 
##                       id_2                  loan_amnt 
##       63602829078.21736145          70458348.47642767 
##                funded_amnt                       term 
##          70361904.80439132                 0.18791493 
##                   int_rate                installment 
##                19.34605973             61766.92356414 
##                      grade             home_ownership 
##                 1.70985230                 2.04605086 
##        verification_status                loan_status 
##                 0.62179392                 0.15129108 
##                 pymnt_plan                        dti 
##                 0.00008866                68.00934810 
##                  revol_bal        initial_list_status 
##         489147939.65753955                 0.24912725 
##                  out_prncp              out_prncp_inv 
##            273519.32762228            273208.22961914 
##                total_pymnt            total_pymnt_inv 
##         106393362.09431723         106374750.73708099 
##            total_rec_prncp              total_rec_int 
##          72449474.50050333           8253374.70288832 
##         total_rec_late_fee                 recoveries 
##               129.69762222            893639.07920437 
##    collection_recovery_fee            last_pymnt_amnt 
##             27064.96883561          39291930.08685572 
##                policy_code           application_type 
##                 0.00000000                 0.00055138 
##              hardship_flag        disbursement_method 
##                 0.00008866                 0.00000000 
##       debt_settlement_flag                  year_temp 
##                 0.02067557                 1.40756346
# looks like policy_code and disbursement_method have variance 0 and mean 1, we can remove these two variables
df.pls <- subset(df.pls, select = -c(policy_code,disbursement_method))
# str(df.pls)                                 #check

# devide and conquer 80% train, 20% test data
p80 <- nrow(d2_sub)*0.8
set.seed(13)
indices <- sort(sample(1:nrow(df.pls), p80)) # select 100 random samples
test.pls <- df.pls[-indices,]
train.pls <- df.pls[indices,]

# pls function
pls.fit <- plsr(loan_status ~ ., data=train.pls, scale=TRUE, validation ="CV")
summary(pls.fit)
## Data:    X dimension: 81206 41 
##  Y dimension: 81206 1
## Fit method: kernelpls
## Number of components considered: 41
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.3892   0.2868    0.276   0.2492   0.2302   0.2247   0.2225
## adjCV       0.3892   0.2868    0.276   0.2491   0.2302   0.2247   0.2225
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.2201    0.219   0.2189    0.2189    0.2189    0.2189    0.2189
## adjCV   0.2201    0.219   0.2189    0.2189    0.2189    0.2189    0.2189
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV       0.2189    0.2189    0.2189    0.2189    0.2189    0.2189
## adjCV    0.2189    0.2188    0.2188    0.2188    0.2188    0.2188
##        20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## CV       0.2189    0.2189    0.2189    0.2188    0.2188    0.2188
## adjCV    0.2188    0.2188    0.2188    0.2188    0.2188    0.2188
##        26 comps  27 comps  28 comps  29 comps  30 comps  31 comps
## CV       0.2188    0.2188    0.2188    0.2188    0.2188    0.2188
## adjCV    0.2188    0.2188    0.2188    0.2188    0.2188    0.2188
##        32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## CV       0.2188    0.2188    0.2188    0.2188    0.2188    0.2188
## adjCV    0.2188    0.2188    0.2188    0.2188    0.2188    0.2188
##        38 comps  39 comps  40 comps  41 comps
## CV       0.2188    0.2188    0.2188    0.2188
## adjCV    0.2188    0.2188    0.2188    0.2188
## 
## TRAINING: % variance explained
##              1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X              7.479    24.06    29.77    33.26    35.99    39.25    41.44
## loan_status   45.744    49.75    59.12    65.07    66.70    67.36    68.06
##              8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X              44.43    47.67     51.40     55.10     58.37     60.50
## loan_status    68.38    68.41     68.41     68.42     68.42     68.42
##              14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## X               62.01     64.09     66.81     68.50     70.32     72.15
## loan_status     68.42     68.42     68.42     68.42     68.42     68.42
##              20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X               73.55     75.02     76.76     78.13     79.10     80.86
## loan_status     68.42     68.42     68.42     68.42     68.42     68.42
##              26 comps  27 comps  28 comps  29 comps  30 comps  31 comps
## X               82.63     83.20     84.79     86.42     87.13     88.94
## loan_status     68.43     68.43     68.43     68.43     68.43     68.43
##              32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## X               91.31     93.52     93.71     95.76     97.96    100.00
## loan_status     68.43     68.43     68.43     68.43     68.43     68.43
##              38 comps  39 comps  40 comps  41 comps
## X              100.00    100.00    100.00    100.00
## loan_status     68.44     68.44     68.44     68.44
pls.prob=predict(pls.fit, test.pls ,ncomp=30)
head(pls.prob)
## [1] 2.127 1.845 2.024 2.088 1.976 1.842
summary(pls.prob)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.279   1.864   1.910   1.813   1.963   2.241
pls.pred <- ifelse(pls.prob>0.5,"Fully Paid","Defaulted")

The partial least squares method is generally used for numeric continous variables. There is a method namly partial least squares discriminant analysis, that can be used for categorical variables. This however was not further investigated due to time constrains. We convert the loan status variable into a numeric variable. However, this gave us results that we cannot interperate in a useful way. In conclusion to the question asked in Part 2, we must give preference to PCA over PLS, since is not suited for this classification problem.

  1. Perform the classification using KNN, Logistic Regression, Decision tree and Random forest. Compare the respective train and test error performances to select one of these approaches. Perform the prediction on the validation set and compute the confusion matrix.



Logistic Regression

# devide and conquer 80% train.log, 20% test.data
set.seed(13)
indices <- sort(sample(1:nrow(d2_sub), p80)) # select 100 random samples
test.log <- d2_sub[-indices,]
train.log <- d2_sub[indices,]

#drop NAs & columns with more than 50 levels or less than 2
drop_na(train.log) 
less2levels <-  train.log[, sapply(train.log, function(col) length(unique(col))) < 2]
plus50levels <-  train.log[, sapply(train.log, function(col) length(unique(col))) > 50]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "numeric")]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "integer")]
rmv.col <- c(colnames(less2levels),colnames(plus50levels))
rmv.col
## [1] "policy_code"         "disbursement_method" "last_pymnt_d"       
## [4] "title"               "earliest_cr_line"    "last_credit_pull_d" 
## [7] "issue_d"             "zip_code"            "addr_state"
train.log <- train.log[ , !(names(train.log) %in% rmv.col)]
# str(train.log)                  # check

log.fit <- glm(loan_status ~ ., family="binomial", data=train.log)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log.fit)
## 
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train.log)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.942   0.000   0.009   0.058   8.490  
## 
## Coefficients: (6 not defined because of singularities)
##                                                     Estimate
## (Intercept)                                  1596.1175906646
## funded_amnt_inv                                -0.0042239504
## annual_inc                                     -0.0000003691
## pub_rec_bankruptcies                            0.1074820796
## revol_util                                     -0.0049668737
## collections_12_mths_ex_med                     -0.6643459174
## chargeoff_within_12_mths                        1.0459590754
## tax_liens                                       0.2420755916
## delinq_2yrs                                     0.0265023207
## inq_last_6mths                                  0.3400357473
## open_acc                                       -0.0253205900
## pub_rec                                        -0.3047168281
## total_acc                                       0.0024553198
## acc_now_delinq                                 -0.1885208017
## delinq_amnt                                    -0.0000348501
## id_2                                            0.0000000716
## loan_amnt                                      -0.0002996871
## funded_amnt                                    -0.0015263423
## term60 months                                   0.4289031765
## int_rate                                       -0.6904382902
## installment                                     0.0015284132
## gradeB                                          3.7380610475
## gradeC                                          5.6530027018
## gradeD                                          8.2266771355
## gradeE                                          8.9640706447
## gradeF                                         14.2810638801
## gradeG                             -2863774854855.7319335937
## sub_gradeA2                                     0.3527068019
## sub_gradeA3                                     0.6494456692
## sub_gradeA4                                     0.9017209014
## sub_gradeA5                                     0.8219715707
## sub_gradeB1                                    -2.2258883645
## sub_gradeB2                                    -1.1620605272
## sub_gradeB3                                    -0.9769213198
## sub_gradeB4                                    -0.8042032834
## sub_gradeB5                                               NA
## sub_gradeC1                                    -1.6006359969
## sub_gradeC2                                    -0.8153221739
## sub_gradeC3                                    -0.8078126019
## sub_gradeC4                                    -0.2678132502
## sub_gradeC5                                               NA
## sub_gradeD1                                    -1.6332531416
## sub_gradeD2                                    -1.5914486352
## sub_gradeD3                                    -0.8900854357
## sub_gradeD4                                    -0.3756871407
## sub_gradeD5                                               NA
## sub_gradeE1                                    -0.7709186141
## sub_gradeE2                                     0.2480298091
## sub_gradeE3                                     2.0829021345
## sub_gradeE4                                     1.2527044116
## sub_gradeE5                                               NA
## sub_gradeF1                                    -3.6999911377
## sub_gradeF2                                    -3.4057448244
## sub_gradeF3                                    -2.8351958248
## sub_gradeF4                                    -0.8594286203
## sub_gradeF5                                               NA
## sub_gradeG1                         2863774854867.3969726562
## sub_gradeG2                         2863774854866.9052734375
## sub_gradeG3                         2863774854890.2983398437
## sub_gradeG4                         2863774854884.0180664062
## sub_gradeG5                         2863774854882.6865234375
## emp_length1 year                               -0.0782523741
## emp_length10+ years                            -0.3513320315
## emp_length2 years                              -0.1509637608
## emp_length3 years                              -0.5054641077
## emp_length4 years                              -0.1745352288
## emp_length5 years                               0.0186843203
## emp_length6 years                               0.0429527711
## emp_length7 years                              -0.3932234899
## emp_length8 years                              -0.5618624586
## emp_length9 years                               1.5856241105
## emp_lengthn/a                                  -0.7721557092
## home_ownershipOTHER                            36.4922215994
## home_ownershipOWN                               0.0696397792
## home_ownershipRENT                             -0.1062975061
## verification_statusSource Verified              0.1155058081
## verification_statusVerified                     0.1955563216
## pymnt_plany                                   237.1805559180
## purposecredit_card                             -0.1382151674
## purposedebt_consolidation                      -0.2699089168
## purposeeducational                             10.7206893887
## purposehome_improvement                        -0.1933713125
## purposehouse                                   -0.4066442714
## purposemajor_purchase                          -1.0884230523
## purposemedical                                 -0.1155378053
## purposemoving                                  -0.4802372028
## purposeother                                   -0.7806814423
## purposerenewable_energy                        -0.7040382302
## purposesmall_business                           0.4615832462
## purposevacation                                -0.1367284913
## purposewedding                                 -0.9761105537
## dti                                             0.0047562573
## revol_bal                                       0.0000133299
## initial_list_statusw                           -0.1292219908
## out_prncp                                       1.8740333621
## out_prncp_inv                                  -1.9973199826
## total_pymnt                                    -6.5260124007
## total_pymnt_inv                                 0.0037202401
## total_rec_prncp                                 6.5284597768
## total_rec_int                                   6.5220315626
## total_rec_late_fee                              6.5108133457
## recoveries                                      6.4832454925
## collection_recovery_fee                         0.1233288790
## last_pymnt_amnt                                 0.0008963066
## application_typeJoint App                      11.5416281238
## hardship_flagY                                            NA
## debt_settlement_flagY                         -19.5103850855
## year_temp                                      -0.7878995473
##                                                   Std. Error z value
## (Intercept)                                   200.1768720510    7.97
## funded_amnt_inv                                 0.0004097675  -10.31
## annual_inc                                      0.0000005697   -0.65
## pub_rec_bankruptcies                            0.2647733045    0.41
## revol_util                                      0.0030787880   -1.61
## collections_12_mths_ex_med                      0.2799704595   -2.37
## chargeoff_within_12_mths                        1.0781461234    0.97
## tax_liens                                       0.2520385718    0.96
## delinq_2yrs                                     0.0802100235    0.33
## inq_last_6mths                                  0.0767800802    4.43
## open_acc                                        0.0186402745   -1.36
## pub_rec                                         0.2213926463   -1.38
## total_acc                                       0.0084265243    0.29
## acc_now_delinq                                  0.7954176587   -0.24
## delinq_amnt                                     0.0000290784   -1.20
## id_2                                            0.0000002950    0.24
## loan_amnt                                       0.0001563432   -1.92
## funded_amnt                                     0.0004092574   -3.73
## term60 months                                   0.6270179517    0.68
## int_rate                                        0.1072614761   -6.44
## installment                                     0.0039223119    0.39
## gradeB                                          0.9709617220    3.85
## gradeC                                          1.2564334848    4.50
## gradeD                                          1.6143429016    5.10
## gradeE                                          1.9639001687    4.56
## gradeF                                          2.7307779117    5.23
## gradeG                             58271573085858.6562500000   -0.05
## sub_gradeA2                                     0.7924181985    0.45
## sub_gradeA3                                     0.7963899549    0.82
## sub_gradeA4                                     0.7538923593    1.20
## sub_gradeA5                                     0.7207798046    1.14
## sub_gradeB1                                     0.5228648144   -4.26
## sub_gradeB2                                     0.4785863644   -2.43
## sub_gradeB3                                     0.3819796432   -2.56
## sub_gradeB4                                     0.3237502345   -2.48
## sub_gradeB5                                               NA      NA
## sub_gradeC1                                     0.4483636838   -3.57
## sub_gradeC2                                     0.4431324428   -1.84
## sub_gradeC3                                     0.4013020329   -2.01
## sub_gradeC4                                     0.3869795852   -0.69
## sub_gradeC5                                               NA      NA
## sub_gradeD1                                     0.6213867035   -2.63
## sub_gradeD2                                     0.5741310440   -2.77
## sub_gradeD3                                     0.5887401083   -1.51
## sub_gradeD4                                     0.6379061130   -0.59
## sub_gradeD5                                               NA      NA
## sub_gradeE1                                     0.7358855420   -1.05
## sub_gradeE2                                     0.8277417025    0.30
## sub_gradeE3                                     0.9853244469    2.11
## sub_gradeE4                                     1.0043178822    1.25
## sub_gradeE5                                               NA      NA
## sub_gradeF1                                     1.6588682361   -2.23
## sub_gradeF2                                     1.7218719955   -1.98
## sub_gradeF3                                     1.7100012716   -1.66
## sub_gradeF4                                     2.0860686477   -0.41
## sub_gradeF5                                               NA      NA
## sub_gradeG1                        58271573085857.8125000000    0.05
## sub_gradeG2                        58271573085858.4765625000    0.05
## sub_gradeG3                        58271573085864.8593750000    0.05
## sub_gradeG4                        58271573085855.3125000000    0.05
## sub_gradeG5                        58271573085853.6875000000    0.05
## emp_length1 year                                0.3469431132   -0.23
## emp_length10+ years                             0.2703057471   -1.30
## emp_length2 years                               0.3248284750   -0.46
## emp_length3 years                               0.3101940738   -1.63
## emp_length4 years                               0.3584759464   -0.49
## emp_length5 years                               0.3758273682    0.05
## emp_length6 years                               0.4207274438    0.10
## emp_length7 years                               0.3683264855   -1.07
## emp_length8 years                               0.3516253685   -1.60
## emp_length9 years                               0.6318478250    2.51
## emp_lengthn/a                                   0.3177068206   -2.43
## home_ownershipOTHER                        335289.2193462860    0.00
## home_ownershipOWN                               0.2326148244    0.30
## home_ownershipRENT                              0.1480845279   -0.72
## verification_statusSource Verified              0.1586177976    0.73
## verification_statusVerified                     0.1756149123    1.11
## pymnt_plany                                 80956.5381584926    0.00
## purposecredit_card                              0.6172423191   -0.22
## purposedebt_consolidation                       0.6056153770   -0.45
## purposeeducational                              2.5673167940    4.18
## purposehome_improvement                         0.6597943666   -0.29
## purposehouse                                    1.0486504641   -0.39
## purposemajor_purchase                           0.6623449307   -1.64
## purposemedical                                  0.7387955506   -0.16
## purposemoving                                   0.7410327760   -0.65
## purposeother                                    0.6212401941   -1.26
## purposerenewable_energy                         1.8897410820   -0.37
## purposesmall_business                           0.9627450244    0.48
## purposevacation                                 0.8184546256   -0.17
## purposewedding                                  1.2153901772   -0.80
## dti                                             0.0084608777    0.56
## revol_bal                                       0.0000063358    2.10
## initial_list_statusw                            0.1373119196   -0.94
## out_prncp                                     539.2367633312    0.00
## out_prncp_inv                                 540.9404325318    0.00
## total_pymnt                                    26.7832151827   -0.24
## total_pymnt_inv                                 0.0005719240    6.50
## total_rec_prncp                                26.7832143605    0.24
## total_rec_int                                  26.7832142871    0.24
## total_rec_late_fee                             26.7831480941    0.24
## recoveries                                     26.7832377605    0.24
## collection_recovery_fee                         0.0125178208    9.85
## last_pymnt_amnt                                 0.0001122375    7.99
## application_typeJoint App                     719.0515385641    0.02
## hardship_flagY                                            NA      NA
## debt_settlement_flagY                           3.5688709568   -5.47
## year_temp                                       0.0992217319   -7.94
##                                                Pr(>|z|)    
## (Intercept)                          0.0000000000000015 ***
## funded_amnt_inv                    < 0.0000000000000002 ***
## annual_inc                                      0.51711    
## pub_rec_bankruptcies                            0.68479    
## revol_util                                      0.10669    
## collections_12_mths_ex_med                      0.01765 *  
## chargeoff_within_12_mths                        0.33197    
## tax_liens                                       0.33682    
## delinq_2yrs                                     0.74109    
## inq_last_6mths                       0.0000094803890652 ***
## open_acc                                        0.17434    
## pub_rec                                         0.16871    
## total_acc                                       0.77076    
## acc_now_delinq                                  0.81265    
## delinq_amnt                                     0.23073    
## id_2                                            0.80816    
## loan_amnt                                       0.05526 .  
## funded_amnt                                     0.00019 ***
## term60 months                                   0.49395    
## int_rate                             0.0000000001218866 ***
## installment                                     0.69678    
## gradeB                                          0.00012 ***
## gradeC                               0.0000068195067174 ***
## gradeD                               0.0000003469213388 ***
## gradeE                               0.0000050087054269 ***
## gradeF                               0.0000001698140433 ***
## gradeG                                          0.96080    
## sub_gradeA2                                     0.65625    
## sub_gradeA3                                     0.41479    
## sub_gradeA4                                     0.23166    
## sub_gradeA5                                     0.25412    
## sub_gradeB1                          0.0000207094740188 ***
## sub_gradeB2                                     0.01518 *  
## sub_gradeB3                                     0.01054 *  
## sub_gradeB4                                     0.01299 *  
## sub_gradeB5                                          NA    
## sub_gradeC1                                     0.00036 ***
## sub_gradeC2                                     0.06578 .  
## sub_gradeC3                                     0.04412 *  
## sub_gradeC4                                     0.48890    
## sub_gradeC5                                          NA    
## sub_gradeD1                                     0.00858 ** 
## sub_gradeD2                                     0.00557 ** 
## sub_gradeD3                                     0.13057    
## sub_gradeD4                                     0.55590    
## sub_gradeD5                                          NA    
## sub_gradeE1                                     0.29482    
## sub_gradeE2                                     0.76445    
## sub_gradeE3                                     0.03452 *  
## sub_gradeE4                                     0.21228    
## sub_gradeE5                                          NA    
## sub_gradeF1                                     0.02572 *  
## sub_gradeF2                                     0.04794 *  
## sub_gradeF3                                     0.09732 .  
## sub_gradeF4                                     0.68035    
## sub_gradeF5                                          NA    
## sub_gradeG1                                     0.96080    
## sub_gradeG2                                     0.96080    
## sub_gradeG3                                     0.96080    
## sub_gradeG4                                     0.96080    
## sub_gradeG5                                     0.96080    
## emp_length1 year                                0.82155    
## emp_length10+ years                             0.19368    
## emp_length2 years                               0.64211    
## emp_length3 years                               0.10321    
## emp_length4 years                               0.62634    
## emp_length5 years                               0.96035    
## emp_length6 years                               0.91868    
## emp_length7 years                               0.28570    
## emp_length8 years                               0.11007    
## emp_length9 years                               0.01209 *  
## emp_lengthn/a                                   0.01508 *  
## home_ownershipOTHER                             0.99991    
## home_ownershipOWN                               0.76465    
## home_ownershipRENT                              0.47287    
## verification_statusSource Verified              0.46649    
## verification_statusVerified                     0.26547    
## pymnt_plany                                     0.99766    
## purposecredit_card                              0.82282    
## purposedebt_consolidation                       0.65583    
## purposeeducational                   0.0000296895954881 ***
## purposehome_improvement                         0.76946    
## purposehouse                                    0.69818    
## purposemajor_purchase                           0.10032    
## purposemedical                                  0.87573    
## purposemoving                                   0.51694    
## purposeother                                    0.20888    
## purposerenewable_energy                         0.70948    
## purposesmall_business                           0.63162    
## purposevacation                                 0.86733    
## purposewedding                                  0.42190    
## dti                                             0.57402    
## revol_bal                                       0.03539 *  
## initial_list_statusw                            0.34666    
## out_prncp                                       0.99723    
## out_prncp_inv                                   0.99705    
## total_pymnt                                     0.80749    
## total_pymnt_inv                      0.0000000000778071 ***
## total_rec_prncp                                 0.80742    
## total_rec_int                                   0.80761    
## total_rec_late_fee                              0.80793    
## recoveries                                      0.80873    
## collection_recovery_fee            < 0.0000000000000002 ***
## last_pymnt_amnt                      0.0000000000000014 ***
## application_typeJoint App                       0.98719    
## hardship_flagY                                       NA    
## debt_settlement_flagY                0.0000000458173689 ***
## year_temp                            0.0000000000000020 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78034.7  on 81205  degrees of freedom
## Residual deviance:  2231.3  on 81104  degrees of freedom
## AIC: 2435
## 
## Number of Fisher Scoring iterations: 25
# "Predicting" the train data 
log.prob <- predict(log.fit, newdata = test.log, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
log.pred <- ifelse(log.prob>0.5,"Fully Paid","Defaulted")

# Test errors and confusion matrix
log.matrix <- table(y.test.pca,log.pred)
log.overall.test.err = mean(log.pred != test.log$loan_status)
log.default.test.err= log.matrix[1,2]/(log.matrix[1,1]+log.matrix[1,2]) # Class Test Error 
log.paid.test.err= log.matrix[2,1]/(log.matrix[2,2]+log.matrix[2,1]) # Class Test Error 
log.matrix
##             log.pred
## y.test.pca   Defaulted Fully Paid
##   Defaulted       3688         64
##   Fully Paid        10      16540
log.default.test.err
## [1] 0.01706
log.paid.test.err
## [1] 0.0006042
log.overall.test.err
## [1] 0.003645

For the Logisitic Regression we used the dataset removing the categorical values with more than 50 levels and those with less than 2. The defaulted test error rate is around 1.7 %, the fully paid test error rate is 0.04 % and the overall test error rate is 0.36 %.

Random Forest

In this following section we create a random forest model. We will try different amounts of variables chosen randomly at each try (mtry). In theory we could use mtry that is equal to the amount of predictors in the dataset. In practice we will choose only compare mtry between 1 and 10 (the section where we try different mtrys is commented, as it is computionally expensive and time consuming) .

# devide and conquer 80% train.rf, 20% test.data
set.seed(13)
indices <- sort(sample(1:nrow(d2_sub), p80)) # select 100 random samples
test.rf <- d2_sub[-indices,]
train.rf <- d2_sub[indices,]

#drop NAs & columns with more than 53 categories on train data
drop_na(train.rf) 
plus50levels <-  train.rf[, sapply(train.rf, function(col) length(unique(col))) > 50]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "numeric")]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "integer")]
rmv.col <- colnames(plus50levels)
train.rf <- train.rf[ , !(names(train.rf) %in% rmv.col)]
# str(train.rf)           check

#d2_sub <- as.data.frame(d2_sub)
rf.fit <- randomForest(loan_status ~ ., data= train.rf,ntree=400, mtry=4)  
rf.fit
## 
## Call:
##  randomForest(formula = loan_status ~ ., data = train.rf, ntree = 400,      mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.58%
## Confusion matrix:
##            Defaulted Fully Paid class.error
## Defaulted      14669        441   0.0291860
## Fully Paid        32      66064   0.0004841
importance(rf.fit)
##                            MeanDecreaseGini
## funded_amnt_inv                    775.4057
## annual_inc                          71.8560
## pub_rec_bankruptcies                10.4872
## revol_util                          67.4561
## collections_12_mths_ex_med           5.3176
## chargeoff_within_12_mths             1.9668
## tax_liens                            6.4099
## delinq_2yrs                         15.3960
## inq_last_6mths                      32.5806
## open_acc                            55.7434
## pub_rec                             14.0838
## total_acc                           60.6186
## acc_now_delinq                       1.6115
## delinq_amnt                          3.0891
## id_2                               103.0269
## loan_amnt                          857.3068
## funded_amnt                        817.2438
## term                               373.9342
## int_rate                           221.2365
## installment                        684.2372
## grade                              208.6517
## sub_grade                          333.6614
## emp_length                          78.4711
## home_ownership                      19.5888
## verification_status                 40.7899
## pymnt_plan                           0.8561
## purpose                             56.4789
## dti                                 89.3934
## revol_bal                           81.4486
## initial_list_status                 16.7607
## out_prncp                          151.5170
## out_prncp_inv                      159.2229
## total_pymnt                       1330.1439
## total_pymnt_inv                   1273.2038
## total_rec_prncp                   3268.4156
## total_rec_int                      377.7770
## total_rec_late_fee                 203.3534
## recoveries                        5789.3385
## collection_recovery_fee           4622.2460
## last_pymnt_amnt                   1580.4862
## policy_code                          0.0000
## application_type                     0.2693
## hardship_flag                        0.8913
## disbursement_method                  0.0000
## debt_settlement_flag               558.6815
## year_temp                           73.9763
varImpPlot(rf.fit)

# Uncomment this section only if you have time 
# check which RF is the most promising, mtry 10 should be sufficient
# default.train.err=double(10)
# paid.train.err=double(10)
# default.test.err=double(10)
# paid.test.err=double(10)
# 
#mtry is no of Variables randomly chosen at each split
# for(mtry in 1:10) 
# {
#   rf=randomForest(loan_status ~ . , data= train.rf,mtry=mtry,ntree=400) 
#   default.train.err[mtry] = rf$confusion[1,3]
#   paid.train.err[mtry] = rf$confusion[2,3]
#   
#   pred<-predict(rf,test.rf) #Predictions on Test Set for each Tree
#   default.test.err[mtry]= table(test.rf$loan_status,pred)[1,2]/(table(test.rf$loan_status,pred)[1,1]+table(test.rf$loan_status,pred)[1,2]) # Class Test Error 
#   paid.test.err[mtry]= table(test.rf$loan_status,pred)[2,1]/(table(test.rf$loan_status,pred)[2,2]+table(test.rf$loan_status,pred)[2,1]) # Class Test Error 
#   
#   cat(mtry," ") #printing the output to the console
# }

# plot the results
# matplot(1:mtry , cbind(default.train.err,paid.train.err,default.test.err,paid.test.err), pch=19 , col=c("red","blue","darkred","darkblue"),type="b",ylab="Classification Error",xlab="Number of Predictors Considered at each Split")
# legend("left",legend=c("Default Train Class.Error","Paid Train Class.Error","Default Test Class.Error","Paid Test Class.Error"),pch=19,  col=c("red","blue","darkred","darkblue"))

rf.pred <- predict(rf.fit, newdata= test.rf)
table(test.rf$loan_status,rf.pred)
##             rf.pred
##              Defaulted Fully Paid
##   Defaulted       3647        105
##   Fully Paid        11      16539
# Test errors and confusion matrix
rf.matrix <- table(test.rf$loan_status,rf.pred)
rf.overall.test.err = mean(rf.pred != test.rf$loan_status)
rf.default.test.err= rf.matrix[1,2]/(rf.matrix[1,1]+rf.matrix[1,2]) # Class Test Error 
rf.paid.test.err= rf.matrix[2,1]/(rf.matrix[2,2]+rf.matrix[2,1]) # Class Test Error 
rf.matrix
##             rf.pred
##              Defaulted Fully Paid
##   Defaulted       3647        105
##   Fully Paid        11      16539
rf.default.test.err
## [1] 0.02799
rf.paid.test.err
## [1] 0.0006647
rf.overall.test.err
## [1] 0.005714

The amount of mtry used in this approach was 4 as we noted that above 4 there was no great increase in accuracy. The most important predictor variables in the random forest are recoveries, followed by collection_recovery_fee. The error rate for the class defaulted is higher (3.14%) than the models seen before. We assume that the random forest did not train so well on the unbalanced data set.



KNN

Now we model the data using K-Nearest-Neighbour

# load previous df
df.knn <- d2_sub
#str(df.knn)              # Check 

#drop NAs & columns with more than 50 levels or less than 2
drop_na(df.knn) 
less2levels <-  df.knn[, sapply(df.knn, function(col) length(unique(col))) < 2]
plus50levels <-  df.knn[, sapply(df.knn, function(col) length(unique(col))) > 50]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "numeric")]
plus50levels <- plus50levels[, sapply(plus50levels, function(col) class(col) != "integer")]
rmv.col <- c(colnames(less2levels),colnames(plus50levels))
# rmv.col                 # Check
df.knn <- df.knn[ , !(names(df.knn) %in% rmv.col)]
#str(df.knn)              # Check 

# transform factors to numeric, except for loan_status
str(df.knn)
## 'data.frame':    101508 obs. of  45 variables:
##  $ funded_amnt_inv           : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ annual_inc                : int  104000 104433 80000 30000 32000 59000 180000 92000 65000 88000 ...
##  $ pub_rec_bankruptcies      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ revol_util                : num  51.8 64.5 21.4 26.6 11.7 24.2 84.5 64.5 3 34.4 ...
##  $ collections_12_mths_ex_med: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ chargeoff_within_12_mths  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_liens                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_2yrs               : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ inq_last_6mths            : int  2 3 1 2 2 1 0 0 1 0 ...
##  $ open_acc                  : int  14 12 30 19 10 10 12 16 12 20 ...
##  $ pub_rec                   : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ total_acc                 : int  31 35 47 36 16 34 27 24 17 42 ...
##  $ acc_now_delinq            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delinq_amnt               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ id_2                      : int  777778 876330 188890 288890 388890 788890 877442 1114 10002 11114 ...
##  $ loan_amnt                 : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ funded_amnt               : int  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ term                      : Factor w/ 2 levels "36 months","60 months": 1 2 1 2 1 1 1 1 1 1 ...
##  $ int_rate                  : num  10.78 22.45 5.32 16.59 13.99 ...
##  $ installment               : num  1097 290 452 246 144 ...
##  $ grade                     : Factor w/ 7 levels "A","B","C","D",..: 2 6 1 4 3 1 2 1 1 1 ...
##  $ sub_grade                 : Factor w/ 35 levels "A1","A2","A3",..: 9 26 1 17 14 4 7 2 1 1 ...
##  $ emp_length                : Factor w/ 12 levels "< 1 year","1 year",..: 10 5 12 1 4 3 3 3 11 1 ...
##  $ home_ownership            : Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 1 4 4 4 1 1 1 3 1 ...
##  $ verification_status       : Factor w/ 3 levels "Not Verified",..: 2 2 1 2 2 1 1 1 1 2 ...
##  $ loan_status               : Factor w/ 2 levels "Defaulted","Fully Paid": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pymnt_plan                : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose                   : Factor w/ 14 levels "car","credit_card",..: 3 7 3 3 10 2 3 3 3 2 ...
##  $ dti                       : num  14 25.4 20.2 30.8 31 ...
##  $ revol_bal                 : int  28475 21929 41593 6118 1294 10467 87329 51507 2269 12203 ...
##  $ initial_list_status       : Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
##  $ out_prncp                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ out_prncp_inv             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_pymnt               : num  34309 11740 16241 10758 5036 ...
##  $ total_pymnt_inv           : num  34309 11740 16241 10758 5036 ...
##  $ total_rec_prncp           : num  33600 10400 15000 10000 4200 7000 20000 28000 20000 17600 ...
##  $ total_rec_int             : num  709 1340 1241 758 836 ...
##  $ total_rec_late_fee        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ recoveries                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ collection_recovery_fee   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ last_pymnt_amnt           : num  33263 10129 2244 9795 1743 ...
##  $ application_type          : Factor w/ 2 levels "Individual","Joint App": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hardship_flag             : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ debt_settlement_flag      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year_temp                 : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
for(col in 1:ncol(df.knn)){
  if(class(df.knn[,col]) %in% c('factor') && colnames(df.knn)[col] != 'loan_status'){
    df.knn[,col] <- as.numeric(df.knn[,col])
  }
}
#str(df.knn)              # Check 

# devide and conquer 80% train.log, 20% test.data
set.seed(13)
indices <- sort(sample(1:nrow(df.knn), p80)) # select 100 random samples
test.knn <- df.knn[-indices,]
train.knn <- df.knn[indices,]

# remove loan_status from test and train set
y.train.knn <- train.knn$loan_status
y.test.knn <- test.knn$loan_status
train.knn <- subset(train.knn, select = -c(loan_status))
test.knn <- subset(test.knn, select = -c(loan_status))
#str(df.knn)              # Check 

set.seed (1)
knn.pred <- knn(train.knn,test.knn, y.train.knn,k=1)

# Test errors and confusion matrix
knn.matrix <- table(y.test.knn,knn.pred)
knn.overall.test.err = mean(knn.pred != y.test.knn)
knn.default.test.err= knn.matrix[1,2]/(knn.matrix[1,1]+knn.matrix[1,2]) # Class Test Error 
knn.paid.test.err= knn.matrix[2,1]/(knn.matrix[2,2]+knn.matrix[2,1]) # Class Test Error 
knn.matrix
##             knn.pred
## y.test.knn   Defaulted Fully Paid
##   Defaulted       2761        991
##   Fully Paid       487      16063
knn.default.test.err
## [1] 0.2641
knn.paid.test.err
## [1] 0.02943
knn.overall.test.err
## [1] 0.0728

The results of the KNN model do not look as accurate as the previous ones. We have a defaulted error rate of 26.3 %, a fully paid error rate of 2.8 %. The overall test error rate is 7.2 %. Thus, KNN does not seem as accurate as some previous models.



Decision Tree

Finally, we model data using a simple decision tree.

# devide and conquer 80% train 20% test data
set.seed(13)
p80 <- nrow(df.pca)*0.8
indices <- sort(sample(1:nrow(df.pca), p80)) # select 100 random samples
test.tree <- df.pca[-indices,]
train.tree <- df.pca[indices,]

# fit the tree 
tree.fit = tree(loan_status ~ . , data = train.tree)
summary(tree.fit)
## 
## Classification tree:
## tree(formula = loan_status ~ ., data = train.tree)
## Variables actually used in tree construction:
## [1] "recoveries"      "last_pymnt_amnt" "term"            "total_rec_prncp"
## [5] "funded_amnt"     "funded_amnt_inv"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.141 = 11400 / 81200 
## Misclassification error rate: 0.018 = 1459 / 81206
plot(tree.fit)
text(tree.fit, pretty = 0)

tree.fit
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 81206 80000 Fully Paid ( 0.186 0.814 )  
##    2) recoveries < 0.005 69797 30000 Fully Paid ( 0.054 0.946 )  
##      4) last_pymnt_amnt < 1269.81 30758 20000 Fully Paid ( 0.120 0.880 )  
##        8) term < 1.5 26963 10000 Fully Paid ( 0.066 0.934 )  
##         16) total_rec_prncp < 4998.13 5275  5000 Fully Paid ( 0.204 0.796 )  
##           32) funded_amnt < 4987.5 4423  2000 Fully Paid ( 0.051 0.949 ) *
##           33) funded_amnt > 4987.5 852     0 Defaulted ( 1.000 0.000 ) *
##         17) total_rec_prncp > 4998.13 21688  6000 Fully Paid ( 0.033 0.967 ) *
##        9) term > 1.5 3795  5000 Defaulted ( 0.501 0.499 )  
##         18) total_rec_prncp < 9989.65 1636  1000 Defaulted ( 0.896 0.104 )  
##           36) funded_amnt_inv < 9725 174    40 Fully Paid ( 0.023 0.977 ) *
##           37) funded_amnt_inv > 9725 1462     0 Defaulted ( 1.000 0.000 ) *
##         19) total_rec_prncp > 9989.65 2159  2000 Fully Paid ( 0.202 0.798 ) *
##      5) last_pymnt_amnt > 1269.81 39039   700 Fully Paid ( 0.001 0.999 ) *
##    3) recoveries > 0.005 11409   500 Defaulted ( 0.997 0.003 ) *
tree.pred <- predict(tree.fit,newdata = test.tree, type = "class")
with(test.tree, table(tree.pred,test.tree$loan_status))
##             
## tree.pred    Defaulted Fully Paid
##   Defaulted       3395         13
##   Fully Paid       357      16537
# decision tree with CV 
cvtree.fit = cv.tree(tree.fit, FUN=prune.misclass)
plot(cvtree.fit)

tree.prune.fit <- prune.misclass(tree.fit,best=3)
tree.prune.fit
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 81206 80000 Fully Paid ( 0.186 0.814 )  
##    2) recoveries < 0.005 69797 30000 Fully Paid ( 0.054 0.946 )  
##      4) last_pymnt_amnt < 1269.81 30758 20000 Fully Paid ( 0.120 0.880 )  
##        8) term < 1.5 26963 10000 Fully Paid ( 0.066 0.934 ) *
##        9) term > 1.5 3795  5000 Defaulted ( 0.501 0.499 )  
##         18) total_rec_prncp < 9989.65 1636  1000 Defaulted ( 0.896 0.104 ) *
##         19) total_rec_prncp > 9989.65 2159  2000 Fully Paid ( 0.202 0.798 ) *
##      5) last_pymnt_amnt > 1269.81 39039   700 Fully Paid ( 0.001 0.999 ) *
##    3) recoveries > 0.005 11409   500 Defaulted ( 0.997 0.003 ) *
tree.prune.pred <- predict(tree.prune.fit,newdata = test.tree, type = "class")

# Test errors and confusion matrix for the pruned tree
tree.matrix <- with(test.tree, table(tree.prune.pred,test.tree$loan_status))
tree.overall.test.err = mean(tree.prune.pred != test.tree$loan_status)
tree.default.test.err= tree.matrix[1,2]/(tree.matrix[1,1]+tree.matrix[1,2]) # Class Test Error 
tree.paid.test.err= tree.matrix[2,1]/(tree.matrix[2,2]+tree.matrix[2,1]) # Class Test Error 
tree.matrix
##                
## tree.prune.pred Defaulted Fully Paid
##      Defaulted       3188         62
##      Fully Paid       564      16488
tree.overall.test.err 
## [1] 0.03083
tree.default.test.err
## [1] 0.01908
tree.paid.test.err
## [1] 0.03308

The results of the decision tree look surprisingly good. The overall test error rate is 3.1 %, the defaulted test error rate is 1.4 % and the fully paid test error rate is 3.4%. The great advantage of this approach is how easy we can explain the results. By far the most important indicator in the pruned tree is the recoveries. If this value is above 0.01, then the model will forecast the loan to Default.





Discussion

PCA <- c(pca.default.test.err,pca.paid.test.err,pca.overall.test.err)
LogRegression <- c(log.default.test.err,log.paid.test.err,log.overall.test.err)
RF <- c(rf.default.test.err,rf.paid.test.err,rf.overall.test.err)
KNN <- c(knn.default.test.err,knn.paid.test.err,knn.overall.test.err)
Tree <- c(tree.default.test.err,tree.paid.test.err,tree.overall.test.err)
row.names <- c("Class:Defaulted Test Error", "Class:Paid Test Error", "Total:Test Error")
all <- data.frame(PCA,LogRegression,RF,KNN,Tree,row.names = row.names ) 
all
all_transposed <- as.data.frame(t(all))
all_transposed[order(all_transposed$`Total:Test Error`),]
write.csv(all_transposed, file = "data/classification_methods_results.csv")

If the objective is to reduce overall test error then the PCA in combination with a logistic regression would be the most appropriate method. If however the class error of the defaulted loans is the objective then a simple pruned tree model might be more appropriate. In addition, we should note that the PCA is an unsupervised method and is therefore inadequate for explaining models. Bottom line: The tree model offers the best solution that is accurate, explainable and understandable.

Following outlook could be considered. Further analysis could be done with PLS discriminant analysis which would allow computing for a categorical target variable. Because of time constraints, as well as lacking knowledge in this area, this was not further explored. A balanced data set could have been set up for the training of the models. This would imply that roughly 50% of the outcome would be “Defaulted” and 50% “Fully Paid”. We would expect some models to perform better with a balanced training set. The team would have also liked more time to implement a Boosting model, which we would have expected to outperform the RF.