Abstract

In this project, I build machine-learned models trained on LendingClub (a leading P2P lending platform) historical loan data from 2012-2014 that help investors quantify credit risks using R. The classifier, predicting whether a given loan will be fully paid or not, achieves accuracy of 0.8344.

Introduction

The purpose of this project is to explore data from Lending Club (LC), a peer-to-peer lending site. LC has made data from funded loans publicly available and accessible in CSV formatted files. This data is incredibly rich, and personal financial data is rare. Additionally, a model predicting defaults could help to instruct the direction of a lending portfolio on the site.I downloaded the Loan Data from 2012 to 2014, a file containing 423814 records. Each record is a distinct loan made on the Lending Club platform. For each loan, 145 characteristics are recorded in the table. These characteristics can largely be divided into two groups - features of the loan, and features of the borrower.
The loan features are the basic stats one might expect, such as the loan amount, the interest rate, and the term of the loan. The borrower features, which comprise by far the majority of the dataset, are the characteristics of the borrowers that Lending Club has deemed important to collect. Such features include employment length, credit history, and public default histories.

After data cleaning, feature selection and feature engineering, I omit the missing values and remove outliers of annual_inc and int_rate. Now the data has 362730 records and 19 variables.

There are 11 Numeric variables: ‘loan_amnt’, ‘int_rate’, ‘installment’, ‘emp_length’,‘annual_inc’,‘dti’,‘tot_cur_bal’,‘inq_last_6mths’,‘open_acc’,‘revol_util’,‘credit_lth’
There are 8 factor variables: ‘grade’,‘term’,‘home_ownership’,‘verification_status’,‘purpose’,‘addr_state’,‘loan_status’,‘issue_d’

Due to the great sample size ,I’ll random choose 15000 experimental units for training model. In the original data, 17.32% of loan_status are ‘Charged Off’ and 82.67% of loan_status are ‘Fully Paid’. There are 17.03% of loan_status ‘Charged Off’ and 82.97% of loan_status ‘Fully Paid’. The percentage are approximately the same, so the sampling is reasonable.Spliting data, 75%(37500 units) for training and 25%(left) for test. The The percentage are approximately the same as the original data, so the sampling is reasonable.

conclusion

The accuracy of logistic regression model for the train set is 0.8325333 and it for the test set is 0.8338667.
The accuracy of model random forest is 0.8344.
We use decision tree c5.0 model. From the plot, it is easily to see when we increase the trials, the misclassification Error is decreasing, until the trials equals 10, the misclassification Error reaches its least value. The accuracy of the decision tree classifier is 0.8213.
when k=1, Accuracy 0.741066666666667 when k=13,Accuracy 0.826133333333333 when k=21, Accuracy 0.8312 When k=25, the model reaches its maximum Accuracy 0.832.
choose all variables as predictors to build the model, the model accuracy is 78.32%. When trying to use laplace to improve the model, the performance of the model improve slightly. The accuracy of the model with laplace =3 is 78.35%. If only choose the factor predictors, the accuracy is 81.54%. Adding laplace =3 to the model, the accuracy is 81.6%.

library(ggplot2)
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.5.2
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x purrr::lift()            masks caret::lift()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
library(mosaic)
## Warning: package 'mosaic' was built under R version 3.5.2
## Loading required package: ggformula
## Loading required package: ggstance
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Warning: package 'mosaicData' was built under R version 3.5.2
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following object is masked from 'package:purrr':
## 
##     cross
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following object is masked from 'package:caret':
## 
##     dotPlot
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median,
##     prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:mosaic':
## 
##     do
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.5.2
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.5.2
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(janitor)
## Warning: package 'janitor' was built under R version 3.5.2
library(readxl)
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
## 
## Attaching package: 'car'
## The following objects are masked from 'package:mosaic':
## 
##     deltaMethod, logit
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.5.2
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.5.2
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

Step1: Collecting Data

Merge the data together into one dataframe. Make a table of the variable names. Make a table of the number of rows and columns in the dataframe.
here’s the code merge data and save data into Rds.
L12_13 = data.frame(read.csv(“LoanStats3b.csv”, stringsAsFactors = FALSE, header = T, skip = 1)) L14 = data.frame(read.csv(“LoanStats3c.csv”, stringsAsFactors = FALSE, header = T, skip = 1)) #read the file loan1 = rbind(L12_13,L14) # merge 2012-2014 into one dataframe saveRDS(loan1,file = ‘loan.Rds’)

loan1=readRDS('loan.Rds')
head(loan1)
variables1=data.frame(Variables=names(loan1));variables1  #create a table of variables
length1=data.frame(row=nrow(loan1),column=ncol(loan1));length1 #create a table of the number of rows and columns in the dataframe

Step 2 exploring and preparing the data

Data Cleaning & Data Transformation & data structure & feature visualization ## Data cleaning## Notes: Summariz missing value; delete column with over 50% missing vales; 210000 is minimum number of non-NA values. After removing those variables with great amount missing values. There’s 87 variables left.

423814*0.5 
## [1] 211907
loan1=loan1[!sapply(loan1, function (x) sum(is.na(x) | x == "")>210000)]; 
head(loan1)
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
##      mths_since_recent_inq         mo_sin_old_il_acct 
##                      49564                      41047 
##           num_tl_120dpd_2m             pct_tl_nvr_dlq 
##                      35861                      27898 
##                avg_cur_bal       mo_sin_old_rev_tl_op 
##                      27757                      27746 
##      mo_sin_rcnt_rev_tl_op               tot_coll_amt 
##                      27746                      27745 
##                tot_cur_bal           total_rev_hi_lim 
##                      27745                      27745 
##             mo_sin_rcnt_tl      num_accts_ever_120_pd 
##                      27745                      27745 
##             num_actv_bc_tl            num_actv_rev_tl 
##                      27745                      27745 
##                  num_bc_tl                  num_il_tl 
##                      27745                      27745 
##              num_op_rev_tl              num_rev_accts 
##                      27745                      27745 
##        num_rev_tl_bal_gt_0               num_tl_30dpd 
##                      27745                      27745 
##         num_tl_90g_dpd_24m         num_tl_op_past_12m 
##                      27745                      27745 
##            tot_hi_cred_lim total_il_high_credit_limit 
##                      27745                      27745 
##                num_bc_sats                   num_sats 
##                      16059                      16059 
##                    bc_util           percent_bc_gt_75 
##                      11727                      11589 
##             bc_open_to_buy       mths_since_recent_bc 
##                      11474                      11078 
##       acc_open_past_24mths                   mort_acc 
##                       7499                       7499 
##          total_bal_ex_mort             total_bc_limit 
##                       7499                       7499 
##                  loan_amnt                funded_amnt 
##                          4                          4 
##            funded_amnt_inv                installment 
##                          4                          4 
##                 annual_inc                        dti 
##                          4                          4 
##                delinq_2yrs             inq_last_6mths 
##                          4                          4 
##                   open_acc                    pub_rec 
##                          4                          4 
##                  revol_bal                  total_acc 
##                          4                          4 
##                  out_prncp              out_prncp_inv 
##                          4                          4 
##                total_pymnt            total_pymnt_inv 
##                          4                          4 
##            total_rec_prncp              total_rec_int 
##                          4                          4 
##         total_rec_late_fee                 recoveries 
##                          4                          4 
##    collection_recovery_fee            last_pymnt_amnt 
##                          4                          4 
## collections_12_mths_ex_med                policy_code 
##                          4                          4 
##             acc_now_delinq   chargeoff_within_12_mths 
##                          4                          4 
##                delinq_amnt       pub_rec_bankruptcies 
##                          4                          4 
##                  tax_liens                  emp_title 
##                          4                          3 
##                       term                   int_rate 
##                          0                          0 
##                      grade                  sub_grade 
##                          0                          0 
##                 emp_length             home_ownership 
##                          0                          0 
##        verification_status                    issue_d 
##                          0                          0 
##                loan_status                 pymnt_plan 
##                          0                          0 
##                    purpose                      title 
##                          0                          0 
##                   zip_code                 addr_state 
##                          0                          0 
##           earliest_cr_line                 revol_util 
##                          0                          0 
##        initial_list_status               last_pymnt_d 
##                          0                          0 
##         last_credit_pull_d           application_type 
##                          0                          0 
##              hardship_flag        disbursement_method 
##                          0                          0 
##       debt_settlement_flag 
##                          0
dim(loan1)
## [1] 423814     87

Note: Remove irrelevant and Redundant Features. some of them are not related to our target variable and some of them are redundant. Finally, 19 relevant variables are chosen. And convert the chracter variable to be factor.

loan1 <- loan1[,c("loan_status","grade", "term","home_ownership","verification_status","purpose","addr_state", "loan_amnt","int_rate",'installment',"emp_length",  "annual_inc", "dti",  'tot_cur_bal',"inq_last_6mths","open_acc", "revol_util",  "issue_d", "earliest_cr_line")]

There are 7 labels for loan_status: “Charged Off”, “Current” , “Default” ,“Fully Paid”, “In Grace Period” , “Late (16-30 days)” , “Late (31-120 days)”. To consolidate these labels, I took several steps. First, I removed all the late loans, as these fall into a certain grey area with ambiguous, undetermined final statuses. I’ll select data with two final labels ‘charged off’ and ‘Fully Paid’ for my classification attempts.Also, remove Na values.

The resulting data was much more cleanly labled:

loan1=subset(loan1,loan1$loan_status=='Charged Off' | loan1$loan_status== 'Fully Paid')

table(loan1$loan_status)
## 
## Charged Off  Fully Paid 
##       70613      338329
table(loan1$loan_status)/length(loan1$loan_status)
## 
## Charged Off  Fully Paid 
##   0.1726724   0.8273276
dim(loan1)
## [1] 408942     19
str(loan1)
## 'data.frame':    408942 obs. of  19 variables:
##  $ loan_status        : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
##  $ grade              : chr  "A" "A" "B" "B" ...
##  $ term               : chr  " 36 months" " 36 months" " 36 months" " 36 months" ...
##  $ home_ownership     : chr  "MORTGAGE" "MORTGAGE" "OWN" "MORTGAGE" ...
##  $ verification_status: chr  "Not Verified" "Source Verified" "Verified" "Source Verified" ...
##  $ purpose            : chr  "debt_consolidation" "debt_consolidation" "debt_consolidation" "debt_consolidation" ...
##  $ addr_state         : chr  "TX" "CA" "MI" "CO" ...
##  $ loan_amnt          : int  12000 28000 27050 12000 27600 12000 3000 20800 7550 15000 ...
##  $ int_rate           : chr  "  7.62%" "  7.62%" " 10.99%" " 11.99%" ...
##  $ installment        : num  374 873 885 399 731 ...
##  $ emp_length         : chr  "3 years" "5 years" "10+ years" "10+ years" ...
##  $ annual_inc         : num  96500 325000 55000 130000 73000 60000 25000 81500 28000 98000 ...
##  $ dti                : num  12.6 18.6 22.9 13 23.1 ...
##  $ tot_cur_bal        : int  200314 799592 114834 327264 241609 7137 19530 23473 5759 13038 ...
##  $ inq_last_6mths     : int  0 1 0 1 1 1 0 2 0 2 ...
##  $ open_acc           : int  17 15 14 9 10 15 5 29 4 16 ...
##  $ revol_util         : chr  "55.7%" "54.6%" "61.2%" "67%" ...
##  $ issue_d            : chr  "Dec-2013" "Dec-2013" "Dec-2013" "Dec-2013" ...
##  $ earliest_cr_line   : chr  "Sep-2003" "Nov-1994" "Oct-1986" "Nov-1997" ...
summary(loan1)
##  loan_status           grade               term          
##  Length:408942      Length:408942      Length:408942     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  home_ownership     verification_status   purpose         
##  Length:408942      Length:408942       Length:408942     
##  Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character    Mode  :character  
##                                                           
##                                                           
##                                                           
##                                                           
##   addr_state          loan_amnt       int_rate          installment     
##  Length:408942      Min.   : 1000   Length:408942      Min.   :   4.93  
##  Class :character   1st Qu.: 8000   Class :character   1st Qu.: 264.74  
##  Mode  :character   Median :12275   Mode  :character   Median : 388.06  
##                     Mean   :14450                      Mean   : 441.55  
##                     3rd Qu.:20000                      3rd Qu.: 578.03  
##                     Max.   :35000                      Max.   :1409.99  
##                                                                         
##   emp_length          annual_inc           dti         tot_cur_bal     
##  Length:408942      Min.   :   3000   Min.   : 0.00   Min.   :      0  
##  Class :character   1st Qu.:  45000   1st Qu.:11.65   1st Qu.:  28208  
##  Mode  :character   Median :  63000   Median :17.15   Median :  80064  
##                     Mean   :  73456   Mean   :17.53   Mean   : 137869  
##                     3rd Qu.:  89000   3rd Qu.:23.15   3rd Qu.: 207710  
##                     Max.   :7500000   Max.   :39.99   Max.   :8000078  
##                                                       NA's   :27741    
##  inq_last_6mths      open_acc      revol_util          issue_d         
##  Min.   :0.0000   Min.   : 0.00   Length:408942      Length:408942     
##  1st Qu.:0.0000   1st Qu.: 8.00   Class :character   Class :character  
##  Median :0.0000   Median :11.00   Mode  :character   Mode  :character  
##  Mean   :0.7832   Mean   :11.34                                        
##  3rd Qu.:1.0000   3rd Qu.:14.00                                        
##  Max.   :8.0000   Max.   :84.00                                        
##                                                                        
##  earliest_cr_line  
##  Length:408942     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Data transformation

Further cleaning of the data involved conerting interest rates, revolving utility rates, employment length from factors to numerics. Employment length in years. Possible values are between 0 and 10 where 0 means n/a(unemployed), 0.5 means less than one year and 10 means ten or more years.removing empty factor levels, and removing annual income outliers.

#Drops the “%” character and converts the observations from “factors” to “numeric”
loan1$int_rate <- gsub("%", "", loan1$int_rate)
loan1$int_rate <- as.numeric(loan1$int_rate)
head(loan1$int_rate)
## [1]  7.62  7.62 10.99 11.99 19.97 10.99
loan1$revol_util <- gsub("%", "", loan1$revol_util)
loan1$revol_util <- as.numeric(loan1$revol_util)
head(loan1$revol_util)
## [1] 55.7 54.6 61.2 67.0 82.8 24.0
#Drops the “years” ,'+' character and symbols. Convert'< 1 year' to '0' and converts the observations from “factors” to “numeric”. Employment length in years. Possible values are between 0 and 10 where 0 means n/a(unemployed), 0.5 means less than one year and 10 means ten or more years..
loan1$emp_length <- gsub("n/a", "0", loan1$emp_length)
loan1$emp_length <- gsub("< 1 year", "0.5", loan1$emp_length)
loan1$emp_length <- gsub("years", "", loan1$emp_length)
loan1$emp_length <- gsub("year", "", loan1$emp_length)
loan1$emp_length <- gsub("\\+", "", loan1$emp_length)
loan1$emp_length <- as.numeric(loan1$emp_length)
head(loan1$emp_length)
## [1]  3  5 10 10  6  4
loan1$annual_inc=loan1$annual_inc/1000

loan1$verification_status=gsub('Source Verified','Verified',loan1$verification_status)
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
##         tot_cur_bal          revol_util         loan_status 
##               27741                 238                   0 
##               grade                term      home_ownership 
##                   0                   0                   0 
## verification_status             purpose          addr_state 
##                   0                   0                   0 
##           loan_amnt            int_rate         installment 
##                   0                   0                   0 
##          emp_length          annual_inc                 dti 
##                   0                   0                   0 
##      inq_last_6mths            open_acc             issue_d 
##                   0                   0                   0 
##    earliest_cr_line 
##                   0
dim(loan1)
## [1] 408942     19
loan1=na.omit(loan1)
dim(loan1)
## [1] 380994     19
summary(loan1)
##  loan_status           grade               term          
##  Length:380994      Length:380994      Length:380994     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  home_ownership     verification_status   purpose         
##  Length:380994      Length:380994       Length:380994     
##  Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character    Mode  :character  
##                                                           
##                                                           
##                                                           
##   addr_state          loan_amnt        int_rate      installment     
##  Length:380994      Min.   : 1000   Min.   : 6.00   Min.   :   4.93  
##  Class :character   1st Qu.: 8000   1st Qu.:10.99   1st Qu.: 266.65  
##  Mode  :character   Median :12500   Median :13.68   Median : 389.28  
##                     Mean   :14540   Mean   :13.98   Mean   : 443.73  
##                     3rd Qu.:20000   3rd Qu.:16.78   3rd Qu.: 581.06  
##                     Max.   :35000   Max.   :26.06   Max.   :1409.99  
##    emp_length       annual_inc          dti         tot_cur_bal     
##  Min.   : 0.000   Min.   :   3.0   Min.   : 0.00   Min.   :      0  
##  1st Qu.: 2.000   1st Qu.:  45.0   1st Qu.:11.74   1st Qu.:  28207  
##  Median : 6.000   Median :  64.0   Median :17.26   Median :  80044  
##  Mean   : 5.861   Mean   :  73.8   Mean   :17.65   Mean   : 137838  
##  3rd Qu.:10.000   3rd Qu.:  90.0   3rd Qu.:23.32   3rd Qu.: 207674  
##  Max.   :10.000   Max.   :7500.0   Max.   :39.99   Max.   :8000078  
##  inq_last_6mths      open_acc       revol_util      issue_d         
##  Min.   :0.0000   Min.   : 1.00   Min.   :  0.0   Length:380994     
##  1st Qu.:0.0000   1st Qu.: 8.00   1st Qu.: 40.2   Class :character  
##  Median :0.0000   Median :11.00   Median : 57.8   Mode  :character  
##  Mean   :0.7818   Mean   :11.42   Mean   : 56.7                     
##  3rd Qu.:1.0000   3rd Qu.:14.00   3rd Qu.: 74.5                     
##  Max.   :8.0000   Max.   :84.00   Max.   :892.3                     
##  earliest_cr_line  
##  Length:380994     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Feature Engineering

Several features of the data set were inherently related, lending themselves naturally to feature engineering. For example, I extract the loan issue year and the year of the borrower’s first credit line to calculate the length in years of each borrower’s credit history.

loan1$issue_d <- substr(loan1$issue_d, 5,8)



loan1$earliest_cr_line <- substr(loan1$earliest_cr_line, 5, 8)


loan1$credit_lth <- as.numeric(loan1$issue_d) - as.numeric(loan1$earliest_cr_line)

loan1 <- loan1[,c("loan_status","grade", "term","issue_d","home_ownership","verification_status","purpose","addr_state", "loan_amnt","int_rate",'installment',"emp_length",  "annual_inc", "dti",  'tot_cur_bal',"inq_last_6mths","open_acc", "revol_util",   "credit_lth")]

Data Visualization

annual_income

ggplot(data = loan1, aes(x=loan_status,y=annual_inc,fill=loan_status))+ geom_boxplot(outlier.size = 1)

ggplot(data=loan1,aes(x=annual_inc,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

outliers=boxplot(loan1$annual_inc,plot=FALSE)$out
loan1=loan1[-which(loan1$annual_inc %in% outliers),]
ggplot(data=loan1,aes(x=annual_inc,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = loan1, aes(x=loan_status,y=annual_inc,fill=loan_status))+ geom_boxplot(outlier.size = 1)

###int_rate To start examining the data, I began by investigating the distributions of each numeric feature via histograms, segmented out by loan outcome. The interest rate analysis was of particular interest. Examining the histogram, charged off loans have a more even distribution, tending towards higher interest rates more frequently than the fully paid ones do. This result makes intuitive sense, as higher interest rates are assigned to riskier investments.

ggplot(data = loan1, aes(x=loan_status,y=int_rate,fill=loan_status))+ geom_boxplot(outlier.size = 1)

ggplot(data=loan1,aes(x=int_rate,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

outliers=boxplot(loan1$int_rate,plot=FALSE)$out
loan1=loan1[-which(loan1$int_rate %in% outliers),]
summary(loan1$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   10.99   13.68   13.93   16.59   25.28

credit history

there’s no obvious difference in credit history between charged off group and Full paid group .

ggplot(data = loan1, aes(x=loan_status,y=credit_lth,fill=loan_status))+ geom_boxplot(outlier.size = 1)

ggplot(data=loan1,aes(x=credit_lth,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Income Verifiable

Next I check how many of the incomes are verifiable. From the barplot, There’s no obviously relationship between verification_status and loan_staus

ggplot(data=loan1,aes(x=verification_status,color=loan_status))+ geom_bar()+facet_wrap(~loan_status)+ggtitle('Is income verified?')

Data Structure

Now the data has 362730 records and 19 variables.

After data cleaning, feature selection and feature engineering, I omit the missing values and remove outliers of annual_inc and int_rate. Now the data has 362730 records and 19 variables.

There are 11 Numeric variables: ‘loan_amnt’, ‘int_rate’, ‘installment’, ‘emp_length’,‘annual_inc’,‘dti’,‘tot_cur_bal’,‘inq_last_6mths’,‘open_acc’,‘revol_util’,‘credit_lth’
There are 8 factor variables: ‘grade’,‘term’,‘home_ownership’,‘verification_status’,‘purpose’,‘addr_state’,‘loan_status’,‘issue_d’

After the first step of data cleaning, there’s no missing value. Due to the great sample size ,I’ll random choose 15000 experimental units for training model. In the original data, 17.32% of loan_status are ‘Charged Off’ and 82.67% of loan_status are ‘Fully Paid’. There are 17.03% of loan_status ‘Charged Off’ and 82.97% of loan_status ‘Fully Paid’. The percentage are approximately the same, so the sampling is reasonable.

loan1[1:8] <- lapply(loan1[1:8] , as.factor)
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
##         loan_status               grade                term 
##                   0                   0                   0 
##             issue_d      home_ownership verification_status 
##                   0                   0                   0 
##             purpose          addr_state           loan_amnt 
##                   0                   0                   0 
##            int_rate         installment          emp_length 
##                   0                   0                   0 
##          annual_inc                 dti         tot_cur_bal 
##                   0                   0                   0 
##      inq_last_6mths            open_acc          revol_util 
##                   0                   0                   0 
##          credit_lth 
##                   0
table(loan1$loan_status)
## 
## Charged Off  Fully Paid 
##       62830      299900
table(loan1$loan_status)/length(loan1$loan_status)
## 
## Charged Off  Fully Paid 
##   0.1732142   0.8267858
dim(loan1)
## [1] 362730     19
L=nrow(loan1);L
## [1] 362730
summary(loan1)
##       loan_status     grade              term        issue_d      
##  Charged Off: 62830   A: 53952    36 months:272016   2012: 24751  
##  Fully Paid :299900   B:108582    60 months: 90714   2013:128981  
##                       C:102153                       2014:208998  
##                       D: 60764                                    
##                       E: 27331                                    
##                       F:  9574                                    
##                       G:   374                                    
##   home_ownership     verification_status               purpose      
##  ANY     :     1   Not Verified:113442   debt_consolidation:219821  
##  MORTGAGE:183525   Verified    :249288   credit_card       : 86238  
##  NONE    :    42                         home_improvement  : 18886  
##  OTHER   :    44                         other             : 15996  
##  OWN     : 33361                         major_purchase    :  6083  
##  RENT    :145757                         small_business    :  3377  
##                                          (Other)           : 12329  
##    addr_state       loan_amnt        int_rate      installment     
##  CA     : 54358   Min.   : 1000   Min.   : 6.00   Min.   :   4.93  
##  NY     : 30165   1st Qu.: 8000   1st Qu.:10.99   1st Qu.: 262.28  
##  TX     : 28565   Median :12000   Median :13.68   Median : 381.04  
##  FL     : 24366   Mean   :14125   Mean   :13.93   Mean   : 430.88  
##  IL     : 14358   3rd Qu.:19600   3rd Qu.:16.59   3rd Qu.: 561.50  
##  NJ     : 13184   Max.   :35000   Max.   :25.28   Max.   :1396.79  
##  (Other):197734                                                    
##    emp_length       annual_inc          dti         tot_cur_bal     
##  Min.   : 0.000   Min.   :  3.00   Min.   : 0.00   Min.   :      0  
##  1st Qu.: 2.000   1st Qu.: 45.00   1st Qu.:12.02   1st Qu.:  27300  
##  Median : 6.000   Median : 61.20   Median :17.53   Median :  74135  
##  Mean   : 5.846   Mean   : 67.14   Mean   :17.88   Mean   : 126971  
##  3rd Qu.:10.000   3rd Qu.: 85.00   3rd Qu.:23.53   3rd Qu.: 197278  
##  Max.   :10.000   Max.   :157.50   Max.   :39.99   Max.   :1654949  
##                                                                     
##  inq_last_6mths      open_acc       revol_util       credit_lth   
##  Min.   :0.0000   Min.   : 1.00   Min.   :  0.00   Min.   : 3.00  
##  1st Qu.:0.0000   1st Qu.: 8.00   1st Qu.: 40.10   1st Qu.:11.00  
##  Median :0.0000   Median :11.00   Median : 57.60   Median :15.00  
##  Mean   :0.7675   Mean   :11.33   Mean   : 56.55   Mean   :16.05  
##  3rd Qu.:1.0000   3rd Qu.:14.00   3rd Qu.: 74.20   3rd Qu.:20.00  
##  Max.   :8.0000   Max.   :84.00   Max.   :892.30   Max.   :70.00  
## 
set.seed(99)
sample_num=sample(1:L,15000, replace = FALSE)
loan=loan1[sample_num,]
table(loan$loan_status)
## 
## Charged Off  Fully Paid 
##        2555       12445
table(loan$loan_status)/length(loan$loan_status)
## 
## Charged Off  Fully Paid 
##   0.1703333   0.8296667
loan_bup=loan #backup




sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
##         loan_status               grade                term 
##                   0                   0                   0 
##             issue_d      home_ownership verification_status 
##                   0                   0                   0 
##             purpose          addr_state           loan_amnt 
##                   0                   0                   0 
##            int_rate         installment          emp_length 
##                   0                   0                   0 
##          annual_inc                 dti         tot_cur_bal 
##                   0                   0                   0 
##      inq_last_6mths            open_acc          revol_util 
##                   0                   0                   0 
##          credit_lth 
##                   0

Split data, 75%(37500 units) for training and 25%(left) for test. The The percentage are approximately the same as the original data, so the sampling is reasonable.

15000*0.75
## [1] 11250
set.seed(5)

index_num=sample(1:15000,15000*0.75,replace = FALSE)
train=loan[index_num,]
test=loan[-index_num,]

table(train$loan_status)/length(train$loan_status)
## 
## Charged Off  Fully Paid 
##   0.1715556   0.8284444
table(test$loan_status)/length(test$loan_status)
## 
## Charged Off  Fully Paid 
##   0.1666667   0.8333333
missmap(loan)

#logistic regression The accuracy of logistic regression model for the train set is 0.8325333 and it for the test set is 0.8338667.

#Step 3 – training a model on the data
loan$loan_status=as.factor(loan$loan_status)
levels(loan$loan_status)
## [1] "Charged Off" "Fully Paid"
train=loan[index_num,]
test=loan[-index_num,]
lr_full <- glm(loan_status~., family=binomial(link='logit'), data=train)
summary(lr_full)
## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9143   0.3104   0.4945   0.6503   1.4738  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  4.552e+00  8.105e-01   5.617 1.94e-08 ***
## gradeB                      -5.915e-01  1.586e-01  -3.729 0.000192 ***
## gradeC                      -7.897e-01  2.134e-01  -3.701 0.000214 ***
## gradeD                      -8.702e-01  2.763e-01  -3.149 0.001638 ** 
## gradeE                      -1.149e+00  3.517e-01  -3.267 0.001086 ** 
## gradeF                      -8.567e-01  4.307e-01  -1.989 0.046703 *  
## gradeG                      -1.232e+00  6.916e-01  -1.781 0.074842 .  
## term 60 months              -7.789e-01  1.510e-01  -5.159 2.49e-07 ***
## issue_d2013                  2.029e-01  1.068e-01   1.900 0.057452 .  
## issue_d2014                 -1.295e-02  1.067e-01  -0.121 0.903467    
## home_ownershipNONE           1.185e+01  3.083e+02   0.038 0.969348    
## home_ownershipOTHER         -2.096e+00  1.489e+00  -1.407 0.159321    
## home_ownershipOWN           -5.687e-02  1.005e-01  -0.566 0.571543    
## home_ownershipRENT          -2.339e-01  7.258e-02  -3.223 0.001269 ** 
## verification_statusVerified -1.041e-01  6.756e-02  -1.540 0.123472    
## purposecredit_card          -7.426e-01  4.386e-01  -1.693 0.090445 .  
## purposedebt_consolidation   -7.642e-01  4.361e-01  -1.752 0.079721 .  
## purposehome_improvement     -8.128e-01  4.507e-01  -1.804 0.071298 .  
## purposehouse                -2.170e-01  6.612e-01  -0.328 0.742714    
## purposemajor_purchase       -4.198e-01  4.981e-01  -0.843 0.399333    
## purposemedical              -1.075e+00  4.987e-01  -2.155 0.031151 *  
## purposemoving               -9.867e-01  5.387e-01  -1.832 0.067001 .  
## purposeother                -6.629e-01  4.508e-01  -1.470 0.141452    
## purposerenewable_energy      1.126e+01  2.068e+02   0.054 0.956555    
## purposesmall_business       -9.267e-01  4.986e-01  -1.859 0.063064 .  
## purposevacation             -1.186e+00  5.270e-01  -2.251 0.024404 *  
## purposewedding              -2.563e-01  7.737e-01  -0.331 0.740480    
## addr_stateAL                -7.547e-01  6.620e-01  -1.140 0.254259    
## addr_stateAR                -4.155e-01  7.073e-01  -0.587 0.556891    
## addr_stateAZ                -8.570e-01  6.450e-01  -1.329 0.183945    
## addr_stateCA                -6.448e-01  6.276e-01  -1.028 0.304170    
## addr_stateCO                -1.833e-01  6.565e-01  -0.279 0.780036    
## addr_stateCT                -7.174e-01  6.610e-01  -1.085 0.277736    
## addr_stateDC                 7.310e-01  1.204e+00   0.607 0.543896    
## addr_stateDE                -1.505e+00  7.443e-01  -2.022 0.043177 *  
## addr_stateFL                -7.861e-01  6.310e-01  -1.246 0.212841    
## addr_stateGA                -6.295e-01  6.409e-01  -0.982 0.326025    
## addr_stateHI                -5.553e-01  7.205e-01  -0.771 0.440935    
## addr_stateIL                -6.977e-01  6.392e-01  -1.091 0.275070    
## addr_stateIN                -7.467e-01  6.560e-01  -1.138 0.255053    
## addr_stateKS                -1.476e-01  6.995e-01  -0.211 0.832885    
## addr_stateKY                -3.175e-01  6.867e-01  -0.462 0.643893    
## addr_stateLA                -4.634e-01  6.731e-01  -0.688 0.491176    
## addr_stateMA                -8.786e-01  6.443e-01  -1.364 0.172704    
## addr_stateMD                -9.156e-01  6.445e-01  -1.421 0.155439    
## addr_stateMI                -4.584e-01  6.485e-01  -0.707 0.479659    
## addr_stateMN                -8.671e-01  6.504e-01  -1.333 0.182511    
## addr_stateMO                -8.473e-01  6.513e-01  -1.301 0.193256    
## addr_stateMS                -6.638e-01  7.362e-01  -0.902 0.367230    
## addr_stateMT                -1.376e-01  8.316e-01  -0.166 0.868539    
## addr_stateNC                -7.637e-01  6.431e-01  -1.187 0.235049    
## addr_stateNH                -3.998e-02  7.469e-01  -0.054 0.957313    
## addr_stateNJ                -6.933e-01  6.380e-01  -1.087 0.277177    
## addr_stateNM                -1.161e+00  6.843e-01  -1.697 0.089678 .  
## addr_stateNV                -5.616e-01  6.584e-01  -0.853 0.393658    
## addr_stateNY                -7.336e-01  6.303e-01  -1.164 0.244468    
## addr_stateOH                -9.415e-01  6.385e-01  -1.474 0.140359    
## addr_stateOK                -1.688e+00  6.655e-01  -2.536 0.011210 *  
## addr_stateOR                -3.681e-03  6.844e-01  -0.005 0.995709    
## addr_statePA                -7.684e-01  6.388e-01  -1.203 0.229025    
## addr_stateRI                -1.014e+00  7.156e-01  -1.417 0.156472    
## addr_stateSC                -1.044e-01  6.815e-01  -0.153 0.878244    
## addr_stateSD                -8.294e-01  8.075e-01  -1.027 0.304322    
## addr_stateTN                -7.652e-01  6.580e-01  -1.163 0.244833    
## addr_stateTX                -6.774e-01  6.312e-01  -1.073 0.283211    
## addr_stateUT                 1.834e-01  7.454e-01   0.246 0.805699    
## addr_stateVA                -6.580e-01  6.416e-01  -1.025 0.305148    
## addr_stateVT                -1.392e+00  8.150e-01  -1.707 0.087752 .  
## addr_stateWA                -2.977e-01  6.539e-01  -0.455 0.648904    
## addr_stateWI                -4.881e-01  6.640e-01  -0.735 0.462303    
## addr_stateWV                -1.725e-01  7.491e-01  -0.230 0.817859    
## addr_stateWY                -7.504e-01  7.610e-01  -0.986 0.324077    
## loan_amnt                    2.396e-05  2.409e-05   0.995 0.319967    
## int_rate                    -4.103e-02  2.523e-02  -1.626 0.103895    
## installment                 -1.089e-03  7.600e-04  -1.433 0.151811    
## emp_length                   9.770e-03  7.318e-03   1.335 0.181834    
## annual_inc                   6.996e-03  1.358e-03   5.153 2.56e-07 ***
## dti                         -1.034e-02  3.843e-03  -2.691 0.007124 ** 
## tot_cur_bal                  3.156e-07  3.216e-07   0.981 0.326481    
## inq_last_6mths              -6.485e-02  2.606e-02  -2.489 0.012817 *  
## open_acc                    -6.139e-03  6.001e-03  -1.023 0.306362    
## revol_util                   1.952e-04  1.350e-03   0.145 0.885084    
## credit_lth                   2.446e-03  3.880e-03   0.630 0.528447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10312.7  on 11249  degrees of freedom
## Residual deviance:  9411.9  on 11167  degrees of freedom
## AIC: 9577.9
## 
## Number of Fisher Scoring iterations: 12
#Step4 eveluate the model performance

levels(test$loan_status)
## [1] "Charged Off" "Fully Paid"
pred=predict(lr_full, newdata=test,type = 'response')
pred=ifelse(pred>=0.5,1,0)
CrossTable(pred, test$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | test$loan_status 
##         pred | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##            0 |          14 |          21 |          35 | 
##              |      11.433 |       2.287 |             | 
##              |       0.400 |       0.600 |       0.009 | 
##              |       0.022 |       0.007 |             | 
##              |       0.004 |       0.006 |             | 
## -------------|-------------|-------------|-------------|
##            1 |         611 |        3104 |        3715 | 
##              |       0.108 |       0.022 |             | 
##              |       0.164 |       0.836 |       0.991 | 
##              |       0.978 |       0.993 |             | 
##              |       0.163 |       0.828 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         625 |        3125 |        3750 | 
##              |       0.167 |       0.833 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
accuracy_Lr <- table(pred, test[,1])
sum(diag(accuracy_Lr))/sum(accuracy_Lr)
## [1] 0.8314667
#Step 5 improve model performance
##based on the P value in the summary of model, choose these significant variables to build a new model
lr_redu=glm(loan_status~grade+term+issue_d+home_ownership+purpose+int_rate+ annual_inc+ dti+inq_last_6mths, family=binomial(link='logit'), data=train)
summary(lr_redu)
## 
## Call:
## glm(formula = loan_status ~ grade + term + issue_d + home_ownership + 
##     purpose + int_rate + annual_inc + dti + inq_last_6mths, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7204   0.3190   0.5036   0.6538   1.3334  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 3.982168   0.505866   7.872 3.49e-15 ***
## gradeB                     -0.577425   0.157709  -3.661 0.000251 ***
## gradeC                     -0.771565   0.211661  -3.645 0.000267 ***
## gradeD                     -0.851452   0.274001  -3.107 0.001887 ** 
## gradeE                     -1.148459   0.348601  -3.294 0.000986 ***
## gradeF                     -0.857709   0.425967  -2.014 0.044056 *  
## gradeG                     -1.270392   0.684177  -1.857 0.063337 .  
## term 60 months             -0.658922   0.063564 -10.366  < 2e-16 ***
## issue_d2013                 0.194893   0.105559   1.846 0.064850 .  
## issue_d2014                -0.014671   0.105070  -0.140 0.888950    
## home_ownershipNONE         11.859511 308.609115   0.038 0.969346    
## home_ownershipOTHER        -1.676228   1.421110  -1.180 0.238191    
## home_ownershipOWN          -0.111479   0.095536  -1.167 0.243259    
## home_ownershipRENT         -0.275465   0.057632  -4.780 1.76e-06 ***
## purposecredit_card         -0.790754   0.435340  -1.816 0.069308 .  
## purposedebt_consolidation  -0.811096   0.433230  -1.872 0.061178 .  
## purposehome_improvement    -0.837937   0.448179  -1.870 0.061533 .  
## purposehouse               -0.360780   0.656320  -0.550 0.582524    
## purposemajor_purchase      -0.436921   0.495028  -0.883 0.377442    
## purposemedical             -1.061080   0.495594  -2.141 0.032272 *  
## purposemoving              -0.892419   0.533951  -1.671 0.094652 .  
## purposeother               -0.654310   0.448337  -1.459 0.144451    
## purposerenewable_energy    11.192705 206.543908   0.054 0.956783    
## purposesmall_business      -0.949348   0.495590  -1.916 0.055417 .  
## purposevacation            -1.115311   0.523259  -2.131 0.033051 *  
## purposewedding             -0.318735   0.772775  -0.412 0.680006    
## int_rate                   -0.049442   0.024380  -2.028 0.042564 *  
## annual_inc                  0.006021   0.001031   5.838 5.29e-09 ***
## dti                        -0.012091   0.003399  -3.557 0.000374 ***
## inq_last_6mths             -0.070702   0.024978  -2.831 0.004647 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10313  on 11249  degrees of freedom
## Residual deviance:  9514  on 11220  degrees of freedom
## AIC: 9574
## 
## Number of Fisher Scoring iterations: 12
##check accuracy in train set
pred_train=predict(lr_redu, newdata=train,type = 'response')
pred_train=ifelse(pred_train>=0.5,1,0)
accuracy_Lrtrain <- table(pred_train, train[,1])
sum(diag(accuracy_Lrtrain))/sum(accuracy_Lrtrain)
## [1] 0.8283556
##check accuracy in test set
pred_new=predict(lr_redu, newdata=test,type = 'response')
pred_new=ifelse(pred_new>=0.5,1,0)
CrossTable(pred_new, test$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | test$loan_status 
##     pred_new | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##            0 |           6 |          10 |          16 | 
##              |       4.167 |       0.833 |             | 
##              |       0.375 |       0.625 |       0.004 | 
##              |       0.010 |       0.003 |             | 
##              |       0.002 |       0.003 |             | 
## -------------|-------------|-------------|-------------|
##            1 |         619 |        3115 |        3734 | 
##              |       0.018 |       0.004 |             | 
##              |       0.166 |       0.834 |       0.996 | 
##              |       0.990 |       0.997 |             | 
##              |       0.165 |       0.831 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         625 |        3125 |        3750 | 
##              |       0.167 |       0.833 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
accuracy_Lrnew <- table(pred_new, test[,1])
sum(diag(accuracy_Lrnew))/sum(accuracy_Lrnew)
## [1] 0.8322667

Random Forest

The accuracy of model random forest is 0.8344

##step3 train the model
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
lapply(train, class)
## $loan_status
## [1] "factor"
## 
## $grade
## [1] "factor"
## 
## $term
## [1] "factor"
## 
## $issue_d
## [1] "factor"
## 
## $home_ownership
## [1] "factor"
## 
## $verification_status
## [1] "factor"
## 
## $purpose
## [1] "factor"
## 
## $addr_state
## [1] "factor"
## 
## $loan_amnt
## [1] "integer"
## 
## $int_rate
## [1] "numeric"
## 
## $installment
## [1] "numeric"
## 
## $emp_length
## [1] "numeric"
## 
## $annual_inc
## [1] "numeric"
## 
## $dti
## [1] "numeric"
## 
## $tot_cur_bal
## [1] "integer"
## 
## $inq_last_6mths
## [1] "integer"
## 
## $open_acc
## [1] "integer"
## 
## $revol_util
## [1] "numeric"
## 
## $credit_lth
## [1] "numeric"
rf=randomForest(loan_status~., data=train, importance=TRUE)
plot(rf)

rf
## 
## Call:
##  randomForest(formula = loan_status ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 17.14%
## Confusion matrix:
##             Charged Off Fully Paid class.error
## Charged Off          38       1892 0.980310881
## Fully Paid           36       9284 0.003862661
##step4 Evaluate the model performance
rf_pred=predict(rf, newdata=test)
CrossTable(test$loan_status,rf_pred)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##                  | rf_pred 
## test$loan_status | Charged Off |  Fully Paid |   Row Total | 
## -----------------|-------------|-------------|-------------|
##      Charged Off |           9 |         616 |         625 | 
##                  |      19.048 |       0.071 |             | 
##                  |       0.014 |       0.986 |       0.167 | 
##                  |       0.643 |       0.165 |             | 
##                  |       0.002 |       0.164 |             | 
## -----------------|-------------|-------------|-------------|
##       Fully Paid |           5 |        3120 |        3125 | 
##                  |       3.810 |       0.014 |             | 
##                  |       0.002 |       0.998 |       0.833 | 
##                  |       0.357 |       0.835 |             | 
##                  |       0.001 |       0.832 |             | 
## -----------------|-------------|-------------|-------------|
##     Column Total |          14 |        3736 |        3750 | 
##                  |       0.004 |       0.996 |             | 
## -----------------|-------------|-------------|-------------|
## 
## 
accuracy=mean(rf_pred==test$loan_status);accuracy
## [1] 0.8344

using ranger

library(caret)
library(ranger)
## Warning: package 'ranger' was built under R version 3.5.3
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
##step3 train the model
rf <- ranger(loan_status~ ., data = train, num.threads = 4 )
summary(rf)
##                           Length Class         Mode     
## predictions               11250  factor        numeric  
## num.trees                     1  -none-        numeric  
## num.independent.variables     1  -none-        numeric  
## mtry                          1  -none-        numeric  
## min.node.size                 1  -none-        numeric  
## prediction.error              1  -none-        numeric  
## forest                       10  ranger.forest list     
## confusion.matrix              4  table         numeric  
## splitrule                     1  -none-        character
## treetype                      1  -none-        character
## call                          4  -none-        call     
## importance.mode               1  -none-        character
## num.samples                   1  -none-        numeric  
## replace                       1  -none-        logical
rf$confusion.matrix
##              predicted
## true          Charged Off Fully Paid
##   Charged Off          57       1873
##   Fully Paid           62       9258
## step4  evaluate the model performance
#training
rf_pred <- predict(rf, train)
confusionMatrix(data=rf_pred$predictions, train$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off        1930          0
##   Fully Paid            0       9320
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.8284     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.1716     
##          Detection Rate : 0.1716     
##    Detection Prevalence : 0.1716     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Charged Off
## 
rf_pred <- predict(rf, test)
confusionMatrix(data=rf_pred$predictions, test$loan_status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off          15         19
##   Fully Paid          610       3106
##                                           
##                Accuracy : 0.8323          
##                  95% CI : (0.8199, 0.8441)
##     No Information Rate : 0.8333          
##     P-Value [Acc > NIR] : 0.58            
##                                           
##                   Kappa : 0.0288          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.024000        
##             Specificity : 0.993920        
##          Pos Pred Value : 0.441176        
##          Neg Pred Value : 0.835845        
##              Prevalence : 0.166667        
##          Detection Rate : 0.004000        
##    Detection Prevalence : 0.009067        
##       Balanced Accuracy : 0.508960        
##                                           
##        'Positive' Class : Charged Off     
## 

Decision Tree c5.0

In this case, We use decision tree c5.0 model. From the plot, it is easily to see when we increase the trials, the misclassification Error is decreasing, until the trials equals 10, the misclassification Error reaches its least value. The accuracy of the decision tree classifier is 0.8213.

## Step 3: Training a model on the data ----
# regression tree using c50

library(C50)
## Warning: package 'C50' was built under R version 3.5.2
model <- C5.0(loan_status~., data = train )

# get basic information about the tree
model
## 
## Call:
## C5.0.formula(formula = loan_status ~ ., data = train)
## 
## Classification Tree
## Number of samples: 11250 
## Number of predictors: 18 
## 
## Tree size: 54 
## 
## Non-standard options: attempt to group attributes
# get more detailed information about the tree
summary(model)
## 
## Call:
## C5.0.formula(formula = loan_status ~ ., data = train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Apr 14 21:59:41 2019
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 11250 cases (19 attributes) from undefined.data
## 
## Decision tree:
## 
## term = 36 months: Fully Paid (8398/1110)
## term = 60 months:
## :...grade in {A,B,C,D}: Fully Paid (2080/510)
##     grade in {E,F,G}:
##     :...addr_state in {AK,AL,AR,DC,FL,GA,IA,ID,IN,KS,MA,ME,MO,NC,NE,NH,NJ,NV,
##         :              NY,SD,UT,VA,WY}: Fully Paid (309/102)
##         addr_state in {DE,MT,PA,RI,VT,WV}: Charged Off (41/17)
##         addr_state = CA:
##         :...credit_lth <= 11: Charged Off (31/10)
##         :   credit_lth > 11: Fully Paid (69/25)
##         addr_state = CO:
##         :...home_ownership in {ANY,MORTGAGE,NONE,OTHER,
##         :   :                  OWN}: Fully Paid (17/3)
##         :   home_ownership = RENT: Charged Off (5)
##         addr_state = CT:
##         :...annual_inc <= 91: Charged Off (2)
##         :   annual_inc > 91: Fully Paid (4)
##         addr_state = HI:
##         :...loan_amnt <= 14925: Charged Off (2)
##         :   loan_amnt > 14925: Fully Paid (2)
##         addr_state = IL:
##         :...verification_status = Not Verified: Charged Off (3)
##         :   verification_status = Verified: Fully Paid (16/7)
##         addr_state = KY:
##         :...inq_last_6mths <= 1: Charged Off (4/1)
##         :   inq_last_6mths > 1: Fully Paid (5)
##         addr_state = MN:
##         :...revol_util <= 74.2: Charged Off (7/1)
##         :   revol_util > 74.2: Fully Paid (4)
##         addr_state = MS:
##         :...emp_length <= 6: Charged Off (3)
##         :   emp_length > 6: Fully Paid (2)
##         addr_state = NM:
##         :...annual_inc <= 77.393: Charged Off (4)
##         :   annual_inc > 77.393: Fully Paid (2)
##         addr_state = OH:
##         :...emp_length <= 1: Charged Off (4)
##         :   emp_length > 1: Fully Paid (23/8)
##         addr_state = OK:
##         :...annual_inc <= 64.5: Charged Off (6)
##         :   annual_inc > 64.5: Fully Paid (6/2)
##         addr_state = OR:
##         :...annual_inc <= 37.5: Charged Off (2)
##         :   annual_inc > 37.5: Fully Paid (8)
##         addr_state = SC:
##         :...purpose in {car,credit_card,debt_consolidation,house,
##         :   :           major_purchase,medical,moving,other,renewable_energy,
##         :   :           small_business,vacation,wedding}: Fully Paid (13/4)
##         :   purpose = home_improvement: Charged Off (2)
##         addr_state = TN:
##         :...issue_d in {2012,2013}: Fully Paid (9/1)
##         :   issue_d = 2014: Charged Off (10/3)
##         addr_state = WI:
##         :...inq_last_6mths <= 1: Fully Paid (7)
##         :   inq_last_6mths > 1: Charged Off (5/1)
##         addr_state = AZ:
##         :...home_ownership in {ANY,NONE,OTHER,OWN,RENT}: Charged Off (6)
##         :   home_ownership = MORTGAGE:
##         :   :...credit_lth <= 14: Charged Off (4/1)
##         :       credit_lth > 14: Fully Paid (5)
##         addr_state = LA:
##         :...credit_lth <= 12: Charged Off (3)
##         :   credit_lth > 12:
##         :   :...dti <= 8.46: Charged Off (2)
##         :       dti > 8.46: Fully Paid (8)
##         addr_state = MI:
##         :...revol_util <= 66.6: Charged Off (4)
##         :   revol_util > 66.6:
##         :   :...inq_last_6mths <= 3: Fully Paid (10)
##         :       inq_last_6mths > 3: Charged Off (2)
##         addr_state = WA:
##         :...annual_inc <= 49.9: Charged Off (5)
##         :   annual_inc > 49.9:
##         :   :...inq_last_6mths <= 2: Fully Paid (19/3)
##         :       inq_last_6mths > 2: Charged Off (2)
##         addr_state = MD:
##         :...home_ownership in {ANY,NONE,OTHER,RENT}: Fully Paid (9)
##         :   home_ownership in {MORTGAGE,OWN}:
##         :   :...inq_last_6mths > 1: Fully Paid (3)
##         :       inq_last_6mths <= 1:
##         :       :...annual_inc <= 55: Fully Paid (4/1)
##         :           annual_inc > 55: Charged Off (7)
##         addr_state = TX:
##         :...verification_status = Not Verified: Fully Paid (3)
##             verification_status = Verified:
##             :...grade in {F,G}: Fully Paid (11/2)
##                 grade = E:
##                 :...annual_inc <= 75.79: Charged Off (17/3)
##                     annual_inc > 75.79: Fully Paid (21/6)
## 
## 
## Evaluation on training data (11250 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      54 1821(16.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     146  1784    (a): class Charged Off
##      37  9283    (b): class Fully Paid
## 
## 
##  Attribute usage:
## 
##  100.00% term
##   25.35% grade
##    6.86% addr_state
##    1.08% credit_lth
##    0.97% annual_inc
##    0.63% verification_status
##    0.60% inq_last_6mths
##    0.53% home_ownership
##    0.28% emp_length
##    0.24% revol_util
##    0.17% issue_d
##    0.13% purpose
##    0.09% dti
##    0.04% loan_amnt
## 
## 
## Time: 0.1 secs
## Step 4: Evaluate model performance ----

#check accuracy in train set
fitted.results <- predict(model, newdata = train)
head(fitted.results)
## [1] Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid
## Levels: Charged Off Fully Paid
misClasificError <- mean(fitted.results != train$loan_status, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.838133333333333"
# test
fitted.results <- predict(model, newdata = test)
head(fitted.results)
## [1] Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid
## Levels: Charged Off Fully Paid
misClasificError <- mean(fitted.results != test$loan_status, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.828533333333333"
CrossTable(test$loan_status, fitted.results)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##                  | fitted.results 
## test$loan_status | Charged Off |  Fully Paid |   Row Total | 
## -----------------|-------------|-------------|-------------|
##      Charged Off |          18 |         607 |         625 | 
##                  |       9.000 |       0.131 |             | 
##                  |       0.029 |       0.971 |       0.167 | 
##                  |       0.333 |       0.164 |             | 
##                  |       0.005 |       0.162 |             | 
## -----------------|-------------|-------------|-------------|
##       Fully Paid |          36 |        3089 |        3125 | 
##                  |       1.800 |       0.026 |             | 
##                  |       0.012 |       0.988 |       0.833 | 
##                  |       0.667 |       0.836 |             | 
##                  |       0.010 |       0.824 |             | 
## -----------------|-------------|-------------|-------------|
##     Column Total |          54 |        3696 |        3750 | 
##                  |       0.014 |       0.986 |             | 
## -----------------|-------------|-------------|-------------|
## 
## 
##step 5 improve the model performance
# Boosting the accuracy of decision trees
# boosted decision tree with n trials

misClasificError= function(n){
  model_trial <- C5.0(loan_status~., data = train, trials =n )
  fitted.results <- predict(model_trial, newdata = test)
  misClasificError <- mean(fitted.results != test$loan_status, na.rm=TRUE)
  return(misClasificError)
}
x=c(4,6,10,20,30,35,36,40,45)
testerror_rates <- sapply(x, FUN = misClasificError)
ggplot() +geom_point(aes(x=x,y=testerror_rates)) + geom_line(aes(x=x,y=testerror_rates)) + ylab("Misclassification Rate")

#From the plot, it is easily to see when we increase the trials, the misclassification Error is decreasing, until the trials equals 35, the misclassification Error reaches its least value. 
model_trial <- C5.0(loan_status~., data = train, trials =35 )

# get basic information about the tree
model_trial
## 
## Call:
## C5.0.formula(formula = loan_status ~ ., data = train, trials = 35)
## 
## Classification Tree
## Number of samples: 11250 
## Number of predictors: 18 
## 
## Number of boosting iterations: 35 requested;  11 used due to early stopping
## Average tree size: 34.4 
## 
## Non-standard options: attempt to group attributes
#check accuracy
fitted.results <- predict(model_trial, newdata = test)
misClasificErrorleast <- mean(fitted.results != test$loan_status, na.rm=TRUE)
print(paste('Maximum Accuracy',1-misClasificErrorleast))
## [1] "Maximum Accuracy 0.821333333333333"

KNN

when k=1, Accuracy 0.741066666666667
when k=13,Accuracy 0.826133333333333
when k=21, Accuracy 0.8312

When k=25, the model reaches its maximum Accuracy 0.832

# create normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
loanknn=loan
loan_n <- as.data.frame(lapply(loanknn[, 9:19], normalize))

trainknn=loan_n[index_num,]
testknn=loan_n[-index_num,]
trainlabel=loanknn[index_num,1]
testlabel=loanknn[-index_num,1]

# build a KNN model
library(class)
## Warning: package 'class' was built under R version 3.5.2
loan_test_pred <- knn(train = trainknn, test = testknn,
                      cl = trainlabel, k = 21)
## Step 4: Evaluating model performance ----

# Create the cross tabulation of predicted vs. actual
CrossTable(x = testlabel, y = loan_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | loan_test_pred 
##    testlabel | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##  Charged Off |          11 |         614 |         625 | 
##              |       0.018 |       0.982 |       0.167 | 
##              |       0.367 |       0.165 |             | 
##              |       0.003 |       0.164 |             | 
## -------------|-------------|-------------|-------------|
##   Fully Paid |          19 |        3106 |        3125 | 
##              |       0.006 |       0.994 |       0.833 | 
##              |       0.633 |       0.835 |             | 
##              |       0.005 |       0.828 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |          30 |        3720 |        3750 | 
##              |       0.008 |       0.992 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
misClasificError <- mean(loan_test_pred != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.8312"
## step 5 improve the model
loan_test_pred1 <- knn(train = trainknn, test = testknn,
                      cl = trainlabel, k = 1)
misClasificError <- mean(loan_test_pred1 != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.741066666666667"
loan_test_pred1 <- knn(train = trainknn, test = testknn,
                      cl = trainlabel, k = 13)
misClasificError <- mean(loan_test_pred1 != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.826133333333333"
# KNN error rate on test sample
knn_accuracy_rate <- function(n)  {
  y_hat <- knn(train = trainknn, test = testknn,
                      cl = trainlabel, k = n)
  misClasificError=mean(y_hat != testlabel,na.rm=TRUE)
  return(1-misClasificError)
}
a <- c(1:10,12,15,20,25,30,35,40)
Accuracy_rates <- sapply(a, FUN = knn_accuracy_rate )
Accuracy_rates
##  [1] 0.7410667 0.7341333 0.7906667 0.7973333 0.8082667 0.8125333 0.8170667
##  [8] 0.8168000 0.8202667 0.8224000 0.8256000 0.8282667 0.8309333 0.8320000
## [15] 0.8322667 0.8325333 0.8322667
ggplot() +geom_point(aes(x=a,y=Accuracy_rates)) + geom_line(aes(x=a,y=Accuracy_rates)) + ylab("Accuracy Rate")

# from the plot, we can see when k=25, the model reaches its maximum accuracy
loan_test_pred1 <- knn(train = trainknn, test = testknn,
                      cl = trainlabel, k = 25)
misClasificError <- mean(loan_test_pred1 != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.832"

Naives Bayes

choose all variables as predictors to build the model, the model accuracy is 78.32%. When trying to use laplace to improve the model, the performance of the model improve slightly. The accuracy of the model with laplace =3 is 78.35%. If only choose the factor predictors, the accuracy is 81.54%. Adding laplace =3 to the model, the accuracy is 81.6%.

##build model
library(e1071)
## Warning: package 'e1071' was built under R version 3.5.2
nb <- naiveBayes(train[,-1], train[,1])
pred=predict(nb, test[,-1])
CrossTable(pred, test$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | test$loan_status 
##         pred | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##  Charged Off |         188 |         376 |         564 | 
##              |      94.000 |      18.800 |             | 
##              |       0.333 |       0.667 |       0.150 | 
##              |       0.301 |       0.120 |             | 
##              |       0.050 |       0.100 |             | 
## -------------|-------------|-------------|-------------|
##   Fully Paid |         437 |        2749 |        3186 | 
##              |      16.640 |       3.328 |             | 
##              |       0.137 |       0.863 |       0.850 | 
##              |       0.699 |       0.880 |             | 
##              |       0.117 |       0.733 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         625 |        3125 |        3750 | 
##              |       0.167 |       0.833 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
##evaluate the model performance
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.7832
##improve the model performance
#try to use laplace to improve the model
nb <- naiveBayes(train[,-1], train[,1],laplace = 3)
pred=predict(nb, test[,-1])
CrossTable(pred, test$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | test$loan_status 
##         pred | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##  Charged Off |         184 |         371 |         555 | 
##              |      90.511 |      18.102 |             | 
##              |       0.332 |       0.668 |       0.148 | 
##              |       0.294 |       0.119 |             | 
##              |       0.049 |       0.099 |             | 
## -------------|-------------|-------------|-------------|
##   Fully Paid |         441 |        2754 |        3195 | 
##              |      15.723 |       3.145 |             | 
##              |       0.138 |       0.862 |       0.852 | 
##              |       0.706 |       0.881 |             | 
##              |       0.118 |       0.734 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         625 |        3125 |        3750 | 
##              |       0.167 |       0.833 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.7834667
#if only choose the factor predictors, the accuracy is 81.54
nb=naiveBayes(train[,2:9], train[,1])
pred=predict(nb, test[,2:9])
CrossTable(pred, test$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | test$loan_status 
##         pred | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##  Charged Off |          81 |         148 |         229 | 
##              |      48.071 |       9.614 |             | 
##              |       0.354 |       0.646 |       0.061 | 
##              |       0.130 |       0.047 |             | 
##              |       0.022 |       0.039 |             | 
## -------------|-------------|-------------|-------------|
##   Fully Paid |         544 |        2977 |        3521 | 
##              |       3.126 |       0.625 |             | 
##              |       0.155 |       0.845 |       0.939 | 
##              |       0.870 |       0.953 |             | 
##              |       0.145 |       0.794 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         625 |        3125 |        3750 | 
##              |       0.167 |       0.833 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.8154667
#adding laplace
nb=naiveBayes(train[,2:9], train[,1],laplace=3)
pred=predict(nb, test[,2:9])
CrossTable(pred, test$loan_status)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3750 
## 
##  
##              | test$loan_status 
##         pred | Charged Off |  Fully Paid |   Row Total | 
## -------------|-------------|-------------|-------------|
##  Charged Off |          78 |         143 |         221 | 
##              |      46.010 |       9.202 |             | 
##              |       0.353 |       0.647 |       0.059 | 
##              |       0.125 |       0.046 |             | 
##              |       0.021 |       0.038 |             | 
## -------------|-------------|-------------|-------------|
##   Fully Paid |         547 |        2982 |        3529 | 
##              |       2.881 |       0.576 |             | 
##              |       0.155 |       0.845 |       0.941 | 
##              |       0.875 |       0.954 |             | 
##              |       0.146 |       0.795 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |         625 |        3125 |        3750 | 
##              |       0.167 |       0.833 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.816