In this project, I build machine-learned models trained on LendingClub (a leading P2P lending platform) historical loan data from 2012-2014 that help investors quantify credit risks using R. The classifier, predicting whether a given loan will be fully paid or not, achieves accuracy of 0.8344.
The purpose of this project is to explore data from Lending Club (LC), a peer-to-peer lending site. LC has made data from funded loans publicly available and accessible in CSV formatted files. This data is incredibly rich, and personal financial data is rare. Additionally, a model predicting defaults could help to instruct the direction of a lending portfolio on the site.I downloaded the Loan Data from 2012 to 2014, a file containing 423814 records. Each record is a distinct loan made on the Lending Club platform. For each loan, 145 characteristics are recorded in the table. These characteristics can largely be divided into two groups - features of the loan, and features of the borrower.
The loan features are the basic stats one might expect, such as the loan amount, the interest rate, and the term of the loan. The borrower features, which comprise by far the majority of the dataset, are the characteristics of the borrowers that Lending Club has deemed important to collect. Such features include employment length, credit history, and public default histories.
After data cleaning, feature selection and feature engineering, I omit the missing values and remove outliers of annual_inc and int_rate. Now the data has 362730 records and 19 variables.
There are 11 Numeric variables: ‘loan_amnt’, ‘int_rate’, ‘installment’, ‘emp_length’,‘annual_inc’,‘dti’,‘tot_cur_bal’,‘inq_last_6mths’,‘open_acc’,‘revol_util’,‘credit_lth’
There are 8 factor variables: ‘grade’,‘term’,‘home_ownership’,‘verification_status’,‘purpose’,‘addr_state’,‘loan_status’,‘issue_d’
Due to the great sample size ,I’ll random choose 15000 experimental units for training model. In the original data, 17.32% of loan_status are ‘Charged Off’ and 82.67% of loan_status are ‘Fully Paid’. There are 17.03% of loan_status ‘Charged Off’ and 82.97% of loan_status ‘Fully Paid’. The percentage are approximately the same, so the sampling is reasonable.Spliting data, 75%(37500 units) for training and 25%(left) for test. The The percentage are approximately the same as the original data, so the sampling is reasonable.
The accuracy of logistic regression model for the train set is 0.8325333 and it for the test set is 0.8338667.
The accuracy of model random forest is 0.8344.
We use decision tree c5.0 model. From the plot, it is easily to see when we increase the trials, the misclassification Error is decreasing, until the trials equals 10, the misclassification Error reaches its least value. The accuracy of the decision tree classifier is 0.8213.
when k=1, Accuracy 0.741066666666667 when k=13,Accuracy 0.826133333333333 when k=21, Accuracy 0.8312 When k=25, the model reaches its maximum Accuracy 0.832.
choose all variables as predictors to build the model, the model accuracy is 78.32%. When trying to use laplace to improve the model, the performance of the model improve slightly. The accuracy of the model with laplace =3 is 78.35%. If only choose the factor predictors, the accuracy is 81.54%. Adding laplace =3 to the model, the accuracy is 81.6%.
library(ggplot2)
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.5.2
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
library(mosaic)
## Warning: package 'mosaic' was built under R version 3.5.2
## Loading required package: ggformula
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Warning: package 'mosaicData' was built under R version 3.5.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following object is masked from 'package:purrr':
##
## cross
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following object is masked from 'package:caret':
##
## dotPlot
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median,
## prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:mosaic':
##
## do
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.5.2
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.5.2
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(janitor)
## Warning: package 'janitor' was built under R version 3.5.2
library(readxl)
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
##
## Attaching package: 'car'
## The following objects are masked from 'package:mosaic':
##
## deltaMethod, logit
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.5.2
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.5.2
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
Merge the data together into one dataframe. Make a table of the variable names. Make a table of the number of rows and columns in the dataframe.
here’s the code merge data and save data into Rds.
L12_13 = data.frame(read.csv(“LoanStats3b.csv”, stringsAsFactors = FALSE, header = T, skip = 1)) L14 = data.frame(read.csv(“LoanStats3c.csv”, stringsAsFactors = FALSE, header = T, skip = 1)) #read the file loan1 = rbind(L12_13,L14) # merge 2012-2014 into one dataframe saveRDS(loan1,file = ‘loan.Rds’)
loan1=readRDS('loan.Rds')
head(loan1)
variables1=data.frame(Variables=names(loan1));variables1 #create a table of variables
length1=data.frame(row=nrow(loan1),column=ncol(loan1));length1 #create a table of the number of rows and columns in the dataframe
Data Cleaning & Data Transformation & data structure & feature visualization ## Data cleaning## Notes: Summariz missing value; delete column with over 50% missing vales; 210000 is minimum number of non-NA values. After removing those variables with great amount missing values. There’s 87 variables left.
423814*0.5
## [1] 211907
loan1=loan1[!sapply(loan1, function (x) sum(is.na(x) | x == "")>210000)];
head(loan1)
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
## mths_since_recent_inq mo_sin_old_il_acct
## 49564 41047
## num_tl_120dpd_2m pct_tl_nvr_dlq
## 35861 27898
## avg_cur_bal mo_sin_old_rev_tl_op
## 27757 27746
## mo_sin_rcnt_rev_tl_op tot_coll_amt
## 27746 27745
## tot_cur_bal total_rev_hi_lim
## 27745 27745
## mo_sin_rcnt_tl num_accts_ever_120_pd
## 27745 27745
## num_actv_bc_tl num_actv_rev_tl
## 27745 27745
## num_bc_tl num_il_tl
## 27745 27745
## num_op_rev_tl num_rev_accts
## 27745 27745
## num_rev_tl_bal_gt_0 num_tl_30dpd
## 27745 27745
## num_tl_90g_dpd_24m num_tl_op_past_12m
## 27745 27745
## tot_hi_cred_lim total_il_high_credit_limit
## 27745 27745
## num_bc_sats num_sats
## 16059 16059
## bc_util percent_bc_gt_75
## 11727 11589
## bc_open_to_buy mths_since_recent_bc
## 11474 11078
## acc_open_past_24mths mort_acc
## 7499 7499
## total_bal_ex_mort total_bc_limit
## 7499 7499
## loan_amnt funded_amnt
## 4 4
## funded_amnt_inv installment
## 4 4
## annual_inc dti
## 4 4
## delinq_2yrs inq_last_6mths
## 4 4
## open_acc pub_rec
## 4 4
## revol_bal total_acc
## 4 4
## out_prncp out_prncp_inv
## 4 4
## total_pymnt total_pymnt_inv
## 4 4
## total_rec_prncp total_rec_int
## 4 4
## total_rec_late_fee recoveries
## 4 4
## collection_recovery_fee last_pymnt_amnt
## 4 4
## collections_12_mths_ex_med policy_code
## 4 4
## acc_now_delinq chargeoff_within_12_mths
## 4 4
## delinq_amnt pub_rec_bankruptcies
## 4 4
## tax_liens emp_title
## 4 3
## term int_rate
## 0 0
## grade sub_grade
## 0 0
## emp_length home_ownership
## 0 0
## verification_status issue_d
## 0 0
## loan_status pymnt_plan
## 0 0
## purpose title
## 0 0
## zip_code addr_state
## 0 0
## earliest_cr_line revol_util
## 0 0
## initial_list_status last_pymnt_d
## 0 0
## last_credit_pull_d application_type
## 0 0
## hardship_flag disbursement_method
## 0 0
## debt_settlement_flag
## 0
dim(loan1)
## [1] 423814 87
Note: Remove irrelevant and Redundant Features. some of them are not related to our target variable and some of them are redundant. Finally, 19 relevant variables are chosen. And convert the chracter variable to be factor.
loan1 <- loan1[,c("loan_status","grade", "term","home_ownership","verification_status","purpose","addr_state", "loan_amnt","int_rate",'installment',"emp_length", "annual_inc", "dti", 'tot_cur_bal',"inq_last_6mths","open_acc", "revol_util", "issue_d", "earliest_cr_line")]
There are 7 labels for loan_status: “Charged Off”, “Current” , “Default” ,“Fully Paid”, “In Grace Period” , “Late (16-30 days)” , “Late (31-120 days)”. To consolidate these labels, I took several steps. First, I removed all the late loans, as these fall into a certain grey area with ambiguous, undetermined final statuses. I’ll select data with two final labels ‘charged off’ and ‘Fully Paid’ for my classification attempts.Also, remove Na values.
The resulting data was much more cleanly labled:
loan1=subset(loan1,loan1$loan_status=='Charged Off' | loan1$loan_status== 'Fully Paid')
table(loan1$loan_status)
##
## Charged Off Fully Paid
## 70613 338329
table(loan1$loan_status)/length(loan1$loan_status)
##
## Charged Off Fully Paid
## 0.1726724 0.8273276
dim(loan1)
## [1] 408942 19
str(loan1)
## 'data.frame': 408942 obs. of 19 variables:
## $ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
## $ grade : chr "A" "A" "B" "B" ...
## $ term : chr " 36 months" " 36 months" " 36 months" " 36 months" ...
## $ home_ownership : chr "MORTGAGE" "MORTGAGE" "OWN" "MORTGAGE" ...
## $ verification_status: chr "Not Verified" "Source Verified" "Verified" "Source Verified" ...
## $ purpose : chr "debt_consolidation" "debt_consolidation" "debt_consolidation" "debt_consolidation" ...
## $ addr_state : chr "TX" "CA" "MI" "CO" ...
## $ loan_amnt : int 12000 28000 27050 12000 27600 12000 3000 20800 7550 15000 ...
## $ int_rate : chr " 7.62%" " 7.62%" " 10.99%" " 11.99%" ...
## $ installment : num 374 873 885 399 731 ...
## $ emp_length : chr "3 years" "5 years" "10+ years" "10+ years" ...
## $ annual_inc : num 96500 325000 55000 130000 73000 60000 25000 81500 28000 98000 ...
## $ dti : num 12.6 18.6 22.9 13 23.1 ...
## $ tot_cur_bal : int 200314 799592 114834 327264 241609 7137 19530 23473 5759 13038 ...
## $ inq_last_6mths : int 0 1 0 1 1 1 0 2 0 2 ...
## $ open_acc : int 17 15 14 9 10 15 5 29 4 16 ...
## $ revol_util : chr "55.7%" "54.6%" "61.2%" "67%" ...
## $ issue_d : chr "Dec-2013" "Dec-2013" "Dec-2013" "Dec-2013" ...
## $ earliest_cr_line : chr "Sep-2003" "Nov-1994" "Oct-1986" "Nov-1997" ...
summary(loan1)
## loan_status grade term
## Length:408942 Length:408942 Length:408942
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## home_ownership verification_status purpose
## Length:408942 Length:408942 Length:408942
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## addr_state loan_amnt int_rate installment
## Length:408942 Min. : 1000 Length:408942 Min. : 4.93
## Class :character 1st Qu.: 8000 Class :character 1st Qu.: 264.74
## Mode :character Median :12275 Mode :character Median : 388.06
## Mean :14450 Mean : 441.55
## 3rd Qu.:20000 3rd Qu.: 578.03
## Max. :35000 Max. :1409.99
##
## emp_length annual_inc dti tot_cur_bal
## Length:408942 Min. : 3000 Min. : 0.00 Min. : 0
## Class :character 1st Qu.: 45000 1st Qu.:11.65 1st Qu.: 28208
## Mode :character Median : 63000 Median :17.15 Median : 80064
## Mean : 73456 Mean :17.53 Mean : 137869
## 3rd Qu.: 89000 3rd Qu.:23.15 3rd Qu.: 207710
## Max. :7500000 Max. :39.99 Max. :8000078
## NA's :27741
## inq_last_6mths open_acc revol_util issue_d
## Min. :0.0000 Min. : 0.00 Length:408942 Length:408942
## 1st Qu.:0.0000 1st Qu.: 8.00 Class :character Class :character
## Median :0.0000 Median :11.00 Mode :character Mode :character
## Mean :0.7832 Mean :11.34
## 3rd Qu.:1.0000 3rd Qu.:14.00
## Max. :8.0000 Max. :84.00
##
## earliest_cr_line
## Length:408942
## Class :character
## Mode :character
##
##
##
##
Further cleaning of the data involved conerting interest rates, revolving utility rates, employment length from factors to numerics. Employment length in years. Possible values are between 0 and 10 where 0 means n/a(unemployed), 0.5 means less than one year and 10 means ten or more years.removing empty factor levels, and removing annual income outliers.
#Drops the â%â character and converts the observations from âfactorsâ to ânumericâ
loan1$int_rate <- gsub("%", "", loan1$int_rate)
loan1$int_rate <- as.numeric(loan1$int_rate)
head(loan1$int_rate)
## [1] 7.62 7.62 10.99 11.99 19.97 10.99
loan1$revol_util <- gsub("%", "", loan1$revol_util)
loan1$revol_util <- as.numeric(loan1$revol_util)
head(loan1$revol_util)
## [1] 55.7 54.6 61.2 67.0 82.8 24.0
#Drops the âyearsâ ,'+' character and symbols. Convert'< 1 year' to '0' and converts the observations from âfactorsâ to ânumericâ. Employment length in years. Possible values are between 0 and 10 where 0 means n/a(unemployed), 0.5 means less than one year and 10 means ten or more years..
loan1$emp_length <- gsub("n/a", "0", loan1$emp_length)
loan1$emp_length <- gsub("< 1 year", "0.5", loan1$emp_length)
loan1$emp_length <- gsub("years", "", loan1$emp_length)
loan1$emp_length <- gsub("year", "", loan1$emp_length)
loan1$emp_length <- gsub("\\+", "", loan1$emp_length)
loan1$emp_length <- as.numeric(loan1$emp_length)
head(loan1$emp_length)
## [1] 3 5 10 10 6 4
loan1$annual_inc=loan1$annual_inc/1000
loan1$verification_status=gsub('Source Verified','Verified',loan1$verification_status)
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
## tot_cur_bal revol_util loan_status
## 27741 238 0
## grade term home_ownership
## 0 0 0
## verification_status purpose addr_state
## 0 0 0
## loan_amnt int_rate installment
## 0 0 0
## emp_length annual_inc dti
## 0 0 0
## inq_last_6mths open_acc issue_d
## 0 0 0
## earliest_cr_line
## 0
dim(loan1)
## [1] 408942 19
loan1=na.omit(loan1)
dim(loan1)
## [1] 380994 19
summary(loan1)
## loan_status grade term
## Length:380994 Length:380994 Length:380994
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## home_ownership verification_status purpose
## Length:380994 Length:380994 Length:380994
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## addr_state loan_amnt int_rate installment
## Length:380994 Min. : 1000 Min. : 6.00 Min. : 4.93
## Class :character 1st Qu.: 8000 1st Qu.:10.99 1st Qu.: 266.65
## Mode :character Median :12500 Median :13.68 Median : 389.28
## Mean :14540 Mean :13.98 Mean : 443.73
## 3rd Qu.:20000 3rd Qu.:16.78 3rd Qu.: 581.06
## Max. :35000 Max. :26.06 Max. :1409.99
## emp_length annual_inc dti tot_cur_bal
## Min. : 0.000 Min. : 3.0 Min. : 0.00 Min. : 0
## 1st Qu.: 2.000 1st Qu.: 45.0 1st Qu.:11.74 1st Qu.: 28207
## Median : 6.000 Median : 64.0 Median :17.26 Median : 80044
## Mean : 5.861 Mean : 73.8 Mean :17.65 Mean : 137838
## 3rd Qu.:10.000 3rd Qu.: 90.0 3rd Qu.:23.32 3rd Qu.: 207674
## Max. :10.000 Max. :7500.0 Max. :39.99 Max. :8000078
## inq_last_6mths open_acc revol_util issue_d
## Min. :0.0000 Min. : 1.00 Min. : 0.0 Length:380994
## 1st Qu.:0.0000 1st Qu.: 8.00 1st Qu.: 40.2 Class :character
## Median :0.0000 Median :11.00 Median : 57.8 Mode :character
## Mean :0.7818 Mean :11.42 Mean : 56.7
## 3rd Qu.:1.0000 3rd Qu.:14.00 3rd Qu.: 74.5
## Max. :8.0000 Max. :84.00 Max. :892.3
## earliest_cr_line
## Length:380994
## Class :character
## Mode :character
##
##
##
Several features of the data set were inherently related, lending themselves naturally to feature engineering. For example, I extract the loan issue year and the year of the borrowerâs first credit line to calculate the length in years of each borrowerâs credit history.
loan1$issue_d <- substr(loan1$issue_d, 5,8)
loan1$earliest_cr_line <- substr(loan1$earliest_cr_line, 5, 8)
loan1$credit_lth <- as.numeric(loan1$issue_d) - as.numeric(loan1$earliest_cr_line)
loan1 <- loan1[,c("loan_status","grade", "term","issue_d","home_ownership","verification_status","purpose","addr_state", "loan_amnt","int_rate",'installment',"emp_length", "annual_inc", "dti", 'tot_cur_bal',"inq_last_6mths","open_acc", "revol_util", "credit_lth")]
ggplot(data = loan1, aes(x=loan_status,y=annual_inc,fill=loan_status))+ geom_boxplot(outlier.size = 1)
ggplot(data=loan1,aes(x=annual_inc,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
outliers=boxplot(loan1$annual_inc,plot=FALSE)$out
loan1=loan1[-which(loan1$annual_inc %in% outliers),]
ggplot(data=loan1,aes(x=annual_inc,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = loan1, aes(x=loan_status,y=annual_inc,fill=loan_status))+ geom_boxplot(outlier.size = 1)
###int_rate To start examining the data, I began by investigating the distributions of each numeric feature via histograms, segmented out by loan outcome. The interest rate analysis was of particular interest. Examining the histogram, charged off loans have a more even distribution, tending towards higher interest rates more frequently than the fully paid ones do. This result makes intuitive sense, as higher interest rates are assigned to riskier investments.
ggplot(data = loan1, aes(x=loan_status,y=int_rate,fill=loan_status))+ geom_boxplot(outlier.size = 1)
ggplot(data=loan1,aes(x=int_rate,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
outliers=boxplot(loan1$int_rate,plot=FALSE)$out
loan1=loan1[-which(loan1$int_rate %in% outliers),]
summary(loan1$int_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 10.99 13.68 13.93 16.59 25.28
there’s no obvious difference in credit history between charged off group and Full paid group .
ggplot(data = loan1, aes(x=loan_status,y=credit_lth,fill=loan_status))+ geom_boxplot(outlier.size = 1)
ggplot(data=loan1,aes(x=credit_lth,color=loan_status))+ geom_histogram()+facet_wrap(~loan_status)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next I check how many of the incomes are verifiable. From the barplot, There’s no obviously relationship between verification_status and loan_staus
ggplot(data=loan1,aes(x=verification_status,color=loan_status))+ geom_bar()+facet_wrap(~loan_status)+ggtitle('Is income verified?')
Now the data has 362730 records and 19 variables.
After data cleaning, feature selection and feature engineering, I omit the missing values and remove outliers of annual_inc and int_rate. Now the data has 362730 records and 19 variables.
There are 11 Numeric variables: ‘loan_amnt’, ‘int_rate’, ‘installment’, ‘emp_length’,‘annual_inc’,‘dti’,‘tot_cur_bal’,‘inq_last_6mths’,‘open_acc’,‘revol_util’,‘credit_lth’
There are 8 factor variables: ‘grade’,‘term’,‘home_ownership’,‘verification_status’,‘purpose’,‘addr_state’,‘loan_status’,‘issue_d’
After the first step of data cleaning, there’s no missing value. Due to the great sample size ,I’ll random choose 15000 experimental units for training model. In the original data, 17.32% of loan_status are ‘Charged Off’ and 82.67% of loan_status are ‘Fully Paid’. There are 17.03% of loan_status ‘Charged Off’ and 82.97% of loan_status ‘Fully Paid’. The percentage are approximately the same, so the sampling is reasonable.
loan1[1:8] <- lapply(loan1[1:8] , as.factor)
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
## loan_status grade term
## 0 0 0
## issue_d home_ownership verification_status
## 0 0 0
## purpose addr_state loan_amnt
## 0 0 0
## int_rate installment emp_length
## 0 0 0
## annual_inc dti tot_cur_bal
## 0 0 0
## inq_last_6mths open_acc revol_util
## 0 0 0
## credit_lth
## 0
table(loan1$loan_status)
##
## Charged Off Fully Paid
## 62830 299900
table(loan1$loan_status)/length(loan1$loan_status)
##
## Charged Off Fully Paid
## 0.1732142 0.8267858
dim(loan1)
## [1] 362730 19
L=nrow(loan1);L
## [1] 362730
summary(loan1)
## loan_status grade term issue_d
## Charged Off: 62830 A: 53952 36 months:272016 2012: 24751
## Fully Paid :299900 B:108582 60 months: 90714 2013:128981
## C:102153 2014:208998
## D: 60764
## E: 27331
## F: 9574
## G: 374
## home_ownership verification_status purpose
## ANY : 1 Not Verified:113442 debt_consolidation:219821
## MORTGAGE:183525 Verified :249288 credit_card : 86238
## NONE : 42 home_improvement : 18886
## OTHER : 44 other : 15996
## OWN : 33361 major_purchase : 6083
## RENT :145757 small_business : 3377
## (Other) : 12329
## addr_state loan_amnt int_rate installment
## CA : 54358 Min. : 1000 Min. : 6.00 Min. : 4.93
## NY : 30165 1st Qu.: 8000 1st Qu.:10.99 1st Qu.: 262.28
## TX : 28565 Median :12000 Median :13.68 Median : 381.04
## FL : 24366 Mean :14125 Mean :13.93 Mean : 430.88
## IL : 14358 3rd Qu.:19600 3rd Qu.:16.59 3rd Qu.: 561.50
## NJ : 13184 Max. :35000 Max. :25.28 Max. :1396.79
## (Other):197734
## emp_length annual_inc dti tot_cur_bal
## Min. : 0.000 Min. : 3.00 Min. : 0.00 Min. : 0
## 1st Qu.: 2.000 1st Qu.: 45.00 1st Qu.:12.02 1st Qu.: 27300
## Median : 6.000 Median : 61.20 Median :17.53 Median : 74135
## Mean : 5.846 Mean : 67.14 Mean :17.88 Mean : 126971
## 3rd Qu.:10.000 3rd Qu.: 85.00 3rd Qu.:23.53 3rd Qu.: 197278
## Max. :10.000 Max. :157.50 Max. :39.99 Max. :1654949
##
## inq_last_6mths open_acc revol_util credit_lth
## Min. :0.0000 Min. : 1.00 Min. : 0.00 Min. : 3.00
## 1st Qu.:0.0000 1st Qu.: 8.00 1st Qu.: 40.10 1st Qu.:11.00
## Median :0.0000 Median :11.00 Median : 57.60 Median :15.00
## Mean :0.7675 Mean :11.33 Mean : 56.55 Mean :16.05
## 3rd Qu.:1.0000 3rd Qu.:14.00 3rd Qu.: 74.20 3rd Qu.:20.00
## Max. :8.0000 Max. :84.00 Max. :892.30 Max. :70.00
##
set.seed(99)
sample_num=sample(1:L,15000, replace = FALSE)
loan=loan1[sample_num,]
table(loan$loan_status)
##
## Charged Off Fully Paid
## 2555 12445
table(loan$loan_status)/length(loan$loan_status)
##
## Charged Off Fully Paid
## 0.1703333 0.8296667
loan_bup=loan #backup
sort(sapply(loan1,function(k) sum(is.na(k))),decreasing = T)
## loan_status grade term
## 0 0 0
## issue_d home_ownership verification_status
## 0 0 0
## purpose addr_state loan_amnt
## 0 0 0
## int_rate installment emp_length
## 0 0 0
## annual_inc dti tot_cur_bal
## 0 0 0
## inq_last_6mths open_acc revol_util
## 0 0 0
## credit_lth
## 0
15000*0.75
## [1] 11250
set.seed(5)
index_num=sample(1:15000,15000*0.75,replace = FALSE)
train=loan[index_num,]
test=loan[-index_num,]
table(train$loan_status)/length(train$loan_status)
##
## Charged Off Fully Paid
## 0.1715556 0.8284444
table(test$loan_status)/length(test$loan_status)
##
## Charged Off Fully Paid
## 0.1666667 0.8333333
missmap(loan)
#logistic regression The accuracy of logistic regression model for the train set is 0.8325333 and it for the test set is 0.8338667.
#Step 3 â training a model on the data
loan$loan_status=as.factor(loan$loan_status)
levels(loan$loan_status)
## [1] "Charged Off" "Fully Paid"
train=loan[index_num,]
test=loan[-index_num,]
lr_full <- glm(loan_status~., family=binomial(link='logit'), data=train)
summary(lr_full)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9143 0.3104 0.4945 0.6503 1.4738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.552e+00 8.105e-01 5.617 1.94e-08 ***
## gradeB -5.915e-01 1.586e-01 -3.729 0.000192 ***
## gradeC -7.897e-01 2.134e-01 -3.701 0.000214 ***
## gradeD -8.702e-01 2.763e-01 -3.149 0.001638 **
## gradeE -1.149e+00 3.517e-01 -3.267 0.001086 **
## gradeF -8.567e-01 4.307e-01 -1.989 0.046703 *
## gradeG -1.232e+00 6.916e-01 -1.781 0.074842 .
## term 60 months -7.789e-01 1.510e-01 -5.159 2.49e-07 ***
## issue_d2013 2.029e-01 1.068e-01 1.900 0.057452 .
## issue_d2014 -1.295e-02 1.067e-01 -0.121 0.903467
## home_ownershipNONE 1.185e+01 3.083e+02 0.038 0.969348
## home_ownershipOTHER -2.096e+00 1.489e+00 -1.407 0.159321
## home_ownershipOWN -5.687e-02 1.005e-01 -0.566 0.571543
## home_ownershipRENT -2.339e-01 7.258e-02 -3.223 0.001269 **
## verification_statusVerified -1.041e-01 6.756e-02 -1.540 0.123472
## purposecredit_card -7.426e-01 4.386e-01 -1.693 0.090445 .
## purposedebt_consolidation -7.642e-01 4.361e-01 -1.752 0.079721 .
## purposehome_improvement -8.128e-01 4.507e-01 -1.804 0.071298 .
## purposehouse -2.170e-01 6.612e-01 -0.328 0.742714
## purposemajor_purchase -4.198e-01 4.981e-01 -0.843 0.399333
## purposemedical -1.075e+00 4.987e-01 -2.155 0.031151 *
## purposemoving -9.867e-01 5.387e-01 -1.832 0.067001 .
## purposeother -6.629e-01 4.508e-01 -1.470 0.141452
## purposerenewable_energy 1.126e+01 2.068e+02 0.054 0.956555
## purposesmall_business -9.267e-01 4.986e-01 -1.859 0.063064 .
## purposevacation -1.186e+00 5.270e-01 -2.251 0.024404 *
## purposewedding -2.563e-01 7.737e-01 -0.331 0.740480
## addr_stateAL -7.547e-01 6.620e-01 -1.140 0.254259
## addr_stateAR -4.155e-01 7.073e-01 -0.587 0.556891
## addr_stateAZ -8.570e-01 6.450e-01 -1.329 0.183945
## addr_stateCA -6.448e-01 6.276e-01 -1.028 0.304170
## addr_stateCO -1.833e-01 6.565e-01 -0.279 0.780036
## addr_stateCT -7.174e-01 6.610e-01 -1.085 0.277736
## addr_stateDC 7.310e-01 1.204e+00 0.607 0.543896
## addr_stateDE -1.505e+00 7.443e-01 -2.022 0.043177 *
## addr_stateFL -7.861e-01 6.310e-01 -1.246 0.212841
## addr_stateGA -6.295e-01 6.409e-01 -0.982 0.326025
## addr_stateHI -5.553e-01 7.205e-01 -0.771 0.440935
## addr_stateIL -6.977e-01 6.392e-01 -1.091 0.275070
## addr_stateIN -7.467e-01 6.560e-01 -1.138 0.255053
## addr_stateKS -1.476e-01 6.995e-01 -0.211 0.832885
## addr_stateKY -3.175e-01 6.867e-01 -0.462 0.643893
## addr_stateLA -4.634e-01 6.731e-01 -0.688 0.491176
## addr_stateMA -8.786e-01 6.443e-01 -1.364 0.172704
## addr_stateMD -9.156e-01 6.445e-01 -1.421 0.155439
## addr_stateMI -4.584e-01 6.485e-01 -0.707 0.479659
## addr_stateMN -8.671e-01 6.504e-01 -1.333 0.182511
## addr_stateMO -8.473e-01 6.513e-01 -1.301 0.193256
## addr_stateMS -6.638e-01 7.362e-01 -0.902 0.367230
## addr_stateMT -1.376e-01 8.316e-01 -0.166 0.868539
## addr_stateNC -7.637e-01 6.431e-01 -1.187 0.235049
## addr_stateNH -3.998e-02 7.469e-01 -0.054 0.957313
## addr_stateNJ -6.933e-01 6.380e-01 -1.087 0.277177
## addr_stateNM -1.161e+00 6.843e-01 -1.697 0.089678 .
## addr_stateNV -5.616e-01 6.584e-01 -0.853 0.393658
## addr_stateNY -7.336e-01 6.303e-01 -1.164 0.244468
## addr_stateOH -9.415e-01 6.385e-01 -1.474 0.140359
## addr_stateOK -1.688e+00 6.655e-01 -2.536 0.011210 *
## addr_stateOR -3.681e-03 6.844e-01 -0.005 0.995709
## addr_statePA -7.684e-01 6.388e-01 -1.203 0.229025
## addr_stateRI -1.014e+00 7.156e-01 -1.417 0.156472
## addr_stateSC -1.044e-01 6.815e-01 -0.153 0.878244
## addr_stateSD -8.294e-01 8.075e-01 -1.027 0.304322
## addr_stateTN -7.652e-01 6.580e-01 -1.163 0.244833
## addr_stateTX -6.774e-01 6.312e-01 -1.073 0.283211
## addr_stateUT 1.834e-01 7.454e-01 0.246 0.805699
## addr_stateVA -6.580e-01 6.416e-01 -1.025 0.305148
## addr_stateVT -1.392e+00 8.150e-01 -1.707 0.087752 .
## addr_stateWA -2.977e-01 6.539e-01 -0.455 0.648904
## addr_stateWI -4.881e-01 6.640e-01 -0.735 0.462303
## addr_stateWV -1.725e-01 7.491e-01 -0.230 0.817859
## addr_stateWY -7.504e-01 7.610e-01 -0.986 0.324077
## loan_amnt 2.396e-05 2.409e-05 0.995 0.319967
## int_rate -4.103e-02 2.523e-02 -1.626 0.103895
## installment -1.089e-03 7.600e-04 -1.433 0.151811
## emp_length 9.770e-03 7.318e-03 1.335 0.181834
## annual_inc 6.996e-03 1.358e-03 5.153 2.56e-07 ***
## dti -1.034e-02 3.843e-03 -2.691 0.007124 **
## tot_cur_bal 3.156e-07 3.216e-07 0.981 0.326481
## inq_last_6mths -6.485e-02 2.606e-02 -2.489 0.012817 *
## open_acc -6.139e-03 6.001e-03 -1.023 0.306362
## revol_util 1.952e-04 1.350e-03 0.145 0.885084
## credit_lth 2.446e-03 3.880e-03 0.630 0.528447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10312.7 on 11249 degrees of freedom
## Residual deviance: 9411.9 on 11167 degrees of freedom
## AIC: 9577.9
##
## Number of Fisher Scoring iterations: 12
#Step4 eveluate the model performance
levels(test$loan_status)
## [1] "Charged Off" "Fully Paid"
pred=predict(lr_full, newdata=test,type = 'response')
pred=ifelse(pred>=0.5,1,0)
CrossTable(pred, test$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | test$loan_status
## pred | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## 0 | 14 | 21 | 35 |
## | 11.433 | 2.287 | |
## | 0.400 | 0.600 | 0.009 |
## | 0.022 | 0.007 | |
## | 0.004 | 0.006 | |
## -------------|-------------|-------------|-------------|
## 1 | 611 | 3104 | 3715 |
## | 0.108 | 0.022 | |
## | 0.164 | 0.836 | 0.991 |
## | 0.978 | 0.993 | |
## | 0.163 | 0.828 | |
## -------------|-------------|-------------|-------------|
## Column Total | 625 | 3125 | 3750 |
## | 0.167 | 0.833 | |
## -------------|-------------|-------------|-------------|
##
##
accuracy_Lr <- table(pred, test[,1])
sum(diag(accuracy_Lr))/sum(accuracy_Lr)
## [1] 0.8314667
#Step 5 improve model performance
##based on the P value in the summary of model, choose these significant variables to build a new model
lr_redu=glm(loan_status~grade+term+issue_d+home_ownership+purpose+int_rate+ annual_inc+ dti+inq_last_6mths, family=binomial(link='logit'), data=train)
summary(lr_redu)
##
## Call:
## glm(formula = loan_status ~ grade + term + issue_d + home_ownership +
## purpose + int_rate + annual_inc + dti + inq_last_6mths, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7204 0.3190 0.5036 0.6538 1.3334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.982168 0.505866 7.872 3.49e-15 ***
## gradeB -0.577425 0.157709 -3.661 0.000251 ***
## gradeC -0.771565 0.211661 -3.645 0.000267 ***
## gradeD -0.851452 0.274001 -3.107 0.001887 **
## gradeE -1.148459 0.348601 -3.294 0.000986 ***
## gradeF -0.857709 0.425967 -2.014 0.044056 *
## gradeG -1.270392 0.684177 -1.857 0.063337 .
## term 60 months -0.658922 0.063564 -10.366 < 2e-16 ***
## issue_d2013 0.194893 0.105559 1.846 0.064850 .
## issue_d2014 -0.014671 0.105070 -0.140 0.888950
## home_ownershipNONE 11.859511 308.609115 0.038 0.969346
## home_ownershipOTHER -1.676228 1.421110 -1.180 0.238191
## home_ownershipOWN -0.111479 0.095536 -1.167 0.243259
## home_ownershipRENT -0.275465 0.057632 -4.780 1.76e-06 ***
## purposecredit_card -0.790754 0.435340 -1.816 0.069308 .
## purposedebt_consolidation -0.811096 0.433230 -1.872 0.061178 .
## purposehome_improvement -0.837937 0.448179 -1.870 0.061533 .
## purposehouse -0.360780 0.656320 -0.550 0.582524
## purposemajor_purchase -0.436921 0.495028 -0.883 0.377442
## purposemedical -1.061080 0.495594 -2.141 0.032272 *
## purposemoving -0.892419 0.533951 -1.671 0.094652 .
## purposeother -0.654310 0.448337 -1.459 0.144451
## purposerenewable_energy 11.192705 206.543908 0.054 0.956783
## purposesmall_business -0.949348 0.495590 -1.916 0.055417 .
## purposevacation -1.115311 0.523259 -2.131 0.033051 *
## purposewedding -0.318735 0.772775 -0.412 0.680006
## int_rate -0.049442 0.024380 -2.028 0.042564 *
## annual_inc 0.006021 0.001031 5.838 5.29e-09 ***
## dti -0.012091 0.003399 -3.557 0.000374 ***
## inq_last_6mths -0.070702 0.024978 -2.831 0.004647 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10313 on 11249 degrees of freedom
## Residual deviance: 9514 on 11220 degrees of freedom
## AIC: 9574
##
## Number of Fisher Scoring iterations: 12
##check accuracy in train set
pred_train=predict(lr_redu, newdata=train,type = 'response')
pred_train=ifelse(pred_train>=0.5,1,0)
accuracy_Lrtrain <- table(pred_train, train[,1])
sum(diag(accuracy_Lrtrain))/sum(accuracy_Lrtrain)
## [1] 0.8283556
##check accuracy in test set
pred_new=predict(lr_redu, newdata=test,type = 'response')
pred_new=ifelse(pred_new>=0.5,1,0)
CrossTable(pred_new, test$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | test$loan_status
## pred_new | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## 0 | 6 | 10 | 16 |
## | 4.167 | 0.833 | |
## | 0.375 | 0.625 | 0.004 |
## | 0.010 | 0.003 | |
## | 0.002 | 0.003 | |
## -------------|-------------|-------------|-------------|
## 1 | 619 | 3115 | 3734 |
## | 0.018 | 0.004 | |
## | 0.166 | 0.834 | 0.996 |
## | 0.990 | 0.997 | |
## | 0.165 | 0.831 | |
## -------------|-------------|-------------|-------------|
## Column Total | 625 | 3125 | 3750 |
## | 0.167 | 0.833 | |
## -------------|-------------|-------------|-------------|
##
##
accuracy_Lrnew <- table(pred_new, test[,1])
sum(diag(accuracy_Lrnew))/sum(accuracy_Lrnew)
## [1] 0.8322667
The accuracy of model random forest is 0.8344
##step3 train the model
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
lapply(train, class)
## $loan_status
## [1] "factor"
##
## $grade
## [1] "factor"
##
## $term
## [1] "factor"
##
## $issue_d
## [1] "factor"
##
## $home_ownership
## [1] "factor"
##
## $verification_status
## [1] "factor"
##
## $purpose
## [1] "factor"
##
## $addr_state
## [1] "factor"
##
## $loan_amnt
## [1] "integer"
##
## $int_rate
## [1] "numeric"
##
## $installment
## [1] "numeric"
##
## $emp_length
## [1] "numeric"
##
## $annual_inc
## [1] "numeric"
##
## $dti
## [1] "numeric"
##
## $tot_cur_bal
## [1] "integer"
##
## $inq_last_6mths
## [1] "integer"
##
## $open_acc
## [1] "integer"
##
## $revol_util
## [1] "numeric"
##
## $credit_lth
## [1] "numeric"
rf=randomForest(loan_status~., data=train, importance=TRUE)
plot(rf)
rf
##
## Call:
## randomForest(formula = loan_status ~ ., data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 17.14%
## Confusion matrix:
## Charged Off Fully Paid class.error
## Charged Off 38 1892 0.980310881
## Fully Paid 36 9284 0.003862661
##step4 Evaluate the model performance
rf_pred=predict(rf, newdata=test)
CrossTable(test$loan_status,rf_pred)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | rf_pred
## test$loan_status | Charged Off | Fully Paid | Row Total |
## -----------------|-------------|-------------|-------------|
## Charged Off | 9 | 616 | 625 |
## | 19.048 | 0.071 | |
## | 0.014 | 0.986 | 0.167 |
## | 0.643 | 0.165 | |
## | 0.002 | 0.164 | |
## -----------------|-------------|-------------|-------------|
## Fully Paid | 5 | 3120 | 3125 |
## | 3.810 | 0.014 | |
## | 0.002 | 0.998 | 0.833 |
## | 0.357 | 0.835 | |
## | 0.001 | 0.832 | |
## -----------------|-------------|-------------|-------------|
## Column Total | 14 | 3736 | 3750 |
## | 0.004 | 0.996 | |
## -----------------|-------------|-------------|-------------|
##
##
accuracy=mean(rf_pred==test$loan_status);accuracy
## [1] 0.8344
library(caret)
library(ranger)
## Warning: package 'ranger' was built under R version 3.5.3
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
##step3 train the model
rf <- ranger(loan_status~ ., data = train, num.threads = 4 )
summary(rf)
## Length Class Mode
## predictions 11250 factor numeric
## num.trees 1 -none- numeric
## num.independent.variables 1 -none- numeric
## mtry 1 -none- numeric
## min.node.size 1 -none- numeric
## prediction.error 1 -none- numeric
## forest 10 ranger.forest list
## confusion.matrix 4 table numeric
## splitrule 1 -none- character
## treetype 1 -none- character
## call 4 -none- call
## importance.mode 1 -none- character
## num.samples 1 -none- numeric
## replace 1 -none- logical
rf$confusion.matrix
## predicted
## true Charged Off Fully Paid
## Charged Off 57 1873
## Fully Paid 62 9258
## step4 evaluate the model performance
#training
rf_pred <- predict(rf, train)
confusionMatrix(data=rf_pred$predictions, train$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 1930 0
## Fully Paid 0 9320
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.8284
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.1716
## Detection Rate : 0.1716
## Detection Prevalence : 0.1716
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Charged Off
##
rf_pred <- predict(rf, test)
confusionMatrix(data=rf_pred$predictions, test$loan_status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 15 19
## Fully Paid 610 3106
##
## Accuracy : 0.8323
## 95% CI : (0.8199, 0.8441)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.58
##
## Kappa : 0.0288
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.024000
## Specificity : 0.993920
## Pos Pred Value : 0.441176
## Neg Pred Value : 0.835845
## Prevalence : 0.166667
## Detection Rate : 0.004000
## Detection Prevalence : 0.009067
## Balanced Accuracy : 0.508960
##
## 'Positive' Class : Charged Off
##
In this case, We use decision tree c5.0 model. From the plot, it is easily to see when we increase the trials, the misclassification Error is decreasing, until the trials equals 10, the misclassification Error reaches its least value. The accuracy of the decision tree classifier is 0.8213.
## Step 3: Training a model on the data ----
# regression tree using c50
library(C50)
## Warning: package 'C50' was built under R version 3.5.2
model <- C5.0(loan_status~., data = train )
# get basic information about the tree
model
##
## Call:
## C5.0.formula(formula = loan_status ~ ., data = train)
##
## Classification Tree
## Number of samples: 11250
## Number of predictors: 18
##
## Tree size: 54
##
## Non-standard options: attempt to group attributes
# get more detailed information about the tree
summary(model)
##
## Call:
## C5.0.formula(formula = loan_status ~ ., data = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Apr 14 21:59:41 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 11250 cases (19 attributes) from undefined.data
##
## Decision tree:
##
## term = 36 months: Fully Paid (8398/1110)
## term = 60 months:
## :...grade in {A,B,C,D}: Fully Paid (2080/510)
## grade in {E,F,G}:
## :...addr_state in {AK,AL,AR,DC,FL,GA,IA,ID,IN,KS,MA,ME,MO,NC,NE,NH,NJ,NV,
## : NY,SD,UT,VA,WY}: Fully Paid (309/102)
## addr_state in {DE,MT,PA,RI,VT,WV}: Charged Off (41/17)
## addr_state = CA:
## :...credit_lth <= 11: Charged Off (31/10)
## : credit_lth > 11: Fully Paid (69/25)
## addr_state = CO:
## :...home_ownership in {ANY,MORTGAGE,NONE,OTHER,
## : : OWN}: Fully Paid (17/3)
## : home_ownership = RENT: Charged Off (5)
## addr_state = CT:
## :...annual_inc <= 91: Charged Off (2)
## : annual_inc > 91: Fully Paid (4)
## addr_state = HI:
## :...loan_amnt <= 14925: Charged Off (2)
## : loan_amnt > 14925: Fully Paid (2)
## addr_state = IL:
## :...verification_status = Not Verified: Charged Off (3)
## : verification_status = Verified: Fully Paid (16/7)
## addr_state = KY:
## :...inq_last_6mths <= 1: Charged Off (4/1)
## : inq_last_6mths > 1: Fully Paid (5)
## addr_state = MN:
## :...revol_util <= 74.2: Charged Off (7/1)
## : revol_util > 74.2: Fully Paid (4)
## addr_state = MS:
## :...emp_length <= 6: Charged Off (3)
## : emp_length > 6: Fully Paid (2)
## addr_state = NM:
## :...annual_inc <= 77.393: Charged Off (4)
## : annual_inc > 77.393: Fully Paid (2)
## addr_state = OH:
## :...emp_length <= 1: Charged Off (4)
## : emp_length > 1: Fully Paid (23/8)
## addr_state = OK:
## :...annual_inc <= 64.5: Charged Off (6)
## : annual_inc > 64.5: Fully Paid (6/2)
## addr_state = OR:
## :...annual_inc <= 37.5: Charged Off (2)
## : annual_inc > 37.5: Fully Paid (8)
## addr_state = SC:
## :...purpose in {car,credit_card,debt_consolidation,house,
## : : major_purchase,medical,moving,other,renewable_energy,
## : : small_business,vacation,wedding}: Fully Paid (13/4)
## : purpose = home_improvement: Charged Off (2)
## addr_state = TN:
## :...issue_d in {2012,2013}: Fully Paid (9/1)
## : issue_d = 2014: Charged Off (10/3)
## addr_state = WI:
## :...inq_last_6mths <= 1: Fully Paid (7)
## : inq_last_6mths > 1: Charged Off (5/1)
## addr_state = AZ:
## :...home_ownership in {ANY,NONE,OTHER,OWN,RENT}: Charged Off (6)
## : home_ownership = MORTGAGE:
## : :...credit_lth <= 14: Charged Off (4/1)
## : credit_lth > 14: Fully Paid (5)
## addr_state = LA:
## :...credit_lth <= 12: Charged Off (3)
## : credit_lth > 12:
## : :...dti <= 8.46: Charged Off (2)
## : dti > 8.46: Fully Paid (8)
## addr_state = MI:
## :...revol_util <= 66.6: Charged Off (4)
## : revol_util > 66.6:
## : :...inq_last_6mths <= 3: Fully Paid (10)
## : inq_last_6mths > 3: Charged Off (2)
## addr_state = WA:
## :...annual_inc <= 49.9: Charged Off (5)
## : annual_inc > 49.9:
## : :...inq_last_6mths <= 2: Fully Paid (19/3)
## : inq_last_6mths > 2: Charged Off (2)
## addr_state = MD:
## :...home_ownership in {ANY,NONE,OTHER,RENT}: Fully Paid (9)
## : home_ownership in {MORTGAGE,OWN}:
## : :...inq_last_6mths > 1: Fully Paid (3)
## : inq_last_6mths <= 1:
## : :...annual_inc <= 55: Fully Paid (4/1)
## : annual_inc > 55: Charged Off (7)
## addr_state = TX:
## :...verification_status = Not Verified: Fully Paid (3)
## verification_status = Verified:
## :...grade in {F,G}: Fully Paid (11/2)
## grade = E:
## :...annual_inc <= 75.79: Charged Off (17/3)
## annual_inc > 75.79: Fully Paid (21/6)
##
##
## Evaluation on training data (11250 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 54 1821(16.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 146 1784 (a): class Charged Off
## 37 9283 (b): class Fully Paid
##
##
## Attribute usage:
##
## 100.00% term
## 25.35% grade
## 6.86% addr_state
## 1.08% credit_lth
## 0.97% annual_inc
## 0.63% verification_status
## 0.60% inq_last_6mths
## 0.53% home_ownership
## 0.28% emp_length
## 0.24% revol_util
## 0.17% issue_d
## 0.13% purpose
## 0.09% dti
## 0.04% loan_amnt
##
##
## Time: 0.1 secs
## Step 4: Evaluate model performance ----
#check accuracy in train set
fitted.results <- predict(model, newdata = train)
head(fitted.results)
## [1] Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid
## Levels: Charged Off Fully Paid
misClasificError <- mean(fitted.results != train$loan_status, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.838133333333333"
# test
fitted.results <- predict(model, newdata = test)
head(fitted.results)
## [1] Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid Fully Paid
## Levels: Charged Off Fully Paid
misClasificError <- mean(fitted.results != test$loan_status, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.828533333333333"
CrossTable(test$loan_status, fitted.results)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | fitted.results
## test$loan_status | Charged Off | Fully Paid | Row Total |
## -----------------|-------------|-------------|-------------|
## Charged Off | 18 | 607 | 625 |
## | 9.000 | 0.131 | |
## | 0.029 | 0.971 | 0.167 |
## | 0.333 | 0.164 | |
## | 0.005 | 0.162 | |
## -----------------|-------------|-------------|-------------|
## Fully Paid | 36 | 3089 | 3125 |
## | 1.800 | 0.026 | |
## | 0.012 | 0.988 | 0.833 |
## | 0.667 | 0.836 | |
## | 0.010 | 0.824 | |
## -----------------|-------------|-------------|-------------|
## Column Total | 54 | 3696 | 3750 |
## | 0.014 | 0.986 | |
## -----------------|-------------|-------------|-------------|
##
##
##step 5 improve the model performance
# Boosting the accuracy of decision trees
# boosted decision tree with n trials
misClasificError= function(n){
model_trial <- C5.0(loan_status~., data = train, trials =n )
fitted.results <- predict(model_trial, newdata = test)
misClasificError <- mean(fitted.results != test$loan_status, na.rm=TRUE)
return(misClasificError)
}
x=c(4,6,10,20,30,35,36,40,45)
testerror_rates <- sapply(x, FUN = misClasificError)
ggplot() +geom_point(aes(x=x,y=testerror_rates)) + geom_line(aes(x=x,y=testerror_rates)) + ylab("Misclassification Rate")
#From the plot, it is easily to see when we increase the trials, the misclassification Error is decreasing, until the trials equals 35, the misclassification Error reaches its least value.
model_trial <- C5.0(loan_status~., data = train, trials =35 )
# get basic information about the tree
model_trial
##
## Call:
## C5.0.formula(formula = loan_status ~ ., data = train, trials = 35)
##
## Classification Tree
## Number of samples: 11250
## Number of predictors: 18
##
## Number of boosting iterations: 35 requested; 11 used due to early stopping
## Average tree size: 34.4
##
## Non-standard options: attempt to group attributes
#check accuracy
fitted.results <- predict(model_trial, newdata = test)
misClasificErrorleast <- mean(fitted.results != test$loan_status, na.rm=TRUE)
print(paste('Maximum Accuracy',1-misClasificErrorleast))
## [1] "Maximum Accuracy 0.821333333333333"
when k=1, Accuracy 0.741066666666667
when k=13,Accuracy 0.826133333333333
when k=21, Accuracy 0.8312
When k=25, the model reaches its maximum Accuracy 0.832
# create normalization function
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
loanknn=loan
loan_n <- as.data.frame(lapply(loanknn[, 9:19], normalize))
trainknn=loan_n[index_num,]
testknn=loan_n[-index_num,]
trainlabel=loanknn[index_num,1]
testlabel=loanknn[-index_num,1]
# build a KNN model
library(class)
## Warning: package 'class' was built under R version 3.5.2
loan_test_pred <- knn(train = trainknn, test = testknn,
cl = trainlabel, k = 21)
## Step 4: Evaluating model performance ----
# Create the cross tabulation of predicted vs. actual
CrossTable(x = testlabel, y = loan_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | loan_test_pred
## testlabel | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## Charged Off | 11 | 614 | 625 |
## | 0.018 | 0.982 | 0.167 |
## | 0.367 | 0.165 | |
## | 0.003 | 0.164 | |
## -------------|-------------|-------------|-------------|
## Fully Paid | 19 | 3106 | 3125 |
## | 0.006 | 0.994 | 0.833 |
## | 0.633 | 0.835 | |
## | 0.005 | 0.828 | |
## -------------|-------------|-------------|-------------|
## Column Total | 30 | 3720 | 3750 |
## | 0.008 | 0.992 | |
## -------------|-------------|-------------|-------------|
##
##
misClasificError <- mean(loan_test_pred != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.8312"
## step 5 improve the model
loan_test_pred1 <- knn(train = trainknn, test = testknn,
cl = trainlabel, k = 1)
misClasificError <- mean(loan_test_pred1 != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.741066666666667"
loan_test_pred1 <- knn(train = trainknn, test = testknn,
cl = trainlabel, k = 13)
misClasificError <- mean(loan_test_pred1 != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.826133333333333"
# KNN error rate on test sample
knn_accuracy_rate <- function(n) {
y_hat <- knn(train = trainknn, test = testknn,
cl = trainlabel, k = n)
misClasificError=mean(y_hat != testlabel,na.rm=TRUE)
return(1-misClasificError)
}
a <- c(1:10,12,15,20,25,30,35,40)
Accuracy_rates <- sapply(a, FUN = knn_accuracy_rate )
Accuracy_rates
## [1] 0.7410667 0.7341333 0.7906667 0.7973333 0.8082667 0.8125333 0.8170667
## [8] 0.8168000 0.8202667 0.8224000 0.8256000 0.8282667 0.8309333 0.8320000
## [15] 0.8322667 0.8325333 0.8322667
ggplot() +geom_point(aes(x=a,y=Accuracy_rates)) + geom_line(aes(x=a,y=Accuracy_rates)) + ylab("Accuracy Rate")
# from the plot, we can see when k=25, the model reaches its maximum accuracy
loan_test_pred1 <- knn(train = trainknn, test = testknn,
cl = trainlabel, k = 25)
misClasificError <- mean(loan_test_pred1 != testlabel, na.rm=TRUE)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.832"
choose all variables as predictors to build the model, the model accuracy is 78.32%. When trying to use laplace to improve the model, the performance of the model improve slightly. The accuracy of the model with laplace =3 is 78.35%. If only choose the factor predictors, the accuracy is 81.54%. Adding laplace =3 to the model, the accuracy is 81.6%.
##build model
library(e1071)
## Warning: package 'e1071' was built under R version 3.5.2
nb <- naiveBayes(train[,-1], train[,1])
pred=predict(nb, test[,-1])
CrossTable(pred, test$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | test$loan_status
## pred | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## Charged Off | 188 | 376 | 564 |
## | 94.000 | 18.800 | |
## | 0.333 | 0.667 | 0.150 |
## | 0.301 | 0.120 | |
## | 0.050 | 0.100 | |
## -------------|-------------|-------------|-------------|
## Fully Paid | 437 | 2749 | 3186 |
## | 16.640 | 3.328 | |
## | 0.137 | 0.863 | 0.850 |
## | 0.699 | 0.880 | |
## | 0.117 | 0.733 | |
## -------------|-------------|-------------|-------------|
## Column Total | 625 | 3125 | 3750 |
## | 0.167 | 0.833 | |
## -------------|-------------|-------------|-------------|
##
##
##evaluate the model performance
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.7832
##improve the model performance
#try to use laplace to improve the model
nb <- naiveBayes(train[,-1], train[,1],laplace = 3)
pred=predict(nb, test[,-1])
CrossTable(pred, test$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | test$loan_status
## pred | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## Charged Off | 184 | 371 | 555 |
## | 90.511 | 18.102 | |
## | 0.332 | 0.668 | 0.148 |
## | 0.294 | 0.119 | |
## | 0.049 | 0.099 | |
## -------------|-------------|-------------|-------------|
## Fully Paid | 441 | 2754 | 3195 |
## | 15.723 | 3.145 | |
## | 0.138 | 0.862 | 0.852 |
## | 0.706 | 0.881 | |
## | 0.118 | 0.734 | |
## -------------|-------------|-------------|-------------|
## Column Total | 625 | 3125 | 3750 |
## | 0.167 | 0.833 | |
## -------------|-------------|-------------|-------------|
##
##
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.7834667
#if only choose the factor predictors, the accuracy is 81.54
nb=naiveBayes(train[,2:9], train[,1])
pred=predict(nb, test[,2:9])
CrossTable(pred, test$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | test$loan_status
## pred | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## Charged Off | 81 | 148 | 229 |
## | 48.071 | 9.614 | |
## | 0.354 | 0.646 | 0.061 |
## | 0.130 | 0.047 | |
## | 0.022 | 0.039 | |
## -------------|-------------|-------------|-------------|
## Fully Paid | 544 | 2977 | 3521 |
## | 3.126 | 0.625 | |
## | 0.155 | 0.845 | 0.939 |
## | 0.870 | 0.953 | |
## | 0.145 | 0.794 | |
## -------------|-------------|-------------|-------------|
## Column Total | 625 | 3125 | 3750 |
## | 0.167 | 0.833 | |
## -------------|-------------|-------------|-------------|
##
##
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.8154667
#adding laplace
nb=naiveBayes(train[,2:9], train[,1],laplace=3)
pred=predict(nb, test[,2:9])
CrossTable(pred, test$loan_status)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3750
##
##
## | test$loan_status
## pred | Charged Off | Fully Paid | Row Total |
## -------------|-------------|-------------|-------------|
## Charged Off | 78 | 143 | 221 |
## | 46.010 | 9.202 | |
## | 0.353 | 0.647 | 0.059 |
## | 0.125 | 0.046 | |
## | 0.021 | 0.038 | |
## -------------|-------------|-------------|-------------|
## Fully Paid | 547 | 2982 | 3529 |
## | 2.881 | 0.576 | |
## | 0.155 | 0.845 | 0.941 |
## | 0.875 | 0.954 | |
## | 0.146 | 0.795 | |
## -------------|-------------|-------------|-------------|
## Column Total | 625 | 3125 | 3750 |
## | 0.167 | 0.833 | |
## -------------|-------------|-------------|-------------|
##
##
accuracy=1-mean(pred!=test$loan_status,na.rm=TRUE);accuracy
## [1] 0.816