For my project I selected a data set that I found on Lending Club’s website (https://www.lendingclub.com). The data is provided for potential investors and contains information about loans that were issued from 2007 through the third quarter of 2017.
Lending Club is the world’s largest peer-to-peer lending platform that enables borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans.
The goals of the project are:
To find the equation that best predicts the probability that a loan will be paid off.
To understand what might cause the probability to change.
To predict loan status based on logistic regression.
An investor earns money when a loan is fully paid off and loses money when a loan is charged off. An investor who has the results of a model that classifies loans would be able to make better investment decisions.
While reviewing Lending Club’s website, I found that investors can see information such as loan rate, loan term, interest rate, borrower’s FICO score, loan amount and loan purpose. Moreover, they can filter by the borrower’s employment length and monthly income.
To collect the data, I downloaded and merged 11 files that contain data from 2007 through the third quarter of 2017 (data source: https://www.lendingclub.com/info/download-data.action). To reduce the loading time, I implemented the following steps.
#1. read in a few records of the input file to identify the classes of the input file and assign that column class to the input file while reading the entire data set
data_2007_2011 <- read.csv(file="https://cdn-stage.fedweb.org/fed-2/13/LoanStats3a.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2012_2013 <- read.csv(file="https://cdn-stage.fedweb.org/fed-2/13/LoanStats3b.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2014 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3c.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2015 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3d.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2016_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q1.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2016_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q2.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2016_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q3.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2016_q4 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q4.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2017_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q1.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2017_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q2.csv",
stringsAsFactors=T, header=T, nrows=5)
data_2017_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q3.csv",
stringsAsFactors=T, header=T, nrows=5)
#2. replace all missing values (empty strings) with NAs
# note: x <- x[is.na(x)] would keep only the NA cells and drop the columns, so assign NA in place instead
data_2007_2011[data_2007_2011 == ""] <- NA
data_2012_2013[data_2012_2013 == ""] <- NA
data_2014[data_2014 == ""] <- NA
data_2015[data_2015 == ""] <- NA
data_2016_q1[data_2016_q1 == ""] <- NA
data_2016_q2[data_2016_q2 == ""] <- NA
data_2016_q3[data_2016_q3 == ""] <- NA
data_2016_q4[data_2016_q4 == ""] <- NA
data_2017_q1[data_2017_q1 == ""] <- NA
data_2017_q2[data_2017_q2 == ""] <- NA
data_2017_q3[data_2017_q3 == ""] <- NA
#3. determine classes
data_2007_2011.colclass <- sapply(data_2007_2011,class)
data_2012_2013.colclass <- sapply(data_2012_2013,class)
data_2014.colclass <- sapply(data_2014,class)
data_2015.colclass <- sapply(data_2015,class)
data_2016_q1.colclass <- sapply(data_2016_q1,class)
data_2016_q2.colclass <- sapply(data_2016_q2,class)
data_2016_q3.colclass <- sapply(data_2016_q3,class)
data_2016_q4.colclass <- sapply(data_2016_q4,class)
data_2017_q1.colclass <- sapply(data_2017_q1,class)
data_2017_q2.colclass <- sapply(data_2017_q2,class)
data_2017_q3.colclass <- sapply(data_2017_q3,class)
#4. assign that column class to the input file while reading the entire data set and define comment.char parameter.
data_2007_2011 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3a.csv",
stringsAsFactors=T,
header=T,colClasses=data_2007_2011.colclass, comment.char="")
data_2012_2013 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3b.csv",
stringsAsFactors=T,
header=T,colClasses=data_2012_2013.colclass, comment.char="")
data_2014 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3c.csv",
stringsAsFactors=T, colClasses=data_2014.colclass, comment.char="")
data_2015 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3d.csv",
stringsAsFactors=T, header=T, colClasses=data_2015.colclass, comment.char="")
data_2016_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q1.csv",
stringsAsFactors=T, header=T,colClasses=data_2016_q1.colclass, comment.char="")
data_2016_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q2.csv",
stringsAsFactors=T, header=T,colClasses=data_2016_q2.colclass, comment.char="")
data_2016_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q3.csv",
stringsAsFactors=T, header=T,colClasses=data_2016_q3.colclass, comment.char="")
data_2016_q4 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q4.csv",
stringsAsFactors=T, header=T,colClasses=data_2016_q4.colclass, comment.char="")
data_2017_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q1.csv",
stringsAsFactors=T, header=T,colClasses=data_2017_q1.colclass, comment.char="")
data_2017_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q2.csv",
stringsAsFactors=T, header=T,colClasses=data_2017_q2.colclass, comment.char="")
data_2017_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q3.csv",
stringsAsFactors=T, header=T,colClasses=data_2017_q3.colclass, comment.char="")
#5. merge csv files
# I selected the first 51 attributes because the csv files for different years have different sets of attributes, but the first 51 attributes are shared by all files
full_data <- rbind(data_2007_2011[1:51], data_2012_2013[1:51], data_2014[1:51],
                   data_2015[1:51], data_2016_q1[1:51], data_2016_q2[1:51],
                   data_2016_q3[1:51], data_2016_q4[1:51], data_2017_q1[1:51],
                   data_2017_q2[1:51], data_2017_q3[1:51])
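The read steps above repeat the same two passes for every file. As a minimal sketch of the same idea, the two passes could be wrapped in a helper and applied to a list of paths. The function name `read_with_classes`, the `n_probe` argument and the two temporary files with made-up values are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: two-pass read of one CSV file.
read_with_classes <- function(path, n_probe = 5) {
  # Pass 1: read only a few rows to infer the column classes.
  probe <- read.csv(path, header = TRUE, nrows = n_probe, stringsAsFactors = TRUE)
  col_classes <- sapply(probe, class)
  # Pass 2: read the whole file with the classes fixed up front.
  read.csv(path, header = TRUE, colClasses = col_classes, comment.char = "")
}

# Self-contained demo on two small temporary files (made-up values).
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(loan_amnt = c(12000, 3000), grade = c("B", "C")),
          paths[1], row.names = FALSE)
write.csv(data.frame(loan_amnt = c(7200, 9000, 7500), grade = c("B", "A", "B")),
          paths[2], row.names = FALSE)

# Read every file the same way, then stack the pieces.
parts <- lapply(paths, read_with_classes)
demo_data <- do.call(rbind, parts)
```

One caveat with this approach: classes inferred from the first few rows are only a guess. If a column looks like an integer in the probe rows but holds decimals or text further down, the second pass will stop with an error, so a larger `n_probe` may be needed.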
After that I determined all types of loan statuses.
levels(factor(full_data$loan_status))
## [1] ""
## [2] "Charged Off"
## [3] "Does not meet the credit policy. Status:Charged Off"
## [4] "Does not meet the credit policy. Status:Fully Paid"
## [5] "Fully Paid"
## [6] "Current"
## [7] "Default"
## [8] "In Grace Period"
## [9] "Late (16-30 days)"
## [10] "Late (31-120 days)"
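I filtered the data so that the data set contains only loans with “Fully Paid” or “Charged Off” statuses; the “Does not meet the credit policy” variants were folded into these two statuses. I ignored loans with statuses “Current”, “In Grace Period”, “Late (16-30 days)”, “Late (31-120 days)” and “Default” since theoretically borrowers can still pay them off.
library(dplyr)
library(stringr)
# fixed() matches the prefix literally, so "." is not treated as a regex wildcard
full_data <- full_data %>%
  mutate(loan_status = str_replace(loan_status, fixed("Does not meet the credit policy. Status:"), "")) %>%
  filter(loan_status %in% c("Fully Paid", "Charged Off"))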
levels(factor(full_data$loan_status))
## [1] "Charged Off" "Fully Paid"
Also, I removed all attributes that investors can’t see on the website and kept only the ones that they can see. Moreover, I converted the term and interest rate attributes to numerical format.
colnames(full_data)
## [1] "id" "member_id"
## [3] "loan_amnt" "funded_amnt"
## [5] "funded_amnt_inv" "term"
## [7] "int_rate" "installment"
## [9] "grade" "sub_grade"
## [11] "emp_title" "emp_length"
## [13] "home_ownership" "annual_inc"
## [15] "verification_status" "issue_d"
## [17] "loan_status" "pymnt_plan"
## [19] "url" "desc"
## [21] "purpose" "title"
## [23] "zip_code" "addr_state"
## [25] "dti" "delinq_2yrs"
## [27] "earliest_cr_line" "inq_last_6mths"
## [29] "mths_since_last_delinq" "mths_since_last_record"
## [31] "open_acc" "pub_rec"
## [33] "revol_bal" "revol_util"
## [35] "total_acc" "initial_list_status"
## [37] "out_prncp" "out_prncp_inv"
## [39] "total_pymnt" "total_pymnt_inv"
## [41] "total_rec_prncp" "total_rec_int"
## [43] "total_rec_late_fee" "recoveries"
## [45] "collection_recovery_fee" "last_pymnt_d"
## [47] "last_pymnt_amnt" "next_pymnt_d"
## [49] "last_credit_pull_d" "collections_12_mths_ex_med"
## [51] "mths_since_last_major_derog"
full_data <- full_data %>%
  select(loan_status, loan_amnt, term, int_rate, installment, grade, emp_length,
         home_ownership, annual_inc, verification_status, dti, delinq_2yrs,
         inq_last_6mths, mths_since_last_delinq, mths_since_last_record, open_acc,
         pub_rec, revol_bal, revol_util, total_acc, collections_12_mths_ex_med,
         mths_since_last_major_derog) %>%
  mutate(term = str_replace(term, "months", ""),
         int_rate = str_replace(int_rate, "%", ""),
         revol_util = str_replace(revol_util, "%", ""),
         term = as.integer(term),
         revol_util = as.integer(revol_util),
         int_rate = as.double(int_rate),
         loan_status = as.factor(loan_status))
#omit NAs
full_data <- na.omit(full_data)
head(full_data)
## loan_status loan_amnt term int_rate installment grade emp_length
## 42538 Fully Paid 12000 36 13.53 407.40 B 10+ years
## 42545 Fully Paid 3000 36 12.85 100.87 B 10+ years
## 42568 Fully Paid 7200 36 10.99 235.69 B 4 years
## 42594 Fully Paid 9000 36 14.98 311.90 C 10+ years
## 42597 Fully Paid 7500 36 11.99 249.08 B 2 years
## 42620 Fully Paid 15850 36 16.24 559.12 C 1 year
## home_ownership annual_inc verification_status dti delinq_2yrs
## 42538 RENT 40000 Source Verified 16.94 0
## 42545 RENT 25000 Verified 24.68 0
## 42568 OWN 70000 Verified 19.20 0
## 42594 MORTGAGE 56000 Verified 21.45 0
## 42597 RENT 59600 Not Verified 15.93 0
## 42620 RENT 59400 Verified 33.22 0
## inq_last_6mths mths_since_last_delinq mths_since_last_record
## 42538 0 53 33
## 42545 0 58 53
## 42568 0 59 59
## 42594 2 36 72
## 42597 0 28 108
## 42620 3 64 52
## open_acc pub_rec revol_bal revol_util total_acc
## 42538 7 2 5572 68 32
## 42545 5 2 2875 54 26
## 42568 14 1 3479 35 49
## 42594 30 1 11317 30 60
## 42597 9 1 5517 54 53
## 42620 17 1 11161 60 36
## collections_12_mths_ex_med mths_since_last_major_derog
## 42538 0 53
## 42545 0 69
## 42568 0 59
## 42594 0 70
## 42597 0 34
## 42620 0 64
The response variable (loan_status) and the explanatory variables are:
colnames(full_data)
## [1] "loan_status" "loan_amnt"
## [3] "term" "int_rate"
## [5] "installment" "grade"
## [7] "emp_length" "home_ownership"
## [9] "annual_inc" "verification_status"
## [11] "dti" "delinq_2yrs"
## [13] "inq_last_6mths" "mths_since_last_delinq"
## [15] "mths_since_last_record" "open_acc"
## [17] "pub_rec" "revol_bal"
## [19] "revol_util" "total_acc"
## [21] "collections_12_mths_ex_med" "mths_since_last_major_derog"
Since the response variable “loan status” is a binary categorical variable (with two possible outcomes, “Fully Paid” or “Charged Off”) and the explanatory variables are numerical and categorical, I used logistic regression to analyze the data set.
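To make the link between the fitted equation and a predicted probability concrete, here is a minimal sketch on simulated data. The data frame `toy`, the single predictor and the coefficients 3 and -0.15 are made up for illustration; they are not estimates from the Lending Club data.

```r
set.seed(1)
n <- 500
int_rate <- runif(n, 5, 25)
# Simulated truth (an assumption): higher rates lower the chance of full repayment.
p_paid <- plogis(3 - 0.15 * int_rate)
loan_status <- factor(ifelse(runif(n) < p_paid, "Fully Paid", "Charged Off"),
                      levels = c("Charged Off", "Fully Paid"))
toy <- data.frame(loan_status, int_rate)

fit <- glm(loan_status ~ int_rate, family = binomial(link = "logit"), data = toy)

# glm() treats the first factor level as failure, so this is P(Fully Paid).
new_loan <- data.frame(int_rate = 13.53)
p_hat <- predict(fit, newdata = new_loan, type = "response")

# Same number from the fitted equation p = 1 / (1 + exp(-(b0 + b1 * x))).
b <- coef(fit)
p_manual <- 1 / (1 + exp(-(b[1] + b[2] * 13.53)))
```

Because the factor levels are ordered “Charged Off”, “Fully Paid”, the modelled probability is the probability of “Fully Paid”. The same convention applies to the models fitted on `full_data`.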
The assumptions for the logistic regression are:
Assumption 1 - Logistic regression typically requires a large sample size. It requires a minimum of 10 cases with the least frequent outcome for each independent variable in the model. For example, if a model has 5 independent variables and the expected probability of the least frequent outcome is .10, then the minimum sample size is 500 (10*5 / .10).
dim(full_data)
## [1] 36222 22
With 36222 observations, the data set can be considered large.
Assumption 2 - Logistic regression requires the observations to be independent of each other. I assume that the majority of the observations are independent since most of the borrowers are not related.
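The rule of thumb in Assumption 1 can be written as a small helper. This is only a sketch of the arithmetic quoted above; the function name `min_sample_size` and its argument names are hypothetical.

```r
# Hypothetical helper for the "10 cases per independent variable" rule of thumb.
min_sample_size <- function(n_predictors, p_least_frequent, cases_per_predictor = 10) {
  stopifnot(p_least_frequent > 0, p_least_frequent <= 0.5)
  # round() guards against floating-point noise before rounding up
  ceiling(round(cases_per_predictor * n_predictors / p_least_frequent, 8))
}

# The example from the text: 5 independent variables, least frequent outcome at .10
n_min <- min_sample_size(5, 0.10)
n_min  # 500
```

Note that a categorical variable such as grade contributes several dummy variables to the model, so the number of estimated coefficients is a more conservative choice for `n_predictors` than the number of columns.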
Assumption 3 - Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
#removing categorical variables except loan status (I will use this data set later)
full_data_no_categorical_status <- full_data %>% select(-grade,-emp_length,-home_ownership,-verification_status)
#removing loan status
full_data_no_categorical <- full_data_no_categorical_status %>% select(-loan_status)
#calculate correlation
cor(full_data_no_categorical,use="complete.obs")
## loan_amnt term int_rate
## loan_amnt 1.000000000 0.422905078 0.252943051
## term 0.422905078 1.000000000 0.449439920
## int_rate 0.252943051 0.449439920 1.000000000
## installment 0.960984857 0.201161598 0.251209631
## annual_inc 0.396604274 0.077325539 -0.024229579
## dti -0.004296079 0.055376313 0.154213650
## delinq_2yrs 0.004823642 -0.013761784 0.044475316
## inq_last_6mths -0.019746327 -0.002598566 0.242287085
## mths_since_last_delinq -0.030847229 0.006928472 -0.050570450
## mths_since_last_record -0.020720434 0.025732174 0.022453452
## open_acc 0.145728246 0.086797964 0.068295286
## pub_rec 0.031823800 -0.024404755 0.006143159
## revol_bal 0.218459342 0.051399233 -0.006254746
## revol_util 0.109403712 0.034683183 0.095508757
## total_acc 0.106496607 0.069087858 0.001666116
## collections_12_mths_ex_med -0.003320038 -0.009964323 0.003916385
## mths_since_last_major_derog -0.019175444 0.004729816 -0.029554970
## installment annual_inc dti
## loan_amnt 0.960984857 0.396604274 -0.0042960793
## term 0.201161598 0.077325539 0.0553763132
## int_rate 0.251209631 -0.024229579 0.1542136500
## installment 1.000000000 0.392374700 -0.0018279538
## annual_inc 0.392374700 1.000000000 -0.2858904922
## dti -0.001827954 -0.285890492 1.0000000000
## delinq_2yrs 0.011625624 0.042238702 -0.0020443121
## inq_last_6mths 0.009682516 0.040434813 0.0004220809
## mths_since_last_delinq -0.038752539 -0.076417767 0.0204770180
## mths_since_last_record -0.026630255 -0.086111918 0.0615709422
## open_acc 0.139589986 0.092781884 0.2540949157
## pub_rec 0.042574297 0.082048619 -0.0488774759
## revol_bal 0.214007210 0.255702205 0.0733672979
## revol_util 0.117137922 0.050101528 0.1431764418
## total_acc 0.094812658 0.105119526 0.1645578233
## collections_12_mths_ex_med 0.000183462 0.005928169 0.0064742469
## mths_since_last_major_derog -0.023985298 -0.042258911 0.0328661843
## delinq_2yrs inq_last_6mths
## loan_amnt 0.0048236422 -0.0197463266
## term -0.0137617839 -0.0025985660
## int_rate 0.0444753161 0.2422870854
## installment 0.0116256241 0.0096825159
## annual_inc 0.0422387017 0.0404348126
## dti -0.0020443121 0.0004220809
## delinq_2yrs 1.0000000000 0.0282431493
## inq_last_6mths 0.0282431493 1.0000000000
## mths_since_last_delinq -0.4979427822 0.0032003344
## mths_since_last_record -0.0124547371 -0.0347099768
## open_acc 0.0550553163 0.1503937659
## pub_rec 0.0006722893 0.0040616085
## revol_bal 0.0064078761 -0.0122525865
## revol_util 0.0001444110 -0.0794601846
## total_acc 0.0423192281 0.1703407140
## collections_12_mths_ex_med 0.0847557528 0.0004772454
## mths_since_last_major_derog -0.3953539971 0.0124708097
## mths_since_last_delinq mths_since_last_record
## loan_amnt -0.030847229 -0.020720434
## term 0.006928472 0.025732174
## int_rate -0.050570450 0.022453452
## installment -0.038752539 -0.026630255
## annual_inc -0.076417767 -0.086111918
## dti 0.020477018 0.061570942
## delinq_2yrs -0.497942782 -0.012454737
## inq_last_6mths 0.003200334 -0.034709977
## mths_since_last_delinq 1.000000000 -0.006842418
## mths_since_last_record -0.006842418 1.000000000
## open_acc -0.040857028 0.031370633
## pub_rec -0.004742984 -0.269376093
## revol_bal -0.019583893 -0.024295713
## revol_util -0.015932256 0.041013824
## total_acc -0.003720052 -0.144489072
## collections_12_mths_ex_med -0.097995842 -0.005553399
## mths_since_last_major_derog 0.689480972 -0.007275000
## open_acc pub_rec revol_bal
## loan_amnt 0.14572825 0.0318237998 0.218459342
## term 0.08679796 -0.0244047554 0.051399233
## int_rate 0.06829529 0.0061431595 -0.006254746
## installment 0.13958999 0.0425742965 0.214007210
## annual_inc 0.09278188 0.0820486188 0.255702205
## dti 0.25409492 -0.0488774759 0.073367298
## delinq_2yrs 0.05505532 0.0006722893 0.006407876
## inq_last_6mths 0.15039377 0.0040616085 -0.012252587
## mths_since_last_delinq -0.04085703 -0.0047429835 -0.019583893
## mths_since_last_record 0.03137063 -0.2693760931 -0.024295713
## open_acc 1.00000000 -0.0184278479 0.162964627
## pub_rec -0.01842785 1.0000000000 0.011407285
## revol_bal 0.16296463 0.0114072853 1.000000000
## revol_util -0.07903746 -0.0002151927 0.230578235
## total_acc 0.59745422 -0.0586662051 0.079748021
## collections_12_mths_ex_med 0.01444761 0.0169394696 -0.007964900
## mths_since_last_major_derog -0.01320742 0.0099003795 0.002334282
## revol_util total_acc
## loan_amnt 0.1094037119 0.106496607
## term 0.0346831827 0.069087858
## int_rate 0.0955087571 0.001666116
## installment 0.1171379217 0.094812658
## annual_inc 0.0501015282 0.105119526
## dti 0.1431764418 0.164557823
## delinq_2yrs 0.0001444110 0.042319228
## inq_last_6mths -0.0794601846 0.170340714
## mths_since_last_delinq -0.0159322556 -0.003720052
## mths_since_last_record 0.0410138239 -0.144489072
## open_acc -0.0790374605 0.597454218
## pub_rec -0.0002151927 -0.058666205
## revol_bal 0.2305782349 0.079748021
## revol_util 1.0000000000 -0.086334311
## total_acc -0.0863343109 1.000000000
## collections_12_mths_ex_med -0.0258469002 -0.023336431
## mths_since_last_major_derog 0.0221548906 -0.006425679
## collections_12_mths_ex_med
## loan_amnt -0.0033200383
## term -0.0099643227
## int_rate 0.0039163850
## installment 0.0001834620
## annual_inc 0.0059281689
## dti 0.0064742469
## delinq_2yrs 0.0847557528
## inq_last_6mths 0.0004772454
## mths_since_last_delinq -0.0979958421
## mths_since_last_record -0.0055533989
## open_acc 0.0144476057
## pub_rec 0.0169394696
## revol_bal -0.0079648997
## revol_util -0.0258469002
## total_acc -0.0233364306
## collections_12_mths_ex_med 1.0000000000
## mths_since_last_major_derog -0.1270235332
## mths_since_last_major_derog
## loan_amnt -0.019175444
## term 0.004729816
## int_rate -0.029554970
## installment -0.023985298
## annual_inc -0.042258911
## dti 0.032866184
## delinq_2yrs -0.395353997
## inq_last_6mths 0.012470810
## mths_since_last_delinq 0.689480972
## mths_since_last_record -0.007275000
## open_acc -0.013207424
## pub_rec 0.009900380
## revol_bal 0.002334282
## revol_util 0.022154891
## total_acc -0.006425679
## collections_12_mths_ex_med -0.127023533
## mths_since_last_major_derog 1.000000000
#the function chart.Correlation generates plots for each pair of variables; however, it loads very slowly
#chart.Correlation(full_data_no_categorical, method="spearman",histogram=TRUE)
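According to the correlation matrix, four pairs of variables stand out as the most strongly correlated. Ranked by absolute correlation, those pairs are loan_amnt and installment (0.96), mths_since_last_delinq and mths_since_last_major_derog (0.69), open_acc and total_acc (0.60), and delinq_2yrs and mths_since_last_delinq (-0.50).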
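Reading a 17 by 17 matrix by eye is error-prone, so a minimal sketch of pulling the strongly correlated pairs out programmatically may help. The helper name `high_cor_pairs` and the cutoff of 0.45 are assumptions chosen for illustration; the demo matrix reuses a few values printed above.

```r
# Hypothetical helper: list variable pairs whose |correlation| exceeds a cutoff.
high_cor_pairs <- function(cor_mat, cutoff = 0.45) {
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
                      var2 = colnames(cor_mat)[idx[, "col"]],
                      r = cor_mat[idx],
                      stringsAsFactors = FALSE)
  # strongest pairs first
  pairs[order(-abs(pairs$r)), ]
}

# Demo on a small matrix built from values in the printed correlation matrix.
vars <- c("loan_amnt", "installment", "open_acc", "total_acc")
m <- diag(4)
dimnames(m) <- list(vars, vars)
m["loan_amnt", "installment"] <- m["installment", "loan_amnt"] <- 0.960984857
m["open_acc", "total_acc"] <- m["total_acc", "open_acc"] <- 0.597454218
m["loan_amnt", "open_acc"] <- m["open_acc", "loan_amnt"] <- 0.145728246
m["installment", "open_acc"] <- m["open_acc", "installment"] <- 0.139589986
m["loan_amnt", "total_acc"] <- m["total_acc", "loan_amnt"] <- 0.106496607
m["installment", "total_acc"] <- m["total_acc", "installment"] <- 0.094812658

pairs_found <- high_cor_pairs(m)
pairs_found
```

Applied to the full matrix, the call would be `high_cor_pairs(cor(full_data_no_categorical))`. The cutoff is a judgment call: a different threshold would return a different number of pairs.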
Assumption 4 - Logistic regression assumes linearity of independent variables and log odds.
I tested linearity in the logit by running the Box-Tidwell transformation test. I added interaction terms to the logistic model, each being the product of an independent variable and its natural logarithm, X*ln(X). If such a term is significant, then there is non-linearity in the logit for that variable.
I stated the following hypotheses:
The null hypothesis: the log odds are linear in the independent variable. No transformation is needed.
The alternative hypothesis: the log odds are not linear in the independent variable. A transformation is needed.
The null hypothesis is rejected only for the total accounts variable (total_acc), whose p-value (0.00613) is less than the significance level of 5%. Hence a transformation is needed for the variable “total_acc”.
All other variables have p-values that are greater than the significance level of 5%, so no transformation is needed for them.
#replacing each variable with variable*log(variable)
for (i in 2:length(full_data_no_categorical_status)){
full_data_no_categorical_status[i] <- full_data_no_categorical_status[i]*log(full_data_no_categorical_status[i])
}
## Warning in FUN(X[[i]], ...): NaNs produced
model_linearity <- glm(formula = loan_status ~ ., family = binomial(link = "logit"),data = full_data_no_categorical_status)
summary(model_linearity)
##
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"),
## data = full_data_no_categorical_status)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1680 -1.1848 0.6907 0.8759 1.7380
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.730e+00 7.697e-01 2.247 0.02461 *
## loan_amnt -8.166e-07 1.163e-05 -0.070 0.94405
## term -6.410e-03 6.198e-03 -1.034 0.30101
## int_rate -6.800e-03 9.655e-03 -0.704 0.48122
## installment -1.222e-04 5.130e-04 -0.238 0.81175
## annual_inc 7.518e-08 1.210e-07 0.621 0.53432
## dti -4.857e-03 3.899e-03 -1.246 0.21289
## delinq_2yrs -4.295e-02 4.143e-02 -1.037 0.29979
## inq_last_6mths -1.550e-02 6.810e-02 -0.228 0.81995
## mths_since_last_delinq -1.782e-03 1.470e-03 -1.212 0.22543
## mths_since_last_record 4.471e-04 9.169e-04 0.488 0.62582
## open_acc -7.902e-03 8.436e-03 -0.937 0.34890
## pub_rec -2.440e-02 5.779e-02 -0.422 0.67287
## revol_bal 7.318e-08 9.358e-07 0.078 0.93767
## revol_util 2.125e-03 1.360e-03 1.563 0.11816
## total_acc 8.091e-03 2.952e-03 2.741 0.00613 **
## collections_12_mths_ex_med 1.149e-01 2.192e-01 0.524 0.60014
## mths_since_last_major_derog 5.865e-04 1.226e-03 0.479 0.63226
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.75 on 389 degrees of freedom
## Residual deviance: 458.77 on 372 degrees of freedom
## (35832 observations deleted due to missingness)
## AIC: 494.77
##
## Number of Fisher Scoring iterations: 4
#oslem.test(full_data$loan_status, fitted(y), g=10)
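For comparison, here is a minimal sketch of the Box-Tidwell check in its usual form, where each X*ln(X) term is added alongside the original predictor rather than replacing it. The data are simulated, and the variable names `x1`, `x2` and the coefficients are made up for illustration. The predictors are kept strictly positive because `0 * log(0)` evaluates to `NaN` in R, which is consistent with the warning and the large number of observations deleted due to missingness in the output above.

```r
set.seed(42)
n <- 2000
x1 <- runif(n, 1, 50)   # strictly positive, so log() is defined
x2 <- runif(n, 1, 50)
# Simulated truth (an assumption): the logit is linear in x1 but depends on log(x2).
y <- rbinom(n, 1, plogis(-5 + 0.04 * x1 + 1.5 * log(x2)))

# Keep the original predictors and add one x * log(x) term per predictor.
fit_bt <- glm(y ~ x1 + x2 + I(x1 * log(x1)) + I(x2 * log(x2)),
              family = binomial(link = "logit"))

# p-values of the added terms; a small value suggests non-linearity in the logit.
bt_table <- coef(summary(fit_bt))
bt_p <- bt_table[grep("log", rownames(bt_table)), "Pr(>|z|)"]
bt_p
```

In this simulation the term for `x2` should come out clearly significant, flagging the variable whose effect is not linear in the logit. Variables that contain zeros (for example counts such as delinq_2yrs) would need a shift or a different check before the logarithm can be taken.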
I replaced total accounts variable by its natural logarithm.
full_data <- full_data %>% mutate(total_acc = log(total_acc))
To find the best-fitting model, that is, the one that gives the best prediction of the response variable, one should use non-redundant explanatory variables. To decide which explanatory variables to include in the multiple logistic regression, I checked whether the explanatory variables are correlated with each other.
Two highly correlated variables should not both be in the final regression model.
To find the best regression model, I ran the step function, which implements a stepwise approach: starting from the null model, it adds or removes one variable at a time and keeps the change that gives the lowest AIC (Akaike information criterion) value, stopping when no change lowers the AIC further. Lower values of AIC indicate the preferred model, that is, the one with the fewest parameters that still provides an adequate fit to the data.
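As a small illustration of what the AIC value measures, the sketch below recomputes it by hand for a logistic model fitted to simulated data. The data and variable names are made up; only the formula AIC = -2 * log-likelihood + 2 * k, with k the number of estimated parameters, is being demonstrated.

```r
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial(link = "logit"))

k <- length(coef(fit))                       # number of estimated parameters
aic_manual <- -2 * as.numeric(logLik(fit)) + 2 * k

# For a 0/1 response the residual deviance equals -2 * log-likelihood,
# so the AIC is simply the deviance plus 2 per parameter.
aic_from_deviance <- deviance(fit) + 2 * k

c(manual = aic_manual, from_deviance = aic_from_deviance, builtin = AIC(fit))
```

The same relation can be read off the step output: the null model has a deviance of about 38610 and one parameter (the intercept), which gives the starting AIC of about 38612.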
model.null = glm(loan_status ~ 1,
data = full_data,
family = binomial(link="logit")
)
model.full = glm(loan_status ~ .,
data = full_data,
family = binomial(link="logit")
)
step(model.null,
scope = list(upper=model.full),
direction = "both",
test = "Chisq",
data = full_data)
## Start: AIC=38611.8
## loan_status ~ 1
##
## Df Deviance AIC LRT Pr(>Chi)
## + grade 6 36736 36750 1873.51 < 2.2e-16 ***
## + int_rate 1 37092 37096 1517.43 < 2.2e-16 ***
## + term 1 37555 37559 1054.57 < 2.2e-16 ***
## + dti 1 37891 37895 719.12 < 2.2e-16 ***
## + loan_amnt 1 37991 37995 618.82 < 2.2e-16 ***
## + installment 1 38155 38159 455.20 < 2.2e-16 ***
## + open_acc 1 38411 38415 199.08 < 2.2e-16 ***
## + verification_status 2 38411 38417 199.28 < 2.2e-16 ***
## + home_ownership 3 38411 38419 199.21 < 2.2e-16 ***
## + revol_util 1 38528 38532 81.78 < 2.2e-16 ***
## + inq_last_6mths 1 38535 38539 74.47 < 2.2e-16 ***
## + emp_length 11 38518 38542 92.03 6.667e-15 ***
## + delinq_2yrs 1 38569 38573 41.13 1.423e-10 ***
## + mths_since_last_delinq 1 38576 38580 33.71 6.407e-09 ***
## + annual_inc 1 38579 38583 30.62 3.134e-08 ***
## + collections_12_mths_ex_med 1 38587 38591 22.85 1.748e-06 ***
## + revol_bal 1 38594 38598 15.47 8.404e-05 ***
## + mths_since_last_major_derog 1 38604 38608 5.47 0.01938 *
## + pub_rec 1 38605 38609 5.12 0.02370 *
## + mths_since_last_record 1 38605 38609 4.54 0.03307 *
## <none> 38610 38612
## + total_acc 1 38610 38614 0.08 0.77779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=36750.29
## loan_status ~ grade
##
## Df Deviance AIC LRT Pr(>Chi)
## + dti 1 36323 36339 413.41 < 2.2e-16 ***
## + term 1 36533 36549 203.05 < 2.2e-16 ***
## + loan_amnt 1 36535 36551 201.53 < 2.2e-16 ***
## + home_ownership 3 36576 36596 160.23 < 2.2e-16 ***
## + installment 1 36612 36628 124.14 < 2.2e-16 ***
## + open_acc 1 36617 36633 119.23 < 2.2e-16 ***
## + emp_length 11 36632 36668 104.18 < 2.2e-16 ***
## + verification_status 2 36691 36709 45.74 1.168e-10 ***
## + revol_util 1 36704 36720 32.30 1.319e-08 ***
## + int_rate 1 36706 36722 30.59 3.193e-08 ***
## + annual_inc 1 36710 36726 26.55 2.563e-07 ***
## + collections_12_mths_ex_med 1 36716 36732 20.57 5.752e-06 ***
## + delinq_2yrs 1 36716 36732 20.31 6.598e-06 ***
## + revol_bal 1 36719 36735 17.63 2.679e-05 ***
## + mths_since_last_delinq 1 36720 36736 15.84 6.887e-05 ***
## + mths_since_last_record 1 36732 36748 4.31 0.03782 *
## + pub_rec 1 36733 36749 3.24 0.07175 .
## + inq_last_6mths 1 36733 36749 3.22 0.07266 .
## + mths_since_last_major_derog 1 36734 36750 2.25 0.13362
## <none> 36736 36750
## + total_acc 1 36736 36752 0.01 0.92726
## - grade 6 38610 38612 1873.51 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=36338.87
## loan_status ~ grade + dti
##
## Df Deviance AIC LRT Pr(>Chi)
## + loan_amnt 1 36085 36103 238.24 < 2.2e-16 ***
## + term 1 36103 36121 219.45 < 2.2e-16 ***
## + installment 1 36173 36191 149.44 < 2.2e-16 ***
## + home_ownership 3 36178 36200 144.96 < 2.2e-16 ***
## + emp_length 11 36237 36275 85.53 1.244e-13 ***
## + verification_status 2 36277 36297 45.86 1.102e-10 ***
## + open_acc 1 36284 36302 39.28 3.673e-10 ***
## + int_rate 1 36293 36311 29.83 4.707e-08 ***
## + delinq_2yrs 1 36300 36318 23.33 1.366e-06 ***
## + mths_since_last_delinq 1 36301 36319 21.66 3.262e-06 ***
## + collections_12_mths_ex_med 1 36303 36321 19.93 8.022e-06 ***
## + total_acc 1 36310 36328 12.99 0.0003134 ***
## + revol_util 1 36313 36331 10.04 0.0015328 **
## + pub_rec 1 36314 36332 8.41 0.0037321 **
## + revol_bal 1 36315 36333 7.52 0.0061019 **
## + mths_since_last_major_derog 1 36318 36336 5.31 0.0212583 *
## <none> 36323 36339
## + annual_inc 1 36322 36340 1.02 0.3124942
## + mths_since_last_record 1 36322 36340 0.62 0.4297031
## + inq_last_6mths 1 36322 36340 0.55 0.4592229
## - dti 1 36736 36750 413.41 < 2.2e-16 ***
## - grade 6 37891 37895 1567.81 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=36102.63
## loan_status ~ grade + dti + loan_amnt
##
## Df Deviance AIC LRT Pr(>Chi)
## + home_ownership 3 35876 35900 209.07 < 2.2e-16 ***
## + term 1 35985 36005 99.58 < 2.2e-16 ***
## + emp_length 11 35970 36010 114.82 < 2.2e-16 ***
## + installment 1 36017 36037 67.85 < 2.2e-16 ***
## + annual_inc 1 36042 36062 42.43 7.319e-11 ***
## + int_rate 1 36050 36070 34.53 4.204e-09 ***
## + total_acc 1 36052 36072 33.12 8.654e-09 ***
## + verification_status 2 36056 36078 28.25 7.347e-07 ***
## + delinq_2yrs 1 36060 36080 25.06 5.571e-07 ***
## + collections_12_mths_ex_med 1 36063 36083 21.19 4.150e-06 ***
## + mths_since_last_delinq 1 36065 36085 19.73 8.898e-06 ***
## + open_acc 1 36068 36088 16.60 4.619e-05 ***
## + pub_rec 1 36078 36098 6.43 0.01119 *
## + mths_since_last_major_derog 1 36080 36100 4.76 0.02920 *
## + revol_util 1 36082 36102 2.58 0.10856
## <none> 36085 36103
## + mths_since_last_record 1 36083 36103 1.26 0.26086
## + revol_bal 1 36084 36104 0.56 0.45461
## + inq_last_6mths 1 36084 36104 0.50 0.47748
## - loan_amnt 1 36323 36339 238.24 < 2.2e-16 ***
## - dti 1 36535 36551 450.12 < 2.2e-16 ***
## - grade 6 37243 37249 1158.75 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35899.56
## loan_status ~ grade + dti + loan_amnt + home_ownership
##
## Df Deviance AIC LRT Pr(>Chi)
## + term 1 35760 35786 115.67 < 2.2e-16 ***
## + installment 1 35795 35821 80.31 < 2.2e-16 ***
## + emp_length 11 35778 35824 97.28 6.162e-16 ***
## + int_rate 1 35839 35865 36.45 1.564e-09 ***
## + delinq_2yrs 1 35846 35872 29.99 4.342e-08 ***
## + annual_inc 1 35846 35872 29.75 4.916e-08 ***
## + mths_since_last_delinq 1 35850 35876 25.42 4.605e-07 ***
## + open_acc 1 35853 35879 22.86 1.741e-06 ***
## + total_acc 1 35854 35880 21.54 3.471e-06 ***
## + collections_12_mths_ex_med 1 35856 35882 19.21 1.172e-05 ***
## + verification_status 2 35858 35886 17.20 0.0001842 ***
## + revol_util 1 35868 35894 7.09 0.0077667 **
## + pub_rec 1 35870 35896 5.75 0.0164890 *
## + mths_since_last_major_derog 1 35870 35896 5.13 0.0235422 *
## <none> 35876 35900
## + mths_since_last_record 1 35874 35900 1.53 0.2156855
## + inq_last_6mths 1 35874 35900 1.51 0.2189319
## + revol_bal 1 35876 35902 0.02 0.8815250
## - home_ownership 3 36085 36103 209.07 < 2.2e-16 ***
## - loan_amnt 1 36178 36200 302.34 < 2.2e-16 ***
## - dti 1 36312 36334 435.95 < 2.2e-16 ***
## - grade 6 36965 36977 1089.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35785.89
## loan_status ~ grade + dti + loan_amnt + home_ownership + term
##
## Df Deviance AIC LRT Pr(>Chi)
## + emp_length 11 35655 35703 104.45 < 2.2e-16 ***
## + int_rate 1 35720 35748 40.23 2.260e-10 ***
## + delinq_2yrs 1 35725 35753 35.21 2.953e-09 ***
## + mths_since_last_delinq 1 35729 35757 30.54 3.277e-08 ***
## + total_acc 1 35734 35762 26.30 2.927e-07 ***
## + open_acc 1 35738 35766 21.81 3.004e-06 ***
## + annual_inc 1 35738 35766 21.62 3.323e-06 ***
## + collections_12_mths_ex_med 1 35739 35767 20.61 5.622e-06 ***
## + verification_status 2 35741 35771 18.75 8.494e-05 ***
## + revol_util 1 35751 35779 9.11 0.002537 **
## + pub_rec 1 35751 35779 8.66 0.003257 **
## + mths_since_last_major_derog 1 35754 35782 6.37 0.011587 *
## + inq_last_6mths 1 35754 35782 6.16 0.013033 *
## <none> 35760 35786
## + installment 1 35758 35786 1.74 0.187287
## + mths_since_last_record 1 35759 35787 0.64 0.425373
## + revol_bal 1 35760 35788 0.19 0.663781
## - term 1 35876 35900 115.67 < 2.2e-16 ***
## - loan_amnt 1 35918 35942 158.49 < 2.2e-16 ***
## - home_ownership 3 35985 36005 225.17 < 2.2e-16 ***
## - dti 1 36198 36222 438.08 < 2.2e-16 ***
## - grade 6 36437 36451 677.60 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35703.44
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length
##
## Df Deviance AIC LRT Pr(>Chi)
## + int_rate 1 35615 35665 40.55 1.914e-10 ***
## + delinq_2yrs 1 35619 35669 36.30 1.693e-09 ***
## + mths_since_last_delinq 1 35624 35674 31.77 1.738e-08 ***
## + open_acc 1 35629 35679 26.21 3.062e-07 ***
## + total_acc 1 35632 35682 23.55 1.216e-06 ***
## + collections_12_mths_ex_med 1 35635 35685 20.21 6.941e-06 ***
## + annual_inc 1 35642 35692 13.51 0.0002377 ***
## + verification_status 2 35641 35693 14.75 0.0006265 ***
## + revol_util 1 35644 35694 11.93 0.0005535 ***
## + pub_rec 1 35647 35697 8.68 0.0032252 **
## + inq_last_6mths 1 35648 35698 7.24 0.0071142 **
## + mths_since_last_major_derog 1 35649 35699 6.13 0.0132866 *
## + installment 1 35653 35703 2.17 0.1408510
## <none> 35655 35703
## + mths_since_last_record 1 35655 35705 0.74 0.3892936
## + revol_bal 1 35655 35705 0.40 0.5268580
## - emp_length 11 35760 35786 104.45 < 2.2e-16 ***
## - term 1 35778 35824 122.84 < 2.2e-16 ***
## - loan_amnt 1 35832 35878 177.01 < 2.2e-16 ***
## - home_ownership 3 35863 35905 207.19 < 2.2e-16 ***
## - dti 1 36072 36118 416.37 < 2.2e-16 ***
## - grade 6 36330 36366 674.89 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35664.89
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate
##
## Df Deviance AIC LRT Pr(>Chi)
## + delinq_2yrs 1 35579 35631 36.24 1.747e-09 ***
## + mths_since_last_delinq 1 35581 35633 33.40 7.486e-09 ***
## + total_acc 1 35589 35641 25.93 3.543e-07 ***
## + open_acc 1 35590 35642 25.27 4.973e-07 ***
## + collections_12_mths_ex_med 1 35596 35648 19.22 1.164e-05 ***
## + annual_inc 1 35600 35652 15.12 0.0001008 ***
## + verification_status 2 35598 35652 16.59 0.0002491 ***
## + revol_util 1 35601 35653 13.66 0.0002190 ***
## + installment 1 35602 35654 12.82 0.0003428 ***
## + inq_last_6mths 1 35606 35658 8.50 0.0035595 **
## + pub_rec 1 35607 35659 8.10 0.0044290 **
## + mths_since_last_major_derog 1 35608 35660 7.19 0.0073372 **
## <none> 35615 35665
## + mths_since_last_record 1 35613 35665 1.51 0.2198582
## + revol_bal 1 35615 35667 0.36 0.5494714
## - int_rate 1 35655 35703 40.55 1.914e-10 ***
## - emp_length 11 35720 35748 104.77 < 2.2e-16 ***
## - term 1 35742 35790 126.80 < 2.2e-16 ***
## - loan_amnt 1 35795 35843 179.99 < 2.2e-16 ***
## - home_ownership 3 35825 35869 209.68 < 2.2e-16 ***
## - grade 6 35882 35920 267.29 < 2.2e-16 ***
## - dti 1 36031 36079 415.74 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35630.65
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs
##
## Df Deviance AIC LRT Pr(>Chi)
## + total_acc 1 35549 35603 29.31 6.172e-08 ***
## + open_acc 1 35556 35610 22.16 2.508e-06 ***
## + annual_inc 1 35561 35615 17.57 2.764e-05 ***
## + verification_status 2 35561 35617 17.66 0.0001463 ***
## + collections_12_mths_ex_med 1 35563 35617 15.21 9.634e-05 ***
## + revol_util 1 35564 35618 14.24 0.0001608 ***
## + installment 1 35565 35619 13.87 0.0001956 ***
## + mths_since_last_delinq 1 35568 35622 10.68 0.0010834 **
## + pub_rec 1 35571 35625 8.13 0.0043559 **
## + inq_last_6mths 1 35571 35625 8.03 0.0046057 **
## <none> 35579 35631
## + mths_since_last_record 1 35577 35631 1.70 0.1923600
## + revol_bal 1 35578 35632 0.34 0.5615886
## + mths_since_last_major_derog 1 35579 35633 0.13 0.7210945
## - delinq_2yrs 1 35615 35665 36.24 1.747e-09 ***
## - int_rate 1 35619 35669 40.49 1.975e-10 ***
## - emp_length 11 35685 35715 105.92 < 2.2e-16 ***
## - term 1 35711 35761 132.52 < 2.2e-16 ***
## - loan_amnt 1 35759 35809 180.22 < 2.2e-16 ***
## - home_ownership 3 35794 35840 215.72 < 2.2e-16 ***
## - grade 6 35841 35881 261.98 < 2.2e-16 ***
## - dti 1 35998 36048 419.09 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35603.34
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc
##
## Df Deviance AIC LRT Pr(>Chi)
## + open_acc 1 35461 35517 88.44 < 2.2e-16 ***
## + inq_last_6mths 1 35533 35589 16.32 5.357e-05 ***
## + verification_status 2 35531 35589 18.04 0.0001210 ***
## + installment 1 35534 35590 14.85 0.0001165 ***
## + collections_12_mths_ex_med 1 35535 35591 14.01 0.0001816 ***
## + annual_inc 1 35537 35593 12.45 0.0004186 ***
## + mths_since_last_delinq 1 35540 35596 9.75 0.0017907 **
## + revol_util 1 35540 35596 9.72 0.0018231 **
## + pub_rec 1 35543 35599 6.47 0.0109500 *
## <none> 35549 35603
## + revol_bal 1 35549 35605 0.63 0.4288168
## + mths_since_last_record 1 35549 35605 0.25 0.6191527
## + mths_since_last_major_derog 1 35549 35605 0.06 0.8066411
## - total_acc 1 35579 35631 29.31 6.172e-08 ***
## - delinq_2yrs 1 35589 35641 39.62 3.090e-10 ***
## - int_rate 1 35592 35644 43.04 5.353e-11 ***
## - emp_length 11 35652 35684 102.83 < 2.2e-16 ***
## - term 1 35688 35740 138.28 < 2.2e-16 ***
## - loan_amnt 1 35742 35794 193.11 < 2.2e-16 ***
## - home_ownership 3 35753 35801 203.54 < 2.2e-16 ***
## - grade 6 35812 35854 262.46 < 2.2e-16 ***
## - dti 1 35996 36048 446.60 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35516.9
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc
##
## Df Deviance AIC LRT Pr(>Chi)
## + annual_inc 1 35444 35502 16.95 3.831e-05 ***
## + verification_status 2 35443 35503 17.94 0.0001270 ***
## + revol_util 1 35446 35504 15.11 0.0001013 ***
## + installment 1 35448 35506 13.30 0.0002654 ***
## + collections_12_mths_ex_med 1 35449 35507 11.96 0.0005429 ***
## + inq_last_6mths 1 35449 35507 11.50 0.0006968 ***
## + mths_since_last_delinq 1 35453 35511 8.08 0.0044820 **
## + pub_rec 1 35455 35513 5.42 0.0199163 *
## <none> 35461 35517
## + mths_since_last_record 1 35460 35518 0.53 0.4659882
## + revol_bal 1 35461 35519 0.08 0.7808971
## + mths_since_last_major_derog 1 35461 35519 0.04 0.8473246
## - delinq_2yrs 1 35497 35551 36.06 1.917e-09 ***
## - int_rate 1 35505 35559 44.13 3.069e-11 ***
## - open_acc 1 35549 35603 88.44 < 2.2e-16 ***
## - emp_length 11 35571 35605 109.85 < 2.2e-16 ***
## - total_acc 1 35556 35610 95.59 < 2.2e-16 ***
## - term 1 35603 35657 142.44 < 2.2e-16 ***
## - loan_amnt 1 35634 35688 173.30 < 2.2e-16 ***
## - home_ownership 3 35667 35717 205.94 < 2.2e-16 ***
## - grade 6 35721 35765 260.13 < 2.2e-16 ***
## - dti 1 35828 35882 367.39 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35501.95
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc
##
## Df Deviance AIC LRT Pr(>Chi)
## + revol_util 1 35426 35486 18.017 2.190e-05 ***
## + verification_status 2 35425 35487 18.649 8.923e-05 ***
## + inq_last_6mths 1 35431 35491 13.116 0.0002928 ***
## + installment 1 35431 35491 12.831 0.0003409 ***
## + collections_12_mths_ex_med 1 35432 35492 12.169 0.0004858 ***
## + mths_since_last_delinq 1 35434 35494 9.461 0.0020994 **
## + pub_rec 1 35437 35497 6.927 0.0084898 **
## <none> 35444 35502
## + mths_since_last_record 1 35443 35503 0.932 0.3344409
## + revol_bal 1 35444 35504 0.412 0.5207704
## + mths_since_last_major_derog 1 35444 35504 0.070 0.7917278
## - annual_inc 1 35461 35517 16.953 3.831e-05 ***
## - delinq_2yrs 1 35482 35538 38.022 6.995e-10 ***
## - int_rate 1 35490 35546 45.719 1.365e-11 ***
## - emp_length 11 35544 35580 100.496 < 2.2e-16 ***
## - total_acc 1 35533 35589 89.381 < 2.2e-16 ***
## - open_acc 1 35537 35593 92.950 < 2.2e-16 ***
## - term 1 35577 35633 133.428 < 2.2e-16 ***
## - loan_amnt 1 35632 35688 187.584 < 2.2e-16 ***
## - home_ownership 3 35641 35693 197.082 < 2.2e-16 ***
## - grade 6 35705 35751 261.289 < 2.2e-16 ***
## - dti 1 35722 35778 278.227 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35485.93
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util
##
## Df Deviance AIC LRT Pr(>Chi)
## + verification_status 2 35407 35471 18.762 8.433e-05 ***
## + inq_last_6mths 1 35410 35472 15.840 6.893e-05 ***
## + collections_12_mths_ex_med 1 35413 35475 13.213 0.0002781 ***
## + installment 1 35414 35476 12.215 0.0004741 ***
## + mths_since_last_delinq 1 35417 35479 9.397 0.0021730 **
## + pub_rec 1 35419 35481 7.163 0.0074443 **
## <none> 35426 35486
## + mths_since_last_record 1 35425 35487 1.222 0.2690492
## + mths_since_last_major_derog 1 35426 35488 0.155 0.6936460
## + revol_bal 1 35426 35488 0.096 0.7563124
## - revol_util 1 35444 35502 18.017 2.190e-05 ***
## - annual_inc 1 35446 35504 19.859 8.339e-06 ***
## - delinq_2yrs 1 35464 35522 38.430 5.675e-10 ***
## - int_rate 1 35474 35532 47.786 4.755e-12 ***
## - total_acc 1 35510 35568 83.743 < 2.2e-16 ***
## - emp_length 11 35530 35568 103.767 < 2.2e-16 ***
## - open_acc 1 35525 35583 99.443 < 2.2e-16 ***
## - term 1 35562 35620 135.795 < 2.2e-16 ***
## - loan_amnt 1 35605 35663 178.762 < 2.2e-16 ***
## - home_ownership 3 35631 35685 205.007 < 2.2e-16 ***
## - dti 1 35668 35726 242.556 < 2.2e-16 ***
## - grade 6 35687 35735 261.083 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35471.17
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status
##
## Df Deviance AIC LRT Pr(>Chi)
## + inq_last_6mths 1 35392 35458 15.538 8.086e-05 ***
## + collections_12_mths_ex_med 1 35394 35460 12.862 0.0003353 ***
## + installment 1 35395 35461 12.024 0.0005253 ***
## + mths_since_last_delinq 1 35398 35464 9.411 0.0021564 **
## + pub_rec 1 35401 35467 6.601 0.0101937 *
## <none> 35407 35471
## + mths_since_last_record 1 35406 35472 0.718 0.3968847
## + mths_since_last_major_derog 1 35407 35473 0.106 0.7446273
## + revol_bal 1 35407 35473 0.080 0.7769471
## - verification_status 2 35426 35486 18.762 8.433e-05 ***
## - revol_util 1 35425 35487 18.130 2.064e-05 ***
## - annual_inc 1 35428 35490 20.626 5.583e-06 ***
## - delinq_2yrs 1 35447 35509 39.584 3.142e-10 ***
## - int_rate 1 35457 35519 50.127 1.441e-12 ***
## - emp_length 11 35504 35546 96.916 7.270e-16 ***
## - total_acc 1 35491 35553 84.027 < 2.2e-16 ***
## - open_acc 1 35507 35569 99.465 < 2.2e-16 ***
## - term 1 35544 35606 136.824 < 2.2e-16 ***
## - loan_amnt 1 35573 35635 166.071 < 2.2e-16 ***
## - home_ownership 3 35601 35659 193.650 < 2.2e-16 ***
## - dti 1 35649 35711 242.280 < 2.2e-16 ***
## - grade 6 35669 35721 261.764 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35457.63
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths
##
## Df Deviance AIC LRT Pr(>Chi)
## + collections_12_mths_ex_med 1 35379 35447 12.900 0.0003287 ***
## + installment 1 35380 35448 11.549 0.0006777 ***
## + mths_since_last_delinq 1 35381 35449 10.184 0.0014166 **
## + pub_rec 1 35385 35453 6.584 0.0102926 *
## <none> 35392 35458
## + mths_since_last_record 1 35391 35459 0.691 0.4056797
## + mths_since_last_major_derog 1 35391 35459 0.199 0.6553885
## + revol_bal 1 35392 35460 0.077 0.7813563
## - inq_last_6mths 1 35407 35471 15.538 8.086e-05 ***
## - verification_status 2 35410 35472 18.460 9.808e-05 ***
## - revol_util 1 35412 35476 20.808 5.078e-06 ***
## - annual_inc 1 35414 35478 22.829 1.771e-06 ***
## - delinq_2yrs 1 35431 35495 39.728 2.918e-10 ***
## - int_rate 1 35444 35508 52.401 4.525e-13 ***
## - emp_length 11 35490 35534 98.325 3.830e-16 ***
## - total_acc 1 35483 35547 91.762 < 2.2e-16 ***
## - open_acc 1 35486 35550 94.584 < 2.2e-16 ***
## - term 1 35538 35602 146.066 < 2.2e-16 ***
## - loan_amnt 1 35566 35630 174.493 < 2.2e-16 ***
## - home_ownership 3 35588 35648 196.679 < 2.2e-16 ***
## - grade 6 35642 35696 250.275 < 2.2e-16 ***
## - dti 1 35641 35705 249.231 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35446.73
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med
##
## Df Deviance AIC LRT Pr(>Chi)
## + installment 1 35367 35437 11.426 0.0007244 ***
## + mths_since_last_delinq 1 35370 35440 8.730 0.0031297 **
## + pub_rec 1 35372 35442 6.408 0.0113585 *
## <none> 35379 35447
## + mths_since_last_record 1 35378 35448 0.636 0.4250090
## + revol_bal 1 35379 35449 0.069 0.7925676
## + mths_since_last_major_derog 1 35379 35449 0.003 0.9571168
## - collections_12_mths_ex_med 1 35392 35458 12.900 0.0003287 ***
## - inq_last_6mths 1 35394 35460 15.575 7.929e-05 ***
## - verification_status 2 35397 35461 18.119 0.0001163 ***
## - revol_util 1 35401 35467 21.901 2.871e-06 ***
## - annual_inc 1 35402 35468 23.158 1.492e-06 ***
## - delinq_2yrs 1 35415 35481 35.800 2.187e-09 ***
## - int_rate 1 35430 35496 51.395 7.553e-13 ***
## - emp_length 11 35477 35523 98.076 4.290e-16 ***
## - total_acc 1 35467 35533 88.628 < 2.2e-16 ***
## - open_acc 1 35471 35537 92.665 < 2.2e-16 ***
## - term 1 35526 35592 146.832 < 2.2e-16 ***
## - loan_amnt 1 35553 35619 174.724 < 2.2e-16 ***
## - home_ownership 3 35574 35636 195.616 < 2.2e-16 ***
## - grade 6 35627 35683 248.595 < 2.2e-16 ***
## - dti 1 35627 35693 247.841 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35437.31
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med + installment
##
## Df Deviance AIC LRT Pr(>Chi)
## + mths_since_last_delinq 1 35358 35430 8.843 0.0029421 **
## + pub_rec 1 35361 35433 6.230 0.0125635 *
## - loan_amnt 1 35369 35437 1.743 0.1867110
## <none> 35367 35437
## + mths_since_last_record 1 35367 35439 0.635 0.4256166
## + revol_bal 1 35367 35439 0.043 0.8360434
## + mths_since_last_major_derog 1 35367 35439 0.004 0.9483777
## - installment 1 35379 35447 11.426 0.0007244 ***
## - collections_12_mths_ex_med 1 35380 35448 12.776 0.0003512 ***
## - inq_last_6mths 1 35382 35450 15.100 0.0001019 ***
## - verification_status 2 35385 35451 17.928 0.0001279 ***
## - revol_util 1 35388 35456 21.192 4.156e-06 ***
## - annual_inc 1 35390 35458 22.527 2.072e-06 ***
## - delinq_2yrs 1 35404 35472 36.783 1.320e-09 ***
## - int_rate 1 35429 35497 61.729 3.942e-15 ***
## - term 1 35438 35506 70.462 < 2.2e-16 ***
## - emp_length 11 35466 35514 99.139 2.645e-16 ***
## - total_acc 1 35456 35524 89.115 < 2.2e-16 ***
## - open_acc 1 35458 35526 91.064 < 2.2e-16 ***
## - home_ownership 3 35563 35627 195.359 < 2.2e-16 ***
## - grade 6 35622 35680 254.949 < 2.2e-16 ***
## - dti 1 35615 35683 247.918 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35430.46
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med + installment + mths_since_last_delinq
##
## Df Deviance AIC LRT Pr(>Chi)
## + pub_rec 1 35352 35426 6.333 0.0118483 *
## + mths_since_last_major_derog 1 35354 35428 4.811 0.0282777 *
## - loan_amnt 1 35360 35430 1.785 0.1815103
## <none> 35358 35430
## + mths_since_last_record 1 35358 35432 0.735 0.3912658
## + revol_bal 1 35358 35432 0.034 0.8538711
## - mths_since_last_delinq 1 35367 35437 8.843 0.0029421 **
## - collections_12_mths_ex_med 1 35370 35440 11.318 0.0007676 ***
## - installment 1 35370 35440 11.538 0.0006818 ***
## - delinq_2yrs 1 35374 35444 15.110 0.0001014 ***
## - inq_last_6mths 1 35374 35444 15.814 6.988e-05 ***
## - verification_status 2 35376 35444 17.923 0.0001282 ***
## - revol_util 1 35380 35450 21.129 4.293e-06 ***
## - annual_inc 1 35383 35453 24.078 9.253e-07 ***
## - int_rate 1 35422 35492 63.075 1.989e-15 ***
## - term 1 35430 35500 71.254 < 2.2e-16 ***
## - emp_length 11 35458 35508 99.248 2.516e-16 ***
## - total_acc 1 35445 35515 87.009 < 2.2e-16 ***
## - open_acc 1 35448 35518 89.501 < 2.2e-16 ***
## - home_ownership 3 35556 35622 197.704 < 2.2e-16 ***
## - grade 6 35615 35675 256.050 < 2.2e-16 ***
## - dti 1 35607 35677 248.777 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35426.13
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med + installment + mths_since_last_delinq +
## pub_rec
##
## Df Deviance AIC LRT Pr(>Chi)
## + mths_since_last_major_derog 1 35348 35424 4.592 0.0321127 *
## - loan_amnt 1 35354 35426 1.721 0.1895678
## <none> 35352 35426
## + mths_since_last_record 1 35352 35428 0.025 0.8740707
## + revol_bal 1 35352 35428 0.025 0.8749050
## - pub_rec 1 35358 35430 6.333 0.0118483 *
## - mths_since_last_delinq 1 35361 35433 8.947 0.0027794 **
## - collections_12_mths_ex_med 1 35363 35435 11.147 0.0008416 ***
## - installment 1 35363 35435 11.358 0.0007511 ***
## - delinq_2yrs 1 35367 35439 15.069 0.0001037 ***
## - verification_status 2 35370 35440 17.391 0.0001673 ***
## - inq_last_6mths 1 35368 35440 15.806 7.018e-05 ***
## - revol_util 1 35373 35445 21.363 3.799e-06 ***
## - annual_inc 1 35378 35450 25.802 3.784e-07 ***
## - int_rate 1 35414 35486 62.319 2.921e-15 ***
## - term 1 35424 35496 71.588 < 2.2e-16 ***
## - emp_length 11 35451 35503 98.991 2.828e-16 ***
## - total_acc 1 35436 35508 83.578 < 2.2e-16 ***
## - open_acc 1 35441 35513 88.662 < 2.2e-16 ***
## - home_ownership 3 35549 35617 197.305 < 2.2e-16 ***
## - grade 6 35606 35668 254.164 < 2.2e-16 ***
## - dti 1 35602 35674 249.957 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35423.54
## loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med + installment + mths_since_last_delinq +
## pub_rec + mths_since_last_major_derog
##
## Df Deviance AIC LRT Pr(>Chi)
## - loan_amnt 1 35349 35423 1.713 0.1906389
## <none> 35348 35424
## + revol_bal 1 35348 35426 0.027 0.8701625
## + mths_since_last_record 1 35348 35426 0.021 0.8842465
## - mths_since_last_major_derog 1 35352 35426 4.592 0.0321127 *
## - pub_rec 1 35354 35428 6.115 0.0134052 *
## - installment 1 35359 35433 11.367 0.0007475 ***
## - collections_12_mths_ex_med 1 35360 35434 12.325 0.0004468 ***
## - mths_since_last_delinq 1 35361 35435 13.526 0.0002353 ***
## - inq_last_6mths 1 35363 35437 15.535 8.099e-05 ***
## - verification_status 2 35365 35437 17.769 0.0001385 ***
## - delinq_2yrs 1 35364 35438 16.460 4.969e-05 ***
## - revol_util 1 35368 35442 20.557 5.789e-06 ***
## - annual_inc 1 35374 35448 26.166 3.132e-07 ***
## - int_rate 1 35409 35483 61.428 4.592e-15 ***
## - term 1 35419 35493 71.699 < 2.2e-16 ***
## - emp_length 11 35447 35501 99.438 2.307e-16 ***
## - total_acc 1 35431 35505 83.341 < 2.2e-16 ***
## - open_acc 1 35435 35509 87.893 < 2.2e-16 ***
## - home_ownership 3 35546 35616 198.562 < 2.2e-16 ***
## - grade 6 35599 35663 251.788 < 2.2e-16 ***
## - dti 1 35596 35670 248.500 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=35423.25
## loan_status ~ grade + dti + home_ownership + term + emp_length +
## int_rate + delinq_2yrs + total_acc + open_acc + annual_inc +
## revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med +
## installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 35349 35423
## + loan_amnt 1 35348 35424 1.713 0.1906389
## + revol_bal 1 35349 35425 0.041 0.8398013
## + mths_since_last_record 1 35349 35425 0.020 0.8872851
## - mths_since_last_major_derog 1 35354 35426 4.601 0.0319576 *
## - pub_rec 1 35355 35427 6.177 0.0129404 *
## - collections_12_mths_ex_med 1 35362 35434 12.380 0.0004340 ***
## - mths_since_last_delinq 1 35363 35435 13.495 0.0002392 ***
## - verification_status 2 35367 35437 17.688 0.0001443 ***
## - inq_last_6mths 1 35365 35437 15.890 6.712e-05 ***
## - delinq_2yrs 1 35366 35438 16.297 5.414e-05 ***
## - revol_util 1 35370 35442 20.714 5.331e-06 ***
## - annual_inc 1 35377 35449 27.468 1.597e-07 ***
## - int_rate 1 35410 35482 60.858 6.134e-15 ***
## - emp_length 11 35448 35500 99.167 2.611e-16 ***
## - total_acc 1 35432 35504 83.224 < 2.2e-16 ***
## - open_acc 1 35438 35510 88.310 < 2.2e-16 ***
## - installment 1 35534 35606 184.789 < 2.2e-16 ***
## - home_ownership 3 35548 35616 199.092 < 2.2e-16 ***
## - grade 6 35599 35661 250.098 < 2.2e-16 ***
## - dti 1 35597 35669 248.126 < 2.2e-16 ***
## - term 1 35647 35719 297.572 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call: glm(formula = loan_status ~ grade + dti + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med + installment + mths_since_last_delinq +
## pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"),
## data = full_data)
##
## Coefficients:
## (Intercept) gradeB
## 2.588e+00 -5.584e-01
## gradeC gradeD
## -1.132e+00 -1.636e+00
## gradeE gradeF
## -2.134e+00 -2.398e+00
## gradeG dti
## -2.676e+00 -2.763e-02
## home_ownershipOWN home_ownershipRENT
## -2.802e-01 -4.040e-01
## home_ownershipANY term
## -1.010e+00 -2.503e-02
## emp_length1 year emp_length10+ years
## 6.807e-02 2.986e-01
## emp_length2 years emp_length3 years
## 1.523e-01 1.715e-01
## emp_length4 years emp_length5 years
## 1.581e-01 1.773e-01
## emp_length6 years emp_length7 years
## 2.098e-01 2.804e-01
## emp_length8 years emp_length9 years
## 1.543e-01 1.656e-01
## emp_lengthn/a int_rate
## -1.693e-01 7.610e-02
## delinq_2yrs total_acc
## -5.004e-02 3.714e-01
## open_acc annual_inc
## -3.107e-02 1.888e-06
## revol_util verification_statusSource Verified
## -2.896e-03 -1.526e-01
## verification_statusVerified inq_last_6mths
## -1.300e-01 -4.835e-02
## collections_12_mths_ex_med installment
## -2.130e-01 -8.861e-04
## mths_since_last_delinq pub_rec
## 3.105e-03 -3.710e-02
## mths_since_last_major_derog
## -1.688e-03
##
## Degrees of Freedom: 36221 Total (i.e. Null); 36185 Residual
## Null Deviance: 38610
## Residual Deviance: 35350 AIC: 35420
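The AIC values in the trace can be checked by hand. For a logistic regression fitted to individual 0/1 outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below is only an arithmetic check on the rounded figures printed above; the variable names are illustrative and are not part of the original analysis.

```r
# Figures copied from the final step of the trace (rounded as printed)
n_obs     <- 36221 + 1   # null degrees of freedom + 1
resid_df  <- 36185       # residual degrees of freedom of the final model
resid_dev <- 35349       # residual deviance of the final model

# Number of estimated coefficients, including the intercept
k <- n_obs - resid_df

# AIC = residual deviance + 2 * number of coefficients
aic <- resid_dev + 2 * k

k
aic
```

The result agrees with the AIC reported for the final step up to rounding of the deviance.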
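The step function selects variables by AIC. It never added revol_bal or mths_since_last_record, because neither lowered the AIC, and it dropped loan_amnt once installment was in the model, since loan_amnt then became redundant (the two variables are highly correlated). The best model is shown below: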
#optimal model
final.model <- glm(formula = loan_status ~ grade + dti + home_ownership + term + emp_length + int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"), data = full_data)
summary(final.model)
##
## Call:
## glm(formula = loan_status ~ grade + dti + home_ownership + term +
## emp_length + int_rate + delinq_2yrs + total_acc + open_acc +
## annual_inc + revol_util + verification_status + inq_last_6mths +
## collections_12_mths_ex_med + installment + mths_since_last_delinq +
## pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"),
## data = full_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5328 0.3621 0.5536 0.7259 2.4219
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.588e+00 1.798e-01 14.389 < 2e-16
## gradeB -5.584e-01 8.513e-02 -6.560 5.40e-11
## gradeC -1.132e+00 9.916e-02 -11.418 < 2e-16
## gradeD -1.636e+00 1.240e-01 -13.196 < 2e-16
## gradeE -2.134e+00 1.515e-01 -14.084 < 2e-16
## gradeF -2.398e+00 1.911e-01 -12.551 < 2e-16
## gradeG -2.676e+00 2.348e-01 -11.398 < 2e-16
## dti -2.763e-02 1.759e-03 -15.701 < 2e-16
## home_ownershipOWN -2.802e-01 4.495e-02 -6.234 4.53e-10
## home_ownershipRENT -4.040e-01 2.900e-02 -13.930 < 2e-16
## home_ownershipANY -1.010e+00 9.549e-01 -1.057 0.290405
## term -2.503e-02 1.438e-03 -17.405 < 2e-16
## emp_length1 year 6.807e-02 7.434e-02 0.916 0.359871
## emp_length10+ years 2.986e-01 5.545e-02 5.385 7.25e-08
## emp_length2 years 1.523e-01 6.834e-02 2.228 0.025867
## emp_length3 years 1.715e-01 6.925e-02 2.476 0.013284
## emp_length4 years 1.581e-01 7.311e-02 2.163 0.030550
## emp_length5 years 1.773e-01 7.351e-02 2.411 0.015896
## emp_length6 years 2.098e-01 7.919e-02 2.650 0.008060
## emp_length7 years 2.804e-01 8.161e-02 3.436 0.000589
## emp_length8 years 1.543e-01 7.971e-02 1.936 0.052917
## emp_length9 years 1.656e-01 8.530e-02 1.941 0.052261
## emp_lengthn/a -1.693e-01 6.913e-02 -2.449 0.014320
## int_rate 7.610e-02 9.796e-03 7.768 7.97e-15
## delinq_2yrs -5.004e-02 1.218e-02 -4.110 3.96e-05
## total_acc 3.714e-01 4.076e-02 9.112 < 2e-16
## open_acc -3.107e-02 3.293e-03 -9.435 < 2e-16
## annual_inc 1.888e-06 3.749e-07 5.037 4.73e-07
## revol_util -2.896e-03 6.365e-04 -4.550 5.38e-06
## verification_statusSource Verified -1.526e-01 3.703e-02 -4.120 3.79e-05
## verification_statusVerified -1.300e-01 3.934e-02 -3.304 0.000954
## inq_last_6mths -4.835e-02 1.208e-02 -4.001 6.30e-05
## collections_12_mths_ex_med -2.130e-01 5.955e-02 -3.577 0.000348
## installment -8.861e-04 6.497e-05 -13.639 < 2e-16
## mths_since_last_delinq 3.105e-03 8.437e-04 3.680 0.000234
## pub_rec -3.710e-02 1.476e-02 -2.513 0.011958
## mths_since_last_major_derog -1.688e-03 7.846e-04 -2.151 0.031456
##
## (Intercept) ***
## gradeB ***
## gradeC ***
## gradeD ***
## gradeE ***
## gradeF ***
## gradeG ***
## dti ***
## home_ownershipOWN ***
## home_ownershipRENT ***
## home_ownershipANY
## term ***
## emp_length1 year
## emp_length10+ years ***
## emp_length2 years *
## emp_length3 years *
## emp_length4 years *
## emp_length5 years *
## emp_length6 years **
## emp_length7 years ***
## emp_length8 years .
## emp_length9 years .
## emp_lengthn/a *
## int_rate ***
## delinq_2yrs ***
## total_acc ***
## open_acc ***
## annual_inc ***
## revol_util ***
## verification_statusSource Verified ***
## verification_statusVerified ***
## inq_last_6mths ***
## collections_12_mths_ex_med ***
## installment ***
## mths_since_last_delinq ***
## pub_rec *
## mths_since_last_major_derog *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38610 on 36221 degrees of freedom
## Residual deviance: 35349 on 36185 degrees of freedom
## AIC: 35423
##
## Number of Fisher Scoring iterations: 4
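Looking at the summary, the coefficients for “home_ownershipANY”, “emp_length1 year”, “emp_length8 years” and “emp_length9 years” are not statistically significant because their p-values are greater than the 0.05 level of significance. This means that these levels cannot be distinguished from the reference level of the same factor. The factors home_ownership and emp_length as a whole remain highly significant in the step output.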
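To show where the p-values in the summary come from, the sketch below recomputes the Wald z statistic and the two-sided p-value for home_ownershipANY from the rounded estimate and standard error printed above. The variable names are illustrative, and small differences from the summary are due to rounding of the inputs.

```r
# Estimate and standard error copied from the summary row for home_ownershipANY
estimate  <- -1.010
std_error <- 9.549e-01

# Wald z statistic: estimate divided by its standard error
z_value <- estimate / std_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_value))

round(z_value, 3)
round(p_value, 3)

# TRUE: the coefficient is not significant at the 0.05 level
p_value > 0.05
```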
In order to test goodness of fit I performed a likelihood ratio test, computed pseudo R-squared measures and ran the Hosmer-Lemeshow test.
A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well, but it is necessary to test whether the observed difference in model fit is statistically significant.
I stated the following hypotheses:
Null hypothesis: the reduced model is true
Alternative hypothesis: the reduced model is false
I compared the final regression model generated by the step function with the initial null model and with a model with fewer predictors.
#the model with fewer predictors
model_fewer_pred <- glm(formula = loan_status ~ grade + dti + loan_amnt + home_ownership + term + emp_length + int_rate, family = binomial(link = "logit"), data = full_data)
#optimal model vs null model
anova(final.model, model.null, test ="Chisq")
## Analysis of Deviance Table
##
## Model 1: loan_status ~ grade + dti + home_ownership + term + emp_length +
## int_rate + delinq_2yrs + total_acc + open_acc + annual_inc +
## revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med +
## installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog
## Model 2: loan_status ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 36185 35349
## 2 36221 38610 -36 -3260.5 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#optimal model vs the model with fewer predictors
anova(final.model, model_fewer_pred, test ="Chisq")
## Analysis of Deviance Table
##
## Model 1: loan_status ~ grade + dti + home_ownership + term + emp_length +
## int_rate + delinq_2yrs + total_acc + open_acc + annual_inc +
## revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med +
## installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog
## Model 2: loan_status ~ grade + dti + loan_amnt + home_ownership + term +
## emp_length + int_rate
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 36185 35349
## 2 36197 35615 -12 -265.64 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
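In both cases the null hypothesis is rejected, as the p-value is less than the 5% significance level. This provides evidence against the reduced model in favor of the final model. Note that model_fewer_pred contains loan_amnt, which the final model does not, so these two models are not strictly nested and the second comparison should be read as approximate.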
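The likelihood ratio statistic is the difference in residual deviance between the two models, compared against a chi-square distribution whose degrees of freedom equal the difference in the number of coefficients. The sketch below repeats that calculation for the second table using the values printed by anova; the variable names are illustrative.

```r
# Values copied from the second analysis of deviance table
resid_df_final   <- 36185
resid_df_reduced <- 36197
lrt_stat         <- 265.64   # difference in residual deviance

# Degrees of freedom of the test
df_diff <- resid_df_reduced - resid_df_final

# Upper-tail chi-square probability
p_value <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)

df_diff
p_value < 0.05
```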
Unlike linear regression with ordinary least squares estimation, logistic regression has no R-squared statistic that gives the proportion of variance in the dependent variable explained by the predictors. However, there are a number of pseudo R-squared metrics that can be of value. The most notable is McFadden’s R-squared. It ranges from 0 to just under 1, with values close to zero indicating that the model has little predictive power.
pR2(final.model)
## llh llhNull G2 McFadden r2ML
## -1.767462e+04 -1.930490e+04 3.260549e+03 8.444875e-02 8.608317e-02
## r2CU
## 1.313065e-01
McFadden’s R-squared is about 0.084 and r2CU (Cragg and Uhler’s pseudo R-squared) is about 0.131. Both values are close to 0, so the model has weak predictive power.
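The pseudo R-squared values can be reproduced from the two log-likelihoods printed by pR2. The sketch below uses the standard definitions of the McFadden, maximum likelihood and Cragg-Uhler measures. The sample size of 36222 is an assumption taken from the null degrees of freedom (36221 + 1), and the variable names are illustrative.

```r
# Log-likelihoods copied from the pR2 output
llh      <- -1.767462e+04   # fitted model
llh_null <- -1.930490e+04   # intercept-only model
n_obs    <- 36222           # assumed: null degrees of freedom + 1

# McFadden: 1 - llh / llh_null
mcfadden <- 1 - llh / llh_null

# Maximum likelihood pseudo R-squared: 1 - exp(-G2 / n), with G2 = 2 * (llh - llh_null)
g2    <- 2 * (llh - llh_null)
r2_ml <- 1 - exp(-g2 / n_obs)

# Cragg-Uhler: r2_ml rescaled by its maximum attainable value
r2_cu <- r2_ml / (1 - exp(2 * llh_null / n_obs))

round(c(McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4)
```

The three values agree with the pR2 output up to rounding of the inputs.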
Another approach to determining the goodness of fit is the Hosmer-Lemeshow statistic, which is computed after the observations have been segmented into groups with similar predicted probabilities. It examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the data set using a Pearson chi-square test.
I stated the following hypotheses:
Null hypothesis: there is a good fit of the model
Alternative hypothesis: there is a poor fit of the model
hoslem.test(full_data$loan_status, fitted(final.model), g=10)
## Warning in Ops.factor(1, y): '-' not meaningful for factors
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: full_data$loan_status, fitted(final.model)
## X-squared = 36222, df = 8, p-value < 2.2e-16
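Since the p-value is close to 0 I rejected the null hypothesis, which suggests that the model fits poorly. However, the warning shows that loan_status was passed to the test as a factor rather than as a numeric 0/1 vector. The reported statistic matches the number of observations (36222), so this result is unreliable and the test should be rerun with a numeric response.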
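One way to avoid the factor warning is to convert the response to 0/1 before computing the statistic. The sketch below illustrates this on simulated data, not on the Lending Club data, and computes the Hosmer-Lemeshow statistic by hand in base R rather than through hoslem.test. The objects x and y_factor, the level labels and the simulated coefficients are all hypothetical.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical predictor and outcome)
n <- 2000
x <- rnorm(n)
y_factor <- factor(
  ifelse(rbinom(n, 1, plogis(0.5 + x)) == 1, "Fully Paid", "Charged Off"),
  levels = c("Charged Off", "Fully Paid")
)

fit   <- glm(y_factor ~ x, family = binomial(link = "logit"))
p_hat <- fitted(fit)

# Convert the factor to 0/1: the second level is the modelled event
y01 <- as.numeric(y_factor) - 1

# Split observations into g groups by deciles of the predicted probability
g   <- 10
grp <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 1 / g)),
           include.lowest = TRUE)

obs_events <- tapply(y01, grp, sum)      # observed events per group
exp_events <- tapply(p_hat, grp, sum)    # expected events per group
n_group    <- tapply(y01, grp, length)   # group sizes

# Hosmer-Lemeshow statistic and p-value on g - 2 degrees of freedom
hl_stat <- sum((obs_events - exp_events)^2 /
               (exp_events * (1 - exp_events / n_group)))
hl_p    <- pchisq(hl_stat, df = g - 2, lower.tail = FALSE)

hl_stat
hl_p
```

With the real data, the same conversion could be applied to full_data$loan_status before calling hoslem.test, provided the second factor level is the fully paid outcome. That ordering is an assumption that should be checked with levels().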
Based on the results of the optimal logistic regression model I built the following logit function:
#logit(p) = ln(p/(1-p))
#logit_p =
# (3.434e+00)
# -(5.621e-01)*gradeB-(1.138e+00)*gradeC-(1.644e+00)*gradeD
# -(2.144e+00)*gradeE-(2.411e+00)*gradeF-(2.689e+00)*gradeG
# -(2.710e-02)*dti
# -(2.842e-01)*homeownershipOWN-(4.066e-01)*homeownershipRENT
# -(2.481e-02)*term
# +(3.027e-01)*emplength10years+(1.534e-01)*emplength2years+(1.718e-01)*emplength3years
# +(1.559e-01)*emplength4years+(1.778e-01)*emplength5years+(2.102e-01)*emplength6years
# +(2.829e-01)*emplength7years+(1.571e-01)*emplength8years-(1.651e-01)*emplengthn/a
# +(7.601e-02)*intrate-(4.956e-02)*delinq2yrs+(1.239e-02)*totalacc
# -(3.128e-02)*openacc+(1.922e-06)*annualinc-(2.941e-03)*revolutil
# -(1.512e-01)*verificationstatussourceverified-(1.317e-01)*verificationstatusverified
# -(4.745e-02)*inqlast6mths-(2.149e-01)*collections12mthsexmed-(8.799e-04)*installment
# +(3.070e-03)*mthssincelastdelinq-(3.284e-02)*pubrec-(1.607e-03)*mthssincelastmajorderog
The natural logarithm of the odds is a linear function of the independent variables. Applying the inverse of the logit function (the logistic function) gives the estimated regression equation.
The estimated regression equation is shown below:
#p_hat = e^logit_p/(1+e^logit_p)
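As a small illustration of this relationship, the sketch below converts a logit into a probability and shows that a coefficient on the logit scale corresponds to a multiplicative change in the odds. It uses the intercept and the grade B coefficient from the logit function above. The helper names logit_to_prob and odds are illustrative and not part of the original analysis.

```r
# Inverse logit, written exactly as in the estimated regression equation
logit_to_prob <- function(logit_p) exp(logit_p) / (1 + exp(logit_p))

# Odds corresponding to a probability
odds <- function(p) p / (1 - p)

intercept <- 3.434e+00
b_gradeB  <- -5.621e-01

p_A <- logit_to_prob(intercept)              # all predictors at zero / reference level
p_B <- logit_to_prob(intercept + b_gradeB)   # same loan, grade B instead of A

# Ratio of the odds for grade B versus grade A
odds_ratio_B_vs_A <- odds(p_B) / odds(p_A)

# The odds ratio equals exp(coefficient), and base R's plogis() is the same inverse logit
all.equal(odds_ratio_B_vs_A, exp(b_gradeB))
all.equal(p_A, plogis(intercept))
```

Under this reading, moving from grade A to grade B multiplies the odds of repayment by exp(-0.5621), which is less than 1, so the odds fall.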
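In order to illustrate how the equation works I will show how the grade affects the probability that the loan will be paid off.
Let’s change the loan grade and hold all remaining variables constant. In the code below the remaining variables are set to zero, which for the categorical variables means their reference levels.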
#create vectors to store grades and probabilities
grade <- c()
probability <- c()
#grade A
logit_p_A = (3.434e+00)-(5.621e-01)*0-(1.138e+00)*0-(1.644e+00)*0-(2.144e+00)*0 -(2.411e+00)*0-(2.689e+00)*0
probability[1] = exp(1)^logit_p_A/(1+exp(1)^logit_p_A)
grade[1] <- "grade A"
#grade B
logit_p_B = (3.434e+00)-(5.621e-01)*1-(1.138e+00)*0-(1.644e+00)*0-(2.144e+00)*0 -(2.411e+00)*0-(2.689e+00)*0
probability[2] = exp(1)^logit_p_B/(1+exp(1)^logit_p_B)
grade[2] <- "grade B"
#grade C
logit_p_C = (3.434e+00)-(5.621e-01)*0-(1.138e+00)*1-(1.644e+00)*0-(2.144e+00)*0 -(2.411e+00)*0-(2.689e+00)*0
probability[3] = exp(1)^logit_p_C/(1+exp(1)^logit_p_C)
grade[3] <- "grade C"
#grade D
logit_p_D = (3.434e+00)-(5.621e-01)*0-(1.138e+00)*0-(1.644e+00)*1-(2.144e+00)*0 -(2.411e+00)*0-(2.689e+00)*0
probability[4] = exp(1)^logit_p_D/(1+exp(1)^logit_p_D)
grade[4] <- "grade D"
#grade E
logit_p_E = (3.434e+00)-(5.621e-01)*0-(1.138e+00)*0-(1.644e+00)*0-(2.144e+00)*1 -(2.411e+00)*0-(2.689e+00)*0
probability[5] = exp(1)^logit_p_E/(1+exp(1)^logit_p_E)
grade[5] <- "grade E"
#grade F
logit_p_F = (3.434e+00)-(5.621e-01)*0-(1.138e+00)*0-(1.644e+00)*0-(2.144e+00)*0 -(2.411e+00)*1-(2.689e+00)*0
probability[6] = exp(1)^logit_p_F/(1+exp(1)^logit_p_F)
grade[6] <- "grade F"
#grade G
logit_p_G = (3.434e+00)-(5.621e-01)*0-(1.138e+00)*0-(1.644e+00)*0-(2.144e+00)*0 -(2.411e+00)*0-(2.689e+00)*1
probability[7] = exp(1)^logit_p_G/(1+exp(1)^logit_p_G)
grade[7] <- "grade G"
#table with results
prob_table <- data.frame(grade,probability)
head(prob_table)
## grade probability
## 1 grade A 0.9687504
## 2 grade B 0.9464397
## 3 grade C 0.9085452
## 4 grade D 0.8569273
## 5 grade E 0.7841472
## 6 grade F 0.7355566
#graph
g <- ggplot(prob_table, aes(grade,probability))
g + ggtitle("Change in probability based on change in grade") + geom_bar(stat="identity",fill="skyblue")
The bar chart above shows that the probability of full repayment falls steadily as the grade moves from A towards G, while all other variables are held constant.
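The grade-by-grade calculation can also be written in vectorized form. The sketch below reuses the intercept and the grade coefficients from the code above. The names grade_coef, grade_prob and grade_odds_ratio are my own, and exp() of a coefficient is shown as the odds ratio relative to grade A.

```r
intercept <- 3.434

# Grade coefficients used in the code above; grade A is the reference level (0)
grade_coef <- c(A = 0, B = -0.5621, C = -1.138, D = -1.644,
                E = -2.144, F = -2.411, G = -2.689)

# Probability of full repayment for each grade, all other terms set to 0
grade_prob <- plogis(intercept + grade_coef)

# Odds of full repayment for each grade relative to grade A
grade_odds_ratio <- exp(grade_coef)

print(round(data.frame(probability = grade_prob,
                       odds_ratio = grade_odds_ratio), 4))
```

Unlike head(), which prints only the first six rows, this also displays the value for grade G.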
#create vectors to store change in dti and probabilities
dti_value <- c()
probability <- c()
for (i in 1:100){
dti_value[i] <- i
logit_p_dti = (3.434e+00)-(2.710e-02)*i
probability[i] = exp(1)^logit_p_dti/(1+exp(1)^logit_p_dti)
}
dti_table <- data.frame(dti_value,probability)
head(dti_table)
## dti_value probability
## 1 1 0.9679195
## 2 2 0.9670672
## 3 3 0.9661931
## 4 4 0.9652967
## 5 5 0.9643773
## 6 6 0.9634345
#graph
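#set the line colour in geom_line(); inside aes() the string would be treated as a data value, not a colour
g <- ggplot(data=dti_table, aes(x=dti_value,y=probability))
g + geom_line(colour="skyblue")+ ggtitle("Change in probability based on change in Debt to Income ratio") + labs(x="Debt to Income ratio,%")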
The line chart above shows that the probability of full repayment decreases as the debt-to-income ratio rises, while all other variables are held constant.
By using the logit function, an investor can check how a change in one variable, or in a combination of variables, affects the probability while all other variables are held constant.
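A combination of variables can be explored with a small helper function. This is a minimal sketch under simplifying assumptions: predict_paid_prob is a hypothetical name, and it includes only the intercept, grade and dti terms from the code above, with every other predictor left at 0 or at its reference level.

```r
# Hypothetical helper: intercept, grade and dti terms only;
# all other predictors are left at 0 or their reference level
predict_paid_prob <- function(grade, dti) {
  grade_coef <- c(A = 0, B = -0.5621, C = -1.138, D = -1.644,
                  E = -2.144, F = -2.411, G = -2.689)
  logit_p <- 3.434 + grade_coef[[grade]] - 0.0271 * dti
  plogis(logit_p)
}

# Compare a grade A and a grade C loan at two debt-to-income ratios
scenarios <- expand.grid(grade = c("A", "C"), dti = c(10, 30),
                         stringsAsFactors = FALSE)
scenarios$probability <- unname(mapply(predict_paid_prob,
                                       scenarios$grade, scenarios$dti))
print(scenarios)
```

Further terms from the logit function could be added to the helper in the same way.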
Next, I used the logistic model to predict loan status.
#shuffle the row indices once so that the training and testing sets do not overlap
shuffled_rows <- sample(nrow(full_data))
n_train <- round(0.66*nrow(full_data))
#66% of data set is for training
training <- full_data[shuffled_rows[1:n_train],]
#34% of data set is for testing
testing <- full_data[shuffled_rows[(n_train+1):nrow(full_data)],]
#run optimal model with training data set
pred_model <- glm(formula = loan_status ~ grade + dti + home_ownership + term + emp_length + int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"), data = training)
#apply regression model as predictor for testing data set
testing$predicted_loan_status = predict(pred_model, newdata=testing,type = "response")
#replace probability that is greater than or equal to 0.5 with "Fully Paid" class and probability that is less than 0.5 with "Charged Off" class
testing <- testing %>% mutate(predicted_loan_status = ifelse(predicted_loan_status < 0.5, "Charged Off","Fully Paid"))
levels(factor(testing$predicted_loan_status))
## [1] "Charged Off" "Fully Paid"
testing <- testing %>% select(loan_status, predicted_loan_status)
head(testing)
## loan_status predicted_loan_status
## 1 Fully Paid Fully Paid
## 2 Fully Paid Fully Paid
## 3 Fully Paid Charged Off
## 4 Fully Paid Fully Paid
## 5 Fully Paid Fully Paid
## 6 Fully Paid Fully Paid
#create confusion matrix
confusion_matrix <- table(testing$predicted_loan_status, testing$loan_status)
confusion_matrix
##
## Charged Off Fully Paid
## Charged Off 300 213
## Fully Paid 2487 9315
#calculate accuracy
accuracy <- (confusion_matrix[1,1] + confusion_matrix[2,2])/nrow(testing)
accuracy
## [1] 0.7807552
On the testing set, the logistic regression model predicts whether a loan will be paid off with an accuracy of 78.08%.
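Accuracy alone can be misleading when one class dominates, so the same confusion matrix can be summarized in a few more ways. The sketch below copies the four counts from the output above into a matrix. The per-class rates and the majority-class baseline are additions of mine, not part of the original analysis.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual)
cm <- matrix(c(300, 2487, 213, 9315), nrow = 2,
             dimnames = list(predicted = c("Charged Off", "Fully Paid"),
                             actual    = c("Charged Off", "Fully Paid")))

accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual charged-off loans that the model predicts as charged off
charged_off_recall <- cm["Charged Off", "Charged Off"] / sum(cm[, "Charged Off"])

# Share of actual fully paid loans that the model predicts as fully paid
fully_paid_recall <- cm["Fully Paid", "Fully Paid"] / sum(cm[, "Fully Paid"])

# Accuracy of always predicting "Fully Paid" (majority-class baseline)
baseline_accuracy <- sum(cm[, "Fully Paid"]) / sum(cm)

print(round(c(accuracy = accuracy,
              charged_off_recall = charged_off_recall,
              fully_paid_recall = fully_paid_recall,
              baseline_accuracy = baseline_accuracy), 4))
```

With these counts the accuracy is roughly 0.781 against a baseline of roughly 0.774, and only about 11% of the charged-off loans are identified. This suggests that a cutoff other than 0.5 might be worth exploring.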
In conclusion, I achieved all the goals of the project. First, I built the optimal logistic regression function using the stepwise technique, which selects the model with the lowest AIC.
Second, I showed how changes in grade and dti affect the probability of success (the probability that the loan will be paid off). Third, I calculated that logistic regression can predict loan status with an accuracy of 78.08% on the testing set.
I learnt a lot about multiple logistic regression that wasn’t covered in the lab assignments.