For my project I selected a data set published on Lending Club’s website (https://www.lendingclub.com), which the company provides for potential investors. The data set contains information about loans that were issued from 2007 through the third quarter of 2017.

Lending Club is the world’s largest peer-to-peer lending platform that enables borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans.

The goals of the project are:

  1. To find the equation that best predicts the probability that a loan will be paid off.

  2. To understand what might cause the probability to change.

  3. To predict loan status using logistic regression.

An investor earns money when a loan is fully paid off and loses money when a loan is charged off. With the results of a model that classifies loans, an investor would be able to make better investment decisions.

While reviewing Lending Club’s website I found that investors can see information such as loan rate, loan term, interest rate, borrower’s FICO score, loan amount and loan purpose. Moreover, they can filter by the borrower’s employment length and monthly income.

In order to collect the data I downloaded (data source: https://www.lendingclub.com/info/download-data.action ) and merged 11 files that contain data from 2007 to the third quarter of 2017. To reduce the loading time I implemented the following steps.

#1. read in a few records of the input file to identify the classes of the input file and assign that column class to the input file while reading the entire data set
data_2007_2011 <- read.csv(file="https://cdn-stage.fedweb.org/fed-2/13/LoanStats3a.csv",  
                           stringsAsFactors=T, header=T, nrows=5)

data_2012_2013 <- read.csv(file="https://cdn-stage.fedweb.org/fed-2/13/LoanStats3b.csv",  
                           stringsAsFactors=T, header=T, nrows=5) 

data_2014 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3c.csv",  
                           stringsAsFactors=T, header=T, nrows=5) 

data_2015 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3d.csv",
                           stringsAsFactors=T, header=T, nrows=5) 

data_2016_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q1.csv",
                          stringsAsFactors=T, header=T, nrows=5) 

data_2016_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q2.csv",  
                          stringsAsFactors=T, header=T, nrows=5)

data_2016_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q3.csv",  
                          stringsAsFactors=T, header=T, nrows=5) 

data_2016_q4 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q4.csv",  
                          stringsAsFactors=T, header=T, nrows=5)

data_2017_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q1.csv",  
                          stringsAsFactors=T, header=T, nrows=5)

data_2017_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q2.csv",  
                          stringsAsFactors=T, header=T, nrows=5)

data_2017_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q3.csv",  
                          stringsAsFactors=T, header=T, nrows=5)


#2. replace all missing values (empty strings) in the sampled rows with NAs
data_2007_2011[data_2007_2011 == ""] <- NA
data_2012_2013[data_2012_2013 == ""] <- NA
data_2014[data_2014 == ""] <- NA
data_2015[data_2015 == ""] <- NA
data_2016_q1[data_2016_q1 == ""] <- NA
data_2016_q2[data_2016_q2 == ""] <- NA
data_2016_q3[data_2016_q3 == ""] <- NA
data_2016_q4[data_2016_q4 == ""] <- NA
data_2017_q1[data_2017_q1 == ""] <- NA
data_2017_q2[data_2017_q2 == ""] <- NA
data_2017_q3[data_2017_q3 == ""] <- NA


#3. determine classes
data_2007_2011.colclass <- sapply(data_2007_2011,class)
data_2012_2013.colclass <- sapply(data_2012_2013,class)
data_2014.colclass <- sapply(data_2014,class)
data_2015.colclass <- sapply(data_2015,class)
data_2016_q1.colclass <- sapply(data_2016_q1,class)
data_2016_q2.colclass <- sapply(data_2016_q2,class)
data_2016_q3.colclass <- sapply(data_2016_q3,class)
data_2016_q4.colclass <- sapply(data_2016_q4,class)
data_2017_q1.colclass <- sapply(data_2017_q1,class)
data_2017_q2.colclass <- sapply(data_2017_q2,class)
data_2017_q3.colclass <- sapply(data_2017_q3,class)


#4. assign that column class to the input file while reading the entire data set and define comment.char parameter.
data_2007_2011 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3a.csv",  
                           stringsAsFactors=T,
                           header=T,colClasses=data_2007_2011.colclass, comment.char="")

data_2012_2013 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3b.csv",  
                           stringsAsFactors=T,
                           header=T,colClasses=data_2012_2013.colclass, comment.char="")

data_2014 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3c.csv",  
                       stringsAsFactors=T, colClasses=data_2014.colclass, comment.char="") 

data_2015 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3d.csv",
                      stringsAsFactors=T, header=T, colClasses=data_2015.colclass, comment.char="") 

data_2016_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q1.csv", 
                         stringsAsFactors=T, header=T,colClasses=data_2016_q1.colclass, comment.char="") 

data_2016_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q2.csv",  
                          stringsAsFactors=T, header=T,colClasses=data_2016_q2.colclass, comment.char="")

data_2016_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q3.csv",  
                          stringsAsFactors=T, header=T,colClasses=data_2016_q3.colclass, comment.char="") 

data_2016_q4 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q4.csv",  
                          stringsAsFactors=T, header=T,colClasses=data_2016_q4.colclass, comment.char="")

data_2017_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q1.csv",  
                          stringsAsFactors=T, header=T,colClasses=data_2017_q1.colclass, comment.char="")

data_2017_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q2.csv",  
                          stringsAsFactors=T, header=T,colClasses=data_2017_q2.colclass, comment.char="")

data_2017_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q3.csv",  
                          stringsAsFactors=T, header=T,colClasses=data_2017_q3.colclass, comment.char="")

#5. merge csv files
# I kept the first 51 attributes because the csv files for different years have different sets of attributes, but the first 51 are the same across files
full_data <- rbind(data_2007_2011[1:51], data_2012_2013[1:51], data_2014[1:51],
                   data_2015[1:51], data_2016_q1[1:51], data_2016_q2[1:51],
                   data_2016_q3[1:51], data_2016_q4[1:51], data_2017_q1[1:51],
                   data_2017_q2[1:51], data_2017_q3[1:51])
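
Steps 1 to 4 repeat the same pattern for every file, so the pattern can be wrapped in a single helper. The following is only a minimal sketch: the helper name read_with_sampled_classes and the small temporary CSV are assumptions made for illustration and stand in for the real LoanStats files. Note that classes inferred from the first few rows can be wrong for later rows, in which case read.csv stops with an error.

```r
# Hypothetical helper: sample a few rows to learn the column classes,
# then read the whole file with those classes fixed in advance.
read_with_sampled_classes <- function(path, n_sample = 5) {
  sample_rows <- read.csv(path, header = TRUE, nrows = n_sample,
                          stringsAsFactors = FALSE)
  col_classes <- sapply(sample_rows, class)
  read.csv(path, header = TRUE, colClasses = col_classes, comment.char = "")
}

# Illustrative data written to a temporary file (not the real data set)
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(loan_amnt = c(12000, 3000, 7200, 9000, 7500, 15850),
                   grade     = c("B", "B", "B", "C", "B", "C"),
                   int_rate  = c(13.53, 12.85, 10.99, 14.98, 11.99, 16.24))
write.csv(demo, tmp, row.names = FALSE)

loans <- read_with_sampled_classes(tmp)
str(loans)
```

With such a helper the eleven files could be read with lapply over a vector of file paths instead of eleven separate calls.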

After that I determined all types of loan statuses.

levels(factor(full_data$loan_status))
##  [1] ""                                                   
##  [2] "Charged Off"                                        
##  [3] "Does not meet the credit policy. Status:Charged Off"
##  [4] "Does not meet the credit policy. Status:Fully Paid" 
##  [5] "Fully Paid"                                         
##  [6] "Current"                                            
##  [7] "Default"                                            
##  [8] "In Grace Period"                                    
##  [9] "Late (16-30 days)"                                  
## [10] "Late (31-120 days)"

I filtered the data so that the data set contains only loans with “Fully Paid” or “Charged Off” statuses. I ignored loans with statuses “Current”, “In Grace Period”, “Late (16-30 days)”, “Late (31-120 days)” and “Default”, since theoretically borrowers can still pay them off.

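library(dplyr)
library(stringr)

#strip the "Does not meet the credit policy" prefix, then keep only the two final statuses
full_data <- full_data %>%
  mutate(loan_status = str_replace(loan_status, "Does not meet the credit policy. Status:", "")) %>%
  filter(loan_status %in% c("Fully Paid", "Charged Off"))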
levels(factor(full_data$loan_status))
## [1] "Charged Off" "Fully Paid"

Also, I removed all attributes that investors can’t see on the website and kept only the ones that they can see. Moreover, I converted the term, interest rate and revolving utilization attributes to numerical format.

colnames(full_data)
##  [1] "id"                          "member_id"                  
##  [3] "loan_amnt"                   "funded_amnt"                
##  [5] "funded_amnt_inv"             "term"                       
##  [7] "int_rate"                    "installment"                
##  [9] "grade"                       "sub_grade"                  
## [11] "emp_title"                   "emp_length"                 
## [13] "home_ownership"              "annual_inc"                 
## [15] "verification_status"         "issue_d"                    
## [17] "loan_status"                 "pymnt_plan"                 
## [19] "url"                         "desc"                       
## [21] "purpose"                     "title"                      
## [23] "zip_code"                    "addr_state"                 
## [25] "dti"                         "delinq_2yrs"                
## [27] "earliest_cr_line"            "inq_last_6mths"             
## [29] "mths_since_last_delinq"      "mths_since_last_record"     
## [31] "open_acc"                    "pub_rec"                    
## [33] "revol_bal"                   "revol_util"                 
## [35] "total_acc"                   "initial_list_status"        
## [37] "out_prncp"                   "out_prncp_inv"              
## [39] "total_pymnt"                 "total_pymnt_inv"            
## [41] "total_rec_prncp"             "total_rec_int"              
## [43] "total_rec_late_fee"          "recoveries"                 
## [45] "collection_recovery_fee"     "last_pymnt_d"               
## [47] "last_pymnt_amnt"             "next_pymnt_d"               
## [49] "last_credit_pull_d"          "collections_12_mths_ex_med" 
## [51] "mths_since_last_major_derog"
full_data <- full_data %>%
  select(loan_status, loan_amnt, term, int_rate, installment, grade, emp_length,
         home_ownership, annual_inc, verification_status, dti, delinq_2yrs,
         inq_last_6mths, mths_since_last_delinq, mths_since_last_record, open_acc,
         pub_rec, revol_bal, revol_util, total_acc, collections_12_mths_ex_med,
         mths_since_last_major_derog) %>%
  mutate(term = str_replace(term, "months", ""),
         int_rate = str_replace(int_rate, "%", ""),
         revol_util = str_replace(revol_util, "%", ""),
         term = as.integer(term),
         revol_util = as.integer(revol_util),
         int_rate = as.double(int_rate),
         loan_status = as.factor(loan_status))

#omit NAs
full_data <- na.omit(full_data)

head(full_data)
##       loan_status loan_amnt term int_rate installment grade emp_length
## 42538  Fully Paid     12000   36    13.53      407.40     B  10+ years
## 42545  Fully Paid      3000   36    12.85      100.87     B  10+ years
## 42568  Fully Paid      7200   36    10.99      235.69     B    4 years
## 42594  Fully Paid      9000   36    14.98      311.90     C  10+ years
## 42597  Fully Paid      7500   36    11.99      249.08     B    2 years
## 42620  Fully Paid     15850   36    16.24      559.12     C     1 year
##       home_ownership annual_inc verification_status   dti delinq_2yrs
## 42538           RENT      40000     Source Verified 16.94           0
## 42545           RENT      25000            Verified 24.68           0
## 42568            OWN      70000            Verified 19.20           0
## 42594       MORTGAGE      56000            Verified 21.45           0
## 42597           RENT      59600        Not Verified 15.93           0
## 42620           RENT      59400            Verified 33.22           0
##       inq_last_6mths mths_since_last_delinq mths_since_last_record
## 42538              0                     53                     33
## 42545              0                     58                     53
## 42568              0                     59                     59
## 42594              2                     36                     72
## 42597              0                     28                    108
## 42620              3                     64                     52
##       open_acc pub_rec revol_bal revol_util total_acc
## 42538        7       2      5572         68        32
## 42545        5       2      2875         54        26
## 42568       14       1      3479         35        49
## 42594       30       1     11317         30        60
## 42597        9       1      5517         54        53
## 42620       17       1     11161         60        36
##       collections_12_mths_ex_med mths_since_last_major_derog
## 42538                          0                          53
## 42545                          0                          69
## 42568                          0                          59
## 42594                          0                          70
## 42597                          0                          34
## 42620                          0                          64
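
The conversion of the text columns to numbers can also be illustrated in isolation. The sketch below uses base R only and a few made-up values in the same style as the raw columns (the variable names term_raw and int_rate_raw are assumptions, not part of the data set).

```r
# Made-up raw values in the style of the term and int_rate columns
term_raw     <- c(" 36 months", " 60 months")
int_rate_raw <- c("13.53%", " 7.9%")

# Remove the unit text, trim surrounding spaces, then convert
term_num     <- as.integer(trimws(sub("months", "", term_raw, fixed = TRUE)))
int_rate_num <- as.numeric(sub("%", "", trimws(int_rate_raw), fixed = TRUE))

term_num
int_rate_num
```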

The explanatory variables are:

colnames(full_data)
##  [1] "loan_status"                 "loan_amnt"                  
##  [3] "term"                        "int_rate"                   
##  [5] "installment"                 "grade"                      
##  [7] "emp_length"                  "home_ownership"             
##  [9] "annual_inc"                  "verification_status"        
## [11] "dti"                         "delinq_2yrs"                
## [13] "inq_last_6mths"              "mths_since_last_delinq"     
## [15] "mths_since_last_record"      "open_acc"                   
## [17] "pub_rec"                     "revol_bal"                  
## [19] "revol_util"                  "total_acc"                  
## [21] "collections_12_mths_ex_med"  "mths_since_last_major_derog"
  1. loan_amnt - the listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
  2. term - the number of payments on the loan. Values are in months and can be either 36 or 60.
  3. int_rate - interest rate on the loan.
  4. installment - the monthly payment owed by the borrower if the loan originates.
  5. annual_inc - the self-reported annual income provided by the borrower during registration.
  6. dti - a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
  7. delinq_2yrs - the number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years.
  8. inq_last_6mths - the number of inquiries in past 6 months (excluding auto and mortgage inquiries).
  9. mths_since_last_delinq - the number of months since the borrower’s last delinquency.
  10. mths_since_last_record - the number of months since the last public record.
  11. open_acc - the number of open credit lines in the borrower’s credit file.
  12. pub_rec - number of derogatory public records.
  13. revol_bal - total credit revolving balance.
  14. revol_util - revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
  15. total_acc - the total number of credit lines currently in the borrower’s credit file.
  16. collections_12_mths_ex_med - number of collections in 12 months excluding medical collections.
  17. mths_since_last_major_derog - months since most recent 90-day or worse rating.

Since the response variable “loan status” is a binary categorical variable (with two possible outcomes, “Fully Paid” or “Charged Off”) and the explanatory variables are a mix of numerical and categorical variables, I used logistic regression to analyze the data set.
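
To make the idea concrete, here is a minimal sketch of a logistic regression on simulated data. The data, the single predictor and the coefficients 3 and -0.15 are assumptions chosen for illustration and are not estimates from the Lending Club data. When the response is a factor, glm treats the first level as failure, so with the levels below the fitted values are probabilities of “Fully Paid”.

```r
set.seed(1)
n <- 500

# Simulated interest rates and a made-up true relationship:
# log(p / (1 - p)) = 3 - 0.15 * int_rate, where p = P(Fully Paid)
int_rate <- runif(n, 5, 25)
p_paid   <- plogis(3 - 0.15 * int_rate)
status   <- factor(ifelse(runif(n) < p_paid, "Fully Paid", "Charged Off"),
                   levels = c("Charged Off", "Fully Paid"))

fit <- glm(status ~ int_rate, family = binomial(link = "logit"))

# Predicted probability of "Fully Paid" for two hypothetical loans
new_loans <- data.frame(int_rate = c(8, 20))
p_hat <- predict(fit, newdata = new_loans, type = "response")
p_hat
```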

The assumptions for the logistic regression are:

Assumption 1 - Logistic regression typically requires a large sample size. It requires a minimum of 10 cases with the least frequent outcome for each independent variable in the model. For example, if a model has 5 independent variables and the expected probability of the least frequent outcome is .10, then the minimum sample size is 500 (10*5 / .10).

dim(full_data)
## [1] 36222    22

With 36222 observations, the data set can be considered large.
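
The rule of thumb from Assumption 1 can be written as a small function. This is a sketch only; the function name min_sample_size and its argument names are made up for illustration.

```r
# Rule of thumb: at least `cases_per_predictor` cases of the least frequent
# outcome for each independent variable in the model.
min_sample_size <- function(n_predictors, p_rare, cases_per_predictor = 10) {
  cases_per_predictor * n_predictors / p_rare
}

# The worked example from the text: 5 predictors, rarest outcome at 10%
min_sample_size(5, 0.10)   # 500
```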

Assumption 2 - Logistic regression requires the observations to be independent of each other. I assume that the majority of the observations are independent since most of the borrowers are not related.

Assumption 3 - Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.

#removing categorical variables except loan status (I will use this data set later)
full_data_no_categorical_status <- full_data %>% select(-grade,-emp_length,-home_ownership,-verification_status)

#removing loan status
full_data_no_categorical <- full_data_no_categorical_status %>% select(-loan_status)

#calculate correlation 
cor(full_data_no_categorical,use="complete.obs")
##                                loan_amnt         term     int_rate
## loan_amnt                    1.000000000  0.422905078  0.252943051
## term                         0.422905078  1.000000000  0.449439920
## int_rate                     0.252943051  0.449439920  1.000000000
## installment                  0.960984857  0.201161598  0.251209631
## annual_inc                   0.396604274  0.077325539 -0.024229579
## dti                         -0.004296079  0.055376313  0.154213650
## delinq_2yrs                  0.004823642 -0.013761784  0.044475316
## inq_last_6mths              -0.019746327 -0.002598566  0.242287085
## mths_since_last_delinq      -0.030847229  0.006928472 -0.050570450
## mths_since_last_record      -0.020720434  0.025732174  0.022453452
## open_acc                     0.145728246  0.086797964  0.068295286
## pub_rec                      0.031823800 -0.024404755  0.006143159
## revol_bal                    0.218459342  0.051399233 -0.006254746
## revol_util                   0.109403712  0.034683183  0.095508757
## total_acc                    0.106496607  0.069087858  0.001666116
## collections_12_mths_ex_med  -0.003320038 -0.009964323  0.003916385
## mths_since_last_major_derog -0.019175444  0.004729816 -0.029554970
##                              installment   annual_inc           dti
## loan_amnt                    0.960984857  0.396604274 -0.0042960793
## term                         0.201161598  0.077325539  0.0553763132
## int_rate                     0.251209631 -0.024229579  0.1542136500
## installment                  1.000000000  0.392374700 -0.0018279538
## annual_inc                   0.392374700  1.000000000 -0.2858904922
## dti                         -0.001827954 -0.285890492  1.0000000000
## delinq_2yrs                  0.011625624  0.042238702 -0.0020443121
## inq_last_6mths               0.009682516  0.040434813  0.0004220809
## mths_since_last_delinq      -0.038752539 -0.076417767  0.0204770180
## mths_since_last_record      -0.026630255 -0.086111918  0.0615709422
## open_acc                     0.139589986  0.092781884  0.2540949157
## pub_rec                      0.042574297  0.082048619 -0.0488774759
## revol_bal                    0.214007210  0.255702205  0.0733672979
## revol_util                   0.117137922  0.050101528  0.1431764418
## total_acc                    0.094812658  0.105119526  0.1645578233
## collections_12_mths_ex_med   0.000183462  0.005928169  0.0064742469
## mths_since_last_major_derog -0.023985298 -0.042258911  0.0328661843
##                               delinq_2yrs inq_last_6mths
## loan_amnt                    0.0048236422  -0.0197463266
## term                        -0.0137617839  -0.0025985660
## int_rate                     0.0444753161   0.2422870854
## installment                  0.0116256241   0.0096825159
## annual_inc                   0.0422387017   0.0404348126
## dti                         -0.0020443121   0.0004220809
## delinq_2yrs                  1.0000000000   0.0282431493
## inq_last_6mths               0.0282431493   1.0000000000
## mths_since_last_delinq      -0.4979427822   0.0032003344
## mths_since_last_record      -0.0124547371  -0.0347099768
## open_acc                     0.0550553163   0.1503937659
## pub_rec                      0.0006722893   0.0040616085
## revol_bal                    0.0064078761  -0.0122525865
## revol_util                   0.0001444110  -0.0794601846
## total_acc                    0.0423192281   0.1703407140
## collections_12_mths_ex_med   0.0847557528   0.0004772454
## mths_since_last_major_derog -0.3953539971   0.0124708097
##                             mths_since_last_delinq mths_since_last_record
## loan_amnt                             -0.030847229           -0.020720434
## term                                   0.006928472            0.025732174
## int_rate                              -0.050570450            0.022453452
## installment                           -0.038752539           -0.026630255
## annual_inc                            -0.076417767           -0.086111918
## dti                                    0.020477018            0.061570942
## delinq_2yrs                           -0.497942782           -0.012454737
## inq_last_6mths                         0.003200334           -0.034709977
## mths_since_last_delinq                 1.000000000           -0.006842418
## mths_since_last_record                -0.006842418            1.000000000
## open_acc                              -0.040857028            0.031370633
## pub_rec                               -0.004742984           -0.269376093
## revol_bal                             -0.019583893           -0.024295713
## revol_util                            -0.015932256            0.041013824
## total_acc                             -0.003720052           -0.144489072
## collections_12_mths_ex_med            -0.097995842           -0.005553399
## mths_since_last_major_derog            0.689480972           -0.007275000
##                                open_acc       pub_rec    revol_bal
## loan_amnt                    0.14572825  0.0318237998  0.218459342
## term                         0.08679796 -0.0244047554  0.051399233
## int_rate                     0.06829529  0.0061431595 -0.006254746
## installment                  0.13958999  0.0425742965  0.214007210
## annual_inc                   0.09278188  0.0820486188  0.255702205
## dti                          0.25409492 -0.0488774759  0.073367298
## delinq_2yrs                  0.05505532  0.0006722893  0.006407876
## inq_last_6mths               0.15039377  0.0040616085 -0.012252587
## mths_since_last_delinq      -0.04085703 -0.0047429835 -0.019583893
## mths_since_last_record       0.03137063 -0.2693760931 -0.024295713
## open_acc                     1.00000000 -0.0184278479  0.162964627
## pub_rec                     -0.01842785  1.0000000000  0.011407285
## revol_bal                    0.16296463  0.0114072853  1.000000000
## revol_util                  -0.07903746 -0.0002151927  0.230578235
## total_acc                    0.59745422 -0.0586662051  0.079748021
## collections_12_mths_ex_med   0.01444761  0.0169394696 -0.007964900
## mths_since_last_major_derog -0.01320742  0.0099003795  0.002334282
##                                revol_util    total_acc
## loan_amnt                    0.1094037119  0.106496607
## term                         0.0346831827  0.069087858
## int_rate                     0.0955087571  0.001666116
## installment                  0.1171379217  0.094812658
## annual_inc                   0.0501015282  0.105119526
## dti                          0.1431764418  0.164557823
## delinq_2yrs                  0.0001444110  0.042319228
## inq_last_6mths              -0.0794601846  0.170340714
## mths_since_last_delinq      -0.0159322556 -0.003720052
## mths_since_last_record       0.0410138239 -0.144489072
## open_acc                    -0.0790374605  0.597454218
## pub_rec                     -0.0002151927 -0.058666205
## revol_bal                    0.2305782349  0.079748021
## revol_util                   1.0000000000 -0.086334311
## total_acc                   -0.0863343109  1.000000000
## collections_12_mths_ex_med  -0.0258469002 -0.023336431
## mths_since_last_major_derog  0.0221548906 -0.006425679
##                             collections_12_mths_ex_med
## loan_amnt                                -0.0033200383
## term                                     -0.0099643227
## int_rate                                  0.0039163850
## installment                               0.0001834620
## annual_inc                                0.0059281689
## dti                                       0.0064742469
## delinq_2yrs                               0.0847557528
## inq_last_6mths                            0.0004772454
## mths_since_last_delinq                   -0.0979958421
## mths_since_last_record                   -0.0055533989
## open_acc                                  0.0144476057
## pub_rec                                   0.0169394696
## revol_bal                                -0.0079648997
## revol_util                               -0.0258469002
## total_acc                                -0.0233364306
## collections_12_mths_ex_med                1.0000000000
## mths_since_last_major_derog              -0.1270235332
##                             mths_since_last_major_derog
## loan_amnt                                  -0.019175444
## term                                        0.004729816
## int_rate                                   -0.029554970
## installment                                -0.023985298
## annual_inc                                 -0.042258911
## dti                                         0.032866184
## delinq_2yrs                                -0.395353997
## inq_last_6mths                              0.012470810
## mths_since_last_delinq                      0.689480972
## mths_since_last_record                     -0.007275000
## open_acc                                   -0.013207424
## pub_rec                                     0.009900380
## revol_bal                                   0.002334282
## revol_util                                  0.022154891
## total_acc                                  -0.006425679
## collections_12_mths_ex_med                 -0.127023533
## mths_since_last_major_derog                 1.000000000
#the function chart.Correlation (PerformanceAnalytics package) generates plots for each pair of variables; however, it runs very slowly
#chart.Correlation(full_data_no_categorical, method="spearman",histogram=TRUE)

According to the correlation matrix, four pairs of variables are highly correlated:

  1. loan amount and installment (r = 0.96)
  2. the number of open accounts and the total number of accounts (r = 0.60)
  3. the number of months since the borrower’s last delinquency and the number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years (r = -0.50)
  4. the number of months since the borrower’s last delinquency and the number of months since the most recent 90-day or worse rating (r = 0.69)
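
Reading pairs off a large printed matrix is error-prone, so the search can be automated. The sketch below is one possible way to do it: the helper name high_cor_pairs, the default cutoff of 0.5 and the simulated data are all assumptions made for illustration. The text does not state a cutoff, and reproducing the four pairs listed above from the real matrix would need a cutoff slightly below 0.5, since the weakest of them has an absolute correlation of about 0.498.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# reaches a chosen threshold (each pair is reported once).
high_cor_pairs <- function(data, threshold = 0.5) {
  cm  <- cor(data, use = "complete.obs")
  idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             stringsAsFactors = FALSE)
}

# Simulated data: installment is built from loan_amnt, dti is unrelated
set.seed(42)
loan_amnt <- runif(200, 1000, 35000)
demo <- data.frame(loan_amnt   = loan_amnt,
                   installment = loan_amnt / 36 + rnorm(200, sd = 20),
                   dti         = runif(200, 0, 35))

res <- high_cor_pairs(demo)
res
```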

Assumption 4 - Logistic regression assumes linearity of independent variables and log odds.

I tested linearity in the logit with the Box-Tidwell transformation test. The test uses terms that are the product of each independent variable and its natural logarithm, X * ln(X). If these terms are significant, then there is non-linearity in the logit.

I stated the following hypothesis:

The null hypothesis: the log odds are linear in the independent variable. No transformation is needed.

The alternative hypothesis: the log odds are not linear in the independent variable. Transformation is needed.

The null hypothesis is rejected only for the total accounts variable (total_acc), whose p-value (0.00613) is below the 5% significance level. Hence a transformation is needed for the variable “total_acc”.

All other variables have p-values that are greater than the 5% significance level, so no transformation is needed for them.

#replacing each variable with variable*log(variable)
#note: a value of 0 gives 0 * log(0) = NaN, and glm() drops rows that contain NaN
for (i in 2:length(full_data_no_categorical_status)) {
  full_data_no_categorical_status[i] <- full_data_no_categorical_status[i] *
    log(full_data_no_categorical_status[i])
}
## Warning in FUN(X[[i]], ...): NaNs produced
model_linearity <- glm(formula = loan_status ~ ., family = binomial(link = "logit"),data = full_data_no_categorical_status)
summary(model_linearity)
## 
## Call:
## glm(formula = loan_status ~ ., family = binomial(link = "logit"), 
##     data = full_data_no_categorical_status)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1680  -1.1848   0.6907   0.8759   1.7380  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                  1.730e+00  7.697e-01   2.247  0.02461 * 
## loan_amnt                   -8.166e-07  1.163e-05  -0.070  0.94405   
## term                        -6.410e-03  6.198e-03  -1.034  0.30101   
## int_rate                    -6.800e-03  9.655e-03  -0.704  0.48122   
## installment                 -1.222e-04  5.130e-04  -0.238  0.81175   
## annual_inc                   7.518e-08  1.210e-07   0.621  0.53432   
## dti                         -4.857e-03  3.899e-03  -1.246  0.21289   
## delinq_2yrs                 -4.295e-02  4.143e-02  -1.037  0.29979   
## inq_last_6mths              -1.550e-02  6.810e-02  -0.228  0.81995   
## mths_since_last_delinq      -1.782e-03  1.470e-03  -1.212  0.22543   
## mths_since_last_record       4.471e-04  9.169e-04   0.488  0.62582   
## open_acc                    -7.902e-03  8.436e-03  -0.937  0.34890   
## pub_rec                     -2.440e-02  5.779e-02  -0.422  0.67287   
## revol_bal                    7.318e-08  9.358e-07   0.078  0.93767   
## revol_util                   2.125e-03  1.360e-03   1.563  0.11816   
## total_acc                    8.091e-03  2.952e-03   2.741  0.00613 **
## collections_12_mths_ex_med   1.149e-01  2.192e-01   0.524  0.60014   
## mths_since_last_major_derog  5.865e-04  1.226e-03   0.479  0.63226   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.75  on 389  degrees of freedom
## Residual deviance: 458.77  on 372  degrees of freedom
##   (35832 observations deleted due to missingness)
## AIC: 494.77
## 
## Number of Fisher Scoring iterations: 4
#oslem.test(full_data$loan_status, fitted(y), g=10)
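
For comparison, the Box-Tidwell check is usually set up by keeping each original predictor in the model and adding its X * ln(X) term alongside it, then looking at the significance of the added term. The following is a minimal sketch of that variant on simulated data with one strictly positive predictor. The data, the variable names and the true relationship are assumptions for illustration, and this is not the code used to produce the output above.

```r
set.seed(7)
n <- 2000

# Simulated predictor (strictly positive, so log(x) is defined) whose true
# effect on the log odds is logarithmic rather than linear
x <- runif(n, 1, 50)
y <- rbinom(n, 1, plogis(-3 + 1.2 * log(x)))

demo <- data.frame(y = y, x = x)
demo$x_logx <- demo$x * log(demo$x)   # the Box-Tidwell term

# Keep x in the model and add x * log(x) next to it
bt_fit <- glm(y ~ x + x_logx, data = demo, family = binomial(link = "logit"))

# A small p-value for x_logx suggests non-linearity in the logit
coef(summary(bt_fit))
```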

I replaced the total accounts variable with its natural logarithm.

full_data <- full_data %>% mutate(total_acc = log(total_acc))

In order to find the best-fitting model, one that provides the best prediction of the response variable, one should use non-redundant explanatory variables. To decide which explanatory variables to include in the multiple logistic regression, I checked whether the explanatory variables are correlated with each other.

Two highly correlated variables should not both be in the final regression model.

In order to find the best regression model I ran the step function, which implements a stepwise approach: it adds and removes variables one at a time and selects the regression model with the lowest AIC (Akaike information criterion) value. Lower values of AIC indicate the preferred model, that is, the one with the fewest parameters that still provides an adequate fit to the data.
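
As a small illustration of the criterion, AIC equals -2 times the log-likelihood plus 2 times the number of estimated parameters. The sketch below checks this by hand and then runs step on simulated data. The data and the variable names x1 and x2 are assumptions for illustration, with x2 constructed as pure noise.

```r
set.seed(3)
n  <- 300
x1 <- rnorm(n)            # has a real effect on the outcome
x2 <- rnorm(n)            # pure noise
y  <- rbinom(n, 1, plogis(0.5 + 1.0 * x1))
demo <- data.frame(y = y, x1 = x1, x2 = x2)

m_null <- glm(y ~ 1,       data = demo, family = binomial(link = "logit"))
m_full <- glm(y ~ x1 + x2, data = demo, family = binomial(link = "logit"))

# AIC = -2 * logLik + 2 * (number of estimated parameters)
ll <- logLik(m_full)
aic_by_hand <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
all.equal(aic_by_hand, AIC(m_full))

# Stepwise search from the null model, adding or dropping one term at a time
selected <- step(m_null,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2),
                 direction = "both",
                 trace = 0)
formula(selected)
```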

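# Intercept-only model: the starting point of the search
model.null <- glm(loan_status ~ 1,
                  data = full_data,
                  family = binomial(link = "logit"))

# Model with every candidate predictor: the upper bound of the search
model.full <- glm(loan_status ~ .,
                  data = full_data,
                  family = binomial(link = "logit"))

# Stepwise search in both directions, with a likelihood-ratio test at each step
step(model.null,
     scope = list(upper = model.full),
     direction = "both",
     test = "Chisq",
     data = full_data)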
## Start:  AIC=38611.8
## loan_status ~ 1
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + grade                        6    36736 36750 1873.51 < 2.2e-16 ***
## + int_rate                     1    37092 37096 1517.43 < 2.2e-16 ***
## + term                         1    37555 37559 1054.57 < 2.2e-16 ***
## + dti                          1    37891 37895  719.12 < 2.2e-16 ***
## + loan_amnt                    1    37991 37995  618.82 < 2.2e-16 ***
## + installment                  1    38155 38159  455.20 < 2.2e-16 ***
## + open_acc                     1    38411 38415  199.08 < 2.2e-16 ***
## + verification_status          2    38411 38417  199.28 < 2.2e-16 ***
## + home_ownership               3    38411 38419  199.21 < 2.2e-16 ***
## + revol_util                   1    38528 38532   81.78 < 2.2e-16 ***
## + inq_last_6mths               1    38535 38539   74.47 < 2.2e-16 ***
## + emp_length                  11    38518 38542   92.03 6.667e-15 ***
## + delinq_2yrs                  1    38569 38573   41.13 1.423e-10 ***
## + mths_since_last_delinq       1    38576 38580   33.71 6.407e-09 ***
## + annual_inc                   1    38579 38583   30.62 3.134e-08 ***
## + collections_12_mths_ex_med   1    38587 38591   22.85 1.748e-06 ***
## + revol_bal                    1    38594 38598   15.47 8.404e-05 ***
## + mths_since_last_major_derog  1    38604 38608    5.47   0.01938 *  
## + pub_rec                      1    38605 38609    5.12   0.02370 *  
## + mths_since_last_record       1    38605 38609    4.54   0.03307 *  
## <none>                              38610 38612                      
## + total_acc                    1    38610 38614    0.08   0.77779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=36750.29
## loan_status ~ grade
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + dti                          1    36323 36339  413.41 < 2.2e-16 ***
## + term                         1    36533 36549  203.05 < 2.2e-16 ***
## + loan_amnt                    1    36535 36551  201.53 < 2.2e-16 ***
## + home_ownership               3    36576 36596  160.23 < 2.2e-16 ***
## + installment                  1    36612 36628  124.14 < 2.2e-16 ***
## + open_acc                     1    36617 36633  119.23 < 2.2e-16 ***
## + emp_length                  11    36632 36668  104.18 < 2.2e-16 ***
## + verification_status          2    36691 36709   45.74 1.168e-10 ***
## + revol_util                   1    36704 36720   32.30 1.319e-08 ***
## + int_rate                     1    36706 36722   30.59 3.193e-08 ***
## + annual_inc                   1    36710 36726   26.55 2.563e-07 ***
## + collections_12_mths_ex_med   1    36716 36732   20.57 5.752e-06 ***
## + delinq_2yrs                  1    36716 36732   20.31 6.598e-06 ***
## + revol_bal                    1    36719 36735   17.63 2.679e-05 ***
## + mths_since_last_delinq       1    36720 36736   15.84 6.887e-05 ***
## + mths_since_last_record       1    36732 36748    4.31   0.03782 *  
## + pub_rec                      1    36733 36749    3.24   0.07175 .  
## + inq_last_6mths               1    36733 36749    3.22   0.07266 .  
## + mths_since_last_major_derog  1    36734 36750    2.25   0.13362    
## <none>                              36736 36750                      
## + total_acc                    1    36736 36752    0.01   0.92726    
## - grade                        6    38610 38612 1873.51 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=36338.87
## loan_status ~ grade + dti
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + loan_amnt                    1    36085 36103  238.24 < 2.2e-16 ***
## + term                         1    36103 36121  219.45 < 2.2e-16 ***
## + installment                  1    36173 36191  149.44 < 2.2e-16 ***
## + home_ownership               3    36178 36200  144.96 < 2.2e-16 ***
## + emp_length                  11    36237 36275   85.53 1.244e-13 ***
## + verification_status          2    36277 36297   45.86 1.102e-10 ***
## + open_acc                     1    36284 36302   39.28 3.673e-10 ***
## + int_rate                     1    36293 36311   29.83 4.707e-08 ***
## + delinq_2yrs                  1    36300 36318   23.33 1.366e-06 ***
## + mths_since_last_delinq       1    36301 36319   21.66 3.262e-06 ***
## + collections_12_mths_ex_med   1    36303 36321   19.93 8.022e-06 ***
## + total_acc                    1    36310 36328   12.99 0.0003134 ***
## + revol_util                   1    36313 36331   10.04 0.0015328 ** 
## + pub_rec                      1    36314 36332    8.41 0.0037321 ** 
## + revol_bal                    1    36315 36333    7.52 0.0061019 ** 
## + mths_since_last_major_derog  1    36318 36336    5.31 0.0212583 *  
## <none>                              36323 36339                      
## + annual_inc                   1    36322 36340    1.02 0.3124942    
## + mths_since_last_record       1    36322 36340    0.62 0.4297031    
## + inq_last_6mths               1    36322 36340    0.55 0.4592229    
## - dti                          1    36736 36750  413.41 < 2.2e-16 ***
## - grade                        6    37891 37895 1567.81 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=36102.63
## loan_status ~ grade + dti + loan_amnt
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + home_ownership               3    35876 35900  209.07 < 2.2e-16 ***
## + term                         1    35985 36005   99.58 < 2.2e-16 ***
## + emp_length                  11    35970 36010  114.82 < 2.2e-16 ***
## + installment                  1    36017 36037   67.85 < 2.2e-16 ***
## + annual_inc                   1    36042 36062   42.43 7.319e-11 ***
## + int_rate                     1    36050 36070   34.53 4.204e-09 ***
## + total_acc                    1    36052 36072   33.12 8.654e-09 ***
## + verification_status          2    36056 36078   28.25 7.347e-07 ***
## + delinq_2yrs                  1    36060 36080   25.06 5.571e-07 ***
## + collections_12_mths_ex_med   1    36063 36083   21.19 4.150e-06 ***
## + mths_since_last_delinq       1    36065 36085   19.73 8.898e-06 ***
## + open_acc                     1    36068 36088   16.60 4.619e-05 ***
## + pub_rec                      1    36078 36098    6.43   0.01119 *  
## + mths_since_last_major_derog  1    36080 36100    4.76   0.02920 *  
## + revol_util                   1    36082 36102    2.58   0.10856    
## <none>                              36085 36103                      
## + mths_since_last_record       1    36083 36103    1.26   0.26086    
## + revol_bal                    1    36084 36104    0.56   0.45461    
## + inq_last_6mths               1    36084 36104    0.50   0.47748    
## - loan_amnt                    1    36323 36339  238.24 < 2.2e-16 ***
## - dti                          1    36535 36551  450.12 < 2.2e-16 ***
## - grade                        6    37243 37249 1158.75 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35899.56
## loan_status ~ grade + dti + loan_amnt + home_ownership
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + term                         1    35760 35786  115.67 < 2.2e-16 ***
## + installment                  1    35795 35821   80.31 < 2.2e-16 ***
## + emp_length                  11    35778 35824   97.28 6.162e-16 ***
## + int_rate                     1    35839 35865   36.45 1.564e-09 ***
## + delinq_2yrs                  1    35846 35872   29.99 4.342e-08 ***
## + annual_inc                   1    35846 35872   29.75 4.916e-08 ***
## + mths_since_last_delinq       1    35850 35876   25.42 4.605e-07 ***
## + open_acc                     1    35853 35879   22.86 1.741e-06 ***
## + total_acc                    1    35854 35880   21.54 3.471e-06 ***
## + collections_12_mths_ex_med   1    35856 35882   19.21 1.172e-05 ***
## + verification_status          2    35858 35886   17.20 0.0001842 ***
## + revol_util                   1    35868 35894    7.09 0.0077667 ** 
## + pub_rec                      1    35870 35896    5.75 0.0164890 *  
## + mths_since_last_major_derog  1    35870 35896    5.13 0.0235422 *  
## <none>                              35876 35900                      
## + mths_since_last_record       1    35874 35900    1.53 0.2156855    
## + inq_last_6mths               1    35874 35900    1.51 0.2189319    
## + revol_bal                    1    35876 35902    0.02 0.8815250    
## - home_ownership               3    36085 36103  209.07 < 2.2e-16 ***
## - loan_amnt                    1    36178 36200  302.34 < 2.2e-16 ***
## - dti                          1    36312 36334  435.95 < 2.2e-16 ***
## - grade                        6    36965 36977 1089.41 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35785.89
## loan_status ~ grade + dti + loan_amnt + home_ownership + term
## 
##                               Df Deviance   AIC    LRT  Pr(>Chi)    
## + emp_length                  11    35655 35703 104.45 < 2.2e-16 ***
## + int_rate                     1    35720 35748  40.23 2.260e-10 ***
## + delinq_2yrs                  1    35725 35753  35.21 2.953e-09 ***
## + mths_since_last_delinq       1    35729 35757  30.54 3.277e-08 ***
## + total_acc                    1    35734 35762  26.30 2.927e-07 ***
## + open_acc                     1    35738 35766  21.81 3.004e-06 ***
## + annual_inc                   1    35738 35766  21.62 3.323e-06 ***
## + collections_12_mths_ex_med   1    35739 35767  20.61 5.622e-06 ***
## + verification_status          2    35741 35771  18.75 8.494e-05 ***
## + revol_util                   1    35751 35779   9.11  0.002537 ** 
## + pub_rec                      1    35751 35779   8.66  0.003257 ** 
## + mths_since_last_major_derog  1    35754 35782   6.37  0.011587 *  
## + inq_last_6mths               1    35754 35782   6.16  0.013033 *  
## <none>                              35760 35786                     
## + installment                  1    35758 35786   1.74  0.187287    
## + mths_since_last_record       1    35759 35787   0.64  0.425373    
## + revol_bal                    1    35760 35788   0.19  0.663781    
## - term                         1    35876 35900 115.67 < 2.2e-16 ***
## - loan_amnt                    1    35918 35942 158.49 < 2.2e-16 ***
## - home_ownership               3    35985 36005 225.17 < 2.2e-16 ***
## - dti                          1    36198 36222 438.08 < 2.2e-16 ***
## - grade                        6    36437 36451 677.60 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35703.44
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length
## 
##                               Df Deviance   AIC    LRT  Pr(>Chi)    
## + int_rate                     1    35615 35665  40.55 1.914e-10 ***
## + delinq_2yrs                  1    35619 35669  36.30 1.693e-09 ***
## + mths_since_last_delinq       1    35624 35674  31.77 1.738e-08 ***
## + open_acc                     1    35629 35679  26.21 3.062e-07 ***
## + total_acc                    1    35632 35682  23.55 1.216e-06 ***
## + collections_12_mths_ex_med   1    35635 35685  20.21 6.941e-06 ***
## + annual_inc                   1    35642 35692  13.51 0.0002377 ***
## + verification_status          2    35641 35693  14.75 0.0006265 ***
## + revol_util                   1    35644 35694  11.93 0.0005535 ***
## + pub_rec                      1    35647 35697   8.68 0.0032252 ** 
## + inq_last_6mths               1    35648 35698   7.24 0.0071142 ** 
## + mths_since_last_major_derog  1    35649 35699   6.13 0.0132866 *  
## + installment                  1    35653 35703   2.17 0.1408510    
## <none>                              35655 35703                     
## + mths_since_last_record       1    35655 35705   0.74 0.3892936    
## + revol_bal                    1    35655 35705   0.40 0.5268580    
## - emp_length                  11    35760 35786 104.45 < 2.2e-16 ***
## - term                         1    35778 35824 122.84 < 2.2e-16 ***
## - loan_amnt                    1    35832 35878 177.01 < 2.2e-16 ***
## - home_ownership               3    35863 35905 207.19 < 2.2e-16 ***
## - dti                          1    36072 36118 416.37 < 2.2e-16 ***
## - grade                        6    36330 36366 674.89 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35664.89
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate
## 
##                               Df Deviance   AIC    LRT  Pr(>Chi)    
## + delinq_2yrs                  1    35579 35631  36.24 1.747e-09 ***
## + mths_since_last_delinq       1    35581 35633  33.40 7.486e-09 ***
## + total_acc                    1    35589 35641  25.93 3.543e-07 ***
## + open_acc                     1    35590 35642  25.27 4.973e-07 ***
## + collections_12_mths_ex_med   1    35596 35648  19.22 1.164e-05 ***
## + annual_inc                   1    35600 35652  15.12 0.0001008 ***
## + verification_status          2    35598 35652  16.59 0.0002491 ***
## + revol_util                   1    35601 35653  13.66 0.0002190 ***
## + installment                  1    35602 35654  12.82 0.0003428 ***
## + inq_last_6mths               1    35606 35658   8.50 0.0035595 ** 
## + pub_rec                      1    35607 35659   8.10 0.0044290 ** 
## + mths_since_last_major_derog  1    35608 35660   7.19 0.0073372 ** 
## <none>                              35615 35665                     
## + mths_since_last_record       1    35613 35665   1.51 0.2198582    
## + revol_bal                    1    35615 35667   0.36 0.5494714    
## - int_rate                     1    35655 35703  40.55 1.914e-10 ***
## - emp_length                  11    35720 35748 104.77 < 2.2e-16 ***
## - term                         1    35742 35790 126.80 < 2.2e-16 ***
## - loan_amnt                    1    35795 35843 179.99 < 2.2e-16 ***
## - home_ownership               3    35825 35869 209.68 < 2.2e-16 ***
## - grade                        6    35882 35920 267.29 < 2.2e-16 ***
## - dti                          1    36031 36079 415.74 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35630.65
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs
## 
##                               Df Deviance   AIC    LRT  Pr(>Chi)    
## + total_acc                    1    35549 35603  29.31 6.172e-08 ***
## + open_acc                     1    35556 35610  22.16 2.508e-06 ***
## + annual_inc                   1    35561 35615  17.57 2.764e-05 ***
## + verification_status          2    35561 35617  17.66 0.0001463 ***
## + collections_12_mths_ex_med   1    35563 35617  15.21 9.634e-05 ***
## + revol_util                   1    35564 35618  14.24 0.0001608 ***
## + installment                  1    35565 35619  13.87 0.0001956 ***
## + mths_since_last_delinq       1    35568 35622  10.68 0.0010834 ** 
## + pub_rec                      1    35571 35625   8.13 0.0043559 ** 
## + inq_last_6mths               1    35571 35625   8.03 0.0046057 ** 
## <none>                              35579 35631                     
## + mths_since_last_record       1    35577 35631   1.70 0.1923600    
## + revol_bal                    1    35578 35632   0.34 0.5615886    
## + mths_since_last_major_derog  1    35579 35633   0.13 0.7210945    
## - delinq_2yrs                  1    35615 35665  36.24 1.747e-09 ***
## - int_rate                     1    35619 35669  40.49 1.975e-10 ***
## - emp_length                  11    35685 35715 105.92 < 2.2e-16 ***
## - term                         1    35711 35761 132.52 < 2.2e-16 ***
## - loan_amnt                    1    35759 35809 180.22 < 2.2e-16 ***
## - home_ownership               3    35794 35840 215.72 < 2.2e-16 ***
## - grade                        6    35841 35881 261.98 < 2.2e-16 ***
## - dti                          1    35998 36048 419.09 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35603.34
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc
## 
##                               Df Deviance   AIC    LRT  Pr(>Chi)    
## + open_acc                     1    35461 35517  88.44 < 2.2e-16 ***
## + inq_last_6mths               1    35533 35589  16.32 5.357e-05 ***
## + verification_status          2    35531 35589  18.04 0.0001210 ***
## + installment                  1    35534 35590  14.85 0.0001165 ***
## + collections_12_mths_ex_med   1    35535 35591  14.01 0.0001816 ***
## + annual_inc                   1    35537 35593  12.45 0.0004186 ***
## + mths_since_last_delinq       1    35540 35596   9.75 0.0017907 ** 
## + revol_util                   1    35540 35596   9.72 0.0018231 ** 
## + pub_rec                      1    35543 35599   6.47 0.0109500 *  
## <none>                              35549 35603                     
## + revol_bal                    1    35549 35605   0.63 0.4288168    
## + mths_since_last_record       1    35549 35605   0.25 0.6191527    
## + mths_since_last_major_derog  1    35549 35605   0.06 0.8066411    
## - total_acc                    1    35579 35631  29.31 6.172e-08 ***
## - delinq_2yrs                  1    35589 35641  39.62 3.090e-10 ***
## - int_rate                     1    35592 35644  43.04 5.353e-11 ***
## - emp_length                  11    35652 35684 102.83 < 2.2e-16 ***
## - term                         1    35688 35740 138.28 < 2.2e-16 ***
## - loan_amnt                    1    35742 35794 193.11 < 2.2e-16 ***
## - home_ownership               3    35753 35801 203.54 < 2.2e-16 ***
## - grade                        6    35812 35854 262.46 < 2.2e-16 ***
## - dti                          1    35996 36048 446.60 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35516.9
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc
## 
##                               Df Deviance   AIC    LRT  Pr(>Chi)    
## + annual_inc                   1    35444 35502  16.95 3.831e-05 ***
## + verification_status          2    35443 35503  17.94 0.0001270 ***
## + revol_util                   1    35446 35504  15.11 0.0001013 ***
## + installment                  1    35448 35506  13.30 0.0002654 ***
## + collections_12_mths_ex_med   1    35449 35507  11.96 0.0005429 ***
## + inq_last_6mths               1    35449 35507  11.50 0.0006968 ***
## + mths_since_last_delinq       1    35453 35511   8.08 0.0044820 ** 
## + pub_rec                      1    35455 35513   5.42 0.0199163 *  
## <none>                              35461 35517                     
## + mths_since_last_record       1    35460 35518   0.53 0.4659882    
## + revol_bal                    1    35461 35519   0.08 0.7808971    
## + mths_since_last_major_derog  1    35461 35519   0.04 0.8473246    
## - delinq_2yrs                  1    35497 35551  36.06 1.917e-09 ***
## - int_rate                     1    35505 35559  44.13 3.069e-11 ***
## - open_acc                     1    35549 35603  88.44 < 2.2e-16 ***
## - emp_length                  11    35571 35605 109.85 < 2.2e-16 ***
## - total_acc                    1    35556 35610  95.59 < 2.2e-16 ***
## - term                         1    35603 35657 142.44 < 2.2e-16 ***
## - loan_amnt                    1    35634 35688 173.30 < 2.2e-16 ***
## - home_ownership               3    35667 35717 205.94 < 2.2e-16 ***
## - grade                        6    35721 35765 260.13 < 2.2e-16 ***
## - dti                          1    35828 35882 367.39 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35501.95
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + revol_util                   1    35426 35486  18.017 2.190e-05 ***
## + verification_status          2    35425 35487  18.649 8.923e-05 ***
## + inq_last_6mths               1    35431 35491  13.116 0.0002928 ***
## + installment                  1    35431 35491  12.831 0.0003409 ***
## + collections_12_mths_ex_med   1    35432 35492  12.169 0.0004858 ***
## + mths_since_last_delinq       1    35434 35494   9.461 0.0020994 ** 
## + pub_rec                      1    35437 35497   6.927 0.0084898 ** 
## <none>                              35444 35502                      
## + mths_since_last_record       1    35443 35503   0.932 0.3344409    
## + revol_bal                    1    35444 35504   0.412 0.5207704    
## + mths_since_last_major_derog  1    35444 35504   0.070 0.7917278    
## - annual_inc                   1    35461 35517  16.953 3.831e-05 ***
## - delinq_2yrs                  1    35482 35538  38.022 6.995e-10 ***
## - int_rate                     1    35490 35546  45.719 1.365e-11 ***
## - emp_length                  11    35544 35580 100.496 < 2.2e-16 ***
## - total_acc                    1    35533 35589  89.381 < 2.2e-16 ***
## - open_acc                     1    35537 35593  92.950 < 2.2e-16 ***
## - term                         1    35577 35633 133.428 < 2.2e-16 ***
## - loan_amnt                    1    35632 35688 187.584 < 2.2e-16 ***
## - home_ownership               3    35641 35693 197.082 < 2.2e-16 ***
## - grade                        6    35705 35751 261.289 < 2.2e-16 ***
## - dti                          1    35722 35778 278.227 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35485.93
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + verification_status          2    35407 35471  18.762 8.433e-05 ***
## + inq_last_6mths               1    35410 35472  15.840 6.893e-05 ***
## + collections_12_mths_ex_med   1    35413 35475  13.213 0.0002781 ***
## + installment                  1    35414 35476  12.215 0.0004741 ***
## + mths_since_last_delinq       1    35417 35479   9.397 0.0021730 ** 
## + pub_rec                      1    35419 35481   7.163 0.0074443 ** 
## <none>                              35426 35486                      
## + mths_since_last_record       1    35425 35487   1.222 0.2690492    
## + mths_since_last_major_derog  1    35426 35488   0.155 0.6936460    
## + revol_bal                    1    35426 35488   0.096 0.7563124    
## - revol_util                   1    35444 35502  18.017 2.190e-05 ***
## - annual_inc                   1    35446 35504  19.859 8.339e-06 ***
## - delinq_2yrs                  1    35464 35522  38.430 5.675e-10 ***
## - int_rate                     1    35474 35532  47.786 4.755e-12 ***
## - total_acc                    1    35510 35568  83.743 < 2.2e-16 ***
## - emp_length                  11    35530 35568 103.767 < 2.2e-16 ***
## - open_acc                     1    35525 35583  99.443 < 2.2e-16 ***
## - term                         1    35562 35620 135.795 < 2.2e-16 ***
## - loan_amnt                    1    35605 35663 178.762 < 2.2e-16 ***
## - home_ownership               3    35631 35685 205.007 < 2.2e-16 ***
## - dti                          1    35668 35726 242.556 < 2.2e-16 ***
## - grade                        6    35687 35735 261.083 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35471.17
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + inq_last_6mths               1    35392 35458  15.538 8.086e-05 ***
## + collections_12_mths_ex_med   1    35394 35460  12.862 0.0003353 ***
## + installment                  1    35395 35461  12.024 0.0005253 ***
## + mths_since_last_delinq       1    35398 35464   9.411 0.0021564 ** 
## + pub_rec                      1    35401 35467   6.601 0.0101937 *  
## <none>                              35407 35471                      
## + mths_since_last_record       1    35406 35472   0.718 0.3968847    
## + mths_since_last_major_derog  1    35407 35473   0.106 0.7446273    
## + revol_bal                    1    35407 35473   0.080 0.7769471    
## - verification_status          2    35426 35486  18.762 8.433e-05 ***
## - revol_util                   1    35425 35487  18.130 2.064e-05 ***
## - annual_inc                   1    35428 35490  20.626 5.583e-06 ***
## - delinq_2yrs                  1    35447 35509  39.584 3.142e-10 ***
## - int_rate                     1    35457 35519  50.127 1.441e-12 ***
## - emp_length                  11    35504 35546  96.916 7.270e-16 ***
## - total_acc                    1    35491 35553  84.027 < 2.2e-16 ***
## - open_acc                     1    35507 35569  99.465 < 2.2e-16 ***
## - term                         1    35544 35606 136.824 < 2.2e-16 ***
## - loan_amnt                    1    35573 35635 166.071 < 2.2e-16 ***
## - home_ownership               3    35601 35659 193.650 < 2.2e-16 ***
## - dti                          1    35649 35711 242.280 < 2.2e-16 ***
## - grade                        6    35669 35721 261.764 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35457.63
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + collections_12_mths_ex_med   1    35379 35447  12.900 0.0003287 ***
## + installment                  1    35380 35448  11.549 0.0006777 ***
## + mths_since_last_delinq       1    35381 35449  10.184 0.0014166 ** 
## + pub_rec                      1    35385 35453   6.584 0.0102926 *  
## <none>                              35392 35458                      
## + mths_since_last_record       1    35391 35459   0.691 0.4056797    
## + mths_since_last_major_derog  1    35391 35459   0.199 0.6553885    
## + revol_bal                    1    35392 35460   0.077 0.7813563    
## - inq_last_6mths               1    35407 35471  15.538 8.086e-05 ***
## - verification_status          2    35410 35472  18.460 9.808e-05 ***
## - revol_util                   1    35412 35476  20.808 5.078e-06 ***
## - annual_inc                   1    35414 35478  22.829 1.771e-06 ***
## - delinq_2yrs                  1    35431 35495  39.728 2.918e-10 ***
## - int_rate                     1    35444 35508  52.401 4.525e-13 ***
## - emp_length                  11    35490 35534  98.325 3.830e-16 ***
## - total_acc                    1    35483 35547  91.762 < 2.2e-16 ***
## - open_acc                     1    35486 35550  94.584 < 2.2e-16 ***
## - term                         1    35538 35602 146.066 < 2.2e-16 ***
## - loan_amnt                    1    35566 35630 174.493 < 2.2e-16 ***
## - home_ownership               3    35588 35648 196.679 < 2.2e-16 ***
## - grade                        6    35642 35696 250.275 < 2.2e-16 ***
## - dti                          1    35641 35705 249.231 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35446.73
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + installment                  1    35367 35437  11.426 0.0007244 ***
## + mths_since_last_delinq       1    35370 35440   8.730 0.0031297 ** 
## + pub_rec                      1    35372 35442   6.408 0.0113585 *  
## <none>                              35379 35447                      
## + mths_since_last_record       1    35378 35448   0.636 0.4250090    
## + revol_bal                    1    35379 35449   0.069 0.7925676    
## + mths_since_last_major_derog  1    35379 35449   0.003 0.9571168    
## - collections_12_mths_ex_med   1    35392 35458  12.900 0.0003287 ***
## - inq_last_6mths               1    35394 35460  15.575 7.929e-05 ***
## - verification_status          2    35397 35461  18.119 0.0001163 ***
## - revol_util                   1    35401 35467  21.901 2.871e-06 ***
## - annual_inc                   1    35402 35468  23.158 1.492e-06 ***
## - delinq_2yrs                  1    35415 35481  35.800 2.187e-09 ***
## - int_rate                     1    35430 35496  51.395 7.553e-13 ***
## - emp_length                  11    35477 35523  98.076 4.290e-16 ***
## - total_acc                    1    35467 35533  88.628 < 2.2e-16 ***
## - open_acc                     1    35471 35537  92.665 < 2.2e-16 ***
## - term                         1    35526 35592 146.832 < 2.2e-16 ***
## - loan_amnt                    1    35553 35619 174.724 < 2.2e-16 ***
## - home_ownership               3    35574 35636 195.616 < 2.2e-16 ***
## - grade                        6    35627 35683 248.595 < 2.2e-16 ***
## - dti                          1    35627 35693 247.841 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35437.31
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med + installment
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + mths_since_last_delinq       1    35358 35430   8.843 0.0029421 ** 
## + pub_rec                      1    35361 35433   6.230 0.0125635 *  
## - loan_amnt                    1    35369 35437   1.743 0.1867110    
## <none>                              35367 35437                      
## + mths_since_last_record       1    35367 35439   0.635 0.4256166    
## + revol_bal                    1    35367 35439   0.043 0.8360434    
## + mths_since_last_major_derog  1    35367 35439   0.004 0.9483777    
## - installment                  1    35379 35447  11.426 0.0007244 ***
## - collections_12_mths_ex_med   1    35380 35448  12.776 0.0003512 ***
## - inq_last_6mths               1    35382 35450  15.100 0.0001019 ***
## - verification_status          2    35385 35451  17.928 0.0001279 ***
## - revol_util                   1    35388 35456  21.192 4.156e-06 ***
## - annual_inc                   1    35390 35458  22.527 2.072e-06 ***
## - delinq_2yrs                  1    35404 35472  36.783 1.320e-09 ***
## - int_rate                     1    35429 35497  61.729 3.942e-15 ***
## - term                         1    35438 35506  70.462 < 2.2e-16 ***
## - emp_length                  11    35466 35514  99.139 2.645e-16 ***
## - total_acc                    1    35456 35524  89.115 < 2.2e-16 ***
## - open_acc                     1    35458 35526  91.064 < 2.2e-16 ***
## - home_ownership               3    35563 35627 195.359 < 2.2e-16 ***
## - grade                        6    35622 35680 254.949 < 2.2e-16 ***
## - dti                          1    35615 35683 247.918 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35430.46
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med + installment + mths_since_last_delinq
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + pub_rec                      1    35352 35426   6.333 0.0118483 *  
## + mths_since_last_major_derog  1    35354 35428   4.811 0.0282777 *  
## - loan_amnt                    1    35360 35430   1.785 0.1815103    
## <none>                              35358 35430                      
## + mths_since_last_record       1    35358 35432   0.735 0.3912658    
## + revol_bal                    1    35358 35432   0.034 0.8538711    
## - mths_since_last_delinq       1    35367 35437   8.843 0.0029421 ** 
## - collections_12_mths_ex_med   1    35370 35440  11.318 0.0007676 ***
## - installment                  1    35370 35440  11.538 0.0006818 ***
## - delinq_2yrs                  1    35374 35444  15.110 0.0001014 ***
## - inq_last_6mths               1    35374 35444  15.814 6.988e-05 ***
## - verification_status          2    35376 35444  17.923 0.0001282 ***
## - revol_util                   1    35380 35450  21.129 4.293e-06 ***
## - annual_inc                   1    35383 35453  24.078 9.253e-07 ***
## - int_rate                     1    35422 35492  63.075 1.989e-15 ***
## - term                         1    35430 35500  71.254 < 2.2e-16 ***
## - emp_length                  11    35458 35508  99.248 2.516e-16 ***
## - total_acc                    1    35445 35515  87.009 < 2.2e-16 ***
## - open_acc                     1    35448 35518  89.501 < 2.2e-16 ***
## - home_ownership               3    35556 35622 197.704 < 2.2e-16 ***
## - grade                        6    35615 35675 256.050 < 2.2e-16 ***
## - dti                          1    35607 35677 248.777 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35426.13
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med + installment + mths_since_last_delinq + 
##     pub_rec
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## + mths_since_last_major_derog  1    35348 35424   4.592 0.0321127 *  
## - loan_amnt                    1    35354 35426   1.721 0.1895678    
## <none>                              35352 35426                      
## + mths_since_last_record       1    35352 35428   0.025 0.8740707    
## + revol_bal                    1    35352 35428   0.025 0.8749050    
## - pub_rec                      1    35358 35430   6.333 0.0118483 *  
## - mths_since_last_delinq       1    35361 35433   8.947 0.0027794 ** 
## - collections_12_mths_ex_med   1    35363 35435  11.147 0.0008416 ***
## - installment                  1    35363 35435  11.358 0.0007511 ***
## - delinq_2yrs                  1    35367 35439  15.069 0.0001037 ***
## - verification_status          2    35370 35440  17.391 0.0001673 ***
## - inq_last_6mths               1    35368 35440  15.806 7.018e-05 ***
## - revol_util                   1    35373 35445  21.363 3.799e-06 ***
## - annual_inc                   1    35378 35450  25.802 3.784e-07 ***
## - int_rate                     1    35414 35486  62.319 2.921e-15 ***
## - term                         1    35424 35496  71.588 < 2.2e-16 ***
## - emp_length                  11    35451 35503  98.991 2.828e-16 ***
## - total_acc                    1    35436 35508  83.578 < 2.2e-16 ***
## - open_acc                     1    35441 35513  88.662 < 2.2e-16 ***
## - home_ownership               3    35549 35617 197.305 < 2.2e-16 ***
## - grade                        6    35606 35668 254.164 < 2.2e-16 ***
## - dti                          1    35602 35674 249.957 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35423.54
## loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med + installment + mths_since_last_delinq + 
##     pub_rec + mths_since_last_major_derog
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## - loan_amnt                    1    35349 35423   1.713 0.1906389    
## <none>                              35348 35424                      
## + revol_bal                    1    35348 35426   0.027 0.8701625    
## + mths_since_last_record       1    35348 35426   0.021 0.8842465    
## - mths_since_last_major_derog  1    35352 35426   4.592 0.0321127 *  
## - pub_rec                      1    35354 35428   6.115 0.0134052 *  
## - installment                  1    35359 35433  11.367 0.0007475 ***
## - collections_12_mths_ex_med   1    35360 35434  12.325 0.0004468 ***
## - mths_since_last_delinq       1    35361 35435  13.526 0.0002353 ***
## - inq_last_6mths               1    35363 35437  15.535 8.099e-05 ***
## - verification_status          2    35365 35437  17.769 0.0001385 ***
## - delinq_2yrs                  1    35364 35438  16.460 4.969e-05 ***
## - revol_util                   1    35368 35442  20.557 5.789e-06 ***
## - annual_inc                   1    35374 35448  26.166 3.132e-07 ***
## - int_rate                     1    35409 35483  61.428 4.592e-15 ***
## - term                         1    35419 35493  71.699 < 2.2e-16 ***
## - emp_length                  11    35447 35501  99.438 2.307e-16 ***
## - total_acc                    1    35431 35505  83.341 < 2.2e-16 ***
## - open_acc                     1    35435 35509  87.893 < 2.2e-16 ***
## - home_ownership               3    35546 35616 198.562 < 2.2e-16 ***
## - grade                        6    35599 35663 251.788 < 2.2e-16 ***
## - dti                          1    35596 35670 248.500 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=35423.25
## loan_status ~ grade + dti + home_ownership + term + emp_length + 
##     int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + 
##     revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + 
##     installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog
## 
##                               Df Deviance   AIC     LRT  Pr(>Chi)    
## <none>                              35349 35423                      
## + loan_amnt                    1    35348 35424   1.713 0.1906389    
## + revol_bal                    1    35349 35425   0.041 0.8398013    
## + mths_since_last_record       1    35349 35425   0.020 0.8872851    
## - mths_since_last_major_derog  1    35354 35426   4.601 0.0319576 *  
## - pub_rec                      1    35355 35427   6.177 0.0129404 *  
## - collections_12_mths_ex_med   1    35362 35434  12.380 0.0004340 ***
## - mths_since_last_delinq       1    35363 35435  13.495 0.0002392 ***
## - verification_status          2    35367 35437  17.688 0.0001443 ***
## - inq_last_6mths               1    35365 35437  15.890 6.712e-05 ***
## - delinq_2yrs                  1    35366 35438  16.297 5.414e-05 ***
## - revol_util                   1    35370 35442  20.714 5.331e-06 ***
## - annual_inc                   1    35377 35449  27.468 1.597e-07 ***
## - int_rate                     1    35410 35482  60.858 6.134e-15 ***
## - emp_length                  11    35448 35500  99.167 2.611e-16 ***
## - total_acc                    1    35432 35504  83.224 < 2.2e-16 ***
## - open_acc                     1    35438 35510  88.310 < 2.2e-16 ***
## - installment                  1    35534 35606 184.789 < 2.2e-16 ***
## - home_ownership               3    35548 35616 199.092 < 2.2e-16 ***
## - grade                        6    35599 35661 250.098 < 2.2e-16 ***
## - dti                          1    35597 35669 248.126 < 2.2e-16 ***
## - term                         1    35647 35719 297.572 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:  glm(formula = loan_status ~ grade + dti + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med + installment + mths_since_last_delinq + 
##     pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"), 
##     data = full_data)
## 
## Coefficients:
##                        (Intercept)                              gradeB  
##                          2.588e+00                          -5.584e-01  
##                             gradeC                              gradeD  
##                         -1.132e+00                          -1.636e+00  
##                             gradeE                              gradeF  
##                         -2.134e+00                          -2.398e+00  
##                             gradeG                                 dti  
##                         -2.676e+00                          -2.763e-02  
##                  home_ownershipOWN                  home_ownershipRENT  
##                         -2.802e-01                          -4.040e-01  
##                  home_ownershipANY                                term  
##                         -1.010e+00                          -2.503e-02  
##                   emp_length1 year                 emp_length10+ years  
##                          6.807e-02                           2.986e-01  
##                  emp_length2 years                   emp_length3 years  
##                          1.523e-01                           1.715e-01  
##                  emp_length4 years                   emp_length5 years  
##                          1.581e-01                           1.773e-01  
##                  emp_length6 years                   emp_length7 years  
##                          2.098e-01                           2.804e-01  
##                  emp_length8 years                   emp_length9 years  
##                          1.543e-01                           1.656e-01  
##                      emp_lengthn/a                            int_rate  
##                         -1.693e-01                           7.610e-02  
##                        delinq_2yrs                           total_acc  
##                         -5.004e-02                           3.714e-01  
##                           open_acc                          annual_inc  
##                         -3.107e-02                           1.888e-06  
##                         revol_util  verification_statusSource Verified  
##                         -2.896e-03                          -1.526e-01  
##        verification_statusVerified                      inq_last_6mths  
##                         -1.300e-01                          -4.835e-02  
##         collections_12_mths_ex_med                         installment  
##                         -2.130e-01                          -8.861e-04  
##             mths_since_last_delinq                             pub_rec  
##                          3.105e-03                          -3.710e-02  
##        mths_since_last_major_derog  
##                         -1.688e-03  
## 
## Degrees of Freedom: 36221 Total (i.e. Null);  36185 Residual
## Null Deviance:       38610 
## Residual Deviance: 35350     AIC: 35420

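The step function selects the model with the lowest AIC. It left out loan_amnt, revol_bal and mths_since_last_record, which add little information beyond the predictors already in the model (for example, loan_amnt is closely related to installment). The best model is shown below: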

#optimal model
final.model <- glm(formula = loan_status ~ grade + dti + home_ownership + term + emp_length + int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"), data = full_data)

summary(final.model)
## 
## Call:
## glm(formula = loan_status ~ grade + dti + home_ownership + term + 
##     emp_length + int_rate + delinq_2yrs + total_acc + open_acc + 
##     annual_inc + revol_util + verification_status + inq_last_6mths + 
##     collections_12_mths_ex_med + installment + mths_since_last_delinq + 
##     pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"), 
##     data = full_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5328   0.3621   0.5536   0.7259   2.4219  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         2.588e+00  1.798e-01  14.389  < 2e-16
## gradeB                             -5.584e-01  8.513e-02  -6.560 5.40e-11
## gradeC                             -1.132e+00  9.916e-02 -11.418  < 2e-16
## gradeD                             -1.636e+00  1.240e-01 -13.196  < 2e-16
## gradeE                             -2.134e+00  1.515e-01 -14.084  < 2e-16
## gradeF                             -2.398e+00  1.911e-01 -12.551  < 2e-16
## gradeG                             -2.676e+00  2.348e-01 -11.398  < 2e-16
## dti                                -2.763e-02  1.759e-03 -15.701  < 2e-16
## home_ownershipOWN                  -2.802e-01  4.495e-02  -6.234 4.53e-10
## home_ownershipRENT                 -4.040e-01  2.900e-02 -13.930  < 2e-16
## home_ownershipANY                  -1.010e+00  9.549e-01  -1.057 0.290405
## term                               -2.503e-02  1.438e-03 -17.405  < 2e-16
## emp_length1 year                    6.807e-02  7.434e-02   0.916 0.359871
## emp_length10+ years                 2.986e-01  5.545e-02   5.385 7.25e-08
## emp_length2 years                   1.523e-01  6.834e-02   2.228 0.025867
## emp_length3 years                   1.715e-01  6.925e-02   2.476 0.013284
## emp_length4 years                   1.581e-01  7.311e-02   2.163 0.030550
## emp_length5 years                   1.773e-01  7.351e-02   2.411 0.015896
## emp_length6 years                   2.098e-01  7.919e-02   2.650 0.008060
## emp_length7 years                   2.804e-01  8.161e-02   3.436 0.000589
## emp_length8 years                   1.543e-01  7.971e-02   1.936 0.052917
## emp_length9 years                   1.656e-01  8.530e-02   1.941 0.052261
## emp_lengthn/a                      -1.693e-01  6.913e-02  -2.449 0.014320
## int_rate                            7.610e-02  9.796e-03   7.768 7.97e-15
## delinq_2yrs                        -5.004e-02  1.218e-02  -4.110 3.96e-05
## total_acc                           3.714e-01  4.076e-02   9.112  < 2e-16
## open_acc                           -3.107e-02  3.293e-03  -9.435  < 2e-16
## annual_inc                          1.888e-06  3.749e-07   5.037 4.73e-07
## revol_util                         -2.896e-03  6.365e-04  -4.550 5.38e-06
## verification_statusSource Verified -1.526e-01  3.703e-02  -4.120 3.79e-05
## verification_statusVerified        -1.300e-01  3.934e-02  -3.304 0.000954
## inq_last_6mths                     -4.835e-02  1.208e-02  -4.001 6.30e-05
## collections_12_mths_ex_med         -2.130e-01  5.955e-02  -3.577 0.000348
## installment                        -8.861e-04  6.497e-05 -13.639  < 2e-16
## mths_since_last_delinq              3.105e-03  8.437e-04   3.680 0.000234
## pub_rec                            -3.710e-02  1.476e-02  -2.513 0.011958
## mths_since_last_major_derog        -1.688e-03  7.846e-04  -2.151 0.031456
##                                       
## (Intercept)                        ***
## gradeB                             ***
## gradeC                             ***
## gradeD                             ***
## gradeE                             ***
## gradeF                             ***
## gradeG                             ***
## dti                                ***
## home_ownershipOWN                  ***
## home_ownershipRENT                 ***
## home_ownershipANY                     
## term                               ***
## emp_length1 year                      
## emp_length10+ years                ***
## emp_length2 years                  *  
## emp_length3 years                  *  
## emp_length4 years                  *  
## emp_length5 years                  *  
## emp_length6 years                  ** 
## emp_length7 years                  ***
## emp_length8 years                  .  
## emp_length9 years                  .  
## emp_lengthn/a                      *  
## int_rate                           ***
## delinq_2yrs                        ***
## total_acc                          ***
## open_acc                           ***
## annual_inc                         ***
## revol_util                         ***
## verification_statusSource Verified ***
## verification_statusVerified        ***
## inq_last_6mths                     ***
## collections_12_mths_ex_med         ***
## installment                        ***
## mths_since_last_delinq             ***
## pub_rec                            *  
## mths_since_last_major_derog        *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38610  on 36221  degrees of freedom
## Residual deviance: 35349  on 36185  degrees of freedom
## AIC: 35423
## 
## Number of Fisher Scoring iterations: 4
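
As a quick check on the AIC criterion used by the step function, the AIC in the summary can be reproduced from the residual deviance. This is a minimal sketch that assumes ungrouped 0/1 responses (so the deviance equals -2 times the log-likelihood); the variable names are my own and the numbers are copied from the summary output above.

```r
# values copied from the summary output above
residual_deviance <- 35349
null_df <- 36221
residual_df <- 36185

# number of estimated parameters: 36 slope coefficients plus the intercept
k <- (null_df - residual_df) + 1

# AIC = -2*logLik + 2*k; for ungrouped binary data -2*logLik is the residual deviance
aic <- residual_deviance + 2 * k
aic
```

The result agrees with the AIC of 35423 reported by summary(), which shows that each additional coefficient has to reduce the deviance by more than 2 to be kept by the stepwise search.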

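By looking at the summary statistics, I concluded that the coefficients for “home_ownershipANY”, “emp_length1 year”, “emp_length8 years” and “emp_length9 years” are not statistically significant because their p-values are greater than the 0.05 level of significance. These are individual levels of categorical variables, so the result means that these levels cannot be distinguished from the reference level of the same variable, not that home ownership or employment length has no effect on the outcome.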
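
The p-values in the summary come from a Wald z-test: the estimate is divided by its standard error and compared with the standard normal distribution. The sketch below recomputes them for the four coefficients mentioned above. The data frame and its column names are my own, and the numbers are the rounded values from the summary table, so the results match the table only approximately.

```r
# estimates and standard errors copied (rounded) from the summary table
coef_table <- data.frame(
  term = c("home_ownershipANY", "emp_length1 year",
           "emp_length8 years", "emp_length9 years"),
  estimate = c(-1.010e+00, 6.807e-02, 1.543e-01, 1.656e-01),
  std_error = c(9.549e-01, 7.434e-02, 7.971e-02, 8.530e-02)
)

# Wald z statistic and two-sided p-value from the standard normal distribution
coef_table$z_value <- coef_table$estimate / coef_table$std_error
coef_table$p_value <- 2 * pnorm(-abs(coef_table$z_value))

coef_table
```

All four p-values are above 0.05, although the ones for 8 and 9 years of employment are only slightly above it.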

To assess goodness of fit I used the likelihood ratio test, pseudo R^2 measures and the Hosmer-Lemeshow test.

A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well, but it is necessary to test whether the observed difference in model fit is statistically significant.

I stated the following hypotheses:

Null hypothesis: the reduced model is true

Alternative hypothesis: the reduced model is false

I compared the final regression model generated by the step function with the initial null model and with a model with fewer predictors.

#the model with fewer predictors
model_fewer_pred <- glm(formula = loan_status ~ grade + dti + loan_amnt + home_ownership + term + emp_length + int_rate, family = binomial(link = "logit"), data = full_data)

#optimal model vs null model
anova(final.model, model.null, test ="Chisq")
## Analysis of Deviance Table
## 
## Model 1: loan_status ~ grade + dti + home_ownership + term + emp_length + 
##     int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + 
##     revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + 
##     installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog
## Model 2: loan_status ~ 1
##   Resid. Df Resid. Dev  Df Deviance  Pr(>Chi)    
## 1     36185      35349                           
## 2     36221      38610 -36  -3260.5 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#optimal model vs the model with fewer predictors
anova(final.model, model_fewer_pred, test ="Chisq")
## Analysis of Deviance Table
## 
## Model 1: loan_status ~ grade + dti + home_ownership + term + emp_length + 
##     int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + 
##     revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + 
##     installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog
## Model 2: loan_status ~ grade + dti + loan_amnt + home_ownership + term + 
##     emp_length + int_rate
##   Resid. Df Resid. Dev  Df Deviance  Pr(>Chi)    
## 1     36185      35349                           
## 2     36197      35615 -12  -265.64 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

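In both cases the null hypothesis is rejected (the p-value is less than the 5% significance level), which provides evidence against the reduced model in favor of the final model. Note that model_fewer_pred contains loan_amnt, which is not in the final model, so these two models are not strictly nested and the second comparison is only approximate.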
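
The p-values reported by anova() can be reproduced by hand: the test statistic is the difference in residual deviance, and it is compared with a chi-square distribution whose degrees of freedom equal the difference in residual degrees of freedom. The helper function below is hypothetical (it is not part of any package), and the inputs are the rounded values from the two tables above.

```r
# likelihood ratio test from residual deviances and residual degrees of freedom
lrt_p_value <- function(deviance_reduced, deviance_full, df_reduced, df_full) {
  statistic <- deviance_reduced - deviance_full  # drop in deviance
  df <- df_reduced - df_full                     # number of extra parameters
  pchisq(statistic, df = df, lower.tail = FALSE)
}

# final model vs null model
p_null <- lrt_p_value(38610, 35349, 36221, 36185)

# final model vs the model with fewer predictors
p_fewer <- lrt_p_value(35615, 35349, 36197, 36185)

c(p_null = p_null, p_fewer = p_fewer)
```

Both p-values are far below 0.05, in line with the "< 2.2e-16" shown in the anova() output.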

Unlike linear regression with ordinary least squares estimation, logistic regression has no R-squared statistic that gives the proportion of variance in the dependent variable explained by the predictors. However, there are a number of pseudo R-squared metrics that could be of value. Most notable is McFadden’s R-squared. It ranges from 0 to just under 1, with values closer to zero indicating that the model has little predictive power.

pR2(final.model)
##           llh       llhNull            G2      McFadden          r2ML 
## -1.767462e+04 -1.930490e+04  3.260549e+03  8.444875e-02  8.608317e-02 
##          r2CU 
##  1.313065e-01

McFadden’s R-squared is about 0.084 and r2CU (the Cragg-Uhler, or Nagelkerke, R-squared) is about 0.131. Both values are close to 0, which indicates that the model has weak predictive power.
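
To make the pseudo R-squared values less of a black box, the sketch below recomputes them from the two log-likelihoods printed by pR2(). The variable names are my own, and n = 36222 is an assumption derived from the 36221 total degrees of freedom in the model summary.

```r
# log-likelihoods copied from the pR2() output above
llh <- -1.767462e+04       # fitted model
llh_null <- -1.930490e+04  # intercept-only model
n <- 36222                 # assumed number of observations (total df + 1)

# McFadden: 1 - logLik(model) / logLik(null)
mcfadden <- 1 - llh / llh_null

# likelihood ratio statistic G2
g2 <- -2 * (llh_null - llh)

# maximum likelihood (Cox-Snell) pseudo R-squared
r2_ml <- 1 - exp(-g2 / n)

# Cragg-Uhler (Nagelkerke) pseudo R-squared: r2ML rescaled to a maximum of 1
r2_cu <- r2_ml / (1 - exp(2 * llh_null / n))

c(McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu)
```

The three values agree with the pR2() output to about four decimal places.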

Another approach to determining the goodness of fit is the Hosmer-Lemeshow statistic, which is computed after the observations have been segmented into groups with similar predicted probabilities. It examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the data set using a Pearson chi-square test.

I stated the following hypotheses:

Null hypothesis: there is a good fit of the model

Alternative hypothesis: there is a poor fit of the model

hoslem.test(full_data$loan_status, fitted(final.model), g=10)
## Warning in Ops.factor(1, y): '-' not meaningful for factors
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  full_data$loan_status, fitted(final.model)
## X-squared = 36222, df = 8, p-value < 2.2e-16

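Since the p-value is close to 0 I rejected the null hypothesis, which would indicate a poor fit of the model. However, the warning in the output shows that loan_status was passed as a factor rather than as a numeric 0/1 vector, and the statistic (36222) equals the number of observations, so this result should be treated with caution.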
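
To show what the Hosmer-Lemeshow statistic measures, and why the response needs to be numeric, here is a minimal base R sketch. The function hosmer_lemeshow() is a hypothetical helper written for illustration, not the hoslem.test implementation, and the data are simulated rather than taken from the Lending Club data set.

```r
# Hosmer-Lemeshow statistic computed by hand; y must be a numeric 0/1 vector
hosmer_lemeshow <- function(y, p_hat, g = 10) {
  stopifnot(is.numeric(y), all(y %in% c(0, 1)))
  # split the observations into g groups of similar predicted probability
  breaks <- unique(quantile(p_hat, probs = seq(0, 1, length.out = g + 1)))
  bins <- cut(p_hat, breaks = breaks, include.lowest = TRUE)
  observed <- tapply(y, bins, sum)      # observed events per group
  expected <- tapply(p_hat, bins, sum)  # expected events per group
  n_bin <- tapply(y, bins, length)      # group sizes
  statistic <- sum((observed - expected)^2 / (expected * (1 - expected / n_bin)))
  df <- length(n_bin) - 2
  list(statistic = statistic, df = df,
       p_value = pchisq(statistic, df = df, lower.tail = FALSE))
}

# simulated example (hypothetical data)
set.seed(1)
x <- rnorm(2000)
y <- rbinom(2000, size = 1, prob = plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial(link = "logit"))
hl <- hosmer_lemeshow(y, fitted(fit), g = 10)
hl

# a two-level factor such as loan_status can be converted to 0/1 like this
status <- factor(c("Charged Off", "Fully Paid", "Fully Paid"))
status_numeric <- as.numeric(status == "Fully Paid")
status_numeric
```

With a factor response the subtraction inside the test is not defined, which is what the warning above points to; converting the response to 0/1 first avoids the problem.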

Based on the results of the optimal logistic regression model I built the following logit function:

#logit(p) = ln(p/(1-p)) =
#logit_p =
#  (3.434e+00)
#  -(5.621e-01)*gradeB-(1.138e+00)*gradeC-(1.644e+00)*gradeD
#  -(2.144e+00)*gradeE-(2.411e+00)*gradeF-(2.689e+00)*gradeG
#  -(2.710e-02)*dti
#  -(2.842e-01)*homeownershipOWN-(4.066e-01)*homeownershipRENT
#  -(2.481e-02)*term
#  +(3.027e-01)*emplength10years+(1.534e-01)*emplength2years+(1.718e-01)*emplength3years
#  +(1.559e-01)*emplength4years+(1.778e-01)*emplength5years+(2.102e-01)*emplength6years
#  +(2.829e-01)*emplength7years+(1.571e-01)*emplength8years-(1.651e-01)*emplengthn/a
#  +(7.601e-02)*intrate-(4.956e-02)*delinq2yrs+(1.239e-02)*totalacc-(3.128e-02)*openacc
#  +(1.922e-06)*annualinc-(2.941e-03)*revolutil
#  -(1.512e-01)*verificationstatussourceverified-(1.317e-01)*verificationstatusverified
#  -(4.745e-02)*inqlast6mths-(2.149e-01)*collections12mthsexmed-(8.799e-04)*installment
#  +(3.070e-03)*mthssincelastdelinq-(3.284e-02)*pubrec-(1.607e-03)*mthssincelastmajorderog

The natural logarithm of the odds (the logit) is a linear function of the independent variables. Applying the inverse of the logit function gives the estimated regression equation for the probability.

The estimated regression equation is shown below:

#p_hat = e^logit_p/(1+e^logit_p)
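
The conversion from logit to probability can be written as a small function. This is a sketch: p_hat() is a name I chose for illustration, and base R's plogis() computes the same quantity.

```r
# inverse of the logit function: converts log-odds into a probability
p_hat <- function(logit_p) {
  exp(logit_p) / (1 + exp(logit_p))
}

# a logit of 0 corresponds to even odds
p_even <- p_hat(0)

# the intercept of the logit function above, i.e. all predictors set to 0
p_intercept <- p_hat(3.434e+00)

# base R's plogis() gives the same result
same_as_plogis <- all.equal(p_intercept, plogis(3.434e+00))

c(p_even = p_even, p_intercept = p_intercept)
```

A logit of 0 gives a probability of 0.5, and larger logits move the probability toward 1.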

To illustrate how the equation works, I will show how the loan grade affects the probability that the loan will be paid off.

Let’s change the loan grade and hold all remaining variables constant (here they are set to 0).

#coefficients of the grade dummy variables (grade A is the reference level, so its coefficient is 0)
grade_coef <- c(A=0, B=-5.621e-01, C=-1.138e+00, D=-1.644e+00, E=-2.144e+00, F=-2.411e+00, G=-2.689e+00)

#logit for each grade; all remaining variables are set to 0
logit_p <- (3.434e+00) + grade_coef

#vectors to store grades and probabilities
grade <- paste("grade", names(grade_coef))
probability <- unname(exp(logit_p)/(1+exp(logit_p)))

#table with results
prob_table <- data.frame(grade,probability)
head(prob_table)
##     grade probability
## 1 grade A   0.9687504
## 2 grade B   0.9464397
## 3 grade C   0.9085452
## 4 grade D   0.8569273
## 5 grade E   0.7841472
## 6 grade F   0.7355566
#graph
g <- ggplot(prob_table, aes(grade,probability))
g + ggtitle("Change in probability based on change in grade") + geom_bar(stat="identity",fill="skyblue")

The bar chart above shows how the probability changes with the loan grade while holding all other variables constant.

#create vectors to store change in dti and probabilities
dti_value <- c()
probability <- c()

for (i in 1:100){
  
dti_value[i] <- i
logit_p_dti = (3.434e+00)-(2.710e-02)*i
    
probability[i] = exp(1)^logit_p_dti/(1+exp(1)^logit_p_dti)
  
}

dti_table <- data.frame(dti_value,probability)
head(dti_table)
##   dti_value probability
## 1         1   0.9679195
## 2         2   0.9670672
## 3         3   0.9661931
## 4         4   0.9652967
## 5         5   0.9643773
## 6         6   0.9634345
#graph
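#colour is set in geom_line; inside aes() it would be treated as a data mapping and not as the colour "skyblue"
g <- ggplot(data=dti_table, aes(x=dti_value,y=probability))
g + geom_line(colour="skyblue")+ ggtitle("Change in probability based on change in Debt to Income ratio") + labs(x="Debt to Income ratio,%")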

The line chart above shows how a change in the debt-to-income ratio affects the probability of success while holding all other variables constant.

By using the logit function, an investor can check how a change in one variable or in a combination of variables affects the probability while holding all other variables constant.

Next I used the logistic model to predict loan status.
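
The same kind of what-if analysis can be done without typing coefficients by hand, by calling predict() on a fitted model with a small table of hypothetical loans. The sketch below uses simulated data; the data frame sim, its columns and the coefficients used to generate it are assumptions made for illustration and are not the Lending Club data.

```r
# simulated stand-in for the loan data (hypothetical columns and effects)
set.seed(42)
n <- 5000
sim <- data.frame(
  grade = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  dti = runif(n, min = 0, max = 40)
)
# log-odds used only to generate the toy response
true_logit <- 3 - 0.6 * (sim$grade == "B") - 1.1 * (sim$grade == "C") - 0.03 * sim$dti
sim$paid <- rbinom(n, size = 1, prob = plogis(true_logit))

fit <- glm(paid ~ grade + dti, family = binomial(link = "logit"), data = sim)

# vary the grade while holding dti fixed at its mean
new_loans <- data.frame(
  grade = factor(c("A", "B", "C"), levels = levels(sim$grade)),
  dti = mean(sim$dti)
)
new_loans$probability <- predict(fit, newdata = new_loans, type = "response")
new_loans
```

Holding the other variables at a typical value such as the mean, rather than at 0, keeps the predicted probabilities in a realistic range.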

#shuffle the rows once so that the training and testing sets do not overlap
shuffled <- full_data[sample(nrow(full_data)),]
n_train <- round(0.66*nrow(full_data))
#66% of data set is for training
training <- shuffled[1:n_train,]
#34% of data set is for testing
testing <- shuffled[(n_train+1):nrow(shuffled),]

#run optimal model with training data set
pred_model <- glm(formula = loan_status ~ grade + dti + home_ownership + term + emp_length + int_rate + delinq_2yrs + total_acc + open_acc + annual_inc + revol_util + verification_status + inq_last_6mths + collections_12_mths_ex_med + installment + mths_since_last_delinq + pub_rec + mths_since_last_major_derog, family = binomial(link = "logit"), data = training)

#apply regression model as predictor for testing data set
testing$predicted_loan_status = predict(pred_model, newdata=testing,type = "response")
#replace a probability that is greater than or equal to 0.5 with the "Fully Paid" class and a probability that is less than 0.5 with the "Charged Off" class
testing <- testing %>% mutate(predicted_loan_status = ifelse(predicted_loan_status < 0.5, "Charged Off","Fully Paid"))
levels(factor(testing$predicted_loan_status))
## [1] "Charged Off" "Fully Paid"
testing <- testing %>% select(loan_status, predicted_loan_status)
head(testing)
##   loan_status predicted_loan_status
## 1  Fully Paid            Fully Paid
## 2 Charged Off            Fully Paid
## 3  Fully Paid            Fully Paid
## 4  Fully Paid            Fully Paid
## 5  Fully Paid            Fully Paid
## 6 Charged Off           Charged Off
#create confusion matrix
confusion_matrix <- table(testing$predicted_loan_status, testing$loan_status)
confusion_matrix
##              
##               Charged Off Fully Paid
##   Charged Off         278        243
##   Fully Paid         2512       9282
#calculate accuracy
accuracy <- (confusion_matrix[1,1] + confusion_matrix[2,2])/nrow(testing)
accuracy
## [1] 0.7762891

Logistic regression can predict whether a loan will be paid off or not with 77.63% accuracy.
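
Accuracy alone can be misleading when one class is much larger than the other, so it helps to compare it with a baseline that always predicts "Fully Paid". The sketch below rebuilds the confusion matrix from the counts printed above; the variable names other than confusion_matrix are my own.

```r
# confusion matrix copied from the output above (rows = predicted, columns = actual)
confusion_matrix <- matrix(
  c(278, 2512, 243, 9282), nrow = 2,
  dimnames = list(predicted = c("Charged Off", "Fully Paid"),
                  actual = c("Charged Off", "Fully Paid"))
)

total <- sum(confusion_matrix)

# share of correct predictions
accuracy <- sum(diag(confusion_matrix)) / total

# accuracy of always predicting "Fully Paid"
baseline <- sum(confusion_matrix[, "Fully Paid"]) / total

# share of actual charged-off loans that the model identifies
recall_charged_off <- confusion_matrix["Charged Off", "Charged Off"] /
  sum(confusion_matrix[, "Charged Off"])

c(accuracy = accuracy, baseline = baseline, recall_charged_off = recall_charged_off)
```

The baseline is about 77.3%, so the model improves on it only slightly, and it identifies roughly one in ten of the loans that were actually charged off. For an investor, a threshold lower than 0.5 for flagging risky loans may be worth exploring.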

In conclusion, I achieved all goals of the project. First, I built the optimal logistic regression function using a stepwise technique (which selects the model with the lowest AIC).
Second, I showed how changes in grade and dti affect the probability of success (the probability that the loan will be paid off). Third, I calculated that logistic regression can predict loan status with 77.63% accuracy.

I learnt a lot about multiple logistic regression that wasn’t covered in the lab assignments.