Objective

As this is a graded task for our Academy students, completion of the task is not optional and counts towards your final score.

Students should be awarded the full points if:
1. The document demonstrates the student’s ability in data preparation (e.g. one or two exploratory data analysis steps)
2. The document demonstrates the student’s understanding of an unbiased estimate of the model’s accuracy (e.g. train and test sets or cross-validation sets)
3. The document demonstrates the student’s ability to interpret the coefficients (e.g. one or two paragraphs on whether the presence of one variable leads to an increase or decrease of the log-odds / probabilities of default)

Option 1: Logistic Regression on Credit Risk
Option 2: Customer segment prediction

Students should receive 1 point for each of the above requirements, for a total of 3 points.

Solution

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(class)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

Option 1 : Logistic Regression on Credit Risk

Applying what you’ve learned, present a simple R Markdown document in which you demonstrate the use of logistic regression on the lbb_loans.csv dataset. Explain your findings wherever necessary and show the necessary data preparation steps. To help you through the exercise, consider the following questions throughout the document:

  • How do we correctly interpret the negative coefficients obtained from your logistic regression?
  • How do we know which of the variables are more statistically significant as predictors?
  • What are some strategies to improve your model?

Data Preparation

To prepare the data, we take the lbb_loans data provided by Algorit.ma and read it in under a shorter name (loans) so that it is easier to work with.

In the Unsupervised Machine Learning workshop within the Machine Learning Specialization, I will dive into the specific details of anomaly detection algorithms in far greater depth, so let’s stay on track and study the dataset we read into our environment below:

# read the csv first
loans <- read.csv("lbb_loans.csv")

# check whether there are any missing values
missmap(loans, main = "Missing values vs observed")

# check the proportion of not_paid
table(loans$not_paid)
## 
##   0   1 
## 749 751
# separate the loans data set into train (80%) and test (20%) sets
# note: no seed is set, so the split changes on every run
intrain <- sample(nrow(loans), nrow(loans)*0.8)
loans.train <- loans[intrain, ]
loans.test <- loans[-intrain, ]
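
Because sample() draws a different subset on every run, the figures reported further down will change each time the document is knitted. A minimal sketch of a reproducible 80/20 split follows. It assumes an arbitrary seed (123) and uses a small toy data frame as a stand-in for loans, so none of the names or values below come from the original analysis.

```r
# Toy stand-in for the loans data (hypothetical values, for illustration only)
toy <- data.frame(id = 1:100, not_paid = rep(c(0, 1), 50))

# Fix the random seed (123 is an arbitrary choice) so the split is repeatable
set.seed(123)
intrain <- sample(nrow(toy), nrow(toy) * 0.8)
toy.train <- toy[intrain, ]
toy.test <- toy[-intrain, ]

# Re-running with the same seed yields exactly the same row indices
set.seed(123)
intrain_again <- sample(nrow(toy), nrow(toy) * 0.8)
identical(intrain, intrain_again)
```

The same set.seed() call placed before the sample() line above would make the train and test sets, and every number derived from them, stable across runs.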

Logistic Regression on Credit Risk

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
creditriskmod <- glm(not_paid~., 
                     data= loans.train, 
                     family= binomial) 
summary(creditriskmod)
## 
## Call:
## glm(formula = not_paid ~ ., family = binomial, data = loans.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8917  -1.1111  -0.7274   1.1387   1.6874  
## 
## Coefficients: (2 not defined because of singularities)
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                        -9.206e-01  2.431e+00  -0.379 0.704901
## initial_list_statusw               -1.010e-01  1.493e-01  -0.677 0.498467
## purposedebt_consolidation           1.291e-01  1.531e-01   0.843 0.399173
## purposehome_improvement             5.856e-02  2.350e-01   0.249 0.803188
## purposemajor_purchase               6.020e-01  3.556e-01   1.693 0.090478
## purposesmall_business               8.220e-01  4.981e-01   1.650 0.098859
## int_rate                           -1.549e-02  5.243e-02  -0.296 0.767601
## installment                         9.223e-04  2.401e-04   3.841 0.000122
## annual_inc                         -3.317e-06  2.148e-06  -1.544 0.122471
## dti                                 2.515e-03  5.295e-03   0.475 0.634787
## verification_statusSource Verified  1.173e-01  1.445e-01   0.811 0.417091
## verification_statusVerified         1.720e-01  1.590e-01   1.082 0.279405
## gradeB                              1.231e-01  2.710e-01   0.454 0.649599
## gradeC                              5.364e-01  4.252e-01   1.262 0.207128
## gradeD                              8.396e-01  6.600e-01   1.272 0.203306
## gradeE                              9.319e-01  9.562e-01   0.975 0.329774
## gradeF                              6.278e-01  1.247e+00   0.503 0.614766
## gradeG                              2.300e-02  1.394e+00   0.017 0.986833
## revol_bal                           2.948e-06  3.519e-06   0.838 0.402163
## inq_last_12m                       -2.964e-02  2.473e-02  -1.198 0.230766
## delinq_2yrs                         1.913e-01  7.984e-02   2.396 0.016597
## home_ownershipOWN                   2.502e-01  1.902e-01   1.315 0.188357
## home_ownershipRENT                  1.305e-01  1.395e-01   0.935 0.349831
## log_inc                             1.871e-02  2.256e-01   0.083 0.933927
## verified                                   NA         NA      NA       NA
## grdCtoA                                    NA         NA      NA       NA
##                                       
## (Intercept)                           
## initial_list_statusw                  
## purposedebt_consolidation             
## purposehome_improvement               
## purposemajor_purchase              .  
## purposesmall_business              .  
## int_rate                              
## installment                        ***
## annual_inc                            
## dti                                   
## verification_statusSource Verified    
## verification_statusVerified           
## gradeB                                
## gradeC                                
## gradeD                                
## gradeE                                
## gradeF                                
## gradeG                                
## revol_bal                             
## inq_last_12m                          
## delinq_2yrs                        *  
## home_ownershipOWN                     
## home_ownershipRENT                    
## log_inc                               
## verified                              
## grdCtoA                               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1663.1  on 1199  degrees of freedom
## Residual deviance: 1589.6  on 1176  degrees of freedom
## AIC: 1637.6
## 
## Number of Fisher Scoring iterations: 4
# fit a reduced model with a subset of the predictors
creditrisk1 <- glm(not_paid ~ int_rate + installment + annual_inc + delinq_2yrs + home_ownership,
                   family = binomial, 
                   data = loans.train)
summary(creditrisk1)
## 
## Call:
## glm(formula = not_paid ~ int_rate + installment + annual_inc + 
##     delinq_2yrs + home_ownership, family = binomial, data = loans.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7764  -1.1115  -0.8098   1.1695   1.5297  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.996e-01  2.013e-01  -4.469 7.86e-06 ***
## int_rate            3.553e-02  1.109e-02   3.203  0.00136 ** 
## installment         9.500e-04  2.222e-04   4.276 1.90e-05 ***
## annual_inc         -3.089e-06  1.122e-06  -2.753  0.00591 ** 
## delinq_2yrs         1.926e-01  7.865e-02   2.449  0.01431 *  
## home_ownershipOWN   2.730e-01  1.863e-01   1.465  0.14289    
## home_ownershipRENT  1.647e-01  1.303e-01   1.265  0.20601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1663.1  on 1199  degrees of freedom
## Residual deviance: 1609.2  on 1193  degrees of freedom
## AIC: 1623.2
## 
## Number of Fisher Scoring iterations: 4
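
The reduced model is nested in the full model and both were fitted to the same training rows, so the two summaries can be compared with a likelihood-ratio test on the difference in residual deviance. The sketch below only reuses the deviance, degrees-of-freedom and AIC values printed above. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the two summary() outputs above
dev_full <- 1589.6; df_full <- 1176; aic_full <- 1637.6
dev_reduced <- 1609.2; df_reduced <- 1193; aic_reduced <- 1623.2

# Likelihood-ratio statistic: increase in deviance caused by dropping predictors
lr_stat <- dev_reduced - dev_full
lr_df <- df_reduced - df_full

# Upper-tail chi-square probability for that increase
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value)

# Lower AIC is better; TRUE means the reduced model is preferred on AIC
aic_reduced < aic_full
```

The p-value is well above 0.05, so there is no evidence that the dropped predictors jointly improve the fit, and the lower AIC points the same way. With the fitted objects available, anova(creditrisk1, creditriskmod, test = "Chisq") should perform the same comparison directly.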
# predict the probability of default on the test set
loans.test$pred.risk <- predict(creditrisk1, loans.test, type = "response")

# build the confusion matrix (rows: predicted, columns: actual)
conf.mat <- table("predicted"=as.numeric(loans.test$pred.risk>=0.5), "actual"=loans.test$not_paid)
conf.mat
##          actual
## predicted  0  1
##         0 92 64
##         1 46 98
# compute accuracy, recall, precision, and specificity from the confusion matrix
tp <- conf.mat["1","1"]; tn <- conf.mat["0","0"]
fp <- conf.mat["1","0"]; fn <- conf.mat["0","1"]
accu <- round((tp+tn)/sum(conf.mat),2)
reca <- round(tp/(tp+fn),2)
prec <- round(tp/(tp+fp),2)
spec <- round(tn/(tn+fp),2)

paste("Accuracy:", accu)
## [1] "Accuracy: 0.63"
paste("Recall:", reca)
## [1] "Recall: 0.6"
paste("Precision:", prec)
## [1] "Precision: 0.68"
paste("Specificity:", spec)
## [1] "Specificity: 0.67"
# check the class balance in the test set
prop.table(table(loans.test$not_paid))
## 
##    0    1 
## 0.46 0.54
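
The confusion matrix above depends on the 0.5 cutoff. A threshold-free summary such as the area under the ROC curve (AUC) is a common complement. The sketch below computes AUC with the Mann-Whitney rank formula in base R. The function name auc_rank and the toy scores are hypothetical and are not taken from the original analysis.

```r
# AUC via the Mann-Whitney rank formula (base R only)
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels, for illustration only
toy_score <- c(0.1, 0.4, 0.35, 0.8)
toy_label <- c(0, 0, 1, 1)
auc_rank(toy_score, toy_label)
```

On the real data the call would be auc_rank(loans.test$pred.risk, loans.test$not_paid). The ROCR package loaded at the top of the document offers prediction() and performance() for the same purpose.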

The Debate

  1. How do we correctly interpret the negative coefficients obtained from your logistic regression?
    Answer: The coefficients in a logistic regression are log odds ratios. A negative value means that the odds ratio is smaller than 1, that is, the odds of default fall as the predictor increases (or, for a categorical predictor, the odds of that group are lower than the odds of the reference group), holding the other predictors constant.
  • Negative coefficients: in the reduced model annual_inc has a negative coefficient (-3.089e-06), indicating that customers with a higher annual income are less likely to default.

  • Positive coefficients: int_rate, installment and delinq_2yrs have positive coefficients, indicating that higher interest rates, larger installments and higher delinq_2yrs counts are each associated with a greater likelihood of default.

In the full model, dti and revol_bal also have positive coefficients, which would suggest that higher dti ratios or higher revolving balances are associated with a greater likelihood of loan default, but neither is statistically significant there.
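
The log-odds interpretation becomes more concrete once the coefficients are exponentiated into odds ratios. The sketch below reuses the estimates printed by summary(creditrisk1). It assumes annual_inc is recorded in single currency units, and the 10,000-unit step is an arbitrary illustrative increment.

```r
# Coefficients copied from the summary(creditrisk1) output
coefs <- c(int_rate = 3.553e-02, installment = 9.500e-04,
           annual_inc = -3.089e-06, delinq_2yrs = 1.926e-01)

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratio <- exp(coefs)
round(odds_ratio, 4)

# A one-unit change in income is tiny, so rescale to a larger step
# (10,000 units is an assumed, illustrative increment)
or_inc_10k <- exp(coefs[["annual_inc"]] * 10000)
or_inc_10k
```

An odds ratio below 1 (annual_inc) means the odds of default shrink as the predictor grows. An odds ratio above 1 (int_rate, installment, delinq_2yrs) means they grow. For example, each additional unit of delinq_2yrs multiplies the odds of default by about 1.21, holding the other predictors fixed.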

  2. How do we know which of the variables are more statistically significant as predictors?
    Answer: The Pr(>|z|) column and the significance codes in the summary() output show which coefficients differ significantly from zero. In the reduced model, installment, int_rate, annual_inc and delinq_2yrs are significant at the 5% level, while the home_ownership levels are not. Statistical measures like these can show the relative importance of the different predictor variables. However, they can’t determine whether the variables are important in a practical sense. To determine practical importance, you’ll need to use your subject area knowledge. If you randomly sample your observations, the variability of the predictor values in your sample likely reflects the variability in the population. In this case, the standardized coefficients and the change in fit when a variable is added (R-squared in linear regression, deviance in logistic regression) are likely to reflect their population values.

  3. What are some strategies to improve your model?
    Answer: Model building can proceed in the following steps:
  • Step one: univariable analysis
  • Step two: multivariable model comparisons
  • Step three: checking the linearity assumption
  • Step four: interactions among covariates
  • Step five: assessing the fit of the model

The process of variable selection, deletion, model fitting and refitting can be repeated for several cycles, depending on the complexity of the variables. Interaction terms help to disentangle complex relationships between covariates and their synergistic effect on the response variable. The model should also be checked for goodness of fit (GOF), in other words, how well the fitted model reflects the real data. The Hosmer-Lemeshow GOF test is the most widely used for logistic regression models. However, it is only a summary statistic for checking model fit. Investigators may also be interested in whether the model fits across the entire range of covariate patterns, which is the task of regression diagnostics.
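
As an illustration of the goodness-of-fit check mentioned above, here is a hand-rolled sketch of the Hosmer-Lemeshow statistic in base R. It runs on simulated data rather than on the loans data. The function name hosmer_lemeshow, the choice of 10 groups and the simulation settings are all assumptions made for the example.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

hosmer_lemeshow <- function(y, p_hat, g = 10) {
  # Split observations into g groups by quantiles of predicted probability
  breaks <- quantile(p_hat, probs = seq(0, 1, length.out = g + 1))
  grp <- cut(p_hat, breaks = breaks, include.lowest = TRUE)
  obs <- tapply(y, grp, sum)        # observed events per group
  expd <- tapply(p_hat, grp, sum)   # expected events per group
  n_g <- tapply(y, grp, length)     # group sizes
  # Sum of squared differences scaled by the binomial variance in each group
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
  df <- g - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

hl <- hosmer_lemeshow(y, fitted(fit))
hl
```

A large p-value gives no evidence of lack of fit, while a small one suggests that predicted and observed default rates disagree in some risk groups. On the real data the call would be hosmer_lemeshow(loans.train$not_paid, fitted(creditrisk1)).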