As this is a graded task for our Academy students, completion of the task is not optional and counts towards your final score.
Students should be awarded the full points if:
1. The document demonstrates the student’s ability in data preparation (e.g. one or two exploratory data analysis steps)
2. The document demonstrates the student’s understanding of an unbiased estimate of the model’s accuracy (e.g. train and test sets or cross-validation sets)
3. The document demonstrates the student’s ability to interpret the coefficients (e.g. one or two paragraphs on whether the presence of a variable leads to an increase or decrease in the log-odds / probability of default)
Students should receive 1 point for each of the above requirements, for a total of 3 points.
Option 1: Logistic Regression on Credit Risk
Option 2: Customer segment prediction
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(class)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
Applying what you’ve learned, present a simple R Markdown document in which you demonstrate the use of logistic regression on the lbb_loans.csv dataset. Explain your findings wherever necessary and show the necessary data preparation steps. To help you through the exercise, consider the following questions throughout the document:
To prepare the data, I take the lbb_loans data provided by Algorit.ma and go through its variable names so that I can understand it better.
In the Unsupervised Machine Learning workshop within the Machine Learning Specialization, I will dive into the specific details of anomaly detection algorithms in far greater depth, so let’s stay on track and study the dataset we read into our environment:
# read the CSV file first
loans <- read.csv("lbb_loans.csv")
# check whether there are any missing values
missmap(loans, main = "Missing values vs observed")
# check the proportion of not_paid
table(loans$not_paid)
##
## 0 1
## 749 751
# separate the loans data set into train (80%) and test (20%) sets; no seed is set, so the split differs on every run
intrain <- sample(nrow(loans), nrow(loans)*0.8)
loans.train <- loans[intrain, ]
loans.test <- loans[-intrain, ]
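Because sample() draws different rows on every run, the numbers reported further down can only be reproduced if the random seed is fixed before the split. A minimal sketch of a seeded 80/20 split follows; the toy data frame and the seed value 123 are assumptions made for illustration and are not part of the loans data.

```r
# Fix the random number generator so the same rows are drawn on every run.
set.seed(123)  # arbitrary seed, chosen only for illustration

# Hypothetical stand-in for the loans data: 100 rows with a balanced 0/1 target.
toy <- data.frame(x = rnorm(100), not_paid = rep(c(0, 1), each = 50))

# Draw 80% of the row indices for training; the remaining rows form the test set.
in_train <- sample(nrow(toy), size = 0.8 * nrow(toy))
toy_train <- toy[in_train, ]
toy_test <- toy[-in_train, ]

# Check the sizes and the class balance of the training set.
nrow(toy_train)
nrow(toy_test)
prop.table(table(toy_train$not_paid))
```

The test rows are never seen during fitting, which is what makes the accuracy measured on them an unbiased estimate.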
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
creditriskmod <- glm(not_paid~.,
data= loans.train,
family= binomial)
summary(creditriskmod)
##
## Call:
## glm(formula = not_paid ~ ., family = binomial, data = loans.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8917 -1.1111 -0.7274 1.1387 1.6874
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.206e-01 2.431e+00 -0.379 0.704901
## initial_list_statusw -1.010e-01 1.493e-01 -0.677 0.498467
## purposedebt_consolidation 1.291e-01 1.531e-01 0.843 0.399173
## purposehome_improvement 5.856e-02 2.350e-01 0.249 0.803188
## purposemajor_purchase 6.020e-01 3.556e-01 1.693 0.090478
## purposesmall_business 8.220e-01 4.981e-01 1.650 0.098859
## int_rate -1.549e-02 5.243e-02 -0.296 0.767601
## installment 9.223e-04 2.401e-04 3.841 0.000122
## annual_inc -3.317e-06 2.148e-06 -1.544 0.122471
## dti 2.515e-03 5.295e-03 0.475 0.634787
## verification_statusSource Verified 1.173e-01 1.445e-01 0.811 0.417091
## verification_statusVerified 1.720e-01 1.590e-01 1.082 0.279405
## gradeB 1.231e-01 2.710e-01 0.454 0.649599
## gradeC 5.364e-01 4.252e-01 1.262 0.207128
## gradeD 8.396e-01 6.600e-01 1.272 0.203306
## gradeE 9.319e-01 9.562e-01 0.975 0.329774
## gradeF 6.278e-01 1.247e+00 0.503 0.614766
## gradeG 2.300e-02 1.394e+00 0.017 0.986833
## revol_bal 2.948e-06 3.519e-06 0.838 0.402163
## inq_last_12m -2.964e-02 2.473e-02 -1.198 0.230766
## delinq_2yrs 1.913e-01 7.984e-02 2.396 0.016597
## home_ownershipOWN 2.502e-01 1.902e-01 1.315 0.188357
## home_ownershipRENT 1.305e-01 1.395e-01 0.935 0.349831
## log_inc 1.871e-02 2.256e-01 0.083 0.933927
## verified NA NA NA NA
## grdCtoA NA NA NA NA
##
## (Intercept)
## initial_list_statusw
## purposedebt_consolidation
## purposehome_improvement
## purposemajor_purchase .
## purposesmall_business .
## int_rate
## installment ***
## annual_inc
## dti
## verification_statusSource Verified
## verification_statusVerified
## gradeB
## gradeC
## gradeD
## gradeE
## gradeF
## gradeG
## revol_bal
## inq_last_12m
## delinq_2yrs *
## home_ownershipOWN
## home_ownershipRENT
## log_inc
## verified
## grdCtoA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1663.1 on 1199 degrees of freedom
## Residual deviance: 1589.6 on 1176 degrees of freedom
## AIC: 1637.6
##
## Number of Fisher Scoring iterations: 4
# fit the first reduced model on a subset of the predictors
creditrisk1 <- glm(not_paid ~ int_rate + installment + annual_inc + delinq_2yrs + home_ownership,
                   family = binomial,
                   data = loans.train)
summary(creditrisk1)
##
## Call:
## glm(formula = not_paid ~ int_rate + installment + annual_inc +
## delinq_2yrs + home_ownership, family = binomial, data = loans.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7764 -1.1115 -0.8098 1.1695 1.5297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.996e-01 2.013e-01 -4.469 7.86e-06 ***
## int_rate 3.553e-02 1.109e-02 3.203 0.00136 **
## installment 9.500e-04 2.222e-04 4.276 1.90e-05 ***
## annual_inc -3.089e-06 1.122e-06 -2.753 0.00591 **
## delinq_2yrs 1.926e-01 7.865e-02 2.449 0.01431 *
## home_ownershipOWN 2.730e-01 1.863e-01 1.465 0.14289
## home_ownershipRENT 1.647e-01 1.303e-01 1.265 0.20601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1663.1 on 1199 degrees of freedom
## Residual deviance: 1609.2 on 1193 degrees of freedom
## AIC: 1623.2
##
## Number of Fisher Scoring iterations: 4
# predict the probability of default on the test set
loans.test$pred.risk <- predict(creditrisk1, loans.test, type = "response")
# build the confusion matrix, classifying as default when the probability is at least 0.5
table("predicted" = as.numeric(loans.test$pred.risk >= 0.5), "actual" = loans.test$not_paid)
## actual
## predicted 0 1
## 0 92 64
## 1 46 98
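# compute accuracy, recall, precision, and specificity from the confusion matrix above
cm <- table("predicted" = as.numeric(loans.test$pred.risk >= 0.5), "actual" = loans.test$not_paid)
tn <- cm["0", "0"]; fn <- cm["0", "1"]
fp <- cm["1", "0"]; tp <- cm["1", "1"]
accu <- round((tp + tn) / sum(cm), 2)
reca <- round(tp / (tp + fn), 2)
prec <- round(tp / (tp + fp), 2)
spec <- round(tn / (tn + fp), 2)
paste("Accuracy:", accu)
## [1] "Accuracy: 0.63"
paste("Recall:", reca)
## [1] "Recall: 0.6"
paste("Precision:", prec)
## [1] "Precision: 0.68"
paste("Specificity:", spec)
## [1] "Specificity: 0.67"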
# check the class balance of the actual labels in the test set
prop.table(table(loans.test$not_paid))
##
## 0 1
## 0.46 0.54
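The metrics above depend on the 0.5 cutoff used to turn probabilities into classes. As a rough sketch of how they move when the cutoff changes, the helper below recomputes them for any threshold. The function name classification_metrics and the ten probabilities and labels are hypothetical, made up for illustration only.

```r
# Compute the four metrics from predicted probabilities and actual 0/1 labels.
classification_metrics <- function(prob, actual, threshold = 0.5) {
  predicted <- as.numeric(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  c(accuracy = (tp + tn) / length(actual),
    recall = tp / (tp + fn),
    precision = tp / (tp + fp),
    specificity = tn / (tn + fp))
}

# Hypothetical predicted probabilities and true labels.
prob <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1)
actual <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)

m_default <- classification_metrics(prob, actual, threshold = 0.5)
m_low <- classification_metrics(prob, actual, threshold = 0.35)
round(m_default, 2)
round(m_low, 2)
```

Lowering the threshold catches more defaulters (higher recall) at the cost of more false alarms (lower specificity), which may be the preferred trade-off when a missed default is more costly than a rejected good loan.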
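Answer
The coefficients in a logistic regression are log-odds ratios. A negative value means that the odds ratio is smaller than 1: the odds of default fall as the predictor rises or, for a categorical predictor, the odds for that level are lower than the odds for the reference level. In creditrisk1 the only negative coefficient is annual_inc (-3.089e-06), indicating that customers with a higher annual income are less likely to default.
Positive coefficients indicate that higher values of the predictor are associated with a greater likelihood of default. In creditrisk1, int_rate (0.0355), installment (0.00095) and delinq_2yrs (0.1926) are all positive and significant at the 5% level, so a higher interest rate, a larger installment, or more delinquencies in the past two years each raise the log-odds of not paying. For example, exp(0.1926) is about 1.21, so each additional delinquency multiplies the odds of default by about 1.21 when the other variables are held fixed.
In the full model, dti and revol_bal also have positive coefficients, which suggests that higher debt-to-income ratios or higher revolving balances go together with more loan defaults, but neither is statistically significant there (p = 0.63 and p = 0.40).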
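To make the log-odds easier to read, the coefficients can be exponentiated into odds ratios. The sketch below does this for the creditrisk1 estimates copied from the summary output above; the step sizes of 100 and 10,000 units used for rescaling are arbitrary choices for illustration.

```r
# Coefficients copied from the creditrisk1 summary output (log-odds scale).
log_odds <- c(int_rate = 3.553e-02,
              installment = 9.500e-04,
              annual_inc = -3.089e-06,
              delinq_2yrs = 1.926e-01)

# Exponentiating a log-odds coefficient gives the odds ratio per one-unit increase.
odds_ratio <- exp(log_odds)
round(odds_ratio, 4)

# Per-unit effects look tiny for variables measured on a large scale, so rescale them.
exp(100 * log_odds["installment"])    # odds ratio for a 100-unit higher installment
exp(10000 * log_odds["annual_inc"])   # odds ratio for a 10,000-unit higher income
```

With the fitted model in the session, exp(coef(creditrisk1)) would give the same odds ratios directly. Values above 1 raise the odds of default and values below 1 lower them.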
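How do we know which of the variables are more statistically significant as predictors?
Answer
In the summary() output, the Pr(>|z|) column and the significance codes show which predictors are statistically significant. In creditrisk1, installment is significant at the 0.001 level, int_rate and annual_inc at the 0.01 level, and delinq_2yrs at the 0.05 level, while the home_ownership levels are not significant. Statistical measures like these can show the relative importance of the different predictor variables. However, they can’t determine whether the variables are important in a practical sense. To determine practical importance, you’ll need to use your subject area knowledge. If you randomly sample your observations, the variability of the predictor values in your sample likely reflects the variability in the population. In this case, the standardized coefficients are likely to reflect their population values. For a logistic model, the change in deviance plays the role that the change in R-squared plays in linear regression.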
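One way to list the significant predictors programmatically is to read the Pr(>|z|) column from the coefficient table of a fitted model. This is a minimal sketch on simulated data: the variables x1 and x2, the seed, and the 0.05 cutoff are assumptions for illustration and do not come from the loans data.

```r
set.seed(42)  # arbitrary seed so the toy data is reproducible
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Only x1 truly drives the outcome; x2 is pure noise.
toy$y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.5 * toy$x1))

fit <- glm(y ~ x1 + x2, family = binomial, data = toy)

# coef(summary(fit)) is a matrix with the columns
# Estimate, Std. Error, z value and Pr(>|z|).
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|z|)"]

# Keep the predictors (not the intercept) whose p-value is below the cutoff.
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
significant
```

The same two lines applied to creditrisk1 would return the predictors marked with stars in its summary. A small p-value says the effect is unlikely to be zero, not that it is large.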
Answer
Step five: Assessing fit of the model
The process of variable selection, deletion, model fitting and refitting can be repeated for several cycles, depending on the complexity of the variables. Interaction terms help to disentangle complex relationships between covariates and their synergistic effect on the response variable. The model should then be checked for goodness of fit (GOF), in other words, how well the fitted model reflects the real data. The Hosmer-Lemeshow GOF test is the most widely used for logistic regression models. However, it is only a summary statistic of model fit. Investigators may also be interested in whether the model fits across the entire range of covariate patterns, which is the task of regression diagnostics.
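The Hosmer-Lemeshow idea can be sketched in a few lines of base R: group the observations by their fitted probability, then compare observed and expected defaults per group with a chi-squared statistic. The function hosmer_lemeshow below is a hand-written illustration, not a packaged implementation, and the simulated data and the choice of 10 groups are assumptions.

```r
# Hosmer-Lemeshow statistic written out in base R (illustrative sketch).
hosmer_lemeshow <- function(observed, fitted, groups = 10) {
  # Cut the fitted probabilities into roughly equal-sized groups by quantile.
  breaks <- unique(quantile(fitted, probs = seq(0, 1, length.out = groups + 1)))
  bin <- cut(fitted, breaks = breaks, include.lowest = TRUE)
  n_g <- tapply(fitted, bin, length)          # observations per group
  obs_events <- tapply(observed, bin, sum)    # observed events per group
  exp_events <- tapply(fitted, bin, sum)      # expected events per group
  p_bar <- exp_events / n_g                   # mean fitted probability per group
  statistic <- sum((obs_events - exp_events)^2 / (n_g * p_bar * (1 - p_bar)))
  df <- length(n_g) - 2
  list(statistic = statistic, df = df,
       p.value = pchisq(statistic, df = df, lower.tail = FALSE))
}

# Simulated data for which the logistic model is correctly specified.
set.seed(1)  # arbitrary seed for reproducibility
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

hl <- hosmer_lemeshow(y, fitted(fit))
hl
```

A large p-value gives no evidence of poor fit, while a small one suggests that the predicted probabilities do not match the observed default rates in some groups. The result is sensitive to the number of groups, which is one reason it is treated as a summary check rather than a full diagnostic.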