Credit risk is the risk a bank or financing institution bears when it lends to an individual or an institution: the borrower may fail to repay the loan principal and interest, which results in a loss for the lender.
To minimize this risk, lenders usually assess the borrower before granting a loan, a process known as credit scoring or credit rating.
Financing institutions have traditionally calculated credit risk with a predetermined standard calculation. Increasingly, however, they use machine learning methods based on historical loan data.
The purpose of this report is to find a model that helps minimize the risk borne by the lending institution. The result of the model's assessment determines whether the lender accepts or rejects a loan application.
This report examines the relationship between Loan Status and the following variables:
- Borrower Marital Status
- Borrower Education
- Borrower Credit History
- Borrower Property Area
Some of the Business Questions that can be gathered are:
Is there a relationship between the loan status with marital status, education, credit history, and property area of the borrower?
Can we predict the loan status based on marital status, education, credit history, and property area of the borrower?
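library(dplyr)   # provides %>%, select(), mutate(), glimpse()
library(ggplot2) # provides ggplot(), used for the visualization later
cr_train <- read.csv("CR_Train.csv", stringsAsFactors = T, na.strings=c("","","NA"))
Checking the data types and NA values of the data:
summary(cr_train)
#> Loan_ID Gender Married Dependents Education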
#> LP001002: 1 Female:112 No :213 0 :345 Graduate :480
#> LP001003: 1 Male :489 Yes :398 1 :102 Not Graduate:134
#> LP001005: 1 NA's : 13 NA's: 3 2 :101
#> LP001006: 1 3+ : 51
#> LP001008: 1 NA's: 15
#> LP001011: 1
#> (Other) :608
#> Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
#> No :500 Min. : 150 Min. : 0 Min. : 9.0
#> Yes : 82 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
#> NA's: 32 Median : 3812 Median : 1188 Median :128.0
#> Mean : 5403 Mean : 1621 Mean :146.4
#> 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
#> Max. :81000 Max. :41667 Max. :700.0
#> NA's :22
#> Loan_Amount_Term Credit_History Property_Area Loan_Status
#> Min. : 12 Min. :0.0000 Rural :179 N:192
#> 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
#> Median :360 Median :1.0000 Urban :202
#> Mean :342 Mean :0.8422
#> 3rd Qu.:360 3rd Qu.:1.0000
#> Max. :480 Max. :1.0000
#> NA's :14 NA's :50
anyNA(cr_train)
#> [1] TRUE
colSums(is.na(cr_train))
#> Loan_ID Gender Married Dependents
#> 0 13 3 15
#> Education Self_Employed ApplicantIncome CoapplicantIncome
#> 0 32 0 0
#> LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 22 14 50 0
#> Loan_Status
#> 0
glimpse(cr_train)
#> Rows: 614
#> Columns: 13
#> $ Loan_ID <fct> LP001002, LP001003, LP001005, LP001006, LP001008, LP~
#> $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male~
#> $ Married <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,~
#> $ Dependents <fct> 0, 1, 0, 0, 0, 2, 0, 3+, 2, 1, 2, 2, 2, 0, 2, 0, 1, ~
#> $ Education <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate~
#> $ Self_Employed <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, NA, No~
#> $ ApplicantIncome <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006~
#> $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, ~
#> $ LoanAmount <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1~
#> $ Loan_Amount_Term <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36~
#> $ Credit_History <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, ~
#> $ Property_Area <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem~
#> $ Loan_Status <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N~
Two columns need a different data type:
- Loan_Amount_Term: convert to factor
- Credit_History: convert to factor
cr.train <- cr_train %>%
dplyr::select(-Loan_ID) %>%
mutate(Loan_Amount_Term = as.factor(Loan_Amount_Term),
Credit_History = as.factor(Credit_History))
head(cr.train)
#> Gender Married Dependents Education Self_Employed ApplicantIncome
#> 1 Male No 0 Graduate No 5849
#> 2 Male Yes 1 Graduate No 4583
#> 3 Male Yes 0 Graduate Yes 3000
#> 4 Male Yes 0 Not Graduate No 2583
#> 5 Male No 0 Graduate No 6000
#> 6 Male Yes 2 Graduate Yes 5417
#> CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1 0 NA 360 1 Urban
#> 2 1508 128 360 1 Rural
#> 3 0 66 360 1 Urban
#> 4 2358 120 360 1 Urban
#> 5 0 141 360 1 Urban
#> 6 4196 267 360 1 Urban
#> Loan_Status
#> 1 Y
#> 2 N
#> 3 Y
#> 4 Y
#> 5 Y
#> 6 Y
The preliminary check (colSums(is.na(cr_train)) above) also shows NA values in:
- Gender
- Married
- Dependents
- Self_Employed
- LoanAmount
- Loan_Amount_Term
- Credit_History
Creating a helper function for data cleansing:
# Return the most frequent value(s) of x; NA if every value occurs equally often
Mode = function(x){
  a = table(x)   # frequency of each distinct value (NA is excluded)
  b = max(a)     # highest frequency
  if(all(a == b))
    mod = NA
  else if(is.numeric(x))
    mod = as.numeric(names(a))[a==b]
  else
    mod = names(a)[a==b]
  return(mod)
}
To improve the overall result, we replace missing (NA) values according to the data type:
- Numeric columns: missing values are replaced with the column mean (using the mean() function).
- Factor columns: missing values are replaced with the most frequent value in the column (using the Mode() helper defined above).
cr.train$Gender[is.na(cr.train$Gender)] <- Mode(cr.train$Gender)
cr.train$Married[is.na(cr.train$Married)] <- Mode(cr.train$Married)
cr.train$Dependents[is.na(cr.train$Dependents)] <- Mode(cr.train$Dependents)
cr.train$Credit_History[is.na(cr.train$Credit_History)] <- Mode(cr.train$Credit_History)
cr.train$LoanAmount[is.na(cr.train$LoanAmount)] <- mean(cr.train$LoanAmount, na.rm = T)
# Note: Loan_Amount_Term is already a factor at this point, so mean() returns NA and
# this line imputes nothing; the 14 NA's remain (see the summary below).
cr.train$Loan_Amount_Term[is.na(cr.train$Loan_Amount_Term)] <- mean(cr.train$Loan_Amount_Term, na.rm = T)
summary(cr.train)
#> Gender Married Dependents Education Self_Employed
#> Female:112 No :213 0 :360 Graduate :480 No :500
#> Male :502 Yes:401 1 :102 Not Graduate:134 Yes : 82
#> 2 :101 NA's: 32
#> 3+: 51
#>
#>
#>
#> ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
#> Min. : 150 Min. : 0 Min. : 9.0 360 :512
#> 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.2 180 : 44
#> Median : 3812 Median : 1188 Median :129.0 480 : 15
#> Mean : 5403 Mean : 1621 Mean :146.4 300 : 13
#> 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:164.8 84 : 4
#> Max. :81000 Max. :41667 Max. :700.0 (Other): 12
#> NA's : 14
#> Credit_History Property_Area Loan_Status
#> 0: 89 Rural :179 N:192
#> 1:525 Semiurban:233 Y:422
#> Urban :202
#>
#>
#>
#>
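The summary above still lists 14 NA's for Loan_Amount_Term, because mean() applied to a factor returns NA. The sketch below illustrates the two imputation routes on a small made-up vector. It is only an illustration under stated assumptions: `mode_value` and the toy `term` vector are hypothetical and not part of the original report.

```r
# Hypothetical helper (not from the original report): most frequent
# non-missing value, taking the first one when there is a tie.
mode_value <- function(x) {
  counts <- table(x)                 # table() excludes NA by default
  names(counts)[which.max(counts)]
}

# Toy factor that mimics Loan_Amount_Term after as.factor()
term <- factor(c(360, 360, 180, NA, 360, 480, NA))

# mean() on a factor returns NA (with a warning), so nothing would be imputed
factor_mean <- suppressWarnings(mean(term, na.rm = TRUE))

# Route 1: mode imputation works directly on the factor
term_mode <- term
term_mode[is.na(term_mode)] <- mode_value(term_mode)

# Route 2: mean imputation needs the numeric values back first
term_num <- as.numeric(as.character(term))
term_num[is.na(term_num)] <- mean(term_num, na.rm = TRUE)
```

Which route is appropriate depends on whether the loan term is treated as a category or as a quantity in the model; the report treats it as a factor.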
plot(cr.train$Loan_Status, cr.train$LoanAmount)
prop.table(table(cr.train$Loan_Status))
#>
#> N Y
#> 0.3127036 0.6872964
table(cr.train$Loan_Status)
#>
#> N Y
#> 192 422
# Drop the rows that still contain NA (Self_Employed, Loan_Amount_Term); 569 rows remain
cr.train = na.omit(cr.train)
levels(cr.train$Loan_Status)
#> [1] "N" "Y"
The classes are not balanced: more borrowers obtained the loan (Y, about 69%) than did not (N, about 31%). Given the data and time constraints, we consider this imbalance acceptable and continue with the modeling.
Creating an intercept-only model (model.cr0) and a logistic regression model with all available predictor variables (model.cr):
model.cr0 <- glm(formula = Loan_Status ~ 1,
data = cr.train,
family = "binomial")
model.cr <- glm(formula = Loan_Status ~ ., data = cr.train, family = 'binomial')
summary(model.cr0)
#>
#> Call:
#> glm(formula = Loan_Status ~ 1, family = "binomial", data = cr.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.5282 -1.5282 0.8633 0.8633 0.8633
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.79511 0.09056 8.78 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 705.51 on 568 degrees of freedom
#> Residual deviance: 705.51 on 568 degrees of freedom
#> AIC: 707.51
#>
#> Number of Fisher Scoring iterations: 4
summary(model.cr)
#>
#> Call:
#> glm(formula = Loan_Status ~ ., family = "binomial", data = cr.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.2384 -0.3671 0.5139 0.7099 2.4588
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) 11.303251499 1455.397665124 0.008
#> GenderMale -0.025757240 0.319348213 -0.081
#> MarriedYes 0.530484291 0.269061534 1.972
#> Dependents1 -0.348072627 0.316592874 -1.099
#> Dependents2 0.312801884 0.354876848 0.881
#> Dependents3+ 0.046914157 0.447023205 0.105
#> EducationNot Graduate -0.510019079 0.275336465 -1.852
#> Self_EmployedYes -0.102003154 0.324543228 -0.314
#> ApplicantIncome 0.000004178 0.000027167 0.154
#> CoapplicantIncome -0.000055857 0.000042370 -1.318
#> LoanAmount -0.001457084 0.001695529 -0.859
#> Loan_Amount_Term36 -31.109472905 1782.396928254 -0.017
#> Loan_Amount_Term60 0.593737383 1779.816079792 0.000
#> Loan_Amount_Term84 -14.534583116 1455.398057514 -0.010
#> Loan_Amount_Term120 -0.461619001 1673.992772581 0.000
#> Loan_Amount_Term180 -13.762369863 1455.397626880 -0.009
#> Loan_Amount_Term240 -14.432232683 1455.398108763 -0.010
#> Loan_Amount_Term300 -14.201857197 1455.397735067 -0.010
#> Loan_Amount_Term360 -14.081636424 1455.397559196 -0.010
#> Loan_Amount_Term480 -15.443353184 1455.397700593 -0.011
#> Credit_History1 3.855041185 0.426766326 9.033
#> Property_AreaSemiurban 0.983493628 0.283096817 3.474
#> Property_AreaUrban 0.205043897 0.274857418 0.746
#> Pr(>|z|)
#> (Intercept) 0.993803
#> GenderMale 0.935716
#> MarriedYes 0.048654 *
#> Dependents1 0.271579
#> Dependents2 0.378081
#> Dependents3+ 0.916417
#> EducationNot Graduate 0.063976 .
#> Self_EmployedYes 0.753295
#> ApplicantIncome 0.877769
#> CoapplicantIncome 0.187399
#> LoanAmount 0.390137
#> Loan_Amount_Term36 0.986075
#> Loan_Amount_Term60 0.999734
#> Loan_Amount_Term84 0.992032
#> Loan_Amount_Term120 0.999780
#> Loan_Amount_Term180 0.992455
#> Loan_Amount_Term240 0.992088
#> Loan_Amount_Term300 0.992214
#> Loan_Amount_Term360 0.992280
#> Loan_Amount_Term480 0.991534
#> Credit_History1 < 0.0000000000000002 ***
#> Property_AreaSemiurban 0.000513 ***
#> Property_AreaUrban 0.455667
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 705.51 on 568 degrees of freedom
#> Residual deviance: 506.56 on 546 degrees of freedom
#> AIC: 552.56
#>
#> Number of Fisher Scoring iterations: 14
exp(0.79) #(Intercept, model without predictors)
#> [1] 2.203396
exp(-0.025) #GenderMale
#> [1] 0.9753099
exp(0.53) #MarriedYes
#> [1] 1.698932
exp(-0.51) #EducationNot Graduate
#> [1] 0.6004956
exp(-0.10) #Self_EmployedYes
#> [1] 0.9048374
Based on the preliminary model with all predictor variables, the exponentiated coefficients can be read as odds and odds ratios. Sample interpretations:
- In the model without predictors, the odds of a borrower getting the loan are about 2.20 to 1.
- The odds of a male borrower getting a loan are 0.98 times the odds of a female borrower, provided that other variables have the same value.
- The odds of a married borrower getting a loan are 1.70 times the odds of an unmarried borrower, provided that other variables have the same value.
- The odds of a non-graduate borrower getting a loan are 0.60 times the odds of a graduate borrower, provided that other variables have the same value.
- The odds of a self-employed borrower getting a loan are 0.90 times the odds of a borrower who is not self-employed, provided that other variables have the same value.
Note that the Gender and Self_Employed coefficients are not statistically significant in this model (p = 0.94 and p = 0.75), so their odds ratios should be read with caution.
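For reference, these interpretations rest on the standard link between a logistic regression coefficient and an odds ratio. The symbols below ($p$, $\beta_j$, $x_j$) are introduced here for illustration only; $p$ denotes the probability that Loan_Status is Y.

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j} \beta_j x_j
\qquad\Longrightarrow\qquad
\mathrm{OR}_j = \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
```

For example, the MarriedYes coefficient of about 0.53 gives $e^{0.53} \approx 1.70$, which is the factor by which the odds (not the probability) of approval change when the other predictors are held fixed.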
model.step0 <- step(model.cr0, direction = "backward", trace = F)
model.step <- step(model.cr, direction = "backward", trace = F)
summary(model.step0)
#>
#> Call:
#> glm(formula = Loan_Status ~ 1, family = "binomial", data = cr.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.5282 -1.5282 0.8633 0.8633 0.8633
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.79511 0.09056 8.78 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 705.51 on 568 degrees of freedom
#> Residual deviance: 705.51 on 568 degrees of freedom
#> AIC: 707.51
#>
#> Number of Fisher Scoring iterations: 4
summary(model.step)
#>
#> Call:
#> glm(formula = Loan_Status ~ Married + Education + LoanAmount +
#> Credit_History + Property_Area, family = "binomial", data = cr.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1936 -0.4041 0.5650 0.7038 2.4901
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.817582 0.502586 -5.606 0.0000000207 ***
#> MarriedYes 0.624818 0.227924 2.741 0.00612 **
#> EducationNot Graduate -0.475380 0.262955 -1.808 0.07063 .
#> LoanAmount -0.001820 0.001255 -1.450 0.14694
#> Credit_History1 3.767896 0.418259 9.009 < 0.0000000000000002 ***
#> Property_AreaSemiurban 0.863675 0.273398 3.159 0.00158 **
#> Property_AreaUrban 0.201740 0.264577 0.763 0.44576
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 705.51 on 568 degrees of freedom
#> Residual deviance: 524.94 on 562 degrees of freedom
#> AIC: 538.94
#>
#> Number of Fisher Scoring iterations: 5
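As a consistency check on the three summaries: for a binary response, the residual deviance equals -2 times the log-likelihood, so the AIC should equal the residual deviance plus twice the number of estimated coefficients. The sketch below reproduces the reported AIC values from the printed deviances; `aic_from_deviance` is a hypothetical helper name, not something from the original report.

```r
# Hypothetical helper (not from the original report): AIC of a binomial GLM
# fitted to binary data, where residual deviance = -2 * log-likelihood.
aic_from_deviance <- function(residual_deviance, n_coef) {
  residual_deviance + 2 * n_coef
}

aic_null <- aic_from_deviance(705.51, 1)   # intercept-only model
aic_full <- aic_from_deviance(506.56, 23)  # full model, 23 coefficients
aic_step <- aic_from_deviance(524.94, 7)   # stepwise model, 7 coefficients

round(c(null = aic_null, full = aic_full, step = aic_step), 2)
```

The stepwise model has the lowest AIC (538.94 against 552.56 for the full model), which is the criterion step() uses when it removes predictors.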
exp(0.795) #(Intercept, model without predictors)
#> [1] 2.214441
exp(0.625) #MarriedYes
#> [1] 1.868246
exp(-0.475) #EducationNot Graduate
#> [1] 0.6218851
exp(3.76) #Credit_History1
#> [1] 42.94843
exp(0.20) #Property_AreaUrban (0.20 is the Urban coefficient; Semiurban is 0.86)
#> [1] 1.221403
The backward stepwise procedure keeps Married, Education, LoanAmount, Credit_History, and Property_Area. The resulting model can be interpreted as follows:
- In the model without predictors, the odds of a borrower getting the loan are about 2.21 to 1.
- The odds of a married borrower getting a loan are 1.87 times the odds of an unmarried borrower, provided that other variables have the same value.
- The odds of a non-graduate borrower getting a loan are 0.62 times the odds of a graduate borrower, provided that other variables have the same value.
- The odds of a borrower with a credit history (Credit_History = 1) getting a loan are 42.95 times the odds of a borrower without one, provided that other variables have the same value.
- The odds of a borrower in an urban property area getting a loan are 1.22 times the odds of a borrower in a rural area, although this coefficient is not statistically significant (p = 0.45).
- Gender was removed by the stepwise selection, so this model gives no odds ratio for it.
Based on the business assumption, we continue with a model on these variables:
- Target variable: Loan_Status
- Predictor variables: Married + Education + LoanAmount + Credit_History + Property_Area
Example visualization of Loan Status vs. LoanAmount:
cr.train %>%
ggplot(aes(x= LoanAmount, y = Loan_Status)) +
geom_point(alpha = 0.5)+
geom_smooth(method = "lm")+
theme_minimal()
Before making predictions, we load and prepare the validation data from a separate source file (CR_Validate.csv). The data must first be modified to match the training data and the model constructed earlier.
cr_val <- read.csv("CR_Validate.csv", stringsAsFactors = T, na.strings=c("","","NA"))
head(cr_val)
#> Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome
#> 1 LP001015 Male Yes 0 Graduate No 5720
#> 2 LP001022 Male Yes 1 Graduate No 3076
#> 3 LP001031 Male Yes 2 Graduate No 5000
#> 4 LP001035 Male Yes 2 Graduate No 2340
#> 5 LP001051 Male No 0 Not Graduate No 3276
#> 6 LP001054 Male Yes 0 Not Graduate Yes 2165
#> CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1 0 110 360 1 Urban
#> 2 1500 126 360 1 Urban
#> 3 1800 208 360 1 Urban
#> 4 2546 100 360 NA Urban
#> 5 0 78 360 1 Urban
#> 6 3422 152 360 1 Urban
#> outcome
#> 1 Y
#> 2 Y
#> 3 Y
#> 4 Y
#> 5 N
#> 6 Y
str(cr_val)
#> 'data.frame': 367 obs. of 13 variables:
#> $ Loan_ID : Factor w/ 367 levels "LP001015","LP001022",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 2 2 ...
#> $ Married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 1 ...
#> $ Dependents : Factor w/ 4 levels "0","1","2","3+": 1 2 3 3 1 1 2 3 3 1 ...
#> $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 1 2 2 2 2 1 2 ...
#> $ Self_Employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 NA 1 ...
#> $ ApplicantIncome : int 5720 3076 5000 2340 3276 2165 2226 3881 13633 2400 ...
#> $ CoapplicantIncome: int 0 1500 1800 2546 0 3422 0 0 0 2400 ...
#> $ LoanAmount : int 110 126 208 100 78 152 59 147 280 123 ...
#> $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 240 360 ...
#> $ Credit_History : int 1 1 1 NA 1 1 1 0 1 1 ...
#> $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 2 1 3 2 ...
#> $ outcome : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 1 2 2 ...
Data Cleansing & Pre-processing for Validation Data
cr.validate <- cr_val %>%
dplyr::select(-Loan_ID) %>%
mutate(Loan_Amount_Term = as.factor(Loan_Amount_Term),
Credit_History = as.factor(Credit_History))
cr.validate$Gender[is.na(cr.validate$Gender)] <- Mode(cr.validate$Gender)
cr.validate$Married[is.na(cr.validate$Married)] <- Mode(cr.validate$Married)
cr.validate$Dependents[is.na(cr.validate$Dependents)] <- Mode(cr.validate$Dependents)
cr.validate$Credit_History[is.na(cr.validate$Credit_History)] <- Mode(cr.validate$Credit_History)
cr.validate$LoanAmount[is.na(cr.validate$LoanAmount)] <- mean(cr.validate$LoanAmount, na.rm = T)
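# As with the training data, Loan_Amount_Term is already a factor, so mean() returns NA
# and the 6 NA's remain (see the summary below); the stepwise model does not use this column.
cr.validate$Loan_Amount_Term[is.na(cr.validate$Loan_Amount_Term)] <- mean(cr.validate$Loan_Amount_Term, na.rm = T)
summary(cr.validate)
#> Gender Married Dependents Education Self_Employed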
#> Female: 70 No :134 0 :210 Graduate :283 No :307
#> Male :297 Yes:233 1 : 58 Not Graduate: 84 Yes : 37
#> 2 : 59 NA's: 23
#> 3+: 40
#>
#>
#>
#> ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
#> Min. : 0 Min. : 0 Min. : 28.0 360 :311
#> 1st Qu.: 2864 1st Qu.: 0 1st Qu.:101.0 180 : 22
#> Median : 3786 Median : 1025 Median :126.0 480 : 8
#> Mean : 4806 Mean : 1570 Mean :136.1 300 : 7
#> 3rd Qu.: 5060 3rd Qu.: 2430 3rd Qu.:157.5 240 : 4
#> Max. :72529 Max. :24000 Max. :550.0 (Other): 9
#> NA's : 6
#> Credit_History Property_Area outcome
#> 0: 59 Rural :111 N: 77
#> 1:308 Semiurban:116 Y:290
#> Urban :140
#>
#>
#>
#>
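Predicting Loan Status and creating a new column for the predicted loan status (pred.loanstat) on the validation data. With type = "response", predict() returns the predicted probability that Loan_Status is Y.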
cr.validate$pred.loanstat <- predict(object = model.step,
newdata = cr.validate,
type = "response")
head(cr.validate)
#> Gender Married Dependents Education Self_Employed ApplicantIncome
#> 1 Male Yes 0 Graduate No 5720
#> 2 Male Yes 1 Graduate No 3076
#> 3 Male Yes 2 Graduate No 5000
#> 4 Male Yes 2 Graduate No 2340
#> 5 Male No 0 Not Graduate No 3276
#> 6 Male Yes 0 Not Graduate Yes 2165
#> CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1 0 110 360 1 Urban
#> 2 1500 126 360 1 Urban
#> 3 1800 208 360 1 Urban
#> 4 2546 100 360 1 Urban
#> 5 0 78 360 1 Urban
#> 6 3422 152 360 1 Urban
#> outcome pred.loanstat
#> 1 Y 0.8287250
#> 2 Y 0.8245510
#> 3 Y 0.8018997
#> 4 Y 0.8312936
#> 5 N 0.6305729
#> 6 Y 0.7359021
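To make the pred.loanstat values less opaque, the first row can be reproduced by hand from the stepwise coefficients. The sketch below copies the rounded coefficients from summary(model.step) and the first validation row shown above; the vector names `b` and `x` are chosen here for illustration only.

```r
# Coefficients copied (rounded) from summary(model.step) above
b <- c(intercept = -2.817582, married_yes = 0.624818, not_graduate = -0.475380,
       loan_amount = -0.001820, credit_history1 = 3.767896,
       semiurban = 0.863675, urban = 0.201740)

# First validation row: married, graduate, LoanAmount = 110,
# Credit_History = 1, Property_Area = Urban (leading 1 is the intercept term)
x <- c(1, 1, 0, 110, 1, 0, 1)

log_odds   <- sum(b * x)        # linear predictor on the log-odds scale
p_approved <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
round(p_approved, 3)
```

The result is close to the 0.8287250 reported for row 1; the small difference comes from using rounded coefficients.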
Classify the validation data based on the predicted probability and store the result in a new column (predstat.Label).
Assumption of business requirement: only approve loans for borrowers whose predicted probability is >= 0.75.
cr.validate$predstat.Label <- ifelse(cr.validate$pred.loanstat >= 0.75, "Y", "N")
cr.validate$predstat.Label <- as.factor(cr.validate$predstat.Label)
head(cr.validate)
#> Gender Married Dependents Education Self_Employed ApplicantIncome
#> 1 Male Yes 0 Graduate No 5720
#> 2 Male Yes 1 Graduate No 3076
#> 3 Male Yes 2 Graduate No 5000
#> 4 Male Yes 2 Graduate No 2340
#> 5 Male No 0 Not Graduate No 3276
#> 6 Male Yes 0 Not Graduate Yes 2165
#> CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1 0 110 360 1 Urban
#> 2 1500 126 360 1 Urban
#> 3 1800 208 360 1 Urban
#> 4 2546 100 360 1 Urban
#> 5 0 78 360 1 Urban
#> 6 3422 152 360 1 Urban
#> outcome pred.loanstat predstat.Label
#> 1 Y 0.8287250 Y
#> 2 Y 0.8245510 Y
#> 3 Y 0.8018997 Y
#> 4 Y 0.8312936 Y
#> 5 N 0.6305729 N
#> 6 Y 0.7359021 N
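The choice of cutoff directly changes who is approved. As a small illustration, the sketch below labels the six probabilities shown above at the report's cutoff of 0.75 and at 0.50, a commonly used default that is included here only for comparison. `label_at` is a hypothetical helper, not part of the original report.

```r
# Predicted probabilities of the six validation rows shown above
prob <- c(0.8287250, 0.8245510, 0.8018997, 0.8312936, 0.6305729, 0.7359021)

# Hypothetical helper: label a probability vector at a given cutoff
label_at <- function(prob, cutoff) {
  factor(ifelse(prob >= cutoff, "Y", "N"), levels = c("N", "Y"))
}

labels_075 <- label_at(prob, 0.75)  # the report's business cutoff
labels_050 <- label_at(prob, 0.50)  # conventional default, for comparison

data.frame(prob, labels_075, labels_050)
```

At 0.75, rows 5 and 6 are rejected, including row 6 whose actual outcome is Y; at 0.50 all six rows would be approved. A stricter cutoff trades missed good borrowers for fewer approved bad ones.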
Because the model contains a very strong predictor variable (Credit_History has by far the largest coefficient), we will try to create several models with selected predictor variables for comparison.
How well the model performs can be assessed by building a confusion matrix from the predicted and actual values. The business assumption for the confusion matrix is that the positive class is an approved loan (Y).
library(caret)
confusionMatrix(data = cr.validate$predstat.Label,
reference = cr.validate$outcome,
positive = "Y")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction N Y
#> N 71 89
#> Y 6 201
#>
#> Accuracy : 0.7411
#> 95% CI : (0.6931, 0.7852)
#> No Information Rate : 0.7902
#> P-Value [Acc > NIR] : 0.9899
#>
#> Kappa : 0.4407
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.6931
#> Specificity : 0.9221
#> Pos Pred Value : 0.9710
#> Neg Pred Value : 0.4437
#> Prevalence : 0.7902
#> Detection Rate : 0.5477
#> Detection Prevalence : 0.5640
#> Balanced Accuracy : 0.8076
#>
#> 'Positive' Class : Y
#>
Based on the confusion matrix result, the matrix can be interpreted as follows:
- True Negative values (TN): 71
- True Positive values (TP): 201
- False Negative values (FN): 89
- False Positive values (FP): 6
Overall, the model classifies about 74.1% of the validation data correctly (accuracy). This is below the No Information Rate of 79.0%, the accuracy obtained by always predicting Y.
The model is correct about 97.1% of the time when it predicts Y for a borrower (precision, Pos Pred Value).
About 79.0% of the borrowers in the validation sample actually have outcome Y (prevalence). The 56.4% in the output is the detection prevalence, the share of borrowers the model predicts as Y.
The ratio of correct positive predictions to the total number of actual positives is about 69.3% (recall, sensitivity).
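These figures can be recomputed directly from the four cell counts, which helps keep precision, recall, and prevalence apart. A minimal sketch, using only the counts printed in the confusion matrix above (the variable names are chosen here for illustration):

```r
# Counts taken from the confusion matrix above (positive class = "Y")
tp <- 201; fn <- 89; fp <- 6; tn <- 71
n  <- tp + tn + fp + fn

accuracy    <- (tp + tn) / n
precision   <- tp / (tp + fp)   # Pos Pred Value
recall      <- tp / (tp + fn)   # Sensitivity
specificity <- tn / (tn + fp)
prevalence  <- (tp + fn) / n    # share of actual "Y"
detection_prevalence <- (tp + fp) / n  # share of predicted "Y"

round(c(accuracy = accuracy, precision = precision, recall = recall,
        specificity = specificity, prevalence = prevalence,
        detection_prevalence = detection_prevalence), 4)
```

The rounded values agree with the caret output: 0.7411, 0.9710, 0.6931, 0.9221, 0.7902, and 0.5640.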