LBB-C1: Credit Risk Modeling

Wayan K.

4/30/2021


About the Data

Credit risk is the risk borne by a bank or financing institution when it lends to an individual or institution. The risk takes the form of non-payment of loan principal and interest, which results in losses for the lender.

To minimize credit risk, lenders usually assess the borrower before granting a loan, a process known as credit scoring or credit rating.

Financing institutions usually calculate credit risk with a predetermined standard calculation. Increasingly, however, they use machine learning methods based on historical loan data.

The purpose of this report is to find a model that helps minimize the risk to the lending institution. The result of the model's assessment will determine whether the financial institution accepts or rejects a loan application.

This report examines the relationship between Loan Status and the following variables:

  • Borrower Marital Status
  • Borrower Education
  • Borrower Credit History
  • Borrower Property Area

Some of the Business Questions that can be gathered are:

  • Is there a relationship between the loan status with marital status, education, credit history, and property area of the borrower?

  • Can we predict the loan status based on marital status, education, credit history, and property area of the borrower?

library(dplyr); library(ggplot2)  # needed below for %>%, glimpse() and ggplot()
cr_train <- read.csv("CR_Train.csv", stringsAsFactors = T, na.strings=c("","NA"))

Data Dictionary

  • Loan_ID : Loan ID
  • Gender : Borrower Gender
  • Married : Borrower Marital Status
  • Dependents : Borrower Number of Dependents
  • Education : Borrower Education
  • Self_Employed : Self Employed Status
  • ApplicantIncome : Borrower Monthly Income
  • CoapplicantIncome: Co-Borrower Monthly Income
  • LoanAmount : Loan Amount
  • Loan_Amount_Term : Loan Amount Term
  • Credit_History : Borrower Credit History
  • Property_Area : Borrower Property Area
  • Loan_Status : Loan Status

Data Wrangling

Checking the data types and missing (NA) values in the data:

summary(cr_train)
#>      Loan_ID       Gender    Married    Dependents        Education  
#>  LP001002:  1   Female:112   No  :213   0   :345   Graduate    :480  
#>  LP001003:  1   Male  :489   Yes :398   1   :102   Not Graduate:134  
#>  LP001005:  1   NA's  : 13   NA's:  3   2   :101                     
#>  LP001006:  1                           3+  : 51                     
#>  LP001008:  1                           NA's: 15                     
#>  LP001011:  1                                                        
#>  (Other) :608                                                        
#>  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
#>  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0  
#>  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
#>  NA's: 32      Median : 3812   Median : 1188     Median :128.0  
#>                Mean   : 5403   Mean   : 1621     Mean   :146.4  
#>                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
#>                Max.   :81000   Max.   :41667     Max.   :700.0  
#>                                                  NA's   :22     
#>  Loan_Amount_Term Credit_History     Property_Area Loan_Status
#>  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
#>  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
#>  Median :360      Median :1.0000   Urban    :202              
#>  Mean   :342      Mean   :0.8422                              
#>  3rd Qu.:360      3rd Qu.:1.0000                              
#>  Max.   :480      Max.   :1.0000                              
#>  NA's   :14       NA's   :50
anyNA(cr_train)
#> [1] TRUE
colSums(is.na(cr_train))
#>           Loan_ID            Gender           Married        Dependents 
#>                 0                13                 3                15 
#>         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
#>                 0                32                 0                 0 
#>        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
#>                22                14                50                 0 
#>       Loan_Status 
#>                 0
glimpse(cr_train)
#> Rows: 614
#> Columns: 13
#> $ Loan_ID           <fct> LP001002, LP001003, LP001005, LP001006, LP001008, LP~
#> $ Gender            <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male~
#> $ Married           <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,~
#> $ Dependents        <fct> 0, 1, 0, 0, 0, 2, 0, 3+, 2, 1, 2, 2, 2, 0, 2, 0, 1, ~
#> $ Education         <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate~
#> $ Self_Employed     <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, NA, No~
#> $ ApplicantIncome   <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006~
#> $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, ~
#> $ LoanAmount        <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1~
#> $ Loan_Amount_Term  <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36~
#> $ Credit_History    <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, ~
#> $ Property_Area     <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem~
#> $ Loan_Status       <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N~

Data Cleansing & Exploratory Data Analysis

Two columns need a different data type:

  • Loan_Amount_Term : convert to factor
  • Credit_History : convert to factor

cr.train <- cr_train %>% 
  dplyr::select(-Loan_ID) %>% 
  mutate(Loan_Amount_Term = as.factor(Loan_Amount_Term),
         Credit_History = as.factor(Credit_History))

head(cr.train)
#>   Gender Married Dependents    Education Self_Employed ApplicantIncome
#> 1   Male      No          0     Graduate            No            5849
#> 2   Male     Yes          1     Graduate            No            4583
#> 3   Male     Yes          0     Graduate           Yes            3000
#> 4   Male     Yes          0 Not Graduate            No            2583
#> 5   Male      No          0     Graduate            No            6000
#> 6   Male     Yes          2     Graduate           Yes            5417
#>   CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1                 0         NA              360              1         Urban
#> 2              1508        128              360              1         Rural
#> 3                 0         66              360              1         Urban
#> 4              2358        120              360              1         Urban
#> 5                 0        141              360              1         Urban
#> 6              4196        267              360              1         Urban
#>   Loan_Status
#> 1           Y
#> 2           N
#> 3           Y
#> 4           Y
#> 5           Y
#> 6           Y

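The preliminary check (colSums(is.na(cr_train))) also shows NA values in these columns:

  • Gender
  • Married
  • Dependents
  • Self_Employed
  • LoanAmount
  • Loan_Amount_Term
  • Credit_History

Creating a helper function for data cleansing that returns the mode (most frequent value) of a column: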

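# Returns the most frequent value(s) of x; table() ignores NA by default.
Mode = function(x){
  a = table(x)                          # frequency of each distinct value
  b = max(a)                            # highest frequency
  if(all(a == b))
    mod = NA                            # all values tie (or only one distinct value): no mode returned
  else if(is.numeric(x))
    mod = as.numeric(names(a))[a==b]    # numeric input: return the value as a number
  else
    mod = names(a)[a==b]                # factor/character input: return the level name
  return(mod)
}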

To get a better overall result, missing (NA) values are replaced according to the column type:

  • Numeric columns: missing values are replaced by the column mean (using the mean() function).
  • Factor columns: missing values are replaced by the value with the highest number of occurrences in the column (using the Mode() function defined above, not base R's mode()).

cr.train$Gender[is.na(cr.train$Gender)] <-  Mode(cr.train$Gender)

cr.train$Married[is.na(cr.train$Married)] <- Mode(cr.train$Married)

cr.train$Dependents[is.na(cr.train$Dependents)] <-  Mode(cr.train$Dependents)

cr.train$Credit_History[is.na(cr.train$Credit_History)] <-  Mode(cr.train$Credit_History)
cr.train$LoanAmount[is.na(cr.train$LoanAmount)] <- mean(cr.train$LoanAmount, na.rm = T)

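# Note: Loan_Amount_Term is already a factor here, so mean() returns NA (with a warning) and the 14 NAs remain, as the summary shows.
cr.train$Loan_Amount_Term[is.na(cr.train$Loan_Amount_Term)] <- mean(cr.train$Loan_Amount_Term, na.rm = T)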

summary(cr.train)
#>     Gender    Married   Dependents        Education   Self_Employed
#>  Female:112   No :213   0 :360     Graduate    :480   No  :500     
#>  Male  :502   Yes:401   1 :102     Not Graduate:134   Yes : 82     
#>                         2 :101                        NA's: 32     
#>                         3+: 51                                     
#>                                                                    
#>                                                                    
#>                                                                    
#>  ApplicantIncome CoapplicantIncome   LoanAmount    Loan_Amount_Term
#>  Min.   :  150   Min.   :    0     Min.   :  9.0   360    :512     
#>  1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.2   180    : 44     
#>  Median : 3812   Median : 1188     Median :129.0   480    : 15     
#>  Mean   : 5403   Mean   : 1621     Mean   :146.4   300    : 13     
#>  3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:164.8   84     :  4     
#>  Max.   :81000   Max.   :41667     Max.   :700.0   (Other): 12     
#>                                                    NA's   : 14     
#>  Credit_History   Property_Area Loan_Status
#>  0: 89          Rural    :179   N:192      
#>  1:525          Semiurban:233   Y:422      
#>                 Urban    :202              
#>                                            
#>                                            
#>                                            
#> 
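The imputation rules used in this section can be sketched on a small toy data frame. This is only an illustration: the values and the helper name `most_frequent` are hypothetical and not part of the report's data or code. It also shows why `mean()` cannot fill a factor column such as Loan_Amount_Term.

```r
# Toy data (hypothetical values, not taken from CR_Train.csv)
toy <- data.frame(
  Gender     = factor(c("Male", "Female", "Male", NA, "Male")),
  LoanAmount = c(100, NA, 140, 120, NA),
  Term       = factor(c(360, 360, NA, 180, 360))
)

# Hypothetical helper: most frequent level, first one wins on ties
most_frequent <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Factor column: fill NA with the most frequent level
toy$Gender[is.na(toy$Gender)] <- most_frequent(toy$Gender)

# Numeric column: fill NA with the mean of the observed values
toy$LoanAmount[is.na(toy$LoanAmount)] <- mean(toy$LoanAmount, na.rm = TRUE)

# Pitfall: mean() of a factor is NA (with a warning), so it cannot impute Term
term_mean <- suppressWarnings(mean(toy$Term, na.rm = TRUE))

# A factor column needs the mode instead
toy$Term[is.na(toy$Term)] <- most_frequent(toy$Term)

toy
```

Taking the first level on ties is a design choice of this sketch; it guarantees a single replacement value for every missing entry.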

Visualizing & Understanding Variables

plot(cr.train$Loan_Status, cr.train$LoanAmount)

prop.table(table(cr.train$Loan_Status))
#> 
#>         N         Y 
#> 0.3127036 0.6872964
table(cr.train$Loan_Status)
#> 
#>   N   Y 
#> 192 422
# Drop the rows that still contain NA (Self_Employed, Loan_Amount_Term)
cr.train = na.omit(cr.train)

levels(cr.train$Loan_Status)
#> [1] "N" "Y"

The two classes are not balanced: borrowers who got the loan (Y, about 69%) outnumber borrowers who did not get the loan (N, about 31%). Given the data and time constraints, this is considered adequate to continue with the modeling.

Logistic Regression Model Selection

Fitting a null model (intercept only) and a full logistic regression model with all available predictor variables:

model.cr0 <- glm(formula = Loan_Status ~ 1,
                    data = cr.train, 
                    family = "binomial")

model.cr <- glm(formula = Loan_Status ~ ., data = cr.train, family = 'binomial')

summary(model.cr0)
#> 
#> Call:
#> glm(formula = Loan_Status ~ 1, family = "binomial", data = cr.train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.5282  -1.5282   0.8633   0.8633   0.8633  
#> 
#> Coefficients:
#>             Estimate Std. Error z value            Pr(>|z|)    
#> (Intercept)  0.79511    0.09056    8.78 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 705.51  on 568  degrees of freedom
#> Residual deviance: 705.51  on 568  degrees of freedom
#> AIC: 707.51
#> 
#> Number of Fisher Scoring iterations: 4
summary(model.cr)
#> 
#> Call:
#> glm(formula = Loan_Status ~ ., family = "binomial", data = cr.train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.2384  -0.3671   0.5139   0.7099   2.4588  
#> 
#> Coefficients:
#>                              Estimate     Std. Error z value
#> (Intercept)              11.303251499 1455.397665124   0.008
#> GenderMale               -0.025757240    0.319348213  -0.081
#> MarriedYes                0.530484291    0.269061534   1.972
#> Dependents1              -0.348072627    0.316592874  -1.099
#> Dependents2               0.312801884    0.354876848   0.881
#> Dependents3+              0.046914157    0.447023205   0.105
#> EducationNot Graduate    -0.510019079    0.275336465  -1.852
#> Self_EmployedYes         -0.102003154    0.324543228  -0.314
#> ApplicantIncome           0.000004178    0.000027167   0.154
#> CoapplicantIncome        -0.000055857    0.000042370  -1.318
#> LoanAmount               -0.001457084    0.001695529  -0.859
#> Loan_Amount_Term36      -31.109472905 1782.396928254  -0.017
#> Loan_Amount_Term60        0.593737383 1779.816079792   0.000
#> Loan_Amount_Term84      -14.534583116 1455.398057514  -0.010
#> Loan_Amount_Term120      -0.461619001 1673.992772581   0.000
#> Loan_Amount_Term180     -13.762369863 1455.397626880  -0.009
#> Loan_Amount_Term240     -14.432232683 1455.398108763  -0.010
#> Loan_Amount_Term300     -14.201857197 1455.397735067  -0.010
#> Loan_Amount_Term360     -14.081636424 1455.397559196  -0.010
#> Loan_Amount_Term480     -15.443353184 1455.397700593  -0.011
#> Credit_History1           3.855041185    0.426766326   9.033
#> Property_AreaSemiurban    0.983493628    0.283096817   3.474
#> Property_AreaUrban        0.205043897    0.274857418   0.746
#>                                    Pr(>|z|)    
#> (Intercept)                        0.993803    
#> GenderMale                         0.935716    
#> MarriedYes                         0.048654 *  
#> Dependents1                        0.271579    
#> Dependents2                        0.378081    
#> Dependents3+                       0.916417    
#> EducationNot Graduate              0.063976 .  
#> Self_EmployedYes                   0.753295    
#> ApplicantIncome                    0.877769    
#> CoapplicantIncome                  0.187399    
#> LoanAmount                         0.390137    
#> Loan_Amount_Term36                 0.986075    
#> Loan_Amount_Term60                 0.999734    
#> Loan_Amount_Term84                 0.992032    
#> Loan_Amount_Term120                0.999780    
#> Loan_Amount_Term180                0.992455    
#> Loan_Amount_Term240                0.992088    
#> Loan_Amount_Term300                0.992214    
#> Loan_Amount_Term360                0.992280    
#> Loan_Amount_Term480                0.991534    
#> Credit_History1        < 0.0000000000000002 ***
#> Property_AreaSemiurban             0.000513 ***
#> Property_AreaUrban                 0.455667    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 705.51  on 568  degrees of freedom
#> Residual deviance: 506.56  on 546  degrees of freedom
#> AIC: 552.56
#> 
#> Number of Fisher Scoring iterations: 14
exp(0.79) #(Intercept model without predictor)
#> [1] 2.203396
exp(-0.025) #genderMale
#> [1] 0.9753099
exp(0.53) #MarriedYes
#> [1] 1.698932
exp(-0.51)  #EducationNot Graduate
#> [1] 0.6004956
exp(-0.10)  #Self_EmployedYes
#> [1] 0.9048374

Based on the preliminary model using all predictor variables, the exponentiated coefficients can be read as odds ratios. Some sample interpretations:

  • In the model without predictors, the odds of a borrower getting the loan are about 2.20, so getting the loan is about 2.20 times as likely as not getting it.

  • The odds of a male borrower getting a loan are 0.98 times the odds of a female borrower, provided that other variables have the same value.

  • The odds of a married borrower getting a loan are 1.70 times the odds of an unmarried borrower, provided that other variables have the same value.

  • The odds of a non-graduate borrower getting a loan are 0.60 times the odds of a graduate borrower, provided that other variables have the same value.

  • The odds of a self-employed borrower getting a loan are 0.90 times the odds of a borrower who is not self-employed, provided that other variables have the same value.
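The odds ratios can also be computed for several coefficients at once instead of one `exp()` call each. The sketch below hard-codes four estimates copied from the `summary(model.cr)` output so that it runs on its own; with the fitted model in the session, `exp(coef(model.cr))` would be the direct equivalent. The vector names `coefs` and `odds_ratios` are illustrative.

```r
# Estimates copied from the summary(model.cr) output in this report
coefs <- c(
  "GenderMale"            = -0.025757240,
  "MarriedYes"            =  0.530484291,
  "EducationNot Graduate" = -0.510019079,
  "Self_EmployedYes"      = -0.102003154
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)

# Above 1: higher odds of Loan_Status = Y than the reference level
# Below 1: lower odds than the reference level
round(odds_ratios, 2)
```

Working from the unrounded estimates avoids small differences that come from rounding a coefficient by hand before exponentiating it.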

Feature Selection with Stepwise Backward Model

model.step0 <- step(model.cr0, direction = "backward", trace = F)

model.step <- step(model.cr, direction = "backward", trace = F)

summary(model.step0)
#> 
#> Call:
#> glm(formula = Loan_Status ~ 1, family = "binomial", data = cr.train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.5282  -1.5282   0.8633   0.8633   0.8633  
#> 
#> Coefficients:
#>             Estimate Std. Error z value            Pr(>|z|)    
#> (Intercept)  0.79511    0.09056    8.78 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 705.51  on 568  degrees of freedom
#> Residual deviance: 705.51  on 568  degrees of freedom
#> AIC: 707.51
#> 
#> Number of Fisher Scoring iterations: 4
summary(model.step)
#> 
#> Call:
#> glm(formula = Loan_Status ~ Married + Education + LoanAmount + 
#>     Credit_History + Property_Area, family = "binomial", data = cr.train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.1936  -0.4041   0.5650   0.7038   2.4901  
#> 
#> Coefficients:
#>                         Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)            -2.817582   0.502586  -5.606         0.0000000207 ***
#> MarriedYes              0.624818   0.227924   2.741              0.00612 ** 
#> EducationNot Graduate  -0.475380   0.262955  -1.808              0.07063 .  
#> LoanAmount             -0.001820   0.001255  -1.450              0.14694    
#> Credit_History1         3.767896   0.418259   9.009 < 0.0000000000000002 ***
#> Property_AreaSemiurban  0.863675   0.273398   3.159              0.00158 ** 
#> Property_AreaUrban      0.201740   0.264577   0.763              0.44576    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 705.51  on 568  degrees of freedom
#> Residual deviance: 524.94  on 562  degrees of freedom
#> AIC: 538.94
#> 
#> Number of Fisher Scoring iterations: 5
exp(0.795) #(Intercept model without predictor)
#> [1] 2.214441
exp(0.625) #MarriedYes
#> [1] 1.868246
exp(-0.475) #EducationNot Graduate
#> [1] 0.6218851
exp(3.76)  #Credit_History1
#> [1] 42.94843
exp(0.20)  #Property_AreaUrban
#> [1] 1.221403
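As an illustration of what `step()` does with `direction = "backward"`, the following sketch runs it on simulated data. Everything in it (the variables `x1`, `x2`, `noise`, the sample size, and the true coefficients) is made up for the example and has no connection to the loan data. Starting from the full model, `step()` removes a term only when doing so lowers the AIC, so the selected model never has a higher AIC than the starting model.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2, while noise is unrelated to y
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * toy$x1 + 0.8 * toy$x2))

# Full logistic regression model with all three predictors
full <- glm(y ~ x1 + x2 + noise, data = toy, family = binomial)

# Backward elimination by AIC (trace = 0 suppresses the step-by-step log)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                           # terms that were kept
c(full = AIC(full), reduced = AIC(reduced))
```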

Interpretation of Stepwise Backward Model

From the backward stepwise selection, the resulting model can be interpreted as follows:

  • In the model without predictors, the odds of a borrower getting the loan are about 2.21, so getting the loan is about 2.21 times as likely as not getting it.

  • The odds of a married borrower getting a loan are 1.87 times the odds of an unmarried borrower, provided that other variables have the same value.

  • The odds of a non-graduate borrower getting a loan are 0.62 times the odds of a graduate borrower, provided that other variables have the same value.

  • Gender was dropped by the stepwise selection, so it has no odds ratio in this model. The odds of a borrower in an urban area getting a loan are 1.22 times the odds of a borrower in a rural area, provided that other variables have the same value (this coefficient is not statistically significant, p = 0.45).

  • The odds of a borrower with Credit_History = 1 getting a loan are 42.95 times the odds of a borrower with Credit_History = 0, provided that other variables have the same value.

Based on the business assumption, the modeling will continue with these variables:

  • Target variable: Loan_Status
  • Predictor variables: Marital status + Education status + Loan Amount + Credit_History + Property_Area

Example visualization of Loan Status vs. LoanAmount:

cr.train %>% 
  ggplot(aes(x= LoanAmount, y = Loan_Status)) +
  geom_point(alpha = 0.5)+
  geom_smooth(method = "lm")+
  theme_minimal()

Prediction Based on Selected Model

Before predicting, we load the validation data from a separate source file (CR_Validate.csv). The data must first be modified to match the training data and the model constructed earlier.

cr_val <- read.csv("CR_Validate.csv", stringsAsFactors = T, na.strings=c("","NA"))
head(cr_val)
#>    Loan_ID Gender Married Dependents    Education Self_Employed ApplicantIncome
#> 1 LP001015   Male     Yes          0     Graduate            No            5720
#> 2 LP001022   Male     Yes          1     Graduate            No            3076
#> 3 LP001031   Male     Yes          2     Graduate            No            5000
#> 4 LP001035   Male     Yes          2     Graduate            No            2340
#> 5 LP001051   Male      No          0 Not Graduate            No            3276
#> 6 LP001054   Male     Yes          0 Not Graduate           Yes            2165
#>   CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1                 0        110              360              1         Urban
#> 2              1500        126              360              1         Urban
#> 3              1800        208              360              1         Urban
#> 4              2546        100              360             NA         Urban
#> 5                 0         78              360              1         Urban
#> 6              3422        152              360              1         Urban
#>   outcome
#> 1       Y
#> 2       Y
#> 3       Y
#> 4       Y
#> 5       N
#> 6       Y
str(cr_val)
#> 'data.frame':    367 obs. of  13 variables:
#>  $ Loan_ID          : Factor w/ 367 levels "LP001015","LP001022",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 2 2 ...
#>  $ Married          : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 1 ...
#>  $ Dependents       : Factor w/ 4 levels "0","1","2","3+": 1 2 3 3 1 1 2 3 3 1 ...
#>  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 1 2 2 2 2 1 2 ...
#>  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 NA 1 ...
#>  $ ApplicantIncome  : int  5720 3076 5000 2340 3276 2165 2226 3881 13633 2400 ...
#>  $ CoapplicantIncome: int  0 1500 1800 2546 0 3422 0 0 0 2400 ...
#>  $ LoanAmount       : int  110 126 208 100 78 152 59 147 280 123 ...
#>  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 240 360 ...
#>  $ Credit_History   : int  1 1 1 NA 1 1 1 0 1 1 ...
#>  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 3 3 2 1 3 2 ...
#>  $ outcome          : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 1 2 2 ...

Data Cleansing & Pre-processing for Validation Data

  • Converting Loan_Amount_Term and Credit_History to factor
  • Using the mode to replace NA values in categorical variables, and the mean to replace NA values in numerical variables.
cr.validate <- cr_val %>% 
  dplyr::select(-Loan_ID) %>% 
  mutate(Loan_Amount_Term = as.factor(Loan_Amount_Term),
         Credit_History = as.factor(Credit_History))

cr.validate$Gender[is.na(cr.validate$Gender)] <-  Mode(cr.validate$Gender)

cr.validate$Married[is.na(cr.validate$Married)] <- Mode(cr.validate$Married)

cr.validate$Dependents[is.na(cr.validate$Dependents)] <-  Mode(cr.validate$Dependents)

cr.validate$Credit_History[is.na(cr.validate$Credit_History)] <-  Mode(cr.validate$Credit_History)

cr.validate$LoanAmount[is.na(cr.validate$LoanAmount)] <- mean(cr.validate$LoanAmount, na.rm = T)

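# Note: Loan_Amount_Term is already a factor here, so mean() returns NA (with a warning) and the 6 NAs remain, as the summary shows.
cr.validate$Loan_Amount_Term[is.na(cr.validate$Loan_Amount_Term)] <- mean(cr.validate$Loan_Amount_Term, na.rm = T)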

summary(cr.validate)
#>     Gender    Married   Dependents        Education   Self_Employed
#>  Female: 70   No :134   0 :210     Graduate    :283   No  :307     
#>  Male  :297   Yes:233   1 : 58     Not Graduate: 84   Yes : 37     
#>                         2 : 59                        NA's: 23     
#>                         3+: 40                                     
#>                                                                    
#>                                                                    
#>                                                                    
#>  ApplicantIncome CoapplicantIncome   LoanAmount    Loan_Amount_Term
#>  Min.   :    0   Min.   :    0     Min.   : 28.0   360    :311     
#>  1st Qu.: 2864   1st Qu.:    0     1st Qu.:101.0   180    : 22     
#>  Median : 3786   Median : 1025     Median :126.0   480    :  8     
#>  Mean   : 4806   Mean   : 1570     Mean   :136.1   300    :  7     
#>  3rd Qu.: 5060   3rd Qu.: 2430     3rd Qu.:157.5   240    :  4     
#>  Max.   :72529   Max.   :24000     Max.   :550.0   (Other):  9     
#>                                                    NA's   :  6     
#>  Credit_History   Property_Area outcome
#>  0: 59          Rural    :111   N: 77  
#>  1:308          Semiurban:116   Y:290  
#>                 Urban    :140          
#>                                        
#>                                        
#>                                        
#> 

Predicting the loan status and storing the predicted probability in a new column (pred.loanstat) on the validation data:

cr.validate$pred.loanstat <- predict(object = model.step,
                    newdata = cr.validate,
                    type = "response")
head(cr.validate)
#>   Gender Married Dependents    Education Self_Employed ApplicantIncome
#> 1   Male     Yes          0     Graduate            No            5720
#> 2   Male     Yes          1     Graduate            No            3076
#> 3   Male     Yes          2     Graduate            No            5000
#> 4   Male     Yes          2     Graduate            No            2340
#> 5   Male      No          0 Not Graduate            No            3276
#> 6   Male     Yes          0 Not Graduate           Yes            2165
#>   CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1                 0        110              360              1         Urban
#> 2              1500        126              360              1         Urban
#> 3              1800        208              360              1         Urban
#> 4              2546        100              360              1         Urban
#> 5                 0         78              360              1         Urban
#> 6              3422        152              360              1         Urban
#>   outcome pred.loanstat
#> 1       Y     0.8287250
#> 2       Y     0.8245510
#> 3       Y     0.8018997
#> 4       Y     0.8312936
#> 5       N     0.6305729
#> 6       Y     0.7359021
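To see where a predicted probability comes from, the value for the first validation row can be reproduced by hand from the `summary(model.step)` coefficients. The sketch below hard-codes those rounded estimates and the first row's values (married, graduate, LoanAmount 110, Credit_History 1, urban); the object names `coefs`, `x`, `log_odds` and `prob_yes` are illustrative. Because the coefficients are rounded, the result should only agree with the `predict()` value of 0.8287 to about three decimals.

```r
# Estimates copied from the summary(model.step) output in this report
coefs <- c(
  "(Intercept)"            = -2.817582,
  "MarriedYes"             =  0.624818,
  "EducationNot Graduate"  = -0.475380,
  "LoanAmount"             = -0.001820,
  "Credit_History1"        =  3.767896,
  "Property_AreaSemiurban" =  0.863675,
  "Property_AreaUrban"     =  0.201740
)

# First validation row as a design vector (1 = dummy variable switched on)
x <- c(
  "(Intercept)"            = 1,
  "MarriedYes"             = 1,
  "EducationNot Graduate"  = 0,
  "LoanAmount"             = 110,
  "Credit_History1"        = 1,
  "Property_AreaSemiurban" = 0,
  "Property_AreaUrban"     = 1
)

log_odds <- sum(coefs * x[names(coefs)])  # linear predictor (log-odds)
prob_yes <- plogis(log_odds)              # inverse logit: 1 / (1 + exp(-log_odds))

round(prob_yes, 3)
```

This is what `predict(..., type = "response")` returns for each row: the inverse logit of the linear predictor.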

Classifying the validation data based on the predicted probability and storing the label in a new column (predstat.Label):

Assumption of Business Requirement: Only approve loans for borrowers with a predicted probability >= 0.75

cr.validate$predstat.Label <- ifelse(cr.validate$pred.loanstat >= 0.75, "Y", "N")

cr.validate$predstat.Label <- as.factor(cr.validate$predstat.Label)

head(cr.validate)
#>   Gender Married Dependents    Education Self_Employed ApplicantIncome
#> 1   Male     Yes          0     Graduate            No            5720
#> 2   Male     Yes          1     Graduate            No            3076
#> 3   Male     Yes          2     Graduate            No            5000
#> 4   Male     Yes          2     Graduate            No            2340
#> 5   Male      No          0 Not Graduate            No            3276
#> 6   Male     Yes          0 Not Graduate           Yes            2165
#>   CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
#> 1                 0        110              360              1         Urban
#> 2              1500        126              360              1         Urban
#> 3              1800        208              360              1         Urban
#> 4              2546        100              360              1         Urban
#> 5                 0         78              360              1         Urban
#> 6              3422        152              360              1         Urban
#>   outcome pred.loanstat predstat.Label
#> 1       Y     0.8287250              Y
#> 2       Y     0.8245510              Y
#> 3       Y     0.8018997              Y
#> 4       Y     0.8312936              Y
#> 5       N     0.6305729              N
#> 6       Y     0.7359021              N
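The effect of the cutoff can be checked on the six predicted probabilities shown above. The sketch below is illustrative only: the helper `label_at` is a hypothetical name, and 0.50 is included purely as a comparison point, not as a value used in this report.

```r
# Predicted probabilities of the first six validation rows (from the output above)
probs <- c(0.8287250, 0.8245510, 0.8018997, 0.8312936, 0.6305729, 0.7359021)

# Hypothetical helper: turn probabilities into a Y/N factor at a given cutoff
label_at <- function(p, cutoff) {
  factor(ifelse(p >= cutoff, "Y", "N"), levels = c("N", "Y"))
}

labels_075 <- label_at(probs, 0.75)  # cutoff assumed by the business requirement
labels_050 <- label_at(probs, 0.50)  # comparison cutoff, for illustration only

table(labels_075)
table(labels_050)
```

A stricter cutoff approves fewer loans: rows 5 and 6 are rejected at 0.75 but would be approved at 0.50.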

Because the model has very strong predictor variables, we will try to create several models and compare them on selected predictor variables.

Model Evaluation

How well the model performs can be observed by building a confusion matrix from the predicted and actual values. The business assumption for the confusion matrix is that the positive class is an approved loan (Y as positive).

library(caret)

confusionMatrix(data = cr.validate$predstat.Label, 
                reference = cr.validate$outcome,
                positive = "Y")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   N   Y
#>          N  71  89
#>          Y   6 201
#>                                              
#>                Accuracy : 0.7411             
#>                  95% CI : (0.6931, 0.7852)   
#>     No Information Rate : 0.7902             
#>     P-Value [Acc > NIR] : 0.9899             
#>                                              
#>                   Kappa : 0.4407             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.6931             
#>             Specificity : 0.9221             
#>          Pos Pred Value : 0.9710             
#>          Neg Pred Value : 0.4437             
#>              Prevalence : 0.7902             
#>          Detection Rate : 0.5477             
#>    Detection Prevalence : 0.5640             
#>       Balanced Accuracy : 0.8076             
#>                                              
#>        'Positive' Class : Y                  
#> 

Based on the Confusion Matrix Result, the matrix can be interpreted as below:

  • True Negative values (TN) : 71

  • True Positive values (TP) : 201

  • False Negative values (FN): 89

  • False Positive values (FP): 6

  • Overall, the model correctly classifies about 74.1% of the validation data (accuracy). This is below the no-information rate of 79.0%, the accuracy obtained by always predicting ‘Yes’.

  • The model is correct about 97.1% of the time when it predicts ‘Yes’ for a borrower's loan (precision).

  • About 79.0% of the borrowers in the validation sample actually got the loan (prevalence), while the model predicts ‘Yes’ for about 56.4% of them (detection prevalence).

  • Of the borrowers who actually got the loan, the model correctly identifies about 69.3% (recall/sensitivity).
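The headline metrics can be recomputed directly from the four counts of the confusion matrix, which makes the definitions explicit. This is a minimal sketch using only the counts reported above; the variable names are illustrative.

```r
# Counts from the confusion matrix above (positive class = "Y")
tn <- 71    # predicted N, actual N
fn <- 89    # predicted N, actual Y
fp <- 6     # predicted Y, actual N
tp <- 201   # predicted Y, actual Y
total <- tn + fn + fp + tp

metrics <- c(
  accuracy             = (tp + tn) / total,   # share of correct predictions
  precision            = tp / (tp + fp),      # correct among predicted Y
  recall               = tp / (tp + fn),      # correct among actual Y
  specificity          = tn / (tn + fp),      # correct among actual N
  prevalence           = (tp + fn) / total,   # share of actual Y
  detection_prevalence = (tp + fp) / total    # share of predicted Y
)

round(metrics, 4)
```

With a 0.75 cutoff the model rarely approves a loan that should be rejected (high precision and specificity), at the cost of rejecting many borrowers who actually got the loan (lower recall).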