Library

library(tidyverse)
library(caTools)
library(ROCR)
library(rpart)
library(rmdformats)
library(randomForest) 
library(psych)

Introduction

Research question:

Which variables affect the loan approval rate? Given a list of applicant characteristics, can we build a model to predict the loan approval outcome?




Problem: Dream Housing Company wants to automate the loan eligibility process in real time, based on the details customers provide when filling out the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target those customers.

Data

This data set was provided as part of a data science challenge. I downloaded the data and uploaded it to my GitHub account, and I read it into R from there.

Source: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/.

my_loan_data<- read.csv("https://raw.githubusercontent.com/yinaS1234/data-606/main/606%20final%20project/loan_data.csv")

head(my_loan_data)
##    Loan_ID Gender Married Dependents    Education Self_Employed ApplicantIncome
## 1 LP001002   Male      No          0     Graduate            No            5849
## 2 LP001003   Male     Yes          1     Graduate            No            4583
## 3 LP001005   Male     Yes          0     Graduate           Yes            3000
## 4 LP001006   Male     Yes          0 Not Graduate            No            2583
## 5 LP001008   Male      No          0     Graduate            No            6000
## 6 LP001011   Male     Yes          2     Graduate           Yes            5417
##   CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
## 1                 0         NA              360              1         Urban
## 2              1508        128              360              1         Rural
## 3                 0         66              360              1         Urban
## 4              2358        120              360              1         Urban
## 5                 0        141              360              1         Urban
## 6              4196        267              360              1         Urban
##   Loan_Status
## 1           Y
## 2           N
## 3           Y
## 4           Y
## 5           Y
## 6           Y
dim(my_loan_data)
## [1] 614  13

Data Cleaning

#Store backup before removing missing values
my_loan_data_backup <- my_loan_data

#Return all rows with missing values
my_loan_data[!complete.cases(my_loan_data),]
##      Loan_ID Gender Married Dependents    Education Self_Employed
## 1   LP001002   Male      No          0     Graduate            No
## 17  LP001034   Male      No          1 Not Graduate            No
## 20  LP001041   Male     Yes          0     Graduate              
## 25  LP001052   Male     Yes          1     Graduate              
## 31  LP001091   Male     Yes          1     Graduate              
## 36  LP001106   Male     Yes          0     Graduate            No
## 37  LP001109   Male     Yes          0     Graduate            No
## 43  LP001123   Male     Yes          0     Graduate            No
## 45  LP001136   Male     Yes          0 Not Graduate           Yes
## 46  LP001137 Female      No          0     Graduate            No
## 64  LP001213   Male     Yes          1     Graduate            No
## 74  LP001250   Male     Yes         3+ Not Graduate            No
## 80  LP001264   Male     Yes         3+ Not Graduate           Yes
## 82  LP001266   Male     Yes          1     Graduate           Yes
## 84  LP001273   Male     Yes          0     Graduate            No
## 87  LP001280   Male     Yes          2 Not Graduate            No
## 96  LP001326   Male      No          0     Graduate              
## 103 LP001350   Male     Yes                Graduate            No
## 104 LP001356   Male     Yes          0     Graduate            No
## 113 LP001391   Male     Yes          0 Not Graduate            No
## 114 LP001392 Female      No          1     Graduate           Yes
## 118 LP001405   Male     Yes          1     Graduate            No
## 126 LP001443 Female      No          0     Graduate            No
## 128 LP001449   Male      No          0     Graduate            No
## 130 LP001465   Male     Yes          0     Graduate            No
## 131 LP001469   Male      No          0     Graduate           Yes
## 157 LP001541   Male     Yes          1     Graduate            No
## 166 LP001574   Male     Yes          0     Graduate            No
## 182 LP001634   Male      No          0     Graduate            No
## 188 LP001643   Male     Yes          0     Graduate            No
## 198 LP001669 Female      No          0 Not Graduate            No
## 199 LP001671 Female     Yes          0     Graduate            No
## 203 LP001682   Male     Yes         3+ Not Graduate            No
## 220 LP001734 Female     Yes          2     Graduate            No
## 224 LP001749   Male     Yes          0     Graduate            No
## 233 LP001770   Male      No          0 Not Graduate            No
## 237 LP001786   Male     Yes          0     Graduate              
## 238 LP001788 Female      No          0     Graduate           Yes
## 260 LP001864   Male     Yes         3+ Not Graduate            No
## 261 LP001865   Male     Yes          1     Graduate            No
## 280 LP001908 Female     Yes          0 Not Graduate            No
## 285 LP001922   Male     Yes          0     Graduate            No
## 306 LP001990   Male      No          0 Not Graduate            No
## 310 LP001998   Male     Yes          2 Not Graduate            No
## 314 LP002008   Male     Yes          2     Graduate           Yes
## 318 LP002036   Male     Yes          0     Graduate            No
## 319 LP002043 Female      No          1     Graduate            No
## 323 LP002054   Male     Yes          2 Not Graduate            No
## 324 LP002055 Female      No          0     Graduate            No
## 336 LP002106   Male     Yes                Graduate           Yes
## 339 LP002113 Female      No         3+ Not Graduate            No
## 349 LP002137   Male     Yes          0     Graduate            No
## 364 LP002178   Male     Yes          0     Graduate            No
## 368 LP002188   Male      No          0     Graduate            No
## 378 LP002223   Male     Yes          0     Graduate            No
## 388 LP002243   Male     Yes          0 Not Graduate            No
## 393 LP002263   Male     Yes          0     Graduate            No
## 396 LP002272   Male     Yes          2     Graduate            No
## 412 LP002319   Male     Yes          0     Graduate              
## 422 LP002357 Female      No          0 Not Graduate            No
## 424 LP002362   Male     Yes          1     Graduate            No
## 436 LP002393 Female                        Graduate            No
## 438 LP002401   Male     Yes          0     Graduate            No
## 445 LP002424   Male     Yes          0     Graduate            No
## 450 LP002444   Male      No          1 Not Graduate           Yes
## 452 LP002447   Male     Yes          2 Not Graduate            No
## 461 LP002478            Yes          0     Graduate           Yes
## 474 LP002522 Female      No          0     Graduate           Yes
## 480 LP002533   Male     Yes          2     Graduate            No
## 491 LP002560   Male      No          0 Not Graduate            No
## 492 LP002562   Male     Yes          1 Not Graduate            No
## 498 LP002588   Male     Yes          0     Graduate            No
## 504 LP002618   Male     Yes          1 Not Graduate            No
## 507 LP002624   Male     Yes          0     Graduate            No
## 525 LP002697   Male      No          0     Graduate            No
## 531 LP002717   Male     Yes          0     Graduate            No
## 534 LP002729   Male      No          1     Graduate            No
## 545 LP002757 Female     Yes          0 Not Graduate            No
## 551 LP002778   Male     Yes          2     Graduate           Yes
## 552 LP002784   Male     Yes          1 Not Graduate            No
## 557 LP002794 Female      No          0     Graduate            No
## 566 LP002833   Male     Yes          0 Not Graduate            No
## 584 LP002898   Male     Yes          1     Graduate            No
## 601 LP002949 Female      No         3+     Graduate              
## 606 LP002960   Male     Yes          0 Not Graduate            No
##     ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1              5849                 0         NA              360
## 17             3596                 0        100              240
## 20             2600              3500        115               NA
## 25             3717              2925        151              360
## 31             4166              3369        201              360
## 36             2275              2067         NA              360
## 37             1828              1330        100               NA
## 43             2400                 0         75              360
## 45             4695                 0         96               NA
## 46             3410                 0         88               NA
## 64             4945                 0         NA              360
## 74             4755                 0         95               NA
## 80             3333              2166        130              360
## 82             2395                 0         NA              360
## 84             6000              2250        265              360
## 87             3333              2000         99              360
## 96             6782                 0         NA              360
## 103           13650                 0         NA              360
## 104            4652              3583         NA              360
## 113            3572              4114        152               NA
## 114            7451                 0         NA              360
## 118            2214              1398         85              360
## 126            3692                 0         93              360
## 128            3865              1640         NA              360
## 130            6080              2569        182              360
## 131           20166                 0        650              480
## 157            6000                 0        160              360
## 166            3707              3166        182               NA
## 182            1916              5063         67              360
## 188            2383              2138         58              360
## 198            1907              2365        120               NA
## 199            3416              2816        113              360
## 203            3992                 0         NA              180
## 220            4283              2383        127              360
## 224            7578              1010        175               NA
## 233            3189              2598        120               NA
## 237            5746                 0        255              360
## 238            3463                 0        122              360
## 260            4931                 0        128              360
## 261            6083              4250        330              360
## 280            4100                 0        124              360
## 285           20667                 0         NA              360
## 306            2000                 0         NA              360
## 310            7667                 0        185              360
## 314            5746                 0        144               84
## 318            2058              2134         88              360
## 319            3541                 0        112              360
## 323            3601              1590         NA              360
## 324            3166              2985        132              360
## 336            5503              4490         70               NA
## 339            1830                 0         NA              360
## 349            6333              4583        259              360
## 364            3013              3033         95              300
## 368            5124                 0        124               NA
## 378            4310                 0        130              360
## 388            3010              3136         NA              360
## 393            2583              2115        120              360
## 396            3276               484        135              360
## 412            6256                 0        160              360
## 422            2720                 0         80               NA
## 424            7250              1667        110               NA
## 436           10047                 0         NA              240
## 438            2213              1125         NA              360
## 445            7333              8333        175              300
## 450            2769              1542        190              360
## 452            1958              1456         60              300
## 461            2083              4083        160              360
## 474            2500                 0         93              360
## 480            2947              1603         NA              360
## 491            2699              2785         96              360
## 492            5333              1131        186              360
## 498            4625              2857        111               12
## 504            4050              5302        138              360
## 507           20833              6667        480              360
## 525            4680              2087         NA              360
## 531            1025              5500        216              360
## 534           11250                 0        196              360
## 545            3017               663        102              360
## 551            6633                 0         NA              360
## 552            2492              2375         NA              360
## 557            2667              1625         84              360
## 566            4467                 0        120              360
## 584            1880                 0         61              360
## 601             416             41667        350              180
## 606            2400              3800         NA              180
##     Credit_History Property_Area Loan_Status
## 1                1         Urban           Y
## 17              NA         Urban           Y
## 20               1         Urban           Y
## 25              NA     Semiurban           N
## 31              NA         Urban           N
## 36               1         Urban           Y
## 37               0         Urban           N
## 43              NA         Urban           Y
## 45               1         Urban           Y
## 46               1         Urban           Y
## 64               0         Rural           N
## 74               0     Semiurban           N
## 80              NA     Semiurban           Y
## 82               1     Semiurban           Y
## 84              NA     Semiurban           N
## 87              NA     Semiurban           Y
## 96              NA         Urban           N
## 103              1         Urban           Y
## 104              1     Semiurban           Y
## 113              0         Rural           N
## 114              1     Semiurban           Y
## 118             NA         Urban           Y
## 126             NA         Rural           Y
## 128              1         Rural           Y
## 130             NA         Rural           N
## 131             NA         Urban           Y
## 157             NA         Rural           Y
## 166              1         Rural           Y
## 182             NA         Rural           N
## 188             NA         Rural           Y
## 198              1         Urban           Y
## 199             NA     Semiurban           Y
## 203              1         Urban           N
## 220             NA     Semiurban           Y
## 224              1     Semiurban           Y
## 233              1         Rural           Y
## 237             NA         Urban           N
## 238             NA         Urban           Y
## 260             NA     Semiurban           N
## 261             NA         Urban           Y
## 280             NA         Rural           Y
## 285              1         Rural           N
## 306              1         Urban           N
## 310             NA         Rural           Y
## 314             NA         Rural           Y
## 318             NA         Urban           Y
## 319             NA     Semiurban           Y
## 323              1         Rural           Y
## 324             NA         Rural           Y
## 336              1     Semiurban           Y
## 339              0         Urban           N
## 349             NA     Semiurban           Y
## 364             NA         Urban           Y
## 368              0         Rural           N
## 378             NA     Semiurban           Y
## 388              0         Urban           N
## 393             NA         Urban           Y
## 396             NA     Semiurban           Y
## 412             NA         Urban           Y
## 422              0         Urban           N
## 424              0         Urban           N
## 436              1     Semiurban           Y
## 438              1         Urban           Y
## 445             NA         Rural           Y
## 450             NA     Semiurban           N
## 452             NA         Urban           Y
## 461             NA     Semiurban           Y
## 474             NA         Urban           Y
## 480              1         Urban           N
## 491             NA     Semiurban           Y
## 492             NA         Urban           Y
## 498             NA         Urban           Y
## 504             NA         Rural           N
## 507             NA         Urban           Y
## 525              1     Semiurban           N
## 531             NA         Rural           Y
## 534             NA     Semiurban           N
## 545             NA     Semiurban           Y
## 551              0         Rural           N
## 552              1         Rural           Y
## 557             NA         Urban           Y
## 566             NA         Rural           Y
## 584             NA         Rural           N
## 601             NA         Urban           N
## 606              1         Urban           N
#store only data without missing values (removed 85 rows)
my_loan_data<- my_loan_data[complete.cases(my_loan_data),]

dim(my_loan_data)
## [1] 529  13
## recode Loan_Status in place: Y = 1, N = 0 (it stays the 13th column)
my_loan_data <- my_loan_data %>%
  mutate(Loan_Status = ifelse(Loan_Status == 'Y', 1, 0))
# Convert all character columns to factors (numeric columns are left unchanged)
my_loan_data <- as.data.frame(unclass(my_loan_data),                     
                       stringsAsFactors = TRUE)

Exploratory data analysis & Inference

Dependent Variable


Loan_Status is the dependent variable. It is a binary categorical variable giving the loan approval outcome: yes (Y, coded as 1) or no (N, coded as 0).

Independent Variable(s)


There are a few independent variables. I will choose the most appropriate variables after doing exploratory analysis. Here are some preliminary variables listed below:

–Credit history

–Applicant income

–Applicants with higher education

–Gender of the applicant

–Number of Dependents

–Property area

Relevant summary statistics

str(my_loan_data)
## 'data.frame':    529 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 529 levels "LP001003","LP001005",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 3 2 2 2 4 2 5 4 3 4 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 2 2 ...
##  $ ApplicantIncome  : int  4583 3000 2583 6000 5417 2333 3036 4006 12841 3200 ...
##  $ CoapplicantIncome: num  1508 0 2358 0 4196 ...
##  $ LoanAmount       : int  128 66 120 141 267 95 158 168 349 70 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 0 1 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 1 3 3 3 3 3 2 3 2 3 ...
##  $ Loan_Status      : num  0 1 1 1 1 1 0 1 0 1 ...
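
The `""` level reported for Gender, Married, Dependents and Self_Employed suggests that blank CSV fields were read as empty strings rather than NA, so complete.cases() did not treat those rows as incomplete. The sketch below illustrates this behaviour on a made-up three-row CSV (`csv_text` is hypothetical, not the loan file) and shows one possible remedy, the `na.strings` argument of read.csv. Whether such rows should then be dropped or imputed is a separate decision.

```r
# Hypothetical three-row CSV: row 2 has a blank Gender, row 3 a missing LoanAmount
csv_text <- "Gender,LoanAmount\nMale,128\n,66\nFemale,NA\n"

# Default behaviour: a blank field in a character column becomes "" and is not NA
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat blank fields as missing as well
clean <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

# Number of rows that complete.cases() considers complete in each version
n_complete_raw   <- sum(complete.cases(raw))
n_complete_clean <- sum(complete.cases(clean))
print(c(raw = n_complete_raw, clean = n_complete_clean))
```

Reading the real file this way would change the row counts and summaries in this report, so all downstream output would need to be regenerated.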

Inference

summary(my_loan_data)
##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001003:  1         : 12      :  2     : 12     Graduate    :421  
##  LP001005:  1   Female: 95   No :188   0 :295     Not Graduate:108  
##  LP001006:  1   Male  :422   Yes:339   1 : 85                       
##  LP001008:  1                          2 : 92                       
##  LP001011:  1                          3+: 45                       
##  LP001013:  1                                                       
##  (Other) :523                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     : 25       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :434       1st Qu.: 2900   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 70       Median : 3816   Median : 1086     Median :128.0  
##                Mean   : 5508   Mean   : 1542     Mean   :145.9  
##                3rd Qu.: 5815   3rd Qu.: 2232     3rd Qu.:167.0  
##                Max.   :81000   Max.   :33837     Max.   :700.0  
##                                                                 
##  Loan_Amount_Term Credit_History     Property_Area  Loan_Status    
##  Min.   : 36.0    Min.   :0.0000   Rural    :155   Min.   :0.0000  
##  1st Qu.:360.0    1st Qu.:1.0000   Semiurban:209   1st Qu.:0.0000  
##  Median :360.0    Median :1.0000   Urban    :165   Median :1.0000  
##  Mean   :342.4    Mean   :0.8507                   Mean   :0.6919  
##  3rd Qu.:360.0    3rd Qu.:1.0000                   3rd Qu.:1.0000  
##  Max.   :480.0    Max.   :1.0000                   Max.   :1.0000  
## 
describe(my_loan_data)
##                   vars   n    mean      sd median trimmed     mad min   max
## Loan_ID*             1 529  265.00  152.85    265  265.00  195.70   1   529
## Gender*              2 529    2.78    0.47      3    2.87    0.00   1     3
## Married*             3 529    2.64    0.49      3    2.68    0.00   1     3
## Dependents*          4 529    2.74    1.05      2    2.60    0.00   1     5
## Education*           5 529    1.20    0.40      1    1.13    0.00   1     2
## Self_Employed*       6 529    2.09    0.42      2    2.04    0.00   1     3
## ApplicantIncome      7 529 5507.82 6404.13   3816 4346.45 1802.84 150 81000
## CoapplicantIncome    8 529 1542.39 2524.30   1086 1118.17 1610.10   0 33837
## LoanAmount           9 529  145.85   84.11    128  133.26   45.96   9   700
## Loan_Amount_Term    10 529  342.35   64.86    360  358.31    0.00  36   480
## Credit_History      11 529    0.85    0.36      1    0.94    0.00   0     1
## Property_Area*      12 529    2.02    0.78      2    2.02    1.48   1     3
## Loan_Status         13 529    0.69    0.46      1    0.74    0.00   0     1
##                   range  skew kurtosis     se
## Loan_ID*            528  0.00    -1.21   6.65
## Gender*               2 -1.95     3.03   0.02
## Married*              2 -0.67    -1.31   0.02
## Dependents*           4  0.86    -0.49   0.05
## Education*            1  1.46     0.14   0.02
## Self_Employed*        2  0.56     2.31   0.02
## ApplicantIncome   80850  6.43    56.78 278.44
## CoapplicantIncome 33837  5.96    60.12 109.75
## LoanAmount          691  2.59     9.94   3.66
## Loan_Amount_Term    444 -2.26     6.06   2.82
## Credit_History        1 -1.96     1.85   0.02
## Property_Area*        2 -0.03    -1.35   0.03
## Loan_Status           1 -0.83    -1.32   0.02
hist(my_loan_data$ApplicantIncome, col="lightblue")

hist(my_loan_data$CoapplicantIncome,col="yellow")

hist(my_loan_data$LoanAmount,col="red")

Applicant incomes range from 150 to 81,000, with the majority below 20,000.

Coapplicant incomes range from 0 (no coapplicant) to 33,837, with the majority below 10,000.

Loan amounts range from 9 to 700, with the majority between 100 and 200.

Most applicants are male and are employed by a company rather than self-employed.

All IDs are unique and randomly allotted. They have no impact on Loan_Status and can be dropped.




summary(my_loan_data$Property_Area)
##     Rural Semiurban     Urban 
##       155       209       165

Property Area:

# Bar chart of counts per property area, one panel per Loan_Status value
ggplot(data=my_loan_data, aes(x=Property_Area)) + 
  geom_bar(col="blue",fill="lightblue") +
  facet_grid(~Loan_Status)+
  scale_x_discrete()

The bar chart of Property Area shows that more loans are approved in Semiurban areas than in Rural or Urban areas.

Coapplicant Income:

summary(my_loan_data$CoapplicantIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    1086    1542    2232   33837
# Histogram of coapplicant income, one panel per Loan_Status value
ggplot(data=my_loan_data, aes(x=CoapplicantIncome)) + 
  geom_histogram(col="yellow",fill="pink", bins = 15) +
  facet_grid(~Loan_Status)+
  theme_bw()

The histogram shows that most applications come with a low coapplicant income, and that loan rejections are concentrated in the lowest coapplicant-income segment.

Education:

summary(my_loan_data$Education)
##     Graduate Not Graduate 
##          421          108
# Bar chart of counts per education level, one panel per Loan_Status value
ggplot(data=my_loan_data, aes(x=Education)) + 
  geom_bar(col="lightgreen",fill="blue") +
  facet_grid(~Loan_Status)+
  scale_x_discrete()+
  theme_bw()

The loan approval rate is higher for graduates than for non-graduates.
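
The faceted bar charts show counts, so an approval rate has to be read as a proportion within each group. A minimal sketch of that calculation follows. The helper `approval_rate_by` and the `toy` data frame are illustrative assumptions, not part of the original analysis; the same call should work on my_loan_data with any of the factor columns.

```r
# Hypothetical helper: row-wise proportions of each Loan_Status value within a group
approval_rate_by <- function(df, group_col, status_col = "Loan_Status") {
  tab <- table(df[[group_col]], df[[status_col]])
  prop.table(tab, margin = 1)  # each row sums to 1
}

# Made-up data, only to show the shape of the result
toy <- data.frame(
  Education   = c("Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"),
  Loan_Status = c(1, 1, 0, 1, 0)
)

rates <- approval_rate_by(toy, "Education")
print(rates)
```

The column labelled `1` holds the approval rate per group, which can be compared directly even when the groups differ greatly in size.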

Number of Dependents:

summary(my_loan_data$Dependents)
##       0   1   2  3+ 
##  12 295  85  92  45
# Bar chart of counts per number of dependents, one panel per Loan_Status value
ggplot(data=my_loan_data, aes(x=Dependents)) + 
  geom_bar(col="lightyellow",fill="lightgreen") +
  facet_grid(~Loan_Status)+
  scale_x_discrete()+
  theme_bw()

Applicants with no dependents have the highest approval and rejection counts, which is expected since they are by far the largest group (295 of 529).

Gender:

summary(my_loan_data$Gender)
##        Female   Male 
##     12     95    422
# Bar chart of counts per gender, one panel per Loan_Status value
ggplot(data=my_loan_data, aes(x=Gender)) + 
  geom_bar(col="lightgrey",fill="lightblue") +
  facet_grid(~Loan_Status)+
  scale_x_discrete()+
  theme_bw()

Male applicants have higher approval and rejection counts than female applicants, which reflects that 422 of the 529 applicants are male.

Models

Logistic Regression

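# Drop Loan_ID (column 1) and keep the remaining 12 columns
my_loan_data_1 <- my_loan_data[,2:13]
# 80/20 train/test split; no seed is set here, so the split changes between runs
ind <- sample.split (Y=my_loan_data_1$Loan_Status, SplitRatio=0.8)
traindf<- my_loan_data_1 [ind,]
testdf<- my_loan_data_1 [!ind,]
# Logistic regression of Loan_Status on all other columns
LRmodel<-glm(Loan_Status~.,traindf,family = "binomial")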
summary(LRmodel)
## 
## Call:
## glm(formula = Loan_Status ~ ., family = "binomial", data = traindf)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3815  -0.3506   0.4698   0.6857   2.4521  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             1.085e+01  6.200e+02   0.018  0.98604    
## GenderFemale            5.232e-01  8.826e-01   0.593  0.55333    
## GenderMale              9.307e-01  8.272e-01   1.125  0.26054    
## MarriedNo              -1.344e+01  6.200e+02  -0.022  0.98270    
## MarriedYes             -1.312e+01  6.200e+02  -0.021  0.98312    
## Dependents0             3.713e-01  1.046e+00   0.355  0.72264    
## Dependents1             1.861e-01  1.073e+00   0.173  0.86230    
## Dependents2             8.196e-01  1.078e+00   0.760  0.44708    
## Dependents3+            5.459e-01  1.136e+00   0.480  0.63093    
## EducationNot Graduate  -5.991e-01  3.217e-01  -1.862  0.06258 .  
## Self_EmployedNo        -4.410e-01  6.087e-01  -0.724  0.46883    
## Self_EmployedYes       -8.157e-01  6.885e-01  -1.185  0.23611    
## ApplicantIncome         1.535e-05  2.760e-05   0.556  0.57802    
## CoapplicantIncome      -5.288e-05  4.487e-05  -1.178  0.23865    
## LoanAmount             -1.579e-03  2.008e-03  -0.787  0.43149    
## Loan_Amount_Term       -3.050e-03  2.628e-03  -1.161  0.24579    
## Credit_History          3.921e+00  4.817e-01   8.140 3.95e-16 ***
## Property_AreaSemiurban  1.239e+00  3.363e-01   3.684  0.00023 ***
## Property_AreaUrban      3.887e-01  3.296e-01   1.180  0.23816    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 521.94  on 422  degrees of freedom
## Residual deviance: 365.73  on 404  degrees of freedom
## AIC: 403.73
## 
## Number of Fisher Scoring iterations: 13

From the z values and p-values above, the most significant variables are

Credit_History and Property_AreaSemiurban
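
Because the model is logistic, each coefficient is a change in log-odds, and exponentiating it gives an odds ratio. The sketch below does this for the two significant estimates, copied by hand from the summary above. They come from one unseeded random split, so other runs will give somewhat different values.

```r
# Estimates copied from the glm summary (log-odds scale)
coefs <- c(Credit_History = 3.921, Property_AreaSemiurban = 1.239)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 1))
```

With the other predictors held fixed, a Credit_History of 1 multiplies the odds of approval by roughly 50, and a Semiurban property multiplies them by roughly 3.5 relative to the Rural baseline.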

res<-predict(LRmodel,testdf,type="response")
res
##          2         10         17         24         40         46         47 
## 0.79794269 0.89624863 0.66451209 0.81633239 0.83752733 0.94938687 0.92082622 
##         51         54         58         59         65         69         74 
## 0.81069083 0.10777463 0.86219550 0.10496769 0.73341144 0.81832210 0.90442186 
##         78         79         83         88         93        100        105 
## 0.90066196 0.92292349 0.90800481 0.88236807 0.91622230 0.87221921 0.08165033 
##        108        110        117        118        127        130        131 
## 0.95380157 0.88337976 0.91200815 0.91666153 0.77537480 0.29442100 0.81339857 
##        134        136        137        141        146        153        154 
## 0.86718474 0.15967917 0.83255416 0.11732794 0.89580946 0.05371221 0.95933721 
##        155        156        158        159        162        164        165 
## 0.92959225 0.89417772 0.04969927 0.39583721 0.90542864 0.90661879 0.72504606 
##        166        168        169        171        175        184        188 
## 0.89228838 0.75889991 0.83724720 0.69110289 0.58322439 0.75241390 0.94747396 
##        189        190        193        197        198        200        201 
## 0.88126234 0.92760280 0.94156327 0.91669401 0.83368379 0.72745169 0.59475517 
##        211        214        215        217        223        256        261 
## 0.86506224 0.79558641 0.47048645 0.05388181 0.92675016 0.74497613 0.63541592 
##        284        293        294        297        303        323        324 
## 0.73954258 0.89696460 0.88349011 0.80046944 0.88637529 0.77852073 0.74126741 
##        328        346        354        355        360        373        386 
## 0.71539124 0.78135158 0.03380021 0.63919148 0.75209956 0.91247922 0.05303924 
##        392        400        409        411        419        428        432 
## 0.81064732 0.10496907 0.94954956 0.89069213 0.75267449 0.03581224 0.88337633 
##        435        442        445        449        451        456        462 
## 0.73370172 0.77333827 0.88208114 0.74487033 0.75549622 0.86861193 0.67495877 
##        464        473        476        477        478        485        487 
## 0.90986704 0.77746170 0.92127151 0.90038903 0.91046546 0.80599164 0.74993693 
##        489        503        509        510        512        515        521 
## 0.78369224 0.84678731 0.94606609 0.70795165 0.78048943 0.11392097 0.81225177 
##        525 
## 0.65013296
table(Actualvalue=testdf$Loan_Status,Predictedvalue=res>0.5)
##            Predictedvalue
## Actualvalue FALSE TRUE
##           0    13   20
##           1     3   70
(13+70)/(13+70+20+3)
## [1] 0.7830189

Accuracy: 78%
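The same figure can be computed from the confusion matrix itself rather than by retyping the counts. This is a minimal sketch: the counts printed above are re-entered by hand into a matrix called `conf_mat`, a name chosen here for illustration and not part of the original analysis.

```r
# Confusion matrix from the logistic regression output above
# (rows = actual Loan_Status, columns = predicted at the 0.5 cutoff).
# matrix() fills column by column, so the first column is c(13, 3).
conf_mat <- matrix(c(13, 3, 20, 70), nrow = 2,
                   dimnames = list(Actualvalue = c("0", "1"),
                                   Predictedvalue = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
round(accuracy, 4)
```

In the live session the table returned by `table(...)` could be passed to `diag()` and `sum()` directly, which avoids transcription slips.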

Decision Tree

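set.seed(42)
# 70/30 split; the index is named train_idx so it does not mask base::sample()
train_idx <- sample.int(n = nrow(my_loan_data_1), size = floor(.70*nrow(my_loan_data_1)), replace = FALSE)
trainnew <- my_loan_data_1[train_idx, ]
testnew  <- my_loan_data_1[-train_idx, ]
# Note: the tree is fit on traindf (the training split created earlier), not on trainnew;
# the cptable printed below reflects that choice.
dtree <- rpart(Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area + LoanAmount + 
                 ApplicantIncome, method="class", data=traindf,parms=list(split="information"))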
dtree$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.40769231      0 1.0000000 1.0000000 0.07299480
## 2 0.01153846      1 0.5923077 0.5923077 0.06104778
## 3 0.01000000      4 0.5538462 0.6538462 0.06339489
plotcp(dtree)

dtree.pruned <- prune(dtree, cp=.02290076)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")

dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
                    dnn=c("Actual", "Predicted"))
dtree.perf
##       Predicted
## Actual   0   1
##      0  49  64
##      1   5 252

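Now use the test data. Note that the code below fits and prunes a new tree on testnew and then scores that same data, so the resulting accuracy is an in-sample figure, not a held-out estimate.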

dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LoanAmount+
                 ApplicantIncome,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
##     CP nsplit rel error xerror       xstd
## 1 0.42      0      1.00   1.00 0.11709266
## 2 0.01      1      0.58   0.58 0.09738725
plotcp(dtree_test)

dtree_test.pruned <- prune(dtree_test, cp=.01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")

dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class") 
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred, dnn=c("Actual", "Predicted")) 
dtree_test.perf
##       Predicted
## Actual   0   1
##      0  23  27
##      1   2 107

Accuracy: 82% ((23+107)/159, about 0.818)
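Because the tree above is both fit and scored on testnew, a held-out estimate would instead score the tree trained on the training rows against rows it has never seen. This is a rough sketch of that workflow, not the analysis performed in this report. It uses the `kyphosis` data that ships with rpart as a stand-in for the loan data. The names `train_k`, `test_k`, `fit` and `fit_pruned`, and the pruning value `cp = 0.02`, are placeholders chosen for illustration.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the loan data here.
set.seed(42)
train_idx <- sample.int(nrow(kyphosis), size = floor(0.70 * nrow(kyphosis)))
train_k <- kyphosis[train_idx, ]
test_k  <- kyphosis[-train_idx, ]

# Fit and prune on the training rows only.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = train_k,
             method = "class", parms = list(split = "information"))
fit_pruned <- prune(fit, cp = 0.02)  # placeholder cp, not a tuned value

# Score the rows the tree has never seen.
pred <- predict(fit_pruned, test_k, type = "class")
perf <- table(test_k$Kyphosis, pred, dnn = c("Actual", "Predicted"))
perf
sum(diag(perf)) / sum(perf)
```

Applied to this report, the analogous step would be to pass testnew to `predict()` on a tree that was fit only on the training split.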

Random Forest

trainnew <- mutate_if(trainnew, is.character, as.factor)
testnew <- mutate_if(testnew, is.character, as.factor)
str(trainnew)
## 'data.frame':    370 obs. of  12 variables:
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 2 3 3 3 3 3 3 2 3 2 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 3 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 5 2 3 4 2 4 2 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 1 1 1 2 1 2 2 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ ApplicantIncome  : int  2764 6400 5695 4333 5708 8080 2281 2423 4226 2149 ...
##  $ CoapplicantIncome: num  1459 7250 4167 1811 5625 ...
##  $ LoanAmount       : int  110 180 175 160 187 180 113 130 110 178 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 0 1 0 1 1 1 1 1 0 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 2 3 2 3 1 2 3 2 ...
##  $ Loan_Status      : num  1 0 1 1 1 1 0 1 1 0 ...
set.seed(42) 
fit.forest <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LoanAmount+
                             ApplicantIncome, data=trainnew,
                           na.action=na.roughfix,
                           importance=TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
fit.forest
## 
## Call:
##  randomForest(formula = Loan_Status ~ Credit_History + Education +      Self_Employed + Property_Area + LoanAmount + ApplicantIncome,      data = trainnew, importance = TRUE, na.action = na.roughfix) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.1532389
##                     % Var explained: 27.76
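The warning and the "Type of random forest: regression" line above show that the forest was grown as a regression on the numeric 0/1 response, which is why the predictions further down are scores that get thresholded at 0.5. If a classification forest is wanted, the usual route is to convert the response to a factor first. The sketch below illustrates this on the built-in `mtcars` data, whose `am` column is also numeric 0/1. The data set, the predictors, and the names `cars` and `fit_cls` are illustrative assumptions. A classification forest on the loan data would not necessarily reproduce the accuracy reported in this section.

```r
# mtcars$am is a numeric 0/1 column, like Loan_Status in this report.
cars <- mtcars
is.numeric(cars$am)   # a numeric response puts randomForest in regression mode

# Converting the response to a factor requests classification instead.
cars$am <- factor(cars$am, levels = c(0, 1))

# Only run the forest if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  fit_cls <- randomForest::randomForest(am ~ mpg + wt + hp, data = cars,
                                        importance = TRUE)
  print(fit_cls$type)                  # type of forest that was grown
  print(head(predict(fit_cls, cars)))  # class labels, no 0.5 cutoff needed
}
```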
importance(fit.forest, type=2)
##                 IncNodePurity
## Credit_History      21.070826
## Education            1.547224
## Self_Employed        2.261059
## Property_Area        3.915217
## LoanAmount          12.623137
## ApplicantIncome     12.353369
forest.pred <- predict(fit.forest, testnew)

forest.pred
##          7          8         11         13         15         18         19 
## 0.18606571 0.84736805 0.90641530 0.86812300 0.86385154 0.15143880 0.63527671 
##         21         22         23         26         29         39         44 
## 0.02503413 0.77682193 0.95482886 0.94075782 0.93379921 0.09528321 0.81479861 
##         50         52         56         58         59         64         65 
## 0.86186664 0.84827143 0.06644799 0.72425266 0.12150380 0.76722260 0.85302685 
##         67         69         71         77         80         81         83 
## 0.22927017 0.66189477 0.93808185 0.81191621 0.94322408 0.87068374 0.89618627 
##         87         99        101        104        117        119        125 
## 0.81518598 0.57208155 0.07689353 0.38116725 0.84412748 0.82910943 0.07243296 
##        129        132        134        137        140        143        144 
## 0.85471395 0.88044784 0.94244478 0.81404207 0.85935665 0.76344657 0.48027126 
##        145        151        154        155        164        168        175 
## 0.53469707 0.96251593 0.78286069 0.70324899 0.87845452 0.74736243 0.72767164 
##        176        177        179        184        190        195        199 
## 0.86712947 0.91999354 0.04911328 0.81178215 0.93714782 0.57345324 0.93441907 
##        200        204        205        209        211        216        218 
## 0.51228080 0.87853953 0.88237628 0.73530632 0.95339297 0.51774225 0.65730188 
##        222        223        231        232        234        235        236 
## 0.90590670 0.93809181 0.82622192 0.86775881 0.92269132 0.90809510 0.95689061 
##        237        244        249        253        256        260        263 
## 0.83164763 0.85649768 0.84315649 0.73647578 0.81610847 0.96443344 0.85261580 
##        264        266        270        273        278        286        289 
## 0.77118010 0.39335017 0.88251789 0.79942445 0.13002499 0.77731525 0.83247778 
##        291        293        305        306        307        309        310 
## 0.86316058 0.81567545 0.67244056 0.78155327 0.03315355 0.85109600 0.86028908 
##        319        320        323        326        327        331        335 
## 0.67394184 0.06517938 0.75014035 0.92752656 0.58128036 0.82912828 0.72063540 
##        336        338        342        352        354        362        363 
## 0.52411056 0.79369728 0.10451982 0.28144606 0.17437228 0.81688706 0.83412806 
##        364        370        376        378        379        380        382 
## 0.82180217 0.84517657 0.06454731 0.87788340 0.69482882 0.87674441 0.74987929 
##        383        384        386        387        392        395        397 
## 0.48864629 0.81231812 0.10942573 0.10288116 0.82999229 0.88683508 0.82723184 
##        404        407        411        413        414        417        418 
## 0.80543375 0.64422213 0.84018241 0.65319735 0.88837251 0.71647980 0.14181460 
##        423        424        425        426        427        428        436 
## 0.65562933 0.06052029 0.93714018 0.72484760 0.70625758 0.12495368 0.85971967 
##        445        446        449        454        456        462        463 
## 0.91176845 0.57748037 0.59574787 0.94035504 0.51922286 0.58880613 0.76161453 
##        466        469        475        477        483        488        492 
## 0.90664380 0.87212338 0.06686548 0.75970351 0.70903837 0.04781336 0.73129931 
##        493        496        497        500        501        502        505 
## 0.93911616 0.83333321 0.90949124 0.85027991 0.92559482 0.13719139 0.82114098 
##        512        516        517        520        522 
## 0.81000331 0.70759094 0.87966150 0.62261016 0.76738305
table(Actualvalue=testnew$Loan_Status,Predictedvalue=forest.pred>0.5)
##            Predictedvalue
## Actualvalue FALSE TRUE
##           0    24   26
##           1     5  104
(104+24)/(104+24+26+5)
## [1] 0.8050314

To calculate accuracy, use the following formula: (TP+TN)/(TP+TN+FP+FN).
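As a small sketch, the formula can be wrapped in a helper that reads TP, TN, FP and FN from a 2x2 confusion matrix laid out as in this report. The function name `accuracy_from_table` and the matrix `rf_tab` are introduced here for illustration. The counts are copied by hand from the random forest table above.

```r
# Accuracy from a 2x2 confusion matrix laid out as in this report:
# rows = actual (0, 1), columns = predicted (FALSE, TRUE).
accuracy_from_table <- function(tab) {
  TN <- tab[1, 1]; FP <- tab[1, 2]
  FN <- tab[2, 1]; TP <- tab[2, 2]
  (TP + TN) / (TP + TN + FP + FN)
}

# Random forest counts printed above (matrix() fills column by column).
rf_tab <- matrix(c(24, 5, 26, 104), nrow = 2,
                 dimnames = list(Actualvalue = c("0", "1"),
                                 Predictedvalue = c("FALSE", "TRUE")))
accuracy_from_table(rf_tab)
```

The helper assumes both predicted classes appear in the table. If every prediction fell on one side of the cutoff, `table()` would return a single column and the indexing would fail.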

Accuracy: 80.5%

Conclusion

To summarize our findings: Credit_History and Property_AreaSemiurban are the two most significant variables for predicting the loan application outcome. Dream Housing Company should target customers with a credit history and customers who live in a Semiurban area.

As for model accuracy:

78% accuracy for logistic regression

82% accuracy for the decision tree ((23+107)/159 from its confusion matrix)

80.50% accuracy for random forest.

Limitations

The dataset is relatively small. A larger dataset would likely help improve model accuracy.

Also, the applicants in the dataset used to build the models are mostly low-income, male, and employed by a company. It would be interesting to include more female, high-income, and self-employed applicants in order to build a better model.

