Problem Statement

This is a classic data science data set. The task is to predict the loan status of applicants. Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate this eligibility check in real time, based on the details customers provide in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others.

The objective is to automate the loan approval process for Dream Housing Finance by identifying if the applicant’s loan could be approved.

Load packages

pacman::p_load(tidyverse, ggthemes , ggrepel, knitr,kableExtra, data.table,caret,pROC,rpart, rattle, rpart.plot, RColorBrewer)

Import data

train <- read_csv("D:/Documents/Family/TKC/20190325_Rportfolio/project/20190328_loan/source/train_u6lujuX_CVtuZ9i.csv")
test <- read_csv("D:/Documents/Family/TKC/20190325_Rportfolio/project/20190328_loan/source/test_Y3wMUE5_7gLdaTN.csv")

Explore data non-visually

Non-visual exploration shows which variables are present and how large the data sets are. It also gives an initial idea of how the cleaning can be done.

#non visual exploration 
#check the data
glimpse(train)

## Observations: 614
## Variables: 13
## $ Loan_ID           <chr> "LP001002", "LP001003", "LP001005", "LP00100...
## $ Gender            <chr> "Male", "Male", "Male", "Male", "Male", "Mal...
## $ Married           <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes...
## $ Dependents        <chr> "0", "1", "0", "0", "0", "2", "0", "3+", "2"...
## $ Education         <chr> "Graduate", "Graduate", "Graduate", "Not Gra...
## $ Self_Employed     <chr> "No", "No", "Yes", "No", "No", "Yes", "No", ...
## $ ApplicantIncome   <dbl> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 30...
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526,...
## $ LoanAmount        <dbl> NA, 128, 66, 120, 141, 267, 95, 158, 168, 34...
## $ Loan_Amount_Term  <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360,...
## $ Credit_History    <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ Property_Area     <chr> "Urban", "Rural", "Urban", "Urban", "Urban",...
## $ Loan_Status       <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y",...

glimpse(test)

## Observations: 367
## Variables: 12
## $ Loan_ID           <chr> "LP001015", "LP001022", "LP001031", "LP00103...
## $ Gender            <chr> "Male", "Male", "Male", "Male", "Male", "Mal...
## $ Married           <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No...
## $ Dependents        <chr> "0", "1", "2", "2", "0", "0", "1", "2", "2",...
## $ Education         <chr> "Graduate", "Graduate", "Graduate", "Graduat...
## $ Self_Employed     <chr> "No", "No", "No", "No", "No", "Yes", "No", "...
## $ ApplicantIncome   <dbl> 5720, 3076, 5000, 2340, 3276, 2165, 2226, 38...
## $ CoapplicantIncome <dbl> 0, 1500, 1800, 2546, 0, 3422, 0, 0, 0, 2400,...
## $ LoanAmount        <dbl> 110, 126, 208, 100, 78, 152, 59, 147, 280, 1...
## $ Loan_Amount_Term  <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 240,...
## $ Credit_History    <dbl> 1, 1, 1, NA, 1, 1, 1, 0, 1, 1, 1, 1, NA, 0, ...
## $ Property_Area     <chr> "Urban", "Urban", "Urban", "Urban", "Urban",...

#non visual exploration 
str(train) #check the data structure

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 614 obs. of  13 variables:
##  $ Loan_ID          : chr  "LP001002" "LP001003" "LP001005" "LP001006" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Married          : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Dependents       : chr  "0" "1" "0" "0" ...
##  $ Education        : chr  "Graduate" "Graduate" "Graduate" "Not Graduate" ...
##  $ Self_Employed    : chr  "No" "No" "Yes" "No" ...
##  $ ApplicantIncome  : num  5849 4583 3000 2583 6000 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : num  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ Loan_Status      : chr  "Y" "N" "Y" "Y" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Loan_ID = col_character(),
##   ..   Gender = col_character(),
##   ..   Married = col_character(),
##   ..   Dependents = col_character(),
##   ..   Education = col_character(),
##   ..   Self_Employed = col_character(),
##   ..   ApplicantIncome = col_double(),
##   ..   CoapplicantIncome = col_double(),
##   ..   LoanAmount = col_double(),
##   ..   Loan_Amount_Term = col_double(),
##   ..   Credit_History = col_double(),
##   ..   Property_Area = col_character(),
##   ..   Loan_Status = col_character()
##   .. )

#combine the train and test data for cleaning
all.data <- rbind(
  train,
  test %>% mutate(Loan_Status="test")
)

#check overall missing data 
#calculate missing proportion
missingprop <- function(data) {
  miss.stuff <- data %>%
    filter(!complete.cases(.))
  miss.stuff.prop <- nrow(miss.stuff)/nrow(data) #share of rows with at least one missing value
  return(miss.stuff.prop)
}

missingprop(all.data)

## [1] 0.216106

colMeans(is.na(all.data)) #check which columns to impute missing data

##           Loan_ID            Gender           Married        Dependents 
##       0.000000000       0.024464832       0.003058104       0.025484200 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##       0.000000000       0.056065240       0.000000000       0.000000000 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##       0.027522936       0.020387360       0.080530071       0.000000000 
##       Loan_Status 
##       0.000000000
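The two checks above answer different questions: missingprop() reports the share of rows with at least one missing value (21.6% here), while colMeans(is.na()) reports the share of missing values within each column (at most about 8% here). A minimal sketch of the difference on a toy data frame, whose values are hypothetical and not taken from the loan data:

```r
# Toy data frame with hypothetical values; NA marks a missing entry
toy <- data.frame(
  income = c(5000, NA, 3000, 4000),
  gender = c("Male", "Female", NA, "Male"),
  area   = c("Urban", "Rural", "Urban", "Rural"),
  stringsAsFactors = FALSE
)

# Share of rows with at least one missing value (what missingprop() measures)
row_share <- mean(!complete.cases(toy))

# Share of missing values within each column (what colMeans(is.na()) measures)
col_share <- colMeans(is.na(toy))

print(row_share)
print(col_share)
```

Two of the four toy rows are incomplete, yet no single column is missing more than a quarter of its values, which is why the row-level figure is always at least as large as any column-level figure.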

all.data.summary <- all.data %>% mutate_all(as.factor)
summary(all.data.summary)

##      Loan_ID       Gender    Married    Dependents        Education  
##  LP001002:  1   Female:182   No  :347   0   :545   Graduate    :763  
##  LP001003:  1   Male  :775   Yes :631   1   :160   Not Graduate:218  
##  LP001005:  1   NA's  : 24   NA's:  3   2   :160                     
##  LP001006:  1                           3+  : 91                     
##  LP001008:  1                           NA's: 25                     
##  LP001011:  1                                                        
##  (Other) :975                                                        
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount 
##  No  :807      2500   : 13     0      :429       120    : 29  
##  Yes :119      5000   : 11     2500   :  6       110    : 27  
##  NA's: 55      3333   : 10     1666   :  5       100    : 24  
##                3500   :  9     2000   :  5       187    : 21  
##                2600   :  8     2083   :  5       150    : 19  
##                4333   :  7     2333   :  5       (Other):834  
##                (Other):923     (Other):526       NA's   : 27  
##  Loan_Amount_Term Credit_History   Property_Area Loan_Status
##  360    :823      0   :148       Rural    :290   N   :192   
##  180    : 66      1   :754       Semiurban:349   test:367   
##  480    : 23      NA's: 79       Urban    :342   Y   :422   
##  300    : 20                                                
##  240    :  8                                                
##  (Other): 21                                                
##  NA's   : 20

Treat missing data

#impute all missing variables
#for categorical data, use the category that appears most frequently
#for numerical data, use median 

all.data <- all.data %>%
  mutate(Gender=ifelse(is.na(Gender),"Male",Gender),
         Married=ifelse(is.na(Married),"Yes",Married),
         Dependents=ifelse(is.na(Dependents),"0",Dependents),
         Self_Employed=ifelse(is.na(Self_Employed),"No",Self_Employed),
         LoanAmount=ifelse(is.na(LoanAmount),median(LoanAmount, na.rm = T),LoanAmount),
         Loan_Amount_Term=ifelse(is.na(Loan_Amount_Term),median(Loan_Amount_Term, na.rm = T),Loan_Amount_Term),
         Credit_History=ifelse(is.na(Credit_History),1,Credit_History))
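The imputation above hard-codes the most frequent categories read off the summary table. As a sketch of an alternative, the fill values could be computed from the data instead. The helper names most_frequent, impute_mode and impute_median are hypothetical and not part of the original analysis, and the example vectors are made up:

```r
# Return the most frequent non-missing value of a vector
most_frequent <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Replace missing values with the most frequent value (character columns)
impute_mode <- function(x) {
  x[is.na(x)] <- most_frequent(x)
  x
}

# Replace missing values with the median (numeric columns)
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

gender <- c("Male", "Female", NA, "Male")   # hypothetical values
amount <- c(100, NA, 120, 140)              # hypothetical values

gender_filled <- impute_mode(gender)
amount_filled <- impute_median(amount)

print(gender_filled)
print(amount_filled)
```

Computing the fill values this way keeps the imputation in step with the data if the input files change, at the cost of a little more code.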

Loan status is the target variable, since the company wants to automate the loan eligibility decision.

One Hot Encoding & Simple Transformation

Transform the categorical data

#find out the correlation between the variables 
#one hot encoding all the categorical variables for correlation 
#convert the binary variables into 1/0
all.data <- all.data %>%
  mutate(Gender=ifelse(Gender=="Male",1,0),
         Married=ifelse(Married=="Yes",1,0),
         Education=ifelse(Education=="Graduate",1,0),
         Self_Employed=ifelse(Self_Employed=="Yes",1,0))
  
data.categorical <- all.data %>%
  select_if(is.character) %>%
  select(-Loan_ID,-Loan_Status,-Dependents)

dmy <- dummyVars(" ~ .", data = data.categorical)
data.categorical.new <- data.frame(predict(dmy, newdata = data.categorical))

#replace original categorical data with binary categorical data
data.cat.names <- names(data.categorical)
all.data <- all.data[,!names(all.data)%in%data.cat.names]
data.new <- cbind(all.data,data.categorical.new)

head(data.new)

##    Loan_ID Gender Married Dependents Education Self_Employed
## 1 LP001002      1       0          0         1             0
## 2 LP001003      1       1          1         1             0
## 3 LP001005      1       1          0         1             1
## 4 LP001006      1       1          0         0             0
## 5 LP001008      1       0          0         1             0
## 6 LP001011      1       1          2         1             1
##   ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1            5849                 0        126              360
## 2            4583              1508        128              360
## 3            3000                 0         66              360
## 4            2583              2358        120              360
## 5            6000                 0        141              360
## 6            5417              4196        267              360
##   Credit_History Loan_Status Property_AreaRural Property_AreaSemiurban
## 1              1           Y                  0                      0
## 2              1           N                  1                      0
## 3              1           Y                  0                      0
## 4              1           Y                  0                      0
## 5              1           Y                  0                      0
## 6              1           Y                  0                      0
##   Property_AreaUrban
## 1                  1
## 2                  0
## 3                  1
## 4                  1
## 5                  1
## 6                  1
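To make the one-hot encoding step concrete, here is a small sketch that produces the same kind of 0/1 columns with base R's model.matrix() instead of caret's dummyVars(). This is a swapped-in technique for illustration only, and the four example values are hypothetical:

```r
# Hypothetical property areas for four applicants
area <- data.frame(
  Property_Area = c("Urban", "Rural", "Semiurban", "Urban"),
  stringsAsFactors = TRUE
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
one_hot <- model.matrix(~ Property_Area - 1, data = area)

print(colnames(one_hot))
print(one_hot)
```

Each row contains exactly one 1, in the column of that applicant's property area. This is also why the three Property_Area columns are perfectly collinear and one of them carries no extra information.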

#calculate actual household size using dependents
table(data.new$Married, data.new$CoapplicantIncome > 0)

##    
##     FALSE TRUE
##   0   224  123
##   1   205  429

#Assume that applicant is added to dependents to form the household size
#But if applicant is married or applicant has a co-applicant or both, then it will be +2.

data.new <- data.new %>%
  mutate(Dependents=ifelse(Dependents=="3+",3,Dependents),
         Dependents=as.numeric(Dependents),
         householdsize=case_when(Married==1 ~ Dependents+2,
                            CoapplicantIncome>0 ~ Dependents+2,
                            Married==0 & CoapplicantIncome==0 ~ Dependents+1),
         CoapplicantIndicator=ifelse(CoapplicantIncome>0,1,0)) 


#check the data is correctly feature-engineered
table(data.new$householdsize , data.new$Dependents , useNA = "ifany")

##    
##       0   1   2   3
##   1 184   0   0   0
##   2 386  24   0   0
##   3   0 136   9   0
##   4   0   0 151   7
##   5   0   0   0  84
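The household size rule can be checked on a few hand-made cases. The sketch below restates the rule from the case_when() call as a plain function; the function name household_size and the four example applicants are hypothetical:

```r
# Household size = dependents + 2 if the applicant is married or has a
# co-applicant with income, otherwise dependents + 1 (the applicant alone)
household_size <- function(married, coapplicant_income, dependents) {
  has_partner <- married == 1 | coapplicant_income > 0
  dependents + ifelse(has_partner, 2, 1)
}

sizes <- household_size(
  married            = c(0, 1, 0, 1),
  coapplicant_income = c(0, 0, 1508, 2358),
  dependents         = c(0, 2, 1, 3)
)

print(sizes)
```

The first applicant is single with no co-applicant and no dependents, so the household size is 1; the last is married with three dependents, so it is 5. These match the smallest and largest household sizes in the cross-table above.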

rm(list=setdiff(ls(), c("data.new")))

#remove test from train
train.new <- data.new %>% 
  filter(Loan_Status!="test") %>%
  mutate(Loan_Status=ifelse(Loan_Status=="Y",1,0))

Correlation

#check for correlation between the variables
cors.data <- data.frame(sapply(train.new %>%
                                 select(-Loan_ID), cor, y=train.new$Loan_Status)) 
names(cors.data) <- c("correlation")

cors.data<-setDT(cors.data, keep.rownames = TRUE)[]
cors.data <- cors.data %>% 
  arrange(desc(correlation)) %>% 
  mutate(flag=ifelse(correlation>0,T,F))

kable(cors.data, digits = 2, format = "html", row.names = TRUE) %>%
  kable_styling(bootstrap_options = c("striped","hover"), 
                full_width = T,
                font_size = 15)

	rn	correlation	flag
1	Loan_Status	1.00	TRUE
2	Credit_History	0.54	TRUE
3	Property_AreaSemiurban	0.14	TRUE
4	Married	0.09	TRUE
5	Education	0.09	TRUE
6	CoapplicantIndicator	0.08	TRUE
7	householdsize	0.04	TRUE
8	Gender	0.02	TRUE
9	Dependents	0.01	TRUE
10	Self_Employed	0.00	FALSE
11	ApplicantIncome	0.00	FALSE
12	Loan_Amount_Term	-0.02	FALSE
13	LoanAmount	-0.03	FALSE
14	Property_AreaUrban	-0.04	FALSE
15	CoapplicantIncome	-0.06	FALSE
16	Property_AreaRural	-0.10	FALSE

The table shows that credit history has by far the strongest correlation with loan approval (0.54). Living in a semi-urban property area (0.14) and being married (0.09) are also positively correlated with approval, but only weakly.

Visual exploration - Target Variable & Categorical Variables

Understand how the data are distributed against the loan status (i.e. the target variable) by visualising the categorical variables alongside the loan statuses.

#understand the target variable
train.new %>% 
  group_by(Loan_Status) %>% 
  summarise(n.count=n()) %>%
  mutate(percent=round(n.count/nrow(train.new)*100,1),
         Loan_Status=as.factor(Loan_Status)) %>%
  ungroup() %>%
  ggplot(aes(x=Loan_Status, y=percent, fill=Loan_Status)) +
  geom_bar(stat="identity")+
  theme_economist_white()

#create a function for plotting many categorical variables and how they interact with the target variable
PlotSimple <- function(dataframe,x,y){
  aaa <- enquo(x)
  bbb <- enquo(y)
  dataframe %>%
    filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
    group_by(!! aaa,!! bbb) %>%
    summarise(n=n())%>%
    mutate(percent=n/nrow(dataframe)) %>%
    ggplot(aes_(fill=aaa, y=~percent, x=bbb)) +
    geom_bar(position="dodge", stat="identity") +
    theme_economist_white()
  #   plot(p) # not strictly necessary
}



xvars <- list(as.name("Married"), as.name("Credit_History"),as.name("Gender"),
              as.name("Education"),as.name("Self_Employed"),as.name("Property_AreaSemiurban"))
cat.data <- train.new %>%
  select(Loan_Status,Gender,Married,Education,Self_Employed,Credit_History,Property_AreaSemiurban) %>%
  mutate_all(as.factor)
all_plots<-lapply (xvars, PlotSimple, dataframe=cat.data, y =Loan_Status)
cowplot::plot_grid(plotlist = all_plots)

About 69% of the clients (422 of 614) have their loan approved. In the categorical plots, each bar is a share of all training applicants. About 60% of all clients have a credit history and an approved loan, which indicates that credit history and loan approval are correlated. 29.2% of all applicants live in a semi-urban property area and have their loan approved. Married applicants also tend to have their loan approved.
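Because PlotSimple() divides each count by nrow(dataframe), the bars show shares of all applicants rather than approval rates within a group. The sketch below contrasts the two readings with prop.table() on a table of hypothetical counts (not the loan data):

```r
# Hypothetical counts: rows are groups, columns are loan status (0 = no, 1 = yes)
counts <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), approved = c("0", "1"))
)

# Each cell divided by the grand total (what PlotSimple() plots)
share_of_all <- prop.table(counts)

# Each cell divided by its row total (approval rate within each group)
rate_in_group <- prop.table(counts, margin = 1)

print(share_of_all)
print(rate_in_group)
```

In this toy table, group B accounts for 40% of all cases as approved loans, but its approval rate within the group is about 67%. Which of the two numbers is meant should be stated whenever a percentage is quoted from the plots.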

Visual exploration - Target Variable & Continuous Variables

Understand the distribution of the loan amount and the incomes

#make the target variable into a factor
train.new <- train.new %>%
  mutate(Loan_Status=as.factor(Loan_Status))

varlist <- c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")

PlotFast <- function(varName) {
  #boxplot of one continuous variable (given by name) split by loan status
  train.new %>%
    select(all_of(c("Loan_Status", varName))) %>%
    ggplot(aes(x = Loan_Status, y = .data[[varName]], fill = Loan_Status)) +
    geom_boxplot() +
    labs(y = varName) +
    theme_economist_white()
}

all_plot_cont<-lapply(varlist,PlotFast)
cowplot::plot_grid(plotlist = all_plot_cont, ncol=3)

rm(train.new)

It is hard to see any distinctive pattern among the current continuous variables. This may mean that approved and non-approved cases have similar loan amounts and applicant/co-applicant incomes.

Feature Engineering & Updated Correlation

It is crucial to make variables that improve the model’s predictive accuracy.

#feature engineer variables to obtain a model that is good at prediction

#The feature-engineered variables:
#  1. Combined income is the applicant's income plus the co-applicant's income. It gives a more holistic view of the financial ability to repay the loan
#  2. Income loan ratio is the combined income divided by the loan amount. It measures the ability to repay the loan: the higher the ratio, the more likely the applicant can pay back
#  3. Loan amount term ratio is the loan amount divided by the loan term, i.e. the amount to be repaid per period of the term. For the same loan amount, a higher ratio means a shorter repayment time
#  4. Per household income is the combined income divided by the household size (taking the dependants into account). The higher it is, the more likely the applicant can pay back
#  5. Per household income loan ratio is the per household income divided by the loan amount. The higher the ratio, the more likely the applicant can pay back



data.new <- data.new %>%
  mutate(combined.income=ApplicantIncome+CoapplicantIncome,
         income.loan.ratio=combined.income/LoanAmount,
         loan.amt.term.ratio=LoanAmount/Loan_Amount_Term,
         per.household.income=combined.income/householdsize,
         per.household.income.loan.ratio=per.household.income/LoanAmount) 
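As a worked example, the five engineered variables can be computed by hand for the second applicant shown in the head() output earlier (LP001003: applicant income 4583, co-applicant income 1508, loan amount 128, term 360, married with one dependent, hence household size 3). The sketch uses base R only and the object name applicant is hypothetical:

```r
# Values for applicant LP001003, taken from the head(data.new) output above
applicant <- data.frame(
  ApplicantIncome   = 4583,
  CoapplicantIncome = 1508,
  LoanAmount        = 128,
  Loan_Amount_Term  = 360,
  householdsize     = 3
)

# Same formulas as in the mutate() call above
applicant$combined.income      <- applicant$ApplicantIncome + applicant$CoapplicantIncome
applicant$income.loan.ratio    <- applicant$combined.income / applicant$LoanAmount
applicant$loan.amt.term.ratio  <- applicant$LoanAmount / applicant$Loan_Amount_Term
applicant$per.household.income <- applicant$combined.income / applicant$householdsize
applicant$per.household.income.loan.ratio <-
  applicant$per.household.income / applicant$LoanAmount

print(round(t(applicant), 3))
```

The combined income is 6091 and the income loan ratio is about 47.6, so this applicant sits well above the threshold of roughly 26.3 that the decision tree later uses on that variable.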


#check for zero and near-zero variance variables and drop them
nzv_cols <- nearZeroVar(data.new)
if(length(nzv_cols) > 0) data.new <- data.new[, -nzv_cols]

#prepare the data for modelling
train.new <- data.new %>% 
  filter(Loan_Status!="test") %>%
  mutate(Loan_Status=ifelse(Loan_Status=="Y",1,0))

#check the correlation of each variable with the target
cors.data <- data.frame(sapply(train.new %>%
                                 select(-Loan_ID), cor, y=train.new$Loan_Status)) 
names(cors.data) <- c("correlation")
cors.data<-setDT(cors.data, keep.rownames = TRUE)[]
cors.data <- cors.data %>% 
  arrange(desc(correlation)) %>% 
  mutate(flag=ifelse(correlation>0,T,F))


kable(cors.data, digits = 2, format = "html", row.names = TRUE) %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = T,
                font_size = 15)

	rn	correlation	flag
1	Loan_Status	1.00	TRUE
2	Credit_History	0.54	TRUE
3	Property_AreaSemiurban	0.14	TRUE
4	Married	0.09	TRUE
5	Education	0.09	TRUE
6	CoapplicantIndicator	0.08	TRUE
7	householdsize	0.04	TRUE
8	income.loan.ratio	0.02	TRUE
9	Gender	0.02	TRUE
10	Dependents	0.01	TRUE
11	Self_Employed	0.00	FALSE
12	ApplicantIncome	0.00	FALSE
13	loan.amt.term.ratio	-0.01	FALSE
14	per.household.income.loan.ratio	-0.02	FALSE
15	Loan_Amount_Term	-0.02	FALSE
16	combined.income	-0.03	FALSE
17	LoanAmount	-0.03	FALSE
18	Property_AreaUrban	-0.04	FALSE
19	per.household.income	-0.05	FALSE
20	CoapplicantIncome	-0.06	FALSE
21	Property_AreaRural	-0.10	FALSE

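#remove variables whose absolute pairwise correlation with another variable exceeds 0.6
correlations <- cor(train.new %>%
                      select(-Loan_ID))
highCorr <- findCorrelation(correlations, cutoff = .6, names = FALSE)

#note: highCorr holds column positions in `correlations`, which excludes Loan_ID,
#while train.new still has Loan_ID as its first column, so the positions dropped here are shifted by one
train.new <- train.new[, -highCorr]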
train.new <- train.new %>%
  mutate(Loan_Status=as.factor(Loan_Status))
varlist <- c("combined.income", "income.loan.ratio", "per.household.income.loan.ratio")
all_plot_cont<-lapply(varlist,PlotFast)
cowplot::plot_grid(plotlist = all_plot_cont,ncol=3)

test.new <- data.new %>%
  filter(Loan_Status=="test") %>%
  mutate(Loan_Status=NA)

test.new <- test.new[,-highCorr]
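One thing to watch in the step above: the positions in highCorr refer to the columns of the correlation matrix, which was computed without Loan_ID, but they are applied to data frames that still contain Loan_ID as the first column. A possible safeguard is to translate positions into column names before dropping. The sketch below shows the idea on toy data with base R only; the flagging rule is a simplified stand-in for caret's findCorrelation(), and all object names and values are hypothetical:

```r
# Toy data: an ID column followed by three numeric columns (hypothetical values)
toy <- data.frame(
  id = c("a", "b", "c", "d", "e"),
  x  = c(1, 2, 3, 4, 5),
  x2 = c(2, 4, 6, 8, 11),   # almost a multiple of x, so highly correlated with it
  z  = c(5, 1, 4, 2, 3),
  stringsAsFactors = FALSE
)

# Correlation matrix computed without the ID column
correlations <- cor(toy[, c("x", "x2", "z")])

# Simplified stand-in for findCorrelation(): flag a column if its absolute
# correlation with any column that precedes it exceeds the cutoff
upper <- abs(correlations)
upper[lower.tri(upper, diag = TRUE)] <- 0
drop_idx <- which(apply(upper, 2, max) > 0.6)

# The positions refer to `correlations`, so translate them into names
drop_names <- colnames(correlations)[drop_idx]

by_position <- toy[, -drop_idx]                     # shifted by the ID column
by_name     <- toy[, !names(toy) %in% drop_names]   # drops the intended column

print(names(by_position))
print(names(by_name))
```

Dropping by position removes x and keeps the flagged x2, whereas dropping by name removes x2 as intended. With caret, passing names = TRUE to findCorrelation() returns column names directly and would allow the same name-based removal.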

It is hard to see any distinctive pattern among the new continuous variables either. This may mean that approved and non-approved cases have similar combined incomes and income-to-loan ratios.

Modelling

#split the train data into training and validating data sets into 70% and 30%
intrain<-createDataPartition(y=train.new$Loan_Status ,p=0.7,list=FALSE)
training<-train.new[intrain,]
validating<-train.new[-intrain,]

training.id <- training %>% select(Loan_ID)
training <- training %>% select(-Loan_ID)

validating.id <- validating %>% select(Loan_ID)
validating <- validating %>% select(-Loan_ID)
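Note that set.seed() is only called after the split, so the partition itself will differ from run to run. As far as I understand, createDataPartition() samples within each class of a factor outcome so that the class balance is preserved. A minimal base R sketch of such a seeded, stratified 70/30 split is shown below; the labels and the seed value are hypothetical and this is not caret's implementation:

```r
set.seed(42)  # arbitrary seed so that the split can be reproduced

# Hypothetical labels: 30 rejected (0) and 70 approved (1)
status <- factor(rep(c("0", "1"), times = c(30, 70)))

# Sample 70% of the row positions within each class
train_idx <- unlist(lapply(split(seq_along(status), status), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

training_labels   <- status[train_idx]
validating_labels <- status[-train_idx]

print(table(training_labels))
print(table(validating_labels))
```

Both parts keep the 30/70 class ratio of the full toy data, which keeps validation metrics comparable with what the model saw during training.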


#model =====
#randomforest =====
training$Loan_Status <- as.factor(training$Loan_Status)
set.seed(1)
fit_rf <- train(Loan_Status ~ ., data = training, method = "rf", prox = FALSE)
print(fit_rf)

## Random Forest 
## 
## 431 samples
##  16 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 431, 431, 431, 431, 431, 431, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8061970  0.5017869
##    9    0.7945394  0.4872008
##   16    0.7904534  0.4780981
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

varImp(fit_rf)

## rf variable importance
## 
##                                  Overall
## Credit_History                  100.0000
## combined.income                  35.7775
## income.loan.ratio                34.4260
## ApplicantIncome                  34.1814
## per.household.income.loan.ratio  31.6184
## LoanAmount                       29.8043
## CoapplicantIncome                19.4077
## Loan_Amount_Term                  7.3335
## householdsize                     6.7821
## Dependents                        4.1325
## Property_AreaSemiurban            2.6524
## Property_AreaRural                1.5337
## Married                           0.9549
## Education                         0.9459
## Self_Employed                     0.2596
## Gender                            0.0000

rf_preds <- predict(fit_rf, newdata = validating)
cfMatrix <- confusionMatrix(table(data = rf_preds, validating$Loan_Status))
cfMatrix

## Confusion Matrix and Statistics
## 
##     
## data   0   1
##    0  21   5
##    1  36 121
##                                           
##                Accuracy : 0.776           
##                  95% CI : (0.7086, 0.8342)
##     No Information Rate : 0.6885          
##     P-Value [Acc > NIR] : 0.0056          
##                                           
##                   Kappa : 0.3863          
##  Mcnemar's Test P-Value : 2.797e-06       
##                                           
##             Sensitivity : 0.3684          
##             Specificity : 0.9603          
##          Pos Pred Value : 0.8077          
##          Neg Pred Value : 0.7707          
##              Prevalence : 0.3115          
##          Detection Rate : 0.1148          
##    Detection Prevalence : 0.1421          
##       Balanced Accuracy : 0.6644          
##                                           
##        'Positive' Class : 0               
##
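The headline statistics in this output can be reproduced by hand from the four cells of the confusion matrix. The sketch below does so in base R, using the counts printed above and treating class 0 (loan not approved) as the positive class, as caret reports; the object names are hypothetical:

```r
# Random forest confusion matrix from the output above
# (rows = predicted class, columns = actual class)
cm <- matrix(
  c(21,   5,
    36, 121),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

accuracy     <- sum(diag(cm)) / sum(cm)          # correct predictions / all cases
sensitivity  <- cm["0", "0"] / sum(cm[, "0"])    # rejected loans correctly flagged
specificity  <- cm["1", "1"] / sum(cm[, "1"])    # approved loans correctly flagged
no_info_rate <- max(colSums(cm)) / sum(cm)       # accuracy of always guessing the majority

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, no_info_rate = no_info_rate), 4))
```

The sensitivity of about 0.37 means the model catches only a little over a third of the loans that were actually rejected, even though the overall accuracy is about 0.78.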

#create decision tree ====
set.seed(2)
fit_dt <- rpart(Loan_Status ~ . ,
             data=training,
             method="class")
print(fit_dt)

## n= 431 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 431 135 1 (0.31322506 0.68677494)  
##   2) Credit_History< 0.5 71   5 0 (0.92957746 0.07042254) *
##   3) Credit_History>=0.5 360  69 1 (0.19166667 0.80833333)  
##     6) income.loan.ratio< 26.26687 10   3 0 (0.70000000 0.30000000) *
##     7) income.loan.ratio>=26.26687 350  62 1 (0.17714286 0.82285714) *

fit_dt$variable.importance

##                  Credit_History               income.loan.ratio 
##                       64.583460                        5.315714 
##                 combined.income per.household.income.loan.ratio 
##                        2.882395                        1.594714

fancyRpartPlot(fit_dt, cex=0.7)

The decision tree has the advantage of explaining applicant profiles clearly. For instance, if an applicant does not have a credit history that meets the guidelines (Credit_History = 0), the loan is unlikely to be approved: about 93% of such applicants in the training data were rejected. If the applicant does have a credit history (i.e. he/she has borrowed before and met the guidelines), other criteria such as the income loan ratio need to be checked to determine whether the loan could be approved.
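The fitted tree is small enough to be written out as two nested rules. The sketch below restates the splits printed by print(fit_dt) as a plain function; the function name predict_tree and the three example applicants are hypothetical, and the thresholds apply only to this particular fit:

```r
# Rules read off the printed tree:
#   Credit_History < 0.5                -> predict 0 (not approved)
#   otherwise income.loan.ratio < 26.27 -> predict 0 (not approved)
#   otherwise                           -> predict 1 (approved)
predict_tree <- function(credit_history, income_loan_ratio) {
  ifelse(credit_history < 0.5, 0,
         ifelse(income_loan_ratio < 26.26687, 0, 1))
}

predicted <- predict_tree(
  credit_history    = c(0, 1, 1),
  income_loan_ratio = c(50, 20, 47.6)
)

print(predicted)
```

Only the third applicant, who has a credit history and a sufficiently high income loan ratio, is predicted to be approved.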

#confusion matrix
validating$Loan_Status <- as.factor(validating$Loan_Status)

dt_preds <- predict(fit_dt, newdata = validating, type = "class")
cfMatrix_dt <- confusionMatrix(data = dt_preds, validating$Loan_Status)
cfMatrix_dt

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  22   4
##          1  35 122
##                                           
##                Accuracy : 0.7869          
##                  95% CI : (0.7204, 0.8438)
##     No Information Rate : 0.6885          
##     P-Value [Acc > NIR] : 0.002           
##                                           
##                   Kappa : 0.4162          
##  Mcnemar's Test P-Value : 1.556e-06       
##                                           
##             Sensitivity : 0.3860          
##             Specificity : 0.9683          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.7771          
##              Prevalence : 0.3115          
##          Detection Rate : 0.1202          
##    Detection Prevalence : 0.1421          
##       Balanced Accuracy : 0.6771          
##                                           
##        'Positive' Class : 0               
##

Conclusion

Both the random forest and the decision tree give similar variable importance results, in which Credit_History and the income loan ratio are important in determining whether a loan will be approved. The two models also have similar accuracy on the validation set (approximately 78%, against a no-information rate of about 69%).

Loan Prediction

Kwan Chet

June 8, 2019