Problem Statement
This is a classic data science data set. It involves predicting the loan status of applicants. Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others.
The objective is to automate the loan approval process for Dream Housing Finance by identifying if the applicant’s loan could be approved.
Load packages
pacman::p_load(tidyverse, ggthemes, ggrepel, knitr, kableExtra, data.table, caret, pROC, rpart, rattle, rpart.plot, RColorBrewer)
Import data
train <- read_csv("D:/Documents/Family/TKC/20190325_Rportfolio/project/20190328_loan/source/train_u6lujuX_CVtuZ9i.csv")
test <- read_csv("D:/Documents/Family/TKC/20190325_Rportfolio/project/20190328_loan/source/test_Y3wMUE5_7gLdaTN.csv")
Explore data non-visually
Non-visual exploration shows the variables and the size of the data sets, and gives an initial idea of what cleaning is needed.
#non visual exploration
#check the data
glimpse(train)
## Observations: 614
## Variables: 13
## $ Loan_ID <chr> "LP001002", "LP001003", "LP001005", "LP00100...
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Mal...
## $ Married <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes...
## $ Dependents <chr> "0", "1", "0", "0", "0", "2", "0", "3+", "2"...
## $ Education <chr> "Graduate", "Graduate", "Graduate", "Not Gra...
## $ Self_Employed <chr> "No", "No", "Yes", "No", "No", "Yes", "No", ...
## $ ApplicantIncome <dbl> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 30...
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526,...
## $ LoanAmount <dbl> NA, 128, 66, 120, 141, 267, 95, 158, 168, 34...
## $ Loan_Amount_Term <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 360,...
## $ Credit_History <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ Property_Area <chr> "Urban", "Rural", "Urban", "Urban", "Urban",...
## $ Loan_Status <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y",...
glimpse(test)
## Observations: 367
## Variables: 12
## $ Loan_ID <chr> "LP001015", "LP001022", "LP001031", "LP00103...
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Mal...
## $ Married <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No...
## $ Dependents <chr> "0", "1", "2", "2", "0", "0", "1", "2", "2",...
## $ Education <chr> "Graduate", "Graduate", "Graduate", "Graduat...
## $ Self_Employed <chr> "No", "No", "No", "No", "No", "Yes", "No", "...
## $ ApplicantIncome <dbl> 5720, 3076, 5000, 2340, 3276, 2165, 2226, 38...
## $ CoapplicantIncome <dbl> 0, 1500, 1800, 2546, 0, 3422, 0, 0, 0, 2400,...
## $ LoanAmount <dbl> 110, 126, 208, 100, 78, 152, 59, 147, 280, 1...
## $ Loan_Amount_Term <dbl> 360, 360, 360, 360, 360, 360, 360, 360, 240,...
## $ Credit_History <dbl> 1, 1, 1, NA, 1, 1, 1, 0, 1, 1, 1, 1, NA, 0, ...
## $ Property_Area <chr> "Urban", "Urban", "Urban", "Urban", "Urban",...
#non visual exploration
str(train) #check the data structure
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 614 obs. of 13 variables:
## $ Loan_ID : chr "LP001002" "LP001003" "LP001005" "LP001006" ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Married : chr "No" "Yes" "Yes" "Yes" ...
## $ Dependents : chr "0" "1" "0" "0" ...
## $ Education : chr "Graduate" "Graduate" "Graduate" "Not Graduate" ...
## $ Self_Employed : chr "No" "No" "Yes" "No" ...
## $ ApplicantIncome : num 5849 4583 3000 2583 6000 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : num NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : num 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : num 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : chr "Urban" "Rural" "Urban" "Urban" ...
## $ Loan_Status : chr "Y" "N" "Y" "Y" ...
## - attr(*, "spec")=
## .. cols(
## .. Loan_ID = col_character(),
## .. Gender = col_character(),
## .. Married = col_character(),
## .. Dependents = col_character(),
## .. Education = col_character(),
## .. Self_Employed = col_character(),
## .. ApplicantIncome = col_double(),
## .. CoapplicantIncome = col_double(),
## .. LoanAmount = col_double(),
## .. Loan_Amount_Term = col_double(),
## .. Credit_History = col_double(),
## .. Property_Area = col_character(),
## .. Loan_Status = col_character()
## .. )
#combine the train and test data for cleaning
all.data <- rbind(
train,
test %>% mutate(Loan_Status="test")
)
#check overall missing data
#calculate missing proportion
missingprop <- function(data) {
miss.stuff <- data %>%
filter(!complete.cases(.))
miss.stuff.prop <- nrow(miss.stuff)/nrow(data) #proportion of rows with at least one missing value (21.6% for all.data)
return(miss.stuff.prop)
}
missingprop(all.data)
## [1] 0.216106
colMeans(is.na(all.data)) #check which columns to impute missing data
## Loan_ID Gender Married Dependents
## 0.000000000 0.024464832 0.003058104 0.025484200
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0.000000000 0.056065240 0.000000000 0.000000000
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 0.027522936 0.020387360 0.080530071 0.000000000
## Loan_Status
## 0.000000000
all.data.summary <- all.data %>% mutate_all(as.factor)
summary(all.data.summary)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 Female:182 No :347 0 :545 Graduate :763
## LP001003: 1 Male :775 Yes :631 1 :160 Not Graduate:218
## LP001005: 1 NA's : 24 NA's: 3 2 :160
## LP001006: 1 3+ : 91
## LP001008: 1 NA's: 25
## LP001011: 1
## (Other) :975
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## No :807 2500 : 13 0 :429 120 : 29
## Yes :119 5000 : 11 2500 : 6 110 : 27
## NA's: 55 3333 : 10 1666 : 5 100 : 24
## 3500 : 9 2000 : 5 187 : 21
## 2600 : 8 2083 : 5 150 : 19
## 4333 : 7 2333 : 5 (Other):834
## (Other):923 (Other):526 NA's : 27
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 360 :823 0 :148 Rural :290 N :192
## 180 : 66 1 :754 Semiurban:349 test:367
## 480 : 23 NA's: 79 Urban :342 Y :422
## 300 : 20
## 240 : 8
## (Other): 21
## NA's : 20
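The two missing-data checks above can be reproduced with base R alone. The following is a minimal sketch on a hypothetical toy data frame (the values are made up for illustration, not taken from the loan data): the row-level figure is the share of rows with at least one NA, and the column-level figure is the share of NA values per column.

```r
# Toy data frame with hypothetical values (not from the loan data)
toy <- data.frame(
  Gender = c("Male", NA, "Female", "Male"),
  LoanAmount = c(128, 66, NA, 120),
  Credit_History = c(1, 1, NA, 0),
  stringsAsFactors = FALSE
)

# Share of rows that contain at least one missing value
row.missing <- mean(!complete.cases(toy))

# Share of missing values in each column
col.missing <- colMeans(is.na(toy))

print(row.missing)
print(col.missing)
```

The row-level share can be much larger than any single column-level share, because a row counts as incomplete as soon as one of its columns is missing.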
Treat missing data
#impute all missing variables
#for categorical data, use the category that appears most frequently
#for numerical data, use median
all.data <- all.data %>%
mutate(Gender=ifelse(is.na(Gender),"Male",Gender),
Married=ifelse(is.na(Married),"Yes",Married),
Dependents=ifelse(is.na(Dependents),"0",Dependents),
Self_Employed=ifelse(is.na(Self_Employed),"No",Self_Employed),
LoanAmount=ifelse(is.na(LoanAmount),median(LoanAmount, na.rm = T),LoanAmount),
Loan_Amount_Term=ifelse(is.na(Loan_Amount_Term),median(Loan_Amount_Term, na.rm = T),Loan_Amount_Term),
Credit_History=ifelse(is.na(Credit_History),1,Credit_History))
Loan status will be the target variable, since the company wants to automate the loan eligibility decision.
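The imputation rule used above (most frequent category for categorical data, median for numerical data) hard-codes the fill values read off the summary. As a hedged alternative, the values could be derived from the data. The helper names `most.frequent`, `impute.mode` and `impute.median` below are hypothetical, and the vectors are toy values.

```r
# Return the most frequent non-missing value of a vector as a string.
# table() drops NA by default; ties resolve to the first level in sort order.
most.frequent <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Fill missing categorical values with the most frequent category
impute.mode <- function(x) {
  x[is.na(x)] <- most.frequent(x)
  x
}

# Fill missing numerical values with the median of the observed values
impute.median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

gender <- c("Male", "Female", NA, "Male")   # toy values
loan <- c(100, NA, 120, 200)                # toy values

gender.filled <- impute.mode(gender)
loan.filled <- impute.median(loan)
```

Deriving the fill values this way keeps the imputation consistent if the data change, at the cost of hiding which value was actually used.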
One Hot Encoding & Simple Transformation
Transform the categorical data
#find out the correlation between the variables
#one hot encoding all the categorical variables for correlation
#convert the binary variables into 1/0
all.data <- all.data %>%
mutate(Gender=ifelse(Gender=="Male",1,0),
Married=ifelse(Married=="Yes",1,0),
Education=ifelse(Education=="Graduate",1,0),
Self_Employed=ifelse(Self_Employed=="Yes",1,0))
data.categorical <- all.data %>%
select_if(is.character) %>%
select(-Loan_ID,-Loan_Status,-Dependents)
dmy <- dummyVars(" ~ .", data = data.categorical)
data.categorical.new <- data.frame(predict(dmy, newdata = data.categorical))
#replace original categorical data with binary categorical data
data.cat.names <- names(data.categorical)
all.data <- all.data[,!names(all.data)%in%data.cat.names]
data.new <- cbind(all.data,data.categorical.new)
head(data.new)
## Loan_ID Gender Married Dependents Education Self_Employed
## 1 LP001002 1 0 0 1 0
## 2 LP001003 1 1 1 1 0
## 3 LP001005 1 1 0 1 1
## 4 LP001006 1 1 0 0 0
## 5 LP001008 1 0 0 1 0
## 6 LP001011 1 1 2 1 1
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1 5849 0 126 360
## 2 4583 1508 128 360
## 3 3000 0 66 360
## 4 2583 2358 120 360
## 5 6000 0 141 360
## 6 5417 4196 267 360
## Credit_History Loan_Status Property_AreaRural Property_AreaSemiurban
## 1 1 Y 0 0
## 2 1 N 1 0
## 3 1 Y 0 0
## 4 1 Y 0 0
## 5 1 Y 0 0
## 6 1 Y 0 0
## Property_AreaUrban
## 1 1
## 2 0
## 3 1
## 4 1
## 5 1
## 6 1
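To illustrate what the one-hot encoding step produces, here is a small sketch that uses base R's `model.matrix()` instead of `caret::dummyVars()` (a swapped-in technique, chosen only so the example has no package dependencies). The toy data frame is hypothetical; the resulting column names follow the same pattern as in the output above.

```r
# Toy data frame with one categorical column (hypothetical rows)
toy <- data.frame(
  Property_Area = c("Urban", "Rural", "Semiurban", "Urban"),
  stringsAsFactors = FALSE
)

# "- 1" removes the intercept so that every category gets its own 0/1 column
dummies <- model.matrix(~ Property_Area - 1, data = toy)

print(colnames(dummies))
print(dummies)
```

Each row has exactly one 1 across the three columns, which is why one of them is always redundant given the other two.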
#calculate actual household size using dependents
table(data.new$Married, data.new$CoapplicantIncome > 0)
##
## FALSE TRUE
## 0 224 123
## 1 205 429
#Assume that applicant is added to dependents to form the household size
#But if applicant is married or applicant has a co-applicant or both, then it will be +2.
data.new <- data.new %>%
mutate(Dependents=ifelse(Dependents=="3+",3,Dependents),
Dependents=as.numeric(Dependents),
householdsize=case_when(Married==1 ~ Dependents+2,
CoapplicantIncome>0 ~ Dependents+2,
Married==0 & CoapplicantIncome==0 ~ Dependents+1),
CoapplicantIndicator=ifelse(CoapplicantIncome>0,1,0))
#check the data is correctly feature-engineered
table(data.new$householdsize , data.new$Dependents , useNA = "ifany")
##
## 0 1 2 3
## 1 184 0 0 0
## 2 386 24 0 0
## 3 0 136 9 0
## 4 0 0 151 7
## 5 0 0 0 84
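The household size rule above can be written as a small standalone function. This is a sketch in base R under the same assumption as the comments in the code: the applicant counts as one person, and a spouse or a co-applicant (or both) adds one more. The function name `household.size` and the input vectors are hypothetical.

```r
# Household size = dependents + applicant (+ one partner if married or
# if there is a co-applicant with income)
household.size <- function(married, coapplicant.income, dependents) {
  has.partner <- married == 1 | coapplicant.income > 0
  dependents + ifelse(has.partner, 2, 1)
}

# Four toy applicants: single, married, single with co-applicant, married with co-applicant
sizes <- household.size(
  married = c(0, 1, 0, 1),
  coapplicant.income = c(0, 0, 1508, 4196),
  dependents = c(0, 1, 0, 2)
)
print(sizes)
```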
rm(list=setdiff(ls(), c("data.new")))
#remove test from train
train.new <- data.new %>%
filter(Loan_Status!="test") %>%
mutate(Loan_Status=ifelse(Loan_Status=="Y",1,0))
Correlation
#check for correlation between the variables
cors.data <- data.frame(sapply(train.new %>%
select(-Loan_ID), cor, y=train.new$Loan_Status))
names(cors.data) <- c("correlation")
cors.data<-setDT(cors.data, keep.rownames = TRUE)[]
cors.data <- cors.data %>%
arrange(desc(correlation)) %>%
mutate(flag=ifelse(correlation>0,T,F))
kable(cors.data, digits = 2, format = "html", row.names = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = T,
font_size = 15)

| | rn | correlation | flag |
|---|---|---|---|
| 1 | Loan_Status | 1.00 | TRUE |
| 2 | Credit_History | 0.54 | TRUE |
| 3 | Property_AreaSemiurban | 0.14 | TRUE |
| 4 | Married | 0.09 | TRUE |
| 5 | Education | 0.09 | TRUE |
| 6 | CoapplicantIndicator | 0.08 | TRUE |
| 7 | householdsize | 0.04 | TRUE |
| 8 | Gender | 0.02 | TRUE |
| 9 | Dependents | 0.01 | TRUE |
| 10 | Self_Employed | 0.00 | FALSE |
| 11 | ApplicantIncome | 0.00 | FALSE |
| 12 | Loan_Amount_Term | -0.02 | FALSE |
| 13 | LoanAmount | -0.03 | FALSE |
| 14 | Property_AreaUrban | -0.04 | FALSE |
| 15 | CoapplicantIncome | -0.06 | FALSE |
| 16 | Property_AreaRural | -0.10 | FALSE |
It shows that existing variables such as credit history, property area and marital status have the strongest correlations with loan approval, although only credit history (0.54) is more than weakly correlated.
Visual exploration - Target Variable & Categorical Variables
Understand the distribution of the data against the loan status (i.e. the target variable) by visualising the categorical variables and loan statuses.
#understand the target variable
train.new %>%
group_by(Loan_Status) %>%
summarise(n.count=n()) %>%
mutate(percent=round(n.count/nrow(train.new)*100,1),
Loan_Status=as.factor(Loan_Status)) %>%
ungroup() %>%
ggplot(aes(x=Loan_Status, y=percent, fill=Loan_Status)) +
geom_bar(stat="identity")+
theme_economist_white()
#create a function for plotting many categorical variables and how they interact with the target variable
PlotSimple <- function(dataframe,x,y){
aaa <- enquo(x)
bbb <- enquo(y)
dataframe %>%
filter(!is.na(!! aaa), !is.na(!! bbb)) %>%
group_by(!! aaa,!! bbb) %>%
summarise(n=n())%>%
mutate(percent=n/nrow(dataframe)) %>%
ggplot(aes_(fill=aaa, y=~percent, x=bbb)) +
geom_bar(position="dodge", stat="identity") +
theme_economist_white()
# plot(p) # not strictly necessary
}
xvars <- list(as.name("Married"), as.name("Credit_History"),as.name("Gender"),
as.name("Education"),as.name("Self_Employed"),as.name("Property_AreaSemiurban"))
cat.data <- train.new %>%
select(Loan_Status,Gender,Married,Education,Self_Employed,Credit_History,Property_AreaSemiurban) %>%
mutate_all(as.factor)
all_plots<-lapply (xvars, PlotSimple, dataframe=cat.data, y =Loan_Status)
cowplot::plot_grid(plotlist = all_plots)
About 69% of the applicants in the training data (422 of 614) have their loan approved. Applicants who have a credit history and were approved make up roughly 60% of all applicants, which indicates that credit history and loan approval are correlated. Applicants who live in a semi-urban area and were approved make up 29.2% of all applicants. Married applicants also tend to have their loans approved. Each percentage in the plots is a share of all applicants, because PlotSimple divides by nrow(dataframe).
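Because PlotSimple divides every count by the total number of rows, the plotted percentages are shares of all applicants rather than approval rates within a group. The sketch below shows the difference with base R's `prop.table()` on hypothetical toy data (ten made-up applicants, not the loan data).

```r
# Ten toy applicants (hypothetical values)
toy <- data.frame(
  Credit_History = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
  Loan_Status    = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1)
)

counts <- table(Credit_History = toy$Credit_History,
                Loan_Status = toy$Loan_Status)

# Each cell divided by the total number of rows (what PlotSimple plots)
share.of.all <- prop.table(counts)

# Each cell divided by its row total: approval rate within each credit history group
share.within <- prop.table(counts, margin = 1)

print(share.of.all)
print(share.within)
```

In this toy example, half of all applicants have a credit history and were approved, while the approval rate among applicants with a credit history is five out of six. The second figure is usually the one that answers "how likely is approval given credit history".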
Visual exploration - Target Variable & Continuous Variables
Understand the distribution of the loan amount and incomes against the loan status.
#make the target variable into a factor
train.new <- train.new %>%
mutate(Loan_Status=as.factor(Loan_Status))
varlist <- c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")
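PlotFast <- function(varName) {
#boxplot of one continuous variable (given by name) split by loan status
#uses the .data pronoun instead of the deprecated group_by_(), select_() and aes_string()
train.new %>%
ggplot(aes(x=Loan_Status, y=.data[[varName]], fill=Loan_Status)) +
geom_boxplot() +
labs(y=varName) +
theme_economist_white()
}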
all_plot_cont<-lapply(varlist,PlotFast)
cowplot::plot_grid(plotlist = all_plot_cont, ncol=3)
rm(train.new)
It is hard to see any distinctive pattern among the current continuous variables. This may mean that approved and rejected applications have similar loan amounts and applicant/co-applicant incomes.
Feature Engineering & Updated Correlation
It is crucial to make variables that improve the model’s predictive accuracy.
#feature engineer variables to obtain a model that is good at prediction
#the combined income (applicant plus co-applicant) gives a more holistic view of the applicant's financial situation
#The feature-engineered variables:
# 1. Combined income is the applicant's income plus the co-applicant's income, since it reflects the true financial ability to repay the loan
# 2. Income loan ratio is the combined income divided by the loan amount. It measures the ability to repay the loan: the higher the ratio, the more income is available relative to the loan
# 3. Loan amount term ratio is the loan amount divided by the loan term, i.e. the amount to repay per unit of term. The higher the ratio, the more has to be repaid in each period
# 4. Per household income is the combined income divided by the household size (taking the dependants into account). The higher the per household income, the more likely the applicant can pay back
# 5. Per household income loan ratio is the per household income divided by the loan amount. It measures the ability to repay the loan: the higher the ratio, the more likely the applicant can pay back
data.new <- data.new %>%
mutate(combined.income=ApplicantIncome+CoapplicantIncome,
income.loan.ratio=combined.income/LoanAmount,
loan.amt.term.ratio=LoanAmount/Loan_Amount_Term,
per.household.income=combined.income/householdsize,
per.household.income.loan.ratio=per.household.income/LoanAmount)
#check for near-zero variance variables and drop them
nzv_cols <- nearZeroVar(data.new)
if(length(nzv_cols) > 0) data.new <- data.new[, -nzv_cols]
#prepare the data for modelling
train.new <- data.new %>%
filter(Loan_Status!="test") %>%
mutate(Loan_Status=ifelse(Loan_Status=="Y",1,0))
#check for correlation between the variables
#remove correlated variables
cors.data <- data.frame(sapply(train.new %>%
select(-Loan_ID), cor, y=train.new$Loan_Status))
names(cors.data) <- c("correlation")
cors.data<-setDT(cors.data, keep.rownames = TRUE)[]
cors.data <- cors.data %>%
arrange(desc(correlation)) %>%
mutate(flag=ifelse(correlation>0,T,F))
kable(cors.data, digits = 2, format = "html", row.names = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"),
full_width = T,
font_size = 15)

| | rn | correlation | flag |
|---|---|---|---|
| 1 | Loan_Status | 1.00 | TRUE |
| 2 | Credit_History | 0.54 | TRUE |
| 3 | Property_AreaSemiurban | 0.14 | TRUE |
| 4 | Married | 0.09 | TRUE |
| 5 | Education | 0.09 | TRUE |
| 6 | CoapplicantIndicator | 0.08 | TRUE |
| 7 | householdsize | 0.04 | TRUE |
| 8 | income.loan.ratio | 0.02 | TRUE |
| 9 | Gender | 0.02 | TRUE |
| 10 | Dependents | 0.01 | TRUE |
| 11 | Self_Employed | 0.00 | FALSE |
| 12 | ApplicantIncome | 0.00 | FALSE |
| 13 | loan.amt.term.ratio | -0.01 | FALSE |
| 14 | per.household.income.loan.ratio | -0.02 | FALSE |
| 15 | Loan_Amount_Term | -0.02 | FALSE |
| 16 | combined.income | -0.03 | FALSE |
| 17 | LoanAmount | -0.03 | FALSE |
| 18 | Property_AreaUrban | -0.04 | FALSE |
| 19 | per.household.income | -0.05 | FALSE |
| 20 | CoapplicantIncome | -0.06 | FALSE |
| 21 | Property_AreaRural | -0.10 | FALSE |
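As a worked check of the engineered variables defined above, the arithmetic can be done by hand for one applicant. The sketch below uses the values of LP001003 from the head(data.new) output (applicant income 4583, co-applicant income 1508, loan amount 128, term 360); the household size of 3 follows from the household rule (married, one dependent). The list name `applicant` is only for illustration.

```r
# Values for LP001003 as printed in head(data.new); householdsize = 1 dependent + 2
applicant <- list(
  ApplicantIncome = 4583,
  CoapplicantIncome = 1508,
  LoanAmount = 128,
  Loan_Amount_Term = 360,
  householdsize = 3
)

combined.income <- applicant$ApplicantIncome + applicant$CoapplicantIncome
income.loan.ratio <- combined.income / applicant$LoanAmount
loan.amt.term.ratio <- applicant$LoanAmount / applicant$Loan_Amount_Term
per.household.income <- combined.income / applicant$householdsize
per.household.income.loan.ratio <- per.household.income / applicant$LoanAmount

print(c(combined.income = combined.income,
        income.loan.ratio = income.loan.ratio,
        loan.amt.term.ratio = loan.amt.term.ratio,
        per.household.income = per.household.income,
        per.household.income.loan.ratio = per.household.income.loan.ratio))
```

The combined income is 6091 and the income loan ratio is about 47.6, which is well above the split value of 26.27 that the decision tree later uses for this variable.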
#remove variables whose pairwise correlation with another variable exceeds 0.6 in absolute value
correlations <- cor(train.new %>%
select(-Loan_ID))
highCorr <- findCorrelation(correlations, cutoff = .6, names = FALSE)
train.new <- train.new[, -highCorr]
train.new <- train.new %>%
mutate(Loan_Status=as.factor(Loan_Status))
varlist <- c("combined.income", "income.loan.ratio", "per.household.income.loan.ratio")
all_plot_cont<-lapply(varlist,PlotFast)
cowplot::plot_grid(plotlist = all_plot_cont,ncol=3)
test.new <- data.new %>%
filter(Loan_Status=="test") %>%
mutate(Loan_Status=NA)
test.new <- test.new[,-highCorr]
It is hard to see any distinctive pattern among the new continuous variables either. This may mean that approved and rejected applications have similar combined incomes and income-to-loan ratios.
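One detail of the removal step above deserves care: findCorrelation() returns column positions within the correlation matrix it was given, and that matrix was built without Loan_ID, whereas the positions are then applied to data frames that still contain Loan_ID as the first column. The sketch below illustrates the mismatch on hypothetical toy data. It uses a simplified stand-in rule (flag the later column of each highly correlated pair), not caret's actual algorithm, and shows that dropping by column name avoids the shift.

```r
# Toy data: an ID column plus three numeric columns; y is nearly a multiple of x
toy <- data.frame(
  id = c("a1", "a2", "a3", "a4", "a5"),
  x  = c(1, 2, 3, 4, 5),
  y  = c(2, 4, 6, 8, 11),
  z  = c(5, 1, 4, 2, 3),
  stringsAsFactors = FALSE
)

# Correlation matrix of the numeric columns only (no id column)
cors <- cor(toy[, c("x", "y", "z")])

# Simplified stand-in for caret::findCorrelation(): flag the later column
# of each pair whose absolute correlation exceeds the cutoff
high <- abs(cors) > 0.6 & upper.tri(cors)
idx <- unname(which(colSums(high) > 0))   # positions within cors, not within toy

# Dropping by position on the full data frame removes the wrong column
by.position <- toy[, -idx, drop = FALSE]

# Translating positions to names first removes the intended column
drop.names <- colnames(cors)[idx]
by.name <- toy[, !names(toy) %in% drop.names, drop = FALSE]

print(names(by.position))
print(names(by.name))
```

With caret itself, calling findCorrelation() with names = TRUE returns column names directly, which serves the same purpose.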
Modelling
#split the train data into training and validating data sets into 70% and 30%
intrain<-createDataPartition(y=train.new$Loan_Status ,p=0.7,list=FALSE)
training<-train.new[intrain,]
validating<-train.new[-intrain,]
training.id <- training %>% select(Loan_ID)
training <- training %>% select(-Loan_ID)
validating.id <- validating %>% select(Loan_ID)
validating <- validating %>% select(-Loan_ID)
#model =====
#randomforest =====
training$Loan_Status <- as.factor(training$Loan_Status)
set.seed(1)
fit_rf <- train(Loan_Status ~ ., data = training, method = "rf", prox = FALSE)
print(fit_rf)
## Random Forest
##
## 431 samples
## 16 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 431, 431, 431, 431, 431, 431, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8061970 0.5017869
## 9 0.7945394 0.4872008
## 16 0.7904534 0.4780981
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
varImp(fit_rf)
## rf variable importance
##
## Overall
## Credit_History 100.0000
## combined.income 35.7775
## income.loan.ratio 34.4260
## ApplicantIncome 34.1814
## per.household.income.loan.ratio 31.6184
## LoanAmount 29.8043
## CoapplicantIncome 19.4077
## Loan_Amount_Term 7.3335
## householdsize 6.7821
## Dependents 4.1325
## Property_AreaSemiurban 2.6524
## Property_AreaRural 1.5337
## Married 0.9549
## Education 0.9459
## Self_Employed 0.2596
## Gender 0.0000
rf_preds <- predict(fit_rf, newdata = validating)
cfMatrix <- confusionMatrix(table(data = rf_preds, validating$Loan_Status))
cfMatrix
## Confusion Matrix and Statistics
##
##
## data 0 1
## 0 21 5
## 1 36 121
##
## Accuracy : 0.776
## 95% CI : (0.7086, 0.8342)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.0056
##
## Kappa : 0.3863
## Mcnemar's Test P-Value : 2.797e-06
##
## Sensitivity : 0.3684
## Specificity : 0.9603
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.7707
## Prevalence : 0.3115
## Detection Rate : 0.1148
## Detection Prevalence : 0.1421
## Balanced Accuracy : 0.6644
##
## 'Positive' Class : 0
##
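The headline statistics in the confusion matrix output above can be recomputed by hand from the four cell counts. The sketch below does this in base R for the random forest matrix, treating class 0 (rejected) as the positive class as the output states. The object name `cm` is only for illustration.

```r
# Random forest confusion matrix from the output above:
# rows are predictions, columns are the actual loan status
cm <- matrix(c(21, 36, 5, 121), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)                  # correct predictions / all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])         # rejected loans correctly identified
specificity <- cm["1", "1"] / sum(cm[, "1"])         # approved loans correctly identified
pos.pred.value <- cm["0", "0"] / sum(cm["0", ])      # predicted rejections that are correct
balanced.accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, pos.pred.value = pos.pred.value,
              balanced.accuracy = balanced.accuracy), 4))
```

Read this way, the model recognises almost all approved loans but only 21 of the 57 rejected ones, which the overall accuracy alone does not reveal.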
#create decision tree ====
set.seed(2)
fit_dt <- rpart(Loan_Status ~ . ,
data=training,
method="class")
print(fit_dt)
## n= 431
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 431 135 1 (0.31322506 0.68677494)
## 2) Credit_History< 0.5 71 5 0 (0.92957746 0.07042254) *
## 3) Credit_History>=0.5 360 69 1 (0.19166667 0.80833333)
## 6) income.loan.ratio< 26.26687 10 3 0 (0.70000000 0.30000000) *
## 7) income.loan.ratio>=26.26687 350 62 1 (0.17714286 0.82285714) *
fit_dt$variable.importance
## Credit_History income.loan.ratio
## 64.583460 5.315714
## combined.income per.household.income.loan.ratio
## 2.882395 1.594714
fancyRpartPlot(fit_dt, cex=0.7)
The decision tree has the advantage of explaining applicant profiles. For instance, an applicant without a credit history (Credit_History = 0; a value of 1 means he/she has borrowed before and met the guidelines) is unlikely to have the loan approved. If the applicant has a credit history, other criteria such as the income loan ratio need to be checked to determine whether the loan could be approved.
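The fitted tree printed above is small enough to write out as explicit rules. The sketch below encodes the two splits from the print(fit_dt) output as a plain function; the function name `tree.rule.predict` and the example inputs are hypothetical, and the function only restates the printed splits rather than replacing predict().

```r
# Rules read off the printed tree:
#   Credit_History < 0.5                        -> predict 0 (not approved)
#   otherwise, income.loan.ratio < 26.26687     -> predict 0 (not approved)
#   otherwise                                   -> predict 1 (approved)
tree.rule.predict <- function(credit.history, income.loan.ratio) {
  ifelse(credit.history < 0.5, 0,
         ifelse(income.loan.ratio < 26.26687, 0, 1))
}

# Three toy applicants: no credit history, low ratio, and a passing case
preds <- tree.rule.predict(
  credit.history = c(0, 1, 1),
  income.loan.ratio = c(50, 20, 47.59)
)
print(preds)
```

Writing the rules out this way makes it clear that the income loan ratio is only consulted for applicants who already have a credit history.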
#confusion matrix
validating$Loan_Status <- as.factor(validating$Loan_Status)
dt_preds <- predict(fit_dt, newdata = validating, type = "class")
cfMatrix_dt <- confusionMatrix(data = dt_preds, validating$Loan_Status)
cfMatrix_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22 4
## 1 35 122
##
## Accuracy : 0.7869
## 95% CI : (0.7204, 0.8438)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.002
##
## Kappa : 0.4162
## Mcnemar's Test P-Value : 1.556e-06
##
## Sensitivity : 0.3860
## Specificity : 0.9683
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.7771
## Prevalence : 0.3115
## Detection Rate : 0.1202
## Detection Prevalence : 0.1421
## Balanced Accuracy : 0.6771
##
## 'Positive' Class : 0
##
Conclusion
Both Random Forest and Decision Tree give similar variable importance results, in which Credit_History and the income loan ratio are important in determining whether the loan will be approved. Both models also reach similar accuracy on the validation set (77.6% and 78.7% respectively), but both identify fewer than 40% of the rejected loans (sensitivity of 0.37 and 0.39).