library(tidyverse)
library(caTools)
library(ROCR)
library(rpart)
library(rmdformats)
library(randomForest)
library(psych)
Research question:
Problem: Dream Housing Company wants to automate its loan eligibility process in real time, based on the details customers provide when filling out the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company wants to identify the customer segments that are eligible for a loan so that it can target those customers specifically.
This data source was provided as part of a data science challenge. I downloaded the data and uploaded it to my GitHub account, and I read it into R from there.
Source: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/.
my_loan_data<- read.csv("https://raw.githubusercontent.com/yinaS1234/data-606/main/606%20final%20project/loan_data.csv")
head(my_loan_data)
## Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome
## 1 LP001002 Male No 0 Graduate No 5849
## 2 LP001003 Male Yes 1 Graduate No 4583
## 3 LP001005 Male Yes 0 Graduate Yes 3000
## 4 LP001006 Male Yes 0 Not Graduate No 2583
## 5 LP001008 Male No 0 Graduate No 6000
## 6 LP001011 Male Yes 2 Graduate Yes 5417
## CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area
## 1 0 NA 360 1 Urban
## 2 1508 128 360 1 Rural
## 3 0 66 360 1 Urban
## 4 2358 120 360 1 Urban
## 5 0 141 360 1 Urban
## 6 4196 267 360 1 Urban
## Loan_Status
## 1 Y
## 2 N
## 3 Y
## 4 Y
## 5 Y
## 6 Y
dim(my_loan_data)
## [1] 614 13
#Store backup before removing missing values
my_loan_data_backup <- my_loan_data
#Return all rows with missing values
my_loan_data[!complete.cases(my_loan_data),]
## Loan_ID Gender Married Dependents Education Self_Employed
## 1 LP001002 Male No 0 Graduate No
## 17 LP001034 Male No 1 Not Graduate No
## 20 LP001041 Male Yes 0 Graduate
## 25 LP001052 Male Yes 1 Graduate
## 31 LP001091 Male Yes 1 Graduate
## 36 LP001106 Male Yes 0 Graduate No
## 37 LP001109 Male Yes 0 Graduate No
## 43 LP001123 Male Yes 0 Graduate No
## 45 LP001136 Male Yes 0 Not Graduate Yes
## 46 LP001137 Female No 0 Graduate No
## 64 LP001213 Male Yes 1 Graduate No
## 74 LP001250 Male Yes 3+ Not Graduate No
## 80 LP001264 Male Yes 3+ Not Graduate Yes
## 82 LP001266 Male Yes 1 Graduate Yes
## 84 LP001273 Male Yes 0 Graduate No
## 87 LP001280 Male Yes 2 Not Graduate No
## 96 LP001326 Male No 0 Graduate
## 103 LP001350 Male Yes Graduate No
## 104 LP001356 Male Yes 0 Graduate No
## 113 LP001391 Male Yes 0 Not Graduate No
## 114 LP001392 Female No 1 Graduate Yes
## 118 LP001405 Male Yes 1 Graduate No
## 126 LP001443 Female No 0 Graduate No
## 128 LP001449 Male No 0 Graduate No
## 130 LP001465 Male Yes 0 Graduate No
## 131 LP001469 Male No 0 Graduate Yes
## 157 LP001541 Male Yes 1 Graduate No
## 166 LP001574 Male Yes 0 Graduate No
## 182 LP001634 Male No 0 Graduate No
## 188 LP001643 Male Yes 0 Graduate No
## 198 LP001669 Female No 0 Not Graduate No
## 199 LP001671 Female Yes 0 Graduate No
## 203 LP001682 Male Yes 3+ Not Graduate No
## 220 LP001734 Female Yes 2 Graduate No
## 224 LP001749 Male Yes 0 Graduate No
## 233 LP001770 Male No 0 Not Graduate No
## 237 LP001786 Male Yes 0 Graduate
## 238 LP001788 Female No 0 Graduate Yes
## 260 LP001864 Male Yes 3+ Not Graduate No
## 261 LP001865 Male Yes 1 Graduate No
## 280 LP001908 Female Yes 0 Not Graduate No
## 285 LP001922 Male Yes 0 Graduate No
## 306 LP001990 Male No 0 Not Graduate No
## 310 LP001998 Male Yes 2 Not Graduate No
## 314 LP002008 Male Yes 2 Graduate Yes
## 318 LP002036 Male Yes 0 Graduate No
## 319 LP002043 Female No 1 Graduate No
## 323 LP002054 Male Yes 2 Not Graduate No
## 324 LP002055 Female No 0 Graduate No
## 336 LP002106 Male Yes Graduate Yes
## 339 LP002113 Female No 3+ Not Graduate No
## 349 LP002137 Male Yes 0 Graduate No
## 364 LP002178 Male Yes 0 Graduate No
## 368 LP002188 Male No 0 Graduate No
## 378 LP002223 Male Yes 0 Graduate No
## 388 LP002243 Male Yes 0 Not Graduate No
## 393 LP002263 Male Yes 0 Graduate No
## 396 LP002272 Male Yes 2 Graduate No
## 412 LP002319 Male Yes 0 Graduate
## 422 LP002357 Female No 0 Not Graduate No
## 424 LP002362 Male Yes 1 Graduate No
## 436 LP002393 Female Graduate No
## 438 LP002401 Male Yes 0 Graduate No
## 445 LP002424 Male Yes 0 Graduate No
## 450 LP002444 Male No 1 Not Graduate Yes
## 452 LP002447 Male Yes 2 Not Graduate No
## 461 LP002478 Yes 0 Graduate Yes
## 474 LP002522 Female No 0 Graduate Yes
## 480 LP002533 Male Yes 2 Graduate No
## 491 LP002560 Male No 0 Not Graduate No
## 492 LP002562 Male Yes 1 Not Graduate No
## 498 LP002588 Male Yes 0 Graduate No
## 504 LP002618 Male Yes 1 Not Graduate No
## 507 LP002624 Male Yes 0 Graduate No
## 525 LP002697 Male No 0 Graduate No
## 531 LP002717 Male Yes 0 Graduate No
## 534 LP002729 Male No 1 Graduate No
## 545 LP002757 Female Yes 0 Not Graduate No
## 551 LP002778 Male Yes 2 Graduate Yes
## 552 LP002784 Male Yes 1 Not Graduate No
## 557 LP002794 Female No 0 Graduate No
## 566 LP002833 Male Yes 0 Not Graduate No
## 584 LP002898 Male Yes 1 Graduate No
## 601 LP002949 Female No 3+ Graduate
## 606 LP002960 Male Yes 0 Not Graduate No
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1 5849 0 NA 360
## 17 3596 0 100 240
## 20 2600 3500 115 NA
## 25 3717 2925 151 360
## 31 4166 3369 201 360
## 36 2275 2067 NA 360
## 37 1828 1330 100 NA
## 43 2400 0 75 360
## 45 4695 0 96 NA
## 46 3410 0 88 NA
## 64 4945 0 NA 360
## 74 4755 0 95 NA
## 80 3333 2166 130 360
## 82 2395 0 NA 360
## 84 6000 2250 265 360
## 87 3333 2000 99 360
## 96 6782 0 NA 360
## 103 13650 0 NA 360
## 104 4652 3583 NA 360
## 113 3572 4114 152 NA
## 114 7451 0 NA 360
## 118 2214 1398 85 360
## 126 3692 0 93 360
## 128 3865 1640 NA 360
## 130 6080 2569 182 360
## 131 20166 0 650 480
## 157 6000 0 160 360
## 166 3707 3166 182 NA
## 182 1916 5063 67 360
## 188 2383 2138 58 360
## 198 1907 2365 120 NA
## 199 3416 2816 113 360
## 203 3992 0 NA 180
## 220 4283 2383 127 360
## 224 7578 1010 175 NA
## 233 3189 2598 120 NA
## 237 5746 0 255 360
## 238 3463 0 122 360
## 260 4931 0 128 360
## 261 6083 4250 330 360
## 280 4100 0 124 360
## 285 20667 0 NA 360
## 306 2000 0 NA 360
## 310 7667 0 185 360
## 314 5746 0 144 84
## 318 2058 2134 88 360
## 319 3541 0 112 360
## 323 3601 1590 NA 360
## 324 3166 2985 132 360
## 336 5503 4490 70 NA
## 339 1830 0 NA 360
## 349 6333 4583 259 360
## 364 3013 3033 95 300
## 368 5124 0 124 NA
## 378 4310 0 130 360
## 388 3010 3136 NA 360
## 393 2583 2115 120 360
## 396 3276 484 135 360
## 412 6256 0 160 360
## 422 2720 0 80 NA
## 424 7250 1667 110 NA
## 436 10047 0 NA 240
## 438 2213 1125 NA 360
## 445 7333 8333 175 300
## 450 2769 1542 190 360
## 452 1958 1456 60 300
## 461 2083 4083 160 360
## 474 2500 0 93 360
## 480 2947 1603 NA 360
## 491 2699 2785 96 360
## 492 5333 1131 186 360
## 498 4625 2857 111 12
## 504 4050 5302 138 360
## 507 20833 6667 480 360
## 525 4680 2087 NA 360
## 531 1025 5500 216 360
## 534 11250 0 196 360
## 545 3017 663 102 360
## 551 6633 0 NA 360
## 552 2492 2375 NA 360
## 557 2667 1625 84 360
## 566 4467 0 120 360
## 584 1880 0 61 360
## 601 416 41667 350 180
## 606 2400 3800 NA 180
## Credit_History Property_Area Loan_Status
## 1 1 Urban Y
## 17 NA Urban Y
## 20 1 Urban Y
## 25 NA Semiurban N
## 31 NA Urban N
## 36 1 Urban Y
## 37 0 Urban N
## 43 NA Urban Y
## 45 1 Urban Y
## 46 1 Urban Y
## 64 0 Rural N
## 74 0 Semiurban N
## 80 NA Semiurban Y
## 82 1 Semiurban Y
## 84 NA Semiurban N
## 87 NA Semiurban Y
## 96 NA Urban N
## 103 1 Urban Y
## 104 1 Semiurban Y
## 113 0 Rural N
## 114 1 Semiurban Y
## 118 NA Urban Y
## 126 NA Rural Y
## 128 1 Rural Y
## 130 NA Rural N
## 131 NA Urban Y
## 157 NA Rural Y
## 166 1 Rural Y
## 182 NA Rural N
## 188 NA Rural Y
## 198 1 Urban Y
## 199 NA Semiurban Y
## 203 1 Urban N
## 220 NA Semiurban Y
## 224 1 Semiurban Y
## 233 1 Rural Y
## 237 NA Urban N
## 238 NA Urban Y
## 260 NA Semiurban N
## 261 NA Urban Y
## 280 NA Rural Y
## 285 1 Rural N
## 306 1 Urban N
## 310 NA Rural Y
## 314 NA Rural Y
## 318 NA Urban Y
## 319 NA Semiurban Y
## 323 1 Rural Y
## 324 NA Rural Y
## 336 1 Semiurban Y
## 339 0 Urban N
## 349 NA Semiurban Y
## 364 NA Urban Y
## 368 0 Rural N
## 378 NA Semiurban Y
## 388 0 Urban N
## 393 NA Urban Y
## 396 NA Semiurban Y
## 412 NA Urban Y
## 422 0 Urban N
## 424 0 Urban N
## 436 1 Semiurban Y
## 438 1 Urban Y
## 445 NA Rural Y
## 450 NA Semiurban N
## 452 NA Urban Y
## 461 NA Semiurban Y
## 474 NA Urban Y
## 480 1 Urban N
## 491 NA Semiurban Y
## 492 NA Urban Y
## 498 NA Urban Y
## 504 NA Rural N
## 507 NA Urban Y
## 525 1 Semiurban N
## 531 NA Rural Y
## 534 NA Semiurban N
## 545 NA Semiurban Y
## 551 0 Rural N
## 552 1 Rural Y
## 557 NA Urban Y
## 566 NA Rural Y
## 584 NA Rural N
## 601 NA Urban N
## 606 1 Urban N
# Keep only rows without missing values (removes 85 rows)
my_loan_data <- my_loan_data[complete.cases(my_loan_data), ]
dim(my_loan_data)
## [1] 529 13
# Create a numeric column trg from Loan_Status (Y = 1, N = 0)
my_loan_data <- my_loan_data %>%
  mutate(trg = ifelse(Loan_Status == 'Y', 1, 0))
# Remove the original Loan_Status column
my_loan_data <- subset(my_loan_data, select = -Loan_Status)
# Rename the last column (trg) to Loan_Status
colnames(my_loan_data)[13] <- 'Loan_Status'
# Convert character columns to factors (numeric columns stay numeric)
my_loan_data <- as.data.frame(unclass(my_loan_data),
                              stringsAsFactors = TRUE)
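The recoding steps above (add `trg`, drop `Loan_Status`, rename by position) can also be collapsed into a single assignment that overwrites the column in place, which avoids depending on the column being in position 13. A minimal sketch on a made-up data frame (`toy` is a hypothetical stand-in, not the loan data):

```r
# Hypothetical toy data standing in for my_loan_data
toy <- data.frame(Loan_ID = c("A1", "A2", "A3"),
                  Loan_Status = c("Y", "N", "Y"))

# Overwrite Loan_Status in place: "Y" becomes 1, anything else becomes 0
toy$Loan_Status <- ifelse(toy$Loan_Status == "Y", 1, 0)

toy$Loan_Status
```

The column keeps its name and position, so no rename step is needed afterwards.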
Loan_Status is the dependent variable. It is a categorical variable that records whether the loan was approved, coded above as 1 for yes and 0 for no.
There are several candidate independent variables. I will choose the most appropriate ones after exploratory analysis. Some preliminary candidates are listed below:
–Credit history
–Applicant income
–Applicants with higher education
–Gender of the applicant
–Number of Dependents
–Property area
str(my_loan_data)
## 'data.frame': 529 obs. of 13 variables:
## $ Loan_ID : Factor w/ 529 levels "LP001003","LP001005",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 3 2 2 2 4 2 5 4 3 4 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 2 1 1 2 1 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 2 2 2 2 2 ...
## $ ApplicantIncome : int 4583 3000 2583 6000 5417 2333 3036 4006 12841 3200 ...
## $ CoapplicantIncome: num 1508 0 2358 0 4196 ...
## $ LoanAmount : int 128 66 120 141 267 95 158 168 349 70 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 0 1 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 1 3 3 3 3 3 2 3 2 3 ...
## $ Loan_Status : num 0 1 1 1 1 1 0 1 0 1 ...
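The `str()` output above shows an empty-string level `""` in Gender, Married, Dependents and Self_Employed. Blank text fields are read as empty strings rather than `NA`, so `complete.cases()` does not flag them. One possible way to treat blanks as missing is the `na.strings` argument of `read.csv()`. The sketch below uses a small inline CSV (made up for illustration) instead of the real file:

```r
# Toy CSV standing in for loan_data.csv; row LP2 has a blank Gender,
# row LP3 has a blank LoanAmount. These values are invented.
csv_text <- paste("Loan_ID,Gender,LoanAmount",
                  "LP1,Male,100",
                  "LP2,,120",
                  "LP3,Female,",
                  sep = "\n")

# Default behaviour: the blank Gender stays "" and is not counted as missing
raw <- read.csv(text = csv_text)

# Treat empty strings as NA in every column
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))

sum(!complete.cases(raw))    # rows flagged as incomplete with the defaults
sum(!complete.cases(clean))  # rows flagged once blanks count as NA
```

Applied to the loan data, this would change how many rows are dropped, so the counts reported in this document would no longer match.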
summary(my_loan_data)
## Loan_ID Gender Married Dependents Education
## LP001003: 1 : 12 : 2 : 12 Graduate :421
## LP001005: 1 Female: 95 No :188 0 :295 Not Graduate:108
## LP001006: 1 Male :422 Yes:339 1 : 85
## LP001008: 1 2 : 92
## LP001011: 1 3+: 45
## LP001013: 1
## (Other) :523
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 25 Min. : 150 Min. : 0 Min. : 9.0
## No :434 1st Qu.: 2900 1st Qu.: 0 1st Qu.:100.0
## Yes: 70 Median : 3816 Median : 1086 Median :128.0
## Mean : 5508 Mean : 1542 Mean :145.9
## 3rd Qu.: 5815 3rd Qu.: 2232 3rd Qu.:167.0
## Max. :81000 Max. :33837 Max. :700.0
##
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 36.0 Min. :0.0000 Rural :155 Min. :0.0000
## 1st Qu.:360.0 1st Qu.:1.0000 Semiurban:209 1st Qu.:0.0000
## Median :360.0 Median :1.0000 Urban :165 Median :1.0000
## Mean :342.4 Mean :0.8507 Mean :0.6919
## 3rd Qu.:360.0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :480.0 Max. :1.0000 Max. :1.0000
##
describe(my_loan_data)
## vars n mean sd median trimmed mad min max
## Loan_ID* 1 529 265.00 152.85 265 265.00 195.70 1 529
## Gender* 2 529 2.78 0.47 3 2.87 0.00 1 3
## Married* 3 529 2.64 0.49 3 2.68 0.00 1 3
## Dependents* 4 529 2.74 1.05 2 2.60 0.00 1 5
## Education* 5 529 1.20 0.40 1 1.13 0.00 1 2
## Self_Employed* 6 529 2.09 0.42 2 2.04 0.00 1 3
## ApplicantIncome 7 529 5507.82 6404.13 3816 4346.45 1802.84 150 81000
## CoapplicantIncome 8 529 1542.39 2524.30 1086 1118.17 1610.10 0 33837
## LoanAmount 9 529 145.85 84.11 128 133.26 45.96 9 700
## Loan_Amount_Term 10 529 342.35 64.86 360 358.31 0.00 36 480
## Credit_History 11 529 0.85 0.36 1 0.94 0.00 0 1
## Property_Area* 12 529 2.02 0.78 2 2.02 1.48 1 3
## Loan_Status 13 529 0.69 0.46 1 0.74 0.00 0 1
## range skew kurtosis se
## Loan_ID* 528 0.00 -1.21 6.65
## Gender* 2 -1.95 3.03 0.02
## Married* 2 -0.67 -1.31 0.02
## Dependents* 4 0.86 -0.49 0.05
## Education* 1 1.46 0.14 0.02
## Self_Employed* 2 0.56 2.31 0.02
## ApplicantIncome 80850 6.43 56.78 278.44
## CoapplicantIncome 33837 5.96 60.12 109.75
## LoanAmount 691 2.59 9.94 3.66
## Loan_Amount_Term 444 -2.26 6.06 2.82
## Credit_History 1 -1.96 1.85 0.02
## Property_Area* 2 -0.03 -1.35 0.03
## Loan_Status 1 -0.83 -1.32 0.02
hist(my_loan_data$ApplicantIncome, col="lightblue")
hist(my_loan_data$CoapplicantIncome,col="yellow")
hist(my_loan_data$LoanAmount,col="red")
Applicant incomes range from 150 to 81,000, with the majority below 20,000.
Coapplicant incomes range from 0 (no coapplicant) to 33,837, with the majority below 10,000.
Loan amounts range from 9 to 700, with the majority falling between 100 and 200.
Most applicants are male and work for a company rather than being self-employed.
All IDs are unique and randomly allotted. They have no impact on Loan_Status and can be dropped.
summary(my_loan_data$Property_Area)
## Rural Semiurban Urban
## 155 209 165
# Bar chart of counts per Property_Area, one panel per Loan_Status value.
# Bare column names and geom_bar() avoid the warnings that
# aes(my_loan_data$...) and geom_histogram(stat = "count") produce.
ggplot(data = my_loan_data, aes(x = Property_Area)) +
  geom_bar(col = "blue", fill = "lightblue") +
  facet_grid(~ Loan_Status) +
  scale_x_discrete()
The Property_Area plot shows that more loans are approved in Semiurban areas than in Rural or Urban areas.
summary(my_loan_data$CoapplicantIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1086 1542 2232 33837
# Histogram of CoapplicantIncome, one panel per Loan_Status value.
# Bare column names avoid the aes(my_loan_data$...) warning.
ggplot(data = my_loan_data, aes(x = CoapplicantIncome)) +
  geom_histogram(col = "yellow", fill = "pink", bins = 15) +
  facet_grid(~ Loan_Status) +
  theme_bw()
The histogram shows that most applications come with a low coapplicant income,
and that the number of loan rejections is also highest in the lowest income segment.
summary(my_loan_data$Education)
## Graduate Not Graduate
## 421 108
# Bar chart of counts per Education level, one panel per Loan_Status value
ggplot(data = my_loan_data, aes(x = Education)) +
  geom_bar(col = "lightgreen", fill = "blue") +
  facet_grid(~ Loan_Status) +
  scale_x_discrete() +
  theme_bw()
The loan approval rate is higher for graduates than for non-graduates.
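The faceted plots show counts, so group sizes and approval rates are mixed together. A rate per group can be read more directly from a row-wise proportion table. A minimal sketch on invented data (`toy` and its values are hypothetical, not taken from the loan file):

```r
# Hypothetical toy data: 3 graduates (2 approved), 2 non-graduates (1 approved)
toy <- data.frame(
  Education = c("Graduate", "Graduate", "Graduate",
                "Not Graduate", "Not Graduate"),
  Loan_Status = c(1, 1, 0, 1, 0)
)

# Cross-tabulate, then convert each row to proportions (margin = 1)
counts <- table(toy$Education, toy$Loan_Status)
rates <- prop.table(counts, margin = 1)

# Column "1" holds the approval rate for each education level
rates[, "1"]
```

The same two calls should work on `my_loan_data` with any of the categorical predictors in place of Education.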
summary(my_loan_data$Dependents)
## 0 1 2 3+
## 12 295 85 92 45
# Bar chart of counts per Dependents level, one panel per Loan_Status value
ggplot(data = my_loan_data, aes(x = Dependents)) +
  geom_bar(col = "lightyellow", fill = "lightgreen") +
  facet_grid(~ Loan_Status) +
  scale_x_discrete() +
  theme_bw()
Applicants with no dependents have the highest counts of both loan approvals and rejections.
summary(my_loan_data$Gender)
## Female Male
## 12 95 422
# Bar chart of counts per Gender, one panel per Loan_Status value
ggplot(data = my_loan_data, aes(x = Gender)) +
  geom_bar(col = "lightgrey", fill = "lightblue") +
  facet_grid(~ Loan_Status) +
  scale_x_discrete() +
  theme_bw()
Male applicants have higher counts of both loan approvals and rejections than female applicants.
# Drop Loan_ID (column 1)
my_loan_data_1 <- my_loan_data[,2:13]
# 80/20 train/test split that preserves the Loan_Status class ratio
ind <- sample.split(Y=my_loan_data_1$Loan_Status, SplitRatio=0.8)
traindf <- my_loan_data_1[ind,]
testdf <- my_loan_data_1[!ind,]
LRmodel <- glm(Loan_Status~., traindf, family = "binomial")
summary(LRmodel)
##
## Call:
## glm(formula = Loan_Status ~ ., family = "binomial", data = traindf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3815 -0.3506 0.4698 0.6857 2.4521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.085e+01 6.200e+02 0.018 0.98604
## GenderFemale 5.232e-01 8.826e-01 0.593 0.55333
## GenderMale 9.307e-01 8.272e-01 1.125 0.26054
## MarriedNo -1.344e+01 6.200e+02 -0.022 0.98270
## MarriedYes -1.312e+01 6.200e+02 -0.021 0.98312
## Dependents0 3.713e-01 1.046e+00 0.355 0.72264
## Dependents1 1.861e-01 1.073e+00 0.173 0.86230
## Dependents2 8.196e-01 1.078e+00 0.760 0.44708
## Dependents3+ 5.459e-01 1.136e+00 0.480 0.63093
## EducationNot Graduate -5.991e-01 3.217e-01 -1.862 0.06258 .
## Self_EmployedNo -4.410e-01 6.087e-01 -0.724 0.46883
## Self_EmployedYes -8.157e-01 6.885e-01 -1.185 0.23611
## ApplicantIncome 1.535e-05 2.760e-05 0.556 0.57802
## CoapplicantIncome -5.288e-05 4.487e-05 -1.178 0.23865
## LoanAmount -1.579e-03 2.008e-03 -0.787 0.43149
## Loan_Amount_Term -3.050e-03 2.628e-03 -1.161 0.24579
## Credit_History 3.921e+00 4.817e-01 8.140 3.95e-16 ***
## Property_AreaSemiurban 1.239e+00 3.363e-01 3.684 0.00023 ***
## Property_AreaUrban 3.887e-01 3.296e-01 1.180 0.23816
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.94 on 422 degrees of freedom
## Residual deviance: 365.73 on 404 degrees of freedom
## AIC: 403.73
##
## Number of Fisher Scoring iterations: 13
Based on the z values and p-values above, the statistically significant variables are
Credit_History and Property_AreaSemiurban.
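The estimates in the coefficient table are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies the two significant estimates from the output above. With the fitted model in memory, `exp(coef(LRmodel))` would be the equivalent for all terms.

```r
# Estimates copied from the summary(LRmodel) output above (log-odds scale)
log_odds <- c(Credit_History = 3.921,
              Property_AreaSemiurban = 1.239)

# Exponentiate to get odds ratios
odds_ratio <- exp(log_odds)

round(odds_ratio, 1)
```

Under this model, a credit history of 1 multiplies the odds of approval by roughly 50 and a Semiurban property by roughly 3.5 (relative to the reference level, Rural), holding the other predictors fixed.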
res<-predict(LRmodel,testdf,type="response")
res
## 2 10 17 24 40 46 47
## 0.79794269 0.89624863 0.66451209 0.81633239 0.83752733 0.94938687 0.92082622
## 51 54 58 59 65 69 74
## 0.81069083 0.10777463 0.86219550 0.10496769 0.73341144 0.81832210 0.90442186
## 78 79 83 88 93 100 105
## 0.90066196 0.92292349 0.90800481 0.88236807 0.91622230 0.87221921 0.08165033
## 108 110 117 118 127 130 131
## 0.95380157 0.88337976 0.91200815 0.91666153 0.77537480 0.29442100 0.81339857
## 134 136 137 141 146 153 154
## 0.86718474 0.15967917 0.83255416 0.11732794 0.89580946 0.05371221 0.95933721
## 155 156 158 159 162 164 165
## 0.92959225 0.89417772 0.04969927 0.39583721 0.90542864 0.90661879 0.72504606
## 166 168 169 171 175 184 188
## 0.89228838 0.75889991 0.83724720 0.69110289 0.58322439 0.75241390 0.94747396
## 189 190 193 197 198 200 201
## 0.88126234 0.92760280 0.94156327 0.91669401 0.83368379 0.72745169 0.59475517
## 211 214 215 217 223 256 261
## 0.86506224 0.79558641 0.47048645 0.05388181 0.92675016 0.74497613 0.63541592
## 284 293 294 297 303 323 324
## 0.73954258 0.89696460 0.88349011 0.80046944 0.88637529 0.77852073 0.74126741
## 328 346 354 355 360 373 386
## 0.71539124 0.78135158 0.03380021 0.63919148 0.75209956 0.91247922 0.05303924
## 392 400 409 411 419 428 432
## 0.81064732 0.10496907 0.94954956 0.89069213 0.75267449 0.03581224 0.88337633
## 435 442 445 449 451 456 462
## 0.73370172 0.77333827 0.88208114 0.74487033 0.75549622 0.86861193 0.67495877
## 464 473 476 477 478 485 487
## 0.90986704 0.77746170 0.92127151 0.90038903 0.91046546 0.80599164 0.74993693
## 489 503 509 510 512 515 521
## 0.78369224 0.84678731 0.94606609 0.70795165 0.78048943 0.11392097 0.81225177
## 525
## 0.65013296
table(Actualvalue=testdf$Loan_Status,Predictedvalue=res>0.5)
## Predictedvalue
## Actualvalue FALSE TRUE
## 0 13 20
## 1 3 70
(13+70)/(13+70+20+3)
## [1] 0.7830189
Accuracy: 78%
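Accuracy alone hides how the errors are distributed across the two classes. As a rough sketch, the logistic regression confusion matrix above can be rebuilt by hand and split into sensitivity (approved loans predicted as approved) and specificity (rejected loans predicted as rejected). The object name `cm` is introduced here for illustration only.

```r
# Confusion matrix copied from the table above (rows = actual, cols = predicted)
cm <- matrix(c(13, 3, 20, 70), nrow = 2,
             dimnames = list(Actual = c("0", "1"),
                             Predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)               # (13 + 70) / 106
sensitivity <- cm["1", "TRUE"]  / sum(cm["1", ])     # 70 / 73
specificity <- cm["0", "FALSE"] / sum(cm["0", ])     # 13 / 33

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

On these counts the model identifies most approved loans but only 13 of the 33 rejected ones, which the single accuracy figure does not reveal.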
set.seed(42)
sample <- sample.int(n = nrow(my_loan_data_1), size = floor(.70*nrow(my_loan_data_1)), replace = F)
trainnew <- my_loan_data_1[sample, ]
testnew <- my_loan_data_1[-sample, ]
# Note: the tree is fit on traindf (the earlier 80% split), not on trainnew
dtree <- rpart(Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area + LoanAmount +
                 ApplicantIncome, method="class", data=traindf,parms=list(split="information"))
dtree$cptable
## CP nsplit rel error xerror xstd
## 1 0.40769231 0 1.0000000 1.0000000 0.07299480
## 2 0.01153846 1 0.5923077 0.5923077 0.06104778
## 3 0.01000000 4 0.5538462 0.6538462 0.06339489
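A complexity parameter for pruning can also be read from the table programmatically instead of being typed in by hand. One common convention, sketched here on the `kyphosis` data that ships with rpart (not the loan data), is to take the CP value of the row with the lowest cross-validated error `xerror`:

```r
library(rpart)

set.seed(42)  # xerror comes from cross-validation, so it depends on the seed

# kyphosis is a small example dataset bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Pick the CP value of the row with the smallest cross-validated error
cp_table <- fit$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the tree back to that complexity
pruned <- prune(fit, cp = best_cp)
```

In the table printed above, the lowest `xerror` is in row 2, so this rule would select a CP of 0.01153846 for `dtree`.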
plotcp(dtree)
dtree.pruned <- prune(dtree, cp=.02290076)
library(rpart.plot)
prp(dtree.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")
dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
                    dnn=c("Actual", "Predicted"))
dtree.perf
## Predicted
## Actual 0 1
## 0 49 64
## 1 5 252
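Now use the test data. Note that the code below fits a new tree to testnew and evaluates it on the same rows, so the resulting accuracy is an in-sample figure rather than a held-out estimate.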
dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LoanAmount+
ApplicantIncome,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
## CP nsplit rel error xerror xstd
## 1 0.42 0 1.00 1.00 0.11709266
## 2 0.01 1 0.58 0.58 0.09738725
plotcp(dtree_test)
dtree_test.pruned <- prune(dtree_test, cp=.01639344)
prp(dtree_test.pruned, type = 2, extra = 104,
    fallen.leaves = TRUE, main="Decision Tree")
dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred, dnn=c("Actual", "Predicted"))
dtree_test.perf
## Predicted
## Actual 0 1
## 0 23 27
## 1 2 107
Accuracy: 82% ((23 + 107) / 159 = 0.818, from the table above)
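For comparison, a held-out evaluation fits the tree on the training rows only and scores it on rows the tree has never seen. The sketch below shows that pattern on the `kyphosis` data bundled with rpart. The 70/30 split mirrors the one used in this document, and all object names are illustrative.

```r
library(rpart)

set.seed(42)

# 70/30 split of the example data
n <- nrow(kyphosis)
train_idx <- sample.int(n, size = floor(0.7 * n))
train <- kyphosis[train_idx, ]
test  <- kyphosis[-train_idx, ]

# Fit on the training rows only
fit <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")

# Score the same fitted tree on the unseen test rows
pred <- predict(fit, newdata = test, type = "class")
cm <- table(Actual = test$Kyphosis, Predicted = pred)

held_out_accuracy <- sum(diag(cm)) / sum(cm)
held_out_accuracy
```

For the loan data, the analogous step would be to call `predict()` on `testnew` with a tree fitted to the training rows, rather than fitting a second tree to `testnew`.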
trainnew <- mutate_if(trainnew, is.character, as.factor)
testnew <- mutate_if(testnew, is.character, as.factor)
str(trainnew)
## 'data.frame': 370 obs. of 12 variables:
## $ Gender : Factor w/ 3 levels "","Female","Male": 2 3 3 3 3 3 3 2 3 2 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 3 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 5 2 3 4 2 4 2 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 1 1 1 2 1 2 2 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ ApplicantIncome : int 2764 6400 5695 4333 5708 8080 2281 2423 4226 2149 ...
## $ CoapplicantIncome: num 1459 7250 4167 1811 5625 ...
## $ LoanAmount : int 110 180 175 160 187 180 113 130 110 178 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 0 1 0 1 1 1 1 1 0 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 2 3 2 3 1 2 3 2 ...
## $ Loan_Status : num 1 0 1 1 1 1 0 1 1 0 ...
set.seed(42)
fit.forest <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LoanAmount+
ApplicantIncome, data=trainnew,
na.action=na.roughfix,
importance=TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
fit.forest
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area + LoanAmount + ApplicantIncome, data = trainnew, importance = TRUE, na.action = na.roughfix)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1532389
## % Var explained: 27.76
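The warning and the "Type of random forest: regression" line appear because Loan_Status is numeric. If a classification forest is wanted, one option is to convert the response to a factor before fitting. The sketch below uses invented toy data and only runs the forest if the randomForest package is installed. The labels "N" and "Y" are an assumption chosen to mirror the original coding.

```r
# Hypothetical toy data with a numeric 0/1 response
toy <- data.frame(
  Credit_History = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1),
  LoanAmount     = c(128, 66, 120, 141, 267, 95, 158, 168, 349, 70),
  Loan_Status    = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1)
)

# A factor response makes randomForest() fit a classification forest
toy$Loan_Status <- factor(toy$Loan_Status, levels = c(0, 1), labels = c("N", "Y"))

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  fit <- randomForest::randomForest(Loan_Status ~ Credit_History + LoanAmount,
                                    data = toy, ntree = 50)
  print(fit$type)  # reports which kind of forest was fitted
}
```

With a classification forest, `predict()` returns class labels by default, and `type = "prob"` returns class probabilities that can be thresholded as in the table further down.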
importance(fit.forest, type=2)
## IncNodePurity
## Credit_History 21.070826
## Education 1.547224
## Self_Employed 2.261059
## Property_Area 3.915217
## LoanAmount 12.623137
## ApplicantIncome 12.353369
forest.pred <- predict(fit.forest, testnew)
forest.pred
## 7 8 11 13 15 18 19
## 0.18606571 0.84736805 0.90641530 0.86812300 0.86385154 0.15143880 0.63527671
## 21 22 23 26 29 39 44
## 0.02503413 0.77682193 0.95482886 0.94075782 0.93379921 0.09528321 0.81479861
## 50 52 56 58 59 64 65
## 0.86186664 0.84827143 0.06644799 0.72425266 0.12150380 0.76722260 0.85302685
## 67 69 71 77 80 81 83
## 0.22927017 0.66189477 0.93808185 0.81191621 0.94322408 0.87068374 0.89618627
## 87 99 101 104 117 119 125
## 0.81518598 0.57208155 0.07689353 0.38116725 0.84412748 0.82910943 0.07243296
## 129 132 134 137 140 143 144
## 0.85471395 0.88044784 0.94244478 0.81404207 0.85935665 0.76344657 0.48027126
## 145 151 154 155 164 168 175
## 0.53469707 0.96251593 0.78286069 0.70324899 0.87845452 0.74736243 0.72767164
## 176 177 179 184 190 195 199
## 0.86712947 0.91999354 0.04911328 0.81178215 0.93714782 0.57345324 0.93441907
## 200 204 205 209 211 216 218
## 0.51228080 0.87853953 0.88237628 0.73530632 0.95339297 0.51774225 0.65730188
## 222 223 231 232 234 235 236
## 0.90590670 0.93809181 0.82622192 0.86775881 0.92269132 0.90809510 0.95689061
## 237 244 249 253 256 260 263
## 0.83164763 0.85649768 0.84315649 0.73647578 0.81610847 0.96443344 0.85261580
## 264 266 270 273 278 286 289
## 0.77118010 0.39335017 0.88251789 0.79942445 0.13002499 0.77731525 0.83247778
## 291 293 305 306 307 309 310
## 0.86316058 0.81567545 0.67244056 0.78155327 0.03315355 0.85109600 0.86028908
## 319 320 323 326 327 331 335
## 0.67394184 0.06517938 0.75014035 0.92752656 0.58128036 0.82912828 0.72063540
## 336 338 342 352 354 362 363
## 0.52411056 0.79369728 0.10451982 0.28144606 0.17437228 0.81688706 0.83412806
## 364 370 376 378 379 380 382
## 0.82180217 0.84517657 0.06454731 0.87788340 0.69482882 0.87674441 0.74987929
## 383 384 386 387 392 395 397
## 0.48864629 0.81231812 0.10942573 0.10288116 0.82999229 0.88683508 0.82723184
## 404 407 411 413 414 417 418
## 0.80543375 0.64422213 0.84018241 0.65319735 0.88837251 0.71647980 0.14181460
## 423 424 425 426 427 428 436
## 0.65562933 0.06052029 0.93714018 0.72484760 0.70625758 0.12495368 0.85971967
## 445 446 449 454 456 462 463
## 0.91176845 0.57748037 0.59574787 0.94035504 0.51922286 0.58880613 0.76161453
## 466 469 475 477 483 488 492
## 0.90664380 0.87212338 0.06686548 0.75970351 0.70903837 0.04781336 0.73129931
## 493 496 497 500 501 502 505
## 0.93911616 0.83333321 0.90949124 0.85027991 0.92559482 0.13719139 0.82114098
## 512 516 517 520 522
## 0.81000331 0.70759094 0.87966150 0.62261016 0.76738305
table(Actualvalue=testnew$Loan_Status,Predictedvalue=forest.pred>0.5)
## Predictedvalue
## Actualvalue FALSE TRUE
## 0 24 26
## 1 5 104
(104+24)/(104+24+26+5)
## [1] 0.8050314
To calculate accuracy, use the following formula: (TP+TN)/(TP+TN+FP+FN).
Accuracy: 80.5%
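The formula can be wrapped in a small helper and applied to the counts from all three confusion matrices in this document, treating approval (1) as the positive class. The function name `accuracy` and the vector `acc` are introduced here for illustration.

```r
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy <- function(tp, tn, fp, fn) (tp + tn) / (tp + tn + fp + fn)

# Counts read off the three confusion matrices printed in this document
acc <- c(
  logistic_regression = accuracy(tp = 70,  tn = 13, fp = 20, fn = 3),
  decision_tree       = accuracy(tp = 107, tn = 23, fp = 27, fn = 2),
  random_forest       = accuracy(tp = 104, tn = 24, fp = 26, fn = 5)
)

round(acc, 3)
```

Note that the three figures are not strictly comparable: the logistic regression uses a different test split, and the decision tree counts come from a tree fitted to the same rows it is scored on.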
To summarize our findings: Credit_History and Property_AreaSemiurban are the two most significant variables for predicting the loan application outcome. Dream Housing Company should target customers with a good credit history (Credit_History = 1) and customers who live in Semiurban areas.
As for model accuracy:
78% accuracy for logistic regression
82% accuracy for the decision tree ((23 + 107) / 159 from its confusion matrix, scored on the same rows it was fitted to)
80.5% accuracy for random forest.
Limitations
The dataset is relatively small. A larger dataset would likely help improve model accuracy.
Also, the applicants in the dataset used to build the models are mostly low-income, male, and working for a company. It would be interesting to include more female, high-income, and self-employed applicants to build better models.