About Company
Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. The customer first applies for a home loan after that company validates the customer’s eligibility for a loan.
The company wants to automate the loan eligibility process (real-time) based on customer detail provided while filling out the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have given a problem identifying the customer segments eligible for loan amounts to target these customers specifically.
This dataset is from Analytics Vidhya’s Competition.
# Import packages
pacman::p_load(tidyverse, gridExtra, circlize, AICcmodavg, RColorBrewer, dplyr, EnvStats, GGally)
loan.train <- read.csv('loan_sanction_train.csv', header = TRUE)
loan.test <- read.csv('loan_sanction_test.csv', header = TRUE)
summary(loan.train)
## Loan_ID Gender Married Dependents
## Length:614 Length:614 Length:614 Length:614
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Education Self_Employed ApplicantIncome CoapplicantIncome
## Length:614 Length:614 Min. : 150 Min. : 0
## Class :character Class :character 1st Qu.: 2878 1st Qu.: 0
## Mode :character Mode :character Median : 3812 Median : 1188
## Mean : 5403 Mean : 1621
## 3rd Qu.: 5795 3rd Qu.: 2297
## Max. :81000 Max. :41667
##
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## Min. : 9.0 Min. : 12 Min. :0.0000 Length:614
## 1st Qu.:100.0 1st Qu.:360 1st Qu.:1.0000 Class :character
## Median :128.0 Median :360 Median :1.0000 Mode :character
## Mean :146.4 Mean :342 Mean :0.8422
## 3rd Qu.:168.0 3rd Qu.:360 3rd Qu.:1.0000
## Max. :700.0 Max. :480 Max. :1.0000
## NA's :22 NA's :14 NA's :50
## Loan_Status
## Length:614
## Class :character
## Mode :character
##
##
##
##
glimpse(loan.train)
## Rows: 614
## Columns: 13
## $ Loan_ID <chr> "LP001002", "LP001003", "LP001005", "LP001006", "LP0…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Married <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"…
## $ Dependents <chr> "0", "1", "0", "0", "0", "2", "0", "3+", "2", "1", "…
## $ Education <chr> "Graduate", "Graduate", "Graduate", "Not Graduate", …
## $ Self_Employed <chr> "No", "No", "Yes", "No", "No", "Yes", "No", "No", "N…
## $ ApplicantIncome <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, …
## $ LoanAmount <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1…
## $ Loan_Amount_Term <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, …
## $ Property_Area <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban"…
## $ Loan_Status <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y…
cols <- list('Gender','Married', 'Dependents','Education','Self_Employed','Credit_History','Property_Area')
for (col in cols){
uni <- list(unique(loan.train[[col]]))
print(paste0('The unique values in the column ',col, ' are: ', uni))
}
## [1] "The unique values in the column Gender are: c(\"Male\", \"Female\", \"\")"
## [1] "The unique values in the column Married are: c(\"No\", \"Yes\", \"\")"
## [1] "The unique values in the column Dependents are: c(\"0\", \"1\", \"2\", \"3+\", \"\")"
## [1] "The unique values in the column Education are: c(\"Graduate\", \"Not Graduate\")"
## [1] "The unique values in the column Self_Employed are: c(\"No\", \"Yes\", \"\")"
## [1] "The unique values in the column Credit_History are: c(1, 0, NA)"
## [1] "The unique values in the column Property_Area are: c(\"Urban\", \"Rural\", \"Semiurban\")"
coappinc_table <- table(loan.train$CoapplicantIncome)
print(paste0('The number of observations with zero value is: ',coappinc_table[names(coappinc_table) == 0]))
## [1] "The number of observations with zero value is: 273"
From the initial summary statistic we can see that there are some missing values in the dataset, the unique values contain empty characters and most of them are in the wrong format. The Coapplicant Income also contains many zero values, we replace all of these values with NA and then using a backward filling to impute some real data. We have to make sure that there are no blank values either,hence counting all the missing values by column including blank ones.
sapply(loan.train, function(x) sum(is.na(x) | x == '' | x== ' '))
## Loan_ID Gender Married Dependents
## 0 13 3 15
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 32 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 22 14 50 0
## Loan_Status
## 0
sapply(loan.test, function(x) sum(is.na(x) | x == '' | x== ' '))
## Loan_ID Gender Married Dependents
## 0 11 0 10
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 23 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 5 6 29 0
We replace the missing values in the Gender category with the *most frequent gender.
The Married variable has three missing values that we replace with ‘No’.
Missing values from the Dependents are replaced with ‘0’, assuming it was left empty as there are no dependents.
From the Self Employed category the values are replaced with ‘No’.
We replace the missing values in the Loan Amount with the average loan amount based on the assumption that it is a mandatory value for a loan application hence can’t be zero.
The Loan Amount Term is replaced with the most frequently appearing term.
The missing values from Credit History are replaced with ‘0’, meaning that there is no history available for that client.
Removing the first column from both data set (Loan_ID) as we don’t need it.
getMode <- function(x){
uni <- unique(x)
uni[which.max(tabulate(match(x, uni)))]
}
loan.train$Gender[loan.train$Gender == '' | loan.train$Gender == ' '] <- getMode(loan.train$Gender)
loan.train$Married[loan.train$Married == '' | loan.train$Married == ' '] <- 'No'
loan.train$Dependents[loan.train$Dependents == '' | loan.train$Dependents == ' '] <- 0
loan.train$Self_Employed[loan.train$Self_Employed == '' | loan.train$Self_Employed == ' '] <- 'No'
loan.train$LoanAmount[is.na(loan.train$LoanAmount)] <- mean(loan.train$LoanAmount, na.rm = TRUE)
loan.train$Loan_Amount_Term[is.na(loan.train$Loan_Amount_Term)] <- getMode(loan.train$Loan_Amount_Term)
loan.train$Credit_History[is.na(loan.train$Credit_History)] <- 0
loan.test$Gender[loan.test$Gender == '' | loan.test$Gender == ' '] <- getMode(loan.test$Gender)
loan.test$Dependents[loan.test$Dependents == '' | loan.test$Dependents == ' '] <- 0
loan.test$Self_Employed[loan.test$Self_Employed == '' | loan.test$Self_Employed == ' '] <- 'No'
loan.test$LoanAmount[is.na(loan.test$LoanAmount)] <- mean(loan.test$LoanAmount, na.rm = TRUE)
loan.test$Loan_Amount_Term[is.na(loan.test$Loan_Amount_Term)] <- getMode(loan.test$Loan_Amount_Term)
loan.test$Credit_History[is.na(loan.test$Credit_History)] <- 0
loan.train <- subset(loan.train, select = -c(Loan_ID))
loan.test <- subset(loan.test, select = -c(Loan_ID))
sapply(loan.train, function(x) sum(is.na(x) | x == '' | x== ' '))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 0
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 0 0 0 0
sapply(loan.test, function(x) sum(is.na(x) | x == '' | x== ' '))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 0
## Loan_Amount_Term Credit_History Property_Area
## 0 0 0
loan.train$CoapplicantIncome <- replace(loan.train$CoapplicantIncome, loan.train$CoapplicantIncome == 0, NA)
loan.test$CoapplicantIncome <- replace(loan.test$CoapplicantIncome, loan.test$CoapplicantIncome == 0, NA)
loan.train <- loan.train %>% fill(CoapplicantIncome, .direction = 'up')
loan.train <- loan.train %>% fill(CoapplicantIncome, .direction = 'down')
loan.test <- loan.test %>% fill(CoapplicantIncome, .direction = 'up')
loan.test <- loan.test %>% fill(CoapplicantIncome, .direction = 'down')
There are no missing values anymore and the data is now ready for further analysis.
Replacing the values ‘3+’, with ‘3’ keeping the assumption that it means 3 or more dependents.
loan.train$Dependents[loan.train$Dependents > 2] <- 3
loan.test$Dependents[loan.test$Dependents > 2] <- 3
We require that some of the variables to be in a categorical format.
loan.train <- loan.train %>% mutate_at(
c('Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area', 'Loan_Status'), as.factor)
loan.test <- loan.test %>% mutate_at(
c('Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area'), as.factor)
for(col in colnames(loan.train)){
print(paste0('The number of missing values in ', col, 'are: ', sum(is.na(loan.train[[col]]))))
}
## [1] "The number of missing values in Genderare: 0"
## [1] "The number of missing values in Marriedare: 0"
## [1] "The number of missing values in Dependentsare: 0"
## [1] "The number of missing values in Educationare: 0"
## [1] "The number of missing values in Self_Employedare: 0"
## [1] "The number of missing values in ApplicantIncomeare: 0"
## [1] "The number of missing values in CoapplicantIncomeare: 0"
## [1] "The number of missing values in LoanAmountare: 0"
## [1] "The number of missing values in Loan_Amount_Termare: 0"
## [1] "The number of missing values in Credit_Historyare: 0"
## [1] "The number of missing values in Property_Areaare: 0"
## [1] "The number of missing values in Loan_Statusare: 0"
Have a look our numeric variables and their distribution.
numeric_list <- c('ApplicantIncome','CoapplicantIncome','LoanAmount')
histFunc <- function(data, var){
ggplot(data, aes(data[,var]))+
geom_histogram(bins = 45, color="black", fill="white", show.legend = FALSE)+
ggtitle(paste0('Histogram of ', var))+
labs(x = var, y = 'Number of observations') +
geom_vline(aes(xintercept=mean(data[,var])),
color="blue", linetype="dashed", size=1)
}
densityFunc <- function(data, var){
ggplot(data, aes(data[,var]))+
geom_histogram(aes(y = ..density..), bins = 45, color="black", fill="white", show.legend = FALSE)+
geom_density(alpha=.2, fill="#FF6666")+
ggtitle(paste0('Histogram of ', var))+
labs(x = var, y = 'Density')
}
p <- list()
pp <- list()
for(var in numeric_list){
p[[var]] <- histFunc(loan.train, var)
pp[[var]] <- densityFunc(loan.train, var)
}
do.call(grid.arrange,p)
do.call(grid.arrange, pp)
There are many outliers in our the variables. Calculate the quantiles of
these three variables and the upper limit only
upperlimitList <- list()
limitCalc <- function(var){
q <- quantile(loan.train[,var], probs = seq(0, 1, 1/4))
iqr = q[4] - q[2]
upper_limit <- round(q[4] + (iqr * 1.5), digits = 0)
}
upperLL <- c('ApplicantIncome','CoapplicantIncome','LoanAmount')
for (var in upperLL){
upperlimitList[[var]] <- limitCalc(var)
}
Remove the outliers and plot again to see the distribution of the data and their qqplot.
outl <- subset(loan.train,
Loan_Amount_Term >= 400 |
Loan_Amount_Term <= 179 |
LoanAmount >= upperlimitList$LoanAmount |
ApplicantIncome >= upperlimitList$ApplicantIncome |
CoapplicantIncome >= upperlimitList$CoapplicantIncome)
new.loan.train <- anti_join(loan.train, outl, by=NULL, copy=FALSE)
qqlist <- c('ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term')
normalityFunc <- function(data, var){
shpwlk <- shapiro.test(data[,var])
ggplot(data, aes(data[,var]))+
geom_histogram(bins = 45, color="black", fill="white", show.legend = FALSE)+
ggtitle('The p-value of the distribution is: ', shpwlk$p.value)+
labs(x = var, y = 'Number of observations')+
geom_vline(aes(xintercept=mean(data[,var])),
color="blue", linetype="dashed", size=1)
}
qqNormPlot <- function(data, var){
qqnorm(data[,var], col = 'red')
qqline(data[,var], distribution = qnorm, col = 'blue')
}
qqNormPlot2 <- function(data, var){
qqnorm(data[,var], col = 'red')
qqline(data[,var], distribution = qnorm, col = 'blue')
}
p <- list()
for(var in qqlist){
p[[var]] <- normalityFunc(new.loan.train, var)
}
do.call(grid.arrange,p)
p <- list()
par(mfcol=c(2,2))
for(var in qqlist){
p[[var]] <- qqNormPlot(loan.train, var)
}
p <- list()
par(mfcol=c(2,2))
for(var in qqlist){
p[[var]] <- qqNormPlot2(new.loan.train, var)
}
The data now seems more normally distributed and from the qqplots can be seen that there is a significant difference between the before and the after state. Make a final test using Rosnet test to check for outliers.
par(mfcol=c(2,2))
ggplot(new.loan.train, aes(x = '', y = ApplicantIncome))+
geom_boxplot(outlier.colour = "red", outlier.shape = 1)
rosTest <- rosnerTest(new.loan.train$ApplicantIncome, k = 5)
rosTest$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 4070.586 1859.743 10139 455 3.263039 3.861533 FALSE
## 2 1 4058.351 1841.490 10047 356 3.252067 3.861000 FALSE
## 3 2 4046.253 1823.514 10000 119 3.264985 3.860465 FALSE
## 4 3 4034.200 1805.520 10000 170 3.304200 3.859929 FALSE
## 5 4 4022.099 1787.189 10000 263 3.344863 3.859392 FALSE
ggplot(new.loan.train, aes(x = '', y = CoapplicantIncome))+
geom_boxplot(outlier.colour = "red", outlier.shape = 1)
rosTest2 <- rosnerTest(new.loan.train$CoapplicantIncome, k = 5)
rosTest2$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 2397.169 1129.974 5654 80 2.882216 3.861533 FALSE
## 2 1 2390.603 1121.584 5625 17 2.883776 3.861000 FALSE
## 3 2 2384.069 1113.228 5625 129 2.911291 3.860465 FALSE
## 4 3 2377.508 1104.736 5625 139 2.939608 3.859929 FALSE
## 5 4 2370.921 1096.104 5625 140 2.968768 3.859392 FALSE
ggplot(new.loan.train, aes(x = '', y = LoanAmount))+
geom_boxplot(outlier.colour = "red", outlier.shape = 1)
rosTest3 <- rosnerTest(new.loan.train$LoanAmount, k = 5)
rosTest3$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 128.2782 44.06748 260 455 2.989094 3.861533 FALSE
## 2 1 128.0126 43.71205 259 289 2.996597 3.861000 FALSE
## 3 2 127.7480 43.35677 258 58 3.004192 3.860465 FALSE
## 4 3 127.4843 43.00162 258 461 3.035134 3.859929 FALSE
## 5 4 127.2196 42.64042 255 129 2.996698 3.859392 FALSE
We can see that there are no more outliers left. Removing the outliers will help to increase model performance. Values in the Loan Amount Term are not tested as everything that’s not 360 value is considered as an outlier. Removing these would significantly reduce the number of observations in the dataframe.
table(loan.train$Loan_Amount_Term, loan.train$Loan_Amount_Term)
##
## 12 36 60 84 120 180 240 300 360 480
## 12 1 0 0 0 0 0 0 0 0 0
## 36 0 2 0 0 0 0 0 0 0 0
## 60 0 0 2 0 0 0 0 0 0 0
## 84 0 0 0 4 0 0 0 0 0 0
## 120 0 0 0 0 3 0 0 0 0 0
## 180 0 0 0 0 0 44 0 0 0 0
## 240 0 0 0 0 0 0 4 0 0 0
## 300 0 0 0 0 0 0 0 13 0 0
## 360 0 0 0 0 0 0 0 0 526 0
## 480 0 0 0 0 0 0 0 0 0 15
univarGenderPlot <- function(data, var){
data %>% group_by(data[,var]) %>%
ggplot(aes(x = data[,var])) +
geom_bar(fill = 4) +
labs(x = var, y = 'Number of applicants') +
ggtitle(paste0('Number of applicants by ', var)) +
theme(plot.title = element_text(hjust = 0.5))
}
plt.list1 <- c('Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area')
p <- list()
for(var in plt.list1){
p[[var]] <- univarGenderPlot(loan.train,var)
}
do.call(grid.arrange,p)
p <- list()
for(var in plt.list1){
p[[var]] <- univarGenderPlot(new.loan.train,var)
}
do.call(grid.arrange,p)
univarLoanAmountPlot <- function(data, var){
data %>% ggplot(aes(x = LoanAmount)) +
geom_histogram(bins = 45, fill = 2,position = "identity") +
facet_wrap(~ data[,var]) +
labs(x = paste0('Loan amount by ', var), y = 'Number of applicants') +
ggtitle(paste0('Distribution of loan amount by ', var)) +
theme(plot.title = element_text(hjust = 0.5))
}
plt.list2 <- c('Gender', 'Dependents', 'Married', 'Education', 'Self_Employed')
p <- list()
for(var in plt.list2){
p[[var]] <- univarLoanAmountPlot(loan.train, var)
}
do.call(grid.arrange,p)
p <- list()
for(var in plt.list2){
p[[var]] <- univarLoanAmountPlot(new.loan.train, var)
}
do.call(grid.arrange,p)
univarApplicantIncomePlot <- function(data, var){
data %>% ggplot(aes(x = ApplicantIncome)) +
geom_histogram(bins = 45, fill = 4, position = "identity") +
facet_wrap(~ data[,var]) +
labs(x = paste0('Applicant Income by ', var), y = 'Number of applicants') +
ggtitle(paste0('Distribution of applicants income by ', var)) +
theme(plot.title = element_text(hjust = 0.5))
}
p <- list()
for(var in plt.list2){
p[[var]] <- univarApplicantIncomePlot(loan.train, var)
}
do.call(grid.arrange,p)
p <- list()
for(var in plt.list2){
p[[var]] <- univarApplicantIncomePlot(new.loan.train, var)
}
do.call(grid.arrange,p)
### Multivariate analysis
table(new.loan.train$Loan_Status, new.loan.train$Loan_Amount_Term)
##
## 180 240 300 360
## N 11 1 4 131
## Y 24 3 5 318
area_loanstatus2 <- table(new.loan.train$Property_Area, new.loan.train$Loan_Status)
chordDiagram(as.matrix(area_loanstatus2), small.gap = 3, big.gap = 10)
circos.clear()
The number of applications by term and acceptance.
The circular plot shows the the loan status - accepted or rejected - with the breakdown of the preferred property location.
loanAmLoanT <- function(data){
data %>% ggplot(aes(x = Loan_Amount_Term , y = LoanAmount, group = Loan_Status)) +
geom_point(cex = 1.5, pch = 1, position = position_jitter(w = 10, h = 0), aes(color = Loan_Status)) +
scale_color_manual(values = c('red', 'blue')) +
labs(x = 'Loan AMount Term', y = 'Loan amount') +
ggtitle('Loan amount and loan term comparison') +
theme(plot.title = element_text(hjust = 0.5))
}
appliCoapp <- function(data){
#filter(ApplicantIncome < 20000 & CoapplicantIncome < 20000) %>%
data %>% ggplot(aes(x = ApplicantIncome , y = CoapplicantIncome, group = Loan_Status)) +
geom_point(cex = 1.5, pch = 1, position = position_jitter(w = 5, h = 0), aes(color = Loan_Status)) +
scale_color_manual(values = c('red', 'blue')) +
labs(x = 'Main applicant income', y = 'Co-applicant income') +
ggtitle('Trimmed main and co applicant income distribution') +
theme(plot.title = element_text(hjust = 0.5))
}
loanAmLoanT(loan.train)
appliCoapp(loan.train)
loanAmLoanT(new.loan.train)
appliCoapp(new.loan.train)
The scatterplot shows applicants income relationship and the term by
amount before and after the cleaning from outliers.
ggpairs(new.loan.train, columns = c('ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Property_Area','Loan_Status'))
We fit a binomial logistic model on our training data set.
glm_model <- glm(formula = Loan_Status ~., family = binomial, data = new.loan.train)
summary(glm_model)
##
## Call:
## glm(formula = Loan_Status ~ ., family = binomial, data = new.loan.train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.326e+00 1.060e+00 -1.251 0.21089
## GenderMale 2.660e-01 3.199e-01 0.831 0.40572
## MarriedYes 3.442e-01 2.773e-01 1.241 0.21445
## Dependents1 -6.901e-02 3.322e-01 -0.208 0.83543
## Dependents2 1.071e-02 3.418e-01 0.031 0.97501
## Dependents3 -2.768e-01 4.319e-01 -0.641 0.52157
## EducationNot Graduate -3.766e-01 2.606e-01 -1.445 0.14846
## Self_EmployedYes -3.205e-01 3.523e-01 -0.910 0.36291
## ApplicantIncome -2.504e-05 7.088e-05 -0.353 0.72387
## CoapplicantIncome 8.969e-05 1.076e-04 0.833 0.40473
## LoanAmount -2.993e-03 3.112e-03 -0.962 0.33614
## Loan_Amount_Term 5.114e-04 2.471e-03 0.207 0.83606
## Credit_History1 2.325e+00 2.542e-01 9.143 < 2e-16 ***
## Property_AreaSemiurban 8.926e-01 2.843e-01 3.140 0.00169 **
## Property_AreaUrban 1.488e-01 2.782e-01 0.535 0.59284
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 603.60 on 496 degrees of freedom
## Residual deviance: 486.79 on 482 degrees of freedom
## AIC: 516.79
##
## Number of Fisher Scoring iterations: 4
Applying a step-wise algorithm to find the best model.
step(glm_model, direction = 'both', trace = 1, steps = 1000)
## Start: AIC=516.79
## Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed +
## ApplicantIncome + CoapplicantIncome + LoanAmount + Loan_Amount_Term +
## Credit_History + Property_Area
##
## Df Deviance AIC
## - Dependents 3 487.24 511.24
## - Loan_Amount_Term 1 486.83 514.83
## - ApplicantIncome 1 486.91 514.91
## - Gender 1 487.47 515.47
## - CoapplicantIncome 1 487.49 515.49
## - Self_Employed 1 487.60 515.60
## - LoanAmount 1 487.72 515.72
## - Married 1 488.33 516.33
## <none> 486.79 516.79
## - Education 1 488.85 516.85
## - Property_Area 2 498.30 524.30
## - Credit_History 1 581.56 609.56
##
## Step: AIC=511.24
## Loan_Status ~ Gender + Married + Education + Self_Employed +
## ApplicantIncome + CoapplicantIncome + LoanAmount + Loan_Amount_Term +
## Credit_History + Property_Area
##
## Df Deviance AIC
## - Loan_Amount_Term 1 487.32 509.32
## - ApplicantIncome 1 487.41 509.41
## - Gender 1 487.89 509.89
## - CoapplicantIncome 1 488.00 510.00
## - Self_Employed 1 488.10 510.10
## - LoanAmount 1 488.15 510.15
## - Married 1 488.68 510.68
## <none> 487.24 511.24
## - Education 1 489.41 511.41
## + Dependents 3 486.79 516.79
## - Property_Area 2 498.61 518.61
## - Credit_History 1 582.49 604.49
##
## Step: AIC=509.32
## Loan_Status ~ Gender + Married + Education + Self_Employed +
## ApplicantIncome + CoapplicantIncome + LoanAmount + Credit_History +
## Property_Area
##
## Df Deviance AIC
## - ApplicantIncome 1 487.53 507.53
## - Gender 1 487.95 507.95
## - CoapplicantIncome 1 488.05 508.05
## - LoanAmount 1 488.17 508.17
## - Self_Employed 1 488.18 508.18
## - Married 1 488.72 508.72
## <none> 487.32 509.32
## - Education 1 489.59 509.59
## + Loan_Amount_Term 1 487.24 511.24
## + Dependents 3 486.83 514.83
## - Property_Area 2 498.81 516.81
## - Credit_History 1 582.55 602.55
##
## Step: AIC=507.53
## Loan_Status ~ Gender + Married + Education + Self_Employed +
## CoapplicantIncome + LoanAmount + Credit_History + Property_Area
##
## Df Deviance AIC
## - Gender 1 488.13 506.13
## - CoapplicantIncome 1 488.28 506.28
## - Self_Employed 1 488.53 506.53
## - Married 1 489.12 507.12
## - LoanAmount 1 489.15 507.15
## <none> 487.53 507.53
## - Education 1 489.68 507.68
## + ApplicantIncome 1 487.32 509.32
## + Loan_Amount_Term 1 487.41 509.41
## + Dependents 3 486.97 512.97
## - Property_Area 2 499.13 515.13
## - Credit_History 1 582.66 600.66
##
## Step: AIC=506.13
## Loan_Status ~ Married + Education + Self_Employed + CoapplicantIncome +
## LoanAmount + Credit_History + Property_Area
##
## Df Deviance AIC
## - CoapplicantIncome 1 488.92 504.92
## - Self_Employed 1 489.14 505.14
## - LoanAmount 1 489.56 505.56
## <none> 488.13 506.13
## - Education 1 490.15 506.15
## - Married 1 490.99 506.99
## + Gender 1 487.53 507.53
## + ApplicantIncome 1 487.95 507.95
## + Loan_Amount_Term 1 488.05 508.05
## + Dependents 3 487.62 511.62
## - Property_Area 2 499.20 513.20
## - Credit_History 1 584.80 600.80
##
## Step: AIC=504.92
## Loan_Status ~ Married + Education + Self_Employed + LoanAmount +
## Credit_History + Property_Area
##
## Df Deviance AIC
## - Self_Employed 1 489.81 503.81
## - LoanAmount 1 489.99 503.99
## <none> 488.92 504.92
## - Education 1 491.02 505.02
## - Married 1 491.61 505.61
## + CoapplicantIncome 1 488.13 506.13
## + Gender 1 488.28 506.28
## + ApplicantIncome 1 488.72 506.72
## + Loan_Amount_Term 1 488.87 506.87
## + Dependents 3 488.35 510.35
## - Property_Area 2 499.79 511.79
## - Credit_History 1 584.80 598.80
##
## Step: AIC=503.81
## Loan_Status ~ Married + Education + LoanAmount + Credit_History +
## Property_Area
##
## Df Deviance AIC
## - LoanAmount 1 491.05 503.05
## <none> 489.81 503.81
## - Education 1 491.92 503.92
## - Married 1 492.53 504.53
## + Self_Employed 1 488.92 504.92
## + CoapplicantIncome 1 489.14 505.14
## + Gender 1 489.16 505.16
## + ApplicantIncome 1 489.48 505.48
## + Loan_Amount_Term 1 489.75 505.75
## + Dependents 3 489.16 509.16
## - Property_Area 2 500.45 510.45
## - Credit_History 1 585.28 597.28
##
## Step: AIC=503.05
## Loan_Status ~ Married + Education + Credit_History + Property_Area
##
## Df Deviance AIC
## - Education 1 492.82 502.82
## <none> 491.05 503.05
## - Married 1 493.20 503.20
## + LoanAmount 1 489.81 503.81
## + Self_Employed 1 489.99 503.99
## + ApplicantIncome 1 490.01 504.01
## + Gender 1 490.61 504.61
## + CoapplicantIncome 1 490.73 504.73
## + Loan_Amount_Term 1 491.02 505.02
## + Dependents 3 490.35 508.35
## - Property_Area 2 501.49 509.49
## - Credit_History 1 586.75 596.75
##
## Step: AIC=502.82
## Loan_Status ~ Married + Credit_History + Property_Area
##
## Df Deviance AIC
## - Married 1 494.82 502.82
## <none> 492.82 502.82
## + Education 1 491.05 503.05
## + Self_Employed 1 491.78 503.78
## + LoanAmount 1 491.92 503.92
## + ApplicantIncome 1 492.18 504.18
## + CoapplicantIncome 1 492.42 504.42
## + Gender 1 492.47 504.47
## + Loan_Amount_Term 1 492.73 504.73
## + Dependents 3 491.99 507.99
## - Property_Area 2 503.80 509.80
## - Credit_History 1 590.92 598.92
##
## Step: AIC=502.82
## Loan_Status ~ Credit_History + Property_Area
##
## Df Deviance AIC
## <none> 494.82 502.82
## + Married 1 492.82 502.82
## + Education 1 493.20 503.20
## + Gender 1 493.58 503.58
## + Self_Employed 1 493.76 503.76
## + ApplicantIncome 1 494.10 504.10
## + LoanAmount 1 494.35 504.35
## + CoapplicantIncome 1 494.45 504.45
## + Loan_Amount_Term 1 494.79 504.79
## + Dependents 3 494.37 508.37
## - Property_Area 2 506.12 510.12
## - Credit_History 1 591.66 597.66
##
## Call: glm(formula = Loan_Status ~ Credit_History + Property_Area, family = binomial,
## data = new.loan.train)
##
## Coefficients:
## (Intercept) Credit_History1 Property_AreaSemiurban
## -1.1512 2.2903 0.8811
## Property_AreaUrban
## 0.2150
##
## Degrees of Freedom: 496 Total (i.e. Null); 493 Residual
## Null Deviance: 603.6
## Residual Deviance: 494.8 AIC: 502.8
There is not much of an improvement on the AIC from the step-wise algorithm, hence we use the variables with the least AIC and a few more that’s seems important.
We build the model only with these variables and compare them with the intercept only model.
glm2_model <- glm(formula = Loan_Status ~ ApplicantIncome + CoapplicantIncome + LoanAmount + Credit_History + Property_Area, family = binomial, data = new.loan.train)
glm_intercept <- glm(Loan_Status~1, data = new.loan.train, family = binomial)
anova(glm_intercept, glm2_model, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Loan_Status ~ 1
## Model 2: Loan_Status ~ ApplicantIncome + CoapplicantIncome + LoanAmount +
## Credit_History + Property_Area
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 496 603.60
## 2 490 493.39 6 110.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LSP <- data.frame(new.loan.train$Loan_Status)
LSP$new.loan.train.Loan_Status <- as.factor(LSP$new.loan.train.Loan_Status)
LSP$glm_model_predprob <- round(fitted(glm2_model),2)
LSP$glm_pred <- ifelse(LSP$glm_model_predprob > 0.6, 'Y', 'N')
LSP$glm_pred <- as.factor(LSP$glm_pred)
library(caret)
confusionMatrix(LSP$new.loan.train.Loan_Status, LSP$glm_pred, positive = 'Y')
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 77 70
## Y 36 314
##
## Accuracy : 0.7867
## 95% CI : (0.748, 0.8219)
## No Information Rate : 0.7726
## P-Value [Acc > NIR] : 0.244920
##
## Kappa : 0.4512
##
## Mcnemar's Test P-Value : 0.001349
##
## Sensitivity : 0.8177
## Specificity : 0.6814
## Pos Pred Value : 0.8971
## Neg Pred Value : 0.5238
## Prevalence : 0.7726
## Detection Rate : 0.6318
## Detection Prevalence : 0.7042
## Balanced Accuracy : 0.7496
##
## 'Positive' Class : Y
##
library('pROC')
rocobj <- roc(LSP$new.loan.train.Loan_Status, LSP$glm_model_predprob, auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj <- ci.se(rocobj, specificities=seq(0, 1, 0.1))
plot(ciobj, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj, boot.stratified=TRUE), type="bars")
print(rocobj)
##
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$glm_model_predprob, auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
##
## Data: LSP$glm_model_predprob in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.7702
## 95% CI: 0.7245-0.816 (DeLong)
print(paste0('The AUC score for the SVM classifier is: ', rocobj$auc))
## [1] "The AUC score for the SVM classifier is: 0.77022351797862"
The ROC curve is seen above for the model. The area under the curve with confidence interval shows that the model is an acceptable.
library(e1071)
nB_model <- naiveBayes(Loan_Status~., data = new.loan.train)
LSP$pred_nB <- predict(nB_model, new.loan.train, type = 'raw')
LSP$predY_nB <- ifelse(LSP$pred_nB[,2] > 0.6, 'Y', 'N')
LSP$predY_nB <- as.factor(LSP$predY_nB)
confusionMatrix(LSP$predY_nB, LSP$new.loan.train.Loan_Status, positive = 'Y')
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 79 36
## Y 68 314
##
## Accuracy : 0.7907
## 95% CI : (0.7523, 0.8257)
## No Information Rate : 0.7042
## P-Value [Acc > NIR] : 8.159e-06
##
## Kappa : 0.4638
##
## Mcnemar's Test P-Value : 0.002367
##
## Sensitivity : 0.8971
## Specificity : 0.5374
## Pos Pred Value : 0.8220
## Neg Pred Value : 0.6870
## Prevalence : 0.7042
## Detection Rate : 0.6318
## Detection Prevalence : 0.7686
## Balanced Accuracy : 0.7173
##
## 'Positive' Class : Y
##
rocobj_nB <- roc(LSP$new.loan.train.Loan_Status, LSP$pred_nB[,2], auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj_nB <- ci.se(rocobj_nB, specificities=seq(0, 1, 0.1))
plot(ciobj_nB, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj_nB, boot.stratified=TRUE), type="bars")
print(rocobj_nB)
##
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$pred_nB[, 2], auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
##
## Data: LSP$pred_nB[, 2] in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.7755
## 95% CI: 0.7297-0.8212 (DeLong)
print(paste0('The AUC score for the Naive Bayes classifier is: ', rocobj_nB$auc))
## [1] "The AUC score for the Naive Bayes classifier is: 0.775471331389699"
new.loan.train$Dependents <- as.integer(new.loan.train$Dependents)
new.loan.train$CoapplicantIncome <- as.integer(new.loan.train$CoapplicantIncome)
new.loan.train$LoanAmount <- as.integer(new.loan.train$LoanAmount)
svm_model <- svm(formula = Loan_Status~., data = new.loan.train, type = 'C-classification', probability = TRUE, kernel = 'linear')
LSP$svm_pred <- predict(svm_model, new.loan.train, probability = TRUE)
confusionMatrix(LSP$svm_pred, LSP$new.loan.train.Loan_Status, positive = 'Y')
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 77 36
## Y 70 314
##
## Accuracy : 0.7867
## 95% CI : (0.748, 0.8219)
## No Information Rate : 0.7042
## P-Value [Acc > NIR] : 2.071e-05
##
## Kappa : 0.4512
##
## Mcnemar's Test P-Value : 0.001349
##
## Sensitivity : 0.8971
## Specificity : 0.5238
## Pos Pred Value : 0.8177
## Neg Pred Value : 0.6814
## Prevalence : 0.7042
## Detection Rate : 0.6318
## Detection Prevalence : 0.7726
## Balanced Accuracy : 0.7105
##
## 'Positive' Class : Y
##
LSP$svm_pred_bool <- ifelse(LSP$svm_pred == 'Y', 1, 0)
rocobj_svm <- roc(LSP$new.loan.train.Loan_Status, LSP$svm_pred_bool, auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj_svm <- ci.se(rocobj_svm, specificities=seq(0, 1, 0.1))
plot(ciobj_svm, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj_svm, boot.stratified=TRUE), type="bars")
print(rocobj_svm)
##
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$svm_pred_bool, auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
##
## Data: LSP$svm_pred_bool in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.7105
## 95% CI: 0.6669-0.754 (DeLong)
print(paste0('The AUC score for the SVM classifier is: ', rocobj_svm$auc))
## [1] "The AUC score for the SVM classifier is: 0.71047619047619"
library(neuralnet)
library(fastDummies)
loan.train_ann <- dummy_cols(new.loan.train,
select_columns = c('Gender','Married','Education','Self_Employed','Credit_History','Property_Area','Loan_Status'),
remove_first_dummy = TRUE,
remove_selected_columns = TRUE)
normalize <- function(x){
return((x-min(x))/(max(x)-min(x)))
}
loan.train_ann$Dependents <- normalize(loan.train_ann$Dependents)
loan.train_ann$ApplicantIncome <- normalize(loan.train_ann$ApplicantIncome)
loan.train_ann$CoapplicantIncome <- normalize(loan.train_ann$CoapplicantIncome)
loan.train_ann$LoanAmount <- normalize(loan.train_ann$LoanAmount)
loan.train_ann$Loan_Amount_Term <- normalize(loan.train_ann$Loan_Amount_Term)
names(loan.train_ann)[names(loan.train_ann)=='Education_Not Graduate'] <- 'Education_Not_Graduate'
ann_model <- neuralnet(Loan_Status_Y~., data = loan.train_ann, hidden = 5, rep = 3, err.fct = 'ce', linear.output = FALSE)
LSP$ann_pred <- ann_model$net.result[[1]]
LSP$ann_pred_bool <- ifelse(LSP$ann_pred > 0.6, 1, 0)
LSP$ann_pred_YN <- ifelse(LSP$ann_pred_bool == 1, 'Y', 'N')
LSP$ann_pred_YN <- as.factor(LSP$ann_pred_YN)
plot(ann_model)
confusionMatrix(LSP$ann_pred_YN, LSP$new.loan.train.Loan_Status, positive = 'Y')
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 116 32
## Y 31 318
##
## Accuracy : 0.8732
## 95% CI : (0.8407, 0.9012)
## No Information Rate : 0.7042
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6963
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9086
## Specificity : 0.7891
## Pos Pred Value : 0.9112
## Neg Pred Value : 0.7838
## Prevalence : 0.7042
## Detection Rate : 0.6398
## Detection Prevalence : 0.7022
## Balanced Accuracy : 0.8488
##
## 'Positive' Class : Y
##
rocobj_ann <- roc(LSP$new.loan.train.Loan_Status, LSP$ann_pred_bool, auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj_ann <- ci.se(rocobj_ann, specificities=seq(0, 1, 0.1))
plot(ciobj_ann, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj_ann, boot.stratified=TRUE), type="bars")
print(rocobj_ann)
##
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$ann_pred_bool, auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
##
## Data: LSP$ann_pred_bool in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.8488
## 95% CI: 0.8125-0.8852 (DeLong)
print(paste0('The AUC score for the ANN classifier is: ', rocobj_ann$auc))
## [1] "The AUC score for the ANN classifier is: 0.848843537414966"
From the above tested classifiers the ANN classifier perform best.
CC0: Public Domain - https://creativecommons.org/publicdomain/zero/1.0/