Background

About Company

Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. The customer first applies for a home loan after that company validates the customer’s eligibility for a loan.

The company wants to automate the loan eligibility process (real-time) based on customer detail provided while filling out the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have given a problem identifying the customer segments eligible for loan amounts to target these customers specifically.

This dataset is from Analytics Vidhya’s Competition.

Data description

  • Loan Id - Id
  • Gender - Male/Female
  • Married - Marital status
  • Dependents - The number of dependents of a customer (0,1,2,3 or more)
  • Education - Has graduated or not
  • Self_Employed - Whether the customer is self-employed
  • ApplicantIncome - Applicant’s monthly income
  • CoapplicantIncome - Second applicant’s income
  • LoanAmount - Amount of loan applied for
  • Loan_Amount_Term - Borrowing time interval
  • Credit_History - Credit history is available
  • Property_Area - Location of the property (Urban, Semiurban, Rural)
  • Loan_Status - Accepted for loan or not
# Import packages
pacman::p_load(tidyverse, gridExtra, circlize, AICcmodavg, RColorBrewer, dplyr, EnvStats, GGally)
loan.train <- read.csv('loan_sanction_train.csv', header = TRUE)
loan.test <- read.csv('loan_sanction_test.csv', header = TRUE)

Data characteristics

summary(loan.train)
##    Loan_ID             Gender            Married           Dependents       
##  Length:614         Length:614         Length:614         Length:614        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Education         Self_Employed      ApplicantIncome CoapplicantIncome
##  Length:614         Length:614         Min.   :  150   Min.   :    0    
##  Class :character   Class :character   1st Qu.: 2878   1st Qu.:    0    
##  Mode  :character   Mode  :character   Median : 3812   Median : 1188    
##                                        Mean   : 5403   Mean   : 1621    
##                                        3rd Qu.: 5795   3rd Qu.: 2297    
##                                        Max.   :81000   Max.   :41667    
##                                                                         
##    LoanAmount    Loan_Amount_Term Credit_History   Property_Area     
##  Min.   :  9.0   Min.   : 12      Min.   :0.0000   Length:614        
##  1st Qu.:100.0   1st Qu.:360      1st Qu.:1.0000   Class :character  
##  Median :128.0   Median :360      Median :1.0000   Mode  :character  
##  Mean   :146.4   Mean   :342      Mean   :0.8422                     
##  3rd Qu.:168.0   3rd Qu.:360      3rd Qu.:1.0000                     
##  Max.   :700.0   Max.   :480      Max.   :1.0000                     
##  NA's   :22      NA's   :14       NA's   :50                         
##  Loan_Status       
##  Length:614        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
glimpse(loan.train)
## Rows: 614
## Columns: 13
## $ Loan_ID           <chr> "LP001002", "LP001003", "LP001005", "LP001006", "LP0…
## $ Gender            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ Married           <chr> "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"…
## $ Dependents        <chr> "0", "1", "0", "0", "0", "2", "0", "3+", "2", "1", "…
## $ Education         <chr> "Graduate", "Graduate", "Graduate", "Not Graduate", …
## $ Self_Employed     <chr> "No", "No", "Yes", "No", "No", "Yes", "No", "No", "N…
## $ ApplicantIncome   <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006…
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, …
## $ LoanAmount        <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1…
## $ Loan_Amount_Term  <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36…
## $ Credit_History    <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, …
## $ Property_Area     <chr> "Urban", "Rural", "Urban", "Urban", "Urban", "Urban"…
## $ Loan_Status       <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y", "N", "Y…
cols <- list('Gender','Married', 'Dependents','Education','Self_Employed','Credit_History','Property_Area')
for (col in cols){
  uni <- list(unique(loan.train[[col]]))
  print(paste0('The unique values in the column ',col, ' are: ', uni))
}
## [1] "The unique values in the column Gender are: c(\"Male\", \"Female\", \"\")"
## [1] "The unique values in the column Married are: c(\"No\", \"Yes\", \"\")"
## [1] "The unique values in the column Dependents are: c(\"0\", \"1\", \"2\", \"3+\", \"\")"
## [1] "The unique values in the column Education are: c(\"Graduate\", \"Not Graduate\")"
## [1] "The unique values in the column Self_Employed are: c(\"No\", \"Yes\", \"\")"
## [1] "The unique values in the column Credit_History are: c(1, 0, NA)"
## [1] "The unique values in the column Property_Area are: c(\"Urban\", \"Rural\", \"Semiurban\")"
coappinc_table <- table(loan.train$CoapplicantIncome)
print(paste0('The number of observations with zero value is: ',coappinc_table[names(coappinc_table) == 0]))
## [1] "The number of observations with zero value is: 273"

From the initial summary statistic we can see that there are some missing values in the dataset, the unique values contain empty characters and most of them are in the wrong format. The Coapplicant Income also contains many zero values, we replace all of these values with NA and then using a backward filling to impute some real data. We have to make sure that there are no blank values either,hence counting all the missing values by column including blank ones.

sapply(loan.train, function(x) sum(is.na(x) | x == '' | x== ' '))
##           Loan_ID            Gender           Married        Dependents 
##                 0                13                 3                15 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                32                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0
sapply(loan.test, function(x) sum(is.na(x) | x == '' | x== ' '))
##           Loan_ID            Gender           Married        Dependents 
##                 0                11                 0                10 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                23                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                 5                 6                29                 0

Data cleaning and transformation

We replace the missing values in the Gender category with the *most frequent gender.

The Married variable has three missing values that we replace with ‘No’.

Missing values from the Dependents are replaced with ‘0’, assuming it was left empty as there are no dependents.

From the Self Employed category the values are replaced with ‘No’.

We replace the missing values in the Loan Amount with the average loan amount based on the assumption that it is a mandatory value for a loan application hence can’t be zero.

The Loan Amount Term is replaced with the most frequently appearing term.

The missing values from Credit History are replaced with ‘0’, meaning that there is no history available for that client.

Removing the first column from both data set (Loan_ID) as we don’t need it.

getMode <- function(x){
  uni <- unique(x)
  uni[which.max(tabulate(match(x, uni)))]
}

loan.train$Gender[loan.train$Gender == '' | loan.train$Gender == ' '] <- getMode(loan.train$Gender)
loan.train$Married[loan.train$Married == '' | loan.train$Married == ' '] <- 'No'
loan.train$Dependents[loan.train$Dependents == '' | loan.train$Dependents == ' '] <- 0
loan.train$Self_Employed[loan.train$Self_Employed == '' | loan.train$Self_Employed == ' '] <- 'No'
loan.train$LoanAmount[is.na(loan.train$LoanAmount)] <- mean(loan.train$LoanAmount, na.rm = TRUE)
loan.train$Loan_Amount_Term[is.na(loan.train$Loan_Amount_Term)] <- getMode(loan.train$Loan_Amount_Term)
loan.train$Credit_History[is.na(loan.train$Credit_History)] <- 0

loan.test$Gender[loan.test$Gender == '' | loan.test$Gender == ' '] <- getMode(loan.test$Gender)
loan.test$Dependents[loan.test$Dependents == '' | loan.test$Dependents == ' '] <- 0
loan.test$Self_Employed[loan.test$Self_Employed == '' | loan.test$Self_Employed == ' '] <- 'No'
loan.test$LoanAmount[is.na(loan.test$LoanAmount)] <- mean(loan.test$LoanAmount, na.rm = TRUE)
loan.test$Loan_Amount_Term[is.na(loan.test$Loan_Amount_Term)] <- getMode(loan.test$Loan_Amount_Term)
loan.test$Credit_History[is.na(loan.test$Credit_History)] <- 0

loan.train <- subset(loan.train, select = -c(Loan_ID))
loan.test <- subset(loan.test, select = -c(Loan_ID))

sapply(loan.train, function(x) sum(is.na(x) | x == '' | x== ' '))
##            Gender           Married        Dependents         Education 
##                 0                 0                 0                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                 0                 0                 0                 0 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                 0                 0                 0                 0
sapply(loan.test, function(x) sum(is.na(x) | x == '' | x== ' '))
##            Gender           Married        Dependents         Education 
##                 0                 0                 0                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                 0                 0                 0                 0 
##  Loan_Amount_Term    Credit_History     Property_Area 
##                 0                 0                 0
loan.train$CoapplicantIncome <- replace(loan.train$CoapplicantIncome, loan.train$CoapplicantIncome == 0, NA)
loan.test$CoapplicantIncome <- replace(loan.test$CoapplicantIncome, loan.test$CoapplicantIncome == 0, NA)
loan.train <- loan.train %>% fill(CoapplicantIncome, .direction = 'up')
loan.train <- loan.train %>% fill(CoapplicantIncome, .direction = 'down')
loan.test <- loan.test %>% fill(CoapplicantIncome, .direction = 'up')
loan.test <- loan.test %>% fill(CoapplicantIncome, .direction = 'down')

There are no missing values anymore and the data is now ready for further analysis.

Replacing the values ‘3+’, with ‘3’ keeping the assumption that it means 3 or more dependents.

loan.train$Dependents[loan.train$Dependents > 2] <- 3
loan.test$Dependents[loan.test$Dependents > 2] <- 3

We require that some of the variables to be in a categorical format.

loan.train <- loan.train %>% mutate_at(
  c('Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area', 'Loan_Status'), as.factor)

loan.test <- loan.test %>% mutate_at(
  c('Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area'), as.factor)
for(col in colnames(loan.train)){
  print(paste0('The number of missing values in ', col, 'are: ', sum(is.na(loan.train[[col]]))))
}
## [1] "The number of missing values in Genderare: 0"
## [1] "The number of missing values in Marriedare: 0"
## [1] "The number of missing values in Dependentsare: 0"
## [1] "The number of missing values in Educationare: 0"
## [1] "The number of missing values in Self_Employedare: 0"
## [1] "The number of missing values in ApplicantIncomeare: 0"
## [1] "The number of missing values in CoapplicantIncomeare: 0"
## [1] "The number of missing values in LoanAmountare: 0"
## [1] "The number of missing values in Loan_Amount_Termare: 0"
## [1] "The number of missing values in Credit_Historyare: 0"
## [1] "The number of missing values in Property_Areaare: 0"
## [1] "The number of missing values in Loan_Statusare: 0"

Outlier’s Detection

Have a look our numeric variables and their distribution.

numeric_list <- c('ApplicantIncome','CoapplicantIncome','LoanAmount')

histFunc <- function(data, var){
    ggplot(data, aes(data[,var]))+
          geom_histogram(bins = 45, color="black", fill="white", show.legend = FALSE)+
          ggtitle(paste0('Histogram of ', var))+
          labs(x = var, y = 'Number of observations') +
          geom_vline(aes(xintercept=mean(data[,var])),
            color="blue", linetype="dashed", size=1)
}

densityFunc <- function(data, var){
    ggplot(data, aes(data[,var]))+
          geom_histogram(aes(y = ..density..), bins = 45, color="black", fill="white", show.legend = FALSE)+
          geom_density(alpha=.2, fill="#FF6666")+
          ggtitle(paste0('Histogram of ', var))+
          labs(x = var, y = 'Density')
}

p <- list()
pp <- list()
for(var in numeric_list){
  p[[var]] <- histFunc(loan.train, var)
  pp[[var]] <- densityFunc(loan.train, var)
}
do.call(grid.arrange,p)

do.call(grid.arrange, pp)

There are many outliers in our the variables. Calculate the quantiles of these three variables and the upper limit only

upperlimitList <- list()
limitCalc <- function(var){
  q <- quantile(loan.train[,var], probs = seq(0, 1, 1/4))
  iqr = q[4] - q[2]
  upper_limit <- round(q[4] + (iqr * 1.5), digits = 0)
}

upperLL <- c('ApplicantIncome','CoapplicantIncome','LoanAmount')
for (var in upperLL){
  upperlimitList[[var]] <- limitCalc(var)
}

Remove the outliers and plot again to see the distribution of the data and their qqplot.

outl <- subset(loan.train, 
               Loan_Amount_Term >= 400 |
               Loan_Amount_Term <= 179 |
               LoanAmount >= upperlimitList$LoanAmount |
               ApplicantIncome >= upperlimitList$ApplicantIncome |
               CoapplicantIncome >= upperlimitList$CoapplicantIncome)
new.loan.train <- anti_join(loan.train, outl, by=NULL, copy=FALSE)

qqlist <- c('ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term')

normalityFunc <- function(data, var){
    shpwlk <- shapiro.test(data[,var])
    ggplot(data, aes(data[,var]))+
          geom_histogram(bins = 45, color="black", fill="white", show.legend = FALSE)+
          ggtitle('The p-value of the distribution is: ', shpwlk$p.value)+
          labs(x = var, y = 'Number of observations')+
          geom_vline(aes(xintercept=mean(data[,var])),
            color="blue", linetype="dashed", size=1)
}

qqNormPlot <- function(data, var){
  qqnorm(data[,var], col = 'red')
  qqline(data[,var], distribution = qnorm, col = 'blue')
}

qqNormPlot2 <- function(data, var){
  qqnorm(data[,var], col = 'red')
  qqline(data[,var], distribution = qnorm, col = 'blue')
}

p <- list()
for(var in qqlist){
  p[[var]] <- normalityFunc(new.loan.train, var)
}
do.call(grid.arrange,p)

p <- list()
par(mfcol=c(2,2))
for(var in qqlist){
  p[[var]] <- qqNormPlot(loan.train, var)
}

p <- list()
par(mfcol=c(2,2))
for(var in qqlist){
  p[[var]] <- qqNormPlot2(new.loan.train, var)
}

The data now seems more normally distributed and from the qqplots can be seen that there is a significant difference between the before and the after state. Make a final test using Rosnet test to check for outliers.

par(mfcol=c(2,2))
ggplot(new.loan.train, aes(x = '', y = ApplicantIncome))+
                    geom_boxplot(outlier.colour = "red", outlier.shape = 1)

rosTest <- rosnerTest(new.loan.train$ApplicantIncome, k = 5) 
rosTest$all.stats
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 4070.586 1859.743 10139     455 3.263039   3.861533   FALSE
## 2 1 4058.351 1841.490 10047     356 3.252067   3.861000   FALSE
## 3 2 4046.253 1823.514 10000     119 3.264985   3.860465   FALSE
## 4 3 4034.200 1805.520 10000     170 3.304200   3.859929   FALSE
## 5 4 4022.099 1787.189 10000     263 3.344863   3.859392   FALSE
ggplot(new.loan.train, aes(x = '', y = CoapplicantIncome))+
                    geom_boxplot(outlier.colour = "red", outlier.shape = 1)

rosTest2 <- rosnerTest(new.loan.train$CoapplicantIncome, k = 5) 
rosTest2$all.stats
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 2397.169 1129.974  5654      80 2.882216   3.861533   FALSE
## 2 1 2390.603 1121.584  5625      17 2.883776   3.861000   FALSE
## 3 2 2384.069 1113.228  5625     129 2.911291   3.860465   FALSE
## 4 3 2377.508 1104.736  5625     139 2.939608   3.859929   FALSE
## 5 4 2370.921 1096.104  5625     140 2.968768   3.859392   FALSE
ggplot(new.loan.train, aes(x = '', y = LoanAmount))+
                    geom_boxplot(outlier.colour = "red", outlier.shape = 1)

rosTest3 <- rosnerTest(new.loan.train$LoanAmount, k = 5) 
rosTest3$all.stats
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 128.2782 44.06748   260     455 2.989094   3.861533   FALSE
## 2 1 128.0126 43.71205   259     289 2.996597   3.861000   FALSE
## 3 2 127.7480 43.35677   258      58 3.004192   3.860465   FALSE
## 4 3 127.4843 43.00162   258     461 3.035134   3.859929   FALSE
## 5 4 127.2196 42.64042   255     129 2.996698   3.859392   FALSE

We can see that there are no more outliers left. Removing the outliers will help to increase model performance. Values in the Loan Amount Term are not tested as everything that’s not 360 value is considered as an outlier. Removing these would significantly reduce the number of observations in the dataframe.

EDA

Univariate analysis

table(loan.train$Loan_Amount_Term, loan.train$Loan_Amount_Term)
##      
##        12  36  60  84 120 180 240 300 360 480
##   12    1   0   0   0   0   0   0   0   0   0
##   36    0   2   0   0   0   0   0   0   0   0
##   60    0   0   2   0   0   0   0   0   0   0
##   84    0   0   0   4   0   0   0   0   0   0
##   120   0   0   0   0   3   0   0   0   0   0
##   180   0   0   0   0   0  44   0   0   0   0
##   240   0   0   0   0   0   0   4   0   0   0
##   300   0   0   0   0   0   0   0  13   0   0
##   360   0   0   0   0   0   0   0   0 526   0
##   480   0   0   0   0   0   0   0   0   0  15
univarGenderPlot <- function(data, var){
         data %>% group_by(data[,var]) %>%
                        ggplot(aes(x = data[,var])) +
                        geom_bar(fill = 4) +
                        labs(x = var, y = 'Number of applicants') +
                        ggtitle(paste0('Number of applicants by ', var)) +
                        theme(plot.title = element_text(hjust = 0.5))
}

plt.list1 <- c('Gender', 'Dependents', 'Married', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area')

p <- list()
for(var in plt.list1){
  p[[var]] <- univarGenderPlot(loan.train,var)
}
do.call(grid.arrange,p)

p <- list()
for(var in plt.list1){
  p[[var]] <- univarGenderPlot(new.loan.train,var)
}
do.call(grid.arrange,p)

univarLoanAmountPlot <- function(data, var){
         data %>% ggplot(aes(x = LoanAmount)) +
                        geom_histogram(bins = 45, fill = 2,position = "identity") +
                        facet_wrap(~ data[,var]) +
                        labs(x = paste0('Loan amount by ', var), y = 'Number of applicants') +
                        ggtitle(paste0('Distribution of loan amount by ', var)) +
                        theme(plot.title = element_text(hjust = 0.5))
}

plt.list2 <- c('Gender', 'Dependents', 'Married', 'Education', 'Self_Employed')

p <- list()
for(var in plt.list2){
  p[[var]] <- univarLoanAmountPlot(loan.train, var)
}
do.call(grid.arrange,p)

p <- list()
for(var in plt.list2){
  p[[var]] <- univarLoanAmountPlot(new.loan.train, var)
}
do.call(grid.arrange,p)

univarApplicantIncomePlot <- function(data, var){
         data %>% ggplot(aes(x = ApplicantIncome)) +
                        geom_histogram(bins = 45, fill = 4, position = "identity") +
                        facet_wrap(~ data[,var]) +
                        labs(x = paste0('Applicant Income by ', var), y = 'Number of applicants') +
                        ggtitle(paste0('Distribution of applicants income by ', var)) +
                        theme(plot.title = element_text(hjust = 0.5))
}

p <- list()
for(var in plt.list2){
  p[[var]] <- univarApplicantIncomePlot(loan.train, var)
}
do.call(grid.arrange,p)

p <- list()
for(var in plt.list2){
  p[[var]] <- univarApplicantIncomePlot(new.loan.train, var)
}
do.call(grid.arrange,p)

### Multivariate analysis

table(new.loan.train$Loan_Status, new.loan.train$Loan_Amount_Term)
##    
##     180 240 300 360
##   N  11   1   4 131
##   Y  24   3   5 318
area_loanstatus2 <- table(new.loan.train$Property_Area, new.loan.train$Loan_Status)
chordDiagram(as.matrix(area_loanstatus2), small.gap = 3, big.gap = 10)

circos.clear()

The number of applications by term and acceptance.

The circular plot shows the the loan status - accepted or rejected - with the breakdown of the preferred property location.

loanAmLoanT <- function(data){
       data %>% ggplot(aes(x = Loan_Amount_Term , y = LoanAmount, group = Loan_Status)) +
                geom_point(cex = 1.5, pch = 1, position = position_jitter(w = 10, h = 0), aes(color = Loan_Status)) +
                scale_color_manual(values = c('red', 'blue')) +
                labs(x = 'Loan AMount Term', y = 'Loan amount') +
                ggtitle('Loan amount and loan term comparison') +
                theme(plot.title = element_text(hjust = 0.5))
}

appliCoapp <- function(data){
                #filter(ApplicantIncome < 20000 & CoapplicantIncome < 20000) %>% 
       data %>% ggplot(aes(x = ApplicantIncome , y = CoapplicantIncome, group = Loan_Status)) +
                geom_point(cex = 1.5, pch = 1, position = position_jitter(w = 5, h = 0), aes(color = Loan_Status)) +
                scale_color_manual(values = c('red', 'blue')) +
                labs(x = 'Main applicant income', y = 'Co-applicant income') +
                ggtitle('Trimmed main and co applicant income distribution') +
                theme(plot.title = element_text(hjust = 0.5))
}

loanAmLoanT(loan.train)

appliCoapp(loan.train)

loanAmLoanT(new.loan.train)

appliCoapp(new.loan.train)

The scatterplot shows applicants income relationship and the term by amount before and after the cleaning from outliers.

ggpairs(new.loan.train, columns = c('ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Property_Area','Loan_Status'))

Model building

Binary Logistic regression

We fit a binomial logistic model on our training data set.

glm_model <- glm(formula = Loan_Status ~., family = binomial, data = new.loan.train)
summary(glm_model)
## 
## Call:
## glm(formula = Loan_Status ~ ., family = binomial, data = new.loan.train)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.326e+00  1.060e+00  -1.251  0.21089    
## GenderMale              2.660e-01  3.199e-01   0.831  0.40572    
## MarriedYes              3.442e-01  2.773e-01   1.241  0.21445    
## Dependents1            -6.901e-02  3.322e-01  -0.208  0.83543    
## Dependents2             1.071e-02  3.418e-01   0.031  0.97501    
## Dependents3            -2.768e-01  4.319e-01  -0.641  0.52157    
## EducationNot Graduate  -3.766e-01  2.606e-01  -1.445  0.14846    
## Self_EmployedYes       -3.205e-01  3.523e-01  -0.910  0.36291    
## ApplicantIncome        -2.504e-05  7.088e-05  -0.353  0.72387    
## CoapplicantIncome       8.969e-05  1.076e-04   0.833  0.40473    
## LoanAmount             -2.993e-03  3.112e-03  -0.962  0.33614    
## Loan_Amount_Term        5.114e-04  2.471e-03   0.207  0.83606    
## Credit_History1         2.325e+00  2.542e-01   9.143  < 2e-16 ***
## Property_AreaSemiurban  8.926e-01  2.843e-01   3.140  0.00169 ** 
## Property_AreaUrban      1.488e-01  2.782e-01   0.535  0.59284    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 603.60  on 496  degrees of freedom
## Residual deviance: 486.79  on 482  degrees of freedom
## AIC: 516.79
## 
## Number of Fisher Scoring iterations: 4

Applying a step-wise algorithm to find the best model.

step(glm_model, direction = 'both', trace = 1, steps = 1000)
## Start:  AIC=516.79
## Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + 
##     ApplicantIncome + CoapplicantIncome + LoanAmount + Loan_Amount_Term + 
##     Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - Dependents         3   487.24 511.24
## - Loan_Amount_Term   1   486.83 514.83
## - ApplicantIncome    1   486.91 514.91
## - Gender             1   487.47 515.47
## - CoapplicantIncome  1   487.49 515.49
## - Self_Employed      1   487.60 515.60
## - LoanAmount         1   487.72 515.72
## - Married            1   488.33 516.33
## <none>                   486.79 516.79
## - Education          1   488.85 516.85
## - Property_Area      2   498.30 524.30
## - Credit_History     1   581.56 609.56
## 
## Step:  AIC=511.24
## Loan_Status ~ Gender + Married + Education + Self_Employed + 
##     ApplicantIncome + CoapplicantIncome + LoanAmount + Loan_Amount_Term + 
##     Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - Loan_Amount_Term   1   487.32 509.32
## - ApplicantIncome    1   487.41 509.41
## - Gender             1   487.89 509.89
## - CoapplicantIncome  1   488.00 510.00
## - Self_Employed      1   488.10 510.10
## - LoanAmount         1   488.15 510.15
## - Married            1   488.68 510.68
## <none>                   487.24 511.24
## - Education          1   489.41 511.41
## + Dependents         3   486.79 516.79
## - Property_Area      2   498.61 518.61
## - Credit_History     1   582.49 604.49
## 
## Step:  AIC=509.32
## Loan_Status ~ Gender + Married + Education + Self_Employed + 
##     ApplicantIncome + CoapplicantIncome + LoanAmount + Credit_History + 
##     Property_Area
## 
##                     Df Deviance    AIC
## - ApplicantIncome    1   487.53 507.53
## - Gender             1   487.95 507.95
## - CoapplicantIncome  1   488.05 508.05
## - LoanAmount         1   488.17 508.17
## - Self_Employed      1   488.18 508.18
## - Married            1   488.72 508.72
## <none>                   487.32 509.32
## - Education          1   489.59 509.59
## + Loan_Amount_Term   1   487.24 511.24
## + Dependents         3   486.83 514.83
## - Property_Area      2   498.81 516.81
## - Credit_History     1   582.55 602.55
## 
## Step:  AIC=507.53
## Loan_Status ~ Gender + Married + Education + Self_Employed + 
##     CoapplicantIncome + LoanAmount + Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - Gender             1   488.13 506.13
## - CoapplicantIncome  1   488.28 506.28
## - Self_Employed      1   488.53 506.53
## - Married            1   489.12 507.12
## - LoanAmount         1   489.15 507.15
## <none>                   487.53 507.53
## - Education          1   489.68 507.68
## + ApplicantIncome    1   487.32 509.32
## + Loan_Amount_Term   1   487.41 509.41
## + Dependents         3   486.97 512.97
## - Property_Area      2   499.13 515.13
## - Credit_History     1   582.66 600.66
## 
## Step:  AIC=506.13
## Loan_Status ~ Married + Education + Self_Employed + CoapplicantIncome + 
##     LoanAmount + Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - CoapplicantIncome  1   488.92 504.92
## - Self_Employed      1   489.14 505.14
## - LoanAmount         1   489.56 505.56
## <none>                   488.13 506.13
## - Education          1   490.15 506.15
## - Married            1   490.99 506.99
## + Gender             1   487.53 507.53
## + ApplicantIncome    1   487.95 507.95
## + Loan_Amount_Term   1   488.05 508.05
## + Dependents         3   487.62 511.62
## - Property_Area      2   499.20 513.20
## - Credit_History     1   584.80 600.80
## 
## Step:  AIC=504.92
## Loan_Status ~ Married + Education + Self_Employed + LoanAmount + 
##     Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - Self_Employed      1   489.81 503.81
## - LoanAmount         1   489.99 503.99
## <none>                   488.92 504.92
## - Education          1   491.02 505.02
## - Married            1   491.61 505.61
## + CoapplicantIncome  1   488.13 506.13
## + Gender             1   488.28 506.28
## + ApplicantIncome    1   488.72 506.72
## + Loan_Amount_Term   1   488.87 506.87
## + Dependents         3   488.35 510.35
## - Property_Area      2   499.79 511.79
## - Credit_History     1   584.80 598.80
## 
## Step:  AIC=503.81
## Loan_Status ~ Married + Education + LoanAmount + Credit_History + 
##     Property_Area
## 
##                     Df Deviance    AIC
## - LoanAmount         1   491.05 503.05
## <none>                   489.81 503.81
## - Education          1   491.92 503.92
## - Married            1   492.53 504.53
## + Self_Employed      1   488.92 504.92
## + CoapplicantIncome  1   489.14 505.14
## + Gender             1   489.16 505.16
## + ApplicantIncome    1   489.48 505.48
## + Loan_Amount_Term   1   489.75 505.75
## + Dependents         3   489.16 509.16
## - Property_Area      2   500.45 510.45
## - Credit_History     1   585.28 597.28
## 
## Step:  AIC=503.05
## Loan_Status ~ Married + Education + Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - Education          1   492.82 502.82
## <none>                   491.05 503.05
## - Married            1   493.20 503.20
## + LoanAmount         1   489.81 503.81
## + Self_Employed      1   489.99 503.99
## + ApplicantIncome    1   490.01 504.01
## + Gender             1   490.61 504.61
## + CoapplicantIncome  1   490.73 504.73
## + Loan_Amount_Term   1   491.02 505.02
## + Dependents         3   490.35 508.35
## - Property_Area      2   501.49 509.49
## - Credit_History     1   586.75 596.75
## 
## Step:  AIC=502.82
## Loan_Status ~ Married + Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## - Married            1   494.82 502.82
## <none>                   492.82 502.82
## + Education          1   491.05 503.05
## + Self_Employed      1   491.78 503.78
## + LoanAmount         1   491.92 503.92
## + ApplicantIncome    1   492.18 504.18
## + CoapplicantIncome  1   492.42 504.42
## + Gender             1   492.47 504.47
## + Loan_Amount_Term   1   492.73 504.73
## + Dependents         3   491.99 507.99
## - Property_Area      2   503.80 509.80
## - Credit_History     1   590.92 598.92
## 
## Step:  AIC=502.82
## Loan_Status ~ Credit_History + Property_Area
## 
##                     Df Deviance    AIC
## <none>                   494.82 502.82
## + Married            1   492.82 502.82
## + Education          1   493.20 503.20
## + Gender             1   493.58 503.58
## + Self_Employed      1   493.76 503.76
## + ApplicantIncome    1   494.10 504.10
## + LoanAmount         1   494.35 504.35
## + CoapplicantIncome  1   494.45 504.45
## + Loan_Amount_Term   1   494.79 504.79
## + Dependents         3   494.37 508.37
## - Property_Area      2   506.12 510.12
## - Credit_History     1   591.66 597.66
## 
## Call:  glm(formula = Loan_Status ~ Credit_History + Property_Area, family = binomial, 
##     data = new.loan.train)
## 
## Coefficients:
##            (Intercept)         Credit_History1  Property_AreaSemiurban  
##                -1.1512                  2.2903                  0.8811  
##     Property_AreaUrban  
##                 0.2150  
## 
## Degrees of Freedom: 496 Total (i.e. Null);  493 Residual
## Null Deviance:       603.6 
## Residual Deviance: 494.8     AIC: 502.8

There is not much of an improvement on the AIC from the step-wise algorithm, hence we use the variables with the least AIC and a few more that’s seems important.

We build the model only with these variables and compare them with the intercept only model.

glm2_model <- glm(formula = Loan_Status ~  ApplicantIncome + CoapplicantIncome + LoanAmount + Credit_History + Property_Area, family = binomial, data = new.loan.train)

glm_intercept <- glm(Loan_Status~1, data = new.loan.train, family = binomial)
anova(glm_intercept, glm2_model, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Loan_Status ~ 1
## Model 2: Loan_Status ~ ApplicantIncome + CoapplicantIncome + LoanAmount + 
##     Credit_History + Property_Area
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       496     603.60                          
## 2       490     493.39  6    110.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LSP <- data.frame(new.loan.train$Loan_Status)
LSP$new.loan.train.Loan_Status <- as.factor(LSP$new.loan.train.Loan_Status)

LSP$glm_model_predprob <- round(fitted(glm2_model),2)
LSP$glm_pred <- ifelse(LSP$glm_model_predprob > 0.6, 'Y', 'N')
LSP$glm_pred <- as.factor(LSP$glm_pred)
library(caret)
confusionMatrix(LSP$new.loan.train.Loan_Status, LSP$glm_pred, positive = 'Y')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  77  70
##          Y  36 314
##                                          
##                Accuracy : 0.7867         
##                  95% CI : (0.748, 0.8219)
##     No Information Rate : 0.7726         
##     P-Value [Acc > NIR] : 0.244920       
##                                          
##                   Kappa : 0.4512         
##                                          
##  Mcnemar's Test P-Value : 0.001349       
##                                          
##             Sensitivity : 0.8177         
##             Specificity : 0.6814         
##          Pos Pred Value : 0.8971         
##          Neg Pred Value : 0.5238         
##              Prevalence : 0.7726         
##          Detection Rate : 0.6318         
##    Detection Prevalence : 0.7042         
##       Balanced Accuracy : 0.7496         
##                                          
##        'Positive' Class : Y              
## 
library('pROC')

rocobj <- roc(LSP$new.loan.train.Loan_Status, LSP$glm_model_predprob, auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj <- ci.se(rocobj, specificities=seq(0, 1, 0.1))
plot(ciobj, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj, boot.stratified=TRUE), type="bars")

print(rocobj)
## 
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$glm_model_predprob,     auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
## 
## Data: LSP$glm_model_predprob in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.7702
## 95% CI: 0.7245-0.816 (DeLong)
print(paste0('The AUC score for the SVM classifier is: ', rocobj$auc))
## [1] "The AUC score for the SVM classifier is: 0.77022351797862"

The ROC curve is seen above for the model. The area under the curve with confidence interval shows that the model is an acceptable.

Naive Bayes Classifier

library(e1071)

nB_model <- naiveBayes(Loan_Status~., data = new.loan.train)
LSP$pred_nB <- predict(nB_model, new.loan.train, type = 'raw')

LSP$predY_nB <- ifelse(LSP$pred_nB[,2] > 0.6, 'Y', 'N')
LSP$predY_nB <- as.factor(LSP$predY_nB)

confusionMatrix(LSP$predY_nB, LSP$new.loan.train.Loan_Status, positive = 'Y')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  79  36
##          Y  68 314
##                                           
##                Accuracy : 0.7907          
##                  95% CI : (0.7523, 0.8257)
##     No Information Rate : 0.7042          
##     P-Value [Acc > NIR] : 8.159e-06       
##                                           
##                   Kappa : 0.4638          
##                                           
##  Mcnemar's Test P-Value : 0.002367        
##                                           
##             Sensitivity : 0.8971          
##             Specificity : 0.5374          
##          Pos Pred Value : 0.8220          
##          Neg Pred Value : 0.6870          
##              Prevalence : 0.7042          
##          Detection Rate : 0.6318          
##    Detection Prevalence : 0.7686          
##       Balanced Accuracy : 0.7173          
##                                           
##        'Positive' Class : Y               
## 
rocobj_nB <- roc(LSP$new.loan.train.Loan_Status, LSP$pred_nB[,2], auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj_nB <- ci.se(rocobj_nB, specificities=seq(0, 1, 0.1))
plot(ciobj_nB, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj_nB, boot.stratified=TRUE), type="bars")

print(rocobj_nB)
## 
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$pred_nB[,     2], auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE,     grid = TRUE)
## 
## Data: LSP$pred_nB[, 2] in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.7755
## 95% CI: 0.7297-0.8212 (DeLong)
print(paste0('The AUC score for the Naive Bayes classifier is: ', rocobj_nB$auc))
## [1] "The AUC score for the Naive Bayes classifier is: 0.775471331389699"

Support Vector Machines Classifier

new.loan.train$Dependents <- as.integer(new.loan.train$Dependents)
new.loan.train$CoapplicantIncome <- as.integer(new.loan.train$CoapplicantIncome)
new.loan.train$LoanAmount <- as.integer(new.loan.train$LoanAmount)
svm_model <- svm(formula = Loan_Status~., data = new.loan.train, type = 'C-classification', probability = TRUE, kernel = 'linear')
LSP$svm_pred <- predict(svm_model, new.loan.train, probability = TRUE)

confusionMatrix(LSP$svm_pred, LSP$new.loan.train.Loan_Status, positive = 'Y')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  77  36
##          Y  70 314
##                                          
##                Accuracy : 0.7867         
##                  95% CI : (0.748, 0.8219)
##     No Information Rate : 0.7042         
##     P-Value [Acc > NIR] : 2.071e-05      
##                                          
##                   Kappa : 0.4512         
##                                          
##  Mcnemar's Test P-Value : 0.001349       
##                                          
##             Sensitivity : 0.8971         
##             Specificity : 0.5238         
##          Pos Pred Value : 0.8177         
##          Neg Pred Value : 0.6814         
##              Prevalence : 0.7042         
##          Detection Rate : 0.6318         
##    Detection Prevalence : 0.7726         
##       Balanced Accuracy : 0.7105         
##                                          
##        'Positive' Class : Y              
## 
LSP$svm_pred_bool <- ifelse(LSP$svm_pred == 'Y', 1, 0)

rocobj_svm <- roc(LSP$new.loan.train.Loan_Status, LSP$svm_pred_bool, auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj_svm <- ci.se(rocobj_svm, specificities=seq(0, 1, 0.1))
plot(ciobj_svm, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj_svm, boot.stratified=TRUE), type="bars")

print(rocobj_svm)
## 
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$svm_pred_bool,     auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
## 
## Data: LSP$svm_pred_bool in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.7105
## 95% CI: 0.6669-0.754 (DeLong)
print(paste0('The AUC score for the SVM classifier is: ', rocobj_svm$auc))
## [1] "The AUC score for the SVM classifier is: 0.71047619047619"

Artificial Neural Network

library(neuralnet)
library(fastDummies)

loan.train_ann <- dummy_cols(new.loan.train, 
                             select_columns = c('Gender','Married','Education','Self_Employed','Credit_History','Property_Area','Loan_Status'), 
                             remove_first_dummy = TRUE, 
                             remove_selected_columns = TRUE)
normalize <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
loan.train_ann$Dependents <- normalize(loan.train_ann$Dependents)
loan.train_ann$ApplicantIncome <- normalize(loan.train_ann$ApplicantIncome)
loan.train_ann$CoapplicantIncome <- normalize(loan.train_ann$CoapplicantIncome)
loan.train_ann$LoanAmount <- normalize(loan.train_ann$LoanAmount)
loan.train_ann$Loan_Amount_Term <- normalize(loan.train_ann$Loan_Amount_Term)
names(loan.train_ann)[names(loan.train_ann)=='Education_Not Graduate'] <- 'Education_Not_Graduate'
ann_model <- neuralnet(Loan_Status_Y~., data = loan.train_ann, hidden = 5, rep = 3, err.fct = 'ce', linear.output = FALSE)
LSP$ann_pred <- ann_model$net.result[[1]]
LSP$ann_pred_bool <- ifelse(LSP$ann_pred > 0.6, 1, 0)
LSP$ann_pred_YN <- ifelse(LSP$ann_pred_bool == 1, 'Y', 'N')
LSP$ann_pred_YN <- as.factor(LSP$ann_pred_YN)
plot(ann_model)
confusionMatrix(LSP$ann_pred_YN, LSP$new.loan.train.Loan_Status, positive = 'Y')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N 116  32
##          Y  31 318
##                                           
##                Accuracy : 0.8732          
##                  95% CI : (0.8407, 0.9012)
##     No Information Rate : 0.7042          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6963          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9086          
##             Specificity : 0.7891          
##          Pos Pred Value : 0.9112          
##          Neg Pred Value : 0.7838          
##              Prevalence : 0.7042          
##          Detection Rate : 0.6398          
##    Detection Prevalence : 0.7022          
##       Balanced Accuracy : 0.8488          
##                                           
##        'Positive' Class : Y               
## 
rocobj_ann <- roc(LSP$new.loan.train.Loan_Status, LSP$ann_pred_bool, auc = TRUE, plot = TRUE, ci = TRUE, print.auc=TRUE, grid = TRUE)
ciobj_ann <- ci.se(rocobj_ann, specificities=seq(0, 1, 0.1))
plot(ciobj_ann, type="shape", col="#1c61b6AA")
plot(ci.sp(rocobj_ann, boot.stratified=TRUE), type="bars")

print(rocobj_ann)
## 
## Call:
## roc.default(response = LSP$new.loan.train.Loan_Status, predictor = LSP$ann_pred_bool,     auc = TRUE, ci = TRUE, plot = TRUE, print.auc = TRUE, grid = TRUE)
## 
## Data: LSP$ann_pred_bool in 147 controls (LSP$new.loan.train.Loan_Status N) < 350 cases (LSP$new.loan.train.Loan_Status Y).
## Area under the curve: 0.8488
## 95% CI: 0.8125-0.8852 (DeLong)
print(paste0('The AUC score for the ANN classifier is: ', rocobj_ann$auc))
## [1] "The AUC score for the ANN classifier is: 0.848843537414966"

From the above tested classifiers the ANN classifier perform best.

Licence

CC0: Public Domain - https://creativecommons.org/publicdomain/zero/1.0/