Introduction

Problem Scoping and Diagnosis

In this project, our clients want to know the chance that a customer will default on a loan, and to use that estimate as a parameter when deciding whether to approve or reject the loan application.

Goals and objectives of the project

The goal of this project is to build a model that classifies whether a given customer will default on a loan payment or not.

Dataset Description

Importing the data set

setwd("C:/Users/seune/Desktop/Master's Degree/Stat/Assignment")
data = read.csv('Loan_data.csv')

Checking the structure of the data

str(data)
## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

The data set consists of 614 observations of 13 variables, of which 8 are categorical (factor) variables, 4 are integer variables and 1 is a numeric variable.

Extract of the first 5 observations

head(data, n=5)
##    Loan_ID Gender Married Dependents    Education Self_Employed
## 1 LP001002   Male      No          0     Graduate            No
## 2 LP001003   Male     Yes          1     Graduate            No
## 3 LP001005   Male     Yes          0     Graduate           Yes
## 4 LP001006   Male     Yes          0 Not Graduate            No
## 5 LP001008   Male      No          0     Graduate            No
##   ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1            5849                 0         NA              360
## 2            4583              1508        128              360
## 3            3000                 0         66              360
## 4            2583              2358        120              360
## 5            6000                 0        141              360
##   Credit_History Property_Area Loan_Status
## 1              1         Urban           Y
## 2              1         Rural           N
## 3              1         Urban           Y
## 4              1         Urban           Y
## 5              1         Urban           Y

These are the first 5 observations of the data set.

Data Pre-processing

Checking the summary for blank entries and other incompleteness

summary(data)
##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1         : 13      :  3     : 15     Graduate    :480  
##  LP001003:  1   Female:112   No :213   0 :345     Not Graduate:134  
##  LP001005:  1   Male  :489   Yes:398   1 :102                       
##  LP001006:  1                          2 :101                       
##  LP001008:  1                          3+: 51                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     : 32       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :500       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 82       Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

The summary reveals that some variables contain blank entries. For example, Dependents has 15 blanks, Married has 3, Gender has 13 and Self_Employed has 32.

Moreover, the summary statistics give a first view of the skewness of the numeric variables, i.e. how close to or far from the median (the middle value) the mean is.
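
As a rough illustration of that comparison, the mean and median values printed by summary(data) above can be put side by side. The data frame `stats` and the column `mean_over_median` are names made up for this sketch; the numbers are copied from the output.

```r
# Mean and median as reported by summary(data) above
stats <- data.frame(
  variable = c("ApplicantIncome", "CoapplicantIncome", "LoanAmount"),
  mean     = c(5403, 1621, 146.4),
  median   = c(3812, 1188, 128.0)
)

# A ratio above 1 means the mean is pulled up by large values (right skew)
stats$mean_over_median <- round(stats$mean / stats$median, 2)
print(stats)
```

All three ratios are above 1, which suggests that these variables are right-skewed.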

Replacing blank entries with NA

data[data==""] <- NA

The blank entries have now been replaced with NA, which R treats as a missing value.
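
An alternative, sketched here on a small inline CSV rather than on Loan_data.csv, is to declare blank fields as missing at import time through the na.strings argument of read.csv. The toy columns and values below are made up for illustration. A side effect of this approach is that the empty string never becomes a factor level.

```r
# Hypothetical three-row CSV with two blank fields
csv_text <- "Gender,Married,LoanAmount
Male,Yes,128
,No,66
Female,,120"

# Treat both empty strings and the literal "NA" as missing values
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA"),
                stringsAsFactors = TRUE)

colSums(is.na(toy))   # missing values per column
levels(toy$Gender)    # no "" level among the factor levels
```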

Checking for Missing Data

sum(is.na(data))
## [1] 149

This shows that there are 149 missing values in the data set.
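
The total can be cross-checked against the per-column counts of blanks and NA's shown in the first summary. The vector `na_counts` is simply typed in from that output; on the real data, colSums(is.na(data)) should give the same breakdown.

```r
# Blank / NA counts per variable, copied from the summary output
na_counts <- c(Gender = 13, Married = 3, Dependents = 15,
               Self_Employed = 32, LoanAmount = 22,
               Loan_Amount_Term = 14, Credit_History = 50)

total_missing <- sum(na_counts)
total_missing

# Variable with the most missing values
names(which.max(na_counts))
```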

Summary of the data

summary(data)
##      Loan_ID       Gender    Married    Dependents        Education  
##  LP001002:  1         :  0       :  0       :  0   Graduate    :480  
##  LP001003:  1   Female:112   No  :213   0   :345   Not Graduate:134  
##  LP001005:  1   Male  :489   Yes :398   1   :102                     
##  LP001006:  1   NA's  : 13   NA's:  3   2   :101                     
##  LP001008:  1                           3+  : 51                     
##  LP001011:  1                           NA's: 15                     
##  (Other) :608                                                        
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##      :  0      Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No  :500      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes : 82      Median : 3812   Median : 1188     Median :128.0  
##  NA's: 32      Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

The summary now clearly shows the missing values and the variables that contain them. For instance, Gender now has 13 NA's, where it previously had 13 blank entries.

Handling missing values using the kNN imputation method

library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
# Impute only the columns that contain missing values, using k = 7 nearest neighbours
data1 <- kNN(data,variable = c("Gender","Married","Dependents","Self_Employed","LoanAmount",
                              "Loan_Amount_Term","Credit_History"), k = 7)
summary(data1)
##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1         :  0      :  0     :  0     Graduate    :480  
##  LP001003:  1   Female:114   No :213   0 :355     Not Graduate:134  
##  LP001005:  1   Male  :500   Yes:401   1 :102                       
##  LP001006:  1                          2 :103                       
##  LP001008:  1                          3+: 54                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     :  0       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :532       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 82       Median : 3812   Median : 1188     Median :126.5  
##                Mean   : 5403   Mean   : 1621     Mean   :145.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:165.8  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                                 
##  Loan_Amount_Term Credit_History    Property_Area Loan_Status
##  Min.   : 12.0    Min.   :0.000   Rural    :179   N:192      
##  1st Qu.:360.0    1st Qu.:1.000   Semiurban:233   Y:422      
##  Median :360.0    Median :1.000   Urban    :202              
##  Mean   :342.4    Mean   :0.855                              
##  3rd Qu.:360.0    3rd Qu.:1.000                              
##  Max.   :480.0    Max.   :1.000                              
##                                                              
##  Gender_imp      Married_imp     Dependents_imp  Self_Employed_imp
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical    
##  FALSE:601       FALSE:611       FALSE:599       FALSE:582        
##  TRUE :13        TRUE :3         TRUE :15        TRUE :32         
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  LoanAmount_imp  Loan_Amount_Term_imp Credit_History_imp
##  Mode :logical   Mode :logical        Mode :logical     
##  FALSE:592       FALSE:600            FALSE:564         
##  TRUE :22        TRUE :14             TRUE :50          
##                                                         
##                                                         
##                                                         
## 

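Subsetting the data set

kNN() appends one logical *_imp column per imputed variable (see the summary above). The subset below keeps only the original 13 columns.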

data1 <- subset(data1, select = Loan_ID:Loan_Status)

sum(is.na(data1))
## [1] 0

The data set now has 0 missing values.
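
The idea behind the kNN imputation used above can be sketched in a few lines of base R. This is not VIM's implementation: as far as I know, kNN() uses a distance that handles mixed variable types and aggregates numeric neighbours with the median by default, whereas this sketch uses a single numeric predictor and the absolute difference as distance. The function `impute_knn_median` and the toy data are hypothetical.

```r
# Toy data: income is complete, loan has one missing value
toy <- data.frame(income = c(10, 12, 11, 50, 52, 13),
                  loan   = c(100, 110, NA, 400, 420, 120))

impute_knn_median <- function(df, target, predictor, k = 3) {
  miss <- which(is.na(df[[target]]))    # rows to impute
  obs  <- which(!is.na(df[[target]]))   # rows that can act as donors
  for (i in miss) {
    # Distance from row i to every donor row on the predictor
    d  <- abs(df[[predictor]][obs] - df[[predictor]][i])
    # Indices of the k closest donors
    nn <- obs[order(d)[seq_len(min(k, length(obs)))]]
    # Impute with the median of the donors' target values
    df[[target]][i] <- median(df[[target]][nn])
  }
  df
}

toy_imputed <- impute_knn_median(toy, "loan", "income", k = 3)
toy_imputed
```

Row 3 has an income of 11, so its three nearest donors are the rows with incomes 10, 12 and 13, and the imputed loan is the median of their loans.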

Summary of the data

library(psych)
describe(data1)
##                   vars   n    mean      sd median trimmed     mad min
## Loan_ID*             1 614  307.50  177.39  307.5  307.50  227.58   1
## Gender*              2 614    2.81    0.39    3.0    2.89    0.00   2
## Married*             3 614    2.65    0.48    3.0    2.69    0.00   2
## Dependents*          4 614    2.77    1.02    2.0    2.60    0.00   2
## Education*           5 614    1.22    0.41    1.0    1.15    0.00   1
## Self_Employed*       6 614    2.13    0.34    2.0    2.04    0.00   2
## ApplicantIncome      7 614 5403.46 6109.04 3812.5 4292.06 1822.86 150
## CoapplicantIncome    8 614 1621.25 2926.25 1188.5 1154.85 1762.07   0
## LoanAmount           9 614  145.39   84.40  126.5  132.30   45.96   9
## Loan_Amount_Term    10 614  342.41   64.43  360.0  358.54    0.00  12
## Credit_History      11 614    0.86    0.35    1.0    0.94    0.00   0
## Property_Area*      12 614    2.04    0.79    2.0    2.05    1.48   1
## Loan_Status*        13 614    1.69    0.46    2.0    1.73    0.00   1
##                     max range  skew kurtosis     se
## Loan_ID*            614   613  0.00    -1.21   7.16
## Gender*               3     1 -1.61     0.60   0.02
## Married*              3     1 -0.64    -1.59   0.02
## Dependents*           5     3  0.97    -0.45   0.04
## Education*            2     1  1.36    -0.15   0.02
## Self_Employed*        3     1  2.15     2.62   0.01
## ApplicantIncome   81000 80850  6.51    59.83 246.54
## CoapplicantIncome 41667 41667  7.45    83.97 118.09
## LoanAmount          700   691  2.71    10.66   3.41
## Loan_Amount_Term    480   468 -2.39     6.83   2.60
## Credit_History        1     1 -2.01     2.05   0.01
## Property_Area*        3     2 -0.07    -1.39   0.03
## Loan_Status*          2     1 -0.81    -1.35   0.02

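describe() gives us a broader range of summary statistics, including skewness and kurtosis. Variables marked with * are factors that describe() has converted to integer codes, so their statistics should not be read as numeric summaries.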

Exploratory Data Analysis

Correlation Matrix

Checking for correlation and multicollinearity between the variables

library(psych)
# Scatterplot matrix with correlations; points are coloured by Loan_Status
pairs.panels(data1,
             gap = 0,
             bg = c("red", "green", "blue")[data1$Loan_Status],
             pch = 21)
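
For a purely numeric view of the same information, a correlation matrix can be computed with cor() on the numeric columns only. The sketch below uses simulated data, not the loan data; the column names and the relationship between `income` and `loan` are assumptions made for illustration.

```r
set.seed(1)

# Simulated data: loan depends on income, area is categorical
toy <- data.frame(income = rnorm(50, mean = 5000, sd = 1000))
toy$loan <- 0.02 * toy$income + rnorm(50, mean = 0, sd = 5)
toy$area <- factor(sample(c("Rural", "Urban"), 50, replace = TRUE))

# Keep numeric columns only, then compute pairwise correlations
num_cols <- sapply(toy, is.numeric)
cor_mat  <- cor(toy[, num_cols], use = "complete.obs")
round(cor_mat, 2)
```

Pairs of predictors with a correlation close to 1 or -1 would be candidates for a closer look at multicollinearity.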

Checking for outliers

Using box plots

boxplot(data1$ApplicantIncome, horizontal = TRUE, main = "Boxplot for Applicant Income")

boxplot(data1$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income")

boxplot(data1$LoanAmount, horizontal = TRUE, main = "Boxplot for LoanAmount")

Outlier treatment for ApplicantIncome, CoapplicantIncome and LoanAmount

ApplicantIncome

bench <- 5795 + 1.5*IQR(data1$ApplicantIncome) # upper fence: Q3 + 1.5*IQR, with Q3 = 5795 from the summary
bench
## [1] 10171.25
# Winsorizing: values above the upper fence are capped at the fence
data1$ApplicantIncome[data1$ApplicantIncome > bench]
##  [1] 12841 12500 11500 10750 13650 11417 14583 10408 23803 10513 20166
## [12] 14999 11757 14866 39999 51763 33846 39147 12000 11000 16250 14683
## [23] 11146 14583 20667 20233 15000 63337 19730 15759 81000 14880 12876
## [34] 10416 37719 16692 16525 16667 10833 18333 17263 20833 13262 17500
## [45] 11250 18165 19484 16666 16120 12000
data1$ApplicantIncome[data1$ApplicantIncome > bench] <- bench
summary(data1$ApplicantIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     150    2878    3812    4617    5795   10171
boxplot(data1$ApplicantIncome, main = "Boxplot for ApplicantIncome")

length(data1$ApplicantIncome)
## [1] 614

CoapplicantIncome

bench <- 2297 + 1.5*IQR(data1$CoapplicantIncome) # upper fence: Q3 + 1.5*IQR, with Q3 = 2297 from the summary
bench
## [1] 5742.875
# Winsorizing: values above the upper fence are capped at the fence
data1$CoapplicantIncome[data1$CoapplicantIncome > bench]
##  [1] 10968  8106  7210  8980  7750 11300  7250  7101  6250  7873 20000
## [12] 20000  8333  6667  6666  7166 33837 41667
data1$CoapplicantIncome[data1$CoapplicantIncome > bench] <- bench
summary(data1$CoapplicantIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    1188    1420    2297    5743
boxplot(data1$CoapplicantIncome, main = "Boxplot for Co-ApplicantIncome")

length(data1$CoapplicantIncome)
## [1] 614

LoanAmount

bench <- 165.8 + 1.5*IQR(data1$LoanAmount) # upper fence: Q3 + 1.5*IQR, with Q3 = 165.8 (rounded) from the summary
bench
## [1] 264.425
# Winsorizing: values above the upper fence are capped at the fence
data1$LoanAmount[data1$LoanAmount > bench]
##  [1] 267 349 315 320 286 312 265 370 650 290 600 275 700 495 280 279 304
## [18] 330 436 480 300 376 490 308 570 380 296 275 360 405 500 480 311 480
## [35] 400 324 600 275 292 350 496
data1$LoanAmount[data1$LoanAmount > bench] <- bench
summary(data1$LoanAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   100.0   126.5   137.2   165.8   264.4
boxplot(data1$LoanAmount, main = "Boxplot for LoanAmount")

length(data1$LoanAmount)
## [1] 614

The upper outliers in all three variables have now been capped, and the data set is reasonably clean.
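
The three treatments above repeat the same steps with a hand-typed third quartile. A possible refactoring, shown on a made-up vector, computes the quartile with quantile() so that no value has to be copied from the summary. The helper name `cap_upper` is hypothetical.

```r
# Cap every value above Q3 + 1.5 * IQR at that upper fence
cap_upper <- function(x) {
  q3    <- quantile(x, 0.75, names = FALSE)
  fence <- q3 + 1.5 * IQR(x)
  pmin(x, fence)
}

x      <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)   # 100 is an obvious outlier
capped <- cap_upper(x)
capped
```

Here Q3 is 7 and the IQR is 4, so the fence is 13 and only the value 100 is changed. Because the quartile is not rounded, results on the loan data may differ slightly from the fences computed above.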

Checking for class imbalance

prop.table(table(data1$Loan_Status))
## 
##         N         Y 
## 0.3127036 0.6872964
table(data1$Loan_Status)
## 
##   N   Y 
## 192 422

Class imbalance arises, mostly in classification problems, when one class of the response variable has far fewer observations than the other.

In this data set, 68.7% of the response values are Y and 31.3% are N. The classes are unequal, but not severely so; hence, we conclude that class imbalance is not a serious problem here and proceed without rebalancing.
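
One practical consequence of these proportions is the baseline they set: a model that always predicts the majority class is already right for every Y observation. The sketch below derives that baseline from the counts printed above; the variable names are made up.

```r
# Class counts from table(data1$Loan_Status)
counts <- c(N = 192, Y = 422)

props <- counts / sum(counts)
round(props, 4)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- max(props)
baseline_accuracy
```

Any model worth keeping should beat this baseline on unseen data.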

Train and Test set

set.seed(222)
split = sample(2,nrow(data1),prob = c(0.75,0.25),replace = TRUE)
train_set = data1[split == 1,]
test_set = data1[split == 2,]

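It is usual practice in machine learning to divide the data set into a training set and a test set. The model is built on the training set and its performance is evaluated on the test set. Here each row is assigned to the training set with probability 0.75, which yields 472 training and 142 test observations.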
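
Because the split above assigns each row independently, the resulting proportions are only approximately 75/25. If an exact split is preferred, one option is to sample row indices without replacement, as in this sketch. It works on row numbers only, and the names `train_idx` and `test_idx` are hypothetical.

```r
set.seed(222)
n <- 614   # number of observations in the data set

# Draw exactly 75% of the row numbers for training
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
# All remaining rows go to the test set
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)
```

The data frames would then be obtained with data1[train_idx, ] and data1[test_idx, ].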

Logistic Regression

Logistic regression uses the sigmoid (logistic) function to map a linear combination of the predictors to a probability between 0 and 1, which is then compared with a threshold to assign a class. It is therefore applicable to classification problems. Other applicable models for classification problems include Decision Tree, Random Forest, Naive Bayes and Neural Network.

For the purpose of this project, we will use Decision Tree and Random Forest along with Logistic Regression.
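
To make the role of the sigmoid concrete, here is a minimal sketch of the function and of the thresholding step. The linear scores in `z` are arbitrary example values, not output of the model fitted below; `sigmoid` is a hand-written helper that matches base R's plogis().

```r
# Sigmoid: maps any real number to the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-3, 0, 2)          # hypothetical linear predictor values
p <- sigmoid(z)           # corresponding probabilities
round(p, 3)

# Classify as "Y" when the probability exceeds 0.5
ifelse(p > 0.5, "Y", "N")
```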

# Fitting Logistic Regression to the training set (column 1, Loan_ID, is dropped)
logistics_classifier = glm(formula = Loan_Status ~ .,
                           family = binomial,
                           data = train_set[,-c(1)])

summary(logistics_classifier)
## 
## Call:
## glm(formula = Loan_Status ~ ., family = binomial, data = train_set[, 
##     -c(1)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3963  -0.2804   0.5099   0.6753   2.9553  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.114e+00  1.096e+00  -2.841  0.00449 ** 
## GenderMale             -5.206e-01  3.599e-01  -1.446  0.14807    
## MarriedYes              8.692e-01  2.976e-01   2.920  0.00350 ** 
## Dependents1            -6.381e-01  3.391e-01  -1.882  0.05988 .  
## Dependents2             3.567e-01  4.166e-01   0.856  0.39181    
## Dependents3+            1.583e-01  5.129e-01   0.309  0.75752    
## EducationNot Graduate  -3.566e-01  3.065e-01  -1.163  0.24470    
## Self_EmployedYes        3.214e-01  3.994e-01   0.805  0.42089    
## ApplicantIncome         1.888e-05  7.695e-05   0.245  0.80615    
## CoapplicantIncome       8.308e-05  9.532e-05   0.872  0.38344    
## LoanAmount             -4.518e-03  3.277e-03  -1.379  0.16795    
## Loan_Amount_Term       -6.262e-04  2.162e-03  -0.290  0.77210    
## Credit_History          4.672e+00  6.191e-01   7.547 4.45e-14 ***
## Property_AreaSemiurban  8.172e-01  3.135e-01   2.607  0.00914 ** 
## Property_AreaUrban      3.953e-01  3.096e-01   1.277  0.20165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 591.70  on 471  degrees of freedom
## Residual deviance: 409.09  on 457  degrees of freedom
## AIC: 439.09
## 
## Number of Fisher Scoring iterations: 5

Based on the output of the logistic regression, only 3 predictors are significant at the 5% level (MarriedYes, Credit_History and Property_AreaSemiurban), while the others are not.

Credit_History is by far the strongest predictor, which is in tune with the intuition that credit history is an important factor in deciding whether a client will default or not. Whether the customer is married or not is also a significant factor, as far as this data set is concerned.
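
Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the three significant predictors, with the estimates typed in from the summary output above. On the fitted object, exp(coef(logistics_classifier)) should give the same quantities for all terms.

```r
# Estimates copied from the coefficient table above
coefs <- c(MarriedYes             = 0.8692,
           Credit_History         = 4.672,
           Property_AreaSemiurban = 0.8172)

# Odds ratio: multiplicative change in the odds of Loan_Status == "Y"
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

The odds ratio for Credit_History is above 100, while those for MarriedYes and Property_AreaSemiurban are a little above 2.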

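Prediction using the Logistic Regression model

# Predicting the test set results
# prob_pred is the predicted probability of Loan_Status == "Y" (the second factor level)
prob_pred = predict(logistics_classifier, type = 'response', newdata = test_set)
# Class labels at a 0.5 threshold: 1 corresponds to "Y", 0 to "N"
y_pred = ifelse(prob_pred > 0.5, 1, 0)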

Confusion Matrix

Estimating the performance of the model

cm = table(ActualValue=test_set$Loan_Status, PredictedValue=prob_pred > 0.5)
cm
##            PredictedValue
## ActualValue FALSE TRUE
##           N    15   26
##           Y     4   97
# Accuracy: share of correctly classified test observations
sum(diag(cm))/sum(cm)
## [1] 0.7887324

Logistic regression achieves an accuracy of 78.87% on the test set, which means that we can expect the model to classify about 8 in every 10 observations correctly. Note, however, that 26 of the 41 actual N cases are predicted as Y.
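
Accuracy alone hides how differently the two classes are handled. As a sketch, the confusion matrix printed above can be re-entered by hand to compute the share of correctly classified observations within each actual class. The object names are made up for this example.

```r
# Confusion matrix from the output above (rows: actual, columns: predicted)
cm <- matrix(c(15, 4, 26, 97), nrow = 2,
             dimnames = list(ActualValue    = c("N", "Y"),
                             PredictedValue = c("N", "Y")))

accuracy <- sum(diag(cm)) / sum(cm)

# Share of each actual class that is predicted correctly
recall_by_class <- diag(cm) / rowSums(cm)

round(accuracy, 4)
round(recall_by_class, 4)
```

Almost all actual Y cases are recovered, but only 15 of the 41 actual N cases are.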

Decision Tree

library(party)
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
Tree_Classifer = ctree(Loan_Status ~ .,
                       data = train_set[,-c(1)])
Tree_Classifer
## 
##   Conditional inference tree with 3 terminal nodes
## 
## Response:  Loan_Status 
## Inputs:  Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area 
## Number of observations:  472 
## 
## 1) Credit_History <= 0; criterion = 1, statistic = 153.068
##   2)*  weights = 70 
## 1) Credit_History > 0
##   3) Married == {No}; criterion = 0.962, statistic = 10.38
##     4)*  weights = 133 
##   3) Married == {Yes}
##     5)*  weights = 269
plot(Tree_Classifer)

The decision tree also corroborates the logistic regression by selecting Credit_History as its first split, i.e. as the most important variable when deciding whether a customer is going to default or not. Like the logistic regression, it then turns to Married.

Prediction using the Decision Tree

pred = predict(Tree_Classifer,newdata = test_set)

cm = table(ActualValue=test_set$Loan_Status, PredictedValue=pred)
cm
##            PredictedValue
## ActualValue  N  Y
##           N 15 26
##           Y  4 97

Confusion Matrix

Estimating the accuracy

sum(diag(cm))/sum(cm)
## [1] 0.7887324

The Decision Tree produces exactly the same confusion matrix as the logistic regression on the test set, and therefore the same accuracy of 78.87%.

Random Forest

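Random Forest is an ensemble method: instead of relying on a single tree, it grows many decision trees (500 here, the randomForest default) and combines their predictions by majority vote. Each tree is grown on a bootstrap sample of the training data, and only a random subset of the predictors is considered at each split. This randomness is the reason the model is called a Random Forest.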
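
A few numbers in the model output below follow directly from this construction, and the sketch here reproduces them with plain arithmetic. The sizes are taken from the output (472 training rows, 11 predictors, 500 trees); the five tree votes are invented to illustrate the majority vote.

```r
n_train <- 472   # training observations
n_tree  <- 500   # number of trees
p_pred  <- 11    # number of predictors

# Probability that a given row is left out of one bootstrap sample
p_oob <- (1 - 1 / n_train)^n_train
# Expected number of trees for which a row is out-of-bag
expected_oob_times <- n_tree * p_oob

# Default number of predictors tried per split for classification
mtry_default <- floor(sqrt(p_pred))

# Majority vote over hypothetical predictions of five trees for one customer
tree_votes <- c("Y", "N", "Y", "Y", "N")
forest_prediction <- names(which.max(table(tree_votes)))

round(p_oob, 3)
round(expected_oob_times)
mtry_default
forest_prediction
```

The expected out-of-bag count is in line with the oob.times values in the str() output, and the default of 3 predictors per split matches the reported mtry.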

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
set.seed(153)
rf_classifier <- randomForest(Loan_Status ~ ., data = train_set[,-c(1)])

str(rf_classifier)
## List of 19
##  $ call           : language randomForest(formula = Loan_Status ~ ., data = train_set[, -c(1)])
##  $ type           : chr "classification"
##  $ predicted      : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 2 2 2 ...
##   ..- attr(*, "names")= chr [1:472] "2" "3" "4" "7" ...
##  $ err.rate       : num [1:500, 1:3] 0.312 0.33 0.344 0.291 0.297 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:3] "OOB" "N" "Y"
##  $ confusion      : num [1:2, 1:3] 70 16 81 305 0.536 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "N" "Y"
##   .. ..$ : chr [1:3] "N" "Y" "class.error"
##  $ votes          : 'matrix' num [1:472, 1:2] 0.2067 0.1758 0.0435 0.1489 0.8128 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:472] "2" "3" "4" "7" ...
##   .. ..$ : chr [1:2] "N" "Y"
##  $ oob.times      : num [1:472] 208 182 184 188 187 165 199 175 183 169 ...
##  $ classes        : chr [1:2] "N" "Y"
##  $ importance     : num [1:11, 1] 3.66 5.09 9.96 4.26 2.89 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:11] "Gender" "Married" "Dependents" "Education" ...
##   .. ..$ : chr "MeanDecreaseGini"
##  $ importanceSD   : NULL
##  $ localImportance: NULL
##  $ proximity      : NULL
##  $ ntree          : num 500
##  $ mtry           : num 3
##  $ forest         :List of 14
##   ..$ ndbigtree : int [1:500] 173 161 149 195 151 175 163 139 169 183 ...
##   ..$ nodestatus: int [1:233, 1:500] 1 1 1 -1 1 1 1 -1 1 1 ...
##   ..$ bestvar   : int [1:233, 1:500] 6 6 10 0 2 7 6 0 6 6 ...
##   ..$ treemap   : int [1:233, 1:2, 1:500] 2 4 6 0 8 10 12 0 14 16 ...
##   ..$ nodepred  : int [1:233, 1:500] 0 0 0 2 0 0 0 1 0 0 ...
##   ..$ xbestsplit: num [1:233, 1:500] 1903 837 0.5 0 2 ...
##   ..$ pid       : num [1:2] 1 1
##   ..$ cutoff    : num [1:2] 0.5 0.5
##   ..$ ncat      : Named int [1:11] 3 3 5 2 3 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:11] "Gender" "Married" "Dependents" "Education" ...
##   ..$ maxcat    : int 5
##   ..$ nrnodes   : int 233
##   ..$ ntree     : num 500
##   ..$ nclass    : int 2
##   ..$ xlevels   :List of 11
##   .. ..$ Gender           : chr [1:3] "" "Female" "Male"
##   .. ..$ Married          : chr [1:3] "" "No" "Yes"
##   .. ..$ Dependents       : chr [1:5] "" "0" "1" "2" ...
##   .. ..$ Education        : chr [1:2] "Graduate" "Not Graduate"
##   .. ..$ Self_Employed    : chr [1:3] "" "No" "Yes"
##   .. ..$ ApplicantIncome  : num 0
##   .. ..$ CoapplicantIncome: num 0
##   .. ..$ LoanAmount       : num 0
##   .. ..$ Loan_Amount_Term : num 0
##   .. ..$ Credit_History   : num 0
##   .. ..$ Property_Area    : chr [1:3] "Rural" "Semiurban" "Urban"
##  $ y              : Factor w/ 2 levels "N","Y": 1 2 2 2 1 2 1 2 2 2 ...
##   ..- attr(*, "names")= chr [1:472] "2" "3" "4" "7" ...
##  $ test           : NULL
##  $ inbag          : NULL
##  $ terms          :Classes 'terms', 'formula'  language Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed +      ApplicantIncome + CoapplicantIncom| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(Loan_Status, Gender, Married, Dependents, Education, Self_Employed,      ApplicantIncome, CoapplicantIncome,| __truncated__ ...
##   .. ..- attr(*, "factors")= int [1:12, 1:11] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:12] "Loan_Status" "Gender" "Married" "Dependents" ...
##   .. .. .. ..$ : chr [1:11] "Gender" "Married" "Dependents" "Education" ...
##   .. ..- attr(*, "term.labels")= chr [1:11] "Gender" "Married" "Dependents" "Education" ...
##   .. ..- attr(*, "order")= int [1:11] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= num 0
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(Loan_Status, Gender, Married, Dependents, Education, Self_Employed,      ApplicantIncome, CoapplicantIncome,| __truncated__ ...
##   .. ..- attr(*, "dataClasses")= Named chr [1:12] "factor" "factor" "factor" "factor" ...
##   .. .. ..- attr(*, "names")= chr [1:12] "Loan_Status" "Gender" "Married" "Dependents" ...
##  - attr(*, "class")= chr [1:2] "randomForest.formula" "randomForest"
attributes(rf_classifier)
## $names
##  [1] "call"            "type"            "predicted"      
##  [4] "err.rate"        "confusion"       "votes"          
##  [7] "oob.times"       "classes"         "importance"     
## [10] "importanceSD"    "localImportance" "proximity"      
## [13] "ntree"           "mtry"            "forest"         
## [16] "y"               "test"            "inbag"          
## [19] "terms"          
## 
## $class
## [1] "randomForest.formula" "randomForest"

Confusion Matrix

Estimating the performance of the model

rf_pred = predict(rf_classifier,test_set)

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
confusionMatrix(rf_pred,test_set$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 16  6
##          Y 25 95
##                                           
##                Accuracy : 0.7817          
##                  95% CI : (0.7047, 0.8466)
##     No Information Rate : 0.7113          
##     P-Value [Acc > NIR] : 0.036626        
##                                           
##                   Kappa : 0.3836          
##  Mcnemar's Test P-Value : 0.001225        
##                                           
##             Sensitivity : 0.3902          
##             Specificity : 0.9406          
##          Pos Pred Value : 0.7273          
##          Neg Pred Value : 0.7917          
##              Prevalence : 0.2887          
##          Detection Rate : 0.1127          
##    Detection Prevalence : 0.1549          
##       Balanced Accuracy : 0.6654          
##                                           
##        'Positive' Class : N               
## 
plot(rf_classifier)

The Random Forest reaches an accuracy of 78.17% on the test set, marginally below the two previous models. Note that caret treats N as the positive class here, so the sensitivity of 0.39 refers to the share of actual N cases that are correctly identified.

Determining the most important variable in the forest

varImpPlot(rf_classifier)

importance(rf_classifier)
##                   MeanDecreaseGini
## Gender                    3.660779
## Married                   5.090343
## Dependents                9.955185
## Education                 4.264957
## Self_Employed             2.894202
## ApplicantIncome          33.028981
## CoapplicantIncome        19.163331
## LoanAmount               30.289876
## Loan_Amount_Term          9.729408
## Credit_History           58.327543
## Property_Area             9.865506

The Random Forest model also ranks Credit_History as the most important variable, just like the two previous models. However, while Random Forest ranks ApplicantIncome as the next most important variable, logistic regression and the decision tree point to Married.
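
The ranking can be read off more easily when the importance values are sorted. The sketch below re-enters the MeanDecreaseGini values from the output above; on the fitted model, the same ordering could be obtained by sorting the column returned by importance(rf_classifier).

```r
# MeanDecreaseGini values copied from importance(rf_classifier)
gini <- c(Gender = 3.660779, Married = 5.090343, Dependents = 9.955185,
          Education = 4.264957, Self_Employed = 2.894202,
          ApplicantIncome = 33.028981, CoapplicantIncome = 19.163331,
          LoanAmount = 30.289876, Loan_Amount_Term = 9.729408,
          Credit_History = 58.327543, Property_Area = 9.865506)

# Sort from most to least important
ranked <- sort(gini, decreasing = TRUE)
round(ranked, 2)

# Three most important predictors
head(names(ranked), 3)
```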

Conclusion

Based on the performance of the Logistic Regression, Decision Tree and Random Forest models, we can conclude that, provided adequate pre-processing is carefully carried out, these models can perform reasonably well on a classification problem: all three reach a test accuracy of about 78 to 79%.

There are other, more advanced models, such as other ensemble methods and neural networks, that can also perform very well on classification problems, but care must be taken to avoid over-fitting in the course of achieving high accuracy.

Comments and suggestions are welcome.

Thanks. Owolabi Ebenezer