Problem Scoping and Diagnosis
In this project, our client wants to know the probability that a customer will default on their loan payments, and to use that probability as a criterion for approving or rejecting the loan.
Goals and objectives of the project
The goal of this project is to build a model that classifies whether a given customer will default on their loan payment.
Dataset Description
setwd("C:/Users/seune/Desktop/Master's Degree/Stat/Assignment")
data = read.csv('Loan_data.csv')
str(data)
## 'data.frame': 614 obs. of 13 variables:
## $ Loan_ID : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
The data set consists of 614 observations of 13 variables, of which 8 are categorical (factors), 4 are integer and 1 is numeric.
head(data, n=5)
## Loan_ID Gender Married Dependents Education Self_Employed
## 1 LP001002 Male No 0 Graduate No
## 2 LP001003 Male Yes 1 Graduate No
## 3 LP001005 Male Yes 0 Graduate Yes
## 4 LP001006 Male Yes 0 Not Graduate No
## 5 LP001008 Male No 0 Graduate No
## ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term
## 1 5849 0 NA 360
## 2 4583 1508 128 360
## 3 3000 0 66 360
## 4 2583 2358 120 360
## 5 6000 0 141 360
## Credit_History Property_Area Loan_Status
## 1 1 Urban Y
## 2 1 Rural N
## 3 1 Urban Y
## 4 1 Urban Y
## 5 1 Urban Y
These are the first 5 observations of the data set.
Checking the structure for possible incompleteness
summary(data)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 13 : 3 : 15 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes:398 1 :102
## LP001006: 1 2 :101
## LP001008: 1 3+: 51
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 32 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
The summary reveals that there are some blank entries: for example, Dependents has 15 blanks, Married has 3, Gender has 13, and so on.
Moreover, the summary statistics give a view of the skewness of the numeric variables, i.e. how close or far the mean is from the median. ApplicantIncome, for instance, has a mean of 5403 but a median of only 3812, which points to a strong right skew.
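As a quick numeric check of that skewness, we can compare the mean and median of each numeric variable side by side (a minimal sketch in base R, using the column names from str(data) above):
sapply(data[c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")],
       function(x) c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE)))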
Replacing blank spaces with NAs
data[data==""] <- NA
We have now replaced the blank spaces with NA's, which R will treat as missing values.
Checking for Missing Data
sum(is.na(data))
## [1] 149
This shows that there are 149 missing values in the data set.
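For a per-column breakdown of where those 149 missing values sit, a base-R one-liner suffices (a sketch; note that at this stage the former blanks are counted as NA too):
colSums(is.na(data))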
Summary of the data
summary(data)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 0 : 0 : 0 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes :398 1 :102
## LP001006: 1 NA's : 13 NA's: 3 2 :101
## LP001008: 1 3+ : 51
## LP001011: 1 NA's: 15
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 0 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes : 82 Median : 3812 Median : 1188 Median :128.0
## NA's: 32 Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
The summary statistics now clearly show the missing values and the variables that contain them. For instance, Gender now has 13 NA's, matching the 13 blank entries it had earlier.
Handling missing values using the kNN imputation method
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# Impute the columns that contain missing values, using the 7 nearest neighbours
data1 <- kNN(data, variable = c("Gender","Married","Dependents","Self_Employed",
                                "LoanAmount","Loan_Amount_Term","Credit_History"),
             k = 7)
summary(data1)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 0 : 0 : 0 Graduate :480
## LP001003: 1 Female:114 No :213 0 :355 Not Graduate:134
## LP001005: 1 Male :500 Yes:401 1 :102
## LP001006: 1 2 :103
## LP001008: 1 3+: 54
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 0 Min. : 150 Min. : 0 Min. : 9.0
## No :532 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :126.5
## Mean : 5403 Mean : 1621 Mean :145.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:165.8
## Max. :81000 Max. :41667 Max. :700.0
##
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12.0 Min. :0.000 Rural :179 N:192
## 1st Qu.:360.0 1st Qu.:1.000 Semiurban:233 Y:422
## Median :360.0 Median :1.000 Urban :202
## Mean :342.4 Mean :0.855
## 3rd Qu.:360.0 3rd Qu.:1.000
## Max. :480.0 Max. :1.000
##
## Gender_imp Married_imp Dependents_imp Self_Employed_imp
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:601 FALSE:611 FALSE:599 FALSE:582
## TRUE :13 TRUE :3 TRUE :15 TRUE :32
##
##
##
##
## LoanAmount_imp Loan_Amount_Term_imp Credit_History_imp
## Mode :logical Mode :logical Mode :logical
## FALSE:592 FALSE:600 FALSE:564
## TRUE :22 TRUE :14 TRUE :50
##
##
##
##
Subsetting the data set
kNN() appends a logical *_imp indicator column for each imputed variable (visible at the end of the summary above), so we keep only the original columns:
data1 <- subset(data1, select = Loan_ID:Loan_Status)
sum(is.na(data1))
## [1] 0
The data set now has 0 missing values.
Summary of the data
library(psych)
describe(data1)
## vars n mean sd median trimmed mad min
## Loan_ID* 1 614 307.50 177.39 307.5 307.50 227.58 1
## Gender* 2 614 2.81 0.39 3.0 2.89 0.00 2
## Married* 3 614 2.65 0.48 3.0 2.69 0.00 2
## Dependents* 4 614 2.77 1.02 2.0 2.60 0.00 2
## Education* 5 614 1.22 0.41 1.0 1.15 0.00 1
## Self_Employed* 6 614 2.13 0.34 2.0 2.04 0.00 2
## ApplicantIncome 7 614 5403.46 6109.04 3812.5 4292.06 1822.86 150
## CoapplicantIncome 8 614 1621.25 2926.25 1188.5 1154.85 1762.07 0
## LoanAmount 9 614 145.39 84.40 126.5 132.30 45.96 9
## Loan_Amount_Term 10 614 342.41 64.43 360.0 358.54 0.00 12
## Credit_History 11 614 0.86 0.35 1.0 0.94 0.00 0
## Property_Area* 12 614 2.04 0.79 2.0 2.05 1.48 1
## Loan_Status* 13 614 1.69 0.46 2.0 1.73 0.00 1
## max range skew kurtosis se
## Loan_ID* 614 613 0.00 -1.21 7.16
## Gender* 3 1 -1.61 0.60 0.02
## Married* 3 1 -0.64 -1.59 0.02
## Dependents* 5 3 0.97 -0.45 0.04
## Education* 2 1 1.36 -0.15 0.02
## Self_Employed* 3 1 2.15 2.62 0.01
## ApplicantIncome 81000 80850 6.51 59.83 246.54
## CoapplicantIncome 41667 41667 7.45 83.97 118.09
## LoanAmount 700 691 2.71 10.66 3.41
## Loan_Amount_Term 480 468 -2.39 6.83 2.60
## Credit_History 1 1 -2.01 2.05 0.01
## Property_Area* 3 2 -0.07 -1.39 0.03
## Loan_Status* 2 1 -0.81 -1.35 0.02
describe() gives us a broad range of summary statistics, including skew and kurtosis.
Checking for correlation and multicollinearity among the variables
library(psych)
# Colour the points by Loan_Status (two levels, so red/green are used)
pairs.panels(data1,
             gap = 0,
             bg = c("red", "green", "blue")[data1$Loan_Status],
             pch = 21)
Using box plots
boxplot(data1$ApplicantIncome, horizontal = TRUE, main = "Boxplot for Applicant Income")
boxplot(data1$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income")
boxplot(data1$LoanAmount, horizontal = TRUE, main = "Boxplot for LoanAmount")
ApplicantIncome
bench <- 5795 + 1.5*IQR(data1$ApplicantIncome)  # upper fence: Q3 + 1.5*IQR
bench
## [1] 10171.25
# Winsorizing method of treating outliers
data1$ApplicantIncome[data1$ApplicantIncome > bench]
## [1] 12841 12500 11500 10750 13650 11417 14583 10408 23803 10513 20166
## [12] 14999 11757 14866 39999 51763 33846 39147 12000 11000 16250 14683
## [23] 11146 14583 20667 20233 15000 63337 19730 15759 81000 14880 12876
## [34] 10416 37719 16692 16525 16667 10833 18333 17263 20833 13262 17500
## [45] 11250 18165 19484 16666 16120 12000
data1$ApplicantIncome[data1$ApplicantIncome > bench] <- bench
summary(data1$ApplicantIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150 2878 3812 4617 5795 10171
boxplot(data1$ApplicantIncome, main = "Boxplot for ApplicantIncome")
length(data1$ApplicantIncome)
## [1] 614
CoapplicantIncome
bench <- 2297 + 1.5*IQR(data1$CoapplicantIncome)  # upper fence: Q3 + 1.5*IQR
bench
## [1] 5742.875
# Winsorizing method of treating outliers
data1$CoapplicantIncome[data1$CoapplicantIncome > bench]
## [1] 10968 8106 7210 8980 7750 11300 7250 7101 6250 7873 20000
## [12] 20000 8333 6667 6666 7166 33837 41667
data1$CoapplicantIncome[data1$CoapplicantIncome > bench] <- bench
summary(data1$CoapplicantIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1188 1420 2297 5743
boxplot(data1$CoapplicantIncome, main = "Boxplot for Co-ApplicantIncome")
length(data1$CoapplicantIncome)
## [1] 614
LoanAmount
bench <- 165.8 + 1.5*IQR(data1$LoanAmount)  # upper fence: Q3 + 1.5*IQR
bench
## [1] 264.425
# Winsorizing method of treating outliers
data1$LoanAmount[data1$LoanAmount > bench]
## [1] 267 349 315 320 286 312 265 370 650 290 600 275 700 495 280 279 304
## [18] 330 436 480 300 376 490 308 570 380 296 275 360 405 500 480 311 480
## [35] 400 324 600 275 292 350 496
data1$LoanAmount[data1$LoanAmount > bench] <- bench
summary(data1$LoanAmount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 100.0 126.5 137.2 165.8 264.4
boxplot(data1$LoanAmount, main = "Boxplot for LoanAmount")
length(data1$LoanAmount)
## [1] 614
The outliers have all been capped at their respective fences, and the data is now reasonably clean.
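Since the same fence-and-cap step was applied three times by hand, it could also be wrapped in a small helper; the cap_upper() function below is a hypothetical sketch, not part of the analysis above, and computes the quartile with quantile() rather than hard-coding it:
cap_upper <- function(x) {
  bench <- quantile(x, 0.75) + 1.5 * IQR(x)  # upper outlier fence: Q3 + 1.5*IQR
  x[x > bench] <- bench                      # cap (winsorize) values above the fence
  x
}
# e.g. data1$LoanAmount <- cap_upper(data1$LoanAmount)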
prop.table(table(data1$Loan_Status))
##
## N Y
## 0.3127036 0.6872964
table(data1$Loan_Status)
##
## N Y
## 192 422
Class imbalance is a common situation in classification problems where one class of the response is far rarer than the other.
In this data set, 68.7% of the response variable is Y and 31.3% is N. Hence, we can conclude that there is no severe class imbalance.
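Had the split been severely imbalanced, one common remedy is to down-sample the majority class. A sketch with downSample() from the caret package (loaded later in this report for confusionMatrix()); it is illustrative only and is not applied to data1:
library(caret)
balanced <- downSample(x = data1[, setdiff(names(data1), "Loan_Status")],
                       y = data1$Loan_Status, yname = "Loan_Status")
table(balanced$Loan_Status)  # both classes now have 192 observations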
set.seed(222)
# Randomly assign each row to the training set (75%) or the test set (25%)
split = sample(2, nrow(data1), prob = c(0.75, 0.25), replace = TRUE)
train_set = data1[split == 1,]
test_set = data1[split == 2,]
It is usual practice in machine learning to divide the data set into a training set and a test set: the model is built on the training set and its performance is evaluated on the test set.
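A quick sanity check (a sketch) is to compare the class ratio in the two halves, which should be similar:
prop.table(table(train_set$Loan_Status))
prop.table(table(test_set$Loan_Status))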
Logistic regression maps a linear combination of the predictors through the sigmoid (logistic) function to produce a class probability, which makes it naturally suited to classification problems. Other applicable models for classification problems are Decision Tree, Random Forest, Naive Bayes, Neural Networks and so on.
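For illustration, the sigmoid squashes any real-valued linear predictor z into a probability between 0 and 1; a minimal sketch, separate from the model fitting below:
sigmoid <- function(z) 1 / (1 + exp(-z))  # the logistic function
sigmoid(c(-4, 0, 4))                      # roughly 0.018, 0.500, 0.982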
For the purpose of this project we will be using Decision Tree and Random Forest along with Logistic Regression.
# Fitting Logistic Regression to the Training set
logistics_classifier = glm(formula = Loan_Status ~ .,
family = binomial,
data = train_set[,-c(1)])
summary(logistics_classifier)
##
## Call:
## glm(formula = Loan_Status ~ ., family = binomial, data = train_set[,
## -c(1)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3963 -0.2804 0.5099 0.6753 2.9553
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.114e+00 1.096e+00 -2.841 0.00449 **
## GenderMale -5.206e-01 3.599e-01 -1.446 0.14807
## MarriedYes 8.692e-01 2.976e-01 2.920 0.00350 **
## Dependents1 -6.381e-01 3.391e-01 -1.882 0.05988 .
## Dependents2 3.567e-01 4.166e-01 0.856 0.39181
## Dependents3+ 1.583e-01 5.129e-01 0.309 0.75752
## EducationNot Graduate -3.566e-01 3.065e-01 -1.163 0.24470
## Self_EmployedYes 3.214e-01 3.994e-01 0.805 0.42089
## ApplicantIncome 1.888e-05 7.695e-05 0.245 0.80615
## CoapplicantIncome 8.308e-05 9.532e-05 0.872 0.38344
## LoanAmount -4.518e-03 3.277e-03 -1.379 0.16795
## Loan_Amount_Term -6.262e-04 2.162e-03 -0.290 0.77210
## Credit_History 4.672e+00 6.191e-01 7.547 4.45e-14 ***
## Property_AreaSemiurban 8.172e-01 3.135e-01 2.607 0.00914 **
## Property_AreaUrban 3.953e-01 3.096e-01 1.277 0.20165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 591.70 on 471 degrees of freedom
## Residual deviance: 409.09 on 457 degrees of freedom
## AIC: 439.09
##
## Number of Fisher Scoring iterations: 5
Based on the output of the logistic regression, only three predictors are significant at the 5% level (Credit_History, MarriedYes and Property_AreaSemiurban); the others are not.
Credit_History is an important factor in deciding whether a client will default, and the model's output is clearly in tune with that expectation. Whether the customer is married is also a significant factor, as far as this data set is concerned.
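Exponentiating the coefficients converts them to odds ratios, which are easier to read (a sketch based on the fitted model above):
round(exp(coef(logistics_classifier)), 3)
# e.g. exp(4.672) for Credit_History is roughly 107, i.e. having a
# credit history multiplies the odds of the 'Y' class by about 107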
Prediction using the logistic regression model
# Predicting the Test set results
prob_pred = predict(logistics_classifier, type = 'response', newdata = test_set)
y_pred = ifelse(prob_pred > 0.5, 1, 0)
Estimating the performance of the model
cm = table(ActualValue=test_set$Loan_Status, PredictedValue=prob_pred > 0.5)
cm
## PredictedValue
## ActualValue FALSE TRUE
## N 15 26
## Y 4 97
# Overall accuracy: proportion of correct predictions
sum(diag(cm))/sum(cm)
## [1] 0.7887324
Logistic regression achieved an accuracy of 78.87%, which means we can expect the model to classify correctly about 8 of every 10 observations.
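Accuracy alone can be misleading, though. Sensitivity and specificity can be read off the confusion matrix above (a sketch; rows are the actual classes, columns the FALSE/TRUE predictions):
sensitivity <- cm["Y", "TRUE"] / sum(cm["Y", ])   # 97 / 101, about 0.96
specificity <- cm["N", "FALSE"] / sum(cm["N", ])  # 15 / 41, about 0.37
c(sensitivity = sensitivity, specificity = specificity)
So the model identifies nearly every 'Y' case but correctly flags only 15 of the 41 'N' cases in the test set, which matters for a default-risk application.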
library(party)
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
Tree_Classifer = ctree(Loan_Status ~ .,
data = train_set[,-c(1)])
Tree_Classifer
##
## Conditional inference tree with 3 terminal nodes
##
## Response: Loan_Status
## Inputs: Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area
## Number of observations: 472
##
## 1) Credit_History <= 0; criterion = 1, statistic = 153.068
## 2)* weights = 70
## 1) Credit_History > 0
## 3) Married == {No}; criterion = 0.962, statistic = 10.38
## 4)* weights = 133
## 3) Married == {Yes}
## 5)* weights = 269
plot(Tree_Classifer)
The decision tree model corroborates the logistic regression by also ranking Credit_History as the most important variable in deciding whether a customer will default.
Prediction using the Decision Tree
pred = predict(Tree_Classifer,newdata = test_set)
cm = table(ActualValue=test_set$Loan_Status, PredictedValue=pred)
cm
## PredictedValue
## ActualValue N Y
## N 15 26
## Y 4 97
Estimating the overall accuracy
sum(diag(cm))/sum(cm)
## [1] 0.7887324
The decision tree model achieves the same accuracy as logistic regression, 78.87%; indeed, its confusion matrix is identical.
Random Forest is an ensemble method: where a Decision Tree uses a single tree, it aggregates the votes of many trees (500 by default in the randomForest package). Each tree is grown on a random bootstrap sample of the training data, with a random subset of predictors tried at each split, which is why the model is called a Random Forest.
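Both the number of trees and mtry (the number of predictors tried at each split) can be tuned rather than left at their defaults. A sketch using tuneRF() from the randomForest package (loaded with its startup messages just below); the column indices assume Loan_ID is column 1 and Loan_Status column 13, as in data1:
library(randomForest)
set.seed(153)
tuned <- tuneRF(x = train_set[, -c(1, 13)], y = train_set$Loan_Status,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)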
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
set.seed(153)
rf_classifier <- randomForest(Loan_Status ~ ., data = train_set[,-c(1)])
str(rf_classifier)
## List of 19
## $ call : language randomForest(formula = Loan_Status ~ ., data = train_set[, -c(1)])
## $ type : chr "classification"
## $ predicted : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 2 2 2 ...
## ..- attr(*, "names")= chr [1:472] "2" "3" "4" "7" ...
## $ err.rate : num [1:500, 1:3] 0.312 0.33 0.344 0.291 0.297 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "OOB" "N" "Y"
## $ confusion : num [1:2, 1:3] 70 16 81 305 0.536 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "N" "Y"
## .. ..$ : chr [1:3] "N" "Y" "class.error"
## $ votes : 'matrix' num [1:472, 1:2] 0.2067 0.1758 0.0435 0.1489 0.8128 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:472] "2" "3" "4" "7" ...
## .. ..$ : chr [1:2] "N" "Y"
## $ oob.times : num [1:472] 208 182 184 188 187 165 199 175 183 169 ...
## $ classes : chr [1:2] "N" "Y"
## $ importance : num [1:11, 1] 3.66 5.09 9.96 4.26 2.89 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:11] "Gender" "Married" "Dependents" "Education" ...
## .. ..$ : chr "MeanDecreaseGini"
## $ importanceSD : NULL
## $ localImportance: NULL
## $ proximity : NULL
## $ ntree : num 500
## $ mtry : num 3
## $ forest :List of 14
## ..$ ndbigtree : int [1:500] 173 161 149 195 151 175 163 139 169 183 ...
## ..$ nodestatus: int [1:233, 1:500] 1 1 1 -1 1 1 1 -1 1 1 ...
## ..$ bestvar : int [1:233, 1:500] 6 6 10 0 2 7 6 0 6 6 ...
## ..$ treemap : int [1:233, 1:2, 1:500] 2 4 6 0 8 10 12 0 14 16 ...
## ..$ nodepred : int [1:233, 1:500] 0 0 0 2 0 0 0 1 0 0 ...
## ..$ xbestsplit: num [1:233, 1:500] 1903 837 0.5 0 2 ...
## ..$ pid : num [1:2] 1 1
## ..$ cutoff : num [1:2] 0.5 0.5
## ..$ ncat : Named int [1:11] 3 3 5 2 3 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:11] "Gender" "Married" "Dependents" "Education" ...
## ..$ maxcat : int 5
## ..$ nrnodes : int 233
## ..$ ntree : num 500
## ..$ nclass : int 2
## ..$ xlevels :List of 11
## .. ..$ Gender : chr [1:3] "" "Female" "Male"
## .. ..$ Married : chr [1:3] "" "No" "Yes"
## .. ..$ Dependents : chr [1:5] "" "0" "1" "2" ...
## .. ..$ Education : chr [1:2] "Graduate" "Not Graduate"
## .. ..$ Self_Employed : chr [1:3] "" "No" "Yes"
## .. ..$ ApplicantIncome : num 0
## .. ..$ CoapplicantIncome: num 0
## .. ..$ LoanAmount : num 0
## .. ..$ Loan_Amount_Term : num 0
## .. ..$ Credit_History : num 0
## .. ..$ Property_Area : chr [1:3] "Rural" "Semiurban" "Urban"
## $ y : Factor w/ 2 levels "N","Y": 1 2 2 2 1 2 1 2 2 2 ...
## ..- attr(*, "names")= chr [1:472] "2" "3" "4" "7" ...
## $ test : NULL
## $ inbag : NULL
## $ terms :Classes 'terms', 'formula' language Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + ApplicantIncome + CoapplicantIncom| __truncated__ ...
## .. ..- attr(*, "variables")= language list(Loan_Status, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome,| __truncated__ ...
## .. ..- attr(*, "factors")= int [1:12, 1:11] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:12] "Loan_Status" "Gender" "Married" "Dependents" ...
## .. .. .. ..$ : chr [1:11] "Gender" "Married" "Dependents" "Education" ...
## .. ..- attr(*, "term.labels")= chr [1:11] "Gender" "Married" "Dependents" "Education" ...
## .. ..- attr(*, "order")= int [1:11] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= num 0
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(Loan_Status, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome,| __truncated__ ...
## .. ..- attr(*, "dataClasses")= Named chr [1:12] "factor" "factor" "factor" "factor" ...
## .. .. ..- attr(*, "names")= chr [1:12] "Loan_Status" "Gender" "Married" "Dependents" ...
## - attr(*, "class")= chr [1:2] "randomForest.formula" "randomForest"
attributes(rf_classifier)
## $names
## [1] "call" "type" "predicted"
## [4] "err.rate" "confusion" "votes"
## [7] "oob.times" "classes" "importance"
## [10] "importanceSD" "localImportance" "proximity"
## [13] "ntree" "mtry" "forest"
## [16] "y" "test" "inbag"
## [19] "terms"
##
## $class
## [1] "randomForest.formula" "randomForest"
Estimating the performance of the model
rf_pred = predict(rf_classifier,test_set)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## The following objects are masked from 'package:psych':
##
## %+%, alpha
confusionMatrix(rf_pred,test_set$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 16 6
## Y 25 95
##
## Accuracy : 0.7817
## 95% CI : (0.7047, 0.8466)
## No Information Rate : 0.7113
## P-Value [Acc > NIR] : 0.036626
##
## Kappa : 0.3836
## Mcnemar's Test P-Value : 0.001225
##
## Sensitivity : 0.3902
## Specificity : 0.9406
## Pos Pred Value : 0.7273
## Neg Pred Value : 0.7917
## Prevalence : 0.2887
## Detection Rate : 0.1127
## Detection Prevalence : 0.1549
## Balanced Accuracy : 0.6654
##
## 'Positive' Class : N
##
plot(rf_classifier)
varImpPlot(rf_classifier)
importance(rf_classifier)
## MeanDecreaseGini
## Gender 3.660779
## Married 5.090343
## Dependents 9.955185
## Education 4.264957
## Self_Employed 2.894202
## ApplicantIncome 33.028981
## CoapplicantIncome 19.163331
## LoanAmount 30.289876
## Loan_Amount_Term 9.729408
## Credit_History 58.327543
## Property_Area 9.865506
The Random Forest model ranked Credit_History as the most important variable, just like the two previous models. Random Forest rates ApplicantIncome as the next most important variable, whereas logistic regression singled out Married.
Based on the performance of the logistic regression, Decision Tree and Random Forest models, we can conclude that, with careful pre-processing, these models can perform reasonably well on this classification problem.
More advanced approaches such as boosted ensembles and Neural Networks can also perform very well on classification problems, but care must be taken to avoid over-fitting in the pursuit of high accuracy.
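A more stable estimate of out-of-sample accuracy than a single train/test split is k-fold cross-validation, which also helps flag over-fitting. A sketch with caret's train(), re-fitting the logistic regression above under 10-fold cross-validation:
set.seed(222)
cv_fit <- train(Loan_Status ~ ., data = data1[, -1],
                method = "glm", family = binomial,
                trControl = trainControl(method = "cv", number = 10))
cv_fit$results  # cross-validated Accuracy and Kappa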
Comments and suggestions are welcome.
Thanks. Owolabi Ebenezer