The data dictionary below describes the Loan_approval dataset.
Loan_ID: unique loan ID
Gender: applicant gender (Male/Female)
Married: applicant marital status (Yes/No)
Dependents: number of dependents for applicant (0, 1, 2, 3+)
Education: applicant college education status (Graduate/Not Graduate)
Self_Employed: applicant self-employment status (Yes/No)
ApplicantIncome: applicant income level
CoapplicantIncome: co-applicant income level (if applicable)
LoanAmount: loan amount requested (in thousands)
Loan_Amount_Term: loan term (in months)
Credit_History: credit history meets guidelines (1/0)
Property_Area: property location (Urban/Semiurban/Rural)
Loan_Status: loan approved (Y/N); target variable
# Packages attached for the analysis (startup messages suppressed)
library(dplyr)
library(purrr)
library(ggplot2)
library(caret)
library(rattle)
library(car)
library(AMORE)
library(randomForest)
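# stringsAsFactors = TRUE keeps text columns as factors, as in the str() output below (not the default in R >= 4.0)
loan_ori <- read.csv('DATAHW3gp2\\Loan_approval.csv', sep = ',', stringsAsFactors = TRUE)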
dim(loan_ori)
## [1] 614 13
# glimpse(loan_ori)
str(loan_ori)
## 'data.frame': 614 obs. of 13 variables:
## $ Loan_ID : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
## make the figure large by setting height=9, otherwise it is too small
pairs(loan_ori, panel = panel.smooth, main = "Loan Approval Data")
Checking for correlation and multicollinearity between the variables
library(psych)
# Color each point by loan status: index the color vector by the factor
pairs.panels (loan_ori,
gap = 0,
bg = c("red","green","blue")[loan_ori$Loan_Status],
pch = 21)
### OUTCOME IMBALANCE CHECKING
table(loan_ori$Loan_Status)
##
## N Y
## 192 422
prop.table(table(loan_ori$Loan_Status))
##
## N Y
## 0.3127036 0.6872964
#check overall missing data
#calculate missing proportion
missingprop <- function(loan_ori) {
miss.stuff <- loan_ori %>%
filter(!complete.cases(.))
miss.stuff.prop <- nrow(miss.stuff)/nrow(loan_ori)
return(miss.stuff.prop)
}
missingprop(loan_ori)
## [1] 0.1384365
## 13.8% of rows have at least one NA (blank strings are not counted yet)
# Checking for Missing Data
sum(is.na(loan_ori))
## [1] 86
summary(loan_ori)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 13 : 3 : 15 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes:398 1 :102
## LP001006: 1 2 :101
## LP001008: 1 3+: 51
## LP001011: 1
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 32 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes: 82 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
describe(loan_ori)
## vars n mean sd median trimmed mad min max
## Loan_ID* 1 614 307.50 177.39 307.5 307.50 227.58 1 614
## Gender* 2 614 2.78 0.47 3.0 2.87 0.00 1 3
## Married* 3 614 2.64 0.49 3.0 2.68 0.00 1 3
## Dependents* 4 614 2.72 1.04 2.0 2.58 0.00 1 5
## Education* 5 614 1.22 0.41 1.0 1.15 0.00 1 2
## Self_Employed* 6 614 2.08 0.42 2.0 2.04 0.00 1 3
## ApplicantIncome 7 614 5403.46 6109.04 3812.5 4292.06 1822.86 150 81000
## CoapplicantIncome 8 614 1621.25 2926.25 1188.5 1154.85 1762.07 0 41667
## LoanAmount 9 592 146.41 85.59 128.0 133.14 47.44 9 700
## Loan_Amount_Term 10 600 342.00 65.12 360.0 358.38 0.00 12 480
## Credit_History 11 564 0.84 0.36 1.0 0.93 0.00 0 1
## Property_Area* 12 614 2.04 0.79 2.0 2.05 1.48 1 3
## Loan_Status* 13 614 1.69 0.46 2.0 1.73 0.00 1 2
## range skew kurtosis se
## Loan_ID* 613 0.00 -1.21 7.16
## Gender* 2 -1.92 2.91 0.02
## Married* 2 -0.72 -1.16 0.02
## Dependents* 4 0.89 -0.38 0.04
## Education* 1 1.36 -0.15 0.02
## Self_Employed* 2 0.49 2.17 0.02
## ApplicantIncome 80850 6.51 59.83 246.54
## CoapplicantIncome 41667 7.45 83.97 118.09
## LoanAmount 691 2.66 10.26 3.52
## Loan_Amount_Term 468 -2.35 6.58 2.66
## Credit_History 1 -1.87 1.51 0.02
## Property_Area* 2 -0.07 -1.39 0.03
## Loan_Status* 1 -0.81 -1.35 0.02
loan_ori[loan_ori==""] <- NA
# Blank strings have been replaced with NA, so R now treats them as missing values.
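One detail worth checking after this recoding: assigning NA to the blank entries of a factor does not remove the empty "" level itself, which is why the str() output further down still lists it. A minimal sketch on a toy data frame (the data frame `toy` and its values are made up for illustration) shows how base R's droplevels() could clear the unused level:

```r
# Toy data frame standing in for loan_ori (values are illustrative only)
toy <- data.frame(
  Gender  = factor(c("Male", "", "Female", "Male")),
  Married = factor(c("Yes", "No", "", "Yes"))
)

# Recode blanks as NA, as in the step above
toy[toy == ""] <- NA

# The empty level is still registered on the factor
levels(toy$Gender)

# droplevels() removes levels that no longer occur in the data
toy <- droplevels(toy)
levels(toy$Gender)
```

Whether to drop the level before or after imputation is a judgment call; the sketch only shows the mechanics.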
par(mfrow=c(2, 1))
boxplot(loan_ori$ApplicantIncome, horizontal = TRUE, main = "Boxplot for Applicant Income", col='red')
boxplot(loan_ori$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income", col='blue')
par(mfrow=c(2, 1))
boxplot(loan_ori$ApplicantIncome, outline= TRUE, col = "red", main ='Applicant, with Outlier',horizontal=TRUE)
boxplot(loan_ori$ApplicantIncome, outline= FALSE, col = "red",main ='Applicant, without Outlier',horizontal=TRUE)
par(mfrow=c(2, 1))
boxplot(loan_ori$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income, With Outlier", col='blue', outline=TRUE)
boxplot(loan_ori$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income, without Outlier", col='blue', outline=FALSE)
par(mfrow=c(2, 1))
boxplot(loan_ori$LoanAmount, horizontal = TRUE, main = "Boxplot for LoanAmount, with Outlier", col='yellow', outline=TRUE)
boxplot(loan_ori$LoanAmount, horizontal = TRUE, main = "Boxplot for LoanAmount, without outlier", col='yellow', outline=FALSE)
ggplot(data=loan_ori, aes(x= ApplicantIncome)) +
geom_histogram(col="red",fill="yellow", bins = 15) +
facet_grid(~Loan_Status)+
theme_bw()
The numerical variables shown here (applicant income, co-applicant income, and loan amount) are clearly not normally distributed. There are extreme outliers among the high-income applicants. So that the model generalizes without being driven by these outliers, we exclude the highest earners from the data. Income is also heavily right-skewed (the describe() output reports a skew of 6.51 for ApplicantIncome), with most applicants concentrated in the lower income tiers, so we apply a log transformation to bring the distributions closer to normal.
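The effect of a log transformation on skewness can be checked numerically. The following is a small sketch, not part of the original analysis: it uses only the first ten ApplicantIncome values shown in the str() output, and `skewness` is a hypothetical helper written in base R (the psych package reports a comparable statistic through describe()).

```r
# First ten ApplicantIncome values, copied from the str() output above
income <- c(5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006, 12841)

# Moment-based sample skewness (hypothetical helper, base R only)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(income)       # positive: long right tail
skew_log <- skewness(log(income))  # smaller after the log transform

c(raw = skew_raw, log = skew_log)
```

Note that log(0) is -Inf, so a column containing zeros, such as CoapplicantIncome, cannot be logged directly. Summing the two incomes first avoids this, because ApplicantIncome is at least 150 in this dataset.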
# loan2$LogToalIncome
dim(loan_ori)
## [1] 614 13
# loan1<- loan_ori
loan1 <-
loan_ori %>%
filter (ApplicantIncome<35000 ) %>%
filter ( CoapplicantIncome<20000) %>%
select (-Loan_ID)
dim(loan1)
## [1] 604 12
str(loan1)
## 'data.frame': 604 obs. of 12 variables:
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
## $ CoapplicantIncome: num 0 1508 0 2358 0 ...
## $ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
## $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
loan2<-loan1
## remove Loan_ID
# loan2$Loan_ID <- NULL
Furthermore, because a loan application is assessed on the combined income of the applicant and the co-applicant, there is little point in keeping them as two variables. We combine them into a single variable, total income, which is log-transformed as well.
# Create the new variable TotalIncome
loan2$TotalIncome <- loan2$ApplicantIncome + loan2$CoapplicantIncome
loan2$ApplicantIncome <- NULL
loan2$CoapplicantIncome <- NULL
loan2$LogToalIncome <- log(loan2$TotalIncome)
loan2$TotalIncome<- NULL
## Log-transform LoanAmount and remove the original
loan2$LogLoanAmount <- log(loan2$LoanAmount)
loan2$LoanAmount<- NULL
str(loan2)
## 'data.frame': 604 obs. of 11 variables:
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ Loan_Amount_Term: int 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
## $ LogToalIncome : num 8.67 8.71 8.01 8.51 8.7 ...
## $ LogLoanAmount : num NA 4.85 4.19 4.79 4.95 ...
hist(loan2$LogToalIncome,
main="Histogram for Total Income, Log-Transformed and Outliers Removed",
xlab="Log total income",
border="blue",
col="maroon",
las=1,
breaks=50, prob = TRUE)
hist(loan2$LogLoanAmount,
main="Histogram for Loan Amount, Log-Transformed and Outliers Removed",
xlab="Log loan amount",
border="red",
col="blue",
las=1,
breaks=50, prob = TRUE)
After the necessary transformation and outlier removal, the most important numerical variables, total household income and loan amount, appear satisfactorily distributed, close to normal. We can use them in the subsequent analysis.
sapply(loan2, function(x) sum(is.na(x)))
## Gender Married Dependents Education
## 12 3 15 0
## Self_Employed Loan_Amount_Term Credit_History Property_Area
## 30 14 49 0
## Loan_Status LogToalIncome LogLoanAmount
## 0 0 22
library(mice)
library(VIM)
mice_plot <- aggr(loan2, col=c('navyblue','red'),
numbers=TRUE, sortVars=TRUE,
labels=names(loan2), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Credit_History 0.081125828
## Self_Employed 0.049668874
## LogLoanAmount 0.036423841
## Dependents 0.024834437
## Loan_Amount_Term 0.023178808
## Gender 0.019867550
## Married 0.004966887
## Education 0.000000000
## Property_Area 0.000000000
## Loan_Status 0.000000000
## LogToalIncome 0.000000000
Judging from the data, about 13% of the records contain missing values. In particular, credit history, which is the most important variable in determining a loan decision, is missing in 8% of records. If we excluded every record with a missing value, the data would be skewed and unsuitable for analysis, because the missing values fall in important variables rather than unimportant ones.
# The mice() function takes care of the imputing process:
imputed_list <- mice(data=loan2, m=1, maxit = 2, method = 'cart', seed = 500)
##
## iter imp variable
## 1 1 Gender Married Dependents Self_Employed Loan_Amount_Term Credit_History LogLoanAmount
## 2 1 Gender Married Dependents Self_Employed Loan_Amount_Term Credit_History LogLoanAmount
## Warning: Number of logged events: 14
## a single imputation (m = 1), otherwise too time-consuming
## a list of 22
tr <- complete(imputed_list,1) ## take the first (and only) imputed dataset
dim(imputed_list)
## NULL
# dim() returns NULL because imputed_list is a mids object, not a data frame
dim(tr)
## [1] 604 11
loan3<-tr
# loan4_temp<-imputed_Data$data ## $data holds the original incomplete data, so it still has missing values; use complete() instead
# str(loan4)
mice_plot2 <- aggr(loan3, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(loan3), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Gender 0
## Married 0
## Dependents 0
## Education 0
## Self_Employed 0
## Loan_Amount_Term 0
## Credit_History 0
## Property_Area 0
## Loan_Status 0
## LogToalIncome 0
## LogLoanAmount 0
# no missing now
sapply(loan3, function(x) sum(is.na(x)))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed Loan_Amount_Term Credit_History Property_Area
## 0 0 0 0
## Loan_Status LogToalIncome LogLoanAmount
## 0 0 0
We used the mice package for missing-data imputation, choosing the 'cart' method because it handles both categorical and numerical variables. As we can see, after imputation the data are complete with minimal loss of information, which we consider satisfactory.
str(loan3)
## 'data.frame': 604 obs. of 11 variables:
## $ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ Married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ Dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ Loan_Amount_Term: num 360 360 360 360 360 360 360 360 360 360 ...
## $ Credit_History : num 1 1 1 1 1 1 1 0 1 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
## $ LogToalIncome : num 8.67 8.71 8.01 8.51 8.7 ...
## $ LogLoanAmount : num 4.7 4.85 4.19 4.79 4.95 ...
hist(loan3$LogLoanAmount,
main="Histogram for Log Loan Amount-After Imputation",
xlab="Loan Amount-log",
border="blue",
col="maroon",
las=1,
breaks=20, prob = TRUE)
set.seed(42)
# Use train_idx rather than `sample` so the base function sample() is not shadowed
train_idx <- sample.int(n = nrow(loan3), size = floor(.70*nrow(loan3)), replace = FALSE)
trainnew <- loan3[train_idx, ]
testnew <- loan3[-train_idx, ]
dim(trainnew)
## [1] 422 11
dim(testnew)
## [1] 182 11
# summary(trainnew)
variable.summary(trainnew)
## Class %.NA Levels Min.Level.Size Mean SD
## Gender factor 0 3 0 NA NA
## Married factor 0 3 0 NA NA
## Dependents factor 0 5 0 NA NA
## Education factor 0 2 93 NA NA
## Self_Employed factor 0 3 0 NA NA
## Loan_Amount_Term numeric 0 NA NA 340.0094787 67.2949328
## Credit_History numeric 0 NA NA 0.8649289 0.3422052
## Property_Area factor 0 3 123 NA NA
## Loan_Status factor 0 2 128 NA NA
## LogToalIncome numeric 0 NA NA 8.6399397 0.4928378
## LogLoanAmount numeric 0 NA NA 4.8403827 0.5133288
colnames(trainnew)
## [1] "Gender" "Married" "Dependents" "Education"
## [5] "Self_Employed" "Loan_Amount_Term" "Credit_History" "Property_Area"
## [9] "Loan_Status" "LogToalIncome" "LogLoanAmount"
dtree <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+LogToalIncome,method="class", data=trainnew,parms=list(split="information"))
dtree$cptable
## CP nsplit rel error xerror xstd
## 1 0.35156250 0 1.0000000 1.0000000 0.07377555
## 2 0.02343750 1 0.6484375 0.6484375 0.06379295
## 3 0.01171875 2 0.6250000 0.6562500 0.06408137
## 4 0.01000000 4 0.6015625 0.6406250 0.06350095
fancyRpartPlot(dtree)
First we fit a decision tree with the "class" method on six predictors: credit history, education, self-employment, property area, log loan amount, and log total income. In the four-layer tree, the first split is on credit history, followed by total income and then loan amount. This finding is intuitive and in line with what we saw in the univariate analysis before modeling.
dtree.pruned <- prune(dtree, cp=.02290076)
dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
dnn=c("Actual", "Predicted"))
dtree.perf
## Predicted
## Actual N Y
## N 56 72
## Y 8 286
fancyRpartPlot(dtree.pruned)
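Next we pruned the tree at a complexity parameter (CP) of about 0.0229 and computed the confusion matrix of the pruned tree on the training data.
In the pruned tree, applicants whose credit history does not meet the guidelines (Credit_History = 0) have about a 21% chance of getting a loan.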
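The pruning cut-off is usually read off the CP table. As a sketch only, one common convention is to prune at the CP value with the lowest cross-validated error (xerror). The example below uses the kyphosis data shipped with rpart rather than the loan data, so its numbers are not comparable with the output above, and the object names are illustrative.

```r
library(rpart)

set.seed(42)  # xerror comes from random cross-validation folds

# Fit a classification tree on rpart's built-in kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             parms = list(split = "information"))

# Pick the CP value with the smallest cross-validated error
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune the tree at that CP
fit_pruned <- prune(fit, cp = best_cp)
fit_pruned$cptable
```

Another convention is the one-standard-error rule, which picks the simplest tree whose xerror is within one xstd of the minimum; either way the choice is driven by the table rather than fixed by hand.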
dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogToalIncome
,method="class", data=testnew,parms=list(split="information"))
dtree_test$cptable
## CP nsplit rel error xerror xstd
## 1 0.5166667 0 1.0000000 1.0000000 0.10569844
## 2 0.0100000 1 0.4833333 0.4833333 0.08229203
dtree_test.pruned <- prune(dtree_test, cp=.022)
dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred,
dnn=c("Actual", "Predicted"))
dtree_test.perf
## Predicted
## Actual N Y
## N 33 27
## Y 2 120
fancyRpartPlot(dtree_test.pruned)
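The block above refits a separate tree on testnew. A more conventional hold-out check would instead apply the tree fitted on the training rows to the held-out rows. Below is a minimal sketch of that pattern, again on rpart's kyphosis data with illustrative names. In this analysis the analogous call would presumably be predict(dtree.pruned, newdata = testnew, type = "class"), which was not run here.

```r
library(rpart)

set.seed(42)
n         <- nrow(kyphosis)
train_idx <- sample.int(n, size = floor(0.70 * n))
train_df  <- kyphosis[train_idx, ]
test_df   <- kyphosis[-train_idx, ]

# Fit on the training rows only
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = train_df, method = "class")

# Predict the held-out rows with the training-set tree
pred <- predict(fit, newdata = test_df, type = "class")

# Confusion matrix on data the tree has not seen
perf <- table(test_df$Kyphosis, pred, dnn = c("Actual", "Predicted"))
perf
```

Because the tree never sees the held-out rows during fitting, this confusion matrix estimates out-of-sample performance, whereas a tree fitted and scored on the same rows tends to look optimistic.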
set.seed(817)
dim(trainnew)
## [1] 422 11
colnames(trainnew)
## [1] "Gender" "Married" "Dependents" "Education"
## [5] "Self_Employed" "Loan_Amount_Term" "Credit_History" "Property_Area"
## [9] "Loan_Status" "LogToalIncome" "LogLoanAmount"
original_rf<-randomForest(Loan_Status~ ., trainnew)
original_rf
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = trainnew)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.62%
## Confusion matrix:
## N Y class.error
## N 56 72 0.56250000
## Y 15 279 0.05102041
Determining the most important variable in the forest
plot(original_rf)
varImpPlot(original_rf)
From the random forest model, it is clear that three variables stand out as the most important: credit history, total income, and loan amount. The remaining variables follow in order of decreasing Gini importance and fall into a lower-priority category. These three variables matter most for getting a loan approved, which is not surprising.
set.seed(42)
fit.forest2 <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogToalIncome, data=trainnew, importance=TRUE)
fit.forest2
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + Education + Self_Employed + Property_Area + LogToalIncome, data = trainnew, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19.43%
## Confusion matrix:
## N Y class.error
## N 53 75 0.58593750
## Y 7 287 0.02380952
forest.pred2 <- predict(fit.forest2, testnew)
forest.perf_test <- table(testnew$Loan_Status, forest.pred2,
dnn=c("Actual", "Predicted"))
forest.perf_test
## Predicted
## Actual N Y
## N 35 25
## Y 4 118
set.seed(817)
tune_grid<-expand.grid(mtry=c(1:10), ntree=c(500,1000,1500,2000)) #expand a grid of parameters
mtry<-tune_grid[[1]]
ntree<-tune_grid[[2]]
OOB<-NULL #use to store calculated OOB error estimate
for(i in 1:nrow(tune_grid)){
rf<-randomForest(Loan_Status~. ,trainnew, mtry=mtry[i], ntree=ntree[i])
confusion<-rf$confusion
temp<-(confusion[2]+confusion[3])/nrow(trainnew) #OOB error: off-diagonal counts over the 422 training rows
OOB<-append(OOB,temp)
}
tune_grid$OOB<-OOB
head(tune_grid[order(tune_grid["OOB"]), ], 4) #order the results
## mtry ntree OOB
## 12 2 1000 0.2014218
## 13 3 1000 0.2014218
## 22 2 1500 0.2014218
## 32 2 2000 0.2014218
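For reference, the OOB error rate is the number of misclassified out-of-bag cases divided by the number of training rows. The sketch below rebuilds the confusion matrix printed for original_rf earlier (counts copied from that output; the object names are illustrative) and recovers the reported 20.62%:

```r
# Confusion matrix printed for original_rf (rows = actual, columns = predicted)
confusion <- matrix(c(56, 15, 72, 279), nrow = 2,
                    dimnames = list(c("N", "Y"), c("N", "Y")))

# Misclassified cases are the off-diagonal cells
misclassified <- sum(confusion) - sum(diag(confusion))

# Divide by the number of rows the forest was trained on
oob_error <- misclassified / sum(confusion)
round(100 * oob_error, 2)  # 20.62
```

The total, sum(confusion) = 422, equals nrow(trainnew), which is the denominator an OOB calculation on a forest trained on trainnew should use.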
I was not able to run gradient boosting on my computer because it crashes RStudio.
Judging from the confusion matrices, the random forest model and the classification tree perform similarly on this dataset. Both models place similar numbers of cases in the true-positive and true-negative cells, so their accuracy and false-negative rates are similar.
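The comparison can be made concrete from the two test-set confusion matrices printed earlier. The sketch below copies those counts and uses a hypothetical helper, `rates`, that treats "Y" (approved) as the positive class:

```r
# Test-set confusion matrices as printed above (rows = actual, columns = predicted)
tree_test   <- matrix(c(33, 2, 27, 120), nrow = 2,
                      dimnames = list(Actual = c("N", "Y"), Predicted = c("N", "Y")))
forest_test <- matrix(c(35, 4, 25, 118), nrow = 2,
                      dimnames = list(Actual = c("N", "Y"), Predicted = c("N", "Y")))

# Hypothetical helper: summary rates with "Y" (approved) as the positive class
rates <- function(cm) {
  c(accuracy            = sum(diag(cm)) / sum(cm),
    sensitivity         = cm["Y", "Y"] / sum(cm["Y", ]),
    specificity         = cm["N", "N"] / sum(cm["N", ]),
    false_negative_rate = cm["Y", "N"] / sum(cm["Y", ]))
}

round(rbind(tree = rates(tree_test), forest = rates(forest_test)), 3)
```

Both matrices give the same accuracy (153 of 182 cases, about 84%), while specificity stays below 60% for both, so rejected applications are the harder class to predict. The tree's counts come from the tree refit on testnew, so this is not a strict hold-out comparison.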
Tuning makes the models slightly better than the originals, but not significantly so.
However, I did not fit the models before the data transformations, so this conclusion may reflect the fact that the variables were all transformed and normalized before fitting, which may indicate some overfitting of both models.
I would not use a gradient boosting model unless really necessary. This is a relatively simple and straightforward dataset with meaningful results; gradient boosting, being very slow to run, might be overkill for this business problem.