Decision Tree and Random Forest to Predict Loan Approval

GOAL

Please use the attached dataset on loan approval status to predict loan approval using Decision Trees. Please be sure to conduct a thorough exploratory analysis to start the task and walk us through your reasoning behind all the steps you are taking.
Using the same dataset on Loan Approval Status, please use Random Forests to predict on loan approval status. Again, please be sure to walk us through the steps you took to get to your final model.
Using the Loan Approval Status data, please use Gradient Boosting to predict on the loan approval status. Please use whatever boosting approach you deem appropriate; but please be sure to walk us through your steps.
Model performance: please compare the models you settled on for the problems above. Comment on their relative performance. Which one would you prefer the most? Why?

DATA DICTIONARY

The data dictionary below describes the Loan_approval dataset.

Loan_ID: unique loan ID
Gender: applicant gender (Male/Female)
Married: applicant marriage status (Yes/No)
Dependents: number of dependents for applicant (0, 1, 2, 3+)
Education: applicant college education status (Graduate / Not Graduate)
Self_Employed: applicant self-employment status (Yes/No)
ApplicantIncome: applicant income level
CoapplicantIncome: co-applicant income level (if applicable)
LoanAmount: loan amount requested (in thousands)
Loan_Amount_Term: loan term (in months)
Credit_History: credit history meets guidelines (1/0)
Property_Area: property location (Urban/Semiurban/Rural)
Loan_Status: loan approved (Y/N); target variable


DATA EXPLORATION

loan_ori<- read.csv('DATAHW3gp2\\Loan_approval.csv', sep=',')

dim(loan_ori)

## [1] 614  13

# glimpse(loan_ori)

str(loan_ori)

## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

## make the figure large by setting height=9, otherwise it is too small
pairs(loan_ori, panel = panel.smooth, main = "Loan Approval Data")

Correlation Matrix

Checking for correlation and multicollinearity between the variables


# library(psych)
pairs.panels(loan_ori,
             gap = 0,
             # index the colour vector by the outcome so each class gets its own colour
             bg = c("red","green","blue")[loan_ori$Loan_Status],
             pch = 21)

### OUTCOME IMBALANCE CHECKING

table(loan_ori$Loan_Status)

## 
##   N   Y 
## 192 422

prop.table(table(loan_ori$Loan_Status))

## 
##         N         Y 
## 0.3127036 0.6872964
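
The class counts above also give a simple reference point for the later models: a classifier that always predicts the majority class. The short sketch below works that number out from the counts printed above; the variable names are illustrative only.

```r
# Class counts copied from the table() output above
counts <- c(N = 192, Y = 422)

# Accuracy of a classifier that always predicts the majority class ("Y")
baseline_acc <- max(counts) / sum(counts)
baseline_acc
```

Any fitted model should beat this figure to be considered useful on this data.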

MISSING DATA CHECKING

#check overall missing data 
#calculate missing proportion

missingprop <- function(loan_ori) {
  miss.stuff <- loan_ori %>%
    filter(!complete.cases(.))
  miss.stuff.prop <- nrow(miss.stuff)/nrow(loan_ori) 
  return(miss.stuff.prop)
}

missingprop(loan_ori)

## [1] 0.1384365

## about 14% of rows have at least one NA (blank strings are not counted yet)

# Checking for Missing Data
sum(is.na(loan_ori))

## [1] 86

summary(loan_ori)

##      Loan_ID       Gender    Married   Dependents        Education  
##  LP001002:  1         : 13      :  3     : 15     Graduate    :480  
##  LP001003:  1   Female:112   No :213   0 :345     Not Graduate:134  
##  LP001005:  1   Male  :489   Yes:398   1 :102                       
##  LP001006:  1                          2 :101                       
##  LP001008:  1                          3+: 51                       
##  LP001011:  1                                                       
##  (Other) :608                                                       
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##     : 32       Min.   :  150   Min.   :    0     Min.   :  9.0  
##  No :500       1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Yes: 82       Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

describe(loan_ori)

##                   vars   n    mean      sd median trimmed     mad min   max
## Loan_ID*             1 614  307.50  177.39  307.5  307.50  227.58   1   614
## Gender*              2 614    2.78    0.47    3.0    2.87    0.00   1     3
## Married*             3 614    2.64    0.49    3.0    2.68    0.00   1     3
## Dependents*          4 614    2.72    1.04    2.0    2.58    0.00   1     5
## Education*           5 614    1.22    0.41    1.0    1.15    0.00   1     2
## Self_Employed*       6 614    2.08    0.42    2.0    2.04    0.00   1     3
## ApplicantIncome      7 614 5403.46 6109.04 3812.5 4292.06 1822.86 150 81000
## CoapplicantIncome    8 614 1621.25 2926.25 1188.5 1154.85 1762.07   0 41667
## LoanAmount           9 592  146.41   85.59  128.0  133.14   47.44   9   700
## Loan_Amount_Term    10 600  342.00   65.12  360.0  358.38    0.00  12   480
## Credit_History      11 564    0.84    0.36    1.0    0.93    0.00   0     1
## Property_Area*      12 614    2.04    0.79    2.0    2.05    1.48   1     3
## Loan_Status*        13 614    1.69    0.46    2.0    1.73    0.00   1     2
##                   range  skew kurtosis     se
## Loan_ID*            613  0.00    -1.21   7.16
## Gender*               2 -1.92     2.91   0.02
## Married*              2 -0.72    -1.16   0.02
## Dependents*           4  0.89    -0.38   0.04
## Education*            1  1.36    -0.15   0.02
## Self_Employed*        2  0.49     2.17   0.02
## ApplicantIncome   80850  6.51    59.83 246.54
## CoapplicantIncome 41667  7.45    83.97 118.09
## LoanAmount          691  2.66    10.26   3.52
## Loan_Amount_Term    468 -2.35     6.58   2.66
## Credit_History        1 -1.87     1.51   0.02
## Property_Area*        2 -0.07    -1.39   0.03
## Loan_Status*          1 -0.81    -1.35   0.02

DATA TRANSFORMATION - PART A

Replacing blank space with NAs

loan_ori[loan_ori==""] <- NA
# Blank strings are now NA, so R treats them as missing values.
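
One side effect worth checking: after this replacement the empty string can remain registered as an unused factor level (the later str() output still lists a "" level). The sketch below uses a small made-up data frame, not the loan data, to show the replacement followed by `droplevels()`, which is one possible way to remove such unused levels.

```r
# Toy data frame standing in for loan_ori (values are made up for illustration)
toy <- data.frame(
  Gender  = factor(c("Male", "", "Female", "Male")),
  Married = factor(c("Yes", "No", "", "Yes"))
)

# Same replacement as above: blank strings become NA
toy[toy == ""] <- NA

# The "" level is still registered even though no row uses it any more
levels(toy$Gender)

# droplevels() removes unused levels from every factor column
toy_clean <- droplevels(toy)
levels(toy_clean$Gender)
```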

DATA TRANSFORMATION - PART B

NUMERICAL DATA EXAMINATION AND TRANSFORMATION

par(mfrow=c(2, 1))
boxplot(loan_ori$ApplicantIncome, horizontal = TRUE, main = "Boxplot for Applicant Income", col='red')
boxplot(loan_ori$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income", col='blue')

par(mfrow=c(2, 1))
# boxplot() takes its title through main=, not title=
boxplot(loan_ori$ApplicantIncome, outline= TRUE, col = "red",  main ='Applicant, with Outlier',horizontal=TRUE)
boxplot(loan_ori$ApplicantIncome, outline= FALSE, col = "red", main ='Applicant, without Outlier',horizontal=TRUE)

par(mfrow=c(2, 1))

boxplot(loan_ori$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income, With Outlier", col='blue', outline=TRUE)
boxplot(loan_ori$CoapplicantIncome, horizontal = TRUE, main = "Boxplot for Co-Applicant Income, without Outlier", col='blue', outline=FALSE)

par(mfrow=c(2, 1))
boxplot(loan_ori$LoanAmount, horizontal = TRUE, main = "Boxplot for LoanAmount, with Outlier", col='yellow', outline=TRUE)
boxplot(loan_ori$LoanAmount, horizontal = TRUE, main = "Boxplot for LoanAmount, without outlier", col='yellow', outline=FALSE)

# refer to columns by bare name inside aes() and facet_grid()
ggplot(data=loan_ori, aes(x= ApplicantIncome)) + 
  geom_histogram(col="red",fill="yellow", bins = 15) +
  facet_grid(~Loan_Status)+
  theme_bw()

DATA TRANSFORMATION - PART C

OUTLIER REMOVAL

The numerical variables shown here (applicant income, co-applicant income, and loan amount) are clearly not normally distributed. There are extreme outliers among high-income applicants. So that the model generalizes without these outliers, we exclude the highest earners from the data (applicant income of 35,000 or more, or co-applicant income of 20,000 or more). The income distribution is also heavily right-skewed, with most applicants in the lower income tiers and a long tail of high earners, so we use a log transformation to bring these variables closer to normal.
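
As a rough illustration of why the log transformation helps, the sketch below draws a synthetic right-skewed sample and compares its skewness before and after taking logs. This is not the loan data: the `rlnorm()` parameters and the `skewness()` helper are assumptions made only for this example.

```r
set.seed(1)

# Synthetic right-skewed "income" sample (log-normal), for illustration only
income <- rlnorm(1000, meanlog = 8.5, sdlog = 0.6)

# Simple moment-based skewness; positive values mean a long right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(income)
skew_log <- skewness(log(income))

c(raw = skew_raw, log = skew_log)
```

The raw sample has a clearly positive skewness, while the logged sample sits close to zero, which is the behavior the transformation above is meant to achieve.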

# loan2$LogToalIncome
dim(loan_ori)

## [1] 614  13

# loan1<- loan_ori

loan1 <-
loan_ori %>% 
  filter (ApplicantIncome<35000 ) %>% 
  filter ( CoapplicantIncome<20000) %>% 
  select (-Loan_ID)

dim(loan1)

## [1] 604  12

str(loan1)

## 'data.frame':    604 obs. of  12 variables:
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

DATA TRANSFORMATION - PART D

MAKE NEW VARIABLE

loan2<-loan1
## remove Loan_ID
# loan2$Loan_ID <-  NULL

Further, because a loan application is assessed on the combined income of the applicant and the co-applicant, there is little point in keeping them as two variables. We combine them into a single variable, total income, and log-transform it as well.

# Create new variable TotalIncome
loan2$TotalIncome <- loan2$ApplicantIncome + loan2$CoapplicantIncome

loan2$ApplicantIncome <- NULL
loan2$CoapplicantIncome <- NULL

loan2$LogToalIncome <- log(loan2$TotalIncome)

loan2$TotalIncome<- NULL

## log-transform LoanAmount and remove the original
loan2$LogLoanAmount <- log(loan2$LoanAmount)

loan2$LoanAmount<- NULL

str(loan2)

## 'data.frame':    604 obs. of  11 variables:
##  $ Gender          : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married         : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents      : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed   : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ Loan_Amount_Term: int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History  : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status     : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
##  $ LogToalIncome   : num  8.67 8.71 8.01 8.51 8.7 ...
##  $ LogLoanAmount   : num  NA 4.85 4.19 4.79 4.95 ...

hist(loan2$LogToalIncome, 
     main="Histogram for Total Income-Log Transformed and Outlier Removed", 
     xlab="Income", 
     border="blue", 
     col="maroon",
     las=1, 
     breaks=50, prob = TRUE)

hist(loan2$LogLoanAmount,  
     main="Histogram for Loan Amount-Log Transformed and Outlier Removed", 
     xlab="LoanAmount", 
     border="red", 
     col="blue",
     las=1, 
     breaks=50, prob = TRUE)

After the necessary transformation and outlier removal, the two most important numerical variables, total household income and loan amount, appear close to normally distributed. We can use them in the subsequent analysis.

IMPUTATION FOR MISSING DATA

sapply(loan2, function(x) sum(is.na(x)))

##           Gender          Married       Dependents        Education 
##               12                3               15                0 
##    Self_Employed Loan_Amount_Term   Credit_History    Property_Area 
##               30               14               49                0 
##      Loan_Status    LogToalIncome    LogLoanAmount 
##                0                0               22


mice_plot <- aggr(loan2, col=c('navyblue','red'),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(loan2), cex.axis=.7,
                  gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable       Count
##    Credit_History 0.081125828
##     Self_Employed 0.049668874
##     LogLoanAmount 0.036423841
##        Dependents 0.024834437
##  Loan_Amount_Term 0.023178808
##            Gender 0.019867550
##           Married 0.004966887
##         Education 0.000000000
##     Property_Area 0.000000000
##       Loan_Status 0.000000000
##     LogToalIncome 0.000000000

Judging from the data, about 14% of the rows contain at least one missing value. Credit history in particular, which is the most important variable in determining a loan decision, is missing for about 8% of rows. If we excluded every row with a missing value, the data would be skewed and unfit for analysis, because the missing values fall in important variables rather than unimportant ones.

# The mice() function takes care of the imputing process:
imputed_list <- mice(data=loan2, m=1, maxit = 2, method = 'cart', seed = 500)

## 
##  iter imp variable
##   1   1  Gender  Married  Dependents  Self_Employed  Loan_Amount_Term  Credit_History  LogLoanAmount
##   2   1  Gender  Married  Dependents  Self_Employed  Loan_Amount_Term  Credit_History  LogLoanAmount

## Warning: Number of logged events: 14

## m = 1: a single imputed dataset, otherwise too time consuming
## imputed_list is a mids object (a list), so dim() on it returns NULL
tr <- complete(imputed_list,1)   ## take the 1st (and only) imputed dataset

dim(imputed_list)

## NULL

dim(tr)

## [1] 604  11

loan3<-tr
# loan4_temp<-imputed_Data$data  ## still have missing data, does not work, do not know why
# str(loan4)

mice_plot2 <- aggr(loan3, col=c('#F8766D','#00BFC4'), numbers=TRUE, sortVars=TRUE, labels=names(loan3), cex.axis=.7, gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable Count
##            Gender     0
##           Married     0
##        Dependents     0
##         Education     0
##     Self_Employed     0
##  Loan_Amount_Term     0
##    Credit_History     0
##     Property_Area     0
##       Loan_Status     0
##     LogToalIncome     0
##     LogLoanAmount     0

# no missing now

sapply(loan3, function(x) sum(is.na(x)))

##           Gender          Married       Dependents        Education 
##                0                0                0                0 
##    Self_Employed Loan_Amount_Term   Credit_History    Property_Area 
##                0                0                0                0 
##      Loan_Status    LogToalIncome    LogLoanAmount 
##                0                0                0

We used the mice package for the missing data imputation, with the cart method, which handles both categorical and numerical variables. As we can see, after the imputation the data is complete, with minimal loss of information. We are satisfied with the result.

str(loan3)

## 'data.frame':    604 obs. of  11 variables:
##  $ Gender          : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married         : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents      : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education       : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed   : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ Loan_Amount_Term: num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History  : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area   : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status     : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
##  $ LogToalIncome   : num  8.67 8.71 8.01 8.51 8.7 ...
##  $ LogLoanAmount   : num  4.7 4.85 4.19 4.79 4.95 ...

hist(loan3$LogLoanAmount,
     main="Histogram for Log Loan Amount-After Imputation",
     xlab="Loan Amount-log",
     border="blue",
     col="maroon",
     las=1,
     breaks=20, prob = TRUE)

TRAINING TESTING SPLIT

set.seed(42)
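# train_idx holds the row numbers of the 70% training sample (avoids masking base::sample)
train_idx <- sample.int(n = nrow(loan3), size = floor(.70*nrow(loan3)), replace = FALSE)
trainnew <- loan3[train_idx, ]
testnew  <- loan3[-train_idx, ]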
dim(trainnew)

## [1] 422  11

dim(testnew)

## [1] 182  11
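
The split above is a simple random sample, so the share of approved loans can differ slightly between the training and test sets. A stratified split is one possible alternative. The sketch below is only an illustration on a stand-in outcome vector: the class counts (188 N, 416 Y) are taken from the row totals of the confusion matrices reported later, and the names `status` and `idx_by_class` are hypothetical.

```r
set.seed(42)

# Stand-in for loan3$Loan_Status with the class counts implied by the report
status <- factor(rep(c("N", "Y"), times = c(188, 416)))

# Row numbers grouped by class
idx_by_class <- split(seq_along(status), status)

# Sample 70% of the rows within each class
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.70 * length(idx)))]
}))

# Class proportions are now nearly identical in both parts
prop.table(table(status[train_idx]))
prop.table(table(status[-train_idx]))
```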

# summary(trainnew)

variable.summary(trainnew)

##                    Class %.NA Levels Min.Level.Size        Mean         SD
## Gender            factor    0      3              0          NA         NA
## Married           factor    0      3              0          NA         NA
## Dependents        factor    0      5              0          NA         NA
## Education         factor    0      2             93          NA         NA
## Self_Employed     factor    0      3              0          NA         NA
## Loan_Amount_Term numeric    0     NA             NA 340.0094787 67.2949328
## Credit_History   numeric    0     NA             NA   0.8649289  0.3422052
## Property_Area     factor    0      3            123          NA         NA
## Loan_Status       factor    0      2            128          NA         NA
## LogToalIncome    numeric    0     NA             NA   8.6399397  0.4928378
## LogLoanAmount    numeric    0     NA             NA   4.8403827  0.5133288

DECISION TREE

colnames(trainnew)

##  [1] "Gender"           "Married"          "Dependents"       "Education"       
##  [5] "Self_Employed"    "Loan_Amount_Term" "Credit_History"   "Property_Area"   
##  [9] "Loan_Status"      "LogToalIncome"    "LogLoanAmount"

dtree <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogLoanAmount+LogToalIncome,method="class", data=trainnew,parms=list(split="information"))
dtree$cptable

##           CP nsplit rel error    xerror       xstd
## 1 0.35156250      0 1.0000000 1.0000000 0.07377555
## 2 0.02343750      1 0.6484375 0.6484375 0.06379295
## 3 0.01171875      2 0.6250000 0.6562500 0.06408137
## 4 0.01000000      4 0.6015625 0.6406250 0.06350095

 fancyRpartPlot(dtree)

First we fit a decision tree on six predictors, using the class method. The full tree has four splits: it splits first on credit history, then on total income, and then on loan amount. This finding is intuitive and is in line with what we saw in the univariate analysis before the modeling.

dtree.pruned <- prune(dtree, cp=.02290076)

dtree.pred <- predict(dtree.pruned, trainnew, type="class")
dtree.perf <- table(trainnew$Loan_Status, dtree.pred,
                    dnn=c("Actual", "Predicted"))
dtree.perf

##       Predicted
## Actual   N   Y
##      N  56  72
##      Y   8 286

fancyRpartPlot(dtree.pruned)

Next we pruned the tree at a CP of 0.0229 and calculated the confusion matrix of the pruned tree on the training data.

In the pruned tree, applicants without a qualifying credit history (Credit_History = 0) have about a 21% chance of getting a loan.
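
To make the confusion matrix easier to read, the sketch below derives accuracy, sensitivity, and specificity from the training counts printed above. The helper `conf_metrics()` is a hypothetical name introduced for this example, and "Y" (approved) is assumed to be the positive class.

```r
# Rows are actual classes, columns are predicted classes
conf_metrics <- function(cm, positive = "Y") {
  negative <- setdiff(rownames(cm), positive)
  tp <- cm[positive, positive]
  tn <- cm[negative, negative]
  fn <- cm[positive, negative]
  fp <- cm[negative, positive]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of approved loans predicted as approved
    specificity = tn / (tn + fp))   # share of rejected loans predicted as rejected
}

# Counts copied from the pruned-tree confusion matrix on the training data
train_cm <- matrix(c(56, 8, 72, 286), nrow = 2,
                   dimnames = list(Actual = c("N", "Y"), Predicted = c("N", "Y")))

metrics <- conf_metrics(train_cm)
round(metrics, 3)
```

The high sensitivity and much lower specificity show that most of the tree's errors are rejected loans predicted as approved.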

dtree_test <- rpart(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogToalIncome
                ,method="class", data=testnew,parms=list(split="information"))

dtree_test$cptable

##          CP nsplit rel error    xerror       xstd
## 1 0.5166667      0 1.0000000 1.0000000 0.10569844
## 2 0.0100000      1 0.4833333 0.4833333 0.08229203

dtree_test.pruned <- prune(dtree_test, cp=.022)


dtree_test.pred <- predict(dtree_test.pruned, testnew, type="class")
dtree_test.perf <- table(testnew$Loan_Status, dtree_test.pred,
                    dnn=c("Actual", "Predicted"))
dtree_test.perf

##       Predicted
## Actual   N   Y
##      N  33  27
##      Y   2 120

fancyRpartPlot(dtree_test.pruned)
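
The code above fits a separate tree on testnew. A more conventional way to estimate out-of-sample performance is to score the test rows with the tree that was trained on the training rows. The sketch below shows that pattern on synthetic data, since the loan data is not reproduced here; the data-generating rule and the names `toy`, `fit`, and `test_cm` are assumptions for illustration.

```r
library(rpart)

set.seed(42)
n <- 600

# Synthetic stand-in for loan3; the relationship below is invented for illustration
toy <- data.frame(
  Credit_History = rbinom(n, 1, 0.85),
  LogToalIncome  = rnorm(n, mean = 8.6, sd = 0.5)
)
p_yes <- ifelse(toy$Credit_History == 1, 0.8, 0.1)
toy$Loan_Status <- factor(ifelse(runif(n) < p_yes, "Y", "N"), levels = c("N", "Y"))

# 70/30 split, as in the report
train_idx <- sample.int(n, size = floor(0.70 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- rpart(Loan_Status ~ Credit_History + LogToalIncome,
             data = toy_train, method = "class",
             parms = list(split = "information"))

# Score the untouched test rows with the model trained above
test_pred <- predict(fit, newdata = toy_test, type = "class")
test_cm <- table(toy_test$Loan_Status, test_pred, dnn = c("Actual", "Predicted"))
test_cm

# Held-out accuracy
test_acc <- sum(diag(test_cm)) / sum(test_cm)
test_acc
```

With the report's objects, the equivalent call would be along the lines of `predict(dtree.pruned, testnew, type = "class")`.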

RANDOM FOREST

Model without tuning- 1st Model

set.seed(817)
dim(trainnew)

## [1] 422  11

colnames(trainnew)

##  [1] "Gender"           "Married"          "Dependents"       "Education"       
##  [5] "Self_Employed"    "Loan_Amount_Term" "Credit_History"   "Property_Area"   
##  [9] "Loan_Status"      "LogToalIncome"    "LogLoanAmount"

original_rf<-randomForest(Loan_Status~ ., trainnew)
original_rf

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = trainnew) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 20.62%
## Confusion matrix:
##    N   Y class.error
## N 56  72  0.56250000
## Y 15 279  0.05102041

Determining the most important variable in the forest

plot(original_rf)

varImpPlot(original_rf)

From the random forest model, three variables clearly stand out as the most important: credit history, total income, and loan amount. The remaining variables follow in decreasing order of Gini importance and fall into a low-priority category. These three variables matter most in getting a loan approved, which is not surprising.

Selective Variables- 2nd RF Model

set.seed(42) 

fit.forest2 <- randomForest(Loan_Status ~ Credit_History+Education+Self_Employed+Property_Area+LogToalIncome, data=trainnew, importance=TRUE)
fit.forest2

## 
## Call:
##  randomForest(formula = Loan_Status ~ Credit_History + Education +      Self_Employed + Property_Area + LogToalIncome, data = trainnew,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.43%
## Confusion matrix:
##    N   Y class.error
## N 53  75  0.58593750
## Y  7 287  0.02380952

forest.pred2 <- predict(fit.forest2, testnew)
forest.perf_test <- table(testnew$Loan_Status, forest.pred2,
                     dnn=c("Actual", "Predicted"))
forest.perf_test

##       Predicted
## Actual   N   Y
##      N  35  25
##      Y   4 118

Tuning Model- Random Forest

set.seed(817)
tune_grid<-expand.grid(mtry=c(1:10), ntree=c(500,1000,1500,2000)) #expand a grid of parameters
mtry<-tune_grid[[1]]
ntree<-tune_grid[[2]] 

OOB<-NULL #use to store calculated OOB error estimate
for(i in 1:nrow(tune_grid)){
  rf<-randomForest(Loan_Status~. ,trainnew, mtry=mtry[i], ntree=ntree[i])
  confusion<-rf$confusion
  temp<-(confusion[2]+confusion[3])/nrow(trainnew) #OOB error: misclassified rows / training rows (422)
  OOB<-append(OOB,temp)
}
tune_grid$OOB<-OOB
head(tune_grid[order(tune_grid["OOB"]), ], 4) #order the results

##    mtry ntree       OOB
## 12    2  1000 0.2014218
## 13    3  1000 0.2014218
## 22    2  1500 0.2014218
## 32    2  2000 0.2014218
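
The OOB error rate is the number of out-of-bag misclassifications divided by the number of training rows (422 here). As a check on that arithmetic, the sketch below recomputes the rate from the confusion matrix printed for the first random forest; the object names are illustrative.

```r
# Counts copied from the first random forest's confusion matrix (rows = actual)
rf_confusion <- matrix(c(56, 15, 72, 279), nrow = 2,
                       dimnames = list(Actual = c("N", "Y"), Predicted = c("N", "Y")))

# Misclassified rows are the off-diagonal cells
n_wrong   <- sum(rf_confusion) - sum(diag(rf_confusion))
oob_error <- n_wrong / sum(rf_confusion)

# Matches the "OOB estimate of error rate" reported by randomForest for that model
round(100 * oob_error, 2)
```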

Gradient Boosting

I was not able to run gradient boosting on my computer because it crashes RStudio.

CONCLUSION:

Judging from the confusion matrices, the random forest model and the classification tree perform similarly on this dataset. The true positive and true negative counts in the two models' confusion matrices are close, meaning that their accuracy and false negative rates are similar.

Tuning makes the models slightly better than the originals, but not significantly better.

However, I did not run the models before the data transformation, so this conclusion may depend on the fact that the variables were transformed and normalized before fitting, which may indicate some overfitting in both models.

I would not use a gradient boosting model unless really necessary. This is a relatively simple and straightforward dataset with meaningful results; gradient boosting, due to its slow training process, might be overkill for this business problem.