Github Link
Web Link

Assignment:

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem? Format: document with screen captures & analysis.

Import Data and Data Structure

We imported the data from the local drive. Another option would be to load the data directly from GitHub, as sketched below.
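A minimal sketch of the GitHub option, assuming a raw GitHub URL (the path below is a placeholder) and the same na.strings handling used for the local file:

# Hypothetical raw GitHub URL -- replace <user>/<repo> with the actual repository path
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/Loan.csv"
loanDF <- read.csv(url, header = TRUE, na.strings = c("", "NA"))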

## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : chr  "LP001002" "LP001003" "LP001005" "LP001006" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Married          : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Dependents       : chr  "0" "1" "0" "0" ...
##  $ Education        : chr  "Graduate" "Graduate" "Graduate" "Not Graduate" ...
##  $ Self_Employed    : chr  "No" "No" "Yes" "No" ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ Loan_Status      : chr  "Y" "N" "Y" "Y" ...
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
LP001002 Male No 0 Graduate No 5849 0 NA 360 1 Urban Y
LP001003 Male Yes 1 Graduate No 4583 1508 128 360 1 Rural N
LP001005 Male Yes 0 Graduate Yes 3000 0 66 360 1 Urban Y
LP001006 Male Yes 0 Not Graduate No 2583 2358 120 360 1 Urban Y
LP001008 Male No 0 Graduate No 6000 0 141 360 1 Urban Y
LP001011 Male Yes 2 Graduate Yes 5417 4196 267 360 1 Urban Y
LP001013 Male Yes 0 Not Graduate No 2333 1516 95 360 1 Urban Y
LP001014 Male Yes 3+ Graduate No 3036 2504 158 360 0 Semiurban N

—————————————————————————————————————————————————-

Dataset Description

Variables ========== Descriptions

Loan_ID ========== Unique Loan ID

Gender =========== Male/Female

Married =========== Applicant marital status (Yes/No)

Dependents ========= Number of dependents

Education ========== Applicant education (Graduate/Not Graduate)

Self_Employed ====== Self-employed (Yes/No)

ApplicantIncome ==== Applicant income

CoapplicantIncome == Coapplicant income

LoanAmount ========= Loan amount (in thousands of dollars)

Loan_Amount_Term === Term of loan in months

Credit_History ===== Credit history meets guidelines (1 = yes, 0 = no)

Property_Area ====== Urban, semi-urban, rural

Loan_Status ======== Loan approved (Y/N)

This dataset is in a typical format that banks use to screen applicants for a loan. There are 614 records with 13 variables. The data types are mostly character and numeric. Several variables (Loan_Status, Self_Employed, Married, Dependents, etc.) are stored as character but should be factors with two levels (Yes/No or 0/1). The variable Credit_History would ideally be expressed as a number of years; here we assume the bank uses ‘1’ to indicate that a customer meets the minimum credit-history requirement for a loan and ‘0’ for those who do not. Normally, a customer with Credit_History = 0 would be denied a loan. Is that true in this bank’s records? The answer is no, so the approval decision must depend on a combination of the other predictors rather than on Credit_History alone. Because every record carries a Loan_Status label, this is labeled data and we can be confident in using supervised learning on it: we will build a classification model to predict the state of a client’s loan approval.
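A quick cross-tabulation makes this easy to verify; a minimal sketch, assuming the data frame is named loanDF as in the structure output above:

# A non-zero count in the (Credit_History = 0, Loan_Status = Y) cell means some
# applicants were approved despite not meeting the credit-history guideline.
with(loanDF, table(Credit_History, Loan_Status, useNA = "ifany"))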

Cleaning Data

#install.packages('Amelia')
#install.packages('DataExplorer')

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
#sum(is.na(loanDF))
misValues <- sum(is.na(loanDF)) # Total number of missing (NA) values in the dataset

# Count the columns whose values are entirely NA or empty strings
emptyValue <- sum(sapply(loanDF, function(x) all(is.na(x) | x == '')))

cat("The dataset contains missing values for a total record of : " , misValues)
## The dataset contains missing values for a total record of :  86
print("\n")
## [1] "\n"
cat("The dataset contains empty values for a total record of : " , emptyValue)
## The dataset contains empty values for a total record of :  0
missmap(loanDF,col=c('yellow','black'),y.at=1,y.labels=' ',legend=TRUE)

#count(loanDF$Credit_History)

The missing-value map confirms that there are indeed missing values (86 in total) within the dataset. Let’s take a closer look at these missing values.

library(VIM)
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
## 
##     coords
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
#aggr(loanDF)
#vis_miss(loanDF)



# Count the NA values in each variable and keep only the variables with missing data
missing.values <- function(df){
    df %>%
    gather(key = "variables", value = "val") %>%
    mutate(is.missing = is.na(val)) %>%
    group_by(variables, is.missing) %>%
    dplyr::summarise(number.missing = n()) %>%
    filter(is.missing==T) %>%
    dplyr::select(-is.missing) %>%
    arrange(desc(number.missing)) 
}

missing.values(loanDF)%>%
  kable()
## `summarise()` has grouped output by 'variables'. You can override using the
## `.groups` argument.
variables          number.missing
Credit_History                 50
LoanAmount                     22
Loan_Amount_Term               14
library(DataExplorer)
plot_missing(loanDF)

#gg_miss_upset(loanDF)


The missing values occur in three variables (Loan_Amount_Term, LoanAmount and Credit_History). Since the dataset is small, deleting the affected rows would shrink it further. Instead of deleting, we can impute these missing values.
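As a rough check of how much listwise deletion would cost, the share of incomplete rows can be computed directly; a small sketch, assuming loanDF as above:

# Proportion and count of rows with at least one NA; dropping them would shrink
# an already small dataset.
mean(!complete.cases(loanDF))
sum(!complete.cases(loanDF))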

#if (is.na(loanDF$Self_Employed) || loanDF$Self_Employed == '')
count(loanDF$Gender)
##        x freq
## 1          13
## 2 Female  112
## 3   Male  489
count(loanDF$Married)
##     x freq
## 1        3
## 2  No  213
## 3 Yes  398
count(loanDF$Self_Employed)
##     x freq
## 1       32
## 2  No  500
## 3 Yes   82
count(loanDF$Credit_History)
##    x freq
## 1  0   89
## 2  1  475
## 3 NA   50
print("The above frequency distribution shows that there are 04 variable with some blank/empty values")
## [1] "The above frequency distribution shows that there are 04 variable with some blank/empty values"
#loanDF$Gender[loanDF$Gender==""]<-NA
#loanDF[loanDF==""]<- c('NA')

# Works but does not fix the issue with blanks value
#loanDF <- loanDF %>% 
#                 mutate_all(na_if,"")

# Works but does not fix the issue with blank value
## define a empty function
# empty_as_na <- function(x){
#     if("factor" %in% class(x)) x <- as.character(x) ## since ifelse won't work with factors
#     ifelse(as.character(x)!="", x, NA) <NA>
# }

# Works but the issue with blank value is still present
## transform all columns
#loanDF %>% 
#  mutate_each(funs(empty_as_na))

#loanDF[loanDF=="NA"]<- c('<NA>')
#loanDF <- loanDF %>%
#   mutate(across(everything(), ~ifelse(.=="", NA, as.character(.))))


print("\n")
## [1] "\n"
print("Let's see if sum of missing values will cath these blank values since we applied a function earlier to account for this issue")
## [1] "Let's see if sum of missing values will cath these blank values since we applied a function earlier to account for this issue"
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Credit_History is:  ", sum(is.na(loanDF$Credit_History)))
## Sum of missing values within variable = Credit_History is:   50
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Gender is:  ", sum(is.na(loanDF$Gender)))
## Sum of missing values within variable = Gender is:   0
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Self_Employed is:  ", sum(is.na(loanDF$Self_Employed)))
## Sum of missing values within variable = Self_Employed is:   0
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Married is:  ", sum(is.na(loanDF$Married)))
## Sum of missing values within variable = Married is:   0
print("\n")
## [1] "\n"
#View(loanDF)

Somehow some blank values remain. They are hard to spot because the missing-value map above does not treat empty strings as NA. We will fill the empty/blank values with ‘NA’ and then check again before performing imputation.
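For reference, a compact way to turn empty strings into proper NA across all character columns is sketched below (using dplyr); this was not the approach ultimately applied here:

library(dplyr)

# Convert "" (and whitespace-only strings) to NA in every character column
loanDF <- loanDF %>%
  mutate(across(where(is.character), ~ na_if(trimws(.x), "")))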

#loanDF <- read.csv("Loan.csv", header=T, na.strings=c("",'NA'))

#loanDF$Gender[loanDF$Gender == " "]<- NA
# loanDF$Gender[loanDF$Gender == ""  | loanDF$Gender== " "] <- NA 
# loanDF$Dependents[loanDF$Dependents == ""  | loanDF$Dependents== " "] <- NA 
# loanDF$Self_Employed[loanDF$Self_Employed == ""  | loanDF$Self_Employed== " "] <- NA 
# loanDF$Married[loanDF$Married == ""  | loanDF$Married== " "] <- NA 

#loanDF$Self_Employed[is.na(loanDF$Self_Employed)] <- mean(loanDF$Self_Employed, na.rm = TRUE)

#if (!require("tidyverse")) install.packages("tidyverse")

# loanDF %>% 
#   mutate(Gender = if_else(is.na(Gender), 
#                          calc_mode(Gender), 
#                          Gender))
# 
# calc_mode <- function(x){
#   
#   # List the distinct / unique values
#   distinct_values <- unique(x)
#   
#   # Count the occurrence of each distinct value
#   distinct_tabulate <- tabulate(match(x, distinct_values))
#   
#   # Return the value with the highest occurrence
#   distinct_values[which.max(distinct_tabulate)]
# }
# 
# 
# loanDF %>% 
#   mutate(across(everything(), ~replace_na(.x, calc_mode(.x))))

# 
# getmode <- function(v){
#   v=v[nchar(as.character(v))>0]
#   uniqv <- unique(v)
#   uniqv[which.max(tabulate(match(v, uniqv)))]
# }
# 
# for (cols in colnames(df)) {
#   if (cols %in% names(df[,sapply(df, is.numeric)])) {
#     df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))
#      
#   }
#   else {
#      
#     df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))
#      
#   }
# }
#  
# df


# The attempts above run without error, but the blank-value issue persists. This time we apply the professor's fix.

# replace_na() (from tidyr) substitutes the string "NA" for missing values;
# note that it does not change empty strings ("")
loanDF$Married <- loanDF$Married %>% replace_na("NA")

loanDF$Gender <- loanDF$Gender %>% replace_na("NA")

loanDF$Dependents <- loanDF$Dependents %>% replace_na("NA")

loanDF$Self_Employed <- loanDF$Self_Employed %>% replace_na("NA")


count(loanDF$Gender)
##        x freq
## 1          13
## 2 Female  112
## 3   Male  489
count(loanDF$Self_Employed)
##     x freq
## 1       32
## 2  No  500
## 3 Yes   82
count(loanDF$Credit_History)
##    x freq
## 1  0   89
## 2  1  475
## 3 NA   50
count(loanDF$Married)
##     x freq
## 1        3
## 2  No  213
## 3 Yes  398
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Credit_History is:  ", sum(is.na(loanDF$Credit_History)))
## Sum of missing values within variable = Credit_History is:   50
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Gender is:  ", sum(is.na(loanDF$Gender)))
## Sum of missing values within variable = Gender is:   0
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Self_Employed is:  ", sum(is.na(loanDF$Self_Employed)))
## Sum of missing values within variable = Self_Employed is:   0
print("\n")
## [1] "\n"
cat("Sum of missing values within variable = Married is:  ", sum(is.na(loanDF$Married)))
## Sum of missing values within variable = Married is:   0
print("\n")
## [1] "\n"
#View(loanDF)

Let’s perform the imputation.


#loanDF1 <- loanDF1[-which(loanDF1$Gender == ""), ]


library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
imputed <- mice(loanDF, m=2, maxit = 2, method = 'cart', seed = 23321)
## 
##  iter imp variable
##   1   1  LoanAmount  Loan_Amount_Term  Credit_History
##   1   2  LoanAmount  Loan_Amount_Term  Credit_History
##   2   1  LoanAmount  Loan_Amount_Term  Credit_History
##   2   2  LoanAmount  Loan_Amount_Term  Credit_History
## Warning: Number of logged events: 8
#mice = multiple imputation by chained equations. The 'm' argument = number of rounds of imputation
#CART = classification and regression trees
loanDF1<- complete(imputed,2) #here I chose the second round of data imputation

missmap(loanDF1,col=c('yellow','black'),y.at=1,y.labels=' ',legend=TRUE)

str(loanDF1)
## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : chr  "LP001002" "LP001003" "LP001005" "LP001006" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Married          : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Dependents       : chr  "0" "1" "0" "0" ...
##  $ Education        : chr  "Graduate" "Graduate" "Graduate" "Not Graduate" ...
##  $ Self_Employed    : chr  "No" "No" "Yes" "No" ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : num  128 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ Loan_Status      : chr  "Y" "N" "Y" "Y" ...
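mice also offers diagnostic plots for comparing observed and imputed values, and the imputed draws themselves can be inspected; a short sketch using the imputed object created above:

# Density of observed vs. imputed values for the variables that were imputed
densityplot(imputed)

# Look at the imputed draws for LoanAmount (one column per imputation round)
head(imputed$imp$LoanAmount)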
#library(stringi)
#stri_isempty(loanDF1$Self_Employed)



#is.null(loanDF1$Gender)
# Checking for empty value again
count(loanDF1$Gender)
##        x freq
## 1          13
## 2 Female  112
## 3   Male  489
count(loanDF1$Married)
##     x freq
## 1        3
## 2  No  213
## 3 Yes  398

We can clearly see that there is no more missing data, but the issue with blank values persists.

Processing Data

Let’s remove the variable we don’t need for the decision-tree models (Loan_ID). Later, we will convert several character variables (Married, Dependents, Self_Employed, Credit_History, Loan_Status, etc.) into factors.

loanDF1$Loan_ID <- NULL
str(loanDF1)
## 'data.frame':    614 obs. of  12 variables:
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Married          : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Dependents       : chr  "0" "1" "0" "0" ...
##  $ Education        : chr  "Graduate" "Graduate" "Graduate" "Not Graduate" ...
##  $ Self_Employed    : chr  "No" "No" "Yes" "No" ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : num  128 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ Loan_Status      : chr  "Y" "N" "Y" "Y" ...

Summary and Correlation

Below is a summary of the cleaned dataset, followed by bar plots of Loan_Status against the main categorical variables.

summary(loanDF1)
##     Gender            Married           Dependents         Education        
##  Length:614         Length:614         Length:614         Length:614        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Self_Employed      ApplicantIncome CoapplicantIncome   LoanAmount   
##  Length:614         Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Class :character   1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  Mode  :character   Median : 3812   Median : 1188     Median :128.0  
##                     Mean   : 5403   Mean   : 1621     Mean   :146.8  
##                     3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                     Max.   :81000   Max.   :41667     Max.   :700.0  
##  Loan_Amount_Term Credit_History   Property_Area      Loan_Status       
##  Min.   : 12.0    Min.   :0.0000   Length:614         Length:614        
##  1st Qu.:360.0    1st Qu.:1.0000   Class :character   Class :character  
##  Median :360.0    Median :1.0000   Mode  :character   Mode  :character  
##  Mean   :342.3    Mean   :0.8388                                        
##  3rd Qu.:360.0    3rd Qu.:1.0000                                        
##  Max.   :480.0    Max.   :1.0000
#library(psych)

#describe(loanDF1$Self_Employed)
par(mfrow=c(2,3))
corr1 <- table(loanDF1$Loan_Status, loanDF1$Gender)
barplot(corr1, main="Loan Status by Gender",
        xlab="Gender", col=c("darkgrey","green"),
        legend = rownames(corr1))

corr2 <- table(loanDF1$Loan_Status, loanDF1$Education)
barplot(corr2, main="Loan Status by Education",
        xlab="Education", col=c("darkgrey","blue"),
        legend = rownames(corr2))
corr3 <- table(loanDF1$Loan_Status, loanDF1$Married)
barplot(corr3, main="Loan Status by Married",
        xlab="Married", col=c("darkgrey","red"),
        legend = rownames(corr3))
corr4 <- table(loanDF1$Loan_Status, loanDF1$Self_Employed)
barplot(corr4, main="Loan Status by Self Employed",
        xlab="Self_Employed", col=c("darkgrey","yellow"),
        legend = rownames(corr4))
corr5 <- table(loanDF1$Loan_Status, loanDF1$Property_Area)
barplot(corr5, main="Loan Status by Property_Area",
        xlab="Property_Area", col=c("black","maroon"),
        legend = rownames(corr5))
corr6 <- table(loanDF1$Loan_Status, loanDF1$Credit_History)
barplot(corr6, main="Loan Status by Credit_History",
        xlab="Credit_History", col=c("darkgrey","maroon"),
        legend = rownames(corr6))

#cor(loanDF1)

The assumption we made earlier turns out to be false: a small percentage of customers had their loans approved even though they did not meet the credit-history guideline. The Loan_Status decision therefore depends on variables other than Credit_History alone. Out of curiosity, we also checked loan approval by gender and found that far more men than women apply for loans; we wonder how the bank would interpret this result. Perhaps the workforce in the area is predominantly male. Comparing married and non-married applicants gives the result we would anticipate: married applicants have more loans approved than non-married ones. The plots also show that the bank approves more loans for graduates than for non-graduates, and that self-employed customers are approved less often, possibly simply because there are fewer self-employed than employed applicants in the area.
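To put numbers behind the bar plots, row-wise proportions of approval within each Credit_History group can be computed; a quick sketch:

# Share of approved (Y) and rejected (N) loans within each credit-history group
prop.table(table(loanDF1$Credit_History, loanDF1$Loan_Status), margin = 1)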

These results still show the blank values.

Let’s look at the distributions of loan approval, applicant income and loan amount.


We observe right-skewed distributions with some outliers. One way to deal with outliers is to delete them when there are only a few, but that could harm the rest of the data since this is a small dataset. Since imputation by classification and regression trees (cart) did not fix the blank values, we will try one more method, random forest (rf), and then transform the character variables into factors.
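As an illustration, the usual 1.5 x IQR rule shows how many applicant-income values would be flagged as outliers; a sketch only, no rows are actually removed:

# Count incomes above Q3 + 1.5 * IQR
q3    <- quantile(loanDF1$ApplicantIncome, 0.75)
upper <- q3 + 1.5 * IQR(loanDF1$ApplicantIncome)
sum(loanDF1$ApplicantIncome > upper)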

# Dry run (maxit = 0) just to extract the default predictor matrix
imputed <- mice(loanDF1, maxit = 0)
predicts <- imputed$predictorMatrix

imputed <- mice(loanDF1, method = 'rf', predictorMatrix = predicts, m=2)
## 
##  iter imp variable
##   1   1
##   1   2
##   2   1
##   2   2
##   3   1
##   3   2
##   4   1
##   4   2
##   5   1
##   5   2
loanDF1 <- complete(imputed)
count(loanDF1$Gender)
##        x freq
## 1          13
## 2 Female  112
## 3   Male  489
data(loanDF1, package="lattice")
ggplot(data=loanDF1, aes(x=CoapplicantIncome, fill=Education)) +
  geom_density() +
  facet_grid(Education~.)

data(loanDF1, package="lattice")
ggplot(data=loanDF1, aes(x=CoapplicantIncome, fill=Property_Area)) +
  geom_density() +
  facet_grid(Property_Area~.)

loanDF1$Gender <- as.factor(loanDF1$Gender)
loanDF1$Married <- as.factor(loanDF1$Married)
loanDF1$Dependents <- as.factor(loanDF1$Dependents)
loanDF1$Education <- as.factor(loanDF1$Education)
loanDF1$Self_Employed <- as.factor(loanDF1$Self_Employed)
loanDF1$Property_Area <- as.factor(loanDF1$Property_Area)
loanDF1$Credit_History <- as.factor(loanDF1$Credit_History)
loanDF1$Loan_Status <- as.factor(loanDF1$Loan_Status)




str(loanDF1)
## 'data.frame':    614 obs. of  12 variables:
##  $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : num  128 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : num  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

Building Model1 Decision Trees

library(caTools)
library(party)
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:arules':
## 
##     info
## The following object is masked from 'package:plyr':
## 
##     empty
## The following object is masked from 'package:BayesFactor':
## 
##     posterior
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
loanDF2 <- loanDF1 %>%
                   dplyr::select(Gender, Married, Dependents, Education, Self_Employed, Property_Area, Credit_History, Loan_Status)

data1 = sample.split(loanDF2, SplitRatio = 0.80)
train1 <- subset(loanDF2, data1 == TRUE)
test1 <- subset(loanDF2, data1 == FALSE)
model1 <- ctree(Loan_Status ~ ., train1)
plot(model1)
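One caveat: caTools::sample.split() is normally given the outcome vector rather than the whole data frame, so that the split preserves the Y/N class proportions. A sketch of that variant is below; the results reported in this document use the split as originally written, and the seed value is arbitrary:

set.seed(123)   # arbitrary seed, for reproducibility
split_idx <- sample.split(loanDF2$Loan_Status, SplitRatio = 0.80)
train1 <- subset(loanDF2, split_idx == TRUE)
test1  <- subset(loanDF2, split_idx == FALSE)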

Prediction of model1

pred1 <- predict(model1, test1)
classifier1 <- table(test1$Loan_Status, pred1)
classifier1
##    pred1
##       N   Y
##   N  21  25
##   Y   4 102

Accuracy of Model 1

accuracy1 <- sum(diag(classifier1))/sum(classifier1)
accuracy1
## [1] 0.8092105
#str(loanDF2)


Let’s try the rpart() function (Model 2)

library(rpart)
library(rpart.plot)
library(caret)

model2 <- rpart(Loan_Status ~.,method="class", data=train1)

rpart.plot(model2, tweak =1.6)

model2.pred <- predict(model2, test1, type="class")
model2.accuracy <- table(test1$Loan_Status, model2.pred, dnn=c("Actual", "Predicted"))
model2.accuracy
##       Predicted
## Actual   N   Y
##      N  21  25
##      Y   4 102
confusionMatrix(predict(model2, type = "class"), train1$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  62  12
##          Y  84 304
##                                           
##                Accuracy : 0.7922          
##                  95% CI : (0.7523, 0.8283)
##     No Information Rate : 0.684           
##     P-Value [Acc > NIR] : 1.396e-07       
##                                           
##                   Kappa : 0.4458          
##                                           
##  Mcnemar's Test P-Value : 4.280e-13       
##                                           
##             Sensitivity : 0.4247          
##             Specificity : 0.9620          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.7835          
##              Prevalence : 0.3160          
##          Detection Rate : 0.1342          
##    Detection Prevalence : 0.1602          
##       Balanced Accuracy : 0.6933          
##                                           
##        'Positive' Class : N               
## 
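Note that the confusionMatrix() call above scores model2 on the training data, because predict() was called without a newdata argument. The test-set version, directly comparable with model1, would be:

# Confusion matrix on the held-out test set (model2.pred was computed on test1)
confusionMatrix(model2.pred, test1$Loan_Status)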
Model3 Random Forest

library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
model3 <- randomForest(Loan_Status ~., data = train1, importance = TRUE, ntree=500)
print(model3)
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = train1, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 21.21%
## Confusion matrix:
##    N   Y class.error
## N 62  84   0.5753425
## Y 14 302   0.0443038
varImp(model3)
##                         N          Y
## Gender          2.9838443  2.9838443
## Married         0.6227917  0.6227917
## Dependents      0.5938169  0.5938169
## Education       1.8119032  1.8119032
## Self_Employed  -2.0796632 -2.0796632
## Property_Area   5.1007809  5.1007809
## Credit_History 58.0467064 58.0467064
varImpPlot(model3)
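The forest was grown with the default mtry (2 variables tried at each split); randomForest::tuneRF() can search over mtry using the OOB error, as sketched below (not run here, output omitted):

# Tune mtry on the training predictors using out-of-bag error
x_train <- train1[, setdiff(names(train1), "Loan_Status")]
tuneRF(x_train, train1$Loan_Status, stepFactor = 1.5, improve = 0.01,
       ntreeTry = 500, trace = TRUE)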

#importance(model3, type = 2)

pred3 <- predict(model3, test1)
model3.accuracy <- table(test1$Loan_Status, pred3, dnn = c("actual", "predicted"))
model3.accuracy
##       predicted
## actual   N   Y
##      N  21  25
##      Y   4 102
conf_matrix_RF <- confusionMatrix(pred3, test1$Loan_Status)
conf_matrix_RF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  21   4
##          Y  25 102
##                                           
##                Accuracy : 0.8092          
##                  95% CI : (0.7376, 0.8683)
##     No Information Rate : 0.6974          
##     P-Value [Acc > NIR] : 0.0012442       
##                                           
##                   Kappa : 0.4809          
##                                           
##  Mcnemar's Test P-Value : 0.0002041       
##                                           
##             Sensitivity : 0.4565          
##             Specificity : 0.9623          
##          Pos Pred Value : 0.8400          
##          Neg Pred Value : 0.8031          
##              Prevalence : 0.3026          
##          Detection Rate : 0.1382          
##    Detection Prevalence : 0.1645          
##       Balanced Accuracy : 0.7094          
##                                           
##        'Positive' Class : N               
## 

Summary of model performance

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
decision_tree_model <- confusionMatrix(table(model2.pred, test1$Loan_Status))$byClass
decision_tree_accuracy <- confusionMatrix(table(model2.pred, test1$Loan_Status))$overall['Accuracy']
decision_tree_model <- data.frame(decision_tree_model)
decision_tree_model <- rbind("Accuracy" = decision_tree_accuracy, decision_tree_model)


randomForest_model <- confusionMatrix(table(pred3, test1$Loan_Status))$byClass
randomforest_accuracy <- confusionMatrix(table(pred3, test1$Loan_Status))$overall['Accuracy']
randomForest_model <- data.frame(randomForest_model)
randomForest_model <- rbind("Accuracy" = randomforest_accuracy, randomForest_model)

summary_dt_rf <- data.frame(decision_tree_model, randomForest_model)

summary_dt_rf %>% 
              kable() %>% 
                     kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Metric                  decision_tree_model   randomForest_model
Accuracy                0.8092105             0.8092105
Sensitivity             0.4565217             0.4565217
Specificity             0.9622642             0.9622642
Pos Pred Value          0.8400000             0.8400000
Neg Pred Value          0.8031496             0.8031496
Precision               0.8400000             0.8400000
Recall                  0.4565217             0.4565217
F1                      0.5915493             0.5915493
Prevalence              0.3026316             0.3026316
Detection Rate          0.1381579             0.1381579
Detection Prevalence    0.1644737             0.1644737
Balanced Accuracy       0.7093929             0.7093929

The decision-tree and random-forest models perform identically on this test set. We double-checked that we had not accidentally compared the same predictions twice; the code does call the decision-tree and random-forest functions separately. A more likely explanation is that, with only a few categorical predictors and Credit_History dominating the importance ranking, rpart() and randomForest() converge on essentially the same decision rule. Let’s switch the target variable and see whether we still get the same result.

str(loanDF2)
## 'data.frame':    614 obs. of  8 variables:
##  $ Gender        : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Married       : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
##  $ Dependents    : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
##  $ Education     : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Self_Employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
##  $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
##  $ Credit_History: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Loan_Status   : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

Credit history seems an appropriate target variable: suppose the bank wants to predict, from the other pre-existing characteristics described in the dataset, whether a customer requesting a new loan meets the minimum credit-history requirement.

Model4 Decision Tree

library(caTools)
library(party)

loanDF2 <- loanDF1 %>%
                   dplyr::select(Gender, Married, Dependents, Education, Self_Employed, Property_Area, Credit_History, Loan_Status)

data2 = sample.split(loanDF2, SplitRatio = 0.80)
train2 <- subset(loanDF2, data2 == TRUE)
test2 <- subset(loanDF2, data2 == FALSE)
model4 <- ctree(Credit_History ~ ., train2)
plot(model4)

Prediction of model4

pred4 <- predict(model4, test2)
classifier2 <- table(test2$Credit_History, pred4)
classifier2
##    pred4
##       0   1
##   0   0  21
##   1   0 133

Accuracy of Model4

accuracy1 <- sum(diag(classifier2))/sum(classifier2)
accuracy1
## [1] 0.8636364
#str(loanDF2)


Model5 Random Forest

library(randomForest)

model5 <- randomForest(Credit_History ~., data = train2, importance = TRUE, ntree=500)
print(model5)
## 
## Call:
##  randomForest(formula = Credit_History ~ ., data = train2, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.78%
## Confusion matrix:
##    0   1 class.error
## 0 13  65  0.83333333
## 1 26 356  0.06806283
varImp(model5)
##                        0          1
## Gender        -0.0597435 -0.0597435
## Married        2.0295887  2.0295887
## Dependents    -0.7149852 -0.7149852
## Education     -2.9101691 -2.9101691
## Self_Employed -1.5031505 -1.5031505
## Property_Area -2.8970870 -2.8970870
## Loan_Status   35.0951505 35.0951505
varImpPlot(model5)

#importance(model3, type = 2)

pred5 <- predict(model5, test2)
model5.accuracy <- table(test2$Credit_History, pred5, dnn = c("actual", "predicted"))
model5.accuracy
##       predicted
## actual   0   1
##      0   5  16
##      1  11 122
conf_matrix_RF <- confusionMatrix(pred5, test2$Credit_History)
conf_matrix_RF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0   5  11
##          1  16 122
##                                           
##                Accuracy : 0.8247          
##                  95% CI : (0.7553, 0.8812)
##     No Information Rate : 0.8636          
##     P-Value [Acc > NIR] : 0.9325          
##                                           
##                   Kappa : 0.1727          
##                                           
##  Mcnemar's Test P-Value : 0.4414          
##                                           
##             Sensitivity : 0.23810         
##             Specificity : 0.91729         
##          Pos Pred Value : 0.31250         
##          Neg Pred Value : 0.88406         
##              Prevalence : 0.13636         
##          Detection Rate : 0.03247         
##    Detection Prevalence : 0.10390         
##       Balanced Accuracy : 0.57769         
##                                           
##        'Positive' Class : 0               
## 

Summary of model (Credit History as a target) performance

library(kableExtra)
decision_tree_model <- confusionMatrix(table(pred4, test2$Credit_History))$byClass
decision_tree_accuracy <- confusionMatrix(table(pred4, test2$Credit_History))$overall['Accuracy']
decision_tree_model <- data.frame(decision_tree_model)
decision_tree_model <- rbind("Accuracy" = decision_tree_accuracy, decision_tree_model)


randomForest_model <- confusionMatrix(table(pred5, test2$Credit_History))$byClass
randomforest_accuracy <- confusionMatrix(table(pred5, test2$Credit_History))$overall['Accuracy']
randomForest_model <- data.frame(randomForest_model)
randomForest_model <- rbind("Accuracy" = randomforest_accuracy, randomForest_model)

summary_dt_rf <- data.frame(decision_tree_model, randomForest_model)

summary_dt_rf %>% 
              kable() %>% 
                     kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) 
Metric                  decision_tree_model   randomForest_model
Accuracy                0.8636364             0.8246753
Sensitivity             0.0000000             0.2380952
Specificity             1.0000000             0.9172932
Pos Pred Value          NaN                   0.3125000
Neg Pred Value          0.8636364             0.8840580
Precision               NA                    0.3125000
Recall                  0.0000000             0.2380952
F1                      NA                    0.2702703
Prevalence              0.1363636             0.1363636
Detection Rate          0.0000000             0.0324675
Detection Prevalence    0.0000000             0.1038961
Balanced Accuracy       0.5000000             0.5776942

This time, judged on raw accuracy, the decision tree appears to win over the random forest (0.864 vs. 0.825). However, the decision tree’s confusion matrix shows that it predicts every test case as Credit_History = 1, so its accuracy is simply the no-information rate (sensitivity 0, balanced accuracy 0.5), while the random forest at least recovers some of the minority class. Part of the difference may also come from using the ctree() function for the decision-tree model, and some bias may remain because the dataset is not fully clean (blank values are still present).
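A quick way to see why raw accuracy is misleading here is to compare it with the no-information rate of the test set, i.e. the accuracy of always predicting the majority class; a sketch:

# Accuracy achieved by always predicting Credit_History = 1 on the test set
max(prop.table(table(test2$Credit_History)))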