Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework3


Part 1: KNN on the Penguins dataset


Please use K-nearest neighbor (KNN) algorithm to predict the species variable. Please be sure to walk through the steps you took. (40 points)

Similar to past assignments when using the Palmer Penguins dataset, we’ll first do some quick exploratory analysis to examine the different features available.

Load Data

We can see below, that there are four continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g. Additionally, there are a few categorical variables: island, sex, and year.

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

Feature Plots

For the continuous variables, we can examine the distributions, broken out by the target variable, species:

By separating our distributions by our target variable, species, we can see that many of the feature interactions show clustering between Adelie and Chinstrap penguins (red and green), while Gentoo penguins tend to contrast the other two species for most interactions. This is also confirmed by most of the single-variable distributions split out by species in the plots below. With the exception of the distribution of bill_length_mm, distributions for body_mass_g, bill_depth_mm, and flipper_length_mm all show there to be overlapping distributions between Adelie and Chinstrap penguin species.

Next, we’ll do some data tidying to get our data set ready for our KNN model. year reflects the date/time of recording and requires some additional thought. If we hypothesize or expect that variability in penguin features could have a time dependent effect, then year could be important. For example, if penguin food supplies vary year by year and this impacts growth or if yearly temperature affects penguin development, then year could matter. If we thought global warming might be impacting penguins physical development or other natural disasters might have impacted grown on specific years, then again, year might matter. That said, in this situation, because we don’t have any reason to expect a time dependent effect, year probably doesn’t add any explanatory power to our models and we can remove it from our data set:

penguins <- penguins %>% 
  dplyr::select(-year)

Missing Value

Additionally, when looking at the number of missing values, we can see the following:

penguins %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  pivot_longer(cols = 1:7, names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
    geom_label(aes(label = NA_count)) +
    theme_minimal() +
    labs(title = 'Count of missing values in penguins dataset') +
    theme(plot.title = element_text(hjust = 0.5))

gg_miss_upset(penguins)

We can see that 11 individuals have at least one missing data point. The missing data points are clustered - two penguin records are missing bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex. None other penguin records missing just the sex data point. We cannot do much with the 2 records missing most of their data, so those will be dropped. For the records missing just sex, while we could try to KNN impute, we don’t know enough about gender differences to know if that’s a safe operation or whether it would introduce additional variability. Therefore, we’ll drop these from our data set as well.

penguins <- na.omit(penguins)

Now, we can take a look at a summary of data before splitting to get a sense of what still needs to be tidy’d in order to get it ready for our KNN model:

skim(penguins)
Data summary
Name penguins
Number of rows 333
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 146, Gen: 119, Chi: 68
island 0 1 FALSE 3 Bis: 163, Dre: 123, Tor: 47
sex 0 1 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 0 1 43.99 5.47 32.1 39.5 44.5 48.6 59.6 ▃▇▇▆▁
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂
flipper_length_mm 0 1 200.97 14.02 172.0 190.0 197.0 213.0 231.0 ▂▇▃▅▃
body_mass_g 0 1 4207.06 805.22 2700.0 3550.0 4050.0 4775.0 6300.0 ▃▇▅▃▂

From above, we can see that we’ll need to remove our species variable, and save it to a separate outcome variable in order to evaluate the performance of our KNN model later on. We can do this by running the following syntax:

species_actual <- penguins %>% 
  dplyr::select(species)

penguins <- penguins %>% 
  dplyr::select(-species)

Additionally, we can see below in the distributions of each of our continuous variables that the scales are inconsistent across features. For better model performance, we’ll want to standardize each of our features.

In order to do this, we’ll administer a z-score standardization to fix our scaling, which will ultimately help with our clustering.

penguins[, c("bill_length_mm", 
             "bill_depth_mm", 
             "flipper_length_mm", 
             "body_mass_g")] <- scale(penguins[, c("bill_length_mm", 
                                                   "bill_depth_mm", 
                                                   "flipper_length_mm", 
                                                   "body_mass_g")])

As you can see below, after the z-score standardization, the scaling on the x and y axis is a lot more consistent across our features.

penguins %>%
  dplyr::select(body_mass_g, bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  ggpairs()

Although this may help with our KNN model, we’ll have to be careful about interpretability later on! Finally, we’ll want to dummy code our island and sex variables, since they are categorical, we’ll need to convert them to 1s and 0s:

island_dcode <- as.data.frame(dummy.code(penguins$island))
penguins <- cbind(penguins, island_dcode)

We’ll also want to remove our original island variable from this dataset, since we created our dummy variables:

penguins <- penguins %>% 
  dplyr::select(-island)

The same process will then be applied to the sex variable:

penguins$sex <- dummy.code(penguins$sex)

Finally, after a fair amount of data tidying and investigation, our data set is ready for our KNN model. We can take a quick look at the updated version of our data set:

kable(head(penguins)) %>% 
  kable_styling(bootstrap_options = "basic")
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex Biscoe Dream Torgersen
1 -0.8946955 0.7795590 -1.4246077 -0.5676206 1 0 0 0
2 -0.8215515 0.1194043 -1.0678666 -0.5055254 0 1 0 0
3 -0.6752636 0.4240910 -0.4257325 -1.1885721 0 1 0 0
5 -1.3335592 1.0842457 -0.5684290 -0.9401915 0 1 0 0
6 -0.8581235 1.7444004 -0.7824736 -0.6918109 1 0 0 0
7 -0.9312674 0.3225288 -1.4246077 -0.7228585 0 1 0 0

Split Training/Test

With our data set tidied, we then split it into a training and test set.

set.seed(123)

trainingRows <- createDataPartition(as.matrix(species_actual$species), p=0.8, list=FALSE)

train_penguins <- penguins[trainingRows,]
test_penguins <- penguins[-trainingRows,]

We’ll also do the same for our target variable, using the same split:

species_actual_train <- species_actual$species[trainingRows]
species_actual_test <- species_actual$species[-trainingRows]

Fit kNN Model

Now, with our data split accordingly, we’ll need to identify the appropriate number for k. To do this, we will first take a standard approach of calculating the square root of the number of rows in our training data set:

sqrt(nrow(train_penguins))
## [1] 16.37071

We can see above, that it is roughly between 16 and 17. Therefore, we’ll perform two KNN models with k=16 and k=17:

set.seed(123) 

k16 <- knn(train_penguins, 
           test_penguins, 
           cl=species_actual_train, 
           k=16)

k17 <- knn(train_penguins, 
           test_penguins, 
           cl=species_actual_train, 
           k=17)
misClassError <- mean(k16 != species_actual_test)
paste0('The number of misclassified penguins with 16 neighbors is: ', misClassError)
## [1] "The number of misclassified penguins with 16 neighbors is: 0.0307692307692308"
table(k16, species_actual_test)
##            species_actual_test
## k16         Adelie Chinstrap Gentoo
##   Adelie        28         1      0
##   Chinstrap      1        12      0
##   Gentoo         0         0     23
misClassError <- mean(k17 != species_actual_test)
paste0('The number of misclassified penguins with 17 neighbors is: ', misClassError)
## [1] "The number of misclassified penguins with 17 neighbors is: 0.0307692307692308"
table(k17, species_actual_test)
##            species_actual_test
## k17         Adelie Chinstrap Gentoo
##   Adelie        28         1      0
##   Chinstrap      1        12      0
##   Gentoo         0         0     23

From our resulting confusion matrices, that show the classified penguins from our KNN model compared to the actual species designations, we can see that both a k of 16 and a k of 17 preformed well on our test set. Essentially, the misclassification rate showed no difference with k=16 or k=17. Although this approach seemed to work quite effectively, we do need to be careful about how we’d go about using either one of these models to predict new data. With an artificially high k value, we could be underfitting the data.

kNN train() with caret

Therefore, as a final step, we can run one final KNN model using the caret package, where the built-in train() function will run KNN classification that automatically picks the optimal number of neighbors (k).

knn_caret <- train(train_penguins, 
                   species_actual_train, 
                   method = "knn", 
                   preProcess = c("center", "scale"))

knn_caret
## k-Nearest Neighbors 
## 
## 268 samples
##   8 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## Pre-processing: centered (7), scaled (7), ignore (1) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 268, 268, 268, 268, 268, 268, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9980298  0.9968970
##   7  0.9976347  0.9963040
##   9  0.9991997  0.9987279
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

From our output, we can see that a k value of 9 was chose as the optimal value for our model based on the Accuracy and Kappa values calculated. Since this inter-rater reliability metric of Kappa is quite important when working with unbalanced data sets such as our penguins data set, this k value of 9 may ultimately perform better than our initial KNN models where k=16 or k=17. We can see this relationship determined by the train function for finding the optimal neighbors plotted below:

plot(knn_caret)

We can see that a neighbors value of 9 clearly showed the highest accuracy value.

knn_caret_predictions <- predict(knn_caret, newdata = test_penguins)
confusionMatrix(knn_caret_predictions, as.factor(species_actual_test))
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         1      0
##   Chinstrap      1        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9692          
##                  95% CI : (0.8932, 0.9963)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9516          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.9231        1.0000
## Specificity                 0.9722           0.9808        1.0000
## Pos Pred Value              0.9655           0.9231        1.0000
## Neg Pred Value              0.9722           0.9808        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1846        0.3538
## Detection Prevalence        0.4462           0.2000        0.3538
## Balanced Accuracy           0.9689           0.9519        1.0000

In the end, with the caret package identifying the “optimal” k-value, we can see that all three of our KNN models performed very well on our test data set. Although it will likely depend on the utility of the KNN model, it does appear that any of our three models could be chose to provide accurate predictions of the three penguin species.


Part 2: Decision Trees on loan approval data set


Please use the attached dataset on loan approval status to predict loan approval using Decision Trees. Please be sure to conduct a thorough exploratory analysis to start the task and walk us through your reasoning behind all the steps you are taking.

Load Data

First, we decided to read in the loan approval data set and take a look at its features:

loan <- read.csv("https://raw.githubusercontent.com/DaisyCai2019/NewData/master/Loan_approval.csv")

kable(head(loan)) %>% 
  kable_styling(bootstrap_options = "basic")
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
LP001002 Male No 0 Graduate No 5849 0 NA 360 1 Urban Y
LP001003 Male Yes 1 Graduate No 4583 1508 128 360 1 Rural N
LP001005 Male Yes 0 Graduate Yes 3000 0 66 360 1 Urban Y
LP001006 Male Yes 0 Not Graduate No 2583 2358 120 360 1 Urban Y
LP001008 Male No 0 Graduate No 6000 0 141 360 1 Urban Y
LP001011 Male Yes 2 Graduate Yes 5417 4196 267 360 1 Urban Y

As we can see from a glimpse of the data set above, the following features are available:

  • Loan_ID: a unique identifier for each loan
  • Gender: split into male/female
  • Married: indicates whether the applicant is either married (“Yes”) or not married (“No”)
  • Dependents: records the number of dependents to the applicant
  • Education: indicates whether the applicant is a graduate or undergraduate student
  • Self_Employed: indicates whether the applicant is either self employed (“Yes) or not (”No")
  • ApplicantIncome: indicates the applicant’s income
  • CoapplicantIncome: indicates the coapplicant’s income
  • LoanAmount: indicates the loan amount (in thousands)
  • Loan_Amount_Term: indicates the loan amount term in number of months
  • Credit_History: indicates whether or not the applicant’s credit history meets loan guidelines (1 or 0)
  • Property_Area: indicates whether the applicant’s property is “urban”, “semi urban” or “rural”
  • Loan_Status: the target variable, indicates whether or not the applicant received the loan

Exploratory data analysis

Now, we can run some exploratory data analysis to get a better sense of how to tidy and interpret these features. Here’s an initial summary of the dataset:

skim(loan)
Data summary
Name loan
Number of rows 614
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Loan_ID 0 1 8 8 0 614 0
Gender 0 1 0 6 13 3 0
Married 0 1 0 3 3 3 0
Dependents 0 1 0 2 15 5 0
Education 0 1 8 12 0 2 0
Self_Employed 0 1 0 3 32 3 0
Property_Area 0 1 5 9 0 3 0
Loan_Status 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ApplicantIncome 0 1.00 5403.46 6109.04 150 2877.5 3812.5 5795.00 81000 ▇▁▁▁▁
CoapplicantIncome 0 1.00 1621.25 2926.25 0 0.0 1188.5 2297.25 41667 ▇▁▁▁▁
LoanAmount 22 0.96 146.41 85.59 9 100.0 128.0 168.00 700 ▇▃▁▁▁
Loan_Amount_Term 14 0.98 342.00 65.12 12 360.0 360.0 360.00 480 ▁▁▁▇▁
Credit_History 50 0.92 0.84 0.36 0 1.0 1.0 1.00 1 ▂▁▁▁▇

Handle Missing Data

We can see that there are a fair amount of things we’ll need to do to clean the dataset before being able to run our decision tree algorithm. First, we can see that there quite a few missing values (NAs) in our LoanAmount, Loan_Amount_Term and Credit_History features. Also, we noticed that there were a lot of blank values, which needed to be recoded to NAs. Therefore, we used the naniar package below to handle this:

loan <- loan %>% 
  replace_with_na_all(condition = ~. == "")
loan %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  pivot_longer(cols = 1:13, names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
  geom_label(aes(label = NA_count)) +
  theme_minimal() +
  labs(title = 'Count of missing values in penguins dataset') +
  theme(plot.title = element_text(hjust = 0.5))

gg_miss_upset(loan)

This can be further visualized by the “Missingness Map” below:

missmap(loan)

Fix Datatypes

Before progressing, we thought it would be helpful to conduct some transformations on this data, as well as account for NAs. To do this, we used the factor function on each feature:

loan$Loan_Status <- factor(loan$Loan_Status)

loan <- loan %>%
           mutate(Gender = factor(Gender),
                  Married = factor(Married),
                  Dependents = factor(Dependents),
                  Education = factor(Education),
                  Self_Employed = factor(Self_Employed),
                  Property_Area = factor(Property_Area),
                  Loan_Status = factor(Loan_Status))

summary(loan)
##    Loan_ID             Gender    Married    Dependents        Education  
##  Length:614         Female:112   No  :213   0   :345   Graduate    :480  
##  Class :character   Male  :489   Yes :398   1   :102   Not Graduate:134  
##  Mode  :character   NA's  : 13   NA's:  3   2   :101                     
##                                             3+  : 51                     
##                                             NA's: 15                     
##                                                                          
##                                                                          
##  Self_Employed ApplicantIncome CoapplicantIncome   LoanAmount   
##  No  :500      Min.   :  150   Min.   :    0     Min.   :  9.0  
##  Yes : 82      1st Qu.: 2878   1st Qu.:    0     1st Qu.:100.0  
##  NA's: 32      Median : 3812   Median : 1188     Median :128.0  
##                Mean   : 5403   Mean   : 1621     Mean   :146.4  
##                3rd Qu.: 5795   3rd Qu.: 2297     3rd Qu.:168.0  
##                Max.   :81000   Max.   :41667     Max.   :700.0  
##                                                  NA's   :22     
##  Loan_Amount_Term Credit_History     Property_Area Loan_Status
##  Min.   : 12      Min.   :0.0000   Rural    :179   N:192      
##  1st Qu.:360      1st Qu.:1.0000   Semiurban:233   Y:422      
##  Median :360      Median :1.0000   Urban    :202              
##  Mean   :342      Mean   :0.8422                              
##  3rd Qu.:360      3rd Qu.:1.0000                              
##  Max.   :480      Max.   :1.0000                              
##  NA's   :14       NA's   :50

Feature Plots

Next, we’ll want to take a look at the distributions of the continuous features, which appear to be ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. We color coded Loan_status where blue is Y and red is N to help visualize patterns between the distributions and our feature distribution:

We can see that ApplicantIncome, CoapplicantIncome and LoanAmount are highly right skewed, with long right tails. Conversely, it looks like Loan_Amount_Term is highly left skewed, with a long left tail.

Target Plots

Additionally, we can also take a look at the categorical variables, broken out by our target variable of whether or not a loan was issued to the applicant. From the green bars below, we can see the total amount of loan approvals, relative to the red bars below documenting the loan denials.

From the categorical variable splits based on loan approval status, we can see a few interesting things:

  • The number of loans applied for by males was significantly higher than loans applied for by females in this dataset.

  • The same phenomenon is true for graduate students, where a much higher number of loan applications were coming from graduate students relative to undergraduate students.

  • Whether or not the applicant was self employed or married also showed a pretty large class imbalance, where those that identified as not self employed and those that identified as married had higher proportions of applicants in this dataset than those that did not identify by those two characteristics.

  • There was a pretty even split in the number of applicants based on property area, where someone identifying that they live in rural, semiurban, or urban settings were pretty evenly distributed.

  • We can see that applicants that did not pass the credit history criteria were almost always likely to be denied a loan.

These factors will be interesting to examine and discuss as we look to build our decision tree model and random forest models, since these class imbalances could affect model performance and interpretation.


Impute missing

Since there were a large amount of missing values that we identified above, we can use KNN() to help with our imputations. With the KNN method, a categorical missing value is imputed with the majority among its k nearest neighbors, and the average value (mean) of the k nearest neighbors is regarded as the prediction for a numerical missing value.

loan <- kNN(loan) %>%
  subset(select = Loan_ID:Loan_Status)

Now, we can double check that this worked correctly by running syntax to find any missing values:

sapply(loan, function(x) sum(is.na(x)))
##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                 0 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                 0                 0                 0                 0 
##       Loan_Status 
##                 0

Fortunately, we can see that there are now no missing data points!

Apply Transforms

Next, we’ll have to work through a few transformations for our highly skewed continuous data. For our LoanAmount feature, we can conduct a log transformation:

loan$LogLoanAmount <- log(loan$LoanAmount)
loan$LogLoan_Amount_Term <- log(loan$Loan_Amount_Term)

Banks generally don’t want loan payments to exceed 25% of an applicants income. Most banks will add the co-applicant income to the applicant income and treat the sum in their decision criteria. For this reason, we combine ApplicantIncome and CoapplicantIncome into a single Income feature and drop the separate ApplicantIncome and CoapplicantIncome. Next we perform a log transformation on Income to obtain the best results.

loan$Income <- loan$ApplicantIncome + loan$CoapplicantIncome
loan$ApplicantIncome <- NULL
loan$CoapplicantIncome <- NULL
loan$LogIncome <- log(loan$Income)

When checking the new distribution of our Income variable, we can see that it is much more normally distributed:

With the exploratory data analysis behind us, and our data tidy’d, we are now ready to run our decision tree!

Split Training/Test

Similar to our KNN model process above, we’ll determine an 80/20 split for our testing and training datasets:

set.seed(123)

trainingRows <- createDataPartition(as.matrix(species_actual$species), p=0.8, list=FALSE)

train_loan<-loan[trainingRows,]
test_loan<-loan[-trainingRows,]

With the data split, we can now run our first decision tree:

tree1 <- rpart(Loan_Status ~ Gender + Married + Dependents + Education + 
                 Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term + 
                 Credit_History + Property_Area, 
               data=train_loan)

rpart.plot(tree1, nn=TRUE)

summary(tree1)
## Call:
## rpart(formula = Loan_Status ~ Gender + Married + Dependents + 
##     Education + Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term + 
##     Credit_History + Property_Area, data = train_loan)
##   n= 268 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.31764706      0 1.0000000 1.0000000 0.08962903
## 2 0.01647059      1 0.6823529 0.6823529 0.07931171
## 3 0.01176471      8 0.5647059 0.9294118 0.08781286
## 4 0.01000000     10 0.5411765 0.9058824 0.08715167
## 
## Variable importance
## Credit_History      LogIncome  LogLoanAmount  Property_Area         Gender 
##             47             22             18              9              2 
##  Self_Employed      Education     Dependents 
##              1              1              1 
## 
## Node number 1: 268 observations,    complexity param=0.3176471
##   predicted class=Y  expected loss=0.3171642  P(node) =1
##     class counts:    85   183
##    probabilities: 0.317 0.683 
##   left son=2 (37 obs) right son=3 (231 obs)
##   Primary splits:
##       Credit_History < 0.5      to the left,  improve=25.753790, (0 missing)
##       LogIncome      < 7.779046 to the left,  improve= 4.191559, (0 missing)
##       LogLoanAmount  < 5.361193 to the right, improve= 3.578266, (0 missing)
##       Property_Area  splits as  LRL,          improve= 2.391196, (0 missing)
##       Dependents     splits as  RLRL,         improve= 1.315920, (0 missing)
## 
## Node number 2: 37 observations
##   predicted class=N  expected loss=0.1351351  P(node) =0.1380597
##     class counts:    32     5
##    probabilities: 0.865 0.135 
## 
## Node number 3: 231 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.2294372  P(node) =0.8619403
##     class counts:    53   178
##    probabilities: 0.229 0.771 
##   left son=6 (23 obs) right son=7 (208 obs)
##   Primary splits:
##       LogLoanAmount < 5.381604 to the right, improve=3.1629310, (0 missing)
##       Property_Area splits as  LRL,          improve=2.9041910, (0 missing)
##       LogIncome     < 8.869608 to the right, improve=2.0238030, (0 missing)
##       Married       splits as  LR,           improve=0.5197897, (0 missing)
##       Dependents    splits as  LLRL,         improve=0.4655822, (0 missing)
##   Surrogate splits:
##       LogIncome < 9.627714 to the right, agree=0.922, adj=0.217, (0 split)
## 
## Node number 6: 23 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.4782609  P(node) =0.0858209
##     class counts:    11    12
##    probabilities: 0.478 0.522 
##   left son=12 (14 obs) right son=13 (9 obs)
##   Primary splits:
##       Property_Area splits as  LLR,          improve=0.6211180, (0 missing)
##       Dependents    splits as  LRRR,         improve=0.5540184, (0 missing)
##       LogLoanAmount < 5.623991 to the left,  improve=0.5540184, (0 missing)
##       LogIncome     < 9.360734 to the right, improve=0.5244147, (0 missing)
##       Married       splits as  LR,           improve=0.0115942, (0 missing)
##   Surrogate splits:
##       Self_Employed splits as  LR,           agree=0.739, adj=0.333, (0 split)
##       LogIncome     < 8.998378 to the right, agree=0.696, adj=0.222, (0 split)
##       LogLoanAmount < 6.325379 to the left,  agree=0.696, adj=0.222, (0 split)
##       Dependents    splits as  LLRL,         agree=0.652, adj=0.111, (0 split)
## 
## Node number 7: 208 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.2019231  P(node) =0.7761194
##     class counts:    42   166
##    probabilities: 0.202 0.798 
##   left son=14 (121 obs) right son=15 (87 obs)
##   Primary splits:
##       Property_Area splits as  LRL,          improve=3.6171640, (0 missing)
##       LogIncome     < 7.799417 to the left,  improve=1.9780490, (0 missing)
##       LogLoanAmount < 4.297262 to the right, improve=1.0214400, (0 missing)
##       Married       splits as  LR,           improve=0.5536131, (0 missing)
##       Dependents    splits as  RLRL,         improve=0.3148534, (0 missing)
##   Surrogate splits:
##       Gender           splits as  RL,           agree=0.601, adj=0.046, (0 split)
##       LogLoanAmount    < 5.318012 to the left,  agree=0.587, adj=0.011, (0 split)
##       Loan_Amount_Term < 420      to the left,  agree=0.587, adj=0.011, (0 split)
## 
## Node number 12: 14 observations
##   predicted class=N  expected loss=0.4285714  P(node) =0.05223881
##     class counts:     8     6
##    probabilities: 0.571 0.429 
## 
## Node number 13: 9 observations
##   predicted class=Y  expected loss=0.3333333  P(node) =0.03358209
##     class counts:     3     6
##    probabilities: 0.333 0.667 
## 
## Node number 14: 121 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.2809917  P(node) =0.4514925
##     class counts:    34    87
##    probabilities: 0.281 0.719 
##   left son=28 (106 obs) right son=29 (15 obs)
##   Primary splits:
##       LogLoanAmount < 4.283372 to the right, improve=1.57306500, (0 missing)
##       LogIncome     < 8.505121 to the left,  improve=1.37741000, (0 missing)
##       Education     splits as  RL,           improve=0.19043430, (0 missing)
##       Dependents    splits as  LRRL,         improve=0.13282170, (0 missing)
##       Self_Employed splits as  RL,           improve=0.09381985, (0 missing)
##   Surrogate splits:
##       LogIncome < 8.01334  to the right, agree=0.901, adj=0.2, (0 split)
## 
## Node number 15: 87 observations
##   predicted class=Y  expected loss=0.09195402  P(node) =0.3246269
##     class counts:     8    79
##    probabilities: 0.092 0.908 
## 
## Node number 28: 106 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.3113208  P(node) =0.3955224
##     class counts:    33    73
##    probabilities: 0.311 0.689 
##   left son=56 (11 obs) right son=57 (95 obs)
##   Primary splits:
##       LogIncome     < 8.139586 to the left,  improve=4.24708900, (0 missing)
##       LogLoanAmount < 5.225732 to the left,  improve=0.78845450, (0 missing)
##       Dependents    splits as  LRRL,         improve=0.10571960, (0 missing)
##       Gender        splits as  LR,           improve=0.06773702, (0 missing)
##       Self_Employed splits as  RL,           improve=0.06773702, (0 missing)
##   Surrogate splits:
##       LogLoanAmount < 4.349757 to the left,  agree=0.915, adj=0.182, (0 split)
## 
## Node number 29: 15 observations
##   predicted class=Y  expected loss=0.06666667  P(node) =0.05597015
##     class counts:     1    14
##    probabilities: 0.067 0.933 
## 
## Node number 56: 11 observations
##   predicted class=N  expected loss=0.2727273  P(node) =0.04104478
##     class counts:     8     3
##    probabilities: 0.727 0.273 
## 
## Node number 57: 95 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.2631579  P(node) =0.3544776
##     class counts:    25    70
##    probabilities: 0.263 0.737 
##   left son=114 (26 obs) right son=115 (69 obs)
##   Primary splits:
##       LogIncome     < 8.869608 to the right, improve=1.05615200, (0 missing)
##       LogLoanAmount < 4.740565 to the left,  improve=0.62682750, (0 missing)
##       Dependents    splits as  RLRL,         improve=0.57809530, (0 missing)
##       Self_Employed splits as  RL,           improve=0.29007700, (0 missing)
##       Gender        splits as  RL,           improve=0.08916409, (0 missing)
##   Surrogate splits:
##       LogLoanAmount < 5.15614  to the right, agree=0.832, adj=0.385, (0 split)
## 
## Node number 114: 26 observations,    complexity param=0.01647059
##   predicted class=Y  expected loss=0.3846154  P(node) =0.09701493
##     class counts:    10    16
##    probabilities: 0.385 0.615 
##   left son=228 (9 obs) right son=229 (17 obs)
##   Primary splits:
##       LogIncome     < 9.051689 to the left,  improve=2.1900450, (0 missing)
##       Dependents    splits as  LLR-,         improve=1.1197220, (0 missing)
##       LogLoanAmount < 5.225732 to the left,  improve=1.1197220, (0 missing)
##       Property_Area splits as  L-R,          improve=0.3076923, (0 missing)
##       Married       splits as  LR,           improve=0.1864802, (0 missing)
##   Surrogate splits:
##       Education splits as  RL, agree=0.731, adj=0.222, (0 split)
##       Gender    splits as  LR, agree=0.692, adj=0.111, (0 split)
## 
## Node number 115: 69 observations,    complexity param=0.01176471
##   predicted class=Y  expected loss=0.2173913  P(node) =0.2574627
##     class counts:    15    54
##    probabilities: 0.217 0.783 
##   left son=230 (31 obs) right son=231 (38 obs)
##   Primary splits:
##       LogIncome        < 8.505121 to the left,  improve=2.1268180, (0 missing)
##       LogLoanAmount    < 4.740565 to the left,  improve=1.6248130, (0 missing)
##       Loan_Amount_Term < 210      to the left,  improve=0.6948507, (0 missing)
##       Dependents       splits as  RLRL,         improve=0.5077346, (0 missing)
##       Property_Area    splits as  R-L,          improve=0.4461663, (0 missing)
##   Surrogate splits:
##       LogLoanAmount < 4.740565 to the left,  agree=0.797, adj=0.548, (0 split)
##       Property_Area splits as  R-L,          agree=0.652, adj=0.226, (0 split)
##       Dependents    splits as  RLRR,         agree=0.638, adj=0.194, (0 split)
##       Gender        splits as  LR,           agree=0.594, adj=0.097, (0 split)
## 
## Node number 228: 9 observations
##   predicted class=N  expected loss=0.3333333  P(node) =0.03358209
##     class counts:     6     3
##    probabilities: 0.667 0.333 
## 
## Node number 229: 17 observations
##   predicted class=Y  expected loss=0.2352941  P(node) =0.06343284
##     class counts:     4    13
##    probabilities: 0.235 0.765 
## 
## Node number 230: 31 observations,    complexity param=0.01176471
##   predicted class=Y  expected loss=0.3548387  P(node) =0.1156716
##     class counts:    11    20
##    probabilities: 0.355 0.645 
##   left son=460 (14 obs) right son=461 (17 obs)
##   Primary splits:
##       LogLoanAmount < 4.695914 to the right, improve=2.39522900, (0 missing)
##       LogIncome     < 8.403892 to the right, improve=1.77450100, (0 missing)
##       Married       splits as  RL,           improve=0.43039050, (0 missing)
##       Dependents    splits as  LRRL,         improve=0.24396860, (0 missing)
##       Education     splits as  RL,           improve=0.06021505, (0 missing)
##   Surrogate splits:
##       LogIncome     < 8.403892 to the right, agree=0.806, adj=0.571, (0 split)
##       Gender        splits as  LR,           agree=0.613, adj=0.143, (0 split)
##       Self_Employed splits as  RL,           agree=0.613, adj=0.143, (0 split)
## 
## Node number 231: 38 observations
##   predicted class=Y  expected loss=0.1052632  P(node) =0.141791
##     class counts:     4    34
##    probabilities: 0.105 0.895 
## 
## Node number 460: 14 observations
##   predicted class=N  expected loss=0.4285714  P(node) =0.05223881
##     class counts:     8     6
##    probabilities: 0.571 0.429 
## 
## Node number 461: 17 observations
##   predicted class=Y  expected loss=0.1764706  P(node) =0.06343284
##     class counts:     3    14
##    probabilities: 0.176 0.824

Predictions

From our decision tree summary and plot, we can see that a few variables, such as Credit_History, Income and LoanAmount, were the most important features in this classification method. We can use this decision tree to make predictions on our holdout test set.

loanPre <- predict(tree1, 
                   test_loan, 
                   type="class",
                   control = rpart.control(minsplit = 20, xval = 81, cp = 0.01))

confusionMatrix(loanPre, test_loan$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  68  69
##          Y  39 170
##                                           
##                Accuracy : 0.6879          
##                  95% CI : (0.6361, 0.7363)
##     No Information Rate : 0.6908          
##     P-Value [Acc > NIR] : 0.572055        
##                                           
##                   Kappa : 0.3219          
##                                           
##  Mcnemar's Test P-Value : 0.005262        
##                                           
##             Sensitivity : 0.6355          
##             Specificity : 0.7113          
##          Pos Pred Value : 0.4964          
##          Neg Pred Value : 0.8134          
##              Prevalence : 0.3092          
##          Detection Rate : 0.1965          
##    Detection Prevalence : 0.3960          
##       Balanced Accuracy : 0.6734          
##                                           
##        'Positive' Class : N               
## 

As we can see from our predictions and the confusion matrix, we were able to obtain about 68% accuracy using this decision tree. To achieve higher accuracy, we’ll use the prune() function in the rpart package to examine a predicted optimal tree size.

plotcp(tree1)

By plotting the cross-validated error against the complexity parameter, we can see that the relationship between the yields the best optimal outcome for our tree at a tree size of 2.

tree1$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.31764706      0 1.0000000 1.0000000 0.08962903
## 2 0.01647059      1 0.6823529 0.6823529 0.07931171
## 3 0.01176471      8 0.5647059 0.9294118 0.08781286
## 4 0.01000000     10 0.5411765 0.9058824 0.08715167

Tuning

Although this is the case, when using this cp value in our prune() function, we still obtain accuracy ratings of around 68%. In order to test difference complexity parameters, we also decided to use our cp value of 0.01647059, which is a tree size of 2. When we use this tree size, and prune our initial tree accordingly, we can see the resulting decision tree below:

tree1.pruned <- prune(tree1, cp=0.01647059)

rpart.plot(tree1.pruned, 
           extra = 104, 
           main = "Decision Tree")

Interestingly, from this pruning process, the Credit_History feature alone seems to do a fairly good job of classifying applicants into the approval vs. disapproval status. The accuracy is about 86% on our test set, which is slightly higher than our initial tree.

However, it’s important to think critically about whether or not this pruned tree would perform better on another dataset, and if it is applicable for real-life events.

tree1.pruned.pred <- predict(tree1.pruned, 
                             test_loan, 
                             type = "class" )

confusionMatrix(tree1.pruned.pred, test_loan$Loan_Status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  54   2
##          Y  53 237
##                                           
##                Accuracy : 0.841           
##                  95% CI : (0.7982, 0.8779)
##     No Information Rate : 0.6908          
##     P-Value [Acc > NIR] : 1.040e-10       
##                                           
##                   Kappa : 0.5715          
##                                           
##  Mcnemar's Test P-Value : 1.562e-11       
##                                           
##             Sensitivity : 0.5047          
##             Specificity : 0.9916          
##          Pos Pred Value : 0.9643          
##          Neg Pred Value : 0.8172          
##              Prevalence : 0.3092          
##          Detection Rate : 0.1561          
##    Detection Prevalence : 0.1618          
##       Balanced Accuracy : 0.7482          
##                                           
##        'Positive' Class : N               
## 

While a single feature does offer fairly good accuracy, there may be mitigating information from other features that would affect a loan decision which is lost if we only include a single feature. While decision trees are simple to construct and understand, their weakness is that they follow a simple linear path for decision making. This linear flow doesn’t allow for downstream features to alter or influence potential predictions.


Part 3: Random Forests on loan approval dataset


Using the same dataset on Loan Approval Status, please use Random Forests to predict on loan approval status. Again, please be sure to walk us through the steps you took to get to your final model. (50 points)

Initial Model

set.seed(123)
fit.forest <- randomForest(Loan_Status ~ Gender + Married + Dependents + 
                             Education + Self_Employed + LogIncome + 
                             LogLoanAmount + Loan_Amount_Term + 
                             Credit_History + Property_Area,
                           data=train_loan)

fit.forest
## 
## Call:
##  randomForest(formula = Loan_Status ~ Gender + Married + Dependents +      Education + Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term +      Credit_History + Property_Area, data = train_loan) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 23.88%
## Confusion matrix:
##    N   Y class.error
## N 38  47  0.55294118
## Y 17 166  0.09289617
importance(fit.forest)
##                  MeanDecreaseGini
## Gender                   2.437321
## Married                  3.169019
## Dependents               7.414653
## Education                2.277848
## Self_Employed            2.834501
## LogIncome               27.251346
## LogLoanAmount           24.507891
## Loan_Amount_Term         5.023619
## Credit_History          22.057853
## Property_Area            6.812607
forest.pred <- predict(fit.forest, 
                       newdata = test_loan)

(dtree.cm_train <- confusionMatrix(forest.pred, test_loan$Loan_Status))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  59  13
##          Y  48 226
##                                           
##                Accuracy : 0.8237          
##                  95% CI : (0.7794, 0.8624)
##     No Information Rate : 0.6908          
##     P-Value [Acc > NIR] : 1.331e-08       
##                                           
##                   Kappa : 0.5464          
##                                           
##  Mcnemar's Test P-Value : 1.341e-05       
##                                           
##             Sensitivity : 0.5514          
##             Specificity : 0.9456          
##          Pos Pred Value : 0.8194          
##          Neg Pred Value : 0.8248          
##              Prevalence : 0.3092          
##          Detection Rate : 0.1705          
##    Detection Prevalence : 0.2081          
##       Balanced Accuracy : 0.7485          
##                                           
##        'Positive' Class : N               
## 

Here, we notice slight improvements on both samples where accuracy for the training sample is 82.37%. Notice the disparity between Specificity and Sensitivity. This model has a high false positive rate and tends to over predict Y when it should have chosen N.

Next we want to see if we have generated enough trees so that the Out Of Bag (OOB Error) error rates are minimum. From the below we see that the OOB error rate is decreasing with 1-20 trees, and rate stabilizes that at around 100 trees.

plot(fit.forest, 
     col = c("black", "red", "dark green"), 
     main = "Predicted Loan Error Rates")

legend("topright", colnames(fit.forest$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

Tuning

In order to see whether the Random forest model can be improved, we will run the same model, but this time we will use the tuneRF function. We provide the features, x, target, y, and number of trees used at each tuning step, ntreeTry.

The values were assigned randomly initially, and they have tweaked until the optimal was found.

set.seed(123)

tree_var <- tuneRF(x = subset(train_loan, select = -Loan_Status), 
                   y = train_loan$Loan_Status, 
                   ntreeTry = 1000)
## mtry = 3  OOB error = 23.88% 
## Searching left ...
## mtry = 2     OOB error = 23.13% 
## 0.03125 0.05 
## Searching right ...
## mtry = 6     OOB error = 23.13% 
## 0.03125 0.05

We rerun the Random Forest model with the new parameter \(Ntree = 1000\) and \(Mtry = 3\)

set.seed(123)

val_opt <- tree_var [,"mtry"][which.min(tree_var [,"OOBError"])]

fit.forest2 <- randomForest(Loan_Status  ~ Gender + Married + Dependents + 
                              Education + Self_Employed + LogIncome + 
                              LogLoanAmount + Loan_Amount_Term + 
                              Credit_History + Property_Area,
                            data=train_loan, 
                            importance = TRUE, 
                            ntree=1000, 
                            mtry = val_opt)

fit.forest2
## 
## Call:
##  randomForest(formula = Loan_Status ~ Gender + Married + Dependents +      Education + Self_Employed + LogIncome + LogLoanAmount + Loan_Amount_Term +      Credit_History + Property_Area, data = train_loan, importance = TRUE,      ntree = 1000, mtry = val_opt) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 22.39%
## Confusion matrix:
##    N   Y class.error
## N 37  48  0.56470588
## Y 12 171  0.06557377

The results of the trained Random Forest model are an out of bag error of 22.39%, which is higher than the original model 23.88%, Although, it still a good result it has got worse with the tuning.

forest.pred2 <- predict(fit.forest2, 
                        newdata = test_loan)

(forest.cm_train <- confusionMatrix(forest.pred2, test_loan$Loan_Status))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  57   5
##          Y  50 234
##                                           
##                Accuracy : 0.841           
##                  95% CI : (0.7982, 0.8779)
##     No Information Rate : 0.6908          
##     P-Value [Acc > NIR] : 1.040e-10       
##                                           
##                   Kappa : 0.579           
##                                           
##  Mcnemar's Test P-Value : 2.975e-09       
##                                           
##             Sensitivity : 0.5327          
##             Specificity : 0.9791          
##          Pos Pred Value : 0.9194          
##          Neg Pred Value : 0.8239          
##              Prevalence : 0.3092          
##          Detection Rate : 0.1647          
##    Detection Prevalence : 0.1792          
##       Balanced Accuracy : 0.7559          
##                                           
##        'Positive' Class : N               
## 

Here, we notice there has the same accuracy as previous one for the training sample is 82.11%. The tuned model has performed worse than the test data set than the original Random Forest. The accuracy is slightly decreased, and the 95 % CI has decreased a bit too.

Next we want to see if we have generated enough trees so that the Out Of Bag (OOB Error) error rates are minimum. From the below we see that the OOB error rate is decreasing with 1-20 trees, and rate stabilizes that at around 20 trees.

plot(fit.forest2, 
     col = c("black", "red", "dark green"), 
     main = "Predicted Loan Error Rates")

legend("topright", colnames(fit.forest2$err.rate), col = 1:6, cex = 0.8, fill = 1:6)


Part 4: Gradient Boosting


Using the Loan Approval Status data, please use Gradient Boosting to predict on the loan approval status. Please use whatever boosting approach you deem appropriate;but please be sure to walk us through your steps. (50 points)

Initial Model

set.seed(123)

boost <- gbm(Loan_Status~., 
             data=train_loan[,-1],
             distribution = "gaussian",
             n.trees = 1000,
             cv.folds = 3)

gbm uses a default number of trees of 100, which is rarely sufficient. Consequently, I crank it up to 1,000 trees. The default depth of each tree (interaction.depth) is 1, which means we are ensembling a bunch of stumps. Lastly, I also include cv.folds to perform a 3 fold cross validation.

boost
## gbm(formula = Loan_Status ~ ., distribution = "gaussian", data = train_loan[, 
##     -1], n.trees = 1000, cv.folds = 3)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 23.
## There were 13 predictors of which 4 had non-zero influence.
sqrt(min(boost$cv.error))
## [1] 0.4136226
summary(boost)

##                                     var    rel.inf
## LoanAmount                   LoanAmount 34.8085776
## Income                           Income 34.0120783
## Dependents                   Dependents  9.6081783
## Credit_History           Credit_History  7.1610501
## Property_Area             Property_Area  5.3913204
## Loan_Amount_Term       Loan_Amount_Term  3.7712276
## Married                         Married  1.6303913
## Education                     Education  1.4209212
## Gender                           Gender  1.1968845
## Self_Employed             Self_Employed  0.9993707
## LogLoanAmount             LogLoanAmount  0.0000000
## LogLoan_Amount_Term LogLoan_Amount_Term  0.0000000
## LogIncome                     LogIncome  0.0000000

The best cross-validation iteration was 34 and the RMSE is 0.3913894

The summary of output creates a new data set with var, a factor variable with the variables in our model, and rel.inf, the relative influence each variable had on our model predictions.From the table, we can see LoanAmount, LogIncome and Credit_History are the top three most important variable for our model.

Tuning

We could tune parameters to see how the results change. Here, I increase the number of trees and also perform a 5 fold cross validation.

set.seed(123)

boost2<-gbm(Loan_Status~., 
            data=train_loan[,-1],
            distribution = "gaussian",
            n.trees = 2000,
            cv.folds = 5)

sqrt(min(boost2$cv.error))
## [1] 0.4160404
summary(boost2)

##                                     var   rel.inf
## Income                           Income 34.132899
## LoanAmount                   LoanAmount 31.436681
## Dependents                   Dependents 12.092748
## Property_Area             Property_Area  6.425018
## Credit_History           Credit_History  5.068015
## Loan_Amount_Term       Loan_Amount_Term  4.341234
## Self_Employed             Self_Employed  1.879418
## Married                         Married  1.838764
## Gender                           Gender  1.540934
## Education                     Education  1.244288
## LogLoanAmount             LogLoanAmount  0.000000
## LogLoan_Amount_Term LogLoan_Amount_Term  0.000000
## LogIncome                     LogIncome  0.000000

We can see the number of relative influence of each variable change and also the RMSE change to 0.3900092, lower than our previous model.

The partial Dependence Plots tell us the relationship and dependence of the variables X with the Response variable Y.

#partial dependence plots
plot(boost2,i="LoanAmount")

plot(boost2,i="Income")

The first plot shows loanAmount is negatively correlated with the response Loan_Status before 300K. The second plot indicate income is positively correlated with Loan_Status when it less than $20,000.

# plot loss function as a result of n trees added to the ensemble
gbm.perf(boost2, method = "cv")

## [1] 26
gbm.perf(boost2, method = "OOB")

## [1] 33
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 2000 
## Equivalent Number of Parameters: 39.99 
## Residual Standard Error: 0.0003932

The plots indicating the optimum number of trees based on the technique we used. The green line indicates error on test and the blue dotted line points the optimum number of iterations. We can observe that the beyond a certain a point (55 iterations for the cv method and 40 for the OOB method), the error on the test data appears to increase because of overfitting.

We use model 2 to forecast our data. According to the cv method, we choose 55 as the number of trees.

boostPre <- predict.gbm(boost2, 
                        test_loan, 
                        n.trees = 55)

boostPre <- ifelse(boostPre < 1.5, 'N', 'Y')
boostPre <- as.factor(boostPre)

(boost.cm <- confusionMatrix(boostPre, test_loan$Loan_Status))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  54   2
##          Y  53 237
##                                           
##                Accuracy : 0.841           
##                  95% CI : (0.7982, 0.8779)
##     No Information Rate : 0.6908          
##     P-Value [Acc > NIR] : 1.040e-10       
##                                           
##                   Kappa : 0.5715          
##                                           
##  Mcnemar's Test P-Value : 1.562e-11       
##                                           
##             Sensitivity : 0.5047          
##             Specificity : 0.9916          
##          Pos Pred Value : 0.9643          
##          Neg Pred Value : 0.8172          
##              Prevalence : 0.3092          
##          Detection Rate : 0.1561          
##    Detection Prevalence : 0.1618          
##       Balanced Accuracy : 0.7482          
##                                           
##        'Positive' Class : N               
## 

According to the Confusion matrix, Our model accuracy is 0.841. Looking at Sensitivity and Specificaity, this model is has a far lower False Negative rate and is less likely to pick ‘N’ when it shouldn’t. However, it’s False positive rate is higher and where it picked ‘Y’ when it should have predicted ‘N’.


Part 5: Model Performance


Model performance: please compare the models you settled on for problem # 2 – 4.Comment on their relative performance. Which one would you prefer the most? Why?(20 points)

temp <- data.frame(dtree.cm_train$overall, 
                   forest.cm_train$overall,
                   boost.cm$overall) %>%
  t() %>%
  data.frame() %>%
  dplyr::select(Accuracy) %>%
  mutate(`Classification Error Rate` = 1-Accuracy)
eval <- data.frame(dtree.cm_train$byClass, 
                   forest.cm_train$byClass,
                   boost.cm$byClass)
eval <- data.frame(t(eval)) %>%
  cbind(temp) %>%
  mutate(eval = c("Decision Tree", "Random Forest", "Gradient Boosting")) 
eval <- dplyr::select(eval, Accuracy, `Classification Error Rate`, Sensitivity, Specificity, Precision, Recall, F1)
rownames(eval) = c("Decision Tree",  "Random Forest", "Gradient Boosting")
t_eval <- t(eval)
colnames(t_eval) <- rownames(eval)
rownames(t_eval) <- colnames(eval)
knitr::kable(t_eval)
Decision Tree Random Forest Gradient Boosting
Accuracy 0.8236994 0.8410405 0.8410405
Classification Error Rate 0.1763006 0.1589595 0.1589595
Sensitivity 0.5514019 0.5327103 0.5046729
Specificity 0.9456067 0.9790795 0.9916318
Precision 0.8194444 0.9193548 0.9642857
Recall 0.5514019 0.5327103 0.5046729
F1 0.6592179 0.6745562 0.6625767