library(skimr)
library(tidyverse)
library(caret) # For featureplot, classification report
library(corrplot) # For correlation matrix
library(AppliedPredictiveModeling)
library(mice) # For data imputation
library(VIM) # For missing data visualization
library(gridExtra) # For grid plots
library(rpart) # For Decision Trees models
library(rpart.plot) # For Decision Tree Plots
library(randomForest) # For Random Forest models
library(randomForestExplainer) # For Random Forest Variable Importance Analysis
library(gbm) # For Gradient Boosted Models
library(pROC) # For AUC calculations
library(ROCR) # For ROC and AUC plots
The penguin dataset is composed of 344 observations of 8 variables, 5 of which are numeric and 3 of which are qualitative. The dataset is mostly complete, with just a few observations containing missing values that will need to be handled.
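The summary below was produced with skimr. A minimal sketch of the call, assuming the data comes from the palmerpenguins package (the data source is not shown in the original chunk):
# Assumed data source: the palmerpenguins package
library(palmerpenguins)
skim(penguins)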
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
From this plot, we can make a few key observations. The species are not evenly distributed across the islands, so the island observations are valuable information for differentiating penguin species. However, the sex of the penguins does not offer much information, as the proportion is about even across all species. We can also note a few missing observations labeled as NA.
We noted from the data summary above that 11 observations were missing for the sex variable. There is also no reason to believe that the year an observation was taken would have any impact on the morphology of the penguins, and we are not doing any time-series modeling, so we drop year from our predictor variables. We also drop island so that the classifier relies on morphological measurements rather than location. Finally, two observations are missing all body measurements; we drop every row with missing values, removing these along with the observations missing sex.
penguins[!complete.cases(penguins), ]
penguins <- penguins[complete.cases(penguins), ]
penguins <- dplyr::select(penguins, -c(year, island))
When looking at body measurements, we see that Adelie and Chinstrap penguins largely overlap except in bill_length. This suggests that we might be able to use bill_depth, body_mass and flipper_length to differentiate the Gentoo penguins from the other species, while the Adelie penguin stands out from the others in bill_length.
The scatterplot matrix below is another way to visualize the separation and overlap between classes for different combinations of variables. We see that, in general, Gentoo penguins stand alone as a separate group, while Adelie and Chinstrap penguins overlap in the comparisons of bill_depth, flipper_length and body_mass.
We see on the univariate feature plots below that the data is approximately normally distributed.
Taking a look at the correlation matrix below, we can make a few observations, notably that flipper_length is highly positively correlated with body_mass which makes sense given that larger penguins should have larger flippers. The other correlations are less obvious to interpret. Given that the dataset only contains a few predictors, we choose not to exclude any variables based on multicollinearity at this time.
The KNN algorithm requires minor data processing. First, predictor variables that are factors should be converted to numeric. Second, because KNN classifies points based on the distance between them, it is important for the variables to be on comparable scales; here we pass the center and scale arguments to the preProcess parameter of the training function to standardize each variable. The data is then split into training (80%) and testing (20%) sets: the training set contains 268 observations and the test set 65.
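For reference, the standardization applied by preProcess and the Euclidean distance typically used by KNN take the usual forms:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad d(a, b) = \sqrt{\sum_{j} \left(z_{aj} - z_{bj}\right)^2}$$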
# Processing
penguins_knn <- penguins
penguins_knn$sex <- as.numeric(penguins_knn$sex)-1 # recode as 1 or 0
# Data Partitioning
set.seed(622)
trainIndex <- createDataPartition(penguins_knn$species, p = .8, list = FALSE, times = 1)
knn_training <- penguins_knn[trainIndex,]
knn_testing <- penguins_knn[-trainIndex,]
We performed 10-fold cross-validation on the training data to determine the optimal parameter k for our model. The resulting accuracy for each value of k is displayed and plotted below. The maximum accuracy is reached at k=3 and k=4, and the training procedure automatically chose k=4 as the best model. We gain roughly a full percentage point in cross-validation accuracy using the tuned model over models with slightly more or fewer neighbors.
trControl <- trainControl(method = "cv",
number = 10)
knn.fit <- train(species ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = trControl,
preProcess = c("center","scale"),
metric = "Accuracy",
data = knn_training)
knn.fit
## k-Nearest Neighbors
##
## 268 samples
## 5 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 241, 241, 240, 242, 241, 242, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9851852 0.9769336
## 2 0.9850427 0.9765099
## 3 0.9962963 0.9941558
## 4 0.9962963 0.9941558
## 5 0.9887464 0.9821064
## 6 0.9851750 0.9764839
## 7 0.9851750 0.9764839
## 8 0.9888787 0.9823281
## 9 0.9888787 0.9823281
## 10 0.9888787 0.9823281
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 4.
The evaluation of the tuned K-NN model on the testing data reveals that the model classified species with perfect accuracy. It is important to note, however, that 100% prediction accuracy is rare in practice; this model benefited from fairly clean class separations and limited overlap in the original dataset.
knnPredict <- predict(knn.fit, newdata = knn_testing)
confusionMatrix(knnPredict, knn_testing$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9448, 1)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.4462 0.2 0.3538
## Detection Rate 0.4462 0.2 0.3538
## Detection Prevalence 0.4462 0.2 0.3538
## Balanced Accuracy 1.0000 1.0 1.0000
A loan-issuance company with a presence across urban, semi-urban and rural areas wants to validate the eligibility of customers to be granted a loan. Using an online application form, customers enter a series of attributes such as Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the eligibility process, multiple supervised classification approaches can be used. In our case, we explore tree-based methods, starting from simple Decision Trees (DT). Ensemble methods such as Random Forests (RF) and Gradient Boosting (GB) are then used to improve on the classification accuracy of the simple DT.
The loan dataset is composed of 13 variables and 614 observations. The target or dependent variable is Loan_Status and contains ‘Y’ or ‘N’ as entries. Predicting whether a loan will be approved is a supervised binary classification problem.
Eight of the variables are factors and five are numeric. The dataset contains missing values recorded either as 'NA' or simply left empty; the worst-affected column, Credit_History, is missing about 8% of its observations. Imputation of the missing values will be a step in the data pre-processing. We also note that the Loan_ID variable is simply an index and holds no valuable information, making it safe to remove. The Credit_History variable is coded as numeric but is really a binary variable with two levels.
loan_raw <- read.csv('https://raw.githubusercontent.com/maelillien/data622/main/hw3/Loan_approval.csv', header = TRUE)
loan_raw <- loan_raw %>% mutate_if(is.character, factor)
loan <- loan_rawhead(loan)# replace blank values with NA to allow for proper calculation of the complete_rate column in the data summary
loan[loan==''] <- NA
skim(loan)
| Name | loan |
| Number of rows | 614 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| factor | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Loan_ID | 0 | 1.00 | FALSE | 614 | LP0: 1, LP0: 1, LP0: 1, LP0: 1 |
| Gender | 13 | 0.98 | FALSE | 2 | Mal: 489, Fem: 112, emp: 0 |
| Married | 3 | 1.00 | FALSE | 2 | Yes: 398, No: 213, emp: 0 |
| Dependents | 15 | 0.98 | FALSE | 4 | 0: 345, 1: 102, 2: 101, 3+: 51 |
| Education | 0 | 1.00 | FALSE | 2 | Gra: 480, Not: 134 |
| Self_Employed | 32 | 0.95 | FALSE | 2 | No: 500, Yes: 82, emp: 0 |
| Property_Area | 0 | 1.00 | FALSE | 3 | Sem: 233, Urb: 202, Rur: 179 |
| Loan_Status | 0 | 1.00 | FALSE | 2 | Y: 422, N: 192 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
| Credit_History | 50 | 0.92 | 0.84 | 0.36 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
The pre-processing steps are the following:

- Create TotalIncome by summing applicant and coapplicant incomes. Typically, loan issuers take into account the combined income of the applicant and guarantor.
- Remove Loan_ID and the individual income variables that were just combined.
- Recode Credit_History as a factor with 2 levels instead of a numeric variable.

Note that for the tree-based methods employed in this exercise, factor variables are not required to be recoded as numeric or expanded into dummy variables. We can see from the data summary above that the remainder of the variables have the proper data types.
loan <- loan %>% mutate(TotalIncome = ApplicantIncome + CoapplicantIncome)
loan <- loan %>% select(-c('Loan_ID','ApplicantIncome','CoapplicantIncome'))
loan$Credit_History <- as.factor(loan$Credit_History)
By examining the pattern plot of missing values below, we discern a slight pattern from variable to variable and observation to observation. We can consider a few scenarios to account for the missingness. For example, individuals who are married but separated might leave the 'Married' field blank, and individuals with non-binary gender identity might not choose 'Male' or 'Female' if leaving the field blank is a valid entry. However, it seems possible to account for the missing values based on the complete information in other variables: we might be able to predict that an individual is married or lives in a semi-urban area if they have a large number of dependents, for instance. Therefore, we make the assumption that the values in this dataset are missing at random (MAR).
The simplest way of dealing with missing values is to conduct "complete case analysis", which involves dropping observations for which predictor values are missing. However, given the small number of observations (614) in this dataset, dropping observations in such fashion would result in significant data loss. Since the variable with the most missing values (about 8%), namely Credit_History, is suspected to be very predictive, we will instead impute the missing values using predictive models.
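The missingness summary below was produced with the VIM package. A minimal sketch of the call (the plotting options are illustrative):
# sortVars = TRUE prints the proportion of missing values per variable, as shown below
aggr(loan, numbers = TRUE, sortVars = TRUE, labels = names(loan), cex.axis = 0.7)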
##
## Variables sorted by number of missings:
## Variable Count
## Credit_History 0.081433225
## Self_Employed 0.052117264
## LoanAmount 0.035830619
## Dependents 0.024429967
## Loan_Amount_Term 0.022801303
## Gender 0.021172638
## Married 0.004885993
## Education 0.000000000
## Property_Area 0.000000000
## Loan_Status 0.000000000
## TotalIncome 0.000000000
Given the assumption that the values are missing at random, we can use the MICE (Multivariate Imputation by Chained Equations) package to impute the missing values throughout the dataset. For each type of predictor variable we use:

- Numeric variables: Predictive Mean Matching (pmm)
- Binary variables: Logistic Regression (logreg)
- Factor variables (3+ levels): Bayesian Polytomous Regression (polyreg)

Below, after a sketch of the imputation call, we confirm that the imputing procedure was successful and that the dataset no longer contains missing values.
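The original imputation chunk is not shown; a minimal sketch, relying on mice's defaults (which select pmm, logreg and polyreg automatically from each variable's type) and an illustrative seed:
imp <- mice(loan, m = 5, seed = 622, printFlag = FALSE) # chained-equations imputation
loan <- mice::complete(imp) # extract a single completed dataset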
table(complete.cases(loan))
##
## TRUE
## 614
By examining the target variable, we see that nearly 70% of all loans are approved.
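This can be verified directly from the class frequencies:
# Share of approved loans: 422 of 614, roughly 69%
prop.table(table(loan$Loan_Status))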
We can make a few observations from the boxplots of the numeric variables below:

- Loan_Amount_Term only takes a few discrete values representing various durations. The most common value by far, as indicated by the flat box, is 360 months, meaning that the most common loan term is 30 years. This is fairly consistent across both outcomes of the dependent variable.
- LoanAmount does not greatly differ across the dependent variable. The interquartile range is slightly more compressed in the 'Y' category and there is a greater range of outlier values on the upper end. The mean loan amounts are comparable for both outcomes.
- TotalIncome is fairly similar across the outcomes up to about $30,000 in total income. Interestingly, the observation with the highest total income recorded is the greatest outlier and was not issued a loan. Large skew is observed across both categories.

From the bar plots of the categorical variables shown below, we observe the following:

- Credit_History: The majority of applicants had credit history, and the large majority of those were approved for a loan. This is likely to be one of the most important factors in determining who gets a loan.
- Dependents: Individuals with 0 dependents form the majority of the cohort. Individuals with fewer dependents may be less risk averse and more willing to take on debt.
- Education: More individuals with graduate education applied for loans, and a greater proportion of them received one in comparison to their "Not Graduate" counterparts.
- Gender: More than 3 times as many males applied for loans as females. Both genders seem to be granted loans in the same proportion.
- Married: Married individuals made up the majority of applicants. This could be a consequence of needing to finance something like a home or a car, which is more typical of married households.
- Property_Area: Individuals living in semi-urban property areas applied for the greatest number of loans and also had the greatest proportion of approved loans. Urban areas follow, with approximately 50% of loans approved, while rural areas had fewer applicants and a greater proportion of rejections.
- Self_Employed: Individuals who were not self-employed made up the large majority of observations. This makes sense given that salaried employees greatly outnumber the self-employed. Additionally, a self-employed individual may have less consistent streams of revenue and therefore might be less willing to take on debt.

Interestingly, there is little difference in the relationship between loan amount and total income across the credit history categories, as seen by the nearly collinear regression lines. The slope is slightly higher for individuals with credit history, which makes sense given that individuals with credit history are in general more likely to be granted loans. We can also observe that individuals below the regression line generally had credit history: these are individuals requesting less sizable loans relative to their incomes.
The least squares lines across the dependent variable are also nearly collinear, but the line representing individuals who received loans has a slightly greater slope, suggesting that higher incomes unlock larger loans, which is sensible and expected.
Other than credit history, total income (applicant income + coapplicant income) seems like the most logical basis for approving or denying a loan. The histograms below compare total income across all levels of the categorical variables. In all plots, total income is skewed to the right tail, with a few observations at the higher end of the log-transformed income scale. From these plots we can make a few additional observations. Looking at education, it makes sense to see more graduates at the tail end of income, since their education should lead to higher-paying jobs. The gender distribution is consistent with the discriminatory pay gap between men and women. The remaining plots do not provide much additional insight.
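A sketch of one such faceted histogram, with the facet variable and bin count chosen for illustration (ggplot2 is loaded via the tidyverse):
ggplot(loan, aes(x = TotalIncome, fill = Loan_Status)) +
  geom_histogram(bins = 30) + # total income, log-scaled below
  scale_x_log10() +
  facet_wrap(~ Education)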
The data pre-processing section above already put the data in the required structural form, and no additional processing is needed for tree-based methods. We partition our data into training and testing sets in a 70%/30% proportion.
set.seed(622)
trainIndex <- createDataPartition(loan$Loan_Status, p = .7, list = FALSE, times = 1)
training <- loan[ trainIndex,]
testing <- loan[-trainIndex,]
A Decision Tree is a type of supervised learning model in which the data is recursively split into two or more sub-populations according to some criterion. Each split is headed by a node: the uppermost node is called the root node, nodes which cannot be split further are called terminal (leaf) nodes, and all other nodes are internal nodes. At each step, a population is split into sub-populations such that each split separates the sub-population better than the previous split. This recursive splitting cycles through the different predictor variables in order to find leaf nodes that make the purest class separation or most accurate predictions.
The Decision Tree model can be used for regression and classification. In this case we use the model for classification of loan approvals.
The first step is to grow a baseline decision tree with splits chosen to generate the most Information Gain, i.e. the decrease in entropy after the data is split on a given attribute, where entropy is a measure of node purity.
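For a node with class proportions $p_k$, the entropy and the information gain of a split into children $j$ receiving $n_j$ of the parent's $n$ observations are:

$$E = -\sum_k p_k \log_2 p_k, \qquad IG = E_{\text{parent}} - \sum_j \frac{n_j}{n}\, E_{\text{child}_j}$$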
After growing the complete tree, we will need to decide whether it requires pruning to reduce its complexity: overly complex trees have high variance and do not predict well on new, unseen data. Here we grow the initial tree with the complexity parameter (cp) set to 0, so no candidate split is rejected for insufficient improvement. The cost complexity parameter differs from Information Gain and the Gini index but is conceptually similar: it modulates the amount by which splitting a given node must improve the relative error, in other words, the minimum improvement in the overall model needed at each node to justify the split.
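Formally, for a subtree $T$ with $|T|$ terminal nodes, rpart selects the tree minimizing the penalized risk below, so setting $cp = 0$ grows the tree out fully:

$$R_{cp}(T) = R(T) + cp \cdot |T| \cdot R(T_{\text{root}})$$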
loan_dt <- rpart(Loan_Status ~., data = training, method = "class", parms = list(split="information"), control = rpart.control(cp=0))
loan_dt## n= 431
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 431 135 Y (0.31322506 0.68677494)
## 2) Credit_History=0 73 6 N (0.91780822 0.08219178) *
## 3) Credit_History=1 358 68 Y (0.18994413 0.81005587)
## 6) TotalIncome< 2457 8 2 N (0.75000000 0.25000000) *
## 7) TotalIncome>=2457 350 62 Y (0.17714286 0.82285714)
## 14) TotalIncome>=20199.5 11 5 N (0.54545455 0.45454545) *
## 15) TotalIncome< 20199.5 339 56 Y (0.16519174 0.83480826)
## 30) Loan_Amount_Term>=420 10 5 N (0.50000000 0.50000000) *
## 31) Loan_Amount_Term< 420 329 51 Y (0.15501520 0.84498480)
## 62) Property_Area=Rural 92 21 Y (0.22826087 0.77173913)
## 124) TotalIncome< 3606.5 9 4 N (0.55555556 0.44444444) *
## 125) TotalIncome>=3606.5 83 16 Y (0.19277108 0.80722892)
## 250) Dependents=1 7 3 N (0.57142857 0.42857143) *
## 251) Dependents=0,2,3+ 76 12 Y (0.15789474 0.84210526) *
## 63) Property_Area=Semiurban,Urban 237 30 Y (0.12658228 0.87341772)
## 126) TotalIncome>=3416 215 30 Y (0.13953488 0.86046512)
## 252) TotalIncome>=6577 77 15 Y (0.19480519 0.80519481)
## 504) TotalIncome< 6800 7 2 N (0.71428571 0.28571429) *
## 505) TotalIncome>=6800 70 10 Y (0.14285714 0.85714286) *
## 253) TotalIncome< 6577 138 15 Y (0.10869565 0.89130435) *
## 127) TotalIncome< 3416 22 0 Y (0.00000000 1.00000000) *
The tree fitted on the training data is displayed below. The simple baseline tree is 8 levels deep. Each node shows the predicted class, the probability of that class and the percentage of observations that fall into the node. Studying the splits, we see that an applicant who is not rejected for lack of credit history has an 81% chance of being approved regardless of all the other variables. Factoring in TotalIncome, an important feature of any loan application, only increases the probability of approval by about 1%, to 82%. Further splits on Loan_Amount_Term and Property_Area each add roughly another 1%.
rpart.plot(loan_dt, box.col = c("pink", "palegreen3")[loan_dt$frame$yval])
To evaluate the simple tree model, we predict Loan_Status on the test set and calculate the performance metrics. The confusion matrix below summarizes the results. We also record the accuracy, sensitivity and specificity metrics in a data frame for comparison with the alternative classification methods.
loan_dt_pred <- predict(loan_dt, testing, type = "class")
dt_cm <- confusionMatrix(loan_dt_pred, testing$Loan_Status)
dt_cm$table
## Reference
## Prediction N Y
## N 29 21
## Y 28 105
performance_df <- data.frame(Model = NULL, Accuracy = NULL, Sensitivity = NULL, Specificity = NULL)
perf_dt <- data.frame(Model = "Base Decision Tree", Accuracy = dt_cm$overall[1], Sensitivity = dt_cm$byClass[1], Specificity = dt_cm$byClass[2])
performance_df <- rbind(performance_df, perf_dt)
perf_dt
In order to obtain a tree which generalizes better to unseen data (lower variance), we now prune the tree based on cross-validated error rates. We look for the cost complexity parameter which results in the lowest cross-validation error rate. The tuned parameter is then fed back into the tree-growing procedure as the threshold a split must exceed to be justified. This ensures that the tree does not grow past nodes for which the cross-validation error rate is not minimal.
The tuning procedure indicated that the Decision Tree providing the most accurate predictions is the minimal tree containing 2 terminal nodes, with just a single split on Credit_History. This is an unrealistic tree which would limit the business's loan-issuing volume, so for practical reasons we chose the cost complexity parameter for which the pruned tree contains at least 2 splits. From the tuning output below, we see on the upper axis that the cross-validation error is minimized with a tree of size 2, but that the error only marginally increases for a tree with 3 terminal nodes.
plotcp(loan_dt)
cost_complexity_dt <- data.frame(printcp(loan_dt))
##
## Classification tree:
## rpart(formula = Loan_Status ~ ., data = training, method = "class",
## parms = list(split = "information"), control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] Credit_History Dependents Loan_Amount_Term Property_Area
## [5] TotalIncome
##
## Root node error: 135/431 = 0.31323
##
## n= 431
##
## CP nsplit rel error xerror xstd
## 1 0.4518519 0 1.00000 1.00000 0.071325
## 2 0.0296296 1 0.54815 0.54815 0.057993
## 3 0.0074074 2 0.51852 0.57037 0.058908
## 4 0.0052910 3 0.51111 0.60741 0.060360
## 5 0.0000000 10 0.47407 0.76296 0.065582
min_err <- (cost_complexity_dt %>% filter(nsplit > 1) %>% slice(which.min(xerror)))$CP
cat("Minimum Error: ", min_err)## Minimum Error: 0.007407407
The resulting tree with 3 terminal nodes is displayed below and the most significant predictors are Credit History and Total Income.
loan_dt_prune <- prune(loan_dt, min_err)
rpart.plot(loan_dt_prune, box.col = c("pink", "palegreen3")[loan_dt_prune$frame$yval])
As before, the classification results on the test set are summarized in the confusion matrix and in the data frame row below.
loan_dt_prune_pred <- predict(loan_dt_prune, testing, type = "class")
dt_pruned_cm <- confusionMatrix(loan_dt_prune_pred, testing$Loan_Status)
dt_pruned_cm$table
## Reference
## Prediction N Y
## N 25 7
## Y 32 119
perf_dt_pruned <- data.frame(Model = "Pruned Decision Tree", Accuracy = dt_pruned_cm$overall[1], Sensitivity = dt_pruned_cm$byClass[1], Specificity = dt_pruned_cm$byClass[2])
performance_df <- rbind(performance_df, perf_dt_pruned)
perf_dt_pruned
The Random Forest model is an ensemble method built from many individual decision trees, with the classification output determined by majority voting. The method aims to reduce the variance of individual trees and is particularly suited to cases where predictors exhibit collinearity. Each tree in the forest is constructed from bootstrapped observations and a randomized subset of features. The forest-growing procedure is controlled by two tuning parameters that we seek to optimize:
- ntree: the number of trees in the forest
- mtry: the number of randomly selected variables considered at each split

For classification, the default value of mtry is the square root of the total number of predictor variables. Since we have 10 predictor variables in our dataset, the default value is 3, as checked below.
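A quick sanity check of that default:
floor(sqrt(ncol(training) - 1)) # 10 predictors, so the default mtry is 3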
When using bootstrapped samples, about a third of the observations are not used in the construction of each tree, and the model's error on these held-out observations is referred to as the Out of Bag (OOB) error. From the tuning results below, we see that random predictor subsets of size 2 and 3 generate identical values for the lowest OOB error. The procedure chooses 2, which is subsequently used as the mtry parameter.
min_tree_var <- tuneRF(x = subset(training, select = -Loan_Status), y = training$Loan_Status, ntreeTry = 500)
## mtry = 3 OOB error = 17.4%
## Searching left ...
## mtry = 2 OOB error = 17.4%
## 0 0.05
## Searching right ...
## mtry = 6 OOB error = 18.56%
## -0.06666667 0.05
val_opt <- min_tree_var[, "mtry"][which.min(min_tree_var[, "OOBError"])]
loan_rf <- randomForest(Loan_Status ~., data = training, importance = TRUE, ntree=500, mtry = val_opt)
loan_rf
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = training, importance = TRUE, ntree = 500, mtry = val_opt)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.71%
## Confusion matrix:
## N Y class.error
## N 72 63 0.46666667
## Y 9 287 0.03040541
Next we examine how the error evolves with the size of the forest in order to identify the number of trees beyond which the OOB error no longer improves. From the plot of error rates below, we see that the OOB error rate decreases significantly up to about 50 trees and stabilizes around 100 trees. For this dataset, 500 trees is more than enough to minimize the OOB error rate.
plot(loan_rf, col = c("black", "red", "green"), lty = c(1, 1, 1), main = "Predicted Loan Error Rates")
legend("right", c(OOB = "Out of Bag Error", "Denied", "Appoved"), col = c("black", "red", "green"), lty = c(1, 1, 1))Similarly to Decision Trees, we are interested in predictive power of individual variables but but in Random Forests, the most predictive feature is obscured by the complexity and not simply found at the top of the tree. We turn to Variable Importance, which is a scale of a given variable’s predictive power which is derived by taking the average purity of child nodes that the split causes across all trees in the forest. Variables with large increases in purity are considered more “important”.
Below is a table of the predictor variables and their various importance measures. Some definitions are provided to help interpret the table:

- Accuracy Decrease: the decrease in accuracy of the overall model if a given predictor is not used in the model
- Gini Decrease: each split results in a reduction in the Gini index; the Gini decrease is the average of this reduction across the entire forest
- Mean_min_depth: the average depth in a tree at which a given variable is first used for a split

var_imp <- measure_importance(loan_rf)
var_imp[,1:7]
We can visualize this table in a few ways, presented below. We make the following observations:
- Credit History and Total Income are the most important predictors because they are most often at the root of individual trees and have the lowest average depth.

plot_min_depth_distribution(min_depth_distribution(loan_rf), mean_sample = "relevant_trees")
plot_multi_way_importance(var_imp, size_measure = "times_a_root")
Studying the multi-way importance plot, we see that Credit History stands alone, with the largest accuracy decrease when the variable is omitted from the model and the highest Gini decrease of all variables. Interestingly, the accuracy decrease when excluding Total Income is mostly in line with the other predictors, much smaller than for Credit History; however, it retains a higher Gini decrease than the rest of the predictors.
plot_multi_way_importance(loan_rf, x_measure = "accuracy_decrease", y_measure = "gini_decrease")
The classification results and metrics of the Random Forest model are summarized below.
loan_rf_pred <- predict(loan_rf, testing, type = "class")
rf_cm <- confusionMatrix(loan_rf_pred, testing$Loan_Status)
rf_cm$table
## Reference
## Prediction N Y
## N 24 4
## Y 33 122
perf_rf <- data.frame(Model = "Random Forest", Accuracy = rf_cm$overall[1], Sensitivity = rf_cm$byClass[1], Specificity = rf_cm$byClass[2])
performance_df <- rbind(performance_df, perf_rf)
perf_rf
Gradient Boosting is another ensemble modeling technique in which several basic models, referred to as weak learners, are combined to improve performance. Instead of building models in parallel and determining the output by majority voting, models are run additively in series, with each model's output serving as an input to the next. This iterative procedure improves the result with each new run of the classifier; given sufficient iterations, the training error can be driven to zero at the expense of computational speed. Gradient boosting is a "slow" learning method modulated by the shrinkage parameter lambda.
With gradient boosted trees, the performance of each individual tree is rather low: the weak learners are usually shallow trees, which can be as simple as stumps. The errors of each step (the pseudo-residuals) are passed to the next learner, placing more emphasis on the misclassified observations. The procedure is further controlled by two additional parameters: the number of trees and the interaction depth (number of splits).
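Schematically, with shrinkage $\lambda$ and a weak learner $h_m$ fit to the current pseudo-residuals, the ensemble is updated additively at each stage $m$:

$$F_m(x) = F_{m-1}(x) + \lambda\, h_m(x)$$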
set.seed(123)
# gbm requires recoding the response as numeric
training$Loan_Status <- ifelse(training$Loan_Status == "Y", 1, 0)
loan_boost <- gbm(Loan_Status ~., data = training,
n.trees=500,
interaction.depth=4,
shrinkage=0.01,
bag.fraction=0.5,
distribution="bernoulli",
verbose=FALSE,
cv.folds = 5,
n.cores=2
)
Using cross-validation to find the optimal number of trees yielded a value of 207, which is the value used to compute our predictions.
Examining feature importance, we once again see that Credit_History and TotalIncome are the most important predictors. Interestingly, LoanAmount holds a higher importance than in the other models. We also notice that predictors such as Gender are on the low end of importance, which is something we would hope to be true, as gender should have no bearing on whether an individual is issued a loan.
# Check performance using 5-fold cross-validation
best.iter <- gbm.perf(loan_boost, method="cv")
print(best.iter)
## [1] 207
summary(loan_boost)
As before, the classification results are summarized below.
# gbm requires recoding the response as numeric
testing$Loan_Status <- ifelse(testing$Loan_Status == "Y", 1, 0)
loan_gb_prob <- predict(loan_boost, newdata = testing, n.trees = best.iter, type = "response")
loan_gb_pred <- as.factor(ifelse(loan_gb_prob > 0.5, 1, 0))
gb_cm <- confusionMatrix(loan_gb_pred, as.factor(testing$Loan_Status))
gb_cm$table
## Reference
## Prediction 0 1
## 0 23 4
## 1 34 122
perf_gb <- data.frame(Model = "Gradient Boosting", Accuracy = gb_cm$overall[1], Sensitivity = gb_cm$byClass[1], Specificity = gb_cm$byClass[2])
performance_df <- rbind(performance_df, perf_gb)
perf_gb
Looking at the performance table below, we see that the test-set accuracies of the Random Forest and Gradient Boosting models are almost identical at approximately 79%, with Random Forest holding a slight edge. The Pruned Decision Tree is close behind with an accuracy of approximately 78%.
With that said, it is worthwhile to look into where these models are accurate and where they are not. Since no real-world model will achieve 100% accuracy on new data, we have to decide which type of incorrect prediction is more costly to the business. For our scenario of loan approval, it is less costly to deny a loan to someone who might have paid it back than to grant one to someone who will not. Therefore, we want to minimize approvals of applicants who are likely to default on a loan. We can do this by investigating the sensitivity and specificity of each model:

- Sensitivity (True Positive Rate): the proportion of applicants predicted as approved who were actually approved in the test dataset
- Specificity (True Negative Rate): the proportion of applicants predicted to be rejected who were actually rejected in the test dataset

A model with high sensitivity but low specificity would be more likely to approve loans for applicants who should not receive one. For this reason, the Random Forest model, which has the highest specificity, is the lowest-risk option and is the model of choice given its high specificity and the highest accuracy rate.
An interesting consideration here is that low sensitivity represents lost opportunity for the company to issue loans to lower-risk individuals. Ultimately, the balance of sensitivity and specificity depends on the risk tolerance of the company. The Random Forest and Gradient Boosted Tree models have the highest specificity but lower sensitivity than the simple decision tree. Further analysis would be needed to determine how much additional revenue could be generated by trading specificity for sensitivity, which is outside the scope of this study.
Ultimately, the ensemble models provided the most risk balanced performance over the single tree models, with Random Forest edging out higher overall accuracy.
rownames(performance_df) <- NULL
performance_df
We can diagnose the best overall classifier by comparing the area under the curve (AUC) of the receiver operating characteristic (ROC) curves, as shown in the plot below; the ideal classifier hugs the top left corner. As stated earlier, the ensemble models provide performance enhancements over the base and pruned trees. The Gradient Boosted model has the highest AUC at 0.757, while the Random Forest model's is slightly lower at 0.746. However, we maintain the conclusion that the Random Forest model is slightly superior due to its minor sensitivity improvement over the Gradient Boosted model.
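The plot itself is not reproduced here; a minimal sketch of the comparison using pROC (the object rf_prob is illustrative, and testing$Loan_Status was recoded to 0/1 above):
rf_prob <- predict(loan_rf, testing, type = "prob")[, "Y"] # class-'Y' probabilities
roc_rf <- roc(testing$Loan_Status, rf_prob)
roc_gb <- roc(testing$Loan_Status, loan_gb_prob)
plot(roc_rf, col = "blue")
plot(roc_gb, col = "red", add = TRUE)
auc(roc_rf); auc(roc_gb)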