Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework3
Please use the K-nearest neighbor (KNN) algorithm to predict the species variable. Please be sure to walk through the steps you took. (40 points)
Similar to past assignments when using the Palmer Penguins dataset, we’ll first do some quick exploratory analysis to examine the different features available.
We can see below that there are four continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Additionally, there are a few categorical variables: island, sex, and year.
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
For the continuous variables, we can examine the distributions, broken out by the target variable, species.
By separating our distributions by our target variable, species, we can see that many feature interactions show clustering between the Adelie and Chinstrap penguins (red and green), while Gentoo penguins separate clearly from the other two species for most feature pairs. This is also confirmed by most of the single-variable distributions split out by species in the plots below: except for bill_length_mm, the distributions of body_mass_g, bill_depth_mm, and flipper_length_mm overlap substantially between the Adelie and Chinstrap species.
Next, we’ll do some data tidying to get our data set ready for our KNN model. year reflects the date/time of recording and requires some additional thought. If we expected variability in penguin features to have a time-dependent component (for example, if food supplies vary year by year, if yearly temperature affects development, or if global warming or natural disasters impacted growth in specific years), then year could be important. In this situation, because we don’t have any reason to expect a time-dependent effect, year probably doesn’t add any explanatory power to our models, and we can remove it from our data set.
# remove year column
penguins <- penguins %>%
dplyr::select(-year)
Additionally, when exploring missing values, we can see the following:
# Plot missing values per feature
penguins %>%
summarise_all(~ sum(is.na(.))) %>%  # count NAs per column (funs() is deprecated)
pivot_longer(cols = 1:7, names_to = 'columns', values_to = 'NA_count') %>%
arrange(desc(NA_count)) %>%
ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
geom_label(aes(label = NA_count)) +
theme_minimal() +
labs(title = 'Count of missing values in penguins dataset') +
theme(plot.title = element_text(hjust = 0.5))
# Use the naniar package to plot missing value patterns
gg_miss_upset(penguins)
We can see that 11 individuals have at least one missing data point. The missing data points are clustered - two penguin records are missing bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex. Nine penguin records are missing just the sex data point. Since we cannot do much with the two records missing most of their data, we will drop these records. For the records missing just sex, while we could try kNN imputation, we don’t know enough about gender differences to know if that’s a safe operation or whether it would introduce bias. Therefore, we’ll drop these from our data set as well.
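For reference only (not executed here), a minimal kNN-imputation sketch using the VIM package (the same package we use later for the loan data) might look like the following; we proceed by dropping the rows instead.
# sketch only: impute the missing sex values from the 5 nearest neighbors rather than dropping rows
penguins_imputed <- VIM::kNN(penguins, variable = "sex", k = 5, imp_var = FALSE)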
# Remove rows with missing values
penguins <- na.omit(penguins)
Looking at our dataset, we see that missing values are no longer an issue; however, we still have some additional cleanup to perform on our dataset below before we proceed with kNN modeling.
# Display data from summary
skim(penguins)
| Name | penguins |
|---|---|
| Number of rows | 333 |
| Number of columns | 7 |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
| island | 0 | 1 | FALSE | 3 | Bis: 163, Dre: 123, Tor: 47 |
| sex | 0 | 1 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
When setting up for modeling, we typically separate our features from our target to have an x and y. With this in mind, we remove species from the main dataframe and create a new dataframe, species_actual, to hold species. Note that the dataframes must maintain their same order, and we cannot do any operations that might alter dataframe length.
# New dataframe just holding the target
species_actual <- penguins %>%
dplyr::select(species)
# Remove the target column so we just have features in the dataframe
penguins <- penguins %>%
dplyr::select(-species)
Additionally, we can see below in each of our continuous variables’ distributions that the scales are inconsistent across features. For better model performance, we’ll want to standardize each of our features.
We normalize our continuous features using a simple z-score standardization.
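For reference, the z-score standardization computed by scale() with its default centering and scaling is:
\[ z = \frac{x - \bar{x}}{s} \]
where \(\bar{x}\) is the feature’s sample mean and \(s\) its sample standard deviation.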
# z-score scale continuous features
penguins[, c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")] <- scale(penguins[, c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")])As you can see below, after the z-score standardization, the scaling on the x and y-axis is now more consistent across our features.
# pair plot to inspect features after scaling
penguins %>%
dplyr::select(body_mass_g, bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
ggpairs()
Although this may help our KNN model, we’ll have to be careful about interpretability later on!
Finally, we’ll want to dummy code our categorical island and sex variables. This process creates new columns for each value and assigns a 0 or 1. Dummy encoding typically drops one value, which becomes the baseline: a categorical feature with five unique values yields four columns, and all zeros represent the reference value, which helps reduce the number of necessary columns. Note, however, that the dummy.code() function used below keeps a column for every island level (one-hot encoding), as seen in the output further down. With the dummy columns in place, we remove our original island and sex variables from this dataset.
# Dummy code the island feature - this creates a new column for each island (0 || 1)
island_dcode <- as.data.frame(dummy.code(penguins$island))
penguins <- cbind(penguins, island_dcode)
# remove the original island factor columns - we will rely on the new dummy columns
penguins <- penguins %>%
dplyr::select(-island)
# dummy code the sex feature - since there are only 2 states, we will only be adding a single column (0 || 1)
penguins$sex <- dummy.code(penguins$sex)
Our data set is ready for our KNN model.
# quick inspect of dataframe
kable(head(penguins)) %>%
kable_styling(bootstrap_options = "basic")
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | Biscoe | Dream | Torgersen |
|---|---|---|---|---|---|---|---|---|
| 1 | -0.8946955 | 0.7795590 | -1.4246077 | -0.5676206 | 1 | 0 | 0 | 0 |
| 2 | -0.8215515 | 0.1194043 | -1.0678666 | -0.5055254 | 0 | 1 | 0 | 0 |
| 3 | -0.6752636 | 0.4240910 | -0.4257325 | -1.1885721 | 0 | 1 | 0 | 0 |
| 5 | -1.3335592 | 1.0842457 | -0.5684290 | -0.9401915 | 0 | 1 | 0 | 0 |
| 6 | -0.8581235 | 1.7444004 | -0.7824736 | -0.6918109 | 1 | 0 | 0 | 0 |
| 7 | -0.9312674 | 0.3225288 | -1.4246077 | -0.7228585 | 0 | 1 | 0 | 0 |
One problem during the modeling process is overfitting. In this situation, a model appears to have good predictive power, but it has “memorized” the training data and doesn’t generalize with new data. So, we train with one dataset and evaluate with a separate dataset not seen during training. It is common to randomly separate the original dataset into 80% / 20%, with 80% used for training. Since our target variable is in a separate dataframe, we apply the same separation to it as well.
# we use the fixed random seed for reproducibility
set.seed(123)
# use caret to split out training vs test dataframes
trainingRows <- createDataPartition(as.matrix(species_actual$species), p=0.8, list=FALSE)
train_penguins <- penguins[trainingRows,]
test_penguins <- penguins[-trainingRows,]
# also split out the target dataframes
species_actual_train <- species_actual$species[trainingRows]
species_actual_test <- species_actual$species[-trainingRows]
With our data split accordingly, we’ll need to identify the appropriate number for k. k represents the number of nearby neighbors used to determine the sample data point class. A rule of thumb has us start with the square root of the number of observations (i.e., rows or records). This varies depending on the number of records, and typically this rule of thumb leads to slightly higher than optimal values, but it’s a good starting point. Later we will use the caret package to help find the optimal value for k.
# initial k
sqrt(nrow(train_penguins))
## [1] 16.37071
We can see above that it is approximately 16. Therefore, we’ll perform our initial kNN model with \(k=16\).
set.seed(123)
# kNN with k = 16 neighbors
k16 <- knn(train_penguins,
test_penguins,
cl=species_actual_train,
k=16)
confusionMatrix(k16, species_actual_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
#misClassError <- mean(k16 != species_actual_test)
#paste0('The number of misclassified penguins with 16 neighbors is: ', misClassError)
#table(k16, species_actual_test)
Our resulting confusion matrix shows the kNN-predicted species compared to the actual species designations. We can see that \(k=16\) performed well on our test set, with an overall \(Accuracy=0.9692\) and very good Sensitivity and Specificity for each class, ranging between 0.9231 and 1.00. This tells us that there were very few false positives and negatives. While this model performed exceptionally well, our choice of \(k=16\) might be sub-optimal. Next, we will leverage caret::train() to help us find the optimal value for k through testing.
The built-in caret::train() function will run the KNN classification multiple times with different values for k, evaluate model performance, and experimentally determine the optimal value of k for our data.
# kNN with caret - autofit k
knn_caret <- train(train_penguins,
species_actual_train,
method = "knn",
preProcess = c("center", "scale"))
knn_caret
## k-Nearest Neighbors
##
## 268 samples
## 8 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## Pre-processing: centered (7), scaled (7), ignore (1)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 268, 268, 268, 268, 268, 268, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9975302 0.9961114
## 7 0.9975104 0.9960614
## 9 0.9983326 0.9973580
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
From our output, we can see that a \(k=9\) was found to be the optimal value for our model based on the calculated Accuracy and Kappa values. The inter-rater reliability metric, Kappa, is quite important when working with unbalanced data sets. Unbalanced datasets have different counts for each class. As the count discrepancy between classes increases, our model can become biased by the larger represented class, leading to poorer predictions of the smaller classes. Given this, a \(k=9\) may ultimately perform better than our initial KNN model with \(k=16\), especially on new data.
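As a quick illustration (not part of the original output), Cohen’s Kappa can be computed by hand from the \(k=16\) confusion matrix above by comparing the observed accuracy with the accuracy expected by chance:
# illustration: Cohen's Kappa from the k = 16 confusion matrix
cm <- table(Predicted = k16, Actual = species_actual_test)
p_obs <- sum(diag(cm)) / sum(cm)                     # observed accuracy
p_exp <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # accuracy expected by chance
(p_obs - p_exp) / (1 - p_exp)                        # Kappa, ~0.9516 as reported above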
# plot accuracy vs # neighbors
plot(knn_caret)
We can see that a neighbors value of 9 clearly showed the highest accuracy value.
# apply model to holdout group
knn_caret_predictions <- predict(knn_caret, newdata = test_penguins)
# show confusion matrix and model performance KPIs
confusionMatrix(knn_caret_predictions, as.factor(species_actual_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
In the end, with the caret package identifying the “optimal” k-value, we can see that both our KNN models (\(k=16\) and \(k=9\)) performed very well on our test data set. Again, k indicates how many nearby data points are used to assign a sample data point to a class. Whether our neighboring data points are spread out or tightly clustered influences the optimal k. Large k values may lead to more misclassifications near borders between classes, while smaller k values may better identify pockets of data points when classes overlap or have non-linear boundaries.
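As an optional sanity check (not part of the original workflow), we could sweep over a range of odd k values and plot the resulting test-set error to see this trade-off directly:
# optional: test-set error across a range of odd k values
k_values <- seq(1, 25, by = 2)
test_error <- sapply(k_values, function(k) {
  pred <- knn(train_penguins, test_penguins, cl = species_actual_train, k = k)
  mean(pred != species_actual_test)
})
plot(k_values, test_error, type = "b", xlab = "k (neighbors)", ylab = "Test error rate")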
In this case, it appears that either of our models could be chosen to provide accurate predictions of the three penguin species.
Please use the attached dataset on loan approval status to predict loan approval using Decision Trees. Please be sure to conduct a thorough exploratory analysis to start the task and walk us through your reasoning behind all the steps you are taking.
First, we decided to read in the loan approval data set and take a look at its features:
# load CSV data set
loan <- read.csv("https://raw.githubusercontent.com/DaisyCai2019/NewData/master/Loan_approval.csv")
# display first few rows for inspection
kable(head(loan)) %>%
kable_styling(bootstrap_options = "basic")
| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |
As we can see from a glimpse of the data set above, the available features include a Loan_ID identifier, applicant demographics (Gender, Married, Dependents, Education, Self_Employed), financial details (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History), Property_Area, and the target variable Loan_Status.
Now, we can run some exploratory data analysis to get a better sense of how to tidy and interpret these features. Here’s an initial summary of the dataset:
# print dataframe summary
skim(loan)
| Name | loan |
|---|---|
| Number of rows | 614 |
| Number of columns | 13 |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Loan_ID | 0 | 1 | 8 | 8 | 0 | 614 | 0 |
| Gender | 0 | 1 | 0 | 6 | 13 | 3 | 0 |
| Married | 0 | 1 | 0 | 3 | 3 | 3 | 0 |
| Dependents | 0 | 1 | 0 | 2 | 15 | 5 | 0 |
| Education | 0 | 1 | 8 | 12 | 0 | 2 | 0 |
| Self_Employed | 0 | 1 | 0 | 3 | 32 | 3 | 0 |
| Property_Area | 0 | 1 | 5 | 9 | 0 | 3 | 0 |
| Loan_Status | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
| Credit_History | 50 | 0.92 | 0.84 | 0.36 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
There are several cleanup steps we will need to perform before we can start modeling with decision trees. We noticed a lot of blank values, which we will need to change to NAs. By coding these to NA, we can then impute them when cleaning up the other NAs. We used the naniar package to help with cleaning up blanks and NAs.
# replace blank values with NA
loan <- loan %>%
replace_with_na_all(condition = ~. == "")
Next, we explore missing values to understand any patterns. This will help us determine the best way to handle missing values. Options might include dropping records with missing data or imputing missing data.
loan %>%
summarise_all(~ sum(is.na(.))) %>%  # count NAs per column (funs() is deprecated)
pivot_longer(cols = 1:13, names_to = 'columns', values_to = 'NA_count') %>%
arrange(desc(NA_count)) %>%
ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
geom_label(aes(label = NA_count)) +
theme_minimal() +
labs(title = 'Count of missing values in loan dataset') +
theme(plot.title = element_text(hjust = 0.5))
gg_miss_upset(loan)
This can be further visualized by the “Missingness Map”.
missmap(loan)
We can see a number of missing values (NAs) in our LoanAmount, Loan_Amount_Term, and Credit_History features. We will replace these in a later step.
Before progressing, we need to convert categorical features into factors. The factor data type is far more memory efficient and provides information about the unique values, or levels, within the feature. Most libraries will detect the factor data type and understand that the feature is a categorical type.
# convert categorical features to factor
loan <- loan %>%
mutate(Gender = as.factor(Gender),
Married = as.factor(Married),
Dependents = as.factor(Dependents),
Education = as.factor(Education),
Self_Employed = as.factor(Self_Employed),
Property_Area = as.factor(Property_Area),
Loan_Status = as.factor(Loan_Status))
summary(loan)
## Loan_ID Gender Married Dependents
## Length:614 Female:112 No :213 0 :345 Graduate :480
## Class :character Male :489 Yes :398 1 :102 Not Graduate:134
## Mode :character NA's : 13 NA's: 3 2 :101
## 3+ : 51
## NA's: 15
##
##
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## No :500 Min. : 150 Min. : 0 Min. : 9.0
## Yes : 82 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## NA's: 32 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
Next, we explore the distributions of our continuous features: ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. We color-coded Loan_Status, with Y in blue and N in red, to help visualize patterns between the target and each feature’s distribution.
We can see that ApplicantIncome, CoapplicantIncome and LoanAmount are highly right-skewed, with long right tails. Conversely, it looks like Loan_Amount_Term is highly left-skewed, with a long left tail.
Additionally, we can look at the categorical variables, broken out by our target variable of whether a loan was issued to the applicant. The green bars below show the loan approvals, and the red bars show the loan denials.
Based on the categorical variables split on loan approval status, we can see a few interesting things:
The number of loans applied for by males was significantly higher than loans applied for by females in this dataset.
The same pattern holds for education, where more loan applications came from graduates than from non-graduates.
We see a large class imbalance with the Self_Employed and Married features. In both cases, there are more applicants who are not self-employed and who are married.
There was a fairly even split in the number of applicants based on Property_Area, with applicants living in rural, semiurban, and urban settings roughly evenly distributed.
We can see that applicants who did not pass the credit history criteria were almost always denied a loan.
These factors will be interesting to examine and discuss as we build our decision tree model and random forest models. These class imbalances could affect model performance and interpretation.
We can use kNN imputation to help replace many of the missing values we identified above. With the kNN method, a categorical missing value is imputed by looking at other records with similar features. Once the k most similar records are found, they are used to infer the missing value. For missing numeric values, the values from the k most similar records are averaged (mean) to “predict” the missing value. Note this is only one way to handle numeric missing values; we could also do median imputation or mean imputation. It comes down to the nature of the data as to which technique makes the most sense.
# impute missing values with kNN, then keep only the original columns (dropping the indicator columns kNN() appends)
loan <- kNN(loan) %>%
subset(select = Loan_ID:Loan_Status)
Next, we double-check that this worked correctly by checking for missing values.
sapply(loan, function(x) sum(is.na(x)))
## Loan_ID Gender Married Dependents
## 0 0 0 0
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 0 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 0 0 0 0
## Loan_Status
## 0
Fortunately, we can see that there are now no missing data points!
Next, we’ll have to work through a few transformations for our highly skewed continuous data. We use BoxCox to identify the optimal \(\lambda\) which informs the ideal transformation to try.
# calculate optimal lambda for BoxCox
BoxCox.lambda(loan$LoanAmount)
## [1] -0.1839035
BoxCox.lambda(loan$Loan_Amount_Term)
## [1] 1.999924
For LoanAmount, \(\lambda=-0.184\), and for Loan_Amount_Term, \(\lambda=2.00\). We round \(\lambda\) to the nearest integer. So, for LoanAmount, \(\lambda=0\), which tells us to do a log() transformation, while for Loan_Amount_Term, \(\lambda=2\), which tells us to do a \(Y^2\) transformation.
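For reference, these rounding rules come from the standard one-parameter Box-Cox family (a textbook definition, not part of the original write-up):
\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} \]
so a rounded \(\lambda = 0\) corresponds to a log transform, \(\lambda = 2\) to a shifted and scaled \(Y^2\), and \(\lambda = -0.5\) to a transform equivalent to \(\frac{1}{\sqrt{Y}}\).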
# apply a log transform
loan$LoanAmount_Xform <- log(loan$LoanAmount)
# apply a square transformation
loan$Loan_Amount_Term_Xform <- loan$Loan_Amount_Term^2
Banks generally don’t want loan payments to exceed 25% of an applicant’s income. Most banks will add the co-applicant income to the applicant’s income and treat the sum in their decision criteria. For this reason, we combine ApplicantIncome and CoapplicantIncome into a single Income feature and drop the separate ApplicantIncome and CoapplicantIncome.
# combine applicant and co-applicant incomes and drop the original separate features
loan$Income <- loan$ApplicantIncome + loan$CoapplicantIncome
loan$ApplicantIncome <- NULL
loan$CoapplicantIncome <- NULL
As above, Income is skewed, so we will want to transform it. Let’s calculate the Box-Cox lambda to inform how to transform it.
# find the optimal lBoxCox lambda for Income
BoxCox.lambda(loan$Income)
## [1] -0.5592006
For Income, \(\lambda \approx -0.5\), which tells us to try a \(\frac{1}{\sqrt{Y}}\) transform.
# transform Income
loan$Income_Xform <- sqrt(loan$Income) ^ (-1)
When checking the new distribution of our Income variable, we can see that it is much more normally distributed:
With the exploratory data analysis behind us, and our data cleaned up, we are now ready to run our decision tree!
Similar to our KNN model process above, we’ll determine an 80/20 split for our testing and training datasets:
set.seed(123)
trainingRows <- createDataPartition(as.matrix(loan$Loan_Status), p=0.8, list=FALSE)
train_loan <- loan[trainingRows,]
test_loan <- loan[-trainingRows,]
With the data split, we can now run our first decision tree:
tree1 <- rpart(Loan_Status ~ Gender + Married + Dependents + Education +
Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data=train_loan)
rpart.plot(tree1, nn=TRUE)
summary(tree1)
## Call:
## rpart(formula = Loan_Status ~ Gender + Married + Dependents +
## Education + Self_Employed + Income_Xform + LoanAmount_Xform +
## Loan_Amount_Term_Xform + Credit_History + Property_Area,
## data = train_loan)
## n= 492
##
## CP nsplit rel error xerror xstd
## 1 0.37662338 0 1.0000000 1.0000000 0.06679061
## 2 0.03246753 1 0.6233766 0.6233766 0.05707947
## 3 0.01948052 2 0.5909091 0.6233766 0.05707947
## 4 0.01000000 3 0.5714286 0.6233766 0.05707947
##
## Variable importance
## Credit_History Income_Xform LoanAmount_Xform
## 84 16 1
##
## Node number 1: 492 observations, complexity param=0.3766234
## predicted class=Y expected loss=0.3130081 P(node) =1
## class counts: 154 338
## probabilities: 0.313 0.687
## left son=2 (70 obs) right son=3 (422 obs)
## Primary splits:
## Credit_History < 0.5 to the left, improve=59.010690, (0 missing)
## Income_Xform < 0.02049156 to the right, improve= 7.590926, (0 missing)
## Property_Area splits as LRL, improve= 7.068090, (0 missing)
## Loan_Amount_Term_Xform < 180000 to the right, improve= 2.441737, (0 missing)
## Education splits as RL, improve= 1.729095, (0 missing)
## Surrogate splits:
## Income_Xform < 0.005074514 to the left, agree=0.86, adj=0.014, (0 split)
##
## Node number 2: 70 observations
## predicted class=N expected loss=0.08571429 P(node) =0.1422764
## class counts: 64 6
## probabilities: 0.914 0.086
##
## Node number 3: 422 observations, complexity param=0.03246753
## predicted class=Y expected loss=0.2132701 P(node) =0.8577236
## class counts: 90 332
## probabilities: 0.213 0.787
## left son=6 (9 obs) right son=7 (413 obs)
## Primary splits:
## Income_Xform < 0.02049156 to the right, improve=5.861038, (0 missing)
## Property_Area splits as LRL, improve=3.255450, (0 missing)
## LoanAmount_Xform < 5.093731 to the right, improve=2.353409, (0 missing)
## Married splits as LR, improve=1.943941, (0 missing)
## Loan_Amount_Term_Xform < 180000 to the right, improve=1.684190, (0 missing)
##
## Node number 6: 9 observations
## predicted class=N expected loss=0.2222222 P(node) =0.01829268
## class counts: 7 2
## probabilities: 0.778 0.222
##
## Node number 7: 413 observations, complexity param=0.01948052
## predicted class=Y expected loss=0.2009685 P(node) =0.8394309
## class counts: 83 330
## probabilities: 0.201 0.799
## left son=14 (11 obs) right son=15 (402 obs)
## Primary splits:
## Income_Xform < 0.007097162 to the left, improve=4.284634, (0 missing)
## LoanAmount_Xform < 5.093731 to the right, improve=3.068245, (0 missing)
## Property_Area splits as LRL, improve=2.162517, (0 missing)
## Loan_Amount_Term_Xform < 180000 to the right, improve=1.832774, (0 missing)
## Married splits as LR, improve=1.616488, (0 missing)
## Surrogate splits:
## LoanAmount_Xform < 6.280122 to the right, agree=0.976, adj=0.091, (0 split)
##
## Node number 14: 11 observations
## predicted class=N expected loss=0.3636364 P(node) =0.02235772
## class counts: 7 4
## probabilities: 0.636 0.364
##
## Node number 15: 402 observations
## predicted class=Y expected loss=0.1890547 P(node) =0.8170732
## class counts: 76 326
## probabilities: 0.189 0.811
From our decision tree summary and plot, we can see that a few variables, namely Credit_History, Income_Xform, and LoanAmount_Xform, were the most important features in this classifier. We can use this decision tree to make predictions on our holdout test set.
# predict on the holdout test set (rpart.control() settings apply when fitting, not when predicting)
loanPre <- predict(tree1,
test_loan,
type="class")
confusionMatrix(loanPre, test_loan$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 6
## Y 16 78
##
## Accuracy : 0.8197
## 95% CI : (0.7398, 0.8834)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.000775
##
## Kappa : 0.5469
##
## Mcnemar's Test P-Value : 0.055009
##
## Sensitivity : 0.5789
## Specificity : 0.9286
## Pos Pred Value : 0.7857
## Neg Pred Value : 0.8298
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.2295
## Balanced Accuracy : 0.7538
##
## 'Positive' Class : N
##
As we can see from our predictions and the confusion matrix, we were able to obtain about 82% accuracy using this decision tree. To achieve higher accuracy, we’ll use the prune() function in the rpart package to examine a predicted optimal tree size.
plotcp(tree1)
By plotting the cross-validated error against the complexity parameter, we can see that the cross-validated error reaches its minimum at a tree size of 2, with no improvement from additional splits.
tree1$cptable
## CP nsplit rel error xerror xstd
## 1 0.37662338 0 1.0000000 1.0000000 0.06679061
## 2 0.03246753 1 0.6233766 0.6233766 0.05707947
## 3 0.01948052 2 0.5909091 0.6233766 0.05707947
## 4 0.01000000 3 0.5714286 0.6233766 0.05707947
We want to minimize xerror; in row 2 (a tree of size 2), we see \(xerror=0.6234\), and additional splits do not reduce it further.
Pruning at the corresponding cp value of 0.03246753 yields a tree of size 2. When we prune our initial tree with this value, we can see the resulting decision tree below.
# prune tree
tree1.pruned <- prune(tree1, cp=0.03246753)
rpart.plot(tree1.pruned,
extra = 104,
main = "Decision Tree")Interestingly, from this pruning process, the Credit_History feature alone seems to do a reasonably good job of classifying applicants into the approval vs. disapproval status. The accuracy is about 86% on our test set, which is similar to our initial tree.
However, it’s important to think critically about whether this pruned tree would perform better on another dataset and is applicable for real-life events.
# predict holdout data
tree1.pruned.pred <- predict(tree1.pruned,
test_loan,
type = "class" )
# show model performance KPIs
confusionMatrix(tree1.pruned.pred, test_loan$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 2
## Y 16 82
##
## Accuracy : 0.8525
## 95% CI : (0.7769, 0.9102)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 2.502e-05
##
## Kappa : 0.6174
##
## Mcnemar's Test P-Value : 0.002183
##
## Sensitivity : 0.5789
## Specificity : 0.9762
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.8367
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.1967
## Balanced Accuracy : 0.7776
##
## 'Positive' Class : N
##
While a single feature offers reasonably good accuracy, other features may carry mitigating information that would be lost, and could change a loan decision, if we relied on the first feature alone. While decision trees are simple to construct and understand, their weakness is that they follow a single greedy sequence of splits, so once an early split is made, downstream features have limited opportunity to alter the prediction.
Using the same dataset on Loan Approval Status, please use Random Forests to predict loan approval status. Again, please be sure to walk us through the steps you took to get to your final model. (50 points)
To expand on our analysis of the loan approval dataset, we’ll run a few random forest models to see if they are a more effective way to classify whether someone will be accepted or rejected for a loan.
In order to run a random forest model, we’ll use the randomForest() function in the randomForest package. We’ll include the same initial features that we did for our first decision tree:
set.seed(123)
# fit a Random Forest model with training data
fit.forest <- randomForest(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data=train_loan)
# display model details
fit.forest
##
## Call:
## randomForest(formula = Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform + Credit_History + Property_Area, data = train_loan)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.33%
## Confusion matrix:
## N Y class.error
## N 73 81 0.52597403
## Y 19 319 0.05621302
# show which features were most important
importance(fit.forest)
## MeanDecreaseGini
## Gender 4.295290
## Married 4.815481
## Dependents 10.784519
## Education 4.104093
## Self_Employed 3.517598
## Income_Xform 46.900023
## LoanAmount_Xform 38.372816
## Loan_Amount_Term_Xform 9.644244
## Credit_History 52.770213
## Property_Area 11.332043
After fitting our model, we can now use it to create predictions on our holdout test set to evaluate its performance.
# run predictions with holdout test data
forest.pred <- predict(fit.forest,
newdata = test_loan)
# show model performance KPIs
(dtree.cm_train <- confusionMatrix(forest.pred, test_loan$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 21 5
## Y 17 79
##
## Accuracy : 0.8197
## 95% CI : (0.7398, 0.8834)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.000775
##
## Kappa : 0.5398
##
## Mcnemar's Test P-Value : 0.019016
##
## Sensitivity : 0.5526
## Specificity : 0.9405
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.8229
## Prevalence : 0.3115
## Detection Rate : 0.1721
## Detection Prevalence : 0.2131
## Balanced Accuracy : 0.7466
##
## 'Positive' Class : N
##
Here, we notice a decrease in Accuracy (\(Accuracy=0.8197\)) compared with our pruned Decision Tree (\(Accuracy=0.8525\)), matching the accuracy of our initial, unpruned tree. Notice the disparity between Specificity and Sensitivity: this model has a high false-positive rate and tends to over-predict Y when it should have chosen N.
Next, we want to see whether we have generated enough trees so that the out-of-bag (OOB) error rate is at a minimum. From the plot below, we see that the OOB error rate decreases rapidly over the first 1-20 trees and stabilizes around 100 trees.
plot(fit.forest,
col = c("black", "red", "dark green"),
main = "Predicted Loan Error Rates")
legend("topright", colnames(fit.forest$err.rate), col = 1:6, cex = 0.8, fill = 1:6)Tto see whether the Random forest model can be improved, we will run the same model, but this time we will use the tuneRF function. We provide the features (x), target (y), and number of trees used at each tuning step (ntreeTry).
tuneRF() starts from the default mtry value and steps it up and down, keeping any change that sufficiently improves the OOB error, until no further improvement is found.
set.seed(123)
# run RF tuning
tree_var <- tuneRF(x = subset(train_loan, select = -Loan_Status),
y = train_loan$Loan_Status,
ntreeTry = 1000)
## mtry = 3 OOB error = 20.33%
## Searching left ...
## mtry = 2 OOB error = 19.92%
## 0.02 0.05
## Searching right ...
## mtry = 6 OOB error = 19.72%
## 0.03 0.05
We rerun the Random Forest model with \(ntree = 100\) (the OOB error stabilized by roughly 100 trees in the plot above) and \(mtry = 6\) (the value with the lowest OOB error found by tuneRF()).
set.seed(123)
# pull out optimal value from previous tuning step
val_opt <- tree_var [,"mtry"][which.min(tree_var [,"OOBError"])]
# fit a new RF model with better parameters
fit.forest2 <- randomForest(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data = train_loan,
importance = TRUE,
ntree = 100,
mtry = 6)
# display model details
fit.forest2
##
## Call:
## randomForest(formula = Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform + Credit_History + Property_Area, data = train_loan, importance = TRUE, ntree = 100, mtry = 6)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 21.14%
## Confusion matrix:
## N Y class.error
## N 81 73 0.47402597
## Y 31 307 0.09171598
The results of the Random Forest model on our training set yielded an out of bag error rate of 21.14%, which is higher than the original model’s OOB error of 20.33%, but still very close. Tuning in this case didn’t seem to make much difference.
# predict using holdout data
forest.pred2 <- predict(fit.forest2,
newdata = test_loan)
# show model performance KPIs
(forest.cm_train <- confusionMatrix(forest.pred2, test_loan$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 9
## Y 16 75
##
## Accuracy : 0.7951
## 95% CI : (0.7125, 0.8628)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.00585
##
## Kappa : 0.4969
##
## Mcnemar's Test P-Value : 0.23014
##
## Sensitivity : 0.5789
## Specificity : 0.8929
## Pos Pred Value : 0.7097
## Neg Pred Value : 0.8242
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.2541
## Balanced Accuracy : 0.7359
##
## 'Positive' Class : N
##
When we use our second Random Forest model to make predictions on our test set, we notice slightly lower accuracy (\(Accuracy=0.7951\)) compared to our untuned RF model (\(Accuracy=0.8197\)).
Next, we again check whether we have generated enough trees so that the OOB error rate is at a minimum. From the plot below, we see that the OOB error rate decreases over the first 1-20 trees, increases a little between 20 and 50 trees, and stabilizes at around 60 trees.
plot(fit.forest2,
col = c("black", "red", "dark green"),
main = "Predicted Loan Error Rates")
legend("topright", colnames(fit.forest2$err.rate), col = 1:6, cex = 0.8, fill = 1:6)Using the Loan Approval Status data, please use Gradient Boosting to predict the loan approval status. Please use whatever boosting approach you deem appropriate; but please be sure to walk us through your steps. (50 points)
gbm uses a default of 100 trees, which is rarely sufficient, so we increase it to 1000 trees (n.trees). The default depth of each tree (interaction.depth) is 1, which means we are ensembling many stumps. We use distribution = "gaussian", which treats the two-level Loan_Status outcome as a numeric 1/2 coding; we will threshold the numeric predictions later to recover class labels. Lastly, we include cv.folds to perform a 3-fold cross-validation.
set.seed(123)
# build gradient boosted model
boost <- gbm(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data=train_loan[,-1],
distribution = "gaussian",
n.trees = 1000,
cv.folds = 3)
# display model details
boost
## gbm(formula = Loan_Status ~ Gender + Married + Dependents + Education +
## Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform +
## Credit_History + Property_Area, distribution = "gaussian",
## data = train_loan[, -1], n.trees = 1000, cv.folds = 3)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 62.
## There were 10 predictors of which 8 had non-zero influence.
sqrt(min(boost$cv.error))
## [1] 0.3913306
summary(boost)
## var rel.inf
## Income_Xform Income_Xform 35.6412932
## LoanAmount_Xform LoanAmount_Xform 26.9510994
## Credit_History Credit_History 14.5858857
## Dependents Dependents 7.4811920
## Loan_Amount_Term_Xform Loan_Amount_Term_Xform 5.2503899
## Property_Area Property_Area 4.5494508
## Self_Employed Self_Employed 2.0068665
## Married Married 1.3407906
## Education Education 1.2198756
## Gender Gender 0.9731563
The best cross-validation iteration was 62, and the cross-validated RMSE is 0.3913306.
The summary() output creates a new data set with var, a factor listing the variables in our model, and rel.inf, the relative influence each variable had on our model predictions. From the table, we can see Income_Xform, LoanAmount_Xform, and Credit_History are the top three most important variables for our model. Notice that the feature importance differs from what we saw in our decision tree, which illustrates how downstream features can still be important. Random forests and gradient boosting both build ensembles of many trees, giving less dominant features repeated chances to contribute.
We can tune parameters to see if we can improve the performance. We increase the number of trees (n.trees) and perform a 5-fold cross-validation (cv.folds).
set.seed(123)
# train gb model with different parameters
boost2 <- gbm(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data = train_loan[,-1],
distribution = "gaussian",
n.trees = 2000,
cv.folds = 5)
# display performance KPIs
sqrt(min(boost2$cv.error))
## [1] 0.3886695
summary(boost2)
## var rel.inf
## Income_Xform Income_Xform 36.755530
## LoanAmount_Xform LoanAmount_Xform 30.036793
## Credit_History Credit_History 9.148464
## Dependents Dependents 7.747582
## Loan_Amount_Term_Xform Loan_Amount_Term_Xform 5.246914
## Property_Area Property_Area 5.235610
## Married Married 1.703130
## Education Education 1.631745
## Self_Employed Self_Employed 1.340918
## Gender Gender 1.153315
We can see that the relative influence of some of the lower-ranked features changed, and the cross-validated RMSE dropped to 0.3886695, slightly lower than our previous model.
# plot loss function as a result of n trees added to the ensemble
gbm.perf(boost2, method = "cv")## [1] 50
gbm.perf(boost2, method = "OOB")## [1] 40
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 2000
## Equivalent Number of Parameters: 39.99
## Residual Standard Error: 0.0003087
The plots indicate the optimum number of trees for each estimation method. The green line shows the error on the held-out data, and the blue dotted line marks the optimum number of iterations. We can observe that beyond a certain point (50 iterations for the cv method and 40 for the OOB method), the held-out error begins to increase because of overfitting.
We use the second model to predict on our holdout data and, following the cv method, use 50 trees.
# predict based on holdout test data with tuned parameters
boostPre <- predict.gbm(boost2,
test_loan,
n.trees = 50)
boostPre <- ifelse(boostPre < 1.5, 'N', 'Y')
boostPre <- as.factor(boostPre)
(boost.cm <- confusionMatrix(boostPre, test_loan$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 2
## Y 16 82
##
## Accuracy : 0.8525
## 95% CI : (0.7769, 0.9102)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 2.502e-05
##
## Kappa : 0.6174
##
## Mcnemar's Test P-Value : 0.002183
##
## Sensitivity : 0.5789
## Specificity : 0.9762
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.8367
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.1967
## Balanced Accuracy : 0.7776
##
## 'Positive' Class : N
##
According to the confusion matrix, our model Accuracy is 0.8525. Looking at Sensitivity and Specificity, this model has a far lower false-negative rate and is less likely to pick ‘N’ when it shouldn’t. However, its false-positive rate remains higher: it still selects ‘Y’ in a number of cases where it should have predicted ‘N’.
Model performance: please compare the models you settled on for problems # 2–4. Comment on their relative performance. Which one would you prefer the most? Why? (20 points)
After running various decision trees, random forest models, and gradient boosting methods, we can take a look at the overall evaluation metrics for these three techniques on the loan approval dataset. By creating a dataframe to store all of our metrics, we can visualize the outcomes below:
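The construction of the eval dataframe is not shown above; a minimal sketch, assuming the three confusionMatrix objects were saved (dtree.cm is a hypothetical name, since the decision tree's confusionMatrix() above was not assigned, while forest.cm_train and boost.cm were created earlier), might look like:
# sketch: collect caret::confusionMatrix metrics for each model into one dataframe
collect_metrics <- function(cm) {
  data.frame(Accuracy = unname(cm$overall["Accuracy"]),
             `Classification Error Rate` = 1 - unname(cm$overall["Accuracy"]),
             Sensitivity = unname(cm$byClass["Sensitivity"]),
             Specificity = unname(cm$byClass["Specificity"]),
             Precision = unname(cm$byClass["Pos Pred Value"]),
             Recall = unname(cm$byClass["Sensitivity"]),
             F1 = unname(cm$byClass["F1"]),
             check.names = FALSE)
}
eval <- rbind(collect_metrics(dtree.cm),        # hypothetical: saved decision tree confusion matrix
              collect_metrics(forest.cm_train), # random forest (created above)
              collect_metrics(boost.cm))        # gradient boosting (created above)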
eval <- dplyr::select(eval, Accuracy, `Classification Error Rate`, Sensitivity, Specificity, Precision, Recall, F1)
rownames(eval) = c("Decision Tree", "Random Forest", "Gradient Boosting")
t_eval <- t(eval)
colnames(t_eval) <- rownames(eval)
rownames(t_eval) <- colnames(eval)
knitr::kable(t_eval)
| | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| Accuracy | 0.8196721 | 0.7950820 | 0.8524590 |
| Classification Error Rate | 0.1803279 | 0.2049180 | 0.1475410 |
| Sensitivity | 0.5526316 | 0.5789474 | 0.5789474 |
| Specificity | 0.9404762 | 0.8928571 | 0.9761905 |
| Precision | 0.8076923 | 0.7096774 | 0.9166667 |
| Recall | 0.5526316 | 0.5789474 | 0.5789474 |
| F1 | 0.6562500 | 0.6376812 | 0.7096774 |
When comparing our three different techniques, we can see that our Gradient Boosting method performed the best and Random Forest performed the worst. That said, Decision Tree and Random Forest had mixed results when we consider false positives and false negatives: RF performed worst overall but had better Sensitivity than the Decision Tree. With lower overall accuracy, a higher classification error rate, and worse performance on measures such as F1, Precision, and Specificity, we would rule out Random Forest. Additionally, the pruned decision tree effectively relies on a single feature (Credit_History); such a model isn't as flexible if applied to a different dataset and doesn't allow for cases where a loan decision should hinge on other criteria.
When considering model performance, accuracy is a good starting point, but false positives versus false negatives can be quite significant from a business perspective. If the bank incorrectly predicts it shouldn’t give a loan, but the applicant would have paid it back (false negative), the bank loses out on potential revenue. If the bank incorrectly gives a loan and the applicant defaults (false positive), the bank loses money on the default. In the end, the bank will need to assign a cost function to both of these situations and determine which has a more negative impact on business. We see high Specificity, between 89% and 98%, across all three models, which indicates low false-negative rates. These models do a pretty good job of giving loans to applicants who will repay, so there should be little opportunity cost on that side. However, we see low Sensitivity, between 55% and 58%, i.e., high numbers of false positives. These models would suggest giving loans to applicants who might default. Of the two errors, the second is probably far more expensive. As a bank, we would probably care more about maximizing Sensitivity along with Accuracy. With that in mind, from a business perspective, the Gradient Boosting model ends up being a better compromise of both Accuracy and Sensitivity.
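To make that trade-off concrete, a small illustration with assumed (hypothetical) unit costs, using the t_eval metrics computed above, might look like:
# hypothetical illustration: expected cost per applicant under assumed unit costs
cost_default <- 5    # assumed cost of approving an applicant who defaults (false positive)
cost_lost_loan <- 1  # assumed cost of denying an applicant who would have repaid (false negative)
prev <- 0.3115       # P(actual = 'N'), the prevalence reported in the confusion matrices above
expected_cost <- prev * (1 - t_eval["Sensitivity", ]) * cost_default +
  (1 - prev) * (1 - t_eval["Specificity", ]) * cost_lost_loan
round(expected_cost, 3)  # lower is better; Gradient Boosting wins under these assumed costs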