Group 4 Assignment 3

(Q1) k-Nearest Neighbors

[Using] the Penguin dataset . . . please use [the] K-nearest neighbors algorithm to predict the species variable.

The KNN Algorithm is a simple classification algorithm. It stores the training data and classifies new cases by a majority vote of the case’s k-nearest neighbors.

The algorithm is sensitive to the units of the chosen predictor variables. It is important that we transform all the values to a common scale. A frequently used practice is to normalize the data to the interval [0, 1].

KNN requires numeric predictors, so we convert all predictor columns to numeric values; only the target variable species remains a factor.

A note on data preparation: this is the third time we’ve worked with the palmerpenguins data. We’ve repeated some data preparation and EDA in the appendix. The data prep we do here is specific to the k-NN algorithm.

penguins_na_omit <- readr::read_rds('data/penguins_na_omit.Rds')

normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

penguins_df_norm <- penguins_na_omit %>% 
  select(-year) %>% 
  mutate_if(is.factor, as.integer) %>% 
  mutate_if(is.numeric, normalize) %>% 
  mutate(species = penguins_na_omit$species)

Create Training and Test sets

The first is used to train the system, while the second is used to evaluate the trained system. In practice, the training and test sets are disjoint: the most common splitting choice is to take about 2/3 of the original data as the training set, with the remaining 1/3 forming the test set.

# Create sample data for testing and training.
set.seed(1234)
ind <- sample(2, nrow(penguins_df_norm), replace=TRUE, prob=c(0.67, 0.33))

# Training and test sets
penguin.train <- penguins_df_norm %>% filter(ind == 1) %>% select(-species)
penguin.test  <- penguins_df_norm %>% filter(ind == 2) %>% select(-species)

# Training and test labels
penguin.trainLabels <- penguins_df_norm$species[ind == 1]
penguin.testLabels <- penguins_df_norm$species[ind == 2]

Build KNN Classifier

This step performs the actual classification. Note: the k parameter is often an odd number to avoid ties in the voting scores. A good starting point for k is the square root of the total number of observations.
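
As a quick illustration of that rule of thumb, the following sketch (assuming the normalized data frame created above) computes an odd k near the square root of n, which is consistent with our choice of k = 19:

# Rule-of-thumb starting value for k: an odd number near sqrt(n)
k_start <- round(sqrt(nrow(penguins_df_norm)))
if (k_start %% 2 == 0) k_start <- k_start + 1
k_start  # with ~333 complete-case penguins this lands on 19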

# knn() from the class package performs the classification
penguin_pred <- knn(
  k = 19,
  train = penguin.train, 
  test = penguin.test, 
  cl = penguin.trainLabels)

caret::confusionMatrix(
  penguin_pred, 
  penguin.testLabels)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        38         0      0
##   Chinstrap      0        25      0
##   Gentoo         0         0     37
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9638, 1)
##     No Information Rate : 0.38       
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                   1.00             1.00          1.00
## Specificity                   1.00             1.00          1.00
## Pos Pred Value                1.00             1.00          1.00
## Neg Pred Value                1.00             1.00          1.00
## Prevalence                    0.38             0.25          0.37
## Detection Rate                0.38             0.25          0.37
## Detection Prevalence          0.38             0.25          0.37
## Balanced Accuracy             1.00             1.00          1.00

With k = 19 we have zero misclassifications!

What are the accuracy measures for other values of k?

k_search <- function(k) {
  penguin_pred <- knn(
    k = k,
    train = penguin.train, 
    test = penguin.test, 
    cl = penguin.trainLabels)

  cm <- caret::confusionMatrix(
    penguin_pred, 
    penguin.testLabels)
  
  data.frame(
    k = k, 
    # column 11 of cm$byClass is Balanced Accuracy
    Adelie = cm$byClass[1, 11],
    Chinstrap = cm$byClass[2, 11],
    Gentoo = cm$byClass[3, 11] )
}

res <- NULL
for (i in seq(1, 21, 2)) {
  res_i <- k_search(i)
  if (is.null(res))
    res <- res_i
  else
    res <- rbind(res, res_i)
}

print(res)
##     k    Adelie Chinstrap Gentoo
## 1   1 0.9919355 0.9800000      1
## 2   3 0.9868421 0.9933333      1
## 3   5 0.9919355 0.9800000      1
## 4   7 0.9919355 0.9800000      1
## 5   9 0.9919355 0.9800000      1
## 6  11 0.9919355 0.9800000      1
## 7  13 0.9919355 0.9800000      1
## 8  15 0.9919355 0.9800000      1
## 9  17 1.0000000 1.0000000      1
## 10 19 1.0000000 1.0000000      1
## 11 21 1.0000000 1.0000000      1

We see that k = 17 is the first value of k for which we have perfect balanced accuracy, but the accuracy is extremely good for all values of k tried.

What if we do not know island?

From previous homework assignments, we found that island was an extremely helpful variable because not all species are found on each island. If we wanted to build a model without island (for individuals we might find at sea, for example), how well would it perform?

cols <- c(
  "bill_length_mm", "bill_depth_mm", 
  "flipper_length_mm", "body_mass_g", "sex")

k_search <- function(k) {
  penguin_pred <- knn(
    k = k,
    train = penguin.train[, cols], 
    test = penguin.test[, cols], 
    cl = penguin.trainLabels)

  cm <- caret::confusionMatrix(
    penguin_pred, 
    penguin.testLabels)
  
  data.frame(
    k = k, 
    Adelie = cm$byClass[1, 11],
    Chinstrap = cm$byClass[2, 11],
    Gentoo = cm$byClass[3, 11] )
}

res <- NULL
for (i in seq(1, 21, 2)) {
  res_i <- k_search(i)
  if (is.null(res))
    res <- res_i
  else
    res <- rbind(res, res_i)
}

print(res)
##     k    Adelie Chinstrap Gentoo
## 1   1 0.9707131 0.9533333      1
## 2   3 0.9919355 0.9800000      1
## 3   5 0.9838710 0.9600000      1
## 4   7 0.9838710 0.9600000      1
## 5   9 0.9838710 0.9600000      1
## 6  11 0.9838710 0.9600000      1
## 7  13 0.9838710 0.9600000      1
## 8  15 0.9838710 0.9600000      1
## 9  17 0.9838710 0.9600000      1
## 10 19 0.9838710 0.9600000      1
## 11 21 0.9838710 0.9600000      1

With k as low as 3 we get a pretty good model without using island. However, we do not see perfect classification as we do when the variable is included.

Loan Data Exploration and Prep

Data Exploration

# Load raw CSV file
loans_raw <- readr::read_csv("data/Loan_approval.csv")

Before conducting any modeling, we explore our dataset. The summary below describes the Loan_approval dataset, which has 614 observations and 13 features. Eight of these variables are factors while the other five are numeric.

summary(loans_raw) %>% kable() %>%   
  kable_styling(
      full_width = FALSE, position = "center", bootstrap_options = c("hover")) %>% 
  scroll_box() %>% 
  kable_classic_2()
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
Length:614 Length:614 Length:614 Length:614 Length:614 Length:614 Min. : 150 Min. : 0 Min. : 9.0 Min. : 12 Min. :0.0000 Length:614 Length:614
Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0 1st Qu.:360 1st Qu.:1.0000 Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 3812 Median : 1188 Median :128.0 Median :360 Median :1.0000 Mode :character Mode :character
NA NA NA NA NA NA Mean : 5403 Mean : 1621 Mean :146.4 Mean :342 Mean :0.8422 NA NA
NA NA NA NA NA NA 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0 3rd Qu.:360 3rd Qu.:1.0000 NA NA
NA NA NA NA NA NA Max. :81000 Max. :41667 Max. :700.0 Max. :480 Max. :1.0000 NA NA
NA NA NA NA NA NA NA NA NA’s :22 NA’s :14 NA’s :50 NA NA

We can visualize the amount of missing data in our dataset with the vis_miss function from the naniar package. The majority of the data is complete and approximately 1.9% of the dataset is missing. There are no observations with a majority of features missing, and there are no features that are abnormally sparse. A threshold of 0.5 was set for observation completeness, meaning that if 50% of the features are NA for any observation, that row is removed from our dataset; no rows in the dataset triggered this threshold. The few pieces of missing information will need to be handled with imputation or removal. Of the categorical variables, this includes Gender, Married, Dependents, Self_Employed, and Credit_History; of the numeric variables, this includes LoanAmount and Loan_Amount_Term. Imputation methods explored include mean, median, mode, bagImpute, and knn.

naniar::vis_miss(loans_raw)

# # Sample code for testing the "50% missing threshold"
# limitmissing <- 0.5 * ncol(loans_raw)
# retain <- apply(loans_raw, MARGIN = 1, function(y) sum(is.na(y))) < limitmissing
# Dataset <- loans_raw[retain, ]

Simple bar plots show the frequency of each categorical variable in the dataset, and histograms show the distribution of each numeric feature. Gender is comprised primarily of Male, and we deliberately remove this feature from our dataset to avoid making loan approval predictions based on any gender bias. For Loan_Amount_Term, most individuals go with the 360-month (30-year) term, while few select terms below or above that. Many of the distributions skew right; however, no transformations to normalize the data are necessary for decision-tree-based classification models.

unique(loans_raw$Loan_Amount_Term)
##  [1] 360 120 240  NA 180  60 300 480  36  84  12
plot_bar(loans_raw, theme_config = defaulttheme, ncol = 3)

plot_histogram(loans_raw,theme_config = defaulttheme, ncol = 2)

The bar chart also depicts a class imbalance in our response variable Loan_Status. Knowing this, we may need to perform some over/under-sampling techniques in order to obtain better training sets. The table below shows this as 69% approved (Y) and 31% denied (N). With this, we know that simply guessing the majority class already provides 69% accuracy, and our models need to surpass this. Kappa is an accuracy metric that is normalized by the class distribution and will be used to compare model performance. For Kappa, values of 0-0.20 are considered slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1 almost perfect.
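
As a reference for how kappa relates to accuracy and class balance, here is a minimal sketch (a hypothetical helper, not part of the assignment code) that computes Cohen's kappa from a confusion matrix:

# Cohen's kappa from a square confusion matrix: (p_o - p_e) / (1 - p_e)
kappa_from_cm <- function(cm) {
  n   <- sum(cm)
  p_o <- sum(diag(cm)) / n                     # observed accuracy
  p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # accuracy expected from the class distribution alone
  (p_o - p_e) / (1 - p_e)
}

# Always predicting the majority class: about 69% accuracy but kappa = 0
kappa_from_cm(matrix(c(0, 192, 0, 422), nrow = 2))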

prop.table(table(loans_raw$Loan_Status)) %>% as.data.frame() %>% 
  mutate(Freq=percent(Freq)) %>% 
  rename("Class" = Var1, "Proportion" = "Freq") %>% 
  mutate(Count =table(loans_raw$Loan_Status)) %>% kable() %>%   
  kable_styling(
      full_width = FALSE, position = "center", bootstrap_options = c("hover")) %>% 
  kable_classic_2()
Class Proportion Count
N 31% 192
Y 69% 422

Next we look at the same categorical and numeric distributions, but separated by our response variable, to assess any interesting or unusual patterns. We notice that the proportion of Y/N is distributed across the categorical variables such that no single feature fully determines the prediction (which would leave little incentive to build models). However, Credit_History appears to be a major determining factor.

plot_bar(loans_raw, by = "Loan_Status")

plot_boxplot(loans_raw, by = "Loan_Status", theme_config = defaulttheme, ncol = 2)

Tree-based models are not susceptible to the negative impacts of collinearity; however, we can still check whether there is any capacity for feature reduction by looking at pairwise correlations of the numeric features. If features describe the data too similarly, they may not provide much additional benefit in the model. For this particular dataset, we do not observe heavy multicollinearity.

plot_correlation(
  loans_raw,
  type = c("continuous"),
  cor_args = list(use = "pairwise.complete.obs"))

Data preparation

While working through questions 2-5 we experimented with various data preparation and cleaning options. We’ve consolidated our selected steps here to make it convenient for the reader to follow along.

We imputed missing values using the mode or the mean, depending on the variable. We also added two new variables: Total_Income and IncomeLoanRatio.

# Helper functions

# mode(): returns the most frequent value of x (the first one, in table order, if tied)
mode <- function(x, ...) {
  tx <- table(x)
  which(tx == max(tx)) %>% sort %>% `[`(., 1) %>% names
}

# impute(): replaces NA values in x with fun(x, na.rm = TRUE), e.g. the mean or the mode
impute <- function(x, fun) {
  fval <- fun(x, na.rm = TRUE)
  y <- ifelse(is.na(x), fval, x)
  return(y)
}
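
# Quick illustrative check of the helpers above (example values, not assignment data)
mode(c("Y", "Y", "N"))     # returns "Y"
impute(c(1, NA, 3), mean)  # returns 1 2 3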

# Prep statements
loans <- loans_raw %>% 
  select(-Loan_ID, -Gender) %>% 
  mutate(
    Married            = impute(Married, mode),
    Dependents         = impute(Dependents, mode),
    Education          = impute(Education, mode),
    Self_Employed      = impute(Self_Employed, mode),
    ApplicantIncome    = impute(ApplicantIncome, mean),
    CoapplicantIncome  = impute(CoapplicantIncome, mean),
    LoanAmount         = impute(LoanAmount, mean),
    Loan_Amount_Term   = impute(Loan_Amount_Term, mean),
    Credit_History     = impute(Credit_History, mode),
    Property_Area      = impute(Property_Area, mode),
    Loan_Status        = impute(Loan_Status, mode)  ) %>% 
  mutate(
    Total_Income = ApplicantIncome + CoapplicantIncome) %>% 
  select(-ApplicantIncome, -CoapplicantIncome) %>% 
  mutate(IncomeLoanRatio = Total_Income / LoanAmount) %>% 
  mutate_if(is.character, factor)

# Create dummyVars, *including* base levels!
dv <- dummyVars(~ ., data = loans, fullRank = FALSE)
loans_dv <- predict(dv, loans) %>% as.data.frame

# Split data: L is a binary vector
# so one can split loans or loans_dv the same way
set.seed(3)
S <- initial_split(loans, prop = .75, strata = Loan_Status)
L <- (1:nrow(loans)) %in% S$in_id

loans_train <- loans[L, ]
loans_test <- loans[!L, ]
loans_dv_train <- loans_dv[L, ]
loans_dv_test <- loans_dv[!L, ]

# Save Rds files for later use
write_rds(loans_train, 'data/loans_train.Rds')
write_rds(loans_test, 'data/loans_test.Rds')
write_rds(loans_dv_train, 'data/loans_dv_train.Rds')
write_rds(loans_dv_test, 'data/loans_dv_test.Rds')

(Q2) Decision Tree

Several models will be tested across 10-fold cross-validation of the training set, which allows for robust estimates that are less susceptible to overfitting. The decision tree model is fit with rpart inside the tidymodels framework. After testing various model setups on the validation folds, the best model will be tuned over the appropriate decision tree hyperparameters, and final predictions and accuracy will be shown on the training and testing sets, which are split 75/25.

Note on tidymodels: the tidymodels framework is a wrapper around the various packages that one can use for the same modeling methods. The specific functions of this package may be new to the reader; however, we hope that the intent of the following code is easy to follow even for readers unfamiliar with tidymodels. We do not use the tidymodels framework for all questions in this homework.

Datatrain <- read_rds('data/loans_dv_train.Rds') %>% 
  mutate(Loan_Status = ifelse(Loan_Status.Y, "Y", "N")) %>% 
  select(-Loan_Status.N, -Loan_Status.Y)

Datatest <- read_rds('data/loans_dv_test.Rds') %>% 
  mutate(Loan_Status = ifelse(Loan_Status.Y, "Y", "N")) %>% 
  select(-Loan_Status.N, -Loan_Status.Y)

datacv <- vfold_cv(Datatrain, v = 10, strata = Loan_Status)

tree_engine <-
    decision_tree(mode = "classification") %>% 
    parsnip::set_engine(engine = "rpart")

dt_wf <-
    workflow() %>% 
    workflows::add_model(tree_engine)

Candidate models

We ran several iterations of models to determine the best data preparation steps above. We have not included each intermediate model. If the reader is interested in reading that code, the authors will make it available.

Baseline model

Kappa, introduced in the data exploration section above, is an accuracy metric normalized by the class distribution (0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1 almost perfect); it is the primary metric we use to compare model performance.

Evaluated across the 10 folds of the training set, this baseline model achieves roughly 82% accuracy and a kappa of about 0.52. Other performance metrics are presented below.

All_metrics <- data.frame()


# Baseline Model = #1
dt_recipe1 <-
  recipe(Loan_Status ~ ., data = Datatrain)

dt_recipe1 %>% prep()
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         19
## 
## Training data contained 461 data points and no missing data.
dt_wf1<-
 dt_wf %>% 
  add_recipe(dt_recipe1)

wf1_results <- dt_wf1 %>% 
  fit_resamples(
    resamples= datacv, 
    metrics = metric_set(
      roc_auc, accuracy, sensitivity, specificity, kap),
    control = control_resamples(save_pred = TRUE))

wf1_results %>% collect_metrics(summarize = TRUE)
## # A tibble: 5 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.818    10  0.0138 Preprocessor1_Model1
## 2 kap      binary     0.521    10  0.0405 Preprocessor1_Model1
## 3 roc_auc  binary     0.741    10  0.0204 Preprocessor1_Model1
## 4 sens     binary     0.511    10  0.0392 Preprocessor1_Model1
## 5 spec     binary     0.956    10  0.0116 Preprocessor1_Model1
wf_res_function <-
  function(wf_results, modelname) {
    # append this model's cross-validated metrics
    # (ROC AUC, accuracy, kappa, etc.) to the running All_metrics table
    
    All_metrics %>% bind_rows(
      collect_metrics(wf_results, summarize = TRUE) %>%
      mutate(model = modelname))
  }

All_metrics <- wf_res_function(wf1_results, "Baseline")

Down-sampling test model

This model starts with the baseline and adds down-sampling to balance our response variable. The down-sampling was tested at a few ratios from 1:1 to 1:2; the optimal ratio, presented below, is 1.8, giving roughly 35% N and 65% Y – about a 4-point adjustment in the class balance.

This model has lower accuracy and KAP likely because our dataset is not large enough. We do not use down-sampling in our final model.

# Down-sampling = #3
set.seed(0)

dt_recipe3 <-
  dt_recipe1 %>% 
  themis::step_downsample(Loan_Status, skip = TRUE, under_ratio = 1.8)

table(juice(prep(dt_recipe3, training = Datatrain))$Loan_Status)
## 
##   N   Y 
## 144 259
dt_wf3<-
 dt_wf %>% 
  add_recipe(dt_recipe3)

dt_recipe3 %>% prep()
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         19
## 
## Training data contained 461 data points and no missing data.
## 
## Operations:
## 
## Down-sampling based on Loan_Status [trained]
wf3_results <- dt_wf3 %>% 
  fit_resamples(
    resamples = datacv,  
    metrics = metric_set(
      roc_auc, accuracy, sensitivity, specificity, kap),
    control = control_resamples(save_pred = TRUE))

wf3_results %>% collect_metrics(summarize = TRUE)
## # A tibble: 5 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.796    10  0.0164 Preprocessor1_Model1
## 2 kap      binary     0.482    10  0.0418 Preprocessor1_Model1
## 3 roc_auc  binary     0.731    10  0.0244 Preprocessor1_Model1
## 4 sens     binary     0.526    10  0.0384 Preprocessor1_Model1
## 5 spec     binary     0.919    10  0.0204 Preprocessor1_Model1
All_metrics <-
  wf_res_function(wf3_results, "Downsample")

Discretization Testing model

In this model we bin Total_Income. Binning features sometimes allows groupings to better predict our response variable. Binning was tested across various features and numbers of bins, and the feature that improved model accuracy most when binned was the engineered feature Total_Income.

# Binned = #6
dt_recipe6 <-
  dt_recipe3 %>% 
  step_discretize(Total_Income, num_breaks = 8)
  
dt_recipe6 %>% prep
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         19
## 
## Training data contained 461 data points and no missing data.
## 
## Operations:
## 
## Down-sampling based on Loan_Status [trained]
## Dummy variables from Total_Income [trained]
dt_wf6 <-
 dt_wf %>% 
  add_recipe(dt_recipe6)

wf6_results <- dt_wf6 %>% 
  fit_resamples(
    resamples = datacv,
    metrics = metric_set(
      roc_auc, accuracy, sensitivity, specificity, kap),
    control = control_resamples(save_pred = TRUE))

wf6_results %>% collect_metrics(summarize = TRUE)
## # A tibble: 5 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.803    10  0.0130 Preprocessor1_Model1
## 2 kap      binary     0.492    10  0.0339 Preprocessor1_Model1
## 3 roc_auc  binary     0.728    10  0.0216 Preprocessor1_Model1
## 4 sens     binary     0.512    10  0.0369 Preprocessor1_Model1
## 5 spec     binary     0.934    10  0.0205 Preprocessor1_Model1
All_metrics <- wf_res_function(wf6_results, "Binning")

Binning our engineered feature Total_Income increased our model KAP from 52% to 53% and our accuracy from 82.1% to 82.4% so we will retain these steps.

All_metrics %>% 
  filter(.metric == "kap") %>% 
  kable() %>% 
  kable_styling(
    full_width = FALSE, position = "center", bootstrap_options = c("hover"))
.metric .estimator mean n std_err .config model
kap binary 0.5206812 10 0.0405325 Preprocessor1_Model1 Baseline
kap binary 0.4824551 10 0.0417955 Preprocessor1_Model1 Downsample
kap binary 0.4920266 10 0.0339074 Preprocessor1_Model1 Binning

Model Tuning

After optimizing our model based on various transformations across multiple validation datasets, the best model is selected and needs to be tuned. There are various specifications of a decision tree that can affect the performance. The ones that will be tuned in this section are:

  • cost_complexity: the minimum improvement in fit required for a split to be kept; larger values prune the tree more aggressively

  • tree_depth: the maximum depth (number of successive splits) a tree may reach before terminating

  • min_n: the minimum number of data points required at a node for a further split to be attempted

In order to test various values, a grid of these parameters is set up with combinations of candidate values, as shown below.

tree_grid <- grid_regular(
  cost_complexity(), tree_depth(), min_n(), levels = 4)
tree_grid %>% 
  kable %>% 
  kable_styling(
      full_width = FALSE, position="center", bootstrap_options = c("hover")) %>%
  scroll_box(height = "200px")
cost_complexity tree_depth min_n
0e+00 1 2
1e-07 1 2
1e-04 1 2
1e-01 1 2
0e+00 5 2
1e-07 5 2
1e-04 5 2
1e-01 5 2
0e+00 10 2
1e-07 10 2
1e-04 10 2
1e-01 10 2
0e+00 15 2
1e-07 15 2
1e-04 15 2
1e-01 15 2
0e+00 1 14
1e-07 1 14
1e-04 1 14
1e-01 1 14
0e+00 5 14
1e-07 5 14
1e-04 5 14
1e-01 5 14
0e+00 10 14
1e-07 10 14
1e-04 10 14
1e-01 10 14
0e+00 15 14
1e-07 15 14
1e-04 15 14
1e-01 15 14
0e+00 1 27
1e-07 1 27
1e-04 1 27
1e-01 1 27
0e+00 5 27
1e-07 5 27
1e-04 5 27
1e-01 5 27
0e+00 10 27
1e-07 10 27
1e-04 10 27
1e-01 10 27
0e+00 15 27
1e-07 15 27
1e-04 15 27
1e-01 15 27
0e+00 1 40
1e-07 1 40
1e-04 1 40
1e-01 1 40
0e+00 5 40
1e-07 5 40
1e-04 5 40
1e-01 5 40
0e+00 10 40
1e-07 10 40
1e-04 10 40
1e-01 10 40
0e+00 15 40
1e-07 15 40
1e-04 15 40
1e-01 15 40

All of these combinations are tested on the 10-fold cross-validation set, where the results for each combination are averaged across the 10 splits of the data. The results of this tuning exercise are plotted below. KAP is the metric used to select the best tree; the smallest cost_complexity, a tree depth of 5, and a minimum node size of 40 are shown to be optimal.

doParallel::registerDoParallel()

tree_engine <-
  decision_tree(
    mode = "classification",
    cost_complexity = tune(), # min improvement needed at each node
    tree_depth = tune(), # max depth of the tree allowed
    min_n = tune() # min number of datapoints required at a node
  ) %>% 
  set_engine(engine = "rpart")

# Loading cached object for document generation
#
# set.seed(3)
# tree_rs <- tune_grid(
#  object =  tree_engine,
#  preprocessor = dt_recipe6,
#  resamples = datacv,
#  grid = tree_grid,
#  metrics = metric_set(accuracy, kap)
# )
# readr::write_rds(tree_rs, 'output/q2_tree_rs.Rds')
#
tree_rs <- readr::read_rds('output/q2_tree_rs.Rds')

collect_metrics(tree_rs) %>% 
  kable() %>% 
  kable_styling(
    full_width = FALSE, position="center", bootstrap_options = c("hover")) %>% 
  scroll_box(height = "200px")
cost_complexity tree_depth min_n .metric .estimator mean n std_err .config
0e+00 1 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model01
0e+00 1 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model01
1e-07 1 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model02
1e-07 1 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model02
1e-04 1 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model03
1e-04 1 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model03
1e-01 1 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model04
1e-01 1 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model04
0e+00 5 2 accuracy binary 0.7789434 10 0.0173302 Preprocessor1_Model05
0e+00 5 2 kap binary 0.4281378 10 0.0437256 Preprocessor1_Model05
1e-07 5 2 accuracy binary 0.7789434 10 0.0173302 Preprocessor1_Model06
1e-07 5 2 kap binary 0.4281378 10 0.0437256 Preprocessor1_Model06
1e-04 5 2 accuracy binary 0.7789434 10 0.0173302 Preprocessor1_Model07
1e-04 5 2 kap binary 0.4281378 10 0.0437256 Preprocessor1_Model07
1e-01 5 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model08
1e-01 5 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model08
0e+00 10 2 accuracy binary 0.7396618 10 0.0129704 Preprocessor1_Model09
0e+00 10 2 kap binary 0.3670301 10 0.0399304 Preprocessor1_Model09
1e-07 10 2 accuracy binary 0.7396618 10 0.0129704 Preprocessor1_Model10
1e-07 10 2 kap binary 0.3670301 10 0.0399304 Preprocessor1_Model10
1e-04 10 2 accuracy binary 0.7396618 10 0.0129704 Preprocessor1_Model11
1e-04 10 2 kap binary 0.3670301 10 0.0399304 Preprocessor1_Model11
1e-01 10 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model12
1e-01 10 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model12
0e+00 15 2 accuracy binary 0.7025501 10 0.0207084 Preprocessor1_Model13
0e+00 15 2 kap binary 0.3176340 10 0.0538731 Preprocessor1_Model13
1e-07 15 2 accuracy binary 0.7025501 10 0.0207084 Preprocessor1_Model14
1e-07 15 2 kap binary 0.3176340 10 0.0538731 Preprocessor1_Model14
1e-04 15 2 accuracy binary 0.7025501 10 0.0207084 Preprocessor1_Model15
1e-04 15 2 kap binary 0.3176340 10 0.0538731 Preprocessor1_Model15
1e-01 15 2 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model16
1e-01 15 2 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model16
0e+00 1 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model17
0e+00 1 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model17
1e-07 1 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model18
1e-07 1 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model18
1e-04 1 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model19
1e-04 1 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model19
1e-01 1 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model20
1e-01 1 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model20
0e+00 5 14 accuracy binary 0.7764333 10 0.0173564 Preprocessor1_Model21
0e+00 5 14 kap binary 0.4449979 10 0.0447469 Preprocessor1_Model21
1e-07 5 14 accuracy binary 0.7764333 10 0.0173564 Preprocessor1_Model22
1e-07 5 14 kap binary 0.4449979 10 0.0447469 Preprocessor1_Model22
1e-04 5 14 accuracy binary 0.7764333 10 0.0173564 Preprocessor1_Model23
1e-04 5 14 kap binary 0.4449979 10 0.0447469 Preprocessor1_Model23
1e-01 5 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model24
1e-01 5 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model24
0e+00 10 14 accuracy binary 0.7545976 10 0.0213791 Preprocessor1_Model25
0e+00 10 14 kap binary 0.4210184 10 0.0557057 Preprocessor1_Model25
1e-07 10 14 accuracy binary 0.7545976 10 0.0213791 Preprocessor1_Model26
1e-07 10 14 kap binary 0.4210184 10 0.0557057 Preprocessor1_Model26
1e-04 10 14 accuracy binary 0.7545976 10 0.0213791 Preprocessor1_Model27
1e-04 10 14 kap binary 0.4210184 10 0.0557057 Preprocessor1_Model27
1e-01 10 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model28
1e-01 10 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model28
0e+00 15 14 accuracy binary 0.7524699 10 0.0203631 Preprocessor1_Model29
0e+00 15 14 kap binary 0.4193623 10 0.0536246 Preprocessor1_Model29
1e-07 15 14 accuracy binary 0.7524699 10 0.0203631 Preprocessor1_Model30
1e-07 15 14 kap binary 0.4193623 10 0.0536246 Preprocessor1_Model30
1e-04 15 14 accuracy binary 0.7524699 10 0.0203631 Preprocessor1_Model31
1e-04 15 14 kap binary 0.4193623 10 0.0536246 Preprocessor1_Model31
1e-01 15 14 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model32
1e-01 15 14 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model32
0e+00 1 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model33
0e+00 1 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model33
1e-07 1 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model34
1e-07 1 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model34
1e-04 1 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model35
1e-04 1 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model35
1e-01 1 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model36
1e-01 1 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model36
0e+00 5 27 accuracy binary 0.7937363 10 0.0170746 Preprocessor1_Model37
0e+00 5 27 kap binary 0.4800042 10 0.0397048 Preprocessor1_Model37
1e-07 5 27 accuracy binary 0.7937363 10 0.0170746 Preprocessor1_Model38
1e-07 5 27 kap binary 0.4800042 10 0.0397048 Preprocessor1_Model38
1e-04 5 27 accuracy binary 0.7937363 10 0.0170746 Preprocessor1_Model39
1e-04 5 27 kap binary 0.4800042 10 0.0397048 Preprocessor1_Model39
1e-01 5 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model40
1e-01 5 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model40
0e+00 10 27 accuracy binary 0.7697225 10 0.0181954 Preprocessor1_Model41
0e+00 10 27 kap binary 0.4408407 10 0.0444088 Preprocessor1_Model41
1e-07 10 27 accuracy binary 0.7697225 10 0.0181954 Preprocessor1_Model42
1e-07 10 27 kap binary 0.4408407 10 0.0444088 Preprocessor1_Model42
1e-04 10 27 accuracy binary 0.7697225 10 0.0181954 Preprocessor1_Model43
1e-04 10 27 kap binary 0.4408407 10 0.0444088 Preprocessor1_Model43
1e-01 10 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model44
1e-01 10 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model44
0e+00 15 27 accuracy binary 0.7697225 10 0.0181954 Preprocessor1_Model45
0e+00 15 27 kap binary 0.4408407 10 0.0444088 Preprocessor1_Model45
1e-07 15 27 accuracy binary 0.7697225 10 0.0181954 Preprocessor1_Model46
1e-07 15 27 kap binary 0.4408407 10 0.0444088 Preprocessor1_Model46
1e-04 15 27 accuracy binary 0.7697225 10 0.0181954 Preprocessor1_Model47
1e-04 15 27 kap binary 0.4408407 10 0.0444088 Preprocessor1_Model47
1e-01 15 27 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model48
1e-01 15 27 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model48
0e+00 1 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model49
0e+00 1 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model49
1e-07 1 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model50
1e-07 1 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model50
1e-04 1 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model51
1e-04 1 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model51
1e-01 1 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model52
1e-01 1 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model52
0e+00 5 40 accuracy binary 0.8042800 10 0.0153525 Preprocessor1_Model53
0e+00 5 40 kap binary 0.5006576 10 0.0387744 Preprocessor1_Model53
1e-07 5 40 accuracy binary 0.8042800 10 0.0153525 Preprocessor1_Model54
1e-07 5 40 kap binary 0.5006576 10 0.0387744 Preprocessor1_Model54
1e-04 5 40 accuracy binary 0.8042800 10 0.0153525 Preprocessor1_Model55
1e-04 5 40 kap binary 0.5006576 10 0.0387744 Preprocessor1_Model55
1e-01 5 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model56
1e-01 5 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model56
0e+00 10 40 accuracy binary 0.7870233 10 0.0194098 Preprocessor1_Model57
0e+00 10 40 kap binary 0.4705777 10 0.0462998 Preprocessor1_Model57
1e-07 10 40 accuracy binary 0.7870233 10 0.0194098 Preprocessor1_Model58
1e-07 10 40 kap binary 0.4705777 10 0.0462998 Preprocessor1_Model58
1e-04 10 40 accuracy binary 0.7870233 10 0.0194098 Preprocessor1_Model59
1e-04 10 40 kap binary 0.4705777 10 0.0462998 Preprocessor1_Model59
1e-01 10 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model60
1e-01 10 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model60
0e+00 15 40 accuracy binary 0.7870233 10 0.0194098 Preprocessor1_Model61
0e+00 15 40 kap binary 0.4705777 10 0.0462998 Preprocessor1_Model61
1e-07 15 40 accuracy binary 0.7870233 10 0.0194098 Preprocessor1_Model62
1e-07 15 40 kap binary 0.4705777 10 0.0462998 Preprocessor1_Model62
1e-04 15 40 accuracy binary 0.7870233 10 0.0194098 Preprocessor1_Model63
1e-04 15 40 kap binary 0.4705777 10 0.0462998 Preprocessor1_Model63
1e-01 15 40 accuracy binary 0.8177120 10 0.0155097 Preprocessor1_Model64
1e-01 15 40 kap binary 0.4992879 10 0.0484900 Preprocessor1_Model64
# tree_rs %>% autoplot() + theme_light(base_family = "IBMPlexSans")
tree_rs %>% autoplot()

Finalize decision

We can finalize our decision and create our decision tree based on the following specification:

show_best(tree_rs, "kap")
## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          5    40 kap     binary     0.501    10  0.0388
## 2    0.0000001             5    40 kap     binary     0.501    10  0.0388
## 3    0.0001                5    40 kap     binary     0.501    10  0.0388
## 4    0.0000000001          1     2 kap     binary     0.499    10  0.0485
## 5    0.0000001             1     2 kap     binary     0.499    10  0.0485
## # ... with 1 more variable: .config <chr>
final_tree <- finalize_model(tree_engine, select_best(tree_rs, "kap"))

final_tree
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 1e-10
##   tree_depth = 5
##   min_n = 40
## 
## Computational engine: rpart

A visual of our model fitted to the training dataset is shown below. The first value in each tree node indicates the predicted class (Y/N) at that node, the second row shows the class probabilities with N on the left and Y on the right, and the third row shows the percentage of the data contained in that node. The top-most node splits on the most descriptive feature of our classification model, while subsequent nodes split on secondary features. Due to the tree specification (a minimum of 40 data points required at each node), only two main features are used to predict Loan_Status: Credit_History and IncomeLoanRatio.

final_wf <-
  workflow() %>% 
  add_model(final_tree) %>% 
  add_recipe(dt_recipe6)

wf_final_results <- final_wf %>% 
  fit_resamples(
    resamples = datacv, 
    metrics = metric_set(
      roc_auc, accuracy, sensitivity, specificity, kap),
    control = control_resamples(save_pred = TRUE))


All_metrics <- wf_res_function(wf_final_results, "wf_final")

dtmodel <- final_wf %>% fit(data = Datatrain)

rattle::fancyRpartPlot(dtmodel$fit$fit$fit,palettes="RdPu")

Variable Importance Plot

Additional information about variable importance was extracted from the model and is shown in the variable importance plot below, where Credit_History has an overwhelming weight of importance, followed by our engineered feature IncomeLoanRatio.

mod <- dtmodel$fit$fit$fit

# Variable importance plot for the fitted rpart model
caret::varImp(mod) %>% 
  mutate(Feature = rownames(.)) %>% 
  ggplot(
    mapping = aes(x = fct_reorder(Feature, Overall), y = Overall)) + 
  geom_col(fill = "skyblue3") +
  coord_flip() +
  defaulttheme +
  labs(
    x = "Feature",
    y = "Importance",
    title = "Variable Importance")

training_predictions <- predict(
  dtmodel, 
  new_data = Datatrain, 
  type = "class") %>% 
  bind_cols(Datatrain$Loan_Status) %>% 
  rename("Loan_Status" = "...2") %>% 
  mutate(
    set = "Training",
    Loan_Status = factor(Loan_Status))

Confusion Matrix, Accuracy

The confusion matrix and accuracy of our training set are shown below, with an accuracy of 82.9% and a kappa of 0.55. The confusion matrix shows that the model is not very good at predicting the minority class, i.e. when an individual will not be approved for the loan.

training_predictions %>% conf_mat(truth = Loan_Status, estimate=.pred_class)
##           Truth
## Prediction   N   Y
##          N  75  10
##          Y  69 307
metrics(
  training_predictions, 
  truth = Loan_Status,
  estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.829
## 2 kap      binary         0.551

Similarly, the confusion matrix and accuracy of our testing set are shown below, with an accuracy of 80.4% and a kappa of 0.49. As with the training data, the confusion matrix shows that the model is not very good at predicting the minority class of when an individual will not be approved for the loan.

testing_predictions <- predict(
  dtmodel,
  new_data = Datatest, 
  type = "class") %>% 
  bind_cols(Datatest$Loan_Status) %>% 
  rename("Loan_Status"= "...2") %>% 
  mutate(
    set = "Testing",
    Loan_Status = factor(Loan_Status))

testing_predictions %>% conf_mat(truth = Loan_Status, estimate=.pred_class)
##           Truth
## Prediction   N   Y
##          N  23   5
##          Y  25 100
metrics(testing_predictions, truth = Loan_Status,estimate = .pred_class)
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.804
## 2 kap      binary         0.487

(Q3) Random Forest

We used the training and test datasets we created above and the randomForest package for this section of the homework. The randomForest package requires that we make a few additional changes:

  • column names in our “dummy value” objects must be valid R variable names
  • the function will attempt regression if our response variable is numeric, so we change it back to being a factor

loans_train <- readr::read_rds('data/loans_train.Rds')
loans_test <- readr::read_rds('data/loans_test.Rds')
loans_dv_train <- readr::read_rds('data/loans_dv_train.Rds') %>% 
  rename(
    Dependents.3 = `Dependents.3+`,
    Education.NotGraduate = `Education.Not Graduate`) %>% 
  mutate(Loan_Status = factor(ifelse(Loan_Status.Y, 'Y', 'N'))) %>% 
  select(-Loan_Status.N, -Loan_Status.Y)
  
loans_dv_test <- readr::read_rds('data/loans_dv_test.Rds') %>% 
  rename(
    Dependents.3 = `Dependents.3+`,
    Education.NotGraduate = `Education.Not Graduate`) %>% 
  mutate(Loan_Status = factor(ifelse(Loan_Status.Y, 'Y', 'N'))) %>% 
  select(-Loan_Status.N, -Loan_Status.Y)

The random forest algorithm works well with all sorts of data: numeric and categorical, un-scaled and scaled, full rank and highly correlated. So we should get similar results whether we use the loans_train object (which keeps each factor variable in a single column) or the loans_dv_train object (which splits factor variables into dummy columns, including columns for base levels).

Default model 1 (which has factor variables in single columns)

set.seed(521)
m1 <- randomForest(Loan_Status ~ ., data = loans_train)
print(m1)
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = loans_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.09%
## Confusion matrix:
##    N   Y class.error
## N 74  70  0.48611111
## Y 18 299  0.05678233

Default model 2 (which splits factor variables into dummy columns)

set.seed(521)
m1_dv <- randomForest(Loan_Status ~ ., data = loans_dv_train)
print(m1_dv)
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = loans_dv_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 19.74%
## Confusion matrix:
##    N   Y class.error
## N 69  75  0.52083333
## Y 16 301  0.05047319

We ran the random forests for both default models and were not satisfied with the results: OOB error rates of 19.09% and 19.74%, respectively. We will see later, when we run the confusion matrix on the predictions, that Model 1 has an accuracy of 83.0%, a Kappa of 0.566, and a Sensitivity of 0.563. Moreover, the error plot for a random forest typically shows curves sloping downward and flattening as trees are added; here the three curves (OOB error and the two class errors) are almost parallel to each other. Considering the similarities between the default models, we decided to keep the m1 model as Model 1 for comparisons with the other Random Forest models in this section.

plot(m1)

(Q3) Random Forest - Model based on Model 1 by Tuning Number of Trees (Model 2)

Features Highlight

The error rate of the previous model may be unacceptable. The previous model used 500 trees. In the next execution of Random Forest, we try to see if we can reduce the error rate by manipulating the number of trees used. The other indices we used to determine the optimal model for this exercise are Accuracy, Error Rate, Kappa, and Sensitivity. We ran Model 2 multiple times using the ntree parameter with values of 100, 200, 300, 400, 425, 450, and 475 random trees.

ntree Accuracy ErrorRate Kappa Sensitivity
100 81.70% 19.96% 0.533 0.542
200 81.70% 20.39% 0.533 0.542
300 81.05% 20.17% 0.513 0.521
400 82.35% 19.75% 0.547 0.542
425 83.00% 19.52% 0.566 0.563
450 83.00% 19.96% 0.566 0.563
475 83.00% 19.74% 0.566 0.563

The table summarizes the values derived from running Model 2. Based on this exercise, we determined that the optimal Model 2 is the one with ntree = 425 because of the combination of Accuracy, Error Rate, Kappa, and Sensitivity: taking all indices together, ntree = 425 has the highest Accuracy, lowest Error Rate, highest Kappa, and highest Sensitivity. Moreover, the optimal Model 2 with ntree = 425 is slightly better than Model 1 in all indices with the exception of Error Rate.
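
The exact sweep code was not included in this write-up; the following is a sketch of how such a table could be produced, assuming the loans_train/loans_test objects and the randomForest and caret packages used throughout this section:

# Sketch of the ntree sweep summarized above (assumed approach)
ntree_candidates <- c(100, 200, 300, 400, 425, 450, 475)
ntree_results <- do.call(rbind, lapply(ntree_candidates, function(nt) {
  set.seed(121)
  fit <- randomForest(Loan_Status ~ ., data = loans_train, ntree = nt)
  cm  <- caret::confusionMatrix(predict(fit, newdata = loans_test), loans_test$Loan_Status)
  data.frame(
    ntree       = nt,
    Accuracy    = unname(cm$overall['Accuracy']),
    ErrorRate   = fit$err.rate[nt, 'OOB'],   # OOB error after nt trees
    Kappa       = unname(cm$overall['Kappa']),
    Sensitivity = unname(cm$byClass['Sensitivity']))
}))
ntree_results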

set.seed(121)
ntreecand = 425
m2 <- randomForest(Loan_Status ~ ., data = loans_train, ntree = ntreecand)
print(m2)
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = loans_train, ntree = ntreecand) 
##                Type of random forest: classification
##                      Number of trees: 425
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.74%
## Confusion matrix:
##    N   Y class.error
## N 73  71  0.49305556
## Y 20 297  0.06309148
plot(m2)

(Q3) Random Forest - Model based on Optimal Model 2 by Tuning Number of Variables (Model 3)

Execution of Random Forest

Model 2 with ntree = 425 may be the optimal solution up to this point. However, we wanted to investigate further whether the number of variables tried at each split (mtry) would help increase the accuracy of the model. In the next execution of Random Forest, we try to see if we can reduce the error rate while increasing accuracy by manipulating the number of variables used. The other indices we used to determine the optimal model for this exercise are Accuracy, Error Rate, Kappa, and Sensitivity. We ran Model 3 multiple times using the mtry parameter with values of 2, 3, 4, 5, 6, and 7 variables.

mtry Accuracy ErrorRate Kappa Sensitivity
2 80.39% 19.09% 0.487 0.479
3 83.00% 19.52% 0.566 0.563
4 80.39% 19.96% 0.512 0.563
5 81.05% 19.96% 0.511 0.563
6 80.39% 19.74% 0.525 0.563
7 79.74% 19.96% 0.515 0.563

The table summarizes the values derived from running Model 3. Based on this exercise, we determined that the optimal Model 3 is the one with mtry = 3 because of the combination of Accuracy, Error Rate, Kappa, and Sensitivity: taking all indices together, mtry = 3 has the highest Accuracy, among the lowest Error Rates, the highest Kappa, and the highest Sensitivity. Moreover, the optimal Model 3 with mtry = 3 is slightly better than Model 1 in all indices with the exception of Error Rate. Coincidentally, the default mtry for Model 2 is 3: the same as the optimal version of Model 3.
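
As with the ntree sweep, the exact loop was not shown; a brief sketch under the same assumptions:

# Sketch of the mtry sweep at ntree = 425 (assumed approach)
mtry_results <- do.call(rbind, lapply(2:7, function(m) {
  set.seed(121)
  fit <- randomForest(Loan_Status ~ ., data = loans_train, ntree = 425, mtry = m)
  cm  <- caret::confusionMatrix(predict(fit, newdata = loans_test), loans_test$Loan_Status)
  data.frame(mtry = m,
             Accuracy = unname(cm$overall['Accuracy']),
             Kappa    = unname(cm$overall['Kappa']))
}))
mtry_results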

set.seed(121)
mtrycand = 3
m3 <- randomForest(Loan_Status ~ ., data = loans_train, ntree = ntreecand, mtry = mtrycand)
print(m3)
## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = loans_train, ntree = ntreecand,      mtry = mtrycand) 
##                Type of random forest: classification
##                      Number of trees: 425
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.74%
## Confusion matrix:
##    N   Y class.error
## N 73  71  0.49305556
## Y 20 297  0.06309148
plot(m3)

(Q3) Random Forest - Model Comparison

In the final step of this exercise, we made predictions with all three random forest models and developed the confusion matrix for each. We extracted important values from these models and compare and discuss them below.

Model 1

prediction1 <- predict(m1,newdata = loans_test)
prediction1.cm <- confusionMatrix(prediction1, loans_test$Loan_Status) 
prediction1.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  27   5
##          Y  21 100
##                                          
##                Accuracy : 0.8301         
##                  95% CI : (0.761, 0.8859)
##     No Information Rate : 0.6863         
##     P-Value [Acc > NIR] : 4.049e-05      
##                                          
##                   Kappa : 0.5661         
##                                          
##  Mcnemar's Test P-Value : 0.003264       
##                                          
##             Sensitivity : 0.5625         
##             Specificity : 0.9524         
##          Pos Pred Value : 0.8437         
##          Neg Pred Value : 0.8264         
##              Prevalence : 0.3137         
##          Detection Rate : 0.1765         
##    Detection Prevalence : 0.2092         
##       Balanced Accuracy : 0.7574         
##                                          
##        'Positive' Class : N              
## 

Model 2

prediction2 <- predict(m2,newdata = loans_test)
prediction2.cm <- confusionMatrix(prediction2, loans_test$Loan_Status) 
prediction2.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  26   5
##          Y  22 100
##                                           
##                Accuracy : 0.8235          
##                  95% CI : (0.7537, 0.8804)
##     No Information Rate : 0.6863          
##     P-Value [Acc > NIR] : 9.014e-05       
##                                           
##                   Kappa : 0.5466          
##                                           
##  Mcnemar's Test P-Value : 0.002076        
##                                           
##             Sensitivity : 0.5417          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.8197          
##              Prevalence : 0.3137          
##          Detection Rate : 0.1699          
##    Detection Prevalence : 0.2026          
##       Balanced Accuracy : 0.7470          
##                                           
##        'Positive' Class : N               
## 

Model 3

prediction3 <- predict(m3,newdata = loans_test)
prediction3.cm <- confusionMatrix(prediction3, loans_test$Loan_Status) 
prediction3.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  26   5
##          Y  22 100
##                                           
##                Accuracy : 0.8235          
##                  95% CI : (0.7537, 0.8804)
##     No Information Rate : 0.6863          
##     P-Value [Acc > NIR] : 9.014e-05       
##                                           
##                   Kappa : 0.5466          
##                                           
##  Mcnemar's Test P-Value : 0.002076        
##                                           
##             Sensitivity : 0.5417          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.8197          
##              Prevalence : 0.3137          
##          Detection Rate : 0.1699          
##    Detection Prevalence : 0.2026          
##       Balanced Accuracy : 0.7470          
##                                           
##        'Positive' Class : N               
## 

Final Comparison

  • Random Forest Model 2 has the highest accuracy at 83.00% and F1-Score at 88.5%. Despite the high error rate indicated earlier, F1-Score and Accuracy seemed to indicate that this Random Forest is an optimal model.

  • By visual inspection of the confusion matrices, all three models have P-Values [Acc > NIR] well below 0.05: 4.049e-05 for Model 1 and 9.014e-05 for Models 2 and 3, indicating sufficient evidence that each is a viable model. Model 3’s p-value is the same as Model 2’s because the two models turn out to be identical, with mtry = 3 and ntree = 425. The very low p-value indicates that there is sufficient evidence that Random Forest Model 2 is a viable model for further consideration.

  • Model 2 (m2) is the optimal Random Forest model in this exercise based on its favorable Accuracy, Kappa, F1-Score, and Sensitivity.

# Model 1 Values
prediction1.accuracy <- prediction1.cm$overall['Accuracy']
prediction1.kappa <- prediction1.cm$overall['Kappa']
prediction1.sensitivity <- prediction1.cm$byClass['Sensitivity']
prediction1.TN <- prediction1.cm$table[1,1]
prediction1.FP <- prediction1.cm$table[1,2]
prediction1.FN <- prediction1.cm$table[2,1]
prediction1.TP <- prediction1.cm$table[2,2]
prediction1.TPR <- prediction1.TP /(prediction1.TP + prediction1.FN)
prediction1.TNR <- prediction1.TN /(prediction1.TN + prediction1.FP)
prediction1.FPR <- prediction1.FP /(prediction1.TN + prediction1.FP)
prediction1.FNR <- prediction1.FN /(prediction1.TP + prediction1.FN)
prediction1.precision <- prediction1.TP / (prediction1.TP + prediction1.FP)
prediction1.recall <- prediction1.TP / (prediction1.TP + prediction1.FN)
prediction1.specificity <- prediction1.TN / (prediction1.TN + prediction1.FP)
prediction1.f1score <- 2 * ((prediction1.precision * prediction1.recall) / (prediction1.precision + prediction1.recall))
# Model 2 Values
prediction2.accuracy <- prediction2.cm$overall['Accuracy']
prediction2.kappa <- prediction2.cm$overall['Kappa']
prediction2.sensitivity <- prediction2.cm$byClass['Sensitivity']
prediction2.TN <- prediction2.cm$table[1,1]
prediction2.FP <- prediction2.cm$table[1,2]
prediction2.FN <- prediction2.cm$table[2,1]
prediction2.TP <- prediction2.cm$table[2,2]
prediction2.TPR <- prediction2.TP /(prediction2.TP + prediction2.FN)
prediction2.TNR <- prediction2.TN /(prediction2.TN + prediction2.FP)
prediction2.FPR <- prediction2.FP /(prediction2.TN + prediction2.FP)
prediction2.FNR <- prediction2.FN /(prediction2.TP + prediction2.FN)
prediction2.precision <- prediction2.TP / (prediction2.TP + prediction2.FP)
prediction2.recall <- prediction2.TP / (prediction2.TP + prediction2.FN)
prediction2.specificity <- prediction2.TN / (prediction2.TN + prediction2.FP)
prediction2.f1score <- 2 * ((prediction2.precision * prediction2.recall) / (prediction2.precision + prediction2.recall))
# Model 3 Values
prediction3.accuracy <- prediction3.cm$overall['Accuracy']
prediction3.kappa <- prediction3.cm$overall['Kappa']
prediction3.sensitivity <- prediction3.cm$byClass['Sensitivity']
prediction3.TN <- prediction3.cm$table[1,1]
prediction3.FP <- prediction3.cm$table[1,2]
prediction3.FN <- prediction3.cm$table[2,1]
prediction3.TP <- prediction3.cm$table[2,2]
prediction3.TPR <- prediction3.TP /(prediction3.TP + prediction3.FN)
prediction3.TNR <- prediction3.TN /(prediction3.TN + prediction3.FP)
prediction3.FPR <- prediction3.FP /(prediction3.TN + prediction3.FP)
prediction3.FNR <- prediction3.FN /(prediction3.TP + prediction3.FN)
prediction3.precision <- prediction3.TP / (prediction3.TP + prediction3.FP)
prediction3.recall <- prediction3.TP / (prediction3.TP + prediction3.FN)
prediction3.specificity <- prediction3.TN / (prediction3.TN + prediction3.FP)
prediction3.f1score <- 2 * ((prediction3.precision * prediction3.recall) / (prediction3.precision + prediction3.recall))
Model Accuracy Kappa Sensitivity Recall Specificity Precision F1Score
Model 1 0.8300654 0.5660995 0.5625000 0.8264463 0.8437500 0.952381 0.8849558
Model 2 0.8235294 0.5465920 0.5416667 0.8196721 0.8387097 0.952381 0.8810573
Model 3 0.8235294 0.5465920 0.5416667 0.8196721 0.8387097 0.952381 0.8810573
readr::write_rds(m2, 'output/rf-mod2-model.Rds')
readr::write_rds(prediction2, 'output/rf-mod2-predictions.Rds')

(Q4) Gradient Boosting

We are using the xgboost package to answer this question. The xgboost package is incredibly powerful and flexible. You can learn more about it and the underlying algorithms here:

We followed the general steps in the tutorial to answer this question as it is our first time using the xgboost package.

  1. Prepare the data
  2. Run a sample model
  3. Tune parameters to select a model with high accuracy and low chance of over-fitting

Prepare the data

xgboost requires data to be in a particular format. The following code shows how to start with the output from the data prep section above and obtain the desired format for the algorithm.

# Load Rds files from data prep section
loans_train     <- readr::read_rds('data/loans_train.Rds')
loans_test      <- readr::read_rds('data/loans_test.Rds')
loans_dv_train  <- readr::read_rds('data/loans_dv_train.Rds')
loans_dv_test   <- readr::read_rds('data/loans_dv_test.Rds')

# xgboost wants predictors and response in separate objects
train_labels <- loans_dv_train$Loan_Status.Y
test_labels <- loans_dv_test$Loan_Status.Y

# We will use this to tell the function to balance the Y/N values
negative_cases <- sum(train_labels == 0)
postive_cases <- sum(train_labels == 1)

# xgboost can use a special xgb.DMatrix object
dtrain <- loans_dv_train %>% 
  select(-Loan_Status.N, -Loan_Status.Y) %>% 
  as.matrix %>% 
  xgb.DMatrix(
    data = .,
    label = loans_dv_train$Loan_Status.Y %>% as.matrix)

# create the same for test data for the predict function
dtest <- loans_dv_test %>% 
  select(-Loan_Status.N, -Loan_Status.Y) %>% 
  as.matrix %>% 
  xgb.DMatrix(
    data = .,
    label = loans_dv_test$Loan_Status.Y %>% as.matrix)

Create a single pre-tuning model

It is always good when using a new method to start with a simple model to make sure everything is working properly.

The following function call includes the smallest number of non-default parameters. (In reality, scale_pos_weight is not required, but our labels are slightly imbalanced, so we treat it as required here.)

xgb1 <- xgboost(
  data = dtrain,
  objective = 'binary:logistic',
  eval_metric = 'error',
  scale_pos_weight = negative_cases/postive_cases,
  nrounds = 10)
## [1]  train-error:0.184382 
## [2]  train-error:0.154013 
## [3]  train-error:0.134490 
## [4]  train-error:0.121475 
## [5]  train-error:0.117137 
## [6]  train-error:0.121475 
## [7]  train-error:0.114967 
## [8]  train-error:0.106291 
## [9]  train-error:0.106291 
## [10] train-error:0.106291

The training error will always decrease if we increase the number of iterations. However, the change in error tends to zero. The important question is what does the test error look like with each successive tree?

# Some helper functions
test_error <- function(model, n) {
  pred <- predict(model, newdata = dtest, ntreelimit = n)
  mean(as.numeric(pred > 0.5) != test_labels)
}

test_errors <- function(model) {
  N <- model$niter
  errors <- numeric(N)
  for (i in 1:N) {
    errors[i] <- test_error(model, i)
  }
  return(errors)
}

compare_errors <- function(model) {
  df <- cbind(
    model$evaluation_log %>% as.data.frame(),
    test_errors(model))
  names(df) <- c('iter', 'train_error', 'test_error')
  return(df)
}

# xgb1 errors
compare_errors(xgb1)
##    iter train_error test_error
## 1     1    0.184382  0.2091503
## 2     2    0.154013  0.2026144
## 3     3    0.134490  0.2287582
## 4     4    0.121475  0.2156863
## 5     5    0.117137  0.2222222
## 6     6    0.121475  0.2222222
## 7     7    0.114967  0.2287582
## 8     8    0.106291  0.2222222
## 9     9    0.106291  0.2287582
## 10   10    0.106291  0.2222222

Even at iteration #1 we see evidence of over-fitting, and the test error increases on average as we add more trees. This is most likely because xgboost's default maximum tree depth is six. Our selected decision tree model had a depth of only two, so over-fitting after just a single boosting round is plausible.
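
As a quick check on this hypothesis (not part of our original analysis), we can refit the same model with a shallower maximum depth, closer to our decision tree, and compare errors with the helpers above:

# Hypothetical refit: same settings as xgb1 but with max_depth = 2
xgb_shallow <- xgboost(
  data = dtrain,
  objective = 'binary:logistic',
  eval_metric = 'error',
  scale_pos_weight = negative_cases/positive_cases,
  nrounds = 10,
  max_depth = 2,
  verbose = 0)

compare_errors(xgb_shallow)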

Parameter tuning

There are dozens of parameters one can pass to the xgboost function, and we will not attempt to address them all here. The most important ones for our first foray into xgboost are probably the following:

  • eta is the scaling factor applied to each tree's contribution (default 0.3)
  • max_depth controls the maximum depth of each tree (default 6)
  • nrounds is the number of boosting iterations to run
  • early_stopping_rounds controls whether to stop before reaching nrounds
  • gamma is described best by the help page: “minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.”

Through experimentation not included in this write-up, we found that the small size of our data set means we do not need many boosting rounds to hit a plateau in the training accuracy. Therefore, gamma is probably not something we need to worry about. (gamma will help prevent too many iterations that have little overall impact.) The same argument applies to early_stopping_rounds.

The two arguments that make the most sense for us to tune are the following; a sketch of the kind of grid search we used appears after the list.

  • max_depth and
  • nrounds
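
That experimentation is not reproduced in this write-up, but a minimal sketch of the kind of grid search we used (reusing the test_errors() helper defined above and assuming the tidyverse is loaded) looks roughly like this:

# Sketch only: for each candidate max_depth, fit once with the maximum number
# of rounds, then score every intermediate round on the test set
grid_results <- purrr::map_dfr(1:5, function(depth) {
  fit <- xgboost(
    data = dtrain,
    objective = 'binary:logistic',
    eval_metric = 'error',
    scale_pos_weight = negative_cases/positive_cases,
    nrounds = 20,
    max_depth = depth,
    verbose = 0)
  data.frame(
    max_depth  = depth,
    nrounds    = 1:20,
    test_error = test_errors(fit))
})

# Combinations with the lowest test error
grid_results %>% arrange(test_error) %>% head()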

Selected parameters

The final model we selected is

xgb_final <- xgboost(
  data = dtrain,
  objective = 'binary:logistic',
  eval_metric = 'error',
  scale_pos_weight = negative_cases/positive_cases,
  nrounds = 17,
  max_depth = 3,
  verbose = 0)

Seventeen may seem like a lot of iterations, but remember that each successive tree's contribution is scaled by the learning rate eta (0.3 by default). The first tree carries the most weight by far.
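
As a rough illustration of our own (not part of the original tuning), we can compare the predictions made with only the first n trees against the full 17-tree predictions; the gap shrinks quickly because each tree's output is damped by eta:

# Average absolute gap between partial-ensemble and full-ensemble predictions
p_full <- predict(xgb_final, newdata = dtest)
sapply(c(1, 5, 10, 17), function(n) {
  p_n <- predict(xgb_final, newdata = dtest, ntreelimit = n)
  mean(abs(p_n - p_full))
})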

Interpretation

Boosted trees are not as easy to interpret as single trees, but some tools still help, like a variable importance plot:

importance_matrix <- xgb.importance(model = xgb_final)
xgb.plot.importance(importance_matrix)

The rank of importance agrees with our single-tree analysis, as we’d expect.
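
For readers who want to dig deeper, xgboost can also render the individual boosted trees (this requires the DiagrammeR package and was not part of our analysis); trees are 0-indexed:

# Optional: visualize the first two boosted trees of the final model
xgb.plot.tree(model = xgb_final, trees = 0:1)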

readr::write_rds(xgb_final, 'output/xgb-final-model.Rds')
pred_final <- predict(xgb_final, newdata = dtest)
readr::write_rds(pred_final, 'output/xgb-final-predictions.Rds')

(Q5) Model Performance and Selection

Load RDS

# load rds into session
rds_files = c(paste0("output/", list.files("output")), 
              paste0("data/", list.files("data"))) %>%
    grep(pattern = "\\.rds$", ., ignore.case = TRUE, value = TRUE)

# build object names from the file names
# (drop the directory and extension, turn dashes into underscores)
var_names = gsub(pattern = "(output/|data/|\\.[Rr]ds)", replacement = "", x = rds_files) %>%
    gsub(pattern = "-", replacement = "_", .)

# read each file and assign it to the matching name in the global environment
invisible(sapply(seq_along(rds_files), function(i) assign(var_names[i], readRDS(rds_files[i]), envir = .GlobalEnv)))

Tidy Models Output

### look at prediction result

# reorder level 
loans_test_target <- relevel(loans_test$Loan_Status, ref = "Y")

# decision tree
dt_testing_predictions = relevel(dt_testing_Predictions$.pred_class, ref = "Y")
dt_confusionMatrix = caret::confusionMatrix(dt_testing_predictions, loans_test_target)
dt_confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   Y   N
##          Y 100  25
##          N   5  23
##                                           
##                Accuracy : 0.8039          
##                  95% CI : (0.7321, 0.8636)
##     No Information Rate : 0.6863          
##     P-Value [Acc > NIR] : 0.0007737       
##                                           
##                   Kappa : 0.4866          
##                                           
##  Mcnemar's Test P-Value : 0.0005226       
##                                           
##             Sensitivity : 0.9524          
##             Specificity : 0.4792          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.8214          
##              Prevalence : 0.6863          
##          Detection Rate : 0.6536          
##    Detection Prevalence : 0.8170          
##       Balanced Accuracy : 0.7158          
##                                           
##        'Positive' Class : Y               
## 
# random forest
rf_mod2_predictions = relevel(rf_mod2_predictions, ref = "Y")
rf_confusionMatrix = caret::confusionMatrix(rf_mod2_predictions, loans_test_target)
rf_confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Y  N
##          Y 98 21
##          N  7 27
##                                           
##                Accuracy : 0.817           
##                  95% CI : (0.7465, 0.8748)
##     No Information Rate : 0.6863          
##     P-Value [Acc > NIR] : 0.0001923       
##                                           
##                   Kappa : 0.5385          
##                                           
##  Mcnemar's Test P-Value : 0.0140193       
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.5625          
##          Pos Pred Value : 0.8235          
##          Neg Pred Value : 0.7941          
##              Prevalence : 0.6863          
##          Detection Rate : 0.6405          
##    Detection Prevalence : 0.7778          
##       Balanced Accuracy : 0.7479          
##                                           
##        'Positive' Class : Y               
## 
# xgb boost
xgb_pred = ifelse(xgb_final_predictions > .5, 'Y', 'N') %>% 
    factor(., levels = c("Y", "N"), labels = c("Y", "N"))
xgb_confusionMatrix = caret::confusionMatrix(xgb_pred, loans_test_target)
xgb_confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Y  N
##          Y 97 19
##          N  8 29
##                                           
##                Accuracy : 0.8235          
##                  95% CI : (0.7537, 0.8804)
##     No Information Rate : 0.6863          
##     P-Value [Acc > NIR] : 9.014e-05       
##                                           
##                   Kappa : 0.563           
##                                           
##  Mcnemar's Test P-Value : 0.05429         
##                                           
##             Sensitivity : 0.9238          
##             Specificity : 0.6042          
##          Pos Pred Value : 0.8362          
##          Neg Pred Value : 0.7838          
##              Prevalence : 0.6863          
##          Detection Rate : 0.6340          
##    Detection Prevalence : 0.7582          
##       Balanced Accuracy : 0.7640          
##                                           
##        'Positive' Class : Y               
## 
# tidy output
dt_stat = broom::tidy(dt_confusionMatrix)
rf_stat = broom::tidy(rf_confusionMatrix)
xgb_stat = broom::tidy(xgb_confusionMatrix)

model_stat_tidy = dplyr::inner_join(dt_stat %>% dplyr::select(term, `decision tree` = estimate),
                                    rf_stat %>% dplyr::select(term, `random forest` = estimate), 
                                    by = "term") %>%
    dplyr::inner_join(., xgb_stat %>% dplyr::select(term, `gradient boosting` = estimate),
                      by = "term")

model_stat_tidy %>% dplyr::filter(term != "mcnemar")
## # A tibble: 13 x 4
##    term                 `decision tree` `random forest` `gradient boosting`
##    <chr>                          <dbl>           <dbl>               <dbl>
##  1 accuracy                       0.804           0.817               0.824
##  2 kappa                          0.487           0.538               0.563
##  3 sensitivity                    0.952           0.933               0.924
##  4 specificity                    0.479           0.562               0.604
##  5 pos_pred_value                 0.8             0.824               0.836
##  6 neg_pred_value                 0.821           0.794               0.784
##  7 precision                      0.8             0.824               0.836
##  8 recall                         0.952           0.933               0.924
##  9 f1                             0.870           0.875               0.878
## 10 prevalence                     0.686           0.686               0.686
## 11 detection_rate                 0.654           0.641               0.634
## 12 detection_prevalence           0.817           0.778               0.758
## 13 balanced_accuracy              0.716           0.748               0.764

There were six decision tree models, built in sequential order so that each was an improvement on its predecessor. Accuracy improved gradually as additional information was added to the model: we generated a “Total_Income” variable as “ApplicantIncome” + “CoapplicantIncome”, then created “IncomeLoanRatio” as Total_Income / LoanAmount, and finally discretized “Total_Income” into 6 bins (the feature engineering is sketched just below). We also tried removing features with low variance, and in one model we tried undersampling as an attempt to tackle the imbalanced data. We tuned the best (sixth) model, focusing on cost complexity, tree depth, and min n (the minimum number of data points needed at a node for a split to occur). The results suggested that only two strong features predict loan status (see the variable importance plot above): “Credit_History” and “IncomeLoanRatio”. After many iterations and tuning exercises, the decision tree algorithm achieved 80.4% accuracy and a Kappa of 48.7%.
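
The feature engineering itself lives in the earlier decision tree and data prep sections; a minimal sketch of it, assuming a raw loans data frame (here called loans_raw as a placeholder) with the column names used above, would look like:

# Sketch only: loans_raw is a hypothetical placeholder for the raw loans data
loans_fe <- loans_raw %>%
  mutate(
    Total_Income     = ApplicantIncome + CoapplicantIncome,
    IncomeLoanRatio  = Total_Income / LoanAmount,
    Total_Income_Bin = cut(Total_Income, breaks = 6))  # discretize into 6 bins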

For random forest, we built three models for comparison, varying the number of variables considered at each split (mtry = 2, 3, 4, 5, 6, and 7) and the number of trees grown (ntree = 100, 200, 300, 400, 425, 450, and 475); a sketch of this kind of tuning loop follows. The results suggested that the optimal parameters for this data set, judged on accuracy, error rate, Kappa, and sensitivity, were mtry = 3 and ntree = 425. The best model achieved 81.7% accuracy and a Kappa of 53.8%.
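
The actual tuning code is in the random forest section; a hypothetical sketch of the loop (parameter names follow the randomForest package, and loans_train / loans_test are the prepared data sets loaded above) is:

# Sketch only: score every (mtry, ntree) combination on the held-out test set
library(randomForest)

grid <- expand.grid(
  mtry  = 2:7,
  ntree = c(100, 200, 300, 400, 425, 450, 475))

rf_grid <- purrr::pmap_dfr(grid, function(mtry, ntree) {
  fit  <- randomForest(Loan_Status ~ ., data = loans_train,
                       mtry = mtry, ntree = ntree)
  pred <- predict(fit, newdata = loans_test)
  data.frame(mtry, ntree,
             accuracy = mean(pred == loans_test$Loan_Status))
})

# Best-performing combinations
rf_grid %>% arrange(desc(accuracy)) %>% head()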

For gradient boosting, we focused on tuning two parameters, max_depth and nrounds, to achieve the best accuracy. Using a for loop over every combination of max_depth (1 to 5) and nrounds (1 to 20) as a grid search, we looked for the combination with the lowest test error; the low point came at max_depth = 3 and iteration 17. The final model points to the same conclusion as the decision tree, namely the importance of “Credit_History” and “IncomeLoanRatio”. The gradient boosting model achieved the best performance among the three algorithms, with 82.4% accuracy and a Kappa of 56.3%.

The gradient boosting model was clearly the winner among the three based on various metrics. However, the decision tree generally does better in terms of interpretability and (visual) communication with stakeholders. The decision tree algorithm works easily with both numerical and categorical features, requires very little data preprocessing, and makes no assumptions about the shape of the data (it is non-parametric and so suits many kinds of data). It also happens to be less affected by multicollinearity than many other classifiers. However, it tends to overfit, and if the data is not well balanced it can produce significant bias (as with the data set in this assignment; we tried to resolve this by undersampling, with no significant improvement). When a data set becomes more complex, with a large number of features, random forest begins to shine and becomes the better alternative.

Random forest also works easily with numerical and categorical features and needs little data preprocessing. It implicitly performs feature selection and grows relatively uncorrelated decision trees by choosing a random subset of features for each tree, so it is less susceptible to outliers and handles both linear and non-linear relationships well. Essentially, random forest averages the results of the many decision trees it builds (each on a random subset of the input variables), so it generally provides high accuracy and balances the bias-variance trade-off well. The downside is that random forest is less interpretable than a single decision tree: it is not as straightforward to communicate the output (such as a specific set of important features) to stakeholders, and it can be computationally intensive for large data sets.

Like random forest, gradient boosting behaves as a black-box algorithm, but unlike random forest its models are built sequentially, with each later model correcting the preceding predictors to arrive at a better prediction. Gradient boosting fits each new predictor to the residual errors of the predictors before it, so by combining one weak learner with the next the error is reduced significantly over time. With sensible settings (shrinkage and limited tree depth) the algorithm resists over-fitting reasonably well, and its variable importance output makes the results relatively easy to communicate. One disadvantage is that it can be sensitive to outliers, since every new learner attempts to fix the residual errors of the previous step. Another is scalability: because each model depends on the ones before it, the procedure is difficult to parallelize and can be challenging for large, complex data sets.

Each of the machine learning algorithms discussed above has different pros and cons. For this assignment and this small data set, gradient boosting is the preferable algorithm: it is the best classifier in this exercise as measured by accuracy, Kappa, F1 score, precision, and so on. For better results (such as higher accuracy and less over-fitting), future attempts could focus more on feature engineering. For example, while “Credit_History” is obviously important, the second most important feature in our model, “IncomeLoanRatio”, is one we engineered ourselves; it holds strong predictive power and is also highly intuitive.

Penguins data prep

library(tidyverse)
library(corrplot)
library(palmerpenguins)
penguins <- palmerpenguins::penguins

This data has 344 rows of 8 variables:

variable            definition
species             a factor denoting penguin species (Adélie, Chinstrap and Gentoo)
island              a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)
bill_length_mm      a number denoting bill length (millimeters)
bill_depth_mm       a number denoting bill depth (millimeters)
flipper_length_mm   an integer denoting flipper length (millimeters)
body_mass_g         an integer denoting body mass (grams)
sex                 a factor denoting penguin sex (female, male)
year                an integer denoting the study year (2007, 2008, or 2009)
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

While there are several methods for dealing with missing data, we have opted to simply remove rows with NAs.

  • imputing numerical variables with mean would decrease variance
  • four variables bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g have the same number of NAs (the quick check after the code below confirms the counts)
  • removing rows with NAs treats these variables equally
  • the sex variable has 11 missing values
  • we removed those observations rather than imputing, since the data is split roughly evenly between male and female
penguins_na_omit = na.omit(penguins)
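
A quick confirmation of the missing-value counts described above:

# Count NAs per column in the raw data
colSums(is.na(penguins))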

There is some correlation in the numeric variables, as we have seen in our other assignments.

penguins_numeric <- penguins_na_omit %>% 
  select(-species, -sex, -island, -year)

# Correlation between the numeric variables
corrplot(cor(penguins_numeric), type = 'lower', diag = FALSE)

Year is not a meaningful predictor for this problem, so we have excluded it. There is a strong relationship between body_mass_g and flipper_length_mm. We list here the pairs of measurement variables with absolute correlations above 0.5:

corr_matrix <- cor(penguins_numeric)

corr_matrix %>% 
  as.data.frame %>% 
  mutate(x = rownames(corr_matrix)) %>% 
  pivot_longer(
    cols = -x,
    names_to = 'y', 
    values_to = 'corr') %>% 
  filter(x < y) %>% 
  filter(corr > 0.5 | corr < -0.5) %>% 
  arrange(desc(abs(corr)))
## # A tibble: 4 x 3
##   x              y                   corr
##   <chr>          <chr>              <dbl>
## 1 body_mass_g    flipper_length_mm  0.873
## 2 bill_length_mm flipper_length_mm  0.653
## 3 bill_length_mm body_mass_g        0.589
## 4 bill_depth_mm  flipper_length_mm -0.578

We save the prepared data for use in Question #1.

readr::write_rds(penguins_na_omit, 'data/penguins_na_omit.Rds')