# KNN and Tree Based Ensemble Models

## Project Goal

In this report, we conduct an exploratory data analysis (EDA) on the Heart Failure Prediction Dataset. We then fit kNN and ensemble models with repeated cross-validation on the training data, and evaluate each model by computing its confusion matrix on the test data:

  • K-Nearest Neighbors (KNN)
  • Ensemble Models
    • Classification Tree
    • Bagged Tree
    • Random Forests
    • Generalized Boosted Regression Models

## Setup: Packages and Helper Functions

In this report, we will use the following packages (loaded in the sketch after this list):

  • here: enables easy file referencing and builds file paths in an OS-independent way
  • stats: loaded before tidyverse to avoid masking some tidyverse functions
  • tidyverse: a collection of useful packages such as dplyr (data manipulation), tidyr (tidying data), ggplot2 (creating graphs), etc.
  • glue: embeds and evaluates R expressions in strings to be printed as messages
  • scales: formats and labels scales nicely for better visualization
  • skimr: provides compact summary statistics for data frames
  • caret: trains and plots classification and regression models
  • rpart: recursive partitioning for classification, regression and survival trees
  • randomForest: classification and regression based on a forest of trees using random inputs
  • gbm: generalized boosted regression models
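
For completeness, a minimal setup sketch (assuming all of these packages are installed) that attaches them in the order described above:

# attach packages; stats comes before tidyverse so that tidyverse
# functions take precedence where names clash
library(here)
library(stats)
library(tidyverse)
library(glue)
library(scales)
library(skimr)
library(caret)
library(rpart)
library(randomForest)
library(gbm)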

First, we define a helper function to reduce repeated code. This function trains a model with cross-validation to tune its parameters, then applies the best model to the test set to assess its performance.

Arguments:

  • form: formula
  • df_train: training set
  • df_test: test set
  • method: classification or regression model to use
  • trControl: a list of values that define how train acts
  • tuneGrid: a data frame with possible tuning values
  • plot: whether to plot the tuning parameter against the resampling metrics
  • ...: arguments passed to the classification or regression routine

Returned Value: a confusion matrix

fit_model <- function(form, df_train, df_test, method, trControl, tuneGrid = NULL, plot = TRUE, ...) {
  # train model
  fit <- train(
    form = form,
    data = df_train,
    method = method,
    preProcess = c("center", "scale"),
    trControl = trControl,
    tuneGrid = tuneGrid, ...)
  
  # print the best tune if there is a tuning parameter
  if(is.null(tuneGrid)){
    print("No tuning parameter")
  } else {
    # print the best tune 
    print("The best tune is found with:")
    print(glue("\t{names(fit$bestTune)} = {fit$bestTune[1,]}"))
  
    if(plot){
      # get model info
      model <- fit$modelInfo$label
      parameter <- fit$modelInfo$parameters$parameter
      description <- fit$modelInfo$parameters$label
      
      # plot parameter vs metrics
      p <- fit$results %>% 
        rename_at(1, ~"x") %>% 
        pivot_longer(cols = -1, names_to = "Metric") %>% 
        ggplot(aes(x, value, color = Metric)) +
        geom_point() +
        geom_line() +
        facet_grid(rows = vars(Metric), scales = "free_y") +
        labs(
          title = glue("{model}: Hyperparameter Tuning"),
          x = glue("{parameter} ({description})")
        )
      print(p)
    }
  }
  
  # make prediction on test set
  pred <- predict(fit, newdata = df_test)

  # confusion matrix (predicted classes as data, true classes as reference)
  cfm <- confusionMatrix(data = pred, reference = df_test[[1]])
  
  # print confusion matrix and accuracy
  print("Confusion table:")
  print(cfm$table)
  print(glue("Accuracy = {cfm$overall['Accuracy']}"))

  # return the confusion matrix
  return(cfm)
}
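
Note that the helper always applies preProcess = c("center", "scale"). Standardization matters for the distance-based kNN model and is essentially harmless for the tree-based models (tree splits are invariant to monotonic rescaling of individual predictors), so one preprocessing choice serves every method below.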

## Data

The Heart Failure Prediction Dataset records whether or not someone has heart disease, along with various measurements of that person’s health. A local copy is saved in the data folder.

df_raw <- read_csv(here("data", "heart.csv"))
## Rows: 918 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
## dbl (7): Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak, HeartDisease
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# show the raw data
df_raw
# check structure
str(df_raw)
## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age           : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : chr [1:918] "M" "F" "M" "F" ...
##  $ ChestPainType : chr [1:918] "ATA" "NAP" "ATA" "ASY" ...
##  $ RestingBP     : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : num [1:918] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : chr [1:918] "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR         : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: chr [1:918] "N" "N" "N" "Y" ...
##  $ Oldpeak       : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : chr [1:918] "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease  : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Sex = col_character(),
##   ..   ChestPainType = col_character(),
##   ..   RestingBP = col_double(),
##   ..   Cholesterol = col_double(),
##   ..   FastingBS = col_double(),
##   ..   RestingECG = col_character(),
##   ..   MaxHR = col_double(),
##   ..   ExerciseAngina = col_character(),
##   ..   Oldpeak = col_double(),
##   ..   ST_Slope = col_character(),
##   ..   HeartDisease = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# check if any missing values
anyNA(df_raw)
## [1] FALSE

Next, we will prepare the data:

  • remove ST_Slope
  • convert the four categorical predictors Sex, ChestPainType, RestingECG and ExerciseAngina to factors
  • convert the response HeartDisease to a factor and relocate it to the first column

df <- df_raw %>% 
  select(-ST_Slope) %>% 
  mutate(
    Sex = factor(Sex),
    ChestPainType = factor(ChestPainType),
    RestingECG = factor(RestingECG),
    ExerciseAngina = factor(ExerciseAngina),
    HeartDisease = if_else(HeartDisease == 1, "Heart Disease", "Normal") %>% factor()
  ) %>% 
  relocate(HeartDisease)

# show the data frame
df
# quick summaries of numeric and categorical variables
skim(df)
Data summary

| Name                   | df   |
|------------------------|------|
| Number of rows         | 918  |
| Number of columns      | 11   |
| Column type frequency: |      |
| factor                 | 5    |
| numeric                | 6    |
| Group variables        | None |

Variable type: factor

| skim_variable  | n_missing | complete_rate | ordered | n_unique | top_counts                           |
|----------------|-----------|---------------|---------|----------|--------------------------------------|
| HeartDisease   | 0         | 1             | FALSE   | 2        | Hea: 508, Nor: 410                   |
| Sex            | 0         | 1             | FALSE   | 2        | M: 725, F: 193                       |
| ChestPainType  | 0         | 1             | FALSE   | 4        | ASY: 496, NAP: 203, ATA: 173, TA: 46 |
| RestingECG     | 0         | 1             | FALSE   | 3        | Nor: 552, LVH: 188, ST: 178          |
| ExerciseAngina | 0         | 1             | FALSE   | 2        | N: 547, Y: 371                       |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean   | sd     | p0   | p25    | p50   | p75   | p100  | hist  |
|---------------|-----------|---------------|--------|--------|------|--------|-------|-------|-------|-------|
| Age           | 0         | 1             | 53.51  | 9.43   | 28.0 | 47.00  | 54.0  | 60.0  | 77.0  | ▁▅▇▆▁ |
| RestingBP     | 0         | 1             | 132.40 | 18.51  | 0.0  | 120.00 | 130.0 | 140.0 | 200.0 | ▁▁▃▇▁ |
| Cholesterol   | 0         | 1             | 198.80 | 109.38 | 0.0  | 173.25 | 223.0 | 267.0 | 603.0 | ▃▇▇▁▁ |
| FastingBS     | 0         | 1             | 0.23   | 0.42   | 0.0  | 0.00   | 0.0   | 0.0   | 1.0   | ▇▁▁▁▂ |
| MaxHR         | 0         | 1             | 136.81 | 25.46  | 60.0 | 120.00 | 138.0 | 156.0 | 202.0 | ▁▃▇▆▂ |
| Oldpeak       | 0         | 1             | 0.89   | 1.07   | -2.6 | 0.00   | 0.6   | 1.5   | 6.2   | ▁▇▆▁▁ |
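
One point worth flagging from the summaries: RestingBP and Cholesterol both have a minimum of 0, which is not physiologically plausible and likely encodes missing measurements; we leave these values as-is in this report.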

## Split the Data

We will use caret::createDataPartition() to create an 80/20 split of training and test sets, stratified by the response HeartDisease.

set.seed(2022)

# split data
trainIndex <- createDataPartition(df$HeartDisease, p = 0.8, list = FALSE)
df_train <- df[trainIndex, ]
df_test <- df[-trainIndex, ]
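
As a quick sanity check, we can confirm that the stratified split preserves the class balance in both sets:

# compare class proportions across the training and test sets
df_train %>% count(HeartDisease) %>% mutate(prop = n / sum(n))
df_test %>% count(HeartDisease) %>% mutate(prop = n / sum(n))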

## Part 1: kNN

To use kNN, we will need to encode the categorical predictors.

# one-hot encode
dummies_model <- dummyVars(HeartDisease ~ ., data = df)
df_encoded <- bind_cols(df[, 1], as_tibble(predict(dummies_model, newdata = df)))

# show the encoded data frame
df_encoded
# do the same split pattern for training and test sets
df_encoded_train <- df_encoded[trainIndex, ]
df_encoded_test <- df_encoded[-trainIndex, ]

Train a kNN model on the standardized data (centered and scaled) using repeated cross-validation (10 folds, 3 repeats) to determine the best parameter \(k \in \{1, 2, \ldots, 40\}\).

# train a kNN model with cv and apply the best model on test set
cfm_knn <- fit_model(
  HeartDisease ~ ., df_encoded_train, df_encoded_test, "knn", 
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
  tuneGrid = expand.grid(k = 1:40))
## [1] "The best tune is found with:"
##  k = 40

## [1] "Confusion table:"
##                Reference
## Prediction      Heart Disease Normal
##   Heart Disease            85     16
##   Normal                   20     62
## Accuracy = 0.80327868852459
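
The selected \(k = 40\) lies on the upper boundary of the tuning grid, so the true optimum may be larger. One way to check is to rerun the search over a wider grid; the sketch below uses illustrative odd values of k up to 101:

# rerun the kNN search over a wider grid (illustrative values)
cfm_knn_wide <- fit_model(
  HeartDisease ~ ., df_encoded_train, df_encoded_test, "knn", 
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
  tuneGrid = expand.grid(k = seq(1, 101, 2)))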

## Part 2: Ensemble

Train a classification tree model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats) to determine the best parameter \(\text{cp} \in \{0.001, 0.002, \ldots, 0.100\}\).

# train a classification tree model with cv and apply the best model on test set
cfm_tree <- fit_model(
  HeartDisease ~ ., df_train, df_test, "rpart", 
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(cp = (1:100)/1000))
## [1] "The best tune is found with:"
##  cp = 0.011

## [1] "Confusion table:"
##                Reference
## Prediction      Heart Disease Normal
##   Heart Disease            78     23
##   Normal                   21     61
## Accuracy = 0.759562841530055

Train a bagged tree model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats). The treebag method has no tuning parameter, so no tuneGrid is supplied and plotting is turned off.

# train a bagged tree model with cv and apply the best model on test set
cfm_bag <- fit_model(
  HeartDisease ~ ., df_train, df_test, "treebag", 
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), plot = FALSE)
## [1] "No tuning parameter"
## [1] "Confusion table:"
##                Reference
## Prediction      Heart Disease Normal
##   Heart Disease            81     20
##   Normal                   23     59
## Accuracy = 0.765027322404372

Train a random forest model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats) to determine the best parameter \(\text{mtry} \in \{1, 2, \ldots, 15\}\).

# train a random forest model with cv and apply the best model on test set
cfm_rf <- fit_model(
  HeartDisease ~ ., df_train, df_test, "rf", 
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(mtry = 1:15))
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## [1] "The best tune is found with:"
##  mtry = 4

## [1] "Confusion table:"
##                Reference
## Prediction      Heart Disease Normal
##   Heart Disease            82     19
##   Normal                   21     61
## Accuracy = 0.781420765027322
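
The warning (repeated once per affected resample) appears because caret's formula interface expands the factor predictors into dummy variables, leaving (by our count) 13 predictor columns, so the grid values \(\text{mtry} = 14\) and \(15\) are invalid and get reset by randomForest. A sketch of the same call with the grid capped at 13 would avoid the warnings:

# cap mtry at the number of predictors after dummy encoding (sketch)
cfm_rf_capped <- fit_model(
  HeartDisease ~ ., df_train, df_test, "rf", 
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(mtry = 1:13))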

Train a boosted tree model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats) to determine the best parameters \(\text{n.trees} \in \{25, 50, \ldots, 200\}\) and \(\text{interaction.depth} \in \{1, 2, 3, 4\}\), with \(\text{shrinkage} = 0.1\) and \(\text{n.minobsinnode} = 10\) held fixed.

# train a boosted tree model with cv and apply the best model on test set
cfm_boost <- fit_model(
  HeartDisease ~ ., df_train, df_test, "gbm", 
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(
    n.trees = seq(25, 200, 25),
    interaction.depth = 1:4,
    shrinkage = 0.1,
    n.minobsinnode = 10),
  plot = FALSE, verbose = FALSE)
## [1] "The best tune is found with:"
##  n.trees = 100
##  interaction.depth = 2
##  shrinkage = 0.1
##  n.minobsinnode = 10
## [1] "Confusion table:"
##                Reference
## Prediction      Heart Disease Normal
##   Heart Disease            85     16
##   Normal                   20     62
## Accuracy = 0.80327868852459

## Comparison

Finally, we gather the overall statistics from each model's confusion matrix into a single table for a side-by-side comparison.

tibble(
    kNN = cfm_knn$overall, 
    Tree = cfm_tree$overall,
    Bagged = cfm_bag$overall,
    RandomForest = cfm_rf$overall,
    Boosted = cfm_boost$overall
  ) %>% 
  mutate_all(round, 4) %>% 
  bind_cols(Metric = names(cfm_knn$overall)) %>% 
  relocate(Metric)
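
Based on the test-set accuracies reported above, kNN and the boosted tree perform best (0.8033), followed by the random forest (0.7814), the bagged tree (0.7650) and the single classification tree (0.7596).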