# KNN and Tree Based Ensemble Models
## Project Goal
In this report, we conduct an exploratory data analysis (EDA) on the Heart Failure Prediction Dataset. We then fit KNN and tree-based ensemble models with repeated cross-validation on the training data and evaluate each model with a confusion matrix on the test data:
- K-Nearest Neighbors (KNN)
- Ensemble Models
  - Classification Tree
  - Bagged Tree
  - Random Forests
  - Generalized Boosted Regression Models
## Set up: Packages and Helper Functions
In this task, we will use the following packages:
- `here`: enables easy file referencing and builds file paths in an OS-independent way
- `stats`: loaded before `tidyverse` to avoid masking some `tidyverse` functions
- `tidyverse`: a collection of useful packages such as `dplyr` (data manipulation), `tidyr` (tidying data), `ggplot2` (creating graphs), etc.
- `glue`: embeds and evaluates R expressions in strings to be printed as messages
- `scales`: formats and labels scales for better visualization
- `caret`: training and plotting classification and regression models
- `rpart`: recursive partitioning for classification, regression and survival trees
- `randomForest`: classification and regression based on a forest of trees using random inputs
- `gbm`: generalized boosted regression models

First, we define a helper function to reduce repeated code. It trains a model with cross-validation to tune the parameters, then applies the best model to the test set to evaluate its performance.

Arguments:
- `form`: formula
- `df_train`: training set
- `df_test`: test set
- `method`: classification or regression model to use
- `trControl`: a list of values that define how `train` acts
- `tuneGrid`: a data frame with possible tuning values
- `plot`: whether to plot parameter and metric
- `...`: arguments passed to the classification or regression routine

Returned value: a confusion matrix
fit_model <- function(form, df_train, df_test, method, trControl, tuneGrid = NULL, plot = TRUE, ...) {
  # train model on centered and scaled predictors
  fit <- train(
    form = form,
    data = df_train,
    method = method,
    preProcess = c("center", "scale"),
    trControl = trControl,
    tuneGrid = tuneGrid, ...)
  # print the best tune if there is a tuning parameter
  if (is.null(tuneGrid)) {
    print("No tuning parameter")
  } else {
    # print the best tune
    print("The best tune is found with:")
    print(glue("\t{names(fit$bestTune)} = {fit$bestTune[1,]}"))
    if (plot) {
      # get model info
      model <- fit$modelInfo$label
      parameter <- fit$modelInfo$parameters$parameter
      description <- fit$modelInfo$parameters$label
      # plot parameter vs metrics (first column of fit$results is the tuning parameter)
      p <- fit$results %>%
        rename_at(1, ~"x") %>%
        pivot_longer(cols = -1, names_to = "Metric") %>%
        ggplot(aes(x, value, color = Metric)) +
        geom_point() +
        geom_line() +
        facet_grid(rows = vars(Metric), scales = "free_y") +
        labs(
          title = glue("{model}: Hyperparameter Tuning"),
          x = glue("{parameter} ({description})")
        )
      print(p)
    }
  }
  # make prediction on test set
  pred <- predict(fit, newdata = df_test)
  # confusion matrix
  # NOTE: confusionMatrix() takes (data = predictions, reference = truth).
  # Here the observed classes are passed first, so in the printed table the
  # rows labelled "Prediction" hold the observed classes and the columns
  # labelled "Reference" hold the predictions. Accuracy is unaffected.
  cfm <- confusionMatrix(df_test[, 1] %>% as_vector(), pred)
  # print confusion matrix and accuracy
  print("Confusion table:")
  print(cfm$table)
  print(glue("Accuracy = {cfm$overall['Accuracy']}"))
  # return the confusion matrix
  return(cfm)
}

## Data
The Heart Failure Prediction Dataset gives information about whether or not someone has heart disease, along with different measurements of that person’s health. A local copy is saved in the data folder. Since the original column names contain spaces and special characters and are also long, we rename the columns when we read the data in.
## Rows: 918 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
## dbl (7): Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak, HeartDisease
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr [1:918] "M" "F" "M" "F" ...
## $ ChestPainType : chr [1:918] "ATA" "NAP" "ATA" "ASY" ...
## $ RestingBP : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : num [1:918] 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : chr [1:918] "Normal" "Normal" "ST" "Normal" ...
## $ MaxHR : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: chr [1:918] "N" "N" "N" "Y" ...
## $ Oldpeak : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : chr [1:918] "Up" "Flat" "Up" "Flat" ...
## $ HeartDisease : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Sex = col_character(),
## .. ChestPainType = col_character(),
## .. RestingBP = col_double(),
## .. Cholesterol = col_double(),
## .. FastingBS = col_double(),
## .. RestingECG = col_character(),
## .. MaxHR = col_double(),
## .. ExerciseAngina = col_character(),
## .. Oldpeak = col_double(),
## .. ST_Slope = col_character(),
## .. HeartDisease = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] FALSE
Next, we will prepare the data:
- remove `ST_Slope`
- convert the four categorical predictors `Sex`, `ChestPainType`, `RestingECG` and `ExerciseAngina` to factors
- convert the response `HeartDisease` to a factor and relocate it to the first column
df <- df_raw %>%
  select(-ST_Slope) %>%
  mutate(
    Sex = factor(Sex),
    ChestPainType = factor(ChestPainType),
    RestingECG = factor(RestingECG),
    ExerciseAngina = factor(ExerciseAngina),
    HeartDisease = if_else(HeartDisease == 1, "Heart Disease", "Normal") %>% factor()
  ) %>%
  relocate(HeartDisease)
# show the data frame
df

| Name | df |
|---|---|
| Number of rows | 918 |
| Number of columns | 11 |
| Column type frequency: | |
| factor | 5 |
| numeric | 6 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| HeartDisease | 0 | 1 | FALSE | 2 | Hea: 508, Nor: 410 |
| Sex | 0 | 1 | FALSE | 2 | M: 725, F: 193 |
| ChestPainType | 0 | 1 | FALSE | 4 | ASY: 496, NAP: 203, ATA: 173, TA: 46 |
| RestingECG | 0 | 1 | FALSE | 3 | Nor: 552, LVH: 188, ST: 178 |
| ExerciseAngina | 0 | 1 | FALSE | 2 | N: 547, Y: 371 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 53.51 | 9.43 | 28.0 | 47.00 | 54.0 | 60.0 | 77.0 | ▁▅▇▆▁ |
| RestingBP | 0 | 1 | 132.40 | 18.51 | 0.0 | 120.00 | 130.0 | 140.0 | 200.0 | ▁▁▃▇▁ |
| Cholesterol | 0 | 1 | 198.80 | 109.38 | 0.0 | 173.25 | 223.0 | 267.0 | 603.0 | ▃▇▇▁▁ |
| FastingBS | 0 | 1 | 0.23 | 0.42 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
| MaxHR | 0 | 1 | 136.81 | 25.46 | 60.0 | 120.00 | 138.0 | 156.0 | 202.0 | ▁▃▇▆▂ |
| Oldpeak | 0 | 1 | 0.89 | 1.07 | -2.6 | 0.00 | 0.6 | 1.5 | 6.2 | ▁▇▆▁▁ |
## Split the Data
We will use `caret::createDataPartition()` to create an 80/20 split of training and test sets.
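
The code for this step is not shown in the report, so the following is only a minimal sketch of how such a split might look. The helper name `split_data`, the seed value and the toy data frame are assumptions made for illustration; only `createDataPartition()` itself comes from the text above.

```r
library(caret)

# Hypothetical helper: stratified train/test split on a factor outcome column.
split_data <- function(data, outcome, p = 0.8, seed = 123) {
  set.seed(seed)  # assumed seed, only for reproducibility of the sketch
  # createDataPartition() samples within each class of the outcome,
  # so both sets keep roughly the same class proportions
  idx <- as.vector(createDataPartition(data[[outcome]], p = p, list = FALSE))
  list(
    index = idx,
    train = data[idx, , drop = FALSE],
    test  = data[-idx, , drop = FALSE]
  )
}

# Toy data standing in for the prepared data frame
toy <- data.frame(
  HeartDisease = factor(rep(c("Heart Disease", "Normal"), times = c(60, 40))),
  Age = seq(30, by = 0.5, length.out = 100)
)

parts <- split_data(toy, "HeartDisease")
nrow(parts$train)
nrow(parts$test)
```

Applied to `df`, the three list elements would play the roles of `trainIndex`, `df_train` and `df_test` used in the rest of the report.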
## Part 1: kNN
To use kNN, we will need to encode the categorical predictors.
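
kNN measures distances between numeric vectors, so each factor has to become a set of numeric columns first. As a small illustration of what one-hot encoding produces, here is a base R sketch on a toy factor that reuses the four `ChestPainType` levels from the summary table; the toy values themselves are made up.

```r
# Toy factor with the four ChestPainType levels (values are illustrative only)
chest_pain <- factor(c("ATA", "NAP", "ASY", "TA", "ASY"))

# One indicator column per level: 1 where the row has that level, 0 otherwise
one_hot <- sapply(levels(chest_pain), function(lv) as.integer(chest_pain == lv))
rownames(one_hot) <- as.character(chest_pain)
one_hot
```

Every row contains exactly one 1, which is the property that gives the encoding its name.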
# one-hot encode
dummies_model <- dummyVars(HeartDisease ~ ., data = df)
df_encoded <- bind_cols(df[, 1], predict(dummies_model, newdata = df))
# show the encoded data frame
df_encoded
# do the same split pattern for training and test sets
df_encoded_train <- df_encoded[trainIndex, ]
df_encoded_test <- df_encoded[-trainIndex, ]

Train a kNN model on the standardized data (centered and scaled) using repeated cross-validation (10 folds, 3 repeats) to determine the best parameter \(k \in \{1, 2, \ldots, 40\}\).
# train a kNN model with cv and apply the best model on test set
cfm_knn <- fit_model(
  HeartDisease ~ ., df_encoded_train, df_encoded_test, "knn",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
  tuneGrid = expand.grid(k = 1:40))
## [1] "The best tune is found with:"
## k = 40
## [1] "Confusion table:"
## Reference
## Prediction Heart Disease Normal
## Heart Disease 85 16
## Normal 20 62
## Accuracy = 0.80327868852459
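
To make the reported accuracy concrete, the sketch below recomputes it from the four counts in the kNN confusion table: the correctly classified cases sit on the diagonal, and accuracy is their share of all test cases. The matrix is typed in by hand from the output above; the object names are illustrative.

```r
# Counts copied from the kNN confusion table, in the same row/column layout
cm <- matrix(
  c(85, 20, 16, 62), nrow = 2,
  dimnames = list(c("Heart Disease", "Normal"), c("Heart Disease", "Normal"))
)

n_test   <- sum(cm)        # total number of test cases
n_right  <- sum(diag(cm))  # cases where both labels agree
accuracy <- n_right / n_test
accuracy
```

Because only the diagonal and the total enter the formula, the result does not depend on which axis holds the predictions.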
## Part 2: Ensemble
Train a classification tree model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats) to determine the best parameter \(\text{cp} \in \{0.001, 0.002, \ldots, 0.100\}\).
# train a classification tree model with cv and apply the best model on test set
cfm_tree <- fit_model(
  HeartDisease ~ ., df_train, df_test, "rpart",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(cp = (1:100)/1000))
## [1] "The best tune is found with:"
## cp = 0.011
## [1] "Confusion table:"
## Reference
## Prediction Heart Disease Normal
## Heart Disease 78 23
## Normal 21 61
## Accuracy = 0.759562841530055
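
To illustrate what "5 folds, 3 repeats" means, here is a rough base R sketch that builds the 15 training index sets by hand. It is not how `caret` does it internally (in particular, this version does not stratify the folds by class); the function name and seed are assumptions. The row count of 735 is the 918 rows of the data minus the 183 test cases in the confusion tables.

```r
# Hypothetical helper: training-row indices for repeated k-fold cross-validation
make_repeated_folds <- function(n, k = 5, repeats = 3, seed = 1) {
  set.seed(seed)  # assumed seed, only for reproducibility of the sketch
  out <- list()
  for (r in seq_len(repeats)) {
    # randomly assign each row to one of k folds of (nearly) equal size
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      # train on every row that is not in the held-out fold f
      out[[sprintf("Fold%d.Rep%d", f, r)]] <- which(fold_id != f)
    }
  }
  out
}

folds <- make_repeated_folds(n = 735)
length(folds)          # number of resamples
range(lengths(folds))  # training-set size in each resample
```

Each candidate value of the tuning parameter is fitted once per resample, and the resampled accuracies are averaged to pick the best value.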
Train a bagged tree model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats).
# train a bagged tree model with cv and apply the best model on test set
cfm_bag <- fit_model(
  HeartDisease ~ ., df_train, df_test, "treebag",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), plot = FALSE)
## [1] "No tuning parameter"
## [1] "Confusion table:"
## Reference
## Prediction Heart Disease Normal
## Heart Disease 81 20
## Normal 23 59
## Accuracy = 0.765027322404372
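Train a random forest model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats) to determine the best parameter \(\text{mtry} \in \{1, 2, \ldots, 15\}\). Values of mtry larger than the number of predictor columns are invalid, which is why `randomForest` emits the `invalid mtry` warning in the output below.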
# train a random forest model with cv and apply the best model on test set
cfm_rf <- fit_model(
  HeartDisease ~ ., df_train, df_test, "rf",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(mtry = 1:15))
## Warning in randomForest.default(x, y, mtry = param$mtry, ...): invalid mtry:
## reset to within valid range
## [1] "The best tune is found with:"
## mtry = 4
## [1] "Confusion table:"
## Reference
## Prediction Heart Disease Normal
## Heart Disease 82 19
## Normal 21 61
## Accuracy = 0.781420765027322
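
The upper end of a sensible `mtry` grid can be estimated by counting the predictor columns that the model actually sees. The sketch below does this on a toy data frame with the same predictors and factor levels as in the summary tables, assuming the formula interface expands factors into treatment-coded dummy columns the way base R's `model.matrix()` does. The toy rows are made up so that every factor level appears.

```r
# Toy rows with the report's ten predictors (values are illustrative only)
toy <- data.frame(
  Age            = c(40, 49, 37, 48),
  Sex            = factor(c("M", "F", "M", "F")),
  ChestPainType  = factor(c("ATA", "NAP", "ASY", "TA")),
  RestingBP      = c(140, 160, 130, 138),
  Cholesterol    = c(289, 180, 283, 214),
  FastingBS      = c(0, 0, 1, 0),
  RestingECG     = factor(c("Normal", "Normal", "ST", "LVH")),
  MaxHR          = c(172, 156, 98, 108),
  ExerciseAngina = factor(c("N", "N", "N", "Y")),
  Oldpeak        = c(0, 1, 0, 1.5)
)

# Expand factors into dummy columns and drop the intercept column
x <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]
n_predictors <- ncol(x)
n_predictors
```

Under that assumption there are 13 predictor columns (6 numeric plus 7 dummies), so only `mtry` values 14 and 15 in the grid would fall outside the valid range.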
Train a boosted tree model on the standardized data (centered and scaled) using repeated cross-validation (5 folds, 3 repeats) to determine the best parameters over \(\text{n.trees} \in \{25, 50, \ldots, 200\}\) and \(\text{interaction.depth} \in \{1, 2, 3, 4\}\), with \(\text{shrinkage} = 0.1\) and \(\text{n.minobsinnode} = 10\) held fixed.
# train a boosted tree model with cv and apply the best model on test set
cfm_boost <- fit_model(
  HeartDisease ~ ., df_train, df_test, "gbm",
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
  tuneGrid = expand.grid(
    n.trees = seq(25, 200, 25),
    interaction.depth = 1:4,
    shrinkage = 0.1,
    n.minobsinnode = 10),
  plot = FALSE, verbose = FALSE)
## [1] "The best tune is found with:"
## n.trees = 100
## interaction.depth = 2
## shrinkage = 0.1
## n.minobsinnode = 10
## [1] "Confusion table:"
## Reference
## Prediction Heart Disease Normal
## Heart Disease 85 16
## Normal 20 62
## Accuracy = 0.80327868852459
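
As a quick check on the size of the boosted-tree search, the tuning grid from the call above can be built on its own with base R's `expand.grid()`. Since two of the four parameters are held at a single value, the grid is simply every pairing of the `n.trees` and `interaction.depth` values. The object name `gbm_grid` is illustrative.

```r
# Same grid as passed to tuneGrid in the gbm call
gbm_grid <- expand.grid(
  n.trees = seq(25, 200, 25),  # 8 values
  interaction.depth = 1:4,     # 4 values
  shrinkage = 0.1,             # fixed
  n.minobsinnode = 10          # fixed
)

nrow(gbm_grid)  # number of candidate parameter combinations
head(gbm_grid, 3)
```

The selected combination (`n.trees = 100`, `interaction.depth = 2`) is one of these 32 candidates.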