Reminding note: Descriptive, Inferential and Predictive Models. Predictive models are mostly black box models that are complex. These have achieve high predictive capacity by compromising interpretability. In this session we will compare the performance of different machine learning classification models. Each algorithm has strengths and weaknesses, making them suitable for different tasks. For details: https://hastie.su.domains/ISLR2/ISLRv2_corrected_June_2023.pdf.download.html

1. K-Nearest Neighbors (KNN)

What it is:
KNN is a simple, instance-based supervised learning algorithm used for classification and regression. It classifies new data points based on their similarity to existing data points.

How it works:

Step 1: Choose the number of neighbors (K).
Step 2: Calculate the distance (Euclidean, Manhattan, etc.) between the new data point and all training points.
Step 3: Select the K nearest neighbors based on distance.
Step 4: For classification, assign the majority class among neighbors; for regression, take the average of neighbors’ values.

KNN is non-parametric and lazy (does not learn a model but memorizes data). It works well with small datasets but can be slow with large data.

2. Linear Discriminant Analysis (LDA)

What it is:
LDA is a supervised classification method that projects data into a lower-dimensional space while maximizing class separability.

How it works:

Step 1: Compute mean vectors for each class and overall mean.
Step 2: Calculate within-class and between-class scatter matrices.
Step 3: Compute eigenvectors and eigenvalues for optimal projection.
Step 4: Project data onto the new space and classify using linear decision boundaries.

LDA assumes Gaussian-distributed data with equal covariance matrices. It is efficient for dimensionality reduction.

3. Quadratic Discriminant Analysis (QDA)

What it is:
QDA is similar to LDA but relaxes the assumption of equal covariance matrices, allowing quadratic decision boundaries.

How it works:

Step 1: Estimate class-specific mean and covariance matrices.
Step 2: Compute discriminant functions for each class using Bayes’ theorem.
Step 3: Classify new points based on the highest discriminant value.

QDA is more flexible than LDA but requires more data to estimate covariances accurately.

4. Support Vector Machine (SVM)

What it is:
SVM is a powerful supervised algorithm for classification and regression, finding the optimal hyperplane that maximizes the margin between classes.

How it works:

Step 1: Identify support vectors (critical data points near the boundary).
Step 2: Maximize the margin using optimization techniques.
Step 3: Use kernel tricks (linear, polynomial, RBF) for non-linear separation.

SVM is effective in high-dimensional spaces but can be slow for large datasets.

5. Gradient Boosting

What it is:
Gradient Boosting is an ensemble method that builds models sequentially, correcting errors from previous models.

How it works:

Step 1: Train a weak model (e.g., decision tree).
Step 2: Compute residuals (errors) and fit a new model on them.
Step 3: Combine models with weighted learning rates.
Step 4: Repeat until errors are minimized.

Popular implementations like XGBoost and LightGBM improve efficiency and accuracy.

6. Neural Networks

What it is:
Neural Networks are deep learning models inspired by biological neurons, capable of learning complex patterns.

How it works:

Step 1: Input data passes through layers (input, hidden, output).
Step 2: Weights are adjusted via backpropagation.
Step 3: Activation functions (ReLU, Sigmoid) introduce non-linearity.
Step 4: Optimizers (Adam, SGD) minimize loss functions.

Neural networks excel in tasks like image and speech recognition but require large data and computational power.

7. Random Forest

What it is:
Random Forest is an ensemble method using multiple decision trees for improved accuracy and robustness.

How it works:

Step 1: Build multiple trees on random subsets of data (bagging).
Step 2: Split nodes on random feature subsets.
Step 3: Aggregate predictions (majority voting for classification, averaging for regression).

It reduces overfitting and handles high-dimensional data well.

Practice different machine learning models

Open R script, save it in a folder and set that folder as the working directory.

# Load libraries
library(MASS)
library(jtools)
library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──

## ✔ broom        1.0.8     ✔ recipes      1.3.1
## ✔ dials        1.4.0     ✔ rsample      1.3.0
## ✔ dplyr        1.1.4     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.2     ✔ tidyr        1.3.1
## ✔ infer        1.0.8     ✔ tune         1.3.0
## ✔ modeldata    1.4.0     ✔ workflows    1.2.0
## ✔ parsnip      1.3.2     ✔ workflowsets 1.1.1
## ✔ purrr        1.0.4     ✔ yardstick    1.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()         masks scales::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ yardstick::get_weights() masks jtools::get_weights()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ dplyr::select()          masks MASS::select()
## ✖ recipes::step()          masks stats::step()

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ dplyr::select()     masks MASS::select()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidymodels_prefer()
theme_set(theme_bw())

# Load data
library(mlbench)
data(BreastCancer)
glimpse(BreastCancer)

## Rows: 699
## Columns: 11
## $ Id              <chr> "1000025", "1002945", "1015425", "1016277", "1017023",…
## $ Cl.thickness    <ord> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, …
## $ Cell.size       <ord> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1,…
## $ Cell.shape      <ord> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1,…
## $ Marg.adhesion   <ord> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1,…
## $ Epith.c.size    <ord> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, …
## $ Bare.nuclei     <fct> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, …
## $ Bl.cromatin     <fct> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, …
## $ Normal.nucleoli <fct> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, …
## $ Mitoses         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, …
## $ Class           <fct> benign, benign, benign, benign, benign, malignant, ben…

We can see the description of the data set in help menu by searching with BreastCancer. Remove the id columns and missing values (NA’s).

bc = BreastCancer[-1] %>% na.omit() # remove Id and missing values (NA's)
x = bc %>% select(-Class) %>% mutate(across(everything(), as.numeric))
y = bc %>% select(Class)

BC = bind_cols(x, y)

summary(BC)

##   Cl.thickness      Cell.size        Cell.shape     Marg.adhesion  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##   Epith.c.size     Bare.nuclei      Bl.cromatin     Normal.nucleoli
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Class    
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000

We will use Class (benign or malignant) as the dependent variable. We can rename the variables (new name = old name)

# Name the modified dataset as df
df = BC %>% 
  rename(Thickness = Cl.thickness,
         Size = Cell.size,
         Shape = Cell.shape,
         Adhesion = Marg.adhesion,
         Epithelia = Epith.c.size,
         Nuclei = Bare.nuclei,
         Chromatin = Bl.cromatin,
         Nucleoli = Normal.nucleoli,
         Mitoses = Mitoses,
         Cancer = Class)

summary(df)

##    Thickness           Size            Shape           Adhesion    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##    Epithelia          Nuclei         Chromatin         Nucleoli    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Cancer   
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000

# Check the variable, benign is first level. Tidymodels predicts the first level as the event or positive outcome. However, we are interested to predict malignant cancer.

df %>% select(Cancer) %>% summary()

##        Cancer   
##  benign   :444  
##  malignant:239

# Split the data for analysis, assessment and validation
# training set = for analysis and assessment
# testing set = for validation
# strata = Cancer ensures same proportion of malignant:benign in the train or test as was in the main data set.

set.seed(1)
data_split = initial_split(df, prop = 0.7, strata = Cancer)
train = training(data_split) # or analysis(data_split)
test = testing(data_split) # or assess(data_split)

# Recipe or formula and preprocessing, same for all models
# . means all other variables in the data
# We do not have any numeric variables
# We may not all the preprocessing steps that we have shown only for demonstration

my_recipe = recipe(Cancer ~ ., data = train) %>% 
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

Logistic regression

# Model specification
log_reg = logistic_reg(mode = 'classification')

# Workflow
log_reg_wf = workflow() %>% add_model(log_reg) %>% add_recipe(my_recipe)

# Final model training with the train data
# Train the model == fit the model to the training set (known data)
# Extracting results

log_reg_trained = fit(log_reg_wf, train)

extract_fit_parsnip(log_reg_trained) %>% tidy()

## # A tibble: 10 × 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)   -1.11      0.362    -3.07  0.00215 
##  2 Thickness      1.26      0.441     2.86  0.00417 
##  3 Size           0.629     0.748     0.842 0.400   
##  4 Shape          0.720     0.800     0.900 0.368   
##  5 Adhesion       0.938     0.424     2.21  0.0270  
##  6 Epithelia      0.117     0.404     0.289 0.773   
##  7 Nuclei         1.32      0.399     3.30  0.000982
##  8 Chromatin      0.936     0.462     2.03  0.0428  
##  9 Nucleoli       0.423     0.421     1.00  0.315   
## 10 Mitoses        0.872     0.571     1.53  0.127

# To apply summ() from jtools or summary()
log_reg_out = extract_fit_engine(log_reg_trained)

summ(log_reg_out)

Observations	477
Dependent variable	..y
Type	Generalized linear model
Family	binomial
Link	logit

χ²(9)	540.54
p	0.00
Pseudo-R² (Cragg-Uhler)	0.93
Pseudo-R² (McFadden)	0.88
AIC	97.18
BIC	138.86

	Est.	S.E.	z val.	p
(Intercept)	-1.11	0.36	-3.07	0.00
Thickness	1.26	0.44	2.86	0.00
Size	0.63	0.75	0.84	0.40
Shape	0.72	0.80	0.90	0.37
Adhesion	0.94	0.42	2.21	0.03
Epithelia	0.12	0.40	0.29	0.77
Nuclei	1.32	0.40	3.30	0.00
Chromatin	0.94	0.46	2.03	0.04
Nucleoli	0.42	0.42	1.00	0.32
Mitoses	0.87	0.57	1.53	0.13
Standard errors: MLE

# Predict the test data (unknown for the machine but known for us)
log_reg_pred = predict(log_reg_trained, new_data = test) %>% 
  bind_cols(predict(log_reg_trained, new_data = test, type = 'prob')) %>% 
  bind_cols(test)

# Explore the the predictions
log_reg_pred

## # A tibble: 206 × 13
##    .pred_class .pred_benign .pred_malignant Thickness  Size Shape Adhesion
##  * <fct>              <dbl>           <dbl>     <dbl> <dbl> <dbl>    <dbl>
##  1 benign             0.983         0.0175          4     1     1        3
##  2 benign             0.943         0.0572          1     1     1        1
##  3 benign             0.995         0.00482         2     1     2        1
##  4 benign             0.985         0.0145          2     1     1        1
##  5 benign             0.992         0.00775         4     2     1        1
##  6 benign             0.998         0.00230         1     1     1        1
##  7 benign             0.997         0.00258         2     1     1        1
##  8 benign             0.998         0.00242         1     1     1        1
##  9 benign             0.995         0.00471         3     2     1        1
## 10 benign             0.998         0.00181         1     1     3        1
## # ℹ 196 more rows
## # ℹ 6 more variables: Epithelia <dbl>, Nuclei <dbl>, Chromatin <dbl>,
## #   Nucleoli <dbl>, Mitoses <dbl>, Cancer <fct>

# Compare this prediction with the known column in the test data

# Collect accuracy metrics, we will this matrix
log_metrices = caret::confusionMatrix(data = log_reg_pred$.pred_class, reference = log_reg_pred$Cancer, positive = 'malignant')

log_metrices

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       133         5
##   malignant      1        67
##                                           
##                Accuracy : 0.9709          
##                  95% CI : (0.9377, 0.9892)
##     No Information Rate : 0.6505          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9351          
##                                           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9306          
##             Specificity : 0.9925          
##          Pos Pred Value : 0.9853          
##          Neg Pred Value : 0.9638          
##              Prevalence : 0.3495          
##          Detection Rate : 0.3252          
##    Detection Prevalence : 0.3301          
##       Balanced Accuracy : 0.9615          
##                                           
##        'Positive' Class : malignant       
##

# This may not match with this command because of level confusion in the dependent variable
# We can relevel the variable to mach accuracy statistics

conf_mat(log_reg_pred, truth = Cancer, estimate = .pred_class) %>% summary()

## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.971
##  2 kap                  binary         0.935
##  3 sens                 binary         0.993
##  4 spec                 binary         0.931
##  5 ppv                  binary         0.964
##  6 npv                  binary         0.985
##  7 mcc                  binary         0.936
##  8 j_index              binary         0.923
##  9 bal_accuracy         binary         0.962
## 10 detection_prevalence binary         0.670
## 11 precision            binary         0.964
## 12 recall               binary         0.993
## 13 f_meas               binary         0.978

# Extract metrics

my_metrics = metric_set(yardstick::sensitivity, yardstick::specificity)

my_metrics(
  log_reg_pred,
  truth = Cancer,
  estimate = .pred_class,
  event_level = "second")

## # A tibble: 2 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 sensitivity binary         0.931
## 2 specificity binary         0.993

conf_mat(log_reg_pred, truth = Cancer, estimate = .pred_class) %>% autoplot(type = 'heatmap')

roc_auc(log_reg_pred, truth = Cancer, .pred_malignant, event_level = 'second')

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.997

# Event level first means benign, second means malignant

# Data for visualization
roc_viz = roc_curve(log_reg_pred, truth = Cancer, .pred_malignant, event_level = 'second')

ggplot(roc_viz, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path(color = 'blue') +
  geom_abline(linetype = "dashed", color = "gray50") +
  labs(
    title = "ROC Curve",
    x = "1 - Specificity (False Positive Rate)",
    y = "Sensitivity (True Positive Rate)"
  ) +
  theme_bw()

# Same plot can be drawn using autoplot
roc_viz %>% autoplot()

Now in brief other machine learning models with along with logistic regression for comparison. Remember the steps: split > recipe > model > workflow > fit > evaluate

library(discrim)  # For LDA/QDA
library(kernlab)  # For SVM
library(kknn)     # For KNN
library(nnet)     # For Neural Networks

set.seed(123)
split = initial_split(df, prop = 0.7, strata = Cancer)
train_data = training(split)
test_data = testing(split)

recipe = recipe(Cancer ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

logistic_model = logistic_reg(mode = 'classification') 

workflow_logistic = workflow() %>%
  add_recipe(recipe) %>%
  add_model(logistic_model)

fit_logistic = workflow_logistic %>% fit(data = train_data)

lda_model = discrim_linear(mode = 'classification') 

workflow_lda = workflow() %>%
  add_recipe(recipe) %>%
  add_model(lda_model)

fit_lda = workflow_lda %>% fit(data = train_data)

qda_model = discrim_quad(mode = 'classification') 

workflow_qda = workflow() %>%
  add_recipe(recipe) %>%
  add_model(qda_model)

fit_qda = workflow_qda %>% fit(data = train_data)

knn_model = nearest_neighbor(mode = 'classification',
  neighbors = tune())

workflow_knn = workflow() %>%
  add_recipe(recipe) %>%
  add_model(knn_model)

# Tune KNN
knn_grid = grid_regular(neighbors(range = c(3, 15)), levels = 5)
tune_knn = tune_grid(
  workflow_knn,
  resamples = vfold_cv(train_data, v = 5),
  grid = knn_grid
)

# Fix: Use named argument for select_best()
best_knn = select_best(tune_knn, metric = "accuracy")
fit_knn = finalize_workflow(workflow_knn, best_knn) %>% fit(train_data)

svm_model = svm_rbf(mode = 'classification')


workflow_svm = workflow() %>%
  add_recipe(recipe) %>%
  add_model(svm_model)

fit_svm = workflow_svm %>% fit(data = train_data)

svm_poly_model = svm_poly(mode = 'classification',
  cost = tune(), degree = tune())

workflow_svm_poly = workflow() %>%
  add_recipe(recipe) %>%
  add_model(svm_poly_model)

# Tune SVM-Poly
svm_poly_grid = grid_regular(cost(), degree(), levels = 3)
tune_svm_poly = tune_grid(
  workflow_svm_poly,
  resamples = vfold_cv(train_data, v = 5),
  grid = svm_poly_grid
)

best_svm_poly = select_best(tune_svm_poly, metric = "accuracy")
fit_svm_poly = finalize_workflow(workflow_svm_poly, best_svm_poly) %>% fit(train_data)

svm_rbf_model = svm_rbf(mode = 'classification', 
                        cost = tune(), rbf_sigma = tune()) 


workflow_svm_rbf = workflow() %>%
  add_recipe(recipe) %>%
  add_model(svm_rbf_model)

# Tune SVM-RBF
svm_rbf_grid = grid_regular(cost(), rbf_sigma(), levels = 3)
tune_svm_rbf = tune_grid(
  workflow_svm_rbf,
  resamples = vfold_cv(train_data, v = 5),
  grid = svm_rbf_grid
)

best_svm_rbf = select_best(tune_svm_rbf, metric = "accuracy")
fit_svm_rbf = finalize_workflow(workflow_svm_rbf, best_svm_rbf) %>% fit(train_data)

xgb_model = boost_tree(mode = 'classification') 


workflow_xgb = workflow() %>%
  add_recipe(recipe) %>%
  add_model(xgb_model)

fit_xgb = workflow_xgb %>% fit(data = train_data)

rf_model = rand_forest(mode = 'classification') 

workflow_rf = workflow() %>%
  add_recipe(recipe) %>%
  add_model(rf_model)

fit_rf = workflow_rf %>% fit(data = train_data)

nn_model = mlp(mode = 'classification', hidden_units = tune(), epochs = tune()) 

workflow_nn = workflow() %>%
  add_recipe(recipe) %>%
  add_model(nn_model)

# Tune Neural Net
nn_grid = grid_regular(hidden_units(range = c(5, 10)), epochs(range = c(50, 100)), levels = 2)
tune_nn = tune_grid(
  workflow_nn,
  resamples = vfold_cv(train_data, v = 3),
  grid = nn_grid
)

best_nn = select_best(tune_nn, metric = "accuracy")
fit_nn = finalize_workflow(workflow_nn, best_nn) %>% fit(train_data)

Quick evaluation of models using test data set

# Begin mapping over models list
models = list(
  "Logit" = fit_logistic,
  "LDA" = fit_lda,
  "QDA" = fit_qda,
  "KNN" = fit_knn,
  "SVM" = fit_svm,
  "SVM (Poly)" = fit_svm_poly,
  "SVM (RBF)" = fit_svm_rbf,
  "Boosting" = fit_xgb,
  "Neural Net" = fit_nn,
  "Random Forest" = fit_rf
)


# Create a function to create an evaluation results
# Step 1: Get predicted probabilities
# Step 2: Add predicted class labels
# Step 3: Add true values from test data
# Step 4: Define which metrics to calculate
# Step 5: Calculate metrics
# Keep track of which model produced which results


test_metrics = map_dfr(models, ~ {
  predictions = predict(.x, test_data, type = "prob") %>%
    bind_cols(predict(.x, test_data)) %>%
    bind_cols(test_data)
  
  metrics = metric_set(accuracy, roc_auc, sens, spec)
  
  metrics(predictions, truth = Cancer, estimate = .pred_class, .pred_benign)
}, .id = "Model")  



evaluations = test_metrics %>%
  select(Model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(desc(accuracy))

evaluations

## # A tibble: 10 × 5
##    Model         accuracy  sens  spec roc_auc
##    <chr>            <dbl> <dbl> <dbl>   <dbl>
##  1 Boosting         0.976 0.985 0.958   0.991
##  2 Random Forest    0.976 0.978 0.972   0.992
##  3 KNN              0.971 0.985 0.944   0.982
##  4 SVM (Poly)       0.971 0.985 0.944   0.995
##  5 Logit            0.966 0.985 0.931   0.994
##  6 LDA              0.966 0.985 0.931   0.995
##  7 Neural Net       0.961 0.970 0.944   0.972
##  8 SVM              0.951 0.948 0.958   0.990
##  9 SVM (RBF)        0.951 0.948 0.958   0.990
## 10 QDA              0.947 0.940 0.958   0.991

# Visualization
ggplot(evaluations) +
  aes(x = Model, y = accuracy, fill = Model) +
  geom_col() +
  geom_text(aes(y = accuracy + 0.05, label = round(accuracy, 2))) +
  labs(x = '', y = 'Accuracy') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = 'none')

Cross-validation and bootstrapping for ROC

Only for logistic regression for demonstration purpose. You can practice also for others

# 10-fold cross-validation (CV): 10 resamples for more accurate error metrics
set.seed(1)
cv_resamples = vfold_cv(train, v = 10, strata = Cancer)

# Fit the model to cv resamples 
cv_results = fit_resamples(log_reg_wf, 
                           resamples = cv_resamples,
                           metrics = metric_set(roc_auc, accuracy),
                           control = control_resamples(save_pred = TRUE))

# See cv results: error metrics
collect_metrics(cv_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.956    10 0.0120  Preprocessor1_Model1
## 2 roc_auc  binary     0.995    10 0.00234 Preprocessor1_Model1

# Bootstrapping
boot_resamples = bootstraps(train, times = 25, strata = Cancer)

# Fit the model to boot resamples
boot_results = fit_resamples(log_reg_wf,
                             resamples = boot_resamples,
                             metrics = metric_set(roc_auc, accuracy),
                             control = control_resamples(save_pred = TRUE))

# See boot results: error metrics
collect_metrics(boot_results)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n  std_err .config             
##   <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
## 1 accuracy binary     0.954    25 0.00319  Preprocessor1_Model1
## 2 roc_auc  binary     0.991    25 0.000825 Preprocessor1_Model1

# ROC curve from CV and bootstraps
cv_preds = collect_predictions(cv_results) %>% mutate(source = 'CV')
boot_preds = collect_predictions(boot_results) %>% mutate(source = 'Bootstrap')
all_preds = bind_rows(cv_preds, boot_preds)

# ROC curve data
all_roc = all_preds %>%
  group_by(source, id) %>% 
  roc_curve(truth = Cancer, .pred_malignant, event_level = 'second')

roc_viz_model = roc_viz %>% mutate(source = 'Model')

all_roc %>% autoplot()

# Visualize
ggplot(all_roc, aes(x = 1 - specificity, y = sensitivity, color = source)) +
  geom_path(aes(group = interaction(source, id)), alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "gray50") +
  geom_path(data = roc_viz_model, linewidth = 1) + # main model's roc curve
  geom_smooth() +
  labs(
    title = "ROC Curves with Mean Overlay: 10-Fold CV vs Bootstrapping",
    x = "1 - Specificity (False Positive Rate)",
    y = "Sensitivity (True Positive Rate)",
    color = "Resampling Method"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Extra notes on model functions, engines (default, no need of any change) and tuning hyperparameters

Functions for different machine learning models in tidymodels

# Logistic Regression
logistic_reg(mode = 'classification', 
             penalty = tune(), 
             mixture = tune())

# Linear Discriminant Analysis
discrim_linear(mode = 'classification', 
               penalty = tune(), 
               regularization = tune()) 

# Quadratic Discriminant Analysis
discrim_quad(mode = 'classification', 
             regularization_method = tune()) 

# K-Nearest Neighbors
nearest_neighbor(mode = 'classification', 
                 neighbors = tune(), 
                 weight_func = tune(), 
                 dist_power = tune()) 

# Support Vector Machine (SVM) with RBF kernel
svm_rbf(mode = 'classification', 
        cost = tune(), 
        rbf_sigma = tune(), 
        margin = tune()) 

# Support Vector Machine (SVM) with Polynomial kernel
svm_poly(mode = 'classification', 
         cost = tune(), 
         degree = tune(), 
         scale_factor = tune(), 
         margin = tune()) 

# Gradient Boosting (XGBoost)
boost_tree(mode = 'classification',
  mtry = tune(),
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  stop_iter = tune()) 

# Random Forest
rand_forest(mode = 'classification',
  mtry = tune(),
  trees = tune(),
  min_n = tune()) 

# Neural Network
mlp(mode = 'classification',
  hidden_units = tune(),
  penalty = tune(),
  dropout = tune(),
  epochs = tune(),
  activation = tune(),
  learn_rate = tune())

Classroom practice codes

# Comparing different machine learning models using Cancer dataset

library(MASS)
library(jtools)
library(tidymodels)
library(tidyverse)
library(mlbench)
library(discrim)  
library(kernlab) 
library(kknn)     
library(nnet)   
library(xgboost)
library(ranger)

tidymodels_prefer()
theme_set(theme_bw())

# Load the dataset
data(BreastCancer)

summary(BreastCancer)

##       Id             Cl.thickness   Cell.size     Cell.shape  Marg.adhesion
##  Length:699         1      :145   1      :384   1      :353   1      :407  
##  Class :character   5      :130   10     : 67   2      : 59   2      : 58  
##  Mode  :character   3      :108   3      : 52   10     : 58   3      : 58  
##                     4      : 80   2      : 45   3      : 56   10     : 55  
##                     10     : 69   4      : 40   4      : 44   4      : 33  
##                     2      : 50   5      : 30   5      : 34   8      : 25  
##                     (Other):117   (Other): 81   (Other): 95   (Other): 63  
##   Epith.c.size  Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses   
##  2      :386   1      :402   2      :166   1      :443     1      :579  
##  3      : 72   10     :132   3      :165   10     : 61     2      : 35  
##  4      : 48   2      : 30   1      :152   3      : 44     3      : 33  
##  1      : 47   5      : 30   7      : 73   2      : 36     10     : 14  
##  6      : 41   3      : 28   4      : 40   8      : 24     4      : 12  
##  5      : 39   (Other): 61   5      : 34   6      : 22     7      :  9  
##  (Other): 66   NA's   : 16   (Other): 69   (Other): 69     (Other): 17  
##        Class    
##  benign   :458  
##  malignant:241  
##                 
##                 
##                 
##                 
##

# Remove ID column and NA (missing) values and rename the columns

bc = BreastCancer[-1] %>% na.omit() # remove Id and missing values (NA's)

x = bc %>% select(-Class) %>% mutate(across(everything(), as.numeric))
y = bc %>% select(Class)

BC = bind_cols(x, y)

summary(BC)

##   Cl.thickness      Cell.size        Cell.shape     Marg.adhesion  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##   Epith.c.size     Bare.nuclei      Bl.cromatin     Normal.nucleoli
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Class    
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000

df = BC %>% 
  rename(Thickness = Cl.thickness,
         Size = Cell.size,
         Shape = Cell.shape,
         Adhesion = Marg.adhesion,
         Epithelia = Epith.c.size,
         Nuclei = Bare.nuclei,
         Chromatin = Bl.cromatin,
         Nucleoli = Normal.nucleoli,
         Mitoses = Mitoses,
         Cancer = Class)

summary(df)

##    Thickness           Size            Shape           Adhesion    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##    Epithelia          Nuclei         Chromatin         Nucleoli    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Cancer   
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000

# data split
set.seed(123)
split = initial_split(df, prop = 0.7, strata = Cancer)
train_data = training(split)
test_data = testing(split)

# Recipe
recipe = recipe(Cancer ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

# Model spec, workflow, train the model
# Logistic regression
logistic_model = logistic_reg(mode = 'classification') 

workflow_logistic = workflow() %>%
  add_recipe(recipe) %>%
  add_model(logistic_model)

fit_logistic = workflow_logistic %>% fit(data = train_data)

# Linear discriminant 
lda_model = discrim_linear(mode = 'classification') 

workflow_lda = workflow() %>%
  add_recipe(recipe) %>%
  add_model(lda_model)

fit_lda = workflow_lda %>% fit(data = train_data)

# Quadratic discriminant: discrim_quad()
qda_model = discrim_quad(mode = 'classification')

workflow_qda = workflow() %>% add_recipe(recipe) %>% add_model(qda_model)

fit_qda = workflow_qda %>% fit(data = train_data)

# KNN model: nearest_neighbor()
knn_model = nearest_neighbor(mode = 'classification')

workflow_knn = workflow() %>% add_recipe(recipe) %>% add_model(knn_model)

fit_knn = workflow_knn %>% fit(data = train_data)

# SVM model - radial kernel: svm_rbf()
svm_model = svm_rbf(mode = 'classification')

workflow_svm = workflow() %>% add_recipe(recipe) %>% add_model(svm_model)

fit_svm = workflow_svm %>% fit(data = train_data)

# SVM model - polynomial kernel: svm_poly()
svm_poly_model = svm_poly(mode = 'classification')

workflow_svm_poly = workflow() %>% add_recipe(recipe) %>% add_model(svm_poly_model)

fit_svm_poly = workflow_svm_poly %>% fit(data = train_data)

##  Setting default kernel parameters

# XGB Boosting model: boost_tree()
xgb_model = boost_tree(mode = 'classification')

workflow_xgb = workflow() %>% add_recipe(recipe) %>% add_model(xgb_model)

fit_xgb = workflow_xgb %>% fit(data = train_data)

# Random forest
rf_model = rand_forest(mode = 'classification') 

workflow_rf = workflow() %>% add_recipe(recipe) %>% add_model(rf_model)

fit_rf = workflow_rf %>% fit(data = train_data)

# Neural network
nn_model = mlp(mode = 'classification') 

workflow_nn = workflow() %>% add_recipe(recipe) %>% add_model(nn_model)

fit_nn = workflow_nn %>% fit(data = train_data)

# Predict the test data using the trained model
models = list(
  "Logit" = fit_logistic,
  "LDA" = fit_lda,
  "QDA" = fit_qda,
  "KNN" = fit_knn,
  "SVM" = fit_svm,
  "SVM (Poly)" = fit_svm_poly,
  "Boosting" = fit_xgb,
  "Neural Net" = fit_nn,
  "Random Forest" = fit_rf
)

models

## $Logit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)    Thickness         Size        Shape     Adhesion    Epithelia  
##     -1.2728       1.5308      -0.2336       1.2488       1.0901      -0.2170  
##      Nuclei    Chromatin     Nucleoli      Mitoses  
##      1.3384       1.4797       1.2404       1.1794  
## 
## Degrees of Freedom: 476 Total (i.e. Null);  467 Residual
## Null Deviance:       617.7 
## Residual Deviance: 63.98     AIC: 83.98
## 
## $LDA
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_linear()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## lda(..y ~ ., data = data)
## 
## Prior probabilities of groups:
##    benign malignant 
## 0.6498952 0.3501048 
## 
## Group means:
##            Thickness       Size      Shape   Adhesion  Epithelia     Nuclei
## benign    -0.5256868 -0.6032032 -0.6027367 -0.5227893 -0.5154657 -0.6009624
## malignant  0.9758259  1.1197186  1.1188525  0.9704473  0.9568525  1.1155589
##            Chromatin   Nucleoli    Mitoses
## benign    -0.5496349 -0.5245437 -0.3100511
## malignant  1.0202804  0.9737039  0.5755440
## 
## Coefficients of linear discriminants:
##                  LD1
## Thickness 0.50029972
## Size      0.39313923
## Shape     0.35296450
## Adhesion  0.11098335
## Epithelia 0.06610106
## Nuclei    0.94791769
## Chromatin 0.24389851
## Nucleoli  0.36902658
## Mitoses   0.04143210
## 
## $QDA
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_quad()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## qda(..y ~ ., data = data)
## 
## Prior probabilities of groups:
##    benign malignant 
## 0.6498952 0.3501048 
## 
## Group means:
##            Thickness       Size      Shape   Adhesion  Epithelia     Nuclei
## benign    -0.5256868 -0.6032032 -0.6027367 -0.5227893 -0.5154657 -0.6009624
## malignant  0.9758259  1.1197186  1.1188525  0.9704473  0.9568525  1.1155589
##            Chromatin   Nucleoli    Mitoses
## benign    -0.5496349 -0.5245437 -0.3100511
## malignant  1.0202804  0.9737039  0.5755440
## 
## $KNN
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.03563941
## Best kernel: optimal
## Best k: 5
## 
## $SVM
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.746345800597587 
## 
## Number of Support Vectors : 208 
## 
## Objective Function Value : -38.5428 
## Training error : 0.002096 
## Probability model included. 
## 
## $`SVM (Poly)`
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_poly()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  1  scale =  1  offset =  1 
## 
## Number of Support Vectors : 39 
## 
## Objective Function Value : -30.5977 
## Training error : 0.02935 
## Probability model included. 
## 
## $Boosting
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## ##### xgb.Booster
## raw: 22.6 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
##     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
##     subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
##     verbose = 0, nthread = 1, objective = "binary:logistic")
## params (as set within xgb.train):
##   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 9 
## niter: 15
## nfeatures : 9 
## evaluation_log:
##   iter training_logloss
##  <num>            <num>
##      1       0.46594890
##      2       0.34020498
##    ---              ---
##     14       0.03760773
##     15       0.03358607
## 
## $`Neural Net`
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: mlp()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## a 9-5-1 network with 56 weights
## inputs: Thickness Size Shape Adhesion Epithelia Nuclei Chromatin Nucleoli Mitoses 
## output(s): ..y 
## options were - entropy fitting 
## 
## $`Random Forest`
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      477 
## Number of independent variables:  9 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.02670169

# Quick evaluation of the models

test_metrics = map_dfr(models, ~ {
  predictions = predict(.x, test_data, type = "prob") %>%
    bind_cols(predict(.x, test_data)) %>%
    bind_cols(test_data)
  metrics = metric_set(accuracy, roc_auc, sens, spec)
  metrics(predictions, truth = Cancer, estimate = .pred_class, .pred_benign)
}, .id = "Model")  



evaluations = test_metrics %>%
  select(Model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(desc(accuracy))

evaluations

## # A tibble: 9 × 5
##   Model         accuracy  sens  spec roc_auc
##   <chr>            <dbl> <dbl> <dbl>   <dbl>
## 1 SVM (Poly)       0.976 0.985 0.958   0.995
## 2 Boosting         0.976 0.985 0.958   0.991
## 3 Random Forest    0.976 0.978 0.972   0.993
## 4 KNN              0.971 0.985 0.944   0.981
## 5 Logit            0.966 0.985 0.931   0.994
## 6 LDA              0.966 0.985 0.931   0.995
## 7 Neural Net       0.956 0.978 0.917   0.960
## 8 SVM              0.951 0.948 0.958   0.991
## 9 QDA              0.947 0.940 0.958   0.991

summary(test_data$Cancer)

##    benign malignant 
##       134        72

# Extracting outputs from individual trained model
# Example: logistic regression => fit_logistic

fit_logistic

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)    Thickness         Size        Shape     Adhesion    Epithelia  
##     -1.2728       1.5308      -0.2336       1.2488       1.0901      -0.2170  
##      Nuclei    Chromatin     Nucleoli      Mitoses  
##      1.3384       1.4797       1.2404       1.1794  
## 
## Degrees of Freedom: 476 Total (i.e. Null);  467 Residual
## Null Deviance:       617.7 
## Residual Deviance: 63.98     AIC: 83.98

fit_logistic %>% tidy()

## # A tibble: 10 × 5
##    term        estimate std.error statistic p.value
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)   -1.27      0.388    -3.28  0.00104
##  2 Thickness      1.53      0.477     3.21  0.00133
##  3 Size          -0.234     0.761    -0.307 0.759  
##  4 Shape          1.25      0.824     1.52  0.129  
##  5 Adhesion       1.09      0.428     2.55  0.0108 
##  6 Epithelia     -0.217     0.427    -0.508 0.612  
##  7 Nuclei         1.34      0.461     2.90  0.00370
##  8 Chromatin      1.48      0.549     2.70  0.00700
##  9 Nucleoli       1.24      0.467     2.65  0.00794
## 10 Mitoses        1.18      0.546     2.16  0.0307

fit_logistic %>% extract_fit_engine() %>% summ(confint = TRUE)

Observations	477
Dependent variable	..y
Type	Generalized linear model
Family	binomial
Link	logit

χ²(9)	553.75
p	0.00
Pseudo-R² (Cragg-Uhler)	0.95
Pseudo-R² (McFadden)	0.90
AIC	83.98
BIC	125.65

	Est.	2.5%	97.5%	z val.	p
(Intercept)	-1.27	-2.03	-0.51	-3.28	0.00
Thickness	1.53	0.60	2.47	3.21	0.00
Size	-0.23	-1.72	1.26	-0.31	0.76
Shape	1.25	-0.37	2.86	1.52	0.13
Adhesion	1.09	0.25	1.93	2.55	0.01
Epithelia	-0.22	-1.05	0.62	-0.51	0.61
Nuclei	1.34	0.43	2.24	2.90	0.00
Chromatin	1.48	0.40	2.56	2.70	0.01
Nucleoli	1.24	0.32	2.16	2.65	0.01
Mitoses	1.18	0.11	2.25	2.16	0.03
Standard errors: MLE

fit_logistic %>% extract_fit_parsnip() %>% tidy()

## # A tibble: 10 × 5
##    term        estimate std.error statistic p.value
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)   -1.27      0.388    -3.28  0.00104
##  2 Thickness      1.53      0.477     3.21  0.00133
##  3 Size          -0.234     0.761    -0.307 0.759  
##  4 Shape          1.25      0.824     1.52  0.129  
##  5 Adhesion       1.09      0.428     2.55  0.0108 
##  6 Epithelia     -0.217     0.427    -0.508 0.612  
##  7 Nuclei         1.34      0.461     2.90  0.00370
##  8 Chromatin      1.48      0.549     2.70  0.00700
##  9 Nucleoli       1.24      0.467     2.65  0.00794
## 10 Mitoses        1.18      0.546     2.16  0.0307

# Predict test data using the trained model
log_predictions = predict(fit_logistic, test_data) %>% 
  bind_cols(predict(fit_logistic, test_data, type = 'prob')) %>%
  bind_cols(test_data)

head(log_predictions)

## # A tibble: 6 × 13
##   .pred_class .pred_benign .pred_malignant Thickness  Size Shape Adhesion
##   <fct>              <dbl>           <dbl>     <dbl> <dbl> <dbl>    <dbl>
## 1 benign        0.991              0.00937         5     1     1        1
## 2 benign        0.995              0.00465         3     1     1        1
## 3 malignant     0.00000989         1.00            8    10    10        8
## 4 benign        0.970              0.0298          1     1     1        1
## 5 benign        0.988              0.0115          2     1     1        1
## 6 benign        0.997              0.00278         4     2     1        1
## # ℹ 6 more variables: Epithelia <dbl>, Nuclei <dbl>, Chromatin <dbl>,
## #   Nucleoli <dbl>, Mitoses <dbl>, Cancer <fct>

# Collect accuracy metrics, we will this matrix
log_metrics = caret::confusionMatrix(data = log_predictions$.pred_class, 
                                      reference = log_predictions$Cancer, 
                                      positive = 'malignant')

log_metrics

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       132         5
##   malignant      2        67
##                                           
##                Accuracy : 0.966           
##                  95% CI : (0.9312, 0.9862)
##     No Information Rate : 0.6505          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9245          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9306          
##             Specificity : 0.9851          
##          Pos Pred Value : 0.9710          
##          Neg Pred Value : 0.9635          
##              Prevalence : 0.3495          
##          Detection Rate : 0.3252          
##    Detection Prevalence : 0.3350          
##       Balanced Accuracy : 0.9578          
##                                           
##        'Positive' Class : malignant       
##

# Using tidymodels
conf_mat(log_predictions, truth = Cancer, estimate = .pred_class) %>% summary()

## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.966
##  2 kap                  binary         0.925
##  3 sens                 binary         0.985
##  4 spec                 binary         0.931
##  5 ppv                  binary         0.964
##  6 npv                  binary         0.971
##  7 mcc                  binary         0.925
##  8 j_index              binary         0.916
##  9 bal_accuracy         binary         0.958
## 10 detection_prevalence binary         0.665
## 11 precision            binary         0.964
## 12 recall               binary         0.985
## 13 f_meas               binary         0.974

conf_mat(log_predictions, truth = Cancer, estimate = .pred_class) %>% autoplot(type = 'heatmap')

# Positive class 'benign', but we want 'malignant' as the positive class

metrics = metric_set(accuracy, sensitivity, specificity)

metrics(log_predictions, truth = Cancer, estimate = .pred_class, event_level = 'second')

## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.966
## 2 sensitivity binary         0.931
## 3 specificity binary         0.985

# ROC_AUC
# Data for visualization
roc_viz = roc_curve(log_predictions, truth = Cancer, .pred_malignant, event_level = 'second')

roc_viz %>% autoplot()

# Variable importance plot (vip::vip())
library(vip)
vip(fit_logistic, num_features = 10)

Machine Learning Classification Model comparison

Professor Dr. Md. Kamrul Hasan

1. K-Nearest Neighbors (KNN)

2. Linear Discriminant Analysis (LDA)

3. Quadratic Discriminant Analysis (QDA)

4. Support Vector Machine (SVM)

5. Gradient Boosting

6. Neural Networks

7. Random Forest

Practice different machine learning models

Logistic regression

Now in brief other machine learning models with along with logistic regression for comparison. Remember the steps: split > recipe > model > workflow > fit > evaluate

Quick evaluation of models using test data set

Cross-validation and bootstrapping for ROC

Only for logistic regression for demonstration purpose. You can practice also for others

Extra notes on model functions, engines (default, no need of any change) and tuning hyperparameters

Functions for different machine learning models in tidymodels

Classroom practice codes