Classification & Smoothing

Discriminant, Tree and GAM - Ch.11 of A.Agresti’s book, An Introduction to Categorical Data Analysis

Author

Javad Faradmal, AI

Published

December 16, 2024


This Chapter

  • Logistic Regression
  • Bayes Discriminant Rule (Naive Bayes).
  • Fisher’s Linear Discriminant Rule.
  • Fisher’s Quadratic Discriminant Rule.
  • KNN Discriminant Rule.
  • Classification Tree.
    • Growing and Pruning Tree.
  • Logistic Regression vs. Bayes Rule’s vs. LDA vs. QDA vs. KNN vs. Classification Tree.
  • Generalized Additive Models
  • Regularization for High-Dimensional Categorical Data

 

Data used for examples

PimaIndiansDiabetes: Pima Indians Diabetes Database

Description: A data frame with 768 observations on 9 variables.

pregnant: Number of times pregnant
glucose: Plasma glucose concentration (glucose tolerance test)
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-Hour serum insulin (mu U/ml)
mass: Body mass index
pedigree: Diabetes pedigree function
age: Age (years)
diabetes: Class variable (test for diabetes)

Code
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

 

kyphosis: Data on Children who have had Corrective Spinal Surgery

Description: The kyphosis data frame has 81 rows and 4 columns.

Kyphosis: A factor with levels absent present indicating if a kyphosis (a type of deformation) was present after the operation.
Age: Age in month
Number: The number of vertebrae involved
Start: The number of the first (topmost) vertebra operated on.

Code
data(kyphosis, package="rpart")
str(kyphosis)
## 'data.frame':    81 obs. of  4 variables:
##  $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
##  $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
##  $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
##  $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
Code
set.seed(123)

# Split the data into training and testing sets
diabetes_split <- initial_split(PimaIndiansDiabetes, prop = 0.8)
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

# Create a recipe for preprocessing
diabetes_recipe <- recipe(diabetes ~ age + glucose, data = diabetes_train) %>%
  step_normalize(all_predictors())

# Create a grid of points to visualize the decision boundary
age_range <- seq(min(PimaIndiansDiabetes$age) - 0.5, 
                max(PimaIndiansDiabetes$age) + 0.5, length.out = 200)
glucose_range <- seq(min(PimaIndiansDiabetes$glucose) - 0.5, 
                max(PimaIndiansDiabetes$glucose) + 0.5, length.out = 200)
grid <- expand.grid(age = age_range, glucose  = glucose_range)
metrics <- metric_set(yardstick::accuracy, yardstick::sensitivity, yardstick::specificity)

 

Logistic Regression

Code
# Define the logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm")

# Create a workflow
diabetes_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(logistic_model)

# Fit the logistic regression model
logistic_fit <- fit(diabetes_workflow, data = diabetes_train)

# Make predictions on the test set
logistic_predictions <- predict(logistic_fit, diabetes_test, type = "prob") %>%
  bind_cols(diabetes_test)

# Add predictions as a factor to the dataset
logistic_predictions <- logistic_predictions %>%
  mutate(
    .pred_class = factor(ifelse(.pred_pos >= 0.5, "pos", "neg"), levels = levels(diabetes))
  )

# Confusion matrix
conf_mat <- logistic_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  88  28
       pos  14  24
Code
# Evaluate the model's performance
accuracy <- logistic_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.727
2 sensitivity binary         0.863
3 specificity binary         0.462
Code
# Calculate ROC AUC
roc_auc <- roc(diabetes_test$diabetes, logistic_predictions$.pred_pos)
print(roc_auc$auc)
Area under the curve: 0.7753
Code
ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1,
           col = "red", linetype = 3) + 
  annotate("text",label=round(roc_auc$auc,3), x = 0.5,
           y = 0.65, col = "darkblue")

Code
# Predict the class for each point in the grid
grid_predictions <- predict(logistic_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "Logistic Decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()


 

Bayes Discriminant Rule (Naive Bayes)

  • Let we have G groups with \(P (obs \in g^{th} \text{ group})=\pi_g\) (priori probability).
  • Let X has the conditional probability mass/density function \(f_{\mathbf X}(\mathbf x|G=g)\).
  • Posterior probability function of \(obs \in g^{th} \text{ group} | \mathbf {X=x}\) will be:

\[P(obs \in g^{th} \text{ group }| \mathbf {X=x})=\frac{f_{\mathbf X}(\mathbf x|G=g)\times \pi_g} {f_{\mathbf X}(\mathbf x)}\]

The Bayes classifier assigns an obs. to the population with highest posterior probability:

\[d(\mathbf x) = \arg\max_g ~P(obs \in g^{th} \text{ group }| \mathbf {X=x})\]

 

If \(f_{\mathbf X}(\mathbf x|G=g) \text{ be } MVN(\mathbf \mu_g , \mathbf \Sigma)\) then \(d(\mathbf x) = \arg\max_g ~\mathbf \mu_{g}^{T} \mathbf \Sigma^{-1} \mathbf X - 0.5 \times \mathbf \mu_{g}^{T} \mathbf \Sigma^{-1} \mathbf \mu_{g} + \log \pi_g\).

If G=2, then an obs. allocates to first group if and only if \(\mathbf a^T(\mathbf X - \mathbf h) > \log \frac{\pi_1}{\pi_2}\) where \(\mathbf a=\mathbf \Sigma^{-1}(\mathbf \mu_1-\mathbf \mu_2)\) and \(\mathbf h=\frac{(\mathbf \mu_1+\mathbf \mu_2)}{2}\)

If \(f_{\mathbf X}(\mathbf x|G=g) \text{ be } \sim MVN(\mathbf \mu_g , \mathbf \Sigma)\) and \(G \sim \text{Unif}\) then ‘Linear discriminant analysis’ could be used.

If \(f_{\mathbf X}(\mathbf x|G=g) \text{ be } \sim MVN(\mathbf \mu_g , \mathbf \Sigma_g)\) and \(G \sim \text{Unif}\) then ‘Quadratic discriminant analysis’ could be used.


 

Fisher’s linear discriminant analysis

Fisher approach:

  • Looked for a linear discriminant functions.
  • Assuming no particular distribution for each population.
  • Total covariance matrix (S) decompose to:
    • Within-class covariance matrix
    • Between-class covariance matrix
  • Project the data into a lower dimensional space Z=Xa optimal for classification.
    • Optimal: Choose a so that maximise the B/W of z.

 

Totla (co)variance and its decomposition:

\[ \begin{align*} {\mathbb{V}\operatorname{ar}}(z_i)&= {\mathbb{V}\operatorname{ar}}(\mathbf a^\top \mathbf x_i)\\ &=\mathbf a^\top {\mathbb{V}\operatorname{ar}}(\mathbf x_i) \mathbf a\\ &=\mathbf a^\top \mathbf S\mathbf a\\ &= \mathbf a^\top\mathbf W\mathbf a+ \mathbf a^\top\mathbf B\mathbf a\\ \end{align*} \]

 

Optimization problem and solution:

\[ \begin{equation} \max_{\mathbf a}\frac{\mathbf a^\top \mathbf B\mathbf a}{\mathbf a^\top \mathbf W\mathbf a} \end{equation} \]

Solution: Eigenvector of \(W^{−1}B\) corresponding to the largest eigenvalue of \(W^{−1}B\)

The function \(L(\mathbf x)=\mathbf a^\top \mathbf x\) is called Fisher’s linear discriminant function. Allocate an obs. to the population g if (discriminant rule):

\(d^{Fisher}(\mathbf x) = \arg \min_k |L(\mathbf x) - L({\boldsymbol{\mu}}_k)| = \arg \min_k | \mathbf a^\top \mathbf x- \mathbf a^\top \hat{{\boldsymbol{\mu}}}_k |.\)

 

Discriminant rule

Let \(L(\mathbf x)=\mathbf a^\top \mathbf x\) and \(L(\mathbf \mu_g)=\mathbf a^\top \mathbf \mu_g\)

Allocate an obs. to the population g if:

\(d(\mathbf x) = \arg \min_g |L(\mathbf x) - L({\boldsymbol{\mu}}_g)| = \arg \min_g | \mathbf a^\top \mathbf x- \mathbf a^\top \hat{{\boldsymbol{\mu}}}_g |\)


 

Example: Pima Indians Diabetes

Code
# Define the LDA model
lda_model <- discrim_linear() %>%
  set_engine("MASS") %>% # Use the MASS package for LDA
  set_mode("classification")

# Create a workflow for LDA
lda_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(lda_model)

# Fit the LDA model
lda_fit <- fit(lda_workflow, data = diabetes_train)

# Predict class labels using LDA
lda_predictions <- predict(lda_fit, diabetes_test, type = "class") %>%
  bind_cols(predict(lda_fit, diabetes_test, type = "prob")) %>%
  bind_cols(diabetes_test)

# Evaluate the LDA model
accuracy <- lda_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.727
2 sensitivity binary         0.863
3 specificity binary         0.462
Code
# Confusion matrix for LDA
lda_conf_mat <- lda_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  88  28
       pos  14  24
Code
# Calculate ROC AUC
roc_auc <- roc(diabetes_test$diabetes, lda_predictions$.pred_pos)
print(roc_auc$auc)
Area under the curve: 0.7768
Code
ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1,
           col = "red", linetype = 3) + 
  annotate("text",label=round(roc_auc$auc,3), x = 0.5,
           y = 0.65, col = "darkblue")

Code
# Predict the class for each point in the grid
grid_predictions <- predict(lda_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "LDA decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()


 

Quadratic discriminant analysis

Quadratic Discriminant Analysis (QDA) is a statistical classification technique used in machine learning and pattern recognition to categorize observations into predefined classes. It extends the ideas of Linear Discriminant Analysis (LDA) by allowing for more flexible decision boundaries between classes. QDA is particularly useful when the assumption of equal covariance matrices across classes (as in LDA) does not hold.


Comparison with Linear Discriminant Analysis (LDA)

  • Covariance Assumption:
    • LDA: Assumes that all classes share the same covariance matrix (\(\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}\) for all \(k\)). This leads to linear decision boundaries.
    • QDA: Allows each class to have its own covariance matrix (\(\boldsymbol{\Sigma}_k\) can differ across classes), resulting in quadratic decision boundaries.
  • Flexibility:
    • LDA: Less flexible due to the shared covariance assumption. May perform poorly if the covariance matrices are significantly different across classes.
    • QDA: More flexible and can model more complex class distributions.
  • Parameter Estimation:
    • LDA: Estimates fewer parameters since the covariance matrix is shared.
    • QDA: Estimates a separate covariance matrix for each class, requiring more data to avoid overfitting.

Limitations of QDA

  • Higher Complexity: Requires estimating a separate covariance matrix for each class, leading to a larger number of parameters.
  • Data Requirements: Needs more training data to accurately estimate the covariance matrices, especially in high-dimensional feature spaces.
  • Risk of Overfitting: With limited data, QDA can overfit the training set, reducing its generalization performance.

Example: Pima Indians Diabetes

 

Code
# Define the QDA model
qda_model <- discrim_quad() %>%
  set_engine("MASS") %>% # Use the MASS package for QDA
  set_mode("classification")

# Create a workflow for QDA
qda_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(qda_model)

# Fit the QDA model
qda_fit <- fit(qda_workflow, data = diabetes_train)

# Predict class labels using QDA
qda_predictions <- predict(qda_fit, diabetes_test, type = "class") %>%
  bind_cols(predict(qda_fit, diabetes_test, type = "prob")) %>%
  bind_cols(diabetes_test)

# Evaluate the QDA model
accuracy <- qda_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.734
2 sensitivity binary         0.873
3 specificity binary         0.462
Code
# Confusion matrix for QDA
qda_conf_mat <- qda_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  89  28
       pos  13  24
Code
# Calculate ROC AUC
roc_auc <- roc(diabetes_test$diabetes, qda_predictions$.pred_pos)
print(roc_auc$auc)
Area under the curve: 0.7819
Code
ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1,
           col = "red", linetype = 3) + 
  annotate("text",label=round(roc_auc$auc,3), x = 0.5,
           y = 0.65, col = "darkblue")

Code
# Predict the class for each point in the grid
grid_predictions <- predict(qda_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)

# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "QDA Decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()


 

K-Nearest Neighbors

  • KNN is one of the simplest and most intuitive machine learning algorithms
  • KNN used for both classification and regression tasks
  • KNN can be remarkably effective, especially in scenarios where the decision boundary is irregular
  • KNN operates on the principle that similar data points exist in close proximity
  • KNN predict class of new point based on the k closest sample points
  • KNN does not build an explicit model

KNN Algorithm

  1. Choose the number of neighbors (k):
  2. Compute the distance: (e.g., Euclidean, Manhattan).
  3. Identify the nearest neighbors:
  4. Make a prediction:
    • Classification: Determine the majority class among the k neighbors.
    • For Regression: Compute the average/weighted average of the target values of the k neighbors.
  • Euclidean Distance:

    \(d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\)

  • Manhattan Distance:

    \(d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|\)

  • Minkowski Distance:

    \(d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}\)

    (For \(p = 2\), it becomes Euclidean; for \(p = 1\), it becomes Manhattan.)


Role of Number of Neighbors (k)

  • Small k (e.g., k=1):
    • Pros: Captures fine details and can model complex decision boundaries.
    • Cons: Highly sensitive to noise; may lead to overfitting.
  • Large k:
    • Pros: Smoother decision boundaries; less sensitive to noise.
    • Cons: Can underfit; may blur distinct class boundaries.

Strategies for Choosing k: Cross-Validation (select k with the best performance)


 

Example: Pima Indians Diabetes

Let K=1

Code
# Define the k-NN model
knn_model <- nearest_neighbor(neighbors = 1, weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Create a workflow
knn_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(knn_model)

# Fit the k-NN model
knn1_fit <- fit(knn_workflow, data = diabetes_train)

# Predict class labels using k-NN
knn1_predictions <- predict(knn1_fit, diabetes_test, type = "class") %>%
  bind_cols(diabetes_test)

# Evaluate the k-NN model
accuracy <- knn1_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.669
2 sensitivity binary         0.765
3 specificity binary         0.481
Code
# Confusion matrix for k-NN
knn_conf_mat <- knn1_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  78  27
       pos  24  25
Code
# Predict the class for each point in the grid
grid_predictions <- predict(knn1_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "k-NN Decision Boundary (k=1)", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()

 

Let K=5

Code
# Define the k-NN model
knn_model <- nearest_neighbor(neighbors = 5, weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Create a workflow
knn_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(knn_model)

# Fit the k-NN model
knn5_fit <- fit(knn_workflow, data = diabetes_train)

# Make predictions on the test set
knn5_predictions <- predict(knn5_fit, diabetes_test, type = "class") %>%
  bind_cols(diabetes_test)

# Confusion matrix
conf_mat <- knn5_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  87  23
       pos  15  29
Code
# Evaluate the model's performance
accuracy <- knn5_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.753
2 sensitivity binary         0.853
3 specificity binary         0.558
Code
# Predict the class for each point in the grid
grid_predictions <- predict(knn5_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "k-NN Decision Boundary (k=5)", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()

 

Let K=50

Code
# Define the k-NN model
knn_model <- nearest_neighbor(neighbors = 50, weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Create a workflow
knn_workflow <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(knn_model)

# Fit the k-NN model
knn50_fit <- fit(knn_workflow, data = diabetes_train)

# Make predictions on the test set
knn50_predictions <- predict(knn50_fit, diabetes_test, type = "class") %>%
  bind_cols(diabetes_test)

# Confusion matrix
conf_mat <- knn50_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  88  23
       pos  14  29
Code
# Evaluate the model's performance
accuracy <- knn50_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.760
2 sensitivity binary         0.863
3 specificity binary         0.558
Code
# Predict the class for each point in the grid
grid_predictions <- predict(knn50_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "k-NN Decision Boundary (k=50)", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()


Classification Tree.

Statistical Problem

Let:

  • Y be a response variable,
  • X1, … , Xp, be a set of p predictors,
  • X’s will be fixed, and Y be a random variable.

Then the problem is to establish a relationship between Y and the X’s.

Mathematically, to estimate the conditional probability or a function of it like conditional expectation.

\(P(Y=y | x_1, ... , x_p)~~~~~~E(Y=y | x_1, ... , X_p)\)

Statistician’s job is to to extract important patterns and trends, and understand “what the data says” (learning from data).

Methods for supervised learning, the two end of the spectrum:

  1. Models in one side, rely heavily on some strong assumption (Low variance and High bias).
    • Logistic regression.
  2. Models on the other side, do not rely on any rigorous assumptions. (High variance and Low bias)
    • KNN

Tree elements

 

 

The tree has 3 layers of nodes.

  • 1st layer (depth=1): the root node.
  • 2nd layer (depth=2), one internal and one external node.
  • 3rd layer (depth=3) two terminal nodes.

Parent: an internal node that partitioning in two nodes.

daughters/offspring/descendants: two nodes extracted from a parent.


Partitioning of subjects

Y: preterm/term delivery.

Covariates, X1 (age) & X13 (amount of drinking)

  • Every node is a subset of the sample.
  • Terminal nodes partitioning the sample.

 

  • Node (I) x13 ≤ c2.
  • Node (II) x13 > c2 and x1 ≤ c1.
  • Node (III) x13 > c2 and x1 > c1.

Aim of the partitioning

 

Complete homogeneity of terminal nodes is an ideal that is rarely realized.

constitute as homogeneous as possible partitioning of subjects in terms of response based on the X’s.

 

A quantitative measure of the node homogeneity is the node impurity.

A node homogeneity = \(\frac{\text{n. of women having a preterm delivery in that node}}{\text{Total n. of women in that node}}\)

 

The closer this ratio is to 0 or 1, the more homogeneous is the node.


Splitting a Node

  1. For each predictor variable:
    1. Constitute all allowable splits
    2. Select the best split based on the a goodness of a split criterion.
  2. Compare the best split of different variables and select among them (the best of the best).

Example: Pima Indians Diabetes

 

With only 2 variables

Code
# Specify the decision tree model
tree_model <- decision_tree() %>% # Use decision_tree() from the 'parsnip' package
  set_engine("rpart") %>% # Set the engine to 'rpart'
  set_mode("classification") # Set the mode to classification

# Create a workflow
tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_recipe(diabetes_recipe)

# Fit the model to the training data
tree_fit <- tree_wf %>%
  fit(data = diabetes_train)

# Extract the fitted tree
fitted_tree <- tree_fit %>%
  extract_fit_engine()

# Plot the decision tree
rpart.plot(fitted_tree)

Code
# Make predictions on the test data
tree_predictions <- predict(tree_fit, new_data = diabetes_test, type = "prob") %>%
  bind_cols(diabetes_test)

tree_predictions <- tree_predictions %>%
  mutate(
    .pred_class = factor(ifelse(.pred_pos >= 0.5, "pos", "neg"), levels = levels(diabetes))
  )

# Evaluate the model
accuracy <- tree_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.708
2 sensitivity binary         0.814
3 specificity binary         0.5  
Code
# Confusion matrix
conf_mat <- tree_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  83  26
       pos  19  26
Code
# Calculate ROC AUC
roc_auc <- roc(diabetes_test$diabetes, tree_predictions$.pred_pos)
print(roc_auc$auc)
Area under the curve: 0.7198
Code
ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1,
           col = "red", linetype = 3) + 
  annotate("text",label=round(roc_auc$auc,3), x = 0.5,
           y = 0.65, col = "darkblue")

Code
# Predict the class for each point in the grid
grid_predictions <- predict(tree_fit, new_data = grid, type = "class")
grid <- grid %>%
  mutate(Predicted = grid_predictions$.pred_class)
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "Classification Tree", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()

 

With All variables

Code
diabetes_recipe <- recipe(diabetes ~ ., data = diabetes_train) %>%
  step_normalize(all_predictors())

# Specify the decision tree model
tree_model <- decision_tree() %>% # Use decision_tree() from the 'parsnip' package
  set_engine("rpart") %>% # Set the engine to 'rpart'
  set_mode("classification") # Set the mode to classification

# Create a workflow
tree_wf <- workflow() %>%
  add_model(tree_model) %>%
  add_recipe(diabetes_recipe)

# Fit the model to the training data
tree_fit <- tree_wf %>%
  fit(data = diabetes_train)

# Extract the fitted tree
fitted_tree <- tree_fit %>%
  extract_fit_engine()

# Plot the decision tree
rpart.plot(fitted_tree)

Code
# Make predictions on the test data
tree_predictions <- predict(tree_fit, new_data = diabetes_test, type = "prob") %>%
  bind_cols(diabetes_test)

tree_predictions <- tree_predictions %>%
  mutate(
    .pred_class = factor(ifelse(.pred_pos >= 0.5, "pos", "neg"), levels = levels(diabetes))
  )

# Evaluate the model
accuracy <- tree_predictions %>%
  metrics(truth = diabetes, estimate = .pred_class) %>%
  print
# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.721
2 sensitivity binary         0.775
3 specificity binary         0.615
Code
# Confusion matrix
conf_mat <- tree_predictions %>%
  conf_mat(truth = diabetes, estimate = .pred_class) %>%
  print
          Truth
Prediction neg pos
       neg  79  20
       pos  23  32
Code
# Calculate ROC AUC
roc_auc <- roc(diabetes_test$diabetes, tree_predictions$.pred_pos)
print(roc_auc$auc)
Area under the curve: 0.7392
Code
ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1,
           col = "red", linetype = 3) + 
  annotate("text",label=round(roc_auc$auc,3), x = 0.5,
           y = 0.65, col = "darkblue")


Logistic Regression vs. LDA vs. KNN vs. Classification Tree.

 

1. Logistic Regression

Models the probability of a class using a logistic (sigmoid) function.

Assumptions:

  • Linearity: The log-odds of the outcome is a linear combination of the input features.
  • Independence: Observations are independent.
  • No or little multicollinearity: Predictors are not highly correlated.

Strengths:

  • Interpretability: Coefficients can be interpreted in terms of odds ratios.
  • Efficiency: Computationally less intensive; scales well with large datasets.
  • Probabilistic Outputs: Provides probabilities for class memberships.
  • Regularization: Extensions like Lasso or Ridge can handle multicollinearity and feature selection.
  • Handles Mixed Data Types: Can manage numerical and categorical features.
  • Robust: makes no assumption about a distribution for X.

Weaknesses:

  • Linearity Assumption: May not capture complex relationships.
  • Sensitive to Outliers: Can be skewed by extreme values.
  • Feature extraction: May require transformation or interaction terms to capture non-linearities.

2. Linear Discriminant Analysis (LDA)

A classification method that projects data onto a lower-dimensional space while preserving class separability.

Assumptions:

  • Normality: Features are normally distributed within each class.
  • Equal Covariance Matrices: All classes share the same covariance matrix.
  • Linearity: Decision boundaries are linear.

Strengths:

  • Efficiency: Works well when assumptions are met; computationally inexpensive.
  • Dimensionality Reduction: Can reduce feature space, improving visualization and reducing noise.
  • Robustness: Performs well with small sample sizes relative to feature dimensions.

Weaknesses:

  • Assumption Sensitivity: Make performance less valuable if normality or equal covariance assumptions are violated.
  • Linearity: Cannot capture non-linear relationships between features and classes.
  • Scaling: Sensitive to the scale of the features; requires standardization.

3. k-Nearest Neighbors (kNN)

A non-parametric, instance-based learning algorithm that classifies based on the majority class among the k closest training examples.

Assumptions:

  • Locality: Assumes that similar instances are near each other in feature space.

Strengths:

  • Simplicity: Easy to understand and implement.
  • Flexibility: Can handle multi-class problems and adapt to complex decision boundaries.
  • No Training Phase: Immediate classification without a separate training phase.

Weaknesses:

  • Computationally Intensive
  • Curse of Dimensionality
  • Feature Scaling: Requires careful feature scaling (e.g., normalization) to ensure fair distance calculations.

4. Classification Trees

Splits the feature space into regions based on feature values to make predictions.

Assumptions:

Strengths:

  • Interpretability: Results in a model that’s easy to visualize and understand.
  • Handles Mixed Data Types: Can manage numerical and categorical features.
  • Non-linear Relationships and Intractions
  • Feature Importance: Can provide insights into feature importance.

Weaknesses:

  • Overfitting: Specially with deep trees.
  • Instability: Small changes in data can lead to different tree structures.
  • Scalability: Can become computationally expensive with very large datasets.

Generalized Additive Models (GAMs)

Some predictors have unknown/complicated relation with response. It is possible to replace the linear effect of those explanatory variables by some smooth functions (more general structure).

+ \(\rightarrow\) Less potential incorrect conclusions because of model misspecification
- \(\rightarrow\) Need to choose among a potentially infinite number of smooth forms
- \(\rightarrow\) the number of parameters is much larger
- \(\rightarrow\) overfitting

GLM: extend LM
- Distribution of response \(\rightarrow\) Exponential family
- Link function \(\rightarrow\) Every differentiable monotone function
GAM: extend GLM
- Linear predictor function \(\rightarrow\) Smooth but additive predictor function

\(g(\mu_i)=\alpha+s(x_{i1})+\dots+s(x_{ip})\)

What are Generalized Additive Models (GAMs)?

  • Definition:
    • GAMs are an extension of Generalized Linear Models (GLMs) that allow for non-linear relationships between predictors and the response variable.
  • Key Feature:
    • Use smooth functions,sj to model the relationship between jth predictor and the response.
  • Flexibility:
    • Capture complex patterns without specifying the exact form of the non-linearity.

Components of GAMs

  • Response Variable (\(Y\)):
    • Has distribution from exponential family
  • Link function
  • Predictors:
    • Each predictor can have a non-linear effect modeled by smooth function, sj.
  • Model Structure: \(g(\mathbb{E}[Y|X]) = \alpha + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)\)

Example: GAM for Diabetes Prediction

Goal: Build a GAM to predict the probability of diabetes based on age and glucose.

Code
# Fit the GAM model on training data
gam_fit1 <- gam(diabetes ~ s(glucose) + s(age), 
                family = binomial(link = "logit"), 
                data = diabetes_train)
# Summary of the model
summary(gam_fit1)

Family: binomial 
Link function: logit 

Formula:
diabetes ~ s(glucose) + s(age)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.8617     0.1148  -7.503 6.23e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df Chi.sq  p-value    
s(glucose) 5.182  6.192  104.9  < 2e-16 ***
s(age)     2.941  3.693   32.9 1.48e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.306   Deviance explained = 26.9%
UBRE = -0.02255  Scale est. = 1         n = 614
Code
# Predict probabilities on the test set
diabetes_test$.pred_pos <- predict(gam_fit1, newdata = diabetes_test,
                                        type = "response")

# Classification using 0.5 threshold
diabetes_test$.pred_class <- factor(
  ifelse(diabetes_test$.pred_pos > 0.5, "pos", "neg")
)

# Confusion Matrix
conf_mat <- table(Predicted = diabetes_test$.pred_class,
                 Truth = diabetes_test$diabetes) %>%
  print
         Truth
Predicted neg pos
      neg  88  23
      pos  14  29
Code
# Calculate Accuracy
accuracy <- mean(diabetes_test$.pred_class == diabetes_test$diabetes)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
[1] "Accuracy: 75.97 %"
Code
# Calculate ROC AUC
roc_auc <- roc(diabetes_test$diabetes, diabetes_test$.pred_pos)
print(roc_auc$auc)
Area under the curve: 0.813
Code
ggroc(roc_auc) +
  annotate("segment", x = 1, y = 0, xend = 0, yend = 1,
           col = "red", linetype = 3) + 
  annotate("text",label=round(roc_auc$auc,3), x = 0.5,
           y = 0.65, col = "darkblue")

Code
# Predict the class for each point in the grid
grid_predictions <- predict(gam_fit1, newdata = grid, type = "response")
grid <- grid %>%
  mutate(Predicted = factor(
  ifelse(grid_predictions > 0.5, "pos", "neg")
))
# Plot the decision boundary
ggplot() +
  geom_tile(data = grid, aes(x = age, y = glucose, fill = Predicted), alpha = 0.5) + # Background color
  geom_point(data = diabetes_train, aes(x = age, y = glucose, color = diabetes), size = 1) + # Training points
  scale_fill_manual(values = c("neg" = "red", "pos" = "blue")) + # Area colors
  scale_color_manual(values = c("neg" = "red", "pos" = "blue")) + # Point colors
  labs(title = "GAM decision Boundary", x = "age", y = "glucose", fill = "Prediction") +
  theme_minimal()

Code
# Example: Visualize the effect of Glucose on Outcome
visreg(gam_fit1, "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability") +
  theme_minimal()

Code
visreg(gam_fit1, "age", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability") +
  theme_minimal()

The Bias/Variance Tradeoff

Smoothing methods could provide different degree of smoothing (using tuning parameter(s))
Greater smoothness \(\rightarrow\) decreasing the variance, increasing the bias.

Selecting an effective df for sj \(\rightarrow\) control the degree of smoothness

effective df = 3 is somewhat similar in overall complexity to a third-degree polynomial.

Code
visreg(
  update(gam_fit1,.~ s(glucose,fx=TRUE, k = 3)),
  "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=2)") +
  theme_minimal()

Code
visreg(
  update(gam_fit1,.~ s(glucose,fx=TRUE, k = 4)),
  "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=3)") +
  theme_minimal()

Code
visreg(
  update(gam_fit1,.~ s(glucose,fx=TRUE, k = 5)),
  "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=4)") +
  theme_minimal()

Code
visreg(
  update(gam_fit1,.~ s(glucose,fx=TRUE, k = 6)),
  "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=5)") +
  theme_minimal()

Code
visreg(
  update(gam_fit1,.~ s(glucose,fx=TRUE, k = 7)),
  "glucose", scale = "response", gg = TRUE) +
  ggtitle("Effect of Glucose on Diabetes Probability (edf=6)") +
  theme_minimal()


Regularization for High-Dimensional Categorical Data

Emphasizing Logistic Regression

1. Introduction

  • High-Dimensional Data: Situations where the number of predictors (p) is large relative to the number of observations (n).
  • Regularization: Techniques to prevent overfitting by adding a penalty to the model complexity.

 

High-dimensional data are increasingly common in applications in genomics, biomedical imaging and so on.


2. Penalized-Likelihood Methods and Lq-Norm Smoothing

Maximum Likelihood Estimation (MLE)

  • MLE aims to find parameter values that maximize the likelihood function.
  • In high dimensions, MLE can lead to overfitting.

overfitting: Model captures noise instead of the underlying pattern.

Penalized Likelihood

  • Objective: Penalize the log-likelihood with a penalty term, \(s(\beta)\), to reduce model complexity.
    • s decreases as elements of \(\beta\) are smoother, such as \(L_q-norm\) smoothing function (generally shrinks the ML estimates toward 0)
  • General Form: \(\hat{\beta} = \arg\max_{\beta} \left\{ L(\beta) - \lambda \|\beta\|_q \right\}\) where:
    • \(L(\beta)\): Log-likelihood function.
    • \(\lambda\): Regularization parameter.
    • \(\|\beta\|_q\): Lq-norm of the coefficient vector (\(\sum_{i=1}^p {|\beta_i|^{q}}\)).

Lq-Norm Smoothing

  • L1-Norm (Lasso) (\(q = 1\)): Promotes sparsity, leading to variable selection.
  • L2-Norm (Ridge Regression) (\(q = 2\)): Shrinks coefficients but doesn’t set them exactly to zero.
  • Elastic Net: Combines L1 and L2 penalties.

3. L1 Norm (Lasso)

  • Lasso (Least Absolute Shrinkage and Selection Operator):
    • Adds penalty proportional to the absolute value of the coefficients \(\hat{\beta} = \arg\max_{\beta} {L(\beta) - \lambda \sum_{i=1}^p {|\beta_i|}}\).
    • As \(\lambda\) increases, more of the \(\hat{\beta}\) completely shrink to zero.

choosing \(\lambda\)

using k-fold cross-validation, for some selected \(\lambda\) values:

  • split data into k (typically k= 10) part
  • each time, one part selected as test and remaining as training data
  • fit the model to training data and then check the goodness of the predictions for y in testing data
  • summary measure of prediction error is the sample mean prediction error for the k runs, for a measure of prediction error
  • select the \(\lambda\):
    • method 1: one with lowest sample mean prediction error
    • method 2: one which its mean prediction error is one standard error above the minimum one selected using 1st method (one-standard-error rule)

Example: Pima Indians Diabetes

Code
# Prepare data
X <- model.matrix(diabetes ~ ., data = PimaIndiansDiabetes)[,-1]
y <- PimaIndiansDiabetes$diabetes

# Fit Lasso logistic regression
lasso_fit <- glmnet(X, y, family = "binomial", alpha = 1)
lasso_model <- cv.glmnet(X, y, family = "binomial", alpha = 1)

plot(lasso_fit, label = TRUE, xvar = "lambda")

Code
plot(lasso_fit, xlim=c(-3.1,-1.5), ylim=c(0,0.08), xvar = "lambda")

Code
# Coefficients at best lambda
cat("\nmin lambda:\t",lasso_model$lambda.min,"\n")

min lambda:  0.003079873 
Code
coef(lasso_model, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -8.1068741379
pregnant     0.1168706886
glucose      0.0338126297
pressure    -0.0112824790
triceps      .           
insulin     -0.0008909569
mass         0.0844148631
pedigree     0.8639975750
age          0.0137171480
Code
cat("\n1se lambda:\t",lasso_model$lambda.1se,"\n")

1se lambda:  0.03459695 
Code
coef(lasso_model, s = "lambda.1se")
9 x 1 sparse Matrix of class "dgCMatrix"
                      s1
(Intercept) -5.937235394
pregnant     0.067571906
glucose      0.026149025
pressure     .          
triceps      .          
insulin      .          
mass         0.046798167
pedigree     0.268841839
age          0.004099019
Code
cat("\nselected lambda:\t", 0.1,"\n")

selected lambda:     0.1 
Code
coef(lasso_model, s = .1)
9 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) -2.99982384
pregnant     .         
glucose      0.01738912
pressure     .         
triceps      .         
insulin      .         
mass         0.00721427
pedigree     .         
age          .         

4. Why Shrink Maximum Likelihood Estimates Toward 0?

Study with large number of Xs’ \(\rightarrow\) most of them have no/minor effects

For instance: in genetic association studies (thousands of genes as X and whether a person has a particular disease as Y), \(|MLE(\beta_j)|\) tend to be much larger than \(|\beta_j|\) (sampling variation, collinearity, quasi-complete or complete separation, sparseness and …).

Benefits of Shrinkage

  • Variance-Bias Tradeoff: Shrinking coefficients reduces variance at the cost of a small increase in bias.
  • Improved Prediction Accuracy: By reducing variance.
  • Model Interpretability: Simplifies the model by selecting relevant features.
  • Numerical Stability: Avoids issues with multicollinearity.

Disadvantage of Lasso

  • May overly penalize \(\beta_j\) that is truly large.
  • \(\hat{\beta_j}\) obtained from Lasso methods do not have approximate normal sampling distribution
  • \(\hat{\beta_j}\) obtained from Lasso methods is bias

5. Variable Selection Methods (Dimension Reduction)

Variable selection methods for large p:

  • variable-elimination methods: stepwise methods and the lasso.
  • replaces the p explanatory variables by a small set of artificial uncorrelated variables (principal components)
  • identify the relevant effects using standard significance tests but with an adjustment for the possibly huge number of tests conducted: One way to do this uses the false discovery rate (FDR).

6. Controlling the False Discovery Rate

Multiple Comparisons Problem

If p be large and few variables have an effects, testing \(H_0: \beta_j = 0\) at \(\alpha = 0.05\) for each explanatory variable:

  • without adjustment \(\rightarrow\) probability of type I error is more than nominal level \(\rightarrow\) most of detected effects are type I error.
  • With Bonferroni adjustment \(\rightarrow\) power may be very low for detect true effects

Recall: if g be the number of tests, Bonferroni’s method use \(\alpha/g\) as the size for each individual test.

  • False Discoveries: Incorrectly identifying a variable as significant.

Control False Discovery Rate (FDR)

  • multiple inference method that addresses FDR issue (“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, …)
  • Benjamini-Hochberg Procedure method:
    • Sort p-value of g tests (\(P_{(1)},\dots,P_{(j)},\dots,P_{(g)}\)).
    • compute adjusted p-value, \(p^*\), and compare with \(\alpha\).
      • \(p_{j}^*= \min(1,~P_j*g/j)\)
      • If for some i<j, \(p_{i}^*>p_{j}^* \Rightarrow p_{i}^* = p_{j}^*\)
j p-value \(p^*\)
1 0.0001 0.0015
2 0.0004 0.0030
3 0.0019 0.0095
4 0.0095 0.0356
5 0.0200 0.0600
6 0.0280 0.0638
7 0.0300 0.0638
8 0.0340 0.0638
9 0.0460 0.0767
10 0.3200 0.4800
11 0.4300 0.5864
12 0.5700 0.7125
13 0.6500 0.7500
14 0.7600 0.8143
15 1.0000 1.000
Code
pvals <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.028, 0.030,0.034,
           0.046, 0.32, 0.43, 0.57, 0.65, 0.76, 1.00)
cat("\nAdjusted p* using Benjamini-Hochberg Procedure\n")

Adjusted p* using Benjamini-Hochberg Procedure
Code
p.adjust(pvals, method=c("BH"))
 [1] 0.00150000 0.00300000 0.00950000 0.03562500 0.06000000 0.06375000
 [7] 0.06375000 0.06375000 0.07666667 0.48000000 0.58636364 0.71250000
[13] 0.75000000 0.81428571 1.00000000

END