Chapter 9. Ensemble Methods

Ensemble methods in machine learning refer to techniques that combine multiple individual models (often called “base learners” or “weak learners”) to create a more robust and accurate predictive model. The core idea is that by aggregating the predictions from several models, the ensemble can reduce errors, improve generalization, and handle variance or bias better than a single model. This is inspired by the “wisdom of the crowd” concept, where diverse opinions lead to better decisions.

Key benefits of ensembles:

  • Reduced Overfitting: By averaging or voting across models, ensembles mitigate the risk of a single model fitting noise in the data.
  • Improved Accuracy: They often outperform individual models on complex tasks.
  • Handling Bias-Variance Tradeoff: Different ensemble types address high bias (e.g., boosting) or high variance (e.g., bagging).

Common types include:

  • Bagging (Bootstrap Aggregating): Builds multiple models on bootstrapped subsets of the data and aggregates predictions (e.g., via averaging or voting).
  • Boosting: Sequentially builds models, where each new model focuses on correcting the errors of the previous ones.
  • Random Forest: A specific type of bagging that uses decision trees with random feature subsets for added diversity.

Ensembles are widely used in classification, regression, and other tasks, and libraries like scikit-learn (Python) or caret/randomForest (R) make them easy to implement.

9.1 Bagging (with Decision Trees)

Bagging involves training multiple decision trees on bootstrapped samples and aggregating their predictions.

How Does Bagging Work?

  1. Bootstrap Sampling: Create multiple datasets by sampling with replacement from the original training data. Each bootstrap sample has the same size as the original dataset.

  2. Model Training: Train a separate model \(h\) on each bootstrap sample. All models use the same learning algorithm.

  3. Aggregation: For classification, combine predictions by majority voting; for regression, average the predictions (a minimal sketch follows below).
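
To make these three steps concrete, here is a minimal R sketch (not the adabag implementation used later in this chapter) that bags decision stumps on the toy data from the next subsection; best_stump() and the voting code are illustrative helpers of our own.

# Minimal bagging sketch: bootstrap + decision stumps + majority vote
set.seed(1)
x <- 1:5
y <- c(1, 1, -1, -1, 1)

best_stump <- function(x, y) {
  thresholds <- c(1.5, 2.5, 3.5, 4.5)
  best <- NULL
  for (th in thresholds) for (s in c(1, -1)) {      # s is the orientation of the rule
    pred <- ifelse(x <= th, s, -s)
    err  <- mean(pred != y)
    if (is.null(best) || err < best$err) best <- list(th = th, s = s, err = err)
  }
  best
}

B <- 25                                             # number of bootstrap samples
stumps <- lapply(1:B, function(b) {
  idx <- sample(length(x), replace = TRUE)          # step 1: bootstrap sample (same size, with replacement)
  best_stump(x[idx], y[idx])                        # step 2: fit one stump per sample
})

# Step 3: each stump votes on every x; the ensemble prediction is the sign of the vote total
votes <- sapply(stumps, function(st) ifelse(x <= st$th, st$s, -st$s))
H <- sign(rowSums(votes))
data.frame(x = x, true = y, bagged = H)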

9.1.1 Step-by-step Example by Hand

Here is a step-by-step example by hand for the data: x = 1,2,3,4,5, y = +1,+1,-1,-1,+1.

For the bootstrap sample x = {3,1,3,5,4}, y = {-1,+1,-1,+1,-1}, the steps are shown in the following table:

kable(data.frame(
  Threshold = c("1.5", "2.5", "3.5", "4.5", "1.5", "2.5", "3.5", "4.5"),
  Rule = c(
    "+1 if x <= 1.5, else -1",
    "+1 if x <= 2.5, else -1",
    "+1 if x <= 3.5, else -1",
    "+1 if x <= 4.5, else -1",
    "-1 if x <= 1.5, else +1",
    "-1 if x <= 2.5, else +1",
    "-1 if x <= 3.5, else +1",
    "-1 if x <= 4.5, else +1"
  ),
  `Correct Points` = c("1,3,3,4", "1,3,3,4", "1,4", "1", "5", "5", "3,3,5", "3,3,4,5"),
  `Wrong Points` = c("5", "5", "3,3,5", "3,3,4,5", "1,3,3,4", "1,3,3,4", "1,4", "1"),
  Error = c("1", "1", "3", "4", "4", "4", "2", "1")
),
col.names = c("Threshold", "Rule", "Correct Points", "Wrong Points", "No. of Errors"),
caption = "Bootstrap Sample: x = {3,1,3,5,4}, y = {-1,+1,-1,+1,-1}",
escape = FALSE, align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(0, bold = TRUE, background = "#2c3e50", color = "white") %>%
  row_spec(c(1,2,8), background = "#d4edda", bold = TRUE)  # Best rules (1 error)
Bootstrap Sample: x = {3,1,3,5,4}, y = {-1,+1,-1,+1,-1}
Threshold Rule Correct Points Wrong Points No. of Errors
1.5 +1 if x <= 1.5, else -1 1,3,3,4 5 1
2.5 +1 if x <= 2.5, else -1 1,3,3,4 5 1
3.5 +1 if x <= 3.5, else -1 1,4 3,3,5 3
4.5 +1 if x <= 4.5, else -1 1 3,3,4,5 4
1.5 -1 if x <= 1.5, else +1 5 1,3,3,4 4
2.5 -1 if x <= 2.5, else +1 5 1,3,3,4 4
3.5 -1 if x <= 3.5, else +1 3,3,5 1,4 2
4.5 -1 if x <= 4.5, else +1 3,3,4,5 1 1

A different table can be constructed for another bootstrap sample: \(\{2, 5, 1, 3, 5\} \rightarrow y = \{+1, +1, +1, -1, +1\}\)

kable(data.frame(
  Threshold = c("1.5", "2.5", "3.5", "4.5", "1.5", "2.5", "3.5", "4.5"),
  Rule = c(
    "+1 if x <= 1.5, else -1",
    "+1 if x <= 2.5, else -1",
    "+1 if x <= 3.5, else -1",
    "+1 if x <= 4.5, else -1",
    "-1 if x <= 1.5, else +1",
    "-1 if x <= 2.5, else +1",
    "-1 if x <= 3.5, else +1",
    "-1 if x <= 4.5, else +1"
  ),
  `Correct Points` = c("1,3", "1,2,3", "1,2", "1,2", "2,5,5", "5,5", "3,5,5", "3,5,5"),
  `Wrong Points` = c("2,5,5", "5,5", "3,5,5", "3,5,5", "1,3", "1,2,3", "1,2", "1,2"),
  Error = c("3", "2", "3", "3", "2", "3", "2", "2")
),
col.names = c("Threshold", "Rule", "Correct Points", "Wrong Points", "No. of Errors"),
caption = "Bootstrap Sample 2: x = {1,2,3,5,5}, y = {+1,+1,-1,+1,+1}",
escape = FALSE, align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE, background = "#2c3e50", color = "white") %>%
  row_spec(c(2,5,7,8), background = "#d4edda", bold = TRUE)  # Best rules (2 errors)
Bootstrap Sample 2: x = {1,2,3,5,5}, y = {+1,+1,-1,+1,+1}
Threshold Rule Correct Points Wrong Points No. of Errors
1.5 +1 if x <= 1.5, else -1 1,3 2,5,5 3
2.5 +1 if x <= 2.5, else -1 1,2,3 5,5 2
3.5 +1 if x <= 3.5, else -1 1,2 3,5,5 3
4.5 +1 if x <= 4.5, else -1 1,2 3,5,5 3
1.5 -1 if x <= 1.5, else +1 2,5,5 1,3 2
2.5 -1 if x <= 2.5, else +1 5,5 1,2,3 3
3.5 -1 if x <= 3.5, else +1 3,5,5 1,2 2
4.5 -1 if x <= 4.5, else +1 3,5,5 1,2 2

When several decision stumps achieve the same (lowest) in-sample error (the first bootstrap sample has 3 tied best rules and the second has 4), we break the tie by preferring stumps of the form “+1 if x ≤ threshold, else -1” and, among those, choosing the lowest threshold. This gives the rules with thresholds 1.5 and 2.5 highlighted above.

Thus, we have the final two rules (stumps or models) summarized in the following table:

bag_table <- data.frame(
  Model = c("$h_1$", "$h_2$"),
  `Bootstrap Sample` = c("{1,3,3,4,5}", "{1,2,3,5,5}"),
  Threshold = c("1.5", "2.5"),
  Rule = c(
    "+1 if x \\leq 1.5, else -1",
    "+1 if x \\leq 2.5, else -1"
  ),
  `In-sample Error` = c("1/5", "2/5")
)

kable(bag_table,
      col.names = c("Model", "Bootstrap Sample", "Threshold", "Rule", "In-sample Error"),
      caption = "Bagging: 2 Decision Stumps (Corrected)",
      escape = FALSE,
      align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE, background = "#2980b9", color = "white") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, width = "10em") %>%
  column_spec(4, width = "12em")
Bagging: 2 Decision Stumps (Corrected)
Model Bootstrap Sample Threshold Rule In-sample Error
\(h_1\) {1,3,3,4,5} 1.5 +1 if x ≤ 1.5, else -1 1/5
\(h_2\) {1,2,3,5,5} 2.5 +1 if x ≤ 2.5, else -1 2/5

Bagging: Majority Vote (two models \(h_1\) and \(h_2\), one based on each bootstrap sample)

kable(data.frame(
  x = 1:5,
  `True y` = c("+1", "+1", "-1", "-1", "+1"),
  `h1(x)`  = c("+1", "-1", "-1", "-1", "-1"),
  `h2(x)`  = c("+1", "+1", "-1", "-1", "-1"),
  `H(x)`   = c("+1", "+1 (tie → H)", "-1", "-1", "-1"),
  Correct  = c("Yes", "Yes", "Yes", "Yes", "No")
),
col.names = c("x", "True y", "h₁(x)", "h₂(x)", "H(x)", "Correct?"),
caption = "Final Bagging Prediction (Tie: H=+1, T=-1)",
align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "#2c3e50", color = "white") %>%
  row_spec(5, background = "#f8d7da") %>%  # Highlight error
  column_spec(1, width = "6em") %>%
  column_spec(2:6, width = "10em")
Final Bagging Prediction (Tie: H=+1, T=-1)
x True y h₁(x) h₂(x) H(x) Correct?
1 +1 +1 +1 +1 Yes
2 +1 -1 +1 +1 (tie → H) Yes
3 -1 -1 -1 -1 Yes
4 -1 -1 -1 -1 Yes
5 +1 -1 -1 -1 No

Accuracy: 4/5 = 80%. Tie-breaking: a coin toss (Heads → +1, Tails → -1) is used in case of a tie, as at x = 2 above.
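
The table above can be checked with a couple of lines of R; here the tie at x = 2 is resolved in favour of +1, matching the coin-toss outcome assumed in the table.

# Verify the majority vote of the two hand-derived stumps
x  <- 1:5
y  <- c(1, 1, -1, -1, 1)
h1 <- ifelse(x <= 1.5, 1, -1)   # stump from bootstrap sample 1
h2 <- ifelse(x <= 2.5, 1, -1)   # stump from bootstrap sample 2
H  <- sign(h1 + h2)
H[H == 0] <- 1                  # break the tie at x = 2 in favour of +1 ("heads")
mean(H == y)                    # 0.8, i.e. 4/5 correct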

9.1.2 How can we do bagging using the R package “adabag”?

library(adabag)   # provides bagging()
library(caret)    # provides confusionMatrix()

set.seed(42)  # For reproducibility
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]

# Train bagging model (uses decision trees by default)
bagging_model <- bagging(Species ~ ., data = train_data, nbagg = 25)  # nbagg: number of bootstrap samples

# Predict on test data
predictions <- predict(bagging_model, newdata = test_data)$class |> factor()

confusionMatrix(predictions, test_data$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         14         1
##   virginica       0          1        17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 2.842e-15       
##                                           
##                   Kappa : 0.9324          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9444
## Specificity                 1.0000            0.9667           0.9630
## Pos Pred Value              1.0000            0.9333           0.9444
## Neg Pred Value              1.0000            0.9667           0.9630
## Prevalence                  0.2667            0.3333           0.4000
## Detection Rate              0.2667            0.3111           0.3778
## Detection Prevalence        0.2667            0.3333           0.4000
## Balanced Accuracy           1.0000            0.9500           0.9537

9.2 Boosting (AdaBoost)

AdaBoost (Adaptive Boosting) is a boosting algorithm that weights misclassified instances higher in subsequent models. I’ll use the adabag package for AdaBoost. (Alternatively, for XGBoost, see the next section.)

The following shows the steps of how AdaBoost works, given training data

\[\{(x_1, y_1), \dots, (x_N, y_N)\}, \text{ where } y_i \in \{-1, +1\}.\]

  • Initialize weights: \(w_i^{(1)} = \frac{1}{N}\) for \(i = 1, 2, \dots, N\)

  • For t = 1 to T:

    • Find weak classifier \(h_t\) that minimizes the weighted error \(\epsilon_t\): \(\epsilon_t = \sum_{i=1}^{N} w_i^{(t)} \mathbf{1}\{h_t(x_i) \neq y_i\}\)

    • Compute classifier weight: \(\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)\)

    • Calculate normalization factor: \(Z_t = \sum_{i=1}^{N} w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))\)

    • Update and then normalize weights: \(w_i^{(t+1)} = \frac{w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))}{Z_t}\)

  • Output final strong classifier: \(H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)\) (implemented in the sketch below)
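
The steps above translate almost line by line into R. The following compact sketch, written for the one-dimensional toy data worked by hand in the next subsection, uses decision stumps as weak learners; adaboost_stumps() is an illustrative helper of our own, not the adabag implementation.

# Minimal AdaBoost sketch with decision-stump weak learners (single feature x)
adaboost_stumps <- function(x, y, T = 2) {
  N <- length(x)
  w <- rep(1 / N, N)                                 # initialize weights
  thresholds <- (head(sort(unique(x)), -1) + tail(sort(unique(x)), -1)) / 2
  models <- vector("list", T)
  for (t in 1:T) {
    best <- NULL
    for (th in thresholds) for (s in c(1, -1)) {     # search all stumps
      pred <- ifelse(x <= th, s, -s)
      err  <- sum(w[pred != y])                      # weighted error epsilon_t
      if (is.null(best) || err < best$err) best <- list(th = th, s = s, err = err)
    }
    alpha <- 0.5 * log((1 - best$err) / best$err)    # classifier weight alpha_t
    pred  <- ifelse(x <= best$th, best$s, -best$s)
    w     <- w * exp(-alpha * y * pred)
    w     <- w / sum(w)                              # normalize (divide by Z_t)
    models[[t]] <- c(best, alpha = alpha)
  }
  models
}

x <- 1:5; y <- c(1, 1, -1, -1, 1)
fit <- adaboost_stumps(x, y, T = 2)
sapply(fit, unlist)   # thresholds 2.5 and 4.5, alphas about 0.693 and 0.549

# Final classifier: sign of the alpha-weighted sum of stump predictions
sign(Reduce(`+`, lapply(fit, function(m) m$alpha * ifelse(x <= m$th, m$s, -m$s))))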

9.2.1 Step-by-step Example by Hand

Assume the following training data:

Fake data
x y
1 1
2 1
3 -1
4 -1
5 1

Weak Classifiers: Decision stumps (threshold-based binary classifiers on a single feature) of the form “If \(x \le \theta\) then \(+1\) else \(-1\)” and “If \(x \le \theta\) then \(-1\) else \(+1\)”.

Step 1: Use multiple iterations to update weights. Initial weights for all instances are equal (uniform):

\[w_1^{(1)}=w_2^{(1)}=w_3^{(1)}=w_4^{(1)}=w_5^{(1)}=\frac{1}{5}=0.2\]

In each iteration, train all possible weak classifiers (decision stumps), each defined by one feature (predictor) and a threshold. Since we only have one feature, the candidate thresholds are the midpoints between consecutive x values:

  • Between 1 and 2 → threshold = 1.5
  • Between 2 and 3 → threshold = 2.5
  • Between 3 and 4 → threshold = 3.5
  • Between 4 and 5 → threshold = 4.5

For each threshold, we try two versions:

  • Predict +1 if \(x\le\) threshold, else -1
  • Predict -1 if \(x\le\) threshold, else +1

We compute weighted error for each.

Try threshold = 1.5

Rule A: Predict +1 if \(x\le 1.5\), else -1 → point 1 → +1 (correct); points 2, 3, 4, 5 → -1.

Point 2: true +1 → misclassified. Point 3: true -1 → correct. Point 4: true -1 → correct. Point 5: true +1 → misclassified. Misclassified: 2, 5 → total weight: 0.2 + 0.2 = 0.4.

Rule B: Predict -1 if \(x\le 1.5\), else +1 → point 1 → -1 (wrong); others → +1.

Point 1: true +1 → misclassified. Point 2: true +1 → correct. Point 3: true -1 → misclassified. Point 4: true -1 → misclassified. Point 5: true +1 → correct. Misclassified: 1, 3, 4 → 0.2 × 3 = 0.6.

→ Better is Rule A: error = 0.4

Try threshold = 2.5

Rule A: +1 if \(x\le 2.5\), else -1 → points 1, 2 → +1 (both correct); points 3, 4, 5 → -1.

Point 3: -1 → correct. Point 4: -1 → correct. Point 5: +1 → misclassified. Misclassified: only 5 → weight = 0.2.

Rule B: -1 if \(x\le 2.5\), else +1 → points 1, 2 → -1 (both wrong); points 3, 4, 5 → +1.

Point 3: -1 → wrong. Point 4: -1 → wrong. Point 5: +1 → correct. Misclassified: 1, 2, 3, 4 → 0.8.

→ Rule A wins: error = 0.2

Try threshold = 3.5

Rule A: +1 if \(x\le 3.5\), else -1 → points 1, 2, 3 → +1; points 4, 5 → -1.

Points 1, 2: +1 → correct. Point 3: -1 → wrong. Point 4: -1 → correct. Point 5: +1 → wrong. Misclassified: 3, 5 → 0.4.

Rule B: -1 if \(x\le 3.5\), else +1 → points 1, 2, 3 → -1; points 4, 5 → +1.

Points 1, 2: wrong. Point 3: correct. Point 4: wrong. Point 5: correct. Misclassified: 1, 2, 4 → 0.6.

→ Best: Rule A, error = 0.4

Try threshold = 4.5

Rule A: +1 if \(x\le 4.5\), else -1 → points 1, 2, 3, 4 → +1; point 5 → -1.

Points 1, 2: correct. Points 3, 4: wrong. Point 5: -1 → wrong. Misclassified: 3, 4, 5 → 0.6.

Rule B: -1 if \(x\le 4.5\), else +1 → points 1, 2, 3, 4 → -1; point 5 → +1.

Points 1, 2: wrong. Points 3, 4: correct. Point 5: +1 → correct. Misclassified: 1, 2 → 0.4.

→ Best: Rule B, error = 0.4

Best stump in iteration 1:

Threshold = 2.5, predict +1 if \(x\le 2.5\), else -1 → misclassifies only point 5. Weighted error: \(\varepsilon_1 = 0.2\).

Calculate Classifier Weight \(\alpha_1\) for \(h_1(x)\):

\[\alpha_1 = \frac{1}{2}\ln\frac{1-\epsilon_1}{\epsilon_1}=\frac{1}{2}\ln\frac{1-0.2}{0.2}=0.6931\]

Calculate Normalization Factor \(Z_1\).

  • Misclassified instance (5): \(w_5 \times e^{\alpha_1} = 0.2 \times e^{0.6931} = 0.2 \times 2 = 0.4\)

  • Correctly classified instances: \(w_i \times e^{-\alpha_1} = 0.2 \times e^{-0.6931} = 0.2 \times 0.5 = 0.1\)

\(Z_1=0.1+0.1+0.1+0.1+0.4=0.8\)

Update Weights.

  • Correct instances (1,2,3,4): \(w_i^{(2)} = \frac{0.1}{0.8} = 0.125\)

  • Misclassified instance (5): \(w_5^{(2)} = \frac{0.4}{0.8} = 0.5\)

The updated weights are: \([0.125, 0.125, 0.125, 0.125, 0.5]\)
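
These numbers are easy to check with plain arithmetic in R (no package needed):

# Check alpha_1, Z_1 and the updated weights for iteration 1
w1      <- rep(0.2, 5)
eps1    <- 0.2
alpha1  <- 0.5 * log((1 - eps1) / eps1)          # 0.6931
correct <- c(TRUE, TRUE, TRUE, TRUE, FALSE)      # only point 5 is misclassified
unnorm  <- w1 * exp(ifelse(correct, -alpha1, alpha1))
Z1      <- sum(unnorm)                           # 0.8
round(unnorm / Z1, 3)                            # 0.125 0.125 0.125 0.125 0.500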

Repeat the above iteration. The following is the second Iteration (t = 2).

Weights: [0.125, 0.125, 0.125, 0.125, 0.5]. Try the four thresholds again.

Threshold = 1.5

Rule A: +1 if x ≤ 1.5, else -1 → point 1 → correct; others → -1.

Point 2: +1 → wrong. Point 3: -1 → correct. Point 4: -1 → correct. Point 5: +1 → wrong. Misclassified: 2, 5 → 0.125 + 0.5 = 0.625.

Rule B: -1 if x ≤ 1.5, else +1 → point 1 wrong; others → +1 → 2 correct, 3 wrong, 4 wrong, 5 correct. Misclassified: 1, 3, 4 → 0.125 × 3 = 0.375.

→ Better: Rule B, error = 0.375

Threshold = 2.5

Rule A: +1 if x ≤ 2.5, else -1 → 1, 2 correct; 3, 4, 5 → -1 → 3, 4 correct, 5 wrong. Misclassified: 5 → 0.5.

Rule B: -1 if x ≤ 2.5, else +1 → 1, 2 wrong; 3, 4, 5 → +1 → 3, 4 wrong, 5 correct. Misclassified: 1, 2, 3, 4 → 0.125 × 4 = 0.5.

→ Best: error = 0.5 (either rule)

Threshold = 3.5

Rule A: +1 if x ≤ 3.5, else -1 → 1, 2 correct, 3 wrong; 4, 5 → -1 → 4 correct, 5 wrong. Misclassified: 3, 5 → 0.125 + 0.5 = 0.625.

Rule B: -1 if x ≤ 3.5, else +1 → 1, 2, 3 → -1 → 1, 2 wrong, 3 correct; 4, 5 → +1 → 4 wrong, 5 correct. Misclassified: 1, 2, 4 → 0.125 × 3 = 0.375.

→ Best: Rule B, error = 0.375

Threshold = 4.5

Rule A: +1 if x ≤ 4.5, else -1 → 1, 2, 3, 4 → +1 → 1, 2 correct, 3, 4 wrong; 5 → -1 → wrong. Misclassified: 3, 4, 5 → 0.125 + 0.125 + 0.5 = 0.75.

Rule B: -1 if x ≤ 4.5, else +1 → 1, 2, 3, 4 → -1 → 1, 2 wrong, 3, 4 correct; 5 → +1 → correct. Misclassified: 1, 2 → 0.125 + 0.125 = 0.25 ← best so far!

Best stump in iteration 2: threshold = 4.5, predict -1 if x ≤ 4.5, else +1 → misclassifies only points 1 and 2. Weighted error: \(\varepsilon_2 = 0.25\).

\(\alpha_2=\frac{1}{2}\ln\frac{1-0.25}{0.25}=0.5493\)

Update Weights.

Calculate Normalization Factor \(Z_2\).

  • Misclassified instances (1,2): \(0.125 \times e^{0.5493} = 0.125 \times 1.732 = 0.2165\)

  • Correctly classified instances (3,4): \(0.125 \times e^{-0.5493} = 0.125 \times 0.577 = 0.0721\)

  • Correctly classified instance (5): \(0.5 \times e^{-0.5493} = 0.5 \times 0.577 = 0.2885\)

\(Z_2=0.2165+0.2165+0.0721+0.0721+0.2885=0.8657\)

Calculate New Weights

  • \(w_1^{(3)} = 0.2165/0.8657 \approx 0.25\)

  • \(w_2^{(3)} = 0.2165/0.8657 \approx 0.25\)

  • \(w_3^{(3)} = 0.0721/0.8657 \approx 0.083\)

  • \(w_4^{(3)} = 0.0721/0.8657 \approx 0.083\)

  • \(w_5^{(3)} = 0.2885/0.8657 \approx 0.333\)

For demonstration purposes, we just do two iterations. So, the final weights are: \([0.25, 0.25, 0.083, 0.083, 0.333]\)

Step 2: The final ensemble classifier is a weighted combination of our weak classifiers:

\[H(x)=\text{sign}\left(0.6931\cdot h_1(x)+0.5493\cdot h_2(x)\right)\]

Where:

\(h_1(x)\): “If \(x \le 2.5\) then \(+1\) else \(-1\)”

\(h_2(x)\): “If \(x \le 4.5\) then \(-1\) else \(+1\)”

verification <- data.frame(
  Instance = 1:5,
  x = 1:5,
  h1 = c(1, 1, -1, -1, -1),
  h2 = c(-1, -1, -1, -1, 1),
  WeightedSum = c(0.144, 0.144, -1.24, -1.24, -0.144),
  FinalPred = c(1, 1, -1, -1, -1),
  TrueLabel = c(1, 1, -1, -1, 1),
  Correct = c("✓", "✓", "✓", "✓", "✗")
)

kable(verification, caption = "Final Classifier Verification") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Final Classifier Verification
Instance x h1 h2 WeightedSum FinalPred TrueLabel Correct
1 1 1 -1 0.144 1 1 ✓
2 2 1 -1 0.144 1 1 ✓
3 3 -1 -1 -1.240 -1 -1 ✓
4 4 -1 -1 -1.240 -1 -1 ✓
5 5 -1 1 -0.144 -1 1 ✗

Training Accuracy: 4/5 = 80%.

AdaBoost Iteration Summary (2 Iterations)
Round Threshold Rule ε (Error) α (Weight) Misclassified
1 2.5 +1 if ≤2.5, else -1 0.2 0.693 {5}
2 4.5 -1 if ≤4.5, else +1 0.25 0.549 {1,2}

9.2.2 Adaboosting with the adabag Package

The following shows how the R package adabag does adaboosting with the iris data.

if (!require(adabag)) install.packages("adabag")
library(adabag)

# Train AdaBoost model
adaboost_model <- boosting(Species ~ ., data = train_data, boos = TRUE, mfinal = 50)  # mfinal: number of iterations

# Predict on test data
predictions2 <- predict(adaboost_model, newdata = test_data)$class |> factor()

confusionMatrix(predictions2, test_data$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         14         1
##   virginica       0          1        17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 2.842e-15       
##                                           
##                   Kappa : 0.9324          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9444
## Specificity                 1.0000            0.9667           0.9630
## Pos Pred Value              1.0000            0.9333           0.9444
## Neg Pred Value              1.0000            0.9667           0.9630
## Prevalence                  0.2667            0.3333           0.4000
## Detection Rate              0.2667            0.3111           0.3778
## Detection Prevalence        0.2667            0.3333           0.4000
## Balanced Accuracy           1.0000            0.9500           0.9537

9.3 XGBoost

XGBoost (short for eXtreme Gradient Boosting) is a powerful, scalable machine learning library for gradient-boosted decision trees (GBDT). It’s widely used for classification, regression, and ranking tasks in data science and machine learning competitions (like Kaggle). XGBoost builds many weak decision trees sequentially — each one learning from the mistakes of the previous ones — to create a strong, accurate predictive model.

Some Tips:

  • The name “eXtreme” comes from its speed and performance — it was designed to push gradient boosting to the extreme.
  • No factor handling – convert factors to dummy variables yourself.
  • Missing values – XGBoost natively handles NA; just keep them.

Why Do People Love XGBoost?

  • Wins competitions (used in ~50% of Kaggle winning solutions)
  • Robust to overfitting
  • Handles messy real-world data
  • Interpretable (feature importance, SHAP)
  • Fast training (even on large datasets)

Example: Predict Iris flower species using petal/sepal measurements

Step 1: Load packages and data

library(xgboost)
library(data.table)
library(ggplot2)
library(caret)  # for confusion matrix

Step 2: Prepare Data for XGBoost

XGBoost needs:

  • Features: numeric matrix
  • Labels: numeric (0-based for classification)
# Features (all except Species)
X <- iris[, 1:4] %>% as.matrix()

# Labels: convert factor to 0,1,2
y <- as.numeric(iris$Species) - 1   # setosa=0, versicolor=1, virginica=2

# Train-test split (70-30)
set.seed(123)
train_idx <- sample(1:nrow(iris), size = 0.7 * nrow(iris))

X_train <- X[train_idx, ]
X_test  <- X[-train_idx, ]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Create xgb.DMatrix (XGBoost's special format)
dtrain <- xgb.DMatrix(data = X_train, label = y_train)

dtest  <- xgb.DMatrix(data = X_test,  label = y_test)

Step 3: Train an XGBoost Model

# Parameters
params <- list(
  objective = "multi:softprob",   # multiclass probability
  num_class = 3,                  # 3 species
  eval_metric = "mlogloss",       # log loss
  eta = 0.3,                      # learning rate
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# Train model
set.seed(123)
model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 20
)
Key XGBoost Parameters
Parameter Description Typical Values
eta Learning rate (step size shrinkage) 0.01 – 0.3
max_depth Maximum depth of a tree 3 – 10
subsample Fraction of observations used per boosting round 0.5 – 1
colsample_bytree Fraction of columns (features) used per tree 0.5 – 1
min_child_weight Minimum sum of instance weight needed in a child 1 – 10

Step 4: Make Predictions

# Predict probabilities (returns vector, reshape to matrix)
pred_probs <- predict(model, dtest, reshape = TRUE)
head(pred_probs)
##           [,1]        [,2]        [,3]
## [1,] 0.9862779 0.009412179 0.004309894
## [2,] 0.9873204 0.007788341 0.004891236
## [3,] 0.9873204 0.007788341 0.004891236
## [4,] 0.9873204 0.007788341 0.004891236
## [5,] 0.9820973 0.013611067 0.004291625
## [6,] 0.9862779 0.009412179 0.004309894
# Predicted class (0-based)
pred_class  <- max.col(pred_probs) - 1
pred_labels <- levels(iris$Species)[pred_class + 1]

Step 5: Evaluate Model

# Accuracy
accuracy <- mean(pred_class == y_test)
cat("Accuracy:", round(accuracy, 4), "\n")  # ~0.9778
## Accuracy: 0.9778
# Confusion Matrix
cm <- confusionMatrix(
  factor(pred_labels, levels = levels(iris$Species)),
  factor(iris$Species[-train_idx], levels = levels(iris$Species))
)
print(cm)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         17         0
##   virginica       0          1        13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9778          
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9664          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9444           1.0000
## Specificity                 1.0000            1.0000           0.9688
## Pos Pred Value              1.0000            1.0000           0.9286
## Neg Pred Value              1.0000            0.9643           1.0000
## Prevalence                  0.3111            0.4000           0.2889
## Detection Rate              0.3111            0.3778           0.2889
## Detection Prevalence        0.3111            0.3778           0.3111
## Balanced Accuracy           1.0000            0.9722           0.9844

Step 6: Feature Importance

importance_matrix <- xgb.importance(model = model)
importance_matrix
##         Feature       Gain     Cover Frequency
##          <char>      <num>     <num>     <num>
## 1: Petal.Length 0.64346260 0.4517480 0.3506494
## 2:  Petal.Width 0.30552986 0.3409521 0.2987013
## 3: Sepal.Length 0.03025878 0.1036788 0.1948052
## 4:  Sepal.Width 0.02074876 0.1036211 0.1558442

Explanation of each of the last 3 columns of the above table:

Column Meaning
Gain How much the feature improves accuracy
Cover How many samples the feature affects
Frequency How often the feature is used
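
The importance table can also be displayed graphically with xgboost's built-in bar chart (an optional extra step, here plotting the Gain measure):

# Optional: bar chart of the importance scores
xgb.plot.importance(importance_matrix, measure = "Gain")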

Step 7: Hyperparameter Tuning with Cross-Validation

# Use xgb.cv for 5-fold CV
cv <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 200,
  nfold = 5,
  early_stopping_rounds = 20,
  verbose = 0
)

# Best iteration
best_iter <- cv$best_iteration
cat("Best iteration:", best_iter, "\n")
## Best iteration: 18
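
Once xgb.cv has suggested a good number of boosting rounds, a natural follow-up (a minimal sketch, not part of the original output above) is to refit the model with that value and re-check the test accuracy:

# Refit using the number of rounds chosen by cross-validation
set.seed(123)
final_model <- xgb.train(params = params, data = dtrain, nrounds = best_iter)

final_pred <- max.col(predict(final_model, dtest, reshape = TRUE)) - 1
mean(final_pred == y_test)   # test accuracy of the refitted model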

SHAP measures how much each feature contributes to a specific prediction.

library(shapviz)
# Create the SHAP object from the fitted xgboost model and the training features
shap <- shapviz(model, X_pred = X_train)

# Global importance
sv_importance(shap, kind = "bar") + 
  ggtitle("SHAP Feature Importance")

9.4 Random Forest

Random Forest is an extension of bagging that adds randomness by selecting random subsets of features at each split. Use the randomForest package.

if (!require(randomForest)) install.packages("randomForest")
library(randomForest)

# Load and prepare data
data(iris)
set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]

# Train Random Forest model
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 20, mtry = 2)  # ntree: number of trees, mtry: features per split

# Predict on test data
predictions3 <- predict(rf_model, newdata = test_data)

confusionMatrix(predictions3, test_data$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         14         1
##   virginica       0          1        17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 2.842e-15       
##                                           
##                   Kappa : 0.9324          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9444
## Specificity                 1.0000            0.9667           0.9630
## Pos Pred Value              1.0000            0.9333           0.9444
## Neg Pred Value              1.0000            0.9667           0.9630
## Prevalence                  0.2667            0.3333           0.4000
## Detection Rate              0.2667            0.3111           0.3778
## Detection Prevalence        0.2667            0.3333           0.4000
## Balanced Accuracy           1.0000            0.9500           0.9537
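
Beyond test-set accuracy, a fitted randomForest object also reports an out-of-bag (OOB) error estimate and variable importance. The following optional lines (a small add-on to the example above) display them:

# Out-of-bag error estimate and variable importance
print(rf_model)      # includes the OOB estimate of the error rate
importance(rf_model) # mean decrease in Gini index per predictor
varImpPlot(rf_model, main = "Random Forest Variable Importance")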

Chapter 10. Clustering

Hierarchical Clustering is an unsupervised method that builds a hierarchy of clusters without requiring a predefined k. It comes in two forms:

  • Agglomerative (Bottom-Up): Starts with each data point as its own cluster and iteratively merges the closest pairs based on a linkage criterion (e.g., Ward’s method minimizes variance, single linkage uses minimum distance, complete linkage uses maximum distance).
  • Divisive (Top-Down): Starts with all points in one cluster and recursively splits them.

In either case, the result is a dendrogram, a tree-like diagram showing the order and distance of merges/splits, which helps determine the optimal number of clusters by “cutting” the tree at a desired height.

K-Means Clustering is an unsupervised machine learning algorithm that partitions a dataset into a predefined number of clusters (k). It aims to minimize the variation (sum of squared distances) within each cluster. The process is iterative:

  1. Randomly initialize k centroids.
  2. Assign each data point to the nearest centroid based on a distance metric (typically Euclidean).
  3. Update centroids as the mean of all points assigned to each cluster.
  4. Repeat until centroids stabilize or a set number of iterations is reached.

10.1 Hierarchical Clustering

A comparison between the single linkage and the complete linkage method:

comparison <- data.frame(
  Feature = c(
    "**Linkage Type**",
    "**Distance Definition**",
    "**Formula**",
    "**Intuition**",
    "**Tendency**",
    "**Sensitive to Outliers?**",
    "**Best For**",
    "**Merge Order (Our Example)**",
    "**Final Merge Distance**"
  ),
  `Single Linkage` = c(
    "Nearest-neighbor",
    "Minimum distance between any two points in clusters",
    "$$   d(C_1, C_2) = \\min_{i\\in C_1, j\\in C_2} d(i,j)   $$",
    "Merge if **any** point is close",
    "Long, chain-like clusters (chaining effect)",
    "Yes — one close point can bridge clusters",
    "Elongated, irregular, or connected data",
    "1. A–B (1.00)<br>2. C–D (1.00)<br>3. AB–E (2.24)<br>4. ABE–CD (2.83)",
    "**2.83** (E to C)"
  ),
  `Complete Linkage` = c(
    "Furthest-neighbor",
    "Maximum distance between any two points in clusters",
    "$$   d(C_1, C_2) = \\max_{i\\in C_1, j\\in C_2} d(i,j)   $$",
    "Merge only if **all** points are close",
    "Compact, spherical clusters",
    "No — requires full cluster proximity",
    "Round, well-separated, noisy data",
    "1. A–B (1.00)<br>2. C–D (1.00)<br>3. AB–E (2.83)<br>4. ABE–CD (6.40)",
    "**6.40** (A to D)"
  ),
  check.names = FALSE
)

comparison %>%
  kable("html", escape = FALSE, align = c("l", "l", "l")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#f8f9fa") %>%
  column_spec(1, bold = TRUE, width = "8cm") %>%
  column_spec(2, width = "7cm", background = "#e6f3ff") %>%
  column_spec(3, width = "7cm", background = "#fff4e6")
Feature Single Linkage Complete Linkage
Linkage Type Nearest-neighbor Furthest-neighbor
Distance Definition Minimum distance between any two points in clusters Maximum distance between any two points in clusters
Formula \[ d(C_1, C_2) = \min_{i\in C_1, j\in C_2} d(i,j) \] \[ d(C_1, C_2) = \max_{i\in C_1, j\in C_2} d(i,j) \]
Intuition Merge if any point is close Merge only if all points are close
Tendency Long, chain-like clusters (chaining effect) Compact, spherical clusters
Sensitive to Outliers? Yes — one close point can bridge clusters No — requires full cluster proximity
Best For Elongated, irregular, or connected data Round, well-separated, noisy data
Merge Order (Our Example):
  Single linkage: 1. A–B (1.00); 2. C–D (1.00); 3. AB–E (2.24); 4. ABE–CD (2.83)
  Complete linkage: 1. A–B (1.00); 2. C–D (1.00); 3. AB–E (2.83); 4. ABE–CD (6.40)
Final Merge Distance 2.83 (E to C, single linkage) 6.40 (A to D, complete linkage)

10.1.1 An Example Worked by Hand

Given data:

points <- data.frame(
  id = c("A", "B", "C", "D", "E"),
  x  = c(1, 2, 5, 6, 3),
  y  = c(1, 1, 5, 5, 3)
)
points
##   id x y
## 1  A 1 1
## 2  B 2 1
## 3  C 5 5
## 4  D 6 5
## 5  E 3 3

Plot data:

library(ggplot2)
ggplot(points, aes(x, y, label = id)) +
  geom_point(size = 6) +
  geom_text(vjust = -1.5, size = 5, show.legend = FALSE) +
  theme_minimal(base_size = 14) +
  labs(title = "Five Points for Clustering") +
  xlim(0, 7) + ylim(0, 6)

The following are steps for creating a dendrogram based on the single linkage method.

  • Step 1: Create the distance matrix
dist_matrix <- round(as.matrix(dist(points[,2:3], method = "euclidean")), 2)
rownames(dist_matrix) <- colnames(dist_matrix) <- points$id
dist_matrix
##      A    B    C    D    E
## A 0.00 1.00 5.66 6.40 2.83
## B 1.00 0.00 5.00 5.66 2.24
## C 5.66 5.00 0.00 1.00 2.83
## D 6.40 5.66 1.00 0.00 3.61
## E 2.83 2.24 2.83 3.61 0.00
  • Step 2: Show the calculation detail.
manual <- data.frame(
  Step = 1:4,
  Closest_Pair = c("A-B (1.00)", "C-D (1.00)", "AB-E (2.24)", "ABE-CD (2.83)"),
  Merge_Into = c("AB", "CD", "ABE", "ABECD"),
  New_Clusters = c("{AB}, {C}, {D}, {E}",
                   "{AB}, {CD}, {E}",
                   "{ABE}, {CD}",
                   "{ABECD}"),
  Merge_Height = c(1.00, 1.00, 2.24, 2.83)
)
knitr::kable(manual, digits = 2, align = "c",
             caption = "Hand-Calculated Merge Steps (Single Linkage)") %>%
  kable_styling(
    full_width = FALSE,           # prevents stretching
    position = "center"           # <-- centers the table
  )
Hand-Calculated Merge Steps (Single Linkage)
Step Closest_Pair Merge_Into New_Clusters Merge_Height
1 A-B (1.00) AB {AB}, {C}, {D}, {E} 1.00
2 C-D (1.00) CD {AB}, {CD}, {E} 1.00
3 AB-E (2.24) ABE {ABE}, {CD} 2.24
4 ABE-CD (2.83) ABECD {ABECD} 2.83
  • Step 3: Create the dendrogram.
# --- 1. Load ONLY what's needed ---
library(dendextend)

# --- 2. Your exact data ---
points <- data.frame(
  id = c("A", "B", "C", "D", "E"),
  x  = c(1,  2,  5,  6,  3),
  y  = c(1,  1,  5,  5,  3)
)

# --- 3. Build hclust from your manual steps ---
merge_matrix <- rbind(
  c(-1, -2),  # A-B
  c(-3, -4),  # C-D
  c( 1, -5),  # AB-E
  c( 3,  2)   # ABE-CD
)

hc <- list(
  merge = merge_matrix,
  height = c(1.00, 1.00, 2.24, 2.83),
  order = c(1, 2, 5, 3, 4),
  labels = points$id,
  method = "single"
)
class(hc) <- "hclust"

# --- 4. Plot dendrogram ---
par(mar = c(6, 4, 4, 8))  # extra bottom + right space

plot(hc, 
     hang = -1, 
     main = "Single-Linkage Dendrogram",
     ylab = "Distance", 
     xlab = "",
     labels = FALSE)  # <-- DO NOT print default labels

# --- 5. ADD -45° ROTATED LETTERS MANUALLY ---
x_pos <- 1:5  # leaf positions in order: A=1, B=2, E=3, C=4, D=5
text(x = x_pos,
     y = -0.5,
     labels = c("A", "B", "E", "C", "D"),
     srt = -45,           # -45 degrees, srt = "string rotation"
     adj = c(1, 0.5),    # align right to point
     cex = 1.4,          # large
     font = 2,           # bold
     xpd = TRUE)

# --- 6. Add merge height lines ---
abline(h = c(1.20, 1.7, 2.3), col = "gray", lty = 2)

# --- 7. Add step labels on the right ---
text(x = 5.5, y = c(1.15, 2.39, 2.98),
     labels = c("Step 1: A-B (1.00)", 
                "Step 3: AB-E (2.24)", 
                "Step 4: ABE-CD (2.83)"),
     col = "darkred", adj = 0, cex = 0.9, font = 2)

  • Step 4: Cut the dendrogram.
cuts <- data.frame(
  k = 1:5,
  height = c(3.0, 2.5, 2.0, 1.0, 0.5),
  clusters = I(list(
    "{ABECD}",
    "{ABE}, {CD}",
    "{AB}, {CD}, {E}",
    "{AB}, {C}, {D}, {E}",
    "{A}, {B}, {C}, {D}, {E}"
  ))
)
knitr::kable(cuts, caption = "Clusters Obtained by Cutting the Dendrogram")
Clusters Obtained by Cutting the Dendrogram
k height clusters
1 3.0 {ABECD}
2 2.5 {ABE}, {CD}
3 2.0 {AB}, {CD}, {E}
4 1.0 {AB}, {C}, {D}, {E}
5 0.5 {A}, {B}, {C}, {D}, {E}

As a comparison, the dendrogram based on the complete linkage method looks like the following:

hc_complete <- hclust(dist(points[,2:3]), method = "complete")
plot(hc_complete, main = "Complete Linkage Dendrogram", hang = -1)
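
The manually built single-linkage tree can be cross-checked against R's own hclust(); its merge heights should reproduce the hand-calculated values 1.00, 1.00, 2.24, 2.83 (up to rounding):

# Cross-check the hand construction with hclust()
hc_single <- hclust(dist(points[, 2:3]), method = "single")
round(hc_single$height, 2)   # 1.00 1.00 2.24 2.83
cutree(hc_single, k = 2)     # A, B, E in one cluster; C, D in the other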

10.1.2 R Code on Pima Data

We use the Pima data.

library(mlbench)
data("PimaIndiansDiabetes2")
Pima = PimaIndiansDiabetes2 |> na.omit()

X <- Pima[, 1:8]  # Exclude Outcome for unsupervised clustering

# Standardize features
X_std <- scale(X)

# Hierarchical Clustering (Agglomerative with Dendrogram)
dist_matrix <- dist(X_std, method = "euclidean")
hclust_model <- hclust(dist_matrix, method = "ward.D2")

# Plot Dendrogram
plot(hclust_model, main = "Hierarchical Clustering Dendrogram (Ward Linkage)",
     xlab = "Sample Index", ylab = "Distance", sub = "")

# Cut dendrogram
hier_labels <- cutree(hclust_model, k = 3)
hier_labels
##   4   5   7   9  14  15  17  19  20  21  25  26  28  29  32  33  36  40  41  44 
##   1   2   1   2   2   3   2   1   1   2   3   3   1   3   2   1   1   2   1   3 
##  51  52  53  54  55  57  58  60  64  69  70  71  72  74  83  86  88  89  92  93 
##   1   1   1   3   2   2   2   1   1   1   1   1   1   1   1   1   1   3   1   3 
##  95  96  98  99 100 104 106 108 109 110 111 112 113 115 120 121 123 126 127 128 
##   1   2   1   1   2   1   1   1   1   1   1   2   1   1   1   2   1   1   1   1 
## 129 131 133 135 136 137 138 140 143 145 148 151 153 154 157 158 159 160 162 163 
##   1   1   2   1   1   1   1   2   1   2   1   2   3   2   1   1   1   3   3   1 
## 166 170 172 174 175 176 178 182 187 188 189 190 192 196 198 199 200 204 205 207 
##   1   1   1   2   1   2   2   1   2   2   3   1   2   2   1   1   2   1   3   3 
## 209 214 215 216 217 218 221 224 225 226 229 230 232 233 235 237 242 244 245 248 
##   1   1   3   3   1   1   1   3   1   1   2   1   2   1   1   3   1   1   2   2 
## 249 253 255 259 260 261 266 272 274 276 278 280 282 283 286 287 288 289 290 291 
##   2   1   1   1   3   1   1   1   1   2   1   1   3   1   3   2   2   1   1   1 
## 292 293 294 296 297 298 299 302 303 306 307 308 309 310 312 313 314 316 317 319 
##   1   2   1   1   2   1   3   1   1   1   3   1   1   1   1   1   1   1   1   1 
## 321 324 326 327 329 330 332 335 336 339 341 342 346 347 349 354 357 359 360 361 
##   1   3   1   1   1   3   1   1   2   2   1   1   3   1   1   1   1   3   2   1 
## 365 366 369 370 371 373 374 375 376 377 378 380 381 383 384 385 386 389 390 391 
##   2   1   1   1   2   1   1   1   3   1   1   2   1   1   1   1   1   3   1   1 
## 393 394 396 397 403 406 410 412 413 414 415 416 420 421 422 423 425 426 428 429 
##   1   1   1   1   1   1   2   1   2   1   1   2   1   2   1   1   2   2   1   2 
## 430 432 433 442 443 446 447 448 449 450 451 453 455 458 459 460 461 463 466 467 
##   1   1   1   1   1   2   1   1   1   1   1   1   1   1   3   3   3   3   1   1 
## 468 470 477 478 479 481 483 484 486 487 488 491 494 498 499 500 501 504 507 508 
##   1   2   2   1   3   2   1   1   2   2   2   1   1   1   3   2   1   1   1   1 
## 509 512 515 516 517 520 521 522 527 528 529 531 533 535 539 540 541 542 544 545 
##   1   1   1   1   3   3   1   1   1   1   1   1   2   1   2   2   3   1   1   1 
## 546 547 548 549 552 554 555 556 562 563 564 566 567 568 569 570 573 574 575 576 
##   2   2   1   1   1   1   1   3   1   1   1   1   1   3   1   1   1   1   2   1 
## 577 585 589 592 594 595 596 598 600 604 607 608 609 610 611 612 613 615 618 621 
##   1   2   2   1   1   2   1   1   1   3   2   1   2   1   1   1   2   3   1   1 
## 624 626 632 634 638 639 640 641 645 646 647 648 649 651 652 653 655 656 657 658 
##   1   1   1   1   1   3   1   1   1   2   1   1   3   1   1   1   1   1   1   2 
## 660 663 664 666 669 670 671 673 674 680 681 683 686 689 690 693 694 696 697 699 
##   1   3   2   1   3   3   3   3   2   1   1   1   1   1   1   1   2   2   1   1 
## 701 705 708 710 711 712 714 716 717 719 722 723 724 727 731 733 734 737 739 741 
##   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   3 
## 742 743 745 746 748 749 752 754 756 761 764 766 
##   1   1   3   3   2   1   1   2   2   1   3   1

10.2 k-Means Clustering

Here is an App that shows how K-Means clustering works: https://szhang2025.github.io/KMeansCustering/

10.2.1 A Hand-calculated Example

We present a hand-calculated example of K-Means Clustering with K = 2 on a tiny 2D dataset.

The points are: A(1,1), B(2,1), C(2,2), D(8,7), E(8,8), F(9,8)

data.frame(
  Point = c("A", "B", "C", "D", "E", "F"),
  X1 = c(1,2,2,8,8,9),
  X2 = c(1,1,2,7,8,8)
) %>% kable() %>% kable_styling()
Point X1 X2
A 1 1
B 2 1
C 2 2
D 8 7
E 8 8
F 9 8
  • Step 0: Choose Initial Centroids (Randomly). Let’s pick:
    • C₁ = (2, 1) → Point B
    • C₂ = (8, 7) → Point D
  • Step 1: Assign Points to Nearest Centroid (Euclidean Distance)
data.frame(
  Point = c("A(1,1)", "B(2,1)", "C(2,2)", "D(8,7)", "E(8,8)", "F(9,8)"),
  `Dist to C1` = c("1", "0", "1", "≈8.5", "≈9.2", "≈9.9"),
  `Dist to C2` = c("≈9.2", "≈8.5", "≈7.8", "0", "1", "≈1.4"),
  `Cluster` = c("1","1","1","2","2","2"),
  `New C1` = rep("(1.67, 1.33)", 3),
  `New C2` = rep("(8.33, 7.67)", 3),
  `Final Cluster` = c("1","1","1","2","2","2")
) %>%
  kable(escape = FALSE) %>%
  kable_styling(full_width = FALSE) %>%
  add_header_above(c(" ", "Iteration 1" = 3, "Update" = 2, "Final" = 1))
Iteration 1
Update
Final
Point Dist.to.C1 Dist.to.C2 Cluster New.C1 New.C2 Final.Cluster
A(1,1) 1 ≈9.2 1 (1.67, 1.33) (8.33, 7.67) 1
B(2,1) 0 ≈8.5 1 (1.67, 1.33) (8.33, 7.67) 1
C(2,2) 1 ≈7.8 1 (1.67, 1.33) (8.33, 7.67) 1
D(8,7) ≈8.5 0 2 (1.67, 1.33) (8.33, 7.67) 2
E(8,8) ≈9.2 1 2 (1.67, 1.33) (8.33, 7.67) 2
F(9,8) ≈9.9 ≈1.4 2 (1.67, 1.33) (8.33, 7.67) 2
  • Step 2: Re-assign Points to New Centroids with C₁ = (1.67, 1.33) and C₂ = (8.33, 7.67).
data.frame(
  Point = c("A(1, 1)", "B(2, 1)", "C(2, 2)", "D(8, 7)", "E(8, 8)", "F(9, 8)"),
  `Dist to C1` = c("≈ 0.75", "≈ 0.47", "≈ 0.75", "≈ 8.50", "≈ 9.20", "≈ 9.91"),
  `Dist to C2` = c("≈ 9.91", "≈ 9.20", "≈ 8.50", "≈ 0.75", "≈ 0.47", "≈ 0.75"),
  `Assigned Cluster` = c("1", "1", "1", "2", "2", "2")
) %>%
  kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(4, bold = TRUE, color = "darkred") %>%
  add_header_above(c(" " = 1, "Distance" = 2, " " = 1), bold = TRUE)
Distance
Point Dist.to.C1 Dist.to.C2 Assigned.Cluster
A(1, 1) ≈ 0.75 ≈ 9.91 1
B(2, 1) ≈ 0.47 ≈ 9.20 1
C(2, 2) ≈ 0.75 ≈ 8.50 1
D(8, 7) ≈ 8.50 ≈ 0.75 2
E(8, 8) ≈ 9.20 ≈ 0.47 2
F(9, 8) ≈ 9.91 ≈ 0.75 2

The algorithm has converged and the final result is:

  • Cluster 1 contains points A, B, C with centroid at (1.67, 1.33)
  • Cluster 2 contains points D, E, F with centroid at (8.33, 7.67) (verified with kmeans() below)
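
Running kmeans() on the six points reproduces this hand calculation (a quick check using base R only):

# Verify the hand calculation with kmeans()
pts <- data.frame(X1 = c(1, 2, 2, 8, 8, 9), X2 = c(1, 1, 2, 7, 8, 8))
set.seed(1)
km <- kmeans(pts, centers = 2, nstart = 10)
km$cluster                   # A, B, C in one cluster; D, E, F in the other
round(km$centers, 2)         # (1.67, 1.33) and (8.33, 7.67), in some order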

10.2.2 R code on Pima Data

# K-Means Clustering for k=2 to 10 and Elbow Method
inertias <- numeric(9)
sil_scores <- numeric(9)
K <- 2:10

for (k in K) {
  set.seed(42)  # Reproducibility
  kmeans_model <- kmeans(X_std, centers = k, nstart = 25)
  inertias[k-1] <- kmeans_model$tot.withinss  # Inertia (within-cluster SS)
  sil <- cluster::silhouette(kmeans_model$cluster, dist(X_std))
  sil_scores[k-1] <- mean(sil[, 3])  # Average silhouette score (always defined here since k starts at 2)
}

# Plot Elbow Method (Inertia vs. k)
plot(K, inertias, type = "b", pch = 19, col = "blue", 
     xlab = "Number of Clusters (k)", ylab = "Inertia (Within-Cluster SS)",
     main = "Elbow Method for Optimal k")
grid()

# Plot Silhouette Scores
plot(K, sil_scores, type = "b", pch = 19, col = "red",
     xlab = "Number of Clusters (k)", ylab = "Average Silhouette Score",
     main = "Silhouette Scores for Different k")
grid()

# Choose optimal k (e.g., based on elbow, typically k=3 or 4)
optimal_k <- 3
set.seed(42)
kmeans_model <- kmeans(X_std, centers = optimal_k, nstart = 25)
labels <- kmeans_model$cluster

# PCA for 2D/3D Visualization
pca <- prcomp(X_std, scale. = TRUE)
X_pca <- pca$x[, 1:3]  # First 3 PCs for 2D/3D

# 2D Visualization
pca_df <- data.frame(PC1 = X_pca[, 1], PC2 = X_pca[, 2], Cluster = as.factor(labels))
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clusters in 2D PCA Space", x = "PC1", y = "PC2") +
  theme_minimal()

Compare K-Means and Hierarchical Clustering:

cat("K-Means Silhouette Score for k =", optimal_k, ":", sil_scores[optimal_k-1], "\n")
## K-Means Silhouette Score for k = 3 : 0.2115462
# Cluster Interpretation (K-Means)
cat("\nCluster Mean Features (K-Means):\n")
## 
## Cluster Mean Features (K-Means):
for (i in 1:optimal_k) {
  cat("Cluster", i, ":\n")
  print(colMeans(X[labels == i, ]))
}
## Cluster 1 :
##    pregnant     glucose    pressure     triceps     insulin        mass 
##   8.0116279 139.5697674  76.8604651  31.9883721 197.7558140  33.5744186 
##    pedigree         age 
##   0.5346395  46.3604651 
## Cluster 2 :
##    pregnant     glucose    pressure     triceps     insulin        mass 
##   1.8032787 131.3606557  74.3032787  38.0819672 201.3934426  39.2163934 
##    pedigree         age 
##   0.6045984  27.6475410 
## Cluster 3 :
##    pregnant     glucose    pressure     triceps     insulin        mass 
##   2.0923913 108.9184783  65.3532609  21.8913043 106.5054348  28.7934783 
##    pedigree         age 
##   0.4635543  25.7554348
# Silhouette Score for Hierarchical Clustering
sil_hier <- cluster::silhouette(hier_labels, dist_matrix)
cat("\nHierarchical Clustering Silhouette Score:", mean(sil_hier[, 3]), "\n")
## 
## Hierarchical Clustering Silhouette Score: 0.2255977

10.3 A Comparison between Hierarchical Clustering and K-Means Clustering

data.frame(
  Aspect = c(
    "Type", 
    "Output", 
    "Number of Clusters", 
    "Algorithm Mechanism", 
    "Distance/Similarity", 
    "Scalability", 
    "Cluster Shape", 
    "Deterministic", 
    "Sensitivity to Outliers", 
    "Interpretability", 
    "Use Cases", 
    "Parameter Tuning", 
    "Memory Usage"
  ),
  `Hierarchical Clustering` = c(
    "Agglomerative or Divisive",
    "Dendrogram (tree structure)",
    "Not required upfront",
    "Merges/splits clusters by similarity",
    "Linkage: single, complete, average, Ward’s",
    "O(n²) or O(n³); not scalable",
    "Non-spherical, nested, irregular",
    "Yes",
    "High (especially single linkage)",
    "High (dendrogram shows hierarchy)",
    "Taxonomy, gene analysis, small data",
    "Linkage, distance, cut height",
    "High: O(n²) distance matrix"
  ),
  `K-Means Clustering` = c(
    "Partitional",
    "Fixed K clusters",
    "Must specify K in advance",
    "Assigns to nearest centroid, updates",
    "Euclidean to centroids",
    "O(n·K·I); highly scalable",
    "Spherical, equal size",
    "No (depends on init)",
    "Moderate",
    "Moderate (centroids + assignments)",
    "Segmentation, image compression",
    "K, initialization (K-means++)",
    "Low"
  ),
  check.names = FALSE
) %>%
  kable("html", escape = FALSE, align = c("l", "l", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Clustering Method" = 2), bold = TRUE, font_size = 14) %>%
  column_spec(1, bold = TRUE, width = "8cm") %>%
  column_spec(2:3, width = "10cm")
Clustering Method
Aspect Hierarchical Clustering K-Means Clustering
Type Agglomerative or Divisive Partitional
Output Dendrogram (tree structure) Fixed K clusters
Number of Clusters Not required upfront Must specify K in advance
Algorithm Mechanism Merges/splits clusters by similarity Assigns to nearest centroid, updates
Distance/Similarity Linkage: single, complete, average, Ward’s Euclidean to centroids
Scalability O(n²) or O(n³); not scalable O(n·K·I); highly scalable
Cluster Shape Non-spherical, nested, irregular Spherical, equal size
Deterministic Yes No (depends on init)
Sensitivity to Outliers High (especially single linkage) Moderate
Interpretability High (dendrogram shows hierarchy) Moderate (centroids + assignments)
Use Cases Taxonomy, gene analysis, small data Segmentation, image compression
Parameter Tuning Linkage, distance, cut height K, initialization (K-means++)
Memory Usage High: O(n²) distance matrix Low
data.frame(
  `Hierarchical Clustering` = c(
    "No need to specify K",
    "Visual dendrogram for insight",
    "Handles non-convex clusters well"
  ),
  `K-Means Clustering` = c(
    "Fast and efficient on large datasets",
    "Simple and widely implemented",
    "Works well with spherical clusters"
  ),
  check.names = FALSE
) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c("Advantages" = 2), bold = TRUE)
Advantages
Hierarchical Clustering K-Means Clustering
No need to specify K Fast and efficient on large datasets
Visual dendrogram for insight Simple and widely implemented
Handles non-convex clusters well Works well with spherical clusters
data.frame(
  `Hierarchical Clustering` = c(
    "Computationally expensive",
    "Not suitable for large datasets",
    "Irreversible merging"
  ),
  `K-Means Clustering` = c(
    "Requires predefined K",
    "Sensitive to initialization/outliers",
    "Assumes spherical clusters"
  ),
  check.names = FALSE
) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c("Limitations" = 2), bold = TRUE)
Limitations
Hierarchical Clustering K-Means Clustering
Computationally expensive Requires predefined K
Not suitable for large datasets Sensitive to initialization/outliers
Irreversible merging Assumes spherical clusters
data.frame(
  Method = c("Hierarchical Clustering", "K-Means Clustering"),
  `When to Use` = c(
    paste("- Don’t know number of clusters<br>",
          "- Want hierarchy (e.g., taxonomy)<br>",
          "- Small to medium data (< 10k points)<br>",
          "- Complex or nested shapes"),
    paste("- Large datasets<br>",
          "- Reasonable guess for K<br>",
          "- Spherical, similar-sized clusters<br>",
          "- Speed and scalability priority")
  ),
  check.names = FALSE
) %>%
  kable("html", escape = FALSE) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, background = "#f0f0f0") %>%
  column_spec(2, width = "12cm")
Method When to Use
Hierarchical Clustering
  • Don’t know number of clusters
  • Want hierarchy (e.g., taxonomy)
  • Small to medium data (< 10k points)
  • Complex or nested shapes
K-Means Clustering
  • Large datasets
  • Reasonable guess for K
  • Spherical, similar-sized clusters
  • Speed and scalability priority

Chapter 11. Support Vector Machines

Support Vector Machines (SVMs) are powerful supervised machine learning models used for classification and regression. They excel in high-dimensional spaces and are particularly effective when the number of features is large compared to the number of samples.

For classification, the core idea of the SVM is to find the optimal hyperplane that maximizes the margin between the two classes. Refer to https://en.wikipedia.org/wiki/Support_vector_machine

  • Hyperplane: A decision boundary separating classes.

  • Margin: The distance between the hyperplane and the nearest data points (called support vectors).

  • Goal: Maximize this margin → better generalization.

  • Linearly separable data result in a hard-margin SVM that separates the classes perfectly.

  • Real-world data result in a soft-margin SVM that allows some misclassification (regularization is needed).

  • When the data are not linearly separable in the original space, map them to a higher-dimensional space using kernel functions (see the sketch below).
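
As a small illustration of these ideas before the hand example, the sketch below fits a linear-kernel SVM with the e1071 package on a tiny separable data set we made up; the cost argument plays the role of the soft-margin penalty.

# Minimal linear SVM on a small made-up 2-D data set
library(e1071)
toy <- data.frame(
  x1 = c(1, 2, 2, 5, 6, 6),
  x2 = c(1, 1, 2, 5, 5, 6),
  y  = factor(c(-1, -1, -1, 1, 1, 1))
)
fit <- svm(y ~ x1 + x2, data = toy, kernel = "linear", cost = 10, scale = FALSE)
fit$SV            # support vectors that define the margin
plot(fit, toy)    # decision regions with support vectors marked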

11.1 A Hand-calculated Example for Classification with SVM

Click: https://szhang2025.github.io/SVM-Example/

11.2 Implement SVM Using Software

11.2.1 Using train() from the caret Package

We use the caret package to train an SVM on the iris data, keeping only the versicolor and virginica species (setosa is excluded). For binary classification, we create two labels: \(1\) and \(-1\). Then, we create training and test data and scale the features.

library(dplyr)

## ---- 1. Data with binary labels -------------------------------------------------
df <- iris %>% 
  filter(Species %in% c("versicolor", "virginica")) %>% 
  dplyr::select(Sepal.Length, Sepal.Width, Species) %>%
  dplyr::mutate(Species = factor(ifelse(Species=="versicolor", -1, 1), 
                          levels = c(-1, 1),
                          labels = c("versicolor", "virginica")))

n = nrow(df)

set.seed(123)
indices <- sample(n, 0.7*n)
train <- df[ indices, ]
test  <- df[-indices, ]

## ---- 2. Scale (must be the SAME on train & test) ------------
preproc <- preProcess(train[,1:2], method = c("center","scale"))
train_s <- predict(preproc, train)
test_s  <- predict(preproc, test)

## ---- 3. Train with 5-fold CV --------------------------------
ctrl <- trainControl(method = "cv", 
                     number = 5, 
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(123)
svm_caret <- train(Species ~ .,
                   data = train_s,
                   method = "svmRadial",
                   trControl = ctrl,
                   tuneLength = 12,          # tries 12 combos
                   metric = "ROC")           # best model is selected based on ROC

print(svm_caret)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 70 samples
##  2 predictor
##  2 classes: 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 56, 56, 56, 56, 56 
## Resampling results across tuning parameters:
## 
##   C       ROC        Sens       Spec     
##     0.25  0.7122449  0.8000000  0.6000000
##     0.50  0.7163265  0.8000000  0.6000000
##     1.00  0.7285714  0.7714286  0.6000000
##     2.00  0.7204082  0.7428571  0.6000000
##     4.00  0.6959184  0.7142857  0.6000000
##     8.00  0.7081633  0.7142857  0.5714286
##    16.00  0.7122449  0.6571429  0.6000000
##    32.00  0.5693878  0.5142857  0.5142857
##    64.00  0.6836735  0.6571429  0.5428571
##   128.00  0.6714286  0.7428571  0.5142857
##   256.00  0.5408163  0.5142857  0.4857143
##   512.00  0.6632653  0.6571429  0.4857143
## 
## Tuning parameter 'sigma' was held constant at a value of 0.72262
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.72262 and C = 1.
plot(svm_caret)

## ---- 3. Predict on new data ---------------------------------
pred_caret <- predict(svm_caret, newdata = test_s)          # class labels
prob_caret <- predict(svm_caret, newdata = test_s, type = "prob")

head(prob_caret)
##   versicolor virginica
## 1  0.1684878 0.8315122
## 2  0.2445360 0.7554640
## 3  0.1980473 0.8019527
## 4  0.6305172 0.3694828
## 5  0.5259511 0.4740489
## 6  0.6495178 0.3504822
confusionMatrix(pred_caret, test_s$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   versicolor virginica
##   versicolor         10         6
##   virginica           5         9
##                                           
##                Accuracy : 0.6333          
##                  95% CI : (0.4386, 0.8007)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.1002          
##                                           
##                   Kappa : 0.2667          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.6250          
##          Neg Pred Value : 0.6429          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.5333          
##       Balanced Accuracy : 0.6333          
##                                           
##        'Positive' Class : versicolor      
## 

11.2.2 Hyperparameter Tuning with tune.svm()

library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:coefplot':
## 
##     extractPath
## The following object is masked from 'package:ggplot2':
## 
##     element
library(caret)
library(dplyr)

## ---- 1. CV-tuning -------------------------------------------
tune_out <- tune.svm(Species ~ .,
                     data = train_s,
                     gamma = c(0.01, 0.5, 1), 
                     cost  = c(0.1, 1, 5),      
                     kernel = "radial",
                     tunecontrol = tune.control(sampling = "cross", cross = 5))

tune_out                # best parameters + CV error
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 5-fold cross validation 
## 
## - best parameters:
##  gamma cost
##   0.01    5
## 
## - best performance: 0.3142857
best_mod <- tune_out$best.model   # <-- the fitted SVM with best params

## ---- 2. Predict on hold-out test set ------------------------
pred_test <- predict(best_mod, test_s)
confusionMatrix(pred_test, test_s$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   versicolor virginica
##   versicolor         10         5
##   virginica           5        10
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.4719, 0.8271)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.04937         
##                                           
##                   Kappa : 0.3333          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6667          
##                                           
##        'Positive' Class : versicolor      
## 

Plot Decision Boundary:

plot_svm <- function(model, data, title) {
  # Create grid
  x1_range <- seq(min(data$Sepal.Length), max(data$Sepal.Length), length.out = 100)
  x2_range <- seq(min(data$Sepal.Width), max(data$Sepal.Width), length.out = 100)
  grid <- expand.grid(Sepal.Length = x1_range, Sepal.Width = x2_range)
  
  # Predict on grid
  grid$pred <- predict(model, grid)
  
  # Plot
  ggplot() +
    geom_tile(data = grid, aes(x = Sepal.Length, y = Sepal.Width, fill = pred), alpha = 0.5) +
    geom_point(data = data, aes(x = Sepal.Length, y = Sepal.Width, color = Species), size = 3) +
    scale_fill_manual(values = c("lightblue", "lightcoral"), guide = "none") +
    scale_color_manual(values = c("blue", "red")) +
    labs(title = title, x = "Sepal Length", y = "Sepal Width") +
    theme_minimal()
}

# Plot best model
plot_svm(best_mod, train_s, "SVM Decision Boundary (RBF Kernel based on tune.svm() from e1071 Package)")

plot_svm(svm_caret, train_s, "SVM Decision Boundary (RBF Kernel using train() function)")

11.3 SVM for Regression

For regression, the SVM fits an ε-insensitive tube (or band) around the regression function, rather than a hard decision boundary as in classification.

We use a simulated dataset to demonstrate the use of SVM for a regression problem.

library(e1071)
set.seed(123)

# Noisy sine
x <- seq(0, 4*pi, length=80)
y <- sin(x) + rnorm(80, sd=0.3)
df <- data.frame(x=x, y=y)

# Fit SVR with ε = 0.2
svr <- svm(y ~ x, data=df, type="eps-regression", kernel="radial",
           epsilon=0.2, cost=10)

# Predict on fine grid
grid <- data.frame(x = seq(0, 4*pi, length=300))
pred <- predict(svr, grid)

Plot the boundary:

# Plot tube
plot(x, y, col="blue", pch=16, main="SVR: ε-Insensitive Tube")
lines(grid$x, sin(grid$x), col="green", lwd=2, lty=2)
lines(grid$x, pred, col="red", lwd=2)
lines(grid$x, pred + 0.2, col="red", lty=3)
lines(grid$x, pred - 0.2, col="red", lty=3)
legend("topright", 
       c("Data", "True", "SVR", "±ε boundary"),
       col=c("blue","green","red","red"), pch=c(16,NA,NA,NA), lty=c(NA,2,1,3))

11.4 SVM: Classification vs Regression

# Create the comparison table
comparison <- data.frame(
  Feature = c(
    "Output",
    "Boundary",
    "Loss Function",
    "Support Vectors",
    "Goal",
    "Key Parameter",
    "Prediction Type"
  ),
  SVC = c(
    "Class label",
    "Hyperplane (line)",
    "Hinge loss",
    "On margin boundaries",
    "Maximize margin",
    "`C` (misclassification penalty)",
    "`predict(..., type = 'response')`"
  ),
  SVR = c(
    "Continuous value",
    "**Tube** of width `2ε`",
    "ε-insensitive loss",
    "On or **outside** the tube",
    "Fit inside tube + minimize `||w||`",
    "`C`, `ε` (tube radius), `ν` (optional)",
    "`predict(..., type = 'response')`"
  ),
  stringsAsFactors = FALSE
)

# Render as beautiful table
kable(comparison, 
      col.names = c("Feature", "SVC (Classification)", "SVR (Regression)"),
      caption = "Key Differences Between SVC and SVR",
      align = c("l", "l", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 13) %>%
  row_spec(0, bold = TRUE, background = "#f8f9fa") %>%
  column_spec(1, bold = TRUE, width = "5cm") %>%
  column_spec(2:3, width = "6cm") %>%
  add_header_above(c(" " = 1, "Support Vector Machines" = 2), 
                   bold = TRUE, background = "#e9ecef")
Key Differences Between SVC and SVR
Support Vector Machines
Feature SVC (Classification) SVR (Regression)
Output Class label Continuous value
Boundary Hyperplane (line) Tube of width 2ε
Loss Function Hinge loss ε-insensitive loss
Support Vectors On margin boundaries On or outside the tube
Goal Maximize margin Fit inside tube + minimize ||w||
Key Parameter C (misclassification penalty) C, ε (tube radius), ν (optional)
Prediction Type predict(..., type = 'response') predict(..., type = 'response')

Chapter 12. A Summary of Machine Learning

data.frame(
  Algorithm = c("K-Means", "Hierarchical", "KNN", "Decision Tree", "SVM", "Neural Net"),
  Eager = c("Yes", "Yes", "No", "Yes", "Yes", "Yes"),
  Lazy = c("No", "No", "Yes", "No", "No", "No"),
  Greedy = c("Yes", "Yes", "No", "Yes", "No", "No")
) %>% kable() %>% kable_styling()
Algorithm Eager Lazy Greedy
K-Means Yes No Yes
Hierarchical Yes No Yes
KNN No Yes No
Decision Tree Yes No Yes
SVM Yes No No
Neural Net Yes No No
data.frame(
  Term = c("**Eager Learning**", "**Lazy Learning**", "**Greedy Strategy**"),
  Definition = c(
    "Builds a complete model during **training phase**; prediction is fast.",
    "Defers computation until **prediction time**; stores all training data.",
    "Makes **locally optimal choice** at each step, aiming for global optimum."
  ),
  `Key Characteristics` = c(
    "- Heavy computation in training<br>- Discards data after training<br>- Generalizes early",
    "- No model in training<br>- Computation on query<br>- Slow prediction",
    "- Not a learning type<br>- Step-by-step local decisions<br>- May reach local optima"
  ),
  check.names = FALSE
) %>%
  kable("html", escape = FALSE, align = c("l", "l", "l")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#f8f9fa") %>%
  column_spec(1, bold = TRUE, width = "4cm", background = "#e9ecef") %>%
  column_spec(2, width = "8cm") %>%
  column_spec(3, width = "8cm")
Term Definition Key Characteristics
Eager Learning: Builds a complete model during training phase; prediction is fast.
  • Heavy computation in training
  • Discards data after training
  • Generalizes early
Lazy Learning: Defers computation until prediction time; stores all training data.
  • No model in training
  • Computation on query
  • Slow prediction
Greedy Strategy: Makes locally optimal choice at each step, aiming for global optimum.
  • Not a learning type
  • Step-by-step local decisions
  • May reach local optima

Final Review Questions and Solutions: https://szhang2025.github.io/SampleExamForSTAT415.615/