Data Mining is the computational process of discovering patterns within large datasets. It employs methods at the intersection of machine learning, statistics, and database systems to extract useful information and transform it into an understandable structure for further use.
Data mining is often seen as a core step in the broader KDD (Knowledge Discovery in Databases) process:
Data Cleaning: Remove noise and inconsistent data.
Data Integration: Combine data from multiple sources.
Data Selection: Retrieve relevant data from the database.
Data Transformation: Consolidate data into forms appropriate for mining (e.g., aggregation, normalization).
Data Mining: Apply intelligent methods to extract patterns.
Pattern Evaluation: Identify truly interesting patterns representing knowledge.
Knowledge Presentation: Present the mined knowledge to the user using visualization and knowledge representation techniques.
Data mining tasks are broadly categorized into two types:
Predictive Tasks: Infer patterns to predict future outcomes.
Classification: Predicts a categorical class label. An example is to classify a loan application as “low,” “medium,” or “high” risk. Common classification algorithms include Decision Trees, Naive Bayes, k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), and Neural Networks.
Regression: Predicts a continuous numeric value. An example is to predict the price of a house or future sales revenue. Common algorithms include Linear Regression, Regression Trees, k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), and Neural Networks.
Descriptive Tasks: Discover patterns that describe the data, providing new insights.
Clustering: Groups similar objects together without predefined labels. The goal is to find inherent groupings in the data. An example is customer segmentation for targeted marketing. Commonly used algorithms for clustering include k-Means and Hierarchical Clustering.
Association Rule Learning (often called market basket analysis): Discovers interesting relations (affinities) between variables in large databases. An example is the famous rule {Diapers} -> {Beer} which suggests that people who buy diapers are also likely to buy beer. A commonly used algorithm is Apriori.
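As a quick illustration, here is a minimal sketch of association rule mining with the Apriori algorithm, assuming the arules package and its built-in Groceries transaction data are available; the support and confidence thresholds are arbitrary choices for the example.
library(arules)
data("Groceries")                                  # built-in grocery-store transactions
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))  # minimum support and confidence
inspect(head(sort(rules, by = "lift"), 3))         # show the three rules with the highest lift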
Why is Data Mining Important?
We live in the age of data. The explosion of data from the web, sensors, business systems, and social media has created immense opportunities. Data mining helps us:
Turn Data into Knowledge: Move from simply storing data to understanding it.
Make Informed Decisions: Support decision-making with data-driven insights, not just intuition.
Gain a Competitive Advantage: Businesses can understand customer behavior, optimize processes, and create new strategies.
Enable Scientific Discovery: Find hidden patterns in scientific data, from genomics to astronomy.
Common Challenges in Data Mining:
Data Quality: Dealing with noisy, incomplete, and inconsistent data.
Scalability: Algorithms must be efficient and scalable to handle enormous datasets.
Dimensionality: The “curse of dimensionality” – high-dimensional data can be sparse and computationally expensive to process.
Privacy and Ethics: Using data responsibly, ensuring individual privacy, and avoiding biased or unethical outcomes.
Interpretability: Understanding and trusting the results of complex models (“black boxes”).
Data mining has ubiquitous applications across industries:
Retail: Market basket analysis, customer segmentation, sales forecasting.
Finance: Credit scoring, fraud detection, algorithmic trading.
Healthcare: Disease prediction, drug discovery, patient outcome analysis.
Telecommunications: Churn prediction, network fault isolation.
Science: Gene sequence analysis, climate pattern recognition, astronomical data analysis.
Web & Social Media: Recommendation systems (Netflix, Amazon), sentiment analysis, targeted advertising.
Classification is the process of assigning a new record to one of a set of existing categories, using a rule developed from similar data in which the records are already classified.
An applicant for a loan can repay on time (category 1), repay late (category 2), or declare bankruptcy (category 3).
A credit card transaction can be normal (category 1) or fraudulent (category 2).
A packet of data traveling on a network can be benign (category 1) or threatening (category 2).
The victim of an illness can recover (category 1), still be ill (category 2), or be deceased (category 3).
Prediction is similar to classification, except that we are trying to predict the value of a numeric output variable. Loosely speaking, we can also predict the class of a categorical output variable, so prediction is a more general term.
In business settings, we may wish to find how likely it is that a customer who purchases item A also purchases item B, or find which items tend to be purchased together. Such analysis of association patterns is called market basket analysis (or affinity analysis or association rules). The rules are generated at the population level and can be used in a variety of ways, such as product placement, weekly promotional offers, and bundling products. Association rules derived from a hospital database on patients’ symptoms during consecutive hospitalizations can help find “which symptom is followed by what other symptoms” to help predict future symptoms for returning patients.
A dataset may contain hundreds of thousands of variables of which many are simply noise. In this situation, reducing the number of variables in the initial step before deploying data mining methods, called dimension reduction, can improve predictive power, manageability, and interpretability.
Before mining your data, you may need to clean, manipulate, and visualize them to generate hypotheses. This process of understanding the global landscape of the data is termed data exploration. Exploration by creating charts and dashboards is called data visualization or visual analytics.
Learning can be supervised or unsupervised. Supervised learning refers to the process of providing an algorithm with records (called training data) in which an output variable of interest is known; the algorithm “learns” how to predict the value of this output variable for new records where the output is unknown. Once the model is developed based on the training data, it can be applied to some reserved records to check its performance. These reserved records constitute the test data. If several models are being considered, a third reserved dataset, called the validation data, can be used to select the best model based on some criterion. Simple linear regression is an example of a supervised learning algorithm, since the values of the outcome variable (usually denoted by y) are all known. Unsupervised learning, on the other hand, refers to the process where there is no output variable to predict or classify and one simply uses a computer algorithm to learn patterns in the data.
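Here is a minimal sketch of a three-way data partition using the built-in mtcars data; the 60/20/20 split is an arbitrary choice for illustration.
set.seed(1)                                        # for reproducibility
n <- nrow(mtcars)
idx <- sample(1:n)                                 # shuffle the row indices
train      <- mtcars[idx[1:round(0.6 * n)], ]
validation <- mtcars[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
test       <- mtcars[idx[(round(0.8 * n) + 1):n], ]
nrow(train); nrow(validation); nrow(test)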
The steps are
Understand the purpose of the data mining project.
Obtain the data from one or more databases or other sources.
Explore, clean, and process the data.
Reduce the data dimension, if necessary.
Determine the data mining task (prediction, classification, clustering, etc.)
Partition the data for supervised learning into training, validation, and test datasets.
Choose the data mining techniques to be used (regression, neural nets, decision trees, etc.)
Perform the task.
Interpret the results.
Deploy the model.
If your data are stored in a CSV (Comma-Separated Values) file, you can use the R function read.csv() or fread() from the data.table package. The syntax is
read.csv(file = "myfile.csv", header = TRUE, sep = ",", skip = k)
fread(file = "myfile.csv", sep = ",", skip = k, header = TRUE)
If the event we are using for classification is rare, for example, fraudulent credit card transactions, sampling a random subset of records may yield so few events that we have little information on them. In such cases, we would want our sampling procedure to overweight the rare class relative to the majority class so that our sample ends up with a healthy complement of rare events. Capturing rare events is usually more valuable than correctly classifying nonevents, so we accept some misclassified nonevents as the price.
The ROSE package creates (possibly) balanced samples by random over-sampling of minority examples, under-sampling of majority examples, or a combination of over- and under-sampling.
Here is a reference on using ROSE: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
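A minimal sketch of rebalancing a rare class with ROSE, using the hacide.train data that ships with the package; the class column cls and the 50/50 target are just the package example's defaults.
library(ROSE)
data(hacide)                                       # loads hacide.train and hacide.test
table(hacide.train$cls)                            # class "1" is rare
balanced <- ovun.sample(cls ~ ., data = hacide.train,
                        method = "both",           # combine over- and under-sampling
                        N = nrow(hacide.train), p = 0.5, seed = 1)$data
table(balanced$cls)                                # roughly balanced classes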
Outliers can be deleted or handled appropriately with domain knowledge. Missing values can be imputed using one of several methods, or the records with missing values can be deleted if the values are missing at random.
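As a simple sketch (using a made-up vector x), missing values can be imputed with the median in base R:
x <- c(1, 2, NA, 4, 100)                           # 100 plays the role of a possible outlier; NA is missing
x[is.na(x)] <- median(x, na.rm = TRUE)             # impute the missing value with the median of the observed values
x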
Some algorithms require that the numeric data be normalized (or standardized). To normalize or standardize a variable, we subtract the mean from the corresponding column and then divide by the standard deviation of the column. The R function scale() will do the job. The resulting values are called z-scores of the original values.
scale(mtcars)
## mpg cyl disp hp drat
## Mazda RX4 0.15088482 -0.1049878 -0.57061982 -0.53509284 0.56751369
## Mazda RX4 Wag 0.15088482 -0.1049878 -0.57061982 -0.53509284 0.56751369
## Datsun 710 0.44954345 -1.2248578 -0.99018209 -0.78304046 0.47399959
## Hornet 4 Drive 0.21725341 -0.1049878 0.22009369 -0.53509284 -0.96611753
## Hornet Sportabout -0.23073453 1.0148821 1.04308123 0.41294217 -0.83519779
## Valiant -0.33028740 -0.1049878 -0.04616698 -0.60801861 -1.56460776
## Duster 360 -0.96078893 1.0148821 1.04308123 1.43390296 -0.72298087
## Merc 240D 0.71501778 -1.2248578 -0.67793094 -1.23518023 0.17475447
## Merc 230 0.44954345 -1.2248578 -0.72553512 -0.75387015 0.60491932
## Merc 280 -0.14777380 -0.1049878 -0.50929918 -0.34548584 0.60491932
## Merc 280C -0.38006384 -0.1049878 -0.50929918 -0.34548584 0.60491932
## Merc 450SE -0.61235388 1.0148821 0.36371309 0.48586794 -0.98482035
## Merc 450SL -0.46302456 1.0148821 0.36371309 0.48586794 -0.98482035
## Merc 450SLC -0.81145962 1.0148821 0.36371309 0.48586794 -0.98482035
## Cadillac Fleetwood -1.60788262 1.0148821 1.94675381 0.85049680 -1.24665983
## Lincoln Continental -1.60788262 1.0148821 1.84993175 0.99634834 -1.11574009
## Chrysler Imperial -0.89442035 1.0148821 1.68856165 1.21512565 -0.68557523
## Fiat 128 2.04238943 -1.2248578 -1.22658929 -1.17683962 0.90416444
## Honda Civic 1.71054652 -1.2248578 -1.25079481 -1.38103178 2.49390411
## Toyota Corolla 2.29127162 -1.2248578 -1.28790993 -1.19142477 1.16600392
## Toyota Corona 0.23384555 -1.2248578 -0.89255318 -0.72469984 0.19345729
## Dodge Challenger -0.76168319 1.0148821 0.70420401 0.04831332 -1.56460776
## AMC Javelin -0.81145962 1.0148821 0.59124494 0.04831332 -0.83519779
## Camaro Z28 -1.12671039 1.0148821 0.96239618 1.43390296 0.24956575
## Pontiac Firebird -0.14777380 1.0148821 1.36582144 0.41294217 -0.96611753
## Fiat X1-9 1.19619000 -1.2248578 -1.22416874 -1.17683962 0.90416444
## Porsche 914-2 0.98049211 -1.2248578 -0.89093948 -0.81221077 1.55876313
## Lotus Europa 1.71054652 -1.2248578 -1.09426581 -0.49133738 0.32437703
## Ford Pantera L -0.71190675 1.0148821 0.97046468 1.71102089 1.16600392
## Ferrari Dino -0.06481307 -0.1049878 -0.69164740 0.41294217 0.04383473
## Maserati Bora -0.84464392 1.0148821 0.56703942 2.74656682 -0.10578782
## Volvo 142E 0.21725341 -1.2248578 -0.88529152 -0.54967799 0.96027290
## wt qsec vs am gear
## Mazda RX4 -0.610399567 -0.77716515 -0.8680278 1.1899014 0.4235542
## Mazda RX4 Wag -0.349785269 -0.46378082 -0.8680278 1.1899014 0.4235542
## Datsun 710 -0.917004624 0.42600682 1.1160357 1.1899014 0.4235542
## Hornet 4 Drive -0.002299538 0.89048716 1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout 0.227654255 -0.46378082 -0.8680278 -0.8141431 -0.9318192
## Valiant 0.248094592 1.32698675 1.1160357 -0.8141431 -0.9318192
## Duster 360 0.360516446 -1.12412636 -0.8680278 -0.8141431 -0.9318192
## Merc 240D -0.027849959 1.20387148 1.1160357 -0.8141431 0.4235542
## Merc 230 -0.068730634 2.82675459 1.1160357 -0.8141431 0.4235542
## Merc 280 0.227654255 0.25252621 1.1160357 -0.8141431 0.4235542
## Merc 280C 0.227654255 0.58829513 1.1160357 -0.8141431 0.4235542
## Merc 450SE 0.871524874 -0.25112717 -0.8680278 -0.8141431 -0.9318192
## Merc 450SL 0.524039143 -0.13920420 -0.8680278 -0.8141431 -0.9318192
## Merc 450SLC 0.575139986 0.08464175 -0.8680278 -0.8141431 -0.9318192
## Cadillac Fleetwood 2.077504765 0.07344945 -0.8680278 -0.8141431 -0.9318192
## Lincoln Continental 2.255335698 -0.01608893 -0.8680278 -0.8141431 -0.9318192
## Chrysler Imperial 2.174596366 -0.23993487 -0.8680278 -0.8141431 -0.9318192
## Fiat 128 -1.039646647 0.90727560 1.1160357 1.1899014 0.4235542
## Honda Civic -1.637526508 0.37564148 1.1160357 1.1899014 0.4235542
## Toyota Corolla -1.412682800 1.14790999 1.1160357 1.1899014 0.4235542
## Toyota Corona -0.768812180 1.20946763 1.1160357 -0.8141431 -0.9318192
## Dodge Challenger 0.309415603 -0.54772305 -0.8680278 -0.8141431 -0.9318192
## AMC Javelin 0.222544170 -0.30708866 -0.8680278 -0.8141431 -0.9318192
## Camaro Z28 0.636460997 -1.36476075 -0.8680278 -0.8141431 -0.9318192
## Pontiac Firebird 0.641571082 -0.44699237 -0.8680278 -0.8141431 -0.9318192
## Fiat X1-9 -1.310481114 0.58829513 1.1160357 1.1899014 0.4235542
## Porsche 914-2 -1.100967659 -0.64285758 -0.8680278 1.1899014 1.7789276
## Lotus Europa -1.741772228 -0.53093460 1.1160357 1.1899014 1.7789276
## Ford Pantera L -0.048290296 -1.87401028 -0.8680278 1.1899014 1.7789276
## Ferrari Dino -0.457097039 -1.31439542 -0.8680278 1.1899014 1.7789276
## Maserati Bora 0.360516446 -1.81804880 -0.8680278 1.1899014 1.7789276
## Volvo 142E -0.446876870 0.42041067 1.1160357 1.1899014 0.4235542
## carb
## Mazda RX4 0.7352031
## Mazda RX4 Wag 0.7352031
## Datsun 710 -1.1221521
## Hornet 4 Drive -1.1221521
## Hornet Sportabout -0.5030337
## Valiant -1.1221521
## Duster 360 0.7352031
## Merc 240D -0.5030337
## Merc 230 -0.5030337
## Merc 280 0.7352031
## Merc 280C 0.7352031
## Merc 450SE 0.1160847
## Merc 450SL 0.1160847
## Merc 450SLC 0.1160847
## Cadillac Fleetwood 0.7352031
## Lincoln Continental 0.7352031
## Chrysler Imperial 0.7352031
## Fiat 128 -1.1221521
## Honda Civic -0.5030337
## Toyota Corolla -1.1221521
## Toyota Corona -1.1221521
## Dodge Challenger -0.5030337
## AMC Javelin -0.5030337
## Camaro Z28 0.7352031
## Pontiac Firebird -0.5030337
## Fiat X1-9 -1.1221521
## Porsche 914-2 -0.5030337
## Lotus Europa -0.5030337
## Ford Pantera L 0.7352031
## Ferrari Dino 1.9734398
## Maserati Bora 3.2116766
## Volvo 142E -0.5030337
## attr(,"scaled:center")
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
## attr(,"scaled:scale")
## mpg cyl disp hp drat wt
## 6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
## qsec vs am gear carb
## 1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
We can also normalize data by rescaling a numeric variable to a scale from 0 to 1, which means to subtract the minimum value of the column and then divide by the range (maximum - minimum). The function rescale() from the “scales” package can do the job.
scale(mtcars$mpg)
## [,1]
## [1,] 0.15088482
## [2,] 0.15088482
## [3,] 0.44954345
## [4,] 0.21725341
## [5,] -0.23073453
## [6,] -0.33028740
## [7,] -0.96078893
## [8,] 0.71501778
## [9,] 0.44954345
## [10,] -0.14777380
## [11,] -0.38006384
## [12,] -0.61235388
## [13,] -0.46302456
## [14,] -0.81145962
## [15,] -1.60788262
## [16,] -1.60788262
## [17,] -0.89442035
## [18,] 2.04238943
## [19,] 1.71054652
## [20,] 2.29127162
## [21,] 0.23384555
## [22,] -0.76168319
## [23,] -0.81145962
## [24,] -1.12671039
## [25,] -0.14777380
## [26,] 1.19619000
## [27,] 0.98049211
## [28,] 1.71054652
## [29,] -0.71190675
## [30,] -0.06481307
## [31,] -0.84464392
## [32,] 0.21725341
## attr(,"scaled:center")
## [1] 20.09062
## attr(,"scaled:scale")
## [1] 6.026948
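Note that the code above standardizes mpg with scale(); for the 0-to-1 rescaling just described, a minimal sketch with the scales package is:
library(scales)
mpg01 <- rescale(mtcars$mpg)                       # (x - min) / (max - min), so values fall in [0, 1]
range(mpg01)
head(mpg01)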
When building models for prediction or classification, we need to make sure that the chosen model generalizes beyond the datasets that we have at hand for building and choosing the model. That is, the chosen model should not be too complicated (for example, with too many terms); otherwise it overfits the data and produces estimates with large standard errors.
The mean squared error of a model can be decomposed into two components: squared bias and variance. If a model underfits the data, the bias component is high but the variance component is low. If a model overfits the data, the bias component is low but the variance component is high. To check the predictive power of a model, we apply it to a new set of predictor values and compare the predictions with the observed outcomes.
Here are some simulations of overfitting: https://medium.com/analytics-vidhya/full-demo-of-overfitting-and-underfitting-in-r-70fd9b66a7d1 and https://fengkehh.github.io/post/2017-06-30-overfitting/ and https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765
When fitting a multiple regression model in R, we use the R function lm(). The syntax is lm(formula, data). To get the prediction with new data, we use the predict() function.
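A minimal sketch of lm() and predict(), using the built-in mtcars data; the predictors wt and hp are chosen only for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)            # fit a multiple regression
summary(fit)                                       # coefficients, R-squared, etc.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))  # hypothetical new records
predict(fit, newdata = new_cars)                   # predicted mpg for the new records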
When building a predictive model, the typical steps are
Determine the purpose: prediction or classification
Obtain the data: you might need to select a sample from the database.
Explore, clean, and process the data: the description of the variables (AKA data dictionary), error checking, creation of dummy variables
Reduce the data dimension: The reduction of the number of variables
Determine the data mining task: prediction of a value or a label, supervised or unsupervised learning
Partition the data (for supervised tasks): the partition of data into training, validation, and test data.
Choose the best candidate technique.
Perform the task using algorithms: the lm() function for multiple regression, the naiveBayes() function for classification, the rpart(), knn() and svm() functions for both prediction and classification, and the functions neuralnet(), linDA(), and lda(), qda(), or mda()
Interpret the results.
Deploy the model: apply the model to new records to predict (or to score) the output values.
Data Exploration is a crucial initial phase in the data mining process that involves using visual and statistical techniques to understand the general properties of a dataset before applying complex algorithms or models.
Think of it as a detective first surveying a crime scene—looking for obvious clues, understanding the layout, and identifying potential points of interest before calling in the forensics team for deep analysis.
The primary goals are to:
Understand the data’s structure.
Uncover initial patterns, trends, and relationships.
Detect anomalies and outliers.
Identify important variables.
Formulate hypotheses for more rigorous testing later.
Inform and guide the subsequent steps of data preprocessing and modeling.
Data exploration typically involves three main types of analysis:
This involves calculating key metrics that describe the central tendency, spread, and shape of the data’s distribution.
Central Tendency: Mean, Median, Mode.
Spread: Range, Variance, Standard Deviation, Interquartile Range (IQR).
Shape: Skewness (measure of asymmetry), Kurtosis (measure of “tailedness”).
For Categorical Data: Frequency counts and proportions for each category.
Example: Calculating the average (mean) customer age and the standard deviation tells you the typical age and how much variation there is in your customer base.
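A minimal sketch of computing such summaries in R, using the built-in mtcars data in place of customer ages:
summary(mtcars$mpg)                                # min, quartiles, median, mean, max
sd(mtcars$mpg)                                     # standard deviation (spread)
table(mtcars$cyl)                                  # frequency counts for a categorical-like variable
prop.table(table(mtcars$cyl))                      # proportions for each category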
Visualizations are powerful tools for seeing patterns that are difficult to detect in tables of numbers.
Histograms & Box Plots: Understand the distribution of a single numerical variable. A box plot quickly shows the median, quartiles, and potential outliers.
Scatter Plots: Visualize the relationship (e.g., correlation) between two numerical variables (e.g., advertising spend vs. sales).
Bar Charts & Pie Charts: Show the frequency or proportion of categories in a categorical variable (e.g., number of customers per country).
Heatmaps & Correlation Matrices: Visualize the correlation coefficients between multiple numerical variables at once. This helps identify highly related variables.
Pair Plots: A grid of scatter plots for multiple variables, providing a comprehensive overview of their relationships.
This is an essential part of exploration where you check for issues that need to be addressed in the data preprocessing stage.
Missing Values: Identifying columns with blank or NULL entries and understanding the pattern of missingness.
Outliers: Detecting extreme values that could be errors or genuine rare events. A scatter plot or box plot makes outliers obvious.
Inconsistencies: Spotting typos, duplicate records, or incorrect data types (e.g., a date stored as text).
We use the R built-in data “iris” to demonstrate visualization.
library(ggplot2)
# 1. Create a side-by-side box plot (bxp)
p <- ggplot(iris, aes(color = Species, x = Sepal.Length))
bxp <- p + geom_boxplot()
# 2. Create an overlay density plot (dp)
p <- ggplot(iris, aes(color = Species, x = Sepal.Length))
dp <- p + geom_density()
# 3. Create a scatter plot
lp <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
# 4. Combine the plots on one page
library(ggpubr)
ggarrange(bxp, dp, lp,
labels = c("A", "B", "C", "D"),
ncol = 2, nrow = 2)
The dimension of a dataset is the number of variables (quantitative or categorical). The dimensionality of a model is the number of predictors or input variables used by the model. For data mining algorithms to operate efficiently, the dimension needs to be reduced. This process is part of the pilot phase of data mining and is done before deploying a model. Several dimension reduction approaches are:
Incorporating domain knowledge to remove or combine categories.
Use data visualization (such as scatterplot matrix) or summaries (such as correlation matrix) to detect redundant variables or categories.
Convert ordinal categorical variables to numerical variables.
Employ automated reduction techniques, such as the principal component analysis (PCA), to create a new set of uncorrelated variables and use only a few of them.
Use regression models and classification & regression trees to remove redundant variables or combine similar categories of categorical variables.
The curse of dimensionality is the affliction caused by adding variables to a model that already involves multiple variables. As variables are added, the feature space becomes increasingly sparse, and classification and prediction models fail because the available data are insufficient to provide a useful model across so many variables. Dimension reduction in the artificial intelligence (AI) literature is often referred to as feature selection (selecting predictor variables or features) or feature extraction (creating more useful variables in place of the original ones).
The handling of a categorical variable in regression is usually through the creation of dummy variables. Many times, software will handle it automatically. If you have \(k\) categories for a variable, you can create \(k\) dummy variables, each taking values of 0 or 1. Since these \(k\) dummy variables are linearly dependent (they always sum to 1), to avoid perfect multicollinearity you should use only \(k-1\) of them. If your data have many categorical variables, then many more dummy variables will be created. If you think some categories of a categorical variable can be combined to form a new categorical variable with fewer categories, do it!
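A minimal sketch of creating the \(k-1\) dummy variables with model.matrix(), using iris$Species (which has \(k = 3\) categories):
dummies <- model.matrix(~ Species, data = iris)[, -1]  # drop the intercept column, keeping k - 1 = 2 dummies
head(dummies)                                          # columns Speciesversicolor and Speciesvirginica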
Another way to achieve data reduction is to use meaningful numerical values for the categories of an ordinal variable. A categorical variable is ordinal (such as a letter grade or a rating on a Likert scale) if there is a natural ordering of its categories. Some domain knowledge might be needed when mapping categories to numerical values.
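A minimal sketch of mapping an ordinal variable to numeric scores; the grades and scores here are hypothetical.
grade <- factor(c("A", "B", "C", "B", "A"),
                levels = c("C", "B", "A"), ordered = TRUE)  # C < B < A
grade_num <- as.numeric(grade)                              # C -> 1, B -> 2, A -> 3
grade_num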
The principal component analysis is a technique that allows us to create a set of \(p\) uncorrelated quantitative variables (called principal components) out of a set of \(p\) quantitative variables, such that the sum of the variances of the PC’s equals the sum of the variances of the original variables. The PC that has the largest variance among all PC’s is called the first principal component, the PC that has the second largest variance among all PC’s is called the second principal component, and so on.
Here is a math-free introduction to PCA: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
head(iris, n = 10)
D = iris[,-5] # Remove 5th column which is categorical
head(D, n = 10)
# No centering, no scaling
pcs = princomp(D, cor = TRUE, scores = TRUE) # Use the correlation matrix to start PCA
# scores = TRUE returns a new data frame with columns being the principal components
pcs
summary(pcs) # Shows importance of components
plot(pcs, type = "l", main = "A Scree Plot") # Variances of PC's
# The variances in the scree plot equal the eigenvalues of the correlation matrix of D, as shown below.
eigen(cor(D))
eigen(cor(D))$values # These sum to the number of variables (here 4)
# Get the corresponding eigen vectors
eigen(cor(D))$vectors
# pcs is a list
names(pcs)
pcs$sdev # Gives the standard deviation of each of the principal components.
pcs$center # Gives the mean of each column of the original data D
pcs$scale # Gives the scaling factor (similar to standard deviation) of each column of the original data D
pcs$loadings # Extract coefficients (or vectors); a matrix whose columns contain the eigenvectors
pcs$scores # Extract the data frame with columns that are the principal components variables (called a score matrix).
# The difference between this new data frame and the original data frame is that the new
# one has columns that are all uncorrelated to each other. The total variance remains the same.
biplot(pcs) # A scatterplot of the first two PCs with original variables and observation labels
# Add lines to the biplot
abline(h=0, v=0, lty = "dashed")
# Calculate the correlation between each of the original variable and each PC
cor(pcs$scores,D)
# Calculate angles between PC1 and original variables
library(magrittr) # provides the pipe operator %>%
(cor(pcs$scores,D)[1,] %>% acos())*180/pi # acos() is the inverse cosine function
The multiple linear regression model has the structure: \[y=\beta_0+\beta_1x_1 + \beta_2x_2 + \beta_3x_3 +\cdots+ \beta_px_p + \epsilon\] where \(y\) is the response variable, the \(x_i\)’s are predictors, \(\beta\)’s are coefficients, and \(\epsilon\) is the error term which is assumed to have a normal distribution with mean 0 and a constant variance \(\sigma^2\).
The R function that fits a multiple linear regression is lm(). The syntax is
lm(formula = y ~ x1 + x2 + x3 + ... + xp, data = myDataFrame)
Since some of the x variables (predictors) might be correlated, the model fitting will have the so-called multi-collinearity issue which results in larger standard errors of the estimates of regression coefficients. To avoid such an issue, we can drop some predictors that are highly correlated to other predictors.
We would like to have parsimonious models which have a small number of predictors.
There are a few criteria that help select predictors: Mallow’s \(C_p\), the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and so on.
Mallow’s \(C_p\) criterion favors models whose \(C_p\) is small and close to the number of coefficients including the intercept. It is defined as
\[C_p=\frac{SSE}{\hat{\sigma}^2_{full}}+2(p+1)-n\] where \(n\) is the number of observations, \(p\) is the number of predictors in the smaller model, \(SSE\) is the sum of squares of residuals. The quantity \(\hat{\sigma}^2_{full}\) is the estimate of the variance of the error term under the full model, a model with all predictors in.
The \(AIC\) (Akaike Information Criterion) is defined as
\[AIC=n\cdot log(SSE/n)+2p\] The \(BIC\) (Bayesian Information Criterion) is defined as
\[BIC=n\cdot log(SSE/n)+p\cdot log(n)\] The smaller the AIC/BIC, the better the model.
A reference: https://online.stat.psu.edu/stat462/node/197/
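R also provides built-in AIC() and BIC() functions. Their values include constant terms from the full log-likelihood, so they differ from the formulas above by an additive constant, but they rank models on the same data in the same way. A minimal sketch (the predictors wt and hp are chosen only for illustration):
fit <- lm(mpg ~ wt + hp, data = mtcars)
AIC(fit)                                           # Akaike Information Criterion
BIC(fit)                                           # Bayesian Information Criterion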
Here we show how to calculate Mallow’s \(C_p\) directly with R code. (The R package “leaps” can also compute \(C_p\) for all subsets automatically.)
full_model = lm(mpg ~ ., data = mtcars)
sigma.square.full = summary(full_model)$sigma^2 # The MSE of the full model
reduced_model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
p = 4 # 4 predictors used in the reduced model
n = nrow(mtcars) # Number of observations in data
sse.reduced = sum(reduced_model$residuals^2) # The SSE of the reduced model
Cp=sse.reduced/sigma.square.full + 2*(p+1) - n
Cp
## [1] 4.430434
Variable selection can be done using a “backward”, “forward”, or “stepwise” method.
# Backward selection starts with the full model
step = step(full_model, direction = "backward")
summary(step) # Summarize the final model
pred = predict(step) # You can use validation data if you have it (which is recommended)
pred
library(forecast)
accuracy(pred, mtcars$mpg)
# Stepwise selection can start with the full model
step = step(full_model, direction = "both")
summary(step)
intercept.only_model = lm(mpg ~ 1, data = mtcars)
# Forward selection starts with the intercept only model, or other small model
step = step(intercept.only_model,
scope=list(lower=intercept.only_model, upper = full_model),
direction = "forward")
summary(step)
# Stepwise selection can also start with the intercept only model
step = step(intercept.only_model,
scope=list(lower=intercept.only_model, upper = full_model),
direction = "both")
summary(step)
All three approaches end up with the same model here, though that is not necessarily the case in general.
The following is from the above reference, with a minor modification.
The ordinary multiple linear regression model based on the ordinary least squares method performs poorly when the dataset contains more variables than observations.
A better alternative is penalized regression, which creates a linear regression model that is penalized for having too many variables by adding a constraint to the least squares criterion. This is also known as shrinkage or regularization.
The consequence of imposing this penalty is to reduce (i.e., shrink) the coefficient values towards zero, which allows the less contributive variables to have coefficients close or equal to zero.
Note that, the shrinkage requires the selection of a tuning parameter (\(\lambda\)) that determines the amount of shrinkage.
Three most commonly used penalized regression methods are ridge regression, lasso regression and elastic net regression.
Ridge regression shrinks the regression coefficients, so that variables, with minor contribution to the outcome, have their coefficients close to zero.
The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called \(L_2\)-norm, which is the sum of the squared coefficients.
The amount of the penalty can be fine-tuned using a constant called lambda, denoted by Greek letter \(\lambda\). Selecting a good value for \(\lambda\) is critical.
When \(\lambda=0\), the penalty term has no effect, and ridge regression will produce the classical least squares coefficients. However, as \(\lambda\) increases toward infinity, the impact of the shrinkage penalty grows, and the ridge regression coefficients get closer to zero.
Note that, in contrast to the ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying the ridge regression, so that all the predictors are on the same scale.
The standardization of a predictor x, can be achieved using the formula \(x'=\frac{x-mean(x)}{sd(x)}\), where \(mean(x)\) is the mean of \(x\) and \(sd(x)\) is the standard deviation. The consequence of this is that, all standardized predictors will have a mean of zero and a standard deviation of one allowing the final fit to not depend on the scale on which the predictors are measured.
One important advantage of ridge regression is that it still performs well, compared to the ordinary least squares method, in a situation where the number of predictors (\(p\)) is larger than the number of observations (\(n\)).
One disadvantage of ridge regression is that it will include all the predictors in the final model, unlike the stepwise regression methods, which generally select models that involve a reduced set of variables.
Ridge regression shrinks the coefficients towards zero, but it does not set any of them exactly to zero. The lasso regression is an alternative that overcomes this drawback.
Required R packages for implementing the regularization methods are:
tidyverse for easy data manipulation and visualization
glmnet for computing penalized regression
caret for easy machine learning workflow
The key function to fit a penalized generalized linear model is glmnet(), which is used as follows:
\[\text{glmnet(x, y, alpha, lambda)},\] where x is a matrix of predictors, y is the response vector, alpha is the elastic net mixing parameter (0 for ridge regression, 1 for lasso, values in between for elastic net), and lambda is the amount of shrinkage.
In penalized regression, you need to specify a constant lambda to adjust the amount of the coefficient shrinkage. The best lambda for your data, can be defined as the lambda that minimize the cross-validation prediction error rate. This can be determined automatically using the function cv.glmnet().
A note: when making predictions, glmnet uses the argument “newx” instead of “newdata” in the predict() function!
A detailed introduction to the glmnet() function is https://glmnet.stanford.edu/articles/glmnet.html.
Partition Data:
# Load packages
library(tidyverse)
library(caret)
library(glmnet)
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set.
# Write my own function to partition original dataset into training and validation sets
partition = function(Data, prop.train = 0.8){
n = nrow(Data)
shuffled.idx = sample(1:n)
train.idx = shuffled.idx[1:round(n*prop.train)]
train.data = Data[train.idx, ]
validation.data = Data[-train.idx, ]
return(list(train.data = train.data, validation.data= validation.data))
}
partitioned.data = partition(Boston, 0.8)
train.data <- partitioned.data$train.data
validation.data <- partitioned.data$validation.data
# Textbook partitions data by the function createDataPartition() from package "caret"
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
validation.data <- Boston[-training.samples, ]
Train a Ridge Regression Model:
# Create a MATRIX containing only predictors and converting any categorical variable to dummy(0/1) variables
x = model.matrix(medv~., train.data)[, -1] # Remove the first column which is the intercept term
# Create the outcome/response variable
y = train.data$medv
# Apply cross-validation to get the best lambda value that minimizes the cross-validation error.
cv <- cv.glmnet(x, y, alpha = 0) # alpha = 0 indicates ridge regression
names(cv) # Check What are in the output
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "call" "name" "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
# Display the best lambda value that minimizes the cross-validation error.
cv$lambda.min
## [1] 0.6711405
# Apply the glmnet() function from the package "glmnet" to fit penalized generalized linear regression model with the best lambda, based on training data
model = glmnet(x, y, alpha = 0, lambda = cv$lambda.min) # alpha = 0 indicates ridge regression
# Display regression coefficients
coef(model)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 30.161201898
## crim -0.081304930
## zn 0.042455174
## indus -0.018052570
## chas 3.034250945
## nox -12.019329562
## rm 3.799027209
## age -0.001052654
## dis -1.052520763
## rad 0.146860467
## tax -0.005189129
## ptratio -0.936572749
## black 0.008465278
## lstat -0.472738519
# Plot the model
plot(model, label=TRUE)
Make a Prediction:
# Prepare data for predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(newx = x.new) %>% as.vector()
# Model performance metrics (note: sum() below gives sqrt(SSE); replace sum() with mean() for the conventional RMSE)
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 48.08355 0.7302393
Lasso stands for Least Absolute Shrinkage and Selection Operator. It shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called \(L_1\)-norm, which is the sum of the absolute coefficients.
In the case of lasso regression, the penalty has the effect of forcing some of the coefficient estimates, those with a minor contribution to the model, to be exactly equal to zero. This means that the lasso can also be seen as an alternative to the subset selection methods for performing variable selection in order to reduce the complexity of the model.
As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical.
A helpful video: https://www.youtube.com/watch?v=ldOqDuXKfL4
One obvious advantage of lasso regression over ridge regression is that it produces simpler and more interpretable models that involve only a reduced set of the predictors. However, neither ridge regression nor the lasso will universally dominate the other.
Generally, lasso might perform better in a situation where some of the predictors have large coefficients, and the remaining predictors have very small coefficients.
Ridge regression will perform better when the outcome is a function of many predictors, all with coefficients of roughly equal size.
To fit a penalized regression using the lasso approach, the only difference from the R code used for ridge regression is that you specify alpha = 1 instead of alpha = 0 (ridge regression).
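A minimal sketch, reusing the x and y objects built from train.data in the ridge example above:
cv_lasso <- cv.glmnet(x, y, alpha = 1)                     # cross-validate lambda for the lasso
lasso_fit <- glmnet(x, y, alpha = 1, lambda = cv_lasso$lambda.min)
coef(lasso_fit)                                            # some coefficients may be exactly zero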
Elastic Net produces a regression model that is penalized with both the L1-norm and L2-norm. The consequence of this is to effectively shrink coefficients (like in ridge regression) and to set some coefficients to zero (as in LASSO).
The elastic net regression can be easily computed using the caret workflow, which invokes the glmnet package.
We use caret to automatically select the best tuning parameters alpha and lambda. The caret package tests a range of possible alpha and lambda values and then selects the best combination, resulting in a final model that is an elastic net model.
Here, we’ll test the combination of 10 different values for alpha and lambda. This is specified using the option tuneLength.
The best alpha and lambda values are those values that minimize the cross-validation error.
# Build the model using the training set
set.seed(123)
model <- train(
medv ~., data = train.data, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Best tuning parameter
model$bestTune
# Display Coefficient of the final model. You need
# to specify the best lambda
coef(model$finalModel, model$bestTune$lambda)
# Make predictions on the validation data
x.new <- model.matrix(medv ~., validation.data)[,-1]
predictions <- model %>% predict(x.new)
# Model performance metrics
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
We can easily compute and compare ridge, lasso and elastic net regression using the caret workflow.
The package “caret” will automatically choose the best tuning parameter values, compute the final model and evaluate the model performance using cross-validation techniques.
# Setup a grid range of lambda values:
lambda <- 10^seq(-3, 3, length = 100)
# 0. # Build the model based on lm()
set.seed(123)
lm <- train(
medv ~.,
data = train.data,
method = "lm"
)
# Model coefficients
lm$finalModel
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) crim zn indus chas nox
## 38.857751 -0.101137 0.056995 0.042972 2.800200 -17.706189
## rm age dis rad tax ptratio
## 3.557881 0.004066 -1.386298 0.290356 -0.011244 -1.048322
## black lstat
## 0.008454 -0.528554
# Make predictions
predictions <- lm %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 47.23971 0.7371523
# 1. # Build the model based on ridge regression
set.seed(123)
ridge <- train(
medv ~.,
data = train.data,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
# Model coefficients
coef(ridge$finalModel, ridge$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s=0.5336699
## (Intercept) 3.011912e+01
## crim -8.097438e-02
## zn 4.257096e-02
## indus -1.789278e-02
## chas 3.032259e+00
## nox -1.193438e+01
## rm 3.795155e+00
## age -8.805055e-04
## dis -1.051099e+00
## rad 1.466378e-01
## tax -5.215477e-03
## ptratio -9.355305e-01
## black 8.472816e-03
## lstat -4.736292e-01
# Make predictions
predictions <- ridge %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 48.09003 0.7301834
# 2. Build the model based on lasso regression
set.seed(123)
lasso <- train(
medv ~., data = train.data, method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = lambda)
)
# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s=0.01873817
## (Intercept) 37.297701673
## crim -0.093817075
## zn 0.052954116
## indus 0.017989474
## chas 2.835973849
## nox -16.386225027
## rm 3.611415279
## age 0.001147027
## dis -1.352191129
## rad 0.249263001
## tax -0.009198695
## ptratio -1.026893010
## black 0.008338414
## lstat -0.523566223
# Make predictions
predictions <- lasso %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 47.26953 0.7369978
# 3. Build the model based on Elastic net:
set.seed(123)
elastic <- train(
medv ~., data = train.data, method = "glmnet"
)
# Model coefficients
coef(elastic$finalModel, elastic$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s=0.1342281
## (Intercept) 3.560779e+01
## crim -9.072383e-02
## zn 5.053326e-02
## indus 7.336181e-03
## chas 2.896603e+00
## nox -1.538997e+01
## rm 3.664437e+00
## age 6.215139e-04
## dis -1.290027e+00
## rad 2.219325e-01
## tax -8.000154e-03
## ptratio -1.007558e+00
## black 8.398374e-03
## lstat -5.132889e-01
# Make predictions
predictions <- elastic %>% predict(newdata = validation.data)
# Model prediction performance
data.frame(
RMSE = sqrt(sum((predictions-validation.data$medv)^2)),
Rsquare = (cor(predictions, validation.data$medv))^2
)
## RMSE Rsquare
## 1 47.4 0.7358807
# 4. Comparing performances of the models:
# The performance of the different models - ridge, lasso and elastic net - can be easily compared using caret.
# The best model is defined as the one that minimizes the prediction error.
models <- list(lm = lm, ridge = ridge, lasso = lasso, elastic = elastic)
resamples(models) %>% summary( metric = "RMSE")
##
## Call:
## summary.resamples(object = ., metric = "RMSE")
##
## Models: lm, ridge, lasso, elastic
## Number of resamples: 25
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lm 4.159431 4.717935 4.980008 4.972712 5.219128 6.144363 0
## ridge 4.139008 4.708648 5.031095 4.978794 5.183415 5.919056 0
## lasso 4.142013 4.727994 4.972148 4.970572 5.239575 6.115328 0
## elastic 4.127985 4.743703 4.976498 4.966967 5.211853 6.069288 0
It can be seen that the elastic net model has the lowest mean RMSE and the lasso model has the lowest median RMSE, although the differences among the four models are small.
Ordinary multiple regression assumes a continuous response variable with normally distributed errors, a mean dependent on a linear combination of predictors, and constant variance. In R, this is implemented using the lm() function. When the response variable deviates from normality, generalized linear models (GLMs) extend this framework to accommodate various distributions. In R, GLMs are fitted using the glm() function.
Binary logistic regression, a special case of GLMs, models a binary categorical response variable assuming a Bernoulli distribution with a probability parameter \(p\) linked to a linear combination of predictors. In R, logistic regression is implemented as:
glm(y ~ x1 + x2 + x3 + ..., data = myData, family = binomial)
We use the Iris dataset to demonstrate logistic regression, transforming the Species variable into a binary outcome: “setosa” (1) versus “non-setosa” (0).
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
D <- iris
# Replace the Species column by a dummy variable indicating Setosa.
D$Species <- ifelse(D$Species == "setosa", 1, 0)
To understand relationships between predictors and the binary response, we visualize the data using boxplots and compute correlations.
library(ggplot2)
library(gridExtra)
p1 <- ggplot(D, aes(x = factor(Species), y = Sepal.Length)) +
geom_boxplot() +
labs(title = "Boxplot of Sepal Length", y = "Sepal Length (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p2 <- ggplot(D, aes(x = factor(Species), y = Sepal.Width)) +
geom_boxplot() +
labs(title = "Boxplot of Sepal Width", y = "Sepal Width (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p3 <- ggplot(D, aes(x = factor(Species), y = Petal.Length)) +
geom_boxplot() +
labs(title = "Boxplot of Petal Length", y = "Petal Length (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
p4 <- ggplot(D, aes(x = factor(Species), y = Petal.Width)) +
geom_boxplot() +
labs(title = "Boxplot of Petal Width", y = "Petal Width (cm)", x = NULL) +
theme_minimal() +
theme(legend.position = "none")
grid.arrange(p1, p2, p3, p4, ncol = 2)
# Display a correlation matrix
cor(D[, -5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
The boxplots reveal how predictors distinguish species, while the correlation matrix highlights linear relationships among predictors.
We partition the data into training (60%, a subjective choice) and validation sets, then train a logistic regression model using Sepal.Length as the predictor.
n <- nrow(D)
set.seed(3.14)
idx <- sample(1:n)
n.rows.training <- n * 0.6
trainingData <- D[idx[1:n.rows.training], ]
validationData <- D[-idx[1:n.rows.training], ]
m <- glm(Species ~ Sepal.Length, data = trainingData, family = binomial)
summary(m)
##
## Call:
## glm(formula = Species ~ Sepal.Length, family = binomial, data = trainingData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.3285 5.1970 4.681 2.85e-06 ***
## Sepal.Length -4.5237 0.9561 -4.731 2.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 114.573 on 89 degrees of freedom
## Residual deviance: 50.449 on 88 degrees of freedom
## AIC: 54.449
##
## Number of Fisher Scoring iterations: 6
We predict probabilities on the validation set and assess performance using a confusion matrix.
# Extract the actual response values from the validation dataset
actual <- validationData$Species
# Predict probabilities for the validation data, excluding the response column, with type="response" for probabilities
pred <- predict(m, newdata = validationData[, -5], type = "response")
# Create a data frame to compare actual and predicted values
data.frame(actual, pred)
## actual pred
## 1 1 7.786169e-01
## 13 1 9.318040e-01
## 14 1 9.924347e-01
## 21 1 4.751484e-01
## 23 1 9.712369e-01
## 24 1 7.786169e-01
## 25 1 9.318040e-01
## 27 1 8.468356e-01
## 28 1 6.910971e-01
## 30 1 9.555154e-01
## 31 1 9.318040e-01
## 32 1 4.751484e-01
## 34 1 3.654331e-01
## 35 1 8.968187e-01
## 39 1 9.881584e-01
## 41 1 8.468356e-01
## 42 1 9.815098e-01
## 44 1 8.468356e-01
## 45 1 7.786169e-01
## 49 1 5.873164e-01
## 51 0 6.503024e-04
## 52 0 9.725824e-03
## 53 0 1.021918e-03
## 55 0 6.208747e-03
## 59 0 3.958440e-03
## 60 0 6.910971e-01
## 63 0 5.658760e-02
## 72 0 3.675319e-02
## 79 0 5.658760e-02
## 81 0 3.654331e-01
## 82 0 3.654331e-01
## 85 0 4.751484e-01
## 88 0 1.520474e-02
## 89 0 2.681105e-01
## 91 0 3.654331e-01
## 92 0 3.675319e-02
## 93 0 1.290964e-01
## 99 0 7.786169e-01
## 100 0 1.889877e-01
## 102 0 1.290964e-01
## 105 0 6.208747e-03
## 106 0 4.311279e-05
## 110 0 2.632444e-04
## 111 0 6.208747e-03
## 116 0 9.725824e-03
## 121 0 1.021918e-03
## 123 0 2.742526e-05
## 124 0 1.520474e-02
## 126 0 2.632444e-04
## 127 0 2.369628e-02
## 128 0 3.675319e-02
## 129 0 9.725824e-03
## 131 0 1.065376e-04
## 134 0 1.520474e-02
## 135 0 3.675319e-02
## 138 0 9.725824e-03
## 141 0 2.521668e-03
## 142 0 1.021918e-03
## 145 0 2.521668e-03
## 147 0 1.520474e-02
# Load the caret package for performance evaluation
library(caret)
# Convert predicted probabilities to binary labels (1 if probability > 0.5, else 0) and convert to a factor
pred.label <- factor((as.numeric(pred) > 0.5) * 1)
# Convert actual values to a factor for compatibility with confusionMatrix
act <- factor(actual)
# Generate a confusion matrix to evaluate model performance, setting positive class as "1"
confusionMatrix(pred.label, act, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 3
## 1 2 17
##
## Accuracy : 0.9167
## 95% CI : (0.8161, 0.9724)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 5.6e-06
##
## Kappa : 0.8101
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8500
## Specificity : 0.9500
## Pos Pred Value : 0.8947
## Neg Pred Value : 0.9268
## Prevalence : 0.3333
## Detection Rate : 0.2833
## Detection Prevalence : 0.3167
## Balanced Accuracy : 0.9000
##
## 'Positive' Class : 1
##
A lift plot measures how effectively the model identifies positive outcomes compared to an intercept-only logistic model (i.e., a baseline model with no predictors). The dataset is sorted by predicted probabilities in descending order and divided into deciles, where each decile represents 10% of the data. Lift is calculated as the ratio of the model’s success rate (proportion of true positives in a decile) to the baseline success rate (proportion of positives in the entire dataset).
Example: Suppose we have 100 observations, with 20 positive cases (a 20% baseline positive rate). We sort the data by predicted probabilities and divide it into 10 deciles (10 observations each). If the model identifies 8 positive cases in the top decile (assuming you have a cutoff in mind to define a positive), the success rate in this decile is \(8/10 = 80\%\). The lift is \(80\% / 20\% = 4\), meaning the model is 4 times as effective at identifying positives in this decile as the baseline.
library(gains)
# Calculate gains table, dividing data into 10 deciles based on predicted probabilities
gain <- gains(actual, pred, groups = 10)
# Compute lift as the ratio of mean response in each decile to the overall mean response
lift <- gain$mean.resp / mean(actual)
# Create a barplot for the decile-wise lift chart, with decile depths as x-axis labels
midpoints <- barplot(lift, names.arg = gain$depth, xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise Lift Chart", ylim = c(0, 3.5))
# Add text labels above bars to display rounded lift values for clarity
text(midpoints, lift + 0.1, labels = round(lift, 1), cex = 0.8)
Interpretation: The model identifies up to three times more positives in the top deciles compared to baseline, indicating strong performance.
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across thresholds. The Area Under the Curve (AUC) summarizes model performance.
library(pROC)
roc <- roc(actual ~ pred)
sensitivity = roc$sensitivities
specificity = roc$specificities
plot(sensitivity~I(1-specificity), type = "l", col = "blue", main = "ROC Curve", xlab = "1-Specificity")
# Calculate AUC
auc(roc)
## Area under the curve: 0.9819
Interpretation: An AUC > 0.5 indicates better-than-random performance, with values closer to 1 reflecting stronger models.
Trade-off: Improving recall often reduces precision, and vice versa. The \(F_1\)-score balances these metrics. \(F_1\) is the harmonic mean of the two:
\[F_1=\frac{2}{\frac{1}{Recall}+\frac{1}{Precision}}\]
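A minimal sketch of computing precision, recall, and \(F_1\) from the confusion matrix shown earlier (positive class “1”: TP = 17, FP = 2, FN = 3):
TP <- 17; FP <- 2; FN <- 3
precision <- TP / (TP + FP)                        # matches Pos Pred Value above (about 0.895)
recall    <- TP / (TP + FN)                        # matches Sensitivity above (0.85)
F1 <- 2 / (1 / recall + 1 / precision)             # harmonic mean of recall and precision
F1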
Choosing Metrics: For rare-event (imbalanced) data, precision, recall, and the \(F_1\)-score are more useful than accuracy for selecting models.
Read the paper https://pmc.ncbi.nlm.nih.gov/articles/PMC8902098/ (right click the link and choose “open link in new tab”) to learn how a real paper can be drafted. Mimic the paper to do your project on a logistic regression.
Classification and Regression Trees (CART) are decision tree algorithms used for classification (predicting a categorical label) and regression (predicting a numeric value).
CART works by recursively partitioning the data into increasingly homogeneous subsets, choosing at each step the predictor and split point that best separate the outcome.
To understand how a classification tree model works, watch videos:
Key advantages of CART: the resulting trees are easy to interpret and visualize, they handle both numeric and categorical predictors, and they require little data preprocessing (no normalization is needed).
Key disadvantages of CART: single trees tend to overfit without pruning and can be unstable, since small changes in the data may produce a very different tree.
We’ll walk through a classification example and a regression example using the rpart package.
We will also cover pruning, tuning, and variable importance.
Example 1: Classification with the Iris Dataset
We’ll use the iris dataset to build a classification tree to predict flower species.
Step 1: Load Libraries and Split Data
library(rpart)
library(rpart.plot)
library(caret) # for splitting data and assessing model
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[trainIndex, ]
validation_data <- iris[-trainIndex, ]
Step 2: Build the Classification Tree
We’ll create a decision tree to classify Species based on the other features.
# Train the classification tree
tree_model <- rpart(Species ~ ., data = train_data, method = "class")
# Plot the tree
rpart.plot(tree_model, main = "Classification Tree for Iris Dataset")
The resulting plot shows the tree structure, with nodes representing decisions and leaves representing predicted classes.
Step 3: Make Predictions
Predict species for the first few rows of the dataset:
# Predict
predictions <- predict(tree_model, newdata = validation_data, type = "class")
Step 4: Evaluate the Model
Use the caret package to compute accuracy:
library(caret)
confusionMatrix(predictions, validation_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.8173, 0.986)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.8667
## Specificity 1.0000 0.9333 0.9667
## Pos Pred Value 1.0000 0.8750 0.9286
## Neg Pred Value 1.0000 0.9655 0.9355
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.2889
## Detection Prevalence 0.3333 0.3556 0.3111
## Balanced Accuracy 1.0000 0.9333 0.9167
This outputs accuracy and other metrics like sensitivity and specificity.
Example 2: Regression with the Boston Housing Dataset
Now, let’s build a regression tree to predict house prices using the Boston dataset from the MASS package.
Step 1: Load and Split Data
library(MASS)
data(Boston)
library(caret)
library(rpart)
library(rpart.plot)
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(Boston$medv, p = 0.7, list = FALSE)
train_data <- Boston[trainIndex, ]
validation_data <- Boston[-trainIndex, ]
Step 2: Build the Regression Tree
# Fit the regression tree
reg_tree <- rpart(medv ~ ., data = train_data, method = "anova")
# Plot the tree
rpart.plot(reg_tree, main = "Regression Tree for Boston Housing")
Here, method = "anova" tells rpart to build a regression tree, with splits chosen to reduce the sum of squared errors.
Step 3: Make Predictions and calculate the performance metric “root mean square error”.
# Predict
reg_predictions <- predict(reg_tree, validation_data)
# Calculate RMSE
rmse <- sqrt(mean((reg_predictions - validation_data$medv)^2))
cat("Root Mean Squared Error:", rmse, "\n")
## Root Mean Squared Error: 5.568488
To avoid overfitting, we can prune the tree based on the complexity parameter (cp):
# Print complexity parameter table
printcp(reg_tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = train_data, method = "anova")
##
## Variables actually used in tree construction:
## [1] lstat nox ptratio rm
##
## Root node error: 28877/356 = 81.115
##
## n= 356
##
## CP nsplit rel error xerror xstd
## 1 0.462980 0 1.00000 1.00335 0.097781
## 2 0.165305 1 0.53702 0.71078 0.074466
## 3 0.080161 2 0.37171 0.52042 0.061138
## 4 0.034497 3 0.29155 0.38648 0.054263
## 5 0.027794 4 0.25706 0.36181 0.052094
## 6 0.021800 5 0.22926 0.33327 0.048749
## 7 0.011952 6 0.20746 0.31856 0.048561
## 8 0.011358 7 0.19551 0.31018 0.048737
## 9 0.010000 8 0.18415 0.31168 0.049371
# Calculate the optimal cp value
optimal.cp = reg_tree$cptable[which.min(reg_tree$cptable[,"xerror"]),"CP"]
cat("The optimal complexity parameter is ", optimal.cp, ", based on the cross validation error.", sep = "")
## The optimal complexity parameter is 0.01135807, based on the cross validation error.
# Prune the tree
pruned_tree <- prune(reg_tree, cp = optimal.cp)
# Plot pruned tree
rpart.plot(pruned_tree, main = "Pruned Regression Tree")
The cp table shows the cross-validation error for different tree sizes. We select the cp value that minimizes the error (xerror).
You can tune a tree by adjusting parameters in rpart.control:
# Example with custom control parameters
control <- rpart.control(minsplit = 20, maxdepth = 10, cp = 0.01)
tuned_tree <- rpart(medv ~ ., data = Boston, method = "anova", control = control)
Here, minsplit is the minimum number of observations a node must contain before a split is attempted, maxdepth limits how deep the tree can grow, and cp is the complexity-parameter threshold a split must exceed to be kept.
Pruning is the process of reducing the size of a decision tree by removing branches (nodes and leaves) that contribute little to predictive accuracy or are likely to cause overfitting. Overfitting occurs when a tree is too complex and captures noise in the training data, leading to poor performance on unseen data (e.g., validation or test sets).
Purpose: reduce overfitting so that the tree generalizes better to unseen data.
How It Works
Pruning in CART typically involves cost-complexity pruning, which balances the tree’s complexity (number of nodes) against its predictive error.
The best tree size minimizes the cost-complexity measure, defined as:
\[R_\alpha(T) = R(T) + \alpha \cdot |T|\]
Where \(R(T)\) is the tree’s error (misclassification rate for classification, SSE for regression), \(|T|\) is the number of terminal nodes (leaves), and \(\alpha \ge 0\) is the complexity parameter.
The complexity parameter \(\alpha\) (referred to as \(c_p\) in rpart) determines how much penalty is applied for each additional terminal node. A higher \(\alpha\) favors simpler trees by pruning branches that don’t sufficiently reduce \(R(T)\).
The rpart package uses the complexity parameter (\(c_p\)) to control pruning:
A higher \(c_p\) value leads to more aggressive pruning (simpler tree). A lower \(c_p\) value retains more branches (more complex tree).
The pruning process involves:
Growing a full tree (with a low \(c_p\), allowing many splits).
Evaluating the tree’s performance using cross-validation or a validation set.
Selecting the \(c_p\) value that minimizes the cross-validated error (xerror in rpart).
Removing branches that don’t justify their complexity based on the chosen \(c_p\).
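As a small numeric illustration (the error values here are hypothetical, not taken from the Boston tree above), compare a 5-leaf subtree with \(R(T_5) = 0.20\) to a 3-leaf subtree with \(R(T_3) = 0.26\):
\[\begin{aligned} \alpha = 0.02:&\quad R_\alpha(T_5) = 0.20 + 0.02\times 5 = 0.30, \quad R_\alpha(T_3) = 0.26 + 0.02\times 3 = 0.32 \;\Rightarrow\; \text{keep the 5-leaf tree};\\ \alpha = 0.05:&\quad R_\alpha(T_5) = 0.20 + 0.05\times 5 = 0.45, \quad R_\alpha(T_3) = 0.26 + 0.05\times 3 = 0.41 \;\Rightarrow\; \text{prune to 3 leaves}. \end{aligned}\]
A larger \(\alpha\) (equivalently, a larger \(c_p\)) therefore pushes the selection toward smaller trees.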
Tuning is the process of adjusting hyperparameters of the decision tree algorithm before or during model training to optimize its performance. In the context of rpart, tuning involves setting parameters in the rpart.control function to control how the tree is grown.
Purpose: control how the tree is grown so that model complexity is balanced against predictive performance.
How It Works
Tuning involves specifying parameters in rpart.control to influence how the tree is constructed. Common parameters include minsplit (the minimum number of observations required to attempt a split), minbucket (the minimum number of observations in a terminal node), maxdepth (the maximum depth of the tree), and cp (the complexity parameter).
Tuning typically involves choosing candidate values for these parameters, fitting a tree for each setting, and comparing the resulting models using cross-validation or a validation set.
Check which variables are most important. We continue with the regression tree “reg_tree” we trained before.
# Variable importance for regression tree
var_imp <- reg_tree$variable.importance
barplot(var_imp, main = "Variable Importance", las = 2)
This plots the relative importance of each feature in the tree.
For a regression tree, the importance of a feature (i.e. predictor) sums the reductions in the sum of squared errors (sse) across splits.
For a classification tree, the importance of a feature (i.e. predictor) sums the reductions in impurity (e.g., Gini index) across splits.
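As a small worked illustration with hypothetical counts: suppose a parent node holds 10 observations (6 of class A, 4 of class B) and a split sends the 5 A’s on one side to a pure left child and 1 A plus 4 B’s to the right child. Then
\[\begin{aligned} \text{Gini}_{\text{parent}} &= 1 - (0.6^2 + 0.4^2) = 0.48,\\ \text{Gini}_{\text{left}} &= 1 - (1^2 + 0^2) = 0, \qquad \text{Gini}_{\text{right}} = 1 - (0.2^2 + 0.8^2) = 0.32,\\ \Delta &= 0.48 - \left(\tfrac{5}{10}\cdot 0 + \tfrac{5}{10}\cdot 0.32\right) = 0.32. \end{aligned}\]
The splitting feature is credited with this reduction, and summing such reductions over all of its splits gives its importance.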
For hand calculation, refer to examples in https://scientistcafe.com/ids/splitting-criteria.
k-Nearest Neighbors (k-NN) is a simple, intuitive supervised machine learning algorithm used for classification and regression.
Two short videos to understand k-NN:
How It Works?
For each point in the training data,
Compute the distance (usually Euclidean) from this point to all other points in the training data.
Select the \(k\) closest points (neighbors).
Use the majority class among those \(k\) neighbors as the prediction.
Repeat the procedure for other \(k\)’s.
The \(k\) that achieves the highest cross-validated performance (e.g., accuracy) becomes the optimal \(k\).
How to test the predictive ability of the learner (the k-nn with the optimal \(k\))?
For each point in the test data,
Compute the distance (usually Euclidean) from this point to all points in the training data.
Select the \(k\) closest points (neighbors).
Use the majority class among those \(k\) neighbors as the prediction.
Report the overall performance metric (accuracy, sensitivity, specificity, \(F_1\), …). A short base-R sketch of the distance-and-majority-vote idea follows.
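Before the caret-based example, here is a minimal base-R sketch of the distance-plus-majority-vote idea. The choice of two petal features and \(k = 5\) is an assumption for illustration only.
# Minimal k-NN classifier by hand (illustration only)
set.seed(123)
train_idx <- sample(nrow(iris), 100)
train_x <- as.matrix(iris[train_idx, 3:4])   # Petal.Length, Petal.Width
train_y <- iris$Species[train_idx]
knn_predict <- function(x_new, k = 5) {
  # Euclidean distance from the new point to every training point
  d <- sqrt(rowSums((train_x - matrix(x_new, nrow(train_x), 2, byrow = TRUE))^2))
  nn <- order(d)[1:k]                        # indices of the k closest training points
  names(which.max(table(train_y[nn])))       # majority class among the neighbors
}
knn_predict(c(1.5, 0.3))   # a short, narrow petal should come out as "setosa"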
An Example:
library(caret)
library(MLmetrics)
# Prepare binary classification dataset, with setosa as positive class first
binary_iris <- iris
binary_iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "other"),
levels = c("setosa", "other"))
# Train/test split
set.seed(123)
train_index <- createDataPartition(binary_iris$Species, p = 0.7, list = FALSE)
train_data <- binary_iris[train_index, ]
test_data <- binary_iris[-train_index, ]
# Simple train control with cross-validation
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = FALSE,
savePredictions = TRUE
)
# Train k-NN model
set.seed(123)
knn_model <- train(
Species ~ .,
data = train_data,
method = "knn",
trControl = ctrl,
tuneLength = 10,
metric = "Accuracy",
preProcess = c("center", "scale")
)
# Print training results
print(knn_model$results)
## k Accuracy Kappa AccuracySD KappaSD
## 1 5 1 1 0 0
## 2 7 1 1 0 0
## 3 9 1 1 0 0
## 4 11 1 1 0 0
## 5 13 1 1 0 0
## 6 15 1 1 0 0
## 7 17 1 1 0 0
## 8 19 1 1 0 0
## 9 21 1 1 0 0
## 10 23 1 1 0 0
plot(knn_model)
# Predict on test set
test_pred <- predict(knn_model, newdata = test_data)
# Evaluate on test set
conf_mat <- confusionMatrix(test_pred, test_data$Species, positive = "setosa")
f1_test <- F1_Score(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
sens_test <- Sensitivity(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
# Show test results
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa other
## setosa 15 0
## other 0 30
##
## Accuracy : 1
## 95% CI : (0.9213, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 1.191e-08
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : setosa
##
cat("Test F1 Score:", round(f1_test, 3), "\n")
## Test F1 Score: 1
cat("Test Sensitivity:", round(sens_test, 3), "\n")
## Test Sensitivity: 1
Explanation of the code:
trainControl() defines how the training should be controlled.
method = “cv”: Use cross-validation.
number = 5: Use 5-fold cross-validation (the data is split into 5 parts; the model is trained on 4 parts and evaluated on the remaining one, repeated 5 times).
train() is a function from the caret package that:
Trains the model
Tunes hyperparameters (in this case, the number of neighbors \(k\))
Evaluates performance using the strategy defined in trainControl
Arguments:
Species ~ .: This is a formula. It means to predict Species using all other columns (the . means “all other variables”).
data = train_data: Use the training portion of the binary iris data.
method = “knn”: Use the k-NN algorithm.
trControl = ctrl: Apply the 5-fold cross-validation we set up.
tuneLength = 10: Try 10 different values of k (for caret’s k-NN these are odd values; here k = 5, 7, …, 23, as shown in the results above).
By default, caret chooses accuracy or kappa as the performance metric when determining \(k\). Fortunately, the caret package allows users to define their own performance metrics through the summaryFunction argument of the trainControl() function.
The following example shows how a summary function is created and used for the choice of performance metrics.
library(caret)
library(MLmetrics)
# Prepare binary classification dataset, with setosa as positive class first
binary_iris <- iris
binary_iris$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "other"),
levels = c("setosa", "other"))
# Train/test split
set.seed(123)
train_index <- createDataPartition(binary_iris$Species, p = 0.8, list = FALSE)
train_data <- binary_iris[train_index, ]
test_data <- binary_iris[-train_index, ]
# Custom summary function returning Accuracy, Kappa, F1, and Sensitivity
custom_summary <- function(data, lev = NULL, model = NULL) {
# Ensure predictions and observations have the same levels
pred <- factor(data$pred, levels = lev)
obs <- factor(data$obs, levels = lev)
# Calculate metrics
acc <- mean(pred == obs)
kappa <- postResample(pred, obs) # returns both Accuracy and Kappa, hence the Kappa.Accuracy / Kappa.Kappa columns below
f1 <- F1_Score(y_pred = pred, y_true = obs, positive = lev[1])
sens <- Sensitivity(y_pred = pred, y_true = obs, positive = lev[1])
c(Accuracy = acc, Kappa = kappa, F1 = f1, Sensitivity = sens)
}
# Train control with cross-validation and custom summary function
ctrl <- trainControl(
method = "cv",
number = 5,
summaryFunction = custom_summary,
classProbs = FALSE,
savePredictions = TRUE
)
# Train k-NN model
set.seed(123)
knn_model <- train(
Species ~ .,
data = train_data,
method = "knn",
trControl = ctrl,
tuneLength = 10,
metric = "F1", # Use F1 to select best model
preProcess = c("center", "scale")
)
# Print training results
print(knn_model)
## k-Nearest Neighbors
##
## 120 samples
## 4 predictor
## 2 classes: 'setosa', 'other'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 96, 96, 96, 96, 96
## Resampling results across tuning parameters:
##
## k Accuracy Kappa.Accuracy Kappa.Kappa F1 Sensitivity
## 5 1 1 1 1 1
## 7 1 1 1 1 1
## 9 1 1 1 1 1
## 11 1 1 1 1 1
## 13 1 1 1 1 1
## 15 1 1 1 1 1
## 17 1 1 1 1 1
## 19 1 1 1 1 1
## 21 1 1 1 1 1
## 23 1 1 1 1 1
##
## F1 was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
plot(knn_model)
# Predict on test set
test_pred <- predict(knn_model, newdata = test_data)
# Ensure predictions and reference have the same factor levels
test_pred <- factor(test_pred, levels = levels(test_data$Species))
# Evaluate on test set
conf_mat <- confusionMatrix(test_pred, test_data$Species, positive = "setosa")
f1_test <- F1_Score(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
sens_test <- Sensitivity(y_pred = test_pred, y_true = test_data$Species, positive = "setosa")
# Show test results
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa other
## setosa 10 0
## other 0 20
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 5.215e-06
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : setosa
##
cat("Test F1 Score:", round(f1_test, 3), "\n")
## Test F1 Score: 1
cat("Test Sensitivity:", round(sens_test, 3), "\n")
## Test Sensitivity: 1
For k-NN regression, the idea is similar to that for classification, except that the prediction for a point is the average of its \(k\) nearest neighbors’ responses, and performance is measured with regression metrics such as RMSE (see the short sketch below).
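A minimal base-R sketch of that averaging step, using toy one-dimensional data and \(k = 3\) (both assumptions for illustration):
# k-NN regression by hand: average the responses of the k nearest neighbors
set.seed(1)
x_train <- runif(20, 0, 10)
y_train <- 2 * x_train + rnorm(20)
knn_reg_predict <- function(x_new, k = 3) {
  d <- abs(x_train - x_new)    # distance in one dimension
  nn <- order(d)[1:k]          # the k closest training points
  mean(y_train[nn])            # predicted value = average of their responses
}
knn_reg_predict(5)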
An example:
library(caret)
library(MLmetrics)
# Use mtcars dataset for regression example
data(mtcars)
# We'll predict mpg (miles per gallon) based on other features
# Train/test split
set.seed(123)
train_index <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train_data <- mtcars[train_index, ]
test_data <- mtcars[-train_index, ]
# Train control with cross-validation for regression
ctrl <- trainControl(
method = "cv",
number = 5,
verboseIter = TRUE
)
# Train k-NN regression model
set.seed(123)
knn_regression <- train(
mpg ~ .,
data = train_data,
method = "knn",
trControl = ctrl,
tuneLength = 10,
metric = "RMSE", # Use RMSE to select k
preProcess = c("center", "scale") # Normalize data
)
## + Fold1: k= 5
## - Fold1: k= 5
## + Fold1: k= 7
## - Fold1: k= 7
## + Fold1: k= 9
## - Fold1: k= 9
## + Fold1: k=11
## - Fold1: k=11
## + Fold1: k=13
## - Fold1: k=13
## + Fold1: k=15
## - Fold1: k=15
## + Fold1: k=17
## - Fold1: k=17
## + Fold1: k=19
## - Fold1: k=19
## + Fold1: k=21
## - Fold1: k=21
## + Fold1: k=23
## - Fold1: k=23
## + Fold2: k= 5
## - Fold2: k= 5
## + Fold2: k= 7
## - Fold2: k= 7
## + Fold2: k= 9
## - Fold2: k= 9
## + Fold2: k=11
## - Fold2: k=11
## + Fold2: k=13
## - Fold2: k=13
## + Fold2: k=15
## - Fold2: k=15
## + Fold2: k=17
## - Fold2: k=17
## + Fold2: k=19
## - Fold2: k=19
## + Fold2: k=21
## - Fold2: k=21
## + Fold2: k=23
## - Fold2: k=23
## + Fold3: k= 5
## - Fold3: k= 5
## + Fold3: k= 7
## - Fold3: k= 7
## + Fold3: k= 9
## - Fold3: k= 9
## + Fold3: k=11
## - Fold3: k=11
## + Fold3: k=13
## - Fold3: k=13
## + Fold3: k=15
## - Fold3: k=15
## + Fold3: k=17
## - Fold3: k=17
## + Fold3: k=19
## - Fold3: k=19
## + Fold3: k=21
## - Fold3: k=21
## + Fold3: k=23
## - Fold3: k=23
## + Fold4: k= 5
## - Fold4: k= 5
## + Fold4: k= 7
## - Fold4: k= 7
## + Fold4: k= 9
## - Fold4: k= 9
## + Fold4: k=11
## - Fold4: k=11
## + Fold4: k=13
## - Fold4: k=13
## + Fold4: k=15
## - Fold4: k=15
## + Fold4: k=17
## - Fold4: k=17
## + Fold4: k=19
## - Fold4: k=19
## + Fold4: k=21
## - Fold4: k=21
## + Fold4: k=23
## - Fold4: k=23
## + Fold5: k= 5
## - Fold5: k= 5
## + Fold5: k= 7
## - Fold5: k= 7
## + Fold5: k= 9
## - Fold5: k= 9
## + Fold5: k=11
## - Fold5: k=11
## + Fold5: k=13
## - Fold5: k=13
## + Fold5: k=15
## - Fold5: k=15
## + Fold5: k=17
## - Fold5: k=17
## + Fold5: k=19
## - Fold5: k=19
## + Fold5: k=21
## - Fold5: k=21
## + Fold5: k=23
## - Fold5: k=23
## Aggregating results
## Selecting tuning parameters
## Fitting k = 7 on full training set
# Print the model and k-value results
print(knn_regression)
## k-Nearest Neighbors
##
## 24 samples
## 10 predictors
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 20, 19, 19, 19, 19
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.883708 0.7602765 3.290200
## 7 3.686561 0.7520515 3.116714
## 9 3.976232 0.7694084 3.191778
## 11 3.964159 0.7890296 3.165727
## 13 4.190642 0.8206310 3.268121
## 15 4.609692 0.8076585 3.534733
## 17 5.211151 0.7847071 4.046588
## 19 5.914555 0.4368621 4.676421
## 21 5.919207 0.5900987 4.715763
## 23 5.919207 0.5900987 4.715763
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
# Plot the tuning results
plot(knn_regression)
# Print table of k values and their performance
print(knn_regression$results)
## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 3.883708 0.7602765 3.290200 0.9946827 0.2103032 0.9945161
## 2 7 3.686561 0.7520515 3.116714 1.0175383 0.1715952 0.8600984
## 3 9 3.976232 0.7694084 3.191778 1.2834203 0.1382612 1.1485793
## 4 11 3.964159 0.7890296 3.165727 1.6363006 0.1198575 1.3007283
## 5 13 4.190642 0.8206310 3.268121 1.9679972 0.1469771 1.4052232
## 6 15 4.609692 0.8076585 3.534733 2.0436493 0.1809126 1.5147916
## 7 17 5.211151 0.7847071 4.046588 1.9356391 0.1559647 1.5617107
## 8 19 5.914555 0.4368621 4.676421 1.8620472 0.4872509 1.5349162
## 9 21 5.919207 0.5900987 4.715763 1.8539655 0.3948347 1.4700451
## 10 23 5.919207 0.5900987 4.715763 1.8539655 0.3948347 1.4700451
# Predict on test set
test_predictions <- predict(knn_regression, newdata = test_data)
# Evaluate on test set
test_actual <- test_data$mpg
# Calculate regression metrics with the caret and MLmetrics packages
rmse_test <- RMSE(test_predictions, test_actual)
mae_test <- MAE(test_predictions, test_actual)
r2_test <- R2(test_predictions, test_actual)
mape_test <- MAPE(test_predictions, test_actual)
# Show test results
cat("\n=== Test Set Performance ===\n")
##
## === Test Set Performance ===
cat("RMSE:", round(rmse_test, 3), "\n")
## RMSE: 1.799
cat("MAE:", round(mae_test, 3), "\n")
## MAE: 1.511
cat("R-squared:", round(r2_test, 3), "\n")
## R-squared: 0.91
cat("MAPE:", round(mape_test * 100, 2), "%\n")
## MAPE: 7.57 %
# Plot actual vs predicted
plot(test_actual, test_predictions,
main = "Actual vs Predicted MPG",
xlab = "Actual MPG",
ylab = "Predicted MPG",
pch = 19, col = "blue")
abline(0, 1, col = "red", lwd = 2) # Perfect prediction line
# Add some statistics to the plot
legend("topleft",
legend = c(paste("R² =", round(r2_test, 3)),
paste("RMSE =", round(rmse_test, 3))),
bty = "n")
Deep learning is a subset of machine learning that uses neural networks with multiple layers to model complex patterns in data. It excels in tasks like image classification, natural language processing, and regression. A neural network consists of an input layer, one or more hidden layers, and an output layer; neurons in adjacent layers are connected by weights, and each neuron applies an activation function to its weighted inputs.
Training involves forward propagation (computing predictions from the inputs layer by layer), measuring the prediction error with a loss function, and backpropagation with gradient descent to adjust the weights so that the error decreases; a tiny forward-pass sketch is shown below.
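To make the forward-propagation step concrete, here is a tiny hand-coded pass through one hidden layer with a logistic (sigmoid) activation. The two inputs and all weights are made-up numbers; training would repeatedly adjust W1, b1, W2, b2 to reduce the loss.
sigmoid <- function(z) 1 / (1 + exp(-z))
inputs <- c(0.5, -1.2)                           # one observation with two input features
W1 <- matrix(c(0.1, -0.3, 0.4, 0.2), nrow = 2)   # input-to-hidden weights (2 x 2)
b1 <- c(0.05, -0.1)                              # hidden-layer biases
W2 <- c(0.7, -0.5)                               # hidden-to-output weights
b2 <- 0.1                                        # output bias
hidden <- sigmoid(as.vector(t(W1) %*% inputs) + b1)  # hidden-layer activations
y_hat <- sigmoid(sum(W2 * hidden) + b2)              # network output, between 0 and 1
y_hat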
We can use the function neuralnet() from R package neuralnet to train models for deep learning.
For classification, the output layer typically has more than one neuron (one for each category of the target variable) with a sigmoid activation function. The error (cost or loss) function is the sum of squared errors (sse), \(err = \frac{1}{2n}\sum_{i=1}^{n} (y_i-\hat{y_i})^2\), or the cross-entropy (ce), which measures the difference between the predicted and true probability distributions.
Example: We show how to classify Species using iris data.
# Install and load required packages
if (!require(neuralnet)) install.packages("neuralnet")
if (!require(NeuralNetTools)) install.packages("NeuralNetTools")
library(neuralnet)
library(NeuralNetTools) # To produce variable importance
# Set random seed for reproducibility
set.seed(123)
# Standardize features (mean = 0, sd = 1)
features <- iris[, 1:4]
features_scaled <- scale(features)
iris_scaled <- data.frame(features_scaled, Species = iris$Species)
# Split into training and test sets (80% train, 20% test)
n_samples <- nrow(iris_scaled)
train_idx <- sample(1:n_samples, 0.8 * n_samples)
train_data <- iris_scaled[train_idx, ]
test_data <- iris_scaled[-train_idx, ]
# Train neural network with neuralnet
m <- neuralnet(
Species ~ Sepal.Length + Sepal.Width, # only two predictors, for illustration
data = train_data,
hidden = c(2, 3), # Two hidden layers: 2 and 3 nodes
rep = 5, # 5 repetitions
act.fct = "logistic", # Logistic activation for hidden layers
linear.output = FALSE, # Logistic activation
stepmax = 20000 # Maximum iterations (epochs)
)
# Plot the best neural network
plot(m, rep = "best")
# Predict on test set
predictions <- predict(m, test_data[, 1:4])
# Convert predictions to class labels
predicted_labels <- apply(predictions, 1, which.max)
predicted_species <- factor(levels(iris$Species)[predicted_labels], levels = levels(iris$Species))
# Create confusion matrix using caret
caret::confusionMatrix(predicted_species, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 7 0
## virginica 0 8 5
##
## Overall Statistics
##
## Accuracy : 0.7333
## 95% CI : (0.5411, 0.8772)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.008062
##
## Kappa : 0.619
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.4667 1.0000
## Specificity 1.0000 1.0000 0.6800
## Pos Pred Value 1.0000 1.0000 0.3846
## Neg Pred Value 1.0000 0.6522 1.0000
## Prevalence 0.3333 0.5000 0.1667
## Detection Rate 0.3333 0.2333 0.1667
## Detection Prevalence 0.3333 0.2333 0.4333
## Balanced Accuracy 1.0000 0.7333 0.8400
For regression, the output layer typically has one neuron (for a single continuous target) with a linear activation function. The error function is the sum of squared errors (sse), \(err = \frac{1}{2n}\sum_{i=1}^{n} (y_i-\hat{y_i})^2\).
We demonstrate the use of neural network for regression with synthetic data and with the Boston housing data from the mlbench package.
Example 1. We generate a synthetic dataset where \(y = x^2 + \sin(3x) + \epsilon\), with \(\epsilon\) as Gaussian (normal) noise.
Generate data:
library(ggplot2) # for the plots below, in case it is not already loaded
# Generate synthetic data
set.seed(123)
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + sin(3 * x) + rnorm(n, 0, 0.2) # Nonlinear relation with noise
data <- data.frame(x = x, y = y)
# Split into training and test sets
train_idx <- sample(1:n, 0.8 * n)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Plot data
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "#1f77b4", alpha = 0.5) +
labs(title = "Synthetic Regression Dataset") +
theme_minimal()
We train a neural network with one hidden layer (10 nodes) using the neuralnet package. The linear.output = TRUE ensures no activation function is applied to the output layer, suitable for regression.
# Train neural network
nn_model <- neuralnet(y ~ x,
data = train_data,
hidden = c(10), # One hidden layer with 10 nodes
err.fct = "sse", # Sum of Squared Errors
linear.output = TRUE, # Linear output for regression
act.fct = "tanh", # Tanh activation for hidden layer
stepmax = 1e6) # Max iterations
We evaluate the model on the test set by calculating the SSE and Mean Squared Error (MSE).
# Predict on test set
pred_test <- predict(nn_model, test_data)
# Calculate SSE and MSE
sse_test <- sum((test_data$y - pred_test)^2)
mse_test <- sse_test / nrow(test_data)
cat("Test SSE:", sse_test, "\n")
## Test SSE: 0.9893282
cat("Test MSE:", mse_test, "\n")
## Test MSE: 0.0247332
# Plot predictions vs. actual
test_results <- data.frame(x = test_data$x, y_actual = test_data$y, y_pred = pred_test)
ggplot(test_results, aes(x = x)) +
geom_point(aes(y = y_actual), color = "#1f77b4", alpha = 0.5, shape = 16) +
geom_point(aes(y = y_pred), color = "#ff7f0e", alpha = 0.5, shape = 17) +
labs(title = "Test Set: Actual vs. Predicted") +
theme_minimal() +
scale_color_manual(values = c("#1f77b4", "#ff7f0e"))
We visualize the model’s predictions across the input range to assess how well it captures the nonlinear relationship.
# Generate predictions for a smooth curve
x_grid <- data.frame(x = seq(-2, 2, length.out = 100))
pred_grid <- predict(nn_model, x_grid)
# Plot
ggplot() +
geom_point(data = train_data, aes(x = x, y = y), color = "#1f77b4", alpha = 0.5) +
geom_line(data = data.frame(x = x_grid$x, y = pred_grid), aes(x = x, y = y),
color = "#ff7f0e", size = 1) +
labs(title = "Neural Network Fit (Training Data)", x = "x", y = "y") +
theme_minimal()
Now, we use the Boston housing data to train a neural net for predicting median house price.
Data partition and normalization.
library(neuralnet)
library(mlbench)
library(caret)
data(BostonHousing)
head(BostonHousing)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(BostonHousing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
BostonHousing$chas = as.numeric(BostonHousing$chas) - 1 # Convert factor to numeric
# Split data
set.seed(123)
trainIndex <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
train_data <- BostonHousing[trainIndex, ]
test_data <- BostonHousing[-trainIndex, ]
# Neural networks perform better with scaled data (normalized to [0, 1])
# Define preprocessing (normalize to [0,1])
pre_proc <- preProcess(train_data, method = "range")
# Apply to training and test data
train_scaled <- predict(pre_proc, train_data)
test_scaled <- predict(pre_proc, test_data)
Train the model:
# Train neural network
set.seed(123) # For reproducibility
nn <- neuralnet(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + b + lstat,
data = train_scaled,
hidden = c(4, 2), # Two hidden layers: 4 and 2 neurons
act.fct = "tanh", # Tanh activation for hidden layers
linear.output = TRUE, # Linear output for regression
err.fct = "sse", # Sum of squared errors
threshold = 0.01, # Convergence threshold
stepmax = 1e6) # Max iterations
Visualize the network:
plot(nn)
Make Predictions: Predict medv for the test set and rescale to original units ($1000s).
predictions <- predict(nn, test_scaled[, -which(names(test_scaled) == "medv")])
# Rescale predictions and actual values to original scale
predictions_rescaled <- predictions * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
actual_rescaled <- test_scaled$medv * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
Evaluate Performance: Calculate Mean Squared Error (MSE) and R-squared.
mse <- mean((actual_rescaled - predictions_rescaled)^2)
cat("Mean Squared Error:", mse, "\n")
## Mean Squared Error: 14.44873
# R-squared
r_squared <- cor(actual_rescaled, predictions_rescaled)^2
cat("R-squared:", r_squared, "\n")
## R-squared: 0.8397848
All code together:
library(neuralnet)
library(mlbench)
library(caret)
data(BostonHousing)
BostonHousing$chas = as.numeric(BostonHousing$chas) - 1 # Convert factor to numeric
# Split data
set.seed(123)
trainIndex <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
train_data <- BostonHousing[trainIndex, ]
test_data <- BostonHousing[-trainIndex, ]
# Normalize with preProcess
pre_proc <- preProcess(train_data, method = "range")
train_scaled <- predict(pre_proc, train_data)
test_scaled <- predict(pre_proc, test_data)
# Model Training
set.seed(123)
nn <- neuralnet(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + b + lstat, data = train_scaled, hidden = c(4, 2), act.fct = "tanh",
linear.output = TRUE, err.fct = "sse", threshold = 0.01, stepmax = 1e6)
# Predictions and Evaluation
predictions <- predict(nn, test_scaled[, -which(names(test_scaled) == "medv")])
predictions_rescaled <- predictions * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
actual_rescaled <- test_scaled$medv * (max(train_data$medv) - min(train_data$medv)) + min(train_data$medv)
mse <- mean((actual_rescaled - predictions_rescaled)^2)
r_squared <- cor(actual_rescaled, predictions_rescaled)^2
plot(actual_rescaled, predictions_rescaled, main = "Predicted vs Actual",
xlab = "Actual ($1000s)", ylab = "Predicted ($1000s)")
abline(0, 1, col = "red")
Ensemble methods in machine learning refer to techniques that combine multiple individual models (often called “base learners” or “weak learners”) to create a more robust and accurate predictive model. The core idea is that by aggregating the predictions from several models, the ensemble can reduce errors, improve generalization, and handle variance or bias better than a single model. This is inspired by the “wisdom of the crowd” concept, where diverse opinions lead to better decisions.
Key benefits of ensembles: lower variance (averaging many models smooths out individual errors), often lower bias (boosting concentrates on hard cases), and better robustness and generalization than a single model.
Common types include bagging (e.g., Random Forest), boosting (e.g., AdaBoost and gradient boosting/XGBoost), and stacking (combining different model types with a meta-learner).
Ensembles are widely used in classification, regression, and other tasks, and libraries like scikit-learn (Python) or caret/randomForest (R) make them easy to implement.
Bagging involves training multiple decision trees on bootstrapped samples and aggregating their predictions.
How Bagging Works?
Bootstrap Sampling: Create multiple datasets by sampling with replacement from the original training data. Each bootstrap sample has the same size as the original dataset.
Model Training: Train a separate model \(h\) on each bootstrap sample. All models use the same learning algorithm.
Aggregation: for classification, combine the models by majority voting; for regression, average their predictions. A minimal code sketch is shown below.
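As a minimal sketch of these three steps (not the ipred/adabag implementation used later), the following bags B = 25 rpart trees on iris and combines them by majority vote; B and the 70/30 split are illustrative choices.
library(rpart)
set.seed(42)
B <- 25
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
votes <- sapply(1:B, function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]      # bootstrap sample (same size)
  fit  <- rpart(Species ~ ., data = boot, method = "class") # one tree per sample
  as.character(predict(fit, newdata = test, type = "class"))
})
# Majority vote across the B trees for each test observation
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == test$Species) # test-set accuracy of the ensemble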
Here is a step-by-step example by hand for the data: x = 1,2,3,4,5, y = +1,+1,-1,-1,+1.
For the bootstrap sample x = {3,1,3,5,4}, y = {-1,+1,-1,+1,-1}, the steps are shown in the following table:
kable(data.frame(
Threshold = c("1.5", "2.5", "3.5", "4.5", "1.5", "2.5", "3.5", "4.5"),
Rule = c(
"+1 if x <= 1.5, else -1",
"+1 if x <= 2.5, else -1",
"+1 if x <= 3.5, else -1",
"+1 if x <= 4.5, else -1",
"-1 if x <= 1.5, else +1",
"-1 if x <= 2.5, else +1",
"-1 if x <= 3.5, else +1",
"-1 if x <= 4.5, else +1"
),
`Correct Points` = c("1,3,3,4", "1,3,3,4", "1,4", "1", "5", "5", "3,3,5", "3,3,4,5"),
`Wrong Points` = c("5", "5", "3,3,5", "3,3,4,5", "1,3,3,4", "1,3,3,4", "1,4", "1"),
Error = c("1", "1", "3", "4", "4", "4", "2", "1")
),
col.names = c("Threshold", "Rule", "Correct Points", "Wrong Points", "No. of Errors"),
caption = "Bootstrap Sample: x = {3,1,3,5,4}, y = {-1,+1,-1,+1,-1}",
escape = FALSE, align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(0, bold = TRUE, background = "#2c3e50", color = "white") %>%
row_spec(c(1,2,8), background = "#d4edda", bold = TRUE) # Best rules (1 error)
| Threshold | Rule | Correct Points | Wrong Points | No. of Errors |
|---|---|---|---|---|
| 1.5 | +1 if x <= 1.5, else -1 | 1,3,3,4 | 5 | 1 |
| 2.5 | +1 if x <= 2.5, else -1 | 1,3,3,4 | 5 | 1 |
| 3.5 | +1 if x <= 3.5, else -1 | 1,4 | 3,3,5 | 3 |
| 4.5 | +1 if x <= 4.5, else -1 | 1 | 3,3,4,5 | 4 |
| 1.5 | -1 if x <= 1.5, else +1 | 5 | 1,3,3,4 | 4 |
| 2.5 | -1 if x <= 2.5, else +1 | 5 | 1,3,3,4 | 4 |
| 3.5 | -1 if x <= 3.5, else +1 | 3,3,5 | 1,4 | 2 |
| 4.5 | -1 if x <= 4.5, else +1 | 3,3,4,5 | 1 | 1 |
A different table can be constructed for another bootstrap sample: \(\{2, 5, 1, 3, 5\} \rightarrow y = \{+1, +1, +1, -1, +1\}\)
kable(data.frame(
Threshold = c("1.5", "2.5", "3.5", "4.5", "1.5", "2.5", "3.5", "4.5"),
Rule = c(
"+1 if x <= 1.5, else -1",
"+1 if x <= 2.5, else -1",
"+1 if x <= 3.5, else -1",
"+1 if x <= 4.5, else -1",
"-1 if x <= 1.5, else +1",
"-1 if x <= 2.5, else +1",
"-1 if x <= 3.5, else +1",
"-1 if x <= 4.5, else +1"
),
`Correct Points` = c("1,3", "1,2,3", "1,2", "1,2", "2,5,5", "5,5", "3,5,5", "3,5,5"),
`Wrong Points` = c("2,5,5", "5,5", "3,5,5", "3,5,5", "1,3", "1,2,3", "1,2", "1,2"),
Error = c("3", "2", "3", "3", "2", "3", "2", "2")
),
col.names = c("Threshold", "Rule", "Correct Points", "Wrong Points", "No. of Errors"),
caption = "Bootstrap Sample 2: x = {1,2,3,5,5}, y = {+1,+1,-1,+1,+1}",
escape = FALSE, align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13) %>%
row_spec(0, bold = TRUE, background = "#2c3e50", color = "white") %>%
row_spec(c(2,5,7,8), background = "#d4edda", bold = TRUE) # Best rules (2 errors)
| Threshold | Rule | Correct Points | Wrong Points | No. of Errors |
|---|---|---|---|---|
| 1.5 | +1 if x <= 1.5, else -1 | 1,3 | 2,5,5 | 3 |
| 2.5 | +1 if x <= 2.5, else -1 | 1,2,3 | 5,5 | 2 |
| 3.5 | +1 if x <= 3.5, else -1 | 1,2 | 3,5,5 | 3 |
| 4.5 | +1 if x <= 4.5, else -1 | 1,2 | 3,5,5 | 3 |
| 1.5 | -1 if x <= 1.5, else +1 | 2,5,5 | 1,3 | 2 |
| 2.5 | -1 if x <= 2.5, else +1 | 5,5 | 1,2,3 | 3 |
| 3.5 | -1 if x <= 3.5, else +1 | 3,5,5 | 1,2 | 2 |
| 4.5 | -1 if x <= 4.5, else +1 | 3,5,5 | 1,2 | 2 |
When multiple decision stumps achieve the same (lowest) in-sample error (three rules tie for the first bootstrap sample and four tie for the second), pick the stump with the lowest threshold among the tied performers.
Thus, we have the final two rules (stumps or models) summarized in the following table:
bag_table <- data.frame(
Model = c("$h_1$", "$h_2$"),
`Bootstrap Sample` = c("{1,3,3,4,5}", "{1,2,3,5,5}"),
Threshold = c("1.5", "2.5"),
Rule = c(
"+1 if x \\leq 1.5, else -1",
"+1 if x \\leq 2.5, else -1"
),
`In-sample Error` = c("1/5", "2/5")
)
kable(bag_table,
col.names = c("Model", "Bootstrap Sample", "Threshold", "Rule", "In-sample Error"),
caption = "Bagging: 2 Decision Stumps (Corrected)",
escape = FALSE,
align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE, font_size = 13) %>%
row_spec(0, bold = TRUE, background = "#2980b9", color = "white") %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, width = "10em") %>%
column_spec(4, width = "12em")
| Model | Bootstrap Sample | Threshold | Rule | In-sample Error |
|---|---|---|---|---|
| \(h_1\) | {1,3,3,4,5} | 1.5 | +1 if x <= 1.5, else -1 | 1/5 |
| \(h_2\) | {1,2,3,5,5} | 2.5 | +1 if x <= 2.5, else -1 | 2/5 |
Bagging: Majority Vote (two models \(h_1\) and \(h_2\), one based on each bootstrap sample)
kable(data.frame(
x = 1:5,
`True y` = c("+1", "+1", "-1", "-1", "+1"),
`h1(x)` = c("+1", "-1", "-1", "-1", "-1"),
`h2(x)` = c("+1", "+1", "-1", "-1", "-1"),
`H(x)` = c("+1", "+1 (tie → H)", "-1", "-1", "-1"),
Correct = c("Yes", "Yes", "Yes", "Yes", "No")
),
col.names = c("x", "True y", "h₁(x)", "h₂(x)", "H(x)", "Correct?"),
caption = "Final Bagging Prediction (Tie: H=+1, T=-1)",
align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 14) %>%
row_spec(0, bold = TRUE, background = "#2c3e50", color = "white") %>%
row_spec(5, background = "#f8d7da") %>% # Highlight error
column_spec(1, width = "6em") %>%
column_spec(2:6, width = "10em")
| x | True y | h₁(x) | h₂(x) | H(x) | Correct? |
|---|---|---|---|---|---|
| 1 | +1 | +1 | +1 | +1 | Yes |
| 2 | +1 | -1 | +1 | +1 (tie → H) | Yes |
| 3 | -1 | -1 | -1 | -1 | Yes |
| 4 | -1 | -1 | -1 | -1 | Yes |
| 5 | +1 | -1 | -1 | -1 | No |
Accuracy: 4/5 = 80%. Tie-breaking: a coin toss (\(H=+1, T=-1\)) is used in case of a tie.
library(adabag) # provides bagging()
set.seed(42) # For reproducibility
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]
# Train bagging model (uses decision trees by default)
bagging_model <- bagging(Species ~ ., data = train_data, nbagg = 25) # nbagg: number of bootstrap samples
# Predict on test data
predictions <- predict(bagging_model, newdata = test_data)$class |> factor()
confusionMatrix(predictions, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 14 1
## virginica 0 1 17
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 2.842e-15
##
## Kappa : 0.9324
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9444
## Specificity 1.0000 0.9667 0.9630
## Pos Pred Value 1.0000 0.9333 0.9444
## Neg Pred Value 1.0000 0.9667 0.9630
## Prevalence 0.2667 0.3333 0.4000
## Detection Rate 0.2667 0.3111 0.3778
## Detection Prevalence 0.2667 0.3333 0.4000
## Balanced Accuracy 1.0000 0.9500 0.9537
AdaBoost (Adaptive Boosting) is a boosting algorithm that gives higher weight to misclassified instances in subsequent models. We’ll use the adabag package for AdaBoost. (Alternatively, for XGBoost, see the next section.)
The following shows how AdaBoost works, assuming training data
\[\{(x_1, y_1), \dots, (x_N, y_N)\}, \quad \text{where each } y_i \text{ is } -1 \text{ or } +1.\]
Initialize weights: \(w_i^{(1)} = \frac{1}{N}\) for \(i = 1, 2, \dots, N\)
For t = 1 to T:
Find the weak classifier \(h_t\) that minimizes the weighted error \(\epsilon_t\): \(\epsilon_t = \sum_{i=1}^{N} w_i^{(t)} \mathbf{1}\{h_t(x_i) \neq y_i\}\)
Compute classifier weight: \(\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)\)
Calculate normalization factor: \(Z_t = \sum_{i=1}^{N} w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))\)
Update and then normalize weights: \(w_i^{(t+1)} = \frac{w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))}{Z_t}\)
Output final strong classifier: \(H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)\)
Assume the following training data:
| x | y |
|---|---|
| 1 | +1 |
| 2 | +1 |
| 3 | -1 |
| 4 | -1 |
| 5 | +1 |
Weak Classifiers: Decision stumps (threshold-based binary classifiers on a single feature) of the form “If \(x \le \theta\) then \(+1\) else \(-1\)” and “If \(x \le \theta\) then \(-1\) else \(+1\)”
Step 1: Use multiple iterations to update weights. Initial weights for all instances are equal (uniform):
\[w_1^{(1)}=w_2^{(1)}=w_3^{(1)}=w_4^{(1)}=w_5^{(1)}=\frac{1}{5}=0.2\]
In each future iteration, train all possible weak classifiers (Decision Stumps) each with one feature (predictor) and a threshold. Since we only have one feature, we consider possible thresholds (midpoints):
Between 1 and 2 → threshold = 1.5
Between 2 and 3 → threshold = 2.5
Between 3 and 4 → threshold = 3.5
Between 4 and 5 → threshold = 4.5
For each threshold, we try two versions:
Predict +1 if \(x\le\) threshold, else -1
Predict -1 if \(x\le\) threshold, else +1
We compute weighted error for each.
Try threshold = 1.5
Rule A: Predict +1 if \(x \le 1.5\) → point 1 → +1 (correct); points 2,3,4,5 → -1.
Point 2 (true +1) misclassified; point 3 (true -1) correct; point 4 (true -1) correct; point 5 (true +1) misclassified. Misclassified: 2, 5 → weights: 0.2 + 0.2 = 0.4.
Rule B: Predict -1 if \(x \le 1.5\) → point 1 → -1 (wrong); others → +1.
Point 1 misclassified; point 2 correct; point 3 misclassified; point 4 misclassified; point 5 correct. Misclassified: 1, 3, 4 → 0.2 × 3 = 0.6.
→ Better is Rule A: error = 0.4
Try threshold = 2.5
Rule A: +1 if \(x \le 2.5\) → points 1,2 → +1 (both correct); points 3,4,5 → -1.
Point 3 (-1) correct; point 4 (-1) correct; point 5 (+1) misclassified. Misclassified: only 5 → weight = 0.2.
Rule B: -1 if \(x \le 2.5\) → points 1,2 → -1 (both wrong); points 3,4,5 → +1.
Point 3 wrong; point 4 wrong; point 5 correct. Misclassified: 1, 2, 3, 4 → 0.8.
→ Rule A wins: error = 0.2
Try threshold = 3.5
Rule A: +1 if \(x \le 3.5\) → points 1,2,3 → +1; points 4,5 → -1.
Points 1,2 correct; point 3 wrong; point 4 correct; point 5 wrong. Misclassified: 3, 5 → 0.2 + 0.2 = 0.4.
Rule B: -1 if \(x \le 3.5\) → points 1,2,3 → -1; points 4,5 → +1.
Points 1,2 wrong; point 3 correct; point 4 wrong; point 5 correct. Misclassified: 1, 2, 4 → 0.6.
→ Best: 0.4
Try threshold = 4.5
Rule A: +1 if \(x \le 4.5\) → points 1,2,3,4 → +1; point 5 → -1.
Points 1,2 correct; points 3,4 wrong; point 5 wrong. Misclassified: 3, 4, 5 → 0.6.
Rule B: -1 if \(x \le 4.5\) → points 1,2,3,4 → -1; point 5 → +1.
Points 1,2 wrong; points 3,4 correct; point 5 correct. Misclassified: 1, 2 → 0.4.
→ Best: 0.4
Best stump in iteration 1:
Threshold = 2.5, predict +1 if \(x \le 2.5\), else -1 → misclassifies only point 5. Weighted error \(\varepsilon_1 = 0.2\).
Calculate Classifier Weight \(\alpha_1\) for \(h_1(x)\):
\[\alpha_1 = \frac{1}{2}\ln\left(\frac{1-\epsilon_1}{\epsilon_1}\right)=\frac{1}{2}\ln(4)=0.6931\]
Calculate Normalization Factor \(Z_1\).
Misclassified instance (5): \(w_5 \times e^{\alpha_1} = 0.2 \times e^{0.6931} = 0.2 \times 2 = 0.4\)
Correctly classified instances: \(w_i \times e^{-\alpha_1} = 0.2 \times e^{-0.6931} = 0.2 \times 0.5 = 0.1\)
\(Z_1=0.1+0.1+0.1+0.1+0.4=0.8\)
Update Weights.
Correct instances (1,2,3,4): \(w_i^{(2)} = \frac{0.1}{0.8} = 0.125\)
Misclassified instance (5): \(w_5^{(2)} = \frac{0.4}{0.8} = 0.5\)
The updated weights are: \([0.125, 0.125, 0.125, 0.125, 0.5]\)
Repeat the above iteration. The following is the second Iteration (t = 2).
Weights: [0.125, 0.125, 0.125, 0.125, 0.5]. Try the thresholds again.
Threshold = 1.5
Rule A: +1 if x ≤ 1.5 → point 1 correct; others → -1 → 2 wrong, 3 correct, 4 correct, 5 wrong. Misclassified: 2, 5 → 0.125 + 0.5 = 0.625.
Rule B: -1 if x ≤ 1.5 → 1 wrong; others → +1 → 2 correct, 3 wrong, 4 wrong, 5 correct. Misclassified: 1, 3, 4 → 0.125 × 3 = 0.375.
→ Better: 0.375
Threshold = 2.5
Rule A: +1 if x ≤ 2.5 → 1, 2 correct; 3, 4, 5 → -1 → 3, 4 correct, 5 wrong. Misclassified: 5 → 0.5.
Rule B: -1 if x ≤ 2.5 → 1, 2 wrong; 3, 4, 5 → +1 → 3, 4 wrong, 5 correct. Misclassified: 1, 2, 3, 4 → 0.125 × 4 = 0.5.
→ Best: 0.5
Threshold = 3.5
Rule A: +1 if x ≤ 3.5 → 1, 2 correct, 3 wrong; 4, 5 → -1 → 4 correct, 5 wrong. Misclassified: 3, 5 → 0.125 + 0.5 = 0.625.
Rule B: -1 if x ≤ 3.5 → 1, 2, 3 → -1 → 1, 2 wrong, 3 correct; 4, 5 → +1 → 4 wrong, 5 correct. Misclassified: 1, 2, 4 → 0.125 × 3 = 0.375.
→ Best: 0.375
Threshold = 4.5
Rule A: +1 if x ≤ 4.5 → 1, 2, 3, 4 → 1, 2 correct, 3, 4 wrong; 5 → -1 (wrong). Misclassified: 3, 4, 5 → 0.125 + 0.125 + 0.5 = 0.75.
Rule B: -1 if x ≤ 4.5 → 1, 2, 3, 4 → -1 → 1, 2 wrong, 3, 4 correct; 5 → +1 (correct). Misclassified: 1, 2 → 0.125 + 0.125 = 0.25 ← best so far.
Best stump in iteration 2: threshold = 4.5, predict -1 if x ≤ 4.5, else +1 → misclassifies only points 1 and 2, so \(\varepsilon_2 = 0.25\).
\(\alpha_2=\frac{1}{2}\ln\left(\frac{1-0.25}{0.25}\right)=0.5493\)
Update Weights.
Calculate Normalization Factor \(Z_2\).
Misclassified instances (1,2): \(0.125 \times e^{0.5493} = 0.125 \times 1.732 = 0.2165\)
Correctly classified instances (3,4): \(0.125 \times e^{-0.5493} = 0.125 \times 0.577 = 0.0721\)
Instance 5: \(0.5 \times e^{-0.5493} = 0.5 \times 0.577 = 0.2885\)
\(Z_2=0.2165+0.2165+0.0721+0.0721+0.2885=0.8657\)
Calculate New Weights
\(w_1^{(3)} = 0.2165/0.8657 \approx 0.25\)
\(w_2^{(3)} = 0.2165/0.8657 \approx 0.25\)
\(w_3^{(3)} = 0.0721/0.8657 \approx 0.083\)
\(w_4^{(3)} = 0.0721/0.8657 \approx 0.083\)
\(w_5^{(3)} = 0.2885/0.8657 \approx 0.333\)
For demonstration purposes, we just do two iterations. So, the final weights are: \([0.25, 0.25, 0.083, 0.083, 0.333]\)
Step 2: The final ensemble classifier is a weighted combination of our weak classifiers:
\[H(x)=\text{sign}\big(0.6931\cdot h_1(x)+0.5493\cdot h_2(x)\big)\]
Where:
\(h_1(x)\): “If \(x \le 2.5\) then \(+1\) else \(-1\)”
\(h_2(x)\): “If \(x \le 4.5\) then \(-1\) else \(+1\)”
verification <- data.frame(
Instance = 1:5,
x = 1:5,
h1 = c(1, 1, -1, -1, -1),
h2 = c(-1, -1, -1, -1, 1),
WeightedSum = c(0.144, 0.144, -1.242, -1.242, -0.144),
FinalPred = c(1, 1, -1, -1, -1),
TrueLabel = c(1, 1, -1, -1, 1),
Correct = c("✓", "✓", "✓", "✓", "✗")
)
kable(verification, caption = "Final Classifier Verification") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Instance | x | h1 | h2 | WeightedSum | FinalPred | TrueLabel | Correct |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | -1 | 0.144 | 1 | 1 | ✓ |
| 2 | 2 | 1 | -1 | 0.144 | 1 | 1 | ✓ |
| 3 | 3 | -1 | -1 | -1.242 | -1 | -1 | ✓ |
| 4 | 4 | -1 | -1 | -1.242 | -1 | -1 | ✓ |
| 5 | 5 | -1 | 1 | -0.144 | -1 | 1 | ✗ |
Training Accuracy: 4/5 = 80%.
| Round | Threshold | Rule | ε (Error) | α (Weight) | Misclassified |
|---|---|---|---|---|---|
| 1 | 2.5 | +1 if ≤2.5, else -1 | 0.2 | 0.693 | {5} |
| 2 | 4.5 | -1 if ≤4.5, else +1 | 0.25 | 0.549 | {1,2} |
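The two rounds above can be reproduced with a short, package-free R sketch; the stump search, \(\alpha\), and weight updates follow the formulas given earlier (tie-breaking is not needed here because each round has a unique best stump).
# Toy AdaBoost by hand: x = 1..5, y as in the table above
x <- 1:5
y <- c(1, 1, -1, -1, 1)
# All candidate stumps: direction = 1 means "predict +1 if x <= threshold, else -1"
stumps <- expand.grid(threshold = c(1.5, 2.5, 3.5, 4.5), direction = c(1, -1))
stump_predict <- function(threshold, direction, x) ifelse(x <= threshold, direction, -direction)
w <- rep(1 / 5, 5)                     # initial (uniform) weights
alphas <- numeric(2); chosen <- list()
for (t in 1:2) {
  # Weighted error of every stump under the current weights
  errors <- apply(stumps, 1, function(s)
    sum(w * (stump_predict(s["threshold"], s["direction"], x) != y)))
  best <- which.min(errors)
  eps <- errors[best]
  alphas[t] <- 0.5 * log((1 - eps) / eps)   # classifier weight alpha_t
  chosen[[t]] <- stumps[best, ]
  h <- stump_predict(stumps$threshold[best], stumps$direction[best], x)
  w <- w * exp(-alphas[t] * y * h)      # up-weight mistakes, down-weight correct points
  w <- w / sum(w)                       # normalize (divide by Z_t)
  cat(sprintf("t = %d: threshold = %.1f, epsilon = %.3f, alpha = %.4f\n",
              t, stumps$threshold[best], eps, alphas[t]))
}
round(w, 3)   # final weights: 0.250 0.250 0.083 0.083 0.333
# Final strong classifier H(x) = sign(alpha_1 h_1(x) + alpha_2 h_2(x))
sign(alphas[1] * stump_predict(chosen[[1]]$threshold, chosen[[1]]$direction, x) +
     alphas[2] * stump_predict(chosen[[2]]$threshold, chosen[[2]]$direction, x))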
The following shows how the R package adabag performs AdaBoost on the iris data.
if (!require(adabag)) install.packages("adabag")
library(adabag)
# Train AdaBoost model
adaboost_model <- boosting(Species ~ ., data = train_data, boos = TRUE, mfinal = 50) # mfinal: number of iterations
# Predict on test data
predictions2 <- predict(adaboost_model, newdata = test_data)$class |> factor()
confusionMatrix(predictions2, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 14 1
## virginica 0 1 17
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 2.842e-15
##
## Kappa : 0.9324
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9444
## Specificity 1.0000 0.9667 0.9630
## Pos Pred Value 1.0000 0.9333 0.9444
## Neg Pred Value 1.0000 0.9667 0.9630
## Prevalence 0.2667 0.3333 0.4000
## Detection Rate 0.2667 0.3111 0.3778
## Detection Prevalence 0.2667 0.3333 0.4000
## Balanced Accuracy 1.0000 0.9500 0.9537
XGBoost (short for eXtreme Gradient Boosting) is a powerful, scalable machine learning library for gradient-boosted decision trees (GBDT). It’s widely used for classification, regression, and ranking tasks in data science and machine learning competitions (like Kaggle). XGBoost builds many weak decision trees sequentially — each one learning from the mistakes of the previous ones — to create a strong, accurate predictive model.
Some tips: start with a moderate learning rate (eta) and enough boosting rounds, use early stopping on a validation set, and control overfitting with max_depth, subsample, and colsample_bytree.
Why do people love XGBoost? It is fast (parallelized tree construction), includes built-in regularization to reduce overfitting, handles missing values automatically, and consistently performs well on tabular data.
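To make the sequential “learn from the previous trees’ mistakes” idea concrete before the xgboost code, here is a minimal gradient-boosting sketch for squared-error regression that fits rpart stumps to the current residuals. The synthetic data, eta = 0.3, and 50 rounds are illustrative assumptions; real XGBoost adds regularization, second-order gradient information, and many engineering optimizations on top of this.
library(rpart)
set.seed(123)
x_gb <- runif(200, -2, 2)
y_gb <- x_gb^2 + rnorm(200, 0, 0.2)         # a nonlinear target
dat_gb <- data.frame(x = x_gb)
eta <- 0.3                                  # learning rate (shrinkage)
pred <- rep(mean(y_gb), 200)                # start from a constant prediction
for (m in 1:50) {
  resid <- y_gb - pred                      # residuals = negative gradient of squared error
  stump <- rpart(resid ~ x, data = data.frame(x = x_gb, resid = resid),
                 control = rpart.control(maxdepth = 1))   # weak learner
  pred <- pred + eta * predict(stump, newdata = dat_gb)   # add a shrunken correction
}
sqrt(mean((y_gb - pred)^2))                 # training RMSE after boosting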
Example: Predict Iris flower species using petal/sepal measurements
Step 1: Load packages and data
library(xgboost)
library(data.table)
library(ggplot2)
library(caret) # for confusion matrix
Step 2: Prepare Data for XGBoost
XGBoost needs a numeric feature matrix (no data frames or factors) and a numeric label vector (0-based class indices for multiclass problems), typically wrapped in an xgb.DMatrix.
# Features (all except Species)
X <- as.matrix(iris[, 1:4])
# Labels: convert factor to 0,1,2
y <- as.numeric(iris$Species) - 1 # setosa=0, versicolor=1, virginica=2
# Train-test split (70-30)
set.seed(123)
train_idx <- sample(1:nrow(iris), size = 0.7 * nrow(iris))
X_train <- X[train_idx, ]
X_test <- X[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
# Create xgb.DMatrix (XGBoost's special format)
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)
Step 3: Train an XGBoost Model
# Parameters
params <- list(
objective = "multi:softprob", # multiclass probability
num_class = 3, # 3 species
eval_metric = "mlogloss", # log loss
eta = 0.3, # learning rate
max_depth = 6,
subsample = 0.8,
colsample_bytree = 0.8
)
# Train model
set.seed(123)
model <- xgb.train(
params = params,
data = dtrain,
nrounds = 20
)
| Parameter | Description | Typical Values |
|---|---|---|
| eta | Learning rate (step size shrinkage) | 0.01 – 0.3 |
| max_depth | Maximum depth of a tree | 3 – 10 |
| subsample | Fraction of observations used per boosting round | 0.5 – 1 |
| colsample_bytree | Fraction of columns (features) used per tree | 0.5 – 1 |
| min_child_weight | Minimum sum of instance weight needed in a child | 1 – 10 |
Step 4: Make Predictions
# Predict probabilities (returns vector, reshape to matrix)
pred_probs <- predict(model, dtest, reshape = TRUE)
head(pred_probs)
## [,1] [,2] [,3]
## [1,] 0.9862779 0.009412179 0.004309894
## [2,] 0.9873204 0.007788341 0.004891236
## [3,] 0.9873204 0.007788341 0.004891236
## [4,] 0.9873204 0.007788341 0.004891236
## [5,] 0.9820973 0.013611067 0.004291625
## [6,] 0.9862779 0.009412179 0.004309894
# Predicted class (0-based)
pred_class <- max.col(pred_probs) - 1
pred_labels <- levels(iris$Species)[pred_class + 1]
Step 5: Evaluate Model
# Accuracy
accuracy <- mean(pred_class == y_test)
cat("Accuracy:", round(accuracy, 4), "\n") # ~0.9778
## Accuracy: 0.9778
# Confusion Matrix
cm <- confusionMatrix(
factor(pred_labels, levels = levels(iris$Species)),
factor(iris$Species[-train_idx], levels = levels(iris$Species))
)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 17 0
## virginica 0 1 13
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9664
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9444 1.0000
## Specificity 1.0000 1.0000 0.9688
## Pos Pred Value 1.0000 1.0000 0.9286
## Neg Pred Value 1.0000 0.9643 1.0000
## Prevalence 0.3111 0.4000 0.2889
## Detection Rate 0.3111 0.3778 0.2889
## Detection Prevalence 0.3111 0.3778 0.3111
## Balanced Accuracy 1.0000 0.9722 0.9844
Step 6: Feature Importance
importance_matrix <- xgb.importance(model = model)
importance_matrix
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: Petal.Length 0.64346260 0.4517480 0.3506494
## 2: Petal.Width 0.30552986 0.3409521 0.2987013
## 3: Sepal.Length 0.03025878 0.1036788 0.1948052
## 4: Sepal.Width 0.02074876 0.1036211 0.1558442
Explanation of each of the last 3 columns of the above table:
| Type | Meaning |
|---|---|
| Gain | How much the feature improves accuracy |
| Cover | How many samples the feature affects |
| Frequency | How often the feature is used |
Step 7: Hyperparameter Tuning with Cross-Validation
# Use xgb.cv for 5-fold CV
cv <- xgb.cv(
params = params,
data = dtrain,
nrounds = 200,
nfold = 5,
early_stopping_rounds = 20,
verbose = 0
)
# Best iteration
best_iter <- cv$best_iteration
cat("Best iteration:", best_iter, "\n")
## Best iteration: 18
SHAP measures how much each feature contributes to a specific prediction.
library(shapviz)
# Now create SHAP object — this will work!
shap <- shapviz(model, X_pred = X_train)
# Global importance
sv_importance(shap, kind = "bar") +
ggtitle("SHAP Feature Importance")
Random Forest is an extension of bagging that adds randomness by selecting random subsets of features at each split. Use the randomForest package.
if (!require(randomForest)) install.packages("randomForest")
library(randomForest)
# Load and prepare data
data(iris)
set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]
# Train Random Forest model
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 20, mtry = 2) # ntree: number of trees, mtry: features per split
# Predict on test data
predictions3 <- predict(rf_model, newdata = test_data)
confusionMatrix(predictions3, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 14 1
## virginica 0 1 17
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 2.842e-15
##
## Kappa : 0.9324
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9444
## Specificity 1.0000 0.9667 0.9630
## Pos Pred Value 1.0000 0.9333 0.9444
## Neg Pred Value 1.0000 0.9667 0.9630
## Prevalence 0.2667 0.3333 0.4000
## Detection Rate 0.2667 0.3111 0.3778
## Detection Prevalence 0.2667 0.3333 0.4000
## Balanced Accuracy 1.0000 0.9500 0.9537
Hierarchical Clustering is an unsupervised method that builds a hierarchy of clusters without requiring a predefined k. It comes in two forms: agglomerative (bottom-up: start with each point as its own cluster and repeatedly merge the two closest clusters) and divisive (top-down: start with one cluster and recursively split it).
K-Means Clustering is an unsupervised machine learning algorithm that partitions a dataset into a predefined number of clusters (k). It aims to minimize the variance (sum of squared distances) within each cluster. The process is iterative: (1) choose k initial centroids, (2) assign each point to its nearest centroid, (3) recompute each centroid as the mean of its assigned points, and (4) repeat steps 2 and 3 until the assignments stop changing. A minimal base-R sketch of these steps follows.
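This sketch runs the assignment and update steps on small synthetic 2-D data with k = 2 and random initialization (illustrative assumptions; the empty-cluster edge case is ignored):
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)           # 10 points, 2 features
centroids <- pts[sample(nrow(pts), 2), ]     # step 1: pick k = 2 initial centroids
for (iter in 1:10) {
  # step 2: assign each point to its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, pts)))[-(1:2), 1:2]
  cluster_id <- max.col(-d)
  # step 3: recompute each centroid as the mean of its assigned points
  new_centroids <- rbind(colMeans(pts[cluster_id == 1, , drop = FALSE]),
                         colMeans(pts[cluster_id == 2, , drop = FALSE]))
  if (all(abs(new_centroids - centroids) < 1e-8)) break   # step 4: stop when stable
  centroids <- new_centroids
}
table(cluster_id)   # cluster sizes after convergence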
We use the Pima data.
library(mlbench)
data("PimaIndiansDiabetes2")
Pima = PimaIndiansDiabetes2 |> na.omit()
X <- Pima[, 1:8] # Exclude Outcome for unsupervised clustering
# Standardize features
X_std <- scale(X)
# Hierarchical Clustering (Agglomerative with Dendrogram)
dist_matrix <- dist(X_std, method = "euclidean")
hclust_model <- hclust(dist_matrix, method = "ward.D2")
# Plot Dendrogram
plot(hclust_model, main = "Hierarchical Clustering Dendrogram (Ward Linkage)",
xlab = "Sample Index", ylab = "Distance", sub = "")
# Cut dendrogram
hier_labels <- cutree(hclust_model, k = 3)
hier_labels
## 4 5 7 9 14 15 17 19 20 21 25 26 28 29 32 33 36 40 41 44
## 1 2 1 2 2 3 2 1 1 2 3 3 1 3 2 1 1 2 1 3
## 51 52 53 54 55 57 58 60 64 69 70 71 72 74 83 86 88 89 92 93
## 1 1 1 3 2 2 2 1 1 1 1 1 1 1 1 1 1 3 1 3
## 95 96 98 99 100 104 106 108 109 110 111 112 113 115 120 121 123 126 127 128
## 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1
## 129 131 133 135 136 137 138 140 143 145 148 151 153 154 157 158 159 160 162 163
## 1 1 2 1 1 1 1 2 1 2 1 2 3 2 1 1 1 3 3 1
## 166 170 172 174 175 176 178 182 187 188 189 190 192 196 198 199 200 204 205 207
## 1 1 1 2 1 2 2 1 2 2 3 1 2 2 1 1 2 1 3 3
## 209 214 215 216 217 218 221 224 225 226 229 230 232 233 235 237 242 244 245 248
## 1 1 3 3 1 1 1 3 1 1 2 1 2 1 1 3 1 1 2 2
## 249 253 255 259 260 261 266 272 274 276 278 280 282 283 286 287 288 289 290 291
## 2 1 1 1 3 1 1 1 1 2 1 1 3 1 3 2 2 1 1 1
## 292 293 294 296 297 298 299 302 303 306 307 308 309 310 312 313 314 316 317 319
## 1 2 1 1 2 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1
## 321 324 326 327 329 330 332 335 336 339 341 342 346 347 349 354 357 359 360 361
## 1 3 1 1 1 3 1 1 2 2 1 1 3 1 1 1 1 3 2 1
## 365 366 369 370 371 373 374 375 376 377 378 380 381 383 384 385 386 389 390 391
## 2 1 1 1 2 1 1 1 3 1 1 2 1 1 1 1 1 3 1 1
## 393 394 396 397 403 406 410 412 413 414 415 416 420 421 422 423 425 426 428 429
## 1 1 1 1 1 1 2 1 2 1 1 2 1 2 1 1 2 2 1 2
## 430 432 433 442 443 446 447 448 449 450 451 453 455 458 459 460 461 463 466 467
## 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 3 3 1 1
## 468 470 477 478 479 481 483 484 486 487 488 491 494 498 499 500 501 504 507 508
## 1 2 2 1 3 2 1 1 2 2 2 1 1 1 3 2 1 1 1 1
## 509 512 515 516 517 520 521 522 527 528 529 531 533 535 539 540 541 542 544 545
## 1 1 1 1 3 3 1 1 1 1 1 1 2 1 2 2 3 1 1 1
## 546 547 548 549 552 554 555 556 562 563 564 566 567 568 569 570 573 574 575 576
## 2 2 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 2 1
## 577 585 589 592 594 595 596 598 600 604 607 608 609 610 611 612 613 615 618 621
## 1 2 2 1 1 2 1 1 1 3 2 1 2 1 1 1 2 3 1 1
## 624 626 632 634 638 639 640 641 645 646 647 648 649 651 652 653 655 656 657 658
## 1 1 1 1 1 3 1 1 1 2 1 1 3 1 1 1 1 1 1 2
## 660 663 664 666 669 670 671 673 674 680 681 683 686 689 690 693 694 696 697 699
## 1 3 2 1 3 3 3 3 2 1 1 1 1 1 1 1 2 2 1 1
## 701 705 708 710 711 712 714 716 717 719 722 723 724 727 731 733 734 737 739 741
## 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3
## 742 743 745 746 748 749 752 754 756 761 764 766
## 1 1 3 3 2 1 1 2 2 1 3 1
# K-Means Clustering for k=2 to 10 and Elbow Method
inertias <- numeric(9)
sil_scores <- numeric(9)
K <- 2:10
for (k in K) {
set.seed(42) # Reproducibility
kmeans_model <- kmeans(X_std, centers = k, nstart = 25)
inertias[k-1] <- kmeans_model$tot.withinss # Inertia (within-cluster SS)
if (k > 1) {
sil <- cluster::silhouette(kmeans_model$cluster, dist(X_std))
sil_scores[k-1] <- mean(sil[, 3]) # Average silhouette score
} else {
sil_scores[k-1] <- NA # Silhouette not defined for k=1
}
}
# Plot Elbow Method (Inertia vs. k)
plot(K, inertias, type = "b", pch = 19, col = "blue",
xlab = "Number of Clusters (k)", ylab = "Inertia (Within-Cluster SS)",
main = "Elbow Method for Optimal k")
grid()
# Plot Silhouette Scores
plot(K, sil_scores, type = "b", pch = 19, col = "red",
xlab = "Number of Clusters (k)", ylab = "Average Silhouette Score",
main = "Silhouette Scores for Different k")
grid()
# Choose optimal k (e.g., based on elbow, typically k=3 or 4)
optimal_k <- 3
set.seed(42)
kmeans_model <- kmeans(X_std, centers = optimal_k, nstart = 25)
labels <- kmeans_model$cluster
# PCA for 2D/3D Visualization
pca <- prcomp(X_std, scale. = TRUE)
X_pca <- pca$x[, 1:3] # First 3 PCs for 2D/3D
# 2D Visualization
pca_df <- data.frame(PC1 = X_pca[, 1], PC2 = X_pca[, 2], Cluster = as.factor(labels))
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3) +
labs(title = "K-Means Clusters in 2D PCA Space", x = "PC1", y = "PC2") +
theme_minimal()
Compare K-Means and Hierarchical Clustering:
cat("K-Means Silhouette Score for k =", optimal_k, ":", sil_scores[optimal_k-1], "\n")
## K-Means Silhouette Score for k = 3 : 0.2115462
# Cluster Interpretation (K-Means)
cat("\nCluster Mean Features (K-Means):\n")
##
## Cluster Mean Features (K-Means):
for (i in 1:optimal_k) {
cat("Cluster", i, ":\n")
print(colMeans(X[labels == i, ]))
}
## Cluster 1 :
## pregnant glucose pressure triceps insulin mass
## 8.0116279 139.5697674 76.8604651 31.9883721 197.7558140 33.5744186
## pedigree age
## 0.5346395 46.3604651
## Cluster 2 :
## pregnant glucose pressure triceps insulin mass
## 1.8032787 131.3606557 74.3032787 38.0819672 201.3934426 39.2163934
## pedigree age
## 0.6045984 27.6475410
## Cluster 3 :
## pregnant glucose pressure triceps insulin mass
## 2.0923913 108.9184783 65.3532609 21.8913043 106.5054348 28.7934783
## pedigree age
## 0.4635543 25.7554348
# Silhouette Score for Hierarchical Clustering
sil_hier <- cluster::silhouette(hier_labels, dist_matrix)
cat("\nHierarchical Clustering Silhouette Score:", mean(sil_hier[, 3]), "\n")
##
## Hierarchical Clustering Silhouette Score: 0.2255977