Machine Learning Techniques in R: From Basics to Advanced

Machine Learning (ML) is a field of artificial intelligence that involves teaching computers to learn patterns from data without being explicitly programmed. This document provides a step-by-step guide to various machine learning techniques in R, from fundamental concepts to more advanced algorithms.

We will cover:

  • Core concepts: the relationship between AI, ML, and Deep Learning, and the main ML tasks (regression, classification, clustering).
  • Supervised learning in R: linear and logistic regression, decision trees, random forests, and gradient boosting with XGBoost.
  • A brief introduction to neural networks with nnet and keras.

Throughout this document, we will use real datasets from R libraries to illustrate practical examples.

2. Machine Learning, Artificial Intelligence, and Deep Learning

The terms Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) are often used interchangeably, but they have distinct meanings and a hierarchical relationship.

2.1 Artificial Intelligence (AI)

Artificial Intelligence is defined as computer algorithms that perform tasks believed to require human intelligence. AI includes:

  • Rule-based algorithms: Explicit instructions for a system (e.g., robots assembling cars).
  • Data-based algorithms: Using historical data to develop predictive models.

Examples of AI applications include:

  • Advanced home automation
  • Speech recognition (Natural Language Processing, NLP)
  • Optical Character Recognition (OCR)
  • Self-driving cars
  • Art generation (e.g., DALL-E)


2.2 Machine Learning (ML)

Machine Learning is a subset of AI that is data-driven rather than rule-based.

  • In ML, algorithms learn patterns from data rather than following predefined rules.
  • ML models use training data to calibrate themselves and discover patterns internally.

Rule-based vs. Data-based Systems

Rule-Based Systems                                    | Machine Learning Systems
Require experts to define rules.                      | Derive patterns from data.
Explicit “if-then” logic.                             | Adjust rules based on training.
Example: handwritten regex for email spam detection.  | Example: spam classifier trained on past emails.

Common Machine Learning Algorithms

  • Regression: Ordinary Least Squares (OLS), Logistic Regression.
  • Instance-Based Learning: k-Nearest Neighbors (k-NN).
  • Ensemble Methods: Random Forest.
  • Clustering: k-Means, Hierarchical Clustering.
  • Neural Networks.

2.3 Deep Learning (DL)

Deep Learning is a subset of ML that relies on Neural Networks with many layers.

Key Features of Deep Learning:

  • Uses multiple layers of neurons.
  • Learns complex patterns through weighted non-linear functions.
  • Updates parameters using gradient-based optimization.

If a neural network contains many neurons (often millions) and multiple layers, it is referred to as a Deep Learning model.

Common Deep Learning Applications

  • Natural Language Processing (NLP)
  • Advanced image recognition
  • Programming code completion


2.4 Machine Learning Tasks

Machine Learning algorithms can be categorized based on the tasks they perform.

1. Regression

  • Objective: Predict a continuous variable.
  • Example: Predict house prices based on square footage.
  • Algorithms:
    • Linear Regression (OLS)
    • Polynomial Regression
    • Random Forest
    • Neural Networks

2. Classification

  • Objective: Predict a category (e.g., spam vs. not spam).
  • Example: Email spam detection.
  • Algorithms:
    • Logistic Regression
    • k-Nearest Neighbors (k-NN)
    • Random Forest
    • Neural Networks

Classification tasks can be:

  • Binary classification: Yes/No, True/False, 0/1.
  • Multiclass classification: Red/Blue/Green, Disease A/B/C.

3. Cluster Analysis

  • Objective: Group observations into homogeneous clusters.
  • Example: Customer segmentation in marketing.
  • Algorithms:
    • k-Means Clustering
    • Hierarchical Clustering
    • Gaussian Mixture Models (GMM).

2.5 Regression vs. Classification vs. Clustering

Task           | Goal                           | Example                           | Algorithms
Regression     | Predict a continuous variable. | House price prediction.           | OLS, Neural Networks
Classification | Predict a category.            | Spam filtering, cancer detection. | Logistic Regression, k-NN
Clustering     | Group similar observations.    | Customer segmentation.            | k-Means, Hierarchical Clustering

2.6 Conclusion

AI, ML, and DL form a nested hierarchy: machine learning is the data-driven subset of AI, and deep learning is the subset of ML built on multi-layer neural networks. The remainder of this document puts these ideas into practice in R, moving from data preparation to regression, classification, tree-based ensembles, and neural networks.


1. Libraries and Data Import

We will use several R packages to facilitate data manipulation, visualization, and modeling:

  • dplyr and tidyr for data wrangling
  • ggplot2 for data visualization
  • caret for streamlined machine learning modeling (optional, but highly recommended)
  • rpart for decision trees
  • randomForest for random forest models
  • xgboost for gradient boosting
  • nnet or keras (optional) for neural networks

Install any packages you do not have by using install.packages("package_name").

# Data Wrangling
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

# Visualization
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.1
# Machine Learning & Modeling
library(caret)         # For cross-validation, model training pipeline
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: lattice
library(rpart)         # For decision trees
library(randomForest)  # For random forest
## Warning: package 'randomForest' was built under R version 4.4.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(xgboost)       # For gradient boosting
## Warning: package 'xgboost' was built under R version 4.4.1
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
# Neural Networks
library(nnet)       # Basic feed-forward neural network
library(keras)      # Deep learning in R (requires TensorFlow backend)
## Warning: package 'keras' was built under R version 4.4.1
set.seed(123)  # For reproducibility

2. Data Preparation and Exploration

2.1 Data Overview

For illustration, let’s start with two well-known datasets:

  • mtcars (built-in dataset in R) for a regression example.
  • iris (built-in dataset in R) for a classification example.

2.2 Exploring the mtcars Dataset (Regression)

The mtcars dataset contains information about miles per gallon (mpg) and various characteristics of different car models.

data("mtcars")

# Basic structure
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# First few rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Convert some variables to factor (e.g., am, cyl, gear)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)

2.3 Visualizing the mtcars Dataset

Let's first look at the distribution of mpg:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(fill = "blue", bins = 10, alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of MPG", x = "MPG", y = "Count")

Visualizing the relationship between mpg and weight (wt):

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  theme_minimal() +
  labs(title = "MPG vs. Weight", x = "Weight (1000 lbs)", y = "MPG")
## `geom_smooth()` using formula = 'y ~ x'

We observe a negative relationship between mpg and wt: as weight increases, mpg tends to decrease.
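
As a quick numeric check (a minimal sketch using base R), we can quantify this relationship with the Pearson correlation:

# Correlation between weight and fuel efficiency
# A value close to -1 indicates a strong negative linear relationship
cor(mtcars$wt, mtcars$mpg)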

2.4 Exploring the iris Dataset (Classification)

The iris dataset has 150 observations of iris flowers with four numeric features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and a Species label with three classes: setosa, versicolor, and virginica.

data("iris")

# Basic structure
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# First few rows
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Class Distribution

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
pairs(iris, col = iris$Species,
      main = "Iris Feature Scatterplot Matrix")

3. Linear Regression (Supervised Learning)

Linear Regression predicts a continuous outcome variable (Y) from one or more predictor variables (X).

Model Form:

\[ \text{MPG} = \beta_0 + \beta_1 \cdot \text{Weight} + \beta_2 \cdot \text{Horsepower} + \dots + \varepsilon \]

3.1 Simple Linear Regression Example

Let’s model mpg using wt only:

model_lm_simple <- lm(mpg ~ wt, data = mtcars)
summary(model_lm_simple)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Interpretation: The coefficient for wt (about -5.34) indicates that each additional 1000 lbs of vehicle weight is associated with a decrease of roughly 5.3 in MPG.
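
As an illustration (a minimal sketch; the weights below are arbitrary example values), the fitted model can be used to predict mpg for new cars:

# Predicted mpg for hypothetical cars weighing 2,500 and 3,500 lbs (wt is in 1000 lbs)
predict(model_lm_simple, newdata = data.frame(wt = c(2.5, 3.5)))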

3.2 Multiple Linear Regression Example

We can include more predictors, such as wt, hp (horsepower), and am (transmission type).

model_lm_multi <- lm(mpg ~ wt + hp + am, data = mtcars)
summary(model_lm_multi)
## 
## Call:
## lm(formula = mpg ~ wt + hp + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## amManual     2.083710   1.376420   1.514 0.141268    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

Key points:

  • Check p-values to see which predictors are significant.
  • Look at R-squared to assess how much of the variance in mpg is explained.

3.3 Diagnostics

par(mfrow = c(2, 2))
plot(model_lm_multi)

Key points:

  • The Residuals vs. Fitted plot should show no systematic pattern; curvature suggests a misspecified model.
  • The Normal Q-Q plot checks whether the residuals are approximately normally distributed.
  • The Scale-Location and Residuals vs. Leverage plots help spot heteroscedasticity and influential observations.

4. Logistic Regression (Supervised Learning)

Logistic Regression predicts a binary (or multi-class) outcome. Although iris$Species has three classes, we can simplify to a binary problem by filtering for two species for demonstration.

4.1 Data Preparation

Let’s consider only Setosa vs. Versicolor for a binary classification:

iris_binary <- iris %>%
  filter(Species != "virginica") %>%
  mutate(Species = factor(Species))

table(iris_binary$Species)
## 
##     setosa versicolor 
##         50         50

The two classes are exactly balanced, with 50 observations each.

4.2 Logistic Model

We’ll predict Species using Petal.Length and Petal.Width.

model_logistic <- glm(Species ~ Petal.Length + Petal.Width, 
                      data = iris_binary, 
                      family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_logistic)
## 
## Call:
## glm(formula = Species ~ Petal.Length + Petal.Width, family = binomial, 
##     data = iris_binary)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -72.73   70289.28  -0.001    0.999
## Petal.Length     18.37   74002.45   0.000    1.000
## Petal.Width      35.76  199094.68   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3863e+02  on 99  degrees of freedom
## Residual deviance: 1.8210e-09  on 97  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

Key points:

  • Coefficients in logistic regression are on the log-odds scale.
  • Petal.Length and Petal.Width separate setosa from versicolor perfectly, which is why glm() warns that fitted probabilities of 0 or 1 occurred: the huge standard errors and non-significant p-values are a symptom of this complete separation, not of a poor model.
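
To see the fitted probabilities in action (a minimal sketch with made-up petal measurements), we can predict the probability that a new flower is versicolor:

# Probability of versicolor (the second factor level) for two hypothetical flowers
new_flowers <- data.frame(Petal.Length = c(1.5, 4.5),
                          Petal.Width  = c(0.2, 1.4))
predict(model_logistic, newdata = new_flowers, type = "response")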

4.3 Model Predictions and Accuracy

# Predict probabilities
iris_binary$prob <- predict(model_logistic, type = "response")

# Convert probabilities to classes using 0.5 threshold
iris_binary$pred <- ifelse(iris_binary$prob > 0.5, "versicolor", "setosa")
iris_binary$pred <- factor(iris_binary$pred)

# Confusion Matrix
confusionMatrix(data = iris_binary$pred, reference = iris_binary$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor
##   setosa         50          0
##   versicolor      0         50
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9638, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : setosa     
## 

We can evaluate the model using metrics like the following (computed by hand in the sketch after this list):

  • Accuracy: The proportion of correctly classified cases.
  • Sensitivity (True Positive Rate): The proportion of actual positives that are correctly identified.
  • Specificity (True Negative Rate): The proportion of actual negatives that are correctly identified.
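
These metrics can also be computed directly from the confusion matrix table; a minimal sketch in base R, treating setosa as the positive class (as caret does above):

cm <- table(Predicted = iris_binary$pred, Actual = iris_binary$Species)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["setosa", "setosa"] / sum(cm[, "setosa"])              # true positive rate
specificity <- cm["versicolor", "versicolor"] / sum(cm[, "versicolor"])  # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)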

5. Decision Trees

Decision trees split the data based on feature thresholds to predict an outcome (either regression or classification).

5.1 Classification Tree on iris

We can use all three classes of the iris dataset.

# rpart for classification
model_tree <- rpart(Species ~ ., data = iris, method = "class")
model_tree
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
##     7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *

5.2 Visualize the Tree

library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.2
rpart.plot(model_tree, main = "Decision Tree for Iris Dataset")

  • Each node in the tree represents a split based on a feature and threshold.
  • Eventually, the tree leads to leaf nodes, which represent the predicted classes or values.

5.3 Evaluate Model Performance

pred_tree <- predict(model_tree, iris, type = "class")
confusionMatrix(pred_tree, iris$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         5
##   virginica       0          1        45
## 
## Overall Statistics
##                                          
##                Accuracy : 0.96           
##                  95% CI : (0.915, 0.9852)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.94           
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9800           0.9000
## Specificity                 1.0000            0.9500           0.9900
## Pos Pred Value              1.0000            0.9074           0.9783
## Neg Pred Value              1.0000            0.9896           0.9519
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3267           0.3000
## Detection Prevalence        0.3333            0.3600           0.3067
## Balanced Accuracy           1.0000            0.9650           0.9450

Decision trees often have low bias but high variance. They can overfit if not pruned.
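
A common remedy is cost-complexity pruning; a minimal sketch with rpart (the cp value below is purely illustrative):

# Inspect the complexity-parameter (cp) table recorded during fitting
printcp(model_tree)

# Prune back to a larger cp to obtain a simpler, less variable tree
model_tree_pruned <- prune(model_tree, cp = 0.1)
model_tree_pruned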


Understanding Tree-Based Models and Neural Networks

1. Binary Decision Trees

A binary decision tree is a flowchart-like structure in which each internal node represents a decision rule, each branch represents an outcome of the rule, and each leaf node represents a final prediction.

1.1 Structure of a Decision Tree

A decision tree recursively splits data based on feature values. It follows a hierarchical structure:

  • Root Node: The first decision point (representing the entire dataset).
  • Internal Nodes: Represent feature-based splits.
  • Leaf Nodes: Represent the final class labels (for classification) or numeric values (for regression).

1.2 Mathematically Formulating Decision Trees

A decision tree splits data to minimize impurity using criteria such as:

  • Gini Impurity (for classification):

\[ G(X) = 1 - \sum_{i=1}^{C} p_i^2 \]

where \(p_i\) is the proportion of observations belonging to class \(i\).

  • Entropy (information-gain-based split):

\[ H(X) = - \sum_{i=1}^{C} p_i \log_2 p_i \]

  • Mean Squared Error (MSE) (for regression):

\[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]

where \(N\) is the number of observations, \(y_i\) is the actual value, and \(\hat{y}_i\) is the predicted value.

A node splits where the chosen criterion (e.g., Gini, Entropy, MSE) is minimized.
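
As a concrete illustration of the Gini criterion (a minimal sketch, not how rpart computes it internally), we can compare the impurity of the root node of the iris tree with one of its nearly pure leaves:

# Gini impurity for a vector of class proportions
gini <- function(p) 1 - sum(p^2)

# Root node: all three species equally represented
gini(c(1/3, 1/3, 1/3))   # about 0.667 (high impurity)

# Leaf with 49 versicolor and 5 virginica out of 54 observations
gini(c(49/54, 5/54))     # about 0.168 (much purer)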

1.3 Visual Representation of a Decision Tree

\[ \begin{array}{c} \textbf{Start (Root Node)} \\ \downarrow \\ \text{Feature 1 < Threshold?} \\ \begin{array}{cc} \text{Yes} & \text{No} \\ \downarrow & \downarrow \\ \text{Feature 2 < Threshold?} & \text{Class B} \\ \begin{array}{cc} \text{Yes} & \text{No} \\ \downarrow & \downarrow \\ \text{Class A} & \text{Class C} \end{array} \end{array} \end{array} \]

2. Random Forests

A Random Forest is an ensemble learning method that combines multiple decision trees to improve generalization and reduce overfitting.

2.1 How Random Forest Works

A Random Forest follows these steps:

  1. Bootstrapping: Select random subsets of training data (sampling with replacement).
  2. Feature Subset Selection: Each tree considers a random subset of features.
  3. Voting/Averaging:
    • Classification: The majority vote across trees determines the final class.
    • Regression: The average of predictions from all trees is taken.

2.2 Mathematical Formulation

Suppose we have \(B\) trees, each trained on different bootstrap samples. The final prediction for an input \(x\) depends on the type of problem:

For Classification (Majority Voting):

\[ \hat{y} = \text{mode} \{ T_1(x), T_2(x), ..., T_B(x) \} \] where \(T_b(x)\) is the prediction from the \(b\)-th tree.

For Regression (Averaging Predictions):

\[ \hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(x) \]

where:

  • \(\hat{y}\) is the final predicted value.
  • \(T_b(x)\) is the prediction from the \(b\)-th decision tree.
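
A minimal sketch of the aggregation step, using made-up predictions from B = 5 hypothetical trees:

# Classification: majority vote across five hypothetical tree predictions
tree_votes <- c("setosa", "versicolor", "versicolor", "versicolor", "virginica")
names(which.max(table(tree_votes)))

# Regression: average the numeric predictions of the same number of trees
tree_preds <- c(21.3, 20.8, 22.1, 21.0, 21.6)
mean(tree_preds)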


2.3 Advantages of Random Forests

  • Handles non-linearity: works well with complex patterns.
  • Reduces overfitting: combines multiple trees to improve generalization.
  • Feature importance: identifies the most significant variables.
  • Works with large datasets: efficient for large-scale problems.

\[ \begin{array}{c} \textbf{Dataset} \\ \downarrow \\ \text{Bootstrap Samples} \\ \begin{array}{ccc} \text{Tree 1} & \text{Tree 2} & \text{Tree B} \\ \downarrow & \downarrow & \downarrow \\ \text{Predictions} \\ \downarrow \\ \text{Majority Vote / Averaging} \end{array} \end{array} \]

6. Random Forest

Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the data and a random subset of features. It typically improves generalization performance compared to a single decision tree.

6.1 Train a Random Forest on iris

model_rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
model_rf
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.67%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         50          0         0        0.00
## versicolor      0         47         3        0.06
## virginica       0          4        46        0.08
  • ntree = 100: Number of trees in the forest.
  • importance = TRUE: Compute feature importance.

6.2 Feature Importance

importance(model_rf)
##                 setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length  3.596153  3.6267879  4.758109             5.558137
## Sepal.Width   2.882770  0.6114827  2.330913             2.792841
## Petal.Length  8.877766 14.0149004 13.072048            14.387083
## Petal.Width  10.693787 15.1471250 13.399962            16.699318
##              MeanDecreaseGini
## Sepal.Length        11.546338
## Sepal.Width          3.270738
## Petal.Length        40.353781
## Petal.Width         44.103010
varImpPlot(model_rf, main = "Feature Importance in Random Forest")

Feature Importance

Feature importance refers to calculating a score for each input feature of a machine learning model. This score reflects the feature’s contribution to the model’s predictive performance. A higher score indicates a greater influence on the model’s predictions.
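
To rank features programmatically (a minimal sketch reusing the importance matrix returned above):

# Sort features by their mean decrease in Gini impurity
imp <- importance(model_rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]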

6.3 Evaluation

pred_rf <- predict(model_rf, iris)
confusionMatrix(pred_rf, iris$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         50         0
##   virginica       0          0        50
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9757, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Random Forest often yields high accuracy with minimal tuning.
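
Note that the confusion matrix above was computed on the same data used for training, so perfect accuracy is expected; the out-of-bag (OOB) error reported earlier (4.67%) is a more honest estimate. For an explicit hold-out evaluation, a minimal sketch using a caret train/test split:

# 70/30 stratified split of iris
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

rf_holdout <- randomForest(Species ~ ., data = iris_train, ntree = 100)
confusionMatrix(predict(rf_holdout, iris_test), iris_test$Species)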



XGBoost (Extreme Gradient Boosting)

XGBoost is an optimized gradient boosting algorithm that builds trees sequentially, where each new tree corrects the errors of the previous trees. It is widely used in machine learning competitions and real-world applications due to its efficiency and accuracy.

How XGBoost Works

XGBoost follows an iterative approach to improving model performance:

  1. Initialize a Simple Model: Start with a weak learner (e.g., a single tree predicting a constant value).
  2. Compute Residuals: Compute the errors (residuals) between actual and predicted values.
  3. Fit a New Tree to Predict Residuals: Each new tree models the negative gradient of the loss function.
  4. Update the Model Iteratively: Add the new tree to the model and repeat the process until convergence.

3.2 Mathematical Formulation

Given a dataset \((X, y)\) with \(N\) observations, XGBoost minimizes the loss function:

\[ L(\theta) = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \]

where:

  • \(l(y_i, \hat{y}_i)\) is the loss function, such as:

    • Mean Squared Error (MSE) for regression: \[ l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2. \]
    • Log Loss for classification.
  • \(\Omega(f_k)\) is the regularization term to prevent overfitting, defined as:

    \[ \Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2 \]

    where:

    • \(T\) is the number of leaves in the tree.
    • \(w_j\) are the leaf weights.
    • \(\gamma\) and \(\lambda\) are regularization parameters.

Gradient Boosting Step

Each new tree \(f_k(x)\) predicts the negative gradient of the loss function:

\[ g_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}. \]

The updated prediction at iteration \(t+1\) is:

\[ \hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta f_t(x_i), \]

where \(\eta\) is the learning rate.
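
To make the update rule concrete, here is a minimal hand-rolled sketch of a single boosting step for squared-error loss on mtcars (for intuition only; this is not how xgboost is implemented internally):

# Step 0: initialize with a constant prediction (the mean of mpg)
yhat <- rep(mean(mtcars$mpg), nrow(mtcars))

# Step 1: residuals are (up to a constant) the negative gradient of squared-error loss
boost_df <- data.frame(res = mtcars$mpg - yhat, wt = mtcars$wt, hp = mtcars$hp)

# Step 2: fit a shallow tree to the residuals
tree_resid <- rpart(res ~ wt + hp, data = boost_df,
                    control = rpart.control(maxdepth = 2))

# Step 3: update the predictions with a learning rate eta
eta  <- 0.1
yhat <- yhat + eta * predict(tree_resid, boost_df)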

3.3 Advantages of XGBoost

  • Handles missing values: automatically deals with missing data.
  • Highly efficient: uses parallel computing and optimized algorithms.
  • Feature importance: identifies the most significant predictors.
  • Regularization: reduces overfitting using L1 (LASSO) and L2 (Ridge) penalties.

Note: XGBoost uses a sparsity-aware split-finding algorithm. When building decision trees, it evaluates candidate splits on the observed values and learns a default direction for observations with missing values, so missing data do not have to be imputed beforehand.

7. Gradient Boosting with XGBoost

Gradient boosting builds trees sequentially, with each new tree correcting the errors of the previous ensemble. XGBoost is a highly optimized library for gradient boosting.

7.1 Data Preparation

For demonstration, we will use the iris dataset (all three classes), but XGBoost typically works with numeric matrices.

# Encode Species as numeric (0,1,2)
iris_xgb <- iris
iris_xgb$Species <- as.numeric(iris_xgb$Species) - 1  # setosa=0, versicolor=1, virginica=2

# Prepare matrix for xgboost
train_matrix <- as.matrix(iris_xgb[, 1:4])
train_label  <- iris_xgb$Species

7.2 Train XGBoost

For multi-class classification, we use objective = "multi:softprob" and specify num_class = 3.

xgb_data <- xgb.DMatrix(data = train_matrix, label = train_label)

params <- list(
  booster = "gbtree",
  objective = "multi:softprob",
  eval_metric = "mlogloss",
  num_class = 3
)

model_xgb <- xgb.train(
  params = params,
  data = xgb_data,
  nrounds = 50,            # number of boosting rounds
  verbose = 0
)

# Check model
model_xgb
## ##### xgb.Booster
## raw: 126.6 Kb 
## call:
##   xgb.train(params = params, data = xgb_data, nrounds = 50, verbose = 0)
## params (as set within xgb.train):
##   booster = "gbtree", objective = "multi:softprob", eval_metric = "mlogloss", num_class = "3", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## # of features: 4 
## niter: 50
## nfeatures : 4

7.3 Predictions and Evaluation

pred_xgb <- predict(model_xgb, xgb_data)
# pred_xgb is a probability matrix with 3 columns

# Convert to class predictions
pred_xgb_matrix <- matrix(pred_xgb, ncol = 3, byrow = TRUE)
pred_class <- max.col(pred_xgb_matrix) - 1  # convert to 0,1,2

Compute the accuracy:

accuracy_xgb <- sum(pred_class == train_label) / nrow(iris_xgb)
accuracy_xgb
## [1] 1

XGBoost is often highly performant and can be tuned extensively via parameters like max_depth, eta, colsample_bytree, etc.
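
A minimal sketch of how those parameters might be explored with XGBoost's built-in cross-validation (the specific values are illustrative, not recommendations):

params_tuned <- list(
  booster = "gbtree",
  objective = "multi:softprob",
  eval_metric = "mlogloss",
  num_class = 3,
  max_depth = 3,           # maximum depth of each tree
  eta = 0.1,               # learning rate
  colsample_bytree = 0.8   # fraction of features sampled per tree
)

cv_xgb <- xgb.cv(
  params = params_tuned,
  data = xgb_data,
  nrounds = 100,
  nfold = 5,
  verbose = 0
)

head(cv_xgb$evaluation_log)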


8. Neural Networks

If you want to explore Neural Networks in R, there are two common approaches:

  • The nnet package for a basic feed-forward, single-hidden-layer neural network.
  • The keras package for deep learning (requires Python and TensorFlow).

8.1 Example with nnet

library(nnet)
# For iris classification
iris_nn <- iris
iris_nn$Species <- class.ind(iris_nn$Species)  # one-hot encoding

nn_model <- nnet(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = iris_nn,
                 size = 2,      # number of hidden units
                 rang = 0.1,
                 decay = 5e-4,
                 maxit = 200)
## # weights:  19
## initial  value 112.679059 
## iter  10 value 50.295223
## iter  20 value 50.136134
## iter  30 value 45.494950
## iter  40 value 4.756196
## iter  50 value 3.679492
## iter  60 value 3.411043
## iter  70 value 3.371026
## iter  80 value 3.359917
## iter  90 value 3.356568
## iter 100 value 3.352885
## iter 110 value 3.351186
## final  value 3.350599 
## converged

Note: class.ind() converts the three-level factor into a three-column indicator (one-hot) matrix, so nnet fits a network with three output units. If you prefer, you can skip the manual encoding and pass the factor Species directly; nnet's formula interface then builds the multi-class network (with a softmax output stage) for you, as sketched below.
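
A simpler alternative (a minimal sketch) is to keep Species as a factor and let nnet's formula interface build the multi-class network itself; prediction then returns class labels directly:

# nnet adds a softmax output layer automatically for a factor with 3 levels
nn_model2 <- nnet(Species ~ ., data = iris, size = 2, rang = 0.1,
                  decay = 5e-4, maxit = 200, trace = FALSE)

# Predicted classes and a quick confusion table (on the training data)
pred_nn <- predict(nn_model2, iris, type = "class")
table(Predicted = pred_nn, Actual = iris$Species)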