Supervised Learning

Supervised learning is used when you have variables you want to predict using other variables. This typically involves learning a mapping from input variables (predictors) x to an output variable (response) y, such that:

\[ y = f(x) \]

Common examples of supervised learning include linear regression, logistic regression (classification), decision trees, and support vector machines.

In the simplest case of supervised learning, we have a set of observed inputs and their corresponding observed outputs, and we aim to determine a function \(f\) such that:

\[ y = f(x) \]

Let:

  • \(x = (x_1, \ldots, x_n)\) be the observed input values (predictors)
  • \(t = (t_1, \ldots, t_n)\) be the corresponding observed target values (responses)

Our goal is to find a function \(f\) such that:

\[ y_i = f(x_i) \]

However, in practice, the observations often contain noise. We account for this by modelling each target value as the function value plus an error term:

\[ t_i = y_i + \varepsilon_i = f(x_i) + \varepsilon_i \]

Where \(\varepsilon_i\) is the noise (error) term for the \(i\)-th observation: the part of the target that \(f(x_i)\) does not explain.

Parameterizing the Function

To define the mapping function \(f\), we introduce parameters \(\theta\), making \(f\) a function of both \(x\) and \(\theta\):

\[ f(x; \theta) = y(\theta) \]

Our goal is to find the parameters \(\theta\) that minimize the difference between the predicted value \(y(\theta)\) and the observed target value \(t\).

In other words, we want the predictions \(f(x_i; \theta)\) to be as close as possible to the targets \(t_i\). This is typically done by minimizing a loss function such as the mean squared error (MSE):

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i; \theta) - t_i)^2 \]

A smaller MSE means our function \(f\) is a better fit for the data.
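
As a minimal illustration (the helper name mse and the example numbers are made up for this sketch), the MSE is a one-liner in R:

# MSE between a vector of predictions and a vector of targets
mse <- function(predicted, target) mean((predicted - target)^2)

# Example with made-up numbers:
mse(c(1.1, 1.9, 3.2), c(1, 2, 3))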


Next Steps

We will now apply this theory to two models:

  1. Linear Regression: where the response variable is continuous
  2. Logistic Regression: where the response variable is categorical (typically binary)

We’ll demonstrate both using example datasets in R.

Linear Regression

Linear Regression is used when the output (or response) variable we are trying to predict is numerical.
The relationship between the input variable \(x\) and the output variable \(y\) is modeled using a linear function:

\[ f_\theta(x) = \theta_1 x + \theta_0 \]

Here:

  • \(\theta_1\) is the slope of the line (also called the weight or coefficient)
  • \(\theta_0\) is the intercept (the value of \(y\) when \(x = 0\))

Our objective is to find the values of \(\theta_1\) and \(\theta_0\) that minimize the Mean Squared Error (MSE), i.e., the average squared difference between the predicted values \(y(\theta)\) and the actual target values \(t\):

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - t_i)^2 \]

By minimizing the MSE, we ensure that the model’s predictions are as close as possible to the actual observed data.

Example: Predicting Car Stopping Distance using Speed

We will use the built-in cars dataset in R, which contains two variables:

  • speed: Speed of the car (mph)
  • dist: Stopping distance (ft)

Our goal is to predict the stopping distance dist given the car’s speed speed.

# Magrittr Library for piping operations
library(magrittr)

# Visualization
library(ggplot2)

# Manipulating data frames
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load cars datasets
data(cars)

cars %>% head
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Plotting a Linear Regression Model on the Dataset

cars %>% ggplot(aes(x = speed, y = dist)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Understanding the Regression Line

The line drawn through the points (blue by default in geom_smooth()) is the regression line that best fits the data. It is based on the regression function:

\[ \hat{y} = \theta_1 x + \theta_0 \]

This function maps a given speed \(x\) to a corresponding predicted stopping distance \(\hat{y}\).

Behind the scenes, the lm() function in R performs a learning process that finds the optimal values for both \(\theta_1\) (slope) and \(\theta_0\) (intercept). These parameters are chosen such that the Mean Squared Error (MSE) between the predicted distances and the actual distances is minimized.

This means the line is positioned to make the overall prediction error as small as possible across all the data points.

Exploring the Effect of Slope (\(\theta_1\)) When Intercept is Zero

Let us assume for now that the intercept \(\theta_0\) is equal to 0.
(We will explain the reasoning behind this assumption later.)

With this simplification, our regression function becomes:

\[ \hat{y} = \theta_1 x \]

This means the regression line will pass through the origin (0,0), and its shape will depend entirely on the value of the slope \(\theta_1\).

Let’s visualize how different values of \(\theta_1\) affect the fit of the regression line on the cars dataset.

# This defines our regression function: it takes speed and theta_1 as arguments
# and returns the predicted distances for a line through the origin
predict_dist <- function(speed, theta_1)
  data.frame(speed = speed, dist = theta_1 * speed, theta = as.factor(theta_1))

# Plot the data points together with regression lines for different values of theta_1
cars %>% ggplot(aes(x = speed, y = dist, colour = theta)) +
  geom_point(colour = "black") +
  geom_line(data = predict_dist(cars$speed, 2)) +
  geom_line(data = predict_dist(cars$speed, 3)) +
  geom_line(data = predict_dist(cars$speed, 4)) +
  scale_color_discrete(name = expression(theta[1]))

We can see that different values of \(\theta_1\) fit the dataset differently. The next step is to find the value of \(\theta_1\) that minimises the squared error \(E(\theta_1) = \sum_{i=1}^{n} (\theta_1 x_i - t_i)^2\). Let's try this ourselves by evaluating the error over a grid of \(\theta_1\) values.

# 50 evenly spaced values of theta_1 between 0 and 5
thetas <- seq(0, 5, length.out = 50)

# Function to calculate the sum of squared errors for each theta_1
fitting_error <- Vectorize(function(theta)
  sum((theta * cars$speed - cars$dist)**2))

data.frame(thetas = thetas, errors = fitting_error(thetas)) %>%
ggplot(aes(x = thetas, y = errors)) +
geom_line() +
xlab(expression(theta[1])) + ylab("")

We can see that the value of \(\theta_1\) with the smallest error sits at the bottom of the curve.
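
We can read an approximate minimiser off the grid we just evaluated, or compute the exact least-squares solution for a through-origin line in closed form. The sketch below reuses the thetas and fitting_error objects defined above:

# Approximate minimiser from the grid search above
thetas[which.min(fitting_error(thetas))]

# Exact closed-form solution for a line through the origin:
# setting dE/d(theta_1) = 0 gives theta_1 = sum(x_i * t_i) / sum(x_i^2)
sum(cars$speed * cars$dist) / sum(cars$speed^2)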

Model Validation

Suppose we want to use two different regression models on the same dataset and determine which one provides the best fit or is more accurate.

To do this, we compare their performance using a common metric — the Mean Squared Error (MSE).
The model with the lower MSE is considered to have a better fit.

Let’s look at an example:

line <- cars %>% lm(dist ~ speed, data=.)
poly <- cars %>% lm(dist ~ speed + I(speed^2), data = .)
# Error function: note that mean(sum(...)) collapses to sum(...), so this is the
# square root of the total squared error rather than a true RMSE. Since we only
# use it to compare models on the same data, the ranking is unaffected.
rmse <- function(x,t) sqrt(mean(sum((t - x)^2)))
rmse(predict(line, cars), cars$dist)
## [1] 106.5529
rmse(predict(poly, cars), cars$dist)
## [1] 104.0419

The polynomial model fits the data slightly better than the linear model, and in fact it must: the linear model is a special case of the polynomial (with the coefficient of speed² set to zero), so the more flexible model can never have a larger training error.
However, there’s a bit of a cheat happening here: we are evaluating how well the models perform on the same data that was used to fit them.

This creates a problem.

A more complex model (like a higher-degree polynomial) will almost always appear to perform better on the training data.
But that doesn’t necessarily mean it’s a better model — it could simply be overfitting: capturing the random noise in the data instead of the true underlying relationship.

What we really care about is how well the model generalizes — that is:

How well does the model perform on new, unseen data that it hasn’t already seen and used to fit its parameters?

To properly evaluate a model’s performance, we need to test it on separate data.
This is where concepts like train-test splits and cross-validation come into play.

Train-Test Split

To evaluate how well our models generalize to unseen data, we can split the dataset into two parts:

  • Training set: used to fit the model (learn the parameters)
  • Testing set: used to evaluate the model’s performance on unseen data

In this example, the cars dataset has 50 data points.
We will use the first 25 points to train our models, and the remaining 25 to test them.

training_data <- cars[1:25,]
test_data <- cars[26:50,]

line <- training_data %>% lm(dist ~ speed, data=.)
poly <- training_data %>% lm(dist ~ speed + I(speed^2), data=.)

# Fit the models on the training data, then calculate the root mean square error on the test data.
rmse(predict(line, test_data), test_data$dist)
## [1] 83.43421
rmse(predict(poly, test_data), test_data$dist)
## [1] 80.64634

Importance of Random Sampling in Train-Test Split

Even though the second-degree polynomial still performs better on the test set, we are still cheating — and here’s why:

The cars dataset is ordered by speed (and stopping distance increases with speed), which means:

  • The training set (first 25 rows) contains mostly low speeds and short distances
  • The test set (last 25 rows) contains mostly high speeds and long distances

So the training and test data are not similar: the model is evaluated almost entirely on a range of speeds it never saw during training, which biases the comparison.

In general, we can’t always know if there is hidden structure in our dataset based on row order. In this case it’s easy to spot, but often it is more subtle.

A Better Approach

To avoid this kind of bias, we should randomly sample the data when creating training and test sets. This removes structure based on row order and ensures both sets are representative of the overall data.

sampled_cars <- cars %>% mutate(training = sample(0:1, nrow(cars), replace = TRUE))
training_data <- sampled_cars %>% filter(training == 1)
test_data <- sampled_cars %>% filter(training == 0)
line <- training_data %>% lm(dist ~ speed, data = .)
poly <- training_data %>% lm(dist ~ speed + I(speed^2), data = .)

rmse(predict(line, test_data), test_data$dist)
## [1] 92.26793
rmse(predict(poly, test_data), test_data$dist)
## [1] 92.72817

On this particular random split the two models perform almost identically, and the linear model's error is in fact marginally lower, so the extra flexibility of the polynomial does not clearly pay off on unseen data. Because the split is random, these numbers will vary from run to run; repeating the split many times, or using cross-validation, gives a more reliable comparison.
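
As a sketch of that idea (reusing the rmse helper defined earlier; the number of repetitions is an arbitrary choice), we can repeat the random train/test split many times and average the test errors for each model:

# Repeat the random train/test split many times and average the test errors
set.seed(1)
compare_models <- function() {
  idx <- sample(nrow(cars), size = nrow(cars) / 2)  # random half for training
  train <- cars[idx, ]
  test  <- cars[-idx, ]
  line <- lm(dist ~ speed, data = train)
  poly <- lm(dist ~ speed + I(speed^2), data = train)
  c(line = rmse(predict(line, test), test$dist),
    poly = rmse(predict(poly, test), test$dist))
}

# Average error over 100 random splits for each model
rowMeans(replicate(100, compare_models()))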

Interpreting the Model Summary

The model summary gives us insightful information about how well the model fits the data, including the values of the parameters:

  • θ₀ (intercept)
  • θ₁ (slope)

For example, from the summary output below:

  • θ₀ (intercept) ≈ -17.54
    This is the predicted distance when the speed is 0. It doesn't give us much practical insight, since a stationary car having a negative stopping distance is not realistic; the intercept mainly positions the line.

  • θ₁ (slope) ≈ 3.82
    This value is more meaningful. It tells us that for every 1 mph increase in speed, the predicted stopping distance increases by approximately 3.8 ft.

Because the model was fitted on a random training sample, these estimates will change slightly from run to run.

line %>% summary
## 
## Call:
## lm(formula = dist ~ speed, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.194  -6.992  -2.677   4.109  32.879 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.544      8.853  -1.982   0.0614 .  
## speed          3.815      0.530   7.197 5.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.28 on 20 degrees of freedom
## Multiple R-squared:  0.7214, Adjusted R-squared:  0.7075 
## F-statistic:  51.8 on 1 and 20 DF,  p-value: 5.739e-07
# Using the fitted models to make predictions for a new speed
# (note: 31 mph is above the largest speed in the data, so this is an extrapolation)
new_data <- data.frame(speed = 31)
predicted_line_distance <- predict(line, newdata = new_data)
predicted_poly_distance <- predict(poly, newdata = new_data)

# View the result for the linear model
predicted_line_distance
##       1 
## 100.712
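
The predicted value can be reproduced by hand from the fitted coefficients, which is a useful sanity check (shown here for the line model; the exact numbers depend on the random training sample used above):

# Reproduce the linear-model prediction manually from the fitted coefficients
theta_0 <- coef(line)["(Intercept)"]
theta_1 <- coef(line)["speed"]
theta_0 + theta_1 * 31   # should match predict(line, newdata = new_data)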

Classification

Classification is used when the target variable is categorical — that is, we want to classify observations into discrete categories.

For example, we can use the BreastCancer dataset to predict whether a patient’s tumor is:

  • Malignant (cancerous)
  • Benign (non-cancerous)

We’ll use features such as cell size and cell thickness to make this prediction.

library(mlbench)
data("BreastCancer")
BreastCancer %>% head
##        Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025            5         1          1             1            2
## 2 1002945            5         4          4             5            7
## 3 1015425            3         1          1             1            2
## 4 1016277            6         8          8             1            3
## 5 1017023            4         1          1             3            2
## 6 1017122            8        10         10             8            7
##   Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           1           3               1       1    benign
## 2          10           3               2       1    benign
## 3           2           3               1       1    benign
## 4           4           3               7       1    benign
## 5           1           3               1       1    benign
## 6          10           9               7       1 malignant
# The Class column signifies either benign or malignant tumor
BreastCancer %>%
ggplot(aes(x = Cl.thickness, y = Class)) +
geom_jitter(height = 0.05, width = 0.3, alpha=0.4)

To plot a logistic regression fit on the dataset, we let geom_smooth() call glm() with a binomial family:

BreastCancer %>%
mutate(Cl.thickness.numeric = as.numeric(as.character(Cl.thickness))) %>%
mutate(IsMalignant = ifelse(Class == "benign", 0, 1)) %>%
ggplot(aes(x = Cl.thickness.numeric, y = IsMalignant)) +  
geom_jitter(height = 0.05, width = 0.3, alpha=0.4) +
geom_smooth(method = "glm", method.args = list(family = "binomial"))  
## `geom_smooth()` using formula = 'y ~ x'

Binary Classification Example

For binary classification, we assume that the target values tᵢ are binary, typically encoded as 0 and 1. However, the input variables xᵢ can still be real-valued.

A common way to define the mapping function f(·; θ) in this case is to ensure it outputs values in the unit interval [0, 1], which we interpret as the probability that the target value is 1.

Prediction Rule:

  • Predict 0 if f(x; θ) < 0.5
  • Predict 1 if f(x; θ) > 0.5

(You may assign the boundary case f(x; θ) = 0.5 to either class by convention.)

Logistic (Sigmoid) Function:

In linear classification, a common mapping function is the logistic function (also known as the sigmoid function), defined as:

\[ f(x; \theta) = \sigma(\theta_1 x + \theta_0) \]

Where the sigmoid function is:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

This function maps any real number into the range (0, 1), making it ideal for representing probabilities.
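
A minimal sketch of the sigmoid in R (purely illustrative; the function name is our own) makes its S-shape and (0, 1) range easy to see:

# The logistic (sigmoid) function maps any real z into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(c(-10, -1, 0, 1, 10))  # approaches 0, crosses 0.5 at z = 0, approaches 1
curve(sigmoid, from = -6, to = 6, xlab = "z", ylab = "sigma(z)")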

formatted_data <- BreastCancer %>%
mutate(Cl.thickness.numeric = as.numeric(as.character(Cl.thickness)),
Cell.size.numeric = as.numeric(as.character(Cell.size))) %>%
mutate(IsMalignant = ifelse(Class == "benign", 0, 1))

# Fitting the data to the classification model.
# Note: without a family argument, glm() fits an ordinary linear model to the 0/1
# outcome; pass family = "binomial" (and use type = "response" in predict()) for a
# true logistic regression as described above.
fitted_model <- formatted_data %>% glm(IsMalignant ~ Cl.thickness.numeric + Cell.size.numeric, data = .)
# Prediction rule
classify <- function(probability) ifelse(probability < 0.5, 0, 1)

# Making Prediction
classified_malignant <- predict(fitted_model, formatted_data) %>% classify

formatted_data %>% select(Cl.thickness.numeric, Cell.size.numeric, IsMalignant) %>% head
##   Cl.thickness.numeric Cell.size.numeric IsMalignant
## 1                    5                 1           0
## 2                    5                 4           0
## 3                    3                 1           0
## 4                    6                 8           0
## 5                    4                 1           0
## 6                    8                10           1

Evaluating Classification Models

When dealing with classification problems, the Root Mean Square Error (RMSE) is not appropriate for evaluating model performance.

Instead of measuring the distance between predicted and actual values (as we do in regression), in classification we care about how many predictions are:

  • Correctly classified
  • Incorrectly classified

Common Evaluation Metrics for Classification:

  • Accuracy: Proportion of total predictions that are correct.
  • Confusion Matrix: A table that shows the number of true positives, false positives, true negatives, and false negatives.
  • Precision and Recall: Useful when dealing with imbalanced datasets.
  • F1 Score: Harmonic mean of precision and recall (see the sketch after this list).
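
Given a 2×2 confusion matrix with rows = actual (0, 1) and columns = predicted (0, 1), all of these metrics can be computed with a few lines of R. A minimal sketch (the function name is our own):

# Compute accuracy, precision, recall and F1 from a 2x2 confusion matrix
# (rows = actual 0/1, columns = predicted 0/1)
classification_metrics <- function(cm) {
  TN <- cm[1, 1]; FP <- cm[1, 2]
  FN <- cm[2, 1]; TP <- cm[2, 2]
  accuracy  <- (TP + TN) / sum(cm)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  f1        <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}

# e.g. classification_metrics(table(actual, predicted))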

Confusion Matrix

A confusion matrix shows how well a classification model performed on a dataset where we know the true outcomes.

It provides a breakdown of correct and incorrect predictions by comparing the predicted classes with the actual classes.

Structure of the Confusion Matrix:

  • Rows: Represent the actual values from formatted_data$IsMalignant
  • Columns: Represent the predicted values from classified_malignant

This allows us to see how many cases of class 0 and class 1 were predicted correctly or incorrectly.

print("Predict classification")
## [1] "Predict classification"
classified_malignant %>% table
## .
##   0   1 
## 492 207
print("Real data classification")
## [1] "Real data classification"
formatted_data$IsMalignant %>% table
## .
##   0   1 
## 458 241
print("Confusion matrix")
## [1] "Confusion matrix"
table(formatted_data$IsMalignant, classified_malignant)
##    classified_malignant
##       0   1
##   0 450   8
##   1  42 199

Confusion Matrix Breakdown

True Negatives (450): Top-left cell

  • True status: Benign (0)
  • Prediction: Benign (0)
  • The model correctly identified 450 benign tumors.

False Positives (8): Top-right cell

  • True status: Benign (0)
  • Prediction: Malignant (1)
  • The model incorrectly flagged 8 benign tumors as malignant.
  • Also called “Type I error”.

False Negatives (42): Bottom-left cell

  • True status: Malignant (1)
  • Prediction: Benign (0)
  • The model missed 42 malignant tumors, incorrectly labeling them as benign.
  • Also called “Type II error”, and is typically more serious in medical contexts.

True Positives (199): Bottom-right cell

  • True status: Malignant (1)
  • Prediction: Malignant (1)
  • The model correctly identified 199 malignant tumors.

Accuracy Calculation

  • True Negatives (TN) = 450
  • True Positives (TP) = 199
  • False Positives (FP) = 8
  • False Negatives (FN) = 42

Step-by-step:

  1. Total correct predictions = TN + TP
    = 450 + 199
    = 649

  2. Total predictions = TN + TP + FP + FN
    = 450 + 199 + 8 + 42
    = 699

  3. Accuracy = Total correct predictions ÷ Total predictions
    = 649 ÷ 699
    = 0.9284…

  4. Convert to percentage:
    = 0.9284… × 100
    = 92.84%

confusion_matrix <- table(formatted_data$Class, classified_malignant, dnn=c("Data", "Predictions"))
(accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix))
## [1] 0.9284692

Class Distribution in the BreastCancer Dataset

Before training a classification model, it’s important to understand the distribution of the classes in your dataset. In this case, we are looking at the distribution of benign and malignant tumors in the BreastCancer dataset.

# Table of class frequencies
tbl <- table(BreastCancer$Class)

# Proportion of benign tumors
tbl["benign"] / sum(tbl)
##    benign 
## 0.6552217
# Proportion of malignant tumors
tbl["malignant"] / sum(tbl)
## malignant 
## 0.3447783
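
This class balance gives us a useful baseline: a model that always predicted the majority class ("benign") would already be right about 65.5% of the time, so a useful classifier must do noticeably better than that. A quick way to compute this baseline accuracy:

# Baseline accuracy: always predict the majority class
max(tbl) / sum(tbl)
## [1] 0.6552217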

Sensitivity and Specificity

While high accuracy is desirable, accuracy alone isn’t enough—especially in clinical settings where the consequences of misclassification can vary greatly.

For instance, misclassifying:

  • A benign tumor as malignant (a false positive) might cause unnecessary stress or procedures.
  • A malignant tumor as benign (a false negative) could result in delayed treatment and serious harm.

Because of this, we often use Sensitivity and Specificity to evaluate classifiers more thoroughly.


Specificity

  • Definition: Measures how well the model identifies negative cases.
  • In the context of breast cancer:
    • Specificity tells us how often the model correctly predicts a benign tumor when it truly is benign.
  • Formula:
    \[ \text{Specificity} = \frac{TN}{TN + FP} \]

Sensitivity (also called Recall)

  • Definition: Measures how well the model identifies positive cases.
  • In the context of breast cancer:
    • Sensitivity tells us how often the model correctly predicts a malignant tumor when it truly is malignant.
  • Formula:
    \[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

These metrics help provide a more nuanced evaluation of model performance, especially in high-stakes domains like healthcare.

# Specificity
specificity <- confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[1,2])
print("specificity")
## [1] "specificity"
specificity
## [1] 0.9825328
# Sensitivity
sensitivity <- confusion_matrix[2,2]/(confusion_matrix[2,1]+confusion_matrix[2,2])
print("sensitivity")
## [1] "sensitivity"
sensitivity
## [1] 0.8257261

Why These Metrics Matter

In medical contexts like cancer detection, both sensitivity and specificity play crucial roles:

  • High Sensitivity is critical because:
    • Missing a malignant tumor (a false negative) could be life-threatening.
    • The goal is to catch as many true cancer cases as possible.
  • High Specificity is also important because:
    • Falsely diagnosing a benign tumor as malignant (a false positive) can lead to unnecessary treatments, costs, and patient anxiety.

Model Performance

  • Specificity: 98.25% ✅
  • Sensitivity: 82.57% ⚠️

This model has excellent specificity, meaning it is very good at confirming benign tumors.
However, its sensitivity is somewhat lower, meaning it’s not as strong at detecting malignant tumors.

⚠️ In critical healthcare scenarios, improving sensitivity may be prioritized—even if it slightly reduces specificity.
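
One common lever for doing so (a sketch, not part of the analysis above) is the decision threshold: lowering it below 0.5 makes the model flag more tumors as malignant, which typically raises sensitivity at the cost of some specificity. Reusing fitted_model and formatted_data from earlier, with an arbitrarily chosen threshold of 0.3:

# Sketch: trade specificity for sensitivity by lowering the decision threshold
classify_with_threshold <- function(score, threshold) ifelse(score < threshold, 0, 1)

scores <- predict(fitted_model, formatted_data)
flagged <- classify_with_threshold(scores, 0.3)   # more aggressive than 0.5

cm <- table(formatted_data$IsMalignant, flagged)
cm[2, 2] / sum(cm[2, ])   # sensitivity
cm[1, 1] / sum(cm[1, ])   # specificity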

Email Spam Detectors: Balancing Sensitivity and Specificity

When it comes to spam detection, there’s a crucial tradeoff between sensitivity and specificity:


🔍 For Spam Detection

Sensitivity (True Positive Rate)
  • The ability to correctly identify actual spam emails.
  • High sensitivity: Most spam is caught ✅
  • Low sensitivity: Spam slips through into the inbox ❌
Specificity (True Negative Rate)
  • The ability to correctly identify legitimate (non-spam) emails.
  • High specificity: Legitimate emails rarely get marked as spam ✅
  • Low specificity: Legitimate emails end up in the spam folder ❌

📌 The Practical Implications

High Sensitivity + Low Specificity
  • ✅ Most spam is filtered
  • ❌ Many legitimate emails are incorrectly marked as spam
  • ❗ Users must frequently check their spam folder
  • ⚠️ Can lead to “false alarm fatigue”, where users ignore the spam folder entirely
High Specificity + Low Sensitivity
  • ✅ Legitimate emails usually reach the inbox
  • ❌ More spam gets through to the inbox
  • ⚠️ Inbox becomes cluttered and potentially risky

🧠 Real-World Considerations

  • Cost asymmetry:
    • Missing a legitimate email (false positive) is usually worse than receiving a spam email (false negative)
  • User preferences:
    • Some users tolerate more spam to avoid missing important emails
  • Adaptive systems:
    • Modern filters often learn from user behavior and can be customized per user
  • Commercial context:
    • Specificity is prioritized in business settings to avoid losing critical client communication

⚖️ Common Strategy

Most commercial email providers tune their filters to favor specificity slightly over sensitivity, since the cost of missing legitimate emails—especially business-critical ones—is typically greater than the annoyance of occasional spam.