Supervised learning is used when you have variables you want to predict using other variables. This typically involves learning a mapping from input variables (predictors) \(x\) to an output variable (response) \(y\), such that:
\[ y = f(x) \]
Common examples of supervised learning include linear regression, logistic regression (classification), decision trees, and support vector machines.
In the simplest case of supervised learning, we have a set of observed inputs \(x_1, \dots, x_n\) together with their corresponding outputs \(y_1, \dots, y_n\). Our goal is to find a function \(f\) such that:
\[ y_i = f(x_i) \]
However, in practice, the observations often contain noise. We account for this by introducing a target vector \(t\) such that:
\[ t_i = y_i + \varepsilon_i = f(x_i) + \varepsilon_i \]
Here, \(y_i = f(x_i)\) is the true underlying value and \(\varepsilon_i\) is a random noise term that accounts for measurement error and other variation not captured by \(f\).
To define the mapping function \(f\), we introduce parameters \(\theta\), making \(f\) a function of both \(x\) and \(\theta\):
\[ f(x; \theta) = y(\theta) \]
Our goal is to find the parameters \(\theta\) that minimize the difference between the predicted value \(y(\theta)\) and the observed target value \(t\).
In other words, we want to minimize the error term \(\varepsilon\), often done by minimizing a loss function such as the mean squared error (MSE):
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i; \theta) - t_i)^2 \]
A smaller MSE means our function \(f\) is a better fit for the data.
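As a concrete illustration, here is a minimal sketch in R that computes the MSE of a candidate function on a few made-up numbers (the data and the candidate mapping below are our own, purely for illustration):
# Illustrative sketch: MSE of a candidate function f on toy data
mse <- function(predicted, target) mean((predicted - target)^2)
x_obs <- c(1, 2, 3, 4)
t_obs <- c(2.1, 3.9, 6.2, 8.1)   # noisy targets
f <- function(x) 2 * x           # candidate mapping f(x) = 2x
mse(f(x_obs), t_obs)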
We will now apply this theory to two kinds of models: linear regression, where the response variable is numerical, and classification, where the response variable is categorical. We’ll demonstrate both using example datasets in R.
Linear Regression is used when the output (or response) variable we
are trying to predict is numerical.
The relationship between the input variable \(x\) and the output variable \(y\) is modeled using a linear
function:
\[ f_\theta(x) = \theta_1 x + \theta_0 \]
Here, \(\theta_1\) is the slope of the line and \(\theta_0\) is the intercept.
Our objective is to find the values of \(\theta_1\) and \(\theta_0\) that minimize the Mean Squared Error (MSE), i.e., the average squared difference between the predicted values \(y(\theta)\) and the actual target values \(t\):
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - t_i)^2 \]
By minimizing the MSE, we ensure that the model’s predictions are as close as possible to the actual observed data.
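For simple linear regression this minimization even has a closed form. The following is a standard least-squares result, added here for completeness rather than taken from the original text: setting the derivatives of the MSE with respect to \(\theta_0\) and \(\theta_1\) to zero gives
\[ \theta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(t_i - \bar{t})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \theta_0 = \bar{t} - \theta_1 \bar{x} \]
where \(\bar{x}\) and \(\bar{t}\) are the means of the inputs and targets. In practice, R’s lm() function computes these estimates for us.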
We will use the built-in cars dataset in R, which contains two variables:
- speed: the speed of the car (mph)
- dist: the stopping distance (ft)
Our goal is to predict the stopping distance dist given the car’s speed speed.
# Magrittr Library for piping operations
library(magrittr)
# Visualization
library(ggplot2)
# Manipulating data frames
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load cars datasets
data(cars)
cars %>% head
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
Plotting a linear regression model on the dataset:
cars %>% ggplot(aes(x = speed, y = dist)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
The smooth line drawn through the data points is the regression line that best fits the data. It is based on the regression function:
\[ \hat{y} = \theta_1 x + \theta_0 \]
This function maps a given speed \(x\) to a corresponding predicted stopping distance \(\hat{y}\).
Behind the scenes, the lm()
function in R performs a
learning process that finds the optimal values for both
\(\theta_1\) (slope) and \(\theta_0\) (intercept). These parameters
are chosen such that the Mean Squared Error (MSE)
between the predicted distances and the actual distances is
minimized.
This means the line is positioned to make the overall prediction error as small as possible across all the data points.
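As a sanity check (a small sketch of our own, not part of the original analysis), we can compute the closed-form least-squares estimates directly and compare them with the coefficients that lm() reports:
# Closed-form least-squares estimates for dist ~ speed (illustrative check)
theta_1 <- with(cars, sum((speed - mean(speed)) * (dist - mean(dist))) /
                      sum((speed - mean(speed))^2))
theta_0 <- mean(cars$dist) - theta_1 * mean(cars$speed)
c(theta_0 = theta_0, theta_1 = theta_1)
# Coefficients found by lm() -- these should match
coef(lm(dist ~ speed, data = cars))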
Let us assume for now that the intercept \(\theta_0\) is equal to 0.
(We will explain the reasoning behind this assumption later.)
With this simplification, our regression function becomes:
\[ \hat{y} = \theta_1 x \]
This means the regression line will pass through the origin (0,0), and its shape will depend entirely on the value of the slope \(\theta_1\).
Let’s visualize how different values of \(\theta_1\) affect the fit of the regression
line on the cars
dataset.
# Regression function: given a vector of speeds and a value of theta_1, returns the predicted distances
predict_dist <- function(speed, theta_1)
data.frame(speed = speed, dist = theta_1 * speed, theta = as.factor(theta_1))
cars %>% ggplot(aes(x = speed, y = dist, colour = theta)) + geom_point(colour = "black") +
# Plotting the regression lines with different values of theta_1
geom_line(data = predict_dist(cars$speed, 2)) +
geom_line(data = predict_dist(cars$speed, 3)) +
geom_line(data = predict_dist(cars$speed, 4)) +
scale_color_discrete(name=expression(theta[1]))
We can see that for different values of \(\theta_1\) the lines all fit the dataset differently. The next thing to do is to find out which value of \(\theta_1\) minimizes the squared error \(E(\theta_1) = \sum_i (\theta_1 x_i - t_i)^2\) (this differs from the MSE only by the constant factor \(1/n\), so it has the same minimizer). Let’s try it ourselves with a range of values of \(\theta_1\).
# generate 50 evenly spaced candidate values of theta_1 between 0 and 5
thetas <- seq(0, 5, length.out = 50)
# function to calculate the summed squared error for each theta_1
fitting_error <- Vectorize(function(theta)
sum((theta * cars$speed - cars$dist)**2))
data.frame(thetas = thetas, errors = fitting_error(thetas)) %>%
ggplot(aes(x = thetas, y = errors)) +
geom_line() +
xlab(expression(theta[1])) + ylab("")
We can see that the \(\theta_1\) with
minimal MSE appears at the bottom of the curve.
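Rather than reading the minimum off the plot, we can also locate it directly. This is a small check of our own, not part of the original analysis: the grid value with the smallest error, and the exact minimizer obtained by setting the derivative of the squared error to zero.
# The candidate theta_1 with the smallest error on our grid
thetas[which.min(fitting_error(thetas))]
# The exact minimizer of sum((theta_1 * speed - dist)^2):
# setting the derivative to zero gives theta_1 = sum(speed * dist) / sum(speed^2)
sum(cars$speed * cars$dist) / sum(cars$speed^2)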
Suppose we want to use two different regression models on the same dataset and determine which one provides the best fit or is more accurate.
To do this, we compare their performance using a common metric — the
Mean Squared Error (MSE).
The model with the lower MSE is considered to have a
better fit.
Let’s look at an example:
line <- cars %>% lm(dist ~ speed, data=.)
poly <- cars %>% lm(dist ~ speed + I(speed^2), data = .)
# Error function: the square root of the summed squared errors (used here to compare the models)
rmse <- function(x,t) sqrt(mean(sum((t - x)^2)))
rmse(predict(line, cars), cars$dist)
## [1] 106.5529
rmse(predict(poly, cars), cars$dist)
## [1] 104.0419
Now, clearly the polynomial model fits the data slightly
better than the linear model — and theoretically, it
should.
However, there’s a bit of a cheat happening here: we
are evaluating how well the models perform on the same
data that was used to fit them.
This creates a problem.
A more complex model (like a higher-degree polynomial) will almost
always appear to perform better on the training data.
But that doesn’t necessarily mean it’s a better model — it could simply
be overfitting: capturing the random noise in the data
instead of the true underlying relationship.
What we really care about is how well the model generalizes — that is:
How well does the model perform on new, unseen data that it hasn’t already seen and used to fit its parameters?
To properly evaluate a model’s performance, we need to test
it on separate data.
This is where concepts like train-test splits and
cross-validation come into play.
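Train/test splits are used in the rest of this section. As a preview of cross-validation, here is a minimal sketch of 5-fold cross-validation on the cars data; the number of folds, the seed, and the variable names are our own choices for illustration, and rmse() is the helper defined earlier.
# Minimal 5-fold cross-validation sketch (illustrative, not part of the original analysis):
# every observation is held out exactly once, and we average the test errors.
set.seed(1)   # seed chosen only for reproducibility
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))
cv_errors <- sapply(1:k, function(i) {
  train <- cars[folds != i, ]
  test  <- cars[folds == i, ]
  fit <- lm(dist ~ speed, data = train)
  rmse(predict(fit, test), test$dist)
})
mean(cv_errors)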
To evaluate how well our models generalize to unseen data, we can split the dataset into two parts: a training set, which we use to fit the models, and a test set, which we use only to evaluate them.
In this example, the cars
dataset has 50 data
points.
We will use the first 25 points to train our models,
and the remaining 25 to test them.
training_data <- cars[1:25,]
test_data <- cars[26:50,]
line <- training_data %>% lm(dist ~ speed, data = .)
poly <- training_data %>% lm(dist ~ speed + I(speed^2), data = .)
# Now calculate the root mean square error on the test data after fitting the models on the training data only
rmse(predict(line, test_data), test_data$dist)
rmse(predict(poly, test_data), test_data$dist)
Even if the second-degree polynomial again performs better on this test set, we are still cheating, and here’s why:
The cars dataset is ordered by speed (and therefore, roughly, by distance), which means:
- the training set contains only the slower cars with short stopping distances, and
- the test set contains only the faster cars with long stopping distances.
So the training and test data are not similar. This introduces bias, because the model is evaluated on a range of speeds it never saw during training; we are really measuring extrapolation rather than ordinary generalization.
In general, we can’t always know if there is hidden structure in our dataset based on row order. In this case it’s easy to spot, but often it is more subtle.
To avoid this kind of bias, we should randomly sample the data when creating training and test sets. This removes structure based on row order and ensures both sets are representative of the overall data.
# Randomly assign each row to the training set (1) or the test set (0)
sampled_cars <- cars %>% mutate(training = sample(0:1, nrow(cars), replace = TRUE))
training_data <- sampled_cars %>% filter(training == 1)
test_data <- sampled_cars %>% filter(training == 0)
line <- training_data %>% lm(dist ~ speed, data = .)
poly <- training_data %>% lm(dist ~ speed + I(speed^2), data = .)
rmse(predict(line, test_data), test_data$dist)
## [1] 92.26793
rmse(predict(poly, test_data), test_data$dist)
## [1] 92.72817
With this random split the two models perform almost identically; in the run shown here the linear model actually has a slightly lower test error than the polynomial, so the polynomial’s apparent advantage largely disappears. Note that the exact numbers change from run to run because the split is random.
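Because a single random split is noisy, a more stable comparison averages the test errors over several splits. The following is a small sketch of our own, not part of the original analysis, reusing the rmse() helper defined above:
# Repeat the random train/test split and average the errors (illustrative sketch)
set.seed(42)   # seed chosen only for reproducibility
errors <- replicate(20, {
  idx <- sample(0:1, nrow(cars), replace = TRUE)
  train <- cars[idx == 1, ]
  test  <- cars[idx == 0, ]
  fit_line <- lm(dist ~ speed, data = train)
  fit_poly <- lm(dist ~ speed + I(speed^2), data = train)
  c(line = rmse(predict(fit_line, test), test$dist),
    poly = rmse(predict(fit_poly, test), test$dist))
})
rowMeans(errors)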
The model summary gives us insightful information about how well the model fits the data, including the estimated values of the parameters. For example, from the summary output below:
θ₀ (intercept) = -17.544
This value tells us the predicted distance when the speed is 0. It doesn’t give much practical insight here, since a negative stopping distance at a speed of 0 is not physically meaningful.
θ₁ (slope) = 3.815
This value is more meaningful. It tells us that for every one mph increase in speed, the predicted stopping distance increases by approximately 3.8 feet.
line %>% summary
##
## Call:
## lm(formula = dist ~ speed, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.194 -6.992 -2.677 4.109 32.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.544 8.853 -1.982 0.0614 .
## speed 3.815 0.530 7.197 5.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.28 on 20 degrees of freedom
## Multiple R-squared: 0.7214, Adjusted R-squared: 0.7075
## F-statistic: 51.8 on 1 and 20 DF, p-value: 5.739e-07
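As a quick check on the slope interpretation above (a one-line sketch of our own, not part of the original analysis), predictions for speeds one mph apart should differ by exactly the fitted slope \(\theta_1\):
# Difference in predicted distance between 10 mph and 11 mph equals the fitted slope
diff(predict(line, data.frame(speed = c(10, 11))))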
# Using the fitted models to predict the stopping distance for a new speed
# (note: speed = 31 lies outside the observed speed range of 4-25 mph, so this is an extrapolation)
new_data <- data.frame(speed = 31)
predicted_line_distance <- predict(line, newdata = new_data)
predicted_poly_distance <- predict(poly, newdata = new_data)
# View the results
predicted_line_distance
## 1
## 100.712
predicted_poly_distance
Classification is used when the target variable is categorical — that is, we want to classify observations into discrete categories.
For example, we can use the BreastCancer dataset to predict whether a patient’s tumor is benign or malignant.
We’ll use features such as cell thickness and cell size to make this prediction.
library(mlbench)
data("BreastCancer")
BreastCancer %>% head
## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025 5 1 1 1 2
## 2 1002945 5 4 4 5 7
## 3 1015425 3 1 1 1 2
## 4 1016277 6 8 8 1 3
## 5 1017023 4 1 1 3 2
## 6 1017122 8 10 10 8 7
## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 1 1 3 1 1 benign
## 2 10 3 2 1 benign
## 3 2 3 1 1 benign
## 4 4 3 7 1 benign
## 5 1 3 1 1 benign
## 6 10 9 7 1 malignant
# The Class column signifies either benign or malignant tumor
BreastCancer %>%
ggplot(aes(x = Cl.thickness, y = Class)) +
geom_jitter(height = 0.05, width = 0.3, alpha=0.4)
To visualize a classification model, we can fit a logistic regression with glm (using the binomial family) and plot the fitted curve on the dataset:
BreastCancer %>%
mutate(Cl.thickness.numeric = as.numeric(as.character(Cl.thickness))) %>%
mutate(IsMalignant = ifelse(Class == "benign", 0, 1)) %>%
ggplot(aes(x = Cl.thickness.numeric, y = IsMalignant)) +
geom_jitter(height = 0.05, width = 0.3, alpha=0.4) +
geom_smooth(method = "glm", method.args = list(family = "binomial"))
## `geom_smooth()` using formula = 'y ~ x'
For binary classification, we assume that the target values \(t_i\) are binary, typically encoded as 0 and 1. However, the input variables \(x_i\) can still be real-valued.
A common way to define the mapping function \(f(\cdot; \theta)\) in this case is to ensure it outputs values in the unit interval [0, 1], which we interpret as the probability that the target value is 1. We then classify an observation as:
- class 0 if \(f(x; \theta) < 0.5\)
- class 1 if \(f(x; \theta) > 0.5\)
(You may define a rule for the boundary case \(f(x; \theta) = 0.5\).)
In linear classification, a common mapping function is the logistic function (also known as the sigmoid function), defined as:
\[ f(x; \theta) = \sigma(\theta_1 x + \theta_0) \]
Where the sigmoid function is:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
This function maps any real number into the range (0, 1), making it ideal for representing probabilities.
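As a quick illustration, we can define the sigmoid ourselves and confirm that it squashes real numbers into (0, 1). The helper below is a sketch of our own, not part of the original code; this is the same function that glm() uses as the inverse link when family = "binomial" is specified.
# Illustrative sigmoid: maps any real number into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-5, -1, 0, 1, 5))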
formatted_data <- BreastCancer %>%
mutate(Cl.thickness.numeric = as.numeric(as.character(Cl.thickness)),
Cell.size.numeric = as.numeric(as.character(Cell.size))) %>%
mutate(IsMalignant = ifelse(Class == "benign", 0, 1))
# Fitting the data to the classification model
# (note: without family = "binomial", this glm() call fits an ordinary linear model,
#  i.e. a linear probability model, whose predictions are then thresholded at 0.5)
fitted_model <- formatted_data %>% glm(IsMalignant ~ Cl.thickness.numeric + Cell.size.numeric, data = .)
# Prediction rule: classify as malignant (1) when the predicted value is at least 0.5
classify <- function(probability) ifelse(probability < 0.5, 0, 1)
# Making predictions
classified_malignant <- predict(fitted_model, formatted_data) %>% classify
formatted_data %>% select(Cl.thickness.numeric, Cell.size.numeric, IsMalignant) %>% head
## Cl.thickness.numeric Cell.size.numeric IsMalignant
## 1 5 1 0
## 2 5 4 0
## 3 3 1 0
## 4 6 8 0
## 5 4 1 0
## 6 8 10 1
When dealing with classification problems, the Root Mean Square Error (RMSE) is not appropriate for evaluating model performance.
Instead of measuring the distance between predicted and actual values (as we do in regression), in classification we care about how many predictions are correct and how many are incorrect.
A confusion matrix shows how well a classification model performed on a dataset where we know the true outcomes.
It provides a breakdown of correct and incorrect predictions by comparing the predicted classes with the actual classes.
Here the actual classes are stored in formatted_data$IsMalignant and the predicted classes in classified_malignant. Tabulating one against the other lets us see how many cases of class 0 and class 1 were predicted correctly or incorrectly.
print("Predict classification")
## [1] "Predict classification"
classified_malignant %>% table
## .
## 0 1
## 492 207
print("Real data classification")
## [1] "Real data classification"
formatted_data$IsMalignant %>% table
## .
## 0 1
## 458 241
print("Confusion matrix")
## [1] "Confusion matrix"
table(formatted_data$IsMalignant, classified_malignant)
## classified_malignant
## 0 1
## 0 450 8
## 1 42 199
Total correct predictions = TN + TP = 450 + 199 = 649
Total predictions = TN + TP + FP + FN = 450 + 199 + 8 + 42 = 699
Accuracy = total correct predictions ÷ total predictions = 649 ÷ 699 ≈ 0.9285, or about 92.8%.
confusion_matrix <- table(formatted_data$Class, classified_malignant, dnn=c("Data", "Predictions"))
(accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix))
## [1] 0.9284692
Before training a classification model, it’s important to understand
the distribution of the classes in your dataset. In this case, we are
looking at the distribution of benign and
malignant tumors in the BreastCancer
dataset.
# Table of class frequencies
tbl <- table(BreastCancer$Class)
# Proportion of benign tumors
tbl["benign"] / sum(tbl)
## benign
## 0.6552217
# Proportion of malignant tumors
tbl["malignant"] / sum(tbl)
## malignant
## 0.3447783
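These proportions also give a useful baseline (our own comparison, not part of the original analysis): a classifier that always predicts the majority class, benign, would already achieve roughly 66% accuracy, so the model’s 92.8% accuracy should be judged against that baseline.
# Accuracy of a trivial classifier that always predicts the majority class
max(tbl) / sum(tbl)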
While high accuracy is desirable, accuracy alone isn’t enough—especially in clinical settings where the consequences of misclassification can vary greatly.
For instance, misclassifying:
- a benign tumor as malignant (a false positive) might cause unnecessary stress or procedures, while
- a malignant tumor as benign (a false negative) could result in delayed treatment and serious harm.
Because of this, we often use Sensitivity and Specificity to evaluate classifiers more thoroughly.
These metrics help provide a more nuanced evaluation of model performance, especially in high-stakes domains like healthcare.
# Specificity: the proportion of actual benign cases (first row of the confusion matrix) classified correctly
specificity <- confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[1,2])
print("specificity")
## [1] "specificity"
specificity
## [1] 0.9825328
# Sensitivity: the proportion of actual malignant cases (second row of the confusion matrix) classified correctly
sensitivity <- confusion_matrix[2,2] / (confusion_matrix[2,1] + confusion_matrix[2,2])
print("sensitivity")
## [1] "sensitivity"
sensitivity
## [1] 0.8257261
In medical contexts like cancer detection, both sensitivity (how reliably the model detects malignant tumors) and specificity (how reliably it confirms benign tumors) play crucial roles.
This model has excellent specificity, meaning it is
very good at confirming benign tumors.
However, its sensitivity is somewhat lower, meaning
it’s not as strong at detecting malignant tumors.
⚠️ In critical healthcare scenarios, improving sensitivity may be prioritized—even if it slightly reduces specificity.
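To make this tradeoff concrete, here is a small sketch of our own, reusing fitted_model and formatted_data from above; the helper name sens_spec and the thresholds are arbitrary choices for illustration. Lowering the classification threshold increases sensitivity at the cost of specificity, and raising it does the opposite.
# Sensitivity and specificity at different classification thresholds (illustrative)
sens_spec <- function(threshold) {
  predicted <- factor(ifelse(predict(fitted_model, formatted_data) >= threshold, 1, 0),
                      levels = c(0, 1))
  cm <- table(actual = formatted_data$IsMalignant, predicted = predicted)
  c(threshold   = threshold,
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}
t(sapply(c(0.3, 0.5, 0.7), sens_spec))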
When it comes to spam detection, where spam is the positive class, there’s a crucial tradeoff between sensitivity and specificity: tuning for higher sensitivity catches more spam but risks flagging legitimate emails, while tuning for higher specificity keeps legitimate mail safe but lets more spam through.
Most commercial email providers tune their filters to favor specificity slightly over sensitivity, since the cost of missing legitimate emails—especially business-critical ones—is typically greater than the annoyance of occasional spam.