15th March, 2025

Introduction

For this homework, I decided to talk about Linear Regression as it is one of the most fundamental and useful concepts that one can learn in the field of machine learning. It serves as a stepping stone to much more complicated algorithms and models. A person with a strong grasp of Linear Regression will have a much easier time learning the more complicated stuff!

So what does a linear regression equation look like? It looks something like this:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

where \(Y\) is the dependent variable, \(X\) is the independent variable, \(\beta_0\) is the intercept, \(\beta_1\) is the slope and \(\epsilon\) is the error term. Fun fact: the error term accounts for factors that are not included in the model but still influence the dependent variable. Its estimated values, the residuals, help us assess the goodness of fit of the model. If the above representation seems complicated to you, then have a look at this!

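\[ Y = \alpha + \beta X \] In essence, both equations are the same: \(\alpha\) plays the role of \(\beta_0\) and \(\beta\) plays the role of \(\beta_1\). The only difference is that the latter one does not contain the error term.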
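To make the role of each symbol concrete, here is a minimal sketch that simulates data from the first equation and then recovers the intercept and slope with lm(). The values chosen for beta0, beta1, the sample size and the noise level are made up for illustration and do not come from any real dataset.

```r
# Sketch: simulate Y = beta0 + beta1 * X + epsilon, then estimate the line.
# All numbers below are hypothetical choices for illustration.
set.seed(42)                            # make the random draws reproducible

n     <- 100                            # number of simulated observations
beta0 <- 2                              # "true" intercept (assumed)
beta1 <- 3                              # "true" slope (assumed)

x       <- runif(n, min = 0, max = 10)  # independent variable
epsilon <- rnorm(n, mean = 0, sd = 1)   # error term: everything the line misses
y       <- beta0 + beta1 * x + epsilon  # dependent variable

# Fit the regression; the estimates should land close to beta0 and beta1
fit_sim <- lm(y ~ x)
print(coef(fit_sim))
```

Because of the error term, the estimates will be close to 2 and 3 but will generally not match them exactly.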

Example Data and Model (Using the iris dataset for this presentation)

library(ggplot2)

# Use iris dataset 
data <- iris

# Fit model 
model <- lm(Sepal.Length ~ Sepal.Width, data = data)

# Create predicted and residual columns in the dataset
data$predicted <- predict(model)
data$residuals <- data$Sepal.Length - data$predicted
# We will generate a scatter plot using our predicted values to see how it performs

Scatter Plot of our Model with a Regression Line

ggplot(data, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(color = 'blue') +
  geom_line(aes(y = predicted), color = 'red') +
  labs(title = "Linear Regression", x = "Sepal Width", y = "Sepal Length")
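The red line is just the fitted equation drawn over the data, so it can also be used to predict the sepal length of a flower that is not in the dataset. The sketch below refits the same model so that it runs on its own. The sepal width of 3.0 is a hypothetical input, and the names fit and new_flower are only for illustration.

```r
# Sketch: predict Sepal.Length for a new, hypothetical flower.
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# A made-up flower; the column name must match the predictor in the formula
new_flower <- data.frame(Sepal.Width = 3.0)

# Option 1: let predict() apply the fitted equation
pred_from_predict <- unname(predict(fit, newdata = new_flower))

# Option 2: apply intercept + slope * x by hand
b <- coef(fit)
pred_by_hand <- unname(b["(Intercept)"] + b["Sepal.Width"] * new_flower$Sepal.Width)

print(pred_from_predict)
print(pred_by_hand)
```

Both options should print the same number, since predict() evaluates the same straight-line equation.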

How did our model perform?

When we use any kind of predictive modelling, we need to know how the model's predictions compare with the actual values before we can improve it. To do that, we construct something called a residual plot. A residual is the difference between an observed value and the value the model predicted for it, and the residual plot shows these differences for every point. The following slide contains one such residual plot.

Furthermore, to fit and fine-tune our models, we use something called a loss function. One such loss function is the MSE or Mean Squared Error. Ordinary least squares, which is what lm() uses, picks the line that minimizes the sum of squared residuals, and that is the same line that minimizes the MSE, so it is well worth knowing for future use :) It looks like this:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

It looks complicated, but in actuality it is simply the average of the squared differences between each observed value \(Y_i\) and its predicted value \(\hat{Y}_i\). Squaring makes every term non-negative, so positive and negative errors cannot cancel each other out, and it penalizes large errors more heavily than small ones!
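As a rough sketch, the MSE formula can be computed in a couple of lines of R. The block refits the iris model so that it runs on its own. The comparison against a baseline that always predicts the mean is an extra added here for illustration, and the variable names are hypothetical.

```r
# Sketch: compute the Mean Squared Error of the fitted line.
fit       <- lm(Sepal.Length ~ Sepal.Width, data = iris)
actual    <- iris$Sepal.Length
predicted <- predict(fit)

# MSE = average of the squared differences between actual and predicted values
mse <- mean((actual - predicted)^2)

# Same quantity computed from the residuals stored in the model
mse_resid <- mean(residuals(fit)^2)

# Baseline for comparison: always predict the mean sepal length
mse_baseline <- mean((actual - mean(actual))^2)

print(c(model = mse, baseline = mse_baseline))
```

A model MSE that is only slightly below the baseline suggests the predictor explains little of the variation in the response.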

Residual Plot

ggplot(data, aes(x = predicted, y = residuals)) +
  geom_point(color = 'blue') +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed") +
  labs(title = "Residual Plot", x = "Predicted Sepal Length", y = "Residuals") +
  theme_minimal()

Making a 3-D plot using a Linear Regression model

Conclusion

Simple linear regression, while not the most powerful predictive model, serves as an excellent starting point for learning about regression techniques. As you progress, you’ll discover that more advanced models, such as multiple regression, polynomial regression or even neural network architectures, are often built upon the foundational concepts of linear regression. I hope you enjoyed this mini-presentation! Have a great day!