2026-02-10

Defining Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables. It aims to find the best-fit line that minimizes the error between the predicted and actual values.

  • Widely used in statistics, engineering, and the sciences
  • The purpose is to explain and predict outcomes
  • It assumes a linear relationship between the independent and dependent variables

Linear Regression Model

The simple linear regression model is: \(Y = \beta_0 + \beta_1 X + \varepsilon\)

Where:

  • \(Y\) = dependent variable
  • \(X\) = independent variable
  • \(\beta_0\) = intercept
  • \(\beta_1\) = slope
  • \(\varepsilon\) = random error

Least Squares Estimation

To find the line that fits the data best, we first need to understand what a residual is and how it relates to least squares estimation.

A residual is the error for a given data point: the observed value minus the value our line predicts.

  • If our line passes directly through a data point, that point has a residual of 0: there is no error for that point.

  • If our line falls above the data point, that point has a negative residual, meaning the true value is less than what our line predicts for that point.

  • If our line falls below the data point, that point has a positive residual, meaning the true value is greater than what our line predicts for that point.
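The three cases above can be illustrated with a small sketch. The candidate line and the three data points are made up for illustration; they are not taken from the data used later in these slides.

```r
# Hypothetical candidate line (not a fitted one): y_hat = 1 + 2 * x
b0 <- 1
b1 <- 2

# Three made-up data points, one for each case
x <- c(1, 2, 3)
y <- c(3, 4, 8)

y_hat <- b0 + b1 * x   # predictions: 3, 5, 7
res <- y - y_hat       # residual = observed minus predicted

print(res)             # 0 (on the line), -1 (line above point), 1 (line below point)
```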

Least Squares Estimation (cont.)

In short, we want the regression line with the smallest total error across all data points.

To find it, we minimize the sum of squared residuals (formula shown below):

\(\mathrm{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2\), where \(n\) is the number of data points.

  • Squaring makes every error non-negative, so positive and negative residuals cannot cancel out
  • Squaring penalizes large errors more than small ones
  • The result is a smooth function that is easy to optimize
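To make the objective concrete, here is a minimal sketch of the sum of squared residuals as an R function. The function name `sse` and the four data points are illustrative assumptions, chosen so the arithmetic can be checked by hand.

```r
# Sum of squared residuals for a candidate line y = b0 + b1 * x
sse <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Made-up points lying exactly on y = 1 + 2x
x <- c(0, 1, 2, 3)
y <- c(1, 3, 5, 7)

sse(1, 2, x, y)   # the line passes through every point, so SSE = 0
sse(0, 2, x, y)   # every residual is 1, so SSE = 4 * 1^2 = 4
sse(1, 1, x, y)   # residuals 0, 1, 2, 3, so SSE = 0 + 1 + 4 + 9 = 14
```

The residual of 3 contributes 9 of the 14 in the last call, which shows how squaring weights large errors more heavily.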

Least Squares Estimation (cont.)

To solve the optimization problem, we:

  • Take the partial derivatives of the SSE with respect to \(\beta_0\) and \(\beta_1\)
  • Set them equal to 0 and solve for \(\beta_0\) and \(\beta_1\)
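The two steps can be written out as follows. This is the standard textbook derivation; the hats on the estimates and the bars for sample means (\(\bar{x}\), \(\bar{y}\)) are notation introduced here, not used elsewhere in these slides.

```latex
\[
\frac{\partial\,\mathrm{SSE}}{\partial \beta_0}
  = -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0,
\qquad
\frac{\partial\,\mathrm{SSE}}{\partial \beta_1}
  = -2 \sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0
\]

\[
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```

The first equation says the residuals sum to zero; the second says they are uncorrelated with \(x\) in the sample.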

The following slides apply linear regression to synthetic data.

Simple Linear Regression

First we create the synthetic data as shown below

set.seed(123)
n <- 100
x <- rnorm(n)
y <- 3 + 2*x + rnorm(n, sd = 1)

data <- data.frame(x, y)

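Then we use ggplot2 to plot the data together with the fitted regression line

library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

ggplot(data, aes(x = x, y = y)) +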
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Simple Linear Regression",
    x = "Predictor (X)",
    y = "Response (Y)"
  )
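The line drawn by `geom_smooth(method = "lm")` is the least-squares fit of `y ~ x`. As a rough check, the sketch below recreates the same synthetic data and computes the slope and intercept by hand. The slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept is `mean(y)` minus the slope times `mean(x)`. The names `b0_hat`, `b1_hat`, and `fit` are illustrative.

```r
# Recreate the synthetic data from the earlier slide
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 3 + 2*x + rnorm(n, sd = 1)

# Closed-form least-squares estimates
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)

print(c(intercept = b0_hat, slope = b1_hat))
print(coef(fit))
```

Both printouts should agree up to floating-point error, and both should land near the true values of 3 and 2 used to generate the data.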

Simple Linear Regression (cont.)

Below is the scatter plot of the data with the fitted regression line.

[Figure: scatter plot of Response (Y) against Predictor (X) with the fitted regression line and its confidence band]

Residual Diagnostics

Because a linear regression model assumes linearity, we can use residual diagnostics to check the validity of this assumption.

The following code fits the model to the previous data and plots its residuals against the predictor. If our assumption is correct, the residuals should scatter randomly around 0 with no visible pattern.

model <- lm(y ~ x, data = data)
data$residuals <- resid(model)

ggplot(data, aes(x = x, y = residuals)) +
  geom_point(color = "purple") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals vs Predictor",
    x = "X",
    y = "Residuals"
  )
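As a numeric complement to the plot, the sketch below checks two properties that hold for any least-squares fit with an intercept: the residuals sum to zero, and they are uncorrelated with the predictor. It recreates the same synthetic data so that it runs on its own; the name `res` is illustrative.

```r
# Recreate the synthetic data and the fitted model
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 3 + 2*x + rnorm(n, sd = 1)

model <- lm(y ~ x)
res <- resid(model)

# Both sums are zero up to floating-point error
print(sum(res))       # residuals sum to zero
print(sum(res * x))   # residuals are orthogonal to the predictor
```

Because these two properties hold by construction, the residual plot can never show an overall linear trend. What the plot can reveal is curvature or changing spread, which would suggest that the linear model is inadequate.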

Residual Diagnostics plot

Multiple Regression (2 independent variables)

Linear regression extends to two independent variables with the formula shown below.

\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\)

It's the same as our original formula, except that we add a second independent variable, \(X_2\), with its own coefficient, \(\beta_2\).

Multiple Regression (cont.)

As before, we use synthetic data to demonstrate.

set.seed(1)
x1 <- rnorm(n)
x2 <- rnorm(n)
y2 <- 1 + 2*x1 - 1.5*x2 + rnorm(n)

df3d <- data.frame(x1, x2, y2)
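The synthetic data above can be fit with `lm()` in the same way as the simple model. The sketch below is one way to do it, assuming `n = 100` carries over from the earlier slide. It also recomputes the coefficients from the normal equations \((X^\top X)\beta = X^\top y\) as a cross-check. The names `fit2`, `X`, and `beta_hat` are illustrative.

```r
# Recreate the synthetic data (n = 100 assumed from the earlier slide)
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y2 <- 1 + 2*x1 - 1.5*x2 + rnorm(n)

# Fit the two-predictor model
fit2 <- lm(y2 ~ x1 + x2)
print(coef(fit2))

# Same estimates from the normal equations (X'X) beta = X'y
X <- cbind(1, x1, x2)   # design matrix with a column of ones for the intercept
beta_hat <- solve(crossprod(X), crossprod(X, y2))
print(drop(beta_hat))
```

The estimates should be close to the true coefficients 1, 2, and -1.5 used to generate `y2`. Geometrically, the fitted model is a plane through the 3D point cloud rather than a line.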

Multiple Regression (cont.)

Below is the code that plots the data as an interactive 3D scatter plot.

library(plotly)  # provides plot_ly(), layout(), and the %>% pipe

plot_ly(
  df3d,
  x = ~x1,
  y = ~x2,
  z = ~y2,
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = "Multiple Linear Regression",
    scene = list(
      xaxis = list(title = "X1"),
      yaxis = list(title = "X2"),
      zaxis = list(title = "Y")
    )
  )

Multiple Regression graph