2025-11-17

What is Simple Linear Regression?

Simple Linear Regression is a statistical method that models the relationship between two variables (X → Y):

  • Dependent variable (Y): The outcome we want to predict
  • Independent variable (X): The predictor we use

Our goal is to find the straight line that best fits the data points.

Where is this useful?

Simple linear regression is useful in a wide range of contexts. Many fields study how one factor helps explain another, such as:

  • Real Estate: (Predicting house prices based on square footage)
  • Marketing: (Forecasting sales based on advertising spend)
  • Education: (Predicting exam scores based on attendance rate)

The Mathematical Model

The simple linear regression equation is:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where:

  • \(Y\) = Dependent variable (response)
  • \(X\) = Independent variable (predictor)
  • \(\beta_0\) = Y-intercept (expected value of Y when X = 0)
  • \(\beta_1\) = Slope (expected change in Y for a one-unit change in X)
  • \(\epsilon\) = Error term (random deviation)
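
To make the roles of the terms concrete, here is a minimal sketch that simulates data from this model. The intercept, slope, sample size, and noise level are illustrative assumptions, not values from the analysis in this document.

```r
# Simulate data from Y = beta_0 + beta_1 * X + epsilon.
# All numeric values below are illustrative assumptions.
set.seed(42)                            # make the random draws reproducible
n       <- 200                          # number of observations
beta_0  <- 2                            # assumed intercept
beta_1  <- 3                            # assumed slope
x       <- runif(n, min = 0, max = 10)  # predictor values
epsilon <- rnorm(n, mean = 0, sd = 1)   # random error term
y       <- beta_0 + beta_1 * x + epsilon

# The deterministic part of the model is the straight line itself.
line_part <- beta_0 + beta_1 * x

# Whatever is left after removing the line is the error term.
all.equal(y - line_part, epsilon)
```

In real data we only observe `x` and `y`; the coefficients and the error term are unknown and have to be estimated.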

Estimating the Coefficients

We estimate \(\beta_0\) and \(\beta_1\) using the Least Squares Method, which minimizes the sum of squared residuals:

\[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\]

Where:

  • \(\bar{x}\) and \(\bar{y}\) are the sample means
  • The “hat” notation (\(\hat{\beta}\)) indicates estimated values
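
The two formulas above can be checked by hand. The sketch below applies them to a tiny made-up data set (the five `x` and `y` values are assumptions chosen only for illustration) and compares the result with R's built-in `lm()`.

```r
# Tiny illustrative data set (made-up values, not from the iris data).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sample means.
x_bar <- mean(x)
y_bar <- mean(y)

# Least squares slope: sum of cross-deviations over sum of squared x-deviations.
beta_1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Least squares intercept: forces the line through (x_bar, y_bar).
beta_0_hat <- y_bar - beta_1_hat * x_bar

# Compare the manual estimates with lm().
fit <- lm(y ~ x)
c(intercept = beta_0_hat, slope = beta_1_hat)
coef(fit)
```

Both approaches should give the same intercept and slope, up to floating-point rounding.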

Our Example: Iris Flower Measurements

We’ll use the iris dataset to explore flower morphology:

Question: Can we predict petal length from petal width?

##   Petal.Width Petal.Length Species
## 1         0.2          1.4  setosa
## 2         0.2          1.4  setosa
## 3         0.2          1.3  setosa
## 4         0.2          1.5  setosa
## 5         0.2          1.4  setosa
## 6         0.4          1.7  setosa
## 7         0.3          1.4  setosa
## 8         0.2          1.5  setosa
  • Petal.Width: Width of petal (cm)
  • Petal.Length: Length of petal (cm)
  • Species: Three iris species (setosa, versicolor, virginica)
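
A preview like the one above can be produced with base R. This is a sketch of one way to do it; the exact call used for the original output is not shown in the source, and the name `preview` is an assumption.

```r
# iris ships with base R (datasets package), so no download is needed.
data(iris)

# Keep only the three columns used in this analysis and show the first 8 rows.
preview <- head(iris[, c("Petal.Width", "Petal.Length", "Species")], 8)
print(preview)

# Number of flowers per species.
species_counts <- table(iris$Species)
print(species_counts)
```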

Scatter Plot with Regression Line

The plot shows a strong positive relationship: wider petals tend to be longer.
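
One way to put a number on "strong positive relationship" is the Pearson correlation coefficient. The sketch below computes it with base R; in simple linear regression, the square of this correlation equals the model's \(R^2\).

```r
# Pearson correlation between petal width and petal length.
r <- cor(iris$Petal.Width, iris$Petal.Length)
r

# In simple linear regression, R^2 is the squared correlation.
r^2
```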

Residual Analysis

The residuals show a grouping pattern, which suggests our simple model may be oversimplifying the relationship (a known risk of simple linear regression when the data contain distinct groups). We may need to fit a separate regression for each species to assess the relationship more rigorously.
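
The grouping can also be checked numerically. The following sketch fits the pooled model and averages the residuals within each species; the variable names are assumptions, and this is only one of several possible diagnostics.

```r
# Pooled model: one line for all three species.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
res <- residuals(fit)

# Mean residual per species. Values clearly away from 0 suggest the pooled
# line systematically over- or under-predicts petal length for that species.
mean_res <- tapply(res, iris$Species, mean)
mean_res

# Spread of the residuals per species, for comparison.
sd_res <- tapply(res, iris$Species, sd)
sd_res
```

If the pooled line described every species equally well, each group's mean residual would be close to zero.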

The R Code Behind Our First Plot

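library(ggplot2)

# Scatter plot of petal width vs. petal length, colored by species,
# with a single least squares line fitted to all points.
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  # One point per flower; transparency helps with overlapping points
  geom_point(aes(color = Species),
             size = 3,
             alpha = 0.6) +
  # Linear regression line with its confidence band (se = TRUE)
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = TRUE,
              color = "black",
              fill = "gray70") +
  # Fixed color for each species
  scale_color_manual(values = c("setosa" = "#FFC107",
                                 "versicolor" = "#8C1D40",
                                 "virginica" = "#00BCD4")) +
  labs(title = "Petal Width vs. Petal Length",
       x = "Petal Width (cm)",
       y = "Petal Length (cm)") +
  theme_minimal(base_size = 14)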

Interactive Scatter Plot with Model Details

This plot is similar to our first one, but we have fitted a separate linear regression for each species. You can zoom in to different parts of the graph and hover over points to see actual vs. predicted values and residuals!
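
The source does not show how the interactive plot was built, so the sketch below only reproduces the numbers behind it: one regression per species and, for every flower, the actual value, the predicted value, and the residual. The names `fits`, `slopes`, and `details` are assumptions.

```r
# Fit one simple linear regression per species.
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Petal.Length ~ Petal.Width, data = d)
})

# Slope of each species-specific line.
slopes <- sapply(fits, function(f) coef(f)[["Petal.Width"]])
slopes

# Per-flower details, similar to what a hover tooltip would display.
details <- do.call(rbind, lapply(names(fits), function(sp) {
  d <- iris[iris$Species == sp, ]
  data.frame(Species     = sp,
             Petal.Width = d$Petal.Width,
             actual      = d$Petal.Length,
             predicted   = unname(fitted(fits[[sp]])),
             residual    = unname(residuals(fits[[sp]])))
}))
head(details)
```

Comparing `slopes` with the slope of the pooled model shows how much the relationship differs within each species.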

Key Takeaways

Simple Linear Regression allows us to:

  1. Quantify relationships between variables
  2. Make predictions for new observations
  3. Understand linear trends in data

Our findings:

  • Each 1 cm increase in petal width is associated with an increase of ~2.23 cm in petal length
  • The model explains 93% of the variance in petal length (\(R^2 = 0.93\))
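
These figures can be read directly from a fitted model. The sketch below fits the pooled regression with `lm()`, extracts the slope and \(R^2\), and makes a prediction for a new flower; the 1.5 cm petal width is a hypothetical value chosen for illustration.

```r
# Fit petal length as a linear function of petal width.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Slope: estimated change in petal length per 1 cm of petal width.
slope <- coef(fit)[["Petal.Width"]]
round(slope, 2)

# Proportion of variance in petal length explained by the model.
r_squared <- summary(fit)$r.squared
round(r_squared, 2)

# Prediction for a new observation (hypothetical petal width of 1.5 cm),
# with a 95% prediction interval.
new_flower <- data.frame(Petal.Width = 1.5)
pred <- predict(fit, newdata = new_flower, interval = "prediction")
pred
```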

It is important to note that correlation ≠ causation: the model describes an association, not a cause.

We should also always validate the model assumptions (linearity, normality of residuals, homoscedasticity).
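
As a rough sketch of what such validation might look like with base R only, the code below runs one simple check per assumption. The choice of tests is an assumption on our part (the source does not say which checks were used), and the homoscedasticity check is a hand-computed studentized Breusch-Pagan statistic rather than a call to a dedicated package.

```r
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
res <- residuals(fit)

# Linearity: does adding a quadratic term improve the fit significantly?
fit_quad <- lm(Petal.Length ~ Petal.Width + I(Petal.Width^2), data = iris)
lin_check <- anova(fit, fit_quad)
lin_check

# Normality of residuals: Shapiro-Wilk test (small p-value = evidence against).
sw <- shapiro.test(res)
sw$p.value

# Homoscedasticity: regress squared residuals on the predictor;
# n * R^2 of this auxiliary model is approximately chi-squared with 1 df.
aux_data <- data.frame(sq_res = res^2, Petal.Width = iris$Petal.Width)
aux      <- lm(sq_res ~ Petal.Width, data = aux_data)
bp_stat  <- nrow(aux_data) * summary(aux)$r.squared
bp_p     <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_p
```

Formal tests are best read alongside diagnostic plots such as residuals vs. fitted values and a normal Q-Q plot.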