2025-10-19

What is Simple Linear Regression?

Simple Linear Regression (SLR) is a method used to model the linear relationship between two continuous variables.

  • Dependent Variable (Y): The outcome or response variable we want to predict.
  • Independent Variable (X): The predictor or explanatory variable used to explain Y.
  • Goal: To find the straight line that best fits the data by minimizing the sum of squared residuals.

We will use the built-in R dataset trees to illustrate SLR, modeling Volume (Y) as a function of Girth (X).

The Simple Linear Model

The theoretical relationship is expressed as:

\[ \text{Volume}_i = \beta_0 + \beta_1 \cdot \text{Girth}_i + \varepsilon_i \]

Where:

  • \(i\) indexes the observation.
  • \(\beta_0\): The theoretical intercept (expected \(Y\) when \(X=0\)).
  • \(\beta_1\): The theoretical slope (expected change in \(Y\) for a one-unit increase in \(X\)).
  • \(\varepsilon_i\): The error term, assumed to be independently and normally distributed with constant variance: \(\varepsilon_i \sim N(0, \sigma^2)\). The residuals of the fitted model are its observable estimates.

The fitted model derived from the data is:

\[ \widehat{\text{Volume}}_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Girth}_i \]

Data Exploration: The trees Dataset

The trees dataset contains girth, height, and volume measurements for 31 black cherry trees.

data(trees)
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

Variables:

  • Girth: Tree diameter (in inches).
  • Height: Tree height (in feet).
  • Volume: Timber volume (in cubic feet).

In this tutorial we treat Volume as a function of Girth only: \(\text{Volume} = f(\text{Girth})\).
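Before fitting anything, it can help to check the structure of the data and the strength of the linear association. The following is a minimal sketch using base R only; the variable name r_girth_volume is introduced here for illustration and is not part of the original analysis.

```r
# Load the built-in dataset (ships with base R's datasets package)
data(trees)

# Column types and basic summaries of the three variables
str(trees)
summary(trees)

# Pearson correlation between the predictor (Girth) and the response (Volume)
r_girth_volume <- cor(trees$Girth, trees$Volume)
r_girth_volume
```

A correlation close to 1 suggests that a straight line is a reasonable first model for these two variables.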

Fitting the Regression Model

The lm() function in R is used to fit the linear model, providing estimates for the coefficients (\(\hat{\beta}_0\) and \(\hat{\beta}_1\)).

reg_trees <- lm(Volume ~ Girth, data = trees)
summary(reg_trees)
Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353,    Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
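Working with the fitted model

The pieces of the summary above can also be pulled out of the model object programmatically. The sketch below uses standard base R accessors (coef(), confint(), predict()); the 12-inch tree and the names new_tree, pred_12, and r_squared are hypothetical and chosen only for illustration.

```r
# Refit the model so this block runs on its own
reg_trees <- lm(Volume ~ Girth, data = trees)

# Point estimates of the intercept and slope
coef(reg_trees)

# 95% confidence intervals for both coefficients
confint(reg_trees, level = 0.95)

# Predicted volume, with a 95% prediction interval,
# for a hypothetical tree with a girth of 12 inches
new_tree <- data.frame(Girth = 12)
pred_12 <- predict(reg_trees, newdata = new_tree, interval = "prediction")
pred_12

# R-squared, as reported in the summary output
r_squared <- summary(reg_trees)$r.squared
r_squared
```

The slope indicates that each additional inch of girth is associated with an expected increase of about 5.07 cubic feet in volume. The negative intercept has no physical meaning here, because a girth of 0 lies far outside the observed range of the data.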

Visualizing the SLR Line

We use ggplot2 to visualize the relationship and overlay the fitted linear model line using geom_smooth(method = "lm").
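One way to produce such a plot is sketched below. It assumes the ggplot2 package is installed; the axis labels, title, and the object name p_trees are illustrative choices rather than the original figure's settings.

```r
library(ggplot2)

# Scatterplot of the raw data with the least-squares line overlaid.
# se = TRUE adds the default 95% confidence band around the fitted line.
p_trees <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(
    x = "Girth (inches)",
    y = "Volume (cubic feet)",
    title = "Volume vs. Girth with fitted SLR line"
  )

print(p_trees)
```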

Calculation of Coefficients

The slope (\(\hat{\beta}_1\)) and intercept (\(\hat{\beta}_0\)) estimates rely on the sum of squares and means of \(X\) and \(Y\).

The sum of squares of \(X\) is calculated as: \[ SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The estimated slope \(\hat{\beta}_1\) (or \(b_1\)) is: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{SS_x} \]

The estimated intercept \(\hat{\beta}_0\) (or \(b_0\)) is: \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
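These formulas can be checked by hand against lm(). The following sketch applies them to the trees data; the names x_bar, y_bar, ss_x, sp_xy, b0, and b1 are introduced here for illustration.

```r
x <- trees$Girth
y <- trees$Volume

# Sample means of X and Y
x_bar <- mean(x)
y_bar <- mean(y)

# Sum of squares of X and sum of cross-products of X and Y
ss_x  <- sum((x - x_bar)^2)
sp_xy <- sum((x - x_bar) * (y - y_bar))

# Slope and intercept from the closed-form expressions
b1 <- sp_xy / ss_x
b0 <- y_bar - b1 * x_bar
c(intercept = b0, slope = b1)

# Compare with the coefficients estimated by lm()
all.equal(unname(coef(lm(Volume ~ Girth, data = trees))), c(b0, b1))
```

The hand-computed values should match the Estimate column of the summary output up to rounding.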

Exploring Multivariate Data

While SLR uses only one predictor, real data usually involve several related variables. To illustrate, we switch to the built-in mtcars dataset and plot MPG against Weight, mapping Cylinders to color and Horsepower to size.
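A plot of this kind might be built as follows. This is a sketch assuming ggplot2 is installed; it uses the mtcars columns wt, mpg, cyl, and hp, while the transparency setting, labels, and the name p_cars are illustrative.

```r
library(ggplot2)

# MPG vs. weight, with cylinder count as a discrete color
# and horsepower as point size
p_cars <- ggplot(mtcars, aes(x = wt, y = mpg,
                             color = factor(cyl), size = hp)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders",
    size = "Horsepower"
  )

print(p_cars)
```

Converting cyl with factor() makes ggplot2 treat it as a categorical variable, so each cylinder count gets its own color instead of a continuous gradient.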

Interactive 3D Visualization

We use plotly to visualize three variables from mtcars: Weight (x), Displacement (y), and MPG (z), allowing interactive rotation and exploration of potential multiple linear relationships.
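A minimal sketch of such a figure is shown below, assuming the plotly package is installed. The marker size, axis titles, and the object name fig are illustrative choices.

```r
library(plotly)

# 3D scatterplot: weight (x), displacement (y), and MPG (z)
fig <- plot_ly(
  data = mtcars,
  x = ~wt, y = ~disp, z = ~mpg,
  type = "scatter3d", mode = "markers",
  marker = list(size = 4)
)

# Label the three axes; plotly::layout is called explicitly to avoid
# confusion with graphics::layout from base R
fig <- plotly::layout(fig, scene = list(
  xaxis = list(title = "Weight (1000 lbs)"),
  yaxis = list(title = "Displacement (cu. in.)"),
  zaxis = list(title = "MPG")
))

fig
```

In an interactive session or a rendered HTML document, the resulting widget can be rotated and zoomed with the mouse.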