March 25, 2025

What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between:

  • One independent variable (predictor) X
  • One dependent variable (response) Y

The relationship is modeled as a straight line:

  • We fit a line through the data points
  • The fitted line minimizes the sum of squared differences between the observed and predicted values of Y
  • The fitted line is used for prediction and for understanding the relationship between the two variables

The Mathematical Model

In Simple Linear Regression, we model the relationship as:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Where:

  • \(Y_i\) is the dependent variable (response) for observation \(i\)
  • \(X_i\) is the independent variable (predictor) for observation \(i\)
  • \(\beta_0\) is the y-intercept (expected value of Y when X = 0)
  • \(\beta_1\) is the slope (expected change in Y for a one-unit increase in X)
  • \(\varepsilon_i\) is the error term (assumed to be normally distributed with mean 0 and constant variance)
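
To make the model concrete, the sketch below simulates data from it and then recovers the parameters with `lm()`. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the range of `x` are all illustrative assumptions, not values from the lecture.

```r
# Simulate data from Y_i = beta0 + beta1 * X_i + eps_i
# (all numeric values below are assumptions chosen for illustration)
set.seed(42)
n     <- 200
beta0 <- 2.0   # assumed true intercept
beta1 <- 0.5   # assumed true slope
sigma <- 1.0   # assumed standard deviation of the error term

x   <- runif(n, min = 0, max = 10)      # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)   # normally distributed errors
y   <- beta0 + beta1 * x + eps          # response generated by the model

# Fit a simple linear regression to the simulated data
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated coefficients should land close to the assumed `beta0` and `beta1`, with the gap shrinking as `n` grows.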

Key Assumptions of Linear Regression

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent of each other
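  3. Homoscedasticity: The variance of the errors is constant across all values of X
  4. Normality: The errors follow a normal distribution

Violating these assumptions can bias the estimates or their standard errors, which makes confidence intervals and p-values unreliable.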
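
One common way to check these assumptions is with residual diagnostics. The sketch below assumes a model of Volume on Girth fitted to R's built-in `trees` dataset; the choice of a Shapiro-Wilk test for normality is also an assumption, and other checks would serve equally well.

```r
# Fit an example model (assumed here: Volume regressed on Girth)
data(trees)
fit <- lm(Volume ~ Girth, data = trees)

# Base R diagnostic plots for an lm object:
#   Residuals vs Fitted   -> linearity
#   Normal Q-Q            -> normality of residuals
#   Scale-Location        -> constant variance (homoscedasticity)
#   Residuals vs Leverage -> influential observations
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Formal test of residual normality (small p-values suggest non-normality)
sw <- shapiro.test(residuals(fit))
sw$p.value
```

Independence usually cannot be judged from these plots alone; it depends mainly on how the data were collected.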

Parameter Estimation

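The method of Ordinary Least Squares (OLS) estimates the parameters by choosing the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals \(\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2\).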

The estimators for \(\beta_0\) and \(\beta_1\) are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

Where \(\bar{X}\) and \(\bar{Y}\) are the sample means of X and Y.
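
The two estimator formulas can be computed directly and compared with `lm()`. This is a minimal sketch that assumes Girth as X and Volume as Y from R's built-in `trees` dataset; the variable names are illustrative.

```r
# OLS estimates computed from the closed-form formulas
data(trees)
x <- trees$Girth    # assumed predictor
y <- trees$Volume   # assumed response

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of X
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# Cross-check against R's built-in fitting function
coef(lm(Volume ~ Girth, data = trees))
```

Both calls should print the same pair of numbers, since `lm()` solves the same least-squares problem.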

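Example Dataset: trees

The first six rows of R's built-in trees dataset (measurements of 31 felled black cherry trees; Girth in inches, Height in feet, Volume in cubic feet):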

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
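
The listing above looks like the output of `head()` on the built-in dataset. Assuming that is how it was produced, it can be reproduced as follows:

```r
# Load the built-in dataset and inspect it
data(trees)

head(trees)   # first six rows
str(trees)    # column types and number of observations
```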

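Visualization with ggplot2

[Figure: ggplot2 visualization of the trees data]

Visualization with Additional Variables

[Figure: visualization with additional variables]

3D Visualization with Plotly

[Figure: 3D visualization created with Plotly]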

R Code for ggplot2 Visualization

Here’s the R code to create the regression plot, following the lecture style:

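# Load the plotting package
library(ggplot2)

# Load the dataset
data(trees)

# Create a basic scatterplot with a fitted regression line;
# se = TRUE draws the confidence band around the line
g <- ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# Display the plot
g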

Model Evaluation

We evaluate the fitted model (Volume regressed on Girth) using several metrics:

  1. R-squared: 0.9353 (proportion of variance explained)
  2. Adjusted R-squared: 0.9331
  3. Residual Standard Error: 4.252
  4. F-statistic: 419.36 (p-value: < 2.22e-16)
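
These metrics can be read from the `summary()` of a fitted model. The sketch below assumes the model behind the figures above is Volume regressed on Girth in the `trees` data; the object names `fit`, `s`, and `f` are illustrative.

```r
# Fit the model and extract the evaluation metrics from its summary
data(trees)
fit <- lm(Volume ~ Girth, data = trees)
s   <- summary(fit)

s$r.squared       # R-squared
s$adj.r.squared   # adjusted R-squared
s$sigma           # residual standard error
s$fstatistic      # F-statistic with its numerator and denominator df

# p-value of the overall F-test, computed from the F distribution
f <- s$fstatistic
pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
```

Calling `print(s)` shows the same quantities in the usual regression summary layout.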