Simple Linear Regression

March 25, 2025

What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between:

One independent variable (predictor) X
One dependent variable (response) Y

The relationship is modeled as a straight line:

We fit a line through the data points
The line is chosen to minimize the sum of squared differences between observed and predicted values
The fitted line is used for prediction and for understanding how Y changes with X

The Mathematical Model

In Simple Linear Regression, we model the relationship as:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Where:

\(Y_i\) is the dependent variable (response) for observation \(i\)
\(X_i\) is the independent variable (predictor) for observation \(i\)
\(\beta_0\) is the y-intercept (expected value of Y when X = 0)
\(\beta_1\) is the slope (expected change in Y for a one-unit increase in X)
\(\varepsilon_i\) is the error term (assumed independent and normally distributed with mean zero and constant variance)
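
To make the model concrete, here is a minimal sketch that simulates data from it and then recovers the parameters with `lm()`. The parameter values, sample size, and seed are illustrative assumptions and do not come from the lecture.

```r
# Simulate data from Y = beta0 + beta1 * X + error (all values are illustrative)
set.seed(42)                      # arbitrary seed for reproducibility
n     <- 200
beta0 <- 2.0                      # assumed true intercept
beta1 <- 0.5                      # assumed true slope
sigma <- 1.0                      # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)   # normally distributed errors
y   <- beta0 + beta1 * x + eps

# Fit the simple linear regression and inspect the estimated coefficients
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated intercept and slope should land close to the assumed values of 2.0 and 0.5, and closer still as `n` grows.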

Key Assumptions of Linear Regression

Linearity: The relationship between X and Y is linear
Independence: Observations are independent of each other
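Homoscedasticity: The errors have constant variance at every value of X
Normality: The errors follow a normal distribution (checked in practice using the residuals)

Violating these assumptions can invalidate the standard errors, confidence intervals, and p-values; a violation of linearity also biases the fitted line itself.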
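
One common way to check these assumptions is to look at the residuals of a fitted model. The sketch below assumes the model of Volume on Girth from the built-in `trees` data, which the later slides use. Independence cannot be read off a plot and depends on how the data were collected.

```r
# Fit the model (assumed here: Volume regressed on Girth)
data(trees)
fit <- lm(Volume ~ Girth, data = trees)

# Linearity and homoscedasticity: residuals vs fitted values
# (look for curvature or a funnel shape)
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: normal Q-Q plot and Shapiro-Wilk test of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))
sw <- shapiro.test(residuals(fit))
sw$p.value
```

A small Shapiro-Wilk p-value suggests non-normal residuals, although with only 31 observations the plots are usually more informative than the test.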

Parameter Estimation

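Ordinary Least Squares (OLS) estimates the parameters by choosing the values of \(\beta_0\) and \(\beta_1\) that minimize the residual sum of squares \(\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2\).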

The estimators for \(\beta_0\) and \(\beta_1\) are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

Where \(\bar{X}\) and \(\bar{Y}\) are the sample means of X and Y.
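
The two formulas can be checked by hand. This sketch computes the estimators directly on the built-in `trees` data, taking Girth as X and Volume as Y (an assumption that matches the plot later in the deck), and compares them with the coefficients from `lm()`.

```r
data(trees)
x <- trees$Girth
y <- trees$Volume

# Slope: sum of cross-products divided by the sum of squares of X
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Compare the hand-computed values with lm()
manual  <- c(b0, b1)
from_lm <- unname(coef(lm(Volume ~ Girth, data = trees)))
all.equal(manual, from_lm)
```

`all.equal()` returns `TRUE` when the two sets of estimates agree up to numerical tolerance.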

Example Dataset: Using Trees

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
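
Output like the rows above can be reproduced as follows. According to R's documentation for `trees`, the data are measurements on 31 felled black cherry trees: Girth is the tree diameter in inches, Height is in feet, and Volume is in cubic feet.

```r
data(trees)     # built-in dataset from the datasets package
head(trees)     # first six rows
str(trees)      # structure: 31 observations of 3 numeric variables
```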

Visualization with ggplot2

Visualization with Additional Variables

3D Visualization with Plotly

R Code for ggplot2 Visualization

Here’s the R code to create the scatterplot of Volume against Girth with the fitted regression line, following the lecture style:

# Load ggplot2 and the dataset
library(ggplot2)
data(trees)

# Scatterplot with the least-squares line; se = TRUE adds a 95% confidence band
g <- ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# Display the plot
g

Model Evaluation

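We evaluate the fitted model (Volume regressed on Girth) using several metrics:

R-squared: 0.9353 (proportion of the variance in Volume explained by Girth)
Adjusted R-squared: 0.9331
Residual Standard Error: 4.252 (on 29 degrees of freedom)
F-statistic: 419.36 on 1 and 29 degrees of freedom (p-value: < 2.22e-16)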
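
These metrics can be pulled out of the model summary. The sketch below assumes the model is Volume regressed on Girth, which is the fit that matches the figures listed above.

```r
data(trees)
fit <- lm(Volume ~ Girth, data = trees)
s   <- summary(fit)

s$r.squared        # R-squared
s$adj.r.squared    # adjusted R-squared
s$sigma            # residual standard error
s$fstatistic       # F value with numerator and denominator degrees of freedom

# p-value of the overall F test, computed from the F distribution
f <- s$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
p_value
```

Printing `summary(fit)` shows the same quantities in a single report, together with the coefficient table.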