2024-11-15
# Load libraries
library(ggplot2)
library(plotly)
library(tidyverse)
# Set default options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
set.seed(123)
Simple Linear Regression
Simple linear regression is a statistical method that:
- Models the relationship between two continuous variables, one independent and one dependent.
- Assumes a linear relationship between \(X\) and \(Y\).
- Is the foundation for more complex regression analyses.
- Is complemented by residual analysis, which helps check and refine the conclusions drawn from the fitted line.
Mathematical Foundation
The simple linear regression model is defined as:
\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]
Where:
- \(\beta_0\) is the y-intercept
- \(\beta_1\) is the slope
- \(\epsilon_i\) is the error term, assumed \(\epsilon_i \sim N(0, \sigma^2)\)
- “Simple” means the outcome variable is modeled using a single predictor.
Least Squares Estimation
The parameters are estimated by minimizing:
\[\sum_{i=1}^n (Y_i - (\beta_0 + \beta_1X_i))^2\]
Resulting in:
\[\hat{\beta_1} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}\]
\[\hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}\]
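The closed-form estimates above can be checked by hand. The following is a minimal sketch, assuming a tiny made-up data set (the values of `x` and `y` are hypothetical and not part of this analysis); it computes both estimates directly from the formulas and compares them with `lm()`.

```r
# Hypothetical toy data, chosen only to illustrate the formulas
x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.1)

# Slope: sum of cross-deviations divided by sum of squared deviations in x
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: makes the fitted line pass through (mean(x), mean(y))
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Compare the hand-computed estimates with lm()
fit_toy <- lm(y ~ x)
manual <- c(beta0_hat, beta1_hat)
print(manual)
print(all.equal(manual, unname(coef(fit_toy))))
```

For these toy values the slope works out to 2.99 and the intercept to 2.05, and `lm()` should agree up to floating-point error.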
Example Dataset and Basic Visualization
# Generate sample data
n <- 100
X <- runif(n, 0, 10)
Y <- 2 + 3 * X + rnorm(n, 0, 2)
data <- data.frame(X = X, Y = Y)
# Fit linear model
model <- lm(Y ~ X, data = data)
# Create basic scatter plot with regression line
p1 <- ggplot(data, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "#8C1D40") +
  theme_minimal() +
  labs(title = "Simple Linear Regression Example",
       x = "Predictor (X)",
       y = "Response (Y)")
print(p1)

Residual Analysis
# Add residuals to data
data$residuals <- residuals(model)
data$fitted <- fitted(model)
# Create residual plot
p2 <- ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#8C1D40") +
  theme_minimal() +
  labs(title = "Residual Plot",
       x = "Fitted Values",
       y = "Residuals")
print(p2)
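Since the model assumes normally distributed errors, a normal Q-Q plot of the residuals is a natural companion to the residual plot. The following is a sketch only: it regenerates the simulated data with the same seed so that it runs on its own, and the names `sim`, `fit`, `qq_df`, and `p_qq` are illustrative choices rather than objects used elsewhere in this document.

```r
library(ggplot2)

# Recreate the simulated data so this sketch is self-contained
set.seed(123)
n <- 100
sim <- data.frame(X = runif(n, 0, 10))
sim$Y <- 2 + 3 * sim$X + rnorm(n, 0, 2)
fit <- lm(Y ~ X, data = sim)

# Standardized residuals put the points on a comparable scale
qq_df <- data.frame(std_resid = rstandard(fit))

# Points close to the reference line are consistent with normal errors
p_qq <- ggplot(qq_df, aes(sample = std_resid)) +
  stat_qq(alpha = 0.5) +
  stat_qq_line(color = "#8C1D40") +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot of Standardized Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")
print(p_qq)
```

Systematic curvature away from the line at either end would suggest skewed or heavy-tailed errors, which the residual-versus-fitted plot alone may not reveal.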

Contour Plot Showing Linear Regression
# Create grid of beta0 and beta1 values
beta0_seq <- seq(1, 3, length.out = 50)
beta1_seq <- seq(2, 4, length.out = 50)
grid <- expand.grid(beta0 = beta0_seq, beta1 = beta1_seq)
# Calculate RSS for each combination
calculate_rss <- function(beta0, beta1) {
  sum((Y - (beta0 + beta1 * X))^2)
}
grid$rss <- mapply(calculate_rss, grid$beta0, grid$beta1)
# Create contour plot using ggplot2
ggplot(grid, aes(x = beta0, y = beta1)) +
  # after_stat(level) replaces the deprecated ..level.. notation
  geom_contour(aes(z = rss, color = after_stat(level)), bins = 20) +
  theme_minimal() +
  labs(title = "RSS Contour Plot",
       x = "β₀",
       y = "β₁",
       color = "RSS") +
  scale_color_viridis_c()
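The innermost contour should surround the least squares estimates, because those are the values of \(\beta_0\) and \(\beta_1\) that minimize the RSS. A minimal sketch of that check follows. It rebuilds the same simulated data and grid under separate, illustrative names (`X_sim`, `Y_sim`, `rss_grid`, `best`) so that it runs on its own without overwriting earlier objects.

```r
# Recreate the simulated data so this sketch is self-contained
set.seed(123)
n <- 100
X_sim <- runif(n, 0, 10)
Y_sim <- 2 + 3 * X_sim + rnorm(n, 0, 2)
fit <- lm(Y_sim ~ X_sim)

# Residual sum of squares for a candidate (beta0, beta1) pair
rss <- function(beta0, beta1) {
  sum((Y_sim - (beta0 + beta1 * X_sim))^2)
}

# Same 50 x 50 grid as in the contour plot
rss_grid <- expand.grid(beta0 = seq(1, 3, length.out = 50),
                        beta1 = seq(2, 4, length.out = 50))
rss_grid$rss <- mapply(rss, rss_grid$beta0, rss_grid$beta1)

# Grid point with the smallest RSS, next to the exact lm() solution
best <- rss_grid[which.min(rss_grid$rss), ]
ols_rss <- sum(residuals(fit)^2)
print(best)
print(coef(fit))
print(ols_rss)
```

The grid minimizer can only approximate the `lm()` coefficients to within the grid spacing, and its RSS can never fall below the RSS of the fitted model.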

Model Summary and Interpretation
The fitted model equation is: \[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\]
Key statistics:
summary(model)$coefficients
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1.982080 0.39210999  5.054909 2.002292e-06
## X           2.982034 0.06836424 43.619788 5.348260e-66
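Each t value in the table is the estimate divided by its standard error (for the intercept, 1.982080 / 0.39210999 ≈ 5.055), and the p-value is the two-sided tail probability of a t distribution with \(n - 2\) degrees of freedom. The sketch below reproduces those columns by hand and adds 95% confidence intervals via `confint()`. It refits the model on regenerated data so that it runs on its own; `sim`, `fit`, `t_manual`, `p_manual`, and `ci` are illustrative names.

```r
# Recreate the simulated data so this sketch is self-contained
set.seed(123)
n <- 100
sim <- data.frame(X = runif(n, 0, 10))
sim$Y <- 2 + 3 * sim$X + rnorm(n, 0, 2)
fit <- lm(Y ~ X, data = sim)

coefs <- summary(fit)$coefficients

# t statistic: estimate divided by its standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

# 95% confidence intervals for the intercept and slope
ci <- confint(fit, level = 0.95)

print(cbind(t_manual, p_manual))
print(ci)
```

A confidence interval for the slope that excludes zero leads to the same conclusion as the small p-value: the data give strong evidence of a linear association between \(X\) and \(Y\).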
Conclusions
- Simple linear regression provides a visual, easy-to-understand way to model the relationship between two variables.
- “Line fitting” is the hallmark of simple linear regression: a straight line is fitted through a series of data points.
- Its assumptions should be checked carefully, because they limit the kinds of relationships the model can represent.
- Because of these limitations, residual analysis is crucial for model validation.
- The method can be extended to multiple regression and other more complex models.