2024-11-15

# Load libraries
library(ggplot2)
library(plotly)
library(tidyverse)
# Set default options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
set.seed(123)

Simple Linear Regression

Simple linear regression is a statistical method that:

  • Models the relationship between two continuous variables: one independent (\(X\)) and one dependent (\(Y\)).
  • Assumes a linear relationship between \(X\) and \(Y\).
  • Is the foundation for more complex regression analyses.
  • Is usually paired with residual analysis, which helps check and complement the conclusions drawn from the fit.

Mathematical Foundation

The simple linear regression model is defined as:

\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]

Where:

  • \(\beta_0\) is the y-intercept
  • \(\beta_1\) is the slope
  • \(\epsilon_i\) is the error term, assumed \(\epsilon_i \sim N(0, \sigma^2)\)

  • “Simple” means the outcome variable is modeled with a single predictor.

Least Squares Estimation

The parameters are estimated by minimizing the residual sum of squares (RSS):

\[\sum_{i=1}^n (Y_i - (\beta_0 + \beta_1X_i))^2\]

Resulting in:

\[\hat{\beta_1} = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}\]

\[\hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}\]
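
As a rough check on the closed-form estimators above, the following sketch computes them by hand on simulated data and compares the result with `lm()`. The data-generating values (intercept 2, slope 3, noise standard deviation 2) and the names `x_demo`, `y_demo`, and `fit_demo` are assumptions chosen only for illustration.

```r
# Hypothetical illustration data (assumed values, not a real dataset)
set.seed(123)
x_demo <- runif(100, 0, 10)
y_demo <- 2 + 3 * x_demo + rnorm(100, 0, 2)

# Closed-form least squares estimates
x_bar <- mean(x_demo)
y_bar <- mean(y_demo)
beta1_hat <- sum((x_demo - x_bar) * (y_demo - y_bar)) / sum((x_demo - x_bar)^2)
beta0_hat <- y_bar - beta1_hat * x_bar

# Compare with lm(); the two sets of estimates should agree up to rounding
fit_demo <- lm(y_demo ~ x_demo)
manual <- c(beta0_hat, beta1_hat)
print(rbind(manual = manual, lm = unname(coef(fit_demo))))
```

Both rows of the printed matrix should match, since `lm()` solves the same minimization problem.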

Example Dataset and Basic Visualization

# Generate sample data
n <- 100
X <- runif(n, 0, 10)
Y <- 2 + 3 * X + rnorm(n, 0, 2)
data <- data.frame(X = X, Y = Y)

# Fit linear model
model <- lm(Y ~ X, data = data)

# Create basic scatter plot with regression line
p1 <- ggplot(data, aes(x = X, y = Y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "#8C1D40") +
  theme_minimal() +
  labs(title = "Simple Linear Regression Example",
       x = "Predictor (X)",
       y = "Response (Y)")
print(p1)

Residual Analysis

# Add residuals to data
data$residuals <- residuals(model)
data$fitted <- fitted(model)

# Create residual plot
p2 <- ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "#8C1D40") +
  theme_minimal() +
  labs(title = "Residual Plot",
       x = "Fitted Values",
       y = "Residuals")
print(p2)
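
The residual plot can be complemented with a few numeric checks. The sketch below re-simulates data under assumed values so that it runs on its own, then verifies two algebraic properties of least squares residuals in a model with an intercept: they sum to zero and are uncorrelated with the predictor. It also applies `shapiro.test()` as one possible check of the normality assumption; other diagnostics would serve equally well. The names `x_demo`, `y_demo`, and `fit_demo` are illustrative.

```r
# Hypothetical illustration data (assumed values, not a real dataset)
set.seed(123)
x_demo <- runif(100, 0, 10)
y_demo <- 2 + 3 * x_demo + rnorm(100, 0, 2)
fit_demo <- lm(y_demo ~ x_demo)
res <- residuals(fit_demo)

# Residuals from a model with an intercept sum to (numerically) zero
res_sum <- sum(res)

# ... and are uncorrelated with the predictor
res_cor <- cor(res, x_demo)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)

print(c(sum = res_sum, cor_with_x = res_cor, shapiro_p = sw$p.value))
```

Because the first two properties hold by construction, they say nothing about model adequacy; the residual plot and the normality check carry the diagnostic information.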

Contour Plot of the Residual Sum of Squares

# Create grid of beta0 and beta1 values
beta0_seq <- seq(1, 3, length.out = 50)
beta1_seq <- seq(2, 4, length.out = 50)
grid <- expand.grid(beta0 = beta0_seq, beta1 = beta1_seq)

# Calculate RSS for each combination
calculate_rss <- function(beta0, beta1) {
  sum((Y - (beta0 + beta1 * X))^2)
}

grid$rss <- mapply(calculate_rss, grid$beta0, grid$beta1)

# Create contour plot using ggplot2
ggplot(grid, aes(x = beta0, y = beta1)) +
  # after_stat(level) replaces the deprecated ..level.. notation
  geom_contour(aes(z = rss, color = after_stat(level)), bins = 20) +
  theme_minimal() +
  labs(title = "RSS Contour Plot",
       x = "β₀",
       y = "β₁",
       color = "RSS") +
  scale_color_viridis_c()
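
The minimum of the RSS surface can also be located numerically rather than read off the contours. The sketch below, on data simulated under assumed values, minimizes the RSS with `optim()` and compares the result with `lm()`. The function name `rss_demo`, the starting point `c(0, 0)`, and the choice of the BFGS method are illustrative assumptions.

```r
# Hypothetical illustration data (assumed values, not a real dataset)
set.seed(123)
x_demo <- runif(100, 0, 10)
y_demo <- 2 + 3 * x_demo + rnorm(100, 0, 2)

# RSS as a function of the parameter vector b = c(beta0, beta1)
rss_demo <- function(b) sum((y_demo - (b[1] + b[2] * x_demo))^2)

# Numerically minimize the RSS, starting from (0, 0)
opt <- optim(par = c(0, 0), fn = rss_demo, method = "BFGS")

# The numerical minimizer should be close to the least squares estimates
ols <- unname(coef(lm(y_demo ~ x_demo)))
print(rbind(optim = opt$par, lm = ols))
```

The two rows should agree to a few decimal places, which corresponds to the center of the innermost contour.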

Model Summary and Interpretation

The fitted model equation is: \[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X\]

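Key statistics (the data were simulated with a true intercept of 2 and a true slope of 3, so the estimates should land close to those values):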

summary(model)$coefficients
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1.982080 0.39210999  5.054909 2.002292e-06
## X           2.982034 0.06836424 43.619788 5.348260e-66
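
To see where the columns of this table come from, the sketch below refits the model on data simulated under the same assumed values and recomputes the t statistics as estimate divided by standard error. It also shows `confint()` for 95% confidence intervals and `predict()` for a prediction interval at a new point. The names `demo` and `fit_demo` and the point `X = 5` are arbitrary choices for illustration.

```r
# Hypothetical illustration data (assumed values, not a real dataset)
set.seed(123)
demo <- data.frame(X = runif(100, 0, 10))
demo$Y <- 2 + 3 * demo$X + rnorm(100, 0, 2)
fit_demo <- lm(Y ~ X, data = demo)

coefs <- summary(fit_demo)$coefficients

# t value = estimate / standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit_demo), lower.tail = FALSE)

# 95% confidence intervals for the intercept and slope
ci <- confint(fit_demo, level = 0.95)

# Prediction interval for a new observation at X = 5 (arbitrary example point)
pred <- predict(fit_demo, newdata = data.frame(X = 5), interval = "prediction")

print(ci)
print(pred)
```

The prediction interval is wider than a confidence interval for the mean response at the same point, because it also accounts for the noise in a single new observation.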

Conclusions

  • Simple linear regression provides a visual, easy-to-understand way to model the relationship between two variables.
  • “Line fitting” is its hallmark: a straight line is fitted through a set of data points.
  • Its assumptions should be checked carefully, because they limit the kinds of relationships the model can describe.
  • Because of these limitations, residual analysis is crucial for model validation.
  • The method can be extended to multiple regression and other, more complex models.