2024-10-22

Introduction to Simple Linear Regression

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.

Assumptions of Simple Linear Regression

  • Linearity: There is a linear relationship between X and Y.
  • Independence: Observations are independent of each other.
  • Homoscedasticity: The variance of the residuals is constant across all values of X.
  • Normality: The residuals of the model are normally distributed.

Mathematical Representation of Simple Linear Regression

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

Where:

  • \(y_i\): Dependent variable
  • \(x_i\): Independent variable
  • \(\beta_0\): Intercept
  • \(\beta_1\): Slope
  • \(\epsilon_i\): Error term
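
Taken together, the independence, homoscedasticity, and normality assumptions listed earlier can be written as one distributional statement about the error term. The symbol \(\sigma^2\) for the common error variance is introduced here for illustration and is not part of the original notation:

\[ \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2) \quad \Longrightarrow \quad y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \; \sigma^2) \]

Read this way, the linearity assumption describes the mean of \(y_i\) at a given \(x_i\), and the other three assumptions describe how observations scatter around that mean.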

Dataset

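# Fix the random seed so the simulated data are reproducible
set.seed(42)
# 50 predictor values drawn from a normal distribution (mean 5, sd 2)
x <- rnorm(50, mean = 5, sd = 2)
# Response with true intercept 0, true slope 2, and standard normal errors
y <- 2 * x + rnorm(50)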

2D Visualization of Linear Regression

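library(ggplot2)

# Fit the simple linear regression of y on x
model <- lm(y ~ x)

# Scatter plot of the data with the fitted least-squares line
ggplot(data = data.frame(x, y), aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()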

3D Visualization

library(plotly)

# Pair each observation with its residual from the fitted model,
# so the z-axis shows actual residuals rather than unrelated random noise
plot_data <- data.frame(x = x, y = y, residual = residuals(model))

plot_ly(plot_data, x = ~x, y = ~y, z = ~residual, type = "scatter3d", mode = "markers") %>%
  layout(title = "3D Plot of Simple Linear Regression",
         scene = list(xaxis = list(title = 'X'),
                      yaxis = list(title = 'Y'),
                      zaxis = list(title = 'Residuals')))

Equation of the Best Fit Line

The equation of the best fit line is:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

Where:

  • \(\hat{\beta}_0\) is the intercept estimate
  • \(\hat{\beta}_1\) is the slope estimate
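
With a single predictor, the least-squares estimates have a well-known closed form. The slope estimate is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept estimate makes the line pass through the point of means. The following is a minimal sketch, assuming the same simulated data as in the Dataset section (re-created here so the block runs on its own). The names slope_hat and intercept_hat are illustrative only.

```r
# Re-create the simulated data (same seed and settings as the Dataset section)
set.seed(42)
x <- rnorm(50, mean = 5, sd = 2)
y <- 2 * x + rnorm(50)

# Closed-form least-squares estimates for one predictor
slope_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept_hat <- mean(y) - slope_hat * mean(x)

# The same estimates as computed by lm()
model <- lm(y ~ x)
coef(model)

# Check that the two approaches agree up to numerical tolerance
all.equal(unname(coef(model)), c(intercept_hat, slope_hat))
```

Because the data were simulated with a slope of 2 and no intercept term, the estimates should land near those values, although they will not match exactly in a sample of 50.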

Plot the residuals

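# Extract the residuals (observed minus fitted values) from the fitted model;
# stored as `res` to avoid masking the residuals() function
res <- residuals(model)

# Residuals against x: points should scatter evenly around the dashed zero line
ggplot(data = data.frame(x = x, residuals = res), aes(x = x, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()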
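
The residual plot above speaks mainly to linearity and constant variance. The normality assumption can be checked separately. The following is a rough sketch, assuming the same simulated data and model as before (re-created here so the block runs on its own). It uses base R's qqnorm(), qqline(), and shapiro.test() rather than ggplot2, which is a choice made for this example, not something the original slides prescribe.

```r
# Re-create the simulated data and fitted model
set.seed(42)
x <- rnorm(50, mean = 5, sd = 2)
y <- 2 * x + rnorm(50)
model <- lm(y ~ x)
res <- residuals(model)

# Normal Q-Q plot: points close to the reference line suggest
# approximately normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```

A formal test like this is only a complement to the plots. With small samples it has little power, and with large samples it can flag departures that are too small to matter.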