9/13/2023

What is Linear Regression?

Linear regression is a statistical technique used to model the strength and direction of the association between a dependent variable (DV) and one or more independent variables (IVs). It assumes a linear relationship between the IVs and the DV, meaning that a change in an IV results in a proportional change in the DV.

  • Prediction: Linear regression can predict the value of the DV from the values of one or more IVs.
  • Inference: Linear regression can be used to understand the relationship between a response (DV) and a predictor (IV) by determining how changes in the IV are associated with changes in the DV, usually by examining the coefficients (slope and intercept) of the regression equation.
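
As a minimal sketch of both uses, the following R snippet fits a line to the built-in `cars` dataset (stopping distance versus speed). The dataset and the query speed of 15 mph are illustrative choices, not part of the original slides.

```r
# Illustrative only: `cars` (speed in mph, stopping distance in ft) is an
# assumed example dataset that ships with base R.
fit <- lm(dist ~ speed, data = cars)

# Prediction: estimate the DV for a new value of the IV.
new_car <- data.frame(speed = 15)
predicted_dist <- predict(fit, newdata = new_car)
print(predicted_dist)

# Inference: inspect the estimated coefficients and their uncertainty.
estimates <- coef(fit)                     # intercept and slope
intervals <- confint(fit, level = 0.95)    # 95% confidence intervals
print(estimates)
print(intervals)
```

The slope in `estimates` is the estimated change in stopping distance per additional mph, and `intervals` indicates how precisely that change is estimated.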

Simple Linear Regression

Relates a dependent variable (Y) to a single independent variable (X) through a linear equation:

\(Y = \beta_0 + \beta_1 X + \epsilon\)

  • \(Y\) = Dependent Variable
  • \(X\) = Independent Variable
  • \(\beta_{0}\) is the intercept (value of \(Y\) when \(X = 0\))
  • \(\beta_{1}\) is the slope (change in \(Y\) for a one-unit change in \(X\))
  • \(\epsilon\) represents the error term (deviation of the observed \(Y\) from the line \(\beta_0 + \beta_1 X\))
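
One way to see what each symbol does is to simulate data from the equation and check that a fitted line recovers the parameters. This is only a sketch: the values `beta0 = 2`, `beta1 = 0.5`, the sample size, and the noise level are all assumptions made for illustration.

```r
# Assumed "true" parameters, chosen only for illustration.
set.seed(42)
n     <- 200
beta0 <- 2      # intercept
beta1 <- 0.5    # slope

x   <- runif(n, min = 0, max = 10)   # independent variable
eps <- rnorm(n, mean = 0, sd = 1)    # error term
y   <- beta0 + beta1 * x + eps       # dependent variable

fit <- lm(y ~ x)
print(coef(fit))   # estimates should land near beta0 and beta1

# Residuals are the sample counterpart of the error term:
# observed Y minus fitted Y.
res <- y - fitted(fit)
```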

Best Fit Line

Linear regression finds the best-fitting line by minimizing the sum of the squared differences between the observed and predicted values (ordinary least squares):

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2 \]

  • \(Y_{i}\): Observed value of the DV for the \(i\)-th data point
  • \(X_{i}\): Value of the IV for the \(i\)-th data point
  • \(\beta_{0}\): Intercept (constant) term of the regression line
  • \(\beta_{1}\): Slope of the regression line
  • \(n\): Number of points in the dataset
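
For simple linear regression this minimization has a standard closed-form solution: the slope is the sum of cross-deviations of \(X\) and \(Y\) divided by the sum of squared deviations of \(X\), and the intercept is \(\bar{Y} - \beta_1 \bar{X}\). The sketch below checks that against `lm()` on five made-up points; the data and the helper `sse()` are hypothetical and exist only for this example.

```r
# Made-up data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Objective from the formula above: sum of squared differences.
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Closed-form least-squares estimates.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
print(c(intercept = b0, slope = b1))   # 0.05 and 1.99 for these points

# lm() minimizes the same objective, so the estimates should agree.
fit <- lm(y ~ x)
print(coef(fit))

# Moving away from the estimates increases the objective.
print(sse(b0, b1))
print(sse(b0 + 0.1, b1))
print(sse(b0, b1 + 0.1))
```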

Another Best Fit Line

This plot uses R's built-in faithful dataset, recorded at the famous Old Faithful geyser in Yellowstone National Park. The best fit line can be used to predict eruption duration from waiting time.
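
A minimal sketch of that prediction in base R follows. The 80-minute waiting time is a hypothetical query value, not one taken from the slides.

```r
# `faithful` ships with base R: eruptions (min) and waiting (min).
fit <- lm(eruptions ~ waiting, data = faithful)
print(coef(fit))

# Predict eruption duration for a hypothetical 80-minute wait,
# together with a 95% prediction interval.
new_wait <- data.frame(waiting = 80)
pred <- predict(fit, newdata = new_wait, interval = "prediction")
print(pred)
```

The `fit` column of `pred` is the point on the best fit line at 80 minutes; `lwr` and `upr` bound the range in which a single new eruption would be expected to fall.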

Linear Regression Plot With Grouped Data

This is the same plot as on the previous slide, but the waiting times have been grouped into 10-minute intervals.

Code for Eruption Dataset Grouped by Time Intervals

library(ggplot2)

# Bin waiting times into 10-minute intervals: [40,50), [50,60), ..., [90,100)
data(faithful)
faithful$wait_intervals <- cut(faithful$waiting, 
  breaks = seq(40, 100, by = 10), right = FALSE)

# Scatter plot colored by interval, with a least-squares line per interval
ggplot(faithful, aes(x = waiting, 
  y = eruptions, color = wait_intervals)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(
      title = "Eruption Durations vs. Waiting Times with Grouped Points",
      x = "Waiting Time (minutes)",
      y = "Eruption Duration (minutes)",
      color = "Wait Time Intervals"   # legend title
    )
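
Because `color = wait_intervals` is mapped in the top-level `aes()`, `geom_smooth()` fits one line per interval rather than a single overall line. A rough base R sketch of those per-interval fits, assuming the same breaks as above (the names `by_interval` and `group_coefs` are made up for this example):

```r
data(faithful)
faithful$wait_intervals <- cut(faithful$waiting,
  breaks = seq(40, 100, by = 10), right = FALSE)

# Fit a separate simple linear regression within each interval.
by_interval <- split(faithful, faithful$wait_intervals)
group_coefs <- sapply(by_interval, function(d) {
  coef(lm(eruptions ~ waiting, data = d))
})

# One column per interval; rows are the intercept and the slope.
print(round(group_coefs, 3))
```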

Another Linear Regression Plot

This plot uses R's built-in Puromycin dataset, which records the rate of an enzymatic reaction as a function of substrate concentration for cells treated and untreated with the antibiotic puromycin.