2025-11-09

Simple Linear Regression

According to IBM (2025), linear regression analysis is used to predict the value of one variable based on the value of another. The variable you want to predict is called the dependent variable, and the variable used to predict it is called the independent variable (IBM, 2025). The “simple” part comes from using only one predictor variable, as opposed to multiple linear regression, which uses several predictor variables (2.1 - What Is Simple Linear Regression? | STAT 462, n.d.).

The Linear Model

The model for a Simple Linear Regression is expressed by the formula

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

  • \(y_i\) is the value of the dependent variable for observation \(i\).
  • \(x_i\) is the value of the independent variable for observation \(i\).
  • \(\beta_0\) is the y-intercept, the expected value of y when x is 0.
  • \(\beta_1\) is the slope coefficient, representing the average change in y for a one-unit increase in x.
  • \(\epsilon_i\) is the random error term, which accounts for the variability in y that the linear term \(\beta_0 + \beta_1 x_i\) does not explain.
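
To make the roles of the terms concrete, here is a minimal sketch that simulates data from the model and then recovers the coefficients. The values chosen for the intercept, slope, error standard deviation, and sample size are illustrative assumptions, not values from any dataset discussed here.

```r
# Simulate n observations from y_i = beta_0 + beta_1 * x_i + epsilon_i.
# All numeric values below are illustrative assumptions.
set.seed(1)
n      <- 200
beta_0 <- 2     # assumed intercept
beta_1 <- 0.5   # assumed slope
sigma  <- 1     # assumed standard deviation of the error term

x       <- runif(n, min = 0, max = 10)     # independent variable
epsilon <- rnorm(n, mean = 0, sd = sigma)  # random error term
y       <- beta_0 + beta_1 * x + epsilon   # dependent variable

# Fit a simple linear regression to the simulated data
sim_fit <- lm(y ~ x)
coef(sim_fit)  # estimates should land near the assumed beta_0 and beta_1
```

Because the data contain random error, the estimates will be close to the assumed coefficients but not identical to them.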

Estimating the coefficients (OLS)

The unknown coefficients \(\beta_0\) and \(\beta_1\) are estimated using the Ordinary Least Squares (OLS) method. OLS minimizes the sum of the squared differences between the observed values (\(y_i\)) and the predicted values (\(\hat{y}_i\)).

The estimated regression line is:

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

The OLS formulas for the estimated coefficients are:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
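
As a quick check of these formulas, the sketch below computes both estimates by hand on R's built-in cars dataset and compares them with the coefficients returned by lm(). The variable names (x_bar, beta_1_hat, and so on) are chosen here for illustration.

```r
# Compute the OLS estimates directly from the closed-form formulas
x <- cars$speed
y <- cars$dist

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta_1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta_0_hat <- y_bar - beta_1_hat * x_bar

manual  <- c(intercept = beta_0_hat, slope = beta_1_hat)
builtin <- coef(lm(dist ~ speed, data = cars))

manual
all.equal(unname(manual), unname(builtin))  # TRUE when both approaches agree
```

For this dataset the slope is about 3.93 and the intercept about -17.58, so each additional mile per hour of speed is associated with roughly 3.9 more feet of stopping distance on average.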

Assumptions of Simple Linear Regression

For the model to yield valid inferences (such as confidence intervals and hypothesis tests), several key assumptions about the random error \(\epsilon_i\) must hold:

  • Linearity: The true relationship between X and Y is linear.
  • Independence: The observations are independent of each other (errors are uncorrelated).
  • Normality: The errors are normally distributed for any given value of X.
  • Equal Variance (Homoscedasticity): The variance of the errors is constant across all levels of the independent variable X.
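
These assumptions are usually checked informally on the residuals of a fitted model. The following is a rough sketch using base R and the built-in cars dataset; the choice of diagnostics (a residuals-versus-fitted plot, a normal Q-Q plot, and a Shapiro-Wilk test) is one common option rather than the only one, and independence is typically judged from how the data were collected rather than from a plot.

```r
# Fit the model and extract residuals and fitted values
fit         <- lm(dist ~ speed, data = cars)
res         <- resid(fit)
fitted_vals <- fitted(fit)

# Linearity and equal variance: residuals should scatter evenly around 0
# with no curve or funnel shape
plot(fitted_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test; a small p-value suggests departure from normality
sw <- shapiro.test(res)
sw$p.value
```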

Example: The cars Dataset

library(ggplot2)
data("cars")
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "red", size = 2) + 
  labs(
    title = "Car Speed VS Stopping Distance",
    x = "Speed (MPH)",
    y = "Stopping Distance (ft)"
  )

The line of best fit

We can overlay the OLS line on the scatter plot.

ggplot(cars, aes(x = speed, y = dist)) + 
  geom_point(color = "red", size = 2) +
  # method = "lm" draws the OLS line; se = TRUE adds the confidence band
  geom_smooth(method = "lm", se = TRUE, color = "green", fill = "yellow") +
  labs(
    title = "Car Speed VS Stopping Distance",
    x = "Speed (MPH)",
    y = "Stopping Distance (ft)"
  )

Cars dataset using Plotly

library(plotly)

carsSLR <- lm(dist ~ speed, data = cars)

p <- plot_ly(data = cars, x = ~speed, y = ~dist, type = 'scatter', mode = 'markers',
             marker = list(size = 10, color = 'orange',
                           line = list(color = 'blue', width = 3)),
             name = "Observed Data") %>%
  layout(
    title = "Car Speed Vs Stopping Distance",
    xaxis = list(title = 'Speed (MPH)'),
    yaxis = list(title = 'Stopping Distance (ft)')
  )

p

Adding regression line

ggplotly(
  ggplot(cars, aes(x = speed, y = dist)) +
    geom_point(color = "blue", size = 1.5) +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    labs(title = 'Car Speed vs. Stopping Distance')
)
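
Beyond plotting, the fitted model object can be inspected and used for prediction. The sketch below refits the same model so that it runs on its own; the speeds in new_speeds are arbitrary values picked for illustration.

```r
# Same model as above, refit here so the example is self-contained
carsSLR <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope
coef(carsSLR)

# Proportion of the variance in stopping distance explained by speed
r_squared <- summary(carsSLR)$r.squared
r_squared

# Predicted stopping distance, with 95% confidence intervals for the mean
# response, at a few illustrative speeds
new_speeds <- data.frame(speed = c(10, 15, 20))
pred <- predict(carsSLR, newdata = new_speeds, interval = "confidence")
cbind(new_speeds, pred)
```

The columns fit, lwr, and upr hold the point prediction and the lower and upper bounds of the interval for each speed.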