2026-04-12

What is Simple Linear Regression?

Simple linear regression is a statistical method that uses a linear equation to model how one variable (X) predicts another (Y). The two variables are:

  • Independent variable (X): predictor
  • Dependent variable (Y): response

Simple linear regression estimates the expected value of Y given X via a fitted line.

Example Question: Can study time serve as a useful predictor of exam scores?

Simple Linear Regression Equation

Simple linear regression is modeled by:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

where:

  • \(Y\) = response variable
  • \(\beta_0\) = intercept
  • \(\beta_1\) = slope
  • \(X\) = predictor variable
  • \(\epsilon\) = random error
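
To make the role of each term concrete, the sketch below simulates data from this equation and then recovers the coefficients with lm(). The intercept (2), slope (0.5), error standard deviation (1), sample size, and seed are arbitrary values chosen for illustration; they are not taken from the study-time example.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)
n     <- 200
beta0 <- 2      # intercept
beta1 <- 0.5    # slope

x       <- runif(n, min = 0, max = 10)   # predictor
epsilon <- rnorm(n, mean = 0, sd = 1)    # random error
y       <- beta0 + beta1 * x + epsilon   # response built from the model equation

# Fit the line and compare the estimates with beta0 and beta1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates will not equal 2 and 0.5 exactly, because the random error term moves each observation off the true line.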

Least Squares Method

In simple linear regression, we choose the line that best fits the data.

We estimate this line by minimizing the sum of squared residuals:

\[ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

where:

  • \(y_i\) = actual value
  • \(\hat{y}_i\) = predicted value
  • \(n\) = number of observations

This method is called the least squares method.
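
For reference, minimizing this sum has a standard closed-form solution. The notation \(\hat{\beta}_0\) and \(\hat{\beta}_1\) (estimated intercept and slope) and \(\bar{x}\) and \(\bar{y}\) (sample means of the predictor and response) is introduced here for illustration and is not used elsewhere in these slides:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The predicted values are then \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\).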

Example Dataset

We will examine whether study time is a useful predictor of exam scores.

  • Independent variable (X): study hours
  • Dependent variable (Y): exam score

Scatterplot Code

library(ggplot2)

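# Study hours and exam scores for 10 observations
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))

# Scatterplot with the fitted least squares line in red
ggplot(study_data, aes(x = study_hours, y = exam_score)) + 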
  geom_point(size = 3) + geom_smooth(method = "lm", color = "red") +
  labs(title = "Study Hours vs. Exam Scores", x = "Study Hours", 
       y = "Exam Score") +
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot

## `geom_smooth()` using formula = 'y ~ x'

Residual Visualization Code

# Same scatterplot, with a blue segment from each point to the fitted line (its residual)
ggplot(study_data, aes(x = study_hours, y = exam_score)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_segment(aes(xend = study_hours,
                   yend = predict(lm(exam_score ~ study_hours, 
                                     data = study_data))),
               color = "blue", linewidth = 1) +
  labs(title = "Study Hours vs. Exam Scores",
       x = "Study Hours", y = "Exam Score") +
  theme(plot.title = element_text(hjust = 0.5))

Residual Visualization

## `geom_smooth()` using formula = 'y ~ x'
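
The blue segments in the plot are the residuals: the vertical gaps between each observed score and the fitted line. As a rough sketch, they can also be listed numerically. The object names fit and resid_table are illustrative and not part of the original slides.

```r
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))

fit <- lm(exam_score ~ study_hours, data = study_data)

# One row per observation: actual score, fitted score, and residual
resid_table <- data.frame(study_hours = study_data$study_hours,
                          actual      = study_data$exam_score,
                          fitted      = fitted(fit),
                          residual    = residuals(fit))
print(round(resid_table, 3))

# Sum of squared residuals, the quantity that least squares minimizes
sum(resid_table$residual^2)
```

Positive residuals mark points above the line, and negative residuals mark points below it.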

Plotly Code

library(plotly)

# Fit the regression once and reuse it for the fitted line
model <- lm(exam_score ~ study_hours, data = study_data)

plot_ly(study_data, x = ~study_hours, y = ~exam_score, 
        type = 'scatter', mode = 'markers', name = "Data") %>%
  add_lines(x = ~study_hours, y = ~fitted(model), 
            name = "Regression Line") %>% 
  layout(title = list(text = "Study Hours vs. Exam Scores", x = 0.43), 
         xaxis = list(title = "Study Hours"), 
         yaxis = list(title = "Exam Score"))

Interactive Plotly Graph

Regression Results

## 
## Call:
## lm(formula = exam_score ~ study_hours, data = study_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0485 -0.7227 -0.2000  1.1333  1.5576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.8667     0.9045   50.71 2.53e-11 ***
## study_hours   4.6970     0.1458   32.22 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.324 on 8 degrees of freedom
## Multiple R-squared:  0.9924, Adjusted R-squared:  0.9914 
## F-statistic:  1038 on 1 and 8 DF,  p-value: 9.376e-10
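
Output of this form is what summary() prints for an lm fit. The slides do not show the call that produced it, so the following is a sketch that assumes it came from the same model fitted in the Plotly code. The name model_summary is introduced here for illustration.

```r
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))

model <- lm(exam_score ~ study_hours, data = study_data)

# Full table: residual quantiles, coefficients, R-squared, F-statistic
model_summary <- summary(model)
print(model_summary)

# Individual pieces can also be pulled out by name
coef(model)                   # intercept and slope estimates
model_summary$coefficients    # estimates, standard errors, t values, p-values
model_summary$r.squared       # multiple R-squared
model_summary$sigma           # residual standard error
```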

Interpreting the Regression Results

  • Slope (\(\beta_1\)):
    • Expected change in exam score for a one-unit increase in study hours (here, about 4.7 points per additional hour).
  • \(R^2\):
    • Proportion of variability in exam scores explained by study time (here, about 0.99).
    • A high \(R^2\) indicates the model explains much of the variation.
    • A low \(R^2\) suggests other factors influence exam scores.
  • p-value (slope):
    • Tests whether the slope is significantly different from zero.
    • A small p-value (< 0.05) suggests study time is a useful predictor.
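
As a sketch of where these quantities come from, \(R^2\) can be recomputed from the residuals, and confint() gives an interval for the slope that complements the p-value. The variable names sse, sst, r_squared, and slope_ci are illustrative.

```r
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))

model <- lm(exam_score ~ study_hours, data = study_data)

# R-squared = 1 - (unexplained variation / total variation)
sse <- sum(residuals(model)^2)
sst <- sum((study_data$exam_score - mean(study_data$exam_score))^2)
r_squared <- 1 - sse / sst
r_squared

# 95% confidence interval for the slope; an interval that excludes 0
# agrees with a slope p-value below 0.05
slope_ci <- confint(model, "study_hours", level = 0.95)
slope_ci
```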

Conclusion

Simple linear regression allows us to:

  • quantify the relationship between study time and exam scores
  • make predictions based on study hours
  • summarize the overall linear pattern in the data

In this example, more study time is associated with higher predicted exam scores: the fitted slope is about 4.7 points per additional study hour, and \(R^2\) (≈ 0.99) is high.
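
A minimal sketch of the prediction step, using predict() on the fitted model. The study hours in new_students (3.5 and 7.5) are hypothetical inputs chosen for illustration, not values from the original data.

```r
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))

model <- lm(exam_score ~ study_hours, data = study_data)

# Hypothetical new observations to predict for
new_students <- data.frame(study_hours = c(3.5, 7.5))

# Point predictions (fit) with 95% prediction intervals (lwr, upr)
pred <- predict(model, newdata = new_students,
                interval = "prediction", level = 0.95)
pred
```

Both inputs lie inside the observed range of 1 to 10 study hours. Predictions outside that range extrapolate beyond the data and deserve more caution.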