Simple Linear Regression

2026-04-12

What is Simple Linear Regression?

Simple linear regression is a statistical method that models how one variable (X) predicts another (Y) using a linear equation. The two variables are:

Independent variable (X): predictor
Dependent variable (Y): response

Simple linear regression estimates the expected value of Y given X via a fitted line.

Example Question: Can study time serve as a useful predictor of exam scores?

Simple Linear Regression Equation

Simple linear regression is modeled by: \[ Y = \beta_0 + \beta_1 X + \epsilon \] Where:

\(Y\) = response variable
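\(\beta_0\) = intercept (expected value of \(Y\) when \(X = 0\))
\(\beta_1\) = slope (expected change in \(Y\) for a one-unit increase in \(X\))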
\(X\) = predictor variable
\(\epsilon\) = random error
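
To make the role of each term concrete, the sketch below simulates data from the model and then fits a line to it. The parameter values (50 and 5), the error standard deviation (2), the sample size, and the seed are hypothetical choices made only for illustration; they do not come from the example dataset used later.

```r
# Simulate data from Y = beta0 + beta1 * X + epsilon.
# All numeric settings below are hypothetical, chosen for illustration only.
set.seed(42)
beta0 <- 50   # intercept
beta1 <- 5    # slope
n <- 200      # number of simulated observations

x <- runif(n, min = 0, max = 10)        # predictor values
epsilon <- rnorm(n, mean = 0, sd = 2)   # random error
y <- beta0 + beta1 * x + epsilon        # response generated by the model

# Fit a line to the simulated data and inspect the estimated coefficients
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Because the data were generated from a known line plus noise, the estimated intercept and slope should land close to, but not exactly on, the values used to generate them.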

Least Squares Method

In simple linear regression, we choose the line that best fits the data.

We estimate this line by minimizing the sum of squared residuals: \[ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \] Where:

\(y_i\) = actual value
\(\hat{y}_i\) = predicted value
\(n\) = number of observations

This method is called the least squares method.
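
For a single predictor, the minimizing line has a standard closed-form solution: \(\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). As a minimal sketch, the code below applies these formulas to the study-hours data from the example dataset in this document and compares the result with `lm()`. The variable names (`b0`, `b1`, `sse`, `sse_other`) are illustrative.

```r
# Study-hours data from the example dataset in this document
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(52, 55, 61, 64, 68, 72, 78, 85, 88, 94)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Sum of squared residuals for the least squares line
y_hat <- b0 + b1 * x
sse <- sum((y - y_hat)^2)

# Any other line gives a larger sum; here the slope is nudged by 0.5
sse_other <- sum((y - (b0 + (b1 + 0.5) * x))^2)

# The hand-computed estimates should agree with lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
c(intercept = b0, slope = b1, sse = sse, sse_other = sse_other)
```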

Example Dataset

We will examine whether study time is a useful predictor of exam scores.

Independent variable (X): study hours
Dependent variable (Y): exam score

Scatterplot Code

library(ggplot2)

study_data = data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                        exam_score = c(52,55,61,64,68,72,78,85,88,94))

ggplot(study_data, aes(x = study_hours, y = exam_score)) + 
  geom_point(size = 3) + geom_smooth(method = "lm", color = "red") +
  labs(title = "Study Hours vs. Exam Scores", x = "Study Hours", 
       y = "Exam Score") +
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot

## `geom_smooth()` using formula = 'y ~ x'

Residual Visualization Code

ggplot(study_data, aes(x = study_hours, y = exam_score)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_segment(aes(xend = study_hours,
                   yend = predict(lm(exam_score ~ study_hours, 
                                     data = study_data))),
               color = "blue", linewidth = 1) +
  labs(title = "Study Hours vs. Exam Scores",
       x = "Study Hours", y = "Exam Score") +
  theme(plot.title = element_text(hjust = 0.5))

Residual Visualization

## `geom_smooth()` using formula = 'y ~ x'
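
The vertical segments in the residual plot can also be inspected as numbers. The sketch below rebuilds the same data and model, then tabulates each fitted value and residual. The names `resid_table` and `worst` are illustrative and do not appear elsewhere in this document.

```r
# Rebuild the example data and model so this block runs on its own
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))
model <- lm(exam_score ~ study_hours, data = study_data)

# One row per observation: actual score, fitted score, residual
resid_table <- data.frame(study_hours = study_data$study_hours,
                          exam_score  = study_data$exam_score,
                          fitted      = fitted(model),
                          residual    = residuals(model))
resid_table

# With an intercept in the model, least squares residuals sum to
# (numerically) zero
sum(resid_table$residual)

# Observation whose point lies farthest from the fitted line
worst <- resid_table[which.max(abs(resid_table$residual)), ]
worst
```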

Plotly Code

library(plotly)

model = lm(exam_score ~ study_hours, data = study_data)

plot_ly(study_data, x = ~study_hours, y = ~exam_score, 
        type = 'scatter', mode = 'markers', name = "Data") %>%
  add_lines(x = ~study_hours, y = ~fitted(model), 
            name = "Regression Line") %>% 
  layout(title = list(text = "Study Hours vs. Exam Scores", x = 0.43), 
         xaxis = list(title = "Study Hours"), 
         yaxis = list(title = "Exam Score"))

Interactive Plotly Graph

Regression Results

## 
## Call:
## lm(formula = exam_score ~ study_hours, data = study_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0485 -0.7227 -0.2000  1.1333  1.5576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.8667     0.9045   50.71 2.53e-11 ***
## study_hours   4.6970     0.1458   32.22 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.324 on 8 degrees of freedom
## Multiple R-squared:  0.9924, Adjusted R-squared:  0.9914 
## F-statistic:  1038 on 1 and 8 DF,  p-value: 9.376e-10
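
The output above has the layout that `summary()` prints for an `lm` fit. As a sketch, assuming the same data and model as in the earlier code, the following reproduces it and pulls out individual quantities. The name `model_summary` is illustrative.

```r
# Rebuild the example data and model so this block runs on its own
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))
model <- lm(exam_score ~ study_hours, data = study_data)

# Full regression summary (coefficients, residual spread, R-squared, F test)
model_summary <- summary(model)
model_summary

# Individual pieces of the summary
coef(model_summary)        # matrix: Estimate, Std. Error, t value, Pr(>|t|)
model_summary$r.squared    # multiple R-squared
model_summary$sigma        # residual standard error

# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)
```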

Interpreting the Regression Results

Slope (\(\beta_1\)):
- Expected change in exam score for a one-unit increase in study hours.
- Here, each additional hour of study is associated with an increase of about 4.70 points in the predicted exam score.
\(R^2\):
- Proportion of the variability in exam scores explained by study time.
- A high \(R^2\) (here 0.9924) indicates the model explains most of the variation.
- A low \(R^2\) would suggest that other factors influence exam scores.
p-value (slope):
- Tests the null hypothesis that the slope is zero.
- A small p-value (< 0.05; here 9.38e-10) suggests study time is a useful predictor.
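
To show where these quantities come from, the sketch below recomputes \(R^2\) and the slope's t-statistic and p-value by hand, assuming the same data and model as in the earlier code. The intermediate names (`sse`, `sst`, `est`, `se`) are illustrative.

```r
# Rebuild the example data and model so this block runs on its own
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))
model <- lm(exam_score ~ study_hours, data = study_data)

# R-squared: 1 minus (unexplained variation / total variation)
sse <- sum(residuals(model)^2)
sst <- sum((study_data$exam_score - mean(study_data$exam_score))^2)
r_squared <- 1 - sse / sst

# t-statistic for the slope: estimate divided by its standard error
est <- coef(summary(model))["study_hours", "Estimate"]
se  <- coef(summary(model))["study_hours", "Std. Error"]
t_value <- est / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df.residual(model), lower.tail = FALSE)

c(r_squared = r_squared, t_value = t_value, p_value = p_value)
```

These values should agree with the `Multiple R-squared`, `t value`, and `Pr(>|t|)` entries in the regression output.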

Conclusion

Simple linear regression allows us to:

quantify the relationship between study time and exam scores
make predictions based on study hours
summarize the overall linear pattern in the data

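In this example, more study time is associated with higher predicted exam scores: each additional hour of study corresponds to about 4.7 more points (slope ≈ 4.7), and the fitted line explains about 99% of the variation in scores (\(R^2\) ≈ 0.99).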
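
As a sketch of the prediction step, the code below uses the fitted model to predict exam scores for new study times. The values in `new_students` are hypothetical and were chosen to stay inside the observed range of 1 to 10 hours, since the fitted line is only supported by data in that range.

```r
# Rebuild the example data and model so this block runs on its own
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))
model <- lm(exam_score ~ study_hours, data = study_data)

# Hypothetical new study times (not part of the original data)
new_students <- data.frame(study_hours = c(2.5, 5.5, 7.5))

# Point predictions from the fitted line
predicted <- predict(model, newdata = new_students)
predicted

# Point predictions with 95% prediction intervals for individual scores
pred_int <- predict(model, newdata = new_students,
                    interval = "prediction", level = 0.95)
pred_int
```

The prediction interval is wider than a confidence interval for the mean response because it also accounts for the random error in an individual score.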