Simple Linear Regression

2026-04-12

What is Simple Linear Regression?

Simple linear regression is a statistical method that models how one variable (X) predicts another (Y) using a straight line. It involves two variables:

Independent variable (X): predictor
Dependent variable (Y): response

Simple linear regression estimates the expected value of Y given X via a fitted line.

Example Question: Can study time serve as a useful predictor of exam scores?

Simple Linear Regression Equation

Simple linear regression is modeled by: \[ Y = \beta_0 + \beta_1 X + \epsilon \] Where:

\(Y\) = response variable
\(\beta_0\) = intercept
\(\beta_1\) = slope
\(X\) = predictor variable
\(\epsilon\) = random error

Least Squares Method

In simple linear regression, we choose the line that best fits the data, meaning the line whose predictions are closest overall to the observed values. This technique is called the least squares method.

We estimate this line by minimizing the sum of squared residuals: \[ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \] Where:

\(y_i\) = actual value
\(\hat{y}_i\) = predicted value
\(n\) = number of observations
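
For reference, minimizing this sum has a well-known closed-form solution. Writing \(\bar{x}\) and \(\bar{y}\) for the sample means and \(\hat{\beta}_0\), \(\hat{\beta}_1\) for the estimated coefficients (notation added here for illustration, not used elsewhere in these slides), the least squares estimates are:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

Each predicted value is then \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\).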

Example Dataset

We will examine whether study time is a useful predictor of exam scores.

Independent variable (X): study hours
Dependent variable (Y): exam score

Scatterplot Code

library(ggplot2)

# Example data: hours studied and exam score for 10 observations
study_data <- data.frame(study_hours = c(1,2,3,4,5,6,7,8,9,10),
                         exam_score = c(52,55,61,64,68,72,78,85,88,94))

ggplot(study_data, aes(x = study_hours, y = exam_score)) + 
  geom_point(size = 3) + geom_smooth(method = "lm", color = "red") +
  labs(title = "Study Hours vs. Exam Scores", x = "Study Hours", 
       y = "Exam Score") +
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot

## `geom_smooth()` using formula = 'y ~ x'

Residual Visualization Code

ggplot(study_data, aes(x = study_hours, y = exam_score)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_segment(aes(xend = study_hours,
                   yend = predict(lm(exam_score ~ study_hours, 
                                     data = study_data))),
               color = "blue", linewidth = 1) +
  labs(title = "Study Hours vs. Exam Scores",
       x = "Study Hours", y = "Exam Score") +
  theme(plot.title = element_text(hjust = 0.5))

Residual Visualization

## `geom_smooth()` using formula = 'y ~ x'
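
The blue segments in the residual plot are the residuals themselves. As a rough sketch, they could also be inspected numerically. The data values are copied from the scatterplot code; the column names `predicted` and `residual` and the variable `residual_sum_of_squares` are illustrative choices, not part of the original slides:

```r
# Data copied from the scatterplot code
study_data <- data.frame(
  study_hours = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  exam_score  = c(52, 55, 61, 64, 68, 72, 78, 85, 88, 94)
)

model <- lm(exam_score ~ study_hours, data = study_data)

# Residual = observed score minus the score predicted by the fitted line
study_data$predicted <- unname(fitted(model))
study_data$residual  <- study_data$exam_score - study_data$predicted
print(study_data)

# The quantity that the least squares method minimizes
residual_sum_of_squares <- sum(study_data$residual^2)
print(residual_sum_of_squares)

# residuals() returns the same values directly
print(all.equal(unname(residuals(model)), study_data$residual))
```

For a least squares fit that includes an intercept, the residuals sum to zero up to floating-point error, which is why the segments fall on both sides of the line.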

Plotly Code

library(plotly)

# Fit the regression once so the fitted values can be reused for the line
model <- lm(exam_score ~ study_hours, data = study_data)

plot_ly(study_data, x = ~study_hours, y = ~exam_score, 
        type = 'scatter', mode = 'markers', name = "Data") %>%
  add_lines(x = ~study_hours, y = ~fitted(model), 
            name = "Regression Line") %>% 
  layout(title = list(text = "Study Hours vs. Exam Scores", x = 0.43), 
         xaxis = list(title = "Study Hours"), 
         yaxis = list(title = "Exam Score"))

Interactive Plotly Graph

Regression Results

## 
## Call:
## lm(formula = exam_score ~ study_hours, data = study_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0485 -0.7227 -0.2000  1.1333  1.5576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.8667     0.9045   50.71 2.53e-11 ***
## study_hours   4.6970     0.1458   32.22 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.324 on 8 degrees of freedom
## Multiple R-squared:  0.9924, Adjusted R-squared:  0.9914 
## F-statistic:  1038 on 1 and 8 DF,  p-value: 9.376e-10
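
The slides show this output without the call that produced it. A minimal sketch of how it could be reproduced, assuming the same `study_data` values as in the scatterplot code. The hand calculation is an added cross-check using the standard closed-form least squares formulas, and the names `slope_by_hand` and `intercept_by_hand` are illustrative:

```r
# Data copied from the scatterplot code
study_data <- data.frame(
  study_hours = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  exam_score  = c(52, 55, 61, 64, 68, 72, 78, 85, 88, 94)
)

# Fit the simple linear regression and print the full summary table
model <- lm(exam_score ~ study_hours, data = study_data)
print(summary(model))

# Cross-check the coefficients with the closed-form least squares formulas
x <- study_data$study_hours
y <- study_data$exam_score
slope_by_hand     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)

# Both pairs should agree up to floating-point error
print(c(intercept = intercept_by_hand, slope = slope_by_hand))
print(coef(model))
```

The hand-computed values should match the `Estimate` column of the table (about 45.87 for the intercept and 4.70 for the slope).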

Interpreting the Regression Results

Slope (\(\beta_1\)):
- Expected change in exam score for each additional hour of study (here, about 4.7 points).
\(R^2\):
- Proportion of variation explained by study time.
- A high \(R^2\) suggests study time explains much of the variation in exam scores.
- A low \(R^2\) suggests other factors may better explain variation in exam scores.
p-value (slope):
- Tests whether the slope differs significantly from zero.
- A small p-value (< 0.05) indicates that study time is a statistically significant predictor of exam scores.
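
The three quantities above can also be pulled out of the fitted model programmatically rather than read off the printed table. A short sketch, assuming the same data as before; the variable names (`fit_summary`, `slope_estimate`, `r_squared`, `slope_p_value`) are illustrative:

```r
# Data copied from the scatterplot code
study_data <- data.frame(
  study_hours = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  exam_score  = c(52, 55, 61, 64, 68, 72, 78, 85, 88, 94)
)

model <- lm(exam_score ~ study_hours, data = study_data)
fit_summary <- summary(model)

# Slope: expected change in exam score per additional study hour
slope_estimate <- coef(model)[["study_hours"]]

# R-squared: proportion of variation in exam scores explained by study hours
r_squared <- fit_summary$r.squared

# p-value for the test that the slope equals zero
slope_p_value <- fit_summary$coefficients["study_hours", "Pr(>|t|)"]

cat(sprintf("Slope:     %.4f\n", slope_estimate))
cat(sprintf("R-squared: %.4f\n", r_squared))
cat(sprintf("p-value:   %.3e\n", slope_p_value))
```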

Conclusion

Simple linear regression allows us to:

quantify the relationship between study time and exam scores
make predictions based on study hours
summarize the overall linear pattern in the data
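
As an illustration of the second point, predictions for new study times could be obtained with `predict()`. This is a sketch only: the values in `new_study_times` are hypothetical and were chosen to stay within the observed range of 1 to 10 hours, since the fitted line says little about what happens outside it.

```r
# Data copied from the scatterplot code
study_data <- data.frame(
  study_hours = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  exam_score  = c(52, 55, 61, 64, 68, 72, 78, 85, 88, 94)
)

model <- lm(exam_score ~ study_hours, data = study_data)

# Hypothetical study times (not from the original data)
new_study_times <- data.frame(study_hours = c(3.5, 5.5, 7.5))

# Point predictions from the fitted line
predicted_scores <- predict(model, newdata = new_study_times)
print(cbind(new_study_times, predicted_score = predicted_scores))

# 95% prediction intervals: columns fit, lwr, upr
prediction_intervals <- predict(model, newdata = new_study_times,
                                interval = "prediction", level = 0.95)
print(prediction_intervals)
```

The prediction at 5.5 hours equals the mean exam score (71.7), because a least squares line with an intercept always passes through the point of sample means.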

In this example, more study time is associated with higher predicted exam scores: the slope is positive (≈ 4.7 points per additional hour) and \(R^2\) is high (≈ 0.99).

The small p-value (≈ \(9.4 \times 10^{-10}\)) indicates that study time is a statistically significant predictor of exam scores.