2026-05-27

What is simple linear regression?

A method for modeling the relationship between two numeric variables — one you observe (\(X\)) and one you want to predict (\(Y\)).

Everyday examples:

  • Does temperature predict ice cream sales?
  • Does hours studied predict exam score?
  • Does penguin flipper length predict body mass?

If there’s a roughly straight-line relationship between \(X\) and \(Y\), simple linear regression helps you describe it, measure it, and use it for prediction.

In our code along, we work with a simple dataset — columns named X and Y — so you can focus on the method without worrying about the subject matter.

The regression equation

The model we fit is:

\[\hat{y} = b_0 + b_1 x\]

Symbol | Name | What it means
\(\hat{y}\) | Predicted value | Our best guess of \(Y\) for a given \(X\)
\(b_0\) | Intercept | The predicted \(Y\) when \(X = 0\)
\(b_1\) | Slope | How much \(\hat{y}\) changes for each 1-unit increase in \(X\)

The slope is the key number. A positive slope means \(Y\) tends to rise with \(X\); a negative slope means it falls.

Reading a slope in plain English

Example from OpenStax (Introductory Statistics 2e):

Researchers used each student’s third-exam score (\(x\)) to predict that student’s final-exam score (\(y\)). The fitted line was:

\[\hat{y} = -173.51 + 4.83\,x\]

“For each additional point on the third exam, the predicted final-exam score increases by 4.83 points, on average.”

That’s all slope interpretation is: fill in the units and say what happens to \(Y\) when \(X\) goes up by 1.
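To see that reading in numbers, the OpenStax line can be evaluated directly in R. This is only a small sketch: the helper name predict_final and the example scores 73 and 74 are picked here for illustration and are not part of the code along.

```r
# Coefficients of the OpenStax fitted line
b0 <- -173.51  # intercept
b1 <- 4.83     # slope

# Hypothetical helper: predicted final-exam score for a third-exam score x
predict_final <- function(x) b0 + b1 * x

predict_final(73)                      # 179.08
predict_final(74) - predict_final(73)  # 4.83, which is the slope
```

Raising \(x\) by one point changes the prediction by exactly the slope, no matter which starting score you pick.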

What does the data look like?

Here is a scatter plot of the data from our code along.

Does the relationship look roughly linear? That’s the first thing to check before fitting any model.
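A scatter plot like this takes one line of base R. The sketch below uses a small made-up data frame as a stand-in, since the real code-along data is not reproduced here; only the column names X and Y come from the code along.

```r
# Hypothetical stand-in for the code-along data frame (columns X and Y)
df <- data.frame(
  X = 1:10,
  Y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)

# Scatter plot: one point per row, X on the horizontal axis, Y on the vertical
plot(df$X, df$Y, xlab = "X", ylab = "Y", main = "Y versus X")
```

If the points bend, fan out, or form clusters, a straight line is probably not the right description.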

Fitting the model in R

# Fit the linear regression model
model <- lm(Y ~ X, data = df)

# View the results
summary(model)

lm(Y ~ X, data = df) tells R: predict Y using X, from the data frame df.

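summary(model) prints the intercept and slope (the Estimate column of the coefficient table) along with \(R^2\) and adjusted \(R^2\). The slope, intercept, and adjusted \(R^2\) are the three values you’ll report on Canvas this week.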
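If you would rather pull those numbers out than read them off the printout, something like the following should work. It is a sketch on a made-up data frame (the values in df and the name fit_summary are assumptions for illustration); coef() and the r.squared and adj.r.squared components are standard parts of an lm fit and its summary.

```r
# Hypothetical stand-in for the code-along data frame (columns X and Y)
df <- data.frame(
  X = 1:10,
  Y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)

model <- lm(Y ~ X, data = df)
fit_summary <- summary(model)

coef(model)["(Intercept)"]  # intercept, b0
coef(model)["X"]            # slope, b1
fit_summary$r.squared       # R-squared
fit_summary$adj.r.squared   # adjusted R-squared
```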

How well does the model fit? — \(R^2\)

\(R^2\) (R-squared) measures the proportion of the variation in \(Y\) that is explained by \(X\):

\[R^2 = 1 - \frac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}}\]

  • \(R^2 = 0\) → a straight line in \(X\) explains none of the variation in \(Y\)
  • \(R^2 = 1\) → \(X\) perfectly predicts \(Y\)
  • In practice, somewhere in between

OpenStax example: \(R^2 = 0.44\) means 44% of the variation in final-exam scores is explained by third-exam scores — the other 56% comes from things not in the model.

We report Adjusted \(R^2\), which corrects \(R^2\) for the number of predictors relative to the sample size. With a single predictor the correction is small.
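Both quantities can be rebuilt by hand from the sums of squares, which is a useful check on what summary() reports. The sketch below assumes a made-up data frame, and the names rss, tss, r2, and adj_r2 are chosen here; the adjusted formula shown is the usual one for a model with an intercept and \(p\) predictors.

```r
# Hypothetical stand-in for the code-along data frame (columns X and Y)
df <- data.frame(
  X = 1:10,
  Y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)
model <- lm(Y ~ X, data = df)

rss <- sum(resid(model)^2)         # residual sum of squares
tss <- sum((df$Y - mean(df$Y))^2)  # total sum of squares
r2  <- 1 - rss / tss

n <- nrow(df)  # number of observations
p <- 1         # number of predictors (just X)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both comparisons should return TRUE
all.equal(r2, summary(model)$r.squared)
all.equal(adj_r2, summary(model)$adj.r.squared)
```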

Checking the residuals

A residual is the gap between what actually happened and what the model predicted (observed minus predicted): \[e_i = y_i - \hat{y}_i\]

Good residuals look random and bell-shaped — no pattern, centered at zero.

In our code along (Chunk 3), you’ll calculate these residuals and report the value for the first observation on Canvas.
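One way to compute and eyeball residuals is sketched below. The data frame is made up and the names res and manual are assumptions; this is not necessarily how Chunk 3 does it. resid() and fitted() are the standard accessors for an lm fit.

```r
# Hypothetical stand-in for the code-along data frame (columns X and Y)
df <- data.frame(
  X = 1:10,
  Y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)
model <- lm(Y ~ X, data = df)

res    <- resid(model)            # residuals as stored by lm
manual <- df$Y - fitted(model)    # observed minus predicted, by hand

res[1]     # residual for the first observation
mean(res)  # essentially zero for a least-squares fit with an intercept

# Visual checks: roughly bell-shaped, and no pattern against fitted values
hist(res, main = "Residuals", xlab = "Residual")
plot(fitted(model), res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)  # dashed reference line at zero
```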

Predicting beyond the data — use caution!

In our code along (Chunk 4), you predict \(Y\) for \(X = 21\), \(22\), and \(23\) — which may be outside the range of the training data.

The dashed blue line marks extrapolation. Predictions become less reliable the further you go beyond what was observed.
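A sketch of the prediction step, with a check on whether the new values fall outside the observed range. The data frame is made up (its \(X\) runs only from 1 to 10, so all three new values are extrapolations here), and new_x and preds are names chosen for illustration; your own data may cover a different range.

```r
# Hypothetical stand-in for the code-along data frame (columns X and Y)
df <- data.frame(
  X = 1:10,
  Y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)
model <- lm(Y ~ X, data = df)

# New X values must be in a data frame with the same column name as the predictor
new_x <- data.frame(X = c(21, 22, 23))
preds <- predict(model, newdata = new_x)
preds

range(df$X)                              # smallest and largest observed X
new_x$X < min(df$X) | new_x$X > max(df$X)  # TRUE where we are extrapolating
```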

Before you submit

Double-check each item before knitting:

  • ☐ File renamed: CodeAlong5-YourLastNameFirstInitial.Rmd
  • ☐ Canvas quiz: slope, intercept, adjusted \(R^2\) (Chunk 2)
  • ☐ Canvas quiz: residual for row 1 (Chunk 3)
  • ☐ Canvas quiz: predicted \(Y\) when \(X = 22\) (Chunk 4)
  • ☐ Click Knit → Knit to HTML and confirm no errors top to bottom
  • ☐ Submit the HTML file: CodeAlong5-YourLastNameFirstInitial.html

Resources