This practice document picks up where Tuesday’s lecture left off. In lecture, we learned that correlation gives us direction and strength, but not how much. Regression gives us a line — a predicted value of Y for any value of X, in real units.
Today, you’ll fit your first regression model on real Canadian data, interpret the output, and practice using the regression equation to make predictions.
library(tidyverse)
library(broom)
cchs <- read_csv("data/cchs_teaching.csv")
We are using the 2022 Canadian Community Health Survey (CCHS), a major national health survey conducted by Statistics Canada. This file contains responses from over 67,000 Canadians on topics including life satisfaction, health, stress, community belonging, and socioeconomic circumstances.
Several variables in this survey code health outcomes so that lower numbers mean better outcomes (e.g., 1 = “Excellent,” 5 = “Poor”). This creates negative correlations and negative slopes that are technically correct but counterintuitive — “more health” showing up as a smaller number is confusing when you’re learning regression for the first time.
To make interpretation easier, we’ll reverse-code the relevant variables so that higher numbers always mean better outcomes. The code below does this once, and we’ll use the recoded versions for the rest of the document.
cchs <- cchs %>%
mutate(
# Self-rated mental health: original 1=Excellent ... 5=Poor
# Reversed: 1=Poor ... 5=Excellent
mental_health = 6 - GEN_05,
# Self-rated general health: original 1=Excellent ... 5=Poor
# Reversed: 1=Poor ... 5=Excellent
general_health = 6 - GEN_01,
# Perceived life stress: original 1=Not at all ... 5=Extremely
# Reversed so higher = less stressed (more relaxed)
low_stress = 6 - GEN_10,
# Community belonging: original 1=Very strong ... 4=Very weak
# Reversed: 1=Very weak ... 4=Very strong
belonging = 5 - GEN_20,
# Life satisfaction: already coded 0-10 where higher = more satisfied
# Just rename for clarity
life_satisfaction = LSM_01
)
What just happened? We created new variables with
reversed scales. For instance, someone who answered “Excellent” (coded 1
in the original) now has a value of 5 on mental_health.
This way, a positive slope means “more of X is associated with more of
Y” — which is how most people naturally think about relationships.
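To see why subtracting from 6 reverses a 1-5 scale, here is a minimal sketch on made-up values. The vectors `original` and `outcome` are hypothetical, not CCHS data. It suggests that reverse-coding flips the sign of a correlation without changing its size.

```r
# Hypothetical 1-5 responses where 1 = "Excellent" and 5 = "Poor"
original <- c(1, 2, 3, 4, 5)

# Reverse-code: 6 - x maps 1 -> 5, 2 -> 4, ..., 5 -> 1
reversed <- 6 - original
reversed

# A made-up outcome that is higher when health is better
outcome <- c(9, 8, 6, 5, 3)

# Same strength, opposite sign
r_original <- cor(original, outcome)
r_reversed <- cor(reversed, outcome)
c(original = r_original, reversed = r_reversed)
```

The same idea applies to the 4-point belonging item, where the constant is 5 instead of 6.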
Survey data uses special codes for missing or refused responses. We need to filter these out before analyzing. The CCHS uses codes like 9, 96, 99, and 999 to indicate “not stated” or “not applicable,” depending on the variable.
We’ll clean the variables we need as we go, but here’s the general principle: always check the codebook, then filter before calculating.
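As a rough illustration of why filtering comes first, the sketch below uses a handful of invented responses (the vector `responses` is hypothetical) in which 99 stands in for a "not stated" code. Leaving the code in badly distorts the mean.

```r
# Hypothetical 0-10 life satisfaction responses; 99 stands in for "not stated"
responses <- c(7, 8, 6, 99, 9)

# Naive mean treats 99 as if it were a real score
mean_with_codes <- mean(responses)

# Keep only values inside the valid 0-10 range, then compute the mean
valid <- responses[responses >= 0 & responses <= 10]
mean_valid <- mean(valid)

c(with_codes = mean_with_codes, valid_only = mean_valid)
```

Here the naive mean is 25.8, which is impossible on a 0-10 scale, while the filtered mean is 7.5.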
Before we build a regression line, let’s start with what we already know.
Does a stronger sense of belonging to one’s community go with higher life satisfaction? Let’s look.
# Filter out not-stated codes
cor_data <- cchs %>%
filter(life_satisfaction <= 10, # Remove code 99 (not stated)
belonging >= 1 & belonging <= 4) # Keep valid responses only
Before computing the correlation, let’s use a tool from Arc 1 — boxplots — to compare life satisfaction across belonging groups.
cor_data %>%
mutate(belonging_label = factor(belonging,
levels = 1:4,
labels = c("Very Weak", "Somewhat Weak", "Somewhat Strong", "Very Strong"))) %>%
ggplot(aes(x = belonging_label, y = life_satisfaction, fill = belonging_label)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.1) +
scale_fill_manual(values = c("#d4d4d4", "#b8cfe0", "#4A6FA5", "#2c4a7c")) +
labs(
title = "Life Satisfaction by Community Belonging",
subtitle = "CCHS 2022 | Medians and IQRs across belonging groups",
x = "Sense of Community Belonging",
y = "Life Satisfaction (0-10)"
) +
theme_minimal() +
theme(legend.position = "none")
Notice how the median (the thick line inside each box) shifts upward as belonging gets stronger. The IQR (the height of each box) also changes — people with weaker belonging show more variation in their life satisfaction.
This is the same logic as the conditional means from lecture: the typical life satisfaction score depends on the level of belonging.
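The conditional-means idea can be sketched in a few lines of base R. The data frame `toy` below is invented for illustration and is not drawn from the CCHS; the point is only that we compute one typical value of Y per level of X.

```r
# Hypothetical respondents: belonging level (1-4) and life satisfaction (0-10)
toy <- data.frame(
  belonging         = c(1, 1, 2, 2, 3, 3, 4, 4),
  life_satisfaction = c(4, 6, 6, 7, 7, 8, 8, 10)
)

# One mean and one median of life satisfaction per belonging level
cond_means   <- tapply(toy$life_satisfaction, toy$belonging, mean)
cond_medians <- tapply(toy$life_satisfaction, toy$belonging, median)

cond_means
cond_medians
```

In this toy example the conditional means rise from 5 to 9 as belonging goes from 1 to 4, which is the pattern the boxplot medians show visually.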
ggplot(cor_data, aes(x = belonging, y = life_satisfaction)) +
geom_jitter(alpha = 0.05, width = 0.25, height = 0.25, color = "steelblue") +
labs(
title = "Life Satisfaction and Community Belonging",
subtitle = "CCHS 2022 | Each dot is one respondent (jittered)",
x = "Sense of Community Belonging (1 = Very Weak, 4 = Very Strong)",
y = "Life Satisfaction (0-10)"
) +
scale_x_continuous(breaks = 1:4,
labels = c("1\nVery Weak", "2\nSomewhat\nWeak",
"3\nSomewhat\nStrong", "4\nVery\nStrong")) +
theme_minimal()
What’s happening in this code?
geom_jitter() adds small random noise so overlapping points become visible (without jitter, 67,000 points would stack on top of each other at only 4 x 11 = 44 grid positions)
alpha = 0.05 makes each point nearly transparent, so darker areas mean more respondents
r_belonging <- cor(cor_data$life_satisfaction, cor_data$belonging)
r_belonging
## [1] 0.3880256
ANSWER KEY
The correlation is approximately 0.39 (the exact value comes from the cor() output above).
The correlation is positive. The sign tells us that as community belonging increases (gets stronger), life satisfaction also tends to increase. People who feel a stronger sense of belonging to their community tend to report higher life satisfaction. We can confirm this visually: the boxplot medians rise from left to right, and the scatterplot shows denser concentration of points in the upper-right region.
This is a moderate correlation. The conventional benchmarks are: weak (< 0.3), moderate (0.3–0.5), strong (> 0.5). At r = 0.39, we’re solidly in the moderate range. In the scatterplot, you can see a general upward trend — the cloud of points tilts toward the upper right — but there is still a lot of spread. At any given level of belonging, life satisfaction ranges widely (from 0 to 10). That wide scatter is why r isn’t closer to 1: the relationship is real but far from deterministic.
The correlation told us that belonging and life satisfaction move together. But the correlation coefficient can’t answer: “How much higher is life satisfaction for people who rate their mental health one point better?” For that, we need a regression line.
Let’s model life satisfaction as a function of self-rated mental health.
# Clean the data for this regression
reg_data <- cchs %>%
filter(life_satisfaction <= 10, # Valid life satisfaction (0-10)
mental_health >= 1 & mental_health <= 5) # Valid mental health (1-5)
cat("Number of respondents:", nrow(reg_data), "\n")
## Number of respondents: 66216
In R, we fit a linear regression with lm() — short for
“linear model.”
model <- lm(life_satisfaction ~ mental_health, data = reg_data)
Read this as: “Model life satisfaction as a linear function of mental health.”
The tilde (~) means “predicted by.” The outcome (Y) goes on the left; the predictor (X) goes on the right.
tidy(model) %>%
mutate(across(where(is.numeric), ~ round(.x, 4)))
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.89 0.0228 171. 0
## 2 mental_health 1.01 0.0061 165. 0
This table has five columns. For now, focus on the first two: term (which coefficient the row describes) and estimate (the fitted value of that coefficient).
We’ll come back to std.error, statistic, and p.value shortly.
Look at the output table above.
ANSWER KEY
From the tidy() output, the intercept is approximately 3.89 (the estimate in the (Intercept) row) and the slope is approximately 1.01 (the estimate in the mental_health row). These numbers come directly from the estimate column.
“Comparing two people who differ by one point on self-rated mental health, the model predicts a difference of about 1.01 points on life satisfaction.” The slope tells us the predicted difference in Y for a one-unit difference in X. Here, that’s almost exactly one-for-one: one point better mental health goes with about one point higher life satisfaction.
The slope is positive, which makes sense: we reverse-coded mental health so that higher values mean better mental health. A positive slope means that people with better mental health tend to have higher life satisfaction. If this slope were negative, it would mean better mental health is associated with lower life satisfaction — that would be hard to explain.
Every simple linear regression produces an equation of the form:
\[\widehat{y} = b_0 + b_1 \cdot x\]
where \(b_0\) is the intercept and \(b_1\) is the slope. The “hat” on \(y\) means it’s a predicted value — our best guess, not the actual observation.
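For readers who want to see where \(b_0\) and \(b_1\) come from, here is a minimal sketch using the standard least-squares formulas for one predictor: slope = cov(x, y) / var(x), and intercept = mean(y) minus slope times mean(x). The vectors `x` and `y` are made up for illustration; the comparison with lm() is only a sanity check.

```r
# Hypothetical data: x on a 1-5 scale, y on a 0-10 scale
x <- c(1, 2, 3, 4, 5)
y <- c(5, 6, 6, 8, 9)

# Least-squares slope and intercept from the textbook formulas
b1_hand <- cov(x, y) / var(x)
b0_hand <- mean(y) - b1_hand * mean(x)

# The same two numbers from lm()
toy_model <- lm(y ~ x)

c(b0_hand = b0_hand, b1_hand = b1_hand)
coef(toy_model)

# Predicted value at x = 2 using y-hat = b0 + b1 * x
y_hat_2 <- b0_hand + b1_hand * 2
y_hat_2
```

With these toy numbers the intercept works out to 3.8 and the slope to 1, so the predicted value at x = 2 is 5.8.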
Let’s extract our specific numbers:
b0 <- round(coef(model)[1], 2)
b1 <- round(coef(model)[2], 2)
cat("Intercept (b0):", b0, "\n")
## Intercept (b0): 3.89
cat("Slope (b1):", b1, "\n")
## Slope (b1): 1.01
cat("\nThe equation: life_satisfaction_hat =", b0, "+", b1, "x mental_health\n")
##
## The equation: life_satisfaction_hat = 3.89 + 1.01 x mental_health
Let’s use this equation. For someone who rates their mental health as 2 (Fair):
\[\widehat{y} = b_0 + b_1 \times 2\]
# Prediction for mental_health = 2
pred_2 <- b0 + b1 * 2
cat("Predicted life satisfaction for mental_health = 2:", pred_2, "\n")
## Predicted life satisfaction for mental_health = 2: 5.91
# Prediction for mental_health = 4
pred_4 <- b0 + b1 * 4
cat("Predicted life satisfaction for mental_health = 4:", pred_4, "\n")
## Predicted life satisfaction for mental_health = 4: 7.93
# Difference
cat("Difference (4 vs 2):", pred_4 - pred_2, "\n")
## Difference (4 vs 2): 2.02
cat("That's 2 x", b1, "=", 2 * b1, "\n")
## That's 2 x 1.01 = 2.02
The slope works as a multiplier: multiply the difference in X by the slope to get the predicted difference in Y.
new_data <- tibble(mental_health = c(1, 2, 3, 4, 5))
new_data$predicted <- predict(model, newdata = new_data)
new_data
## # A tibble: 5 × 2
## mental_health predicted
## <dbl> <dbl>
## 1 1 4.90
## 2 2 5.91
## 3 3 6.92
## 4 4 7.93
## 5 5 8.94
Your hand calculations should match these numbers.
ANSWER KEY
For mental_health = 1: \(\widehat{y}\) = 3.89 + 1.01 x 1 = 4.90. You can verify this from the predict() table (first row). Is it reasonable? A predicted life satisfaction of about 5 out of 10 for someone with poor mental health seems plausible, though individual variation is large.
For mental_health = 5: \(\widehat{y}\) = 3.89 + 1.01 x 5 = 8.94. Check the last row of the predict() table. A predicted life satisfaction near 9 out of 10 for someone with excellent mental health also seems plausible.
The difference is 8.94 - 4.90 = 4.04. This equals 4 x 1.01 = 4.04. The difference in predictions equals the difference in X (which is 5 - 1 = 4) multiplied by the slope. This is the core logic of regression: the slope is a constant rate of change, so you just multiply.
The intercept (3.89) is the predicted life satisfaction when mental health = 0. But no one in the data has mental_health = 0 — the scale runs from 1 to 5. The intercept is a mathematical anchor point that positions the line correctly, but it doesn’t have a direct real-world interpretation here. This is common in regression: the intercept is necessary for the equation but doesn’t always describe a realistic scenario. We rely on the slope for substantive interpretation.
ggplot(reg_data, aes(x = mental_health, y = life_satisfaction)) +
geom_jitter(alpha = 0.04, width = 0.25, height = 0.25, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
labs(
title = "Life Satisfaction and Self-Rated Mental Health",
subtitle = "Red line = regression line (line of best fit)",
x = "Self-Rated Mental Health (1 = Poor, 5 = Excellent)",
y = "Life Satisfaction (0-10)"
) +
scale_x_continuous(breaks = 1:5,
labels = c("1\nPoor", "2\nFair", "3\nGood",
"4\nVery Good", "5\nExcellent")) +
theme_minimal()
What geom_smooth(method = "lm") does:
It draws the regression line — the same one defined by the equation you
wrote in Part 3. Every point on this line is a predicted value.
The vertical distance between each point and the line is a residual — the difference between what was observed and what the model predicted.
# Pick a mix: some well-predicted, some poorly predicted
set.seed(2026)
# Get all residuals, then pick illustrative cases
reg_data_with_resid <- reg_data %>%
mutate(
predicted = round(predict(model, newdata = .), 2),
residual = round(life_satisfaction - predicted, 2)
)
# Select cases that show clear contrast
examples <- bind_rows(
# Two well-predicted respondents (small residuals)
reg_data_with_resid %>% filter(abs(residual) < 0.5) %>% sample_n(2),
# One moderately off
reg_data_with_resid %>% filter(abs(residual) > 1.5 & abs(residual) < 2.5) %>% sample_n(1),
# Two poorly predicted respondents (large residuals)
reg_data_with_resid %>% filter(abs(residual) > 3) %>% sample_n(2)
) %>%
select(mental_health, life_satisfaction, predicted, residual) %>%
arrange(abs(residual))
examples
## # A tibble: 5 × 4
## mental_health life_satisfaction predicted residual
## <dbl> <dbl> <dbl> <dbl>
## 1 4 8 7.93 0.07
## 2 2 6 5.91 0.09
## 3 3 5 6.92 -1.92
## 4 4 4 7.93 -3.93
## 5 3 0 6.92 -6.92
The residual column is what the model didn’t capture —
everything else besides mental health that affects someone’s life
satisfaction.
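The residual calculation itself is just subtraction. The sketch below repeats it on a tiny invented dataset (`x` and `y` are hypothetical) and checks it against R's built-in resid(). It also shows a general property of least-squares lines with an intercept: the residuals sum to zero.

```r
# Hypothetical data: x on a 1-5 scale, y on a 0-10 scale
x <- c(1, 2, 3, 4, 5)
y <- c(5, 6, 6, 8, 9)
toy_model <- lm(y ~ x)

# Residual = observed - predicted
predicted     <- predict(toy_model)
residual_hand <- y - predicted

# Matches R's built-in residuals
all.equal(unname(residual_hand), unname(resid(toy_model)))

# Positive and negative residuals cancel out (zero up to rounding error)
sum(residual_hand)

data.frame(x, y,
           predicted = round(predicted, 2),
           residual  = round(residual_hand, 2))
```

In this toy example four points sit 0.2 above the line and one sits 0.8 below it.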
ANSWER KEY
The respondents with residuals close to zero (the first two rows, with |residual| < 0.5) are well-predicted — the model’s guess was close to their actual score. The respondents with large residuals (last two rows, |residual| > 3) are poorly predicted — the model was off by 3 or more points on a 10-point scale. For example, a respondent with high mental health but very low life satisfaction would have a large negative residual — the model expected them to be much more satisfied than they actually are. The specific cases vary because of the random seed, but the contrast between small and large residuals should be clear.
Many factors could explain why two people with the same mental health rating report different life satisfaction: income and financial security (someone with good mental health but severe financial stress might report lower satisfaction), relationship quality (social support, marriage, family conflict), physical health (chronic pain, disability), employment situation (job loss, meaningful work), or recent life events (bereavement, moving, having a child). Mental health matters, but it’s only one piece of the puzzle.
The residual is the gap between reality and the model’s prediction. It captures everything that affects life satisfaction other than mental health — all the variables we didn’t include (income, relationships, health, luck, personality, life circumstances). A simple regression with one predictor will always leave a lot unexplained. The residuals are a reminder that our model is a simplification, not a complete explanation.
We found a positive slope. But could this just be noise — a pattern that appeared by chance in our particular sample?
tidy(model) %>%
mutate(
across(c(estimate, std.error, statistic), ~ round(.x, 4)),
p.value = ifelse(p.value < 0.0001, "< 0.0001", as.character(round(p.value, 4)))
)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 3.89 0.0228 171. < 0.0001
## 2 mental_health 1.01 0.0061 165. < 0.0001
Focus on the p.value column for the
mental_health row.
From lecture: the p-value asks, “If there were truly no relationship between mental health and life satisfaction in the population, how likely would we be to see a slope this large (or larger) in our sample, just by chance?”
A small p-value means it would be very unlikely — so we have evidence the pattern is real.
Our p-value is far below all of the conventional thresholds (0.05, 0.01, and 0.001): very strong evidence that the association between mental health and life satisfaction is not just noise.
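One way to build intuition for the "no relationship" scenario is a shuffle (permutation) simulation. This is not how lm() computes its p-value (that comes from a t-statistic); it is only an illustrative sketch on simulated data, and every name and number in it is an assumption made for the example.

```r
set.seed(1)

# Simulated data with a real positive relationship (hypothetical values)
n <- 200
x <- sample(1:5, n, replace = TRUE)
y <- 4 + 1 * x + rnorm(n, sd = 2)

# Slope observed in this simulated "sample"
observed_slope <- coef(lm(y ~ x))[["x"]]

# Shuffling y breaks any link between x and y, mimicking "no relationship"
shuffled_slopes <- replicate(1000, coef(lm(sample(y) ~ x))[["x"]])

# Share of shuffled slopes at least as far from zero as the observed one
p_shuffle <- mean(abs(shuffled_slopes) >= abs(observed_slope))

observed_slope
p_shuffle
```

If almost none of the shuffled slopes are as large as the observed one, the observed slope would be very surprising in a world with no relationship, which is the idea a small p-value expresses.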
ANSWER KEY
Yes, the association is statistically significant. The p-value for the mental_health slope shows “< 0.0001” in the output, which is well below all conventional thresholds (0.05, 0.01, 0.001). This means there is very strong evidence against the null hypothesis of no relationship. With over 66,000 respondents and a slope of approximately 1.01, we can be very confident this pattern exists in the population and is not a fluke of our sample.
Statistical significance does not tell us two important things. First, it doesn’t tell us about effect size: how much mental health matters for life satisfaction in practical terms. A slope can be statistically significant but trivially small. Second, it doesn’t tell us about causation: we don’t know whether better mental health causes higher life satisfaction, or whether it’s the other way around, or whether some third factor (like income, social support, or personality) drives both. Significance only tells us the pattern is unlikely to be random noise.
No, that would not be an important finding in any practical sense. A slope of 0.001 means a one-unit change in X predicts a change of only 0.001 in Y — that’s essentially zero effect. The reason it’s “significant” is the massive sample size (n = 10 million): with enough data, even the tiniest, most meaningless pattern becomes detectable. This is a classic example of statistical significance without practical significance. The p-value tells you the pattern is “real” (not noise), but the slope tells you it doesn’t matter. This is why we always look at the slope and the p-value together, not just the p-value alone.
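A small simulation can make this concrete. The sketch below uses a smaller sample and a larger slope than the scenario in the question (n = 200,000 and a true slope of 0.02, both chosen only so the code runs quickly), so treat it as an illustration of the principle rather than a reproduction of that case.

```r
set.seed(42)

# Simulated data: a very weak true slope and a large sample (hypothetical)
n <- 200000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)

big_fit <- summary(lm(y ~ x))

slope     <- big_fit$coefficients["x", "Estimate"]
p_value   <- big_fit$coefficients["x", "Pr(>|t|)"]
r_squared <- big_fit$r.squared

c(slope = slope, p_value = p_value, r_squared = r_squared)
```

The p-value should come out extremely small even though the slope is close to zero and R-squared is a tiny fraction of one percent: statistically detectable, practically negligible.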
So far our predictor (mental health) was ordinal — coded 1 through 5. What happens when the predictor is categorical, like sex?
cat_data <- cchs %>%
filter(mental_health >= 1 & mental_health <= 5,
DHH_SEX %in% c(1, 2)) %>%
mutate(sex = ifelse(DHH_SEX == 1, "Male", "Female"))
cat_model <- lm(mental_health ~ sex, data = cat_data)
tidy(cat_model) %>%
mutate(
across(c(estimate, std.error, statistic), ~ round(.x, 4)),
p.value = ifelse(p.value < 0.0001, "< 0.0001", as.character(round(p.value, 4)))
)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 3.51 0.0055 641. < 0.0001
## 2 sexMale 0.181 0.0081 22.5 < 0.0001
Reading this output:
When a predictor is categorical, R picks one group as the reference category (here, “Female” because it comes first alphabetically) and reports the difference for the other group.
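The intercept is the predicted (mean) mental health score for the reference group, Females
The slope (sexMale) is how much higher or lower Males score compared to Females
cat_data %>%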
group_by(sex) %>%
summarise(
mean_mental_health = round(mean(mental_health), 3),
n = n()
)
## # A tibble: 2 × 3
## sex mean_mental_health n
## <chr> <dbl> <int>
## 1 Female 3.51 35665
## 2 Male 3.69 30683
Check: the intercept should equal the Female mean, and intercept + slope should equal the Male mean. Let’s verify:
b0_sex <- round(coef(cat_model)[1], 3)
b1_sex <- round(coef(cat_model)[2], 3)
cat("Intercept (Female mean):", b0_sex, "\n")
## Intercept (Female mean): 3.508
cat("Slope (Male difference):", b1_sex, "\n")
## Slope (Male difference): 0.181
cat("Intercept + Slope =", b0_sex, "+", b1_sex, "=", b0_sex + b1_sex, "(Male mean)\n")
## Intercept + Slope = 3.508 + 0.181 = 3.689 (Male mean)
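The same check can be reproduced on a dataset small enough to do in your head. The data frame `toy` and its groups "A" and "B" are invented for this sketch; "A" becomes the reference category because it comes first alphabetically.

```r
# Hypothetical scores for two groups
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  score = c(2, 3, 4, 5, 6, 7)
)

# Group means computed directly
group_means <- tapply(toy$score, toy$group, mean)

# Regression with the two-level categorical predictor
toy_fit <- lm(score ~ group, data = toy)

group_means
coef(toy_fit)

# Intercept = mean of reference group "A"; intercept + slope = mean of "B"
toy_b0 <- coef(toy_fit)[["(Intercept)"]]
toy_b1 <- coef(toy_fit)[["groupB"]]
c(A = toy_b0, B = toy_b0 + toy_b1)
```

Group A averages 3 and group B averages 6, so the intercept is 3 and the groupB coefficient is 3.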
glance(cat_model) %>%
select(r.squared, adj.r.squared) %>%
mutate(across(everything(), ~ round(.x, 4)))
## # A tibble: 1 × 2
## r.squared adj.r.squared
## <dbl> <dbl>
## 1 0.0076 0.0075
R-squared tells you the proportion of variation in mental health that is explained by sex alone. Look at the number — is it large or small?
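To unpack "proportion of variation explained," the sketch below computes R-squared by hand as 1 minus (residual sum of squares / total sum of squares) on invented data. It uses base R's summary() rather than glance() so it runs without extra packages; with a single numeric predictor, R-squared also equals the squared correlation.

```r
# Hypothetical data: x on a 1-5 scale, y on a 0-10 scale
x <- c(1, 2, 3, 4, 5)
y <- c(5, 6, 6, 8, 9)
toy_model <- lm(y ~ x)

# Variation left over after the model vs. total variation in y
ss_resid <- sum(resid(toy_model)^2)
ss_total <- sum((y - mean(y))^2)

r2_hand <- 1 - ss_resid / ss_total
r2_lm   <- summary(toy_model)$r.squared
r2_cor  <- cor(x, y)^2

c(by_hand = r2_hand, from_lm = r2_lm, squared_r = r2_cor)
```

All three values agree (about 0.93 for this toy data). The toy relationship is far stronger than the sex and mental health example, where R-squared is under 0.01.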
ANSWER KEY
From the tidy() output, the intercept is approximately 3.51, which is the predicted mental health score for women (the reference group). The slope for sexMale is approximately 0.18. So the predicted score for men is 3.51 + 0.18 = 3.69. You can confirm this by checking the group means table and the intercept + slope check, which both show Female mean = 3.51 and Male mean = 3.69.
The difference is 0.18 points on a 5-point scale. That’s very small: less than one-fifth of a single point. To put it in perspective, the scale goes from 1 (Poor) to 5 (Excellent), and the gap between men and women is about 4.5% of the total range (0.18 / 4). Both groups average between “Good” (3) and “Very Good” (4). If you saw two people, one scoring 3.51 and the other 3.69, you would not describe their mental health as meaningfully different. A useful benchmark: if the difference is small relative to the standard deviation of the outcome (which is about 1.04), then it’s not large. Here, 0.18 / 1.04 = 0.17 standard deviations, a very small effect.
No, a very small p-value does not mean the difference is important. It means the difference is unlikely to be zero in the population. With n = 66,000+, even tiny differences produce very small p-values because the standard error shrinks as n grows. The p-value here reflects our certainty that the difference exists, not the size of the difference. A difference of 0.18 on a 5-point scale is statistically detectable but practically negligible — it would not change any clinical decision, policy, or individual’s life.
R-squared is approximately 0.0076 (about 0.76%). That means sex explains less than 1% of the variation in mental health. The other 99%+ comes from factors other than sex — income, age, life circumstances, personality, social support, and countless other things. This extremely low R-squared reinforces the message from the p-value: yes, the difference between men and women is “real” in the statistical sense, but sex is essentially useless as a predictor of any individual’s mental health. This is the perfect example of the lecture point that statistical significance and practical importance are not the same thing.
You’ve fitted and interpreted a regression model. Now let’s see if you can work with the equation on paper — without running code.
Below are regression results from a model you haven’t run:
life satisfaction predicted by low stress (where higher
values of low_stress mean less stressed, on a 1-5
scale).
The model found:
\[\widehat{\text{life\_satisfaction}} = 4.74 + 0.85 \times \text{low\_stress}\]
Use this equation to answer the following. Show your arithmetic in your journal.
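Let’s verify this equation by running the model. The intercept in the output below prints as 4.73 rather than the 4.74 in the equation; for the problems, use the equation as stated.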
stress_data <- cchs %>%
filter(life_satisfaction <= 10,
low_stress >= 1 & low_stress <= 5)
stress_model <- lm(life_satisfaction ~ low_stress, data = stress_data)
tidy(stress_model) %>%
mutate(across(where(is.numeric), ~ round(.x, 4)))
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.73 0.0237 199. 0
## 2 low_stress 0.852 0.0069 123. 0
Problem A — Find \(\widehat{y}\): A respondent reports a stress level of 4 on the reversed scale (meaning they have relatively low stress). What does the model predict for their life satisfaction?
Problem B — Find X: The model predicts a life
satisfaction score of 7.29. What value of
low_stress would produce this prediction? (Solve the
equation for X.)
Problem C — Find the residual: A respondent has
low_stress = 3 and an observed life
satisfaction of 8. The model predicts \(\widehat{y}\) for this respondent based on
the equation above. What is the residual? Is this respondent
above or below the regression
line?
Problem D — Interpret: In one sentence, interpret the slope of 0.85 using the template from lecture: “Comparing two people who differ by one unit on [predictor], the model predicts…”
ANSWER KEY
Problem A — Find \(\widehat{y}\):
Plug low_stress = 4 into the equation:
\(\widehat{y}\) = 4.74 + 0.85 x 4 = 4.74 + 3.40 = 8.14
The model predicts a life satisfaction of 8.14 for someone with low_stress = 4.
Problem B — Find X:
We know \(\widehat{y}\) = 7.29 and need to solve for X:
7.29 = 4.74 + 0.85 x X
7.29 - 4.74 = 0.85 x X
2.55 = 0.85 x X
X = 2.55 / 0.85 = 3
A low_stress value of 3 would produce a predicted life satisfaction of 7.29.
Problem C — Find the residual:
First, compute the predicted value for low_stress = 3:
\(\widehat{y}\) = 4.74 + 0.85 x 3 = 4.74 + 2.55 = 7.29
Then compute the residual:
residual = observed - predicted = 8 - 7.29 = 0.71
The residual is positive (+0.71), which means this respondent is above the regression line. Their actual life satisfaction is higher than what the model predicted based on their stress level alone. Something other than stress — maybe strong social support, meaningful work, or good health — is pushing their satisfaction above the trend.
Problem D — Interpret:
“Comparing two people who differ by one unit on low stress, the model predicts the person with lower stress will have a life satisfaction score that is 0.85 points higher on the 0-10 scale.”
Note: The slope tells us about the predicted difference between two people, not about what happens when one person’s stress changes. We cannot make causal claims from this cross-sectional regression.
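If you want to check your paper arithmetic afterwards, the three calculations can be sketched as follows. The helper names `predict_satisfaction()` and `solve_for_stress()` are made up for this example, and the coefficients are taken from the stated equation rather than from a fitted model.

```r
# Coefficients from the stated equation: y-hat = 4.74 + 0.85 * low_stress
b0_stress <- 4.74
b1_stress <- 0.85

# Hypothetical helpers: plug in X, or rearrange the equation to solve for X
predict_satisfaction <- function(low_stress) b0_stress + b1_stress * low_stress
solve_for_stress     <- function(y_hat) (y_hat - b0_stress) / b1_stress

# Problem A: predicted life satisfaction at low_stress = 4
pred_a <- predict_satisfaction(4)

# Problem B: low_stress value that gives a prediction of 7.29
x_b <- solve_for_stress(7.29)

# Problem C: residual = observed - predicted, at low_stress = 3
resid_c <- 8 - predict_satisfaction(3)

round(c(A = pred_a, B = x_b, C = resid_c), 2)
```

The rounded results are 8.14, 3, and 0.71, matching the worked answers above.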
Take a few minutes to reflect on this practice session:
ANSWER KEY
These are reflection questions, so answers will vary. Here are model responses:
Correlation tells you direction and strength (positive/negative, weak/strong) but not how much. Regression tells you the predicted change in Y per unit change in X, in real units. The correlation says “these move together”; the regression says “by how much.”
The slope (to understand the size and direction of the relationship) and the p-value (to assess whether the relationship is likely real or just noise). Together, these answer the two most important questions: “Is there an effect?” and “How big is it?” A common mistake is looking only at the p-value and ignoring the slope. As we saw in Part 6, a tiny slope can have a tiny p-value if n is large enough.
Open-ended. Common sources of confusion at this stage include: why the intercept doesn’t always make sense in context, what “predicted” means when we’re looking at data we already have, and why we’d use a straight line for data that doesn’t look very linear. All good questions worth discussing.
Correlation gives a unitless number (r) between -1 and +1. It measures direction and strength but not “how much.”
Regression gives an equation: \(\widehat{y} = b_0 + b_1 \cdot x\). The slope (\(b_1\)) tells you the predicted change in Y for a one-unit change in X — in real units.
The intercept (b0) is the predicted Y when X = 0. It anchors the line but may not always have a meaningful real-world interpretation.
The slope (b1) is the rate of change. Multiply any difference in X by the slope to get the predicted difference in Y.
Residuals (\(e = y - \widehat{y}\)) are what the model didn’t explain. They capture all the other factors besides X that affect Y.
The p-value asks: “How likely is this slope if there were no real relationship?” A small p-value means the pattern is unlikely to be noise. It does not mean the effect is large or causal.
Categorical predictors work by comparing groups. The intercept is the predicted value for the reference group; the slope is the predicted difference for the other group.
| Function | What it does | Example |
|---|---|---|
| lm() | Fit a linear regression | lm(y ~ x, data = df) |
| tidy() | Clean regression output table | tidy(model) |
| glance() | Model-level summary (R-squared, etc.) | glance(model) |
| coef() | Extract intercept and slope | coef(model) |
| predict() | Get predicted values for new data | predict(model, newdata = df) |
| cor() | Compute correlation coefficient | cor(x, y) |
| geom_smooth(method = "lm") | Add regression line to plot | Added inside ggplot() |
| geom_jitter() | Scatter with random noise | geom_jitter(alpha = 0.05) |
| ifelse() | Recode a variable | ifelse(x == 1, "A", "B") |
Remember: The goal is not perfect code, but building intuition for what regression tells you — and what it doesn’t. If something confuses you, write it down — that’s valuable information for class discussion.