Getting Started

This practice document picks up where Tuesday’s lecture left off. In lecture, we learned that correlation gives us direction and strength, but not how much. Regression gives us a line — a predicted value of Y for any value of X, in real units.

Today, you’ll fit your first regression model on real Canadian data, interpret the output, and practice using the regression equation to make predictions.

Load packages and data

library(tidyverse)
library(broom)

cchs <- read_csv("data/cchs_teaching.csv")

We are using the 2022 Canadian Community Health Survey (CCHS), a major national health survey conducted by Statistics Canada. This file contains responses from over 67,000 Canadians on topics including life satisfaction, health, stress, community belonging, and socioeconomic circumstances.

A note on variable coding

Several variables in this survey code health outcomes so that lower numbers mean better outcomes (e.g., 1 = “Excellent,” 5 = “Poor”). This creates negative correlations and negative slopes that are technically correct but counterintuitive — “more health” showing up as a smaller number is confusing when you’re learning regression for the first time.

To make interpretation easier, we’ll reverse-code the relevant variables so that higher numbers always mean better outcomes. The code below does this once, and we’ll use the recoded versions for the rest of the document.

cchs <- cchs %>%
  mutate(
    # Self-rated mental health: original 1=Excellent ... 5=Poor
    # Reversed: 1=Poor ... 5=Excellent
    mental_health = 6 - GEN_05,

    # Self-rated general health: original 1=Excellent ... 5=Poor
    # Reversed: 1=Poor ... 5=Excellent
    general_health = 6 - GEN_01,

    # Perceived life stress: original 1=Not at all ... 5=Extremely
    # Reversed so higher = less stressed (more relaxed)
    low_stress = 6 - GEN_10,

    # Community belonging: original 1=Very strong ... 4=Very weak
    # Reversed: 1=Very weak ... 4=Very strong
    belonging = 5 - GEN_20,

    # Life satisfaction: already coded 0-10 where higher = more satisfied
    # Just rename for clarity
    life_satisfaction = LSM_01
  )

What just happened? We created new variables with reversed scales. For instance, someone who answered “Excellent” (coded 1 in the original) now has a value of 5 on mental_health. This way, a positive slope means “more of X is associated with more of Y” — which is how most people naturally think about relationships.
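To see why reversing a scale changes only the sign of a relationship, here is a minimal sketch on made-up numbers (health_original and satisfaction are hypothetical vectors, not CCHS data). Subtracting from 6 flips a 1-5 scale, and the correlation keeps its size but switches sign.

```r
# Hypothetical toy responses (not from the CCHS file)
health_original <- c(1, 2, 2, 3, 4, 5)   # 1 = Excellent ... 5 = Poor
satisfaction    <- c(9, 8, 7, 6, 5, 3)   # 0-10, higher = more satisfied

# Reverse-code: 1 becomes 5, 2 becomes 4, ..., 5 becomes 1
health_reversed <- 6 - health_original

r_original <- cor(health_original, satisfaction)
r_reversed <- cor(health_reversed, satisfaction)

# Same magnitude, opposite sign
cat("Original coding r:", round(r_original, 3), "\n")
cat("Reversed coding r:", round(r_reversed, 3), "\n")
```

The two printed values should be mirror images of each other: negative under the original coding, positive after reversing.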

Data cleaning

Survey data uses special codes for missing or refused responses. We need to filter these out before analyzing. The CCHS uses codes like 9, 96, 99, and 999 to indicate “not stated” or “not applicable,” depending on the variable.

We’ll clean the variables we need as we go, but here’s the general principle: always check the codebook, then filter before calculating.
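One consequence of reverse-coding before cleaning is that the special codes get transformed too. The sketch below uses a made-up vector (gen_05_toy is a hypothetical stand-in for a raw survey column) to show that a “not stated” code of 9 becomes -3 after 6 - x, which is why the filters later in this document keep only values inside the valid range.

```r
# Hypothetical raw responses: valid codes 1-5 plus two "not stated" codes (9)
gen_05_toy <- c(1, 3, 5, 9, 2, 9)

# Step 1: look at the values that are actually present
table(gen_05_toy)

# Step 2: reverse-code; the special code 9 turns into 6 - 9 = -3
mental_health_toy <- 6 - gen_05_toy
mental_health_toy

# Step 3: keep only values inside the valid 1-5 range
valid <- mental_health_toy >= 1 & mental_health_toy <= 5
mental_health_toy[valid]
```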


Part 1: Review — The Correlation Coefficient

Before we build a regression line, let’s start with what we already know.

Life satisfaction and community belonging

Does a stronger sense of belonging to one’s community go with higher life satisfaction? Let’s look.

# Filter out not-stated codes
cor_data <- cchs %>%
  filter(life_satisfaction <= 10,    # Remove code 99 (not stated)
         belonging >= 1 & belonging <= 4)  # Keep valid responses only

Comparing groups: Boxplots

Before computing the correlation, let’s use a tool from Arc 1 — boxplots — to compare life satisfaction across belonging groups.

cor_data %>%
  mutate(belonging_label = factor(belonging,
    levels = 1:4,
    labels = c("Very Weak", "Somewhat Weak", "Somewhat Strong", "Very Strong"))) %>%
  ggplot(aes(x = belonging_label, y = life_satisfaction, fill = belonging_label)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.1) +
  scale_fill_manual(values = c("#d4d4d4", "#b8cfe0", "#4A6FA5", "#2c4a7c")) +
  labs(
    title = "Life Satisfaction by Community Belonging",
    subtitle = "CCHS 2022 | Medians and IQRs across belonging groups",
    x = "Sense of Community Belonging",
    y = "Life Satisfaction (0-10)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Notice how the median (the thick line inside each box) shifts upward as belonging gets stronger. The IQR (the height of each box) also changes — people with weaker belonging show more variation in their life satisfaction.

This is the same logic as the conditional means from lecture: the typical life satisfaction score depends on the level of belonging.

The scatterplot

ggplot(cor_data, aes(x = belonging, y = life_satisfaction)) +
  geom_jitter(alpha = 0.05, width = 0.25, height = 0.25, color = "steelblue") +
  labs(
    title = "Life Satisfaction and Community Belonging",
    subtitle = "CCHS 2022 | Each dot is one respondent (jittered)",
    x = "Sense of Community Belonging (1 = Very Weak, 4 = Very Strong)",
    y = "Life Satisfaction (0-10)"
  ) +
  scale_x_continuous(breaks = 1:4,
                     labels = c("1\nVery Weak", "2\nSomewhat\nWeak",
                                "3\nSomewhat\nStrong", "4\nVery\nStrong")) +
  theme_minimal()

What’s happening in this code?

  • geom_jitter() adds small random noise so overlapping points become visible (without jitter, 67,000 points would stack on top of each other at only 4 x 11 = 44 grid positions)
  • alpha = 0.05 makes each point nearly transparent — darker areas mean more respondents

Computing the correlation

r_belonging <- cor(cor_data$life_satisfaction, cor_data$belonging)
r_belonging
## [1] 0.3880256
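
For intuition about what cor() computes, the following sketch rebuilds Pearson’s r from z-scores on a small made-up dataset (x_toy and y_toy are hypothetical). It assumes the standard sample formula: the sum of the products of z-scores, divided by n - 1.

```r
# Hypothetical toy data (not CCHS)
x_toy <- c(1, 2, 2, 3, 4, 4)
y_toy <- c(3, 5, 6, 6, 8, 9)

# Convert each variable to z-scores (distance from the mean in SD units)
z_x <- (x_toy - mean(x_toy)) / sd(x_toy)
z_y <- (y_toy - mean(y_toy)) / sd(y_toy)

# Pearson's r: sum of z-score products divided by n - 1
r_by_hand <- sum(z_x * z_y) / (length(x_toy) - 1)
r_builtin <- cor(x_toy, y_toy)

cat("By hand:", r_by_hand, "\n")
cat("cor():  ", r_builtin, "\n")
```

Because r is built from z-scores, it has no units, which is why it cannot say “how much” in points of life satisfaction.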

Journal Prompt 1

  1. What is the correlation between life satisfaction and community belonging?
  2. Is it positive or negative? What does the sign tell you about the direction of the relationship?
  3. Would you describe the strength of this correlation as weak, moderate, or strong? Look at the scatterplot — how does the amount of scatter around the trend relate to the value of r you computed?

ANSWER KEY

  1. The correlation is approximately 0.39 (the exact value comes from the cor() output above).

  2. The correlation is positive. The sign tells us that as community belonging increases (gets stronger), life satisfaction also tends to increase. People who feel a stronger sense of belonging to their community tend to report higher life satisfaction. We can confirm this visually: the boxplot medians rise from left to right, and the scatterplot shows denser concentration of points in the upper-right region.

  3. This is a moderate correlation. The conventional benchmarks are: weak (< 0.3), moderate (0.3–0.5), strong (> 0.5). At r = 0.39, we’re solidly in the moderate range. In the scatterplot, you can see a general upward trend — the cloud of points tilts toward the upper right — but there is still a lot of spread. At any given level of belonging, life satisfaction ranges widely (from 0 to 10). That wide scatter is why r isn’t closer to 1: the relationship is real but far from deterministic.


Part 2: Your First Regression

The correlation told us that belonging and life satisfaction move together. But a correlation coefficient can’t answer a question like: “How much higher is life satisfaction for people who rate their mental health one point better?” For that, we need a regression line, so we now switch predictors from belonging to self-rated mental health.

Life satisfaction and self-rated mental health

Let’s model life satisfaction as a function of self-rated mental health.

# Clean the data for this regression
reg_data <- cchs %>%
  filter(life_satisfaction <= 10,        # Valid life satisfaction (0-10)
         mental_health >= 1 & mental_health <= 5)  # Valid mental health (1-5)

cat("Number of respondents:", nrow(reg_data), "\n")
## Number of respondents: 66216

Fitting the model

In R, we fit a linear regression with lm() — short for “linear model.”

model <- lm(life_satisfaction ~ mental_health, data = reg_data)

Read this as: “Model life satisfaction as a linear function of mental health.”

The tilde (~) means “predicted by.” The outcome (Y) goes on the left; the predictor (X) goes on the right.
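
As a bridge between Part 1 and lm(), here is a small sketch on made-up data (the data frame toy is hypothetical). It shows that the least-squares slope is the correlation rescaled into real units, b1 = r x sd(y) / sd(x), and that the intercept follows from the two means.

```r
# Hypothetical toy data (not CCHS)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(4, 6, 5, 8, 9))

# Fit the model: outcome on the left of ~, predictor on the right
toy_model <- lm(y ~ x, data = toy)
coef(toy_model)   # intercept 2.8, slope 1.2 for these numbers

# The same slope, rebuilt from the correlation and the two SDs
slope_from_r <- cor(toy$x, toy$y) * sd(toy$y) / sd(toy$x)

# The line passes through (mean of x, mean of y), which fixes the intercept
intercept_from_means <- mean(toy$y) - slope_from_r * mean(toy$x)

cat("Slope from r:", slope_from_r, "\n")
cat("Intercept from means:", intercept_from_means, "\n")
```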

Reading the output

tidy(model) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))
## # A tibble: 2 × 5
##   term          estimate std.error statistic p.value
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)       3.89    0.0228      171.       0
## 2 mental_health     1.01    0.0061      165.       0

This table has five columns. For now, focus on the first two:

  • term: Which part of the equation — the intercept or the slope
  • estimate: The number the model found

We’ll come back to std.error, statistic, and p.value shortly.


Journal Prompt 2

Look at the output table above.

  1. What is the intercept? What is the slope?
  2. Using the template from lecture, fill in the blank: “Comparing two people who differ by one point on self-rated mental health, the model predicts a difference of ____ on life satisfaction.”
  3. Is the slope positive or negative? Does the direction make sense to you? Why?

ANSWER KEY

  1. From the tidy() output, the intercept is approximately 3.89 (the estimate in the (Intercept) row) and the slope is approximately 1.01 (the estimate in the mental_health row). These numbers come directly from the estimate column.

  2. “Comparing two people who differ by one point on self-rated mental health, the model predicts a difference of about 1.01 points on life satisfaction.” The slope tells us the predicted difference in Y for a one-unit difference in X. Here, that’s almost exactly one-for-one: one point better mental health goes with about one point higher life satisfaction.

  3. The slope is positive, which makes sense: we reverse-coded mental health so that higher values mean better mental health. A positive slope means that people with better mental health tend to have higher life satisfaction. If this slope were negative, it would mean better mental health is associated with lower life satisfaction — that would be hard to explain.


Part 3: The Equation and Predictions

Writing the equation

Every simple linear regression produces an equation of the form:

\[\widehat{y} = b_0 + b_1 \cdot x\]

where \(b_0\) is the intercept and \(b_1\) is the slope. The “hat” on \(y\) means it’s a predicted value — our best guess, not the actual observation.

Let’s extract our specific numbers:

b0 <- round(coef(model)[1], 2)
b1 <- round(coef(model)[2], 2)

cat("Intercept (b0):", b0, "\n")
## Intercept (b0): 3.89
cat("Slope (b1):", b1, "\n")
## Slope (b1): 1.01
cat("\nThe equation: life_satisfaction_hat =", b0, "+", b1, "x mental_health\n")
## 
## The equation: life_satisfaction_hat = 3.89 + 1.01 x mental_health

Making predictions by hand

Let’s use this equation. For someone who rates their mental health as 2 (Fair):

\[\widehat{y} = b_0 + b_1 \times 2\]

# Prediction for mental_health = 2
pred_2 <- b0 + b1 * 2
cat("Predicted life satisfaction for mental_health = 2:", pred_2, "\n")
## Predicted life satisfaction for mental_health = 2: 5.91
# Prediction for mental_health = 4
pred_4 <- b0 + b1 * 4
cat("Predicted life satisfaction for mental_health = 4:", pred_4, "\n")
## Predicted life satisfaction for mental_health = 4: 7.93
# Difference
cat("Difference (4 vs 2):", pred_4 - pred_2, "\n")
## Difference (4 vs 2): 2.02
cat("That's 2 x", b1, "=", 2 * b1, "\n")
## That's 2 x 1.01 = 2.02

The slope works as a multiplier: multiply the difference in X by the slope to get the predicted difference in Y.
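
The reason this works can be written out in one line. Writing \(\widehat{y}_4\) and \(\widehat{y}_2\) for the predictions at mental health 4 and 2 (subscripts introduced here only for illustration), subtracting cancels the intercept and leaves the slope times the gap in X:

\[\widehat{y}_4 - \widehat{y}_2 = (b_0 + b_1 \cdot 4) - (b_0 + b_1 \cdot 2) = b_1 \cdot (4 - 2) = 1.01 \times 2 = 2.02\]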

Checking with R’s predict()

new_data <- tibble(mental_health = c(1, 2, 3, 4, 5))
new_data$predicted <- predict(model, newdata = new_data)
new_data
## # A tibble: 5 × 2
##   mental_health predicted
##           <dbl>     <dbl>
## 1             1      4.90
## 2             2      5.91
## 3             3      6.92
## 4             4      7.93
## 5             5      8.94

Your hand calculations should match these numbers.
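
A small caveat, sketched on made-up data (toy and toy_model are hypothetical): hand calculations match predict() exactly only when you use unrounded coefficients. With coefficients rounded to two decimals, as in b0 and b1 above, the results can differ slightly in the later decimal places.

```r
# Hypothetical toy data (not CCHS)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 2.9, 4.2, 4.8, 6.1, 6.7))
toy_model <- lm(y ~ x, data = toy)

new_x <- data.frame(x = c(2, 4))

# Predictions from R
from_predict <- unname(predict(toy_model, newdata = new_x))

# Hand calculation with full-precision coefficients
b <- coef(toy_model)
by_hand_exact <- unname(b[1] + b[2] * new_x$x)

# Hand calculation with coefficients rounded to 2 decimals
b_rounded <- round(b, 2)
by_hand_rounded <- unname(b_rounded[1] + b_rounded[2] * new_x$x)

data.frame(x = new_x$x, from_predict, by_hand_exact, by_hand_rounded)
```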


Journal Prompt 3

  1. What does the model predict for someone with a mental health rating of 1 (Poor)? Does this prediction seem reasonable?
  2. What does it predict for someone with a rating of 5 (Excellent)?
  3. Compute the difference between those two predictions. How does it relate to the slope?
  4. The intercept is the predicted value when mental health = 0. But 0 isn’t a valid value on this scale (it only goes 1-5). What does this tell you about interpreting the intercept?

ANSWER KEY

  1. For mental_health = 1: \(\widehat{y}\) = 3.89 + 1.01 x 1 = 4.90. You can verify this from the predict() table (first row). Is it reasonable? A predicted life satisfaction of about 5 out of 10 for someone with poor mental health seems plausible, though individual variation is large.

  2. For mental_health = 5: \(\widehat{y}\) = 3.89 + 1.01 x 5 = 8.94. Check the last row of the predict() table. A predicted life satisfaction near 9 out of 10 for someone with excellent mental health also seems plausible.

  3. The difference is 8.94 - 4.90 = 4.04. This equals 4 x 1.01 = 4.04. The difference in predictions equals the difference in X (which is 5 - 1 = 4) multiplied by the slope. This is the core logic of regression: the slope is a constant rate of change, so you just multiply.

  4. The intercept (3.89) is the predicted life satisfaction when mental health = 0. But no one in the data has mental_health = 0 — the scale runs from 1 to 5. The intercept is a mathematical anchor point that positions the line correctly, but it doesn’t have a direct real-world interpretation here. This is common in regression: the intercept is necessary for the equation but doesn’t always describe a realistic scenario. We rely on the slope for substantive interpretation.


Part 4: Seeing the Line

The scatterplot with the regression line

ggplot(reg_data, aes(x = mental_health, y = life_satisfaction)) +
  geom_jitter(alpha = 0.04, width = 0.25, height = 0.25, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1.2) +
  labs(
    title = "Life Satisfaction and Self-Rated Mental Health",
    subtitle = "Red line = regression line (line of best fit)",
    x = "Self-Rated Mental Health (1 = Poor, 5 = Excellent)",
    y = "Life Satisfaction (0-10)"
  ) +
  scale_x_continuous(breaks = 1:5,
                     labels = c("1\nPoor", "2\nFair", "3\nGood",
                                "4\nVery Good", "5\nExcellent")) +
  theme_minimal()

What geom_smooth(method = "lm") does: It draws the regression line — the same one defined by the equation you wrote in Part 3. Every point on this line is a predicted value.

What the model misses: Residuals

The vertical distance between each point and the line is a residual — the difference between what was observed and what the model predicted.

# Pick a mix: some well-predicted, some poorly predicted
set.seed(2026)

# Get all residuals, then pick illustrative cases
reg_data_with_resid <- reg_data %>%
  mutate(
    predicted = round(predict(model, newdata = .), 2),
    residual = round(life_satisfaction - predicted, 2)
  )

# Select cases that show clear contrast
examples <- bind_rows(
  # Two well-predicted respondents (small residuals)
  reg_data_with_resid %>% filter(abs(residual) < 0.5) %>% sample_n(2),
  # One moderately off
  reg_data_with_resid %>% filter(abs(residual) > 1.5 & abs(residual) < 2.5) %>% sample_n(1),
  # Two poorly predicted respondents (large residuals)
  reg_data_with_resid %>% filter(abs(residual) > 3) %>% sample_n(2)
) %>%
  select(mental_health, life_satisfaction, predicted, residual) %>%
  arrange(abs(residual))

examples
## # A tibble: 5 × 4
##   mental_health life_satisfaction predicted residual
##           <dbl>             <dbl>     <dbl>    <dbl>
## 1             4                 8      7.93     0.07
## 2             2                 6      5.91     0.09
## 3             3                 5      6.92    -1.92
## 4             4                 4      7.93    -3.93
## 5             3                 0      6.92    -6.92

The residual column is what the model didn’t capture — everything else besides mental health that affects someone’s life satisfaction.
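
To make the definition concrete, this sketch on made-up data (toy is hypothetical) computes residuals as observed minus predicted, checks them against R’s residuals(), and shows that least-squares residuals sum to zero when the model includes an intercept.

```r
# Hypothetical toy data (not CCHS)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(4, 6, 5, 8, 9))
toy_model <- lm(y ~ x, data = toy)

# Residual = observed - predicted
toy$predicted <- unname(predict(toy_model))
toy$residual  <- toy$y - toy$predicted
toy

# Same values as R's built-in extractor
all.equal(toy$residual, unname(residuals(toy_model)))

# Positive and negative misses cancel out (up to floating-point error)
sum(toy$residual)
```

A positive residual means the point sits above the line; a negative one means it sits below.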


Journal Prompt 4

  1. Look at the example respondents above. Who does the model predict well for (small residual)? Who does it predict poorly for (large residual)?
  2. What might explain why two people with the same mental health rating have different life satisfaction scores? Name two or three possible factors.
  3. The lecture said residuals capture “everything the model didn’t explain.” In your own words, what does that mean?

ANSWER KEY

  1. The respondents with residuals close to zero (the first two rows, with |residual| < 0.5) are well-predicted: the model’s guess was close to their actual score. The respondents with large residuals (last two rows, |residual| > 3) are poorly predicted: the model was off by 3 or more points on a 10-point scale. For example, the respondent with mental health 3 and life satisfaction 0 has a residual of -6.92, meaning the model expected them to be much more satisfied than they actually are. The specific cases depend on the random seed (set.seed(2026) keeps them reproducible), but the contrast between small and large residuals should be clear.

  2. Many factors could explain why two people with the same mental health rating report different life satisfaction: income and financial security (someone with good mental health but severe financial stress might report lower satisfaction), relationship quality (social support, marriage, family conflict), physical health (chronic pain, disability), employment situation (job loss, meaningful work), or recent life events (bereavement, moving, having a child). Mental health matters, but it’s only one piece of the puzzle.

  3. The residual is the gap between reality and the model’s prediction. It captures everything that affects life satisfaction other than mental health — all the variables we didn’t include (income, relationships, health, luck, personality, life circumstances). A simple regression with one predictor will always leave a lot unexplained. The residuals are a reminder that our model is a simplification, not a complete explanation.


Part 5: Is the Pattern Real?

We found a positive slope. But could this just be noise — a pattern that appeared by chance in our particular sample?

The p-value

tidy(model) %>%
  mutate(
    across(c(estimate, std.error, statistic), ~ round(.x, 4)),
    p.value = ifelse(p.value < 0.0001, "< 0.0001", as.character(round(p.value, 4)))
  )
## # A tibble: 2 × 5
##   term          estimate std.error statistic p.value 
##   <chr>            <dbl>     <dbl>     <dbl> <chr>   
## 1 (Intercept)       3.89    0.0228      171. < 0.0001
## 2 mental_health     1.01    0.0061      165. < 0.0001

Focus on the p.value column for the mental_health row.

From lecture: the p-value asks, “If there were truly no relationship between mental health and life satisfaction in the population, how likely would we be to see a slope this large (or larger) in our sample, just by chance?”

A small p-value means it would be very unlikely — so we have evidence the pattern is real.
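
One way to make the “just by chance” idea tangible is a shuffle (permutation) simulation. This is a sketch of the logic on made-up data, not the calculation lm() performs (lm() uses a t-test). Shuffling y breaks any link with x, so the shuffled slopes show what “no relationship” looks like. All names here (toy_x, toy_y, slope_of, n_shuffles) are hypothetical.

```r
set.seed(1)

# Hypothetical data with a built-in positive relationship
toy_x <- rep(1:5, each = 20)
toy_y <- 4 + 0.5 * toy_x + rnorm(100, sd = 2)

# Least-squares slope for one predictor: cov(x, y) / var(x)
slope_of <- function(x, y) cov(x, y) / var(x)
observed_slope <- slope_of(toy_x, toy_y)

# Shuffle y many times; each shuffle is a world with no relationship
n_shuffles <- 1000
shuffled_slopes <- replicate(n_shuffles, slope_of(toy_x, sample(toy_y)))

# Share of shuffles with a slope at least as far from zero as the observed one
p_shuffle <- mean(abs(shuffled_slopes) >= abs(observed_slope))

cat("Observed slope:", round(observed_slope, 3), "\n")
cat("Shuffle-based p-value:", p_shuffle, "\n")
```

If almost none of the shuffled slopes are as extreme as the observed one, chance alone is a poor explanation for the pattern.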

Standard thresholds

  • p < 0.05: “Significant” (less than a 5% chance of a slope this extreme if there were no relationship)
  • p < 0.01: “Highly significant” (less than 1%)
  • p < 0.001: “Very highly significant” (less than 0.1%)

Our p-value is far below all of these thresholds — very strong evidence that the association between mental health and life satisfaction is not just noise.


Journal Prompt 5

  1. Is the association between mental health and life satisfaction statistically significant? How do you know?
  2. From lecture: significance tells you the pattern is unlikely to be noise. What does it not tell you? (Hint: think about effect size and causation.)
  3. Imagine a study found a slope of 0.001 with p < 0.001 and n = 10,000,000. Would you call that an important finding? Why or why not?

ANSWER KEY

  1. Yes, the association is statistically significant. The p-value for the mental_health slope shows “< 0.0001” in the output, which is well below all conventional thresholds (0.05, 0.01, 0.001). This means there is very strong evidence against the null hypothesis of no relationship. With over 66,000 respondents and a slope of approximately 1.01, we can be very confident this pattern exists in the population and is not a fluke of our sample.

  2. Statistical significance does not tell us two important things. First, it doesn’t tell us about effect size: how much mental health matters for life satisfaction in practical terms. A slope can be statistically significant but trivially small. Second, it doesn’t tell us about causation: we don’t know whether better mental health causes higher life satisfaction, or whether it’s the other way around, or whether some third factor (like income, social support, or personality) drives both. Significance only tells us the pattern is unlikely to be random noise.

  3. No, that would not be an important finding in any practical sense. A slope of 0.001 means a one-unit change in X predicts a change of only 0.001 in Y — that’s essentially zero effect. The reason it’s “significant” is the massive sample size (n = 10 million): with enough data, even the tiniest, most meaningless pattern becomes detectable. This is a classic example of statistical significance without practical significance. The p-value tells you the pattern is “real” (not noise), but the slope tells you it doesn’t matter. This is why we always look at the slope and the p-value together, not just the p-value alone.
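
The scenario in question 3 can be imitated with simulated data. In this sketch all values are invented: the true slope of 0.02 and the sample size of 200,000 are assumptions chosen for illustration. The slope is tiny, yet the large sample makes it statistically detectable while R-squared stays near zero.

```r
set.seed(42)

# Simulated data: a very weak true relationship, a very large sample
n <- 200000
x_sim <- rnorm(n)
y_sim <- 0.02 * x_sim + rnorm(n)

sim_model   <- lm(y_sim ~ x_sim)
sim_summary <- summary(sim_model)

slope_est <- sim_summary$coefficients["x_sim", "Estimate"]
slope_p   <- sim_summary$coefficients["x_sim", "Pr(>|t|)"]
r_sq      <- sim_summary$r.squared

cat("Estimated slope:", round(slope_est, 4), "\n")
cat("p-value:", format.pval(slope_p), "\n")
cat("R-squared:", round(r_sq, 5), "\n")
```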


Part 6: A Categorical Predictor — Sex Differences in Mental Health

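So far our predictor (mental health) was ordinal, coded 1 through 5. What happens when the predictor is categorical, like sex? Note that mental health switches roles in this part: it becomes the outcome, and sex is the predictor.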

Fitting the model

cat_data <- cchs %>%
  filter(mental_health >= 1 & mental_health <= 5,
         DHH_SEX %in% c(1, 2)) %>%
  mutate(sex = ifelse(DHH_SEX == 1, "Male", "Female"))

cat_model <- lm(mental_health ~ sex, data = cat_data)

tidy(cat_model) %>%
  mutate(
    across(c(estimate, std.error, statistic), ~ round(.x, 4)),
    p.value = ifelse(p.value < 0.0001, "< 0.0001", as.character(round(p.value, 4)))
  )
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value 
##   <chr>          <dbl>     <dbl>     <dbl> <chr>   
## 1 (Intercept)    3.51     0.0055     641.  < 0.0001
## 2 sexMale        0.181    0.0081      22.5 < 0.0001

Reading this output:

When a predictor is categorical, R picks one group as the reference category (here, “Female” because it comes first alphabetically) and reports the difference for the other group.

  • The intercept is the predicted mental health score for the reference group (Female)
  • The slope (labeled sexMale) is how much higher or lower Males score compared to Females
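
Behind the scenes, R turns the two-level predictor into a 0/1 indicator column. The sketch below uses made-up scores (toy_sex is hypothetical) and model.matrix() to display that column, so the fitted equation reads as intercept + slope x (1 if Male, 0 if Female).

```r
# Hypothetical toy data (not CCHS)
toy_sex <- data.frame(
  sex   = c("Female", "Female", "Female", "Male", "Male", "Male"),
  score = c(3, 4, 3, 4, 4, 5)
)

# The design matrix R builds: a column of 1s plus a 0/1 "sexMale" indicator
design <- model.matrix(~ sex, data = toy_sex)
design

# Fit the model; the sexMale coefficient multiplies that 0/1 column
toy_cat_model <- lm(score ~ sex, data = toy_sex)
coef(toy_cat_model)

# Group means for comparison
group_means <- tapply(toy_sex$score, toy_sex$sex, mean)
group_means
```

For Females the indicator is 0, so the prediction is just the intercept; for Males it is 1, so the prediction is the intercept plus the sexMale coefficient.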

Checking against group means

cat_data %>%
  group_by(sex) %>%
  summarise(
    mean_mental_health = round(mean(mental_health), 3),
    n = n()
  )
## # A tibble: 2 × 3
##   sex    mean_mental_health     n
##   <chr>               <dbl> <int>
## 1 Female               3.51 35665
## 2 Male                 3.69 30683

Check: the intercept should equal the Female mean, and intercept + slope should equal the Male mean. Let’s verify:

b0_sex <- round(coef(cat_model)[1], 3)
b1_sex <- round(coef(cat_model)[2], 3)

cat("Intercept (Female mean):", b0_sex, "\n")
## Intercept (Female mean): 3.508
cat("Slope (Male difference):", b1_sex, "\n")
## Slope (Male difference): 0.181
cat("Intercept + Slope =", b0_sex, "+", b1_sex, "=", b0_sex + b1_sex, "(Male mean)\n")
## Intercept + Slope = 3.508 + 0.181 = 3.689 (Male mean)

How much does sex explain?

glance(cat_model) %>%
  select(r.squared, adj.r.squared) %>%
  mutate(across(everything(), ~ round(.x, 4)))
## # A tibble: 1 × 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1    0.0076        0.0075

R-squared tells you the proportion of variation in mental health that is explained by sex alone. Look at the number — is it large or small?
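
As a sketch of where this number comes from (made-up data; toy is hypothetical): R-squared is 1 minus the ratio of residual variation to total variation, and in a regression with a single predictor it equals the squared correlation. Base R’s summary() is used here; glance() reports the same value.

```r
# Hypothetical toy data (not CCHS)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(4, 6, 5, 8, 9))
toy_model <- lm(y ~ x, data = toy)

# Variation left over after the model (residual sum of squares)
ss_residual <- sum(residuals(toy_model)^2)

# Total variation in y around its mean (total sum of squares)
ss_total <- sum((toy$y - mean(toy$y))^2)

r2_by_hand      <- 1 - ss_residual / ss_total
r2_from_summary <- summary(toy_model)$r.squared
r2_from_cor     <- cor(toy$x, toy$y)^2

cat("By hand:", r2_by_hand, "\n")
cat("summary():", r2_from_summary, "\n")
cat("Squared correlation:", r2_from_cor, "\n")
```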


Journal Prompt 6

  1. What is the predicted mental health score for women? For men?
  2. Is the difference large or small? How do you judge “large” on a 5-point scale?
  3. The p-value is very small. Does that mean the difference is important? Why might a statistically significant difference still be practically unimportant?
  4. What does R-squared tell you about how much of the variation in mental health is explained by sex alone? Does this change your interpretation of the p-value?

ANSWER KEY

  1. From the tidy() output, the intercept is approximately 3.51, which is the predicted mental health score for women (the reference group). The slope for sexMale is approximately 0.18. So the predicted score for men is 3.51 + 0.18 = 3.69. You can confirm this by checking the group means table and the cat() verification output, which both show Female mean = 3.51 and Male mean = 3.69.

  2. The difference is 0.18 points on a 5-point scale. That’s very small: less than one-fifth of a single point. To put it in perspective, the scale goes from 1 (Poor) to 5 (Excellent), a range of 4 points, so the gap between men and women is about 4.5% of the total range (0.18 / 4). Both groups average between “Good” (3) and “Very Good” (4). If you saw two people, one scoring 3.51 and the other 3.69, you would not describe their mental health as meaningfully different. A useful benchmark: if the difference is small relative to the standard deviation of the outcome (which is about 1.04), then it’s not large. Here, 0.18 / 1.04 = 0.17 standard deviations, a very small effect.

  3. No, a very small p-value does not mean the difference is important. It means the difference is unlikely to be zero in the population. With n = 66,000+, even tiny differences produce very small p-values because the standard error shrinks as n grows. The p-value here reflects our certainty that the difference exists, not the size of the difference. A difference of 0.18 on a 5-point scale is statistically detectable but practically negligible — it would not change any clinical decision, policy, or individual’s life.

  4. R-squared is approximately 0.0076 (about 0.76%). That means sex explains less than 1% of the variation in mental health. The other 99%+ comes from factors other than sex: income, age, life circumstances, personality, social support, and countless other things. This extremely low R-squared puts the p-value in perspective: yes, the difference between men and women is “real” in the statistical sense, but sex is essentially useless as a predictor of any individual’s mental health. This is a clear example of the lecture point that statistical significance and practical importance are not the same thing.


Part 7: The Equation Challenge

You’ve fitted and interpreted a regression model. Now let’s see if you can work with the equation on paper — without running code.

Below are regression results from a model you haven’t run: life satisfaction predicted by low stress (where higher values of low_stress mean less stressed, on a 1-5 scale).

The model found:

\[\widehat{\text{life\_satisfaction}} = 4.74 + 0.85 \times \text{low\_stress}\]

Use this equation to answer the following. Show your arithmetic in your journal.

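Before you start, let’s check this equation by running the model. The fitted coefficients (about 4.73 and 0.852) differ slightly from the values in the equation above, by about 0.01 for the intercept; use the equation as written for the problems below: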

stress_data <- cchs %>%
  filter(life_satisfaction <= 10,
         low_stress >= 1 & low_stress <= 5)

stress_model <- lm(life_satisfaction ~ low_stress, data = stress_data)
tidy(stress_model) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    4.73     0.0237      199.       0
## 2 low_stress     0.852    0.0069      123.       0

Journal Prompt 7

Problem A — Find \(\widehat{y}\): A respondent reports a stress level of 4 on the reversed scale (meaning they have relatively low stress). What does the model predict for their life satisfaction?

Problem B — Find X: The model predicts a life satisfaction score of 7.29. What value of low_stress would produce this prediction? (Solve the equation for X.)

Problem C — Find the residual: A respondent has low_stress = 3 and an observed life satisfaction of 8. The model predicts \(\widehat{y}\) for this respondent based on the equation above. What is the residual? Is this respondent above or below the regression line?

Problem D — Interpret: In one sentence, interpret the slope of 0.85 using the template from lecture: “Comparing two people who differ by one unit on [predictor], the model predicts…”

ANSWER KEY

Problem A — Find \(\widehat{y}\):

Plug low_stress = 4 into the equation:

\(\widehat{y}\) = 4.74 + 0.85 x 4 = 4.74 + 3.40 = 8.14

The model predicts a life satisfaction of 8.14 for someone with low_stress = 4.


Problem B — Find X:

We know \(\widehat{y}\) = 7.29 and need to solve for X:

7.29 = 4.74 + 0.85 x X

7.29 - 4.74 = 0.85 x X

2.55 = 0.85 x X

X = 2.55 / 0.85 = 3

A low_stress value of 3 would produce a predicted life satisfaction of 7.29.


Problem C — Find the residual:

First, compute the predicted value for low_stress = 3:

\(\widehat{y}\) = 4.74 + 0.85 x 3 = 4.74 + 2.55 = 7.29

Then compute the residual:

residual = observed - predicted = 8 - 7.29 = 0.71

The residual is positive (+0.71), which means this respondent is above the regression line. Their actual life satisfaction is higher than what the model predicted based on their stress level alone. Something other than stress — maybe strong social support, meaningful work, or good health — is pushing their satisfaction above the trend.


Problem D — Interpret:

“Comparing two people who differ by one unit on low stress, the model predicts the person with lower stress will have a life satisfaction score that is 0.85 points higher on the 0-10 scale.”

Note: The slope tells us about the predicted difference between two people, not about what happens when one person’s stress changes. We cannot make causal claims from this cross-sectional regression.
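
If you want to check your paper arithmetic afterwards, the equation can be written as a small R function. This sketch hard-codes the coefficients from the equation in this part (4.74 and 0.85); the names predict_ls and solve_for_stress are hypothetical helpers, not part of any package.

```r
# Coefficients taken from the equation given in Part 7
b0_stress <- 4.74
b1_stress <- 0.85

# Predicted life satisfaction for a given low_stress value
predict_ls <- function(low_stress) b0_stress + b1_stress * low_stress

# Rearranged equation: which low_stress value gives a target prediction?
solve_for_stress <- function(y_hat) (y_hat - b0_stress) / b1_stress

problem_a <- predict_ls(4)            # Problem A: about 8.14
problem_b <- solve_for_stress(7.29)   # Problem B: about 3
problem_c <- 8 - predict_ls(3)        # Problem C: observed - predicted, about 0.71

cat("A:", problem_a, "\n")
cat("B:", problem_b, "\n")
cat("C:", problem_c, "\n")
```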


Part 8: Reflection

Final Journal Prompt

Take a few minutes to reflect on this practice session:

  1. What is the one thing you want to remember about the difference between correlation and regression?
  2. When you see a regression output table, what are the two numbers you look at first? Why?
  3. What was confusing or surprising about today’s practice? Write down a question you’d like to discuss in class.

ANSWER KEY

These are reflection questions, so answers will vary. Here are model responses:

  1. Correlation tells you direction and strength (positive/negative, weak/strong) but not how much. Regression tells you the predicted change in Y per unit change in X, in real units. The correlation says “these move together”; the regression says “by how much.”

  2. The slope (to understand the size and direction of the relationship) and the p-value (to assess whether the relationship is likely real or just noise). Together, these answer the two most important questions: “Is there an effect?” and “How big is it?” A common mistake is looking only at the p-value and ignoring the slope. As we saw in Part 6, a tiny slope can have a tiny p-value if n is large enough.

  3. Open-ended. Common sources of confusion at this stage include: why the intercept doesn’t always make sense in context, what “predicted” means when we’re looking at data we already have, and why we’d use a straight line for data that doesn’t look very linear. All good questions worth discussing.


Summary of Key Concepts

Correlation gives a unitless number (r) between -1 and +1. It measures direction and strength but not “how much.”

Regression gives an equation: \(\widehat{y} = b_0 + b_1 \cdot x\). The slope (\(b_1\)) tells you the predicted change in Y for a one-unit change in X — in real units.

The intercept (\(b_0\)) is the predicted Y when X = 0. It anchors the line but may not always have a meaningful real-world interpretation.

The slope (\(b_1\)) is the rate of change. Multiply any difference in X by the slope to get the predicted difference in Y.

Residuals (\(e = y - \widehat{y}\)) are what the model didn’t explain. They capture all the other factors besides X that affect Y.

The p-value asks: “How likely is this slope if there were no real relationship?” A small p-value means the pattern is unlikely to be noise. It does not mean the effect is large or causal.

Categorical predictors work by comparing groups. The intercept is the predicted value for the reference group; the slope is the predicted difference for the other group.

Summary of Key Functions

Function                     | What it does                           | Example
-----------------------------|----------------------------------------|------------------------------
lm()                         | Fit a linear regression                | lm(y ~ x, data = df)
tidy()                       | Clean regression output table          | tidy(model)
glance()                     | Model-level summary (R-squared, etc.)  | glance(model)
coef()                       | Extract intercept and slope            | coef(model)
predict()                    | Get predicted values for new data      | predict(model, newdata = df)
cor()                        | Compute correlation coefficient        | cor(x, y)
geom_smooth(method = "lm")   | Add regression line to plot            | Added inside ggplot()
geom_jitter()                | Scatter with random noise              | geom_jitter(alpha = 0.05)
ifelse()                     | Recode a variable                      | ifelse(x == 1, "A", "B")

Remember: The goal is not perfect code, but building intuition for what regression tells you — and what it doesn’t. If something confuses you, write it down — that’s valuable information for class discussion.