Lesson 1: Always Check, and Double Check
print(table(chs_data$PGEOGR))
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2446 1714 1088 1977 2037 2001 1218 1165 1395 2195 946 1021 1009 1091 1033 2233
## 17 18 19 20 21 22 23 24 25 26 27 28
## 1037 1924 997 940 1742 1078 968 2748 1110 1123 1964 788
print(table(chs_data$PVISMIN))
##
## 1 2 9
## 2609 17765 20614
print(table(chs_data$PHGEDUC))
##
## 1 2 3 4 5 6 7 99
## 3691 8035 3539 7656 1688 6671 4230 5478
print(table(chs_data$PCOS_05))
##
## 1 2 3 4 5 6 7 8 9 99
## 1829 1122 1246 6034 3513 6297 8578 4396 7515 458
What do you note here? What about below?
atlantic_chs <- chs_data %>%
filter(PGEOGR %in% 1:6) %>%
select(
outcome = PCOS_05, # Community satisfaction (0-10)
household_income = PHHTTINC, # Household income
education = PHGEDUC, # Highest education completed
visible_minority = PVISMIN # Visible minority status
)
table(atlantic_chs$visible_minority)
##
## 9
## 11263
Let’s do a more systematic check:
vis_min_by_region <- chs_data %>%
group_by(PGEOGR) %>%
summarize(
total_observations = n(),
visible_minority_1 = sum(PVISMIN == 1),
visible_minority_2 = sum(PVISMIN == 2),
visible_minority_9 = sum(PVISMIN == 9),
percent_missing = round(sum(PVISMIN == 9) / n() * 100, 1)
)
print(vis_min_by_region)
## # A tibble: 28 × 6
## PGEOGR total_observations visible_minority_1 visible_minority_2
## <int> <int> <int> <int>
## 1 1 2446 0 0
## 2 2 1714 0 0
## 3 3 1088 0 0
## 4 4 1977 0 0
## 5 5 2037 0 0
## 6 6 2001 0 0
## 7 7 1218 23 1076
## 8 8 1165 288 870
## 9 9 1395 11 1236
## 10 10 2195 4 2109
## # ℹ 18 more rows
## # ℹ 2 more variables: visible_minority_9 <int>, percent_missing <dbl>
We could run the models on the full data like so:
chs_for_analysis <- chs_data %>%
select(
outcome = PCOS_05, # Community satisfaction (0-10)
household_income = PHHTTINC, # Household income
education = PHGEDUC, # Highest education completed
visible_minority = PVISMIN, # Visible minority status
region = PGEOGR # Region code
) %>%
mutate(
# Clean outcome variable (community satisfaction 0-10)
outcome = case_when(
outcome < 0 ~ NA_real_,
outcome > 10 ~ NA_real_,
TRUE ~ as.numeric(outcome)
),
# Clean household income by excluding extreme values
household_income = case_when(
household_income >= 99999999999 ~ NA_real_, # Exclude extremely large values
household_income < 0 ~ NA_real_, # Exclude negative values
TRUE ~ household_income
),
# Create log income variable with clean data
log_income = log10(household_income + 1),
# Recode region with all provided names
region_name = case_when(
region == 1 ~ "Newfoundland and Labrador",
region == 2 ~ "Prince Edward Island",
region == 3 ~ "Halifax",
region == 4 ~ "Outside Halifax - NS",
region == 5 ~ "Saint John and Moncton",
region == 6 ~ "Outside Saint John and Moncton - NB",
region == 7 ~ "Québec",
region == 8 ~ "Montréal",
region == 9 ~ "Other CMAs - Québec",
region == 10 ~ "Outside CMAs - QC",
region == 11 ~ "Ottawa",
region == 12 ~ "Toronto",
region == 13 ~ "Hamilton",
region == 14 ~ "Kitchener-Cambridge-Waterloo",
region == 15 ~ "Other CMAs - Ontario",
region == 16 ~ "Outside CMAs - ON",
region == 17 ~ "Winnipeg",
region == 18 ~ "Outside Winnipeg - MB",
region == 19 ~ "Regina",
region == 20 ~ "Saskatoon",
region == 21 ~ "Outside Regina and Saskatoon - SK",
region == 22 ~ "Calgary",
region == 23 ~ "Edmonton",
region == 24 ~ "Outside Calgary and Edmonton - AB",
region == 25 ~ "Vancouver",
region == 26 ~ "Other CMAs - British Columbia",
region == 27 ~ "Outside CMAs - BC",
region == 28 ~ "Territories",
TRUE ~ NA_character_
),
# Convert to factor with explicit ordering
region_name = factor(region_name),
# Recode education as a factor with meaningful labels
education = case_when(
education == 1 ~ "Less than high school",
education == 2 ~ "High school diploma",
education == 3 ~ "Trade certificate",
education == 4 ~ "College/CEGEP diploma",
education == 5 ~ "University below bachelor",
education == 6 ~ "Bachelor's degree",
education == 7 ~ "Above bachelor's degree",
TRUE ~ NA_character_
),
# Convert to factor with explicit ordering
education = factor(education,
levels = c("Less than high school",
"High school diploma",
"Trade certificate",
"College/CEGEP diploma",
"University below bachelor",
"Bachelor's degree",
"Above bachelor's degree")),
# Recode visible minority status as a factor
# Based on the table: 1=visible minority, 2=no visible minority, 9=not stated
visible_minority = case_when(
visible_minority == 1 ~ "Visible minority member",
visible_minority == 2 ~ "No visible minority",
TRUE ~ NA_character_
),
# Convert to factor with explicit ordering
visible_minority = factor(visible_minority,
levels = c("Visible minority member",
"No visible minority"))
)
Building Progressive Regression Models (using full dataset for
demonstration)
# Model 1: Household Income Only
model1 <- lm(outcome ~ log_income, data = chs_for_analysis)
summary(model1)
##
## Call:
## lm(formula = outcome ~ log_income, data = chs_for_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3936 -1.3103 0.6809 1.7434 3.0253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.55803 0.16168 34.377 < 2e-16 ***
## log_income 0.14417 0.03352 4.302 1.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.172 on 33140 degrees of freedom
## (7846 observations deleted due to missingness)
## Multiple R-squared: 0.000558, Adjusted R-squared: 0.0005279
## F-statistic: 18.5 on 1 and 33140 DF, p-value: 1.701e-05
# Model 2: Add Education Level
model2 <- lm(outcome ~ log_income + education, data = chs_for_analysis)
summary(model2)
##
## Call:
## lm(formula = outcome ~ log_income + education, data = chs_for_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1680 -1.4335 0.4245 1.7528 3.9079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.56769 0.18474 24.725 <2e-16 ***
## log_income 0.49260 0.04046 12.175 <2e-16 ***
## educationHigh school diploma -0.47556 0.04642 -10.246 <2e-16 ***
## educationTrade certificate -0.58201 0.05512 -10.560 <2e-16 ***
## educationCollege/CEGEP diploma -0.83930 0.04815 -17.432 <2e-16 ***
## educationUniversity below bachelor -0.61749 0.06969 -8.861 <2e-16 ***
## educationBachelor's degree -0.92026 0.05165 -17.816 <2e-16 ***
## educationAbove bachelor's degree -0.90825 0.05711 -15.904 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.151 on 30378 degrees of freedom
## (10602 observations deleted due to missingness)
## Multiple R-squared: 0.01401, Adjusted R-squared: 0.01378
## F-statistic: 61.67 on 7 and 30378 DF, p-value: < 2.2e-16
# Model 3: Add Visible Minority Status
model3 <- lm(outcome ~ log_income + education + visible_minority, data = chs_for_analysis)
summary(model3)
##
## Call:
## lm(formula = outcome ~ log_income + education + visible_minority,
## data = chs_for_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8930 -1.6130 0.2976 1.7448 4.2639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.97730 0.24539 16.208 < 2e-16 ***
## log_income 0.53884 0.05207 10.349 < 2e-16 ***
## educationHigh school diploma -0.45692 0.06318 -7.232 4.97e-13 ***
## educationTrade certificate -0.64664 0.07562 -8.551 < 2e-16 ***
## educationCollege/CEGEP diploma -0.76111 0.06491 -11.725 < 2e-16 ***
## educationUniversity below bachelor -0.57024 0.09065 -6.290 3.24e-10 ***
## educationBachelor's degree -0.84646 0.06960 -12.163 < 2e-16 ***
## educationAbove bachelor's degree -0.80487 0.07645 -10.528 < 2e-16 ***
## visible_minorityNo visible minority 0.21569 0.05070 4.254 2.11e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.17 on 17349 degrees of freedom
## (23630 observations deleted due to missingness)
## Multiple R-squared: 0.01395, Adjusted R-squared: 0.0135
## F-statistic: 30.69 on 8 and 17349 DF, p-value: < 2.2e-16
# Create regression tables using tab_model
tab_model(
model1, model2, model3,
dv.labels = c("Model 1:\nIncome Only", "Model 2:\n+ Education", "Model 3:\n+ Visible Minority"),
string.pred = "Predictors",
string.est = "β",
string.ci = "95% CI",
p.style = "stars",
show.aic = TRUE,
show.r2 = TRUE,
title = "Table 1. Factors Affecting Community Satisfaction in Canada"
)
Table 1. Factors Affecting Community Satisfaction in Canada
|
Model 1: Income Only
|
Model 2: + Education
|
Model 3: + Visible Minority
|
Predictors
|
β
|
95% CI
|
β
|
95% CI
|
β
|
95% CI
|
(Intercept)
|
5.56 ***
|
5.24 – 5.87
|
4.57 ***
|
4.21 – 4.93
|
3.98 ***
|
3.50 – 4.46
|
log income
|
0.14 ***
|
0.08 – 0.21
|
0.49 ***
|
0.41 – 0.57
|
0.54 ***
|
0.44 – 0.64
|
education [High school diploma]
|
|
|
-0.48 ***
|
-0.57 – -0.38
|
-0.46 ***
|
-0.58 – -0.33
|
education [Trade certificate]
|
|
|
-0.58 ***
|
-0.69 – -0.47
|
-0.65 ***
|
-0.79 – -0.50
|
education [College/CEGEP diploma]
|
|
|
-0.84 ***
|
-0.93 – -0.74
|
-0.76 ***
|
-0.89 – -0.63
|
education [University below bachelor]
|
|
|
-0.62 ***
|
-0.75 – -0.48
|
-0.57 ***
|
-0.75 – -0.39
|
education [Bachelor’s degree]
|
|
|
-0.92 ***
|
-1.02 – -0.82
|
-0.85 ***
|
-0.98 – -0.71
|
education [Above bachelor’s degree]
|
|
|
-0.91 ***
|
-1.02 – -0.80
|
-0.80 ***
|
-0.95 – -0.66
|
visible minority [No visible minority]
|
|
|
|
|
0.22 ***
|
0.12 – 0.32
|
Observations
|
33142
|
30386
|
17358
|
R2 / R2 adjusted
|
0.001 / 0.001
|
0.014 / 0.014
|
0.014 / 0.013
|
AIC
|
145475.142
|
132800.950
|
76163.893
|
- p<0.05 ** p<0.01 *** p<0.001
|
We are, however, not being systematic still in understanding how to
handle all our variables:
Lesson 2: Explore variable distributions before AND after
atlantic_clean <- chs_data %>%
filter(PGEOGR %in% 1:6) %>%
select(
satisfaction = PCOS_05,
income = PHHTTINC,
education = PHGEDUC,
immigration = PRSPIMST,
region = PGEOGR
)
# Check income distribution before transformation
print(table(atlantic_clean$income))
##
## 775 800 850 875 1300 1900
## 2 1 1 1 3 1
## 1950 2000 2400 2500 4200 4800
## 1 1 1 3 1 1
## 5250 5500 6000 6250 6500 7000
## 1 1 1 2 1 2
## 7250 7500 7750 8000 8250 8500
## 5 4 14 15 15 9
## 8750 9000 9250 9500 9750 10000
## 6 5 5 4 7 19
## 10500 11000 11500 12000 12500 13000
## 20 22 34 28 14 17
## 13500 14000 14500 15000 15500 16000
## 19 25 19 19 20 14
## 16500 17000 17500 18000 18500 19000
## 10 17 20 11 20 38
## 19500 20000 20500 21000 22000 23000
## 24 97 2 166 230 214
## 24000 25000 26000 27000 28000 29000
## 160 143 86 91 81 75
## 30000 31000 32000 33000 34000 35000
## 80 67 69 73 89 93
## 36000 37000 38000 39000 40000 41000
## 100 92 95 101 95 79
## 42000 43000 44000 45000 46000 47000
## 78 89 70 70 78 74
## 47500 48000 49000 50000 51000 52500
## 27 77 56 133 11 126
## 55000 57500 60000 62500 65000 67500
## 174 174 150 164 185 157
## 70000 72500 75000 77500 80000 82500
## 157 146 136 132 142 133
## 85000 87500 90000 92500 95000 97500
## 146 117 122 99 137 102
## 1e+05 102500 105000 110000 115000 120000
## 142 16 153 143 129 122
## 125000 130000 135000 140000 145000 150000
## 154 111 92 93 93 102
## 155000 160000 165000 170000 175000 180000
## 86 70 72 50 57 58
## 185000 190000 195000 2e+05 205000 210000
## 46 61 39 54 3 64
## 220000 230000 240000 250000 260000 270000
## 51 46 33 32 23 21
## 280000 290000 3e+05 310000 320000 330000
## 16 14 6 5 1 6
## 340000 350000 360000 370000 380000 390000
## 7 8 5 7 8 4
## 4e+05 410000 420000 450000 460000 470000
## 4 4 1 2 5 7
## 480000 5e+05 525000 550000 575000 99999999999
## 7 6 10 12 3 2535
# Check satisfaction values
print(table(atlantic_clean$satisfaction))
##
## 1 2 3 4 5 6 7 8 9 99
## 437 242 300 1509 830 1581 2305 1314 2621 124
# Clean the variables one by one
atlantic_clean <- atlantic_clean %>%
mutate(
satisfaction_clean = case_when(
satisfaction >= 0 & satisfaction <= 10 ~ satisfaction,
TRUE ~ NA_real_
)
)
# Compare before AND after
print(table(atlantic_clean$satisfaction))
##
## 1 2 3 4 5 6 7 8 9 99
## 437 242 300 1509 830 1581 2305 1314 2621 124
print(table(atlantic_clean$satisfaction_clean))
##
## 1 2 3 4 5 6 7 8 9
## 437 242 300 1509 830 1581 2305 1314 2621
atlantic_clean <- atlantic_clean %>%
mutate(
income_clean = case_when(
income >= 99999999999 ~ NA_real_,
income < 0 ~ NA_real_,
TRUE ~ income
)
)
## Compare before AND after
print(table(atlantic_clean$income))
##
## 775 800 850 875 1300 1900
## 2 1 1 1 3 1
## 1950 2000 2400 2500 4200 4800
## 1 1 1 3 1 1
## 5250 5500 6000 6250 6500 7000
## 1 1 1 2 1 2
## 7250 7500 7750 8000 8250 8500
## 5 4 14 15 15 9
## 8750 9000 9250 9500 9750 10000
## 6 5 5 4 7 19
## 10500 11000 11500 12000 12500 13000
## 20 22 34 28 14 17
## 13500 14000 14500 15000 15500 16000
## 19 25 19 19 20 14
## 16500 17000 17500 18000 18500 19000
## 10 17 20 11 20 38
## 19500 20000 20500 21000 22000 23000
## 24 97 2 166 230 214
## 24000 25000 26000 27000 28000 29000
## 160 143 86 91 81 75
## 30000 31000 32000 33000 34000 35000
## 80 67 69 73 89 93
## 36000 37000 38000 39000 40000 41000
## 100 92 95 101 95 79
## 42000 43000 44000 45000 46000 47000
## 78 89 70 70 78 74
## 47500 48000 49000 50000 51000 52500
## 27 77 56 133 11 126
## 55000 57500 60000 62500 65000 67500
## 174 174 150 164 185 157
## 70000 72500 75000 77500 80000 82500
## 157 146 136 132 142 133
## 85000 87500 90000 92500 95000 97500
## 146 117 122 99 137 102
## 1e+05 102500 105000 110000 115000 120000
## 142 16 153 143 129 122
## 125000 130000 135000 140000 145000 150000
## 154 111 92 93 93 102
## 155000 160000 165000 170000 175000 180000
## 86 70 72 50 57 58
## 185000 190000 195000 2e+05 205000 210000
## 46 61 39 54 3 64
## 220000 230000 240000 250000 260000 270000
## 51 46 33 32 23 21
## 280000 290000 3e+05 310000 320000 330000
## 16 14 6 5 1 6
## 340000 350000 360000 370000 380000 390000
## 7 8 5 7 8 4
## 4e+05 410000 420000 450000 460000 470000
## 4 4 1 2 5 7
## 480000 5e+05 525000 550000 575000 99999999999
## 7 6 10 12 3 2535
print(table(atlantic_clean$income_clean))
##
## 775 800 850 875 1300 1900 1950 2000 2400 2500 4200
## 2 1 1 1 3 1 1 1 1 3 1
## 4800 5250 5500 6000 6250 6500 7000 7250 7500 7750 8000
## 1 1 1 1 2 1 2 5 4 14 15
## 8250 8500 8750 9000 9250 9500 9750 10000 10500 11000 11500
## 15 9 6 5 5 4 7 19 20 22 34
## 12000 12500 13000 13500 14000 14500 15000 15500 16000 16500 17000
## 28 14 17 19 25 19 19 20 14 10 17
## 17500 18000 18500 19000 19500 20000 20500 21000 22000 23000 24000
## 20 11 20 38 24 97 2 166 230 214 160
## 25000 26000 27000 28000 29000 30000 31000 32000 33000 34000 35000
## 143 86 91 81 75 80 67 69 73 89 93
## 36000 37000 38000 39000 40000 41000 42000 43000 44000 45000 46000
## 100 92 95 101 95 79 78 89 70 70 78
## 47000 47500 48000 49000 50000 51000 52500 55000 57500 60000 62500
## 74 27 77 56 133 11 126 174 174 150 164
## 65000 67500 70000 72500 75000 77500 80000 82500 85000 87500 90000
## 185 157 157 146 136 132 142 133 146 117 122
## 92500 95000 97500 1e+05 102500 105000 110000 115000 120000 125000 130000
## 99 137 102 142 16 153 143 129 122 154 111
## 135000 140000 145000 150000 155000 160000 165000 170000 175000 180000 185000
## 92 93 93 102 86 70 72 50 57 58 46
## 190000 195000 2e+05 205000 210000 220000 230000 240000 250000 260000 270000
## 61 39 54 3 64 51 46 33 32 23 21
## 280000 290000 3e+05 310000 320000 330000 340000 350000 360000 370000 380000
## 16 14 6 5 1 6 7 8 5 7 8
## 390000 4e+05 410000 420000 450000 460000 470000 480000 5e+05 525000 550000
## 4 4 4 1 2 5 7 7 6 10 12
## 575000
## 3
atlantic_clean <- atlantic_clean %>%
mutate(
log_income = log10(income_clean)
)
print(table(atlantic_clean$log_income))
##
## 2.88930170250631 2.90308998699194 2.92941892571429 2.94200805302231
## 2 1 1 1
## 3.11394335230684 3.27875360095283 3.29003461136252 3.30102999566398
## 3 1 1 1
## 3.38021124171161 3.39794000867204 3.6232492903979 3.68124123737559
## 1 3 1 1
## 3.72015930340596 3.74036268949424 3.77815125038364 3.79588001734408
## 1 1 1 2
## 3.81291335664286 3.84509804001426 3.86033800657099 3.8750612633917
## 1 2 5 4
## 3.88930170250631 3.90308998699194 3.91645394854993 3.92941892571429
## 14 15 15 9
## 3.94200805302231 3.95424250943932 3.96614173273903 3.97772360528885
## 6 5 5 4
## 3.98900461569854 4 4.02118929906994 4.04139268515823
## 7 19 20 22
## 4.06069784035361 4.07918124604763 4.09691001300806 4.11394335230684
## 34 28 14 17
## 4.13033376849501 4.14612803567824 4.16136800223498 4.17609125905568
## 19 25 19 19
## 4.19033169817029 4.20411998265593 4.21748394421391 4.23044892137827
## 20 14 10 17
## 4.24303804868629 4.25527250510331 4.26717172840301 4.27875360095283
## 20 11 20 38
## 4.29003461136252 4.30102999566398 4.31175386105575 4.32221929473392
## 24 97 2 166
## 4.34242268082221 4.36172783601759 4.38021124171161 4.39794000867204
## 230 214 160 143
## 4.41497334797082 4.43136376415899 4.44715803134222 4.46239799789896
## 86 91 81 75
## 4.47712125471966 4.49136169383427 4.50514997831991 4.51851393987789
## 80 67 69 73
## 4.53147891704226 4.54406804435028 4.55630250076729 4.568201724067
## 89 93 100 92
## 4.57978359661681 4.5910646070265 4.60205999132796 4.61278385671974
## 95 101 95 79
## 4.6232492903979 4.63346845557959 4.64345267648619 4.65321251377534
## 78 89 70 70
## 4.66275783168157 4.67209785793572 4.67669360962487 4.68124123737559
## 78 74 27 77
## 4.69019608002851 4.69897000433602 4.70757017609794 4.72015930340596
## 56 133 11 126
## 4.74036268949424 4.75966784468963 4.77815125038364 4.79588001734407
## 174 174 150 164
## 4.81291335664286 4.82930377283103 4.84509804001426 4.86033800657099
## 185 157 157 146
## 4.8750612633917 4.88930170250631 4.90308998699194 4.91645394854993
## 136 132 142 133
## 4.92941892571429 4.94200805302231 4.95424250943933 4.96614173273903
## 146 117 122 99
## 4.97772360528885 4.98900461569854 5 5.01072386539177
## 137 102 142 16
## 5.02118929906994 5.04139268515823 5.06069784035361 5.07918124604763
## 153 143 129 122
## 5.09691001300806 5.11394335230684 5.13033376849501 5.14612803567824
## 154 111 92 93
## 5.16136800223498 5.17609125905568 5.19033169817029 5.20411998265593
## 93 102 86 70
## 5.21748394421391 5.23044892137827 5.24303804868629 5.25527250510331
## 72 50 57 58
## 5.26717172840301 5.27875360095283 5.29003461136252 5.30102999566398
## 46 61 39 54
## 5.31175386105575 5.32221929473392 5.34242268082221 5.36172783601759
## 3 64 51 46
## 5.38021124171161 5.39794000867204 5.41497334797082 5.43136376415899
## 33 32 23 21
## 5.44715803134222 5.46239799789896 5.47712125471966 5.49136169383427
## 16 14 6 5
## 5.50514997831991 5.51851393987789 5.53147891704226 5.54406804435028
## 1 6 7 8
## 5.55630250076729 5.568201724067 5.57978359661681 5.5910646070265
## 5 7 8 4
## 5.60205999132796 5.61278385671974 5.6232492903979 5.65321251377534
## 4 4 1 2
## 5.66275783168157 5.67209785793572 5.68124123737559 5.69897000433602
## 5 7 7 6
## 5.72015930340596 5.74036268949424 5.75966784468963
## 10 12 3
## Before
print(table(atlantic_clean$education))
##
## 1 2 3 4 5 6 7 99
## 1075 2208 1084 2131 326 1547 1051 1841
atlantic_clean <- atlantic_clean %>%
mutate(
education_clean = case_when(
education == 1 ~ "Less than high school",
education == 2 ~ "High school diploma",
education == 3 ~ "Trade certificate",
education == 4 ~ "College/CEGEP diploma",
education == 5 ~ "University below bachelor",
education == 6 ~ "Bachelor's degree",
education == 7 ~ "Above bachelor's degree",
TRUE ~ NA_character_
),
education_clean = factor(education_clean)
)
## After
print(table(atlantic_clean$education_clean))
##
## Above bachelor's degree Bachelor's degree College/CEGEP diploma
## 1051 1547 2131
## High school diploma Less than high school Trade certificate
## 2208 1075 1084
## University below bachelor
## 326
Now note that these are ordered alphabetically. Rank ordered
categorical variables should be ordered for interpretability, setting
the “lowest” rank/category as the reference category in modeling:
atlantic_clean <- atlantic_clean %>%
mutate(
education_grouped = case_when(
education_clean %in% c("Less than high school", "High school diploma") ~ "High school or less",
education_clean %in% c("Trade certificate", "College/CEGEP diploma") ~ "Post-secondary (non-university)",
education_clean %in% c("University below bachelor", "Bachelor's degree", "Above bachelor's degree") ~ "University education",
TRUE ~ NA_character_
),
education_grouped = factor(education_grouped, levels = c("High school or less",
"Post-secondary (non-university)",
"University education"))
)
## Check, always
print(table(atlantic_clean$education_grouped))
##
## High school or less Post-secondary (non-university)
## 3283 3215
## University education
## 2924
Let’s continue:
## Before
print(table(atlantic_clean$immigration))
##
## 9
## 11263
atlantic_clean <- atlantic_clean %>%
mutate(
immigration_clean = case_when(
immigration == 1 ~ "Canadian-born",
immigration == 2 ~ "Immigrant/Non-permanent",
TRUE ~ NA_character_
),
immigration_clean = factor(immigration_clean)
)
## After
print(table(atlantic_clean$immigration_clean))
## < table of extent 0 >
Again, not provided for Atlantic Canada!
atlantic_analysis <- atlantic_clean %>%
filter(
!is.na(satisfaction_clean),
!is.na(log_income),
!is.na(education_clean)
)
print(nrow(atlantic_clean))
## [1] 11263
print(nrow(atlantic_analysis))
## [1] 7791
print(nrow(atlantic_clean) - nrow(atlantic_analysis))
## [1] 3472
Let us build models now:
model1 <- lm(satisfaction_clean ~ log_income, data = atlantic_analysis)
model2 <- lm(satisfaction_clean ~ log_income + education_grouped, data = atlantic_analysis)
tab_model(
model1, model2,
dv.labels = c("Model 1: Income", "Model 2: + Education"),
title = "Proper Models With Clean Data"
)
Proper Models With Clean Data
|
Model 1: Income
|
Model 2: + Education
|
Predictors
|
Estimates
|
CI
|
p
|
Estimates
|
CI
|
p
|
(Intercept)
|
6.25
|
5.60 – 6.91
|
<0.001
|
4.68
|
3.94 – 5.42
|
<0.001
|
log income
|
0.08
|
-0.06 – 0.21
|
0.272
|
0.48
|
0.32 – 0.64
|
<0.001
|
education grouped [Post-secondary (non-university)]
|
|
|
|
-0.48
|
-0.60 – -0.36
|
<0.001
|
education grouped [University education]
|
|
|
|
-0.63
|
-0.77 – -0.50
|
<0.001
|
Observations
|
7791
|
7791
|
R2 / R2 adjusted
|
0.000 / 0.000
|
0.012 / 0.012
|
Here is a reminder how to clean up your table, which you can do
directly in the tab_model function:
tab_model(
model1, model2,
dv.labels = c("Model 1: Income Only", "Model 2: Income + Education"),
title = "Life Satisfaction in Atlantic Canada",
pred.labels = c(
"(Intercept)" = "Intercept",
"log_income" = "Log Income (log10)",
"education_groupedPost-secondary (non-university)" = "Education: Post-secondary",
"education_groupedUniversity education" = "Education: University"
),
show.reflvl = TRUE,
p.style = "numeric"
)
Life Satisfaction in Atlantic Canada
|
Model 1: Income Only
|
Model 2: Income + Education
|
Predictors
|
Estimates
|
CI
|
p
|
Estimates
|
CI
|
p
|
Intercept
|
6.25
|
5.60 – 6.91
|
<0.001
|
4.68
|
3.94 – 5.42
|
<0.001
|
Log Income (log10)
|
0.08
|
-0.06 – 0.21
|
0.272
|
0.48
|
0.32 – 0.64
|
<0.001
|
Education: Post-secondary
|
|
|
|
-0.48
|
-0.60 – -0.36
|
<0.001
|
Education: University
|
|
|
|
-0.63
|
-0.77 – -0.50
|
<0.001
|
Observations
|
7791
|
7791
|
R2 / R2 adjusted
|
0.000 / 0.000
|
0.012 / 0.012
|
Lesson 3: Understanding interactions
First, let me simulate an example to explain:
set.seed(123)
n <- 500
party_size <- sample(5:100, n, replace = TRUE)
personality <- sample(c("Introvert", "Extrovert"), n, replace = TRUE)
party_data <- data.frame(
party_size = party_size,
personality = personality
)
party_data$personality_num <- ifelse(party_data$personality == "Introvert", -1, 1)
base_enjoyment <- 50
introvert_slope <- -0.5
extrovert_slope <- 0.3
noise_sd <- 10
party_data$enjoyment <- base_enjoyment +
(party_data$personality_num == -1) * introvert_slope * party_data$party_size +
(party_data$personality_num == 1) * extrovert_slope * party_data$party_size +
rnorm(n, mean = 0, sd = noise_sd)
party_data$enjoyment <- pmax(0, pmin(100, party_data$enjoyment))
model <- lm(enjoyment ~ party_size * personality, data = party_data)
ggplot(party_data, aes(x = party_size, y = enjoyment, color = personality)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(
title = "Party Enjoyment by Party Size and Personality Type",
x = "Number of People at Party",
y = "Enjoyment (0-100 scale)",
color = "Personality Type"
) +
theme_minimal() +
scale_color_manual(values = c("Introvert" = "blue", "Extrovert" = "red"))

new_data <- expand.grid(
party_size = seq(5, 100, by = 5),
personality = c("Introvert", "Extrovert")
)
predictions <- predict(model, newdata = new_data, se.fit = TRUE)
new_data$predicted_enjoyment <- predictions$fit
ggplot(new_data, aes(x = party_size, y = predicted_enjoyment, color = personality, group = personality)) +
geom_line(size = 1.5) +
labs(
title = "Predicted Party Enjoyment by Party Size and Personality Type",
subtitle = "Illustration of Non-Parallel Lines (Interaction Effect)",
x = "Number of People at Party",
y = "Predicted Enjoyment (0-100 scale)",
color = "Personality Type"
) +
theme_minimal() +
scale_color_manual(values = c("Introvert" = "blue", "Extrovert" = "red")) +
annotate("text", x = 80, y = 30, label = "Introverts enjoy\nsmaller parties", color = "blue") +
annotate("text", x = 80, y = 70, label = "Extroverts enjoy\nlarger parties", color = "red")

Let’s do another simulated example, leveraging a simulated example we
did during Tutorial 6 (for MLR):
set.seed(123)
n <- 500
experience <- runif(n, 0, 30) # Years of experience from 0-30
education <- sample(c("HS or Less", "College", "Graduate"),
size = n, replace = TRUE,
prob = c(0.4, 0.4, 0.2)) # Realistic distribution
income_mlr <- case_when(
education == "HS or Less" ~ 25000 + 1200 * experience,
education == "College" ~ 40000 + 1200 * experience,
education == "Graduate" ~ 60000 + 1200 * experience
) + rnorm(n, 0, 10000) # Add some random noise
income_interaction <- case_when(
education == "HS or Less" ~ 25000 + 800 * experience,
education == "College" ~ 40000 + 1200 * experience,
education == "Graduate" ~ 60000 + 2000 * experience
) + rnorm(n, 0, 10000) # Add some random noise
sim_data_mlr <- data.frame(
Income = income_mlr,
Experience = experience,
Education = factor(education, levels = c("HS or Less", "College", "Graduate")),
Model = "Multiple Linear Regression\n(Parallel Slopes)"
)
sim_data_interaction <- data.frame(
Income = income_interaction,
Experience = experience,
Education = factor(education, levels = c("HS or Less", "College", "Graduate")),
Model = "Interaction Model\n(Different Slopes)"
)
p1 <- ggplot(sim_data_mlr, aes(x = Experience, y = Income, color = Education)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
scale_y_continuous(labels = scales::dollar_format()) +
scale_color_brewer(palette = "Set1") +
theme_minimal(base_size = 12) +
labs(
title = "Multiple Linear Regression",
subtitle = "Effect of experience is HELD CONSTANT across education levels\n(Experience slope doesn't depend on education)",
x = "Years of Work Experience",
y = "Annual Income",
color = "Education Level"
) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
) +
annotate("text", x = 25, y = 30000, label = "Reference Category",
color = "#E41A1C", fontface = "bold") +
annotate("text", x = 25, y = 55000,
label = "β₂, College effect = +$15,000", color = "#377EB8") +
annotate("text", x = 25, y = 80000,
label = "β₃, Graduate effect = +$35,000", color = "#4DAF4A") +
annotate("text", x = 5, y = 90000,
label = "β₁ = +$1,200 per year\n(same for all groups)",
color = "black", fontface = "italic")
p2 <- ggplot(sim_data_interaction, aes(x = Experience, y = Income, color = Education)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
scale_y_continuous(labels = scales::dollar_format()) +
scale_color_brewer(palette = "Set1") +
theme_minimal(base_size = 12) +
labs(
title = "Interaction Model",
subtitle = "Effect of experience is ALLOWED TO VARY by education level\n(Experience slope depends on education)",
x = "Years of Work Experience",
y = "Annual Income",
color = "Education Level"
) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
) +
annotate("text", x = 25, y = 30000,
label = "HS slope = +$800/year",
color = "#E41A1C", fontface = "bold") +
annotate("text", x = 25, y = 60000,
label = "College slope = +$1,200/year", color = "#377EB8") +
annotate("text", x = 25, y = 100000,
label = "Graduate slope = +$2,000/year", color = "#4DAF4A")
combined_plot <- p1 + p2 +
plot_layout(guides = "collect") +
plot_annotation(
title = "Comparing Multiple Linear Regression vs. Interaction Effects",
theme = theme(
plot.title = element_text(face = "bold", size = 16),
legend.position = "bottom"
)
)
combined_plot

Let’s use our real-world example from earlier to see a possible
interaction modeled and then visualized:
model3 <- lm(satisfaction_clean ~ log_income + education_grouped + log_income * education_grouped, data = atlantic_analysis)
tab_model(
model1, model2, model3,
dv.labels = c("Model 1: Income Only", "Model 2: Main Effects", "Model 3: Interaction"),
title = "Life Satisfaction in Atlantic Canada",
pred.labels = c(
"(Intercept)" = "Intercept",
"log_income" = "Log Income (log10)",
"education_groupedPost-secondary (non-university)" = "Education: Post-secondary",
"education_groupedUniversity education" = "Education: University",
"log_income:education_groupedPost-secondary (non-university)" = "Log Income × Post-secondary",
"log_income:education_groupedUniversity education" = "Log Income × University"
),
p.style = "numeric",
digits = 2
)
Life Satisfaction in Atlantic Canada
|
Model 1: Income Only
|
Model 2: Main Effects
|
Model 3: Interaction
|
Predictors
|
Estimates
|
CI
|
p
|
Estimates
|
CI
|
p
|
Estimates
|
CI
|
p
|
Intercept
|
6.25
|
5.60 – 6.91
|
<0.001
|
4.68
|
3.94 – 5.42
|
<0.001
|
4.96
|
3.65 – 6.26
|
<0.001
|
Log Income (log10)
|
0.08
|
-0.06 – 0.21
|
0.272
|
0.48
|
0.32 – 0.64
|
<0.001
|
0.42
|
0.13 – 0.71
|
0.004
|
Education: Post-secondary
|
|
|
|
-0.48
|
-0.60 – -0.36
|
<0.001
|
-0.97
|
-2.78 – 0.85
|
0.296
|
Education: University
|
|
|
|
-0.63
|
-0.77 – -0.50
|
<0.001
|
-0.96
|
-2.92 – 0.99
|
0.334
|
Log Income × Post-secondary
|
|
|
|
|
|
|
0.10
|
-0.28 – 0.49
|
0.598
|
Log Income × University
|
|
|
|
|
|
|
0.07
|
-0.34 – 0.48
|
0.731
|
Observations
|
7791
|
7791
|
7791
|
R2 / R2 adjusted
|
0.000 / 0.000
|
0.012 / 0.012
|
0.012 / 0.012
|
Now let’s visualize:
plot_model(model3, type = "int",
axis.title = c("Log Income", "Life Satisfaction"),
legend.title = "Education Level",
title = "Interaction Between Income and Education on Life Satisfaction") +
theme_minimal()

Let’s revisit our full Canada specification above:
model3 <- lm(outcome ~ log_income + education + visible_minority, data = chs_for_analysis)
summary(model3)
##
## Call:
## lm(formula = outcome ~ log_income + education + visible_minority,
## data = chs_for_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8930 -1.6130 0.2976 1.7448 4.2639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.97730 0.24539 16.208 < 2e-16 ***
## log_income 0.53884 0.05207 10.349 < 2e-16 ***
## educationHigh school diploma -0.45692 0.06318 -7.232 4.97e-13 ***
## educationTrade certificate -0.64664 0.07562 -8.551 < 2e-16 ***
## educationCollege/CEGEP diploma -0.76111 0.06491 -11.725 < 2e-16 ***
## educationUniversity below bachelor -0.57024 0.09065 -6.290 3.24e-10 ***
## educationBachelor's degree -0.84646 0.06960 -12.163 < 2e-16 ***
## educationAbove bachelor's degree -0.80487 0.07645 -10.528 < 2e-16 ***
## visible_minorityNo visible minority 0.21569 0.05070 4.254 2.11e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.17 on 17349 degrees of freedom
## (23630 observations deleted due to missingness)
## Multiple R-squared: 0.01395, Adjusted R-squared: 0.0135
## F-statistic: 30.69 on 8 and 17349 DF, p-value: < 2.2e-16
model4 <- lm(outcome ~ log_income + education + visible_minority + education*visible_minority, data = chs_for_analysis)
summary(model4)
##
## Call:
## lm(formula = outcome ~ log_income + education + visible_minority +
## education * visible_minority, data = chs_for_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9161 -1.5948 0.2954 1.7457 4.2920
##
## Coefficients:
## Estimate
## (Intercept) 3.37521
## log_income 0.54296
## educationHigh school diploma 0.26447
## educationTrade certificate -0.59678
## educationCollege/CEGEP diploma -0.26389
## educationUniversity below bachelor 0.07138
## educationBachelor's degree -0.28055
## educationAbove bachelor's degree -0.10208
## visible_minorityNo visible minority 0.82030
## educationHigh school diploma:visible_minorityNo visible minority -0.75196
## educationTrade certificate:visible_minorityNo visible minority -0.04679
## educationCollege/CEGEP diploma:visible_minorityNo visible minority -0.51026
## educationUniversity below bachelor:visible_minorityNo visible minority -0.67477
## educationBachelor's degree:visible_minorityNo visible minority -0.58442
## educationAbove bachelor's degree:visible_minorityNo visible minority -0.77655
## Std. Error
## (Intercept) 0.36397
## log_income 0.05208
## educationHigh school diploma 0.31156
## educationTrade certificate 0.37376
## educationCollege/CEGEP diploma 0.29797
## educationUniversity below bachelor 0.33225
## educationBachelor's degree 0.28769
## educationAbove bachelor's degree 0.28943
## visible_minorityNo visible minority 0.28045
## educationHigh school diploma:visible_minorityNo visible minority 0.31794
## educationTrade certificate:visible_minorityNo visible minority 0.38110
## educationCollege/CEGEP diploma:visible_minorityNo visible minority 0.30444
## educationUniversity below bachelor:visible_minorityNo visible minority 0.34454
## educationBachelor's degree:visible_minorityNo visible minority 0.29459
## educationAbove bachelor's degree:visible_minorityNo visible minority 0.29805
## t value
## (Intercept) 9.273
## log_income 10.425
## educationHigh school diploma 0.849
## educationTrade certificate -1.597
## educationCollege/CEGEP diploma -0.886
## educationUniversity below bachelor 0.215
## educationBachelor's degree -0.975
## educationAbove bachelor's degree -0.353
## visible_minorityNo visible minority 2.925
## educationHigh school diploma:visible_minorityNo visible minority -2.365
## educationTrade certificate:visible_minorityNo visible minority -0.123
## educationCollege/CEGEP diploma:visible_minorityNo visible minority -1.676
## educationUniversity below bachelor:visible_minorityNo visible minority -1.958
## educationBachelor's degree:visible_minorityNo visible minority -1.984
## educationAbove bachelor's degree:visible_minorityNo visible minority -2.605
## Pr(>|t|)
## (Intercept) < 2e-16
## log_income < 2e-16
## educationHigh school diploma 0.39598
## educationTrade certificate 0.11036
## educationCollege/CEGEP diploma 0.37584
## educationUniversity below bachelor 0.82991
## educationBachelor's degree 0.32948
## educationAbove bachelor's degree 0.72433
## visible_minorityNo visible minority 0.00345
## educationHigh school diploma:visible_minorityNo visible minority 0.01804
## educationTrade certificate:visible_minorityNo visible minority 0.90228
## educationCollege/CEGEP diploma:visible_minorityNo visible minority 0.09374
## educationUniversity below bachelor:visible_minorityNo visible minority 0.05019
## educationBachelor's degree:visible_minorityNo visible minority 0.04729
## educationAbove bachelor's degree:visible_minorityNo visible minority 0.00918
##
## (Intercept) ***
## log_income ***
## educationHigh school diploma
## educationTrade certificate
## educationCollege/CEGEP diploma
## educationUniversity below bachelor
## educationBachelor's degree
## educationAbove bachelor's degree
## visible_minorityNo visible minority **
## educationHigh school diploma:visible_minorityNo visible minority *
## educationTrade certificate:visible_minorityNo visible minority
## educationCollege/CEGEP diploma:visible_minorityNo visible minority .
## educationUniversity below bachelor:visible_minorityNo visible minority .
## educationBachelor's degree:visible_minorityNo visible minority *
## educationAbove bachelor's degree:visible_minorityNo visible minority **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.169 on 17343 degrees of freedom
## (23630 observations deleted due to missingness)
## Multiple R-squared: 0.01475, Adjusted R-squared: 0.01395
## F-statistic: 18.54 on 14 and 17343 DF, p-value: < 2.2e-16
plot_model(model4, type = "int",
axis.title = c("Minority", "Life Satisfaction"),
legend.title = "Education Level",
title = "Interaction Between Visible Minoirty and Education on Life Satisfaction") +
theme_minimal()

Ok, let’s turn now to breaking down what it all means with our
example:
Main Effects Interpretation:
Log income (0.54296): For each unit increase in
log income, life satisfaction increases by about 0.54 points on the 0-10
scale, holding all other variables constant. This is a strong, highly
significant effect that persists regardless of education or visible
minority status.
Education main effects: These coefficients
represent the relationship between education and life satisfaction
specifically for visible minorities (the reference group).
Interestingly, none of these education effects are statistically
significant, suggesting that among visible minorities, different
education levels don’t significantly predict different life satisfaction
levels when controlling for income.
Visible minority main effect (0.82030): This
coefficient shows that non-visible minorities report life satisfaction
0.82 points higher than visible minorities, but only for those with less
than high school education (the reference education category). This is a
substantial difference on a 0-10 scale.
Understanding the Interaction:
The interaction terms represent how the effect of visible minority
status changes across different education levels (or equivalently, how
the effect of education differs between visible minority groups).
Each interaction coefficient should be interpreted as a
modification to the main effect of visible minority status:
- For example, the interaction “High school diploma:visible minorityNo
visible minority” (-0.75196) means that the visible minority gap of 0.82
is reduced by 0.75 for those with high school diplomas, resulting in a
much smaller gap of about 0.07.
For people with bachelor’s degrees, the visible minority effect
is 0.82030 - 0.58442 = 0.24, and for those with above bachelor’s
degrees, it’s 0.82030 - 0.77655 = 0.04.
Pattern of diminishing returns: The higher the
education level, the smaller the life satisfaction gap between visible
minorities and non-visible minorities, with the gap nearly disappearing
at the highest education level.
What This Means More “Sociologically”:
Conditional relationships: The effect of being a
visible minority on life satisfaction is conditional on education level,
and vice versa. We cannot accurately describe either effect without
considering the other.
Income vs. education effects: Income has a direct
positive relationship with life satisfaction regardless of other
factors, while education’s relationship is more contingent on visible
minority status. This suggests different pathways through which
socioeconomic factors influence well-being.
But let’s actually explore another (i.e., don’t do that – the last
lesson for today)