Lab Overview This combined lab covers two related topics: Part 1 introduces binary outcome models using marijuana legalization attitudes from the GSS. Part 2 practices our understanding of predictions and marginal effects across LPM and logistic models, using unemployment from the GSS. Part 3 uses ACS data to practice binary outcome models, with a particular attention to how to apply survey design variables to produce weighted descriptive and weighted logistic regression analysis.
Conceptual Review: Why Not Just Use OLS for Binary Outcomes? When the dependent variable is binary (0/1), ordinary least squares (OLS) can produce predicted probabilities below 0 or above 1. The Linear Probability Model (LPM) uses OLS anyway and is easy to interpret, but logistic regression solves the boundary problem by modeling the log odds of the outcome. You’ll see both approaches this week and compare them directly.
We’ll use the 2022 General Social Survey (GSS). Our outcome is
legal_grass: whether a respondent thinks marijuana should
be legalized (1 = yes, 0 = no). Our key predictor is
conservatism, a 1–7 ideology scale (1 = extremely liberal,
7 = extremely conservative).
library(gssr)
gss2022 <- gss_get_yr(2022)
gss <- gss2022 %>%
select(grassnv, polviews, sex, race, hispanic, age, educ, wtssnrps, ballot) %>%
haven::zap_labels() %>% #strips variable and value labels from data imported; standard practice if you are importing STATA file
mutate(
legal_grass = case_when(
grassnv == 1 ~ 1,
grassnv == 2 ~ 0
),
conservatism = polviews,
age = case_when(age > 88 ~ NaN, TRUE ~ age),
sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
race = case_when(
hispanic != 1 ~ "Hispanic",
race == 1 & hispanic == 1 ~ "White",
race == 2 & hispanic == 1 ~ "Black",
race == 3 & hispanic == 1 ~ "Other"
),
race4 = factor(race, levels = c("White", "Black", "Hispanic", "Other")),
weight = wtssnrps
) %>%
filter(ballot != 1) %>%
select(-wtssnrps, -hispanic, -race, -polviews, -grassnv, -ballot)
gg_miss_upset(gss)
We’ll use listwise deletion for simplicity here.
gss <- gss %>% drop_na()
datasummary_skim(gss, type = "numeric",
title = "Descriptive Statistics: Continuous Variables",
notes = "Data: 2022 General Social Survey")
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| age | 69 | 0 | 48.7 | 17.6 | 18.0 | 47.0 | 88.0 | |
| highest year of school completed | 15 | 0 | 14.7 | 2.5 | 4.0 | 14.0 | 20.0 | |
| legal_grass | 2 | 0 | 0.8 | 0.4 | 0.0 | 1.0 | 1.0 | |
| think of self as liberal or conservative | 7 | 0 | 3.8 | 1.5 | 1.0 | 4.0 | 7.0 | |
| person post-stratification weight, nonrespondents adjusted | 526 | 0 | 0.8 | 0.9 | 0.1 | 0.5 | 6.3 |
datasummary_skim(gss, type = "categorical",
title = "Descriptive Statistics: Categorical Variables",
notes = "Data: 2022 General Social Survey")
| N | % | ||
|---|---|---|---|
| Data: 2022 General Social Survey | |||
| sex | Male | 236 | 43.9 |
| Female | 301 | 56.1 | |
| race4 | White | 379 | 70.6 |
| Black | 52 | 9.7 | |
| Hispanic | 77 | 14.3 | |
| Other | 29 | 5.4 | |
A balance table compares demographics across the two
sides of the outcome variable. The Diff. in Means column
shows how groups differ.
datasummary_balance(~legal_grass, data = gss,
title = "Balance Table: Demographics by Views on Marijuana Legalization",
notes = "Data: 2022 General Social Survey (Unweighted)")
| 0 | 1 | ||||
|---|---|---|---|---|---|
| Mean | Std. Dev. | Mean | Std. Dev. | ||
| Data: 2022 General Social Survey (Unweighted) | |||||
| age | 55.3 | 17.1 | 46.7 | 17.3 | |
| educ | 14.8 | 2.8 | 14.6 | 2.5 | |
| conservatism | 4.8 | 1.4 | 3.6 | 1.4 | |
| weight | 0.9 | 1.1 | 0.7 | 0.8 | |
| N | Pct. | N | Pct. | ||
| sex | Male | 60 | 50.0 | 176 | 42.2 |
| Female | 60 | 50.0 | 241 | 57.8 | |
| race4 | White | 79 | 65.8 | 300 | 71.9 |
| Black | 9 | 7.5 | 43 | 10.3 | |
| Hispanic | 20 | 16.7 | 57 | 13.7 | |
| Other | 12 | 10.0 | 17 | 4.1 | |
Exercise 1: What do the descriptive statistics and balance table tell you about who supports marijuana legalization? Summarize two or three key differences.
Your answer here: The first thing I noticed about the descriptive statistics is that the distribution for ideologgy is pretty evenly distrbuted but a majority of respondents identified as somewhere in the middle. For opinions on marijuana legalization, I was surprised to see that most respondents support the issue. Furthermore, we see that whites make up the vast majority of the surveyees in this dataset. When we look to the balance table, we see that the population supporting legalization are younger, less conservative, and slightly less educated on average versus those that oppose marijuana legalization. Moreover, more women support legalization than men, though the breakdown of those opposing is 50/50 by sex. Given that women are more greatly represented in this sample, we can only take this with a grain of salt. This is the same case for race, where whites appear to strongly favor legalization but we can’t draw conclusions off of this given that whites make up the vast majority of the study too.
The LPM estimates the effect of each predictor on the probability that the outcome equals 1. It’s just OLS with a binary dependent variable.
linprob1 <- lm(legal_grass ~ conservatism + age + educ + race4 + sex,
data = gss, weights = weight)
modelsummary(linprob1,
estimate = "{estimate}{stars} ({std.error})",
statistic = NULL,
stars = c(`***` = .001, `**` = .01, `*` = .05),
gof_omit = "F|RMSE|Log|IC",
output = "huxtable")
| (1) | |
|---|---|
| (Intercept) | 1.534*** (0.122) |
| conservatism | -0.097*** (0.012) |
| age | -0.005*** (0.001) |
| educ | -0.007 (0.006) |
| race4Black | -0.030 (0.069) |
| race4Hispanic | -0.205*** (0.051) |
| race4Other | -0.468*** (0.068) |
| sexFemale | -0.008 (0.034) |
| Num.Obs. | 537 |
| R2 | 0.238 |
| R2 Adj. | 0.228 |
Exercise 2: Interpret the coefficient for
conservatism in the LPM. What does it mean in probability
terms?
Your answer here: For this model, the coefficient would tell us the likelihood of an individual supporting legalization if they are not at all conservative, 0 years of age, no education, white, and male. This kind of highlights the issue of LPMs, as there’s no reality where the DV is greater than one as the intercept implies. However, would it make sense to interpret this as a young/white/uneducated/male being highly likely to support legalization?
The LPM can produce predicted probabilities below 0 or above 1. To
see this, we predict across a grid of conservatism and
age values for a Black male with average education.
linpreddata <- expand_grid(conservatism = seq(1, 7),
age = seq(18, 88, 10),
educ = mean(gss$educ),
race4 = "Black",
sex = "Male")
linpreddata$pred_grass <- predict(linprob1, newdata = linpreddata, type = "response")
range(linpreddata$pred_grass)
## [1] 0.2774806 1.2132735
ggplot(linpreddata, aes(x = conservatism, y = age, fill = pred_grass)) +
geom_tile() + scale_fill_viridis_c() + theme_minimal() +
geom_text(aes(label = round(pred_grass, 2)), color = "white", size = 4) +
scale_x_continuous(breaks = seq(0, 8, 1)) +
scale_y_continuous(breaks = seq(18, 88, 10)) +
labs(title = "Linear Probability Model",
subtitle = "Predicted Probability of Favoring Legalization — Black Males, Average Education",
y = "Age", x = "Conservatism", fill = "Probability")
Exercise 3: Are any predicted values outside [0, 1]? Where in the grid does this occur, and why is it a problem?
Your answer here: Yes, there are predicted values greater than 1, which is the problem with LPMs. We see here that these are around the lowest ages and the lowest degrees of conservatism. The issue here is that we are interested in the yes/no of supporting legalization, and anything outside of the 0-1 range is not observable in real life. Though we can’t use this to interpret the probability coefficients, we can see that it means that younger/liberal individuals are far more likely to support legalization than others.
Logistic regression solves the out-of-bounds problem by transforming
predictions through the logistic function, we make sure all predicted
probabilities stay between 0 and 1. Use glm() with
family = "binomial".
logit_full <- glm(legal_grass ~ conservatism + age + educ + race4 + sex,
family = "binomial", data = gss)
logit_base <- glm(legal_grass ~ conservatism,
family = "binomial", data = gss)
compmods <- list("LPM" = linprob1, "Base Logit" = logit_base, "Full Logit" = logit_full)
modelsummary(compmods,
estimate = "{estimate}{stars} ({std.error})",
statistic = NULL,
stars = c(`***` = .001, `**` = .01, `*` = .05),
gof_omit = "F|RMSE",
output = "huxtable")
| LPM | Base Logit | Full Logit | |
|---|---|---|---|
| (Intercept) | 1.534*** (0.122) | 3.740*** (0.388) | 6.507*** (0.945) |
| conservatism | -0.097*** (0.012) | -0.596*** (0.083) | -0.628*** (0.090) |
| age | -0.005*** (0.001) | -0.026*** (0.007) | |
| educ | -0.007 (0.006) | -0.079 (0.046) | |
| race4Black | -0.030 (0.069) | -0.090 (0.420) | |
| race4Hispanic | -0.205*** (0.051) | -0.900** (0.332) | |
| race4Other | -0.468*** (0.068) | -1.362** (0.446) | |
| sexFemale | -0.008 (0.034) | 0.193 (0.231) | |
| Num.Obs. | 537 | 537 | 537 |
| R2 | 0.238 | ||
| R2 Adj. | 0.228 | ||
| AIC | 679.2 | 513.3 | 493.1 |
| BIC | 717.8 | 521.9 | 527.4 |
| Log.Lik. | -330.603 | -254.646 | -238.560 |
modelplot(compmods, coef_omit = 'Interc') +
geom_vline(aes(xintercept = 0), linetype = "dotted", linewidth = 1, alpha = .5) +
theme(legend.position = "bottom")
Note: The LPM and logit coefficients are on different scales and should not be directly compared in magnitude. The plot shows direction and uncertainty, not equivalent effect sizes.
modelsummary(logit_full,
estimate = "{estimate} ({std.error}) {stars}",
stars = c(`*` = .05, `**` = .01, `***` = .001),
statistic = NULL,
gof_omit = "F|RMSE|IC|Log",
output = "huxtable")
| (1) | |
|---|---|
| (Intercept) | 6.507 (0.945) *** |
| conservatism | -0.628 (0.090) *** |
| age | -0.026 (0.007) *** |
| educ | -0.079 (0.046) |
| race4Black | -0.090 (0.420) |
| race4Hispanic | -0.900 (0.332) ** |
| race4Other | -1.362 (0.446) ** |
| sexFemale | 0.193 (0.231) |
| Num.Obs. | 537 |
logitpreddata <- expand_grid(conservatism = seq(1, 7),
age = seq(18, 88, 10),
educ = mean(gss$educ),
race4 = "Black",
sex = "Male")
logitpreddata$pred_grass <- predict(logit_full, newdata = logitpreddata, type = "response")
range(logitpreddata$pred_grass)
## [1] 0.1921750 0.9845489
ggplot(logitpreddata, aes(x = conservatism, y = age, fill = pred_grass)) +
geom_tile() + scale_fill_viridis_c() + theme_minimal() +
geom_text(aes(label = round(pred_grass, 2)), color = "white", size = 4) +
scale_x_continuous(breaks = seq(0, 8, 1)) +
scale_y_continuous(breaks = seq(18, 88, 10)) +
labs(title = "Logit Model",
subtitle = "Predicted Probability of Favoring Legalization — Black Males, Average Education",
y = "Age", x = "Conservatism", fill = "Probability")
Log odds are directionally intuitive but hard to communicate. Odds ratios (ORs) are easier: an OR > 1 means increased odds, OR < 1 means decreased odds.
cbind(LogOdds = round(coef(logit_full), 3),
OddsRatio = round(exp(coef(logit_full)), 3))
## LogOdds OddsRatio
## (Intercept) 6.507 669.665
## conservatism -0.628 0.534
## age -0.026 0.974
## educ -0.079 0.924
## race4Black -0.090 0.914
## race4Hispanic -0.900 0.407
## race4Other -1.362 0.256
## sexFemale 0.193 1.213
To get the percent change in odds: subtract 1 from the OR. For example, if OR = 0.391, then \((0.391 - 1) = -0.609\), meaning a ~61% decrease in odds.
modelsummary(list("Log Odds" = logit_full, "Odds Ratio" = logit_full),
exponentiate = c(F, T),
estimate = "{estimate} ({std.error}) {stars}",
stars = c(`*` = .05, `**` = .01, `***` = .001),
statistic = NULL,
gof_omit = "F|RMSE|Log",
output = "huxtable")
| Log Odds | Odds Ratio | |
|---|---|---|
| (Intercept) | 6.507 (0.945) *** | 669.665 (632.587) *** |
| conservatism | -0.628 (0.090) *** | 0.534 (0.048) *** |
| age | -0.026 (0.007) *** | 0.974 (0.007) *** |
| educ | -0.079 (0.046) | 0.924 (0.043) |
| race4Black | -0.090 (0.420) | 0.914 (0.384) |
| race4Hispanic | -0.900 (0.332) ** | 0.407 (0.135) ** |
| race4Other | -1.362 (0.446) ** | 0.256 (0.114) ** |
| sexFemale | 0.193 (0.231) | 1.213 (0.281) |
| Num.Obs. | 537 | 537 |
| AIC | 493.1 | 493.1 |
| BIC | 527.4 | 527.4 |
Exercise 5: Interpret the odds ratio for
educ. What does it say about the relationship between
education and support for legalization?
Your answer here: For education, we can see that a one unit increase (a one year increase in education) is associated with a ~8% decreased odds of supporting legalization. This means that highly educated people are less likely to support legalization than those with less education.
Higher (less negative) log-likelihood = better fit. We use it to compare nested models.
logLik(logit_full)
## 'log Lik.' -238.5598 (df=8)
logLik(logit_base)
## 'log Lik.' -254.6458 (df=2)
There is no true R² for logistic regression. McFadden’s pseudo-R² approximates it:
\[R^2_{McFadden} = 1 - \frac{LL_{model}}{LL_{null}}\]
Values between 0.2 and 0.4 are generally considered a good fit.
DescTools::PseudoR2(logit_full, which = c("McFadden", "McFaddenAdj", "logLik"))
## McFadden McFaddenAdj logLik
## 0.1637843 0.1357422 -238.5597959
DescTools::PseudoR2(logit_base, which = c("McFadden", "McFaddenAdj", "logLik"))
## McFadden McFaddenAdj logLik
## 0.1073987 0.1003881 -254.6457804
Exercise 7: Based on the log-likelihood and pseudo-R² values, which model fits better? the base logit or the full logit? How do you know?
Your answer here: Using just the log likelihood estimates, we see that the full logit model is better because it is closer to 0. The McFadden’s psuedo-R2 supports this, though it is outside of the ideal range so we shouldn’t consider that model a good fit for the data.
Conceptual Review: Why Do We Need Predictions and Marginal Effects? Raw regression coefficients tell you about direction and statistical significance, but they are often hard to communicate and can be misleading , especially in nonlinear models like logit. Predictions let you translate model output into concrete expected values (e.g., “a 30-year-old with 16 years of education is predicted to earn $X”). Marginal effects tell you how the predicted outcome changes when one predictor changes by one unit, which gives you the “slope” of the relationship at any given point. —
For Part 2, we use the same GSS dataset with a focus on unemployment as the primary outcome and human capital and demographic characteristics as predictors.
gss2022b <- gss2022
gss2 <- gss2022b %>%
transmute(
income = as.numeric(conrinc),
age = ifelse(age < 0, NA, age),
educ = ifelse(educ < 0, NA, educ),
sex = factor(sex, levels = 1:2, labels = c("Male", "Female")),
race = factor(race, levels = 1:3, labels = c("White", "Black", "Other")),
unemp = ifelse(unemp == 1, 1, 0)
) %>%
drop_na()
datasummary_skim(gss2, type = "numeric", fmt = 2, histogram = TRUE,
title = "Descriptive Statistics: Continuous Variables",
notes = "Data: 2022 General Social Survey")
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| income | 26 | 0 | 40987.62 | 39527.08 | 316.50 | 28485.00 | 163572.71 | |
| age | 67 | 0 | 43.96 | 14.58 | 18.00 | 42.00 | 89.00 | |
| educ | 20 | 0 | 14.50 | 2.90 | 0.00 | 14.00 | 20.00 | |
| unemp | 2 | 0 | 0.36 | 0.48 | 0.00 | 0.00 | 1.00 |
datasummary_skim(gss2, type = "categorical",
title = "Descriptive Statistics: Categorical Variables",
notes = "Data: 2022 General Social Survey",
output = "huxtable")
| N | % | ||
|---|---|---|---|
| sex | Male | 709 | 48.5 |
| Female | 753 | 51.5 | |
| race | White | 975 | 66.7 |
| Black | 242 | 16.6 | |
| Other | 245 | 16.8 | |
| Data: 2022 General Social Survey | |||
We fit two models: a base model with main effects only, and a full
model that adds an educ × sex interaction to test whether
education’s protective effect differs by sex.
logit_base <- glm(unemp ~ educ + sex + age + I(age^2) + race,
data = gss2, family = binomial)
logit_model <- glm(unemp ~ educ * sex + age + I(age^2) + race,
data = gss2, family = binomial)
modelsummary(list("Base" = logit_base, "With Interaction" = logit_model),
exponentiate = FALSE,
estimate = "{estimate}{stars} ({std.error})",
statistic = NULL,
stars = c(`***` = .001, `**` = .01, `*` = .05),
gof_omit = "pss|alg|RMSE|AIC|BIC|F",
output = "huxtable")
| Base | With Interaction | |
|---|---|---|
| (Intercept) | 1.435* (0.575) | 1.249 (0.645) |
| educ | -0.087*** (0.020) | -0.075** (0.028) |
| sexFemale | 0.245* (0.113) | 0.605 (0.578) |
| age | -0.011 (0.025) | -0.011 (0.025) |
| I(age^2) | -0.000 (0.000) | -0.000 (0.000) |
| raceBlack | 0.075 (0.153) | 0.072 (0.153) |
| raceOther | 0.163 (0.151) | 0.161 (0.151) |
| educ × sexFemale | -0.025 (0.040) | |
| Num.Obs. | 1462 | 1462 |
| Log.Lik. | -915.008 | -914.806 |
Exercise: Interpret the main effect of
educ and the interaction term educ:sexFemale
in the full model using log odds. Does education protect men and women
equally against unemployment risk?
Your answer here: The main effect of education is that, in both models, a one unit increase in education drastically reduces the odds of being unemployed. Furthermore, the interaction term tells us that there is a slight difference in how education impacts sex/gender’s relationship with employment, though the relationship itself is not significant. If it were, we would say that education protects women from unemployment at a lower level than it does for men.
Before interpreting, we should check whether the full model fits better than the base model.
logLik(logit_base)
## 'log Lik.' -915.008 (df=7)
logLik(logit_model)
## 'log Lik.' -914.8059 (df=8)
DescTools::PseudoR2(logit_base, which = c("McFadden", "McFaddenAdj", "logLik"))
## McFadden McFaddenAdj logLik
## 0.04594708 0.03864838 -915.00803839
DescTools::PseudoR2(logit_model, which = c("McFadden", "McFaddenAdj", "logLik"))
## McFadden McFaddenAdj logLik
## 0.04615780 0.03781643 -914.80594082
Exercise 14: Compare model fit between the base and interaction models. Does adding the interaction term improve fit? How do you know?
Your answer here: We see that adding the interaction term does slightly improve the model, as the one with it is a little closer to 0 than the base model. This is also the case for the McFadden’s R2, so we know that the interaction term’s model is performing only slightly better than the one without it.
plot_predictions(logit_model, condition = c("educ", "sex")) +
labs(title = "Predicted P(Unemployment) by Education and Sex",
x = "Education (years)", y = "Predicted Probability of Unemployment")
# Predicted probabilities for specific profiles
predictions(logit_model,
newdata = datagrid(educ = c(12, 16),
sex = c("Male", "Female"),
age = mean(gss2$age),
race = "White"))
##
## educ sex age race Estimate Pr(>|z|) S 2.5 % 97.5 %
## 12 Male 44 White 0.372 <0.001 16.0 0.318 0.429
## 12 Female 44 White 0.445 0.09 3.5 0.383 0.509
## 16 Male 44 White 0.305 <0.001 38.6 0.259 0.356
## 16 Female 44 White 0.350 <0.001 26.0 0.303 0.400
##
## Type: invlink(link)
Exercise: Using the specific predictions above, describe the predicted probability of unemployment for a White male with 12 vs. 16 years of education (at mean age). Compare to the equivalent female profiles. What do these concrete numbers communicate that log-odds coefficients cannot?
Your answer here: Here, we see that white men with the lower level of education have a higher predicted probability of being unemployed (37%) versus highly educated white men (30%). For women, we see that under-educated respondents are way more likely to be unemployed (44%) versus highly educated ones (35%). It is worth nothing that at both levels of education, women are more likely than men to be undemployed These concrete numbers can tell us the actual predicted probability of being unemployed by age/education/etc, whereas log-odds can only tell us the directionality of the relationship with employment.
Rather than specific profiles, we can ask: what is the model-based average predicted probability for each racial group, averaging over all other covariates as they actually appear in the data?
avg_predictions(logit_model, by = "race")
##
## race Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## White 0.348 0.0148 23.5 <0.001 402.3 0.319 0.377
## Black 0.393 0.0305 12.9 <0.001 123.6 0.333 0.452
## Other 0.404 0.0305 13.3 <0.001 130.7 0.344 0.464
##
## Type: response
avg_predictions(logit_model, by = "sex")
##
## sex Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## Male 0.339 0.0173 19.6 <0.001 280.4 0.305 0.372
## Female 0.389 0.0172 22.6 <0.001 373.0 0.355 0.423
##
## Type: response
Exercise: Describe the racial and sex differences in average predicted unemployment probability. How do these compare to the raw observed proportions? Why might they differ?
Your answer here: From these observations, we can see that white individuals have a much lower predicted probabiltiy of being unemployed (~35%) versus people of color (~40%). For sex, we see that women are slightly more likely to be unemployed (39%) versus men (34%). This aligns with what we saw above, though it could’ve differed because the other prediction table applied specific education levels and ages.
The AME computes each observation’s marginal effect and averages across the sample. It answers: “On average across the population, what is the effect of a one-unit change in X on the probability of unemployment?”
avg_slopes(logit_model)
##
## Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## age dY/dX -0.00623 0.000891 -6.996 <0.001 38.5 -0.00798 -0.00449
## educ dY/dX -0.01916 0.004242 -4.517 <0.001 17.3 -0.02748 -0.01085
## race Black - White 0.01582 0.033552 0.472 0.6373 0.7 -0.04994 0.08158
## race Other - White 0.03559 0.033708 1.056 0.2911 1.8 -0.03048 0.10165
## sex Female - Male 0.05348 0.024575 2.176 0.0295 5.1 0.00532 0.10165
##
## Type: response
Key Takeaway: AME estimates are in probability units. A negative AME for
educmeans each additional year of schooling reduces the average probability of unemployment, a direct, policy-relevant quantity.
The MEM computes the effect for a hypothetical “average” respondent.
slopes(logit_model, newdata = datagrid())
##
## Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## age dY/dX -0.00688 0.000959 -7.169 <0.001 40.3 -0.008756 -0.0050
## educ dY/dX -0.02331 0.006575 -3.545 <0.001 11.3 -0.036201 -0.0104
## race Black - White 0.01709 0.036220 0.472 0.6370 0.7 -0.053897 0.0881
## race Other - White 0.03844 0.036361 1.057 0.2904 1.8 -0.032827 0.1097
## sex Female - Male 0.05153 0.026185 1.968 0.0491 4.3 0.000211 0.1029
##
## Type: response
Exercise: Compare the AME and MEM for
educ and for sexFemale. Are they similar or
different? When and why do they diverge in a logistic model?
Your answer here: For education, we see in both forms that a one year increase is associated with a decreased predicted probability of unemployment. Similarly, we see that females in both estimates are more likely to be unemployed than men. They can diverge when the “average” person is not actually a legit real-life observation, wherein the MEM provides an estimate that we wouldn’t actually observe. (I could use more help understanding this part, tbh)
Does education’s protective effect change across the life course and does this differ by sex?
Your answer here: I’m not confident I’m interpretting this correctly, but I think: Yes, education protects more against unemployment for younger individuals. And, as women get older, they are more likely to experience unemployment than men.
slopes(logit_model, newdata = datagrid(educ = c(12, 16),
sex = c("Male", "Female"),
age = c(25, 45, 65)))
##
## Term Contrast educ sex age Estimate Std. Error z Pr(>|z|) S
## age dY/dX 12 Male 25 -0.00531 0.002885 -1.84 0.0655 3.9
## age dY/dX 12 Male 45 -0.00691 0.000957 -7.23 <0.001 40.9
## age dY/dX 12 Male 65 -0.00670 0.001780 -3.76 <0.001 12.5
## age dY/dX 12 Female 25 -0.00523 0.002787 -1.88 0.0607 4.0
## age dY/dX 12 Female 45 -0.00734 0.001005 -7.31 <0.001 41.7
## --- 50 rows omitted. See ?print.marginaleffects ---
## sex Female - Male 16 Male 45 0.04410 0.028395 1.55 0.1204 3.1
## sex Female - Male 16 Male 65 0.03149 0.020391 1.54 0.1226 3.0
## sex Female - Male 16 Female 25 0.04990 0.032146 1.55 0.1206 3.1
## sex Female - Male 16 Female 45 0.04410 0.028395 1.55 0.1204 3.1
## sex Female - Male 16 Female 65 0.03149 0.020391 1.54 0.1226 3.0
## 2.5 % 97.5 %
## -0.01097 0.000340
## -0.00879 -0.005040
## -0.01019 -0.003207
## -0.01069 0.000234
## -0.00931 -0.005375
## -0.01155 0.099757
## -0.00848 0.071450
## -0.01310 0.112907
## -0.01155 0.099757
## -0.00848 0.071450
## Type: response
plot_slopes(logit_model, variables = "educ", condition = c("age", "sex")) +
labs(title = "Marginal Effect of Education on P(Unemployment) by Age and Sex",
x = "Age", y = "Marginal Effect of Education")
Exercise: Describe how education’s protective effect varies across age and by sex. At what ages is education most protective? Do men and women follow the same pattern?
Your answer here: I think education is most effective for younger people, as the slope is most level around ~30 years of age. The marginal effect of education is always lower for women, but the gap between men/women closes as they age. They follow a similar trajectory, so it’s not like women are more employed as they age while men aren’t.
Key Takeaway: No single summary number, not the AME, not the coefficient, not the OR , tells the full story of a logistic model. Predictions, classification, and conditional effects each illuminate a different facet of the relationship.Pick the one that is most relevant for your project and your audience.
Overview This section applies everything from Parts 1 and 2 to a new dataset and a new substantive question: Who lacks health insurance? We use individual-level microdata from the 2022 American Community Survey (ACS) for Texas, pulled directly from the Census Bureau API using the
tidycensuspackage. This is real-world survey data with survey weights and a good opportunity to practice working with large, nationally representative microdata.
The ACS Public Use Microdata Sample (PUMS) provides individual-level
records. We pull it using get_pums(). You need a free
Census API key, register at https://api.census.gov/data/key_signup.html and run
census_api_key("YOUR_KEY", install = TRUE) once to store
it.
Variables we request:
| Variable | Description |
|---|---|
HICOV |
Health insurance coverage (1 = insured, 2 = uninsured) |
AGEP |
Age |
SEX |
Sex (1 = Male, 2 = Female) |
RAC1P |
Race |
HISP |
Hispanic origin |
SCHL |
Educational attainment |
PINCP |
Personal income |
ESR |
Employment status |
PWGTP |
Person weight |
library(tidycensus)
census_api_key("fc561cd65e1b44273f52a1e76620bc59253f8ba3", overwrite = TRUE) # run once to store your key
acs_raw <- get_pums(
variables = c("HICOV", "AGEP", "SEX", "RAC1P", "HISP", "SCHL", "PINCP", "ESR", "PWGTP",
"PUMA", "ST"),
state = "TX",
survey = "acs1",
year = 2022
)
acs <- acs_raw %>%
mutate(
# Outcome: uninsured (1 = no health insurance, 0 = has coverage)
uninsured = as.integer(HICOV == 2),
age = as.numeric(AGEP),
sex = factor(SEX, levels = c("1", "2"), labels = c("Male", "Female")),
# Education: recode SCHL to approximate years of schooling
educ_yrs = case_when(
SCHL %in% as.character(1:11) ~ 8, # less than HS
SCHL %in% as.character(12:15) ~ 12, # HS diploma / GED
SCHL %in% as.character(16:17) ~ 14, # some college
SCHL == "18" ~ 14, # associate's
SCHL == "19" ~ 14,
SCHL == "20" ~ 16, # bachelor's
SCHL %in% as.character(21:24) ~ 18, # graduate degree
TRUE ~ NA_real_
),
# what if I want to code education as a categorical variable?
educ_cat = case_when(
SCHL %in% as.character(1:11) ~ "1 Less than HS",
SCHL %in% as.character(12:15) ~ "2 HS diploma/GED",
SCHL %in% as.character(16:20) ~ "3 Some college/Associate's",
SCHL %in% as.character(21:24) ~ "4 Bachelor's or higher"
),
educ_cat = factor(educ_cat, levels = c("1 Less than HS", "2 HS diploma/GED",
"3 Some college/Associate's", "4 Bachelor's or higher")),
# Race/ethnicity (Hispanic takes priority)
race_eth = case_when(
HISP != "01" ~ "Hispanic",
RAC1P == "1" ~ "White",
RAC1P == "2" ~ "Black",
RAC1P %in% as.character(3:9) ~ "Other"
),
race_eth = factor(race_eth, levels = c("White", "Black", "Hispanic", "Other")),
# Employed vs. not
# ESR: "1"/"2" civilian employed, "3" unemployed,
# "4"/"5" armed forces, "6" not in labor force, "b" N/A (under 16)
employed = case_when(
ESR %in% c("1", "2", "4", "5") ~ 1,
ESR %in% c("3", "6") ~ 0,
TRUE ~ NA
),
# Income: log-transform (keep positive incomes only)
income_log = ifelse(as.numeric(PINCP) > 0, log(as.numeric(PINCP)), NA_real_),
#survey design variables
weight = as.numeric(PWGTP),
PUMA = as.character(PUMA),
ST = as.character(ST)
) %>%
# Working-age adults only
filter(age >= 18, age <= 64) %>%
select(uninsured, age, sex, educ_yrs, race_eth, employed, income_log, weight, PUMA, ST) %>%
drop_na()
cat("ACS analytic sample:", nrow(acs), "individuals\n")
## ACS analytic sample: 142845 individuals
# ACS PUMS uses a stratified cluster design:
# ids = ~PUMA : Public Use Microdata Areas are the sampling clusters
# strata = ~ST : states are the strata
# nest = TRUE : PUMA IDs are nested within strata (same number ≠ same area)
acs_design <- svydesign(ids = ~PUMA,
strata = ~ST,
weights = ~weight,
data = acs,
nest = TRUE)
# Weighted means and proportions using the survey design
# svymean() returns population-representative estimates
# Continuous variables
svymean(~age + income_log + uninsured, design = acs_design, na.rm = TRUE) %>%
as.data.frame() %>%
round(3) %>%
knitr::kable(caption = "Weighted Means: ACS Texas 2022 (Ages 18–64)")
| mean | SE | |
|---|---|---|
| age | 39.940 | 0.146 |
| income_log | 10.404 | 0.019 |
| uninsured | 0.187 | 0.006 |
# Categorical variables
svymean(~sex + race_eth + educ_yrs, design = acs_design, na.rm = TRUE) %>%
as.data.frame() %>%
round(3) %>%
knitr::kable(caption = "Weighted Proportions: Categorical Variables")
| mean | SE | |
|---|---|---|
| sexMale | 0.529 | 0.002 |
| sexFemale | 0.471 | 0.002 |
| race_ethWhite | 0.410 | 0.014 |
| race_ethBlack | 0.128 | 0.007 |
| race_ethHispanic | 0.369 | 0.015 |
| race_ethOther | 0.094 | 0.005 |
| educ_yrs | 15.366 | 0.049 |
# Weighted uninsurance rate by group
cat("=== Weighted % Uninsured by Race/Ethnicity ===\n")
## === Weighted % Uninsured by Race/Ethnicity ===
svyby(~uninsured, ~race_eth, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
print()
## race_eth uninsured se
## White White 11.3 0.4
## Black Black 17.2 0.7
## Hispanic Hispanic 29.2 0.8
## Other Other 11.5 0.6
cat("=== Weighted % Uninsured by Race/Ethnicity ===\n")
## === Weighted % Uninsured by Race/Ethnicity ===
svyby(~uninsured, ~employed, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
print()
## employed uninsured se
## 0 0 22.3 0.6
## 1 100 18.1 0.6
cat("\n=== Weighted % Uninsured by Sex ===\n")
##
## === Weighted % Uninsured by Sex ===
svyby(~uninsured, ~sex, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
print()
## sex uninsured se
## Male Male 20.7 0.7
## Female Female 16.4 0.5
cat("\n=== Weighted % Uninsured by Education ===\n")
##
## === Weighted % Uninsured by Education ===
svyby(~uninsured, ~educ_yrs, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
print()
## educ_yrs uninsured se
## 8 800 53.3 2.0
## 12 1200 41.2 1.1
## 14 1400 23.5 0.5
## 16 1600 15.3 0.6
## 18 1800 7.3 0.3
Exercise 1: What patterns do you see in the balance table? Which groups have higher rates of being uninsured?
Your answer here: We see here that Hispanic men have the lighest rates of being uninsured, and that education has a negative correlation with insurance.
We fit both an LPM and a logistic regression predicting the probability of being uninsured, with education, income, age, sex, race/ethnicity, and employment status as predictors.
ACS person weights (PWGTP) are probability
weights , each respondent represents hundreds of people in the
population. The correct approach is svyglm() from the
survey package, which accounts for the complex survey
design properly.
lpm_acs <- lm(uninsured ~ educ_yrs + income_log + age + sex + race_eth + employed,
data = acs, weights = weight)
logit_acs <- svyglm(uninsured ~ educ_yrs + income_log + age + sex + race_eth + employed,
design = acs_design, family = binomial)
modelsummary(list("LPM" = lpm_acs, "Logit (Log Odds)" = logit_acs),
estimate = "{estimate}{stars} ({std.error})",
statistic = NULL,
stars = c(`***` = .001, `**` = .01, `*` = .05),
gof_omit = "F|RMSE|Log|IC",
output = "huxtable")
| LPM | Logit (Log Odds) | |
|---|---|---|
| (Intercept) | 1.057*** (0.010) | 4.672*** (0.137) |
| educ_yrs | -0.031*** (0.000) | -0.239*** (0.007) |
| income_log | -0.035*** (0.001) | -0.244*** (0.011) |
| age | -0.002*** (0.000) | -0.013*** (0.001) |
| sexFemale | -0.044*** (0.002) | -0.309*** (0.022) |
| race_ethBlack | 0.028*** (0.003) | 0.309*** (0.053) |
| race_ethHispanic | 0.120*** (0.002) | 0.842*** (0.038) |
| race_ethOther | 0.011** (0.004) | 0.082 (0.048) |
| employed | 0.019*** (0.003) | 0.144*** (0.042) |
| Num.Obs. | 142845 | 142845 |
| R2 | 0.105 | |
| R2 Adj. | 0.105 |
Exercise 2: Compare the direction of coefficients between the LPM and logit. Do they tell the same story? Which predictor has the strongest association with being uninsured?
Your answer here: By and large, directionally, both models tell the same story. Furthermore, we see in both that hispanics have the highest probability of being uninsured and is the strongest indicator of uninsurance.
# Check whether LPM produces out-of-bounds predictions
lpm_preds <- predict(lpm_acs)
cat("LPM predicted probability range:", round(range(lpm_preds), 3), "\n")
## LPM predicted probability range: -0.124 0.798
cat("Proportion below 0:", mean(lpm_preds < 0), "\n")
## Proportion below 0: 0.08279604
cat("Proportion above 1:", mean(lpm_preds > 1), "\n")
## Proportion above 1: 0
Exercise 3: Does the LPM produce out-of-bounds predictions here?
Your answer here: Yes, the LPM produces predictions that are less than 0.
modelsummary(list("Log Odds" = logit_acs, "Odds Ratios" = logit_acs),
exponentiate = c(FALSE, TRUE),
estimate = "{estimate}{stars} ({std.error})",
statistic = NULL,
stars = c(`***` = .001, `**` = .01, `*` = .05),
gof_omit = "F|RMSE",
output = "huxtable")
| Log Odds | Odds Ratios | |
|---|---|---|
| (Intercept) | 4.672*** (0.137) | 106.872*** (14.656) |
| educ_yrs | -0.239*** (0.007) | 0.787*** (0.005) |
| income_log | -0.244*** (0.011) | 0.783*** (0.008) |
| age | -0.013*** (0.001) | 0.987*** (0.001) |
| sexFemale | -0.309*** (0.022) | 0.734*** (0.016) |
| race_ethBlack | 0.309*** (0.053) | 1.363*** (0.073) |
| race_ethHispanic | 0.842*** (0.038) | 2.320*** (0.088) |
| race_ethOther | 0.082 (0.048) | 1.085 (0.052) |
| employed | 0.144*** (0.042) | 1.155*** (0.048) |
| Num.Obs. | 142845 | 142845 |
| AIC | 121884.6 | 121884.6 |
| BIC | 121937.5 | 121937.5 |
| Log.Lik. | -60831.719 | -60831.719 |
Exercise 4: Interpret the odds ratio for age and race_ethHispanic, respectively.
Your answer here: For age, we see that a one year increase in age correlates with a 1.3% declined odds of being uninsured. In other words, older individuals are more likely to be insured than younger individuals. For race, we see that being hispanic, relative to being white, increases the odds of being uninsured by 132%.
plot_predictions(logit_acs, condition = c("educ_yrs", "race_eth")) +
labs(title = "Predicted P(Uninsured) by Education and Race/Ethnicity",
subtitle = "ACS Texas 2022; other covariates held at mean/mode",
x = "Education (years)", y = "Predicted Probability of Being Uninsured")
predictions(logit_acs,
newdata = datagrid(educ_yrs = c(8, 12, 16),
race_eth = c("White", "Black", "Hispanic"),
income_log = mean(acs$income_log, na.rm = TRUE),
age = mean(acs$age),
sex = "Male",
employed = 1))
##
## educ_yrs race_eth income_log age sex employed Estimate Std. Error z
## 8 White 10.4 41.6 Male 1 0.458 0.01395 32.8
## 8 Black 10.4 41.6 Male 1 0.535 0.01556 34.4
## 8 Hispanic 10.4 41.6 Male 1 0.662 0.01199 55.3
## 12 White 10.4 41.6 Male 1 0.245 0.00702 34.9
## 12 Black 10.4 41.6 Male 1 0.307 0.01079 28.4
## 12 Hispanic 10.4 41.6 Male 1 0.430 0.00921 46.6
## 16 White 10.4 41.6 Male 1 0.111 0.00342 32.4
## 16 Black 10.4 41.6 Male 1 0.145 0.00648 22.4
## 16 Hispanic 10.4 41.6 Male 1 0.224 0.00660 34.0
## Pr(>|z|) S 2.5 % 97.5 %
## <0.001 783.4 0.431 0.486
## <0.001 859.8 0.505 0.566
## <0.001 Inf 0.639 0.686
## <0.001 884.6 0.231 0.259
## <0.001 588.2 0.286 0.328
## <0.001 Inf 0.412 0.448
## <0.001 763.4 0.104 0.118
## <0.001 367.8 0.133 0.158
## <0.001 839.8 0.212 0.237
##
## Type: response
Exercise 5: For an employed male at mean age and income, how does predicted uninsurance probability differ between a White respondent with 16 years of education vs. a Hispanic respondent with 8 years?
Your answer here: We see here that white males at mean age who are employed and have 16 years of education are only 11% likely to be uninsured. Conversely, hispanic men at the same age and employment but with only 8 years of education are 66% likely to be uninsuredd.
avg_predictions(logit_acs, by = "race_eth")
##
## race_eth Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## White 0.111 0.00280 39.6 <0.001 Inf 0.105 0.116
## Black 0.174 0.00625 27.8 <0.001 561.4 0.161 0.186
## Hispanic 0.287 0.00585 49.0 <0.001 Inf 0.275 0.298
## Other 0.114 0.00474 24.1 <0.001 422.4 0.105 0.123
##
## Type: response
Exercise 6: Compare the average predicted probability of being uninsured across racial/ethnic groups. How large is the Hispanic-White gap?
Your answer here: We see here that hispanics have an 18.6% higher odds of being uninsured than white individuals. For whites, we see that they are only 11% likely to be uninsured. Blacks are 17.4% likely to be uninsured. Hispanics are 28.7% likely to be uninsured. Others are 11.4% likely to be uninsured.
avg_slopes(logit_acs)
##
## Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 %
## age dY/dX -0.00160 0.000114 -14.06 <0.001 146.7 -0.00182
## educ_yrs dY/dX -0.03019 0.000852 -35.44 <0.001 911.7 -0.03186
## employed 1 - 0 0.01776 0.005011 3.54 <0.001 11.3 0.00794
## income_log dY/dX -0.03084 0.001405 -21.94 <0.001 352.0 -0.03359
## race_eth Black - White 0.03544 0.006416 5.52 <0.001 24.8 0.02286
## race_eth Hispanic - White 0.11291 0.005415 20.85 <0.001 318.3 0.10229
## race_eth Other - White 0.00872 0.005218 1.67 0.0948 3.4 -0.00151
## sex Female - Male -0.03885 0.002858 -13.60 <0.001 137.4 -0.04445
## 97.5 %
## -0.00138
## -0.02853
## 0.02758
## -0.02808
## 0.04801
## 0.12352
## 0.01895
## -0.03325
##
## Type: response
plot_slopes(logit_acs, variables = "income_log", condition = c("age", "sex")) +
labs(title = "Marginal Effect of Log Income on P(Uninsured) by Age and Sex",
x = "Age", y = "Marginal Effect of Log Income")
Exercise 7: What is the AME of
income_log? Interpret it in plain language. Does income’s
protective effect against being uninsured vary by age or sex?
Your answer here: The AME of income_log is -.03. I think this means that a one unit increase in income reduces the likelihood of being uninsured by 3%, but I’m not entirely confident in this interpretation. We see here that income has a relatively constant effect on insurnace by age, though women are always much more likely to be uninsured than men. (I’m not confident at all in this answer tbh)
Exercise 8 (Challenge): Re-run the logit as an
unweighted glm() (no weights, no survey design) and compare
the race/ethnicity coefficients to svyglm(). Do the point
estimates change? Do the standard errors change?
Your answer here: It doesn’t look like the unweighted model changed much, though the standard errors are slightly larger in the weighted one. I could use some help interpretting this comparison.
logit_acs_unwtd <- glm(
uninsured ~ educ_yrs + income_log + age + sex + race_eth + employed,
data = acs_design$variables,
family = binomial
)
modelsummary(
list(
"Unweighted Logit" = logit_acs_unwtd,
"Survey-Weighted Logit" = logit_acs
),
statistic = "({std.error})",
stars = TRUE,
gof_omit = "IC|Log|Adj|Pseudo"
)
| Unweighted Logit | Survey-Weighted Logit | |
|---|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
| (Intercept) | 4.551*** | 4.672*** |
| (0.076) | (0.137) | |
| educ_yrs | -0.237*** | -0.239*** |
| (0.004) | (0.007) | |
| income_log | -0.241*** | -0.244*** |
| (0.007) | (0.011) | |
| age | -0.011*** | -0.013*** |
| (0.001) | (0.001) | |
| sexFemale | -0.298*** | -0.309*** |
| (0.016) | (0.022) | |
| race_ethBlack | 0.275*** | 0.309*** |
| (0.027) | (0.053) | |
| race_ethHispanic | 0.776*** | 0.842*** |
| (0.017) | (0.038) | |
| race_ethOther | 0.010 | 0.082+ |
| (0.030) | (0.048) | |
| employed | 0.075*** | 0.144*** |
| (0.022) | (0.042) | |
| Num.Obs. | 142845 | 142845 |
| F | 1478.118 | 411.809 |