Lab Overview This combined lab covers two related topics: Part 1 introduces binary outcome models using marijuana legalization attitudes from the GSS. Part 2 practices our understanding of predictions and marginal effects across LPM and logistic models, using unemployment from the GSS. Part 3 uses ACS data to practice binary outcome models, with a particular attention to how to apply survey design variables to produce weighted descriptive and weighted logistic regression analysis.

Part 1: Binary Outcomes

Conceptual Review: Why Not Just Use OLS for Binary Outcomes? When the dependent variable is binary (0/1), ordinary least squares (OLS) can produce predicted probabilities below 0 or above 1. The Linear Probability Model (LPM) uses OLS anyway and is easy to interpret, but logistic regression solves the boundary problem by modeling the log odds of the outcome. You’ll see both approaches this week and compare them directly.

Data Loading and Management

We’ll use the 2022 General Social Survey (GSS). Our outcome is legal_grass: whether a respondent thinks marijuana should be legalized (1 = yes, 0 = no). Our key predictor is conservatism, a 1–7 ideology scale (1 = extremely liberal, 7 = extremely conservative).

library(gssr)

gss2022 <- gss_get_yr(2022)

gss <- gss2022 %>%
  select(grassnv, polviews, sex, race, hispanic, age, educ, wtssnrps, ballot) %>%
  haven::zap_labels() %>% #strips variable and value labels from data imported; standard practice if you are importing STATA file
  mutate(
    legal_grass = case_when(
      grassnv == 1 ~ 1,
      grassnv == 2 ~ 0
    ),
    conservatism = polviews,
    age = case_when(age > 88 ~ NaN, TRUE ~ age),
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
    race = case_when(
      hispanic != 1 ~ "Hispanic",
      race == 1 & hispanic == 1 ~ "White",
      race == 2 & hispanic == 1 ~ "Black",
      race == 3 & hispanic == 1 ~ "Other"
    ),
    race4 = factor(race, levels = c("White", "Black", "Hispanic", "Other")),
    weight = wtssnrps
  ) %>%
  filter(ballot != 1) %>%
  select(-wtssnrps, -hispanic, -race, -polviews, -grassnv, -ballot)

Missingness Analysis

gg_miss_upset(gss)

We’ll use listwise deletion for simplicity here.

gss <- gss %>% drop_na()

Descriptive Statistics

datasummary_skim(gss, type = "numeric",
                 title = "Descriptive Statistics: Continuous Variables",
                 notes = "Data: 2022 General Social Survey")

	Unique	Mean	SD	Min	Median	Max
age	69	48.7	17.6	18.0	47.0	88.0
highest year of school completed	15	14.7	2.5	4.0	14.0	20.0
legal_grass	2	0.8	0.4	0.0	1.0	1.0
think of self as liberal or conservative	7	3.8	1.5	1.0	4.0	7.0
person post-stratification weight, nonrespondents adjusted	526	0.8	0.9	0.1	0.5	6.3

datasummary_skim(gss, type = "categorical",
                 title = "Descriptive Statistics: Categorical Variables",
                 notes = "Data: 2022 General Social Survey")

Descriptive Statistics: Categorical Variables
		N	%
Data: 2022 General Social Survey
sex	Male	236	43.9
	Female	301	56.1
race4	White	379	70.6
	Black	52	9.7
	Hispanic	77	14.3
	Other	29	5.4

A balance table compares demographics across the two sides of the outcome variable. The Diff. in Means column shows how groups differ.

datasummary_balance(~legal_grass, data = gss,
                    title = "Balance Table: Demographics by Views on Marijuana Legalization",
                    notes = "Data: 2022 General Social Survey (Unweighted)")

Balance Table: Demographics by Views on Marijuana Legalization
		0		1
		Mean	Std. Dev.	Mean	Std. Dev.
Data: 2022 General Social Survey (Unweighted)
age		55.3	17.1	46.7	17.3
educ		14.8	2.8	14.6	2.5
conservatism		4.8	1.4	3.6	1.4
weight		0.9	1.1	0.7	0.8
		N	Pct.	N	Pct.
sex	Male	60	50.0	176	42.2
	Female	60	50.0	241	57.8
race4	White	79	65.8	300	71.9
	Black	9	7.5	43	10.3
	Hispanic	20	16.7	57	13.7
	Other	12	10.0	17	4.1

Exercise 1: What do the descriptive statistics and balance table tell you about who supports marijuana legalization? Summarize two or three key differences.

Your answer here: The first thing I noticed about the descriptive statistics is that the distribution for ideologgy is pretty evenly distrbuted but a majority of respondents identified as somewhere in the middle. For opinions on marijuana legalization, I was surprised to see that most respondents support the issue. Furthermore, we see that whites make up the vast majority of the surveyees in this dataset. When we look to the balance table, we see that the population supporting legalization are younger, less conservative, and slightly less educated on average versus those that oppose marijuana legalization. Moreover, more women support legalization than men, though the breakdown of those opposing is 50/50 by sex. Given that women are more greatly represented in this sample, we can only take this with a grain of salt. This is the same case for race, where whites appear to strongly favor legalization but we can’t draw conclusions off of this given that whites make up the vast majority of the study too.

Section 1: Linear Probability Model (LPM)

The LPM estimates the effect of each predictor on the probability that the outcome equals 1. It’s just OLS with a binary dependent variable.

linprob1 <- lm(legal_grass ~ conservatism + age + educ + race4 + sex,
               data = gss, weights = weight)

modelsummary(linprob1,
             estimate  = "{estimate}{stars} ({std.error})",
               statistic = NULL,                                                                                        
               stars     = c(`***` = .001, `**` = .01, `*` = .05),
             gof_omit = "F|RMSE|Log|IC",
             output = "huxtable")

	(1)
(Intercept)	1.534*** (0.122)
conservatism	-0.097*** (0.012)
age	-0.005*** (0.001)
educ	-0.007 (0.006)
race4Black	-0.030 (0.069)
race4Hispanic	-0.205*** (0.051)
race4Other	-0.468*** (0.068)
sexFemale	-0.008 (0.034)
Num.Obs.	537
R2	0.238
R2 Adj.	0.228

Exercise 2: Interpret the coefficient for conservatism in the LPM. What does it mean in probability terms?

Your answer here: For this model, the coefficient would tell us the likelihood of an individual supporting legalization if they are not at all conservative, 0 years of age, no education, white, and male. This kind of highlights the issue of LPMs, as there’s no reality where the DV is greater than one as the intercept implies. However, would it make sense to interpret this as a young/white/uneducated/male being highly likely to support legalization?

The Out-of-Bounds Problem

The LPM can produce predicted probabilities below 0 or above 1. To see this, we predict across a grid of conservatism and age values for a Black male with average education.

linpreddata <- expand_grid(conservatism = seq(1, 7),
                           age = seq(18, 88, 10),
                           educ = mean(gss$educ),
                           race4 = "Black",
                           sex = "Male")

linpreddata$pred_grass <- predict(linprob1, newdata = linpreddata, type = "response")

range(linpreddata$pred_grass)

## [1] 0.2774806 1.2132735

ggplot(linpreddata, aes(x = conservatism, y = age, fill = pred_grass)) +
  geom_tile() + scale_fill_viridis_c() + theme_minimal() +
  geom_text(aes(label = round(pred_grass, 2)), color = "white", size = 4) +
  scale_x_continuous(breaks = seq(0, 8, 1)) +
  scale_y_continuous(breaks = seq(18, 88, 10)) +
  labs(title = "Linear Probability Model",
       subtitle = "Predicted Probability of Favoring Legalization — Black Males, Average Education",
       y = "Age", x = "Conservatism", fill = "Probability")

Exercise 3: Are any predicted values outside [0, 1]? Where in the grid does this occur, and why is it a problem?

Your answer here: Yes, there are predicted values greater than 1, which is the problem with LPMs. We see here that these are around the lowest ages and the lowest degrees of conservatism. The issue here is that we are interested in the yes/no of supporting legalization, and anything outside of the 0-1 range is not observable in real life. Though we can’t use this to interpret the probability coefficients, we can see that it means that younger/liberal individuals are far more likely to support legalization than others.

Section 2: Logistic Regression

Logistic regression solves the out-of-bounds problem by transforming predictions through the logistic function, we make sure all predicted probabilities stay between 0 and 1. Use glm() with family = "binomial".

logit_full <- glm(legal_grass ~ conservatism + age + educ + race4 + sex,
                  family = "binomial", data = gss)

logit_base <- glm(legal_grass ~ conservatism,
                  family = "binomial", data = gss)

compmods <- list("LPM" = linprob1, "Base Logit" = logit_base, "Full Logit" = logit_full)

modelsummary(compmods,
             estimate  = "{estimate}{stars} ({std.error})",
               statistic = NULL,                                                                                        
               stars     = c(`***` = .001, `**` = .01, `*` = .05),
             gof_omit = "F|RMSE",
             output = "huxtable")

	LPM	Base Logit	Full Logit
(Intercept)	1.534*** (0.122)	3.740*** (0.388)	6.507*** (0.945)
conservatism	-0.097*** (0.012)	-0.596*** (0.083)	-0.628*** (0.090)
age	-0.005*** (0.001)		-0.026*** (0.007)
educ	-0.007 (0.006)		-0.079 (0.046)
race4Black	-0.030 (0.069)		-0.090 (0.420)
race4Hispanic	-0.205*** (0.051)		-0.900** (0.332)
race4Other	-0.468*** (0.068)		-1.362** (0.446)
sexFemale	-0.008 (0.034)		0.193 (0.231)
Num.Obs.	537	537	537
R2	0.238
R2 Adj.	0.228
AIC	679.2	513.3	493.1
BIC	717.8	521.9	527.4
Log.Lik.	-330.603	-254.646	-238.560

modelplot(compmods, coef_omit = 'Interc') +
  geom_vline(aes(xintercept = 0), linetype = "dotted", linewidth = 1, alpha = .5) +
  theme(legend.position = "bottom")

Note: The LPM and logit coefficients are on different scales and should not be directly compared in magnitude. The plot shows direction and uncertainty, not equivalent effect sizes.

Interpreting Log Odds

modelsummary(logit_full,
             estimate = "{estimate} ({std.error}) {stars}",
             stars     = c(`*` = .05, `**` = .01, `***` = .001),   
             statistic = NULL,
             gof_omit = "F|RMSE|IC|Log",
             output = "huxtable")

	(1)
(Intercept)	6.507 (0.945) ***
conservatism	-0.628 (0.090) ***
age	-0.026 (0.007) ***
educ	-0.079 (0.046)
race4Black	-0.090 (0.420)
race4Hispanic	-0.900 (0.332) **
race4Other	-1.362 (0.446) **
sexFemale	0.193 (0.231)
Num.Obs.	537

Checking That Predictions Stay Bounded

logitpreddata <- expand_grid(conservatism = seq(1, 7),
                             age = seq(18, 88, 10),
                             educ = mean(gss$educ),
                             race4 = "Black",
                             sex = "Male")

logitpreddata$pred_grass <- predict(logit_full, newdata = logitpreddata, type = "response")

range(logitpreddata$pred_grass)

## [1] 0.1921750 0.9845489

ggplot(logitpreddata, aes(x = conservatism, y = age, fill = pred_grass)) +
  geom_tile() + scale_fill_viridis_c() + theme_minimal() +
  geom_text(aes(label = round(pred_grass, 2)), color = "white", size = 4) +
  scale_x_continuous(breaks = seq(0, 8, 1)) +
  scale_y_continuous(breaks = seq(18, 88, 10)) +
  labs(title = "Logit Model",
       subtitle = "Predicted Probability of Favoring Legalization — Black Males, Average Education",
       y = "Age", x = "Conservatism", fill = "Probability")

Section 3: Odds Ratios

Log odds are directionally intuitive but hard to communicate. Odds ratios (ORs) are easier: an OR > 1 means increased odds, OR < 1 means decreased odds.

cbind(LogOdds  = round(coef(logit_full), 3),
      OddsRatio = round(exp(coef(logit_full)), 3))

##               LogOdds OddsRatio
## (Intercept)     6.507   669.665
## conservatism   -0.628     0.534
## age            -0.026     0.974
## educ           -0.079     0.924
## race4Black     -0.090     0.914
## race4Hispanic  -0.900     0.407
## race4Other     -1.362     0.256
## sexFemale       0.193     1.213

To get the percent change in odds: subtract 1 from the OR. For example, if OR = 0.391, then $(0.391 - 1) = -0.609$, meaning a ~61% decrease in odds.

modelsummary(list("Log Odds" = logit_full, "Odds Ratio" = logit_full),
             exponentiate = c(F, T),
             estimate = "{estimate} ({std.error}) {stars}",
             stars     = c(`*` = .05, `**` = .01, `***` = .001),   
             statistic = NULL,
             gof_omit = "F|RMSE|Log",
             output = "huxtable")

	Log Odds	Odds Ratio
(Intercept)	6.507 (0.945) ***	669.665 (632.587) ***
conservatism	-0.628 (0.090) ***	0.534 (0.048) ***
age	-0.026 (0.007) ***	0.974 (0.007) ***
educ	-0.079 (0.046)	0.924 (0.043)
race4Black	-0.090 (0.420)	0.914 (0.384)
race4Hispanic	-0.900 (0.332) **	0.407 (0.135) **
race4Other	-1.362 (0.446) **	0.256 (0.114) **
sexFemale	0.193 (0.231)	1.213 (0.281)
Num.Obs.	537	537
AIC	493.1	493.1
BIC	527.4	527.4

Exercise 5: Interpret the odds ratio for educ. What does it say about the relationship between education and support for legalization?

Your answer here: For education, we can see that a one unit increase (a one year increase in education) is associated with a ~8% decreased odds of supporting legalization. This means that highly educated people are less likely to support legalization than those with less education.

Section 4: Model Evaluation

Log-Likelihood

Higher (less negative) log-likelihood = better fit. We use it to compare nested models.

logLik(logit_full)

## 'log Lik.' -238.5598 (df=8)

logLik(logit_base)

## 'log Lik.' -254.6458 (df=2)

McFadden’s Pseudo-R²

There is no true R² for logistic regression. McFadden’s pseudo-R² approximates it:

\[R^2_{McFadden} = 1 - \frac{LL_{model}}{LL_{null}}\]

Values between 0.2 and 0.4 are generally considered a good fit.

DescTools::PseudoR2(logit_full, which = c("McFadden", "McFaddenAdj", "logLik"))

##     McFadden  McFaddenAdj       logLik 
##    0.1637843    0.1357422 -238.5597959

DescTools::PseudoR2(logit_base, which = c("McFadden", "McFaddenAdj", "logLik"))

##     McFadden  McFaddenAdj       logLik 
##    0.1073987    0.1003881 -254.6457804

Exercise 7: Based on the log-likelihood and pseudo-R² values, which model fits better? the base logit or the full logit? How do you know?

Your answer here: Using just the log likelihood estimates, we see that the full logit model is better because it is closer to 0. The McFadden’s psuedo-R2 supports this, though it is outside of the ideal range so we shouldn’t consider that model a good fit for the data.

Part 2: Predictions and Marginal Effects

Conceptual Review: Why Do We Need Predictions and Marginal Effects? Raw regression coefficients tell you about direction and statistical significance, but they are often hard to communicate and can be misleading , especially in nonlinear models like logit. Predictions let you translate model output into concrete expected values (e.g., “a 30-year-old with 16 years of education is predicted to earn $X”). Marginal effects tell you how the predicted outcome changes when one predictor changes by one unit, which gives you the “slope” of the relationship at any given point. —

Data Preparation

For Part 2, we use the same GSS dataset with a focus on unemployment as the primary outcome and human capital and demographic characteristics as predictors.

gss2022b <- gss2022

gss2 <- gss2022b %>%
  transmute(
    income = as.numeric(conrinc),
    age    = ifelse(age < 0, NA, age),
    educ   = ifelse(educ < 0, NA, educ),
    sex    = factor(sex, levels = 1:2, labels = c("Male", "Female")),
    race   = factor(race, levels = 1:3, labels = c("White", "Black", "Other")),
    unemp  = ifelse(unemp == 1, 1, 0)
  ) %>%
  drop_na()

Descriptive Statistics

datasummary_skim(gss2, type = "numeric", fmt = 2, histogram = TRUE,
                 title = "Descriptive Statistics: Continuous Variables",
                 notes = "Data: 2022 General Social Survey")

	Unique	Mean	SD	Min	Median	Max
income	26	40987.62	39527.08	316.50	28485.00	163572.71
age	67	43.96	14.58	18.00	42.00	89.00
educ	20	14.50	2.90	0.00	14.00	20.00
unemp	2	0.36	0.48	0.00	0.00	1.00

datasummary_skim(gss2, type = "categorical",
                 title = "Descriptive Statistics: Categorical Variables",
                 notes = "Data: 2022 General Social Survey",
                 output = "huxtable")

Descriptive Statistics: Categorical Variables
		N	%
sex	Male	709	48.5
	Female	753	51.5
race	White	975	66.7
	Black	242	16.6
	Other	245	16.8
Data: 2022 General Social Survey

Section 5: Logistic Regression — Full Analysis

Part A: Model Fitting and Comparison

We fit two models: a base model with main effects only, and a full model that adds an educ × sex interaction to test whether education’s protective effect differs by sex.

logit_base  <- glm(unemp ~ educ + sex + age + I(age^2) + race,
                   data = gss2, family = binomial)

logit_model <- glm(unemp ~ educ * sex + age + I(age^2) + race,
                   data = gss2, family = binomial)

modelsummary(list("Base" = logit_base, "With Interaction" = logit_model),
             exponentiate = FALSE,
             estimate  = "{estimate}{stars} ({std.error})",
               statistic = NULL,                                                                                        
               stars     = c(`***` = .001, `**` = .01, `*` = .05),
             gof_omit = "pss|alg|RMSE|AIC|BIC|F",
             output = "huxtable")

	Base	With Interaction
(Intercept)	1.435* (0.575)	1.249 (0.645)
educ	-0.087*** (0.020)	-0.075** (0.028)
sexFemale	0.245* (0.113)	0.605 (0.578)
age	-0.011 (0.025)	-0.011 (0.025)
I(age^2)	-0.000 (0.000)	-0.000 (0.000)
raceBlack	0.075 (0.153)	0.072 (0.153)
raceOther	0.163 (0.151)	0.161 (0.151)
educ × sexFemale		-0.025 (0.040)
Num.Obs.	1462	1462
Log.Lik.	-915.008	-914.806

Exercise: Interpret the main effect of educ and the interaction term educ:sexFemale in the full model using log odds. Does education protect men and women equally against unemployment risk?

Your answer here: The main effect of education is that, in both models, a one unit increase in education drastically reduces the odds of being unemployed. Furthermore, the interaction term tells us that there is a slight difference in how education impacts sex/gender’s relationship with employment, though the relationship itself is not significant. If it were, we would say that education protects women from unemployment at a lower level than it does for men.

Part B: Model Fit Statistics

Before interpreting, we should check whether the full model fits better than the base model.

logLik(logit_base)

## 'log Lik.' -915.008 (df=7)

logLik(logit_model)

## 'log Lik.' -914.8059 (df=8)

DescTools::PseudoR2(logit_base,  which = c("McFadden", "McFaddenAdj", "logLik"))

##      McFadden   McFaddenAdj        logLik 
##    0.04594708    0.03864838 -915.00803839

DescTools::PseudoR2(logit_model, which = c("McFadden", "McFaddenAdj", "logLik"))

##      McFadden   McFaddenAdj        logLik 
##    0.04615780    0.03781643 -914.80594082

Exercise 14: Compare model fit between the base and interaction models. Does adding the interaction term improve fit? How do you know?

Your answer here: We see that adding the interaction term does slightly improve the model, as the one with it is a little closer to 0 than the base model. This is also the case for the McFadden’s R2, so we know that the interaction term’s model is performing only slightly better than the one without it.

Part C: Predicted Probabilities

plot_predictions(logit_model, condition = c("educ", "sex")) +
  labs(title = "Predicted P(Unemployment) by Education and Sex",
       x = "Education (years)", y = "Predicted Probability of Unemployment")

# Predicted probabilities for specific profiles
predictions(logit_model,
            newdata = datagrid(educ = c(12, 16),
                               sex  = c("Male", "Female"),
                               age  = mean(gss2$age),
                               race = "White"))

## 
##  educ    sex age  race Estimate Pr(>|z|)    S 2.5 % 97.5 %
##    12 Male    44 White    0.372   <0.001 16.0 0.318  0.429
##    12 Female  44 White    0.445     0.09  3.5 0.383  0.509
##    16 Male    44 White    0.305   <0.001 38.6 0.259  0.356
##    16 Female  44 White    0.350   <0.001 26.0 0.303  0.400
## 
## Type: invlink(link)

Exercise: Using the specific predictions above, describe the predicted probability of unemployment for a White male with 12 vs. 16 years of education (at mean age). Compare to the equivalent female profiles. What do these concrete numbers communicate that log-odds coefficients cannot?

Your answer here: Here, we see that white men with the lower level of education have a higher predicted probability of being unemployed (37%) versus highly educated white men (30%). For women, we see that under-educated respondents are way more likely to be unemployed (44%) versus highly educated ones (35%). It is worth nothing that at both levels of education, women are more likely than men to be undemployed These concrete numbers can tell us the actual predicted probability of being unemployed by age/education/etc, whereas log-odds can only tell us the directionality of the relationship with employment.

Average Predicted Probability by Group

Rather than specific profiles, we can ask: what is the model-based average predicted probability for each racial group, averaging over all other covariates as they actually appear in the data?

avg_predictions(logit_model, by = "race")

## 
##   race Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
##  White    0.348     0.0148 23.5   <0.001 402.3 0.319  0.377
##  Black    0.393     0.0305 12.9   <0.001 123.6 0.333  0.452
##  Other    0.404     0.0305 13.3   <0.001 130.7 0.344  0.464
## 
## Type: response

avg_predictions(logit_model, by = "sex")

## 
##     sex Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
##  Male      0.339     0.0173 19.6   <0.001 280.4 0.305  0.372
##  Female    0.389     0.0172 22.6   <0.001 373.0 0.355  0.423
## 
## Type: response

Exercise: Describe the racial and sex differences in average predicted unemployment probability. How do these compare to the raw observed proportions? Why might they differ?

Your answer here: From these observations, we can see that white individuals have a much lower predicted probabiltiy of being unemployed (~35%) versus people of color (~40%). For sex, we see that women are slightly more likely to be unemployed (39%) versus men (34%). This aligns with what we saw above, though it could’ve differed because the other prediction table applied specific education levels and ages.

Part D: Average Marginal Effects (AME)

The AME computes each observation’s marginal effect and averages across the sample. It answers: “On average across the population, what is the effect of a one-unit change in X on the probability of unemployment?”

avg_slopes(logit_model)

## 
##  Term      Contrast Estimate Std. Error      z Pr(>|z|)    S    2.5 %   97.5 %
##  age  dY/dX         -0.00623   0.000891 -6.996   <0.001 38.5 -0.00798 -0.00449
##  educ dY/dX         -0.01916   0.004242 -4.517   <0.001 17.3 -0.02748 -0.01085
##  race Black - White  0.01582   0.033552  0.472   0.6373  0.7 -0.04994  0.08158
##  race Other - White  0.03559   0.033708  1.056   0.2911  1.8 -0.03048  0.10165
##  sex  Female - Male  0.05348   0.024575  2.176   0.0295  5.1  0.00532  0.10165
## 
## Type: response

Key Takeaway: AME estimates are in probability units. A negative AME for educ means each additional year of schooling reduces the average probability of unemployment, a direct, policy-relevant quantity.

Marginal Effects at the Mean (MEM)

The MEM computes the effect for a hypothetical “average” respondent.

slopes(logit_model, newdata = datagrid())

## 
##  Term      Contrast Estimate Std. Error      z Pr(>|z|)    S     2.5 %  97.5 %
##  age  dY/dX         -0.00688   0.000959 -7.169   <0.001 40.3 -0.008756 -0.0050
##  educ dY/dX         -0.02331   0.006575 -3.545   <0.001 11.3 -0.036201 -0.0104
##  race Black - White  0.01709   0.036220  0.472   0.6370  0.7 -0.053897  0.0881
##  race Other - White  0.03844   0.036361  1.057   0.2904  1.8 -0.032827  0.1097
##  sex  Female - Male  0.05153   0.026185  1.968   0.0491  4.3  0.000211  0.1029
## 
## Type: response

Exercise: Compare the AME and MEM for educ and for sexFemale. Are they similar or different? When and why do they diverge in a logistic model?

Your answer here: For education, we see in both forms that a one year increase is associated with a decreased predicted probability of unemployment. Similarly, we see that females in both estimates are more likely to be unemployed than men. They can diverge when the “average” person is not actually a legit real-life observation, wherein the MEM provides an estimate that we wouldn’t actually observe. (I could use more help understanding this part, tbh)

Part E: Conditional Marginal Effects

Does education’s protective effect change across the life course and does this differ by sex?

Your answer here: I’m not confident I’m interpretting this correctly, but I think: Yes, education protects more against unemployment for younger individuals. And, as women get older, they are more likely to experience unemployment than men.

slopes(logit_model, newdata = datagrid(educ = c(12, 16),
                                       sex  = c("Male", "Female"),
                                       age  = c(25, 45, 65)))

## 
##  Term      Contrast educ    sex age Estimate Std. Error     z Pr(>|z|)    S
##   age dY/dX           12 Male    25 -0.00531   0.002885 -1.84   0.0655  3.9
##   age dY/dX           12 Male    45 -0.00691   0.000957 -7.23   <0.001 40.9
##   age dY/dX           12 Male    65 -0.00670   0.001780 -3.76   <0.001 12.5
##   age dY/dX           12 Female  25 -0.00523   0.002787 -1.88   0.0607  4.0
##   age dY/dX           12 Female  45 -0.00734   0.001005 -7.31   <0.001 41.7
## --- 50 rows omitted. See ?print.marginaleffects ---
##   sex Female - Male   16 Male    45  0.04410   0.028395  1.55   0.1204  3.1
##   sex Female - Male   16 Male    65  0.03149   0.020391  1.54   0.1226  3.0
##   sex Female - Male   16 Female  25  0.04990   0.032146  1.55   0.1206  3.1
##   sex Female - Male   16 Female  45  0.04410   0.028395  1.55   0.1204  3.1
##   sex Female - Male   16 Female  65  0.03149   0.020391  1.54   0.1226  3.0
##     2.5 %    97.5 %
##  -0.01097  0.000340
##  -0.00879 -0.005040
##  -0.01019 -0.003207
##  -0.01069  0.000234
##  -0.00931 -0.005375
##  -0.01155  0.099757
##  -0.00848  0.071450
##  -0.01310  0.112907
##  -0.01155  0.099757
##  -0.00848  0.071450
## Type: response

plot_slopes(logit_model, variables = "educ", condition = c("age", "sex")) +
  labs(title = "Marginal Effect of Education on P(Unemployment) by Age and Sex",
       x = "Age", y = "Marginal Effect of Education")

Exercise: Describe how education’s protective effect varies across age and by sex. At what ages is education most protective? Do men and women follow the same pattern?

Your answer here: I think education is most effective for younger people, as the slope is most level around ~30 years of age. The marginal effect of education is always lower for women, but the gap between men/women closes as they age. They follow a similar trajectory, so it’s not like women are more employed as they age while men aren’t.

Key Takeaway: No single summary number, not the AME, not the coefficient, not the OR , tells the full story of a logistic model. Predictions, classification, and conditional effects each illuminate a different facet of the relationship.Pick the one that is most relevant for your project and your audience.

Part 3: Binary Outcomes with American Community Survey Data

Overview This section applies everything from Parts 1 and 2 to a new dataset and a new substantive question: Who lacks health insurance? We use individual-level microdata from the 2022 American Community Survey (ACS) for Texas, pulled directly from the Census Bureau API using the tidycensus package. This is real-world survey data with survey weights and a good opportunity to practice working with large, nationally representative microdata.

Data Loading

The ACS Public Use Microdata Sample (PUMS) provides individual-level records. We pull it using get_pums(). You need a free Census API key, register at https://api.census.gov/data/key_signup.html and run census_api_key("YOUR_KEY", install = TRUE) once to store it.

Variables we request:

Variable	Description
`HICOV`	Health insurance coverage (1 = insured, 2 = uninsured)
`AGEP`	Age
`SEX`	Sex (1 = Male, 2 = Female)
`RAC1P`	Race
`HISP`	Hispanic origin
`SCHL`	Educational attainment
`PINCP`	Personal income
`ESR`	Employment status
`PWGTP`	Person weight

library(tidycensus)

census_api_key("fc561cd65e1b44273f52a1e76620bc59253f8ba3", overwrite = TRUE)  # run once to store your key
acs_raw <- get_pums(
  variables = c("HICOV", "AGEP", "SEX", "RAC1P", "HISP", "SCHL", "PINCP", "ESR", "PWGTP",
                "PUMA", "ST"),
  state     = "TX",
  survey    = "acs1",
  year      = 2022
)

Data Management

acs <- acs_raw %>%
  mutate(
    # Outcome: uninsured (1 = no health insurance, 0 = has coverage)
    uninsured = as.integer(HICOV == 2),

    age = as.numeric(AGEP),

    sex = factor(SEX, levels = c("1", "2"), labels = c("Male", "Female")),

    # Education: recode SCHL to approximate years of schooling
    educ_yrs = case_when(
      SCHL %in% as.character(1:11)  ~ 8,   # less than HS
      SCHL %in% as.character(12:15) ~ 12,  # HS diploma / GED
      SCHL %in% as.character(16:17) ~ 14,  # some college
      SCHL == "18"                  ~ 14,  # associate's
      SCHL == "19"                  ~ 14,
      SCHL == "20"                  ~ 16,  # bachelor's
      SCHL %in% as.character(21:24) ~ 18,  # graduate degree
      TRUE ~ NA_real_
    ),
    # what if I want to code education as a categorical variable? 
    educ_cat = case_when(                                                                                                     
    SCHL %in% as.character(1:11)  ~ "1 Less than HS",                                                                   
    SCHL %in% as.character(12:15) ~ "2 HS diploma/GED",                                                                 
    SCHL %in% as.character(16:20) ~ "3 Some college/Associate's",                                                       
    SCHL %in% as.character(21:24) ~ "4 Bachelor's or higher"                                                            
  ),                                                                                                                    
  educ_cat = factor(educ_cat, levels = c("1 Less than HS", "2 HS diploma/GED",                                                  
                                  "3 Some college/Associate's", "4 Bachelor's or higher")),                             
                          

    # Race/ethnicity (Hispanic takes priority)
    race_eth = case_when(
      HISP != "01"                   ~ "Hispanic",
      RAC1P == "1"                   ~ "White",
      RAC1P == "2"                   ~ "Black",
      RAC1P %in% as.character(3:9)   ~ "Other"
    ),
    race_eth = factor(race_eth, levels = c("White", "Black", "Hispanic", "Other")),

    # Employed vs. not
    # ESR: "1"/"2" civilian employed, "3" unemployed,
    #       "4"/"5" armed forces, "6" not in labor force, "b" N/A (under 16)
    employed = case_when(
      ESR %in% c("1", "2", "4", "5") ~ 1,
      ESR %in% c("3", "6")           ~ 0,
      TRUE                           ~ NA 
    ),
    # Income: log-transform (keep positive incomes only)
    income_log = ifelse(as.numeric(PINCP) > 0, log(as.numeric(PINCP)), NA_real_),

    #survey design variables
    weight = as.numeric(PWGTP),
    PUMA   = as.character(PUMA),
    ST     = as.character(ST)
  ) %>%
  
  # Working-age adults only
  filter(age >= 18, age <= 64) %>%
  select(uninsured, age, sex, educ_yrs, race_eth, employed, income_log, weight, PUMA, ST) %>%
  drop_na()

cat("ACS analytic sample:", nrow(acs), "individuals\n")

## ACS analytic sample: 142845 individuals

# ACS PUMS uses a stratified cluster design:
#   ids     = ~PUMA   : Public Use Microdata Areas are the sampling clusters
#   strata  = ~ST     : states are the strata
#   nest    = TRUE    : PUMA IDs are nested within strata (same number ≠ same area)
acs_design <- svydesign(ids     = ~PUMA,
                        strata  = ~ST,
                        weights = ~weight,
                        data    = acs,
                        nest    = TRUE)

Descriptive Statistics

# Weighted means and proportions using the survey design
# svymean() returns population-representative estimates

# Continuous variables
svymean(~age + income_log + uninsured, design = acs_design, na.rm = TRUE) %>%
  as.data.frame() %>%
  round(3) %>%
  knitr::kable(caption = "Weighted Means: ACS Texas 2022 (Ages 18–64)")

Weighted Means: ACS Texas 2022 (Ages 18–64)
	mean	SE
age	39.940	0.146
income_log	10.404	0.019
uninsured	0.187	0.006

# Categorical variables
svymean(~sex + race_eth + educ_yrs, design = acs_design, na.rm = TRUE) %>%
  as.data.frame() %>%
  round(3) %>%
  knitr::kable(caption = "Weighted Proportions: Categorical Variables")

Weighted Proportions: Categorical Variables
	mean	SE
sexMale	0.529	0.002
sexFemale	0.471	0.002
race_ethWhite	0.410	0.014
race_ethBlack	0.128	0.007
race_ethHispanic	0.369	0.015
race_ethOther	0.094	0.005
educ_yrs	15.366	0.049

# Weighted uninsurance rate by group
cat("=== Weighted % Uninsured by Race/Ethnicity ===\n")

## === Weighted % Uninsured by Race/Ethnicity ===

svyby(~uninsured, ~race_eth, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
  mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
  print()

##          race_eth uninsured  se
## White       White      11.3 0.4
## Black       Black      17.2 0.7
## Hispanic Hispanic      29.2 0.8
## Other       Other      11.5 0.6

cat("=== Weighted % Uninsured by Race/Ethnicity ===\n")

## === Weighted % Uninsured by Race/Ethnicity ===

svyby(~uninsured, ~employed, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
  mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
  print()

##   employed uninsured  se
## 0        0      22.3 0.6
## 1      100      18.1 0.6

cat("\n=== Weighted % Uninsured by Sex ===\n")

## 
## === Weighted % Uninsured by Sex ===

svyby(~uninsured, ~sex, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
  mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
  print()

##           sex uninsured  se
## Male     Male      20.7 0.7
## Female Female      16.4 0.5

cat("\n=== Weighted % Uninsured by Education ===\n")

## 
## === Weighted % Uninsured by Education ===

svyby(~uninsured, ~educ_yrs, design = acs_design, FUN = svymean, na.rm = TRUE) %>%
  mutate(across(where(is.numeric), ~round(. * 100, 1))) %>%
  print()

##    educ_yrs uninsured  se
## 8       800      53.3 2.0
## 12     1200      41.2 1.1
## 14     1400      23.5 0.5
## 16     1600      15.3 0.6
## 18     1800       7.3 0.3

Exercise 1: What patterns do you see in the balance table? Which groups have higher rates of being uninsured?

Your answer here: We see here that Hispanic men have the lighest rates of being uninsured, and that education has a negative correlation with insurance.

Section A: LPM vs. Logit

We fit both an LPM and a logistic regression predicting the probability of being uninsured, with education, income, age, sex, race/ethnicity, and employment status as predictors.

ACS person weights (PWGTP) are probability weights , each respondent represents hundreds of people in the population. The correct approach is svyglm() from the survey package, which accounts for the complex survey design properly.

lpm_acs <- lm(uninsured ~ educ_yrs + income_log + age + sex + race_eth + employed,
              data = acs, weights = weight)

logit_acs <- svyglm(uninsured ~ educ_yrs + income_log + age + sex + race_eth + employed,
                    design = acs_design, family = binomial)

modelsummary(list("LPM" = lpm_acs, "Logit (Log Odds)" = logit_acs),
             estimate  = "{estimate}{stars} ({std.error})",
             statistic = NULL,
             stars     = c(`***` = .001, `**` = .01, `*` = .05),
             gof_omit  = "F|RMSE|Log|IC",
             output    = "huxtable")

	LPM	Logit (Log Odds)
(Intercept)	1.057*** (0.010)	4.672*** (0.137)
educ_yrs	-0.031*** (0.000)	-0.239*** (0.007)
income_log	-0.035*** (0.001)	-0.244*** (0.011)
age	-0.002*** (0.000)	-0.013*** (0.001)
sexFemale	-0.044*** (0.002)	-0.309*** (0.022)
race_ethBlack	0.028*** (0.003)	0.309*** (0.053)
race_ethHispanic	0.120*** (0.002)	0.842*** (0.038)
race_ethOther	0.011** (0.004)	0.082 (0.048)
employed	0.019*** (0.003)	0.144*** (0.042)
Num.Obs.	142845	142845
R2	0.105
R2 Adj.	0.105

Exercise 2: Compare the direction of coefficients between the LPM and logit. Do they tell the same story? Which predictor has the strongest association with being uninsured?

Your answer here: By and large, directionally, both models tell the same story. Furthermore, we see in both that hispanics have the highest probability of being uninsured and is the strongest indicator of uninsurance.

# Check whether LPM produces out-of-bounds predictions
lpm_preds <- predict(lpm_acs)
cat("LPM predicted probability range:", round(range(lpm_preds), 3), "\n")

## LPM predicted probability range: -0.124 0.798

cat("Proportion below 0:", mean(lpm_preds < 0), "\n")

## Proportion below 0: 0.08279604

cat("Proportion above 1:", mean(lpm_preds > 1), "\n")

## Proportion above 1: 0

Exercise 3: Does the LPM produce out-of-bounds predictions here?

Your answer here: Yes, the LPM produces predictions that are less than 0.

Section B: Odds Ratios and Model Fit

modelsummary(list("Log Odds" = logit_acs, "Odds Ratios" = logit_acs),
             exponentiate = c(FALSE, TRUE),
             estimate  = "{estimate}{stars} ({std.error})",
             statistic = NULL,
             stars     = c(`***` = .001, `**` = .01, `*` = .05),
             gof_omit  = "F|RMSE",
             output    = "huxtable")

	Log Odds	Odds Ratios
(Intercept)	4.672*** (0.137)	106.872*** (14.656)
educ_yrs	-0.239*** (0.007)	0.787*** (0.005)
income_log	-0.244*** (0.011)	0.783*** (0.008)
age	-0.013*** (0.001)	0.987*** (0.001)
sexFemale	-0.309*** (0.022)	0.734*** (0.016)
race_ethBlack	0.309*** (0.053)	1.363*** (0.073)
race_ethHispanic	0.842*** (0.038)	2.320*** (0.088)
race_ethOther	0.082 (0.048)	1.085 (0.052)
employed	0.144*** (0.042)	1.155*** (0.048)
Num.Obs.	142845	142845
AIC	121884.6	121884.6
BIC	121937.5	121937.5
Log.Lik.	-60831.719	-60831.719

Exercise 4: Interpret the odds ratio for age and race_ethHispanic, respectively.

Your answer here: For age, we see that a one year increase in age correlates with a 1.3% declined odds of being uninsured. In other words, older individuals are more likely to be insured than younger individuals. For race, we see that being hispanic, relative to being white, increases the odds of being uninsured by 132%.

Section C: Predicted Probabilities

plot_predictions(logit_acs, condition = c("educ_yrs", "race_eth")) +
  labs(title = "Predicted P(Uninsured) by Education and Race/Ethnicity",
       subtitle = "ACS Texas 2022; other covariates held at mean/mode",
       x = "Education (years)", y = "Predicted Probability of Being Uninsured")

predictions(logit_acs,
            newdata = datagrid(educ_yrs   = c(8, 12, 16),
                               race_eth   = c("White", "Black", "Hispanic"),
                               income_log = mean(acs$income_log, na.rm = TRUE),
                               age        = mean(acs$age),
                               sex        = "Male",
                               employed   = 1))

## 
##  educ_yrs race_eth income_log  age  sex employed Estimate Std. Error    z
##         8 White          10.4 41.6 Male        1    0.458    0.01395 32.8
##         8 Black          10.4 41.6 Male        1    0.535    0.01556 34.4
##         8 Hispanic       10.4 41.6 Male        1    0.662    0.01199 55.3
##        12 White          10.4 41.6 Male        1    0.245    0.00702 34.9
##        12 Black          10.4 41.6 Male        1    0.307    0.01079 28.4
##        12 Hispanic       10.4 41.6 Male        1    0.430    0.00921 46.6
##        16 White          10.4 41.6 Male        1    0.111    0.00342 32.4
##        16 Black          10.4 41.6 Male        1    0.145    0.00648 22.4
##        16 Hispanic       10.4 41.6 Male        1    0.224    0.00660 34.0
##  Pr(>|z|)     S 2.5 % 97.5 %
##    <0.001 783.4 0.431  0.486
##    <0.001 859.8 0.505  0.566
##    <0.001   Inf 0.639  0.686
##    <0.001 884.6 0.231  0.259
##    <0.001 588.2 0.286  0.328
##    <0.001   Inf 0.412  0.448
##    <0.001 763.4 0.104  0.118
##    <0.001 367.8 0.133  0.158
##    <0.001 839.8 0.212  0.237
## 
## Type: response

Exercise 5: For an employed male at mean age and income, how does predicted uninsurance probability differ between a White respondent with 16 years of education vs. a Hispanic respondent with 8 years?

Your answer here: We see here that white males at mean age who are employed and have 16 years of education are only 11% likely to be uninsured. Conversely, hispanic men at the same age and employment but with only 8 years of education are 66% likely to be uninsuredd.

avg_predictions(logit_acs, by = "race_eth")

## 
##  race_eth Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
##  White       0.111    0.00280 39.6   <0.001   Inf 0.105  0.116
##  Black       0.174    0.00625 27.8   <0.001 561.4 0.161  0.186
##  Hispanic    0.287    0.00585 49.0   <0.001   Inf 0.275  0.298
##  Other       0.114    0.00474 24.1   <0.001 422.4 0.105  0.123
## 
## Type: response

Exercise 6: Compare the average predicted probability of being uninsured across racial/ethnic groups. How large is the Hispanic-White gap?

Your answer here: We see here that hispanics have an 18.6% higher odds of being uninsured than white individuals. For whites, we see that they are only 11% likely to be uninsured. Blacks are 17.4% likely to be uninsured. Hispanics are 28.7% likely to be uninsured. Others are 11.4% likely to be uninsured.

Section D: Average Marginal Effects

avg_slopes(logit_acs)

## 
##        Term         Contrast Estimate Std. Error      z Pr(>|z|)     S    2.5 %
##  age        dY/dX            -0.00160   0.000114 -14.06   <0.001 146.7 -0.00182
##  educ_yrs   dY/dX            -0.03019   0.000852 -35.44   <0.001 911.7 -0.03186
##  employed   1 - 0             0.01776   0.005011   3.54   <0.001  11.3  0.00794
##  income_log dY/dX            -0.03084   0.001405 -21.94   <0.001 352.0 -0.03359
##  race_eth   Black - White     0.03544   0.006416   5.52   <0.001  24.8  0.02286
##  race_eth   Hispanic - White  0.11291   0.005415  20.85   <0.001 318.3  0.10229
##  race_eth   Other - White     0.00872   0.005218   1.67   0.0948   3.4 -0.00151
##  sex        Female - Male    -0.03885   0.002858 -13.60   <0.001 137.4 -0.04445
##    97.5 %
##  -0.00138
##  -0.02853
##   0.02758
##  -0.02808
##   0.04801
##   0.12352
##   0.01895
##  -0.03325
## 
## Type: response

plot_slopes(logit_acs, variables = "income_log", condition = c("age", "sex")) +
  labs(title = "Marginal Effect of Log Income on P(Uninsured) by Age and Sex",
       x = "Age", y = "Marginal Effect of Log Income")

Exercise 7: What is the AME of income_log? Interpret it in plain language. Does income’s protective effect against being uninsured vary by age or sex?

Your answer here: The AME of income_log is -.03. I think this means that a one unit increase in income reduces the likelihood of being uninsured by 3%, but I’m not entirely confident in this interpretation. We see here that income has a relatively constant effect on insurnace by age, though women are always much more likely to be uninsured than men. (I’m not confident at all in this answer tbh)

Exercise 8 (Challenge): Re-run the logit as an unweighted glm() (no weights, no survey design) and compare the race/ethnicity coefficients to svyglm(). Do the point estimates change? Do the standard errors change?

Your answer here: It doesn’t look like the unweighted model changed much, though the standard errors are slightly larger in the weighted one. I could use some help interpretting this comparison.

logit_acs_unwtd <- glm(
  uninsured ~ educ_yrs + income_log + age + sex + race_eth + employed,
  data = acs_design$variables,
  family = binomial
)

modelsummary(
  list(
    "Unweighted Logit" = logit_acs_unwtd,
    "Survey-Weighted Logit" = logit_acs
  ),
  statistic = "({std.error})",
  stars = TRUE,
  gof_omit = "IC|Log|Adj|Pseudo"
)

	Unweighted Logit	Survey-Weighted Logit
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
(Intercept)	4.551***	4.672***
	(0.076)	(0.137)
educ_yrs	-0.237***	-0.239***
	(0.004)	(0.007)
income_log	-0.241***	-0.244***
	(0.007)	(0.011)
age	-0.011***	-0.013***
	(0.001)	(0.001)
sexFemale	-0.298***	-0.309***
	(0.016)	(0.022)
race_ethBlack	0.275***	0.309***
	(0.027)	(0.053)
race_ethHispanic	0.776***	0.842***
	(0.017)	(0.038)
race_ethOther	0.010	0.082+
	(0.030)	(0.048)
employed	0.075***	0.144***
	(0.022)	(0.042)
Num.Obs.	142845	142845
F	1478.118	411.809

Weeks 8 & 9: Binary Outcomes, Predictions, and Marginal Effects

Matthew Reyes

Spring 2025

Part 1: Binary Outcomes

Data Loading and Management

Missingness Analysis

Descriptive Statistics

Section 1: Linear Probability Model (LPM)

The Out-of-Bounds Problem

Section 2: Logistic Regression

Interpreting Log Odds

Checking That Predictions Stay Bounded

Section 3: Odds Ratios

Section 4: Model Evaluation

Log-Likelihood

McFadden’s Pseudo-R²

Part 2: Predictions and Marginal Effects

Data Preparation

Descriptive Statistics

Section 5: Logistic Regression — Full Analysis

Part A: Model Fitting and Comparison

Part B: Model Fit Statistics

Part C: Predicted Probabilities

Average Predicted Probability by Group

Part D: Average Marginal Effects (AME)

Marginal Effects at the Mean (MEM)

Part E: Conditional Marginal Effects

Part 3: Binary Outcomes with American Community Survey Data

Data Loading

Data Management

Descriptive Statistics

Section A: LPM vs. Logit

Section B: Odds Ratios and Model Fit

Section C: Predicted Probabilities

Section D: Average Marginal Effects