library(tidyverse)
# add any other packages you need here
set.seed(6789) # for reproducibility
lalonde <- bind_rows(causaldata::cps_mixtape, causaldata::nsw_mixtape)Homework: Propensity score and doubly robust methods
PHTH 6800
For this homework we will use the cps_mixtape and nsw_mixtape datasets from the causaldata package. The nsw_mixtape dataset contains treated and control individuals from the National Supported Work (NSW) Demonstration, a randomized job training program for young, disadvantaged workers that started in 1975, while cps_mixtape is a observational comparison group drawn from the Current Population Survey (CPS). These datasets have been used extensively to demonstrate causal methods, starting with Lalonde (1986). The treatment (treat) is participation in the NSW program and the outcome is real earnings in 1978 (re78, in dollars). Available covariates are: age, educ (years of education), black, hisp, marr (married), nodegree (no high school degree), re74 (real earnings 1974), and re75 (real earnings 1975).
We will assume conditional exchangeability given age, education, race/ethnicity, marital status, high school degree status, and prior earnings.
Question 1
We want to estimate the effect of the NSW job training program on 1978 earnings using propensity score matching.
- Before fitting any model, explore the raw data to assess positivity. Describe what you find. Do you have any concerns? Would you classify any violations you see as structural or random? (You don’t need to look into or list every combiantion of covariates — just describe any potentially problematic patterns.)
#checking how many postive and negitive values we have in the dataset
count(lalonde, treat)# A tibble: 2 × 2
treat n
<dbl> <int>
1 0 16252
2 1 185
The CPS controls outnumber the treated group. This makes sense as CPS is a general population sample, while NSW participants are disadvantaged workers so the distributions will look different. I would think these are structural violations for some subgroups and not just sampling noise.
- Fit a logistic regression model for
treatusing all available covariates as linear terms. Visualize the overlap of the estimated propensity scores by treatment group (it might help to plot the treated and control distributions separately to see them better, instead of layered as we have done in class). Does the distribution confirm what you found in part (a)?
library(MatchIt)
library(WeightIt)
library(cobalt)
library(marginaleffects)
ps_mod <- glm(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
family = binomial(), data = lalonde)
lalonde <- lalonde |>
mutate(ps = predict(ps_mod, type = "response"))
# Plot treated and control separately so they don't obscure each other
ggplot(lalonde, aes(x = ps)) +
geom_histogram(bins = 50) +
facet_wrap(~treat, labeller = labeller(treat = c("0" = "Control", "1" = "Treated")),
scales = "free_y") +
labs(x = "Estimated propensity score", y = "Count") +
theme_minimal()Yes the distribution confirms what we found in part a, as seen in the graphs it can be seen that the distributions will be very skewed. We see treated individuals have a very high scores.
- Is it valid to estimate the ATE in this dataset without any change to the study population? What about the ATT? Describe conceptually how these two estimands differ with respect to the observed data and this causal question and interpreting results.
ATE - effect if everyone were treated vs untreated - Not sure if it would be valid here unless there were changes to the study population. - CPS and NSW participants are different demographically and economically. - The positivity assumption would be violated for the control side.
ATT - effect among the treated - Only need untreated individuals who are comparable to NSW participants - I think we can find some CPS participants that fit in which makes ATT a more realistic estimate given the data we have at hand.
- Use
matchit()withdistance = "glm",discard = "both", andratio = 4to perform nearest-neighbor 4:1 matching. How many treated and control individuals were discarded? Based on who was discarded, what estimand does the resulting matched sample target?
matched <- matchit(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde,
distance = "glm",
discard = "both",
ratio = 4)
matchedA `matchit` object
- method: 4:1 nearest neighbor matching without replacement
- distance: Propensity score [common support]
- estimated with logistic regression
- common support: units from both groups dropped
- number of obs.: 16437 (original), 925 (matched)
- target estimand: ATT
- covariates: age, educ, black, hisp, marr, nodegree, re74, re75
We can confirm the positivity problem, where no CPS participants are in common with NSW participants. The result is just the ATT among all the treated individuals since no treated participants were dropped.
- Create a love plot of the matched sample. Comment briefly on balance.
love.plot(matched, stars = "std", thresholds = 0.1)I think the matched sample is balanced well on the covariates - as the adjusted data points are within the threshold.
- Extract the matched dataset and use
avg_comparisons()with cluster-robust standard errors to estimate your causal estimand. Report and interpret the estimate.
matched_data <- match.data(matched)
avg_comparisons(
lm(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
data = matched_data, weights = weights),
variables = "treat",
vcov = ~subclass
)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1185 625 1.9 0.0578 4.1 -39.2 2409
Term: treat
Type: response
Comparison: 1 - 0
The effect of job training seems to yield a higher pay but the confidence interval is greater than z, which means that we cant confidently rule out that there was no effect.
Question 2
We now try an weighting approach on the same data.
- Fit an IPT weighting model for the ATE using all available covariates. Use
include.obj = TRUEso the underlying propensity score model is stored. Print thesummary()and a love plot. What do the maximum weights tell you?
w_out1 <- weightit(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde,
method = "glm",
estimand = "ATE",
include.obj = TRUE)
summary(w_out1) Summary of weights
- Weight ranges:
Min Max
treated 3.277 |---------------------------| 853.003
control 1. || 1.443
- Units with the 5 most extreme weights by group:
185 179 137 124 10
treated 284.928 379.295 542.757 628.738 853.003
9204 9371 16125 5555 3947
control 1.433 1.433 1.433 1.434 1.443
- Weight statistics:
Coef of Var MAD Entropy # Zeros
treated 2.712 1.415 1.492 0
control 0.051 0.018 0.001 0
- Effective Sample Sizes:
Control Treated
Unweighted 16252. 185.
Weighted 16210.01 22.24
love.plot(w_out1, stars = "std", thresholds = 0.1)The max weight for a treated participant is 853. Not a great model to use one participant to represent 853 untreated people. From the plot we can see the balance is not great after weighting, with many of the features outside of the threshold. Might have to restrict the sample?
- Examine the propensity score model coefficients using
summary(w_out1$obj). Do any covariates have extremely large or small coefficients that might point to a data problem?
summary(w_out1$obj)
Call:
NULL
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.85095 0.25529 -30.753 < 2e-16 ***
age -0.08438 0.06916 -1.220 0.22246
educ -0.03174 0.08592 -0.369 0.71184
black 3.73378 0.16769 22.265 < 2e-16 ***
hisp 1.70479 0.25535 6.676 2.53e-11 ***
marr -0.73686 0.14820 -4.972 6.69e-07 ***
nodegree 0.46088 0.16334 2.822 0.00478 **
re74 -0.25480 0.15889 -1.604 0.10882
re75 -1.61274 0.20323 -7.935 2.23e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.4271614)
Null deviance: 2028.1 on 16436 degrees of freedom
Residual deviance: 1152.6 on 16428 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 10
- The coefficent ‘Black’ stands out as it has the highest value, so it is the stongest predictor of NSW participation. Not a data problem, just seems like race is a big driver in the weights.
- ‘re75’ has the largest negitive coefficient, which confirms that higher earnings in 1975 predicts that the individual will not be in NSW.
- Main issues seem like they were driven by race and prior earnings.
- Rather than attempting to reweight a fundamentally incomparable population, apply new eligibility criteria to restrict the dataset to individuals more plausibly comparable to the NSW participants. Decide on criteria that make substantive sense (this is somewhat subjective so everyone will have different answers; explain your reasoning). Refit the propensity score model in the restricted dataset, print the new summary and love plot, and assess whether the weight distribution and balance improved.
lalonde_restricted <- lalonde |>
filter(
age <= 55,
re74 < 10000,
re75 < 10000
)
w_out2 <- weightit(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde_restricted,
method = "glm",
estimand = "ATE",
include.obj = TRUE)
summary(w_out2) Summary of weights
- Weight ranges:
Min Max
treated 3.235 |---------------------------| 1050.621
control 1. || 1.454
- Units with the 5 most extreme weights by group:
170 136 124 116 10
treated 224.793 230.111 311.566 733.294 1050.621
5086 5073 4776 4253 2896
control 1.432 1.435 1.439 1.439 1.454
- Weight statistics:
Coef of Var MAD Entropy # Zeros
treated 3.289 1.415 1.628 0
control 0.081 0.046 0.003 0
- Effective Sample Sizes:
Control Treated
Unweighted 5250. 171.
Weighted 5215.49 14.55
love.plot(w_out2, stars = "std", thresholds = 0.1)I choose to restrict age at 55 and real earning 74 and 75 at 10 000. - Since the program is aimed towards disadvantaged workers, I would doubt that people above 55 would be apart of the program as they are closer to retirement. - as the program was intended for low-income individuals, capping the earnings at 10k makes sense as as anyone earning above that would not be apart of the program.
The weight distribution did not improve but got worse :(. The positivity problem still exists as we can see that even fewer treated people are doing the work in the weighted estimator - which we can see by the increase in maximum treated weight. I think this is because race drives the high weights so even comparing these two groups, they are not comparable to race.
- Even after applying the eligibility criteria, there may still be some extreme weights. Trim the weights at the 99.5th percentile. Compare the weight summaries and love plots before and after trimming. What are you trading off by trimming?
w_out2_trimmed <- trim(w_out2, at = 0.995)
summary(w_out2) Summary of weights
- Weight ranges:
Min Max
treated 3.235 |---------------------------| 1050.621
control 1. || 1.454
- Units with the 5 most extreme weights by group:
170 136 124 116 10
treated 224.793 230.111 311.566 733.294 1050.621
5086 5073 4776 4253 2896
control 1.432 1.435 1.439 1.439 1.454
- Weight statistics:
Coef of Var MAD Entropy # Zeros
treated 3.289 1.415 1.628 0
control 0.081 0.046 0.003 0
- Effective Sample Sizes:
Control Treated
Unweighted 5250. 171.
Weighted 5215.49 14.55
summary(w_out2_trimmed) Summary of weights
- Weight ranges:
Min Max
treated 3.235 |------------------------| 19.386
control 1. || 1.454
- Units with the 5 most extreme weights by group:
10 22 23 28 42
treated 19.386 19.386 19.386 19.386 19.386
5086 5073 4776 4253 2896
control 1.432 1.435 1.439 1.439 1.454
- Weight statistics:
Coef of Var MAD Entropy # Zeros
treated 0.753 0.595 0.234 0
control 0.081 0.046 0.003 0
- Effective Sample Sizes:
Control Treated
Unweighted 5250. 171.
Weighted 5215.49 109.41
love.plot(w_out2, stars = "std", thresholds = 0.1)love.plot(w_out2_trimmed, stars = "std", thresholds = 0.1)Trimming made a difference and the weights are not dominated by only a few people.Trimming however did not fix the compariability problem rather just stabilized the estimator. The tradoff is giving a small amount of bias for a stable estimate, by capping the weights which reduced the variance but the trimmed people are not showing up properly.
Question 3
Using your dataset with your eligibility critera applied, we will now compare several estimation approaches for the ATE.
- Estimate the ATE using outcome regression standardization via
standardize_glm(), adjusting for the same covariates. Report the means under each treatment level and the ATE with 95% CIs.
library(stdReg2)
fit_std <- standardize_glm(
re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde_restricted,
family = "gaussian",
values = list(treat = c(0, 1)),
contrasts = "difference",
reference = 0
)
fit_stdOutcome formula: re78 ~ treat + age + educ + black + hisp + marr + nodegree +
re74 + re75
Outcome family: gaussian
Outcome link function: identity
Exposure: treat
Tables:
treat Estimate Std.Error lower.0.95 upper.0.95
1 0 6564 97.3 6374 6755
2 1 8358 625.4 7132 9583
Reference level: treat = 0
Contrast: difference
treat Estimate Std.Error lower.0.95 upper.0.95
1 0 0 0 0 0
2 1 1793 629 561 3026
The confidence interval excludes zero so this is a statistically significant positive effect.
- Estimate the ATE using IPT weighting (untrimmed weights) via
glm_weightit()+avg_comparisons(). Report the means under each treatment level and the ATE with 95% CIs.
# 3b - IPTW untrimmed
fit_w2 <- glm_weightit(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde_restricted,
weightit = w_out2)
avg_comparisons(fit_w2, variables = "treat")
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
752 788 0.953 0.34 1.6 -794 2297
Term: treat
Type: probs
Comparison: 1 - 0
avg_predictions(fit_w2, variables = "treat")
treat Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0 6566 90.5 72.59 <0.001 Inf 6389 6743
1 7317 783.1 9.34 <0.001 66.6 5783 8852
Type: probs
- Estimate the ATE using IPT weighting (trimmed weights from Question 2d) via
glm_weightit()+avg_comparisons(). Report the means under each treatment level and the ATE with 95% CIs.
fit_w2_trimmed <- glm_weightit(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde_restricted,
weightit = w_out2_trimmed)
avg_comparisons(fit_w2_trimmed, variables = "treat")
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1501 5554 0.27 0.787 0.3 -9385 12386
Term: treat
Type: probs
Comparison: 1 - 0
avg_predictions(fit_w2_trimmed, variables = "treat")
treat Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0 6566 88 74.59 <0.001 Inf 6393 6738
1 8067 5554 1.45 0.146 2.8 -2819 18952
Type: probs
- Estimate the ATE using the doubly robust IPTW-GLM approach via
standardize_glm_dr(), specifying both an outcome model and a propensity score model. Report the means under each treatment level and the ATE with 95% CIs.
fit_dr <- standardize_glm_dr(
formula_outcome = re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
formula_exposure = treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
data = lalonde_restricted,
values = list(treat = c(0, 1)),
contrasts = "difference",
reference = 0
)
fit_drDoubly robust estimator with:
Exposure formula: treat ~ age + educ + black + hisp + marr + nodegree + re74 +
re75
Exposure link function: logit
Outcome formula: re78 ~ treat + age + educ + black + hisp + marr + nodegree +
re74 + re75
Outcome family: gaussian
Outcome link function: identity
Exposure: treat
Tables:
treat Estimate Std.Error lower.0.95 upper.0.95
1 0 6566 97.3 6375 6757
2 1 7317 1629.7 4123 10512
Reference level: treat = 0
Contrast: difference
treat Estimate Std.Error lower.0.95 upper.0.95
1 0 0 0 0 0
2 1 752 1632 -2447 3950
- Estimate the ATE using TMLE with a Super Learner library of your choice. At minimum include
"SL.glm","SL.step", and"SL.glm.interaction", but add any other models of your choice (you can see what is included withSuperLearner::listWrappers()and even create your own withSuperLearner::create.Learner()). Report the means under each treatment level and the ATE with 95% CIs.
library(tmle)
library(SuperLearner)
tmle_out <- tmle(
Y = lalonde_restricted$re78,
A = lalonde_restricted$treat,
W = lalonde_restricted |> select(age, educ, black, hisp, marr, nodegree, re74, re75),
family = "gaussian",
Q.SL.library = c("SL.glm", "SL.step", "SL.glm.interaction"),
g.SL.library = c("SL.glm", "SL.step", "SL.glm.interaction")
)
summary(tmle_out) Initial estimation of Q
Procedure: cv-SuperLearner, ensemble
Model:
Y ~ SL.glm_All + SL.step_All + SL.glm.interaction_All
Coefficients:
SL.glm_All 0.2166229
SL.step_All 0
SL.glm.interaction_All 0.7833771
Cross-validated R squared : 0.1946
Estimation of g (treatment mechanism)
Procedure: SuperLearner, ensemble
Model:
A ~ SL.glm_All + SL.step_All + SL.glm.interaction_All
Coefficients:
SL.glm_All 0
SL.step_All 0.4653137
SL.glm.interaction_All 0.5346863
Estimation of g.Z (intermediate variable assignment mechanism)
Procedure: No intermediate variable
Estimation of g.Delta (missingness mechanism)
Procedure: No missingness, ensemble
Bounds on g: (0.0079, 1)
Bounds on g for ATT/ATC: (0.0079, 0.9921)
Marginal Mean under Treatment (EY1)
Parameter Estimate: 7890.2
Estimated Variance: 242440
p-value: <2e-16
95% Conf Interval: (6925.1, 8855.3)
Marginal Mean under Comparator (EY0)
Parameter Estimate: 6562.1
Estimated Variance: 9456.3
p-value: <2e-16
95% Conf Interval: (6371.5, 6752.7)
Additive Effect
Parameter Estimate: 1328.1
Estimated Variance: 250700
p-value: 0.0079922
95% Conf Interval: (346.7, 2309.4)
Additive Effect among the Treated
Parameter Estimate: 2058.1
Estimated Variance: 403880
p-value: 0.0012018
95% Conf Interval: (812.5, 3303.7)
Additive Effect among the Controls
Parameter Estimate: 1899.6
Estimated Variance: 269930
p-value: 0.00025587
95% Conf Interval: (881.33, 2917.9)
- Do the estimates broadly agree? What can and cannot be concluded from their agreement or disagreement?
I somewhat agree with the estimates, as they all point in a positive direction but vary in magnitude. Outcome regression and TMLE have the most stable results with confidence intervals that exclude 0s.I think this suggests a positive effect of NSW on earnings. The other IPTW estimates have a wide Confidence intervals and crosses 0 mainly because of the large weights from the violations from the positivity.
Question 4
- We have used two strategies in this homework to address the positivity problem, one in the matching portion and another in the weighting. What is the key difference between these two strategies, and how do they affect the estimand and its interpretation?
- In matching we are dropping individulas outside the region of common support, which changes who is in the analysis which pushes the estimated towards ATE. (population restricting is explicit)
- In weighting we applied the criteria to restrict the dataset and then trimmed the weights. Which was us trying to stabilize the estimator by removing the comparisons and adding a cap to some observations. It is harder to characterize the estimated after trimming. (restricts the population indirectly)
- Explain in your own words what “doubly robust” means in this context. What is a limitation of double robustness?
- I think doubly robust means that you get 2 tries to get the estimator right. If the outcome model or the propensity score model is specified properly the estimate will be consistent.
- A limitations that double robustness doesnt help when both models are incorrect. it also as we saw doesnt protect against positivity violations.