Homework: Propensity score and doubly robust methods

PHTH 6800

Author

Your name here

For this homework we will use the cps_mixtape and nsw_mixtape datasets from the causaldata package. The nsw_mixtape dataset contains treated and control individuals from the National Supported Work (NSW) Demonstration, a randomized job training program for young, disadvantaged workers that started in 1975, while cps_mixtape is an observational comparison group drawn from the Current Population Survey (CPS). These datasets have been used extensively to demonstrate causal methods, starting with Lalonde (1986). The treatment (treat) is participation in the NSW program and the outcome is real earnings in 1978 (re78, in dollars). Available covariates are: age, educ (years of education), black, hisp, marr (married), nodegree (no high school degree), re74 (real earnings 1974), and re75 (real earnings 1975).

We will assume conditional exchangeability given age, education, race/ethnicity, marital status, high school degree status, and prior earnings.

library(tidyverse)
# add any other packages you need here
set.seed(6789) # for reproducibility

lalonde <- bind_rows(causaldata::cps_mixtape, causaldata::nsw_mixtape)

Question 1

We want to estimate the effect of the NSW job training program on 1978 earnings using propensity score matching.

  a. Before fitting any model, explore the raw data to assess positivity. Describe what you find. Do you have any concerns? Would you classify any violations you see as structural or random? (You don’t need to look into or list every combination of covariates — just describe any potentially problematic patterns.)
# check how many treated (1) and control (0) individuals are in the dataset
count(lalonde, treat)
# A tibble: 2 × 2
  treat     n
  <dbl> <int>
1     0 16252
2     1   185

The CPS controls (n = 16,252) vastly outnumber the treated group (n = 185). This makes sense: the CPS is a general population sample, while NSW participants are young, disadvantaged workers, so the covariate distributions should look very different. For some subgroups (for example, high earners in the CPS, who would never have been eligible for the program), I would consider these structural violations rather than random sampling noise.
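A quick comparison of covariate means by treatment group makes the imbalance concrete (a minimal sketch using only variables already in the dataset):

# compare covariate means by treatment group to spot regions of non-overlap
lalonde |>
  group_by(treat) |>
  summarise(across(c(age, educ, black, hisp, marr, nodegree, re74, re75), mean))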

  b. Fit a logistic regression model for treat using all available covariates as linear terms. Visualize the overlap of the estimated propensity scores by treatment group (it might help to plot the treated and control distributions separately to see them better, instead of layered as we have done in class). Does the distribution confirm what you found in part (a)?
library(MatchIt)
library(WeightIt)
library(cobalt)
library(marginaleffects)

ps_mod <- glm(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
              family = binomial(), data = lalonde)

lalonde <- lalonde |>
  mutate(ps = predict(ps_mod, type = "response"))

# Plot treated and control separately so they don't obscure each other
ggplot(lalonde, aes(x = ps)) +
  geom_histogram(bins = 50) +
  facet_wrap(~treat, labeller = labeller(treat = c("0" = "Control", "1" = "Treated")),
             scales = "free_y") +
  labs(x = "Estimated propensity score", y = "Count") +
  theme_minimal()

Yes, the distributions confirm what we found in part (a). The two groups are heavily skewed in opposite directions: nearly all CPS controls have propensity scores near zero, while many treated individuals have very high scores, leaving little overlap.

  c. Is it valid to estimate the ATE in this dataset without any change to the study population? What about the ATT? Describe conceptually how these two estimands differ with respect to the observed data, this causal question, and the interpretation of results.

ATE: the effect if everyone in the combined population were treated versus untreated, E[Y(1) − Y(0)].

  • I don’t think it is valid here without changing the study population. CPS and NSW participants are too different demographically and economically, so positivity is violated on the control side: most CPS individuals have essentially no chance of being treated.

ATT: the effect among the treated, E[Y(1) − Y(0) | treat = 1].

  • For the ATT we only need untreated individuals who are comparable to NSW participants. Some CPS participants do fit that description, which makes the ATT the more realistic estimand given the data at hand.

  d. Use matchit() with distance = "glm", discard = "both", and ratio = 4 to perform nearest-neighbor 4:1 matching. How many treated and control individuals were discarded? Based on who was discarded, what estimand does the resulting matched sample target?
matched <- matchit(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
                   data = lalonde,
                   distance = "glm",
                   discard = "both",
                   ratio = 4)
matched
A `matchit` object
 - method: 4:1 nearest neighbor matching without replacement
 - distance: Propensity score [common support]

             - estimated with logistic regression
 - common support: units from both groups dropped
 - number of obs.: 16437 (original), 925 (matched)
 - target estimand: ATT
 - covariates: age, educ, black, hisp, marr, nodegree, re74, re75

This confirms the positivity problem: the vast majority of CPS controls fall outside the region of common support and are never used as matches. Because no treated participants were dropped (the 925 matched observations are the 185 treated plus 740 controls), the matched sample still targets the ATT among all treated individuals.
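The printed object doesn’t show the discard counts directly; the summary method reports a sample-size table, which should include a “Discarded” row since we set discard = "both":

# sample sizes (all, matched, unmatched, discarded) by treatment group
summary(matched)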

  e. Create a love plot of the matched sample. Comment briefly on balance.
love.plot(matched, stars = "std", thresholds = 0.1)

The matched sample looks well balanced on the covariates: the adjusted standardized mean differences all fall within the ±0.1 threshold.

  f. Extract the matched dataset and use avg_comparisons() with cluster-robust standard errors to estimate your causal estimand. Report and interpret the estimate.
matched_data <- match.data(matched)

avg_comparisons(
  lm(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
     data = matched_data, weights = weights),
  variables = "treat",
  vcov = ~subclass
)

 Estimate Std. Error   z Pr(>|z|)   S 2.5 % 97.5 %
     1185        625 1.9   0.0578 4.1 -39.2   2409

Term: treat
Type: response
Comparison: 1 - 0

Job training is estimated to increase 1978 earnings by about $1,185 among the treated (ATT), but the 95% confidence interval crosses zero (−$39 to $2,409; p ≈ 0.058), so we can’t confidently rule out that there was no effect.

Question 2

We now try a weighting approach on the same data.

  a. Fit an IPT weighting model for the ATE using all available covariates. Use include.obj = TRUE so the underlying propensity score model is stored. Print the summary() and a love plot. What do the maximum weights tell you?
w_out1 <- weightit(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
                   data = lalonde,
                   method = "glm",
                   estimand = "ATE",
                   include.obj = TRUE)

summary(w_out1)
                  Summary of weights

- Weight ranges:

          Min                                   Max
treated 3.277 |---------------------------| 853.003
control 1.    ||                              1.443

- Units with the 5 most extreme weights by group:
                                                
             185     179     137     124      10
 treated 284.928 379.295 542.757 628.738 853.003
            9204    9371   16125    5555    3947
 control   1.433   1.433   1.433   1.434   1.443

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       2.712 1.415   1.492       0
control       0.051 0.018   0.001       0

- Effective Sample Sizes:

            Control Treated
Unweighted 16252.    185.  
Weighted   16210.01   22.24
love.plot(w_out1, stars = "std", thresholds = 0.1)

The maximum weight for a treated participant is 853, meaning one treated person stands in for roughly 853 people in the weighted pseudo-population; correspondingly, the effective treated sample size collapses from 185 to about 22. Not a great model. The love plot shows that balance is still poor after weighting, with many covariates outside the threshold. We might have to restrict the sample.
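To see who is driving this, we can inspect the covariate profiles of the most heavily weighted treated units (a sketch; weightit stores the weight vector in w_out1$weights):

# treated units carrying the largest ATE weights
lalonde |>
  mutate(w = w_out1$weights) |>
  filter(treat == 1) |>
  slice_max(w, n = 5) |>
  select(age, educ, black, hisp, marr, nodegree, re74, re75, w)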

  b. Examine the propensity score model coefficients using summary(w_out1$obj). Do any covariates have extremely large or small coefficients that might point to a data problem?
summary(w_out1$obj)

Call:
NULL

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.85095    0.25529 -30.753  < 2e-16 ***
age         -0.08438    0.06916  -1.220  0.22246    
educ        -0.03174    0.08592  -0.369  0.71184    
black        3.73378    0.16769  22.265  < 2e-16 ***
hisp         1.70479    0.25535   6.676 2.53e-11 ***
marr        -0.73686    0.14820  -4.972 6.69e-07 ***
nodegree     0.46088    0.16334   2.822  0.00478 ** 
re74        -0.25480    0.15889  -1.604  0.10882    
re75        -1.61274    0.20323  -7.935 2.23e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.4271614)

    Null deviance: 2028.1  on 16436  degrees of freedom
Residual deviance: 1152.6  on 16428  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 10
  • The coefficient on black stands out: it is by far the largest in magnitude, making race the strongest predictor of NSW participation. Not a data problem in itself, but it suggests race is a big driver of the extreme weights.
  • re75 has the largest negative coefficient, which confirms that higher earnings in 1975 strongly predict not being in the NSW sample.
  • The main issues seem to be driven by race and prior earnings.
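Exponentiating the coefficients puts them on the odds-ratio scale, which makes the magnitudes easier to read (a quick check; w_out1$obj holds the underlying glm fit because we set include.obj = TRUE):

# implied odds ratios for NSW participation
exp(coef(w_out1$obj))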
  c. Rather than attempting to reweight a fundamentally incomparable population, apply new eligibility criteria to restrict the dataset to individuals more plausibly comparable to the NSW participants. Decide on criteria that make substantive sense (this is somewhat subjective so everyone will have different answers; explain your reasoning). Refit the propensity score model in the restricted dataset, print the new summary and love plot, and assess whether the weight distribution and balance improved.
lalonde_restricted <- lalonde |>
  filter(
    age <= 55,       
    re74 < 10000,        
    re75 < 10000
  )

w_out2 <- weightit(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
                   data = lalonde_restricted,
                   method = "glm",
                   estimand = "ATE",
                   include.obj = TRUE)

summary(w_out2)
                  Summary of weights

- Weight ranges:

          Min                                    Max
treated 3.235 |---------------------------| 1050.621
control 1.    ||                               1.454

- Units with the 5 most extreme weights by group:
                                                 
             170     136     124     116       10
 treated 224.793 230.111 311.566 733.294 1050.621
            5086    5073    4776    4253     2896
 control   1.432   1.435   1.439   1.439    1.454

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       3.289 1.415   1.628       0
control       0.081 0.046   0.003       0

- Effective Sample Sizes:

           Control Treated
Unweighted 5250.    171.  
Weighted   5215.49   14.55
love.plot(w_out2, stars = "std", thresholds = 0.1)

I chose to restrict age to 55 or under and to cap real earnings in both 1974 and 1975 at $10,000.
  • Since the program is aimed at disadvantaged workers, I doubt that people over 55 would be part of the program, as they are closer to retirement.
  • Since the program was intended for low-income individuals, capping prior earnings at $10,000 makes sense; anyone earning above that would likely not be part of the program.

The weight distribution did not improve; it got worse. The positivity problem still exists, with even fewer treated people doing the work in the weighted estimator, as we can see from the increase in the maximum treated weight (853 to 1,051) and the drop in effective treated sample size (22.2 to 14.6). I think this is because race drives the high weights, so even after restriction the two groups are not comparable with respect to race.
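A quick check of the race composition in the restricted sample supports this:

# proportion Black and Hispanic by treatment group in the restricted data
lalonde_restricted |>
  group_by(treat) |>
  summarise(prop_black = mean(black), prop_hisp = mean(hisp), n = n())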

  d. Even after applying the eligibility criteria, there may still be some extreme weights. Trim the weights at the 99.5th percentile. Compare the weight summaries and love plots before and after trimming. What are you trading off by trimming?
w_out2_trimmed <- trim(w_out2, at = 0.995)

summary(w_out2)
                  Summary of weights

- Weight ranges:

          Min                                    Max
treated 3.235 |---------------------------| 1050.621
control 1.    ||                               1.454

- Units with the 5 most extreme weights by group:
                                                 
             170     136     124     116       10
 treated 224.793 230.111 311.566 733.294 1050.621
            5086    5073    4776    4253     2896
 control   1.432   1.435   1.439   1.439    1.454

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       3.289 1.415   1.628       0
control       0.081 0.046   0.003       0

- Effective Sample Sizes:

           Control Treated
Unweighted 5250.    171.  
Weighted   5215.49   14.55
summary(w_out2_trimmed)
                  Summary of weights

- Weight ranges:

          Min                                  Max
treated 3.235    |------------------------| 19.386
control 1.    ||                             1.454

- Units with the 5 most extreme weights by group:
                                           
             10     22     23     28     42
 treated 19.386 19.386 19.386 19.386 19.386
           5086   5073   4776   4253   2896
 control  1.432  1.435  1.439  1.439  1.454

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       0.753 0.595   0.234       0
control       0.081 0.046   0.003       0

- Effective Sample Sizes:

           Control Treated
Unweighted 5250.    171.  
Weighted   5215.49  109.41
love.plot(w_out2, stars = "std", thresholds = 0.1)

love.plot(w_out2_trimmed, stars = "std", thresholds = 0.1)

Trimming made a clear difference: the maximum treated weight falls from 1,051 to 19.4, and the weights are no longer dominated by only a few people (the effective treated sample size rises from 14.6 to 109.4). Trimming did not fix the comparability problem, however; it just stabilized the estimator. The tradeoff is accepting a small amount of bias in exchange for a stable estimate: capping the weights reduces the variance, but the trimmed units no longer receive the representation the propensity model says they should, so the estimand drifts away from the ATE in the full restricted population.
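The variance side of that tradeoff shows up directly in the effective sample size, which we can compute by hand from the stored weights using Kish's formula (a minimal sketch):

# effective sample size: (sum of weights)^2 / sum of squared weights
ess <- function(w) sum(w)^2 / sum(w^2)
tr  <- lalonde_restricted$treat == 1
c(before = ess(w_out2$weights[tr]), after = ess(w_out2_trimmed$weights[tr]))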

Question 3

Using your dataset with your eligibility criteria applied, we will now compare several estimation approaches for the ATE.

  a. Estimate the ATE using outcome regression standardization via standardize_glm(), adjusting for the same covariates. Report the means under each treatment level and the ATE with 95% CIs.
library(stdReg2)

fit_std <- standardize_glm(
  re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
  data = lalonde_restricted,
  family = "gaussian",
  values = list(treat = c(0, 1)),
  contrasts = "difference",
  reference = 0
)

fit_std
Outcome formula: re78 ~ treat + age + educ + black + hisp + marr + nodegree + 
    re74 + re75
Outcome family: gaussian 
Outcome link function: identity 
Exposure:  treat 

Tables: 
  treat Estimate Std.Error lower.0.95 upper.0.95
1     0     6564      97.3       6374       6755
2     1     8358     625.4       7132       9583

Reference level:  treat = 0 
Contrast:  difference 
  treat Estimate Std.Error lower.0.95 upper.0.95
1     0        0         0          0          0
2     1     1793       629        561       3026

The standardized mean is $8,358 under treatment versus $6,564 under control, giving an ATE of $1,793 (95% CI: $561 to $3,026). The confidence interval excludes zero, so this is a statistically significant positive effect.

  b. Estimate the ATE using IPT weighting (untrimmed weights) via glm_weightit() + avg_comparisons(). Report the means under each treatment level and the ATE with 95% CIs.
# 3b - IPTW untrimmed
fit_w2 <- glm_weightit(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
                       data = lalonde_restricted,
                       weightit = w_out2)

avg_comparisons(fit_w2, variables = "treat")

 Estimate Std. Error     z Pr(>|z|)   S 2.5 % 97.5 %
      752        788 0.953     0.34 1.6  -794   2297

Term: treat
Type: probs
Comparison: 1 - 0
avg_predictions(fit_w2, variables = "treat")

 treat Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
     0     6566       90.5 72.59   <0.001  Inf  6389   6743
     1     7317      783.1  9.34   <0.001 66.6  5783   8852

Type: probs
  c. Estimate the ATE using IPT weighting (trimmed weights from Question 2d) via glm_weightit() + avg_comparisons(). Report the means under each treatment level and the ATE with 95% CIs.
fit_w2_trimmed <- glm_weightit(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
                                data = lalonde_restricted,
                                weightit = w_out2_trimmed)

avg_comparisons(fit_w2_trimmed, variables = "treat")

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1501       5554 0.27    0.787 0.3 -9385  12386

Term: treat
Type: probs
Comparison: 1 - 0
avg_predictions(fit_w2_trimmed, variables = "treat")

 treat Estimate Std. Error     z Pr(>|z|)   S 2.5 % 97.5 %
     0     6566         88 74.59   <0.001 Inf  6393   6738
     1     8067       5554  1.45    0.146 2.8 -2819  18952

Type: probs
  d. Estimate the ATE using the doubly robust IPTW-GLM approach via standardize_glm_dr(), specifying both an outcome model and a propensity score model. Report the means under each treatment level and the ATE with 95% CIs.
fit_dr <- standardize_glm_dr(
  formula_outcome = re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
  formula_exposure = treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
  data = lalonde_restricted,
  values = list(treat = c(0, 1)),
  contrasts = "difference",
  reference = 0
)
fit_dr
Doubly robust estimator with: 

Exposure formula: treat ~ age + educ + black + hisp + marr + nodegree + re74 + 
    re75
Exposure link function: logit 
Outcome formula: re78 ~ treat + age + educ + black + hisp + marr + nodegree + 
    re74 + re75
Outcome family: gaussian 
Outcome link function: identity 
Exposure:  treat 

Tables: 
  treat Estimate Std.Error lower.0.95 upper.0.95
1     0     6566      97.3       6375       6757
2     1     7317    1629.7       4123      10512

Reference level:  treat = 0 
Contrast:  difference 
  treat Estimate Std.Error lower.0.95 upper.0.95
1     0        0         0          0          0
2     1      752      1632      -2447       3950
  e. Estimate the ATE using TMLE with a Super Learner library of your choice. At minimum include "SL.glm", "SL.step", and "SL.glm.interaction", but add any other models of your choice (you can see what is included with SuperLearner::listWrappers() and even create your own with SuperLearner::create.Learner()). Report the means under each treatment level and the ATE with 95% CIs.
library(tmle)
library(SuperLearner)

tmle_out <- tmle(
  Y = lalonde_restricted$re78,
  A = lalonde_restricted$treat,
  W = lalonde_restricted |> select(age, educ, black, hisp, marr, nodegree, re74, re75),
  family = "gaussian",
  Q.SL.library = c("SL.glm", "SL.step", "SL.glm.interaction"),
  g.SL.library = c("SL.glm", "SL.step", "SL.glm.interaction")
)
summary(tmle_out)
 Initial estimation of Q
     Procedure: cv-SuperLearner, ensemble
     Model:
         Y ~  SL.glm_All + SL.step_All + SL.glm.interaction_All

     Coefficients: 
          SL.glm_All    0.2166229 
         SL.step_All    0 
     SL.glm.interaction_All    0.7833771 

     Cross-validated R squared :  0.1946 

 Estimation of g (treatment mechanism)
     Procedure: SuperLearner, ensemble
     Model:
         A ~  SL.glm_All + SL.step_All + SL.glm.interaction_All 

     Coefficients: 
          SL.glm_All    0 
         SL.step_All    0.4653137 
     SL.glm.interaction_All    0.5346863 

 Estimation of g.Z (intermediate variable assignment mechanism)
     Procedure: No intermediate variable 

 Estimation of g.Delta (missingness mechanism)
     Procedure: No missingness, ensemble

 Bounds on g: (0.0079, 1) 

 Bounds on g for ATT/ATC: (0.0079, 0.9921) 

 Marginal Mean under Treatment (EY1)
   Parameter Estimate:  7890.2
   Estimated Variance:  242440
              p-value:  <2e-16
    95% Conf Interval:  (6925.1, 8855.3)

 Marginal Mean under Comparator (EY0)
   Parameter Estimate:  6562.1
   Estimated Variance:  9456.3
              p-value:  <2e-16
    95% Conf Interval:  (6371.5, 6752.7)

 Additive Effect
   Parameter Estimate:  1328.1
   Estimated Variance:  250700
              p-value:  0.0079922
    95% Conf Interval:  (346.7, 2309.4)

 Additive Effect among the Treated
   Parameter Estimate:  2058.1
   Estimated Variance:  403880
              p-value:  0.0012018
    95% Conf Interval:  (812.5, 3303.7)

 Additive Effect among the Controls
   Parameter Estimate:  1899.6
   Estimated Variance:  269930
              p-value:  0.00025587
    95% Conf Interval:  (881.33, 2917.9)
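The prompt also mentions adding custom learners via SuperLearner::create.Learner(). A hedged sketch of how that could look (assuming the ranger package is installed; the tuning value is arbitrary):

# create a random-forest learner and add it to both libraries
rf_lrn <- SuperLearner::create.Learner("SL.ranger", params = list(num.trees = 500))
# then pass c("SL.glm", "SL.step", "SL.glm.interaction", rf_lrn$names)
# as Q.SL.library and g.SL.library in the tmle() call above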
  f. Do the estimates broadly agree? What can and cannot be concluded from their agreement or disagreement?

The estimates broadly agree in direction: they all point to a positive effect, but they vary in magnitude (roughly $750 to $1,800). Outcome regression and TMLE give the most stable results, with confidence intervals that exclude zero, which suggests a positive effect of NSW on earnings. The IPTW estimates have much wider confidence intervals that cross zero, mainly because of the large weights produced by the positivity violations. Agreement across methods is reassuring but cannot establish that the estimates are unbiased, since all of the methods rely on the same conditional exchangeability assumption; disagreement mostly tells us the results are sensitive to how the positivity problem is handled.
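Collecting the ATE point estimates and CIs reported above side by side (numbers copied from the Question 3 output):

# side-by-side comparison of the Question 3 ATE estimates
tribble(
  ~method,                    ~estimate, ~lower, ~upper,
  "Outcome regression",            1793,    561,   3026,
  "IPTW (untrimmed)",               752,   -794,   2297,
  "IPTW (trimmed)",                1501,  -9385,  12386,
  "Doubly robust (IPTW-GLM)",       752,  -2447,   3950,
  "TMLE",                          1328,    347,   2309
)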

Question 4

  a. We have used two strategies in this homework to address the positivity problem, one in the matching portion and another in the weighting. What is the key difference between these two strategies, and how do they affect the estimand and its interpretation?
  • In matching we dropped individuals outside the region of common support, which changes who is in the analysis and shifts the estimand (here, to the ATT within the region of common support). The population restriction is explicit: we can say exactly who was excluded and why.
  • In weighting we applied eligibility criteria to restrict the dataset and then trimmed the weights. This stabilizes the estimator by capping the influence of some observations, but it restricts the population indirectly: after trimming, it is much harder to characterize exactly which population the estimand applies to.
  b. Explain in your own words what “doubly robust” means in this context. What is a limitation of double robustness?
  • I think “doubly robust” means you get two tries to get the estimator right: if either the outcome model or the propensity score model is correctly specified, the estimate will be consistent (see the sketch below).
  • A limitation is that double robustness doesn’t help when both models are incorrect, and, as we saw, it doesn’t protect against positivity violations.
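To make the “two tries” concrete, here is a minimal hand-rolled AIPW sketch (not the stdReg2 implementation; the names q_mod and ps_mod2 are my own):

# outcome model (try #1) and propensity score model (try #2)
q_mod   <- lm(re78 ~ treat + age + educ + black + hisp + marr + nodegree + re74 + re75,
              data = lalonde_restricted)
ps_mod2 <- glm(treat ~ age + educ + black + hisp + marr + nodegree + re74 + re75,
               family = binomial(), data = lalonde_restricted)

d  <- lalonde_restricted
e  <- predict(ps_mod2, type = "response")
q1 <- predict(q_mod, newdata = mutate(d, treat = 1))
q0 <- predict(q_mod, newdata = mutate(d, treat = 0))

# augmented IPW means: the weighting term corrects the outcome model's
# residuals, and the outcome model stabilizes the weights -- the estimate
# is consistent if either piece is right
mu1 <- mean(d$treat * (d$re78 - q1) / e + q1)
mu0 <- mean((1 - d$treat) * (d$re78 - q0) / (1 - e) + q0)
mu1 - mu0  # doubly robust ATE estimate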