Power with Propensity Weights

Author

Eric Hedberg

Introduction

The purpose of this short document is to use simulation to provide a method to determine plausible design effects from the use of inverse propensity weights. For example, suppose a power analysis is required for an evaluation of a program which seeks to improve several binary outcomes with varying baseline rates.

  • outcome 1 = .1
  • outcome 2 = .25
  • outcome 3 = .5

Propensity weights design effect

These design effects are for propensity weighted approach. Propensity weights effectively match treatment and comparison units based on a predicted propensity (see, e.g., Austin 2011) for the estimation of the average treatment effect (ATE). However, propensity weights also impact the significance tests through weighting design effects, which are related to the variation in the weight variable. A rough estimate of this design effect is given in Kish (1992)

\[ DEFF = 1 + CV_w^2 \]

where \(CV_w\) is the ratio of the standard deviation of the weights to the mean weight. The factor \(DEFF\) can be used to estimate the effective sample size for power analysis. For example, if the actual sample size is 100, but the expected \(DEFF\) is 2, the sample size for power analysis should be \(\frac{100}{2} = 50\). Alternatively, if a power analysis reports a required sample of 100 and the \(DEFF\) is 2, the actual number of required cases is 200.

In order to anticipate the \(DEFF\) from propensity weights, we must consider the the correlation of the observable factors used in the treatment propensity model and how this impacts the weights. Formally, let’s assume that the propensity for treatment is some probability which ranges from 0 to 1, but can be transformed into log-odds, \(T_i = \mbox{ln}\left(\frac{p_i}{1-p_p}\right)\).

These quantiles are associated with a set of factors which can be summarized by some set of standard normal values, \(X\).

This function then creates the variables for \(X\) and \(T\), where both are approximately standard normal and correlated with a value of \(\rho\). We also define a log-odds function.

log_odds <- function(p) {
  return(
    log(p/(1-p))
    )
}
create_T <- function(xrho, n, p = .5) {
  x <- rnorm(n)
  t <- xrho*x + rnorm(n, mean = log_odds(p), sd = sqrt(1-xrho^2))
  return(data.frame(cbind(x = x, t = t)))
}

For example, we create an object with 1,000 observations for \(T\) and \(X\), called xt

xt <- create_T(xrho = .5,  n = 1000)
summary(xt)
       x                   t            
 Min.   :-3.159044   Min.   :-3.058016  
 1st Qu.:-0.688354   1st Qu.:-0.634629  
 Median :-0.009003   Median : 0.055023  
 Mean   : 0.009850   Mean   : 0.007813  
 3rd Qu.: 0.685672   3rd Qu.: 0.633364  
 Max.   : 2.753219   Max.   : 2.408692  
cat(paste("sd of t is", sd(xt$t),"\n"))
sd of t is 0.955875153440084 
cat(paste("cor between x and t is", cor(xt$x, xt$t),"\n"))
cor between x and t is 0.497211775994681 

Next, we create another function to take \(T\) and create a treatment variable, based on a binomial coin flip with probability equal to the probability associated with log-odds \(T\). The program also takes a parameter for the general probability of treatment, \(p\), which is converted into a logit to re-center the value of \(T\).

Say we want a treatment variable with a mean of around .5 (after defining an inverse logit function),

inv_logit <- function(t) {
  return(
    1/(1+exp(-1*(t)))
  )
}
mean(
  rbinom(
    nrow(xt),
    1,
    inv_logit(xt$t)
  )
)
[1] 0.498

This program then runs a quick logit to predict treatment and computes the propensity weight, and then the expected design effect. It uses the create_T function defined above.

compute_deff <- function(xrho, p = .5, n) {
  xt <- create_T(xrho = xrho,n = n, p = p)
  treat <- rbinom(
    n,
    1,
    inv_logit(xt$t)
  )
  ehat <- glm(treat ~ xt$x, family = binomial("logit"))$fitted.values
  w <- treat/ehat + (1-treat)/(1-ehat)
  return(1 + sd(w)^2/mean(w)^2)
}

So, if we assume that our observed factors are correlated with the propensity for treatment by about .5, and half of the cases are assigned treatment, the design effect may be something like

compute_deff(xrho = .5, p = .5, n =1000)
[1] 1.048454

So, below we run a simulation with various values of this correlation to get a sense about what we expect this to design effect to be, by getting the mean of the simulations for correlations from 0, which means the weights are very similar and thus no design effect to .9, which means the design effect is larger because the weights differ more.

mean_sim <- function(xrho, n, p, reps = 10000) {
  return(
    mean(
      replicate(
        reps, 
        compute_deff(xrho = xrho, p = p, n = n)
      )
    )
  )
}
deff <- unlist(
  lapply(
    seq(from = 0, to = .9, by = .1),
    mean_sim,
    p = .5,
    n = 1000,
  )
)
cbind(rho = seq(from = 0, to = .9, by = .1), deff = deff)
      rho     deff
 [1,] 0.0 1.002010
 [2,] 0.1 1.003769
 [3,] 0.2 1.009111
 [4,] 0.3 1.018611
 [5,] 0.4 1.032439
 [6,] 0.5 1.051774
 [7,] 0.6 1.079025
 [8,] 0.7 1.114816
 [9,] 0.8 1.163595
[10,] 0.9 1.232852

Where we see that the design effect increases with the quality of the propensity model. A worst case would be a \(DEFF\) of about 1.3, with a marginal model producing a \(DEFF\) of about 1.10.

Below are values for \(DEFF\) for designs in which 10, 25, 75, or 90 percent of \(N=1000\) cases are typically assigned treatment. As the portion of cases assigned treatment become more extreme, the design effect increases.

for (p in c(.1, .25, .5, .75, .9)) {
  deff <- unlist(
    lapply(
      seq(from = 0, to = .9, by = .1),
      mean_sim,
      p = p,
      n =1000,
    )
  ) 
  cat(paste0("p = ", p,"\n"))
  print(cbind(rho = seq(from = 0, to = .9, by = .1), deff = deff))
  cat("\n")
}
p = 0.1
      rho     deff
 [1,] 0.0 2.184610
 [2,] 0.1 2.194091
 [3,] 0.2 2.229179
 [4,] 0.3 2.284834
 [5,] 0.4 2.375992
 [6,] 0.5 2.501019
 [7,] 0.6 2.674646
 [8,] 0.7 2.902447
 [9,] 0.8 3.214270
[10,] 0.9 3.642611

p = 0.25
      rho     deff
 [1,] 0.0 1.229047
 [2,] 0.1 1.232630
 [3,] 0.2 1.244051
 [4,] 0.3 1.262332
 [5,] 0.4 1.289472
 [6,] 0.5 1.327022
 [7,] 0.6 1.379900
 [8,] 0.7 1.449139
 [9,] 0.8 1.548725
[10,] 0.9 1.687918

p = 0.5
      rho     deff
 [1,] 0.0 1.002007
 [2,] 0.1 1.003772
 [3,] 0.2 1.009180
 [4,] 0.3 1.018469
 [5,] 0.4 1.032355
 [6,] 0.5 1.051821
 [7,] 0.6 1.078307
 [8,] 0.7 1.114653
 [9,] 0.8 1.164398
[10,] 0.9 1.233071

p = 0.75
      rho     deff
 [1,] 0.0 1.229666
 [2,] 0.1 1.232785
 [3,] 0.2 1.242895
 [4,] 0.3 1.261740
 [5,] 0.4 1.289401
 [6,] 0.5 1.326830
 [7,] 0.6 1.379779
 [8,] 0.7 1.450299
 [9,] 0.8 1.548309
[10,] 0.9 1.687641

p = 0.9
      rho     deff
 [1,] 0.0 2.183357
 [2,] 0.1 2.193195
 [3,] 0.2 2.229672
 [4,] 0.3 2.291048
 [5,] 0.4 2.381147
 [6,] 0.5 2.503858
 [7,] 0.6 2.668637
 [8,] 0.7 2.901404
 [9,] 0.8 3.210860
[10,] 0.9 3.663451

These results are generally invariant with regards to sample size. For example, here are the results for \(p = .25\) for \(N = 500\)

deff <- unlist(
  lapply(
    seq(from = 0, to = .9, by = .1),
    mean_sim,
    p = .25,
    n =500,
  )
)
cbind(rho = seq(from = 0, to = .9, by = .1), deff = deff)
      rho     deff
 [1,] 0.0 1.233912
 [2,] 0.1 1.237913
 [3,] 0.2 1.249339
 [4,] 0.3 1.266866
 [5,] 0.4 1.295482
 [6,] 0.5 1.334918
 [7,] 0.6 1.388497
 [8,] 0.7 1.458235
 [9,] 0.8 1.558223
[10,] 0.9 1.700188

The Power Analysis

Now we are ready for the power analysis. Below, I estimate the required sample to detect proportion differences of .05, .1, and .2 from baselines of .5, .75, and .1 (for the outcomes above). To apply the impact of weighting, we then multiply the samples by 1.1 and 1.3 to find the required samples (assuming a 50/50 split in cases).

for (baseline in c(.1,.25,.5)) {
  for (d in c(.05,.1,.2)) {
    n <- power.prop.test(
      p1 = baseline, 
      p2 = baseline+d, 
      power = .8,
      alternative = "two.sided")$n
    
    cat(
      paste0(
        "baseline rate = ", baseline, 
        "; delta = ", d, "\n",
        "DEFF = 1.1 = ", ceiling(n*2*1.1), "\n",
        "DEFF = 1.3 = ", ceiling(n*2*1.3), "\n"
      )
    )
  }
}
baseline rate = 0.1; delta = 0.05
DEFF = 1.1 = 1509
DEFF = 1.3 = 1783
baseline rate = 0.1; delta = 0.1
DEFF = 1.1 = 438
DEFF = 1.3 = 518
baseline rate = 0.1; delta = 0.2
DEFF = 1.1 = 136
DEFF = 1.3 = 161
baseline rate = 0.25; delta = 0.05
DEFF = 1.1 = 2752
DEFF = 1.3 = 3252
baseline rate = 0.25; delta = 0.1
DEFF = 1.1 = 723
DEFF = 1.3 = 855
baseline rate = 0.25; delta = 0.2
DEFF = 1.1 = 194
DEFF = 1.3 = 230
baseline rate = 0.5; delta = 0.05
DEFF = 1.1 = 3443
DEFF = 1.3 = 4069
baseline rate = 0.5; delta = 0.1
DEFF = 1.1 = 853
DEFF = 1.3 = 1008
baseline rate = 0.5; delta = 0.2
DEFF = 1.1 = 205
DEFF = 1.3 = 242

Acknowledgement

This work was supported by the Department of Education, (grant number R305D200045).