EPI 271 Propensity Score Lab 3: Use of propensity scores (Adjustment, Stratification, Matching, and Weighting)

References

Software for implementing matching methods and propensity scores: http://www.biostat.jhsph.edu/~estuart/propensityscoresoftware.html
Graphical representation of balance using PSAgraphics: http://www.jstatsoft.org/v29/i06/paper
MatchIt: Nonparametric Preprocessing for Parametric Causal Inference: http://r.iq.harvard.edu/docs/matchit/2.4-20/matchit.pdf
Propensity Score Matching: Nearest Neighbor (Greedy) Matching with & without Subclassification: http://www.youtube.com/watch?v=LmtR8B9pRWc

6. Do the methods differ?

Most of the methods resulted in ORs between 1.5 and 2.0.

The matching method (unconditional) and the SMR weighting method are answering the same question, contrasting the exposed subjects to similar subjects who were not exposed (average treatment effect in the treated, ATT).

The inverse probability of treatment weighting (IPTW) method answers a different question: it contrasts the whole cohort on TPA with the whole cohort off TPA (average treatment effect, ATE). This corresponds to a counterfactual situation in which even those with contraindications were treated with TPA, which is why the OR of death is higher.
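
The two weighting schemes contrasted above can be sketched as follows. This is a minimal illustration on simulated data, not the lab's own weighting code: the variables `x`, `z`, `w.iptw`, and `w.smr` are hypothetical names, and the weight formulas are the standard ones for ATE (IPTW) and ATT (SMR) weights.

```r
## Toy data (hypothetical): one confounder x and a binary treatment z
set.seed(1)
n <- 2000
x <- rnorm(n)
z <- rbinom(n, size = 1, prob = plogis(-2 + x))

## Propensity score from a logistic model
ps <- fitted(glm(z ~ x, family = binomial))

## IPTW: both groups are reweighted to resemble the whole cohort (ATE)
w.iptw <- ifelse(z == 1, 1 / ps, 1 / (1 - ps))

## SMR weights: the treated keep weight 1; the untreated are
## reweighted to resemble the treated (ATT)
w.smr <- ifelse(z == 1, 1, ps / (1 - ps))

## Sum of weights within each treatment group
tapply(w.iptw, z, sum)
tapply(w.smr,  z, sum)
```

Under a well-specified PS model, the IPTW sums in each group should be roughly the full cohort size, while the SMR sums should be roughly the number treated. Untreated subjects with a PS near 1, or treated subjects with a PS near 0, receive very large IPTW weights, which is one reason the IPTW interval can be wide.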

-- Unadjusted result
Unadjusted logistic     OR 3.17 (2.33, 4.31)

-- Conventional logistic regression adjusting for covariates as they are
Conventional logistic   OR 1.91 (1.27, 2.88)

-- Propensity score as a covariate in logistic regression
Linear PS adjusted      OR 1.71 (1.11, 2.63)
Curvilinear PS adjusted OR 1.65 (1.09, 2.51)

-- PS quintile stratification and crude OR within each stratum
Stratum 1               OR 0.00 (0.00, 888.11)  # No exposed case
Stratum 2               OR 0.00 (0.00, 143.14)  # No exposed case
Stratum 3               OR 7.95 (0.75,  49.44)
Stratum 4               OR 1.94 (0.21,   8.72)
Stratum 5               OR 1.72 (1.14,   2.56)
M-H combined            OR 1.80 (1.25,   2.60)  # Not appropriate here because of effect modification

-- PS Matching and logistic regression
MatchIt   1:1           OR 1.78 (1.01, 3.19)    # Unconditional logistic regression
nonrandom 1:1           OR 1.69 (0.96, 2.97)    # Unconditional logistic regression
nonrandom 1:1           OR 1.72 (0.96, 3.08)    # Conditional logistic regression
nonrandom 1:1           OR 1.68 (0.96, 2.97)    # GEE with exchangeable correlation

MatchIt   1:4           OR 1.80 (1.17, 2.72)    # Unconditional logistic regression
nonrandom 1:4           OR 1.80 (1.18, 2.74)    # Unconditional logistic regression
nonrandom 1:4           OR 1.83 (1.19, 2.82)    # Conditional logistic regression
nonrandom 1:4           OR 1.80 (1.18, 2.74)    # GEE with exchangeable correlation

-- PS Weighting and logistic regression with robust variance estimator
IPTW                    OR 4.61 (1.22, 17.45)   # Everybody on TPA vs Nobody on TPA (ATE)
SMRW                    OR 1.54 (0.96,  2.48)   # Treated on TPA vs Treated not on TPA (ATT)

Create dataset

## Load data
library(sas7bdat)
dat <- read.sas7bdat("./epi271data.sas7bdat")
names(dat) <- tolower(names(dat))

## Create new variables
dat <- within(dat, {
    ## age ≥ 70
    age70 <- as.numeric(age >= 70)

    ## sumbartel ≥ 90
    sumbart90 <- as.numeric(sumbartel >= 90)

    ## rankpre 4,5,6 to 5
    rankpre[rankpre %in% c(4,5,6)] <- 5

    ## Categorical time
    timeintcat <- cut(timeint, breaks = c(-Inf,1,3,5,Inf), labels = 1:4,
                      right = FALSE, include.lowest = T)

    ## Quartiles of residents
    residentq <- cut(residents, breaks = quantile(residents), labels = 1:4, include.lowest = T)
})

## Convert categoricals to factors
categoricals <- c("age5","age70","afib","aphasia","living","gender","rankpre","time","residentq","referral","paresis","sumbart90","timeintcat","transport","vigilanz","ward")

dat[categoricals] <- lapply(dat[categoricals],
                            function(var) {
                                var <- factor(var)
                            })

Calculate the propensity score for t-PA treatment for the study population

logit.ps <- glm(tpa ~ age5 + afib + aphasia + cardiac + gender + htn + hyperchol + icu + living + rankpre + residentq + referral + paresis + prevstroke + sumbart90 + transport + timeintcat + vigilanz + ward + timeintcat:year + age70:year, data = dat, family = binomial)

## Extract propensity scores (available only for subjects with complete covariate data)
pscores <- fitted(logit.ps)

## Put them back into the dataset using case numbers
dat$pscore <- NA

## Put the PS into the dataset for those who have PS
dat$pscore[as.numeric(names(pscores))] <- pscores
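
An alternative to indexing by case number is to let `glm()` pad the dropped rows itself. The sketch below uses a small hypothetical data frame `toy` (not the lab data) and the standard `na.action = na.exclude` option, then checks that it agrees with the name-based approach used above.

```r
## Toy data (hypothetical) with one missing covariate value
toy <- data.frame(z = c(0, 1, 0, 1, 1, 0, 0, 1),
                  x = c(0.2, 1.5, NA, 0.9, -0.3, 0.1, 1.1, 0.4))

## With na.exclude, fitted() returns NA for rows dropped from the fit
fit <- glm(z ~ x, data = toy, family = binomial, na.action = na.exclude)
toy$pscore <- fitted(fit)

## Same result via name-based indexing (relies on default row names 1..n)
fit2 <- glm(z ~ x, data = toy, family = binomial)
ps2  <- rep(NA_real_, nrow(toy))
ps2[as.numeric(names(fitted(fit2)))] <- fitted(fit2)

all.equal(unname(toy$pscore), ps2)
```

The name-based approach depends on the data frame keeping its default row names, so the `na.exclude` version may be the safer choice if rows have been reordered or subset.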

Crude & conventional analyses

## Fit a crude model
logit.crude <- glm(death ~ tpa,
                   data = dat,
                   family = binomial(link = "logit"))

library(epicalc)
logistic.display(logit.crude)$table[1,, drop = F]

             OR(95%CI)           P(Wald's test) P(LR-test)
tpa: 1 vs 0  "3.17 (2.33,4.31) " "< 0.001"      "< 0.001"


## Fit a traditional logistic regression model
logit.traditional <- glm(death ~ tpa + age5 + afib + aphasia + cardiac + gender + htn + hyperchol + icu + living + rankpre + residentq + referral + paresis + prevstroke + sumbart90 + transport + timeintcat + vigilanz + ward + timeintcat:year + age70:year,
                         data = dat,
                         family = binomial(link = "logit"))

logistic.display(logit.traditional)$table[1,, drop = F]

       OR lower95ci upper95ci Pr(>|Z|)
tpa 1.914      1.27     2.885 0.001924
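
If epicalc is not available, odds ratios with Wald-type confidence intervals can be obtained in base R. This is a sketch on simulated data (the data frame `toy` and the object `or.table` are hypothetical); it is not claimed to reproduce every column of `logistic.display()`.

```r
## Toy data (hypothetical)
set.seed(2)
toy <- data.frame(tpa = rbinom(500, size = 1, prob = 0.3))
toy$death <- rbinom(500, size = 1, prob = plogis(-2 + 0.8 * toy$tpa))

fit <- glm(death ~ tpa, data = toy, family = binomial)

## Exponentiate coefficients and Wald 95% limits to get ORs
or.table <- exp(cbind(OR = coef(fit), confint.default(fit)))
or.table["tpa", , drop = FALSE]
```

`confint.default()` gives Wald intervals based on the estimated standard errors; profile-likelihood intervals would differ slightly.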

1. Easiest with regard to programming: regression adjustment with the propensity score.

a. Replace the confounder set in your outcome model with the propensity score (e.g., linear, quintiles, or deciles).

2. Regression model: run an outcome model of the association between TPA and death, controlling for confounding by including the propensity score.

Linear PS: OR 1.71 (1.11, 2.63)
Curvilinear PS: OR 1.65 (1.09, 2.51)

## Linear relationship assumed
logit.linear.ps.adjusted <- glm(death ~ tpa + pscore,
                         data = dat,
                         family = binomial(link = "logit"))
logistic.display(logit.linear.ps.adjusted)


Logistic regression predicting death 

                    crude OR(95%CI)    adj. OR(95%CI)     P(Wald's test) P(LR-test)
tpa: 1 vs 0         2.76 (1.97,3.87)   1.71 (1.11,2.63)   0.016          0.019     

pscore (cont. var.) 9.61 (4.89,18.88)  5.24 (2.21,12.41)  < 0.001        < 0.001   

Log-likelihood = -2023.0324
No. of observations = 9146
AIC value = 4052.0647


## Curvilinear relationship assumed
logit.curvilinear.ps.adjusted <- glm(death ~ tpa + pscore + I(pscore^2),
                                data = dat,
                                family = binomial(link = "logit"))
logistic.display(logit.curvilinear.ps.adjusted)


                    OR   lower95ci  upper95ci    Pr(>|Z|)
tpa           1.653413  1.09112590    2.50546 0.017729754
pscore      147.758293 18.05759275 1209.04893 0.000003193
I(pscore^2)   0.001006  0.00001467    0.06899 0.001376494

3. Stratified analysis: categorize the propensity score into quintiles and perform a stratified analysis.

## Create quintiles
quintiles <- quantile(dat$pscore, probs = seq(from = 0, to = 1, by = 0.2), na.rm = TRUE)
dat$pscoreq <- cut(dat$pscore, breaks = quintiles, labels = 1:5, include.lowest = T)

## Perform logistic regression within each stratum
library(plyr)
logistic.stratified <- dlply(.data = dat, .variables = "pscoreq",
                             .fun = function(DF) {
                                 glm(death ~ tpa, data = DF, family = binomial)
                             })

a. What is the association between t-PA and death in each stratum?

## Get OR in each stratum
res.strata.logit <- lapply(logistic.stratified[1:5], function(X){
    logistic.display(X)$table[1,]
})
print(do.call(rbind, res.strata.logit), quote = F)

  OR(95%CI)          P(Wald's test) P(LR-test)
1 0 (0,Inf)          0.984          0.771     
2 0 (0,Inf)          0.986          0.701     
3 7.97 (1.53,41.67)  0.014          0.039     
4 1.94 (0.43,8.71)   0.385          0.422     
5 1.72 (1.17,2.53)   0.006          0.008


## In the first two strata, no exposed cases are present, thus OR of 0
xtabs.stratified <- dlply(.data = dat, .variables = "pscoreq",
                          .fun = function(DF) {
                              xtabs(~ tpa + death, data = DF)[2:1, 2:1]
                          })
## Show 2x2 table in each stratum
xtabs.stratified[1:5]

$`1`
   death
tpa  1    0
  1  0    1
  0 76 1754

$`2`
   death
tpa  1    0
  1  0    2
  0 66 1761

$`3`
   death
tpa  1    0
  1  2    5
  0 87 1734

$`4`
   death
tpa   1    0
  1   2   13
  0 133 1681

$`5`
   death
tpa   1    0
  1  38  236
  0 133 1422

b. What is the overall effect estimate?

The MH-OR is probably not useful in this situation as the effects are heterogeneous across PS strata.

## Transform list to array
array.stratified <- laply(xtabs.stratified[1:5], invisible)
## Array permutation to change dimension order
array.stratified <- aperm(array.stratified, c(2,3,1))
## Rename dimensions
dimnames(array.stratified) <- list(tpa = c("1", "0"), death = c("1", "0"), stratum = c("1", "2", "3", "4", "5"))

## OR both stratum-specific and Mantel-Haenszel
epicalc::mhor(mhtable = array.stratified)


Stratified analysis by  stratum 
               OR lower lim. upper lim. P value
stratum 1    0.00      0.000     888.11 1.00000
stratum 2    0.00      0.000     143.14 1.00000
stratum 3    7.95      0.747      49.44 0.04196
stratum 4    1.94      0.211       8.72 0.30501
stratum 5    1.72      1.137       2.56 0.00919
M-H combined 1.80      1.252       2.60 0.00129

M-H Chi2(1) = 10.36 , P value = 0.001 

 One or more cells of the stratified table == 0. 
 Homogeneity test not computable. 

 Graph not drawn
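
The Mantel-Haenszel combined OR above can be reproduced by hand from the five 2x2 tables printed earlier. This sketch uses only those cell counts; the object names are illustrative, and `mantelhaen.test()` from the stats package is used as a cross-check in place of `epicalc::mhor()`.

```r
## Cell counts from the five PS strata shown above.
## Within each stratum, values fill column by column:
## (tpa=1,death=1), (tpa=0,death=1), (tpa=1,death=0), (tpa=0,death=0)
counts <- c(0,   76,   1, 1754,
            0,   66,   2, 1761,
            2,   87,   5, 1734,
            2,  133,  13, 1681,
            38, 133, 236, 1422)
tabs <- array(counts, dim = c(2, 2, 5),
              dimnames = list(tpa = c("1", "0"), death = c("1", "0"),
                              stratum = as.character(1:5)))

## Mantel-Haenszel OR: sum(a * d / n) / sum(b * c / n) over strata
n.stratum <- apply(tabs, 3, sum)
mh.or <- sum(tabs[1, 1, ] * tabs[2, 2, ] / n.stratum) /
         sum(tabs[1, 2, ] * tabs[2, 1, ] / n.stratum)
round(mh.or, 2)

## Cross-check with the stats package
mantelhaen.test(tabs)$estimate
```

Strata 1 and 2 contribute nothing to the numerator because they contain no exposed deaths, so the combined estimate is driven almost entirely by stratum 5.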

Graphical representation of balance within strata using PSAgraphics

http://www.jstatsoft.org/v29/i06/paper

## Complete cases only for this package
dat.ps.complete <- dat[complete.cases(dat),]
nrow(dat.ps.complete)

[1] 7410


## Very few exposed with low PS
xtabs(~ pscoreq + tpa, data = dat.ps.complete)

       tpa
pscoreq    0    1
      1 1491    0
      2 1487    1
      3 1456    7
      4 1471   14
      5 1255  228


## Visualization using PSAgraphics
library(PSAgraphics)

## Age compared (Compare balance graphically of a continuous covariate as part of a PSA)
box.psa(continuous = dat.ps.complete$age,
        treatment  = dat.ps.complete$tpa,
        strata     = dat.ps.complete$pscoreq)


## Atrial fibrillation compared (Compare balance graphically of a categorical covariate as part of a PSA)
cat.psa(categorical = dat.ps.complete$afib,
        treatment   = dat.ps.complete$tpa,
        strata      = dat.ps.complete$pscoreq)


$`treatment:stratum.proportions`
    0:1 1:1   0:2 1:2   0:3 1:3   0:4   1:4  0:5   1:5
0 0.881 NaN 0.841   1 0.792   1 0.704 0.857 0.68 0.627
1 0.119 NaN 0.159   0 0.208   0 0.296 0.143 0.32 0.373

## Generates a Propensity Score Assessment Plot (does not work)
## circ.psa(response  = (dat.ps.complete$death == 1),
##          treatment = dat.ps.complete$tpa,
##          strata    = dat.ps.complete$pscoreq)

4. Match on the propensity score. There are various ways to match:

Matching using MatchIt package

Reference: the matchit() help page, excerpted below.

The "full" and "optimal" methods are carried out through the optmatch package.

  method: This argument specifies a matching method. Currently,
          ‘"exact"’ (exact matching), ‘"full"’ (full matching),
          ‘"genetic"’ (genetic matching), ‘"nearest"’ (nearest neighbor
          matching), ‘"optimal"’ (optimal matching), and ‘"subclass"’
          (subclassification) are available. The default is
          ‘"nearest"’. Note that within each of these matching methods,
          _MatchIt_ offers a variety of options.

## Full syntax
matchit(formula, data = NULL, discard = 0, exact = FALSE,
        replace = FALSE, ratio = 1, model = "logit", reestimate = FALSE,
        nearest = TRUE, m.order = 2, caliper = 0, calclosest = FALSE,
        mahvars = NULL, subclass = 0, sub.by = "treat", counter = TRUE,
        full = FALSE, full.options = list(), ...)

Propensity score matching using MatchIt

Most of these methods (such as logistic or probit regression) define the distance by first estimating the propensity score, defined as the probability of receiving treatment, conditional on the covariates. (Source: MatchIt documentation, section 4.1.0.2.2, "Additional Arguments for Specification of Distance Measures", http://r.iq.harvard.edu/docs/matchit/2.4-20/matchit.pdf)
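
To make the idea of greedy nearest-neighbor matching on the propensity score concrete, here is a minimal base R sketch. The helper `greedy.match()` is hypothetical and is not MatchIt's implementation; processing treated subjects from the highest PS down, matching 1:1 without replacement, and using no caliper are all assumptions of this sketch.

```r
## Hypothetical helper: greedy 1:1 nearest-neighbor matching without replacement
greedy.match <- function(ps, treat) {
    treated  <- which(treat == 1)
    controls <- which(treat == 0)
    ## Process treated subjects from highest to lowest PS
    treated  <- treated[order(ps[treated], decreasing = TRUE)]
    matched  <- integer(0)
    for (i in treated) {
        if (length(controls) == 0) break
        ## Closest remaining control on the PS
        j <- controls[which.min(abs(ps[controls] - ps[i]))]
        matched  <- c(matched, j)
        ## A control can be used only once
        controls <- setdiff(controls, j)
    }
    data.frame(treated = treated[seq_along(matched)], control = matched)
}

## Toy data (hypothetical): 2 treated and 5 untreated subjects
ps    <- c(0.10, 0.80, 0.75, 0.30, 0.60, 0.20, 0.55)
treat <- c(0,    1,    0,    0,    1,    0,    0)
res   <- greedy.match(ps, treat)
res
```

Here subject 2 (PS 0.80) is paired with subject 3 (PS 0.75), and subject 5 (PS 0.60) with subject 7 (PS 0.55). Because each match is final once made, a greedy algorithm does not necessarily minimize the total distance across all pairs, which is what optimal matching aims for.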

## Load MatchIt by Dr. Gary King
library(MatchIt)

## 
##  MatchIt (Version 2.4-20, built: 2011-10-24)
##  Please refer to http://gking.harvard.edu/matchit for full documentation 
##  or help.matchit() for help with commands supported by MatchIt.
##


## matchit() takes formula for propensity score, not propensity score itself
out.matchit <- matchit(## Give formula for propensity score model
                       formula  = tpa ~ age5 + afib + aphasia + cardiac + gender + htn + hyperchol + icu + living + rankpre + residentq + referral + paresis + prevstroke + sumbart90 + transport + timeintcat + vigilanz + ward + timeintcat:year + age70:year,
                       data     = dat.ps.complete, # Cannot have missing values. Use complete dataset
                       method   = "nearest",       # nearest is the same as greedy match
                       distance = "logit",         # Distance defined by usual propensity score from logistic model
                       ratio    = 1                # 1:1 match is the default
                       )

## 1 tx:4 control matching
out.matchit4 <- matchit(## Give formula for propensity score model
                       formula  = tpa ~ age5 + afib + aphasia + cardiac + gender + htn + hyperchol + icu + living + rankpre + residentq + referral + paresis + prevstroke + sumbart90 + transport + timeintcat + vigilanz + ward + timeintcat:year + age70:year,
                       data     = dat.ps.complete, # Cannot have missing values. Use complete dataset
                       method   = "nearest",       # nearest is the same as greedy match
                       distance = "logit",         # Distance defined by usual propensity score from logistic model
                       ratio    = 4                # 1:4 match
                       )

## 1:1 match. There are only 250 treated, thus, 250 controls were chosen
out.matchit


Call: 
matchit(formula = tpa ~ age5 + afib + aphasia + cardiac + gender + 
    htn + hyperchol + icu + living + rankpre + residentq + referral + 
    paresis + prevstroke + sumbart90 + transport + timeintcat + 
    vigilanz + ward + timeintcat:year + age70:year, data = dat.ps.complete, 
    method = "nearest", distance = "logit", ratio = 1)

Sample sizes:
          Control Treated
All          7160     250
Matched       250     250
Unmatched    6910       0
Discarded       0       0

## out.matchit$match.matrix     # Stratum indicator is in $match.matrix
## Output Matched Data Sets for further analysis
dat.matchit <- match.data(out.matchit)

## 1:4 match. There are only 250 treated, thus, 1000 controls were chosen
out.matchit4


Call: 
matchit(formula = tpa ~ age5 + afib + aphasia + cardiac + gender + 
    htn + hyperchol + icu + living + rankpre + residentq + referral + 
    paresis + prevstroke + sumbart90 + transport + timeintcat + 
    vigilanz + ward + timeintcat:year + age70:year, data = dat.ps.complete, 
    method = "nearest", distance = "logit", ratio = 4)

Sample sizes:
          Control Treated
All          7160     250
Matched      1000     250
Unmatched    6160       0
Discarded       0       0

## out.matchit4$match.matrix     # Stratum indicator is in $match.matrix
## Output Matched Data Sets for further analysis
dat.matchit4 <- match.data(out.matchit4)

Examine balance

## See balance of all covariates (not run)
## plot(out.matchit)

## Propensity score jittered dot plots 1:1
plot(out.matchit, type = "jitter", interactive = F)
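
Besides plots, balance is often summarized numerically with the standardized mean difference (SMD) of each covariate between treatment groups. The sketch below is a minimal illustration: the helper `smd()` is hypothetical, and the pooled-variance form used here is one common definition rather than the only one.

```r
## Hypothetical helper: standardized mean difference for one numeric covariate
smd <- function(x, treat) {
    m1 <- mean(x[treat == 1])
    m0 <- mean(x[treat == 0])
    v1 <- var(x[treat == 1])
    v0 <- var(x[treat == 0])
    ## Difference in means divided by the pooled standard deviation
    (m1 - m0) / sqrt((v1 + v0) / 2)
}

## Toy data (hypothetical)
x     <- c(1, 2, 3, 4, 2, 3, 4, 5)
treat <- c(0, 0, 0, 0, 1, 1, 1, 1)
smd(x, treat)
```

Applied to the matched data, the same helper could be called on a covariate such as age in dat.matchit, with tpa as the treatment indicator. An absolute SMD below about 0.1 is a commonly used rule of thumb for adequate balance.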