This example is based on the Right Heart Catheterization data set available at Vanderbilt University and described here. The key reference is Connors AF et al. 1996 The effectiveness of RHC in the initial care of critically ill patients. JAMA 276: 889-897. Connors et al. used a logistic regression model to develop a propensity score then: [a] matched RHC to non-RHC patients and [b] adjusted for propensity score in models for outcomes, followed by a sensitivity analysis. The key conclusions were that RHC patients had decreased survival time, and any unmeasured confounder would have to be somewhat strong to explain away the results.
The observations in the rhc data set describe 5,735 subjects who are critically ill, and most variables were obtained during day 1 of a hospitalization. We’ll begin by importing the rhc data into a tibble (a lazy and surly data frame - see R’s help file on tibbles for more details) in R that contains 5,735 observations (rows) on 63 variables (columns).
In the import process, we’re going to do a few things to simplify the process of getting a functional data set.
The automated column specification in R’s readr package uses the first 1000 rows of the data set - with the rhc data, this creates parsing failures for several measures which are assumed to be integers but actually should be parsed as doubles (continuous variables). So we specify the column types for those variables here then allow R to identify the type for the other columns.
By default, readr’s read_csv command treats all string variables as characters, and not as factors. For several multicategorical variables, we want a factor representation in a specific order, so we’ll specify that, too, also using the cols function.
This will throw a message indicating that the `` variable (which I assume is a set of row names) will have a new name, but we’ll ignore that since we won’t use the row names in our subsequent work.
After we import the data, we re-order the columns to present the SUPPORT patient identification code first, then the exposure (swang1), then information on the outcomes, then information on the covariates of interest for our work, dropping the other variables.
The rhc_cleaning data at this stage is a tibble (a lazy and surly data frame) containing 5735 observations (rows) on 57 variables (columns). The observations describe subjects who are critically ill, and most variables were obtained during day 1 of a hospitalization.
Code
rhc_cleaning
# A tibble: 5,735 × 57
ptid swang1 death sadmdte dschdte dthdte lstctdte age sex edu income
<chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
1 00005 No RHC No 11142 11151 NA 11382 70.3 Male 12 Under …
2 00007 RHC Yes 11799 11844 11844 11844 78.2 Female 12 Under …
3 00009 RHC No 12083 12143 NA 12400 46.1 Female 14.1 $25-$5…
4 00010 No RHC Yes 11146 11183 11183 11182 75.3 Female 9 $11-$2…
5 00011 RHC Yes 12035 12037 12037 12036 67.9 Male 9.95 Under …
6 00012 No RHC No 12389 12396 NA 12590 86.1 Female 8 Under …
7 00013 No RHC No 12381 12423 NA 12616 55.0 Male 14 $25-$5…
8 00014 No RHC Yes 11453 11487 11491 11490 43.6 Male 12 $25-$5…
9 00016 No RHC No 12426 12437 NA 12560 18.0 Female 12.8 Under …
10 00017 RHC No 11381 11400 NA 11590 48.4 Female 11.0 Under …
# … with 5,725 more rows, and 46 more variables: ninsclas <fct>, race <fct>,
# cat1 <fct>, dnr1 <fct>, wtkilo1 <dbl>, hrt1 <dbl>, meanbp1 <dbl>,
# resp1 <dbl>, temp1 <dbl>, card <fct>, gastr <fct>, hema <fct>, meta <fct>,
# neuro <fct>, ortho <fct>, renal <fct>, resp <fct>, seps <fct>,
# trauma <fct>, amihx <dbl>, ca <fct>, cardiohx <dbl>, chfhx <dbl>,
# chrpulhx <dbl>, dementhx <dbl>, gibledhx <dbl>, immunhx <dbl>,
# liverhx <dbl>, malighx <dbl>, psychhx <dbl>, renalhx <dbl>, …
1.5 Any missingness?
Code
rhc_cleaning %>%miss_var_summary()
# A tibble: 57 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 dthdte 2013 35.1
2 dschdte 1 0.0174
3 ptid 0 0
4 swang1 0 0
5 death 0 0
6 sadmdte 0 0
7 lstctdte 0 0
8 age 0 0
9 sex 0 0
10 edu 0 0
# … with 47 more rows
We could also graph this with gg_miss_var(rhc_cleaning). As it turns out, there were three variables in the raw rhc data that have missing values, but we’ve omitted one (about day 1 urine output) in creating this rhc_cleaning file. The remaining two are dates, and we’ll address those in the material on creating our outcomes.
2 The Treatment / Exposure We’ll Study
The treatment/exposure we are studying is the presence of a Swan-Ganz (right heart) catheter on day 1, captured in the variable swang1. The Medline Plus description of right heart catheterization is here. Basically, the procedure passes a thin tube into the right side of the heart and the arteries leading to the lungs, to measure the heart’s function and blood flow. It’s done in very ill patients for several reasons, and is also referred to as right heart catheterization and pulmonary artery catheterization.
Code
rhc_cleaning %>%tabyl(swang1)
swang1 n percent
RHC 2184 0.3808195
No RHC 3551 0.6191805
Of course, the most important issue to us is that each subject was not randomly allocated to either the “RHC” or “No RHC” group. Instead, this is an observational study, where each subject received their treatment (RHC or no) on the basis of what their providers thought would be best for them.
3 The 3 Outcomes We’ll Study
We will define three outcomes here: one binary, one quantitative, and one time-to-event.
Our first outcome is in-study mortality, captured as a binary variable.
We will then define hospital length of stay, a quantitative variable, and then
Time to death, a time-to-event variable that will be right-censored for those who did not die during the study.
3.1 Mortality, a binary outcome: death
One outcome is death during the study, as captured in the death variable. This is a binary outcome.
Code
rhc_cleaning %>%tabyl(death)
death n percent
No 2013 0.3510026
Yes 3722 0.6489974
Those people without a value for dthdte turn out to be the people with death values of “No” and this makes sense, because dthdte is a code for date of death.
3.2 Working with Dates to Define the Time-Related Outcomes
3.2.1 The Raw Data on Dates
Variable
Definition
Values
sadmdte
Study Admission Date
Integer ranging from 10754 to 12441, indicating the # of days after 1960-01-01, so study admission date range is actually 1989-06-11 to 1994-01-23.
dschdte
Hospital Discharge Date
Integer ranging from 10757 to 12560, indicating the # of days after 1960-01-01, so hospital discharge date range is actually 1989-06-14 to 1994-05-22. (Note: 1 missing value)
dthdte
Death Date
Integer ranging from 10757 to 12783, indicating the # of days after 1960-01-01, so death date range is actually 1989-06-14 to 1994-12-31. The 2013 people with missing values are those with No for death.
lstctdte
Date of Last Contact
Integer ranging from 10756 to 12644, indicating the # of days after 1960-01-01, so death date range is actually 1989-06-13 to 1994-08-14.
Next, we will develop the two other primary outcomes: survival time (which we’ll store in survdays), and hospital length of stay (which will go in hospdays).
It turns out that for three subjects, their dates of last contact appear to occur after their death dates, which is irritating, but doesn’t affect us much since the only way in which we’ll use the last contact date is for calculating survival time for patients without a death date.
3.2.2 Impute hospital discharge date for the one missing value
We could just impute a random value, but that’s not a great idea, since hospital discharge date is tied to other information that is available to us for this patient. The subject with a missing hospital discharge date turns out to be ptid 08382.
# A tibble: 1 × 6
ptid death sadmdte dschdte dthdte lstctdte
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 08382 Yes 12386 NA 12780 12582
So the patient was admitted to the study on day 12386, had their last contact with the study on day 12582 (i.e. 196 days after study entry), and died on day 12780 (i.e. 394 days after study entry.) It seems rational to impute a value of hospital discharge date that falls in between their date of study admission and their date of last contact.
Among the RHC patients, what is a reasonable guess as to the time from study admission to hospital discharge that we might apply to this patient? We could see what the distribution of these differences are in the available data, like so…
Code
mosaic::favstats(dschdte - sadmdte ~ swang1, data = rhc_cleaning)
swang1 min Q1 median Q3 max mean sd n missing
1 RHC 2 8 16 31 342 24.69125 27.80502 2183 1
2 No RHC 2 7 12 22 338 19.52915 23.58672 3551 0
I’ll arbitrarily choose to impute a discharge date for ptid 08382 so as to yield a 25 day hospitalization, so as to affect the mean stay length within the RHC group as little as possible.
This implies that we’ll impute a value of 12386 + 25 = 12411 for this patient.
3.3 Length of initial hospital stay, a quantitative outcome: hospdays
Our third outcome is length of initial hospital stay, which is captured by subtracting the study admission date (sadmdte) from the hospital discharge date (dschdte). This is a quantitative outcomes, and we will create it soon, and name it hospdays in our data set.
swang1 min Q1 median Q3 max mean sd n missing
1 RHC 2 8 16 31 342 24.69139 27.79865 2184 0
2 No RHC 2 7 12 22 338 19.52915 23.58672 3551 0
It looks like all of these values are reasonable (in that they are all positive) and we have no missing values.
3.4 Survival Time in Days, a time-to-event outcome: survdays
Another outcome is survival time in days (which is censored for all patients who did not die during the study), and which is captured by subtracting the study admission date (sadmdte) from the death date (dthdte) for the patients who died in study, and the study admission date (sadmdte) from the date of last contact (lstctdte) for the patients who did not die in study. This is a time-to-event outcome, and we will create it soon, and name it survdays in our data set.
For the subjects who died, this is straightforward, in that we take the death date and subtract the study admission date.
For those who survived through the study, we take the date of last contact and subtract the study admission date, in order to get their (right censored) survival times.
death min Q1 median Q3 max mean sd n missing
1 No 2 187 210 236 1351 231.18082 112.24989 2013 0
2 Yes 2 6 12 23 338 19.99893 24.59443 3722 0
Again, this looks reasonable. All of these values are strictly positive, for instance, and none seem wildly out of line, and we have no missing values. We also see that those who died had much shorter survival times (typically) than those who survived, which makes sense.
We might also look at the survdays variable broken down by RHC status.
Code
mosaic::favstats(survdays ~ swang1, data = rhc_cleaning)
swang1 min Q1 median Q3 max mean sd n missing
1 RHC 2 9 25.5 187 1351 91.75962 129.5763 2184 0
2 No RHC 2 9 25.0 191 1243 95.57871 117.7174 3551 0
These seem quite comparable, so far, and again, they make sense (all are positive, and there are no missing values.)
4 The 50 Covariates We’ll Study
Our propensity model will eventually include 50 covariates, described below.
4.1 Group 1: Socio-demographics (6 covariates)
Variable
Definition
age
Age in years
sex
Sex
edu
Years of Education
income
Income Category (4 levels)
ninsclas
Insurance Category (6 levels)
race
Self-Reported Race (3 levels)
Here’s a Table 1 for these covariates.
Code
vars01 <-c("age", "sex", "edu", "income", "ninsclas", "race")table1_01 <-CreateTableOne(vars = vars01, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_01, smd =TRUE)
Seven of these covariates show standardized mean differences larger than 0.10.
4.4 Group 4: Comorbid Illness and Transfer Status (13 covariates)
Variable
Definition
amihx
Definite Myocardial Infarction
ca
Cancer (3 levels)
cardiohx
Acute MI, Peripheral Vascular Disease, Severe Cardiovascular Symptoms (NYHA-Class III), Very Severe Cardiovascular Symptoms (NYHA- IV)
chfhx
Congestive Heart Failure
chrpulhx
Chronic Pulmonary Disease, Severe or Very Severe Pulmonary Disease
dementhx
Dementia, Stroke or Cerebral Infarct, Parkinson’s Disease
gibledhx
Upper GI Bleeding
immunhx
Immunosupperssion, Organ Transplant, HIV Positivity, Diabetes Mellitus Without End Organ Damage, Diabetes Mellitus With End Organ Damage, Connective Tissue Disease
Six of these covariates show standardized mean differences larger than 0.10.
So, in all, 31 of the 50 covariates show standardized mean differences that exceed 0.10 before any propensity score adjustment.
5 A Little Refactoring
For our outcome, and our treatment, it will be useful to have 1/0 versions of the variables, and it will also be helpful to have the factor versions releveled to put the outcome of interest (death) first and the treatment of interest (RHC) first.
I’m going to set a seed for random numbers before we start, which I can change later if I like.
Code
set.seed(50012345)
7 Unadjusted Outcome Assessments
Before we estimate a propensity score, we’ll perform unadjusted assessments to describe the impact of RHC (or not) on our three outcomes, without adjustment for any covariates at all.
7.1.2 Fitting the Model with the treatment and outcome as factors requires care
Alternatively, we could just work with the factors, but then, it’s important to be sure what is actually being modeled. Do this by making the outcome and treatment logical results, as follows.
Code
modelA_unadj1 <-glm((death =="Yes") ~ (swang1 =="RHC"), data = rhc, family =binomial())tidy(modelA_unadj1, conf.int =TRUE, exponentiate =TRUE)
Now, you’ve inverted the odds ratio! Not good. So, either use 0 and 1 (with 1 being the result you want to study), or use the factors but specify the direction for both treatment and outcome as a logical statement.
7.2 Outcome B: Length of Stay (quantitative)
Code
hospdays_unadj <-lm(hospdays ~ treat_rhc, data = rhc)hospdays_unadj
Here’s the data on survdays and death for our first six patients:
Code
rhc %>%select(survdays, death) %>%head()
# A tibble: 6 × 2
survdays death
<dbl> <fct>
1 240 No
2 45 Yes
3 317 No
4 37 Yes
5 2 Yes
6 201 No
Here, we have a (right-censored) time to event outcome, called survdays, and an indicator of censoring (death which is “Yes” if the patient’s time is real and not censored). So the first subject should have a time to death of 240 or more days, because they were still alive when the study ended, while the second should have a time that is exactly 45 days, because that person died during the study. The survival object we need, then, is:
Code
Surv(rhc$survdays, rhc$death =="Yes") %>%head()
[1] 240+ 45 317+ 37 2 201+
Code
survdays_unadj <-coxph(Surv(survdays, death =="Yes") ~ treat_rhc, data = rhc)survdays_unadj
Call:
coxph(formula = Surv(survdays, death == "Yes") ~ treat_rhc, data = rhc)
coef exp(coef) se(coef) z p
treat_rhc 0.07841 1.08156 0.03347 2.342 0.0192
Likelihood ratio test=5.46 on 1 df, p=0.0195
n= 5735, number of events= 3722
OK. We’re trying to use ps to estimate the probability of having a RHC, and it looks the people who actually had RHC have higher PS, on average, so that’s promising.
9 Checking Rubin’s Rules Prior to Propensity Adjustment
Rubin Rule 1 result should ideally be near 0, and certainly between -50 and +50.
So, we’ve got some work to do. Let’s try matching. In several different ways.
10 Match 1: 1:1 Greedy Matching on the linear PS, without Replacement
Code
X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match1 <-Match(Tr = Tr, X = X, M =1, estimand ="ATT", replace =FALSE, ties =FALSE)summary(match1)
Estimate... 0
SE......... 0
T-stat..... NaN
p.val...... NA
Original number of observations.............. 5735
Original number of treated obs............... 2184
Matched number of observations............... 2184
Matched number of observations (unweighted). 2184
11 Match 2: 1:2 Greedy Matching on the linear PS, without Replacement
Code
X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match2 <-Match(Tr = Tr, X = X, M =2, estimand ="ATT", replace =FALSE, ties =FALSE)summary(match2)
Estimate... 0
SE......... 0
T-stat..... NaN
p.val...... NA
Original number of observations.............. 5735
Original number of treated obs............... 2184
Matched number of observations............... 1775
Matched number of observations (unweighted). 3550
12 Match 3: Genetic Search Matching to do 1:1 Matching without replacement
The GenMatch function finds optimal balance using multivariate matching where a genetic search algorithm determines the weight that each covariate is given.
Please note that I am using way too small values of pop.size (especially) and probably max.generations, too, so that things run quickly. I am also only matching on the propensity score and linear propensity score, rather than on the individual covariates. For a course project, I recommend you try matching on the individual covariates if you can, but this reduction in the paramater values is fine.
For actual publications, follow the recommendations of the GenMatch algorithm, as described in the Matching and genoud package help files. The default values (which may be too small themselves) are 100 for both pop.size and max.generations.
Code
X <-cbind(rhc$linps, rhc$ps)Tr <-as.logical(rhc$swang1 =="RHC")genout3 <-GenMatch(Tr = Tr, X = X,estimand ="ATT", M =1,pop.size =10, max.generations =10,wait.generations =4, verbose =FALSE)
Fri Mar 10 07:39:32 2023
Domains:
0.000000e+00 <= X1 <= 1.000000e+03
0.000000e+00 <= X2 <= 1.000000e+03
Data Type: Floating Point
Operators (code number, name, population)
(1) Cloning........................... 0
(2) Uniform Mutation.................. 1
(3) Boundary Mutation................. 1
(4) Non-Uniform Mutation.............. 1
(5) Polytope Crossover................ 1
(6) Simple Crossover.................. 2
(7) Whole Non-Uniform Mutation........ 1
(8) Heuristic Crossover............... 2
(9) Local-Minimum Crossover........... 0
SOFT Maximum Number of Generations: 10
Maximum Nonchanging Generations: 4
Population size : 10
Convergence Tolerance: 1.000000e-03
Not Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
Not Checking Gradients before Stopping.
Using Out of Bounds Individuals.
Maximization Problem.
GENERATION: 0 (initializing the population)
Lexical Fit..... 1.311905e-01 2.248375e-01 1.000000e+00 1.000000e+00
#unique......... 10, #Total UniqueCount: 10
var 1:
best............ 2.514367e+01
mean............ 3.122722e+02
variance........ 5.498279e+04
var 2:
best............ 6.998688e+02
mean............ 4.798079e+02
variance........ 1.318615e+05
GENERATION: 1
Lexical Fit..... 1.313094e-01 2.252163e-01 1.000000e+00 1.000000e+00
#unique......... 6, #Total UniqueCount: 16
var 1:
best............ 3.190046e+01
mean............ 4.206356e+01
variance........ 1.654869e+03
var 2:
best............ 5.221507e+02
mean............ 7.154282e+02
variance........ 5.651104e+04
GENERATION: 2
Lexical Fit..... 1.313094e-01 2.252163e-01 1.000000e+00 1.000000e+00
#unique......... 7, #Total UniqueCount: 23
var 1:
best............ 3.190046e+01
mean............ 6.657013e+01
variance........ 4.258032e+03
var 2:
best............ 5.221507e+02
mean............ 6.198933e+02
variance........ 1.196544e+04
GENERATION: 3
Lexical Fit..... 1.313094e-01 2.252163e-01 1.000000e+00 1.000000e+00
#unique......... 6, #Total UniqueCount: 29
var 1:
best............ 3.190046e+01
mean............ 9.278428e+01
variance........ 2.429210e+04
var 2:
best............ 5.221507e+02
mean............ 5.796450e+02
variance........ 1.073255e+04
GENERATION: 4
Lexical Fit..... 1.313266e-01 2.252830e-01 1.000000e+00 1.000000e+00
#unique......... 4, #Total UniqueCount: 33
var 1:
best............ 4.099422e+01
mean............ 3.764262e+01
variance........ 1.006961e+02
var 2:
best............ 5.231225e+02
mean............ 4.814444e+02
variance........ 1.694430e+04
GENERATION: 5
Lexical Fit..... 1.313266e-01 2.252830e-01 1.000000e+00 1.000000e+00
#unique......... 6, #Total UniqueCount: 39
var 1:
best............ 4.099422e+01
mean............ 3.515675e+01
variance........ 2.348079e+01
var 2:
best............ 5.231225e+02
mean............ 5.324653e+02
variance........ 1.354162e+03
'wait.generations' limit reached.
No significant improvement in 4 generations.
Solution Lexical Fitness Value:
1.313266e-01 2.252830e-01 1.000000e+00 1.000000e+00
Parameters at the Solution:
X[ 1] : 4.099422e+01
X[ 2] : 5.231225e+02
Solution Found Generation 1
Number of Generations Run 5
Fri Mar 10 07:39:40 2023
Total run time : 0 hours 0 minutes and 8 seconds
Estimate... 0
SE......... 0
T-stat..... NaN
p.val...... NA
Original number of observations.............. 5735
Original number of treated obs............... 2184
Matched number of observations............... 2184
Matched number of observations (unweighted). 2297
13 Match 4: 1:1 Greedy Matching on the linear PS, WITH Replacement
Code
X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match4 <-Match(Tr = Tr, X = X, M =1, estimand ="ATT", replace =TRUE, ties =FALSE)summary(match4)
Estimate... 0
SE......... 0
T-stat..... NaN
p.val...... NA
Original number of observations.............. 5735
Original number of treated obs............... 2184
Matched number of observations............... 2184
Matched number of observations (unweighted). 2184
14 Match 5: 1:1 Caliper Matching on the linear PS, WITHOUT Replacement
Code
X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match5 <-Match(Tr = Tr, X = X, M =1, caliper =0.2,estimand ="ATT", replace =FALSE, ties =FALSE)summary(match5)
Estimate... 0
SE......... 0
T-stat..... NaN
p.val...... NA
Original number of observations.............. 5735
Original number of treated obs............... 2184
Matched number of observations............... 1563
Matched number of observations (unweighted). 1563
Caliper (SDs)........................................ 0.2
Number of obs dropped by 'exact' or 'caliper' 621
15 Match 6: 1:2 Greedy Matching on the linear PS, WITH Replacement
Code
X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match6 <-Match(Tr = Tr, X = X, M =2, estimand ="ATT", replace =TRUE, ties =FALSE)summary(match6)
Estimate... 0
SE......... 0
T-stat..... NaN
p.val...... NA
Original number of observations.............. 5735
Original number of treated obs............... 2184
Matched number of observations............... 2184
Matched number of observations (unweighted). 4368
17 ATT Matching Estimates for the Binary Outcome, Death
17.1 With Match 1, in detail
Note that the variable which identifies the matches (called matches_1) is set up as a factor, as it needs to be, and that I’m using the 1/0 versions of both the outcome (so died rather than death) and treatment (so treat_rhc rather than swang1).
Code
death_adj_m1 <-clogit(died ~ treat_rhc +strata(matches_1), data = rhc.matches_1)summary(death_adj_m1)
Call:
coxph(formula = Surv(rep(1, 4368L), died) ~ treat_rhc + strata(matches_1),
data = rhc.matches_1, method = "exact")
n= 4368, number of events= 2861
coef exp(coef) se(coef) z Pr(>|z|)
treat_rhc 0.23156 1.26056 0.06488 3.569 0.000358 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
treat_rhc 1.261 0.7933 1.11 1.432
Concordance= 0.558 (se = 0.023 )
Likelihood ratio test= 12.82 on 1 df, p=3e-04
Wald test = 12.74 on 1 df, p=4e-04
Score (logrank) test = 12.79 on 1 df, p=3e-04
The odds ratio in the exp(coef) section above is the average causal effect estimate. It describes the odds of having an event (died) occur associated with being a RHC subject (as identified by treat_rhc), as compared to the odds of that event when a non-RHC subject. We can tidy this with:
ggplot(death_match_results, aes(x = approach, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Death using RHC Propensity Matching",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")
Let’s look at just the four strategies that achieved stronger balance.
Code
death_match_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5", "Match 6")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Matching Estimates in RHC (Death)",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")
18 ATT Matching Estimates for the Quantitative Outcome, Hospital Length of Stay
18.1 With Match 1, in detail
Again, the variable which identifies the matches (called matches_1) is set up as a factor, as it needs to be, and that I’m using the 1/0 version of treatment (so treat_rhc rather than swang1).
Our result for this quantitative outcome (hospdays) comes from a mixed model where the matches are treated as a random factor, but the treatment group is treated as a fixed factor, using the lme4 package.
Linear mixed model fit by REML ['lmerMod']
Formula: hospdays ~ treat_rhc + (1 | matches_1)
Data: rhc.matches_1
REML criterion at convergence: 40934.5
Scaled residuals:
Min 1Q Median 3Q Max
-0.9292 -0.5540 -0.3137 0.1747 11.9768
Random effects:
Groups Name Variance Std.Dev.
matches_1 (Intercept) 14.5 3.809
Residual 674.0 25.962
Number of obs: 4368, groups: matches_1, 2184
Fixed effects:
Estimate Std. Error t value
(Intercept) 20.8255 0.5615 37.090
treat_rhc 3.8658 0.7857 4.921
Correlation of Fixed Effects:
(Intr)
treat_rhc -0.700
The estimated effect of treat_rhc in the fixed effects section, combined with the 95% confidence intervals material, provides the average causal effect estimate. It describes the change in the outcome hospdays associated with being a RHC subject (as identified by treat_rhc), as compared to being a non-RHC subject. We can tidy this with:
ggplot(hospdays_match_results, aes(x = approach, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for LOS using RHC Propensity Matching",x ="Propensity Adjustment Approach", y ="Estimated Change in Hospital Length of Stay associated with RHC")
Let’s look at just the four strategies that achieved stronger balance.
Code
hospdays_match_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5", "Match 6")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Matching Estimates in RHC (LOS)",x ="Propensity Adjustment Approach", y ="Est. Change in Hospital LOS associated with RHC")
19 ATT Matching Estimates for the Time-to-Event Outcome, In-Study Survival
We’ll use a stratified Cox proportional hazards model to compare the RHC/No RHC groups on our time-to-event outcome survdays, while accounting for the matched pairs. The main result will be a relative hazard rate estimate, with 95% CI. I will use the 0/1 numeric versions of the event indicator (died), and of the treatment indicator (treat_rhc).
Call:
coxph(formula = Surv(survdays, died) ~ treat_rhc + strata(matches_1),
data = rhc.matches_1)
n= 4368, number of events= 2861
coef exp(coef) se(coef) z Pr(>|z|)
treat_rhc 0.04768 1.04883 0.04554 1.047 0.295
exp(coef) exp(-coef) lower .95 upper .95
treat_rhc 1.049 0.9534 0.9593 1.147
Concordance= 0.512 (se = 0.016 )
Likelihood ratio test= 1.1 on 1 df, p=0.3
Wald test = 1.1 on 1 df, p=0.3
Score (logrank) test = 1.1 on 1 df, p=0.3
The hazard ratio in the exp(coef) section is the average causal effect estimate. It describes the hazard for having an event (died) occur associated with being a RHC subject (as identified by treat_rhc), as compared to the hazard of that event when a non-RHC subject. We can tidy this with:
ggplot(survdays_match_results, aes(x = approach, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Survival using RHC Propensity Matching",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")
Let’s look at just the four strategies that achieved stronger balance.
Code
survdays_match_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5", "Match 6")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Matching Estimates in RHC (Survival)",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")
20 Propensity Score Weighting (ATT approach)
20.1 Estimate the Propensity Score using Generalized Boosted Regression, and then perfom ATT Weighting
To begin, we’ll re-estimate the propensity score using the twang function ps. This uses a generalized boosted regression approach to estimate the propensity score and produce material for checking balance.
Code
# Recall that twang does not play well with tibbles,# so we have to use the data frame version of rhc object# and I'm using the 1/0 version of the treatmentrhc_df <- base::as.data.frame(rhc)ps_rhc <-ps(treat_rhc ~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1,data = rhc_df,n.trees =4000,interaction.depth =2,stop.method =c("es.mean"),estimand ="ATT",verbose =FALSE)
20.1.1 Did we let the simulations run long enough to stabilize estimates?
Code
plot(ps_rhc)
20.1.2 What is the effective sample size of our weighted results?
Code
summary(ps_rhc)
n.treat n.ctrl ess.treat ess.ctrl max.es mean.es max.ks
unw 2184 3551 2184 3551.000 0.53367192 0.14968341 0.21274111
es.mean.ATT 2184 3551 2184 1250.501 0.09123789 0.03520429 0.04966016
max.ks.p mean.ks iter
unw NA 0.05831639 NA
es.mean.ATT NA 0.01540652 3999
20.1.3 How is the balance?
Code
plot(ps_rhc, plots =2)
Code
plot(ps_rhc, plots =3)
20.1.4 Assessing Balance with cobalt
Code
b2 <-bal.tab(ps_rhc, full.stop.method ="es.mean.att", stats =c("m", "v"), un =TRUE)b2
We could simply compare the propensity score balance as reported in the table above, looking at the standardized biases (the standardized difference on a proportion scale, rather than a percentage.) I’ll multiply each by 100 to place things on the usual scale.
Code
100*b2[1]$Balance$Diff.Un[1] # imbalance prior to weighting
[1] 155.5377
Code
100*b2[1]$Balance$Diff.Adj[1] # imbalance after weighting
[1] 51.07571
So this is still indicative of some difficulty reaching a completely reasonable level of balance. We’d like to see this much closer to 0.
20.3 Rubin’s Rule 2 After Weighting
Code
b2[1]$Balance$V.Ratio.Un[1] # imbalance prior to weighting
[1] 1.208331
Code
b2[1]$Balance$V.Ratio.Adj[1] # imbalance after weighting
[1] 0.9057622
So there’s no particular problem with Rule 2 here.
20.4 Semi-Automated Love plot of Standardized Differences
Code
p <-love.plot(b2, threshold = .1, size =3, stars ="raw",title ="Standardized Diffs and ATT weights")p +theme_bw()
20.5 Semi-Automated Love plot of Variance Ratios
Code
p <-love.plot(b2, stat ="v",threshold =1.25, size =3, title ="Variance Ratios: TWANG ATT weighting")p +theme_bw()
21 ATT Weighting Estimates
We’re not too excited about weighted estimates without (at least) an additional adjustment for differences in the propensity score in this case, but I’ll run them anyway just to show how that would be done.
21.1 Weighted ATT for Outcome A: Death
The relevant regression approach uses the svydesign and svyglm functions from the survey package. For a binary outcome, we build the outcome model using the quasibinomial, rather than the usual binomial family. Note the use of the 1/0 version of both the outcome and the treatment variables here.
The estimates we want come from the treat_rhc row.
21.2 Double Robust Weighted Estimate for Death
Now let’s add in a regression adjustment for the linear propensity score, which should alleviate some of our concerns about residual imbalance after weighting.
So there’s our table. It’s worth remembering that the only analyses that we’ve
Code
death_results %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Estimates in RHC (Death)",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")
And here are the results for only the five methods (adding our double robust approach to the analyses we liked in the matching) which showed reasonable balance after propensity score adjustment…
Code
death_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5","Match 6", "DR")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Estimates in RHC (Death)",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")
21.4 Weighted ATT for Outcome B: Hospital Length of Stay
The relevant regression approach uses the svydesign and svyglm functions from the survey package.
The estimates we want come from the treat_rhc row.
21.5 Double Robust Weighted Estimate for LOS
Now let’s add in a regression adjustment for the linear propensity score. Again, we should have some additional faith in these results over those with just weighting alone, because of the failure of the weighting approach to pass Rubin’s Rule 1.
# A tibble: 9 × 7
term estimate std.error conf.low conf.high approach description
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 treat_rhc 5.16 0.687 3.81 6.51 Unmatched No Matching
2 treat_rhc 3.87 0.786 2.33 5.41 Match 1 1:1 greedy match…
3 treat_rhc 4.98 0.545 3.91 6.05 Match 2 1:2 greedy match…
4 treat_rhc 1.40 0.824 -0.216 3.01 Match 3 1:1 genetic sear…
5 treat_rhc 1.27 0.859 -0.411 2.96 Match 4 1:1 greedy match…
6 treat_rhc 2.30 0.919 0.496 4.10 Match 5 1:1 caliper matc…
7 treat_rhc 1.22 0.556 0.125 2.31 Match 6 1:2 greedy match…
8 treat_rhc 2.10 1.03 0.0750 4.12 ATT weighted ATT weights
9 treat_rhc 1.56 1.10 -0.605 3.72 DR DR: ATT wts + adj
So there’s our table.
Code
hospdays_results %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for LOS",x ="Propensity Adjustment Approach", y ="Estimated Change in Hospital LOS associated with RHC")
And here are the results for only the five methods (adding our double robust approach to the analyses we liked in the matching) which showed reasonable balance after propensity score adjustment…
Code
hospdays_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5","Match 6", "DR")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for LOS",x ="Propensity Adjustment Approach", y ="Estimated Change in Hospital LOS associated with RHC")
21.7 Weighted ATT for Outcome C: In-Study Time to Death
Again, the estimates we want come from the treat_rhc row.
21.8 Double Robust Weighted Estimate for Days to Death
Now let’s add in a regression adjustment for the linear propensity score. Again, in this case, we prefer this double robust approach to weighting alone because of Rubin’s Rule 1 and the Love plot.
survdays_results %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Survival Time",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")
And here are the results for only the five methods (adding our double robust approach to the analyses we liked in the matching) which showed reasonable balance after propensity score adjustment…
Code
survdays_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5","Match 6", "DR")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Survival Time",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")
22 Sensitivity Analyses After Matching
22.1 For the Binary Outcome, Death
The rbounds package can evaluate binary outcomes using the binarysens and Fishersens functions. The binarysens function works directly on a matched sample as developed using the Matching package. For example, in our third match, based on the genetic search algorithm with 1:1 matching without replacement, this is match3. Note that in order to do this, we needed to run match3 including the appropriate (binary) outcome (Y).
Code
Y <- rhc$diedX <-cbind(rhc$linps, rhc$ps)Tr <-as.logical(rhc$swang1 =="RHC")match3 <-Match(Y = Y, Tr = Tr, X = X, estimand ="ATT", Weight.matrix = genout3)Rc <- match3$mdata$Y[match3$mdata$Tr==0]Rt <- match3$mdata$Y[match3$mdata$Tr==1] out.tab <-matrix(c(table(Rc, Rt)), nrow =2)out.tab
Rosenbaum Sensitivity Test
Unconfounded estimate .... 0
Gamma Lower bound Upper bound
1.00 0 0.00000
1.05 0 0.00006
1.10 0 0.00099
1.15 0 0.00865
1.20 0 0.04479
1.25 0 0.14833
1.30 0 0.33859
1.35 0 0.57390
1.40 0 0.77815
1.45 0 0.90728
1.50 0 0.96873
Note: Gamma is Odds of Differential Assignment To
Treatment Due to Unobserved Factors
At a 5% significance level, we need to compare the listed upper bounds to a value of \(\alpha = 0.05/2 = 0.025\). It appears that we retain significance up to a hidden bias of \(\Gamma\) = 1.15, but not at \(\Gamma\) = 1.2. See Chapter 9 of Rosenbaum’s Observation and Experiment for more details on interpretation of this result.
We can also use this approach with other Matching results, including Match 6, which is a 1:2 match with replacement. Again, we have to run the match with the appropriate outcome included.
Code
Y <- rhc$diedX <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match6 <-Match(Y = Y, Tr = Tr, X = X, M =2, estimand ="ATT", replace =TRUE, ties =FALSE)Rc <- match6$mdata$Y[match6$mdata$Tr==0]Rt <- match6$mdata$Y[match6$mdata$Tr==1] out.tab <-matrix(c(table(Rc, Rt)), nrow =2)out.tab
Rosenbaum Sensitivity Test
Unconfounded estimate .... 0
Gamma Lower bound Upper bound
1.00 0 0.00000
1.05 0 0.00004
1.10 0 0.00192
1.15 0 0.02811
1.20 0 0.16574
1.25 0 0.47114
1.30 0 0.78518
1.35 0 0.94718
1.40 0 0.99213
1.45 0 0.99927
1.50 0 0.99996
Note: Gamma is Odds of Differential Assignment To
Treatment Due to Unobserved Factors
Here, at a 5% significance level, we again compare our upper bounds to \(\alpha/2 = 0.025\) and we retain significance up to a hidden bias of \(\Gamma\) = 1.10, but not at \(\Gamma\) = 1.15.
22.2 For the Quantitative Outcome, In-Hospital Length of Stay
We have already used the Match function from the Matching package to develop our matched samples. For our 1:1 matches, we need only run the psens function from the rbounds package to obtain sensitivity results. For example, in Match 3,
Rosenbaum Sensitivity Test for Wilcoxon Signed Rank P-Value
Unconfounded estimate .... 1e-04
Gamma Lower bound Upper bound
1.00 1e-04 0.0001
1.05 0e+00 0.0041
1.10 0e+00 0.0453
1.15 0e+00 0.2168
1.20 0e+00 0.5344
1.25 0e+00 0.8213
1.30 0e+00 0.9575
1.35 0e+00 0.9937
1.40 0e+00 0.9994
1.45 0e+00 1.0000
1.50 0e+00 1.0000
Note: Gamma is Odds of Differential Assignment To
Treatment Due to Unobserved Factors
At a 5% significance level, we retain significance up to a hidden bias of \(\Gamma\) = 1.05, but not at \(\Gamma\) = 1.1. See Chapter 9 of Rosenbaum’s Observation and Experiment for more details on interpretation of this result.
22.3 For the Survival Outcome, In-Study Time to Death
In this setting, we would need to use the spreadsheet software provided by Dr. Love. That software is designed only for 1:1 greedy matching without replacement or weighting, which is the approach we took in Match 1 (we could also have looked at matching using a caliper, but without replacement.) We need to identify, for each matched pair of subjects, whether that pair’s set of results was:
that the RHC patient definitely survived longer than the non-RHC patient
that the non-RHC patient definitely survived longer than the RHC patient
that it is unclear which patient survived longer, due to censoring
Here is some very fussy code that accomplishes this counting. This could probably be accomplished just as well with some sort of case_when statement, but it’s a bit tricky.
result n percent
No RHC lived longer 988 0.4523810
No clear winner 278 0.1272894
RHC lived longer 918 0.4203297
Total 2184 1.0000000
So we have counts for the pairs where a clear winner is identified, and we can enter the spreadsheet. We need the number of pairs with a clear winner, and the number of pairs where the No RHC patient lived longer than the RHC patient. Doing so, we see that the result doesn’t meet the sign-score test standard necessary to have a p value below 0.10, so we stop there, and conclude that the effect would be highly sensitive to hidden bias.
---title: "Propensity Scores and the Right Heart Catheterization Data"author: "Thomas E. Love, Ph.D."date: last-modifiedformat: html: toc: true number-sections: true code-fold: show code-tools: true code-overflow: wrap embed-resources: true date-format: iso theme: pulse ---# Preliminaries## Study BackgroundThis example is based on the Right Heart Catheterization data set available at [Vanderbilt University](https://biostat.app.vumc.org/wiki/Main/DataSets) and described [here](https://biostat.app.vumc.org/wiki/pub/Main/DataSets/rhc.html). The key reference is Connors AF et al. 1996 The effectiveness of RHC in the initial care of critically ill patients. [*JAMA* 276: 889-897](http://jamanetwork.com/journals/jama/fullarticle/407990). Connors et al. used a logistic regression model to develop a propensity score then: [a] matched RHC to non-RHC patients and [b] adjusted for propensity score in models for outcomes, followed by a sensitivity analysis. The key conclusions were that RHC patients had decreased survival time, and any unmeasured confounder would have to be somewhat strong to explain away the results.## Loading Packages```{r load_packages, message=FALSE, warning = FALSE}knitr::opts_chunk$set(comment =NA)library(here); library(janitor); library(naniar)library(tableone); library(broom); library(survival)library(Matching); library(cobalt); library(rgenoud)library(lme4); library(rbounds); library(conflicted)library(twang); library(survey)library(tidyverse)conflict_prefer("select", "dplyr")conflict_prefer("filter", "dplyr")theme_set(theme_bw())```## Importing the DataThe observations in the `rhc` data set describe 5,735 subjects who are critically ill, and most variables were obtained during day 1 of a hospitalization. We'll begin by importing the `rhc` data into a **tibble** (a lazy and surly data frame - see R's help file on tibbles for more details) in R that contains 5,735 observations (rows) on 63 variables (columns). In the import process, we're going to do a few things to simplify the process of getting a functional data set.- The automated column specification in R's **readr** package uses the first 1000 rows of the data set - with the `rhc` data, this creates parsing failures for several measures which are assumed to be integers but actually should be parsed as doubles (continuous variables). So we specify the column types for those variables here then allow R to identify the type for the other columns.- By default, **readr**'s `read_csv` command treats all string variables as characters, and not as factors. For several multicategorical variables, we want a factor representation in a specific order, so we'll specify that, too, also using the `cols` function.```{r read_with_column_specs}column_types_rhc <-cols(urin1 ="d", meanbp1 ="d", resp1 ="d",swang1 =col_factor(c("RHC", "No RHC")),death =col_factor(c("No", "Yes")),sex =col_factor(c("Male", "Female")),cat1 =col_factor(c("ARF", "CHF", "Cirrhosis", "Colon Cancer", "Coma", "COPD","Lung Cancer", "MOSF w/Malignancy", "MOSF w/Sepsis")),dnr1 =col_factor(c("No", "Yes")),card =col_factor(c("No", "Yes")),gastr =col_factor(c("No", "Yes")),hema =col_factor(c("No", "Yes")),meta =col_factor(c("No", "Yes")),neuro =col_factor(c("No", "Yes")),ortho =col_factor(c("No", "Yes")),renal =col_factor(c("No", "Yes")),resp =col_factor(c("No", "Yes")),seps =col_factor(c("No", "Yes")),trauma =col_factor(c("No", "Yes")),income =col_factor(c("Under $11k", "$11-$25k", "$25-$50k", "> $50k")),ninsclas =col_factor(c("Private", "Private & Medicare", "Medicare", "Medicare & Medicaid", "Medicaid", "No insurance")),race =col_factor(c("white", "black", "other")),ca =col_factor(c("No", "Yes", "Metastatic")))```The data are available at the link below.```{r}rhc_raw <-read_csv("https://biostat.app.vumc.org/wiki/pub/Main/DataSets/rhc.csv", col_types = column_types_rhc)```This will throw a message indicating that the `` variable (which I assume is a set of row names) will have a new name, but we'll ignore that since we won't use the row names in our subsequent work.- After we import the data, we re-order the columns to present the SUPPORT patient identification code first, then the exposure (`swang1`), then information on the outcomes, then information on the covariates of interest for our work, dropping the other variables.```{r complete_import_process}rhc_cleaning <- rhc_raw %>%select(ptid, swang1, death, sadmdte, dschdte, dthdte, lstctdte, age, sex, edu, income, ninsclas, race, cat1, dnr1, wtkilo1, hrt1, meanbp1, resp1, temp1, card, gastr, hema, meta, neuro, ortho, renal, resp, seps, trauma, amihx, ca, cardiohx, chfhx, chrpulhx, dementhx, gibledhx, immunhx, liverhx, malighx, psychhx, renalhx, transhx, aps1, das2d3pc, scoma1, surv2md1, alb1, bili1, crea1, hema1, paco21, pafi1, ph1, pot1, sod1, wblc1)```## The Data Set after Initial ImportThe `rhc_cleaning` data at this stage is a tibble (a lazy and surly data frame) containing `r nrow(rhc_cleaning)` observations (rows) on `r ncol(rhc_cleaning)` variables (columns). The observations describe subjects who are critically ill, and most variables were obtained during day 1 of a hospitalization.```{r}rhc_cleaning```## Any missingness?```{r impute_discharge_date}rhc_cleaning %>%miss_var_summary()```We could also graph this with `gg_miss_var(rhc_cleaning)`. As it turns out, there were three variables in the raw `rhc` data that have missing values, but we've omitted one (about day 1 urine output) in creating this `rhc_cleaning` file. The remaining two are dates, and we'll address those in the material on creating our outcomes.# The Treatment / Exposure We'll StudyThe treatment/exposure we are studying is the presence of a Swan-Ganz (right heart) catheter on day 1, captured in the variable `swang1`. The Medline Plus description of right heart catheterization [is here](https://medlineplus.gov/ency/article/003870.htm). Basically, the procedure passes a thin tube into the right side of the heart and the arteries leading to the lungs, to measure the heart's function and blood flow. It's done in very ill patients for several reasons, and is also referred to as right heart catheterization and pulmonary artery catheterization. ```{r}rhc_cleaning %>%tabyl(swang1)```Of course, the most important issue to us is that each subject was not randomly allocated to either the "RHC" or "No RHC" group. Instead, this is an **observational** study, where each subject received their treatment (RHC or no) on the basis of what their providers thought would be best for them.# The 3 Outcomes We'll StudyWe will define three outcomes here: one binary, one quantitative, and one time-to-event. - Our first outcome is in-study mortality, captured as a binary variable.- We will then define hospital length of stay, a quantitative variable, and then- Time to death, a time-to-event variable that will be right-censored for those who did not die during the study.## Mortality, a binary outcome: `death`One outcome is death during the study, as captured in the `death` variable. This is a **binary** outcome.```{r}rhc_cleaning %>%tabyl(death)```Those people without a value for `dthdte` turn out to be the people with `death` values of "No" and this makes sense, because `dthdte` is a code for date of death.## Working with Dates to Define the Time-Related Outcomes### The Raw Data on DatesVariable | Definition | Values-------- | ------------------------ | ----------------------------------------`sadmdte` | Study Admission Date | Integer ranging from 10754 to 12441, indicating the \# of days after 1960-01-01, so study admission date range is actually 1989-06-11 to 1994-01-23.`dschdte` | Hospital Discharge Date | Integer ranging from 10757 to 12560, indicating the \# of days after 1960-01-01, so hospital discharge date range is actually 1989-06-14 to 1994-05-22. (**Note: 1 missing value**)`dthdte` | Death Date | Integer ranging from 10757 to 12783, indicating the \# of days after 1960-01-01, so death date range is actually 1989-06-14 to 1994-12-31. The **2013 people with missing values** are those with No for `death`.`lstctdte` | Date of Last Contact | Integer ranging from 10756 to 12644, indicating the \# of days after 1960-01-01, so death date range is actually 1989-06-13 to 1994-08-14.Next, we will develop the two other primary outcomes: survival time (which we'll store in `survdays`), and hospital length of stay (which will go in `hospdays`). - It turns out that for three subjects, their dates of last contact appear to occur after their death dates, which is irritating, but doesn't affect us much since the only way in which we'll use the last contact date is for calculating survival time for patients *without* a death date. ### Impute hospital discharge date for the one missing valueWe could just impute a random value, but that's not a great idea, since hospital discharge date is tied to other information that is available to us for this patient. The subject with a missing hospital discharge date turns out to be `ptid` 08382.```{r}rhc_cleaning %>%filter(is.na(dschdte)) %>%select(ptid, death, sadmdte, dschdte, dthdte, lstctdte)```So the patient was admitted to the study on day 12386, had their last contact with the study on day 12582 (i.e. 196 days after study entry), and died on day 12780 (i.e. 394 days after study entry.) It seems rational to impute a value of hospital discharge date that falls in between their date of study admission and their date of last contact. - Among the RHC patients, what is a reasonable guess as to the time from study admission to hospital discharge that we might apply to this patient? We could see what the distribution of these differences are in the available data, like so...```{r}#| message: falsemosaic::favstats(dschdte - sadmdte ~ swang1, data = rhc_cleaning)```I'll arbitrarily choose to impute a discharge date for `ptid` 08382 so as to yield a 25 day hospitalization, so as to affect the mean stay length within the RHC group as little as possible. This implies that we'll impute a value of 12386 + 25 = 12411 for this patient.```{r}rhc_cleaning <- rhc_cleaning %>%mutate(dschdte =replace_na(dschdte, 12411))rhc_cleaning %>%select(dschdte) %>%miss_var_summary()```## Length of initial hospital stay, a quantitative outcome: `hospdays`Our third outcome is length of initial hospital stay, which is captured by subtracting the study admission date (`sadmdte`) from the hospital discharge date (`dschdte`). This is a **quantitative** outcomes, and we will create it soon, and name it `hospdays` in our data set.```{r}rhc_cleaning <- rhc_cleaning %>%mutate(hospdays = dschdte - sadmdte)mosaic::favstats(hospdays ~ swang1, data = rhc_cleaning)```It looks like all of these values are reasonable (in that they are all positive) and we have no missing values.## Survival Time in Days, a time-to-event outcome: `survdays`Another outcome is survival time in days (which is censored for all patients who did not die during the study), and which is captured by subtracting the study admission date (`sadmdte`) from the death date (`dthdte`) for the patients who died in study, and the study admission date (`sadmdte`) from the date of last contact (`lstctdte`) for the patients who did not die in study. This is a **time-to-event** outcome, and we will create it soon, and name it `survdays` in our data set.- For the subjects who died, this is straightforward, in that we take the death date and subtract the study admission date. - For those who survived through the study, we take the date of last contact and subtract the study admission date, in order to get their (right censored) survival times.```{r}rhc_cleaning <- rhc_cleaning %>%mutate(survdays =ifelse(death =="Yes", dschdte - sadmdte, lstctdte - sadmdte))mosaic::favstats(survdays ~ death, data = rhc_cleaning)```Again, this looks reasonable. All of these values are strictly positive, for instance, and none seem wildly out of line, and we have no missing values. We also see that those who died had much shorter survival times (typically) than those who survived, which makes sense.We might also look at the `survdays` variable broken down by RHC status.```{r}mosaic::favstats(survdays ~ swang1, data = rhc_cleaning)```These seem quite comparable, so far, and again, they make sense (all are positive, and there are no missing values.)# The 50 Covariates We'll StudyOur propensity model will eventually include 50 covariates, described below.## Group 1: Socio-demographics (6 covariates)Variable | Definition -------- | ------------------------------------------------`age` | Age in years `sex` | Sex `edu` | Years of Education `income` | Income Category (4 levels)`ninsclas` | Insurance Category (6 levels) `race` | Self-Reported Race (3 levels)Here's a Table 1 for these covariates.```{r covariates_demographics}vars01 <-c("age", "sex", "edu", "income", "ninsclas", "race")table1_01 <-CreateTableOne(vars = vars01, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_01, smd =TRUE)```- Two of these covariates show standardized mean differences larger than 0.10.## Group 2: Presentation at Admission (7 covariates)Variable | Definition -------- | ------------------------------------------------`cat1` | Primary disease category (9 levels)`dnr1` | Do Not Resuscitate status, day 1 `wtkilo1` | Weight in kg, day 1 `hrt1` | Heart rate, day 1 `meanbp1` | Mean Blood Pressure, day 1 `resp1` | Respiratory rate, day 1`temp1` | Temperature, C, day 1```{r covariates_admission_presentation}vars02 <-c("cat1", "dnr1", "wtkilo1", "hrt1", "meanbp1", "resp1", "temp1")table1_02 <-CreateTableOne(vars = vars02, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_02, smd =TRUE)```- Six of these covariates show standardized mean differences larger than 0.10.## Group 3: Admission Diagnosis Categories (10 covariates)Variable | Definition -------- | ------------------------------------------------`card` | Cardiovascular diagnosis `gastr` | Gastrointestinal diagnosis`hema` | Hematologic diagnosis`meta` | Metabolic diagnosis`neuro` | Neurological diagnosis`ortho` | Orthopedic diagnosis`renal` | Renal diagnosis`resp` | Respiratory diagnosis `seps` | Sepsis diagnosis `trauma` | Trauma diagnosis ```{r covariates_admission_diagnoses}vars03 <-c("card", "gastr", "hema", "meta", "neuro", "ortho","renal", "resp", "seps", "trauma")table1_03 <-CreateTableOne(vars = vars03, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_03, smd =TRUE)```- Seven of these covariates show standardized mean differences larger than 0.10.## Group 4: Comorbid Illness and Transfer Status (13 covariates)Variable | Definition -------- | ------------------------------------------------`amihx` | Definite Myocardial Infarction `ca` | Cancer (3 levels)`cardiohx` | Acute MI, Peripheral Vascular Disease, Severe Cardiovascular Symptoms (NYHA-Class III), Very Severe Cardiovascular Symptoms (NYHA- IV)`chfhx` | Congestive Heart Failure `chrpulhx` | Chronic Pulmonary Disease, Severe or Very Severe Pulmonary Disease `dementhx` | Dementia, Stroke or Cerebral Infarct, Parkinson's Disease `gibledhx` | Upper GI Bleeding `immunhx` | Immunosupperssion, Organ Transplant, HIV Positivity, Diabetes Mellitus Without End Organ Damage, Diabetes Mellitus With End Organ Damage, Connective Tissue Disease `liverhx` | Cirrhosis, Hepatic Failure `malighx` | Solid Tumor, Metastatic Disease, Chronic Leukemia/Myeloma, Acute Leukemia, Lymphoma `psychhx` | Psychiatric History, Active Psychosis or Severe Depression `renalhx` | Chronic Renal Disease, Chronic Hemodialysis or Peritoneal Dialysis `transhx` | Transfer (\> 24 Hours) from Another Hospital ```{r covariates_comorbidities_and_transfer}vars04 <-c("amihx", "ca", "cardiohx", "chfhx", "chrpulhx", "dementhx","gibledhx", "immunhx", "liverhx", "malighx", "psychhx", "renalhx", "transhx")table1_04 <-CreateTableOne(vars = vars04, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_04, smd =TRUE)```- Seven of these covariates show standardized mean differences larger than 0.10.## Group 5: Day 1 Summary Measures of Presentation / Severity of Illness (4 covariates)Variable | Definition -------- | ------------------------------------------------`aps1` | APACHE III score ignoring Coma, day 1 `das2d3pc` | DASI (Duke Activity Status Index prior to admission) `scoma1` | Support Coma score based on Glasgow, day 1 `surv2md1` | SUPPORT model estimate of Prob(surviving 2 months)```{r covariates_presentation_severity}vars05 <-c("aps1", "das2d3pc", "scoma1", "surv2md1")table1_05 <-CreateTableOne(vars = vars05, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_05, smd =TRUE)```- Three of these covariates show standardized mean differences larger than 0.10.## Group 6: Day 1 Lab Results (10 covariates)Variable | Definition -------- | ------------------------------------------------`alb1` | Albumin `bili1` | Bilirubin `crea1` | Serum Creatinine `hema1` | Hematocrit `paco21` | PaCo2 `pafi1` | PaO2\/(.01 FIO2) `ph1` | Serum PH (arterial) `pot1` | Serum Potassium `sod1` | Serum Sodium `wblc1` | White blood cell count (4 subjects with value 0)```{r covariates_day_1_lab_results}vars06 <-c("alb1", "bili1", "crea1", "hema1", "paco21","pafi1", "pot1", "sod1", "wblc1")table1_06 <-CreateTableOne(vars = vars06, strata ="swang1", data = rhc_cleaning, test =FALSE)print(table1_06, smd =TRUE)```- Six of these covariates show standardized mean differences larger than 0.10.So, in all, 31 of the 50 covariates show standardized mean differences that exceed 0.10 before any propensity score adjustment.# A Little RefactoringFor our outcome, and our treatment, it will be useful to have 1/0 versions of the variables, and it will also be helpful to have the factor versions releveled to put the outcome of interest (death) first and the treatment of interest (RHC) first.```{r}rhc <- rhc_cleaning %>%mutate(treat_rhc =as.numeric(swang1 =="RHC"),swang1 =fct_relevel(swang1, "RHC"),death =fct_relevel(death, "Yes"),died =as.numeric(death =="Yes"))```## Saving the Data File```{r}saveRDS(rhc, here("data", "rhc.Rds"))```# Setting a SeedI'm going to set a seed for random numbers before we start, which I can change later if I like.```{r}set.seed(50012345)```# Unadjusted Outcome AssessmentsBefore we estimate a propensity score, we'll perform unadjusted assessments to describe the impact of RHC (or not) on our three outcomes, without adjustment for any covariates at all.## Outcome A: In-Study Mortality (binary)```{r}rhc %>%tabyl(swang1, death) %>%adorn_totals() %>%adorn_percentages() %>%adorn_pct_formatting() %>%adorn_ns(position ="front") %>% adorn_title```The odds ratio of death associated with RHC as opposed to No RHC should be larger than 1, specifically, it should be $$\frac{1486 \times 1315}{2236 \times 698} = 1.25$$Let's see if that's what we get.### Fitting the Model with the 1/0 treatment and outcome```{r}death_unadj <-glm(died ~ treat_rhc, data = rhc, family =binomial())death_unadjtidy(death_unadj, conf.int =TRUE, exponentiate =TRUE)```### Fitting the Model with the treatment and outcome as factors requires careAlternatively, we could just work with the factors, but then, it's important to be sure what is actually being modeled. Do this by making the outcome and treatment logical results, as follows.```{r}modelA_unadj1 <-glm((death =="Yes") ~ (swang1 =="RHC"), data = rhc, family =binomial())tidy(modelA_unadj1, conf.int =TRUE, exponentiate =TRUE)```Failing to do this can cause trouble:```{r}modelA_unadj_bad <-glm(death ~ swang1, data = rhc, family =binomial())tidy(modelA_unadj_bad, conf.int =TRUE, exponentiate =TRUE)```Note that this last version is actually predicting "Death = No" on the basis of "swang1 = No RHC", which can be extremely confusing. And if you did this:```{r}modelA_unadj_bad2 <-glm(death ~ swang1 =="RHC", data = rhc, family =binomial())tidy(modelA_unadj_bad2, conf.int =TRUE, exponentiate =TRUE)```Now, you've inverted the odds ratio! Not good. So, either use 0 and 1 (with 1 being the result you want to study), or use the factors but specify the direction for both treatment and outcome as a logical statement.## Outcome B: Length of Stay (quantitative)```{r}hospdays_unadj <-lm(hospdays ~ treat_rhc, data = rhc)hospdays_unadjtidy(hospdays_unadj, conf.int =TRUE)```## Outcome C: Time to death (time-to-event)Here's the data on `survdays` and `death` for our first six patients:```{r}rhc %>%select(survdays, death) %>%head()```Here, we have a (right-censored) time to event outcome, called `survdays`, and an indicator of censoring (`death` which is "Yes" if the patient's time is real and not censored). So the first subject should have a time to death of 240 or more days, because they were still alive when the study ended, while the second should have a time that is exactly 45 days, because that person died during the study. The survival object we need, then, is:```{r}Surv(rhc$survdays, rhc$death =="Yes") %>%head()``````{r}survdays_unadj <-coxph(Surv(survdays, death =="Yes") ~ treat_rhc, data = rhc)survdays_unadjtidy(survdays_unadj, conf.int =TRUE, exponentiate =TRUE)```# Estimate the Propensity Score with Logistic Regression```{r}propensity_model <-glm(swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1,family =binomial(link ="logit"), data = rhc)```## Store the propensity scores and linear propensity scores```{r}rhc <- rhc %>%mutate(ps =fitted(propensity_model),linps = propensity_model$linear.predictors)rhc %>%select(ptid, ps, linps) %>%head()```## Plot the Propensity Scores to Verify that they Match Your Expectations```{r}ggplot(rhc, aes(x = ps, fill = swang1)) +geom_density(alpha =0.5) +scale_fill_viridis_d(option ="plasma") +theme_bw()``````{r}ggplot(rhc, aes(x = swang1, y = ps)) +geom_violin(aes(fill = swang1)) +geom_boxplot(width =0.2) +scale_fill_viridis_d(option ="plasma") +guides(fill ="none") +coord_flip() +theme_bw()```OK. We're trying to use `ps` to estimate the probability of having a RHC, and it looks the people who actually had RHC have higher PS, on average, so that's promising.# Checking Rubin's Rules Prior to Propensity AdjustmentRubin Rule 1 result should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.unadj <-with(rhc, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.unadj```Rubin Rule 2 result should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.unadj <-with(rhc, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.unadj```So, we've got some work to do. Let's try matching. In several different ways.# Match 1: 1:1 Greedy Matching on the linear PS, without Replacement```{r}X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match1 <-Match(Tr = Tr, X = X, M =1, estimand ="ATT", replace =FALSE, ties =FALSE)summary(match1)```## Love Plot for Match 1```{r, fig.height = 8}b1 <-bal.tab(match1, swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1 + ps + linps,data=rhc, un =TRUE)love.plot(b1, threshold = .1, size =1.5, stars ="raw",var.order ="unadjusted",title ="Standardized Differences and Match 1") +theme_bw()```## Rubin's Rules 1 and 2 for Match 1Create a new data frame, containing only the matched sample.```{r}matches_1 <-factor(rep(match1$index.treated, 2))rhc.matches_1 <-cbind(matches_1, rhc[c(match1$index.control, match1$index.treated),])```Rubin Rule 1 result after matching should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.m1 <-with(rhc.matches_1, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.m1```Rubin Rule 2 result after matching should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.m1 <-with(rhc.matches_1, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.m1```# Match 2: 1:2 Greedy Matching on the linear PS, without Replacement```{r}X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match2 <-Match(Tr = Tr, X = X, M =2, estimand ="ATT", replace =FALSE, ties =FALSE)summary(match2)```## Love Plot for Match 2```{r, fig.height = 8}b2 <-bal.tab(match2, swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1 + ps + linps,data=rhc, un =TRUE)love.plot(b2, threshold = .1, size =1.5, stars ="raw",var.order ="unadjusted",title ="Standardized Differences and Match 2") +theme_bw()```## Rubin's Rules 1 and 2 for Match 2Create a new data frame, containing only the matched sample.```{r}matches_2 <-factor(rep(match2$index.treated, 2))rhc.matches_2 <-cbind(matches_2, rhc[c(match2$index.control, match2$index.treated),])```Rubin Rule 1 result after matching should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.m2 <-with(rhc.matches_2, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.m2```Rubin Rule 2 result after matching should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.m2 <-with(rhc.matches_2, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.m2```# Match 3: Genetic Search Matching to do 1:1 Matching without replacementThe `GenMatch` function finds optimal balance using multivariate matching where a genetic search algorithm determines the weight that each covariate is given.Please note that I am using way too small values of `pop.size` (especially) and probably `max.generations`, too, so that things run quickly. I am also only matching on the propensity score and linear propensity score, rather than on the individual covariates. For a course project, I recommend you try matching on the individual covariates if you can, but this reduction in the paramater values is fine. For actual publications, follow the recommendations of the `GenMatch` algorithm, as described in the `Matching` and `genoud` package help files. The default values (which may be too small themselves) are 100 for both `pop.size` and `max.generations`.```{r}X <-cbind(rhc$linps, rhc$ps)Tr <-as.logical(rhc$swang1 =="RHC")genout3 <-GenMatch(Tr = Tr, X = X,estimand ="ATT", M =1,pop.size =10, max.generations =10,wait.generations =4, verbose =FALSE)match3 <-Match(Tr = Tr, X = X, estimand ="ATT", Weight.matrix = genout3)summary(match3)```## Love Plot for Match 3```{r, fig.height = 8}b3 <-bal.tab(match3, swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1 + ps + linps,data=rhc, un =TRUE)love.plot(b3, threshold = .1, size =1.5, stars ="raw",var.order ="unadjusted",title ="Standardized Differences and Match 3") +theme_bw()```## Rubin's Rules 1 and 2 for Match 3Create a new data frame, containing only the matched sample.```{r}matches_3 <-factor(rep(match3$index.treated, 2))rhc.matches_3 <-cbind(matches_3, rhc[c(match3$index.control, match3$index.treated),])```Rubin Rule 1 result after matching should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.m3 <-with(rhc.matches_3, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.m3```Rubin Rule 2 result after matching should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.m3 <-with(rhc.matches_3, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.m3```# Match 4: 1:1 Greedy Matching on the linear PS, WITH Replacement```{r}X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match4 <-Match(Tr = Tr, X = X, M =1, estimand ="ATT", replace =TRUE, ties =FALSE)summary(match4)```## Love Plot for Match 4```{r, fig.height = 8}b4 <-bal.tab(match4, swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1 + ps + linps,data=rhc, un =TRUE)love.plot(b4, threshold = .1, size =1.5, stars ="raw",var.order ="unadjusted",title ="Standardized Differences and Match 4") +theme_bw()```## Rubin's Rules 1 and 2 for Match 4Create a new data frame, containing only the matched sample.```{r}matches_4 <-factor(rep(match4$index.treated, 2))rhc.matches_4 <-cbind(matches_4, rhc[c(match4$index.control, match4$index.treated),])```Rubin Rule 1 result after matching should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.m4 <-with(rhc.matches_4, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.m4```Rubin Rule 2 result after matching should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.m4 <-with(rhc.matches_4, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.m4```# Match 5: 1:1 Caliper Matching on the linear PS, WITHOUT Replacement```{r}X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match5 <-Match(Tr = Tr, X = X, M =1, caliper =0.2,estimand ="ATT", replace =FALSE, ties =FALSE)summary(match5)```## Love Plot for Match 5```{r, fig.height = 8}b5 <-bal.tab(match5, swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1 + ps + linps,data=rhc, un =TRUE)love.plot(b5, threshold = .1, size =1.5, stars ="raw",var.order ="unadjusted",title ="Standardized Differences and Match 5") +theme_bw()```## Rubin's Rules 1 and 2 for Match 5Create a new data frame, containing only the matched sample.```{r}matches_5 <-factor(rep(match5$index.treated, 2))rhc.matches_5 <-cbind(matches_5, rhc[c(match5$index.control, match5$index.treated),])```Rubin Rule 1 result after matching should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.m5 <-with(rhc.matches_5, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.m5```Rubin Rule 2 result after matching should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.m5 <-with(rhc.matches_5, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.m5```# Match 6: 1:2 Greedy Matching on the linear PS, WITH Replacement```{r}X <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match6 <-Match(Tr = Tr, X = X, M =2, estimand ="ATT", replace =TRUE, ties =FALSE)summary(match6)```## Love Plot for Match 6```{r, fig.height = 8}b6 <-bal.tab(match6, swang1 =="RHC"~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1 + ps + linps,data=rhc, un =TRUE)love.plot(b6, threshold = .1, size =1.5, stars ="raw",var.order ="unadjusted",title ="Standardized Differences and Match 6") +theme_bw()```## Rubin's Rules 1 and 2 for Match 6Create a new data frame, containing only the matched sample.```{r}matches_6 <-factor(rep(match6$index.treated, 2))rhc.matches_6 <-cbind(matches_6, rhc[c(match6$index.control, match6$index.treated),])```Rubin Rule 1 result after matching should ideally be near 0, and certainly between -50 and +50.```{r}rubin1.m6 <-with(rhc.matches_6, abs(100*(mean(linps[swang1=="RHC"]) -mean(linps[swang1=="No RHC"]))/sd(linps)))rubin1.m6```Rubin Rule 2 result after matching should ideally be near 1, and certainly between 1/2 and 2.```{r}rubin2.m6 <-with(rhc.matches_6, var(linps[swang1 =="RHC"]) /var(linps[swang1 =="No RHC"]))rubin2.m6```# Ranking the MatchesMatch | Rubin 1 | Rubin 2 | Love Plot | Matched Sets | Description-----: | ------: | -----: | --------- | -------: | -------------------------None | `r round_half_up(rubin1.unadj,1)` | `r round_half_up(rubin2.unadj,2)` | -- | -- | No Matching1 | `r round_half_up(rubin1.m1, 1)` | `r round_half_up(rubin2.m1, 2)` | Pretty Bad | 2,184 | 1:1 greedy w/o repl2 | `r round_half_up(rubin1.m2, 1)` | `r round_half_up(rubin2.m2, 2)` | Dreadful | 1,775 | 1:2 greedy w/o repl3 | `r round_half_up(rubin1.m3, 1)` | `r round_half_up(rubin2.m3, 2)` | Very Good | 2,184 | Genetic Search 1:1 w/o repl4 | `r round_half_up(rubin1.m4, 1)` | `r round_half_up(rubin2.m4, 2)` | Very Good | 2,184 | 1:1 greedy with repl5 | `r round_half_up(rubin1.m5, 1)` | `r round_half_up(rubin2.m5, 2)` | Very Good | 1,562 | 1:1 caliper without repl6 | `r round_half_up(rubin1.m6, 1)` | `r round_half_up(rubin2.m6, 2)` | Very Good | 2,184 | 1:2 greedy with repl# ATT Matching Estimates for the Binary Outcome, Death## With Match 1, in detailNote that the variable which identifies the matches (called `matches_1`) is set up as a factor, as it needs to be, and that I'm using the 1/0 versions of both the outcome (so `died` rather than `death`) and treatment (so `treat_rhc` rather than `swang1`).```{r}death_adj_m1 <-clogit(died ~ treat_rhc +strata(matches_1), data = rhc.matches_1)summary(death_adj_m1)```The odds ratio in the `exp(coef)` section above is the average causal effect estimate. It describes the odds of having an event (`died`) occur associated with being a RHC subject (as identified by `treat_rhc`), as compared to the odds of that event when a non-RHC subject. We can tidy this with:```{r}tidy(death_adj_m1, exponentiate =TRUE, conf.level =0.95)```## With Matches 2-6, in much less detail```{r}death_adj_m2 <-clogit(died ~ treat_rhc +strata(matches_2), data = rhc.matches_2)tidy(death_adj_m2, exponentiate =TRUE, conf.level =0.95)``````{r}death_adj_m3 <-clogit(died ~ treat_rhc +strata(matches_3), data = rhc.matches_3)tidy(death_adj_m3, exponentiate =TRUE, conf.level =0.95)``````{r}death_adj_m4 <-clogit(died ~ treat_rhc +strata(matches_4), data = rhc.matches_4)tidy(death_adj_m4, exponentiate =TRUE, conf.level =0.95)``````{r}death_adj_m5 <-clogit(died ~ treat_rhc +strata(matches_5), data = rhc.matches_5)tidy(death_adj_m5, exponentiate =TRUE, conf.level =0.95)``````{r}death_adj_m6 <-clogit(died ~ treat_rhc +strata(matches_6), data = rhc.matches_6)tidy(death_adj_m6, exponentiate =TRUE, conf.level =0.95)```## Graphing the Results Across Multiple Matching Strategies```{r}temp0 <-tidy(death_unadj, conf.int =TRUE, exponentiate =TRUE) %>%filter(term =="treat_rhc")temp1 <-tidy(death_adj_m1, exponentiate =TRUE, conf.int = T, conf.level =0.95)temp2 <-tidy(death_adj_m2, exponentiate =TRUE, conf.int = T, conf.level =0.95)temp3 <-tidy(death_adj_m3, exponentiate =TRUE, conf.int = T, conf.level =0.95)temp4 <-tidy(death_adj_m4, exponentiate =TRUE, conf.int = T, conf.level =0.95)temp5 <-tidy(death_adj_m5, exponentiate =TRUE, conf.int = T, conf.level =0.95)temp6 <-tidy(death_adj_m6, exponentiate =TRUE, conf.int = T, conf.level =0.95)death_match_results <-rbind(temp0, temp1, temp2, temp3, temp4, temp5, temp6)death_match_results <- death_match_results %>%mutate(approach =c("Unmatched", "Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6")) %>%mutate(description =c("No Matching", "1:1 greedy matching without replacement", "1:2 greedy matching without replacement","1:1 genetic search matching without replacement","1:1 greedy matching with replacement","1:1 caliper matching on the linear PS without replacement","1:2 greedy matching with replacement"))death_match_results``````{r}ggplot(death_match_results, aes(x = approach, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Death using RHC Propensity Matching",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")```Let's look at just the four strategies that achieved stronger balance.```{r}death_match_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5", "Match 6")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Matching Estimates in RHC (Death)",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")```# ATT Matching Estimates for the Quantitative Outcome, Hospital Length of Stay## With Match 1, in detailAgain, the variable which identifies the matches (called `matches_1`) is set up as a factor, as it needs to be, and that I'm using the 1/0 version of treatment (so `treat_rhc` rather than `swang1`). Our result for this quantitative outcome (`hospdays`) comes from a mixed model where the matches are treated as a random factor, but the treatment group is treated as a fixed factor, using the `lme4` package.```{r}hospdays_adj_m1 <-lmer(hospdays ~ treat_rhc + (1| matches_1),data = rhc.matches_1)summary(hospdays_adj_m1); confint(hospdays_adj_m1)```The estimated effect of `treat_rhc` in the fixed effects section, combined with the 95% confidence intervals material, provides the average causal effect estimate. It describes the change in the outcome `hospdays` associated with being a RHC subject (as identified by `treat_rhc`), as compared to being a non-RHC subject. We can tidy this with:```{r, message = FALSE}broom.mixed::tidy(hospdays_adj_m1, conf.int =TRUE, conf.level =0.95)```or more specifically, we can look at just the treatment effect with:```{r, message = FALSE}broom.mixed::tidy(hospdays_adj_m1, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc")```## With Matches 2-6, in much less detail```{r}hospdays_adj_m2 <-lmer(hospdays ~ treat_rhc + (1| matches_2),data = rhc.matches_2)broom.mixed::tidy(hospdays_adj_m2, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc")``````{r}hospdays_adj_m3 <-lmer(hospdays ~ treat_rhc + (1| matches_3),data = rhc.matches_3)broom.mixed::tidy(hospdays_adj_m3, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc")``````{r}hospdays_adj_m4 <-lmer(hospdays ~ treat_rhc + (1| matches_4),data = rhc.matches_4)broom.mixed::tidy(hospdays_adj_m4, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc")``````{r}hospdays_adj_m5 <-lmer(hospdays ~ treat_rhc + (1| matches_5),data = rhc.matches_5)broom.mixed::tidy(hospdays_adj_m5, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc")``````{r}hospdays_adj_m6 <-lmer(hospdays ~ treat_rhc + (1| matches_6),data = rhc.matches_6)broom.mixed::tidy(hospdays_adj_m6, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc")```## Graphing the Results Across Multiple Matching Strategies```{r}temp0 <-tidy(hospdays_unadj, conf.int =TRUE) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)temp1 <-tidy(hospdays_adj_m1, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)temp2 <-tidy(hospdays_adj_m2, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)temp3 <-tidy(hospdays_adj_m3, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)temp4 <-tidy(hospdays_adj_m4, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)temp5 <-tidy(hospdays_adj_m5, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)temp6 <-tidy(hospdays_adj_m6, conf.int =TRUE, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(term, estimate, std.error, conf.low, conf.high)hospdays_match_results <-rbind(temp0, temp1, temp2, temp3, temp4, temp5, temp6)hospdays_match_results <- hospdays_match_results %>%mutate(approach =c("Unmatched", "Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6")) %>%mutate(description =c("No Matching", "1:1 greedy matching without replacement", "1:2 greedy matching without replacement","1:1 genetic search matching without replacement","1:1 greedy matching with replacement","1:1 caliper matching on the linear PS without replacement","1:2 greedy matching with replacement"))hospdays_match_results``````{r}ggplot(hospdays_match_results, aes(x = approach, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for LOS using RHC Propensity Matching",x ="Propensity Adjustment Approach", y ="Estimated Change in Hospital Length of Stay associated with RHC")```Let's look at just the four strategies that achieved stronger balance.```{r}hospdays_match_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5", "Match 6")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Matching Estimates in RHC (LOS)",x ="Propensity Adjustment Approach", y ="Est. Change in Hospital LOS associated with RHC")```# ATT Matching Estimates for the Time-to-Event Outcome, In-Study SurvivalWe'll use a stratified Cox proportional hazards model to compare the RHC/No RHC groups on our time-to-event outcome `survdays`, while accounting for the matched pairs. The main result will be a relative hazard rate estimate, with 95% CI. I will use the 0/1 numeric versions of the event indicator (`died`), and of the treatment indicator (`treat_rhc`).## With Match 1, in detail```{r}survdays_adj_m1 <-coxph(Surv(survdays, died) ~ treat_rhc +strata(matches_1), data=rhc.matches_1)summary(survdays_adj_m1)```The hazard ratio in the `exp(coef)` section is the average causal effect estimate. It describes the hazard for having an event (`died`) occur associated with being a RHC subject (as identified by `treat_rhc`), as compared to the hazard of that event when a non-RHC subject. We can tidy this with:```{r}tidy(survdays_adj_m1, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)```## With Matches 2-6, in much less detail```{r}survdays_adj_m2 <-coxph(Surv(survdays, died) ~ treat_rhc +strata(matches_2), data=rhc.matches_2)tidy(survdays_adj_m2, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)``````{r}survdays_adj_m3 <-coxph(Surv(survdays, died) ~ treat_rhc +strata(matches_3), data=rhc.matches_3)tidy(survdays_adj_m3, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)``````{r}survdays_adj_m4 <-coxph(Surv(survdays, died) ~ treat_rhc +strata(matches_4), data=rhc.matches_4)tidy(survdays_adj_m4, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)``````{r}survdays_adj_m5 <-coxph(Surv(survdays, died) ~ treat_rhc +strata(matches_5), data=rhc.matches_5)tidy(survdays_adj_m5, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)``````{r}survdays_adj_m6 <-coxph(Surv(survdays, died) ~ treat_rhc +strata(matches_6), data=rhc.matches_6)tidy(survdays_adj_m6, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)```## Graphing the Results Across Multiple Matching Strategies```{r}temp0 <-tidy(survdays_unadj, conf.int =TRUE, exponentiate =TRUE)temp1 <-tidy(survdays_adj_m1, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)temp2 <-tidy(survdays_adj_m2, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)temp3 <-tidy(survdays_adj_m3, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)temp4 <-tidy(survdays_adj_m4, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)temp5 <-tidy(survdays_adj_m5, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)temp6 <-tidy(survdays_adj_m6, exponentiate =TRUE, conf.int =TRUE, conf.level =0.95)survdays_match_results <-rbind(temp0, temp1, temp2, temp3, temp4, temp5, temp6)survdays_match_results <- survdays_match_results %>%mutate(approach =c("Unmatched", "Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6")) %>%mutate(description =c("No Matching", "1:1 greedy matching without replacement", "1:2 greedy matching without replacement","1:1 genetic search matching without replacement","1:1 greedy matching with replacement","1:1 caliper matching on the linear PS without replacement","1:2 greedy matching with replacement"))survdays_match_results``````{r}ggplot(survdays_match_results, aes(x = approach, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Survival using RHC Propensity Matching",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")```Let's look at just the four strategies that achieved stronger balance.```{r}survdays_match_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5", "Match 6")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Matching Estimates in RHC (Survival)",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")```# Propensity Score Weighting (ATT approach)## Estimate the Propensity Score using Generalized Boosted Regression, and then perfom ATT WeightingTo begin, we'll re-estimate the propensity score using the `twang` function `ps`. This uses a *generalized boosted regression* approach to estimate the propensity score and produce material for checking balance.```{r, warning = FALSE}# Recall that twang does not play well with tibbles,# so we have to use the data frame version of rhc object# and I'm using the 1/0 version of the treatmentrhc_df <- base::as.data.frame(rhc)ps_rhc <-ps(treat_rhc ~ age + sex + edu + income + ninsclas + race + cat1 + dnr1 + wtkilo1 + hrt1 + meanbp1 + resp1 + temp1 + card + gastr + hema + meta + neuro + ortho + renal + resp + seps + trauma + amihx + ca + cardiohx + chfhx + chrpulhx + dementhx + gibledhx + immunhx + liverhx + malighx + psychhx + renalhx + transhx + aps1 + das2d3pc + scoma1 + surv2md1 + alb1 + bili1 + crea1 + hema1 + paco21 + pafi1 + ph1 + pot1 + sod1 + wblc1,data = rhc_df,n.trees =4000,interaction.depth =2,stop.method =c("es.mean"),estimand ="ATT",verbose =FALSE)```### Did we let the simulations run long enough to stabilize estimates?```{r}plot(ps_rhc)```### What is the effective sample size of our weighted results?```{r}summary(ps_rhc)```### How is the balance?```{r}plot(ps_rhc, plots =2)``````{r}plot(ps_rhc, plots =3)```### Assessing Balance with `cobalt````{r}b2 <-bal.tab(ps_rhc, full.stop.method ="es.mean.att", stats =c("m", "v"), un =TRUE)b2```## Rubin's Rule 1 After WeightingWe could simply compare the propensity score balance as reported in the table above, looking at the standardized biases (the standardized difference on a proportion scale, rather than a percentage.) I'll multiply each by 100 to place things on the usual scale.```{r}100*b2[1]$Balance$Diff.Un[1] # imbalance prior to weighting100*b2[1]$Balance$Diff.Adj[1] # imbalance after weighting```So this is still indicative of some difficulty reaching a completely reasonable level of balance. We'd like to see this much closer to 0.## Rubin's Rule 2 After Weighting```{r}b2[1]$Balance$V.Ratio.Un[1] # imbalance prior to weightingb2[1]$Balance$V.Ratio.Adj[1] # imbalance after weighting```So there's no particular problem with Rule 2 here.## Semi-Automated Love plot of Standardized Differences```{r, fig.height = 8}p <-love.plot(b2, threshold = .1, size =3, stars ="raw",title ="Standardized Diffs and ATT weights")p +theme_bw()```## Semi-Automated Love plot of Variance Ratios```{r, fig.height = 8}p <-love.plot(b2, stat ="v",threshold =1.25, size =3, title ="Variance Ratios: TWANG ATT weighting")p +theme_bw()```# ATT Weighting Estimates We're not too excited about weighted estimates without (at least) an additional adjustment for differences in the propensity score in this case, but I'll run them anyway just to show how that would be done.## Weighted ATT for Outcome A: DeathThe relevant regression approach uses the `svydesign` and `svyglm` functions from the `survey` package. For a binary outcome, we build the outcome model using the quasibinomial, rather than the usual binomial family. Note the use of the 1/0 version of both the outcome and the treatment variables here.```{r}rhcwts_design <-svydesign(ids=~1, weights=~get.weights(ps_rhc, stop.method ="es.mean"),data=rhc_df) # using twang ATT weightsadj_death_wtd <-svyglm(died ~ treat_rhc, design=rhcwts_design,family=quasibinomial())wt_twangatt_res_death <-tidy(adj_death_wtd, exponentiate =TRUE,conf.int =TRUE,conf.level =0.95) %>%select(term, estimate, std.error,lo95 = conf.low, hi95 = conf.high)wt_twangatt_res_death```The estimates we want come from the `treat_rhc` row.## Double Robust Weighted Estimate for DeathNow let's add in a regression adjustment for the linear propensity score, which should alleviate some of our concerns about residual imbalance after weighting.```{r}dr_adj_death <-svyglm(died ~ treat_rhc + linps, design=rhcwts_design,family=quasibinomial())dr_death <-tidy(dr_adj_death, exponentiate =TRUE,conf.int =TRUE,conf.level =0.95) %>%select(term, estimate, std.error,lo95 = conf.low, hi95 = conf.high)dr_death```Again, the estimates we want come from the `treat_rhc` row.## Graphing the Weighted and Matched Results for Outcome A```{r}temp7 <-tidy(adj_death_wtd, exponentiate =TRUE, conf.int = T, conf.level =0.95) %>%filter(term =="treat_rhc")temp8 <-tidy(dr_adj_death, exponentiate =TRUE, conf.int = T, conf.level =0.95) %>%filter(term =="treat_rhc")death_new_results <-rbind(temp7, temp8) %>%mutate(approach =c("ATT weighted", "DR")) %>%mutate(description =c("ATT weights", "DR: ATT wts + adj"))death_results <-rbind(death_match_results, death_new_results)death_results```So there's our table. It's worth remembering that the only analyses that we've ```{r}death_results %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Estimates in RHC (Death)",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")```And here are the results for only the five methods (adding our double robust approach to the analyses we liked in the matching) which showed reasonable balance after propensity score adjustment...```{r}death_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5","Match 6", "DR")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="ATT Estimates in RHC (Death)",x ="Propensity Adjustment Approach", y ="Estimated Odds Ratio for Death associated with RHC")```## Weighted ATT for Outcome B: Hospital Length of StayThe relevant regression approach uses the `svydesign` and `svyglm` functions from the `survey` package.```{r}rhcwts_design <-svydesign(ids=~1, weights=~get.weights(ps_rhc, stop.method ="es.mean"),data=rhc_df) # using twang ATT weightsadj_hospdays_wtd <-svyglm(hospdays ~ treat_rhc, design= rhcwts_design)wt_twangatt_res_hospdays <-tidy(adj_hospdays_wtd, conf.int =TRUE,conf.level =0.95) %>%select(term, estimate, std.error,lo95 = conf.low, hi95 = conf.high)wt_twangatt_res_hospdays```The estimates we want come from the `treat_rhc` row.## Double Robust Weighted Estimate for LOSNow let's add in a regression adjustment for the linear propensity score. Again, we should have some additional faith in these results over those with just weighting alone, because of the failure of the weighting approach to pass Rubin's Rule 1.```{r}dr_adj_days <-svyglm(hospdays ~ treat_rhc + linps, design=rhcwts_design)dr_los <-tidy(dr_adj_days, conf.int =TRUE,conf.level =0.95) %>%select(term, estimate, std.error,lo95 = conf.low, hi95 = conf.high)dr_los```Once again, the estimates we want come from the `treat_rhc` row.## Graphing the Weighted and Matched Results for Outcome B```{r}temp9 <-tidy(adj_hospdays_wtd, conf.int = T, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(-statistic, -p.value)temp10 <-tidy(dr_adj_days, conf.int = T, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(-statistic, -p.value)hospdays_new_results <-rbind(temp9, temp10) %>%mutate(approach =c("ATT weighted", "DR")) %>%mutate(description =c("ATT weights", "DR: ATT wts + adj"))hospdays_results <-rbind(hospdays_match_results, hospdays_new_results)hospdays_results```So there's our table. ```{r}hospdays_results %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for LOS",x ="Propensity Adjustment Approach", y ="Estimated Change in Hospital LOS associated with RHC")```And here are the results for only the five methods (adding our double robust approach to the analyses we liked in the matching) which showed reasonable balance after propensity score adjustment...```{r}hospdays_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5","Match 6", "DR")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =0, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for LOS",x ="Propensity Adjustment Approach", y ="Estimated Change in Hospital LOS associated with RHC")```## Weighted ATT for Outcome C: In-Study Time to Death```{r}rhc_wts <-get.weights(ps_rhc, stop.method ="es.mean")adj_survdays_wtd <-coxph(Surv(survdays, died) ~ treat_rhc,data = rhc_df, weights = rhc_wts)wt_twangatt_res_surv <-tidy(adj_survdays_wtd, exponentiate =TRUE,conf.int =TRUE,conf.level =0.95) %>%select(term, estimate, std.error,lo95 = conf.low, hi95 = conf.high)wt_twangatt_res_surv```Again, the estimates we want come from the `treat_rhc` row.## Double Robust Weighted Estimate for Days to DeathNow let's add in a regression adjustment for the linear propensity score. Again, in this case, we prefer this double robust approach to weighting alone because of Rubin's Rule 1 and the Love plot.```{r}dr_survdays <-coxph(Surv(survdays, died) ~ treat_rhc + linps,data = rhc_df, weights = rhc_wts)dr_surv <-tidy(dr_survdays, exponentiate =TRUE, conf.int =TRUE,conf.level =0.95) %>%select(term, estimate, std.error,lo95 = conf.low, hi95 = conf.high)dr_surv```Once again, the estimates we want come from the `treat_rhc` row.## Graphing the Weighted and Matched Results for Outcome C```{r}temp11 <-tidy(adj_survdays_wtd, exponentiate = T, conf.int = T, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(-robust.se)temp12 <-tidy(dr_survdays, exponentiate = T, conf.int = T, conf.level =0.95) %>%filter(term =="treat_rhc") %>%select(-robust.se)survdays_new_results <-rbind(temp11, temp12) %>%mutate(approach =c("ATT weighted", "DR")) %>%mutate(description =c("ATT weights", "DR: ATT wts + adj"))survdays_results <-rbind(survdays_new_results, survdays_match_results)survdays_results```So there's our table. ```{r}survdays_results %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Survival Time",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")```And here are the results for only the five methods (adding our double robust approach to the analyses we liked in the matching) which showed reasonable balance after propensity score adjustment...```{r}survdays_results %>%filter(approach %in%c("Match 3", "Match 4", "Match 5","Match 6", "DR")) %>%ggplot(., aes(x = description, y = estimate, ymin = conf.low, ymax = conf.high)) +geom_text(aes(label =round_half_up(estimate,2)), vjust =-1.25) +geom_pointrange() +geom_hline(yintercept =1, col ="red", linetype ="dashed") +theme_bw() +coord_flip() +labs(title ="Comparing ATT Estimates for Survival Time",x ="Propensity Adjustment Approach", y ="Estimated Hazard Ratio associated with RHC")```# Sensitivity Analyses After Matching## For the Binary Outcome, DeathThe `rbounds` package can evaluate binary outcomes using the `binarysens` and `Fishersens` functions. The `binarysens` function works directly on a matched sample as developed using the Matching package. For example, in our third match, based on the genetic search algorithm with 1:1 matching without replacement, this is `match3`. Note that in order to do this, we needed to run `match3` including the appropriate (binary) outcome (Y).```{r}Y <- rhc$diedX <-cbind(rhc$linps, rhc$ps)Tr <-as.logical(rhc$swang1 =="RHC")match3 <-Match(Y = Y, Tr = Tr, X = X, estimand ="ATT", Weight.matrix = genout3)Rc <- match3$mdata$Y[match3$mdata$Tr==0]Rt <- match3$mdata$Y[match3$mdata$Tr==1] out.tab <-matrix(c(table(Rc, Rt)), nrow =2)out.tabbinarysens(out.tab[2,1], out.tab[1,2],Gamma =1.5, GammaInc =0.05)```At a 5% significance level, we need to compare the listed upper bounds to a value of $\alpha = 0.05/2 = 0.025$. It appears that we retain significance up to a hidden bias of $\Gamma$ = 1.15, but not at $\Gamma$ = 1.2. See Chapter 9 of Rosenbaum's *Observation and Experiment* for more details on interpretation of this result.We can also use this approach with other Matching results, including Match 6, which is a 1:2 match with replacement. Again, we have to run the match with the appropriate outcome included.```{r}Y <- rhc$diedX <- rhc$linpsTr <-as.logical(rhc$swang1 =="RHC")match6 <-Match(Y = Y, Tr = Tr, X = X, M =2, estimand ="ATT", replace =TRUE, ties =FALSE)Rc <- match6$mdata$Y[match6$mdata$Tr==0]Rt <- match6$mdata$Y[match6$mdata$Tr==1] out.tab <-matrix(c(table(Rc, Rt)), nrow =2)out.tabbinarysens(out.tab[2,1], out.tab[1,2],Gamma =1.5, GammaInc =0.05)```Here, at a 5% significance level, we again compare our upper bounds to $\alpha/2 = 0.025$ and we retain significance up to a hidden bias of $\Gamma$ = 1.10, but not at $\Gamma$ = 1.15.## For the Quantitative Outcome, In-Hospital Length of StayWe have already used the `Match` function from the Matching package to develop our matched samples. For our 1:1 matches, we need only run the `psens` function from the `rbounds` package to obtain sensitivity results. For example, in Match 3,```{r}Y <- rhc$hospdaysX <-cbind(rhc$linps, rhc$ps)Tr <-as.logical(rhc$swang1 =="RHC")match3 <-Match(Y = Y, Tr = Tr, X = X, estimand ="ATT", Weight.matrix = genout3)Rc <- match3$mdata$Y[match3$mdata$Tr==0]Rt <- match3$mdata$Y[match3$mdata$Tr==1] psens(Rt, Rc, Gamma =1.5, GammaInc =0.05)```At a 5% significance level, we retain significance up to a hidden bias of $\Gamma$ = 1.05, but not at $\Gamma$ = 1.1. See Chapter 9 of Rosenbaum's *Observation and Experiment* for more details on interpretation of this result.## For the Survival Outcome, In-Study Time to DeathIn this setting, we would need to use the spreadsheet software provided by Dr. Love. That software is designed only for 1:1 greedy matching without replacement or weighting, which is the approach we took in Match 1 (we could also have looked at matching using a caliper, but without replacement.) We need to identify, for each matched pair of subjects, whether that pair's set of results was:- that the RHC patient definitely survived longer than the non-RHC patient- that the non-RHC patient definitely survived longer than the RHC patient- that it is unclear which patient survived longer, due to censoringHere is some very fussy code that accomplishes this counting. This could probably be accomplished just as well with some sort of `case_when` statement, but it's a bit tricky.```{r}a1 <- rhc.matches_1 %>%select(matches_1, swang1, death, survdays) %>%arrange(matches_1) ``````{r}a2 <-unite(a1, situation, swang1, death, sep ="_", remove =TRUE)a3 <-spread(a2, situation, survdays)a4 <- a3 %>%filter(complete.cases(`No RHC_Yes`, `RHC_Yes`)) %>%mutate(result =ifelse(`RHC_Yes`>`No RHC_Yes`, "RHC lived longer", "No RHC lived longer"))a5 <- a3 %>%filter(complete.cases(`No RHC_No`, `RHC_No`)) %>%mutate(result ="No clear winner")a6 <- a3 %>%filter(complete.cases(`No RHC_No`, `RHC_Yes`)) %>%mutate(result =ifelse(`RHC_Yes`>`No RHC_No`, "No clear winner", "No RHC lived longer"))a7 <- a3 %>%filter(complete.cases(`No RHC_Yes`, `RHC_No`)) %>%mutate(result =ifelse(`RHC_No`>`No RHC_Yes`, "RHC lived longer", "No clear winner"))a8 <-bind_rows(a4, a5, a6, a7)a8 %>%tabyl(result) %>%adorn_totals()```So we have counts for the pairs where a clear winner is identified, and we can enter the spreadsheet. We need the number of pairs with a clear winner, and the number of pairs where the No RHC patient lived longer than the RHC patient. Doing so, we see that the result doesn't meet the sign-score test standard necessary to have a p value below 0.10, so we stop there, and conclude that the effect would be highly sensitive to hidden bias.# Session Information```{r}xfun::session_info()```