The first question we must ask is: what is our target population? In our case, we will draw upon outcomes from the FU1 and FU2 surveys, so we will begin by making the FU1 respondents the target. In subsequent analyses that use FU2 variables as the “baseline” value we can construct a separate set of weights using FU2 as the target. However, a key drawback of doing this is that we end up with effectively two “target populations,” so we may want to settle on using FU1 respondents, or even baseline respondents, as the target.
In many ways the modeling of non-response can be thought of as an extension of propensity score modeling; indeed there is a robust statistical literature to this effect (see David 1983, for example).
Response propensity methods allow for more flexible modeling of the probability of survey response. In this way we can accommodate continuous predictors, allowing for flexible specifications (e.g., splines, local polynomial regression) and a large number of variables.
Under this approach, survey response adjustments are made using the inverse of the estimated response propensities. This assumes we have specified the response model correctly, an assumption that can be mitigated to some degree through the flexible regression specifications described above.
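As a minimal sketch of the idea (using a hypothetical data frame dat with response indicator r; the covariates mirror the predictor list used below):
# Sketch only: flexible logistic response-propensity model
library(rms)   # rcs() restricted cubic spline, as used in the models below
m <- glm(r ~ education + sex + rcs(enrollment_age),
         family = binomial, data = dat)
# Nonresponse adjustment factor = inverse of the estimated response propensity
dat$p_hat <- predict(m, type = "response")
dat$w_nr  <- ifelse(dat$r == 1, 1 / dat$p_hat, 0)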
A limitation, however, is that some individuals may have very low estimated response propensities. These individuals receive very large nonresponse weighting adjustment factors, which can substantially inflate the variance of the estimated parameters in our study.
One way around this is to trim or compress very large weights (Potter, 1990; Kish, 1992; Meng et al., 2010). Another approach, based on recent developments in propensity score methods (Imbens and Rubin 2015) and nonresponse stratification methods (Little 1983), may yield a more tractable solution; we pursue it below.
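For illustration, one simple trimming rule caps the adjustment factors at a chosen quantile (a sketch only, with an arbitrary cutoff; the cited papers develop more principled, data-driven procedures):
# Sketch: cap adjustment factors at the 99th percentile among responders
cap <- quantile(dat$w_nr[dat$r == 1], 0.99)
dat$w_trim <- pmin(dat$w_nr, cap)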
Our first step is to pre-specify a set of characteristics and administrative variables deemed a priori as potentially important for response. We begin with a baseline response propensity specification with linear terms for each of these covariates, though we could also choose a more flexible specification with second-order and interaction terms (possibly selected by an iterative algorithm and checked via cross-validation of the prediction errors to guard against overfitting).
# Baseline X variables for predicting response
z <- c("education","sex","maritalstatus","hhsize","employed","enrollment_source","raceanalysis","insurancecoverage")
# We can also include interaction terms very easily:
# z <- c("education","sex","maritalstatus","hhsize","employed","state","enrollment_source","raceanalysis",
# "enrollment_source*enrollment_age","enrollment_source*raceanalysis","enrollment_source*education",
# "enrollment_source*sex","enrollment_source*maritalstatus","enrollment_source*hhsize","enrollment_source*employed",
# "enrollment_age*sex","enrollment_age*education","enrollment_age*maritalstatus","enrollment_age*employed")
# Variables we want to enter as a spline
z.spline <- c("enrollment_age")
# FU variables for predicting response (e.g., lagged value of the outcomes).
######
## NOTE: The current dataset includes only a very minimal number of these variables.
## We will augment this list with additional variables in the next iteration of the data.
######
x_2 <- c("insurancecoveragef1")
x_3 <- c("insurancecoveragef2") # lagged FU2 value for the FU3 response model
We next fit logistic regression models predicting survey response. Note I have commented out the model predicting FU1 response since we are using FU1 respondents as our “target” population.
# !!!!! TK NOTE: EVENTUALLY WE WILL NEED TO USE ADMINISTRATIVE VARIABLES FOR THIS!
# Response Indicators
xx$r_1 <- as.integer(!is.na(xx$insurancecoveragef1))
xx$r_2 <- as.integer(!is.na(xx$insurancecoveragef2))
xx$r_3 <- as.integer(!is.na(xx$insurancecoveragef3))
#!!!!!!
#!!!!!! Since our target population is responders to FU1 we restrict the master sample to them here.
#!!!!!!
xx <- xx %>% filter(r_1==1)
# # Fit Response Model for FU1
# xx_1.full <- xx[,intersect(colnames(xx),c("r_1","idnumber",z,z.spline))]
# xx_1.full.desc <- Hmisc::describe(xx_1.full)
#
# # Get only the non-missing values that will be fit in the response regression
# xx_1 <- xx_1.full %>% na.omit()
#
# # Fit Model & obtain model frame, predicted response rate, and log-odds
# m_1 <- glm(tofmla(y="r_1",x=z,x.spline="enrollment_age"),family="binomial",data=xx_1)
# xx_1.f <- cbind.data.frame(idnumber=xx_1$idnumber,
# r_1=model.frame(m_1)$r_1,
# model.matrix(m_1)) %>% data.frame() %>% select(-X.Intercept.)
# xx_1.f$pr <- predict(m_1,type="response")
# xx_1.f$logodds = log(xx_1.f$pr/(1-xx_1.f$pr))
#
# # Perform K-fold cross validation in case we want to compare across models
# #cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
# #cv.err_1 <- boot::cv.glm(xx_1, m_1, cost, K = 6)$delta[2]
#######
## FU2
#######
# Fit Response Model for FU2
xx_2.full <- xx[,intersect(colnames(xx),c("r_2","idnumber",z,z.spline,x_2))]
xx_2.full.desc <- Hmisc::describe(xx_2.full)
# NOTE: SOME RESPONDENTS HAVE ITEM-MISSING VALUES FOR THE FU1 VARIABLES
# INCLUDED IN THE MODEL THAT PREDICTS FU2 RESPONSE. WE RECODE THESE TO THE
# PLACEHOLDER CATEGORY "9998" SO THE ROWS AREN'T DROPPED BY na.omit() BELOW.
na.to.9998 <- function(x) ifelse(is.na(x),"9998",x)
xx_2.full <- xx_2.full %>% mutate_each_(funs(na.to.9998),x_2)
# Get only the non-missing values that will be fit in the response regression
xx_2 <- xx_2.full %>% na.omit()
# Fit Model & obtain model frame, predicted response rate, and log-odds
m_2 <- glm(tofmla(y="r_2",x=z,x.spline="enrollment_age"),family="binomial",data=xx_2)
m_2.1 <- glm(tofmla(y="r_2",x=c(z,x_2),x.spline="enrollment_age"),family="binomial",data=xx_2)
xx_2.f <- cbind.data.frame(idnumber=xx_2$idnumber,
r_2=model.frame(m_2)$r_2,
model.matrix(m_2)) %>% data.frame() %>% select(-X.Intercept.)
xx_2.f$pr_2 <- predict(m_2,type="response")
xx_2.f$logodds = log(xx_2.f$pr_2/(1-xx_2.f$pr_2))
# Perform K-fold cross validation in case we want to compare across models
# cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
# cv.err_2 <- boot::cv.glm(xx_2, m_2, cost, K = 6)$delta[2]
# cv.err_2.1 <- boot::cv.glm(xx_2, m_2.1, cost, K = 6)$delta[2]
#######
## FU3
#######
# Fit Response Model for FU3
xx_3.full <- xx[,intersect(colnames(xx),c("r_3","idnumber",z,z.spline,x_3,x_2))]
xx_3.full.desc <- Hmisc::describe(xx_3.full)
# NOTE: FU2 NON-RESPONDENTS (AND ITEM NON-RESPONDENTS) HAVE MISSING VALUES FOR
# THE LAGGED VARIABLES INCLUDED IN THE MODEL THAT PREDICTS FU3 RESPONSE.
# WE RECODE THESE TO THE PLACEHOLDER CATEGORY "9998" SO THE ROWS AREN'T DROPPED.
na.to.9998 <- function(x) ifelse(is.na(x),"9998",x)
xx_3.full <- xx_3.full %>% mutate_each_(funs(na.to.9998),c(x_3,x_2))
# Get only the non-missing values that will be fit in the response regression
xx_3 <- xx_3.full %>% na.omit()
# Fit Model & obtain model frame, predicted response rate, and log-odds
m_3 <- glm(tofmla(y="r_3",x=z,x.spline="enrollment_age"),family="binomial",data=xx_3)
m_3.1 <- glm(tofmla(y="r_3",x=c(z,x_3,x_2),x.spline="enrollment_age"),
family="binomial",data=xx_3)
xx_3.f <- cbind.data.frame(idnumber=xx_3$idnumber,
r_3=model.frame(m_3)$r_3,
model.matrix(m_3)) %>% data.frame() %>% select(-X.Intercept.)
xx_3.f$pr_3 <- predict(m_3,type="response")
xx_3.f$logodds = log(xx_3.f$pr_3/(1-xx_3.f$pr_3))
# Perform K-fold cross validation in case we want to compare across models
#cv.err_3 <- boot::cv.glm(xx_3, m_3, cost, K = 6)$delta[2]
####
# Fit a model with only the baseline covariates (i.e., no interaction terms) for balance checking later
####
xx.base <- xx[,-grep("f1$|f2$|f3$",names(xx))] %>% select(-peremployed_tract)
xx.base<- colnames(xx.base[,grep(paste0(colnames(model.frame(m_3)[,-1]),collapse="|"),
colnames(xx.base))])
vars <- c(z,z.spline,x_2,x_3)
vars2 <- c(z,z.spline,x_2)
if (any(grepl("\\*", vars))) xx.base <- vars[-grep("\\*", vars)] else xx.base <- vars
m_base <- glm(tofmla(y="r_3",x=xx.base),family="binomial",data=xx_3)
stargazer::stargazer(m_2,m_3,title="Response Propensity Model for SCCS FU2 and FU3",
ci=FALSE,font.size="scriptsize",omit.stat = c("adj.rsq","rsq","f","ser"),type="html")
Response Propensity Model for SCCS FU2 and FU3 (coefficients with standard errors in parentheses)

 | r_2 (1) | r_3 (2) |
---|---|---|
education2 | 0.165*** (0.041) | 0.240*** (0.042) |
education3 | 0.411*** (0.040) | 0.503*** (0.040) |
education4 | 0.443*** (0.055) | 0.556*** (0.053) |
education5 | 0.488*** (0.042) | 0.615*** (0.042) |
education6 | 0.666*** (0.051) | 0.811*** (0.049) |
education7 | 0.862*** (0.069) | 1.060*** (0.062) |
education8 | 0.973*** (0.099) | 1.208*** (0.086) |
education8888 | 0.551 (1.324) | -0.384 (1.306) |
education9999 | -0.039 (1.418) | 0.753 (1.417) |
sexM | -0.389*** (0.021) | -0.382*** (0.020) |
maritalstatus2 | -0.238*** (0.025) | -0.229*** (0.023) |
maritalstatus3 | -0.230*** (0.037) | -0.261*** (0.035) |
maritalstatus4 | -0.103*** (0.030) | -0.124*** (0.028) |
maritalstatus8888 | -0.804 (1.066) | -1.360 (1.238) |
hhsize(2,3] | -0.082*** (0.027) | -0.053** (0.025) |
hhsize(3,4] | -0.060* (0.033) | -0.077** (0.031) |
hhsize(4,5] | -0.062 (0.045) | -0.065 (0.042) |
hhsize(5,6] | -0.206*** (0.063) | 0.014 (0.061) |
hhsize(6,8.89e+03] | -0.109 (0.068) | 0.039 (0.066) |
hhsize(8.89e+03,1e+04] | 0.027 (0.680) | 1.308* (0.755) |
employed1 | 0.053** (0.022) | 0.132*** (0.020) |
employed7777 | -0.070 (0.107) | 0.005 (0.105) |
enrollment_sourceM | 0.996*** (0.035) | 0.903*** (0.030) |
enrollment_sourceT | 0.762*** (0.169) | 0.697*** (0.140) |
raceanalysis2 | -0.163*** (0.023) | -0.156*** (0.021) |
raceanalysis3 | -0.245** (0.103) | -0.216** (0.097) |
raceanalysis4 | -0.354 (0.269) | -0.209 (0.252) |
raceanalysis5 | -0.269 (0.164) | -0.058 (0.160) |
raceanalysis6 | -0.415*** (0.142) | -0.455*** (0.139) |
raceanalysis7 | -0.126* (0.071) | -0.115* (0.066) |
raceanalysis8888 | 9.754 (48.650) | 0.578 (0.870) |
raceanalysis9999 | 0.108 (1.194) | 0.792 (1.192) |
insurancecoverage1 | 0.049** (0.022) | 0.011 (0.021) |
insurancecoverage8888 | -0.564 (0.717) | -0.601 (0.754) |
insurancecoverage9999 | 0.741 (0.578) | 0.504 (0.500) |
rcs(enrollment_age)enrollment_age | 0.032*** (0.009) | 0.038*** (0.009) |
rcs(enrollment_age)enrollment_age' | -0.163** (0.079) | -0.269*** (0.075) |
rcs(enrollment_age)enrollment_age'' | 0.547** (0.257) | 0.865*** (0.244) |
rcs(enrollment_age)enrollment_age''' | -0.687** (0.278) | -1.013*** (0.264) |
Constant | -0.924** (0.389) | -1.822*** (0.375) |
Observations | 50,165 | 50,165 |
Log Likelihood | -29,880.000 | -32,735.000 |
Akaike Inf. Crit. | 59,839.000 | 65,550.000 |

Note: *p<0.1; **p<0.05; ***p<0.01
After estimating the response propensity model we next assess the adequacy of the predicted response score. To do this we exploit a feature of the true response propensity score: the independence of the response indicator and the baseline characteristics given the score. Ideally, strata would be constructed pairing individuals with the same value of the score; however, this isn’t feasible given the wide range of values the score could take on.
Therefore, we proceed by stratifying the data using the following iterative procedure.
In the notation below, let \(N_0(j)\) and \(N_1(j)\) be the numbers of non-responders and responders in substratum \(j\). Similarly, let \(\bar l_0(j)\) and \(\bar l_1(j)\) be the average values of the linearized response propensity (log-odds) within the stratum. Finally, let \(S^2(j)\) be the pooled sample variance of the linearized response propensity within stratum \(j\).
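Concretely, a stratum \(j\) is split at the median of the estimated propensity score whenever the absolute value of the two-sample t-statistic comparing the linearized propensities of responders and non-responders exceeds the cutoff (t_max below), provided each resulting substratum retains more than N_min responders and N_min non-responders. The statistic computed by tstat() below is
\[ t(j) = \frac{\bar l_1(j) - \bar l_0(j)}{\sqrt{S^2(j)\Big(\frac{1}{N_1(j)} + \frac{1}{N_0(j)}\Big)}}, \qquad S^2(j) = \frac{\sum_{i \in j: R_i = 0} (l_i - \bar l_0(j))^2 + \sum_{i \in j: R_i = 1} (l_i - \bar l_1(j))^2}{N_0(j)+N_1(j)-2}. \]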
# Define the model/ data to assess balance on
xx.f <- xx_3.f %>% mutate(r=r_3)
####
## Parameters
####
# t-statistic cutoff value
t_max <- 1
# Minimum sample size in substratum
N_min <- 10
####
# t-statistic function
####
tstat <- function(X)
{
temp1 <- X %>% group_by(J,r) %>% summarise(Nj=n(), lj = mean(logodds))
temp2 <- X %>% inner_join(temp1,c("J","r")) %>% group_by(J,r) %>%
mutate(s1=sum((logodds-lj)^2)) %>% select(J,r,Nj,lj,s1) %>% unique() %>%
reshape2::melt(id.vars=c("J","r")) %>%
reshape2::dcast(J~variable+r) %>% mutate(S2j = (1 / ((Nj_0+Nj_1)-2))*(s1_0+s1_1)) %>%
mutate(tstat = (lj_1-lj_0)/sqrt(S2j*((1/Nj_1)+(1/Nj_0))))
tt <- temp2 %>% collect() %>% .[["tstat"]]
tt[is.na(tt)] <- 0
Nj0 <- temp2 %>% collect() %>% .[["Nj_0"]]
Nj1 <- temp2 %>% collect() %>% .[["Nj_1"]]
J <- temp2 %>% collect() %>% .[["J"]]
out <- list(J = J, Nj0=Nj0,Nj1=Nj1,tt=tt)
return(out)
}
####
# Split the response data into strata.
####
# Start with J = 1
X <- xx.f %>% mutate(J =1) %>% mutate(pr=pr_3)
strata.to.split <- 1
i = 1
test <- list()
while (strata.to.split==1)
{
stats <- X %>% tstat()
cut.medscore <- X %>% group_by(J) %>%
mutate(newcut = cut(pr,breaks=c(0,median(pr),1),labels=c("l","h")),
median = median(pr)) %>% group_by(J,r,newcut) %>%
summarise(newJ=n(),median=mean(median)) %>%
reshape2::melt(id.vars=c("J","r","newcut")) %>% reshape2::dcast(J~variable+r+newcut) %>%
tbl_df() %>% rename(median=median_0_l) %>% select(-median_1_l,-median_1_h,-median_0_h)
stats <- do.call("cbind",stats) %>% tbl_df()
stats2 <- stats %>% inner_join(cut.medscore,"J") %>% mutate(abs.tt=abs(as.numeric(tt))) %>%
mutate(split = as.integer(abs.tt>t_max & newJ_0_l>N_min & newJ_1_l>N_min &
newJ_0_h>N_min & newJ_1_h>N_min)) %>%
select(J,split,median)
strata.to.split <- stats2 %>% collect() %>% .[["split"]] %>% max()
X2 <- X %>% left_join(stats2,"J") %>% ungroup() %>% group_by(J) %>%
mutate(J2 = ifelse(pr<=median,"1","2")) %>% mutate(test=ifelse(pr>median,1,0)) %>%
ungroup() %>% mutate(J = ifelse(split==1,paste0(J,J2),J)); length(unique(X2$J))
if (is.na(strata.to.split)) strata.to.split <- 0
if (max(strata.to.split)==1) X <- X2 %>% select(-split,-median,-J2)
test[[i]] <- X2 %>% filter(idnumber=="00017217") %>% select(pr,J,median)
i = i + 1
}
# Get Number of strata
J <- length(unique(X$J))
# Replace the original data with the stratified version of the data
xx.f <- X
# Attach t-statistic values to each of the J strata in the main data.
xx.f.stats <- X %>% tstat()
xx.f.stats <- do.call("cbind",xx.f.stats) %>% tbl_df() %>%
mutate(tt=round(as.numeric(tt),3)) %>% select(J,tt)
Within each of the J strata we can examine the density of the response propensity scores among responders and non-responders; the table below reports the corresponding counts of non-responders and responders in each stratum.
Stratum | Non-Responders | Responders |
---|---|---|
111111111 | 160 | 38 |
111111112 | 138 | 57 |
11111112 | 287 | 106 |
11111121 | 275 | 118 |
11111122 | 254 | 138 |
111112 | 1043 | 528 |
1111211 | 508 | 276 |
1111212111 | 71 | 31 |
1111212112 | 62 | 32 |
111121212 | 121 | 75 |
111121221 | 126 | 70 |
1111212221 | 65 | 34 |
1111212222 | 48 | 48 |
111122 | 957 | 609 |
111211 | 939 | 633 |
111212 | 894 | 672 |
11122 | 1708 | 1428 |
11211 | 1694 | 1441 |
112121111 | 115 | 88 |
112121112 | 94 | 100 |
112121121 | 104 | 95 |
112121122 | 118 | 75 |
112121211 | 112 | 84 |
112121212 | 97 | 98 |
1121212211 | 60 | 39 |
1121212212 | 41 | 55 |
112121222 | 89 | 105 |
112122 | 815 | 750 |
1122 | 3117 | 3148 |
12111 | 1481 | 1658 |
12112 | 1394 | 1738 |
12121 | 1391 | 1745 |
12122 | 1321 | 1813 |
122111 | 641 | 928 |
12211211 | 158 | 238 |
12211212 | 143 | 245 |
122112211 | 78 | 118 |
1221122121 | 42 | 56 |
1221122122 | 29 | 69 |
12211222 | 137 | 254 |
1221211 | 284 | 503 |
1221212 | 252 | 530 |
122122 | 430 | 1136 |
12221111 | 126 | 267 |
12221112 | 102 | 290 |
1222112 | 195 | 590 |
12221211 | 85 | 307 |
12221212 | 103 | 289 |
12221221 | 88 | 305 |
12221222 | 79 | 311 |
122221 | 279 | 1294 |
1222221111 | 22 | 77 |
1222221112 | 17 | 82 |
122222112 | 26 | 168 |
12222212 | 54 | 335 |
12222221 | 54 | 337 |
1222222211 | 14 | 84 |
1222222212 | 11 | 86 |
122222222 | 16 | 177 |
We next check overall balance of the covariates across responders and non-responders. We have \(K\) covariates in the response propensity model, and to assess balance across strata we will analyze each of these covariates \(X_{ik}, k = 1, ..., K\) as an “outcome.” Specifically, in stratum \(j\) the difference between responding and non-responding units is estimated by
\[ \hat \tau_k^X(j) = \bar X_{1,k}(j) - \bar X_{0,k}(j) \] with sampling variance estimated as \[ \hat V_k^X(j) = s_k^2(j) \cdot \bigg( \frac{1}{N_1(j)}+\frac{1}{N_0(j)} \bigg) \] where \[ s_k^2(j) = \frac{1}{N_0(j)+N_1(j)-2}\bigg( \sum_{i: B_i(j)=1} (1-R_i) \cdot (X_{ik}-\bar X_{0k}(j))^2 + \sum_{i: B_i(j)=1} R_i \cdot (X_{ik}-\bar X_{1k}(j))^2\bigg) \] and \(B_i(j) = 1\) indicates that unit \(i\) belongs to stratum \(j\).
We then take a weighted average of these estimates across strata
\[ \hat \tau_k^X = \sum_{j=1}^J \bigg( \frac{N_0(j)+N_1(j)}{N} \bigg) \hat \tau_k^X(j) \] with variance \[ \hat V_k^X = \sum_{j=1}^J \bigg( \frac{N_0(j)+N_1(j)}{N} \bigg)^2 \cdot \hat V_k^X(j) \] and convert it into a z-value for a two-sided test of whether \(\hat \tau_k^X\) is equal to zero: \[ z_k = \frac{\hat \tau_k^X}{\sqrt{\hat V_k^X}} \]
####
# z-scores for balance
####
# Get only the baseline covariates (i.e., not interaction terms)
X1 <- expand.model.frame(m_base,~idnumber)
X2 <- model.matrix(m_base) %>% data.frame() %>% select(-1)
colnames(X2) <- paste0("xx",1:ncol(X2))
X3 <- xx.f %>% select(J,idnumber) %>% mutate(idnumber=as.character(idnumber))
X <- cbind.data.frame(X1,X2) %>% select(idnumber,r=r_3,contains("xx")) %>% inner_join(X3,"idnumber")
# Get tau (i.e., the difference in mean covariate values between respondents
# and non-respondents in each stratum)
temp1 <- X %>% group_by(J,r) %>% summarise_each(funs(mean) ,contains("xx"))
tauJ <- temp1 %>% mutate_each(funs(diff),contains("xx")) %>% group_by(J,r) %>% filter(n()==1)
colnames(tauJ) = gsub("^xx","tau_xx",colnames(temp1))
# Get Sample Sizes by stratum (overall, responders, nonresponders)
NJr <- X %>% group_by(J,r) %>% summarise(Nj=n())
NJ <- NJr %>% reshape2::melt(id.vars=c("J","r")) %>%
reshape2::dcast(J~variable+r) %>% mutate(Nj = Nj_0 + Nj_1)
vv <- function(x) (x-mean(x))^2
s2J <- X %>% group_by(J,r) %>% mutate_each(funs(vv),contains("xx")) %>%
summarise_each(funs(sum), contains("xx")) %>% group_by(J) %>%
summarise_each(funs(sum),contains("xx")) %>% left_join(NJ,"J") %>%
mutate(temp=1/(Nj_1+Nj_0-2)) %>% mutate_each(funs(.*temp),contains("xx"))
VkJ <- s2J %>% mutate(temp=(1/Nj_0)+(1/Nj_1)) %>% mutate_each(funs(.*temp),contains("xx"))
tau <- tauJ %>% group_by(J) %>% filter(row_number()==1) %>% select(-r) %>%
left_join(NJ,"J") %>% ungroup() %>% mutate(N=sum(Nj)) %>%
mutate(weight = (Nj_0+Nj_1)/N) %>% mutate_each(funs(.*weight),contains("xx")) %>%
summarise_each(funs(sum), contains("xx"))
V <- VkJ %>% ungroup() %>% mutate(N=sum(Nj)) %>%
  mutate(weight = ((Nj_0+Nj_1)/N)^2) %>% mutate_each(funs(.*weight),contains("xx")) %>%
  summarise_each(funs(sum), contains("xx"))
# z-score for each covariate (1 x K matrix)
z <- (tau / sqrt(V)) %>% as.matrix()
pval <- 2*pnorm(-abs(t(z)))
uniform <- seq(0,1,0.01)
empirical <- c()
for (pp in 1:length(uniform)) {
empirical[pp] <- (1/length(pval))*sum(as.integer(pval[,1]<=uniform[pp]))
}
To assess overall balance we obtain the p-values associated with these z-scores and compare them to the uniform distribution, asking: what fraction of the empirical p-values is less than a given percentile of the uniform(0,1) distribution? In the plot below, the empirical fraction of p-values below each uniform(0,1) percentile lies above the 45-degree line and the mean p-value exceeds 0.50, indicating good overall balance between responders and non-responders on all the included predictors.
meanval <- mean(pval[,1])
pdata <- cbind.data.frame(uniform=uniform,empirical=empirical)
p <- pdata %>% ggplot(aes(x=uniform,y=uniform))+geom_line(lty=2)+geom_step(aes(x=empirical))+
annotate("text",x=.5,y=0.25,label=paste("Mean Value for p-values:",round(meanval,3)),hjust=0)
p + theme_tufte() + theme(text=element_text(family="Gill Sans"))+xlab("Empirical\nShare of p-values less than Y-value")+
ylab("Uniform")
Balance Check for FU3 Weight
Having settled on a response propensity model with a reasonable degree of balance across responders and non-responders, we construct a non-response weight for each responding observation within the strata. These observations get a weight equal to the inverse of the average observed response rate within each stratum.
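That is, letting \(\bar r(j)\) denote the observed response rate in stratum \(j\), each unit receives
\[ w_i = \begin{cases} \dfrac{1}{\bar r(j)} = \dfrac{N_0(j) + N_1(j)}{N_1(j)} & \text{if } R_i = 1, \\ 0 & \text{if } R_i = 0. \end{cases} \]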
# Construct Weights
xx_3.f <- xx.f %>% mutate(w.FU3=1)
xx_3.f$cell.3 <- xx_3.f %>% group_by(J) %>% group_indices()
xx_3.f <- xx_3.f %>% arrange(cell.3) %>% group_by(cell.3) %>%
mutate(rr.cell = mean(r_3))
xx_3.f[xx_3.f$r_3==1,]$w.FU3 = xx_3.f[xx_3.f$r_3==1,]$w.FU3 * (1/xx_3.f[xx_3.f$r_3==1,]$rr.cell)
xx_3.f[xx_3.f$r_3==0,]$w.FU3 = 0
# Validate the weights sum to the target population size (i.e., FU1 sample size)
sum(xx_3.f$w.FU3)
## [1] 50165
round(sum(xx_3.f$w.FU3),0)== dim(xx_3)[1]
## [1] TRUE
We’ll now go through a similar process to construct the FU2 weights as well.
xx.f <- xx_2.f %>% mutate(r=r_2)
####
# Split the FU2 response data into strata.
####
# Start with J = 1
X <- xx.f %>% mutate(J =1) %>% mutate(pr=pr_2)
strata.to.split <- 1
i = 1
test <- list()
while (strata.to.split==1)
{
stats <- X %>% tstat()
cut.medscore <- X %>% group_by(J) %>%
mutate(newcut = cut(pr,breaks=c(0,median(pr),1),labels=c("l","h")),
median = median(pr)) %>% group_by(J,r,newcut) %>%
summarise(newJ=n(),median=mean(median)) %>%
reshape2::melt(id.vars=c("J","r","newcut")) %>%
reshape2::dcast(J~variable+r+newcut) %>% tbl_df() %>%
rename(median=median_0_l) %>% select(-median_1_l,-median_1_h,-median_0_h)
stats <- do.call("cbind",stats) %>% tbl_df()
stats2 <- stats %>% inner_join(cut.medscore,"J") %>%
mutate(abs.tt=abs(as.numeric(tt))) %>%
mutate(split = as.integer(abs.tt>t_max & newJ_0_l>N_min &
newJ_1_l>N_min & newJ_0_h>N_min & newJ_1_h>N_min)) %>%
select(J,split,median)
strata.to.split <- stats2 %>% collect() %>% .[["split"]] %>% max()
X2 <- X %>% left_join(stats2,"J") %>% ungroup() %>% group_by(J) %>%
mutate(J2 = ifelse(pr<=median,"1","2")) %>% mutate(test=ifelse(pr>median,1,0)) %>%
ungroup() %>% mutate(J = ifelse(split==1,paste0(J,J2),J)); length(unique(X2$J))
if (is.na(strata.to.split)) strata.to.split <- 0
if (max(strata.to.split)==1) X <- X2 %>% select(-split,-median,-J2)
test[[i]] <- X2 %>% filter(idnumber=="00017217") %>% select(pr,J,median)
i = i + 1
cat(length(unique(X$J)))
}
## 2481523313948556262
#do.call("rbind",test)
# Get Number of strata
J <- length(unique(X$J)); J
## [1] 62
# Replace the original data with the stratified version of the data
xx.f <- X
# Attach t-statistic values to each of the J strata in the main data.
xx.f.stats <- X %>% tstat()
xx.f.stats <- do.call("cbind",xx.f.stats) %>% tbl_df() %>%
mutate(tt=round(as.numeric(tt),3)) %>%
select(J,tt)
####
# z-scores for balance
####
# Obtain the baseline covariates (i.e., no interaction terms) for balance checking
if (any(grepl("\\*", vars2))) xx.base <- vars2[-grep("\\*", vars2)] else xx.base <- vars2
m_base <- glm(tofmla(y="r_2",x=xx.base),family="binomial",data=xx_2)
# Get only the baseline covariates (i.e., not interaction terms)
X1 <- expand.model.frame(m_base,~idnumber)
X2 <- model.matrix(m_base) %>% data.frame() %>% select(-1)
colnames(X2) <- paste0("xx",1:ncol(X2))
X3 <- xx.f %>% select(J,idnumber) %>% mutate(idnumber=as.character(idnumber))
X <- cbind.data.frame(X1,X2) %>% select(idnumber,r=r_2,contains("xx")) %>% inner_join(X3,"idnumber")
# Get tau (i.e., the difference in mean covariate values between respondents
# and non-respondents in each stratum)
temp1 <- X %>% group_by(J,r) %>% summarise_each(funs(mean) ,contains("xx"))
tauJ <- temp1 %>% mutate_each(funs(diff),contains("xx")) %>% group_by(J,r) %>% filter(n()==1)
colnames(tauJ) = gsub("^xx","tau_xx",colnames(temp1))
# Get Sample Sizes by stratum (overall, responders, nonresponders)
NJr <- X %>% group_by(J,r) %>% summarise(Nj=n())
NJ <- NJr %>% reshape2::melt(id.vars=c("J","r")) %>%
reshape2::dcast(J~variable+r) %>% mutate(Nj = Nj_0 + Nj_1)
vv <- function(x) (x-mean(x))^2
s2J <- X %>% group_by(J,r) %>% mutate_each(funs(vv),contains("xx")) %>%
summarise_each(funs(sum), contains("xx")) %>% group_by(J) %>%
summarise_each(funs(sum),contains("xx")) %>% left_join(NJ,"J") %>%
mutate(temp=1/(Nj_1+Nj_0-2)) %>% mutate_each(funs(.*temp),contains("xx"))
VkJ <- s2J %>% mutate(temp=(1/Nj_0)+(1/Nj_1)) %>% mutate_each(funs(.*temp),contains("xx"))
tau <- tauJ %>% group_by(J) %>% filter(row_number()==1) %>% select(-r) %>%
left_join(NJ,"J") %>% ungroup() %>% mutate(N=sum(Nj)) %>%
mutate(weight = (Nj_0+Nj_1)/N) %>% mutate_each(funs(.*weight),contains("xx")) %>%
summarise_each(funs(sum), contains("xx"))
V <- VkJ %>% ungroup() %>% mutate(N=sum(Nj)) %>%
  mutate(weight = ((Nj_0+Nj_1)/N)^2) %>% mutate_each(funs(.*weight),contains("xx")) %>%
  summarise_each(funs(sum), contains("xx"))
# z-score for each covariate (1 x K matrix)
z <- (tau / sqrt(V)) %>% as.matrix()
pval <- 2*pnorm(-abs(t(z)))
uniform <- seq(0,1,0.01)
empirical <- c()
for (pp in 1:length(uniform)) {
empirical[pp] <- (1/length(pval))*sum(as.integer(pval[,1]<=uniform[pp]))
}
# Construct FU2 Weights
xx_2.f <- xx.f %>% mutate(w.FU2=1)
xx_2.f$cell.2 <- xx_2.f %>% group_by(J) %>% group_indices()
xx_2.f <- xx_2.f %>% arrange(cell.2) %>% group_by(cell.2) %>% mutate(rr.cell = mean(r_2))
xx_2.f[xx_2.f$r_2==1,]$w.FU2 = xx_2.f[xx_2.f$r_2==1,]$w.FU2 * (1/xx_2.f[xx_2.f$r_2==1,]$rr.cell)
xx_2.f[xx_2.f$r_2==0,]$w.FU2 = 0
# Validate that the weights sum to the target population size (i.e., the FU1 sample size)
sum(xx_2.f$w.FU2)
## [1] 50165
# NOTE: the unrounded comparison below fails only because of floating-point
# error; rounding first, as in the FU3 check above, returns TRUE.
sum(xx_2.f$w.FU2) == dim(xx_2)[1]
## [1] FALSE
meanval <- mean(pval[,1])
pdata <- cbind.data.frame(uniform=uniform,empirical=empirical)
p <- pdata %>% ggplot(aes(x=uniform,y=uniform))+geom_line(lty=2)+geom_step(aes(x=empirical))+
annotate("text",x=.5,y=0.25,label=paste("Mean Value for p-values:",round(meanval,3)),hjust=0)
p + theme_tufte() + theme(text=element_text(family="Gill Sans"))+
xlab("Empirical\nShare of p-values less than Y-value")+ylab("Uniform")
Balance Check for FU2 Weight
The next set of challenges relates to the fact that we have a multi-wave survey, though some of our analyses may rely only on cross-sectional data. Here we draw heavily on the work of Little and David (1983) on combining nonresponse weights across follow-up waves of the SCCS. The discussion of similar issues in Chen et al. (2012) also informs the notation and discussion below.
With multiple waves of the SCCS we must deal with survey attrition due to moving, loss to follow-up, or death. However, given that the sampling frame for the SCCS follow-up surveys targets previous nonresponders, we also have reentry among people who skip a particular follow-up but respond to a later one. For FU3 we have roughly 3,000–4,000 cases that meet these criteria.
To develop the weighting adjustment for attrition and reentry we need a bit of notation. Since we have defined our target population based on the responders to the baseline survey (see above) we essentially have a three-wave panel.
With this in mind, let \(r = (r_1, r_2, r_3)\) be the response status (0/1) in each follow-up wave \(k\). Let \(z\) denote a set of design variables (e.g., general population sample, community health center [??], etc.), and let \(x = (x_1, x_2, x_3)\) reflect variables collected in each follow-up round, where \(x_k\) is observed when \(r_k = 1\) and missing when \(r_k = 0\).
We then fit a sequence of logit regressions in the spirit of Little and David (1983): \(r_1\) on \(z\); \(r_2\) on \(z\) and \(x_1\) among FU1 respondents; and \(r_3\) on \(z\), \(x_1\), and \(x_2\) among respondents to both FU1 and FU2. Each wave's nonresponse weight is then constructed from the inverse of the cumulative product of the fitted response probabilities.
With these weights in hand, \(w_1\) is the initial nonresponse weight, while \(w_2\) can be used for any cross-sectional analyses of respondents to the second follow-up as well as any longitudinal analyses that include respondents to follow-ups 1 and 2. In addition, \(w_3\) can be used for cross-sectional analyses of follow-up 3 and for longitudinal analyses of respondents to all three waves.
We also have some cases with respondents to FU1 and FU3 but not FU2 (the “101” pattern below). The approach above can be amended to include an additional weight based on regressing \(r_3\) on \(z\) and \(x_1\) alone. The weights \(w_{k.1}\) can then be constructed so that they use the appropriate predicted value depending on whether information on \(x_2\) is available (i.e., whether \(r_2\) is 0 or 1).
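As a sketch of this sequential construction (hypothetical data frame dd; z1 and z2 stand in for the design variables, x1 and x2 for the wave-specific variables; one common construction under sequential ignorability, not necessarily the final implementation):
# Wave 2: response model among FU1 respondents, using design and FU1 variables
m2 <- glm(r_2 ~ z1 + z2 + x1, family = binomial, data = subset(dd, r_1 == 1))
dd$p2 <- predict(m2, newdata = dd, type = "response")
# Wave 3: response model among respondents to both prior waves
m3 <- glm(r_3 ~ z1 + z2 + x1 + x2, family = binomial,
          data = subset(dd, r_1 == 1 & r_2 == 1))
dd$p3 <- predict(m3, newdata = dd, type = "response")
# Variant for the 101 pattern: predict r_3 from z and x1 only (x2 unobserved)
m3.1 <- glm(r_3 ~ z1 + z2 + x1, family = binomial, data = subset(dd, r_1 == 1))
dd$p3.1 <- predict(m3.1, newdata = dd, type = "response")
# Cumulative inverse-probability weights
dd$w2 <- ifelse(dd$r_2 == 1, 1 / dd$p2, 0)
dd$w3 <- with(dd, ifelse(r_3 == 1,
                         ifelse(r_2 == 1, 1 / (p2 * p3), 1 / p3.1), 0))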
The observed response patterns (FU1 FU2 FU3) are distributed as follows:

Response Pattern | Percent |
---|---|
000 | 32.1 |
111 | 29.6 |
100 | 15.5 |
110 | 13.1 |
101 | 4.2 |
010 | 2.9 |
011 | 2.6 |
001 | 0.1 |
Response Pattern | Percent (among respondents to at least one follow-up) |
---|---|
111 | 43.6 |
100 | 22.8 |
110 | 19.3 |
101 | 6.2 |
010 | 4.2 |
011 | 3.8 |
001 | 0.2 |
For our within-state comparisons we’ll fit the following model: \[ h(E(Y_{itk})) = \beta_0 + \beta_1 \mathrm{Year}_t + \beta_2 \mathrm{Tx}_i + \beta_3 \mathrm{PostExpansion}_t \times \mathrm{Tx}_i + \beta_4 \mathrm{State}_k + \beta_5 \mathrm{State}_k \times \mathrm{Year}_t + \beta_6 \mathrm{State}_k \times \mathrm{Tx}_i + \beta_7 \mathrm{PostExpansion}_t \times \mathrm{Tx}_i \times \mathrm{ExpansionState}_k + \beta_8 \, \mathrm{spline}(\mathrm{Age}) + \alpha_i \] For now I have fit linear models with a respondent random effect (\(\alpha_i\)) to capture the longitudinal nature of the data. We have few patient characteristics, but I did include a restricted cubic spline for age. The comparison group is simply respondents over age 65.
Here we restrict to the non-elderly and fit the following model: \[ h(E(Y_{itk})) = \beta_0 + \beta_1 \mathrm{Year}_t + \beta_2 \mathrm{State}_k + \beta_3 \mathrm{PostExpansion}_t \times \mathrm{ExpansionState}_k + \beta_4 \, \mathrm{spline}(\mathrm{Age}) + \alpha_i \] Again, I have fit a linear model with a respondent random effect and a restricted cubic spline for age.
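Both models are straightforward to express with lme4; the sketch below is illustrative only (df and the outcome unins are placeholders, and the regressor names are patterned on the variables appearing in the tables that follow):
# Sketch: linear mixed models with a respondent random intercept
library(lme4)
library(rms)   # for the rcs() restricted cubic spline
# Within-state model (treated vs. respondents over 65, triple-difference term)
m_within <- lmer(unins ~ year + tx + I(post * tx) + state + state:year +
                   state:tx + I(post * tx * ss.expansion) +
                   rcs(enrollment_age) + (1 | idnumber), data = df)
# Between-state model (non-elderly only)
m_between <- lmer(unins ~ year + state + I(post * ss.expansion) +
                    rcs(enrollment_age) + (1 | idnumber),
                  data = subset(df, enrollment_age < 65))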
Dependent variable: r

 | Within-State Model (1) | Between-State Model (2) |
---|---|---|
I(post * tx) | 0.040*** (0.005) | |
I(post * tx * ss.expansion) | -0.011 (0.011) | |
I(post * ss.expansion) | | 0.015** (0.007) |
Observations | 104,966 | 62,720 |
Log Likelihood | -66,099.000 | -40,307.000 |
Akaike Inf. Crit. | 132,286.000 | 80,655.000 |
Bayesian Inf. Crit. | 132,707.000 | 80,836.000 |

Note: *p<0.1; **p<0.05; ***p<0.01
Dependent variable: r

 | Nonexpansion (1) | Expansion (2) |
---|---|---|
I(post * tx) | 0.040*** (0.005) | 0.029*** (0.009) |
Observations | 83,506 | 21,460 |
Log Likelihood | -53,176.000 | -12,904.000 |
Akaike Inf. Crit. | 106,421.000 | 25,840.000 |
Bayesian Inf. Crit. | 106,738.000 | 25,968.000 |

Note: *p<0.1; **p<0.05; ***p<0.01
Dependent variable: unins

 | Within: Unweighted (1) | Within: Non-Response Weighted (2) | Between: Unweighted (3) | Between: Non-Response Weighted (4) |
---|---|---|---|---|
I(post * tx) | -0.063*** (0.005) | -0.069*** (0.005) | | |
I(post * tx * ss.expansion) | -0.068*** (0.010) | -0.074*** (0.010) | | |
ss.expansion:post | | | -0.063*** (0.007) | -0.072*** (0.008) |
Observations | 77,217 | 77,217 | 46,121 | 46,121 |
Log Likelihood | -21,325.000 | -23,386.000 | -19,760.000 | -20,797.000 |
Akaike Inf. Crit. | 42,763.000 | 46,884.000 | 39,562.000 | 41,636.000 |
Bayesian Inf. Crit. | 43,281.000 | 47,402.000 | 39,746.000 | 41,820.000 |

Note: *p<0.1; **p<0.05; ***p<0.01