1 Defining a Target Population

The first question we must ask is: what is our target population? In our case, we will draw upon outcomes from the FU1 and FU2 surveys, so we will begin by making the FU1 respondents the target. In subsequent analyses that use FU2 variables as the “baseline” value, we can construct a separate set of weights using FU2 as the target. However, one key drawback of doing so is that we end up with effectively two “target populations,” so we may want to settle on using FU1 respondents, or even baseline respondents, as the target.

1.1 Response Propensity Approach

In many ways the modeling of non-response can be thought of as an extension of propensity score modeling; indeed there is a robust statistical literature to this effect (see David 1983, for example).

Response propensity methods allow for more flexible modeling of the probability of survey response: we can accommodate continuous predictors, flexible functional forms (e.g., splines, local polynomial regression), and a large number of variables.

Under this approach, survey response adjustments are made using the inverse of the estimated response propensities. This requires the assumption that we have fit the correct specification of the model, which can be mitigated to some degree through the flexible regressions described above.

A limitation, however, is that some individuals may have very low estimated response propensities. These individuals would receive very large nonresponse weighting adjustment factors, which would inflate the variance of the estimated parameters in our study.

One way around this would be to trim or compress very large weights (Potter 1990; Kish 1992; Meng et al. 2010). However, another approach, based on recent developments in propensity score methods (Imbens and Rubin 2015) and nonresponse stratification methods (Little 1983), may yield a more tractable alternative.
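To make the weighting and trimming steps concrete, here is a minimal sketch of inverse-propensity weighting with a simple percentile cap; the data frame dat, its columns r and pr, and the 99th-percentile cutoff are all hypothetical, not objects from our data build:

# Sketch: inverse-propensity nonresponse weights with a percentile cap.
# `dat$r` is a 0/1 response indicator; `dat$pr` is the estimated propensity.
w <- ifelse(dat$r == 1, 1 / dat$pr, 0)   # responders weighted by 1/propensity
cap <- quantile(w[dat$r == 1], 0.99)     # an arbitrary trimming threshold
w.trimmed <- pmin(w, cap)                # compress the most extreme weights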

2 Non-Response Adjustment for the FU3 Survey

2.1 Pre-specify a list of predictors

Our first step is to pre-specify a set of characteristics and administrative variables deemed a priori as potentially important for response. We can begin with a baseline response propensity specification with linear terms for each of these covariates, though we could also consider a more flexible specification that includes second-order and interaction terms (possibly selected by an iterative algorithm and checked via cross-validation of the prediction errors to mitigate the risk of overfitting).
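As one concrete illustration of such an iterative search (not necessarily the approach we will ultimately adopt), stepwise AIC selection over two-way interactions could look like the following; m0 stands for a hypothetical baseline glm() fit with linear terms only:

# Sketch: expand a linear-terms logit model toward two-way interactions,
# selecting terms by AIC; `m0` is a hypothetical baseline glm() fit.
library(MASS)
m_flex <- stepAIC(m0, scope = list(lower = formula(m0), upper = ~ .^2),
                  direction = "both", trace = FALSE)
# Predictive error could then be compared across models via K-fold CV
# (see the commented boot::cv.glm calls below).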

# Baseline X variables for predicting response
  z <- c("education","sex","maritalstatus","hhsize","employed","enrollment_source","raceanalysis","insurancecoverage")

# We can also include interaction terms very easily:
  # z <- c("education","sex","maritalstatus","hhsize","employed","state","enrollment_source","raceanalysis",
  #      "enrollment_source*enrollment_age","enrollment_source*raceanalysis","enrollment_source*education",
  #      "enrollment_source*sex","enrollment_source*maritalstatus","enrollment_source*hhsize","enrollment_source*employed",
  #      "enrollment_age*sex","enrollment_age*education","enrollment_age*maritalstatus","enrollment_age*employed")

# Variables we want to enter as a spline
  z.spline <- c("enrollment_age")

# FU variables for predicting response (e.g., lagged values of the outcomes).
######
## NOTE: The current dataset includes only a very minimal number of these variables.
## We will augment this list with additional variables in the next iteration of the data.
######
  x_2 <- c("insurancecoveragef1")
  x_3 <- c("insurancecoveragef1")

2.2 Fit the response propensity model

We next fit logistic regression models predicting survey response. Note that I have commented out the model predicting FU1 response, since we are using FU1 respondents as our “target” population.
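The formula-builder tofmla() used below is defined outside this section. A minimal sketch consistent with how it is called here, assuming the spline terms are restricted cubic splines via rms::rcs() (which matches the rcs(enrollment_age) rows in the model table below), would be:

# Assumed sketch of the tofmla() helper used throughout this section:
# builds a formula such as r_2 ~ education + ... + rcs(enrollment_age).
# Requires the rms package to be attached for rcs().
tofmla <- function(y, x, x.spline = NULL) {
  rhs <- x
  if (!is.null(x.spline)) rhs <- c(rhs, paste0("rcs(", x.spline, ")"))
  as.formula(paste(y, "~", paste(rhs, collapse = " + ")))
}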

# !!!!! TK NOTE: EVENTUALLY WE WILL NEED TO USE ADMINISTRATIVE VARIABLES FOR THIS!
# Response Indicators
xx$r_1 <- as.integer(!is.na(xx$insurancecoveragef1))
xx$r_2 <- as.integer(!is.na(xx$insurancecoveragef2))
xx$r_3 <- as.integer(!is.na(xx$insurancecoveragef3))

#!!!!!!
#!!!!!! Since our target population is responders to FU1 we restrict the master sample to them here. 
#!!!!!!
xx <- xx %>% filter(r_1==1)

# # Fit Response Model for FU1
# xx_1.full <-  xx[,intersect(colnames(xx),c("r_1","idnumber",z,z.spline))] 
# xx_1.full.desc <- Hmisc::describe(xx_1.full)
# 
# # Get only non-missing values that will be fit in the response regression
# xx_1 <- xx_1.full %>% na.omit()
# 
# # Fit Model & obtain model frame, predicted response rate, and log-odds
# m_1 <- glm(tofmla(y="r_1",x=z,x.spline="enrollment_age"),family="binomial",data=xx_1)
# xx_1.f <-  cbind.data.frame(idnumber=xx_1$idnumber,
#                             r_1=model.frame(m_1)$r_1,
#                             model.matrix(m_1)) %>% data.frame() %>% select(-X.Intercept.)
# xx_1.f$pr <- predict(m_1,type="response")
# xx_1.f$logodds =  log(xx_1.f$pr/(1-xx_1.f$pr))
# 
# # Perform K-fold cross validation in case we want to compare across models
# #cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
# #cv.err_1 <- boot::cv.glm(xx_1, m_1, cost, K = 6)$delta[2]


#######
## FU2
#######
# Fit Response Model for FU2
xx_2.full <-  xx[,intersect(colnames(xx),c("r_2","idnumber",z,z.spline,x_2))] 
xx_2.full.desc <- Hmisc::describe(xx_2.full)

# NOTE: MISSING VALUES FOR THE LAGGED FU1 VARIABLES INCLUDED IN THE MODEL
# THAT PREDICTS FU2 RESPONSE WOULD CAUSE RECORDS TO BE DROPPED BY na.omit()
# BELOW. WE RECODE THEM TO 9998 SO THOSE RECORDS ARE RETAINED.
na.to.9998 <- function(x) ifelse(is.na(x),"9998",x)
xx_2.full <- xx_2.full %>% mutate_each_(funs(na.to.9998),x_2)

# Get only non-missing values that will be fit in the response regression
xx_2 <- xx_2.full %>% na.omit()

# Fit Model & obtain model frame, predicted response rate, and log-odds
m_2 <- glm(tofmla(y="r_2",x=z,x.spline="enrollment_age"),family="binomial",data=xx_2)
m_2.1 <- glm(tofmla(y="r_2",x=c(z,x_2),x.spline="enrollment_age"),family="binomial",data=xx_2)

xx_2.f <-  cbind.data.frame(idnumber=xx_2$idnumber,
                            r_2=model.frame(m_2)$r_2,
                            model.matrix(m_2)) %>% data.frame() %>% select(-X.Intercept.)
xx_2.f$pr_2 <- predict(m_2,type="response")
xx_2.f$logodds =  log(xx_2.f$pr_2/(1-xx_2.f$pr_2))

# Perform K-fold cross validation in case we want to compare across models
# cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
# cv.err_2 <- boot::cv.glm(xx_2, m_2, cost, K = 6)$delta[2]
# cv.err_2.1 <- boot::cv.glm(xx_2, m_2.1, cost, K = 6)$delta[2]

#######
## FU3
#######
# Fit Response Model for FU3
xx_3.full <-  xx[,intersect(colnames(xx),c("r_3","idnumber",z,z.spline,x_3,x_2))] 
xx_3.full.desc <- Hmisc::describe(xx_3.full)

# NOTE: MISSING VALUES FOR THE LAGGED FU VARIABLES INCLUDED IN THE MODEL
# THAT PREDICTS FU3 RESPONSE WOULD CAUSE RECORDS TO BE DROPPED BY na.omit()
# BELOW. AS ABOVE, WE RECODE THEM TO 9998 SO THOSE RECORDS ARE RETAINED.
na.to.9998 <- function(x) ifelse(is.na(x),"9998",x)
xx_3.full <- xx_3.full %>% mutate_each_(funs(na.to.9998),c(x_3,x_2))


# Get only non-missing values that will be fit in the response regression
xx_3 <- xx_3.full %>% na.omit()

# Fit Model & obtain model frame, predicted response rate, and log-odds
m_3 <- glm(tofmla(y="r_3",x=z,x.spline="enrollment_age"),family="binomial",data=xx_3)
m_3.1 <- glm(tofmla(y="r_3",x=c(z,x_3,x_2),x.spline="enrollment_age"),
             family="binomial",data=xx_3)

xx_3.f <-  cbind.data.frame(idnumber=xx_3$idnumber,
                            r_3=model.frame(m_3)$r_3,
                            model.matrix(m_3)) %>% data.frame() %>% select(-X.Intercept.)
xx_3.f$pr_3 <- predict(m_3,type="response")
xx_3.f$logodds =  log(xx_3.f$pr_3/(1-xx_3.f$pr_3))

# Perform K-fold cross validation in case we want to compare across models
#cv.err_2 <- boot::cv.glm(xx_2, m_2, cost, K = 6)$delta[2]


####
# Fit a model with only the baseline covariates (i.e., no interaction terms) for balance checking later
####
xx.base <- xx[,-grep("f1$|f2$|f3$",names(xx))] %>% select(-peremployed_tract)
xx.base<- colnames(xx.base[,grep(paste0(colnames(model.frame(m_3)[,-1]),collapse="|"),
                                 colnames(xx.base))])
vars <- c(z,z.spline,x_2,x_3)
vars2 <- c(z,z.spline,x_2)
if (max(as.integer(grepl("\\*",c(vars))))==1) xx.base <-vars[-grep("\\*",c(vars))] else xx.base <- vars
m_base <- glm(tofmla(y="r_3",x=xx.base),family="binomial",data=xx_3)
stargazer::stargazer(m_2,m_3,title="Response Propensity Model for SCCS FU2 and FU3",
           ci=FALSE,font.size="scriptsize",omit.stat = c("adj.rsq","rsq","f","ser"),type="html")
Response Propensity Model for SCCS FU2 and FU3
                                      Dependent variable:
                                      r_2                   r_3
                                      (1)                   (2)
education2                            0.165*** (0.041)      0.240*** (0.042)
education3                            0.411*** (0.040)      0.503*** (0.040)
education4                            0.443*** (0.055)      0.556*** (0.053)
education5                            0.488*** (0.042)      0.615*** (0.042)
education6                            0.666*** (0.051)      0.811*** (0.049)
education7                            0.862*** (0.069)      1.060*** (0.062)
education8                            0.973*** (0.099)      1.208*** (0.086)
education8888                         0.551 (1.324)         -0.384 (1.306)
education9999                         -0.039 (1.418)        0.753 (1.417)
sexM                                  -0.389*** (0.021)     -0.382*** (0.020)
maritalstatus2                        -0.238*** (0.025)     -0.229*** (0.023)
maritalstatus3                        -0.230*** (0.037)     -0.261*** (0.035)
maritalstatus4                        -0.103*** (0.030)     -0.124*** (0.028)
maritalstatus8888                     -0.804 (1.066)        -1.360 (1.238)
hhsize(2,3]                           -0.082*** (0.027)     -0.053** (0.025)
hhsize(3,4]                           -0.060* (0.033)       -0.077** (0.031)
hhsize(4,5]                           -0.062 (0.045)        -0.065 (0.042)
hhsize(5,6]                           -0.206*** (0.063)     0.014 (0.061)
hhsize(6,8.89e+03]                    -0.109 (0.068)        0.039 (0.066)
hhsize(8.89e+03,1e+04]                0.027 (0.680)         1.308* (0.755)
employed1                             0.053** (0.022)       0.132*** (0.020)
employed7777                          -0.070 (0.107)        0.005 (0.105)
enrollment_sourceM                    0.996*** (0.035)      0.903*** (0.030)
enrollment_sourceT                    0.762*** (0.169)      0.697*** (0.140)
raceanalysis2                         -0.163*** (0.023)     -0.156*** (0.021)
raceanalysis3                         -0.245** (0.103)      -0.216** (0.097)
raceanalysis4                         -0.354 (0.269)        -0.209 (0.252)
raceanalysis5                         -0.269 (0.164)        -0.058 (0.160)
raceanalysis6                         -0.415*** (0.142)     -0.455*** (0.139)
raceanalysis7                         -0.126* (0.071)       -0.115* (0.066)
raceanalysis8888                      9.754 (48.650)        0.578 (0.870)
raceanalysis9999                      0.108 (1.194)         0.792 (1.192)
insurancecoverage1                    0.049** (0.022)       0.011 (0.021)
insurancecoverage8888                 -0.564 (0.717)        -0.601 (0.754)
insurancecoverage9999                 0.741 (0.578)         0.504 (0.500)
rcs(enrollment_age)enrollment_age     0.032*** (0.009)      0.038*** (0.009)
rcs(enrollment_age)enrollment_age’    -0.163** (0.079)      -0.269*** (0.075)
rcs(enrollment_age)enrollment_age’’   0.547** (0.257)       0.865*** (0.244)
rcs(enrollment_age)enrollment_age’’’  -0.687** (0.278)      -1.013*** (0.264)
Constant                              -0.924** (0.389)      -1.822*** (0.375)
Observations                          50,165                50,165
Log Likelihood                        -29,880.000           -32,735.000
Akaike Inf. Crit.                     59,839.000            65,550.000
Note: *p<0.1; **p<0.05; ***p<0.01

2.3 Construct response propensity strata

After estimating the response propensity model, we next assess the adequacy of the predicted response score. To do this we exploit a feature of the true response propensity score: the independence of the response indicator and the baseline characteristics given the score. Ideally, strata would be constructed by pairing individuals with the same value of the score; however, this isn’t feasible given the essentially continuous range of values the score can take on.

Therefore, we proceed by stratifying the data using the following iterative procedure.

  • We begin with one stratum (J=1) and test whether the current number of strata is sufficient, using a t-statistic that compares the mean values for responders and non-responders of the estimated linearized response propensity score (the log odds ratio), defined as \[ \hat l(x) = \mathrm{ln}\bigg ( \frac{\hat e(x)}{1-\hat e(x)} \bigg ) \] We focus on the log odds ratio because it is more likely to have an (approximately) normal distribution (see figure below).

  • If there is evidence that the mean values of the log odds ratio of the estimated score for responders and non-responders are unequal in a given stratum (for example, if the t-statistic exceeds a pre-determined cutoff value of 1), then the stratum is split in two at the median value of the estimated score within the stratum. Each split is subject not only to the pre-determined t-value cutoff, but also to pre-specified minimum sample sizes within the resulting substrata.

In the notation below, let \(N_0(j)\) and \(N_1(j)\) be the subsample sizes of non-responders and responders in stratum \(j\). Similarly, let \(\bar l_0(j)\) and \(\bar l_1(j)\) be the corresponding average values of the linearized response propensity within the stratum. Finally, let \(S^2(j)\) be the pooled sample variance of the linearized response propensity within stratum \(j\).
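With this notation, the t-statistic computed within each stratum (implemented in tstat() below) is \[ t(j) = \frac{\bar l_1(j) - \bar l_0(j)}{\sqrt{S^2(j)\Big(\frac{1}{N_0(j)}+\frac{1}{N_1(j)}\Big)}}, \qquad S^2(j) = \frac{\sum_{i \in j}(1-R_i)\big(l_i - \bar l_0(j)\big)^2 + \sum_{i \in j} R_i \big(l_i - \bar l_1(j)\big)^2}{N_0(j)+N_1(j)-2} \]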

# Define the model/ data to assess balance on
xx.f <- xx_3.f %>% mutate(r=r_3)

####
## Parameters
####

# t-statistic cutoff value
t_max <- 1  

# Minimum sample size in substratum
N_min <- 10

####
# t-statistic function
####

tstat <- function(X) 
{
  temp1 <- X %>% group_by(J,r) %>% summarise(Nj=n(), lj = mean(logodds)) 
  temp2 <- X %>% inner_join(temp1,c("J","r")) %>% group_by(J,r) %>% 
    mutate(s1=sum((logodds-lj)^2)) %>% select(J,r,Nj,lj,s1) %>% unique() %>% 
    reshape2::melt(id.vars=c("J","r")) %>% 
    reshape2::dcast(J~variable+r)  %>% mutate(S2j = (1 / ((Nj_0+Nj_1)-2))*(s1_0+s1_1)) %>% 
    mutate(tstat = (lj_1-lj_0)/sqrt(S2j*((1/Nj_1)+(1/Nj_0))))  
  tt <- temp2 %>% collect() %>% .[["tstat"]]
  tt[is.na(tt)] <- 0
  Nj0 <- temp2 %>% collect() %>% .[["Nj_0"]] 
  Nj1 <- temp2 %>% collect() %>% .[["Nj_1"]] 
  J <- temp2 %>% collect() %>% .[["J"]]
  out <- list(J = J, Nj0=Nj0,Nj1=Nj1,tt=tt)
  return(out)
}

####
# Split the response data into strata.
####

# Start with J = 1
X <- xx.f %>% mutate(J =1) %>% mutate(pr=pr_3)
strata.to.split <- 1
i = 1
test <- list()
while (strata.to.split==1)
{
  stats <- X %>% tstat() 
  cut.medscore <- X %>% group_by(J) %>% 
    mutate(newcut = cut(pr,breaks=c(0,median(pr),1),labels=c("l","h")),
    median = median(pr)) %>% group_by(J,r,newcut) %>% 
    summarise(newJ=n(),median=mean(median)) %>% 
    reshape2::melt(id.vars=c("J","r","newcut")) %>% reshape2::dcast(J~variable+r+newcut) %>% 
    tbl_df() %>% rename(median=median_0_l) %>% select(-median_1_l,-median_1_h,-median_0_h)
  
  stats <- do.call("cbind",stats) %>% tbl_df()
  stats2 <- stats %>% inner_join(cut.medscore,"J") %>% mutate(abs.tt=abs(as.numeric(tt))) %>% 
    mutate(split = as.integer(abs.tt>t_max & newJ_0_l>N_min & newJ_1_l>N_min &
                                newJ_0_h>N_min & newJ_1_h>N_min)) %>% 
    select(J,split,median)
  
  
  strata.to.split <- stats2 %>% collect() %>% .[["split"]] %>% max()

  X2 <- X %>% left_join(stats2,"J") %>% ungroup() %>% group_by(J) %>% 
    mutate(J2 = ifelse(pr<=median,"1","2")) %>% mutate(test=ifelse(pr>median,1,0)) %>% 
    ungroup() %>% mutate(J = ifelse(split==1,paste0(J,J2),J)); length(unique(X2$J))
  
  if (is.na(strata.to.split))  strata.to.split <- 0
  
  if (max(strata.to.split)==1) X <- X2 %>% select(-split,-median,-J2)
  test[[i]] <- X2 %>% filter(idnumber=="00017217") %>% select(pr,J,median)  
  i = i + 1
}

# Get Number of strata
J <- length(unique(X$J))

# Replace the original data with the stratified version of the data
xx.f <- X

# Attach t-statistic values to each of the J strata in the main data. 
xx.f.stats <- X %>% tstat() 
xx.f.stats <-  do.call("cbind",xx.f.stats) %>% tbl_df() %>% 
  mutate(tt=round(as.numeric(tt),3)) %>% select(J,tt)

The plot below shows the density distribution of response propensity scores among responders and non-responders within each of the J strata.

Sample Sizes by Response Propensity Strata (FU3 Response)
Stratum    Non-Responders  Responders
111111111 160 38
111111112 138 57
11111112 287 106
11111121 275 118
11111122 254 138
111112 1043 528
1111211 508 276
1111212111 71 31
1111212112 62 32
111121212 121 75
111121221 126 70
1111212221 65 34
1111212222 48 48
111122 957 609
111211 939 633
111212 894 672
11122 1708 1428
11211 1694 1441
112121111 115 88
112121112 94 100
112121121 104 95
112121122 118 75
112121211 112 84
112121212 97 98
1121212211 60 39
1121212212 41 55
112121222 89 105
112122 815 750
1122 3117 3148
12111 1481 1658
12112 1394 1738
12121 1391 1745
12122 1321 1813
122111 641 928
12211211 158 238
12211212 143 245
122112211 78 118
1221122121 42 56
1221122122 29 69
12211222 137 254
1221211 284 503
1221212 252 530
122122 430 1136
12221111 126 267
12221112 102 290
1222112 195 590
12221211 85 307
12221212 103 289
12221221 88 305
12221222 79 311
122221 279 1294
1222221111 22 77
1222221112 17 82
122222112 26 168
12222212 54 335
12222221 54 337
1222222211 14 84
1222222212 11 86
122222222 16 177

2.4 Assessing global balance across strata

We next check overall balance of the covariates across responders and non-responders. We have \(K\) covariates in the response propensity model, and to assess balance across strata we will analyze each of these covariates \(X_{ik}, k = 1, ..., K\) as an “outcome.” Specifically, in stratum \(j\) the difference between responding and non-responding units is estimated by

\[ \hat \tau_k^X(j) = \bar X_{1,k}(j) - \bar X_{0,k}(j) \] with sampling variance estimated as \[ \hat V_k^X(j) = s_k^2(j) \cdot \bigg ( \frac{1}{N_1(j)}+\frac{1}{N_0(j)} \bigg) \] where \[ s_k^2(j) = \frac{1}{N_0(j)+N_1(j)-2}\bigg ( \sum_{i: B_i(j)=1} (1-R_i) \cdot (X_{ik}-\bar X_{0,k}(j))^2 + \sum_{i: B_i(j)=1} R_i \cdot (X_{ik}-\bar X_{1,k}(j))^2\bigg) \]

We then take a weighted average of these estimates across blocks

\[ \hat \tau_k^X = \sum_{j=1}^J \bigg ( \frac{N_0(j)+N_1(j)}{N} \bigg ) \hat \tau_k^X(j) \] with variance \[ \hat V_k^X = \sum_{j= 1}^J \bigg (\frac{N_0(j)+N_1(j)}{N} \bigg )^2 \cdot \hat V_k^X(j) \] and convert into a z-value for a two-sided test of whether \(\hat \tau_k^X\) is equal to zero: \[ z_k = \frac{\hat \tau_k^X}{\sqrt{\hat V_k^X}} \]

####
# z-scores for balance
####
# Get only the baseline covariates (i.e., not interaction terms) 
X1 <- expand.model.frame(m_base,~idnumber) 
X2 <- model.matrix(m_base) %>% data.frame() %>% select(-1)
colnames(X2) <- paste0("xx",1:ncol(X2))
X3 <- xx.f %>% select(J,idnumber) %>% mutate(idnumber=as.character(idnumber))
X <- cbind.data.frame(X1,X2)  %>% select(idnumber,r=r_3,contains("xx")) %>% inner_join(X3,"idnumber")

# Get tau (i.e., difference in mean value of covariates between respondents and non-respondents 
# in each stratum)
temp1 <- X %>% group_by(J,r) %>% summarise_each(funs(mean) ,contains("xx")) 
tauJ <- temp1 %>% mutate_each(funs(diff),contains("xx")) %>% group_by(J,r) %>% filter(n()==1) 
colnames(tauJ) = gsub("^xx","tau_xx",colnames(temp1))

# Get Sample Sizes by stratum (overall, responders, nonresponders)
NJr <-  X %>% group_by(J,r) %>% summarise(Nj=n()) 
NJ <-  NJr %>%  reshape2::melt(id.vars=c("J","r")) %>% 
reshape2::dcast(J~variable+r)  %>%     mutate(Nj = Nj_0 + Nj_1)

vv <- function(x) (x-mean(x))^2
s2J <- X %>% group_by(J,r) %>% mutate_each(funs(vv),contains("xx")) %>% 
  summarise_each(funs(sum), contains("xx")) %>% group_by(J) %>% 
  summarise_each(funs(sum),contains("xx")) %>% left_join(NJ,"J") %>% 
  mutate(temp=1/(Nj_1+Nj_0-2)) %>% mutate_each(funs(.*temp),contains("xx"))
VkJ <- s2J %>% mutate(temp=(1/Nj_0)+(1/Nj_1)) %>% mutate_each(funs(.*temp),contains("xx"))

tau <- tauJ %>% group_by(J) %>% filter(row_number()==1) %>% select(-r) %>% 
  left_join(NJ,"J") %>% ungroup() %>% mutate(N=sum(Nj)) %>% 
  mutate(weight = (Nj_0+Nj_1)/N) %>% mutate_each(funs(.*weight),contains("xx")) %>% 
  summarise_each(funs(sum), contains("xx"))

V <- VkJ %>% ungroup() %>% mutate(N=sum(Nj)) %>% 
  mutate(weight = ((Nj_0+Nj_1)/N)^2) %>% mutate_each(funs(.*weight),contains("xx")) %>% 
  summarise_each(funs(sum), contains("xx")) %>% as.vector()

zstat <- tau/sqrt(V)  %>% as.matrix() %>% t()   # new name so we don't overwrite the covariate list z
pval <- 2*pnorm(-abs(t(zstat)))

uniform <- seq(0,1,0.01)
empirical <- c()
for (pp in 1:length(uniform)) {
    empirical[pp] <- (1/length(pval))*sum(as.integer(pval[,1]<=uniform[pp]))
}

To assess overall balance we obtain the p-values associated with these z-scores and compare their distribution to the uniform(0,1). We then ask: what fraction of the empirical p-values is less than a given percentile of the uniform(0,1) distribution? In the plot below, the fraction of empirical p-values below each percentile of the uniform(0,1) lies above the 45-degree line and the mean p-value is above 0.50, indicating good overall balance between responders and non-responders on all included predictors.

meanval <- mean(pval[,1])
pdata <- cbind.data.frame(uniform=uniform,empirical=empirical)
p <- pdata %>% ggplot(aes(x=uniform,y=uniform))+geom_line(lty=2)+geom_step(aes(x=empirical))+
  annotate("text",x=.5,y=0.25,label=paste("Mean Value for p-values:",round(meanval,3)),hjust=0)
p + theme_tufte() + theme(text=element_text(family="Gill Sans"))+xlab("Empirical\nShare of p-values less than Y-value")+
  ylab("Uniform")
Balance Check for FU3 Weight

2.5 Construct the non-response weights

Having settled on a response propensity model with a reasonable degree of balance across responders and non-responders, we construct a non-response weight for each responding observation within the strata. These observations get a weight equal to the inverse of the average observed response rate within each stratum.

# Construct  Weights
xx_3.f <- xx.f %>% mutate(w.FU3=1)
xx_3.f$cell.3 <- xx_3.f %>% group_by(J) %>% group_indices() 

xx_3.f <- xx_3.f %>% arrange(cell.3) %>% group_by(cell.3) %>% 
  mutate(rr.cell = mean(r_3))
xx_3.f[xx_3.f$r_3==1,]$w.FU3 = xx_3.f[xx_3.f$r_3==1,]$w.FU3 * (1/xx_3.f[xx_3.f$r_3==1,]$rr.cell)
xx_3.f[xx_3.f$r_3==0,]$w.FU3 = 0

# Validate the weights sum to the target population size (i.e., FU1 sample size)
sum(xx_3.f$w.FU3) 
## [1] 50165
round(sum(xx_3.f$w.FU3),0)== dim(xx_3)[1]
## [1] TRUE

3 Construct the FU2 Weights

We’ll now go through a similar process to construct the FU2 weights as well.

xx.f <- xx_2.f %>% mutate(r=r_2)

####
# Split the FU2 response data into strata.
####

# Start with J = 1
X <- xx.f %>% mutate(J =1) %>% mutate(pr=pr_2)
strata.to.split <- 1
i = 1
test <- list()
while (strata.to.split==1)
{
  stats <- X %>% tstat() 
  cut.medscore <- X %>% group_by(J) %>% 
    mutate(newcut = cut(pr,breaks=c(0,median(pr),1),labels=c("l","h")),
    median = median(pr)) %>% group_by(J,r,newcut) %>% 
    summarise(newJ=n(),median=mean(median)) %>% 
    reshape2::melt(id.vars=c("J","r","newcut")) %>% 
    reshape2::dcast(J~variable+r+newcut) %>% tbl_df() %>% 
    rename(median=median_0_l) %>% select(-median_1_l,-median_1_h,-median_0_h)
  
  stats <- do.call("cbind",stats) %>% tbl_df()
  stats2 <- stats %>% inner_join(cut.medscore,"J") %>% 
    mutate(abs.tt=abs(as.numeric(tt))) %>% 
    mutate(split = as.integer(abs.tt>t_max & newJ_0_l>N_min &
                  newJ_1_l>N_min & newJ_0_h>N_min & newJ_1_h>N_min)) %>% 
    select(J,split,median)
  
  
  strata.to.split <- stats2 %>% collect() %>% .[["split"]] %>% max()

  X2 <- X %>% left_join(stats2,"J") %>% ungroup() %>% group_by(J) %>% 
    mutate(J2 = ifelse(pr<=median,"1","2")) %>% mutate(test=ifelse(pr>median,1,0)) %>% 
    ungroup() %>% mutate(J = ifelse(split==1,paste0(J,J2),J)); length(unique(X2$J))
  
  if (is.na(strata.to.split))  strata.to.split <- 0
  
  if (max(strata.to.split)==1) X <- X2 %>% select(-split,-median,-J2)
  test[[i]] <- X2 %>% filter(idnumber=="00017217") %>% select(pr,J,median)  
  i = i + 1
  cat(length(unique(X$J)), "")  # track the number of strata at each iteration

}
## 2 4 8 15 23 31 39 48 55 62 62
#do.call("rbind",test)

# Get Number of strata
J <- length(unique(X$J)); J
## [1] 62
# Replace the original data with the stratified version of the data
xx.f <- X

# Attach t-statistic values to each of the J strata in the main data. 
xx.f.stats <- X %>% tstat() 
xx.f.stats <-  do.call("cbind",xx.f.stats) %>% tbl_df() %>% 
  mutate(tt=round(as.numeric(tt),3)) %>% 
  select(J,tt)


####
# z-scores for balance
####

# Obtain the baseline covariates (i.e., no interaction terms) for balance checking
if (max(as.integer(grepl("\\*",c(vars2))))==1) xx.base <-vars2[-grep("\\*",c(vars2))] else xx.base <- vars2
m_base <- glm(tofmla(y="r_2",x=xx.base),family="binomial",data=xx_2)


# Get only the baseline covariates (i.e., not interaction terms) 
X1 <- expand.model.frame(m_base,~idnumber) 
X2 <- model.matrix(m_base) %>% data.frame() %>% select(-1)
colnames(X2) <- paste0("xx",1:ncol(X2))
X3 <- xx.f %>% select(J,idnumber) %>% mutate(idnumber=as.character(idnumber))
X <- cbind.data.frame(X1,X2)  %>% select(idnumber,r=r_2,contains("xx")) %>% inner_join(X3,"idnumber")

# Get tau (i.e., difference in mean value of covariates between respondents and non-respondents 
# in each stratum)
temp1 <- X %>% group_by(J,r) %>% summarise_each(funs(mean) ,contains("xx")) 
tauJ <- temp1 %>% mutate_each(funs(diff),contains("xx")) %>% group_by(J,r) %>% filter(n()==1) 
colnames(tauJ) = gsub("^xx","tau_xx",colnames(temp1))

# Get Sample Sizes by stratum (overall, responders, nonresponders)
NJr <-  X %>% group_by(J,r) %>% summarise(Nj=n()) 
NJ <-  NJr %>%  reshape2::melt(id.vars=c("J","r")) %>% 
reshape2::dcast(J~variable+r)  %>%     mutate(Nj = Nj_0 + Nj_1)

vv <- function(x) (x-mean(x))^2
s2J <- X %>% group_by(J,r) %>% mutate_each(funs(vv),contains("xx")) %>% 
  summarise_each(funs(sum), contains("xx")) %>% group_by(J) %>% 
  summarise_each(funs(sum),contains("xx")) %>% left_join(NJ,"J") %>% 
  mutate(temp=1/(Nj_1+Nj_0-2)) %>% mutate_each(funs(.*temp),contains("xx"))
VkJ <- s2J %>% mutate(temp=(1/Nj_0)+(1/Nj_1)) %>% mutate_each(funs(.*temp),contains("xx"))

tau <- tauJ %>% group_by(J) %>% filter(row_number()==1) %>% select(-r) %>% 
  left_join(NJ,"J") %>% ungroup() %>% mutate(N=sum(Nj)) %>% 
  mutate(weight = (Nj_0+Nj_1)/N) %>% mutate_each(funs(.*weight),contains("xx")) %>% 
  summarise_each(funs(sum), contains("xx"))

V <- VkJ %>% ungroup() %>% mutate(N=sum(Nj)) %>% 
  mutate(weight = ((Nj_0+Nj_1)/N)^2) %>% mutate_each(funs(.*weight),contains("xx")) %>% 
  summarise_each(funs(sum), contains("xx")) %>% as.vector()

zstat <- tau/sqrt(V)  %>% as.matrix() %>% t()   # new name so we don't overwrite the covariate list z
pval <- 2*pnorm(-abs(t(zstat)))

uniform <- seq(0,1,0.01)
empirical <- c()
for (pp in 1:length(uniform)) {
    empirical[pp] <- (1/length(pval))*sum(as.integer(pval[,1]<=uniform[pp]))
}

# Construct FU2 Weights
xx_2.f <- xx.f %>% mutate(w.FU2=1)
xx_2.f$cell.2 <- xx_2.f %>% group_by(J) %>% group_indices() 

xx_2.f <- xx_2.f %>% arrange(cell.2) %>% group_by(cell.2) %>% mutate(rr.cell = mean(r_2))
xx_2.f[xx_2.f$r_2==1,]$w.FU2 = xx_2.f[xx_2.f$r_2==1,]$w.FU2 * (1/xx_2.f[xx_2.f$r_2==1,]$rr.cell)
xx_2.f[xx_2.f$r_2==0,]$w.FU2 = 0

# Validate the weights sum to the target population size (i.e., FU1 sample size)
sum(xx_2.f$w.FU2) 
## [1] 50165
# (the strict equality likely fails only because the sum is not rounded first,
#  unlike the FU3 check above)
sum(xx_2.f$w.FU2) == dim(xx_2)[1]
## [1] FALSE
meanval <- mean(pval[,1])
pdata <- cbind.data.frame(uniform=uniform,empirical=empirical)
p <- pdata %>% ggplot(aes(x=uniform,y=uniform))+geom_line(lty=2)+geom_step(aes(x=empirical))+
  annotate("text",x=.5,y=0.25,label=paste("Mean Value for p-values:",round(meanval,3)),hjust=0)
p + theme_tufte() + theme(text=element_text(family="Gill Sans"))+
  xlab("Empirical\nShare of p-values less than Y-value")+ylab("Uniform")
Balance Check for FU2 Weight

4 Differential Changes in Uninsured Rates

5 Dealing with the Multiple Waves of the SCCS

The next set of challenges relates to the fact that we have a multi-wave survey, though some of our analyses may simply rely on cross-sectional data. Here, we need to draw heavily on the work of Little and David (1983) for how to combine nonresponse weights across follow-up waves of the SCCS. The discussion of similar issues in Chen et al. (2012) also heavily influences the notation and discussion below.

5.1 Motivation and Notation

With multiple waves of the SCCS we must deal with survey attrition due to moving, loss to follow-up, or death. However, because the sampling frame for the SCCS follow-up surveys also targets previous nonresponders, we also have reentry among people who did not respond to a particular follow-up. For FU3 we have about 3,000 to 4,000 cases that meet these criteria.

To develop the weighting adjustment for attrition and reentry we need a bit of notation. Since we have defined our target population based on the responders to the baseline survey (see above) we essentially have a three-wave panel.

With this in mind, let \(r = (r_1, r_2, r_3)\) be the response status (0/1) in each follow-up wave \(k\). Let \(z\) denote a set of design variables (e.g., general population sample, community health center [??], etc.), and let \(x = (x_1, x_2, x_3)\) denote variables collected in each follow-up round, where \(x_k\) is observed when \(r_k = 1\) and missing when \(r_k = 0\).

5.1.1 Regression Specifications

We then fit the following logit regressions:

  1. Regress \(r_1\) on \(z\) using the sample of all units observed at baseline. We then construct a weight \(w_1\), where \(w_1^{-1} = \hat p(r_1 = 1|z)\).
  2. Regress \(r_2\) on \(z\) and \(x_1\), using sampled units where \(r_1=1\). We then construct a weight \(w_2 = w_1 w_{2.1}\), where \(w_{2.1}^{-1} = \hat p(r_2=1|z,x_1,r_1=1)\).
  3. Regress \(r_3\) on \(z\), \(x_1\), and \(x_2\), using sampled units where \(r_1 = r_2 = 1\). We then construct a new weight \(w_3 = w_2 w_{3.12}\), where \(w_{3.12}^{-1} = \hat p(r_3=1|z,x_1,x_2,r_1=r_2=1)\). (A sketch of this sequential construction follows the list.)
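A minimal sketch of this sequential construction, assuming a hypothetical analysis file dat with complete (no-NA) design variables z1 and z2, response indicators r1 through r3, and wave variables x1 and x2:

# Sketch: sequential wave-nonresponse weights (all names hypothetical;
# assumes no item missingness, so fitted values align with the data rows).
m1 <- glm(r1 ~ z1 + z2,            family = binomial, data = dat)
dat$w1 <- 1 / predict(m1, type = "response")

s1 <- subset(dat, r1 == 1)                         # responders at FU1
m2 <- glm(r2 ~ z1 + z2 + x1,       family = binomial, data = s1)
s1$w2 <- s1$w1 / predict(m2, type = "response")    # w2 = w1 * w_{2.1}

s12 <- subset(s1, r2 == 1)                         # responders at FU1 and FU2
m3 <- glm(r3 ~ z1 + z2 + x1 + x2,  family = binomial, data = s12)
s12$w3 <- s12$w2 / predict(m3, type = "response")  # w3 = w2 * w_{3.12}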

With these weights in hand, \(w_1\) is the initial nonresponse weight, while \(w_2\) can be used for any cross-sectional analyses of respondents in the second follow-up as well as any longitudinal analyses that include respondents to follow-ups 1 and 2. In addition, \(w_3\) can be used for cross-sectional analyses of follow-up 3 and in longitudinal analyses of respondents to all three waves.

We also have some cases with respondents to FU1 and FU3, but not FU2. The approach above can be amended to include an additional weight based on regressing \(r_3\) on \(z\) and \(x_1\) only. The weights \(w_{k.1}\) can then be constructed to use the appropriate predicted value depending on whether information on \(x_2\) is available (i.e., whether \(r_2\) is 0 or 1).
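Continuing the sketch above, the additional weight for this response pattern might be built as follows (again with hypothetical names, and with the pattern logic as one plausible reading of the amendment):

# Sketch (continued): weight for the r = (1,0,1) pattern -- model FU3
# response using only z and x1, among all FU1 responders.
m3.1 <- glm(r3 ~ z1 + z2 + x1, family = binomial, data = s1)
s1$w3.1 <- s1$w1 / predict(m3.1, type = "response")
# Use w3 from above when r2 == 1, and w3.1 when r2 == 0 and r3 == 1.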

5.2 Wave Nonresponse Patterns in the SCCS

Wave Response Patterns for the SCCS
Response Pattern   Percent
000                32.1
111                29.6
100                15.5
110                13.1
101                 4.2
010                 2.9
011                 2.6
001                 0.1

Wave Response Patterns for the SCCS, Excluding 000s
Response Pattern   Percent
111                43.6
100                22.8
110                19.3
101                 6.2
010                 4.2
011                 3.8
001                 0.2

6 Relationship Between Non-Response and Differential Changes in Coverage

6.1 Within-State Comparisons

For our within-state comparisons we’ll fit the following model: \[ h(E(Y_{itk})) = \beta_0 + \beta_1 Year_t + \beta_2 Tx_i + \beta_3 PostExpansion_t \times Tx_i + \beta_4 State_k + \beta_5 State_k \times Year_t + \beta_6 State_k \times Tx_i + \beta_7 PostExpansion_t \times Tx_i \times ExpansionState_k + \beta_8 \mathrm{spline}(Age) + \alpha_i \] For now I have fit linear models with a respondent random effect to capture the longitudinal nature of the data. We have few patient characteristics, but I did include a restricted cubic spline for age. The comparison group is simply respondents over age 65.
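A minimal sketch of this specification using lme4 (all object and variable names here are assumptions rather than names from the actual data build; unins is the uninsured indicator and dd the analysis file):

# Sketch: within-state DD with a respondent random intercept.
# Hypothetical names: dd, unins, year, tx, post, state, ss.expansion, idnumber.
library(lme4)
library(rms)   # rcs() for the restricted cubic spline
m_within <- lmer(unins ~ year + tx + I(post * tx) + state + state:year +
                   state:tx + I(post * tx * ss.expansion) +
                   rcs(enrollment_age) + (1 | idnumber),
                 data = dd)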

6.2 Between-State Comparisons

Here we restrict to the non-elderly and fit the following model: \[ h(E(Y_{itk})) = \beta_0 + \beta_1 Year_t + \beta_2 State_k + \beta_3 PostExpansion_t \times ExpansionState_k + \beta_4 \mathrm{spline}(Age) + \alpha_i \] Again, I have fit a linear model with a respondent random effect and a restricted cubic spline for age.
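A corresponding sketch for the between-state model, under the same assumptions (nonelderly is a hypothetical indicator for the under-65 restriction):

# Sketch: between-state DD among the non-elderly (same assumed names as above).
m_between <- lmer(unins ~ year + state + I(post * ss.expansion) +
                    rcs(enrollment_age) + (1 | idnumber),
                  data = subset(dd, nonelderly == 1))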

Relationship Between Intervention Status (Non-Elderly), State Expansion Status, and Wave Response
                              Dependent variable: r
                              Within-State Model   Between-State Model
                              (1)                  (2)
I(post * tx)                  0.040***
                              (0.005)
I(post * tx * ss.expansion)   -0.011
                              (0.011)
I(post * ss.expansion)                             0.015**
                                                   (0.007)
Observations                  104,966              62,720
Log Likelihood                -66,099.000          -40,307.000
Akaike Inf. Crit.             132,286.000          80,655.000
Bayesian Inf. Crit.           132,707.000          80,836.000
Note: *p<0.1; **p<0.05; ***p<0.01
Within-State Model: Relationship Between Intervention Status (Non-Elderly) and Follow-Up Response
                      Dependent variable: r
                      Nonexpansion   Expansion
                      (1)            (2)
I(post * tx)          0.040***       0.029***
                      (0.005)        (0.009)
Observations          83,506         21,460
Log Likelihood        -53,176.000    -12,904.000
Akaike Inf. Crit.     106,421.000    25,840.000
Bayesian Inf. Crit.   106,738.000    25,968.000
Note: *p<0.1; **p<0.05; ***p<0.01

7 DD Models Predicting Uninsured Status (Weighted and Unweighted by Survey Response)

Change in Uninsured Status: Within (and Between) State Comparisons
                              Dependent variable: unins
                              Within:       Within: Non-Response   Between:      Between: Non-Response
                              Unweighted    Weighted               Unweighted    Weighted
                              (1)           (2)                    (3)           (4)
I(post * tx)                  -0.063***     -0.069***
                              (0.005)       (0.005)
I(post * tx * ss.expansion)   -0.068***     -0.074***
                              (0.010)       (0.010)
ss.expansion:post                                                  -0.063***     -0.072***
                                                                   (0.007)       (0.008)
Observations                  77,217        77,217                 46,121        46,121
Log Likelihood                -21,325.000   -23,386.000            -19,760.000   -20,797.000
Akaike Inf. Crit.             42,763.000    46,884.000             39,562.000    41,636.000
Bayesian Inf. Crit.           43,281.000    47,402.000             39,746.000    41,820.000
Note: *p<0.1; **p<0.05; ***p<0.01