The first question we must ask is: what is our target population? In our case, we will draw upon outcomes from the FU1 and FU2 surveys, so we will begin by making the FU1 respondents the target. In subsequent analyses that use FU2 variables as the “baseline” value we can construct a separate set of weights using FU2 as the target. However, a key drawback of doing this is that we end up with effectively two “target populations,” so we may want to settle on using FU1 respondents, or even baseline respondents, as the target.
In many ways the modeling of non-response can be thought of as an extension of propensity score modeling; indeed there is a robust statistical literature to this effect (see David 1983, for example).
Response propensity methods allow for more flexible modeling of the probability of survey response. In this way we can accommodate continuous predictors, allowing for flexible specifications (e.g., splines, local polynomial regression) and a large number of variables.
Under this approach, survey response adjustments are made using the inverse of the estimated response propensities. This assumes we have specified the response model correctly, an assumption that can be mitigated to some degree through the flexible regression specifications described above.
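As a minimal sketch of the idea (using a hypothetical data frame dat with response indicator r; the covariates mirror the predictor list used below):
# Sketch only: flexible logistic response-propensity model
library(rms)   # rcs() restricted cubic spline, as used in the models below
m <- glm(r ~ education + sex + rcs(enrollment_age),
         family = binomial, data = dat)
# Nonresponse adjustment factor = inverse of the estimated response propensity
dat$p_hat <- predict(m, type = "response")
dat$w_nr  <- ifelse(dat$r == 1, 1 / dat$p_hat, 0)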
A limitation, however, is that some individuals may have very low estimated response propensities. These individuals receive very large nonresponse weighting adjustment factors, which can substantially inflate the variance of the estimated parameters in our study.
One way around this is to trim or compress very large weights (Potter, 1990; Kish, 1992; Meng et al., 2010). Another approach, based on recent developments in propensity score methods (Imbens and Rubin 2015) and nonresponse stratification methods (Little 1983), may yield a more tractable solution; we pursue it below.
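For illustration, one simple trimming rule caps the adjustment factors at a chosen quantile (a sketch only, with an arbitrary cutoff; the cited papers develop more principled, data-driven procedures):
# Sketch: cap adjustment factors at the 99th percentile among responders
cap <- quantile(dat$w_nr[dat$r == 1], 0.99)
dat$w_trim <- pmin(dat$w_nr, cap)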
Our first step is to pre-specify a set of characteristics and administrative variables deemed a priori as potentially important for response. We begin with a baseline response propensity specification with linear terms for each of these covariates, though we could also choose a more flexible specification with second-order and interaction terms (possibly selected by an iterative algorithm and checked via cross-validation of the prediction errors to guard against overfitting).
# Baseline X variables for predicting response
z <- c("education","sex","maritalstatus","hhsize","employed","enrollment_source","raceanalysis","insurancecoverage")
# We can also include interaction terms very easily:
# z <- c("education","sex","maritalstatus","hhsize","employed","state","enrollment_source","raceanalysis",
# "enrollment_source*enrollment_age","enrollment_source*raceanalysis","enrollment_source*education",
# "enrollment_source*sex","enrollment_source*maritalstatus","enrollment_source*hhsize","enrollment_source*employed",
# "enrollment_age*sex","enrollment_age*education","enrollment_age*maritalstatus","enrollment_age*employed")
# Variables we want to enter as a spline
z.spline <- c("enrollment_age")
# FU variables for predicting response (e.g., lagged value of the outcomes).
######
## NOTE: The current dataset includes only a very minimal number of these variables.
## We will augment this list with additional variables in the next iteration of the data.
######
x_2 <- c("insurancecoveragef1")
x_3 <- c("insurancecoveragef2") # lagged FU2 value for the FU3 response model
We next fit logistic regression models predicting survey response. Note I have commented out the model predicting FU1 response since we are using FU1 respondents as our “target” population.
# !!!!! TK NOTE: EVENTUALLY WE WILL NEED TO USE ADMINISTRATIVE VARIABLES FOR THIS!
# Response Indicators
xx$r_1 <- as.integer(!is.na(xx$insurancecoveragef1))
xx$r_2 <- as.integer(!is.na(xx$insurancecoveragef2))
xx$r_3 <- as.integer(!is.na(xx$insurancecoveragef3))
#!!!!!!
#!!!!!! Since our target population is responders to FU1 we restrict the master sample to them here.
#!!!!!!
xx <- xx %>% filter(r_1==1)
# # Fit Response Model for FU1
# xx_1.full <- xx[,intersect(colnames(xx),c("r_1","idnumber",z,z.spline))]
# xx_1.full.desc <- Hmisc::describe(xx_1.full)
#
# # Get only the non-missing values that will be fit in the response regression
# xx_1 <- xx_1.full %>% na.omit()
#
# # Fit Model & obtain model frame, predicted response rate, and log-odds
# m_1 <- glm(tofmla(y="r_1",x=z,x.spline="enrollment_age"),family="binomial",data=xx_1)
# xx_1.f <- cbind.data.frame(idnumber=xx_1$idnumber,
# r_1=model.frame(m_1)$r_1,
# model.matrix(m_1)) %>% data.frame() %>% select(-X.Intercept.)
# xx_1.f$pr <- predict(m_1,type="response")
# xx_1.f$logodds = log(xx_1.f$pr/(1-xx_1.f$pr))
#
# # Perform K-fold cross validation in case we want to compare across models
# #cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
# #cv.err_1 <- boot::cv.glm(xx_1, m_1, cost, K = 6)$delta[2]
#######
## FU2
#######
# Fit Response Model for FU2
xx_2.full <- xx[,intersect(colnames(xx),c("r_2","idnumber",z,z.spline,x_2))]
xx_2.full.desc <- Hmisc::describe(xx_2.full)
# NOTE: SOME RESPONDENTS HAVE ITEM-MISSING VALUES FOR THE FU1 VARIABLES
# INCLUDED IN THE MODEL THAT PREDICTS FU2 RESPONSE. WE RECODE THESE TO THE
# PLACEHOLDER CATEGORY "9998" SO THE ROWS AREN'T DROPPED BY na.omit() BELOW.
na.to.9998 <- function(x) ifelse(is.na(x),"9998",x)
xx_2.full <- xx_2.full %>% mutate_each_(funs(na.to.9998),x_2)
# Get only the non-missing values that will be fit in the response regression
xx_2 <- xx_2.full %>% na.omit()
# Fit Model & obtain model frame, predicted response rate, and log-odds
m_2 <- glm(tofmla(y="r_2",x=z,x.spline="enrollment_age"),family="binomial",data=xx_2)
m_2.1 <- glm(tofmla(y="r_2",x=c(z,x_2),x.spline="enrollment_age"),family="binomial",data=xx_2)
xx_2.f <- cbind.data.frame(idnumber=xx_2$idnumber,
r_2=model.frame(m_2)$r_2,
model.matrix(m_2)) %>% data.frame() %>% select(-X.Intercept.)
xx_2.f$pr_2 <- predict(m_2,type="response")
xx_2.f$logodds = log(xx_2.f$pr_2/(1-xx_2.f$pr_2))
# Perform K-fold cross validation in case we want to compare across models
# cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
# cv.err_2 <- boot::cv.glm(xx_2, m_2, cost, K = 6)$delta[2]
# cv.err_2.1 <- boot::cv.glm(xx_2, m_2.1, cost, K = 6)$delta[2]
#######
## FU3
#######
# Fit Response Model for FU3
xx_3.full <- xx[,intersect(colnames(xx),c("r_3","idnumber",z,z.spline,x_3,x_2))]
xx_3.full.desc <- Hmisc::describe(xx_3.full)
# NOTE: FU2 NON-RESPONDENTS (AND ITEM NON-RESPONDENTS) HAVE MISSING VALUES FOR
# THE LAGGED VARIABLES INCLUDED IN THE MODEL THAT PREDICTS FU3 RESPONSE.
# WE RECODE THESE TO THE PLACEHOLDER CATEGORY "9998" SO THE ROWS AREN'T DROPPED.
na.to.9998 <- function(x) ifelse(is.na(x),"9998",x)
xx_3.full <- xx_3.full %>% mutate_each_(funs(na.to.9998),c(x_3,x_2))
# Get only the non-missing values that will be fit in the response regression
xx_3 <- xx_3.full %>% na.omit()
# Fit Model & obtain model frame, predicted response rate, and log-odds
m_3 <- glm(tofmla(y="r_3",x=z,x.spline="enrollment_age"),family="binomial",data=xx_3)
m_3.1 <- glm(tofmla(y="r_3",x=c(z,x_3,x_2),x.spline="enrollment_age"),
family="binomial",data=xx_3)
xx_3.f <- cbind.data.frame(idnumber=xx_3$idnumber,
r_3=model.frame(m_3)$r_3,
model.matrix(m_3)) %>% data.frame() %>% select(-X.Intercept.)
xx_3.f$pr_3 <- predict(m_3,type="response")
xx_3.f$logodds = log(xx_3.f$pr_3/(1-xx_3.f$pr_3))
# Perform K-fold cross validation in case we want to compare across models
#cv.err_3 <- boot::cv.glm(xx_3, m_3, cost, K = 6)$delta[2]
####
# Fit a model with only the baseline covariates (i.e., no interaction terms) for balance checking later
####
xx.base <- xx[,-grep("f1$|f2$|f3$",names(xx))] %>% select(-peremployed_tract)
xx.base<- colnames(xx.base[,grep(paste0(colnames(model.frame(m_3)[,-1]),collapse="|"),
colnames(xx.base))])
vars <- c(z,z.spline,x_2,x_3)
vars2 <- c(z,z.spline,x_2)
if (any(grepl("\\*", vars))) xx.base <- vars[-grep("\\*", vars)] else xx.base <- vars
m_base <- glm(tofmla(y="r_3",x=xx.base),family="binomial",data=xx_3)
stargazer::stargazer(m_2,m_3,title="Response Propensity Model for SCCS FU2 and FU3",
ci=FALSE,font.size="scriptsize",omit.stat = c("adj.rsq","rsq","f","ser"),type="html")
Response Propensity Model for SCCS FU2 and FU3 (coefficients with standard errors in parentheses)

 | r_2 (1) | r_3 (2) |
---|---|---|
education2 | 0.165*** (0.041) | 0.240*** (0.042) |
education3 | 0.411*** (0.040) | 0.503*** (0.040) |
education4 | 0.443*** (0.055) | 0.556*** (0.053) |
education5 | 0.488*** (0.042) | 0.615*** (0.042) |
education6 | 0.666*** (0.051) | 0.811*** (0.049) |
education7 | 0.862*** (0.069) | 1.060*** (0.062) |
education8 | 0.973*** (0.099) | 1.208*** (0.086) |
education8888 | 0.551 (1.324) | -0.384 (1.306) |
education9999 | -0.039 (1.418) | 0.753 (1.417) |
sexM | -0.389*** (0.021) | -0.382*** (0.020) |
maritalstatus2 | -0.238*** (0.025) | -0.229*** (0.023) |
maritalstatus3 | -0.230*** (0.037) | -0.261*** (0.035) |
maritalstatus4 | -0.103*** (0.030) | -0.124*** (0.028) |
maritalstatus8888 | -0.804 (1.066) | -1.360 (1.238) |
hhsize(2,3] | -0.082*** (0.027) | -0.053** (0.025) |
hhsize(3,4] | -0.060* (0.033) | -0.077** (0.031) |
hhsize(4,5] | -0.062 (0.045) | -0.065 (0.042) |
hhsize(5,6] | -0.206*** (0.063) | 0.014 (0.061) |
hhsize(6,8.89e+03] | -0.109 (0.068) | 0.039 (0.066) |
hhsize(8.89e+03,1e+04] | 0.027 (0.680) | 1.308* (0.755) |
employed1 | 0.053** (0.022) | 0.132*** (0.020) |
employed7777 | -0.070 (0.107) | 0.005 (0.105) |
enrollment_sourceM | 0.996*** (0.035) | 0.903*** (0.030) |
enrollment_sourceT | 0.762*** (0.169) | 0.697*** (0.140) |
raceanalysis2 | -0.163*** (0.023) | -0.156*** (0.021) |
raceanalysis3 | -0.245** (0.103) | -0.216** (0.097) |
raceanalysis4 | -0.354 (0.269) | -0.209 (0.252) |
raceanalysis5 | -0.269 (0.164) | -0.058 (0.160) |
raceanalysis6 | -0.415*** (0.142) | -0.455*** (0.139) |
raceanalysis7 | -0.126* (0.071) | -0.115* (0.066) |
raceanalysis8888 | 9.754 (48.650) | 0.578 (0.870) |
raceanalysis9999 | 0.108 (1.194) | 0.792 (1.192) |
insurancecoverage1 | 0.049** (0.022) | 0.011 (0.021) |
insurancecoverage8888 | -0.564 (0.717) | -0.601 (0.754) |
insurancecoverage9999 | 0.741 (0.578) | 0.504 (0.500) |
rcs(enrollment_age)enrollment_age | 0.032*** (0.009) | 0.038*** (0.009) |
rcs(enrollment_age)enrollment_age' | -0.163** (0.079) | -0.269*** (0.075) |
rcs(enrollment_age)enrollment_age'' | 0.547** (0.257) | 0.865*** (0.244) |
rcs(enrollment_age)enrollment_age''' | -0.687** (0.278) | -1.013*** (0.264) |
Constant | -0.924** (0.389) | -1.822*** (0.375) |
Observations | 50,165 | 50,165 |
Log Likelihood | -29,880.000 | -32,735.000 |
Akaike Inf. Crit. | 59,839.000 | 65,550.000 |

Note: *p<0.1; **p<0.05; ***p<0.01
After estimating the response propensity model we next assess the adequacy of the predicted response score. To do this we exploit a feature of the true response propensity score: the independence of the response indicator and the baseline characteristics given the score. Ideally, strata would be constructed pairing individuals with the same value of the score; however, this isn’t feasible given the wide range of values the score could take on.
Therefore, we proceed by stratifying the data using the following iterative procedure.
In the notation below, let \(N_0(j)\) and \(N_1(j)\) be the numbers of non-responders and responders in substratum \(j\). Similarly, let \(\bar l_0(j)\) and \(\bar l_1(j)\) be the average values of the linearized response propensity (log-odds) within the stratum. Finally, let \(S^2(j)\) be the pooled sample variance of the linearized response propensity within stratum \(j\).
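Concretely, a stratum \(j\) is split at the median of the estimated propensity score whenever the absolute value of the two-sample t-statistic comparing the linearized propensities of responders and non-responders exceeds the cutoff (t_max below), provided each resulting substratum retains more than N_min responders and N_min non-responders. The statistic computed by tstat() below is
\[ t(j) = \frac{\bar l_1(j) - \bar l_0(j)}{\sqrt{S^2(j)\Big(\frac{1}{N_1(j)} + \frac{1}{N_0(j)}\Big)}}, \qquad S^2(j) = \frac{\sum_{i \in j: R_i = 0} (l_i - \bar l_0(j))^2 + \sum_{i \in j: R_i = 1} (l_i - \bar l_1(j))^2}{N_0(j)+N_1(j)-2}. \]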
# Define the model/ data to assess balance on
xx.f <- xx_3.f %>% mutate(r=r_3)
####
## Parameters
####
# t-statistic cutoff value
t_max <- 1
# Minimum sample size in substratum
N_min <- 10
####
# t-statistic function
####
tstat <- function(X)
{
temp1 <- X %>% group_by(J,r) %>% summarise(Nj=n(), lj = mean(logodds))
temp2 <- X %>% inner_join(temp1,c("J","r")) %>% group_by(J,r) %>%
mutate(s1=sum((logodds-lj)^2)) %>% select(J,r,Nj,lj,s1) %>% unique() %>%
reshape2::melt(id.vars=c("J","r")) %>%
reshape2::dcast(J~variable+r) %>% mutate(S2j = (1 / ((Nj_0+Nj_1)-2))*(s1_0+s1_1)) %>%
mutate(tstat = (lj_1-lj_0)/sqrt(S2j*((1/Nj_1)+(1/Nj_0))))
tt <- temp2 %>% collect() %>% .[["tstat"]]
tt[is.na(tt)] <- 0
Nj0 <- temp2 %>% collect() %>% .[["Nj_0"]]
Nj1 <- temp2 %>% collect() %>% .[["Nj_1"]]
J <- temp2 %>% collect() %>% .[["J"]]
out <- list(J = J, Nj0=Nj0,Nj1=Nj1,tt=tt)
return(out)
}
####
# Split the response data into strata.
####
# Start with J = 1
X <- xx.f %>% mutate(J =1) %>% mutate(pr=pr_3)
strata.to.split <- 1
i = 1
test <- list()
while (strata.to.split==1)
{
stats <- X %>% tstat()
cut.medscore <- X %>% group_by(J) %>%
mutate(newcut = cut(pr,breaks=c(0,median(pr),1),labels=c("l","h")),
median = median(pr)) %>% group_by(J,r,newcut) %>%
summarise(newJ=n(),median=mean(median)) %>%
reshape2::melt(id.vars=c("J","r","newcut")) %>% reshape2::dcast(J~variable+r+newcut) %>%
tbl_df() %>% rename(median=median_0_l) %>% select(-median_1_l,-median_1_h,-median_0_h)
stats <- do.call("cbind",stats) %>% tbl_df()
stats2 <- stats %>% inner_join(cut.medscore,"J") %>% mutate(abs.tt=abs(as.numeric(tt))) %>%
mutate(split = as.integer(abs.tt>t_max & newJ_0_l>N_min & newJ_1_l>N_min &
newJ_0_h>N_min & newJ_1_h>N_min)) %>%
select(J,split,median)
strata.to.split <- stats2 %>% collect() %>% .[["split"]] %>% max()
X2 <- X %>% left_join(stats2,"J") %>% ungroup() %>% group_by(J) %>%
mutate(J2 = ifelse(pr<=median,"1","2")) %>% mutate(test=ifelse(pr>median,1,0)) %>%
ungroup() %>% mutate(J = ifelse(split==1,paste0(J,J2),J)); length(unique(X2$J))
if (is.na(strata.to.split)) strata.to.split <- 0
if (max(strata.to.split)==1) X <- X2 %>% select(-split,-median,-J2)
test[[i]] <- X2 %>% filter(idnumber=="00017217") %>% select(pr,J,median)
i = i + 1
}
# Get Number of strata
J <- length(unique(X$J))
# Replace the original data with the stratified version of the data
xx.f <- X
# Attach t-statistic values to each of the J strata in the main data.
xx.f.stats <- X %>% tstat()
xx.f.stats <- do.call("cbind",xx.f.stats) %>% tbl_df() %>%
mutate(tt=round(as.numeric(tt),3)) %>% select(J,tt)
Within each of the J strata we can examine the density of the response propensity scores among responders and non-responders; the table below reports the corresponding counts of non-responders and responders in each stratum.
Stratum | Non-Responders | Responders |
---|---|---|
111111111 | 160 | 38 |
111111112 | 138 | 57 |
11111112 | 287 | 106 |
11111121 | 275 | 118 |
11111122 | 254 | 138 |
111112 | 1043 | 528 |
1111211 | 508 | 276 |
1111212111 | 71 | 31 |
1111212112 | 62 | 32 |
111121212 | 121 | 75 |
111121221 | 126 | 70 |
1111212221 | 65 | 34 |
1111212222 | 48 | 48 |
111122 | 957 | 609 |
111211 | 939 | 633 |
111212 | 894 | 672 |
11122 | 1708 | 1428 |
11211 | 1694 | 1441 |
112121111 | 115 | 88 |
112121112 | 94 | 100 |
112121121 | 104 | 95 |
112121122 | 118 | 75 |
112121211 | 112 | 84 |
112121212 | 97 | 98 |
1121212211 | 60 | 39 |
1121212212 | 41 | 55 |
112121222 | 89 | 105 |
112122 | 815 | 750 |
1122 | 3117 | 3148 |
12111 | 1481 | 1658 |
12112 | 1394 | 1738 |
12121 | 1391 | 1745 |
12122 | 1321 | 1813 |
122111 | 641 | 928 |
12211211 | 158 | 238 |
12211212 | 143 | 245 |
122112211 | 78 | 118 |
1221122121 | 42 | 56 |
1221122122 | 29 | 69 |
12211222 | 137 | 254 |
1221211 | 284 | 503 |
1221212 | 252 | 530 |
122122 | 430 | 1136 |
12221111 | 126 | 267 |
12221112 | 102 | 290 |
1222112 | 195 | 590 |
12221211 | 85 | 307 |
12221212 | 103 | 289 |
12221221 | 88 | 305 |
12221222 | 79 | 311 |
122221 | 279 | 1294 |
1222221111 | 22 | 77 |
1222221112 | 17 | 82 |
122222112 | 26 | 168 |
12222212 | 54 | 335 |
12222221 | 54 | 337 |
1222222211 | 14 | 84 |
1222222212 | 11 | 86 |
122222222 | 16 | 177 |
We next check overall balance of the covariates across responders and non-responders. We have \(K\) covariates in the response propensity model, and to assess balance across strata we will analyze each of these covariates \(X_{ik}, k = 1, ..., K\) as an “outcome.” Specifically, in stratum \(j\) the difference between responding and non-responding units is estimated by
\[ \hat \tau_k^X(j) = \bar X_{1,k}(j) - \bar X_{0,k}(j) \] with sampling variance estimated as \[ \hat V_k^X(j) = s_k^2(j) \cdot \bigg( \frac{1}{N_1(j)}+\frac{1}{N_0(j)} \bigg) \] where \[ s_k^2(j) = \frac{1}{N_0(j)+N_1(j)-2}\bigg( \sum_{i: B_i(j)=1} (1-R_i) \cdot (X_{ik}-\bar X_{0k}(j))^2 + \sum_{i: B_i(j)=1} R_i \cdot (X_{ik}-\bar X_{1k}(j))^2\bigg) \] and \(B_i(j) = 1\) indicates that unit \(i\) belongs to stratum \(j\).
We then take a weighted average of these estimates across strata
\[ \hat \tau_k^X = \sum_{j=1}^J \bigg( \frac{N_0(j)+N_1(j)}{N} \bigg) \hat \tau_k^X(j) \] with variance \[ \hat V_k^X = \sum_{j=1}^J \bigg( \frac{N_0(j)+N_1(j)}{N} \bigg)^2 \cdot \hat V_k^X(j) \] and convert it into a z-value for a two-sided test of whether \(\hat \tau_k^X\) is equal to zero: \[ z_k = \frac{\hat \tau_k^X}{\sqrt{\hat V_k^X}} \]
####
# z-scores for balance
####
# Get only the baseline covariates (i.e., not interaction terms)
X1 <- expand.model.frame(m_base,~idnumber)
X2 <- model.matrix(m_base) %>% data.frame() %>% select(-1)
colnames(X2) <- paste0("xx",1:ncol(X2))
X3 <- xx.f %>% select(J,idnumber) %>% mutate(idnumber=as.character(idnumber))
X <- cbind.data.frame(X1,X2) %>% select(idnumber,r=r_3,contains("xx")) %>% inner_join(X3,"idnumber")
# Get tau (i.e., the difference in mean covariate values between respondents
# and non-respondents in each stratum)
temp1 <- X %>% group_by(J,r) %>% summarise_each(funs(mean) ,contains("xx"))
tauJ <- temp1 %>% mutate_each(funs(diff),contains("xx")) %>% group_by(J,r) %>% filter(n()==1)
colnames(tauJ) = gsub("^xx","tau_xx",colnames(temp1))
# Get Sample Sizes by stratum (overall, responders, nonresponders)
NJr <- X %>% group_by(J,r) %>% summarise(Nj=n())
NJ <- NJr %>% reshape2::melt(id.vars=c("J","r")) %>%
reshape2::dcast(J~variable+r) %>% mutate(Nj = Nj_0 + Nj_1)
vv <- function(x) (x-mean(x))^2
s2J <- X %>% group_by(J,r) %>% mutate_each(funs(vv),contains("xx")) %>%
summarise_each(funs(sum), contains("xx")) %>% group_by(J) %>%
summarise_each(funs(sum),contains("xx")) %>% left_join(NJ,"J") %>%
mutate(temp=1/(Nj_1+Nj_0-2)) %>% mutate_each(funs(.*temp),contains("xx"))
VkJ <- s2J %>% mutate(temp=(1/Nj_0)+(1/Nj_1)) %>% mutate_each(funs(.*temp),contains("xx"))
tau <- tauJ %>% group_by(J) %>% filter(row_number()==1) %>% select(-r) %>%
left_join(NJ,"J") %>% ungroup() %>% mutate(N=sum(Nj)) %>%
mutate(weight = (Nj_0+Nj_1)/N) %>% mutate_each(funs(.*weight),contains("xx")) %>%
summarise_each(funs(sum), contains("xx"))
V <- VkJ %>% ungroup() %>% mutate(N=sum(Nj)) %>%
  mutate(weight = ((Nj_0+Nj_1)/N)^2) %>% mutate_each(funs(.*weight),contains("xx")) %>%
  summarise_each(funs(sum), contains("xx"))
# z-score for each covariate (1 x K matrix)
z <- (tau / sqrt(V)) %>% as.matrix()
pval <- 2*pnorm(-abs(t(z)))
uniform <- seq(0,1,0.01)
empirical <- c()
for (pp in 1:length(uniform)) {
empirical[pp] <- (1/length(pval))*sum(as.integer(pval[,1]<=uniform[pp]))
}
To assess overall balance we obtain the p-values associated with these z-scores and compare them to the uniform distribution, asking: what fraction of the empirical p-values is less than a given percentile of the uniform(0,1) distribution? In the plot below, the empirical fraction of p-values below each uniform(0,1) percentile lies above the 45-degree line and the mean p-value exceeds 0.50, indicating good overall balance between responders and non-responders on all the included predictors.
meanval <- mean(pval[,1])
pdata <- cbind.data.frame(uniform=uniform,empirical=empirical)
p <- pdata %>% ggplot(aes(x=uniform,y=uniform))+geom_line(lty=2)+geom_step(aes(x=empirical))+
annotate("text",x=.5,y=0.25,label=paste("Mean Value for p-values:",round(meanval,3)),hjust=0)
p + theme_tufte() + theme(text=element_text(family="Gill Sans"))+xlab("Empirical\nShare of p-values less than Y-value")+
ylab("Uniform")
Balance Check for FU3 Weight
Having settled on a response propensity model with a reasonable degree of balance across responders and non-responders, we construct a non-response weight for each responding observation within the strata. These observations get a weight equal to the inverse of the average observed response rate within each stratum.
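That is, letting \(\bar r(j)\) denote the observed response rate in stratum \(j\), each unit receives
\[ w_i = \begin{cases} \dfrac{1}{\bar r(j)} = \dfrac{N_0(j) + N_1(j)}{N_1(j)} & \text{if } R_i = 1, \\ 0 & \text{if } R_i = 0. \end{cases} \]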
# Construct Weights
xx_3.f <- xx.f %>% mutate(w.FU3=1)
xx_3.f$cell.3 <- xx_3.f %>% group_by(J) %>% group_indices()
xx_3.f <- xx_3.f %>% arrange(cell.3) %>% group_by(cell.3) %>%
mutate(rr.cell = mean(r_3))
xx_3.f[xx_3.f$r_3==1,]$w.FU3 = xx_3.f[xx_3.f$r_3==1,]$w.FU3 * (1/xx_3.f[xx_3.f$r_3==1,]$rr.cell)
xx_3.f[xx_3.f$r_3==0,]$w.FU3 = 0
# Validate the weights sum to the target population size (i.e., FU1 sample size)
sum(xx_3.f$w.FU3)
## [1] 50165
round(sum(xx_3.f$w.FU3),0)== dim(xx_3)[1]
## [1] TRUE
We’ll now go through a similar process to construct the FU2 weights as well.
xx.f <- xx_2.f %>% mutate(r=r_2)
####
# Split the FU2 response data into strata.
####
# Start with J = 1
X <- xx.f %>% mutate(J =1) %>% mutate(pr=pr_2)
strata.to.split <- 1
i = 1
test <- list()
while (strata.to.split==1)
{
stats <- X %>% tstat()
cut.medscore <- X %>% group_by(J) %>%
mutate(newcut = cut(pr,breaks=c(0,median(pr),1),labels=c("l","h")),
median = median(pr)) %>% group_by(J,r,newcut) %>%
summarise(newJ=n(),median=mean(median)) %>%
reshape2::melt(id.vars=c("J","r","newcut")) %>%
reshape2::dcast(J~variable+r+newcut) %>% tbl_df() %>%
rename(median=median_0_l) %>% select(-median_1_l,-median_1_h,-median_0_h)
stats <- do.call("cbind",stats) %>% tbl_df()
stats2 <- stats %>% inner_join(cut.medscore,"J") %>%
mutate(abs.tt=abs(as.numeric(tt))) %>%
mutate(split = as.integer(abs.tt>t_max & newJ_0_l>N_min &
newJ_1_l>N_min & newJ_0_h>N_min & newJ_1_h>N_min)) %>%
select(J,split,median)
strata.to.split <- stats2 %>% collect() %>% .[["split"]] %>% max()
X2 <- X %>% left_join(stats2,"J") %>% ungroup() %>% group_by(J) %>%
mutate(J2 = ifelse(pr<=median,"1","2")) %>% mutate(test=ifelse(pr>median,1,0)) %>%
ungroup() %>% mutate(J = ifelse(split==1,paste0(J,J2),J)); length(unique(X2$J))
if (is.na(strata.to.split)) strata.to.split <- 0
if (max(strata.to.split)==1) X <- X2 %>% select(-split,-median,-J2)
test[[i]] <- X2 %>% filter(idnumber=="00017217") %>% select(pr,J,median)
i = i + 1
cat(length(unique(X$J)))
}
## 2481523313948556262
#do.call("rbind",test)
# Get Number of strata
J <- length(unique(X$J)); J
## [1] 62
# Replace the original data with the stratified version of the data
xx.f <- X
# Attach t-statistic values to each of the J strata in the main data.
xx.f.stats <- X %>% tstat()
xx.f.stats <- do.call("cbind",xx.f.stats) %>% tbl_df() %>%
mutate(tt=round(as.numeric(tt),3)) %>%
select(J,tt)
####
# z-scores for balance
####
# Obtain the baseline covariates (i.e., no interaction terms) for balance checking
if (any(grepl("\\*", vars2))) xx.base <- vars2[-grep("\\*", vars2)] else xx.base <- vars2
m_base <- glm(tofmla(y="r_2",x=xx.base),family="binomial",data=xx_2)
# Get only the baseline covariates (i.e., not interaction terms)
X1 <- expand.model.frame(m_base,~idnumber)
X2 <- model.matrix(m_base) %>% data.frame() %>% select(-1)
colnames(X2) <- paste0("xx",1:ncol(X2))
X3 <- xx.f %>% select(J,idnumber) %>% mutate(idnumber=as.character(idnumber))
X <- cbind.data.frame(X1,X2) %>% select(idnumber,r=r_2,contains("xx")) %>% inner_join(X3,"idnumber")
# Get tau (i.e., the difference in mean covariate values between respondents
# and non-respondents in each stratum)
temp1 <- X %>% group_by(J,r) %>% summarise_each(funs(mean) ,contains("xx"))
tauJ <- temp1 %>% mutate_each(funs(diff),contains("xx")) %>% group_by(J,r) %>% filter(n()==1)
colnames(tauJ) = gsub("^xx","tau_xx",colnames(temp1))
# Get Sample Sizes by stratum (overall, responders, nonresponders)
NJr <- X %>% group_by(J,r) %>% summarise(Nj=n())
NJ <- NJr %>% reshape2::melt(id.vars=c("J","r")) %>%
reshape2::dcast(J~variable+r) %>% mutate(Nj = Nj_0 + Nj_1)
vv <- function(x) (x-mean(x))^2
s2J <- X %>% group_by(J,r) %>% mutate_each(funs(vv),contains("xx")) %>%
summarise_each(funs(sum), contains("xx")) %>% group_by(J) %>%
summarise_each(funs(sum),contains("xx")) %>% left_join(NJ,"J") %>%
mutate(temp=1/(Nj_1+Nj_0-2)) %>% mutate_each(funs(.*temp),contains("xx"))
VkJ <- s2J %>% mutate(temp=(1/Nj_0)+(1/Nj_1)) %>% mutate_each(funs(.*temp),contains("xx"))
tau <- tauJ %>% group_by(J) %>% filter(row_number()==1) %>% select(-r) %>%
left_join(NJ,"J") %>% ungroup() %>% mutate(N=sum(Nj)) %>%
mutate(weight = (Nj_0+Nj_1)/N) %>% mutate_each(funs(.*weight),contains("xx")) %>%
summarise_each(funs(sum), contains("xx"))
V <- VkJ %>% ungroup() %>% mutate(N=sum(Nj)) %>%
  mutate(weight = ((Nj_0+Nj_1)/N)^2) %>% mutate_each(funs(.*weight),contains("xx")) %>%
  summarise_each(funs(sum), contains("xx"))
# z-score for each covariate (1 x K matrix)
z <- (tau / sqrt(V)) %>% as.matrix()
pval <- 2*pnorm(-abs(t(z)))
uniform <- seq(0,1,0.01)
empirical <- c()
for (pp in 1:length(uniform)) {
empirical[pp] <- (1/length(pval))*sum(as.integer(pval[,1]<=uniform[pp]))
}
# Construct FU2 Weights
xx_2.f <- xx.f %>% mutate(w.FU2=1)
xx_2.f$cell.2 <- xx_2.f %>% group_by(J) %>% group_indices()
xx_2.f <- xx_2.f %>% arrange(cell.2) %>% group_by(cell.2) %>% mutate(rr.cell = mean(r_2))
xx_2.f[xx_2.f$r_2==1,]$w.FU2 = xx_2.f[xx_2.f$r_2==1,]$w.FU2 * (1/xx_2.f[xx_2.f$r_2==1,]$rr.cell)
xx_2.f[xx_2.f$r_2==0,]$w.FU2 = 0
# Validate that the weights sum to the target population size (i.e., the FU1 sample size)
sum(xx_2.f$w.FU2)
## [1] 50165
# NOTE: the unrounded comparison below fails only because of floating-point
# error; rounding first, as in the FU3 check above, returns TRUE.
sum(xx_2.f$w.FU2) == dim(xx_2)[1]
## [1] FALSE
meanval <- mean(pval[,1])
pdata <- cbind.data.frame(uniform=uniform,empirical=empirical)
p <- pdata %>% ggplot(aes(x=uniform,y=uniform))+geom_line(lty=2)+geom_step(aes(x=empirical))+
annotate("text",x=.5,y=0.25,label=paste("Mean Value for p-values:",round(meanval,3)),hjust=0)
p + theme_tufte() + theme(text=element_text(family="Gill Sans"))+
xlab("Empirical\nShare of p-values less than Y-value")+ylab("Uniform")
Balance Check for FU2 Weight
The next set of challenges relates to the fact that we have a multi-wave survey, though some of our analyses may rely only on cross-sectional data. Here we draw heavily on the work of Little and David (1983) on combining nonresponse weights across follow-up waves of the SCCS. The discussion of similar issues in Chen et al. (2012) also informs the notation and discussion below.
With multiple waves of the SCCS we must deal with survey attrition due to moving, loss to follow-up, or death. However, given that the sampling frame for the SCCS follow-up surveys targets previous nonresponders, we also have reentry among people who skip a particular follow-up but respond to a later one. For FU3 we have roughly 3,000–4,000 cases that meet these criteria.
To develop the weighting adjustment for attrition and reentry we need a bit of notation. Since we have defined our target population based on the responders to the baseline survey (see above) we essentially have a three-wave panel.
With this in mind, let \(r = (r_1, r_2, r_3)\) be the response status (0/1) in each follow-up wave \(k\). Let \(z\) denote a set of design variables (e.g., general population sample, community health center [??], etc.), and let \(x = (x_1, x_2, x_3)\) reflect variables collected in each follow-up round, where \(x_k\) is observed when \(r_k = 1\) and missing when \(r_k = 0\).
We then fit a sequence of logit regressions in the spirit of Little and David (1983): \(r_1\) on \(z\); \(r_2\) on \(z\) and \(x_1\) among FU1 respondents; and \(r_3\) on \(z\), \(x_1\), and \(x_2\) among respondents to both FU1 and FU2. Each wave's nonresponse weight is then constructed from the inverse of the cumulative product of the fitted response probabilities.
With these weights in hand, \(w_1\) is the initial nonresponse weight, while \(w_2\) can be used for any cross-sectional analyses of respondents to the second follow-up as well as any longitudinal analyses that include respondents to follow-ups 1 and 2. In addition, \(w_3\) can be used for cross-sectional analyses of follow-up 3 and for longitudinal analyses of respondents to all three waves.
We also have some cases with respondents to FU1 and FU3 but not FU2 (the “101” pattern below). The approach above can be amended to include an additional weight based on regressing \(r_3\) on \(z\) and \(x_1\) alone. The weights \(w_{k.1}\) can then be constructed so that they use the appropriate predicted value depending on whether information on \(x_2\) is available (i.e., whether \(r_2\) is 0 or 1).
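As a sketch of this sequential construction (hypothetical data frame dd; z1 and z2 stand in for the design variables, x1 and x2 for the wave-specific variables; one common construction under sequential ignorability, not necessarily the final implementation):
# Wave 2: response model among FU1 respondents, using design and FU1 variables
m2 <- glm(r_2 ~ z1 + z2 + x1, family = binomial, data = subset(dd, r_1 == 1))
dd$p2 <- predict(m2, newdata = dd, type = "response")
# Wave 3: response model among respondents to both prior waves
m3 <- glm(r_3 ~ z1 + z2 + x1 + x2, family = binomial,
          data = subset(dd, r_1 == 1 & r_2 == 1))
dd$p3 <- predict(m3, newdata = dd, type = "response")
# Variant for the 101 pattern: predict r_3 from z and x1 only (x2 unobserved)
m3.1 <- glm(r_3 ~ z1 + z2 + x1, family = binomial, data = subset(dd, r_1 == 1))
dd$p3.1 <- predict(m3.1, newdata = dd, type = "response")
# Cumulative inverse-probability weights
dd$w2 <- ifelse(dd$r_2 == 1, 1 / dd$p2, 0)
dd$w3 <- with(dd, ifelse(r_3 == 1,
                         ifelse(r_2 == 1, 1 / (p2 * p3), 1 / p3.1), 0))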
The observed response patterns (FU1 FU2 FU3) are distributed as follows:

Response Pattern | Percent |
---|---|
000 | 32.1 |
111 | 29.6 |
100 | 15.5 |
110 | 13.1 |
101 | 4.2 |
010 | 2.9 |
011 | 2.6 |
001 | 0.1 |
Response Pattern | Percent (among respondents to at least one follow-up) |
---|---|
111 | 43.6 |
100 | 22.8 |
110 | 19.3 |
101 | 6.2 |
010 | 4.2 |
011 | 3.8 |
001 | 0.2 |
For our within-state comparisons we’ll fit the following model: \[ h(E(Y_{itk})) = \beta_0 + \beta_1 \mathrm{Year}_t + \beta_2 \mathrm{Tx}_i + \beta_3 \mathrm{PostExpansion}_t \times \mathrm{Tx}_i + \beta_4 \mathrm{State}_k + \beta_5 \mathrm{State}_k \times \mathrm{Year}_t + \beta_6 \mathrm{State}_k \times \mathrm{Tx}_i + \beta_7 \mathrm{PostExpansion}_t \times \mathrm{Tx}_i \times \mathrm{ExpansionState}_k + \beta_8 \, \mathrm{spline}(\mathrm{Age}) + \alpha_i \] For now I have fit linear models with a respondent random effect (\(\alpha_i\)) to capture the longitudinal nature of the data. We have few patient characteristics, but I did include a restricted cubic spline for age. The comparison group is simply respondents over age 65.
Here we restrict to the non-elderly and fit the following model: \[ h(E(Y_{itk})) = \beta_0 + \beta_1 \mathrm{Year}_t + \beta_2 \mathrm{State}_k + \beta_3 \mathrm{PostExpansion}_t \times \mathrm{ExpansionState}_k + \beta_4 \, \mathrm{spline}(\mathrm{Age}) + \alpha_i \] Again, I have fit a linear model with a respondent random effect and a restricted cubic spline for age.
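Both models are straightforward to express with lme4; the sketch below is illustrative only (df and the outcome unins are placeholders, and the regressor names are patterned on the variables appearing in the tables that follow):
# Sketch: linear mixed models with a respondent random intercept
library(lme4)
library(rms)   # for the rcs() restricted cubic spline
# Within-state model (treated vs. respondents over 65, triple-difference term)
m_within <- lmer(unins ~ year + tx + I(post * tx) + state + state:year +
                   state:tx + I(post * tx * ss.expansion) +
                   rcs(enrollment_age) + (1 | idnumber), data = df)
# Between-state model (non-elderly only)
m_between <- lmer(unins ~ year + state + I(post * ss.expansion) +
                    rcs(enrollment_age) + (1 | idnumber),
                  data = subset(df, enrollment_age < 65))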
Dependent variable: r

 | Within-State Model (1) | Between-State Model (2) |
---|---|---|
I(post * tx) | 0.040*** (0.005) | |
I(post * tx * ss.expansion) | -0.011 (0.011) | |
I(post * ss.expansion) | | 0.015** (0.007) |
Observations | 104,966 | 62,720 |
Log Likelihood | -66,099.000 | -40,307.000 |
Akaike Inf. Crit. | 132,286.000 | 80,655.000 |
Bayesian Inf. Crit. | 132,707.000 | 80,836.000 |

Note: *p<0.1; **p<0.05; ***p<0.01
Dependent variable: r

 | Nonexpansion (1) | Expansion (2) |
---|---|---|
I(post * tx) | 0.040*** (0.005) | 0.029*** (0.009) |
Observations | 83,506 | 21,460 |
Log Likelihood | -53,176.000 | -12,904.000 |
Akaike Inf. Crit. | 106,421.000 | 25,840.000 |
Bayesian Inf. Crit. | 106,738.000 | 25,968.000 |

Note: *p<0.1; **p<0.05; ***p<0.01
Dependent variable: unins

 | Within: Unweighted (1) | Within: Non-Response Weighted (2) | Between: Unweighted (3) | Between: Non-Response Weighted (4) |
---|---|---|---|---|
I(post * tx) | -0.063*** (0.005) | -0.069*** (0.005) | | |
I(post * tx * ss.expansion) | -0.068*** (0.010) | -0.074*** (0.010) | | |
ss.expansion:post | | | -0.063*** (0.007) | -0.072*** (0.008) |
Observations | 77,217 | 77,217 | 46,121 | 46,121 |
Log Likelihood | -21,325.000 | -23,386.000 | -19,760.000 | -20,797.000 |
Akaike Inf. Crit. | 42,763.000 | 46,884.000 | 39,562.000 | 41,636.000 |
Bayesian Inf. Crit. | 43,281.000 | 47,402.000 | 39,746.000 | 41,820.000 |

Note: *p<0.1; **p<0.05; ***p<0.01