Categorical outcomes

In the vast majority of situations in your work as demographers, your outcome will either be qualitative or non-normally distributed, especially if you work with individual-level survey data.

When we speak of qualitative outcomes, we are generally concerned with the observation of discrete events or categories rather than continuous measurements; the simplest such outcome is a dichotomous (1/0) variable.

Logit and Probit models

If our outcome is dichotomous (1/0), the natural distribution to consider for a GLM is the binomial \[y \sim \text{Binomial}(n, p)\] with \(p\) being the mean of the binomial and \(n\) the number of trials. When you have individual-level data, \(n\) is typically 1, and \(p\) is the probability of observing a 1, conditional on the observed predictors. There are two common techniques for estimating this mean: logistic and probit regression. In a logistic model, the link function is the logit, or log-odds, function:

\(\text{logit}(p) = \log \frac{p}{1-p}\)

Inverting this link gives the following conditional mean model:

\[E(y|x) = \frac{1}{1+ \exp({-\sum_k \beta_k x_{ik}})}\] which is the logistic (inverse logit) function evaluated at the linear predictor. Expressing \(p\) as a linear model is done via the log-odds transformation of the probability:

\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]

For the probit model, the link function is the inverse cumulative Normal distribution (the standard Normal quantile function):

\[\Phi^{-1}(p) = \sum_k \beta_k x_{ik}, \qquad \text{so that} \qquad E(y|x) = p = \Phi\left(\sum_k \beta_k x_{ik}\right)\]

In practice, these two models give very similar estimates and have very similar interpretations. The logistic regression model has the more convenient odds-ratio interpretation of its \(\beta\)'s, while the probit model's coefficients are usually converted to marginal effects, which takes more work and which software generally does not report by default.
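As a quick aside, R's built-in distribution functions implement both link functions, which can help build intuition. A minimal sketch using base R only (the value 0.2 is just an arbitrary probability):

p <- 0.2
qlogis(p)          #logit link: log(p/(1-p)), about -1.386
plogis(qlogis(p))  #inverse logit recovers p = 0.2
qnorm(p)           #probit link: inverse standard Normal CDF, about -0.841
pnorm(qnorm(p))    #standard Normal CDF recovers p = 0.2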

Logit/Probit Regression example

There is no trick to fitting logistic regression models to complex survey data: just use the svyglm() function with the appropriate family, family=binomial for the logistic model and family=binomial(link="probit") for the probit model. You don't have to specify the link function for the logistic model, since the logit link is the default.
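For reference, the two calls look like the sketch below, where des, y, x1, and x2 are placeholder names for a survey design object, a 0/1 outcome, and two predictors (the real design and variables are set up in the example that follows):

#minimal sketch, assuming a svydesign object des with a 0/1 outcome y and predictors x1, x2
fit.l <- svyglm(y ~ x1 + x2, design = des, family = binomial)                   #logistic (default logit link)
fit.p <- svyglm(y ~ x1 + x2, design = des, family = binomial(link = "probit"))  #probit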

Example

This example will cover the use of R functions for fitting binary logit and probit models to complex survey data.

For this example I am using the 2017 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data.

#load libraries
library(car)
## Loading required package: carData
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(questionr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#load brfss
load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))


#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss_17)
head(nams, n=10)
##  [1] "dispcode" "statere1" "safetime" "hhadult"  "genhlth"  "physhlth"
##  [7] "menthlth" "poorhlth" "hlthpln1" "persdoc2"
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-tolower(gsub(pattern = "_",replacement =  "",x =  nams))
names(brfss_17)<-newnames

#Poor or fair self rated health
brfss_17$badhealth<-Recode(brfss_17$genhlth, recodes="4:5=1; 1:3=0; else=NA")

#sex
brfss_17$male<-as.factor(ifelse(brfss_17$sex==1, "Male", "Female"))

#Age cut into intervals
brfss_17$agec<-cut(brfss_17$age80, breaks=c(0,24,39,59,79,99))

#race/ethnicity
brfss_17$black<-Recode(brfss_17$racegr3, recodes="2=1; 9=NA; else=0")
brfss_17$white<-Recode(brfss_17$racegr3, recodes="1=1; 9=NA; else=0")
brfss_17$other<-Recode(brfss_17$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss_17$hispanic<-Recode(brfss_17$racegr3, recodes="5=1; 9=NA; else=0")

brfss_17$race_eth<-Recode(brfss_17$racegr3, recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA", as.factor = T)

#insurance
brfss_17$ins<-Recode(brfss_17$hlthpln1, recodes ="7:9=NA; 1=1;2=0")

#income grouping
brfss_17$inc<-ifelse(brfss_17$incomg==9, NA, brfss_17$incomg)

#education level
brfss_17$educ<-Recode(brfss_17$educa, recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA", as.factor=T)
brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad')

#employment
brfss_17$employ<-Recode(brfss_17$employ1, recodes="1:2='Employed'; 3:6='nilf'; 7='retired'; 8='unable'; else=NA", as.factor=T)
brfss_17$employ<-relevel(brfss_17$employ, ref='Employed')

#marital status
brfss_17$marst<-Recode(brfss_17$marital, recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA", as.factor=T)
brfss_17$marst<-relevel(brfss_17$marst, ref='married')


#BMI: in brfss_17 the bmi5 variable has 2 implied decimal places, so we must divide by 100 to get actual BMI values

brfss_17$bmi<-brfss_17$bmi5/100
brfss_17$obese<-ifelse(brfss_17$bmi>=30, 1, 0)
#smoking currently
brfss_17$smoke<-Recode(brfss_17$smoker3, recodes="1:2='Current'; 3='Former';4='NeverSmoked'; else=NA", as.factor=T)
brfss_17$smoke<-relevel(brfss_17$smoke, ref = "NeverSmoked")

Analysis

First, we will do some descriptive analysis, such as means and cross tabulations.

library(dplyr)
sub<-brfss_17%>%
  select(badhealth,mmsaname, bmi, agec,race_eth, marst, educ,white, black, hispanic, other, smoke, ins, mmsawt, ststr) %>%
  filter( complete.cases(.))

#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data =sub )

We first examine the % of US adults with poor/fair health by education level, and do a survey-corrected chi-square test for independence.

library(ggplot2)
cat<-svyby(formula = ~badhealth, by = ~educ, design = des, FUN = svymean, na.rm=T)
svychisq(~badhealth+educ, design = des)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~badhealth + educ, design = des)
## F = 585.7, ndf = 3.6133e+00, ddf = 7.1982e+05, p-value < 2.2e-16
qplot(x=cat$educ,y=cat$badhealth, data=cat ,xlab="Education", ylab="%  Fair/Poor Health" )+
geom_errorbar(aes(x=educ, ymin=badhealth-1.96*se,ymax= badhealth+1.96*se), width=.25)+
ggtitle(label = "% of US Adults with Fair/Poor Health by Education")

Calculate race*health cross tabulation, and plot it

dog<-svyby(formula = ~badhealth, by = ~race_eth, design = des, FUN = svymean, na.rm=T)
svychisq(~badhealth+race_eth, design = des)
## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~badhealth + race_eth, design = des)
## F = 113.57, ndf = 3.4023e+00, ddf = 6.7779e+05, p-value < 2.2e-16
qplot(x=dog$race_eth,y=dog$badhealth, data=dog ,xlab="Race", ylab="%  Fair/Poor Health" )+
geom_errorbar(aes(x=race_eth, ymin=badhealth-se,ymax= badhealth+se), width=.25)+
ggtitle(label = "% of US Adults with Fair/Poor Health by Race/Ethnicity")

Calculate race by education by health cross tabulation, and plot it

catdog<-svyby(formula = ~badhealth, by = ~race_eth+educ, design = des, FUN = svymean, na.rm=T)
catdog
#this plot is a little more complicated

catdog$educ_rec<-factor(c(rep("HS Grad", 5),rep("Primary Sch", 5), rep("LT HS", 5),  
                          rep("Some College", 5), rep("College Grad", 5)), ordered = T)
#fix the order of the education factor levels
catdog$educ_rec<-factor(catdog$educ_rec, levels(catdog$educ_rec)[c(4,3,2,5,1)])

p<-ggplot(catdog, aes(educ_rec,badhealth),xlab="Race", ylab="% Bad Health")
p<-p+geom_point(aes(colour=race_eth))
p<-p+geom_line(aes(colour=race_eth,group=race_eth))
#p<-p+geom_errorbar(aes(x=educ, ymin=badhealth-1.96*se,ymax= badhealth+1.96*se,colour=race_eth), width=.25)
p<-p+ylab("% Fair/Poor Health")
p<-p+xlab("Education Level")
p+ggtitle("% of US Adults in 2017 in Bad Health by Race and Education")

Which shows significant variation in health status by education level and race/ethnicity.

#Logit model
fit.logit<-svyglm(badhealth~race_eth+educ+agec,
                  design= des,
                  family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.logit)
## 
## Call:
## svyglm(formula = badhealth ~ race_eth + educ + agec, design = des, 
##     family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.00287    0.06770 -29.585  < 2e-16 ***
## race_ethnh black     -0.03109    0.05058  -0.615   0.5387    
## race_ethnh multirace  0.06804    0.08741   0.778   0.4364    
## race_ethnh other     -0.17043    0.08313  -2.050   0.0404 *  
## race_ethnhwhite      -0.37482    0.04189  -8.949  < 2e-16 ***
## educ0Prim             0.88869    0.07067  12.575  < 2e-16 ***
## educ1somehs           0.67014    0.05069  13.220  < 2e-16 ***
## educ3somecol         -0.32651    0.03302  -9.889  < 2e-16 ***
## educ4colgrad         -1.16599    0.03459 -33.713  < 2e-16 ***
## agec(24,39]           0.49634    0.06571   7.554 4.24e-14 ***
## agec(39,59]           1.03889    0.06181  16.809  < 2e-16 ***
## agec(59,79]           1.34546    0.06167  21.817  < 2e-16 ***
## agec(79,99]           1.50892    0.07744  19.485  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9953381)
## 
## Number of Fisher Scoring iterations: 5

Get odds ratios and confidence intervals

exp(coef(fit.logit))#odds ratios
##          (Intercept)     race_ethnh black race_ethnh multirace 
##            0.1349470            0.9693860            1.0704060 
##     race_ethnh other      race_ethnhwhite            educ0Prim 
##            0.8433002            0.6874099            2.4319523 
##          educ1somehs         educ3somecol         educ4colgrad 
##            1.9545118            0.7214375            0.3116149 
##          agec(24,39]          agec(39,59]          agec(59,79] 
##            1.6426999            2.8260718            3.8399711 
##          agec(79,99] 
##            4.5218259
exp(confint(fit.logit)) #confidence intervals for odds ratios
##                          2.5 %    97.5 %
## (Intercept)          0.1181786 0.1540947
## race_ethnh black     0.8778976 1.0704087
## race_ethnh multirace 0.9018691 1.2704381
## race_ethnh other     0.7165064 0.9925317
## race_ethnhwhite      0.6332309 0.7462245
## educ0Prim            2.1173841 2.7932543
## educ1somehs          1.7696595 2.1586731
## educ3somecol         0.6762309 0.7696662
## educ4colgrad         0.2911917 0.3334704
## agec(24,39]          1.4442085 1.8684717
## agec(39,59]          2.5036547 3.1900094
## agec(59,79]          3.4027733 4.3333414
## agec(79,99]          3.8850580 5.2629611

Probit model

To get the probit model, you use family=binomial(link = "probit") in svyglm():

#probit model
fit.probit<-svyglm(badhealth~race_eth+educ+agec,
                   design=des,
                   family=binomial(link= "probit"))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Present both model coefficients next to one another

#helper that exponentiates coefficients; used later with stargazer's apply.coef argument
myexp<-function(x){exp(x)}

stargazer(fit.logit, fit.probit,
          type = "html",
          style="demography",
          covariate.labels=c("Black","MultRace" ,"Other","NHWhite", "PrimarySchool", "SomeHS","SomeColl", "CollGrad", "Age 24-39","Age 39-59" ,"Age 59-79", "Age 80+"),
          ci=T, column.sep.width = "10pt")
badhealth
                 Model 1 (survey-weighted logistic)   Model 2 (survey-weighted probit)
Black            -0.031 (-0.130, 0.068)               -0.016 (-0.072, 0.040)
MultRace          0.068 (-0.103, 0.239)                0.036 (-0.061, 0.132)
Other            -0.170* (-0.333, -0.007)             -0.100* (-0.188, -0.012)
NHWhite          -0.375*** (-0.457, -0.293)           -0.211*** (-0.257, -0.165)
PrimarySchool     0.889*** (0.750, 1.027)              0.546*** (0.461, 0.631)
SomeHS            0.670*** (0.571, 0.769)              0.401*** (0.341, 0.461)
SomeColl         -0.327*** (-0.391, -0.262)           -0.184*** (-0.220, -0.147)
CollGrad         -1.166*** (-1.234, -1.098)           -0.629*** (-0.665, -0.593)
Age 24-39         0.496*** (0.368, 0.625)              0.260*** (0.195, 0.325)
Age 39-59         1.039*** (0.918, 1.160)              0.555*** (0.493, 0.617)
Age 59-79         1.345*** (1.225, 1.466)              0.732*** (0.670, 0.794)
Age 80+           1.509*** (1.357, 1.661)              0.837*** (0.755, 0.919)
Constant         -2.003*** (-2.136, -1.870)           -1.158*** (-1.228, -1.089)
N                 200,568                              200,568
Log Likelihood   -77,742.630                          -77,737.000
AIC               155,511.300                          155,500.000
*p < .05; **p < .01; ***p < .001

Both of these models show the same pattern of effects. Relative to non-Hispanic whites, Hispanics, blacks, and multi-race individuals all show increased chances of reporting poor/fair health (the model's reference category for race/ethnicity is Hispanic, which is why whites appear with a negative coefficient). The education terms show a clear gradient: compared with high school graduates (the reference group), those with less schooling have higher odds of reporting poor/fair health and those with more schooling have lower odds. Likewise, as people get older, they are more likely to report poor/fair health, compared to those under age 25 (the reference group).

Marginal effects

In a linear regression model, the \(\beta\)'s are themselves the marginal effects:

\[\frac{\partial E(y|x)}{\partial x} = \beta\]

which is just the rate of change in the conditional mean of \(y\) per unit change in \(x\). This is the case only for strictly linear models.

In the logit and probit models, which are nonlinear in their parameters, the marginal effect also has to account for the derivative of the assumed CDF with respect to the linear predictor:

\[\frac{\partial y}{\partial x} = \beta \, \frac{\partial \Phi(x'\beta)}{\partial x'\beta} = \beta \, \phi(x'\beta)\]

So we multiply the estimated \(\beta\) by the p.d.f. of the assumed distribution (logistic density for the logit model, standard Normal density for the probit), evaluated at the linear predictor; averaging this over the sample gives the average marginal effect. In R that's not a big problem:

#Logit marginal effects
log.marg<-coef(fit.logit)*mean(dlogis(predict(fit.logit)), na.rm=T)

#for probit now
prob.marg<-coef(fit.probit)*mean(dnorm(predict(fit.probit)), na.rm=T)


effects<-data.frame(effect=c(log.marg, prob.marg),
                    term=rep(names(log.marg),2),
                    model=c(rep("logit", length(log.marg)),rep("probit", length(log.marg)))
                    )

effects%>%
  ggplot(aes(x=term, y=effect))+geom_point(aes(color=model, group=model, shape=model),
  position=position_jitterdodge(jitter.width = 1),
             size=2)+
  ylab("Marginal Effect")+
  xlab("Model Term")+
  geom_abline(intercept = 0, slope=0)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle(label = "Comparison of marginal effects in Logit and Probit models")

Which shows us that the marginal effects are very similar between the two models.
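If you prefer numbers to a plot, you can also print the two sets of average marginal effects side by side; a short sketch using the log.marg and prob.marg objects created above:

#side-by-side comparison of the average marginal effects from the two models
round(cbind(logit = log.marg, probit = prob.marg), 4)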

Fitted Values

As I often say, I like to talk about “interesting cases”. In order to do this, you need the fitted mean for a particular case, which you get from the model's fitted values. To do this, I generate a bunch of “fake people” that have variability in the model covariates, and get the fitted value for each type of person. This is perhaps overkill in this example, because I generate every possible type of person; ideally you would pick a few interesting cases to discuss.

In order to derive these, we effectively “solve the equation” for the model; another way of saying it is that we estimate the conditional mean of \(y\) by specifying the \(x\) values that are meaningful for a particular comparison. For example, the probability of a young, white, college-educated person reporting poor/fair health is just the model evaluated at those particular characteristics:

\[\text{Pr(poor/fair health)} = \frac{1}{1+\exp\left(-(\beta_0 + \beta_1\text{white} + \beta_2\text{young}+\beta_3\text{college})\right)}\]

#get a series of predicted probabilites for different "types" of people for each model
#expand.grid will generate all possible combinations of values you specify
dat<-expand.grid(race_eth=levels(brfss_17$race_eth), educ=levels(brfss_17$educ), agec=levels(brfss_17$agec))

#You MAY need to get rid of impossible cases here

#generate the fitted values
fit<-predict(fit.logit, newdata=dat, type="response")
fitp<-predict(fit.probit, newdata=dat, type="response")
#add the values to the fake data
dat$fitted.prob.lrm<-round(fit, 3)
dat$fitted.prob.pro<-round(fitp, 3)

#Print the fitted probabilities for the first 20 cases
knitr::kable(head(dat, n=20))
race_eth educ agec fitted.prob.lrm fitted.prob.pro
hispanic 2hsgrad (0,24] 0.119 0.123
nh black 2hsgrad (0,24] 0.116 0.120
nh multirace 2hsgrad (0,24] 0.126 0.131
nh other 2hsgrad (0,24] 0.102 0.104
nhwhite 2hsgrad (0,24] 0.085 0.085
hispanic 0Prim (0,24] 0.247 0.270
nh black 0Prim (0,24] 0.241 0.265
nh multirace 0Prim (0,24] 0.260 0.282
nh other 0Prim (0,24] 0.217 0.238
nhwhite 0Prim (0,24] 0.184 0.205
hispanic 1somehs (0,24] 0.209 0.224
nh black 1somehs (0,24] 0.204 0.220
nh multirace 1somehs (0,24] 0.220 0.235
nh other 1somehs (0,24] 0.182 0.196
nhwhite 1somehs (0,24] 0.153 0.166
hispanic 3somecol (0,24] 0.089 0.090
nh black 3somecol (0,24] 0.086 0.087
nh multirace 3somecol (0,24] 0.094 0.096
nh other 3somecol (0,24] 0.076 0.075
nhwhite 3somecol (0,24] 0.063 0.060

Which show us the estimated probability of reporting poor/fair health for each specified type of “fake person” that we generate. For example, let’s look at the probability for a Non-Hispanic white, age 39-59 with a college education, compared to a Hispanic person, age 39-59 with a primary school education:

dat[which(dat$race_eth=="nhwhite"&dat$agec=="(39,59]"&dat$educ=="4colgrad"),]
dat[which(dat$race_eth=="hispanic"&dat$agec=="(39,59]"&dat$educ=="0Prim"),]

The first case has an estimated probability of reporting poor/fair health of about 7%, while the second case has nearly a 50% chance. These fitted probabilities are often a more effective way to convey the results of a model than talking through all of the regression coefficients.

The probability that a NH white person who is <25 years old and has a high school education reports poor/fair health is 0.085.
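As a check, you can compute this probability by hand from the logit coefficients printed above: high school graduate and age (0,24] are the reference categories, so the linear predictor is just the intercept plus the nhwhite coefficient, and plogis() converts it to a probability.

#verify the fitted probability by hand using the logit coefficients
b <- coef(fit.logit)
plogis(b["(Intercept)"] + b["race_ethnhwhite"])  #approximately 0.085, matching the table above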

Nested model comparison

Often in a research setting we are interested in comparing several models, and almost never are we satisfied with a single solitary model. This is because the literature on our subjects often has multiple facets to it. So, for instance, we may be interested in how SES mediates the effects of race/ethnicity on health (see Shuey and Wilson 2008 and Sudano and Baker 2005 or Hummer 1993 for a few examples of these types of analysis).

Typically in these types of analyses, predictor variables are entered into the model in “blocks”. For example, let’s look at the self-rated health outcome from above (0 = good to excellent health, 1 = fair/poor health). But instead of entering all variables into the model simultaneously, we begin with the effect of race/ethnicity, then add the effects of SES, and then the effects of health behavior variables.

fit.logit1<-svyglm(badhealth~race_eth,design= des, family=binomial) #race only
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit2<-svyglm(badhealth~race_eth+educ,design= des, family=binomial) #race+education
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit3<-svyglm(badhealth~race_eth+educ+ins+smoke,design= des, family=binomial)#race+education+health behaviors
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fit.logit1)
## 
## Call:
## svyglm(formula = badhealth ~ race_eth, design = des, family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.10678    0.03280 -33.742  < 2e-16 ***
## race_ethnh black     -0.22286    0.04652  -4.790 1.67e-06 ***
## race_ethnh multirace -0.34678    0.08110  -4.276 1.90e-05 ***
## race_ethnh other     -0.82381    0.07938 -10.378  < 2e-16 ***
## race_ethnhwhite      -0.64963    0.03597 -18.059  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.000005)
## 
## Number of Fisher Scoring iterations: 4

In model 1 we see that Hispanics and blacks have higher odds of reporting poor self-rated health compared to non-Hispanic whites, while the “other” group shows lower odds of reporting poor health.

Now, let’s see if, by controlling for education, some of these differences go away or are reduced. The fancy word for an effect being reduced is “attenuated”. We will also test whether adding the education terms significantly improves the model fit. Traditionally this would be done using a likelihood ratio test, but in survey models that’s not kosher, so instead we use a design-adjusted Wald test of the education terms via regTermTest().

summary(fit.logit2)
## 
## Call:
## svyglm(formula = badhealth ~ race_eth + educ, design = des, family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.25928    0.04081 -30.858  < 2e-16 ***
## race_ethnh black      0.11161    0.04935   2.261 0.023731 *  
## race_ethnh multirace  0.09546    0.08542   1.118 0.263778    
## race_ethnh other     -0.18534    0.08571  -2.162 0.030592 *  
## race_ethnhwhite      -0.14924    0.03967  -3.762 0.000169 ***
## educ0Prim             1.12712    0.07035  16.023  < 2e-16 ***
## educ1somehs           0.71026    0.04937  14.386  < 2e-16 ***
## educ3somecol         -0.33133    0.03270 -10.132  < 2e-16 ***
## educ4colgrad         -1.10746    0.03392 -32.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9984191)
## 
## Number of Fisher Scoring iterations: 5
regTermTest(fit.logit2, test.terms = ~educ, method="Wald", df = NULL)
## Wald test for educ
##  in svyglm(formula = badhealth ~ race_eth + educ, design = des, family = binomial)
## F =  513.1387  on  4  and  199207  df: p= < 2.22e-16

So, we see the race effects in all groups attenuate (shrink toward zero) somewhat after considering education; the differences in health are smaller once you control for education.

The F test also suggests that the second model fits the data better than the first. It is another omnibus test, asking whether there is any variation in our outcome by education in the second model.

Next we consider the third model, which contains health behaviors of current smoking and insurance coverage:

summary(fit.logit3)
## 
## Call:
## svyglm(formula = badhealth ~ race_eth + educ + ins + smoke, design = des, 
##     family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.57045    0.05817 -26.996  < 2e-16 ***
## race_ethnh black      0.04697    0.04956   0.948  0.34324    
## race_ethnh multirace -0.06338    0.08678  -0.730  0.46515    
## race_ethnh other     -0.21570    0.08552  -2.522  0.01166 *  
## race_ethnhwhite      -0.30536    0.04029  -7.579  3.5e-14 ***
## educ0Prim             1.13629    0.07059  16.096  < 2e-16 ***
## educ1somehs           0.63809    0.05022  12.706  < 2e-16 ***
## educ3somecol         -0.31346    0.03301  -9.495  < 2e-16 ***
## educ4colgrad         -1.00270    0.03491 -28.721  < 2e-16 ***
## ins                   0.15316    0.04674   3.277  0.00105 ** 
## smokeCurrent          0.67681    0.03581  18.898  < 2e-16 ***
## smokeFormer           0.50403    0.03106  16.229  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.994982)
## 
## Number of Fisher Scoring iterations: 5
regTermTest(fit.logit3, test.terms=~ins+smoke, method="Wald", df = NULL)
## Wald test for ins smoke
##  in svyglm(formula = badhealth ~ race_eth + educ + ins + smoke, design = des, 
##     family = binomial)
## F =  155.2489  on  3  and  199204  df: p= < 2.22e-16

In this model, we see the gaps between Hispanics/blacks and whites widen again. This is somewhat confusing, but is most likely related to the differential rates of smoking among those groups, compared to whites. Both current and former smokers are more likely to report poor/fair health, and having insurance coverage is associated with modestly higher odds of reporting poor/fair health. Finally, we see that this model fits significantly better than the previous model.

stargazer(fit.logit1, fit.logit2, fit.logit3,type = "html", style="demography", covariate.labels =c("Black","MultRace" ,"Other","NHWhite", "PrimarySchool", "SomeHS","SomeColl", "CollGrad",  "Insurance", "Current Smoker", "Former Smoker"), ci = T, apply.coef = myexp )
badhealth
                 Model 1                   Model 2                   Model 3
Black            0.800*** (0.709, 0.891)   1.118*** (1.021, 1.215)   1.048*** (0.951, 1.145)
MultRace         0.707*** (0.548, 0.866)   1.100*** (0.933, 1.268)   0.939*** (0.769, 1.109)
Other            0.439*** (0.283, 0.594)   0.831*** (0.663, 0.999)   0.806*** (0.638, 0.974)
NHWhite          0.522*** (0.452, 0.593)   0.861*** (0.784, 0.939)   0.737*** (0.658, 0.816)
PrimarySchool                              3.087*** (2.949, 3.225)   3.115*** (2.977, 3.254)
SomeHS                                     2.035*** (1.938, 2.131)   1.893*** (1.794, 1.991)
SomeColl                                   0.718*** (0.654, 0.782)   0.731*** (0.666, 0.796)
CollGrad                                   0.330*** (0.264, 0.397)   0.367*** (0.298, 0.435)
Insurance                                                            1.166*** (1.074, 1.257)
Current Smoker                                                       1.968*** (1.897, 2.038)
Former Smoker                                                        1.655*** (1.595, 1.716)
Constant         0.331*** (0.266, 0.395)   0.284*** (0.204, 0.364)   0.208*** (0.094, 0.322)
N                200,568                   200,568                   200,568
Log Likelihood   -84,591.230               -79,950.350               -78,938.390
AIC              169,192.500               159,918.700               157,900.800
*p < .05; **p < .01; ***p < .001

Stratified models

Often in the literature, we will see models stratified by some predictor. This is usually because a specific hypothesis is stated regarding how the effect of a particular predictor varies by some categorical variable. In this case, we may be interested in considering if education or smoking universally affects the poor health outcome. We get at this by stratifying our analysis by race (in this example).

The easiest way to do this is to subset the data by race and run the models separately.

The first thing we do is test for the interaction of education and race. If this interaction is not significant, we have no justification for proceeding, because the effect of education would not vary by race group. This is the test of parallel slopes, à la the ANCOVA model.

fit.logitint<-svyglm(badhealth~race_eth*educ+ins+smoke,design= des, family=binomial)#race*education interaction+health behaviors
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
regTermTest(fit.logitint, test.terms = ~race_eth:educ, method = "Wald", df=NULL)
## Wald test for race_eth:educ
##  in svyglm(formula = badhealth ~ race_eth * educ + ins + smoke, design = des, 
##     family = binomial)
## F =  2.664228  on  16  and  199188  df: p= 0.00031822

Here, the F-test does indicate that the interaction term in the model is significant, so the effects of education are not constant by race.

Now we stratify our models:

fit.unrestricted<-svyglm(badhealth~educ+ins+smoke,design= des, family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit.white<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, white==1), family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit.black<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, black==1), family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit.other1<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, race_eth=="nh other"), family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit.other2<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, race_eth=="nh multirace"), family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
fit.logit.hisp<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, hispanic==1), family=binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Here we examine the model results next to one another

stargazer(fit.logit.white, fit.logit.black, fit.logit.hisp, fit.logit.other1, fit.logit.other2,
          type = "html", style="demography",
          covariate.labels =c("Primary School", "Some HS", "Some College", "College Grad", "Insurance", "Current Smoker", "Former Smoker", "Intercept"),
          column.labels = c("NH White", "NH Black", "Hispanic", "NH Other 1", "NH Other 2"),
          ci=T , apply.coef = myexp)
badhealth
                 NH White (Model 1)        NH Black (Model 2)        Hispanic (Model 3)        NH Other 1 (Model 4)      NH Other 2 (Model 5)
Primary School   3.163*** (2.918, 3.408)   1.860*** (1.475, 2.245)   3.376*** (3.179, 3.573)   1.172** (0.368, 1.975)    2.177*** (1.257, 3.098)
Some HS          2.132*** (2.005, 2.259)   1.701*** (1.484, 1.918)   1.662*** (1.456, 1.868)   2.700*** (1.978, 3.423)   1.661*** (1.020, 2.301)
Some College     0.723*** (0.649, 0.798)   0.764*** (0.601, 0.927)   0.688*** (0.501, 0.874)   0.896*** (0.500, 1.292)   0.795*** (0.426, 1.164)
College Grad     0.354*** (0.277, 0.431)   0.417*** (0.233, 0.601)   0.407*** (0.209, 0.606)   0.402* (0.041, 0.763)     0.426* (0.003, 0.848)
Insurance        1.020*** (0.900, 1.141)   1.243*** (1.036, 1.451)   1.244*** (1.075, 1.413)   1.425*** (0.990, 1.859)   1.594*** (1.157, 2.032)
Current Smoker   2.175*** (2.092, 2.259)   2.155*** (1.994, 2.317)   1.351*** (1.154, 1.549)   1.723*** (1.346, 2.100)   1.956*** (1.590, 2.323)
Former Smoker    1.624*** (1.556, 1.692)   2.203*** (2.024, 2.383)   1.389*** (1.224, 1.554)   2.439*** (2.099, 2.780)   2.046*** (1.681, 2.411)
Intercept        0.170** (0.041, 0.298)    0.191 (-0.025, 0.407)     0.220* (0.024, 0.416)     0.124 (-0.360, 0.608)     0.134 (-0.382, 0.651)
N                150,655                   20,285                    17,979                    7,964                     3,685
Log Likelihood   -54,302.950               -9,109.037                -8,590.519                -2,611.371                -1,582.710
AIC              108,621.900               18,234.070                17,197.040                5,238.742                 3,181.420
*p < .05; **p < .01; ***p < .001
#function to test the equality of a single coefficient across two fitted models
beta.test<-function(model1, model2, betaname){
  s1<-summary(model1)$coef
  s2<-summary(model2)$coef
  #squared difference between the two estimates
  db <- (s2[rownames(s2)==betaname,1]-s1[rownames(s1)==betaname,1])^2
  #sum of the two squared standard errors
  sd <- s2[rownames(s2)==betaname,2]^2+s1[rownames(s1)==betaname,2]^2
  #Wald-type statistic (z squared), compared to a chi-square with 1 df
  td <- db/sd
  beta1 <- s1[rownames(s1)==betaname,1]
  beta2 <- s2[rownames(s2)==betaname,1]
  pv <- 1-pchisq(td, df = 1)
  print(list(beta=betaname, beta1=beta1, beta2=beta2, x2=td, pvalue=pv))
}

Here is an example of testing whether the “Current Smoker” effect is the same among whites and blacks. This follows the logic set forth in Allison (2010, p. 219).

We test \(H_0: \beta_{1j} = \beta_{1k}\) across two models \(j \text{ and } k\) with \[z= \frac{\beta_{1j} - \beta_{1k}}{\sqrt{\left[ s.e.(\beta_{1j}) \right]^2+\left[ s.e.(\beta_{1k}) \right]^2}}\]

and compare \(|z|\) to a standard Normal distribution for the p value. The beta.test() function above computes the equivalent \(z^2\) statistic and compares it to a \(\chi^2\) distribution with 1 df.

beta.test(fit.logit.white, fit.logit.black, "smokeCurrent")
## $beta
## [1] "smokeCurrent"
## 
## $beta1
## [1] 0.7771827
## 
## $beta2
## [1] 0.76783
## 
## $x2
## [1] 0.01016842
## 
## $pvalue
## [1] 0.9196786

Which suggests that the effect of current smoking is the same in the two groups.

Here we also examine the equality of the college education effect, this time between Hispanics and non-Hispanic multirace respondents.

beta.test(fit.logit.hisp, fit.logit.other2, "educ4colgrad")
## $beta
## [1] "educ4colgrad"
## 
## $beta1
## [1] -0.8978635
## 
## $beta2
## [1] -0.8541422
## 
## $x2
## [1] 0.03368197
## 
## $pvalue
## [1] 0.8543849

Which suggests the effect of college education is the same in these two groups.