Categorical outcomes
- Logit and Probit models
- Logit/Probit Regression example
Present both model coefficients next to one another

Categorical outcomes

In the vast majority of situations in your work as demographers, your outcome will either be of a qualitative nature or non-normally distributed, especially if you work with individual level survey data.

When we speak of qualitative outcomes, we generally are concerned with the observation of:

A particular behavior (marriage, migration, birth, death)
A transition (unemployed to employed, married to divorced)
A threshold characteristic (adoption of sterilization after ideal # of children is reached)
In general, each of these outcomes would be coded as a binary variable (1 or 0) depending on whether the outcome of interest was observed

Logit and Probit models

If our outcome is dichotomous (1/0), the natural distribution to consider for a GLM is the binomial $y \sim \ \text{Binomial}\binom{n}{p}$ with $p$ being the mean of the binomial, and n being the number of trials, generally when you have individual data, n is always 1, and p is the probability of observing the 1, conditional on the observed predictors. There are two common techniques to estimate the mean, logistic and probit regression. In a Logistic model, the link function is the inverse logit function, or

$\text{Logit}^{-1}(p) =log \frac{p}{(1-p)}$

Which gives us the following conditional mean model:

$E(y|x) = \frac{1}{1+ exp({-\sum_k \beta_k x_{ik}})}$ Which situates the model within the logistic distribution function. Expressing p as a linear model is done via this log odds transformation of the probability:

$log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}$

For the Probit model, the link function is the inverse cumulative Normal distribution:

$E(y|x) = \Phi^{-1}(p) = \Phi (\sum_k \beta_k x_{ik})$

In practice, these two models give very similar estimates and will have very similar interpretations, although the logistic regression model has the more convenient odds ratio interpretation of its $\beta's$ , while the probit model’s coefficients are often transformed into marginal coefficients, which is more of a challenge and software generally doesn’t give you these by default.

Logit/Probit Regression example

There is no trick to fitting logistic regression models using survey data, just use the svyglm() function with the appropriate distribution specified via family=binomial for logistic and family=binomial(link="probit") for the probit model. You don’t have to specify the link function if you’re just doing the logistic model, as it is the default.

Example

This example will cover the use of R functions for fitting binary logit and probit models to complex survey data.

For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data. Link

Analysis

First, we will do some descriptive analysis, such as means and cross tabulations.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sub<-brfss16m%>%
  select(badhealth,mmsaname, bmi, agec,race_eth, marst, educ,white, black, hispanic, other, smoke, ins, mmsawt, ststr) %>%
  filter( complete.cases(.))

#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data =sub )

First, we examine the % of US adults with poor/fair health by education level, and do a survey-corrected chi-square test for independence.

library(ggplot2)
cat<-svyby(formula = ~badhealth, by = ~educ, design = des, FUN = svymean, na.rm=T)
svychisq(~badhealth+educ, design = des)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~badhealth + educ, design = des)
## F = 872.14, ndf = 3.6451e+00, ddf = 7.8436e+05, p-value < 2.2e-16

qplot(x=cat$educ,y=cat$badhealth, data=cat ,xlab="Education", ylab="%  Fair/Poor Health" )+
geom_errorbar(aes(x=educ, ymin=badhealth-1.96*se,ymax= badhealth+1.96*se), width=.25)+
ggtitle(label = "% of US Adults with Fair/Poor Health by Education")

calculate race*health cross tabulation, and plot it

dog<-svyby(formula = ~badhealth, by = ~race_eth, design = des, FUN = svymean, na.rm=T)
svychisq(~badhealth+race_eth, design = des)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~badhealth + race_eth, design = des)
## F = 180.45, ndf = 3.4913e+00, ddf = 7.5127e+05, p-value < 2.2e-16

qplot(x=dog$race_eth,y=dog$badhealth, data=dog ,xlab="Race", ylab="%  Fair/Poor Health" )+
geom_errorbar(aes(x=race_eth, ymin=badhealth-se,ymax= badhealth+se), width=.25)+
ggtitle(label = "% of US Adults with Fair/Poor Health by Race/Ethnicity")

calculate race by education by health cross tabulation, and plot it

catdog<-svyby(formula = ~badhealth, by = ~race_eth+educ, design = des, FUN = svymean, na.rm=T)
catdog

ABCDEFGHIJ0123456789

	race_eth <fctr>	educ <fctr>	badhealth <dbl>	se <dbl>
hispanic.2hsgrad	hispanic	2hsgrad	0.20662309	0.009324475
nh black.2hsgrad	nh black	2hsgrad	0.22078292	0.008922924
nh multirace.2hsgrad	nh multirace	2hsgrad	0.22169288	0.023942488
nh other.2hsgrad	nh other	2hsgrad	0.17628289	0.022601657
nhwhite.2hsgrad	nhwhite	2hsgrad	0.18809241	0.003620089
hispanic.0Prim	hispanic	0Prim	0.52268219	0.017371573
nh black.0Prim	nh black	0Prim	0.40835384	0.042074676
nh multirace.0Prim	nh multirace	0Prim	0.56121115	0.107934890
nh other.0Prim	nh other	0Prim	0.48458262	0.075413576
nhwhite.0Prim	nhwhite	0Prim	0.45751070	0.023759156

#this plot is a little more complicated
catdog$race_rec<-rep(c("NHWhite","Hispanic", "NHBlack","NH Multi", "NH Other" ),5)
catdog$educ_rec<-factor(c(rep("Primary Sch", 5), rep("LT HS", 5), rep("HS Grad", 5), 
                          rep("Some College", 5), rep("College Grad", 5)), ordered = T)
#fix the order of the education factor levels
catdog$educ_rec<-factor(catdog$educ_rec, levels(catdog$educ_rec)[c(4,3,2,5,1)])

p<-ggplot(catdog, aes(educ,badhealth),xlab="Race", ylab="% Bad Health")
p<-p+geom_point(aes(colour=race_eth))
p<-p+geom_line(aes(colour=race_eth,group=race_eth))
p<-p+geom_errorbar(aes(x=educ, ymin=badhealth-1.96*se,ymax= badhealth+1.96*se,colour=race_eth), width=.25)
p<-p+ylab("% Fair/Poor Health")
p<-p+xlab("Education Level")
p+ggtitle("% of US Adults in 2011 in Bad Health by Race and Education")

Which shows a significant variation in health status by education level and race/ethnicity

#Logit model
fit.logit<-svyglm(badhealth~race_eth+educ+agec,
                  design= des,
                  family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

exp(coef(fit.logit))#odds ratios

##          (Intercept)     race_ethnh black race_ethnh multirace 
##            0.1303120            0.8399410            1.1195877 
##     race_ethnh other      race_ethnhwhite            educ0Prim 
##            0.6681388            0.6135648            2.9570045 
##          educ1somehs         educ3somecol         educ4colgrad 
##            2.0602831            0.7325664            0.3152745 
##          agec(24,39]          agec(39,59]          agec(59,79] 
##            1.6285442            2.8825523            4.3441486 
##          agec(79,99] 
##            5.1250512

exp(confint(fit.logit)) #confidence intervals for odds ratios

##                          2.5 %    97.5 %
## (Intercept)          0.1138632 0.1491369
## race_ethnh black     0.7649679 0.9222622
## race_ethnh multirace 0.9491247 1.3206659
## race_ethnh other     0.5735897 0.7782731
## race_ethnhwhite      0.5699745 0.6604887
## educ0Prim            2.6224477 3.3342422
## educ1somehs          1.8811879 2.2564286
## educ3somecol         0.6891588 0.7787081
## educ4colgrad         0.2960943 0.3356971
## agec(24,39]          1.4262519 1.8595286
## agec(39,59]          2.5476044 3.2615376
## agec(59,79]          3.8372618 4.9179931
## agec(79,99]          4.4530765 5.8984277

#probit model
fit.probit<-svyglm(badhealth~race_eth+educ+agec,
                   design=des,
                   family=binomial(link= "probit"))

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

Present both model coefficients next to one another

myexp<-function(x){exp(x)}

stargazer(fit.logit,
          type = "html",
          style="demography",
          covariate.labels=c("Black","MultRace" ,"Other","NHWhite", "PrimarySchool", "SomeHS","SomeColl", "CollGrad", "Age 24-39","Age 39-59" ,"Age 59-79", "Age 80+"),
          ci=T, column.sep.width = "10pt",apply.coef = myexp)


	badhealth

Black	0.840^***
	(0.746, 0.933)
MultRace	1.120^***
	(0.954, 1.285)
Other	0.668^***
	(0.516, 0.821)
NHWhite	0.614^***
	(0.540, 0.687)
PrimarySchool	2.957^***
	(2.837, 3.077)
SomeHS	2.060^***
	(1.969, 2.151)
SomeColl	0.733^***
	(0.671, 0.794)
CollGrad	0.315^***
	(0.253, 0.378)
Age 24-39	1.629^***
	(1.496, 1.761)
Age 39-59	2.883^***
	(2.759, 3.006)
Age 59-79	4.344^***
	(4.220, 4.468)
Age 80+	5.125^***
	(4.985, 5.266)
Constant	0.130
	(-0.005, 0.265)
N	216,561
Log Likelihood	-80,335.410
AIC	160,696.800

p < .05; p < .01; p < .001

Both of these models show the exact same patterns of effects, with Hispanics, blacks and multi-race individuals showing increased chances of reporting poor/fair health, when compared to whites (Reference group). Similarly, the education variables shows a negative linear trend, with those with more education having lower chances of reporting poor/fair health compared to those with a primary school education (Reference group), and likewise, as people get older, they are more likely to report poor/fair health, compared to those under age 24 (Reference group).

Marginal effects

In a regression model in general, the $\beta's$ are the solution to the differential equation: $\frac{\partial y}{\partial x} = \beta$

which is just the rate of change in y, given x, known as the marginal effect. This is the case for strictly linear model

In the logit and probit model, which are nonlinear models, owing to their presumed model structure, the marginal effect also has to take into account the change in the respective pdf with respect to the mean, or:

$\frac{\partial y}{\partial x} = \beta *\frac{\partial \Phi(x' \beta)}{\partial x'\beta}$

So we have to multiply the estimated $\beta$ by the p.d.f. of the assumed marginal distribution evaluated at the mean function. In R that’s not big problem:

#Logit marginal effects
log.marg<-coef(fit.logit)*mean(dlogis(predict(fit.logit)), na.rm=T)

#for probit now
prob.marg<-coef(fit.probit)*mean(dnorm(predict(fit.probit)), na.rm=T)


effects<-data.frame(effect=c(log.marg, prob.marg),
                    term=rep(names(log.marg),2),
                    model=c(rep("logit", length(log.marg)),rep("probit", length(log.marg)))
                    )

effects%>%
  ggplot(aes(x=term, y=effect))+geom_point(aes(color=model, group=model, shape=model),
  position=position_jitterdodge(jitter.width = 1),
             size=2)+
  ylab("Marginal Effect")+
  xlab("Model Term")+
  geom_abline(intercept = 0, slope=0)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  ggtitle(label = "Comparison of marginal effects in Logit and Probit models")

Which shows us that the marginal effects are very similar between the two models.

Fitted Values

As I often say, I like to talk about “interesting cases”. In order to do this, you need the fitted mean for a particular case. This is done by getting the fitted values for that case from the model. To do this, I generate a bunch of “fake people” that have variability in the model covariates, and fit the model for each type of person. This is perhaps overkill in this example because I fit every type of person, ideally you would want a few interesting cases to discuss.

In order to derive these, we effectively “solve the equation” for the model, or another way of saying it, we estimate the conditional mean of y, by specifying the x values that are meaningful for a particular comparison. For example the probability of a white, young college educated person reporting poor health is just the estimate of the model, evaluated at those particular characteristics:

$\text{Pr(poor/fair health)} = \frac{1}{1+exp({\beta_0 + \beta_1*white + \beta_2*young+\beta_3*college})}$

#get a series of predicted probabilites for different "types" of people for each model
#expand.grid will generate all possible combinations of values you specify
dat<-expand.grid(race_eth=levels(brfss16m$race_eth), educ=levels(brfss16m$educ), agec=levels(brfss16m$agec))

#You MAY need to get rid of impossible cases here

#generate the fitted values
fit<-predict(fit.logit, newdat=dat,type="response")
fitp<-predict(fit.probit, newdat=dat,type="response")
#add the values to the fake data
dat$fitted.prob.lrm<-round(fit, 3)
dat$fitted.prob.pro<-round(fitp, 3)

#Print the fitted probabilities for the first 20 cases
knitr::kable(head(dat, n=20))

race_eth	educ	agec	fitted.prob.lrm	fitted.prob.pro
hispanic	2hsgrad	(0,24]	0.115	0.119
nh black	2hsgrad	(0,24]	0.099	0.101
nh multirace	2hsgrad	(0,24]	0.127	0.132
nh other	2hsgrad	(0,24]	0.080	0.080
nhwhite	2hsgrad	(0,24]	0.074	0.073
hispanic	0Prim	(0,24]	0.278	0.305
nh black	0Prim	(0,24]	0.245	0.272
nh multirace	0Prim	(0,24]	0.301	0.326
nh other	0Prim	(0,24]	0.205	0.230
nhwhite	0Prim	(0,24]	0.191	0.217
hispanic	1somehs	(0,24]	0.212	0.227
nh black	1somehs	(0,24]	0.184	0.200
nh multirace	1somehs	(0,24]	0.231	0.246
nh other	1somehs	(0,24]	0.152	0.165
nhwhite	1somehs	(0,24]	0.141	0.154
hispanic	3somecol	(0,24]	0.087	0.088
nh black	3somecol	(0,24]	0.074	0.074
nh multirace	3somecol	(0,24]	0.097	0.098
nh other	3somecol	(0,24]	0.060	0.057
nhwhite	3somecol	(0,24]	0.055	0.052

Which show us the estimated probability of reporting poor/fair health for each specified type of “fake person” that we generate. For example, let’s look at the probability for a Non-Hispanic white, age 39-59 with a college education, compared to a Hispanic person, age 39-59 with a primary school education:

dat[which(dat$race_eth=="nhwhite"&dat$agec=="(39,59]"&dat$educ=="4colgrad"),]

ABCDEFGHIJ0123456789

	race_eth <fctr>	educ <fctr>	agec <fctr>	fitted.prob.lrm <dbl>	fitted.prob.pro <dbl>
75	nhwhite	4colgrad	(39,59]	0.068	0.067

dat[which(dat$race_eth=="hispanic"&dat$agec=="(39,59]"&dat$educ=="0Prim"),]

ABCDEFGHIJ0123456789

	race_eth <fctr>	educ <fctr>	agec <fctr>	fitted.prob.lrm <dbl>	fitted.prob.pro <dbl>
56	hispanic	0Prim	(39,59]	0.526	0.521

The first case has an estimated probability of reporting poor/fair health of about 7%, while the second case has over about a 50% chance. These are often more effective ways to convey the result of a model, instead of talking about all the regression coefficients.

The probability that a NH white person who is <25 years old and has a high school education is 0.074

Nested model comparison

Often in a research setting we are interested in comparing several models, and almost never are we satisfied with a single solitary model. This is because the literature on our subjects often has multiple facets to it. So, for instance, we may be interested in how SES mediates the effects of race/ethnicity on health (see Shuey and Wilson 2008 and Sudano and Baker 2005 or Hummer 1993 for a few examples of these types of analysis).

Typically in these types of analysis, predictor variables are entered into the model in “blocks”. For example, let’s look at the self-rated health outcome (0=good or excellent health, 1= fair/poor health) from last week. But instead of entering all variables in the model simultaneously, we begin with the effect of race/ethnicity, then add the effect of SES then the effects of health behavior variables.

fit.logit1<-svyglm(badhealth~race_eth,design= des, family=binomial) #race only

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit2<-svyglm(badhealth~race_eth+educ,design= des, family=binomial) #race+education

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit3<-svyglm(badhealth~race_eth+educ+ins+smoke,design= des, family=binomial)#race+education+health behaviors

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

summary(fit.logit1)

## 
## Call:
## svyglm(formula = badhealth ~ race_eth, design = des, family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.04839    0.02940 -35.654  < 2e-16 ***
## race_ethnh black     -0.38231    0.04356  -8.777  < 2e-16 ***
## race_ethnh multirace -0.33941    0.07926  -4.282 1.85e-05 ***
## race_ethnh other     -1.03318    0.07496 -13.784  < 2e-16 ***
## race_ethnhwhite      -0.76655    0.03238 -23.674  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.000005)
## 
## Number of Fisher Scoring iterations: 4

In model 1 we see that Hispanics and blacks have a higher odds of reporting poor self rated health, compared to non-Hispanic whites, while the “other” group shows lower odds of reporting poor health.

Now, let’s see if, by controlling for education, some of these differences go away, or are reduced. The fancy word for when an effect is reduced is “attenuated”. We will also do a test to see if the model with the education term significantly improves the model fit. Traditionally this would be done using a likelihood ratio test, but in survey models, that’s not kosher

summary(fit.logit2)

## 
## Call:
## svyglm(formula = badhealth ~ race_eth + educ, design = des, family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.26308    0.03681 -34.310  < 2e-16 ***
## race_ethnh black     -0.02520    0.04656  -0.541   0.5883    
## race_ethnh multirace  0.14082    0.08388   1.679   0.0932 .  
## race_ethnh other     -0.35181    0.07835  -4.490 7.12e-06 ***
## race_ethnhwhite      -0.21911    0.03596  -6.092 1.11e-09 ***
## educ0Prim             1.32167    0.06051  21.842  < 2e-16 ***
## educ1somehs           0.76391    0.04533  16.852  < 2e-16 ***
## educ3somecol         -0.33370    0.03073 -10.858  < 2e-16 ***
## educ4colgrad         -1.12429    0.03174 -35.425  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.998466)
## 
## Number of Fisher Scoring iterations: 5

regTermTest(fit.logit2, test.terms = ~educ, method="Wald", df = NULL)

## Wald test for educ
##  in svyglm(formula = badhealth ~ race_eth + educ, design = des, family = binomial)
## F =  716.3323  on  4  and  215176  df: p= < 2.22e-16

so, we see the race effects in all groups attenuate (reduce in size) somewhat after considering education, so the differences in health are smaller once you control for education.

The F test also suggest that the second model fits the data better than the first. It is another of these omnibus tests that asks whether there is any variation in our outcome by education in the second model.

Next we consider the third model, which contains health behaviors of current smoking and insurance coverage:

summary(fit.logit3)

## 
## Call:
## svyglm(formula = badhealth ~ race_eth + educ + ins + smoke, design = des, 
##     family = binomial)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = sub)
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.547335   0.052749 -29.334  < 2e-16 ***
## race_ethnh black     -0.093037   0.046911  -1.983  0.04734 *  
## race_ethnh multirace -0.002242   0.085095  -0.026  0.97898    
## race_ethnh other     -0.379124   0.079288  -4.782 1.74e-06 ***
## race_ethnhwhite      -0.376244   0.036723 -10.245  < 2e-16 ***
## educ0Prim             1.343604   0.060368  22.257  < 2e-16 ***
## educ1somehs           0.709797   0.045535  15.588  < 2e-16 ***
## educ3somecol         -0.311911   0.031250  -9.981  < 2e-16 ***
## educ4colgrad         -1.021504   0.032858 -31.089  < 2e-16 ***
## ins                   0.123410   0.046091   2.678  0.00742 ** 
## smokeCurrent          0.630684   0.032645  19.320  < 2e-16 ***
## smokeFormer           0.517375   0.029206  17.715  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9968191)
## 
## Number of Fisher Scoring iterations: 5

regTermTest(fit.logit3, test.terms=~ins+smoke, method="Wald", df = NULL)

## Wald test for ins smoke
##  in svyglm(formula = badhealth ~ race_eth + educ + ins + smoke, design = des, 
##     family = binomial)
## F =  167.7186  on  3  and  215173  df: p= < 2.22e-16

In this model, we see the effects for Hispanics and blacks go back up. This is somewhat confusing, but is most likely related to the differential rates of smoking among those groups, compared to whites. Both current and former smokers are more likely to report poor/fair health, while insurance coverage does not affect the odds at all. Finally, we see that the model is significantly better fitting than the previous model.

stargazer(fit.logit1, fit.logit2, fit.logit3,type = "html", style="demography", covariate.labels =c("Black","MultRace" ,"Other","NHWhite", "PrimarySchool", "SomeHS","SomeColl", "CollGrad",  "Insurance", "Current Smoker", "Former Smoker"), ci = T )


	badhealth
	Model 1	Model 2	Model 3

Black	-0.382^***	-0.025	-0.093^*
	(-0.468, -0.297)	(-0.116, 0.066)	(-0.185, -0.001)
MultRace	-0.339^***	0.141	-0.002
	(-0.495, -0.184)	(-0.024, 0.305)	(-0.169, 0.165)
Other	-1.033^***	-0.352^***	-0.379^***
	(-1.180, -0.886)	(-0.505, -0.198)	(-0.535, -0.224)
NHWhite	-0.767^***	-0.219^***	-0.376^***
	(-0.830, -0.703)	(-0.290, -0.149)	(-0.448, -0.304)
PrimarySchool		1.322^***	1.344^***
		(1.203, 1.440)	(1.225, 1.462)
SomeHS		0.764^***	0.710^***
		(0.675, 0.853)	(0.621, 0.799)
SomeColl		-0.334^***	-0.312^***
		(-0.394, -0.273)	(-0.373, -0.251)
CollGrad		-1.124^***	-1.022^***
		(-1.186, -1.062)	(-1.086, -0.957)
Insurance			0.123^**
			(0.033, 0.214)
Current Smoker			0.631^***
			(0.567, 0.695)
Former Smoker			0.517^***
			(0.460, 0.575)
Constant	-1.048^***	-1.263^***	-1.547^***
	(-1.106, -0.991)	(-1.335, -1.191)	(-1.651, -1.444)
N	216,561	216,561	216,561
Log Likelihood	-88,672.490	-83,124.470	-82,153.280
AIC	177,355.000	166,266.900	164,330.600

p < .05; p < .01; p < .001

Stratified models

Often in the literature, we will see models stratified by some predictor. This is usually because a specific hypothesis is stated regarding how the effect of a particular predictor varies by some categorical variable. In this case, we may be interested in considering if education or smoking universally affects the poor health outcome. We get at this by stratifying our analysis by race (in this example).

The easiest way to do this is to subset the data by race and run the models separately.

The first thing we do is test for the interaction of education and race. If this interaction is not significant, we have no justification for proceeding, because the effect of education does not vary by race group. This is the test of parallel slopes, a’ la the ANCOVA model

fit.logitint<-svyglm(badhealth~race_eth*educ+ins+smoke,design= des, family=binomial)#race*education interaction+health behaviors

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

regTermTest(fit.logitint, test.terms = ~race_eth:educ, method = "Wald", df=NULL)

## Wald test for race_eth:educ
##  in svyglm(formula = badhealth ~ race_eth * educ + ins + smoke, design = des, 
##     family = binomial)
## F =  1.834767  on  16  and  215157  df: p= 0.021648

Here, the F-test does indicate that the interaction term in the model is significant, so the effects of education are not constant by race.

Now we stratify our models:

fit.unrestricted<-svyglm(badhealth~educ+ins+smoke,design= des, family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit.white<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, white==1), family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit.black<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, black==1), family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit.other1<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, race_eth=="nh other"), family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit.other2<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, race_eth=="nh multirace"), family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit.hisp<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, hispanic==1), family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

Here we examine the model results next to one another

stargazer(fit.logit.white, fit.logit.black, fit.logit.hisp, fit.logit.other1, fit.logit.other2,
          type = "html", style="demography",
          covariate.labels =c(  "Some HS","HS Graduate", "Some College", "College Grad", "Insurance", "Current Smoker", "Former Smoker","Intercept"),
          column.labels = c("NH White", "NH Black", "Hispanic", "NH Other 1", "NH Other 2"),
          ci=T )


	badhealth
	NH White	NH Black	Hispanic	NH Other 1	NH Other 2
	Model 1	Model 2	Model 3	Model 4	Model 5

Some HS	1.261^***	0.843^***	1.482^***	1.480^***	1.576^***
	(1.072, 1.450)	(0.475, 1.211)	(1.307, 1.657)	(0.830, 2.130)	(0.678, 2.474)
HS Graduate	0.736^***	0.607^***	0.776^***	0.688^*	0.701^*
	(0.622, 0.851)	(0.392, 0.822)	(0.591, 0.961)	(0.126, 1.250)	(0.120, 1.282)
Some College	-0.313^***	-0.382^***	-0.168	-0.657^**	-0.080
	(-0.382, -0.244)	(-0.536, -0.229)	(-0.339, 0.003)	(-1.105, -0.210)	(-0.455, 0.296)
College Grad	-1.061^***	-0.899^***	-0.951^***	-0.983^***	-0.775^***
	(-1.131, -0.990)	(-1.081, -0.717)	(-1.127, -0.775)	(-1.376, -0.590)	(-1.157, -0.394)
Insurance	0.003	0.029	0.273^***	-0.034	0.111
	(-0.115, 0.121)	(-0.186, 0.244)	(0.119, 0.427)	(-0.616, 0.547)	(-0.405, 0.627)
Current Smoker	0.740^***	0.567^***	0.307^***	0.784^***	0.659^***
	(0.664, 0.816)	(0.405, 0.729)	(0.131, 0.483)	(0.456, 1.112)	(0.286, 1.032)
Former Smoker	0.525^***	0.783^***	0.401^***	0.504^**	0.556^**
	(0.462, 0.589)	(0.619, 0.948)	(0.242, 0.561)	(0.186, 0.821)	(0.193, 0.919)
Intercept	-1.834^***	-1.553^***	-1.666^***	-1.731^***	-1.693^***
	(-1.963, -1.705)	(-1.768, -1.338)	(-1.834, -1.498)	(-2.326, -1.135)	(-2.318, -1.069)
N	163,923	21,529	19,564	8,139	3,406
Log Likelihood	-56,843.560	-9,140.441	-9,499.501	-2,388.959	-1,495.399
AIC	113,703.100	18,296.880	19,015.000	4,793.917	3,006.799

p < .05; p < .01; p < .001

beta.test<-function(model1, model2, betaname){
s1<-summary(model1)$coef
s2<-summary(model2)$coef
db <- ((s2[rownames(s2)==betaname,1]-s1[rownames(s1)==betaname,1]))^2
sd <-s2[rownames(s2)==betaname,2]^2+s1[rownames(s1)==betaname,2]^2
td <- db/sd
beta1=s1[rownames(s1)==betaname,1]
beta2=s2[rownames(s2)==betaname,1]
pv<-1- pchisq(td, df = 1)
print(list(beta=betaname,beta1=beta1, beta2=beta2, x2=td, pvalue=pv))
}

Here is an example of testing if the “Current Smoking” effect is the same among whites and blacks. This follows the logic set forth in Allison 2010, p 219

Test for $\beta_{1j} = \beta_{1k}$ in two models $j \text{ and } k$ $z= \frac{\beta_{1j} - \beta_{1k}}{\left[ s.e.(\beta_{1j}) \right]^2+\left[ s.e.(\beta_{1k}) \right]^2}$

compare the $|z|$ test to a normal distribution for the p value.

beta.test(fit.logit.white, fit.logit.black, "smokeCurrent")

## $beta
## [1] "smokeCurrent"
## 
## $beta1
## [1] 0.7399846
## 
## $beta2
## [1] 0.5670447
## 
## $x2
## [1] 3.571477
## 
## $pvalue
## [1] 0.05878002

Which suggests that the effect of current smoking is the same in the two groups.

Here we also examine the equality of the college education effect, this time between Hispanics and non hispanic multiracial

beta.test(fit.logit.hisp, fit.logit.other2, "educ4colgrad")

## $beta
## [1] "educ4colgrad"
## 
## $beta1
## [1] -0.9510186
## 
## $beta2
## [1] -0.7752798
## 
## $x2
## [1] 0.6725732
## 
## $pvalue
## [1] 0.4121559

Which suggests the effect of college education is the same in these two groups.

DEM 7283 - Example 2- Logit and Probit Models

Corey Sparks, PhD

February 4, 2019

Categorical outcomes

Logit and Probit models

Logit/Probit Regression example

Example

Analysis

calculate race*health cross tabulation, and plot it

calculate race by education by health cross tabulation, and plot it

Present both model coefficients next to one another

Marginal effects

Fitted Values

Nested model comparison

Stratified models