Categorical outcomes

In the vast majority of situations in your work as demographers, your outcome will either be of a qualitative nature or non-normally distributed, especially if you work with individual level survey data.

When we speak of qualitative outcomes, we generally are concerned with the observation of:

A particular behavior (marriage, migration, birth, death)
A transition (unemployed to employed, married to divorced)
A threshold characteristic (adoption of sterilization after ideal # of children is reached)
In general, each of these outcomes would be coded as a binary variable (1 or 0) depending on whether the outcome of interest was observed

Basics of Genearlized Linear Models

Up until now, we have been relying on linear statistical models which assumed the Normal distribution for our outcomes. A broader class of regression models, are Generalized Linear Models, or GLMs, which allow for linear regression for outcomes that are not assumed to come from a Normal distribution.

GLMs are a class of statistical models that link the mean of the specified distribution to a linear combination of covariates by some form of link function. For example, the Normal distribution has the mean, \(\mu\), which is typically estimated using the linear mean function :

\[\mu = \beta_0 + \beta_1 x_1\] Which describes the line that estimates the mean of the outcome variable as a linear function of the predictor variable \(x_1\). This model uses an identity link meaning there is no transformation of the linear mean function as it is connected to the mean of the outcome.

This can be written as:

\[g(u) = g(E(Y)) = \beta_0 + \beta_1 x_1\]

Where \(g()\) is the link function, linking the mean of the Normal distribution to the linear mean function of the model.

The linear model is appropriate for the Normal distribution, because this distribution can take any value from \(- \infty\) to \(\infty\). Other distributions do not have this wide range, so transformations of the linear mean function must be used so that the linear model remains on the scale of the data.

You have probably seen the binomial distribution in either a basic statistics course, remember the coin flips? Or in the context of a logistic regression model.

There are two ways the binomial distribution is typically used, the first is the context of logistic regression, where a special case of the binomial is used, called the Bernoulli distribution. This is the case of the binomial when there is basically a single coin flip, and you’re trying to figure out the probability that it is heads (or tails). This is said to be a single trial, and the outcome is either 1 or 0 (heads or tails).

The second way the binomial is used is when you have multiple trials, and you’re trying to estimate the probability of the event occurring over multiple trials. In this case, your number of trials, \(n\) can be large, and your number of successes, \(y\) is the random variable under consideration. This is the basic makeup of a demographic rate, the count-based binomial.

The mean of the binomial distribution is a proportion or a probability, \(\pi\), which tells you how likely it is that the event your interested in occurs. Any model using the binomial distributor will be geared towards estimating the probability.

When a variable is coded as binary, meaning either 1 or 0, the Bernoulli distribution is used, as in the logistic regression model. When coded like this, the model tries to use the other measured information to predict the 1 value versus the 0 value. So in the basic sense, we want to construct a model like:

\[Pr(y=1) =\pi = \text{some function of predictors}\]

The good thing is that, when we have count data, not just 1’s and 0’s, the same thing happens. The ratio or successes (\(y\)) to trials (\(n\)) is used to estimate \(\pi\) and we build a model for that rate:

\[\text{Binomial} \binom{n}{y} = \frac{y}{n} = \pi = \text{some function of predictors}\]

Binary outcome variables

The ratio \(\frac{y}{n}\) is a rate or probability, and as such has very strict bounds. Probabilities cannot be less than 0 or greater than 1, so again, we should not use the Normal distribution here, since it is valid for all real numbers. Instead, we are using the binomial, but we still run into the problem of having a strictly bounded value, \(\pi\) that we are trying to estimate with a linear function.

Enter the link function again.

The binomial distribution typically uses either a logit or probit link function, but others such as the complementary log-log link function are also used in certain circumstances. For now we will use the logit function.

The logit transforms the probability, \(\pi\), which is bound on the interval \([0,1]\) into a new limitless interval similar to the normal distribution of \([-\infty, \infty]\). The transformation is knows a the log-odds transformation, or logit for short.

The odds of an event happening are the probability that something happens, divided by the probability it does not happen, in this case:

\[\text{odds }{\pi} = \frac{\pi}{(1-\pi)}\]

Which is bound on the interval \([0, \infty]\), when we take the natural log of the odds, the value is transformed into the linear space, of \([-\infty, \infty]\).

\[\text{log-odds }{\pi} = log \left ( \frac{\pi}{(1-\pi)} \right) \]

This can be modeled using a linear function of covariates now, without worrying about the original boundary problem:

\[log \left ( \frac{\pi}{(1-\pi)} \right) = \beta_0 +\beta_1 x_1\]

or more compactly:

\[logit (\pi) = \beta_0 +\beta_1 x_1\]

Interpretation of the binomial model

Similar to when the Poisson model was introduced, the binomial model also has a strange interpretation when compared to the OLS model.

Since we used the log-odds, or logit transformation of the mean of the outcome, \(\pi\), the interpretations of the model \(\beta\)’s are not on a linear scale, they are on a log-odds scale.

While we can certainly interpret a positive \(\beta\) as increasing the odds of \(y\) occurring, or increasing the rate, and a negative \(\beta\), as decreasing the odds, this is not commonly how the model effects are interpreted.

Instead, the common interpretation when using the logit model is the odds ratio interpretation of \(\beta\). To obtain the odds ratio, you must exponentiate the \(\beta\) from the model.

For example, a \(\beta\) of 0.025 would be \(\text{exp}(\beta) = 1.025\), and we would say that for a 1 unit increase in \(x_1\), the odds of \(y\) occurring are 2.5% higher.

This percentage is obtained by subtracting 1 from the \(\text{exp}(\beta)\), or \(\text{% change in odds} = \text{exp}(\beta) - 1\), which in this case is 0.025.

For a \(\beta_1\) that is negative, such as \(\beta_1 = -.15\), the exponent is 0.86. Following the rule we just used, we see that a one unit change in \(x_1\) leads to a \(\text{exp}(-0.15) - 1 = -.14\), or a 14% decrease in the odds of y occurring.

Logit and Probit models

If our outcome is dichotomous (1/0), the natural distribution to consider for a GLM is the binomial

\[y \sim \ \text{Binomial}\binom{n}{p}\] with \(p\) being the mean of the binomial, and n being the number of trials, generally when you have individual data, n is always 1, and \(p\) is the probability of observing the 1, conditional on the observed predictors.

There are two common techniques to estimate the mean, logistic regression and probit regression.

In a Logistic model, the link function is the inverse logit function, or

\(\text{Logit}^{-1}(p) =log \frac{p}{(1-p)}\)

Which gives us the following conditional mean model:

\[E(y|x) = \frac{1}{1+ exp({-\sum_k \beta_k x_{ik}})}\] Which situates the model within the logistic distribution function.

Expressing p as a linear model is done via this log odds transformation of the probability:

\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]

For the Probit model, the link function is the inverse cumulative Normal distribution:

\[E(y|x) = \Phi^{-1}(p) = \Phi (\sum_k \beta_k x_{ik})\]

In practice, these two models give very similar estimates and will have very similar interpretations, although the logistic regression model has the more convenient odds ratio interpretation of its \(\beta's\), while the probit model’s coefficients are often transformed into marginal coefficients, which is more of a challenge and software generally doesn’t give you these by default.

Logit/Probit Regression example

There is no trick to fitting logistic regression models using survey data, just use the svyglm() function with the appropriate distribution specified via family=binomial for logistic and family=binomial(link="probit") for the probit model. You don’t have to specify the link function if you’re just doing the logistic model, as it is the default.

Example using BRFSS

This example will cover the use of R functions for fitting binary logit and probit models to complex survey data.

For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data. Link

#load libraries
library(car)

## Loading required package: carData

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(questionr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.1
## v readr   1.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()
## x tidyr::unpack() masks Matrix::unpack()

#load brfss
load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))


#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss_17)
head(nams, n=10)

##  [1] "dispcode" "statere1" "safetime" "hhadult"  "genhlth"  "physhlth"
##  [7] "menthlth" "poorhlth" "hlthpln1" "persdoc2"

#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-tolower(gsub(pattern = "_",replacement =  "",x =  nams))
names(brfss_17)<-newnames

#Poor or fair self rated health
brfss_17$badhealth<-Recode(brfss_17$genhlth,
                           recodes="4:5=1; 1:3=0; else=NA")

#sex
brfss_17$male<-as.factor(ifelse(brfss_17$sex==1,
                                "Male",
                                "Female"))

#Age cut into intervals
brfss_17$agec<-cut(brfss_17$age80,
                   breaks=c(0,24,39,59,79,99))

#race/ethnicity
brfss_17$black<-Recode(brfss_17$racegr3,
                       recodes="2=1; 9=NA; else=0")
brfss_17$white<-Recode(brfss_17$racegr3,
                       recodes="1=1; 9=NA; else=0")
brfss_17$other<-Recode(brfss_17$racegr3,
                       recodes="3:4=1; 9=NA; else=0")
brfss_17$hispanic<-Recode(brfss_17$racegr3,
                          recodes="5=1; 9=NA; else=0")

brfss_17$race_eth<-Recode(brfss_17$racegr3,
                          recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
                          as.factor = T)

#insurance
brfss_17$ins<-Recode(brfss_17$hlthpln1,
                     recodes ="7:9=NA; 1=1;2=0")

#income grouping
brfss_17$inc<-ifelse(brfss_17$incomg==9,
                     NA,
                     brfss_17$incomg)

#education level
brfss_17$educ<-Recode(brfss_17$educa,
                      recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
                      as.factor=T)

#brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad')

#employment
brfss_17$employ<-Recode(brfss_17$employ1,
                        recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
                        as.factor=T)

brfss_17$employ<-relevel(brfss_17$employ,
                         ref='Employed')

#marital status
brfss_17$marst<-Recode(brfss_17$marital,
                       recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
                       as.factor=T)

brfss_17$marst<-relevel(brfss_17$marst,
                        ref='married')

#Age cut into intervals
brfss_17$agec<-cut(brfss_17$age80,
                   breaks=c(0,29,39,59,79,99))

#BMI, in the brfss_17a the bmi variable has 2 implied decimal places, so we must divide by 100 to get real bmi's

brfss_17$bmi<-brfss_17$bmi5/100

brfss_17$obese<-ifelse(brfss_17$bmi>=30,
                       1,
                       0)
#smoking currently
brfss_17$smoke<-Recode(brfss_17$smoker3,
                       recodes="1:2 ='Current'; 3 ='Former';4 ='NeverSmoked'; else = NA",
                       as.factor=T)

brfss_17$smoke<-relevel(brfss_17$smoke,
                        ref = "NeverSmoked")

Analysis

First, we will do some descriptive analysis, such as means and cross tabulations.

sub<-brfss_17 %>%
  select(badhealth,mmsaname, bmi, agec,race_eth, marst, educ,white, black, hispanic, other, smoke, ins, mmsawt, ststr) %>%
  filter( complete.cases( . ))

#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids= ~1,
               strata= ~ststr,
               weights= ~mmsawt
               , data = sub )

First, we examine the % of US adults with poor/fair health by education level, and do a survey-corrected chi-square test for independence.

cat<-svyby(formula = ~badhealth,
           by = ~educ,
           design = des,
           FUN = svymean,
           na.rm=T)

svychisq(~badhealth+educ,
         design = des)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~badhealth + educ, design = des)
## F = 585.7, ndf = 3.6133e+00, ddf = 7.1982e+05, p-value < 2.2e-16

plot of estimates with standard errors

cat%>%
  ggplot()+
  geom_point(aes(x=educ,y=badhealth))+
  geom_errorbar(aes(x=educ, ymin = badhealth-1.96*se, 
                    ymax= badhealth+1.96*se),
                width=.25)+
   labs(title = "Percent % of US Adults with Fair/Poor Health by Education", 
        caption = "Source: CDC BRFSS - SMART Data, 2017 \n Calculations by Corey S. Sparks, Ph.D.",
       x = "Education",
       y = "%  Fair/Poor Health")+
  theme_minimal()

Calculate race*health cross tabulation, and plot it

dog<-svyby(formula = ~badhealth,
           by = ~race_eth, 
           design = des, 
           FUN = svymean,
           na.rm=T)

svychisq(~badhealth+race_eth,
         design = des)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~badhealth + race_eth, design = des)
## F = 113.57, ndf = 3.4023e+00, ddf = 6.7779e+05, p-value < 2.2e-16

dog%>%
  ggplot()+
  geom_point(aes(x=race_eth,y=badhealth))+
  geom_errorbar(aes(x=race_eth, ymin = badhealth-1.96*se, 
                    ymax= badhealth+1.96*se),
                width=.25)+
   labs(title = "Percent % of US  with Fair/Poor Health by Race/Ethnicity", 
        caption = "Source: CDC BRFSS - SMART Data, 2017 \n Calculations by Corey S. Sparks, Ph.D.",
       x = "Race/Ethnicity",
       y = "%  Fair/Poor Health")+
  theme_minimal()

Calculate race by education by health cross tabulation, and plot it

catdog<-svyby(formula = ~badhealth,
              by = ~race_eth+educ,
              design = des,
              FUN = svymean,
              na.rm=T)

#this plot is a little more complicated, but facet_wrap() plots separate plots for groups

catdog%>%
  ggplot()+
  geom_point(aes(x=educ, y = badhealth))+ 
  geom_errorbar(aes(x=educ, ymin = badhealth-1.96*se, 
                    ymax= badhealth+1.96*se),
                width=.25)+
  facet_wrap(~ race_eth, nrow = 3)+
  labs(title = "Percent % of US  with Fair/Poor Health by Race/Ethnicity and Education", 
        caption = "Source: CDC BRFSS - SMART Data, 2017 \n Calculations by Corey S. Sparks, Ph.D.",
       x = "Education",
       y = "%  Fair/Poor Health")+
  theme_minimal()

Fitting the logistic regression model

To fit the model, we use svyglm()

#Logit model
fit.logit<-svyglm(badhealth ~ race_eth + educ + agec,
                  design = des,
                  family = binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

library(broom)
fit.logit%>%
  tidy()%>%
  knitr::kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-0.941	0.077	-12.278	0.000
race_ethnh black	-0.032	0.051	-0.623	0.533
race_ethnh multirace	0.061	0.087	0.704	0.482
race_ethnh other	-0.179	0.083	-2.158	0.031
race_ethnhwhite	-0.378	0.042	-9.041	0.000
educ1somehs	-0.222	0.079	-2.809	0.005
educ2hsgrad	-0.896	0.071	-12.663	0.000
educ3somecol	-1.217	0.071	-17.029	0.000
educ4colgrad	-2.048	0.072	-28.304	0.000
agec(29,39]	0.365	0.057	6.454	0.000
agec(39,59]	0.870	0.047	18.422	0.000
agec(59,79]	1.177	0.047	24.932	0.000
agec(79,99]	1.341	0.067	20.094	0.000

Get odds ratios and confidence intervals for the estimates

The most straighforward way to get odds ratios is to exponentiate the \(\beta\) parameters. These are stored in the coefficients(fit.logit)

fit.logit%>%
  tidy()%>%
  mutate(OR = exp(estimate))%>%
  knitr::kable(digits = 3)

term	estimate	std.error	statistic	p.value	OR
(Intercept)	-0.941	0.077	-12.278	0.000	0.390
race_ethnh black	-0.032	0.051	-0.623	0.533	0.969
race_ethnh multirace	0.061	0.087	0.704	0.482	1.063
race_ethnh other	-0.179	0.083	-2.158	0.031	0.836
race_ethnhwhite	-0.378	0.042	-9.041	0.000	0.685
educ1somehs	-0.222	0.079	-2.809	0.005	0.801
educ2hsgrad	-0.896	0.071	-12.663	0.000	0.408
educ3somecol	-1.217	0.071	-17.029	0.000	0.296
educ4colgrad	-2.048	0.072	-28.304	0.000	0.129
agec(29,39]	0.365	0.057	6.454	0.000	1.441
agec(39,59]	0.870	0.047	18.422	0.000	2.387
agec(59,79]	1.177	0.047	24.932	0.000	3.244
agec(79,99]	1.341	0.067	20.094	0.000	3.824

And we can get simple confidence intervals for these esimates using confint() which will work for most models.

fit.logit%>%
  tidy()%>%
  mutate(OR = exp(estimate),
         LowerOR_Ci = exp(estimate - 1.96*std.error),
         UpperOR_Ci = exp(estimate + 1.96*std.error))%>%
  knitr::kable(digits = 3)

term	estimate	std.error	statistic	p.value	OR	LowerOR_Ci	UpperOR_Ci
(Intercept)	-0.941	0.077	-12.278	0.000	0.390	0.336	0.454
race_ethnh black	-0.032	0.051	-0.623	0.533	0.969	0.877	1.070
race_ethnh multirace	0.061	0.087	0.704	0.482	1.063	0.896	1.261
race_ethnh other	-0.179	0.083	-2.158	0.031	0.836	0.710	0.984
race_ethnhwhite	-0.378	0.042	-9.041	0.000	0.685	0.631	0.744
educ1somehs	-0.222	0.079	-2.809	0.005	0.801	0.686	0.935
educ2hsgrad	-0.896	0.071	-12.663	0.000	0.408	0.355	0.469
educ3somecol	-1.217	0.071	-17.029	0.000	0.296	0.257	0.341
educ4colgrad	-2.048	0.072	-28.304	0.000	0.129	0.112	0.149
agec(29,39]	0.365	0.057	6.454	0.000	1.441	1.289	1.610
agec(39,59]	0.870	0.047	18.422	0.000	2.387	2.176	2.618
agec(59,79]	1.177	0.047	24.932	0.000	3.244	2.957	3.559
agec(79,99]	1.341	0.067	20.094	0.000	3.824	3.355	4.359

A sligtly more digestible form can be obtained from the sjPlot library. In this plot, if the error bars overlap 1, the effects are not statistically significant.

library(sjPlot)

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

plot_model(fit.logit,
           title = "Odds ratios for Poor Self Rated Health")

Probit model

To get the probit model, you use link = "probit" in svyglm

#probit model
fit.probit<-svyglm(badhealth~race_eth+educ+agec,
                   design=des,
                   family=binomial(link= "probit"))

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Present both model coefficients next to one another

myexp<-function(x){exp(x)}

stargazer(fit.logit, fit.probit,
          type = "html",
          style="demography",
          covariate.labels=c("Black","MultRace" ,"Other","NH White", "Primary School", "Some HS","Some College", "College Grad", "Age 24-39","Age 39-59" ,"Age 59-79", "Age 80+"),
          ci=T,
          column.sep.width = "10pt")


	badhealth
	survey-weighted	survey-weighted
	logistic	probit
	Model 1	Model 2

Black	-0.032	-0.016
	(-0.131, 0.068)	(-0.072, 0.040)
MultRace	0.061	0.032
	(-0.109, 0.232)	(-0.064, 0.128)
Other	-0.179^*	-0.105^*
	(-0.342, -0.016)	(-0.193, -0.017)
NH White	-0.378^***	-0.213^***
	(-0.460, -0.296)	(-0.259, -0.167)
Primary School	-0.222^**	-0.148^**
	(-0.377, -0.067)	(-0.243, -0.052)
Some HS	-0.896^***	-0.551^***
	(-1.035, -0.758)	(-0.636, -0.466)
Some College	-1.217^***	-0.731^***
	(-1.357, -1.077)	(-0.817, -0.646)
College Grad	-2.048^***	-1.171^***
	(-2.189, -1.906)	(-1.256, -1.085)
Age 24-39	0.365^***	0.190^***
	(0.254, 0.476)	(0.133, 0.248)
Age 39-59	0.870^***	0.466^***
	(0.777, 0.962)	(0.417, 0.514)
Age 59-79	1.177^***	0.643^***
	(1.084, 1.269)	(0.595, 0.692)
Age 80+	1.341^***	0.749^***
	(1.210, 1.472)	(0.676, 0.821)
Constant	-0.941^***	-0.521^***
	(-1.091, -0.791)	(-0.609, -0.432)
N	200,568	200,568
Log Likelihood	-77,802.860	-77,800.350
AIC	155,631.700	155,626.700

p < .05; p < .01; p < .001

Both of these models show the exact same patterns of effects, with Hispanics, blacks and multi-race individuals showing increased chances of reporting poor/fair health, when compared to whites (Reference group).

Similarly, the education variables shows a negative linear trend, with those with more education having lower chances of reporting poor/fair health compared to those with a primary school education (Reference group), and likewise, as people get older, they are more likely to report poor/fair health, compared to those under age 24 (Reference group).

Fitted Values

As I often say, I like to talk about “interesting cases”. In order to do this, you need the fitted mean for a particular case. This is done by getting the fitted values for that case from the model. To do this, I generate a bunch of “fake people” that have variability in the model covariates, and fit the model for each type of person. This is perhaps overkill in this example because I fit every type of person, ideally you would want a few interesting cases to discuss.

In order to derive these, we effectively “solve the equation” for the model, or another way of saying it, we estimate the conditional mean of y, by specifying the x values that are meaningful for a particular comparison. For example the probability of a white, young college educated person reporting poor health is just the estimate of the model, evaluated at those particular characteristics:

\[\text{Pr(poor/fair health)} = \frac{1}{1+exp({\beta_0 + \beta_1*white + \beta_2*young+\beta_3*college})}\]

#get a series of predicted probabilites for different "types" of people for each model
#ref_grid will generate all possible combinations of predictors from a model

library(emmeans)
rg<-ref_grid(fit.logit)

marg_logit<-emmeans(object = rg,
              specs = c( "race_eth",  "educ"),
              type="response" )

knitr::kable(marg_logit,  digits = 4)

race_eth	educ	prob	SE	df	asymp.LCL	asymp.UCL
hispanic	0Prim	0.4526	0.0164	Inf	0.4206	0.4850
nh black	0Prim	0.4448	0.0187	Inf	0.4086	0.4816
nh multirace	0Prim	0.4678	0.0258	Inf	0.4177	0.5186
nh other	0Prim	0.4087	0.0240	Inf	0.3627	0.4564
nhwhite	0Prim	0.3616	0.0163	Inf	0.3304	0.3941
hispanic	1somehs	0.3983	0.0129	Inf	0.3734	0.4238
nh black	1somehs	0.3908	0.0130	Inf	0.3656	0.4166
nh multirace	1somehs	0.4131	0.0219	Inf	0.3709	0.4566
nh other	1somehs	0.3563	0.0200	Inf	0.3181	0.3964
nhwhite	1somehs	0.3121	0.0103	Inf	0.2923	0.3326
hispanic	2hsgrad	0.2523	0.0081	Inf	0.2367	0.2685
nh black	2hsgrad	0.2464	0.0074	Inf	0.2321	0.2611
nh multirace	2hsgrad	0.2640	0.0159	Inf	0.2341	0.2963
nh other	2hsgrad	0.2200	0.0133	Inf	0.1951	0.2471
nhwhite	2hsgrad	0.1878	0.0040	Inf	0.1801	0.1956
hispanic	3somecol	0.1967	0.0070	Inf	0.1833	0.2108
nh black	3somecol	0.1917	0.0065	Inf	0.1794	0.2047
nh multirace	3somecol	0.2065	0.0133	Inf	0.1817	0.2338
nh other	3somecol	0.1699	0.0111	Inf	0.1491	0.1928
nhwhite	3somecol	0.1436	0.0033	Inf	0.1374	0.1502
hispanic	4colgrad	0.0964	0.0041	Inf	0.0886	0.1048
nh black	4colgrad	0.0937	0.0038	Inf	0.0865	0.1014
nh multirace	4colgrad	0.1019	0.0076	Inf	0.0878	0.1179
nh other	4colgrad	0.0819	0.0058	Inf	0.0713	0.0939
nhwhite	4colgrad	0.0681	0.0019	Inf	0.0646	0.0718

You can compare these to the probit model estimates, they are very similar

rg<-ref_grid(fit.probit)

marg_probit<-emmeans(object = rg,
              specs = c( "race_eth", "agec"),
              type="response" )
knitr::kable(marg_probit, digits = 4)

race_eth	agec	prob	SE	df	asymp.LCL	asymp.UCL
hispanic	(0,29]	0.1490	0.0063	Inf	0.1370	0.1617
nh black	(0,29]	0.1453	0.0067	Inf	0.1326	0.1588
nh multirace	(0,29]	0.1565	0.0117	Inf	0.1348	0.1805
nh other	(0,29]	0.1259	0.0091	Inf	0.1090	0.1445
nhwhite	(0,29]	0.1050	0.0044	Inf	0.0965	0.1140
hispanic	(29,39]	0.1975	0.0073	Inf	0.1835	0.2122
nh black	(29,39]	0.1931	0.0078	Inf	0.1783	0.2087
nh multirace	(29,39]	0.2065	0.0138	Inf	0.1805	0.2347
nh other	(29,39]	0.1696	0.0112	Inf	0.1486	0.1925
nhwhite	(29,39]	0.1438	0.0054	Inf	0.1335	0.1547
hispanic	(39,59]	0.2826	0.0076	Inf	0.2678	0.2977
nh black	(39,59]	0.2772	0.0077	Inf	0.2622	0.2925
nh multirace	(39,59]	0.2935	0.0160	Inf	0.2629	0.3255
nh other	(39,59]	0.2481	0.0133	Inf	0.2228	0.2749
nhwhite	(39,59]	0.2153	0.0049	Inf	0.2058	0.2250
hispanic	(59,79]	0.3455	0.0087	Inf	0.3286	0.3627
nh black	(59,79]	0.3396	0.0088	Inf	0.3225	0.3571
nh multirace	(59,79]	0.3574	0.0174	Inf	0.3239	0.3919
nh other	(59,79]	0.3076	0.0150	Inf	0.2788	0.3376
nhwhite	(59,79]	0.2708	0.0049	Inf	0.2612	0.2806
hispanic	(79,99]	0.3851	0.0143	Inf	0.3574	0.4134
nh black	(79,99]	0.3790	0.0140	Inf	0.3520	0.4067
nh multirace	(79,99]	0.3974	0.0209	Inf	0.3570	0.4389
nh other	(79,99]	0.3456	0.0188	Inf	0.3095	0.3831
nhwhite	(79,99]	0.3068	0.0109	Inf	0.2858	0.3285

Which show us the estimated probability of reporting poor/fair health for each specified type of “typical person” that we generate. For example, let’s look at the probability for a Non-Hispanic white with a college education, compared to a Hispanic person, age 39-59 with a college education:

comps<-as.data.frame(marg_logit)

comps[comps$race_eth=="hispanic" & comps$educ == "4colgrad" , ]

comps[comps$race_eth=="nhwhite" & comps$educ == "4colgrad" , ]

The first case has an estimated probability of reporting poor/fair health of about 9%, while the second case has about a 6.5% chance. These are often more effective ways to convey the result of a model, instead of talking about all the regression coefficients.

DEM 7283 - Example 2 - Logit and Probit Models

Corey Sparks, PhD

08 February, 2021