In the vast majority of situations in your work as demographers, your outcome will either be qualitative in nature or non-normally distributed, especially if you work with individual-level survey data.

When we speak of qualitative outcomes, we are generally concerned with the observation of a discrete event or state, such as whether a person moved in the past year or whether a person is unemployed.

Example

This example will cover the use of R functions for fitting binary logit and probit models to individual-level survey data.

For this example I am using IPUMS USA data stored on GitHub.

library(haven)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(ggplot2)
library(knitr)

ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

newpums<-ipums%>%
  filter(labforce==2, age>=18, incwage>0, relate==1)%>% #adults in the labor force, householders, with wage income
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage), #set missing-data codes for wages to NA
         unemp=ifelse(empstat==2,1,0), #unemployment
         mig=case_when(migrate1%in%c(2:4)~ 1, #moved in past year
                       migrate1==1~ 0),       #same house as one year ago
         sexrecode=ifelse(sex==1, "male", "female"))%>%
  mutate(race_eth = case_when(hispan %in% c(1:4) & race %in% c(1:9) ~ "hispanic",
                              hispan ==0 & race==1 ~ "0nh_white", #the 0 prefix makes whites the reference level
                              hispan ==0 & race==2 ~ "nh_black",
                              hispan ==0 & race%in%c(3,7,8,9) ~ "nh_other",
                              hispan ==0 & race%in%c(4:6) ~ "nh_asian",
                              hispan ==9 ~ "missing"))%>%
  mutate(edurec = case_when(educd %in% c(0:61) ~ "nohs", #less than high school
                            educd %in% c(62:64) ~ "hs", #high school graduate
                            educd %in% c(65:100) ~ "somecoll", #some college
                            educd %in% c(101:116) ~ "collgrad", #college graduate
                            educd ==999 ~ "missing"))
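
Before fitting any models, it is worth verifying that the recodes came out as expected. A quick base-R check on the data created above:

#check the recoded variables
table(newpums$race_eth, useNA = "ifany")
table(newpums$edurec, newpums$mig, useNA = "ifany")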

Analysis

First, we will do some descriptive analysis, such as means and cross-tabulations.

We begin by examining the percentage of US adults who moved in the last year, by education level.

#standard error of a mean: sd/sqrt(n), computed over non-missing values
sefun<-function(x) {sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x)))}


newpums%>%
  group_by(edurec)%>%
  summarise(migpct=mean(mig, na.rm=TRUE), semig=sefun(mig))%>%
  ggplot()+
  geom_bar(aes(edurec, migpct), stat = "identity")+
  geom_errorbar(aes(x=edurec, ymin=migpct-1.96*semig, ymax=migpct+1.96*semig), width=.25)+
  ggtitle("Percent of US adults who moved in the last year", subtitle = "By level of education")

Which shows significant variation in moving by education level.

Logit and Probit models

If our outcome is dichotomous (1/0), the natural distribution to consider for a GLM is the binomial \[y \sim \text{Binomial}(n, p)\] with \(p\) being the mean of the binomial and \(n\) being the number of trials. When you have individual-level data, \(n\) is generally 1, and \(p\) is the probability of observing a 1, conditional on the observed predictors. There are two common techniques for estimating this mean: logistic and probit regression. In a logistic model, the link function is the logit, or

\(\text{logit}(p) = \log \frac{p}{1-p}\)

Which gives us the following conditional mean model:

\[E(y|x) = \frac{1}{1+ \exp\left(-\sum_k \beta_k x_{ik}\right)}\] which is the logistic cumulative distribution function evaluated at the linear predictor. Expressing \(p\) as a linear model is done via the log-odds transformation of the probability:

\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]
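
As a quick numerical check, base R’s qlogis() and plogis() functions implement the logit and its inverse:

#the logit and inverse logit in base R
p <- 0.25
qlogis(p)         # log(p/(1-p)) = log(1/3) = -1.099
plogis(qlogis(p)) # inverse logit recovers p = 0.25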

For the probit model, the link function is the inverse cumulative Normal distribution, \(\Phi^{-1}\):

\[\Phi^{-1}(p) = \sum_k \beta_k x_{ik}, \qquad E(y|x) = p = \Phi\left(\sum_k \beta_k x_{ik}\right)\]

In practice, these two models give very similar estimates and have very similar interpretations. The logistic regression model has the more convenient odds-ratio interpretation of its \(\beta\)’s, while the probit model’s coefficients are usually transformed into marginal effects, which is more of a challenge, and software generally doesn’t give you these by default.
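
A well-known rule of thumb is that logit coefficients are roughly 1.6 to 1.8 times their probit counterparts (the age coefficients in the models fit below have a ratio of about 1.75), because the two CDFs nearly coincide once this scale difference is taken into account. A short base-R illustration:

#the logistic CDF at xb is close to the Normal CDF at xb/1.6
xb <- seq(-3, 3, by = 1)
round(data.frame(xb, logit = plogis(xb), probit_rescaled = pnorm(xb/1.6)), 3)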

Logit/Probit Regression example

There is no trick to fitting logistic regression models: just use the glm() function with the appropriate distribution specified via family=binomial for the logistic model and family=binomial(link="probit") for the probit model. You don’t have to specify the link function if you’re fitting the logistic model, as it is the default.

#Logit model
fit.logit<-glm(mig~race_eth+edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial)

summary(fit.logit)
## 
## Call:
## glm(formula = mig ~ race_eth + edurec + sexrecode + scale(age) + 
##     scale(I(age^2)), family = binomial, data = newpums)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4253  -0.5272  -0.3733  -0.3215   2.5323  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.09436    0.02423 -86.446  < 2e-16 ***
## race_ethhispanic -0.00983    0.03806  -0.258 0.796172    
## race_ethnh_asian  0.23345    0.05010   4.660 3.17e-06 ***
## race_ethnh_black  0.20662    0.03940   5.244 1.57e-07 ***
## race_ethnh_other  0.21128    0.06987   3.024 0.002495 ** 
## edurechs         -0.20782    0.03364  -6.177 6.53e-10 ***
## edurecnohs       -0.16562    0.05476  -3.025 0.002489 ** 
## edurecsomecoll   -0.09350    0.02798  -3.342 0.000832 ***
## sexrecodemale     0.06008    0.02404   2.499 0.012439 *  
## scale(age)       -2.60474    0.07301 -35.679  < 2e-16 ***
## scale(I(age^2))   1.85684    0.07814  23.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54102  on 68041  degrees of freedom
## Residual deviance: 47740  on 68031  degrees of freedom
## AIC: 47762
## 
## Number of Fisher Scoring iterations: 5

And the probit model:

#probit model
fit.probit<-glm(mig~race_eth+edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial(link= "probit"))

summary(fit.probit)
## 
## Call:
## glm(formula = mig ~ race_eth + edurec + sexrecode + scale(age) + 
##     scale(I(age^2)), family = binomial(link = "probit"), data = newpums)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3762  -0.5388  -0.3725  -0.3170   2.5448  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.209664   0.012857 -94.087  < 2e-16 ***
## race_ethhispanic -0.003402   0.020725  -0.164  0.86961    
## race_ethnh_asian  0.127317   0.027742   4.589 4.45e-06 ***
## race_ethnh_black  0.117695   0.021433   5.491 3.99e-08 ***
## race_ethnh_other  0.119265   0.038967   3.061  0.00221 ** 
## edurechs         -0.098155   0.017869  -5.493 3.95e-08 ***
## edurecnohs       -0.077109   0.029087  -2.651  0.00803 ** 
## edurecsomecoll   -0.039478   0.015161  -2.604  0.00922 ** 
## sexrecodemale     0.029043   0.012973   2.239  0.02518 *  
## scale(age)       -1.486622   0.038589 -38.525  < 2e-16 ***
## scale(I(age^2))   1.084296   0.040198  26.974  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54102  on 68041  degrees of freedom
## Residual deviance: 47793  on 68031  degrees of freedom
## AIC: 47815
## 
## Number of Fisher Scoring iterations: 5

Both of these models show the exact same pattern of effects. Compared to non-Hispanic whites (the reference group), non-Hispanic blacks, Asians, and members of other racial groups have significantly higher odds of having moved in the past year, while Hispanics do not differ significantly from whites. The education effects are all negative relative to college graduates (the reference group), so adults with less than a college degree are less likely to have moved. Finally, the negative coefficient on age indicates that the probability of moving declines as people get older, with the positive quadratic term flattening that decline at older ages.
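
Since the logistic model’s coefficients have an odds-ratio interpretation, we can exponentiate them, along with Wald confidence intervals, for easier reading:

#odds ratios and 95% Wald confidence intervals from the logit model
round(exp(cbind(OR = coef(fit.logit), confint.default(fit.logit))), 3)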

We can present both models’ coefficients next to one another in a single table using stargazer:

stargazer(fit.logit, fit.probit, type = "html", style = "demography", ci = T)
                    Dependent variable: mig
                    logistic (Model 1)           probit (Model 2)
race_ethhispanic    -0.010   (-0.084, 0.065)     -0.003   (-0.044, 0.037)
race_ethnh_asian     0.233*** (0.135, 0.332)      0.127*** (0.073, 0.182)
race_ethnh_black     0.207*** (0.129, 0.284)      0.118*** (0.076, 0.160)
race_ethnh_other     0.211**  (0.074, 0.348)      0.119**  (0.043, 0.196)
edurechs            -0.208*** (-0.274, -0.142)    -0.098*** (-0.133, -0.063)
edurecnohs          -0.166**  (-0.273, -0.058)    -0.077**  (-0.134, -0.020)
edurecsomecoll      -0.093*** (-0.148, -0.039)    -0.039**  (-0.069, -0.010)
sexrecodemale        0.060*   (0.013, 0.107)       0.029*   (0.004, 0.054)
scale(age)          -2.605*** (-2.748, -2.462)    -1.487*** (-1.562, -1.411)
scale(I(age^2))      1.857*** (1.704, 2.010)       1.084*** (1.006, 1.163)
Constant            -2.094*** (-2.142, -2.047)    -1.210*** (-1.235, -1.184)
N                    68,042                        68,042
Log Likelihood      -23,869.820                   -23,896.710
AIC                  47,761.630                    47,815.420
*p < .05; **p < .01; ***p < .001

Fitted Values

As I often say, I like to talk about “interesting cases”. In order to do this, you need the fitted mean for a particular case, which you get as the fitted value for that case from the model. To do this, I generate a set of “fake people” who vary on the model covariates and compute the fitted probability for each type of person. This is perhaps overkill in this example, because I generate every possible type; ideally you would pick a few interesting cases to discuss.

In order to derive these, we effectively “solve the equation” for the model; put another way, we estimate the conditional mean of \(y\) by specifying the \(x\) values that are meaningful for a particular comparison. For example, the probability that a young, white, college-educated person moved in the past year is just the estimate of the model, evaluated at those particular characteristics:

\[\text{Pr(moved)} = \frac{1}{1+\exp\left(-\left[\beta_0 + \beta_1\text{white} + \beta_2\text{young}+\beta_3\text{college}\right]\right)}\]

#get a series of predicted probabilites for different "types" of people for each model
#expand.grid will generate all possible combinations of values you specify
dat<-expand.grid(race_eth=levels(factor(newpums$race_eth)),
                 edurec=levels(factor(newpums$edurec)),
                 age=c(25, 35, 45, 65),
                 sexrecode=levels(factor(newpums$sexrecode)))

#You MAY need to get rid of impossible cases here

#generate the fitted values
#note: because the model formula uses scale(age), predict() re-evaluates the
#scaling on the new data
fit<-predict(fit.logit, newdata=dat, type="response")
fitp<-predict(fit.probit, newdata=dat, type="response")
#add the values to the fake data
dat$fitted.prob.lrm<-round(fit, 3)
dat$fitted.prob.pro<-round(fitp, 3)

#Print the fitted probabilities for the first 20 cases
head(dat, n=20)
##     race_eth   edurec age sexrecode fitted.prob.lrm fitted.prob.pro
## 1  0nh_white collgrad  25    female           0.397           0.387
## 2   hispanic collgrad  25    female           0.395           0.386
## 3   nh_asian collgrad  25    female           0.454           0.437
## 4   nh_black collgrad  25    female           0.447           0.433
## 5   nh_other collgrad  25    female           0.448           0.434
## 6  0nh_white       hs  25    female           0.348           0.350
## 7   hispanic       hs  25    female           0.346           0.349
## 8   nh_asian       hs  25    female           0.403           0.399
## 9   nh_black       hs  25    female           0.397           0.395
## 10  nh_other       hs  25    female           0.398           0.395
## 11 0nh_white     nohs  25    female           0.358           0.358
## 12  hispanic     nohs  25    female           0.356           0.357
## 13  nh_asian     nohs  25    female           0.413           0.407
## 14  nh_black     nohs  25    female           0.407           0.403
## 15  nh_other     nohs  25    female           0.408           0.404
## 16 0nh_white somecoll  25    female           0.375           0.372
## 17  hispanic somecoll  25    female           0.372           0.371
## 18  nh_asian somecoll  25    female           0.431           0.421
## 19  nh_black somecoll  25    female           0.424           0.418
## 20  nh_other somecoll  25    female           0.425           0.418

Which shows us the estimated probability of moving in the last year for each specified type of “fake person” that we generated. For example, let’s look at the probability for a non-Hispanic white person, age 25, with a college education, compared to a Hispanic person, age 25, with a high school education:

dat[which(dat$race_eth=="0nh_white"&dat$age==25&dat$edurec=="collgrad"),]
##     race_eth   edurec age sexrecode fitted.prob.lrm fitted.prob.pro
## 1  0nh_white collgrad  25    female           0.397           0.387
## 81 0nh_white collgrad  25      male           0.411           0.398
dat[which(dat$race_eth=="hispanic"&dat$age==25&dat$edurec=="hs"),]
##    race_eth edurec age sexrecode fitted.prob.lrm fitted.prob.pro
## 7  hispanic     hs  25    female           0.346           0.349
## 87 hispanic     hs  25      male           0.360           0.360

The first case has an estimated probability of moving of about 40% for females and 41% for males, while the second case has a 35% chance for females and a 36% chance for males. A few such comparisons are often a more effective way to convey the results of a model than talking through all of the regression coefficients.
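
Another way to present several of these cases at once is to plot the fitted probabilities. A minimal sketch using the dat object created above (the groups shown are arbitrary choices for illustration):

#plot fitted probabilities by age and education for two race/ethnicity groups
dat%>%
  filter(race_eth%in%c("0nh_white", "hispanic"))%>%
  ggplot(aes(x=age, y=fitted.prob.lrm, color=edurec, linetype=race_eth))+
  geom_line()+
  facet_wrap(~sexrecode)+
  labs(title="Fitted probability of moving in the past year", x="Age", y="Pr(moved)")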

Marginal effects

In a regression model in general, the \(\beta\)’s are the partial derivatives of the conditional mean: \[\frac{\partial y}{\partial x} = \beta\]

which is just the rate of change in \(y\) given a change in \(x\), known as the marginal effect. This holds exactly for the strictly linear model.

In the logit and probit models, which are nonlinear owing to their assumed model structure, the marginal effect also has to take into account the derivative of the respective distribution function with respect to the linear predictor, or:

\[\frac{\partial y}{\partial x} = \beta *\frac{\partial \Phi(x' \beta)}{\partial x'\beta}\]

So we have to multiply the estimated \(\beta\) by the p.d.f. of the assumed distribution, evaluated at the linear predictor; averaging this over the sample gives the average marginal effect. In R that’s not a big problem:

#Logit average marginal effects: beta times the mean of the logistic pdf
#evaluated at each observation's linear predictor
log.marg<-coef(fit.logit)*mean(dlogis(predict(fit.logit)), na.rm=TRUE)

#and for the probit, using the Normal pdf
prob.marg<-coef(fit.probit)*mean(dnorm(predict(fit.probit)), na.rm=TRUE)

plot(log.marg[-1], ylab="Marginal Effects", axes=T,xaxt="n", main="Marginal Effects from Logit and Probit models", ylim=c(-.3, .3))
axis(side=1, at=1:10, labels=F)
text(x=1:10, y=-.4,  srt = 45, pos = 1, xpd = TRUE,
     labels = c( "Hispanic", "NH asian","NH black" ,"NH other","HS",
                 "NO HS","some coll", "male", "age", "age^2" ))
points(prob.marg[-1], col=2)
abline(h=0, col=3)
legend("topleft", legend=c("Logit Model", "Probit Model"), col=c("black", "red"),pch=1)

Which shows us that the marginal effects are very similar between the two models. We can coax these into a table like:

data.frame(m.logit=log.marg, m.probit=prob.marg)
##                       m.logit      m.probit
## (Intercept)      -0.219007341 -0.2335248301
## race_ethhispanic -0.001027932 -0.0006567572
## race_ethnh_asian  0.024411983  0.0245784045
## race_ethnh_black  0.021606581  0.0227208592
## race_ethnh_other  0.022093233  0.0230240354
## edurechs         -0.021732197 -0.0189487456
## edurecnohs       -0.017318541 -0.0148858855
## edurecsomecoll   -0.009777071 -0.0076211089
## sexrecodemale     0.006282844  0.0056066753
## scale(age)       -0.272378745 -0.2869913879
## scale(I(age^2))   0.194170592  0.2093227547
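
As a cross-check on the hand calculation, average marginal effects can also be computed with the margins package (assuming it is installed; it is not loaded above):

#average marginal effects via the margins package
library(margins)
summary(margins(fit.logit))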