In the vast majority of situations in your work as demographers, your outcome will either be of a qualitative nature or non-normally distributed, especially if you work with individual level survey data.
When we speak of qualitative outcomes, we generally are concerned with the observation of:
A particular behavior (marriage, migration, birth, death)
A transition (unemployed to employed, married to divorced)
A threshold characteristic (adoption of sterilization after ideal # of children is reached)
In general, each of these outcomes would be coded as a binary variable (1 or 0) depending on whether the outcome of interest was observed
This example will cover the use of R functions for fitting binary logit and probit models to complex survey data.
For this example I am using IPUMS data from Github.
library(haven)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(ggplot2)
library(knitr)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0, relate==1)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
unemp=ifelse(empstat==2,1,0), #unemployment
mig=case_when(.$migrate1%in%c(2:4)~ 1,
.$migrate1==1~0), #moved in past year
sexrecode=ifelse(sex==1, "male", "female"))%>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"0nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing"))%>%
mutate(edurec = case_when(.$educd %in% c(0:61)~"nohs",
.$educd %in% c(62:64)~"hs",
.$educd %in% c(65:100)~"somecoll",
.$educd %in% c(101:116)~"collgrad",
.$educd ==999 ~ "missing"))
First, we will do some descriptive analysis, such as means and cross tabulations.
First, we examine the % of US adults who moved in the last yar by education level.
sefun<-function(x) {sd(x, na.rm=T)/table(is.na(x))[1]}
newpums%>%
group_by(edurec)%>%
summarise(migpct=mean(mig, na.rm=T), semig=sefun(mig))%>%
ggplot()+geom_bar(aes(edurec, migpct),stat = "identity")+geom_errorbar(aes(x=edurec, ymin=migpct-1.96*semig,ymax= migpct+1.96*semig), width=.25)+ggtitle("Percent of Texas Adults who moved in the last year", subtitle = "By level of education")
Which shows a significant variation in by education level
If our outcome is dichtomous (1/0), the natural distribution to consider for a GLM is the binomial \[y \sim \ \text{Binomial}\binom{n}{p}\] with \(p\) being the mean of the binomial, and n being the number of trials, generally when you have individual data, n is always 1, and p is the probability of observing the 1, conditional on the observed predictors. There are two common techniques to estimate the mean, logistic and probit regression. In a Logistic model, the link function is the inverse logit function, or
\(\text{Logit}^{-1}(p) =log \frac{p}{(1-p)}\)
Which gives us the following conditional mean model:
\[E(y|x) = \frac{1}{1+ exp({-\sum_k \beta_k x_{ik}})}\] Which situates the model within the logistic distribution function. Expressing p as a linear model is done via this log odds transformation of the probability:
\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]
For the Probit model, the link function is the inverse cumulative Normal distribution:
\[E(y|x) = \Phi^{-1}(p) = \Phi (\sum_k \beta_k x_{ik})\]
In practice, these two models give very similar estimates and will have very similar interpretations, although the logitistic regression model has the more convenient odds ratio interpretation of its \(\beta's\), while the probit model’s coefficients are often transformed into marginal coefficients, which is more of a challenge and software generally doesn’t give you these by default.
There is no trick to fitting logistic regression models, just use the glm() function with the apppriate distribution specified via family=binomial for logistic and family=binomial(link="probit") for the probit model. You don’t have to specify the link function if you’re just doing the logistic model, as it is the default.
#Logit model
fit.logit<-glm(mig~race_eth+edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial)
summary(fit.logit)
##
## Call:
## glm(formula = mig ~ race_eth + edurec + sexrecode + scale(age) +
## scale(I(age^2)), family = binomial, data = newpums)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4253 -0.5272 -0.3733 -0.3215 2.5323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.09436 0.02423 -86.446 < 2e-16 ***
## race_ethhispanic -0.00983 0.03806 -0.258 0.796172
## race_ethnh_asian 0.23345 0.05010 4.660 3.17e-06 ***
## race_ethnh_black 0.20662 0.03940 5.244 1.57e-07 ***
## race_ethnh_other 0.21128 0.06987 3.024 0.002495 **
## edurechs -0.20782 0.03364 -6.177 6.53e-10 ***
## edurecnohs -0.16562 0.05476 -3.025 0.002489 **
## edurecsomecoll -0.09350 0.02798 -3.342 0.000832 ***
## sexrecodemale 0.06008 0.02404 2.499 0.012439 *
## scale(age) -2.60474 0.07301 -35.679 < 2e-16 ***
## scale(I(age^2)) 1.85684 0.07814 23.763 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54102 on 68041 degrees of freedom
## Residual deviance: 47740 on 68031 degrees of freedom
## AIC: 47762
##
## Number of Fisher Scoring iterations: 5
And the probit model:
#probit model
fit.probit<-glm(mig~race_eth+edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial(link= "probit"))
summary(fit.probit)
##
## Call:
## glm(formula = mig ~ race_eth + edurec + sexrecode + scale(age) +
## scale(I(age^2)), family = binomial(link = "probit"), data = newpums)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3762 -0.5388 -0.3725 -0.3170 2.5448
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.209664 0.012857 -94.087 < 2e-16 ***
## race_ethhispanic -0.003402 0.020725 -0.164 0.86961
## race_ethnh_asian 0.127317 0.027742 4.589 4.45e-06 ***
## race_ethnh_black 0.117695 0.021433 5.491 3.99e-08 ***
## race_ethnh_other 0.119265 0.038967 3.061 0.00221 **
## edurechs -0.098155 0.017869 -5.493 3.95e-08 ***
## edurecnohs -0.077109 0.029087 -2.651 0.00803 **
## edurecsomecoll -0.039478 0.015161 -2.604 0.00922 **
## sexrecodemale 0.029043 0.012973 2.239 0.02518 *
## scale(age) -1.486622 0.038589 -38.525 < 2e-16 ***
## scale(I(age^2)) 1.084296 0.040198 26.974 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54102 on 68041 degrees of freedom
## Residual deviance: 47793 on 68031 degrees of freedom
## AIC: 47815
##
## Number of Fisher Scoring iterations: 5
Both of these models show the exact same patterns of effects, with Hispanics, blacks and multi-race individuals showing increased chances of reporting poor/fair health, when compared to whites (Reference group). Similarly, the education variables shows a negative linear trend, with those with more education having lower chances of reporting poor/fair health compared to those with a primary school education (Reference group), and likewise, as people get older, they are more likely to report poor/fair health, compared to those under age 24 (Reference group).
stargazer(fit.logit, fit.probit, type = "html", style = "demography", ci = T)
| mig | ||
| logistic | probit | |
| Model 1 | Model 2 | |
| race_ethhispanic | -0.010 | -0.003 |
| (-0.084, 0.065) | (-0.044, 0.037) | |
| race_ethnh_asian | 0.233*** | 0.127*** |
| (0.135, 0.332) | (0.073, 0.182) | |
| race_ethnh_black | 0.207*** | 0.118*** |
| (0.129, 0.284) | (0.076, 0.160) | |
| race_ethnh_other | 0.211** | 0.119** |
| (0.074, 0.348) | (0.043, 0.196) | |
| edurechs | -0.208*** | -0.098*** |
| (-0.274, -0.142) | (-0.133, -0.063) | |
| edurecnohs | -0.166** | -0.077** |
| (-0.273, -0.058) | (-0.134, -0.020) | |
| edurecsomecoll | -0.093*** | -0.039** |
| (-0.148, -0.039) | (-0.069, -0.010) | |
| sexrecodemale | 0.060* | 0.029* |
| (0.013, 0.107) | (0.004, 0.054) | |
| scale(age) | -2.605*** | -1.487*** |
| (-2.748, -2.462) | (-1.562, -1.411) | |
| scale(I(age2)) | 1.857*** | 1.084*** |
| (1.704, 2.010) | (1.006, 1.163) | |
| Constant | -2.094*** | -1.210*** |
| (-2.142, -2.047) | (-1.235, -1.184) | |
| N | 68,042 | 68,042 |
| Log Likelihood | -23,869.820 | -23,896.710 |
| AIC | 47,761.630 | 47,815.420 |
| p < .05; p < .01; p < .001 | ||
As I often say, I like to talk about “interesting cases”. In order to do this, you need the fitted mean for a particular case. This is done by getting the fitted values for that case from the model. To do this, I generate a bunch of “fake people” that have variability in the model covariates, and fit the model for each type of person. This is perhaps overkill in this example because I fit every type of person, ideally you would want a few interesting cases to discuss.
In order to derive these, we effectively “solve the equation” for the model, or another way of saying it, we estimate the conditional mean of y, by specifying the x values that are meaningful for a particular comparison. For example the probabilty of a white, young college educated person reporting poor health is just the estimate of the model, evaluated at those particular characteristics:
\[\text{Pr(poor/fair health)} = \frac{1}{1+exp({\beta_0 + \beta_1*white + \beta_2*young+\beta_3*college})}\]
#get a series of predicted probabilites for different "types" of people for each model
#expand.grid will generate all possible combinations of values you specify
dat<-expand.grid(race_eth=levels(factor(newpums$race_eth)), edurec=levels(factor(newpums$edurec)), age=c(25, 35, 45, 65), sexrecode=levels(factor(newpums$sexrecode)) )
#You MAY need to get rid of impossible cases here
#generate the fitted values
fit<-predict(fit.logit, newdat=dat,type="response")
fitp<-predict(fit.probit, newdat=dat,type="response")
#add the values to the fake data
dat$fitted.prob.lrm<-round(fit, 3)
dat$fitted.prob.pro<-round(fitp, 3)
#Print the fitted probabilities for the first 20 cases
head(dat, n=20)
## race_eth edurec age sexrecode fitted.prob.lrm fitted.prob.pro
## 1 0nh_white collgrad 25 female 0.397 0.387
## 2 hispanic collgrad 25 female 0.395 0.386
## 3 nh_asian collgrad 25 female 0.454 0.437
## 4 nh_black collgrad 25 female 0.447 0.433
## 5 nh_other collgrad 25 female 0.448 0.434
## 6 0nh_white hs 25 female 0.348 0.350
## 7 hispanic hs 25 female 0.346 0.349
## 8 nh_asian hs 25 female 0.403 0.399
## 9 nh_black hs 25 female 0.397 0.395
## 10 nh_other hs 25 female 0.398 0.395
## 11 0nh_white nohs 25 female 0.358 0.358
## 12 hispanic nohs 25 female 0.356 0.357
## 13 nh_asian nohs 25 female 0.413 0.407
## 14 nh_black nohs 25 female 0.407 0.403
## 15 nh_other nohs 25 female 0.408 0.404
## 16 0nh_white somecoll 25 female 0.375 0.372
## 17 hispanic somecoll 25 female 0.372 0.371
## 18 nh_asian somecoll 25 female 0.431 0.421
## 19 nh_black somecoll 25 female 0.424 0.418
## 20 nh_other somecoll 25 female 0.425 0.418
Which shows us the estimated probabilty of reporting moving in the last year for each specified type of “fake person” that we generate. For example, let’s look at the probability for a Non-Hispanic white, age 25 with a college education, compared to a Hispanic person, age 25 with a high school education:
dat[which(dat$race_eth=="0nh_white"&dat$age==25&dat$edurec=="collgrad"),]
## race_eth edurec age sexrecode fitted.prob.lrm fitted.prob.pro
## 1 0nh_white collgrad 25 female 0.397 0.387
## 81 0nh_white collgrad 25 male 0.411 0.398
dat[which(dat$race_eth=="hispanic"&dat$age==25&dat$edurec=="hs"),]
## race_eth edurec age sexrecode fitted.prob.lrm fitted.prob.pro
## 7 hispanic hs 25 female 0.346 0.349
## 87 hispanic hs 25 male 0.360 0.360
The first case has an estimated probability of reporting poor/fair health of about 39% form females and 41% for males, while the second case has a 35% chance for females and a 36% chance for males. These are often more effective ways to convey the result of a model, instead of talking about all the regression coefficients.
In a regression model in general, the \(\beta's\) are the solution to the differential equation: \[\frac{\partial y}{\partial x} = \beta\]
which is just the rate of change in y, given x, known as the marginal effect. This is the case for strictly linear model
In the logit and probit model, which are nonlinear models, owing to their presumed model structure, the marginal effect also has to take into account the change in the respective pdf with respect to the mean, or:
\[\frac{\partial y}{\partial x} = \beta *\frac{\partial \Phi(x' \beta)}{\partial x'\beta}\]
So we have to multiply the estimated \(\beta\) by the p.d.f. of the assumed marginal distribution evaluated at the mean function. In R that’s not big problem:
#Logit marginal effects
log.marg<-coef(fit.logit)*mean(dlogis(predict(fit.logit)), na.rm=T)
#for probit now
prob.marg<-coef(fit.probit)*mean(dnorm(predict(fit.probit)), na.rm=T)
plot(log.marg[-1], ylab="Marginal Effects", axes=T,xaxt="n", main="Marginal Effects from Logit and Probit models", ylim=c(-.3, .3))
axis(side=1, at=1:10, labels=F)
text(x=1:10, y=-.4, srt = 45, pos = 1, xpd = TRUE,
labels = c( "Hispanic", "NH asian","NH black" ,"NH other","HS",
"NO HS","some coll", "male", "age", "age^2" ))
points(prob.marg[-1], col=2)
abline(h=0, col=3)
legend("topleft", legend=c("Logit Model", "Probit Model"), col=c("black", "red"),pch=1)
Which shows us that the marginal effects are very similar between the two models. We can coax these into a table like:
data.frame(m.logit=log.marg, m.probit=prob.marg)
## m.logit m.probit
## (Intercept) -0.219007341 -0.2335248301
## race_ethhispanic -0.001027932 -0.0006567572
## race_ethnh_asian 0.024411983 0.0245784045
## race_ethnh_black 0.021606581 0.0227208592
## race_ethnh_other 0.022093233 0.0230240354
## edurechs -0.021732197 -0.0189487456
## edurecnohs -0.017318541 -0.0148858855
## edurecsomecoll -0.009777071 -0.0076211089
## sexrecodemale 0.006282844 0.0056066753
## scale(age) -0.272378745 -0.2869913879
## scale(I(age^2)) 0.194170592 0.2093227547