This handout introduces you to different ways of working with logistic and probit regression, extending the discussion in Bailey on these topics. After reading this handout, you should be able to reliably conduct basic probit or logit analyses, report the results in tabular form, provide basic graphics, and provide interpretations of substantive importance.

We’ll be working with real data in this handout: polling data from a Gallup News Service survey of 1,252 U.S. adults, a nationally representative sample conducted September 25-30, 2014. If this were a professional or academic analysis, we’d go much more in-depth into the data and its analysis, but since the effects we’re interested in are so large we can get away with ignoring some fine details.

Importing Data

library(foreign)
data <- read.dta("g201411.dta")

The question we’re most interested in is the first question in this poll:

Do you approve or disapprove of the way Barack Obama is handling his job as president?

This is Q1 in the raw dataset. We transform it to make it useful for our analysis by setting “Approve” equal to 1 and “Disapprove” equal to 0:

data$approve <- NA
data$approve[data$Q1=="Approve"] <- 1
data$approve[data$Q1=="Disapprove"] <- 0

We also need to clean up several variables for use in modelling later on:

# Creating Partisan ID variable 
# (dummy for whether respondent is a Republican)
data$gop <- NA
data$gop[data$D9=="Republican"]<- 1
data$gop[data$D9=="Democrat"] <- 0
data$gop[data$D9A=="Republican"] <- 1
data$gop[data$D9A=="Democrat"] <- 0
data$gop[data$D9A=="Neith/Other (vol)"] <- 0

# Creating female variable
# (dummy for whether respondent is a woman)

data$female <- NA
data$female[data$D1=="Female"] <- 1
data$female[data$D1=="Male"] <- 0
data$female[data$D1A=="Female"] <- 1
data$female[data$D1A == "Male"] <- 0

# Creating a variable with a clearer name
# (Gallup calls age "D2", which is hard to
# remember)
data$age <- data$D2

# Creating a dummy variable for whether the
# respondent has completed a four-year college 
# degree

data$college <- NA
data$college[as.numeric(data$D3)<=5] <- 0
data$college[as.numeric(data$D3)>=6 & as.numeric(data$D3) <= 8] <- 1

You should look at the codebook to see how I arrived at these values for the subsetting/indexing commands.
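
One good habit (not strictly required for this handout) is to sanity-check the recodes before modeling; a quick table() or summary() on the new variables will catch most coding mistakes:

# Optional checks on the recoded variables
table(data$Q1, data$approve, useNA = "ifany")   # DV against the raw question
table(data$gop, useNA = "ifany")                # partisanship dummy
summary(data[, c("age", "college", "female")])  # remaining predictors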

We now have a dependent variable—approve—and several explanatory variables—age, college, female, and gop. Why these explanatory variables? There’s a long literature on the determinants of American voters’ opinions and behaviors, and many of those findings have disseminated into the culture. It shouldn’t come as a surprise that college-educated citizens tend, on average, to have different opinions than citizens without college degrees, that older and younger citizens view the world differently, that gender affects how people see their interests, and, most clearly, that people’s partisan affiliation is highly correlated with their vote choices and opinions. We will set aside concerns about endogeneity/causation for this handout; suffice it to say that whether partisan identification (in particular) is a reflection of how people see the world (that is, whether people join parties based on their beliefs) or whether partisan ID exerts a causal influence on voters (that is, whether the parties people join change voters’ beliefs) is a perennial and important question.

Analyzing Probit Models Using glm()

We’re going to start with a basic probit model:

glm.1 <- glm(approve~age+college+female+gop,data=data,family=binomial(link="probit"))

This creates an object glm.1 whose dependent variable, approve, takes a value of 1 if the respondent approves of President Obama and 0 otherwise. The argument family=binomial(link="probit") tells glm() to run a probit model.
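
As an aside (and just a sketch, not something we use below): the logistic-regression version of the same model only requires changing the link function. The practical difference shows up later, when converting the linear predictor into a probability, where you would use plogis() instead of pnorm().

# Logit version of the same model, for comparison only
glm.logit <- glm(approve ~ age + college + female + gop,
                 data = data, family = binomial(link = "logit"))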

As with an OLS object created via lm(), we can use summary() to see what the model estimates:

summary(glm.1)
## 
## Call:
## glm(formula = approve ~ age + college + female + gop, family = binomial(link = "probit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0818  -0.3925  -0.3099   0.5962   2.6406  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.114993   0.240751   4.631 3.63e-06 ***
## age         -0.006766   0.003680  -1.839   0.0659 .  
## college      0.230035   0.124988   1.840   0.0657 .  
## female       0.006675   0.128084   0.052   0.9584    
## gop         -2.404942   0.127328 -18.888  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 985.32  on 782  degrees of freedom
## Residual deviance: 509.53  on 778  degrees of freedom
##   (469 observations deleted due to missingness)
## AIC: 519.53
## 
## Number of Fisher Scoring iterations: 5

This isn’t the kind of output you’d ever turn in, of course, but it is what you’ll see as you work through the analysis on your own.

The output suggests that approval declines as respondents grow older, that college-educated people are more likely to approve of President Obama, that women are very slightly more likely to approve than men, and that Republicans are vastly less likely to approve of President Obama than the reference group (Democrats and independents). Of these, only our estimate of the coefficient for gop reaches conventional levels of significance—the association between being a Republican and disapproving of Obama, conditional on other factors, is very strong. Although neither the age nor the college coefficient estimate reaches conventional levels of significance, we should be relatively confident that both the older and the better-educated in fact behave as their coefficient signs predict.

To calculate the magnitude of the relationship between a respondent’s being a Republican or not and his/her likelihood of approving of President Obama, we can do as Bailey suggests and plug in the estimated coefficient values:

p1 <- pnorm(glm.1$coef[1] + glm.1$coef[2]*45 + glm.1$coef[3]*1 + glm.1$coef[4]*1 + glm.1$coef[5]*1)
p1
## (Intercept) 
##  0.08727451

This code generates a predicted probability of approving of President Obama for a 45-year-old Republican college-educated woman—about 9 percent. (Ignore the name (Intercept), which is left over from plugging in all of the glm.1$coef data.) Note that because this is a probit model we use pnorm(), the cumulative distribution function of the standard normal distribution, to convert the linear predictor into a probability; Bailey explains this intuitively in the textbook. The code also includes references to glm.1$coef followed by a variety of bracketed numbers; check out what that means:

glm.1$coef
##  (Intercept)          age      college       female          gop 
##  1.114992969 -0.006766483  0.230034874  0.006675253 -2.404942449
glm.1$coef[1]
## (Intercept) 
##    1.114993

In other words, glm.1$coef is just the named vector of estimated coefficient values, and the [n] indexing just selects the n-th element of that vector.
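
Plugging in coefficients by hand is good practice for seeing what the model does, but it is easy to mis-order a coefficient. A less error-prone alternative (just a sketch; the hand calculation above is the one Bailey walks through) is to let predict() do the arithmetic:

# Same predicted probability via predict(); type = "response"
# converts the linear predictor into a probability for us
predict(glm.1,
        newdata = data.frame(age = 45, college = 1, female = 1, gop = 1),
        type = "response")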

Counterfactually, we can predict the probability of approving of the president for a 45-year-old college-educated woman who is not a Republican:

p2 <- pnorm(glm.1$coef[1] + glm.1$coef[2]*45 + glm.1$coef[3]*1 + glm.1$coef[4]*1 + glm.1$coef[5]*0)
p2
## (Intercept) 
##   0.8524989

This code generates that predicted probability—about 85 percent.

To analyze the counterfactual difference, we just subtract

p2-p1
## (Intercept) 
##   0.7652244

and the results tell us that a woman who was identical (according to the model) in every way but her partisan identification would be about 76 percentage points more likely to approve of the President than her Republican counterpart. Substantively, this basically means that Republican party identifiers are overwhelmingly likely to disapprove of President Obama, while those who identify as Democrats or independents are overwhelmingly likely to approve.

We can solve this more generally—instead of plugging in values for everything, we let each respondent’s actual data values speak for themselves on every variable except party ID.

p3 <- pnorm(glm.1$coef[1] + glm.1$coef[2]*data$age + glm.1$coef[3]*data$college + 
              glm.1$coef[4]*data$female + glm.1$coef[5]*1)
p4 <- pnorm(glm.1$coef[1] + glm.1$coef[2]*data$age + glm.1$coef[3]*data$college + 
              glm.1$coef[4]*data$female + glm.1$coef[5]*0)
mean(p3,na.rm=TRUE) 
## [1] 0.06361608
mean(p4,na.rm=TRUE)
## [1] 0.8007045
mean(p4,na.rm=TRUE) - mean(p3,na.rm=TRUE)
## [1] 0.7370885

(We have to use the na.rm=TRUE argument because any missing value in any part of the data will generate a missing value, NA, as the result.) As we can see, our prediction is, again, that Republicans are less than one-tenth (!) as likely as non-Republicans to approve of President Obama, and that the gap between the two groups is about 74 percentage points. If all you knew about someone was that he was a Republican, and you had a chance to bet at even odds that he disapproved of President Obama, you should definitely take that bet!
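
The same whole-sample counterfactual can be computed without touching the coefficients at all (again, just a sketch of an alternative to the calculation above): make two copies of the data, set gop to 1 in one copy and 0 in the other, and let predict() fill in the probabilities.

# Counterfactual data: everyone a Republican vs. no one a Republican
data.gop <- data
data.gop$gop <- 1
data.notgop <- data
data.notgop$gop <- 0
p.gop <- predict(glm.1, newdata = data.gop, type = "response")
p.notgop <- predict(glm.1, newdata = data.notgop, type = "response")
mean(p.notgop, na.rm = TRUE) - mean(p.gop, na.rm = TRUE)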

Analyzing Probit Models Using Zelig

There’s another way to do this kind of analysis, using a package called Zelig, that you might (or might not!) find easier.

install.packages("Zelig") ## May take a few moments
library(Zelig)
## 
## Attaching package: 'Zelig'
## The following object is masked from 'package:base':
## 
##     mode

Let’s replicate our earlier analysis using Zelig.

glm.z.1 <- zelig(approve~age+college+female+gop,data=data,model="probit", cite=FALSE)

The syntax is almost exactly the same. model="probit" is, I think, a little easier than specifying a link function (although statisticians like the other syntax because it requires you to remember that you are in fact using probability distributions). And, obviously, we’re typing zelig() instead of glm(). (You don’t have to set cite=FALSE, but I really recommend it; it suppresses the citation message Zelig otherwise prints each time you fit a model.)

summary(glm.z.1)
## Model: 
## 
## Call:
## z5$zelig(formula = approve ~ age + college + female + gop, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0818  -0.3925  -0.3099   0.5962   2.6406  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.114993   0.240751   4.631 3.63e-06
## age         -0.006766   0.003680  -1.839   0.0659
## college      0.230035   0.124988   1.840   0.0657
## female       0.006675   0.128084   0.052   0.9584
## gop         -2.404942   0.127328 -18.888  < 2e-16
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 985.32  on 782  degrees of freedom
## Residual deviance: 509.53  on 778  degrees of freedom
##   (469 observations deleted due to missingness)
## AIC: 519.53
## 
## Number of Fisher Scoring iterations: 5
## 
## Next step: Use 'setx' method

The output is almost exactly the same: the one small difference is that Zelig, written by zealots, doesn’t give you significance stars (which is really best practice anyway). The beauty of Zelig comes with the next step:

pred.x <- setx(glm.z.1,age=45,college=1,female=1,gop=1)

No clumsy specifying of objects! No remembering which coefficient estimate goes with which! Just write the values of the explanatory variables you want to simulate at, along with the name of the model object (glm.z.1, in this case) that you want to use for predictions. And then …

s.out <- sim(glm.z.1,x=pred.x)
summary(s.out)
## 
##  sim x :
##  -----
## ev
##            mean         sd        50%       2.5%     97.5%
## [1,] 0.08895784 0.02077701 0.08671798 0.05487727 0.1332768
## pv
##          0     1
## [1,] 0.896 0.104
plot(s.out)

Zelig works according to a three-part recipe: use zelig() to estimate the model, setx() to set the values of the explanatory variables, and sim() to generate a model object with predicted values of Y based on the x’s you set. Using summary() and plot() you can then see pretty easily what is going on. Just as before, we estimate that a 45-year-old Republican college-educated woman has about a 9 percent probability of approving of President Obama, but this comes with a nice distribution illustrating our uncertainty (or, in this case, our relative certainty) about that prediction. The ev values in the summary() output are the expected values and their distribution: as you can see, 95 percent of the simulations returned expected values between about 5.5 percent and about 13.3 percent. The pv values are the predicted values; they are slightly different from the ev, but not by much (the first time I ran this, it was 9.1 percent versus 8.9 percent), and you can mostly ignore discrepancies between the two. (Also, since these are simulations, you’ll get slightly different results each time depending on how the computer randomly generates draws to fill in the distributions.)
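
To make the recipe concrete, here is the same three-step sequence for a different, purely hypothetical profile (a 30-year-old Republican man without a college degree; these values are just for illustration):

# Step 1 (zelig) was done above; steps 2 and 3 for a new profile
pred.x.alt <- setx(glm.z.1, age = 30, college = 0, female = 0, gop = 1)
s.out.alt <- sim(glm.z.1, x = pred.x.alt)
summary(s.out.alt)
plot(s.out.alt)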

To compare our 45-year-old Republican woman with her non-Republican “twin”:

pred.x.dem <- setx(glm.z.1,age=45,college=1,female=1,gop=0)
s.out.2 <- sim(glm.z.1,
               x=pred.x,
               x1=pred.x.dem) ## note x1 is the new pred.x.dem object
summary(s.out.2)
## 
##  sim x :
##  -----
## ev
##            mean         sd        50%      2.5%     97.5%
## [1,] 0.08818767 0.02203123 0.08557339 0.0519672 0.1374175
## pv
##          0     1
## [1,] 0.921 0.079
## 
##  sim x1 :
##  -----
## ev
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.8498413 0.03118508 0.8528755 0.7796029 0.9045696
## pv
##          0     1
## [1,] 0.134 0.866
## fd
##           mean        sd       50%      2.5%     97.5%
## [1,] 0.7616536 0.0259277 0.7642562 0.7048544 0.8079291

The interpretation of this output is pretty much the same as in the previous example; just make sure you keep x1 and x straight. The only new term is the fd numbers—fd means First Difference, and that’s just the difference between the expected values under x1 and under x (and, again, it’s about 76 percentage points).

We can also plot this using plot(s.out.2). It’s really cool, and you should try it! But for technical reasons it’s a little hard to include in this handout.

Differences Between GLM and Zelig

As you can see, there aren’t many differences between the model parameters you’ll estimate with glm() and Zelig; the differences lie in the philosophy and the ease of interpreting the output that comes from the two packages. glm() is more classic and a little more fiddly; Zelig is easier to use but relies on simulation for its interpretation, which may make it slightly harder to exactly replicate your interpretive quantities from run to run (again, the models the two packages estimate are equivalent).

I generally work with glm(), but I find that for exploratory work being able to switch back and forth between models within Zelig is much easier (you can specify model="ls" for least squares, model="logit" for logit, and so on). Either should work fine with the table-making procedure we’ve already discussed for displaying estimated coefficients; however, you must run one of the prediction procedures above (plugging in values by hand with glm(), or simulating with setx() and sim()) to get results that are substantively easy to interpret (as Bailey discusses).
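
For instance (a sketch only, reusing the same formula and data), switching estimators in Zelig is a one-argument change:

# Same specification, different estimators
glm.z.logit <- zelig(approve ~ age + college + female + gop,
                     data = data, model = "logit", cite = FALSE)
glm.z.ls <- zelig(approve ~ age + college + female + gop,
                  data = data, model = "ls", cite = FALSE)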