This document covers in-class exercises from Session 4 of PUBP-705, Adv. Stat. Methods for Policy Analysis.

Coverage:

  1. Multicollinearity detection for logistic regression
  2. Comparing logit models: pseudo R2 and odds ratios
  3. Logistic regression marginal effects and predicted probabilities
  4. Other logit/probit-family models and Poisson regression for counts

Load our packages and datasets for today.
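
A sketch of this step (the require() calls match the loading messages below; the data file was not named in the original, so the read-in line is a commented placeholder):

require(car)
require(foreign)
require(ggplot2)
require(pscl)   # pulls in MASS and lattice as dependencies
# read in the library-use survey data (file name is a placeholder)
# libraries <- read.dta("<libraries data file>.dta")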

## Loading required package: car
## Loading required package: foreign
## Loading required package: ggplot2
## Loading required package: pscl
## Loading required package: MASS
## Loading required package: lattice
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

Do some cleaning.

# replace literal "<NA>" and "NA" strings with real missing values
# (replace()'s second argument is an index, so we need the logical test x == "...")
libraries <- as.data.frame(lapply(libraries, function(x){replace(x, x == "<NA>", NA)}))
libraries <- as.data.frame(lapply(libraries, function(x){replace(x, x == "NA", NA)}))

# replace 'other' and 'refused' with NA
# (ifelse() on a factor returns the underlying integer codes, so index-and-assign is safer)
libraries$educ[libraries$educ %in% c("other", "refused")] <- NA

  1. Multicollinearity Detection for Logistic Regression

VIF works the same way for logistic regression as it does for ordinary (OLS) regression.

# run our logit model predicting library use 
logm <- glm(libuseyr ~ student + degree + kids + intacess, 
                   data=libraries, family=binomial(link = logit))
summary(logm)
## 
## Call:
## glm(formula = libuseyr ~ student + degree + kids + intacess, 
##     family = binomial(link = logit), data = libraries)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0374  -1.2008   0.6745   0.9128   1.6338  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -0.4526     0.2391  -1.893  0.05832 .  
## studentno           -0.5766     0.1820  -3.168  0.00153 ** 
## degreehigh school    0.5660     0.1935   2.925  0.00345 ** 
## degreesome college   1.1099     0.2029   5.471 4.48e-08 ***
## degreecollege        1.2706     0.2129   5.968 2.40e-09 ***
## kidsone or more      0.5181     0.1191   4.350 1.36e-05 ***
## intacessyes          0.6052     0.1235   4.901 9.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1972.8  on 1469  degrees of freedom
## Residual deviance: 1812.6  on 1463  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 1826.6
## 
## Number of Fisher Scoring iterations: 4
# the same implementation of VIF we used for OLS also works for logit
vif(logm)
##              GVIF Df GVIF^(1/(2*Df))
## student  1.060888  1        1.029994
## degree   1.176088  3        1.027401
## kids     1.041629  1        1.020602
## intacess 1.182948  1        1.087634

All the GVIFs here are close to 1, so multicollinearity is not a concern for this model. (For factors with more than one degree of freedom, vif() reports the generalized VIF; the GVIF^(1/(2*Df)) column is comparable to the square root of an ordinary VIF.)

# a logit interaction term
# (note: degree * intacess already expands to degree + intacess + degree:intacess,
#  so listing the main effects separately is redundant but harmless)
logm2 <- glm(libuseyr ~ student + degree + kids + intacess + degree * intacess, 
                   data=libraries, family=binomial(link = logit))
summary(logm2)
## 
## Call:
## glm(formula = libuseyr ~ student + degree + kids + intacess + 
##     degree * intacess, family = binomial(link = logit), data = libraries)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9892  -1.1768   0.7018   0.9225   1.7312  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.6917     0.2738  -2.526 0.011541 *  
## studentno                       -0.5540     0.1824  -3.037 0.002386 ** 
## degreehigh school                0.7336     0.2452   2.991 0.002776 ** 
## degreesome college               1.4244     0.2724   5.230 1.70e-07 ***
## degreecollege                    1.8271     0.3330   5.487 4.09e-08 ***
## kidsone or more                  0.5106     0.1195   4.273 1.93e-05 ***
## intacessyes                      1.3095     0.3755   3.488 0.000487 ***
## degreehigh school:intacessyes   -0.5820     0.4192  -1.388 0.165057    
## degreesome college:intacessyes  -0.8542     0.4323  -1.976 0.048161 *  
## degreecollege :intacessyes      -1.1258     0.4699  -2.396 0.016579 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1972.8  on 1469  degrees of freedom
## Residual deviance: 1805.8  on 1460  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 1825.8
## 
## Number of Fisher Scoring iterations: 4

We might want to compare these models. Pseudo R2 is one way to do this; since logm is nested in logm2, a likelihood-ratio test (sketched just below) is another. We'll also get the odds ratios for the coefficients, for easier interpretation.
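
A quick sketch of the likelihood-ratio test (this output was not in the original, but both models were fit on the same 1,470 rows, so the comparison is valid):

# likelihood-ratio test: do the interaction terms improve fit?
anova(logm, logm2, test = "Chisq")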

# get the Odds Ratio
exp(logm2$coef)
##                    (Intercept)                      studentno 
##                      0.5007376                      0.5746657 
##              degreehigh school             degreesome college 
##                      2.0825361                      4.1552929 
##                 degreecollege                 kidsone or more 
##                      6.2157561                      1.6662409 
##                    intacessyes  degreehigh school:intacessyes 
##                      3.7043568                      0.5587725 
## degreesome college:intacessyes     degreecollege :intacessyes 
##                      0.4256259                      0.3243994
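
If you also want uncertainty on the odds-ratio scale, one option (not part of the original output; confint() on a glm computes profile-likelihood intervals) is:

# odds ratios with 95% profile-likelihood confidence intervals
exp(confint(logm2))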
# estimate Pseudo R2
pR2(logm)
##           llh       llhNull            G2      McFadden          r2ML 
## -9.062895e+02 -1.004677e+03  1.967748e+02  9.792941e-02  1.252879e-01 
##          r2CU 
##  1.681477e-01
pR2(logm2)
##           llh       llhNull            G2      McFadden          r2ML 
##  -902.8877362 -1004.6768696   203.5782667     0.1013153     0.1293268 
##          r2CU 
##     0.1735684

The pseudo R2 can be interpreted as an estimate of the improvement in the model relative to an 'empty' model with no predictors. Here both models explain only a modest share of the variation, and the interaction model improves on the simpler one only slightly (McFadden's 0.101 vs. 0.098).

This package reports 3 measures of pseudo R2: McFadden's, Maximum Likelihood (ML), and Cragg and Uhler's (CU). All these calculations use a similar idea - look at the log likelihood for the model (llh) and the log likelihood for a 'null' (empty) model and see how much of an improvement we get. The different methods use various corrections, but are conceptually similar.
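To make that idea concrete, here is a minimal sketch of McFadden's measure computed by hand. The null model is refit on the full model's own model frame so both use the same rows (recall that 32 observations were dropped for missingness):

# McFadden's pseudo R2 = 1 - llh / llhNull
nullm <- update(logm, . ~ 1, data = model.frame(logm))
1 - as.numeric(logLik(logm)) / as.numeric(logLik(nullm))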

  2. Logistic Regression Marginal Effects

Now that we've run some more models, we might want to dig deeper into the interpretation of logistic regression coefficients. Basic OLS regression coefficients are easy to interpret - each one is the marginal effect of X on Y, holding all other variables constant.

However, this is not true of logistic regression. Thinking about the functional form of a logistic regression model should give you a sense of why: the function is non-linear, so the marginal effects won't be the same across all values. For a continuous predictor, the marginal effect of X on the probability is beta * p * (1 - p), which depends on where p falls on the curve. As a result, we have to estimate the marginal effects from the model.

This is necessary both for ease of interpretation and for investigating practical policy significance. Specific probabilities might be of interest to us, or predicted probabilities at particular values of the independent variables.

Calculating marginal effects:

# we need yet another package for this
require(mfx)
## Loading required package: mfx
## Loading required package: sandwich
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: betareg
# now, we can estimate marginal effects for the model
logitmfx(libuseyr ~ student + degree + kids + intacess + degree * intacess, 
                   data=libraries, atmean = T)
## Call:
## logitmfx(formula = libuseyr ~ student + degree + kids + intacess + 
##     degree * intacess, data = libraries, atmean = T)
## 
## Marginal Effects:
##                                    dF/dx Std. Err.       z     P>|z|    
## studentno                      -0.123897  0.037812 -3.2766 0.0010506 ** 
## degreehigh school               0.167008  0.053107  3.1448 0.0016623 ** 
## degreesome college              0.300769  0.049366  6.0926 1.111e-09 ***
## degreecollege                   0.366479  0.053081  6.9041 5.051e-12 ***
## kidsone or more                 0.119055  0.027221  4.3737 1.222e-05 ***
## intacessyes                     0.308176  0.084591  3.6431 0.0002693 ***
## degreehigh school:intacessyes  -0.142197  0.103725 -1.3709 0.1704058    
## degreesome college:intacessyes -0.208440  0.105110 -1.9831 0.0473587 *  
## degreecollege :intacessyes     -0.272749  0.110672 -2.4645 0.0137216 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## dF/dx is for discrete change for the following variables:
## 
## [1] "studentno"                      "degreehigh school"             
## [3] "degreesome college"             "degreecollege "                
## [5] "kidsone or more"                "intacessyes"                   
## [7] "degreehigh school:intacessyes"  "degreesome college:intacessyes"
## [9] "degreecollege :intacessyes"

This output looks very much like the original summary for the interaction model, but the values mean something different: rather than log-odds coefficients, each dF/dx is the change in the probability of Y for a one-unit change in X (a discrete 0-to-1 change for the dummy variables here), holding the other variables at their means.

Remember that the raw logit coefficients are on the log-odds scale, so they are not directly interpretable as marginal effects on the probability without a transformation like this.

Note: as far as I can tell, this package only calculates marginal effects at the means (atmean = TRUE) or average marginal effects (atmean = FALSE, sketched below); it is not as flexible as Stata's margins command, which can evaluate them at arbitrary values.
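
The average-marginal-effects version of the same call, for comparison:

# average marginal effects (AME), averaging each observation's effect
# rather than evaluating at the means
logitmfx(libuseyr ~ student + degree + kids + intacess + degree * intacess, 
                   data=libraries, atmean = FALSE)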

We can also make predictions at different values. This is another way of presenting the results of a logistic regression in terms that make sense to people: estimate the predicted probabilities at specific values of the independent variables and compare them.

# make predictions for the model
# keep just the model variables, then drop incomplete rows
subdata <- libraries[, c("libuseyr", "student", "intacess", "kids", "degree")]
subdata <- subdata[complete.cases(subdata), ]

# generate a made-up observation with specific values for prediction
newdata <- data.frame(student = "no", intacess = "no",
                      degree = "high school", kids = "none")
newdata <- rbind(newdata,data.frame(student = "yes", intacess = "no",
                      degree = "high school", kids = "none"))
# type = "response" returns predicted probabilities (type = "link" would give log-odds)
newdata$pred <- predict(logm, newdata = newdata, type = "response")

So, we've generated two basic predictions from the model. All the variable values are the same, except that we vary student from no to yes. That gives us predicted probabilities for two hypothetical observations that differ on only one point (within the context of the model): whether or not the person is currently a student. The difference between the two predicted probabilities is the effect of being a student, holding the other variables at the specified values.
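To pull that contrast out directly (a small addition, not in the original):

# discrete change in predicted probability: student "yes" vs. "no"
newdata$pred[2] - newdata$pred[1]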

More types of logit/probit models:

There are many other types of regression in the logit/probit family that can deal with dependent variables that don't fit the simple binary mold.

For categorical-nominal variables you might want to use multinomial logit. For categorical-ordinal variables you might want to use ordered logit or probit (a sketch follows below).
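
A minimal sketch of an ordered logit using polr() from MASS (already loaded above). The outcome libusefreq here is hypothetical - an ordered factor such as never < sometimes < often - and is not a variable in our dataset, so the call is left commented out:

# olm <- polr(libusefreq ~ student + degree + kids, data = libraries,
#             method = "logistic")
# summary(olm)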

For variables with other unusual properties - for example, counts that are heavily right-skewed, with a lot of small values and a much smaller number of large values - we will be using Poisson regression (and others in the same family).

Poisson models are commonly used to deal with count data, since count observations tend to follow a Poisson distribution. This method also addresses the non-linearity: count data are typically right-skewed, with many low counts and a few high counts. There are a variety of ways to deal with these problems.

# we run poisson models using the same GLM framework we use for most models

# pois <- glm(formula, data=data, family=poisson)
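# e.g., with a hypothetical count outcome 'visits' (library visits per year,
# not a variable we have loaded):
# pois <- glm(visits ~ student + degree + kids, data = libraries, family = poisson)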

# STOPPED HERE