This document covers in-class exercises from Session 3 of PUBP-705, Adv. Stat. Methods for Policy Analysis.

Material covered: + Log-Log models (elasticity) + Interaction terms + Logistic Regression

## Loading required package: car
## Warning: package 'car' was built under R version 3.0.3
## Loading required package: foreign
## Warning: package 'foreign' was built under R version 3.0.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.3
  1. Log-Log Models and elasticity calculations

As discussed last session, we can interpret logarithmically transformed variables in percentage terms - if we log both sides of an equation, we can interpret the results as the percentage impact a 1% increase of an IV has on the DV.

Example:

# read in data
poultry <- read.dta("C:/Users/Administrator/Downloads/poultry.dta")

# plot your variables of interest
ggplot(poultry,aes(price,q_capita)) + geom_point() + stat_smooth(method = "lm")

# make some log variables
poultry$log_price <- log(poultry$price)
poultry$log_qcapita <- log(poultry$q_capita)
poultry$log_inc000 <- log(poultry$inc_000)

ggplot(poultry,aes(log_price,log_qcapita)) + geom_point() + stat_smooth(method = "lm")

# make a scatter-matrix
pairs(data.frame(poultry$price,poultry$q_capita,poultry$inc_000,
                 poultry$log_price,poultry$log_qcapita,poultry$inc_000
                 ))

# predict an elasticity!
elasticity <- lm(log_qcapita ~ log_inc000 + log_price, data = poultry)
summary(elasticity)
## 
## Call:
## lm(formula = log_qcapita ~ log_inc000 + log_price, data = poultry)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04040 -0.01475 -0.00175  0.01550  0.06184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.79672    0.24752  31.499  < 2e-16 ***
## log_inc000  -0.40575    0.10114  -4.012 0.000684 ***
## log_price   -0.14801    0.05867  -2.523 0.020227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02611 on 20 degrees of freedom
## Multiple R-squared:  0.8335, Adjusted R-squared:  0.8169 
## F-statistic: 50.07 on 2 and 20 DF,  p-value: 1.634e-08

The price coefficient is price elasticity of demand The income coefficient is income elasticity of demand

  1. Interaction terms

Problem: how do we understand conditional impacts? what if one variable changes dependent on another?

A student example: US state electricity regulation - why are electric markets reg’d / not reg’d? Theory: perhaps the party of the governor is a factor. In an interaction model, we might analyze the relationship between some other predictor and electricity regulation conditional on the party of the governor.

# read in our data
saldata <- read.dta("C:/Users/Administrator/Downloads/salary.dta")

Hypothesis: perhaps gender conditions the relationship between age and salary

H_Null: there the relationship between age and salary is the same regardless of gender

# test an interaction model

# make sex a more clear variable
# Random Note - there's probably a better way to read STATA files into R....
saldata$female <- ifelse(saldata$sex == "females", 1,0)

salmodel <- lm(salnow ~ female + age + female * age, data=saldata)
summary(salmodel)
## 
## Call:
## lm(formula = salnow ~ female + age + female * age, data = saldata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8967  -3521  -1316   1434  38431 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19426.24    1445.45  13.440  < 2e-16 ***
## female      -6474.85    1888.88  -3.428 0.000662 ***
## age           -77.80      38.10  -2.042 0.041690 *  
## female:age     10.74      48.63   0.221 0.825364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6061 on 470 degrees of freedom
## Multiple R-squared:  0.2176, Adjusted R-squared:  0.2126 
## F-statistic: 43.56 on 3 and 470 DF,  p-value: < 2.2e-16

Interpretation:

……

  1. Logistic Regression

Used for binary dependent variables Why not linear regression? It can give you strange results…. e.g. predictions >1

If your only interest is classification, you can technically make a linear probability model and classify as 0/1 if the prediction is >0.5, <0.5, but this doesn’t make nearly as much logical sense as logistic regression.

Logistic regression will generate predicted probabilities, which is much more helpful.

So, what are you predicting? The log of the odds ratio: $ log(p/1-p) $

Example:

# load in data
libraries <- read.dta("C:/Users/Administrator/Downloads/lib1500.dta", 
                      convert.factors = T)

This dataset deals with public library use - what predicts library usage?

Hypothesis: perhaps education is a factor to consider here? We might expect more educated people to be more likely to go to the library.

Look at the education variable

table(libraries$educ)
## 
##           1st grade           2nd grade           3rd grade 
##                   5                   1                   3 
##           4th grade           5th grade           6th grade 
##                   2                   3                   5 
##           7th grade           8th grade           9th grade 
##                   4                  22                  26 
##          10th grade          11th grade          12th grade 
##                  41                  53                 487 
##        some college    associate degree    bachelors degree 
##                 312                 110                 264 
##     master's degree professional school     doctoral degree 
##                 102                  25                  13 
##               other             refused 
##                   5                   0

Looks reasonable. HOWEVER: it has categories labeled ‘other’ and ‘refused’. We will have to do something about those.

Given how few people we have in the low categories, we’ll probably also want to combine into a variable like: ‘No HS’. We won’t be able to make any useful inferences about the 5 people who only attended 1st grade.

Note, however, that the dataset has done this for us in the ‘degree’ variable

# do some cleaning - replace all junk strings in the data
# with true NA values (this is an artifact of translating from STATA, I think)
# not sure if this happens generally, or if it's just this dataset
libraries <- as.data.frame(lapply(libraries, function(x){replace(x,"<NA>",NA)}))
libraries <- as.data.frame(lapply(libraries, function(x){replace(x,"NA",NA)}))

# replace 'other' and 'refused' with 'NA'
libraries$educ <- ifelse(libraries$educ == "other",NA,libraries$educ)
libraries$educ <- ifelse(libraries$educ == "refused",NA,libraries$educ)

# run a logistic regression predicting whether a person is a library user
# with: student, education, internet access, whether they have kids

# look at some descriptives first
table(libraries$libuseyr,libraries$student)
##             
##              yes  no refused
##   no          50 540       0
##   yes        160 742       0
##   don't know   0   0       0
##   refused      0   0       0
table(libraries$libuseyr,libraries$educ)
##             
##                1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
##   no           2   0   3   2   3   3   4  19  16  28  22 236 107  27  72
##   yes          3   1   0   0   0   2   0   3  10  13  31 249 205  82 192
##   don't know   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   refused      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##             
##               16  17  18
##   no          25   9   4
##   yes         77  16   9
##   don't know   0   0   0
##   refused      0   0   0
table(libraries$libuseyr,libraries$degree)
##             
##              less than high school high school some college college 
##   no                           102         236          134      110
##   yes                           63         249          287      294
##   don't know                     0           0            0        0
##   refused                        0           0            0        0
# that degree variable is quite a bit clearer

table(libraries$libuseyr,libraries$kids)
##             
##              none one or more unknown
##   no          404         188       0
##   yes         500         399       0
##   don't know    0           0       0
##   refused       0           0       0
# run a logistic regression
logit.model <- glm(libuseyr ~ student + educ + kids, 
                   data=libraries, family=binomial(link = logit))
summary(logit.model)
## 
## Call:
## glm(formula = libuseyr ~ student + educ + kids, family = binomial(link = logit), 
##     data = libraries)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2662  -1.1194   0.7205   0.9849   2.3198  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.34138    0.41973  -5.578 2.43e-08 ***
## studentno       -0.77552    0.17609  -4.404 1.06e-05 ***
## educ             0.24824    0.02923   8.493  < 2e-16 ***
## kidsone or more  0.60940    0.11631   5.240 1.61e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1974.7  on 1470  degrees of freedom
## Residual deviance: 1843.2  on 1467  degrees of freedom
##   (31 observations deleted due to missingness)
## AIC: 1851.2
## 
## Number of Fisher Scoring iterations: 4

Lots of stuff just happened, but just a process of figuring out what’s going on with the data, cleaning where necessary, and testing a simple model.

The coefficients we just generated are raw logit coefficients. They’re not very useful on their own - the interpretation is difficult. Is it helpful to tell someone that an additional year of eduction increases the log-odds of using the library by 0.248?

Maybe it would help you. Probably not me. You’ll want to convert to an easier form, which we can do by exponentiating the coefficients (remember: the log-odds are just the log of the odds). Doing this gets of the odds-ratio, which is much easier to discuss.

# get the Odds Ratio
exp(logit.model$coef)
##     (Intercept)       studentno            educ kidsone or more 
##      0.09619464      0.46046564      1.28177100      1.83933487

This can be interpreted as the change in the ratio of ‘success’ to ‘failure’ with a one unit change in the IV. So, for education, we have an OR of 1.218. We can say that, for each additional year of education, the person is on average about 28% more likely to use the library than a person with 1 less year of education.

For coefficients less than 1, it’s slightly different. Take the variable ‘student’ - apparently it’s measured as 1 if you are not a student, which seems to have a negative effect on the log-odds of visiting the library. Makes sense.

With the OR, it becomes less than 1 - someone who is not a student is ~54% less likely than someone who is a student to visit the library, on average.

Model fit statistics - there is no R^2 for logistic regression. However, we can approximate it.

# to get pseudo R^2 estimates we need yet another add-on package
require(pscl)
## Loading required package: pscl
## Warning: package 'pscl' was built under R version 3.0.3
## Loading required package: MASS
## Loading required package: lattice
## Classes and Methods for R developed in the
## 
## Political Science Computational Laboratory
## 
## Department of Political Science
## 
## Stanford University
## 
## Simon Jackman
## 
## hurdle and zeroinfl functions by Achim Zeileis
pR2(logit.model)
##           llh       llhNull            G2      McFadden          r2ML 
## -9.216075e+02 -1.004677e+03  1.661388e+02  8.268268e-02  1.067982e-01 
##          r2CU 
##  1.433785e-01

We care about the McFadden number here - McFadden’s Pseudo R^2

Basically, this calculation uses the log-liklihood to compare a model with no predictors to the model we created. The better our model with predictors performs, the higher the pseudo R^2.

So, we’re looking at the improvement of our model over a Null model, with no IVs and it’s giving us a sense of how we’re improving our predictions

Another way to do that is to actually make some predictions and see how we do.

## create a new data-frame with only the variables we care about

subdata <- data.frame(libraries$libuseyr,libraries$student,libraries$educ,
                      libraries$kids)

# restrict the dataset to complete cases so we can match the prediction
# a regression will, by default, use 'pairwise deletion' when making an estimate
# it deletes all the rows with any missing values
# we have to do this manually to match the predicted values to the data

subdata <- subdata[complete.cases(subdata),]

# make a prediction
subdata$prediction <- predict.glm(logit.model)

Now we can see how our prediction did against the actual values

table(subdata$prediction, subdata$libraries.libuseyr)
##                      
##                        no yes don't know refused
##   -2.86865591561056     1   0          0       0
##   -2.62041320020312     0   1          0       0
##   -2.37217048479568     1   0          0       0
##   -2.25925189084052     1   3          0       0
##   -2.12392776938824     1   0          0       0
##   -1.76276646002563     1   0          0       0
##   -1.62744233857335     2   1          0       0
##   -1.51452374461819     1   0          0       0
##   -1.37919962316591     3   0          0       0
##   -1.26628102921075     3   0          0       0
##   -1.13095690775847    15   1          0       0
##   -0.987249423399209    1   0          0       0
##   -0.882714192351032    8   3          0       0
##   -0.769795598395872    1   0          0       0
##   -0.634471476943591   21   3          0       0
##   -0.521552882988431    4   2          0       0
##   -0.38622876153615    12   7          0       0
##   -0.27331016758099     7   4          0       0
##   -0.242521277176886    1   1          0       0
##   -0.137986046128709  154 121          0       0
##   -0.107197155724605    1   1          0       0
##   -0.0250674521735498   6   7          0       0
##   0.110256669278731    57  95          0       0
##   0.223175263233891     7  11          0       0
##   0.358499384686172    14  35          0       0
##   0.389288275090276     0   3          0       0
##   0.471417978641332    73 101          0       0
##   0.502206869045436     0   2          0       0
##   0.606742100093613    51  79          0       0
##   0.637530990497717     7   6          0       0
##   0.719660694048772    31  60          0       0
##   0.750449584452876     1   3          0       0
##   0.854984815501053    15  44          0       0
##   0.885773705905157    11  34          0       0
##   0.967903409456213    12  27          0       0
##   0.998692299860317     3  10          0       0
##   1.10322753090849      5   9          0       0
##   1.1340164213126       1  13          0       0
##   1.21614612486365     15  83          0       0
##   1.24693501526776      2  19          0       0
##   1.35147024631593      3   7          0       0
##   1.38225913672004      5  18          0       0
##   1.46438884027109      5  21          0       0
##   1.4951777306752       8  16          0       0
##   1.63050185212748      5   8          0       0
##   1.71263155567854      1   3          0       0
##   1.74342044608264      0   6          0       0
##   1.87874456753492      1   3          0       0
##   1.96087427108598      1   2          0       0
##   1.99166316149008      1  11          0       0
##   2.23990587689752      0   4          0       0
##   2.48814859230496      2   1          0       0
# not too useful; however, we can see that it generated a lot of predicted 
# probabilities. That first person is very unlikely to have visited the library.

# generate 0,1 predictions
subdata$pred <- ifelse(subdata$prediction >= 0.5, 1, 0)

table(subdata$pred, subdata$libraries.libuseyr)
##    
##      no yes don't know refused
##   0 396 400          0       0
##   1 186 489          0       0

That’s a bit better. This table tells us how our model performed.

Top-left box is true-negatives - we predicted a 0 (didn’t visit) correctly Top-right is false-negative - we predicted a 0 when we should have guessed 1 Etc…

Sensitivity

Specificity

Other fit tests for logistic regression:

In addition to pseudo-R^2 and checking our predictive performance, we can use the Hosmer-Lemeshow Goodness of Fit test to see how well our model is specified.

Often abbreviated as HL, GOF, or HL-GOF.

We’ll need another add-on package for this.

require(ResourceSelection)
## Loading required package: ResourceSelection
## Warning: package 'ResourceSelection' was built under R version 3.0.3
## ResourceSelection 0.2-4   2014-05-19
# here's where I got the description of all this
# http://thestatsgeek.com/2014/02/16/the-hosmer-lemeshow-goodness-of-fit-test-for-logistic-regression/
  
# now, we can run the GOF test
# in this implementation, we have to pass the DV and the fitted values (our predicted probabilies) to the function
# we also need to tell it how many groups to use (see description in slide-deck)

hoslem.test(logit.model$y,logit.model$fitted.values,g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  logit.model$y, logit.model$fitted.values
## X-squared = 23.0532, df = 8, p-value = 0.003297
# for students: use names(logit.model) to get a sense of what objects R creates when you run a logistic regression. In this case, we're using the DV from the model and the fitted.values from the model (the predictions)

# we're also setting the group number to 10

In this test, the Null Hypothesis is that the fit of the model is good.

We have some cause for concern here - we reject the null, so perhaps our model doesn’t fit that well. We could have figured this out from our predictions, which were pretty lousy (lots of errors). If this were an actual project, we’d have some work to do.

Finally, we can also calculate a VIF statistic using pseudo-R^2. The calculation is the same, and the package we used earlier can manage it.

vif(logit.model)
##  student     educ     kids 
## 1.000465 1.012274 1.012647
# nothing to worry about here