This document covers in-class exercises from Session 3 of PUBP-705, Adv. Stat. Methods for Policy Analysis.
Material covered: + Log-Log models (elasticity) + Interaction terms + Logistic Regression
## Loading required package: car
## Warning: package 'car' was built under R version 3.0.3
## Loading required package: foreign
## Warning: package 'foreign' was built under R version 3.0.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.3
As discussed last session, we can interpret logarithmically transformed variables in percentage terms - if we log both sides of an equation, we can interpret the results as the percentage impact a 1% increase of an IV has on the DV.
Example:
# read in data
poultry <- read.dta("C:/Users/Administrator/Downloads/poultry.dta")
# plot your variables of interest
ggplot(poultry,aes(price,q_capita)) + geom_point() + stat_smooth(method = "lm")
# make some log variables
poultry$log_price <- log(poultry$price)
poultry$log_qcapita <- log(poultry$q_capita)
poultry$log_inc000 <- log(poultry$inc_000)
ggplot(poultry,aes(log_price,log_qcapita)) + geom_point() + stat_smooth(method = "lm")
# make a scatter-matrix
pairs(data.frame(poultry$price,poultry$q_capita,poultry$inc_000,
poultry$log_price,poultry$log_qcapita,poultry$inc_000
))
# predict an elasticity!
elasticity <- lm(log_qcapita ~ log_inc000 + log_price, data = poultry)
summary(elasticity)
##
## Call:
## lm(formula = log_qcapita ~ log_inc000 + log_price, data = poultry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.04040 -0.01475 -0.00175 0.01550 0.06184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.79672 0.24752 31.499 < 2e-16 ***
## log_inc000 -0.40575 0.10114 -4.012 0.000684 ***
## log_price -0.14801 0.05867 -2.523 0.020227 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02611 on 20 degrees of freedom
## Multiple R-squared: 0.8335, Adjusted R-squared: 0.8169
## F-statistic: 50.07 on 2 and 20 DF, p-value: 1.634e-08
The price coefficient is price elasticity of demand The income coefficient is income elasticity of demand
Problem: how do we understand conditional impacts? what if one variable changes dependent on another?
A student example: US state electricity regulation - why are electric markets reg’d / not reg’d? Theory: perhaps the party of the governor is a factor. In an interaction model, we might analyze the relationship between some other predictor and electricity regulation conditional on the party of the governor.
# read in our data
saldata <- read.dta("C:/Users/Administrator/Downloads/salary.dta")
Hypothesis: perhaps gender conditions the relationship between age and salary
H_Null: there the relationship between age and salary is the same regardless of gender
# test an interaction model
# make sex a more clear variable
# Random Note - there's probably a better way to read STATA files into R....
saldata$female <- ifelse(saldata$sex == "females", 1,0)
salmodel <- lm(salnow ~ female + age + female * age, data=saldata)
summary(salmodel)
##
## Call:
## lm(formula = salnow ~ female + age + female * age, data = saldata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8967 -3521 -1316 1434 38431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19426.24 1445.45 13.440 < 2e-16 ***
## female -6474.85 1888.88 -3.428 0.000662 ***
## age -77.80 38.10 -2.042 0.041690 *
## female:age 10.74 48.63 0.221 0.825364
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6061 on 470 degrees of freedom
## Multiple R-squared: 0.2176, Adjusted R-squared: 0.2126
## F-statistic: 43.56 on 3 and 470 DF, p-value: < 2.2e-16
Interpretation:
……
Used for binary dependent variables Why not linear regression? It can give you strange results…. e.g. predictions >1
If your only interest is classification, you can technically make a linear probability model and classify as 0/1 if the prediction is >0.5, <0.5, but this doesn’t make nearly as much logical sense as logistic regression.
Logistic regression will generate predicted probabilities, which is much more helpful.
So, what are you predicting? The log of the odds ratio: $ log(p/1-p) $
Example:
# load in data
libraries <- read.dta("C:/Users/Administrator/Downloads/lib1500.dta",
convert.factors = T)
This dataset deals with public library use - what predicts library usage?
Hypothesis: perhaps education is a factor to consider here? We might expect more educated people to be more likely to go to the library.
Look at the education variable
table(libraries$educ)
##
## 1st grade 2nd grade 3rd grade
## 5 1 3
## 4th grade 5th grade 6th grade
## 2 3 5
## 7th grade 8th grade 9th grade
## 4 22 26
## 10th grade 11th grade 12th grade
## 41 53 487
## some college associate degree bachelors degree
## 312 110 264
## master's degree professional school doctoral degree
## 102 25 13
## other refused
## 5 0
Looks reasonable. HOWEVER: it has categories labeled ‘other’ and ‘refused’. We will have to do something about those.
Given how few people we have in the low categories, we’ll probably also want to combine into a variable like: ‘No HS’. We won’t be able to make any useful inferences about the 5 people who only attended 1st grade.
Note, however, that the dataset has done this for us in the ‘degree’ variable
# do some cleaning - replace all junk strings in the data
# with true NA values (this is an artifact of translating from STATA, I think)
# not sure if this happens generally, or if it's just this dataset
libraries <- as.data.frame(lapply(libraries, function(x){replace(x,"<NA>",NA)}))
libraries <- as.data.frame(lapply(libraries, function(x){replace(x,"NA",NA)}))
# replace 'other' and 'refused' with 'NA'
libraries$educ <- ifelse(libraries$educ == "other",NA,libraries$educ)
libraries$educ <- ifelse(libraries$educ == "refused",NA,libraries$educ)
# run a logistic regression predicting whether a person is a library user
# with: student, education, internet access, whether they have kids
# look at some descriptives first
table(libraries$libuseyr,libraries$student)
##
## yes no refused
## no 50 540 0
## yes 160 742 0
## don't know 0 0 0
## refused 0 0 0
table(libraries$libuseyr,libraries$educ)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## no 2 0 3 2 3 3 4 19 16 28 22 236 107 27 72
## yes 3 1 0 0 0 2 0 3 10 13 31 249 205 82 192
## don't know 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## refused 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 16 17 18
## no 25 9 4
## yes 77 16 9
## don't know 0 0 0
## refused 0 0 0
table(libraries$libuseyr,libraries$degree)
##
## less than high school high school some college college
## no 102 236 134 110
## yes 63 249 287 294
## don't know 0 0 0 0
## refused 0 0 0 0
# that degree variable is quite a bit clearer
table(libraries$libuseyr,libraries$kids)
##
## none one or more unknown
## no 404 188 0
## yes 500 399 0
## don't know 0 0 0
## refused 0 0 0
# run a logistic regression
logit.model <- glm(libuseyr ~ student + educ + kids,
data=libraries, family=binomial(link = logit))
summary(logit.model)
##
## Call:
## glm(formula = libuseyr ~ student + educ + kids, family = binomial(link = logit),
## data = libraries)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2662 -1.1194 0.7205 0.9849 2.3198
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.34138 0.41973 -5.578 2.43e-08 ***
## studentno -0.77552 0.17609 -4.404 1.06e-05 ***
## educ 0.24824 0.02923 8.493 < 2e-16 ***
## kidsone or more 0.60940 0.11631 5.240 1.61e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1974.7 on 1470 degrees of freedom
## Residual deviance: 1843.2 on 1467 degrees of freedom
## (31 observations deleted due to missingness)
## AIC: 1851.2
##
## Number of Fisher Scoring iterations: 4
Lots of stuff just happened, but just a process of figuring out what’s going on with the data, cleaning where necessary, and testing a simple model.
The coefficients we just generated are raw logit coefficients. They’re not very useful on their own - the interpretation is difficult. Is it helpful to tell someone that an additional year of eduction increases the log-odds of using the library by 0.248?
Maybe it would help you. Probably not me. You’ll want to convert to an easier form, which we can do by exponentiating the coefficients (remember: the log-odds are just the log of the odds). Doing this gets of the odds-ratio, which is much easier to discuss.
# get the Odds Ratio
exp(logit.model$coef)
## (Intercept) studentno educ kidsone or more
## 0.09619464 0.46046564 1.28177100 1.83933487
This can be interpreted as the change in the ratio of ‘success’ to ‘failure’ with a one unit change in the IV. So, for education, we have an OR of 1.218. We can say that, for each additional year of education, the person is on average about 28% more likely to use the library than a person with 1 less year of education.
For coefficients less than 1, it’s slightly different. Take the variable ‘student’ - apparently it’s measured as 1 if you are not a student, which seems to have a negative effect on the log-odds of visiting the library. Makes sense.
With the OR, it becomes less than 1 - someone who is not a student is ~54% less likely than someone who is a student to visit the library, on average.
Model fit statistics - there is no R^2 for logistic regression. However, we can approximate it.
# to get pseudo R^2 estimates we need yet another add-on package
require(pscl)
## Loading required package: pscl
## Warning: package 'pscl' was built under R version 3.0.3
## Loading required package: MASS
## Loading required package: lattice
## Classes and Methods for R developed in the
##
## Political Science Computational Laboratory
##
## Department of Political Science
##
## Stanford University
##
## Simon Jackman
##
## hurdle and zeroinfl functions by Achim Zeileis
pR2(logit.model)
## llh llhNull G2 McFadden r2ML
## -9.216075e+02 -1.004677e+03 1.661388e+02 8.268268e-02 1.067982e-01
## r2CU
## 1.433785e-01
We care about the McFadden number here - McFadden’s Pseudo R^2
Basically, this calculation uses the log-liklihood to compare a model with no predictors to the model we created. The better our model with predictors performs, the higher the pseudo R^2.
So, we’re looking at the improvement of our model over a Null model, with no IVs and it’s giving us a sense of how we’re improving our predictions
Another way to do that is to actually make some predictions and see how we do.
## create a new data-frame with only the variables we care about
subdata <- data.frame(libraries$libuseyr,libraries$student,libraries$educ,
libraries$kids)
# restrict the dataset to complete cases so we can match the prediction
# a regression will, by default, use 'pairwise deletion' when making an estimate
# it deletes all the rows with any missing values
# we have to do this manually to match the predicted values to the data
subdata <- subdata[complete.cases(subdata),]
# make a prediction
subdata$prediction <- predict.glm(logit.model)
Now we can see how our prediction did against the actual values
table(subdata$prediction, subdata$libraries.libuseyr)
##
## no yes don't know refused
## -2.86865591561056 1 0 0 0
## -2.62041320020312 0 1 0 0
## -2.37217048479568 1 0 0 0
## -2.25925189084052 1 3 0 0
## -2.12392776938824 1 0 0 0
## -1.76276646002563 1 0 0 0
## -1.62744233857335 2 1 0 0
## -1.51452374461819 1 0 0 0
## -1.37919962316591 3 0 0 0
## -1.26628102921075 3 0 0 0
## -1.13095690775847 15 1 0 0
## -0.987249423399209 1 0 0 0
## -0.882714192351032 8 3 0 0
## -0.769795598395872 1 0 0 0
## -0.634471476943591 21 3 0 0
## -0.521552882988431 4 2 0 0
## -0.38622876153615 12 7 0 0
## -0.27331016758099 7 4 0 0
## -0.242521277176886 1 1 0 0
## -0.137986046128709 154 121 0 0
## -0.107197155724605 1 1 0 0
## -0.0250674521735498 6 7 0 0
## 0.110256669278731 57 95 0 0
## 0.223175263233891 7 11 0 0
## 0.358499384686172 14 35 0 0
## 0.389288275090276 0 3 0 0
## 0.471417978641332 73 101 0 0
## 0.502206869045436 0 2 0 0
## 0.606742100093613 51 79 0 0
## 0.637530990497717 7 6 0 0
## 0.719660694048772 31 60 0 0
## 0.750449584452876 1 3 0 0
## 0.854984815501053 15 44 0 0
## 0.885773705905157 11 34 0 0
## 0.967903409456213 12 27 0 0
## 0.998692299860317 3 10 0 0
## 1.10322753090849 5 9 0 0
## 1.1340164213126 1 13 0 0
## 1.21614612486365 15 83 0 0
## 1.24693501526776 2 19 0 0
## 1.35147024631593 3 7 0 0
## 1.38225913672004 5 18 0 0
## 1.46438884027109 5 21 0 0
## 1.4951777306752 8 16 0 0
## 1.63050185212748 5 8 0 0
## 1.71263155567854 1 3 0 0
## 1.74342044608264 0 6 0 0
## 1.87874456753492 1 3 0 0
## 1.96087427108598 1 2 0 0
## 1.99166316149008 1 11 0 0
## 2.23990587689752 0 4 0 0
## 2.48814859230496 2 1 0 0
# not too useful; however, we can see that it generated a lot of predicted
# probabilities. That first person is very unlikely to have visited the library.
# generate 0,1 predictions
subdata$pred <- ifelse(subdata$prediction >= 0.5, 1, 0)
table(subdata$pred, subdata$libraries.libuseyr)
##
## no yes don't know refused
## 0 396 400 0 0
## 1 186 489 0 0
That’s a bit better. This table tells us how our model performed.
Top-left box is true-negatives - we predicted a 0 (didn’t visit) correctly Top-right is false-negative - we predicted a 0 when we should have guessed 1 Etc…
Sensitivity
Specificity
Other fit tests for logistic regression:
In addition to pseudo-R^2 and checking our predictive performance, we can use the Hosmer-Lemeshow Goodness of Fit test to see how well our model is specified.
Often abbreviated as HL, GOF, or HL-GOF.
We’ll need another add-on package for this.
require(ResourceSelection)
## Loading required package: ResourceSelection
## Warning: package 'ResourceSelection' was built under R version 3.0.3
## ResourceSelection 0.2-4 2014-05-19
# here's where I got the description of all this
# http://thestatsgeek.com/2014/02/16/the-hosmer-lemeshow-goodness-of-fit-test-for-logistic-regression/
# now, we can run the GOF test
# in this implementation, we have to pass the DV and the fitted values (our predicted probabilies) to the function
# we also need to tell it how many groups to use (see description in slide-deck)
hoslem.test(logit.model$y,logit.model$fitted.values,g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: logit.model$y, logit.model$fitted.values
## X-squared = 23.0532, df = 8, p-value = 0.003297
# for students: use names(logit.model) to get a sense of what objects R creates when you run a logistic regression. In this case, we're using the DV from the model and the fitted.values from the model (the predictions)
# we're also setting the group number to 10
In this test, the Null Hypothesis is that the fit of the model is good.
We have some cause for concern here - we reject the null, so perhaps our model doesn’t fit that well. We could have figured this out from our predictions, which were pretty lousy (lots of errors). If this were an actual project, we’d have some work to do.
Finally, we can also calculate a VIF statistic using pseudo-R^2. The calculation is the same, and the package we used earlier can manage it.
vif(logit.model)
## student educ kids
## 1.000465 1.012274 1.012647
# nothing to worry about here