If you haven’t already, please read the Introduction to this
set of lab activities, which includes instructions on where to find the
.csv files referenced in each lab.
For Lab 3 install the packages car and
effects for later use. When you run library()
to load these, you’ll get a conflict for functions from car
with dplyr and purrr (both part of the
tidyverse) but it shouldn’t cause any issues in this
lab.
Linear regression can be considered as a simple case of generalized linear modeling. This lab is not intended to teach simple linear regression, but some good background can be found in the Handbook of Biological Statistics (McDonald 2014)
Run help("glm") to see the default arguments, the first
of which are reproduced here:
glm(formula, family = gaussian, data...) # many more arguments in full function
To get a little practice with lm() in a multiple
regression context, the first activity will ‘test’ this using
ants.tb.
Import the ants_tb.csv file and preview it
ants.tb<-read_csv("ants_tb.csv")
## Rows: 44 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Site, Habitat
## dbl (5): Srich, Latitude, Elevation, Elev_100m, logS
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ants.tb)
Create a linear model, i.e. an lm() object using the
ants data:
ants.mlr1 <- lm(logS ~ Latitude + Elev_100m + Habitat, data = ants.tb)
Now use the same formula with the argument
family = gaussian in glm():
ants.glm1 <- glm(logS ~ Latitude + Elev_100m + Habitat, family = gaussian, data = ants.tb)
Use summary() on both model objects,
ants.mlr1 and ants.glm1, and examine the
output. How can you tell which output is from lm() and
which is from glm()? Does it look like they provide the
same results? What kinds of output are included in one but not the other
(HINT, look above and/or below the Coefficients table in each
summary)?
Now we’ll use glm() for logistic
regression. For this lab we are going to use a dataset with a
binary response variable (1/0, representing yes/no) and
multiple predictor variables.
A good reference for logistic regression is Lesson 12 from Penn State University’s STAT 462 (Applied Regression Analysis).
I modified (took liberties with) a dataset from the package
carData (see NOTES in sources*) for use in
this lab. It is available as mroz_mod.csv.
Don’t forget to use library() for the
packages listed at the top, if you haven’t loaded them this session.
Import the data using read_csv and run spec()
on the data. If you want to name your new tibble with the same name I
use in the code below, I named it restvol.
## Rows: 753 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): volunteer, k5, k6_18, age, college, spouse_col, log_faminc
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Assume that the data for this Lab are from a survey of individuals about their participation in any ecological restoration project in the previous calendar year (but see NOTES* for original dataset). The survey was constrained to married individuals between 30 and 60 years of age.
| Variable | Description | Notes |
|---|---|---|
volunteer |
participation in volunteer ecological restoration | 1 = yes, 0 = no |
k5 |
number of children ages 5 and younger | values span 0-3 (but few 3s) |
k6_18 |
number of children ages 6 to 18 | values span 0-8 (but few > 5) |
age |
age in years | 30-60 |
college |
college attendance | 1 = yes, 0 = no |
spouse_col |
spouse’s college attendance | 1 = yes, 0 = no |
log_faminc |
log-transformed monthly family income in $1K | 10^value (of this variable) is monthly income in $1K |
What ‘hypotheses’ (not statistical, just educated speculation) might you have about the effect of each variable on whether or not an individual participated in restoration in the past year?
Even though we’re going to include multiple independent variables in
our model in 2.3, let’s plot the relationship of just
ONE of them (k5) to the response variable
(volunteer), as an example of visualizing logistic
relationships in ggplot() and introducing some plotting
options.
In the code below, line by line:
restvol is the data (and we’re obviously using pipes
for this).ggplot() you should recognize
aes().geom_jitter() instead of
geom_point() because both variables are discrete and there
are a lot of data points (especially at k5 = 0). To be clear, the actual
values of k5 are just 0, 1, 2, or 3, but the points are spread out
(jittered) along the x-axis using width =, and very
slightly along the y-axis using height =.method = "lm" towards the end of
Lab 2, we can use geom_smooth() but
specify a different method = and specify that
family=binomial (look for similarities to
glm() code below!). stat_smooth() is used
similarly in ggplot().se = TRUE plots the prediction interval and you know
what color = does.restvol %>%
ggplot(aes(x = k5, y = volunteer)) +
geom_jitter(height = 0.025, width = 0.3) +
labs(x = "# of children < 5 yo", y = "probability of volunteering") +
geom_smooth(formula = y ~ x, method = "glm",
method.args = list(family=binomial),
se = TRUE,
color = "blue4")
Given this plot, does it appear that people with more children < 5 yo, are MORE or LESS likely to volunteer on a restoration project? Other than the plotted line and prediction interval can you “see” that pattern in the points themselves (and why or why not)?
On to using multiple predictors in logistic regression. The code we used above to create ants.glm1, as a multiple linear regression model, was
glm(logS ~ Latitude + Elev_100m + Habitat, family = gaussian, data = ants.tb)
Let’s create a model object using the restvol data.
restvol.lg1<-glm(volunteer ~ k5 + k6_18 + age + college + spouse_col + log_faminc,
family = binomial(link = logit), data = restvol)
The argument family = binomial(link = logit) is the
addition that makes this a logistic regression model. If we only used
family = binomial the same model would run, as “logit” is
the default link for that family - however in this lab I’ll specify the
link = function just to be explicit.
What are the other “link” functions that the
binomial family accepts? HINT: use
help("family")
Before we look at the summary() of the logistic
model, don’t forget about looking at distribution(s) of the residuals.
Although we CAN use plot() just like we would with output
from lm(), for logistic regression those are harder to
interpret. Instead let’s use residualPlots() which is part
of the package car.
residualPlots(restvol.lg1)
(if you get a long list of Warnings don’t worry about them.) The relatively ‘flat’ lines in the plots are reassuring -hooray for homoscedasticity!
P.S. this is one of the best examples of why we can call these Generalized Linear Models - my non-mathy version is, “if the model is appropriate for the data, the math of the equations for a specific model (through variance and link functions) means that the error term has a linear relationship with the predictors.”
Let’s run summary() on the model object to start to
examine the results.
summary(restvol.lg1)
##
## Call:
## glm(formula = volunteer ~ k5 + k6_18 + age + college + spouse_col +
## log_faminc, family = binomial(link = logit), data = restvol)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.10094 0.82905 1.328 0.184
## k5 -1.47701 0.19568 -7.548 4.42e-14 ***
## k6_18 -0.08604 0.06736 -1.277 0.201
## age -0.06266 0.01267 -4.944 7.66e-07 ***
## college 0.99975 0.22154 4.513 6.40e-06 ***
## spouse_col 0.15084 0.20542 0.734 0.463
## log_faminc 1.61174 0.38202 4.219 2.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 920.78 on 746 degrees of freedom
## AIC: 934.78
##
## Number of Fisher Scoring iterations: 3
First let’s assess the overall model “fit”. Hopefully in
Exercise 1 you noticed that for a glm()
there’s not an R2 value, F-statistic, or p-value. However, there are a
number of ways to ask, “does the model fit the data?” Today we’ll use a
goodness-of-fit test, (using the chi-squared distribution) with the
reported Null and Residual deviances with their associated degrees of
freedom:
The general form is
1-pchisq((Null deviance - Residual deviance), (Null df - Residual df))
In this case, ‘Null’ refers to a null model, which for logistic (and indeed most) regression models is aka an intercept-only model. If we actually ran it, it would look like
glm(volunteer ~ 1, family = binomial(link = logit), data = restvol)
However, R does that for us and provides the Null deviance.
Take those values (null and residual deviance, and df of each) and
run the 1-pchisq() code (provided below if you have any
problems). The resulting probability can be interpreted like a p-value:
if it is <0.05 (or perhaps <0.10 if you’re OK with a higher chance
of Type I errors), then there is evidence that your model has a better
fit to the data than a null model.
Does the result of this goodness-of-fit test indicate that we should
proceed (is the probability < 0.05)? (NOTE, the pchisq()
function rounds down to 0 at very low probabilities.)
1-pchisq((1029.75 - 920.78), (752 - 746))
## [1] 0
Yes, the p-value is MUCH less than 0.05, meaning there IS a significant difference between our logistic regression model and the ‘null’ model in terms of fitting the data.
Other than the intercept (not meaningful in this case), what
parameters are considered statistically significant (in your
summary() output above)? What is the direction of
the effect (positive or negative) on the probability of
volunteering?
Because of the logit transformation (see PSU STAT 462 for different forms of the logistic regression equation), the linear combination of the X variables in a multiple logistic regression, in general terms is:
\[ log\left(\frac{\pi}{1-\pi}\right) \sim \beta_0 + X_1\beta_1 + ... + X_k\beta_k + \epsilon \] where \(\pi\) is the probability of the event (in our example, the probability of an individual volunteering) and \(({\pi}/{1-\pi})\) are the odds of that event. These concepts are commonly used, e.g. if an event has a 75% probability, the odds (or more specifically the odds ratio) of that event are 3:1. Check it!
\(0.75/(1-0.75) = 0.75/0.25 = 3/1\)
However, the full term on the left side of the logistic regression equation is not just the odds, but rather the log odds (i.e. the natural logarithm of the odds), and this is not something we’re necessarily used to thinking about. Since it’s using the natural log, we can just use \(e\) to the power of the coefficient(s), e.g.
exp(coef(restvol.lg1))
to get the multiplier effect on the odds ratio.
Now use: exp(confint(restvol.lg1))
…which shows the confidence intervals around the specific odds ratio calculations above. A confidence interval that overlaps 1.0 means that there is no relationship between that variable and the odds of being a volunteer (because a multiplier effect of 1 results in the same odds ratio).
Which of the CIs overlap 1.0?
For future use, here’s a different summary() function from the
package car called S() (run to see what it
provides automatically).
S(restvol.lg1)
## Call: glm(formula = volunteer ~ k5 + k6_18 + age + college + spouse_col +
## log_faminc, family = binomial(link = logit), data = restvol)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.10094 0.82905 1.328 0.184
## k5 -1.47701 0.19568 -7.548 4.42e-14 ***
## k6_18 -0.08604 0.06736 -1.277 0.201
## age -0.06266 0.01267 -4.944 7.66e-07 ***
## college 0.99975 0.22154 4.513 6.40e-06 ***
## spouse_col 0.15084 0.20542 0.734 0.463
## log_faminc 1.61174 0.38202 4.219 2.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 920.78 on 746 degrees of freedom
##
## logLik df AIC BIC
## -460.39 7 934.78 967.15
##
## Number of Fisher Scoring iterations: 3
##
## Exponentiated Coefficients and Confidence Bounds
## Estimate 2.5 % 97.5 %
## (Intercept) 3.0069819 0.5923670 15.3432850
## k5 0.2283195 0.1537099 0.3313822
## k6_18 0.9175550 0.8036181 1.0469902
## age 0.9392603 0.9158636 0.9625767
## college 2.7175990 1.7701086 4.2236300
## spouse_col 1.1628102 0.7773691 1.7410571
## log_faminc 5.0115355 2.3999695 10.7540351
Now let’s plot the predicted effects by each variable. Effects plots from a multiple logistic regression model are intended to show the predicted effect of each predictor variable on the response (in this case, probability of volunteering), holding everything else in the model equal.
The default predictor effect plot can be generated using:
plot(predictorEffects(restvol.lg1))
Just in a descriptive way, what do you notice? What is true of the
plots of non-significant predictors compared to those that have
significant effects? Are there particular aspects of these relationships
that the plots help show, that you don’t get just from the
summary() output?
For future use see the Predictor Effects Gallery
*NOTES To create the dataset you uploaded as
mroz_mod.csv, I used the dataset named Mroz
from the carData package (which is installed with
car). The original dataset is on women’s labor force
participation, but I changed the variable names and context to make the
example more relevant to environmental studies. I also deleted one of
the original variables, lwg, and transformed values in the
log_faminc variable (called inc in the
original dataset), but the other values are the same. This dataset is
worked up as an example in (Fox and Weisberg 2019), citing (Mroz 1987). There’s
also an interesting use of it for plotting residuals in Zhang
(2016).