If you haven’t already, please read the Introduction to this set of lab activities, which includes instructions on where to find the .csv files referenced in each lab.

Generalized Linear Models, aka GLMs

For Lab 3 install the packages car and effects for later use. When you run library() to load these, you’ll get a conflict for functions from car with dplyr and purrr (both part of the tidyverse) but it shouldn’t cause any issues in this lab.

Linear regression can be considered as a simple case of generalized linear modeling. This lab is not intended to teach simple linear regression, but some good background can be found in the Handbook of Biological Statistics (McDonald 2014)

Exercise 1

Run help("glm") to see the default arguments, the first of which are reproduced here:

glm(formula, family = gaussian, data...) # many more arguments in full function

To get a little practice with lm() in a multiple regression context, the first activity will ‘test’ this using ants.tb.

Import the ants_tb.csv file and preview it

ants.tb<-read_csv("ants_tb.csv")
## Rows: 44 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Site, Habitat
## dbl (5): Srich, Latitude, Elevation, Elev_100m, logS
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ants.tb)

Create a linear model, i.e. an lm() object using the ants data:

ants.mlr1 <- lm(logS ~ Latitude + Elev_100m + Habitat, data = ants.tb)

Now use the same formula with the argument family = gaussian in glm():

ants.glm1 <- glm(logS ~ Latitude + Elev_100m + Habitat, family = gaussian, data = ants.tb)

Use summary() on both model objects, ants.mlr1 and ants.glm1, and examine the output. How can you tell which output is from lm() and which is from glm()? Does it look like they provide the same results? What kinds of output are included in one but not the other (HINT, look above and/or below the Coefficients table in each summary)?

Exercise 2

Now we’ll use glm() for logistic regression. For this lab we are going to use a dataset with a binary response variable (1/0, representing yes/no) and multiple predictor variables.

A good reference for logistic regression is Lesson 12 from Penn State University’s STAT 462 (Applied Regression Analysis).

I modified (took liberties with) a dataset from the package carData (see NOTES in sources*) for use in this lab. It is available as mroz_mod.csv.

Don’t forget to use library() for the packages listed at the top, if you haven’t loaded them this session. Import the data using read_csv and run spec() on the data. If you want to name your new tibble with the same name I use in the code below, I named it restvol.

## Rows: 753 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): volunteer, k5, k6_18, age, college, spouse_col, log_faminc
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.1

Assume that the data for this Lab are from a survey of individuals about their participation in any ecological restoration project in the previous calendar year (but see NOTES* for original dataset). The survey was constrained to married individuals between 30 and 60 years of age.

Table 1. Variables in the modified Mroz dataset, imported as ‘restvol’
Variable Description Notes
volunteer participation in volunteer ecological restoration 1 = yes, 0 = no
k5 number of children ages 5 and younger values span 0-3 (but few 3s)
k6_18 number of children ages 6 to 18 values span 0-8 (but few > 5)
age age in years 30-60
college college attendance 1 = yes, 0 = no
spouse_col spouse’s college attendance 1 = yes, 0 = no
log_faminc log-transformed monthly family income in $1K 10^value (of this variable) is monthly income in $1K

What ‘hypotheses’ (not statistical, just educated speculation) might you have about the effect of each variable on whether or not an individual participated in restoration in the past year?

2.2

Even though we’re going to include multiple independent variables in our model in 2.3, let’s plot the relationship of just ONE of them (k5) to the response variable (volunteer), as an example of visualizing logistic relationships in ggplot() and introducing some plotting options.

In the code below, line by line:

  • restvol is the data (and we’re obviously using pipes for this).
  • in ggplot() you should recognize aes().
  • Here we use geom_jitter() instead of geom_point() because both variables are discrete and there are a lot of data points (especially at k5 = 0). To be clear, the actual values of k5 are just 0, 1, 2, or 3, but the points are spread out (jittered) along the x-axis using width =, and very slightly along the y-axis using height =.
  • similar to using method = "lm" towards the end of Lab 2, we can use geom_smooth() but specify a different method = and specify that family=binomial (look for similarities to glm() code below!). stat_smooth() is used similarly in ggplot().
  • se = TRUE plots the prediction interval and you know what color = does.
restvol %>%
  ggplot(aes(x = k5, y = volunteer)) +
  geom_jitter(height = 0.025, width = 0.3) +
  labs(x = "# of children < 5 yo", y = "probability of volunteering") +
  geom_smooth(formula = y ~ x, method = "glm", 
              method.args = list(family=binomial),
              se = TRUE,
              color = "blue4")

Given this plot, does it appear that people with more children < 5 yo, are MORE or LESS likely to volunteer on a restoration project? Other than the plotted line and prediction interval can you “see” that pattern in the points themselves (and why or why not)?

2.3

On to using multiple predictors in logistic regression. The code we used above to create ants.glm1, as a multiple linear regression model, was

glm(logS ~ Latitude + Elev_100m + Habitat, family = gaussian, data = ants.tb)

Let’s create a model object using the restvol data.

restvol.lg1<-glm(volunteer ~ k5 + k6_18 + age + college + spouse_col + log_faminc, 
                 family = binomial(link = logit), data = restvol)

The argument family = binomial(link = logit) is the addition that makes this a logistic regression model. If we only used family = binomial the same model would run, as “logit” is the default link for that family - however in this lab I’ll specify the link = function just to be explicit.

What are the other “link” functions that the binomial family accepts? HINT: use help("family")

2.4

Before we look at the summary() of the logistic model, don’t forget about looking at distribution(s) of the residuals. Although we CAN use plot() just like we would with output from lm(), for logistic regression those are harder to interpret. Instead let’s use residualPlots() which is part of the package car.

residualPlots(restvol.lg1)

(if you get a long list of Warnings don’t worry about them.) The relatively ‘flat’ lines in the plots are reassuring -hooray for homoscedasticity!

P.S. this is one of the best examples of why we can call these Generalized Linear Models - my non-mathy version is, “if the model is appropriate for the data, the math of the equations for a specific model (through variance and link functions) means that the error term has a linear relationship with the predictors.”

2.5

Let’s run summary() on the model object to start to examine the results.

summary(restvol.lg1)
## 
## Call:
## glm(formula = volunteer ~ k5 + k6_18 + age + college + spouse_col + 
##     log_faminc, family = binomial(link = logit), data = restvol)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.10094    0.82905   1.328    0.184    
## k5          -1.47701    0.19568  -7.548 4.42e-14 ***
## k6_18       -0.08604    0.06736  -1.277    0.201    
## age         -0.06266    0.01267  -4.944 7.66e-07 ***
## college      0.99975    0.22154   4.513 6.40e-06 ***
## spouse_col   0.15084    0.20542   0.734    0.463    
## log_faminc   1.61174    0.38202   4.219 2.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  920.78  on 746  degrees of freedom
## AIC: 934.78
## 
## Number of Fisher Scoring iterations: 3

2.6

First let’s assess the overall model “fit”. Hopefully in Exercise 1 you noticed that for a glm() there’s not an R2 value, F-statistic, or p-value. However, there are a number of ways to ask, “does the model fit the data?” Today we’ll use a goodness-of-fit test, (using the chi-squared distribution) with the reported Null and Residual deviances with their associated degrees of freedom:

The general form is

1-pchisq((Null deviance - Residual deviance), (Null df - Residual df))

In this case, ‘Null’ refers to a null model, which for logistic (and indeed most) regression models is aka an intercept-only model. If we actually ran it, it would look like

glm(volunteer ~ 1, family = binomial(link = logit), data = restvol)

However, R does that for us and provides the Null deviance.

Take those values (null and residual deviance, and df of each) and run the 1-pchisq() code (provided below if you have any problems). The resulting probability can be interpreted like a p-value: if it is <0.05 (or perhaps <0.10 if you’re OK with a higher chance of Type I errors), then there is evidence that your model has a better fit to the data than a null model.

Does the result of this goodness-of-fit test indicate that we should proceed (is the probability < 0.05)? (NOTE, the pchisq() function rounds down to 0 at very low probabilities.)

1-pchisq((1029.75 - 920.78), (752 - 746))
## [1] 0

Yes, the p-value is MUCH less than 0.05, meaning there IS a significant difference between our logistic regression model and the ‘null’ model in terms of fitting the data.

2.7

Other than the intercept (not meaningful in this case), what parameters are considered statistically significant (in your summary() output above)? What is the direction of the effect (positive or negative) on the probability of volunteering?

Interpreting Coefficients

Because of the logit transformation (see PSU STAT 462 for different forms of the logistic regression equation), the linear combination of the X variables in a multiple logistic regression, in general terms is:

\[ log\left(\frac{\pi}{1-\pi}\right) \sim \beta_0 + X_1\beta_1 + ... + X_k\beta_k + \epsilon \] where \(\pi\) is the probability of the event (in our example, the probability of an individual volunteering) and \(({\pi}/{1-\pi})\) are the odds of that event. These concepts are commonly used, e.g. if an event has a 75% probability, the odds (or more specifically the odds ratio) of that event are 3:1. Check it!

\(0.75/(1-0.75) = 0.75/0.25 = 3/1\)

However, the full term on the left side of the logistic regression equation is not just the odds, but rather the log odds (i.e. the natural logarithm of the odds), and this is not something we’re necessarily used to thinking about. Since it’s using the natural log, we can just use \(e\) to the power of the coefficient(s), e.g.

exp(coef(restvol.lg1))

to get the multiplier effect on the odds ratio.

Now use: exp(confint(restvol.lg1))

…which shows the confidence intervals around the specific odds ratio calculations above. A confidence interval that overlaps 1.0 means that there is no relationship between that variable and the odds of being a volunteer (because a multiplier effect of 1 results in the same odds ratio).

Which of the CIs overlap 1.0?

2.8

For future use, here’s a different summary() function from the package car called S() (run to see what it provides automatically).

S(restvol.lg1)
## Call: glm(formula = volunteer ~ k5 + k6_18 + age + college + spouse_col +
##           log_faminc, family = binomial(link = logit), data = restvol)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.10094    0.82905   1.328    0.184    
## k5          -1.47701    0.19568  -7.548 4.42e-14 ***
## k6_18       -0.08604    0.06736  -1.277    0.201    
## age         -0.06266    0.01267  -4.944 7.66e-07 ***
## college      0.99975    0.22154   4.513 6.40e-06 ***
## spouse_col   0.15084    0.20542   0.734    0.463    
## log_faminc   1.61174    0.38202   4.219 2.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  920.78  on 746  degrees of freedom
## 
##  logLik      df     AIC     BIC 
## -460.39       7  934.78  967.15 
## 
## Number of Fisher Scoring iterations: 3
## 
## Exponentiated Coefficients and Confidence Bounds
##              Estimate     2.5 %     97.5 %
## (Intercept) 3.0069819 0.5923670 15.3432850
## k5          0.2283195 0.1537099  0.3313822
## k6_18       0.9175550 0.8036181  1.0469902
## age         0.9392603 0.9158636  0.9625767
## college     2.7175990 1.7701086  4.2236300
## spouse_col  1.1628102 0.7773691  1.7410571
## log_faminc  5.0115355 2.3999695 10.7540351

2.9

Now let’s plot the predicted effects by each variable. Effects plots from a multiple logistic regression model are intended to show the predicted effect of each predictor variable on the response (in this case, probability of volunteering), holding everything else in the model equal.

The default predictor effect plot can be generated using:

plot(predictorEffects(restvol.lg1))

Just in a descriptive way, what do you notice? What is true of the plots of non-significant predictors compared to those that have significant effects? Are there particular aspects of these relationships that the plots help show, that you don’t get just from the summary() output?

For future use see the Predictor Effects Gallery

Citations

*NOTES To create the dataset you uploaded as mroz_mod.csv, I used the dataset named Mroz from the carData package (which is installed with car). The original dataset is on women’s labor force participation, but I changed the variable names and context to make the example more relevant to environmental studies. I also deleted one of the original variables, lwg, and transformed values in the log_faminc variable (called inc in the original dataset), but the other values are the same. This dataset is worked up as an example in (Fox and Weisberg 2019), citing (Mroz 1987). There’s also an interesting use of it for plotting residuals in Zhang (2016).

Fox, J., and S. Weisberg. 2019. An r companion to applied regression. 3rd edition. Sage Publications.
McDonald, J. H. 2014. Handbook of biological statistics. 3rd edition. Sparky House Publishing.
Mroz, T. A. 1987. The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions. Econometrica 55:765–799.