If you haven’t already, please read the Introduction to this
set of lab activities, which includes instructions on where to find the
.csv files referenced in each lab.
For this lab you’ll need all of these packages - some of them you
will probably need to install: tidyverse,
AICcmodavg, effects, car,
rcompanion, MASS, pscl,
ggeffects, and performance.
Run the library() commands to load the packages you need
for this lab.
library(tidyverse)
library(AICcmodavg)
library(effects)
library(car)
library(rcompanion)
library(MASS) #for glm.nb()
library(pscl) #for zeroinfl()
library(ggeffects)
library(performance)
A good reference for regression modeling with count-based data is this overview from UCLA.
You’re going to bring in data from a .csv file again, but this time
we’re going to use col_types = to tell
read_csv specifically how to read in some of the
columns.
These data are from H. Jane Brockmann’s long-term research on horseshoe crabs in Florida (Brockmann 1996). The response variable is the number of male satellites (crabs that are not attached to a female during mating, but crowd around to try to fertilize at least some of her eggs). Import them now.
crab.sat <- read_csv("crab_sat.csv", col_types = cols(color = col_factor(levels = c("ltmed", "medium", "dkmed", "dark"), ordered = TRUE), spine_cond = col_integer(), satellites = col_integer()))
| Variable | Description | Notes |
|---|---|---|
satellites |
number of male satellites counted near a female crab | integer from 0-15 |
color |
color of female (light medium to dark) | ordinal variable |
color.n |
numerical version of color | coded 1-4, not used in this lab |
spine_cond |
condition of the female’s spines, coded numerically | 1 = both broken, 2 = one broken, 3 = both good |
width |
female carapace width | in centimeters |
weight |
female weight | in kilograms |
…and since we used read_csv you can run
spec(crab.sat) too, to see how the variables are
classified.
This kind of coding is helpful when you have a .csv and know what’s
in it (probably from opening it in Excel) AND how you want R to read in
each variable. You can always assign these classes after you’ve imported
the data into R (see for example, as.factor()).
In this case we know that the text in the column “color” can be considered an ordered factor (variable), i.e. there is a logical order, from ltmed through dark. The column/variable “spine_cond” is an integer that corresponds to the three categories (Table 1). In the case of “satellites”, the distinction of making it an integer is not critical but it can’t hurt since it’s a count.
See my NOTE1 about the menu-based Import Dataset tool in RStudio (which you may have used in the past) which is an alternative to the coding version…
Let’s start by plotting the count data - you can add your own tweaks
to the histogram if you want to use more functionality of
ggplot (see Histogram R Graph
Gallery)
crab.sat %>%
ggplot(aes(x=satellites)) +
geom_histogram(binwidth = 1, fill="purple4", alpha=0.6, color="#e9ecef") +
labs(title = "Size of Crab Satellite Groups", x = "N satellites") +
theme_bw()
Given this plot, you might want to know exactly how many counts of 0 satellite males there are. There are a number of ways of doing this in R (of course). One way, using base R coding is:
count(crab.sat[crab.sat$satellites == 0,6]) which counts
the number of values of 0 in column 6 (the satellites variable). Try
it…
…however, for practice let’s also set it up using
tidyverse syntax. This version gives the count not just of
the 0 values, but all distinct values in satellites, and
which should correspond to the histogram above.
crab.sat %>% count(satellites)
## # A tibble: 15 × 2
## satellites n
## <int> <int>
## 1 0 62
## 2 1 16
## 3 2 9
## 4 3 19
## 5 4 19
## 6 5 15
## 7 6 13
## 8 7 4
## 9 8 6
## 10 9 3
## 11 10 3
## 12 11 1
## 13 12 1
## 14 14 1
## 15 15 1
NOTE, if you have tried using select() after running
library(MASS) you will get an error message that starts
Error in select(.,…
This How
to Fix page from ‘Listen Data’ does a good job of explaining the
error - there are select() functions in both
dplyr (part of the tidyverse) and MASS
packages, but there is a simple workaround. See the message under the
library(MASS) above as well.
How many females had 0 satellite males in this data set? What percentage of the total N is that? If you think that we might come back to this later, you’re right!
One issue we haven’t covered in labs, but which might sound familiar if you’ve done multiple linear regression before, is multicollinearity. This happens when independent variables in a regression model are highly correlated to each other (easiest to conceptualize when both are continuous variables).
There is a pretty reasonable summary on this site if you want to read more: How to Test for Multicollinearity in R.
But for this lab, since crab.sat has two continuous
variables, width and weight, AND because these body-size related
measurements might be assumed to be correlated for female crabs, use
cor() to quickly get a Pearson’s correlation coefficient,
r:
cor(crab.sat$width, crab.sat$weight)
What is the r value? This (0.89) is high enough to think there COULD be an issue if we include both variables so we’ll proceed with one in the model in 1.3. (There is more you CAN do wrt multicollinearity - look up Variance Inflation Factors or VIFs if you want to learn more.)
Let’s create our first Poisson model. We’ll use AIC scores later, but our approach here is not completely model selection-based as you’ll see. We’re NOT evaluating a full set of candidate models including simple, one-predictor models. If you’re not familiar with the use of AIC, this page is a brief introduction: How to Calculate AIC in R.
The glm() formula, data, and family arguments should
look familiar from Lab 3, we’re just specifying that we want to
use the poisson family with its default
link = "log".
csat.ps1 <- glm(satellites ~ width + color + spine_cond, data = crab.sat, family = poisson(link = "log"))
summary(csat.ps1)
##
## Call:
## glm(formula = satellites ~ width + color + spine_cond, family = poisson(link = "log"),
## data = crab.sat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.99086 0.60217 -4.967 6.81e-07 ***
## width 0.15033 0.02112 7.118 1.09e-12 ***
## color.L -0.37365 0.15720 -2.377 0.0175 *
## color.Q 0.10401 0.12232 0.850 0.3951
## color.C 0.05574 0.09124 0.611 0.5412
## spine_cond 0.01873 0.05823 0.322 0.7478
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.24 on 167 degrees of freedom
## AIC: 926.54
##
## Number of Fisher Scoring iterations: 6
This should look somewhat familiar, if you’ve used glm()
before. For now, just a note on why R has included color.L,
color.Q, and color.C for our ordinal variable
color. This is a default way that glm()
handles an ordered factor: it evaluates the variable itself (color.L,
with L for linear), plus the term color2 (color.Q, with Q for
quadratic), plus the term color3 (color.C, with C for cubic).
For this model and lab, and since the quadratic and cubic terms are
non-significant, we’ll just focus on the term color.L.
Remember using the null and residual deviance to evaluate the logistic model in Lab 3?
Here we’ll use the function nagelkerke which calculates
this, plus some pseudo-R2 values
for us:
nagelkerke(csat.ps1)
## $Models
##
## Model: "glm, satellites ~ width + color + spine_cond, poisson(link = \"log\"), crab.sat"
## Null: "glm, satellites ~ 1, poisson(link = \"log\"), crab.sat"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0744374
## Cox and Snell (ML) 0.3463280
## Nagelkerke (Cragg and Uhler) 0.3474770
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -5 -36.775 73.551 1.8661e-14
##
## $Number.of.observations
##
## Model: 173
## Null: 173
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
The output labeled ‘$Likelihood.ratio.test’ is the same as you should
get using
1-pchisq((Null deviance - Residual deviance), (Null df - Residual df))
Is there evidence this is a better fit for the data than a “NULL” model? YES, the p-value of the likelihood ratio test is very small, so you reject the null hypothesis of “no difference” (in model likelihood, or “fit” between this model and a null model)
As with the logistic model, we can and should examine patterns in the residuals. Run
residualPlots(csat.ps1)
## Test stat Pr(>|Test stat|)
## width 8.8807 0.002882 **
## color
## spine_cond 0.6114 0.434254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table output suggests a significant, non-linear association of
the residuals with the predictor variable width. In the
plot, you can see the bend in the fitted line - ALSO reflected in the
predicted values plot (lower right).
With the predictor variable color it’s also notable that
those ‘outlier’ residuals (high values above the whisker of the box
plot) occur more in the dark medium and dark color group.
For this Lab, we’ll take the approach that we could and probably
should evaluate other count data models. Going back to the histogram and
calculations in Exercise 1.1, it seems we might want to
do a check for ‘zero inflation’ and ‘overdispersion’. These functions
come from the package performance:
Run these - the results should be self-explanatory.
check_zeroinflation(csat.ps1)
## # Check for zero-inflation
##
## Observed zeros: 62
## Predicted zeros: 15
## Ratio: 0.24
## Model is underfitting zeros (probable zero-inflation).
check_overdispersion(csat.ps1)
## # Overdispersion test
##
## dispersion ratio = 3.258
## Pearson's Chi-Squared = 544.115
## p-value = < 0.001
## Overdispersion detected.
To evaluate these models against each other using AIC (below), we’ll
create a model object using the SAME predictor variables as we did for
the Poisson regression above,
satellites ~ width + color + spine_cond, with these
different modeling approaches (again, for more background on these
models see this UCLA
site)
Negative Binomial
csat.nb1 <- glm.nb(satellites ~ width + color + spine_cond, data = crab.sat)
Zero-inflated Negative Binomial
csat.zinb1 <- zeroinfl(satellites ~ width + color + spine_cond, data = crab.sat, dist = "negbin")
Hurdle
csat.hd1 <- hurdle(satellites ~ width + color + spine_cond, data = crab.sat, dist="poisson", zero.dist = "binomial")
There are multiple ways of setting up each of these, but for this lab
we’re just choosing a fairly basic approach. For the hurdle,
note that there are two distributions specified: a Poisson distribution
for the non-zero counts with dist="poisson" and a binomial
distribution for the zero counts.
Now we’re just going to ask for the AIC values directly.
AIC(csat.ps1, csat.nb1, csat.zinb1, csat.hd1)
Which model type has the lowest AIC? We’re going to use that as our best supported model of the crab satellite data.
Above you used residualPlots() with our original Poisson
regression, but that won’t work for the zero-inflated negative binomial
model. We’re going to build ONE plot (of predicted values v. residuals,
as a basic diagnostic) in ggplot() ourselves now.
dgn.resid<-as_tibble(cbind(zinb1.pred<-predict(csat.zinb1, type = "response"), zinb1.res<-residuals(csat.zinb1, type = "pearson")))
names(dgn.resid)<-c("predicted", "resid")
ggplot(dgn.resid, aes(x = predicted, y = resid)) +
geom_point(color = "black", shape = 21, size = 2)+
geom_smooth(method=lm, color = "red", se=TRUE)+
labs(x = "predicted values", y = "residuals")+
theme_classic()
How does that pattern look to you compared to the plot in the bottom right of 1.4?
Now run
summary(csat.zinb1)
##
## Call:
## zeroinfl(formula = satellites ~ width + color + spine_cond, data = crab.sat,
## dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.3799 -0.7302 -0.2485 0.5859 4.0152
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60617 0.87465 0.693 0.4883
## width 0.04286 0.03091 1.386 0.1656
## color.L 0.20792 0.22960 0.906 0.3652
## color.Q 0.36572 0.17543 2.085 0.0371 *
## color.C 0.18763 0.12779 1.468 0.1420
## spine_cond -0.07939 0.08178 -0.971 0.3317
## Log(theta) 1.77269 0.37637 4.710 2.48e-06 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.5195 3.0684 4.080 4.50e-05 ***
## width -0.4791 0.1156 -4.144 3.41e-05 ***
## color.L 1.2995 0.6970 1.864 0.0623 .
## color.Q 0.5772 0.5263 1.097 0.2728
## color.C 0.1348 0.3733 0.361 0.7180
## spine_cond -0.2531 0.2733 -0.926 0.3544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 5.8867
## Number of iterations in BFGS optimization: 25
## Log-likelihood: -346.1 on 13 Df
The two tables of coefficients are:
Count model coefficients (negbin with log link) = assessing
the relationship of our predictor variables with the non-zero count
data. The significant Log(theta) just reflects the fact
that yes, the count data are overdispersed (this theta parameter is in
the model for that reason).
For our predictor variables, the significant color.Q
means that there is a significant curvilinear relationship (as
represented by a quadratic equation) between the level of
color and our count of satellite males. To understand this
better would take further plotting, which you’ll do in
1.8
Zero-inflation model coefficients (binomial with logit link)
= assessing the relationship of our predictor variables with the 0
counts (as a binary logistic function).
The negative width coefficient means that as
width values go up the count of male satellites is LESS
likely to be ‘0’. The other way of saying this would be, you are more
likely to have a male satellite, as your width increases.
The linear term for color has a p-value of 0.0623 which
is suggestive of a relationship, but that also depends on how closely
you like to stick to alpha = 0.05 (I don’t, myself, as long as you are
transparent about your findings). I would state that there’s at least
some evidence that darker colors also predict a greater probability of
having a male satellite.
Use the code below to plot “predicted counts” of male satellites across values of each predictor variable.
csat.eff<-ggpredict(csat.zinb1)
plot(csat.eff)
## $width
##
## $color
##
## $spine_cond
How do these plots help explain the output under Count model
coefficients (negbin with log link) from
summary(csat.zinb1)? Remember that the only statistically
signifcant model coefficients are Log(theta) (reflecting
the overdispersion compared to a Poisson distribution) and the
color.Q reflecting a non-linear relationship of female
color to the number of male satellites.
In the first plot (with width on the x-axis, the positive slope of the line reflects the values of the coefficient (0.0429). But the wide confidence intervals around the line reflect the non-significant effect. Similarly, in the third plot, the negative slope reflects the coefficent (-0.0794) for spine_cond but the intervals are even wider.
For the categorical value color, you can see that the predicted value of male satellites for light medium through dark medium are very similar, but go up quite a bit for the ‘dark’ category. This definitely appears to be a non-linear relationship, which is captured by the \(color^2\) term in the model (with a positive coefficent of 0.366).
NOTE1: You can find a lot of these specific
column-assigning tools (as well as directing R where to look for the
file) using the RStudio menu File -> Import Dataset -> (choose
your data type), e.g. from text, from Excel… choose
“from text” if it’s a .csv and choose the from text (readr)
option as that will create the tidyverse version (a tibble)
for you!
ALSO, what this tool does is actually set up code that runs in your console - so it CAN be a good way to see how to code what you just did!