Lab 4: Modeling discrete count data

If you haven’t already, please read the Introduction to this set of lab activities, which includes instructions on where to find the .csv files referenced in each lab.

Exercise 1

A good reference for regression modeling with count-based data is this overview from UCLA.

You’re going to bring in data from a .csv file again, but this time we’re going to use col_types = to tell read_csv specifically how to read in some of the columns.

These data are from H. Jane Brockmann’s long-term research on horseshoe crabs in Florida (Brockmann 1996). The response variable is the number of male satellites (crabs that are not attached to a female during mating, but crowd around to try to fertilize at least some of her eggs). Import them now.

crab.sat <- read_csv("crab_sat.csv", col_types = cols(color = col_factor(levels = c("ltmed", "medium", "dkmed", "dark"), ordered = TRUE), spine_cond = col_integer(), satellites = col_integer()))

Table 1. Variables in the horseshoe crab dataset (Brockmann 1996), imported as `crab.sat`
Variable	Description	Notes
`satellites`	number of male satellites counted near a female crab	integer from 0-15
`color`	color of female (light medium to dark)	ordinal variable
`color.n`	numerical version of color	coded 1-4, not used in this lab
`spine_cond`	condition of the female’s spines, coded numerically	1 = both broken, 2 = one broken, 3 = both good
`width`	female carapace width	in centimeters
`weight`	female weight	in kilograms

…and since we used read_csv you can run spec(crab.sat) too, to see how the variables are classified.

This kind of coding is helpful when you have a .csv and know what’s in it (probably from opening it in Excel) AND how you want R to read in each variable. You can always assign these classes after you’ve imported the data into R (see for example, as.factor()).

In this case we know that the text in the column “color” can be considered an ordered factor (variable), i.e. there is a logical order, from ltmed through dark. The column/variable “spine_cond” is an integer that corresponds to the three categories (Table 1). In the case of “satellites”, the distinction of making it an integer is not critical but it can’t hurt since it’s a count.

See my NOTE¹ about the menu-based Import Dataset tool in RStudio (which you may have used in the past) which is an alternative to the coding version…

1.1

Let’s start by plotting the count data - you can add your own tweaks to the histogram if you want to use more functionality of ggplot (see Histogram R Graph Gallery)

 crab.sat %>%
  ggplot(aes(x=satellites)) +
  geom_histogram(binwidth = 1, fill="purple4", alpha=0.6, color="#e9ecef") +
  labs(title = "Size of Crab Satellite Groups", x = "N satellites") +
  theme_bw()

Given this plot, you might want to know exactly how many counts of 0 satellite males there are. There are a number of ways of doing this in R (of course). One way, using base R coding is:

count(crab.sat[crab.sat$satellites == 0,6]) which counts the number of values of 0 in column 6 (the satellites variable). Try it…

…however, for practice let’s also set it up using tidyverse syntax. This version gives the count not just of the 0 values, but all distinct values in satellites, and which should correspond to the histogram above.

crab.sat %>% count(satellites)

## # A tibble: 15 × 2
##    satellites     n
##         <int> <int>
##  1          0    62
##  2          1    16
##  3          2     9
##  4          3    19
##  5          4    19
##  6          5    15
##  7          6    13
##  8          7     4
##  9          8     6
## 10          9     3
## 11         10     3
## 12         11     1
## 13         12     1
## 14         14     1
## 15         15     1

NOTE, if you have tried using select() after running library(MASS) you will get an error message that starts Error in select(.,…

This How to Fix page from ‘Listen Data’ does a good job of explaining the error - there are select() functions in both dplyr (part of the tidyverse) and MASS packages, but there is a simple workaround. See the message under the library(MASS) above as well.

How many females had 0 satellite males in this data set? What percentage of the total N is that? If you think that we might come back to this later, you’re right!

1.2

One issue we haven’t covered in labs, but which might sound familiar if you’ve done multiple linear regression before, is multicollinearity. This happens when independent variables in a regression model are highly correlated to each other (easiest to conceptualize when both are continuous variables).

There is a pretty reasonable summary on this site if you want to read more: How to Test for Multicollinearity in R.

But for this lab, since crab.sat has two continuous variables, width and weight, AND because these body-size related measurements might be assumed to be correlated for female crabs, use cor() to quickly get a Pearson’s correlation coefficient, r:

cor(crab.sat$width, crab.sat$weight)

What is the r value? This (0.89) is high enough to think there COULD be an issue if we include both variables so we’ll proceed with one in the model in 1.3. (There is more you CAN do wrt multicollinearity - look up Variance Inflation Factors or VIFs if you want to learn more.)

1.3

Let’s create our first Poisson model. We’ll use AIC scores later, but our approach here is not completely model selection-based as you’ll see. We’re NOT evaluating a full set of candidate models including simple, one-predictor models. If you’re not familiar with the use of AIC, this page is a brief introduction: How to Calculate AIC in R.

The glm() formula, data, and family arguments should look familiar from Lab 3, we’re just specifying that we want to use the poisson family with its default link = "log".

csat.ps1 <- glm(satellites ~ width + color + spine_cond, data = crab.sat, family = poisson(link = "log"))
summary(csat.ps1)

## 
## Call:
## glm(formula = satellites ~ width + color + spine_cond, family = poisson(link = "log"), 
##     data = crab.sat)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.99086    0.60217  -4.967 6.81e-07 ***
## width        0.15033    0.02112   7.118 1.09e-12 ***
## color.L     -0.37365    0.15720  -2.377   0.0175 *  
## color.Q      0.10401    0.12232   0.850   0.3951    
## color.C      0.05574    0.09124   0.611   0.5412    
## spine_cond   0.01873    0.05823   0.322   0.7478    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.24  on 167  degrees of freedom
## AIC: 926.54
## 
## Number of Fisher Scoring iterations: 6

This should look somewhat familiar, if you’ve used glm() before. For now, just a note on why R has included color.L, color.Q, and color.C for our ordinal variable color. This is a default way that glm() handles an ordered factor: it evaluates the variable itself (color.L, with L for linear), plus the term color² (color.Q, with Q for quadratic), plus the term color³ (color.C, with C for cubic). For this model and lab, and since the quadratic and cubic terms are non-significant, we’ll just focus on the term color.L.

Remember using the null and residual deviance to evaluate the logistic model in Lab 3?

Here we’ll use the function nagelkerke which calculates this, plus some pseudo-R² values for us:

nagelkerke(csat.ps1)

## $Models
##                                                                                         
## Model: "glm, satellites ~ width + color + spine_cond, poisson(link = \"log\"), crab.sat"
## Null:  "glm, satellites ~ 1, poisson(link = \"log\"), crab.sat"                         
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                            0.0744374
## Cox and Snell (ML)                  0.3463280
## Nagelkerke (Cragg and Uhler)        0.3474770
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff  Chisq    p.value
##       -5     -36.775 73.551 1.8661e-14
## 
## $Number.of.observations
##           
## Model: 173
## Null:  173
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"

The output labeled ‘$Likelihood.ratio.test’ is the same as you should get using 1-pchisq((Null deviance - Residual deviance), (Null df - Residual df))

Is there evidence this is a better fit for the data than a “NULL” model? YES, the p-value of the likelihood ratio test is very small, so you reject the null hypothesis of “no difference” (in model likelihood, or “fit” between this model and a null model)

1.4

As with the logistic model, we can and should examine patterns in the residuals. Run

residualPlots(csat.ps1)

##            Test stat Pr(>|Test stat|)   
## width         8.8807         0.002882 **
## color                                   
## spine_cond    0.6114         0.434254   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table output suggests a significant, non-linear association of the residuals with the predictor variable width. In the plot, you can see the bend in the fitted line - ALSO reflected in the predicted values plot (lower right).

With the predictor variable color it’s also notable that those ‘outlier’ residuals (high values above the whisker of the box plot) occur more in the dark medium and dark color group.

For this Lab, we’ll take the approach that we could and probably should evaluate other count data models. Going back to the histogram and calculations in Exercise 1.1, it seems we might want to do a check for ‘zero inflation’ and ‘overdispersion’. These functions come from the package performance:

Run these - the results should be self-explanatory.

check_zeroinflation(csat.ps1)

## # Check for zero-inflation
## 
##    Observed zeros: 62
##   Predicted zeros: 15
##             Ratio: 0.24

## Model is underfitting zeros (probable zero-inflation).

check_overdispersion(csat.ps1)

## # Overdispersion test
## 
##        dispersion ratio =   3.258
##   Pearson's Chi-Squared = 544.115
##                 p-value = < 0.001

## Overdispersion detected.

1.5

To evaluate these models against each other using AIC (below), we’ll create a model object using the SAME predictor variables as we did for the Poisson regression above, satellites ~ width + color + spine_cond, with these different modeling approaches (again, for more background on these models see this UCLA site)

Negative Binomial

csat.nb1    <- glm.nb(satellites ~ width + color + spine_cond, data = crab.sat)

Zero-inflated Negative Binomial

csat.zinb1  <- zeroinfl(satellites ~ width + color + spine_cond, data = crab.sat, dist = "negbin")

Hurdle

csat.hd1  <- hurdle(satellites ~ width + color + spine_cond, data = crab.sat, dist="poisson", zero.dist = "binomial")

There are multiple ways of setting up each of these, but for this lab we’re just choosing a fairly basic approach. For the hurdle, note that there are two distributions specified: a Poisson distribution for the non-zero counts with dist="poisson" and a binomial distribution for the zero counts.

Now we’re just going to ask for the AIC values directly.

AIC(csat.ps1, csat.nb1, csat.zinb1, csat.hd1)

Which model type has the lowest AIC? We’re going to use that as our best supported model of the crab satellite data.

1.6

Above you used residualPlots() with our original Poisson regression, but that won’t work for the zero-inflated negative binomial model. We’re going to build ONE plot (of predicted values v. residuals, as a basic diagnostic) in ggplot() ourselves now.

dgn.resid<-as_tibble(cbind(zinb1.pred<-predict(csat.zinb1, type = "response"), zinb1.res<-residuals(csat.zinb1, type = "pearson")))

names(dgn.resid)<-c("predicted", "resid")

ggplot(dgn.resid, aes(x = predicted, y = resid)) +
  geom_point(color = "black", shape = 21, size = 2)+
  geom_smooth(method=lm, color = "red", se=TRUE)+
  labs(x = "predicted values", y = "residuals")+
  theme_classic()

How does that pattern look to you compared to the plot in the bottom right of 1.4?

1.7

Now run

summary(csat.zinb1)

## 
## Call:
## zeroinfl(formula = satellites ~ width + color + spine_cond, data = crab.sat, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3799 -0.7302 -0.2485  0.5859  4.0152 
## 
## Count model coefficients (negbin with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.60617    0.87465   0.693   0.4883    
## width        0.04286    0.03091   1.386   0.1656    
## color.L      0.20792    0.22960   0.906   0.3652    
## color.Q      0.36572    0.17543   2.085   0.0371 *  
## color.C      0.18763    0.12779   1.468   0.1420    
## spine_cond  -0.07939    0.08178  -0.971   0.3317    
## Log(theta)   1.77269    0.37637   4.710 2.48e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  12.5195     3.0684   4.080 4.50e-05 ***
## width        -0.4791     0.1156  -4.144 3.41e-05 ***
## color.L       1.2995     0.6970   1.864   0.0623 .  
## color.Q       0.5772     0.5263   1.097   0.2728    
## color.C       0.1348     0.3733   0.361   0.7180    
## spine_cond   -0.2531     0.2733  -0.926   0.3544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 5.8867 
## Number of iterations in BFGS optimization: 25 
## Log-likelihood: -346.1 on 13 Df

The two tables of coefficients are: Count model coefficients (negbin with log link) = assessing the relationship of our predictor variables with the non-zero count data. The significant Log(theta) just reflects the fact that yes, the count data are overdispersed (this theta parameter is in the model for that reason).

For our predictor variables, the significant color.Q means that there is a significant curvilinear relationship (as represented by a quadratic equation) between the level of color and our count of satellite males. To understand this better would take further plotting, which you’ll do in 1.8

Zero-inflation model coefficients (binomial with logit link) = assessing the relationship of our predictor variables with the 0 counts (as a binary logistic function).

The negative width coefficient means that as width values go up the count of male satellites is LESS likely to be ‘0’. The other way of saying this would be, you are more likely to have a male satellite, as your width increases.

The linear term for color has a p-value of 0.0623 which is suggestive of a relationship, but that also depends on how closely you like to stick to alpha = 0.05 (I don’t, myself, as long as you are transparent about your findings). I would state that there’s at least some evidence that darker colors also predict a greater probability of having a male satellite.

1.8

Use the code below to plot “predicted counts” of male satellites across values of each predictor variable.

csat.eff<-ggpredict(csat.zinb1)
plot(csat.eff)

## $width

## 
## $color

## 
## $spine_cond

How do these plots help explain the output under Count model coefficients (negbin with log link) from summary(csat.zinb1)? Remember that the only statistically signifcant model coefficients are Log(theta) (reflecting the overdispersion compared to a Poisson distribution) and the color.Q reflecting a non-linear relationship of female color to the number of male satellites.

In the first plot (with width on the x-axis, the positive slope of the line reflects the values of the coefficient (0.0429). But the wide confidence intervals around the line reflect the non-significant effect. Similarly, in the third plot, the negative slope reflects the coefficent (-0.0794) for spine_cond but the intervals are even wider.

For the categorical value color, you can see that the predicted value of male satellites for light medium through dark medium are very similar, but go up quite a bit for the ‘dark’ category. This definitely appears to be a non-linear relationship, which is captured by the $color^2$ term in the model (with a positive coefficent of 0.366).

Citations

NOTE¹: You can find a lot of these specific column-assigning tools (as well as directing R where to look for the file) using the RStudio menu File -> Import Dataset -> (choose your data type), e.g. from text, from Excel… choose “from text” if it’s a .csv and choose the from text (readr) option as that will create the tidyverse version (a tibble) for you!

ALSO, what this tool does is actually set up code that runs in your console - so it CAN be a good way to see how to code what you just did!

Brockmann, H. J. 1996. Satellite male groups in horseshoe crabs, limulus polyphemus. Ethology 102:1–21.

Lab 4: Modeling discrete count data

John Withey, The Evergreen State College

February 2025

GLMs continued: Poisson Regression and other models for count data

Exercise 1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

Citations