Citizen science data offer opportunistic observations of bird abundances that must be structured into a coherent design to address responses to environmental variation.

Resources

source('clarkFunctions2021.r')

eBird data

Data for this unit come from eBird, a community-based platform launched in 2002 where anyone can create an account and log the birds they see. An observation is a ‘checklist’ that records the number of individuals seen for each species. Visit this link for more information about how scientists use eBird data in their research projects.

Hurricane redistribution of seabirds

The observation that seabirds are found far from coastlines immediately following storms, combined with the observation that hurricanes are becoming more intense (citation), suggests that populations may suffer as future storms intensify.

Lane examined the redistribution of seabird counts on eBird, comparing changes before and after storms against those of non-seabird species. The basic design for the analysis includes main effects for distance-to-shore and before versus after storms (a factor), and an interaction between them. The interaction term addresses the question of whether storm effects change with distance from shore.
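In model notation, the linear predictor for a checklist combines the two main effects and their product,

\[
\eta = \beta_0 + \beta_1\,\text{dist2shore} + \beta_2\,\text{post} + \beta_3\,\text{dist2shore} \times \text{post},
\]

where post is an indicator for checklists recorded after a storm (the names anticipate those used in the fits below).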

From the large eBird archive Lane extracted checklists relevant for the hurricane season (August-October) from South Carolina, North Carolina, and Virginia from 2015 to 2018,

load( "birdData.rdata", verbose = T )
## Loading objects:
##   stateData
data <- rbind( stateData[[1]], stateData[[2]], stateData[[3]] )   # combine the three states

First we need some EDA. We check variable modes to discover that missing counts are indicated with an 'X', which we remove:

# any surprises?
sapply(data, mode)
##      checklist_id          latitude         longitude  observation_date 
##       "character"         "numeric"         "numeric"         "numeric" 
##              year  duration_minutes   scientific_name observation_count 
##         "numeric"         "numeric"       "character"       "character" 
##                jd 
##         "numeric"
# yes, counts are a character variable, find the offending counts
table( data$observation_count )

Here we remove from the data the rows with 'X' in observation_count and convert counts to numeric,

data <- data[ !data$observation_count  == 'X', ]
data$observation_count <- as.numeric( data$observation_count )

Next, we see that data are not stored as observations-by-variables, presumably because that format would require a large matrix dominated by zeros. Here are the first few lines of the data tibble,

head(data)

All of these rows refer to the same checklist, listing the counts by species, one per row. This is a compact format for storage, but we need to change it to one row per checklist (that is, observations-by-variables). That format is obtained with tapply, which moves the data from one row per checklist-species combination to one row per checklist, with a column for each species,

ydata   <- tapply( data$observation_count, 
                   list( obs = data$checklist_id, 
                         species = data$scientific_name),
                   sum, na.rm=T)
ydata[ is.na(ydata) ] <- 0                       # unobserved species

We now need plot information to match ydata. First extract the columns we need,

m     <- match( rownames(ydata), data$checklist_id )       
xdata <- data[m, !colnames(data) == 'scientific_name']    # observations by variables

Here we use lubridate to extract numeric month, year, and day of year from the observation dates:

library(lubridate)
xdata$month <- month(xdata$observation_date)              # translate format to numeric
xdata$year  <- year(xdata$observation_date) 
xdata$jd    <- yday(xdata$observation_date)               # day of year (Julian date)

We now have data formatted as observations-by-variables, organized with plot information in xdata and species counts in ydata.
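A quick check (not part of the original workflow) confirms that the rows of the two objects align:

dim( ydata )                                       # checklists by species
dim( xdata )                                       # checklists by variables
all( rownames(ydata) == xdata$checklist_id )       # TRUE if rows match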

Here is observation effort, in hours:

hist( xdata$duration_minutes/60, nclass=50, xlab = 'Hours per observation', main='')

The duration recorded on checklists.

We could decide that there is something fundamentally unusual about observations lasting longer than a day. To increase comparability, we restrict observations to those spanning less than 4 hours:

w <- which( xdata$duration_minutes < (4*60) )
xdata <- xdata[w,]
ydata <- ydata[w,]

The seabirds belong to the following genera:

genera <- c("Anhinga", "Pelecanus", "Fregata", 
            "Anous", "Rynchops", "Rissa", 
            "Chroicocephalus", "Hydrocoloeus", "Leucophaeus",
            "Larus", "Gelochelidon", "Hydroprogne",
            "Thalasseus", "Sternula", "Onychoprion", 
            "Sterna", "Chlidonias")

The gsub() function finds a pattern in a character string and replaces it with a new pattern. Here, the pattern " .*" matches the first space and everything after it, which we replace with an empty string in the species names. For example, in Cardinalis cardinalis the code finds and removes " cardinalis", leaving only the text before the space, "Cardinalis", which is the genus.
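A one-line illustration of the pattern,

gsub( " .*", "", "Cardinalis cardinalis" )         # leaves "Cardinalis"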

specNames <- colnames(ydata)
genus     <- gsub( " .*", "", specNames )
seabirds  <- specNames[ genus %in% genera ]
yshore    <- ydata[, specNames %in% seabirds ]

Here is a species rank-incidence plot:

tmp <- ydata
tmp[ tmp > 1 ] <- 1
ra <- colMeans(tmp, na.rm=T )                # mean incidence
ra <- ra[ order(ra, decreasing = T) ]        # rank
rm( tmp )

np   <- 1:20
ylim <- c(1e-5, 1e+6)
plot( ra, type = 's', log = 'xy', ylim = ylim, lwd=2, 
      xlab = 'Rank', ylab = 'Incidence (fraction of observations)' )
text( np, 1.3*ra[np], names(ra)[np], srt = 80, pos = 4, cex=np^(-.4))

m <- match(seabirds, names(ra))
segments( m, ylim[1], m, ra[m], col='blue')

The incidence by rank for each species; only the dominants are labeled. As expected, seabirds (blue) are not well represented in the data.

Not only are seabirds poorly represented, but data are also dominated by zeros. For example, here is the wood thrush,

spec <- "Hylocichla mustelina"  # Wood thrush
species <- ydata[ , spec ]
sdata   <- cbind(xdata, species)
hist( species, nclass=100)

Histogram of species counts is dominated by zeros.

This dominance by zeros can be a problem for model fitting, an issue we return to later.
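To see how severe this is (a quick check, not in the original workflow), here is the fraction of checklists with no wood thrush:

mean( species == 0 )                               # fraction of checklists with zero counts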

Counts by year show the increase in activity even over the short time covered by this data set:

hist( xdata$year, xlab = 'Year', ylab = 'Checklists', main='')

Rising use of eBird in this region.

Here are counts by Julian date, or day of the year:

par( mfrow=c(2,1), bty='n', mar=c(4,5,1,5) )
countByDate <- tapply( species, xdata$jd, sum)
minByDate   <- tapply(xdata$duration_minutes, xdata$jd, sum)
cpeByDate   <- countByDate/minByDate
DOY         <- as.numeric( names(countByDate) )
plot( DOY, minByDate/60, type = 's', xlab = '', ylab = 'Hours')
plot( DOY, cpeByDate, type = 's', xlab = '', ylab = 'Counts/minute')

Observation time varies daily (above), while CPE tails off late in the season.

To generate maps we want a color ramp that allows us to set the scale:

getColors <- function(x, scale = .5, hue = 'reds'){
  if(hue == 'reds')
    cols  <- c('#fff7ec','#fdcc8a','#fc8d59','#e34a33','#b30000')  # color ramp low to high
  if(hue == 'blues')
    cols <- c('#ffffcc','#a1dab4','#41b6c4','#2c7fb8','#253494')
  if(hue == 'grey')
    cols <- c('#f7f7f7','#cccccc','#969696','#636363','#252525')
  ncols <- 20
  grFun <- colorRampPalette( cols )                                # function generates ramp
  cseq  <- c(0, seq(.01, max(x)^scale, length = ncols )^(1/scale)) # partition x on a power scale
  i     <- findInterval( x, cseq, all.inside = T)                  # assign x values to partitions
  g     <- grFun(ncols)                                            # ramp of length ncols
  list( cols = g[ i ], scale = cseq )                              # assign colors
}

To appreciate the geographic range at this time of year we want maps of counts. Here is a map for this species:

library( ggplot2 )
library( raster )
library( maps )
states <- map("state", plot = FALSE, fill = TRUE)
cc     <- getColors( species, .5 )$cols

ggplot() + 
  geom_point(data = sdata, aes(x = longitude, y = latitude), size = .3, 
             color = cc) +
  geom_polygon(data = states, aes(x=long, y = lat, group = group), 
               fill = NA, color = 'tan') + 
  coord_fixed(xlim = c(-84.5, -75),  ylim = c(32, 40), ratio = 1.2) 

Map of checklists with reds showing high counts, largely obscured by overlapping points.

The problem with this map is that overlapping points obscure patterns. We can see through this clutter if we aggregate counts to a grid that is sufficiently coarse that points do not overlap. We do this here for a \(0.1^\circ \times 0.1^\circ\) lattice of longitude/latitude values. By accumulating both counts and effort (minutes) we generate a map of counts per effort (CPE) that clarifies distribution and abundance,

xloc <- round( sdata$longitude, 1 )                   # aggregate to 0.1 lon/lat
yloc <- round( sdata$latitude, 1 )

xy <- paste(xloc, yloc, sep='_')                      # assign name to grid points
sdata$xy <- xy

tcount <- tapply( sdata$species, xy, sum )            # counts by grid point
tmin   <- tapply( sdata$duration_minutes, xy, sum )   # effort
cpe    <- tcount/tmin                                 # CPE
lonlat <- columnSplit(names(tmin), '_' )              # lon/lat for grid points
sgrid  <- data.frame( lon = as.numeric(lonlat[,1]), lat = as.numeric(lonlat[,2]),
                      count = tcount, minutes = tmin, cpe,
                      stringsAsFactors = F)
rownames(sgrid) <- NULL
                                       
cc <- getColors( sgrid$cpe , .2 )$cols                 # color ramp based on CPE

ggplot() + 
  geom_point(data = sgrid, aes(x = lon, y = lat), size = 1, 
             color = cc, shape = 15) +
  geom_polygon(data = states, aes(x=long, y = lat, group = group), 
               fill = NA, color = 'tan') + 
  coord_fixed(xlim = c(-84.5, -75),  ylim = c(32, 39.4), ratio = 1.2) 

CPE map based on the 0.1-degree lattice.

Of course, we could isolate effort to clarify the geographic blind spots:

cc <- getColors( sgrid$minutes , .2 )$cols                 # color ramp based on minutes
ggplot() + 
  geom_point(data = sgrid, aes(x = lon, y = lat), size = 1, 
             color = cc, shape = 15) +
  geom_polygon(data = states, aes(x=long, y = lat, group = group), 
               fill = NA, color = 'tan') + 
  coord_fixed(xlim = c(-84.5, -75),  ylim = c(32, 39.4), ratio = 1.2) 

Effort map based on the 0.1-degree lattice shows areas of concentration and missing information.

Hurricane data

The hurricanes affecting this region are listed here:

dates <- as.Date(c('2016-10-09', '2017-09-26', '2018-09-14', '2019-09-06', '2020-08-04'))

hurricanes <- data.frame(
  name = c('Matthew', 'Maria', 'Florence', 'Dorian', 'Isaias'),
  cat  = c(1,1,1,2,1),
  path = c('coastal', 'offshore', 'inland', 'coastal', 'inland'),
  date = dates
)
knitr::kable(hurricanes)
name      cat  path      date
Matthew   1    coastal   2016-10-09
Maria     1    offshore  2017-09-26
Florence  1    inland    2018-09-14
Dorian    2    coastal   2019-09-06
Isaias    1    inland    2020-08-04

Here we add columns for Julian date (day of the year) and year,

hurricanes$jd   <- yday(hurricanes$date)
hurricanes$year <- year(hurricanes$date)

And finally, we want a column with an interval of dates starting 10 days before each hurricane and ending 10 days after the hurricane,

hurricanes$int <- interval(hurricanes$date - 10, hurricanes$date + 10)

From the full data we flag checklists that fall in pre-hurricane (5 days before through 1 day before) and post-hurricane (day of the storm through 4 days after) windows,

pre    <- interval(hurricanes$date - 5, hurricanes$date - 1)
post   <- interval(hurricanes$date, hurricanes$date + 4)
timing <- hurrName <- rep('', nrow(sdata))

for(j in 1:length(pre)){                                # loop over hurricanes
  w <- which( sdata$observation_date %within% pre[j] )  # before hurricane
  timing[w] <- 'pre'
  hurrName[w] <- hurricanes$name[j]
  
  w <- which( sdata$observation_date %within% post[j] ) # after hurricane
  timing[w] <- 'post'
  hurrName[w] <- hurricanes$name[j]
}

sdata$htime <- timing
sdata$name  <- hurrName
xhurr <- sdata[ sdata$htime %in% c('pre','post'), ]
yhurr <- yshore[ sdata$htime %in% c('pre','post'), ]

To determine distance-to-shore we extract the coastline for this region,

library(sp)
library(spdep)
library(rgeos)
library(geosphere)
library(sf)                                                           # for st_as_sf() below
coast   <- map("usa", plot = FALSE)                                   # coastline
ww      <- which(coast$x > -82 & coast$x < -75 & 
                   coast$y > 32 & coast$y < 40)                       # isolate this region
coast$x <- coast$x[ww]
coast$y <- coast$y[ww]
coast$names <- 'main'
coast$range <- c( range(coast$x), range(coast$y))

coast  <- as_Spatial( st_as_sf(coast) )                               # compatibility
coords <- data.frame( long = xhurr$longitude, lat = xhurr$latitude )
spts   <- SpatialPoints(coords)
xhurr$dist2shore <- apply(gDistance(spts, coast, byid=TRUE), 2, min)  # min distance to coastline (degrees lon/lat)

Here is a map of all seabird CPE before and after hurricanes,

shoreCount <- rowSums(yhurr)
shoreCPE   <- shoreCount/xhurr$duration_minutes

cc <- getColors( shoreCPE, scale = .2, hue = 'grey' )$cols            # seabirds
w  <- which( xhurr$htime == 'pre' )                                   # before hurricanes
cc[ w ] <- getColors( shoreCPE[w], scale = .2, hue = 'blues' )$cols
w  <- which( xhurr$htime == 'post' )                                  # after 
cc[ w ] <- getColors( shoreCPE[w], scale = .2 )$cols

ggplot() + 
  geom_point(data = xhurr, aes(x = longitude, y = latitude), size = 1, 
             color = cc, shape = 15) +
  geom_polygon(data = states, aes(x=long, y = lat, group = group), 
               fill = NA, color = 'tan') + 
  coord_fixed(xlim = c(-84.5, -75),  ylim = c(32, 39.4), ratio = 1.2) 

Seabird CPE before (blue) and after (red) hurricanes.

Traditional GLM

A GLM can fit counts with effort as an offset, but it will still struggle with the dominance of zeros.

A GLM for all seabirds can be fitted to the counts using duration_minutes as effort. This Poisson regression assumes the standard log link (the default).
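The offset enters on the log scale, so coefficients describe a rate, counts per unit of observation time. For checklist \(i\) with linear predictor \(\mathbf{x}_i'\boldsymbol{\beta}\),

\[
\log \mu_i = \log( \text{duration\_minutes}_i ) + \mathbf{x}_i'\boldsymbol{\beta}, \qquad y_i \sim \text{Poisson}( \mu_i ).
\]

Here is the fit,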

htime <- factor( xhurr$htime, levels = c('pre','post') )   # 'pre' as the reference level
dist2shore <- xhurr$dist2shore
fit <- glm( shoreCount ~ htime*dist2shore, offset = log( xhurr$duration_minutes),
            family = 'poisson')
summary(fit) 
## 
## Call:
## glm(formula = shoreCount ~ htime * dist2shore, family = "poisson", 
##     offset = log(xhurr$duration_minutes))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -16.538   -3.613   -1.081   -0.194  144.544  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.613498   0.004317 -142.10   <2e-16 ***
## htimepost             0.127618   0.005951   21.44   <2e-16 ***
## dist2shore           -2.122796   0.011872 -178.81   <2e-16 ***
## htimepost:dist2shore -0.885201   0.021421  -41.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 776262  on 11370  degrees of freedom
## Residual deviance: 566390  on 11367  degrees of freedom
## AIC: 576197
## 
## Number of Fisher Scoring iterations: 10

The negative dist2shore coefficient means that seabird counts decline with distance from the coast. The positive htimepost coefficient indicates that seabirds are more abundant after hurricanes than before. The negative interaction means that the distance effect becomes stronger (more negative) from pre- to post-hurricane. In other words, hurricanes concentrate seabirds along the coast. This result is counter-intuitive in light of our EDA, which shows seabirds inland following hurricanes.
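Because of the log link, these coefficients act proportionately. One way to see what they imply (a quick sketch using the fit above; dist2shore is in degrees of longitude/latitude) is to predict counts per minute at the coast and away from it, before and after storms:

b <- coef(fit)                            # (Intercept), htimepost, dist2shore, htimepost:dist2shore
d <- c(0, 0.5)                            # at the coast and 0.5 degrees from shore
exp( b[1] + b[3]*d )                      # pre-hurricane counts per minute
exp( b[1] + b[2] + (b[3] + b[4])*d )      # post-hurricane counts per minute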

How would this compare with results for individual species?

Independent models for each species

Some species might be affected differently than others. It also seems important to ask if seabird responses differ from other species. Here is a loop to fit the model on each species that occurs in at least 50 observations,

library(gjam)                            # within time interval, at least 50 observations
yfull <- ydata[sdata$htime %in% c('pre','post'), ]
y     <- gjamTrimY(yfull,50)$y 
specNames <- colnames(y)
S         <- length(specNames)
specColor <- rep('grey', S)              # color code seabirds        
specColor[ gsub(" .*", "", specNames) %in% genera ] <- 'darkblue'

betaMu <- matrix(NA, 4, S)               # hold coefficients for each species
colnames(betaMu) <- specNames
betaSe <- betaMu

for(j in 1:S){
  yj  <- y[,j]
  fit <- glm( yj ~ htime*dist2shore, offset = log( xhurr$duration_minutes),
              family = 'poisson')
  sf  <- summary(fit)$coefficients
  betaMu[,j] <- sf[,1]
  betaSe[,j] <- sf[,2]
}
rownames(betaMu) <- rownames(betaSe) <- rownames(sf)

Here is a function to plot coefficients for all species,

plotCoeff <- function( betaMu, betaSe, specColor,
                       rname = 'dist2shore' ){ # plot coefficients
  S <- ncol(betaMu)
  s <- c(1:S)
  Q <- nrow(betaMu)
  
  par(mfrow=c(Q,1), bty = 'n', mar=c(1,4,.1,3))

  for(k in 1:Q){
    
    bm  <- betaMu[k,]
    bs  <- betaSe[k,]
    ord <- order( bm )  
    bm  <- bm[ord]
    bs  <- bs[ord]
  
    up <- max( which( (bm - 1.96*bs) < 0 ) )   # increasers
    do <- min( which( (bm + 1.96*bs) > 0 ) )   # decreasers
  
    bs[ bs > 5 ] <- 5
    ylim <- c( min( bm - 2*bs ), max(bm + bs) )
    
    plot( NA, xlim = c(0, S), ylim = ylim, xlab = '', ylab = '', xaxt='n')
    rect(0, ylim[1], do, ylim[2], col = '#edf8fb', border = '#ccece6')
    rect(up, ylim[1], S, ylim[2], col = '#fef0d9', border = '#fdd49e')
    abline(h = 0, lty=2)
    
    points( s, bm, pch = 3, col = specColor[ord], lwd=2 )
    segments( s, bm - 1.96*bs, s, bm + 1.96*bs, col = specColor[ord] )
    segments( s, bm - bs, s, bm + bs, col = specColor[ord], lwd=3 )
    title( rownames(betaMu)[k], line = -2)
  }
}
plotCoeff( betaMu[-1,], betaSe[-1,], specColor = specColor )

Seabirds (blue) are distributed across the range of responses.

We’ve highlighted decreasers (left) and increasers (right) as those having lower or higher CPE after hurricanes. Seabirds (blue) show no clear trend.

Species with negative dist2shore coefficients decline in abundance with distance from shore. Increasers with positive interactions have hurricane effects that amplify the differences with distance to shore, and vice versa.
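As a quick tally (a sketch based on the per-species GLM fits above), we can count the significant increasers and decreasers for htimepost and ask how many are seabirds:

bmp <- betaMu['htimepost', ]                       # estimates for the htimepost effect
bsp <- betaSe['htimepost', ]                       # their standard errors
inc <- which( bmp - 1.96*bsp > 0 )                 # interval entirely above zero: increasers
dec <- which( bmp + 1.96*bsp < 0 )                 # interval entirely below zero: decreasers
sea <- gsub( " .*", "", specNames ) %in% genera    # which species are seabirds
c( increasers = length(inc), decreasers = length(dec),
   seabirdIncreasers = sum(sea[inc]), seabirdDecreasers = sum(sea[dec]) )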

Generalized joint attribute modeling (GJAM)

The model for the aggregated seabird counts tells a different story from the individual-species models. GJAM could address issues that relate to these differences. There are several things to consider. First, the log link function means that coefficients are proportionate; to interpret the magnitude of a response, we would have to specify how large the population is. Second, the massive number of zeros in the data is a problem for GLMs, a problem that diminishes when species are aggregated to a sum. Third, species probably depend on one another, yet we are fitting them independently. For example, if we scatter a million birds in front of a storm, their displacements may tell us no more than would just a few.

xhurr$htime <- factor( xhurr$htime, levels = c('pre','post') )   # 'pre' as the reference level
ef   <- list( columns = 1:S, values = xhurr$duration_minutes )
rl   <- list(r = 8, N = 20)
ml   <- list(ng = 2000, burnin = 1000, typeNames = 'DA', 
             reductList = rl, effort = ef)
form <- as.formula( ~ htime*dist2shore )
out  <- gjam(form, xdata = xhurr, ydata = y, modelList = ml)

Here are some plots,

pl   <- list(GRIDPLOTS=T, specColor = specColor)
gjamPlot(output = out, plotPars = pl)

Species richness and diversity are not part of the fitted model, yet the model predicts them well:

Predicted species richness and diversity from the fitted model.

Fitted coefficients in the model accommodate the covariances between species, here showing only those having zero outside the 95% credible interval:

Coefficients from the gjam model.

By combining fitted coefficients with the covariance in predictors, gjam estimates communities of species based on their similarities in response:

Communities defined by responses to predictors in the model.

For seabird species with positive htime (increased CPE after hurricanes) and positive htime:dist2shore, the increases are largest at distance from shore. Here is a figure in the same format we used for the GLM:

bm <- out$parameters$betaMu[-1,]
bs <- out$parameters$betaSe[-1,]
S <- ncol(bm)
specColor <- rep('grey', S)              # color code seabirds        
specColor[ gsub(" .*", "", specNames) %in% genera ] <- 'darkblue'
plotCoeff( betaMu = bm, betaSe = bs, specColor, rname = 'dist2shore' )

Coefficients from gjam ordered by distance to shore.

A positive interaction amplifies the hurricane effect at distance from the coast.

Exercise. Do increasers and decreasers tend to have positive or negative interactions (htimepost:dist2shore)? Why or why not? Check this for seabirds and for non-seabird species. Where and why might GLM and GJAM results disagree?

You might try this:

  1. The objects bm and bs include one column for each species. They hold the posterior mean and standard error, respectively, for each predictor-species combination. Find the increasers and decreasers, based on the variable htimepost, using the approximate rule that zero is above bm + 1.96*bs (a decreaser) or below bm - 1.96*bs (an increaser).

  2. Determine which of these are seabirds using the vector of genera.

  3. You might summarize coefficients with biplots that compare the main effects (htimepost, dist2shore) with the interaction term.

Recap

Citizen-science data provide information on responses that would otherwise not be observed. They suffer from biases in coverage and uneven methods, which can sometimes be accommodated by the design used in the model. For eBird, observation time provides an index of effort that improves the comparability of counts.

A joint model that considers the covariance in responses between species affects the uncertainty in parameters, which becomes the basis for deciding that an effect is ‘different from zero’. GLMs can model counts with effort, but they depend on a non-linear link function that estimates parameters on a proportionate scale. GJAM accommodates the zero dominance while estimating parameters on the observation scale, so coefficients can be interpreted directly as CPE per unit of predictor variable.