Citizen science data offer opportunistic observations on bird abundances, that must be structured into a coherent design to address responses to environmental variation. By contrast with BBS, which is structured to obtain counts be experts only during the breeding season, ebird data accumulate year-round by observers with a range of expertise in bird identification.
Resources
Lane Scher video on eBird data collection
Cyclone avoidance behaviour by foraging shorebirds shorebirds adjust behavior before/during storms.
The Wandering Eta arrives Birders anticipate storm effects, including displaced shorebirds.
Generalized joint attribute modeling (GJAM) vignette and publication
Here are some functions and packages:
source('clarkFunctions2022.r')
install.packages( 'ggplot2' )
install.packages( 'raster' )
install.packages( 'maps' )
install.packages( 'lubridate' )
install.packages( 'gjam' )
install.packages( 'sp' )
install.packages( 'spdep' )
install.packages( 'rgeos' )
install.packages( 'geosphere' )
install.packages( 'repmis' )
library( ggplot2 )
library( raster )
library( maps )
library( lubridate )
library( gjam )
library( sp )
library( spdep )
library( rgeos )
library( geosphere )Logistics
Today
- this unit
For next time
- questions in this unit to Sakai, will discuss in class
eBird data
Data for this unit come from eBird, a community-based platform launched in 2002 where anyone can create an account and log the birds they see. An observation is a ‘checklist’ that records numbers of each species Visit this link for more information about how scientists use eBird data in their research projects.
Here is a paper showing some of the differences between eBird and the breeding bird survey that we looked at last time.
Hurricane redistribution of shorebirds
The observation that shorebirds are found far from coastlines immediately following storms, combined with the observation that hurricanes are becoming more intense (citation), suggests that populations may suffer as future storms intensify. Here we compare distributions of shorebirds days before and immediately after hurricanes to examine effects on their distributions.
Lane examined the redistribution of shorebird counts on ebird, comparing changes before and after storms against those of non-shorebird species. The basic design for the analysis includes main effects for distance-to-shore, before versus after storms (a factor), and an interaction between them. The interaction term addresses the question of whether or not storm effects change with distance from shore.
From the large eBird archive Lane extracted checklists relevant for the hurricane season (August-October) from South Carolina, North Carolina, and Virginia from 2015 to 2018,
load( 'ebirdHurricaneExample.rdata', verbose = T )## Loading objects:
## xdata
## ydata
The observations-by-variables are organized with plot information in
xdata and species counts in ydata.
Here is observation effort, in hours:
hist( xdata$duration_minutes/60, nclass=50, xlab = 'Hours per observation', main='')The duration recorded on checklists.
The shorebirds belong to the following genera:
genera <- sort( c("Anhinga", "Pelecanus", "Fregata",
"Anous", "Rynchops", "Rissa",
"Chroicocephalus", "Hydrocoloeus", "Leucophaeus",
"Larus", "Gelochelidon", "Hydroprogne",
"Thalasseus", "Sternula", "Onychoprion",
"Sterna", "Chlidonias") )In the next block, the gsub() function finds a pattern
in a character string and replaces it with a new pattern. Here, we find
all character strings that start with a space, and replace them with a
blank string, in the column called scientific_name. For
example, in Cardinalis cardinalis, the code will find and
remove everything after the space, which is ” cardinalis”. So we’re left
with just the text before the space, or “Cardinalis”, which is the
genus.
specNames <- colnames(ydata)
genus <- gsub( " .*", "", specNames )
shorebirds <- specNames[ genus %in% genera ] # shorebirds are in genus
yshore <- ydata[, specNames %in% shorebirds ] # includes only shorebirdsSome ecological communities can be dominated by one or a few species. Others might have a broad range of species. Species rank abundance or rank incidence is used as an index of diversity.
Here is a species rank-incidence:
tmp <- ydata
tmp[ tmp > 1 ] <- 1 # 0 = absence, 1 = presence
ra <- colMeans(tmp, na.rm=T ) # mean incidence
ra <- ra[ order(ra, decreasing = T) ] # rank
rm( tmp )
np <- 1:20 # show only 20 most abundant
ylim <- c(1e-5, 1e+6)
plot( ra, type = 's', log = 'xy', ylim = ylim, lwd=2,
xlab = 'Rank', ylab = 'Incidence (fraction of observations)' )
text( np, 1.3*ra[np], names(ra)[np], srt = 80, pos = 4, cex=np^(-.4))
m <- match(shorebirds, names(ra)) # identify the shorebirds with line
segments( m, ylim[1], m, ra[m], col='blue')The incidence by rank for each species, only the dominants are labeled. As expected, shorebirds (blue) are not well-represented in the data.
Not only are shorebirds poorly represented, but data are also dominated by zeros. For example, here is the wood thrush,
spec <- "Hylocichla mustelina" # Wood thrush
species <- ydata[ , spec ]
sdata <- cbind(xdata, species)
hist( species, nclass=100)Histogram of wood thrush species counts, a common species, is still dominated by zeros.
This dominance by zeros can be a problem for model fitting, an issue we return to later.
Counts by year shows the increase in activity just over the short time covered by this data set:
hist( xdata$year, xlab = 'Year', ylab = 'Checklists', main='')Rising use of ebird in this region during the time of these hurricanes.
Here are counts by day of the year (DOY), sometimes termed the
Julian date. Recall the function tapply:
par( mfrow=c(2,1), bty='n', mar=c(4,5,1,5) )
countByDate <- tapply( species, xdata$jd, sum)
minByDate <- tapply(xdata$duration_minutes, xdata$jd, sum)
cpeByDate <- countByDate/minByDate
DOY <- as.numeric( names(countByDate) )
plot( DOY, minByDate/60, type = 's', xlab = '', ylab = 'Hours')
plot( DOY, cpeByDate, type = 's', xlab = 'DOY', ylab = 'Counts/minute')Observation time varies daily (above), while CPE tails off late in the season.
To generate maps we want a color ramp that allows us to set the scale. Here is a function that creates a color ramp that can be used to map patterns in abundance:
getColors <- function(x, scale = .5, hue = 'reds'){
if(hue == 'reds')
cols <- c('#fff7ec','#fdcc8a','#fc8d59','#e34a33','#b30000') # color ramp low to high
if(hue == 'blues')
cols <- c('#ffffcc','#a1dab4','#41b6c4','#2c7fb8','#253494')
if(hue == 'grey')
cols <- c('#f7f7f7','#cccccc','#969696','#636363','#252525')
ncols <- 20
grFun <- colorRampPalette( cols ) # function generates ramp
cseq <- c(0, seq(.01, max(x)^scale, length = ncols )^(1/scale)) # partition ba scale
i <- findInterval( x, cseq, all.inside = T) # assign ba to partition
g <- grFun(ncols) # ramp of length ncols
list( cols = g[ i ], scale = cseq ) # assign colors
}To appreciate the geographic range at this time of year we want maps of counts. Here is a map for this species:
states <- map("state", plot = FALSE, fill = TRUE)
cc <- getColors( species, .5 )$cols
ggplot() +
geom_point(data = sdata, aes(x = longitude, y = latitude), size = .3,
color = cc) +
geom_polygon(data = states, aes(x=long, y = lat, group = group),
fill = NA, color = 'tan') +
coord_fixed(xlim = c(-84.5, -75), ylim = c(32, 40), ratio = 1.2) Map of wood thrush checklists with reds showing high counts, largely obscurred by overlapping points.
This map shows abundance on a color ramp of yellow (low) to red (high). The problem with this map is that overlapping points obscure patterns. We can see through this clutter if we aggregate counts to a grid that is sufficiently coarse that points do not overlap. We do this here for a \(0.1^\circ \times 0.1^\circ\) lattice of longitude/latitude values. By accumulating both counts and effort (minutes) we generate a map of counts per effort (CPE) that clarifies distribution and abundance,
xloc <- round( sdata$longitude, 1 ) # aggregate to 0.1 lon/lat
yloc <- round( sdata$latitude, 1 )Now I generate location names, using their location:
xy <- paste(xloc, yloc, sep='_') # assign name to grid points
sdata$xy <- xyNow I can aggregate by location and draw the map:
tcount <- tapply( sdata$species, xy, sum ) # counts by grid point
tmin <- tapply( sdata$duration_minutes, xy, sum ) # effort by grid point
cpe <- tcount/tmin # CPE by grid point
lonlat <- columnSplit(names(tmin), '_' ) # lon/lat for grid points
sgrid <- data.frame( lon = as.numeric(lonlat[,1]), lat = as.numeric(lonlat[,2]),
count = tcount, minutes = tmin, cpe, stringsAsFactors = F)
rownames(sgrid) <- NULL
cc <- getColors( sgrid$cpe , .2 )$cols # color ramp based on CPE
ggplot() +
geom_point(data = sgrid, aes(x = lon, y = lat), size = 1,
color = cc, shape = 15) +
geom_polygon(data = states, aes(x=long, y = lat, group = group),
fill = NA, color = 'tan') +
coord_fixed(xlim = c(-84.5, -75), ylim = c(32, 39.4), ratio = 1.2) CPE map for wood thrush based on the 0.1-degree lattice allows me to see variation in CPE.
Now I can see the spatial patterns in abundance.
Of course, we could map the effort itself to clarify the geographic blind spots:
cc <- getColors( sgrid$minutes , .2 )$cols # color ramp based on minutes
ggplot() +
geom_point(data = sgrid, aes(x = lon, y = lat), size = 1,
color = cc, shape = 15) +
geom_polygon(data = states, aes(x=long, y = lat, group = group),
fill = NA, color = 'tan') +
coord_fixed(xlim = c(-84.5, -75), ylim = c(32, 39.4), ratio = 1.2) CPE map based on the 0.1-degree lattice shows areas of concentration and missing information.
We have most information where there are many birders around metropolitan centers (e.g., D.C., Richmond, the Research Triangle, Charlotte) and then again in the western mountains and along the coast. Many rural areas lack observations.
Hurricane data
The hurricanes affecting this region are entered here:
dates <- as.Date(c('2016-10-09', '2017-09-26', '2018-09-14', '2019-09-06', '2020-08-04'))
hurricanes <- data.frame(
name = c('Matthew', 'Maria', 'Florence', 'Dorian', 'Isaias'),
cat = c(1,1,1,2,1),
path = c('coastal', 'offshore', 'inland', 'coastal', 'inland'),
date = dates
)
knitr::kable(hurricanes)| name | cat | path | date |
|---|---|---|---|
| Matthew | 1 | coastal | 2016-10-09 |
| Maria | 1 | offshore | 2017-09-26 |
| Florence | 1 | inland | 2018-09-14 |
| Dorian | 2 | coastal | 2019-09-06 |
| Isaias | 1 | inland | 2020-08-04 |
Here we add columns for Julian date (day of the year) and year using
functions from the package lubridate,
hurricanes$jd <- yday(hurricanes$date)
hurricanes$year <- year(hurricanes$date)And finally, we want a column with an interval of dates starting 10 days before each hurricane and ending 10 days after the hurricane,
hurricanes$int <- interval(hurricanes$date - 10, hurricanes$date + 4)From the full data we extract checklists that occur within this interval, I want to find observations from several days before and after the hurricane.
pre <- interval(hurricanes$date - 10, hurricanes$date - 1)
post <- interval(hurricanes$date, hurricanes$date + 4)
timing <- hurrName <- rep('', nrow(sdata))
for(j in 1:length(pre)){ # loop over hurricanes
w <- which( sdata$observation_date %within% pre[j] ) # before hurricane
timing[w] <- 'pre'
hurrName[w] <- hurricanes$name[j]
w <- which( sdata$observation_date %within% post[j] ) # after hurricane
timing[w] <- 'post'
hurrName[w] <- hurricanes$name[j]
}
sdata$htime <- timing
sdata$name <- hurrName
xhurr <- sdata[ sdata$htime %in% c('pre','post'), ]
yhurr <- yshore[ sdata$htime %in% c('pre','post'), ]Find the column htime in xdata to see that
this worked.
To determine distance-to-shore I extract the coastline for this
region and calculate dist2shore:
coast <- map("usa", plot = FALSE) # coastline for USA
ww <- which(coast$x > -82 & coast$x < -75 &
coast$y > 32 & coast$y < 40) # isolate this region
coast$x <- coast$x[ww]
coast$y <- coast$y[ww]
coast$names <- 'main'
coast$range <- c( range(coast$x), range(coast$y))
coast <- as_Spatial( st_as_sf(coast) ) # compatibility
coords <- data.frame( long = xhurr$longitude, lat = xhurr$latitude )
spts <- SpatialPoints(coords)
xhurr$dist2shore <- apply(gDistance(spts, coast, byid=TRUE), 2, min) # dist to coastlineCheck the dist2shore column in xdata.
Here is a map of all shorebird CPE before and after hurricanes,
shoreCount <- rowSums(yhurr)
shoreCPE <- shoreCount/xhurr$duration_minutes
cc <- getColors( shoreCPE, scale = .2, hue = 'grey' )$cols # shorebirds
w <- which( xhurr$htime == 'pre' ) # before hurricanes
cc[ w ] <- getColors( shoreCPE[w], scale = .2, hue = 'blues' )$cols
w <- which( xhurr$htime == 'post' ) # after
cc[ w ] <- getColors( shoreCPE[w], scale = .2 )$cols
ggplot() +
geom_point(data = xhurr, aes(x = longitude, y = latitude), size = 1,
color = cc, shape = 15) +
geom_polygon(data = states, aes(x=long, y = lat, group = group),
fill = NA, color = 'tan') +
coord_fixed(xlim = c(-84.5, -75), ylim = c(32, 39.4), ratio = 1.2) Shorebird CPE before (blue) and after (red) hurricanes.
There are some shorebirds inland before hurricanes, but more after. These could be different species, so we need a model that includes all species.
Traditional GLM
A GLM can fit counts using effort as an offset, meaning that the counts are treated as counts per effort. But the GLM is univariate: each species is modeled separately as though it were independent of others.
A GLM for all shorebirds can be obtained on the counts using
duration_minutes as effort. This Poisson regression assumes
the standard log link (the default),
htime <- factor(xhurr$htime)
levels(htime) <- c('pre','post')
dist2shore <- xhurr$dist2shore
fit <- glm( shoreCount ~ htime*dist2shore, offset = log( xhurr$duration_minutes),
family = 'poisson')
summary(fit) ##
## Call:
## glm(formula = shoreCount ~ htime * dist2shore, family = "poisson",
## offset = log(xhurr$duration_minutes))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16.602 -3.517 -0.987 -0.143 146.114
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.613498 0.004317 -142.10 <2e-16 ***
## htimepost 0.123291 0.005180 23.80 <2e-16 ***
## dist2shore -2.122796 0.011872 -178.81 <2e-16 ***
## htimepost:dist2shore -1.140877 0.017942 -63.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1202446 on 17251 degrees of freedom
## Residual deviance: 864150 on 17248 degrees of freedom
## AIC: 878907
##
## Number of Fisher Scoring iterations: 9
The negative dist2shore coefficient means that
shorebirds decline with distance from the coast. The positive
htimepost coefficient indicates that shorebird counts per
effort increase after hurricanes. The negative interaction means that
the distance effect increases (becomes more negative) from pre- to
post-hurricane. In other words, hurricanes concentrate shorebirds along
the coast. This result is counter-intuitive in light of our EDA that
shows shorebirds inland following hurricanes.
How would this compare with results for individual species?
Independent models for each species
Some species might be affected differently than others. It also seems important to ask if shorebird responses differ from other species. Here is a loop to fit the model on each species that occurs in at least 50 observations,
yfull <- ydata[sdata$htime %in% c('pre','post'), ] # within time interval
w <- gjamTrimY(yfull,50) # at least 50 observations
y <- w$y
s <- w$colIndex # which species included
S <- colnames(y)
specNames <- colnames(y)
S <- length(specNames)
specColor <- rep('grey', S) # color code shorebirds
specColor[ gsub(" .*", "", specNames) %in% genera ] <- 'darkblue'
betaMu <- matrix(NA, 4, S) # hold coefficients for each species
colnames(betaMu) <- specNames
betaSe <- betaMu
for(j in 1:S){
yj <- y[,j]
fit <- glm( y[,j] ~ htime*dist2shore, offset = log( xhurr$duration_minutes),
family = 'poisson')
sf <- summary(fit)$coefficients
betaMu[,j] <- sf[,1]
betaSe[,j] <- sf[,2]
}
rownames(betaMu) <- rownames(betaSe) <- rownames(sf)Check the object betaMu.
I could define species having effects that increase or decrease
following hurricances as increasers or decreasers. The
definition I use here are coefficients having a 95% confidence interval
that is above or below zero. The 95% confidence interval is
approximately 1.96 standard errors below and above the mean. Here is
this criterion applied to the predictor dist2shore:
bm <- betaMu # parameter estimate for predictor (rows) X species (columns)
bs <- betaSe # parameter standard error
up <- which( (bm['dist2shore',] + 1.96*bs['dist2shore',]) < 0 ) # increasers
do <- which( (bm['dist2shore',] - 1.96*bs['dist2shore',]) > 0 ) # decreasersHere are the decreasers in do: Aix sponsa, Anas
platyrhynchos, Archilochus colubris, Baeolophus bicolor, Bombycilla
cedrorum, Buteo platypterus, Calidris himantopus, Calidris melanotos,
Cardellina canadensis, Catharus guttatus, Catharus minimus, Catharus
ustulatus, Chaetura pelagica, Charadrius vociferus, Chordeiles minor,
Contopus virens, Corvus brachyrhynchos, Corvus corax, Cyanocitta
cristata, Cygnus olor, Dryobates pubescens, Dryobates villosus,
Dryocopus pileatus, Empidonax minimus, Eremophila alpestris, Fulica
americana, Haemorhous mexicanus, Hirundo rustica, Hylocichla mustelina,
Junco hyemalis, Megascops asio, Meleagris gallopavo, Melospiza
lincolnii, Melospiza melodia, Oreothlypis peregrina, Passerculus
sandwichensis, Passerina cyanea, Phalaropus lobatus, Pheucticus
ludovicianus, Pipilo erythrophthalmus, Piranga olivacea, Poecile
carolinensis, Regulus calendula, Riparia riparia, Sayornis phoebe,
Seiurus aurocapilla, Setophaga caerulescens, Setophaga castanea,
Setophaga citrina, Setophaga dominica, Setophaga fusca, Setophaga
magnolia, Setophaga pensylvanica, Setophaga tigrina, Setophaga virens,
Sialia sialis, Sitta canadensis, Sitta carolinensis, Spinus tristis,
Sterna hirundo, Sturnella magna, Tachycineta bicolor, Thryothorus
ludovicianus, Tringa solitaria, Troglodytes hiemalis, Turdus
migratorius, Vireo flavifrons, Vireo philadelphicus, Vireo solitarius,
Zenaida macroura.
Here is a function to plot coefficients for all species,
plotCoeff <- function( betaMu, betaSe, specColor){
# plot coefficients
S <- ncol(betaMu)
s <- c(1:S)
Q <- nrow(betaMu)
par(mfrow=c(Q,1), bty = 'n', mar=c(1,4,.1,3))
for(k in 1:Q){
bm <- betaMu[k,]
bs <- betaSe[k,]
ord <- order( bm )
bm <- bm[ord]
bs <- bs[ord]
up <- max( which( (bm - 1.96*bs) < 0 ) ) # largest increasers
do <- min( which( (bm + 1.96*bs) > 0 ) ) # decreasers
bs[ bs > 5 ] <- 5
ylim <- c( min( bm - 2*bs ), max(bm + bs) )
plot( NA, xlim = c(0, S), ylim = ylim, xlab = '', ylab = '', xaxt='n')
rect(0, ylim[1], do, ylim[2], col = '#edf8fb', border = '#ccece6')
rect(up, ylim[1], S, ylim[2], col = '#fef0d9', border = '#fdd49e')
abline(h = 0, lty=2)
points( s, bm, pch = 3, col = specColor[ord], lwd=2 )
segments( s, bm - 1.96*bs, s, bm + 1.96*bs, col = specColor[ord] )
segments( s, bm - bs, s, bm + bs, col = specColor[ord], lwd=3 )
title( rownames(betaMu)[k], line = -2)
}
}plotCoeff( betaMu[-1,], betaSe[-1,], specColor = specColor )Shorebirds (blue) are distributed across the range of responses.
I’ve highlighted decreasers (left) and increases (right) as those having lower or higher CPE after hurricanes. Shorebirds (blue) show no clear trend.
The species with negative dist2shore decrease in
abundance with distance-from-shore. These include the shorebirds.
The increasers having positive interaction
htimepost:dist2shore have hurricane effects that amplify
distance-to-shore differences, and vice versa.
Generalized joint attribute modeling (GJAM)
The model for the aggregated shorebird counts tell a different story from the individual species models. GJAM could address issues that relate to these differences. There are several things to consider. First, the log link function means that coefficients are proportionate. In order to interpret the magnitude of the response, we would have to specify how large the population is. Second, species probably depend on one another, and we are fitting them independently. For example, if we scatter a million birds in front of a storm, their displacements may tell us no more than if we had observed just a few birds. GJAM accounts for the dependence between species.
This code will take a while to run due to the large size of the data set:
xhurr$htime <- factor( xhurr$htime )
levels(xhurr$htime) <- c('pre','post')
ef <- list( columns = 1:S, values = xhurr$duration_minutes )
rl <- list(r = 8, N = 20)
ml <- list(ng = 2000, burnin = 500, typeNames = 'DA', reductList = rl, effort = ef)
form <- as.formula( ~ htime*dist2shore )
out <- gjam(form, xdata = xhurr, ydata = y, modelList = ml)Here are some plots,
pl <- list(GRIDPLOTS=T, specColor = specColor)
gjamPlot(output = out, plotPars = pl)Fitted coefficients in the model accommodate the covariances between species, here showing only those having zero outside the 95% credible interval:
Coefficients from the gjam model for distance to shore. Note negative values for shorebirds (blue)–increasing distance means fewer shorebirds
Coefficients for pre-hurricane. Shorebirds (blue) see the largest negative impacts, meaning that pre-hurricane values are positive.
Coefficients for the interaction between distance and pre-hurricane. Shorebirds (blue) show the strongest interactions. Because most main effects of pre-hurricane are positive for shorebirds, a negative interaction with distance dampens the pre-hurricane effect with increasing distance from shore–most displaced shorebirds are those near the coast.
By combining fitted coefficients with the covariance in predictors, gjam estimates communities of species based on their similarities in response:
Communities defined by responses to predictors in the model show variation in similarity from high (red) to low (blue).
The species that are similar to one another in their responses to the
hurricanes and distance to shore fall out in the same groups in the
panel at left. Groups containing many shorebirds tend to have negative
responses to dist2shore (blue) in the response matrix at
right.
The overall sensitivity of bird abundances to predictors in the model.
The sensitivity plot above shows that dist2shore is the
overall most important predictor in the model of responses near the
times of hurricanes.
For shorebird species with positive htimepost/negative
htimepre (increased CPE after hurricanes) and positive
htimepost:dis2shore/negative
htimepre:dis2shore, the increases are highest at distance
from shore.
A positive interaction amplifies the hurricane effect at distance from the coast.
Exercise. Do you expect decreasers (e.g., many
shorebirds) to have positive or negative interactions
htimepre:dist2shore? If so, why or why not? In other words,
do you expect decreases to be greatest close or far from the coast?
Equivalently, do you expect a decline with distance from the coast to be
greatest for pre- or post-hurricane? Recall that the second variable
amplified the main effect when the interaction as the same sign as the
main effect.
Recap
Citizen-science data provide information on responses that would otherwise not be observed. They suffer from biases in coverage and uneven methods, which can sometimes be accommodated by the design used in the model. For ebird, observation time provides an index of effort that improves comparability of counts.
A joint model that considers the covariance in responses between species affects the uncertainty in parameters, which becomes the basis for deciding that an effect is ‘different than zero’. GLMs can model counts with effort, but depend on a non-linear link function that estimates parameters on a proportionate scale. GJAM accommodates the zero-dominance while estimating parameters on the observation scale. They can be directly interpreted as CPE per unit of predictor variable.