Recent and projected wood thrush losses.

Observational data are the basis for understanding changes in distribution and abundance. Useful analysis starts with exploratory data analysis (EDA) to understand relationships between variables and how sample methods and effort have changed over time and vary in space. The distribution of data determines parameter estimates and predictions of trends.

In this vignette we consider evidence for bird declines from the perspective of a structured inventory, the Breeding-Bird Survey (BBS).

Resources

Files on Sakai

Resources/code/clarkFunctions2021.r

Resources/data/bbsExample.Rdata

Resources/data/NCaroli.csv

Resources/data/routes.csv

Resources/data/SpeciesList.txt

Resources/data/weather.csv

source('clarkFunctions2021.r')

In the media

Birds Are Vanishing From North America(New York Times)

Silent Skies: Billions of North American Birds Have Vanished(Scientific American)

Science resources

Decline of the North American avifauna(Science)

STATE OF THE WORLD’S BIRDS. (BIRDLIFE INTERNATIONAL)

The Audubon Climate Report A study of citizen science data, with species distributions models, to predict species at risk.

Connectivity of wood thrush breeding, wintering, and migration sites, Stanley et al on tracking wood thrush populations combined with BBS (Cons Biol).

Declines in insectivorous birds are associated with high neonicotinoid concentrations, Hallmann et al. find a relationship between bird declines and surface-water concentrations of insecticides.

Conservation Status of North American Birds in the Face of Future Climate Change Langham et al. report on the Audubon study.

Breeding bird survey, explanation and data, with supporting data

IUCN Red List

Activities today

Follow up on ebird example, data analysis and application.
Discussion from maps: Using the BBS data select a species for study, identify where it is increasing or decreasing. Based on your research, speculate on causes for increases or declines. Select four species each for mapping trends. Based on the research discussed today prepare A) a written summary for the individual species you selected, and B) a synthesis of all species selected by the class. For A, your individual summaries will include 1) he spatio-temporal trends (where is it increasing/decreasing?) and 2) possible explanations based on your research. For B, you can enter your individual summaries in a google doc and begin thinking about the synthesis.

For next class

The migratory wood thrush: We started this unit by exploring trends in abundance, mention the migratory wood thrush as an example of a population that is declining. In this vignette we explore trends through time, rather than spatial maps. I’ll start by saying a bit more on the BBS data, then let the data themselves lead the transition to data exploration and modeling.

Wood thrush populations may be declining

Breeding-bird survey data

The breeding-bird survey (BBS) is a structured inventory taken during the breeding season in North America and spanning the last few decades.

As mentioned previously, files we will use are located on Sakai. Familiarize yourself with the BBS sampling protocols here. The data can be downloaded from here. I download most of files, but limit the geographic data files to the state of North Carolina.

Downloaded files are shown in a directory.

From the README.txt file I determine that I can examine data from the entire continent. To get started I focus on North Carolina. I see this is a .csv file, so I might open it in a text editor or in excel. I see that it has column headings. Data within the file appear to all be integer values. I can use the R function read.csv to read the file. It contains columns count10 ... that are summed in the column SpeciesTotal. I decide to begin with the first 6 columns and the SpeciesTotal:

keep <- c(1:6,13)                                        #columns I want
data <- read.csv('NCaroli.csv',header=T)[,keep]
colnames(data)[7] <- 'count'                             #simplify
data[1:5,]                                               #check

mode(data)

## [1] "list"

is.data.frame(data)

## [1] TRUE

The column I have renamed 'count' refers to total counts from 50 stops. Here are some functions I use to explore the data. Execute each in turn and try to make sense of what you get:

str(data)                                           # structure?
table(data$Route)                                   # obs by Route
table(data$Year)                                    # obs by Year
ry <- with(data, which(Year == 1970 & Route == 1))  # a specific route/yr
with(data, table(ry))                               # obs from this route/yr?
data[ry,c('Aou','count')][1:10,]                    # species code/count

Exercise 1. What is the role of the function with in this block of code?

The str(data) function tells me that data is an object called a data.frame, which is a type of list. Subsequent lines tell me about the columns in data. I could use the function summary(data) to summarize these columns by quantile. The other lines in this code use the function table to extract some different summaries. The help(nameOfFunction) pages are the first stop for R functions that I don’t understand.

OxV format and the `data.frame`

Data analysis requires observations-by-variables format. Data often must be restructured for analysis.

A call to the function mode(data) would tell me that data is a list. A list lets me gather a diversity of R objects under one name. When I pass that object around, all elements of the list move together. For example, I might want to store the years and routes where the wood thrush has been observed, the species with Aou code 7550. Here I construct a list that includes the name, routes, and years:

w <- which(data$Aou == 7550)    # rows in data for sp 2890
r <- unique(data$Route[w])      # unique routes
y <- unique(data$Year[w])       # unique years
woodThrush <- list(genus = 'Hylocichla', species = 'mustelina', 
                   routes = r, years = y)  
sapply(woodThrush,mode)

##       genus     species      routes       years 
## "character" "character"   "numeric"   "numeric"

sapply(woodThrush,length)       # lengths of objects in list

##   genus species  routes   years 
##       1       1     111      50

A call to function str(woodThrush) would tell me that I have a list of length-4 that includes two character variables and two vectors of integers. I can access the route data using woodThrush$routes or woodThrush[[3]] (third object in the list woodThrush).

Data are often stored in formats that are not amendable to analysis; reformatting is required. For analysis I typically want a matrix or data.frame with observations as rows and variables as columns, OxV format. Here are terms I use:

terms	definition
observation	what I record at a site/location/plot, numeric or not
sample	observations-by-variables `matrix` or `data.frame`
design matrix	observations-by-variables `matrix`, any factors converted to indicators
covariate	a predictor that takes continuous values
factor	a discrete predictor that can have 2, 3, … levels
level	a factor has at least 2
main effect	an individual predictor
interaction	a combination of more than one predictor

The list that I called data looks like a matrix. I find the dimensions of data with a call to dim(data) = [120361, 7]. These two numbers are rows and columns, respectively. Although data has rows and columns, it is not stored as a R matrix, because it may include factors or even characters. A data.frame is rendered as rows and columns to the console, but it is not stored as a matrix and, thus, does not admit matrix operations. The columns of a data.frame must have the same lengths, but they do not need to be of the same storage mode. Although I cannot do numeric operations across columns of a list, a data.frame is useful for structuring data and often required by R packages. If all elements of a data.frame are numeric, then the data.frame is converted to a numeric matrix using the function as.matrix.

This is a good time to mention a new object in R, the tribble, see a description here ??tribble. Some data.frame behaviors can be troublesome if you are not familiar with them. These will come up as we proceed. Because there is so much code available that uses the data.frame, I will continue to use it here.

Be careful when changing the dimension of a matrix. For example, I might construct a matrix like this:

nd <- 3
z <- matrix(1:nd, 2*nd, 2)
colnames(z) <- paste('c',1:2,sep='')
rownames(z) <- paste('r',1:(2*nd),sep='')
z

##    c1 c2
## r1  1  1
## r2  2  2
## r3  3  3
## r4  1  1
## r5  2  2
## r6  3  3

I provided 1:nd $= 3$ elements to fill a matrix that holds 6 rows and 2 columns. This did not generate an error or warning, because the number of elements in the matrix is a multiple of 3. This behavior can simplify code, but it can also create bugs when this is not what I want and I am not given a warning.

Is the wood thrush declining?

Analyzing trends in observational data require understanding of how methods and observation effort have changed over time.

In this section I fill out concepts and operations that help with exploratory data analysis (EDA).

Model components and definitions

To explore this question I need to structure the BBS data. An observation includes information (Route, Year) and the count for each species that is seen or heard. It includes a description of the weather, which affects behavior. From the table above, a sample is an observations-by-variables matrix or data.frame. It is the fundamental unit for analysis; I draw inference from a sample. This format is OxV (‘obs by variables’) format. The design matrix is a matrix. In R it is stored as a matrix, which is numeric. It contains predictors, which can be main effects and interactions. It can include covariates and factors. It can be analyzed, because it is numeric.

If factors are included in the data.frame, they should be declared using the function factor. If the factors are identified as words in a data.frame, then you will be interpreted that way. Confusion can arise when a factor is represented by integer values, e.g., $1, 2, \dots$; it will be analyzed as a factor only if I declare it with factor(variable). When the data.frame is converted to a design matrix, the factor levels each occupy a column of zeros and ones (more on this later).

OxV format for the BBS example

Many data sets are not organized as OxV, often for good reasons, but they must be put in OxV format for analysis. The data may not be stored as OxV, because it can be inefficient. In the BBS data, an observation consists of the counts of all species at a stop–it includes everything the observer recorded at that stop. Species not seen are zero. Rather than enter many zeros in columns for all of the species not seen, the BBS data offers one line per species–a single observation occupies many rows in data.

Second, OxV format may be hard to see. For viewing/editing data I prefer to see entire rows. When I can see the full width of an object I can find columns, check formatting, correct errors, and so forth. If BBS data were stored as OxV, I would have to scroll across hundreds of columns. If the number of variables is so large that I cannot see it, I might want to format it differently. I do so at the risk of creating more work when I analyze it. The object data is easy to read, because I see entire rows, e.g., with head(data) and tail(data).

If some variables are repeated across multiple observations they might be stored separately. This practice not only reduces size and, thus, storage, but also reduces errors that can enter simply due to redundancy. In the object data, the column Route could be treated as a variable (a ‘factor’ or group to be treated as a ‘random effect’), but it could also be an indicator for other variables. Any attribute of a Route that does not change through time can be recorded once and then retrieved using the index represented by Route. It can be an index that allows us to link it with information in other files. To put it in OxV I need to assemble this outside information into the proper rows.

For analysis, I reformat data to have one line per observation. As you work through the flow that follows pay attention to how vectors i, j are constructed then used to transform a vector in data to a matrix that is OxV:

allCodes <- sort(unique(data$Aou))                # species codes
S        <- length(allCodes)                      # no. species
rt_yr    <- paste(data$Route, data$Year, sep='_') # each route/yr is an obs
allObs   <- sort(unique(rt_yr))                   # each obs
n        <- length(allObs)                        # no. obs
y        <- matrix(0, n, S)                       # OxV
i        <- match(rt_yr, allObs)                  # row index
j        <- match(data[,'Aou'],allCodes)          # col index
ij       <- cbind(i,j)
y[ij]    <- data[,'count']                        # fill obs matrix
colnames(y) <- allCodes                           # name everything
rownames(y) <- allObs
rtYr     <- matrix( unlist(strsplit(allObs,'_')), n, 2, byrow=T) # rownames to rt, yr
route    <- as.numeric(rtYr[,1])
year     <- as.numeric(rtYr[,2])
x <- cbind(route, year)
rownames(x) <- allObs

Each observation refers to a Route/Year combination. I generated a vector rt_yr to give each observation a name that combines this information that uniquely identifies an observation. I extracted the unique rt_yr into a new vector allObs, which becomes the rownames for the OxV matrix y. The columns for y start with the unique species codes allCodes.

Using allObs and allCodes I associate each row in data with row index i and column index j. The OxV matrix y is initialized with zeros, then filled with one row of cbind(i,j) for each row of data. I add row and column names. Finally, I extract the proper route and year from rtYr and cbind them to y, which is OxV.

The species list

The file SpeciesList.txt contains codes for species. Unfortunately, it contains many non-ascii characters–including names in French and Spanish. The non-ascii characters create an error when I try to read the file.

I start by writing a function bbsSpecs to read the file line-by-line using the function readlines. This function gives me a character vector with one element per line from the file. I then use the function iconv to convert any latin1 character to ASCII, replacing all occurrences with the flag "notASCII". I then use the function grep to locate these flags. Finally, I replace all occurrences of "notASCII" with "-".

The function replaceNonAscii in the file clarkFunctions2021.r returns a character vector, each element being a row from filename, but without non-ascii characters. The question now becomes, what do I do with this character vector?

I need to parse each vector into variables, which now occupy positions in each character. I placed a function bbsSpecs in clarkFunctions2021.r that parses each vector into a data.frame containing species codes and names. Here is the table:

specTab <- bbsSpecs(path='')
specTab[1:5,1:3]

specTab is a data.frame where the species names are treated as factor levels.

To extract data for the wood thrush, I find the character 'WoodThrush' in the column name. I translate the Aou code into a character variable so I can locate it in the column heading of specTab. I then extract that column into a new variable I call y:

wr    <- grep('WoodThrush',specTab[,'name']) # find code
wc    <- as.character(specTab[wr,'Aou'])     # find column in specTab
ydata <- y[,wc]                              # extract counts

I now have a vector ydata, each element of which holds the counts corresponding to the same row in x.

Exercise 2. From specTab find the species code for the yellow-billed cuckoo. I would use the function grep.

Routes and weather

I read in the routes to determine where observations were obtained. These are contained in the file routes.csv, which I use to obtain locations:

rt <- read.csv('routes.csv',header=T)           # read routes
rt <- rt[rt$statenum == 63,]                    # NC code = 63
w  <- match(x[,'route'],rt[,'Route'])           # align with rows in x
l  <- rt[w,c('Longitude','Latitude')]           # align lon/lat with x

The object l holds the location information for each route sorted to match the order of routes in x and ydata.

The weather at the time of observation affects activity and, thus, counts. Here is the weather file that BBS provides with data:

r  <- read.csv('weather.csv',header=T,
               stringsAsFactors=F) 
r  <- r[r$statenum == 63,]       # NC code = 63

# average and convert temperature
temp <- (as.numeric(r[,'StartTemp']) + as.numeric(r[,'EndTemp']))/2
temp[r$TempScale == 'F'] <- round( (temp[r$TempScale == 'F'] - 32)*5/9, 1)

rn   <- paste(r[,'Route'],r[,'Year'],sep='_') # unique Route_Year id
w    <- match(rownames(x),rn)                 # align with rows in x
wthr <- cbind(temp, r[,c('Route','StartWind','StartSky')])[w,] # combine cols, sort rows

In the above code note the correction for the fact that some values are fahrenheit. Note the variable $TempScale flags this variable. The data.frame wthr holds temperature, wind, and sky conditions for the time when observations were made.

Here I set up a design matrix that holds columns I want as predictors as well as any others I might decide to use later:

year  <- x[,'year']
xdata <- cbind(l,year,wthr)
str(xdata)

## 'data.frame':    2079 obs. of  7 variables:
##  $ Longitude: num  -78.4 -78.4 -78.4 -78.4 -78.4 ...
##  $ Latitude : num  34 34 34 34 34 ...
##  $ year     : num  1966 1967 1968 1969 1970 ...
##  $ temp     : num  24.4 26.9 23.1 26.7 21.1 26.7 25.3 26.4 20.8 22.5 ...
##  $ Route    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StartWind: chr  "1" "1" "1" "1" ...
##  $ StartSky : chr  "2" "1" "1" "0" ...

Explore the data

Sample effort varies in time, in geographic space, and in covariate space. This coverage determines estimated parameters, trends, and predictions.

I start by exploring the data. Here is a map to see where samples are located in the state, and the relationships between variables:

par(mfrow=c(2,2), bty='n', mar=c(4,5,1,1))
hist(xdata$year, main='', nclass=60)   # routes by year
title('effort by year')
library(maps)

times <- table(xdata$Route)                             # times per route
map('county',region='north carolina')
points(rt[,'Longitude'],rt[,'Latitude'], cex=.075*times, pch=16, col='grey')
points(xdata[,'Longitude'],xdata[,'Latitude'], cex=.3, pch=16)
title('effort by route')
plot(xdata$year, xdata$temp, cex=.3, ylab='degrees C')
title('temperature trend')
plot(jitter(as.numeric(xdata$StartWind)), 
     jitter(as.numeric(xdata$StartSky)), xlab='wind',ylab='sky' )
title('weather')

Data exploration for the wood thrush.

Exercise 3. Does the distribution of data suggest that trends in time could be hard to estimate?

There are often too many (sparse) factors

Factors have discrete levels. Because factors are treated as fixed effects in models, useful inference depends on each level having sufficient observations in the data. In observational data there are often too many factor levels, some of which may be too rare to support inference. It can be useful to aggregate them.

To check if some variables are stored as factors I could do this:

sapply(xdata, is.factor)

## Longitude  Latitude      year      temp     Route StartWind  StartSky 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE

Here I note that variables clearly recorded as factor levels are not stored that way. The variables StartWind and StartSky. From documentation I learn that wind codes are an ordered factor. Here are sky classification codes:

Sky	description	new name
0	Clear or few clouds	clear
1	Partly cloudy (scattered) or variable sky	clouds
2	Cloudy (broken) or overcast	clouds
4	Fog or smoke	fogSmoke
5	Drizzle	rain
7	Snow	snow
8	Showers	rain
9	N/A	unknown

I could declare them as factors with xdata$StartWind <- as.factor(xdata$StartWind). Instead I do some consolidation first.

Here is a table of levels in the data:

table(xdata$StartSky)

## 
##   0   1   2   4   5   8 
## 928 457 530 131  28   5

Note, there is only one observation in level 5. I want fewer and more balanced levels, so I combine some:

sky <- .combineFacLevels(xdata$StartSky,fname='0', aname = 'clear')
sky <- .combineFacLevels(sky,fname=c('1','2'), aname = 'clouds')
sky <- .combineFacLevels(sky,fname='4', aname = 'fogSmoke')
sky <- .combineFacLevels(sky,fname=c('5','8'), aname = 'rain')
sky <- .combineFacLevels(sky,fname='7', aname = 'snow')
sky <- .combineFacLevels(sky,fname='9', aname = 'unkn')
xdata$StartSky <- sky

I expect to see few birds at high wind speeds. The variable StartWind is also a factor in the model (see documentation in weathercodes.txt). To determine the distribution of factor levels I make a table:

table(xdata$StartWind)

## 
##    0    1    2    3    4    5 
## 1189  597  224   64    4    1

In light of the fact that levels 4 and 5 are rare, and high winds are probably important, I could bring more balance by combining these high values into a single class:

wind <- .combineFacLevels(xdata$StartWind,fname=c('3','4','5'), 
                          aname = '3')
xdata$StartWind <- wind

Abundance trends

To examine trends in counts over time I start with scatter plots and a spline function for counts by year. To highlight the range of abundance I use a heat color gradient from blue (low) to red (high). I overlay a spline fit to summarize mean trends:

par(mfcol=c(1,2),bty='n', mar=c(4,4,2,.5))

colors <- c('blue','green','orange','brown')   # color gradient

ym   <- max(ydata) + 1
ylim <- c(0,80)
cf   <- colorRampPalette(colors)(ym)
plot(jitter(year),jitter(ydata),col=cf[y+1], cex=.4, pch=15, 
     xlab='year', ylab='count', ylim=ylim)
spl <- spline(ydata ~ year)
lines(spl,lwd=5,col='white')
lines(spl, lwd=2)

nc  <- 30                    # frequency before 1985, after 2000
y1  <- 1:(nc-1)
sc  <- c(round(y1/nc*ym,0), ym)
old <- hist(ydata[year <= 1985],plot=F, breaks=c(0,sc))
new <- hist(ydata[year >= 2000],plot=F, breaks=c(0,sc))

plot(NULL,xlim=c(0,.15),ylim=ylim, ylab='',xlab='frequency')
.colorSegment(old$density, old$mids, colors, nc, lwd=2)
.colorSegment(new$density, new$mids, colors, nc, lwd=3)

Counts shaded by abandance and spline fit at left. Frequencies of counts before 1985 and after 2000 at right.

The spline function gives an idea of the trend in the data, highlighted by the heat colors I extracted using the function colorRampPalette. To the right are frequency plots for counts before 1985 (thin) and after 2000 (thick). Plots show the declining trend from different perspectives.

A generalized linear model (GLM)

Here is a simple generalized linear regression (GLM) model, more precisely termed a Poisson regression to fit parameters. I include some predictors from the BBS data,

fit <- glm(ydata ~ temp + year + StartWind, data=xdata, family=poisson("log"))
summary(fit)

## 
## Call:
## glm(formula = ydata ~ temp + year + StartWind, family = poisson("log"), 
##     data = xdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8752  -2.2054  -0.7085   1.1847   9.7254  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 63.5837470  0.9715180  65.448  < 2e-16 ***
## temp        -0.0213270  0.0020261 -10.526  < 2e-16 ***
## year        -0.0304594  0.0004853 -62.758  < 2e-16 ***
## StartWind1  -0.0745468  0.0159526  -4.673 2.97e-06 ***
## StartWind2  -0.1264779  0.0235340  -5.374 7.69e-08 ***
## StartWind3  -0.3814783  0.0433770  -8.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17420  on 2077  degrees of freedom
## Residual deviance: 13607  on 2072  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 20945
## 
## Number of Fisher Scoring iterations: 5

This code says that I want a GLM model for response (counts) y with two covariates, temp + year, and one factor, StartWind.

Exercise 4. Which variables in this GLM appear to be important and why?

Design matrix

A design matrix holds predictors. For observation $i$ there are $p$ predictors, typically the first is an intercept. Elements of the length-$p$ vector of coefficients $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)$ are multiplied by corresponding elements in the design vector $\mathbf{x}_i = (1, x_{i2}, \dots x_{i,p})$. The design matrix organizes the vectors $\mathbf{x}_i: i = 1, \dots, n$ into a $n \times p$ matrix $\mathbf{X}$.

I set up the design matrix using the function model.matrix. There is a missing value in xdata, which affects how I implement model.matrix. The default treatment of missing values in R packages is to disgard the entire observation from the data. To see where this missing value occurs you could do this: which(is.na(xdata),arr.ind=T). The design matrix constructed by model.matrix follows this convention, so I need to subject y to the same treatment. I get X and y with the same observation omitted this way:

data   <- xdata
data$y <- ydata
xmiss  <- which(is.na(xdata),arr.ind=T)
X <- model.matrix(y ~ temp + year + StartWind, data=data)
Y <- model.frame(y ~ temp + year + StartWind, data=data)$y

I take a moment to look at the design matrix X,

head(X)

##        (Intercept) temp year StartWind1 StartWind2 StartWind3
## 3269             1 24.4 1966          1          0          0
## 3269.1           1 26.9 1967          1          0          0
## 3269.2           1 23.1 1968          1          0          0
## 3269.3           1 26.7 1969          1          0          0
## 3269.4           1 21.1 1970          0          0          0
## 3269.5           1 26.7 1971          0          1          0

The three indicator columns for the factor StartWind contains 1’s for the non-reference level of this factor. The reference level is the first level,

attr(data$StartWind,'levels')

## [1] "0" "1" "2" "3"

There is no column for level 0, because this is the intercept. Thus, the coefficients for other levels are relative to level 0.

Additional variables

I now explore some additional variables that might help explain trends in the wood thrush population.

f     <- covarGrid(xdata[,c('Longitude','Latitude')], path='')
data <- cbind(data,f)

Here are the variable names: Longitude, Latitude, year, temp, Route, StartWind, StartSky.

Here are tables of soil and land cover :

table(data$soil)

## 
##    Alfisols    Andisols   Aridisols    Entisols   Histosols Inceptisols 
##          94           0           0          62         155         202 
##   Mollisols      Others   Spodosols    Ultisols   Vertisols 
##           0           0          23        1503           0

table(data$nlcd)

## 
##    cultivated-crop cultivated-pasture              dev-0              dev-1 
##                200                149                164                 89 
##       forest-decid   forest-evergreen       forest-mixed     herb-grassland 
##                565                327                147                 20 
##        shrub-scrub              water wetland-herbaceous      wetland-woody 
##                130                 40                  4                244

Here is some aggregation of levels:

data$nlcd <- .combineFacLevels(data$nlcd, c('dev-0','dev-1'), aname = 'dev')
fname      <- c('wetland-herbaceous','wetland-woody')
data$nlcd <- .combineFacLevels(data$nlcd, fname, aname = 'wetland')
data$nlcd <- .combineFacLevels(data$nlcd, fname='NA', aname = 'reference')

On which land cover types has the decline been most severe? Here are an analysis and some plots:

fit <- glm(y ~ temp + year*nlcd + StartWind, data = data, family=poisson("log"))

xmiss <- which(is.na(data),arr.ind=T)     # find missing values
xmiss <- sort(unique(xmiss[,1]))           # missing rows
newdata <- data[-xmiss,]                  # prediction grid

p1  <- predict(fit, newdata=newdata, type='response')
par(mfrow=c(2,2),bty='n', mar=c(4,4,1,1))
plot(ydata[-xmiss],p1, xlab='observed', ylab='predicted', cex=.2)
abline(0,1,lty=2)
plot(newdata$temp,p1, xlab='winter temp', ylab=' ', cex=.2)
plot(newdata$year,p1, xlab='year', ylab='predicted', cex=.2)
plot(newdata$nlcd,p1, xlab=' ', ylab='predicted', cex=.2)

summary(fit)

## 
## Call:
## glm(formula = y ~ temp + year * nlcd + StartWind, family = poisson("log"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.7144  -2.0133  -0.6579   1.1512   9.6787  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 102.114688   3.077006  33.186  < 2e-16 ***
## temp                         -0.021194   0.002162  -9.801  < 2e-16 ***
## year                         -0.049786   0.001547 -32.179  < 2e-16 ***
## nlcdcultivated-pasture       13.370544   4.863630   2.749 0.005976 ** 
## nlcddev                     -37.351718   4.237326  -8.815  < 2e-16 ***
## nlcdforest-decid            -47.125773   3.611895 -13.047  < 2e-16 ***
## nlcdforest-evergreen        -54.332414   3.705405 -14.663  < 2e-16 ***
## nlcdforest-mixed            -69.736761   4.312896 -16.169  < 2e-16 ***
## nlcdherb-grassland           -3.268878  13.422845  -0.244 0.807594    
## nlcdshrub-scrub             -66.047156   6.366410 -10.374  < 2e-16 ***
## nlcdwater                   203.189138  24.308416   8.359  < 2e-16 ***
## nlcdwetland                 -43.731596   4.576219  -9.556  < 2e-16 ***
## StartWind1                   -0.062698   0.016271  -3.853 0.000117 ***
## StartWind2                   -0.109932   0.023714  -4.636 3.56e-06 ***
## StartWind3                   -0.367400   0.043867  -8.375  < 2e-16 ***
## year:nlcdcultivated-pasture  -0.006783   0.002446  -2.773 0.005550 ** 
## year:nlcddev                  0.018650   0.002130   8.757  < 2e-16 ***
## year:nlcdforest-decid         0.023629   0.001815  13.016  < 2e-16 ***
## year:nlcdforest-evergreen     0.027362   0.001863  14.684  < 2e-16 ***
## year:nlcdforest-mixed         0.035029   0.002168  16.154  < 2e-16 ***
## year:nlcdherb-grassland       0.001304   0.006756   0.193 0.847000    
## year:nlcdshrub-scrub          0.032903   0.003191  10.312  < 2e-16 ***
## year:nlcdwater               -0.101772   0.012180  -8.356  < 2e-16 ***
## year:nlcdwetland              0.021945   0.002296   9.559  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17420  on 2077  degrees of freedom
## Residual deviance: 12442  on 2054  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 19816
## 
## Number of Fisher Scoring iterations: 5

Exercise 5. Examine the main effects and interactions from this analysis to suggest which cover types have experienced the biggest declines.

Exercise 6. Compare the trends you see in the wood thrush or another species to the projection maps of [Audubon] and PBGJAM.

Recap

A minimal knowledge of data types and storage modes is needed to manipulate objects in R. Functions are available from the base R package. They are supplemented by others that can be installed from the CRAN website, github, sourceforge, or even developed yourself.

Data analysis begins with data formatting and exploration. Data analysis requires OxV (observations $\times$ variables) format. I review packages and functions that can help.

Observational data are unbalanced. Analysis requires attention to uneven distribution of sample effort in geographic space, covariate space, and time.

11. Declining bird populations
change through time as a big data problem

Duke University ENV89S: Environmental change in the big-data era

James S Clark

2021-03-02

Resources

Files on Sakai

In the media

Science resources

Activities today

For next class

Breeding-bird survey data

OxV format and the `data.frame`

Is the wood thrush declining?

Model components and definitions

OxV format for the BBS example

The species list

Routes and weather

Explore the data

There are often too many (sparse) factors

Abundance trends

A generalized linear model (GLM)

Design matrix

Additional variables

Recap

11. Declining bird populations change through time as a big data problem

Duke University ENV89S: Environmental change in the big-data era

James S Clark

2021-03-02

Resources

Files on Sakai

In the media

Science resources

Activities today

For next class

Breeding-bird survey data

OxV format and the data.frame

Is the wood thrush declining?

Model components and definitions

OxV format for the BBS example

The species list

Routes and weather

Explore the data

There are often too many (sparse) factors

Abundance trends

A generalized linear model (GLM)

Design matrix

Additional variables

Recap

11. Declining bird populations
change through time as a big data problem

OxV format and the `data.frame`