Before modeling begins there is a question and, often, data. If the data are observational, exploratory data analysis (EDA) is the initial step, used to identify relationships between variables and spatio-temporal trends. Following EDA, we fit a Bayesian model in rjags.

Logistics

Resources

Data: you can download the Breeding Bird Survey explanation and data, along with supporting data.

  • for North Carolina the file BBSexampleTime.rdata is on github

Software

Get the updated clarkFunctions2024.R file on Sakai:

source('clarkFunctions2024.r')

rjags: If you are on a mac, you may have trouble installing rjags. If so, use this link to download the newest version of JAGS. Then install and load packages:

library( rjags )
library( maps )
library( coda )
library( TruncatedNormal )
library( lattice )
library( repmis )

Readings

Here is some background on wood thrush migration

Today’s plan

  • review unit 3 answers

  • A few camera-trap assignments still need completion: wildlifeinsights.org. (Recall that the account is approved by submitting this form.)

  • this vignette’s objectives:

    • basics of exploratory data analysis (EDA)
    • observations-by-variables format
    • more about factors
    • formula and graph for a basic GLM
    • rjags for Bayesian analysis

For next time

  • exercises from this vignette.


Case study: the migratory wood thrush

The migratory wood thrush population is declining. There are multiple threats, both here in the temperate zone and in the tropical wintering grounds in Central America. Stanley et al. (2015) quantified declines in wood thrush abundance in the breeding range by region and related them to forest cover. They used tracking data to determine whether regional differences in the US might be explained by linkages between summer and wintering grounds. I use this example to examine changes in abundance using the breeding-bird survey (BBS) data. Specifically, we will ask whether wood thrush populations could be in decline in NC and, if so, which variables in the BBS data might help us understand it.

Wood thrush populations may be declining.

EDA with R?

Data manipulation remains the most time-consuming element of analysis. The popularity of R begins with the flexibility it provides for data structures and management.

A user can interact with R at a range of levels. High-level interactions with transparent functions like lm (linear regression) require little more than specification of predictors and a response variable. At the same time, R allows for low-level algorithm development. In fact, functions written in C++ and Fortran can be compiled and used directly with R. This combination of high- and low-level interaction allows one to enter at almost any degree of sophistication, then advance, without requiring a change in software.

The Wickham article is good background on data structures; his article lists the most common problems with data structures.

The most important concept to take from this article is that of observations by variables, the basic structure needed for data modeling.

interacting with R

This section on R application follows the introduction in the vignette on Introduction to R.

Where am I?

When I open RStudio I am in a directory. I can interact directly with files in my working directory. To interact with a file in a different directory, I either supply a path to the directory that holds it or change my working directory using the Session tab in RStudio.

I can determine which directory I am in with the function getwd(). I can move to a different directory using setwd(pathToThere).

Here I am:

getwd()
## [1] "/Users/jimclark/Library/CloudStorage/Box-Box/Home Folder jimclark/Private/classes/bayesClass2024spring/4bbs"

Syntax for paths is standard unix/linux. I move down into a directory by giving its name followed by /. I move up a directory by starting the path with "..".

Where’s the file?

As discussed above, files are identified by their location, or path, and the file name. The path can be relative to the current directory. To determine which files are in my current directory, I use list.files(). This returns a character vector of file names. To find the names of files in a different directory I need to include the path. This code works for me, because the paths match the locations of directories relative to my current directory:

list.files()                        # files in current directory
list.files("../dataFiles/")         # in dataFiles
list.files("../dataFiles/dataBBS")  # in dataFiles/dataBBS

What’s in the file?

I interact with files through functions that allow me to access their contents or create new files to hold content that is currently in my working environment. Common formats for data include .txt and .csv files, but there are dozens of others.

An internet search will often locate an R package that reads a specific file type. Some files may require modification, either to make them readable or to extract information that the available package does not handle. The R function readLines allows the desperate measure of reading a file line-by-line. This might be needed when, say, a .txt file contains non-ASCII characters. If I have an unfamiliar file type, I start with google.
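
For example, here is a minimal sketch of the readLines approach; the file name myData.txt and the comma delimiter are hypothetical:

tmp   <- readLines('myData.txt')        # one character string per line of the file
tmp   <- tmp[ nchar(tmp) > 0 ]          # drop empty lines
parts <- strsplit(tmp, split = ',')     # split each line into tokens at commas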

R has its own compressed binary format, .rdata, commonly used when multiple objects will be processed and stored together. These files are read with the load function.

Here I load observations contained in two objects, xdata (predictors) and ydata (bird counts) from the internet. The function source_data from package repmis wraps the load within a function that downloads the file first:

d <- "https://github.com/jimclarkatduke/gjam/blob/master/BBSexampleTime.rdata?raw=True"
source_data(d)
## [1] "xdata"         "ydata"         "missingEffort" "timeList"     
## [5] "edata"

The console displays names of objects loaded into the working environment. Here are a few lines of xdata:

          lon      lat Route      soil   nlcd year temp StartWind StartSky juneTemp
1-0 -78.37957 33.98146     1 Spodosols forest 1964   NA        NA       NA     25.6
1-1 -78.37957 33.98146     1 Spodosols forest 1965   NA        NA       NA     24.5
1-2 -78.37957 33.98146     1 Spodosols forest 1966 24.4         2        2     23.8
1-3 -78.37957 33.98146     1 Spodosols forest 1967 26.9         2        2     24.1
1-4 -78.37957 33.98146     1 Spodosols forest 1968 23.1         2        2     25.4

Here are a few lines of ydata:

    MourningDove Red-belliedWoodpecker ChimneySwift EasternWood-Pewee BlueJay
1-0            0                     0            0                 0       0
1-1            4                     1            1                 0       4
1-2           12                     5            4                 2      16
1-3           51                     5            4                 8      19
1-4           15                     5            6                 4      27

xdata is a data.frame (mode(xdata) is list), and ydata is a numeric matrix. I created these objects from files on the BBS site.

The data.frame xdata holds information about each observation, including location (lon, lat), the observation Route, the soil type (a factor), and the nlcd land cover type.

One of the important things about this structure is the alignment of observations in xdata and ydata. In fact, they could be the same file. I’ve separated them here because there are a large number of species in ydata, and I don’t want the extra effort needed to extract them from a larger file.


OxV format and the data.frame

Data are often stored in formats that are not amenable to analysis; reformatting is required. For analysis I typically want a matrix or data.frame with observations as rows and variables as columns, OxV format.

Many data sets are not organized as OxV, often for good reasons, but they must be put in OxV format for analysis. Here are terms I use:

term           definition
observation    what I record at a site/location/plot, numeric or not
sample         observations-by-variables matrix or data.frame
design matrix  observations-by-variables matrix, with any factors converted to indicators
covariate      a predictor that takes continuous values
factor         a discrete predictor that can have 2, 3, … levels
level          one of the discrete values a factor can take; a factor has at least 2
main effect    an individual predictor
interaction    a combination of more than one predictor


The list saved as xdata looks like a matrix. I find the dimensions of xdata with a call to dim(xdata) = [2751, 18]. These two numbers are rows and columns, respectively. Recall that although xdata has rows and columns, it is not stored as an R matrix, because it may include factors or even characters. A data.frame is rendered as rows and columns for the benefit of the user, but it is not stored as a matrix and, thus, does not admit matrix operations. The columns of a data.frame must have the same lengths, but they do not need to be of the same mode. In fact, a data.frame is a type of list.

Although I cannot do numeric operations across columns of a list, a data.frame is useful for structuring data. If all elements of a data.frame are numeric, then the data.frame can be converted to a numeric matrix using the function as.matrix.
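
As a small sketch with a hypothetical data.frame:

df <- data.frame( a = 1:3, b = c(0.5, 1.5, 2.5) )   # all columns numeric
m  <- as.matrix( df )                               # now a numeric matrix
m %*% c(1, 2)                                       # matrix operations are allowed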

A related object in R is the tibble (see ??tibble for a description). Some data.frame behaviors can be troublesome if you are not familiar with them; these will come up as we proceed. Because so much available code uses the data.frame, I will continue to use it here.

Previously, we used sapply to explore the objects in a data.frame:

sapply( xdata, is.numeric )         # is each column numeric?
##     insert     groups      times        lon        lat      Route       soil 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE      FALSE 
##       nlcd       year       temp  StartWind   StartSky   precSite   precAnom 
##      FALSE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
##    defSite    defAnom winterTmin   juneTemp 
##       TRUE       TRUE       TRUE       TRUE
which( sapply( xdata, is.factor ) ) # which columns are factors?
## soil nlcd 
##    7    8

I’ll expand on factors in the next section.


Exercise 1: Are there character variables in xdata? How did you find out?


Exploratory data analysis (EDA)

Exploratory data analysis (EDA) is the first step required for an analysis. EDA helps to determine if the scientific question can be answered with the data at hand. It guides model development and computation.

The relationships between variables must be understood before any modeling begins. Do data provide sufficient coverage in space, time, and predictor and response variables? Are there relationships between variables? Is there ‘signal’? In this section I learn about the data through EDA.

Observations by variables

An observation is a route-year, and that is how xdata is organized: it is observations by variables. I also want to examine counts of wood thrush as a routes (rows) by years (columns) matrix. As previously, I can use match. At the end of the vignette on Introduction to R there was a problem that required the same operation, so I put it in the functions vec2Mat and mat2Vec that are now in clarkFunctions2024.R.

The next thing I do is create a routeYear variable to identify observations. One routeYear is a single observation:

xdata$routeYear <- paste( xdata$Route, xdata$year, sep = '_' )

Now I use the vec2Mat function I created to create a routeByYr matrix. To check this function I then invert the operation using mat2Vec. Because this second function cannot know how rows were organized in the original ydata, I use the newly created xdata$routeYear to compare the reconstructed routeYr with ydata[,spec]:

spec      <- 'WoodThrush'
routeByYr <- vec2Mat( ydata[,spec], xdata$Route, xdata$year )  # counts
routeYr   <- mat2Vec( routeByYr, include.na = F )
all.equal( ydata[,spec], routeYr[xdata$routeYear,'x'] )        # yes (except names)
## [1] "names for target but not for current"

The all.equal function says that the values in these two objects are the same; they differ only in their names. The routeByYr matrix holds the WoodThrush counts organized by route and year:

  1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
     0    1    0    0    2    0    1    4    3    5
    NA   NA    0    0    0    0    0    1    1    1
     0    0    0    0    0    0    0    0    0    1
     0    1    4   11    1    1    5   19    1   35
    NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

The matrix routeByYr helps to visualize each Route as a time series. Missing values (NA) are routes that were not counted in that year.

The wood thrush summer range (orange) includes North Carolina. The winter range is in Central America (blue).

We can also obtain marginal summaries and express them as counts per effort (CPE),

cByYr <- colSums( routeByYr, na.rm=T )     # counts by year
nByYr <- colSums( routeByYr*0+1, na.rm=T ) # no. routes by year (why does this work?)
nByYr <- table( xdata$year )               # another method
CPE   <- cByYr/nByYr                       # counts per route by year
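
A quick plot of CPE against year gives a first look at the trend; this sketch uses the objects created above:

yrs <- as.numeric( names(CPE) )                     # years recovered from the names
plot( yrs, as.numeric(CPE), type = 'b', pch = 16,
      xlab = 'Year', ylab = 'Counts per route', bty = 'n' )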

Spatial distribution

To explore the data I first generate a map of the observations. Here is a map of the routes:

map( 'county', xlim = c(-85, -75), ylim = c(33.6, 36.8), col='grey')
map( 'state', add=T)
points( xdata$lon, xdata$lat, pch = 16 )

Route locations.

I see balanced coverage in space, from mountains to coast.

I now examine counts for wood thrush, using symbol size to represent the species count and color to represent the moisture deficit variable.

yspec <- ydata[, spec ]
cy  <- tapply(yspec, xdata$Route, mean, na.rm=T)     # route means
wi  <- match( as.numeric(names(cy)), xdata$Route )   # index for each route
ll  <- xdata[ wi, c('lon','lat') ]                   # locations
def <- xdata$defSite[wi]                             # route moisture deficit

Now I assign colors to def that range from wet (blue) to dry (brown). The function colorRampPalette generates a new function that takes the number of colors needed, nlevs, and interpolates a sequence of nlevs colors.

ramp  <- c("#2166ac", "#01665e", "#d8b365", "#8c510a") # color ramp
nlevs <- 20                                            # no. of colors in sequence
df    <- seq( min(def), max(def), length = nlevs )     # sequence spanning range of def
colT  <- colorRampPalette( ramp )                      # new function to assign colors
cols  <- colT(nlevs)                                   # interpolated to no. of levels
di    <- findInterval(def, df, all.inside = T)         # assign color level        

There is now one moisture deficit value (def) for each Route. Their locations (lon/lat) are held in ll. Here is a histogram of def:

hist( def, nclass = nlevs, col = cols, border = cols, main = '' )

Annual moisture deficit ranges from negative (wet) to positive (dry).

Note that dry (brown) colors correspond to positive deficits. Here they are on the map:

map( 'county', xlim = c(-85, -75), ylim = c(33.6, 36.8), col = 'grey' )
map( 'state', add = T )
cex <- 6*cy/max(cy, na.rm=T)                                  # symbol size scaled to counts
points( ll[,1], ll[,2], pch = 16, cex = cex, col = cols[di] ) # note how colors handled
title(spec)

Wet colors are confined to western mountains. Large symbols indicate high counts.


Exercise 2. What does this map tell us about the spatial distribution of bird counts and the moisture deficits where they are common?


Is the wood thrush declining?

To explore this question I need to structure the BBS data. Again, an observation includes information (Route, Year) and the count for each species that is seen or heard. It includes a description of the weather, which affects behavior. From the table above, a sample is an observations-by-variables matrix or data.frame. It is the fundamental unit for analysis; I draw inference from a sample. This format is OxV (‘obs by variables’) format. The design matrix is numeric. It contains predictors, which can be main effects and interactions, and these can include covariates and factors. Because it is numeric, it can be analyzed.

If a variable is stored as a character it may need to be declared a factor. Confusion can arise when a factor is represented by integer values, e.g., \(1, 2, \dots\); it will be analyzed as a factor only if I declare it with factor(variable). When the data.frame is converted to a design matrix, the factor levels each occupy a column of zeros and ones (more on this later).
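
Here is a small sketch with a hypothetical integer-coded variable, showing the difference between leaving it numeric and declaring it a factor:

wind  <- c(1, 1, 2, 3, 2, 1)        # integer codes, stored as numeric
windF <- factor( wind )             # declared as a factor with levels 1, 2, 3
model.matrix( ~ wind )              # a single column: treated as a covariate
model.matrix( ~ windF )             # zero/one indicator columns for levels 2 and 3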

Here are raw observations plotted by year:

plot(xdata$year, ydata[,'WoodThrush'], xlab = 'Year', ylab = 'Wood thrush counts', 
     cex = .4, pch = 16, bty = 'n', col = cols[di] )

Here is the mean count by route-year:

thrush <- tapply(yspec, list( route = xdata$Route, year = xdata$year), 
                 mean, na.rm=T)

Take a look at the structure of the matrix thrush. Here I add the mean taken over all routes in a given year:

thrushYear <- colMeans( thrush, na.rm=T)            # average by year
year       <- as.numeric(names(thrushYear))         # all years in the study

plot(xdata$year, ydata[,'WoodThrush'], xlab = 'Year',
     ylab = 'Wood thrush counts', cex = .5, pch = 16, bty = 'n', col = cols[di] )
lines(year, thrushYear, col='white', lwd=5)
lines(year, thrushYear, lwd=2)

thrushQuant <- apply( thrush, 2, quantile, na.rm=T) # quantiles for counts
lines(year, thrushQuant[2,], col='white', lwd=5)
lines(year, thrushQuant[4,], col='white', lwd=5)
lines(year, thrushQuant[2,], lwd=1)
lines(year, thrushQuant[4,], lwd=1)

Counts in time with quantiles


Exercise 3. Can we tell from this plot if birds are declining? Are there alternative explanations?


More detailed EDA

I start by exploring the data to get a better idea of sample effort in space and time. Here are the number of routes observed each year and a map showing where samples are located in the state:

par( mfrow = c(1,2), bty = 'n', mar = c(4,3,1,3) )        # observations by year
hist( xdata$year, main='', xlab = 'Year', nclass=60 )     # routes by year
title('effort by year')

xdata$Route <- as.numeric( as.character(xdata$Route) )    # if it's a factor

effortByRoute <- tapply( xdata$year*0 + 1, xdata$Route, sum, na.rm = T )
mm     <- match( names(effortByRoute), xdata$Route )
lonLat <- xdata[mm, c('lon','lat')]
routeSummary <- data.frame( route = xdata$Route[mm], lonLat, effort = effortByRoute )

map('county', region = 'north carolina', col = 'grey')   # symbols scaled by obs per route
points( routeSummary$lon, routeSummary$lat, cex = .03*effortByRoute, pch=16 )
title('effort by route')

Distribution in space and time.

The big change in effort after 1987 needs more thought. Which sites were added? Here’s a map:

firstYr <- tapply(xdata$year, xdata$Route, min, na.rm=T)  # yr a route started
firstYr <- firstYr[ as.character(routeSummary$route) ]
routeSummary <- cbind( routeSummary, firstYr )

col <- rep('blue', nrow(routeSummary))                    # early routes
col[ routeSummary[, 'firstYr'] > 1987 ] <- 'darkgreen'        # added later
map('county',region = 'north carolina')
points( routeSummary$lon, routeSummary$lat, cex=.055*routeSummary$effort, pch=16, col = 'white' )
points( routeSummary$lon, routeSummary$lat, cex=.05*routeSummary$effort, pch=16, col=col)

Early (blue) and late (green) routes

Are these sites changing the distribution of data with respect to, say, temperature?

plot(xdata$year, xdata$juneTemp, cex=.2, xlab = 'Year', ylab='degrees C', bty = 'n' )
title('Temperature trend')

routes <- sort(unique(routeSummary$route))
nroute <- length(routes)

for(i in 1:nroute){
  wi <- which(xdata$Route == routes[i])
  lines(xdata$year[wi], xdata$juneTemp[wi], col = col[i])
}
tmu <- tapply(xdata$juneTemp, xdata$year, mean, na.rm=T)
years <- as.numeric(names(tmu))
lines( years, tmu, col=2, lwd = 3)   # annual average T

Temperature trends by route.


Exercise 4. What do the sizes of the symbols in the map indicate? How would I determine if adding new sites after 1987 has biased the sample toward warmer or cooler temperatures? What about the distribution of data suggests that trends in time for bird abundances could be hard to estimate?


What is observed in a given year could depend less on the mean temperature for the site than on the anomaly: is it a warm or cold time relative to the site mean? Here are June temperatures represented as anomalies from the site mean:

# overall trend
trend <- lm(tmu ~ years)

# local anomaly
tmuByRoute <- tapply(xdata$juneTemp, xdata$Route, mean, na.rm=T)
tanomaly   <- xdata$juneTemp - tmuByRoute[ as.character(xdata$Route) ]

plot(xdata$year, tanomaly, cex=.2, bty = 'n' )
abline(h = 0, lty=2)

Temperature anomalies

cor( tanomaly, ydata[,spec] ) # correlation with thrush
## [1] -0.08288059


Exercise 5. Does the temperature anomaly explain variation in wood thrush count?

Unbalanced factor levels

Recall that some variables are stored as factors:

which( sapply(xdata, is.factor) )
## soil nlcd 
##    7    8

Here I note that variables clearly recorded as factor levels are not stored that way. The variables StartWind and StartSky are potentially important, because they affect bird activity. From documentation I learned that wind codes are an ordered factor. Here are sky classification codes used in the original data:

Sky  description                                new name
0    Clear or few clouds                        clear
1    Partly cloudy (scattered) or variable sky  clouds
2    Cloudy (broken) or overcast                clouds
4    Fog or smoke                               fogSmoke
5    Drizzle                                    rain
7    Snow                                       snow
8    Showers                                    rain
9    N/A                                        unknown

A common problem with factors is that the representation of levels can be highly uneven, which can have a big impact on a model fit. Each factor level will be treated like an intercept in model fitting. The version of xdata loaded here has nine levels from the BBS data aggregated into four. They are currently not stored as factors, but I can change that by declaring them to be factors.
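
Before declaring them as factors, I can check how evenly the levels are represented; a minimal sketch:

table( xdata$StartWind, useNA = 'ifany' )    # observations per wind code, including NA
table( xdata$StartSky,  useNA = 'ifany' )    # observations per sky code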

To appreciate how they are treated in model fitting, try a simple example with the variable xdata$StartWind. Here is the effect of ignoring that this is a factor:

# create a factor version for each variable
xdata$routeFactor     <- as.factor(xdata$Route)
xdata$StartWindFactor <- as.factor(xdata$StartWind)
xdata$SkyFactor       <- as.factor(xdata$StartSky)

fit1 <- glm(yspec ~ xdata$StartWind, family = poisson)        # WRONG: fitted as non-factor
fit2 <- glm(yspec ~ xdata$StartWindFactor, family = poisson)  # CORRECT

summary(fit1)
## 
## Call:
## glm(formula = yspec ~ xdata$StartWind, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4627  -2.5917  -0.9182   1.2063  12.6804  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.310526   0.015492 149.143   <2e-16 ***
## xdata$StartWind -0.012172   0.008676  -1.403    0.161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17421  on 2078  degrees of freedom
## Residual deviance: 17419  on 2077  degrees of freedom
##   (672 observations deleted due to missingness)
## AIC: 24753
## 
## Number of Fisher Scoring iterations: 5
summary(fit2)
## 
## Call:
## glm(formula = yspec ~ xdata$StartWindFactor, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.5468  -2.5462  -0.9225   1.2632  12.7635  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.281335   0.009269 246.131  < 2e-16 ***
## xdata$StartWindFactor2  0.054364   0.015747   3.452 0.000556 ***
## xdata$StartWindFactor3 -0.005901   0.023337  -0.253 0.800385    
## xdata$StartWindFactor4 -0.185720   0.043225  -4.297 1.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17421  on 2078  degrees of freedom
## Residual deviance: 17385  on 2075  degrees of freedom
##   (672 observations deleted due to missingness)
## AIC: 24722
## 
## Number of Fisher Scoring iterations: 5
plot( jitter(xdata$StartWind), yspec, cex=.2, xlab = 'StartWind factor level',
      ylab = 'Wood thrush counts', bty = 'n' )

Wind is a discrete factor.


Exercise 6. What is the difference in your interpretation of the wind variable in the two fits? Explain what the coefficients mean in the two fits.

Exercise 7. From the object trend, is temperature increasing or decreasing on average? Is this trend ‘significant’? Does it differ by site? Hint: consider including route as a factor.



A generalized linear model (GLM)

Unlike linear regression for continuous responses, which uses a Gaussian likelihood, the observations in this example are discrete counts. The linear model is not appropriate, because observations can only be non-negative integers, \(y = 0, 1, 2, \dots\).

Here is a simple generalized linear model (GLM), more precisely termed a Poisson regression, to fit parameters:

\[ \begin{aligned} y_i &\sim Poi(\lambda_i) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] The term GLM refers to the fact that there is a linear function of parameters (second line) embedded within a non-Gaussian distribution. The first line is a Poisson distribution. The Poisson distribution has one parameter, an intensity \(\lambda > 0\). It describes count data \(y \in \{0, 1, \dots\}\), i.e., non-negative integers. The Poisson has the unusual property that \(\lambda\) is both the mean and the variance of the distribution, \(E[y] = Var[y] = \lambda\).

The second line is linear in parameters, but it is a non-linear function of \(\lambda\). In this context, the log function is called the link function, connecting the linear model to the Poisson sampling distribution.

For a sample of \(n\) observations, the likelihood is

\[[\mathbf{y}|\mathbf{X}, \boldsymbol{\beta}] = \prod_{i=1}^n Poi(y_i | \lambda_i)\] where \(\mathbf{y}\) is the length-\(n\) vector of counts, and \(\mathbf{X}\) is the \(n \times p\) matrix of predictor variables. The R function glm can be used to fit GLMs. I give it a formula, the data, and the family, family = distribution("link"). I include some predictors from the BBS data:

fit <- glm(yspec ~ temp + year + StartWindFactor, data=xdata, family=poisson("log"))
summary(fit)
## 
## Call:
## glm(formula = yspec ~ temp + year + StartWindFactor, family = poisson("log"), 
##     data = xdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8752  -2.2054  -0.7085   1.1847   9.7254  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      63.5837470  0.9715180  65.448  < 2e-16 ***
## temp             -0.0213270  0.0020261 -10.526  < 2e-16 ***
## year             -0.0304594  0.0004853 -62.758  < 2e-16 ***
## StartWindFactor2 -0.0745468  0.0159526  -4.673 2.97e-06 ***
## StartWindFactor3 -0.1264779  0.0235340  -5.374 7.69e-08 ***
## StartWindFactor4 -0.3814783  0.0433770  -8.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17420  on 2077  degrees of freedom
## Residual deviance: 13607  on 2072  degrees of freedom
##   (673 observations deleted due to missingness)
## AIC: 20945
## 
## Number of Fisher Scoring iterations: 5

This code says that I want a GLM for the response (counts) yspec with two covariates, temp + year, and one factor, StartWindFactor. I want a log link, i.e., \(\log(\lambda_i) = \mathbf{x}'_i \boldsymbol{\beta}\).
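
To view the fitted model on the scale of the counts, I can ask glm for predictions on the response scale, which exponentiates the linear predictor; a sketch:

lambdaHat <- predict( fit, type = 'response' )   # exp( x'beta ), one value per fitted observation
head( lambdaHat )                                # fitted Poisson intensities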


Exercise 9. Which variables in this GLM appear to be important and why?

A Bayesian implementation

I now turn to a Bayesian implementation. The package used here, rjags, uses Markov chain Monte Carlo (MCMC) simulation. We’ll discuss that in coming vignettes.

The model I fit now includes a prior distribution for coefficients,

\[ \begin{aligned} \left[ \boldsymbol{\beta} | \mathbf{X}, \mathbf{y} \right] &\propto \prod_{i=1}^n Poi(y_i | \lambda_i) \times \prod_p N \left(\beta_p | 0, 10^5 \right) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \]

The right-hand side of the first line holds the likelihood and a normal prior distribution for each \(\beta_p\). This prior distribution is non-informative, because the variance is large enough to make the prior essentially flat.

Graphical model for the Bayesian GLM.

The graph does not include a box for \(\lambda\), because it is a deterministic transformation of \(\beta\) and \(\mathbf{x}\). Consider that I could have written the likelihood \(Poi(y_i | \lambda_i)\) as \(Poi \left( y_i | \exp \left[\mathbf{x}'_i \boldsymbol{\beta} \right] \right)\).

Design matrix in rjags

For this example I use the package rjags. There are \(p\) predictors; typically the first is an intercept. As before, elements of the length-\(p\) coefficient vector \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)\) are multiplied by corresponding elements of the design vector \(\mathbf{x}_i = (1, x_{i2}, \dots, x_{i,p})\). The design matrix organizes the vectors \(\mathbf{x}_i, i = 1, \dots, n\) into an \(n \times p\) matrix \(\mathbf{X}\).

I set up the design matrix using the function model.matrix. There are missing values in xdata, which affects how I implement model.matrix. The default treatment of missing values in R packages is to disregard the entire observation. To see where missing values occur you could do this: which(is.na(xdata), arr.ind = T). The design matrix constructed by model.matrix follows this convention, so I need to subject the response to the same treatment. I get X and Y with the same observations omitted this way:

y     <- ydata[,spec]                                              # wood thrush counts
xmiss <- which(is.na(xdata),arr.ind=T)                             # locations of missing values
X <- model.matrix(y ~ temp + year + StartWindFactor, data=xdata)   # design matrix, NA rows dropped
Y <- model.frame(y ~ temp + year + StartWindFactor, data=xdata)$y  # response with the same rows dropped

I take a moment to look at the design matrix X,

head(X)
    (Intercept) temp year StartWindFactor2 StartWindFactor3 StartWindFactor4
1-2           1 24.4 1966                1                0                0
1-3           1 26.9 1967                1                0                0
1-4           1 23.1 1968                1                0                0
1-5           1 26.7 1969                1                0                0
1-6           1 21.1 1970                0                0                0

The three indicator columns for the factor StartWind contain 1’s for the non-reference levels of this factor. The reference level is the first level,

attr(xdata$StartWindFactor,'levels')
## [1] "1" "2" "3" "4"

Again, there is no column for level 1, because it is absorbed into the intercept. Thus, the coefficients for the other levels are relative to level 1.
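
If I wanted a different reference level, I could relevel the factor before building the design matrix; this is a sketch rather than something I do here:

tmp <- relevel( xdata$StartWindFactor, ref = '2' )   # make level 2 the reference
head( model.matrix( ~ tmp ) )                        # indicator columns now for levels 1, 3, 4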

rjags implementation

I discuss theory and algorithms for Gibbs sampling in later units. For now, I simply implement rjags, using the model graph to discuss the fit at a high level.

For rjags I have to remember that the normal distribution is parameterized with a precision, equal to 1/variance, rather than a variance or standard deviation. To keep this straight, here is the parameterization of the normal distribution as used in three contexts:

context  notation                 which parameter?
text     \(N(\mu,\sigma^2)\)      variance \(\sigma^2\)
most R   dnorm(x, mu, sigma)      standard deviation \(\sigma\)
jags     dnorm(mu, tau)           precision \(\tau = \sigma^{-2}\)
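
To move between the parameterizations in R, a short sketch:

tau   <- 1.0E-5          # precision used in the prior below
sigma <- 1/sqrt( tau )   # implied standard deviation, about 316
sigma^2                  # implied variance, 10^5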

This block of code sets up the model that will be stored in the file jagsWoodThrush.txt. The structure is cat( model string, file ): this writes the model string to the file, which I then read when I run jags. The model looks like executable code, but it is actually setting up the graph. Note the two for loops: the first loops over observations, defining the likelihood given in the equation above; the second loops over elements of the coefficient vector \(\boldsymbol{\beta}\), defining their priors.

file <- "jagsWoodThrush.txt"                # model file

# set up the 'graph'
cat("model{                                 
  for(i in 1:n){                             
    Y[i] ~ dpois(lambda[i])                 
    lambda[i] <- exp( inprod(beta[],X[i,]) ) 
  }
   for (i in 1:p) {
     beta[i] ~ dnorm(0, 1.0E-5)              
   }
}", file = file)

I can find the file generated by this code in my working directory.

Here is an analysis with rjags. Below, the function jags.model builds the model that I sample with MCMC. The first line defines the data as an object that I call dataList; note that each object in this list is used in the model. Arguments to jags.model specify initial values for estimates (inits), the file holding the model, the number of chains (n.chains), and the number of adaptation ("burnin") iterations (n.adapt). The subsequent call to coda.samples sets the total number of sampling iterations, n.iter. I save the data objects for later analysis as a compressed file:

dataList <- list(Y = Y, X = X, n = nrow(X), p = ncol(X))  # specify data objects
save( dataList, file='jagsData.Rdata')

# computation
outjags <- jags.model(data = dataList, inits = NULL, file = file, 
                      n.chains = 3, n.adapt = 500 )      # translate graphical model in file
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 2078
##    Unobserved stochastic nodes: 6
##    Total graph size: 19897
## 
## Initializing model
params <- 'beta'
output <- coda.samples( outjags, params, n.iter = 5000 ) # posterior simulation

summary( output )
## 
## Iterations = 501:5500
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##              Mean       SD  Naive SE Time-series SE
## beta[1]  5.189551 2.704865 2.209e-02      1.2215808
## beta[2] -0.011593 0.002189 1.787e-05      0.0001967
## beta[3] -0.001331 0.001349 1.101e-05      0.0006120
## beta[4]  0.053291 0.016804 1.372e-04      0.0003988
## beta[5] -0.001114 0.023930 1.954e-04      0.0003854
## beta[6] -0.187601 0.043809 3.577e-04      0.0006562
## 
## 2. Quantiles for each variable:
## 
##              2.5%       25%        50%        75%     97.5%
## beta[1]  0.397136  2.900134  5.2317030  7.5102618  9.704023
## beta[2] -0.016096 -0.013045 -0.0115820 -0.0100760 -0.007313
## beta[3] -0.003573 -0.002472 -0.0013665 -0.0002177  0.001042
## beta[4]  0.020374  0.041920  0.0533381  0.0647435  0.086029
## beta[5] -0.048636 -0.017196 -0.0008988  0.0149631  0.045203
## beta[6] -0.274283 -0.217651 -0.1875291 -0.1573784 -0.103582
summary( window(output, start = 501) ) # remove early iterations
## 
## Iterations = 501:5500
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##              Mean       SD  Naive SE Time-series SE
## beta[1]  5.189551 2.704865 2.209e-02      1.2215808
## beta[2] -0.011593 0.002189 1.787e-05      0.0001967
## beta[3] -0.001331 0.001349 1.101e-05      0.0006120
## beta[4]  0.053291 0.016804 1.372e-04      0.0003988
## beta[5] -0.001114 0.023930 1.954e-04      0.0003854
## beta[6] -0.187601 0.043809 3.577e-04      0.0006562
## 
## 2. Quantiles for each variable:
## 
##              2.5%       25%        50%        75%     97.5%
## beta[1]  0.397136  2.900134  5.2317030  7.5102618  9.704023
## beta[2] -0.016096 -0.013045 -0.0115820 -0.0100760 -0.007313
## beta[3] -0.003573 -0.002472 -0.0013665 -0.0002177  0.001042
## beta[4]  0.020374  0.041920  0.0533381  0.0647435  0.086029
## beta[5] -0.048636 -0.017196 -0.0008988  0.0149631  0.045203
## beta[6] -0.274283 -0.217651 -0.1875291 -0.1573784 -0.103582
plot( output, bty = 'n' )

Compare the coefficient summaries with and without start = 501; dropping early iterations matters when the chains have not yet stabilized. One option in that case is to reinitialize the model using coefficients from the previous fit.

What do the trace plots suggest about the model fit?

I now have an object outjags of class "rjags" and an object output of class mcmc.list, which holds the MCMC chains along with other information about the model and fit; it contains lists within lists. Below I plot the individual chains and then aggregate them into a single matrix for a combined density plot. We’ll discuss this aggregation in the next few meetings.

par(mfrow=c(7,2), mar=c(3,3,1,1), bty='n')   # base-graphics layout (the lattice plots below ignore par)

allchains <- numeric(0)                      # stack the three chains into one matrix
nchains   <- length(output)
for(i in 1:nchains){
  allchains <- rbind(allchains, output[[i]])
}
allchains <- as.mcmc(allchains)              # convert back to an mcmc object

xyplot(output)          # using lattice for individual chains

densityplot(output)     # density for individual chains

densityplot(allchains)  # density for combined chains

Note that the last densityplot, of allchains, combines the three MCMC chains.
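
The coda package also provides numerical diagnostics that complement these plots; a sketch using functions from coda, which is already loaded:

gelman.diag( output )       # Gelman-Rubin statistic across the three chains
effectiveSize( output )     # effective sample size for each coefficient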

Exercise 10: Why is the wood thrush declining? Has the model converged? Repeat this analysis with a different species, then identify differences between the glm and Bayesian results.

Recap

A minimal knowledge of data types and storage modes is needed to manipulate objects in R. Functions are available from the base R package, supplemented by others that can be installed from CRAN, github, or sourceforge, or written yourself.

Data analysis begins with data formatting and exploration. Data analysis requires OxV (observations \(\times\) variables) format. I review packages and functions that can help.

The Poisson regression example provides an introduction to generalized linear models (GLMs), fitted here by conventional maximum likelihood (glm).

We will learn about how a Bayesian analysis like this works. For now, a first experience with rjags sets us up for modeling that we’ll discuss in the next weeks.