Before modeling begins there is a question and, often, data that will be
used to address it. If data are observational, exploratory data
analysis (EDA) is the initial step for identifying relationships between
variables and spatio-temporal trends. Following EDA of Breeding Bird
Survey data, we apply a Bayesian model in rjags.
Wood thrush populations may be declining.
Data: you can download the Breeding Bird Survey, with its explanation
and supporting data, from the BBS site. For North Carolina the file
BBSexampleTime.rdata is on github.
Software
The clarkFunctions2026.R file is on
Canvas: source('clarkFunctions2026.r')
rjags: If you are on a mac, you may have trouble
installing rjags. If so, use this
link to download the newest version of JAGS. Then install and load
packages:
library( rjags )
library( maps )
library( coda )
library( TruncatedNormal )
library( lattice )
library( repmis )
Here is some background on wood thrush migration. Objectives of this vignette include working with factors and using rjags for Bayesian analysis.
The migratory wood thrush population is declining. There are multiple threats, both here in the temperate zone and in their tropical wintering grounds in Central America. Stanley et al. (2015) quantified declines in wood thrush abundance in the breeding range by region and related them to forest cover. They used tracking data to determine if regional differences in the US might be explained by conditions in their differing wintering grounds (Central America). I use this example to examine changes in abundance using the breeding-bird survey (BBS) data. Specifically, we will ask whether or not wood thrush populations could be in decline in NC and, if so, which variables in the BBS data might help us understand it.
Data manipulation remains the most time-consuming element of analysis. The popularity of R begins with the flexibility it provides for data structures and management.
A user can interact with R at a range of levels.
High-level interactions with transparent functions like lm
(linear regression) require little more than specification of predictors
and a response variable. At the same time, R allows for
low-level algorithm development. In fact, functions written in C++ and
Fortran can be compiled and used directly with R. This
combination of high- and low-level interaction allows one to enter at
almost any degree of sophistication, then advance, without requiring a
change in software.
The Wickham article is good background on data structures; it lists the most common problems with them.
The most important concept to take from this article is that of observations by variables, the basic structure needed for data modeling.
This section on R application follows the introduction in
the vignette on Introduction to
R.
Because I am now interacting with files, it’s worth a few words on
working directories and paths to files. When I open RStudio I am in a
working directory. I can interact directly with files
that are stored in my working directory. To interact with a file in a
different directory, I need to either supply a path to that
directory or change my working directory using the
Session tab in RStudio.
I can determine which directory I am in with the function
getwd(). I can move to a different directory using
setwd(pathToThere).
Here I am:
getwd()
## [1] "/Users/jimclark/Documents/Classes/bayesClass665_2026/6bbs"
Syntax for paths is standard unix/linux. I move down a directory by
giving a path with the directory name followed by /. I can
move up a directory by starting the path with "..".
As discussed above, files are identified by their location, or
path, and the file name. To determine which files are in my
current directory, I use list.files(). This returns a
character vector of file names. To find the names of files
in a different directory I need to include the path. The
path can be relative to the current directory. This code
works for me, because the paths match the locations of directories
relative to my current directory:
list.files() # files in current directory
list.files("../dataFiles/") # directory in the parent directory
list.files("../dataFiles/dataBBS") # a subdirectory of dataFiles
I interact with files through functions that allow me to
access their contents or create new files to hold content that is
currently in my working environment. Common formats for data include
.txt and .csv files, but there are dozens of
others.
An internet search will often locate an R package that
reads a specific file type. Some files may require modification, either
to make them readable or to extract information they contain that is not
extracted by the package I find to read them. The R function
readLines allows the desperate measure of reading files
line-by-line. This might be needed when, say, a .txt file
contains non-ASCII characters. If I have an unfamiliar file type, I
browse the internet.
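As a hedged sketch (the file name rawNotes.txt and the cleaning step are hypothetical), readLines pulls in raw text that I can inspect and repair before any parsing:
raw <- readLines( 'rawNotes.txt', warn = FALSE )   # one character string per line
head( raw )                                        # inspect the first few lines
bad <- grep( '[^ -~]', raw )                       # lines containing non-ASCII characters
raw[ bad ] <- iconv( raw[ bad ], from = '', to = 'ASCII', sub = ' ' ) # replace them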
R has its own compression format .rdata that is commonly
used when there are multiple objects that will be processed and stored
together. These files are accessed through the
load function.
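Here is a minimal sketch of the round trip (the object and file names are made up): save writes any set of R objects to a compressed .rdata file, and load restores them under their original names:
x <- rnorm( 10 ); y <- letters[ 1:5 ]
save( x, y, file = 'myObjects.rdata' )   # write both objects to one compressed file
rm( x, y )                               # remove them from the working environment
load( 'myObjects.rdata' )                # restores x and y by name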
Here I load observations contained in two objects, xdata
(predictors) and ydata (bird counts) from the internet. The
function source_data from package repmis wraps
the load within a function that downloads the file
first:
d <- "https://github.com/jimclarkatduke/gjam/blob/master/BBSexampleTime.rdata?raw=True"
source_data(d)
## [1] "xdata" "ydata" "missingEffort" "timeList"
## [5] "edata"
The console displays names of objects loaded into the working
environment. Here are a few lines of xdata:
| | lon | lat | Route | soil | nlcd | year | temp | StartWind | StartSky | juneTemp |
|---|---|---|---|---|---|---|---|---|---|---|
| 1-0 | -78.37957 | 33.98146 | 1 | Spodosols | forest | 1964 | NA | NA | NA | 25.6 |
| 1-1 | -78.37957 | 33.98146 | 1 | Spodosols | forest | 1965 | NA | NA | NA | 24.5 |
| 1-2 | -78.37957 | 33.98146 | 1 | Spodosols | forest | 1966 | 24.4 | 2 | 2 | 23.8 |
| 1-3 | -78.37957 | 33.98146 | 1 | Spodosols | forest | 1967 | 26.9 | 2 | 2 | 24.1 |
| 1-4 | -78.37957 | 33.98146 | 1 | Spodosols | forest | 1968 | 23.1 | 2 | 2 | 25.4 |
Here are a few lines of ydata:
| | MourningDove | Red-belliedWoodpecker | ChimneySwift | EasternWood-Pewee | BlueJay |
|---|---|---|---|---|---|
| 1-0 | 0 | 0 | 0 | 0 | 0 |
| 1-1 | 4 | 1 | 1 | 0 | 4 |
| 1-2 | 12 | 5 | 4 | 2 | 16 |
| 1-3 | 51 | 5 | 4 | 8 | 19 |
| 1-4 | 15 | 5 | 6 | 4 | 27 |
The storage modes of these objects are
mode(xdata) = list and mode(ydata) = numeric.
I created these objects from files on the BBS site.
One of the important things about this structure is the alignment of
observations in xdata and ydata. In fact, they
could be the same file. I’ve separated them here because there are a
large number of species in ydata, and I find it easier to
work with fewer columns.
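A quick hedged check that the two objects align row-for-row (rownames are assigned below, so the last line applies after that step):
nrow( xdata ) == nrow( ydata )                    # same number of observations?
identical( rownames(xdata), rownames(ydata) )     # do observation labels match?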
data.frame
Data are often stored in formats that are not amenable to analysis;
reformatting is required. For analysis I typically want a
matrix or data.frame with observations as rows
and variables as columns, OxV format.
Many data sets are not organized as OxV, often for good reasons, but they must be put in OxV format for analysis. Here are terms I use:
| terms | definition |
|---|---|
| observation | what I record at a site/location/plot, numeric or not |
| sample | observations-by-variables matrix or data.frame |
| design matrix | observations-by-variables matrix, any factors converted to indicators |
| covariate | a predictor that takes continuous values |
| factor | a discrete predictor that can have 2, 3, … levels |
| level | a factor has at least 2 levels |
| main effect | an individual predictor |
| interaction | a combination of more than one predictor |
The list saved as xdata looks like a table,
but it is not a matrix. It is a special type of
list called a data.frame. Although
xdata has rows and columns, it is not stored as a
matrix, because it may include factors or even
characters. A data.frame is rendered to the
console as rows and columns for the benefit of the user, but it is not
stored as a matrix and, thus, does not admit
matrix operations. The columns of a data.frame
must have the same lengths, but they do not need to be of the same
storage mode. In this case xdata holds
information about each observation, including location
(lon, lat), the observation Route, the
soil type (a factor), and the nlcd
land cover type (also a factor). Although I cannot do numeric operations across columns
of a list, a data.frame is useful for
structuring data. I find the dimensions of xdata with a
call to dim(xdata) = [2751, 18]. These two numbers are rows
and columns, respectively.
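A small hedged illustration (toy data, not xdata) of why a data.frame holding a factor does not behave like a matrix:
df <- data.frame( count = c(3, 7), soil = factor( c('Spodosols','Ultisols') ) )
as.matrix( df )      # everything becomes character; no numeric operations possible
data.matrix( df )    # factors are replaced by their integer level codes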
It can be helpful to assign a name to each observation as
rownames. I add rownames to several
data.frames here:
rownames( xdata ) <- rownames( ydata ) <- rownames( edata ) <- paste( xdata$Route, xdata$year, sep = '_' )
A related object in R is the tibble; see ??tibble for a description.
Some data.frame behaviors can
be troublesome if you are not familiar with them. These will come up as
we proceed. Because there is so much code that uses the
data.frame, I continue to use it here.
Previously, we used sapply to explore the objects in a
data.frame:
sapply( xdata, is.numeric ) # is each column numeric?
## insert groups times lon lat Route soil
## TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## nlcd year temp StartWind StartSky precSite precAnom
## FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## defSite defAnom winterTmin juneTemp
## TRUE TRUE TRUE TRUE
which( sapply( xdata, is.factor ) ) # which columns are factors?
## soil nlcd
## 7 8
I’ll expand on factors in the next section.
Exercise 1: Are there character variables
in xdata? How did you find out?
Exploratory data analysis (EDA) is the first step required for an analysis. EDA helps to determine if the scientific question can be answered with the data at hand. It guides model development and computation.
The relationships between variables must be explored before any modeling begins. Do data provide sufficient coverage in space, time, and predictor and response variables? Are there relationships between variables? Is there ‘signal’? In this section I learn about the data through EDA.
An observation is a route-year, and that is how
xdata is organized. It is observations by
variables. If I want to examine counts of wood thrush as
routes (rows) by years (columns), I can use
match in the function vec2Mat to generate a
route by year matrix.
spec <- 'WoodThrush'
routeByYr <- vec2Mat( ydata[,spec], xdata$Route, xdata$year ) # counts
| 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 2 | 0 | 1 | 4 | 3 | 5 |
| NA | NA | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 4 | 11 | 1 | 1 | 5 | 19 | 1 | 35 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
The matrix routeByYr helps to visualize each
Route as a time series. Missing values, NA are
routes that were not observed in that year.
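vec2Mat comes from clarkFunctions2026.R; I have not shown its code, but a simplified stand-in built on match and matrix indexing might look like this (a sketch of the idea, not the actual function):
rts <- sort( unique( xdata$Route ) )              # unique routes (rows)
yrs <- sort( unique( xdata$year ) )               # unique years (columns)
rby <- matrix( NA, length(rts), length(yrs), dimnames = list(rts, yrs) )
rby[ cbind( match(xdata$Route, rts), match(xdata$year, yrs) ) ] <- ydata[, spec]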
The wood thrush summer range (orange) includes North Carolina. The winter range is in Central America (blue).
We can also obtain marginal summaries and express them as counts per effort (CPE),
cByYr <- colSums( routeByYr, na.rm=T ) # counts by year
nByYr <- colSums( routeByYr*0+1, na.rm=T ) # routes by year (why does this work?)
nByYr <- table( xdata$year ) # another method
CPE <- cByYr/nByYr # counts per route by year
Some of these objects tell me about data structure, such as
nByYr, the number of routes observed by year.
To explore the data I first generate a map of the observations. Here is a map of the routes:
map( 'county', xlim = c(-85, -75), ylim = c(33.6, 36.8), col='grey')
map( 'state', add=T)
points( xdata$lon, xdata$lat, pch = 16 )
BBS route locations.
I see balanced coverage in space, from mountains to coast.
I now examine counts for wood thrush, using symbol size to represent
the species count and color to represent the variables for moisture
deficit.
yspec <- ydata[, spec ]
cy <- tapply(yspec, xdata$Route, mean, na.rm=T) # route means
wi <- match( as.numeric(names(cy)), xdata$Route ) # index for each route
ll <- xdata[ wi, c('lon','lat') ] # locations
def <- xdata$defSite[wi] # route moisture deficit
Now I assign colors to def that range from wet (blue) to
dry (brown). The function colorRampPalette generates a new
function that takes the number of colors needed,
nlev, and interpolates a new sequence of nlev
colors.
ramp <- c("#2166ac", "#01665e", "#d8b365", "#8c510a") # color ramp
nlevs <- 20 # no. of color bins
df <- seq( min(def), max(def), length = nlevs ) # bins span range of def
colT <- colorRampPalette( ramp ) # new function to assign colors
cols <- colT( nlevs ) # interpolate colors
di <- findInterval(def, df, all.inside = T) # assign color level
There is now one moisture deficit value (def) for each
Route. Their locations, as lon /lat, are held in
ll. Here is a histogram of def:
hist( def, nclass = nlevs, col = cols, border = cols, main = '' )
Annual moisture deficit ranges from negative (wet) to positive (dry).
Note dry colors are positive. Here they are on the map:
map( 'county', xlim = c(-85, -75), ylim = c(33.6, 36.8), col = 'grey' )
map( 'state', add = T )
cex <- 6*cy/max(cy, na.rm=T) # symbol size scaled to counts
points( ll[,1], ll[,2], pch = 16, cex = cex, col = cols[di] ) # note colors
title(spec)
Low moisture deficits are confined to western mountains. Large symbols indicate high counts.
Recall that symbol size is proportional to the average count
cy.
Exercise 2. What does this map tell us about the spatial distribution of bird counts and the moisture deficits where they are common?
To explore this question I need to structure the BBS data. Again, an
observation includes information (Route, Year)
and the count for each species that is seen or heard. It
includes a description of the weather, which affects behavior. From the
table above, a sample is an observations-by-variables
matrix or data.frame. It is the fundamental
unit for analysis; I draw inference from a sample. This
format is OxV (‘obs by variables’) format. The
design matrix is a matrix, which is
numeric. It contains predictors, which can be main effects and
interactions. It can include covariates and factors. It can be analyzed,
because it is numeric.
If a variable is stored as a character it may need to be
declared a factor. Confusion can arise when a factor is
represented by integer values, e.g., \(1, 2,
\dots\); it will be analyzed as a factor only if I declare it
with factor(variable). When the data.frame is
converted to a design matrix, the factor
levels each occupy a column of zeros and ones (more on this later).
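As a small hedged illustration with toy data (not the BBS variables), declaring integer codes as a factor changes how model.matrix expands them into a design matrix:
sky <- c( 1, 2, 2, 3, 1 )            # integer codes for a categorical variable
model.matrix( ~ sky )                # WRONG here: treated as one continuous covariate
model.matrix( ~ factor(sky) )        # one 0/1 indicator column per non-reference level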
Here are raw counts plotted by year:
plot( xdata$year, ydata[,'WoodThrush'], xlab = 'Year', ylab = 'Wood thrush counts',
cex = .4, pch = 16, col = cols[di] )
Here I organize the counts by route and year:
thrush <- tapply(yspec, list( route = xdata$Route, year = xdata$year),
mean, na.rm=T)
Take a look at the structure of the matrix thrush. Here
I add the mean taken over all routes in a given year:
thrushYear <- colMeans( thrush, na.rm=T) # average by year
year <- as.numeric(names(thrushYear)) # all years in the study
plot(xdata$year, ydata[,'WoodThrush'], xlab = 'Year',
ylab = 'Wood thrush counts', cex = .5, pch = 16, col = cols[di] )
lines(year, thrushYear, col='white', lwd=5)
lines(year, thrushYear, lwd=2)
thrushQuant <- apply( thrush, 2, quantile, na.rm=T) # quantiles for counts
lines(year, thrushQuant[2,], col='white', lwd=5)
lines(year, thrushQuant[4,], col='white', lwd=5)
lines(year, thrushQuant[2,], lwd=1)
lines(year, thrushQuant[4,], lwd=1)
Counts in time with quantiles
Exercise 3. Can we tell from this plot if birds are declining? Are there alternative explanations?
I start by exploring the data to get a better idea of sample effort in space and time, first as a histogram of routes observed by year, then as a map showing where routes are located in the state:
hist( xdata$year, main='', xlab = 'Year', nclass=60 ) # routes by year
title('effort by year')
xdata$Route <- as.numeric( as.character(xdata$Route) ) # factor to numeric
effortByRoute <- tapply( xdata$year*0 + 1, xdata$Route, sum, na.rm = T )
mm <- match( names(effortByRoute), xdata$Route )
lonLat <- xdata[mm, c('lon','lat')]
routeSummary <- data.frame( route = xdata$Route[mm], lonLat, effort = effortByRoute )
map('county', region = 'north carolina', col = 'grey') # symbols scaled by obs per route
points( routeSummary$lon, routeSummary$lat, cex = .03*effortByRoute, pch=16 )
title('effort by route')
Distribution in space and time.
The big change in effort after 1987 needs more thought. Which sites were added? Here’s a map:
firstYr <- tapply(xdata$year, xdata$Route, min, na.rm=T) # yr a route started
firstYr <- firstYr[ as.character(routeSummary$route) ]
routeSummary <- cbind( routeSummary, firstYr )
col <- rep('blue', nrow(routeSummary)) # early routes
col[ routeSummary[, 'firstYr'] > 1987 ] <- 'darkgreen' # added later
map('county',region = 'north carolina')
points( routeSummary$lon, routeSummary$lat, cex=.055*routeSummary$effort, pch=16, col = 'white' )
points( routeSummary$lon, routeSummary$lat, cex=.05*routeSummary$effort, pch=16, col=col)
Early (blue) and late (green) routes
Are these sites changing the distribution of data with respect to, say, temperature?
plot(xdata$year, xdata$juneTemp, cex=.2, xlab = 'Year', ylab='degrees C' )
title('Temperature trend')
routes <- sort(unique(routeSummary$route))
nroute <- length(routes)
for(i in 1:nroute){
wi <- which(xdata$Route == routes[i])
lines(xdata$year[wi], xdata$juneTemp[wi], col = col[i])
}
tmu <- tapply(xdata$juneTemp, xdata$year, mean, na.rm=T)
years <- as.numeric(names(tmu))
lines( years, tmu, col=2, lwd = 3) # annual average T
Temperature trends by route.
Exercise 4. What do the sizes of the symbols in the map indicate? How would I determine if adding new sites after 1987 has biased the sample toward warmer or cooler temperatures? What about the distribution of data suggests that trends in time for bird abundances could be hard to estimate?
What is observed in a given year could depend less on the mean temperature for the site than on the anomaly: is it a warm or cold year relative to the site mean? Here are June temperatures represented as anomalies from the site mean:
trend <- lm(tmu ~ years) # overall trend
# local anomaly
tmuByRoute <- tapply(xdata$juneTemp, xdata$Route, mean, na.rm=T)
tanomaly <- xdata$juneTemp - tmuByRoute[ as.character(xdata$Route) ]
plot(xdata$year, tanomaly, cex=.2 )
abline(h = 0, lty=2)
Temperature anomalies
cor( tanomaly, ydata[,spec] ) # correlation with thrush
## [1] -0.08288059
Exercise 5. Does the temperature anomaly explain variation in wood thrush count?
Recall that some variables are stored as factors:
which( sapply(xdata, is.factor) )
## soil nlcd
## 7 8
Here I note that variables clearly recorded as factor levels are not
stored that way. The variables StartWind and
StartSky are potentially important, because they affect
bird activity. From documentation I learned that wind codes are an
ordered factor. Here are sky classification codes used in the original
data:
| Sky | description | new name |
|---|---|---|
| 0 | Clear or few clouds | clear |
| 1 | Partly cloudy (scattered) or variable sky | clouds |
| 2 | Cloudy (broken) or overcast | clouds |
| 4 | Fog or smoke | fogSmoke |
| 5 | Drizzle | rain |
| 7 | Snow | snow |
| 8 | Showers | rain |
| 9 | N/A | unknown |
plot( jitter(xdata$StartWind), yspec, cex=.2, xlab = 'StartWind factor level',
ylab = 'Wood thrush counts' )
Wind is a discrete factor with unbalanced levels.
A common problem with factors is that the representation of levels
can be highly uneven, which has a big impact on a model fit. Each
factor level will be treated like an intercept in model
fitting. The version of xdata loaded here has nine levels
from the BBS data aggregated into four,
table( xdata$StartWind )
##
## 1 2 3 4
## 1189 597 224 69
This ordinal scale is ranked from low to high. When there is an imbalance in factor levels (as here), it may be worth additional aggregation. Due to sample size alone, well-represented levels may prove important (e.g., “significant”) while rare levels do not. Is there an important difference between, say, levels 3 and 4? If not, aggregating them into a single level will reduce the imbalance, as in the sketch below.
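Here is one hedged way to do that aggregation (a sketch; whether levels 3 and 4 should be combined is a judgment call):
wind <- factor( xdata$StartWind )          # ordinal codes as a factor
levels( wind ) <- c('1', '2', '3+', '3+')  # merge levels 3 and 4 into a single level
table( wind )                              # check the new level counts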
To appreciate how factors are treated in model fitting, try a simple
example with the variable xdata$StartWind. Here is the
effect of ignoring that this is a factor:
# create a factor version for each variable
xdata$routeFactor <- as.factor(xdata$Route)
xdata$StartWindFactor <- as.factor(xdata$StartWind)
xdata$SkyFactor <- as.factor(xdata$StartSky)
fit1 <- glm(yspec ~ xdata$StartWind, family = poisson) # WRONG: fitted as non-factor
fit2 <- glm(yspec ~ xdata$StartWindFactor, family = poisson) # CORRECT
summary(fit1)
##
## Call:
## glm(formula = yspec ~ xdata$StartWind, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.310526 0.015492 149.143 <2e-16 ***
## xdata$StartWind -0.012172 0.008676 -1.403 0.161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 17421 on 2078 degrees of freedom
## Residual deviance: 17419 on 2077 degrees of freedom
## (672 observations deleted due to missingness)
## AIC: 24753
##
## Number of Fisher Scoring iterations: 5
summary(fit2)
##
## Call:
## glm(formula = yspec ~ xdata$StartWindFactor, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.281335 0.009269 246.131 < 2e-16 ***
## xdata$StartWindFactor2 0.054364 0.015747 3.452 0.000556 ***
## xdata$StartWindFactor3 -0.005901 0.023337 -0.253 0.800385
## xdata$StartWindFactor4 -0.185720 0.043225 -4.297 1.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 17421 on 2078 degrees of freedom
## Residual deviance: 17385 on 2075 degrees of freedom
## (672 observations deleted due to missingness)
## AIC: 24722
##
## Number of Fisher Scoring iterations: 5
Exercise 6. What is the difference in your interpretation of the wind variable in the two fits? Explain what the coefficients mean in the two fits.
Exercise 7. From the object trend, is
temperature increasing or decreasing on average? Is this trend
‘significant’? Does it differ by site? Hint: consider including route as
a factor.
To examine trends in counts over time I start with scatter plots and a spline function for counts by year. To highlight the range of abundance I use a color gradient. I overlay a spline fit to summarize mean trends:
ramp <- c('#d0d1e6','#014636') # color gradient
nlev <- max(yspec) + 1
ylim <- c(0,80)
cf <- colorRampPalette( ramp )(nlev)
plot( jitter(xdata$year), jitter(yspec), col=cf[yspec+1], cex=.4, pch=15,
xlab='year', ylab='count', ylim=ylim)
spl <- spline(yspec ~ xdata$year)
lines(spl,lwd=5,col='white')
lines(spl)
nc <- 30 # frequencies before 1987, after 2000
y1 <- 1:(nc-1)
sc <- c(round(y1/nc*nlev,0), nlev)
old <- hist(yspec[xdata$year <= 1987],plot=F, breaks=c(0,sc))
new <- hist(yspec[xdata$year >= 2000],plot=F, breaks=c(0,sc))
plot(NULL,xlim=c(0,.15),ylim=ylim, ylab='',xlab='frequency', yaxt = 'n')
colorSegment(old$density, old$mids, ramp, nc, lwd=2)
colorSegment(new$density, new$mids, ramp, nc, lwd=3)
text( .04, 24, 'pre-1987')
text( .07, 10, 'post-2000')
Counts shaded by abundance with a spline fit at left. Frequencies of counts before 1987 and after 2000 at right.
Exercise 8. How does the distribution of counts change after 1987?
Unlike linear regression for continuous responses with a Gaussian likelihood, the observations in this example are discrete counts. The linear model is not appropriate, because observations can only be non-negative integers, \(y = 0, 1, 2, \dots\).
Here is a simple generalized linear model (GLM), more precisely termed a Poisson regression, used to fit parameters:
\[ \begin{aligned} y_i &\sim Poi(\lambda_i) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] The term GLM refers to the fact that there is a linear function of parameters (second line) embedded within a non-Gaussian distribution. The first line is a Poisson distribution. The Poisson distribution has one parameter, an intensity \(\lambda > 0\). It describes count data \(y \in \{0, 1, \dots\}\), i.e., non-negative integers. The Poisson has the unusual property that \(\lambda\) is both the mean and the variance of the distribution, \(E[y] = Var[y] = \lambda\).
The second line is linear in parameters, but it is a non-linear function of \(\lambda\). In this context, the log function is called the link function, connecting the linear model to the Poisson sampling distribution.
For a sample of \(n\) observations, the likelihood is
\[[\mathbf{y}|\mathbf{X},
\boldsymbol{\beta}] = \prod_{i=1}^n Poi(y_i | \lambda_i)\] where
\(\mathbf{y}\) is the length-\(n\) vector of counts, and \(\mathbf{X}\) is the \(n \times p\) matrix of predictor variables.
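To make this concrete, the log of this product can be evaluated directly in R for a generic design matrix X, count vector y, and coefficient vector b (a sketch, not part of the fitting code that follows):
poisLogLik <- function( b, X, y ){         # log-likelihood for Poisson regression, log link
  lambda <- exp( X %*% b )                 # intensity for each observation
  sum( dpois( y, lambda, log = TRUE ) )    # sum of log Poisson densities
}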
The R function glm can be used to fit GLMs. I give it a
formula, the data, and the model family,
family = distribution("link"). I include some predictors from
the BBS data:
form <- as.formula( 'yspec ~ temp + year + StartWindFactor' )
fit <- glm( form, data = xdata, family = poisson("log"))
summary( fit )
##
## Call:
## glm(formula = form, family = poisson("log"), data = xdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.5837470 0.9715180 65.448 < 2e-16 ***
## temp -0.0213270 0.0020261 -10.526 < 2e-16 ***
## year -0.0304594 0.0004853 -62.758 < 2e-16 ***
## StartWindFactor2 -0.0745468 0.0159526 -4.673 2.97e-06 ***
## StartWindFactor3 -0.1264779 0.0235340 -5.374 7.69e-08 ***
## StartWindFactor4 -0.3814783 0.0433770 -8.794 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 17420 on 2077 degrees of freedom
## Residual deviance: 13607 on 2072 degrees of freedom
## (673 observations deleted due to missingness)
## AIC: 20945
##
## Number of Fisher Scoring iterations: 5
This code specifies a GLM model for response (counts) y
with two covariates, temp + year, and one factor,
StartWindFactor. I want a log link, i.e., \(\log(\lambda_i) = \mathbf{x}'_i
\boldsymbol{\beta}\).
Exercise 9. Which variables in this GLM appear to be important and why?
I now turn to a Bayesian implementation. The package used here,
rjags, uses Markov chain Monte Carlo (MCMC) simulation. Here I'll
start with a simple Metropolis sampler. We'll discuss technical details
in coming vignettes.
The model I fit now includes a prior distribution for coefficients,
\[ \begin{aligned} \left[ \boldsymbol{\beta} | \mathbf{X}, \mathbf{y} \right] &\propto \prod_{i=1}^n Poi(y_i | \lambda_i) \times \prod_p N \left(\beta_p | 0, 10^5 \right) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \]
The first line holds the likelihood and a normal prior distribution for each \(\beta_p\). This prior distribution is non-informative, because the variance is so large that the prior is effectively flat.
Graphical model for the Bayesian GLM.
The graph does not include a box for \(\lambda\), because it is a deterministic transformation of \(\boldsymbol{\beta}\) and \(\mathbf{x}\). Consider that I could have written the likelihood \(Poi(y_i | \lambda_i)\) as \(Poi \left( y_i | \exp \left[\mathbf{x}'_i \boldsymbol{\beta} \right] \right)\).
I set up the design matrix using the function
model.matrix. There are missing values in
xdata, which affects how I implement
model.matrix. The default treatment of missing values in R
packages is to disregard the entire observation. To see where missing
values occur you could do this:
which(is.na(xdata),arr.ind=T). The design matrix
constructed by model.matrix follows this convention, so I
need to subject the response to the same treatment. I get
X and Y with the same observations omitted this
way:
X <- model.matrix( form, data = xdata)
Y <- model.frame( form, data = xdata)$yspec
I take a moment to look at the design matrix X,
head(X)
| | (Intercept) | temp | year | StartWindFactor2 | StartWindFactor3 | StartWindFactor4 |
|---|---|---|---|---|---|---|
| 1_1966 | 1 | 24.4 | 1966 | 1 | 0 | 0 |
| 1_1967 | 1 | 26.9 | 1967 | 1 | 0 | 0 |
| 1_1968 | 1 | 23.1 | 1968 | 1 | 0 | 0 |
| 1_1969 | 1 | 26.7 | 1969 | 1 | 0 | 0 |
| 1_1970 | 1 | 21.1 | 1970 | 0 | 0 | 0 |
The three indicator columns for the factor
StartWindFactor contain 1’s for the non-reference levels of
this factor. The reference level is the first level,
attr(xdata$StartWindFactor,'levels')
## [1] "1" "2" "3" "4"
Again, there is no column for level 1, because it is
absorbed into the intercept. Thus, the coefficients for the other levels are
expressed relative to level 1.
Here is a Bayesian analysis based on Metropolis sampling, a type of
MCMC that we will discuss in vignettes to follow. For now I’ll just say
that the covariance matrix cmat is a crude approximation to
the shape I expect for the posterior distribution of \(\boldsymbol{\beta}\). I use it to improve
efficiency of the algorithm. Again, more on that later.
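updateBetaGLM is supplied by clarkFunctions2026.R; I have not reproduced it here, but a schematic Metropolis update for the coefficients under the flat normal prior might look roughly like this (an assumption about its internals, not the actual code):
metropStep <- function( b, cmat, X, y ){                  # one Metropolis update of beta
  logPost <- function( bb )                               # log likelihood + log prior
    sum( dpois( y, exp(X %*% bb), log = TRUE ) ) +
    sum( dnorm( bb, 0, sqrt(1e5), log = TRUE ) )
  bnew <- MASS::mvrnorm( 1, as.vector(b), cmat )          # propose new coefficients
  if( log( runif(1) ) < logPost(bnew) - logPost(b) )      # accept with Metropolis ratio
    return( list( beta = bnew, accept = 1 ) )
  list( beta = b, accept = 0 )                            # otherwise keep current values
}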
cmat <- solve( crossprod(X) ) # proposal matrix for Metropolis
bg <- cmat%*%crossprod(X, log( Y + 1) ) # initial coefficient estimates
cmat <- cmat*.1
p <- ncol(X)
n <- nrow(X)
ng <- 2000
bgibbs <- matrix( NA, ng, p, dimnames = list( NULL, colnames(X) ) )
accept <- 0
for( g in 1:ng){
tmp <- updateBetaGLM(bg, cmat, X, Y, likelihood = 'dpois', link = 'log')
bg <- tmp$beta
accept <- accept + tmp$accept
bgibbs[g, ] <- tmp$beta
}
keep <- 500:ng # omit burnin
beta <- cbind( apply( bgibbs[ keep, ], 2, mean ), # table of estimates
apply( bgibbs[ keep, ], 2, sd ),
t( apply( bgibbs[ keep, ], 2, quantile, c(.025,.975) ) ) )
colnames( beta )[1:2] <- c( 'Estimate', 'Std. Error' )
glmBetaBayes <- signif( beta, 4 )
The estimates from this model look like this:
| | Estimate | Std. Error | 2.5% | 97.5% |
|---|---|---|---|---|
| (Intercept) | 63.59000 | 0.9278000 | 61.76000 | 65.40000 |
| temp | -0.02104 | 0.0020960 | -0.02491 | -0.01692 |
| year | -0.03046 | 0.0004642 | -0.03136 | -0.02954 |
| StartWindFactor2 | -0.07467 | 0.0151600 | -0.10390 | -0.04922 |
| StartWindFactor3 | -0.12850 | 0.0240700 | -0.17290 | -0.08259 |
| StartWindFactor4 | -0.38980 | 0.0448500 | -0.47760 | -0.30440 |
Recall fitted coefficients for the traditional model:
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 63.5837470 | 0.9715180 | 65.447834 | 0e+00 |
| temp | -0.0213270 | 0.0020261 | -10.526210 | 0e+00 |
| year | -0.0304594 | 0.0004853 | -62.757806 | 0e+00 |
| StartWindFactor2 | -0.0745468 | 0.0159526 | -4.673023 | 3e-06 |
| StartWindFactor3 | -0.1264779 | 0.0235340 | -5.374255 | 1e-07 |
| StartWindFactor4 | -0.3814783 | 0.0433770 | -8.794485 | 0e+00 |
I return to these estimates after fitting the model in
rjags.
I discuss theory and algorithms for Gibbs sampling in later units.
For now, I simply implement rjags, using the model graph to
discuss the fit at a high level.
For rjags I have to remember that the spread of the
normal distribution is parameterized as a precision,
equal to 1/variance. To keep this straight, here is the parameterization
of the normal distribution as used in three contexts:
| context | notation | which parameter? |
|---|---|---|
| equations | \(N(\mu,\sigma^2)\) | variance \(\sigma^2\) |
| most R | dnorm(x, mu, sigma) | standard deviation \(\sigma\) |
| jags | dnorm(mu, tau) | precision \(\tau = 1/\sigma^2\) |
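A brief hedged reminder of the conversion I keep in mind when moving between R and jags:
sigma <- 2                # standard deviation used by R's dnorm( x, mean, sd )
tau   <- 1 / sigma^2      # precision passed to the jags dnorm( mu, tau )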
This block of code sets up the model that will be stored in a file
jagsWoodThrush.txt. The structure is
cat( model string, file ). This code will write the
model string to file. I will read this file
when I run jags. The model looks like
executable code, but it actually is setting up the graph. Note two
for loops. The first loops over observations, defining the
likelihood, as given in the equation above. The second loops over
elements in the coefficient vector \(\boldsymbol{\beta}\).
file <- "jagsWoodThrush.txt" # model file
cat("model{ # set up the graph
for(i in 1:n){
Y[i] ~ dpois(lambda[i])
lambda[i] <- exp( inprod(beta[],X[i,]) )
}
for (i in 1:p) {
beta[i] ~ dnorm(0, 1.0E-5)
}
}", file = file)
Note the syntax: the symbol ~ indicates “is distributed as”.
As in R, the symbol <- indicates assignment.
I can find the file generated by this code in my working directory.
Here is an analysis with rjags. Below is the
function jags.model that I use to set up the MCMC. The first
line defines the data as an object that I call
dataList. Note that each of the objects in this list is
used in the model. Arguments to the
function jags.model specify initial values for estimates
(inits), the file holding the model, the
number of chains to be simulated (n.chains),
and an adaptation (“burnin”) period, n.adapt. The total number of
MCMC iterations, n.iter, is set when samples are drawn with
coda.samples. I save objects for later analysis as a
compressed file:
dataList <- list(Y = Y, X = X, n = nrow(X), p = ncol(X) ) # specify data objects
save( dataList, file='jagsData.Rdata')
# computation
outjags <- jags.model(data = dataList, inits = NULL, file = file,
n.chains = 3, n.adapt = 500 ) # translate graphical model in file
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 2078
## Unobserved stochastic nodes: 6
## Total graph size: 19897
##
## Initializing model
params <- 'beta'
output <- coda.samples( outjags, params, n.iter = 5000 ) # posterior simulation
summary( output )
##
## Iterations = 501:5500
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta[1] 5.805337 2.740397 2.238e-02 1.2481387
## beta[2] -0.012057 0.002051 1.674e-05 0.0001766
## beta[3] -0.001634 0.001365 1.115e-05 0.0006202
## beta[4] 0.052441 0.016546 1.351e-04 0.0004762
## beta[5] -0.001844 0.023720 1.937e-04 0.0003943
## beta[6] -0.190163 0.045184 3.689e-04 0.0007933
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta[1] 1.217857 3.562913 5.773077 7.9652366 10.9163623
## beta[2] -0.016089 -0.013399 -0.012064 -0.0106913 -0.0079320
## beta[3] -0.004158 -0.002707 -0.001625 -0.0005131 0.0006735
## beta[4] 0.020224 0.041319 0.052539 0.0636193 0.0849490
## beta[5] -0.049083 -0.017564 -0.001796 0.0144223 0.0447375
## beta[6] -0.279474 -0.220870 -0.189565 -0.1597338 -0.1025073
summary( window(output, start = 501) ) # remove early iterations
##
## Iterations = 501:5500
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta[1] 5.805337 2.740397 2.238e-02 1.2481387
## beta[2] -0.012057 0.002051 1.674e-05 0.0001766
## beta[3] -0.001634 0.001365 1.115e-05 0.0006202
## beta[4] 0.052441 0.016546 1.351e-04 0.0004762
## beta[5] -0.001844 0.023720 1.937e-04 0.0003943
## beta[6] -0.190163 0.045184 3.689e-04 0.0007933
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta[1] 1.217857 3.562913 5.773077 7.9652366 10.9163623
## beta[2] -0.016089 -0.013399 -0.012064 -0.0106913 -0.0079320
## beta[3] -0.004158 -0.002707 -0.001625 -0.0005131 0.0006735
## beta[4] 0.020224 0.041319 0.052539 0.0636193 0.0849490
## beta[5] -0.049083 -0.017564 -0.001796 0.0144223 0.0447375
## beta[6] -0.279474 -0.220870 -0.189565 -0.1597338 -0.1025073
plot( output )
Note that the summary with start = 501 is unchanged here,
because the chains returned by coda.samples already begin after the
adaptation period. One option would be to reinitialize the chains at the
coefficients from the previously fitted model.
What do the trace plots suggest about the model fit?
I now have an object outjags of
class "rjags" and an object output of class
mcmc.list, which contains the MCMC chains in addition to other
information about the model and fit. It contains lists within lists. I
first plot the individual chains, then combine the chains for a single
density plot. We'll discuss aggregating chains in the next few
meetings.
par(mfrow=c(7,2), mar=c(3,3,1,1), bty='n')
allchains <- numeric(0)
nchains <- length(output)
for(i in 1:nchains){
allchains <- rbind( allchains, output[[i]] )
}
allchains <- as.mcmc(allchains)
xyplot(output) # using lattice for individual chains
densityplot(output) # density for individual chains
densityplot(allchains) # density for combined chains
Note that the last densityplot of allchains
combines the three MCMC chains. Clearly, the algorithm has still not
converged.
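Convergence can also be checked numerically with functions from the coda package loaded above (a brief sketch):
gelman.diag( output )        # potential scale reduction factor for each coefficient
effectiveSize( output )      # effective sample size combined across the three chains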
Exercise 10: Why is the wood thrush declining? Has the model converged? Repeat this analysis with a different species, then identify differences between the glm and Bayesian results.
These examples highlight at least two points. First, posterior
simulation is an art, and algorithms that work in one setting may need
additional effort when applied to another. The traditional GLM
implemented in the function glm and the Metropolis
sampler in updateBetaGLM are both fast and give essentially
the same estimates. This is to be expected for the non-informative prior
used here. By contrast, rjags was in this case slow and
still far from convergence. The algorithm could have been helped with
more tuning, but this can be difficult when models get
large. In fact, all software for posterior simulation that attempts
broad application confronts this challenge.
Second, as we saw with regression, the simple GLM estimated traditionally or with a non-informative prior should and does lead to the same interpretation. The value of Bayes increases as models become more complex.
A minimal knowledge of data types and storage modes is needed to manipulate objects in R. Functions are available from the base R package. They are supplemented by others that can be installed from the CRAN website, github, sourceforge, or even developed yourself.
Data analysis begins with data formatting and exploration. Data analysis requires OxV (observations \(\times\) variables) format. I review packages and functions that can help.
The example using Poisson regression provides an introduction to
generalized linear models (GLMs), fitted here by conventional maximum likelihood
(glm).
We will learn about how a Bayesian analysis like this works. For now, a first experience with rjags sets us up for modeling that we’ll discuss in the next weeks.