Before modeling begins there is a question and, often, data. If data are observational, then exploratory data analysis (EDA) is the initial step that determines relationships between variables and spatio-temporal trends.

Logistics

Resources

Data

you can download Breeding bird survey, explanation and data, with supporting data.
for North Carolina the file BBSexampleTime.rdata is on github

Software

source('clarkFunctions2021.r')

Readings

Tidy Data, Wickam on concepts that link models and data to algorithms to R, J Stat Soft.
Connectivity of wood thrush breeding, wintering, and migration sites, Stanley et al on tracking wood thrush populations combined with BBS Cons Biol.

For next time

reading
field trip comments from Lane
Assignment 3 exercises from this vignette due 4 Feb.

Today’s plan

groups
- identify questions on Assignment 1
- select coordinators for Assignment 2, due 4 Feb
discussion of questions
this document

Objectives:

basics of exploratory data analysis (EDA)
gain familiarity with R
learn to model with factors
write a formula and graph for a basic GLM

Case study: the migratory wood thrush

The migratory wood thrush population is declining. There are multiple threats, both here in the temperate zone and in their tropical wintering grounds in Central America. Stanley et al. (2015) quantified declines in wood thrush abundance in the breeding range by region, and they related it to forest cover. They used tracking data to determine if regional differences in the US might be explained by the fact that birds summer and wintering grounds were linked. I used this example to examine changes in abundance using the breeding-bird survey (BBS) data. Specifically, we will ask whether or not wood thrush populations could be in decline in NC and, if so, which variables in the BBS data might help us understand it.

We’ll use R to get us started on the BBS data, then let the data themselves lead the transition to data exploration and modeling.

Wood thrush populations may be declining

EDA with `R`?

Data manipulation remains the most time-consuming element of analysis. The challenges posed by data can halt progress long before model analysis begins. The popularity of R begins with the flexibility it provides for data structures and management.

A user can interact with R at a range of levels. High-level interactions with transparent functions like lm (linear regression) require little more than specification of predictors and a response variable. At the same time, R allows for low-level algorithm development. In fact, functions written in C++ and Fortran can be compiled and used directly with R. This combination of high- and low-level interaction allows one to enter at almost any degree of sophistication, then advance, without requiring a change in software.

The Wickam article is a good background on data structures.

interacting with R

This section on R application follows the introduction in the vignette basicR.

where am I?

When I open R studio I am in a directory. I can interact directly with files that are in my working directory. To interact with a file in a different directory, I need to either supply a path to the directory that it occupies, or I set my working directory using the Session tab in Rstudio.

I can determine which directory I am in with the function getwd(). I can move to a different director using setwd(pathToThere). Below is a screenshot of my directories:

Current directory.

Here I am:

getwd()

## [1] "/Users/jimclark/Box/Home Folder jimclark/Private/classes/bayesClasses2021spring/3data"

Syntax for paths is standard unix/linux. I move down a directory by giving a path with the directory name followed by /. I can move up directory by starting the path with "..". For example, the path to a file in the directory dataFiles (see screenshot above) is up, then down:

setwd("../dataFiles/")           # go there
setwd("../2data/")               # back here

Where’s the file?

As discussed above, files are identified by their location, or path, and the file name. The path can be relative to the current directory. To determine which files are in my current directory, I use list.files(). This returns a character vector of file names. To find the names of files in a different directory I need to include the path. This code works for me, because the paths match the locations of directories relative to my current directory:

list.files()                        # files in current directory

##  [1] "~$tesDay3.docx"            "3DataEDA.html"            
##  [3] "3DataEDA.Rmd"              "3EDA.html"                
##  [5] "3EDA.key"                  "3EDA.Rmd"                 
##  [7] "4Data2Bayes.Rmd"           "notesDay3.docx"           
##  [9] "poisBayesGraph.jpg"        "ranIntSlpBBS.rds"         
## [11] "ranIntSlpBBS.stan"         "rsconnect"                
## [13] "ScreenShotBBS.png"         "ScreenShotDirectories.png"
## [15] "wood_thrush.jpg"

list.files("../dataFiles/")         # in dataFiles

## character(0)

list.files("../dataFiles/dataBBS")  # in dataFiles/dataBBS

## character(0)

What’s in the file

I interact with files through functions that allow me to access their contents or create new files to hold content that is currently in my working environment. With so many file formats, ingesting data files has become almost an art. Common formats for data are .txt and .csv files, both of which can be straightforward, but not always. There are now dozens of file formats that can be loaded into R, but each can present challenges.

An internet search will often locate an R package that reads a specific file type. Some files may require modification, either to make them readable or to extract information they contain that is not extracted by the package I find to read them. The R function readLines allows the desperate measure of reading files line-by-line. This is useful when, say, a .txt file contains non-ASCII characters. However, this is no help at all for many file types. One of the most flexible editors appears to be atom, freely available on the internet.

R has its own compression format .rdata that is commonly used for data when there are multiple objects that will be processed together. These files are accessed through the load function.
Here I load observations contained in two objects, xdata (predictors) and ydata (bird counts) from the internet. The function source_data wraps load within a function that downloads the file first:

library(repmis)
d <- "https://github.com/jimclarkatduke/gjam/blob/master/BBSexampleTime.rdata?raw=True"
source_data(d)

## Downloading data from: https://github.com/jimclarkatduke/gjam/blob/master/BBSexampleTime.rdata?raw=True

## SHA-1 hash of the downloaded data file is:
## a258512e4f745d5823f5e7b7e0eed83f5e3b7e2f

## [1] "xdata"         "ydata"         "missingEffort" "timeList"     
## [5] "edata"

# objects look like this:
xdata[1:5,]

ydata[1:5,]

##     MourningDove Red-belliedWoodpecker ChimneySwift EasternWood-Pewee BlueJay
## 1-0            0                     0            0                 0       0
## 1-1            4                     1            1                 0       4
## 1-2           12                     5            4                 2      16
## 1-3           51                     5            4                 8      19
## 1-4           15                     5            6                 4      27
##     AmericanCrow EuropeanStarling CommonGrackle ChippingSparrow EasternTowhee
## 1-0            0                0             1               0             0
## 1-1            2                3             8               0             4
## 1-2            8                0            70               0            27
## 1-3           16                6            41               0            34
## 1-4           16                8            46               1            45
##     NorthernCardinal IndigoBunting Red-eyedVireo BrownThrasher CarolinaWren
## 1-0                0             0             0             0            0
## 1-1                5             3             1             0            5
## 1-2               11            11            12             2           21
## 1-3               25             5            10             2           16
## 1-4               40            11             3             2           18
##     TuftedTitmouse CarolinaChickadee WoodThrush AmericanRobin
## 1-0              0                 0          0             0
## 1-1              1                 2          1             0
## 1-2             12                 1          0             1
## 1-3              4                 0          0             0
## 1-4              2                 3          2             0

The modes of these objects are mode(xdata) = list and mode(ydata) = matrix. I created these objects from files on the BBS site.

The data.frame xdata holds information about each observation, including location (lon, lat), the observation Route, the soil type (a factor), the nlcd land cover type.

Why are the storage modes of xdata and ydata different?

OxV format and the `data.frame`

Data are often stored in formats that are not amendable to analysis; reformatting is required. For analysis I typically want a matrix or data.frame with observations as rows and variables as columns, OxV format.

Many data sets are not organized as OxV, often for good reasons, but they must be put in OxV format for analysis. Here are terms I use:

terms	definition
observation	what I record at a site/location/plot, numeric or not
sample	observations-by-variables `matrix` or `data.frame`
design matrix	observations-by-variables `matrix`, any factors converted to indicators
covariate	a predictor that takes continuous values
factor	a discrete predictor that can have 2, 3, … levels
level	a factor has at least 2
main effect	an individual predictor
interaction	a combination of more than one predictor

The list saved as xdata looks like a matrix. I find the dimensions of xdata with a call to dim(xdata) = [2751, 18]. These two numbers are rows and columns, respectively. Although xdata has rows and columns, it is not stored as a R matrix, because it may include factors or even characters. A data.frame is rendered as rows and columns for the benefit of the user, but it is not stored as a matrix and, thus, does not admit matrix operations. The columns of a data.frame must have the same lengths, but they do not need to be of the same mode. In fact, a data.frame is a type of list.

Although I cannot do numeric operations across columns of a list, a data.frame is useful for structuring data. If all elements of a data.frame are numeric, then the data.frame is converted to a numeric matrix using the function as.matrix.

A related object in R is the tribble, see a description here ??tribble. Some data.frame behaviors can be troublesome if you are not familiar with them. These will come up as we proceed. Because there is so much code available that uses the data.frame, I will continue to use it here.

Here are some ways to explore the objects in xdata:

# is each column numeric?
sapply( xdata, is.numeric )

##     insert     groups      times        lon        lat      Route       soil 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE      FALSE 
##       nlcd       year       temp  StartWind   StartSky   precSite   precAnom 
##      FALSE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
##    defSite    defAnom winterTmin   juneTemp 
##       TRUE       TRUE       TRUE       TRUE

# is each column a factor?
sapply( xdata, is.factor )

##     insert     groups      times        lon        lat      Route       soil 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE       TRUE 
##       nlcd       year       temp  StartWind   StartSky   precSite   precAnom 
##       TRUE      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
##    defSite    defAnom winterTmin   juneTemp 
##      FALSE      FALSE      FALSE      FALSE

# which columns are factors?
which( sapply( xdata, is.factor ) )

## soil nlcd 
##    7    8

I’ll say more about factors in the next section.

Exercise 1: Are there character variables in xdata?

Exploratory data analysis (EDA)

Exploratory data analysis (EDA) is the first step required for an analysis. EDA helps to determine if the scientific question can be answered with the data at hand. It guides model development and computation.

The relationships between variables must be understood before any modeling begins. Do data provide sufficient coverage in space, time, and predictor and response variables? Are there relationships between variables? Is there ‘signal’? In this section I learn about the data through EDA.

Observations by variables

An observation is a route-year, and that is how xdata is organized. It is observations by variables. If I want to examine counts of wood thrush as routes (rows) by years columns. I can use match:

spec <- 'WoodThrush'
y    <- ydata[,spec]                            # extract the spec column

routes <- sort( unique( xdata$Route ) )         # find all routes
years  <- min( xdata$year ):max( xdata$year )   # all years

nr <- length(routes)                            # no routes and years
ny <- length(years)

i <- match(xdata$Route, routes)                 # route index    
j <- match(xdata$year, years)                   # year index    
routeByYr <- matrix(NA, nr, ny)                 # route by yr matrix
rownames(routeByYr) <- routes
colnames(routeByYr) <- years
routeByYr[ cbind(i, j) ] <- y                   # fill matrix with counts

head( routeByYr )[,1:15]

##   1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
## 1    0    1    0    0    2    0    1    4    3    5    3    1    1    6    4
## 2   NA   NA    0    0    0    0    0    1    1    1    6    3    1    5    3
## 3    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0
## 4    0    1    4   11    1    1    5   19    1   35   50    1    1    1    1
## 5   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 6   NA   NA    0    1   10   10   13   18   19   24   18   14   16   16   11

The matrix routeByYr helps to visualize each Route as a time series. We can also obtain marginal summaries:

cByYr <- colSums( routeByYr, na.rm=T )          # counts by year
nByYr <- colSums( routeByYr*0+1, na.rm=T )      # no. routes by year (why does this work?)
cPerN <- cByYr/nByYr                            # counts per route by year
nByYr <- table( xdata$year )                    # another method

tab <- rbind( round(nByYr), round(cPerN, 1) )
rownames(tab) <- c('routes', 'CPE')

knitr::kable( tab[,1:7] )

	1964	1965	1966	1967	1968	1969	1970
routes	14	14.0	22.0	24.0	24.0	26	27.0
CPE	0	1.4	8.8	6.4	12.1	14	7.6

Spatial distribution

To explore the data I first generate a map of the observations. Here is a map of the routes:

library(maps)

map('county', xlim = c(-85, -75), ylim = c(33.6, 36.8), col='grey')
map('state', add=T)
points(xdata$lon, xdata$lat, pch = 16, cex = 1)

Route locations.

I see balanced coverage in space, from mountains to coast.

I now examine counts for wood thrush, using symbol size to represent the species count and color to represent the variables for moisture deficit.

# route mean
cy   <- tapply(y, xdata$Route, mean, na.rm=T)         # route means
wi   <- match( as.numeric(names(cy)), xdata$Route )   # index for each route
ll   <- xdata[ wi, c('lon','lat') ]                   # find each location

# the moisture deficit varies by route
def <- xdata$defSite[wi]                              # route moisture deficit

# discretize color scheme
df   <- seq(min(def), max(def), length=20)            # bins for deficit values

# a function to define colors for a gradient
colT <- colorRampPalette( c('#8c510a','#d8b365','#01665e','#2166ac') )
cols <- rev( colT(20) )

# assign a color to each anomaly
di <- findInterval(def, df, all.inside = T)          # assign deficits to bins

map('county', xlim = c(-85, -75), ylim = c(33.6, 36.8), col='grey')
map('state', add=T)
cex <- 6*cy/max(cy, na.rm=T)
points(ll[,1], ll[,2], pch = 16, cex = cex, col = cols[di] )
title(spec)

Route locations

Exercise 2. What does this map tell us about the spatial distribution of bird counts and the moisture deficits where they are common?

Is the wood thrush declining?

To explore this question I need to structure the BBS data. An observation includes information (Route, Year) and the count for each species that is seen or heard. It includes a description of the weather, which affects behavior. From the table above, a sample is an observations-by-variables matrix or data.frame. It is the fundamental unit for analysis; I draw inference from a sample. This format is OxV (‘obs by variables’) format. The design matrix is a matrix, which is numeric. It contains predictors, which can be main effects and interactions. It can include covariates and factors. It can be analyzed, because it is numeric.

If factors are included in the data.frame, they should be declared using the function factor. If the factors are identified as words in a data.frame, then you will be interpreted that way. Confusion can arise when a factor is represented by integer values, e.g., $1, 2, \dots$; it will be analyzed as a factor only if I declare it with factor(variable). When the data.frame is converted to a design matrix, the factor levels each occupy a column of zeros and ones (more on this later).

Here are raw observations plotted by year:

plot(xdata$year, ydata[,'WoodThrush'], cex=.2)

Here is a sum by route-year:

# route by year mean counts
thrush <- tapply(ydata[,'WoodThrush'], 
                 list( route = xdata$Route, year = xdata$year), 
                 mean, na.rm=T)

Take a look at the structure of the matrix thrush. Here I add the mean taken over all routes in a given year:

# average by year
thrushYear <- colMeans( thrush, na.rm=T)

# all years in the study
year <- as.numeric(names(thrushYear))

plot(xdata$year, ydata[,'WoodThrush'], cex=.2)
lines(year, thrushYear, col='white', lwd=5)
lines(year, thrushYear, lwd=2)

# quantiles for counts
thrushQuant <- apply( thrush, 2, quantile, na.rm=T)
lines(year, thrushQuant[2,], col='white', lwd=5)
lines(year, thrushQuant[4,], col='white', lwd=5)
lines(year, thrushQuant[2,], lwd=1)
lines(year, thrushQuant[4,], lwd=1)

Counts in time with quantiles

Exercise 3. Can we tell from this plot if birds are declining?

More detailed EDA

I start by exploring the data to get a better idea of sample effort in space and time. Here is a map to see where samples are located in the state, and the relationships between variables:

# observations by year
par(mfrow=c(1,2), bty='n', mar=c(4,3,1,1))
hist(xdata$year, main='', nclass=60)                              # routes by year
title('effort by year')

# observations by route
xdata$Route  <- as.numeric( as.character(xdata$Route) )           # if it's a factor
n_route      <- table( xdata$Route )                              # obs per route
mm           <- match( as.numeric(names(n_route)), xdata$Route)
routeSummary <- data.frame(n_route, xdata[mm, c('lon','lat')],
                           stringsAsFactors = F)
colnames(routeSummary)[1:2] <- c('route','nobs')
head( routeSummary )

# symbols scaled by obs per route
map('county',region='north carolina')
points(routeSummary$lon,routeSummary$lat, cex=.05*routeSummary$nobs, 
       pch=16, col='tan')
points(xdata$lon,xdata$lat, cex=.3, pch=16)
title('effort by route')

Distribution in space and time.

The big change in effort after 1987 needs more thought. Which sites were added?

firstYr      <- tapply(xdata$year, xdata$Route, min, na.rm=T)  # yr a route started
firstYr      <- firstYr[ as.character(routeSummary$route) ]
routeSummary <- cbind( routeSummary, firstYr )
head( routeSummary )

col <- rep('blue', nrow(routeSummary))                    # early routes
col[ routeSummary$firstYr > 1987 ] <- 'green'             # added later
cex <- .05*routeSummary$nobs                              # symbol size proportional to nobs

map('county',region='north carolina')
points(routeSummary$lon,routeSummary$lat, cex=cex, pch=16, col=col)

Early and late routes

Are these sites changing the distribution of data with respect to, say, temperature?

plot(xdata$year, xdata$juneTemp, cex=.1, ylab='degrees C')
title('temperature trend')

routes <- sort(unique(routeSummary$route))
nroute <- length(routes)

for(i in 1:nroute){
  wi <- which(xdata$Route == routes[i])
  lines(xdata$year[wi], xdata$juneTemp[wi], col = col[i])
}
tmu <- tapply(xdata$juneTemp, xdata$year, mean, na.rm=T)
years <- as.numeric(names(tmu))
lines( years, tmu, col=2, lwd = 3)

Temperature trends by route.

Exercise 4. How would I determine if adding new sites has biased the sample toward warmer or cooler temperatures?

Exercise 5. Does the distribution of data suggest that trends in time for bird abundances could be hard to estimate?

What is observed in a given year could depend less on the mean temperature for the site, but instead on the anomaly–is it a warm or cold time relative to the site mean? Here are June temperatures represented as anomalies from the site mean:

# overall trend
trend <- lm(tmu ~ years)

# local anomaly
tmuByRoute <- tapply(xdata$juneTemp, xdata$Route, mean, na.rm=T)
tanomaly   <- xdata$juneTemp - tmuByRoute[ as.character(xdata$Route) ]

plot(xdata$year, tanomaly, cex=.2)
abline(h = 0, lty=2)

Temperature anomalies

Factors, fixed and random effects

To check if some variables are stored as factors I used the function sapply:

sapply(xdata, is.factor)

##     insert     groups      times        lon        lat      Route       soil 
##      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE       TRUE 
##       nlcd       year       temp  StartWind   StartSky   precSite   precAnom 
##       TRUE      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE 
##    defSite    defAnom winterTmin   juneTemp 
##      FALSE      FALSE      FALSE      FALSE

Here I note that variables clearly recorded as factor levels are not stored that way. The variables StartWind and StartSky are potentially important, because they affect bird activity. From documentation I learned that wind codes are an ordered factor. Here are sky classification codes:

Sky	description	new name
0	Clear or few clouds	clear
1	Partly cloudy (scattered) or variable sky	clouds
2	Cloudy (broken) or overcast	clouds
4	Fog or smoke	fogSmoke
5	Drizzle	rain
7	Snow	snow
8	Showers	rain
9	N/A	unknown

A common problem with factors is that the representation of levels can be highly uneven, which has a big impact on a model fit. Each factor level will be treated like an intercept in model fitting. The version of xdata loaded here has these 9 classes aggregated into four. They are currently not stored as factors, but I can change that by declaring them to be factors.

To appreciate how they are treated in model fitting, try a simple example with the variable xdata$startWind:

par(mfrow=c(1,2), bty = 'n')

# create a factor version for each variable
xdata$routeFactor     <- as.factor(xdata$Route)
xdata$StartWindFactor <- as.factor(xdata$StartWind)
xdata$SkyFactor       <- as.factor(xdata$StartSky)

fit1 <- glm(y ~ xdata$StartWind, family = poisson) # WRONG: fitted as non-factor
fit2 <- glm(y ~ xdata$StartWindFactor, family = poisson)

summary(fit1)

## 
## Call:
## glm(formula = y ~ xdata$StartWind, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4627  -2.5917  -0.9182   1.2063  12.6804  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.310526   0.015492 149.143   <2e-16 ***
## xdata$StartWind -0.012172   0.008676  -1.403    0.161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17421  on 2078  degrees of freedom
## Residual deviance: 17419  on 2077  degrees of freedom
##   (672 observations deleted due to missingness)
## AIC: 24753
## 
## Number of Fisher Scoring iterations: 5

summary(fit2)

## 
## Call:
## glm(formula = y ~ xdata$StartWindFactor, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.5468  -2.5462  -0.9225   1.2632  12.7635  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.281335   0.009269 246.131  < 2e-16 ***
## xdata$StartWindFactor2  0.054364   0.015747   3.452 0.000556 ***
## xdata$StartWindFactor3 -0.005901   0.023337  -0.253 0.800385    
## xdata$StartWindFactor4 -0.185720   0.043225  -4.297 1.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17421  on 2078  degrees of freedom
## Residual deviance: 17385  on 2075  degrees of freedom
##   (672 observations deleted due to missingness)
## AIC: 24722
## 
## Number of Fisher Scoring iterations: 5

par( mfrow=c(1,2) )
plot( jitter(xdata$StartWind), y, cex=.2)
plot( xdata$StartWindFactor, y, cex=.2)

Wind is a discrete factor.

Exercise 6. What is the difference in your interpretation of the wind variable in the two fits?

Exercise 7. From the object trend, is temperature increasing or decreasing on average? Is this trend ‘significant’? Does it differ by site? Hint: consider including route as a factor.

Details in abundance trends

To examine trends in counts over time I start with scatter plots and a spline function for counts by year. To highlight the range of abundance I use a heat color gradient from blue (low) to red (high). I overlay a spline fit to summarize mean trends:

par(mfcol=c(1,2),bty='n', mar=c(4,4,2,.5))

colors <- c('blue','green','orange','brown')   # color gradient

ym   <- max(y) + 1
ylim <- c(0,80)
cf   <- colorRampPalette(colors)(ym)
plot(jitter(xdata$year),jitter(y),col=cf[y+1], cex=.4, pch=15, 
     xlab='year', ylab='count', ylim=ylim)
spl <- spline(y ~ xdata$year)

## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

lines(spl,lwd=5,col='white')
lines(spl)


nc  <- 30                    # frequency before 1985, after 2000
y1  <- 1:(nc-1)
sc  <- c(round(y1/nc*ym,0), ym)
old <- hist(y[xdata$year <= 1987],plot=F, breaks=c(0,sc))
new <- hist(y[xdata$year >= 2000],plot=F, breaks=c(0,sc))

plot(NULL,xlim=c(0,.15),ylim=ylim, ylab='',xlab='frequency')
.colorSegment(old$density, old$mids, colors, nc, lwd=2)
.colorSegment(new$density, new$mids, colors, nc, lwd=3)
text( .04, 24, 'pre-1987')
text( .07, 10, 'post-2000')

Counts shaded by abandance and spline fit at left. Frequencies of counts before 1985 and after 2000 at right.

Exercise 8. How does the distribution of counts change after 1987?

The spline function gives an idea of the trend in the data, highlighted by the heat colors I extracted using the function colorRampPalette. To the right are frequency plots for counts before 1985 (thin) and after 2000 (thick).

A generalized linear model (GLM)

Here is a simple generalized linear regression (GLM) model, more precisely termed a Poisson regression to fit parameters:

\[ \begin{aligned} y_i &\sim Poi(\lambda_i) \\ \log(\lambda_i) &= \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] The term GLM refers to fact that there is a linear function of parameters (second line) embedded within a non-Gaussian distribution. The first line is a Poisson distribution. The Poisson distribution has one parameter, an intensity $\lambda > 0$. It describes count data $y \in \{0, 1, \dots\}$, i.e., non-negative integers. The Poisson has the unusual property that $\lambda$ is both the mean and the variance of the distribution, $E[y] = Var[y] = \lambda$.

The second line is linear in parameters, but it is a non-linear function of $\lambda$. In this context, the log function is called the link function, connecting the linear model to the Poisson sampling distribution.

For a sample of $n$ observations, the likelihood is

\[[\mathbf{y}|\mathbf{X}, \boldsymbol{\beta}] = \prod_{i=1}^n Poi(y_i | \lambda_i)\] where $\mathbf{y}$ is the length-$n$ vector of counts, and $\mathbf{X}$ is the $n \times p$ matrix of predictor variables. The R function glm can be used to fit GLMs. I give it a formula, the data, and model, family=distribution("link"). I include some predictors from the BBS data,

fit <- glm(y ~ temp + year + StartWindFactor, data=xdata, family=poisson("log"))
summary(fit)

## 
## Call:
## glm(formula = y ~ temp + year + StartWindFactor, family = poisson("log"), 
##     data = xdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8752  -2.2054  -0.7085   1.1847   9.7254  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      63.5837470  0.9715180  65.448  < 2e-16 ***
## temp             -0.0213270  0.0020261 -10.526  < 2e-16 ***
## year             -0.0304594  0.0004853 -62.758  < 2e-16 ***
## StartWindFactor2 -0.0745468  0.0159526  -4.673 2.97e-06 ***
## StartWindFactor3 -0.1264779  0.0235340  -5.374 7.69e-08 ***
## StartWindFactor4 -0.3814783  0.0433770  -8.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17420  on 2077  degrees of freedom
## Residual deviance: 13607  on 2072  degrees of freedom
##   (673 observations deleted due to missingness)
## AIC: 20945
## 
## Number of Fisher Scoring iterations: 5

This code says that I want a GLM model for response (counts) y with two covariates, temp + year, and one factor, StartWind.

I can use the predict function to evaluate predictions:

yhat <- predict.glm(fit, type = 'response')
yobs <- y[names(yhat)]
plot(yobs, yhat, cex=.1, xlab='Observed', ylab = 'Predicted' )
abline(0, 1, col = 'white', lwd=4)
abline(0, 1, lty=2)
abline(h = mean(yobs), col = 'white', lwd=4)
abline(h = mean(yobs), lty=2)

I have drawn a 1:1 line and a horizontal line at the mean of the observations. The latter is expected if the model explains nothing.

Exercise 9. Which variables in this GLM appear to be important and why?

Recap

A minimal knowledge of data types and storage modes is needed to manipulate objects in R. Functions are available from the base R package. They are supplemented by others that can be installed from the CRAN website, github, sourceforge, or even developed yourself.

Data analysis begins with data formatting and exploration. Data analysis requires OxV (observations $\times$ variables) format. I review packages and functions that can help.

The example using Poisson regression provides an introduction to generalized linear models (GLMs), by conventional maximum likelihood (glm).

3. Exploratory data analysis (EDA)
Web files to pattern
Duke University

env/bio 665 Bayesian inference for environmental models

Jim Clark

2021-01-28

Logistics

Resources

Data

Software

Readings

For next time

Today’s plan

Case study: the migratory wood thrush

EDA with `R`?

interacting with R

where am I?

Where’s the file?

What’s in the file

OxV format and the `data.frame`

Exploratory data analysis (EDA)

Observations by variables

Spatial distribution

Is the wood thrush declining?

More detailed EDA

Factors, fixed and random effects

Details in abundance trends

A generalized linear model (GLM)

Recap

3. Exploratory data analysis (EDA) Web files to pattern Duke University

env/bio 665 Bayesian inference for environmental models

Jim Clark

2021-01-28

Logistics

Resources

Data

Software

Readings

For next time

Today’s plan

Case study: the migratory wood thrush

EDA with R?

interacting with R

where am I?

Where’s the file?

What’s in the file

OxV format and the data.frame

Exploratory data analysis (EDA)

Observations by variables

Spatial distribution

Is the wood thrush declining?

More detailed EDA

Factors, fixed and random effects

Details in abundance trends

A generalized linear model (GLM)

Recap

3. Exploratory data analysis (EDA)
Web files to pattern
Duke University

EDA with `R`?

OxV format and the `data.frame`