This document describes the process of loading, evaluating, modifying, and finalizing the Carolina Sandhills NWR (CSNWR) occupancy data for analysis. This data will be used in a subsequent document as the basis for an occupancy analysis of three pine woods/grassland bird species (Bachman’s Sparrow, Brown-headed Nuthatch, and Northern Bobwhite).

The specific objectives of this document are to:

  1. Document the process of importing CSNWR’s occupancy data into R.

  2. Conduct some basic quality control and assessment of the data via built-in R functions and visualization.

  3. Create relevant environmental covariates from raw vegetation measurements.

  4. Join multiple data sets into a single data set for analysis.

  5. Save the resulting data sets for posterity and occupancy analysis.

Preliminaries

If you are new to R, or simply a bit rusty, have a look through this page to get (re)acquainted. It will make things much easier if the material on that page makes sense.

The following code imports some custom functions and aliases (abbreviations for existing functions) that we will use regularly. It also installs (if necessary; administrative privileges NOT required!!) and loads the R packages needed to perform the analyses in this document.

# Load some useful aliases and functions
# NOTE: requires an active internet connection
# install.packages("devtools") # Install your first time through
require(devtools)
source_gist(9216051) # Rprofile.R
source_gist(9216061) # various.R

# Load required packages
# NOTE: `reshape2` (melt) and `psych` (pairs.panels) are used later in this
# document, unless equivalents are supplied by the gists sourced above
toLoad = c("unmarked", "lubridate", "plyr", "dplyr", "car", "ggplot2", "grid",
           "reshape2", "psych")
instant_pkgs(toLoad)

# Load a useful modification of an `unmarked` function
source("./R/newformatLong.R")

Loading data

Getting the data into R is not always trivial, particularly given our reliance on Excel spreadsheets for data storage. The easiest format for R to work with is a comma-delimited file (.csv), which can be created easily from Excel. This video illustrates the process; you may have to adjust the quality to 720p to make out the details:

The .csv creation process is a bit more complicated with an Access database table/query, but we can go down that road as necessary.

unmarked requires data in “wide” format, with a single row for each site and thus multiple columns for each survey (detection/nondetection) and, if necessary, associated covariates that change seasonally (among years) or among surveys (observation variables). However, we typically enter our data in “long” format with a single row for each observation. Fortunately, unmarked has some convenience functions to make this conversion relatively painless.
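
To see the difference, consider this toy sketch (hypothetical two-site, two-visit detection data, not the CSNWR data):

# "Long" format: one row per site-by-visit observation
long_toy <- data.frame(site  = c("A", "A", "B", "B"),
                       visit = c(1, 2, 1, 2),
                       occ   = c(0, 1, 1, 1))

# "Wide" format: one row per site, one detection column per visit
wide_toy <- data.frame(site = c("A", "B"), occ.1 = c(0, 1), occ.2 = c(1, 1))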

Throughout this document, we’ll use the 2012 count/occupancy data. Let’s load it and take a look.

occ2012 <- import.data(read.csv("./Data/occ2012.csv", header=TRUE), # Identify the data location
                       types = c("date", "character", "factor",     # Specify variable types
                                 "factor", rep("integer", 2))) 

The next five steps (functions) examine the data structure, check for gross errors, and generally confirm that R is reading the data the way we expect. Some of these steps (e.g., str, summary) we’ll use repeatedly; others are specific to our expectations for this particular data set. If you haven’t confirmed that the code on this page makes sense, it’s probably worth your time.

# Step 1. Check the data frame structure
str(occ2012) # Looks as expected; equal observations (65 * 3) for each species
## 'data.frame':    585 obs. of  6 variables:
##  $ date    : POSIXct, format: "2012-03-19" "2012-03-19" "2012-03-19" "2012-03-19" ...
##  $ site    : chr  "4I" "4H" "4F" "4E" ...
##  $ observer: Factor w/ 2 levels "NJ","WW": 2 2 2 2 2 2 2 2 2 2 ...
##  $ spp     : Factor w/ 3 levels "BACS","BHNU",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ wind    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ count   : int  0 0 0 0 3 1 1 0 0 0 ...
# Step 2. This particular data set should contain exactly 65 distinct sites.  The following code confirms it.
with(occ2012, length(unique(site))) # 65 sites, check...
## [1] 65
# Step 3. Get a summary.  Check for missing values where we're not expecting them, and that the number of records for each level of factor variables and the summary statistics for continuous variables make sense.
summary(occ2012) # Values make sense; note, however, the 2 missing (NA) wind values
##       date                         site           observer   spp           wind           count       
##  Min.   :2012-03-19 00:00:00   Length:585         NJ: 22   BACS:195   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2012-03-29 00:00:00   Class :character   WW:563   BHNU:195   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2012-04-27 00:00:00   Mode  :character            NOBO:195   Median :2.000   Median :0.0000  
##  Mean   :2012-04-23 15:28:00                                          Mean   :1.605   Mean   :0.5607  
##  3rd Qu.:2012-05-15 00:00:00                                          3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :2012-05-29 00:00:00                                          Max.   :3.000   Max.   :5.0000  
##                                                                       NA's   :2
# Step 4. Tabulations of relevant variables are frequently handy for checking data.  In this case, we evaluate the distribution of counts for each species 
# For example, of the 195 visits, BACS were not detected on 164 visits, one was detected on 16 visits, 2 were detected on 14 visits, and 3 were detected on only a single visit (164 + 16 + 14 + 1 = 195).
with(occ2012, table(spp, count)) # no indication of extreme counts, etc.
##       count
## spp      0   1   2   3   4   5
##   BACS 164  16  14   1   0   0
##   BHNU  79  38  39  29   7   3
##   NOBO 168  19   8   0   0   0
# Step 5. Look at the first few rows of the data frame
head(occ2012)
##         date site observer  spp wind count
## 1 2012-03-19   4I       WW BHNU    0     0
## 2 2012-03-19   4H       WW BHNU    0     0
## 3 2012-03-19   4F       WW BHNU    0     0
## 4 2012-03-19   4E       WW BHNU    0     0
## 5 2012-03-19   4G       WW BHNU    0     3
## 6 2012-03-19   4J       WW BHNU    0     1

You should follow a similar procedure for creating the data set(s) containing any associated covariates (e.g., vegetation data). We will link these covariate data to the count/occupancy data in R, so it is highly recommended that the variables used for the join have matching names in both data sets (in fact, the code below assumes that is the case). In our example, we have two covariate data sets: one containing vegetation data that varies only among sites and not (yet) over time, and one containing burn frequency, which changed for some sites from 2012 to 2014. For the vegetation data, all we require is a matching variable site to link it to the observation data. For the fire frequency data, which varies among sites but also among seasons, we need an additional variable in both the observation and fire data sets to link them (e.g., site and year). In many cases we can (and prefer to) create these linking variables in R (e.g., year; see below); the less work we do in Excel, the more transparent our analyses. Note that we suggest including covariates that change from survey to survey (e.g., temperature, wind, observer) with the observation (occupancy) data.
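
To make the linking idea concrete, here is a minimal sketch of both kinds of joins using dplyr’s left_join (the object names and values are hypothetical, not the CSNWR data):

# Toy example: site-level covariates link by `site` alone;
# year-varying covariates link by `site` + `year`
obs_toy  <- data.frame(site = c("A", "B"), year = c(2012, 2012), count = c(0, 2))
veg_toy  <- data.frame(site = c("A", "B"), canopy = c(0.4, 0.7))
fire_toy <- data.frame(site = c("A", "B"), year = c(2012, 2012), burnfreq = c(2, 3))
obs_toy  <- left_join(obs_toy, veg_toy, by = "site")
obs_toy  <- left_join(obs_toy, fire_toy, by = c("site", "year"))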

# Load vegetation data
veg <- import.data(read.csv("./Data/vegplots.csv", header=TRUE),
                   types = c(rep("character", 2), "date", rep("numeric", 7),
                             "character", rep("numeric", 11)))

# Load fire frequency data
fires <- import.data(read.csv("./Data/firefreq.csv", header=TRUE),
                     types = c("character", "numeric", "character", "numeric"))

# Check the veg and fire data using some of the steps reported above
# str(veg)
# str(fires)

# As expected, all sites had 16 records, with 4 records in each of the 4 sectors (i.e., subplots)
#with(veg, all(table(site) == 16))
#with(veg, all(table(site, sector) == 4))

# Check for uncharacteristic measurements using summary (not illustrated)
# summary(veg) 
# summary(fires)

Additionally, plotting various aspects of the data (based on your biological knowledge or expectations) can provide an excellent means of finding errors. We’ll use this more below, but for now we present the simple example of patterns of fire frequency by site — fire frequency at a given site should not decrease over time.

We use the ggplot2 library for visualization since we find it rather intuitive. There’s literally a book on the subject, but we’ll try to keep it relatively simple.

# Fire frequency over time by site; should only increase (or remain the same) 
# Looks good now.  A previous version found an error!

# line 1: call ggplot, specify the data frame, and set up aesthetics
# In this case, we want year on the x-axis and burnfreq on the y-axis
ggplot(fires, aes(x = year, y = burnfreq)) +
    # line 2: we want a line graph and we want separate graphs (facets) for every site    
    geom_line() + facet_wrap(~site) + 
    # line 3: these options "prettify" the x-axis and set a simple black & white theme
    scale_x_continuous(breaks = c(2012, 2014), limits=c(2011.5,2014.5)) + theme_bw()

Manipulating and summarizing data

Our covariate data (vegetation and fire) are successfully loaded, but they are not yet ready to join with the occupancy/count observations. The vegetation data need to be summarized at each site; they currently contain raw vegetation measurements on a host of variables. Next we use some data manipulation functions to calculate our desired measures. Specifically, using data from the four subplots, we will calculate for each site (1) the total number of snags (of any kind) within 50 meters (this minimizes overlap among subplots), (2) the average DBH of measured pine trees, and (3) indices of canopy, shrub, and ground covers.

# Note, When no pines were present, no DBHs could be measured
# These are entered as DBH = 0

# We group the veg data at the level you want to summarize
# In this case, we want variables at the `site` level; we also keep `date`
# since all subplots were measured on the same day
veg <- group_by(veg, site) # Data frame first, then grouping variable(s)

# Summarize variables using this grouping 
veg <- summarize(veg, # grouped data frame
                 date = median(date, na.rm = TRUE), # used after future veg surveys
                 pine_ba = mean(pine_ba, na.rm = TRUE),
                 hard_ba = mean(hard_ba, na.rm = TRUE),
                 pine_dbh = mean(c(pdbh1, pdbh2, pdbh3, pdbh4, pdbh5), na.rm = TRUE), # pool all measured DBHs
                 snags = sum(c(snag_l25, snag_l50), na.rm = TRUE), # total snags within 50 m
                 canopy = mean(canopy, na.rm = TRUE)/20,
                 grass = mean(grass, na.rm = TRUE)/20,
                 forb = mean(forb, na.rm = TRUE)/20,
                 woody = mean(woody, na.rm = TRUE)/20,
                 bare = mean(bare, na.rm = TRUE)/20,
                 litter = mean(litter, na.rm = TRUE)/20,
                 shrubs = mean(shrubs, na.rm = TRUE)/20)

Of course, it’s always worth checking what you just created. We can use some of the reliable stand-bys as well as plot the data. Boxplots give us a feel for the distribution of a variable, help identify extreme (or “too many”) outliers, and let us check that the median (bold horizontal line) and middle 50% of the data (box) conform to our expectations based on biological knowledge:

#str(veg)
#summary(veg)

# Some boxplots to look for any glaring problems
# First, we "melt" the data into a long format so that we can create boxplots for each variable
mveg <- melt(as.data.frame(select(veg, -site, -date)))
# Take a look if you desire
# head(mveg)

# Sometimes it's nice to see not just the first six rows but a random sample of rows from
# throughout the data frame
# You can view specific rows and columns of a data frame using the syntax df[rows, columns]
# In this case, we ask for 15 random rows and all columns (hence the blank after the comma)
mveg[sample(1:nrow(mveg), 15), ]
##     variable   value
## 365    grass  0.1125
## 704   shrubs  0.2250
## 139 pine_dbh 10.0300
## 663   shrubs  0.3125
## 688   shrubs  0.2875
## 181 pine_dbh  5.4400
## 44   pine_ba 50.0000
## 222    snags  3.0000
## 256    snags  7.0000
## 220    snags  8.0000
## 609   litter  1.0000
## 657   shrubs  0.4500
## 389    grass  0.5750
## 242    snags  4.0000
## 566     bare  0.3125
# Now plot the boxplots
# Easy way, but this puts all variables on the same y-axis scale which may not be desired
# depending on your data
#ggplot(mveg, aes(x = variable, y = value)) + geom_boxplot()

# Slightly more complicated but better
ggplot(mveg, aes(x = "", y = value)) + # No x-axis; we'll use facets so we can vary the y-axis
    geom_boxplot() + # we want boxplots
    facet_wrap(~variable, scales="free_y", nrow=2) +
    labs(x = "", y = "Measurements from\nall sites") # Rename axes to "prettify"

Joining observations with covariates

With the relevant data sets loaded and modified, we need to consolidate them into a single primary data frame for subsequent re-formatting and analysis by unmarked. The observation/occupancy data will be our primary data frame, to which we add covariates, because it contains a single record/line for every observation. Again, our job will be much easier if the variables used to link data sets are identically named and structured; time spent ensuring this is the case is time well spent. We first create a variable year in our observation data that reflects the year of observation so we can join it with the fire frequency data; year already exists in the fire frequency data. We also add a Julian day, or day of year, variable (doy), since unmarked (and most other analytical programs) does not play very nicely with dates. Julian day adequately captures our intended use of survey date.

# Mutate does what you think: mutates (changes) variables into new variables
occ2012 <- mutate(occ2012, # specify the data frame to change, or mutate
                  year = year(date), # Add year variable; extracts year from date
                  doy = yday(date)) # Add Julian day (day of year) variable

Supposing our data sets now meet the matching-variable criteria, the join process is relatively painless. We can keep reassigning the modifications to the observation/occupancy data to preserve the data set name. Joins can only occur between two data sets at a time, so in our case we’ll have to perform two joins; the order is irrelevant in this case.

# Join the vegetation data
# We don't need the date of the vegetation surveys, so we can drop (-) it
# This is really useful for large data sets with lots of irrelevant variables/info
occ2012 <- left_join(occ2012, select(veg, -date), by = "site")

# Join the fire data
# Notice that 2014 data are dropped because there are no matching records in the observations
occ2012 <- left_join(occ2012, select(fires, -burn_unit), by = c("site", "year"))

# Sort the data to make it easier to interpret visually
occ2012 <- arrange(occ2012, spp, site, date)

# Add a categorical variable for site visit to explore heterogeneous detection over time
occ2012 <- mutate(occ2012, # This WILL NOT work if you have missing observations
                  visit = factor(rep(1:3, nrow(occ2012)/3)))
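
Note that if any surveys had been missed, a safer alternative (a sketch using dplyr’s row_number(); it assumes the data remain sorted by date within spp and site, as above) is to number the visits within each species-by-site group:

# Alternative that tolerates missed visits: number surveys within each
# spp x site group (uncomment to use)
# occ2012 <- mutate(group_by(occ2012, spp, site), visit = row_number())
# occ2012 <- mutate(ungroup(occ2012), visit = factor(visit))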

Exploring relationships among covariates

With a data set tidied and joined, we’re still not quite ready for analysis. We have some data visualization to do first. Visualization of your data provides a convenient way to identify potential problems (e.g., extreme outliers that may indicate data entry errors) and explore potential multicollinearity (correlation) among covariates. We illustrate only for Bachman’s Sparrow (BACS).

We find the following figure extremely helpful, but it warrants some explanation. On the diagonal are the histogram and density of measured values for a given variable. This gives similar information to the boxplots we created earlier, although more variables are shown now after joining the data. Off-diagonal elements contain information on the bivariate relationship between the two variables involved. The upper diagonal shows the bivariate scatter plot with a LOESS smooth (red line) fitted to give you a sense of the relationship between the variables. The correlation between the two variables is recorded in the corresponding location in the lower diagonal: reds indicate positive correlations, blues negative correlations, and the size of the box and text indicates the magnitude of the correlation. For example, the relationship between pine basal area (pine_ba) and canopy cover (canopy) is fairly strong, as indicated by the large red box and correlation of 0.67 (lower diagonal); the corresponding plot on the upper diagonal illustrates this relationship.

# First, subset to only BACS
BACS <- droplevels(subset(occ2012, spp == "BACS"))
pairs.panels(select(BACS, -date, -site, -observer, -spp, -year, -visit), main = "BACS @ CSNWR") # drop irrelevant columns

We also check formally for multicollinearity among our covariates, which can negatively influence our ability to identify variables important in species occupancy and abundance. We use variance inflation factors (VIFs) as a formal evaluation. VIFs measure how strongly a given variable is correlated with the other variables. High VIFs (values > 3 are worth consideration) make it more difficult to identify the important variables in an analysis. For example, in this data set, the canopy cover measure is correlated with several of the other covariates (VIF ~ 5), particularly (looking at the above plot) pine basal area, leaf litter, and ground cover. This makes sense, as a site with higher pine basal area (i.e., more pine trees in general) should have a higher canopy score simply because of the number of trees. It may be worth considering dropping canopy cover or possibly pine basal area, depending on biological or practical (how easy are they to measure?) considerations.

# Testing VIF of predictor variables
testVIF <- select(BACS, -date, -site, -observer, -spp, -year, -count, -visit)
corvif(testVIF) # Canopy is borderline
##               VIF
## wind     1.078491
## doy      1.014359
## pine_ba  2.587933
## hard_ba  1.399166
## pine_dbh 1.550958
## snags    1.402949
## canopy   4.969563
## grass    1.965154
## forb     2.075085
## woody    1.581583
## bare     2.040518
## litter   3.077420
## shrubs   2.291691
## burnfreq 2.303101
# Dropping canopy cover leaves the remaining VIFs in good shape (< 3)
corvif(select(testVIF, -canopy)) # Much better 
##               VIF
## wind     1.074181
## doy      1.013730
## pine_ba  1.842130
## hard_ba  1.202303
## pine_dbh 1.492236
## snags    1.134720
## grass    1.650383
## forb     1.666595
## woody    1.580893
## bare     1.891900
## litter   2.378413
## shrubs   2.164207
## burnfreq 2.109610
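
For intuition, a variable’s VIF equals 1/(1 − R²), where R² comes from regressing that variable on all of the other covariates. As a quick hand check (a sketch assuming testVIF as created above; the result should roughly match corvif()’s value for canopy):

# Regress canopy on the remaining covariates; VIF = 1 / (1 - R^2)
r2_canopy <- summary(lm(canopy ~ ., data = testVIF))$r.squared
1 / (1 - r2_canopy)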

Rescaling covariates to improve model fitting

Rescaling covariates so that they don’t span several orders of magnitude often expedites the optimization algorithms. In the occupancy data, for example, the proportion of shrubs and the day of year differ by two orders of magnitude. We have two standard options. First, we can standardize the values of the covariates, converting them to a standard normal scale. So, for example, instead of pine basal area having a mean (± SD) of 45.5 (± 16.8), it will be rescaled to a mean (± SD) of 0 (± 1); ditto for all other standardized variables. However, this can sometimes cause trouble in occupancy models that feature an interaction between a continuous and a categorical variable. Second, we can rescale them manually by dividing/multiplying by a constant. Assuming you remember the constant(!), this method makes it easier to put variables back on the original scale for, e.g., plotting, but it is a bit less convenient if you have many variables.

In this occupancy analysis we will not consider any interactions with categorical variables, so we standardize all continuous covariates. We use a slightly different formulation of the standardization, scaling relative to the median and median absolute deviation (MAD) rather than the mean and standard deviation. This makes it a bit easier to plot down the road.

# Get covariate medians and median absolute deviations (MADs)
covars <- select(occ2012, -date, -site, -observer, -spp, -count, -year, -visit, -wind)
covar_meds <- apply(covars, 2, median)
covar_mads <- apply(covars, 2, mad)

# Now scale them in the original data set
which_cols <- which(names(occ2012) %in% names(covar_meds)) # Identifies columns to standardize
occ2012[, which_cols] <- median_scale(occ2012[, which_cols])
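
median_scale() comes from the gists sourced at the outset; conceptually it does something like the following minimal sketch, centering each column on its median and dividing by its MAD (the actual implementation may differ):

# Conceptual stand-in for median_scale(): (x - median) / MAD, column-wise
median_scale_sketch <- function(df) {
    as.data.frame(lapply(df, function(x) (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)))
}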

Converting data to unmarked format

Nearly finished! Our final act is to convert the (standardized) observation and covariate data set from its current “long” format, with a single record/row for every observation. unmarked wants the data in “wide” format and provides a function for the conversion. However, it has certain requirements for variable order. Moreover, the final format depends on the type of analysis we intend to perform (e.g., single season vs. multiple season vs. removal). For simplicity, we prepare the data for a single season occupancy analysis, but we’ll save the standardized occupancy data set for future use if and when we pursue occupancy analyses based on multiple seasons/years of surveys.

# First, since we're modeling occupancy here, we dichotomize the count column into presence (count > 0) or absence. 
occ2012$occ <- ifelse(occ2012$count > 0, 1, 0)

# Next we reorder the data set (column order) according to `unmarked`'s demands
# 1st 3 columns must be, in this order: site, date, and the occ/count variable
# Note: we drop `observer` because almost all surveys were conducted by a single observer
# We will keep it in multiseason models when the observer changes...
occ2012 <- cbind(occ2012[, c("site", "visit", "occ",
                             "count")],   # preserve the count for now 
                 occ2012[, "wind", drop = FALSE], # if any, factor covariates not standardized; use drop=FALSE if only one
                 occ2012[, which_cols],  # then all other covariates of interest
                 occ2012[, "spp", drop = FALSE])  # if more than one species/group, put it here

# Split into a list of data sets by species
# unmarked requires species separately at the moment
occ2012 <- split(occ2012, occ2012$spp)

# Drop species (spp) column from each data frame and restore data frame names
occ2012 <- lapply(occ2012, function(x) droplevels(select(x, -spp)))

# Pull the data set for each species into the environment for further use
# In this case, we now have three data sets to work with: BACS, BHNU, and NOBO

list2env(occ2012, envir = .GlobalEnv)
## <environment: R_GlobalEnv>
# Now use an `unmarked` function to convert each species' data set into wide format
# Technically this is my modification of the function to automagically sort site-level
# variables from observation-level variables among other things
BHNU <- newformatLong(BHNU, type = "unmarkedFrameOccu", timename = "visit")
BACS <- newformatLong(BACS, type = "unmarkedFrameOccu", timename = "visit")
NOBO <- newformatLong(NOBO, type = "unmarkedFrameOccu", timename = "visit")

Save the data files

We’ll save all the relevant data sets into a single file that can easily be loaded into R without having to redo these steps every time we want to run an occupancy analysis.

# We want to keep the three species' data sets, the standardized 'final' occupancy data set, 
# and the objects containing the covariate medians and MADs, which we'll need to plot things
# on the original scale later
# Might as well save the veg and fire data as well
save(BHNU, BACS, NOBO, occ2012, covar_meds, covar_mads, veg, fires,
     file = "./Data/occupancy_2012.Rda")