Overview

Background

Understanding why species are found at some locations and not others is a central focus in ecology (MacKenzie et al. 2017). Humans have long followed and predicted (or at least tried to predict) the movements and occurrence of species, because our survival has depended on it (e.g., for food, tools, clothing, and medicine, and to avoid predators and parasites). Today, occupancy models give us a powerful and important tool for conservation, wildlife management, and policy and decision making. When we estimate species occupancy, we can predict and map the probability of a species occurring at a given location and time based on factors such as land cover type or habitat. For example, if you estimate the probability of occupancy of an endangered or inherently rare species throughout an area, you would have data to support protecting areas that, if disturbed or developed, might have negative effects on the species (e.g., a local extinction event). Other terms that are used synonymously with occupancy are occurrence and presence/absence.
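
A useful way to see why occupancy models separate occurrence from detection: if a site is truly occupied with probability psi and a single survey detects the species with probability p, the chance of recording the species at least once across k independent visits is psi * (1 - (1 - p)^k). The quick sketch below uses made-up numbers, purely for illustration, to show how imperfect detection shrinks the proportion of occupied sites we would naively record as occupied.

#toy illustration of occupancy (psi) versus detection (p); values are made up for illustration only
psi <- 0.6   # probability a site is truly occupied
p   <- 0.3   # probability of detecting the species on one visit, given it is present
k   <- 3     # number of repeat visits
psi * (1 - (1 - p)^k)   # probability of recording the species at least once in k visits
#> [1] 0.3942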

We will work through an example analysis in R to estimate occupancy for the wood thrush (Hylocichla mustelina) in New Hampshire, USA. We will use eBird data (https://ebird.org/home) collected by citizen scientists (Sullivan et al. 2009), and will follow the methods outlined as best practices for processing, sampling, and analyzing these data (Strimas-Mackey et al. 2020; https://cornelllabofornithology.github.io/ebird-best-practices/). Within groups, you will develop and test hypotheses relating habitat characteristics to the probability of detecting wood thrushes. To fit and compare multiple occupancy models, we will use the R package ‘unmarked’ (Fiske and Chandler 2011). By evaluating model performance we can assess the contributions of site-level and observation-level covariates to the probability of occupancy. We will import raw eBird data from the past 10 years in NH and combine these data with remotely sensed data from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS; Friedl and Sulla-Menashe 2015) to extract land cover types including forest, urban, and cropland. Additionally, we’ll add surface elevation data (Amatulli et al. 2018). Finally, within your groups, you’ll work to develop and test hypotheses and modify the existing code to fit models that can predict where wood thrushes are most likely to occur (and not occur) in NH.

After working through the analysis with your group, each group member will write a brief (i.e., <= 5-page) summary that should include the motivation and objectives of your study, a statement of the predictions or hypotheses being tested, results (ideally with some tables and figures), map(s), and a short discussion of management implications or strategies informed by your results.


Goals

  • Use open source eBird data and NASA’s MODIS satellite imagery data
  • Estimate the occupancy of the wood thrush in New Hampshire
  • Test different factors that might influence occupancy of a forest-breeding migratory songbird
  • Use a model to predict the probability of wood thrush occupancy
  • Create figures of the relationship of covariates to detection probability and occupancy
  • Create a map of model-estimated occupancy (Mean and SE) for NH
  • Discuss habitat management strategies and decisions in NH to promote the wood thrush, a species of conservation concern


Skills

  • Using R and open source code
  • Data manipulation
  • Using raster and vector files
  • Building simple functions
  • Fitting models
  • AIC model selection and model averaging
  • Interpreting model results
  • Making publishable maps and figures


1) Load packages and set working directory in R.

It’s important at the beginning of each R session to get things organized such as setting the working directory and loading (or installing, if needed) any packages or functions. Below, you’ll want to set your own path for your working directory.

#clear environment
rm(list=ls())

#set working directory
my.directory<-"YOUR PATH GOES HERE"
# setwd(my.directory)

# install.packages("devtools")
# devtools::install_github("mstrimas/ebppackages")

#load packages
library(raster)
library(ggplot2)
library(ggmap)
library(ggsn)
library(viridisLite)
library(auk)
library(lubridate)
library(sf)
library(gridExtra)
library(tidyverse)
library(grid)
library(velox)
library(MODIS)
library(dggridR)
library(unmarked)
library(ebirdst)
library(MuMIn)
library(AICcmodavg)
library(fields)
library(purrr)
library(dplyr)
library(knitr)
library(pander)

# resolve namespace conflicts
select <- dplyr::select
filter <- dplyr::filter

#set global options
knitr::opts_chunk$set(echo = TRUE, message=FALSE, results='hide')


2) Read in raw eBird data and process: filtering data and zero-filling

When we read in raw eBird data, there are several steps we have to take to keep only the parts of the data we’ll be using. We do this through a process of filtering, or sub-setting, the data, constraining it to include only what we’ll need for this analysis. Since the eBird database contains data collected in various ways, using different methods, and under different conditions, we have to be specific about which data we use to ensure they meet the assumptions of the models we’ll be fitting. For example, in our case, we’ll limit the data to stationary or traveling surveys of up to 5 km (total distance) conducted between 01 June and 30 June each year from 2010 to 2019. And in order to estimate occupancy of a species, we need all the surveys where the species was detected as well as all the surveys where it was NOT detected.

# create a new folder: "Data" within your working directory where we'll store data and save files
data_dir <- "./Data"
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

#read in eBird data downloaded from eBird as a text (.txt) file
# NOTE: file_sampling should point to the sampling event (checklist) data file that accompanies
# the EBD download, not to the observation file itself; adjust the filename to match your download
ebd <- auk_ebd("./Data/ebd_US-NH_relDec-2019/ebd_US-NH_relDec-2019.txt", 
               file_sampling = "./Data/ebd_US-NH_relDec-2019/ebd_sampling_relDec-2019.txt")

#define an object, ebd_filters, which are a subset (or filter) of all the raw data, ebd:
  # Species: Wood Thrush
  # State: NH
  # Dates: 01 June to 30 June
  # Protocol Type: Stationary or Traveling
ebd_filters <- ebd %>% 
  auk_species("Wood Thrush") %>% 
  # restrict to New Hampshire
  auk_state("US-NH") %>% 
  # june, use * to get data from any year
  auk_date(date = c("*-06-01", "*-06-30")) %>% 
  # restrict to the standard traveling and stationary count protocols
  auk_protocol(protocol = c("Stationary", "Traveling")) %>% 
  auk_complete()
ebd_filters

#make file paths for eBird files output from previous code to make ebd_filters
f_ebd <- file.path(data_dir, "ebd_woothr_june_NH.txt")
f_sampling <- file.path(data_dir, "ebd_checklists_june_NH.txt")

# only run if the files don't already exist
if (!file.exists(f_ebd)) {
  auk_filter(ebd_filters, file = f_ebd, file_sampling = f_sampling)
}

#zero-fill the data (i.e., add a count of zero for every survey event where the species was NOT detected)
ebd_zf <- auk_zerofill(f_ebd, f_sampling, collapse = TRUE)

# function to convert time observation to hours since midnight
time_to_decimal <- function(x) {
  x <- hms(x, quiet = TRUE)
  hour(x) + minute(x) / 60 + second(x) / 3600
}
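# quick check with example values (not from the data):
#   time_to_decimal("06:30:00") returns 6.5
#   time_to_decimal("13:15:00") returns 13.25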

#fix variable names
names(ebd_zf)[names(ebd_zf)=="observation_count.x"]<-"observation_count"
names(ebd_zf)[names(ebd_zf)=="scientific_name.x"]<-"scientific_name"

#remove some columns
ebd_zf$scientific_name.y<-NULL
ebd_zf$observation_count.y<-NULL

# clean up variables
ebd_zf <- ebd_zf %>% 
  mutate(
    # convert X to NA
    observation_count = if_else(observation_count == "X", 
                                NA_character_, observation_count),
    observation_count = as.integer(observation_count),
    # effort_distance_km to 0 for non-travelling counts
    effort_distance_km = if_else(protocol_type != "Traveling", 
                                 0, effort_distance_km),
    # convert time to decimal hours since midnight
    time_observations_started = time_to_decimal(time_observations_started),
    # split date into year and day of year
    year = year(observation_date),
    day_of_year = yday(observation_date)
  )

# additional filtering: constrain the data further to surveys lasting 5 hours or less, traveling 5 km or less, conducted in 2010 or later, and with 10 or fewer observers.
ebd_zf_filtered <- ebd_zf %>% 
  filter(
    # effort filters
    duration_minutes <= 5 * 60,
    effort_distance_km <= 5,
    # last 10 years of data
    year >= 2010,
    # 10 or fewer observers
    number_observers <= 10)

#final formatting and saving
ebird <- ebd_zf_filtered %>% 
  select(checklist_id, observer_id, sampling_event_identifier,
         scientific_name,
         observation_count, species_observed, 
         state_code, locality_id, latitude, longitude,
         protocol_type, all_species_reported,
         observation_date, year, day_of_year,
         time_observations_started, 
         duration_minutes, effort_distance_km,
         number_observers)

#Export the formatted eBird data as a new .csv file
write_csv(ebird, "./Data/ebd_woothr_june_NH_zf.csv", na = "")


3) Read in formatted data, perform spatial sampling, and plot on map

#read in ebd_zf data
ebird<-read.csv("./Data/ebd_woothr_june_NH_zf.csv",header=TRUE)

#We'll want to set the map projection so every time we plot data (vector or raster), everything will line up correctly
map_proj <- "+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"

#get bounding box coords from eBird data
x.min=min(ebird$longitude, na.rm=TRUE)-0.4
x.max=max(ebird$longitude, na.rm=TRUE)+0.4
y.min=min(ebird$latitude, na.rm=TRUE)-0.3
y.max=max(ebird$latitude, na.rm=TRUE)+0.3

#compile bounding box corner coordinates
ebird.bbox = c(left = x.min, bottom = y.min, right = x.max, top = y.max)

#define map area and type
map.new <- get_stamenmap(bbox=ebird.bbox, maptype="toner-background", source="stamen", zoom=8)

stateMap <- raster::getData('GADM', country='USA', level=2)
nh.sub<-stateMap[stateMap@data$NAME_1 == "New Hampshire",]

nh.poly<-fortify(nh.sub, region="NAME_1")
nh.poly <- nh.poly[order(nh.poly$order),]
nh.poly$id<-as.factor(nh.poly$id)


#plot base map and vector data with ggplot
ebird$species_observed<-as.factor(as.character(ebird$species_observed))
ebird$species_observed<-factor(ebird$species_observed, levels=c("TRUE","FALSE"))

rawEbirdMap<-ggmap(map.new, darken=c(0.7,"white"))+
  geom_map(data=nh.poly, map=nh.poly, aes(y=lat, x=long, map_id=id),color="black",fill="gray90",alpha=0.8, inherit.aes = FALSE)+
  geom_point(data=ebird, aes(x=longitude, y=latitude, color=species_observed),alpha=0.6,shape=16,size=0.6)+
  scale_color_manual(values=c("red","gray40"), name="Detected")+
  #overlay layer of all locations with positive detections
  #geom_point(data=subset(ebird, species_observed==TRUE), aes(x=longitude, y=latitude),color="royalblue",alpha=0.7,shape=16,size=1)+
  theme(panel.border=element_rect(color="black",fill="transparent"))+
  labs(x="Longitude",y="Latitude")+
  guides(alpha="none")
#rawEbirdMap

mapOut<-rawEbirdMap+ggsn::scalebar(dist = 20, dist_unit="km", model = "WGS84", transform=TRUE, y.min = y.min, y.max = y.max, x.min =x.min, x.max = x.max, st.size = 2, st.dist=0.03, location = "bottomright", st.bottom = FALSE, border.size=0.3, anchor=c(x=-70.55, y=42.48))+
  theme(legend.position = c(0.2,0.8), legend.background=element_rect(color="black",fill=alpha("white",0.4)),
        legend.title = element_text(size = 7), legend.text = element_text(size = 7))

ggsn::north2(mapOut, x = 0.69, y = 0.26, scale = 0.1, symbol = 11)

4) Read in covariates as raster data and format


4a) Elevation (1-km resolution) data from http://www.earthenv.org/topography

#Elevation data
elev <- raster("./Data/elevation_1KMmd_GMTEDmd.tif")
raster::plot(elev, col=viridis(100))

#now subset NH and plot
#crop clips the elevation raster to extent of polygon
nh_elev<-crop(elev, nh.sub)
raster::plot(nh_elev, col=viridis(100))

#mask clips the elevation raster to include everything within the polygon's perimeter
nh_elev_crop<-mask(nh_elev, nh.sub)
raster::plot(nh_elev_crop, col=viridis(100))

#save cropped NH elev raster
writeRaster(nh_elev_crop, './Data/nh_elev_crop.tif', overwrite=TRUE)


4b) View land cover class data from MODIS terra satellite and prepare for use in models

To generate the data we’re loading in this section, you’ll use the ‘MODIS’ package, which requires some preliminary steps to get all the correct dependencies (e.g., GDAL, HDF4) installed and talking to each other correctly (by providing path information). For detailed information on downloading MODIS data from R there is plenty of guidance on the web, but refer here first: https://cornelllabofornithology.github.io/ebird-best-practices/
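
If you need to generate those MODIS files yourself, the download step looks roughly like the sketch below. This is only a hedged outline, not the exact call used to build the files in the Data folder: the getTile()/runGdal() arguments shown here (paths, dates, collection, and SDS selection) are placeholders you would adapt by following the best-practices guide linked above.

#rough sketch of downloading MCD12Q1 land cover with the 'MODIS' package
#(placeholder arguments -- adapt paths, dates, and SDS selection to your own setup)
# MODISoptions(localArcPath = "./Data/modis", outDirPath = "./Data/modis")
# tiles <- getTile(nh.sub)   # MODIS tiles covering New Hampshire
# runGdal(product = "MCD12Q1", collection = "006",
#         tileH = tiles@tileH, tileV = tiles@tileV,
#         begin = "2010.01.01", end = "2019.12.31",
#         SDSstring = "01",   # select the land cover layer you need
#         job = "modis_mcd12q1_umd")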

# load the landcover data as a raster stack
landcover <- list.files("./Data/modis", "^modis_mcd12q1_umd", 
                        full.names = TRUE) %>% 
  stack()

# label layers with year
landcover <- names(landcover) %>% 
  str_extract("(?<=modis_mcd12q1_umd_)[0-9]{4}") %>% 
  paste0("y", .) %>% 
  setNames(landcover, .)

#subset raster stack to get single raster layer, 9
landcover2019<-subset(landcover, 9, drop=TRUE)

#reproject raster 
landcover.proj<-raster::projectRaster(landcover2019, crs=map_proj)

#crop to NH polygon
landcover.nh<-crop(landcover.proj, nh.sub)
landcover.nh<-mask(landcover.nh, nh.sub)

#convert raster to data.frame
landcover2019.df<-as.data.frame(landcover2019, xy = TRUE, na.rm = TRUE)

#make land cover lookup data.frame
lc.lookup<-data.frame(value=0:15,Class = c("Water","Evergreen_needleleaf", "Evergreen_broadleaf", 
                               "Deciduous_needleleaf","Deciduous_broadleaf","Mixed_forest",
                               "Closed_shrubland", "Open_shrubland", "Woody_savanna","Savanna", 
                               "Grassland", "Wetland", "Cropland", "Urban","Mosiac", "Barren"))

#convert raster to a data.frame 
raster2.pts <- rasterToPoints(landcover.nh)
raster2.df <- data.frame(raster2.pts)

#reset colnames 
colnames(raster2.df)[3]<-"value"
#round numbers
raster2.df$value<-round(raster2.df$value,0)

#merge with raster2.df
raster.merge<-merge(raster2.df, lc.lookup, by="value",all.x=TRUE)


#set colors for land cover classes
myColors<-c("khaki3","yellow","darkgreen","springgreen3","seagreen3","darkolivegreen4","tan","chartreuse3","orange","chocolate","red","royalblue","turquoise","purple")

#plot with ggplot
landCoverMap<-ggmap(map.new, darken=c(0.7,"white"))+
  geom_tile(data=raster.merge, aes(x=x, y=y, fill=Class, color=Class))+
  # geom_map(data=nh.poly, map=nh.poly, aes(y=lat, x=long, map_id=id),color="black",fill="gray90",alpha=0.8, inherit.aes = FALSE)+
  scale_fill_manual(values=myColors, name="Land Cover Class")+
  scale_color_manual(values=myColors)+
#overlay layer of all locations with positive detections
  #geom_point(data=subset(ebird, species_observed==TRUE), aes(x=longitude, y=latitude),color="royalblue",alpha=0.7,shape=16,size=1)+
  theme(panel.border=element_rect(color="black",fill="transparent"))+
  labs(x="Longitude",y="Latitude")+
  guides(alpha="none", color="none")

landCoverOut<-landCoverMap+ggsn::scalebar(dist = 20, dist_unit="km", model = "WGS84", transform=TRUE, y.min = y.min, y.max = y.max, x.min =x.min, x.max = x.max, st.size = 2, st.dist=0.03, location = "bottomright", st.bottom = FALSE, border.size=0.3, anchor=c(x=-70.55, y=42.48))

ggsn::north2(landCoverOut, x = 0.55, y = 0.26, scale = 0.1, symbol = 11)


4c) Extract land cover proportions within a buffer around each eBird survey location.

# load the landcover data as a raster stack
landcover <- list.files("./Data/modis", "^modis_mcd12q1_umd", 
                        full.names = TRUE) %>% 
  raster::stack()

# label layers with year
landcover <- names(landcover) %>% 
  str_extract("(?<=modis_mcd12q1_umd_)[0-9]{4}") %>% 
  paste0("y", .) %>% 
  setNames(landcover, .)

######################################################################

#get most recent year in rasters
max_lc_year <- names(landcover) %>% 
  str_extract("[0-9]{4}") %>% 
  as.integer() %>% 
  max()

#define the neighborhood (buffer) radius around each location, in raster map units
neighborhood_radius <- 5 * ceiling(max(res(landcover))) / 2
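# with MODIS's ~500-m land cover cells, this works out to a radius of 2.5 cells,
# i.e., each buffer covers a neighborhood roughly 5 cells (~2.5 km) across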

#use buffer around survey locations to then extract raster cell values from
ebird_buff <- ebird %>% 
  distinct(year = format(as.Date(observation_date), "%Y"),
           locality_id, latitude, longitude) %>% 
  # for years after the most recent land cover year, use the most recent year available
  mutate(year_lc = if_else(as.integer(year) > max_lc_year, 
                           as.character(max_lc_year), year),
         year_lc = paste0("y", year_lc)) %>% 
  # convert to spatial features
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  # transform to modis projection
  st_transform(crs = raster::projection(landcover)) %>% 
  # buffer to create neighborhood around each point
  st_buffer(dist = neighborhood_radius) %>%
  # nest by year
  nest(data = c(year, locality_id, geometry))

####################################################################################
# function to extract landcover data for all checklists in a given year
calculate_pland <- function(yr, regions, lc) {
  # create a lookup table to get locality_id from row number
  locs <- st_set_geometry(regions, NULL) %>% 
    mutate(id = row_number())
  
  # extract using velox
  lc_vlx <- velox(lc[[yr]])
  lc_vlx$extract(regions, df = TRUE) %>% 
    # velox doesn't properly name columns, fix that
    set_names(c("id", "landcover")) %>% 
    # join to lookup table to get locality_id
    inner_join(locs, ., by = "id") %>% 
    select(-id)
}

# iterate over all years extracting landcover for all checklists in each
lc_extract <- ebird_buff %>% 
  mutate(pland = map2(year_lc, data, calculate_pland, lc = landcover)) %>% 
  select(pland) %>% 
  unnest(cols = pland)

pland <- lc_extract %>% 
  # count landcovers
  count(locality_id, year, landcover) %>% 
  # calculate proportion
  group_by(locality_id, year) %>% 
  mutate(pland = n / sum(n)) %>% 
  ungroup() %>% 
  select(-n) %>% 
  # remove NAs after tallying so pland is relative to total number of cells
  filter(!is.na(landcover))

# convert names to be more descriptive
lc_names <- tibble(landcover = 0:15,
                   lc_name = c("pland_00_water", 
                               "pland_01_evergreen_needleleaf", 
                               "pland_02_evergreen_broadleaf", 
                               "pland_03_deciduous_needleleaf", 
                               "pland_04_deciduous_broadleaf", 
                               "pland_05_mixed_forest",
                               "pland_06_closed_shrubland", 
                               "pland_07_open_shrubland", 
                               "pland_08_woody_savanna", 
                               "pland_09_savanna", 
                               "pland_10_grassland", 
                               "pland_11_wetland", 
                               "pland_12_cropland", 
                               "pland_13_urban", 
                               "pland_14_mosiac", 
                               "pland_15_barren"))
pland <- pland %>% 
  inner_join(lc_names, by = "landcover") %>% 
  arrange(landcover) %>% 
  select(-landcover)

# transform to wide format, filling in implicit missing values with 0s
pland <- pland %>% 
  pivot_wider(names_from = lc_name, 
              values_from = pland, 
              values_fill = list(pland = 0))

# save
write_csv(pland, "./Data/modis_pland_location-year.csv")

#Now for elevation data

# buffer each checklist location
ebird_buff_noyear <- ebird %>% 
  distinct(locality_id, latitude, longitude) %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  st_transform(crs = projection(elev)) %>% 
  st_buffer(dist = neighborhood_radius)

# extract using velox and calculate median and sd
locs <- st_set_geometry(ebird_buff_noyear, NULL) %>% 
    mutate(id = row_number())
elev_checklists <- velox(elev)$extract(ebird_buff_noyear, df = TRUE) %>% 
  # velox doesn't properly name columns, fix that
  set_names(c("id", "elevation")) %>% 
  # join to lookup table to get locality_id
  inner_join(locs, ., by = "id") %>% 
  # summarize
  group_by(locality_id) %>% 
  summarize(elevation_median = median(elevation, na.rm = TRUE),
            elevation_sd = sd(elevation, na.rm = TRUE))

# extract using velox and calculate median and sd for the prediction surface
# NOTE: r_cells (the buffered prediction-surface cells) is created in the prediction-surface
# step of the eBird best-practices workflow, which is not shown in this document
elev_pred <- velox(elev)$extract(r_cells, df = TRUE) %>% 
  # velox doesn't properly name columns, fix that
  set_names(c("id", "elevation")) %>% 
  # summarize
  group_by(id) %>% 
  summarize(elevation_median = median(elevation, na.rm = TRUE),
            elevation_sd = sd(elevation, na.rm = TRUE))

# checklist covariates
pland_elev_checklist <- inner_join(pland, elev_checklists, by = "locality_id")
write_csv(pland_elev_checklist, "./Data/pland-elev_location-year.csv")

# prediction surface covariates
# NOTE: pland_coords (land cover proportions for the prediction-surface cells) also comes
# from the best-practices prediction-surface step, not shown here
pland_elev_pred <- inner_join(pland_coords, elev_pred, by = "id")
write_csv(pland_elev_pred, "./Data/pland-elev_prediction-surface.csv")


4d) Combine eBird data with habitat covariates

# set random number seed to ensure fully repeatable results
set.seed(1)

# setup output directory for saved results
if (!dir.exists("output")) {
  dir.create("output")
}

# ebird data
ebird <- read_csv("./Data/ebd_woothr_june_NH_zf.csv") %>% 
  mutate(year = year(observation_date),
         # occupancy modeling requires an integer response
         species_observed = as.integer(species_observed))

# modis land cover covariates
habitat <- read_csv("./Data/pland-elev_location-year.csv") %>% 
  mutate(year = as.integer(year))

# combine ebird and modis data
ebird_habitat <- inner_join(ebird, habitat, by = c("locality_id", "year"))

# prediction surface
pred_surface <- read_csv("./Data/pland-elev_prediction-surface.csv")
# latest year of landcover data
max_lc_year <- pred_surface$year[1]
r <- raster("./Data/prediction-surface.tif")

# load gis data for making maps
map_proj <- "+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
ne_land <- read_sf("./Data/GIS_data/gis-data.gpkg", "ne_land") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
bcr <- read_sf("./Data/GIS_data/gis-data.gpkg", "bcr") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
ne_country_lines <- read_sf("./Data/GIS_data/gis-data.gpkg", "ne_country_lines") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
ne_state_lines <- read_sf("./Data/GIS_data/gis-data.gpkg", "ne_state_lines") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()

#################################################

#prepare data (filtering)
# filter prior to creating occupancy model data
ebird_filtered <- filter(ebird_habitat, 
                         number_observers <= 5,
                         year == max(year))

#format data
occ <- filter_repeat_visits(ebird_filtered, 
                            min_obs = 2, max_obs = 10,
                            annual_closure = TRUE,
                            date_var = "observation_date",
                            site_vars = c("locality_id", "observer_id"))
# entire data set
nrow(ebird_habitat)
#> [1] 134122
# reduced data set
nrow(occ)
#> [1] 11179
# number of individual sites
n_distinct(occ$site)
#> [1] 1343
# format for unmarked
occ_wide <- format_unmarked_occu(occ, 
                                 site_id = "site", 
                                 response = "species_observed",
                                 site_covs = c("n_observations", 
                                               "latitude", "longitude", 
                                               "pland_04_deciduous_broadleaf", 
                                               "pland_05_mixed_forest",
                                               "pland_12_cropland",
                                               "pland_13_urban",
                                               "elevation_median"),
                                 obs_covs = c("time_observations_started", 
                                              "duration_minutes", 
                                              "effort_distance_km", 
                                              "number_observers", 
                                              "protocol_type",
                                              "pland_04_deciduous_broadleaf", 
                                              "pland_05_mixed_forest",
                                              "pland_12_cropland",
                                              "pland_13_urban",
                                              "elevation_median"))


4e) Spatial sub-sampling of survey locations

#Spatial sub sampling
# generate hexagonal grid with ~ 5 km betweeen cells
dggs <- dgconstruct(spacing = 5)

# get hexagonal cell id for each site
occ_wide_cell <- occ_wide %>% 
  mutate(cell = dgGEO_to_SEQNUM(dggs, longitude, latitude)$seqnum)
# sample one site per grid cell
occ_ss <- occ_wide_cell %>% 
  group_by(cell) %>% 
  sample_n(size = 1) %>% 
  ungroup() %>% 
  select(-cell)
# calculate the percent decrease in the number of sites
#1 - nrow(occ_ss) / nrow(occ_wide)


###################################################################
# generate polygons for the grid cells
hexagons <- dgcellstogrid(dggs, unique(occ_wide_cell$cell), frame = FALSE) %>% 
  st_as_sf()

occ_wide$Points<-paste("All (n = ",length(unique(occ_wide$site)), ")",sep="")
occ_ss$Points<-paste("Sampled (n = ", length(unique(occ_ss$site)), ")", sep="")

occ_comb<-rbind(occ_wide, occ_ss)
occ_comb$Points<-as.factor(occ_comb$Points)
levels(occ_comb$Points)

plotColors<-c("gray20","red")

subSamplePlot<-ggmap(map.new, darken=c(0.7,"white"))+
  geom_sf(data = hexagons,inherit.aes = FALSE)+
  geom_point(data=occ_comb, aes(x=longitude,y=latitude,color=Points),size=0.8,alpha=0.6, shape=16, inherit.aes = FALSE)+
  theme(panel.border=element_rect(color="black",fill="transparent"), panel.background = element_rect(fill="white"))+
  scale_color_manual(values=plotColors)+
  labs(x="Longitude",y="Latitude")+
  guides(alpha="none")
  
subSampleFacet<-subSamplePlot+facet_wrap(.~Points)
subSampleFacet


5) Make unmarkedFrame object with eBird data

#create unmarkedFrame object
# drop the plotting-only label column added above before formatting for 'unmarked'
occ_ss$Points <- NULL
occ_um <- formatWide(occ_ss, type = "unmarkedFrameOccu")
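
Optionally, take a quick look at what ended up in the unmarkedFrame before fitting any models (a quick check using standard ‘unmarked’ accessors):

#optional: inspect the unmarkedFrame
summary(occ_um)      # number of sites, detection summary, covariates
head(getY(occ_um))   # first few detection histories (rows = sites, columns = repeat visits)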


6) Fitting occupancy models using ‘unmarked’ package

# fit model occu(~detection process ~occupancy process)
occ_model <- occu(~ time_observations_started + duration_minutes + effort_distance_km + number_observers + protocol_type + pland_04_deciduous_broadleaf + pland_05_mixed_forest + pland_12_cropland + pland_13_urban + elevation_median
                  ~ pland_04_deciduous_broadleaf + pland_05_mixed_forest + 
                    pland_12_cropland + pland_13_urban+elevation_median, 
                  data = occ_um)
# look at the regression coefficients from the model
pander(summary(occ_model))

Call: occu(formula = ~time_observations_started + duration_minutes + effort_distance_km + number_observers + protocol_type + pland_04_deciduous_broadleaf + pland_05_mixed_forest + pland_12_cropland + pland_13_urban + elevation_median ~ pland_04_deciduous_broadleaf + pland_05_mixed_forest + pland_12_cropland + pland_13_urban + elevation_median, data = occ_um)

Occupancy (logit-scale):
                               Estimate        SE        z   P(>|z|)
(Intercept)                    -2.29259   0.61687  -3.7165  0.000202
pland_04_deciduous_broadleaf    1.59035   0.91097   1.7458  0.080849
pland_05_mixed_forest           1.85037   0.87616   2.1119  0.034695
pland_12_cropland               1.96672  45.55465   0.0432  0.965564
pland_13_urban                 -0.22543   2.27904  -0.0989  0.921205
elevation_median               -0.00402   0.00129  -3.1089  0.001878

Detection (logit-scale):
                               Estimate        SE       z   P(>|z|)
(Intercept)                     2.29085   1.21631   1.883  5.96e-02
time_observations_started      -0.04547   0.04869  -0.934  3.50e-01
duration_minutes                0.01849   0.00735   2.517  1.19e-02
effort_distance_km             -0.30976   0.36728  -0.843  3.99e-01
number_observers               -0.17555   0.64105  -0.274  7.84e-01
protocol_typeTraveling          1.95479   0.84342   2.318  2.05e-02
pland_04_deciduous_broadleaf   -0.48258   1.30947  -0.369  7.12e-01
pland_05_mixed_forest          -0.24232   1.21475  -0.199  8.42e-01
pland_12_cropland             -91.25579  22.22884  -4.105  4.04e-05
pland_13_urban                 -9.26570   2.77085  -3.344  8.26e-04
elevation_median               -0.00294   0.00249  -1.183  2.37e-01

AIC: 455.9301
Number of sites: 380
optim convergence code: 0
optim iterations: 100
Bootstrap iterations: 0


7) Check goodness of model fit

occ_gof <- mb.gof.test(occ_model, nsim = 10, plot.hist = FALSE)
# hide the chisq table to give simpler output
occ_gof$chisq.table <- NULL
print(occ_gof)

MacKenzie and Bailey goodness-of-fit for single-season occupancy model

Chi-square statistic = 1630.075
Number of bootstrap samples = 10
P-value = 0.7

Quantiles of bootstrapped statistics:
   0%   25%   50%   75%  100% 
 1355  1700  2065  2280  5477 

Estimate of c-hat = 0.67

# get list of all possible terms, then subset to those we want to keep
det_terms <- getAllTerms(occ_model) %>% 
  # retain the detection submodel covariates
  discard(str_detect, pattern = "psi")


8) Model comparison and selection with Akaike’s Information Criterion (AIC)

# dredge all possible combinations of the occupancy covariates
occ_dredge <- dredge(occ_model, fixed = det_terms)
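#optional: peek at the ranked model-selection table from dredge before trimming it below
# head(occ_dredge)   # top-ranked models, sorted by AICc
# nrow(occ_dredge)   # total number of candidate models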

# model comparison
mc <- select(occ_dredge, contains("psi("), df, AICc, delta, weight)
mc<-mc[,-1]

names(mc)<-gsub("psi\\(","", gsub("\\)","",names(mc)))
mc %>% 
  mutate_all(~ round(., 3)) %>% 
  pander()
elevation_median  pland_04_deciduous_broadleaf  pland_05_mixed_forest  pland_12_cropland  pland_13_urban  df  AICc   delta  weight
-0.004            1.647                         1.9                    NA                 NA              15  453.4  0      0.189
-0.003            NA                            1.099                  NA                 NA              14  454.6  1.204  0.104
-0.002            NA                            NA                     NA                 NA              13  455.1  1.719  0.08
-0.004            1.621                         1.885                  0.319              NA              16  455.6  2.161  0.064
-0.004            1.627                         1.915                  NA                 -0.089          16  455.6  2.244  0.062
NA                NA                            NA                     NA                 NA              12  455.8  2.376  0.058
-0.003            NA                            NA                     NA                 -2.567          14  455.9  2.481  0.055
-0.003            NA                            0.939                  NA                 -1.626          15  456.4  2.96   0.043
NA                NA                            0.687                  NA                 NA              13  456.6  3.221  0.038
-0.003            0.563                         NA                     NA                 NA              14  456.9  3.514  0.033
-0.003            NA                            1.094                  -4.226             NA              15  457.1  3.673  0.03
-0.004            1.59                          1.85                   1.967              -0.225          17  457.6  4.229  0.023
NA                NA                            NA                     NA                 -0.674          13  457.8  4.394  0.021
-0.003            0.424                         NA                     NA                 -2.267          15  457.8  4.434  0.021
NA                -0.056                        NA                     NA                 NA              13  457.9  4.506  0.02
-0.002            NA                            NA                     -9.323             NA              14  458.1  4.737  0.018
NA                NA                            NA                     -4.391             NA              13  458.3  4.884  0.016
-0.003            NA                            NA                     -0.764             -2.519          15  458.3  4.917  0.016
-0.003            NA                            0.975                  1.004              -1.668          16  458.4  4.99   0.016
NA                0.275                         0.812                  NA                 NA              14  458.6  5.214  0.014
NA                NA                            0.7                    NA                 0.085           14  458.8  5.371  0.013
NA                NA                            0.687                  -0.786             NA              14  458.8  5.435  0.013
-0.003            0.562                         NA                     -8.499             NA              15  459.6  6.216  0.008
NA                -0.134                        NA                     NA                 -0.873          14  459.9  6.529  0.007
NA                NA                            NA                     -0.394             -0.665          14  460    6.586  0.007
-0.003            0.426                         NA                     -0.994             -2.235          16  460    6.609  0.007
NA                -0.07                         NA                     -3.96              NA              14  460.4  6.974  0.006
NA                0.411                         0.947                  NA                 0.717           15  460.7  7.306  0.005
NA                0.29                          0.818                  0.78               NA              15  460.7  7.321  0.005
NA                NA                            0.697                  1.394              0.058           15  460.8  7.434  0.005
NA                -0.119                        NA                     0.358              -0.761          15  462    8.656  0.003
NA                0.388                         0.929                  1.892              0.589           16  462.7  9.336  0.002


9) Examine how detection-process covariates influence p(Detection)

detformulaList<-c(
  ~duration_minutes,
  ~effort_distance_km,
  ~number_observers,
  ~time_observations_started,
  ~pland_04_deciduous_broadleaf,
  ~pland_05_mixed_forest,
  ~pland_12_cropland,
  ~pland_13_urban,
  ~elevation_median
)


#get obsCovs as data.frame
obsCovs.df<-as.data.frame(occ_um@obsCovs)
#now fit models

#create empty list to save predicted detection Probs
DetCovariatePred<-list()

#for loop to iterate over detection covariates and produce predicted estimates for detection prob
for(i in 1:length(detformulaList)){
  print(detformulaList[[i]])
  new.occ_model <- occu(detformulaList[[i]]~ 1,  data=occ_um)
  
  #get det covariate name
  varName<-gsub("~","",as.character(detformulaList[[i]]))[2]

  #get range of values from data to simulate new data
  varData<-obsCovs.df[,names(obsCovs.df)==varName]

  #simulate new data
  sim.data<-data.frame(seq(min(varData,na.rm=TRUE), max(varData,na.rm=TRUE),length.out=100))
  colnames(sim.data)<-c(varName)
  #get predicted values using sim.data
  pred.df<-predict(new.occ_model, type="det", newdata=sim.data, appendData=TRUE)
  colnames(pred.df)[5]<-"det_cov"
  
  #add column with covariate name
  pred.df$CovariateName<-varName
  
  DetCovariatePred<-rbind(DetCovariatePred, pred.df)
  
}

~duration_minutes
~effort_distance_km
~number_observers
~time_observations_started
~pland_04_deciduous_broadleaf
~pland_05_mixed_forest
~pland_12_cropland
~pland_13_urban
~elevation_median

  #make CovariateName a factor
  DetCovariatePred$CovariateName<-as.factor(DetCovariatePred$CovariateName)
  
  # Plot it
  detPlot<-ggplot(data=DetCovariatePred)+
    geom_ribbon(aes(x=det_cov, ymin=lower, ymax=upper), fill="gray80")+
    geom_line(aes(x=det_cov, y=Predicted),color="royalblue3")+
    labs(x="Covariate Values", y="Detection Probability")+
    #ylim(0,1)+
    theme(panel.border=element_rect(color="black",fill="transparent"), panel.background = element_rect(fill="white"))

  detPlotFacet<-detPlot+facet_wrap(.~CovariateName, scales="free",ncol=3)
  detPlotFacet


10) Examine how occupancy process covariates influence p(Occupancy)

occformulaList<-c(
  ~1~pland_04_deciduous_broadleaf,
  ~1~pland_05_mixed_forest,
  ~1~pland_12_cropland,
  ~1~pland_13_urban,
  ~1~elevation_median
)

#get obsCovs as data.frame
siteCovs.df<-as.data.frame(occ_um@siteCovs)
#now fit models

#create empty list to save predicted occupancy probs
occCovariatePredOcc<-list()

#for loop to iterate over occupancy covariates and produce predicted estimates for occupancy prob
for(i in 1:length(occformulaList)){
  print(occformulaList[[i]])
  new.occ_model <- occu(occformulaList[[i]],  data=occ_um)
  
  #get occupancy covariate name
  varName<-gsub("~","",as.character(occformulaList[[i]]))[3]

  #get range of values from data to simulate new data
  varData<-siteCovs.df[,names(siteCovs.df)==varName]

  #simulate new data
  sim.data<-data.frame(seq(min(varData,na.rm=TRUE), max(varData,na.rm=TRUE),length.out=100))
  colnames(sim.data)<-c(varName)
  #get predicted values using sim.data
  pred.df<-predict(new.occ_model, type="state", newdata=sim.data, appendData=TRUE)
  colnames(pred.df)[5]<-"occ_cov"
  
  #add column with covariate name
  pred.df$CovariateName<-varName
  
  occCovariatePredOcc<-rbind(occCovariatePredOcc, pred.df)
  
}

~1 ~ pland_04_deciduous_broadleaf
~1 ~ pland_05_mixed_forest
~1 ~ pland_12_cropland
~1 ~ pland_13_urban
~1 ~ elevation_median

  #make CovariateName a factor
  occCovariatePredOcc$CovariateName<-as.factor(occCovariatePredOcc$CovariateName)

  # Plot it
  occPlot<-ggplot(data=occCovariatePredOcc)+
    geom_ribbon(aes(x=occ_cov, ymin=lower, ymax=upper), fill="gray80")+
    geom_line(aes(x=occ_cov, y=Predicted),color="seagreen4")+
    labs(x="Covariate Values", y="Probability of Site Occupancy")+
    #ylim(0,1)+
    theme(panel.border=element_rect(color="black",fill="transparent"), panel.background = element_rect(fill="white"))

  occPlotFacet<-occPlot+facet_wrap(.~CovariateName, scales="free",ncol=3)
  occPlotFacet


11) Use model averaging of top-performing models to estimate wood thrush occupancy

# select models with the most support for model averaging
occ_dredge_95 <- get.models(occ_dredge, subset = cumsum(weight) <= 0.95)

# average models based on model weights
occ_avg <- model.avg(occ_dredge_95, fit = TRUE)

# model coefficients
t(occ_avg$coefficients) %>%
knitr::kable()
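#optional: fuller summary of the model-averaged coefficients
# summary(occ_avg)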

#now use model to make predictions
occ_pred <- predict(occ_avg, 
                    newdata = as.data.frame(pred_surface), 
                    type = "state")

# add to prediction surface
pred_occ <- bind_cols(pred_surface, occ_prob = occ_pred$fit,occ_se = occ_pred$se.fit) %>% 
  select(latitude, longitude, occ_prob, occ_se)


r_pred <- pred_occ %>% 
  # convert to spatial features
  st_as_sf(coords = c("longitude", "latitude"), crs = map_proj) %>% 
  st_transform(crs = raster::projection(r)) %>% 
  # rasterize
  rasterize(r)
r_pred <- r_pred[[c("occ_prob", "occ_se")]]

# save the raster
tif_dir <- "output"
if (!dir.exists(tif_dir)) {
  dir.create(tif_dir)
}
writeRaster(r_pred[["occ_prob"]], 
            filename = file.path(tif_dir, "occupancy-model_prob_woothr.tif"),
            overwrite = TRUE)
writeRaster(r_pred[["occ_se"]], 
            filename = file.path(tif_dir, "occupancy-model_se_woothr.tif"), 
            overwrite = TRUE)


12) Map the mean and SE probability of occupancy for wood thrushes in NH.

#now map results
# project predictions
r_pred_proj <- projectRaster(r_pred, crs = map_proj, method = "ngb")

stateMap <- raster::getData('GADM', country='USA', level=2)
nh.sub<-stateMap[stateMap@data$NAME_1 == "New Hampshire",]

###################################################
#plot predicted occupancy in NH for each model
r_pred_proj_crop<-crop(r_pred_proj, nh.sub)
r_pred_proj_NH<-mask(r_pred_proj_crop,nh.sub)

r_pred_occu_NH<-subset(r_pred_proj_NH, 1, drop=TRUE)

# Convert the landscape data RasterLayer objects to data frames for ggplot
r_pred_occu_NH.df <- as.data.frame(r_pred_occu_NH, xy = TRUE, na.rm = TRUE)

#now add SE
r_pred_occuSE_NH<-subset(r_pred_proj_NH, 2, drop=TRUE)

# Convert the landscape data RasterLayer objects to data frames for ggplot
r_pred_occuSE_NH.df <- as.data.frame(r_pred_occuSE_NH, xy = TRUE, na.rm = TRUE)

#add Statistic Colum and standardize column names
r_pred_occu_NH.df$Statistic<-"Mean p(Occupancy)"
colnames(r_pred_occu_NH.df)[3]<-"Probability"
r_pred_occuSE_NH.df$Statistic<-"SE"
colnames(r_pred_occuSE_NH.df)[3]<-"Probability"

#combine data.frames
r_pred_comb<-rbind(r_pred_occu_NH.df, r_pred_occuSE_NH.df)

#make Statistic a factor
r_pred_comb$Statistic<-as.factor(r_pred_comb$Statistic)

NHOccuMap<-ggmap(map.new, darken=c(0.7,"white"))+
  geom_tile(data=r_pred_comb, aes(x=x, y=y, fill=Probability, color=Probability))+
  geom_map(data=nh.poly, map=nh.poly, aes(y=lat, x=long, map_id=id),color="black",fill="transparent",alpha=0.8, inherit.aes = FALSE)+
  scale_fill_viridis_c(name="p(occupancy)")+
  scale_color_viridis_c(name="p(occupancy)")+
theme(panel.border=element_rect(color="black",fill="transparent"))+
  labs(x="Longitude",y="Latitude")+
  guides(alpha="none", color="none")
NHOccuMap.facet<-NHOccuMap+facet_grid(~Statistic)

levels(r_pred_comb$Statistic)

[1] "Mean p(Occupancy)" "SE"

NHOccuMapOut<-NHOccuMap.facet+ggsn::scalebar(dist = 40, dist_unit="km", model = "WGS84", transform=TRUE, y.min = y.min, y.max = y.max, x.min =x.min, x.max = x.max, st.size = 2, st.dist=0.03, location = "bottomright", st.bottom = FALSE, border.size=0.3, anchor=c(x=-70.55, y=42.48))
  # theme(legend.key.size = ,legend.position = c(0.22,0.8),   legend.background=element_rect(color="black",fill=alpha("white",0.6)),
  #       legend.title = element_text(size = 5), legend.text = element_text(size = 5))

ggsn::north2(NHOccuMapOut, x = 0.75, y = 0.26, scale = 0.1, symbol = 11)


References

Amatulli, G., S. Domisch, M.-N. Tuanmu, B. Parmentier, A. Ranipeta, J. Malczyk, and W. Jetz. 2018. A suite of global, cross-scale topographic variables for environmental and biodiversity modeling. Scientific Data 5:180040. https://doi.org/10.1038/sdata.2018.40

Fiske, I., and R. Chandler. 2011. unmarked: An R package for fitting hierarchical models of wildlife occurrence and abundance. Journal of Statistical Software 43(10):1-23.

Friedl, M., and D. Sulla-Menashe. 2015. MCD12C1 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 0.05Deg CMG V006. Distributed by NASA EOSDIS Land Processes DAAC. https://doi.org/10.5067/MODIS/MCD12C1.006. Accessed 2020-02-14.

MacKenzie, D.I., J.D. Nichols, J.A. Royle, K.H. Pollock, L. Bailey, and J.E. Hines. 2017. Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence. Elsevier.

Strimas-Mackey, M., W.M. Hochachka, V. Ruiz-Gutierrez, O.J. Robinson, E.T. Miller, T. Auer, S. Kelling, D. Fink, and A. Johnston. 2020. Best Practices for Using eBird Data. Version 1.0. https://cornelllabofornithology.github.io/ebird-best-practices/. Cornell Lab of Ornithology, Ithaca, New York. https://doi.org/10.5281/zenodo.3620739

Sullivan, B.L., C.L. Wood, M.J. Iliff, R.E. Bonney, D. Fink, and S. Kelling. 2009. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation 142:2282-2292.