BIOL 827 Dr. Jeff Lane
Assignment 1: Data Exploration and preliminary data analysis using LM or GLM
Kathrine M. Stewart
Question: At a landscape level, do female boreal woodland caribou (Rangifer tarandus caribou) with calves select for different habitat attributes than female caribou without calves during the summer season?
Hypothesis: Female caribou with calves will be more risk adverse than those without calves, and will therefore spend more time in habitats that reduce encounters with predators.
Prediction: Female caribou with calves will spend more time in muskegs and mature jack pine forest and will show a stronger avoidance of linear features than female caribou that do not have calves.
STEP 1. Data Collection
For this analysis, I used GPS point locations (n=11965) collected from from 54 radio collared, adult female caribou during the 2014 summer season (July 1st-September 15th). Collars were programmed to transmit a spatial location every 5 hours; however, due to failed fixes and random shifts in recording intervals, time intervals between points are irregular and individual sample sizes are unbalanced.
Using ArcGIS 10.2.1(c), I attached the following ecologically relevant attributes to each GPS point (variable, [code name in data frame]):
-elevation(m), [Elev]
-slope(degrees), [Slope]
-aspect, [Asp]
-Euclidean distance to the nearest linear feature (m, e.g., road, seismic line etc.), [LinDist]
-Euclidean distance to the nearest water course or water body (m), [WatDist]
-habitat class (categorical variable with 8 levels):
+Jack Pine Dominated, Early Successional Stage (<=10 yrs after fire), [JPES] +Jack Pine Dominated, Mid-Succession, (between 10 yrs and 40yrs after fire)[JPMid] +Jack Pine Dominated, Mature, (40+ yrs after fire), [JPMat] +Black Spruce Dominated, [BSDom] +White Birch Dominated, [WBDom] +Mixed Deciduous-Conifer, [MixCD] +Treed Muskeg, [Tmusk] +Open Muskeg, [Omusk]
Additionally, each point was classified as belonging to a caribou with a calf (hereafter CWC) or without a calf (hereafter CNC) according to data collected from calf and cow surveys conducted in March 2015. Because calf mortality is extremely high in the post-calving period, (which immediately precedes the summer season), I assumed that all caribou without calves in the March 2015 survey lost their calves prior to the 2014 summer season.
The study area was defined as the collared caribou’s summer range, which was derived from a minimum convex polygon enclosing all point locations recorded from the 54 caribou during the summer. These locations represent a sample of “used” locations drawn from all possible locations within the summer range. In order to build a habitat selection model, these locations must be compared to a sample of available locations drawn from within the same spatial area. Thus, I randomly generated an equal number of point locations (n=11965) within the defined summer range to create a sample of points representing what is available to the 54 caribou. Available points were only generated in regions of the study area that were plausibly accessible to caribou, (i.e., no points were generated in the middle of large water bodies). As with the used locations, a suite of spatial attributes was attached to each available location. By setting “use” vs. “availabile” as a binomial response, I am able to evaluate the probability of caribou occurrence within the study area.
STEP 2. Data Checking and Manipulation
sum <- read.csv("C:/Users/pdm276/Desktop/BIO827/Assignments/JeffLane_Assignment1/Summer.csv")
sum$AID<-as.factor(as.character(sum$AID))
sum$UseAv<-factor(sum$UseAv,levels=c(1,0), labels=c("Used", "Available")) #change 1 and 0 to Used and Available
sum$UseAv<-as.factor(as.character(sum$UseAv))
sum$CSN<-as.factor(as.character(sum$CSN))
sum$JPES<-as.factor(as.character(sum$JPES))
sum$JPMid<-as.factor(as.character(sum$JPMid))
sum$JPMat<-as.factor(as.character(sum$JPMat))
sum$BSDom<-as.factor(as.character(sum$BSDom))
sum$WBDom<-as.factor(as.character(sum$WBDom))
sum$MixCD<-as.factor(as.character(sum$MixCD))
sum$Tmusk<-as.factor(as.character(sum$Tmusk))
sum$Omusk<-as.factor(as.character(sum$Omusk))
#create column with Julian Dates
sum$Date<-as.character(sum$Date)
sum$Date<-strptime(sum$Date, format="%m/%d/%Y")
str(sum$Date) #check date in POSIXlt format
## POSIXlt[1:23930], format: "2014-07-01" "2014-07-01" "2014-07-01" "2014-07-01" ...
sum$Time<-as.character(sum$Time)
str(sum$Time)
## chr [1:23930] "0:00:35" "0:00:35" "0:00:35" "0:00:35" ...
sum$Jul.Date<-as.POSIXct(paste(sum$Date, sum$Time), format="%Y-%m-%d %H:%M:%S")
str(sum$Jul.Date) #check date in POSIXct format
## POSIXct[1:23930], format: "2014-07-01 00:00:35" "2014-07-01 00:00:35" ...
str(sum)
## 'data.frame': 23930 obs. of 27 variables:
## $ PID : int 380 380 495 495 460 460 484 484 476 476 ...
## $ UseAv : Factor w/ 2 levels "Available","Used": 1 2 1 2 1 2 1 2 1 2 ...
## $ AID : Factor w/ 54 levels "140101","140102",..: 2 2 7 7 13 13 23 23 34 34 ...
## $ CSN : Factor w/ 54 levels "670074","670075",..: 37 37 24 24 48 48 2 2 21 21 ...
## $ Date : POSIXlt, format: "2014-07-01" "2014-07-01" ...
## $ Time : chr "0:00:35" "0:00:35" "0:00:35" "0:00:35" ...
## $ Season : Factor w/ 1 level "Summer": 1 1 1 1 1 1 1 1 1 1 ...
## $ Calf : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
## $ Veg : Factor w/ 8 levels "BSDom","JPES",..: 1 4 8 3 2 1 2 2 1 1 ...
## $ JPES : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 1 1 ...
## $ JPMid : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ JPMat : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
## $ BSDom : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 2 ...
## $ WBDom : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ MixCD : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tmusk : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Omusk : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Slope : int 0 1 10 1 2 7 2 2 3 2 ...
## $ Asp : num 23.2 106 309.1 11.3 345.5 ...
## $ Elev : int 456 460 433 357 441 379 553 367 353 554 ...
## $ Age : int 16 16 33 40 4 40 7 4 40 40 ...
## $ Island : Factor w/ 2 levels "Island","Mainland": 2 2 2 1 2 2 2 1 2 2 ...
## $ LinDist : num 25864 4911 5348 6793 9902 ...
## $ Lin_Feat: Factor w/ 2 levels "LinearDist","Major_Roads": 1 1 1 1 2 1 1 1 1 1 ...
## $ WatDist : num 2986 2599 325 0 429 ...
## $ Wat_Feat: Factor w/ 2 levels "WaterBodies",..: 1 1 2 1 1 1 2 1 2 1 ...
## $ Jul.Date: POSIXct, format: "2014-07-01 00:00:35" "2014-07-01 00:00:35" ...
sapply(sum, function(x) sum(is.na(x)))#check for NAs
## PID UseAv AID CSN Date Time Season Calf
## 0 0 0 0 0 0 0 0
## Veg JPES JPMid JPMat BSDom WBDom MixCD Tmusk
## 0 0 0 0 0 0 0 0
## Omusk Slope Asp Elev Age Island LinDist Lin_Feat
## 0 0 0 0 0 0 0 0
## WatDist Wat_Feat Jul.Date
## 0 0 0
head(sum)
## PID UseAv AID CSN Date Time Season Calf Veg JPES
## 1 380 Available 140102 670160 2014-07-01 0:00:35 Summer Yes BSDom 0
## 2 380 Used 140102 670160 2014-07-01 0:00:35 Summer Yes JPMid 0
## 3 495 Available 140107 670109 2014-07-01 0:00:35 Summer No WBDom 0
## 4 495 Used 140107 670109 2014-07-01 0:00:35 Summer No JPMat 0
## 5 460 Available 140116 670184 2014-07-01 0:00:35 Summer No JPES 1
## 6 460 Used 140116 670184 2014-07-01 0:00:35 Summer No BSDom 0
## JPMid JPMat BSDom WBDom MixCD Tmusk Omusk Slope Asp Elev Age Island
## 1 0 0 1 0 0 0 0 0 23.20 456 16 Mainland
## 2 1 0 0 0 0 0 0 1 105.95 460 16 Mainland
## 3 0 0 0 1 0 0 0 10 309.14 433 33 Mainland
## 4 0 1 0 0 0 0 0 1 11.31 357 40 Island
## 5 0 0 0 0 0 0 0 2 345.47 441 4 Mainland
## 6 0 0 1 0 0 0 0 7 310.84 379 40 Mainland
## LinDist Lin_Feat WatDist Wat_Feat Jul.Date
## 1 25863.94 LinearDist 2985.92 WaterBodies 2014-07-01 00:00:35
## 2 4911.23 LinearDist 2599.29 WaterBodies 2014-07-01 00:00:35
## 3 5348.42 LinearDist 324.96 Watercourses 2014-07-01 00:00:35
## 4 6792.65 LinearDist 0.00 WaterBodies 2014-07-01 00:00:35
## 5 9902.15 Major_Roads 428.65 WaterBodies 2014-07-01 00:00:35
## 6 12721.82 LinearDist 1452.08 WaterBodies 2014-07-01 00:00:35
STEP 3. Data Exploration
Exploring the raw data is a crucial part of any scientific study. Zurr et al. (2009) suggest researchers follow an 8 step data exploration protocol before proceeding with data analysis in order to avoid common statistical errors (see Figure 1, pp.4, Zuur et al. 2009). Not all of these steps apply to my data. For example, because my response variable (Y) is binomial, I did not test for outliers in the response variable (e.g., Step 1: Outliers in X and Y, Zurr et al. 2009), nor did I check for normality, homogeneity or zero trouble in Y (e.g., steps 2-4, Fig. 1, Zurr et al. 2009). I also did not check for independence in the response variable (e.g., Step 8: Independence in Y, Zurr et al. 2009) because I have a priori knowledge that when GPS locations are repeatedly sampled from individuals over short time intervals (as is the case with this sample of GPS locations), they will exhibit some degree of spatial-temporal autocorrelation.
I was able to explore the data for (a) outliers in the explanatory variables; (b) collinearity between explanatory variables; (c) relationships between the response and explanatory variables; and (d) interactions between explanatory variables.
(A) Outliers in X
(i) Continuous Explanatory Variables:
[Elevation]
par(mfrow=c(1,2))
boxplot(sum$Elev, ylab="Elevation (m)", col="blue", cex.lab=0.8, pch=16)
dotchart(sum$Elev, xlab="Elevation (m)", ylab="Order of Data", pch=16, cex.lab=0.8)
Figure 1: Boxplot and clevland dot plot to test for outliers in the explanatory variable, elevation (measured in meters above sea level). For the clevland dot plot, the data is ordered by Julian date.
par(mfrow=c(1,1))
From the box plot and Clevland dotplot, there appear to be several values for elevation which fall outside the range of elevation for the majority of the points. The range of elevation within the study area is 271m to 697m so all elevation values within this range are possible.
sum[sum$Elev>697,] #check for elevation values > 697m
## [1] PID UseAv AID CSN Date Time Season
## [8] Calf Veg JPES JPMid JPMat BSDom WBDom
## [15] MixCD Tmusk Omusk Slope Asp Elev Age
## [22] Island LinDist Lin_Feat WatDist Wat_Feat Jul.Date
## <0 rows> (or 0-length row.names)
sum[sum$Elev<271,] #check for elevation values < 271m
## [1] PID UseAv AID CSN Date Time Season
## [8] Calf Veg JPES JPMid JPMat BSDom WBDom
## [15] MixCD Tmusk Omusk Slope Asp Elev Age
## [22] Island LinDist Lin_Feat WatDist Wat_Feat Jul.Date
## <0 rows> (or 0-length row.names)
None of the elevation values fall outside the plausible range so I retained all values for this variable.
[Slope]
par(mfrow=c(1,2))
boxplot(sum$Slope, ylab="Slope (deg.)", col="green", cex.lab=0.8, pch=16)
dotchart(sum$Slope, xlab="Slope (deg.)", ylab="Order of Data", pch=16, cex.lab=0.8)
Figure 2: Boxplot and clevland dot plot to test for outliers in the explanatory variable, slope (measured in degrees). For the clevland dot plot, the data is ordered by Julian date.
par(mfrow=c(1,1))
From the box plot and Clevland dotplot, there appear to be several outliers for the variable Slope. The range of possible slope values within the study area is 0 degrees to 42.2511 degrees so all values within this range are possible.
sum[sum$slope>42.2511,] #check for slope values > 42.2511 degrees
## [1] PID UseAv AID CSN Date Time Season
## [8] Calf Veg JPES JPMid JPMat BSDom WBDom
## [15] MixCD Tmusk Omusk Slope Asp Elev Age
## [22] Island LinDist Lin_Feat WatDist Wat_Feat Jul.Date
## <0 rows> (or 0-length row.names)
sum[sum$Slope<0,] #check for slope values < 0 degrees
## [1] PID UseAv AID CSN Date Time Season
## [8] Calf Veg JPES JPMid JPMat BSDom WBDom
## [15] MixCD Tmusk Omusk Slope Asp Elev Age
## [22] Island LinDist Lin_Feat WatDist Wat_Feat Jul.Date
## <0 rows> (or 0-length row.names)
There are no slope values outside the plausible range so I retained all values for this variable.
[Aspect]
par(mfrow=c(1,2))
boxplot(sum$Asp, ylab="Aspect", col="yellow", cex.lab=0.8, pch=16)
dotchart(sum$Asp, xlab="Aspect", ylab="Order of Data", pch=16, cex.lab=0.8)
Figure 3: Boxplot and clevland dot plot to test for outliers in the explanatory variable, aspect. For the clevland dot plot, the data is ordered by Julian date.
par(mfrow=c(1,1))
There are no outliers evident for this variable.
[Distance to Linear Feature]
par(mfrow=c(1,2))
boxplot(sum$LinDist, ylab="Distance to Linear Feature (m)", col="purple", cex.lab=0.8, pch=16)
dotchart(sum$LinDist, xlab="Distance to Linear Feature (m)", ylab="Order of Data", pch=16, cex.lab=0.8)
Figure 4: Boxplot and clevland dot plot to test for outliers in the explanatory variable, distance to the neareast linear feature (measured as the euclidean distance in meters from a point location). For the clevland dot plot, the data is ordered by Julian date.
par(mfrow=c(1,1))
There are a number of values above 55000m (55km) that are worth investigating.
nrow(sum[sum$LinDist>55000,]) #25 records occur over 55km from linear features
## [1] 25
summary(sum$LinDist) #max distance is 72400m (72.4km) from nearest linear feature
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02 2307.00 7005.00 9884.00 12390.00 72400.00
All 25 points located >55km from the nearest linear featue are classified as “available” and are located on islands. Caribou are strong swimmers capable of traversing large distances across water. These islands are all within swimming distance for caribou, either from the mainland or from an adjacent island that is itself accessible from the mainland. Therefore, I did not remove any of the values for this variable.
[Distance to Water]
par(mfrow=c(1,2))
boxplot(sum$WatDist, ylab="Distance to Water (m)", col="pink", cex.lab=0.8, pch=16)
dotchart(sum$WatDist, xlab="Distance to Water (m)", ylab="Order of Data", pch=16, cex.lab=0.8)
Figure 5: Boxplot and clevland dot plot to test for outliers in the explanatory variable, distance to the neareast water source (i.e., a river or lake, measured as the euclidean distance in meters from a point location). For the clevland dot plot, the data is ordered by Julian date.
par(mfrow=c(1,1))
There are several points located over 8000m (8km) from a water source (either a waterbody or watercourse).
nrow(sum[sum$WatDist>8000,]) #24 records over 8km from a water source
## [1] 24
summary(sum$WatDist) #max distance is 11300m (11.3km) from water
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 479.6 1217.0 1501.0 2212.0 11300.0
All 24 records located >8km from a water source are classified as “available” locations. They are all located in areas where it is possible for a point to have a radius of 11.3km that excludes a water source; therefore, none of these values were removed from the data set.
(ii) Outliers in Categorical Explanatory Variables
Categorical variables cannot produce outliers in the conventional sense, but comparing counts for each level of a categorcial variable can help identify levels that may be superfluous. Here, I compared counts for each level of the categorical variable, Habitat Class or [Veg].
barplot(table(sum$Veg), main="Distribution of Points Between Habitat Classes", xlab ="Habitat Class", ylab="Count")
summary(sum$Veg)
## BSDom JPES JPMat JPMid MixCD Omusk Tmusk WBDom
## 4190 4002 4208 3792 1539 3353 2488 358
Figure 6: Distribution of used and available point locations between the 8 habitat classes defined within the summer range. All habitat classes represent combinations of habitat features (e.g., moisture regime, canopy closure, understory composition etc.) that are ecologically relevant to woodland caribou.
From the distribution of the points in Figure 6, it appears that pixels classified as “white birch dominated” or [WBDom] are scarce in study area (358/23930~0.015 or 1.5% of total points), meaning that the probability of a caribou encountering this habitat class is rare. This could lead to a ‘false positive’ (i.e., indication that caribou strongly select for this habitat type even if they actually avoid it). In addition, its rarity means that it would be dangerous to assume that this habitat type is equally available to all of the caribou within the study area (which is a key assumption for this analysis). Indeed, this habitat type is only found in the southern half of the summer range and so is not available to caribou in the northern half.
One option to address this issue is to combine WBDom with a biophysically similar habitat class; in this case, “mixed conifer-deciduous” or [MixCD]. Before doing so, I looked at the distribution of this rare habitat class between “used” and “available” locations:
library(plyr)
count(subset(sum, UseAv=="Available" & Veg=="WBDom", select=Veg))
## Veg freq
## 1 WBDom 257
count(subset(sum, UseAv=="Used" & Veg=="WBDom", select=Veg))
## Veg freq
## 1 WBDom 101
There are over twice as many available WBDom locations than there are used WBDom locations (Available: 257, Used: 101), which suggest that caribou may avoid white birch dominated stands. At a smaller, local spatial scale, retaining the WBDom class might be warranted; however, at the landscape scale of this study, it is prudent to combine WBDom with the class MixCD.
library(plyr)
sum$VegReclass<-revalue(sum$Veg,c("WBDom"="MixCD.WB", "MixCD"="MixCD.WB")) #create new column with reclassed veg types
summary(sum$VegReclass) #check that WBDom and MixCD now combined in new class MixCD.WB
## BSDom JPES JPMat JPMid MixCD.WB Omusk Tmusk
## 4190 4002 4208 3792 1897 3353 2488
sum$MixCD.WB<-ifelse(sum$WBDom=="1",1,ifelse(sum$MixCD=="1",1,0)) #create new dummy variable column for MixcD.WB
nrow(sum[sum$MixCD.WB==1,]) #verify all 1897 records are classified as 1
## [1] 1897
sum$MixCD.WB<-as.factor(as.character(sum$MixCD.WB)) #convert to factor
str(sum$MixCD.WB)
## Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
sapply(sum, function(x) sum(is.na(x))) #double check for NA values
## PID UseAv AID CSN Date Time
## 0 0 0 0 0 0
## Season Calf Veg JPES JPMid JPMat
## 0 0 0 0 0 0
## BSDom WBDom MixCD Tmusk Omusk Slope
## 0 0 0 0 0 0
## Asp Elev Age Island LinDist Lin_Feat
## 0 0 0 0 0 0
## WatDist Wat_Feat Jul.Date VegReclass MixCD.WB
## 0 0 0 0 0
(B) Collinearity Between Explanatory Variables
(i) Collinearity Between Continuous Explanatory Variables
I used a multipanel scatterplot to assess collinearity between continuous explanatory variables. I did this for the entire data set, and then by CWC and CNC.
[All Data]
names_sum<-c("Slope", "Aspect", "Elevation", "Nearest Linear Feat.", "Nearest Water Feat.")
#Code for panel.cor and panel.smooth2 functions (Zurr et al. 2009):
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r1=cor(x,y,use="pairwise.complete.obs")
r <- abs(cor(x, y,use="pairwise.complete.obs"))
txt <- format(c(r1, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.9/strwidth(txt)
text(0.5, 0.5, txt, cex = cex * r)
}
panel.smooth2=function (x, y, col = par("col"), bg = NA, pch = par("pch"),
cex = 1, col.smooth = "red", span = 2/3, iter = 3, ...)
{
points(x, y, pch = pch, col = col, bg = bg, cex = cex)
ok <- is.finite(x) & is.finite(y)
if (any(ok))
lines(stats::lowess(x[ok], y[ok], f = span, iter = iter),
col = 1, ...)
}
plot(sum[, c(18:20,23,25)], lower.panel = panel.cor, upper.panel = panel.smooth2, labels=names_sum)
Figure 7: Multipanel pair-wise scatter plot to assess collinearity between explanatory variables for all point locations from the summer season (i.e. those belonging to CWC and CNC). The numbers in the lower panels are the Pearson correlation coefficients between each pair of explanatory variables. According to the reported coefficients, there are no strong correlations between any of the explanatory variables.
[CWC]
Calf<-sum[sum$Calf=="Yes",]
plot(Calf[, c(18:20,23,25)], lower.panel = panel.cor, upper.panel = panel.smooth2, labels=names_sum)
Figure 8: Multipanel pair-wise scatter plot to assess collinearity between explanatory variables for point locations recorded from caribou with calves (CWC). The numbers in the lower panels are the Pearson correlation coefficients calculated for each pair of explanatory variables. According to the reported coefficients, there are no strong correlations between any of the explanatory variables for CWC.
[CNC]
NoCalf<-sum[sum$Calf=="No",]
plot(NoCalf[, c(18:20,23,25)], lower.panel = panel.cor, upper.panel = panel.smooth2, labels=names_sum)
Figure 9: Multipanel pair-wise scatter plot to assess collinearity between explanatory variables for point locations recorded from caribou without calves (CNC). The numbers in the lower panels are the Pearson correlation coefficients calculated for each pair of explanatory variables. According to the reported coefficients, there are no strong correlations between any of the explanatory variables for CNC.
(ii) Collinearity Between Categorical and Continuous Explanatory Variables
To assess collinearity between the categorical variable “Habitat Class” (now VegReclass) and the continuous explanatory variables, I evaluated the range of each continuous variable over each level of habitat class in order to check for systematic changes in the range of values for each continuous variable across the categories.
[Slope vs. Habitat Class]
library(lattice)
bwplot(sum$Slope~sum$VegReclass, cex=0.5, xlab="Habitat Class", ylab="Slope (deg.)", main="Slope by Habitat Class", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=1, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 10: Boxplots to assess the spread of slope values recorded within each habitat class.
There appears to be a difference in the mean and range of values for slope between wetland habitat classes (Omusk and Tmusk) and the 5 terrestrial habitat classes (see Figure 10). Biologically, this makes sense as wetlands tend to be in low-lying, flat areas. The lower and upper quartiles of slope for each class are all between 0 and 5 degrees. While a 5 degree difference in slope could indirectly influence caribou via an impact on local vegetation communites, this small change in slope is likely unimportant at the landscape scale.
[Elevation vs. Habitat Class]
bwplot(sum$Elev~sum$VegReclass, cex=0.5, xlab="Habitat Class", ylab="Elevation (m) ", main="Elevation by Habitat Class", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=1, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 11: Boxplots to assess the spread of elevation values recorded within each habitat class.
The mean value for elevation varies between some of the habitat classes (e.g., BSDom and JPMid), but the overall range of values for this variable is pretty consistent across classes (see Figure 11).
[Aspect vs. Habitat Class]
bwplot(sum$Asp~sum$VegReclass, cex=0.5, xlab="Habitat Class", ylab="Aspect", main="Aspect by Habitat Class", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=1, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 12: Boxplots to assess the spread of aspect values recorded within each habitat class.
The means and ranges for the variable aspect are farily consistent across the seven habitat classes (see Figure 12).
[Distance to Linear Feature vs. Habitat Class]
bwplot(sum$LinDist~sum$VegReclass, cex=0.5, xlab="Habitat Class", ylab="Nearest Linear Feature (m)", main="Proximity to Linear Features by Habitat Class", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=1, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 13: Boxplots to assess the spread of distance values measured between point locations and linear features within each habitat class.
As indicated in Figure 13, the mean values and ranges for the variable “distance to nearest linear features” are consistent across the seven habitat classes.
[Distance to Water Feature vs. Habitat Class]
bwplot(sum$WatDist~sum$VegReclass, cex=0.5, xlab="Habitat Class", ylab="Nearest Water Source (m)", main="Proximity to Water by Habitat Class", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=1, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 14: Boxplots to assess the spread of distance values measured between point locations and the nearest water source within each habitat class.
The mean values and ranges for the variable “distance to nearest water source” are consistent across the seven habitat classes (see Figure 14).
(C) Relationships Between the Response and the Explanatory Variables
(i) Response vs. Continuous Explanatory Variables
The response is binomial (i.e., “Used” or “Available”). Because the used locations represent a subset of the available locations, the range of values for a given continuous variable vs. the used reponse must fall within the range of values for that variable vs. the available response. Here, I am looking for a marked difference in the mean and spread of the values between the two response categories. The data was split between CWC (label = “Yes”) and CNC (label = “No”).Recall: white-birch dominated forest were combined with mixed coniferous-deciduous forests, thus reducing the original 8 habitat classes to 7.
[Response vs. Slope]
library(lattice)
bwplot(sum$Slope~sum$UseAv|sum$Calf,data=sum,layout=c(2,1),cex=0.5, xlab="Used vs. Available Habitat", ylab="Slope(deg.)", main="Slope by Used/Available Points", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=0.5, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 15: Range of values for the variable slope for used and randomly generated “available” point locations. Points are divided between caribou without calves (n=17834, label =“No”) and caribou with calves (n=6096, label = “Yes”).
[Response vs. Elevation]
bwplot(sum$Elev~sum$UseAv|sum$Calf,data=sum,layout=c(2,1),cex=0.5, xlab="Used vs. Available Habitat", ylab="Elevation (m)", main="Elevation by Used/Available Points", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=0.5, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 16: Range of values for the variable elevation for used and randomly generated “available” point locations. Points are divided between caribou without calves (n=17834, label =“No”) and caribou with calves (n=6096, label = “Yes”).
[Response vs. Aspect]
bwplot(sum$Asp~sum$UseAv|sum$Calf,data=sum,layout=c(2,1),cex=0.5, xlab="Used vs. Available Habitat", ylab="Aspect", main="Aspect by Used/Available Points", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=0.5, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 17: Range of values for the variable aspect for used and randomly generated “available” point locations. Points are divided between caribou without calves (n=17834, label =“No”) and caribou with calves (n=6096, label = “Yes”).
[Response vs. Distance to Nearest Linear Feature]
bwplot(sum$LinDist~sum$UseAv|sum$Calf,data=sum,layout=c(2,1),cex=0.5, xlab="Used vs. Available Habitat", ylab="Distance to Linear Feature (m)", main="Proximity to Linear Features by Used/Available Points", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=0.5, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 18: Range of values for the variable “distance to the nearest linear feature” for used and randomly generated “available” point locations. Points are divided between caribou without calves (n=17834, label =“No”) and caribou with calves (n=6096, label = “Yes”).
[Response vs. Distance to Nearest Water Source]
bwplot(sum$WatDist~sum$UseAv|sum$Calf,data=sum,layout=c(2,1),cex=0.5, xlab="Used vs. Available Habitat", ylab="Distance to Water (m)",main="Proximity to Water by Used/Available Points", par.settings=list(box.rectangle=list(col="blue"),box.umbrella=list(col="blue"), plot.symbol=list(cex=0.5, col="blue")), scales=list(x=list(relation="same"), y=list(relation="same")), pch="|", cex.lab=3, cex.main=3)
Figure 19: Range of values for the variable “distance to the nearest water source (water body or water course)” for used and randomly generated “available” point locations. Points are divided between caribou without calves (n=17834, label =“No”) and caribou with calves (n=6096, label = “Yes”).
Looking at the boxplots in figures 15 to 19, there do not seem to be any definitive two-way relationships between the response varibale and the continuous variables for CWC or CNC.
(ii) Response vs. Categorical Variables
Here, I examined counts of used and available points across the seven habitat classes. I did this for all of the data and then by CWC and CNC.
[All Data]
count<-table(sum$UseAv, sum$VegReclass)
barplot(table(sum$UseAv, sum$VegReclass), ylab="Count", main="Used vs. Available Points by Habitat Type", col=c("blue", "red"), legend=rownames(count), beside=TRUE)
Figure 20: Distribution of used and available points across the seven habitat classes for the entire data set (n=23930 point locations divided equally between “used” and “available”).
Overall, there are more used locations in wetlands (open and treed muskegs), mature jack pine forests and black spruce dominated forests, with the biggest difference observed between Omusk and the response (see Figure 20). This suggest that open muskegs may be important to caribou habitat selection in the summer.
[CWC]
barplot(table(Calf$UseAv, Calf$VegReclass), ylab="Count", main="Used vs. Available Points by Habitat Type", col=c("blue", "red"), legend=rownames(count), beside=TRUE)
Figure 21: Distribution of used and available points across the seven habitat classes for caribou with calves (n=6096 point locations divided equally between “used” and “available”).
For caribou with calves, there were noticeably more used locations in wetlands and mature jack pine forests, with a marked difference between the number of used and available locations in the latter category (see Figure 21). This suggests that mature jack pine forests may be a good predictor of CWC habitat selection during the summer season.
[CNC]
barplot(table(NoCalf$UseAv, NoCalf$VegReclass), ylab="Count", main="Used vs. Available Points by Habitat Type", col=c("blue", "red"), legend=rownames(count), beside=TRUE)
Figure 22: Distribution of used and available points across the seven habitat classes for caribou without calves (n=17834 point locations divided equally between “used” and “available”).
For caribou without calves, there were distinctly more used locations in wetlands and black spruce dominated forests. Comparing the plots for CWC and CNC (Figures 21 and 22), it would seem that wetlands are important to caribou in general while mature jack pine forests are more important to CWC than CNC and black spruce forests are more important to CNC than CWC.
(D) Interactions Between Explanatory Variables
To investigate whether the relationship between continuous explanatory variables changes between habitat classes, I plotted pairs of explanatory variables by each habitat class for CWC (label = “Yes”) and CNC (label = “No”). Note that I did not test for interactions between aspect and any of the other variables because these relationships do not make biophysical sense.
[Slope vs. Elevation]
xyplot(sum$Slope~sum$Elev|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Slope vs. Elevation by Habitat Class", xlab="Elevation(m)", ylab="Slope (deg.)", layout=c(7,2))
Figure 23: Relationship between slope and elevation within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
[Slope vs. Aspect]
xyplot(sum$Slope~sum$Asp|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Slope vs. Aspect by Habitat Class", xlab="Aspect", ylab="Slope (deg.)", layout=c(7,2))
Figure 24: Relationship between slope and aspect within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
[Slope vs. Distance to Linear Feature]
xyplot(sum$Slope~sum$LinDist|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Slope vs. Proximity to Linear Features", xlab="Proximity to Linear Feature (m)", ylab="Slope (deg.)", layout=c(7,2))
Figure 25: Relationship between slope and the distance to the nearest linear feature within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
[Slope vs. Distance to Water]
xyplot(sum$Slope~sum$WatDist|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Slope vs. Distance to Water(m) by Habitat Class", xlab="Proximity to Water(m)", ylab="Slope (deg.)", layout=c(7,2))
Figure 26: Relationship between slope and the distance to the nearest water source within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
[Elevation vs. Distance to Linear Feature]
xyplot(sum$Elev~sum$LinDist|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Elevation vs. Proximity to Linear Features by Habitat Class", xlab="Proximity to Linear Feature (m)", ylab="Elevation(m)", layout=c(7,2))
Figure 27: Relationship between elevation and distance to the nearest linear feature within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
[Elevation vs. Distance to Water]
xyplot(sum$Elev~sum$WatDist|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Elevation vs. Distance to Water(m) by Habitat Class", xlab="Proximity to Water(m)", ylab="Elevation(m)", layout=c(7,2))
Figure 28: Relationship between elevation and distance to the nearest water source within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
[Distance to Linear Feature vs. Distance to Water]
xyplot(sum$LinDist~sum$WatDist|sum$VegReclass*sum$Calf, data=sum, pch=19, main="Proximty to Linear Features and Water by Habitat Class", xlab="Proximity to Water(m)", ylab="Proximity to Linear Feature(m)", layout=c(7,2))
Figure 29: Relationship between the distance to the nearest linear feature and the distance to the nearest water source within each habitat class. The top row depicts this relationship for caribou with calves (n=6096, label =“Yes”) and the bottom row depicts this relationship for caribou without calves (n=17834, label = “No”).
For both CWC and CNC, the relationships between continuous variables do not appear to change across habitat classes (see Figures 23-29).
STEP 4. Data Exploration Summary
Habitat preference is commonly quantified by comparing habitat use (i.e., used locations) vs. habitat availability (i.e., a random sample of point locations at least equal to the number of used locations). As my study design lends itself to this approach, I knew prior to the data exploration that a generalised linear model (glm) would be the most appropriate of this assignment’s model options for analysing my data. This type of model provides the framework necessary to apply logistic regression to data with a binomial response.
It is important to note that a GLM does not provide the framework necessary to address (a) the spatial and temporal autocorrelation between GPS point locations and (b) the fact that “used” locations were repeatedly sampled from the 54 collared caribou over the summer period. So, while I am acknowledging that my used points are not independent, I will not be able to address this issue using a GLM.
The data exploration process was incredibly useful as it allowed me to test for confounding outliers, collinearity and interactions between the explanatory variables, as well as explore individual relationships between the response variable and the explanatory variables. While testing for outliers, I discovered that white birch dominated forests were incredibly rare in the study area so I combined this habitat class with the habitat class for mixed conifer-deciduous forests because the rarity of this habitat class makes it superfluous at a landscape scale. While examining the relationship between the binomial response and the habitat classes for CWC vs. CNC, I discovered that mature jack pine forests may be more important to CWC while black spruce dominated forests may be more important to CNC (see Figures 21 and 22). Wetlands (e.g., open and treed muskegs) may be important to both groups. For the most part, however, there were no concerning or biologically implausible patterns or relationships in the data.
STEP 5. Model Selection, Interpretation and Validation
For modelling purposes, I created dummy variables (i.e. columns of 0s and 1s)for each level of the category ‘habitat class’, with a ‘1’ indicating the point was located in that habitat class and a ‘0’ indicating the point was not. If a ‘1’ was recorded under one of the habitat classes (for example JPES), then a ‘0’ would be recorded in all of the other habitat class dummy columns. E.g.:
head(sum[c(10:13,16,17,29)])
## JPES JPMid JPMat BSDom Tmusk Omusk MixCD.WB
## 1 0 0 0 1 0 0 0
## 2 0 1 0 0 0 0 0
## 3 0 0 0 0 0 0 1
## 4 0 0 1 0 0 0 0
## 5 1 0 0 0 0 0 0
## 6 0 0 0 1 0 0 0
Accordingly, these dummy variables are perfectly collinear. The purpose of doing this was to allow me to drop habitat classes if one or more proved to be a poor predictor of the response.
Backwards model selection was employed to assess which of the variables of interest best predicted habitat use; that is, I first created a global model containing all of the vairables of interest and then sequentially dropped model terms that did not significantly alter the model deviance when they were removed.
Because I am interested in whether habitat selection differs between caribou with calves (CWC) and caribou without calves (CNC) at the level of the landscape (i.e., over the entire range defined by the 54 collared caribou). I tested two seperate selection models - one for CWC and one for the CNC - and then compared whether the predictors to answer my question.
(i) CWC
[Global Model]
globCalf<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+WatDist+JPES+JPMid+JPMat+BSDom+MixCD.WB+Tmusk+Omusk, data=Calf, family=binomial (link="logit"))
summary(globCalf)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## WatDist + JPES + JPMid + JPMat + BSDom + MixCD.WB + Tmusk +
## Omusk, family = binomial(link = "logit"), data = Calf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7436 -1.0543 0.1477 1.0138 1.9739
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.496e+00 2.527e-01 9.876 < 2e-16 ***
## Elev -3.685e-03 5.112e-04 -7.209 5.65e-13 ***
## Slope -1.701e-01 2.952e-02 -5.763 8.28e-09 ***
## Asp 1.836e-05 3.373e-04 0.054 0.95659
## LinDist -1.467e-05 2.391e-06 -6.135 8.50e-10 ***
## WatDist -3.317e-05 2.223e-05 -1.492 0.13572
## JPES1 -8.801e-01 1.022e-01 -8.615 < 2e-16 ***
## JPMid1 -9.959e-01 1.003e-01 -9.933 < 2e-16 ***
## JPMat1 2.423e-01 9.253e-02 2.619 0.00882 **
## BSDom1 -2.639e-01 1.017e-01 -2.594 0.00947 **
## MixCD.WB1 -1.088e+00 1.262e-01 -8.627 < 2e-16 ***
## Tmusk1 -1.423e-01 1.080e-01 -1.317 0.18775
## Omusk1 NA NA NA NA
## Slope:Asp 1.486e-04 1.307e-04 1.137 0.25556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8450.9 on 6095 degrees of freedom
## Residual deviance: 7807.9 on 6083 degrees of freedom
## AIC: 7833.9
##
## Number of Fisher Scoring iterations: 4
Figure 30: Model summary output for the global model, globCalf.
In the model summary for the global model (Figure 30), there is a message stating “Coefficients: (1 not defined because of singularities)”. This is referring to the fact that the dummy variables for habitat class are collinear (as described above) and, as a result, one of the parameters cannot be defined independent of the other dummy variables. Here, the variable Omusk (open muskeg) was not defined because it was the last dummy column to record a ‘1’(hence the NA values for Omusk in the coefficient table).Essentially, the dummy columns have been collapsed back into a single categorical variable with the column Omusk set as the reference column for all of the other dummy columns.
Because I used a binomial logistic regression, the value of the coefficient for a given habitat class is a measure of how the log odds of a successful response (i.e. “used”) will change if a point location is recorded in that class compared to if that point location was recorded in the reference habitat class, Omusk. For example, the coefficient value for JPES is -8.801e-01 which means that if a point is recorded in early successional jack pine forest, the log odds of it being a “used” point changes by e-8.801e-01 or 0.039 compared to if that point had been recorded in open muskeg. According to the associated z-value (JPES: z=-8.615, p<2e-16), this change is significant. There is also a significant decrease in the log-odds ratio of a point location being used if the point is in mid-succession jack pine forest (JPMid: z=-9.933, p<2e-16), mixed coniferous-deciduous forest (MixCd.WB: z=-8.627, p<2e-16) or black spruce dominated forest (BSDom:z=-2.594, p=0.00947) when compared to open muskeg ; in contrast, the log odds of point being used significantly increases if that point is in mature jack pine forest (JPMat: z=2.619, p=0.00882) compared to open muskeg. There is no significant change in the log odds if the point is in treed vs. open muskeg (Tmusk: z=-1.317, p=1.18775).
For continuous explanatory variables, the value of the estimated coefficient is a measure of how the log odds of a successful response changes with a one unit increase in the continuous variable. At p<0.001, a one unit increase in elevation, slope and the distance to the nearest linear feature all cause a significant decrease in the log odds of a point being used (Elev: z=-7.209, p=5.65e-13; Slope: z=-5.763, p=8.28e-9; LinDist: z=-6.135, p=8.50e-10). Aspect and the distance to the nearest water source do not cause a significant change in the log odds of a point being used (Asp: z=0.054, p=0.95659; WatDist: z=-1.492, p=0.13572). For the former variable, this was expected, as aspect is only meaningful here as it relates to slope. The interaction between slope and aspect does not produce a significant change in the log odds of use (Slope:Asp: z=1.137, p=0.2556).
An assumption of logistic regression based on a binomial distribution is that the dispersion of the model is taken to be 1. Model dispersion can be estimated by dividing the residual deviance by the resiual degrees of freedom. For the model globCalf, the estimated dispersion is 7807.9/6083 = 1.28, which indicates the model is overdispersed. A solution to the problem of overdispersion is to fit a quasibinomial GLM, which incorporates a dispersion parameter, q. In a quasibinomial model, the variance is allowed to increase with the mean, but the standard error of all of the model coefficients is multiplied by square root of the dispersion parameter as a penalty. This can change the significance of some parameters.
Dispersion can result from outliers, model misspecification (i.e using the wrong model), collinear model covariates, the clustering of observations and/or misspecification of the link function. Here, the dispersion is likely caused by the spatial-temporal autocorrelation between GPS point locations, and the fact that the dummy variables are collinear. In short, a GLM model does not provide the framework necessary to address these issues and so the model is misspecified. Because of this misspecification, fitting a quasibinomial glm is unlikely to correct for the overdispersion; however, because ignoring overdispersion can result in severe model missinterpretation, I fit a quasibinomial model to test its effectiveness:
[Quais-Global Model]
QglobCalf<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+WatDist+JPES+JPMid+JPMat+BSDom+MixCD.WB+Tmusk+Omusk, data=Calf, family=quasibinomial (link="logit"))
summary(QglobCalf)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## WatDist + JPES + JPMid + JPMat + BSDom + MixCD.WB + Tmusk +
## Omusk, family = quasibinomial(link = "logit"), data = Calf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7436 -1.0543 0.1477 1.0138 1.9739
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.496e+00 2.532e-01 9.856 < 2e-16 ***
## Elev -3.685e-03 5.122e-04 -7.195 7.02e-13 ***
## Slope -1.701e-01 2.958e-02 -5.752 9.27e-09 ***
## Asp 1.836e-05 3.380e-04 0.054 0.95667
## LinDist -1.467e-05 2.395e-06 -6.123 9.74e-10 ***
## WatDist -3.317e-05 2.228e-05 -1.489 0.13653
## JPES1 -8.801e-01 1.024e-01 -8.598 < 2e-16 ***
## JPMid1 -9.959e-01 1.005e-01 -9.914 < 2e-16 ***
## JPMat1 2.423e-01 9.271e-02 2.614 0.00898 **
## BSDom1 -2.639e-01 1.019e-01 -2.589 0.00964 **
## MixCD.WB1 -1.088e+00 1.264e-01 -8.610 < 2e-16 ***
## Tmusk1 -1.423e-01 1.082e-01 -1.315 0.18866
## Omusk1 NA NA NA NA
## Slope:Asp 1.486e-04 1.310e-04 1.135 0.25653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.0039)
##
## Null deviance: 8450.9 on 6095 degrees of freedom
## Residual deviance: 7807.9 on 6083 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Figure 31: Model summary output for the quasibinomial global model, QglobCalf.
For the quasi model, the dispersion parameter is taken to be 1.0039 instead of 1 (see Figure 31). The model dispersion is still equal to 1.28, however, so the model is still overdispersed. The significance of the estimated model coefficients have also remained the same. Because the quasibinomial model failed to significantly reduce the model dispersion, and because the dispersion of the original model was not overly large, I chose to ignore the dispersion and proceeded with using the original model, globCalf.
[Model Selection and Interpretation]
drop1(globCalf, test="Chisq")
## Single term deletions
##
## Model:
## UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + WatDist +
## JPES + JPMid + JPMat + BSDom + MixCD.WB + Tmusk + Omusk
## Df Deviance AIC LRT Pr(>Chi)
## <none> 7807.9 7833.9
## Elev 1 7860.3 7884.3 52.402 4.523e-13 ***
## LinDist 1 7845.6 7869.6 37.731 8.121e-10 ***
## WatDist 1 7810.1 7834.1 2.229 0.1355
## JPES 0 7807.9 7833.9 0.000
## JPMid 0 7807.9 7833.9 0.000
## JPMat 0 7807.9 7833.9 0.000
## BSDom 0 7807.9 7833.9 0.000
## MixCD.WB 0 7807.9 7833.9 0.000
## Tmusk 0 7807.9 7833.9 0.000
## Omusk 0 7807.9 7833.9 0.000
## Slope:Asp 1 7809.2 7833.2 1.292 0.2557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 1: Output from the drop1 function (test = Chi-squared), as performed on the global model, globCalf. Note the lack of a test statistic for the dummy habitat class variables
The drop1 function did not calculate a test statistic for these dummy variables (see Table 1), which may indicate that it cannot handle the perfectly collinear dummy variables. Therefore, I repeated the above steps, but substituted all of the dummy habitat classes for the single categorical variable VegReclass.
Calf<-within(Calf, VegReclass<-relevel(VegReclass, ref="Omusk")) #relevel the variable VegReclass so that Omusk is the reference category; otherwise, the reference category is alphabetically selected as BSDom.
globCalf2<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+WatDist+VegReclass, data=Calf, family=binomial(link="logit"))
summary(globCalf2)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## WatDist + VegReclass, family = binomial(link = "logit"),
## data = Calf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7436 -1.0543 0.1477 1.0138 1.9739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.496e+00 2.527e-01 9.876 < 2e-16 ***
## Elev -3.685e-03 5.112e-04 -7.209 5.65e-13 ***
## Slope -1.701e-01 2.952e-02 -5.763 8.28e-09 ***
## Asp 1.836e-05 3.373e-04 0.054 0.95659
## LinDist -1.467e-05 2.391e-06 -6.135 8.50e-10 ***
## WatDist -3.317e-05 2.223e-05 -1.492 0.13572
## VegReclassBSDom -2.639e-01 1.017e-01 -2.594 0.00947 **
## VegReclassJPES -8.801e-01 1.022e-01 -8.615 < 2e-16 ***
## VegReclassJPMat 2.423e-01 9.253e-02 2.619 0.00882 **
## VegReclassJPMid -9.959e-01 1.003e-01 -9.933 < 2e-16 ***
## VegReclassMixCD.WB -1.088e+00 1.262e-01 -8.627 < 2e-16 ***
## VegReclassTmusk -1.423e-01 1.080e-01 -1.317 0.18775
## Slope:Asp 1.486e-04 1.307e-04 1.137 0.25556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8450.9 on 6095 degrees of freedom
## Residual deviance: 7807.9 on 6083 degrees of freedom
## AIC: 7833.9
##
## Number of Fisher Scoring iterations: 4
Figure 32: Model summary output for the new global model, globCalf2. The dummy habitat class variables used in the model globCalf have been replaced by a single categorical variable, VegReclass.
The coefficient values for the habitat classes have not changed from when the habitat classes were specified as dummmy variables (compare Figure 32 to Figure 31). This supports the observation that the glm modelling program collapses collinear dummy variables into a single category. As with the model globCalf, this model is also overdispersed (dispersion = 1.28). For the reasons described above, I ignored the overdispersion.
drop1(globCalf2, test="Chisq")
## Single term deletions
##
## Model:
## UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + WatDist +
## VegReclass
## Df Deviance AIC LRT Pr(>Chi)
## <none> 7807.9 7833.9
## Elev 1 7860.3 7884.3 52.40 4.523e-13 ***
## LinDist 1 7845.6 7869.6 37.73 8.121e-10 ***
## WatDist 1 7810.1 7834.1 2.23 0.1355
## VegReclass 6 8143.8 8157.8 335.94 < 2.2e-16 ***
## Slope:Asp 1 7809.2 7833.2 1.29 0.2557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 2: Output from the drop1 function (test = Chi-squared), as performed on the new global model, globCalf2.
The drop1 function was able to calculate a test statistic for the category VegReclass. According to the Analysis of Deviance (Table 2), removing the variable VegReclass from the model globCalf2 significantly increases the model deviance (X2(6,N=6096) = 335.94, p<2.2e-16), which - unlike for the previous drop1 analysis - suggests habitat class is an important predictor of caribou habitat selection for CWC. Removing either the variable LinDist or Elev also causes a significant increase in the model variance (LinDist: X2(1,N=6096)=37.73, p= 8.121e-10, Elev: X2(6,N=6096) = 52.40, p= 4.523e-13). In contrast, removing either the interaction between Slope and Aspect (Slope:Asp), or distance to the nearest water source (WatDist, ) does not significantly alter the model deviance (Slope:Asp:X2(1,N=6096) = 1.29, p=0.2557; WatDist:X2(1,N=6096) = 2.23, p=0.1355) . Because the delta AIC between these two alternate models is less than 2, I concurrently dropped both terms from the global model to create a new, simplified model, dropCalf1. Note that in dropping the interaction term, Slope:Asp, I also dropped the independent terms, Slope and Asp. Though not shown here, I did rerun the entire model selection process with just slope as an independent fixed effect and it was not a significant predictor of the response.
dropCalf1<-glm(UseAv~Elev+LinDist+VegReclass, data=Calf, family=binomial(link="logit"))
summary(dropCalf1)
##
## Call:
## glm(formula = UseAv ~ Elev + LinDist + VegReclass, family = binomial(link = "logit"),
## data = Calf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.73084 -1.07275 0.08744 1.01335 1.73868
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.600e+00 2.393e-01 10.866 < 2e-16 ***
## Elev -4.272e-03 4.791e-04 -8.917 < 2e-16 ***
## LinDist -1.355e-05 2.370e-06 -5.716 1.09e-08 ***
## VegReclassBSDom -5.261e-01 9.695e-02 -5.426 5.76e-08 ***
## VegReclassJPES -1.012e+00 1.004e-01 -10.081 < 2e-16 ***
## VegReclassJPMat 7.548e-02 8.995e-02 0.839 0.401
## VegReclassJPMid -1.108e+00 9.900e-02 -11.192 < 2e-16 ***
## VegReclassMixCD.WB -1.387e+00 1.217e-01 -11.396 < 2e-16 ***
## VegReclassTmusk -1.569e-01 1.075e-01 -1.461 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8450.9 on 6095 degrees of freedom
## Residual deviance: 7905.9 on 6087 degrees of freedom
## AIC: 7923.9
##
## Number of Fisher Scoring iterations: 4
Figure 33: Model summary output for the reduced model, dropCalf1.
For this new model, a one unit increase in either elevation or the distance to the nearest linear feature causes a significant decrease in the log-odds of a point being used (Elev: z=-8.917, p< 2e-16; LinDist: z=-5.716, p<2e-16; see Figure 33). This suggests that caribou are more likey to be found at lower elevations and closer to linear features.
Compared to if a point is in open muskeg, the log odds of a point being used significantly decreases if that point is located in early successional jack pine forest (JPES: z=1.012e+00, p<2e-16), mid-successional jack pine forest (JPMid: z=-11.192, p<2e-16), black spruce dominated forest (BSDom: z=-5.426, p= 5.76e-8), and mixed coniferous-deciduous forest (MixCD.WB: z=-11.396, p<2e-16). There is no significant change in the log odds of use if a point is in mature jack pine forest (JPMat: z=0.839, p=0.401) or treed muskeg (z=-1.461, p=0.144*). This suggests that mature forests and treed muskegs are equally as important to CWC, while early and mid-successional jack pine forest, black spruce dominated forests and mixed coniferous-deciduous forests are more likely to be avoided relative to open muskegs.
The model dispersion is 7905.9/6087 = 1.30. Again, this is likely due to model misspecification, but I still attempted to rectify this issue by fitting a quasibinomial model:
QdropCalf1<-glm(UseAv~Elev+LinDist+VegReclass, data=Calf, family=quasibinomial(link="logit"))
summary(QdropCalf1)
##
## Call:
## glm(formula = UseAv ~ Elev + LinDist + VegReclass, family = quasibinomial(link = "logit"),
## data = Calf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.73084 -1.07275 0.08744 1.01335 1.73868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.600e+00 2.393e-01 10.865 < 2e-16 ***
## Elev -4.272e-03 4.791e-04 -8.915 < 2e-16 ***
## LinDist -1.355e-05 2.370e-06 -5.715 1.15e-08 ***
## VegReclassBSDom -5.261e-01 9.697e-02 -5.425 6.02e-08 ***
## VegReclassJPES -1.012e+00 1.004e-01 -10.080 < 2e-16 ***
## VegReclassJPMat 7.548e-02 8.996e-02 0.839 0.401
## VegReclassJPMid -1.108e+00 9.902e-02 -11.190 < 2e-16 ***
## VegReclassMixCD.WB -1.387e+00 1.217e-01 -11.394 < 2e-16 ***
## VegReclassTmusk -1.569e-01 1.075e-01 -1.460 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.00033)
##
## Null deviance: 8450.9 on 6095 degrees of freedom
## Residual deviance: 7905.9 on 6087 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Figure 34: Model summary output for the quasibinomial model,QdropCalf1.
Fitting the quasibinomial model only very marginally reduced the model dispersion and did not change the significance of the estimates for the coefficients (see Figure 34); therefore, I continued backwards model selection with the binomial model, dropCalf1.
drop1(dropCalf1, test="Chisq")
## Single term deletions
##
## Model:
## UseAv ~ Elev + LinDist + VegReclass
## Df Deviance AIC LRT Pr(>Chi)
## <none> 7905.9 7923.9
## Elev 1 7986.7 8002.7 80.78 < 2.2e-16 ***
## LinDist 1 7938.7 7954.7 32.76 1.045e-08 ***
## VegReclass 6 8296.3 8302.3 390.33 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 3: Output from the drop1 function (test = Chi-squared), as performed on the reduced model, dropCalf1.
From the Analysis of Deviance table (Table 3), it is evident that dropping any of the remaining explanatory variables results in a significant increase in model deviance (Elev: X2(1,N=6096) = 80.78, p<2.2e-16; LinDist:X2(1,N=6096) = 32.76, p=1.045e-8; VegReclass: X2(6,N=6096) = 390.33, p<2.2e-16). This suggests that elevation, the distance to linear features and habitat class are all important predictors of caribou habitat selection for CWC.
To test the significance of this new model, I compared it to the global (saturated) model:
anova(globCalf2, dropCalf1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + WatDist +
## VegReclass
## Model 2: UseAv ~ Elev + LinDist + VegReclass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6083 7807.9
## 2 6087 7905.9 -4 -98.086 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 4: Comparison of the global model globcalf2 to the reduced model dropCalf1 using the function anova (test= Chi-squared).
The simpler model, dropCalf1 significantly reduced model deviance (see Table 4) and was thus selected as the top model for CWC.
[Model Validation]
Data analysis does not end with the selection of a top model. It is important to check how well the model fits the data and whether data complies with the assumptions of the model. For a binomial response modelled using logistic regression, traditional diagnostic plots (e.g., QQ-plot, Scale-Location plot) are generally uninformative because the residuals are not normally distributed (they are discrete), and the binary nature of the response results in strings of residuals that are hard to interpret. E.g.,:
par(mfrow=c(2,2))
plot(dropCalf1)
Figure 35: Example of traditional diagnostic plots for model validation. Because the response for this analysis is binomial, these plots are farily uninformative.
par(mfrow=c(1,1))
Plotting the residuals against the explanatory variables is also a somewhat fruitless exercise because the assumption of homoscedasticity does not apply to logistic regression. That said, a binned residual plot can be used as an alternative diagnostic tool to assess how well a glm predicts a binary response. To draw this plot, data are first organized into bins defined by the fitted values and then, the average residual value for each bin is plotted against the average fitted value for each bin. Bin size is somewhat arbitrary, but each bin should contain enough points to avoid unnecessary noise in the residuals while still allowing for a large enough number of bins to resolve local patterns in the residuals (Iorio and Innario 2012). For this study, bin size was set as √n (as per Gelman et al. 2006), where n is the sample size used to generate the model (e.g., n=6096 for CWC). The upper and lower lines on a binned residual plot represent theoretical error bounds within which 95% of the binned residuals should fall if the model fits the data well (Iorio and Innario 2012). If a logistic regression model does a good job of predicting the binomial response, the averaged residuals in a binned residual plot should be uniformly distributed around zero and the majority should fall within the 95% error bounds of the plot.
Before constructing the binned residual plot, I first plotted the model residuals vs. fitted values to ensure that none of the samples scored as “Used” or “1” were in the string of points scored as “Available” or “0”, (as recommended by Zurr et al. 2007):
plot(fitted(dropCalf1), resid(dropCalf1), xlab="fitted values", ylab = "model residuals", main="Residuals vs. Fitted Values [ Model: dropCalf1 ]", col=c("red", "blue")[Calf$UseAv], pch=16)
abline(h=0, lty=2, col="black")
Figure 35: Plot of model residuals vs. fitted values for the model dropCalf1. The upper blue string represents the “used” response and the lower red string the “available”" response.
In Figure 35, there are two strings of points because the model is predicting whether a point location is used (1) or available (0). If the true value of the response is 0, then the model overestimates the true value and the residuals are negative (red line); if the true value is 1, then the model underestimates the true value and the residuals are positive (blue line). In the above plot (Figure 35), the residuals follow the expected shape and all used and available reponses are in the appropriate string of points.
Next, I constructed the binned residual plot:
library(arm)
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.8-6, built: 2015-7-7)
## Working directory is C:/Users/pdm276/Desktop/BIO827/Assignments/JeffLane_Assignment1
binnedplot(fitted(dropCalf1), resid(dropCalf1), nclass=78, main="Binned Residual Plot [Model dropCalf1]", col.int="red")
Figure 36: Binned Residual Plot for the model dropCalf1. Each point represents the average residual vs. the average fitted value for each bin of data points. Bin sizes were computed as the square root of the sample size, n (i.e., the sq. rt. of 6096). The upper and lower red lines represent the theoretical error bounds within which 95% of the estimated residuals should fall IF the model is true.
Looking at the binned residual plot (Figure 36), there are numerous points that fall outside of the theoretical error bounds. This is indicative of a poorly fit model. A poorly fit model may be the result of (a) multicollinearity between the explanatory variables that was not detected during the data exploration (e.g., pairwise correlations were small, but there may be a linear dependence among three or more variables); (b) one or more points with an undue amount of leverage (value of X far from the mean value of X); (c) inlfuential points (have a large influence on model fit); (d) a lack of independence in the response variable…etc. The latter is very likely given the spatio-temporal autocorrelation inherent to GPS point locations recorded at the study scale, however I still explored the data for multicollinearity and influential points.
One way to test for multicollinearity is to calculate a variance inflation factor (VIF). The VIF quantifies how much the variance of the estimated model coefficients are inflated when multicollinearity is present (i.e., it is the factor by which the variance is inflated).
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:arm':
##
## logit
vif(dropCalf1)
## GVIF Df GVIF^(1/(2*Df))
## Elev 1.318738 1 1.148363
## LinDist 1.203566 1 1.097071
## VegReclass 1.114194 6 1.009052
Table 5: Generalised variance inflation factors calculated for the explanatory variables of the model dropCalf1. The output shows three columns: the GVIF, the Df, and the GVIF^(1/(2xDf)). The Df is the number of coefficients in the subset of the variable (e.g., for elevation, Df=1 because it is a continuous variable and therefore has only one level and one estimated coefficient). The GVIF is the generalised variation inflation factor, which is used in place of the standard VIF because the standard VIF cannot be applied to variables with a Df > 1 (e.g., a categorical variable with multiple levels). Briefly, it is the VIF corrected by the number of degrees of freedom of the predictor variable, meaning that for continuous variables, the VIF and GVIF are equivalent. The GVIF^(1/(2xDf)) reduces the GVIF to a linear measure and makes the GVIFs comparable across dimensions (Monette and Fox 1992).
There is considerable debate over whether to use the GVIF or GVIF^(1/(2xDf)) value to diagnose multicollinearity. Here, I have applied a GVIF^(1/(2xDf)) <2 (which is equal to a VIF of 4 for variables with Df=1) as the threshold for the inflation factor. Since none of the GVIF^(1/(2xDf)) values exceed this threshold (Table 5), it would appear that multicollinearity is not responsible for the poor model fit.
To check for influential points, I plotted the Cook’s distance for each observation:
cutoff <- 4/((nrow(Calf)-length(dropCalf1$coefficients)-2)) #identify D values > 4/(n-k-1). Thus is a commonly used threshold above which a Cook's distance value is indicative of an influential point.
plot(dropCalf1, which=4, cook.levels=cutoff)
Figure 36: Cook’s Distance for the used and available data points recorded for caribou with calves.
There are three data points in the graph with a Cook’s distance exceeding the threshold of 4(n-k-1): point 16122, point 22795, and point 23512 (labelled points, Figure 36). These were investigated further:
#row names refer back to the original data frame 'sum' so use this data frame to investigate points
sum[16122, c(2,3,8,18:20,21,23,25,28)] #used point
## UseAv AID Calf Slope Asp Elev Age LinDist WatDist VegReclass
## 16122 Used 140125 Yes 0 239.04 345 40 49429.27 968.42 MixCD.WB
sum[22795, c(2,3,8,18:20,21,23,25,28)] #available point
## UseAv AID Calf Slope Asp Elev Age LinDist WatDist
## 22795 Available 140125 Yes 1 21.04 559 6 19581.79 71.1
## VegReclass
## 22795 JPES
sum[23512, c(2,3,8,18:20,21,23,25,28)] #used point
## UseAv AID Calf Slope Asp Elev Age LinDist WatDist VegReclass
## 23512 Used 140125 Yes 2 53.13 345 40 51160.58 0 MixCD.WB
Figure 37: Summary of the three data points from data set for CWC with Cook’s Distance exceeding the predefined threshold of 4/(n-k-1).
The two used points (point 16122 and 23512, Figure 37) are located almost or over 50km from the nearest linear feature and were some of the points previously identified during the exploration of outliers for the variable LinDist. They were retained because they are measured points that were checked for GPS error and found to be realistic records of each animal’s location. The available point (point 22795, Figure 37) has one of the extreme values for elevation identified during the exploration for outliers in the variable Elev. However, because it falls within the possible range of elevation values for the study area, it was not excluded from the data set.
Removing these points may improve the model fit. Therefore, I reran the top model dropCalf1 but excluding these three influential points and their associated used or available counter-point.
sum2<-sum[-c(16121,16122,22795,22796,23511,23512),] #remove observations from original data frame
Calf2<-sum2[sum2$Calf=="Yes",] #resubset
Calf2<-within(Calf2, VegReclass<-relevel(VegReclass, ref="Omusk")) #relevel the variable VegReclass so Omusk is ref.
Calfnew<-glm(UseAv~Elev+LinDist+VegReclass, data=Calf2, family=binomial(link="logit"))
summary(Calfnew)
##
## Call:
## glm(formula = UseAv ~ Elev + LinDist + VegReclass, family = binomial(link = "logit"),
## data = Calf2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.73067 -1.07306 0.09125 1.01259 1.74790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.593e+00 2.394e-01 10.828 < 2e-16 ***
## Elev -4.248e-03 4.794e-04 -8.860 < 2e-16 ***
## LinDist -1.386e-05 2.376e-06 -5.835 5.38e-09 ***
## VegReclassBSDom -5.257e-01 9.695e-02 -5.422 5.90e-08 ***
## VegReclassJPES -1.010e+00 1.005e-01 -10.052 < 2e-16 ***
## VegReclassJPMat 7.494e-02 8.995e-02 0.833 0.405
## VegReclassJPMid -1.110e+00 9.901e-02 -11.208 < 2e-16 ***
## VegReclassMixCD.WB -1.409e+00 1.223e-01 -11.515 < 2e-16 ***
## VegReclassTmusk -1.565e-01 1.075e-01 -1.457 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8442.5 on 6089 degrees of freedom
## Residual deviance: 7896.2 on 6081 degrees of freedom
## AIC: 7914.2
##
## Number of Fisher Scoring iterations: 4
Figure 38: Model summary output for the model calfnew. This model was generated from the set of CWC points, excluding the three influential points identified in the Cook’s D Plot (Figure 36).
Though the z-values have changed slightly, the significance for each predictor variable hasn’t changed between this model and the model dropCalf1 (Table 3).
Because the models were not fitted on the same set of observations (i.e. the new model excludes six points from the original data set, Calf), I cannot use the function anova function in R(c) to compare the two models. However, I reran model diagnostics on this new model to look for improvements in model fit.
plot(fitted(Calfnew), resid(Calfnew), xlab="fitted values", ylab = "model residuals", main="Residuals vs. Fitted Values [ Model: Calfnew ]", col=c("red", "blue")[Calf2$UseAv], pch=16)
abline(h=0, lty=2, col="black")
Figure 39: Plot of model residuals vs. fitted values for the model Calfnew. The upper blue string represents the “used” response and the lower red string the “available”" response.
library(arm) #binned residuals
binnedplot(fitted(Calfnew), resid(Calfnew), nclass=78, main="Binned Residual Plot [Model Calfnew]", col.int="red")
Figure 40: Binned Residual Plot for the model Calfnew. Each point represents the average residual vs. the average fitted value for each bin of data points. Bin sizes were computed as the square root of the sample size, n (i.e., the sq. rt. of 6096). The upper and lower red lines represent the theoretical error bounds within which 95% of the estimated residuals should fall IF the model is true.
There are still a considerable number of points outside the theoretical 95% error bounds of the binned residual plot (Figure 40), indicating that removing the influential points did not improve model fit. At this point, it would be wise to begin exploring fitting other models that can account for some of the issues not addressed by the glm (mainly lack of independence in Y). This is further discussed at the end of this analysis. I now move on to the model selection, interpretation and validation for caribou without calves (CNC).
(ii) CNC
My research question is best answered by creating seperate models for caribou with calves (CWC) and caribou without calves (CNC). Below, I have repeated the above model selection, interpretation and validation for CNC, albeit with more abbreviated reporting where possible.
[Global Model]
NoCalf<-within(NoCalf, VegReclass<-relevel(VegReclass, ref="Omusk")) # to keep things consistent and make model comparisoneasier, relevel data so that "Omusk" is the reference category for VegReclass
globNoCalf<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+WatDist+VegReclass, data=NoCalf, family=binomial(link="logit"))
summary(globNoCalf)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## WatDist + VegReclass, family = binomial(link = "logit"),
## data = NoCalf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9205 -1.0681 0.1611 1.0359 2.1531
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.172e+00 1.473e-01 28.321 < 2e-16 ***
## Elev -6.923e-03 2.963e-04 -23.366 < 2e-16 ***
## Slope -1.596e-01 1.750e-02 -9.117 < 2e-16 ***
## Asp 1.266e-04 2.060e-04 0.615 0.53880
## LinDist -4.047e-05 1.685e-06 -24.017 < 2e-16 ***
## WatDist -1.999e-05 1.275e-05 -1.568 0.11695
## VegReclassBSDom -1.170e-01 6.013e-02 -1.945 0.05172 .
## VegReclassJPES -6.635e-01 5.877e-02 -11.289 < 2e-16 ***
## VegReclassJPMat -1.941e-01 5.948e-02 -3.263 0.00110 **
## VegReclassJPMid -7.546e-01 6.060e-02 -12.453 < 2e-16 ***
## VegReclassMixCD.WB -1.357e+00 7.691e-02 -17.641 < 2e-16 ***
## VegReclassTmusk -2.756e-01 6.529e-02 -4.221 2.43e-05 ***
## Slope:Asp 2.066e-04 7.452e-05 2.772 0.00557 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24723 on 17833 degrees of freedom
## Residual deviance: 22496 on 17821 degrees of freedom
## AIC: 22522
##
## Number of Fisher Scoring iterations: 4
Figure 41: Model summary output for the global model globNoCalf. This model was generated from the set of CNC points (n=17834).
A one unit increase in elevation, slope or the distance to the nearest linear feature all significantly decrease the log odds of a point being used (Elev:z=-23.366, p<2e-16; Slope: z=-9.117, p<2e-16; LinDist: z=-24.017, p<2e-16, Figure 41). As was the case with the global model for CWC, aspect and the distance to the nearest water source do not significantly change the the log odds of a point being used (Asp: z=0.615, p=0.53880; WatDist: z=-1.568, p= 0.11695); however, unlike for the global CWC model, the interaction between slope and aspect does significantly increase the log odds of a point being used (Slope:Asp: z=2.772, p=0.00557).
At a significance level of p<0.05, all of the habitat classes except for black spruce dominated forest significantly decrease the log odds of a point being used compared to if it were in open muskeg (JPES: z=-11.289, p<2e-16; JPMat: z=-3.263; p=0.00110; JPMid: z=-12.453, p<2e-16; MixCD.WB: z=-17.641, p<2e-16; Tmusk: z=-4.221, p=2.43e-05). At a significance level of p<0.1, black spruce dominated forest significantly decreases the log odds of a point being used relative to if that point were in open muskeg (BSDom: z=-1.945, p=0.05172).
The model is overdispersed by 1.26 (Residual Deviance/ df = 22496/17821 = 1.26, Figure 41). I ran a quasibinomial model to explore whether the addition of a dispersion parameter would reduce the disperion.
[Quasi Global Model]
QglobNoCalf<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+WatDist+VegReclass, data=NoCalf, family=quasibinomial(link="logit"))
summary(QglobNoCalf)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## WatDist + VegReclass, family = quasibinomial(link = "logit"),
## data = NoCalf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9205 -1.0681 0.1611 1.0359 2.1531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.172e+00 1.470e-01 28.382 < 2e-16 ***
## Elev -6.923e-03 2.957e-04 -23.416 < 2e-16 ***
## Slope -1.596e-01 1.747e-02 -9.137 < 2e-16 ***
## Asp 1.266e-04 2.056e-04 0.616 0.53794
## LinDist -4.047e-05 1.682e-06 -24.068 < 2e-16 ***
## WatDist -1.999e-05 1.272e-05 -1.571 0.11619
## VegReclassBSDom -1.170e-01 6.000e-02 -1.950 0.05124 .
## VegReclassJPES -6.635e-01 5.865e-02 -11.313 < 2e-16 ***
## VegReclassJPMat -1.941e-01 5.936e-02 -3.270 0.00108 **
## VegReclassJPMid -7.546e-01 6.047e-02 -12.480 < 2e-16 ***
## VegReclassMixCD.WB -1.357e+00 7.675e-02 -17.678 < 2e-16 ***
## VegReclassTmusk -2.756e-01 6.516e-02 -4.230 2.35e-05 ***
## Slope:Asp 2.066e-04 7.436e-05 2.778 0.00548 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9957621)
##
## Null deviance: 24723 on 17833 degrees of freedom
## Residual deviance: 22496 on 17821 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Figure 42: Model summary output for the quaisbinomial global model QglobNoCalf.
The model dispersion for the quasibinomial model is taken to be 0.9957621 (see Figure 42), which is slightly less than the assumed binomial model dispersion of 1 for the previous model (see Figure 41). As adding the dispersion parameter does not change the significance of any of the model parameters and actually seems to worsen the overdispersion, I continued with model selection using the original global model, globNoCalf.
[Model Selection and Interpretation]
drop1(globNoCalf, test="Chisq")
## Single term deletions
##
## Model:
## UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + WatDist +
## VegReclass
## Df Deviance AIC LRT Pr(>Chi)
## <none> 22496 22522
## Elev 1 23063 23087 566.91 < 2.2e-16 ***
## LinDist 1 23123 23147 626.94 < 2.2e-16 ***
## WatDist 1 22498 22522 2.46 0.116780
## VegReclass 6 23050 23064 553.52 < 2.2e-16 ***
## Slope:Asp 1 22504 22528 7.70 0.005536 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 6: Output from the drop1 function (test = Chi-squared), as performed on the global model, globNoCalf.
According to the Analysis of Deviance table (Table 6), dropping any of the variables save for the distance to the nearest water source(WatDist: X2(1,N=17834) = 2.46, p=0.116780) significantly increases the model deviance (Elev: X2(1,N=17834) = 566.91, p<-2.2e-16; LinDist: X2(1,N=17834) = 626.94, p<-2.2e-16; VegReclass: X2(6,N=17834) = 553.52, p<-2.2e-16; Slope:Asp:X2(1,N=17834) = 7.70, p=0.005536). I therefore constructed a new model that excluded WatDist.
dropNoCalf1<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+VegReclass, data=NoCalf, family=binomial(link="logit"))
summary(dropNoCalf1)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## VegReclass, family = binomial(link = "logit"), data = NoCalf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9185 -1.0685 0.1634 1.0366 2.1636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.183e+00 1.472e-01 28.423 < 2e-16 ***
## Elev -7.020e-03 2.899e-04 -24.216 < 2e-16 ***
## Slope -1.596e-01 1.750e-02 -9.116 < 2e-16 ***
## Asp 1.258e-04 2.060e-04 0.611 0.54134
## LinDist -4.042e-05 1.686e-06 -23.980 < 2e-16 ***
## VegReclassBSDom -1.149e-01 6.012e-02 -1.912 0.05589 .
## VegReclassJPES -6.626e-01 5.877e-02 -11.275 < 2e-16 ***
## VegReclassJPMat -1.893e-01 5.941e-02 -3.187 0.00144 **
## VegReclassJPMid -7.538e-01 6.059e-02 -12.442 < 2e-16 ***
## VegReclassMixCD.WB -1.355e+00 7.691e-02 -17.618 < 2e-16 ***
## VegReclassTmusk -2.734e-01 6.528e-02 -4.188 2.81e-05 ***
## Slope:Asp 2.095e-04 7.449e-05 2.813 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24723 on 17833 degrees of freedom
## Residual deviance: 22498 on 17822 degrees of freedom
## AIC: 22522
##
## Number of Fisher Scoring iterations: 4
Figure 43: Model summary output for the reduced model dropNoCalf1.
At a significance value of p<0.001, a one unit increase in the variables elevation, slope or the distance to the nearest linear feature significantly decreases the log odds of a point being used (Elev:z=-24.216, p<2e-16; Slope: z=-9.116, p<2e-16; LinDist: z=-23.980, p<2e-16, see Figure 43). Compared to the coefficients estimated for the global model, globNoCalf, the z-values reported here have changed little. As expected, the variable Aspect does not cause a significant change in the log odds of a point being used. Here, I ignore it as I am only interested in the interaction between Slope and Aspect.
With respect to if a point was located in an open muskeg habitat, a point located in all of the habitat classes, except for black spruce dominated forest, significantly decrease the log odds of a point being used at a significance level of p<0.05 (JPES: z=-11.275, p<2e-16; JPMat: z=-3.187; p=0.00144; JPMid: z=-12.442, p<2e-16; MixCD.WB: z=-17.618, p<2e-16, Tmusk: z=-4.188, p=2.81e-5). Black spruce dominated forest significantly decreases the log odds of a point being used relative to if that point were in open muskeg at p<0.1 (BSDom:z=-1.92, p=0.05589).
The model is overdispersed by 1.26 ((resid. dev.=22493/ df=17820) =1.26, see model output Figure 43). I ran a quasibinomial model to explore whether the addition of a dispersion parameter would reduce the overdispersion.
QdropNoCalf1<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+VegReclass, data=NoCalf, family=quasibinomial(link="logit"))
summary(QdropNoCalf1)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## VegReclass, family = quasibinomial(link = "logit"), data = NoCalf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9185 -1.0685 0.1634 1.0366 2.1636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.183e+00 1.469e-01 28.481 < 2e-16 ***
## Elev -7.020e-03 2.893e-04 -24.266 < 2e-16 ***
## Slope -1.596e-01 1.747e-02 -9.135 < 2e-16 ***
## Asp 1.258e-04 2.056e-04 0.612 0.54051
## LinDist -4.042e-05 1.682e-06 -24.030 < 2e-16 ***
## VegReclassBSDom -1.149e-01 5.999e-02 -1.916 0.05541 .
## VegReclassJPES -6.626e-01 5.865e-02 -11.298 < 2e-16 ***
## VegReclassJPMat -1.893e-01 5.928e-02 -3.194 0.00141 **
## VegReclassJPMid -7.538e-01 6.046e-02 -12.467 < 2e-16 ***
## VegReclassMixCD.WB -1.355e+00 7.675e-02 -17.655 < 2e-16 ***
## VegReclassTmusk -2.734e-01 6.515e-02 -4.197 2.72e-05 ***
## Slope:Asp 2.095e-04 7.434e-05 2.819 0.00483 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.9958857)
##
## Null deviance: 24723 on 17833 degrees of freedom
## Residual deviance: 22498 on 17822 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Figure 44: Model summary output for the quasibinomial model QdropNoCalf1.
The dispersion for the quasibinomial model is now taken to be 0.9958857 (see output Figure 44), which is slightly less than the assumed binomial model dispersion of 1. As adding the dispersion parameter does not change the significance of any of the model parameters and does not seem to improve the overdispersion, I continued with model selection using the model dropNoCalf1.
drop1(dropNoCalf1, test="Chisq")
## Single term deletions
##
## Model:
## UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + VegReclass
## Df Deviance AIC LRT Pr(>Chi)
## <none> 22498 22522
## Elev 1 23109 23131 610.87 < 2.2e-16 ***
## LinDist 1 23124 23146 625.51 < 2.2e-16 ***
## VegReclass 6 23053 23065 554.59 < 2.2e-16 ***
## Slope:Asp 1 22506 22528 7.92 0.004878 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 7: Output from the drop1 function (test = Chi-squared), as performed on the reduced model, dropNoCalf1.
Dropping any of the remaining variables causes a significant increase in model deviance (Elev:X2(1,N=17834) = 610.87, p<-2.2e-16; LinDist: X2(1,N=17834) = 625.51, p<-2.2e-16; VegReclass: X2(6,N=17834) = 554.59, p<-2.2e-16; Slope:Asp: X2(1,N=17834) = 7.92, p=0.004878, see Table 7).This suggests that elevation, the distance to linear features, habitat class and the relationship between slope and aspect are all important predictors of habitat selection by CNC.
However, an anova comparison between the CNC global model (globNoCalf) and the model dropNoCalf1 indicates that the two models are NOT significantly different than one another (Table 8). Comparing AIC values for the two models, the delta AIC is less than 2 (Table 9).
anova(globNoCalf, dropNoCalf1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + WatDist +
## VegReclass
## Model 2: UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist + VegReclass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 17821 22496
## 2 17822 22498 -1 -2.46 0.1168
Table 8: Comparing the reduced model dropNoCalf1 to the global model globNoCalf using the function anova (test= Chi-squared). There is no significant difference between the two models.
AIC(globNoCalf, dropNoCalf1)
## df AIC
## globNoCalf 13 22521.95
## dropNoCalf1 12 22522.41
Table 9: AIC values for the global model globNoCalf and the reduced model dropNoCalf1.
So which model should be selected as the ‘top’ model? One option is to average the models, which is common practice for models with a delta AIC <2. Another option is to choose the most parsimonious model (i.e the model with fewer terms). Because I am currently naive as to how to navigate the seemly numerous pitfalls associated with correctly applying and interpreting the results of model averaging, I subscribed to the latter option and selected the simpler model, dropNoCalf1 as the top model for CNC.
[Model Validation]
plot(fitted(dropNoCalf1), resid(dropNoCalf1), xlab="fitted values", ylab = "model residuals", main="Residuals vs. Fitted Values [ Model: dropNoCalf1 ]", col=c("red", "blue")[NoCalf$UseAv], pch=16)
abline(h=0, lty=2, col="black")
Figure 45: Plot of model residuals vs. fitted values for the model dropNoCalf1. The upper blue string represents the “used” response and the lower red string the “available”" response.
There are no used points in the available string and vice-versa.
library(arm)
binnedplot(fitted(dropNoCalf1), resid(dropNoCalf1), nclass=135, main="Binned Residual Plot [Model dropNoCalf1]", col.int="red")
Figure 46: Binned Residual Plot for the model dropNoCalf1. Each point represents the average residual vs. the average fitted value for each bin of data points. Bin sizes were computed as the square root of the sample size, n (i.e., the sq. rt. of 17834). The upper and lower red lines represent the theoretical error bounds within which 95% of the estimated residuals should fall IF the model is true.
For the model dropNoCalf1, there are a number of residuals outside the theoretical 95% error bounds, which suggests that the model dropNoCalf1 is a poor fit for the data. Interestingly, the residuals seem to cycle above and below the error bounds; this may be indicative that the variance is systematically cycling with one or more predictor variables.
I next calculated the generalised variance inflation factors.
library(car)
vif(dropNoCalf1)
## GVIF Df GVIF^(1/(2*Df))
## Elev 1.250175 1 1.118112
## Slope 4.711805 1 2.170669
## Asp 1.790590 1 1.338129
## LinDist 1.094295 1 1.046086
## VegReclass 1.312640 6 1.022929
## Slope:Asp 5.641445 1 2.375173
Table 10: Generalised variance inflation factors calculated for the explanatory variables of the model dropCalf1. The output shows three columns: the GVIF, the Df, and the GVIF^(1/(2xDf)). The Df is the number of coefficients in the subset of the variable. The GVIF is the generalised variation inflation factor, which is used in place of the standard VIF because the standard VIF cannot be applied to variables with a Df > 1. The GVIF^(1/(2xDf)) reduces the GVIF to a linear measure and makes the GVIFs comparable across dimensions (Monette and Fox 1992).
The GVIF^(1/(2Df)) values for the model terms Slope and Slope:Asp are greater than the threshold value of 2 (see Table 10). This is expected given that an interaction term is the product of its two interacting variables and thus not independent from them. Allison (2012) asserts that these high values are not a cause for concern and can be ignored. I am at a loss to explain, however, why the GVIF^(1/(2Df)) for Aspect is so low relative to Slope. For all of the other predictor variables, none have a **GVIF^(1/(2*Df))** >2 so multicollinearity is likley not the root cause of the poor model fit.
Finally, I tested for influential points.
cutoff <- 4/((nrow(NoCalf)-length(dropNoCalf1$coefficients)-2)) #identify D values > 4/(n-k-1).
plot(dropNoCalf1, which=4, cook.levels=cutoff)
Figure 47: Cook’s Distance for the used and available data points recorded for caribou without calves. There are three data points that have undue influence on the model (labelled points).
In Figure 47 it is difficult to read the two overlapping points between 0 and 5000. I used the interactive package zoom in R(c) to resolve these points into point 2996 and point 3074. An examination of the three points identified as having undue influence on the model (points 2996, 3074 and 11454), revealed that points 2996 and 3074 are actually duplicate points (see Figure 48). I therefore deleted 3074 and its associated available point from the data frame.
Point 2996 and point 11454 are both located on steeper slopes relative to most of the other points in the data set (14 and 18 degrees respectively, see Figure 48). In fact, only 14 records in the data set NoCalf have a Slope value greater than or equal to 14 degrees. These points were identified during the exploration for outliers, but because the maximum slope in the study area is approx. 42.2511 deg., they were not removed. These points are also both located at an extremely high elevation relative to the majority of the points in the data set (527m and 516m respectively); again, because these points are within the plausible range of values for the area, and because they have been double checked for GPS error through visual examination of each animal’s trajectory, they were not removed from the data. Nonetheless, I removed these points (and the associated available points) to check if their exclusion improved model fit.
#row names refer back to the original data frame 'sum' so use this data frame to investigate points
sum[2996, c(2,3,8,18:20,21,23,25,28)] #used point
## UseAv AID Calf Slope Asp Elev Age LinDist WatDist VegReclass
## 2996 Used 140168 No 14 101.46 527 33 69.98 1773.71 JPMid
sum[3074, c(2,3,8,18:20,21,23,25,28)] #used point - duplicate of 2996!
## UseAv AID Calf Slope Asp Elev Age LinDist WatDist VegReclass
## 3074 Used 140168 No 14 101.46 527 33 69.98 1773.71 JPMid
sum[11454, c(2,3,8,18:20,21,23,25,28)] #used point
## UseAv AID Calf Slope Asp Elev Age LinDist WatDist VegReclass
## 11454 Used 140161 No 18 354.32 516 40 1389.39 74.04 JPMat
Figure 48: Summary of the three data points from data set for CNC with Cook’s Distance exceeding the predefined threshold of 4/(n-k-1).
sum2<-sum[-c(3073, 3074),]
NoCalf<-sum2[sum2$Calf=="No",]
sum2<-sum2[-c(2995,2996,11453,11454),]
NoCalf2<-sum2[sum2$Calf=="No",]
NoCalf2<-within(NoCalf2, VegReclass<-relevel(VegReclass, ref="Omusk")) #relevel VegReclass so Omusk is reference
NoCalfnew<-glm(UseAv~Elev+Slope+Asp+Slope:Asp+LinDist+VegReclass, data=NoCalf2, family=binomial(link="logit"))
summary(NoCalfnew)
##
## Call:
## glm(formula = UseAv ~ Elev + Slope + Asp + Slope:Asp + LinDist +
## VegReclass, family = binomial(link = "logit"), data = NoCalf2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9204 -1.0685 0.1653 1.0358 2.1671
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.192e+00 1.472e-01 28.469 < 2e-16 ***
## Elev -7.030e-03 2.900e-04 -24.242 < 2e-16 ***
## Slope -1.636e-01 1.757e-02 -9.311 < 2e-16 ***
## Asp 1.101e-04 2.061e-04 0.534 0.59319
## LinDist -4.044e-05 1.686e-06 -23.984 < 2e-16 ***
## VegReclassBSDom -1.122e-01 6.013e-02 -1.865 0.06218 .
## VegReclassJPES -6.597e-01 5.879e-02 -11.222 < 2e-16 ***
## VegReclassJPMat -1.873e-01 5.942e-02 -3.152 0.00162 **
## VegReclassJPMid -7.546e-01 6.061e-02 -12.452 < 2e-16 ***
## VegReclassMixCD.WB -1.351e+00 7.693e-02 -17.564 < 2e-16 ***
## VegReclassTmusk -2.733e-01 6.529e-02 -4.187 2.83e-05 ***
## Slope:Asp 2.217e-04 7.466e-05 2.969 0.00299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24718 on 17829 degrees of freedom
## Residual deviance: 22486 on 17818 degrees of freedom
## AIC: 22510
##
## Number of Fisher Scoring iterations: 4
Figure 49: Model summary output for the the model Nocalfnew. This model was generated from a data set excluding the three influential points and their associated available points from the data set for CNC.
Though the z-values changed slightly for the new model NoCalfnew, the significance of the predictor variables essentially stayed the same (compare Figures 43 and 49). A binned residual plot revealed that there are still an abundance of residuals outside the 95% theoretical error bounds for this new model (Figure 51), which suggests that removing the influential points did not improve model fit. As was the case for the CWC model, it seems prudent to explore alternate modelling techniques.
plot(fitted(NoCalfnew), resid(NoCalfnew), xlab="fitted values", ylab = "model residuals", main="Residuals vs. Fitted Values [ Model: NoCalfnew ]", col=c("red", "blue")[NoCalf2$UseAv], pch=16)
abline(h=0, lty=2, col="black")
Figure 50: Plot of model residuals vs. fitted values for the model NoCalfnew. The upper blue string represents the “used” response and the lower red string the “available”" response.
library(arm) #binned residuals
binnedplot(fitted(NoCalfnew), resid(NoCalfnew), nclass=134, main="Binned Residual Plot [Model NoCalfnew]", col.int="red")
Figure 51: Binned Residual Plot for the model NoCalfnew. Each point represents the average residual vs. the average fitted value for each bin of data points. Bin sizes were computed as the square root of the sample size, n (i.e., the sq. rt. of 17834). The upper and lower red lines represent the theoretical error bounds within which 95% of the estimated residuals should fall IF the model is true.
STEP 6. Discussion
I am interested in evaluating whether habitat selection during the summer season differs between female caribou with calves (CWC) and female caribou without calves (CNC). For each subset group of caribou, I constructed and evaluated a glm (family =binomial, link = logit) to determine which combination of the given predictor variables (elevation, slope, aspect, distance to nearest linear feature, distance to nearest water source, and habitat class) best predicted the occurrence of caribou on the landscape. Comparing the ‘top’ models for CWC vs. CNC, the only difference was that the interaction between slope and aspect seemed to be important to CNC. From the bar charts generated whilst exploring the relationship between the response (“used” or “available” point location) and each habitat class, the heights of the bars indicated that mature jack pine forests may be more important to CWC while black spruce forests may be more important to CNC. Unfortunately, I was unable to tease out the importance of the different habitat classes because the drop1 function could not compute an analysis of deviance for the dummy habitat class variables originally included in the model.
As inidcated by binned residual plots, both top models fit their respective data sets poorly (Figures 36 and 46). Boyce et al. (2002) suggest k-fold cross validation as a more robust test for model goodness of fit in a presence-available study such as this one. However, evidence suggests that model misspecification is responsible for the poor fit for CWC and CNC, in which case running an alternative validation test is superfluous. First, the generalised variance inflation factors were below the threshold of concern for multicollinearity (Tables 5 and 10). Second, removal of influential points did not improve model fit (see binned residual plots for models generated with data that excluded influential data points, Figures 40 and 51). Finally, I have a priori knowledge that the response variable is not independent due to spatial-temporal autocorrelation, which is a violation of the assumptions of a binomial glm.
A more practical model choice for my data analysis would be a generalised linear mixed model (glmm) with random effects because the error structure that can account for my study’s unbalanced sample design and spatial-temporal autocorrelation. Generalised Estimating Equations (GEEs) may also be useful as they offer a framework for fitting models to clustered, correlated data. In summary, a glm model is not appropriate for my analysis.
REFERENCES
1.Allison, P. 2012. ‘When Can You Safely Ignore Multicollinearity?’, Statistical Horizons website: http://statisticalhorizons.com/multicollinearity;
2.Boyce, M.S., Vernier, P.R., Nielson, S.E. & Schmiegelow, F.K.A. 2002. Evaluating resource selection functions. Ecological Modelling, 157: 281-300;
3.Fox, J. and Monetter, G. 1992. Theory and Methods, Generalized Collinearity Diagnostics. Journal of the American Statistical Association, 87: 178-173;
4.Gelman, A., Hill, J. & Michael, A.R.2006. Data Analysis Using Regression and Multilevel/Hierarchical Models, [pg.97-99] Cambridge University Press;
5.Iorio, F.D. & Iannario, M. 2012. Residual Diagnostics for Interpreting CUB Models. STATISTICA annoLXXII n.2;
Zuur, A.F., Ieno, E.N., Smith, G.M. 2007. Analyzing Ecological Data. Springer Science + Business Media LLC, 223 Spring Street, New York, NY, 10013, USA;
Zurr A.F., Ieno, E.N., Elphick, C.S. 2009. A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1: 3-14.