The Northern Great Plains Piping Plovers are a threatened shorbird species that reside in aquatic habitats across North and South Dakota. It is imperative that we figure out the renesting patterns for these birds so that we can provide ideal breeding habitats for the ones who are most likely to renest. Using the renesting propensity data from a USGS study, I conducted a two-sample t-test analyzing the age in which a plover’s nest failed by whether or not it renested. I was able to conclude that there is a significant difference between the ages of plovers who renest and those who do not.
Renesting is a strategy that many birds use in order to increase their chances of reproductive success after an initial nesting failure. Numerous factors can affect when a nest fails and if the bird decides to renest. There could be a limited amount of land for a bird to renest, the time of year may not be optimal to start breeding, or the bird may simply not want to spend the energy looking for a new nesting location. From 2014 to 2016, a USGS study was conducted on the renesting propensities of the Northern Great Plains Piping Plovers (Swift et al. 2020). These birds are found around various freshwater habitats in the North Dakota, South Dakota, and Montana regions. The researchers searched along sandbars and shorelines on lakes and rivers as these areas contained the vegetated sand and gravel that is ideal for nest locations. Once nests were located, they were monitored every two to four days. The dataset I used includes the IDs of each plover, the IDs of the pairs, whether or not they renested, the date when the first nest failed, the plover’s age when its nest failed, the habitat type, the amount of available habitat, the landform type, the amount of previous nesting attempts, the year of failure, and the cause of failure. In this analysis, I wanted to know if younger Northern Great Plains Piping Plovers have a higher rate of renesting than older plovers. I focused on the ages in which a plover’s nest failed and whether or not they renested. Since these birds are a threatened species, we must find a pattern in which groups of birds are more likely to renest so that we can accommodate them with ideal breeding habitats.
Using histograms, Shapiro-Wilk tests, and QQ-Plots, on the numeric variables, I was able to conclude that none of the numeric data meets the assumptions for normality. This led me to believe that the data should be transformed before running the two-sample t-test. If no transformation could allow those assumptions to be met, then a non-parametric test would have to be used. The best non-parametric to replace a two-sample t-test would have to be a Wilcox Rank Sum test.
The scatter plots of matrices and Pearson correlations for the variables also revealed that none of the data is significantly correlated (Fig. 1). This is beneficial to me as it indicates that the “age at failure” variable is a good predictor of whether or not the birds renest.
Boxplots appear to show a distinct difference between the median ages of plovers who renest and those who do not (Fig. 2). They suggest that plovers who renested tended to have had their nests fail earlier in the study and tended to be of younger age.
Histograms of each numeric variable are plotted along the diagonal. A correlation coefficient and scatterplot comparison is provided for each pair of variables.
Visualizes a mean, median, and range for each numeric variable over whether or not the plover renested.
In order to compare the means between the ages of plovers who renested and those who did not, I attempted to conduct a two-sample t-test. This test compares the means of a numerical variable between two populations. After checking the assumptions of normality and equal variances, I realized the data must be transformed. However, I was unable to find a transformation that would allow the data to meet the assumptions and resorted to using a non-parametric test instead. A linear regression model was then conducted as a “predictive” test using all of the numeric variables in order to see which variables are good indicators of failure age.
The null hypothesis for the two-sample t-test is as follows: There is no significant difference in the mean ages of Northern Great Plains Piping Plovers who renest and those who do not. The data for the “age at failure” by “renested” had to meet the assumptions of random samples, normality, and equal variances before the two-sample t-test could be performed. Given the methods of the researchers, I am going to assume that the data collection process was unbiased so that the Random Samples assumption could be met. To test for normality, I extracted the residuals and conducted a Shapiro-Wilk test, a QQ-Plot, and a histogram on this data. None of these tests suggested that the data is normal. To check for homoskedasticity, or equal variance, I looked at a residuals vs. fitted plot of the data and conducted a Levene’s test as well. Neither suggested that the variances are equal.
Since the assumptions for normality and homoskedasticity were not met and a transformation would have been too complex, I used a non-parametric Wilcox Rank Sum test. The null hypothesis for this test is as follows: the difference between the median ages of plovers who renest and plovers who do not is 0. The assumptions of random sampling and a symmetric population around the mean would have to be met in order to perform the test. The assumption for random samples has already been meet so there was no need to retest for it. To determine if the population is symmetric around the mean, I calculated the mean of the data then looked at a histogram to see if the data appeared mirrored around the mean. Based on my judgement, both assumptions were met to allow me to conduct the Wilcox Rank Sum test.
Since the sample size is very large, I tried ingoring the normality assumption and conducting a two-sample t-test anyway. I performed a Welch’s t-test to at least account for the unequal variances and compared the results to the Wilcox Rank Sum test. Since the Welch’s t-test is a variation of the two-sample t-test, the null hypothesis remains the same.The results of both tests led to the same conclusion: there is a significant difference in the mean ages of Northern Great Plains Piping Plovers who renest and those who do not.
To determine which variables are good indicators of failure age, I conducted a linear regression model using all of the numeric variables within the dataset. Before running the model, the assumptions of random sampling, normal residuals, and homoskedasticity must be met. The data was already proven to be randomly sampled, a Shapiro-Wilk test, QQ-Plot, and histogram were used to determine normality in the residuals, and a residuals vs. fitted plot was used to compare variances. These tests suggest that the assumptions for normality and homoskedasticity are not met. However, since the sample size is very large, I tried ignoring the violations and running the model anyway using a \(summary\) command.
To check the assumptions of the two-sample t-test, a Shapiro-Wilk test, QQ-Plot, and histogram on the residuals were used to check normality while a residuals vs. fitted plot and Levene’s test were used to check homoskedasticity. The Shapiro-Wilk test revealed a p-value of less than 0.05, the QQ-Plot displayed many points outside the 95 percent confidence interval, and the histogram showed a non-normal pattern. The consensus is that the normality assumption is not met. The residuals vs. fitted plot displayed an arguably cone-shaped pattern and the Levene’s test revealed a p-value of less than 0.05. Both of these tests suggest that the homoskedasticity assumptions are not met. Using a Wilcox Rank Sum test as a non-parametric alternative to the two-sample t-test, it was revealed that there is a significant difference in the mean ages of Northern Great Plains Piping Plovers who renest and those who do not. A Welch’s t-test that ignored the normality violation revealed the same conclusion.
For the Wilcox Rank Sum test, I calculated a mean of 29.19 then looked at a histogram of the “age at failure” variable to see if the assumption for a symmetric population around the mean was met. It appeared to be so and the Wilcox test was run. With a p-value of less than 0.05, we can reject the null hypothesis that the difference between the median plover ages at failure is 0 (W = 187007, p-value < 2.2e-16). This allows me to conclude that there is a significant difference between the ages of plovers who renest and those who do not. The Welch’s t-test also concluded a significant difference between the mean ages of plovers who renest and those who do not (p-value < 2.2e-16, 95% Confidence Interval: 10.79 < x < 13.67).
A boxplot of ages of plovers who renested is compared to a boxplot of those who did not renest.
When testing the assumptions for the linear regression model, the QQ-Plot only revealed a few confidence interval outliers at the top and the histogram showed a potentially normal graph of the residuals, possibly indicating a normal distribution. However, the Shapiro-Wilk test revealed a p-value of less than 0.05, clearly suggesting that the data is not normal. The residuals vs. fitted plot revealed no distinct pattern and thus suggests that the variances are equal.
Even though the assumptions for normality were not met, I still ran the regression model given that the sample size is exceptionally large. The summary revealed that each numerical variable output a p-value of less than 0.05. This means that each variable is a significant predictor of the failure age.
Age at failure is plotted against the failure date and the plot of plovers who renested overlaps the plot of those who did not.
Knowing that there is a significant difference between the mean ages of plovers who renest and those who do not, I can look at a boxplot to assume that plovers who renest tend to be younger in age (Fig. 3). Logically, this would make sense as older plovers do not have much of a lifespan left and therefore would find it a waste of energy to set up a new nest. With this information, we now know to prioritize accomodating the younger plovers if the necessary resources for their reproduction are limited. Since I concluded that other numeric variables (failure date, available habitat, previous nesting attempts) are good predictors of failure age, we can monitor these variables to study how exactly they affect the failure rates of plover nesting.
Though both the Wilcox Rank Sum test and Welch’s t-test led to the same conclusion, the power of this analysis is not the strongest. The non-parametric Wilcox test is almost guaranteed to be less powerful than a parametric test. This is why the Welch’s t-test was also conducted. However, the violation of the normality assumption was ignored since the sample size was very large. If a transformation could allow the data to be normalized, this analysis could have had more power.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Revelle, W. (2019) psych: Procedures for Personality and Psychological Research, Northwestern University, Evanston, Illinois, USA, https://CRAN.R-project.org/package=psych Version = 1.9.12.
Swift, R.J., Anteau, M.J., Ring, M.M., Toy, D.L., and Sherfy, M.H., Low renesting propensity and reproductive success make renesting unproductive for the threatened Piping Plover (Charadrius melodus), The Condor, Volume 122, Issue 2, 5 May 2020, duz066, https://doi.org/10.1093/condor/duz066
Swift, R.J., Anteau, M.J., Ring, M.M., Toy, D.L., and Sherfy, M.H., 2019, Renesting propensity, intervals, and reproductive success data for the Northern Great Plains Piping Plover, a threatened shorebird species 2014-2016: U.S. Geological Survey data release, https://doi.org/10.5066/P9VAS8P7.
#Check data
head(prop)
levels(prop$renested)
## [1] "NO" "YES"
summary(prop)
## ID PairID renested failuredate
## Min. : 1.0 Min. : 1.0 NO :1126 Min. :132.0
## 1st Qu.: 244.0 1st Qu.: 222.0 YES: 229 1st Qu.:161.5
## Median : 504.0 Median : 493.0 Median :172.5
## Mean : 536.7 Mean : 489.1 Mean :174.0
## 3rd Qu.: 834.5 3rd Qu.: 741.0 3rd Qu.:186.0
## Max. :1120.0 Max. :1010.0 Max. :226.5
## ageatfailure habitattype availhabitat landform
## Min. : 1.00 ALKALI :286 Min. :-1.7790000 ISLAND :545
## 1st Qu.:17.00 RESERVOIR:582 1st Qu.:-1.0380000 SHORELINE:810
## Median :25.50 RIVER :487 Median : 0.0800000
## Mean :29.19 Mean :-0.0006472
## 3rd Qu.:43.90 3rd Qu.: 0.6140000
## Max. :58.00 Max. : 2.6570000
## prevnestingattempts mate1ageexp mate2ageexp year
## Min. :0.0000 ASY_no : 75 ASY_no : 60 Min. :2014
## 1st Qu.:0.0000 ASY_yes:722 ASY_yes: 213 1st Qu.:2014
## Median :0.0000 SY_no : 80 SY_no : 57 Median :2015
## Mean :0.1151 UNK :478 UNK :1025 Mean :2015
## 3rd Qu.:0.0000 3rd Qu.:2016
## Max. :2.0000 Max. :2016
## failurecause
## FLOODING:255
## HATCH :498
## PREDATOR:168
## UNK :392
## WEATHER : 42
##
#Check numeric variables for normality assumptions
par(mfrow=c(2,2))
hist(prop$failuredate, main = "Histogram of Failure Date", xlab = "Failure Date") #appears normal
hist(prop$ageatfailure, main = "Histogram of Age at Failure", xlab = "Age at Failure") #does not appear normal
hist(prop$availhabitat, main = "Histogram of Available Habitat", xlab = "Available Habitat") #likely not normal
hist(prop$prevnestingattempts, main = "Histogram of Previous Nesting Attempts", xlab = "Previous Nesting Attempts") #not normal, right-skewed
par(mfrow=c(1,1))
shapiro.test(prop$failuredate) #p-value < 0.05, not normal
##
## Shapiro-Wilk normality test
##
## data: prop$failuredate
## W = 0.98901, p-value = 1.449e-08
shapiro.test(prop$ageatfailure) #p-value < 0.05, not normal
##
## Shapiro-Wilk normality test
##
## data: prop$ageatfailure
## W = 0.93695, p-value < 2.2e-16
shapiro.test(prop$availhabitat) #p-value < 0.05, not normal
##
## Shapiro-Wilk normality test
##
## data: prop$availhabitat
## W = 0.88649, p-value < 2.2e-16
shapiro.test(prop$prevnestingattempts) #p-value < 0.05, not normal
##
## Shapiro-Wilk normality test
##
## data: prop$prevnestingattempts
## W = 0.36684, p-value < 2.2e-16
par(mfrow=c(2,2))
qqPlot(prop$failuredate) #some outliers at the bottom, not normal
## [1] 67 631
qqPlot(prop$ageatfailure) #LOTS of outliers, not normal
## [1] 559 1119
qqPlot(prop$availhabitat) #LOTS of outliers, not normal
## [1] 598 600
qqPlot(prop$prevnestingattempts) #LOTS of outliers, not normal
## [1] 28 143
#Numeric variable analysis
par(mfrow=c(1,1))
pairs.panels(prop[-1][-1][-1][-3][-4][-5][-5][-5][-5]) #no data is highly correlated; include as fig
par(mfrow=c(2,2)) #include as fig
boxplot(prop$failuredate~prop$renested, ylab = "Failure Date", xlab = "Renested")
boxplot(prop$ageatfailure~prop$renested, ylab = "Age at Failure", xlab = "Renested")
boxplot(prop$availhabitat~prop$renested, ylab = "Available Habitat", xlab = "Renested")
boxplot(prop$prevnestingattempts~prop$renested, ylab = "Previous Nesting Attempts", xlab = "Renested")
#I can conclude that...
#the assumptions of normality are not met for the "age at failure" variable, therefore the data should be transformed
#"renesting" appears to be a good predictor of each numeric variable
#the failure date is slightly correlated with the age at failure
#With this information, I will attempt a two-sample t-test on the age at failure by renested to compare means
#I will also conduct a linear regression model to see what variables are most predictive of the failure age
#Try to run a two-sample t-test on the "age at failure" by "renested"
#Check assumptions:
#Random samples? Yes
#Normality? ummm...
#Equal variances? ...
model <- lm(prop$ageatfailure~prop$renested)
par(mfrow=c(2,2))
plot(model) #only values in 2 locations, arguably cone-shaped; unequal variances
par(mfrow=c(1,1))
res <- model$residuals
shapiro.test(res) #p-value less than 0.05, not normal
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.95587, p-value < 2.2e-16
qqPlot(res) #many outliers, not normal
## [1] 417 1220
hist(res, main = "Histogram of Model Residuals", xlab = "Residuals") #no skew but not normal
#H_0 for Levene's Test: variances are equal
leveneTest(prop$ageatfailure~prop$renested) #p-value less than 0.05, reject null hypothesis; variances are not equal
#Assumptions of normality and equal variance are not met
#No easy transformation to meet assumptions, must use non-parametric test (Wilcox Rank Sum Test)
#Assumptions for Wilcox:
#Samples are random? Yes
#Population is symmetric around mean? Looks to be so
mean(prop$ageatfailure)
## [1] 29.18753
hist(prop$ageatfailure, main = "Histogram of Age at Failure", xlab = "Age at Failure")
#H_0 for wilcox test: the difference between the medians is 0
wilcox.test(prop$ageatfailure~prop$renested) #p-value less than 0.05, reject null hypothesis
##
## Wilcoxon rank sum test with continuity correction
##
## data: prop$ageatfailure by prop$renested
## W = 187007, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
#Conclusion: There is a difference between the ages of plovers who renest and those who do not.
#Sample size is very large so maybe I can ignore the normality violation and try a two-sample t-test anyway
#I will try a Welch's t-test to account for the unequal variances
t.test(prop$ageatfailure~prop$renested) #p-value less than 0.05, reject null hypothesis
##
## Welch Two Sample t-test
##
## data: prop$ageatfailure by prop$renested
## t = 16.685, df = 541.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 10.78966 13.66926
## sample estimates:
## mean in group NO mean in group YES
## 31.25435 19.02489
#Conclusion is the same: There is a difference between the ages plovers who renest and those who do not
#Create a ggplot to summarize the results, include as a fig
ggplot(prop, aes(x=renested, y=ageatfailure, fill=renested, color=renested))+ #base plot
geom_boxplot(show.legend=F)+ #create boxplot, hide legend cause its just colors
scale_fill_manual(values = c("red", "green"))+
scale_color_manual(values=c("black", "black"))+
ylab("Age at Failure")+
xlab("Renested")+
theme_bw()+
theme(text = element_text(size=15))
####Linear Regression
#Check assumptions of linear regression
#Random sample? Yes
#Normal residuals? oop...
#Equal variances? Yes
model2 <- lm(ageatfailure~renested + failuredate + availhabitat + prevnestingattempts, data = prop)
par(mfrow=c(2,2))
plot(model2) #res vs. fitted has no distinct pattern, variances equal
par(mfrow=c(1,1))
res2 <- model2$residuals
qqPlot(res2) #a few outliers at the top, possibly normal
## 545 1011
## 495 928
shapiro.test(res2) #p-value less than 0.05, not normal
##
## Shapiro-Wilk normality test
##
## data: res2
## W = 0.99053, p-value = 1.146e-07
hist(res2, main = "Histogram of Model 2 Residuals", xlab = "Residuals") #possibly normal
#Normality assumptions are not definitively met but since the sample size is so large, let's see the comparison anyway
summary(model2) #Everything is a significant predictor
##
## Call:
## lm(formula = ageatfailure ~ renested + failuredate + availhabitat +
## prevnestingattempts, data = prop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.152 -7.387 0.016 8.776 28.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.77867 3.43903 -17.092 < 2e-16 ***
## renestedYES -3.19285 0.89394 -3.572 0.000367 ***
## failuredate 0.51857 0.01942 26.698 < 2e-16 ***
## availhabitat 1.05316 0.36693 2.870 0.004166 **
## prevnestingattempts -14.87630 0.92810 -16.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.14 on 1350 degrees of freedom
## Multiple R-squared: 0.4419, Adjusted R-squared: 0.4402
## F-statistic: 267.2 on 4 and 1350 DF, p-value: < 2.2e-16
#Compared "failure date" with "age at failure" since they are the most correlated
ggplot(prop, aes(x=failuredate, y=ageatfailure, color=renested, fill=renested))+
geom_jitter(size=3, alpha=0.6, shape=21)+
geom_smooth(method="lm", alpha=0.6, size=0.5)+
scale_color_manual(values=c("black", "black"))+
scale_fill_manual(values=c("red", "green"))+
ylab("Age at Failure")+
xlab("Failure Date")+
theme_bw()+ #cute theme
theme(text = element_text(size=15))