R script (3 Points) - You will receive 3 points for having a functioning .rmd file. This script must be used to answer the assigned questions. Include comments in the script as well.
Load the “HMF Complete Uncleaned.csv” file. This is a more complete version of the data you have been working with in this and previous sections. Note that there is missing data and you may have to adjust for this for some of your calculations.
# Load the dataset
HMF_data <- read_csv("HMF Complete Uncleaned.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 4332 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): band, initialcapture, bandsize, date, net, sex, age, species, spe...
## dbl (6): uniqueid, year, weight, wc, wcr, wcf
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows to check the data
head(HMF_data)
## # A tibble: 6 × 24
## uniqueid band initialcapture bandsize date year time net sex age
## <dbl> <chr> <chr> <chr> <chr> <dbl> <tim> <chr> <chr> <chr>
## 1 5 58-159034 n <NA> 6/26… 1961 18:40 61B M AHY
## 2 5 58-159034 n <NA> 6/12… 1961 13:40 62B M AHY
## 3 1448 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## 4 1449 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## 5 1454 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## 6 1455 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## # ℹ 14 more variables: weight <dbl>, species <chr>, speciesc <chr>,
## # habitat <chr>, food <chr>, nesting <chr>, behav <chr>, wc <dbl>, wcr <dbl>,
## # wcf <dbl>, broodp <chr>, fatc <chr>, cp <chr>, notes <chr>
Calculate the mean bird weight for the data set. In this data, the weight roughly follows a Poisson distribution (note this is because we are looking at multiple species with different average weights). Using your mean weight,plot the Poisson density function for weight.
# Calculate mean bird weight, removing missing values
mean_weight <- mean(HMF_data$weight, na.rm = TRUE)
# Create a sequence of weights for plotting
weight_range <- seq(0, max(HMF_data$weight, na.rm = TRUE), by = 1)
# Compute Poisson probabilities using the mean weight as lambda
poisson_probs <- dpois(weight_range, lambda = mean_weight)
# Plot the Poisson density function
ggplot(data = data.frame(weight_range, poisson_probs), aes(x = weight_range, y = poisson_probs)) +
geom_line(color = "blue", size = 1) +
labs(title = "Poisson Density Function for Bird Weight",
x = "Bird Weight (grams)",
y = "Probability Density") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In this data, bird weight is measured in grams. What is the probability that a randomly selected bird weighs less than 44 grams? What is the probability that a randomly selected bird weighs more than 35 grams?
# Probability of a bird weighing less than 44 grams
prob_less_44 <- ppois(43, lambda = mean_weight)
# Probability of a bird weighing more than 35 grams
prob_more_35 <- 1 - ppois(35, lambda = mean_weight)
# Print the results
prob_less_44
## [1] 0.7863364
prob_more_35
## [1] 0.6857404
Using R, subset the data into two time periods. One time period will be from 1960-1968. The other time period will be from 2011-2014. Name the the new dataframes “HMF_Historic_Data” and “HMF_Recent_Data” respectively. You can use either tidyverse or the subsetting function here
# Subset data for historic period (1960-1968)
HMF_Historic_Data <- HMF_data %>%
filter(year >= 1960 & year <= 1968)
## filter: removed 631 rows (15%), 3,701 rows remaining
# Subset data for recent period (2011-2014)
HMF_Recent_Data <- HMF_data %>%
filter(year >= 2011 & year <= 2014)
## filter: removed 3,978 rows (92%), 354 rows remaining
# Display the first few rows to verify
head(HMF_Historic_Data)
## # A tibble: 6 × 24
## uniqueid band initialcapture bandsize date year time net sex age
## <dbl> <chr> <chr> <chr> <chr> <dbl> <tim> <chr> <chr> <chr>
## 1 5 58-159034 n <NA> 6/26… 1961 18:40 61B M AHY
## 2 5 58-159034 n <NA> 6/12… 1961 13:40 62B M AHY
## 3 1448 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## 4 1449 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## 5 1454 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## 6 1455 101-1105… y <NA> 5/27… 1964 17:50 Nest <NA> N
## # ℹ 14 more variables: weight <dbl>, species <chr>, speciesc <chr>,
## # habitat <chr>, food <chr>, nesting <chr>, behav <chr>, wc <dbl>, wcr <dbl>,
## # wcf <dbl>, broodp <chr>, fatc <chr>, cp <chr>, notes <chr>
head(HMF_Recent_Data)
## # A tibble: 6 × 24
## uniqueid band initialcapture bandsize date year time net sex age
## <dbl> <chr> <chr> <chr> <chr> <dbl> <tim> <chr> <chr> <chr>
## 1 1936 251091622 n <NA> 6/15… 2011 10:50 6 <NA> AD
## 2 1952 230191051 n <NA> 6/27… 2011 07:10 9 <NA> AD
## 3 1952 230191051 n <NA> 7/24… 2011 07:30 9 <NA> AD
## 4 1960 230191059 n <NA> 8/5/… 2011 10:15 2 <NA> AD
## 5 1974 230191073 n <NA> 7/17… 2013 06:30 6 <NA> AHY
## 6 2000 230191028 n <NA> 6/4/… 2012 09:30 9 <NA> AHY
## # ℹ 14 more variables: weight <dbl>, species <chr>, speciesc <chr>,
## # habitat <chr>, food <chr>, nesting <chr>, behav <chr>, wc <dbl>, wcr <dbl>,
## # wcf <dbl>, broodp <chr>, fatc <chr>, cp <chr>, notes <chr>
Using the two data frames, create two additional data frames that calculate the number of captures per year. Name the first “Historic_Captures_Per_Year” and the second “Recent_Captures_Per_Year”. Note: each row represents a unique capture, so the total number of rows or observations per year represents the total number of captures.Hint - using the group_by() and summarise() functions will be very useful here. Remember n() can be used to count the number of observations when used in the summarise command. Example summarise(count = n())
# Calculate number of captures per year for historic data
Historic_Captures_Per_Year <- HMF_Historic_Data %>%
group_by(year) %>%
summarise(count = n())
## group_by: one grouping variable (year)
## summarise: now 9 rows and 2 columns, ungrouped
# Calculate number of captures per year for recent data
Recent_Captures_Per_Year <- HMF_Recent_Data %>%
group_by(year) %>%
summarise(count = n())
## group_by: one grouping variable (year)
## summarise: now 4 rows and 2 columns, ungrouped
# Display the results
Historic_Captures_Per_Year
## # A tibble: 9 × 2
## year count
## <dbl> <int>
## 1 1960 378
## 2 1961 190
## 3 1962 234
## 4 1963 642
## 5 1964 900
## 6 1965 196
## 7 1966 380
## 8 1967 466
## 9 1968 315
Recent_Captures_Per_Year
## # A tibble: 4 × 2
## year count
## <dbl> <int>
## 1 2011 98
## 2 2012 97
## 3 2013 101
## 4 2014 58
Test to see if the number of captures/observations per year is normally distributed for each dataframe. For each data frame, produce one visualization test and one statistical test.
# Histogram for Historic Data
hist(Historic_Captures_Per_Year$count,
main = "Histogram of Captures Per Year (Historic Data)",
xlab = "Captures", col = "blue", border = "black")
# Q-Q Plot for Historic Data
qqnorm(Historic_Captures_Per_Year$count, main = "Q-Q Plot for Historic Data")
qqline(Historic_Captures_Per_Year$count, col = "red")
# Shapiro-Wilk Normality Test for Historic Data
shapiro.test(Historic_Captures_Per_Year$count)
##
## Shapiro-Wilk normality test
##
## data: Historic_Captures_Per_Year$count
## W = 0.87242, p-value = 0.1305
# Histogram for Recent Data
hist(Recent_Captures_Per_Year$count,
main = "Histogram of Captures Per Year (Recent Data)",
xlab = "Captures", col = "red", border = "black")
# Q-Q Plot for Recent Data
qqnorm(Recent_Captures_Per_Year$count, main = "Q-Q Plot for Recent Data")
qqline(Recent_Captures_Per_Year$count, col = "blue")
# Shapiro-Wilk Normality Test for Recent Data
shapiro.test(Recent_Captures_Per_Year$count)
##
## Shapiro-Wilk normality test
##
## data: Recent_Captures_Per_Year$count
## W = 0.70713, p-value = 0.01401
In 2014 and 2015, sampling effort was only half of that in 2011, 2012, and 2013. To adjust for this, multiply the capture records for 2014 and 2015 by 2 (note that this is a very rough approximation of how to actually adjust for differences in trap effort). Now check to see if the recent data is normally distributed for captures by year. Hint: you can easily double the value in a specific cell in a data frame as follows “example_data[x,y]<-example_data[x,y]*2”
# Adjust capture counts for 2014 and 2015
Recent_Captures_Per_Year$count[Recent_Captures_Per_Year$year %in% c(2014, 2015)] <-
Recent_Captures_Per_Year$count[Recent_Captures_Per_Year$year %in% c(2014, 2015)] * 2
# Histogram for Adjusted Recent Data
hist(Recent_Captures_Per_Year$count,
main = "Histogram of Adjusted Captures Per Year (Recent Data)",
xlab = "Captures", col = "red", border = "black")
# Q-Q Plot for Adjusted Recent Data
qqnorm(Recent_Captures_Per_Year$count, main = "Q-Q Plot for Adjusted Recent Data")
qqline(Recent_Captures_Per_Year$count, col = "blue")
# Shapiro-Wilk Normality Test for Adjusted Recent Data
shapiro.test(Recent_Captures_Per_Year$count)
##
## Shapiro-Wilk normality test
##
## data: Recent_Captures_Per_Year$count
## W = 0.78544, p-value = 0.07857
Conduct a linear regression to investigate whether or not their is a statistical relationship between number of captures and year for the historic and recent time periods. For each time period, is there a relationship? How do you know?
# Linear regression for Historic Data
historic_lm <- lm(count ~ year, data = Historic_Captures_Per_Year)
summary(historic_lm) # Display regression results
##
## Call:
## lm(formula = count ~ year, data = Historic_Captures_Per_Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -222.26 -163.16 -45.29 33.68 488.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13402.244 62856.784 -0.213 0.837
## year 7.033 32.004 0.220 0.832
##
## Residual standard error: 247.9 on 7 degrees of freedom
## Multiple R-squared: 0.006852, Adjusted R-squared: -0.135
## F-statistic: 0.04829 on 1 and 7 DF, p-value: 0.8323
# Linear regression for Recent Data
recent_lm <- lm(count ~ year, data = Recent_Captures_Per_Year)
summary(recent_lm) # Display regression results
##
## Call:
## lm(formula = count ~ year, data = Recent_Captures_Per_Year)
##
## Residuals:
## 1 2 3 4
## 3.7 -3.1 -4.9 4.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11569.500 5162.367 -2.241 0.154
## year 5.800 2.565 2.261 0.152
##
## Residual standard error: 5.736 on 2 degrees of freedom
## Multiple R-squared: 0.7188, Adjusted R-squared: 0.5782
## F-statistic: 5.112 on 1 and 2 DF, p-value: 0.1522
# Plot regression for Historic Data
plot(Historic_Captures_Per_Year$year, Historic_Captures_Per_Year$count,
main = "Linear Regression: Captures Over Time (Historic Data)",
xlab = "Year", ylab = "Captures", pch = 16, col = "blue")
abline(historic_lm, col = "red", lwd = 2)
# Plot regression for Recent Data
plot(Recent_Captures_Per_Year$year, Recent_Captures_Per_Year$count,
main = "Linear Regression: Captures Over Time (Recent Data)",
xlab = "Year", ylab = "Captures", pch = 16, col = "blue")
abline(recent_lm, col = "red", lwd = 2)
#In the historic data from the 1960s, there is no meaningful relationship between the year and the number of captures. The regression line is nearly flat, and the coefficient for year is very small (7.033), indicating little to no trend. Additionally, the p-value (0.832) is very high, meaning the relationship is not statistically significant, and the R-squared value (0.006852) suggests that the model explains almost none of the variation in captures. In contrast, the recent data from 2011-2014 shows a more noticeable upward trend in captures, with a higher R-squared value (0.7188), indicating that the model explains about 72% of the variation. However, the p-value (0.152) is still above the conventional significance threshold, meaning the statistical evidence for a strong relationship remains weak, likely due to the small sample size.
We are interested in knowing if the overall abundance of species is different between the historic and recent time periods. Conduct a test which compares the abundances between the Historic_Captures_Per_Year and the Recent_Captures_Per_Year dataframes. Are the abundances between these two time periods different? Use a boxplot to visualize this data
# Combine both datasets for comparison
combined_data <- data.frame(
Year_Group = rep(c("Historic", "Recent"),
c(nrow(Historic_Captures_Per_Year), nrow(Recent_Captures_Per_Year))),
Captures = c(Historic_Captures_Per_Year$count, Recent_Captures_Per_Year$count)
)
# Boxplot comparing the two periods
boxplot(Captures ~ Year_Group, data = combined_data,
main = "Comparison of Captures Per Year (Historic vs. Recent)",
xlab = "Time Period", ylab = "Number of Captures",
col = c("blue", "red"))
# Perform a t-test to compare mean captures between time periods
t.test(Historic_Captures_Per_Year$count, Recent_Captures_Per_Year$count,
var.equal = FALSE) # Welch's t-test (default)
##
## Welch Two Sample t-test
##
## data: Historic_Captures_Per_Year$count and Recent_Captures_Per_Year$count
## t = 3.9673, df = 8.0517, p-value = 0.004081
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 129.2692 487.1753
## sample estimates:
## mean of x mean of y
## 411.2222 103.0000
We want to know if, broadly speaking, functional traits vary by habitat. Create a data frame that includes the mean, median, and standard deviation for species wing cord length (“wc” in the dataframe) for species that live in forests (forest), grasslands (grassland), open woodlands (open woodlands) and scrub (scrub). Once again, think about group_by() and summarise(). You will also need to filter so you only include the 4 groups of interest
Also note, drop levels will get rid of empty categories in your data frame. After you subset your data, you may see that your data still includes habitats that are not of interest even though there are 0 records for these habitats. droplevels() will fix this. Use droplevels as so: exampledata<-droplevels(exampledata)
# Load the dataset
HMF_Data <- read.csv("HMF Complete Uncleaned.csv", stringsAsFactors = TRUE)
# Subset the data to include only the four habitat types of interest
selected_habitats <- c("forest", "grassland", "open woodlands", "scrub")
HMF_Selected_Habitats <- subset(HMF_Data, habitat %in% selected_habitats)
# Drop unused factor levels
HMF_Selected_Habitats <- droplevels(HMF_Selected_Habitats)
# Calculate mean, median, and standard deviation of wing cord length (wc) for each habitat
habitat_summary <- aggregate(wc ~ habitat, data = HMF_Selected_Habitats,
FUN = function(x) c(mean = mean(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE)))
# Convert the summary into a cleaner format
habitat_summary <- do.call(data.frame, habitat_summary)
# Rename columns for clarity
colnames(habitat_summary) <- c("Habitat", "Mean_WC", "Median_WC", "SD_WC")
# Display the summary table
print(habitat_summary)
## Habitat Mean_WC Median_WC SD_WC
## 1 forest 97.10915 104.00 17.24035
## 2 grassland 97.81250 98.75 12.29919
## 3 open woodlands 104.40155 93.50 25.79321
## 4 scrub 84.22275 84.00 11.00481
Run a test(s) to determine if wing cord length differs between habitat types. How does wing cord length differ between these four habitats? Create a boxplot that shows wing cord (measured in cm) for each habitat type.
# Boxplot of wing cord length by habitat
boxplot(wc ~ habitat, data = HMF_Selected_Habitats,
main = "Wing Cord Length by Habitat Type",
xlab = "Habitat", ylab = "Wing Cord Length (cm)",
col = c("blue", "green", "orange", "purple"),
las = 2) # Rotate x-axis labels for readability
# Perform ANOVA to test for differences in wing cord length across habitats
anova_result <- aov(wc ~ habitat, data = HMF_Selected_Habitats)
summary(anova_result) # Display ANOVA results
## Df Sum Sq Mean Sq F value Pr(>F)
## habitat 3 94376 31459 79.21 <2e-16 ***
## Residuals 3544 1407516 397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 739 observations deleted due to missingness
# Perform Tukey's HSD test for pairwise comparisons
tukey_result <- TukeyHSD(anova_result)
print(tukey_result) # Display pairwise comparisons
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = wc ~ habitat, data = HMF_Selected_Habitats)
##
## $habitat
## diff lwr upr p adj
## grassland-forest 0.7033503 -8.418183 9.824884 0.9972442
## open woodlands-forest 7.2924000 5.393952 9.190848 0.0000000
## scrub-forest -12.8864046 -16.277770 -9.495040 0.0000000
## open woodlands-grassland 6.5890497 -2.596898 15.774997 0.2530558
## scrub-grassland -13.5897549 -23.195945 -3.983565 0.0015985
## scrub-open woodlands -20.1788046 -23.739790 -16.617819 0.0000000