Project_2

Author

Dekker Spielman

The dataset I am using comes from the North American Bird Banding Program (NABBP). It is a huge dataset, with records going back as far as 1960. In general, it tracks individual birds over time, along with many details about each bird. I chose it because it is collected through citizen reporting, which means many people have contributed to it and almost anyone can help add to it. The full NABBP dataset is far too large for this project, so I am using a smaller subset that tracks only Red-Winged Blackbirds. Some of the variables in this CSV are: record ID, band ID, date of the encounter, species, country, latitude/longitude, age, sex, and status, plus other, less important information that I’ll cut out.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(RColorBrewer)
setwd("~/aaaworkingdirectory")
blackbirds <- read_csv("NABBP_2023_grp_48.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 845271 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): EVENT_TYPE, BAND, ORIGINAL_BAND, EVENT_DATE, EVENT_DAY, EVENT_MONT...
dbl (12): RECORD_ID, EVENT_YEAR, SPECIES_ID, LAT_DD, LON_DD, COORD_PREC, AGE...
lgl  (1): OTHER_BANDS

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here I take the full CSV and cut out all the data I don’t plan on using. First, I get rid of every entry that isn’t in the US or Canada, along with entries whose month code is above 12 or whose day code is above 31. Then, I drop every variable except for the date and coordinate information. After that, I make a simple bar plot to see how many entries there are each year. My assumption when making the plot was that the number of entries would go up every year, not because the bird population is going up, but because the rise of the internet would make reporting bird encounters easier.

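sleekBirds <- blackbirds |>
  # Keep only encounters reported in the US or Canada
  filter(ISO_COUNTRY %in% c("US", "CA")) |>
  # Drop rows whose month or day code is outside the calendar range
  filter(as.numeric(EVENT_MONTH) < 13) |>
  filter(as.numeric(EVENT_DAY) < 32) |>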
  mutate(month = recode(EVENT_MONTH,
                        "01" = "Jan",
                        "02" = "Feb",
                        "03" = "Mar",
                        "04" = "Apr",
                        "05" = "May",
                        "06" = "Jun",
                        "07" = "Jul",
                        "08" = "Aug",
                        "09" = "Sep",
                        "10" = "Oct",
                        "11" = "Nov",
                        "12" = "Dec")) |>
  select(EVENT_DATE, EVENT_DAY, EVENT_MONTH, month, EVENT_YEAR, LAT_DD, LON_DD) |>
  rename(lat = LAT_DD, long = LON_DD)
head(sleekBirds)

# A tibble: 6 × 7
  EVENT_DATE EVENT_DAY EVENT_MONTH month EVENT_YEAR   lat  long
  <chr>      <chr>     <chr>       <chr>      <dbl> <dbl> <dbl>
1 06/21/1960 21        06          Jun         1960  53.4 -113.
2 06/21/1960 21        06          Jun         1960  53.4 -113.
3 06/14/1962 14        06          Jun         1962  53.4 -111.
4 06/14/1962 14        06          Jun         1962  53.4 -111.
5 06/14/1962 14        06          Jun         1962  53.4 -111.
6 06/14/1962 14        06          Jun         1962  53.4 -111.

encounterPlot <- sleekBirds |>
  ggplot(aes(EVENT_YEAR)) +
  geom_bar() +
  labs(x = "Year",
       y = "Number of Reports",
       title = "Number of Bird Reports by Year")
encounterPlot

That result was unexpected. Instead of rising steadily, the number of reports spikes near the end of the sixties and then stabilizes for the next forty years. While researching for this project, I learned from the U.S. Fish and Wildlife Service how climate change has been affecting the habitats of migratory birds: when the birds travel, they need to go farther north to reach a habitat that suits them (Source 1). I want to see whether the bird sighting data supports this. Next, I fit a linear regression to see whether latitude goes up over time on average.

linearBird <- lm(lat ~ EVENT_YEAR, data = sleekBirds)
summary(linearBird)


Call:
lm(formula = lat ~ EVENT_YEAR, data = sleekBirds)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.0881  -1.1640   0.6006   2.1314  24.8629 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.939e+01  5.931e-01  -49.55   <2e-16 ***
EVENT_YEAR   3.554e-02  2.998e-04  118.57   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.698 on 843664 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.01639,   Adjusted R-squared:  0.01639 
F-statistic: 1.406e+04 on 1 and 843664 DF,  p-value: < 2.2e-16
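
To put the fitted slope in concrete terms, here is a minimal sketch that plugs two years into the fitted line. The coefficients are copied, rounded, from the regression output above; `predict_lat` is a hypothetical helper name used only for illustration, and this is just arithmetic on the reported estimates, not a new model.

```r
# Coefficients copied (rounded) from the summary(linearBird) output above.
intercept <- -29.39
slope <- 0.03554

# Hypothetical helper: latitude predicted by the fitted line for a given year.
predict_lat <- function(year) intercept + slope * year

lat_1960 <- predict_lat(1960)
lat_2022 <- predict_lat(2022)

# Change in predicted latitude between the two years (slope * 62).
shift <- lat_2022 - lat_1960

round(c(lat_1960 = lat_1960, lat_2022 = lat_2022, shift = shift), 2)
```

With these rounded coefficients, the fitted line moves north by roughly 2.2 degrees of latitude between 1960 and 2022.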

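As you can see, the slope is positive, and the p-value (< 2e-16) shows that it is statistically significant. The fitted line is y = 0.03554x - 29.39, where x is the year and y is the latitude in degrees. The R-squared of 0.016 means that year explains under 2% of the variation in latitude, so the trend is small compared with the spread of the data. Now, I’m going to do the final visualization.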

birdSummary <- sleekBirds |>
  # EVENT_YEAR is numeric, so compare against numbers rather than strings
  filter(EVENT_YEAR %in% c(1960, 2022)) |>
  arrange(EVENT_YEAR, EVENT_MONTH) |>
  summarize(.by = c(EVENT_YEAR, month), Latitude = mean(lat))
birdSummary

# A tibble: 24 × 3
   EVENT_YEAR month Latitude
        <dbl> <chr>    <dbl>
 1       1960 Jan       33.2
 2       1960 Feb       32.2
 3       1960 Mar       37.0
 4       1960 Apr       41.1
 5       1960 May       41.4
 6       1960 Jun       41.1
 7       1960 Jul       41.2
 8       1960 Aug       41.5
 9       1960 Sep       40.5
10       1960 Oct       39.4
# ℹ 14 more rows

#This is a fix I found that makes sure that the months don't get reordered to alphabetical when plotted (Source 2)
birdSummary$month <- as.character(birdSummary$month)
birdSummary$month <- factor(birdSummary$month, levels=unique(birdSummary$month))

aveLat <- birdSummary |>
  ggplot(aes(month, Latitude)) +
  geom_point(aes(color = as.character(EVENT_YEAR))) +
  geom_line(color = "pink") +
  scale_color_brewer(palette = "Set1") +
  theme_dark() +
  labs(x = "Month",
       y = "Average Latitude",
       color = "Year",
       caption = "NABBP",
       title = "Average Latitude of Red-Winged Blackbird Sightings 1960 vs 2022")
ggplotly(aveLat, tooltip = c("Latitude"))
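
The month-ordering fix used before the plot can be seen in isolation in the small sketch below. The vector names are hypothetical and the data is made up for illustration; `month.abb` is a constant built into base R that holds the twelve abbreviated month names in calendar order.

```r
# Made-up month labels, already in calendar order (as after arrange()).
months_in_data_order <- c("Jan", "Feb", "Mar", "Apr")

# Default factor(): levels are sorted alphabetically.
alphabetical <- factor(months_in_data_order)
levels(alphabetical)

# Levels taken from order of first appearance, as in the fix from Source 2.
in_order <- factor(months_in_data_order, levels = unique(months_in_data_order))
levels(in_order)

# Alternative: fix the levels to the calendar regardless of row order.
by_calendar <- factor(c("Mar", "Jan"), levels = month.abb)
levels(by_calendar)
```

The `unique()` approach depends on the rows already being sorted by month, which the `arrange()` call provides here. Using `month.abb` as the levels would not depend on row order.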

The final visualization helps show how Red-Winged Blackbirds have, over time, been spotted farther and farther north. I like the way it turned out, even though it’s not really what I was going for when I started. Because of all the GIS data included in the database, I wanted to incorporate it into a map. Unfortunately, due to the huge number of sightings, the widget had trouble loading and would crash RStudio whenever I tried to draw a circle for every sighting. I wasn’t happy with putting the GIS information on a map if I averaged it the way I did here, so I ended up with the plot shown here.
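
One common way to keep a map widget responsive is to count sightings per grid cell and draw one marker per cell instead of one per sighting. The sketch below is not something this project tried; it only illustrates the idea on a made-up stand-in table. The `sightings` data, the 1-degree cell size, and the column names `lat_bin`, `long_bin`, and `n_sightings` are all assumptions chosen for illustration.

```r
library(dplyr)

# Made-up stand-in for the lat/long columns of sleekBirds.
sightings <- tibble(
  lat  = c(53.4, 53.4, 53.6, 41.2, 41.9),
  long = c(-113.2, -113.4, -113.9, -87.6, -87.1)
)

# Snap each sighting to a 1-degree grid cell, then count sightings per cell.
binned <- sightings |>
  mutate(lat_bin = floor(lat), long_bin = floor(long)) |>
  count(lat_bin, long_bin, name = "n_sightings")

binned
```

Each row of `binned` is one grid cell, so a map would need one marker per occupied cell, sized or colored by `n_sightings`, rather than one per record. Unlike the monthly averages, this keeps the geographic spread of the sightings, at the cost of precision within each cell.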

Bibliography

Source 1: https://www.fws.gov/story/2023-06/what-migrating-birds-can-teach-us-about-managing-climate-change

Source 2: https://stackoverflow.com/questions/12774210/how-do-you-specifically-order-ggplot2-x-axis-instead-of-alphabetical-order