Data 110 Project 2

Introduction to the Dataset

The dataset I selected focuses on bee observations from the Global Biodiversity Information Facility (GBIF), which compiles biodiversity records from around the world. I chose this dataset because I am interested in the role bees play in ecosystems and agriculture, and because I wanted to analyze a large-scale dataset to continue the interest piqued by the bee dataset I wanted to use for Project 1. The dataset contains about 600,000 observations across 50 variables, describing both the collection process and the bees themselves. Key variables include the event date and last-edited date, the names of collectors or identifiers, taxonomic classification (categorical variables), and location data such as latitude and longitude (numerical variables). Most of the data is categorical, with numerical values mainly limited to dates and geographic coordinates. To clean the dataset, I removed variables with little data or significance and filtered out rows with NAs in the locality, state or province, and month fields.

Loading the Bee Dataset

library(leaflet)

Warning: package 'leaflet' was built under R version 4.5.3

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

setwd("C:/Users/Administrator/OneDrive - montgomerycollege.edu/DATA 110")
bee_data <- read_csv("gbif_bee_data.csv")

Rows: 602665 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (26): datasetKey, occurrenceID, kingdom, phylum, class, order, family, ...
dbl  (10): gbifID, individualCount, decimalLatitude, decimalLongitude, coord...
lgl  (12): infraspecificEpithet, verbatimScientificNameAuthorship, coordinat...
dttm  (2): dateIdentified, lastInterpreted

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering NAs and Narrowing Variables

bee1 <- bee_data |>
  filter(!is.na(locality) & !is.na(stateProvince) & !is.na(month)) |>
  select(
    -elevation, -verbatimScientificName, -verbatimScientificNameAuthorship,
    -infraspecificEpithet, -occurrenceStatus, -individualCount,
    -publishingOrgKey, -coordinateUncertaintyInMeters, -depth, -depthAccuracy,
    -typeStatus, -establishmentMeans, -mediaType
  )
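
One way to decide which variables carry too little data to keep is to look at the share of missing values in each column. The sketch below runs on a small made-up tibble (`toy` and all of its values are hypothetical, not taken from the GBIF file); the same call could be pointed at `bee_data`.

```r
library(tibble)

# Hypothetical stand-in for bee_data; values are invented for illustration
toy <- tibble(
  locality  = c("Bowie", NA, "Laurel", "Towson"),
  elevation = c(NA, NA, NA, 12),
  month     = c(5, 6, NA, 7)
)

# is.na() on a data frame returns a logical matrix, so colMeans() gives
# the fraction of missing values per column; sort from most to least missing
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_share
```

Columns near the top of this list are candidates for removal, while columns near the bottom are cheap to filter on.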

Filtering Down to 668 Observations by Maryland Locality

# unique(bee1$locality)  # data were collected from about 1,000 localities

bee2 <- bee1 |>
  filter(stateProvince %in% c("Maryland"))

bee3 <- bee2 |>
  filter(locality %in% c("Fredrick", "Rockville", "Burtonsville", "District of Columbia", "Towson", "Ocean City", "Bowie", "Laurel", "College Park"))
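
Because `%in%` matches strings exactly, a locality that is spelled differently in the data than in the filter is dropped without any warning. A quick check, sketched here on made-up data (`toy` and `wanted` are hypothetical, and how the GBIF records actually spell each city is not something this report shows), is to compare the requested names with the values present.

```r
library(dplyr)
library(tibble)

# Hypothetical localities; not drawn from the GBIF file
toy <- tibble(locality = c("Bowie", "Bowie", "Laurel", "Frederick"))
wanted <- c("Fredrick", "Bowie", "Laurel")

# Requested names that never appear in the data
unmatched <- setdiff(wanted, unique(toy$locality))
unmatched

# Rows kept per locality after the exact-match filter
kept <- toy |>
  filter(locality %in% wanted) |>
  count(locality, sort = TRUE)
kept
```

Any name returned in `unmatched` contributes no rows to the filtered data, so it is worth checking against `unique()` of the real column before relying on the counts.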

Plot 1 (Interactive)

p1 <- ggplot(bee3, aes(x = year, y = scientificName, color = locality)) +
  geom_jitter(size = 4, alpha = 0.6) +
  labs(
    x = "Year",
    y = "Scientific Name",
    title = "Scientific Name of Bees Collected (by MD City)",
    caption = "Source: GBIF"
  ) +
  theme_minimal(base_size = 14, base_family = "serif") +
  scale_color_brewer(name = "Maryland City", palette = "Set2")
ggplotly(p1)

Plot 2 (Interactive)

p2 <- ggplot(bee3, aes(x = year, fill = locality)) +
  geom_histogram() +
  labs(
    x = "Year",
    y = "Number of Bee Observations",
    title = "Bee Observations by Year and Maryland City",
    fill = "City",
    caption = "Source: GBIF"
  ) +
  theme_minimal(base_size = 14, base_family = "serif") +
  scale_fill_brewer(palette = "Set3")
ggplotly(p2)

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
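
The message above appears because `geom_histogram()` falls back to 30 bins when no bin setting is given. Since `year` holds whole numbers, one option is to draw one bin per year with `binwidth = 1`. This is a minimal sketch on invented data (`toy` is hypothetical), not a change to the plot above.

```r
library(ggplot2)

# Hypothetical observations; not drawn from the GBIF file
toy <- data.frame(
  year     = c(2001, 2001, 2002, 2004, 2004, 2004),
  locality = c("Bowie", "Laurel", "Bowie", "Bowie", "Laurel", "Laurel")
)

# binwidth = 1 gives each year its own bar and silences the stat_bin() message
p <- ggplot(toy, aes(x = year, fill = locality)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Year", y = "Number of Bee Observations", fill = "City")

# Inspect the counts ggplot2 computed for each bin and fill group
bins <- layer_data(p)
sum(bins$count)
```

With one bin per year, the height of each stacked bar can be read directly as the number of observations recorded that year.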

Plot 3 (Interactive)

p3 <- ggplot(bee3, aes(x = year, fill = scientificName )) +
  geom_density(alpha = 0.5) +
  scale_fill_viridis_d(option = "viridis") +
  labs(
    title = "Prevalence of Species in Maryland",
    x = "Year",
    y = "Density",
    caption = "Source: GBIF"
  ) +
  theme_minimal(base_size = 14, base_family = "serif")

ggplotly(p3)

Warning: Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.

Prevalence for Scale

bee3 <- bee3 |>
  add_count(scientificName, name = "prevalence")
class(bee3$prevalence)

[1] "integer"
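
`add_count()` attaches a group size to every row without collapsing the data, which is why each observation can carry its species' total. The sketch below illustrates this on three invented rows (`toy` is hypothetical) and compares it with the longer `group_by()` plus `mutate()` form.

```r
library(dplyr)
library(tibble)

# Hypothetical rows; not drawn from the GBIF file
toy <- tibble(
  scientificName = c("Apis mellifera", "Apis mellifera", "Bombus impatiens")
)

# Every row keeps its place and gains the count for its species
with_count <- toy |>
  add_count(scientificName, name = "prevalence")

# Equivalent long form: group, count rows per group, then ungroup
long_form <- toy |>
  group_by(scientificName) |>
  mutate(prevalence = n()) |>
  ungroup()

with_count
```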

Map Plot

leaflet(data = bee3) |>
  setView(lng = -76.6413, lat = 39.0458, zoom = 8) |>
  addProviderTiles(providers$Esri.NatGeoWorldMap) |>
  addCircleMarkers(
    lng = ~decimalLongitude,
    lat = ~decimalLatitude,
    radius =  ~log(prevalence)*2,
    fillOpacity = 0.5,
    fillColor = "yellow",
    color = "black"
  )
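
The marker radius is `log(prevalence) * 2`, where `log()` is the natural logarithm. A short worked example shows how this behaves at the low end: a species observed only once has `log(1) = 0`, so its marker gets a radius of 0. The sketch below compares this with `log1p()`, which computes `log(1 + x)`; that variant is only one possible alternative and is not what the map above uses. The `prevalence` values are invented.

```r
# Hypothetical species counts
prevalence <- c(1L, 2L, 10L, 100L)

# Scaling used in the map: zero radius when prevalence is 1
radius_log <- log(prevalence) * 2

# Assumed alternative: log(1 + x) stays positive for a single observation
radius_log1p <- log1p(prevalence) * 2

round(data.frame(prevalence, radius_log, radius_log1p), 2)
```

Either way the log scale compresses large counts, so a species with 100 records is drawn only a few times larger than one with 10.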

Transforming Month into Names

# Assign the result back to bee3 so the pop-ups below show month names
bee3 <- bee3 |>
  mutate(month = as.numeric(as.character(month))) |>
  mutate(month = c("January", "February", "March", "April",
                   "May", "June", "July", "August",
                   "September", "October", "November", "December")[month])
bee3

# A tibble: 668 × 38
       gbifID datasetKey    occurrenceID kingdom phylum class order family genus
        <dbl> <chr>         <chr>        <chr>   <chr>  <chr> <chr> <chr>  <chr>
 1 4023708322 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 2 1456502036 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 3 4023710352 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 4 4023710462 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 5 4023710466 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 6 4023710467 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 7 4023708473 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 8 4023710497 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
 9 4023708518 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
10 4023708516 f519367d-6b9… https://www… Animal… Arthr… Inse… <NA>  <NA>   <NA> 
# ℹ 658 more rows
# ℹ 29 more variables: species <chr>, taxonRank <chr>, scientificName <chr>,
#   countryCode <chr>, locality <chr>, stateProvince <chr>,
#   decimalLatitude <dbl>, decimalLongitude <dbl>, coordinatePrecision <dbl>,
#   elevationAccuracy <lgl>, eventDate <chr>, day <dbl>, month <chr>,
#   year <dbl>, taxonKey <dbl>, speciesKey <dbl>, basisOfRecord <chr>,
#   institutionCode <chr>, collectionCode <chr>, catalogNumber <chr>, …
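
Base R ships the constant `month.name`, which holds the twelve English month names in calendar order, so the lookup vector does not have to be typed out by hand. This is a small sketch on made-up values (`months_raw` is hypothetical):

```r
# Hypothetical month numbers stored as text
months_raw <- c("1", "7", "12")

# Index the built-in vector of month names by month number
month_labels <- month.name[as.integer(months_raw)]
month_labels

# Optional: a factor with calendar-ordered levels keeps plots from
# sorting the months alphabetically
month_factor <- factor(month_labels, levels = month.name)
```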

Map Including Mouse Click Pop-Up

popupbee <- paste0(
  "<b>Bee Observations In Maryland Cities: </b>", "<br>",
  "<b>Year: </b>", bee3$year, "<br>",
  "<b>Month: </b>", bee3$month, "<br>",
  "<b>Scientific Name: </b>", bee3$scientificName, "<br>",
  "<b>City: </b>", bee3$locality, "<br>",
  "<b>Lat: </b>", bee3$decimalLatitude, "<br>",
  "<b>Long: </b>", bee3$decimalLongitude, "<br>"
)

leaflet(data = bee3) |>
  setView(lng = -76.6413, lat = 39.0458, zoom = 8) |>
  addProviderTiles(providers$Esri.NatGeoWorldMap) |>
  addCircleMarkers(
    lng = ~decimalLongitude,
    lat = ~decimalLatitude,
    radius =  ~log(prevalence)*2,
    fillOpacity = 0.7,
    fillColor = "yellow",
    color = "black",
    popup = popupbee,
    clusterOptions = markerClusterOptions())

Closing Essay

My map shows the basic geographical prevalence of bees across cities in Maryland, with pop-ups giving the species and date of each observation. The plots before the map didn’t show much exciting variation overall. From the first two plots, I did notice that most of the data was collected in Bowie, Laurel, and College Park. This makes me wonder whether something about the environment in these areas supports a higher concentration of bees, such as green space or habitat conditions, or whether the pattern mostly reflects where people are sampling (possibly near colleges).

One limitation of my analysis was that the dataset didn’t give me much to work with in terms of numerical variables, and there were a lot of NAs, which made deeper exploration difficult. Because of that, my analysis stayed mostly descriptive rather than more in-depth or statistical. If I had a cleaner dataset with more complete values, I would have liked to explore the trends in more detail and look more closely at how environmental factors might be affecting bee distribution over time.