The dataset I selected focuses on bee observations from the Global Biodiversity Information Facility (GBIF), which compiles biodiversity records from around the world. I chose this dataset because I am interested in the role bees play in ecosystems and agriculture, and because working with a large-scale dataset let me continue the interest piqued by the bee dataset I wanted to use for Project 1. The dataset contains about 600,000 observations, with variables describing both the collection process and the bees themselves. Key variables include the event date and last-edited date, the names of collectors or identifiers and the taxonomic classification (categorical variables), and location data such as latitude and longitude (numerical variables). Most of the data is categorical, with numerical values mainly limited to dates and geographic coordinates. To clean the dataset, I removed variables with little data or significance and filtered out entries with NAs to improve overall data quality.
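The cleaning step described above could look roughly like the sketch below. This is not the code used in the report: the toy data frame `bee_raw`, its values, and the 50% completeness threshold are all assumptions made for illustration, and the column names simply follow common GBIF naming.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the GBIF export; all values are invented.
bee_raw <- data.frame(
  scientificName   = c("Apis mellifera", "Bombus impatiens", NA, "Apis mellifera"),
  eventDate        = c("2019-06-01", "2020-07-15", "2021-05-03", NA),
  decimalLatitude  = c(38.99, 39.10, 38.94, 39.00),
  decimalLongitude = c(-76.94, -76.85, -76.73, -76.90),
  recordNumber     = c(NA, NA, NA, "17")  # mostly empty column
)

bee_clean <- bee_raw |>
  # Keep only columns where at least half of the values are present
  # (the 0.5 cutoff is an arbitrary choice for this sketch).
  select(where(~ mean(!is.na(.x)) >= 0.5)) |>
  # Drop any remaining rows that still contain an NA.
  drop_na()

bee_clean
```

Here `recordNumber` is dropped as a sparse column, and the two rows with a missing species name or date are filtered out, leaving two complete rows.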
Loading the Bee Dataset
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.3
library(tidyverse)
library(plotly)
# Stacked histogram of observations per year, filled by city
p2 <- ggplot(bee3, aes(x = year, fill = locality)) +
  geom_histogram() +
  labs(
    x = "Year",
    y = "Number of Bee Observations",
    title = "Bee Observations by Year and Maryland City",
    fill = "City",
    caption = "Source: GBIF"
  ) +
  theme_minimal(base_size = 14, base_family = "serif") +
  scale_fill_brewer(palette = "Set3")
# Convert the static ggplot into an interactive plotly widget
ggplotly(p2)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
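The message above appears because `geom_histogram()` falls back to 30 bins when no bin size is given. One possible way to address it, assuming `year` holds whole-number years, is to set `binwidth = 1` so that each bin spans a single year. The toy data below is invented for illustration and is not taken from `bee3`.

```r
library(ggplot2)

# Hypothetical observations; cities match those named in the report,
# but the rows themselves are made up.
toy_bees <- data.frame(
  year     = c(2018L, 2018L, 2019L, 2020L, 2020L, 2021L),
  locality = c("Bowie", "Laurel", "Bowie", "College Park", "Laurel", "Bowie")
)

# binwidth = 1 makes each bin one year wide, so no default-bins message is shown.
p_binned <- ggplot(toy_bees, aes(x = year, fill = locality)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Year", y = "Number of Bee Observations", fill = "City")

p_binned
```

Since every observation falls into exactly one bin, the bar heights still add up to the number of rows in the data.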
Plot 3 (Interactive)
# Overlaid density curves of observation year, one per species
p3 <- ggplot(bee3, aes(x = year, fill = scientificName)) +
  geom_density(alpha = 0.5) +
  scale_fill_viridis_d(option = "viridis") +
  labs(
    title = "Prevalence of Species in Maryland",
    x = "Year",
    y = "Density",
    caption = "Source: GBIF"
  ) +
  theme_minimal(base_size = 14, base_family = "serif")
# Convert the static ggplot into an interactive plotly widget
ggplotly(p3)
Warning: Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Prevalence for Scale
# Add a per-species observation count to every row
bee3 <- bee3 |>
  add_count(scientificName, name = "prevalence")
# Check the type of the new column
class(bee3$prevalence)
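To make the `add_count()` step more concrete, here is a minimal sketch on a made-up data frame (`toy_species` is hypothetical, not part of the report's data). It shows that every row keeps its place and gains the total number of rows sharing its species.

```r
library(dplyr)

# Hypothetical three-row example; species values are invented.
toy_species <- data.frame(
  scientificName = c("Apis mellifera", "Bombus impatiens", "Apis mellifera")
)

# add_count() keeps all rows and appends the group size as a new column.
toy_species <- toy_species |>
  add_count(scientificName, name = "prevalence")

toy_species$prevalence         # 2 1 2
class(toy_species$prevalence)  # "integer"
```

Unlike `count()`, which collapses the data to one row per species, `add_count()` leaves the row count unchanged, which is convenient when the count is later used as a per-observation scale.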
My map shows the basic geographical prevalence of bees by species and time across cities in Maryland. The plots before the map didn’t show much notable variation overall. From the first two plots, I did notice that most of the data was collected in Bowie, Laurel, and College Park. This makes me wonder what about these areas supports a higher concentration of bee observations, whether it’s green space, habitat conditions, or simply differences in where people are sampling (maybe college-specific).
One limitation of my analysis was that the dataset didn’t give me much to work with in terms of numerical variables, and there were a lot of NAs, which made deeper exploration difficult. Because of that, my analysis stayed mostly descriptive rather than statistical. If I had a cleaner dataset with more complete values, I would have liked to explore the trends in more detail and look more closely at how environmental factors might be affecting bee distribution over time.