A few years ago, I read a book (The Beekeeper’s Lament: How One Man and Half a Billion Honey Bees Help Feed America by Hannah Nordhaus) about industrial beekeeping in the United States.
Beekeeping is a challenging industry and I was taken aback by how beekeepers are able to handle issues affecting their colonies (e.g., climate change, colony collapse disorder, varroa mites, pesticide use, etc.). These issues are even more pronounced in winter and are seen across honeybee colonies in the Northern Hemisphere (Dainat et al., 2012). I also recently visited the 21 Degrees Estate on Oahu, a cacao farm where honey is produced on a small scale alongside several types of fruit trees and tropical flowers. I saw the beehives up close, met a few bees, learned about the beekeeping process in the context of a small family farm, and tasted different types of honey. I don’t have the resources to maintain my own bee colonies, but I do have the ability to analyze data about them.
The source of my dataset is Bee Colony Statistics from data.world. The dataset is copyrighted by Bee Informed Partnership (BIP), a non-profit organization which was initially funded by a USDA grant and currently works with Bee Labs at various universities. Each year, BIP collects data from beekeepers via an online Loss and Management Survey to assess honey bee health in the US. Survey data from April 1, 2010 to March 31, 2021 is captured in this dataset.
There are 8 variables in this dataset:
Year (categorical): Each year in this dataset begins on April 1st. For instance, data for 2010/11 was collected between 4/1/2010 and 3/31/2011.
Season (categorical): The source allows the viewer to select winter, summer, or annual data. All the data in this dataset is annual since it encompasses all seasons.
State (categorical): Includes all 50 states as well as the District of Columbia, Puerto Rico, Guam, “Other,” and “MultiStateOperation”
Total.Annual.Loss (numerical, continuous): A percentage calculated based on the number of colonies lost to the number of colonies managed over the specified period of time.
Beekeepers (numerical, discrete): The number of beekeepers with colonies in the state.
Beekeepers.Exclusive.To.State (numerical, continuous): The percentage of beekeepers who do not keep colonies outside of the selected state.
Colonies (numerical, discrete): The number of colonies in the state.
Colonies.Exclusive.to.State (numerical, continuous): The percentage of colonies whose beekeepers do not keep other bees outside of the selected state.
There is also a separate column for observation number.
I will tidy my data by ensuring that the variable names and data are formatted correctly, that the variable type is correct, and removing irrelevant columns that will not be analyzed.
If >10 beekeepers from a particular state responded to the survey, their losses were reported as N/A in this dataset to protect their privacy. However, their losses were included in the national statistics.
I used the 9 climate regions to group states within the contiguous US. These regions were identified as being climatically consistent by scientists at the National Centers for Environmental Information. I chose this grouping method because climate change has been identified as a variable affecting colony populations. Since non-contiguous states and territories are not included in these 9 regions, they have been filtered out.
library(tidyverse)
library(RColorBrewer)
library(plotly)
library(tidyr)
library(leaflet)
library(leaflegend)
setwd("C:/Users/jedi_/Documents/Academic/MC/Datasets")
bees <- read_csv("bee_colony_BIP21.csv", col_names = TRUE)
names(bees) <- tolower(names(bees))
names(bees) <- gsub("\\.","_",names(bees))
head(bees)
## # A tibble: 6 × 9
## `___1` year season state total_annual_loss beekeepers beekeepers_exclusive…¹
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 2020/… Annual Alas… NA NA NA
## 2 2 2020/… Annual Dist… NA NA NA
## 3 3 2020/… Annual Guam NA NA NA
## 4 4 2020/… Annual Hawa… NA NA NA
## 5 5 2020/… Annual Neva… NA NA NA
## 6 6 2020/… Annual Puer… NA NA NA
## # ℹ abbreviated name: ¹beekeepers_exclusive_to_state
## # ℹ 2 more variables: colonies <dbl>, colonies_exclusive_to_state <dbl>
bees2 <- bees |>
select(-c(`___1`, season))
sapply(bees2, class)
## year state
## "character" "character"
## total_annual_loss beekeepers
## "numeric" "numeric"
## beekeepers_exclusive_to_state colonies
## "numeric" "numeric"
## colonies_exclusive_to_state
## "numeric"
All variable types are correct.
bees3 <- bees2 |>
filter(!state %in% c("Alaska", "Guam", "Hawaii", "MultiStateOperation", "Other", "Puerto Rico"))
#Filter out non-contiguous states/territories and other values that don't correspond to a defined climate region
bees_region <- bees3 |>
mutate(region = case_when(state %in% c("California", "Nevada") ~ "West",
state %in% c("Oregon", "Washington", "Idaho") ~ "Northwest",
state %in% c("Montana", "Nebraska", "North Dakota", "South Dakota", "Wyoming") ~ "Northern Rockies and Plains",
state %in% c("Arizona", "Utah", "Colorado", "New Mexico") ~ "Southwest",
state %in% c("Texas", "Oklahoma", "Kansas", "Louisiana", "Arkansas", "Mississippi") ~ "South",
state %in% c("Alabama", "Florida", "Georgia", "North Carolina", "South Carolina", "Virginia") ~ "Southeast",
state %in% c("Illinois", "Indiana", "Kentucky", "Missouri", "Ohio", "Tennessee", "West Virginia") ~ "Ohio Valley",
state %in% c("Iowa", "Michigan", "Minnesota", "Wisconsin") ~ "Upper Midwest",
state %in% c("Connecticut", "Delaware", "Maine", "Maryland", "Massachusetts", "New Hampshire", "New Jersey", "New York", "Pennsylvania", "Rhode Island", "Vermont", "District of Columbia") ~ "Northeast"))
#Add a column for climate region
head(bees_region)
## # A tibble: 6 × 8
## year state total_annual_loss beekeepers beekeepers_exclusive…¹ colonies
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2020/21 District… NA NA NA NA
## 2 2020/21 Nevada NA NA NA NA
## 3 2020/21 South Da… NA NA NA NA
## 4 2020/21 Delaware 56.5 11 90.9 1229
## 5 2020/21 Kansas 64.6 11 90.9 513
## 6 2020/21 Georgia 52.5 110 93.6 3863
## # ℹ abbreviated name: ¹beekeepers_exclusive_to_state
## # ℹ 2 more variables: colonies_exclusive_to_state <dbl>, region <chr>
bees_region |>
ggplot(aes(x = year, y = total_annual_loss, color = region)) +
geom_point()
#Create a scatterplot of total annual loss over the years by region
bees_region |>
ggplot(aes(x = year, y = total_annual_loss, fill = region)) +
geom_col() #Create a bar graph of total annual loss over the years by region
bees2010 <- bees_region |>
filter(year == "2010/11")
bees2010 |>
ggplot() +
geom_col(aes(x = region, y = total_annual_loss, fill = region))
#Create a bar graph of total annual loss by region for 2010/2011
bees2010 |>
ggplot(aes(x = beekeepers_exclusive_to_state, y = total_annual_loss, color = region)) +
geom_point()
#Create a scatterplot of total annual loss and beekeepers by region for 2010/2011
library(DataExplorer)
plot_correlation(bees3)
#Explore correlations among all variables
There do not appear to be strong correlations between any of the variables included in this correlation plot.
bees2010 |>
ggplot(aes(x = beekeepers, y = total_annual_loss, color = region)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x, se = FALSE) +
labs(title = "Beekeepers vs. Annual Colony Loss in 2010",
caption = "Source: Bee Informed Partnership/USDA",
color = "Climate Region") +
xlab("Number of Beekeepers in State/Region") +
ylab("Annual Colony Loss (%)") +
theme_light()
#Create a scatterplot of annual loss and beekeepers by region for 2010/2011 with linear regression
This linear regression analysis shows that the relationship between number of beekeepers and annual colony loss are quite different across climate regions. There is a strong positive relationship in the Upper Midwest, Northeast, and Southwest regions, while there is a strong negative relationship in the Ohio Valley, South, and Northern Rockies and Plains states. There is a slight negative relationship in the Southeast and a slight positive relationship in the Southwest. The linear regression model does not appear to fit data collected for the West. The differences seen across the regions here might be due to differences in local climate/ecological events that affect bee populations in varied ways.
bees2010grp <- bees2010 |>
group_by(region, beekeepers) |>
summarise(loss = sum(total_annual_loss, na.rm = TRUE))
bp <- bees2010grp |>
ggplot(text = paste("Region:", region)) +
geom_line(aes(x = beekeepers, y = loss)) +
geom_area(aes(x = beekeepers, y = loss, fill = region), colour = "white") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Bee Colony Loss in 2010",
caption = "Source: Bee Informed Partnership/USDA",
x = "# of Beekeepers in State",
y = "Annual Colony Loss (%)",
color = "Climate Region") +
theme_minimal(base_size = 11) +
theme(legend.position = c(0.16, 0.7)) +
theme(plot.title = element_text(hjust = 0.5))
bp <- ggplotly(bp) |> layout(margin = list(l = 50, r = 50, b = 100, t = 50),
annotations = list(x = 1, y = -0.3, text = "Source: Bee Informed Partnership",
xref='paper', yref='paper', showarrow = F,
xanchor='right', yanchor='auto', xshift=0, yshift=0,
font = list(size = 12)))
bp
#Use plotly to create an interactive plot of beekeepers vs. colony loss in 2010
This visualization shows the relationship between the number of beekeepers in a given state and the annual colony loss by region for the 2010-2011 period. There does not appear to be an overarching single, strong relationship across all regions, but the ways in which regions vary is interesting. Southeastern states appear to have the highest number of beekeepers but their colony loss is lower than the Northeast, where there are fewer beekeepers per state.
I would have liked to include years on the x-axis to compare colony losses over the years by region, but had difficulty with using lubridate in data cleaning. Most of the relationships I explored in this project seem inconclusive and I think the results would have been stronger and more meaningful if I had been able to include time series data, since most of the existing research on colony loss is driven by observations of changes over time. I spent a long time figuring out how to group states by their climate regions which took time away from other aspects of my work.
Dainat B, Evans JD, Chen YP, Gauthier L, Neumann P (2012) Predictive Markers of Honey Bee Colony Collapse. PLoS ONE 7(2): e32151. doi:10.1371/journal.pone.0032151
Nordhaus, H. (2011). The beekeeper’s lament: How one man and half a billion honey bees help feed america. Harpercollins.