Introduction
For my final assignment, I decided to leverage the Bee Colonies data set from TidyTuesday. This data set is from the USDA which “provides information on honey bee colonies in terms of number of colonies, maximum, lost, percent lost, added, renovated, and percent renovated, as well as colonies lost with Colony Collapse Disorder symptoms with both over and less than five colonies”. I first loaded both CSV files and then added seasons from the dates, cleaned the date column to as.Date format, and filtered NAs. I was specifically curious about the colony loss by state. I wanted to investigate to see if there are any correlations between variables like renovations or season. See my graphs below.
Graph 1: Histogram
My first graph was a simple histogram looking at the colony max for all states over all time. As you can see, it has skewed distribution which leads me to believe it would be normally distributed with log.
colony_max <- colony[,.(avg_colony_max = mean(colony_max)),
by = .(season, year, state)][order(-avg_colony_max)]
histogram1 <- ggplot(data=colony_max, aes(x=avg_colony_max)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 50000,
boundary=0,
show.legend=F, na.rm=TRUE) +
labs(x = "Max Colony Size", y = NULL,
title = "Max Colony Size for All States") +
scale_y_continuous(expand = c(0.01,0.01),
labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(expand = c(0.01,0.01),
breaks = seq(1900, 1810000, 250000), label=comma)
histogram1
Graph 2: Violin Plot
I took the log of the colony max and plotted in a violin plot by year to determine if the max colony size has changed since the data was first being collected. It looks like most of the years have the same distribution except for 2018 which had smaller max colony sizes than other years.
colony_max$ln_avg_colony_max <- log(colony_max$avg_colony_max)
violin_plot1 <- ggplot(colony_max, aes(x=as.factor(year),
y=ln_avg_colony_max)) +
geom_violin(alpha=.7) +
labs( x = "Season", y = "Log of Max Colony Size",
title = "Max Colony Size per Year for All States",
subtitle="Note the log was taken due to skewed distribution.")
violin_plot1
Graph 3: Bar Chart
Next I created a bar chart looking at the top 10 states with the largest aggregate colony over time. As you can see, California is the largest state in terms of aggregate colonies by far.
agg_colony_by_state <- colony[,.(sum_colony_n = sum(colony_n)),
by = .(state)][order(-sum_colony_n)][1:10]
bar_chart_1 <- ggplot(data=agg_colony_by_state,
aes(x= reorder(state, -sum_colony_n), y=sum_colony_n)) +
geom_bar(stat="identity") +
labs(x="State", y= "Sum of Total Colony",
title= "Top 10 States with Largest Aggregate Colonies Over Time") +
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y=element_blank())
bar_chart_1
Graph 4: Bar Chart
Then I wanted to investigate the seasonal colony loss to better understand what season experiences the highest amount of colony loss. As you can see, unfortunately the data set only had full seasonal data for 2020.
avg_colony_loss <- colony[,.(avg_colony_lost_pct = mean(colony_lost_pct)),
by = .(season, year)][order(-year)][1:10]
bar_chart_2 <- ggplot(data=avg_colony_loss,
aes(x= year, y=avg_colony_lost_pct, fill = season)) +
geom_bar(stat="identity") +
labs(x="Year", y= "Avg Colony Lost Percentage", title= "Seasonal Colony Loss")
bar_chart_2
Graph 5: Histogram
Next I wanted to investigate some more positive (meaning happy!) data and see the distribution of colony addition for all states. This distribution is also skewed which lead me to believe a log distribution would be better suited here.
avg_colony_added <- colony[,.(avg_colony_added = mean(colony_added)),
by = .(season, year, state)][order(-avg_colony_added)]
histogram2 <- ggplot(data=avg_colony_added, aes(x=avg_colony_added)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 5000,
boundary=0, show.legend=F, na.rm=TRUE) +
labs(x = "Colony Size", y=NULL, title = "Seasonal Colony Added for All States") +
scale_y_continuous(expand = c(0.01,0.01),
labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(expand = c(0.01,0.01),breaks = seq(0, 250000, 25000))
histogram2
Graph 6: Violin Plot
I took the log of the colony added and ran another violin plot per season for all states. It looks like the most colonies added occur during the spring time.
avg_colony_added$ln_avg_colony_added <- log(avg_colony_added$avg_colony_added)
violin_plot2 <- ggplot(avg_colony_added, aes(x=as.factor(season),
y=ln_avg_colony_added)) +
geom_violin(alpha=.7) +
labs( x = "Season", y = "Log of Average Colony Added",
title = "Average Colony Added per Season from 2015-2021",
subtitle="Note the log was taken due to skewed distribution.")
violin_plot2
Graph 7: Scatterplot
Then I moved onto running a couple scatterplots. I wanted to determine if there was a correlation between average max colony size and average colony lost. There was a linear relationship but it is hard to tell exactly since there are some extreme values (most likely from California).
colony_lost_max <- colony[,.(avg_colony_max = mean(colony_max),
avg_colony_lost = mean(colony_lost)),by = .(season, year, state)]
scatterplot1 <- ggplot(colony_lost_max, aes(avg_colony_max, avg_colony_lost)) +
geom_point(size=2, shape=20) +
labs(x="Avg Max Colony Size", y= "Avg Colony Lost",
title= "Scatterplot Between Max Colony and Colony Lost") +
theme(axis.text.y=element_blank(),
axis.text.x=element_blank()) +
geom_smooth(method=lm)
scatterplot1
Graph 8: Scatterplot
The second scatterplot I ran was investigating the relationship between average colony lost percentage and the average colony renovated percentage. I wanted to determine if there was a negative relationship between loss and renovation meaning colony loss decreases as colonies are renovated. Based on the observations, there is a slight negative association.
colony_lost_reno <- colony[,.(avg_colony_lost_pct = mean(colony_lost_pct),
avg_colony_reno_pct = mean(colony_reno_pct)),
by = .(season, year, state)]
scatterplot2 <- ggplot(colony_lost_reno, aes(x=avg_colony_lost_pct,
y=avg_colony_reno_pct)) +
geom_point(size=2, shape=20) +
labs(x="Avg Colony Lost Percentage", y= "Avg Colony Renovated Percentage",
title= "Average Colony Lost Percentage and Average Colony Renovated Percentage")+
theme(axis.text.y=element_blank(),
axis.text.x=element_blank()) +
geom_smooth(method=lm)
scatterplot2
Graph 9: Map
Next I moved onto maps. I imported lat/long data from an external datasource via GitHub and merged it with my colony data table. Since I knew 2020 was the year of complete seasonal loss observations, I used this year to determine which state had the highest loss percentages per season. As you can see from the 3 charts, it looks like the Southwest suffers from the highest colony loss during the summer while the South suffers from the highest colony loss during the Spring.
colony_lost_2020 <- colony[year==2020,by = .(season, colony_lost_pct, state)]
usa <- map_data("state")
us_state_lat_long <- as.data.frame(read_csv("https://raw.githubusercontent.com/jasperdebie/VisInfo/master/us-state-capitals.csv"))
us_state_lat_long$state <- us_state_lat_long$name
colony_lost_2020 <- merge(colony_lost_2020, us_state_lat_long, by="state")
map <- ggplot() +
geom_map(
data = usa, map = usa,
aes(long, lat, map_id = region)) +
geom_point(
data = colony_lost_2020,
aes(longitude, latitude, size = colony_lost_pct),
alpha= 0.5) +
theme_bw()+
theme(legend.position= "top",
legend.direction = "horizontal",
legend.text = element_text(size = 6)) +
labs(title="Colony Lost Percentage per Season in 2020") +
facet_wrap(. ~ season, ncol = 2) +
theme(axis.title=element_blank(),
axis.text = element_blank())
map
Graph 10: Animation
Finally, I created my animated bar chart looking into the stressors for 2020 per state. I used the season as fill to easily tell which stressor contributes to the highest percentage of total stress. As you can see, most states have Varroa mites at the highest stressor for their colony loss.
stressors_state_2020 <- stressor[year==2020,.(avg_stress_pct = mean(stress_pct)),
by = .(season, state, stressor)]
animation <- ggplot(stressors_state_2020, aes(stressor, avg_stress_pct,
fill=season)) +
geom_bar(position="dodge", stat="identity") +
theme_bw() +
labs( x = "Stressor", y = "Percentage",
title='2020 Colony Stressors by Season: {closest_state}') +
transition_states(state) +
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1))
animation
Conclusion
In summary based on the observations, bee colony loss occurs most often in summer. The most important stressor is the Varroa mite (see more information here about this external parasitic mite). California is home to the largest colonies historically and the most new colonies are added in the spring. There is a slightly negative association between average colony lost percentage and average colony renovation percentage meaning the higher a colony’s renovation rate, the lower the lost percentage.