# Insert image
knitr::include_graphics("sf-beesemen.jpg")
- taken by Marisa Lubeck,
USGS
Bees, in general, have been slowly going extinct due to a variety of issues, such as pesticides, diseases, and invasive mites. This is a major problem that should be more frequently addressed since bees are one of Earth’s largest pollinators, allowing our fruits and flowers to grow. Without them, agriculture would falter and collapse, leaving us with very little in regards to economical and food stock. In order to do a study on the factors affecting bee populations, I have chosen to use a dataset named “Save the Bees”. All data from this dataset is directly collected from surveys collected by the USDA (United States Department of Agriculture), now publicly available on the USDA website, but has been cleaned and neatened by user Shane Simon on Kaggle (https://www.kaggle.com/datasets/m000sey/save-the-honey-bees). It is a full analysis of the population of honeybees between 2015 and 2022, with 17 different variables.
The utilized variables are:
percent_lost
, the percentage of honey bee colonies lost
within a certain quarter of the yearlost_colonies
, the actual number of lost honey bee
colonies within a quarter of a yearquarter
, the quarter of the year ranging from Q1 to
Q4percent_renovated
, the percent of honey bee colonies
renovated by human contact within the quarterpesticides
, the percent of honey bee colonies affected
by pesticidesvarroa_mites
, the percent of honey bee colonies
affected by varroa mitesdiseases
, the percent of honey bee colonies affected by
certain diseases# Calling the packages
library(ggplot2) # Graphing
library(tidyverse) # Commands
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot) # Correlation plot
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
library(dplyr) # Commands
library(plotly) # Interactive Correlation plot
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(reshape2) # Correlation plot - Longform
## Warning: package 'reshape2' was built under R version 4.4.3
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggthemes) # Extra themes
## Warning: package 'ggthemes' was built under R version 4.4.3
# Calling the data set
bees <- read.csv("save_the_bees.csv")
head(bees)
## state state_code num_colonies max_colonies lost_colonies percent_lost
## 1 Alabama AL 7000 7000 1800 26
## 2 Arizona AZ 35000 35000 4600 13
## 3 Arkansas AR 13000 14000 1500 11
## 4 California CA 1440000 1690000 255000 15
## 5 Colorado CO 3500 12500 1500 12
## 6 Connecticut CT 3900 3900 870 22
## added_colonies renovated_colonies percent_renovated quarter year varroa_mites
## 1 2800 250 4 1 2015 10.0
## 2 3400 2100 6 1 2015 26.9
## 3 1200 90 1 1 2015 17.6
## 4 250000 124000 7 1 2015 24.7
## 5 200 140 1 1 2015 14.6
## 6 290 0 0 1 2015 2.5
## other_pests_and_parasites diseases pesticides other unknown
## 1 5.4 0.0 2.2 9.1 9.4
## 2 20.5 0.1 0.0 1.8 3.1
## 3 11.4 1.5 3.4 1.0 1.0
## 4 7.2 3.0 7.5 6.5 2.8
## 5 0.9 1.8 0.6 2.6 5.9
## 6 1.4 0.0 0.0 21.2 2.4
#Checking for NAs! (THERE'S NONE THIS IS A MIRACLE)
colSums(is.na(bees))
## state state_code num_colonies
## 0 0 0
## max_colonies lost_colonies percent_lost
## 0 0 0
## added_colonies renovated_colonies percent_renovated
## 0 0 0
## quarter year varroa_mites
## 0 0 0
## other_pests_and_parasites diseases pesticides
## 0 0 0
## other unknown
## 0 0
# (Out of curiosity) Highest loss states
high_loss_states <- bees |>
filter(percent_lost > 25)
# Arrange by highest loss percentage -> Missouri with a whopping 65%
bees_sorted <- bees |>
arrange(desc(percent_lost))
# Colony lost per year -> 2015 with a peak of 18322.98
bees_lost <- bees |>
group_by(year) |>
summarize(avg_colony_loss = mean(lost_colonies))
# Scatterplot visualization (1)
ggplot(bees, aes(x = percent_lost, y = pesticides, color = as.factor(quarter), size = percent_renovated)) +
geom_point(alpha = 0.6) +
labs(title = "Effect of Pesticides on Colony Loss",
x = "Colony Lost (%)",
y = "Pesticide Exposure (%)",
color = "Quarter of the Year",
size = "Renovation (%)",
caption = "Data sourced from the USDA") +
scale_color_manual(values = c("1" = "#EFB700", "2" = "#4C9A2A",
"3" = "#63A4FF", "4" = "#5A3D2B")) +
theme_gdocs() +
theme(plot.title = element_text(size = 20, face = "bold"),
plot.caption = element_text(size = 9))
The points seem to be mostly clumped in the bottom left corner, so maintaining lower half of both variables. Renovation and quarter seem to not have a pattern or effect. There is no visible line, trajectory or pattern- but a large quantity of various outliers. Overall there is no connected pattern, rhyme, or reason other than the grouping in the lower quartile.
# Scatterplot visualization (2)
ggplot(bees, aes(x = percent_lost, y = varroa_mites, color = as.factor(quarter), size = percent_renovated)) +
geom_point(alpha = 0.6) +
labs(title = "Effect of Varroa Mites on Colony Loss",
x = "Colony Lost (%)",
y = "Varroa Mite Exposure (%)",
color = "Quarter of the Year",
size = "Renovation (%)",
caption = "Data sourced from the USDA") +
scale_color_manual(values = c("1" = "#EFB700", "2" = "#4C9A2A",
"3" = "#63A4FF", "4" = "#5A3D2B")) +
theme_gdocs() +
theme(plot.title = element_text(size = 20, face = "bold"),
plot.caption = element_text(size = 9))
Points are pooled in the lower percentages of colony lost, but exposure to varroa mites has stretched the data upwards, touching higher percentages. The data is expressing a bit more of a strongly positive slope. Varroa mites seem to actually have minimal positive correlation with varroa mites, since the slope is almost TOO positive. There are some really high points that still lay in the lower areas of colony lost, instilling this perspective. Renovation seems to have no impact, but the quarters seem to have a little bit of a visible color grouping. Q1 (yellow) is the lowest, followed by Q2 (green) and Q4 (brown) being in the same general area, while Q3 (blue) is just a touch higher.
# Scatterplot visualization (3)
ggplot(bees, aes(x = percent_lost, y = diseases, color = as.factor(quarter), size = percent_renovated)) +
geom_point(alpha = 0.6) +
labs(title = "Effect of Diseases on Colony Loss",
x = "Colony Lost (%)",
y = "Disease Exposure (%)",
color = "Quarter of the Year",
size = "Renovation (%)",
caption = "Data sourced from the USDA") +
scale_color_manual(values = c("1" = "#EFB700", "2" = "#4C9A2A",
"3" = "#63A4FF", "4" = "#5A3D2B")) +
theme_gdocs() +
theme(plot.title = element_text(size = 20, face = "bold"),
plot.caption = element_text(size = 9))
Almost all points are collected between 25% disease exposure and 35% of colony lost. There is an expressed potential for a positive slope, but there are, once again, many outliers throwing their hands into the pool of contribution. Quarter and renovation seem to once again not hold any weight in potential patterns seen in this graph. There is a potential for visual correlation between the two variables.
# Develop the correlation matrix
cor_matrix <- bees |>
select(pesticides, diseases, varroa_mites, percent_lost) |> # Select variables
cor()
# Force the matrix to be longform
cor_data <- melt(cor_matrix)
# Color palette
custom_colors <- colorRampPalette(c("#715f61", "white", "#f1b24e"))(200)
# Default correlation plot for presentation (Without Plotly)
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8,
col = custom_colors, cl.length = 5)
# GGPlot version of the matrix so it can be interactive
p <- ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#715f61", mid = "white", high = "#f1b24e", midpoint = 0) +
theme_minimal() +
labs(title = "Interactive Correlation Matrix",
fill = "Correlation",
x = "Variable 1",
y = "variable 2",
caption = "Data sourced from the USDA") +
theme_economist()
ggplotly(p) |>
layout( # https://stackoverflow.com/questions/45103559/plotly-adding-a-source-or-caption-to-a-chart for how I did this because turns out you can't use normal ggplot captions
margin = list(b = 100),
annotations = list(
x = 1, y = -0.3, text = "Data sourced from the USDA",
xref = "paper", yref = "paper", showarrow = FALSE,
xanchor = "right", font = list(size = 12)
)
)
There is no expressed negative correlation between any of the variables, as well as no strong positive correlation. All correlation seems to be under 50%, and all correlation between my independent and dependent variables specifically is incredibly low. The lowest correlation falls to percent_lost and diseases at 0.1695, and the highest is pesticides and varroa_mites, with 0.4389.
# Linear Regression Model
beeslm <- lm(percent_lost ~ varroa_mites + pesticides + diseases, data = bees)
summary(beeslm)
##
## Call:
## lm(formula = percent_lost ~ varroa_mites + pesticides + diseases,
## data = bees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.206 -4.694 -1.352 3.005 55.770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.45868 0.35211 24.023 < 2e-16 ***
## varroa_mites 0.07235 0.01124 6.439 1.63e-10 ***
## pesticides 0.04589 0.02405 1.908 0.05659 .
## diseases 0.09974 0.03168 3.148 0.00168 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.11 on 1449 degrees of freedom
## Multiple R-squared: 0.06862, Adjusted R-squared: 0.0667
## F-statistic: 35.59 on 3 and 1449 DF, p-value: < 2.2e-16
LM EQUATION: percent lost = 8.45868 + 0.07235(varroa mites) + 0.04589(pesticides) + 0.09974(diseases)
ANALYSIS: Having an R-squared of 0.0667, AKA 6.67%, means only about that much of the percentage lost can be predicted by variables varroa_mites, pesticides, and diseases. This generally implies that these variables are not very predictive of the percentage of lost of bee colonies.
Having a p-value of 2.2e-16, AKA near zero, heavily implies that these variables are correlated and hold some form of affect on one another. The p-value is incredibly small, expressing how the observed results are very unlikely to be a product of random chance.
# Graphing the linear regression model
bees$predicted <- predict(beeslm)
ggplot(bees, aes(x = predicted, y = percent_lost)) +
geom_point() +
geom_smooth(color = "#c27d83", method = "lm") +
labs(title = "Prediction LM Graph",
caption = "Data sourced from the USDA",
x = "Predicted LM",
y = "Percent of Lost Colonies",) +
theme_gdocs() +
theme(plot.title = element_text(size = 20, face = "bold"),
plot.caption = element_text(size = 9))
## `geom_smooth()` using formula = 'y ~ x'
There is a clear positive slope, but it is incredibly weak. The clustered points all still remain about within that bottom left quarter, with the following sprinkled outliers. scattered in various predictions. Due to the weak slope we can assume the correlation is also weak as well.
# Graphing a histogram of the residuals
hist(beeslm$residuals,
main = "Histogram of Bees Linear Model Residuals", # Sets the title
xlab = "Residuals", # Label x-axis
ylab = "Frequency", # Label y-axis
col = "#EFB700", # Bar fill
border = "#5A3D2B") # Border color
Quick histogram of the residual, and it is expressed that, due to the graph being generally symmetrical about 0, there is suggested normality. Yes, it appears skewed right, but it is centered and symmetric about 0.
Overall, the general issues I chose to highlight regarding honey bee colony health and well being seem to have very little actual effect on the decline in honey bee colony population. These are just a small sample of the many issues that affect honey bee populations, so if I were to continue, I would like to explore effects the environment and global warming have had on honey bee colonies. Personally, I assumed that these variables would at least express stronger evidence of correlation. The high p-value is representative of significance between the variables, but because of the state of my data, it is unclear if my results can actually be considered representative of the general population of honey bee colonies.
INTRODUCTION:
Bees, in general, have been slowly going extinct due to a variety of issues, such as pesticides, diseases, and invasive mites. This is a major problem that should be more frequently addressed since bees are one of Earth’s largest pollinators, allowing our fruits and flowers to grow. Without them, agriculture would falter and collapse, leaving us with very little in regards to economical and food stock. In order to do a study on the factors affecting bee populations, I have chosen to use a dataset named “Save the Bees”. All data from this dataset is directly collected from surveys collected by the USDA (United States Department of Agriculture), now publicly available on the USDA website, but has been cleaned and neatened by user Shane Simon on Kaggle (https://www.kaggle.com/datasets/m000sey/save-the-honey-bees). It is a full analysis of the population of honeybees between 2015 and 2022, with 17 different variables.
The utilized variables are: percent_lost
, the percentage
of honey bee colonies lost within a certain quarter of the year
lost_colonies
, the actual number of lost honey bee colonies
within a quarter of a year quarter
, the quarter of the year
ranging from Q1 to Q4 percent_renovated
, the percent of
honey bee colonies renovated by human contact within the quarter
pesticides
, the percent of honey bee colonies affected by
pesticides varroa_mites
, the percent of honey bee colonies
affected by varroa mites diseases
, the percent of honey bee
colonies affected by certain diseases
GRAPH ANALYSIS:
(Scatterplot 1) The points seem to be mostly clumped in the bottom left corner, so maintaining lower half of both variables. Renovation and quarter seem to not have a pattern or effect. There is no visible line, trajectory or pattern- but a large quantity of various outliers. Overall there is no connected pattern, rhyme, or reason other than the grouping in the lower quartile.
(Scatterplot 2) Points are pooled in the lower percentages of colony lost, but exposure to varroa mites has stretched the data upwards, touching higher percentages. The data is expressing a bit more of a strongly positive slope. Varroa mites seem to actually have minimal positive correlation with varroa mites, since the slope is almost TOO positive. There are some really high points that still lay in the lower areas of colony lost, instilling this perspective. Renovation seems to have no impact, but the quarters seem to have a little bit of a visible color grouping. Q1 (yellow) is the lowest, followed by Q2 (green) and Q4 (brown) being in the same general area, while Q3 (blue) is just a touch higher.
(Scatterplot 3) Almost all points are collected between 25% disease exposure and 35% of colony lost. There is an expressed potential for a positive slope, but there are, once again, many outliers throwing their hands into the pool of contribution. Quarter and renovation seem to once again not hold any weight in potential patterns seen in this graph. There is a potential for visual correlation between the two variables.
(Correlation Matrix) There is no expressed negative correlation between any of the variables, as well as no strong positive correlation. All correlation seems to be under 50%, and all correlation between my independent and dependent variables specifically is incredibly low. The lowest correlation falls to percent_lost and diseases at 0.1695, and the highest is pesticides and varroa_mites, with 0.4389.
LINEAR REGRESSION MODEL:
(Linear Model Equation) percent lost = 8.45868 + 0.07235(varroa mites) + 0.04589(pesticides) + 0.09974(diseases)
(Analysis) Having an R-squared of 0.0667, AKA 6.67%, means only about that much of the percentage lost can be predicted by variables varroa_mites, pesticides, and diseases. This generally implies that these variables are not very predictive of the percentage of lost of bee colonies.
Having a p-value of 2.2e-16, AKA near zero, heavily implies that these variables are correlated and hold some form of affect on one another. The p-value is incredibly small, expressing how the observed results are very unlikely to be a product of random chance.
(Prediction model analysis) There is a clear positive slope, but it is incredibly weak. The clustered points all still remain about within that bottom left quarter, with the following sprinkled outliers. scattered in various predictions. Due to the weak slope we can assume the correlation is also weak as well.
(Histogram analysis) Quick histogram of the residual, and it is expressed that, due to the graph being generally symmetrical about 0, there is suggested normality. Yes, it appears skewed right, but it is centered and symmetric about 0.
CONCLUSION:
Overall, the general issues I chose to highlight regarding honey bee colony health and well being seem to have very little actual effect on the decline in honey bee colony population. These are just a small sample of the many issues that affect honey bee populations, so if I were to continue, I would like to explore effects the environment and global warming have had on honey bee colonies. Personally, I assumed that these variables would at least express stronger evidence of correlation. The high p-value is representative of significance between the variables, but because of the state of my data, it is unclear if my results can actually be considered representative of the general population of honey bee colonies.