DATA 110 Final Project

# Insert image
knitr::include_graphics("sf-beesemen.jpg")

- taken by Marisa Lubeck, USGS

Introduction

Bees, in general, have been slowly going extinct due to a variety of issues, such as pesticides, diseases, and invasive mites. This is a major problem that should be more frequently addressed since bees are one of Earth’s largest pollinators, allowing our fruits and flowers to grow. Without them, agriculture would falter and collapse, leaving us with very little in regards to economical and food stock. In order to do a study on the factors affecting bee populations, I have chosen to use a dataset named “Save the Bees”. All data from this dataset is directly collected from surveys collected by the USDA (United States Department of Agriculture), now publicly available on the USDA website, but has been cleaned and neatened by user Shane Simon on Kaggle (https://www.kaggle.com/datasets/m000sey/save-the-honey-bees). It is a full analysis of the population of honeybees between 2015 and 2022, with 17 different variables.

The utilized variables are:

percent_lost, the percentage of honey bee colonies lost within a certain quarter of the year
lost_colonies, the actual number of lost honey bee colonies within a quarter of a year
quarter, the quarter of the year ranging from Q1 to Q4
percent_renovated, the percent of honey bee colonies renovated by human contact within the quarter
pesticides, the percent of honey bee colonies affected by pesticides
varroa_mites, the percent of honey bee colonies affected by varroa mites
diseases, the percent of honey bee colonies affected by certain diseases

# Calling the packages
library(ggplot2) # Graphing
library(tidyverse) # Commands

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot) # Correlation plot

## Warning: package 'corrplot' was built under R version 4.4.3

## corrplot 0.95 loaded

library(dplyr) # Commands
library(plotly) # Interactive Correlation plot

## Warning: package 'plotly' was built under R version 4.4.3

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(reshape2) # Correlation plot - Longform

## Warning: package 'reshape2' was built under R version 4.4.3

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(ggthemes) # Extra themes

## Warning: package 'ggthemes' was built under R version 4.4.3

# Calling the data set
bees <- read.csv("save_the_bees.csv")
head(bees)

##         state state_code num_colonies max_colonies lost_colonies percent_lost
## 1     Alabama         AL         7000         7000          1800           26
## 2     Arizona         AZ        35000        35000          4600           13
## 3    Arkansas         AR        13000        14000          1500           11
## 4  California         CA      1440000      1690000        255000           15
## 5    Colorado         CO         3500        12500          1500           12
## 6 Connecticut         CT         3900         3900           870           22
##   added_colonies renovated_colonies percent_renovated quarter year varroa_mites
## 1           2800                250                 4       1 2015         10.0
## 2           3400               2100                 6       1 2015         26.9
## 3           1200                 90                 1       1 2015         17.6
## 4         250000             124000                 7       1 2015         24.7
## 5            200                140                 1       1 2015         14.6
## 6            290                  0                 0       1 2015          2.5
##   other_pests_and_parasites diseases pesticides other unknown
## 1                       5.4      0.0        2.2   9.1     9.4
## 2                      20.5      0.1        0.0   1.8     3.1
## 3                      11.4      1.5        3.4   1.0     1.0
## 4                       7.2      3.0        7.5   6.5     2.8
## 5                       0.9      1.8        0.6   2.6     5.9
## 6                       1.4      0.0        0.0  21.2     2.4

#Checking for NAs! (THERE'S NONE THIS IS A MIRACLE)
colSums(is.na(bees))

##                     state                state_code              num_colonies 
##                         0                         0                         0 
##              max_colonies             lost_colonies              percent_lost 
##                         0                         0                         0 
##            added_colonies        renovated_colonies         percent_renovated 
##                         0                         0                         0 
##                   quarter                      year              varroa_mites 
##                         0                         0                         0 
## other_pests_and_parasites                  diseases                pesticides 
##                         0                         0                         0 
##                     other                   unknown 
##                         0                         0

# (Out of curiosity) Highest loss states
high_loss_states <- bees |>
  filter(percent_lost > 25)

# Arrange by highest loss percentage -> Missouri with a whopping 65%
bees_sorted <- bees |>
  arrange(desc(percent_lost))

# Colony lost per year -> 2015 with a peak of 18322.98
bees_lost <- bees |>
  group_by(year) |>
  summarize(avg_colony_loss = mean(lost_colonies))

# Scatterplot visualization (1)
ggplot(bees, aes(x = percent_lost, y = pesticides, color = as.factor(quarter), size = percent_renovated)) +
  geom_point(alpha = 0.6) +
  labs(title = "Effect of Pesticides on Colony Loss",
       x = "Colony Lost (%)",
       y = "Pesticide Exposure (%)",
       color = "Quarter of the Year",
       size = "Renovation (%)",
       caption = "Data sourced from the USDA") +
  scale_color_manual(values = c("1" = "#EFB700", "2" = "#4C9A2A", 
                                "3" = "#63A4FF", "4" = "#5A3D2B")) +
  theme_gdocs() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        plot.caption = element_text(size = 9))

The points seem to be mostly clumped in the bottom left corner, so maintaining lower half of both variables. Renovation and quarter seem to not have a pattern or effect. There is no visible line, trajectory or pattern- but a large quantity of various outliers. Overall there is no connected pattern, rhyme, or reason other than the grouping in the lower quartile.

# Scatterplot visualization (2)
ggplot(bees, aes(x = percent_lost, y = varroa_mites, color = as.factor(quarter), size = percent_renovated)) +
  geom_point(alpha = 0.6) +
  labs(title = "Effect of Varroa Mites on Colony Loss",
       x = "Colony Lost (%)",
       y = "Varroa Mite Exposure (%)",
       color = "Quarter of the Year",
       size = "Renovation (%)",
       caption = "Data sourced from the USDA") +
  scale_color_manual(values = c("1" = "#EFB700", "2" = "#4C9A2A", 
                                "3" = "#63A4FF", "4" = "#5A3D2B")) +
  theme_gdocs() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        plot.caption = element_text(size = 9))

Points are pooled in the lower percentages of colony lost, but exposure to varroa mites has stretched the data upwards, touching higher percentages. The data is expressing a bit more of a strongly positive slope. Varroa mites seem to actually have minimal positive correlation with varroa mites, since the slope is almost TOO positive. There are some really high points that still lay in the lower areas of colony lost, instilling this perspective. Renovation seems to have no impact, but the quarters seem to have a little bit of a visible color grouping. Q1 (yellow) is the lowest, followed by Q2 (green) and Q4 (brown) being in the same general area, while Q3 (blue) is just a touch higher.

# Scatterplot visualization (3)
ggplot(bees, aes(x = percent_lost, y = diseases, color = as.factor(quarter), size = percent_renovated)) +
  geom_point(alpha = 0.6) +
  labs(title = "Effect of Diseases on Colony Loss",
       x = "Colony Lost (%)",
       y = "Disease Exposure (%)",
       color = "Quarter of the Year",
       size = "Renovation (%)",
       caption = "Data sourced from the USDA") +
  scale_color_manual(values = c("1" = "#EFB700", "2" = "#4C9A2A", 
                                "3" = "#63A4FF", "4" = "#5A3D2B")) +
  theme_gdocs() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        plot.caption = element_text(size = 9))

Almost all points are collected between 25% disease exposure and 35% of colony lost. There is an expressed potential for a positive slope, but there are, once again, many outliers throwing their hands into the pool of contribution. Quarter and renovation seem to once again not hold any weight in potential patterns seen in this graph. There is a potential for visual correlation between the two variables.

# Develop the correlation matrix
cor_matrix <- bees |> 
  select(pesticides, diseases, varroa_mites, percent_lost) |> # Select variables
  cor() 

 # Force the matrix to be longform
cor_data <- melt(cor_matrix)

# Color palette
custom_colors <- colorRampPalette(c("#715f61", "white", "#f1b24e"))(200)

# Default correlation plot for presentation (Without Plotly)
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8,  
         col = custom_colors, cl.length = 5)

# GGPlot version of the matrix so it can be interactive
p <- ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#715f61", mid = "white", high = "#f1b24e", midpoint = 0) +
  theme_minimal() +
  labs(title = "Interactive Correlation Matrix", 
       fill = "Correlation",
       x = "Variable 1",
       y = "variable 2",
       caption = "Data sourced from the USDA") +
  theme_economist()

ggplotly(p) |>
  layout(      # https://stackoverflow.com/questions/45103559/plotly-adding-a-source-or-caption-to-a-chart for how I did this because turns out you can't use normal ggplot captions
    margin = list(b = 100),  
    annotations = list(
      x = 1, y = -0.3, text = "Data sourced from the USDA",
      xref = "paper", yref = "paper", showarrow = FALSE,
      xanchor = "right", font = list(size = 12)
    )
  )

There is no expressed negative correlation between any of the variables, as well as no strong positive correlation. All correlation seems to be under 50%, and all correlation between my independent and dependent variables specifically is incredibly low. The lowest correlation falls to percent_lost and diseases at 0.1695, and the highest is pesticides and varroa_mites, with 0.4389.

# Linear Regression Model
beeslm <- lm(percent_lost ~ varroa_mites + pesticides + diseases, data = bees)
summary(beeslm)

## 
## Call:
## lm(formula = percent_lost ~ varroa_mites + pesticides + diseases, 
##     data = bees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.206  -4.694  -1.352   3.005  55.770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.45868    0.35211  24.023  < 2e-16 ***
## varroa_mites  0.07235    0.01124   6.439 1.63e-10 ***
## pesticides    0.04589    0.02405   1.908  0.05659 .  
## diseases      0.09974    0.03168   3.148  0.00168 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.11 on 1449 degrees of freedom
## Multiple R-squared:  0.06862,    Adjusted R-squared:  0.0667 
## F-statistic: 35.59 on 3 and 1449 DF,  p-value: < 2.2e-16

LM EQUATION: percent lost = 8.45868 + 0.07235(varroa mites) + 0.04589(pesticides) + 0.09974(diseases)

ANALYSIS: Having an R-squared of 0.0667, AKA 6.67%, means only about that much of the percentage lost can be predicted by variables varroa_mites, pesticides, and diseases. This generally implies that these variables are not very predictive of the percentage of lost of bee colonies.

Having a p-value of 2.2e-16, AKA near zero, heavily implies that these variables are correlated and hold some form of affect on one another. The p-value is incredibly small, expressing how the observed results are very unlikely to be a product of random chance.

# Graphing the linear regression model
bees$predicted <- predict(beeslm)

ggplot(bees, aes(x = predicted, y = percent_lost)) +
  geom_point() +
  geom_smooth(color = "#c27d83", method = "lm") +
  labs(title = "Prediction LM Graph",
       caption = "Data sourced from the USDA",
       x = "Predicted LM",
       y = "Percent of Lost Colonies",) +
  theme_gdocs() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        plot.caption = element_text(size = 9))

## `geom_smooth()` using formula = 'y ~ x'

There is a clear positive slope, but it is incredibly weak. The clustered points all still remain about within that bottom left quarter, with the following sprinkled outliers. scattered in various predictions. Due to the weak slope we can assume the correlation is also weak as well.

# Graphing a histogram of the residuals
hist(beeslm$residuals,
     main = "Histogram of Bees Linear Model Residuals",  # Sets the title
     xlab = "Residuals",  # Label x-axis
     ylab = "Frequency",  # Label y-axis
     col = "#EFB700",  # Bar fill
     border = "#5A3D2B")  # Border color

Quick histogram of the residual, and it is expressed that, due to the graph being generally symmetrical about 0, there is suggested normality. Yes, it appears skewed right, but it is centered and symmetric about 0.

Overall, the general issues I chose to highlight regarding honey bee colony health and well being seem to have very little actual effect on the decline in honey bee colony population. These are just a small sample of the many issues that affect honey bee populations, so if I were to continue, I would like to explore effects the environment and global warming have had on honey bee colonies. Personally, I assumed that these variables would at least express stronger evidence of correlation. The high p-value is representative of significance between the variables, but because of the state of my data, it is unclear if my results can actually be considered representative of the general population of honey bee colonies.

Full organization of all writing pieces:

INTRODUCTION:

The utilized variables are: percent_lost, the percentage of honey bee colonies lost within a certain quarter of the year lost_colonies, the actual number of lost honey bee colonies within a quarter of a year quarter, the quarter of the year ranging from Q1 to Q4 percent_renovated, the percent of honey bee colonies renovated by human contact within the quarter pesticides, the percent of honey bee colonies affected by pesticides varroa_mites, the percent of honey bee colonies affected by varroa mites diseases, the percent of honey bee colonies affected by certain diseases

GRAPH ANALYSIS:

(Scatterplot 1) The points seem to be mostly clumped in the bottom left corner, so maintaining lower half of both variables. Renovation and quarter seem to not have a pattern or effect. There is no visible line, trajectory or pattern- but a large quantity of various outliers. Overall there is no connected pattern, rhyme, or reason other than the grouping in the lower quartile.

(Scatterplot 2) Points are pooled in the lower percentages of colony lost, but exposure to varroa mites has stretched the data upwards, touching higher percentages. The data is expressing a bit more of a strongly positive slope. Varroa mites seem to actually have minimal positive correlation with varroa mites, since the slope is almost TOO positive. There are some really high points that still lay in the lower areas of colony lost, instilling this perspective. Renovation seems to have no impact, but the quarters seem to have a little bit of a visible color grouping. Q1 (yellow) is the lowest, followed by Q2 (green) and Q4 (brown) being in the same general area, while Q3 (blue) is just a touch higher.

(Scatterplot 3) Almost all points are collected between 25% disease exposure and 35% of colony lost. There is an expressed potential for a positive slope, but there are, once again, many outliers throwing their hands into the pool of contribution. Quarter and renovation seem to once again not hold any weight in potential patterns seen in this graph. There is a potential for visual correlation between the two variables.

(Correlation Matrix) There is no expressed negative correlation between any of the variables, as well as no strong positive correlation. All correlation seems to be under 50%, and all correlation between my independent and dependent variables specifically is incredibly low. The lowest correlation falls to percent_lost and diseases at 0.1695, and the highest is pesticides and varroa_mites, with 0.4389.

LINEAR REGRESSION MODEL:

(Linear Model Equation) percent lost = 8.45868 + 0.07235(varroa mites) + 0.04589(pesticides) + 0.09974(diseases)

(Analysis) Having an R-squared of 0.0667, AKA 6.67%, means only about that much of the percentage lost can be predicted by variables varroa_mites, pesticides, and diseases. This generally implies that these variables are not very predictive of the percentage of lost of bee colonies.

(Prediction model analysis) There is a clear positive slope, but it is incredibly weak. The clustered points all still remain about within that bottom left quarter, with the following sprinkled outliers. scattered in various predictions. Due to the weak slope we can assume the correlation is also weak as well.

(Histogram analysis) Quick histogram of the residual, and it is expressed that, due to the graph being generally symmetrical about 0, there is suggested normality. Yes, it appears skewed right, but it is centered and symmetric about 0.

CONCLUSION:

DATA 110 Final Project

R.R

Introduction

Full organization of all writing pieces: