Investigating the Impacts of Year, Season, Location, and Known Stressors on Bee Colony Health

Part 1 - Introduction

The data used herein was collected and provided by the USDA in their annual honey bee report released in August 2022 (jonthegeek, 2022). The USDA collected data on a stratified sample of honeybee colonies across the United States through the National Agricultural Statistics Service, who conduct a “Bee and Honey Inquiry” up to 4 times per year to gain data on colony numbers from field offices around the country (NASS, 2022). In this inquiry, colony owners can either electronically input their data, respond to mail requesting information, or speak with the NASS over the phone to report their honeybee conditions (NASS, 2022). For any missing data and to check all data points, the USDA used historic data (NASS, 2022). It is notable though, that because it was not possible to sample every single bee colony in the United States and these data were collected through a survey, the sample is susceptible to sampling error, but they used estimation and projections to minimize this error (NASS, 2022). This data was published to the TidyTuesday page on GitHub, from which we are accessing it (jonthegeek,2022).

The USDA reported data on the number of honeybee colonies, the maximum number of colonies in an area, the number of colonies lost, the percent of colonies lost, the percent of colonies added, the number and percent of colonies renovated, information on colony health stressors and information on Colony Collapse Disorder symptoms, as well as month and state location information (NASS, 2022).

The data used focuses on bee colony losses in the United States from 2008 to 2021. Specifically, the data is finding at what time of year (season) bee colony losses are the highest and has it changed over the 14 year time period. Two seasons, the summer and winter seasons, are primarily being focused in this data set.

Image of a Honeybee (Sarnoff, 2025).

Bee colonies play a major role in pollinating both plants and our food. Without the bees doing this, we would lose out on a third of our global food production (Roberts, 2025). Putting this into numbers, in the United States specifically, honey bees pollinate roughly $15 billion of food crops each year (Woods, 2021). Not only do they play a crucial role in food production directly, if bees were to go extinct, there would be indirect impacts. For example, there would be a loss of biodiversity (specifically with plant diversity and strength) in all ecosystems that rely on pollination, which would then create a chain reaction that would deplete and destroy every other species within the ecosystem (Roberts, 2025). All of this considered, it is absolutely vital to study and understand bee colonies to protect our food sources and ensure biodiversity remains in our ecosystems.

Our Research Question and Variables:

Using this data, we hope to inquire on what is happening to bee colonies around the United States. More specifically, we seek to analyze what factors are most greatly impacting bee colony loss, so that we can better understand the dangers posed to bee colonies. Here, we plan to explore the impacts of year, season (month), region (state), and known stressors on the percentage of bee colonies lost at a given time.

Response variables, or dependent variables, in this data are variables that were impacted as a result of stressors, months, and states, on bee colonies. In the case of this analysis, we are concerned primarily with percent of colonies lost. This variable is a numerical, continuous variable, communicating the percentage of bee colonies in an area lost. This variable is particularly useful because it allows us to standardize colonies lost despite differences in original colony number.

Explanatory variables, or independent variables, in this data set are the items that stayed consistent over the time period that the bee colonies didn’t impact. In this set, the year, month, states, and stressors are not impacted by the bees. This data is categorical and is extremely useful in determining, what, when, and where bee colonies are being lost the most and what can be possibly done to replenish and repopulate the bees.

First, we must load in and clean our data. For this, we will use a script provided by the TidyTuesday page on GitHub (jonthegeek, 2022).

#Access and clean data

#Call in Data:
colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-11/colony.csv')

## Rows: 1222 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): months, state
## dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(colony)

## # A tibble: 6 × 10
##    year months        state      colony_n colony_max colony_lost colony_lost_pct
##   <dbl> <chr>         <chr>         <dbl>      <dbl>       <dbl>           <dbl>
## 1  2015 January-March Alabama        7000       7000        1800              26
## 2  2015 January-March Arizona       35000      35000        4600              13
## 3  2015 January-March Arkansas      13000      14000        1500              11
## 4  2015 January-March California  1440000    1690000      255000              15
## 5  2015 January-March Colorado       3500      12500        1500              12
## 6  2015 January-March Connectic…     3900       3900         870              22
## # ℹ 3 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>

stressor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-11/stressor.csv')

## Rows: 7332 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): months, state, stressor
## dbl (2): year, stress_pct
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(stressor)

## # A tibble: 6 × 5
##    year months        state   stressor              stress_pct
##   <dbl> <chr>         <chr>   <chr>                      <dbl>
## 1  2015 January-March Alabama Varroa mites                10  
## 2  2015 January-March Alabama Other pests/parasites        5.4
## 3  2015 January-March Alabama Disesases                   NA  
## 4  2015 January-March Alabama Pesticides                   2.2
## 5  2015 January-March Alabama Other                        9.1
## 6  2015 January-March Alabama Unknown                      9.4

#Merge Instead to Remove Duplicated columns and make sure data is merged based off common observations (month, year, stressor)
#see https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right for how to merge data
merged_bees<-merge(colony, stressor, by=c("year","months","state"))

Part 2 - Exploring the Data (Descriptive Statistics)

Our goal is to use the data to explore what the leading stressors of bees are and how they impact the percent of colonies lost, in relation to the year, state, and season (months of year). To do this, we must first explore and visualize the data.

Exploring Percent of Colonies Lost:

First, we must explore percent of colonies lost, our response variable. To explore this variable, we will first inquire about it’s mean, median, standard deviation, and standard error:

#mean = 11.38185% of colonies
mean_bees<-mean(merged_bees$colony_lost_pct,na.rm=TRUE)
mean_bees

## [1] 11.38185

#median= 10% of colonies
median(merged_bees$colony_lost_pct,na.rm=TRUE)

## [1] 10

#standard deviation = 7.223226
sd_bees<-sd(merged_bees$colony_lost_pct,na.rm=TRUE)
sd_bees

## [1] 7.223226

#calculate SE  (stdev/sqrt n) = 0.08435678  
n<-count(merged_bees) #n= 7332
SE<-(sd_bees/(sqrt(n)))
SE

##            n
## 1 0.08435678

These statistics tell us that the average (mean) percentage of colonies lost across the board was 11.38185%, and that the median percentage of colonies lost was 10%. The percent of colonies lost also had a standard deviation of 7.223226 and a standard error of 0.084. These statistics suggest that given the large sample size, the data is fairly close to population data (standard error is 0.084%), but that there is a bit of spread around the mean (standard deviation is 7.22%).

To further explore our response variable, it is also useful to visualize the shape of the data using a histogram:

percent_lost_hist<-ggplot(merged_bees, aes(x=colony_lost_pct))+
         geom_histogram(fill="yellow", color="black",binwidth=2)+ theme_classic()+ 
          xlab("Percent of Bee Colonies Lost")+ylab("Frequency")
percent_lost_hist

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).

From the shape of the histogram, it appears the data may be positively skewed, with more data points close to zero, and a few data points further out on the positive end/tail. More tests, and potentially transformations, will thus have to be done on the data before statistical analysis.

We can also explore the frequency of different stressors (explanatory variable), using a histogram to visualize if/any stressors are more common:

Freq_Stress_Hist <-ggplot(merged_bees, aes(x=stressor))+
         geom_histogram(stat="count", fill="yellow", color="black",bins=20)+ theme_classic()+ 
          xlab("Stressors")+ylab("Frequency")

## Warning in geom_histogram(stat = "count", fill = "yellow", color = "black", :
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Freq_Stress_Hist

This histogram shows us that there is an equal number of data points for each stressor.

Next, we must explore how our explanatory variables, year, season (month), location (state), and stressor type, relate to our response variable (percent of colony lost).Being that these are all categorical variables, individual descriptive statistics cannot be run for each variable. Instead, we must visualize how these explanatory variables impact the response variable, percent of colonies lost.

First, we want to explore how the year impacted the percent of colonies lost. To do this, we can first compare the distribution of percent of colonies lost by year using a histogram:

#Year must be a factor for graph to work, make year a factor:
is.factor(merged_bees$year) #not a factor

## [1] FALSE

merged_bees$year <- as.factor(as.character(merged_bees$year)) ##make it a factor
is.factor(merged_bees$year)

## [1] TRUE

levels(merged_bees$year) #in numerical order, 2015-2021, good to go

## [1] "2015" "2016" "2017" "2018" "2019" "2020" "2021"

#Make Histogram:
colonybyyear <- gghistogram(merged_bees, x = "colony_lost_pct",
                     add = "mean", rug = TRUE,
                     color = "year", fill = "year", bins = 12, 
                     palette = c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100","#AA4499" ))
facet(colonybyyear, facet.by = "year")

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).

When visually inspecting these histograms, it appears that there was not a huge difference between years when it came to the distribution, but that 2021 potentially had lower colony loss percentages across the board. To better visualize differences in percentage loss dependent on year, a box plot can be used:

lossbyyear <- ggplot(merged_bees, aes(x=year, y=colony_lost_pct, fill = year)) + 
  geom_boxplot()
lossbyyear

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

#add jitter dots for each data point to better visualize spread of data:
lossbyyear + geom_jitter(shape=20, position=position_jitter(0.1)) + 
  theme_classic() +
  labs(x = "Year", y = "Percent of Colonies Lost") + 
  scale_x_discrete(labels = c('2015','2016','2017','2018','2019','2020','2021')) + 
  scale_fill_manual(values=c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100","#AA4499"), 
                    name="Year",
                    breaks=c("2015", "2016", "2017","2018","2019","2020","2021"))

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 324 rows containing missing values or values outside the scale range
## (`geom_point()`).

lossbyyearjitter <- ggboxplot(merged_bees, x = "year", y = "colony_loss_pct",
               color = "year", palette = "jco",
               add = "jitter",na.rm=TRUE)

These box plots show that, visually, there is a lot of spread to the percentage of colony lost in each year, but that over the years, there does not appear to be a huge difference in colony lost percentage, which can be further analyzed and confirmed.

Next, we want to look at the distribution of percent loss by month using a histogram:

is.factor(merged_bees$months)

## [1] FALSE

merged_bees$months <- as.factor(as.character(merged_bees$months)) ##make it a factor
is.factor(merged_bees$months)

## [1] TRUE

levels(merged_bees$months)

## [1] "April-June"       "January-March"    "July-September"   "October-December"

#
merged_bees <- merged_bees%>%
  reorder_levels(months,order=c("January-March","April-June","July-September","October-December"))
percent_lostbymonth <- gghistogram(merged_bees, x = "colony_lost_pct",
                     add = "mean", rug = TRUE,
                     color = "months", fill = "months", bins = 12, 
                     palette = c("#88CCEE","#CC6677","#DDCC77","#117733" ))
facet(percent_lostbymonth, facet.by = "months")

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).

The histograms above show the percent loss of colonies during 4 sets of months (3 months per set), which appear to have slightly different, but relatively normal distributions. Further testing is needed to confirm any differences. We can also use a box plot to help us better visualize the data:

lossbymonths <- ggplot(merged_bees, aes(x=months, y=colony_lost_pct, fill = months)) + 
  geom_boxplot()
lossbymonths

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

#add jitter dots for each data point to better visualize spread of data:
lossbymonths + geom_jitter(shape=20, position=position_jitter(0.1)) + 
  theme_classic() +
  labs(x = "Month", y = "Percent of Colonies Lost") + 
  scale_x_discrete(labels = c('January-March','April-June','July-September','October-December')) + 
  scale_fill_manual(values=c("#88CCEE","#CC6677","#DDCC77","#117733"), 
                    name="Months",
                    breaks=c("January-March","April-June","July-September","October-December"))

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 324 rows containing missing values or values outside the scale range
## (`geom_point()`).

lossbymonthsjitter <- ggboxplot(merged_bees, x = "months", y = "colony_lost_pct",
               color = "months", palette = "jco",
               add = "jitter",na.rm=TRUE)

The box plots show that there could be differences in percent loss by month, but that a lot of variation is present in the data, and further testing is needed to confirm if these differences are significant.

Very similar to loss by month, looking at the loss by state could allow us to see the areas where bee colony lost is the highest. We will use histograms to show distributions for each state, as well as the United States as a whole, to see the loss of bee colonies.

is.factor(merged_bees$state)

## [1] FALSE

merged_bees$state <- as.factor(as.character(merged_bees$state)) ##make it a factor
is.factor(merged_bees$state)

## [1] TRUE

levels(merged_bees$state)

##  [1] "Alabama"        "Arizona"        "Arkansas"       "California"    
##  [5] "Colorado"       "Connecticut"    "Florida"        "Georgia"       
##  [9] "Hawaii"         "Idaho"          "Illinois"       "Indiana"       
## [13] "Iowa"           "Kansas"         "Kentucky"       "Louisiana"     
## [17] "Maine"          "Maryland"       "Massachusetts"  "Michigan"      
## [21] "Minnesota"      "Mississippi"    "Missouri"       "Montana"       
## [25] "Nebraska"       "New Jersey"     "New Mexico"     "New York"      
## [29] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [33] "Oregon"         "Other States"   "Pennsylvania"   "South Carolina"
## [37] "South Dakota"   "Tennessee"      "Texas"          "United States" 
## [41] "Utah"           "Vermont"        "Virginia"       "Washington"    
## [45] "West Virginia"  "Wisconsin"      "Wyoming"

#Histogram
percent_lostbystate <- gghistogram(merged_bees, x = "colony_lost_pct",
                     add = "mean", rug = TRUE,
                     color = "state", fill = "state", binwidth=1,
                     palette=c("#3E92CC", "#FFD166", "#EF476F", "#06D6A0", "#118AB2","#073B4C", "#FFD23F", "#F8961E", "#F94144", "#F9C74F","#90BE6D", "#43AA8B", "#577590", "#7400B8", "#6930C3", "#5E60CE", "#5390D9", "#4EA8DE", "#48BFE3", "#56CFE1","#64DFDF", "#72EFDD", "#80FFDB", "#FFADAD", "#FFD6A5","#FDFFB6", "#CAFFBF", "#9BF6FF", "#A0C4FF", "#BDB2FF","#FFC6FF", "#FFFF1F", "#D00000", "#FFBA08", "#3F88C5","#032B43", "#136F63", "#F1FA0E", "#A8DADC", "#457B9D","#1D3557", "#FF6B6B", "#4ECDC4", "#556270", "#C7F464","#FF6F91", "#FF9671"))+theme(legend.position="none")
facet(percent_lostbystate, facet.by = "state",na.rm=TRUE)

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).

Because there are so many states included, a histogram can be difficult to interpret. We can use a box plot to better visualize these differences:

lossbystate <- ggplot(merged_bees, aes(x=state, y=colony_lost_pct, fill = state)) + 
  geom_boxplot()+theme(axis.text.x=element_text(angle=45,vjust=1, hjust=1))
lossbystate

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The box plot shows us the variation present in the data, and suggests some group differences by month, but further testing and clarification are needed. Because there are so many states, we are going to subset our data to the top 20 states, and move forward with analysis from there. First we will subset our data and create a histogram to see the distribution of the data:

top_states <- merged_bees %>%
  group_by(state) %>%
  summarise(mean_pct = mean(colony_lost_pct, na.rm = TRUE)) %>%
  arrange(desc(mean_pct)) %>%
  slice_head(n = 20)

# Print the top 20 species
top_states

## # A tibble: 20 × 2
##    state          mean_pct
##    <fct>             <dbl>
##  1 Kansas             19.3
##  2 Arizona            17.7
##  3 New Mexico         16.9
##  4 Ohio               15.4
##  5 Alabama            15.3
##  6 Kentucky           14.7
##  7 Arkansas           14.6
##  8 Tennessee          14.6
##  9 Illinois           14.5
## 10 Colorado           14.0
## 11 Florida            13.4
## 12 United States      13.4
## 13 Georgia            13.2
## 14 Virginia           13.2
## 15 Wisconsin          13.0
## 16 Indiana            12.9
## 17 North Carolina     12.6
## 18 Pennsylvania       12.3
## 19 Massachusetts      12.2
## 20 West Virginia      12.2

levels(top_states$state)

##  [1] "Alabama"        "Arizona"        "Arkansas"       "California"    
##  [5] "Colorado"       "Connecticut"    "Florida"        "Georgia"       
##  [9] "Hawaii"         "Idaho"          "Illinois"       "Indiana"       
## [13] "Iowa"           "Kansas"         "Kentucky"       "Louisiana"     
## [17] "Maine"          "Maryland"       "Massachusetts"  "Michigan"      
## [21] "Minnesota"      "Mississippi"    "Missouri"       "Montana"       
## [25] "Nebraska"       "New Jersey"     "New Mexico"     "New York"      
## [29] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [33] "Oregon"         "Other States"   "Pennsylvania"   "South Carolina"
## [37] "South Dakota"   "Tennessee"      "Texas"          "United States" 
## [41] "Utah"           "Vermont"        "Virginia"       "Washington"    
## [45] "West Virginia"  "Wisconsin"      "Wyoming"

# Bar plot of top 20 endangered bee states
ggplot(top_states, aes(x = reorder(state, -mean_pct), y = mean_pct)) +
  geom_bar(stat = "identity", fill = "steelblue") + theme_bw() +
  labs(x = "States", y = "Percent Colony Lost", title = "Most Endangered Bee States") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

is.factor(merged_bees$state)

## [1] TRUE

subset.df.bees <- merged_bees[merged_bees$state %in% c('Kansas','Arizona','New Mexico','Ohio','Alabama','Kentucky','Arkansas','Tennessee', 'Illinois','Colorado','Florida','Georgia','Virginia','Wisconsin','Indiana','North Carolina','Pennsylvania','Massachusetts','West Virginia'), ]
subset.df.bees <- droplevels(subset.df.bees)
levels(subset.df.bees$state) #check to see if you only have your top 20

##  [1] "Alabama"        "Arizona"        "Arkansas"       "Colorado"      
##  [5] "Florida"        "Georgia"        "Illinois"       "Indiana"       
##  [9] "Kansas"         "Kentucky"       "Massachusetts"  "New Mexico"    
## [13] "North Carolina" "Ohio"           "Pennsylvania"   "Tennessee"     
## [17] "Virginia"       "West Virginia"  "Wisconsin"

#Make a Histogram:
percent_lostbytop20state <- gghistogram(subset.df.bees, x = "colony_lost_pct",
                     add = "mean", rug = TRUE,
                     color = "state", fill = "state", binwidth=5,
                     palette=c("#3E92CC", "#FFD166", "#EF476F","#073B4C", "#F8961E", "#F94144", "#F9C74F","#90BE6D", "#43AA8B", "#7400B8", "#6930C3", "#FFADAD", "#FF6B6B","#556270", "#A0C4FF", "#C7F464","#5E60CE", "#FFFF1F", "#D00000", "#FFBA08"))+theme(legend.position="none")
facet(percent_lostbytop20state, facet.by = "state", na.rm=TRUE)

## Warning: Removed 114 rows containing non-finite outside the scale range
## (`stat_bin()`).

We can also look for differences between the states on a boxplot:

lossbytop20state <- ggplot(subset.df.bees, aes(x=state, y=colony_lost_pct, fill = state)) + 
  geom_boxplot()+theme(axis.text.x=element_text(angle=45,vjust=1, hjust=0.5))
lossbytop20state

## Warning: Removed 114 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Before attempting to determine whether certain stressors were associated with higher colony percentage losses, the distribution of colony loss percentage by stressor type must also be visually represented using a histogram:

#First check what stressors were tested, and make sure R knows they are factors:
is.factor(merged_bees$stressor) #not a factor

## [1] FALSE

merged_bees$stressor <- as.factor(as.character(merged_bees$stressor)) ##make it a factor
is.factor(merged_bees$stressor)

## [1] TRUE

levels(merged_bees$stressor)

## [1] "Disesases"             "Other"                 "Other pests/parasites"
## [4] "Pesticides"            "Unknown"               "Varroa mites"

#Make sure percent is numeric:
is.numeric(merged_bees$colony_lost_pct)

## [1] TRUE

## Reorder Data:
merged_bees <- merged_bees %>% 
  reorder_levels(stressor, order = c("Disesases", "Varroa mites","Other pests/parasites", "Pesticides", "Other","Unknown"))
#Make Histogram:
pctlossbystressor <- gghistogram(merged_bees, x = "colony_lost_pct",
                     add = "mean", rug = TRUE,
                     color = "stressor", fill = "stressor", bins = 12, 
                     palette = c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100" ))
facet(pctlossbystressor, facet.by = "stressor")

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).

These histograms reveal a relatively similar distribution of percentage losses across all stressor types, and differences in colony loss dependent on stressor type may be more easily visually represented using a box plot:

stressorpercent <- ggplot(merged_bees, aes(x=stressor, y=colony_lost_pct, fill = stressor)) + 
  geom_boxplot()
stressorpercent

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

#add jitter dots for each data point
stressorpercent + geom_jitter(shape=20, position=position_jitter(0.1)) + 
  theme_classic() +
  labs(x = "Stressor", y = "Percent of Colonies Lost") + 
  scale_x_discrete(labels = c('Disesases', 'Varroa mites','Other pests/parasites', 'Pesticides', 'Other','Unknown')) + 
  scale_fill_manual(values=c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100"), 
                    name="Stressor",
                    breaks=c("Disesases", "Varroa mites","Other pests/parasites", "Pesticides", "Other","Unknown"), 
                    labels=c("Disesases", "Varroa mites","Other pests/parasites", "Pesticides", "Other","Unknown"))

## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 324 rows containing missing values or values outside the scale range
## (`geom_point()`).

stressorpctjitter <- ggboxplot(merged_bees, x = "stressor", y = "Percent of Colonies Lost",
               color = "stressor", palette = "jco",
               add = "jitter")

When visualizing colony loss per stressor, it appears that the bulk of colony loss may be equal across groups, but is also apparent that there are large amounts of variance, so more tests will need to be run, particularly for variance and normality.

Exploring Stressors:

In addition to these, we can also explore how location, season, and year influence what stressors are present. By distributing stressor presence by year, month (season), and state, we can see when and where certain stressors were most prevalent.

First, it is useful to determine whether the stressors present varied depending on the year. To do this, we can visualize the distribution of stressors each year using a bar plot:

ggplot(merged_bees, aes(x = year, y= stressor, fill=stressor, na.rm=TRUE)) +   
  geom_bar(position="stack",stat="identity") +theme_classic()+
  xlab("Year")+
  ylab("Frequency of Stressor")

Looking at a bar graph, it appears that stressors may not have fluctuated by year, but further analysis is needed.It appears that, if anything, there may have been less stressors present overall in 2021, but the proportions of stressors present remained the same, making it unclear whether there were simply less data points in 2021.

We can also look at the frequency of different stressors in each month using a bar plot:

ggplot(merged_bees, aes(x = months, y= stressor, fill=stressor, na.rm=TRUE)) +   
  geom_bar(position="stack",stat="identity") +theme_classic()+
  xlab("Month")+
  ylab("Frequency of Stressor")

The bar graph shows us that January through June has a higher frequency of all stressors compared to July through December.

Last, we can look at a bar graph of frequencies of stressors by state:

ggplot(merged_bees, aes(x = state, y= stressor, fill=stressor, na.rm=TRUE)) +   
  geom_bar(position="stack",stat="identity") +theme_classic()+
  xlab("state")+
  ylab("Frequency of Stressor")+
  theme(axis.text.x=element_text(angle=45,vjust=1, hjust=1))

Using this bar graph, it would appear that stressors are approximately equal in proportion in all states.

Part 3 - Statistical Test (Inferential Statistics)

Now that data has been visualized, we can begin to analyze the data to determine whether bee colony loss differs by year, season, region, and stressor type. Because percent loss is a numeric, continuous variable, and all 4 of these explanatory variables (year, month, state, stressor) are categorical and have 3 or more groups, one-way ANOVA’s can be used to answer each of these questions, as long as the assumptions for an ANOVA, normality and equal variance between groups, are met.

Analysis of Percent of Colonies Lost by Year

To elucidate whether the percent of colonies lost varies dependent on the year, we can use an analysis of variance if the data meets normality and equal variance assumptions because there are 7 year groups present. When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by year. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is dependent on the year. This alternative hypothesis means that at least one year is different from the others.

Tests for Normality

Recall that on the histogram of percent of colonies lost, there appeared to be positive skew. Because of this, we need to verify whether this data meets the normality conditions for an ANOVA.

To verify normality, we first must make a linear model of the data before we can examine its residuals for normality:

pct_year_model  <- lm(colony_lost_pct ~ year, data = merged_bees)

From here, we can use a qqplot, a histogram, and a residuals plot to see if the data (its residuals) is normal:

## Create a QQ plot of residuals
ggqqplot(residuals(pct_year_model))

## Look at the histogram of residuals
hist.model.yearpct <- hist(pct_year_model$residuals)

hist.model.yearpct

## $breaks
##  [1] -15 -10  -5   0   5  10  15  20  25  30  35  40
## 
## $counts
##  [1]  180 1518 2358 1620  696  366  156   48   36    6   24
## 
## $density
##  [1] 0.0051369863 0.0433219178 0.0672945205 0.0462328767 0.0198630137
##  [6] 0.0104452055 0.0044520548 0.0013698630 0.0010273973 0.0001712329
## [11] 0.0006849315
## 
## $mids
##  [1] -12.5  -7.5  -2.5   2.5   7.5  12.5  17.5  22.5  27.5  32.5  37.5
## 
## $xname
## [1] "pct_year_model$residuals"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## Residuals plot
plot(pct_year_model, 1)

#The qqplot is curved, the histogram is skewed, and the residuals plot does not have randomly scattered points, suggesting non-normality, lets use a Shapiro Wilk Test to verify:
#shapiro.test(pct_year_model$residuals) - error: Shapiro Test cannot be used if n>5000

The qqplot, histogram, and residuals plot of residuals appear non-normal, and a Shapiro Wilk cannot be used to verify this because the dataset is too large, so we are operating under the assumption that this data is nonnormal. Before trying to transform the data, ANOVA are very sensitive to extreme outliers, so lets check for extreme outliers:

merged_bees %>% 
  group_by(year) %>%
  identify_outliers(colony_lost_pct)

## # A tibble: 246 × 14
##    year  months        state    colony_n colony_max colony_lost colony_lost_pct
##    <fct> <fct>         <fct>       <dbl>      <dbl>       <dbl>           <dbl>
##  1 2015  January-March Illinois     6000      10500        4200              40
##  2 2015  January-March Illinois     6000      10500        4200              40
##  3 2015  January-March Illinois     6000      10500        4200              40
##  4 2015  January-March Illinois     6000      10500        4200              40
##  5 2015  January-March Illinois     6000      10500        4200              40
##  6 2015  January-March Illinois     6000      10500        4200              40
##  7 2015  January-March Kentucky     7500      10500        4100              39
##  8 2015  January-March Kentucky     7500      10500        4100              39
##  9 2015  January-March Kentucky     7500      10500        4100              39
## 10 2015  January-March Kentucky     7500      10500        4100              39
## # ℹ 236 more rows
## # ℹ 7 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

#There are extreme outliers. We can try to transform the data to see if it makes it more normal, then retest:
merged_bees$transformedcolony<-log(merged_bees$colony_lost_pct)

merged_bees %>% 
  group_by(year) %>%
  identify_outliers(transformedcolony)

## # A tibble: 276 × 15
##    year  months     state        colony_n colony_max colony_lost colony_lost_pct
##    <fct> <fct>      <fct>           <dbl>      <dbl>       <dbl>           <dbl>
##  1 2015  April-June Hawaii          13500      13500         120               1
##  2 2015  April-June Hawaii          13500      13500         120               1
##  3 2015  April-June Hawaii          13500      13500         120               1
##  4 2015  April-June Hawaii          13500      13500         120               1
##  5 2015  April-June Hawaii          13500      13500         120               1
##  6 2015  April-June Hawaii          13500      13500         120               1
##  7 2015  April-June Massachuset…     4100      14000         320               2
##  8 2015  April-June Massachuset…     4100      14000         320               2
##  9 2015  April-June Massachuset…     4100      14000         320               2
## 10 2015  April-June Massachuset…     4100      14000         320               2
## # ℹ 266 more rows
## # ℹ 8 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>,
## #   transformedcolony <dbl>, is.outlier <lgl>, is.extreme <lgl>

There are several extreme outliers, even after transforming the data, so an ANOVA cannot be used, and a non-parametric test, most likely a Kruskal-Wallis test, must be used instead.

There are extreme outliers, and the data cannot be transformed to remove them.

Test for Equal Variance

Before proceeding, we must also test for equal variance between groups. A Levene’s test will help us do this:

merged_bees %>% levene_test(colony_lost_pct ~ year)

## # A tibble: 1 × 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1     6  7001      9.47 2.21e-10

Because the variance yields a p value below .05, the variances between groups are not equal.

Because the data does not fit the requirements of an ANOVA, a non-parametric test, a Kruskal-wallis test, must be used:

kruskal.test(colony_lost_pct ~ year, data = merged_bees)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  colony_lost_pct by year
## Kruskal-Wallis chi-squared = 97.202, df = 6, p-value < 2.2e-16

This test shows that there is a statistically significant difference in percent of bee colonies lost depending on what year it was (X^2(6)=97.202,p<0.0001).

Analysis of Percent of Colonies Lost by Month

We can also analyze whether percent of colonies lost varies by month. We can use an analysis of variance if the data meets normality and equal variance assumptions because there are 4 month groups present. When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by month. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is different dependent on the month. This alternative hypothesis means that at least one month is different from the others.

#make a linear model
pct_month_model  <- lm(colony_lost_pct ~ months, data = merged_bees)

From here, we can use a qqplot, a histogram, and a residuals plot to see if the data is normal, and check for outliers:

## Create a QQ plot of residuals
ggqqplot(residuals(pct_month_model))

## Look at the histogram of residuals
hist.model.monthpct <- hist(pct_month_model$residuals)

hist.model.monthpct

## $breaks
##  [1] -15 -10  -5   0   5  10  15  20  25  30  35  40  45
## 
## $counts
##  [1]  240 1110 2376 1968  750  330  126   42   36   18    6    6
## 
## $density
##  [1] 0.0068493151 0.0316780822 0.0678082192 0.0561643836 0.0214041096
##  [6] 0.0094178082 0.0035958904 0.0011986301 0.0010273973 0.0005136986
## [11] 0.0001712329 0.0001712329
## 
## $mids
##  [1] -12.5  -7.5  -2.5   2.5   7.5  12.5  17.5  22.5  27.5  32.5  37.5  42.5
## 
## $xname
## [1] "pct_month_model$residuals"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

##Create a residuals plot
plot(pct_month_model, 1)

#The qqplot is curved, the histogram is skewed, and the residuals plot is not scattered, suggesting non-normality, and Shapiro Wilk does not work for a data set this size. 

#can try to use transformed data to make the data more normal:
transformedmodel_month  <- lm(transformedcolony ~ months, data = merged_bees)
plot(transformedmodel_month, 1)

plot(transformedmodel_month, 2)

#still not normal, can try different transformation
merged_bees$sqrtpercent<-sqrt(merged_bees$colony_lost_pct)
sqtransformedmodel_month  <- lm(sqrtpercent ~ months, data = merged_bees)
plot(sqtransformedmodel_month, 1)

sqtransformedmodel_month

## 
## Call:
## lm(formula = sqrtpercent ~ months, data = merged_bees)
## 
## Coefficients:
##            (Intercept)        monthsApril-June    monthsJuly-September  
##                 3.5356                 -0.8640                 -0.2621  
## monthsOctober-December  
##                -0.2348

#Data still non-normal

#check for outliers
merged_bees %>% 
  group_by(months) %>%
  identify_outliers(colony_lost_pct)

## # A tibble: 240 × 16
##    months        year  state    colony_n colony_max colony_lost colony_lost_pct
##    <fct>         <fct> <fct>       <dbl>      <dbl>       <dbl>           <dbl>
##  1 January-March 2015  Illinois     6000      10500        4200              40
##  2 January-March 2015  Illinois     6000      10500        4200              40
##  3 January-March 2015  Illinois     6000      10500        4200              40
##  4 January-March 2015  Illinois     6000      10500        4200              40
##  5 January-March 2015  Illinois     6000      10500        4200              40
##  6 January-March 2015  Illinois     6000      10500        4200              40
##  7 January-March 2015  Kentucky     7500      10500        4100              39
##  8 January-March 2015  Kentucky     7500      10500        4100              39
##  9 January-March 2015  Kentucky     7500      10500        4100              39
## 10 January-March 2015  Kentucky     7500      10500        4100              39
## # ℹ 230 more rows
## # ℹ 9 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>,
## #   transformedcolony <dbl>, sqrtpercent <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

#There are extreme outliers in the data, cannot do an ANOVA

The data does not pass any normality tests.

We can also check for equal variance between groups:

merged_bees %>% levene_test(colony_lost_pct ~ months)

## # A tibble: 1 × 4
##     df1   df2 statistic         p
##   <int> <int>     <dbl>     <dbl>
## 1     3  7004      170. 4.29e-106

The data is not normal, has extreme outliers, and does not have equal variance, so we must use a nonparametric test (Kruskal-Wallis):

kruskal.test(colony_lost_pct ~ months, data = merged_bees)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  colony_lost_pct by months
## Kruskal-Wallis chi-squared = 701.85, df = 3, p-value < 2.2e-16

There is a significant difference in percentage of colonies lost based on month (X^2(3)=701.85,p<0.0001)

Analysis of Percent of Colonies Lost by State

We can then perform the same analysis steps for percent of colonies lost by state.We can use an analysis of variance if the data meets normality and equal variance assumptions because there are 47 state groups present. When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by state. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is different dependent on the state.This alternative hypothesis means that at least one state is different from the others.

#make a linear model
pct_state_model  <- lm(colony_lost_pct ~ state, data = merged_bees)

## Create a QQ plot of residuals
ggqqplot(residuals(pct_state_model))

## Look at the histogram of residuals
hist.model.statepct <- hist(pct_state_model$residuals)

hist.model.statepct

## $breaks
##  [1] -20 -15 -10  -5   0   5  10  15  20  25  30  35  40
## 
## $counts
##  [1]   30  102 1224 2736 1686  732  330   84   42   12   18   12
## 
## $density
##  [1] 0.0008561644 0.0029109589 0.0349315068 0.0780821918 0.0481164384
##  [6] 0.0208904110 0.0094178082 0.0023972603 0.0011986301 0.0003424658
## [11] 0.0005136986 0.0003424658
## 
## $mids
##  [1] -17.5 -12.5  -7.5  -2.5   2.5   7.5  12.5  17.5  22.5  27.5  32.5  37.5
## 
## $xname
## [1] "pct_state_model$residuals"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

##Look at residual plot
plot(pct_state_model, 1)

#The qqplot, histogram, and residuals plot all suggest non-normality

#check for outliers
merged_bees %>% 
  group_by(state) %>%
  identify_outliers(colony_lost_pct)

## # A tibble: 186 × 16
##    state   year  months          colony_n colony_max colony_lost colony_lost_pct
##    <fct>   <fct> <fct>              <dbl>      <dbl>       <dbl>           <dbl>
##  1 Alabama 2020  April-June          8500       8500        4100              48
##  2 Alabama 2020  April-June          8500       8500        4100              48
##  3 Alabama 2020  April-June          8500       8500        4100              48
##  4 Alabama 2020  April-June          8500       8500        4100              48
##  5 Alabama 2020  April-June          8500       8500        4100              48
##  6 Alabama 2020  April-June          8500       8500        4100              48
##  7 Arizona 2015  October-Decemb…    36000      39000       12000              31
##  8 Arizona 2015  October-Decemb…    36000      39000       12000              31
##  9 Arizona 2015  October-Decemb…    36000      39000       12000              31
## 10 Arizona 2015  October-Decemb…    36000      39000       12000              31
## # ℹ 176 more rows
## # ℹ 9 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>,
## #   transformedcolony <dbl>, sqrtpercent <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

#There are extreme outliers in the data, cannot do an ANOVA 

#check for equal variance 
merged_bees %>% levene_test(colony_lost_pct ~ state) #unequal variance

## # A tibble: 1 × 4
##     df1   df2 statistic         p
##   <int> <int>     <dbl>     <dbl>
## 1    46  6961      21.7 1.68e-166

The data has extreme outliers and does not have equal variance, so we must use a nonparametric test (Kruskal-Wallis):

kruskal.test(colony_lost_pct ~ state, data = merged_bees)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  colony_lost_pct by state
## Kruskal-Wallis chi-squared = 1553.3, df = 46, p-value < 2.2e-16

There is a significant difference in percent of bee colonies lost depending on state (X^2(46)=1553.3,p<0.0001).

Because the original dataset is too large to do pairwise comparisons reliably, we will redo this analysis using only the top 20 states for colony loss. Because 20 states are included, we will use an ANOVA if the data meets requirements. First, we must create a linear model and check for normality

#Create a linear model:
pct_top20state_model  <- lm(colony_lost_pct ~ state, data = subset.df.bees)

## Create a QQ plot of residuals
ggqqplot(residuals(pct_top20state_model))

## Look at the histogram of residuals
hist.model.monthpct <- hist(pct_top20state_model$residuals)

hist.model.monthpct

## $breaks
##  [1] -20 -15 -10  -5   0   5  10  15  20  25  30  35  40
## 
## $counts
##  [1]  30  78 666 924 504 330 198  60  30   6  18   6
## 
## $density
##  [1] 0.0021052632 0.0054736842 0.0467368421 0.0648421053 0.0353684211
##  [6] 0.0231578947 0.0138947368 0.0042105263 0.0021052632 0.0004210526
## [11] 0.0012631579 0.0004210526
## 
## $mids
##  [1] -17.5 -12.5  -7.5  -2.5   2.5   7.5  12.5  17.5  22.5  27.5  32.5  37.5
## 
## $xname
## [1] "pct_top20state_model$residuals"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

##Create a residuals plot
plot(pct_top20state_model, 1)

#Shapiro Wilk Test for Normality
shapiro.test(subset.df.bees$colony_lost_pct)

## 
##  Shapiro-Wilk normality test
## 
## data:  subset.df.bees$colony_lost_pct
## W = 0.91864, p-value < 2.2e-16

The qqplot is curved, histogram is skewed, residual plot has lined up points, and the Shapiro test suggests non-normality (p<.001). Before attempting to transform the data, we must check for extreme outliers:

subset.df.bees %>% 
  group_by(state) %>%
  identify_outliers(colony_lost_pct)

## # A tibble: 72 × 14
##    state   year  months          colony_n colony_max colony_lost colony_lost_pct
##    <fct>   <fct> <fct>              <dbl>      <dbl>       <dbl>           <dbl>
##  1 Alabama 2020  April-June          8500       8500        4100              48
##  2 Alabama 2020  April-June          8500       8500        4100              48
##  3 Alabama 2020  April-June          8500       8500        4100              48
##  4 Alabama 2020  April-June          8500       8500        4100              48
##  5 Alabama 2020  April-June          8500       8500        4100              48
##  6 Alabama 2020  April-June          8500       8500        4100              48
##  7 Arizona 2015  October-Decemb…    36000      39000       12000              31
##  8 Arizona 2015  October-Decemb…    36000      39000       12000              31
##  9 Arizona 2015  October-Decemb…    36000      39000       12000              31
## 10 Arizona 2015  October-Decemb…    36000      39000       12000              31
## # ℹ 62 more rows
## # ℹ 7 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>, stressor <chr>, stress_pct <dbl>, is.outlier <lgl>,
## #   is.extreme <lgl>

There are still extreme outliers, so this data cannot be transformed to make it normal. Let’s also check for equal variance before proceeding:

subset.df.bees %>% levene_test(colony_lost_pct ~ state) #unequal variance

## # A tibble: 1 × 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1    18  2831      17.9 2.58e-54

Variance of the data is also unequal, so we must use a Kruskal-Wallis Test:

kruskal.test(colony_lost_pct ~ state, data = subset.df.bees)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  colony_lost_pct by state
## Kruskal-Wallis chi-squared = 174.11, df = 18, p-value < 2.2e-16

There is a significant difference in colony losyt percentage by state, even in only the top 20 states for colony lost (X^2(18)=174.11,p <.0001).

Analysis of Percent of Colonies Lost by Stressor

The same analysis can be done for percent of colonies lost by stressor.When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by stressor. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is different dependent on the stressor. This alternative hypothesis means that at least one stressor is different from the others.

#make a linear model
pct_stressor_model  <- lm(colony_lost_pct ~ stressor, data = merged_bees)

## Create a QQ plot of residuals
ggqqplot(residuals(pct_stressor_model))

## Look at the histogram of residuals
hist.model.stressorpct <- hist(pct_stressor_model$residuals)

hist.model.stressorpct

## $breaks
##  [1] -15 -10  -5   0   5  10  15  20  25  30  35  40  45
## 
## $counts
##  [1]  198 1578 2280 1656  630  408  144   42   42    6   18    6
## 
## $density
##  [1] 0.0056506849 0.0450342466 0.0650684932 0.0472602740 0.0179794521
##  [6] 0.0116438356 0.0041095890 0.0011986301 0.0011986301 0.0001712329
## [11] 0.0005136986 0.0001712329
## 
## $mids
##  [1] -12.5  -7.5  -2.5   2.5   7.5  12.5  17.5  22.5  27.5  32.5  37.5  42.5
## 
## $xname
## [1] "pct_stressor_model$residuals"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

## Look at residuals plot:
plot(pct_stressor_model,1)

#The qqplot is curved, the histogram is skewed, and the residuals plot is a straight line, suggesting non-normality

#check for outliers
merged_bees %>% 
  group_by(stressor) %>%
  identify_outliers(colony_lost_pct)

## # A tibble: 204 × 16
##    stressor  year  months  state colony_n colony_max colony_lost colony_lost_pct
##    <fct>     <fct> <fct>   <fct>    <dbl>      <dbl>       <dbl>           <dbl>
##  1 Disesases 2015  Januar… Illi…     6000      10500        4200              40
##  2 Disesases 2015  Januar… Kent…     7500      10500        4100              39
##  3 Disesases 2015  Januar… Mary…     7500      10000        4100              41
##  4 Disesases 2015  Januar… Ohio     18000      22000       10500              48
##  5 Disesases 2015  Januar… Penn…    14000      21000        6500              31
##  6 Disesases 2015  Januar… West…     4700       6000        1800              30
##  7 Disesases 2015  July-S… Arka…    23000      30000        9000              30
##  8 Disesases 2015  Octobe… Ariz…    36000      39000       12000              31
##  9 Disesases 2015  Octobe… Kans…     8500       8500        3400              40
## 10 Disesases 2016  Januar… Okla…     6000       6000        2900              48
## # ℹ 194 more rows
## # ℹ 8 more variables: colony_added <dbl>, colony_reno <dbl>,
## #   colony_reno_pct <dbl>, stress_pct <dbl>, transformedcolony <dbl>,
## #   sqrtpercent <dbl>, is.outlier <lgl>, is.extreme <lgl>

#There are extreme outliers in the data, cannot do an ANOVA 

#check for equal variance 
merged_bees %>% levene_test(colony_lost_pct ~ stressor) #unequal variance

## # A tibble: 1 × 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     5  7002  6.01e-28     1

ANOVA requirements are not met, do a Kruskal-Wallis:

kruskal.test(colony_lost_pct ~ stressor, data = merged_bees)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  colony_lost_pct by stressor
## Kruskal-Wallis chi-squared = 0, df = 5, p-value = 1

There is not a significant difference in percent of colonies lost dependent on the stressor present (X^2(5)=0,p=1).

Post-Hoc Analysis

Knowing that there is a significant difference in percent of colonies lost dependent on year, month, and state, we can perform a post-hoc test to determine where exactly these differences occur. Because we have used a Kruskal-Wallis, we will use a nonparametric post hoc, Dunn’s Test, for this.

First, we will see where the difference is by year:

#Dunn's Post Hoc Test for Year
merged_bees %>%
  dunn_test(colony_lost_pct ~ year, p.adjust.method = "bonferroni")

## # A tibble: 21 × 9
##    .y.        group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
##  * <chr>      <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
##  1 colony_lo… 2015   2016    1128  1122    -0.949 3.43e- 1 1   e+ 0 ns          
##  2 colony_lo… 2015   2017    1128  1122    -1.96  4.98e- 2 1   e+ 0 ns          
##  3 colony_lo… 2015   2018    1128  1122    -0.212 8.32e- 1 1   e+ 0 ns          
##  4 colony_lo… 2015   2019    1128   840     0.596 5.51e- 1 1   e+ 0 ns          
##  5 colony_lo… 2015   2020    1128  1116    -7.18  7.05e-13 1.48e-11 ****        
##  6 colony_lo… 2015   2021    1128   558    -4.98  6.32e- 7 1.33e- 5 ****        
##  7 colony_lo… 2016   2017    1122  1122    -1.01  3.12e- 1 1   e+ 0 ns          
##  8 colony_lo… 2016   2018    1122  1122     0.735 4.62e- 1 1   e+ 0 ns          
##  9 colony_lo… 2016   2019    1122   840     1.47  1.41e- 1 1   e+ 0 ns          
## 10 colony_lo… 2016   2020    1122  1116    -6.22  4.88e-10 1.03e- 8 ****        
## # ℹ 11 more rows

There is a significant difference between: 2015 and 2020 (p<.0001), 2015 and 2021 (p<.0001), 2016 and 2020 (p<.0001), 2016 and 2021 (p<.001), 2017 and 2020 (p<.0001), 2017 and 2021 (p<.05), 2018 and 2020 (p<.0001), 2018 and 2021 (p<.0001), 2019 and 2020 (p<.0001), and 2019 and 2021 (p<.0001),

Next, we can do the same for month:

#Dunn's Post Hoc Test on Month:
merged_bees %>%
  dunn_test(colony_lost_pct ~ months, p.adjust.method = "bonferroni")

## # A tibble: 6 × 9
##   .y.       group1 group2    n1    n2 statistic         p     p.adj p.adj.signif
## * <chr>     <chr>  <chr>  <int> <int>     <dbl>     <dbl>     <dbl> <chr>       
## 1 colony_l… Janua… April…  1950  1680   -25.1   1.77e-139 1.06e-138 ****        
## 2 colony_l… Janua… July-…  1950  1692    -6.12  9.55e- 10 5.73e-  9 ****        
## 3 colony_l… Janua… Octob…  1950  1686    -5.23  1.66e-  7 9.98e-  7 ****        
## 4 colony_l… April… July-…  1680  1692    18.4   1.36e- 75 8.18e- 75 ****        
## 5 colony_l… April… Octob…  1680  1686    19.2   2.16e- 82 1.29e- 81 ****        
## 6 colony_l… July-… Octob…  1692  1686     0.848 3.97e-  1 1   e+  0 ns

There is a significant difference between: Jan-March compared to April-June, July-September, and October-December (p<.0001), April-June and July-September (p<.0001), and April-June and October-December (p<.0001). That said, all months differed from January-March and April-June.

Last, we can do the same for state. We will use the top 20 states because if we use all 47, we cannot reliably use pairwise comparisons:

#Dunn's Post Hoc Test on State
subset.df.bees %>%
  dunn_test(colony_lost_pct ~ state, p.adjust.method = "bonferroni")

## # A tibble: 171 × 9
##    .y.          group1 group2    n1    n2 statistic       p   p.adj p.adj.signif
##  * <chr>        <chr>  <chr>  <int> <int>     <dbl>   <dbl>   <dbl> <chr>       
##  1 colony_lost… Alaba… Arizo…   150   150     4.41  1.02e-5 0.00175 **          
##  2 colony_lost… Alaba… Arkan…   150   150    -0.699 4.84e-1 1       ns          
##  3 colony_lost… Alaba… Color…   150   150    -1.28  2.00e-1 1       ns          
##  4 colony_lost… Alaba… Flori…   150   150    -0.105 9.16e-1 1       ns          
##  5 colony_lost… Alaba… Georg…   150   150    -0.264 7.92e-1 1       ns          
##  6 colony_lost… Alaba… Illin…   150   150    -0.725 4.69e-1 1       ns          
##  7 colony_lost… Alaba… India…   150   150    -2.74  6.09e-3 1       ns          
##  8 colony_lost… Alaba… Kansas   150   150     3.99  6.58e-5 0.0112  *           
##  9 colony_lost… Alaba… Kentu…   150   150    -0.247 8.05e-1 1       ns          
## 10 colony_lost… Alaba… Massa…   150   150    -4.36  1.32e-5 0.00226 **          
## # ℹ 161 more rows

There is a significant difference in percent of colonies lost between: Alabama and Arizona, Indiana, Massachusetts, Pennsylvania, Arkansas

Arizona and Colorado, Florida, Georgia, Illinois, Indiana, Kentucky, Massachusetts, New Mexico, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, Wisconsin

Arkansas and Kansas, Massachusetts

Colorado and Indiana

Florida and Kansas, Massachusetts, Pennsylvania

Georgia and Kansas, Massachusetts

Illinois and Kansas, Massachusetts

Indiana and Kansas

Kansas and Kentucky, Massachusetts, New Mexico, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, Wisconsin, Massachusetts

Massachusetts and New Mexico, Tennessee

Pennsylvania and Tennessee

Part 4 - Conclusion

Results

After analyzing our data, it is apparent that year (X^2(6)=97.202,p<0.0001), month (X^2(3)=701.85,p<0.0001), and state (X^2(46)=1553.3,p<0.0001) have a significant impact on percent of bee colonies lost, but that stressor ((X^2(5)=0,p=1) does not. Year-wise, there was a significant difference between: 2015 and 2020 (p<.0001), 2015 and 2021 (p<.0001), 2016 and 2020 (p<.0001), 2016 and 2021 (p<.001), 2017 and 2020 (p<.0001), 2017 and 2021 (p<.05), 2018 and 2020 (p<.0001), 2018 and 2021 (p<.0001), 2019 and 2020 (p<.0001), and 2019 and 2021 (p<.0001). For month, there was a significant difference between: Jan-March compared to April-June, July-September, and October-December (p<.0001), April-June and July-September (p<.0001), and April-June and October-December (p<.0001). That said, all months differed from January-March and April-June. There is also a significant difference in colony loss percentage by state, even in only the top 20 states for colony lost (X^2(18)=174.11,p <.0001). State-wise, there were various pairwise differences, particularly when comparing Arizona, Kansas, and Massachusetts to other states. Seemingly, of the top 20 states, Massachusetts had the least colony loss and Kansas and Arizona had the most. This means that the year it was had a signficant impact on bee colony loss, as did the month/season, and the state the data were from. From this analysis, it is apparent that different stressor types did not have a different impact on bee colony loss from each other, meaning that either all of the stressors had the same effect on the bee colony loss.

Using a post hoc analysis, it was determined that there was a statistically significant difference between all years analyzed (2015, 2016, 2017, 2018, 2019, 2020, 2021) and the bee colony loss in 2020 and 2021, respectively. This means that in 2020 and 2021, the levels of bee colony loss were significantly lower (as shown in the box plot) than 2015, 2016, 2017, 2018, 2019, and each other. This means that for some reason, likely an environmental one, 2020 and 2021 had decreased bee colony loss.

From the graphs in part 1 it is clear that January to March had the most colonies lost, while April to June had the least. This is further supported by our post hoc analysis, which showed that there was a significant difference in colony loss in January to March compared to all other times of year, which, from the boxplot, we can tell is significantly more loss than other months. There is also a significant difference between the colony loss in April to June compared to all other months, which, from the boxplot, we can tell is because there was significantly more colony lost. This means, to a statitsically significant level, January to March had the most colony loss, while April to June had the least. Being that, in North America where data are collected, January to March bring winter weather, while April to June bring warmer weather and increased blooming plants to pollinate, we may speculate that weather and pollen resources have an impact on bee colony loss. A limitation to this analysis, though, is that such data on temperature and pollen levels are not available to support this point.

When comparing the top 20 states to eachother using a post hoc test, Arizona, Kansas, and Massachusetts were significantly different from every other state. Looking at the boxplot, it would appear that Kansas and Arizona had the most colony loss, while Massachusetts appeared to have the least out of the top 20. This aligns with previous speculation that climate/weather could be a stake, since Arizona and Kansas are largely warmer than Massachusetts, but further data collection and analysis would be needed to prove this. Further, additional post hoc testing would be necessary to determine differences with any states not in the top 20 for colony loss.

Interestingly, this data does not support that any specific stressor has a greater impact on colony loss than the others, meaning all stressors were equally dangerous to bees.

Limitations

Some limitations of our analysis can be seen with the Shapiro-Wilk test. Since the bee colony loss data has over 5000 data entries, we cannot use this test. Without the Shapiro-Wilks, it is difficult to confidently determine if our data is normal before running statistical analyses. Similarly, because there was data for so many states, it was difficult to analyze the full data set while still having readable histograms and boxplots. To mitigate this, we used the top 20 states, but this was at the expense of not analyzing the other 27 state’s data fully. We also do not have data for honey production, for instance, to determine the actual impacts of this colony lost. Likewise, without an analysis more advanced than the one ran herein (i.e., a multiple regression analysis or a two-way ANOVA), we were unable to capture whether combinations of stressors, rather than individual stressors alone, caused colony loss.

Implications and Next Steps

The analysis done is important because it shows us how and why colonies of bees are being lost. If too many bees are lost, the consequences would be dramatic. Using the statistical tests in Rmarkdown helped us better understand correlations between colony loss and explanatory variables (year, month, state, and stressor), and show if there is statistical significance between colony loss and variables. If there is, steps can be taken to combat bee colony loss and re-introduce/protect the existing colonies. For instance, the introduction of better quality care during months of higher loss could possibly limit future losses. Also, we can possibly limit the amount of stressors that are man-made (i.e. pesticides) by creating alternatives that keep bee colonies safer and healthier, which would be particularly useful if certain stressors are later found to be significant.

As mentioned in when discussing limitations, a multiple regression analysis would be a useful future step to help further elucidate whether combinations of stressors are responsible for colony loss. With this information, more informed decisions on ways to help protect bee colonies could be formed.

References

Jonthegeek. (2022). Bee Colonies (rfordatascience/ tidytuesday/ data/ 2022/2022-01-11/ readme.md). GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-01-11/readme.md

National Agricultural Statistics Service (NASS). (2022). Honey Bee Colonies. National Agricultural Statistics Service (NASS), Agricultural Statistics Board, United States Department of Agriculture (USDA). ISSN: 2470-993X. https://downloads.USDA.library.Cornell.edu/USDA-esmis/files/rn301137d/kh04fx05c/qb98nn582/hcny0822.pdf

Roberts, C. (2025). What happens to ecosystems when bees disappear? Earth.Org. https://earth.org/what-happens-to-ecosystems-when-bees-disappear/

Sarnoff, L. (2025). Honey bee colonies could face 70% losses in 2025, impacting agriculture. ABC News. https://abcnews.go.com/US/honey-bee-colonies-face-70-losses-2025-impacting/story?id=120191720

Woods, J. (2021). US beekeepers continue to report high colony loss rates, no clear progression toward improvement. Office of Communications and Marketing. https://ocm.auburn.edu/newsroom/news_articles/2021/06/241121-honey-bee-annual-loss-survey-results.php