The data used herein was collected and provided by the USDA in their annual honey bee report released in August 2022 (jonthegeek, 2022). The USDA collected data on a stratified sample of honeybee colonies across the United States through the National Agricultural Statistics Service, who conduct a “Bee and Honey Inquiry” up to 4 times per year to gain data on colony numbers from field offices around the country (NASS, 2022). In this inquiry, colony owners can either electronically input their data, respond to mail requesting information, or speak with the NASS over the phone to report their honeybee conditions (NASS, 2022). For any missing data and to check all data points, the USDA used historic data (NASS, 2022). It is notable though, that because it was not possible to sample every single bee colony in the United States and these data were collected through a survey, the sample is susceptible to sampling error, but they used estimation and projections to minimize this error (NASS, 2022). This data was published to the TidyTuesday page on GitHub, from which we are accessing it (jonthegeek,2022).
The USDA reported data on the number of honeybee colonies, the maximum number of colonies in an area, the number of colonies lost, the percent of colonies lost, the percent of colonies added, the number and percent of colonies renovated, information on colony health stressors and information on Colony Collapse Disorder symptoms, as well as month and state location information (NASS, 2022).
The data used focuses on bee colony losses in the United States from 2008 to 2021. Specifically, the data is finding at what time of year (season) bee colony losses are the highest and has it changed over the 14 year time period. Two seasons, the summer and winter seasons, are primarily being focused in this data set.
Bee colonies play a major role in pollinating both plants and our food. Without the bees doing this, we would lose out on a third of our global food production (Roberts, 2025). Putting this into numbers, in the United States specifically, honey bees pollinate roughly $15 billion of food crops each year (Woods, 2021). Not only do they play a crucial role in food production directly, if bees were to go extinct, there would be indirect impacts. For example, there would be a loss of biodiversity (specifically with plant diversity and strength) in all ecosystems that rely on pollination, which would then create a chain reaction that would deplete and destroy every other species within the ecosystem (Roberts, 2025). All of this considered, it is absolutely vital to study and understand bee colonies to protect our food sources and ensure biodiversity remains in our ecosystems.
Using this data, we hope to inquire on what is happening to bee colonies around the United States. More specifically, we seek to analyze what factors are most greatly impacting bee colony loss, so that we can better understand the dangers posed to bee colonies. Here, we plan to explore the impacts of year, season (month), region (state), and known stressors on the percentage of bee colonies lost at a given time.
Response variables, or dependent variables, in this data are variables that were impacted as a result of stressors, months, and states, on bee colonies. In the case of this analysis, we are concerned primarily with percent of colonies lost. This variable is a numerical, continuous variable, communicating the percentage of bee colonies in an area lost. This variable is particularly useful because it allows us to standardize colonies lost despite differences in original colony number.
Explanatory variables, or independent variables, in this data set are the items that stayed consistent over the time period that the bee colonies didn’t impact. In this set, the year, month, states, and stressors are not impacted by the bees. This data is categorical and is extremely useful in determining, what, when, and where bee colonies are being lost the most and what can be possibly done to replenish and repopulate the bees.
First, we must load in and clean our data. For this, we will use a script provided by the TidyTuesday page on GitHub (jonthegeek, 2022).
#Access and clean data
#Call in Data:
colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-11/colony.csv')
## Rows: 1222 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): months, state
## dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(colony)
## # A tibble: 6 × 10
## year months state colony_n colony_max colony_lost colony_lost_pct
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2015 January-March Alabama 7000 7000 1800 26
## 2 2015 January-March Arizona 35000 35000 4600 13
## 3 2015 January-March Arkansas 13000 14000 1500 11
## 4 2015 January-March California 1440000 1690000 255000 15
## 5 2015 January-March Colorado 3500 12500 1500 12
## 6 2015 January-March Connectic… 3900 3900 870 22
## # ℹ 3 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>
stressor <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-01-11/stressor.csv')
## Rows: 7332 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): months, state, stressor
## dbl (2): year, stress_pct
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(stressor)
## # A tibble: 6 × 5
## year months state stressor stress_pct
## <dbl> <chr> <chr> <chr> <dbl>
## 1 2015 January-March Alabama Varroa mites 10
## 2 2015 January-March Alabama Other pests/parasites 5.4
## 3 2015 January-March Alabama Disesases NA
## 4 2015 January-March Alabama Pesticides 2.2
## 5 2015 January-March Alabama Other 9.1
## 6 2015 January-March Alabama Unknown 9.4
#Merge Instead to Remove Duplicated columns and make sure data is merged based off common observations (month, year, stressor)
#see https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right for how to merge data
merged_bees<-merge(colony, stressor, by=c("year","months","state"))
Our goal is to use the data to explore what the leading stressors of bees are and how they impact the percent of colonies lost, in relation to the year, state, and season (months of year). To do this, we must first explore and visualize the data.
First, we must explore percent of colonies lost, our response variable. To explore this variable, we will first inquire about it’s mean, median, standard deviation, and standard error:
#mean = 11.38185% of colonies
mean_bees<-mean(merged_bees$colony_lost_pct,na.rm=TRUE)
mean_bees
## [1] 11.38185
#median= 10% of colonies
median(merged_bees$colony_lost_pct,na.rm=TRUE)
## [1] 10
#standard deviation = 7.223226
sd_bees<-sd(merged_bees$colony_lost_pct,na.rm=TRUE)
sd_bees
## [1] 7.223226
#calculate SE (stdev/sqrt n) = 0.08435678
n<-count(merged_bees) #n= 7332
SE<-(sd_bees/(sqrt(n)))
SE
## n
## 1 0.08435678
These statistics tell us that the average (mean) percentage of colonies lost across the board was 11.38185%, and that the median percentage of colonies lost was 10%. The percent of colonies lost also had a standard deviation of 7.223226 and a standard error of 0.084. These statistics suggest that given the large sample size, the data is fairly close to population data (standard error is 0.084%), but that there is a bit of spread around the mean (standard deviation is 7.22%).
To further explore our response variable, it is also useful to visualize the shape of the data using a histogram:
percent_lost_hist<-ggplot(merged_bees, aes(x=colony_lost_pct))+
geom_histogram(fill="yellow", color="black",binwidth=2)+ theme_classic()+
xlab("Percent of Bee Colonies Lost")+ylab("Frequency")
percent_lost_hist
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).
From the shape of the histogram, it appears the data may be positively skewed, with more data points close to zero, and a few data points further out on the positive end/tail. More tests, and potentially transformations, will thus have to be done on the data before statistical analysis.
We can also explore the frequency of different stressors (explanatory variable), using a histogram to visualize if/any stressors are more common:
Freq_Stress_Hist <-ggplot(merged_bees, aes(x=stressor))+
geom_histogram(stat="count", fill="yellow", color="black",bins=20)+ theme_classic()+
xlab("Stressors")+ylab("Frequency")
## Warning in geom_histogram(stat = "count", fill = "yellow", color = "black", :
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
Freq_Stress_Hist
This histogram shows us that there is an equal number of data points for each stressor.
Next, we must explore how our explanatory variables, year, season (month), location (state), and stressor type, relate to our response variable (percent of colony lost).Being that these are all categorical variables, individual descriptive statistics cannot be run for each variable. Instead, we must visualize how these explanatory variables impact the response variable, percent of colonies lost.
First, we want to explore how the year impacted the percent of colonies lost. To do this, we can first compare the distribution of percent of colonies lost by year using a histogram:
#Year must be a factor for graph to work, make year a factor:
is.factor(merged_bees$year) #not a factor
## [1] FALSE
merged_bees$year <- as.factor(as.character(merged_bees$year)) ##make it a factor
is.factor(merged_bees$year)
## [1] TRUE
levels(merged_bees$year) #in numerical order, 2015-2021, good to go
## [1] "2015" "2016" "2017" "2018" "2019" "2020" "2021"
#Make Histogram:
colonybyyear <- gghistogram(merged_bees, x = "colony_lost_pct",
add = "mean", rug = TRUE,
color = "year", fill = "year", bins = 12,
palette = c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100","#AA4499" ))
facet(colonybyyear, facet.by = "year")
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).
When visually inspecting these histograms, it appears that there was not a huge difference between years when it came to the distribution, but that 2021 potentially had lower colony loss percentages across the board. To better visualize differences in percentage loss dependent on year, a box plot can be used:
lossbyyear <- ggplot(merged_bees, aes(x=year, y=colony_lost_pct, fill = year)) +
geom_boxplot()
lossbyyear
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
#add jitter dots for each data point to better visualize spread of data:
lossbyyear + geom_jitter(shape=20, position=position_jitter(0.1)) +
theme_classic() +
labs(x = "Year", y = "Percent of Colonies Lost") +
scale_x_discrete(labels = c('2015','2016','2017','2018','2019','2020','2021')) +
scale_fill_manual(values=c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100","#AA4499"),
name="Year",
breaks=c("2015", "2016", "2017","2018","2019","2020","2021"))
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 324 rows containing missing values or values outside the scale range
## (`geom_point()`).
lossbyyearjitter <- ggboxplot(merged_bees, x = "year", y = "colony_loss_pct",
color = "year", palette = "jco",
add = "jitter",na.rm=TRUE)
These box plots show that, visually, there is a lot of spread to the percentage of colony lost in each year, but that over the years, there does not appear to be a huge difference in colony lost percentage, which can be further analyzed and confirmed.
Next, we want to look at the distribution of percent loss by month using a histogram:
is.factor(merged_bees$months)
## [1] FALSE
merged_bees$months <- as.factor(as.character(merged_bees$months)) ##make it a factor
is.factor(merged_bees$months)
## [1] TRUE
levels(merged_bees$months)
## [1] "April-June" "January-March" "July-September" "October-December"
#
merged_bees <- merged_bees%>%
reorder_levels(months,order=c("January-March","April-June","July-September","October-December"))
percent_lostbymonth <- gghistogram(merged_bees, x = "colony_lost_pct",
add = "mean", rug = TRUE,
color = "months", fill = "months", bins = 12,
palette = c("#88CCEE","#CC6677","#DDCC77","#117733" ))
facet(percent_lostbymonth, facet.by = "months")
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).
The histograms above show the percent loss of colonies during 4 sets of months (3 months per set), which appear to have slightly different, but relatively normal distributions. Further testing is needed to confirm any differences. We can also use a box plot to help us better visualize the data:
lossbymonths <- ggplot(merged_bees, aes(x=months, y=colony_lost_pct, fill = months)) +
geom_boxplot()
lossbymonths
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
#add jitter dots for each data point to better visualize spread of data:
lossbymonths + geom_jitter(shape=20, position=position_jitter(0.1)) +
theme_classic() +
labs(x = "Month", y = "Percent of Colonies Lost") +
scale_x_discrete(labels = c('January-March','April-June','July-September','October-December')) +
scale_fill_manual(values=c("#88CCEE","#CC6677","#DDCC77","#117733"),
name="Months",
breaks=c("January-March","April-June","July-September","October-December"))
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 324 rows containing missing values or values outside the scale range
## (`geom_point()`).
lossbymonthsjitter <- ggboxplot(merged_bees, x = "months", y = "colony_lost_pct",
color = "months", palette = "jco",
add = "jitter",na.rm=TRUE)
The box plots show that there could be differences in percent loss by month, but that a lot of variation is present in the data, and further testing is needed to confirm if these differences are significant.
Very similar to loss by month, looking at the loss by state could allow us to see the areas where bee colony lost is the highest. We will use histograms to show distributions for each state, as well as the United States as a whole, to see the loss of bee colonies.
is.factor(merged_bees$state)
## [1] FALSE
merged_bees$state <- as.factor(as.character(merged_bees$state)) ##make it a factor
is.factor(merged_bees$state)
## [1] TRUE
levels(merged_bees$state)
## [1] "Alabama" "Arizona" "Arkansas" "California"
## [5] "Colorado" "Connecticut" "Florida" "Georgia"
## [9] "Hawaii" "Idaho" "Illinois" "Indiana"
## [13] "Iowa" "Kansas" "Kentucky" "Louisiana"
## [17] "Maine" "Maryland" "Massachusetts" "Michigan"
## [21] "Minnesota" "Mississippi" "Missouri" "Montana"
## [25] "Nebraska" "New Jersey" "New Mexico" "New York"
## [29] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [33] "Oregon" "Other States" "Pennsylvania" "South Carolina"
## [37] "South Dakota" "Tennessee" "Texas" "United States"
## [41] "Utah" "Vermont" "Virginia" "Washington"
## [45] "West Virginia" "Wisconsin" "Wyoming"
#Histogram
percent_lostbystate <- gghistogram(merged_bees, x = "colony_lost_pct",
add = "mean", rug = TRUE,
color = "state", fill = "state", binwidth=1,
palette=c("#3E92CC", "#FFD166", "#EF476F", "#06D6A0", "#118AB2","#073B4C", "#FFD23F", "#F8961E", "#F94144", "#F9C74F","#90BE6D", "#43AA8B", "#577590", "#7400B8", "#6930C3", "#5E60CE", "#5390D9", "#4EA8DE", "#48BFE3", "#56CFE1","#64DFDF", "#72EFDD", "#80FFDB", "#FFADAD", "#FFD6A5","#FDFFB6", "#CAFFBF", "#9BF6FF", "#A0C4FF", "#BDB2FF","#FFC6FF", "#FFFF1F", "#D00000", "#FFBA08", "#3F88C5","#032B43", "#136F63", "#F1FA0E", "#A8DADC", "#457B9D","#1D3557", "#FF6B6B", "#4ECDC4", "#556270", "#C7F464","#FF6F91", "#FF9671"))+theme(legend.position="none")
facet(percent_lostbystate, facet.by = "state",na.rm=TRUE)
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).
Because there are so many states included, a histogram can be difficult to interpret. We can use a box plot to better visualize these differences:
lossbystate <- ggplot(merged_bees, aes(x=state, y=colony_lost_pct, fill = state)) +
geom_boxplot()+theme(axis.text.x=element_text(angle=45,vjust=1, hjust=1))
lossbystate
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The box plot shows us the variation present in the data, and suggests some group differences by month, but further testing and clarification are needed. Because there are so many states, we are going to subset our data to the top 20 states, and move forward with analysis from there. First we will subset our data and create a histogram to see the distribution of the data:
top_states <- merged_bees %>%
group_by(state) %>%
summarise(mean_pct = mean(colony_lost_pct, na.rm = TRUE)) %>%
arrange(desc(mean_pct)) %>%
slice_head(n = 20)
# Print the top 20 species
top_states
## # A tibble: 20 × 2
## state mean_pct
## <fct> <dbl>
## 1 Kansas 19.3
## 2 Arizona 17.7
## 3 New Mexico 16.9
## 4 Ohio 15.4
## 5 Alabama 15.3
## 6 Kentucky 14.7
## 7 Arkansas 14.6
## 8 Tennessee 14.6
## 9 Illinois 14.5
## 10 Colorado 14.0
## 11 Florida 13.4
## 12 United States 13.4
## 13 Georgia 13.2
## 14 Virginia 13.2
## 15 Wisconsin 13.0
## 16 Indiana 12.9
## 17 North Carolina 12.6
## 18 Pennsylvania 12.3
## 19 Massachusetts 12.2
## 20 West Virginia 12.2
levels(top_states$state)
## [1] "Alabama" "Arizona" "Arkansas" "California"
## [5] "Colorado" "Connecticut" "Florida" "Georgia"
## [9] "Hawaii" "Idaho" "Illinois" "Indiana"
## [13] "Iowa" "Kansas" "Kentucky" "Louisiana"
## [17] "Maine" "Maryland" "Massachusetts" "Michigan"
## [21] "Minnesota" "Mississippi" "Missouri" "Montana"
## [25] "Nebraska" "New Jersey" "New Mexico" "New York"
## [29] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [33] "Oregon" "Other States" "Pennsylvania" "South Carolina"
## [37] "South Dakota" "Tennessee" "Texas" "United States"
## [41] "Utah" "Vermont" "Virginia" "Washington"
## [45] "West Virginia" "Wisconsin" "Wyoming"
# Bar plot of top 20 endangered bee states
ggplot(top_states, aes(x = reorder(state, -mean_pct), y = mean_pct)) +
geom_bar(stat = "identity", fill = "steelblue") + theme_bw() +
labs(x = "States", y = "Percent Colony Lost", title = "Most Endangered Bee States") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
is.factor(merged_bees$state)
## [1] TRUE
subset.df.bees <- merged_bees[merged_bees$state %in% c('Kansas','Arizona','New Mexico','Ohio','Alabama','Kentucky','Arkansas','Tennessee', 'Illinois','Colorado','Florida','Georgia','Virginia','Wisconsin','Indiana','North Carolina','Pennsylvania','Massachusetts','West Virginia'), ]
subset.df.bees <- droplevels(subset.df.bees)
levels(subset.df.bees$state) #check to see if you only have your top 20
## [1] "Alabama" "Arizona" "Arkansas" "Colorado"
## [5] "Florida" "Georgia" "Illinois" "Indiana"
## [9] "Kansas" "Kentucky" "Massachusetts" "New Mexico"
## [13] "North Carolina" "Ohio" "Pennsylvania" "Tennessee"
## [17] "Virginia" "West Virginia" "Wisconsin"
#Make a Histogram:
percent_lostbytop20state <- gghistogram(subset.df.bees, x = "colony_lost_pct",
add = "mean", rug = TRUE,
color = "state", fill = "state", binwidth=5,
palette=c("#3E92CC", "#FFD166", "#EF476F","#073B4C", "#F8961E", "#F94144", "#F9C74F","#90BE6D", "#43AA8B", "#7400B8", "#6930C3", "#FFADAD", "#FF6B6B","#556270", "#A0C4FF", "#C7F464","#5E60CE", "#FFFF1F", "#D00000", "#FFBA08"))+theme(legend.position="none")
facet(percent_lostbytop20state, facet.by = "state", na.rm=TRUE)
## Warning: Removed 114 rows containing non-finite outside the scale range
## (`stat_bin()`).
We can also look for differences between the states on a boxplot:
lossbytop20state <- ggplot(subset.df.bees, aes(x=state, y=colony_lost_pct, fill = state)) +
geom_boxplot()+theme(axis.text.x=element_text(angle=45,vjust=1, hjust=0.5))
lossbytop20state
## Warning: Removed 114 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Before attempting to determine whether certain stressors were associated with higher colony percentage losses, the distribution of colony loss percentage by stressor type must also be visually represented using a histogram:
#First check what stressors were tested, and make sure R knows they are factors:
is.factor(merged_bees$stressor) #not a factor
## [1] FALSE
merged_bees$stressor <- as.factor(as.character(merged_bees$stressor)) ##make it a factor
is.factor(merged_bees$stressor)
## [1] TRUE
levels(merged_bees$stressor)
## [1] "Disesases" "Other" "Other pests/parasites"
## [4] "Pesticides" "Unknown" "Varroa mites"
#Make sure percent is numeric:
is.numeric(merged_bees$colony_lost_pct)
## [1] TRUE
## Reorder Data:
merged_bees <- merged_bees %>%
reorder_levels(stressor, order = c("Disesases", "Varroa mites","Other pests/parasites", "Pesticides", "Other","Unknown"))
#Make Histogram:
pctlossbystressor <- gghistogram(merged_bees, x = "colony_lost_pct",
add = "mean", rug = TRUE,
color = "stressor", fill = "stressor", bins = 12,
palette = c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100" ))
facet(pctlossbystressor, facet.by = "stressor")
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_bin()`).
These histograms reveal a relatively similar distribution of percentage losses across all stressor types, and differences in colony loss dependent on stressor type may be more easily visually represented using a box plot:
stressorpercent <- ggplot(merged_bees, aes(x=stressor, y=colony_lost_pct, fill = stressor)) +
geom_boxplot()
stressorpercent
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
#add jitter dots for each data point
stressorpercent + geom_jitter(shape=20, position=position_jitter(0.1)) +
theme_classic() +
labs(x = "Stressor", y = "Percent of Colonies Lost") +
scale_x_discrete(labels = c('Disesases', 'Varroa mites','Other pests/parasites', 'Pesticides', 'Other','Unknown')) +
scale_fill_manual(values=c("#88CCEE","#CC6677","#DDCC77","#117733","#332288","#661100"),
name="Stressor",
breaks=c("Disesases", "Varroa mites","Other pests/parasites", "Pesticides", "Other","Unknown"),
labels=c("Disesases", "Varroa mites","Other pests/parasites", "Pesticides", "Other","Unknown"))
## Warning: Removed 324 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 324 rows containing missing values or values outside the scale range
## (`geom_point()`).
stressorpctjitter <- ggboxplot(merged_bees, x = "stressor", y = "Percent of Colonies Lost",
color = "stressor", palette = "jco",
add = "jitter")
When visualizing colony loss per stressor, it appears that the bulk of colony loss may be equal across groups, but is also apparent that there are large amounts of variance, so more tests will need to be run, particularly for variance and normality.
In addition to these, we can also explore how location, season, and year influence what stressors are present. By distributing stressor presence by year, month (season), and state, we can see when and where certain stressors were most prevalent.
First, it is useful to determine whether the stressors present varied depending on the year. To do this, we can visualize the distribution of stressors each year using a bar plot:
ggplot(merged_bees, aes(x = year, y= stressor, fill=stressor, na.rm=TRUE)) +
geom_bar(position="stack",stat="identity") +theme_classic()+
xlab("Year")+
ylab("Frequency of Stressor")
Looking at a bar graph, it appears that stressors may not have fluctuated by year, but further analysis is needed.It appears that, if anything, there may have been less stressors present overall in 2021, but the proportions of stressors present remained the same, making it unclear whether there were simply less data points in 2021.
We can also look at the frequency of different stressors in each month using a bar plot:
ggplot(merged_bees, aes(x = months, y= stressor, fill=stressor, na.rm=TRUE)) +
geom_bar(position="stack",stat="identity") +theme_classic()+
xlab("Month")+
ylab("Frequency of Stressor")
The bar graph shows us that January through June has a higher frequency of all stressors compared to July through December.
Last, we can look at a bar graph of frequencies of stressors by state:
ggplot(merged_bees, aes(x = state, y= stressor, fill=stressor, na.rm=TRUE)) +
geom_bar(position="stack",stat="identity") +theme_classic()+
xlab("state")+
ylab("Frequency of Stressor")+
theme(axis.text.x=element_text(angle=45,vjust=1, hjust=1))
Using this bar graph, it would appear that stressors are approximately equal in proportion in all states.
Now that data has been visualized, we can begin to analyze the data to determine whether bee colony loss differs by year, season, region, and stressor type. Because percent loss is a numeric, continuous variable, and all 4 of these explanatory variables (year, month, state, stressor) are categorical and have 3 or more groups, one-way ANOVA’s can be used to answer each of these questions, as long as the assumptions for an ANOVA, normality and equal variance between groups, are met.
To elucidate whether the percent of colonies lost varies dependent on the year, we can use an analysis of variance if the data meets normality and equal variance assumptions because there are 7 year groups present. When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by year. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is dependent on the year. This alternative hypothesis means that at least one year is different from the others.
Recall that on the histogram of percent of colonies lost, there appeared to be positive skew. Because of this, we need to verify whether this data meets the normality conditions for an ANOVA.
To verify normality, we first must make a linear model of the data before we can examine its residuals for normality:
pct_year_model <- lm(colony_lost_pct ~ year, data = merged_bees)
From here, we can use a qqplot, a histogram, and a residuals plot to see if the data (its residuals) is normal:
## Create a QQ plot of residuals
ggqqplot(residuals(pct_year_model))
## Look at the histogram of residuals
hist.model.yearpct <- hist(pct_year_model$residuals)
hist.model.yearpct
## $breaks
## [1] -15 -10 -5 0 5 10 15 20 25 30 35 40
##
## $counts
## [1] 180 1518 2358 1620 696 366 156 48 36 6 24
##
## $density
## [1] 0.0051369863 0.0433219178 0.0672945205 0.0462328767 0.0198630137
## [6] 0.0104452055 0.0044520548 0.0013698630 0.0010273973 0.0001712329
## [11] 0.0006849315
##
## $mids
## [1] -12.5 -7.5 -2.5 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5
##
## $xname
## [1] "pct_year_model$residuals"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
## Residuals plot
plot(pct_year_model, 1)
#The qqplot is curved, the histogram is skewed, and the residuals plot does not have randomly scattered points, suggesting non-normality, lets use a Shapiro Wilk Test to verify:
#shapiro.test(pct_year_model$residuals) - error: Shapiro Test cannot be used if n>5000
The qqplot, histogram, and residuals plot of residuals appear non-normal, and a Shapiro Wilk cannot be used to verify this because the dataset is too large, so we are operating under the assumption that this data is nonnormal. Before trying to transform the data, ANOVA are very sensitive to extreme outliers, so lets check for extreme outliers:
merged_bees %>%
group_by(year) %>%
identify_outliers(colony_lost_pct)
## # A tibble: 246 × 14
## year months state colony_n colony_max colony_lost colony_lost_pct
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 2015 January-March Illinois 6000 10500 4200 40
## 2 2015 January-March Illinois 6000 10500 4200 40
## 3 2015 January-March Illinois 6000 10500 4200 40
## 4 2015 January-March Illinois 6000 10500 4200 40
## 5 2015 January-March Illinois 6000 10500 4200 40
## 6 2015 January-March Illinois 6000 10500 4200 40
## 7 2015 January-March Kentucky 7500 10500 4100 39
## 8 2015 January-March Kentucky 7500 10500 4100 39
## 9 2015 January-March Kentucky 7500 10500 4100 39
## 10 2015 January-March Kentucky 7500 10500 4100 39
## # ℹ 236 more rows
## # ℹ 7 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
#There are extreme outliers. We can try to transform the data to see if it makes it more normal, then retest:
merged_bees$transformedcolony<-log(merged_bees$colony_lost_pct)
merged_bees %>%
group_by(year) %>%
identify_outliers(transformedcolony)
## # A tibble: 276 × 15
## year months state colony_n colony_max colony_lost colony_lost_pct
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 2015 April-June Hawaii 13500 13500 120 1
## 2 2015 April-June Hawaii 13500 13500 120 1
## 3 2015 April-June Hawaii 13500 13500 120 1
## 4 2015 April-June Hawaii 13500 13500 120 1
## 5 2015 April-June Hawaii 13500 13500 120 1
## 6 2015 April-June Hawaii 13500 13500 120 1
## 7 2015 April-June Massachuset… 4100 14000 320 2
## 8 2015 April-June Massachuset… 4100 14000 320 2
## 9 2015 April-June Massachuset… 4100 14000 320 2
## 10 2015 April-June Massachuset… 4100 14000 320 2
## # ℹ 266 more rows
## # ℹ 8 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>,
## # transformedcolony <dbl>, is.outlier <lgl>, is.extreme <lgl>
There are several extreme outliers, even after transforming the data, so an ANOVA cannot be used, and a non-parametric test, most likely a Kruskal-Wallis test, must be used instead.
There are extreme outliers, and the data cannot be transformed to remove them.
Before proceeding, we must also test for equal variance between groups. A Levene’s test will help us do this:
merged_bees %>% levene_test(colony_lost_pct ~ year)
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 6 7001 9.47 2.21e-10
Because the variance yields a p value below .05, the variances between groups are not equal.
Because the data does not fit the requirements of an ANOVA, a non-parametric test, a Kruskal-wallis test, must be used:
kruskal.test(colony_lost_pct ~ year, data = merged_bees)
##
## Kruskal-Wallis rank sum test
##
## data: colony_lost_pct by year
## Kruskal-Wallis chi-squared = 97.202, df = 6, p-value < 2.2e-16
This test shows that there is a statistically significant difference in percent of bee colonies lost depending on what year it was (X^2(6)=97.202,p<0.0001).
We can also analyze whether percent of colonies lost varies by month. We can use an analysis of variance if the data meets normality and equal variance assumptions because there are 4 month groups present. When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by month. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is different dependent on the month. This alternative hypothesis means that at least one month is different from the others.
#make a linear model
pct_month_model <- lm(colony_lost_pct ~ months, data = merged_bees)
From here, we can use a qqplot, a histogram, and a residuals plot to see if the data is normal, and check for outliers:
## Create a QQ plot of residuals
ggqqplot(residuals(pct_month_model))
## Look at the histogram of residuals
hist.model.monthpct <- hist(pct_month_model$residuals)
hist.model.monthpct
## $breaks
## [1] -15 -10 -5 0 5 10 15 20 25 30 35 40 45
##
## $counts
## [1] 240 1110 2376 1968 750 330 126 42 36 18 6 6
##
## $density
## [1] 0.0068493151 0.0316780822 0.0678082192 0.0561643836 0.0214041096
## [6] 0.0094178082 0.0035958904 0.0011986301 0.0010273973 0.0005136986
## [11] 0.0001712329 0.0001712329
##
## $mids
## [1] -12.5 -7.5 -2.5 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5
##
## $xname
## [1] "pct_month_model$residuals"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##Create a residuals plot
plot(pct_month_model, 1)
#The qqplot is curved, the histogram is skewed, and the residuals plot is not scattered, suggesting non-normality, and Shapiro Wilk does not work for a data set this size.
#can try to use transformed data to make the data more normal:
transformedmodel_month <- lm(transformedcolony ~ months, data = merged_bees)
plot(transformedmodel_month, 1)
plot(transformedmodel_month, 2)
#still not normal, can try different transformation
merged_bees$sqrtpercent<-sqrt(merged_bees$colony_lost_pct)
sqtransformedmodel_month <- lm(sqrtpercent ~ months, data = merged_bees)
plot(sqtransformedmodel_month, 1)
sqtransformedmodel_month
##
## Call:
## lm(formula = sqrtpercent ~ months, data = merged_bees)
##
## Coefficients:
## (Intercept) monthsApril-June monthsJuly-September
## 3.5356 -0.8640 -0.2621
## monthsOctober-December
## -0.2348
#Data still non-normal
#check for outliers
merged_bees %>%
group_by(months) %>%
identify_outliers(colony_lost_pct)
## # A tibble: 240 × 16
## months year state colony_n colony_max colony_lost colony_lost_pct
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 January-March 2015 Illinois 6000 10500 4200 40
## 2 January-March 2015 Illinois 6000 10500 4200 40
## 3 January-March 2015 Illinois 6000 10500 4200 40
## 4 January-March 2015 Illinois 6000 10500 4200 40
## 5 January-March 2015 Illinois 6000 10500 4200 40
## 6 January-March 2015 Illinois 6000 10500 4200 40
## 7 January-March 2015 Kentucky 7500 10500 4100 39
## 8 January-March 2015 Kentucky 7500 10500 4100 39
## 9 January-March 2015 Kentucky 7500 10500 4100 39
## 10 January-March 2015 Kentucky 7500 10500 4100 39
## # ℹ 230 more rows
## # ℹ 9 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>,
## # transformedcolony <dbl>, sqrtpercent <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
#There are extreme outliers in the data, cannot do an ANOVA
The data does not pass any normality tests.
We can also check for equal variance between groups:
merged_bees %>% levene_test(colony_lost_pct ~ months)
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 3 7004 170. 4.29e-106
The data is not normal, has extreme outliers, and does not have equal variance, so we must use a nonparametric test (Kruskal-Wallis):
kruskal.test(colony_lost_pct ~ months, data = merged_bees)
##
## Kruskal-Wallis rank sum test
##
## data: colony_lost_pct by months
## Kruskal-Wallis chi-squared = 701.85, df = 3, p-value < 2.2e-16
There is a significant difference in percentage of colonies lost based on month (X^2(3)=701.85,p<0.0001)
We can then perform the same analysis steps for percent of colonies lost by state.We can use an analysis of variance if the data meets normality and equal variance assumptions because there are 47 state groups present. When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by state. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is different dependent on the state.This alternative hypothesis means that at least one state is different from the others.
#make a linear model
pct_state_model <- lm(colony_lost_pct ~ state, data = merged_bees)
## Create a QQ plot of residuals
ggqqplot(residuals(pct_state_model))
## Look at the histogram of residuals
hist.model.statepct <- hist(pct_state_model$residuals)
hist.model.statepct
## $breaks
## [1] -20 -15 -10 -5 0 5 10 15 20 25 30 35 40
##
## $counts
## [1] 30 102 1224 2736 1686 732 330 84 42 12 18 12
##
## $density
## [1] 0.0008561644 0.0029109589 0.0349315068 0.0780821918 0.0481164384
## [6] 0.0208904110 0.0094178082 0.0023972603 0.0011986301 0.0003424658
## [11] 0.0005136986 0.0003424658
##
## $mids
## [1] -17.5 -12.5 -7.5 -2.5 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5
##
## $xname
## [1] "pct_state_model$residuals"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##Look at residual plot
plot(pct_state_model, 1)
#The qqplot, histogram, and residuals plot all suggest non-normality
#check for outliers
merged_bees %>%
group_by(state) %>%
identify_outliers(colony_lost_pct)
## # A tibble: 186 × 16
## state year months colony_n colony_max colony_lost colony_lost_pct
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 2020 April-June 8500 8500 4100 48
## 2 Alabama 2020 April-June 8500 8500 4100 48
## 3 Alabama 2020 April-June 8500 8500 4100 48
## 4 Alabama 2020 April-June 8500 8500 4100 48
## 5 Alabama 2020 April-June 8500 8500 4100 48
## 6 Alabama 2020 April-June 8500 8500 4100 48
## 7 Arizona 2015 October-Decemb… 36000 39000 12000 31
## 8 Arizona 2015 October-Decemb… 36000 39000 12000 31
## 9 Arizona 2015 October-Decemb… 36000 39000 12000 31
## 10 Arizona 2015 October-Decemb… 36000 39000 12000 31
## # ℹ 176 more rows
## # ℹ 9 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>, stressor <fct>, stress_pct <dbl>,
## # transformedcolony <dbl>, sqrtpercent <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
#There are extreme outliers in the data, cannot do an ANOVA
#check for equal variance
merged_bees %>% levene_test(colony_lost_pct ~ state) #unequal variance
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 46 6961 21.7 1.68e-166
The data has extreme outliers and does not have equal variance, so we must use a nonparametric test (Kruskal-Wallis):
kruskal.test(colony_lost_pct ~ state, data = merged_bees)
##
## Kruskal-Wallis rank sum test
##
## data: colony_lost_pct by state
## Kruskal-Wallis chi-squared = 1553.3, df = 46, p-value < 2.2e-16
There is a significant difference in percent of bee colonies lost depending on state (X^2(46)=1553.3,p<0.0001).
Because the original dataset is too large to do pairwise comparisons reliably, we will redo this analysis using only the top 20 states for colony loss. Because 20 states are included, we will use an ANOVA if the data meets requirements. First, we must create a linear model and check for normality
#Create a linear model:
pct_top20state_model <- lm(colony_lost_pct ~ state, data = subset.df.bees)
## Create a QQ plot of residuals
ggqqplot(residuals(pct_top20state_model))
## Look at the histogram of residuals
hist.model.monthpct <- hist(pct_top20state_model$residuals)
hist.model.monthpct
## $breaks
## [1] -20 -15 -10 -5 0 5 10 15 20 25 30 35 40
##
## $counts
## [1] 30 78 666 924 504 330 198 60 30 6 18 6
##
## $density
## [1] 0.0021052632 0.0054736842 0.0467368421 0.0648421053 0.0353684211
## [6] 0.0231578947 0.0138947368 0.0042105263 0.0021052632 0.0004210526
## [11] 0.0012631579 0.0004210526
##
## $mids
## [1] -17.5 -12.5 -7.5 -2.5 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5
##
## $xname
## [1] "pct_top20state_model$residuals"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##Create a residuals plot
plot(pct_top20state_model, 1)
#Shapiro Wilk Test for Normality
shapiro.test(subset.df.bees$colony_lost_pct)
##
## Shapiro-Wilk normality test
##
## data: subset.df.bees$colony_lost_pct
## W = 0.91864, p-value < 2.2e-16
The qqplot is curved, histogram is skewed, residual plot has lined up points, and the Shapiro test suggests non-normality (p<.001). Before attempting to transform the data, we must check for extreme outliers:
subset.df.bees %>%
group_by(state) %>%
identify_outliers(colony_lost_pct)
## # A tibble: 72 × 14
## state year months colony_n colony_max colony_lost colony_lost_pct
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 2020 April-June 8500 8500 4100 48
## 2 Alabama 2020 April-June 8500 8500 4100 48
## 3 Alabama 2020 April-June 8500 8500 4100 48
## 4 Alabama 2020 April-June 8500 8500 4100 48
## 5 Alabama 2020 April-June 8500 8500 4100 48
## 6 Alabama 2020 April-June 8500 8500 4100 48
## 7 Arizona 2015 October-Decemb… 36000 39000 12000 31
## 8 Arizona 2015 October-Decemb… 36000 39000 12000 31
## 9 Arizona 2015 October-Decemb… 36000 39000 12000 31
## 10 Arizona 2015 October-Decemb… 36000 39000 12000 31
## # ℹ 62 more rows
## # ℹ 7 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>, stressor <chr>, stress_pct <dbl>, is.outlier <lgl>,
## # is.extreme <lgl>
There are still extreme outliers, so this data cannot be transformed to make it normal. Let’s also check for equal variance before proceeding:
subset.df.bees %>% levene_test(colony_lost_pct ~ state) #unequal variance
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 18 2831 17.9 2.58e-54
Variance of the data is also unequal, so we must use a Kruskal-Wallis Test:
kruskal.test(colony_lost_pct ~ state, data = subset.df.bees)
##
## Kruskal-Wallis rank sum test
##
## data: colony_lost_pct by state
## Kruskal-Wallis chi-squared = 174.11, df = 18, p-value < 2.2e-16
There is a significant difference in colony losyt percentage by state, even in only the top 20 states for colony lost (X^2(18)=174.11,p <.0001).
The same analysis can be done for percent of colonies lost by stressor.When we conduct this analysis, we are operating under the null hypothesis that percent of colonies lost does not vary by stressor. If the data meets our significance level, p<.05, then we can reject this null hypothesis, supporting the alternative hypothesis that the percent of bee colonies lost is different dependent on the stressor. This alternative hypothesis means that at least one stressor is different from the others.
#make a linear model
pct_stressor_model <- lm(colony_lost_pct ~ stressor, data = merged_bees)
## Create a QQ plot of residuals
ggqqplot(residuals(pct_stressor_model))
## Look at the histogram of residuals
hist.model.stressorpct <- hist(pct_stressor_model$residuals)
hist.model.stressorpct
## $breaks
## [1] -15 -10 -5 0 5 10 15 20 25 30 35 40 45
##
## $counts
## [1] 198 1578 2280 1656 630 408 144 42 42 6 18 6
##
## $density
## [1] 0.0056506849 0.0450342466 0.0650684932 0.0472602740 0.0179794521
## [6] 0.0116438356 0.0041095890 0.0011986301 0.0011986301 0.0001712329
## [11] 0.0005136986 0.0001712329
##
## $mids
## [1] -12.5 -7.5 -2.5 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5
##
## $xname
## [1] "pct_stressor_model$residuals"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
## Look at residuals plot:
plot(pct_stressor_model,1)
#The qqplot is curved, the histogram is skewed, and the residuals plot is a straight line, suggesting non-normality
#check for outliers
merged_bees %>%
group_by(stressor) %>%
identify_outliers(colony_lost_pct)
## # A tibble: 204 × 16
## stressor year months state colony_n colony_max colony_lost colony_lost_pct
## <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Disesases 2015 Januar… Illi… 6000 10500 4200 40
## 2 Disesases 2015 Januar… Kent… 7500 10500 4100 39
## 3 Disesases 2015 Januar… Mary… 7500 10000 4100 41
## 4 Disesases 2015 Januar… Ohio 18000 22000 10500 48
## 5 Disesases 2015 Januar… Penn… 14000 21000 6500 31
## 6 Disesases 2015 Januar… West… 4700 6000 1800 30
## 7 Disesases 2015 July-S… Arka… 23000 30000 9000 30
## 8 Disesases 2015 Octobe… Ariz… 36000 39000 12000 31
## 9 Disesases 2015 Octobe… Kans… 8500 8500 3400 40
## 10 Disesases 2016 Januar… Okla… 6000 6000 2900 48
## # ℹ 194 more rows
## # ℹ 8 more variables: colony_added <dbl>, colony_reno <dbl>,
## # colony_reno_pct <dbl>, stress_pct <dbl>, transformedcolony <dbl>,
## # sqrtpercent <dbl>, is.outlier <lgl>, is.extreme <lgl>
#There are extreme outliers in the data, cannot do an ANOVA
#check for equal variance
merged_bees %>% levene_test(colony_lost_pct ~ stressor) #unequal variance
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 5 7002 6.01e-28 1
ANOVA requirements are not met, do a Kruskal-Wallis:
kruskal.test(colony_lost_pct ~ stressor, data = merged_bees)
##
## Kruskal-Wallis rank sum test
##
## data: colony_lost_pct by stressor
## Kruskal-Wallis chi-squared = 0, df = 5, p-value = 1
There is not a significant difference in percent of colonies lost dependent on the stressor present (X^2(5)=0,p=1).
Knowing that there is a significant difference in percent of colonies lost dependent on year, month, and state, we can perform a post-hoc test to determine where exactly these differences occur. Because we have used a Kruskal-Wallis, we will use a nonparametric post hoc, Dunn’s Test, for this.
First, we will see where the difference is by year:
#Dunn's Post Hoc Test for Year
merged_bees %>%
dunn_test(colony_lost_pct ~ year, p.adjust.method = "bonferroni")
## # A tibble: 21 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 colony_lo… 2015 2016 1128 1122 -0.949 3.43e- 1 1 e+ 0 ns
## 2 colony_lo… 2015 2017 1128 1122 -1.96 4.98e- 2 1 e+ 0 ns
## 3 colony_lo… 2015 2018 1128 1122 -0.212 8.32e- 1 1 e+ 0 ns
## 4 colony_lo… 2015 2019 1128 840 0.596 5.51e- 1 1 e+ 0 ns
## 5 colony_lo… 2015 2020 1128 1116 -7.18 7.05e-13 1.48e-11 ****
## 6 colony_lo… 2015 2021 1128 558 -4.98 6.32e- 7 1.33e- 5 ****
## 7 colony_lo… 2016 2017 1122 1122 -1.01 3.12e- 1 1 e+ 0 ns
## 8 colony_lo… 2016 2018 1122 1122 0.735 4.62e- 1 1 e+ 0 ns
## 9 colony_lo… 2016 2019 1122 840 1.47 1.41e- 1 1 e+ 0 ns
## 10 colony_lo… 2016 2020 1122 1116 -6.22 4.88e-10 1.03e- 8 ****
## # ℹ 11 more rows
There is a significant difference between: 2015 and 2020 (p<.0001), 2015 and 2021 (p<.0001), 2016 and 2020 (p<.0001), 2016 and 2021 (p<.001), 2017 and 2020 (p<.0001), 2017 and 2021 (p<.05), 2018 and 2020 (p<.0001), 2018 and 2021 (p<.0001), 2019 and 2020 (p<.0001), and 2019 and 2021 (p<.0001),
Next, we can do the same for month:
#Dunn's Post Hoc Test on Month:
merged_bees %>%
dunn_test(colony_lost_pct ~ months, p.adjust.method = "bonferroni")
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 colony_l… Janua… April… 1950 1680 -25.1 1.77e-139 1.06e-138 ****
## 2 colony_l… Janua… July-… 1950 1692 -6.12 9.55e- 10 5.73e- 9 ****
## 3 colony_l… Janua… Octob… 1950 1686 -5.23 1.66e- 7 9.98e- 7 ****
## 4 colony_l… April… July-… 1680 1692 18.4 1.36e- 75 8.18e- 75 ****
## 5 colony_l… April… Octob… 1680 1686 19.2 2.16e- 82 1.29e- 81 ****
## 6 colony_l… July-… Octob… 1692 1686 0.848 3.97e- 1 1 e+ 0 ns
There is a significant difference between: Jan-March compared to April-June, July-September, and October-December (p<.0001), April-June and July-September (p<.0001), and April-June and October-December (p<.0001). That said, all months differed from January-March and April-June.
Last, we can do the same for state. We will use the top 20 states because if we use all 47, we cannot reliably use pairwise comparisons:
#Dunn's Post Hoc Test on State
subset.df.bees %>%
dunn_test(colony_lost_pct ~ state, p.adjust.method = "bonferroni")
## # A tibble: 171 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 colony_lost… Alaba… Arizo… 150 150 4.41 1.02e-5 0.00175 **
## 2 colony_lost… Alaba… Arkan… 150 150 -0.699 4.84e-1 1 ns
## 3 colony_lost… Alaba… Color… 150 150 -1.28 2.00e-1 1 ns
## 4 colony_lost… Alaba… Flori… 150 150 -0.105 9.16e-1 1 ns
## 5 colony_lost… Alaba… Georg… 150 150 -0.264 7.92e-1 1 ns
## 6 colony_lost… Alaba… Illin… 150 150 -0.725 4.69e-1 1 ns
## 7 colony_lost… Alaba… India… 150 150 -2.74 6.09e-3 1 ns
## 8 colony_lost… Alaba… Kansas 150 150 3.99 6.58e-5 0.0112 *
## 9 colony_lost… Alaba… Kentu… 150 150 -0.247 8.05e-1 1 ns
## 10 colony_lost… Alaba… Massa… 150 150 -4.36 1.32e-5 0.00226 **
## # ℹ 161 more rows
There is a significant difference in percent of colonies lost between: Alabama and Arizona, Indiana, Massachusetts, Pennsylvania, Arkansas
Arizona and Colorado, Florida, Georgia, Illinois, Indiana, Kentucky, Massachusetts, New Mexico, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, Wisconsin
Arkansas and Kansas, Massachusetts
Colorado and Indiana
Florida and Kansas, Massachusetts, Pennsylvania
Georgia and Kansas, Massachusetts
Illinois and Kansas, Massachusetts
Indiana and Kansas
Kansas and Kentucky, Massachusetts, New Mexico, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, Wisconsin, Massachusetts
Massachusetts and New Mexico, Tennessee
Pennsylvania and Tennessee
After analyzing our data, it is apparent that year (X^2(6)=97.202,p<0.0001), month (X^2(3)=701.85,p<0.0001), and state (X^2(46)=1553.3,p<0.0001) have a significant impact on percent of bee colonies lost, but that stressor ((X^2(5)=0,p=1) does not. Year-wise, there was a significant difference between: 2015 and 2020 (p<.0001), 2015 and 2021 (p<.0001), 2016 and 2020 (p<.0001), 2016 and 2021 (p<.001), 2017 and 2020 (p<.0001), 2017 and 2021 (p<.05), 2018 and 2020 (p<.0001), 2018 and 2021 (p<.0001), 2019 and 2020 (p<.0001), and 2019 and 2021 (p<.0001). For month, there was a significant difference between: Jan-March compared to April-June, July-September, and October-December (p<.0001), April-June and July-September (p<.0001), and April-June and October-December (p<.0001). That said, all months differed from January-March and April-June. There is also a significant difference in colony loss percentage by state, even in only the top 20 states for colony lost (X^2(18)=174.11,p <.0001). State-wise, there were various pairwise differences, particularly when comparing Arizona, Kansas, and Massachusetts to other states. Seemingly, of the top 20 states, Massachusetts had the least colony loss and Kansas and Arizona had the most. This means that the year it was had a signficant impact on bee colony loss, as did the month/season, and the state the data were from. From this analysis, it is apparent that different stressor types did not have a different impact on bee colony loss from each other, meaning that either all of the stressors had the same effect on the bee colony loss.
Using a post hoc analysis, it was determined that there was a statistically significant difference between all years analyzed (2015, 2016, 2017, 2018, 2019, 2020, 2021) and the bee colony loss in 2020 and 2021, respectively. This means that in 2020 and 2021, the levels of bee colony loss were significantly lower (as shown in the box plot) than 2015, 2016, 2017, 2018, 2019, and each other. This means that for some reason, likely an environmental one, 2020 and 2021 had decreased bee colony loss.
From the graphs in part 1 it is clear that January to March had the most colonies lost, while April to June had the least. This is further supported by our post hoc analysis, which showed that there was a significant difference in colony loss in January to March compared to all other times of year, which, from the boxplot, we can tell is significantly more loss than other months. There is also a significant difference between the colony loss in April to June compared to all other months, which, from the boxplot, we can tell is because there was significantly more colony lost. This means, to a statitsically significant level, January to March had the most colony loss, while April to June had the least. Being that, in North America where data are collected, January to March bring winter weather, while April to June bring warmer weather and increased blooming plants to pollinate, we may speculate that weather and pollen resources have an impact on bee colony loss. A limitation to this analysis, though, is that such data on temperature and pollen levels are not available to support this point.
When comparing the top 20 states to eachother using a post hoc test, Arizona, Kansas, and Massachusetts were significantly different from every other state. Looking at the boxplot, it would appear that Kansas and Arizona had the most colony loss, while Massachusetts appeared to have the least out of the top 20. This aligns with previous speculation that climate/weather could be a stake, since Arizona and Kansas are largely warmer than Massachusetts, but further data collection and analysis would be needed to prove this. Further, additional post hoc testing would be necessary to determine differences with any states not in the top 20 for colony loss.
Interestingly, this data does not support that any specific stressor has a greater impact on colony loss than the others, meaning all stressors were equally dangerous to bees.
Some limitations of our analysis can be seen with the Shapiro-Wilk test. Since the bee colony loss data has over 5000 data entries, we cannot use this test. Without the Shapiro-Wilks, it is difficult to confidently determine if our data is normal before running statistical analyses. Similarly, because there was data for so many states, it was difficult to analyze the full data set while still having readable histograms and boxplots. To mitigate this, we used the top 20 states, but this was at the expense of not analyzing the other 27 state’s data fully. We also do not have data for honey production, for instance, to determine the actual impacts of this colony lost. Likewise, without an analysis more advanced than the one ran herein (i.e., a multiple regression analysis or a two-way ANOVA), we were unable to capture whether combinations of stressors, rather than individual stressors alone, caused colony loss.
The analysis done is important because it shows us how and why colonies of bees are being lost. If too many bees are lost, the consequences would be dramatic. Using the statistical tests in Rmarkdown helped us better understand correlations between colony loss and explanatory variables (year, month, state, and stressor), and show if there is statistical significance between colony loss and variables. If there is, steps can be taken to combat bee colony loss and re-introduce/protect the existing colonies. For instance, the introduction of better quality care during months of higher loss could possibly limit future losses. Also, we can possibly limit the amount of stressors that are man-made (i.e. pesticides) by creating alternatives that keep bee colonies safer and healthier, which would be particularly useful if certain stressors are later found to be significant.
As mentioned in when discussing limitations, a multiple regression analysis would be a useful future step to help further elucidate whether combinations of stressors are responsible for colony loss. With this information, more informed decisions on ways to help protect bee colonies could be formed.
Jonthegeek. (2022). Bee Colonies (rfordatascience/ tidytuesday/ data/ 2022/2022-01-11/ readme.md). GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-01-11/readme.md
National Agricultural Statistics Service (NASS). (2022). Honey Bee Colonies. National Agricultural Statistics Service (NASS), Agricultural Statistics Board, United States Department of Agriculture (USDA). ISSN: 2470-993X. https://downloads.USDA.library.Cornell.edu/USDA-esmis/files/rn301137d/kh04fx05c/qb98nn582/hcny0822.pdf
Roberts, C. (2025). What happens to ecosystems when bees disappear? Earth.Org. https://earth.org/what-happens-to-ecosystems-when-bees-disappear/
Sarnoff, L. (2025). Honey bee colonies could face 70% losses in 2025, impacting agriculture. ABC News. https://abcnews.go.com/US/honey-bee-colonies-face-70-losses-2025-impacting/story?id=120191720
Woods, J. (2021). US beekeepers continue to report high colony loss rates, no clear progression toward improvement. Office of Communications and Marketing. https://ocm.auburn.edu/newsroom/news_articles/2021/06/241121-honey-bee-annual-loss-survey-results.php