#check working directory
#getwd()
#load libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(RColorBrewer)
library(ggthemes)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
INTRODUCTION In recent years the disappearance of large numbers of honeybees, also known as colony collapse disorder, has been studied extensively, but the causes remain elusive. (Mickaël Henry, 2012) There are a number of known stressors to honeybees including pests, parasites, disease, and pesticides. Additionally, transporting bees can be stressful if not done properly, (Caron, 2000) and the foraging environment including crop and temperature can have negative effects. (William G. Meikle, 2020) Because most, if not all, of bees moved are commercial, the motive to preserve the colonies’ health is strong. There is little danger to the bees unless accidental because they understand the proper methods. (Caron, 2000) Further, the colony moves are due to temperature change in that as weather gets cold, the bees are moved to warmer areas to help with different crops that are still in need of fertilization. (Mickaël Henry, 2012) The assumption for this analysis is that moves and temperature are not a serious factor in colony loss. Therefore, there will be analysis of the stressors measured in bees by examining the following datasets. I have been interested in news stories about this topic over the past decade, but never really delved into it in a scientific way. The Covid-19 pandemic made me curious to learn what was happening with the bees when I saw the dataset. I hope scientists can come up with some answers to help solve this problem. The Covid-19 vaccine shows there is always hope in science.
THE DATA Both of my datasets are from Kaggle. The first set that I called bees has general information as follows:
DATA: Bee colony loss
VARIABLES:
Year: Year in format fiscal format YYYY/YY
Season: All were annual - not used
State: Full name of state
Total Annual Loss: annual colony loss per 100 colonies
Beekeepers: Total number of beekeepers
Beekeepers Exclusive to State: annual number of beekeepers per 100 that only work in one state
Colonies: Total number of honey bee colonies
Colonies Exclusive to State: annual number of colonies per 100 that remain in the same state for the entire year.
DATA: Stressors
VARIABLES:
State: Abbreviation name of state
Varroa mites: Number of infesations of Varroa mites per 100 colonies
Other pests and parasites: Number of other infestations per 100 colonies
Diseases: Number of diseased per 100 colonies
Pesticides: Number of pesticide detected colonies per 100
Other: other detected stressor per 100 colonies
Unknown: unknown remaining unidentified stressor per 100 colonies.
Year: Year in quarters format YYYY-#Q
The reason for the examining the second dataset is that after I finished analyzing the first, there was not much there except the loss rate. Because of my background research, the other variables such as number of beekeepers and whether bees were moved or not did not seem to shed light on the question of what was causing the losses. The second dataset had variables related to the information about pests, disease and pesticides.
# LOAD DATASETS
bees <- read_csv("C:/Users/libcl/OneDrive/Documents/bee_colony_loss.csv") #1st set taken from course datasets, available at Kaggle
##
## -- Column specification --------------------------------------------------------
## cols(
## Year = col_character(),
## Season = col_character(),
## State = col_character(),
## `Total Annual Loss` = col_character(),
## Beekeepers = col_double(),
## `Beekeepers Exclusive to State` = col_character(),
## Colonies = col_double(),
## `Colonies Exclusive to State` = col_character()
## )
stressors <- read_csv("C:/Users/libcl/OneDrive/Documents/stressors.csv") #from https://storage.googleapis.com/kagglesdsdata/datasets/793168/1497256/stressors.csv
##
## -- Column specification --------------------------------------------------------
## cols(
## State = col_character(),
## `Varroa mites` = col_character(),
## `Other pests and parasites` = col_character(),
## Diseases = col_character(),
## Pesticides = col_character(),
## Other = col_character(),
## Unknown = col_character(),
## Year = col_character()
## )
#BEES DATA
#change column heading to lower case
names(bees) <- tolower(names(bees))
#replace " " with _ in column headings
names(bees) <- gsub(" ","_",names(bees), fixed = TRUE)
#add column region for states before replacing with abbreviation next
#replace state full name with abbreviation
bees$state <- state.abb[match(bees$state, state.name)]
#mutate "year" column to "date" in date form with starting year only
names(bees$year) <- gsub("//","",names(bees$year), fixed = TRUE)
# make new column date, with class numeric from year with / removed
my_year = str_replace(bees$year, pattern = "/", replacement = "")
my_years = (substr(my_year,1,4))
my_date = as.numeric(my_years)
bees <- bees %>%
mutate(dates = my_date)
# change percentages by removing %-sign and changing to numeric
my_total_annual_loss = str_replace(bees$total_annual_loss, pattern = "%", replacement = "")
my_beekeepers_exclusive_to_state = str_replace(bees$beekeepers_exclusive_to_state, pattern = "%", replacement = "")
my_colonies_exclusive_to_state = str_replace(bees$colonies_exclusive_to_state, pattern = "%", replacement = "")
bees$total_annual_loss = as.numeric(my_total_annual_loss)
bees$beekeepers_exclusive_to_state = as.numeric(my_beekeepers_exclusive_to_state)
bees$colonies_exclusive_to_state = as.numeric(my_colonies_exclusive_to_state)
#STRESSORS DATA
#change column heading to lower case
names(stressors) <- tolower(names(stressors))
#replace " " with _ in column headings
names(stressors) <- gsub(" ","_",names(stressors), fixed = TRUE)
#change stressor rows containing (Z), and replace with NA for all stressor column names
myvarroa_mites <- str_replace(stressors$varroa_mites, pattern = "(Z)", replacement = "NA")
myother_pests_and_parasites <- str_replace(stressors$other_pests_and_parasites, pattern = "(Z)", replacement = "NA")
mydiseases <- str_replace(stressors$diseases, pattern = "(Z)", replacement = "NA")
mypesticides <- str_replace(stressors$pesticides, pattern = "(Z)", replacement = "NA")
myother <- str_replace(stressors$other, pattern = "(Z)", replacement = "NA")
myunknown <- str_replace(stressors$unknown, pattern = "(Z)", replacement = "NA")
# change character to double class for percentages by changing to double to convert with mutate
vm = as.double(myvarroa_mites)
## Warning: NAs introduced by coercion
op = as.double(myother_pests_and_parasites)
## Warning: NAs introduced by coercion
d = as.double(mydiseases)
## Warning: NAs introduced by coercion
p = as.double(mypesticides)
## Warning: NAs introduced by coercion
o = as.double(myother)
## Warning: NAs introduced by coercion
u = as.double(myunknown)
## Warning: NAs introduced by coercion
#fix dates to eliminate quarter; will have to average for each year later for one annual value to join tables
my_year = str_replace(stressors$year, pattern = "-", replacement = "")
my_year = (substr(my_year,1,4))
my_date = as.numeric(my_year)
stressors <- stressors %>%
mutate(dates = my_date, var_mites = vm/100, other_pestpara = op/100, disease = d/100, pesticide = p/100, others = o/100, unk = u/100)
#write.csv(bees, "bees.csv")
#write.csv(stressors, "bees_stressors.csv")
bees <- bees %>%
filter(!is.na(state)) %>%
mutate(loss_rate = total_annual_loss/100, stexclusive_keeperrate = beekeepers_exclusive_to_state/100, stexclusive_colonyrate = colonies_exclusive_to_state/100)
df2 <- stressors %>%
filter(!is.na(state)) %>%
group_by(state, dates) %>%
mutate(n = n())
df3 <- df2 %>%
summarize(state, dates, n, var_mitesavg = sum(var_mites)/n, other_pestparaavg = sum(other_pestpara)/n, diseaseavg = sum(disease)/n, pesticideavg = sum(pesticide)/n, othersavg = sum(others)/n, unkavg = sum(unk)/n)
## `summarise()` regrouping output by 'state', 'dates' (override with `.groups` argument)
#drop duplicate rows
df4 <- distinct(df3)
bees_tab <- bees %>% inner_join(df4, by = "state", "dates", keep = FALSE)
bees_tab %>%
filter(total_annual_loss != "NA") %>%
ggplot(aes(loss_rate)) +
geom_boxplot(fill = 7) +
theme_dark()
bees_tab %>%
filter(total_annual_loss != "NA") %>%
ggplot(aes(loss_rate, group = year)) +
geom_boxplot(fill = 7) +
xlab("annual loss rate") +
ggtitle("United States Bee Colonies") +
theme_dark()
Looking at possible outliers using interquartile range and Q1 and Q3 values from summary: result is 7 values representing 7 states, 4 from 2010. 2010 had more losses so I am keeping the values because it won’t affect that much. Similarly a single value in one state should not affect the analysis.
summary(bees)
## year season state total_annual_loss
## Length:348 Length:348 Length:348 Min. : 7.50
## Class :character Class :character Class :character 1st Qu.:30.50
## Mode :character Mode :character Mode :character Median :39.80
## Mean :40.91
## 3rd Qu.:48.50
## Max. :86.90
## NA's :11
## beekeepers beekeepers_exclusive_to_state colonies
## Min. : 2.00 Min. : 3.90 Min. : 6
## 1st Qu.: 21.75 1st Qu.: 85.70 1st Qu.: 1019
## Median : 54.00 Median : 93.05 Median : 5156
## Mean : 86.40 Mean : 87.10 Mean : 31388
## 3rd Qu.:109.75 3rd Qu.: 96.90 3rd Qu.: 30617
## Max. :828.00 Max. :100.00 Max. :625897
##
## colonies_exclusive_to_state dates loss_rate
## Min. : 0.000 Min. :2010 Min. :0.0750
## 1st Qu.: 3.475 1st Qu.:2011 1st Qu.:0.3050
## Median : 21.350 Median :2013 Median :0.3980
## Mean : 40.163 Mean :2013 Mean :0.4091
## 3rd Qu.: 82.575 3rd Qu.:2015 3rd Qu.:0.4850
## Max. :100.000 Max. :2016 Max. :0.8690
## NA's :11
## stexclusive_keeperrate stexclusive_colonyrate
## Min. :0.0390 Min. :0.00000
## 1st Qu.:0.8570 1st Qu.:0.03475
## Median :0.9305 Median :0.21350
## Mean :0.8710 Mean :0.40163
## 3rd Qu.:0.9690 3rd Qu.:0.82575
## Max. :1.0000 Max. :1.00000
##
range <- IQR(bees$total_annual_loss, na.rm = TRUE, type = 7)
out_low <- 30.50 - (1.5 * range)
out_high <- 48.50 + 1.5 * range
bees %>%
filter(total_annual_loss > out_high | total_annual_loss < out_low)
## # A tibble: 7 x 12
## year season state total_annual_lo~ beekeepers beekeepers_excl~ colonies
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2016~ Annual OK 83.9 29 96.5 9513
## 2 2013~ Annual CT 81.6 65 90.8 1531
## 3 2010~ Annual WV 78.6 25 92 509
## 4 2010~ Annual IA 80.7 14 85.7 1037
## 5 2010~ Annual MI 83.5 99 97 10320
## 6 2010~ Annual GA 84.7 61 96.7 5923
## 7 2010~ Annual ME 86.9 43 93 5668
## # ... with 5 more variables: colonies_exclusive_to_state <dbl>, dates <dbl>,
## # loss_rate <dbl>, stexclusive_keeperrate <dbl>, stexclusive_colonyrate <dbl>
Look at correlation of exclusivity of beekeepers versus exclusivity of colonies by state. No discernable pattern is seen in the scatterplot, and 44% correlation was calculated. There is not a lot of information in this dataset to determine what factors may be involved in the loss, so after visualizing extent of losses on a map, stressors in the second dataset will be considered.
bees %>%
ggplot() +
geom_point(aes(x = stexclusive_colonyrate, y = loss_rate, color = stexclusive_keeperrate)) +
ggtitle("Correlation Exclusive") +
xlab("Colonies Excl. to State") +
ylab("Loss Rate")
## Warning: Removed 11 rows containing missing values (geom_point).
cor(bees$stexclusive_colonyrate, bees$beekeepers_exclusive_to_state)
## [1] 0.4473572
#2010 losses map
bees2010 <- bees %>%
filter(dates == 2010)
bees2010$hover <- with(bees2010, paste(bees2010$state, "Single State BK Rate:", bees2010$stexclusive_keeperrate, "Single State Colony Rate:", bees2010$stexclusive_colonyrate, "Total Colonies:", bees2010$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2010 <- plot_geo(bees2010, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2010 - US Colony Loss Rate ',
geo = g
)
## Warning: Ignoring 3 observations
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
p2010
#2011 losses map
bees2011 <- bees %>%
filter(dates == 2011)
bees2011$hover <- with(bees2011, paste(bees2011$state, "Single State BK Rate:", bees2011$stexclusive_keeperrate, "Single State Colony Rate:", bees2011$stexclusive_colonyrate, "Total Colonies:", bees2011$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2011 <- plot_geo(bees2011, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2011 - US Colony Loss Rate ',
geo = g
)
## Warning: Ignoring 3 observations
p2011
#2012 losses map
bees2012 <- bees %>%
filter(dates == 2012)
bees2012$hover <- with(bees2012, paste(bees2012$state,"Single State BK Rate:", bees2012$stexclusive_keeperrate, "Single State Colony Rate:", bees2012$stexclusive_colonyrate, "Total Colonies:", bees2012$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2012 <- plot_geo(bees2012, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2012 - US Colony Loss Rate ',
geo = g
)
## Warning: Ignoring 2 observations
p2012
#2013 losses map
bees2013 <- bees %>%
filter(dates == 2013)
bees2013$hover <- with(bees2013, paste(bees2013$state,"Single State BK Rate:", bees2013$stexclusive_keeperrate, "Single State Colony Rate:", bees2013$stexclusive_colonyrate, "Total Colonies:", bees2013$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2013 <- plot_geo(bees2013, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2013 - US Colony Loss Rate ',
geo = g
)
## Warning: Ignoring 1 observations
p2013
subplot(p2010, p2011, p2012, p2013)
#2014 losses map
bees2014 <- bees %>%
filter(dates == 2014)
bees2014$hover <- with(bees2014, paste(bees2014$state,"Single State BK Rate:", bees2014$stexclusive_keeperrate, "Single State Colony Rate:", bees2014$stexclusive_colonyrate, "Total Colonies:", bees2014$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2014 <- plot_geo(bees2014, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2014 - US Colony Loss Rate ',
geo = g
)
## Warning: Ignoring 1 observations
#2015 losses map
bees2015 <- bees %>%
filter(dates == 2015)
bees2015$hover <- with(bees2015, paste(bees2015$state,"Single State BK Rate:", bees2015$stexclusive_keeperrate, "Single State Colony Rate:", bees2015$stexclusive_colonyrate, "Total Colonies:", bees2015$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2015 <- plot_geo(bees2015, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2015 - US Colony Loss Rate ',
geo = g
)
## Warning: Ignoring 1 observations
#2016 losses map
bees2016 <- bees %>%
filter(dates == 2016)
bees2016$hover <- with(bees2016, paste(bees2016$state,"Single State BK Rate:", bees2016$stexclusive_keeperrate, "Single State Colony Rate:", bees2016$stexclusive_colonyrate, "Total Colonies:", bees2016$colonies, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
p2016 <- plot_geo(bees2016, locationmode = 'USA-states') %>%
add_trace(
z = ~loss_rate, text = ~hover, locations = ~state,
color = ~loss_rate, colors = 'Purples'
) %>%
colorbar(title = "Loss Rate") %>%
layout(
title = '2014 - 2016 - US Colony Loss Rate ',
geo = g
)
subplot(p2014, p2015, p2016)
bees_means <- bees_tab %>%
group_by(state) %>%
summarize(lrate = mean(loss_rate), mvar = mean(var_mitesavg), mop = mean(other_pestparaavg), mdis = mean(diseaseavg), mpesticide = mean(pesticideavg), mo = mean(othersavg), munk = mean(unkavg))
## `summarise()` ungrouping output (override with `.groups` argument)
# Counterintuitively, Hawai shows low average loss while having highest infestation rate; Illinois, Connecticut and Maryland show the reverse of low infestation with high loss
bees_means %>%
ggplot(aes(x = state, y = lrate, color = mvar)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_point() +
ggtitle("Varrus Mite")
## Warning: Removed 3 rows containing missing values (geom_point).
#Illinois and Pennsylvania show high loss, low pesticides; those states showing highest pesticides all show loss_rates between 35-45%, by no means the highest
bees_means %>%
ggplot(aes(x = state, y = lrate, color = mpesticide)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_point() +
ggtitle("Pesticides")
## Warning: Removed 3 rows containing missing values (geom_point).
#re-arrange to shorter names and arrange in long form
bees_means <- bees_means %>%
rename(other = mo)
bees_means <- bees_means %>%
rename(disease = mdis)
bees_means <- bees_means %>%
rename(other_pests = mop)
bees_means <- bees_means %>%
rename(pesticides = mpesticide)
bees_means <- bees_means %>%
rename(varrus_mites = mvar)
bees_means <- bees_means %>%
rename(unknown = munk)
bees_meanslong <- bees_means %>%
pivot_longer(
cols = 3:8,
names_to = "category",
values_to = "percentage",
values_drop_na = TRUE)
s_bar <- bees_meanslong %>%
filter(!is.na(percentage)) %>%
ggplot() +
geom_col(aes(x = category, y = percentage, fill = state), position = "dodge") +
ggtitle("Colony Stressor Groups") +
xlab("Category") +
ylab("Percentage affected")
s_bar
# stressors map
bees_means$hover <- with(bees_means, paste(bees_means$state,"Varrus Mites:", bees_means$varrus_mites, "Pesticides:", bees_means$pesticides, "Diseases:", bees_means$disease, sep = "<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('steelblue')
)
stressor_p1 <- plot_geo(bees_means, locationmode = 'USA-states') %>%
add_trace(
z = ~lrate, text = ~hover, locations = ~state,
color = ~lrate, colors = "YlOrBr"
) %>%
colorbar(title = "Average Loss Rate") %>%
layout(
title = 'US Colony Average Loss Rates and Stressors ',
geo = g
)
## Warning: Ignoring 3 observations
stressor_p1
Here is another stressors visualization without dates included. I could not include dates because only the year was provided in both datasets, and a year only cannot be converted to date class (I tried, but time was a constraint in figuring this out)
cols <- brewer.pal(6, "Dark2")
highchart() %>%
hc_add_series(data = bees_meanslong,
type = "line", hcaes(x = state,
y = percentage,
group = category)) %>%
hc_title(
text = "Rate of Stressors on Total Colonies Across States",
margin = 20,
align = "left",
style = list(color = "#22A884", useHTML = TRUE)
) %>%
hc_xAxis(
title = list(text = "state"),
alternateGridColor = "#FDFFD5",
opposite = TRUE
) %>%
hc_colors(cols)
I would like to look at linear model even though this looks difficult because there are many missing values in the dataset of stressors. Not surprisingly, there appears to be a downward slope from what I can tell.
bees_means %>%
filter(!is.na(lrate) & !is.na(varrus_mites)) %>%
ggplot(aes(x = varrus_mites, y = lrate)) +
geom_point(aes(x = varrus_mites, y = lrate)) +
geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula 'y ~ x'
bees_means %>%
filter(!is.na(lrate) & !is.na(pesticides)) %>%
ggplot(aes(x = pesticides, y = lrate)) +
geom_point(aes(x = varrus_mites, y = lrate)) +
geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula 'y ~ x'
bees_means %>%
filter(!is.na(lrate) & !is.na(disease)) %>%
ggplot(aes(x = pesticides, y = lrate)) +
geom_point(aes(x = disease, y = lrate)) +
geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
I decided to use correlations of stressors using Ggally. None look linear which indicates poor correlation. Since there is not much correlation a multi-regression model may include most of these variables. Disease and pesticides, and varrus mites and pesticides look like correlations above 70%. I will proceed with the ml fit.
# Try loading GGally to see correlations of stressors
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
df10 <- bees_means
df10$state <- df10$hover <- NULL
df10 <- df10 %>%
na.omit(df10)
# Check correlations (as scatterplots), distribution and print correlation coefficient
ggpairs(df10, title="correlogram with ggpairs()")
# first iteration of model with all included variables including date
fit_multi <- lm(loss_rate ~ dates.y + var_mitesavg + pesticideavg + diseaseavg + other_pestparaavg + othersavg + unkavg, data = bees_tab)
summary(fit_multi)
##
## Call:
## lm(formula = loss_rate ~ dates.y + var_mitesavg + pesticideavg +
## diseaseavg + other_pestparaavg + othersavg + unkavg, data = bees_tab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28518 -0.08880 -0.01354 0.06873 0.45938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.237753 9.180897 -1.115 0.26521
## dates.y 0.005303 0.004555 1.164 0.24475
## var_mitesavg -0.205607 0.063929 -3.216 0.00136 **
## pesticideavg -0.028819 0.119301 -0.242 0.80920
## diseaseavg -0.370461 0.171475 -2.160 0.03110 *
## other_pestparaavg -0.160933 0.080344 -2.003 0.04559 *
## othersavg 0.778365 0.170752 4.558 6.16e-06 ***
## unkavg 0.186260 0.148228 1.257 0.20936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1237 on 649 degrees of freedom
## (603 observations deleted due to missingness)
## Multiple R-squared: 0.07051, Adjusted R-squared: 0.06049
## F-statistic: 7.034 on 7 and 649 DF, p-value: 4.19e-08
plot(fit_multi)
##### remove unknown first, then pesticides, then diseases, then other pests, then parasites, and then other. All resulted in R^2 < .05 so the model does not work
fit_final <- lm(loss_rate ~ dates.y + var_mitesavg, data = bees_tab)
summary(fit_final)
##
## Call:
## lm(formula = loss_rate ~ dates.y + var_mitesavg, data = bees_tab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32415 -0.09976 -0.00520 0.07897 0.47429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.145051 7.401080 -1.371 0.171
## dates.y 0.005263 0.003672 1.434 0.152
## var_mitesavg -0.191893 0.032715 -5.866 5.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1395 on 1229 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.02723, Adjusted R-squared: 0.02565
## F-statistic: 17.2 on 2 and 1229 DF, p-value: 4.283e-08
plot(fit_final)
# first iteration of model with all included variables excluding date
head(bees_means)
## # A tibble: 6 x 9
## state lrate varrus_mites other_pests disease pesticides other unknown hover
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 AL 0.359 0.270 0.256 NA 0.0756 0.0556 0.0697 AL<br>~
## 2 AR 0.376 0.468 0.154 NA 0.136 0.0714 0.0386 AR<br>~
## 3 AZ 0.406 0.437 0.119 NA NA 0.105 0.0593 AZ<br>~
## 4 CA 0.357 0.422 0.136 0.075 0.140 0.104 0.0472 CA<br>~
## 5 CO 0.384 0.391 NA NA NA 0.0688 0.0209 CO<br>~
## 6 CT 0.530 0.182 NA NA NA 0.0588 NA CT<br>~
fit3 <- lm(lrate ~ varrus_mites + disease + pesticides + other_pests + other + unknown, data = bees_means)
summary(fit3)
##
## Call:
## lm(formula = lrate ~ varrus_mites + disease + pesticides + other_pests +
## other + unknown, data = bees_means)
##
## Residuals:
## 4 7 8 10 11 12 15 30
## -0.056673 0.069655 0.015274 0.081726 -0.001082 -0.026423 0.001653 -0.020245
## 31 34 38 43
## -0.060064 0.053353 -0.036368 -0.020805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6522 0.1142 5.709 0.0023 **
## varrus_mites -0.6030 0.4557 -1.323 0.2430
## disease -2.3666 1.5196 -1.557 0.1801
## pesticides 0.9453 1.1531 0.820 0.4496
## other_pests -1.0485 0.7408 -1.415 0.2161
## other 2.0426 1.1165 1.829 0.1269
## unknown -0.2040 1.0584 -0.193 0.8548
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06971 on 5 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.7085, Adjusted R-squared: 0.3588
## F-statistic: 2.026 on 6 and 5 DF, p-value: 0.2278
plot(fit3)
#MR^2 is above 70%, but p-value is .2 so we need to proceed to see if we can get some level of significance from the variable.
# remove unknown, pesticides, varrus mites, unknown
head(bees_means)
## # A tibble: 6 x 9
## state lrate varrus_mites other_pests disease pesticides other unknown hover
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 AL 0.359 0.270 0.256 NA 0.0756 0.0556 0.0697 AL<br>~
## 2 AR 0.376 0.468 0.154 NA 0.136 0.0714 0.0386 AR<br>~
## 3 AZ 0.406 0.437 0.119 NA NA 0.105 0.0593 AZ<br>~
## 4 CA 0.357 0.422 0.136 0.075 0.140 0.104 0.0472 CA<br>~
## 5 CO 0.384 0.391 NA NA NA 0.0688 0.0209 CO<br>~
## 6 CT 0.530 0.182 NA NA NA 0.0588 NA CT<br>~
fit3 <- lm(lrate ~ disease + other_pests + other, data = bees_means)
summary(fit3)
##
## Call:
## lm(formula = lrate ~ disease + other_pests + other, data = bees_means)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11154 -0.04988 0.01001 0.04216 0.12017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50031 0.07066 7.081 5.5e-06 ***
## disease -1.83572 0.64381 -2.851 0.01282 *
## other_pests -1.24978 0.41807 -2.989 0.00975 **
## other 2.08495 0.74845 2.786 0.01459 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06715 on 14 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.545, Adjusted R-squared: 0.4476
## F-statistic: 5.591 on 3 and 14 DF, p-value: 0.009827
plot(fit3)
#MR^2 = .545 so explains about 55% of Loss Rate, with p = 0.009827 so the factors are important; surprisingly varrus mites and pesticides reduce the MR^2 so there must be some overlap with disease and other factors.
Our multiple regression model is not a perfect fit, but I believe the Multiple R^2 of 55% and p = .01 does indicate that disease, other pests, and the “other” undefined variable are significant in determining loss rates. Other helpful information such as temperature ranges of exposure, might complete the model.
CONCLUSION:
My initial research indicated that the cause of colony loss for honeybees is not settled and may be due to multiple factors. I believe that is shown by this data, especially when looking at the attempted multiple regression model which showed mild significance of a few of the categories including disease, pests and an undefined “other” factor.
Temperature data was not included in these datasets, but another analysis with this data might show sensitivity causing survival issues. (William G. Meikle, 2020) Also, surprisingly, pesticides did not appear as a significant contribution to colony death, but this may be due to how the pesticides are being measured. Low level chemical exposure may be just as damaging because it can lead to neurological issues that may interfere with homing issues, which is the ability for the worker bees to find there way back to the colony with necessary sustenance for the hives' survival, leading to colony death without high levels of chemicals being detected. (Mickaël Henry, 2012)
One surprising finding was the varrus mite pest category appeared to be the most common pest and was the by far the highest measure present of all categories in the data, but during the multiple regression was not as significant as disease which was measured at much lower levels, so caution is necessary when comparing levels as I did in this analysis. Just looking at the data visualizations of this data, the varrus mite infestations appeared to be a much more serious problem than pesticides; however, the scientific studies show that pesticides do not need to measure in high levels to cause harm a higher level of harm. (Mickaël Henry, 2012) Any future analysis I do of this dataset would need to include information about what measurements are considered harmful for each category, if that is possible to determine.
Works Cited Caron, D. M. (2000). Moving Bees. Mid-Atlantic Apiculture Research and Extension Consortium. University Park, PA: MAAREC Publication 5.3. Jerrod Penn, W. H. (2019, August 29). Support for Solitary Bee Conservation Among the Public Versus Beekeepers. Retrieved from Amer. J. Agr. Econ: https://academic.oup.com/ajae/article-abstract/101/5/1386/5556335 Mickaël Henry, M. B. (2012). A Common Pesticide Decreases Foraging Success and Survival in Honey Bees. Science, 348-350. William G. Meikle, M. &. (2020, 3 19). Landscape factors infuencing honey bee colony behavior in Southern California commercial apiaries. Retrieved from nature.com/scientificreports: https://www.nature.com/articles/s41598-020-61716-6