#check working directory
#getwd()

#load libraries
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(RColorBrewer)
library(ggthemes)
library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

INTRODUCTION In recent years the disappearance of large numbers of honeybees, also known as colony collapse disorder, has been studied extensively, but the causes remain elusive. (Mickaël Henry, 2012) There are a number of known stressors to honeybees including pests, parasites, disease, and pesticides. Additionally, transporting bees can be stressful if not done properly, (Caron, 2000) and the foraging environment including crop and temperature can have negative effects. (William G. Meikle, 2020) Because most, if not all, of bees moved are commercial, the motive to preserve the colonies’ health is strong. There is little danger to the bees unless accidental because they understand the proper methods. (Caron, 2000) Further, the colony moves are due to temperature change in that as weather gets cold, the bees are moved to warmer areas to help with different crops that are still in need of fertilization. (Mickaël Henry, 2012) The assumption for this analysis is that moves and temperature are not a serious factor in colony loss. Therefore, there will be analysis of the stressors measured in bees by examining the following datasets. I have been interested in news stories about this topic over the past decade, but never really delved into it in a scientific way. The Covid-19 pandemic made me curious to learn what was happening with the bees when I saw the dataset. I hope scientists can come up with some answers to help solve this problem. The Covid-19 vaccine shows there is always hope in science.

THE DATA Both of my datasets are from Kaggle. The first set that I called bees has general information as follows:

DATA: Bee colony loss

VARIABLES:

Year: Year in format fiscal format YYYY/YY
Season: All were annual - not used
State: Full name of state
Total Annual Loss: annual colony loss per 100 colonies
Beekeepers: Total number of beekeepers
Beekeepers Exclusive to State: annual number of beekeepers per 100 that only work in one state
Colonies: Total number of honey bee colonies
Colonies Exclusive to State: annual number of colonies per 100 that remain in the same state for the entire year.

DATA: Stressors

VARIABLES:

State: Abbreviation name of state
Varroa mites: Number of infesations of Varroa mites per 100 colonies
Other pests and parasites: Number of other infestations per 100 colonies
Diseases: Number of diseased per 100 colonies
Pesticides: Number of pesticide detected colonies per 100
Other: other detected stressor per 100 colonies
Unknown: unknown remaining unidentified stressor per 100 colonies.
Year: Year in quarters format YYYY-#Q

The reason for the examining the second dataset is that after I finished analyzing the first, there was not much there except the loss rate. Because of my background research, the other variables such as number of beekeepers and whether bees were moved or not did not seem to shed light on the question of what was causing the losses. The second dataset had variables related to the information about pests, disease and pesticides.

# LOAD DATASETS
bees <- read_csv("C:/Users/libcl/OneDrive/Documents/bee_colony_loss.csv") #1st set taken from course datasets, available at Kaggle

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Year = col_character(),
##   Season = col_character(),
##   State = col_character(),
##   `Total Annual Loss` = col_character(),
##   Beekeepers = col_double(),
##   `Beekeepers Exclusive to State` = col_character(),
##   Colonies = col_double(),
##   `Colonies Exclusive to State` = col_character()
## )

stressors <- read_csv("C:/Users/libcl/OneDrive/Documents/stressors.csv") #from https://storage.googleapis.com/kagglesdsdata/datasets/793168/1497256/stressors.csv

## 
## -- Column specification --------------------------------------------------------
## cols(
##   State = col_character(),
##   `Varroa mites` = col_character(),
##   `Other pests and parasites` = col_character(),
##   Diseases = col_character(),
##   Pesticides = col_character(),
##   Other = col_character(),
##   Unknown = col_character(),
##   Year = col_character()
## )

CLEAN THE DATA

#BEES DATA
#change column heading to lower case
names(bees) <- tolower(names(bees))

#replace " " with _ in column headings
names(bees) <- gsub(" ","_",names(bees), fixed = TRUE)

#add column region for states before replacing with abbreviation next

#replace state full name with abbreviation
bees$state <- state.abb[match(bees$state, state.name)]

#mutate "year" column to "date" in date form with starting year only
names(bees$year) <- gsub("//","",names(bees$year), fixed = TRUE)
# make new column date, with class numeric from year with / removed
my_year = str_replace(bees$year, pattern = "/", replacement = "")
my_years = (substr(my_year,1,4))
my_date = as.numeric(my_years)
bees <- bees %>%
  mutate(dates = my_date)

# change percentages by removing %-sign and changing to numeric
my_total_annual_loss = str_replace(bees$total_annual_loss, pattern = "%", replacement = "")
my_beekeepers_exclusive_to_state = str_replace(bees$beekeepers_exclusive_to_state, pattern = "%", replacement = "")
my_colonies_exclusive_to_state = str_replace(bees$colonies_exclusive_to_state, pattern = "%", replacement = "")
bees$total_annual_loss = as.numeric(my_total_annual_loss)
bees$beekeepers_exclusive_to_state = as.numeric(my_beekeepers_exclusive_to_state)
bees$colonies_exclusive_to_state = as.numeric(my_colonies_exclusive_to_state)
 

#STRESSORS DATA

#change column heading to lower case
names(stressors) <- tolower(names(stressors))

#replace " " with _ in column headings
names(stressors) <- gsub(" ","_",names(stressors), fixed = TRUE)

#change stressor rows containing (Z), and replace with NA for all stressor column names
myvarroa_mites <- str_replace(stressors$varroa_mites, pattern = "(Z)", replacement = "NA")
myother_pests_and_parasites <- str_replace(stressors$other_pests_and_parasites, pattern = "(Z)", replacement = "NA") 
mydiseases <- str_replace(stressors$diseases, pattern = "(Z)", replacement = "NA")
mypesticides <- str_replace(stressors$pesticides, pattern = "(Z)", replacement = "NA")
myother <- str_replace(stressors$other, pattern = "(Z)", replacement = "NA")
myunknown <- str_replace(stressors$unknown, pattern = "(Z)", replacement = "NA")

# change character to double class for percentages by changing to double to convert with mutate
vm = as.double(myvarroa_mites)

## Warning: NAs introduced by coercion

op = as.double(myother_pests_and_parasites)

## Warning: NAs introduced by coercion

d = as.double(mydiseases)

## Warning: NAs introduced by coercion

p = as.double(mypesticides)

## Warning: NAs introduced by coercion

o = as.double(myother)

## Warning: NAs introduced by coercion

u = as.double(myunknown)

## Warning: NAs introduced by coercion

#fix dates to eliminate quarter; will have to average for each year later for one annual value to join tables
my_year = str_replace(stressors$year, pattern = "-", replacement = "")
my_year = (substr(my_year,1,4))
my_date = as.numeric(my_year)

stressors <- stressors %>% 
  mutate(dates = my_date, var_mites = vm/100, other_pestpara = op/100, disease = d/100, pesticide = p/100, others = o/100, unk = u/100)

Write cleaned datasets to csv files

#write.csv(bees, "bees.csv")
#write.csv(stressors, "bees_stressors.csv")

Convert % columns to decimals with shorter column names

bees <- bees %>%
  filter(!is.na(state)) %>% 
  mutate(loss_rate = total_annual_loss/100, stexclusive_keeperrate = beekeepers_exclusive_to_state/100,  stexclusive_colonyrate = colonies_exclusive_to_state/100)

Stressors data needs averages for state/years before joining.

df2 <- stressors %>%
  filter(!is.na(state)) %>% 
  group_by(state, dates) %>% 
  mutate(n = n())

df3 <- df2 %>% 
  summarize(state, dates, n, var_mitesavg = sum(var_mites)/n, other_pestparaavg = sum(other_pestpara)/n, diseaseavg = sum(disease)/n, pesticideavg = sum(pesticide)/n, othersavg = sum(others)/n, unkavg = sum(unk)/n)

## `summarise()` regrouping output by 'state', 'dates' (override with `.groups` argument)

#drop duplicate rows
 df4 <- distinct(df3)

Join datasets for maps

bees_tab <- bees %>% inner_join(df4, by = "state", "dates", keep = FALSE)

Boxplots and histograms to see data distribution

bees_tab %>% 
  filter(total_annual_loss != "NA") %>% 
  ggplot(aes(loss_rate)) + 
  geom_boxplot(fill = 7) +
  theme_dark()

bees_tab %>% 
  filter(total_annual_loss != "NA") %>% 
  ggplot(aes(loss_rate, group = year)) +
  geom_boxplot(fill = 7) +
  xlab("annual loss rate") +
  ggtitle("United States Bee Colonies") +
  theme_dark()

Looking at possible outliers using interquartile range and Q1 and Q3 values from summary: result is 7 values representing 7 states, 4 from 2010. 2010 had more losses so I am keeping the values because it won’t affect that much. Similarly a single value in one state should not affect the analysis.

summary(bees)

##      year              season             state           total_annual_loss
##  Length:348         Length:348         Length:348         Min.   : 7.50    
##  Class :character   Class :character   Class :character   1st Qu.:30.50    
##  Mode  :character   Mode  :character   Mode  :character   Median :39.80    
##                                                           Mean   :40.91    
##                                                           3rd Qu.:48.50    
##                                                           Max.   :86.90    
##                                                           NA's   :11       
##    beekeepers     beekeepers_exclusive_to_state    colonies     
##  Min.   :  2.00   Min.   :  3.90                Min.   :     6  
##  1st Qu.: 21.75   1st Qu.: 85.70                1st Qu.:  1019  
##  Median : 54.00   Median : 93.05                Median :  5156  
##  Mean   : 86.40   Mean   : 87.10                Mean   : 31388  
##  3rd Qu.:109.75   3rd Qu.: 96.90                3rd Qu.: 30617  
##  Max.   :828.00   Max.   :100.00                Max.   :625897  
##                                                                 
##  colonies_exclusive_to_state     dates        loss_rate     
##  Min.   :  0.000             Min.   :2010   Min.   :0.0750  
##  1st Qu.:  3.475             1st Qu.:2011   1st Qu.:0.3050  
##  Median : 21.350             Median :2013   Median :0.3980  
##  Mean   : 40.163             Mean   :2013   Mean   :0.4091  
##  3rd Qu.: 82.575             3rd Qu.:2015   3rd Qu.:0.4850  
##  Max.   :100.000             Max.   :2016   Max.   :0.8690  
##                                             NA's   :11      
##  stexclusive_keeperrate stexclusive_colonyrate
##  Min.   :0.0390         Min.   :0.00000       
##  1st Qu.:0.8570         1st Qu.:0.03475       
##  Median :0.9305         Median :0.21350       
##  Mean   :0.8710         Mean   :0.40163       
##  3rd Qu.:0.9690         3rd Qu.:0.82575       
##  Max.   :1.0000         Max.   :1.00000       
##

range <- IQR(bees$total_annual_loss, na.rm = TRUE, type = 7)
out_low <- 30.50 - (1.5 * range)
out_high <- 48.50 + 1.5 * range

bees %>% 
  filter(total_annual_loss > out_high | total_annual_loss < out_low)

## # A tibble: 7 x 12
##   year  season state total_annual_lo~ beekeepers beekeepers_excl~ colonies
##   <chr> <chr>  <chr>            <dbl>      <dbl>            <dbl>    <dbl>
## 1 2016~ Annual OK                83.9         29             96.5     9513
## 2 2013~ Annual CT                81.6         65             90.8     1531
## 3 2010~ Annual WV                78.6         25             92        509
## 4 2010~ Annual IA                80.7         14             85.7     1037
## 5 2010~ Annual MI                83.5         99             97      10320
## 6 2010~ Annual GA                84.7         61             96.7     5923
## 7 2010~ Annual ME                86.9         43             93       5668
## # ... with 5 more variables: colonies_exclusive_to_state <dbl>, dates <dbl>,
## #   loss_rate <dbl>, stexclusive_keeperrate <dbl>, stexclusive_colonyrate <dbl>

Look at correlation of exclusivity of beekeepers versus exclusivity of colonies by state. No discernable pattern is seen in the scatterplot, and 44% correlation was calculated. There is not a lot of information in this dataset to determine what factors may be involved in the loss, so after visualizing extent of losses on a map, stressors in the second dataset will be considered.

bees %>% 
  ggplot() +
  geom_point(aes(x = stexclusive_colonyrate, y = loss_rate, color = stexclusive_keeperrate)) +
  ggtitle("Correlation Exclusive") +
  xlab("Colonies Excl. to State") +
  ylab("Loss Rate")

## Warning: Removed 11 rows containing missing values (geom_point).

cor(bees$stexclusive_colonyrate, bees$beekeepers_exclusive_to_state)

## [1] 0.4473572

Map losses for each year with beekeeper and colony information in tooltip

#2010 losses map
bees2010 <- bees %>% 
  filter(dates == 2010)

bees2010$hover <- with(bees2010,  paste(bees2010$state, "Single State BK Rate:", bees2010$stexclusive_keeperrate, "Single State Colony Rate:", bees2010$stexclusive_colonyrate, "Total Colonies:", bees2010$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2010 <- plot_geo(bees2010, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2010 - US Colony Loss Rate ',
    geo = g
  )

## Warning: Ignoring 3 observations

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

p2010

#2011 losses map
bees2011 <- bees %>% 
  filter(dates == 2011)


bees2011$hover <- with(bees2011, paste(bees2011$state, "Single State BK Rate:", bees2011$stexclusive_keeperrate, "Single State Colony Rate:", bees2011$stexclusive_colonyrate, "Total Colonies:", bees2011$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2011 <- plot_geo(bees2011, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2011 - US Colony Loss Rate ',
    geo = g
  )

## Warning: Ignoring 3 observations

p2011

#2012 losses map
bees2012 <- bees %>% 
  filter(dates == 2012)


bees2012$hover <- with(bees2012,  paste(bees2012$state,"Single State BK Rate:", bees2012$stexclusive_keeperrate, "Single State Colony Rate:", bees2012$stexclusive_colonyrate, "Total Colonies:", bees2012$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2012 <- plot_geo(bees2012, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2012 - US Colony Loss Rate ',
    geo = g
  )

## Warning: Ignoring 2 observations

p2012

#2013 losses map
bees2013 <- bees %>% 
  filter(dates == 2013)


bees2013$hover <- with(bees2013,  paste(bees2013$state,"Single State BK Rate:", bees2013$stexclusive_keeperrate, "Single State Colony Rate:", bees2013$stexclusive_colonyrate, "Total Colonies:", bees2013$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2013 <- plot_geo(bees2013, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2013 - US Colony Loss Rate ',
    geo = g
  )

## Warning: Ignoring 1 observations

p2013

subplot(p2010, p2011, p2012, p2013)

#2014 losses map
bees2014 <- bees %>% 
  filter(dates == 2014)


bees2014$hover <- with(bees2014,  paste(bees2014$state,"Single State BK Rate:", bees2014$stexclusive_keeperrate, "Single State Colony Rate:", bees2014$stexclusive_colonyrate, "Total Colonies:", bees2014$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2014 <- plot_geo(bees2014, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2014 - US Colony Loss Rate ',
    geo = g
  )

## Warning: Ignoring 1 observations

#2015 losses map
bees2015 <- bees %>% 
  filter(dates == 2015)


bees2015$hover <- with(bees2015,  paste(bees2015$state,"Single State BK Rate:", bees2015$stexclusive_keeperrate, "Single State Colony Rate:", bees2015$stexclusive_colonyrate, "Total Colonies:", bees2015$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2015 <- plot_geo(bees2015, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2015 - US Colony Loss Rate ',
    geo = g
  )

## Warning: Ignoring 1 observations

#2016 losses map
bees2016 <- bees %>% 
  filter(dates == 2016)


bees2016$hover <- with(bees2016,  paste(bees2016$state,"Single State BK Rate:", bees2016$stexclusive_keeperrate, "Single State Colony Rate:", bees2016$stexclusive_colonyrate, "Total Colonies:", bees2016$colonies, sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

p2016 <- plot_geo(bees2016, locationmode = 'USA-states') %>%
  add_trace(
    z = ~loss_rate, text = ~hover, locations = ~state,
    color = ~loss_rate, colors = 'Purples'
  ) %>%
  colorbar(title = "Loss Rate") %>%
  layout(
    title = '2014 - 2016 - US Colony Loss Rate ',
    geo = g
  )

subplot(p2014, p2015, p2016)

Start analyzing various stressors, i.e. pests and pesticides.

Process means for observed causes of loss by state

bees_means <- bees_tab %>% 
  group_by(state) %>% 
  summarize(lrate = mean(loss_rate), mvar = mean(var_mitesavg), mop =  mean(other_pestparaavg), mdis = mean(diseaseavg), mpesticide = mean(pesticideavg), mo = mean(othersavg), munk = mean(unkavg))

## `summarise()` ungrouping output (override with `.groups` argument)

Plot means of varrus mite infestation rates and then pesticide rates by state and compare with mean state loss rates

# Counterintuitively, Hawai shows low average loss while having highest infestation rate; Illinois, Connecticut and Maryland show the reverse of low infestation with high loss
bees_means  %>%
  ggplot(aes(x = state, y = lrate, color = mvar)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point() +
  ggtitle("Varrus Mite")

## Warning: Removed 3 rows containing missing values (geom_point).

#Illinois and Pennsylvania show high loss, low pesticides; those states showing highest pesticides all show loss_rates between 35-45%, by no means the highest
bees_means  %>%
  ggplot(aes(x = state, y = lrate, color = mpesticide)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point() +
  ggtitle("Pesticides")

## Warning: Removed 3 rows containing missing values (geom_point).

Prepare to us plotly again to compare stressors with mean loss rate for each state

#re-arrange to shorter names and arrange in long form
bees_means <- bees_means %>% 
  rename(other = mo) 
bees_means <- bees_means %>% 
  rename(disease = mdis)
bees_means <- bees_means %>% 
  rename(other_pests = mop)
bees_means <- bees_means %>% 
  rename(pesticides = mpesticide)
bees_means <- bees_means %>% 
  rename(varrus_mites = mvar) 
bees_means <- bees_means %>% 
  rename(unknown = munk)
bees_meanslong <- bees_means %>%
  pivot_longer(
    cols = 3:8,
    names_to = "category",
    values_to = "percentage",
    values_drop_na = TRUE)

Create bar chart of stressor rates by state

s_bar <- bees_meanslong %>%
  filter(!is.na(percentage)) %>% 
  ggplot() + 
  geom_col(aes(x = category, y = percentage, fill = state), position = "dodge") +
  ggtitle("Colony Stressor Groups") +
  xlab("Category") +
  ylab("Percentage affected")
s_bar

Map mean of each stressor and loss rate on US Map

# stressors map

bees_means$hover <- with(bees_means,  paste(bees_means$state,"Varrus Mites:", bees_means$varrus_mites, "Pesticides:", bees_means$pesticides, "Diseases:", bees_means$disease,  sep = "<br>"))

# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('steelblue')
)

stressor_p1 <- plot_geo(bees_means, locationmode = 'USA-states') %>%
  add_trace(
    z = ~lrate, text = ~hover, locations = ~state,
    color = ~lrate, colors = "YlOrBr"
  ) %>%
  colorbar(title = "Average Loss Rate") %>%
  layout(
    title = 'US Colony Average Loss Rates and Stressors ',
    geo = g
  )

## Warning: Ignoring 3 observations

stressor_p1

Here is another stressors visualization without dates included. I could not include dates because only the year was provided in both datasets, and a year only cannot be converted to date class (I tried, but time was a constraint in figuring this out)

cols <- brewer.pal(6, "Dark2")

highchart() %>%
  hc_add_series(data = bees_meanslong,
                   type = "line", hcaes(x = state,
                   y = percentage, 
                   group = category)) %>% 
  hc_title(
  text = "Rate of Stressors on Total Colonies Across States",
    margin = 20,
    align = "left",
    style = list(color = "#22A884", useHTML = TRUE)
    ) %>% 
  
  
  hc_xAxis(
    title = list(text = "state"),
    alternateGridColor = "#FDFFD5",
    opposite = TRUE    
    

)  %>% 
  hc_colors(cols)

I would like to look at linear model even though this looks difficult because there are many missing values in the dataset of stressors. Not surprisingly, there appears to be a downward slope from what I can tell.

bees_means %>%
  filter(!is.na(lrate) & !is.na(varrus_mites)) %>% 
  ggplot(aes(x = varrus_mites, y = lrate)) +
  geom_point(aes(x = varrus_mites, y = lrate)) +
  geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula 'y ~ x'

bees_means %>% 
  filter(!is.na(lrate) & !is.na(pesticides)) %>%
  ggplot(aes(x = pesticides, y = lrate)) +
  geom_point(aes(x = varrus_mites, y = lrate)) +
  geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula 'y ~ x'

bees_means %>% 
  filter(!is.na(lrate) & !is.na(disease)) %>%
  ggplot(aes(x = pesticides, y = lrate)) +
  geom_point(aes(x = disease, y = lrate)) +
  geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 6 rows containing non-finite values (stat_smooth).

I decided to use correlations of stressors using Ggally. None look linear which indicates poor correlation. Since there is not much correlation a multi-regression model may include most of these variables. Disease and pesticides, and varrus mites and pesticides look like correlations above 70%. I will proceed with the ml fit.

# Try loading GGally to see correlations of stressors
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

df10 <- bees_means
df10$state <- df10$hover <- NULL
df10 <- df10 %>% 
  na.omit(df10)

 
# Check correlations (as scatterplots), distribution and print correlation coefficient 
ggpairs(df10, title="correlogram with ggpairs()")

# first iteration of model with all included variables including date

fit_multi <- lm(loss_rate ~ dates.y + var_mitesavg + pesticideavg + diseaseavg  + other_pestparaavg + othersavg + unkavg, data = bees_tab)
summary(fit_multi)

## 
## Call:
## lm(formula = loss_rate ~ dates.y + var_mitesavg + pesticideavg + 
##     diseaseavg + other_pestparaavg + othersavg + unkavg, data = bees_tab)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28518 -0.08880 -0.01354  0.06873  0.45938 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -10.237753   9.180897  -1.115  0.26521    
## dates.y             0.005303   0.004555   1.164  0.24475    
## var_mitesavg       -0.205607   0.063929  -3.216  0.00136 ** 
## pesticideavg       -0.028819   0.119301  -0.242  0.80920    
## diseaseavg         -0.370461   0.171475  -2.160  0.03110 *  
## other_pestparaavg  -0.160933   0.080344  -2.003  0.04559 *  
## othersavg           0.778365   0.170752   4.558 6.16e-06 ***
## unkavg              0.186260   0.148228   1.257  0.20936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1237 on 649 degrees of freedom
##   (603 observations deleted due to missingness)
## Multiple R-squared:  0.07051,    Adjusted R-squared:  0.06049 
## F-statistic: 7.034 on 7 and 649 DF,  p-value: 4.19e-08

plot(fit_multi)

##### remove unknown first, then pesticides, then diseases, then other pests, then parasites, and then other. All resulted in R^2 < .05 so the model does not work

fit_final <- lm(loss_rate ~ dates.y + var_mitesavg, data = bees_tab)
summary(fit_final)

## 
## Call:
## lm(formula = loss_rate ~ dates.y + var_mitesavg, data = bees_tab)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32415 -0.09976 -0.00520  0.07897  0.47429 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -10.145051   7.401080  -1.371    0.171    
## dates.y        0.005263   0.003672   1.434    0.152    
## var_mitesavg  -0.191893   0.032715  -5.866 5.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1395 on 1229 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.02723,    Adjusted R-squared:  0.02565 
## F-statistic:  17.2 on 2 and 1229 DF,  p-value: 4.283e-08

plot(fit_final)

# first iteration of model with all included variables excluding date
head(bees_means)

## # A tibble: 6 x 9
##   state lrate varrus_mites other_pests disease pesticides  other unknown hover  
##   <chr> <dbl>        <dbl>       <dbl>   <dbl>      <dbl>  <dbl>   <dbl> <chr>  
## 1 AL    0.359        0.270       0.256  NA         0.0756 0.0556  0.0697 AL<br>~
## 2 AR    0.376        0.468       0.154  NA         0.136  0.0714  0.0386 AR<br>~
## 3 AZ    0.406        0.437       0.119  NA        NA      0.105   0.0593 AZ<br>~
## 4 CA    0.357        0.422       0.136   0.075     0.140  0.104   0.0472 CA<br>~
## 5 CO    0.384        0.391      NA      NA        NA      0.0688  0.0209 CO<br>~
## 6 CT    0.530        0.182      NA      NA        NA      0.0588 NA      CT<br>~

fit3 <- lm(lrate ~ varrus_mites + disease + pesticides + other_pests + other + unknown, data = bees_means)
summary(fit3)

## 
## Call:
## lm(formula = lrate ~ varrus_mites + disease + pesticides + other_pests + 
##     other + unknown, data = bees_means)
## 
## Residuals:
##         4         7         8        10        11        12        15        30 
## -0.056673  0.069655  0.015274  0.081726 -0.001082 -0.026423  0.001653 -0.020245 
##        31        34        38        43 
## -0.060064  0.053353 -0.036368 -0.020805 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.6522     0.1142   5.709   0.0023 **
## varrus_mites  -0.6030     0.4557  -1.323   0.2430   
## disease       -2.3666     1.5196  -1.557   0.1801   
## pesticides     0.9453     1.1531   0.820   0.4496   
## other_pests   -1.0485     0.7408  -1.415   0.2161   
## other          2.0426     1.1165   1.829   0.1269   
## unknown       -0.2040     1.0584  -0.193   0.8548   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06971 on 5 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.7085, Adjusted R-squared:  0.3588 
## F-statistic: 2.026 on 6 and 5 DF,  p-value: 0.2278

plot(fit3)

#MR^2 is above 70%, but p-value is .2 so we need to proceed to see if we can get some level of significance from the variable.

# remove unknown, pesticides, varrus mites, unknown
head(bees_means)

## # A tibble: 6 x 9
##   state lrate varrus_mites other_pests disease pesticides  other unknown hover  
##   <chr> <dbl>        <dbl>       <dbl>   <dbl>      <dbl>  <dbl>   <dbl> <chr>  
## 1 AL    0.359        0.270       0.256  NA         0.0756 0.0556  0.0697 AL<br>~
## 2 AR    0.376        0.468       0.154  NA         0.136  0.0714  0.0386 AR<br>~
## 3 AZ    0.406        0.437       0.119  NA        NA      0.105   0.0593 AZ<br>~
## 4 CA    0.357        0.422       0.136   0.075     0.140  0.104   0.0472 CA<br>~
## 5 CO    0.384        0.391      NA      NA        NA      0.0688  0.0209 CO<br>~
## 6 CT    0.530        0.182      NA      NA        NA      0.0588 NA      CT<br>~

fit3 <- lm(lrate ~ disease + other_pests + other, data = bees_means)
summary(fit3)

## 
## Call:
## lm(formula = lrate ~ disease + other_pests + other, data = bees_means)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11154 -0.04988  0.01001  0.04216  0.12017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.50031    0.07066   7.081  5.5e-06 ***
## disease     -1.83572    0.64381  -2.851  0.01282 *  
## other_pests -1.24978    0.41807  -2.989  0.00975 ** 
## other        2.08495    0.74845   2.786  0.01459 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06715 on 14 degrees of freedom
##   (27 observations deleted due to missingness)
## Multiple R-squared:  0.545,  Adjusted R-squared:  0.4476 
## F-statistic: 5.591 on 3 and 14 DF,  p-value: 0.009827

plot(fit3)

#MR^2 = .545 so explains about 55% of Loss Rate, with p = 0.009827 so the factors are important; surprisingly varrus mites and pesticides reduce the MR^2 so there must be some overlap with disease and other factors.

Our multiple regression model is not a perfect fit, but I believe the Multiple R^2 of 55% and p = .01 does indicate that disease, other pests, and the “other” undefined variable are significant in determining loss rates. Other helpful information such as temperature ranges of exposure, might complete the model.

CONCLUSION:

My initial research indicated that the cause of colony loss for honeybees is not settled and may be due to multiple factors.  I believe that is shown by this data, especially when looking at the attempted multiple regression model which showed mild significance of a few of the categories including disease, pests and an undefined “other” factor. 
Temperature data was not included in these datasets, but another analysis with this data might show sensitivity causing survival issues. (William G. Meikle, 2020) Also, surprisingly, pesticides did not appear as a significant contribution to colony death, but this may be due to how the pesticides are being measured.  Low level chemical exposure may be just as damaging because it can lead to neurological issues that may interfere with homing issues, which is the ability for the worker bees to find there way back to the colony with necessary sustenance for the hives' survival, leading to colony death without high levels of chemicals being detected. (Mickaël Henry, 2012)
One surprising finding was the varrus mite pest category appeared to be the most common pest and was the by far the highest measure present of all categories in the data, but during the multiple regression was not as significant as disease which was measured at much lower levels, so caution is necessary when comparing levels as I did in this analysis. Just looking at the data visualizations of this data, the varrus mite infestations appeared to be a much more serious problem than pesticides; however, the scientific studies show that pesticides do not need to measure in high levels to cause harm a higher level of harm. (Mickaël Henry, 2012) Any future analysis I do of this dataset would need to include information about what measurements are considered harmful for each category, if that is possible to determine.

Works Cited Caron, D. M. (2000). Moving Bees. Mid-Atlantic Apiculture Research and Extension Consortium. University Park, PA: MAAREC Publication 5.3. Jerrod Penn, W. H. (2019, August 29). Support for Solitary Bee Conservation Among the Public Versus Beekeepers. Retrieved from Amer. J. Agr. Econ: https://academic.oup.com/ajae/article-abstract/101/5/1386/5556335 Mickaël Henry, M. B. (2012). A Common Pesticide Decreases Foraging Success and Survival in Honey Bees. Science, 348-350. William G. Meikle, M. &. (2020, 3 19). Landscape factors infuencing honey bee colony behavior in Southern California commercial apiaries. Retrieved from nature.com/scientificreports: https://www.nature.com/articles/s41598-020-61716-6

Bee Colony Loss

Elizabeth Click

12/13/2020