Project 2: Exploratory Data Analysis of the Disappearance of Honeybees throughtout 2010 - 2016.

Introduction

Love them of hate them, Honeybees are the most effective pollinators on the plant. They play a significant role in the nature by pollinating 85% of food crops intended for human consumption (FDA). In addition, they produce honey and wax. Two commonly used raw material.

In recent years, there has been reports that the number of bees in the United States are quickly dwindling. With the continuous disappearance of honeybees, devastating ripple effect are occurring across all ecosystems. If honeybees were to go extinct, we would lose all the plants that bees pollinate. With the significant decrease of plants in our ecosystem, all animals that eat those plants would die due to starvation, and so on up the food chain. In addition, our supermarkets would have half the amount of fruits and vegetables.

Because of the contribution and struggles of honeybees, many beekeepers and activist have arise to preserve the honeybee population such as the save the bee campaign.

In order to preserve the honeybee population, honeybees activist need to find the root cause on why honeybees are disappearing.

Data Descriptions

For this project, I a taking a look into the disappearance of honeybees in the United States through 2010-2016.

The data was obtained from the DATA 110 dataset course folder, but original originated from the Data World website at: https://data.world/finley/bee-colony-statistical-data-from-1987-2017/workspace/file?filename=Bee+Colony+Loss.xlsx

Data Details

This dataset is a statistical collection of data from 2010 - 2016. The data provides information on (by state) the loss of bee colonies (by percent) annually, the number of known colonies, the number of known beekeepers, and the percentage of beekeepers exclusive to the state.

Three categorical and five quantitative variables.

Total rows: 365

Total Features: 8

Variables

Year: Year in format fiscal format YYYY/YY

Season: All were annual

State: Full name of state in the United States

Annual Loss: Annual colony loss per 100 colonies

Beekeepers: Total number of beekeepers

Beekeepers Exclusive to State: Annual number of beekeepers per 100 that only work in one state

Colonies: Total number of honey bee colonies

Colonies Exclusive to State: Annual number of colonies per 100 that remain in the same state for the entire year.

Why I chose this Topic

The reason I chose this topic is because my initial topic was stolen, sad face. While scrolling around the DATA 110 dataset folder, I came upon the bee_colony_loss dataset. I was instantly reminded that for my birthday by auntie provides me with her freshly homemade baklava made from her own beehive. I was devastated from the thought if bees were to extinct, I would never be able to taste my aunties delicious baklava again. From there, I was intrigued and wanted to explore this dataset.

Required Packages

Packages required for this project:

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("dplyr")
library("ggplot2")
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("magrittr")
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library("ggthemes")
library("usmap")
library("knitr")

Storing Data

# Setting working directory
setwd("C:/Users/Jerem/OneDrive/Documents/Montgomery College/Fall 2021/DATA 110/Project 2")
# Provide a fast and friendly way to read rectangular data
library(readxl)
# rereading dataset into bees
bees <- read_csv("bee_colony_loss.csv")
## Rows: 365 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): Year, Season, ÿState, ÿTotal Annual Loss, ÿBeekeepers Exclusive to ...
## dbl (2): ÿBeekeepers, ÿColonies
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(bees)
str(bees)
## spec_tbl_df [365 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Year                          : chr [1:365] "2016/17" "2016/17" "2016/17" "2016/17" ...
##  $ Season                        : chr [1:365] "Annual" "Annual" "Annual" "Annual" ...
##  $ ÿState                        : chr [1:365] "Massachusetts" "Montana" "Nevada" "Maine" ...
##  $ ÿTotal Annual Loss            : chr [1:365] "15.90%" "17.10%" "23.00%" "23.30%" ...
##  $ ÿBeekeepers                   : num [1:365] 87 21 13 65 18 10 9 52 30 62 ...
##  $ ÿBeekeepers Exclusive to State: chr [1:365] "94.30%" "52.40%" "92.30%" "93.80%" ...
##  $ ÿColonies                     : num [1:365] 27186 35905 2512 41102 6521 ...
##  $ ÿColonies Exclusive to State  : chr [1:365] "2.30%" "0.30%" "5.20%" "1.40%" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Year = col_character(),
##   ..   Season = col_character(),
##   ..   ÿState = col_character(),
##   ..   `ÿTotal Annual Loss` = col_character(),
##   ..   ÿBeekeepers = col_double(),
##   ..   `ÿBeekeepers Exclusive to State` = col_character(),
##   ..   ÿColonies = col_double(),
##   ..   `ÿColonies Exclusive to State` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Preparation

Each title of each columns besides Year and State has a letter y with a diaeresis next to it. For convenience stake, I am giving each column a new name.

# Give new column names
colnames(bees) <- c("Year", "Season", "State","Annual.Percent.Loss", "Beekeeper.Count", "Beekeepers.Percent.Exclusive.to.State","Colony.Count","Colonies.Percent.Exclusive.to.State")
head(bees)
## # A tibble: 6 x 8
##   Year    Season State         Annual.Percent.~ Beekeeper.Count Beekeepers.Perc~
##   <chr>   <chr>  <chr>         <chr>                      <dbl> <chr>           
## 1 2016/17 Annual Massachusetts 15.90%                        87 94.30%          
## 2 2016/17 Annual Montana       17.10%                        21 52.40%          
## 3 2016/17 Annual Nevada        23.00%                        13 92.30%          
## 4 2016/17 Annual Maine         23.30%                        65 93.80%          
## 5 2016/17 Annual Wyoming       23.40%                        18 77.80%          
## 6 2016/17 Annual Hawaii        26.20%                        10 100.00%         
## # ... with 2 more variables: Colony.Count <dbl>,
## #   Colonies.Percent.Exclusive.to.State <chr>

Several columns are factors when they should be numeric.

# removing % symbols for pure numeric values
bees[] <- lapply(bees, gsub, pattern = '%', replacement = '')

bees$Annual.Percent.Loss <- as.numeric(bees$Annual.Percent.Loss)
bees$Beekeepers.Percent.Exclusive.to.State <- as.numeric(bees$Beekeepers.Percent.Exclusive.to.State)
bees$Colonies.Percent.Exclusive.to.State <- as.numeric(bees$Colonies.Percent.Exclusive.to.State)
head(bees)
## # A tibble: 6 x 8
##   Year    Season State         Annual.Percent.~ Beekeeper.Count Beekeepers.Perc~
##   <chr>   <chr>  <chr>                    <dbl> <chr>                      <dbl>
## 1 2016/17 Annual Massachusetts             15.9 87                          94.3
## 2 2016/17 Annual Montana                   17.1 21                          52.4
## 3 2016/17 Annual Nevada                    23   13                          92.3
## 4 2016/17 Annual Maine                     23.3 65                          93.8
## 5 2016/17 Annual Wyoming                   23.4 18                          77.8
## 6 2016/17 Annual Hawaii                    26.2 10                         100  
## # ... with 2 more variables: Colony.Count <chr>,
## #   Colonies.Percent.Exclusive.to.State <dbl>

Beekeeper Count and Colony Count randomly turned into characters. Here we changed them back to numeric values

bees$Beekeeper.Count<- as.integer(bees$Beekeeper.Count)
bees$Colony.Count<- as.integer(bees$Colony.Count)
bees$Colony.Count<- as.numeric(bees$Colony.Count)
head(bees)
## # A tibble: 6 x 8
##   Year    Season State         Annual.Percent.~ Beekeeper.Count Beekeepers.Perc~
##   <chr>   <chr>  <chr>                    <dbl>           <int>            <dbl>
## 1 2016/17 Annual Massachusetts             15.9              87             94.3
## 2 2016/17 Annual Montana                   17.1              21             52.4
## 3 2016/17 Annual Nevada                    23                13             92.3
## 4 2016/17 Annual Maine                     23.3              65             93.8
## 5 2016/17 Annual Wyoming                   23.4              18             77.8
## 6 2016/17 Annual Hawaii                    26.2              10            100  
## # ... with 2 more variables: Colony.Count <dbl>,
## #   Colonies.Percent.Exclusive.to.State <dbl>

In this project, I am using plotly to make interactive maps. States need to be abbreviated so choropleth maps and plotly can work properly.

State Abbreviation dataset was obtain from World Population Review website: https://worldpopulationreview.com/states/state-abbreviations

# Storing state code abbreviation
states <- read_csv("states.csv")
## Rows: 51 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): State, Abbrev, Code
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(states)

# Joining both datasets together
new_bees <-bees %>%
  inner_join(states, by.x = State, by.y = state) %>%
  select(Year, Season, Code, Annual.Percent.Loss, Beekeeper.Count, Beekeepers.Percent.Exclusive.to.State, Colony.Count, Colonies.Percent.Exclusive.to.State)
## Joining, by = "State"
head(new_bees)
## # A tibble: 6 x 8
##   Year    Season Code  Annual.Percent.Loss Beekeeper.Count Beekeepers.Percent.E~
##   <chr>   <chr>  <chr>               <dbl>           <int>                 <dbl>
## 1 2016/17 Annual MA                   15.9              87                  94.3
## 2 2016/17 Annual MT                   17.1              21                  52.4
## 3 2016/17 Annual NV                   23                13                  92.3
## 4 2016/17 Annual ME                   23.3              65                  93.8
## 5 2016/17 Annual WY                   23.4              18                  77.8
## 6 2016/17 Annual HI                   26.2              10                 100  
## # ... with 2 more variables: Colony.Count <dbl>,
## #   Colonies.Percent.Exclusive.to.State <dbl>

Later on in my analysis, I found it troublesome to work with the leading years in the Year column. I removed the leading years from the year column.

# the Year column only shows the first 4 characters in each row
new_bees$Year <- substr(new_bees$Year,1,4)
head(new_bees)
## # A tibble: 6 x 8
##   Year  Season Code  Annual.Percent.Loss Beekeeper.Count Beekeepers.Percent.Exc~
##   <chr> <chr>  <chr>               <dbl>           <int>                   <dbl>
## 1 2016  Annual MA                   15.9              87                    94.3
## 2 2016  Annual MT                   17.1              21                    52.4
## 3 2016  Annual NV                   23                13                    92.3
## 4 2016  Annual ME                   23.3              65                    93.8
## 5 2016  Annual WY                   23.4              18                    77.8
## 6 2016  Annual HI                   26.2              10                   100  
## # ... with 2 more variables: Colony.Count <dbl>,
## #   Colonies.Percent.Exclusive.to.State <dbl>

Adding a new variables called “hover” for my plotly text descriptions.

new_bees$hover <- with(new_bees, paste("State:", new_bees$Code, 
                                     "<br>",
                                     "Beekeeper Count:", new_bees$Beekeeper.Count,
                                     "<br>",
                                     "Beekeepers %:", new_bees$Beekeepers.Percent.Exclusive.to.State, 
                                     "<br>",
                                     "Colony Count:", new_bees$Colony.Count, 
                                     "<br>",
                                     "Colonies %:", new_bees$Colonies.Percent.Exclusive.to.State))

Data Cleaning

Checking how many rows that have NA values.

row.has.na<- apply(new_bees,1,function(x){any(is.na(x))})
sum(row.has.na)
## [1] 13

According to the raw dataset there are 13 rows with NA values. To make sure there are no value errors within the analysis, I am removing all NA value rows.

# Removing "NA"s
df_bees <- new_bees[!row.has.na,]
str(df_bees)
## tibble [342 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Year                                 : chr [1:342] "2016" "2016" "2016" "2016" ...
##  $ Season                               : chr [1:342] "Annual" "Annual" "Annual" "Annual" ...
##  $ Code                                 : chr [1:342] "MA" "MT" "NV" "ME" ...
##  $ Annual.Percent.Loss                  : num [1:342] 15.9 17.1 23 23.3 23.4 26.2 26.3 26.6 27.3 29.2 ...
##  $ Beekeeper.Count                      : int [1:342] 87 21 13 65 18 10 9 52 30 62 ...
##  $ Beekeepers.Percent.Exclusive.to.State: num [1:342] 94.3 52.4 92.3 93.8 77.8 100 22.2 94.2 83.3 82.3 ...
##  $ Colony.Count                         : num [1:342] 27186 35905 2512 41102 6521 ...
##  $ Colonies.Percent.Exclusive.to.State  : num [1:342] 2.3 0.3 5.2 1.4 1.4 100 0.2 79.8 1.9 4.2 ...
##  $ hover                                : chr [1:342] "State: MA <br> Beekeeper Count: 87 <br> Beekeepers %: 94.3 <br> Colony Count: 27186 <br> Colonies %: 2.3" "State: MT <br> Beekeeper Count: 21 <br> Beekeepers %: 52.4 <br> Colony Count: 35905 <br> Colonies %: 0.3" "State: NV <br> Beekeeper Count: 13 <br> Beekeepers %: 92.3 <br> Colony Count: 2512 <br> Colonies %: 5.2" "State: ME <br> Beekeeper Count: 65 <br> Beekeepers %: 93.8 <br> Colony Count: 41102 <br> Colonies %: 1.4" ...

Data Frame Summary

df_bees %>% head() %>% knitr::kable()
Year Season Code Annual.Percent.Loss Beekeeper.Count Beekeepers.Percent.Exclusive.to.State Colony.Count Colonies.Percent.Exclusive.to.State hover
2016 Annual MA 15.9 87 94.3 27186 2.3 State: MA
Beekeeper Count: 87
Beekeepers %: 94.3
Colony Count: 27186
Colonies %: 2.3
2016 Annual MT 17.1 21 52.4 35905 0.3 State: MT
Beekeeper Count: 21
Beekeepers %: 52.4
Colony Count: 35905
Colonies %: 0.3
2016 Annual NV 23.0 13 92.3 2512 5.2 State: NV
Beekeeper Count: 13
Beekeepers %: 92.3
Colony Count: 2512
Colonies %: 5.2
2016 Annual ME 23.3 65 93.8 41102 1.4 State: ME
Beekeeper Count: 65
Beekeepers %: 93.8
Colony Count: 41102
Colonies %: 1.4
2016 Annual WY 23.4 18 77.8 6521 1.4 State: WY
Beekeeper Count: 18
Beekeepers %: 77.8
Colony Count: 6521
Colonies %: 1.4
2016 Annual HI 26.2 10 100.0 84 100.0 State: HI
Beekeeper Count: 10
Beekeepers %: 100
Colony Count: 84
Colonies %: 100

Exploratory Data Analysis

Is There Actually an Annual Loss in Honeybees Throughout the Years

To begin with, I want to address if there is actually an annual loss of honeybees throughout the years.

Simply Graph

bee_bargraph <- df_bees %>%
  ggplot(aes(x = Year, y = Annual.Percent.Loss)) +
  geom_col(aes(fill = Year)) +
  labs(title = "Annual Loss Throughout 2010 - 2016",
       x = "Year",
       y = "Annual Loss")

bee_bargraph

Certainly, the annual loss of honeybees has spiked since 2011 by 00%. On average, 00% of honeybees disappear each year.

Statistical Anaylsis

### Make Box plots of % loss by year

bee_boxplots <- ggplot(df_bees, aes(Year, Annual.Percent.Loss)) +
              geom_boxplot(fill = 7) +
        scale_y_continuous(name = "Annual Percent Loss",
                           breaks = seq(0, 100, 10),
                           limits=c(0, 100)) +
        ggtitle("Box Plot of Annual Percent Loss by Year") +
        theme_dark() +
        scale_colour_brewer(palette="Accent")

bee_boxplots

The variability from year to year is quite high, but majority of colony loss hasn’t change overall. The year of 2010 is quite eye catching. The variability on annual loss was high before the initial incline of annual loss.

Additional Statistics

summary(df_bees)
##      Year              Season              Code           Annual.Percent.Loss
##  Length:342         Length:342         Length:342         Min.   : 7.50      
##  Class :character   Class :character   Class :character   1st Qu.:30.50      
##  Mode  :character   Mode  :character   Mode  :character   Median :39.85      
##                                                           Mean   :41.22      
##                                                           3rd Qu.:48.65      
##                                                           Max.   :86.90      
##  Beekeeper.Count  Beekeepers.Percent.Exclusive.to.State  Colony.Count   
##  Min.   :  5.00   Min.   :  3.90                        Min.   :    29  
##  1st Qu.: 23.00   1st Qu.: 85.62                        1st Qu.:  1046  
##  Median : 55.00   Median : 92.90                        Median :  5740  
##  Mean   : 87.97   Mean   : 87.05                        Mean   : 31906  
##  3rd Qu.:112.75   3rd Qu.: 96.78                        3rd Qu.: 31220  
##  Max.   :828.00   Max.   :100.00                        Max.   :625897  
##  Colonies.Percent.Exclusive.to.State    hover          
##  Min.   :  0.00                      Length:342        
##  1st Qu.:  3.50                      Class :character  
##  Median : 20.35                      Mode  :character  
##  Mean   : 38.87                                        
##  3rd Qu.: 79.50                                        
##  Max.   :100.00

Questions Associates with Dataset

Perhaps some appropriate questions associates with the dataset include: How does the number of beekeepers in a state relate to the number of bee colonies? Do particular states attract more bee colonies? Do particular states attract more bee keepers? Is there a correlation between the number of beekeepers vs number of bees and the annual loss across states? Is there a correlation between the four cardinal directions and the annual loss?

Annual Percent loss on the map of the United States

#color palette didn't work
Honey <- c("#E0900E", "#FBc349", "F8B51E", "#FFFDDE", "#DCC4A4" )
bee_map1 <- plot_geo(df_bees,
                    locationmode = 'USA-states',
                    frame = ~Year) %>%
  add_trace(locations = ~Code,
            z = ~Annual.Percent.Loss,
            zmin = 0,
            zmax = max(df_bees$Annual.Percent.Loss),
            color = ~Annual.Percent.Loss,
            colorscale = "heat",
            text = ~hover,
            hoverinfo = 'text') %>%
  layout(geo = list(scope = 'usa'),
         title = "Annual Percent Loss of Honeybees in the US\n2010 - 2016") %>%
  config(displayModeBar = FALSE) %>%
  colorbar(ticksuffix = "%")

bee_map1

Missing color is due to the removal of NA values.

Scrolling through the years it seems like the east has a higher annual loss than the west.

Checking if states are equally represented

df_bees %>%
        group_by(Code) %>%
        tally()
## # A tibble: 50 x 2
##    Code      n
##    <chr> <int>
##  1 AL        7
##  2 AR        7
##  3 AZ        7
##  4 CA        7
##  5 CO        7
##  6 CT        7
##  7 DC        5
##  8 DE        7
##  9 FL        7
## 10 GA        7
## # ... with 40 more rows

On average, each state has 7 rows of data - one row per year excluding District of Columbia (5 rows of data), Mississippi (6 rows of data), Nevada (5 rows of data), South Dakota (5 rows of data), and Wyoming (5 rows of data).

Conclusion

From this project analysis, we can conclude that there has been a large disappearance of honeybees throughout the years of 2010 - 2016.

Time wasn’t on my side for this project. I would of like to explore the factors for the disappearance of honeybees and the questions I was curious about the dataset.

In addition, if I had more time, I would of like to find and join others dataset to explore the massive decreases of the honeybee’s population each year.

Dataset I would of joined:

Citations

[1] EPA, Environmental Protection Agency, https://www.epa.gov/pollinator-protection/colony-collapse-disorder.

[2] Medicine, Center for Veterinary. “Helping Agriculture’s Helpful Honey Bees.” U.S. Food and Drug Administration, FDA, https://www.fda.gov/animal-veterinary/animal-health-literacy/helping-agricultures-helpful-honey-bees.

[3] Crane, Eva (1983) The Archaeology of Beekeeping, Cornell University Press, ISBN 0-8014-1609-4