1 - Context

In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component to American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon of disappearing worker bees causing the remaining hive colony to collapse. Speculation to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. Twelve years later, some industries are observing recovery but the American honey industry is still largely struggling. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America by state from 1998 to 2012.

2 - Data Source

NASS

NASS(National Agricultural Statistics Service) is a part of United States Department of Agriculture(USDA). NASS conducts many surveys every year and prepares reports covering virtually every aspect of U.S. agriculture such as production and supplies of food and fiber, prices paid and recieived by farmers, farm labor and wages, farm finances, chemical use, and changes in the demographics of U.S. and so on. Those are a few of what they are doing. NSAA largely is responsible for collecting data in agricultural industry. Since this organization is backed by U.S. government, those datasets are reliable and accurate.

3 - Analysis

3.1 Research Questions

Keeping the following as the main points to be analyzed:

  • How has honey production yield changed from 1998 to 2012?
  • Over time, which states produce the most honey? Which produce the least?
  • Does the data show any trends in terms of the number of honey producing colonies and yield per colony before 2006, which was when concern over Colony Collapse Disorder spread nationwide?
  • Are there any patterns that can be observed between total honey production and value of production every year? How has value of production, which in some sense could be tied to demand, changed every year?

3.2 Data Preparation

  • Let’s load up some packages, our data, and take a peek at the variables involved.
suppressMessages(library(lubridate))
suppressMessages(library(tidyverse))
suppressMessages(library(cowplot))
suppressMessages(library(scales))
suppressMessages(library(DataCombine))
suppressMessages(library(stargazer))
suppressMessages(library(sandwich))
suppressMessages(library(dyn) )
suppressMessages(library(lmtest))
suppressMessages(library(data.table))
suppressMessages(library(tidyr))
suppressMessages(library(readr))
suppressMessages(library(stringr))
suppressMessages(library(forecast))
suppressMessages(library(gridExtra))

  • Set working directory.
dir <- "../Data/"
func <- "../Codes/" 
output <- "../Output/"

  • Call function.
source(paste0(func, "theme_bg.R"))
source(paste0(func, "da_helper_functions.R"))

  • Here are 3 datasets with the same type of information on honeybee production and prices. They differ by years (1998-2002, 2003-2007, 2008-2012). The file is quite messy, so use read_lines() and then convert each vector into a dataframe.
#Load Data
honeyraw9802 <- read_lines(paste(dir, "honeyraw_1998to2002.csv", sep=""))
honeyraw9802 <- data.frame(honeyraw9802[-1:-4])

honeyraw0307 <- read_lines(paste(dir,"honeyraw_2003to2007.csv", sep=""))
honeyraw0307 <- data.frame(honeyraw0307[-1:-76])

honeyraw0812 <- read_lines(paste(dir, "honeyraw_2008to2012.csv", sep=""))
honeyraw0812 <- data.frame(honeyraw0812[-1:-67])

  • Combine the 3 datasets.
# Clean Data
honeyrawlist <- list(honeyraw9802,honeyraw0307,honeyraw0812)

honeyfinallist <- list()
suppressWarnings(
for(honey_idx in 1:length(honeyrawlist)) {

  honey <- honeyrawlist[[honey_idx]]

  honey <- honey %>%
    separate(colnames(honey[1]), paste0("X", 1:9), sep = ",") %>%
    setNames(c("X1", "X2", "state", "numcol", "yieldpercol","totalprod",
               "stocks","priceperlb","prodvalue")) %>%
    select(-X1, -X2)
  
  honey$state <- gsub("\"", "", honey$state)
  honey$numcol <- as.integer(honey$numcol)*1000
  honey$yieldpercol <- as.integer(honey$yieldpercol)
  honey$totalprod <- as.integer(honey$totalprod)*1000
  honey$stocks <- as.integer(honey$stocks)*1000
  honey$priceperlb <- as.numeric(honey$priceperlb)/100
  honey$prodvalue <- as.numeric(honey$prodvalue)*1000

  honey <- honey %>%
    filter(!is.na(prodvalue)&!grepl("/", honey$state))

  honeyfinallist[[honey_idx]] <- honey
}
)

# Combine all into one dataframe.
suppressWarnings(
honeyfinaldf <- bind_rows(honeyfinallist)
)

# Use match() to find full state names and replace with abbreviations
matches_and_nas <- match(honeyfinaldf$state, state.name)
matches <- !is.na(matches_and_nas)
honeyfinaldf$state[matches] <- state.abb[matches_and_nas[matches]]

# Lastly, we want to add a year column  
ystart <- c(grep("AL", honeyfinaldf$state))
yend <- c(grep("WY", honeyfinaldf$state))
years <- 1998:2012
ynums <- yend - ystart + 1
honeyfinaldf$year <- rep(years, ynums)

#adding a new variable consumption
honeyfinaldf$consumption <- honeyfinaldf$totalprod-honeyfinaldf$stocks

suppressMessages(glimpse(honeyfinaldf))
## Observations: 626
## Variables: 9
## $ state       <chr> "AL", "AZ", "AR", "CA", "CO", "FL", "GA", "HI", "I...
## $ numcol      <dbl> 16000, 55000, 53000, 450000, 27000, 230000, 75000,...
## $ yieldpercol <int> 71, 60, 65, 83, 72, 98, 56, 118, 50, 71, 92, 78, 4...
## $ totalprod   <dbl> 1136000, 3300000, 3445000, 37350000, 1944000, 2254...
## $ stocks      <dbl> 159000, 1485000, 1688000, 12326000, 1594000, 45080...
## $ priceperlb  <dbl> 0.72, 0.64, 0.59, 0.62, 0.70, 0.64, 0.69, 0.77, 0....
## $ prodvalue   <dbl> 818000, 2112000, 2033000, 23157000, 1361000, 14426...
## $ year        <int> 1998, 1998, 1998, 1998, 1998, 1998, 1998, 1998, 19...
## $ consumption <dbl> 977000, 1815000, 1757000, 25024000, 350000, 180320...
  • Variables of the honeyproduction dataset is provided below:
    1. state: 2-digit code standing for each state across America.
    2. numcol: Number of honey producing colonies. Honey producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies which did not survive the entire year.
    3. yieldpercol: Honey yield per colony. Unit is pounds.
    4. totalprod: Total production (numcol x yieldpercol). Unit is pounds.
    5. stocks: Refers to stocks held by producers. Unit is pounds.
    6. priceperlb: Refers to average price per pound based on expanded sales. Unit is dollars.
    7. prodvalue: Value of production (totalprod x priceperlb). Unit is dollars.

  • Finally, the new dataframe satisfies the criteria of tidy data and is saved to a new csv ready to be uploaded.
# Save working file 
write.csv(honeyfinaldf, paste(output, "honeyproduction.csv", sep=""), row.names = FALSE)

3.3 Honey Production Yield Change From 1998 to 2012

  • It’s always good to use some data quality checks. There is no missing values.
colSums(is.na(honeyfinaldf))
##       state      numcol yieldpercol   totalprod      stocks  priceperlb 
##           0           0           0           0           0           0 
##   prodvalue        year consumption 
##           0           0           0

  • As seen from the graph, honey production has been decreasing over time.
change <- honeyfinaldf %>%
  group_by(year) %>%
  summarise(YearTotal=sum(totalprod))

ggplot(data=change, aes(x=year, y=YearTotal/1000000)) + 
  geom_smooth(method = "lm") + 
  geom_line(color = "red", size = 2) +
  geom_point() +
  scale_x_continuous(breaks=seq(1998, 2012, 2)) +
  ylab("Production (in Million lb)") +
  xlab("Years") +
  theme_bg()+
  background_grid(major = "xy", minor="y", size.major = 0.2) 

3.4 Which state produce the most honey? Which produces the least?

productionperstate <- honeyfinaldf %>%
    group_by(state) %>%
    summarise(sumprod=sum(totalprod)) %>%
    arrange(desc(sumprod)) %>%
    head(10) %>%
    mutate(percentage=round(sumprod/sum(sumprod)*100,2))
    
ggplot(data=productionperstate, aes(x=reorder(state,sumprod),y=sumprod/1000000))+
    geom_col(aes(fill=state),show.legend=F)+
    geom_text(aes(label=percentage))+
    coord_flip()+
    labs(x="States in the USA",y="Total Honey Produced (in Million lb)",title="Total Honey Produced by Each State")