Project2_Honey Production

Summary:

I will be working with the dataset “Honey production in USA from 1998 - 2012”. I read that Morgan Freeman converted his 124-acre ranch into a giant honeybee sanctuary to save the bees. That made me want to learn more, so I chose this dataset to see what it shows about honey production.

While doing some research I found this website, https://www.ars.usda.gov/oc/br/ccd/index/#public, which explains in detail why the honey bee is important to human beings. The statement that struck me most was: “About one mouthful in three in our diet directly or indirectly benefits from honey bee pollination. Commercial production of many high-value and specialty crops like almonds and other tree nuts, berries, fruits and vegetables depend on pollination by honey bees. These are the foods that give our diet diversity, color, and flavor.” This one statement is enough to convey why honey bees matter.

Dataset Context:

In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon of disappearing worker bees causing the remaining hive colony to collapse. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. Twelve years later, some industries are observing recovery but the American honey industry is still largely struggling. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America by state from 1998 to 2012. The context is from Kaggle.

Loading Libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(htmlwidgets)

Set working directory and read the csv

setwd("~/Desktop/Pankti _ Data Science/honey production")
honey_production <- read_csv("~/Desktop/Pankti _ Data Science/honey production/honeyproduction.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   state = col_character(),
##   numcol = col_double(),
##   yieldpercol = col_double(),
##   totalprod = col_double(),
##   stocks = col_double(),
##   priceperlb = col_double(),
##   prodvalue = col_double(),
##   year = col_double()
## )

glimpse(honey_production)

## Rows: 626
## Columns: 8
## $ state       <chr> "AL", "AZ", "AR", "CA", "CO", "FL", "GA", "HI", "ID", "IL…
## $ numcol      <dbl> 16000, 55000, 53000, 450000, 27000, 230000, 75000, 8000, …
## $ yieldpercol <dbl> 71, 60, 65, 83, 72, 98, 56, 118, 50, 71, 92, 78, 46, 50, …
## $ totalprod   <dbl> 1136000, 3300000, 3445000, 37350000, 1944000, 22540000, 4…
## $ stocks      <dbl> 159000, 1485000, 1688000, 12326000, 1594000, 4508000, 307…
## $ priceperlb  <dbl> 0.72, 0.64, 0.59, 0.62, 0.70, 0.64, 0.69, 0.77, 0.65, 1.1…
## $ prodvalue   <dbl> 818000, 2112000, 2033000, 23157000, 1361000, 14426000, 28…
## $ year        <dbl> 1998, 1998, 1998, 1998, 1998, 1998, 1998, 1998, 1998, 199…

The column descriptions provided in the overview are as follows:

numcol: Number of honey-producing colonies. Honey-producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies which did not survive the entire year.
yieldpercol: Honey yield per colony. Unit is pounds (quantitative).
totalprod: Total production (numcol x yieldpercol). Unit is pounds (quantitative).
stocks: Refers to stocks held by producers. Unit is pounds (quantitative).
priceperlb: Refers to average price per pound based on expanded sales. Unit is dollars (quantitative).
prodvalue: Value of production (totalprod x priceperlb). Unit is dollars (quantitative).

Other useful information: Certain states are excluded every year (ex. CT) to avoid disclosing data for individual operations. Due to rounding, total colonies multiplied by total yield may not equal production. Also, summation of states will not equal U.S. level value of production.
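The two derived columns can be checked against their definitions. The sketch below is only an illustration: it retypes the first six rows of the dataset (the same rows that head() prints in this report) into a small data frame, and the names honey_head, prod_gap, value_gap and tolerance are made up for this example. The tolerance of 1000 is an assumption based on the reported values looking rounded to the nearest thousand.

```r
# First six rows of the dataset, retyped from the head() output in this report.
honey_head <- data.frame(
  state       = c("AL", "AZ", "AR", "CA", "CO", "FL"),
  numcol      = c(16000, 55000, 53000, 450000, 27000, 230000),
  yieldpercol = c(71, 60, 65, 83, 72, 98),
  totalprod   = c(1136000, 3300000, 3445000, 37350000, 1944000, 22540000),
  priceperlb  = c(0.72, 0.64, 0.59, 0.62, 0.70, 0.64),
  prodvalue   = c(818000, 2112000, 2033000, 23157000, 1361000, 14426000)
)

# Recompute the derived columns from their stated definitions.
prod_check  <- honey_head$numcol * honey_head$yieldpercol
value_check <- honey_head$totalprod * honey_head$priceperlb

# Absolute gap between the reported and the recomputed values.
honey_head$prod_gap  <- abs(honey_head$totalprod - prod_check)
honey_head$value_gap <- abs(honey_head$prodvalue - value_check)

# Assumed tolerance: reported figures appear rounded to the nearest thousand.
tolerance <- 1000
print(honey_head[, c("state", "prod_gap", "value_gap")])
print(all(honey_head$prod_gap <= tolerance))
print(all(honey_head$value_gap <= tolerance))
```

For AL, for example, 16000 x 71 = 1136000 matches totalprod exactly, while 1136000 x 0.72 = 817920 differs from the reported prodvalue of 818000 only by rounding.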

The dataset provided is fairly clean, but as a quality check I will see whether any column contains NA values.

colSums(is.na(honey_production)) #Checking NA

##       state      numcol yieldpercol   totalprod      stocks  priceperlb 
##           0           0           0           0           0           0 
##   prodvalue        year 
##           0           0

head(honey_production)

## # A tibble: 6 x 8
##   state numcol yieldpercol totalprod   stocks priceperlb prodvalue  year
##   <chr>  <dbl>       <dbl>     <dbl>    <dbl>      <dbl>     <dbl> <dbl>
## 1 AL     16000          71   1136000   159000       0.72    818000  1998
## 2 AZ     55000          60   3300000  1485000       0.64   2112000  1998
## 3 AR     53000          65   3445000  1688000       0.59   2033000  1998
## 4 CA    450000          83  37350000 12326000       0.62  23157000  1998
## 5 CO     27000          72   1944000  1594000       0.7    1361000  1998
## 6 FL    230000          98  22540000  4508000       0.64  14426000  1998

Beginning with exploratory analysis, I will check total production, price per lb and yield per colony for each year in the dataset. This analysis will help me find out whether these measures are correlated and whether the correlation is positive or negative.

Trend of the honey production in USA from 1998-2012

Prod_per_year <- honey_production %>% # new summary dataset
  group_by(year) %>% # group the dataset by year
  summarise(TotalProd = sum(totalprod)) # total honey production for each year
options(scipen = 999)

Visualisation - Total Honey Production for years(1998-2012)

prod <- ggplot(Prod_per_year,aes(x=year , y= TotalProd))+
  geom_line(color = '#FFCC33')+
  geom_point()+
  geom_hline(yintercept = median(Prod_per_year$TotalProd), linetype = "dashed", color= "#99CCCC",size = 1)+
  annotate("text", x = 1999, y = 175000000, label = "Median Line", vjust = -0.5,color="red")+
  labs(x="Year",y="Total Production(lbs)", title="Total Production for years(1998-2012)",caption = "Dataset from Kaggle")+
  theme_minimal()
fig<-ggplotly(prod)
fig

Analysis of the visualisation (production/LB)

Looking at the graph, the lowest production was in 2012. Compared with the median line, the decline in production started in 2005, with the exception of 2010. As production decreased, let's look at how the price varied.
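Reading the years off the plot against the median line can also be done in code. This is a minimal sketch, assuming a data frame shaped like Prod_per_year (columns year and TotalProd). The helper below_median_years and the toy numbers are hypothetical and are not values from the honey dataset.

```r
# Hypothetical helper: return the years whose value lies below the median.
below_median_years <- function(yearly, value_col = "TotalProd") {
  med <- median(yearly[[value_col]])
  yearly$year[yearly[[value_col]] < med]
}

# Toy yearly totals, made up for illustration only.
toy_prod <- data.frame(
  year      = 2001:2005,
  TotalProd = c(200, 190, 180, 170, 185)
)

# The median of the toy totals is 185, so 2003 and 2004 fall below it.
print(below_median_years(toy_prod))
```

With the real summary, calling below_median_years(Prod_per_year) would list the years that sit under the dashed line in the plot.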

Trend of the price/LB in USA from 1998-2012

Avg_price <- honey_production %>% # new summary dataset
  group_by(year) %>% # group the dataset by year
  summarise(priceperLB = mean(priceperlb)) # mean of the price per lb for each year, 1998-2012

Visualisation - Mean price(perlb) by years(1998-2012)

price <- ggplot(Avg_price, aes(x = year, y = priceperLB)) +
  geom_line(color = '#FFCC33') +
  geom_point() +
  geom_hline(yintercept = mean(Avg_price$priceperLB), linetype = "dashed", color = "#99CCCC", size = 1) +
  annotate("text", x = 1999, y = 1.5, label = "Mean Line", vjust = -0.5, color = "red") +
  labs(x = "Year", y = "Price(perlb)", title = "Mean price(perlb) by years(1998-2012)", caption = "Dataset from Kaggle") +
  theme_minimal()
fig <- ggplotly(price)
fig

Analysis of the visualisation (price/LB)

The lowest price was in the year 2000, when honey production was roughly 220,000,000 lbs. The price started increasing from the year 2008 and was highest in 2012, when production was lowest.

Trend of the yield per colony in USA from 1998-2012

Avg_yield <- honey_production %>% # new summary dataset
  group_by(year) %>% # group the dataset by year
  summarise(Yieldpercol = mean(yieldpercol)) # mean of the yield per colony for each year, 1998-2012

Visualisation - Yield per colony by years(1998-2012)

yield <- ggplot(Avg_yield, aes(x = year, y = Yieldpercol)) +
  geom_line(color = '#FFCC33') +
  geom_point() +
  geom_hline(yintercept = mean(Avg_yield$Yieldpercol), linetype = "dashed", color = "#99CCCC", size = 1) +
  annotate("text", x = 1999, y = 62, label = "Mean Line", vjust = -0.4, color = "red") +
  # yieldpercol is measured in pounds per colony, so label the axis accordingly
  labs(x = "Year", y = "Yield per colony (lbs)", title = "Yield per colony by years(1998-2012)") +
  theme_minimal()
fig <- ggplotly(yield)
fig

Analysis of the visualisation (yield per colony)

The yield has been decreasing since 2006, with the lowest value in 2012. I tried to find out why there was a decrease in 2003 but could not find concrete data. The decline that started after 2008 may be related to the economic crisis that occurred then.

Looking at the graphs, there appears to be a correlation between the yield and the price.

corr_yield_price <- honey_production%>% 
  group_by(year)%>% 
  summarise(Yieldpercol = mean(yieldpercol),priceperLB = mean(priceperlb))

Visualisation - Correlation between average yield and average price per LB

corrplot_yield_price <- corr_yield_price %>%
  ggplot(aes(x = Yieldpercol, y = priceperLB)) +
  geom_point(shape = 9, size = 2, color = "#FFCC66") +
  geom_smooth(method = 'lm', color = "#99CCCC", fill = "#cccccc") +
  # axis labels follow the aes() mapping: yield on x, price on y
  labs(x = "Yield per colony (lbs)", y = "Price (per LB)", title = "Correlation between average yield and average price per LB") +
  theme_light()
fig <- ggplotly(corrplot_yield_price)

## `geom_smooth()` using formula 'y ~ x'

fig

Calculating correlation

cor(corr_yield_price$Yieldpercol,corr_yield_price$priceperLB)

## [1] -0.8909709

Analysis

I calculated the correlation between yield per colony and price per lb to see whether the numbers support my assumption of a negative correlation between the two fields. In the graph, years with a higher average yield per colony have a lower average price per lb, which shows an inverse relation between the two. The correlation value of about -0.89 also supports the graph.
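To make clear what cor() computes, the sketch below works out the Pearson correlation by hand and compares it with cor(), then runs cor.test() from base R as one way to see how much evidence a handful of yearly points carries. The toy data frame and its numbers are made up for illustration and are not the yearly averages from this dataset.

```r
# Toy yearly averages, made up for illustration only.
toy <- data.frame(
  yield = c(70, 68, 65, 60, 58),
  price = c(0.8, 0.9, 1.1, 1.5, 1.8)
)

# Built-in Pearson correlation (the default method of cor()).
r_builtin <- cor(toy$yield, toy$price)

# Same quantity by hand: co-deviation divided by the product of the spreads.
dx <- toy$yield - mean(toy$yield)
dy <- toy$price - mean(toy$price)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# cor.test() adds a p-value and a confidence interval to the estimate.
test_result <- cor.test(toy$yield, toy$price)

print(r_builtin)
print(r_manual)
print(test_result$p.value)
```

On the real data, the same check would be cor.test(corr_yield_price$Yieldpercol, corr_yield_price$priceperLB). Because both series trend over time, a strong correlation between yearly averages does not by itself show that one causes the other.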

I liked working with this dataset as it gave me good knowledge about the honey industry and why it is still struggling compared to other industries. One reason is that demand has skyrocketed and cannot be met domestically: US beekeepers say they need to be paid $2 per pound to break even, while foreign honey can be imported for as little as 81 cents per pound (from https://www.bbc.com/news/business-53149367). The other problem is that the sector remains largely self-regulated, with very little government monitoring. Take the USDA’s grading system: it isn’t actually enforced. Honeys are not routinely tested by the department or any other federal agency (from https://www.bbc.com/news/business-53149367).

I wanted to work with more recent data to see the changes. Unfortunately I could not find it, but my search is on, and I wish to continue with the same dataset for the final project if I can get a more recent version.

The one thing which did not work for me is the caption: the caption set in labs() is not shown once the plot is converted with ggplotly. I searched but could not find a way to show the caption in the ggplotly output.
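One workaround that may help with the missing caption is to add the text as a plotly annotation after the conversion, using plotly::layout(). The sketch below is an untested-on-this-report illustration with made-up toy data; the position values (x, y) and the bottom margin are assumptions that will likely need adjusting for each figure. For plots where ggplotly already creates annotations (for example facet labels), setting annotations this way could replace them.

```r
library(ggplot2)
library(plotly)

# Toy data, made up for illustration only.
toy <- data.frame(year = 2001:2005, value = c(5, 4, 6, 3, 2))

p <- ggplot(toy, aes(x = year, y = value)) +
  geom_line() +
  theme_minimal()

# Convert to plotly, then place the caption text below the plotting area.
fig <- ggplotly(p)
fig <- plotly::layout(
  fig,
  margin = list(b = 80),            # assumed extra bottom space for the caption
  annotations = list(
    list(
      text = "Dataset from Kaggle",
      xref = "paper", yref = "paper", # position relative to the plot area
      x = 1, y = -0.2,                # assumed placement: bottom right
      xanchor = "right", yanchor = "top",
      showarrow = FALSE
    )
  )
)
fig
```

The call is written as plotly::layout() because loading plotly masks graphics::layout(), as the startup messages in this report show.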

Project2_Honey Production_PD

Pankti Dalal

4/17/2021