I will be working with the dataset “Honey production in USA from 1998 - 2012”. I read that Morgan Freeman converted his 124-acre ranch into a giant honeybee sanctuary to save the bees. That intrigued me enough to dig deeper, so I chose this dataset to see what the numbers show.
While doing some research I found this website, https://www.ars.usda.gov/oc/br/ccd/index/#public, which explains in detail why the honey bee is important to human beings. The statement that stood out to me was: “About one mouthful in three in our diet directly or indirectly benefits from honey bee pollination. Commercial production of many high-value and specialty crops like almonds and other tree nuts, berries, fruits and vegetables depend on pollination by honey bees. These are the foods that give our diet diversity, color, and flavor.” This one statement is enough to convey how much depends on honey bees.
In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon of disappearing worker bees causing the remaining hive colony to collapse. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. Twelve years later, some industries are observing recovery but the American honey industry is still largely struggling. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America by state from 1998 to 2012. This context is taken from Kaggle.
### Loading Libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(htmlwidgets)
setwd("~/Desktop/Pankti _ Data Science/honey production")
honey_production <- read_csv("~/Desktop/Pankti _ Data Science/honey production/honeyproduction.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## state = col_character(),
## numcol = col_double(),
## yieldpercol = col_double(),
## totalprod = col_double(),
## stocks = col_double(),
## priceperlb = col_double(),
## prodvalue = col_double(),
## year = col_double()
## )
glimpse(honey_production)
## Rows: 626
## Columns: 8
## $ state <chr> "AL", "AZ", "AR", "CA", "CO", "FL", "GA", "HI", "ID", "IL…
## $ numcol <dbl> 16000, 55000, 53000, 450000, 27000, 230000, 75000, 8000, …
## $ yieldpercol <dbl> 71, 60, 65, 83, 72, 98, 56, 118, 50, 71, 92, 78, 46, 50, …
## $ totalprod <dbl> 1136000, 3300000, 3445000, 37350000, 1944000, 22540000, 4…
## $ stocks <dbl> 159000, 1485000, 1688000, 12326000, 1594000, 4508000, 307…
## $ priceperlb <dbl> 0.72, 0.64, 0.59, 0.62, 0.70, 0.64, 0.69, 0.77, 0.65, 1.1…
## $ prodvalue <dbl> 818000, 2112000, 2033000, 23157000, 1361000, 14426000, 28…
## $ year <dbl> 1998, 1998, 1998, 1998, 1998, 1998, 1998, 1998, 1998, 199…
numcol: Number of honey-producing colonies. Honey-producing colonies are the maximum number of colonies from which honey was taken during the year. It is possible to take honey from colonies which did not survive the entire year.
yieldpercol: Honey yield per colony. Unit is pounds (quantitative).
totalprod: Total production (numcol x yieldpercol). Unit is pounds (quantitative).
stocks: Stocks held by producers. Unit is pounds (quantitative).
priceperlb: Average price per pound based on expanded sales. Unit is dollars (quantitative).
prodvalue: Value of production (totalprod x priceperlb). Unit is dollars (quantitative).
Other useful information: Certain states are excluded every year (ex. CT) to avoid disclosing data for individual operations. Due to rounding, total colonies multiplied by total yield may not equal production. Also, summation of states will not equal U.S. level value of production.
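As a quick sanity check on the two derived columns, the sketch below recomputes them for the first three 1998 rows, with values copied from the glimpse() output above. The data frame name `check` and the helper columns are my own, introduced only for illustration, and the 1,000-unit tolerance is an assumption based on the rounding note.

```r
# First three 1998 rows (AL, AZ, AR), copied from the glimpse() output
check <- data.frame(
  state       = c("AL", "AZ", "AR"),
  numcol      = c(16000, 55000, 53000),
  yieldpercol = c(71, 60, 65),
  totalprod   = c(1136000, 3300000, 3445000),
  priceperlb  = c(0.72, 0.64, 0.59),
  prodvalue   = c(818000, 2112000, 2033000)
)

# Recompute the derived columns from their definitions
check$prod_calc  <- check$numcol * check$yieldpercol    # numcol x yieldpercol
check$value_calc <- check$totalprod * check$priceperlb  # totalprod x priceperlb

# Differences between recomputed and reported values
check$prod_diff  <- check$prod_calc - check$totalprod
check$value_diff <- check$value_calc - check$prodvalue

print(check[, c("state", "prod_diff", "value_diff")])
```

For these rows the production figures match exactly, while the recomputed values of production differ from the reported ones by less than 1,000 dollars, which is consistent with the rounding mentioned above.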
The dataset provided is pretty clean, but as a quality check I will see whether it contains any NAs.
colSums(is.na(honey_production)) #Checking NA
## state numcol yieldpercol totalprod stocks priceperlb
## 0 0 0 0 0 0
## prodvalue year
## 0 0
head(honey_production)
## # A tibble: 6 x 8
## state numcol yieldpercol totalprod stocks priceperlb prodvalue year
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL 16000 71 1136000 159000 0.72 818000 1998
## 2 AZ 55000 60 3300000 1485000 0.64 2112000 1998
## 3 AR 53000 65 3445000 1688000 0.59 2033000 1998
## 4 CA 450000 83 37350000 12326000 0.62 23157000 1998
## 5 CO 27000 72 1944000 1594000 0.7 1361000 1998
## 6 FL 230000 98 22540000 4508000 0.64 14426000 1998
Beginning with exploratory analysis, I will check total production, price per lb, and yield per colony for each year in the dataset. This analysis will help me find the correlation between the measures and determine whether it is positive or negative.
Prod_per_year <- honey_production%>% # new dataset
group_by(year)%>% # group the dataset by year
summarise(TotalProd = sum(totalprod)) # sum the total honey production for each year
options(scipen = 999)
prod <- ggplot(Prod_per_year,aes(x=year , y= TotalProd))+
geom_line(color = '#FFCC33')+
geom_point()+
geom_hline(yintercept = median(Prod_per_year$TotalProd), linetype = "dashed", color= "#99CCCC",size = 1)+
annotate("text", x = 1999, y = 175000000, label = "Median Line", vjust = -0.5,color="red")+
labs(x="Year",y="Total Production(lbs)", title="Total Production for years(1998-2012)",caption = "Dataset from Kaggle")+
theme_minimal()
fig<-ggplotly(prod)
fig
Looking at the graph, the lowest production was in 2012. Measured against the median line, the decline in production started in 2005, with 2010 as the exception. Since production decreased, let's look at how the price varied.
Avg_price <- honey_production%>% # new dataset
group_by(year)%>% # group the dataset by year
summarise(priceperLB = mean(priceperlb)) # mean of the price per lb for each year, 1998-2012
price <- ggplot(Avg_price,aes(x=year , y=priceperLB ))+
geom_line(color = '#FFCC33')+
geom_point()+
geom_hline(yintercept = mean(Avg_price$priceperLB), linetype = "dashed", color= "#99CCCC",size = 1)+
annotate("text", x = 1999, y = 1.5, label = "Mean Line", vjust = -0.5,color="red")+
labs(x="Year",y="Price(perlb)", title="Mean price(perlb) by years(1998-2012)",caption = "Dataset from Kaggle")+
theme_minimal()
fig<-ggplotly(price)
fig
The lowest price was in 2000, when honey production was roughly 220 million lbs. The price started increasing in 2008 and peaked in 2012, the year production was lowest.
Avg_yield <- honey_production%>% # new dataset
group_by(year)%>% # group the dataset by year
summarise(Yieldpercol = mean(yieldpercol)) # mean of the yield per colony for each year, 1998-2012
yield <- ggplot(Avg_yield,aes(x=year , y=Yieldpercol ))+
geom_line(color = '#FFCC33')+
geom_point()+
geom_hline(yintercept = mean(Avg_yield$Yieldpercol), linetype = "dashed", color= "#99CCCC",size=1)+
annotate("text", x = 1999, y = 62, label = "Mean Line", vjust = -0.4,color="red")+
labs(x="Year",y="Yield per colony (lbs)", title="Mean yield per colony by years(1998-2012)")+
theme_minimal()
fig<-ggplotly(yield)
fig
The yield has been decreasing since 2006, with the lowest value in 2012. I tried to find out why there was a decrease in 2003 but could not find concrete data. I attribute the decline that started after 2008 to the economic crisis that occurred at that time.
corr_yield_price <- honey_production%>%
group_by(year)%>%
summarise(Yieldpercol = mean(yieldpercol),priceperLB = mean(priceperlb))
corrplot_yield_price <- corr_yield_price %>%
ggplot(aes(`Yieldpercol`, `priceperLB`)) +
geom_point(shape = 9, size = 2, color="#FFCC66") +
geom_smooth(method='lm',color="#99CCCC",fill="#cccccc")+
labs(x="Yield per colony (lbs)",y="Price (per lb)", title="Correlation between average yield per colony and average price per lb")+
theme_light()
fig<-ggplotly(corrplot_yield_price)
## `geom_smooth()` using formula 'y ~ x'
fig
cor(corr_yield_price$Yieldpercol,corr_yield_price$priceperLB)
## [1] -0.8909709
I calculated the correlation between yield per colony and price per lb to see whether the data support my assumption of a negative correlation between the two fields. The plot shows an inverse relationship: years with a higher yield per colony have a lower price per lb. The correlation coefficient of about -0.89 also supports what the graph shows.
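To make the correlation value less of a black box, here is a minimal sketch that computes the Pearson coefficient from its definition, compares it with cor(), and uses cor.test() to attach a p-value. The data frame `toy` holds made-up numbers for illustration only; they are not the real yearly averages from this dataset.

```r
# Hypothetical yearly averages, for illustration only (not the real values)
toy <- data.frame(
  Yieldpercol = c(70, 68, 66, 63, 61, 58, 56),
  priceperLB  = c(0.8, 0.9, 1.1, 1.3, 1.4, 1.7, 1.9)
)

# Pearson correlation from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(toy$Yieldpercol, toy$priceperLB) /
  (sd(toy$Yieldpercol) * sd(toy$priceperLB))

# cor() uses the Pearson method by default, so it should agree
r_builtin <- cor(toy$Yieldpercol, toy$priceperLB)

# cor.test() adds a p-value and a confidence interval for the coefficient
result <- cor.test(toy$Yieldpercol, toy$priceperLB)

print(c(manual = r_manual, builtin = r_builtin))
print(result$p.value)
```

The same cor.test() call could be applied to `corr_yield_price$Yieldpercol` and `corr_yield_price$priceperLB`. With only 15 yearly points, the confidence interval it reports is a useful reminder of how uncertain the estimate is.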
I liked working with this dataset because it taught me a lot about the honey industry and why, for various reasons, it is still struggling compared to other industries. One reason is that demand has skyrocketed and cannot be met domestically: US beekeepers say they need to be paid $2 per pound to break even, while foreign honey can be imported for as little as 81 cents per pound (from https://www.bbc.com/news/business-53149367). The other problem is that the sector remains largely self-regulated, with very little government monitoring. Take the USDA’s grading system: it isn’t actually enforced. Honeys are not routinely tested by the department, or any other federal agency (from https://www.bbc.com/news/business-53149367).
I wanted to work with more recent data to see how things have changed. Unfortunately I could not find it, but my search is still on, and I would like to continue with the same dataset for my final project if I can get a recent version.
The one thing that did not work for me is the caption: it does not carry over to ggplotly. I searched but could not find a way to keep the ggplot caption in the interactive plot.
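One workaround that is often suggested is to add the caption back as a plotly annotation through layout(), positioned in "paper" coordinates below the plotting area. The sketch below assumes that approach. The data frame `toy`, the margin, and the x/y placement values are my own illustrative choices and would likely need tuning for the real plots.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data standing in for Prod_per_year
toy <- data.frame(year = 1998:2002, TotalProd = c(5, 4, 6, 3, 2))

p <- ggplot(toy, aes(x = year, y = TotalProd)) +
  geom_line() +
  labs(x = "Year", y = "Total Production (lbs)", title = "Toy example")

caption_text <- "Dataset from Kaggle"

# ggplotly() does not carry over labs(caption = ...), so the caption is
# re-added as an annotation placed relative to the whole plot ("paper")
fig <- ggplotly(p) %>%
  layout(
    margin = list(b = 90),            # extra bottom space for the caption
    annotations = list(
      text = caption_text,
      x = 1, y = -0.25,               # bottom-right, below the x axis
      xref = "paper", yref = "paper",
      xanchor = "right", yanchor = "top",
      showarrow = FALSE
    )
  )
```

Printing `fig` then displays the interactive plot with the caption text underneath it.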