Content This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here’s how the Hass Avocado Board describes the data on their website:
The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.
Some relevant columns in the dataset:
Date - The date of the observation AveragePrice - the average price of a single avocado type - conventional or organic year - the year Region - the city or region of the observation Total Volume - Total number of avocados sold 4046 - Total number of avocados with PLU 4046 sold 4225 - Total number of avocados with PLU 4225 sold 4770 - Total number of avocados with PLU 4770 sold
# Import Packages
options(warn=-1)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggthemes))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr))
options(scipen=5)
There are 14 variables and 18249 rows in data.
#Check data structure
str(df)
## 'data.frame': 18249 obs. of 14 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Date : chr "2015-12-27" "2015-12-20" "2015-12-13" "2015-12-06" ...
## $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
## $ Total.Volume: num 64237 54877 118220 78992 51040 ...
## $ X4046 : num 1037 674 795 1132 941 ...
## $ X4225 : num 54455 44639 109150 71976 43838 ...
## $ X4770 : num 48.2 58.3 130.5 72.6 75.8 ...
## $ Total.Bags : num 8697 9506 8145 5811 6184 ...
## $ Small.Bags : num 8604 9408 8042 5677 5986 ...
## $ Large.Bags : num 93.2 97.5 103.1 133.8 197.7 ...
## $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
## $ type : chr "conventional" "conventional" "conventional" "conventional" ...
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ region : chr "Albany" "Albany" "Albany" "Albany" ...
After checked the class of each columns, I dropped year column as it is duplicate. We can get it from Date column. Then I decided to convert the Date column from factor to date. Also I changed year to factor from int for better analysis. I also created a new column called month from Date columns in case for monthly analysis. Finally, I sorted the data by Date by ascending order.
# Change the Date from factor to date and order data by Date
df <- df %>%
select(-X,-year) %>%
mutate(Date = as.Date(Date, "%Y-%m-%d"),
year = factor(year(Date)),
month = month(Date, label = TRUE),
weekday = wday(Date, label = TRUE),
year_month = as.factor(format(as.Date(Date), "%Y-%m")),
revenue = AveragePrice* Total.Volume) %>%
arrange(Date)
str(df)
## 'data.frame': 18249 obs. of 17 variables:
## $ Date : Date, format: "2015-01-04" "2015-01-04" ...
## $ AveragePrice: num 1.22 1 1.08 1.01 1.02 1.4 0.93 1.19 1.11 0.88 ...
## $ Total.Volume: num 40873 435021 788025 80034 491738 ...
## $ X4046 : num 2820 364302 53987 44562 7194 ...
## $ X4225 : num 28287 23821 552906 24964 396752 ...
## $ X4770 : num 49.9 82.2 39995 2752.3 128.8 ...
## $ Total.Bags : num 9716 46816 141137 7756 87663 ...
## $ Small.Bags : num 9187 16707 137146 6064 87407 ...
## $ Large.Bags : num 530 30109 3991 1691 256 ...
## $ XLarge.Bags : num 0 0 0 0 0 ...
## $ type : chr "conventional" "conventional" "conventional" "conventional" ...
## $ region : chr "Albany" "Atlanta" "BaltimoreWashington" "Boise" ...
## $ year : Factor w/ 4 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year_month : Factor w/ 39 levels "2015-01","2015-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ revenue : num 49865 435021 851067 80835 501573 ...
I continued checking if any null values, any duplicated rows and also checked if any extrem values to be aware of. All good!
# Any null values?
sum(is.null(df))
## [1] 0
# Check if any duplicated rows
anyDuplicated(df)
## [1] 0
#Check Date, type, region any duplicated
df%>% count(Date, type, region) %>% filter(n >1)
## [1] Date type region n
## <0 rows> (or 0-length row.names)
# Check if any extrem values
summary(df)
## Date AveragePrice Total.Volume X4046
## Min. :2015-01-04 Min. :0.440 Min. : 85 Min. : 0
## 1st Qu.:2015-10-25 1st Qu.:1.100 1st Qu.: 10839 1st Qu.: 854
## Median :2016-08-14 Median :1.370 Median : 107377 Median : 8645
## Mean :2016-08-13 Mean :1.406 Mean : 850644 Mean : 293008
## 3rd Qu.:2017-06-04 3rd Qu.:1.660 3rd Qu.: 432962 3rd Qu.: 111020
## Max. :2018-03-25 Max. :3.250 Max. :62505647 Max. :22743616
##
## X4225 X4770 Total.Bags Small.Bags
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3009 1st Qu.: 0 1st Qu.: 5089 1st Qu.: 2849
## Median : 29061 Median : 185 Median : 39744 Median : 26363
## Mean : 295155 Mean : 22840 Mean : 239639 Mean : 182195
## 3rd Qu.: 150207 3rd Qu.: 6243 3rd Qu.: 110783 3rd Qu.: 83338
## Max. :20470573 Max. :2546439 Max. :19373134 Max. :13384587
##
## Large.Bags XLarge.Bags type region
## Min. : 0 Min. : 0.0 Length:18249 Length:18249
## 1st Qu.: 127 1st Qu.: 0.0 Class :character Class :character
## Median : 2648 Median : 0.0 Mode :character Mode :character
## Mean : 54338 Mean : 3106.4
## 3rd Qu.: 22029 3rd Qu.: 132.5
## Max. :5719097 Max. :551693.7
##
## year month weekday year_month revenue
## 2015:5615 Jan :1944 Sun:18249 2015-03: 540 Min. : 134
## 2016:5616 Mar :1836 Mon: 0 2015-05: 540 1st Qu.: 17164
## 2017:5722 Feb :1728 Tue: 0 2015-08: 540 Median : 139530
## 2018:1296 May :1512 Wed: 0 2015-11: 540 Mean : 927948
## Jul :1512 Thu: 0 2016-01: 540 3rd Qu.: 514191
## Oct :1512 Fri: 0 2016-05: 540 Max. :54379912
## (Other):8205 Sat: 0 (Other):15009
Region column is quite messy. It contains total USA, regions and cities.
unique(df$region)
## [1] "Albany" "Atlanta" "BaltimoreWashington"
## [4] "Boise" "Boston" "BuffaloRochester"
## [7] "California" "Charlotte" "Chicago"
## [10] "CincinnatiDayton" "Columbus" "DallasFtWorth"
## [13] "Denver" "Detroit" "GrandRapids"
## [16] "GreatLakes" "HarrisburgScranton" "HartfordSpringfield"
## [19] "Houston" "Indianapolis" "Jacksonville"
## [22] "LasVegas" "LosAngeles" "Louisville"
## [25] "MiamiFtLauderdale" "Midsouth" "Nashville"
## [28] "NewOrleansMobile" "NewYork" "Northeast"
## [31] "NorthernNewEngland" "Orlando" "Philadelphia"
## [34] "PhoenixTucson" "Pittsburgh" "Plains"
## [37] "Portland" "RaleighGreensboro" "RichmondNorfolk"
## [40] "Roanoke" "Sacramento" "SanDiego"
## [43] "SanFrancisco" "Seattle" "SouthCarolina"
## [46] "SouthCentral" "Southeast" "Spokane"
## [49] "StLouis" "Syracuse" "Tampa"
## [52] "TotalUS" "West" "WestTexNewMexico"
We only choose Total US as analysis scope.
df_total <- df %>% filter(region == "TotalUS" )
Organic avocados are generally more expensive than conventional type.
date.ranges<-paste(min(date(df_total$Date)),"~",max(date(df_total$Date)))
ggplot(df_total, aes(x = type,y = AveragePrice, color=type)) +
geom_boxplot()+
scale_y_continuous(labels = scales::dollar,limits = c(0,2.5))+
labs(title="US Total Market: Average Price of one Avocado",
subtitle=paste("Date period: " ,date.ranges),
x="Date",y="Average Price",
color="Type")+
theme_minimal()
The average price distribution also shows organic avocado is much more expensive than conventional type.
df_total %>%
ggplot(aes(x = AveragePrice, fill=type)) +
geom_histogram(binwidth = 0.20, alpha=0.5) +
scale_y_continuous(breaks = seq(0,1000,20)) +
scale_x_continuous(breaks = seq(0,3.40,.20)) +
labs(title="Total US Market: Average Price of one Avocado",
subtitle=paste("Date period: " ,date.ranges),
x="Average Price")+
theme_minimal() +
theme(legend.position="top")
From 2015 to 2017, avocado average price appears to be gradually increasing.
date.ranges<-paste(min(date(df_total$Date)),"~",max(date(df_total$Date)))
ggplot(df_total, aes(type, AveragePrice))+
geom_boxplot(aes(color = year))+
scale_y_continuous(limits=c(0,2.2),
breaks = seq(0,2.2,0.2),
expand = expansion(mult = c(0, .05)),
labels = scales::dollar) +
labs(colour = "Year",
x = "Type",
y ="Average Price",
title = "Total US Market: Average price of one avocado by year by type",
subtitle=paste("Date period: " ,date.ranges))+
theme_minimal() +
theme(legend.position="top")
There are some exceptions to this. In March of 2017, there was a period of time where the conventional and organic avocado prices were similar. Moreover, in July of 2015, organic avocado prices were cheaper than conventional ones.
ggplot(df_total,aes(x=Date,y=AveragePrice,color=type,group=type))+
geom_line(size=1.2)+
geom_smooth(linetype=2,alpha=.3)+
scale_x_date(date_labels = "%b %y",
date_breaks = "4 month",
limits =c(min(date(df_total$Date)),max(date(df_total$Date))))+
scale_y_continuous(breaks = seq(0,4,.20))+
theme_minimal()+
labs(title="Total US Market: Average Price of one Avocado",
subtitle=paste("Date period: " ,date.ranges),
x="Date",
y="Average Price",
color="Type",
caption = "(based on US Total Data)")+
theme(legend.position="top")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_total %>%
group_by(year, month, type) %>%
summarise(AveragePrice = mean(AveragePrice)) %>%
ggplot(aes(x=month, y=AveragePrice, color=year,group = year)) +
labs(colour = "Year",
x = "Month",
y ="Average Price",
title="Total US Market: Average monthly prices of one avocado by type for each year",
subtitle=paste("Date period: " ,date.ranges))+
geom_line() +
scale_y_continuous(labels = scales::dollar)+
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_grid(.~type) +
theme_bw() +
theme(legend.position="top")
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
df_total %>%
mutate(Quarter=factor(quarter(Date,with_year = F))) %>%
ggplot(aes(x = Quarter,y=AveragePrice,color=type,group=Quarter))+
geom_jitter(width = .2)+
geom_boxplot()+
theme_bw()+
guides(color=F)+
facet_grid(type ~ year)+
scale_y_continuous(breaks = seq(0,3,.2),labels = scales::dollar)+
scale_x_discrete(drop=F)+
labs(title="Average Price of a Single Avocado: Total US Market",
subtitle= paste("Dates from",date.ranges),x="Quarter of Year",y="Average Price")
We can see there are clear seasonal trends in avocado average price. Sep and Oct, avocado average price for both types tend to be highest.