Import Data

Content This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here’s how the Hass Avocado Board describes the data on their website:

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

Some relevant columns in the dataset:

Date - The date of the observation AveragePrice - the average price of a single avocado type - conventional or organic year - the year Region - the city or region of the observation Total Volume - Total number of avocados sold 4046 - Total number of avocados with PLU 4046 sold 4225 - Total number of avocados with PLU 4225 sold 4770 - Total number of avocados with PLU 4770 sold

#  Import Packages
options(warn=-1)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggthemes))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr))

options(scipen=5)

Data Checking and Cleaning

There are 14 variables and 18249 rows in data.

#Check data structure
str(df)
## 'data.frame':    18249 obs. of  14 variables:
##  $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date        : chr  "2015-12-27" "2015-12-20" "2015-12-13" "2015-12-06" ...
##  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
##  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
##  $ X4046       : num  1037 674 795 1132 941 ...
##  $ X4225       : num  54455 44639 109150 71976 43838 ...
##  $ X4770       : num  48.2 58.3 130.5 72.6 75.8 ...
##  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
##  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
##  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
##  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
##  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ region      : chr  "Albany" "Albany" "Albany" "Albany" ...

After checked the class of each columns, I dropped year column as it is duplicate. We can get it from Date column. Then I decided to convert the Date column from factor to date. Also I changed year to factor from int for better analysis. I also created a new column called month from Date columns in case for monthly analysis. Finally, I sorted the data by Date by ascending order.

# Change the Date from factor to date and order data by Date
 df <- df %>%
  select(-X,-year) %>%
  mutate(Date = as.Date(Date, "%Y-%m-%d"), 
         year = factor(year(Date)), 
         month = month(Date, label = TRUE),
         weekday = wday(Date, label = TRUE),
         year_month = as.factor(format(as.Date(Date), "%Y-%m")),
         revenue = AveragePrice* Total.Volume) %>%
  arrange(Date)
str(df)
## 'data.frame':    18249 obs. of  17 variables:
##  $ Date        : Date, format: "2015-01-04" "2015-01-04" ...
##  $ AveragePrice: num  1.22 1 1.08 1.01 1.02 1.4 0.93 1.19 1.11 0.88 ...
##  $ Total.Volume: num  40873 435021 788025 80034 491738 ...
##  $ X4046       : num  2820 364302 53987 44562 7194 ...
##  $ X4225       : num  28287 23821 552906 24964 396752 ...
##  $ X4770       : num  49.9 82.2 39995 2752.3 128.8 ...
##  $ Total.Bags  : num  9716 46816 141137 7756 87663 ...
##  $ Small.Bags  : num  9187 16707 137146 6064 87407 ...
##  $ Large.Bags  : num  530 30109 3991 1691 256 ...
##  $ XLarge.Bags : num  0 0 0 0 0 ...
##  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
##  $ region      : chr  "Albany" "Atlanta" "BaltimoreWashington" "Boise" ...
##  $ year        : Factor w/ 4 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ month       : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday     : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year_month  : Factor w/ 39 levels "2015-01","2015-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ revenue     : num  49865 435021 851067 80835 501573 ...

I continued checking if any null values, any duplicated rows and also checked if any extrem values to be aware of. All good!

# Any null values?
sum(is.null(df))
## [1] 0
# Check if any duplicated rows
anyDuplicated(df)
## [1] 0
#Check Date, type, region any duplicated
df%>% count(Date, type, region) %>% filter(n >1)
## [1] Date   type   region n     
## <0 rows> (or 0-length row.names)
# Check if any extrem values
summary(df)
##       Date             AveragePrice    Total.Volume          X4046         
##  Min.   :2015-01-04   Min.   :0.440   Min.   :      85   Min.   :       0  
##  1st Qu.:2015-10-25   1st Qu.:1.100   1st Qu.:   10839   1st Qu.:     854  
##  Median :2016-08-14   Median :1.370   Median :  107377   Median :    8645  
##  Mean   :2016-08-13   Mean   :1.406   Mean   :  850644   Mean   :  293008  
##  3rd Qu.:2017-06-04   3rd Qu.:1.660   3rd Qu.:  432962   3rd Qu.:  111020  
##  Max.   :2018-03-25   Max.   :3.250   Max.   :62505647   Max.   :22743616  
##                                                                            
##      X4225              X4770           Total.Bags         Small.Bags      
##  Min.   :       0   Min.   :      0   Min.   :       0   Min.   :       0  
##  1st Qu.:    3009   1st Qu.:      0   1st Qu.:    5089   1st Qu.:    2849  
##  Median :   29061   Median :    185   Median :   39744   Median :   26363  
##  Mean   :  295155   Mean   :  22840   Mean   :  239639   Mean   :  182195  
##  3rd Qu.:  150207   3rd Qu.:   6243   3rd Qu.:  110783   3rd Qu.:   83338  
##  Max.   :20470573   Max.   :2546439   Max.   :19373134   Max.   :13384587  
##                                                                            
##    Large.Bags       XLarge.Bags           type              region         
##  Min.   :      0   Min.   :     0.0   Length:18249       Length:18249      
##  1st Qu.:    127   1st Qu.:     0.0   Class :character   Class :character  
##  Median :   2648   Median :     0.0   Mode  :character   Mode  :character  
##  Mean   :  54338   Mean   :  3106.4                                        
##  3rd Qu.:  22029   3rd Qu.:   132.5                                        
##  Max.   :5719097   Max.   :551693.7                                        
##                                                                            
##    year          month      weekday       year_month       revenue        
##  2015:5615   Jan    :1944   Sun:18249   2015-03:  540   Min.   :     134  
##  2016:5616   Mar    :1836   Mon:    0   2015-05:  540   1st Qu.:   17164  
##  2017:5722   Feb    :1728   Tue:    0   2015-08:  540   Median :  139530  
##  2018:1296   May    :1512   Wed:    0   2015-11:  540   Mean   :  927948  
##              Jul    :1512   Thu:    0   2016-01:  540   3rd Qu.:  514191  
##              Oct    :1512   Fri:    0   2016-05:  540   Max.   :54379912  
##              (Other):8205   Sat:    0   (Other):15009

Region column is quite messy. It contains total USA, regions and cities.

unique(df$region)
##  [1] "Albany"              "Atlanta"             "BaltimoreWashington"
##  [4] "Boise"               "Boston"              "BuffaloRochester"   
##  [7] "California"          "Charlotte"           "Chicago"            
## [10] "CincinnatiDayton"    "Columbus"            "DallasFtWorth"      
## [13] "Denver"              "Detroit"             "GrandRapids"        
## [16] "GreatLakes"          "HarrisburgScranton"  "HartfordSpringfield"
## [19] "Houston"             "Indianapolis"        "Jacksonville"       
## [22] "LasVegas"            "LosAngeles"          "Louisville"         
## [25] "MiamiFtLauderdale"   "Midsouth"            "Nashville"          
## [28] "NewOrleansMobile"    "NewYork"             "Northeast"          
## [31] "NorthernNewEngland"  "Orlando"             "Philadelphia"       
## [34] "PhoenixTucson"       "Pittsburgh"          "Plains"             
## [37] "Portland"            "RaleighGreensboro"   "RichmondNorfolk"    
## [40] "Roanoke"             "Sacramento"          "SanDiego"           
## [43] "SanFrancisco"        "Seattle"             "SouthCarolina"      
## [46] "SouthCentral"        "Southeast"           "Spokane"            
## [49] "StLouis"             "Syracuse"            "Tampa"              
## [52] "TotalUS"             "West"                "WestTexNewMexico"

We only choose Total US as analysis scope.

df_total <- df %>% filter(region == "TotalUS" )

Analysis

Organic avocados are generally more expensive than conventional type.

date.ranges<-paste(min(date(df_total$Date)),"~",max(date(df_total$Date)))

ggplot(df_total, aes(x = type,y = AveragePrice, color=type)) + 
  geom_boxplot()+
  scale_y_continuous(labels = scales::dollar,limits = c(0,2.5))+
    labs(title="US Total Market: Average Price of one Avocado",
         subtitle=paste("Date period: " ,date.ranges),
       x="Date",y="Average Price",
       color="Type")+ 
  theme_minimal()

The average price distribution also shows organic avocado is much more expensive than conventional type.

df_total %>% 
    ggplot(aes(x = AveragePrice, fill=type)) +
  geom_histogram(binwidth = 0.20, alpha=0.5) +
  scale_y_continuous(breaks = seq(0,1000,20)) + 
  scale_x_continuous(breaks = seq(0,3.40,.20)) +
      labs(title="Total US Market: Average Price of one Avocado",
         subtitle=paste("Date period: " ,date.ranges),
         x="Average Price")+
  theme_minimal() + 
  theme(legend.position="top")

From 2015 to 2017, avocado average price appears to be gradually increasing.

date.ranges<-paste(min(date(df_total$Date)),"~",max(date(df_total$Date)))

ggplot(df_total, aes(type, AveragePrice))+
    geom_boxplot(aes(color = year))+
    scale_y_continuous(limits=c(0,2.2),
                       breaks = seq(0,2.2,0.2),
                       expand = expansion(mult = c(0, .05)),
                       labels = scales::dollar) + 
    labs(colour = "Year", 
         x = "Type", 
         y ="Average Price", 
         title = "Total US Market: Average price of one avocado by year by type",
         subtitle=paste("Date period: " ,date.ranges))+
    
    theme_minimal() + 
    theme(legend.position="top")

There are some exceptions to this. In March of 2017, there was a period of time where the conventional and organic avocado prices were similar. Moreover, in July of 2015, organic avocado prices were cheaper than conventional ones.

ggplot(df_total,aes(x=Date,y=AveragePrice,color=type,group=type))+
  geom_line(size=1.2)+
  geom_smooth(linetype=2,alpha=.3)+
  scale_x_date(date_labels = "%b %y",
               date_breaks =  "4 month",
               limits =c(min(date(df_total$Date)),max(date(df_total$Date))))+
  scale_y_continuous(breaks = seq(0,4,.20))+
  theme_minimal()+
  labs(title="Total US Market: Average Price of one Avocado",
       subtitle=paste("Date period: " ,date.ranges),
       x="Date",
       y="Average Price",
       color="Type",
       caption = "(based on US Total Data)")+
  theme(legend.position="top")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_total %>% 
    group_by(year, month, type) %>% 
    summarise(AveragePrice = mean(AveragePrice)) %>%
    ggplot(aes(x=month, y=AveragePrice, color=year,group = year)) +
    labs(colour = "Year", 
         x = "Month", 
         y ="Average Price", 
         title="Total US Market: Average monthly prices of one avocado by type for each year",
       subtitle=paste("Date period: " ,date.ranges))+
    geom_line() + 
    scale_y_continuous(labels = scales::dollar)+
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    facet_grid(.~type) + 
  theme_bw() + 
  theme(legend.position="top")
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

  df_total %>% 
    mutate(Quarter=factor(quarter(Date,with_year = F)))  %>% 
    ggplot(aes(x = Quarter,y=AveragePrice,color=type,group=Quarter))+
    geom_jitter(width = .2)+
    geom_boxplot()+
    theme_bw()+
    guides(color=F)+
    facet_grid(type ~ year)+
    scale_y_continuous(breaks = seq(0,3,.2),labels = scales::dollar)+
    scale_x_discrete(drop=F)+
    labs(title="Average Price of a Single Avocado: Total US Market",
       subtitle= paste("Dates from",date.ranges),x="Quarter of Year",y="Average Price")

We can see there are clear seasonal trends in avocado average price. Sep and Oct, avocado average price for both types tend to be highest.