Introduction

This dataset was created to research which cities millennials should move to for affordable avocados so they can enjoy their avocado toast in all of its glory. The author of this dataset claimed that millennials love avocado toast.

However, I decided to take a different approach with this data by analyzing avocado sales through trends over time to understand changes in total volumes sold, average prices, types of avocados, and etc. Instead of trying to determine which cities sell the most cost-effective avocados, I want to uncover the reasons why these movements in avocado sales occur.

Dataset

I retrieved this Avocado Prices Dataset from Kaggle, where it was originally downloaded from the Hass Avocado Board Website in May 2018. The Hass Avocado Board contains data and research on Hass Avocados, the most popular type of avocados.

Description of the data from Hass Avocado Board’s website:

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags.

Some relevant columns in the dataset that have been used in the analysis:

  • Date - The date of the observation
  • Average Price - the Average price of a single avocado
  • Type - Whether the avocado is conventional or organic
  • Year - the Year of the observation
  • Region - the city or region of the observation
  • Total Volume - Total number of avocados sold

Findings

It is definite that Hass avocado sales skyrocketed in 2016 due to increased popularity and have been growing at a rapid rate to match heightened demand. Because of this, consumers are buying larger quantities (bags) of avocados. Consumers are also starting to buy more organic options of avocados throughout the years. This might explain why average prices for avocados are also slowly rising annually as organic variations are normally more expensive than the conventional option.

The majority of Hass avocados are grown in California and are sold during their peak season of harvest. The peak season months of May to August have the largest number of avocados sold in any given year. In addition, regions around California have the highest volume of avocados sold because they have more distribution as they are physically closer to the supplier.

how to pick an avocado

Total Volume

What is the Total Volume of Avocados Sold?

The multiple bar charts below show the total volume of avocados sold by month by year in hundred of millions. I can conclude that the U.S avocado consumption rate is growing drastically as the total number of avocados sold increase year over year from 2015 to 2018. Please note that there is only data for the first three months in 2018, so this conclusion is based on the finalized data compiled for 2015 through 2017.

There was a spike in avocados sold in May of 2015 and 2016. This is because the peak season for avocados in the U.S. is May to August, so they are more likely to be advertised and promoted in retail stores around that time. Consumers must have been urgently waiting for May, the start of the avocado season, to splurge on their favorite fruit. As avocados become more prominent because of their nutritional value among other features, they are also being pursued by those in the food industry where avocado is an essential ingredient in popular meals and drinks.

months_df = avo_df %>%
  select(Date, `Total Volume`) %>%
  mutate(months=months(mdy(Date), abbreviate=TRUE),
         year=year(mdy(Date))) %>%
  group_by(year, months) %>%
  summarise(n=sum(`Total Volume`), .groups='keep') %>%
  data.frame()
 

months_df$year=factor(months_df$year) 

mymonths=c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
month_order=factor(months_df$months, level=mymonths)


x= min(as.numeric(levels(months_df$year)))

y= max(as.numeric(levels(months_df$year))) 


months_df$year = factor(months_df$year, levels=seq(y, x, by=-1))

ggplot(months_df, aes(x= month_order, y=n, fill=year)) +
  geom_bar(stat="identity", position="dodge") + 
  theme_light() + 
  theme(plot.title=element_text(hjust = 0.5)) + 
  scale_y_continuous(labels=comma) +
  labs(title="Total Volume of Avocados Sold by Month by Year",
       x="Months of the Year", 
       y="Number of Avocados",
       fill="Year") + 
  scale_fill_brewer(palette="Pastel1") +
  facet_wrap(~year, ncol=2, nrow=2)

Total Bags Sold

What is the Total Number of Avocados Bags Sold?

According to the Hass Avocado Board website, the total volume of avocados sold is accumulated by the number of total bags and the total number of avocados with specific product codes. I decided to focus on the total number of bags sold by month by year to ensure there is a similar trend that matches the total volume sold. There are also more bags of avocados sold in peak season months (May to August), comparable to total volume. Despite only having data for the first three months of 2018, I can see that February and March of 2018 sold more bags than most months between 2015 to 2017.

I also noticed that there is a dramatic increase of bags sold monthly in 2016 than in 2015, which allows me to believe that the market for avocados bloomed in 2016. The constant development and popularity of avocados have enabled them to become huge competitors of other fruits.

months_df = avo_df %>%
  select(Date, `Total Bags`) %>%
  mutate(months=months(mdy(Date), abbreviate=TRUE),
         year=year(mdy(Date))) %>%
  group_by(year, months) %>%
  summarise(n=sum(`Total Bags`), .groups='keep') %>%
  data.frame()

mymonths=c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
month_order=factor(months_df$months, level=mymonths)


days_df = avo_df %>%
  select (Date) %>%
  mutate(year=year(mdy(Date)),
         dayoftheweek=weekdays(mdy(Date), abbreviate=TRUE)) %>%
group_by(year, dayoftheweek) %>%
summarise(n=length(Date), .groups='keep') %>%
data.frame()

breaks=c(seq(0,max(days_df$n), by=100000))

ggplot(months_df, aes(x=year,y=months, fill=n)) +
  geom_tile(color="black") +
  geom_text(aes(label=comma(n))) + 
  labs(title="Total Bags Sold by Month By Year",
       x="Year",
       y="Months of the Year",
       fill="Total Bag Count") +
  theme_minimal() +
  theme(plot.title=element_text(hjust = 0.5)) +
  scale_y_discrete(limits=rev(levels(month_order))) +
  scale_fill_continuous(low="white", high="darkgreen", breaks=breaks, labels=comma) + 
  guides(fill=guide_legend(reverse = TRUE, override.aes = list(colour="black")))

Bag Sizes

What are the Different Bag Sizes for Avocados Sold?

I also decided to break down the total number of bags into small, large, and XL sizes to analyze if there was a trend in bag sizes purchased through the years. As observed, consumers have gradually started to buy more large bags than small bags. The percentage of small bags purchased in 2015 is 82.2%, and it dropped almost 10% to 73.4% in 2018. On the other hand, the percentage of large bags increased by 8%, from 17.1% in 2015 to 25.1% in 2018. The purchase rate for extra large bags is also increasing, but at a slower rate than large bags.

I am assuming that large bags are more cost-efficient than small bags, or consumers may prefer to buy large bags over small bags because they feel that avocados are a necessity now because of how popular avocados have become. I foresee large bags overtaking small bags in a couple of years at this rate.

bags_df = avo_df %>%
  select(Date, 
         `Small Bags`, 
         `Large Bags`, 
         `XLarge Bags`,
         `Total Bags`) %>%
  mutate(year = year(mdy(Date))) %>%
  group_by(year) %>%
  summarise(smallBagSum=sum(`Small Bags`),
            largeBagSum=sum(`Large Bags`),
            xlargeBagSum=sum(`XLarge Bags`),
            totalBagSum=sum(`Total Bags`),
            .groups='keep') %>%
  group_by(year) %>%
  mutate(smallBagPercent = round(100*smallBagSum/sum(totalBagSum),1),
         largeBagPercent = round(100*largeBagSum/sum(totalBagSum),1),
         xlargeBagPercent = round(100*xlargeBagSum/sum(totalBagSum),1)
         ) %>%
  ungroup() %>%
  data.frame()

row1 = c(2015, "Small Bags", 634682705)
row2 = c(2015, "Large Bags", 132066400) 
row3 = c(2015, "XLarge Bags", 5443128)

row4 = c(2016, "Small Bags", 1106494240)
row5 = c(2016, "Large Bags", 336626342)
row6 = c(2016, "XLarge Bags", 20038285)

row7 = c(2017, "Small Bags", 1222952525)
row8 = c(2017, "Large Bags", 399339040)
row9 = c(2017, "XLarge Bags", 23997172)

row10 = c(2018, "Small Bags", 360741368)
row11 = c(2018, "Large Bags", 123583988)
row12 = c(2018, "XLarge Bags", 7210592)

bags_sum = data.frame(rbind(row1, row2, row3, 
                            row4, row5, row6, 
                            row7, row8, row9,
                            row10, row11, row12))
colnames(bags_sum) = c("year", "bagType", "sum")

plot_ly(textposition="inside", labels= ~bagType, values = ~sum) %>%
  add_pie(data=bags_sum[bags_sum$year==2018, ], 
          name="2018", title="2018", domain=list(row=0, column=0)) %>%
  add_pie(data=bags_sum[bags_sum$year==2017, ],
          name="2017", title="2017", domain=list(row=0, column=1)) %>%
  add_pie(data=bags_sum[bags_sum$year==2016, ],
          name="2016", title="2016", domain=list(row=1, column=0)) %>%
  add_pie(data=bags_sum[bags_sum$year==2015, ],
          name="2015", title="2015", domain=list(row=1, column=1)) %>%
  layout(title="Bag Sizes Sold by Year", showlegend=TRUE, 
         grid=list(rows=2, columns=2))

Average Prices

What are the Average Prices for Avocados?

I was also curious as to how these avocados are priced in the market when demand continues to rise. From the line plot below, avocados have been getting more expensive every year in terms of the average price. Avocados are also cheaper at the start and end of every year, but the average price of avocados in 2018 is almost doubled the average price in 2015.

September 2017 had the highest average price of $1.60 for Hass avocados. There was a weak harvest of avocados in 2017, leading to a shortage as demand was still high. According to the American Restaurant Association, the wholesale price of avocados was up 125% since the beginning of the year in 2017. In addition, Guacamole Day (yes, there is an actual day to celebrate eating guacamole) also lands on September 16 of every year so restaurants needed to stock up to prepare. This helps explain the price surge.

Even though it may be tempting to buy avocados at their lowest average price at the beginning or the end of the year, retailers are most likely trying to discard the remaining, possibly stale, avocados by discounting them. It is more ideal to wait for the next batch of fresh avocados during peak season, even if they are more costly.

months_df2 = avo_df %>%
  select(Date, AveragePrice, `Total Volume`) %>%
  dplyr::rename(TotalVolume = `Total Volume`) %>%
  mutate(months=months(mdy(Date), abbreviate=TRUE),
         year=year(mdy(Date))) %>%
  group_by(year, months) %>%
  summarise(sumQuantity = sum(TotalVolume),
            totalPrice = sum(AveragePrice * TotalVolume),
            newAvgPrice = round((totalPrice/sumQuantity), 2),
            .groups = 'keep') %>%
  data.frame()

months_df2$year=factor(months_df2$year) 

mymonths2=c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
month_order2=factor(months_df2$months, level=mymonths2)

ggplot(months_df2, aes(x=month_order2, y=newAvgPrice, group=year)) +
  geom_line(aes(color=year), size=3) +
  labs(title="Average Prices by Month and by Year",
       x="Months of the Year",
       y="Average Price") +
  theme_light() +
  theme(plot.title=element_text(hjust=0.5)) +
  geom_point(shape=21, size=5, color="black", fill="white") +
  scale_y_continuous(labels=scales::dollar_format()) +
  scale_color_brewer(palette="Set2", name="Year", guide=guide_legend(reverse = TRUE))

Top 10 Regions

What are the Top 10 Regions for Most Avocados Sold?

The West region has the highest total number of avocados sold, however California and South Central are also competing for the top spot. They all have over one billion avocados sold between 2015 and 2018. This isn’t surprising as 90% of avocados are grown and produced in California, so it’s only natural that the surrounding regions have more access to the distribution because they are within closer proximity.

However, there is a variety of other regions outside of the top three that made it into the top ten such as northeast, southeast, great lakes and etc. This shows that avocados are expanding into other territories outside of their origin, thus continuing to increase popularity.

top10_region_df = avo_df %>%
  select(region, "Total Volume") %>%
  dplyr::rename(TotalVolume = "Total Volume") %>%
  group_by(region) %>%
  dplyr::summarise(sum_of_volume = sum(TotalVolume), .groups = 'keep') %>%
  dplyr::arrange(desc(sum_of_volume)) %>%
  data.frame() %>%
  dplyr::slice(2:11)

ylab = c(250, 500, 750, 1000)

ggplot(top10_region_df, aes(x = reorder(region, -sum_of_volume), y = sum_of_volume)) +
  geom_bar(colour="black", fill="violetred4", stat="identity") +
  labs(title=" Top 10 Regions with Highest Avocado Sales (2015-2018)",
       x="Regions",
       y="Number of Avocados") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = paste0(ylab, "M"),
                     breaks = 10^6 * ylab) +
  geom_text(aes(x = region, y = sum_of_volume, label = scales::comma(sum_of_volume), vjust=-0.5))

Avocado Types

What are the Different Types of Avocados Sold?

There are two types of Hass avocados sold in retail, conventional and organic. Generally, the average consumer would purchase conventional fruits over organic fruits because it is cheaper. This would attract the targeted audience, millennials, even more assuming that they are younger and frugal. The pie chart below shows that conventional avocados make up 97.2% of the total number of avocados sold between 2015 and 2018.

In the nested pie chart of Types of Avocados Sold (2015-2017), organic avocados are gradually being purchased more year over year. However, there is still a giant gap between consumers purchasing conventional avocados over organic varieties but it is still some sort of progress. This newly found discovery of avocados and their nutritional value has made consumers more health-conscious leading them to purchase more organic products as time goes on.

Disclaimer: I did not include 2018 in my nested pie chart, because the dataset only has three months worth of data in 2018.

type_df = avo_df %>%
  select(type, Date, `Total Volume`) %>%
  mutate(year = year(mdy(Date))) %>%
  group_by(year, type) %>%
  summarise(n=sum(`Total Volume`), .groups='keep') %>%
  group_by(year) %>%
  mutate(percent_of_total = round(100*n/sum(n), 1)) %>% 
  ungroup() %>%
  data.frame()

plot_ly(type_df, 
        labels = ~type, 
        values = ~n, 
        hoverinfo = "text", 
        hovertext = paste("<b>", type_df$type, "</b> 
          Count:", scales::comma(type_df$n))) %>%
  add_pie(hole=0.6) %>%
  layout(title="Types of Avocados Sold (2015-2018)") %>%
  layout(annotations=list(text=paste0("Total Avocado Count: \n", 
                                      scales::comma(sum(type_df$n))),
                          "showarrow"=F))
plot_ly(hole=0.7) %>%
  layout(title="Types of Avocados Sold (2015-2017)") %>%
  add_trace(data=type_df[type_df$year==2017, ],
            labels = ~type,
            values = ~type_df[type_df$year==2017, "n"],
            type="pie",
            textposition ="inside",
            hovertemplate = "<b>Year:</b> 2017<br> 
                            <b>Type:</b> %{label}<br> 
                            <b>Percent:</b> %{percent}<br> 
                            <b>Number of Avocados:</b> %{value}
                            <extra></extra>") %>%
  add_trace(data=type_df[type_df$year==2016, ],
            labels = ~type,
            values = ~type_df[type_df$year==2016, "n"],
            type="pie",
            textposition ="inside",
            hovertemplate = "<b>Year:</b> 2016<br> 
                            <b>Type:</b> %{label}<br> 
                            <b>Percent:</b> %{percent}<br> 
                            <b>Number of Avocados:</b> %{value}
                            <extra></extra>",
            domain = list(
              x = c(0.16, 0.84), 
              y = c(0.16, 0.84))) %>%
  add_trace(data=type_df[type_df$year==2015, ],
            labels = ~type,
            values = ~type_df[type_df$year==2015, "n"],
            type="pie",
            textposition ="inside",
            hovertemplate = "<b>Year:</b> 2015<br> 
                            <b>Type:</b> %{label}<br> 
                            <b>Percent:</b> %{percent}<br> 
                            <b>Number of Avocados:</b> %{value}
                            <extra></extra>",
            domain = list(
              x = c(0.27, 0.73), 
              y = c(0.27, 0.73)))