The challenge

Context

It is a well-known fact that Millennials LOVE Avocado Toast. It’s also a well-known fact that all Millennials live in their parents’ basements.

Clearly, they aren’t buying homes because they are buying too much Avocado Toast!

But maybe there’s hope… if a Millennial could find a city with cheap avocados, they could live out the Millennial American Dream.

Content

This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. Here’s how the Hass Avocado Board describes the data on their website:

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

Some relevant columns in the dataset:

  • Date - The date of the observation
  • AveragePrice - the average price of a single avocado
  • type - conventional or organic
  • year - the year
  • region - the city or region of the observation
  • Total Volume - Total number of avocados sold
  • 4046 - Total number of avocados with PLU 4046 sold
  • 4225 - Total number of avocados with PLU 4225 sold
  • 4770 - Total number of avocados with PLU 4770 sold

Acknowledgements

Many thanks to the Hass Avocado Board for sharing this data!!

http://www.hassavocadoboard.com/retail/volume-and-price-data

Inspiration

In which cities can Millennials have their avocado toast AND buy a home?

Was the Avocadopocalypse of 2017 real?

Data quality

Missing values

column         missing values
X1             0
Date           0
AveragePrice   0
Total Volume   0
4046           0
4225           0
4770           0
Total Bags     0
Small Bags     0
Large Bags     0
XLarge Bags    0
type           0
year           0
region         0

Duplicated values

Number of rows:

## [1] 18249

Number of unique rows:

## [1] 18249

There are no duplicated rows.
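The code behind the two checks above is not shown. A minimal sketch of the same idea follows, using only Python's standard library. The sample rows are made up for illustration, and treating empty strings and the literal `NA` as missing is an assumption, not something the source specifies.

```python
import csv
import io
from collections import Counter


def missing_and_duplicates(rows):
    """Count missing cells per column, total rows and unique rows.

    `rows` is a list of dicts, for example from csv.DictReader.
    """
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # Assumption: empty strings and "NA" mark a missing value.
            if value is None or value.strip() in ("", "NA"):
                missing[column] += 1
    columns = list(rows[0].keys()) if rows else []
    missing_per_column = {c: missing.get(c, 0) for c in columns}
    # Two rows are duplicates only if every column matches.
    unique_rows = {tuple(row.items()) for row in rows}
    return missing_per_column, len(rows), len(unique_rows)


# Tiny made-up sample, not the real dataset.
sample = io.StringIO(
    "Date,AveragePrice,type,region\n"
    "2015-01-04,1.22,conventional,Albany\n"
    "2015-01-04,1.79,organic,Albany\n"
    "2015-01-11,,conventional,Albany\n"
)
rows = list(csv.DictReader(sample))
missing, n_rows, n_unique = missing_and_duplicates(rows)
print(missing)
print(n_rows, n_unique)
```

When the number of rows equals the number of unique rows, as in the output above, there are no duplicated rows.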

Data exploration

Price analysis

Price distribution

Date variable

##       Date           
##  Min.   :2015-01-04  
##  1st Qu.:2015-10-25  
##  Median :2016-08-14  
##  Mean   :2016-08-13  
##  3rd Qu.:2017-06-04  
##  Max.   :2018-03-25

The time difference between 2015-01-04 and 2018-03-25 is 1176 days. Let’s check whether we have that many distinct dates:

## [1] 169

We only have 169 unique days. Strange.
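One way to investigate is to count the rows per date and attach the weekday, which is the shape of the table that follows. The sketch below is an illustration in plain Python. The function name `rows_per_date` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def rows_per_date(iso_dates):
    """Return (date, row count, weekday name) tuples sorted by date."""
    counts = Counter(iso_dates)
    return [
        # date.weekday() returns 0 for Monday up to 6 for Sunday.
        (d, n, WEEKDAYS[date.fromisoformat(d).weekday()])
        for d, n in sorted(counts.items())
    ]


# Made-up sample of the Date column, one entry per row.
sample_dates = ["2015-01-11", "2015-01-04", "2015-01-04"]
for entry in rows_per_date(sample_dates):
    print(entry)
```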

Date n week_day
2015-01-04 108 Sunday
2015-01-11 108 Sunday
2015-01-18 108 Sunday
2015-01-25 108 Sunday
2015-02-01 108 Sunday
2015-02-08 108 Sunday
2015-02-15 108 Sunday
2015-02-22 108 Sunday
2015-03-01 108 Sunday
2015-03-08 108 Sunday
2015-03-15 108 Sunday
2015-03-22 108 Sunday
2015-03-29 108 Sunday
2015-04-05 108 Sunday
2015-04-12 108 Sunday
2015-04-19 108 Sunday
2015-04-26 108 Sunday
2015-05-03 108 Sunday
2015-05-10 108 Sunday
2015-05-17 108 Sunday
2015-05-24 108 Sunday
2015-05-31 108 Sunday
2015-06-07 108 Sunday
2015-06-14 108 Sunday
2015-06-21 108 Sunday
2015-06-28 108 Sunday
2015-07-05 108 Sunday
2015-07-12 108 Sunday
2015-07-19 108 Sunday
2015-07-26 108 Sunday
2015-08-02 108 Sunday
2015-08-09 108 Sunday
2015-08-16 108 Sunday
2015-08-23 108 Sunday
2015-08-30 108 Sunday
2015-09-06 108 Sunday
2015-09-13 108 Sunday
2015-09-20 108 Sunday
2015-09-27 108 Sunday
2015-10-04 108 Sunday
2015-10-11 108 Sunday
2015-10-18 108 Sunday
2015-10-25 108 Sunday
2015-11-01 108 Sunday
2015-11-08 108 Sunday
2015-11-15 108 Sunday
2015-11-22 108 Sunday
2015-11-29 108 Sunday
2015-12-06 107 Sunday
2015-12-13 108 Sunday
2015-12-20 108 Sunday
2015-12-27 108 Sunday
2016-01-03 108 Sunday
2016-01-10 108 Sunday
2016-01-17 108 Sunday
2016-01-24 108 Sunday
2016-01-31 108 Sunday
2016-02-07 108 Sunday
2016-02-14 108 Sunday
2016-02-21 108 Sunday
2016-02-28 108 Sunday
2016-03-06 108 Sunday
2016-03-13 108 Sunday
2016-03-20 108 Sunday
2016-03-27 108 Sunday
2016-04-03 108 Sunday
2016-04-10 108 Sunday
2016-04-17 108 Sunday
2016-04-24 108 Sunday
2016-05-01 108 Sunday
2016-05-08 108 Sunday
2016-05-15 108 Sunday
2016-05-22 108 Sunday
2016-05-29 108 Sunday
2016-06-05 108 Sunday
2016-06-12 108 Sunday
2016-06-19 108 Sunday
2016-06-26 108 Sunday
2016-07-03 108 Sunday
2016-07-10 108 Sunday
2016-07-17 108 Sunday
2016-07-24 108 Sunday
2016-07-31 108 Sunday
2016-08-07 108 Sunday
2016-08-14 108 Sunday
2016-08-21 108 Sunday
2016-08-28 108 Sunday
2016-09-04 108 Sunday
2016-09-11 108 Sunday
2016-09-18 108 Sunday
2016-09-25 108 Sunday
2016-10-02 108 Sunday
2016-10-09 108 Sunday
2016-10-16 108 Sunday
2016-10-23 108 Sunday
2016-10-30 108 Sunday
2016-11-06 108 Sunday
2016-11-13 108 Sunday
2016-11-20 108 Sunday
2016-11-27 108 Sunday
2016-12-04 108 Sunday
2016-12-11 108 Sunday
2016-12-18 108 Sunday
2016-12-25 108 Sunday
2017-01-01 108 Sunday
2017-01-08 108 Sunday
2017-01-15 108 Sunday
2017-01-22 108 Sunday
2017-01-29 108 Sunday
2017-02-05 108 Sunday
2017-02-12 108 Sunday
2017-02-19 108 Sunday
2017-02-26 108 Sunday
2017-03-05 108 Sunday
2017-03-12 108 Sunday
2017-03-19 108 Sunday
2017-03-26 108 Sunday
2017-04-02 108 Sunday
2017-04-09 108 Sunday
2017-04-16 108 Sunday
2017-04-23 108 Sunday
2017-04-30 108 Sunday
2017-05-07 108 Sunday
2017-05-14 108 Sunday
2017-05-21 108 Sunday
2017-05-28 108 Sunday
2017-06-04 108 Sunday
2017-06-11 108 Sunday
2017-06-18 107 Sunday
2017-06-25 107 Sunday
2017-07-02 108 Sunday
2017-07-09 108 Sunday
2017-07-16 108 Sunday
2017-07-23 108 Sunday
2017-07-30 108 Sunday
2017-08-06 108 Sunday
2017-08-13 108 Sunday
2017-08-20 108 Sunday
2017-08-27 108 Sunday
2017-09-03 108 Sunday
2017-09-10 108 Sunday
2017-09-17 108 Sunday
2017-09-24 108 Sunday
2017-10-01 108 Sunday
2017-10-08 108 Sunday
2017-10-15 108 Sunday
2017-10-22 108 Sunday
2017-10-29 108 Sunday
2017-11-05 108 Sunday
2017-11-12 108 Sunday
2017-11-19 108 Sunday
2017-11-26 108 Sunday
2017-12-03 108 Sunday
2017-12-10 108 Sunday
2017-12-17 108 Sunday
2017-12-24 108 Sunday
2017-12-31 108 Sunday
2018-01-07 108 Sunday
2018-01-14 108 Sunday
2018-01-21 108 Sunday
2018-01-28 108 Sunday
2018-02-04 108 Sunday
2018-02-11 108 Sunday
2018-02-18 108 Sunday
2018-02-25 108 Sunday
2018-03-04 108 Sunday
2018-03-11 108 Sunday
2018-03-18 108 Sunday
2018-03-25 108 Sunday

The data were always collected on Sundays. Let’s see what we find if we focus on a single day:

## # A tibble: 108 x 3
## # Groups:   region [54]
##    region              type             n
##    <chr>               <chr>        <int>
##  1 Albany              conventional     1
##  2 Albany              organic          1
##  3 Atlanta             conventional     1
##  4 Atlanta             organic          1
##  5 BaltimoreWashington conventional     1
##  6 BaltimoreWashington organic          1
##  7 Boise               conventional     1
##  8 Boise               organic          1
##  9 Boston              conventional     1
## 10 Boston              organic          1
## # … with 98 more rows

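OK, so we have one observation per region and type every Sunday: 54 regions × 2 types = 108 rows. Three Sundays in the table above (2015-12-06, 2017-06-18 and 2017-06-25) have 107 rows, one short of the full 108. Next step: are we missing any Sundays?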

## # A tibble: 1 x 2
## # Groups:   week_day [1]
##   week_day     n
##   <ord>    <int>
## 1 Sunday     169

So yes: the 1176-day span is exactly 168 weeks, which gives 169 Sundays from the first date to the last, and all 169 are present.
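The same conclusion can be checked directly by generating every seventh day between the first and last date and comparing that set with the observed dates. This is a sketch: `missing_weekly_dates` is a hypothetical helper, and `observed` is a stand-in for the dataset's unique dates.

```python
from datetime import date, timedelta


def missing_weekly_dates(observed, first, last):
    """Return the expected weekly dates absent from `observed`,
    plus the number of expected dates."""
    expected = set()
    current = first
    while current <= last:
        expected.add(current)
        current += timedelta(days=7)
    return sorted(expected - set(observed)), len(expected)


first, last = date(2015, 1, 4), date(2018, 3, 25)
# Stand-in for the dataset's unique dates: every 7th day in the range.
observed = [first + timedelta(weeks=k) for k in range(169)]
missing, n_expected = missing_weekly_dates(observed, first, last)
print((last - first).days, n_expected, missing)  # 1176 169 []
```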

Price evolution

Overall price change:

Creating a quick model with region and type to predict the price (the fitted rules below also split on year):

AveragePrice
8 0.82 when type is conventional & region is DallasFtWorth or Houston or PhoenixTucson or SouthCentral or WestTexNewMexico
9 1.07 when type is conventional & region is Atlanta or Boise or California or CincinnatiDayton or Columbus or Denver or Detroit or Indianapolis or LasVegas or LosAngeles or Louisville or Nashville or NewOrleansMobile or Portland or RichmondNorfolk or Roanoke or SanDiego or SouthCarolina or Spokane or TotalUS or West
10 1.20 when type is conventional & region is Albany or BaltimoreWashington or Boston or BuffaloRochester or Charlotte or Chicago or GrandRapids or GreatLakes or HarrisburgScranton or HartfordSpringfield or Jacksonville or MiamiFtLauderdale or Midsouth or NewYork or Northeast or NorthernNewEngland or Orlando or Philadelphia or Pittsburgh or Plains or RaleighGreensboro or Sacramento or SanFrancisco or Seattle or Southeast or StLouis or Syracuse or Tampa & year < 2017
11 1.40 when type is conventional & region is Albany or BaltimoreWashington or Boston or BuffaloRochester or Charlotte or Chicago or GrandRapids or GreatLakes or HarrisburgScranton or HartfordSpringfield or Jacksonville or MiamiFtLauderdale or Midsouth or NewYork or Northeast or NorthernNewEngland or Orlando or Philadelphia or Pittsburgh or Plains or RaleighGreensboro or Sacramento or SanFrancisco or Seattle or Southeast or StLouis or Syracuse or Tampa & year >= 2017
12 1.41 when type is organic & region is CincinnatiDayton or Columbus or DallasFtWorth or Denver or Detroit or GreatLakes or Houston or Indianapolis or LosAngeles or Louisville or Nashville or Pittsburgh or RichmondNorfolk or Roanoke or SouthCentral
13 1.59 when type is organic & region is Atlanta or Boise or MiamiFtLauderdale or Midsouth or NewOrleansMobile or Portland or Southeast or Tampa or TotalUS or West
14 1.74 when type is organic & region is Albany or BaltimoreWashington or Boston or BuffaloRochester or California or Chicago or GrandRapids or HarrisburgScranton or Jacksonville or LasVegas or Northeast or NorthernNewEngland or Orlando or Philadelphia or PhoenixTucson or Plains or RaleighGreensboro or SanDiego or Seattle or SouthCarolina or Spokane or StLouis or Syracuse or WestTexNewMexico
15 2.08 when type is organic & region is Charlotte or HartfordSpringfield or NewYork or Sacramento or SanFrancisco

We can already see patterns that indicate where avocados are likely to be more or less expensive. Let’s check the relevant metrics of our model:

##      RMSE  Rsquared       MAE 
## 0.2629175 0.5736658 0.1955221
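For reference, the three metrics can be computed as in the sketch below. The source does not say how Rsquared was defined. Here it is assumed to be the squared Pearson correlation between observed and predicted values, which can differ from 1 - SSE/SST. The input numbers are made up.

```python
import math


def regression_metrics(observed, predicted):
    """RMSE, R-squared (squared Pearson correlation, an assumption) and MAE."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r_squared = cov * cov / (var_o * var_p)
    return {"RMSE": rmse, "Rsquared": r_squared, "MAE": mae}


# Made-up prices: every prediction is 0.5 too high.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(metrics)
```

In this toy case the correlation-based R-squared is 1 even though every prediction is off by 0.5, which is why RMSE and MAE are worth reading alongside it.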

Results analysis

On average, conventional avocados are cheapest in DallasFtWorth, Houston, PhoenixTucson, SouthCentral and WestTexNewMexico:

It is interesting how the price has fallen in these regions. Why? Let’s see the distribution of our residuals:

Visualize the predictions:

Analysis vol. & price

Analysis of ORGANIC avocados

Analysis of CONVENTIONAL avocados

Seasons effect

The two previous charts show a clear seasonality based on the seasons of the year. We are going to create new variables to identify them:
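The exact variables are not shown in the source. One plausible version, sketched below, maps each date to a meteorological Northern Hemisphere season by month. Both the season boundaries and the function name `season` are assumptions.

```python
from datetime import date


def season(d):
    """Meteorological Northern Hemisphere season of a date (assumed definition)."""
    if d.month in (12, 1, 2):
        return "winter"
    if d.month in (3, 4, 5):
        return "spring"
    if d.month in (6, 7, 8):
        return "summer"
    return "autumn"


for d in (date(2015, 1, 4), date(2016, 10, 2), date(2017, 6, 18), date(2018, 3, 25)):
    print(d.isoformat(), season(d))
```

A categorical variable like this can then enter a linear model as a set of indicator columns.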

In this case we will use a linear model:

Let’s check the performance:

##      RMSE  Rsquared       MAE 
## 0.2524595 0.6069077 0.1917957

Let’s visualize the predictions:

This model has detected a trend and a seasonality. Time to start using forecasting to make our predictions.

Time series

Decomposition

First we will analyse the organic avocados:
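The source does not show how the decomposition was computed. As an illustration of the general idea only, the sketch below performs a classical additive decomposition: a centered moving average estimates the trend, and the average detrended value at each position in the cycle estimates the seasonal effect. The function names and the toy series are hypothetical.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: half weight on the two end points (a 2 x period MA).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def seasonal_means(series, trend, period):
    """Average detrended value at each position within the period."""
    buckets = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[t % period].append(value - level)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    return [m - centre for m in means]  # seasonal effects sum to zero


# Toy series with a flat trend and a repeating 4-step pattern.
toy = [1, 2, 3, 4] * 3
trend = centered_moving_average(toy, period=4)
seasonal = seasonal_means(toy, trend, period=4)
print(trend)
print(seasonal)
```

The remainder is what is left after subtracting the trend and the seasonal effect from each observation.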

Understanding autocorrelation to find patterns we are not explaining:

Notes from a Kaggle kernel:

  • What does the autocorrelation graph tell us?

    First of all, we have to understand the concept of lags. So what are lags? Think of lags as time intervals. In this case we think of lags as monthly time intervals. Now, what is the main goal of autocorrelation? We would like to see whether there is a linear relationship with the first lag (month), that is, whether certain patterns repeat relative to the first month. The correlation at lag zero is always one because the series is correlated with itself. At lag one, the correlation is close to one, which tells us that the next month is similar to the first, so the trend is highly correlated with month 1 (January).

  • What can we conclude about the autocorrelation?

    There is a high autocorrelation until the third lag (March), which tells us that there is a strong linear relationship between these months and the first month. However, we see no linear relationship between the other months and the first month. So we can conclude that there are no consistent patterns showing a linear relationship with the prices in the first month, for both conventional and organic avocados.
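To make the notion of a lag concrete, here is a minimal sketch of the sample autocorrelation function in plain Python: the series is compared with a copy of itself shifted by `lag` steps. The function name and the toy series are illustrative and are not taken from the kernel quoted above.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    acf = []
    for lag in range(max_lag + 1):
        # Covariance between the series and itself shifted by `lag` steps.
        num = sum(
            (series[t] - mean) * (series[t + lag] - mean)
            for t in range(n - lag)
        )
        acf.append(num / denom)
    return acf


acf = autocorrelation([1, 2, 3, 4, 5], max_lag=2)
print(acf)
```

The value at lag 0 is always 1, because the series is compared with itself. How quickly the later values shrink shows how long the series keeps resembling its own past.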

Applying ARIMA

Using ARIMA to select and fit a model:

Visualize the ARIMA model performance: