1 Introduction

It is a well known fact that Millenials LOVE Avocado Toast :D . It’s also a well known fact that all Millenials live in their parents basements :( .

Clearly, they aren’t buying home because they are buying too much Avocado Toast!

Now let’s help our Millenials in New York to predict avocado prices so they can save more to enjoy more avocados!

This data is downloaded from from link below:

https://www.kaggle.com/neuromusic/avocado-prices

2 Exploratory Data

2.1 Import Library

Let us import the library required.

library(dplyr) # for data wrangling
library(lubridate) # to dea with date
library(padr) # for padding
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # calculate error
library(ggplot2)
library(plotly)

2.2 Import Data

Let us import the data for the analysis.

avocado <- read.csv("avocado.csv")
avocado <- avocado %>% filter(region=="NewYork") %>% select(Date,AveragePrice)

2.3 Data Pre-Processing

Our Date column is still not in Date data type. We have to change to Date data type

avocado$Date <- as.Date(avocado$Date)
avocado$Date <- ymd(avocado$Date)

The characteristics of a Time Series object must consists of:

  • an ordered structure for the period
  • no missing period, and
  • no missing value

Therefore, we need to do time series padding.

avocado_clean <- avocado %>%
  group_by(Date) %>% 
  summarise(AveragePrice=mean(AveragePrice)) %>% 
  arrange(Date) 

avocado_clean   
colSums(is.na(avocado_clean))
#>         Date AveragePrice 
#>            0            0

Our Dataset is ready. It is on periodic order, no missing period and no missing value.

3 Time Series Analysis

3.1 TS Object

Before we do analysis and make prediction, we have to make our data set as a Time-Series (TS) Object

avocado_ts <- ts(data = avocado_clean$AveragePrice,
   start = 2015,
   frequency = 48)

In this section, I tried to explore whether our timeseries object has trend and seasonal properties (one-seasonal/multiseasonal).

avocado_deco <- avocado_ts %>%  decompose() %>% plot()

If we look at plot above, the data has trend and seasonality

3.2 Cross Validation

The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially.

# 24 latest week for  test
avocado_test <- tail(avocado_ts, 24)

avocado_train <- head(avocado_ts, -24)

3.3 Model Building

This time, I will compare between two of the widely used time series modeling in business and industry: Holt-Winters and Seasonal Arima. I use Holt-Winters because my data contain both seasonality and trend. I also want to compare it between seasonal Arima to check whether seasonal ARIMA can give better forecasting performance.

# Holt-Winters
avocado_hw <- HoltWinters(avocado_train) 
# ARIMA
avocado_arima <- auto.arima(avocado_train, seasonal = T)

3.4 Forecast & Evaluation

Now we are going to forecast and evaluate our models

#forecast

avocado_hw_f <- forecast(avocado_hw,h=24)
avocado_arima_f <- forecast(avocado_arima,h=24)

Let us visualize our models’ forecast

avocado_ts %>% 
    autoplot(series = "Actual") +
  autolayer(avocado_hw_f$mean, series = "Holts-Winter Model")  +
  autolayer(avocado_arima_f$mean, series = "Arima Model")  +
  labs(title="Avocado Price Prediction in New York",
       y="Price/Piece in USD") + theme_bw()

Now let us compare our Models performance

#Holts-Winter Model
accuracy(avocado_hw_f,avocado_test)
#>                       ME      RMSE        MAE          MPE      MAPE      MASE
#> Training set  0.00492317 0.1422274 0.09740765  -0.05874552  5.330022 0.4293816
#> Test set     -0.23076662 0.3121820 0.25023176 -15.19327495 16.284733 1.1030439
#>                    ACF1 Theil's U
#> Training set 0.04455499        NA
#> Test set     0.70795934  2.526663
#Arima Model
accuracy(avocado_arima_f,avocado_test)
#>                        ME      RMSE        MAE          MPE      MAPE      MASE
#> Training set  0.004396247 0.1189256 0.09218445  -0.09417289  5.344705 0.4063573
#> Test set     -0.230409053 0.2698729 0.23197219 -14.95703282 15.039958 1.0225541
#>                     ACF1 Theil's U
#> Training set 0.002780219        NA
#> Test set     0.536553230   2.15109

If we look at above result, we found that our Arima model is the best model with lower MAPE (Test set) value.

3.5 Assumption

Since our Arima model is the best model, it needs to fullfil 2 assumption requirements: Normality & Non-Correlation with our model forecast residuals.

3.5.1 Normality

We check our model forecast residuals normality using Shapiro-wik test

shapiro.test(avocado_arima_f$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  avocado_arima_f$residuals
#> W = 0.97943, p-value = 0.02812
# p-value < 0.05; normally distributed

the p-value < 0.05, so we can say that our model residuals are normally distributed

3.5.2 Autocorrelation

We check the correlation between our model forecast residuals using Box test with Ljung-Box type

Box.test(avocado_arima_f$residuals, type = "Ljung-Box") 
#> 
#>  Box-Ljung test
#> 
#> data:  avocado_arima_f$residuals
#> X-squared = 0.0011441, df = 1, p-value = 0.973
#p-value > alpha (0.05);no auto correlation

the p-value > 0.05, so we can say that there is no auto-correlation between our model forecast residuals.

Now we can conclude that our Arima model fullfills both 2 assumptions: Normality and No-Correlation.

4 Model Price Prediction

Now we know that our best model is Arima Model. Our data is up to March-2018. Now we would like to find out the next 48 weeks (up to early 2019) price prediction using our model

#next 48 weeks price prediction (Price/piece in USD)  :
  
model_prod <- auto.arima(avocado_ts)

avocado_pro_f <- forecast(model_prod,h=48)

avocado_pro_f$mean
#> Time Series:
#> Start = c(2018, 26) 
#> End = c(2019, 25) 
#> Frequency = 48 
#>  [1] 1.573997 1.587375 1.600728 1.599668 1.598159 1.578807 1.594368 1.566271
#>  [9] 1.573241 1.594408 1.592582 1.543925 1.547584 1.545395 1.552727 1.567272
#> [17] 1.558721 1.580704 1.579162 1.562176 1.561239 1.564305 1.562690 1.563814
#> [25] 1.563869 1.562218 1.563362 1.551601 1.559180 1.553348 1.551341 1.536317
#> [33] 1.517518 1.538996 1.534328 1.531683 1.566683 1.563643 1.541474 1.520511
#> [41] 1.532246 1.530197 1.494546 1.519774 1.515978 1.535294 1.529221 1.531607
avocado_ts %>% 
    autoplot(series = "Actual Price") +
  autolayer(avocado_pro_f$mean, series = "Price Prediction")  +
  labs(title="Avocado Price Prediction in New York",
       y="Price/Piece in USD") + theme_bw()