It is a well known fact that Millenials LOVE Avocado Toast :D . It’s also a well known fact that all Millenials live in their parents basements :( .
Clearly, they aren’t buying home because they are buying too much Avocado Toast!
Now let’s help our Millenials in New York to predict avocado prices so they can save more to enjoy more avocados!
This data is downloaded from from link below:
Let us import the library required.
library(dplyr) # for data wrangling
library(lubridate) # to dea with date
library(padr) # for padding
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # calculate error
library(ggplot2)
library(plotly)Let us import the data for the analysis.
avocado <- read.csv("avocado.csv")
avocado <- avocado %>% filter(region=="NewYork") %>% select(Date,AveragePrice)Our Date column is still not in Date data type. We have to change to Date data type
avocado$Date <- as.Date(avocado$Date)
avocado$Date <- ymd(avocado$Date)The characteristics of a Time Series object must consists of:
Therefore, we need to do time series padding.
avocado_clean <- avocado %>%
group_by(Date) %>%
summarise(AveragePrice=mean(AveragePrice)) %>%
arrange(Date)
avocado_clean colSums(is.na(avocado_clean))#> Date AveragePrice
#> 0 0
Our Dataset is ready. It is on periodic order, no missing period and no missing value.
Before we do analysis and make prediction, we have to make our data set as a Time-Series (TS) Object
avocado_ts <- ts(data = avocado_clean$AveragePrice,
start = 2015,
frequency = 48)In this section, I tried to explore whether our timeseries object has trend and seasonal properties (one-seasonal/multiseasonal).
avocado_deco <- avocado_ts %>% decompose() %>% plot() If we look at plot above, the data has trend and seasonality
The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially.
# 24 latest week for test
avocado_test <- tail(avocado_ts, 24)
avocado_train <- head(avocado_ts, -24)This time, I will compare between two of the widely used time series modeling in business and industry: Holt-Winters and Seasonal Arima. I use Holt-Winters because my data contain both seasonality and trend. I also want to compare it between seasonal Arima to check whether seasonal ARIMA can give better forecasting performance.
# Holt-Winters
avocado_hw <- HoltWinters(avocado_train)
# ARIMA
avocado_arima <- auto.arima(avocado_train, seasonal = T)Now we are going to forecast and evaluate our models
#forecast
avocado_hw_f <- forecast(avocado_hw,h=24)
avocado_arima_f <- forecast(avocado_arima,h=24)Let us visualize our models’ forecast
avocado_ts %>%
autoplot(series = "Actual") +
autolayer(avocado_hw_f$mean, series = "Holts-Winter Model") +
autolayer(avocado_arima_f$mean, series = "Arima Model") +
labs(title="Avocado Price Prediction in New York",
y="Price/Piece in USD") + theme_bw()Now let us compare our Models performance
#Holts-Winter Model
accuracy(avocado_hw_f,avocado_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.00492317 0.1422274 0.09740765 -0.05874552 5.330022 0.4293816
#> Test set -0.23076662 0.3121820 0.25023176 -15.19327495 16.284733 1.1030439
#> ACF1 Theil's U
#> Training set 0.04455499 NA
#> Test set 0.70795934 2.526663
#Arima Model
accuracy(avocado_arima_f,avocado_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.004396247 0.1189256 0.09218445 -0.09417289 5.344705 0.4063573
#> Test set -0.230409053 0.2698729 0.23197219 -14.95703282 15.039958 1.0225541
#> ACF1 Theil's U
#> Training set 0.002780219 NA
#> Test set 0.536553230 2.15109
If we look at above result, we found that our Arima model is the best model with lower MAPE (Test set) value.
Since our Arima model is the best model, it needs to fullfil 2 assumption requirements: Normality & Non-Correlation with our model forecast residuals.
We check our model forecast residuals normality using Shapiro-wik test
shapiro.test(avocado_arima_f$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: avocado_arima_f$residuals
#> W = 0.97943, p-value = 0.02812
# p-value < 0.05; normally distributedthe p-value < 0.05, so we can say that our model residuals are normally distributed
We check the correlation between our model forecast residuals using Box test with Ljung-Box type
Box.test(avocado_arima_f$residuals, type = "Ljung-Box") #>
#> Box-Ljung test
#>
#> data: avocado_arima_f$residuals
#> X-squared = 0.0011441, df = 1, p-value = 0.973
#p-value > alpha (0.05);no auto correlationthe p-value > 0.05, so we can say that there is no auto-correlation between our model forecast residuals.
Now we can conclude that our Arima model fullfills both 2 assumptions: Normality and No-Correlation.
Now we know that our best model is Arima Model. Our data is up to March-2018. Now we would like to find out the next 48 weeks (up to early 2019) price prediction using our model
#next 48 weeks price prediction (Price/piece in USD) :
model_prod <- auto.arima(avocado_ts)
avocado_pro_f <- forecast(model_prod,h=48)
avocado_pro_f$mean#> Time Series:
#> Start = c(2018, 26)
#> End = c(2019, 25)
#> Frequency = 48
#> [1] 1.573997 1.587375 1.600728 1.599668 1.598159 1.578807 1.594368 1.566271
#> [9] 1.573241 1.594408 1.592582 1.543925 1.547584 1.545395 1.552727 1.567272
#> [17] 1.558721 1.580704 1.579162 1.562176 1.561239 1.564305 1.562690 1.563814
#> [25] 1.563869 1.562218 1.563362 1.551601 1.559180 1.553348 1.551341 1.536317
#> [33] 1.517518 1.538996 1.534328 1.531683 1.566683 1.563643 1.541474 1.520511
#> [41] 1.532246 1.530197 1.494546 1.519774 1.515978 1.535294 1.529221 1.531607
avocado_ts %>%
autoplot(series = "Actual Price") +
autolayer(avocado_pro_f$mean, series = "Price Prediction") +
labs(title="Avocado Price Prediction in New York",
y="Price/Piece in USD") + theme_bw()