The goal of this topic is to build a predictive model on the beer data from the fpp package. The data will be split into two samples: a training set covering 1991 to 1994, and a test set containing the data from 1995 onwards. Forecasts will be made with the four basic benchmark methods (mean, naive, drift and seasonal naive) and with the tslm function. We will also measure forecast accuracy by comparing the training and test sets.

Load data and libraries
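
The code for this step is not shown. A minimal sketch, assuming the fpp package (which should attach forecast and make the beer series available) and knitr (for the kable calls used later) are installed:

```r
# Assumption: fpp attaches the forecast package and provides the beer series
library(fpp)
# Assumption: knitr is used only for kable() when printing accuracy tables
library(knitr)

# Quick look at the object: a monthly time series
str(beer)
frequency(beer)
```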

Plot the data

plot(beer, type="l", col="blue")

Check data structure

summary(beer)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   119.0   135.5   145.5   149.3   156.2   192.0
hist(beer, col = "blue")

Data transformation using the BoxCox function

best_beer<-BoxCox.lambda(beer)
best<-BoxCox(beer,best_beer)
plot(best)
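
To make the transformation concrete, here is a small base-R sketch of the Box-Cox formula for a strictly positive series. The helpers box_cox and inv_box_cox are illustrative names, not the forecast implementation, and lambda = 0.5 is an arbitrary value chosen for the example rather than the one BoxCox.lambda selects. The three input values are the minimum, median and maximum from the summary above.

```r
# Box-Cox transform for strictly positive values (illustrative helper)
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, needed to bring values back to the original scale
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- c(119, 145.5, 192)          # min, median and max of the beer series
w <- box_cox(y, lambda = 0.5)    # lambda = 0.5 is an arbitrary example value
all.equal(inv_box_cox(w, lambda = 0.5), y)  # the round trip recovers y
```

Because everything that follows is computed on the transformed series, forecasts and error measures are on the transformed scale too. The forecast package provides InvBoxCox for converting them back.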

Check seasonality by month and year on the transformed data

seasonplot(best, col = rainbow(10), year.labels = TRUE)

COMMENTS: Looking at the main building blocks of the data, we see no clear trend, but we suspect a seasonal structure linked to rising production towards the end of the year (Christmas and New Year effects).

Decompose data using stl function and additive model structure

decomp<-stl(best, s.window = 15)
plot(decomp)

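# Seasonally adjusted series: the data with the STL seasonal component removed
adjust<-seasadj(decomp)
# Naive forecast of the seasonally adjusted series
plot(naive(adjust))

# Seasonal naive forecast of the same seasonally adjusted series
plot(snaive(adjust))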

Autocorrelation with acf and pacf

tsdisplay(best)

Divide data between training and test time series objects

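b1<-window(best, end=c(1994, 12))   # training set: 1991 to December 1994
b2<-window(best, start=c(1995, 1))  # test set: January 1995 onwards
h<-length(b2)                       # forecast horizon = size of the test set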

Forecast from 1995 onwards using mean, naive, drift and seasonal methods

plot(best, type = 'n')
lines(b1)
lines(b2, col = "red")
abline(v = time(b2)[1], lty = 2, lwd = 2) # dashed line at the start of the test set

# Mean (based on the overall mean value)
f1 <- meanf(b1, h = h)
lines(f1$mean, lwd = 2, col = "yellow")

# Naive (based on the last value)
f2 <- rwf(b1, h = h)
lines(f2$mean, lwd = 2, col = "green")

# Drift (based on 1st and last value)
f3 <- rwf(b1, drift = TRUE, h = h)
lines(f3$mean, lwd = 2, col = "orange")

# Seasonal naive forecast
f4 <- snaive(b1, h = h)
lines(f4$mean, lwd = 2, col = "blue")
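
The point forecasts behind these four methods are simple enough to compute by hand. The sketch below does so in base R on a made-up quarterly series (the numbers are illustrative only, not the beer data), assuming the usual textbook definitions of each method.

```r
# Toy quarterly series (made-up numbers, not the beer data)
y <- ts(c(10, 14, 8, 12, 11, 15, 9, 13), start = c(2000, 1), frequency = 4)
h <- 4
n <- length(y)
m <- frequency(y)

# Mean method: every forecast is the average of the training data
mean_fc <- rep(mean(y), h)

# Naive method: every forecast is the last observed value
naive_fc <- rep(y[n], h)

# Drift method: extend the line joining the first and last observations
drift_fc <- y[n] + (1:h) * (y[n] - y[1]) / (n - 1)

# Seasonal naive method: repeat the value from the same season one cycle earlier
snaive_fc <- y[n - m + ((1:h - 1) %% m) + 1]

rbind(mean_fc, naive_fc, drift_fc, snaive_fc)
```

On a series with a strong seasonal pattern, only the last method reproduces the within-year shape, which is why it is the natural benchmark here.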

kable(accuracy(f1, b2)) # display accuracy for mean method
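|              |         ME |      RMSE |       MAE |        MPE |      MAPE |     MASE |       ACF1 | Theil’s U |
|--------------|-----------:|----------:|----------:|-----------:|----------:|---------:|-----------:|----------:|
| Training set |  0.0000000 | 0.0008243 | 0.0006761 | -0.0000688 | 0.0680501 | 1.628164 |  0.4381281 |        NA |
| Test set     | -0.0005652 | 0.0008534 | 0.0006734 | -0.0569714 | 0.0678601 | 1.621680 | -0.4624799 | 0.7852037 |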
kable(accuracy(f2, b2)) # display accuracy for naive method
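|              |         ME |      RMSE |       MAE |        MPE |      MAPE |     MASE |       ACF1 | Theil’s U |
|--------------|-----------:|----------:|----------:|-----------:|----------:|---------:|-----------:|----------:|
| Training set |  0.0000128 | 0.0008598 | 0.0006630 |  0.0012535 | 0.0667417 | 1.596633 | -0.2099102 |        NA |
| Test set     | -0.0017945 | 0.0019050 | 0.0017945 | -0.1807940 | 0.1807940 | 4.321656 | -0.4624799 |  1.697863 |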
kable(accuracy(f3, b2)) # display accuracy for drift method
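|              |         ME |      RMSE |       MAE |        MPE |      MAPE |     MASE |       ACF1 | Theil’s U |
|--------------|-----------:|----------:|----------:|-----------:|----------:|---------:|-----------:|----------:|
| Training set |  0.0000000 | 0.0008597 | 0.0006616 | -0.0000387 | 0.0666047 | 1.593345 | -0.2099102 |        NA |
| Test set     | -0.0018522 | 0.0019607 | 0.0018522 | -0.1866123 | 0.1866123 | 4.460764 | -0.4343529 |  1.751872 |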
kable(accuracy(f4, b2)) # display accuracy for seasonal method
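|              |         ME |      RMSE |       MAE |        MPE |      MAPE |     MASE |       ACF1 | Theil’s U |
|--------------|-----------:|----------:|----------:|-----------:|----------:|---------:|-----------:|----------:|
| Training set | -0.0001523 | 0.0005277 | 0.0004152 | -0.0153448 | 0.0418132 | 1.000000 | -0.2479798 |        NA |
| Test set     |  0.0000392 | 0.0005297 | 0.0004480 |  0.0039347 | 0.0451176 | 1.078918 | -0.0906758 |  0.454126 |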
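
For readers who want to see what the main columns mean, the following base-R sketch computes them by hand on made-up numbers (not the beer data). It assumes the standard definitions, with MASE scaled by the in-sample mean absolute error of the seasonal naive forecast. That scaling is consistent with the training-set MASE of exactly 1 reported for the seasonal naive method.

```r
# Toy data (made-up numbers): two years of quarterly training data,
# one year of actual values and the matching forecasts
train  <- c(10, 14, 8, 12, 11, 15, 9, 13)
m      <- 4                       # seasonal period of the toy data
actual <- c(12, 16, 10, 14)
fc     <- c(11, 15, 9, 13)

e <- actual - fc                  # forecast errors

me   <- mean(e)                   # mean error (bias)
rmse <- sqrt(mean(e^2))           # root mean squared error
mae  <- mean(abs(e))              # mean absolute error
mpe  <- mean(100 * e / actual)    # mean percentage error
mape <- mean(abs(100 * e / actual))  # mean absolute percentage error

# MASE: MAE divided by the in-sample MAE of the seasonal naive forecast
scale <- mean(abs(diff(train, lag = m)))
mase  <- mae / scale

c(ME = me, RMSE = rmse, MAE = mae, MPE = mpe, MAPE = mape, MASE = mase)
```

A MASE above 1 on the test set means the forecast does worse than the seasonal naive benchmark did on the training data.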

COMMENT: The seasonal naive method minimizes the RMSE on the test set, with a value of 0.0005296950 (measured on the Box-Cox transformed scale). It also has the lowest test-set MAE, MAPE and MASE of the four methods. Our intuition about the seasonal structure of the data is confirmed by the accuracy measures, so the seasonal naive method is the best of the four for our prediction task.

Analyse residuals for the seasonal naive forecast

res <- residuals(f4)
plot(res)

hist(res, breaks = "FD", col = "lightgreen")

acf(res, na.action = na.omit)

The bell shape of the histogram suggests that the residuals are roughly normal. Whether they are uncorrelated is judged from the ACF plot rather than from the histogram.
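
Visual checks can be complemented with formal tests. The sketch below uses two base-R tests on simulated white noise (res_toy is a stand-in, not the actual residuals): the Ljung-Box test for autocorrelation and the Shapiro-Wilk test for normality. The lag of 12 is an assumption chosen to cover one seasonal cycle of monthly data.

```r
# Stand-in for model residuals: simulated white noise, not the real residuals
set.seed(42)
res_toy <- rnorm(48)

# Ljung-Box test: the null hypothesis is no autocorrelation up to the given lag
lb <- Box.test(res_toy, lag = 12, type = "Ljung-Box")

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw <- shapiro.test(res_toy)

lb$p.value
sw$p.value
```

Large p-values mean the tests find no evidence against uncorrelated, normal residuals. With the real seasonal naive residuals, the leading missing values can be dropped first, for example with na.omit(res).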

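Fit the model based on the trend

# Note: the model is fitted on the full transformed series `best`,
# so the forecast extends h steps beyond the last observation
fit<-tslm(best~trend)
f<-forecast(fit, h=h)
plot(f)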

acf(residuals(f))

Fit the model based on trend and seasonality

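fit2<-tslm(best~trend+season)
# Stored as f_season so the naive forecast f2 defined earlier is not overwritten
f_season<-forecast(fit2, h=h)
plot(f_season)

acf(residuals(f_season))

pacf(residuals(f_season))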
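
To show what a trend-plus-season regression does, here is a roughly equivalent fit with base-R lm on a made-up quarterly series (not the beer data). It assumes that the trend predictor is the time index 1, 2, ..., n and that the season predictor is a factor of the position within the cycle. This is a sketch of the idea, not the tslm internals.

```r
# Toy quarterly series (made-up numbers): +0.25 per quarter plus a fixed
# seasonal pattern; assumed to start in the first period of the cycle
y <- ts(c(10, 14, 8, 12, 11, 15, 9, 13), start = c(2000, 1), frequency = 4)
n <- length(y)
m <- frequency(y)
h <- 4

# Build the two predictors explicitly
train_df <- data.frame(
  y      = as.numeric(y),
  trend  = seq_len(n),
  season = factor(as.integer(cycle(y)), levels = 1:m)
)
fit_lm <- lm(y ~ trend + season, data = train_df)

# Future predictors: the trend keeps counting, the season keeps cycling
future_df <- data.frame(
  trend  = n + seq_len(h),
  season = factor(((n + seq_len(h) - 1) %% m) + 1, levels = 1:m)
)
pred <- predict(fit_lm, newdata = future_df)
pred
```

Because the toy series is an exact linear trend plus a fixed seasonal pattern, the regression reproduces the next year without error. Real data will leave residuals, which is what the ACF and PACF plots examine.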

The ACF and PACF plots also suggest that, once the trend and seasonal structure are modelled, the residuals show no notable autocorrelation.
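
Both tslm fits use the whole series, so they are not scored on the 1995 hold-out the way the four benchmark methods are. One option is to refit on the training window only and evaluate on the test window. The following is a sketch on the untransformed beer series, assuming the fpp package is installed. The names train, test, fit_train and fc_train are illustrative.

```r
# Assumption: fpp attaches forecast and provides the beer series
library(fpp)

# Same split as before, here on the untransformed data
train <- window(beer, end = c(1994, 12))
test  <- window(beer, start = c(1995, 1))

# Fit the trend + season regression on the training window only
fit_train <- tslm(train ~ trend + season)
fc_train  <- forecast(fit_train, h = length(test))

# Training-set and test-set accuracy, comparable across methods
acc <- accuracy(fc_train, test)
acc[, c("RMSE", "MAE", "MAPE", "MASE")]
```

The resulting test-set row can be compared with the seasonal naive benchmark, provided both are computed on the same scale.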