From the plot of both series, we see that sales of the two formats sometimes move in opposite directions - when Paperback sells well, Hardcover sales struggle, and vice versa. The daily trend is generally increasing for both books, and by the end of the series Hardcover sales are almost always higher than Paperback's. This is notable given that Hardcover sales on day 1 were well below Paperback's, and it suggests a substitution effect from Paperback to Hardcover: people buy the Hardcover instead of the Paperback.
For a white noise series, we expect all of the spikes in the ACF to lie within the dashed blue lines. From ggAcf, we see that two spikes fall outside the bounds while the rest lie within them; in this case T = 30, so the bounds are at +-2/sqrt(30), approximately +-0.37.
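The bound is simple arithmetic and easy to verify directly:
# the +/- 2/sqrt(T) significance bound drawn by ggAcf, with T = 30 observations
2 / sqrt(30)
## [1] 0.3651484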
Since the data is not white noise and not stationary, we can look for trend and seasonality. I perform an STL decomposition on the series converted to weekly frequency (frequency = 7) to examine the features of the data. The trend is increasing for both Paperback and Hardcover, and the weekly seasonal pattern shows a peak on days 1 and 6 for both series, with sales dropping sharply on days 3 and 4.
From the summary statistics, we see that Hardcover books generally sell more overall - both the mean and the median are higher.
library(fpp2)  # loads the forecast and ggplot2 packages along with the books data
# Daily pattern - general series
autoplot(books) +
ylab("Number of book sales") + xlab("Day")
# Daily pattern - sub series
autoplot(books, facets=TRUE) +
ylab("Number of book sales") + xlab("Day")
# Weekly pattern
ts(books[, 'Paperback'], frequency = 7) %>%
stl(s.window = 'periodic', robust = TRUE) %>%
autoplot()
ts(books[, 'Hardcover'], frequency = 7) %>%
stl(s.window = 'periodic', robust = TRUE) %>%
autoplot()
#gglagplot(ts(books[, 'Paperback'], frequency = 7))
#gglagplot(ts(books[, 'Hardcover'], frequency = 7))
# Autocorrelation for each book
ggAcf(books[, 'Paperback'])
ggAcf(books[, 'Hardcover'])
# summary statistics
summary(books)
## Paperback Hardcover
## Min. :111.0 Min. :128.0
## 1st Qu.:167.2 1st Qu.:170.5
## Median :189.0 Median :200.5
## Mean :186.4 Mean :198.8
## 3rd Qu.:207.2 3rd Qu.:222.0
## Max. :247.0 Max. :283.0
Since SES is suitable for data with no clear trend or seasonal pattern, the forecasts are flat (a quick check follows the plots below).
# Estimate parameters
fc_paper <- ses(books[, 'Paperback'], h=5)
fc_hard <- ses(books[, 'Hardcover'], h=5)
autoplot(fc_paper) +
autolayer(fitted(fc_paper), series="Fitted") +
ylab("Number of paperbook sales") + xlab("Day")
autoplot(fc_hard) +
autolayer(fitted(fc_hard), series="Fitted") +
ylab("Number of hardbook sales") + xlab("Day")
# c. Compute the RMSE values for the training data in each case.
The training-set RMSE for Paperback is higher than that for Hardcover; a by-hand check of the RMSE follows the accuracy() output below.
#accuracy(fitted(fc_paper), books[, 'Paperback'])
accuracy(fc_paper)
## ME RMSE MAE MPE MAPE MASE
## Training set 7.175981 33.63769 27.8431 0.4736071 15.57784 0.7021303
## ACF1
## Training set -0.2117522
accuracy(fc_hard)
## ME RMSE MAE MPE MAPE MASE
## Training set 9.166735 31.93101 26.77319 2.636189 13.39487 0.7987887
## ACF1
## Training set -0.1417763
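The training RMSE is just the root mean square of the training residuals, so the line below should reproduce the value accuracy() reports for Paperback.
# RMSE from first principles: root mean square of the training residuals
sqrt(mean(residuals(fc_paper)^2))
## [1] 33.63769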
As expected, since Holt's linear method models a linear trend, the forecast line follows a straight line.
fc_paper_hl <- holt(books[, 'Paperback'], h=4)
fc_hard_hl <- holt(books[, 'Hardcover'], h=4)
autoplot(fc_paper_hl) +
autolayer(fitted(fc_paper_hl), series="Fitted") +
ylab("Number of paperbook sales") + xlab("Day")
autoplot(fc_hard_hl) +
autolayer(fitted(fc_hard_hl), series="Fitted") +
ylab("Number of hardbook sales") + xlab("Day")
accuracy(fc_paper_hl)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.717178 31.13692 26.18083 -5.508526 15.58354 0.6602122
## ACF1
## Training set -0.1750792
accuracy(fc_hard_hl)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1357882 27.19358 23.15557 -2.114792 12.1626 0.6908555
## ACF1
## Training set -0.03245186
The RMSEs for Holt's method are lower than those for SES. Holt's method extends SES to allow forecasting of data with a trend, whereas SES is suited to data without trend or seasonality; since both of these series are trending, Holt's method is the better fit.
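The linearity of the forecasts can also be checked directly: the first differences of Holt's point forecasts are constant and equal to the estimated slope (about 1.25 here, matching the increments in the forecast table shown later).
# Holt's point forecasts lie on a straight line: constant first differences
diff(fc_paper_hl$mean)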
# Let's compare RMSE on the training set between Holt's linear method and SES
# Holt's
print(accuracy(fc_paper_hl))
## ME RMSE MAE MPE MAPE MASE
## Training set -3.717178 31.13692 26.18083 -5.508526 15.58354 0.6602122
## ACF1
## Training set -0.1750792
print(accuracy(fc_hard_hl))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1357882 27.19358 23.15557 -2.114792 12.1626 0.6908555
## ACF1
## Training set -0.03245186
# SES
print(accuracy(fc_paper))
## ME RMSE MAE MPE MAPE MASE
## Training set 7.175981 33.63769 27.8431 0.4736071 15.57784 0.7021303
## ACF1
## Training set -0.2117522
print(accuracy(fc_hard))
## ME RMSE MAE MPE MAPE MASE
## Training set 9.166735 31.93101 26.77319 2.636189 13.39487 0.7987887
## ACF1
## Training set -0.1417763
One might expect Holt's method to beat SES here since it models the trend. I would choose Holt's method for these datasets, as both series are trending rather than stationary.
# d. Calculate a 95% prediction interval for the first forecast for each series, using the RMSE values and assuming normal errors. Compare your intervals with those produced using ses and holt.
As expected, the intervals all differ slightly: the by-hand intervals use 1.96 times the training RMSE, while ses() and holt() estimate the forecast variance from the fitted model. Since the SES forecast is flat and has a higher RMSE, its prediction interval is wider than Holt's. (A small helper consolidating the by-hand arithmetic appears after the computations below.)
# Holt's
print(forecast(fc_paper_hl))[1,]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 209.4668 166.6035 252.3301 143.9130 275.0205
## 32 210.7177 167.8544 253.5811 145.1640 276.2715
## 33 211.9687 169.1054 254.8320 146.4149 277.5225
## 34 213.2197 170.3564 256.0830 147.6659 278.7735
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 209.4668 166.6035 252.3301 143.913 275.0205
holt(books[, 'Paperback'], h=1)$mean[1] - (1.96 * accuracy(fc_paper_hl)[2])
## [1] 148.4384
holt(books[, 'Paperback'], h=1)$mean[1] + (1.96 * accuracy(fc_paper_hl)[2])
## [1] 270.4951
print(forecast(fc_hard_hl))[1,]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 250.1739 212.7390 287.6087 192.9222 307.4256
## 32 253.4765 216.0416 290.9113 196.2248 310.7282
## 33 256.7791 219.3442 294.2140 199.5274 314.0308
## 34 260.0817 222.6468 297.5166 202.8300 317.3334
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 250.1739 212.739 287.6087 192.9222 307.4256
holt(books[, 'Hardcover'], h=1)$mean[1] - (1.96 * accuracy(fc_hard_hl)[2])
## [1] 196.8745
holt(books[, 'Hardcover'], h=1)$mean[1] + (1.96 * accuracy(fc_hard_hl)[2])
## [1] 303.4733
# SES
print(forecast(fc_paper))[1,]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 207.1097 162.4882 251.7311 138.8670 275.3523
## 32 207.1097 161.8589 252.3604 137.9046 276.3147
## 33 207.1097 161.2382 252.9811 136.9554 277.2639
## 34 207.1097 160.6259 253.5935 136.0188 278.2005
## 35 207.1097 160.0215 254.1979 135.0945 279.1249
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 207.1097 162.4882 251.7311 138.867 275.3523
ses(books[, 'Paperback'], h=1)$mean[1] - (1.96 * accuracy(fc_paper)[2])
## [1] 141.1798
ses(books[, 'Paperback'], h=1)$mean[1] + (1.96 * accuracy(fc_paper)[2])
## [1] 273.0396
print(forecast(fc_hard))[1,]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 239.5601 197.2026 281.9176 174.7799 304.3403
## 32 239.5601 194.9788 284.1414 171.3788 307.7414
## 33 239.5601 192.8607 286.2595 168.1396 310.9806
## 34 239.5601 190.8347 288.2855 165.0410 314.0792
## 35 239.5601 188.8895 290.2306 162.0662 317.0540
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 31 239.5601 197.2026 281.9176 174.7799 304.3403
ses(books[, 'Hardcover'], h=1)$mean[1] - (1.96 * accuracy(fc_hard)[2])
## [1] 176.9753
ses(books[, 'Hardcover'], h=1)$mean[1] + (1.96 * accuracy(fc_hard)[2])
## [1] 302.1449
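To avoid repeating this arithmetic (and the copy-paste mistakes it invites), the computation can be wrapped in a small helper; pi_95 below is a hypothetical name, not a forecast-package function.
# hypothetical helper: 95% interval from the first point forecast and training RMSE,
# assuming normally distributed errors
pi_95 <- function(fc) {
rmse <- accuracy(fc)[1, "RMSE"]
fc$mean[1] + c(-1.96, 1.96) * rmse
}
pi_95(fc_paper_hl)  # compare with the Lo 95 / Hi 95 values above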
ukcars contains quarterly UK passenger vehicle production from 1977Q1 to 2005Q1. The ACF, the STL decomposition and the lag plot all show strong seasonality every year (every 4 lags). There is an increasing trend in the data after 1980 (before then it was decreasing). From the seasonal subseries plot, we know that Q1 generally has the highest car production whereas Q3 has the lowest. The data is neither white noise nor stationary - the ACF has significant spikes at every lag. The summary statistics show that the mean is almost the same as the median, indicating a fairly symmetric distribution.
# general plot
autoplot(ukcars) +
ylab("Car production") + xlab("Quarters")
# STL decomposition
ukcars %>%
stl(s.window = 'periodic', robust = TRUE) %>%
autoplot()
# Lag plot
gglagplot(ukcars)
# Autocorrelation
ggAcf(ukcars)
# seasonal plot
ggseasonplot(ukcars, year.labels=TRUE, year.labels.left=TRUE) +
ylab("Car production") +
ggtitle("Seasonal plot: UK car product")
ggseasonplot(ukcars, polar=TRUE) +
ylab("Car production") +
ggtitle("Seasonal plot: UK car product")
# seasonal subseries plot
ggsubseriesplot(ukcars) +
ylab("Car production") +
ggtitle("Seasonal subseries plot: UK car product")
# summary statistics
summary(ukcars)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 171.2 267.0 335.1 333.5 391.8 494.3
Note that the seasonally adjusted data is the original data minus the seasonal component (see the check after the code below).
# fit normal data
fit <- stl(ukcars, s.window="periodic",
robust=TRUE)
# check: Q2 1977 raw value 371.051 minus seasonal component 21.8574881 gives 349.1935
# extract the seasonally adjusted data
fit_sea <- seasadj(fit)
head(fit_sea)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1977 306.9244 349.1935 315.8259 344.0282
## 1978 335.0444 340.9645
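A quick sanity check that seasadj() simply subtracts the STL seasonal component (a sketch; it should return TRUE):
# seasonally adjusted series = original data minus the seasonal component
all.equal(seasadj(fit), ukcars - fit$time.series[, "seasonal"])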
Since the data has been seasonally adjusted, we use a model with error = A, trend = A (damped) and seasonal = N. A damped trend usually forecasts better in the long run, because it keeps the trend from being extrapolated indefinitely.
fcast <- stlf(fit_sea, etsmodel = "AAN", damped=TRUE, h = 2)
fcast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 Q2 393.1714 362.6002 423.7426 346.4167 439.9261
## 2005 Q3 409.4102 372.0259 446.7945 352.2358 466.5846
# general plot
autoplot(fcast) +
ylab("Car production") + xlab("Quarters")
# d. Forecast the next two years of the series using Holt’s linear method applied to the seasonally adjusted data (as before but with damped=FALSE).
Since the data has been seasonally adjusted, we again use a model with error = A, trend = A and seasonal = N, this time without damping. Note that the point forecasts are slightly higher than with the damped trend.
fcast_nodamp <- stlf(fit_sea, etsmodel = "AAN", damped=FALSE, h = 2)
fcast_nodamp
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 Q2 394.3341 363.9375 424.7306 347.8465 440.8216
## 2005 Q3 411.4903 373.9466 449.0339 354.0722 468.9083
# general plot
autoplot(fcast_nodamp) +
ylab("Car production") + xlab("Quarters")
# e. Now use ets() to choose a seasonal model for the data.
Note that ets() chooses the model automatically by minimizing the AICc. The result shows that ETS(A,N,A) is preferred (error = A, trend = N, seasonal = A). A comparison sketch follows the model summary below.
# ETS models - ANA
ets_model <- ets(ukcars)
summary(ets_model)
## ETS(A,N,A)
##
## Call:
## ets(y = ukcars)
##
## Smoothing parameters:
## alpha = 0.6199
## gamma = 1e-04
##
## Initial states:
## l = 314.2568
## s = -1.7579 -44.9601 21.1956 25.5223
##
## sigma: 25.9302
##
## AIC AICc BIC
## 1277.752 1278.819 1296.844
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
## ACF1
## Training set 0.02573334
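To see the selection at work, one can force an alternative specification and compare AICc values; a minimal sketch (output not shown):
# compare the AICc of the automatic choice against a forced AAA specification
ets(ukcars)$aicc
ets(ukcars, model = "AAA")$aicc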
Comparing RMSE on the training data, the STL model with AAN and no damping gives the best result (the lowest RMSE).
# STL model - AAN with damped trend (training-set RMSE) - 23.32
summary(fcast)
##
## Forecast method: STL + ETS(A,Ad,N)
##
## Model Information:
## ETS(A,Ad,N)
##
## Call:
## ets(y = x, model = etsmodel, damped = TRUE, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.7037
## beta = 1e-04
## phi = 0.9243
##
## Initial states:
## l = 341.6239
## b = -5.0799
##
## sigma: 23.8549
##
## AIC AICc BIC
## 1257.950 1258.743 1274.314
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.551267 23.32113 18.48987 0.07585736 5.955086 0.602576
## ACF1
## Training set 0.02262668
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 Q2 393.1714 362.6002 423.7426 346.4167 439.9261
## 2005 Q3 409.4102 372.0259 446.7945 352.2358 466.5846
# STL model - AAN without damping (training-set RMSE) - 23.30
summary(fcast_nodamp)
##
## Forecast method: STL + ETS(A,A,N)
##
## Model Information:
## ETS(A,A,N)
##
## Call:
## ets(y = x, model = etsmodel, damped = FALSE, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.7248
## beta = 1e-04
##
## Initial states:
## l = 328.1086
## b = 0.9202
##
## sigma: 23.7186
##
## AIC AICc BIC
## 1255.697 1256.258 1269.334
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3412727 23.295 18.1605 -0.5520141 5.884617 0.5918418
## ACF1
## Training set 0.02103582
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 Q2 394.3341 363.9375 424.7306 347.8465 440.8216
## 2005 Q3 411.4903 373.9466 449.0339 354.0722 468.9083
# ETS model - ANA (training-set RMSE) - 25.23
summary(ets_model)
## ETS(A,N,A)
##
## Call:
## ets(y = ukcars)
##
## Smoothing parameters:
## alpha = 0.6199
## gamma = 1e-04
##
## Initial states:
## l = 314.2568
## s = -1.7579 -44.9601 21.1956 25.5223
##
## sigma: 25.9302
##
## AIC AICc BIC
## 1277.752 1278.819 1296.844
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.313884 25.23244 20.17907 -0.1570979 6.629003 0.6576259
## ACF1
## Training set 0.02573334
For RMSE on the cross-validated test sets, however, the STL model with a damped trend is the most reasonable choice - it has the lowest RMSE (the three values are collected side by side after the computations below).
# CV errors for STL with damped trend
e <- tsCV(fit_sea, stlf, etsmodel = "AAN", damped=TRUE, h=2)
# RMSE on test sets - STL with damped trend
sqrt(mean(e^2, na.rm=TRUE))
## [1] 29.76946
# CV errors for STL without damping
e2 <- tsCV(fit_sea, stlf, etsmodel = "AAN", damped=FALSE, h=2)
# RMSE on test sets - STL without damping
sqrt(mean(e2^2, na.rm=TRUE))
## [1] 30.5195
# CV errors for ets (ANA)
f <- function(y, h) {
forecast(ets(y), h = h)
}
e3 <- tsCV(ukcars, f, h=2)
# RMSE on test sets - ets with ANA
sqrt(mean(e3^2, na.rm=TRUE))
## [1] 34.4618
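Collecting the three cross-validated RMSEs in one place makes the comparison explicit (a quick sketch; the values are the ones printed above, roughly 29.77, 30.52 and 34.46):
# side-by-side comparison of the cross-validated test-set RMSEs
c(STL_damped = sqrt(mean(e^2, na.rm=TRUE)),
STL_linear = sqrt(mean(e2^2, na.rm=TRUE)),
ETS_ANA = sqrt(mean(e3^2, na.rm=TRUE)))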
# RMSE on the whole dataset - for STL with damped trend
#sqrt(mean(residuals(stlf(fit_sea, etsmodel = "AAN", damped=TRUE, h = 2))^2, na.rm=TRUE))
Since the p-value < 0.05 and several spikes in the residual ACF are significant, we reject the null hypothesis that the residuals are uncorrelated: the residuals are autocorrelated. The histogram suggests the residuals are fairly normally distributed, with a mean close to zero.
Therefore, the residuals of my preferred model do not resemble white noise - there is information left in them, so the prediction intervals may not be reliable. (The same test can also be run by hand; see the sketch at the end.)
checkresiduals(fcast)
## Warning in checkresiduals(fcast): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,Ad,N)
## Q* = 24.138, df = 3, p-value = 2.337e-05
##
## Model df: 5. Total lags used: 8
#checkresiduals(stlf(fit_sea, etsmodel = "AAN", damped=TRUE))
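The Ljung-Box test can be reproduced directly with Box.test(), passing the model degrees of freedom through fitdf; a sketch that should match the Q* and p-value above:
# Ljung-Box test by hand: 8 lags and 5 model df, so the test statistic has 8 - 5 = 3 df
Box.test(residuals(fcast), lag = 8, type = "Ljung-Box", fitdf = 5)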