The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
souvenirs %>% autoplot(color = "#69b3a2") + ggtitle("Sales for a souvenir shop, Queensland, Australia") + labs(x = "")
## Plot variable not specified, automatically selected `.vars = Sales`
The data show a strong seasonal pattern, with a large spike at Christmas and a smaller spike around March each year, when the surfing festival is held. The spikes grow larger every year except 1991. Overall, there is an upward trend.
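To make the December and March spikes easier to compare across years, a seasonal plot helps; a minimal sketch using feasts' gg_season(), which fpp3 loads:
# Each line is one year; the December spike and the smaller March bump stand out
souvenirs %>% gg_season(Sales) + labs(title = "Seasonal plot: souvenir sales", y = "Sales")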
Explain why it is necessary to take logarithms of these data before fitting a model.
p1 <- ggplot(souvenirs, aes(x = Sales)) + geom_histogram(fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
labs(title = "Histogram of sales", x = "Sales")
p2 <- ggplot(souvenirs, aes(x = log(Sales))) + geom_histogram(fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
labs(title = "Histogram of log(sales)", x = "log(Sales)")
p1 + p2 + plot_layout(nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram of sales is heavily right-skewed; taking logarithms brings the data much closer to a normal distribution. More importantly, the seasonal fluctuations grow with the level of the series, so taking logarithms makes the seasonality roughly additive and stabilises the variance, which improves both the regression fit and the forecasts derived from it.
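We can see the stabilised variance directly by plotting the series on the log scale; a quick sketch, assuming fabletools' autoplot() accepts a transformed variable:
# On the log scale the seasonal swings are roughly constant in size
souvenirs %>% autoplot(log(Sales)) + ggtitle("Log sales") + labs(x = "")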
Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
souvenirs2 <- ts(souvenirs$Sales, start = c(1987, 1), end = c(1993, 12), frequency = 12)
log.souvenirs <- log(souvenirs2)
# Surfing festival dummy: 1 in every March from 1988 onwards
dummy.fest <- rep(0, length(souvenirs2))
dummy.fest[seq_along(dummy.fest) %% 12 == 3] <- 1
dummy.fest[3] <- 0  # no festival in March 1987; it was first held in 1988
dummy.fest <- ts(dummy.fest, frequency = 12, start = c(1987, 1))
new.data <- data.frame(log.souvenirs, dummy.fest)
fit <- tslm(log.souvenirs ~ trend + season + dummy.fest, data = new.data)
summary(fit)
##
## Call:
## tslm(formula = log.souvenirs ~ trend + season + dummy.fest, data = new.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend 0.0220198 0.0008268 26.634 < 2e-16 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 5.88e-05 ***
## season6 0.4488283 0.0957647 4.687 1.33e-05 ***
## season7 0.6104545 0.0958039 6.372 1.71e-08 ***
## season8 0.5879644 0.0958503 6.134 4.53e-08 ***
## season9 0.6693299 0.0959037 6.979 1.36e-09 ***
## season10 0.7473919 0.0959643 7.788 4.48e-11 ***
## season11 1.2067479 0.0960319 12.566 < 2e-16 ***
## season12 1.9622412 0.0961066 20.417 < 2e-16 ***
## dummy.fest 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.2e-16
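As a quick visual check of the fit, the fitted values can be overlaid on the log series; a sketch, assuming the forecast package's autolayer() for ts objects:
# Fitted values track the data closely, as the R-squared suggests
autoplot(log.souvenirs, series = "Data") +
  autolayer(fitted(fit), series = "Fitted") +
  ggtitle("Log sales: actual vs fitted")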
Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
p1 <- autoplot(fit$residuals) + ggtitle("Residuals over time")
ab <- data.frame(residuals = residuals(fit), fitted = fitted(fit))
p2 <- ggplot(ab, aes(x = fitted, y = residuals)) + geom_point() + labs(x = "Fitted values", y = "Residuals")
p1 + p2
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
The residual plots look reasonable at first glance: the residuals appear fairly random over time, and there is no obvious pattern against the fitted values.
Do boxplots of the residuals for each month. Does this reveal any problems with the model?
boxplot(residuals(fit) ~ cycle(residuals(fit)), xlab = "Month", ylab = "Residuals")
The boxplots show that the residual variance is not constant across months; it is noticeably larger in August, September, and October. This suggests the model does not capture all of the seasonal detail in the data.
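An equivalent month-by-month view can be produced with the forecast package's ggsubseriesplot(); a sketch:
# One panel per month, with the monthly mean overlaid
ggsubseriesplot(residuals(fit)) + labs(title = "Residuals by month", y = "Residuals")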
What do the values of the coefficients tell you about each variable?
fit$coefficients
## (Intercept) trend season2 season3 season4 season5
## 7.61966701 0.02201983 0.25141682 0.26608280 0.38405351 0.40948697
## season6 season7 season8 season9 season10 season11
## 0.44882828 0.61045453 0.58796443 0.66932985 0.74739195 1.20674790
## season12 dummy.fest
## 1.96224123 0.50151509
The coefficients give the direction and magnitude of each variable's effect on log sales. The intercept is the January baseline, and each seasonal coefficient is the average difference in log sales between that month and January; season12 is by far the largest, so December has the biggest seasonal uplift in sales. The trend coefficient of 0.022 corresponds to roughly 2.2% sales growth per month, and the dummy.fest coefficient of 0.50 is the extra boost to log sales in March when the surfing festival is held.
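Because the model is on the log scale, exponentiating a coefficient gives its multiplicative effect on sales; for example:
exp(fit$coefficients["trend"])       # ~1.022: sales grow about 2.2% per month
exp(fit$coefficients["season12"])    # ~7.1: December sales are about 7x the January baseline
exp(fit$coefficients["dummy.fest"])  # ~1.65: the festival lifts March sales by about 65%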
What does the Ljung-Box test tell you about your model?
Box.test(fit$residuals, lag = 10, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: fit$residuals
## X-squared = 69.779, df = 10, p-value = 4.891e-11
The Ljung-Box test gives a p-value far below 0.05, so we reject the null hypothesis of independent residuals: there is autocorrelation left in the model's residuals, which means the model can still be improved.
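The residual ACF shows where this autocorrelation sits; a sketch using the forecast package's ggAcf():
# Significant spikes at the early lags confirm the Ljung-Box result
ggAcf(fit$residuals) + ggtitle("ACF of regression residuals")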
Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
# Note: dummy.fest is set to 0 for all 36 months; since the festival is held
# every March, setting it to 1 in each March would lift those forecasts by the
# dummy coefficient (~0.50 on the log scale).
forecast_data <- data.frame(dummy.fest = rep(0, 36))
preds <- forecast(fit, newdata = forecast_data)
ts_pred <- ts(preds$mean, frequency = 12, start = c(1994, 1))
ts_pred %>% autoplot() + ggtitle("Forecasting 1994, 1995, and 1996")
# Prediction intervals (log scale)
preds
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.491352 9.238522 9.744183 9.101594 9.88111
## Feb 1994 9.764789 9.511959 10.017620 9.375031 10.15455
## Mar 1994 9.801475 9.461879 10.141071 9.277961 10.32499
## Apr 1994 9.941465 9.688635 10.194296 9.551707 10.33122
## May 1994 9.988919 9.736088 10.241749 9.599161 10.37868
## Jun 1994 10.050280 9.797449 10.303110 9.660522 10.44004
## Jul 1994 10.233926 9.981095 10.486756 9.844168 10.62368
## Aug 1994 10.233456 9.980625 10.486286 9.843698 10.62321
## Sep 1994 10.336841 10.084010 10.589671 9.947083 10.72660
## Oct 1994 10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994 10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994 11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995 9.755590 9.499844 10.011336 9.361338 10.14984
## Feb 1995 10.029027 9.773281 10.284773 9.634775 10.42328
## Mar 1995 10.065713 9.722498 10.408928 9.536620 10.59481
## Apr 1995 10.205703 9.949957 10.461449 9.811451 10.59996
## May 1995 10.253157 9.997411 10.508903 9.858904 10.64741
## Jun 1995 10.314518 10.058772 10.570264 9.920265 10.70877
## Jul 1995 10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995 10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995 10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995 10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995 11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995 11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996 10.019828 9.760564 10.279093 9.620151 10.41951
## Feb 1996 10.293265 10.034000 10.552530 9.893588 10.69294
## Mar 1996 10.329951 9.982679 10.677222 9.794605 10.86530
## Apr 1996 10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996 10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996 10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996 10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996 10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996 10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996 10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996 11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996 12.224288 11.965023 12.483552 11.824611 12.62396
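These forecasts and intervals are on the log scale; a minimal sketch of back-transforming them to dollar sales (ignoring the bias adjustment for the mean):
sales_point <- exp(preds$mean)            # point forecasts on the original scale
sales_lo95  <- exp(preds$lower[, "95%"])  # lower 95% bound
sales_hi95  <- exp(preds$upper[, "95%"])  # upper 95% bound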
How could you improve these predictions by modifying the model?
# Box-Cox transformation parameter chosen by the Guerrero method
souvenirs %>% features(Sales, features = guerrero) %>%
pull(lambda_guerrero)
## [1] 0.002118221
As the Ljung-Box test showed, there is autocorrelation left in the residuals, so the model can be improved; one natural step is to model that autocorrelation directly, as sketched below. In addition, the Guerrero method suggests a Box-Cox transformation with lambda of about 0.002, which is essentially 0 and so confirms that the log transformation we used is appropriate.
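A hedged sketch of such an improvement, using fpp3's regression with ARIMA errors so the residual autocorrelation is modelled rather than ignored; festival is a hypothetical dummy mirroring the one built earlier, assuming fpp3 (and its lubridate) is loaded:
# Same regressors as before, but the errors follow an automatically chosen ARIMA process
souvenirs %>%
  mutate(festival = month(Month) == 3 & year(Month) > 1987) %>%
  model(ARIMA(log(Sales) ~ trend() + season() + festival)) %>%
  report()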
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
Use ARIMA() to find an appropriate ARIMA model. What model was selected? Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
aus_airpassengers %>% autoplot() + ggtitle("Air Transport Passengers Australia") + labs(x = "")
## Plot variable not specified, automatically selected `.vars = Passengers`
fit <- aus_airpassengers %>%
model(ARIMA(Passengers))
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
The time series has an increasing trend and no seasonal pattern. The ARIMA() function automatically selected an ARIMA(0,2,1) model. Now let's check the residuals from the model.
fit %>% gg_tsresiduals()
The residuals appear to be white noise: apart from two outliers, around 1989 and 2009, they look random. The ACF plot shows no significant autocorrelation in the residuals, and the histogram is close to a normal distribution.
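A portmanteau test backs up this visual check; a sketch in the fpp3 style, with dof set to 1 for the single estimated MA coefficient:
# Large p-value => no evidence against white-noise residuals
augment(fit) %>% features(.innov, ljung_box, lag = 10, dof = 1)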
# Detach the forecast package so that fable's forecast() method is used below
detach(package:forecast)
fit %>% forecast(h=10) %>%
autoplot(aus_airpassengers) + ggtitle("Forecasting Air Transport Passengers Australia") + labs(x = "")
The forecasts continue the strong upward trend, passing 90 million passengers annually within the next 10 years.
Write the model in terms of the backshift operator.
The selected model is ARIMA(0,2,1). Because it is fitted with \(d = 2\) and no constant, in backshift notation it is
\[ (1 - B)^2 y_t = (1 + \theta_1 B)\epsilon_t \]
where
\(y_t\) is the value of the time series at time \(t\),
\(B\) is the backshift operator,
\(\theta_1 = -0.8963\) is the MA(1) coefficient, and
\(\epsilon_t\) is the white-noise error term at time \(t\).
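Expanding the backshift polynomial gives the model as a forecast equation:
\[ y_t = 2y_{t-1} - y_{t-2} + \epsilon_t + \theta_1 \epsilon_{t-1} \]
so each forecast extrapolates the most recent change in the series, adjusted by the lagged error.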