can_ts <- ts(can_df$HoursPerWeek, start=c(1966), frequency=1)
can_lm <- tslm(can_ts ~ trend)
summary(can_lm)##
## Call:
## tslm(formula = can_ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
After review, it is not possible to model seasonality using annual data.
can_qm <- tslm(can_ts ~ trend + I(trend^2))
summary(can_qm)##
## Call:
## tslm(formula = can_ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
Again it’s found after review, it is not possible to model seasonality using annual data.
An exponential trend model should begin with an exploration of the logarithmic output (or y) variable
##
## Call:
## tslm(formula = dept_train ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
Although visually it would appear that quarter two generally results in higher sales figures, the second quarter (season2) output is insignificant (.248) when valued statistically (p < .05).
dept_f <- forecast(dept_exp, h=2)
kable(dept_f)| 6 Q1 | 58793.71 | 55790.19 | 61958.92 | 54090.84 | 63905.46 |
| 6 Q2 | 60951.51 | 57837.76 | 64232.89 | 56076.04 | 66250.87 |
plot(dept_train, type = "o")
lines(dept_exp$fitted, col = "blue")plot(dept_train - dept_exp$fitted.values, type ="o")The U-Shape of the residuals shows that the extremes of the series (beginning/end) may be under-valued. As such, the estimates for quarters 21 and 22 may be slightly undervalued when compared to their actual values. We would prefer for these residuals to appear as white noise without a discernable pattern visible within the data.
The trend in the data is not captured well by the model
As described above, the U-Shape evident within the residual plot reveals that that the trend may not be captured adequately within the model. Seasonality would have been revealed through more oscillating trends in the residuals.
Based on the U-Shape mentioned previously, a quadratic trend model to the residuals (with Quarter and Quarter^2) would be an adequate and parsimoniou solution for improving the model’s fit.
The time-series appears to contain both trend and seasonality. The upward trend is initially masked visibly by the strong seasonality. From what we can see in the plots, there seems to be monthly seasonality (11 dummy variables) and a linear trend (1 variable) for a total of 12 variables within the series.
sales_df <- read.csv("SouvenirSales.csv",stringsAsFactors = FALSE)
sales_ts <- ts(sales_df$Sales,start = c(1995, 1), frequency = 12)
test_len <- 12
train_len <- length(sales_ts) - test_len
sales_train <- window(sales_ts, end = c(1995, train_len))
sales_test <- window(sales_ts, start = c(1995, train_len + 1))
a <- tslm(sales_train ~ trend + season)
summary(a)##
## Call:
## tslm(formula = sales_train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
plot(a, main = "Model A")The highest coefficients (highest positive variation from month (season) 1) are in months 11 and 12. At first this would seem related to Christmas sales, but on further reflection it is most likely due to Australia’s location in the Southern hemisphere. November and December would mark the beginning of the summer season, which would most likely result in a large uptick in sales.
The trend coefficient of 245.36 is the slope (m) within a linear model (y = mx + b). This means that although their may be seasonal variation, the model overall maintains a regressive slope value of 245.36 over ever x (month) variable.
b <- tslm(log(sales_train) ~ trend + season)
summary(b)##
## Call:
## tslm(formula = log(sales_train) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
plot(b, main = "Model B")Fitting the logarithmic value of sales with a linear trend is the same as fitting an exponential trend to a model of Sales.
Much like with the trend coefficient in the previous problem, this coefficient represents the slope of the model. Rather than being in terms of dollars however, this coefficient (as a log value) represents a % growth value over every (x) period, in this case the growth is slightly larger than 2% per period. This of course will be varied by seasonal variations.
b_f <- forecast(b, h = 14)
b_f## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9.188097 8.917220 9.458974 8.769890 9.606304
## Feb 2001 9.491232 9.220354 9.762109 9.073024 9.909439
## Mar 2001 9.925335 9.654457 10.196212 9.507127 10.343542
## Apr 2001 9.625329 9.354452 9.896207 9.207122 10.043537
## May 2001 9.694286 9.423408 9.965163 9.276078 10.112493
## Jun 2001 9.740741 9.469864 10.011619 9.322534 10.158949
## Jul 2001 9.898195 9.627318 10.169072 9.479988 10.316402
## Aug 2001 9.882831 9.611954 10.153708 9.464624 10.301038
## Sep 2001 9.992619 9.721742 10.263496 9.574412 10.410826
## Oct 2001 10.107664 9.836787 10.378542 9.689457 10.525872
## Nov 2001 10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001 11.372615 11.101738 11.643493 10.954408 11.790823
## Jan 2002 9.441533 9.166476 9.716590 9.016873 9.866193
## Feb 2002 9.744667 9.469610 10.019724 9.320007 10.169327
feb <- 9.744667
exp(9.744667)## [1] 17062.99
af <- forecast(a, h = test_len)
bf <- forecast(b, h = test_len)
accuracy(af$mean, sales_test)## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
accuracy(exp(bf$mean), sales_test)## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
It would appear that both the RMSE and MAPE are lower for the second model (B), these are both valid measures of a model’s accuracy and with an MAPE almost 50% of the value of a, b would seem to be a better fit overall. We can also plot the models against the validation set to visually decide on the best fit.
plot(sales_test)
lines(af$mean, col="red")
lines(exp(bf$mean), col="blue")The blue line seems to be the best fit to the actual validation set (black), and lo and behold, its our model B, the RMSE and MAPE proved to be good predictors all along.
If we’re looking into descriptive rather than prescriptive analysis, we can concentrate our efforts on the heavy hitting months each year. Instead of generating foreccast patterns we can consolidate all data to really define the trends and seasonality differences that have become clear over time. Also, we may want to consider the presence of a multiplicative trend/seasonality emerging in the last few periods of the model based on the previous estimates.