#Loading necessary libraries:
library(forecast)
library(tidytext)
library(stringr)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(knitr)
library(pander)
For this assignment I will be completing the assigned problems from Chapter 3 in the Practical Time Series Forecasting with R textbook by, Galit Schmueli and Kenneth C. Lichtendahl Jr.
First I will import the csv file that contains the Souvenir Sales data needed for the following questions. The Souvenir Sales data file contains monthly sales for a souvenir shop at the beach resort town in Queensland, Australia, between 1995 and 2001.
#Importing the Souvenir Sales Data
Souv_Sales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
str(Souv_Sales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
head(Souv_Sales)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
tail(Souv_Sales)
## Date Sales
## 79 1-Jul 26155.15
## 80 1-Aug 28586.52
## 81 1-Sep 30505.41
## 82 1-Oct 30821.33
## 83 1-Nov 46634.38
## 84 1-Dec 104660.67
By plotting the time series we are able to get a general idea what the data looks like before moving forward in the analysis.
SSales <- ts(Souv_Sales$Sales/1000, start=c(1995,1), frequency=12)
plot(SSales, ylim = c(0,120), ylab = "Sales (in thousands)", xlab = "Time", main = "Time Series of Souvenir Sales", bty = "l")
pander(SSales)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | |
|---|---|---|---|---|---|---|---|---|
| 1995 | 1.665 | 2.398 | 2.841 | 3.547 | 3.753 | 3.715 | 4.35 | 3.566 |
| 1996 | 2.5 | 5.198 | 7.225 | 4.806 | 5.901 | 4.951 | 6.179 | 4.752 |
| 1997 | 4.717 | 5.703 | 9.958 | 5.305 | 6.492 | 6.631 | 7.35 | 8.177 |
| 1998 | 5.921 | 5.815 | 12.42 | 6.37 | 7.609 | 7.225 | 8.121 | 7.979 |
| 1999 | 4.827 | 6.47 | 9.639 | 8.821 | 8.722 | 10.21 | 11.28 | 12.55 |
| 2000 | 7.615 | 9.85 | 14.56 | 11.59 | 9.333 | 13.08 | 16.73 | 19.89 |
| 2001 | 10.24 | 11.27 | 21.83 | 17.36 | 16 | 18.6 | 26.16 | 28.59 |
| Sep | Oct | Nov | Dec | |
|---|---|---|---|---|
| 1995 | 5.022 | 6.423 | 7.601 | 19.76 |
| 1996 | 5.496 | 5.835 | 12.6 | 28.54 |
| 1997 | 8.573 | 9.691 | 15.15 | 34.06 |
| 1998 | 8.093 | 8.477 | 17.91 | 30.11 |
| 1999 | 11.64 | 13.61 | 21.82 | 45.06 |
| 2000 | 23.93 | 25.39 | 36.02 | 80.72 |
| 2001 | 30.51 | 30.82 | 46.63 | 104.7 |
Generally speaking, Figure 1 shows some sort of upward trend over time and possible seasonality. In order to check if seasonality does exist in the data set we will break out the sales data by month.
par(oma = c(0,0,0,2))
xrange <- c(1995, 2001)
yrange <- range(SSales)
plot(xrange, yrange, type="n", xlab = "Year", ylab="Souvenir Sales (in thousands)", axes=F)
colors <- rainbow(12)
linetype <- c(1:12)
plotchar <-c(1:12)
axis(1, at=seq(1995,2001,1), labels=format(seq(1995,2001,1)))
for ( i in 1:12) {
currentMonth <- subset(SSales, cycle(SSales)==i)
lines(seq(1995, 1995+length(currentMonth)-1,1), currentMonth, type="b", lwd=1,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
title("Souvenir Sales Broken Out by Month")
legend(2001.35, 120, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)
ggseasonplot(SSales)
There is definitely seasonality looking at the plots in both Figure 2 and Figure 3, it is clear that December has the highest sales for the souvenir shop every year.
Using the \(decompose()\) function, the estimates for the systematic noise and components of a time series can be found.
componentsOfSSales <- decompose(SSales)
componentsOfSSales
## $x
## Jan Feb Mar Apr May Jun Jul
## 1995 1.66481 2.39753 2.84071 3.54729 3.75296 3.71474 4.34961
## 1996 2.49981 5.19824 7.22514 4.80603 5.90088 4.95134 6.17912
## 1997 4.71702 5.70263 9.95758 5.30478 6.49243 6.63080 7.34962
## 1998 5.92110 5.81458 12.42125 6.36977 7.60912 7.22475 8.12122
## 1999 4.82664 6.47023 9.63877 8.82117 8.72237 10.20948 11.27655
## 2000 7.61503 9.84969 14.55840 11.58733 9.33256 13.08209 16.73278
## 2001 10.24324 11.26688 21.82684 17.35733 15.99779 18.60153 26.15515
## Aug Sep Oct Nov Dec
## 1995 3.56634 5.02182 6.42348 7.60060 19.75621
## 1996 4.75215 5.49643 5.83510 12.60008 28.54172
## 1997 8.17662 8.57317 9.69050 15.15184 34.06101
## 1998 7.97925 8.09306 8.47670 17.91466 30.11441
## 1999 12.55222 11.63739 13.60689 21.82211 45.06069
## 2000 19.88861 23.93338 25.39135 36.02480 80.72171
## 2001 28.58652 30.50541 30.82133 46.63438 104.66067
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1995 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## 1996 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## 1997 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## 1998 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## 1999 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## 2000 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## 2001 -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## Jul Aug Sep Oct Nov Dec
## 1995 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
## 1996 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
## 1997 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
## 1998 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
## 1999 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
## 2000 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
## 2001 -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
##
## $trend
## Jan Feb Mar Apr May Jun Jul
## 1995 NA NA NA NA NA NA 5.421133
## 1996 6.517855 6.643493 6.712677 6.707937 6.891732 7.466107 7.924554
## 1997 8.566258 8.757715 9.028598 9.317437 9.584402 9.920696 10.200837
## 1998 10.729094 10.753020 10.724792 10.654212 10.718755 10.669431 10.459387
## 1999 10.913802 11.235815 11.574035 11.935474 12.312042 13.097614 13.836559
## 2000 15.392422 15.925448 16.743464 17.746816 18.829614 20.907268 22.502653
## 2001 25.224785 25.979797 26.616045 27.116128 27.784443 29.223966 NA
## Aug Sep Oct Nov Dec
## 1995 5.572621 5.872002 6.107134 6.249078 6.390100
## 1996 8.037954 8.172822 8.307455 8.352884 8.447509
## 1997 10.255671 10.362989 10.510016 10.600920 10.672196
## 1998 10.441103 10.352485 10.338690 10.487217 10.657966
## 1999 14.093553 14.439348 14.759589 14.900270 15.045387
## 2000 22.671211 23.033112 23.576381 24.094515 24.602210
## 2001 NA NA NA NA NA
##
## $random
## Jan Feb Mar Apr May
## 1995 NA NA NA NA NA
## 1996 2.6321169 4.1168522 1.2043337 2.6999580 4.0833862
## 1997 2.8009240 2.5070205 1.6208524 0.5892071 1.9822662
## 1998 1.8421673 0.6236647 2.3883287 0.3174221 1.9646037
## 1999 0.5629994 0.7965205 -1.2433947 1.4875609 1.4845666
## 2000 -1.1272306 -0.5136528 -1.4931930 -1.5576212 -4.4228150
## 2001 -8.3313839 -9.1508120 -4.0973338 -5.1569337 -6.7124146
## Jun Jul Aug Sep Oct
## 1995 NA 1.3811125 0.0831381 0.4593348 0.7421522
## 1996 2.3126805 0.7072021 -1.1963844 -1.3668748 -2.0465482
## 1997 1.5375514 -0.3985808 0.0103681 -0.4803019 -0.3937099
## 1998 1.3827668 0.1144692 -0.3724336 -0.9499082 -1.4361836
## 1999 1.9393134 -0.1073729 0.5480868 -1.4924411 -0.7268928
## 2000 -2.9977307 -3.3172370 -0.6931819 2.2097843 2.2407755
## 2001 -5.7949882 NA NA NA NA
## Nov Dec
## 1995 -4.9900804 -13.9773547
## 1996 -2.0944058 -7.2492538
## 1997 -1.7906816 -3.9546509
## 1998 1.0858409 -7.8870209
## 1999 0.5802375 2.6718382
## 2000 5.5886825 28.7760353
## 2001 NA NA
##
## $figure
## [1] -6.6501615 -5.5621051 -0.6918707 -4.6018646 -5.0742387 -4.8274476
## [7] -2.4526359 -2.0894193 -1.3095168 -0.4258064 6.3416020 27.3434647
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"
plot(componentsOfSSales)
Figure 4 shows us that the souvenir sales data has some type of upward trend, and that the seasonality seems to be additive because the spikes in the data occur at fairly equal distances apart.
validLen<- 12
TrainLeng <- length(SSales) - validLen
SSales_Train <- window(SSales, start =c(1995,1), end=c(1995, TrainLeng))
pander(SSales_Train)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | |
|---|---|---|---|---|---|---|---|---|
| 1995 | 1.665 | 2.398 | 2.841 | 3.547 | 3.753 | 3.715 | 4.35 | 3.566 |
| 1996 | 2.5 | 5.198 | 7.225 | 4.806 | 5.901 | 4.951 | 6.179 | 4.752 |
| 1997 | 4.717 | 5.703 | 9.958 | 5.305 | 6.492 | 6.631 | 7.35 | 8.177 |
| 1998 | 5.921 | 5.815 | 12.42 | 6.37 | 7.609 | 7.225 | 8.121 | 7.979 |
| 1999 | 4.827 | 6.47 | 9.639 | 8.821 | 8.722 | 10.21 | 11.28 | 12.55 |
| 2000 | 7.615 | 9.85 | 14.56 | 11.59 | 9.333 | 13.08 | 16.73 | 19.89 |
| Sep | Oct | Nov | Dec | |
|---|---|---|---|---|
| 1995 | 5.022 | 6.423 | 7.601 | 19.76 |
| 1996 | 5.496 | 5.835 | 12.6 | 28.54 |
| 1997 | 8.573 | 9.691 | 15.15 | 34.06 |
| 1998 | 8.093 | 8.477 | 17.91 | 30.11 |
| 1999 | 11.64 | 13.61 | 21.82 | 45.06 |
| 2000 | 23.93 | 25.39 | 36.02 | 80.72 |
SSales_Valid <- window(SSales, start=c(1995,TrainLeng+1), end=c(1995, TrainLeng+validLen))
pander(SSales_Valid)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | |
|---|---|---|---|---|---|---|---|---|---|
| 2001 | 10.24 | 11.27 | 21.83 | 17.36 | 16 | 18.6 | 26.16 | 28.59 | 30.51 |
| Oct | Nov | Dec | |
|---|---|---|---|
| 2001 | 30.82 | 46.63 | 104.7 |
Data partitioning is done as a preliminary step to make sure to address the problem of overfitting and to use one period of the data to develop our forecasting model or method, then we use the other period to test how the model/method performs. The data in this example was partitioned into two periods where the earlier time period will be used as the training period and the later period will be used as the validation period. Because we are looking at a time series forecasting example there is no third or additional test set, as there typically is when building predictive models using cross-sectional data. The third set in a time series would be the actual future forecasts after we valid how good the model is on the validation period.
The analyst chose a 12-month validation period to mimic the forecast horizon, as stated in the assignment she was given, which was to forecast for 12 months (i.e. for year 2002). The validation period selected needed to be at least a year, if it were longer than 12-months than the training period would contain less recent information and the model would miss out on the most recent information available at the time of prediction.
Since the data for full-year 2001 represents the validation period, we will use that to create the naive forecast for the validation period. The naive forecast will simply be the most recent value of the series, or in other words it will be the last data point in the training period. Therefore, the analysis below should show that the sales value for December 2000 (which was $80,721.71), repeated over the full naive forecast. When plotted the naive forecast will be a straight line. However, since our previous analysis showed seasonality we will show and plot the seasonal naive forecast for the validation period as well.
naive_Valid <- naive(SSales_Train, h=validLen)
naive_Valid
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 80.72171 67.31575 94.12767 60.219059 101.2244
## Feb 2001 80.72171 61.76282 99.68060 51.726583 109.7168
## Mar 2001 80.72171 57.50190 103.94152 45.210077 116.2333
## Apr 2001 80.72171 53.90978 107.53364 39.716409 121.7270
## May 2001 80.72171 50.74507 110.69835 34.876389 126.5670
## Jun 2001 80.72171 47.88394 113.55948 30.500677 130.9427
## Jul 2001 80.72171 45.25287 116.19055 26.476795 134.9666
## Aug 2001 80.72171 42.80392 118.63950 22.731457 138.7120
## Sep 2001 80.72171 40.50382 120.93960 19.213758 142.2297
## Oct 2001 80.72171 38.32833 123.11509 15.886636 145.5568
## Nov 2001 80.72171 36.25916 125.18426 12.722110 148.7213
## Dec 2001 80.72171 34.28209 127.16133 9.698445 151.7450
pander(naive_Valid$mean)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | |
|---|---|---|---|---|---|---|---|---|
| 2001 | 80.72 | 80.72 | 80.72 | 80.72 | 80.72 | 80.72 | 80.72 | 80.72 |
| Sep | Oct | Nov | Dec | |
|---|---|---|---|---|
| 2001 | 80.72 | 80.72 | 80.72 | 80.72 |
snaive_valid <- snaive(SSales_Train, h=validLen)
snaive_valid
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 7.61503 -0.6738117 15.90387 -5.0616594 20.29172
## Feb 2001 9.84969 1.5608483 18.13853 -2.8269994 22.52638
## Mar 2001 14.55840 6.2695583 22.84724 1.8817106 27.23509
## Apr 2001 11.58733 3.2984883 19.87617 -1.0893594 24.26402
## May 2001 9.33256 1.0437183 17.62140 -3.3441294 22.00925
## Jun 2001 13.08209 4.7932483 21.37093 0.4054006 25.75878
## Jul 2001 16.73278 8.4439383 25.02162 4.0560906 29.40947
## Aug 2001 19.88861 11.5997683 28.17745 7.2119206 32.56530
## Sep 2001 23.93338 15.6445383 32.22222 11.2566906 36.61007
## Oct 2001 25.39135 17.1025083 33.68019 12.7146606 38.06804
## Nov 2001 36.02480 27.7359583 44.31364 23.3481106 48.70149
## Dec 2001 80.72171 72.4328683 89.01055 68.0450206 93.39840
pander(snaive_valid$mean)
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | |
|---|---|---|---|---|---|---|---|---|
| 2001 | 7.615 | 9.85 | 14.56 | 11.59 | 9.333 | 13.08 | 16.73 | 19.89 |
| Sep | Oct | Nov | Dec | |
|---|---|---|---|---|
| 2001 | 23.93 | 25.39 | 36.02 | 80.72 |
plot(snaive_valid, main="Seasonal Naive Forecast for the Validation Period", xlab="Year", ylab="Souvenir Sales (in thousands)")
Using the \(accuracy()\) function from the forecast package we can look at the RSME and MAPE of the naive forecast for the validation period quite easily.
accuracy(naive_Valid, SSales_Valid)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.113477 10.46073 5.506879 -25.27554 61.16191 1.47054
## Test set -50.500287 56.09907 54.490114 -287.13834 290.95050 14.55087
## ACF1 Theil's U
## Training set -0.1968879 NA
## Test set 0.3182456 6.649124
The seasonal naive forecast for the validation period RMSE is $56,099.07 in sales and the MAPE is 290.95%. These values tell us how accurate the training period was in predicting the validation period using a naive forecast. The measures seem high, but they serve as benchmark for us to now look at the accuracy of the seasonal naive forecast and at a later date different forecast models.
accuracy(snaive_valid, SSales_Valid)
## ME RMSE MAE MPE MAPE MASE
## Training set 3.401361 6.467818 3.744801 22.39270 25.64127 1.000000
## Test set 7.828278 9.542346 7.828278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
For the seasonal naive forecast for the validation period the RMSE is $9,542.35 in sales and the MAPE is 27.28%. This means that on average our forecast is \(\approx\) 27% off from the actual data points in the validation period. These measures are lower than the naive forecast and therefore, the deviation from the actual values is lower. The seasonal naive forecast has more predicting accuracy than the naive forecast, this makes sense because from the very first figure in this assignment we saw that seasonality existed.
NF_Hist_valid <-hist(naive_Valid$residuals/1000, breaks=12, ylim = c(0,70), ylab="Frequency", xlab="Forecast Error (in thousands)", main="", bty="l")
multiplier <- NF_Hist_valid$counts / NF_Hist_valid$density
NF_Density <- density(naive_Valid$residuals/1000, na.rm=TRUE)
NF_Density$y <- NF_Density$y * multiplier[1]
lines(NF_Density, col="red", lwd=2)
NF_Hist_valid$counts
## [1] 1 0 2 1 1 0 1 14 39 5 3 2 1 0 0 0 1
SNF_Hist_valid <-hist(snaive_valid$residuals/1000, breaks = 12, ylim=c(1,50), ylab="Frequency", xlab="Forecast Error (in thousands)", main="", bty="l")
multiplier <- SNF_Hist_valid$counts / SNF_Hist_valid$density
SNF_Density <- density(snaive_valid$residuals/1000, na.rm=TRUE)
SNF_Density$y <- SNF_Density$y * multiplier[1]
lines(SNF_Density, col="red", lwd=2)
SNF_Hist_valid$counts
## [1] 7 43 5 4 0 0 0 0 1
From Figure 7, we can see that the forecast error terms of the seasonal naive forecast for the validation period are right skewed. Of the 60 error terms we have, only 7 are negative, which means 53 are positive. This indicates that we are under predicting much more often than we are over predicting (\(\approx\) 7.5 times more likely to be under predicting than over predicting).
plot(SSales_Valid, bty="l", xaxt="n", xlab="Year: 2001", main="Naive Forecasts vs. Actual Sales", yaxt="n", ylab="Souvenir Sales (in thousands)", col=1, lty=1, lwd=2.5, ylim=c(0,125))
axis(1, at=seq(2001,2001.917,0.08333), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, las=2)
# Naive forecast red line
lines(naive_Valid$mean, col=2, lty=1, lwd=2.5)
#Seasonal Naive Forecast blue
lines(snaive_valid$mean, col=3, lty=1, lwd=2.5)
# Add legend
legend(2001.3,125, c("Actual", "Naive Forecast", "Seasonal Naive Forecast"), col=(1:3), lty=1, lwd=2.5)
Figures 7 and 8 confirm that the seasonal naive forecast is under predicting sales.
plot(snaive_valid$residuals, bty="l", ylab="Residuals", xlab="Year", main="Residuals by Year")
Many of the residuals are above 0 which provides further evidence that our seasonal naive forecast is under predicting.
qqnorm(snaive_valid$residuals[13:72])
qqline(snaive_valid$residuals[13:72], col="blue", lwd=1.5)
Figure 10 shows that the error terms are not normally distributed. For the error terms to be normally distributed, the line would need to be at a 45 degree angle from the bottom lefthand corner to the upper righthand corner or all of the dots hitting the line, which is not the case.
qqnorm(SSales_Valid - snaive_valid$mean)
qqline(SSales_Valid - snaive_valid$mean, col="red", lwd=1.5)
Figure 11 shows us that the residuals for the validation period are not normally distributed either.
The analyst must recombine the data from the training period and validation period in order to use all available data, as well as the most recent data, to forecast for 2002. Developing a forecast based only on the training period would make it more difficult to generate an accurate forecast for 2002 because the analyst would not have the most recent data to help create the forecast.
Yes, the analyst should partition the data into the training and validation periods. Although three years of data is not much data to go off, the analyst would want to use the first 24 months as the training period and the last 12 months as the validation period. The data partitioning allows the analyst to develop models which will be used to test for accuracy and fit.
No, the analyst needs to examine time plots of the series and of the model forecasts for both the training and validation periods. Then both periods should be examined against the actual data in order to see how accurate the forecast was, similar to Figure 8.
Yes, the analyst will often do this to help determine if the chosen model is “good enough” and is viable for use. The analyst may also use the training period metrics to compare different models (e.g. seasonal naive vs. Holt’s). The analyst should be interested in these values for the validation period, as well, to examine how accurately the training period was able to predict the validation period using the MAPE and RSME as measures.
Yes, the analyst should look at the MAPE and RMSE values for the validation period to see how accurately the predicted validation period matched the actual data based on the training set. The MAPE allows us to have a better understanding of how the forecast deviated (on average) from the actual values. While the RSME allows us to measure the deviation in the same units as the data series. By looking at both of these measures we can gauge how accurate our forecast model is.
Yes, the analyst should definitely compute naive and seasonal naive forecasts. The naive forecast uses the most recent data points to help forecast a model. Although sometimes the naive forecast is not the best forecasting model it at least provides a baseline or starting point for other forecasting models to be benchmarked against in the future.
autoplot(snaive_valid)
autoplot(SSales, series="Data") + autolayer(fitted(snaive_valid), series="Fitted") +autolayer(snaive_valid$mean, series="Seasonal Naive")
## Warning: Removed 12 rows containing missing values (geom_path).
autoplot(residuals(snaive_valid))