Let’s take a look at a time series plot. I just want to see what the data looks like and try determine the patterns.
library(forecast)
## Warning: package 'forecast' was built under R version 3.3.3
setwd("C:/Users/Yusuf/Documents")
souvenir <- read.csv("SouvenirSales585.csv")
str(souvenir)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : Factor w/ 84 levels "1-Apr","1-Aug",..: 38 32 56 14 62 50 44 20 80 74 ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
head(souvenir)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
tail(souvenir)
## Date Sales
## 79 1-Jul 26155.15
## 80 1-Aug 28586.52
## 81 1-Sep 30505.41
## 82 1-Oct 30821.33
## 83 1-Nov 46634.38
## 84 1-Dec 104660.67
# Create the time series object and plot it
sSales <- ts(souvenir$Sales, start=c(1995,1), frequency=12)
# Plot it
plot(sSales, bty="l", yaxt="n", xlab="Year", ylab="Sales",lwd=3)
# Add the y-axis
axis(2, at=seq(1500,105000,10000), labels=format(seq(1.5,105,10)), las=2)
There appears to be seasonality on the monthly level in the series. Because of this pattern, the naive forecast for each of the months in the validation period is equal to the actual sales in the most recent similar month. For example, the forecast for January 2001 is equal to the actual sales from January 2000.
now use the snaive model from the forecast package to generate the point forecasts for the validation period (the year 2001).
library(forecast)
validLength <- 12
trainLength <- length(sSales) - validLength
sSalesTrain <- window(sSales, start=c(1995,1), end=c(1995, trainLength))
sSalesValid <- window(sSales, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
naiveForValid <- naive(sSalesTrain, h=validLength)
snaiveForValid <- snaive(sSalesTrain, h=validLength)
# To see the point forecasts from the seasonal naive model
snaiveForValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
We already have created the seasonal naive forecast from the previous question, but we still have to do more to get the RMSE and MAPE. It is simply one function call as shown below.
plot(naiveForValid)
plot(snaiveForValid)
accuracy(snaiveForValid, sSalesValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
We are interested in the RMSE and MAPE for the Test set. We see that the RMSE = 9542 and the MAPE = 27.28%. The MAPE is often easier to intepret and convey to a decision maker.however, that any value of MAPE cannot necessarily indicate if there is “something wrong”. These metrics are based on historical forecast errors and are trying to tell us how “good” the model is. For this particular dataset we could probably find a better model. A first task would be to re-examine the time series plot, and maybe manipulate it some more, to see if there are any other patterns that the seasonal naive model does not capture.
We can use the following code to create a histogram of the error terms (i.e., the residuals). We are trying to determine if the “bell-curve” appears indicating a normal distribution for error terms. If the error terms are centered around zero, it means we are, on average, over-forecasting and under-forecasting about the same amount of the time. First, let’s look at the training period residuals. we also overlay a density curve to help identify the “shape”
# Plot the histogram and store it to use later
myhist <- hist(snaiveForValid$residuals, ylab="Frequency", xlab="Forecast Error", main="", bty="l",lwd=3)
multiplier <- myhist$counts / myhist$density
# Need to ignore NA from 1995
mydensity <- density(snaiveForValid$residuals, na.rm=TRUE,lwd=3)
## Warning: In density.default(snaiveForValid$residuals, na.rm = TRUE, lwd = 3) :
## extra argument 'lwd' will be disregarded
mydensity$y <- mydensity$y * multiplier[1]
# Add the density curve
lines(mydensity,lwd=3)
The resulting histogram indicates that the error terms for the training period are not normally distributed. There are multiple “humps” and it is right-skewed. We can also see the large majority of the error terms are positive indicating that we are under predicting most of the time.
Now,we create a histogram for the validation period. We accomplish this task with the following code. We need to do the calculation to find the error terms by subtracting the point forecasts from the actuals for the validation period.validHist <- hist(sSalesValid - snaiveForValid$mean, ylab="Frequency", xlab="Forecast Error", main="", bty="l",lwd=3)
validMultiplier <- validHist$counts / validHist$density
validDensity <- density(sSalesValid - snaiveForValid$mean)
validDensity$y <- validDensity$y * validMultiplier[1]
# Add the density curve
lines(validDensity,lwd=3)
To create a time plot of the actuals and the seasonal naive forecasts we use the following code.
# Plot the actual values from the validation period (2001)
plot(sSalesValid, bty="l", xaxt="n", xlab="The Year 2001", yaxt="n", ylab="Sales (000s)",lwd=3)
axis(1, at=seq(2001,2002,0.09090909), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at=seq(10200,104700,10000), labels=format(seq(10.2,104.7,10)), las=2)
# Now add the forecasts and make the line red and dashed
lines(snaiveForValid$mean, col=2, lty=2,lwd=3)
# Add a legend
legend(2001,104700, c("Actual","Forecast"), col=1:2, lty=1:2)
As is indicated by the line plots, the seasonal naive model is consistently under predicting the actual sales. This is consistent with what we saw with the histogram of the errors on the both the training period (showing that the large majority of the forecast errors are positive) and the validation period (all of the forecast errors are positive). All graphs are leading to the same conclusion.
There are two more graphs we plot just the residuals over time. Second, we create a âQ-Q plotâ" which helps determine the (non)-normality of the residuals.# plot all the residuals that we have
plot(snaiveForValid$residuals,lwd=3)
Again, we can see that we are under predicting a large majority of the time with the seasonal naive model. Additionally, it appears to be getting worse. Note, that because we used a seasonal naive model, we do not have forecasts for that entire first year, 1995. And, of course, with a forecast for that first year, we cannot calculate a residual. Now we look to a normality plot for the error terms for everywhere we have a residual. Recall, that the first year, 1995, we have no residuals, so we need to exclude those from the plot. Otherwise things look really weird
qqnorm(snaiveForValid$residuals[13:72],lwd=2)
qqline(snaiveForValid$residuals[13:72],lwd=2)
For the error terms to be normally distributed, we would need to see a 45-degree line from the bottom, left-hand corner to the upper, right-hand corner of the plot. This plot, again, shows that we do not have normally distributed error terms.
If we only want to look at the residuals for the validation period, we can use the following line of code.qqnorm(sSalesValid - snaiveForValid$mean)
qqline(sSalesValid - snaiveForValid$mean)
par(oma = c(0, 0, 0, 4))
xrange <- c(1995,2001)
yrange <- range(sSales)
plot(xrange, yrange, type="n", xlab="Year", ylab="Souvenir Sales by Month", bty="l", las=1, yaxt = "n",lwd=3)
colors <- rainbow(12)
linetype <- c(14:25)
plotchar <- c(14:25)
axis(1, at=seq(1995,2001), labels=format(seq(1995,2001)))
axis(2, at=seq(0,110000,18000), labels=format(seq(0,110000,18000)), cex.axis=0.5, las=2)
for (i in 1:12) {
currentmonth <- subset(sSales, cycle(sSales)==i)
lines(seq(1995, 1995+length(currentmonth)-1,1), currentmonth, type="b", lwd=2,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
title("Souvenir Sales Broken Out by Month")
legend(2002, 105000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)
summary with the time series of the data + Decomposition show the components of the time series
sSales_plots = decompose(sSales)
sSales_plots$seasonal
## Jan Feb Mar Apr May Jun
## 1995 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1996 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1997 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1998 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1999 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2000 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2001 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## Jul Aug Sep Oct Nov Dec
## 1995 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1996 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1997 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1998 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1999 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 2000 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 2001 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
plot(sSales_plots,lwd=2)