Pro-STA-585

Let’s take a look at a time series plot. I just want to see what the data looks like and try determine the patterns.

library(forecast)

## Warning: package 'forecast' was built under R version 3.3.3

setwd("C:/Users/Yusuf/Documents")
souvenir <- read.csv("SouvenirSales585.csv")
str(souvenir)

## 'data.frame':    84 obs. of  2 variables:
##  $ Date : Factor w/ 84 levels "1-Apr","1-Aug",..: 38 32 56 14 62 50 44 20 80 74 ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...

head(souvenir)

##     Date   Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74

tail(souvenir)

##     Date     Sales
## 79 1-Jul  26155.15
## 80 1-Aug  28586.52
## 81 1-Sep  30505.41
## 82 1-Oct  30821.33
## 83 1-Nov  46634.38
## 84 1-Dec 104660.67

# Create the time series object and plot it
sSales <- ts(souvenir$Sales, start=c(1995,1), frequency=12)

# Plot it
plot(sSales, bty="l", yaxt="n", xlab="Year", ylab="Sales",lwd=3)
# Add the y-axis

axis(2, at=seq(1500,105000,10000), labels=format(seq(1.5,105,10)), las=2)

There appears to be seasonality on the monthly level in the series. Because of this pattern, the naive forecast for each of the months in the validation period is equal to the actual sales in the most recent similar month. For example, the forecast for January 2001 is equal to the actual sales from January 2000.

now use the snaive model from the forecast package to generate the point forecasts for the validation period (the year 2001).

library(forecast)
validLength <- 12
trainLength <- length(sSales) - validLength

sSalesTrain <- window(sSales, start=c(1995,1), end=c(1995, trainLength))
sSalesValid <- window(sSales, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
naiveForValid <- naive(sSalesTrain, h=validLength)
snaiveForValid <- snaive(sSalesTrain, h=validLength)

# To see the point forecasts from the seasonal naive model
snaiveForValid$mean

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71

We already have created the seasonal naive forecast from the previous question, but we still have to do more to get the RMSE and MAPE. It is simply one function call as shown below.

plot(naiveForValid)

plot(snaiveForValid)

accuracy(snaiveForValid, sSalesValid)

##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759

We are interested in the RMSE and MAPE for the Test set. We see that the RMSE = 9542 and the MAPE = 27.28%. The MAPE is often easier to intepret and convey to a decision maker.however, that any value of MAPE cannot necessarily indicate if there is “something wrong”. These metrics are based on historical forecast errors and are trying to tell us how “good” the model is. For this particular dataset we could probably find a better model. A first task would be to re-examine the time series plot, and maybe manipulate it some more, to see if there are any other patterns that the seasonal naive model does not capture.

We can use the following code to create a histogram of the error terms (i.e., the residuals). We are trying to determine if the “bell-curve” appears indicating a normal distribution for error terms. If the error terms are centered around zero, it means we are, on average, over-forecasting and under-forecasting about the same amount of the time. First, let’s look at the training period residuals. we also overlay a density curve to help identify the “shape”

# Plot the histogram and store it to use later
myhist <- hist(snaiveForValid$residuals, ylab="Frequency", xlab="Forecast Error", main="", bty="l",lwd=3)

multiplier <- myhist$counts / myhist$density

# Need to ignore NA from 1995
mydensity <- density(snaiveForValid$residuals, na.rm=TRUE,lwd=3)

## Warning: In density.default(snaiveForValid$residuals, na.rm = TRUE, lwd = 3) :
##  extra argument 'lwd' will be disregarded

mydensity$y <- mydensity$y * multiplier[1]

# Add the density curve
lines(mydensity,lwd=3)

The resulting histogram indicates that the error terms for the training period are not normally distributed. There are multiple “humps” and it is right-skewed. We can also see the large majority of the error terms are positive indicating that we are under predicting most of the time.

Now,we create a histogram for the validation period. We accomplish this task with the following code. We need to do the calculation to find the error terms by subtracting the point forecasts from the actuals for the validation period.

validHist <- hist(sSalesValid - snaiveForValid$mean, ylab="Frequency", xlab="Forecast Error", main="", bty="l",lwd=3)

validMultiplier <- validHist$counts / validHist$density

validDensity <- density(sSalesValid - snaiveForValid$mean)
validDensity$y <- validDensity$y * validMultiplier[1]

# Add the density curve
lines(validDensity,lwd=3)

To create a time plot of the actuals and the seasonal naive forecasts we use the following code.

# Plot the actual values from the validation period (2001)
plot(sSalesValid, bty="l", xaxt="n", xlab="The Year 2001", yaxt="n", ylab="Sales (000s)",lwd=3)
axis(1, at=seq(2001,2002,0.09090909), labels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at=seq(10200,104700,10000), labels=format(seq(10.2,104.7,10)), las=2)

# Now add the forecasts and make the line red and dashed
lines(snaiveForValid$mean, col=2, lty=2,lwd=3)

# Add a legend
legend(2001,104700, c("Actual","Forecast"), col=1:2, lty=1:2)

As is indicated by the line plots, the seasonal naive model is consistently under predicting the actual sales. This is consistent with what we saw with the histogram of the errors on the both the training period (showing that the large majority of the forecast errors are positive) and the validation period (all of the forecast errors are positive). All graphs are leading to the same conclusion.

There are two more graphs we plot just the residuals over time. Second, we create a âQ-Q plotâ" which helps determine the (non)-normality of the residuals.

# plot all the residuals that we have
plot(snaiveForValid$residuals,lwd=3)

Again, we can see that we are under predicting a large majority of the time with the seasonal naive model. Additionally, it appears to be getting worse. Note, that because we used a seasonal naive model, we do not have forecasts for that entire first year, 1995. And, of course, with a forecast for that first year, we cannot calculate a residual. Now we look to a normality plot for the error terms for everywhere we have a residual. Recall, that the first year, 1995, we have no residuals, so we need to exclude those from the plot. Otherwise things look really weird

qqnorm(snaiveForValid$residuals[13:72],lwd=2)
qqline(snaiveForValid$residuals[13:72],lwd=2)

For the error terms to be normally distributed, we would need to see a 45-degree line from the bottom, left-hand corner to the upper, right-hand corner of the plot. This plot, again, shows that we do not have normally distributed error terms.

If we only want to look at the residuals for the validation period, we can use the following line of code.

qqnorm(sSalesValid - snaiveForValid$mean)
qqline(sSalesValid - snaiveForValid$mean)

par(oma = c(0, 0, 0, 4))
xrange <- c(1995,2001)
yrange <- range(sSales)
plot(xrange, yrange, type="n", xlab="Year", ylab="Souvenir Sales by Month", bty="l", las=1, yaxt = "n",lwd=3)
colors <- rainbow(12) 
linetype <- c(14:25) 
plotchar <- c(14:25)
axis(1, at=seq(1995,2001), labels=format(seq(1995,2001)))
axis(2, at=seq(0,110000,18000), labels=format(seq(0,110000,18000)), cex.axis=0.5, las=2)
for (i in 1:12) { 
  currentmonth <- subset(sSales, cycle(sSales)==i)
  lines(seq(1995, 1995+length(currentmonth)-1,1), currentmonth, type="b", lwd=2,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) 
} 
title("Souvenir Sales Broken Out by Month")
legend(2002, 105000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)

summary with the time series of the data + Decomposition show the components of the time series

sSales_plots = decompose(sSales)
sSales_plots$seasonal

##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1996 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1997 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1998 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1999 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2000 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2001 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1996 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1997 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1998 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1999 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 2000 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 2001 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647

plot(sSales_plots,lwd=2)

Pro-STA-585

Yusuf Sultan

4/28/2017