set working directory, load csv file

setwd("~/wd678/Unit 2/Assign 2")
souvenir <- read.csv("SouvenirSales.csv", stringsAsFactors=FALSE, header=TRUE)

create time series and plot it

souvSales.ts <- ts(souvenir$Sales, start=c(1995,1), frequency=12)
plot(souvSales.ts, bty="l")

break out each month and plot it over the years

reset outer margin; format domain & range; set up plot; format & add lines, title, legend

par(oma=c(0,0,0,2))

xrange <- c(1995,2001)
yrange <- range(souvSales.ts)
plot(xrange, yrange, type="n", xlab="Date", ylab="Sales", bty="l", las=1, xaxt="n")

colors <- rainbow(12)
linetype <- c(1:12)
plotchar <- c(1:12)

axis(1, at=seq(1995,2001,1), labels=format(seq(1995,2001,1)))

for (i in 1:12) {
  currentMonth <- subset(souvSales.ts, cycle(souvSales.ts)==i)
  lines(seq(1995, 1995+length(currentMonth)-1,1), currentMonth, type="b", lwd=1,
      lty=linetype[i], col=colors[i], pch=plotchar[i])
}

title("Souvenir Sales Broken Out by Month")

legend(2001.35,100000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)

According to the month-by-month plot above, the series has monthly seasonality, with sales peaking in December of each year.
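
As a quick cross-check of this seasonality (a sketch, not part of the original assignment), base R's monthplot() draws one sub-series per month, which makes the December peaks easy to spot:

monthplot(souvSales.ts, ylab="Sales", xlab="Month", main="Souvenir Sales by Month")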

install and load forecast package

#install.packages("forecast")
library(forecast, quietly=TRUE, warn.conflicts=FALSE)

set validation and training period lengths, partition the periods

1(a) The data was partitioned to avoid overfitting the model.

Partitioning prevents fitting the noise along with the signal, which would cause the model to perform poorly on new data. Using separate data to build the model and to test it gives an honest estimate of the model's predictive accuracy.

1(b) The analyst chose a 12-month validation period to match the forecast horizon,

which is the minimum desirable length for assessing forecast accuracy with a relatively short time series such as this one. Lengthening the validation period would pull the most recent observations, which are likely the most predictive, out of the training period. With a longer time series, a longer validation period could be used.

validLength <- 12

trainLength <- length(souvSales.ts) - validLength

nSalesTrain <- window(souvSales.ts, start=c(1995,1), end=c(1995,trainLength))
nSalesValid <- window(souvSales.ts, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
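
A quick sanity check on the partition lengths (a sketch; the series runs January 1995 through December 2001, 84 observations):

length(nSalesTrain) # 72 observations: Jan 1995 - Dec 2000
length(nSalesValid) # 12 observations: Jan - Dec 2001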

generate the naive forecast and examine its point forecasts

1(c) The naive forecast for every month in the validation period, computed below, is by definition the single most recent observation in the training period: 80,722.

naiveForValid <- naive(nSalesTrain, h=validLength)

naiveForValid$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71
##           Aug      Sep      Oct      Nov      Dec
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71
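
As a quick check (a sketch, not required by the assignment), the naive point forecast should equal the last observation of the training period, December 2000:

tail(nSalesTrain, 1) # 80721.71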

I determined the accuracy measures.

1(d) The RMSE was 56,099 and the MAPE was 291%.

These large errors occurred because sales are seasonal, peaking every December, and the series has an exponential trend. Since December 2000 was the last observation in the training period, the single value used for the naive forecast was a peak December sale. With the December peaks rising exponentially and the other months selling far less, the naive forecast was substantially greater than actual sales in every month of 2001 except December, which reached the highest peak of the trend.

accuracy(naiveForValid, nSalesValid)
##                      ME     RMSE       MAE        MPE      MAPE     MASE
## Training set   1113.477 10460.73  5506.879  -25.27554  61.16191  1.47054
## Test set     -50500.288 56099.07 54490.114 -287.13834 290.95050 14.55087
##                    ACF1 Theil's U
## Training set -0.1968879        NA
## Test set      0.3182456  6.649124
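
To make the connection to the definitions explicit (a sketch, not required by the assignment), the validation-period RMSE and MAPE can be recomputed by hand from the forecast errors:

err <- nSalesValid - naiveForValid$mean
sqrt(mean(err^2))                  # RMSE, approximately 56,099
mean(abs(err / nSalesValid)) * 100 # MAPE, approximately 291%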

1(e) I plotted a histogram of naive forecast errors for the validation period.

naiveValidRes <- nSalesValid - naiveForValid$mean
myHist <- hist(naiveValidRes, ylab="Frequency", xlab="Forecast Error", main="", bty="l")

1(e-cont.) I plotted the forecast values and the actual sales for the validation period.

The forecast errors are not normally distributed: the forecast overpredicted actual sales in 11 of the 12 validation months. This happened because the naive forecast was taken from the final observation of an exponentially trending training period, a peak December month, and December sales are disproportionately higher than sales in every other month.

plot(nSalesValid, bty="l", xaxt="n", xlab="Year 2001", yaxt="n", ylab="Souvenirs Sold")

axis(1, at=seq(2001, 2001.75, 0.25), labels=c("Jan", "Apr", "Jul","Oct"))
axis(2,las=2)

lines(naiveForValid$mean, col=2, lty=2)

legend(2001,110000, c("Actual","Forecast"), col=1:2, lty=1:2)

1(f) The analyst must recombine the training and validation periods and rerun the chosen model on the full series before forecasting future periods.

If needed, the analyst can iterate through the whole process again to develop a better-performing model.
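
A minimal sketch of step 1(f), using the naive model only as a stand-in for whatever model the analyst ultimately chooses: refit on the recombined (full) series and forecast the next 12 months.

naiveFinal <- naive(souvSales.ts, h=12)
naiveFinal$mean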

2.

The following steps should be taken:

2(a)

This is done to avoid overfitting the model. Partitioning keeps the model from fitting the noise, which would cause it to perform poorly on new data, and it allows a model developed on the training data to be evaluated on data set aside for validation.

2(b)

Since the model is built from the training period, it should be tested on fresh data in the validation period. Evaluating the model on the same period it was built from would introduce bias.

2(c)

Naive forecasts are used as a baseline model because they are simple to implement and follow the logic that the most recent data is likely the most predictive. The performance of all other models is compared against, and judged relative to, this baseline.
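
As an illustration of the baseline comparison (a sketch only; the seasonal naive model here is just an example candidate), any competing model would be scored on the same validation period and set against the naive accuracy measures above:

snaiveForValid <- snaive(nSalesTrain, h=validLength)
accuracy(snaiveForValid, nSalesValid)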

----------------------------------------

The following steps should not be taken:

2(d)

The validation period should be examined as a time plot to ensure that its components (level, trend, seasonality, noise) are consistent with those of the training period; only then can a model built from the training period be expected to perform well on the validation period.

As a visual aid, time plots of the model forecasts should be compared against the actual data in the validation period.

2(e)

Since the training period is used to build the models, testing them on the same period would produce optimistically biased MAPE and RMSE values.
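
A minimal sketch of reporting only the validation-period (test set) errors rather than the training-period errors:

accuracy(naiveForValid, nSalesValid)["Test set", c("RMSE", "MAPE")]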