Set the working directory and load the CSV file.
setwd("~/wd678/Unit 2/Assign 2")
souvenir <- read.csv("SouvenirSales.csv", stringsAsFactors=FALSE, header=TRUE)
Create the time series and plot it.
souvSales.ts <- ts(souvenir$Sales, start=c(1995,1), frequency=12)
plot(souvSales.ts, bty="l")
Break out each month and plot it over the years.
Reset the outer margin; set the domain and range; set up the plot; then format and add the lines, title, and legend.
par(oma=c(0,0,0,2))
xrange <- c(1995,2001)
yrange <- range(souvSales.ts)
plot(xrange, yrange, type="n", xlab="Date", ylab="Sales", bty="l", las=1, xaxt="n")  # suppress default x-axis; a custom year axis is added below
colors <- rainbow(12)
linetype <- c(1:12)
plotchar <- c(1:12)
axis(1, at=seq(1995,2001,1), labels=format(seq(1995,2001,1)))
for (i in 1:12) {
currentMonth <- subset(souvSales.ts, cycle(souvSales.ts)==i)
lines(seq(1995, 1995+length(currentMonth)-1,1), currentMonth, type="b", lwd=1,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
title("Souvenir Sales Broken Out by Month")
legend(2001.35,100000, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month", xpd=NA)
According to the month-by-month plot above, the seasonality is monthly: sales climb to a pronounced peak every December.
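As a cross-check on that conclusion, base R's monthplot() produces the same month-by-month breakdown in a single call; this is a minimal sketch rather than part of the original analysis.
# One subseries per month across the years; horizontal bars mark each month's mean
monthplot(souvSales.ts, ylab="Sales", xlab="Month")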
Install and load the forecast package.
#install.packages("forecast")
library(forecast, quietly=TRUE, warn.conflicts=FALSE)
Set the validation and training period lengths and partition the series.
Partitioning avoids fitting the noise along with the signal, which would make the model perform poorly. Using separate data to build the model and to test it improves the model's predictive accuracy.
The validation period was set to 12 months, which is the minimum desirable length for judging forecast accuracy with a relatively short time series such as the analyst had. Lengthening the validation period would remove the most recent data, which is likely to be the most accurately predictive, from the training period. With a long time series, a longer validation period can be used.
validLength <- 12
trainLength <- length(souvSales.ts) - validLength
nSalesTrain <- window(souvSales.ts, start=c(1995,1), end=c(1995,trainLength))
nSalesValid <- window(souvSales.ts, start=c(1995,trainLength+1), end=c(1995,trainLength+validLength))
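A quick sanity check on the split (assuming the series runs January 1995 through December 2001, as plotted above):
# Expect 72 training months (Jan 1995 - Dec 2000) and 12 validation months (2001)
length(nSalesTrain)
length(nSalesValid)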
Use a naive forecast and inspect its point forecasts.
naiveForValid <- naive(nSalesTrain, h=validLength)
naiveForValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71 80721.71
## Aug Sep Oct Nov Dec
## 2001 80721.71 80721.71 80721.71 80721.71 80721.71
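Every point forecast above is the same number because naive() simply carries the last training observation (December 2000) forward; a minimal check:
# The repeated forecast value is just the final training observation
tail(nSalesTrain, 1)
all.equal(as.numeric(naiveForValid$mean), rep(as.numeric(tail(nSalesTrain, 1)), validLength))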
I determined the accuracy measures (below). The validation-period errors are large because sales were seasonal and peaked every December, and because December 2000, the last entry in the exponentially trending training period, supplied the single value used for the naive forecast: a peak December sale. With every December peak ascending exponentially, followed by relatively lower sales in the other months, the naive forecast was substantially greater than actual sales in every month of 2001 except December, which occurred at the highest December peak of the trend.
accuracy(naiveForValid, nSalesValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 1113.477 10460.73 5506.879 -25.27554 61.16191 1.47054
## Test set -50500.288 56099.07 54490.114 -287.13834 290.95050 14.55087
## ACF1 Theil's U
## Training set -0.1968879 NA
## Test set 0.3182456 6.649124
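The Test set row can be verified by hand; these are the standard error definitions (error = actual minus forecast) that accuracy() uses:
# Hand-computed validation-period measures; should match the Test set row above
err <- as.numeric(nSalesValid) - as.numeric(naiveForValid$mean)
c(ME = mean(err),
  RMSE = sqrt(mean(err^2)),
  MAE = mean(abs(err)),
  MPE = mean(100 * err / as.numeric(nSalesValid)),
  MAPE = mean(abs(100 * err / as.numeric(nSalesValid))))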
naiveValidRes <- nSalesValid - naiveForValid$mean
myHist <- hist(naiveValidRes, ylab="Frequency", xlab="Forecast Error", main="")
Nearly all of the forecast errors are negative because the naive forecast was taken near the end of an exponentially trending time series, at a peaking December month in the training period, where December sales were disproportionately higher than those of all other months.
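A short numeric companion to the histogram quantifies that skew:
# Summary of the errors and the share that are negative (over-forecasts)
summary(as.numeric(naiveValidRes))
mean(naiveValidRes < 0)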
plot(nSalesValid, bty="l", xaxt="n", xlab="Year 2001", yaxt="n", ylab="Souvenirs Sold")
axis(1, at=seq(2001, 2001.75, 0.25), labels=c("Jan", "Apr", "Jul","Oct"))
axis(2,las=2)
lines(naiveForValid$mean, col=2, lty=2)
legend(2001,110000, c("Actual","Forecast"), col=1:2, lty=1:2)
The analyst can iterate through the whole process again, if needed, to develop a better-performing model.
The following steps should be taken:
Partition the data into training and validation periods. This avoids overfitting, which causes the model to perform poorly, and makes it possible to run a model developed on training data on new data designated for validation.
Running the model on the training period on which it was built would introduce bias.
Compute a naive forecast as a benchmark against which to compare and judge the performance of all other models (see the sketch at the end of this section).
The following steps should not be taken:
Skipping an examination of the validation period to ensure that its components (level, trend, seasonality, noise) are congruent with the training period. Only then will the model built from the training period perform well on the validation period.
Judging models by accuracy measures alone. As a visual aid, time plots of model forecasts should be compared against actual data in the validation period.
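To make the benchmark and visual-aid steps concrete, here is a minimal sketch; the seasonal naive model (forecast's snaive()) is a hypothetical second candidate used only for illustration, not one the analyst fit:
# Judge a second candidate against the naive benchmark on the validation period
snaiveForValid <- snaive(nSalesTrain, h=validLength)
accuracy(naiveForValid, nSalesValid)["Test set", c("RMSE", "MAPE")]
accuracy(snaiveForValid, nSalesValid)["Test set", c("RMSE", "MAPE")]
# Visual aid: overlay both forecasts on the actual validation data
plot(nSalesValid, bty="l", ylab="Souvenirs Sold",
     ylim=range(nSalesValid, naiveForValid$mean, snaiveForValid$mean))
lines(naiveForValid$mean, col=2, lty=2)
lines(snaiveForValid$mean, col=4, lty=3)
legend("topleft", c("Actual", "Naive", "Seasonal naive"), col=c(1, 2, 4), lty=1:3)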