setwd("~/USM_MBA_678")
SouvSales <- read.csv("SouvenirSales.csv", header=TRUE, stringsAsFactors=FALSE)
str(SouvSales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
First we import souvenir sales figures into RStudio.
library(ggplot2)
library(forecast)
SouvSales.ts <- ts(SouvSales$Sales, start=c(1995,1), end=c(2001,12), freq=12)
plot(SouvSales.ts, xlab="Time", ylab="Sales)", bty="l")
After converting the souvenir sales data to a time series, we then plot the Sales figures. The graph shows obvious seasonality and trending of the data upwards over time.
#store the results to plot them later
ComponentsOfSouvSales<-decompose(SouvSales.ts)
#output them
ComponentsOfSouvSales
## $x
## Jan Feb Mar Apr May Jun Jul
## 1995 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1996 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1997 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1998 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1999 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 2000 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 2001 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1995 3566.34 5021.82 6423.48 7600.60 19756.21
## 1996 4752.15 5496.43 5835.10 12600.08 28541.72
## 1997 8176.62 8573.17 9690.50 15151.84 34061.01
## 1998 7979.25 8093.06 8476.70 17914.66 30114.41
## 1999 12552.22 11637.39 13606.89 21822.11 45060.69
## 2000 19888.61 23933.38 25391.35 36024.80 80721.71
## 2001 28586.52 30505.41 30821.33 46634.38 104660.67
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1995 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1996 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1997 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1998 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1999 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2000 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2001 -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## Jul Aug Sep Oct Nov Dec
## 1995 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1996 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1997 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1998 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 1999 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 2000 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
## 2001 -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
##
## $trend
## Jan Feb Mar Apr May Jun Jul
## 1995 NA NA NA NA NA NA 5421.133
## 1996 6517.855 6643.493 6712.677 6707.937 6891.732 7466.107 7924.554
## 1997 8566.257 8757.715 9028.598 9317.438 9584.403 9920.696 10200.837
## 1998 10729.094 10753.020 10724.792 10654.212 10718.755 10669.431 10459.387
## 1999 10913.802 11235.815 11574.035 11935.474 12312.042 13097.614 13836.559
## 2000 15392.422 15925.448 16743.464 17746.816 18829.614 20907.268 22502.653
## 2001 25224.785 25979.797 26616.045 27116.128 27784.443 29223.966 NA
## Aug Sep Oct Nov Dec
## 1995 5572.621 5872.002 6107.134 6249.078 6390.100
## 1996 8037.954 8172.822 8307.455 8352.884 8447.509
## 1997 10255.671 10362.989 10510.016 10600.920 10672.196
## 1998 10441.103 10352.485 10338.690 10487.217 10657.966
## 1999 14093.552 14439.348 14759.589 14900.270 15045.387
## 2000 22671.211 23033.112 23576.381 24094.515 24602.210
## 2001 NA NA NA NA NA
##
## $random
## Jan Feb Mar Apr May
## 1995 NA NA NA NA NA
## 1996 2632.1169 4116.8522 1204.3337 2699.9580 4083.3862
## 1997 2800.9240 2507.0205 1620.8524 589.2071 1982.2662
## 1998 1842.1673 623.6647 2388.3287 317.4221 1964.6037
## 1999 562.9994 796.5205 -1243.3947 1487.5609 1484.5666
## 2000 -1127.2306 -513.6528 -1493.1930 -1557.6212 -4422.8150
## 2001 -8331.3839 -9150.8120 -4097.3338 -5156.9337 -6712.4146
## Jun Jul Aug Sep Oct
## 1995 NA 1381.1125 83.1381 459.3348 742.1522
## 1996 2312.6805 707.2021 -1196.3844 -1366.8748 -2046.5482
## 1997 1537.5514 -398.5808 10.3681 -480.3019 -393.7099
## 1998 1382.7668 114.4692 -372.4336 -949.9082 -1436.1836
## 1999 1939.3134 -107.3729 548.0868 -1492.4411 -726.8928
## 2000 -2997.7307 -3317.2370 -693.1819 2209.7843 2240.7755
## 2001 -5794.9882 NA NA NA NA
## Nov Dec
## 1995 -4990.0804 -13977.3547
## 1996 -2094.4058 -7249.2538
## 1997 -1790.6816 -3954.6509
## 1998 1085.8409 -7887.0209
## 1999 580.2375 2671.8382
## 2000 5588.6825 28776.0353
## 2001 NA NA
##
## $figure
## [1] -6650.1615 -5562.1051 -691.8707 -4601.8646 -5074.2387 -4827.4476
## [7] -2452.6359 -2089.4193 -1309.5168 -425.8064 6341.6020 27343.4647
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"
#now we plot
plot(ComponentsOfSouvSales)
The observed frame is the actual data plot as a line plot. There is an overall upward movement to the data as shown on the trend frame. The seasonality seems to be additive. The random frame is the noise or unexplained part to the data. Possibly the repeating drops in the random frame are a function of the timing of marketing or advertising or possibly even weather related influences on the timing of seasonal shopping. In any case it should be noted that the random component of the time series does not seem that “random” but more like a predictably constant value or floor to the time series data.
Because of this seasonal pattern based on months, the naive forecast for each of the months in the validation period is equal to the actual sales in the most recent similar quarter, so we will use a “seasonal naive” forecast as the baseline. Thus a ‘snaive’ model from the ‘forecast’ package to generate the point forecasts for the validation period (the year 2001) will be used.
Problem 1 Questions: a) Why was the data partitioned? Partitioning occurs to avoid overfitting and to be able to assess the predictive performance of the model on new data.
c)What is the naive forecast for the validation period?
#set the length of the validation period to 12 months(one year)
nvalid<-12
#set the length of the training period to everything else
ntrain <- length(SouvSales.ts) - nvalid
#Partition the data into training and validation periods
SouvTrain<-window(SouvSales.ts, start=c(1995,1), end = c(1995, ntrain))
SouvValid<-window(SouvSales.ts, start=c(1995,ntrain+1), end=c(1995,ntrain+nvalid))
#use the seasonal naive forecast
snaiveForValid <-snaive(SouvTrain, h=nvalid)
#to see the point forecasts from the seasonal naive model
snaiveForValid$mean
## Jan Feb Mar Apr May Jun Jul
## 2001 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## Aug Sep Oct Nov Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71
This is the naive forecast for the validation period using a seasonal naive model.
We use the accuracy() function from the forecast library to determine the RMSE and MAPE values for the naive forecasts.
accuracy(snaiveForValid, SouvValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set 7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
## ACF1 Theil's U
## Training set 0.4140974 NA
## Test set 0.2264895 0.7373759
The RMSE and MAPE for the Test set are \(\approx\) 9542 and 27.28% respectively.
#plot the histogram and store it to use later
hist(snaiveForValid$residuals,breaks=20,probability=TRUE)
#add density curve
lines(density(snaiveForValid$residuals, na.rm=TRUE))
The distribution seems to follow that of a normal curve. We can see that most of the residuals are positive and concentrated in the two bins of 0-2000 and 2001-4000.
#plot the actual values from the validation period(2001)
plot(SouvValid, bty="l", xaxt="n", xlab="The Year 2001", yaxt="n",ylab="Sales")
axis(1, at=seq(2001,2001.91667,0.08333), labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))
axis(2, las=2)
#now add the forecasts and make the line red and dashed
lines(snaiveForValid$mean, col=2, lty=2)
#add a legend
legend(2001, 81000,c("Actual","Forecast"), col=1:2,lty=1:2)
As is indicated by the line plots, the seasonal naive model underpredicts consistently the actual sales throughout the validation period. We also saw that with the histogram of training period residuals showed mostly positive forecast errors and this is consistent with the line plot above.
Before attemptng to forecast future values of the series, the training and validation periods must be recombined into one long series and the chosen method is rerun on the complete data. This final model is then used to forecast future values. Below is the result of using the snaive model from the forecast library to generate 2002 forecasted values:
#set the length of the forecasting period to 12 months(one year)
nforecast<-12
#set the length of the combined training period to all the data
njoin <- length(SouvSales.ts)
#Partition the data into training and validation periods
SouvjoinTrain<-window(SouvSales.ts, start=c(1995,1), end = c(1995, njoin))
#use the seasonal naive forecast on the joined periods
snaiveForecast <-snaive(SouvjoinTrain, h=nforecast)
#to see the point forecasts from the seasonal naive model
snaiveForecast$mean
## Jan Feb Mar Apr May Jun Jul
## 2002 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 2002 28586.52 30505.41 30821.33 46634.38 104660.67
These are the forecast point values for 2002 using the snaive model.
autoplot(SouvSales.ts, series="Data") + autolayer(snaiveForecast$mean, series="Seasonal Naive")
And graphically, hereis the actual data plotted in addition to the forecast for the year 2002.
Problem #2 Forecasting Shampoo Sales
setwd("~/USM_MBA_678")
ShampSales <- read.csv("ShampooSales.csv", header=TRUE, stringsAsFactors=FALSE)
str(ShampSales)
## 'data.frame': 36 obs. of 2 variables:
## $ Month : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Shampoo.Sales: num 266 146 183 119 180 ...
library(forecast)
ShampSales.ts <- ts(ShampSales$Shampoo.Sales, start=c(1995,1), end=c(1997,12), freq=12)
plot(ShampSales.ts, xlab="Time", ylab="Shampoo Sales", bty="l")
The Shampoo sales data shows a definite upward trend.
#store the results to plot them later
ComponentsOfShampSales<-decompose(ShampSales.ts)
#output them
ComponentsOfShampSales
## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1995 266.0 145.9 183.1 119.3 180.3 168.5 231.8 224.5 192.8 122.9 336.5
## 1996 194.3 149.5 210.1 273.3 191.4 287.0 226.0 303.6 289.9 421.6 264.5
## 1997 339.7 440.4 315.9 439.3 401.3 437.4 575.5 407.6 682.0 475.3 581.3
## Dec
## 1995 185.9
## 1996 342.3
## 1997 646.9
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1995 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1996 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## 1997 -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## Jul Aug Sep Oct Nov Dec
## 1995 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1996 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
## 1997 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
##
## $trend
## Jan Feb Mar Apr May Jun Jul
## 1995 NA NA NA NA NA NA 193.4708
## 1996 216.4250 219.4792 226.8208 243.3125 252.7583 256.2750 268.8500
## 1997 366.3875 385.2833 405.9542 424.5292 439.9667 465.8583 NA
## Aug Sep Oct Nov Dec
## 1995 190.6333 191.9083 199.4500 206.3292 211.7292
## 1996 287.0292 303.5583 314.8833 330.5458 345.5583
## 1997 NA NA NA NA NA
##
## $random
## Jan Feb Mar Apr May
## 1995 NA NA NA NA NA
## 1996 -2.931076 -67.760243 31.454340 2.396007 -16.558160
## 1997 -7.493576 57.335590 -41.878993 -12.820660 6.133507
## Jun Jul Aug Sep Oct
## 1995 NA 35.377257 3.435590 2.062674 -96.845660
## 1996 24.379340 -45.801910 -13.860243 -12.487326 86.421007
## 1997 -34.803993 NA NA NA NA
## Nov Dec
## 1995 92.896007 -16.497743
## 1996 -103.320660 6.073090
## 1997 NA NA
##
## $figure
## [1] -19.193924 -2.218924 -48.175174 27.591493 -44.800174 6.345660
## [7] 2.951910 30.431076 -1.171007 20.295660 37.274826 -9.331424
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"
#now we plot
plot(ComponentsOfShampSales)
By decomposing the Shampoo Sales data, we verify the upward trend of the data as well as the presence of of a seasonal component to the time series.
As the question asks what steps to take regarding forecasting sales in future months, we would use the following steps: As with any time series, the data should be partitioned into training and validation periods. Time plots should be examined for both the training and validation periods. MAPE and RMSE values for the test or validation set should be looked at. Finally, given the seasonal component of the data set, naive forecasts should be computed.
#set the length of the validation period to 12 months(one year)
nvalid<-12
#set the length of the training period to everything else
ntrain <- length(ShampSales.ts) - nvalid
#Partition the data into training and validation periods
ShampTrain<-window(ShampSales.ts, start=c(1995,1), end = c(1995, ntrain))
ShampValid<-window(ShampSales.ts, start=c(1995,ntrain+1), end=c(1995,ntrain+nvalid))
#use the seasonal naive forecast
snaiveForShampValid <-snaive(ShampTrain, h=nvalid)
#to see the point forecasts from the seasonal naive model
snaiveForShampValid$mean
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1997 194.3 149.5 210.1 273.3 191.4 287.0 226.0 303.6 289.9 421.6 264.5
## Dec
## 1997 342.3
We use the accuracy() function from the forecast library to determine the RMSE and MAPE values for the naive forecasts.
accuracy(snaiveForShampValid, ShampValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 66.33333 121.9118 91.2500 19.00793 30.12280 1.000000
## Test set 215.75833 240.4731 215.7583 43.62046 43.62046 2.364475
## ACF1 Theil's U
## Training set -0.3328601 NA
## Test set -0.6413139 1.824886
With a MAPE of 43.62% for the Test set. The predictive value of the naive model is questionable.
Problem #3 Two different models were fit to the same time series. The first 100 time periods were used for the training period and the last 12 periods were treated as a validation period. Assume that both models make sense practically and fit the data reasonably well. Below are the RMSE values for each of the models:
Training Period Validation Period
Model A 543 690 Model B 669 675
Which model appears more useful for retrospectively describing the different components of this time series? Why? Model A exhibits a considerably smaller RMSE value for the training error compared to Model B. Thus if we wish to retrospectively describe the time series by looking at the model itself, we would be presumably better off observing the model which was more predictive of the training period data - Model A.
Which model appears to be more useful for forecasting purposes? Why? In terms of which model to use for forecasting purposes, even though Model B has a much higher RMSE for the training period, it has a smaller RMSE for the Validation Period which means that it would presumably have a smaller RMSE when used for forecasting future values.