Problem #1

setwd("~/USM_MBA_678")
SouvSales <- read.csv("SouvenirSales.csv", header=TRUE, stringsAsFactors=FALSE)
str(SouvSales)
## 'data.frame':    84 obs. of  2 variables:
##  $ Date : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...

First we import souvenir sales figures into RStudio.

library(ggplot2)
library(forecast)

SouvSales.ts <- ts(SouvSales$Sales, start=c(1995,1), end=c(2001,12), freq=12)
plot(SouvSales.ts, xlab="Time", ylab="Sales)",  bty="l")

After converting the souvenir sales data to a time series, we then plot the Sales figures. The graph shows obvious seasonality and trending of the data upwards over time.

#store the results to plot them later
ComponentsOfSouvSales<-decompose(SouvSales.ts)

#output them
ComponentsOfSouvSales
## $x
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1995   1664.81   2397.53   2840.71   3547.29   3752.96   3714.74   4349.61
## 1996   2499.81   5198.24   7225.14   4806.03   5900.88   4951.34   6179.12
## 1997   4717.02   5702.63   9957.58   5304.78   6492.43   6630.80   7349.62
## 1998   5921.10   5814.58  12421.25   6369.77   7609.12   7224.75   8121.22
## 1999   4826.64   6470.23   9638.77   8821.17   8722.37  10209.48  11276.55
## 2000   7615.03   9849.69  14558.40  11587.33   9332.56  13082.09  16732.78
## 2001  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 1995   3566.34   5021.82   6423.48   7600.60  19756.21
## 1996   4752.15   5496.43   5835.10  12600.08  28541.72
## 1997   8176.62   8573.17   9690.50  15151.84  34061.01
## 1998   7979.25   8093.06   8476.70  17914.66  30114.41
## 1999  12552.22  11637.39  13606.89  21822.11  45060.69
## 2000  19888.61  23933.38  25391.35  36024.80  80721.71
## 2001  28586.52  30505.41  30821.33  46634.38 104660.67
## 
## $seasonal
##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1996 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1997 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1998 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 1999 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2000 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
## 2001 -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1996 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1997 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1998 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 1999 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 2000 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 2001 -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 
## $trend
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1995        NA        NA        NA        NA        NA        NA  5421.133
## 1996  6517.855  6643.493  6712.677  6707.937  6891.732  7466.107  7924.554
## 1997  8566.257  8757.715  9028.598  9317.438  9584.403  9920.696 10200.837
## 1998 10729.094 10753.020 10724.792 10654.212 10718.755 10669.431 10459.387
## 1999 10913.802 11235.815 11574.035 11935.474 12312.042 13097.614 13836.559
## 2000 15392.422 15925.448 16743.464 17746.816 18829.614 20907.268 22502.653
## 2001 25224.785 25979.797 26616.045 27116.128 27784.443 29223.966        NA
##            Aug       Sep       Oct       Nov       Dec
## 1995  5572.621  5872.002  6107.134  6249.078  6390.100
## 1996  8037.954  8172.822  8307.455  8352.884  8447.509
## 1997 10255.671 10362.989 10510.016 10600.920 10672.196
## 1998 10441.103 10352.485 10338.690 10487.217 10657.966
## 1999 14093.552 14439.348 14759.589 14900.270 15045.387
## 2000 22671.211 23033.112 23576.381 24094.515 24602.210
## 2001        NA        NA        NA        NA        NA
## 
## $random
##              Jan         Feb         Mar         Apr         May
## 1995          NA          NA          NA          NA          NA
## 1996   2632.1169   4116.8522   1204.3337   2699.9580   4083.3862
## 1997   2800.9240   2507.0205   1620.8524    589.2071   1982.2662
## 1998   1842.1673    623.6647   2388.3287    317.4221   1964.6037
## 1999    562.9994    796.5205  -1243.3947   1487.5609   1484.5666
## 2000  -1127.2306   -513.6528  -1493.1930  -1557.6212  -4422.8150
## 2001  -8331.3839  -9150.8120  -4097.3338  -5156.9337  -6712.4146
##              Jun         Jul         Aug         Sep         Oct
## 1995          NA   1381.1125     83.1381    459.3348    742.1522
## 1996   2312.6805    707.2021  -1196.3844  -1366.8748  -2046.5482
## 1997   1537.5514   -398.5808     10.3681   -480.3019   -393.7099
## 1998   1382.7668    114.4692   -372.4336   -949.9082  -1436.1836
## 1999   1939.3134   -107.3729    548.0868  -1492.4411   -726.8928
## 2000  -2997.7307  -3317.2370   -693.1819   2209.7843   2240.7755
## 2001  -5794.9882          NA          NA          NA          NA
##              Nov         Dec
## 1995  -4990.0804 -13977.3547
## 1996  -2094.4058  -7249.2538
## 1997  -1790.6816  -3954.6509
## 1998   1085.8409  -7887.0209
## 1999    580.2375   2671.8382
## 2000   5588.6825  28776.0353
## 2001          NA          NA
## 
## $figure
##  [1] -6650.1615 -5562.1051  -691.8707 -4601.8646 -5074.2387 -4827.4476
##  [7] -2452.6359 -2089.4193 -1309.5168  -425.8064  6341.6020 27343.4647
## 
## $type
## [1] "additive"
## 
## attr(,"class")
## [1] "decomposed.ts"
#now we plot
plot(ComponentsOfSouvSales)

The observed frame is the actual data plot as a line plot. There is an overall upward movement to the data as shown on the trend frame. The seasonality seems to be additive. The random frame is the noise or unexplained part to the data. Possibly the repeating drops in the random frame are a function of the timing of marketing or advertising or possibly even weather related influences on the timing of seasonal shopping. In any case it should be noted that the random component of the time series does not seem that “random” but more like a predictably constant value or floor to the time series data.

Because of this seasonal pattern based on months, the naive forecast for each of the months in the validation period is equal to the actual sales in the most recent similar quarter, so we will use a “seasonal naive” forecast as the baseline. Thus a ‘snaive’ model from the ‘forecast’ package to generate the point forecasts for the validation period (the year 2001) will be used.

Problem 1 Questions: a) Why was the data partitioned? Partitioning occurs to avoid overfitting and to be able to assess the predictive performance of the model on new data.

  1. Why did the analyst choose a 12-month validation period? Since the analyst was asked to predict the sales for the 12 months of 2002, she should choose a validation period that mimics the forecast horizon, thus allowing the evaluation of actual predictive performance.

c)What is the naive forecast for the validation period?

#set the length of the validation period to 12 months(one year)
nvalid<-12

#set the length of the training period to everything else
ntrain <- length(SouvSales.ts) - nvalid


#Partition the data into training and validation periods
SouvTrain<-window(SouvSales.ts, start=c(1995,1), end = c(1995, ntrain))
SouvValid<-window(SouvSales.ts, start=c(1995,ntrain+1), end=c(1995,ntrain+nvalid))

#use the seasonal naive forecast
snaiveForValid <-snaive(SouvTrain, h=nvalid)

#to see the point forecasts from the seasonal naive model
snaiveForValid$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2001  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78
##           Aug      Sep      Oct      Nov      Dec
## 2001 19888.61 23933.38 25391.35 36024.80 80721.71

This is the naive forecast for the validation period using a seasonal naive model.

  1. Compute the RMSE and MAPE for the naive forcasts.

We use the accuracy() function from the forecast library to determine the RMSE and MAPE values for the naive forecasts.

accuracy(snaiveForValid, SouvValid)
##                    ME     RMSE      MAE      MPE     MAPE     MASE
## Training set 3401.361 6467.818 3744.801 22.39270 25.64127 1.000000
## Test set     7828.278 9542.346 7828.278 27.27926 27.27926 2.090439
##                   ACF1 Theil's U
## Training set 0.4140974        NA
## Test set     0.2264895 0.7373759

The RMSE and MAPE for the Test set are \(\approx\) 9542 and 27.28% respectively.

  1. Plot a histogram of the forecast errors that result from the naive forecasts (for the validation period). Plot also a time plot for the naive forecasts and the actual sales numbers in the validation period. What can you say about the behavior of the naive forecasts?
#plot the histogram and store it to use later
hist(snaiveForValid$residuals,breaks=20,probability=TRUE)

#add density curve

lines(density(snaiveForValid$residuals, na.rm=TRUE))

The distribution seems to follow that of a normal curve. We can see that most of the residuals are positive and concentrated in the two bins of 0-2000 and 2001-4000.

#plot the actual values from the validation period(2001)
plot(SouvValid, bty="l", xaxt="n", xlab="The Year 2001", yaxt="n",ylab="Sales")

axis(1, at=seq(2001,2001.91667,0.08333), labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))
axis(2, las=2)

#now add the forecasts and make the line red and dashed
lines(snaiveForValid$mean, col=2, lty=2)

#add a legend
legend(2001, 81000,c("Actual","Forecast"), col=1:2,lty=1:2)

As is indicated by the line plots, the seasonal naive model underpredicts consistently the actual sales throughout the validation period. We also saw that with the histogram of training period residuals showed mostly positive forecast errors and this is consistent with the line plot above.

  1. The analyst found a forecasting model that gives satisfactory performance on the validation set. What must she do to use the forecasting model for generating forecasts for the year 2002?

Before attemptng to forecast future values of the series, the training and validation periods must be recombined into one long series and the chosen method is rerun on the complete data. This final model is then used to forecast future values. Below is the result of using the snaive model from the forecast library to generate 2002 forecasted values:

#set the length of the forecasting period to 12 months(one year)
nforecast<-12

#set the length of the combined training period to all the data
njoin <- length(SouvSales.ts)


#Partition the data into training and validation periods
SouvjoinTrain<-window(SouvSales.ts, start=c(1995,1), end = c(1995, njoin))


#use the seasonal naive forecast on the joined periods
snaiveForecast <-snaive(SouvjoinTrain, h=nforecast)

#to see the point forecasts from the seasonal naive model
snaiveForecast$mean
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2002  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 2002  28586.52  30505.41  30821.33  46634.38 104660.67

These are the forecast point values for 2002 using the snaive model.

autoplot(SouvSales.ts, series="Data") + autolayer(snaiveForecast$mean, series="Seasonal Naive")

And graphically, hereis the actual data plotted in addition to the forecast for the year 2002.

Problem #2 Forecasting Shampoo Sales

setwd("~/USM_MBA_678")
ShampSales <- read.csv("ShampooSales.csv", header=TRUE, stringsAsFactors=FALSE)
str(ShampSales)
## 'data.frame':    36 obs. of  2 variables:
##  $ Month        : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Shampoo.Sales: num  266 146 183 119 180 ...
library(forecast)

ShampSales.ts <- ts(ShampSales$Shampoo.Sales, start=c(1995,1), end=c(1997,12), freq=12)
plot(ShampSales.ts, xlab="Time", ylab="Shampoo Sales",  bty="l")

The Shampoo sales data shows a definite upward trend.

#store the results to plot them later
ComponentsOfShampSales<-decompose(ShampSales.ts)

#output them
ComponentsOfShampSales
## $x
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 1995 266.0 145.9 183.1 119.3 180.3 168.5 231.8 224.5 192.8 122.9 336.5
## 1996 194.3 149.5 210.1 273.3 191.4 287.0 226.0 303.6 289.9 421.6 264.5
## 1997 339.7 440.4 315.9 439.3 401.3 437.4 575.5 407.6 682.0 475.3 581.3
##        Dec
## 1995 185.9
## 1996 342.3
## 1997 646.9
## 
## $seasonal
##             Jan        Feb        Mar        Apr        May        Jun
## 1995 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1996 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
## 1997 -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1995   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1996   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 1997   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 
## $trend
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1995       NA       NA       NA       NA       NA       NA 193.4708
## 1996 216.4250 219.4792 226.8208 243.3125 252.7583 256.2750 268.8500
## 1997 366.3875 385.2833 405.9542 424.5292 439.9667 465.8583       NA
##           Aug      Sep      Oct      Nov      Dec
## 1995 190.6333 191.9083 199.4500 206.3292 211.7292
## 1996 287.0292 303.5583 314.8833 330.5458 345.5583
## 1997       NA       NA       NA       NA       NA
## 
## $random
##              Jan         Feb         Mar         Apr         May
## 1995          NA          NA          NA          NA          NA
## 1996   -2.931076  -67.760243   31.454340    2.396007  -16.558160
## 1997   -7.493576   57.335590  -41.878993  -12.820660    6.133507
##              Jun         Jul         Aug         Sep         Oct
## 1995          NA   35.377257    3.435590    2.062674  -96.845660
## 1996   24.379340  -45.801910  -13.860243  -12.487326   86.421007
## 1997  -34.803993          NA          NA          NA          NA
##              Nov         Dec
## 1995   92.896007  -16.497743
## 1996 -103.320660    6.073090
## 1997          NA          NA
## 
## $figure
##  [1] -19.193924  -2.218924 -48.175174  27.591493 -44.800174   6.345660
##  [7]   2.951910  30.431076  -1.171007  20.295660  37.274826  -9.331424
## 
## $type
## [1] "additive"
## 
## attr(,"class")
## [1] "decomposed.ts"
#now we plot
plot(ComponentsOfShampSales)

By decomposing the Shampoo Sales data, we verify the upward trend of the data as well as the presence of of a seasonal component to the time series.

As the question asks what steps to take regarding forecasting sales in future months, we would use the following steps: As with any time series, the data should be partitioned into training and validation periods. Time plots should be examined for both the training and validation periods. MAPE and RMSE values for the test or validation set should be looked at. Finally, given the seasonal component of the data set, naive forecasts should be computed.

#set the length of the validation period to 12 months(one year)
nvalid<-12

#set the length of the training period to everything else
ntrain <- length(ShampSales.ts) - nvalid


#Partition the data into training and validation periods
ShampTrain<-window(ShampSales.ts, start=c(1995,1), end = c(1995, ntrain))
ShampValid<-window(ShampSales.ts, start=c(1995,ntrain+1), end=c(1995,ntrain+nvalid))

#use the seasonal naive forecast
snaiveForShampValid <-snaive(ShampTrain, h=nvalid)

#to see the point forecasts from the seasonal naive model
snaiveForShampValid$mean
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 1997 194.3 149.5 210.1 273.3 191.4 287.0 226.0 303.6 289.9 421.6 264.5
##        Dec
## 1997 342.3

We use the accuracy() function from the forecast library to determine the RMSE and MAPE values for the naive forecasts.

accuracy(snaiveForShampValid, ShampValid)
##                     ME     RMSE      MAE      MPE     MAPE     MASE
## Training set  66.33333 121.9118  91.2500 19.00793 30.12280 1.000000
## Test set     215.75833 240.4731 215.7583 43.62046 43.62046 2.364475
##                    ACF1 Theil's U
## Training set -0.3328601        NA
## Test set     -0.6413139  1.824886

With a MAPE of 43.62% for the Test set. The predictive value of the naive model is questionable.

Problem #3 Two different models were fit to the same time series. The first 100 time periods were used for the training period and the last 12 periods were treated as a validation period. Assume that both models make sense practically and fit the data reasonably well. Below are the RMSE values for each of the models:

        Training Period       Validation Period

Model A 543 690 Model B 669 675

  1. Which model appears more useful for retrospectively describing the different components of this time series? Why? Model A exhibits a considerably smaller RMSE value for the training error compared to Model B. Thus if we wish to retrospectively describe the time series by looking at the model itself, we would be presumably better off observing the model which was more predictive of the training period data - Model A.

  2. Which model appears to be more useful for forecasting purposes? Why? In terms of which model to use for forecasting purposes, even though Model B has a much higher RMSE for the training period, it has a smaller RMSE for the Validation Period which means that it would presumably have a smaller RMSE when used for forecasting future values.