Term Paper Stage One

The topic I am studying is the revenue or sales for Indigo Books and Music, which is Canada’s largest book, gift, and specialty toy retailer. It is important to study this topic because there are larger uncertainty and a more rapid change in today’s business environment, and therefore there is a heavier role to play within predicting future sales (sales forecasting). Prediction becomes more important in order to not lose market shares and has become quite a common practice in the retail industry. Moreover, as competition grows stronger, companies continuously need to come up with new advantages to compete and survive. By forecasting sales, Indigo will be able to better strategize how it wants to operate in the future, in order to meet demands, perhaps realize that demand is decreasing and change needs to be employed. In order to conduct this study, the Bloomberg terminal was used to find quarterly revenue data for the years 2002 quarter one to 2017 quarter three.

To begin with, it is key to conduct statistical analysis and data visualization to get a good understanding of the properties of the data. To get a good insight it is important to determine whether the data has these three components of interest; a trend component, a seasonal component and an irregular component.

Load Library and Indigo sales data, indicate that it is a time series, and create a name variable

indigo_data <- read.csv("~/Dropbox (Personal)/ECON4210/Assignment Two/indigo_data.csv")
indigo_data <- ts(indigo_data, start = 2002, frequency = 4)
data = indigo_data[,"sales"]
plot(data, main="Indigo Quarterly Sales", xlab="Year",ylab="Millions")

Considering trend and seasonal influences are very important for the results of time series analysis. When observing the series against time, it is evident that Indigo sales fluctuate quite dramatically. There have been several periods where the sales peaked, reaching nearly 400 million, however, in a quarter or two they plummet down to just over 150 million. This could be because one or two quarters are more profitable than the other due to various factors, but overall indigo sales stay over the years stays nearly constant. This points towards seasonality. Moreover, there was an upward trend from 2002 to around 2010, but then it shifted down before slightly picking up again in around 2014. This proves that the data is quite random, and there is no evident trend.

Plot Histogram and determine summary statistics

summary(data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   181.5   194.0   222.8   251.0   400.0

mean(data)

## [1] 222.7778

sd(data)

## [1] 66.04775

hist(indigo_data[,"sales"], breaks=50, main="Histogram of Indigo Quarterly Sales", xlab="Millions",ylab="Frequency")

The data has a mean of 222.7778 and a standard deviation of 66.04775. Looking at the histogram, the majority of the data is skewed to the left, with a few bars scattered from approximately 275 million to 400 million. That means the mean was correct in saying that majority of the sales are around 222 million, and the standard deviation was also correct in saying that a lot of points are far off from the averages, meaning there could be a few outliers.

Plot Autocorrelation

Acf(data, main="Indigo Sales Autocorrelation")

The autocorrelation of Indigo Sales graph shows strong evidence of autocorrelation in sales. The autocorrelation graph shows it peaking upward at when we analyze the residuals and the ACFs of the residuals, it is apparent that there are large spikes on lag 4, 8, 12 and 16 going upwards, and large spikes going downward at 2, 6, 10, and 14. This shows that there is a high degree of both pattern (which appears to be seasonal) and autocorrelation of the residuals. More important, indigo sales shows repeating patterns of positive and negative autocorrelations, typical for seasonality. Therefore, it is safe to assume that indigo sales for this time period show a strong seasonal pattern but no trend. The season pattern is not confused with a cyclical behavior because the period is unchanging and associated with an aspect of the calendar (quarters).Time series decomposition will determine if there are irregular components in the data set, basically determine is there is anything else in the time series.

Conduct a STL decomposition

TSDfit <- stl(data,s.window="periodic", robust=TRUE)
plot(TSDfit)

Due to the fact that there is high seasonality, as evident in both the data and seasonal graph, it is safe to assume that the seasonal pattern is constant through time, and to reflect that the parameter has been set to be infinite, forcing the seasonal component to be periodic (s.window=”periodic”). The entire data set will be used to perform an analysis. However, we are more interested in the remainder component, which is what is left over when the seasonal and trend-cycle components have been subtracted from the data. It is apparent that near around 2012, sales increased in certain quarters, reaching a high at the end of 2017 (corresponding to some large positive values in the remainder component). This should be taken into consideration as it may indicate that a drift component may be necessary to accurately forecast the data.

The next step involves forecasting sales using different models, and then determining which one shows the best fit. To begin with, forecasts will be made using benchmarks, and to determine how accurate the naïve, seasonal naïve and mean model is.

Set the training and testing windows

train <- window(data,start=c(2002, 1),end=c(2014, 2))
test <- window(data, start=c(2014,2),end=c(2017, 3))
both <- window(data,start=c(2002, 1))
h=length(test)

Forecast for Average, Naïve, Seasonal Naïve Methods with benchmarks

Indigofit1 <- meanf(train, h=h)
plot(Indigofit1,main="Indigo Quarterly Sales Average Forecast", xlab="Year",ylab="Millions")

Indigofit2 <- naive(train, h=h)
plot(Indigofit2,main="Indigo Quarterly Sales Naive Forecast ", xlab="Year",ylab="Millions")

Indigofit3 <- snaive(train, h=h)
plot(Indigofit3,main="Indigo Quarterly Sales Seasonal Naive Forecast", xlab="Year",ylab="Millions")

Determine the accuracy of the benchmark models

a1Indigofit1 = accuracy(Indigofit1, test)
a2Indigofit2 = accuracy(Indigofit2, test)
a3Indigofit3 = accuracy(Indigofit3, test)

Combine the forecast summary statistics into a table with row names

a.table<-rbind(a1Indigofit1,a2Indigofit2, a3Indigofit3)

row.names(a.table)<-c('Mean training','Mean test', 'Naive training', 'Naive test', 
                      'S. Naive training', 'S. Naive test' )

#Order the table according to MASE
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]

Make the table look nice

kable(a.table, caption="Forecast accuracy",digits = 4 )

Forecast accuracy
	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1	Theil’s U
S. Naive training	2.8913	12.1145	9.9348	1.2172	4.8370	1.0000	0.4454	NA
S. Naive test	20.2308	28.4483	21.0000	7.8957	8.1647	2.1138	0.5471	0.2896
Mean training	0.0000	59.5876	47.8720	-6.4317	21.4118	4.8186	-0.1012	NA
Mean test	30.9077	85.7059	60.2615	4.6263	20.3423	6.0657	-0.1527	0.8400
Naive test	68.3077	105.1482	68.3077	21.1096	21.1096	6.8756	-0.1527	1.0195
Naive training	0.8980	88.4072	68.3265	-5.8694	29.3440	6.8775	-0.3815	NA

The following will be looked at: RSME, MAE, MAPE, and MASE. It is evident that in all of those, the seasonal naïve model produces the least amount of error. This is evident because:

RSME: 105.14825 > 85.70588 > 28.44833

MAE: 68.30769 > 60.26154 > 21

MAPE: 21.10956 > 20.34231> 8.164698

MASE: 6.87561 > 6.065713 > 2.113786

In every single scenario, the season naive forecast has the lowest number which indicates the smallest error. Therefore, using these accuracies, the season naive forecast is the most accurate benchmark and I will be comparing that with the time series decomposition and smoothing methods.

Forecast using time series decomposition

Indigofit4.stl <- stl(train,data, s.window="periodic", robust=TRUE)
Indigofit4 <- forecast(Indigofit4.stl, method="rwdrift",h=h)
plot(Indigofit4)

To forecast a decomposed time series, we have to determine is we want to include the seasonal component (which is unchanging) or adjust for the seasonal component. Earlier, we found that there were a fair bit of irregularities apparent in the data from 2012 to 2017, so the forecast has to adjust for seasonality. For time series decomposition, a random walk with drift model is the most appropriate (Seasonal and Trend decomposition using Loess). Next, the smoothing method will be used.

The best smoothing method to use as a prediction model is the Holt (1957) and Winters (1960) extended Holt’s method, which captures seasonality. There are two variations to this method, the multiplicative and additive method, for this time series the additive model is preferred because we observed earlier that we see roughly the same size peaks and troughs throughout the time series. However, for this research paper, I will forecast using both methods to further verify that the additive method is more accurate.

Forecast using Holt-Winters smoothing method

Indigofit5 <- hw(train, seasonal="multiplicative", h=h) 
plot(Indigofit5)

Indigofit6 <- hw(train, seasonal="additive", h=h) 
plot(Indigofit6)

plot(Indigofit6,ylab="Indigo sales",
     plot.conf=FALSE, type="o", fcol="white", xlab="Year", main= "Holt-Winters Seasonal Smoothing Forecasts")
lines(fitted(Indigofit5), col="red", lty=2)
lines(fitted(Indigofit6), col="green", lty=2)
lines(Indigofit5$mean, type="o", col="red")
lines(Indigofit6$mean, type="o", col="green")
legend("topleft",lty=1, pch=1, col=1:3, 
       c("data","Holt Winters' Additive","Holt Winters' Multiplicative"))

Compare the accuracy

a5Indigofit5 = accuracy (Indigofit5, test)
a6Indigofit6 = accuracy (Indigofit6, test)

Combine the forecast summary statistics into a table with row names

a.table<-rbind(a5Indigofit5,a6Indigofit6)

row.names(a.table)<-c('Multiplicative training','Multiplicative test', 'Additive  training', 'Additive test')

#Order the table according to MASE
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]
a.table

Make the table look nice

kable(a.table, caption="Forecast accuracy",digits = 4 )

Forecast accuracy
	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1	Theil’s U
Additive training	-1.6080	9.4452	7.3524	-1.0898	3.5492	0.7401	0.0468	NA
Multiplicative training	-2.0446	9.7421	8.0996	-1.1484	3.8516	0.8153	0.0627	NA
Additive test	40.5063	49.4899	40.5063	16.2902	16.2902	4.0772	0.5929	0.5025
Multiplicative test	44.6692	57.7153	44.6692	16.6514	16.6514	4.4962	0.3261	0.5898

Comparing the accuracy measures (RMSE, MASE, MAPE, MPE, and MAE) we can see that the forecasting methods that have the lowest errors for the additive method, which proves our earlier statement.

Thus far, we have a seasonal naive model, Holt and Winter additive model and finally a seasonal and trend decomposition using loess random walk with drift model.

Compare the accuracy

a5Indigofit6 = accuracy (Indigofit6, test)
a4Indigofit4 = accuracy (Indigofit4, test)
a3Indigofit3 = accuracy (Indigofit3, test)

Combine the forecast summary statistics into a table with row names

a.table2<-rbind(a6Indigofit6, a4Indigofit4, a3Indigofit3)
row.names(a.table2)<-c('Additive  training', 'Additive test','SLT training','SLT test','s.naive training','s.naive test' )

#Order the table according to MASE
a.table2<-as.data.frame(a.table2)
a.table2<-a.table2[order(a.table2$MASE),]

Make the table look nice

kable(a.table2, caption="Forecast accuracy",digits = 4 )

Forecast accuracy
	ME	RMSE	MAE	MPE	MAPE	MASE	ACF1	Theil’s U
Additive training	-1.6080	9.4452	7.3524	-1.0898	3.5492	0.7401	0.0468	NA
SLT training	0.0000	14.4627	9.6959	-0.4611	4.4853	0.9760	-0.5401	NA
s.naive training	2.8913	12.1145	9.9348	1.2172	4.8370	1.0000	0.4454	NA
s.naive test	20.2308	28.4483	21.0000	7.8957	8.1647	2.1138	0.5471	0.2896
SLT test	29.3170	39.9708	29.3802	10.2083	10.2423	2.9573	0.0630	0.4030
Additive test	40.5063	49.4899	40.5063	16.2902	16.2902	4.0772	0.5929	0.5025

Comparing the accuracy for those model, it is evident that the seasonal naive model has the lowest amount of error and therefore is the best predictor for this data set.

Although it can be surprising, some forecasting methods are very simple and surprisingly effective. This method is the best one because it is versatile and can be used for data with nonlinear relationships. The data set have such a strong seasonal component which is the strongest factor, enough that our results have found that simply taking the previous years sales is the most accurate forecast.

From this research paper, I have learned that the sales data for Indigo shows seasonality, no apparent trend, a remainder component, and is best forecasted using the seasonal naive method. This forecast shows that Indigo sales stay relatively constant throughout time.

Term Paper Stage One

Niha Shamim

February 27, 2018

Load Library and Indigo sales data, indicate that it is a time series, and create a name variable

Plot Histogram and determine summary statistics

Plot Autocorrelation

Conduct a STL decomposition

Set the training and testing windows

Forecast for Average, Naïve, Seasonal Naïve Methods with benchmarks

Determine the accuracy of the benchmark models

Combine the forecast summary statistics into a table with row names

Make the table look nice

RSME: 105.14825 > 85.70588 > 28.44833

MAE: 68.30769 > 60.26154 > 21

MAPE: 21.10956 > 20.34231> 8.164698

MASE: 6.87561 > 6.065713 > 2.113786

Forecast using time series decomposition

Forecast using Holt-Winters smoothing method

Compare the accuracy

Combine the forecast summary statistics into a table with row names

Make the table look nice

Compare the accuracy

Combine the forecast summary statistics into a table with row names

Make the table look nice