The topic I am studying is the revenue or sales for Indigo Books and Music, which is Canada’s largest book, gift, and specialty toy retailer. It is important to study this topic because there are larger uncertainty and a more rapid change in today’s business environment, and therefore there is a heavier role to play within predicting future sales (sales forecasting). Prediction becomes more important in order to not lose market shares and has become quite a common practice in the retail industry. Moreover, as competition grows stronger, companies continuously need to come up with new advantages to compete and survive. By forecasting sales, Indigo will be able to better strategize how it wants to operate in the future, in order to meet demands, perhaps realize that demand is decreasing and change needs to be employed. In order to conduct this study, the Bloomberg terminal was used to find quarterly revenue data for the years 2002 quarter one to 2017 quarter three.
To begin with, it is key to conduct statistical analysis and data visualization to get a good understanding of the properties of the data. To get a good insight it is important to determine whether the data has these three components of interest; a trend component, a seasonal component and an irregular component.
indigo_data <- read.csv("~/Dropbox (Personal)/ECON4210/Assignment Two/indigo_data.csv")
indigo_data <- ts(indigo_data, start = 2002, frequency = 4)
data = indigo_data[,"sales"]
plot(data, main="Indigo Quarterly Sales", xlab="Year",ylab="Millions")
Considering trend and seasonal influences are very important for the results of time series analysis. When observing the series against time, it is evident that Indigo sales fluctuate quite dramatically. There have been several periods where the sales peaked, reaching nearly 400 million, however, in a quarter or two they plummet down to just over 150 million. This could be because one or two quarters are more profitable than the other due to various factors, but overall indigo sales stay over the years stays nearly constant. This points towards seasonality. Moreover, there was an upward trend from 2002 to around 2010, but then it shifted down before slightly picking up again in around 2014. This proves that the data is quite random, and there is no evident trend.
summary(data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 135.0 181.5 194.0 222.8 251.0 400.0
mean(data)
## [1] 222.7778
sd(data)
## [1] 66.04775
hist(indigo_data[,"sales"], breaks=50, main="Histogram of Indigo Quarterly Sales", xlab="Millions",ylab="Frequency")
The data has a mean of 222.7778 and a standard deviation of 66.04775. Looking at the histogram, the majority of the data is skewed to the left, with a few bars scattered from approximately 275 million to 400 million. That means the mean was correct in saying that majority of the sales are around 222 million, and the standard deviation was also correct in saying that a lot of points are far off from the averages, meaning there could be a few outliers.
Acf(data, main="Indigo Sales Autocorrelation")
The autocorrelation of Indigo Sales graph shows strong evidence of autocorrelation in sales. The autocorrelation graph shows it peaking upward at when we analyze the residuals and the ACFs of the residuals, it is apparent that there are large spikes on lag 4, 8, 12 and 16 going upwards, and large spikes going downward at 2, 6, 10, and 14. This shows that there is a high degree of both pattern (which appears to be seasonal) and autocorrelation of the residuals. More important, indigo sales shows repeating patterns of positive and negative autocorrelations, typical for seasonality. Therefore, it is safe to assume that indigo sales for this time period show a strong seasonal pattern but no trend. The season pattern is not confused with a cyclical behavior because the period is unchanging and associated with an aspect of the calendar (quarters).Time series decomposition will determine if there are irregular components in the data set, basically determine is there is anything else in the time series.
TSDfit <- stl(data,s.window="periodic", robust=TRUE)
plot(TSDfit)
Due to the fact that there is high seasonality, as evident in both the data and seasonal graph, it is safe to assume that the seasonal pattern is constant through time, and to reflect that the parameter has been set to be infinite, forcing the seasonal component to be periodic (s.window=”periodic”). The entire data set will be used to perform an analysis. However, we are more interested in the remainder component, which is what is left over when the seasonal and trend-cycle components have been subtracted from the data. It is apparent that near around 2012, sales increased in certain quarters, reaching a high at the end of 2017 (corresponding to some large positive values in the remainder component). This should be taken into consideration as it may indicate that a drift component may be necessary to accurately forecast the data.
The next step involves forecasting sales using different models, and then determining which one shows the best fit. To begin with, forecasts will be made using benchmarks, and to determine how accurate the naïve, seasonal naïve and mean model is.
train <- window(data,start=c(2002, 1),end=c(2014, 2))
test <- window(data, start=c(2014,2),end=c(2017, 3))
both <- window(data,start=c(2002, 1))
h=length(test)
Indigofit1 <- meanf(train, h=h)
plot(Indigofit1,main="Indigo Quarterly Sales Average Forecast", xlab="Year",ylab="Millions")
Indigofit2 <- naive(train, h=h)
plot(Indigofit2,main="Indigo Quarterly Sales Naive Forecast ", xlab="Year",ylab="Millions")
Indigofit3 <- snaive(train, h=h)
plot(Indigofit3,main="Indigo Quarterly Sales Seasonal Naive Forecast", xlab="Year",ylab="Millions")
a1Indigofit1 = accuracy(Indigofit1, test)
a2Indigofit2 = accuracy(Indigofit2, test)
a3Indigofit3 = accuracy(Indigofit3, test)
a.table<-rbind(a1Indigofit1,a2Indigofit2, a3Indigofit3)
row.names(a.table)<-c('Mean training','Mean test', 'Naive training', 'Naive test',
'S. Naive training', 'S. Naive test' )
#Order the table according to MASE
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]
kable(a.table, caption="Forecast accuracy",digits = 4 )
ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U | |
---|---|---|---|---|---|---|---|---|
S. Naive training | 2.8913 | 12.1145 | 9.9348 | 1.2172 | 4.8370 | 1.0000 | 0.4454 | NA |
S. Naive test | 20.2308 | 28.4483 | 21.0000 | 7.8957 | 8.1647 | 2.1138 | 0.5471 | 0.2896 |
Mean training | 0.0000 | 59.5876 | 47.8720 | -6.4317 | 21.4118 | 4.8186 | -0.1012 | NA |
Mean test | 30.9077 | 85.7059 | 60.2615 | 4.6263 | 20.3423 | 6.0657 | -0.1527 | 0.8400 |
Naive test | 68.3077 | 105.1482 | 68.3077 | 21.1096 | 21.1096 | 6.8756 | -0.1527 | 1.0195 |
Naive training | 0.8980 | 88.4072 | 68.3265 | -5.8694 | 29.3440 | 6.8775 | -0.3815 | NA |
The following will be looked at: RSME, MAE, MAPE, and MASE. It is evident that in all of those, the seasonal naïve model produces the least amount of error. This is evident because:
In every single scenario, the season naive forecast has the lowest number which indicates the smallest error. Therefore, using these accuracies, the season naive forecast is the most accurate benchmark and I will be comparing that with the time series decomposition and smoothing methods.
Indigofit4.stl <- stl(train,data, s.window="periodic", robust=TRUE)
Indigofit4 <- forecast(Indigofit4.stl, method="rwdrift",h=h)
plot(Indigofit4)
To forecast a decomposed time series, we have to determine is we want to include the seasonal component (which is unchanging) or adjust for the seasonal component. Earlier, we found that there were a fair bit of irregularities apparent in the data from 2012 to 2017, so the forecast has to adjust for seasonality. For time series decomposition, a random walk with drift model is the most appropriate (Seasonal and Trend decomposition using Loess). Next, the smoothing method will be used.
The best smoothing method to use as a prediction model is the Holt (1957) and Winters (1960) extended Holt’s method, which captures seasonality. There are two variations to this method, the multiplicative and additive method, for this time series the additive model is preferred because we observed earlier that we see roughly the same size peaks and troughs throughout the time series. However, for this research paper, I will forecast using both methods to further verify that the additive method is more accurate.
Indigofit5 <- hw(train, seasonal="multiplicative", h=h)
plot(Indigofit5)
Indigofit6 <- hw(train, seasonal="additive", h=h)
plot(Indigofit6)
plot(Indigofit6,ylab="Indigo sales",
plot.conf=FALSE, type="o", fcol="white", xlab="Year", main= "Holt-Winters Seasonal Smoothing Forecasts")
lines(fitted(Indigofit5), col="red", lty=2)
lines(fitted(Indigofit6), col="green", lty=2)
lines(Indigofit5$mean, type="o", col="red")
lines(Indigofit6$mean, type="o", col="green")
legend("topleft",lty=1, pch=1, col=1:3,
c("data","Holt Winters' Additive","Holt Winters' Multiplicative"))
a5Indigofit5 = accuracy (Indigofit5, test)
a6Indigofit6 = accuracy (Indigofit6, test)
a.table<-rbind(a5Indigofit5,a6Indigofit6)
row.names(a.table)<-c('Multiplicative training','Multiplicative test', 'Additive training', 'Additive test')
#Order the table according to MASE
a.table<-as.data.frame(a.table)
a.table<-a.table[order(a.table$MASE),]
a.table
kable(a.table, caption="Forecast accuracy",digits = 4 )
ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U | |
---|---|---|---|---|---|---|---|---|
Additive training | -1.6080 | 9.4452 | 7.3524 | -1.0898 | 3.5492 | 0.7401 | 0.0468 | NA |
Multiplicative training | -2.0446 | 9.7421 | 8.0996 | -1.1484 | 3.8516 | 0.8153 | 0.0627 | NA |
Additive test | 40.5063 | 49.4899 | 40.5063 | 16.2902 | 16.2902 | 4.0772 | 0.5929 | 0.5025 |
Multiplicative test | 44.6692 | 57.7153 | 44.6692 | 16.6514 | 16.6514 | 4.4962 | 0.3261 | 0.5898 |
Comparing the accuracy measures (RMSE, MASE, MAPE, MPE, and MAE) we can see that the forecasting methods that have the lowest errors for the additive method, which proves our earlier statement.
Thus far, we have a seasonal naive model, Holt and Winter additive model and finally a seasonal and trend decomposition using loess random walk with drift model.
a5Indigofit6 = accuracy (Indigofit6, test)
a4Indigofit4 = accuracy (Indigofit4, test)
a3Indigofit3 = accuracy (Indigofit3, test)
a.table2<-rbind(a6Indigofit6, a4Indigofit4, a3Indigofit3)
row.names(a.table2)<-c('Additive training', 'Additive test','SLT training','SLT test','s.naive training','s.naive test' )
#Order the table according to MASE
a.table2<-as.data.frame(a.table2)
a.table2<-a.table2[order(a.table2$MASE),]
kable(a.table2, caption="Forecast accuracy",digits = 4 )
ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U | |
---|---|---|---|---|---|---|---|---|
Additive training | -1.6080 | 9.4452 | 7.3524 | -1.0898 | 3.5492 | 0.7401 | 0.0468 | NA |
SLT training | 0.0000 | 14.4627 | 9.6959 | -0.4611 | 4.4853 | 0.9760 | -0.5401 | NA |
s.naive training | 2.8913 | 12.1145 | 9.9348 | 1.2172 | 4.8370 | 1.0000 | 0.4454 | NA |
s.naive test | 20.2308 | 28.4483 | 21.0000 | 7.8957 | 8.1647 | 2.1138 | 0.5471 | 0.2896 |
SLT test | 29.3170 | 39.9708 | 29.3802 | 10.2083 | 10.2423 | 2.9573 | 0.0630 | 0.4030 |
Additive test | 40.5063 | 49.4899 | 40.5063 | 16.2902 | 16.2902 | 4.0772 | 0.5929 | 0.5025 |
Comparing the accuracy for those model, it is evident that the seasonal naive model has the lowest amount of error and therefore is the best predictor for this data set.
Although it can be surprising, some forecasting methods are very simple and surprisingly effective. This method is the best one because it is versatile and can be used for data with nonlinear relationships. The data set have such a strong seasonal component which is the strongest factor, enough that our results have found that simply taking the previous years sales is the most accurate forecast.
From this research paper, I have learned that the sales data for Indigo shows seasonality, no apparent trend, a remainder component, and is best forecasted using the seasonal naive method. This forecast shows that Indigo sales stay relatively constant throughout time.