happytails <- read.csv("Happy Tails Overall Attendance 2007-2017.csv", header = TRUE)
happytails.ts <- ts(happytails$Attendance, start=c(2007,10), end=c(2017,12), freq=7)
happytails.ts2 = ts(happytails$Attendance, start=c(2007,10), end=c(2017,12), freq=12)
Happy Tails daycare and boarding has been making dogs tails wag for over a decade now. We were lucky enough to get our hands on their overall attendance numbers from 2007 until 2018, in order to help Frank, the owner, predict future attendance numbers and explore the impacts of external factors. Prior to forecasting, however, we wanted to initially explore the data; check for any patterns, trends, or seasonality that could aid in forecasting and future business decisions, as Frank hasn’t had the chance to do so previously.
First things first, we want to visualize the data in its entirety by plotting it.
Initial analysis of the timeseries shows a possible additive or multiplicative trend with seasonality. The pattern of the data appears to show an overall increase of Attendance between 2012 and 2018 and also displays seasonality with a large dip and increase in business levels each year. From around 2010 on, you can see a clear “M” shaped pattern in the data.
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(1, 100), bty="l")
As we began the project, it was noted that Happy Tails had been growing every year. The plot below shows in color that this would indeed be true for every month of the year. The shade of blue transitions from dark to light as time progresses which would indicate that business had indeed continued to grow since the company began.
ggplot(happytails) +
geom_point(aes(x= Attendance, y = Month,
col = Year))
The single lines from the geom_point visual represent the data pretty well, but to look at it from a slightly different perspective, we’ll jitter the date. Jittering the date using the geom_jitter() function creates more density in the plot, and shows here that attendance has indeed increased since the business began.
ggplot(happytails, aes(x= Attendance, y = Month, col = Year)) +
geom_jitter()
A box and whisker plot is a good way to visualize the range of attendance numbers for not only the entire data set, but the range that the majority of attendance falls into as well. As we can see, the majority of attendance counts fall between 39 and 75 dogs, with the mean number being 58.
summary(happytails$Attendance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 39.00 53.00 57.59 75.00 151.00
boxplot(happytails$Attendance, horizontal= TRUE)
Another approach is to use the ggseasonplot() function, which plots a line for each year, instead of each month as above. This can be useful for seeing which years did better than others. It’s evident from the below plot that 2014-2017 had some of their highest attendance numbers, climbing above all other years multiple times, while 2009 and 2010 had some of the lowest attendance counts. This is to be expected as we saw from the initial graph of the data that the totals slowly increase as time goes on.
ggseasonplot(happytails.ts)
We can see that using ggseasonplot() to break the data down by week shows that Friday and Tuesday attendance for all years stayed in a pretty small range, between about 15 and 30.
ggseasonplot(happytails.ts)
The line graph below shows that Fridays seem to have the potential to be the busiest day of the week. Another noticable trend is that Thursdays also show potential to be very high attendance days. This could be due to holidays such as Thanksgiving which typically wrap Thursday into a prolonged weekend.
ggplot(happytails, aes(x= Day, y = Attendance)) +
geom_line(aes(color = Day, group = Day), size = 3)
The histogram below captures density of high attendance days at Happy Tails. We used the “Position Fill” format on this histogram to help provide an image which captures what days of the week have been historically more busy. This histogram captures that Fridays do seem to have higher attendance historically than most other days of the week. It also shows that Sundays are relatively quiet regarding attendance levels. This is likely due to the weekend playing a factor in attendance.
ggplot(happytails, aes(x= Attendance, fill = Day)) +
geom_histogram(aes(y = ..density..),binwidth = 10.0, position = "fill" )
One way to determine if there is seasonality in the data is to use the ets() function on the time series. The output below, ETS(M,N,N), suggests it’s a time series no seaonality.
validLength <- 12
trainLength <- length(happytails.ts) - validLength
# Partition the data into training and validation periods
happytailsTrain <- window(happytails.ts, end=c(2007, trainLength))
happytailsValid <- window(happytails.ts, start=c(2007,trainLength+1))
ets(happytails.ts)
## ETS(M,N,N)
##
## Call:
## ets(y = happytails.ts)
##
## Smoothing parameters:
## alpha = 0.0586
##
## Initial states:
## l = 16.9518
##
## sigma: 0.5115
##
## AIC AICc BIC
## 657.6154 657.9632 664.4867
We will now compute an estimate of the autocorrelation function on the time series with the Acf() function. A strong positive lag-1 autocorrelation means consecutive values generally move in the same direction, and is seen in time series with strong linear trends.
happytailsACF <- Acf(happytails.ts, lag.max = 52)
print(happytailsACF)
##
## Autocorrelations of series 'happytails.ts', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.203 -0.096 -0.059 -0.220 -0.023 0.219 0.191 -0.015 -0.149
## 10 11 12 13 14 15 16 17 18 19
## -0.145 0.023 0.065 -0.040 -0.057 -0.034 -0.114 -0.081 0.138 0.150
## 20 21 22 23 24 25 26 27 28 29
## -0.034 -0.067 -0.177 -0.169 -0.120 0.170 0.172 0.032 -0.126 -0.102
## 30 31 32 33 34 35 36 37 38 39
## -0.015 0.094 0.182 -0.001 -0.037 -0.077 -0.103 0.064 0.086 -0.013
## 40 41 42 43 44 45 46 47 48 49
## -0.033 0.019 -0.058 -0.009 0.014 -0.014 0.061 0.013 0.060 0.012
## 50 51 52
## -0.031 -0.024 -0.031
Lag-0 autocorrelation is 1, which will always be the case, and which shows that the y-values are perfectly correlated with themselves. We see that lags 1,4,6, and 7 have the largest autocorrelations. These large values indicate weekly seasonality. We can also see as the weeks go by, the numbers become less and less correlated.
Differencing the data with a lag-1 to help adjust for a linear trend:
Acf(diff(happytails.ts, lag=1), lag.max=52, main="ACF Plot for Differenced Series")
The plot indicates that most of the seasonality has been captured, though there are a couple lags close to being significant.
Plotting the differenced series:
plot(diff(happytails.ts, ylim=c(10, 100), bty="l"))
This plot confirms that seasonality still exists within the differenced series.
We will also conduct a Ljung-Box Test to check the white noise. A time series is considered white noise if the variables are uncorrelated, or are close to zero. If the box test returns a small p-value (less than .05), we will know it is not white noise.
Box.test(happytails.ts, lag=7, fitdf=0, type= "Ljung-Box")
##
## Box-Ljung test
##
## data: happytails.ts
## X-squared = 14.966, df = 7, p-value = 0.03644
We can see that there is a very small p-value, proving that the data is not white noise, but is in fact correlated as we assumed.
As we have previously mentioned there is some form of seasonality in the time series. Our review of the timeseries thus far has revealed some interesting insight to trends and patterns already, albeit with seasonality in place. A natural next step would be then to supress the seasonality and continue evaluation. We chose to supress the seasonality by both quarter and year. The below quarterly plot confirmed that there is an increase in attendance flucuation as time goes by, and also indicates an increase in overall attendance. By supressing yearly seasonality, it is much more apparent that attendance levels are increasing over time although attendance is seemingly flattening our or declining towards the end of the timeseries.
quarterly <- aggregate(happytails.ts2, nfrequency=4, FUN=sum)
plot(quarterly, ylim= c(0, 200), bty="l", main= "Quarterly")
yearly <- aggregate(happytails.ts, nfrequency=1, FUN=sum)
plot(yearly, ylim= c(0, 200), bty="l", main= "Yearly")
We also decomposed the time series to eliminate any systemic and noise components of the time series. The observed frame reveals upward movement in attendance levels, which is further indicated by the trend frame. The seasonality frame reveals that the pattern is additive and the random frame reveals the noise or unexplained part to the data.
componentsOfhappytails <- decompose(happytails.ts)
plot(componentsOfhappytails)
The output from the ets() function showed no additive or multiplicative trend in the data, but we’ll fit multiple different types of trend lines to the data to see if any of them are a good fit. We’ll fit a quadratic line, an exponential line, and both a linear line with seasonality and one without.
Checking the adjusted R-squared values will determine which model fits the data better. An adjusted R-squared value of .52 would mean 52% of the data can be explained using the chosen method, .75 would mean 75% of the data can be explained, etc.
Linear:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsLinear <- tslm(happytails.ts ~ trend)
summary(happytailsLinear)
##
## Call:
## tslm(formula = happytails.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.896 -4.460 2.201 6.905 22.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.73288 2.47720 7.562 1.08e-10 ***
## trend 0.06609 0.05818 1.136 0.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.47 on 71 degrees of freedom
## Multiple R-squared: 0.01785, Adjusted R-squared: 0.004016
## F-statistic: 1.29 on 1 and 71 DF, p-value: 0.2598
lines(happytailsLinear$fitted, lwd=2)
Linear with seasonality:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsLinearSeasonal = tslm(happytails.ts ~trend + season)
summary(happytailsLinearSeasonal)
##
## Call:
## tslm(formula = happytails.ts ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.468 -4.392 1.718 7.132 24.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.28636 4.02723 4.789 1.01e-05 ***
## trend 0.06970 0.05912 1.179 0.243
## season2 -3.36970 4.75504 -0.709 0.481
## season3 2.02273 4.64620 0.435 0.665
## season4 -2.59242 4.64545 -0.558 0.579
## season5 -3.38939 4.64545 -0.730 0.468
## season6 -0.66061 4.75614 -0.139 0.890
## season7 3.36970 4.75504 0.709 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.63 on 65 degrees of freedom
## Multiple R-squared: 0.07357, Adjusted R-squared: -0.0262
## F-statistic: 0.7374 on 7 and 65 DF, p-value: 0.6411
lines(happytailsLinearSeasonal$fitted, lwd= 2)
Quadratic:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsQuad <- tslm(happytails.ts ~ trend + I(trend^2))
summary(happytailsQuad)
##
## Call:
## tslm(formula = happytails.ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.851 -4.414 2.369 6.909 23.039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.5502283 3.8076004 4.872 6.63e-06 ***
## trend 0.0806985 0.2374583 0.340 0.735
## I(trend^2) -0.0001975 0.0031097 -0.063 0.950
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.55 on 70 degrees of freedom
## Multiple R-squared: 0.01791, Adjusted R-squared: -0.01015
## F-statistic: 0.6381 on 2 and 70 DF, p-value: 0.5313
lines(happytailsQuad$fitted, lty=2, lwd=3)
Exponential:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsExponential <- tslm(happytails.ts ~ trend + season, lambda=0)
summary(happytailsExponential)
##
## Call:
## tslm(formula = happytails.ts ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9908 -0.2309 0.1988 0.6425 1.4640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.981098 0.369293 8.072 2.17e-11 ***
## trend -0.003289 0.005421 -0.607 0.546
## season2 -0.316385 0.436032 -0.726 0.471
## season3 0.297642 0.426052 0.699 0.487
## season4 -0.402614 0.425983 -0.945 0.348
## season5 -0.367555 0.425983 -0.863 0.391
## season6 0.140926 0.436134 0.323 0.748
## season7 0.095217 0.436032 0.218 0.828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9749 on 65 degrees of freedom
## Multiple R-squared: 0.9605, Adjusted R-squared: 0.9562
## F-statistic: 225.7 on 7 and 65 DF, p-value: < 2.2e-16
lines(happytailsExponential$fitted, lwd= 2)
As we can see from the accuracy metrics, the exponential trend line fits the data the best, explaining 96% of it.
During our meeting with Frank, he advised us that his business levels were getting better every year, which they seemingly are. We wanted to take a deeper look at the impact of external events on his business, though. Two particular events that we wanted to review were the impact of the 2007 recession, and the opening of a local competitor in 2015.
To start we will take a closer look at business levels in from 2007-2013 to capture the three years during the recession and the three years following the recession. The attendance levels remain relatively stable from 2007-2010 and begin to slowly rise which could indicate that the business itself is not entirely recession proof.
A few observations can be made from the data. One thing that sticks out is that there’s not as much of a seasonal drop during the recession period as there is from 2010-2013: as we can see the attendance has limited fluctuation. Another interesting thing to note is nearly all the lowest attendance counts actually happened post-recession.
ht2007.ts <- ts(happytails$Attendance, start=c(2007), end=c(2012), freq=12)
plot(ht2007.ts, xlab="Year", ylab="Attendance", ylim=c(0, 100), bty="l")
Roscoe’s Bed and Bark opened on January 1, 2015. They offer similar services to Happy Tails, so we wanted to see if there were any indications of their opening affecting the Happy Tails data.
Roscoes <- window(happytails.ts, 2013, c(2016,12))
plot(Roscoes, xlab="Year", ylab="Attendance", ylim=c(0, 100), bty="l")
Attendance at Happy Tails doesn’t appear to be affected by the opening. Both minimal attendance and maximum attendance continue to appear to trend upward. This would indicate that there is more than enough room in the area for competition to open and still not truly impact business levels.
Now that we have evaluated the data, it is time to start working with forecasting models and applying them. But first, we have to partition the data into training and validation periods. Our validation period will be one year, so we’ll set the validation length to 12.
validLength <- 12
trainLength <- length(happytails.ts) - validLength
# Partition the data into training and validation periods
happytailsTrain <- window(happytails.ts, end=c(2007, trainLength))
happytailsValid <- window(happytails.ts, start=c(2007,trainLength+1))
The seasonal naive forecast simply uses the previous attendance count for that day from the previous season as the forecast. For instance, a time series with weekly seasonality would forecast next monday as having the same amount of dogs as the previous monday.
As we can see from the accompanying accuracy, the accuracy metrics test set are actually pretty low save for the MAPE and MPE, indicating that this wouldn’t be a terrible forecasting method to use.
# Use the seasonal naive forecast
snaiveForValid <- snaive(happytailsTrain, h=validLength)
autoplot(snaiveForValid)
accuracy(snaiveForValid, happytailsValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.7555556 9.189366 6.888889 -140.0864 177.0335 1.000000
## Test set 2.3333333 14.623041 10.833333 -184.2225 224.9887 1.572581
## ACF1 Theil's U
## Training set 0.2044071 NA
## Test set 0.4067804 0.3411266
summary(snaiveForValid, happytailsValid)
##
## Forecast method: Seasonal naive method
##
## Model Information:
## Call: snaive(y = happytailsTrain, h = validLength)
##
## Residual sd: 9.2617
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.7555556 9.189366 6.888889 -140.0864 177.0335 1 0.2044071
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 28 16.223354 39.77665 9.989174 46.01083
## 2015.857 35 23.223354 46.77665 16.989174 53.01083
## 2016.000 32 20.223354 43.77665 13.989174 50.01083
## 2016.143 24 12.223354 35.77665 5.989174 42.01083
## 2016.286 29 17.223354 40.77665 10.989174 47.01083
## 2016.429 3 -8.776646 14.77665 -15.010826 21.01083
## 2016.571 3 -8.776646 14.77665 -15.010826 21.01083
## 2016.714 28 11.345307 44.65469 2.528845 53.47115
## 2016.857 35 18.345307 51.65469 9.528845 60.47115
## 2017.000 32 15.345307 48.65469 6.528845 57.47115
## 2017.143 24 7.345307 40.65469 -1.471155 49.47115
## 2017.286 29 12.345307 45.65469 3.528845 54.47115
Simple exponential smoothing is a form of a weighted moving average that places more weight on more recent data. So, last Monday’s attendance and the Monday before that will hold more weight when forecasting than will the attendance count from Monday ten years ago. We don’t explicity select a value for alpha to allow the ses() function to select one on its own.
It’s evident from the plot that this isn’t an accurate forecasting method, as it didn’t account for any dips in the attendance numbers.
ses_forecast = ses(happytailsTrain, h=12)
autoplot(ses_forecast)
accuracy(ses_forecast, happytailsValid)
## ME RMSE MAE MPE MAPE
## Training set -0.0009410753 8.683206 6.772228 -140.5818 165.6509
## Test set 7.8842759928 12.854771 11.922851 -128.2052 188.7422
## MASE ACF1 Theil's U
## Training set 0.9830653 0.2388391 NA
## Test set 1.7307364 0.1081245 0.1350953
summary(ses_forecast, happytailsValid)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = happytailsTrain, h = 12)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 19.6157
##
## sigma: 8.6832
##
## AIC AICc BIC
## 436.2493 436.7493 442.1031
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0009410753 8.683206 6.772228 -140.5818 165.6509 0.9830653
## ACF1
## Training set 0.2388391
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 19.61572 8.487747 30.7437 2.596952 36.6345
## 2015.857 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.000 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.143 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.286 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.429 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.571 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.714 19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.857 19.61572 8.487747 30.7437 2.596952 36.6345
## 2017.000 19.61572 8.487747 30.7437 2.596952 36.6345
## 2017.143 19.61572 8.487747 30.7437 2.596952 36.6345
## 2017.286 19.61572 8.487747 30.7437 2.596952 36.6345
The Holt_Winter method is an exponential smoothing method that can handle data sets with both seasonality and trend. It can be used with both additive and multiplicative seasonality.
Additive Seasonality: The model appears to fit well with a lower MAPE and RMSE values, though the AICc value is slightly larger.
hw_additive = hw(happytailsTrain, seasonal = "additive", h=12)
autoplot(hw_additive)
summary(hw_additive)
##
## Forecast method: Holt-Winters' additive method
##
## Model Information:
## Holt-Winters' additive method
##
## Call:
## hw(y = happytailsTrain, h = 12, seasonal = "additive")
##
## Smoothing parameters:
## alpha = 1e-04
## beta = 1e-04
## gamma = 0.0033
##
## Initial states:
## l = 19.7922
## b = -0.0065
## s=2.2822 2.9427 0.6308 -0.0665 -4.9743 -2.8079
## 1.9931
##
## sigma: 8.1633
##
## AIC AICc BIC
## 447.8278 455.8278 471.2428
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.09534931 8.163279 6.505222 -119.9818 144.613 0.9443064
## ACF1
## Training set 0.2335874
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 19.44097 8.979309 29.90263 3.4412393 35.44070
## 2015.857 20.12026 9.658597 30.58192 4.1205274 36.11999
## 2016.000 22.39668 11.935018 32.85834 6.3969479 38.39641
## 2016.143 21.74094 11.279278 32.20261 5.7412072 37.74068
## 2016.286 21.48957 11.027904 31.95124 5.4898334 37.48931
## 2016.429 16.63959 6.177920 27.10125 0.6398475 32.63933
## 2016.571 14.45288 3.991209 24.91455 -1.5468642 30.45262
## 2016.714 19.39903 8.937270 29.86078 3.3991506 35.39890
## 2016.857 20.07832 9.616554 30.54008 4.0784325 36.07820
## 2017.000 22.35474 11.892971 32.81650 6.3548458 38.35463
## 2017.143 21.69900 11.237225 32.16077 5.6990967 37.69890
## 2017.286 21.44763 10.985845 31.90941 5.4477134 37.44754
Multiplicative:
We can see the errors are slightly higher when using the multiplicative seasonality, meaning we should have stuck to the additive.
hw_mult = hw(happytailsTrain, seasonal = "multiplicative", h=12)
autoplot(hw_mult)
summary(hw_mult)
##
## Forecast method: Holt-Winters' multiplicative method
##
## Model Information:
## Holt-Winters' multiplicative method
##
## Call:
## hw(y = happytailsTrain, h = 12, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha = 2e-04
## beta = 1e-04
## gamma = 5e-04
##
## Initial states:
## l = 19.9365
## b = -9e-04
## s=1.0627 1.0751 1.0693 1.0561 0.7201 0.8462
## 1.1705
##
## sigma: 0.4653
##
## AIC AICc BIC
## 459.2773 467.2773 482.6922
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2180287 8.125259 6.45653 -122.4453 145.6172 0.9372382
## ACF1
## Training set 0.2352681
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 21.06072 8.501906 33.61953 1.853672 40.26777
## 2015.857 21.31988 8.606525 34.03323 1.876481 40.76328
## 2016.000 21.43775 8.654105 34.22139 1.886854 40.98864
## 2016.143 21.18809 8.553320 33.82286 1.864877 40.51130
## 2016.286 23.33149 9.418578 37.24441 2.053525 44.60946
## 2016.429 16.86626 6.808658 26.92387 1.484483 32.24804
## 2016.571 14.34923 5.792566 22.90590 1.262942 27.43552
## 2016.714 21.04440 8.495290 33.59351 1.852192 40.23661
## 2016.857 21.30336 8.599820 34.00690 1.874972 40.73175
## 2017.000 21.42113 8.647354 34.19491 1.885323 40.95694
## 2017.143 21.17167 8.546637 33.79670 1.863350 40.47998
## 2017.286 23.31341 9.411206 37.21561 2.051825 44.57499
The ets models are the most flexible of the smoothing methods and automatically select the best model choosing the one with the lowest AICc.
We can see from the output that it chose ETS(M,N,N) as the optimal model, which means it has multiplicative error, no trend, and no seasonality. It then uses this information to forecast, and as we can see has a slightly larger AICc value than Holt-Winter’s method, but is same as the simple exponential smoothing.
ets_fit <- ets(happytailsTrain)
ets_forecast <- forecast(ets_fit)
plot(ets_forecast)
summary(ets_forecast, happytailsValid)
##
## Forecast method: ETS(M,N,N)
##
## Model Information:
## ETS(M,N,N)
##
## Call:
## ets(y = happytailsTrain)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 19.6099
##
## sigma: 0.4428
##
## AIC AICc BIC
## 436.2462 436.7462 442.1000
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.004824043 8.683208 6.774002 -140.5111 165.6113 0.9833229
## ACF1
## Training set 0.2388391
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 19.60997 8.482661 30.73729 2.592218 36.62773
## 2015.857 19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.000 19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.143 19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.286 19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.429 19.60997 8.482661 30.73729 2.592217 36.62773
## 2016.571 19.60997 8.482661 30.73729 2.592217 36.62773
## 2016.714 19.60997 8.482661 30.73729 2.592217 36.62773
## 2016.857 19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.000 19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.143 19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.286 19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.429 19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.571 19.60997 8.482661 30.73729 2.592217 36.62773
An AR model is an autogressive linear model that predicts the value of a time series using the preceding value in time. We’re going to use the auto.arima() function and let it choose the best model for us. As we can see, it chose an ARIMA(0,0,1)(0,1,1)[7] which has the smallest AICc of all previous forecasts.
arima_fit = auto.arima(happytailsTrain, stepwise = FALSE)
arima_forecast = forecast(arima_fit)
plot(arima_forecast)
summary(arima_forecast, happytailsValid)
##
## Forecast method: ARIMA(0,0,1)(0,1,1)[7]
##
## Model Information:
## Series: happytailsTrain
## ARIMA(0,0,1)(0,1,1)[7]
##
## Coefficients:
## ma1 sma1
## 0.3039 -0.3145
## s.e. 0.1702 0.1679
##
## sigma^2 estimated as 75.74: log likelihood=-160.61
## AIC=327.21 AICc=327.8 BIC=332.63
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.482288 7.914054 5.575024 -110.6714 135.3323 0.8092776
## ACF1
## Training set -0.04183417
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 25.745994 14.592716 36.89927 8.6885266 42.80346
## 2015.857 31.599251 19.942457 43.25604 13.7717235 49.42678
## 2016.000 30.377460 18.720666 42.03425 12.5499328 48.20499
## 2016.143 21.506229 9.849435 33.16302 3.6787015 39.33376
## 2016.286 27.004357 15.347564 38.66115 9.1768307 44.83188
## 2016.429 4.002981 -7.653812 15.65977 -13.8245453 21.83051
## 2016.571 3.662316 -7.994477 15.31911 -14.1652103 21.48984
## 2016.714 25.817852 11.877722 39.75798 4.4982628 47.13744
## 2016.857 31.599251 17.466895 45.73161 9.9856777 53.21282
## 2017.000 30.377460 16.245104 44.50982 8.7638870 51.99103
## 2017.143 21.506229 7.373873 35.63858 -0.1073443 43.11980
## 2017.286 27.004357 12.872002 41.13671 5.3907848 48.61793
## 2017.429 4.002981 -10.129374 18.13534 -17.6105912 25.61655
## 2017.571 3.662316 -10.470039 17.79467 -17.9512563 25.27589
A good way to take advantage of autocorrelation is to create a second-level forecasting model on the residuals. This approach is often useful for short-term forecasting, and includes a 3 step process where an AR model is fitted to the series of residuals instead of the raw data series, and is then used to forecast future residuals.
Step 1: fit a multiple linear regression model that accounts for linear trend and additive seasonality.
Step 2: fit an AR model to the forecast errors from Step One.
Step 3: compute the adjusted forecast from steps 1 and 2
Step 1: As the data has already been previously partioned, we will fit the model with linear trend and additive seasonality to it and check the residuals.
happytailsLinearSeason <- tslm(happytailsTrain ~ trend + season)
accuracy(happytailsLinearSeason)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.41507e-16 8.104593 6.473376 -116.773 140.8127 0.9558764
resACF <- Acf(happytailsLinearSeason$residuals)
print(resACF)
##
## Autocorrelations of series 'happytailsLinearSeason$residuals', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.236 -0.192 -0.309 -0.357 -0.026 0.110 0.269 0.013 -0.250
## 10 11 12 13 14 15 16 17
## -0.079 -0.038 0.038 0.004 -0.019 0.072 0.076 0.048
We can see there are still multiple significant lags, meaning there is still autocorrelation present.
Step 2:
Before going forward, we have to calculate the t statistic, which is the coefficient divided by the standard error, and then use that to estimate the p-value on the normal distribution. (Anything greater that 2 or -2 is statistically significant)
ARModel = Arima(happytailsLinearSeason$residuals, order = c(1,0,0))
ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]
## ar1
## 1.766488
2*pnorm(-abs(ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]))
## ar1
## 0.07731401
We can now see if we have accounted for autocorrelation in the series:
Acf(ARModel$residuals)
We can tell from the plot that we haven’t captured all the autocorrelations, as there is still significant correlation at lag 4 and lag 7.
Step 3
Then we wanted to forecast using both the linear regression model and the autocorrelation of the error terms. We first created them separetely, and then added them together and ploted it to create the adjusted forecast.
We can see that we want to adjust the straight regression forecast down from June through July, October through December and February (negative values for the error forecast). We will also want to adjust it up for the positive months on the error forecast.
lrForecast <- forecast(happytailsLinearSeason, h=validLength)
errorForecast <- forecast(ARModel, h=validLength)
adjustedForecast <- lrForecast$mean + errorForecast$mean
#plot it
plot(happytailsValid, bty="l", xaxt="n", xlab="Year 2017", yaxt="n", ylab="Attendance", ylim=c(1,120))
# Add the x-axis
axis(1, at=seq(2017, 2017 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Add the y-axis
axis(2, at=seq(10,130,10), labels=format(seq(10,130,10)), las=2)
# Add the adjusted forecast
lines(adjustedForecast, col="blue")
# Add a legend
legend(2017,120, c("Actuals", "Adjusted Forecast"), lty=c(1,1), col=c("black", "blue"), bty="n")
accuracy(adjustedForecast, happytailsValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8.069135 13.46635 11.94609 -136.343 196.782 0.2124555 0.07828943
summary(adjustedForecast, happytailsValid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.12 18.94 20.31 19.43 20.88 22.23
We can see that we want to adjust the straight regression forecast as it underforecasted for most of the plot.
A TBATS model can be used for a times series that exhibits multiple complex seasonalities, and takes into account autocorrelation in the residuals.
tbats_forecast = msts(happytailsTrain, seasonal.periods = c(7))
fit1= tbats(tbats_forecast)
fc_tbats = forecast(fit1)
autoplot(fc_tbats)
accuracy(fc_tbats, happytailsValid)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2780325 8.641762 6.82315 -135.8291 162.2391 0.9904572
## Test set 9.6929901 14.316217 13.06642 -100.1209 166.8231 1.8967386
## ACF1 Theil's U
## Training set 0.23857611 NA
## Test set 0.01801802 0.1592408
summary(fc_tbats, happytailsValid)
##
## Forecast method: BATS(1, {0,0}, -, -)
##
## Model Information:
## BATS(1, {0,0}, -, -)
##
## Call: tbats(y = tbats_forecast)
##
## Parameters
## Alpha: -0.00992644
##
## Seed States:
## [,1]
## [1,] 19.45052
##
## Sigma: 8.641762
## AIC: 433.7517
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2780325 8.641762 6.82315 -135.8291 162.2391 0.9904572
## ACF1
## Training set 0.2385761
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 19.30701 8.232146 30.38187 2.369468 36.24455
## 2015.857 19.30701 8.231601 30.38242 2.368633 36.24539
## 2016.000 19.30701 8.231055 30.38296 2.367799 36.24622
## 2016.143 19.30701 8.230510 30.38351 2.366965 36.24706
## 2016.286 19.30701 8.229964 30.38406 2.366130 36.24789
## 2016.429 19.30701 8.229419 30.38460 2.365296 36.24872
## 2016.571 19.30701 8.228873 30.38515 2.364462 36.24956
## 2016.714 19.30701 8.228328 30.38569 2.363627 36.25039
## 2016.857 19.30701 8.227782 30.38624 2.362793 36.25123
## 2017.000 19.30701 8.227237 30.38678 2.361959 36.25206
## 2017.143 19.30701 8.226691 30.38733 2.361125 36.25289
## 2017.286 19.30701 8.226146 30.38787 2.360291 36.25373
## 2017.429 19.30701 8.225601 30.38842 2.359457 36.25456
## 2017.571 19.30701 8.225055 30.38896 2.358623 36.25540
We can see from the accuracy() function that the TBATS model doesn’t perform as well as the ARIMA model.
Based on our analysis, the ARIMA model fits our data the best and produces the most accurate forecasts. Below is a reproduction of the forecasting model that we applied to the existing data set. As noted previously, the AICc is the lowest of all forecasting models we applied at 327. The MAPE is still high relative to the other accuracy metrics, but is still lower than that of other methods.
arima_fit = auto.arima(happytailsTrain, stepwise = FALSE)
arima_forecast = forecast(arima_fit)
plot(arima_forecast)
summary(arima_forecast, happytailsValid)
##
## Forecast method: ARIMA(0,0,1)(0,1,1)[7]
##
## Model Information:
## Series: happytailsTrain
## ARIMA(0,0,1)(0,1,1)[7]
##
## Coefficients:
## ma1 sma1
## 0.3039 -0.3145
## s.e. 0.1702 0.1679
##
## sigma^2 estimated as 75.74: log likelihood=-160.61
## AIC=327.21 AICc=327.8 BIC=332.63
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.482288 7.914054 5.575024 -110.6714 135.3323 0.8092776
## ACF1
## Training set -0.04183417
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015.714 25.745994 14.592716 36.89927 8.6885266 42.80346
## 2015.857 31.599251 19.942457 43.25604 13.7717235 49.42678
## 2016.000 30.377460 18.720666 42.03425 12.5499328 48.20499
## 2016.143 21.506229 9.849435 33.16302 3.6787015 39.33376
## 2016.286 27.004357 15.347564 38.66115 9.1768307 44.83188
## 2016.429 4.002981 -7.653812 15.65977 -13.8245453 21.83051
## 2016.571 3.662316 -7.994477 15.31911 -14.1652103 21.48984
## 2016.714 25.817852 11.877722 39.75798 4.4982628 47.13744
## 2016.857 31.599251 17.466895 45.73161 9.9856777 53.21282
## 2017.000 30.377460 16.245104 44.50982 8.7638870 51.99103
## 2017.143 21.506229 7.373873 35.63858 -0.1073443 43.11980
## 2017.286 27.004357 12.872002 41.13671 5.3907848 48.61793
## 2017.429 4.002981 -10.129374 18.13534 -17.6105912 25.61655
## 2017.571 3.662316 -10.470039 17.79467 -17.9512563 25.27589
Up to this point, we’ve only forecasted our validation period to measure the accuracy of the forecast. Now, we’ll apply the forecaset to the entire dataset to predict future attendance numbers. We’ll forecast for the next 5 months.
arima_fit = arima(happytails.ts, order= c(0,0,1), seasonal= c(0,1,1))
arima_forecast = forecast(arima_fit, h=5)
plot(arima_forecast)
summary(arima_forecast)
##
## Forecast method: ARIMA(0,0,1)(0,1,1)[7]
##
## Model Information:
##
## Call:
## arima(x = happytails.ts, order = c(0, 0, 1), seasonal = c(0, 1, 1))
##
## Coefficients:
## ma1 sma1
## 0.3045 -0.6637
## s.e. 0.1495 0.1624
##
## sigma^2 estimated as 123.6: log likelihood = -254.7, aic = 515.4
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.9146159 10.57238 7.804503 -150.131 180.6143 0.8760157
## ACF1
## Training set -0.03582091
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2018.714 27.48839 13.237840 41.73893 5.694058 49.28271
## 2018.857 33.02876 18.132252 47.92528 10.246515 55.81101
## 2019.000 20.58449 5.687983 35.48101 -2.197754 43.36674
## 2019.143 11.83076 -3.065696 26.72722 -10.951404 34.61293
## 2019.286 25.91576 11.019891 40.81163 3.134494 48.69703
As we can see, our point forecasted attendance for the next 5 months is 27, 33, 21, 12, and 26, respectively. It will be interesting to see how accurate these predictions are.