HappyTails

happytails <- read.csv("Happy Tails Overall Attendance 2007-2017.csv", header = TRUE)

happytails.ts <- ts(happytails$Attendance, start=c(2007,10), end=c(2017,12), freq=7)

happytails.ts2 =  ts(happytails$Attendance, start=c(2007,10), end=c(2017,12), freq=12)

INITIAL DATA VISUALIZATION

Happy Tails daycare and boarding has been making dogs tails wag for over a decade now. We were lucky enough to get our hands on their overall attendance numbers from 2007 until 2018, in order to help Frank, the owner, predict future attendance numbers and explore the impacts of external factors. Prior to forecasting, however, we wanted to initially explore the data; check for any patterns, trends, or seasonality that could aid in forecasting and future business decisions, as Frank hasn’t had the chance to do so previously.

Plotting The Data

First things first, we want to visualize the data in its entirety by plotting it.

Overall Attendance 2007 - 2017

Initial analysis of the timeseries shows a possible additive or multiplicative trend with seasonality. The pattern of the data appears to show an overall increase of Attendance between 2012 and 2018 and also displays seasonality with a large dip and increase in business levels each year. From around 2010 on, you can see a clear “M” shaped pattern in the data.

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(1, 100), bty="l")

Plotting Attendance Over Time Using geom_point

As we began the project, it was noted that Happy Tails had been growing every year. The plot below shows in color that this would indeed be true for every month of the year. The shade of blue transitions from dark to light as time progresses which would indicate that business had indeed continued to grow since the company began.

ggplot(happytails) +
  geom_point(aes(x= Attendance, y = Month, 
                  col = Year))

Using geom_jitter to Show Yearly Increases:

The single lines from the geom_point visual represent the data pretty well, but to look at it from a slightly different perspective, we’ll jitter the date. Jittering the date using the geom_jitter() function creates more density in the plot, and shows here that attendance has indeed increased since the business began.

ggplot(happytails, aes(x= Attendance, y = Month, col = Year)) +
  geom_jitter()

Range of Attendance Counts

A box and whisker plot is a good way to visualize the range of attendance numbers for not only the entire data set, but the range that the majority of attendance falls into as well. As we can see, the majority of attendance counts fall between 39 and 75 dogs, with the mean number being 58.

summary(happytails$Attendance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   39.00   53.00   57.59   75.00  151.00

boxplot(happytails$Attendance, horizontal= TRUE)

Attendance Broken Out by Month with ggseasonplot()

Another approach is to use the ggseasonplot() function, which plots a line for each year, instead of each month as above. This can be useful for seeing which years did better than others. It’s evident from the below plot that 2014-2017 had some of their highest attendance numbers, climbing above all other years multiple times, while 2009 and 2010 had some of the lowest attendance counts. This is to be expected as we saw from the initial graph of the data that the totals slowly increase as time goes on.

ggseasonplot(happytails.ts)

Broken Out by Week Using ggseasonplot()

We can see that using ggseasonplot() to break the data down by week shows that Friday and Tuesday attendance for all years stayed in a pretty small range, between about 15 and 30.

ggseasonplot(happytails.ts)

High Attendance Days

The line graph below shows that Fridays seem to have the potential to be the busiest day of the week. Another noticable trend is that Thursdays also show potential to be very high attendance days. This could be due to holidays such as Thanksgiving which typically wrap Thursday into a prolonged weekend.

ggplot(happytails, aes(x= Day, y = Attendance)) +
 geom_line(aes(color = Day, group = Day), size = 3)

Density of High Attendance Days

The histogram below captures density of high attendance days at Happy Tails. We used the “Position Fill” format on this histogram to help provide an image which captures what days of the week have been historically more busy. This histogram captures that Fridays do seem to have higher attendance historically than most other days of the week. It also shows that Sundays are relatively quiet regarding attendance levels. This is likely due to the weekend playing a factor in attendance.

ggplot(happytails, aes(x= Attendance, fill = Day)) +
  geom_histogram(aes(y = ..density..),binwidth = 10.0, position = "fill" )

Determining Seasonality

One way to determine if there is seasonality in the data is to use the ets() function on the time series. The output below, ETS(M,N,N), suggests it’s a time series no seaonality.

validLength <- 12
trainLength <- length(happytails.ts) - validLength
# Partition the data into training and validation periods
happytailsTrain <- window(happytails.ts, end=c(2007, trainLength))
happytailsValid <- window(happytails.ts, start=c(2007,trainLength+1))

ets(happytails.ts)

## ETS(M,N,N) 
## 
## Call:
##  ets(y = happytails.ts) 
## 
##   Smoothing parameters:
##     alpha = 0.0586 
## 
##   Initial states:
##     l = 16.9518 
## 
##   sigma:  0.5115
## 
##      AIC     AICc      BIC 
## 657.6154 657.9632 664.4867

Autocorrelation

We will now compute an estimate of the autocorrelation function on the time series with the Acf() function. A strong positive lag-1 autocorrelation means consecutive values generally move in the same direction, and is seen in time series with strong linear trends.

happytailsACF <- Acf(happytails.ts, lag.max = 52)

print(happytailsACF)

## 
## Autocorrelations of series 'happytails.ts', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.203 -0.096 -0.059 -0.220 -0.023  0.219  0.191 -0.015 -0.149 
##     10     11     12     13     14     15     16     17     18     19 
## -0.145  0.023  0.065 -0.040 -0.057 -0.034 -0.114 -0.081  0.138  0.150 
##     20     21     22     23     24     25     26     27     28     29 
## -0.034 -0.067 -0.177 -0.169 -0.120  0.170  0.172  0.032 -0.126 -0.102 
##     30     31     32     33     34     35     36     37     38     39 
## -0.015  0.094  0.182 -0.001 -0.037 -0.077 -0.103  0.064  0.086 -0.013 
##     40     41     42     43     44     45     46     47     48     49 
## -0.033  0.019 -0.058 -0.009  0.014 -0.014  0.061  0.013  0.060  0.012 
##     50     51     52 
## -0.031 -0.024 -0.031

Lag-0 autocorrelation is 1, which will always be the case, and which shows that the y-values are perfectly correlated with themselves. We see that lags 1,4,6, and 7 have the largest autocorrelations. These large values indicate weekly seasonality. We can also see as the weeks go by, the numbers become less and less correlated.

Differencing the data with a lag-1 to help adjust for a linear trend:

Acf(diff(happytails.ts, lag=1), lag.max=52, main="ACF Plot for Differenced Series")

The plot indicates that most of the seasonality has been captured, though there are a couple lags close to being significant.

Plotting the differenced series:

plot(diff(happytails.ts, ylim=c(10, 100), bty="l"))

This plot confirms that seasonality still exists within the differenced series.

White Noise

We will also conduct a Ljung-Box Test to check the white noise. A time series is considered white noise if the variables are uncorrelated, or are close to zero. If the box test returns a small p-value (less than .05), we will know it is not white noise.

Box.test(happytails.ts, lag=7, fitdf=0, type= "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  happytails.ts
## X-squared = 14.966, df = 7, p-value = 0.03644

We can see that there is a very small p-value, proving that the data is not white noise, but is in fact correlated as we assumed.

Suppressing Seasonality

As we have previously mentioned there is some form of seasonality in the time series. Our review of the timeseries thus far has revealed some interesting insight to trends and patterns already, albeit with seasonality in place. A natural next step would be then to supress the seasonality and continue evaluation. We chose to supress the seasonality by both quarter and year. The below quarterly plot confirmed that there is an increase in attendance flucuation as time goes by, and also indicates an increase in overall attendance. By supressing yearly seasonality, it is much more apparent that attendance levels are increasing over time although attendance is seemingly flattening our or declining towards the end of the timeseries.

quarterly <- aggregate(happytails.ts2, nfrequency=4, FUN=sum)
plot(quarterly, ylim= c(0, 200), bty="l", main= "Quarterly")

yearly <- aggregate(happytails.ts, nfrequency=1, FUN=sum)
plot(yearly, ylim= c(0, 200), bty="l", main= "Yearly")

Decomposing the Data

We also decomposed the time series to eliminate any systemic and noise components of the time series. The observed frame reveals upward movement in attendance levels, which is further indicated by the trend frame. The seasonality frame reveals that the pattern is additive and the random frame reveals the noise or unexplained part to the data.

componentsOfhappytails <- decompose(happytails.ts)
plot(componentsOfhappytails)

Fitting Trend Lines to the Data

The output from the ets() function showed no additive or multiplicative trend in the data, but we’ll fit multiple different types of trend lines to the data to see if any of them are a good fit. We’ll fit a quadratic line, an exponential line, and both a linear line with seasonality and one without.

Checking the adjusted R-squared values will determine which model fits the data better. An adjusted R-squared value of .52 would mean 52% of the data can be explained using the chosen method, .75 would mean 75% of the data can be explained, etc.

Linear:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsLinear <- tslm(happytails.ts ~ trend)
summary(happytailsLinear)

## 
## Call:
## tslm(formula = happytails.ts ~ trend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.896  -4.460   2.201   6.905  22.971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.73288    2.47720   7.562 1.08e-10 ***
## trend        0.06609    0.05818   1.136     0.26    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.47 on 71 degrees of freedom
## Multiple R-squared:  0.01785,    Adjusted R-squared:  0.004016 
## F-statistic:  1.29 on 1 and 71 DF,  p-value: 0.2598

lines(happytailsLinear$fitted, lwd=2)

Linear with seasonality:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsLinearSeasonal = tslm(happytails.ts ~trend + season)
summary(happytailsLinearSeasonal)

## 
## Call:
## tslm(formula = happytails.ts ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.468  -4.392   1.718   7.132  24.776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.28636    4.02723   4.789 1.01e-05 ***
## trend        0.06970    0.05912   1.179    0.243    
## season2     -3.36970    4.75504  -0.709    0.481    
## season3      2.02273    4.64620   0.435    0.665    
## season4     -2.59242    4.64545  -0.558    0.579    
## season5     -3.38939    4.64545  -0.730    0.468    
## season6     -0.66061    4.75614  -0.139    0.890    
## season7      3.36970    4.75504   0.709    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.63 on 65 degrees of freedom
## Multiple R-squared:  0.07357,    Adjusted R-squared:  -0.0262 
## F-statistic: 0.7374 on 7 and 65 DF,  p-value: 0.6411

lines(happytailsLinearSeasonal$fitted, lwd= 2)

Quadratic:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsQuad <- tslm(happytails.ts ~ trend + I(trend^2))
summary(happytailsQuad)

## 
## Call:
## tslm(formula = happytails.ts ~ trend + I(trend^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.851  -4.414   2.369   6.909  23.039 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.5502283  3.8076004   4.872 6.63e-06 ***
## trend        0.0806985  0.2374583   0.340    0.735    
## I(trend^2)  -0.0001975  0.0031097  -0.063    0.950    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.55 on 70 degrees of freedom
## Multiple R-squared:  0.01791,    Adjusted R-squared:  -0.01015 
## F-statistic: 0.6381 on 2 and 70 DF,  p-value: 0.5313

lines(happytailsQuad$fitted, lty=2, lwd=3)

Exponential:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsExponential <- tslm(happytails.ts ~ trend + season, lambda=0)
summary(happytailsExponential)

## 
## Call:
## tslm(formula = happytails.ts ~ trend + season, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9908 -0.2309  0.1988  0.6425  1.4640 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.981098   0.369293   8.072 2.17e-11 ***
## trend       -0.003289   0.005421  -0.607    0.546    
## season2     -0.316385   0.436032  -0.726    0.471    
## season3      0.297642   0.426052   0.699    0.487    
## season4     -0.402614   0.425983  -0.945    0.348    
## season5     -0.367555   0.425983  -0.863    0.391    
## season6      0.140926   0.436134   0.323    0.748    
## season7      0.095217   0.436032   0.218    0.828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9749 on 65 degrees of freedom
## Multiple R-squared:  0.9605, Adjusted R-squared:  0.9562 
## F-statistic: 225.7 on 7 and 65 DF,  p-value: < 2.2e-16

lines(happytailsExponential$fitted, lwd= 2)

As we can see from the accuracy metrics, the exponential trend line fits the data the best, explaining 96% of it.

IMPACT OF EXTERNAL FACTORS ON OVERALL ATTENDANCE

During our meeting with Frank, he advised us that his business levels were getting better every year, which they seemingly are. We wanted to take a deeper look at the impact of external events on his business, though. Two particular events that we wanted to review were the impact of the 2007 recession, and the opening of a local competitor in 2015.

Impact of the Recession

To start we will take a closer look at business levels in from 2007-2013 to capture the three years during the recession and the three years following the recession. The attendance levels remain relatively stable from 2007-2010 and begin to slowly rise which could indicate that the business itself is not entirely recession proof.

A few observations can be made from the data. One thing that sticks out is that there’s not as much of a seasonal drop during the recession period as there is from 2010-2013: as we can see the attendance has limited fluctuation. Another interesting thing to note is nearly all the lowest attendance counts actually happened post-recession.

ht2007.ts <- ts(happytails$Attendance, start=c(2007), end=c(2012), freq=12)
plot(ht2007.ts, xlab="Year", ylab="Attendance", ylim=c(0, 100), bty="l")

Impact of Competition

Roscoe’s Bed and Bark opened on January 1, 2015. They offer similar services to Happy Tails, so we wanted to see if there were any indications of their opening affecting the Happy Tails data.

Roscoes <- window(happytails.ts, 2013, c(2016,12))
plot(Roscoes, xlab="Year", ylab="Attendance", ylim=c(0, 100), bty="l")

Attendance at Happy Tails doesn’t appear to be affected by the opening. Both minimal attendance and maximum attendance continue to appear to trend upward. This would indicate that there is more than enough room in the area for competition to open and still not truly impact business levels.

APPLYING FORECASTING MODELS TO THE DATA

Now that we have evaluated the data, it is time to start working with forecasting models and applying them. But first, we have to partition the data into training and validation periods. Our validation period will be one year, so we’ll set the validation length to 12.

validLength <- 12
trainLength <- length(happytails.ts) - validLength
# Partition the data into training and validation periods
happytailsTrain <- window(happytails.ts, end=c(2007, trainLength))
happytailsValid <- window(happytails.ts, start=c(2007,trainLength+1))

Seasonal Naive Forecast

The seasonal naive forecast simply uses the previous attendance count for that day from the previous season as the forecast. For instance, a time series with weekly seasonality would forecast next monday as having the same amount of dogs as the previous monday.

As we can see from the accompanying accuracy, the accuracy metrics test set are actually pretty low save for the MAPE and MPE, indicating that this wouldn’t be a terrible forecasting method to use.

# Use the seasonal naive forecast
snaiveForValid <- snaive(happytailsTrain, h=validLength)

autoplot(snaiveForValid)

accuracy(snaiveForValid, happytailsValid)

##                     ME      RMSE       MAE       MPE     MAPE     MASE
## Training set 0.7555556  9.189366  6.888889 -140.0864 177.0335 1.000000
## Test set     2.3333333 14.623041 10.833333 -184.2225 224.9887 1.572581
##                   ACF1 Theil's U
## Training set 0.2044071        NA
## Test set     0.4067804 0.3411266

summary(snaiveForValid, happytailsValid)

## 
## Forecast method: Seasonal naive method
## 
## Model Information:
## Call: snaive(y = happytailsTrain, h = validLength) 
## 
## Residual sd: 9.2617 
## 
## Error measures:
##                     ME     RMSE      MAE       MPE     MAPE MASE      ACF1
## Training set 0.7555556 9.189366 6.888889 -140.0864 177.0335    1 0.2044071
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80      Lo 95    Hi 95
## 2015.714             28 16.223354 39.77665   9.989174 46.01083
## 2015.857             35 23.223354 46.77665  16.989174 53.01083
## 2016.000             32 20.223354 43.77665  13.989174 50.01083
## 2016.143             24 12.223354 35.77665   5.989174 42.01083
## 2016.286             29 17.223354 40.77665  10.989174 47.01083
## 2016.429              3 -8.776646 14.77665 -15.010826 21.01083
## 2016.571              3 -8.776646 14.77665 -15.010826 21.01083
## 2016.714             28 11.345307 44.65469   2.528845 53.47115
## 2016.857             35 18.345307 51.65469   9.528845 60.47115
## 2017.000             32 15.345307 48.65469   6.528845 57.47115
## 2017.143             24  7.345307 40.65469  -1.471155 49.47115
## 2017.286             29 12.345307 45.65469   3.528845 54.47115

Simple Exponential Smoothing

Simple exponential smoothing is a form of a weighted moving average that places more weight on more recent data. So, last Monday’s attendance and the Monday before that will hold more weight when forecasting than will the attendance count from Monday ten years ago. We don’t explicity select a value for alpha to allow the ses() function to select one on its own.

It’s evident from the plot that this isn’t an accurate forecasting method, as it didn’t account for any dips in the attendance numbers.

ses_forecast = ses(happytailsTrain, h=12)
autoplot(ses_forecast)

accuracy(ses_forecast, happytailsValid)

##                         ME      RMSE       MAE       MPE     MAPE
## Training set -0.0009410753  8.683206  6.772228 -140.5818 165.6509
## Test set      7.8842759928 12.854771 11.922851 -128.2052 188.7422
##                   MASE      ACF1 Theil's U
## Training set 0.9830653 0.2388391        NA
## Test set     1.7307364 0.1081245 0.1350953

summary(ses_forecast, happytailsValid)

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = happytailsTrain, h = 12) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 19.6157 
## 
##   sigma:  8.6832
## 
##      AIC     AICc      BIC 
## 436.2493 436.7493 442.1031 
## 
## Error measures:
##                         ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.0009410753 8.683206 6.772228 -140.5818 165.6509 0.9830653
##                   ACF1
## Training set 0.2388391
## 
## Forecasts:
##          Point Forecast    Lo 80   Hi 80    Lo 95   Hi 95
## 2015.714       19.61572 8.487747 30.7437 2.596952 36.6345
## 2015.857       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.000       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.143       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.286       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.429       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.571       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.714       19.61572 8.487747 30.7437 2.596952 36.6345
## 2016.857       19.61572 8.487747 30.7437 2.596952 36.6345
## 2017.000       19.61572 8.487747 30.7437 2.596952 36.6345
## 2017.143       19.61572 8.487747 30.7437 2.596952 36.6345
## 2017.286       19.61572 8.487747 30.7437 2.596952 36.6345

Holt-Winter’s Method

The Holt_Winter method is an exponential smoothing method that can handle data sets with both seasonality and trend. It can be used with both additive and multiplicative seasonality.

Additive Seasonality: The model appears to fit well with a lower MAPE and RMSE values, though the AICc value is slightly larger.

hw_additive = hw(happytailsTrain, seasonal = "additive", h=12)
autoplot(hw_additive)

summary(hw_additive)

## 
## Forecast method: Holt-Winters' additive method
## 
## Model Information:
## Holt-Winters' additive method 
## 
## Call:
##  hw(y = happytailsTrain, h = 12, seasonal = "additive") 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     beta  = 1e-04 
##     gamma = 0.0033 
## 
##   Initial states:
##     l = 19.7922 
##     b = -0.0065 
##     s=2.2822 2.9427 0.6308 -0.0665 -4.9743 -2.8079
##            1.9931
## 
##   sigma:  8.1633
## 
##      AIC     AICc      BIC 
## 447.8278 455.8278 471.2428 
## 
## Error measures:
##                      ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 0.09534931 8.163279 6.505222 -119.9818 144.613 0.9443064
##                   ACF1
## Training set 0.2335874
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80      Lo 95    Hi 95
## 2015.714       19.44097  8.979309 29.90263  3.4412393 35.44070
## 2015.857       20.12026  9.658597 30.58192  4.1205274 36.11999
## 2016.000       22.39668 11.935018 32.85834  6.3969479 38.39641
## 2016.143       21.74094 11.279278 32.20261  5.7412072 37.74068
## 2016.286       21.48957 11.027904 31.95124  5.4898334 37.48931
## 2016.429       16.63959  6.177920 27.10125  0.6398475 32.63933
## 2016.571       14.45288  3.991209 24.91455 -1.5468642 30.45262
## 2016.714       19.39903  8.937270 29.86078  3.3991506 35.39890
## 2016.857       20.07832  9.616554 30.54008  4.0784325 36.07820
## 2017.000       22.35474 11.892971 32.81650  6.3548458 38.35463
## 2017.143       21.69900 11.237225 32.16077  5.6990967 37.69890
## 2017.286       21.44763 10.985845 31.90941  5.4477134 37.44754

Multiplicative:
We can see the errors are slightly higher when using the multiplicative seasonality, meaning we should have stuck to the additive.

hw_mult = hw(happytailsTrain, seasonal = "multiplicative", h=12)
autoplot(hw_mult)

summary(hw_mult)

## 
## Forecast method: Holt-Winters' multiplicative method
## 
## Model Information:
## Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = happytailsTrain, h = 12, seasonal = "multiplicative") 
## 
##   Smoothing parameters:
##     alpha = 2e-04 
##     beta  = 1e-04 
##     gamma = 5e-04 
## 
##   Initial states:
##     l = 19.9365 
##     b = -9e-04 
##     s=1.0627 1.0751 1.0693 1.0561 0.7201 0.8462
##            1.1705
## 
##   sigma:  0.4653
## 
##      AIC     AICc      BIC 
## 459.2773 467.2773 482.6922 
## 
## Error measures:
##                      ME     RMSE     MAE       MPE     MAPE      MASE
## Training set -0.2180287 8.125259 6.45653 -122.4453 145.6172 0.9372382
##                   ACF1
## Training set 0.2352681
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2015.714       21.06072 8.501906 33.61953 1.853672 40.26777
## 2015.857       21.31988 8.606525 34.03323 1.876481 40.76328
## 2016.000       21.43775 8.654105 34.22139 1.886854 40.98864
## 2016.143       21.18809 8.553320 33.82286 1.864877 40.51130
## 2016.286       23.33149 9.418578 37.24441 2.053525 44.60946
## 2016.429       16.86626 6.808658 26.92387 1.484483 32.24804
## 2016.571       14.34923 5.792566 22.90590 1.262942 27.43552
## 2016.714       21.04440 8.495290 33.59351 1.852192 40.23661
## 2016.857       21.30336 8.599820 34.00690 1.874972 40.73175
## 2017.000       21.42113 8.647354 34.19491 1.885323 40.95694
## 2017.143       21.17167 8.546637 33.79670 1.863350 40.47998
## 2017.286       23.31341 9.411206 37.21561 2.051825 44.57499

Fitting an ets model

The ets models are the most flexible of the smoothing methods and automatically select the best model choosing the one with the lowest AICc.

We can see from the output that it chose ETS(M,N,N) as the optimal model, which means it has multiplicative error, no trend, and no seasonality. It then uses this information to forecast, and as we can see has a slightly larger AICc value than Holt-Winter’s method, but is same as the simple exponential smoothing.

ets_fit <- ets(happytailsTrain)
ets_forecast <- forecast(ets_fit)
plot(ets_forecast)

summary(ets_forecast, happytailsValid)

## 
## Forecast method: ETS(M,N,N)
## 
## Model Information:
## ETS(M,N,N) 
## 
## Call:
##  ets(y = happytailsTrain) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 19.6099 
## 
##   sigma:  0.4428
## 
##      AIC     AICc      BIC 
## 436.2462 436.7462 442.1000 
## 
## Error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.004824043 8.683208 6.774002 -140.5111 165.6113 0.9833229
##                   ACF1
## Training set 0.2388391
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2015.714       19.60997 8.482661 30.73729 2.592218 36.62773
## 2015.857       19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.000       19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.143       19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.286       19.60997 8.482661 30.73729 2.592218 36.62773
## 2016.429       19.60997 8.482661 30.73729 2.592217 36.62773
## 2016.571       19.60997 8.482661 30.73729 2.592217 36.62773
## 2016.714       19.60997 8.482661 30.73729 2.592217 36.62773
## 2016.857       19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.000       19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.143       19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.286       19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.429       19.60997 8.482661 30.73729 2.592217 36.62773
## 2017.571       19.60997 8.482661 30.73729 2.592217 36.62773

Arima Model

An AR model is an autogressive linear model that predicts the value of a time series using the preceding value in time. We’re going to use the auto.arima() function and let it choose the best model for us. As we can see, it chose an ARIMA(0,0,1)(0,1,1)[7] which has the smallest AICc of all previous forecasts.

arima_fit = auto.arima(happytailsTrain, stepwise = FALSE)
arima_forecast = forecast(arima_fit)
plot(arima_forecast)

summary(arima_forecast, happytailsValid)

## 
## Forecast method: ARIMA(0,0,1)(0,1,1)[7]
## 
## Model Information:
## Series: happytailsTrain 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.3039  -0.3145
## s.e.  0.1702   0.1679
## 
## sigma^2 estimated as 75.74:  log likelihood=-160.61
## AIC=327.21   AICc=327.8   BIC=332.63
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.482288 7.914054 5.575024 -110.6714 135.3323 0.8092776
##                     ACF1
## Training set -0.04183417
## 
## Forecasts:
##          Point Forecast      Lo 80    Hi 80       Lo 95    Hi 95
## 2015.714      25.745994  14.592716 36.89927   8.6885266 42.80346
## 2015.857      31.599251  19.942457 43.25604  13.7717235 49.42678
## 2016.000      30.377460  18.720666 42.03425  12.5499328 48.20499
## 2016.143      21.506229   9.849435 33.16302   3.6787015 39.33376
## 2016.286      27.004357  15.347564 38.66115   9.1768307 44.83188
## 2016.429       4.002981  -7.653812 15.65977 -13.8245453 21.83051
## 2016.571       3.662316  -7.994477 15.31911 -14.1652103 21.48984
## 2016.714      25.817852  11.877722 39.75798   4.4982628 47.13744
## 2016.857      31.599251  17.466895 45.73161   9.9856777 53.21282
## 2017.000      30.377460  16.245104 44.50982   8.7638870 51.99103
## 2017.143      21.506229   7.373873 35.63858  -0.1073443 43.11980
## 2017.286      27.004357  12.872002 41.13671   5.3907848 48.61793
## 2017.429       4.002981 -10.129374 18.13534 -17.6105912 25.61655
## 2017.571       3.662316 -10.470039 17.79467 -17.9512563 25.27589

Using AR as a Second-Layer Model

A good way to take advantage of autocorrelation is to create a second-level forecasting model on the residuals. This approach is often useful for short-term forecasting, and includes a 3 step process where an AR model is fitted to the series of residuals instead of the raw data series, and is then used to forecast future residuals.

Step 1: fit a multiple linear regression model that accounts for linear trend and additive seasonality.

Step 2: fit an AR model to the forecast errors from Step One.

Step 3: compute the adjusted forecast from steps 1 and 2

Step 1: As the data has already been previously partioned, we will fit the model with linear trend and additive seasonality to it and check the residuals.

happytailsLinearSeason <- tslm(happytailsTrain ~ trend + season)
accuracy(happytailsLinearSeason)

##                        ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -3.41507e-16 8.104593 6.473376 -116.773 140.8127 0.9558764

resACF <- Acf(happytailsLinearSeason$residuals)

print(resACF)

## 
## Autocorrelations of series 'happytailsLinearSeason$residuals', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.236 -0.192 -0.309 -0.357 -0.026  0.110  0.269  0.013 -0.250 
##     10     11     12     13     14     15     16     17 
## -0.079 -0.038  0.038  0.004 -0.019  0.072  0.076  0.048

We can see there are still multiple significant lags, meaning there is still autocorrelation present.

Step 2:

Before going forward, we have to calculate the t statistic, which is the coefficient divided by the standard error, and then use that to estimate the p-value on the normal distribution. (Anything greater that 2 or -2 is statistically significant)

ARModel = Arima(happytailsLinearSeason$residuals, order = c(1,0,0))
ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]

##      ar1 
## 1.766488

2*pnorm(-abs(ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]))

##        ar1 
## 0.07731401

We can now see if we have accounted for autocorrelation in the series:

Acf(ARModel$residuals)

We can tell from the plot that we haven’t captured all the autocorrelations, as there is still significant correlation at lag 4 and lag 7.

Step 3

Then we wanted to forecast using both the linear regression model and the autocorrelation of the error terms. We first created them separetely, and then added them together and ploted it to create the adjusted forecast.

We can see that we want to adjust the straight regression forecast down from June through July, October through December and February (negative values for the error forecast). We will also want to adjust it up for the positive months on the error forecast.

lrForecast <- forecast(happytailsLinearSeason, h=validLength)

errorForecast <- forecast(ARModel, h=validLength)

adjustedForecast <- lrForecast$mean + errorForecast$mean

#plot it
plot(happytailsValid, bty="l", xaxt="n", xlab="Year 2017", yaxt="n", ylab="Attendance", ylim=c(1,120))

# Add the x-axis
axis(1, at=seq(2017, 2017 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

# Add the y-axis
axis(2, at=seq(10,130,10), labels=format(seq(10,130,10)), las=2)

# Add the adjusted forecast
lines(adjustedForecast, col="blue")

# Add a legend
legend(2017,120, c("Actuals", "Adjusted Forecast"), lty=c(1,1), col=c("black", "blue"), bty="n")

accuracy(adjustedForecast, happytailsValid)

##                ME     RMSE      MAE      MPE    MAPE      ACF1  Theil's U
## Test set 8.069135 13.46635 11.94609 -136.343 196.782 0.2124555 0.07828943

summary(adjustedForecast, happytailsValid)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.94   20.31   19.43   20.88   22.23

We can see that we want to adjust the straight regression forecast as it underforecasted for most of the plot.

Fitting a TBATs model:

A TBATS model can be used for a times series that exhibits multiple complex seasonalities, and takes into account autocorrelation in the residuals.

tbats_forecast = msts(happytailsTrain, seasonal.periods = c(7))
fit1= tbats(tbats_forecast)
fc_tbats = forecast(fit1)
autoplot(fc_tbats)

accuracy(fc_tbats, happytailsValid)

##                     ME      RMSE      MAE       MPE     MAPE      MASE
## Training set 0.2780325  8.641762  6.82315 -135.8291 162.2391 0.9904572
## Test set     9.6929901 14.316217 13.06642 -100.1209 166.8231 1.8967386
##                    ACF1 Theil's U
## Training set 0.23857611        NA
## Test set     0.01801802 0.1592408

summary(fc_tbats, happytailsValid)

## 
## Forecast method: BATS(1, {0,0}, -, -)
## 
## Model Information:
## BATS(1, {0,0}, -, -)
## 
## Call: tbats(y = tbats_forecast)
## 
## Parameters
##   Alpha: -0.00992644
## 
## Seed States:
##          [,1]
## [1,] 19.45052
## 
## Sigma: 8.641762
## AIC: 433.7517
## 
## Error measures:
##                     ME     RMSE     MAE       MPE     MAPE      MASE
## Training set 0.2780325 8.641762 6.82315 -135.8291 162.2391 0.9904572
##                   ACF1
## Training set 0.2385761
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2015.714       19.30701 8.232146 30.38187 2.369468 36.24455
## 2015.857       19.30701 8.231601 30.38242 2.368633 36.24539
## 2016.000       19.30701 8.231055 30.38296 2.367799 36.24622
## 2016.143       19.30701 8.230510 30.38351 2.366965 36.24706
## 2016.286       19.30701 8.229964 30.38406 2.366130 36.24789
## 2016.429       19.30701 8.229419 30.38460 2.365296 36.24872
## 2016.571       19.30701 8.228873 30.38515 2.364462 36.24956
## 2016.714       19.30701 8.228328 30.38569 2.363627 36.25039
## 2016.857       19.30701 8.227782 30.38624 2.362793 36.25123
## 2017.000       19.30701 8.227237 30.38678 2.361959 36.25206
## 2017.143       19.30701 8.226691 30.38733 2.361125 36.25289
## 2017.286       19.30701 8.226146 30.38787 2.360291 36.25373
## 2017.429       19.30701 8.225601 30.38842 2.359457 36.25456
## 2017.571       19.30701 8.225055 30.38896 2.358623 36.25540

We can see from the accuracy() function that the TBATS model doesn’t perform as well as the ARIMA model.

Choosing the Best Model

Based on our analysis, the ARIMA model fits our data the best and produces the most accurate forecasts. Below is a reproduction of the forecasting model that we applied to the existing data set. As noted previously, the AICc is the lowest of all forecasting models we applied at 327. The MAPE is still high relative to the other accuracy metrics, but is still lower than that of other methods.

arima_fit = auto.arima(happytailsTrain, stepwise = FALSE)
arima_forecast = forecast(arima_fit)
plot(arima_forecast)

summary(arima_forecast, happytailsValid)

## 
## Forecast method: ARIMA(0,0,1)(0,1,1)[7]
## 
## Model Information:
## Series: happytailsTrain 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.3039  -0.3145
## s.e.  0.1702   0.1679
## 
## sigma^2 estimated as 75.74:  log likelihood=-160.61
## AIC=327.21   AICc=327.8   BIC=332.63
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.482288 7.914054 5.575024 -110.6714 135.3323 0.8092776
##                     ACF1
## Training set -0.04183417
## 
## Forecasts:
##          Point Forecast      Lo 80    Hi 80       Lo 95    Hi 95
## 2015.714      25.745994  14.592716 36.89927   8.6885266 42.80346
## 2015.857      31.599251  19.942457 43.25604  13.7717235 49.42678
## 2016.000      30.377460  18.720666 42.03425  12.5499328 48.20499
## 2016.143      21.506229   9.849435 33.16302   3.6787015 39.33376
## 2016.286      27.004357  15.347564 38.66115   9.1768307 44.83188
## 2016.429       4.002981  -7.653812 15.65977 -13.8245453 21.83051
## 2016.571       3.662316  -7.994477 15.31911 -14.1652103 21.48984
## 2016.714      25.817852  11.877722 39.75798   4.4982628 47.13744
## 2016.857      31.599251  17.466895 45.73161   9.9856777 53.21282
## 2017.000      30.377460  16.245104 44.50982   8.7638870 51.99103
## 2017.143      21.506229   7.373873 35.63858  -0.1073443 43.11980
## 2017.286      27.004357  12.872002 41.13671   5.3907848 48.61793
## 2017.429       4.002981 -10.129374 18.13534 -17.6105912 25.61655
## 2017.571       3.662316 -10.470039 17.79467 -17.9512563 25.27589

Future Forecasts Using ARIMA (0,0,1)(0,1,1)[7]

Up to this point, we’ve only forecasted our validation period to measure the accuracy of the forecast. Now, we’ll apply the forecaset to the entire dataset to predict future attendance numbers. We’ll forecast for the next 5 months.

arima_fit = arima(happytails.ts, order= c(0,0,1), seasonal= c(0,1,1))
arima_forecast = forecast(arima_fit, h=5)
plot(arima_forecast)

summary(arima_forecast)

## 
## Forecast method: ARIMA(0,0,1)(0,1,1)[7]
## 
## Model Information:
## 
## Call:
## arima(x = happytails.ts, order = c(0, 0, 1), seasonal = c(0, 1, 1))
## 
## Coefficients:
##          ma1     sma1
##       0.3045  -0.6637
## s.e.  0.1495   0.1624
## 
## sigma^2 estimated as 123.6:  log likelihood = -254.7,  aic = 515.4
## 
## Error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.9146159 10.57238 7.804503 -150.131 180.6143 0.8760157
##                     ACF1
## Training set -0.03582091
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80      Lo 95    Hi 95
## 2018.714       27.48839 13.237840 41.73893   5.694058 49.28271
## 2018.857       33.02876 18.132252 47.92528  10.246515 55.81101
## 2019.000       20.58449  5.687983 35.48101  -2.197754 43.36674
## 2019.143       11.83076 -3.065696 26.72722 -10.951404 34.61293
## 2019.286       25.91576 11.019891 40.81163   3.134494 48.69703

As we can see, our point forecasted attendance for the next 5 months is 27, 33, 21, 12, and 26, respectively. It will be interesting to see how accurate these predictions are.