happytails <- read.csv("Happy Tail Attendance_updated.csv")

Initial Data Visualization

First things first, we’ll plot the overall attendance data, which inlcudes half-day daycare, full-day daycare, and boarding, just to get a general idea of what it looks like.

library(forecast)
library(ggplot2)
library(TSA)
## Warning: package 'TSA' was built under R version 3.4.4
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.4.4
## Loading required package: locfit
## Warning: package 'locfit' was built under R version 3.4.4
## locfit 1.5-9.1    2013-03-22
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:forecast':
## 
##     getResponse
## This is mgcv 1.8-22. For overview type 'help("mgcv-package")'.
## Loading required package: tseries
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
library(ggfortify)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

happytails.ts <- ts(happytails$Attendance, start=c(2012,1), end=c(2017,12), freq=7)
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

There definitely appears to be a linear pattern here consisting of highs and lows.

Let’s look at just the first 3 years of data to see the pattern more clearly:

#Plot first 3 years

ht_first3years= window(happytails.ts, 2012,c(2014,12))
plot(ht_first3years, xlab="Year", ylab= "Attendance", ylim= c(0,200), bty= "l")

We can more clearly see almost an annually repeating “M” shape in the data. Attendance starts out low at the beginning of the year, peaks in the first half of the year, and begins decreasing for the last quarter of the year.

Aggregating by year

yearly <- aggregate(happytails.ts, nfrequency=1, FUN=sum)
plot(yearly, bty="l")

->It seems as the largest increase in business was from 2013 through 2015 with business leveling off towards 2016. It should be interesting when fitting prediction models to this data to see what the future could hold and if business will continue to level off or grow.<-

Attendance Broken Out by Month

Though the monthly plots can be a bit overwhelming at first sight, there are some good things we can pull off of it. For instance, it appears August 2012, March 2014, and July 2015 had the 3 lowest attendance counts from 2012-2017. January 2015, October 2016, and February 2017 had the highest attendance counts. Each year seems to follow a similar pattern, however the drop of attendance seems difficult to predict regarding as to when this change will occur.

# set the outer margin area to the right a bit bigger
par(oma = c(0, 0, 0, 2))

# We have 6 years of data
xrange <- c(1,7)
# Get the range of the ridership to set up a nicely formatted plot
yrange <- range(happytails.ts)

# set up the plot 
plot(xrange, yrange, type="n", xlab="Year", ylab="Attendance", bty="l")

# Give each of the months its own color, line type, and character
colors <- rainbow(12) 
linetype <- c(1:12) 
plotchar <- c(1:12)

# add lines 
for (i in 1:12) { 
  currentMonth <- subset(happytails.ts, cycle(happytails.ts)==i)
  lines(currentMonth, type="b", lwd=1.5,
      lty=linetype[i], col=colors[i], pch=plotchar[i]) 
} 

# add a title
title("Attendance Broken Out by Month")

# add a legend 
legend(6,80, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month")

Attendance Broken Out by Year with ggseasonplot

Another approach is to use the ggseasonplot() function, which plots a line for each year, instead of each month as above. This can be useful for seeing which years did better than others. It’s evident from the below plot that 2017 had some of their highest attendance numbers, climbing above all other years multiple times, while 2012 and 2013 had some of the lowest attendance counts. This is to be expected as we saw from the initial graph of the data that the totals slowly increase as time goes on.

ggseasonplot(happytails.ts)

Determining Seasonality

One way to determine if there is seasonality in the data is to use the ets() function on the time series. The output below, ETS(M,N,A), suggests it’s a time series with multiplicative error, no trend, and additive seaonality.

ets(happytails.ts)
## ETS(M,N,A) 
## 
## Call:
##  ets(y = happytails.ts) 
## 
##   Smoothing parameters:
##     alpha = 0.2962 
##     gamma = 0.0027 
## 
##   Initial states:
##     l = 58.6696 
##     s=-16.1443 5.599 10.8892 11.9537 12.8195 7.3565
##            -32.4736
## 
##   sigma:  0.1759
## 
##      AIC     AICc      BIC 
## 411.9533 418.0644 430.4548

Create a time series for weekly and monthly data

We will also look at the data from additional perspectives in weekly and monthly increments to determine if that shows any additional trend or pattern that can help with our forecasting. By creating a new time series we will work to decompose and see what comes up. We will start by creating time series for both sets.

#Weekly Time Series
happytails_W.ts<-ts(happytails$Attendance, start=c(2012,1,1),end=c(2018,2,28), frequency = 7)
happytails_W_lm<-tslm(happytails_W.ts ~ trend+I(trend^2))
#Monthly Time Series
happytails_M.ts<-ts(happytails$Attendance, start=c(2012,1,1),end=c(2018,2,28), frequency = 30)
happytails_M_lm<-tslm(happytails_M.ts ~ trend+I(trend^2))

Decompose and plot the weekly time series for initial analysis

#  Decompose and plot the weekly time series for initial analysis
dec_happytails_W <- decompose(happytails_W.ts)
plot(dec_happytails_W)

Decompose and plot the monthly time series for initial analysis

#  Decompose and plot the monthly time series for initial analysis
dec_happytails_M <- decompose(happytails_M.ts)
plot(dec_happytails_M)

#Create monthly and yearly averages
monthly <- group_by(happytails, MY) %>% summarise(mon_tot=sum(Attendance), monthly_avg_per_day=mean(Attendance))
annual <- group_by(happytails, Year) %>% summarise(yearly_tot=sum(Attendance), yearly_avg_per_day=mean(Attendance))
#  Create a Time Series and Line Model of the data
monthly_sum.ts<-ts(monthly$mon_tot, start=c(2012,1),end=c(2018,2), frequency = 12)
monthly_sum_lm<-tslm(monthly_sum.ts ~ trend+I(trend^2))
#Create time series for the monthly totals and averages
monthly_avg.ts<-ts(monthly$monthly_avg_per_day, start=c(2012,1),end=c(2018,2), frequency = 12)
monthly_avg_lm<-tslm(monthly_avg.ts ~ trend+I(trend^2))

Decompose and plot the monthly sum time series

#  Decompose and plot the monthly sum time series
dec_monthly_sum <- decompose(monthly_sum.ts)
plot(dec_monthly_sum)

Decompose and plot the monthly average time series

#  Decompose and plot the monthly average time series
dec_monthly_avg <- decompose(monthly_avg.ts)
plot(dec_monthly_avg)

Generate plots for daily time series

#  Generate plots for daily time series
par(mfrow = c(2,1))
plot(happytails_W.ts, col=4, main="Happy Tails Day Care Daily Total Attendance \n 2012 to 2018 with Weekly Seasonality", xlab="Day", ylab="Dogs", ylim=c(0, 120), type="l")
lines(happytails_W_lm$fitted, col=2, lwd=2)

plot(happytails_M.ts, col=4, main="Happy Tails Day Care Daily Total Attendance  \n  2012 to 2018 with Monthly Seasonality", xlab="Day", ylab="Dogs", ylim=c(0, 120), type="l")
lines(happytails_M_lm$fitted, col=2, lwd=2)

Generate plots for monthly time series

#  Generate plots for monthly time series
par(mfrow = c(2,1))
plot(monthly_sum.ts, col=4, main="Happy Tails Day Care Daily Total \n Attendance by month 2012 to 2018 \n", xlab="Day", ylab="Dogs", ylim=c(1500, 3500), type="l")
lines(monthly_sum_lm$fitted, col=2, lwd=2)

#  Generate plots for monthly time series
plot(monthly_avg.ts, col=4, main="Happy Tails Day Care Daily Average \n Attendance by month 2012 to 2018 \n", xlab="Day", ylab="Dogs", ylim=c(50, 120), type="l")
lines(monthly_avg_lm$fitted, col=2, lwd=2)

Fitting Trend Lines to the Data

The output from the ets() function showed no additive or multiplicative trend in the data, but we’ll fit multiple different types of trend lines to the data to see if any of them are a good fit. We’ll fit a quadratic line, an exponential line, and both a linear line with seasonality and one without.

Checking the adjusted R-squared values will determine which model fits the data better. An adjusted R-squared value of .52 would mean 52% of the data can be explained using the chosen method, .75 would mean 75% of the data can be explained, etc.

Linear:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsLinear <- tslm(happytails.ts ~ trend)
summary(happytailsLinear)
## 
## Call:
## tslm(formula = happytails.ts ~ trend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.848 -14.806   5.029  13.540  24.194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.0305     5.5262   8.149  2.1e-10 ***
## trend         0.5306     0.2005   2.647   0.0111 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.64 on 45 degrees of freedom
## Multiple R-squared:  0.1347, Adjusted R-squared:  0.1155 
## F-statistic: 7.008 on 1 and 45 DF,  p-value: 0.01115
lines(happytailsLinear$fitted, lwd=2)

Linear with seasonality:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsLinearSeasonal = tslm(happytails.ts ~trend + season)
summary(happytailsLinearSeasonal)
## 
## Call:
## tslm(formula = happytails.ts ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.9000  -4.3571  -0.5714   6.4464  16.2429 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.2898     4.4755   3.416 0.001497 ** 
## trend         0.4673     0.1083   4.317 0.000105 ***
## season2      38.6755     5.3598   7.216 1.09e-08 ***
## season3      42.4939     5.3630   7.923 1.20e-09 ***
## season4      44.1694     5.3685   8.227 4.72e-10 ***
## season5      40.9878     5.3761   7.624 3.03e-09 ***
## season6      36.2276     5.5799   6.493 1.07e-07 ***
## season7      14.5935     5.5841   2.613 0.012670 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 39 degrees of freedom
## Multiple R-squared:  0.7831, Adjusted R-squared:  0.7442 
## F-statistic: 20.12 on 7 and 39 DF,  p-value: 4e-11
lines(happytailsLinearSeasonal$fitted, lwd= 2)

Quadratic:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsQuad <- tslm(happytails.ts ~ trend + I(trend^2))
summary(happytailsQuad)
## 
## Call:
## tslm(formula = happytails.ts ~ trend + I(trend^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.478 -14.110   5.869  13.435  24.511 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.995991   8.592410   4.888  1.4e-05 ***
## trend        0.902219   0.825751   1.093    0.281    
## I(trend^2)  -0.007741   0.016679  -0.464    0.645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.81 on 44 degrees of freedom
## Multiple R-squared:  0.139,  Adjusted R-squared:  0.09982 
## F-statistic:  3.55 on 2 and 44 DF,  p-value: 0.0372
lines(happytailsQuad$fitted, lty=2, lwd=3)

The Quadratic trend line seems to show a tapering off attendance after 2015.

Exponential:

plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")

happytailsExponential <- tslm(happytails.ts ~ trend + season, lambda=0)
summary(happytailsExponential)
## 
## Call:
## tslm(formula = happytails.ts ~ trend + season, lambda = 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48743 -0.10720  0.02158  0.12361  0.47292 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.983772   0.098648  30.247  < 2e-16 ***
## trend       0.009045   0.002386   3.790 0.000510 ***
## season2     0.919556   0.118138   7.784 1.84e-09 ***
## season3     1.027919   0.118210   8.696 1.15e-10 ***
## season4     1.049015   0.118331   8.865 6.91e-11 ***
## season5     0.990025   0.118499   8.355 3.20e-10 ***
## season6     0.925654   0.122989   7.526 4.10e-09 ***
## season7     0.496448   0.123081   4.033 0.000248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.221 on 39 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 4.431e+04 on 7 and 39 DF,  p-value: < 2.2e-16
lines(happytailsExponential$fitted, lwd= 2)

As we can see from the accuracy metrics, the exponential trend line fits the data the best, explaining 99% of it.

Autocorrelation and White Noise

We will now compute an estimate of the autocorrelation function on the time series with the Acf() function. A strong positive lag-1 autocorrelation means consecutive values generally move in the same direction, and is seen in time series with strong linear trends.

A time series is considered white noise if the variables are uncorrelated, or are close to zero.

happytailsACF <- Acf(happytails.ts)

print(happytailsACF)
## 
## Autocorrelations of series 'happytails.ts', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000  0.288 -0.052 -0.262 -0.201 -0.075  0.370  0.593  0.272 -0.103 
##     10     11     12     13     14     15     16 
## -0.195 -0.261 -0.084  0.240  0.545  0.125 -0.121

Lag-0 autocorrelation is 1, which will always be the case, and which shows that the y-values are perfectly correlated with themselves. We see that lag-7, lag-14 and lag-21 have the largest autocorrelations. These large values indicate a linear trend and monthly seasonality. This obervation includes negative autocorrelations for this data set, which could mean that business is impacted by external forces, such as seasonal changes.

Differencing the data with a lag-1 to help adjust for a linear trend:

Acf(diff(happytails.ts, lag=1), lag.max=12, main="ACF Plot for Differenced Series")

The plot indicates that most of the seasonality has been captured, though there is still one significant lag at lag7.

Lets plot the differenced series.

plot(diff(happytails.ts, ylim=c(10, 100), bty="l"))

This plot confirms that seasonality still exists within the differenced series.

We will also conduct a Llung-Box Test to check the white noise. If it returns a small p-value we will know it is not white noise.

Box.test(happytails.ts, lag=24, fitdf=0, type= "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  happytails.ts
## X-squared = 110.66, df = 24, p-value = 4.322e-13

We can see that there is a very small p-value, meaning that the data is not white noise, but is in fact correlated.

Applying Forecasting Models to the data**

Now that we have evaluated the data, it is time to start working with forecasting models and applying them to the data. But first, we have to partition the data into training and validation periods. Our validation period will be one year, so we’ll set the validation length to 12.

validLength <- 12
trainLength <- length(happytails.ts) - validLength

happytailsTrain <- window(happytails.ts, end=c(2012, trainLength))
happytailsValid <- window(happytails.ts, start=c(2012,trainLength+1))

Compute both the trailing moving average with a window size of 12 and the centered moving average and plot.

# First, let's compute the trailing moving average based on the training set
# In the package zoo, we have the function rollmean()
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
trailingMA <- rollmean(happytailsTrain, k=12, align="right")

# We can also find the centered moving average with the function ma()
centeredMA <- ma(happytailsTrain, order=12)

# Helps set up the plot
yrange = range(happytails.ts)

# Set up the plot
plot(c(2012, 2018), yrange, type="n", xlab="Year",  ylab="Attendance", bty="l", xaxt="n", yaxt="n")

# Add the time series air
lines(happytails.ts, bty="l")

# Add the x-axis
axis(1, at=seq(2012,2017,1), labels=format(seq(2012,2017,1)))

# Add the y-axis
axis(2, at=seq(10,100,10), labels=format(seq(10,100,10)), las=2)

lines(trailingMA, col="red", lty=2)
lines(centeredMA)

# Add the legend
legend(2012,99, c("ARPM", "Centered Moving Average", "Trailing Moving Average"), lty=c(1,1,2), col=c("black", "black", "red"), bty="n")

Both the trailing moving average and centered moving average seem to more accurately capture the seasonal dip in attendance that happens each year. Prior to 2015, the central moving aveage was larger than the trailing average almost all of the time which would be indicative of a significant upward trend between 2012 and 2015.

AR(1) Models and Random Walks

An AR model is an autogressive linear model that predicts the value of a time series using the preceding value in time. An AR(1) model suggests an autogressive model on a dataset with autocorrelation at lag 1.

A random walk is a special form of an AR(1) model in which the slope efficient is equal to 1 and changes from one period to the next are random. Because of this, random walk series are difficult to predict.

We will fit an AR(1) model and look at the estimated co-efficients. In order to accurately say the series is not a random walk, there must be a small p-value. Anything larger than ??=0.05 for a p-value would signify the series is a random walk.

fit <-Arima(happytails.ts, order=c(1,0,0))
print(fit)
## Series: happytails.ts 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     mean
##       0.2975  57.6904
## s.e.  0.1413   3.8552
## 
## sigma^2 estimated as 366.4:  log likelihood=-204.45
## AIC=414.91   AICc=415.47   BIC=420.46

We will calculate the two-tailed p-value from a t-distrubtion to be “conservative”.

2*pt(-abs((1 - fit$coef["ar1"]) / sqrt(diag(vcov(fit)))["ar1"]), df=length(happytails.ts)-1)
##          ar1 
## 9.645865e-06

Now we will calculate it using the normal distribution, just to check

2*pnorm(-abs((1 - fit$coef["ar1"]) / sqrt(diag(vcov(fit)))["ar1"]))
##          ar1 
## 6.616083e-07

Clearly, this does not appear to be a random walk, which is not suprising, considering we believe that there is seasonal patterns to this data set, and therefore wouldn’t be random.

Using AR as a Second-Layer Model

Another way to take advantage of autocorrelation is to create a second-level forecasting model on the residuals. This approach is often useful for short-term forecasting, and includes a 3 step process where an AR model is fitted to the series of residuals instead of the raw data series, and is then used to forecast future residuals.

Step 1: fit a multiple linear regression model that accounts for linear trend and additive seasonality.

Step 2: fitting an AR model to the forecast errors from Step One.

Step 3: compute the adjusted forecast from steps 1 and 2

Step 1: As the data has already been previously partioned, we will fit the model with linear trend and additive seasonality to it and see what it looks like

happytailsLinearSeason <- tslm(happytailsTrain ~ trend + season)
summary(happytailsLinearSeason)
## 
## Call:
## tslm(formula = happytailsTrain ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2000  -4.0000  -0.4857   5.9000  20.5429 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.8367     5.2868   2.617 0.014348 *  
## trend         0.6776     0.1752   3.868 0.000627 ***
## season2      33.5224     6.4904   5.165 1.96e-05 ***
## season3      42.8449     6.4975   6.594 4.49e-07 ***
## season4      43.7673     6.5093   6.724 3.22e-07 ***
## season5      39.8898     6.5257   6.113 1.57e-06 ***
## season6      32.2122     6.5469   4.920 3.77e-05 ***
## season7      13.3347     6.5726   2.029 0.052442 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.26 on 27 degrees of freedom
## Multiple R-squared:  0.7804, Adjusted R-squared:  0.7234 
## F-statistic:  13.7 on 7 and 27 DF,  p-value: 1.998e-07

There’s a fairlyl high p-value, and a very low adjusted R-sqaured value.

Now we want to look at the ACF plot for the residuals from our fitted model

resACF <- Acf(happytailsLinearSeason$residuals)

print(resACF)
## 
## Autocorrelations of series 'happytailsLinearSeason$residuals', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000 -0.439  0.312 -0.256  0.277 -0.117  0.222 -0.533  0.372 -0.311 
##     10     11     12     13     14     15 
##  0.223 -0.257  0.045 -0.076  0.110 -0.209

We can see there are still multiple significant lags, meaning there is still autocorrelation present.

Step 2: Let’s run an AR(2) model on the residuals from the fitted model to see if it fits better than the AR(1):

ARModel <- Arima(happytailsLinearSeason$residuals, order=c(2,0,0))
print(ARModel)
## Series: happytailsLinearSeason$residuals 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##           ar1     ar2     mean
##       -0.4003  0.2060  -0.1677
## s.e.   0.1685  0.1785   1.1068
## 
## sigma^2 estimated as 66.49:  log likelihood=-121.73
## AIC=251.47   AICc=252.8   BIC=257.69

Before going forward, we have to calculate the t statistics for each model, which is the coefficient divided by the standard error.
(Anything greater that 2 or -2 is statistically significant)

AR(1) t statistic:

ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]
##       ar1 
## -2.374998

AR(2) t statistic:

ARModel$coef["ar2"] / sqrt(diag(vcov(ARModel)))["ar2"]
##      ar2 
## 1.154289

Now to estimate the p-value on the normal distribution:

AR(1) p-value:

2*pnorm(-abs(ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]))
##        ar1 
## 0.01754905

AR(2) p-value:

2*pnorm(-abs(ARModel$coef["ar2"] / sqrt(diag(vcov(ARModel)))["ar1"]))
##       ar2 
## 0.2215266

This confirms that we should have just used an AR(1) Model since the residuals are statistically significant. We also may wish to see if we have accounted for autocorrelation in the series.

Create an ACF plot fo the residuals from the AR(2) Model

Acf(ARModel$residuals)

We can tell from the plot that we haven’t captured all the autocorrelations, as there are still numerous lags past the blue dotted line.

Step 3

Now we want to use the originally fitted regression model, along with the autocorrelation of the error terms, to forecast for the validation period.

lrForecast <- forecast(happytailsLinearSeason, h=validLength)
lrForecast
##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## 2017.000       38.22857 22.69498  53.76216 13.96716  62.48998
## 2017.143       72.42857 56.89498  87.96216 48.16716  96.68998
## 2017.286       82.42857 66.89498  97.96216 58.16716 106.68998
## 2017.429       84.02857 68.49498  99.56216 59.76716 108.28998
## 2017.571       80.82857 65.29498  96.36216 56.56716 105.08998
## 2017.714       73.82857 58.29498  89.36216 49.56716  98.08998
## 2017.857       55.62857 40.09498  71.16216 31.36716  79.88998
## 2018.000       42.97143 26.86385  59.07901 17.81352  68.12934
## 2018.143       77.17143 61.06385  93.27901 52.01352 102.32934
## 2018.286       87.17143 71.06385 103.27901 62.01352 112.32934
## 2018.429       88.77143 72.66385 104.87901 63.61352 113.92934
## 2018.571       85.57143 69.46385 101.67901 60.41352 110.72934

Generate a forecast for the error terms using residuals AR(2) model

errorForecast <- forecast(ARModel, h=validLength)
errorForecast
##          Point Forecast      Lo 80     Hi 80     Lo 95    Hi 95
## 2017.000   -1.429140796 -11.879015  9.020733 -17.41084 14.55256
## 2017.143    1.631471793  -9.624434 12.887378 -15.58295 18.84589
## 2017.286   -1.147773536 -13.036554 10.741007 -19.33009 17.03455
## 2017.429    0.595253558 -11.532108 12.722615 -17.95195 19.14245
## 2017.571   -0.675036398 -12.927534 11.577461 -19.41361 18.06354
## 2017.714    0.192538783 -12.117834 12.502912 -18.63455 19.01963
## 2017.857   -0.416443634 -12.755246 11.922358 -19.28701 18.45413
## 2018.000    0.006060449 -12.346400 12.358520 -18.88540 18.89752
## 2018.143   -0.288524032 -12.647619 12.070571 -19.19013 18.61308
## 2018.286   -0.083562100 -12.445867 12.278743 -18.99008 18.82295
## 2018.429   -0.226295513 -12.590157 12.137566 -19.13519 18.68260
## 2018.571   -0.126935261 -12.491551 12.237681 -19.03698 18.78311

Create the adjusted forecast by adding the two forecasts together

adjustedForecast <- lrForecast$mean + errorForecast$mean
adjustedForecast
## Time Series:
## Start = c(2017, 1) 
## End = c(2018, 5) 
## Frequency = 7 
##  [1] 36.79943 74.06004 81.28080 84.62382 80.15354 74.02111 55.21213
##  [8] 42.97749 76.88290 87.08787 88.54513 85.44449

We can see that we want to adjust the straight regression forecast down from June through July, October through December and February (negative values for the error forecast). We will also want to adjust it up for the positive months on the error forecast.

Now lets plot the adjusted forecast next to the actuals from the validation period:

plot(happytailsValid, bty="l", xaxt="n", xlab="Year 2017", yaxt="n", ylab="Attendance", ylim=c(10,100))

# Add the x-axis
axis(1, at=seq(2017, 2017 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

# Add the y-axis
axis(2, at=seq(10,100,10), labels=format(seq(10,100,10)), las=2)

# Add the adjusted forecast
lines(adjustedForecast, col="blue")

# Add a legend
legend(1,99, c("Actuals", "Adjusted Forecast"), lty=c(1,1), col=c("black", "blue"), bty="n")

We could also look at accuracy metrics.

accuracy(adjustedForecast, happytailsValid)
##                ME    RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -7.59073 12.0672 10.44397 -18.06111 21.71138 0.1025393 0.2629886

Looking at the forecasted model compared to the results, there is still some more room to grow regarding accuracy. We’ll fit some additional models to the data to help with comparisons.

Fitting an ets model

ets_fit <- ets(happytails.ts)
ets_forecast <- forecast(ets_fit)
plot(ets_forecast)

accuracy(ets_forecast)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.2710522 10.14703 7.539165 -2.811093 15.24353 0.6822774
##                    ACF1
## Training set -0.3287415
summary(ets_forecast)
## 
## Forecast method: ETS(M,N,A)
## 
## Model Information:
## ETS(M,N,A) 
## 
## Call:
##  ets(y = happytails.ts) 
## 
##   Smoothing parameters:
##     alpha = 0.2962 
##     gamma = 0.0027 
## 
##   Initial states:
##     l = 58.6696 
##     s=-16.1443 5.599 10.8892 11.9537 12.8195 7.3565
##            -32.4736
## 
##   sigma:  0.1759
## 
##      AIC     AICc      BIC 
## 411.9533 418.0644 430.4548 
## 
## Error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.2710522 10.14703 7.539165 -2.811093 15.24353 0.6822774
##                    ACF1
## Training set -0.3287415
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80     Lo 95     Hi 95
## 2018.714       68.04340 52.70666 83.38014 44.587880  91.49892
## 2018.857       46.30886 34.89702 57.72070 28.855954  63.76176
## 2019.000       29.98401 21.21641 38.75161 16.575124  43.39290
## 2019.143       69.79834 52.97816 86.61853 44.074087  95.52260
## 2019.286       75.23844 56.65100 93.82588 46.811409 103.66547
## 2019.429       74.43644 55.31652 93.55637 45.195039 103.67784
## 2019.571       73.32975 53.75044 92.90905 43.385789 103.27370
## 2019.714       68.04340 48.80311 87.28370 38.617911  97.46889
## 2019.857       46.30886 30.01574 62.60197 21.390687  71.22703
## 2020.000       29.98401 15.39609 44.57194  7.673709  52.29432
## 2020.143       69.79834 49.32173 90.27496 38.482061 101.11463
## 2020.286       75.23844 53.27876 97.19812 41.654014 108.82287
## 2020.429       74.43644 52.01625 96.85663 40.147722 108.72516
## 2020.571       73.32975 50.50973 96.14976 38.429549 108.22994

Fitting a TBATs model:

A TBATS model can be used for a times series that exhibits multiple complex seasonalities, and takes into account autocorrelation in the residuals.

tbats_forecast = msts(happytails.ts, seasonal.periods = c(7))
fit1= tbats(tbats_forecast)
fc_tbats = forecast(fit1)
autoplot(fc_tbats)

accuracy(fc_tbats)
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1.333779 9.436176 7.277694 -0.6556304 15.07298 0.6586148
##                    ACF1
## Training set -0.2864632

We can see from the accuracy() function that the TBATS model performs even better than the ets, with a slightly lower MAPE, MAE, and RMSE, and MASE.


Create Training files, then run naive, seasonal naive, and mean forecasts

#  Create Training files, then run naive, seasonal naive, and mean forecasts
valid_length<-365
train_length<-length(happytails_W.ts)-valid_length
happytails_W_train<-window(happytails_W.ts, start=c(2012,1), end=c(2017,2))
happytails_W_valid <- window(happytails_W.ts, start=c(2017,3), end=c(2018,2))
happytails_W_nfc<-naive(happytails_W_train, h=30)
print(happytails_W_nfc)
##          Point Forecast      Lo 80    Hi 80       Lo 95    Hi 95
## 2017.286             87  56.915367 117.0846   40.989524 133.0105
## 2017.429             87  44.453903 129.5461   21.931360 152.0686
## 2017.571             87  34.891886 139.1081    7.307517 166.6925
## 2017.714             87  26.830733 147.1693   -5.020953 179.0210
## 2017.857             87  19.728715 154.2713  -15.882553 189.8826
## 2018.000             87  13.307999 160.6920  -25.702190 199.7022
## 2018.143             87   7.403542 166.5965  -34.732278 208.7323
## 2018.286             87   1.907807 172.0922  -43.137279 217.1373
## 2018.429             87  -3.253900 177.2539  -51.031429 225.0314
## 2018.571             87  -8.135964 182.1360  -58.497901 232.4979
## 2018.714             87 -12.779441 186.7794  -65.599486 239.5995
## 2018.857             87 -17.216227 191.2162  -72.384965 246.3850
## 2019.000             87 -21.471688 195.4717  -78.893132 252.8931
## 2019.143             87 -25.566391 199.5664  -85.155439 259.1554
## 2019.286             87 -29.517284 203.5173  -91.197809 265.1978
## 2019.429             87 -33.338534 207.3385  -97.041905 271.0419
## 2019.571             87 -37.042121 211.0421 -102.706054 276.7061
## 2019.714             87 -40.638290 214.6383 -108.205919 282.2059
## 2019.857             87 -44.135877 218.1359 -113.555017 287.5550
## 2020.000             87 -47.542571 221.5426 -118.765105 292.7651
## 2020.143             87 -50.865110 224.8651 -123.846491 297.8465
## 2020.286             87 -54.109439 228.1094 -128.808263 302.8083
## 2020.429             87 -57.280833 231.2808 -133.658493 307.6585
## 2020.571             87 -60.384002 234.3840 -138.404380 312.4044
## 2020.714             87 -63.423167 237.4232 -143.052382 317.0524
## 2020.857             87 -66.402133 240.4021 -147.608317 321.6083
## 2021.000             87 -69.324341 243.3243 -152.077448 326.0774
## 2021.143             87 -72.192917 246.1929 -156.464556 330.4646
## 2021.286             87 -75.010709 249.0107 -160.773998 334.7740
## 2021.429             87 -77.780324 251.7803 -165.009758 339.0098
happytails_W_snfc<-snaive(happytails_W_train, h=30)
print(happytails_W_snfc)
##          Point Forecast      Lo 80     Hi 80       Lo 95     Hi 95
## 2017.286             71 49.3131080  92.68689  37.8327608 104.16724
## 2017.429             77 55.3131080  98.68689  43.8327608 110.16724
## 2017.571             74 52.3131080  95.68689  40.8327608 107.16724
## 2017.714             75 53.3131080  96.68689  41.8327608 108.16724
## 2017.857             57 35.3131080  78.68689  23.8327608  90.16724
## 2018.000             38 16.3131080  59.68689   4.8327608  71.16724
## 2018.143             87 65.3131080 108.68689  53.8327608 120.16724
## 2018.286             71 40.3301031 101.66990  24.0944405 117.90556
## 2018.429             77 46.3301031 107.66990  30.0944405 123.90556
## 2018.571             74 43.3301031 104.66990  27.0944405 120.90556
## 2018.714             75 44.3301031 105.66990  28.0944405 121.90556
## 2018.857             57 26.3301031  87.66990  10.0944405 103.90556
## 2019.000             38  7.3301031  68.66990  -8.9055595  84.90556
## 2019.143             87 56.3301031 117.66990  40.0944405 133.90556
## 2019.286             71 33.4372011 108.56280  13.5526565 128.44734
## 2019.429             77 39.4372011 114.56280  19.5526565 134.44734
## 2019.571             74 36.4372011 111.56280  16.5526565 131.44734
## 2019.714             75 37.4372011 112.56280  17.5526565 132.44734
## 2019.857             57 19.4372011  94.56280  -0.4473435 114.44734
## 2020.000             38  0.4372011  75.56280 -19.4473435  95.44734
## 2020.143             87 49.4372011 124.56280  29.5526565 144.44734
## 2020.286             71 27.6262159 114.37378   4.6655216 137.33448
## 2020.429             77 33.6262159 120.37378  10.6655216 143.33448
## 2020.571             74 30.6262159 117.37378   7.6655216 140.33448
## 2020.714             75 31.6262159 118.37378   8.6655216 141.33448
## 2020.857             57 13.6262159 100.37378  -9.3344784 123.33448
## 2021.000             38 -5.3737841  81.37378 -28.3344784 104.33448
## 2021.143             87 43.6262159 130.37378  20.6655216 153.33448
## 2021.286             71 22.5066352 119.49336  -3.1642015 145.16420
## 2021.429             77 28.5066352 125.49336   2.8357985 151.16420
happytails_W_meanfc<-meanf(happytails_W_train, h=30)
print(happytails_W_meanfc)
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2017.286       55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.429       55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.571       55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.714       55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.857       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.000       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.143       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.286       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.429       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.571       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.714       55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.857       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.000       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.143       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.286       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.429       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.571       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.714       55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.857       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.000       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.143       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.286       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.429       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.571       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.714       55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.857       55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.000       55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.143       55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.286       55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.429       55.78378 29.47757 82.08999 14.91753 96.65004

Compare the accurracy of forecasts

We will review the accurracy of Naive, Seasonal, and Mean forecasts.

Naive Forecast

accuracy(happytails_W_nfc, happytails_W.ts)
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set   1.527778 23.47516 18.58333  -9.847215 38.56174 1.486667
## Test set     -25.428571 32.43455 25.42857 -73.765604 73.76560 2.034286
##                    ACF1 Theil's U
## Training set -0.2597335        NA
## Test set      0.1255394 0.6910422

Seasonal Forecast

accuracy(happytails_W_snfc, happytails_W.ts)
##                     ME     RMSE  MAE        MPE     MAPE MASE       ACF1
## Training set  4.766667 16.92237 12.5   3.060644 24.55643 1.00 -0.4777442
## Test set     -6.857143 10.15593  8.0 -20.049207 21.57410 0.64  0.3421299
##              Theil's U
## Training set        NA
## Test set      0.278051

Mean Forecast

accuracy(happytails_W_meanfc, happytails_W.ts)
##                        ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 1.342957e-15 19.61265 16.94814 -22.02119 44.78424 1.355851
## Test set     5.787645e+00 20.94957 19.94981 -11.41727 46.20989 1.595985
##                   ACF1 Theil's U
## Training set 0.2489209        NA
## Test set     0.1255394 0.5018537

Plot the forecasts - Method 1 with Confidence Intervals

Now that we have reviewed accurracy, lets plot the forecasts.

Forecasts from the Naive Method

autoplot(happytails_W_nfc,  xlab="Week", ylab="# Dogs") + autolayer(fitted(happytails_W_nfc))
## Warning: Removed 1 rows containing missing values (geom_path).

Forecasts from the Seasonal Naive Method.

autoplot(happytails_W_snfc,  xlab="Week", ylab="# Dogs") + autolayer(fitted(happytails_W_snfc))
## Warning: Removed 7 rows containing missing values (geom_path).

Forecasts from the Mean Method

autoplot(happytails_W_meanfc,  xlab="Week", ylab="# Dogs") + autolayer(fitted(happytails_W_meanfc))

Using Method 2

#  Plot the forecasts - Method 2
plot(happytails_W_valid, bty="l",xant="n", xlab="Week", yaxt="n", ylab="# Dogs")
## Warning in plot.window(xlim, ylim, log, ...): "xant" is not a graphical
## parameter
## Warning in title(main = main, xlab = xlab, ylab = ylab, ...): "xant" is not
## a graphical parameter
## Warning in axis(1, ...): "xant" is not a graphical parameter
## Warning in axis(2, ...): "xant" is not a graphical parameter
## Warning in box(...): "xant" is not a graphical parameter
axis(2, las=2)
lines(happytails_W_snfc$mean, col=2, lty=2)
legend(2012, 100000, c("Actual", "Forecast"), col=1:2, lty=1:2)

Create a histogram of the forecast errors

# Create a histogram of the forecast errors
my_hist<-hist(happytails_W_snfc$residuals, ylab="Frequency", xlab = "Forecast Error", bty="l", main="Daily Dog Attendance Forecast Residuals \n Method 1")
multiplier <- my_hist$counts / my_hist$density

my_density<- density(happytails_W_snfc$residuals, na.rm=TRUE) 
my_density$y<- my_density$y * multiplier[1]

lines(my_density)

hist(happytails_W_snfc$residuals, breaks=10, probability = TRUE, ylab="Frequency", xlab = "Forecast Error", bty="l", main="Daily Dog Attendance Forecast Residuals \n Method 2")
lines(density(happytails_W_snfc$residuals, na.rm=TRUE))

hist(happytails_W_snfc$residuals, breaks=20, probability = TRUE, ylab="Frequency", xlab = "Forecast Error", bty="l", main="Daily Dog Attendance Forecast Residuals \n Method 3")
lines(density(happytails_W_snfc$residuals, na.rm=TRUE))

Obtaining additional Data

While working on our analysis of the data, we obtained a copy of attendance to from 2007-2011. One of our earlier questions was to determine if the impact of the recession may have had an impact on business. Apparently from plotting the data, it appears to have had a significant amount of impact on business levels as business was very low for the years following the recession.

ht2007 <- read.csv("dccount 2007-2012.csv")
head(ht2007)
##        Date Attendance  X X.1 X.2 X.3
## 1 10/1/2007         21 NA  NA  NA  NA
## 2 10/2/2007         28 NA  NA  NA  NA
## 3 10/3/2007          7 NA  NA  NA  NA
## 4 10/4/2007         21 NA  NA  NA  NA
## 5 10/5/2007         14 NA  NA  NA  NA
## 6 10/8/2007          7 NA  NA  NA  NA
str(ht2007)
## 'data.frame':    1560 obs. of  6 variables:
##  $ Date      : Factor w/ 1540 levels "","1/1/2008",..: 126 179 230 245 250 263 268 131 136 141 ...
##  $ Attendance: int  21 28 7 21 14 7 22 23 28 20 ...
##  $ X         : logi  NA NA NA NA NA NA ...
##  $ X.1       : logi  NA NA NA NA NA NA ...
##  $ X.2       : logi  NA NA NA NA NA NA ...
##  $ X.3       : int  NA NA NA NA NA NA NA NA NA NA ...

Create a Time Series for 2007-2011

ht2007.ts <- ts(ht2007$Attendance, start=c(2007), end=c(2011), freq=12)
autoplot(ht2007.ts, xlab="Year", ylab="Attendance", bty="l")

The Time Series shows that attendance was down following the recession. Interesting as the business was described earlier as “Recession Proof”