happytails <- read.csv("Happy Tail Attendance_updated.csv")
First things first, we’ll plot the overall attendance data, which inlcudes half-day daycare, full-day daycare, and boarding, just to get a general idea of what it looks like.
library(forecast)
library(ggplot2)
library(TSA)
## Warning: package 'TSA' was built under R version 3.4.4
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.4.4
## Loading required package: locfit
## Warning: package 'locfit' was built under R version 3.4.4
## locfit 1.5-9.1 2013-03-22
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:forecast':
##
## getResponse
## This is mgcv 1.8-22. For overview type 'help("mgcv-package")'.
## Loading required package: tseries
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library(ggfortify)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
happytails.ts <- ts(happytails$Attendance, start=c(2012,1), end=c(2017,12), freq=7)
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
There definitely appears to be a linear pattern here consisting of highs and lows.
Let’s look at just the first 3 years of data to see the pattern more clearly:
#Plot first 3 years
ht_first3years= window(happytails.ts, 2012,c(2014,12))
plot(ht_first3years, xlab="Year", ylab= "Attendance", ylim= c(0,200), bty= "l")
We can more clearly see almost an annually repeating “M” shape in the data. Attendance starts out low at the beginning of the year, peaks in the first half of the year, and begins decreasing for the last quarter of the year.
yearly <- aggregate(happytails.ts, nfrequency=1, FUN=sum)
plot(yearly, bty="l")
->It seems as the largest increase in business was from 2013 through 2015 with business leveling off towards 2016. It should be interesting when fitting prediction models to this data to see what the future could hold and if business will continue to level off or grow.<-
Though the monthly plots can be a bit overwhelming at first sight, there are some good things we can pull off of it. For instance, it appears August 2012, March 2014, and July 2015 had the 3 lowest attendance counts from 2012-2017. January 2015, October 2016, and February 2017 had the highest attendance counts. Each year seems to follow a similar pattern, however the drop of attendance seems difficult to predict regarding as to when this change will occur.
# set the outer margin area to the right a bit bigger
par(oma = c(0, 0, 0, 2))
# We have 6 years of data
xrange <- c(1,7)
# Get the range of the ridership to set up a nicely formatted plot
yrange <- range(happytails.ts)
# set up the plot
plot(xrange, yrange, type="n", xlab="Year", ylab="Attendance", bty="l")
# Give each of the months its own color, line type, and character
colors <- rainbow(12)
linetype <- c(1:12)
plotchar <- c(1:12)
# add lines
for (i in 1:12) {
currentMonth <- subset(happytails.ts, cycle(happytails.ts)==i)
lines(currentMonth, type="b", lwd=1.5,
lty=linetype[i], col=colors[i], pch=plotchar[i])
}
# add a title
title("Attendance Broken Out by Month")
# add a legend
legend(6,80, 1:12, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Month")
Another approach is to use the ggseasonplot() function, which plots a line for each year, instead of each month as above. This can be useful for seeing which years did better than others. It’s evident from the below plot that 2017 had some of their highest attendance numbers, climbing above all other years multiple times, while 2012 and 2013 had some of the lowest attendance counts. This is to be expected as we saw from the initial graph of the data that the totals slowly increase as time goes on.
ggseasonplot(happytails.ts)
One way to determine if there is seasonality in the data is to use the ets() function on the time series. The output below, ETS(M,N,A), suggests it’s a time series with multiplicative error, no trend, and additive seaonality.
ets(happytails.ts)
## ETS(M,N,A)
##
## Call:
## ets(y = happytails.ts)
##
## Smoothing parameters:
## alpha = 0.2962
## gamma = 0.0027
##
## Initial states:
## l = 58.6696
## s=-16.1443 5.599 10.8892 11.9537 12.8195 7.3565
## -32.4736
##
## sigma: 0.1759
##
## AIC AICc BIC
## 411.9533 418.0644 430.4548
We will also look at the data from additional perspectives in weekly and monthly increments to determine if that shows any additional trend or pattern that can help with our forecasting. By creating a new time series we will work to decompose and see what comes up. We will start by creating time series for both sets.
#Weekly Time Series
happytails_W.ts<-ts(happytails$Attendance, start=c(2012,1,1),end=c(2018,2,28), frequency = 7)
happytails_W_lm<-tslm(happytails_W.ts ~ trend+I(trend^2))
#Monthly Time Series
happytails_M.ts<-ts(happytails$Attendance, start=c(2012,1,1),end=c(2018,2,28), frequency = 30)
happytails_M_lm<-tslm(happytails_M.ts ~ trend+I(trend^2))
# Decompose and plot the weekly time series for initial analysis
dec_happytails_W <- decompose(happytails_W.ts)
plot(dec_happytails_W)
# Decompose and plot the monthly time series for initial analysis
dec_happytails_M <- decompose(happytails_M.ts)
plot(dec_happytails_M)
#Create monthly and yearly averages
monthly <- group_by(happytails, MY) %>% summarise(mon_tot=sum(Attendance), monthly_avg_per_day=mean(Attendance))
annual <- group_by(happytails, Year) %>% summarise(yearly_tot=sum(Attendance), yearly_avg_per_day=mean(Attendance))
# Create a Time Series and Line Model of the data
monthly_sum.ts<-ts(monthly$mon_tot, start=c(2012,1),end=c(2018,2), frequency = 12)
monthly_sum_lm<-tslm(monthly_sum.ts ~ trend+I(trend^2))
#Create time series for the monthly totals and averages
monthly_avg.ts<-ts(monthly$monthly_avg_per_day, start=c(2012,1),end=c(2018,2), frequency = 12)
monthly_avg_lm<-tslm(monthly_avg.ts ~ trend+I(trend^2))
# Decompose and plot the monthly sum time series
dec_monthly_sum <- decompose(monthly_sum.ts)
plot(dec_monthly_sum)
# Decompose and plot the monthly average time series
dec_monthly_avg <- decompose(monthly_avg.ts)
plot(dec_monthly_avg)
# Generate plots for daily time series
par(mfrow = c(2,1))
plot(happytails_W.ts, col=4, main="Happy Tails Day Care Daily Total Attendance \n 2012 to 2018 with Weekly Seasonality", xlab="Day", ylab="Dogs", ylim=c(0, 120), type="l")
lines(happytails_W_lm$fitted, col=2, lwd=2)
plot(happytails_M.ts, col=4, main="Happy Tails Day Care Daily Total Attendance \n 2012 to 2018 with Monthly Seasonality", xlab="Day", ylab="Dogs", ylim=c(0, 120), type="l")
lines(happytails_M_lm$fitted, col=2, lwd=2)
# Generate plots for monthly time series
par(mfrow = c(2,1))
plot(monthly_sum.ts, col=4, main="Happy Tails Day Care Daily Total \n Attendance by month 2012 to 2018 \n", xlab="Day", ylab="Dogs", ylim=c(1500, 3500), type="l")
lines(monthly_sum_lm$fitted, col=2, lwd=2)
# Generate plots for monthly time series
plot(monthly_avg.ts, col=4, main="Happy Tails Day Care Daily Average \n Attendance by month 2012 to 2018 \n", xlab="Day", ylab="Dogs", ylim=c(50, 120), type="l")
lines(monthly_avg_lm$fitted, col=2, lwd=2)
The output from the ets() function showed no additive or multiplicative trend in the data, but we’ll fit multiple different types of trend lines to the data to see if any of them are a good fit. We’ll fit a quadratic line, an exponential line, and both a linear line with seasonality and one without.
Checking the adjusted R-squared values will determine which model fits the data better. An adjusted R-squared value of .52 would mean 52% of the data can be explained using the chosen method, .75 would mean 75% of the data can be explained, etc.
Linear:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsLinear <- tslm(happytails.ts ~ trend)
summary(happytailsLinear)
##
## Call:
## tslm(formula = happytails.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.848 -14.806 5.029 13.540 24.194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.0305 5.5262 8.149 2.1e-10 ***
## trend 0.5306 0.2005 2.647 0.0111 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.64 on 45 degrees of freedom
## Multiple R-squared: 0.1347, Adjusted R-squared: 0.1155
## F-statistic: 7.008 on 1 and 45 DF, p-value: 0.01115
lines(happytailsLinear$fitted, lwd=2)
Linear with seasonality:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsLinearSeasonal = tslm(happytails.ts ~trend + season)
summary(happytailsLinearSeasonal)
##
## Call:
## tslm(formula = happytails.ts ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.9000 -4.3571 -0.5714 6.4464 16.2429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.2898 4.4755 3.416 0.001497 **
## trend 0.4673 0.1083 4.317 0.000105 ***
## season2 38.6755 5.3598 7.216 1.09e-08 ***
## season3 42.4939 5.3630 7.923 1.20e-09 ***
## season4 44.1694 5.3685 8.227 4.72e-10 ***
## season5 40.9878 5.3761 7.624 3.03e-09 ***
## season6 36.2276 5.5799 6.493 1.07e-07 ***
## season7 14.5935 5.5841 2.613 0.012670 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.03 on 39 degrees of freedom
## Multiple R-squared: 0.7831, Adjusted R-squared: 0.7442
## F-statistic: 20.12 on 7 and 39 DF, p-value: 4e-11
lines(happytailsLinearSeasonal$fitted, lwd= 2)
Quadratic:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsQuad <- tslm(happytails.ts ~ trend + I(trend^2))
summary(happytailsQuad)
##
## Call:
## tslm(formula = happytails.ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.478 -14.110 5.869 13.435 24.511
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.995991 8.592410 4.888 1.4e-05 ***
## trend 0.902219 0.825751 1.093 0.281
## I(trend^2) -0.007741 0.016679 -0.464 0.645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.81 on 44 degrees of freedom
## Multiple R-squared: 0.139, Adjusted R-squared: 0.09982
## F-statistic: 3.55 on 2 and 44 DF, p-value: 0.0372
lines(happytailsQuad$fitted, lty=2, lwd=3)
The Quadratic trend line seems to show a tapering off attendance after 2015.
Exponential:
plot(happytails.ts, xlab="Year", ylab="Attendance", ylim=c(10, 100), bty="l")
happytailsExponential <- tslm(happytails.ts ~ trend + season, lambda=0)
summary(happytailsExponential)
##
## Call:
## tslm(formula = happytails.ts ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48743 -0.10720 0.02158 0.12361 0.47292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.983772 0.098648 30.247 < 2e-16 ***
## trend 0.009045 0.002386 3.790 0.000510 ***
## season2 0.919556 0.118138 7.784 1.84e-09 ***
## season3 1.027919 0.118210 8.696 1.15e-10 ***
## season4 1.049015 0.118331 8.865 6.91e-11 ***
## season5 0.990025 0.118499 8.355 3.20e-10 ***
## season6 0.925654 0.122989 7.526 4.10e-09 ***
## season7 0.496448 0.123081 4.033 0.000248 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.221 on 39 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 4.431e+04 on 7 and 39 DF, p-value: < 2.2e-16
lines(happytailsExponential$fitted, lwd= 2)
As we can see from the accuracy metrics, the exponential trend line fits the data the best, explaining 99% of it.
We will now compute an estimate of the autocorrelation function on the time series with the Acf() function. A strong positive lag-1 autocorrelation means consecutive values generally move in the same direction, and is seen in time series with strong linear trends.
A time series is considered white noise if the variables are uncorrelated, or are close to zero.
happytailsACF <- Acf(happytails.ts)
print(happytailsACF)
##
## Autocorrelations of series 'happytails.ts', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 0.288 -0.052 -0.262 -0.201 -0.075 0.370 0.593 0.272 -0.103
## 10 11 12 13 14 15 16
## -0.195 -0.261 -0.084 0.240 0.545 0.125 -0.121
Lag-0 autocorrelation is 1, which will always be the case, and which shows that the y-values are perfectly correlated with themselves. We see that lag-7, lag-14 and lag-21 have the largest autocorrelations. These large values indicate a linear trend and monthly seasonality. This obervation includes negative autocorrelations for this data set, which could mean that business is impacted by external forces, such as seasonal changes.
Differencing the data with a lag-1 to help adjust for a linear trend:
Acf(diff(happytails.ts, lag=1), lag.max=12, main="ACF Plot for Differenced Series")
The plot indicates that most of the seasonality has been captured, though there is still one significant lag at lag7.
Lets plot the differenced series.
plot(diff(happytails.ts, ylim=c(10, 100), bty="l"))
This plot confirms that seasonality still exists within the differenced series.
We will also conduct a Llung-Box Test to check the white noise. If it returns a small p-value we will know it is not white noise.
Box.test(happytails.ts, lag=24, fitdf=0, type= "Ljung-Box")
##
## Box-Ljung test
##
## data: happytails.ts
## X-squared = 110.66, df = 24, p-value = 4.322e-13
We can see that there is a very small p-value, meaning that the data is not white noise, but is in fact correlated.
Now that we have evaluated the data, it is time to start working with forecasting models and applying them to the data. But first, we have to partition the data into training and validation periods. Our validation period will be one year, so we’ll set the validation length to 12.
validLength <- 12
trainLength <- length(happytails.ts) - validLength
happytailsTrain <- window(happytails.ts, end=c(2012, trainLength))
happytailsValid <- window(happytails.ts, start=c(2012,trainLength+1))
Compute both the trailing moving average with a window size of 12 and the centered moving average and plot.
# First, let's compute the trailing moving average based on the training set
# In the package zoo, we have the function rollmean()
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
trailingMA <- rollmean(happytailsTrain, k=12, align="right")
# We can also find the centered moving average with the function ma()
centeredMA <- ma(happytailsTrain, order=12)
# Helps set up the plot
yrange = range(happytails.ts)
# Set up the plot
plot(c(2012, 2018), yrange, type="n", xlab="Year", ylab="Attendance", bty="l", xaxt="n", yaxt="n")
# Add the time series air
lines(happytails.ts, bty="l")
# Add the x-axis
axis(1, at=seq(2012,2017,1), labels=format(seq(2012,2017,1)))
# Add the y-axis
axis(2, at=seq(10,100,10), labels=format(seq(10,100,10)), las=2)
lines(trailingMA, col="red", lty=2)
lines(centeredMA)
# Add the legend
legend(2012,99, c("ARPM", "Centered Moving Average", "Trailing Moving Average"), lty=c(1,1,2), col=c("black", "black", "red"), bty="n")
Both the trailing moving average and centered moving average seem to more accurately capture the seasonal dip in attendance that happens each year. Prior to 2015, the central moving aveage was larger than the trailing average almost all of the time which would be indicative of a significant upward trend between 2012 and 2015.
An AR model is an autogressive linear model that predicts the value of a time series using the preceding value in time. An AR(1) model suggests an autogressive model on a dataset with autocorrelation at lag 1.
A random walk is a special form of an AR(1) model in which the slope efficient is equal to 1 and changes from one period to the next are random. Because of this, random walk series are difficult to predict.
We will fit an AR(1) model and look at the estimated co-efficients. In order to accurately say the series is not a random walk, there must be a small p-value. Anything larger than ??=0.05 for a p-value would signify the series is a random walk.
fit <-Arima(happytails.ts, order=c(1,0,0))
print(fit)
## Series: happytails.ts
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.2975 57.6904
## s.e. 0.1413 3.8552
##
## sigma^2 estimated as 366.4: log likelihood=-204.45
## AIC=414.91 AICc=415.47 BIC=420.46
We will calculate the two-tailed p-value from a t-distrubtion to be “conservative”.
2*pt(-abs((1 - fit$coef["ar1"]) / sqrt(diag(vcov(fit)))["ar1"]), df=length(happytails.ts)-1)
## ar1
## 9.645865e-06
Now we will calculate it using the normal distribution, just to check
2*pnorm(-abs((1 - fit$coef["ar1"]) / sqrt(diag(vcov(fit)))["ar1"]))
## ar1
## 6.616083e-07
Clearly, this does not appear to be a random walk, which is not suprising, considering we believe that there is seasonal patterns to this data set, and therefore wouldn’t be random.
Using AR as a Second-Layer Model
Another way to take advantage of autocorrelation is to create a second-level forecasting model on the residuals. This approach is often useful for short-term forecasting, and includes a 3 step process where an AR model is fitted to the series of residuals instead of the raw data series, and is then used to forecast future residuals.
Step 1: fit a multiple linear regression model that accounts for linear trend and additive seasonality.
Step 2: fitting an AR model to the forecast errors from Step One.
Step 3: compute the adjusted forecast from steps 1 and 2
Step 1: As the data has already been previously partioned, we will fit the model with linear trend and additive seasonality to it and see what it looks like
happytailsLinearSeason <- tslm(happytailsTrain ~ trend + season)
summary(happytailsLinearSeason)
##
## Call:
## tslm(formula = happytailsTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2000 -4.0000 -0.4857 5.9000 20.5429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.8367 5.2868 2.617 0.014348 *
## trend 0.6776 0.1752 3.868 0.000627 ***
## season2 33.5224 6.4904 5.165 1.96e-05 ***
## season3 42.8449 6.4975 6.594 4.49e-07 ***
## season4 43.7673 6.5093 6.724 3.22e-07 ***
## season5 39.8898 6.5257 6.113 1.57e-06 ***
## season6 32.2122 6.5469 4.920 3.77e-05 ***
## season7 13.3347 6.5726 2.029 0.052442 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.26 on 27 degrees of freedom
## Multiple R-squared: 0.7804, Adjusted R-squared: 0.7234
## F-statistic: 13.7 on 7 and 27 DF, p-value: 1.998e-07
There’s a fairlyl high p-value, and a very low adjusted R-sqaured value.
Now we want to look at the ACF plot for the residuals from our fitted model
resACF <- Acf(happytailsLinearSeason$residuals)
print(resACF)
##
## Autocorrelations of series 'happytailsLinearSeason$residuals', by lag
##
## 0 1 2 3 4 5 6 7 8 9
## 1.000 -0.439 0.312 -0.256 0.277 -0.117 0.222 -0.533 0.372 -0.311
## 10 11 12 13 14 15
## 0.223 -0.257 0.045 -0.076 0.110 -0.209
We can see there are still multiple significant lags, meaning there is still autocorrelation present.
Step 2: Let’s run an AR(2) model on the residuals from the fitted model to see if it fits better than the AR(1):
ARModel <- Arima(happytailsLinearSeason$residuals, order=c(2,0,0))
print(ARModel)
## Series: happytailsLinearSeason$residuals
## ARIMA(2,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 mean
## -0.4003 0.2060 -0.1677
## s.e. 0.1685 0.1785 1.1068
##
## sigma^2 estimated as 66.49: log likelihood=-121.73
## AIC=251.47 AICc=252.8 BIC=257.69
Before going forward, we have to calculate the t statistics for each model, which is the coefficient divided by the standard error.
(Anything greater that 2 or -2 is statistically significant)
AR(1) t statistic:
ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]
## ar1
## -2.374998
AR(2) t statistic:
ARModel$coef["ar2"] / sqrt(diag(vcov(ARModel)))["ar2"]
## ar2
## 1.154289
Now to estimate the p-value on the normal distribution:
AR(1) p-value:
2*pnorm(-abs(ARModel$coef["ar1"] / sqrt(diag(vcov(ARModel)))["ar1"]))
## ar1
## 0.01754905
AR(2) p-value:
2*pnorm(-abs(ARModel$coef["ar2"] / sqrt(diag(vcov(ARModel)))["ar1"]))
## ar2
## 0.2215266
This confirms that we should have just used an AR(1) Model since the residuals are statistically significant. We also may wish to see if we have accounted for autocorrelation in the series.
Create an ACF plot fo the residuals from the AR(2) Model
Acf(ARModel$residuals)
We can tell from the plot that we haven’t captured all the autocorrelations, as there are still numerous lags past the blue dotted line.
Step 3
Now we want to use the originally fitted regression model, along with the autocorrelation of the error terms, to forecast for the validation period.
lrForecast <- forecast(happytailsLinearSeason, h=validLength)
lrForecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2017.000 38.22857 22.69498 53.76216 13.96716 62.48998
## 2017.143 72.42857 56.89498 87.96216 48.16716 96.68998
## 2017.286 82.42857 66.89498 97.96216 58.16716 106.68998
## 2017.429 84.02857 68.49498 99.56216 59.76716 108.28998
## 2017.571 80.82857 65.29498 96.36216 56.56716 105.08998
## 2017.714 73.82857 58.29498 89.36216 49.56716 98.08998
## 2017.857 55.62857 40.09498 71.16216 31.36716 79.88998
## 2018.000 42.97143 26.86385 59.07901 17.81352 68.12934
## 2018.143 77.17143 61.06385 93.27901 52.01352 102.32934
## 2018.286 87.17143 71.06385 103.27901 62.01352 112.32934
## 2018.429 88.77143 72.66385 104.87901 63.61352 113.92934
## 2018.571 85.57143 69.46385 101.67901 60.41352 110.72934
Generate a forecast for the error terms using residuals AR(2) model
errorForecast <- forecast(ARModel, h=validLength)
errorForecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2017.000 -1.429140796 -11.879015 9.020733 -17.41084 14.55256
## 2017.143 1.631471793 -9.624434 12.887378 -15.58295 18.84589
## 2017.286 -1.147773536 -13.036554 10.741007 -19.33009 17.03455
## 2017.429 0.595253558 -11.532108 12.722615 -17.95195 19.14245
## 2017.571 -0.675036398 -12.927534 11.577461 -19.41361 18.06354
## 2017.714 0.192538783 -12.117834 12.502912 -18.63455 19.01963
## 2017.857 -0.416443634 -12.755246 11.922358 -19.28701 18.45413
## 2018.000 0.006060449 -12.346400 12.358520 -18.88540 18.89752
## 2018.143 -0.288524032 -12.647619 12.070571 -19.19013 18.61308
## 2018.286 -0.083562100 -12.445867 12.278743 -18.99008 18.82295
## 2018.429 -0.226295513 -12.590157 12.137566 -19.13519 18.68260
## 2018.571 -0.126935261 -12.491551 12.237681 -19.03698 18.78311
Create the adjusted forecast by adding the two forecasts together
adjustedForecast <- lrForecast$mean + errorForecast$mean
adjustedForecast
## Time Series:
## Start = c(2017, 1)
## End = c(2018, 5)
## Frequency = 7
## [1] 36.79943 74.06004 81.28080 84.62382 80.15354 74.02111 55.21213
## [8] 42.97749 76.88290 87.08787 88.54513 85.44449
We can see that we want to adjust the straight regression forecast down from June through July, October through December and February (negative values for the error forecast). We will also want to adjust it up for the positive months on the error forecast.
Now lets plot the adjusted forecast next to the actuals from the validation period:
plot(happytailsValid, bty="l", xaxt="n", xlab="Year 2017", yaxt="n", ylab="Attendance", ylim=c(10,100))
# Add the x-axis
axis(1, at=seq(2017, 2017 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Add the y-axis
axis(2, at=seq(10,100,10), labels=format(seq(10,100,10)), las=2)
# Add the adjusted forecast
lines(adjustedForecast, col="blue")
# Add a legend
legend(1,99, c("Actuals", "Adjusted Forecast"), lty=c(1,1), col=c("black", "blue"), bty="n")
We could also look at accuracy metrics.
accuracy(adjustedForecast, happytailsValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -7.59073 12.0672 10.44397 -18.06111 21.71138 0.1025393 0.2629886
Looking at the forecasted model compared to the results, there is still some more room to grow regarding accuracy. We’ll fit some additional models to the data to help with comparisons.
ets_fit <- ets(happytails.ts)
ets_forecast <- forecast(ets_fit)
plot(ets_forecast)
accuracy(ets_forecast)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2710522 10.14703 7.539165 -2.811093 15.24353 0.6822774
## ACF1
## Training set -0.3287415
summary(ets_forecast)
##
## Forecast method: ETS(M,N,A)
##
## Model Information:
## ETS(M,N,A)
##
## Call:
## ets(y = happytails.ts)
##
## Smoothing parameters:
## alpha = 0.2962
## gamma = 0.0027
##
## Initial states:
## l = 58.6696
## s=-16.1443 5.599 10.8892 11.9537 12.8195 7.3565
## -32.4736
##
## sigma: 0.1759
##
## AIC AICc BIC
## 411.9533 418.0644 430.4548
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2710522 10.14703 7.539165 -2.811093 15.24353 0.6822774
## ACF1
## Training set -0.3287415
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2018.714 68.04340 52.70666 83.38014 44.587880 91.49892
## 2018.857 46.30886 34.89702 57.72070 28.855954 63.76176
## 2019.000 29.98401 21.21641 38.75161 16.575124 43.39290
## 2019.143 69.79834 52.97816 86.61853 44.074087 95.52260
## 2019.286 75.23844 56.65100 93.82588 46.811409 103.66547
## 2019.429 74.43644 55.31652 93.55637 45.195039 103.67784
## 2019.571 73.32975 53.75044 92.90905 43.385789 103.27370
## 2019.714 68.04340 48.80311 87.28370 38.617911 97.46889
## 2019.857 46.30886 30.01574 62.60197 21.390687 71.22703
## 2020.000 29.98401 15.39609 44.57194 7.673709 52.29432
## 2020.143 69.79834 49.32173 90.27496 38.482061 101.11463
## 2020.286 75.23844 53.27876 97.19812 41.654014 108.82287
## 2020.429 74.43644 52.01625 96.85663 40.147722 108.72516
## 2020.571 73.32975 50.50973 96.14976 38.429549 108.22994
A TBATS model can be used for a times series that exhibits multiple complex seasonalities, and takes into account autocorrelation in the residuals.
tbats_forecast = msts(happytails.ts, seasonal.periods = c(7))
fit1= tbats(tbats_forecast)
fc_tbats = forecast(fit1)
autoplot(fc_tbats)
accuracy(fc_tbats)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.333779 9.436176 7.277694 -0.6556304 15.07298 0.6586148
## ACF1
## Training set -0.2864632
We can see from the accuracy() function that the TBATS model performs even better than the ets, with a slightly lower MAPE, MAE, and RMSE, and MASE.
# Create Training files, then run naive, seasonal naive, and mean forecasts
valid_length<-365
train_length<-length(happytails_W.ts)-valid_length
happytails_W_train<-window(happytails_W.ts, start=c(2012,1), end=c(2017,2))
happytails_W_valid <- window(happytails_W.ts, start=c(2017,3), end=c(2018,2))
happytails_W_nfc<-naive(happytails_W_train, h=30)
print(happytails_W_nfc)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2017.286 87 56.915367 117.0846 40.989524 133.0105
## 2017.429 87 44.453903 129.5461 21.931360 152.0686
## 2017.571 87 34.891886 139.1081 7.307517 166.6925
## 2017.714 87 26.830733 147.1693 -5.020953 179.0210
## 2017.857 87 19.728715 154.2713 -15.882553 189.8826
## 2018.000 87 13.307999 160.6920 -25.702190 199.7022
## 2018.143 87 7.403542 166.5965 -34.732278 208.7323
## 2018.286 87 1.907807 172.0922 -43.137279 217.1373
## 2018.429 87 -3.253900 177.2539 -51.031429 225.0314
## 2018.571 87 -8.135964 182.1360 -58.497901 232.4979
## 2018.714 87 -12.779441 186.7794 -65.599486 239.5995
## 2018.857 87 -17.216227 191.2162 -72.384965 246.3850
## 2019.000 87 -21.471688 195.4717 -78.893132 252.8931
## 2019.143 87 -25.566391 199.5664 -85.155439 259.1554
## 2019.286 87 -29.517284 203.5173 -91.197809 265.1978
## 2019.429 87 -33.338534 207.3385 -97.041905 271.0419
## 2019.571 87 -37.042121 211.0421 -102.706054 276.7061
## 2019.714 87 -40.638290 214.6383 -108.205919 282.2059
## 2019.857 87 -44.135877 218.1359 -113.555017 287.5550
## 2020.000 87 -47.542571 221.5426 -118.765105 292.7651
## 2020.143 87 -50.865110 224.8651 -123.846491 297.8465
## 2020.286 87 -54.109439 228.1094 -128.808263 302.8083
## 2020.429 87 -57.280833 231.2808 -133.658493 307.6585
## 2020.571 87 -60.384002 234.3840 -138.404380 312.4044
## 2020.714 87 -63.423167 237.4232 -143.052382 317.0524
## 2020.857 87 -66.402133 240.4021 -147.608317 321.6083
## 2021.000 87 -69.324341 243.3243 -152.077448 326.0774
## 2021.143 87 -72.192917 246.1929 -156.464556 330.4646
## 2021.286 87 -75.010709 249.0107 -160.773998 334.7740
## 2021.429 87 -77.780324 251.7803 -165.009758 339.0098
happytails_W_snfc<-snaive(happytails_W_train, h=30)
print(happytails_W_snfc)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2017.286 71 49.3131080 92.68689 37.8327608 104.16724
## 2017.429 77 55.3131080 98.68689 43.8327608 110.16724
## 2017.571 74 52.3131080 95.68689 40.8327608 107.16724
## 2017.714 75 53.3131080 96.68689 41.8327608 108.16724
## 2017.857 57 35.3131080 78.68689 23.8327608 90.16724
## 2018.000 38 16.3131080 59.68689 4.8327608 71.16724
## 2018.143 87 65.3131080 108.68689 53.8327608 120.16724
## 2018.286 71 40.3301031 101.66990 24.0944405 117.90556
## 2018.429 77 46.3301031 107.66990 30.0944405 123.90556
## 2018.571 74 43.3301031 104.66990 27.0944405 120.90556
## 2018.714 75 44.3301031 105.66990 28.0944405 121.90556
## 2018.857 57 26.3301031 87.66990 10.0944405 103.90556
## 2019.000 38 7.3301031 68.66990 -8.9055595 84.90556
## 2019.143 87 56.3301031 117.66990 40.0944405 133.90556
## 2019.286 71 33.4372011 108.56280 13.5526565 128.44734
## 2019.429 77 39.4372011 114.56280 19.5526565 134.44734
## 2019.571 74 36.4372011 111.56280 16.5526565 131.44734
## 2019.714 75 37.4372011 112.56280 17.5526565 132.44734
## 2019.857 57 19.4372011 94.56280 -0.4473435 114.44734
## 2020.000 38 0.4372011 75.56280 -19.4473435 95.44734
## 2020.143 87 49.4372011 124.56280 29.5526565 144.44734
## 2020.286 71 27.6262159 114.37378 4.6655216 137.33448
## 2020.429 77 33.6262159 120.37378 10.6655216 143.33448
## 2020.571 74 30.6262159 117.37378 7.6655216 140.33448
## 2020.714 75 31.6262159 118.37378 8.6655216 141.33448
## 2020.857 57 13.6262159 100.37378 -9.3344784 123.33448
## 2021.000 38 -5.3737841 81.37378 -28.3344784 104.33448
## 2021.143 87 43.6262159 130.37378 20.6655216 153.33448
## 2021.286 71 22.5066352 119.49336 -3.1642015 145.16420
## 2021.429 77 28.5066352 125.49336 2.8357985 151.16420
happytails_W_meanfc<-meanf(happytails_W_train, h=30)
print(happytails_W_meanfc)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2017.286 55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.429 55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.571 55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.714 55.78378 29.47757 82.08999 14.91753 96.65004
## 2017.857 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.000 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.143 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.286 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.429 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.571 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.714 55.78378 29.47757 82.08999 14.91753 96.65004
## 2018.857 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.000 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.143 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.286 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.429 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.571 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.714 55.78378 29.47757 82.08999 14.91753 96.65004
## 2019.857 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.000 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.143 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.286 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.429 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.571 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.714 55.78378 29.47757 82.08999 14.91753 96.65004
## 2020.857 55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.000 55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.143 55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.286 55.78378 29.47757 82.08999 14.91753 96.65004
## 2021.429 55.78378 29.47757 82.08999 14.91753 96.65004
We will review the accurracy of Naive, Seasonal, and Mean forecasts.
Naive Forecast
accuracy(happytails_W_nfc, happytails_W.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.527778 23.47516 18.58333 -9.847215 38.56174 1.486667
## Test set -25.428571 32.43455 25.42857 -73.765604 73.76560 2.034286
## ACF1 Theil's U
## Training set -0.2597335 NA
## Test set 0.1255394 0.6910422
Seasonal Forecast
accuracy(happytails_W_snfc, happytails_W.ts)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 4.766667 16.92237 12.5 3.060644 24.55643 1.00 -0.4777442
## Test set -6.857143 10.15593 8.0 -20.049207 21.57410 0.64 0.3421299
## Theil's U
## Training set NA
## Test set 0.278051
Mean Forecast
accuracy(happytails_W_meanfc, happytails_W.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.342957e-15 19.61265 16.94814 -22.02119 44.78424 1.355851
## Test set 5.787645e+00 20.94957 19.94981 -11.41727 46.20989 1.595985
## ACF1 Theil's U
## Training set 0.2489209 NA
## Test set 0.1255394 0.5018537
Now that we have reviewed accurracy, lets plot the forecasts.
Forecasts from the Naive Method
autoplot(happytails_W_nfc, xlab="Week", ylab="# Dogs") + autolayer(fitted(happytails_W_nfc))
## Warning: Removed 1 rows containing missing values (geom_path).
Forecasts from the Seasonal Naive Method.
autoplot(happytails_W_snfc, xlab="Week", ylab="# Dogs") + autolayer(fitted(happytails_W_snfc))
## Warning: Removed 7 rows containing missing values (geom_path).
Forecasts from the Mean Method
autoplot(happytails_W_meanfc, xlab="Week", ylab="# Dogs") + autolayer(fitted(happytails_W_meanfc))
# Plot the forecasts - Method 2
plot(happytails_W_valid, bty="l",xant="n", xlab="Week", yaxt="n", ylab="# Dogs")
## Warning in plot.window(xlim, ylim, log, ...): "xant" is not a graphical
## parameter
## Warning in title(main = main, xlab = xlab, ylab = ylab, ...): "xant" is not
## a graphical parameter
## Warning in axis(1, ...): "xant" is not a graphical parameter
## Warning in axis(2, ...): "xant" is not a graphical parameter
## Warning in box(...): "xant" is not a graphical parameter
axis(2, las=2)
lines(happytails_W_snfc$mean, col=2, lty=2)
legend(2012, 100000, c("Actual", "Forecast"), col=1:2, lty=1:2)
# Create a histogram of the forecast errors
my_hist<-hist(happytails_W_snfc$residuals, ylab="Frequency", xlab = "Forecast Error", bty="l", main="Daily Dog Attendance Forecast Residuals \n Method 1")
multiplier <- my_hist$counts / my_hist$density
my_density<- density(happytails_W_snfc$residuals, na.rm=TRUE)
my_density$y<- my_density$y * multiplier[1]
lines(my_density)
hist(happytails_W_snfc$residuals, breaks=10, probability = TRUE, ylab="Frequency", xlab = "Forecast Error", bty="l", main="Daily Dog Attendance Forecast Residuals \n Method 2")
lines(density(happytails_W_snfc$residuals, na.rm=TRUE))
hist(happytails_W_snfc$residuals, breaks=20, probability = TRUE, ylab="Frequency", xlab = "Forecast Error", bty="l", main="Daily Dog Attendance Forecast Residuals \n Method 3")
lines(density(happytails_W_snfc$residuals, na.rm=TRUE))
While working on our analysis of the data, we obtained a copy of attendance to from 2007-2011. One of our earlier questions was to determine if the impact of the recession may have had an impact on business. Apparently from plotting the data, it appears to have had a significant amount of impact on business levels as business was very low for the years following the recession.
ht2007 <- read.csv("dccount 2007-2012.csv")
head(ht2007)
## Date Attendance X X.1 X.2 X.3
## 1 10/1/2007 21 NA NA NA NA
## 2 10/2/2007 28 NA NA NA NA
## 3 10/3/2007 7 NA NA NA NA
## 4 10/4/2007 21 NA NA NA NA
## 5 10/5/2007 14 NA NA NA NA
## 6 10/8/2007 7 NA NA NA NA
str(ht2007)
## 'data.frame': 1560 obs. of 6 variables:
## $ Date : Factor w/ 1540 levels "","1/1/2008",..: 126 179 230 245 250 263 268 131 136 141 ...
## $ Attendance: int 21 28 7 21 14 7 22 23 28 20 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : logi NA NA NA NA NA NA ...
## $ X.3 : int NA NA NA NA NA NA NA NA NA NA ...
ht2007.ts <- ts(ht2007$Attendance, start=c(2007), end=c(2011), freq=12)
autoplot(ht2007.ts, xlab="Year", ylab="Attendance", bty="l")
The Time Series shows that attendance was down following the recession. Interesting as the business was described earlier as “Recession Proof”