Introduction

This R guide will focus on time series data sets and how to analyze and interpret them. A time series data set consists of observations recorded sequentially over time, usually at regular intervals.

The very first thing we’re going to do in this R guide is explore the data set we’ll be using. Then, we’ll talk about how and why to use various smoothing techniques. Next, we’ll look at how time series data can be modeled using the linear modeling techniques we have already learned. Then, we’ll investigate how to manually and automatically create an autoregressive integrated moving average (ARIMA) model. Finally, we’ll compare the models we’ve created.

Exploring the Data

Understanding our Data Set

The data set we’ll be using for this R guide is \(\texttt{lap}\) from the R package astsa. It contains weekly mortality, weather, and pollution measurements for Los Angeles, recorded from the first quarter of 1970 through the third quarter of 1979, and was collected as part of the Los Angeles Pollution-Mortality Study. Let’s use \(\texttt{head(dataset)}\) and \(\texttt{names(dataset)}\) to take a look at the data. In order to use these commands with our data set, we’ll need to first transform the data set from a time series object to a data frame. We can do this using \(\texttt{as.data.frame(dataset)}\).

library(astsa)
data(lap)
names(as.data.frame(lap))
##  [1] "tmort"  "rmort"  "cmort"  "tempr"  "rh"     "co"     "so2"   
##  [8] "no2"    "hycarb" "o3"     "part"
head(as.data.frame(lap))
##    tmort rmort  cmort tempr    rh    co  so2   no2 hycarb   o3  part
## 1 183.63 11.90  97.85 72.38 29.20 11.51 3.37  9.64  45.79 6.69 72.72
## 2 191.05 10.75 104.64 67.19 67.51  8.92 2.59 10.05  43.90 6.83 49.60
## 3 180.09  9.33  94.36 62.94 61.42  9.48 3.29  7.80  32.18 4.98 55.68
## 4 184.67  9.54  98.05 72.49 58.99 10.28 3.04 13.39  40.43 9.25 55.16
## 5 173.60  8.27  95.85 74.25 34.80 10.57 3.39 11.90  48.53 9.15 66.02
## 6 183.73  7.55  95.98 67.88 66.78  7.99 2.57 10.11  48.61 8.80 44.01

We can see that there are many different variables to choose from, all of which are quantitative. For this R guide, we have chosen to look at cardiovascular mortality in Los Angeles over time. So, we will be using the variable \(\texttt{cmort}\) as our response, and since our data set is a time series object, time is built into the series as its index, so we will not need to specify a separate time variable as our predictor.

Plotting

We can plot our time series data using the \(\texttt{plot(response)}\) command, where \(\texttt{response}\) is our response variable. We can give the y-axis a more descriptive label by adding the argument \(\texttt{ylab = "name"}\).

plot(lap[,"cmort"], ylab = "Cardiovascular Mortality")

Now that we know what our data looks like, let’s move on to smoothing out the trend so we can better visualize the data without all the noise.

Smoothing a Time Series

Smoothing is an important part of data visualization because it filters out the noise in our data set so we can see the overall trend the data follows. There are many different smoothing methods available, and we’ll illustrate several of the most common ones.

Basic Moving Averages

The very first smoothing method we’ll look at is the basic moving average. This technique replaces each point with an average of that point and a few of its neighbors to remove some of the variation in the data set. We can accomplish this using the \(\texttt{filter(response, sides, filter)}\) command from the base stats package. The \(\texttt{sides}\) argument takes the value 1 or 2 depending on whether you want the average to use only past values in the time series (1) or both past and future values (2). The \(\texttt{filter}\) argument is a vector of weights: its length determines how many data values are averaged for each point, and its entries give the weight each value receives. For this example, we’ll average both past and future values, using one value on either side of each point and giving each equal weight, to create our moving average.

bma <- filter(lap[,"cmort"], sides=2, filter = (c(1, 1, 1)/3))
par(mfrow=c(2, 1))
plot(lap[,"cmort"], ylab = "Cardiovascular Mortality", main = "Original")
plot(bma, ylab = "Cardiovascular Mortality", main = "Smoothed")

We also plotted the original and smoothed time series one above the other to illustrate the difference between the two. For this, we used \(\texttt{par(mfrow=c(row, column))}\), where \(\texttt{row}\) and \(\texttt{column}\) are the numbers of rows and columns in the plotting grid. Then, we plotted the original response variable and our filtered response variable, adding a title to each plot.

There is a clear difference between the two plots: the smoothed series is far less noisy, i.e. it has less short-term variation. Next, we’ll move on to weighted moving averages.

Weighted Moving Averages

The weighted moving average technique is very similar to the basic moving average technique; only the values in the \(\texttt{filter}\) argument change. In this example, we want the point being smoothed to have twice the weight of its two neighbors, so we use the weight vector \(\texttt{c(1, 2, 1)/4}\). Note that we divide by 4 so that the weights still sum to 1, which keeps the smoothed series on the same scale as the original.

wma <- filter(lap[,"cmort"], sides=2, filter = c(1, 2, 1)/4)
par(mfrow=c(2, 1))
plot(lap[,"cmort"], ylab = "Cardiovascular Mortality", main = "Original")
plot(wma, ylab = "Cardiovascular Mortality", main = "Smoothed")

As you can see, this technique had a very similar effect on our time series to the basic moving average technique.

Normal Kernel Smoother

Now, we’ll look at the normal kernel smoothing technique. This technique weights the points in each local average using a normal (Gaussian) density centered at the point being smoothed: that point gets the highest weight, and the further away another point is, the less weight it receives. For this method, we use the command \(\texttt{ksmooth(x, y, "normal", bandwidth)}\). \(\texttt{x}\) and \(\texttt{y}\) are our time and response variables, respectively. Since our data is a time series object, we’ll need to use the \(\texttt{time(dataset)}\) command to access our time variable. \(\texttt{"normal"}\) indicates that we are using a normal kernel smoother, and \(\texttt{bandwidth}\) controls how quickly the weights fall off with distance, i.e. how much smoothing is applied. For this example, we’ll use a bandwidth of .1.

ks <- ksmooth(time(lap), lap[,"cmort"], "normal", bandwidth=.1)
plot(lap[,"cmort"], ylab = "Cardiovascular Mortality")
lines(ks, lwd=4, lty=2)

We can see the smoothed series overlaid on the original time series using the command \(\texttt{lines()}\), with the arguments \(\texttt{lwd}\) and \(\texttt{lty}\) changing the line width and type, respectively. The dashed line is the smoothed time series.

Using this technique, we can still see the broader variation in the data, but with significantly less noise.

Lowess Smoothing

Next is the lowess (locally weighted scatterplot smoothing) technique. This technique smooths the time series with localized regression: a weighted linear regression is fit around each data point using some proportion of the surrounding points in the time series. We can do lowess smoothing using the command \(\texttt{lowess(response, f)}\), where \(\texttt{f}\) is the smoother span, the proportion of points in the time series used for each localized regression. We’ll use a smoother span of .03 for this example.

ls <- lowess(lap[,"cmort"], f=.03)
plot(lap[,"cmort"], ylab = "Cardiovascular Mortality")
lines(ls, lwd=4, lty=4)

Again, this smoother definitely removes a lot of noise from our time series and allows us to see the true trend of the data.

Spline Smoothing

Lastly, we discuss spline smoothing. The spline smoothing technique fits piecewise cubic polynomials that join smoothly at knots along the time axis, with a roughness penalty that keeps the curve from chasing every wiggle in the data. We can smooth using this technique with the command \(\texttt{smooth.spline(x, y, spar)}\). The \(\texttt{x}\) and \(\texttt{y}\) arguments are the time and response variables, and \(\texttt{spar}\) is the smoothing parameter: larger values give a smoother curve. We’ll use a smoothing parameter of .3 for this example.

ss <- smooth.spline(time(lap), lap[,"cmort"], spar = .3)
plot(lap[,"cmort"], ylab = "Cardiovascular Mortality")
lines(ss, lwd=4, lty=5)

This last smoothing technique also allows us to see the overall trend of the data without the noise. Additionally, we notice that each of these techniques produces broadly similar results, so the choice of technique is largely up to the user; the tuning parameters, however, will need to be adjusted for each data set.
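As an optional sketch (not part of the original guide), we can overlay all five smoothers on the original series, reusing the \(\texttt{bma}\), \(\texttt{wma}\), \(\texttt{ks}\), \(\texttt{ls}\) and \(\texttt{ss}\) objects created above, to compare them directly:

# optional: overlay the five smoothers on the original series
plot(lap[,"cmort"], ylab = "Cardiovascular Mortality", col = "grey")
lines(bma, lty = 1, lwd = 2)  # basic moving average
lines(wma, lty = 2, lwd = 2)  # weighted moving average
lines(ks, lty = 3, lwd = 2)   # normal kernel smoother
lines(ls, lty = 4, lwd = 2)   # lowess
lines(ss, lty = 5, lwd = 2)   # smoothing spline
legend("topright", legend = c("moving avg", "weighted avg", "kernel", "lowess", "spline"),
       lty = 1:5, lwd = 2, cex = 0.7)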

Decomposing the Data

Now we want to look at where the variation in our data set is coming from. To do this, we’ll need to look at the decomposition of the data. Decomposing the data separates the series into its components: an overall trend, a seasonal component, and a random remainder. We can decompose the data using the function \(\texttt{decompose(dataset)}\).

Before we do the decomposition, we should check that the seasonal pattern has a roughly constant amplitude over time, neither growing nor shrinking, since the default additive decomposition assumes this. With the exception of the large spike around the beginning of 1973, we see a roughly constant seasonal pattern in our data set, since the distance between the peaks and troughs of our time series remains relatively consistent. Once we get our decomposed values, we will plot them using the \(\texttt{plot()}\) command.

decomp <- decompose(lap[,"cmort"])
plot(decomp)

The decomposition shows an overall decreasing trend along with a strong seasonal component, so we’ll need to account for both in our modeling.

Using Linear Models for Time Series Data

Creating the Model

Since we’ve talked a lot about linear models in the past, let’s create a multiple linear regression model to predict cardiovascular mortality with time as a predictor. We can create the model as we have in past R guides (http://rpubs.com/bensonsyd/385183) using the \(\texttt{lm()}\) command. Then, we can look at a plot of the data and the regression equation using the \(\texttt{plot()}\) and \(\texttt{lines()}\) commands, discussed previously. Finally, we can look at the summary of the model using \(\texttt{summary()}\).

However, the very first thing we’ll need to do is create levels in our time variable so that we can model the data monthly and seasonally. We first separate the fractional part of the time index, which represents the week, from the integer part, which represents the year. We then convert these fractions into a factor with levels 1 through 52, where 1 indicates the first week of the year and 52 the last. Finally, we group these weeks into months like this:

justyear <- floor(time(lap))
weekdec <- time(lap) - justyear
weekfactor <- factor(round(weekdec*52 + 1))
month <- weekfactor
levels(month) <- list(jan = as.character(1:4),   feb = as.character(5:8),
                      mar = as.character(9:13),  apr = as.character(14:17),
                      may = as.character(18:21), jun = as.character(22:26),
                      jul = as.character(27:30), aug = as.character(31:34),
                      sep = as.character(35:39), oct = as.character(40:43),
                      nov = as.character(44:47), dec = as.character(48:52))

The \(\texttt{floor(variable)}\) command rounds the specified variable down, \(\texttt{factor(variable)}\) converts the specified variable into a factor, and \(\texttt{round(variable)}\) rounds the specified variable to the nearest integer. Finally, the \(\texttt{levels(variable)}\) command allows you to see or set the levels of a factor variable.

We also saw that our data has a seasonal trend so we want to include that in our model. We need to create a new seasonal variable which divides our weeks into spring, summer, fall and winter, like this:

season <- weekfactor
levels(season) <- list(spr = as.character(12:24), sum = as.character(25:37),
                       fall = as.character(38:50), win = c(as.character(1:11), "51", "52"))

We divided the weeks into their seasons following the astronomical guidelines. Finally, we can create our model.

lmmod <- lm(lap[,"cmort"] ~ justyear + month + season)

Analyzing the Model

Plotting

We plot the data and the regression equation using a similar technique to the one used in our multiple linear regression R guides, using the \(\texttt{plot()}\) and \(\texttt{lines()}\) commands. The point estimates of the various coefficients can be found using the \(\texttt{coef(model)}\) command.

y <- coef(lmmod)[1] + coef(lmmod)[2]*justyear +
  coef(lmmod)[3]*(month == "feb") + coef(lmmod)[4]*(month == "mar") +
  coef(lmmod)[5]*(month == "apr") + coef(lmmod)[6]*(month == "may") +
  coef(lmmod)[7]*(month == "jun") + coef(lmmod)[8]*(month == "jul") +
  coef(lmmod)[9]*(month == "aug") + coef(lmmod)[10]*(month == "sep") +
  coef(lmmod)[11]*(month == "oct") + coef(lmmod)[12]*(month == "nov") +
  coef(lmmod)[13]*(month == "dec") + coef(lmmod)[14]*(season == "sum") +
  coef(lmmod)[15]*(season == "fall") + coef(lmmod)[16]*(season == "win")

plot(lap[,"cmort"] ~ time(lap), ylab = "Cardiovascular Mortality", type = "l")
lines(as.numeric(time(lap)), as.numeric(y), lwd=4, lty=6)
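As a quick side check (an optional step, not in the original guide), the vector \(\texttt{y}\) we built by summing the coefficient terms is exactly the model’s vector of fitted values, so \(\texttt{fitted(lmmod)}\) should reproduce it:

# optional: confirm that fitted() matches the hand-built predictions
all.equal(as.numeric(y), as.numeric(fitted(lmmod)))

Using \(\texttt{fitted()}\) is a convenient shortcut when a model has many coefficients.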

From the plot, we can see that this model allows us to visualize the trend of the data in a similar way to the plots we made with the smoothing techniques. We notice that this model seems to be missing some important peaks within the data, even though it follows the general trend.

summary(lmmod)
## 
## Call:
## lm(formula = lap[, "cmort"] ~ justyear + month + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7960  -3.9063  -0.4418   3.5440  27.7597 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3200.45830  185.20307  17.281  < 2e-16 ***
## justyear      -1.57405    0.09379 -16.782  < 2e-16 ***
## monthfeb      -4.05850    1.33005  -3.051  0.00240 ** 
## monthmar      -6.20838    1.39611  -4.447 1.08e-05 ***
## monthapr      -8.27533    2.00006  -4.138 4.13e-05 ***
## monthmay      -9.11083    2.00006  -4.555 6.61e-06 ***
## monthjun     -10.59371    1.89910  -5.578 4.02e-08 ***
## monthjul     -12.90341    2.19023  -5.891 7.11e-09 ***
## monthaug     -10.13666    2.19023  -4.628 4.73e-06 ***
## monthsep      -9.63371    1.95697  -4.923 1.17e-06 ***
## monthoct      -6.75661    2.05731  -3.284  0.00110 ** 
## monthnov       3.16479    2.06400   1.533  0.12584    
## monthdec       3.62983    1.59172   2.280  0.02301 *  
## seasonsum      0.93033    1.49372   0.623  0.53369    
## seasonfall     4.68282    1.74014   2.691  0.00736 ** 
## seasonwin      2.93967    1.49372   1.968  0.04963 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.948 on 492 degrees of freedom
## Multiple R-squared:  0.6566, Adjusted R-squared:  0.6461 
## F-statistic: 62.71 on 15 and 492 DF,  p-value: < 2.2e-16

Looking at our summary, we can see that year is a significant predictor for our response variable, cardiovascular mortality, since the p-value from its t-test is less than \(2 \cdot 10^{-16}\); we definitely want to keep it as a predictor. However, some of our other predictors do not appear to be significant, including November among the months and summer among the seasons, since the p-values from their t-tests are greater than .05.

This model also uses a lot of different predictors. One way to potentially reduce the number of parameters is to fit an ARIMA model instead of a linear model, so that’s what we’ll try next.

Constructing ARIMA Manually

For this section, to make our ARIMA model, we’ll need to install several packages, including quantmod, tseries, timeSeries, forecast and xts. We’ll start by loading these.

library(quantmod)
library(tseries)
library(timeSeries)
library(forecast)
library(xts)

The three components of an ARIMA model are autoregression, differencing, and moving averages. The autoregressive (AR) part uses past values of the series as predictors in a linear regression. Differencing takes differences between points to produce a stationary series, one with a roughly constant mean and variance. The moving average (MA) part uses the error terms of past data points as predictors.

The decomposition showed an overall decreasing trend along with a strong seasonal component, so we’ll need to account for both in our ARIMA model. The order we will use for our ARIMA model is written (p, d, q)x(P, D, Q)S, where (p, d, q) is the order of the non-seasonal component, (P, D, Q) is the order of the seasonal component, and S is the number of periods in a year (52 for our weekly data).
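For reference (this is the standard textbook form, not output from our data), a seasonal ARIMA model can be written compactly using the backshift operator \(B\), where \(B y_t = y_{t-1}\) and \(B^S y_t = y_{t-S}\):

\[(1 - \phi_1 B - \cdots - \phi_p B^p)(1 - \Phi_1 B^S - \cdots - \Phi_P B^{PS})(1 - B)^d (1 - B^S)^D y_t = (1 + \theta_1 B + \cdots + \theta_q B^q)(1 + \Theta_1 B^S + \cdots + \Theta_Q B^{QS})\epsilon_t\]

Here the \(\phi\)’s and \(\theta\)’s are the non-seasonal AR and MA coefficients, the \(\Phi\)’s and \(\Theta\)’s are their seasonal counterparts, and \(\epsilon_t\) is white noise. R’s \(\texttt{arima()}\) function reports MA coefficients using the plus-sign convention shown here.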

Differencing

The very first step in creating our ARIMA model is to determine our order of differencing. To do this, we’ll use the \(\texttt{diff(dataset, lag)}\) function. The \(\texttt{lag}\) argument indicates how far back we go to take each difference: a lag of 52, for example, takes the difference between \(y_t\) and \(y_{t-52}\). Since we have weekly data with a yearly seasonal pattern, we’ll start with this seasonal difference at lag 52. We can plot the differenced series using \(\texttt{plot(differences)}\) and test whether it is stationary using \(\texttt{adf.test(differences)}\).

diffseas <- diff(lap[,"cmort"], lag=52)
plot(diffseas)

adf.test(diffseas)
## Warning in adf.test(diffseas): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diffseas
## Dickey-Fuller = -7.0734, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

For the ADF test, the null hypothesis is that the data is not stationary, while the alternative hypothesis is that the data is stationary. The p-value of our ADF test is lower than .05 and the differenced series looks relatively stationary, with no clear increasing or decreasing trend, so we can treat it as stationary. This means we do not need any non-seasonal differencing. Thus, in our ARIMA model, \(d=0\) and \(D=1\).
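As an optional cross-check (an extra step, not part of the original workflow), the forecast package’s \(\texttt{ndiffs()}\) function estimates how many additional non-seasonal differences a series needs to become stationary; applied to our seasonally differenced series, it should agree with the ADF test and return 0.

# optional: estimate how many more non-seasonal differences are needed (expect 0)
ndiffs(diffseas)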

Preliminary p, q, P and Q

To determine the order of the moving average (MA) and autoregression (AR) portions of our seasonal and non-seasonal components for our preliminary model, we need to look at autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of our differenced values. We do this using the \(\texttt{Acf(differences)}\) and \(\texttt{Pacf(differences)}\) commands from the forecast package. The ACF plot will allow us to determine \(q\) and \(Q\), while the PACF plot will allow us to determine \(p\) and \(P\).

Acf(diffseas)

The spikes are fairly high for the first two lags, but remain above the blue significance bounds until about the fourth lag, so we’ll try \(q = 2\). We also see a major spike at lag 52 with a few smaller ones around it, so we’ll use \(Q = 1\). Next, we’ll look at the PACF plot.

Pacf(diffseas)

The spikes are high for the first two lags, so we’ll try \(p = 2\). Then, we see a large spike around lag 52 with a smaller spike around lag 104, so we’ll set \(P = 1\).

Creating the Model

Our first model will be an ARIMA(2, 0, 2)x(1, 1, 1)52. We can fit this using the \(\texttt{arima(dataset, order, seasonal = list(order, period))}\) function, where the first \(\texttt{order}\) is our (p, d, q), the \(\texttt{order}\) inside \(\texttt{seasonal}\) is our (P, D, Q), and \(\texttt{period}\) is our S. We can view the summary of our model using the \(\texttt{summary()}\) command.

mod1 <- arima(lap[,"cmort"], order = c(2, 0, 2), seasonal = list(order = c(1, 1, 1), period = 52))
summary(mod1)
## 
## Call:
## arima(x = lap[, "cmort"], order = c(2, 0, 2), seasonal = list(order = c(1, 1, 
##     1), period = 52))
## 
## Coefficients:
##          ar1     ar2     ma1      ma2     sar1     sma1
##       0.2282  0.6449  0.0556  -0.2868  -0.2143  -0.7367
## s.e.  0.1259  0.1218  0.1370   0.1089   0.0655   0.0772
## 
## sigma^2 estimated as 29.75:  log likelihood = -1450.15,  aic = 2914.3
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.6043444 5.167603 3.948944 -0.9549451 4.536669 0.7441035
##                      ACF1
## Training set -0.001270938

Now, we need to look at the significance of each of our parameters by running a t-test for each one. Our test statistic is \(\dfrac{\text{estimate}}{SE(\text{estimate})}\), and we can use the command \(\texttt{2*pt(abs(test.statistic), df, lower.tail=FALSE)}\) to get the p-value for each test. Our degrees of freedom are the sample size minus the number of parameters estimated; in this case that is \(508 - 6 = 502\).

teststats <- coef(mod1)/sqrt(diag(vcov(mod1)))
2*pt(abs(teststats), df=502, lower.tail=FALSE)
##          ar1          ar2          ma1          ma2         sar1 
## 7.057046e-02 1.783863e-07 6.849434e-01 8.719135e-03 1.151300e-03 
##         sma1 
## 5.910738e-20

Since the p-value for ma1 is quite large, the moving average portion of the model appears to have more terms than it needs, so let’s try a smaller MA order. We can do this simply by decreasing \(q\) by one, which removes the ma2 term. After making that change to our model, we can run the t-tests on each parameter again.

mod2 <- arima(lap[,"cmort"], order = c(2, 0, 1), seasonal = list(order = c(1, 1, 1), period = 52))
teststats <- coef(mod2)/sqrt(diag(vcov(mod2)))
2*pt(abs(teststats), df=503, lower.tail=FALSE)
##          ar1          ar2          ma1         sar1         sma1 
## 1.318283e-06 7.573250e-03 1.027355e-02 2.538604e-03 1.695162e-19

Now all of our parameters are significant, so we’ll move forward with this model. The last thing we should do is check the Ljung-Box statistic. We’ll do this using the \(\texttt{Box.test(residuals, type = "Ljung")}\) command.

Box.test(mod2$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod2$residuals
## X-squared = 0.034943, df = 1, p-value = 0.8517

The null hypothesis for this test is that the model is adequate (the residuals show no remaining autocorrelation), while the alternative hypothesis is that the model is inadequate. We get a p-value of .8517 from our test, so we fail to reject the null hypothesis and conclude that this model is adequate.
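For a broader look at the residuals (an optional extra, not part of the original workflow), base R’s \(\texttt{tsdiag()}\) function plots the standardized residuals, their ACF, and Ljung-Box p-values at several lags for a fitted \(\texttt{arima}\) model:

# optional: residual diagnostics for the fitted ARIMA model
tsdiag(mod2)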

Forecasting

We can use the model we just created to forecast, or predict, future cardiovascular mortality in Los Angeles. The command we use to do this is \(\texttt{forecast(model, h)}\), where \(\texttt{h}\) is the number of time steps ahead we want to forecast. So, let’s forecast the cardiovascular mortality in Los Angeles for the last quarter of 1979, which is 13 weeks ahead.

lastqt1 <- forecast(mod2, h=13)
lastqt1
##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## 1979.769       84.75789 77.71755  91.79824 73.99061  95.52518
## 1979.788       88.87832 81.55608  96.20056 77.67992 100.07672
## 1979.808       89.89339 81.96907  97.81771 77.77419 102.01259
## 1979.827       89.96051 81.67854  98.24248 77.29433 102.62669
## 1979.846       92.24197 83.65403 100.82991 79.10785 105.37609
## 1979.865      100.08764 91.26290 108.91238 86.59137 113.58391
## 1979.885      101.41812 92.40260 110.43364 87.63007 115.20617
## 1979.904       97.26099 88.09253 106.42944 83.23905 111.28292
## 1979.923       98.90222 89.61034 108.19409 84.69152 113.11291
## 1979.942       97.35414 87.96246 106.74582 82.99081 111.71748
## 1979.962       97.29598 87.82353 106.76842 82.80912 111.78283
## 1979.981       97.80446 88.26619 107.34273 83.21693 112.39198
## 1980.000       93.99693 84.40749 103.58638 79.33114 108.66273

Our output gives us the point estimate for each prediction as well as an 80% and 95% prediction interval. We can also see this information on a plot with the rest of our time series using the \(\texttt{plot(forecast.output)}\) command.

plot(lastqt1)

On this plot, the blue line represents our point predictions, the dark blue shaded area is the 80% prediction interval and the light blue area is the 95% prediction interval.

Constructing ARIMA Automatically

Creating the Model

To create our ARIMA model automatically, we can use the forecast package’s \(\texttt{auto.arima(response, seasonal = TRUE)}\) command. This will estimate the order of each component of the ARIMA model as well as the coefficients.

mod <- auto.arima(lap[,"cmort"], seasonal = TRUE)
mod
## Series: lap[, "cmort"] 
## ARIMA(2,1,1) 
## 
## Coefficients:
##          ar1     ar2      ma1
##       0.0957  0.2515  -0.6435
## s.e.  0.4302  0.2444   0.4155
## 
## sigma^2 estimated as 33.72:  log likelihood=-1609.89
## AIC=3227.77   AICc=3227.85   BIC=3244.68

The output indicates that we need two autoregressive terms, one order of differencing, and one moving average term, with no seasonal component. Thankfully, the coefficients are already estimated for us using these specifications. Writing \(y'_t = y_t - y_{t-1}\) for the differenced series, our model is \(y'_t = 0.0957\, y'_{t-1} + 0.2515\, y'_{t-2} + \epsilon_t - 0.6435\, \epsilon_{t-1}\).

Forecasting

We can use the model we just created to forecast future cardiovascular mortality in Los Angeles. Again, the command we use to do this is \(\texttt{forecast(model, h)}\). So, let’s forecast the cardiovascular mortality in Los Angeles for the last quarter of 1979.

lastqt <- forecast(mod, h=13)
lastqt
##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## 1979.769       86.49062 79.04890  93.93234 75.10949  97.87174
## 1979.788       85.59554 77.42829  93.76278 73.10481  98.08626
## 1979.808       85.76151 76.26452  95.25850 71.23712 100.28590
## 1979.827       85.55229 75.26156  95.84303 69.81397 101.29061
## 1979.846       85.57401 74.43977  96.70824 68.54566 102.60235
## 1979.865       85.52347 73.67330  97.37364 67.40020 103.64674
## 1979.885       85.52409 72.98132  98.06687 66.34158 104.70661
## 1979.904       85.51144 72.32593  98.69696 65.34594 105.67695
## 1979.923       85.51039 71.70929  99.31149 64.40343 106.61735
## 1979.942       85.50711 71.11959  99.89463 63.50330 107.51092
## 1979.962       85.50653 70.55521 100.45785 62.64045 108.37261
## 1979.981       85.50565 70.01165 100.99965 61.80963 109.20167
## 1980.000       85.50542 69.48710 101.52374 61.00751 110.00333

Again, our output gives us the point estimate for each prediction as well as 80% and 95% prediction intervals, and we can see this information on a plot with the rest of our time series using the \(\texttt{plot()}\) command.

plot(lastqt)

You might notice that these forecasted values vary much less from week to week than the values we forecasted with our manually created model. This is most likely due to the lack of a seasonal component in the automatically generated model.
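To see this difference directly, we can stack the two forecast plots (an optional sketch reusing the \(\texttt{lastqt1}\) and \(\texttt{lastqt}\) objects created above):

# optional: compare the manual and automatic forecasts
par(mfrow = c(2, 1))
plot(lastqt1, main = "Manual ARIMA(2,0,1)x(1,1,1)[52]")
plot(lastqt, main = "Automatic ARIMA(2,1,1)")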

Comparing the Models

AIC

We can compare these two ARIMA models and our linear model using AIC, discussed in our multiple linear regression R guides (http://rpubs.com/bensonsyd/385183). We use the command \(\texttt{AIC(model)}\) to obtain these values for each model.

AIC(lmmod)
## [1] 3270.996
AIC(mod2)
## [1] 2916.594
AIC(mod)
## [1] 3227.77

From the AIC values, we can see that we were actually able to create a better model manually than automatically, since our manual model has the lower AIC. Both ARIMA models also have lower AIC values than the linear model, meaning they modeled the data better than the linear model did.

Likelihood Ratio Test

We can also compare the two ARIMA models using a likelihood ratio test. This tests the null hypothesis that the simpler model is sufficient against the alternative hypothesis that the more complicated model fits the data significantly better. This test requires nested models; in our case, our automatic model is nested within our manual model, so we can use this test. Let’s see whether the added parameters in our manual model actually improve the fit enough to warrant their use.

To compute our test statistic, we subtract the log-likelihood of the simpler model from the log-likelihood of the more complex model and multiply the difference by 2. We then compare this test statistic to a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models. In this case, our simple model uses 4 parameters and our complicated model uses 6, giving 2 degrees of freedom. The \(\texttt{logLik(model)}\) function gives us both the log-likelihood and the number of parameters (degrees of freedom) for the chosen model. We can use \(\texttt{pchisq(teststat, df, lower.tail = FALSE)}\) to get our p-value, where \(\texttt{df}\) is our degrees of freedom.

teststat <- 2*(as.numeric(logLik(mod2))-as.numeric(logLik(mod)))
pchisq(teststat, df=2, lower.tail = FALSE)
## [1] 3.634035e-69

Our p-value is extremely small, far smaller than .05, so we can conclude that the parameters in our more complicated model are important and necessary for modeling this data set.