The purpose of this document is to guide readers through the basics of time series analysis in R. First we’ll load a time series data set that ships with R, then we’ll compare some models that might fit the data, and finally we’ll use those models to forecast a few values beyond the observed data.
For our motivating example, we’ll use the data set called ‘Seatbelts’, which comes with the ‘datasets’ package. This data set has multiple columns of monthly data on road casualties in Great Britain, along with related covariates, from January 1969 through December 1984. The column we’ll use is the third in the set, which counts front-seat passengers killed or seriously injured each month, so we’ll create a variable ‘frontdeath’ to store our data of interest. Unfortunately, this data does not come packaged with the dates of observation, so we’ll use the ‘seq()’ function to create a sequence of dates. The ‘as.Date’ syntax formats the inputs as dates in YYYY-MM-DD form, and we define their spacing as monthly using by = ‘month’. Then we can combine our front-seat casualty data with our date values using the ‘cbind’ function, with the two sets as inputs, and attach the result to an arbitrary name, ‘ourdata’.
library(datasets)
data(Seatbelts)
frontdeath <- Seatbelts[ , c("front")]
frontdeath1 <- ts(frontdeath , start = c(1969, 1) , end = c(1984, 12) , frequency = 12)
datevalues <- seq(from = as.Date("1969-01-01") , to = as.Date("1984-12-01") , by = 'month')
ourdata <- cbind.data.frame(datevalues , frontdeath1)
While we’re at it, we’ll need to install the ‘quantmod’ , ‘tseries’, ‘timeSeries’ , ‘forecast’ , and ‘xts’ R packages to access all of the functions we need for our analysis.
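If any of these packages aren’t installed yet, a one-time call to the base-R function ‘install.packages’ will fetch them from CRAN (skip this line if they’re already on your machine):
install.packages(c("quantmod" , "tseries" , "timeSeries" , "forecast" , "xts"))   # one-time installation from CRAN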
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 3.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Warning: package 'TTR' was built under R version 3.3.3
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(tseries)
## Warning: package 'tseries' was built under R version 3.3.3
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.3.3
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.3.3
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
library(forecast)
## Warning: package 'forecast' was built under R version 3.3.3
library(xts)
Let’s plot our data to get an idea of what trends might be present. We’ll define the ‘type’ as “l” for line so that our plot is drawn as a continuous line graph.
plot(ourdata , main = "Front Seat Casualties" , type = 'l')
This plot shows a huge drop in the number of front-seat passengers killed or injured during and after 1983. Research into this sudden decrease reveals that in January 1983 the UK introduced a law requiring seatbelts to be worn. This accounts for the sudden decrease in front-seat casualties and the subsequent low values throughout 1983 and 1984. Since this has a nontrivial effect on the data recorded after that date, we might consider removing it. However, that would affect how we interpret our forecasts once the models are built, so we’ll leave the data as we found it.
A downward trend seems to be present even after taking the 1983 change into account, although it’s a bit difficult to distinguish seasonal variation from random variation, so we can use the ‘decompose()’ function to identify which sources of variation are present in our observed data. We plot the decomposition because a visual representation of the output is easier to interpret at a glance.
FrontDecomp <- decompose(frontdeath1)
plot(FrontDecomp)
From this plot, we see that our instinct was mostly correct! There is a general downward trend and obvious seasonal variation. This information will become useful when we begin to build models because we will have to take into account seasonal components to create a model that fits the data best.
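If a numeric look at the components is useful, the object returned by ‘decompose’ stores each piece separately; the lines below simply print parts of what the plot already shows (no new computation):
head(FrontDecomp$trend , 24)     # the smoothed trend component (NA at the ends of the series)
FrontDecomp$seasonal[1:12]       # the repeating monthly seasonal effects
head(FrontDecomp$random , 24)    # the leftover random component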
Sometimes it’s a good idea to start by fitting a simple linear regression model to determine whether or not the response has a linear relationship with the predictor. In our case, we can use the p-value from the t-test that the ‘summary’ function reports for ‘datevalues’ to judge the significance of the linear relationship between front-seat casualties and time.
linmod <- lm(frontdeath1 ~ datevalues)
summary(linmod)
##
## Call:
## lm(formula = frontdeath1 ~ datevalues)
##
## Residuals:
## Min 1Q Median 3Q Max
## -277.297 -90.534 -2.868 95.399 314.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.012e+03 1.710e+01 59.18 <2e-16 ***
## datevalues -6.886e-02 5.608e-03 -12.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131.1 on 190 degrees of freedom
## Multiple R-squared: 0.4424, Adjusted R-squared: 0.4395
## F-statistic: 150.8 on 1 and 190 DF, p-value: < 2.2e-16
From the summary output, it’s clear that the number of front-seat casualties is linearly related to the passing of time, based on the p-value for the ‘datevalues’ variable, which is nearly 0. The slope of the line relating the two is the coefficient on ‘datevalues’, which is negative here. Because ‘datevalues’ is a Date vector, R measures it in days, so the slope of \(-6.886 \times 10^{-2}\) means we expect the number of front-seat casualties to decrease by about 0.07 per day, or roughly 2 per month, on average.
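To translate the daily slope into a per-month figure, we can multiply the fitted coefficient by the average length of a month in days (about 30.44); this is just arithmetic on the coefficient reported above, not a new model, and ‘slope_per_day’ is a throwaway name used only for illustration:
slope_per_day <- coef(linmod)["datevalues"]   # estimated change in casualties per day
slope_per_day * 30.44                         # roughly -2.1 casualties per month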
What this summary does not do a great job of showing is how useful the linear model is for forecasting future values and representing the trends present in the observed data. Let’s use the ‘plot’ function to do some residual analysis and assumption checks before we get too excited over this model.
plot(linmod)
From the Residuals vs. Fitted Values graph, the red smoothing line shows that the residuals do not scatter randomly around zero; its wave-like fluctuations suggest that a model with curvature or a seasonal component would follow the observed trends more closely. Looking at the scale of the residuals on the y-axis, we also see that the model frequently over-predicts or under-predicts casualties by the hundreds. This amount of unexplained variation is consistent with the coefficient of determination \(R^2 = 0.4424\) from the summary above: less than 50% of the variation in observed front-seat casualties is explained by this simple linear model.
From the normal q-q plot, there is no convincing evidence that our error terms don’t follow a normal distribution so we can reasonably accept that the normality assumption holds.
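If a visual check alone doesn’t feel convincing, a formal Shapiro-Wilk test on the residuals can back up the q-q plot; this is an optional extra rather than part of the workflow above, and a large p-value is consistent with normally distributed errors:
shapiro.test(residuals(linmod))   # H0: the residuals come from a normal distribution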
ARIMA is the shorthand way of indicating an autoregressive integrated moving average model. An autoregressive model is a time series model that generates a response \(y_t\) that is dependent on previous observations of the response. An autoregressive model of order p is denoted AR(p) where \(p\) is the number of previous observations of y that are factored into determining the subsequent \(y_t\) . In this type of model, \(y_t\) depends on \(y_{t-1}, . . . , y_{t-p}\) .
A moving average model is dependent on the error of the previous observations of \(y_t\) . Thus, we say that \(y_t\) depends on \(\epsilon_{t-1}, ... \epsilon_{t-q}\). It is denoted MA(q) where \(q\) is the number of error terms that are factored into determining the next iterative \(y_t\) .
The autoregressive and moving average components apply only to stationary time series: series with no obvious trend or seasonal variation, which look like a sample of identically distributed points with constant mean and constant variance. The “integrated” part of ARIMA is how we get there: we take differences between consecutive observations and use those differences as the data for the rest of the model. For example, if \(d = 1\), we use first-order differences, so \(z_1 = y_2 - y_1\), \(z_2 = y_3 - y_2\), and so on, and we fit the ARIMA model to those \(z\) values. If the first differences do not produce a stationary series, we take differences of the \(z\) values as well.
Including all of these concepts in one model will produce an ARIMA model with parameters \(p\), \(d\), and \(q\) where \(d\) is the order (number) of differences from the integrated portion that we took to get a stationary time series.
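To make the differencing idea concrete, here is a quick sketch on a toy vector (made-up numbers, purely for illustration, not the seatbelt data) showing that the \(z\) values described above are exactly what R’s ‘diff’ function returns, and what a second round of differencing (\(d = 2\)) would use:
y <- c(112 , 118 , 132 , 129 , 121)   # a toy series, purely for illustration
y[-1] - y[-length(y)]                 # first differences by hand: y2-y1, y3-y2, ...
diff(y , lag = 1)                     # the same first differences via diff()
diff(diff(y))                         # second-order differences, i.e. what d = 2 would use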
When building a model, we start with finding \(d\) . The ‘diff’ function will take one order of differences between consecutive observations. Observations that are adjacent have a lag of 1 so we’ll define that in the function. We can use the ‘adf.test’ function to perform an Augmented Dickey-Fuller test. The null hypothesis for this test is that our input data is not stationary. Thus if our p-value is low, we reject the null hypothesis in favor of the alternative one that our data is stationary.
firstdiffs <- diff(frontdeath1, lag=1)
adf.test(firstdiffs)
## Warning in adf.test(firstdiffs): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: firstdiffs
## Dickey-Fuller = -8.5119, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Our p-value is small, so we can treat the differenced series as stationary. Since a single order of differencing was enough to achieve stationarity, we’ll use \(d\) = 1 when we build our ARIMA model.
To determine \(p\) and \(q\) , we need to explore autocorrelation in our observed data. An autocorrelation function tells you the correlation of 2 points separated by a specific “lag.” If we look at consecutive observations, the lag is 1. We use ‘Acf’ for autocorrelation function in R which will output a graph of correlation spikes. Here is what that looks like:
Acf(firstdiffs)
The spikes indicate the strength and direction of correlation at various lags. For example, in this plot we see that adjacent observations are strongly negatively correlated, since there is a large downward spike at lag 1. To choose \(q\), we look at how many significant spikes appear at the lowest lags; here one dominant spike at lag 1 suggests building our ARIMA model with \(q\) = 1.
The ‘Pacf’ function in R computes the partial autocorrelation function, and we’ll use its output in a similar way to choose a \(p\) parameter.
Pacf(firstdiffs)
There are three significant spikes at the lowest lags so we’ll use 3 for our \(p\).
Now to put the pieces together, we’ll use the ‘arima’ function in R and input our values of \(p\), \(d\), and \(q\) in succession and define it as the order of the ARIMA model.
mod1 <- arima(frontdeath1 , order = c(3 , 1 , 1))
mod1
##
## Call:
## arima(x = frontdeath1, order = c(3, 1, 1))
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.5135 -0.0164 -0.0307 -0.9038
## s.e. 0.0821 0.0823 0.0770 0.0404
##
## sigma^2 estimated as 11894: log likelihood = -1167.61, aic = 2345.22
The standard form of an ARIMA model (applied to the differenced series when \(d > 0\)) is \(y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t\), where \(\phi_i\) represents the \(i^{th}\) coefficient of the autoregressive portion and \(\theta_i\) represents the \(i^{th}\) coefficient of the moving average portion of the model. Thus, our output gave us three autoregressive coefficients and one moving average coefficient, since that’s what we asked for.
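To line those symbols up with the fitted numbers, the coefficients can be pulled directly from the model object; ‘ar1’ through ‘ar3’ are the \(\phi_i\) estimates and ‘ma1’ is \(\theta_1\). This simply re-displays values already shown in the printout above:
coef(mod1)   # named vector of the estimated phi (ar) and theta (ma) coefficients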
So how useful are the forecasts from this model? One way to judge is a Box-Ljung test on the residuals. The null hypothesis is that the residuals are uncorrelated, i.e., that our ARIMA model has adequately captured the trends and variation in the observed data.
Box.test(mod1$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.012125, df = 1, p-value = 0.9123
Since the p-value from the Box-Ljung test is quite high (well over 0.05), we fail to reject the null hypothesis and conclude that there isn’t evidence that our model is inadequate.
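For a broader look at the residuals than a single Box-Ljung test, base R’s ‘tsdiag’ function (which works on models fitted with ‘arima’) plots the standardized residuals, their ACF, and Ljung-Box p-values across a range of lags; this is an optional supplementary diagnostic:
tsdiag(mod1)   # standardized residuals, residual ACF, and Ljung-Box p-values by lag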
From the decomposition we performed earlier, it’s obvious that our observed data shows seasonal variation, so we’ll explore the effect of a seasonal component in another ARIMA model. The result will be an ARIMA(p,d,q)x(P,D,Q)S model, where \(S\) is the time span between seasonal repetitions. Now \(P\) and \(Q\) describe how many observations and error terms from the same month in previous years are factored into determining the subsequent \(y_t\).
The decomposition showed that the seasonal variation repeats annually, so now we’ll use a lag of 12 when taking differences. We look at the seasonal differences first because removing the seasonal pattern may eliminate the need for ordinary consecutive differences. Let’s check the result of an ADF test after we take seasonal differences.
seasondiff <- diff(frontdeath1, lag = 12)
adf.test(seasondiff)
##
## Augmented Dickey-Fuller Test
##
## data: seasondiff
## Dickey-Fuller = -3.3059, Lag order = 5, p-value = 0.07226
## alternative hypothesis: stationary
Our p-value is slightly higher than 0.05, indicating that we might want to include differences of an order higher than one. However, it’s not much higher than 0.05, and we can reasonably treat the seasonally differenced series as stationary, so we’ll just keep this fact in mind when we build models later.
We’ll repeat the same autocorrelation functions as before to analyze our new integrated data. However, we need to estimate values for \(P\) and \(Q\) as well this time. To do so, we’ll observe whether there are spikes at multiples of 12.
Acf(seasondiff)
Here, we can see there are several significant spikes at low lags, suggesting a choice of \(q\) somewhere between 1 and 6; we’ll use 6. Since there is also a large spike at lag 12, we’ll use 1 as our value of \(Q\). Now let’s determine which values of \(P\) and \(p\) to plug in.
Pacf(seasondiff)
From this chart, we see two significant spikes at low lags and two more around lag 12. Any estimate between 1 and 4 for \(p\) might be warranted, but we’ll use 2, since the other spikes sit near our seasonal lag. Because there are two big spikes at multiples of 12, we’ll choose 2 as the value of \(P\).
Now that we have all of our parameters defined, let’s build another ARIMA model and conduct another Box-Ljung test to determine its utility. Notice that we define a ‘seasonal’ input before entering the order of \(P\), \(D\), and \(Q\) and we indicate that they’re the annual seasonal components by stating that their period is 12 months.
mod2 <- arima(frontdeath1, order = c(2 , 0 , 6) , seasonal = list(order = c(2 , 1 , 1) , period = 12))
Box.test(mod2$residuals, type="Ljung")
##
## Box-Ljung test
##
## data: mod2$residuals
## X-squared = 0.033192, df = 1, p-value = 0.8554
Notice that the p-value is much higher than 0.05 again which indicates that we have no reason to believe that this model is an inadequate representation of our observed data.
The two ARIMA models that we’ve constructed over the previous two sections passed the preliminary utility tests, but R has a function that will create an ARIMA model that it believes fits the observed data most accurately. This function is called ‘auto.arima’ and simply takes a dataset as its input. Let’s see how well it works.
modR <- auto.arima(frontdeath1)
Box.test(modR$residuals, type="Ljung")
##
## Box-Ljung test
##
## data: modR$residuals
## X-squared = 0.13636, df = 1, p-value = 0.7119
Again, this model’s p-value is larger than 0.05, so we conclude that it is also an adequate depiction of the observed data.
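If you’re curious which orders ‘auto.arima’ actually settled on, printing the fitted object shows the selected ARIMA(p,d,q)(P,D,Q)[12] specification along with its estimated coefficients:
modR   # displays the order auto.arima chose and the fitted coefficients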
While all four of the models we’ve created so far have passed preliminary utility tests, we can still judge them against each other to decide which of them fit the data most closely. The easiest way to do so is to compare their AIC values. Refer to the multiple linear regression R guide for a review of AIC comparison, but the general rule is that we favor models with fewer terms and lower AIC values.
AIC(mod1)
## [1] 2345.217
AIC(mod2)
## [1] 2104.367
AIC(modR)
## [1] 2255.699
AIC(linmod)
## [1] 2421.211
This output shows that our seasonal ARIMA model has the lowest AIC value by well over 100. A decrease of that size is enough to warrant the choice of a more complex model, so we are justified in choosing the seasonal model we built by hand to forecast front-seat casualties beyond 1984. However, if we trust R more than we trust ourselves, we could also use the model it built for us.
Forecasting with an ARIMA model is very simple. It only requires one function, aptly named ‘forecast’. All we need to supply is the model to use and an \(h\) value giving how many periods beyond the observed data we would like to forecast. Since we only have 16 years of data and the seatbelt law changed right near the end of the observation period, we might choose to predict only a few months out. Let’s try forecasting 3 months out with the ARIMA model that R built, so the output represents the predicted number of front-seat casualties for January through March of 1985.
plot(forecast(modR , h = 3) , main = "Auto-ARIMA")
The three blue dots represent the point predictions of the number of front-seat casualties in the UK for the months of January, February, and March 1985, respectively. The shaded areas represent prediction intervals for each of those point forecasts: the darker blue band is the 80% interval and the lighter shading marks the limits of the 95% interval. Recall that as the interval widens, so does our confidence that the true value lies within its bounds.
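If you need the numbers behind the picture, the object returned by ‘forecast’ stores the point forecasts and interval limits explicitly; the sketch below (using a throwaway name ‘fc’) just extracts what the plot displays:
fc <- forecast(modR , h = 3)
fc$mean    # point forecasts for January through March 1985
fc$lower   # lower limits of the 80% and 95% intervals
fc$upper   # upper limits of the 80% and 95% intervals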