Introduction

This R Guide will talk about time series. A time series is a series of data that is obtained at different times, often at equal intervals.

In this R guide, we will talk about the general equation for time series. We will then discuss seasonal variation and what it looks like in our data. We will talk about two methods for smoothing our data. Then, we can make a linear model and talk about autocorrelation. Lastly, we will talk about seasonal ARIMA modeling and forecasting.

For this R guide we will be using the data set birth, from the astsa library. This data set is for the monthly live births in thousands for the United stats during the years 1948-1979.

General Equation for Time Series

The general equation is: \[{y_t} = {TR_t}+{\epsilon_t}\]

where y_t is the value of the time series at time t
where TR_t is the trend at time t
where epsilon_t is the error at time t

y_t and epsilon_t are both parts we have seen before. TR_t is something new. The trend of the data determines what the TR_t piece is. Below are the four common types.

1. When there is no trend: $${TR_t} = {\beta_0}$$
2. When there is a linear trend: $${TR_t} = {\beta_0} + {\beta_1}t$$
3. When there is a quadratic trend: $${TR_t} = {\beta_0} + {\beta_1}t + {\beta_2}{t^2}$$
4. When there is a pth order polynomial trend: $${TR_t} = {\beta_0} + {\beta_1}t + {\beta_2}{t^2} +...+{\beta_p}{t^p}$$

Seasonal Variation

In time series data, seasonality occurs when there are variations present at specific regular intervals. These regular intervals must be less then a year. Many factors can cause seasonal variation, such as: seasons, weather, school, sports seasons, and more.

The general equation for time series when it has seasonality changes to: \[{y_t} = {TR_t} + {SN_t} + {\epsilon_t}\]

There are two types of seasonal variation: increasing seasonal variation and constant seasonal variation.

With increasing seasonal variation, the magnitude of swing is increasing over time. This is not the type of seasonal variation that we want. We want to find a transformation that will make the increasing seasonal variation constant.

With constant seasonal variation, the magnitude of swing is constant over time. We want our time series data to look like this. It will most likely not be perfect but we want each swing to be about the same height. We can achieve this by transforming data that has increasing seasonality, or if our data already has constant seasonality we are good to go.

Seasonal Variation in our Data

Now, since we know what seasonal variation is, we can look at our data. We need to call the library and the data before we can use the plot() command.

library(astsa)
data(birth)
plot(birth)

There appears to be constant seasonal variation because the magnitude of the swing is constant over time.

Another way to look at the data is the command decompose(), separates the data into a seasonal piece, a trend piece and an irregular components piece. This helps when we can’t see a clear seasonal or trend piece.

We can plot the decomposed data by calling plot(decompose(data)).

plot(decompose(birth))

The top box shows the data as it is.

The trend component shows that we start with an increasing trend until around 1960. Then the trend switches to decreasing until around 1976 where it increases again After that it continues to go back to a decreasing trend.

When looking at the seasonal component, we can see that the magnitude of our swings is constant over times. This means there is constant seasonal variation in our data. Fortunately, we don’t have to transform the data at all to get constant seasonal variation because we already have it.

The last section is random. This is the random noise in our data that doesn’t apply to either the trend or the seasonal components and is not useful for this analysis of the data.

Smoothing Our Data

There are various types of time series smoothing techniques. Smoothing our data is important to our time series analysis because it allows us to analyse the overall trend of our data without most of the noise interference.

The basic method for smoothing techniques is to average data values that are close to each other. This creates a smoothed data point that will eliminate the noise that previously surrounded that point. We will continue our analysis by performing some of these smoothing techniques.

Weighted moving Average

First, we are going to smooth the data set using the weighted moving average approach. To begin we have to decide how much weight we want to give to each point. For our analysis we have decided to use 3 points, the middle point and the one on either side. Thus, we need 3 weights, one for each point. To accomplish this we simply decided to give the middle point a weight twice as much of the weight of the point before or after it. To establish the weights, we created a weights variable that contains a vector c(1,2,1), which is how you give the middle point twice as much weight, and then divided by 3 because there are 3 points.

weights = c(1, 2, 1)/3

Thus, this means that the middle point will more heavily influence our smoothed point.

Next, all we have to do is use the filter(data set, sides, filter) function to smooth our data. We will enter birth as our data set and 2 as the sides argument as we want a data point from each side, past and future, to smooth three data points together; past, present, and future. Then, our filter will be set to our weights command we created before.

filterWeights = filter(birth, sides = 2, filter = weights)

Now, that we have smoothed our data, we can plot the original time series next to the smoothed time series to see the results. We can use the par(mfrow=c(2,1)) function to fit both plots next to each other.

par(mfrow = c(2,1))
plot.ts(birth, main = "unsmoothed")
plot.ts(filterWeights, main = "smoothed")

As you can see, the smoothed plot eliminates much of the noise in our data and allows us to clearly see the trend.

Splines

Our next technique for smoothing that we used is called spline smoothing. For spline smoothing, the x-axis gets split up into multiple small intervals. The data from each of these interval is then used to smooth the time series. We can do spline smoothing using the smooth.spline(time factor, data, spar) command. Spar is our smoothing parameter which we decided to use 0.2 for, the larger the spar the more smoothed your plot will become. We can plot our smoothed data on top of our original by using the lines() command so that we can see the difference.

plot(birth)
lines(smooth.spline(time(birth), birth, spar = 0.2), lwd = 2, col = 2)

Again, this smoothing technique also allows us to eliminate noise and yet still see the trend of our data and not smooth the plot too much.

Linear Model

Now, we can look at our linear model. First, we must create a data frame of the data so that we have variables we can call; date (the time factor of the data) and y (the birth response of the data).

library(reshape2)
data = data.frame(date = time(birth), Y = as.matrix(birth))
model = lm(Y ~ date, data)
summary(model)
## 
## Call:
## lm(formula = Y ~ date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.810 -22.411  -1.172  21.602  82.170 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4297.3111   342.2257   12.56   <2e-16 ***
## date          -2.0303     0.1743  -11.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.2 on 371 degrees of freedom
## Multiple R-squared:  0.2678, Adjusted R-squared:  0.2658 
## F-statistic: 135.7 on 1 and 371 DF,  p-value: < 2.2e-16

As you can see from the summary, date is significant, which we would expect as it is a time series analysis.

Monthly Model

Now, we can use dummy variables such as month to model our seasonal variation. To build a monthly model we first need to look at the time factors for our data set.

head(time(birth))
## [1] 1948.000 1948.083 1948.167 1948.250 1948.333 1948.417

We can see that time is in years and decimals, so we need to create monthly factors. This can be achieved by first getting the year by itself without a decimal using the floor() command so it is rounded down. Then, you take the current time factor and subtract off the year. Lastly, we can create monthly factors by taking just the decimal portion from the year, multiplying it by 12, and then rounding the result.

justyear <- floor(time(birth))
modecimal <- time(birth) - justyear
mofactor <-factor(round(modecimal*12))
head(cbind(time(birth), mofactor))
##      time(birth) mofactor
## [1,]    1948.000        1
## [2,]    1948.083        2
## [3,]    1948.167        3
## [4,]    1948.250        4
## [5,]    1948.333        5
## [6,]    1948.417        6

Now, we can label our month factors with the actual month using the levels() command. This makes it easier for us to read the summary of the month model.

levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May", 
                      "Jun", "Jul", "Aug", "Sep", "Oct",
                      "Nov", "Dec")

Next, we can create our month model just as we have created multiple linear regression models in previous R guides, using justyear and mofactor.

Using the summary function, we can see which months have an impact on births.

monthmod <- lm(as.matrix(birth) ~ 0+justyear + mofactor, data = birth)
summary(monthmod)
## 
## Call:
## lm(formula = as.matrix(birth) ~ 0 + justyear + mofactor, data = birth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64.20 -18.68  -1.27  24.58  57.98 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## justyear      -2.0586     0.1523  -13.52   <2e-16 ***
## mofactorJan 4348.3473   299.1271   14.54   <2e-16 ***
## mofactorFeb 4327.8089   299.0521   14.47   <2e-16 ***
## mofactorMar 4351.3896   299.0521   14.55   <2e-16 ***
## mofactorApr 4330.8412   299.0521   14.48   <2e-16 ***
## mofactorMay 4341.8412   299.0521   14.52   <2e-16 ***
## mofactorJun 4342.4218   299.0521   14.52   <2e-16 ***
## mofactorJul 4368.5831   299.0521   14.61   <2e-16 ***
## mofactorAug 4375.9380   299.0521   14.63   <2e-16 ***
## mofactorSep 4371.8735   299.0521   14.62   <2e-16 ***
## mofactorOct 4365.4218   299.0521   14.60   <2e-16 ***
## mofactorNov 4345.7122   299.0521   14.53   <2e-16 ***
## mofactorDec 4355.0993   299.0521   14.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.39 on 360 degrees of freedom
## Multiple R-squared:  0.9931, Adjusted R-squared:  0.9929 
## F-statistic:  4007 on 13 and 360 DF,  p-value: < 2.2e-16

It looks like all of the months have a significant impact and thus, we can say we have monthly seasonality. Now, we can plot our model to see how it compares to the actual data.

Now, we can plot the model on top of our data using the lines() command.

plot(birth, type = "l")
lines(as.numeric(time(birth)), monthmod$fitted.values, col = "red")

We can see that the red model does not plot the data very well. This is partially because of the trend. A single linear model cannot account for increasing and decreasing since it can only model a straight line. If we look closely, however, we can see that the seasonality portion of the data is modeled well with the red lines. The swings look to be about equal.

We can try to make it better by using seasons instead of months. This may not do much because this looks pretty good besides the trend portion and our data is already a monthly set. But, we can still try it out for fun.

Seasonal Model

To get a seasonal model, we can do the same thing, however instead of using month factors we will use season factors. We can categorize the seasons as: winter = Dec, Jan, Feb; spring = Mar, Apr, May; summer = Jun, Jul, Aug; and fall = Sep, Oct, Nov. Using these categories and the following R code we can create levels for the seasons.

season <- mofactor
levels(season) <- list(winter = c("Dec", "Jan", "Feb"), spring = c("Mar", "Apr", "May"), summer = c("Jun", "Jul", "Aug"), fall = c("Sep", "Oct", "Nov"))

Now that we have our seasons separated, we can create a new model based off of seasons instead of months.

seasonmod <- lm(as.matrix(birth) ~ 0+justyear + season, data = birth)
summary(seasonmod)
## 
## Call:
## lm(formula = as.matrix(birth) ~ 0 + justyear + season, data = birth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.058 -20.858  -0.764  20.130  71.616 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## justyear       -2.056      0.165  -12.47   <2e-16 ***
## seasonwinter 4339.042    323.839   13.40   <2e-16 ***
## seasonspring 4336.599    323.812   13.39   <2e-16 ***
## seasonsummer 4357.556    323.812   13.46   <2e-16 ***
## seasonfall   4356.245    323.812   13.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.58 on 368 degrees of freedom
## Multiple R-squared:  0.9918, Adjusted R-squared:  0.9917 
## F-statistic:  8870 on 5 and 368 DF,  p-value: < 2.2e-16

We can see that all of the seasons are significant.

Again we can plot our new model to compare.

plot(birth, type = "l")
lines(as.numeric(time(birth)), seasonmod$fitted.values, col="blue")

Just from the plot, we can see we probably won’t be using this model. It did not seem to make the model any better. The swings go smaller so this model cannot account for all of the peaks.

Model Comparison

Now we can compare our month model and our season model to see which one fits our data better. For this comparison we used the anova function like in the last R guide.

anova(monthmod, seasonmod)
## Analysis of Variance Table
## 
## Model 1: as.matrix(birth) ~ 0 + justyear + mofactor
## Model 2: as.matrix(birth) ~ 0 + justyear + season
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    360 250659                                  
## 2    368 300517 -8    -49858 8.9508 3.288e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value is very small, we can see that we should keep the monthly model because that one is the more complicated model. This lines up with what we saw with the plots as well.

Autocorrelation

Auto-correlation is when current point is related to the previous point. For example, positive auto-correlation would be if the previous residual was positive, then the next residual is likely to be positive; same for negative following a negative residual. For negative auto-correlation, if the previous residual was positive the next residual is likely negative. We can check for auto-correlation by doing the Durbin-Watson test found in the lmtest library. If we us the dwtest on our model and set alternative = “greater”, we can check for positive auto-correlation. To check for negative correlation, simply change alternative to equal “less”. For this test, we will set our alpha level to .05.

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Warning: package 'zoo' was built under R version 3.4.4
dwtest(monthmod, alternative = "greater")
## 
##  Durbin-Watson test
## 
## data:  monthmod
## DW = 0.074567, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(monthmod, alternative = "less")
## 
##  Durbin-Watson test
## 
## data:  monthmod
## DW = 0.074567, p-value = 1
## alternative hypothesis: true autocorrelation is less than 0

This shows that we have positive autocorrelation because the first test gave us a very small p-value and the second gave 1 as a p-value. This is something we need to keep in mind when assessing our model, but it isn’t anything we need to try to fix.

Seasonal ARIMA

Because we have the trend piece, a linear model does not work well with our data. There is another type of model we can use which is called ARIMA.

ARIMA stands for Autoregressive Integrated Moving Average.

- Autoregressive: y_t is a linear combination of y_t-1 and maybe y_t-2,..., y_t-p.
- Moving Average: y_t is a linear combination of error at t-1,..., error at t-q.
- Integrated: the difference needed until there is a constant mean and a constant variances over time.

The equation is: ARIMA (p, d, q) (P, D, Q) S

1. where p is the order for the autoregressive portion
2. where d is the order for the integrated portion
3. where q is the order for the moving average portion
4. where P is the order for the seasonal autoregressive portion
5. where D is the order for the seasonal integrated portion
6. where Q is the order for the seasonal moving average portion
7. where S is the number of seasons

The hypotheses are below for the arima models along with the teststats.

\[{H_0}: {\phi} = 0 -and- {\theta} = 0\] \[{H_A}: {\phi} \neq 0 -and- {\theta} \neq 0\]

\[{teststat} = \hat{\phi}/{SE(\hat{\phi})}-and-\hat{\theta}/{SE(\hat{\theta})}\]

The list of libraries below are ones we need when we want to use time series. The forecast library lets us forecast with our model at the end. The quantmod library allows us to use lag when doing differences, which we will talk about later. The tseries library is for the ARIMA model, which we will discuss later on. The timeSeries library allowed us to plot the data. The last library, xts, allows us to take the differences.

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Warning: package 'xts' was built under R version 3.4.4
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Warning: package 'timeDate' was built under R version 3.4.4
library(xts)

There are 4 steps to creating an ARIMA model manually.

1. Differences
2. ACF
3. PACF
4. Model and Compare

Differences

Differencing is a transformation to make time-series data stationary. Being stationary means that the data’s properties don’t depend on time. We don’t want to see any seasonal or upward trend.

To find the seasonal difference, we can use the diff() command. We will supply our data set and a lag. Lag is the number of seasons. We chose 12 for the number of months. You can try any number of seasons and see which looks the best. We decided 12 for the number of months was best because that is what we used with our linear model. We can then plot the differences and see if that helped with the seasonal trend.

diff1 = diff(birth, lag = 12)
plot.ts(diff1)

This looks pretty good in terms of no seasonal trend.

A way to check for stationary data is the Augmented Dickey-Fuller Test. The null hypothesis is that the data is not stationary and the alternative is stationary. We will use the adf.test() command that takes the data.

adf.test(diff1)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.3649, Lag order = 7, p-value = 0.06026
## alternative hypothesis: stationary

The p-value is larger than our alpha level of 0.05 so we can conclude the data is not yet stationary. Since we got rid of most of the seasonal trend, we can take care of the downward trend.

We can use the same diff() command but exchange lag for differences. Differences will be the order or number of differences you want to take. We just took one to start off.

diff2 = diff(diff1, difference = 1)
plot.ts(diff2)

This looks even better. We should still check to see if it is stationary to move on.

adf.test(diff2)
## Warning in adf.test(diff2): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff2
## Dickey-Fuller = -8.7804, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Now, the p-value is very small. We can conclude the data is now stationary so we can continue on.

From these two differences, we can conclude that:

D = 1
d = 1

ACF

MA(q): checking if y_t is a linear combination of epsilon_t-1, epsilon_t-2,…, epsilon_t-q.

\[{y_t} = {\mu} + {\epsilon_t} + {\theta_1}{\epsilon_(t-1)} + {\theta_2}{\epsilon_(t-2)} + ... + {\theta_q}{\epsilon_(t-q)}\] Assume theta_q does not equal 0 and epsilon_t is iid N(0, sigma-squared)

To find q and Q, we can use the Acf() command and give it our difference data. When looking for q, we are looking at 1 of lag. If the line at 1 is past the blue line, q = 1. When looking for Q, we are looking at what lag we picked above (12) and the multiples (like 24). Again, if the line is past the blue line, then Q = 1.

Acf(diff2)

Looking at 1, we can see the line is past the blue line. The line at 2 is not past the blue line but it is very close. This leads us to believe q is 1, maybe 2.

We can see the line at 12 is past the blue line. The line at 24 is also past the blue line so Q might be above 1.

From the graph, we can conclude that:

q >= 1
Q >= 1

PACF

AR(p): checking if y_t is a linear combination of y_t-1, y_t-2,…, y_t-p.

\[{y_t} = {\phi_1}{y_(t-1)} + {\phi_2}{y_(t-2)} + ... + {\phi_p}{y_(t-p)} + {\epsilon_t}\] Assume phi_p does not equal 0 and epsilon_t is iid N(0, sigma-squared)

To find p and P, we can use the Pacf() command and give it our difference data. Finding p and P is the same as finding q and Q, just with a different graph.

Pacf(diff2)

Looking at 1, we can see the line is past the blue line. The line at 2, 3 and 4 are also past the blue line. This leads us to believe p is at least 1.

The line at 12 is past the blue line as well so we can assume 12 is 1. 24 is past the blue line so P should be at least 1.

From the graph, we can conclude that:

p >= 1
P >= 1

Modeling and Comparing

When deciding if a component is needed in the equation, we look at the teststat. To find the teststat, we simply take the top value in the coefficients table over the bottom value (or the standard error). As a rule of thumb, if the teststat is less than 2, drop that component. If it is between 2 and 3, use the teststat to find the p-value.

mod1 = arima(birth, order = c(1,1,1), season = list(order = c(1,1,1), period = 12))
summary(mod1)
## 
## Call:
## arima(x = birth, order = c(1, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1    sar1     sma1
##       0.3127  -0.7088  0.1058  -0.8481
## s.e.  0.0868   0.0607  0.0710   0.0469
## 
## sigma^2 estimated as 45.53:  log likelihood = -1204.83,  aic = 2419.66
## 
## Training set error measures:
##                       ME     RMSE      MAE         MPE     MAPE      MASE
## Training set -0.09665592 6.628867 5.063631 -0.03292443 1.661033 0.3605801
##                    ACF1
## Training set 0.01128661

Looking at the teststats, we can see that the teststat for sar1 looks small. We can check what it actually is and find the p-value as well. We will use 2 times the pt() command because we want the alternative hypothesis to be not equal. We will supply the teststat, the degrees of freedom (which is the total number of data points minus the number of predictors) and lower.tail equal to false

teststat1 = 0.1058/0.0710
2*pt(teststat1, 368-4, lower.tail = FALSE)
## [1] 0.1370532

Since the p-value for this teststat is greater than our alpha level of 0.05, we will not include this predictor in our model. We can try a new model without this predictor

mod2 = arima(birth, order = c(1,1,1), season = list(order = c(0,1,1), period = 12))
summary(mod2)
## 
## Call:
## arima(x = birth, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.3038  -0.7006  -0.8000
## s.e.  0.0865   0.0604   0.0441
## 
## sigma^2 estimated as 45.91:  log likelihood = -1205.93,  aic = 2419.85
## 
## Training set error measures:
##                       ME     RMSE      MAE         MPE     MAPE     MASE
## Training set -0.04563711 6.657134 5.079695 -0.01582804 1.664997 0.361724
##                    ACF1
## Training set 0.01149696

All of the teststats look good now.

Before we try to make it better, we should check to see which model should be our base model. Let’s use the Likelihood Ratio Test to compare the two models.

This test is used to compare two nested models. Remember, nested models are when one model is just a special case of the other model. Our models are an example of nested models because both contain the same parameters but mod1 has an extra one.

We first need to find the teststats. The teststat is the loglikelihood of the complicated model minus the loglikelihood of the simple model and all of that multiplied by 2 or -2. The teststat needs to be positive so you can use 2 or -2 so that the teststat remains positive.

Before we can find the teststat, we need to find the loglikelihood of the models. We will use the logLik() command and give it our models.

logLik(mod1)
## 'log Lik.' -1204.83 (df=5)
logLik(mod2)
## 'log Lik.' -1205.927 (df=4)

With this command, we can see the likelihood and the degrees of freedom.

To get the teststat, we will need just the numeric value of the logLik() command. To do this, we will use the as.numeric() command and give the difference between our models. If you want, you can use the command on each model and then subtract or subtract and then use the command. They will produce the same results.

For the teststat, we will have to multiply by 2 or -2. We can find out which one we will need by looking at the logLik() command results. We can see that the loglikelihood of mod2 is larger, so we can use 2.

teststat = 2*as.numeric(logLik(mod2) - logLik(mod1))

Now, we can find our p-value. We will use the pchisq() command because the teststat has a chi-squared distribution. The pchisq() command needs a positive teststat, the degrees of freedom and lower.tail = FALSE. We can find the degrees of freedom by subtracting the degrees of freedom found in the logLik() command.

If the p-value is small, we should keep the more complicated model.

pchisq(abs(teststat), df=1, lower.tail = FALSE)
## [1] 0.1384866

Because the p-value is larger than our alpha level of 0.05, we should keep our simpler model, mod2. This will be our base model.

We can now try to increase p and q since we decided beforehand that those could be greater than 1.

mod3 = arima(birth, order = c(2,1,2), season = list(order = c(0,1,1), period = 12))
summary(mod3)
## 
## Call:
## arima(x = birth, order = c(2, 1, 2), seasonal = list(order = c(0, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sma1
##       1.2863  -0.3842  -1.6788  0.7343  -0.8155
## s.e.  0.1428   0.0898   0.1275  0.0920   0.0435
## 
## sigma^2 estimated as 45.37:  log likelihood = -1204.01,  aic = 2420.01
## 
## Training set error measures:
##                       ME     RMSE      MAE         MPE     MAPE      MASE
## Training set -0.02839672 6.617808 5.041914 -0.01203438 1.654352 0.3590337
##                     ACF1
## Training set 0.003065459

All of the teststats look good.

Now that we have the base model and the complicated model, we can use the Ljung Box Test to compare those two models.

The Ljung Box Test analyzes the residuals. The ARIMA model should account for the dependency for y_t’s. If the residuals are small, then the Ljung Box Test will have a small teststat. This is what we want. The null hypothesis for this test is that the model is adequate, which means it does account for dependency. The alternative is that the model is inadequate and does not account for dependency.

We will use the Box.test() command and supply the model$residuals (to get the residuals for the model) and the type which will equal Ljung.

Box.test(mod2$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod2$residuals
## X-squared = 0.049701, df = 1, p-value = 0.8236

The p-value is large. This means that we accept the null and can say that the model is adequate and can account for the dependency.

Box.test(mod3$residuals, type= "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod3$residuals
## X-squared = 0.0035334, df = 1, p-value = 0.9526

The p-value for mod3 is larger than mod2. This shows us that mod3 accounts for more dependency and is, therefore, a better model for our data.

Auto ARIMA

R has a command to do almost all of the above automatically. It is called auto.arima. You can give auto.arima your data and it will give back a model for you to use. It is what R thinks is the best model to fit your data.

automod = auto.arima(birth)
automod
## Series: birth 
## ARIMA(1,1,1)(1,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1    sar1     sma1
##       0.3127  -0.7088  0.1058  -0.8481
## s.e.  0.0868   0.0607  0.0710   0.0469
## 
## sigma^2 estimated as 46.04:  log likelihood=-1204.83
## AIC=2419.66   AICc=2419.83   BIC=2439.09

Right off the bat, you can see this is different than our model. The first component different is the number of seasons. Because this is a weekly time series, R automatically chose 52. When we worked with ours, we decided 12 was actually the best. From this, we can tell the models will be very different.

Our model is ARIMA (2,1,2)(0,1,1)[12].

The difference could stem from the fact that auto.arima only has a certain number of predictors. We did not have a constraint on ours. Another difference was that we dropped one of the predictors. The p-value was fairly small and could have been included if we really wanted to.

That being said, we are going to stick with our model.

For fun, we can compare the models using AIC. We assume that the automatic model will have the smaller AIC but it can still be interesting to check.

AIC(automod)
## [1] 2419.66
AIC(mod3)
## [1] 2420.01

We were right to assume the automatic model would have a lower AIC. There isn’t much of a difference though so our model isn’t much worse. This shows that our model will forecast fairly well, compared to the automatic model, but could be better.

Forecasting

Finally, we can use our model above to forecast into the future. We will use the forecast() command which takes a model and how many periods you want to forecast into the future. We chose to forecast 12 months into the future since our data is a monthly time series.

myARIMAcast = forecast(mod3, h=12)
plot(myARIMAcast)

The blue line is the forecasted line from our model. You can see two shaded areas. The darker blue shaded area is the 80% confidence interval while the lighter blue is the 95% confidence interval.

Now, we can forecast the auto model and see if it is better at forecasting than our model.

autocast = forecast(automod, h = 12)
plot(autocast)

Comparing the two ARIMA forecasts, we can see that they look very similar. It looks like both the intervals for the auto model might be slightly smaller but that seems to be the only noticeable difference.

Conclusion

This R guide went over time series methods. We discussed the general equation and seasonal variation. We smoothed our data and created a linear model. Lastly, we went over seasonal arima and forecasting for it.

There are other methods to model time series data but these are a few to get you started.