A time series measures the same variable over time. Examples include annual sales figures of a certain product or daily temperatures at a particular weather station. When modeling a time series, we use time as a predictor.
As always, it’s a good idea to make a plot before creating a model. This is still done using the plot function, but we add the argument type = “l” (lowecase L). With this argument, the output plot has the points connected using lines instead of individual data points, making it easier to spot any trends.
A time series plot can show no trend over time, which means the model will only have the intercept. If we see a linear trend, we want to use a linear model. If the trend has curvature, a quadratic or higher power model might be useful. Generally, if we see reversals in trends, we want to use polynomial models.
The aatemp dataset is helpful in learning how to create a model for a time series. Most of the concepts involved are ones we’ve already covered in previous classes and should be familiar. The aatemp dataset measures annual mean temperatures in Ann Arbor, Michigan, between 1854 and 2000. As always, it’s a good idea to plot the data before creating our model.
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(aatemp)
attach(aatemp)
plot(temp ~ year, type = "l", ylab = "temperature")
The plot shows a slight upward trend. There isn’t any apparent curvature, so let’s first try a linear model. We still use the lm function that we used for linear and polynomial regression models, but we use time as the predictor.
linetemp <- lm(temp ~ year)
summary(linetemp)
##
## Call:
## lm(formula = temp ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9843 -0.9113 -0.0820 0.9946 3.5343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.005510 7.310781 3.284 0.00136 **
## year 0.012237 0.003768 3.247 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
## F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
We can see that the linear term is definitely significant, which means there is a trend in the annual mean temperatures. To see if the linear model is any good, let’s look at the diagnostic plots.
par(mfrow = c(2,2))
plot(linetemp)
The residual plot could be better. We can see some evidence of heteroscedasticity, so let’s try a quadratic model.
quadtemp <- lm(temp ~ year + I(year^2))
summary(quadtemp)
##
## Call:
## lm(formula = temp ~ year + I(year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0412 -0.9538 -0.0624 0.9959 3.5820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02 3.837e+02 -0.554 0.580
## year 2.567e-01 3.962e-01 0.648 0.518
## I(year^2) -6.307e-05 1.022e-04 -0.617 0.539
##
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared: 0.08846, Adjusted R-squared: 0.07218
## F-statistic: 5.434 on 2 and 112 DF, p-value: 0.005591
The quadratic term doesn’t seem to be significant. Perhaps a cubic term would help things out a bit. We can make things easier when creating higher power polynomial models by using the poly argument for lm. The poly argument gets the predictor variable name and the desired power of our model, in that order. If we wanted a 5th power model, we would add poly(temp, 5). For a cubic model, we change the number to 3, as shown below.
cubetemp <- lm(temp ~ poly(year, 3))
summary(cubetemp)
##
## Call:
## lm(formula = temp ~ poly(year, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8557 -0.9646 -0.1552 1.0485 4.1538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.7426 0.1346 354.796 <2e-16 ***
## poly(year, 3)1 4.7616 1.4430 3.300 0.0013 **
## poly(year, 3)2 -0.9071 1.4430 -0.629 0.5309
## poly(year, 3)3 -3.3132 1.4430 -2.296 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1063
## F-statistic: 5.518 on 3 and 111 DF, p-value: 0.001436
In the model summary, we can see that the cubic term is significant, but not the quadratic term. Now let’s see if this improves the diagnostic plots.
par(mfrow = c(2,2))
plot(cubetemp)
Everything looks good here. The residual plot shows a more even vertical spread, and we can see the there is no significant trend in the residuals. Normality and leverage both look good, and the scale-location plot shows no significant trends.
We can create a forecast for the future using a time series by plugging in a future time value. This is slightly different from the regression models in which we never want to predict beyond the scope of the data. With a time series, making predictions into the near future is acceptable because it’s reasonable to assume the trend will continue into the near future. In the case of Ann Arbor’s annual mean temperatures, it’s reasonable to assume next year’s temperature will be somewhat similar to this year’s temperature.
One concern we must address is the issue of autocorrelation. This is when the residuals are correlated with each other over time. We can have positive or negative autocorrelation in a time series model.
Postive autocorrelation can have two cases. The first case is that a positive residual is likely to be followed by another positive residual. The other case is a negative residual is likely to be followed by another negative residual.
Negative autocorrelation is the tendency for the residuals to change signs, that is, a positive residual is likely to be followed by a negative residual, and vice versa.
Autocorrelation is a problem because it throws off the standard errors of our model, leading to problems when we conduct hypothesis testing since we would be violating the independence assumption. We do have a formal test to check for autocorrelation, called the Durbin-Watson test. Note that this test only looks for first order autocorrelation (i.e. cases where residuals are correlated with the one directly before and directly after it). The Durbin-Watson test is a hypothesis test, so let’s define our null and alternative hypotheses.
\(H_0\): there is no autocorrelation in the model.
\(H_a\): there is autocorrelation OR there is positive autocorrelation OR there is negative autocorrelation
We can define our alternative hypothesis to test for the existence of any autocorrelation, or specifically for positive or negative autocorrelation. If we fail to reject the null hypothesis, then our model has no autocorrelation. If we reject the null hypothesis, then we have autocorrelation in the model.
We can perform the Durbin-Watson test using the dwtest function found in the lmtest library in R. We can change our alternative hypothesis using the alternative argument (this can be set to “two.sided”, “greater”, or “less”).
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(cubetemp, alternative = c("two.sided"))
##
## Durbin-Watson test
##
## data: cubetemp
## DW = 1.7171, p-value = 0.06928
## alternative hypothesis: true autocorrelation is not 0
Our p-value is greater than 0.05, so we fail to reject the null hypothesis. Our cubic model for the mean annual temperature in Ann Arbor, Michigan, does not have autocorrelation.
We checked the independence assumption for cross sectional data by looking at how the data was collected. In a time series, this assumption is checked by looking for autocorrelation.