Today in class we started chapter 6, which is about time series. We use time series models to make forecasts, i.e., to predict what will happen beyond the range of our data.
To begin, we covered section 6.1, which concerns polynomial trends, where we use time as a predictor. We learned that we can write a time series in the following general form: \[y_t = TR_t + \epsilon_t,\] where \(y_t\) is the value at time \(t\), \(TR_t\) is the trend, and \(\epsilon_t\) is the error term. We also discussed possible trends that a time series could have, including: no trend, a linear trend, a quadratic trend, and more. In general, a \(p^{th}\)-order polynomial trend model takes the form: \[TR_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \dots + \beta_p t^p.\] Finally, it was mentioned that we would use a polynomial of degree three or higher when there is a reversal in curvature in the series.
In R we can create multiple models to determine which one best fits our data. We can start with an SLR model and add terms to increase the order of our polynomial. Generally, we keep adding terms as long as all of the terms in the new model are significant and the new model fits the data better than the previous one. Keep in mind, though, that a model of degree \(k\) might have no significant predictors while a model of degree \(k+1\) does. Thus, it's important to analyze the data to see what makes sense, and it helps to create a wide range of models for comparison, as in the sketch below.
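As a rough sketch of this workflow (using a simulated series rather than one of the class datasets), we can fit polynomial trends of increasing order and compare them:
#Sketch: fit polynomial trends of increasing order to a simulated series.
set.seed(1)
t <- 1:100
y <- 5 + 0.3 * t - 0.002 * t^2 + rnorm(100, sd = 2)  #quadratic trend plus noise
fit1 <- lm(y ~ t)              #linear trend
fit2 <- lm(y ~ t + I(t^2))     #quadratic trend
summary(fit2)                  #are all terms in the larger model significant?
anova(fit1, fit2)              #does the larger model fit significantly better?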
In the last section of this document we use three datasets to illustrate the process of creating the best time series model for a given dataset.
In section 6.2 we learned about autocorrelation, which occurs when residuals are correlated through time. There are two types of autocorrelation: positive and negative. Positive autocorrelation occurs when residuals tend to be followed by residuals of the same sign (i.e., positive residuals followed by positive residuals, or negative residuals followed by negative residuals). Negative autocorrelation occurs when residuals tend to be followed by residuals of the opposite sign (i.e., positive residuals followed by negative residuals, or vice versa). The simulation sketch below illustrates both patterns.
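As an illustrative sketch (simulated series, not one of the class examples), we can generate both patterns and plot them:
#Simulated AR(1) series showing positive vs. negative autocorrelation.
set.seed(1)
pos <- arima.sim(model = list(ar = 0.9), n = 100)   #long runs of same-sign values
neg <- arima.sim(model = list(ar = -0.9), n = 100)  #frequent sign flips
par(mfrow = c(1, 2))
plot(pos, main = "Positive autocorrelation")
plot(neg, main = "Negative autocorrelation")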
Autocorrelation is bad because when we build a model we assume that the residuals are independent, but if the residuals are correlated they are not independent. Thus, autocorrelation causes our standard errors to be too small or too large. If a standard error is too small, we will reject the null hypothesis too frequently, and our confidence intervals will be too narrow. If a standard error is too large, we will not reject the null hypothesis as often as we should, and our confidence intervals will be too wide.
There are two methods that can be used to detect autocorrelation. The first is visual: we plot the residuals through time. The second is numerical: we use the Durbin-Watson test. The null hypothesis for this test states that there is no autocorrelation. The alternative hypothesis can take one of three forms: there is autocorrelation (two-sided), there is positive autocorrelation, or there is negative autocorrelation. Note that for our class we will only focus on first-order autocorrelation.
After we create our model, we run the Durbin-Watson test with one of these alternative hypotheses, and R gives us a p-value that tells us whether or not to reject the null hypothesis.
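For reference, here is a minimal sketch of the three forms of the test in R (using the built-in cars data purely as a placeholder model):
library(lmtest)
mod0 <- lm(dist ~ speed, data = cars)     #placeholder model for illustration
dwtest(mod0)                              #H_a: positive autocorrelation (the default)
dwtest(mod0, alternative = "less")        #H_a: negative autocorrelation
dwtest(mod0, alternative = "two.sided")   #H_a: autocorrelation of either sign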
Below are three examples that demonstrate how to use the Durbin-Watson test in R.
All hypothesis tests in the three examples assume a significance level of 0.05. All datasets are in the faraway library.
Attach the airpass data.
library(faraway)
data(airpass)
attach(airpass)
Plot the data.
plot(pass ~ year, type = "l")
Notice that a linear model or a quadratic model could fit our data.
Create an SLR model.
mod <- lm(pass ~ year)
summary(mod)
##
## Call:
## lm(formula = pass ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.858 -30.727 -5.757 24.489 164.999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1474.771 61.106 -24.14 <2e-16 ***
## year 31.886 1.108 28.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.06 on 142 degrees of freedom
## Multiple R-squared: 0.8536, Adjusted R-squared: 0.8526
## F-statistic: 828.2 on 1 and 142 DF, p-value: < 2.2e-16
plot(mod)
The p-values from the t-tests and the F-test confirm that there is a linear trend between the year and the number of passengers. However, the residual plots show a trend and heteroscedasticity in the residuals.
Create a quadratic model.
mod2 <- lm(pass ~ year + I(year^2))
summary(mod2)
##
## Call:
## lm(formula = pass ~ year + I(year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.353 -27.339 -7.442 21.603 146.116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1570.5174 1053.9017 1.490 0.13841
## year -79.2078 38.4007 -2.063 0.04098 *
## I(year^2) 1.0092 0.3487 2.894 0.00441 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.91 on 141 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.8599
## F-statistic: 439.8 on 2 and 141 DF, p-value: < 2.2e-16
The p-value for \(year\) is larger now because \(year^2\) is accounting for some of the variability, but both terms are still significant. Definitely keep \(year^2\), and keep \(year\) because we're keeping \(year^2\). Use the quadratic (rather than linear) model because \(year^2\) is significant.
General note: if \(year^p\) is significant, we must keep \(year, year^2, \dots, year^{p-1}\) in the model. Similarly, if an interaction \(xz\) is significant, we must keep \(x\) and \(z\) in the model.
Extrapolate to see what the prediction is for the year 1962 (year = 62 on this scale).
coef(mod2) %*% c(1,62, 62^2)
## [,1]
## [1,] 538.9268
The predicted number of passengers for 1962 is 538.9268.
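Equivalently (a sketch, not in the original notes), predict() gives the same extrapolation without writing out the design vector by hand:
#Same extrapolation via predict(); year = 62 corresponds to 1962.
predict(mod2, newdata = data.frame(year = 62))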
Conduct the Durbin-Watson test to check for autocorrelation. By default, the dwtest command tests against the alternative of positive autocorrelation.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(mod2)
##
## Durbin-Watson test
##
## data: mod2
## DW = 0.56939, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
The p-value for this test is less than \(2.2 \times 10^{-16}\), so we reject the null hypothesis and conclude that there is positive autocorrelation.
Now attach a new dataset.
library(faraway)
data(aatemp)
#annual temperature for Ann Arbor, MI
#forecast for 2002
attach(aatemp)
## The following object is masked from airpass:
##
## year
Plot the data.
plot(aatemp, type = "l")
From the graph it looks like we might want a polynomial of degree greater than two.
Create an SLR model.
modd <- lm(temp ~ year)
summary(modd)
##
## Call:
## lm(formula = temp ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9843 -0.9113 -0.0820 0.9946 3.5343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.005510 7.310781 3.284 0.00136 **
## year 0.012237 0.003768 3.247 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
## F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
plot(modd)
Year is significant, so temp is related to year. The residuals aren’t great though.
Create a quadratic model.
modd2 <- lm(temp ~ year + I(year^2))
summary(modd2)
##
## Call:
## lm(formula = temp ~ year + I(year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0412 -0.9538 -0.0624 0.9959 3.5820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02 3.837e+02 -0.554 0.580
## year 2.567e-01 3.962e-01 0.648 0.518
## I(year^2) -6.307e-05 1.022e-04 -0.617 0.539
##
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared: 0.08846, Adjusted R-squared: 0.07218
## F-statistic: 5.434 on 2 and 112 DF, p-value: 0.005591
plot(modd2)
Now neither \(year\) nor \(year^2\) is significant, so there is no evidence of a quadratic relationship.
Create a cubic polynomial.
modd3 <- lm(temp ~ year + I(year^2) + I(year^3))
summary(modd3)
##
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8557 -0.9646 -0.1552 1.0485 4.1538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.959e+04 1.734e+04 2.283 0.0243 *
## year -6.159e+01 2.694e+01 -2.286 0.0241 *
## I(year^2) 3.197e-02 1.395e-02 2.291 0.0238 *
## I(year^3) -5.527e-06 2.407e-06 -2.296 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1063
## F-statistic: 5.518 on 3 and 111 DF, p-value: 0.001436
plot(modd3)
The residuals look way better, but the p-values aren't as significant as they were for the SLR model. Use anova to compare the SLR model and the cubic model. The anova function tests the difference between the two nested models; its p-value tells us whether the additional term(s) are significant.
anova(modd, modd3)
## Analysis of Variance Table
##
## Model 1: temp ~ year
## Model 2: temp ~ year + I(year^2) + I(year^3)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 113 242.94
## 2 111 231.14 2 11.8 2.8335 0.06307 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 0.06307, which is greater than our significance level of 0.05. Thus, we don't want the extra terms, and we will keep our SLR model.
Predict the average temperature for the year 2002.
coef(modd) %*% c(1, 2002)
## [,1]
## [1,] 48.50451
The estimated average temperature for 2002 is 48.50451 degrees F.
Now check for positive autocorrelation using the Durbin-Watson test (here applied to the cubic model).
library(lmtest)
dwtest(modd3)
##
## Durbin-Watson test
##
## data: modd3
## DW = 1.7171, p-value = 0.03464
## alternative hypothesis: true autocorrelation is greater than 0
Our p-value is 0.03464, which is less than our significance level of 0.05. Thus, there is positive autocorrelation in the residuals of our model.
Attach the divusa dataset, and use the year to predict the number of births.
library(faraway)
data(divusa)
#looking at the birth rate (births per 1000 women) in the US
#forecast for 2002
attach(divusa)
## The following object is masked from aatemp:
##
## year
## The following object is masked from airpass:
##
## year
Plot the data.
plot(birth ~ year, type = "l")
modusa <- lm(birth ~ year)
summary(modusa)
##
## Call:
## lm(formula = birth ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.159 -11.129 -4.086 11.232 33.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1073.45274 161.22254 6.658 4.03e-09 ***
## year -0.50284 0.08234 -6.107 4.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.06 on 75 degrees of freedom
## Multiple R-squared: 0.3321, Adjusted R-squared: 0.3232
## F-statistic: 37.3 on 1 and 75 DF, p-value: 4.143e-08
plot(modusa)
The data look like they could follow a polynomial of degree three or more. In the SLR model, year is really significant, but the residuals are really bad.
Create a quadratic model.
modusa2 <- lm(birth ~ year + I(year^2))
summary(modusa2)
##
## Call:
## lm(formula = birth ~ year + I(year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.2952 -11.8939 -0.7406 12.5154 26.3901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.428e+04 1.464e+04 -3.709 0.000400 ***
## year 5.604e+01 1.495e+01 3.748 0.000351 ***
## I(year^2) -1.444e-02 3.818e-03 -3.782 0.000313 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.8 on 74 degrees of freedom
## Multiple R-squared: 0.4403, Adjusted R-squared: 0.4252
## F-statistic: 29.11 on 2 and 74 DF, p-value: 4.718e-10
plot(modusa2)
\(year^2\) and \(year\) are both really significant. The residuals are still messed up. Next, create a cubic model.
modusa3 <- lm(birth ~ year + I(year^2) + I(year^3))
summary(modusa3)
##
## Call:
## lm(formula = birth ~ year + I(year^2) + I(year^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.382 -12.208 -1.928 12.042 26.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.557e+06 1.467e+06 1.062 0.292
## year -2.414e+03 2.248e+03 -1.074 0.286
## I(year^2) 1.247e+00 1.148e+00 1.086 0.281
## I(year^3) -2.148e-04 1.954e-04 -1.099 0.275
##
## Residual standard error: 14.78 on 73 degrees of freedom
## Multiple R-squared: 0.4494, Adjusted R-squared: 0.4268
## F-statistic: 19.86 on 3 and 73 DF, p-value: 1.625e-09
plot(modusa3)
Now none of the individual terms is significant (all of the p-values are around 0.28), even though the overall F-test is: the raw powers of year are highly collinear, which inflates the standard errors. The residuals are still messed up.
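As a quick sketch (not from the original notes), we can confirm how collinear the raw powers of year are:
#The raw powers of year are nearly perfectly correlated over 1920-1996,
#which is why the individual t-tests in modusa3 lose significance.
with(divusa, cor(year, year^2))
with(divusa, cor(year^2, year^3))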
Create a 4th-degree polynomial using poly(), which by default generates orthogonal polynomial terms and sidesteps the collinearity among the raw powers.
modusa4 <- lm(birth ~ poly(year, 4))
summary(modusa4)
##
## Call:
## lm(formula = birth ~ poly(year, 4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.505 -7.543 1.208 7.846 13.306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.8883 0.9585 92.738 < 2e-16 ***
## poly(year, 4)1 -98.0709 8.4107 -11.660 < 2e-16 ***
## poly(year, 4)2 -55.9709 8.4107 -6.655 4.76e-09 ***
## poly(year, 4)3 -16.2403 8.4107 -1.931 0.0574 .
## poly(year, 4)4 104.1628 8.4107 12.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.411 on 72 degrees of freedom
## Multiple R-squared: 0.8241, Adjusted R-squared: 0.8143
## F-statistic: 84.34 on 4 and 72 DF, p-value: < 2.2e-16
plot(modusa4)
Most terms are highly significant, though the cubic term is only marginal (p = 0.0574). The residuals are still messed up.
Create an 8th-degree polynomial.
modusa8 <- lm(birth ~ poly(year, 8))
summary(modusa8)
##
## Call:
## lm(formula = birth ~ poly(year, 8))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8539 -1.8149 0.1146 1.5883 10.7377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.8883 0.3592 247.435 < 2e-16 ***
## poly(year, 8)1 -98.0709 3.1523 -31.111 < 2e-16 ***
## poly(year, 8)2 -55.9709 3.1523 -17.756 < 2e-16 ***
## poly(year, 8)3 -16.2403 3.1523 -5.152 2.40e-06 ***
## poly(year, 8)4 104.1628 3.1523 33.043 < 2e-16 ***
## poly(year, 8)5 -23.5942 3.1523 -7.485 1.89e-10 ***
## poly(year, 8)6 -58.3056 3.1523 -18.496 < 2e-16 ***
## poly(year, 8)7 16.6946 3.1523 5.296 1.37e-06 ***
## poly(year, 8)8 13.5140 3.1523 4.287 5.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.152 on 68 degrees of freedom
## Multiple R-squared: 0.9767, Adjusted R-squared: 0.9739
## F-statistic: 355.8 on 8 and 68 DF, p-value: < 2.2e-16
plot(modusa8)
All predictors are significant. The residuals look great!!
Create a 9th-degree polynomial.
modusa9 <- lm(birth ~ poly(year, 9))
summary(modusa9)
##
## Call:
## lm(formula = birth ~ poly(year, 9))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8169 -1.7292 0.1805 1.5714 10.6432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.8883 0.3609 246.298 < 2e-16 ***
## poly(year, 9)1 -98.0709 3.1669 -30.968 < 2e-16 ***
## poly(year, 9)2 -55.9709 3.1669 -17.674 < 2e-16 ***
## poly(year, 9)3 -16.2403 3.1669 -5.128 2.69e-06 ***
## poly(year, 9)4 104.1628 3.1669 32.891 < 2e-16 ***
## poly(year, 9)5 -23.5942 3.1669 -7.450 2.35e-10 ***
## poly(year, 9)6 -58.3056 3.1669 -18.411 < 2e-16 ***
## poly(year, 9)7 16.6946 3.1669 5.272 1.55e-06 ***
## poly(year, 9)8 13.5140 3.1669 4.267 6.36e-05 ***
## poly(year, 9)9 -1.9428 3.1669 -0.613 0.542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.167 on 67 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9737
## F-statistic: 313.4 on 9 and 67 DF, p-value: < 2.2e-16
plot(modusa9)
The 9th-degree term is not significant, so we should use the 8th-degree polynomial.
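To see how well the chosen model tracks the series, we could overlay its fitted values on the data (a sketch, not in the original notes):
#Overlay the fitted 8th-degree trend on the observed series.
plot(birth ~ year, type = "l")
lines(year, fitted(modusa8), lty = 2)  #dashed line: fitted polynomial trend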
Predict the number of births for the year 2002.
newdat <- data.frame(year = 2002)
predict(modusa8, newdata = newdat)
## 1
## 126.8733
The predicted number of births per 1000 women in the year 2002 is 126.8733. Keep in mind this is an extrapolation beyond the observed data, where a high-degree polynomial can behave erratically.
Check for positive autocorrelation.
library(lmtest)
dwtest(modusa8)
##
## Durbin-Watson test
##
## data: modusa8
## DW = 1.1621, p-value = 4.106e-07
## alternative hypothesis: true autocorrelation is greater than 0
The p-value is \(4.106 \times 10^{-7}\), so we reject the null hypothesis and conclude that there is positive autocorrelation in our model.