Chapter 6 day 1

Section 1: polynomial trends through time!
The format is
y_t = TR_t + E_t where y is the value of the time series at time t, E is the error term and TR is the trend.
This can vary based on what level of polynomial is needed:
1) no trend = beta_0
2) linear trend = beta_0 + beta_1 * t
3) quadratic trend = beta_0 + beta_1 t + beta_1 t^2
which can either be increasing growth, decreasing growth, increasing decline and decreasing decline
4) multiple n polynomial trend = beta_0 + beta_1 t + beta_1 t^2 + … + beta_1 * t^n
which is used when the data shows a reversal in curvature.

Section 2: Autocorrelation
Investigating whether the residuals are correlated over time. The two types are Postive autocorrelation and Negative autocorrelation.
1) Positive Autocorrelation is when positive(negative) errors are likely to be followed by another positive(negative) residual
2) Negative Autocorrelation is when positive(negative) errors are likely to be followed by a negative(positive) residual.
We are only going to be looking at First order correlation, where we are only testing for errors directly on either side of the residual (aka E_t is only related to E_t-1 and E_t+1)
WE do this with the Durbin Watson statistic, which uses the code dwtest(). The hypothesis is that there is not autocorrelation in the residuals and we will test against the hypothesis that the error terms are autocorrelated (either positive or negative)

We’ll start with the airpass data, looking at the monthly airline passengers data set. We will check how the monthly totals changes over time by creating a linear model.

library(faraway)
data(airpass)
attach(airpass)
plot(pass~year, type="l")

airmod <- lm(pass~ year, airpass)
summary(airmod)

## 
## Call:
## lm(formula = pass ~ year, data = airpass)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.858 -30.727  -5.757  24.489 164.999 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1474.771     61.106  -24.14   <2e-16 ***
## year           31.886      1.108   28.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.06 on 142 degrees of freedom
## Multiple R-squared:  0.8536, Adjusted R-squared:  0.8526 
## F-statistic: 828.2 on 1 and 142 DF,  p-value: < 2.2e-16

plot(airmod)

We can try to improve the model by making it a polynomial rather than linear function.

airmod2<-lm(pass~year + I(year^2))
summary(airmod2)

## 
## Call:
## lm(formula = pass ~ year + I(year^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.353  -27.339   -7.442   21.603  146.116 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1570.5174  1053.9017   1.490  0.13841   
## year         -79.2078    38.4007  -2.063  0.04098 * 
## I(year^2)      1.0092     0.3487   2.894  0.00441 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.91 on 141 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.8599 
## F-statistic: 439.8 on 2 and 141 DF,  p-value: < 2.2e-16

plot(airmod2)

It doesn’t look that much different to me.

We can look at another data set to practice this more.
Here is the annual mean temperatures in Ann Arbor, Michigan. What linear model fits best (first we should graph it)?

data(aatemp)
attach(aatemp)

## The following object is masked from airpass:
## 
##     year

plot(temp~year,type="l")

tempmod<-lm(temp~year,data=aatemp)
summary(tempmod)

## 
## Call:
## lm(formula = temp ~ year, data = aatemp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9843 -0.9113 -0.0820  0.9946  3.5343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.005510   7.310781   3.284  0.00136 **
## year         0.012237   0.003768   3.247  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.07727 
## F-statistic: 10.55 on 1 and 113 DF,  p-value: 0.001533

plot(tempmod)

Now we can try a quadratic:

tempmod2<-lm(temp~year+I(year^2),data=aatemp)
summary(tempmod2)

## 
## Call:
## lm(formula = temp ~ year + I(year^2), data = aatemp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0412 -0.9538 -0.0624  0.9959  3.5820 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02  3.837e+02  -0.554    0.580
## year         2.567e-01  3.962e-01   0.648    0.518
## I(year^2)   -6.307e-05  1.022e-04  -0.617    0.539
## 
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared:  0.08846,    Adjusted R-squared:  0.07218 
## F-statistic: 5.434 on 2 and 112 DF,  p-value: 0.005591

plot(tempmod2)

Well that’s ok, the p value is better, but we should try cubic to see:

tempmod3<-lm(temp~year+I(year^2)+I(year^3),data=aatemp)
summary(tempmod3)

## 
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3), data = aatemp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8557 -0.9646 -0.1552  1.0485  4.1538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.959e+04  1.734e+04   2.283   0.0243 *
## year        -6.159e+01  2.694e+01  -2.286   0.0241 *
## I(year^2)    3.197e-02  1.395e-02   2.291   0.0238 *
## I(year^3)   -5.527e-06  2.407e-06  -2.296   0.0236 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared:  0.1298, Adjusted R-squared:  0.1063 
## F-statistic: 5.518 on 3 and 111 DF,  p-value: 0.001436

plot(tempmod3)

This looks much better and the p value had a significant improvement. We should probably be worried about the fact that the residuals are all packed in all left rather than spread out in the residuals vs leverage graph. We want to forcast the year 2002 to see what the temperature will be. You can do this by using the coefficients from the linear model summary.

3.959e+04+-6.159e+01*2002+3.197e-02*2002^2+-5.527e-06*2002^3

## [1] 73.92719

Now we can use the Durbin Watson to test our model to examine whether the time series errors have autocorrelation.

#install.packages("lmtest") to install the package
library(lmtest)

## Warning: package 'lmtest' was built under R version 3.4.4

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.4.4

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

dwtest(tempmod3)

## 
##  Durbin-Watson test
## 
## data:  tempmod3
## DW = 1.7171, p-value = 0.03464
## alternative hypothesis: true autocorrelation is greater than 0

The default setting for the dwtest is positive. If you want to test for negative autocorrelation, you should specify the alternative as “less”.

dwtest(tempmod3,alternative = "less")

## 
##  Durbin-Watson test
## 
## data:  tempmod3
## DW = 1.7171, p-value = 0.9654
## alternative hypothesis: true autocorrelation is less than 0

I believe that there is no negative autocorrelation and there is positive ac

The last dataset we worked with is the divorce rates in the USA over the years 1920 - 1996.

data("divusa")
attach(divusa)

## The following object is masked from aatemp:
## 
##     year

## The following object is masked from airpass:
## 
##     year

plot(divorce~year,type = "l")

There’s a huge jump in the data right before 1950. I’m sure that’s going to affect our data.
Making the linear model:

divmod <- lm(divorce~year,data=divusa)
summary(divmod)

## 
## Call:
## lm(formula = divorce ~ year, data = divusa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7828 -1.8092  0.1592  1.6292  7.3048 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -422.97530   27.29465  -15.50   <2e-16 ***
## year           0.22280    0.01394   15.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.719 on 75 degrees of freedom
## Multiple R-squared:  0.7731, Adjusted R-squared:   0.77 
## F-statistic: 255.5 on 1 and 75 DF,  p-value: < 2.2e-16

plot(divmod)

divmod2 <- lm(divorce~year+I(year^2)+I(year^3),data=divusa)
summary(divmod2)

## 
## Call:
## lm(formula = divorce ~ year + I(year^2) + I(year^3), data = divusa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0160 -1.4605 -0.1395  1.5400  8.6719 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  4.986e+05  2.477e+05   2.013   0.0478 *
## year        -7.601e+02  3.796e+02  -2.003   0.0489 *
## I(year^2)    3.861e-01  1.939e-01   1.992   0.0502 .
## I(year^3)   -6.535e-05  3.300e-05  -1.980   0.0515 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 73 degrees of freedom
## Multiple R-squared:  0.8138, Adjusted R-squared:  0.8062 
## F-statistic: 106.4 on 3 and 73 DF,  p-value: < 2.2e-16

plot(divmod2)

We want to forcast the year 2002 to see what the divorce will be. You can do this by using the coefficients from the linear model summary.

4.986e+05+-7.601e+02*2000+3.861e-01*2000^2+-6.535e-05*2000^3

## [1] 0

I’m guessing because this dataset is not particularly easy to follow using linear/quadratic/linear that this model will not be good at predicting for the actual divorce rate because it doesn’t make sense to have divorce rate as 0.

Really the only different thing we’re doing with the models is defining year to be the only predictor, and the DW test. We’ve done the polynomial regression and plotting the “linear” model, and the dwtest is pretty easy once you have the package installed and you understand what autocorrelation is.

Chapter 6 day 1

Caitlyn Rome

April 3, 2018