In class, we began covering time series. We began by talking about what the different types of trend lines look like. The obvious linearly increasing and decreasing trends, and also the curved trends that can be fitted with a polynomial function.

#install.packages("faraway")
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(aatemp)
attach(aatemp)
summary(aatemp)
##       year           temp      
##  Min.   :1854   Min.   :43.41  
##  1st Qu.:1910   1st Qu.:46.78  
##  Median :1939   Median :47.73  
##  Mean   :1940   Mean   :47.74  
##  3rd Qu.:1972   3rd Qu.:48.63  
##  Max.   :2000   Max.   :51.89
plot(temp~year, type="l")

mod <- lm(temp~year)
summary(mod)
## 
## Call:
## lm(formula = temp ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9843 -0.9113 -0.0820  0.9946  3.5343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.005510   7.310781   3.284  0.00136 **
## year         0.012237   0.003768   3.247  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.07727 
## F-statistic: 10.55 on 1 and 113 DF,  p-value: 0.001533

The t-test shows a low p-value after doing single linear regression, but based on the scatter plot we saw earlier, the trend does not seem to be linear.

plot(mod)

The residuals follow a curvature as well, lets add the quadratic term.

mod2 <- lm(temp~year+I(year^2))
summary(mod2)
## 
## Call:
## lm(formula = temp ~ year + I(year^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0412 -0.9538 -0.0624  0.9959  3.5820 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02  3.837e+02  -0.554    0.580
## year         2.567e-01  3.962e-01   0.648    0.518
## I(year^2)   -6.307e-05  1.022e-04  -0.617    0.539
## 
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared:  0.08846,    Adjusted R-squared:  0.07218 
## F-statistic: 5.434 on 2 and 112 DF,  p-value: 0.005591

All of a sudden, nothing is significant. I know I am not imagining the curvature in the residuals. Lets add another polynomial term.

mod3 <- lm(temp~year+I(year^2)+I(year^3))
summary(mod3)
## 
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8557 -0.9646 -0.1552  1.0485  4.1538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.959e+04  1.734e+04   2.283   0.0243 *
## year        -6.159e+01  2.694e+01  -2.286   0.0241 *
## I(year^2)    3.197e-02  1.395e-02   2.291   0.0238 *
## I(year^3)   -5.527e-06  2.407e-06  -2.296   0.0236 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared:  0.1298, Adjusted R-squared:  0.1063 
## F-statistic: 5.518 on 3 and 111 DF,  p-value: 0.001436

The t-tests show significance across the board. Lets check on the residuals!

plot(mod3)

Much better, notice how the curvature in the residual plot has been eliminated.

summary(aatemp)
##       year           temp      
##  Min.   :1854   Min.   :43.41  
##  1st Qu.:1910   1st Qu.:46.78  
##  Median :1939   Median :47.73  
##  Mean   :1940   Mean   :47.74  
##  3rd Qu.:1972   3rd Qu.:48.63  
##  Max.   :2000   Max.   :51.89
coef(mod)%*%c(1,2002)
##          [,1]
## [1,] 48.50451
coef(mod2)%*%c(1,2002,2002^2)
##          [,1]
## [1,] 48.32538
coef(mod3)%*%c(1,2002,2002^2,2002^3)
##          [,1]
## [1,] 47.50484

Using each of the fitted models we can predict the temperature for the year 2002. If our best model was the 3 degree polynomial, the best prediction is 47.5 degrees.

Next we looked at data on births over time.

data(divusa)
attach(divusa)
## The following object is masked from aatemp:
## 
##     year
summary(divusa)
##       year         divorce        unemployed         femlab     
##  Min.   :1920   Min.   : 6.10   Min.   : 1.200   Min.   :22.70  
##  1st Qu.:1939   1st Qu.: 8.70   1st Qu.: 4.200   1st Qu.:27.47  
##  Median :1958   Median :10.60   Median : 5.600   Median :37.10  
##  Mean   :1958   Mean   :13.27   Mean   : 7.173   Mean   :38.58  
##  3rd Qu.:1977   3rd Qu.:20.30   3rd Qu.: 7.500   3rd Qu.:47.80  
##  Max.   :1996   Max.   :22.80   Max.   :24.900   Max.   :59.30  
##     marriage          birth           military     
##  Min.   : 49.70   Min.   : 65.30   Min.   : 1.940  
##  1st Qu.: 61.90   1st Qu.: 68.90   1st Qu.: 3.469  
##  Median : 74.10   Median : 85.90   Median : 9.102  
##  Mean   : 72.97   Mean   : 88.89   Mean   :12.365  
##  3rd Qu.: 80.00   3rd Qu.:107.30   3rd Qu.:14.266  
##  Max.   :118.10   Max.   :122.90   Max.   :86.641
plot(birth~year,type="l")

This is definitely not a linear trend.

bmod <- lm(birth~year)
summary(bmod)
## 
## Call:
## lm(formula = birth ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.159 -11.129  -4.086  11.232  33.509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1073.45274  161.22254   6.658 4.03e-09 ***
## year          -0.50284    0.08234  -6.107 4.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.06 on 75 degrees of freedom
## Multiple R-squared:  0.3321, Adjusted R-squared:  0.3232 
## F-statistic:  37.3 on 1 and 75 DF,  p-value: 4.143e-08
plot(bmod)

The process is repeated until the 8th degree polynomial, which actually yields a significant model.

mod8 <- lm(birth~poly(year,8))
summary(mod8)
## 
## Call:
## lm(formula = birth ~ poly(year, 8))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8539 -1.8149  0.1146  1.5883 10.7377 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     88.8883     0.3592 247.435  < 2e-16 ***
## poly(year, 8)1 -98.0709     3.1523 -31.111  < 2e-16 ***
## poly(year, 8)2 -55.9709     3.1523 -17.756  < 2e-16 ***
## poly(year, 8)3 -16.2403     3.1523  -5.152 2.40e-06 ***
## poly(year, 8)4 104.1628     3.1523  33.043  < 2e-16 ***
## poly(year, 8)5 -23.5942     3.1523  -7.485 1.89e-10 ***
## poly(year, 8)6 -58.3056     3.1523 -18.496  < 2e-16 ***
## poly(year, 8)7  16.6946     3.1523   5.296 1.37e-06 ***
## poly(year, 8)8  13.5140     3.1523   4.287 5.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.152 on 68 degrees of freedom
## Multiple R-squared:  0.9767, Adjusted R-squared:  0.9739 
## F-statistic: 355.8 on 8 and 68 DF,  p-value: < 2.2e-16
plot(mod8)

#install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
dwtest(mod8, alternative="two.sided")
## 
##  Durbin-Watson test
## 
## data:  mod8
## DW = 1.1621, p-value = 8.211e-07
## alternative hypothesis: true autocorrelation is not 0
dwtest(mod8, alternative="less")
## 
##  Durbin-Watson test
## 
## data:  mod8
## DW = 1.1621, p-value = 1
## alternative hypothesis: true autocorrelation is less than 0
dwtest(mod8)
## 
##  Durbin-Watson test
## 
## data:  mod8
## DW = 1.1621, p-value = 4.106e-07
## alternative hypothesis: true autocorrelation is greater than 0

A Durbin Watson test reveals that there is auto correlation in the births over time data.

Auto correlation occurs when the next residual depends on the sign of the previous residual. Positive autocor is when a positive residual makes it more likely for the following residuals to also be positive. Positive autocor can also be when a negative residual makes it more likely for the following residuals to be negative. Negative autocor is when it alternates positive and negative signs. Both are bad because residuals should be independent.