In this class we learned about trends over time. Some trends through time could be linear, some could be quadratic, some could be even higher polynomials or perhaps there is no trend at all. We did a lot of testing and graphing to determine which graphs fit best. This first data set has to do with airline prices over time.

library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(airpass)
attach(airpass)
head(airpass)
##   pass     year
## 1  112 49.08333
## 2  118 49.16667
## 3  132 49.25000
## 4  129 49.33333
## 5  121 49.41667
## 6  135 49.50000
mod <- lm(pass ~ year)
summary(mod)
## 
## Call:
## lm(formula = pass ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.858 -30.727  -5.757  24.489 164.999 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1474.771     61.106  -24.14   <2e-16 ***
## year           31.886      1.108   28.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.06 on 142 degrees of freedom
## Multiple R-squared:  0.8536, Adjusted R-squared:  0.8526 
## F-statistic: 828.2 on 1 and 142 DF,  p-value: < 2.2e-16

With this test it is clear that the linear term is significant, but lets look at residual data to see if a polynomial term would help.

plot(mod)

The residuals look pretty linear, but let’s try a quadratic term just for fun.

mod2 <- lm(pass ~ year + I(year^2))
plot(mod2)

This didn’t seem to change anything much so I think for the sake of simplicity it is best to stick with our linear model. Next we’ll look at temperature data over time.

data(aatemp)
attach(aatemp)
## The following object is masked from airpass:
## 
##     year
summary(aatemp)
##       year           temp      
##  Min.   :1854   Min.   :43.41  
##  1st Qu.:1910   1st Qu.:46.78  
##  Median :1939   Median :47.73  
##  Mean   :1940   Mean   :47.74  
##  3rd Qu.:1972   3rd Qu.:48.63  
##  Max.   :2000   Max.   :51.89
tempmod <- lm(temp ~ year)
summary(tempmod)
## 
## Call:
## lm(formula = temp ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9843 -0.9113 -0.0820  0.9946  3.5343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 24.005510   7.310781   3.284  0.00136 **
## year         0.012237   0.003768   3.247  0.00153 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.07727 
## F-statistic: 10.55 on 1 and 113 DF,  p-value: 0.001533

After a summary of the linear model, our linear term is significant so let’s look at the residuals.

plot(tempmod)

The residuals look like they have a cubic relationship, so let’s make a cubic and squared term and see if it is significant.

tempmod3 <- lm(temp ~ year + I(year^2) + I(year^3))
summary(tempmod3)
## 
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8557 -0.9646 -0.1552  1.0485  4.1538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.959e+04  1.734e+04   2.283   0.0243 *
## year        -6.159e+01  2.694e+01  -2.286   0.0241 *
## I(year^2)    3.197e-02  1.395e-02   2.291   0.0238 *
## I(year^3)   -5.527e-06  2.407e-06  -2.296   0.0236 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared:  0.1298, Adjusted R-squared:  0.1063 
## F-statistic: 5.518 on 3 and 111 DF,  p-value: 0.001436

All the terms are significant so let’s examine the residuals to see if they’ve flattened out.

plot(tempmod3)

This residual plot looks great, it has homoscedasticity and no trend. Now we can use this model to make some predictions.

coef(tempmod3) %*% c(1, 2002, 2002^2, 2002^3)
##          [,1]
## [1,] 47.50484
coef(tempmod) %*% c(1, 2002)
##          [,1]
## [1,] 48.50451

The predictions of our cubic model were different than our linear model by about one degree. Now let’s try looking at birth rates over time.

data(divusa)
attach(divusa)
## The following object is masked from aatemp:
## 
##     year
## The following object is masked from airpass:
## 
##     year
head(divusa)
##   year divorce unemployed femlab marriage birth military
## 1 1920     8.0        5.2  22.70     92.0 117.9   3.2247
## 2 1921     7.2       11.7  22.79     83.0 119.8   3.5614
## 3 1922     6.6        6.7  22.88     79.7 111.2   2.4553
## 4 1923     7.1        2.4  22.97     85.2 110.5   2.2065
## 5 1924     7.2        5.0  23.06     80.3 110.9   2.2889
## 6 1925     7.2        3.2  23.15     79.2 106.6   2.1735
divmod <- lm(birth ~ year)
plot(divmod)

After looking at the residuals of the linear model it is obvious this model should be fitted with a polynomial, but it is uncertain how large that polynomial should be. So I’ll test higher and higher degree polynomials until I find a term that is insignificant.

divmod2 <- lm(birth ~ poly(year, 8))
summary(divmod2)
## 
## Call:
## lm(formula = birth ~ poly(year, 8))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8539 -1.8149  0.1146  1.5883 10.7377 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     88.8883     0.3592 247.435  < 2e-16 ***
## poly(year, 8)1 -98.0709     3.1523 -31.111  < 2e-16 ***
## poly(year, 8)2 -55.9709     3.1523 -17.756  < 2e-16 ***
## poly(year, 8)3 -16.2403     3.1523  -5.152 2.40e-06 ***
## poly(year, 8)4 104.1628     3.1523  33.043  < 2e-16 ***
## poly(year, 8)5 -23.5942     3.1523  -7.485 1.89e-10 ***
## poly(year, 8)6 -58.3056     3.1523 -18.496  < 2e-16 ***
## poly(year, 8)7  16.6946     3.1523   5.296 1.37e-06 ***
## poly(year, 8)8  13.5140     3.1523   4.287 5.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.152 on 68 degrees of freedom
## Multiple R-squared:  0.9767, Adjusted R-squared:  0.9739 
## F-statistic: 355.8 on 8 and 68 DF,  p-value: < 2.2e-16

We ended up with a degree 8 polynomial so let’s check the residuals and see how they look.

plot(divmod2)

Now the residuals are homoscedascous and show no signs of trending. Let’s plot this model against the actual data to get a visual of how well this polynomial model fits.

plot(birth ~ year)
lines(divmod2$fitted.values ~ year)

The model is obviously a great fit, showing that sometimes it pays to use high degree polynomials in our models. But there is one more thing we should test for, and that’s autocorrelation. If our residuals are correlated, we are violating one of the assumptions of regression. They could be positively correlated, or negatively correlated. Positive meaning that residuals often follow the sign of the previous residual, in a sequence that could resemble something like this: ++++++++———++++++++++. Negative autocorrelation means the residuals will flip flop very often in a sequence such as this: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-. To detect autocorrelation, we can perform a Durbin-Watson test, with the null hypothesis being that there is no autocorrelation.

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
dwtest(divmod2, alternative = "two.sided")
## 
##  Durbin-Watson test
## 
## data:  divmod2
## DW = 1.1621, p-value = 8.211e-07
## alternative hypothesis: true autocorrelation is not 0

This test was two sided, meaning it was looking for both positive and negative autocorrelation. And we can see from our very small p-value that there is autocorrelation, meaning we violated one of our regression assumptions.