In class, we began covering time series. We began by talking about what the different types of trend lines look like. The obvious linearly increasing and decreasing trends, and also the curved trends that can be fitted with a polynomial function.
#install.packages("faraway")
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(aatemp)
attach(aatemp)
summary(aatemp)
## year temp
## Min. :1854 Min. :43.41
## 1st Qu.:1910 1st Qu.:46.78
## Median :1939 Median :47.73
## Mean :1940 Mean :47.74
## 3rd Qu.:1972 3rd Qu.:48.63
## Max. :2000 Max. :51.89
plot(temp~year, type="l")
mod <- lm(temp~year)
summary(mod)
##
## Call:
## lm(formula = temp ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9843 -0.9113 -0.0820 0.9946 3.5343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.005510 7.310781 3.284 0.00136 **
## year 0.012237 0.003768 3.247 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
## F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
The t-test shows a low p-value after doing single linear regression, but based on the scatter plot we saw earlier, the trend does not seem to be linear.
plot(mod)
The residuals follow a curvature as well, lets add the quadratic term.
mod2 <- lm(temp~year+I(year^2))
summary(mod2)
##
## Call:
## lm(formula = temp ~ year + I(year^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0412 -0.9538 -0.0624 0.9959 3.5820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.127e+02 3.837e+02 -0.554 0.580
## year 2.567e-01 3.962e-01 0.648 0.518
## I(year^2) -6.307e-05 1.022e-04 -0.617 0.539
##
## Residual standard error: 1.47 on 112 degrees of freedom
## Multiple R-squared: 0.08846, Adjusted R-squared: 0.07218
## F-statistic: 5.434 on 2 and 112 DF, p-value: 0.005591
All of a sudden, nothing is significant. I know I am not imagining the curvature in the residuals. Lets add another polynomial term.
mod3 <- lm(temp~year+I(year^2)+I(year^3))
summary(mod3)
##
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8557 -0.9646 -0.1552 1.0485 4.1538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.959e+04 1.734e+04 2.283 0.0243 *
## year -6.159e+01 2.694e+01 -2.286 0.0241 *
## I(year^2) 3.197e-02 1.395e-02 2.291 0.0238 *
## I(year^3) -5.527e-06 2.407e-06 -2.296 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1063
## F-statistic: 5.518 on 3 and 111 DF, p-value: 0.001436
The t-tests show significance across the board. Lets check on the residuals!
plot(mod3)
Much better, notice how the curvature in the residual plot has been eliminated.
summary(aatemp)
## year temp
## Min. :1854 Min. :43.41
## 1st Qu.:1910 1st Qu.:46.78
## Median :1939 Median :47.73
## Mean :1940 Mean :47.74
## 3rd Qu.:1972 3rd Qu.:48.63
## Max. :2000 Max. :51.89
coef(mod)%*%c(1,2002)
## [,1]
## [1,] 48.50451
coef(mod2)%*%c(1,2002,2002^2)
## [,1]
## [1,] 48.32538
coef(mod3)%*%c(1,2002,2002^2,2002^3)
## [,1]
## [1,] 47.50484
Using each of the fitted models we can predict the temperature for the year 2002. If our best model was the 3 degree polynomial, the best prediction is 47.5 degrees.
Next we looked at data on births over time.
data(divusa)
attach(divusa)
## The following object is masked from aatemp:
##
## year
summary(divusa)
## year divorce unemployed femlab
## Min. :1920 Min. : 6.10 Min. : 1.200 Min. :22.70
## 1st Qu.:1939 1st Qu.: 8.70 1st Qu.: 4.200 1st Qu.:27.47
## Median :1958 Median :10.60 Median : 5.600 Median :37.10
## Mean :1958 Mean :13.27 Mean : 7.173 Mean :38.58
## 3rd Qu.:1977 3rd Qu.:20.30 3rd Qu.: 7.500 3rd Qu.:47.80
## Max. :1996 Max. :22.80 Max. :24.900 Max. :59.30
## marriage birth military
## Min. : 49.70 Min. : 65.30 Min. : 1.940
## 1st Qu.: 61.90 1st Qu.: 68.90 1st Qu.: 3.469
## Median : 74.10 Median : 85.90 Median : 9.102
## Mean : 72.97 Mean : 88.89 Mean :12.365
## 3rd Qu.: 80.00 3rd Qu.:107.30 3rd Qu.:14.266
## Max. :118.10 Max. :122.90 Max. :86.641
plot(birth~year,type="l")
This is definitely not a linear trend.
bmod <- lm(birth~year)
summary(bmod)
##
## Call:
## lm(formula = birth ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.159 -11.129 -4.086 11.232 33.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1073.45274 161.22254 6.658 4.03e-09 ***
## year -0.50284 0.08234 -6.107 4.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.06 on 75 degrees of freedom
## Multiple R-squared: 0.3321, Adjusted R-squared: 0.3232
## F-statistic: 37.3 on 1 and 75 DF, p-value: 4.143e-08
plot(bmod)
The process is repeated until the 8th degree polynomial, which actually yields a significant model.
mod8 <- lm(birth~poly(year,8))
summary(mod8)
##
## Call:
## lm(formula = birth ~ poly(year, 8))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8539 -1.8149 0.1146 1.5883 10.7377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.8883 0.3592 247.435 < 2e-16 ***
## poly(year, 8)1 -98.0709 3.1523 -31.111 < 2e-16 ***
## poly(year, 8)2 -55.9709 3.1523 -17.756 < 2e-16 ***
## poly(year, 8)3 -16.2403 3.1523 -5.152 2.40e-06 ***
## poly(year, 8)4 104.1628 3.1523 33.043 < 2e-16 ***
## poly(year, 8)5 -23.5942 3.1523 -7.485 1.89e-10 ***
## poly(year, 8)6 -58.3056 3.1523 -18.496 < 2e-16 ***
## poly(year, 8)7 16.6946 3.1523 5.296 1.37e-06 ***
## poly(year, 8)8 13.5140 3.1523 4.287 5.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.152 on 68 degrees of freedom
## Multiple R-squared: 0.9767, Adjusted R-squared: 0.9739
## F-statistic: 355.8 on 8 and 68 DF, p-value: < 2.2e-16
plot(mod8)
#install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(mod8, alternative="two.sided")
##
## Durbin-Watson test
##
## data: mod8
## DW = 1.1621, p-value = 8.211e-07
## alternative hypothesis: true autocorrelation is not 0
dwtest(mod8, alternative="less")
##
## Durbin-Watson test
##
## data: mod8
## DW = 1.1621, p-value = 1
## alternative hypothesis: true autocorrelation is less than 0
dwtest(mod8)
##
## Durbin-Watson test
##
## data: mod8
## DW = 1.1621, p-value = 4.106e-07
## alternative hypothesis: true autocorrelation is greater than 0
A Durbin Watson test reveals that there is auto correlation in the births over time data.
Auto correlation occurs when the next residual depends on the sign of the previous residual. Positive autocor is when a positive residual makes it more likely for the following residuals to also be positive. Positive autocor can also be when a negative residual makes it more likely for the following residuals to be negative. Negative autocor is when it alternates positive and negative signs. Both are bad because residuals should be independent.