In this class we learned about trends over time. Some trends through time could be linear, some could be quadratic, some could be even higher polynomials or perhaps there is no trend at all. We did a lot of testing and graphing to determine which graphs fit best. This first data set has to do with airline prices over time.
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(airpass)
attach(airpass)
head(airpass)
## pass year
## 1 112 49.08333
## 2 118 49.16667
## 3 132 49.25000
## 4 129 49.33333
## 5 121 49.41667
## 6 135 49.50000
mod <- lm(pass ~ year)
summary(mod)
##
## Call:
## lm(formula = pass ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.858 -30.727 -5.757 24.489 164.999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1474.771 61.106 -24.14 <2e-16 ***
## year 31.886 1.108 28.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.06 on 142 degrees of freedom
## Multiple R-squared: 0.8536, Adjusted R-squared: 0.8526
## F-statistic: 828.2 on 1 and 142 DF, p-value: < 2.2e-16
With this test it is clear that the linear term is significant, but lets look at residual data to see if a polynomial term would help.
plot(mod)
The residuals look pretty linear, but let’s try a quadratic term just for fun.
mod2 <- lm(pass ~ year + I(year^2))
plot(mod2)
This didn’t seem to change anything much so I think for the sake of simplicity it is best to stick with our linear model. Next we’ll look at temperature data over time.
data(aatemp)
attach(aatemp)
## The following object is masked from airpass:
##
## year
summary(aatemp)
## year temp
## Min. :1854 Min. :43.41
## 1st Qu.:1910 1st Qu.:46.78
## Median :1939 Median :47.73
## Mean :1940 Mean :47.74
## 3rd Qu.:1972 3rd Qu.:48.63
## Max. :2000 Max. :51.89
tempmod <- lm(temp ~ year)
summary(tempmod)
##
## Call:
## lm(formula = temp ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9843 -0.9113 -0.0820 0.9946 3.5343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.005510 7.310781 3.284 0.00136 **
## year 0.012237 0.003768 3.247 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.466 on 113 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
## F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
After a summary of the linear model, our linear term is significant so let’s look at the residuals.
plot(tempmod)
The residuals look like they have a cubic relationship, so let’s make a cubic and squared term and see if it is significant.
tempmod3 <- lm(temp ~ year + I(year^2) + I(year^3))
summary(tempmod3)
##
## Call:
## lm(formula = temp ~ year + I(year^2) + I(year^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8557 -0.9646 -0.1552 1.0485 4.1538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.959e+04 1.734e+04 2.283 0.0243 *
## year -6.159e+01 2.694e+01 -2.286 0.0241 *
## I(year^2) 3.197e-02 1.395e-02 2.291 0.0238 *
## I(year^3) -5.527e-06 2.407e-06 -2.296 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.443 on 111 degrees of freedom
## Multiple R-squared: 0.1298, Adjusted R-squared: 0.1063
## F-statistic: 5.518 on 3 and 111 DF, p-value: 0.001436
All the terms are significant so let’s examine the residuals to see if they’ve flattened out.
plot(tempmod3)
This residual plot looks great, it has homoscedasticity and no trend. Now we can use this model to make some predictions.
coef(tempmod3) %*% c(1, 2002, 2002^2, 2002^3)
## [,1]
## [1,] 47.50484
coef(tempmod) %*% c(1, 2002)
## [,1]
## [1,] 48.50451
The predictions of our cubic model were different than our linear model by about one degree. Now let’s try looking at birth rates over time.
data(divusa)
attach(divusa)
## The following object is masked from aatemp:
##
## year
## The following object is masked from airpass:
##
## year
head(divusa)
## year divorce unemployed femlab marriage birth military
## 1 1920 8.0 5.2 22.70 92.0 117.9 3.2247
## 2 1921 7.2 11.7 22.79 83.0 119.8 3.5614
## 3 1922 6.6 6.7 22.88 79.7 111.2 2.4553
## 4 1923 7.1 2.4 22.97 85.2 110.5 2.2065
## 5 1924 7.2 5.0 23.06 80.3 110.9 2.2889
## 6 1925 7.2 3.2 23.15 79.2 106.6 2.1735
divmod <- lm(birth ~ year)
plot(divmod)
After looking at the residuals of the linear model it is obvious this model should be fitted with a polynomial, but it is uncertain how large that polynomial should be. So I’ll test higher and higher degree polynomials until I find a term that is insignificant.
divmod2 <- lm(birth ~ poly(year, 8))
summary(divmod2)
##
## Call:
## lm(formula = birth ~ poly(year, 8))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8539 -1.8149 0.1146 1.5883 10.7377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.8883 0.3592 247.435 < 2e-16 ***
## poly(year, 8)1 -98.0709 3.1523 -31.111 < 2e-16 ***
## poly(year, 8)2 -55.9709 3.1523 -17.756 < 2e-16 ***
## poly(year, 8)3 -16.2403 3.1523 -5.152 2.40e-06 ***
## poly(year, 8)4 104.1628 3.1523 33.043 < 2e-16 ***
## poly(year, 8)5 -23.5942 3.1523 -7.485 1.89e-10 ***
## poly(year, 8)6 -58.3056 3.1523 -18.496 < 2e-16 ***
## poly(year, 8)7 16.6946 3.1523 5.296 1.37e-06 ***
## poly(year, 8)8 13.5140 3.1523 4.287 5.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.152 on 68 degrees of freedom
## Multiple R-squared: 0.9767, Adjusted R-squared: 0.9739
## F-statistic: 355.8 on 8 and 68 DF, p-value: < 2.2e-16
We ended up with a degree 8 polynomial so let’s check the residuals and see how they look.
plot(divmod2)
Now the residuals are homoscedascous and show no signs of trending. Let’s plot this model against the actual data to get a visual of how well this polynomial model fits.
plot(birth ~ year)
lines(divmod2$fitted.values ~ year)
The model is obviously a great fit, showing that sometimes it pays to use high degree polynomials in our models. But there is one more thing we should test for, and that’s autocorrelation. If our residuals are correlated, we are violating one of the assumptions of regression. They could be positively correlated, or negatively correlated. Positive meaning that residuals often follow the sign of the previous residual, in a sequence that could resemble something like this: ++++++++———++++++++++. Negative autocorrelation means the residuals will flip flop very often in a sequence such as this: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-. To detect autocorrelation, we can perform a Durbin-Watson test, with the null hypothesis being that there is no autocorrelation.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.4
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
dwtest(divmod2, alternative = "two.sided")
##
## Durbin-Watson test
##
## data: divmod2
## DW = 1.1621, p-value = 8.211e-07
## alternative hypothesis: true autocorrelation is not 0
This test was two sided, meaning it was looking for both positive and negative autocorrelation. And we can see from our very small p-value that there is autocorrelation, meaning we violated one of our regression assumptions.