The aatemp data come from the U.S. Historical Climatology Network. They are the annual mean temperatures (in degrees F) in Ann Arbor, Michigan going back about 150 years.
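If the data are not already in the session, they can be loaded first (a sketch; assuming aatemp ships with the faraway package, which accompanies the textbook):
library(faraway)
data(aatemp)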
attach(aatemp)
The following objects are masked from aatemp (pos = 3):
temp, year
lmod <- lm(temp ~ year)
summary(lmod)
Call:
lm(formula = temp ~ year)
Residuals:
Min 1Q Median 3Q Max
-3.9843 -0.9113 -0.0820 0.9946 3.5343
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.005510 7.310781 3.284 0.00136 **
year 0.012237 0.003768 3.247 0.00153 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.466 on 113 degrees of freedom
Multiple R-squared: 0.08536, Adjusted R-squared: 0.07727
F-statistic: 10.55 on 1 and 113 DF, p-value: 0.001533
par(mfrow=c(2,2))  # show all four diagnostic plots in one panel
plot(lmod)
In the first diagnostic plot (Residuals vs. Fitted) above, the residuals show no systematic curvature and a roughly constant spread, so the linear trend looks adequate and the errors appear homoscedastic.
In the Normal Q-Q plot, normal residuals should follow the reference line approximately; they do here, so the residuals look approximately normal.
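To complement these visual checks, formal tests can back up both impressions (a sketch; assumes the lmtest package is installed):
library(lmtest)
bptest(lmod)                   # Breusch-Pagan test for heteroscedasticity
shapiro.test(residuals(lmod))  # Shapiro-Wilk test for normality of the residuals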
par(mfrow=c(1,1))
plot(temp~year)
abline(coefficients(lmod))
Ans: The trend is linear, with temperature rising by about 0.012 °F per year.
Since step() and stepAIC() have no built-in way to reduce the degree of a polynomial term, backward elimination cannot be applied to the degree directly. Instead, I fit polynomial models of decreasing degree and use the anova() function to compare all of the nested fits collectively, as in the code below.
library(MASS)
lmod_poly_10 <- lm(temp ~ poly(year,10), data=aatemp)
lmod_poly_9 <- lm(temp ~ poly(year,9), data=aatemp)
lmod_poly_8 <- lm(temp ~ poly(year,8), data=aatemp)
lmod_poly_7 <- lm(temp ~ poly(year,7), data=aatemp)
lmod_poly_6 <- lm(temp ~ poly(year,6), data=aatemp)
lmod_poly_5 <- lm(temp ~ poly(year,5), data=aatemp)
lmod_poly_4 <- lm(temp ~ poly(year,4), data=aatemp)
lmod_poly_3 <- lm(temp ~ poly(year,3), data=aatemp)
lmod_poly_2 <- lm(temp ~ poly(year,2), data=aatemp)
lmod_poly_1 <- lm(temp ~ poly(year,1), data=aatemp)
print(anova(lmod_poly_10,lmod_poly_9,lmod_poly_8,lmod_poly_7,lmod_poly_6, lmod_poly_5, lmod_poly_4, lmod_poly_3, lmod_poly_2, lmod_poly_1))
Analysis of Variance Table
Model 1: temp ~ poly(year, 10)
Model 2: temp ~ poly(year, 9)
Model 3: temp ~ poly(year, 8)
Model 4: temp ~ poly(year, 7)
Model 5: temp ~ poly(year, 6)
Model 6: temp ~ poly(year, 5)
Model 7: temp ~ poly(year, 4)
Model 8: temp ~ poly(year, 3)
Model 9: temp ~ poly(year, 2)
Model 10: temp ~ poly(year, 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 104 208.11
2 105 208.24 -1 -0.1207 0.0603 0.80652
3 106 210.19 -1 -1.9583 0.9786 0.32483
4 107 211.41 -1 -1.2124 0.6059 0.43812
5 108 212.28 -1 -0.8784 0.4390 0.50908
6 109 213.75 -1 -1.4700 0.7346 0.39337
7 110 225.19 -1 -11.4404 5.7171 0.01860 *
8 111 231.14 -1 -5.9455 2.9711 0.08774 .
9 112 242.12 -1 -10.9776 5.4858 0.02108 *
10 113 242.94 -1 -0.8228 0.4112 0.52277
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
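As a more compact alternative to the ten explicit fits above, the models can be built in a loop and compared in a single call (a sketch; produces the same sequential comparisons):
models <- lapply(10:1, function(d)
  lm(as.formula(paste0("temp ~ poly(year, ", d, ")")), data = aatemp))
do.call(anova, models)  # same table as above, degree 10 down to degree 1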
Since the p-values for Model 7 and Model 9 are significant, I single out poly(year, 4) and poly(year, 2) and compare their remaining fit metrics to decide which one to use.
summary(lmod_poly_4)
Call:
lm(formula = temp ~ poly(year, 4), data = aatemp)
Residuals:
Min 1Q Median 3Q Max
-4.0085 -0.9618 -0.0913 0.9926 3.7370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.7426 0.1334 357.827 < 2e-16 ***
poly(year, 4)1 4.7616 1.4308 3.328 0.00119 **
poly(year, 4)2 -0.9071 1.4308 -0.634 0.52741
poly(year, 4)3 -3.3132 1.4308 -2.316 0.02243 *
poly(year, 4)4 2.4383 1.4308 1.704 0.09117 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.431 on 110 degrees of freedom
Multiple R-squared: 0.1522, Adjusted R-squared: 0.1213
F-statistic: 4.936 on 4 and 110 DF, p-value: 0.001068
step(object = lmod_poly_4, direction = "backward")
Start: AIC=87.28
temp ~ poly(year, 4)
Df Sum of Sq RSS AIC
<none> 225.19 87.284
- poly(year, 4) 4 40.418 265.61 98.267
Call:
lm(formula = temp ~ poly(year, 4), data = aatemp)
Coefficients:
(Intercept) poly(year, 4)1 poly(year, 4)2 poly(year, 4)3 poly(year, 4)4
47.7426 4.7616 -0.9071 -3.3132 2.4383
summary(lmod_poly_2)
Call:
lm(formula = temp ~ poly(year, 2), data = aatemp)
Residuals:
Min 1Q Median 3Q Max
-4.0412 -0.9538 -0.0624 0.9959 3.5820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.7426 0.1371 348.218 < 2e-16 ***
poly(year, 2)1 4.7616 1.4703 3.239 0.00158 **
poly(year, 2)2 -0.9071 1.4703 -0.617 0.53851
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.47 on 112 degrees of freedom
Multiple R-squared: 0.08846, Adjusted R-squared: 0.07218
F-statistic: 5.434 on 2 and 112 DF, p-value: 0.005591
step(object = lmod_poly_2, direction = "backward")
Start: AIC=91.62
temp ~ poly(year, 2)
Df Sum of Sq RSS AIC
<none> 242.12 91.616
- poly(year, 2) 2 23.495 265.61 98.267
Call:
lm(formula = temp ~ poly(year, 2), data = aatemp)
Coefficients:
(Intercept) poly(year, 2)1 poly(year, 2)2
47.7426 4.7616 -0.9071
As the degree-4 polynomial has the lower AIC (87.28 vs. 91.62) and the higher adjusted R-squared (12.13% vs. 7.22%), I go with the degree-4 polynomial.
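The AIC values reported by step() can also be read off directly with extractAIC(), which uses the same criterion (a sketch; note that stats::AIC() would differ by an additive constant):
extractAIC(lmod_poly_4)  # edf = 5, AIC = 87.28, matching step() above
extractAIC(lmod_poly_2)  # edf = 3, AIC = 91.62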
Here we plot the data along with the fitted degree-4 model.
plot(temp~year)
points(year, fitted(lmod_poly_4),col="orange", pch=20)
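For a smoother rendering than the fitted points, the model can be evaluated on a fine grid of years (a sketch; year.seq is just a helper name here):
year.seq <- seq(min(year), max(year), length.out = 200)
lines(year.seq, predict(lmod_poly_4, data.frame(year = year.seq)),
      col = "orange", lwd = 2)  # fitted degree-4 curve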
Let’s predict the temperature in 2020.
new_data <- data.frame(year = 2020)  # new_data must be defined before predicting
predict(lmod_poly_4, new_data)
1
49.559
Ans: The predicted mean annual temperature for 2020 is 49.559 °F. Since 2020 lies beyond the observed years, this is an extrapolation and should be treated with caution.
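A point estimate alone hides the uncertainty of this extrapolation; a 95% prediction interval makes it explicit (a sketch):
predict(lmod_poly_4, newdata = data.frame(year = 2020),
        interval = "prediction", level = 0.95)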
Inserting a vertical line at year = 1930.
plot(temp~year,aatemp,xlab="Year",ylab="Temperature", main="Annual Temperature distribution in Ann Arbor, MI")
abline(v=1930, lty=6)
# Rows 1:49 of aatemp correspond to years up to 1930
constant <- function(x) ifelse(x <= 1930, mean(aatemp$temp[1:49]), 0)
linear_mod <- function(x) ifelse(x > 1930, x - 1930, 0)
model <- lm(temp ~ constant(year) + linear_mod(year), aatemp)
x <- seq(1854, 2000, by = 1)
additive_model <- model$coef[1] + model$coef[2]*constant(x) + model$coef[3]*linear_mod(x)
lines(x, additive_model, lty = 2)
summary(model)
Call:
lm(formula = temp ~ constant(year) + linear_mod(year), data = aatemp)
Residuals:
Min 1Q Median 3Q Max
-3.5867 -0.9456 -0.0979 1.0233 3.9925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.748738 0.343152 142.062 < 2e-16 ***
constant(year) -0.037279 0.008423 -4.426 2.24e-05 ***
linear_mod(year) -0.012519 0.008248 -1.518 0.132
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.381 on 112 degrees of freedom
Multiple R-squared: 0.1954, Adjusted R-squared: 0.181
F-statistic: 13.6 on 2 and 112 DF, p-value: 5.167e-06
Ans: From the summary of the additive model, the claim does look warranted: the constant(year) component of the model is statistically significant.
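An alternative way to encode the "constant until 1930, then linear rise" claim is a single hinge term that is zero before 1930 and grows linearly afterwards (a sketch; rhs() is a hypothetical helper name):
rhs <- function(x) pmax(x - 1930, 0)  # 0 up to 1930, then year - 1930
broken <- lm(temp ~ rhs(year), data = aatemp)
summary(broken)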
Choosing knots, given the goal of fitting a cubic spline with six basis functions.
require(splines)
# Default knot locations: the 25/50/75% quantiles of year
attr(bs(year, df = 6), "knots")
25% 50% 75%
1910.5 1939.0 1971.5
dim(bs(year, knots = c(1910.5, 1939.0, 1971.5)))
[1] 115 6
bs_temp <- bs(year, knots = c(1910.5, 1939.0, 1971.5))
matplot(year, bs_temp, type = "l", col = 1, main = "Visualizing basis functions")
This confirms that the B-spline basis returned by bs() has dimension 6. Recall that a cubic spline with three knots has seven degrees of freedom; these are used up by an intercept plus six basis functions. We could equivalently use the df option to place the knots at uniform quantiles of the data. Since the requirement is a cubic spline, we keep the default degree of 3.
dim(bs(year, df = 6))
[1] 115 6
cs_fit <- lm(temp ~ bs(year, df = 6), data = aatemp)
yearlims <- range(year)
year.grid <- seq(from = yearlims[1], to = yearlims[2])
pred <- predict(cs_fit, newdata = list(year = year.grid), se = TRUE)
plot(year, temp, col = "gray")
lines(year.grid, pred$fit, lwd = 4)
lines(year.grid, pred$fit + 2*pred$se.fit, lty = "dashed")
lines(year.grid, pred$fit - 2*pred$se.fit, lty = "dashed")
summary(cs_fit)
Call:
lm(formula = temp ~ bs(year, df = 6), data = aatemp)
Residuals:
Min 1Q Median 3Q Max
-3.6561 -0.9270 -0.1545 0.9551 3.2158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.8876 1.0083 47.493 <2e-16 ***
bs(year, df = 6)1 -0.2200 1.8374 -0.120 0.9049
bs(year, df = 6)2 -2.3188 1.2635 -1.835 0.0692 .
bs(year, df = 6)3 0.9948 1.2497 0.796 0.4278
bs(year, df = 6)4 0.8136 1.2814 0.635 0.5268
bs(year, df = 6)5 -1.1798 1.3381 -0.882 0.3799
bs(year, df = 6)6 1.3563 1.2623 1.074 0.2850
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.401 on 108 degrees of freedom
Multiple R-squared: 0.2017, Adjusted R-squared: 0.1573
F-statistic: 4.547 on 6 and 108 DF, p-value: 0.0003774
Ans: Yes, the cubic spline fits somewhat better than the straight-line model: its adjusted R-squared of 15.73% is about double the 7.73% of the linear fit, though neither model explains much of the year-to-year variation.
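To put this comparison on a formal footing, the two fits can be tested against each other, since a straight line lies in the span of the cubic-spline basis (a sketch):
anova(lmod, cs_fit)  # F-test of the straight line against the spline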