library(faraway)
data(airpass)
attach(airpass)
head(airpass)
##   pass     year
## 1  112 49.08333
## 2  118 49.16667
## 3  132 49.25000
## 4  129 49.33333
## 5  121 49.41667
## 6  135 49.50000
We begin by plotting the data to get a general idea of what kind of model would be the best fit:
plot(pass ~ year, type = "l")
The plot shows a positive correlation between year and the number of passengers. However, it is difficult to determine from the plot alone whether the relationship is linear.
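One way to judge the shape of the trend is to overlay a smoothed curve on the raw series. The sketch below uses base R's `lowess()` for this; the object name `smooth_fit` and the choice of smoother are illustrative assumptions, not part of the original analysis.

```r
library(faraway)
data(airpass)

# Plot the raw monthly series as a line
plot(pass ~ year, data = airpass, type = "l")

# Overlay a lowess smoother to make the underlying trend easier to see.
# smooth_fit is a hypothetical name used only for this illustration.
smooth_fit <- lowess(airpass$year, airpass$pass)
lines(smooth_fit, col = "red", lwd = 2)
```

If the red curve bends noticeably rather than following a straight line, that suggests a purely linear trend may be too simple.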
Next, we create a linear model:
mod <- lm(pass ~ year)
summary(mod)
##
## Call:
## lm(formula = pass ~ year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -93.858 -30.727  -5.757  24.489 164.999
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1474.771     61.106  -24.14   <2e-16 ***
## year           31.886      1.108   28.78   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.06 on 142 degrees of freedom
## Multiple R-squared: 0.8536, Adjusted R-squared: 0.8526
## F-statistic: 828.2 on 1 and 142 DF, p-value: < 2.2e-16
The p-value for year is very small, so we can conclude that there is a significant linear relationship between year and the number of passengers.
We can look at the residual plot to see if a linear model is the best possible model:
plot(mod)
Since the residuals are not evenly dispersed around zero, it seems that a linear model is not the best fit for this data.
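To look at the residual spread directly, the residuals can be plotted against the fitted values with a reference line at zero. This is a minimal sketch that refits the same linear model with an explicit `data` argument instead of relying on `attach()`; `plot(mod, which = 1)` should give a similar built-in view.

```r
library(faraway)
data(airpass)

# Same linear model as above, fitted with an explicit data argument
mod <- lm(pass ~ year, data = airpass)

# Residuals against fitted values; an adequate model should show
# points scattered evenly around the dashed zero line
plot(fitted(mod), resid(mod),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A curved pattern or a spread that changes across the fitted values indicates structure that the straight-line fit is missing.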
We can look at a quadratic model instead of a linear model to see if that will fit the relationship better.
mod_quad <- lm(pass ~ year + I(year^2))
summary(mod_quad)
##
## Call:
## lm(formula = pass ~ year + I(year^2))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -100.353  -27.339   -7.442   21.603  146.116
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1570.5174  1053.9017   1.490  0.13841
## year         -79.2078    38.4007  -2.063  0.04098 *
## I(year^2)      1.0092     0.3487   2.894  0.00441 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.91 on 141 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.8599
## F-statistic: 439.8 on 2 and 141 DF, p-value: < 2.2e-16
The summary tells us that the quadratic term is significant, so this model is better than the linear model. The quadratic term captures curvature in the data that the linear model could not fit.
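Because the linear model is nested inside the quadratic one, the two can also be compared with an F-test using `anova()`. The sketch below refits both models with an explicit `data` argument; the name `cmp` is an illustrative assumption. With a single added term, the F-test should agree with the t-test on `I(year^2)` reported in the summary.

```r
library(faraway)
data(airpass)

# Refit both models with an explicit data argument
mod      <- lm(pass ~ year, data = airpass)
mod_quad <- lm(pass ~ year + I(year^2), data = airpass)

# F-test for the nested comparison: does adding year^2 reduce the
# residual sum of squares by more than chance would explain?
cmp <- anova(mod, mod_quad)
print(cmp)
```

A small p-value in the second row supports keeping the quadratic term.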
We look at the residual plot to see if it looks better than the previous one.
plot(mod_quad)
By looking at the residual plot and the summary, we can confirm that the quadratic model fits the data better than the linear model.
We can use the quadratic model to forecast the number of passengers in 1962, with the linear model's forecast computed first for comparison:
coef(mod) %*% c(1, 62)
##          [,1]
## [1,] 502.1735
coef(mod_quad) %*% c(1, 62, 62^2)
##          [,1]
## [1,] 538.9268
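The same forecasts can be obtained with `predict()`, which avoids building the coefficient vector by hand and can also return a prediction interval. This is a sketch that refits the models with an explicit `data` argument; `new_year` is a hypothetical name for the data frame holding the forecast point.

```r
library(faraway)
data(airpass)

# Refit both models with an explicit data argument
mod      <- lm(pass ~ year, data = airpass)
mod_quad <- lm(pass ~ year + I(year^2), data = airpass)

# Forecast point: year = 62 on the scale used by the data set
new_year <- data.frame(year = 62)

# Point forecasts; these should match the matrix products above
pred_lin  <- predict(mod, newdata = new_year)
pred_quad <- predict(mod_quad, newdata = new_year)
print(c(linear = unname(pred_lin), quadratic = unname(pred_quad)))

# Quadratic forecast with a 95% prediction interval (fit, lwr, upr)
print(predict(mod_quad, newdata = new_year, interval = "prediction"))
```

Since 62 lies beyond the observed range of `year`, both forecasts are extrapolations, and the prediction interval gives a sense of how uncertain they are.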