Seasonal variation is the magnitude of swing constant through time. All of the models we talk about in this class are going to have constant seasonal variation. To ensure this, if we encounter nonconstant seasonal variation, we’re going to transform the data to make it constant.
require(faraway)
## Loading required package: faraway
## Warning: package 'faraway' was built under R version 3.3.3
data(airpass)
plot(pass ~ year, airpass, type = "l")
In the plot above, the seasonal variation is increasing. We don’t want this, so let’s look at ways to transform the variable “pass” and make the seasonal variance constant
par(mfrow = c(2,2))
plot((pass^0.75) ~ year, airpass, type = "l")
plot((pass^0.5) ~ year, airpass, type = "l")
plot((pass^0.25) ~ year, airpass, type = "l")
plot(log(pass) ~ year, airpass, type = "l")
par(mfrow = c(1, 2))
plot(pass ~ year, airpass, type = "l")
plot(log(pass) ~ year, airpass, type = "l")
It looks like taking the natural log of the pass variable allows us to work with constant seasonal variance. Yay! Note that really small lambdas also look more constant, but we may as well go with using the log transformation, since they look pretty similar and natural log is easier to interpret.
Let’s say we’re working with a model that is just the trend, the seasonal component, and the error component. We will have a dummy variable for each unit of time. This means that if we’re looking at 12 months, we’ll have 12 dummy’s, if we’re looking at 4 seasons (or quarters) we’ll have 4 dummy’s, etc.
Let’s use airpass data to creat an example again. First we must clean up the data a bit (borrowed from in-class activity):
# Need to get just the month
justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear #month as a decimal
mofactor <-factor(round(modecimal*12)) #month as a factor ie categorical
# Changes levels of factor to be mo names
levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")
Model time
model2 <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(model2)
## (Intercept) justyear mofactorFeb mofactorMar mofactorApr
## -1.214997885 0.120825657 0.031389870 0.019403852 0.159699778
## mofactorMay mofactorJun mofactorJul mofactorAug mofactorSep
## 0.138499729 0.146195892 0.278410898 0.392422029 0.393195995
## mofactorOct mofactorNov mofactorDec
## 0.258630198 0.130540762 -0.003108143
summary(model2)
##
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214998 0.081277 -14.949 < 2e-16 ***
## justyear 0.120826 0.001432 84.399 < 2e-16 ***
## mofactorFeb 0.031390 0.024253 1.294 0.198
## mofactorMar 0.019404 0.024253 0.800 0.425
## mofactorApr 0.159700 0.024253 6.585 1.00e-09 ***
## mofactorMay 0.138500 0.024253 5.711 7.19e-08 ***
## mofactorJun 0.146196 0.024253 6.028 1.58e-08 ***
## mofactorJul 0.278411 0.024253 11.480 < 2e-16 ***
## mofactorAug 0.392422 0.024253 16.180 < 2e-16 ***
## mofactorSep 0.393196 0.024253 16.212 < 2e-16 ***
## mofactorOct 0.258630 0.024253 10.664 < 2e-16 ***
## mofactorNov 0.130541 0.024253 5.382 3.28e-07 ***
## mofactorDec -0.003108 0.024253 -0.128 0.898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.982
## F-statistic: 649.4 on 12 and 131 DF, p-value: < 2.2e-16
Note that all the dummy variables are specifically in comparison to January. This means that the difference between February and January, March and January, and December and January are not significantly different (statistically), but the other months are.
Let’s look at seasons instead of months:
levels(mofactor) <- list(Winter = c("Dec", "Jan", "Feb"), Spring = c("Mar", "Apr", "May"), Summer = c("Jun", "Jul", "Aug"), Fall = c("Sep", "Oct", "Nov"))
model3 <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(model3)
## (Intercept) justyear mofactorSpring mofactorSummer mofactorFall
## -1.20197254 0.12076004 0.09641867 0.26289382 0.25133987
summary(model3)
##
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27061 -0.07119 0.00114 0.06893 0.21364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.201973 0.132877 -9.046 1.17e-15 ***
## justyear 0.120760 0.002404 50.229 < 2e-16 ***
## mofactorSpring 0.096419 0.023543 4.095 7.12e-05 ***
## mofactorSummer 0.262894 0.023543 11.167 < 2e-16 ***
## mofactorFall 0.251340 0.023543 10.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09983 on 139 degrees of freedom
## Multiple R-squared: 0.9503, Adjusted R-squared: 0.9489
## F-statistic: 664.4 on 4 and 139 DF, p-value: < 2.2e-16
plot(model3)
Not totally perfect, but now we’ve got a pretty good model with mostly constant seasonal variance and 4 significant dummy variables related to season! But let’s actually check and see if this seasonal model is better than the monthly model:
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: log(pass) ~ justyear + mofactor
## Model 2: log(pass) ~ justyear + mofactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 131 0.46072
## 2 139 1.38515 -8 -0.92443 32.857 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a statistically significant difference between the two models, meaning we should actually just go with the 12-month model (model2) unless we have a really good reason to examine seasons (we don’t).