Constant (or Additive) Seasonal Variation: magnitude of swing is contant over time - good
Increasing Seasonal Variation: magnitude of swing is changing over time - bad so find a transformation so that you can have constant seasonal variation
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(airpass)
attach(airpass)
We will look at the number of passengers over time. We want the swings in the line to be relatively equal.
Type = “l” makes the line rather than just the points on the graph.
plot(pass ~ year, type = "l")
This is pretty obviously increasing seasonal variation. We need to use transformations to make it constant.
I tried log first.
plot(log(pass) ~ year, type = "l")
This actually makes it much better. The magnitude looks to be about the same throughout the entire graph.
A dummy variable is a yes or no variable or something binary. E.g. is it spring or not?
Assume the model has this general form:
\[{y_t} = {TR_t} + {SN_t} + {\epsilon_t}\]
y is the value of the time series at time t TR is the trend at time t SN is the seasonal factor at time t the epsilon is the error at time t
Using dummy variable to model seasonality: create a dummy varaible for each month (11) or season (3), transform if necessary to get constant varaition first
Let’s use dummy variables for the different seasons for the data above.
First we need to get only the year out of each data point.
justyear <- floor(airpass$year)
summary(justyear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.00 52.00 55.00 54.58 58.00 61.00
As you can see, these show only the year and not the month.
Now we can get the months in decimals.
modecimal <- airpass$year - justyear
summary(modecimal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2292 0.4583 0.4583 0.6875 0.9167
We can make each month a different category by simply factoring modecimal. We can also add labels to each of the levels or months.
mofactor <-factor(round(modecimal*12))
levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
summary(mofactor)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 12 12 12 12 12 12 12 12 12 12 12 12
Make sure you check using head(mofactor) and tail(mofactor) if the months are correct.
We can now look at the model.
mod <- lm(log(pass) ~ 0+justyear + mofactor, data=airpass)
coef(mod)
## justyear mofactorJan mofactorFeb mofactorMar mofactorApr mofactorMay
## 0.1208257 -1.2149979 -1.1836080 -1.1955940 -1.0552981 -1.0764982
## mofactorJun mofactorJul mofactorAug mofactorSep mofactorOct mofactorNov
## -1.0688020 -0.9365870 -0.8225759 -0.8218019 -0.9563677 -1.0844571
## mofactorDec
## -1.2181060
summary(mod)
##
## Call:
## lm(formula = log(pass) ~ 0 + justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## justyear 0.120826 0.001432 84.40 <2e-16 ***
## mofactorJan -1.214998 0.081277 -14.95 <2e-16 ***
## mofactorFeb -1.183608 0.079878 -14.82 <2e-16 ***
## mofactorMar -1.195594 0.079878 -14.97 <2e-16 ***
## mofactorApr -1.055298 0.079878 -13.21 <2e-16 ***
## mofactorMay -1.076498 0.079878 -13.48 <2e-16 ***
## mofactorJun -1.068802 0.079878 -13.38 <2e-16 ***
## mofactorJul -0.936587 0.079878 -11.72 <2e-16 ***
## mofactorAug -0.822576 0.079878 -10.30 <2e-16 ***
## mofactorSep -0.821802 0.079878 -10.29 <2e-16 ***
## mofactorOct -0.956368 0.079878 -11.97 <2e-16 ***
## mofactorNov -1.084457 0.079878 -13.58 <2e-16 ***
## mofactorDec -1.218106 0.079878 -15.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 9.734e+04 on 13 and 131 DF, p-value: < 2.2e-16
Now, we can uses seasons instead of months.
I chose to use Dec, Jan and Feb as winter, Mar, Apr and May as spring, etc. You can make the seasons whatever you want.
season <- mofactor
levels(season) <- list(winter=c("Dec", "Jan", "Feb"), spring=c("Mar", "Apr", "May"), summer=c("Jun", "Jul", "Aug"), fall=c("Sep", "Oct", "Nov"))
summary(season)
## winter spring summer fall
## 36 36 36 36
We can look at the new model now.
mod2 <- lm(log(pass) ~ 0+justyear + season, data=airpass)
coef(mod2)
## justyear seasonwinter seasonspring seasonsummer seasonfall
## 0.1207600 -1.2019725 -1.1055539 -0.9390787 -0.9506327
summary(mod2)
##
## Call:
## lm(formula = log(pass) ~ 0 + justyear + season, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27061 -0.07119 0.00114 0.06893 0.21364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## justyear 0.120760 0.002404 50.229 < 2e-16 ***
## seasonwinter -1.201973 0.132877 -9.046 1.17e-15 ***
## seasonspring -1.105554 0.132082 -8.370 5.52e-14 ***
## seasonsummer -0.939079 0.132082 -7.110 5.58e-11 ***
## seasonfall -0.950633 0.132082 -7.197 3.50e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09983 on 139 degrees of freedom
## Multiple R-squared: 0.9997, Adjusted R-squared: 0.9997
## F-statistic: 8.93e+04 on 5 and 139 DF, p-value: < 2.2e-16
We can also look at graph with both models on it. We can compare the two models that way.
with(airpass, plot(log(pass)~ year, type="l" ))
lines(airpass$year, mod$fitted.values,col="blue")
lines(airpass$year, mod2$fitted.values,col="red")
We can also compare the models with numbers as well. That might be easier to see which one is better.
anova(mod, mod2)
## Analysis of Variance Table
##
## Model 1: log(pass) ~ 0 + justyear + mofactor
## Model 2: log(pass) ~ 0 + justyear + season
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 131 0.46072
## 2 139 1.38515 -8 -0.92443 32.857 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests the monthly model is better. From the graph, you can also see that the blue line maps the data a little better than the red line.
I received a few great pieces of advice in the peer review session. Overall, our idea was good and our models and graphs we are going to use are good. Our explaination of pretty much everything needs to be elaborated on. Before this, I had been confused on what to add. I now can finish that up. We also completely forgot to add how we collected the data. That will be easy to add on. We also could take out a few models that we did since we showed each of the steps we got to the final model. I think we have a pretty good idea of what we need to do now.