A General Look at the Data

library(faraway)
## Warning: package 'faraway' was built under R version 3.3.3
data(airpass)
attach(airpass)

We’re looking at that same air passenger time series data. The topic of the day is seasonal variation and we’ll use the year part of the time as the trend and then the months as a seasonal component of the model.

with(airpass, plot(pass~year, type = "l"))

We see that the seasonal variation is NOT constant. The later years have much more variation than the earlier ones. We should transform the response to be logY.

with(airpass, plot(log(pass)~year, type = "l"))

Much better! It’s important to transform the response until you get something that resembles constant variation.

Creating a Seasonal Variation Component

Instead of the typical \({y_t} = TR_t + \epsilon_t\) we’ll add a \(SN_t\) in place of some dummy variable categorical things.

Maybe we should include the month to try to get a more accurate representation of the patterns.

yeartime <- floor(airpass$year)
monthtimed <- airpass$year - yeartime

^These variables will give us a value for the year and then convert the month into a decimal. Now we need to make sure that the month is a categorical factor so we’ll use the factor function.

monthfact <- factor(round(monthtimed * 12))

We can check our work with cbind(airpass$year, monthfact)

Now our new factor needs names so we know what they are. We have to type out the names of the months and assign them to levels

levels(monthfact) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")  

Here we want to check again that our work happened like we wanted it to.

head(monthfact)
## [1] Feb Mar Apr May Jun Jul
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
head(airpass$year)
## [1] 49.08333 49.16667 49.25000 49.33333 49.41667 49.50000
tail(monthfact)
## [1] Aug Sep Oct Nov Dec Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
tail(airpass$year)
## [1] 60.58333 60.66667 60.75000 60.83333 60.91667 61.00000
randomsamp <- sample(1:nrow(airpass) , 5 , replace = FALSE)
monthfact[randomsamp]
## [1] Dec Aug Dec Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
airpass[randomsamp,"year"]
## [1] 52.91667 53.58333 55.91667 57.00000 59.16667

Now we can add these new variables into the airpass data so we can see them next to each other.

airpass$yeartime <- yeartime
airpass$monthfact <- monthfact

Here we create a model that uses both the year part of the date as a trend factor and the month part of the date to represent thes seasonal variatino factors.

amodel<-lm(log(pass)~ 0+yeartime+monthfact)
coef(amodel)
##     yeartime monthfactJan monthfactFeb monthfactMar monthfactApr 
##    0.1208257   -1.2149979   -1.1836080   -1.1955940   -1.0552981 
## monthfactMay monthfactJun monthfactJul monthfactAug monthfactSep 
##   -1.0764982   -1.0688020   -0.9365870   -0.8225759   -0.8218019 
## monthfactOct monthfactNov monthfactDec 
##   -0.9563677   -1.0844571   -1.2181060
summary(amodel)
## 
## Call:
## lm(formula = log(pass) ~ 0 + yeartime + monthfact)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.156370 -0.041016  0.003677  0.044069  0.132324 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## yeartime      0.120826   0.001432   84.40   <2e-16 ***
## monthfactJan -1.214998   0.081277  -14.95   <2e-16 ***
## monthfactFeb -1.183608   0.079878  -14.82   <2e-16 ***
## monthfactMar -1.195594   0.079878  -14.97   <2e-16 ***
## monthfactApr -1.055298   0.079878  -13.21   <2e-16 ***
## monthfactMay -1.076498   0.079878  -13.48   <2e-16 ***
## monthfactJun -1.068802   0.079878  -13.38   <2e-16 ***
## monthfactJul -0.936587   0.079878  -11.72   <2e-16 ***
## monthfactAug -0.822576   0.079878  -10.30   <2e-16 ***
## monthfactSep -0.821802   0.079878  -10.29   <2e-16 ***
## monthfactOct -0.956368   0.079878  -11.97   <2e-16 ***
## monthfactNov -1.084457   0.079878  -13.58   <2e-16 ***
## monthfactDec -1.218106   0.079878  -15.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 9.734e+04 on 13 and 131 DF,  p-value: < 2.2e-16
with(airpass, plot(log(pass)~ year, type="l" ))
lines(airpass$year, amodel$fitted.values,col="blue")

Note that these two lines are similar, this is an important concept to be able to code out because now we know how to add and create our own variables in a data set and use them to create a multiple linear regression model for time series data.

Additional Components

Another useful way to organize this data would be to split it into literal seasons based on our new monthly categories. All we have to do is adjust the levels so that 3 months go into each new season variable.

season<-airpass$monthfact
levels(season)=list(Winter=c("Dec","Jan","Feb"),
                    Spring=c("Mar","Apr","May"),
                    Summer=c("Jun", "Jul", "Aug"),
                    Fall=c("Sep", "Oct", "Nov"))
summary(season)
## Winter Spring Summer   Fall 
##     36     36     36     36

This makes perfect sense because we split all of the seasons into equal numbers of months. Now, let’s be sure that we get all of the data in one place.

airpass$season<-season
head(airpass)
##   pass     year yeartime monthfact season
## 1  112 49.08333       49       Feb Winter
## 2  118 49.16667       49       Mar Spring
## 3  132 49.25000       49       Apr Spring
## 4  129 49.33333       49       May Spring
## 5  121 49.41667       49       Jun Summer
## 6  135 49.50000       49       Jul Summer

Finally, we create a model that includes the year and season, and we can compare our two prediction models using lines of different colors to see which one fits the observed trends the closest.

modseason <- lm(log(pass) ~ yeartime + season, data=airpass)
with(airpass, plot(log(pass)~ year, type="l" ))
lines(airpass$year, amodel$fitted.values,col="blue")
lines(airpass$year, modseason$fitted.values,col="red")

I’d say the monthly one fit the data way better. It captures more of the seasonal variation. Here we can compare the two using a partial f-test.

anova(amodel,modseason)
## Analysis of Variance Table
## 
## Model 1: log(pass) ~ 0 + yeartime + monthfact
## Model 2: log(pass) ~ yeartime + season
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    131 0.46072                                  
## 2    139 1.38515 -8  -0.92443 32.857 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Peer Review

I got a lot of feedback on our group paper about how we organized the report. My partner pointed out that our introduction and methods sections had a lot of results in them. She suggested that we save the results for one complete section so that we aren’t giving away too much too soon. We also need to add more context by literature review so that our research question seems more like it’s solving a problem than just an assignment for a grade.