Overview

On Thursday in class we discussed sections 6.3 and 6.4, which tackled seasonal variation and the ways to model seasonal variation. We used the airpass data set as an example. At the end of class we did peer reviews on our group projects.

6.3: Seasonal Variation

Seasonal variation occurs when the variation is different for different reoccuring time periods, like each year, season, month, day, etc. There is constant seasonal variation, in which the magnitude of the swing is constant through time. This is the kind of seasonal variation we want to see in our data.

There is also increasing seasonal variation, in which the magnitude of the swing increases over time. We do not like to see this in our data - statisticians do not like variation that isn’t constant. To attempt to remedy this issue, we can perform transformations on our variables, such as log or square root.

Let’s look at the airpass data to see if there is seasonal variation. If there is, we’ll decide if it’s constant or increasing.

library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(airpass)
attach(airpass)
plot(pass ~ year, type = "l")

We can see that the variance increases as the years get later, so there seems to be increasing seasonal variation in our data.

To try and correct this, I’ll perform a log transformation on the response, which is the number of passengers in thousands.

plot(log(pass) ~ year, type = "l")

This seems to have corrected the problem. The variance looks much more like constant seasonal variance, which is what we want to see!

6.4: Modeling Seasonal Variation

Times series with seasonal variation can be modeling with the following equation: \[y_t = TR_t + SN_t +\epsilon_t\] where \(TR_t\) is the trend and \(SN_t\) is the seasonal factor.

We can model seasonal variation using dummy variables and trig functions. We didn’t go into much detail about trig functions, so I will only show how to model using dummy variables.

By modeling seasonal variation, we can produce more accurate results than if we just ignored the seasonal component.

First, we need to separate the year variable so we have the month and the year as separate variables.

justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear

We can now turn each month from a decimal into an integer. 1 for Jan, 2 for Feb, etc. The factor command sets the month as a variable.

mofactor <-factor(round(modecimal*12))

Instead of numbers, let’s have the month names as the entries.

levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May", 
                      "Jun", "Jul", "Aug", "Sep", "Oct",
                      "Nov", "Dec")  

summary(mofactor) 
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##  12  12  12  12  12  12  12  12  12  12  12  12

We can see that there are twelve entries for each month.

Let’s check our work again.

#beginning few
head(mofactor)      
## [1] Feb Mar Apr May Jun Jul
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
head(airpass$year)
## [1] 49.08333 49.16667 49.25000 49.33333 49.41667 49.50000
#end few
tail(mofactor)
## [1] Aug Sep Oct Nov Dec Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
tail(airpass$year)
## [1] 60.58333 60.66667 60.75000 60.83333 60.91667 61.00000
#random few
randoms <- sample(1:nrow(airpass), 10, replace=FALSE)
mofactor[randoms]
##  [1] Jun Dec Jan Apr Dec Dec Dec Mar Jan Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
airpass[randoms,"year"]
##  [1] 56.41667 57.91667 57.00000 57.25000 52.91667 53.91667 55.91667
##  [8] 52.16667 55.00000 59.25000

The decimals still seem to matchup with the month designation.

We can now add our new variables into the airpass data set.

airpass$justyear <- justyear
airpass$mofactor <- mofactor

We can now create a multiple linear regression model using a dummy variable for each month with the variables we just created. We will continue to log the pass variable to take care of the increasing seasonal variation.

mod1 <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(mod1)
##  (Intercept)     justyear  mofactorFeb  mofactorMar  mofactorApr 
## -1.214997885  0.120825657  0.031389870  0.019403852  0.159699778 
##  mofactorMay  mofactorJun  mofactorJul  mofactorAug  mofactorSep 
##  0.138499729  0.146195892  0.278410898  0.392422029  0.393195995 
##  mofactorOct  mofactorNov  mofactorDec 
##  0.258630198  0.130540762 -0.003108143
summary(mod1)
## 
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.156370 -0.041016  0.003677  0.044069  0.132324 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.214998   0.081277 -14.949  < 2e-16 ***
## justyear     0.120826   0.001432  84.399  < 2e-16 ***
## mofactorFeb  0.031390   0.024253   1.294    0.198    
## mofactorMar  0.019404   0.024253   0.800    0.425    
## mofactorApr  0.159700   0.024253   6.585 1.00e-09 ***
## mofactorMay  0.138500   0.024253   5.711 7.19e-08 ***
## mofactorJun  0.146196   0.024253   6.028 1.58e-08 ***
## mofactorJul  0.278411   0.024253  11.480  < 2e-16 ***
## mofactorAug  0.392422   0.024253  16.180  < 2e-16 ***
## mofactorSep  0.393196   0.024253  16.212  < 2e-16 ***
## mofactorOct  0.258630   0.024253  10.664  < 2e-16 ***
## mofactorNov  0.130541   0.024253   5.382 3.28e-07 ***
## mofactorDec -0.003108   0.024253  -0.128    0.898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared:  0.9835, Adjusted R-squared:  0.982 
## F-statistic: 649.4 on 12 and 131 DF,  p-value: < 2.2e-16

There is no dummy variable for January because that is automatically built into the intercept. So in this case, the estimates for the coefficients for the rest of the months are in comparison to January. We can see that most of the months are significant, but not all.

From this summary, we can tell that the summer months have more passengers than January because their coefficient estimates are bigger. We can tell that the winter months are similar to January because their coefficient estimates are closer to 0.

We could also create a multiple linear regression model including January as a dummy variable.

mod2 <- lm(log(pass) ~ 0 + justyear + mofactor, data=airpass)
coef(mod2)
##    justyear mofactorJan mofactorFeb mofactorMar mofactorApr mofactorMay 
##   0.1208257  -1.2149979  -1.1836080  -1.1955940  -1.0552981  -1.0764982 
## mofactorJun mofactorJul mofactorAug mofactorSep mofactorOct mofactorNov 
##  -1.0688020  -0.9365870  -0.8225759  -0.8218019  -0.9563677  -1.0844571 
## mofactorDec 
##  -1.2181060
summary(mod2)
## 
## Call:
## lm(formula = log(pass) ~ 0 + justyear + mofactor, data = airpass)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.156370 -0.041016  0.003677  0.044069  0.132324 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## justyear     0.120826   0.001432   84.40   <2e-16 ***
## mofactorJan -1.214998   0.081277  -14.95   <2e-16 ***
## mofactorFeb -1.183608   0.079878  -14.82   <2e-16 ***
## mofactorMar -1.195594   0.079878  -14.97   <2e-16 ***
## mofactorApr -1.055298   0.079878  -13.21   <2e-16 ***
## mofactorMay -1.076498   0.079878  -13.48   <2e-16 ***
## mofactorJun -1.068802   0.079878  -13.38   <2e-16 ***
## mofactorJul -0.936587   0.079878  -11.72   <2e-16 ***
## mofactorAug -0.822576   0.079878  -10.30   <2e-16 ***
## mofactorSep -0.821802   0.079878  -10.29   <2e-16 ***
## mofactorOct -0.956368   0.079878  -11.97   <2e-16 ***
## mofactorNov -1.084457   0.079878  -13.58   <2e-16 ***
## mofactorDec -1.218106   0.079878  -15.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 9.734e+04 on 13 and 131 DF,  p-value: < 2.2e-16

Let’s take a look at the plot of our original linear model now, fitted over the original data.

with(airpass, plot(log(pass) ~ year, type="l" ))
lines(airpass$year, mod1$fitted.values,col="blue")

Looks pretty good!

Since not all of our months were significant in our original linear model, instead of breaking out every month, let’s collapse them down into seasons.

We start by creating a new season variable, and collapsing each month into the season we want.

season <- mofactor
levels(season) <- list(winter=c("Dec","Jan","Feb"), spring=c("Mar","Apr","May"),  
                       summer=c("Jun","Jul","Aug"), fall=c("Sep","Oct","Nov"))

We can check the top of the data.

head(season)
## [1] winter spring spring spring summer summer
## Levels: winter spring summer fall
head(mofactor)
## [1] Feb Mar Apr May Jun Jul
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

We now create the model with the season dummy variable.

modseason <- lm(log(pass) ~ justyear + season, data=airpass)
coef(modseason)
##  (Intercept)     justyear seasonspring seasonsummer   seasonfall 
##  -1.20197254   0.12076004   0.09641867   0.26289382   0.25133987
summary(modseason)
## 
## Call:
## lm(formula = log(pass) ~ justyear + season, data = airpass)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27061 -0.07119  0.00114  0.06893  0.21364 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.201973   0.132877  -9.046 1.17e-15 ***
## justyear      0.120760   0.002404  50.229  < 2e-16 ***
## seasonspring  0.096419   0.023543   4.095 7.12e-05 ***
## seasonsummer  0.262894   0.023543  11.167  < 2e-16 ***
## seasonfall    0.251340   0.023543  10.676  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09983 on 139 degrees of freedom
## Multiple R-squared:  0.9503, Adjusted R-squared:  0.9489 
## F-statistic: 664.4 on 4 and 139 DF,  p-value: < 2.2e-16

We’ll throw the season variable into the airpass data as well.

airpass$season <- season

We can now compare our two models on the same plot.

with(airpass, plot(log(pass) ~ year, type = "l"))
  lines(airpass$year, mod1$fitted.values, col="blue")
  lines(airpass$year, modseason$fitted.values, col="red")

We can see that our new model doesn’t look quite as accurate. To compare the two models more in depth, we can use the anova function.

anova(mod1, modseason)
## Analysis of Variance Table
## 
## Model 1: log(pass) ~ justyear + mofactor
## Model 2: log(pass) ~ justyear + season
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    131 0.46072                                  
## 2    139 1.38515 -8  -0.92443 32.857 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since we have a small p-value, we want the model with more variables - so we would use the model with all of the months.

Project Peer Review

The project I peer reviewed really just didn’t feel like a rough draft. The group had an interesting topic picked out as well as some interesting variables, but all they needed to do was flesh out the body of the paper, as well as organize some of their calculations.

I got some decent feedback on my group’s project. We had a few typos, and there were some organization issues. The reader wasn’t clear what variables we were going to proceed using once we had done the initial data exploration. We also need to write a little bit more on our methods and what we used in the Methods section, as well as rework the Lit Review a bit more. All in all, it was good to have somebody read and give feedback on the little as well as the big issues in our paper.