6.3 Seasonal Variation

On Thursday, we first took a look at seasonal variation. We looked at two types: constant (or additive) seasonal variation, in which the magnitude of the swing of values is constant through time, and increasing seasonal variation, where the magnitude of swim is increasing through time. We want to have a constant seasonal variation and not an increasing seasonal variation since that implies heteroscedasticity, which goes against one of the main regression assumptions. We can manipulate the plots by transforming the response variable in order to create constant seasonal variation; we can plot time against \(\sqrt y_t\),\(\log{y_t}\), or \(y_t^\lambda\), where as \(\lambda\) approaches 0, it resembles \(\log{y_t}\).

6.4 Modeling Seasonal Variation

The general model for seasonal variation is as follows: \[y_t=TR_t + SN_t + \epsilon_t\]

where:

\(y_t\) is the measure at time \(t\)

\(TR_t\) is the trend (e.g. \(\beta_0 + \beta_1 t\) )

\(SN_t\) is the seasonal factor

\(\epsilon_t\) is the error or residual

The prediction at time t is going to eliminate the error term and is as follows:

\[\widehat{y_t}=TR_t + SN_t\]

Example

In class, we used the airpass dataset from the faraway package to show what increasing seasonal variation looks like.

library(faraway)
data(airpass)
attach(airpass)
with(airpass,plot(pass~year,type="l"))

Since this implies heteroscedasticity, we want to transform the response variable in order to get the model to have constant seasonal variation (i.e. it leaves us with a homoscedastic model).

with(airpass,plot(log(pass)~year,type="l"))

As you can see, by transforming the response variable with the log function, the magnitude of the swing is relatively constant.

Let’s take a look at the other methods of transformation in order to compare it to what we have already found

with(airpass,plot(sqrt(pass)~year,type="l"))

Using the sqrt transformation, the increasing seasonal variation is still quite apparent, so we don’t want to use this model. Now, let’s take a look at raising pass to various powers.

par(mfrow=c(2,2))
with(airpass,plot((pass^.8)~year,type="l"))
with(airpass,plot((pass^.6)~year,type="l"))
with(airpass,plot((pass^.4)~year,type="l"))
with(airpass,plot((pass^.2)~year,type="l"))

As you can see, with decreasing the power that pass is raised to, the seasonal variation is becoming more and more constant. This intuitively makes sense since the power is approaching zero, and resembling the log function more and more, which we found to have pretty much constant seasonal variation.

Dummy Variables

Also, we can use dummy variables in modeling seasonal variation. In our example, we looked at months (e.g. I(mo=June)= 1 if June, 0 otherwise). We can actually combine dummy variables to see if we can create a reasonable model with less dummy variables required (e.g. 4 seasons)

justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear  

mofactor <-factor(round(modecimal*12))  

cbind(airpass$year, mofactor)
##                 mofactor
##   [1,] 49.08333        2
##   [2,] 49.16667        3
##   [3,] 49.25000        4
##   [4,] 49.33333        5
##   [5,] 49.41667        6
##   [6,] 49.50000        7
##   [7,] 49.58333        8
##   [8,] 49.66667        9
##   [9,] 49.75000       10
##  [10,] 49.83333       11
##  [11,] 49.91667       12
##  [12,] 50.00000        1
##  [13,] 50.08333        2
##  [14,] 50.16667        3
##  [15,] 50.25000        4
##  [16,] 50.33333        5
##  [17,] 50.41667        6
##  [18,] 50.50000        7
##  [19,] 50.58333        8
##  [20,] 50.66667        9
##  [21,] 50.75000       10
##  [22,] 50.83333       11
##  [23,] 50.91667       12
##  [24,] 51.00000        1
##  [25,] 51.08333        2
##  [26,] 51.16667        3
##  [27,] 51.25000        4
##  [28,] 51.33333        5
##  [29,] 51.41667        6
##  [30,] 51.50000        7
##  [31,] 51.58333        8
##  [32,] 51.66667        9
##  [33,] 51.75000       10
##  [34,] 51.83333       11
##  [35,] 51.91667       12
##  [36,] 52.00000        1
##  [37,] 52.08333        2
##  [38,] 52.16667        3
##  [39,] 52.25000        4
##  [40,] 52.33333        5
##  [41,] 52.41667        6
##  [42,] 52.50000        7
##  [43,] 52.58333        8
##  [44,] 52.66667        9
##  [45,] 52.75000       10
##  [46,] 52.83333       11
##  [47,] 52.91667       12
##  [48,] 53.00000        1
##  [49,] 53.08333        2
##  [50,] 53.16667        3
##  [51,] 53.25000        4
##  [52,] 53.33333        5
##  [53,] 53.41667        6
##  [54,] 53.50000        7
##  [55,] 53.58333        8
##  [56,] 53.66667        9
##  [57,] 53.75000       10
##  [58,] 53.83333       11
##  [59,] 53.91667       12
##  [60,] 54.00000        1
##  [61,] 54.08333        2
##  [62,] 54.16667        3
##  [63,] 54.25000        4
##  [64,] 54.33333        5
##  [65,] 54.41667        6
##  [66,] 54.50000        7
##  [67,] 54.58333        8
##  [68,] 54.66667        9
##  [69,] 54.75000       10
##  [70,] 54.83333       11
##  [71,] 54.91667       12
##  [72,] 55.00000        1
##  [73,] 55.08333        2
##  [74,] 55.16667        3
##  [75,] 55.25000        4
##  [76,] 55.33333        5
##  [77,] 55.41667        6
##  [78,] 55.50000        7
##  [79,] 55.58333        8
##  [80,] 55.66667        9
##  [81,] 55.75000       10
##  [82,] 55.83333       11
##  [83,] 55.91667       12
##  [84,] 56.00000        1
##  [85,] 56.08333        2
##  [86,] 56.16667        3
##  [87,] 56.25000        4
##  [88,] 56.33333        5
##  [89,] 56.41667        6
##  [90,] 56.50000        7
##  [91,] 56.58333        8
##  [92,] 56.66667        9
##  [93,] 56.75000       10
##  [94,] 56.83333       11
##  [95,] 56.91667       12
##  [96,] 57.00000        1
##  [97,] 57.08333        2
##  [98,] 57.16667        3
##  [99,] 57.25000        4
## [100,] 57.33333        5
## [101,] 57.41667        6
## [102,] 57.50000        7
## [103,] 57.58333        8
## [104,] 57.66667        9
## [105,] 57.75000       10
## [106,] 57.83333       11
## [107,] 57.91667       12
## [108,] 58.00000        1
## [109,] 58.08333        2
## [110,] 58.16667        3
## [111,] 58.25000        4
## [112,] 58.33333        5
## [113,] 58.41667        6
## [114,] 58.50000        7
## [115,] 58.58333        8
## [116,] 58.66667        9
## [117,] 58.75000       10
## [118,] 58.83333       11
## [119,] 58.91667       12
## [120,] 59.00000        1
## [121,] 59.08333        2
## [122,] 59.16667        3
## [123,] 59.25000        4
## [124,] 59.33333        5
## [125,] 59.41667        6
## [126,] 59.50000        7
## [127,] 59.58333        8
## [128,] 59.66667        9
## [129,] 59.75000       10
## [130,] 59.83333       11
## [131,] 59.91667       12
## [132,] 60.00000        1
## [133,] 60.08333        2
## [134,] 60.16667        3
## [135,] 60.25000        4
## [136,] 60.33333        5
## [137,] 60.41667        6
## [138,] 60.50000        7
## [139,] 60.58333        8
## [140,] 60.66667        9
## [141,] 60.75000       10
## [142,] 60.83333       11
## [143,] 60.91667       12
## [144,] 61.00000        1

This makes it so that each result is assigned a month based on number. Now, we want to assign each result the name of the month instead of the number

levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May", 
                      "Jun", "Jul", "Aug", "Sep", "Oct",
                      "Nov", "Dec")  

summary(mofactor)  
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##  12  12  12  12  12  12  12  12  12  12  12  12
# To check work, take a look at the beginning few using the `head` function and the end few using the `tail` function
head(mofactor)      
## [1] Feb Mar Apr May Jun Jul
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
head(airpass$year)
## [1] 49.08333 49.16667 49.25000 49.33333 49.41667 49.50000
tail(mofactor)
## [1] Aug Sep Oct Nov Dec Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
tail(airpass$year)
## [1] 60.58333 60.66667 60.75000 60.83333 60.91667 61.00000
# Randomly select some to check (because middle could be wrong)
randoms <- sample(1:nrow(airpass), 10, replace=FALSE)
mofactor[randoms]
##  [1] Sep Mar Oct Jul Apr Sep Mar Dec Apr May
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
airpass[randoms,"year"]
##  [1] 52.66667 54.16667 58.75000 51.50000 55.25000 57.66667 49.16667
##  [8] 58.91667 58.25000 53.33333

Everything looks good. Now we can add new variable names to airpass to simplify.

airpass$justyear <- justyear
airpass$mofactor <- mofactor

With this we can create a model using a dummy for each month

mod <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(mod)
##  (Intercept)     justyear  mofactorFeb  mofactorMar  mofactorApr 
## -1.214997885  0.120825657  0.031389870  0.019403852  0.159699778 
##  mofactorMay  mofactorJun  mofactorJul  mofactorAug  mofactorSep 
##  0.138499729  0.146195892  0.278410898  0.392422029  0.393195995 
##  mofactorOct  mofactorNov  mofactorDec 
##  0.258630198  0.130540762 -0.003108143
summary(mod)
## 
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.156370 -0.041016  0.003677  0.044069  0.132324 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.214998   0.081277 -14.949  < 2e-16 ***
## justyear     0.120826   0.001432  84.399  < 2e-16 ***
## mofactorFeb  0.031390   0.024253   1.294    0.198    
## mofactorMar  0.019404   0.024253   0.800    0.425    
## mofactorApr  0.159700   0.024253   6.585 1.00e-09 ***
## mofactorMay  0.138500   0.024253   5.711 7.19e-08 ***
## mofactorJun  0.146196   0.024253   6.028 1.58e-08 ***
## mofactorJul  0.278411   0.024253  11.480  < 2e-16 ***
## mofactorAug  0.392422   0.024253  16.180  < 2e-16 ***
## mofactorSep  0.393196   0.024253  16.212  < 2e-16 ***
## mofactorOct  0.258630   0.024253  10.664  < 2e-16 ***
## mofactorNov  0.130541   0.024253   5.382 3.28e-07 ***
## mofactorDec -0.003108   0.024253  -0.128    0.898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared:  0.9835, Adjusted R-squared:  0.982 
## F-statistic: 649.4 on 12 and 131 DF,  p-value: < 2.2e-16

This tells us that the summer months are significantly different than January (built into the intercept), whereas the winter months are not since they have higher p-values (December, February, March)

with(airpass, plot(log(pass)~ year, type="l" ))
lines(airpass$year, mod$fitted.values,col="blue")

Since we found that winter and summer months are significantly different, we can take a look at seasons and use those for dummy variables. Once we get this new model, we want to conduct a partial f-test using the anova function in order to see if the model is significantly different than the model with months as the dummy variables.

season<-airpass$mofactor
levels(season)<-list(Summer=c("Jun","Jul","Aug"),Winter=c("Dec","Jan","Feb"),Spring=c("Mar","Apr","May"),Fall=c("Sep","Oct","Nov"))
mofactor
##   [1] Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun
##  [18] Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
##  [35] Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr
##  [52] May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep
##  [69] Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb
##  [86] Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul
## [103] Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## [120] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May
## [137] Jun Jul Aug Sep Oct Nov Dec Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
summary(season)
## Summer Winter Spring   Fall 
##     36     36     36     36
airpass$season<-season
head(airpass)
##   pass     year justyear mofactor season
## 1  112 49.08333       49      Feb Winter
## 2  118 49.16667       49      Mar Spring
## 3  132 49.25000       49      Apr Spring
## 4  129 49.33333       49      May Spring
## 5  121 49.41667       49      Jun Summer
## 6  135 49.50000       49      Jul Summer
modseason<-lm(log(pass) ~ justyear + season, data=airpass)
with(airpass,plot(log(pass)~year,type="l"))
lines(airpass$year,mod$fitted.values,col="blue")
lines(airpass$year,modseason$fitted.values,col="red")

anova(mod,modseason)
## Analysis of Variance Table
## 
## Model 1: log(pass) ~ justyear + mofactor
## Model 2: log(pass) ~ justyear + season
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    131 0.46072                                  
## 2    139 1.38515 -8  -0.92443 32.857 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is <2.2e-16, the two models are significantly different, and therefore we should stick with the model where the dummy variables are the months. Also, note that all of the models appear to have constant seasonal variation, which is definitely a good sign.

Peer Review

At the end of class, we exchanged the drafts of our group projects, and we received a few notes that would help improve our paper. Firstly, we included some smooth scatter plot in order to represent our data. When one of my classmates viewed it, they did not understand what it was, so it would be good for us to add an explanation as to how a smooth scatter plot is useful/ how it might be better than a regular scatter plot for our data. Additionally, in our Methods section, my peer reviewer noted that we should be more clear as to how we came up with the model that we initially created for our project. As a whole, I received a lot of positive feedback as a result, with only a couple problems that need to be fixed in order to better understand our process.