Topic

Today in class we talked about seasonal variation, and how to model it.

Seasonal Variation

There are two types of seasonal varation, one good one bad.

  1. Constant: this had the same height between the peaks and pits. It could be increasing (both x and y get bigger values), but the peaks and the pits are still the same distance. We want this kind!
  2. Increasing: the peaks and the pits have different heights throughout the plot. We do not want this kind.

What to do with bad variation?

If we find ourselves with increasing seasonal variation, we are able to fix it! We can use transformations to make it have constant seasonal growth. We should start with a log transformation, then try a square root, and then can try a exponent fraction (smaller than the square root of 0.5). However, the closer that fraction gets to zero, the graph will start doing the same transformation as a log. So, instead of doing a small, exponent fraction, it would be simpler to do a log transformation.

Example!

Plot Yt over time. Do you have constant seasonal variation? If no, find transformation.

library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data("airpass")
head(airpass)
##   pass     year
## 1  112 49.08333
## 2  118 49.16667
## 3  132 49.25000
## 4  129 49.33333
## 5  121 49.41667
## 6  135 49.50000
plot(pass ~ year, data = airpass, type = "l")

There is increasing seasonal variation, the peaks get higher as the year goes on. We should try and use a transformation to fix this.

We can start with log(Yt)

with(airpass, plot(log(pass) ~ year, type = "l"))

After trying with log of one side, log of both sides and log of the other side, the log around the response looks the best. Our peaks and pits of the seasons are now the same height.

Now, we want to split the year into months. We can look at the head of the data to see how the year is listed out.

head(airpass)
##   pass     year
## 1  112 49.08333
## 2  118 49.16667
## 3  132 49.25000
## 4  129 49.33333
## 5  121 49.41667
## 6  135 49.50000

It shows the months as decimals. 49.5 would be the year 1949 in July.

We will create dummy variables in the months so that we can see what the points are in each month.

yearonly <- floor(airpass$year)
monthdec <- airpass$year - yearonly
mofactor <-factor(round(monthdec*12))
cbind(airpass$year, mofactor)
##                 mofactor
##   [1,] 49.08333        2
##   [2,] 49.16667        3
##   [3,] 49.25000        4
##   [4,] 49.33333        5
##   [5,] 49.41667        6
##   [6,] 49.50000        7
##   [7,] 49.58333        8
##   [8,] 49.66667        9
##   [9,] 49.75000       10
##  [10,] 49.83333       11
##  [11,] 49.91667       12
##  [12,] 50.00000        1
##  [13,] 50.08333        2
##  [14,] 50.16667        3
##  [15,] 50.25000        4
##  [16,] 50.33333        5
##  [17,] 50.41667        6
##  [18,] 50.50000        7
##  [19,] 50.58333        8
##  [20,] 50.66667        9
##  [21,] 50.75000       10
##  [22,] 50.83333       11
##  [23,] 50.91667       12
##  [24,] 51.00000        1
##  [25,] 51.08333        2
##  [26,] 51.16667        3
##  [27,] 51.25000        4
##  [28,] 51.33333        5
##  [29,] 51.41667        6
##  [30,] 51.50000        7
##  [31,] 51.58333        8
##  [32,] 51.66667        9
##  [33,] 51.75000       10
##  [34,] 51.83333       11
##  [35,] 51.91667       12
##  [36,] 52.00000        1
##  [37,] 52.08333        2
##  [38,] 52.16667        3
##  [39,] 52.25000        4
##  [40,] 52.33333        5
##  [41,] 52.41667        6
##  [42,] 52.50000        7
##  [43,] 52.58333        8
##  [44,] 52.66667        9
##  [45,] 52.75000       10
##  [46,] 52.83333       11
##  [47,] 52.91667       12
##  [48,] 53.00000        1
##  [49,] 53.08333        2
##  [50,] 53.16667        3
##  [51,] 53.25000        4
##  [52,] 53.33333        5
##  [53,] 53.41667        6
##  [54,] 53.50000        7
##  [55,] 53.58333        8
##  [56,] 53.66667        9
##  [57,] 53.75000       10
##  [58,] 53.83333       11
##  [59,] 53.91667       12
##  [60,] 54.00000        1
##  [61,] 54.08333        2
##  [62,] 54.16667        3
##  [63,] 54.25000        4
##  [64,] 54.33333        5
##  [65,] 54.41667        6
##  [66,] 54.50000        7
##  [67,] 54.58333        8
##  [68,] 54.66667        9
##  [69,] 54.75000       10
##  [70,] 54.83333       11
##  [71,] 54.91667       12
##  [72,] 55.00000        1
##  [73,] 55.08333        2
##  [74,] 55.16667        3
##  [75,] 55.25000        4
##  [76,] 55.33333        5
##  [77,] 55.41667        6
##  [78,] 55.50000        7
##  [79,] 55.58333        8
##  [80,] 55.66667        9
##  [81,] 55.75000       10
##  [82,] 55.83333       11
##  [83,] 55.91667       12
##  [84,] 56.00000        1
##  [85,] 56.08333        2
##  [86,] 56.16667        3
##  [87,] 56.25000        4
##  [88,] 56.33333        5
##  [89,] 56.41667        6
##  [90,] 56.50000        7
##  [91,] 56.58333        8
##  [92,] 56.66667        9
##  [93,] 56.75000       10
##  [94,] 56.83333       11
##  [95,] 56.91667       12
##  [96,] 57.00000        1
##  [97,] 57.08333        2
##  [98,] 57.16667        3
##  [99,] 57.25000        4
## [100,] 57.33333        5
## [101,] 57.41667        6
## [102,] 57.50000        7
## [103,] 57.58333        8
## [104,] 57.66667        9
## [105,] 57.75000       10
## [106,] 57.83333       11
## [107,] 57.91667       12
## [108,] 58.00000        1
## [109,] 58.08333        2
## [110,] 58.16667        3
## [111,] 58.25000        4
## [112,] 58.33333        5
## [113,] 58.41667        6
## [114,] 58.50000        7
## [115,] 58.58333        8
## [116,] 58.66667        9
## [117,] 58.75000       10
## [118,] 58.83333       11
## [119,] 58.91667       12
## [120,] 59.00000        1
## [121,] 59.08333        2
## [122,] 59.16667        3
## [123,] 59.25000        4
## [124,] 59.33333        5
## [125,] 59.41667        6
## [126,] 59.50000        7
## [127,] 59.58333        8
## [128,] 59.66667        9
## [129,] 59.75000       10
## [130,] 59.83333       11
## [131,] 59.91667       12
## [132,] 60.00000        1
## [133,] 60.08333        2
## [134,] 60.16667        3
## [135,] 60.25000        4
## [136,] 60.33333        5
## [137,] 60.41667        6
## [138,] 60.50000        7
## [139,] 60.58333        8
## [140,] 60.66667        9
## [141,] 60.75000       10
## [142,] 60.83333       11
## [143,] 60.91667       12
## [144,] 61.00000        1

In this section of code, we are switching the years and decimals to months. We then subtract the year out, leaving the decimals to find the month. Next, we multiply the decimal by the number of months, so that we can get the numerical value of the month (ie 2 for February). The last line is double checking that the code transferred.

levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct","Nov", "Dec")  
summary(mofactor) 
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##  12  12  12  12  12  12  12  12  12  12  12  12

With this chunk we assigned each month to their related numerical values. 2 would be labeled Feb, 10 would be Oct, etc.

Next, we can add these into our dataset so that it will be clearly labeled.

airpass$yearonly <- yearonly
airpass$mofactor <- mofactor

We have uploaded these separate months, so now we can create the model with the dummy variable as months in it.

mod <- lm(log(pass) ~ yearonly + mofactor, data=airpass)
coef(mod)
##  (Intercept)     yearonly  mofactorFeb  mofactorMar  mofactorApr 
## -1.214997885  0.120825657  0.031389870  0.019403852  0.159699778 
##  mofactorMay  mofactorJun  mofactorJul  mofactorAug  mofactorSep 
##  0.138499729  0.146195892  0.278410898  0.392422029  0.393195995 
##  mofactorOct  mofactorNov  mofactorDec 
##  0.258630198  0.130540762 -0.003108143
summary(mod)
## 
## Call:
## lm(formula = log(pass) ~ yearonly + mofactor, data = airpass)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.156370 -0.041016  0.003677  0.044069  0.132324 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.214998   0.081277 -14.949  < 2e-16 ***
## yearonly     0.120826   0.001432  84.399  < 2e-16 ***
## mofactorFeb  0.031390   0.024253   1.294    0.198    
## mofactorMar  0.019404   0.024253   0.800    0.425    
## mofactorApr  0.159700   0.024253   6.585 1.00e-09 ***
## mofactorMay  0.138500   0.024253   5.711 7.19e-08 ***
## mofactorJun  0.146196   0.024253   6.028 1.58e-08 ***
## mofactorJul  0.278411   0.024253  11.480  < 2e-16 ***
## mofactorAug  0.392422   0.024253  16.180  < 2e-16 ***
## mofactorSep  0.393196   0.024253  16.212  < 2e-16 ***
## mofactorOct  0.258630   0.024253  10.664  < 2e-16 ***
## mofactorNov  0.130541   0.024253   5.382 3.28e-07 ***
## mofactorDec -0.003108   0.024253  -0.128    0.898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared:  0.9835, Adjusted R-squared:  0.982 
## F-statistic: 649.4 on 12 and 131 DF,  p-value: < 2.2e-16

If we look at the summary results closely, we can see that there seem to be trends in the pvalues of certain months. These trends look like they are showing seasonality. For example, February, March and December can all be clumped together, showing the season of winter.

We can change our dummy variable from months to seasons, to simpify our process a little bit.

season <- airpass$mofactor
levels(season) <- list(Winter=c("Jan", "Feb", "Mar", "Dec"), Spring=c("Apr", "May", "Jun"), Summer=c("Jul", "Aug", "Sep"), Fall=c("Oct", "Nov"))
summary(season)
## Winter Spring Summer   Fall 
##     48     36     36     24

To do this I separated out the trends in the p-values and used intuition to assign the different seasons. I was able to make a list of the seasons, and use our months input to add a broader level.

airpass$season <- season
head(airpass)
##   pass     year yearonly mofactor season
## 1  112 49.08333       49      Feb Winter
## 2  118 49.16667       49      Mar Winter
## 3  132 49.25000       49      Apr Spring
## 4  129 49.33333       49      May Spring
## 5  121 49.41667       49      Jun Spring
## 6  135 49.50000       49      Jul Summer

We are able to define season into our data set, and can use head{} to double check to make sure everything translates.

Lastly, we can create a model that splits it into seasons.

modseason <- lm(log(pass) ~ yearonly + season, data=airpass)

with(airpass, plot(log(pass)~year, type = "l"))
lines(airpass$year, mod$fitted.values, col="blue")
lines(airpass$year, modseason$fitted.values, col="red")

The blue is showing when we split the seasonality into months. The red shows when we split the seasonality into seasons.

Lastly, we can run an anova function to see which of our models has a better fit.

anova(mod,modseason)
## Analysis of Variance Table
## 
## Model 1: log(pass) ~ yearonly + mofactor
## Model 2: log(pass) ~ yearonly + season
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    131 0.46072                                  
## 2    139 0.67624 -8  -0.21553 7.6604 2.215e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 2 is better, so we will keep using the model that uses seasons instead of months!