On Thursday in class we discussed sections 6.3 and 6.4, which tackled seasonal variation and the ways to model seasonal variation. We used the airpass
data set as an example. At the end of class we did peer reviews on our group projects.
Seasonal variation occurs when the variation is different for different reoccuring time periods, like each year, season, month, day, etc. There is constant seasonal variation, in which the magnitude of the swing is constant through time. This is the kind of seasonal variation we want to see in our data.
There is also increasing seasonal variation, in which the magnitude of the swing increases over time. We do not like to see this in our data - statisticians do not like variation that isn’t constant. To attempt to remedy this issue, we can perform transformations on our variables, such as log or square root.
Let’s look at the airpass
data to see if there is seasonal variation. If there is, we’ll decide if it’s constant or increasing.
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(airpass)
attach(airpass)
plot(pass ~ year, type = "l")
We can see that the variance increases as the years get later, so there seems to be increasing seasonal variation in our data.
To try and correct this, I’ll perform a log transformation on the response, which is the number of passengers in thousands.
plot(log(pass) ~ year, type = "l")
This seems to have corrected the problem. The variance looks much more like constant seasonal variance, which is what we want to see!
Times series with seasonal variation can be modeling with the following equation: \[y_t = TR_t + SN_t +\epsilon_t\] where \(TR_t\) is the trend and \(SN_t\) is the seasonal factor.
We can model seasonal variation using dummy variables and trig functions. We didn’t go into much detail about trig functions, so I will only show how to model using dummy variables.
By modeling seasonal variation, we can produce more accurate results than if we just ignored the seasonal component.
First, we need to separate the year variable so we have the month and the year as separate variables.
justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear
We can now turn each month from a decimal into an integer. 1 for Jan, 2 for Feb, etc. The factor command sets the month as a variable.
mofactor <-factor(round(modecimal*12))
Instead of numbers, let’s have the month names as the entries.
levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")
summary(mofactor)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 12 12 12 12 12 12 12 12 12 12 12 12
We can see that there are twelve entries for each month.
Let’s check our work again.
#beginning few
head(mofactor)
## [1] Feb Mar Apr May Jun Jul
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
head(airpass$year)
## [1] 49.08333 49.16667 49.25000 49.33333 49.41667 49.50000
#end few
tail(mofactor)
## [1] Aug Sep Oct Nov Dec Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
tail(airpass$year)
## [1] 60.58333 60.66667 60.75000 60.83333 60.91667 61.00000
#random few
randoms <- sample(1:nrow(airpass), 10, replace=FALSE)
mofactor[randoms]
## [1] Jun Dec Jan Apr Dec Dec Dec Mar Jan Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
airpass[randoms,"year"]
## [1] 56.41667 57.91667 57.00000 57.25000 52.91667 53.91667 55.91667
## [8] 52.16667 55.00000 59.25000
The decimals still seem to matchup with the month designation.
We can now add our new variables into the airpass data set.
airpass$justyear <- justyear
airpass$mofactor <- mofactor
We can now create a multiple linear regression model using a dummy variable for each month with the variables we just created. We will continue to log the pass
variable to take care of the increasing seasonal variation.
mod1 <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(mod1)
## (Intercept) justyear mofactorFeb mofactorMar mofactorApr
## -1.214997885 0.120825657 0.031389870 0.019403852 0.159699778
## mofactorMay mofactorJun mofactorJul mofactorAug mofactorSep
## 0.138499729 0.146195892 0.278410898 0.392422029 0.393195995
## mofactorOct mofactorNov mofactorDec
## 0.258630198 0.130540762 -0.003108143
summary(mod1)
##
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214998 0.081277 -14.949 < 2e-16 ***
## justyear 0.120826 0.001432 84.399 < 2e-16 ***
## mofactorFeb 0.031390 0.024253 1.294 0.198
## mofactorMar 0.019404 0.024253 0.800 0.425
## mofactorApr 0.159700 0.024253 6.585 1.00e-09 ***
## mofactorMay 0.138500 0.024253 5.711 7.19e-08 ***
## mofactorJun 0.146196 0.024253 6.028 1.58e-08 ***
## mofactorJul 0.278411 0.024253 11.480 < 2e-16 ***
## mofactorAug 0.392422 0.024253 16.180 < 2e-16 ***
## mofactorSep 0.393196 0.024253 16.212 < 2e-16 ***
## mofactorOct 0.258630 0.024253 10.664 < 2e-16 ***
## mofactorNov 0.130541 0.024253 5.382 3.28e-07 ***
## mofactorDec -0.003108 0.024253 -0.128 0.898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.982
## F-statistic: 649.4 on 12 and 131 DF, p-value: < 2.2e-16
There is no dummy variable for January because that is automatically built into the intercept. So in this case, the estimates for the coefficients for the rest of the months are in comparison to January. We can see that most of the months are significant, but not all.
From this summary, we can tell that the summer months have more passengers than January because their coefficient estimates are bigger. We can tell that the winter months are similar to January because their coefficient estimates are closer to 0.
We could also create a multiple linear regression model including January as a dummy variable.
mod2 <- lm(log(pass) ~ 0 + justyear + mofactor, data=airpass)
coef(mod2)
## justyear mofactorJan mofactorFeb mofactorMar mofactorApr mofactorMay
## 0.1208257 -1.2149979 -1.1836080 -1.1955940 -1.0552981 -1.0764982
## mofactorJun mofactorJul mofactorAug mofactorSep mofactorOct mofactorNov
## -1.0688020 -0.9365870 -0.8225759 -0.8218019 -0.9563677 -1.0844571
## mofactorDec
## -1.2181060
summary(mod2)
##
## Call:
## lm(formula = log(pass) ~ 0 + justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## justyear 0.120826 0.001432 84.40 <2e-16 ***
## mofactorJan -1.214998 0.081277 -14.95 <2e-16 ***
## mofactorFeb -1.183608 0.079878 -14.82 <2e-16 ***
## mofactorMar -1.195594 0.079878 -14.97 <2e-16 ***
## mofactorApr -1.055298 0.079878 -13.21 <2e-16 ***
## mofactorMay -1.076498 0.079878 -13.48 <2e-16 ***
## mofactorJun -1.068802 0.079878 -13.38 <2e-16 ***
## mofactorJul -0.936587 0.079878 -11.72 <2e-16 ***
## mofactorAug -0.822576 0.079878 -10.30 <2e-16 ***
## mofactorSep -0.821802 0.079878 -10.29 <2e-16 ***
## mofactorOct -0.956368 0.079878 -11.97 <2e-16 ***
## mofactorNov -1.084457 0.079878 -13.58 <2e-16 ***
## mofactorDec -1.218106 0.079878 -15.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 9.734e+04 on 13 and 131 DF, p-value: < 2.2e-16
Let’s take a look at the plot of our original linear model now, fitted over the original data.
with(airpass, plot(log(pass) ~ year, type="l" ))
lines(airpass$year, mod1$fitted.values,col="blue")
Looks pretty good!
Since not all of our months were significant in our original linear model, instead of breaking out every month, let’s collapse them down into seasons.
We start by creating a new season
variable, and collapsing each month into the season we want.
season <- mofactor
levels(season) <- list(winter=c("Dec","Jan","Feb"), spring=c("Mar","Apr","May"),
summer=c("Jun","Jul","Aug"), fall=c("Sep","Oct","Nov"))
We can check the top of the data.
head(season)
## [1] winter spring spring spring summer summer
## Levels: winter spring summer fall
head(mofactor)
## [1] Feb Mar Apr May Jun Jul
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
We now create the model with the season dummy variable.
modseason <- lm(log(pass) ~ justyear + season, data=airpass)
coef(modseason)
## (Intercept) justyear seasonspring seasonsummer seasonfall
## -1.20197254 0.12076004 0.09641867 0.26289382 0.25133987
summary(modseason)
##
## Call:
## lm(formula = log(pass) ~ justyear + season, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27061 -0.07119 0.00114 0.06893 0.21364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.201973 0.132877 -9.046 1.17e-15 ***
## justyear 0.120760 0.002404 50.229 < 2e-16 ***
## seasonspring 0.096419 0.023543 4.095 7.12e-05 ***
## seasonsummer 0.262894 0.023543 11.167 < 2e-16 ***
## seasonfall 0.251340 0.023543 10.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09983 on 139 degrees of freedom
## Multiple R-squared: 0.9503, Adjusted R-squared: 0.9489
## F-statistic: 664.4 on 4 and 139 DF, p-value: < 2.2e-16
We’ll throw the season variable into the airpass
data as well.
airpass$season <- season
We can now compare our two models on the same plot.
with(airpass, plot(log(pass) ~ year, type = "l"))
lines(airpass$year, mod1$fitted.values, col="blue")
lines(airpass$year, modseason$fitted.values, col="red")
We can see that our new model doesn’t look quite as accurate. To compare the two models more in depth, we can use the anova function.
anova(mod1, modseason)
## Analysis of Variance Table
##
## Model 1: log(pass) ~ justyear + mofactor
## Model 2: log(pass) ~ justyear + season
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 131 0.46072
## 2 139 1.38515 -8 -0.92443 32.857 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since we have a small p-value, we want the model with more variables - so we would use the model with all of the months.
The project I peer reviewed really just didn’t feel like a rough draft. The group had an interesting topic picked out as well as some interesting variables, but all they needed to do was flesh out the body of the paper, as well as organize some of their calculations.
I got some decent feedback on my group’s project. We had a few typos, and there were some organization issues. The reader wasn’t clear what variables we were going to proceed using once we had done the initial data exploration. We also need to write a little bit more on our methods and what we used in the Methods section, as well as rework the Lit Review a bit more. All in all, it was good to have somebody read and give feedback on the little as well as the big issues in our paper.