Today, we began class by discussing increasing seasonal variation and how transformations can make that variation constant. We then discussed how to model seasonal variation using dummy variables and briefly touched on using trigonometric functions for the same purpose, but did not get to talk extensively about that method.
Generally, increasing seasonal variation is when the size of the seasonal swings in a data set grows as time increases, so you see a fanning out of data points. Constant seasonal variation is when the swings stay the same size, so there is no fanning out. Using dummy variables in a time series analysis is very similar to using dummy variables in a multiple linear regression: each season (for example, each month) gets its own indicator variable, with one season left out as the baseline.
Of course, we use these modeling techniques any time we see seasonal variation in our time series data and use the transformations any time the seasonal variation in a data set is not constant.
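One way to see why a log transformation can help is to assume, purely for illustration, that the series is multiplicative. The symbols here are my own notation rather than anything from class: T_t is the trend, S_t is a seasonal factor, and ε_t is a positive error term.

```latex
y_t = T_t \cdot S_t \cdot \varepsilon_t
\quad\Longrightarrow\quad
\log y_t = \log T_t + \log S_t + \log \varepsilon_t
```

Under that assumption, the seasonal swing on the original scale is proportional to T_t, so it fans out as the trend rises. On the log scale the seasonal term log S_t is simply added, so its size no longer depends on the level of the trend.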
To illustrate these concepts in R, I’ll complete an example using the data set airpass from the R package faraway. This data set contains the number of passengers (in thousands) traveling by plane per month from 1949 to 1960.
First, we’ll need to plot the data to see whether it has constant or increasing seasonal variation.
library(faraway)
data(airpass)
plot(pass~year, data = airpass, type = "l")
This data set has very obvious increasing seasonal variation. To fix this, we can try either a square root or logarithmic transformation on our passenger variable. Let’s compare them.
plot(sqrt(pass)~year, data = airpass, type = "l")
plot(log(pass)~year, data = airpass, type = "l")
We can see that the log transform removes most of the increasing seasonal variation from our data so we will use that in any further models created.
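The visual comparison can also be backed up with a rough numeric check. The sketch below uses a simulated multiplicative series (hypothetical data, not airpass) and compares the within-year range of the last year to that of the first year on each scale. All variable names are made up for this illustration.

```r
# Simulated monthly series (hypothetical data, not airpass):
# multiplicative seasonality on an exponential trend.
set.seed(1)
n_years <- 10
idx <- seq_len(12 * n_years)              # month index 1, 2, ..., 120
yr <- rep(seq_len(n_years), each = 12)    # year index for each month
season <- 1 + 0.3 * sin(2 * pi * idx / 12)
y <- 100 * exp(0.01 * idx) * season * exp(rnorm(length(idx), sd = 0.02))

# Range (max - min) within each year, on a given scale
yearly_range <- function(x) tapply(x, yr, function(v) diff(range(v)))
ranges <- data.frame(
  raw  = as.numeric(yearly_range(y)),
  sqrt = as.numeric(yearly_range(sqrt(y))),
  log  = as.numeric(yearly_range(log(y)))
)

# Ratio of last-year range to first-year range;
# values near 1 suggest roughly constant seasonal variation
growth <- sapply(ranges, function(r) r[n_years] / r[1])
print(round(growth, 2))
```

A ratio well above 1 means the seasonal swings are still growing on that scale. The same idea could be applied to airpass by grouping on the year, though the grouping variable would need to be built first.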
Now, we’ll discuss how to add seasonal variation to the trend model. Since we didn’t get to talk much about modeling seasonal variation using trigonometric functions, I will only illustrate how to model seasonal variation using dummy variables.
head(airpass)
## pass year
## 1 112 49.08333
## 2 118 49.16667
## 3 132 49.25000
## 4 129 49.33333
## 5 121 49.41667
## 6 135 49.50000
Looking at our data set, we see that the data is monthly, but the month is only encoded in the decimal part of year rather than as a factor, so we’ll need to turn the months into a factor. We can do that in the following way:
justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear
mofactor <- factor(round(modecimal*12))
head(cbind(airpass$year, mofactor))
## mofactor
## [1,] 49.08333 2
## [2,] 49.16667 3
## [3,] 49.25000 4
## [4,] 49.33333 5
## [5,] 49.41667 6
## [6,] 49.50000 7
levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")
airpass$justyear <- justyear
airpass$mofactor <- mofactor
So, now, if we view the first few entries of our data set, we can see that the months are factors instead of decimals.
head(airpass)
## pass year justyear mofactor
## 1 112 49.08333 49 Feb
## 2 118 49.16667 49 Mar
## 3 132 49.25000 49 Apr
## 4 129 49.33333 49 May
## 5 121 49.41667 49 Jun
## 6 135 49.50000 49 Jul
Then, let’s make our model. This is similar to making a multiple linear regression model.
monthmod <- lm(log(pass)~justyear + mofactor, data = airpass)
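We can look at the summary of the model and notice that all of the coefficients corresponding to the different months are relative to the baseline level, which is labeled Jan. Thus, each month coefficient estimates how the log number of passengers differs from that baseline month, holding the year fixed. One caution: round(modecimal*12) is 0 whenever year is a whole number, and that 0 becomes the first factor level and so receives the label Jan, while the first row (year 49.08333) is labeled Feb. If year marks the end of each month, every label is one month late, so the labels should be checked against the raw data before interpreting individual months.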
summary(monthmod)
##
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214998 0.081277 -14.949 < 2e-16 ***
## justyear 0.120826 0.001432 84.399 < 2e-16 ***
## mofactorFeb 0.031390 0.024253 1.294 0.198
## mofactorMar 0.019404 0.024253 0.800 0.425
## mofactorApr 0.159700 0.024253 6.585 1.00e-09 ***
## mofactorMay 0.138500 0.024253 5.711 7.19e-08 ***
## mofactorJun 0.146196 0.024253 6.028 1.58e-08 ***
## mofactorJul 0.278411 0.024253 11.480 < 2e-16 ***
## mofactorAug 0.392422 0.024253 16.180 < 2e-16 ***
## mofactorSep 0.393196 0.024253 16.212 < 2e-16 ***
## mofactorOct 0.258630 0.024253 10.664 < 2e-16 ***
## mofactorNov 0.130541 0.024253 5.382 3.28e-07 ***
## mofactorDec -0.003108 0.024253 -0.128 0.898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.982
## F-statistic: 649.4 on 12 and 131 DF, p-value: < 2.2e-16
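The baseline behavior in the summary above comes from R's default treatment coding for factors. As a minimal sketch, using a toy three-level factor (hypothetical, not the mofactor from the example), model.matrix() shows which dummy columns lm() would build, and relevel() shows how the baseline could be changed.

```r
# Toy month factor (hypothetical, three levels only) to show dummy coding
mo <- factor(c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
             levels = c("Jan", "Feb", "Mar"))

# Default treatment contrasts: the first level (Jan) is absorbed
# into the intercept, and the other levels get one column each
X <- model.matrix(~ mo)
print(colnames(X))

# relevel() moves another level to the front, so coefficients
# would then be measured relative to Mar instead of Jan
mo_mar <- relevel(mo, ref = "Mar")
X_mar <- model.matrix(~ mo_mar)
print(colnames(X_mar))
```

This assumes R's default contrast settings. With 12 monthly levels the same coding yields the 11 month coefficients seen in the summary.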
Instead of using months as our dummy variables, we could also have collapsed the factor down to seasons. However, we will leave our model at this stage, since the monthly factor keeps more detail about the seasonal pattern than a seasonal factor would.
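For reference, collapsing months into seasons could be sketched as follows. The month factor here is a hypothetical stand-in built from month.abb, and the grouping into seasons (Dec to Feb as winter, and so on) is my own assumption rather than something we settled on in class.

```r
# Hypothetical month factor with levels in calendar order
# (a stand-in, not the mofactor built in the example)
mo <- factor(rep(month.abb, times = 2), levels = month.abb)

# Copy the factor, then merge levels: each list element names a
# new level and lists the old levels it absorbs
season <- mo
levels(season) <- list(
  Winter = c("Dec", "Jan", "Feb"),
  Spring = c("Mar", "Apr", "May"),
  Summer = c("Jun", "Jul", "Aug"),
  Fall   = c("Sep", "Oct", "Nov")
)
print(table(season))
```

A model using such a factor in place of the monthly one would estimate 3 seasonal coefficients instead of 11.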
Finally, we can plot our model to see how it compares to the actual trend of the data.
plot(log(pass)~year, data = airpass, type = "l")
lines(airpass$year, monthmod$fitted.values, type = "l", col = "red")
The red line is our model and the black line indicates the actual data. The model seems to fit our data pretty well.
Since these concepts are central to time series analysis, this topic fits very well into our course. They also fit well with what we have learned about time series modelling, because they allow us to add complexity to our model so that it more accurately represents the data.
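Because the model was fit to log(pass), its fitted values are on the log scale. A rough sketch of bringing them back to the original units is below. It uses simulated data (hypothetical, standing in for airpass), and the names d, fit, and fit_orig are made up for this illustration.

```r
# Simulated data (hypothetical, standing in for airpass)
set.seed(2)
idx <- 1:48
mo <- factor(rep(month.abb, 4), levels = month.abb)
yr <- rep(1:4, each = 12)
y <- 100 * exp(0.1 * yr) * (1 + 0.2 * sin(2 * pi * idx / 12)) *
  exp(rnorm(48, sd = 0.02))
d <- data.frame(y = y, yr = yr, mo = mo)

# Same model form as in the example: log response, year trend, month dummies
fit <- lm(log(y) ~ yr + mo, data = d)

# fitted() returns values on the log scale;
# exp() maps them back to the original units
fit_orig <- exp(fitted(fit))

plot(d$y, type = "l")
lines(fit_orig, col = "red")
```

One caveat: exponentiating a fitted log value does not give exactly the fitted mean on the original scale, so the back-transformed line is best read as an approximate summary rather than an unbiased prediction.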
I received some feedback today on the formatting of our paper and a few organizational comments. However, the more important feedback I received today involved the content of the paper and the analysis that we produced. One comment asked us to be more specific about the transformations we tried and why we chose not to use them. I was also asked to give possible reasons for the insignificant predictors that we found in our model. Finally, a major critique I received was to emphasize the most important findings of our study more.
I will definitely take the formatting and organizational comments to heart and make those alterations in our paper. To address the comment pertaining to transformations, I will go into more detail about the process we went through with transformations and how they did not improve our model. I will address the comment relating to insignificant predictors by discussing these predictors to a greater extent in the paper and looking at the data set to see if there could be issues with the data that are causing us to conclude that the predictors are insignificant. I will also likely look into further studies relating to our topic to figure out whether our findings are standard or an anomaly. Lastly, I will address the emphasis critique by expanding on the more important findings of our study, since they currently blend in with the other information that we provide in the paper.