In class on Friday, we learned how to model seasonal variation and conducted peer review for our group projects.
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data(airpass)
Graphing the airpass time series data, we can see that the seasonal variation increases in magnitude as time goes on. That means that we are dealing with Increasing Seasonal variation. To fix this, we must transform the data.
with(airpass, plot(pass~year, type = "l"))
with(airpass, plot(log(pass)~year, type = "l"))
Doing a natural log of the passenger count makes our seasonal variation constant.
justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear
mofactor <-factor(round(modecimal*12))
Follow the above commands to turn decimal years into years and months.
cbind(airpass$year, mofactor)
## mofactor
## [1,] 49.08333 2
## [2,] 49.16667 3
## [3,] 49.25000 4
## [4,] 49.33333 5
## [5,] 49.41667 6
## [6,] 49.50000 7
## [7,] 49.58333 8
## [8,] 49.66667 9
## [9,] 49.75000 10
## [10,] 49.83333 11
## [11,] 49.91667 12
## [12,] 50.00000 1
## [13,] 50.08333 2
## [14,] 50.16667 3
## [15,] 50.25000 4
## [16,] 50.33333 5
## [17,] 50.41667 6
## [18,] 50.50000 7
## [19,] 50.58333 8
## [20,] 50.66667 9
## [21,] 50.75000 10
## [22,] 50.83333 11
## [23,] 50.91667 12
## [24,] 51.00000 1
## [25,] 51.08333 2
## [26,] 51.16667 3
## [27,] 51.25000 4
## [28,] 51.33333 5
## [29,] 51.41667 6
## [30,] 51.50000 7
## [31,] 51.58333 8
## [32,] 51.66667 9
## [33,] 51.75000 10
## [34,] 51.83333 11
## [35,] 51.91667 12
## [36,] 52.00000 1
## [37,] 52.08333 2
## [38,] 52.16667 3
## [39,] 52.25000 4
## [40,] 52.33333 5
## [41,] 52.41667 6
## [42,] 52.50000 7
## [43,] 52.58333 8
## [44,] 52.66667 9
## [45,] 52.75000 10
## [46,] 52.83333 11
## [47,] 52.91667 12
## [48,] 53.00000 1
## [49,] 53.08333 2
## [50,] 53.16667 3
## [51,] 53.25000 4
## [52,] 53.33333 5
## [53,] 53.41667 6
## [54,] 53.50000 7
## [55,] 53.58333 8
## [56,] 53.66667 9
## [57,] 53.75000 10
## [58,] 53.83333 11
## [59,] 53.91667 12
## [60,] 54.00000 1
## [61,] 54.08333 2
## [62,] 54.16667 3
## [63,] 54.25000 4
## [64,] 54.33333 5
## [65,] 54.41667 6
## [66,] 54.50000 7
## [67,] 54.58333 8
## [68,] 54.66667 9
## [69,] 54.75000 10
## [70,] 54.83333 11
## [71,] 54.91667 12
## [72,] 55.00000 1
## [73,] 55.08333 2
## [74,] 55.16667 3
## [75,] 55.25000 4
## [76,] 55.33333 5
## [77,] 55.41667 6
## [78,] 55.50000 7
## [79,] 55.58333 8
## [80,] 55.66667 9
## [81,] 55.75000 10
## [82,] 55.83333 11
## [83,] 55.91667 12
## [84,] 56.00000 1
## [85,] 56.08333 2
## [86,] 56.16667 3
## [87,] 56.25000 4
## [88,] 56.33333 5
## [89,] 56.41667 6
## [90,] 56.50000 7
## [91,] 56.58333 8
## [92,] 56.66667 9
## [93,] 56.75000 10
## [94,] 56.83333 11
## [95,] 56.91667 12
## [96,] 57.00000 1
## [97,] 57.08333 2
## [98,] 57.16667 3
## [99,] 57.25000 4
## [100,] 57.33333 5
## [101,] 57.41667 6
## [102,] 57.50000 7
## [103,] 57.58333 8
## [104,] 57.66667 9
## [105,] 57.75000 10
## [106,] 57.83333 11
## [107,] 57.91667 12
## [108,] 58.00000 1
## [109,] 58.08333 2
## [110,] 58.16667 3
## [111,] 58.25000 4
## [112,] 58.33333 5
## [113,] 58.41667 6
## [114,] 58.50000 7
## [115,] 58.58333 8
## [116,] 58.66667 9
## [117,] 58.75000 10
## [118,] 58.83333 11
## [119,] 58.91667 12
## [120,] 59.00000 1
## [121,] 59.08333 2
## [122,] 59.16667 3
## [123,] 59.25000 4
## [124,] 59.33333 5
## [125,] 59.41667 6
## [126,] 59.50000 7
## [127,] 59.58333 8
## [128,] 59.66667 9
## [129,] 59.75000 10
## [130,] 59.83333 11
## [131,] 59.91667 12
## [132,] 60.00000 1
## [133,] 60.08333 2
## [134,] 60.16667 3
## [135,] 60.25000 4
## [136,] 60.33333 5
## [137,] 60.41667 6
## [138,] 60.50000 7
## [139,] 60.58333 8
## [140,] 60.66667 9
## [141,] 60.75000 10
## [142,] 60.83333 11
## [143,] 60.91667 12
## [144,] 61.00000 1
Now, we want to assign categorical labels to each month number.
levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")
summary(mofactor)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 12 12 12 12 12 12 12 12 12 12 12 12
Great, there are exactly 12 data points for each month.
To check your work, select 10 months at random and see if they are correct.
randoms <- sample(1:nrow(airpass), 10, replace=FALSE)
mofactor[randoms]
## [1] Jan Nov Apr Mar Apr Sep Dec May May Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
airpass[randoms,"year"]
## [1] 55.00000 51.83333 51.25000 60.16667 57.25000 57.66667 59.91667
## [8] 54.33333 59.33333 54.25000
To add the variables into the data:
airpass$justyear <- justyear
airpass$mofactor <- mofactor
Now, just flow the new variables thru as predictors.
mod <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(mod)
## (Intercept) justyear mofactorFeb mofactorMar mofactorApr
## -1.214997885 0.120825657 0.031389870 0.019403852 0.159699778
## mofactorMay mofactorJun mofactorJul mofactorAug mofactorSep
## 0.138499729 0.146195892 0.278410898 0.392422029 0.393195995
## mofactorOct mofactorNov mofactorDec
## 0.258630198 0.130540762 -0.003108143
summary(mod)
##
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214998 0.081277 -14.949 < 2e-16 ***
## justyear 0.120826 0.001432 84.399 < 2e-16 ***
## mofactorFeb 0.031390 0.024253 1.294 0.198
## mofactorMar 0.019404 0.024253 0.800 0.425
## mofactorApr 0.159700 0.024253 6.585 1.00e-09 ***
## mofactorMay 0.138500 0.024253 5.711 7.19e-08 ***
## mofactorJun 0.146196 0.024253 6.028 1.58e-08 ***
## mofactorJul 0.278411 0.024253 11.480 < 2e-16 ***
## mofactorAug 0.392422 0.024253 16.180 < 2e-16 ***
## mofactorSep 0.393196 0.024253 16.212 < 2e-16 ***
## mofactorOct 0.258630 0.024253 10.664 < 2e-16 ***
## mofactorNov 0.130541 0.024253 5.382 3.28e-07 ***
## mofactorDec -0.003108 0.024253 -0.128 0.898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.982
## F-statistic: 649.4 on 12 and 131 DF, p-value: < 2.2e-16
How can we tell the summer months have more passengers than January? The point estimate for the slope for summer month predictors is higher than those of winter months.
How can we tell the winter months are similar to January wrt number of passengers? They are less significant to the model because the intercept already includes January, so by adding February, March, and December, we are not contributing any useful information to the model.
with(airpass, plot(log(pass)~ year, type="l" ))
lines(airpass$year, mod$fitted.values,col="blue")
We are predicting very well.
We received a lot of feedback in our peer review session. I do not think that we had as complete of a draft as we should have prior to a peer review session, so much of the information was obvious to me, but we did come away with a few things to try in our project. Our lit review was not as related to our project mission as it could have been. We had a long paragraph on how movie studios do “cold openings” of movies to avoid being ripped apart by film critics, but our project was more about predicting critic behavior rather than movie performance. Our methods section included all of our predictor variables, but left out the response, which made our reviewers confused as to what we were predicting. We had too many plots cluttering up the page and not enough interpretation of the graphs. Something we definitely need to add is the checking of the regression assumptions. We received positive feedback on the discussion section of our paper, but I fear that it may be too long and wordy. When we have all of the changes made in our paper, that will be the first section that we will trim down. Our next step is to create a good presentation on this topic and I think that through preparing for a presentation, we will also get a sense of what else needs to be included in our paper. Some other next steps include giving our paper a table of contents and more readability.