First we will look at the airpass data within the faraway package to look at the trend of passengers over time. First we will call up the data and look at the plot that will show the relationship between the two variables.
library(faraway)
## Warning: package 'faraway' was built under R version 3.4.4
data("airpass")
names(airpass)
## [1] "pass" "year"
Now that we know the names of the variables we can look at the plot.
plot(pass ~ year, data = airpass, type = "l")
By looking at the graph there is definitely a seasonal variation, but it appears to be increasing seasonal variation, so we will attempt to find a transform that there will be constant variation.
plot(log(pass) ~ year, data = airpass , type = "l")
The log transformation appears to take care of the increasing seasonal variation, but let’s look at the square root to see what plot that provides.
plot(sqrt(pass) ~ year, data = airpass , type = "l")
Log appears to be the best transformation, you can see that square root still shows that increasing variation.
Now we will look at the Mitchell data set within the alr3 package for some more practice with seasonal variation.
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
##
## logit, vif
##
## Attaching package: 'alr3'
## The following objects are masked from 'package:faraway':
##
## cathedral, pipeline, twins
data("Mitchell")
names(Mitchell)
## [1] "Month" "Temp"
we want to look at the plot of temperature over time.
plot(Temp ~ Month , data = Mitchell , type = "l")
We are happy with this plot. Seasonal variation is certainly present, but there is no obvious increase in the variance. With this example, we would not need to use any of the transformations because it appears to already be constant seasonal variation.
Now we will create a model using dummy variables to model seasonality. We can also use transformations if necessary to achieve constant seasonal variation. We will use the airpass data set.
library(faraway)
data(airpass)
We know from our work earlier that the log(pass) transformation gave this data set the best seasonal variance. Now we want to look at the data on a monthly basis, so we will need to create a new variable for each month. To do this we will have to isolate each month as a decimal and rename it.
justyear <- floor(airpass$year)
modecimal <- airpass$year - justyear
Now we have each month as a decimal. We will know multiple all of the data points by 12 to get an integer for each month.
mofactor <-factor(round(modecimal*12))
Now we can check our work.
cbind(airpass$year, mofactor)
## mofactor
## [1,] 49.08333 2
## [2,] 49.16667 3
## [3,] 49.25000 4
## [4,] 49.33333 5
## [5,] 49.41667 6
## [6,] 49.50000 7
## [7,] 49.58333 8
## [8,] 49.66667 9
## [9,] 49.75000 10
## [10,] 49.83333 11
## [11,] 49.91667 12
## [12,] 50.00000 1
## [13,] 50.08333 2
## [14,] 50.16667 3
## [15,] 50.25000 4
## [16,] 50.33333 5
## [17,] 50.41667 6
## [18,] 50.50000 7
## [19,] 50.58333 8
## [20,] 50.66667 9
## [21,] 50.75000 10
## [22,] 50.83333 11
## [23,] 50.91667 12
## [24,] 51.00000 1
## [25,] 51.08333 2
## [26,] 51.16667 3
## [27,] 51.25000 4
## [28,] 51.33333 5
## [29,] 51.41667 6
## [30,] 51.50000 7
## [31,] 51.58333 8
## [32,] 51.66667 9
## [33,] 51.75000 10
## [34,] 51.83333 11
## [35,] 51.91667 12
## [36,] 52.00000 1
## [37,] 52.08333 2
## [38,] 52.16667 3
## [39,] 52.25000 4
## [40,] 52.33333 5
## [41,] 52.41667 6
## [42,] 52.50000 7
## [43,] 52.58333 8
## [44,] 52.66667 9
## [45,] 52.75000 10
## [46,] 52.83333 11
## [47,] 52.91667 12
## [48,] 53.00000 1
## [49,] 53.08333 2
## [50,] 53.16667 3
## [51,] 53.25000 4
## [52,] 53.33333 5
## [53,] 53.41667 6
## [54,] 53.50000 7
## [55,] 53.58333 8
## [56,] 53.66667 9
## [57,] 53.75000 10
## [58,] 53.83333 11
## [59,] 53.91667 12
## [60,] 54.00000 1
## [61,] 54.08333 2
## [62,] 54.16667 3
## [63,] 54.25000 4
## [64,] 54.33333 5
## [65,] 54.41667 6
## [66,] 54.50000 7
## [67,] 54.58333 8
## [68,] 54.66667 9
## [69,] 54.75000 10
## [70,] 54.83333 11
## [71,] 54.91667 12
## [72,] 55.00000 1
## [73,] 55.08333 2
## [74,] 55.16667 3
## [75,] 55.25000 4
## [76,] 55.33333 5
## [77,] 55.41667 6
## [78,] 55.50000 7
## [79,] 55.58333 8
## [80,] 55.66667 9
## [81,] 55.75000 10
## [82,] 55.83333 11
## [83,] 55.91667 12
## [84,] 56.00000 1
## [85,] 56.08333 2
## [86,] 56.16667 3
## [87,] 56.25000 4
## [88,] 56.33333 5
## [89,] 56.41667 6
## [90,] 56.50000 7
## [91,] 56.58333 8
## [92,] 56.66667 9
## [93,] 56.75000 10
## [94,] 56.83333 11
## [95,] 56.91667 12
## [96,] 57.00000 1
## [97,] 57.08333 2
## [98,] 57.16667 3
## [99,] 57.25000 4
## [100,] 57.33333 5
## [101,] 57.41667 6
## [102,] 57.50000 7
## [103,] 57.58333 8
## [104,] 57.66667 9
## [105,] 57.75000 10
## [106,] 57.83333 11
## [107,] 57.91667 12
## [108,] 58.00000 1
## [109,] 58.08333 2
## [110,] 58.16667 3
## [111,] 58.25000 4
## [112,] 58.33333 5
## [113,] 58.41667 6
## [114,] 58.50000 7
## [115,] 58.58333 8
## [116,] 58.66667 9
## [117,] 58.75000 10
## [118,] 58.83333 11
## [119,] 58.91667 12
## [120,] 59.00000 1
## [121,] 59.08333 2
## [122,] 59.16667 3
## [123,] 59.25000 4
## [124,] 59.33333 5
## [125,] 59.41667 6
## [126,] 59.50000 7
## [127,] 59.58333 8
## [128,] 59.66667 9
## [129,] 59.75000 10
## [130,] 59.83333 11
## [131,] 59.91667 12
## [132,] 60.00000 1
## [133,] 60.08333 2
## [134,] 60.16667 3
## [135,] 60.25000 4
## [136,] 60.33333 5
## [137,] 60.41667 6
## [138,] 60.50000 7
## [139,] 60.58333 8
## [140,] 60.66667 9
## [141,] 60.75000 10
## [142,] 60.83333 11
## [143,] 60.91667 12
## [144,] 61.00000 1
Looks good so far, we see a month integer matching up for every data point. Now let’s change those levels to the month names.
levels(mofactor) <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec")
summary(mofactor)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 12 12 12 12 12 12 12 12 12 12 12 12
Here we see that there was 12 data points recorded for each month. Let’s check our work again by looking at the last few data points.
tail(mofactor)
## [1] Aug Sep Oct Nov Dec Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
tail(airpass$year)
## [1] 60.58333 60.66667 60.75000 60.83333 60.91667 61.00000
These values look to be accurate. Now to make sure nothing odd is occurring in the middle, let’s look at a random sample.
randoms <- sample(1:nrow(airpass), 10, replace=FALSE)
mofactor[randoms]
## [1] Apr Apr Sep Apr Jan Feb Feb Nov Sep Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
airpass[randoms,"year"]
## [1] 50.25000 53.25000 54.66667 58.25000 52.00000 57.08333 56.08333
## [8] 50.83333 53.66667 55.91667
Looks good, we can now continue with our new monthly levels. Now we will create a model with a dummy variable for each month.
mod <- lm(log(pass) ~ justyear + mofactor, data=airpass)
coef(mod)
## (Intercept) justyear mofactorFeb mofactorMar mofactorApr
## -1.214997885 0.120825657 0.031389870 0.019403852 0.159699778
## mofactorMay mofactorJun mofactorJul mofactorAug mofactorSep
## 0.138499729 0.146195892 0.278410898 0.392422029 0.393195995
## mofactorOct mofactorNov mofactorDec
## 0.258630198 0.130540762 -0.003108143
summary(mod)
##
## Call:
## lm(formula = log(pass) ~ justyear + mofactor, data = airpass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156370 -0.041016 0.003677 0.044069 0.132324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214998 0.081277 -14.949 < 2e-16 ***
## justyear 0.120826 0.001432 84.399 < 2e-16 ***
## mofactorFeb 0.031390 0.024253 1.294 0.198
## mofactorMar 0.019404 0.024253 0.800 0.425
## mofactorApr 0.159700 0.024253 6.585 1.00e-09 ***
## mofactorMay 0.138500 0.024253 5.711 7.19e-08 ***
## mofactorJun 0.146196 0.024253 6.028 1.58e-08 ***
## mofactorJul 0.278411 0.024253 11.480 < 2e-16 ***
## mofactorAug 0.392422 0.024253 16.180 < 2e-16 ***
## mofactorSep 0.393196 0.024253 16.212 < 2e-16 ***
## mofactorOct 0.258630 0.024253 10.664 < 2e-16 ***
## mofactorNov 0.130541 0.024253 5.382 3.28e-07 ***
## mofactorDec -0.003108 0.024253 -0.128 0.898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0593 on 131 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.982
## F-statistic: 649.4 on 12 and 131 DF, p-value: < 2.2e-16
From the summary we can see that the month of January is tied in with the intercept coefficient. WE can see that there are more passengers that fly in the summer months because the intercept values are larger than in the winter months. Also because the coefficient values of the summer months are small, we know that fewer people are flying in the winter months.
with(airpass, plot(log(pass)~ year, type="l" ))
lines(airpass$year, mod$fitted.values,col="blue")
Looking at the comparison of the plots of the transformation data and our newly created model, they look nearly identical to each other.
I received some very useful feedback and advice from the peer review, and I hope that I was able to provide helpful tips as well. Some of my main take aways as far as improvements that needed to be made regarding our paper was explaining the importance of the predictors that we used in the models. Whether it be from previous studies, or just something that we hypothesized would be important, it should be mentioned. There were also a few areas where sentences were unclear, and plots were not positioned close enough in proximity to the paragraph that describes it. The also main piece of advice was to go back through and adjust so that the same verb tense is maintained throughout. This mistake is easily made with four authors of a paper, but it is important for it to be corrected so that there is consistency through the paper.
The main piece of advice I gave was to make sure to reference studies similar in nature in the introduction, as well as compare the findings of the similar study to your own. If your results don’t match theirs, that doesn’t mean you are wrong, but it does bring up an interesting talking point as to why you found different predictors to be significant. I also stressed the important of plots to help the reader visualize what is talked about in the text. You don’t want to include too many plots, but they can be helpful in some areas to better relay the desired info.