Homework 1 - Predictive Analytics & Forecasting

Ioannis (Yannos) Michailidis

###########################################################################################################

rm(list = ls())

Load useful libraries

library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(zoo)
library(data.table)
library(fpp2)
library(seasonal)

CHAPTER 5

###########################################################################################################

Question 1

Get data for first 20 days:

daily20 = head(elecdaily, 20)

(a)

Plot data

autoplot(daily20, facets=TRUE)

Plot demand against temperature"

daily20 %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

Regression model for demand with temperature xvar:

fit.lm1 = tslm(Demand ~ Temperature, data = daily20)
summary(fit.lm1)

## 
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.060  -7.117  -1.437  17.484  27.102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2117    17.9915   2.179   0.0428 *  
## Temperature   6.7572     0.6114  11.052 1.88e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22 on 18 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8644 
## F-statistic: 122.1 on 1 and 18 DF,  p-value: 1.876e-09

There is a positive relationship (Temperature coefficient = 6.7572) because the higher the temperature the more electricity is used for air conditioning. Intuitively, in very hot days more households turn on their AC for cooling increasing electricity demand.

(b)

Look at residuals:

checkresiduals(fit.lm1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774

Also plot residuals against predictor:

df_daily20 <- as.data.frame(daily20)
df_daily20[,"Residuals"]  <- as.numeric(residuals(fit.lm1))
ggplot(df_daily20, aes(x=Temperature, y=Residuals)) +
  geom_point()

As we can see from the Breusch-Godfrey test, the p-value is high and therefore the test is insignificant. As a result, we can assume that there is no significant autocorrelation remaining in the residuals. In addition, despite the residual distribution being slightly skewed to the left and seeing an ACF spike at Lag 4 (and a couple outliers) autocorrelation is not particularly large. Also we can observe in the residuals vs predictor (Temperature) plot that points seem to be randomly scattered. Therefore, we can assume that the model is adequate.

(c)

Forecast for T=15 & T=35:

fc_new <- forecast(fit.lm1, h=1, newdata=data.frame(Temperature = c(15,35)))
fc_new

##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 3.857143       140.5701 108.6810 172.4591  90.21166 190.9285
## 4.000000       275.7146 245.2278 306.2014 227.57056 323.8586

Plot forecasts:

autoplot(fc_new)

Yes, I do believe the forecasts for Temperatures of 15C and 35C. For the fist we have a predicted average demand of 140.5701 which is reasonable as it is relatively lower to other historical points where temperature was slighlty higher by a sensible amount. For the latter we have a predicted average demand of 275.7146 which is also reasonable as it is slighlty higher than demand for the historical point in time where Temperature = 34.

(d)

We can use the forecast object to get the 80% and 90% prediction intervals.

80% intervals:

# for  T=15
fc_new$upper[1, 1]

## [1] 172.4591

fc_new$lower[1, 1]

## [1] 108.681

# for T=35
fc_new$upper[2, 1]

## [1] 306.2014

fc_new$lower[2, 1]

## [1] 245.2278

95% intervals:

# for T=15
fc_new$upper[1, 2]

## [1] 190.9285

fc_new$lower[1, 2]

## [1] 90.21166

# for T=35
fc_new$upper[2, 2]

## [1] 323.8586

fc_new$lower[2, 2]

## [1] 227.5706

(d)

Plot Temperature vs Demand:

elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

As we can see, a linear model like the one trained for the first 20 days would NOT be appropriate for the total dataset. It is indeed able to fit the data fine for the initial 20 data points but it would produce wrong predictions - especially for low and high temperature values in the complete dataset. The sample that we used to trained the linear model is not representative of the total data and would therefore be inadequate.

Question 2

(a)

head(mens400)

## Time Series:
## Start = 1896 
## End = 1916 
## Frequency = 0.25 
## [1] 54.2 49.4 49.2 50.0 48.2   NA

autoplot(mens400)

There are 3 main observations in the Year vs Winning time plot above:

That there is an overall decline as the Year increases (winning times are generally faster as years pass)
That the decline (negative slope) is not continuous (in some cases there is variation / increase in winning time in a year compared to the previous one)
That there is missing data (some years do not have recorded winning times)

(b)

Regression:

fit.lm2 = tslm(mens400 ~ time(mens400), data= mens400)

Get coefficient:

fit.lm2$coefficients[2]

## time(mens400) 
##   -0.06457385

Plot:

autoplot(mens400) + 
  geom_abline(slope=fit.lm2$coefficients[2],
              intercept=fit.lm2$coefficients[1], color="red")

As we can see above the fitted coefficient is approxiamtely -0.0646. As a result the winning times have been decreasing by an average rate of 0.0646 per year

(c)

df_mens400 <- as.data.frame(time(mens400))
df_mens400 <- df_mens400 %>%  rename(Time=x)
df_mens400[,"Residuals"]  <- as.numeric(residuals(fit.lm2))

Plot Time vs Residuals:

ggplot(df_mens400, aes(x=Time, y=Residuals)) +
  geom_point()

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

## Warning: Removed 3 rows containing missing values (geom_point).

Check Residuals:

checkresiduals(fit.lm2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295

As we can observe from the two plots above and the Breusch-Godfrey test the fitted line is quite suitable for the data. Overall the test suggests that there is no significant autocorrelation left. The plots show that the line fits the data well with some exceptions (outlier in the very first) data point of year 1896.

(d)

Years with missing data are not taken into account when fitting the model.

Time = time(mens400)

Fit:

fit.lm2 <- lm(mens400 ~ Time,data = mens400,
              na.action = na.exclude)

Forecast:

fc_new <- forecast(fit.lm2, newdata = data.frame(Time = 2020))

Plot:

autoplot(mens400) + autolayer(fc_new, PI = TRUE)

80% prediction intervals:

fc_new$upper[,1]

## [1] 43.63487

fc_new$lower[,1]

## [1] 40.44975

94% prediction intervals:

fc_new$upper[,2]

## [1] 44.53176

fc_new$lower[,2]

## [1] 39.55286

The above prediction intervals are calculated with the assumption that residuals follow a normal distribution (which is not exactly the case as we have seen from our check residuals analysis previously)

Question 3

easter(ausbeer)

##      Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00

We observe a vector of quarterly data for the fraction of time easter holiday days fall within each quarter. In most cases we have 0s and 1s. When the value is 0 it means that easter did not fall within that quarter and if it is 1 that easter completely fell within that quarter. If the value is a fraction it means that a fraction of easter (eg 0.67 of easter days) fall within that quarter and the rest (eg 0.33) fall within the previous or next quarter.

Question 4

As we know in a log-log model the estimated coefficients are basically elasticity values. For example in the following model:

# log y = β0 + β1 log x + e

β1 is n elasticity coefficient: the ratio of the percentage changein y to the percentage change in x. We can prove this by expressing y as a function of x:

# log y = β0 + β1 log x + e =>  
# y = e^( β0 + β1 log x + e)

and now we can differentiate with respect to x:

# dy/dx = d/dx(e^( β0 + β1 log x + e)) =>
# dy/dx = (β1/x)e^( β0 + β1 log x + e) =>
# dy/dx = β1*y/x =>
# β1 = (x/dx) * (dy/y) =>
# β1 = (x/dx) / (y/dy)

# Which is the ratio of the percentage change in x over the percentage change in y aka an elasticity

Question 5

(a)

autoplot(fancy)

The following patterns can be observed:

Very obvious seasonality. There are very high sales spikes every year during Christmas (and higher demand in the couple of months preceding Christmas) and smaller spikes every March when the local surg festival is held
Monthly sales and demand spikes are getting larger and larger especially in years after 1991 (when the spike did not increase). The following years the peak demand was increasing at a greater rate than regular non-seasonal sales. As a result, we can see that during these years - probably when the store is expanding its premises, range of products etc - sales are not increasing linearly but exponentially (logarithmic model probably more appropriate)

(b)

As observed previously, seasonal variations (peaks in December etc) increase exponentially (no linear trend). As a result, it is necessary that we use a log transformation before fitting the data.

(c)

yearmonth <- time(fancy)

Create surfing festival dummy (1 if month is 3 aka March after 1988):

surfing_festival <- c()
for(i in 1:length(yearmonth)){
  M <- round(12*(yearmonth[i] - floor(yearmonth[i]))) + 1
  Y <- floor(yearmonth[i])
  if(Y >= 1988 & M == 3){
    surfing_festival[i] <- 1
  } else {
    surfing_festival[i] <- 0
  }
}

Fit log-linear model using BoxCox - log(sales) and linear trend, seasonal dummies and surfing festival dummy:

fit.log_lm5 <- tslm(BoxCox(fancy, 0) ~ trend + season + surfing_festival)
summary(fit.log_lm5)

## 
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ trend + season + surfing_festival)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.6196670  0.0742471 102.626  < 2e-16 ***
## trend            0.0220198  0.0008268  26.634  < 2e-16 ***
## season2          0.2514168  0.0956790   2.628 0.010555 *  
## season3          0.2660828  0.1934044   1.376 0.173275    
## season4          0.3840535  0.0957075   4.013 0.000148 ***
## season5          0.4094870  0.0957325   4.277 5.88e-05 ***
## season6          0.4488283  0.0957647   4.687 1.33e-05 ***
## season7          0.6104545  0.0958039   6.372 1.71e-08 ***
## season8          0.5879644  0.0958503   6.134 4.53e-08 ***
## season9          0.6693299  0.0959037   6.979 1.36e-09 ***
## season10         0.7473919  0.0959643   7.788 4.48e-11 ***
## season11         1.2067479  0.0960319  12.566  < 2e-16 ***
## season12         1.9622412  0.0961066  20.417  < 2e-16 ***
## surfing_festival 0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16

(d)

Plot residuals against time:

autoplot(fit.log_lm5$residuals)

Get df with residuals and fitted values:

df_5_res_fitted <- cbind(Residuals = fit.log_lm5$residuals,
                         Fitted = fit.log_lm5$fitted.values) %>%
  as.data.frame()

Plot residuals against fitted values:

df_5_res_fitted %>%
  ggplot(aes(x = Fitted, y = Residuals)) + geom_point()

As we can see there is a pattern in residuals as time changes. This means that the model is inadequate as residuals are correlated with the the predictor aka time.

(e)

Get df with residuals with month column:

df_5_res_month <- cbind.data.frame(
  Month = factor(month.abb[round(12*(yearmonth - floor(yearmonth)) + 1)],
                 labels = month.abb,
                 ordered = TRUE),
  Residuals = fit.log_lm5$residuals
  )

Boxplot with residuals by month:

df_5_res_month %>% ggplot(aes(x = Month, y = Residuals)) + geom_boxplot()

## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

The monthly residuals’ boxplots could reveal some issues with the model. As we can see distributions for each month vary a good amount as well as their symmetry - not all of them seem symmetrical. Also median values vary and are not all close to 0 (possibly affecting normality assumption for residuals)

(f)

Get model coefficients:

fit.log_lm5$coefficients

##      (Intercept)            trend          season2          season3 
##       7.61966701       0.02201983       0.25141682       0.26608280 
##          season4          season5          season6          season7 
##       0.38405351       0.40948697       0.44882828       0.61045453 
##          season8          season9         season10         season11 
##       0.58796443       0.66932985       0.74739195       1.20674790 
##         season12 surfing_festival 
##       1.96224123       0.50151509

We can make the following conclusions about our variables based on our model’s estimated coefficients:

There is a positive trend coefficient of 0.0220. That means that as time increases (years pass) sales also increase on average. For every additional year sales go up by approximately 2.2%.
All seasonal dummy coefficients are positive meaning that there is an additional effect for sales in all months relatively to the ommited dummy variable (season1 aka January). That means that on average all months are estimated to have higher sales than January (which is the lowest sales month). In addition, the highest additional effect is shown in December (as expected because of Christmas) as well as in the months prior to Christmas. The additional effect of season 3 which is when the festival takes place (March) is not particularly high because we have included the surfing_festival dummy
The surfing_festival dummy is positive (~0.50) meaning that there is an additional positive incremental effect of the the time of year when the surfing festival takes place on sales.

(g)

Check residuals:

checkresiduals(fit.log_lm5)

## 
##  Breusch-Godfrey test for serial correlation of order up to 17
## 
## data:  Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494

As we can see the p-value of the Breusch-Godfrey test is very small (0.00249 which is smaller than 0.05). This means that the test is is significant and that residuals are still correlated. This indicated autocorrelation means that we cannot conclude that residuals resemble white noise making the model inadequate.

(h)

Create future time series for 1994,1995,1996:

df_future <- ts(data = rep(0, 36), start = 1994, 
                end = c(1996, 12), frequency = 12)

Create surfing festival dummy for future df:

yearmonth <- time(df_future)

surfing_festival <- c()
for(i in 1:length(yearmonth)){
  M <- round(12*(yearmonth[i] - floor(yearmonth[i]))) + 1
  Y <- floor(yearmonth[i])
  if(Y >= 1988 & M == 3){
    surfing_festival[i] <- 1
  } else {
    surfing_festival[i] <- 0
  }
}

Forecast:

df_future_sales <- forecast(
  fit.log_lm5, newdata = data.frame(Time = yearmonth,
                                    surfing_festival = surfing_festival)
  )

Get predictions with 80% and 95% prediction intervals (note that predictions and intervals below are not raw sales but logged):

df_future_sales

##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
## Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
## Mar 1994      10.302990 10.048860 10.557120  9.911228 10.69475
## Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
## May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
## Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
## Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
## Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
## Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
## Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995       9.755590  9.499844 10.011336  9.361338 10.14984
## Feb 1995      10.029027  9.773281 10.284773  9.634775 10.42328
## Mar 1995      10.567228 10.310518 10.823938 10.171489 10.96297
## Apr 1995      10.205703  9.949957 10.461449  9.811451 10.59996
## May 1995      10.253157  9.997411 10.508903  9.858904 10.64741
## Jun 1995      10.314518 10.058772 10.570264  9.920265 10.70877
## Jul 1995      10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995      10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995      10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995      10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995      11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995      11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996      10.019828  9.760564 10.279093  9.620151 10.41951
## Feb 1996      10.293265 10.034000 10.552530  9.893588 10.69294
## Mar 1996      10.831466 10.571566 11.091365 10.430810 11.23212
## Apr 1996      10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996      10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996      10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996      10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996      10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996      10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996      10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996      11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996      12.224288 11.965023 12.483552 11.824611 12.62396

Plot forecast:

autoplot(df_future_sales)

(i)

Transform predictions and intervals to raw data format:

df_fc_mean = exp(df_future_sales$mean)
df_fc_upper = exp(df_future_sales$upper)
df_fc_lower = exp(df_future_sales$lower)

Print transformed mean forecasts and prediction intervals:

df_fc_mean

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1994  13244.70  17409.81  29821.65  20774.16  21783.73  23162.27  27831.56
## 1995  17250.40  22675.20  38840.85  27057.06  28371.96  30167.42  36248.88
## 1996  22467.57  29533.04  50587.81  35240.15  36952.72  39291.20  47211.93
##            Aug       Sep       Oct       Nov       Dec
## 1994  27818.48  30848.42  34095.57  55176.84 120067.79
## 1995  36231.84  40178.16  44407.37  71864.42 156380.86
## 1996  47189.73  52329.57  57837.85  93598.96 203676.38

df_fc_upper

##               [,1]      [,2]
## Jan 1994  17054.73  19557.43
## Feb 1994  22418.00  25707.73
## Mar 1994  38450.24  44123.68
## Apr 1994  26750.16  30675.62
## May 1994  28050.15  32166.37
## Jun 1994  29825.24  34201.95
## Jul 1994  35837.72  41096.73
## Aug 1994  35820.87  41077.41
## Sep 1994  39722.43  45551.50
## Oct 1994  43903.67  50346.32
## Nov 1994  71049.28  81475.41
## Dec 1994 154607.08 177294.90
## Jan 1995  22277.59  25587.08
## Feb 1995  29283.31  33633.55
## Mar 1995  50208.44  57697.39
## Apr 1995  34942.16  40133.06
## May 1995  36640.25  42083.42
## Jun 1995  38958.95  44746.58
## Jul 1995  46812.70  53767.06
## Aug 1995  46790.69  53741.78
## Sep 1995  51887.06  59595.26
## Oct 1995  57348.77  65868.34
## Nov 1995  92807.48 106594.69
## Dec 1995 201954.08 231955.81
## Jan 1996  29117.46  33506.86
## Feb 1996  38274.14  44043.89
## Mar 1996  65602.25  75517.62
## Apr 1996  45670.42  52555.15
## May 1996  47889.88  55109.18
## Jun 1996  50920.48  58596.65
## Jul 1996  61185.57  70409.18
## Aug 1996  61156.80  70376.07
## Sep 1996  67817.91  78041.33
## Oct 1996  74956.52  86256.08
## Nov 1996 121302.09 139588.15
## Dec 1996 263959.89 303751.35

# first column is 80% interval and second 95% interval

df_fc_lower

##               [,1]       [,2]
## Jan 1994  10285.82   8969.583
## Feb 1994  13520.45  11790.284
## Mar 1994  23129.40  20155.412
## Apr 1994  16133.21  14068.696
## May 1994  16917.24  14752.395
## Jun 1994  17987.81  15685.969
## Jul 1994  21613.98  18848.111
## Aug 1994  21603.82  18839.249
## Sep 1994  23956.87  20891.193
## Oct 1994  26478.61  23090.230
## Nov 1994  42850.31  37366.903
## Dec 1994  93244.59  81312.400
## Jan 1995  13357.65  11629.938
## Feb 1995  17558.28  15287.252
## Mar 1995  30046.98  26146.972
## Apr 1995  20951.33  18241.435
## May 1995  21969.51  19127.918
## Jun 1995  23359.80  20338.387
## Jul 1995  28068.91  24438.412
## Aug 1995  28055.72  24426.922
## Sep 1995  31111.50  27087.467
## Oct 1995  34386.35  29938.733
## Nov 1995  55647.40  48449.831
## Dec 1995 121091.75 105429.448
## Jan 1996  17336.40  15065.329
## Feb 1996  22788.25  19802.984
## Mar 1996  39009.73  33887.802
## Apr 1996  27191.96  23629.808
## May 1996  28513.41  24778.151
## Jun 1996  30317.82  26346.183
## Jul 1996  36429.60  31657.322
## Aug 1996  36412.48  31642.439
## Sep 1996  40378.47  35088.887
## Oct 1996  44628.77  38782.394
## Nov 1996  72222.70  62761.521
## Dec 1996 157160.50 136572.460

# first column is 80% interval and second 95%

(j)

As we have observed so far the log-linear model is inadequate. There is still correlation within the residuals and as we can see from the last predictions plot the model does not capture accurately the exponential increase in peak sales.

As a result, we could improve predictions by fitting a log-log model to capture the exponential trend. In this case the interpretation of our estimated coefficients would be elasticity values.

Question 6

head(gasoline)

## Time Series:
## Start = 1991.1 
## End = 1991.19582477755 
## Frequency = 52.1785714285714 
## [1] 6.621 6.433 6.582 7.224 6.875 6.947

Consider only data up to the end of 2004:

df_gasoline <- window(gasoline, end = 2005)

Look at plot:

autoplot(df_gasoline, xlab = "Year")

(a)

Fit harmonic regression with trend models with different fourier terms:

fit.hr1 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=1))
fit.hr2 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=3))
fit.hr3 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=5))
fit.hr4 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=7))
fit.hr5 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=10))
fit.hr6 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=12))

Plot observed gasoline and fitted values for different models:

autoplot(df_gasoline) +
  autolayer(fit.hr1$fitted.values,
            series = as.character(1)) +
  autolayer(fit.hr2$fitted.values,
            series = as.character(3)) +
  autolayer(fit.hr3$fitted.values,
            series = as.character(5)) +
  autolayer(fit.hr4$fitted.values,
            series = as.character(7)) +
  autolayer(fit.hr5$fitted.values,
            series = as.character(10)) +
  autolayer(fit.hr6$fitted.values,
            series = as.character(12)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

We cannot tell much from plotting everything together so lets plot separately:

autoplot(df_gasoline) +
  autolayer(fit.hr1$fitted.values,
            series = as.character(1)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

autoplot(df_gasoline) +
  autolayer(fit.hr2$fitted.values,
            series = as.character(3)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

autoplot(df_gasoline) +
  autolayer(fit.hr3$fitted.values,
            series = as.character(5)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

autoplot(df_gasoline) +
  autolayer(fit.hr4$fitted.values,
            series = as.character(7)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

autoplot(df_gasoline) +
  autolayer(fit.hr5$fitted.values,
            series = as.character(10)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

autoplot(df_gasoline) +
  autolayer(fit.hr6$fitted.values,
            series = as.character(12)) +
  ylab("gasoline") +
  ggtitle("Weekly Gasoline Supply (mbpd)") +
  guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))

(b)

Lets select the appropriate number of fourier terms by minimizing AICc and CV:

Search and print AICc value for all possible number of fourier terms (lets say max k = 26 because freq/2 = 52.18/2 ~= 26):

for (k in 1:26){
  AICc <- CV(
    tslm(df_gasoline ~ trend + fourier(df_gasoline, K = k)))[['AICc']]
  print(paste0("AICc for ", k, " terms = ", AICc))
}

## [1] "AICc for 1 terms = -1813.20839072956"
## [1] "AICc for 2 terms = -1818.95741052303"
## [1] "AICc for 3 terms = -1862.83361533579"
## [1] "AICc for 4 terms = -1863.74227544095"
## [1] "AICc for 5 terms = -1872.72257839179"
## [1] "AICc for 6 terms = -1902.07737210877"
## [1] "AICc for 7 terms = -1912.67183911035"
## [1] "AICc for 8 terms = -1912.54204025102"
## [1] "AICc for 9 terms = -1910.97993898114"
## [1] "AICc for 10 terms = -1913.44100863917"
## [1] "AICc for 11 terms = -1913.22768793883"
## [1] "AICc for 12 terms = -1917.10993954982"
## [1] "AICc for 13 terms = -1913.17152209052"
## [1] "AICc for 14 terms = -1911.3588180895"
## [1] "AICc for 15 terms = -1906.98774556301"
## [1] "AICc for 16 terms = -1905.16529332884"
## [1] "AICc for 17 terms = -1903.78225572243"
## [1] "AICc for 18 terms = -1910.1407759961"
## [1] "AICc for 19 terms = -1906.43389513879"
## [1] "AICc for 20 terms = -1902.46881421726"
## [1] "AICc for 21 terms = -1898.39428251371"
## [1] "AICc for 22 terms = -1894.39830795924"
## [1] "AICc for 23 terms = -1890.69143456694"
## [1] "AICc for 24 terms = -1888.00086427751"
## [1] "AICc for 25 terms = -1885.12915004351"
## [1] "AICc for 26 terms = -1880.5002026393"

Search and print CV value for all possible number of fourier terms (lets say max k = 26 because freq/2 = 52.18/2 ~= 26):

for (k in 1:26){
  CV <- CV(
    tslm(df_gasoline ~ trend + fourier(df_gasoline, K = k)))[['CV']]
  print(paste0("CV for ", k, " terms = ", CV))
}

## [1] "CV for 1 terms = 0.0820192054454044"
## [1] "CV for 2 terms = 0.0813685429593528"
## [1] "CV for 3 terms = 0.0765884641420364"
## [1] "CV for 4 terms = 0.0764860811768495"
## [1] "CV for 5 terms = 0.0755364599938532"
## [1] "CV for 6 terms = 0.0725333015193832"
## [1] "CV for 7 terms = 0.0714733954288623"
## [1] "CV for 8 terms = 0.0714743127258836"
## [1] "CV for 9 terms = 0.0716145091624263"
## [1] "CV for 10 terms = 0.0713575412169122"
## [1] "CV for 11 terms = 0.0713648805476336"
## [1] "CV for 12 terms = 0.0709694493090534"
## [1] "CV for 13 terms = 0.0713380742903736"
## [1] "CV for 14 terms = 0.0714971412874022"
## [1] "CV for 15 terms = 0.0719086208045337"
## [1] "CV for 16 terms = 0.0720689398824101"
## [1] "CV for 17 terms = 0.0721860882730974"
## [1] "CV for 18 terms = 0.0715312552744315"
## [1] "CV for 19 terms = 0.0718716529738733"
## [1] "CV for 20 terms = 0.0722383395523968"
## [1] "CV for 21 terms = 0.0726160278362386"
## [1] "CV for 22 terms = 0.0729862490372426"
## [1] "CV for 23 terms = 0.0733286296560949"
## [1] "CV for 24 terms = 0.0735684811426977"
## [1] "CV for 25 terms = 0.0738256692802845"
## [1] "CV for 26 terms = 0.0742675686386856"

As we can observe we have an obvious first low AICc and CV value at K=7. However, both AICc and CV are minimized at K=12!

(c)

Check residuals:

checkresiduals(fit.hr6)

## 
##  Breusch-Godfrey test for serial correlation of order up to 104
## 
## data:  Residuals from Linear regression model
## LM test = 154.2, df = 104, p-value = 0.001021

(d)

Forecast using fitted model for next year (52 weeks horizon):

df_forecast_2005 <- forecast(
  fit.hr6, newdata=data.frame(fourier(df_gasoline, K = 12, h = 52))
  )

df_forecast_2005

##          Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 2005.014       8.713386 8.371087 9.055684 8.189474  9.237298
## 2005.033       8.642131 8.299720 8.984542 8.118047  9.166215
## 2005.052       8.656942 8.314540 8.999345 8.132871  9.181013
## 2005.071       8.661292 8.318899 9.003684 8.137236  9.185347
## 2005.090       8.686370 8.344109 9.028631 8.162515  9.210225
## 2005.110       8.784129 8.441982 9.126276 8.260448  9.307809
## 2005.129       8.895967 8.553823 9.238111 8.372291  9.419643
## 2005.148       8.937884 8.595756 9.280012 8.414232  9.461535
## 2005.167       8.940875 8.598761 9.282988 8.417246  9.464503
## 2005.186       8.988596 8.646477 9.330716 8.464958  9.512235
## 2005.205       9.062316 8.720191 9.404442 8.538669  9.585963
## 2005.225       9.073794 8.731673 9.415915 8.550153  9.597435
## 2005.244       9.031913 8.689800 9.374027 8.508284  9.555542
## 2005.263       9.041458 8.699342 9.383575 8.517824  9.565092
## 2005.282       9.122877 8.780756 9.464999 8.599236  9.646519
## 2005.301       9.169383 8.827263 9.511503 8.645744  9.693022
## 2005.320       9.132601 8.790487 9.474715 8.608970  9.656231
## 2005.340       9.120782 8.778667 9.462897 8.597150  9.644414
## 2005.359       9.221551 8.879431 9.563671 8.697912  9.745190
## 2005.378       9.335425 8.993306 9.677545 8.811787  9.859064
## 2005.397       9.316637 8.974522 9.658752 8.793006  9.840269
## 2005.416       9.214109 8.871994 9.556224 8.690478  9.737740
## 2005.435       9.218957 8.876838 9.561076 8.695320  9.742595
## 2005.455       9.383861 9.041741 9.725980 8.860222  9.907499
## 2005.474       9.545162 9.203046 9.887278 9.021530 10.068795
## 2005.493       9.559667 9.217553 9.901782 9.036037 10.083298
## 2005.512       9.487053 9.144935 9.829171 8.963416 10.010689
## 2005.531       9.464953 9.122834 9.807073 8.941315  9.988592
## 2005.550       9.508378 9.166261 9.850494 8.984744 10.032011
## 2005.570       9.534922 9.192808 9.877037 9.011292 10.058553
## 2005.589       9.521986 9.179869 9.864104 8.998351 10.045621
## 2005.608       9.522899 9.180779 9.865019 8.999260 10.046538
## 2005.627       9.548443 9.206326 9.890560 9.024808 10.072078
## 2005.646       9.536705 9.194591 9.878819 9.013075 10.060335
## 2005.665       9.451014 9.108897 9.793131 8.927380  9.974649
## 2005.685       9.330630 8.988509 9.672750 8.806990  9.854269
## 2005.704       9.229844 8.887726 9.571962 8.706208  9.753480
## 2005.723       9.170862 8.828748 9.512976 8.647232  9.694492
## 2005.742       9.168084 8.825968 9.510200 8.644451  9.691717
## 2005.761       9.230990 8.888869 9.573110 8.707349  9.754630
## 2005.780       9.320372 8.978252 9.662492 8.796734  9.844010
## 2005.800       9.363916 9.021801 9.706030 8.840285  9.887546
## 2005.819       9.342664 9.000548 9.684779 8.819032  9.866296
## 2005.838       9.304159 8.962036 9.646281 8.780516  9.827801
## 2005.857       9.267511 8.925389 9.609633 8.743869  9.791153
## 2005.876       9.194408 8.852292 9.536523 8.670776  9.718040
## 2005.895       9.100701 8.758587 9.442816 8.577070  9.624332
## 2005.915       9.104667 8.762540 9.446794 8.581017  9.628317
## 2005.934       9.265935 8.923806 9.608064 8.742282  9.789587
## 2005.953       9.436660 9.094538 9.778782 8.913018  9.960302
## 2005.972       9.400410 9.058293 9.742527 8.876776  9.924044
## 2005.991       9.147441 8.805233 9.489649 8.623668  9.671215

(e)

Plot all historical data with predictions up to end of 2005:

autoplot(df_forecast_2005) +
  autolayer(window(gasoline, start = 2005, end = 2006))

Zoom in between beginning of 2005 and end of 2005:

autoplot(df_forecast_2005) +
  autolayer(window(gasoline, start = 2005, end = 2006)) +
  scale_x_continuous(limits = c(2005, 2006))

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

## Warning: Removed 726 row(s) containing missing values (geom_path).

As we can observe the model does a pretty good job forecasting most weeks in 2005. The actual values fall within the 80% prediction interval (dark blue area) for most weeks and the 95% prediction interval (light blue area) for very few weeks. There is only one point in time (around end of Q3 of the year) that the actual value falls out of the prediction interval in which case we have a sudden drop in supply that is not captured by our model.

Question 7

head(huron)

## Time Series:
## Start = 1875 
## End = 1880 
## Frequency = 1 
## [1] 10.38 11.86 10.97 10.80  9.79 10.39

(a)

Plot:

autoplot(huron)

There are two main observations from the yearly water level data:

That there is not a super clear pattern (data is quite cyclic - there are ups and downs but without any obvious pattern)
That there is initially a downward trend until about 1920, whereas after that there is visible trend

(b)

Linear regression:

fit.lm7 <- tslm(huron ~ trend)
fit.lm7

## 
## Call:
## tslm(formula = huron ~ trend)
## 
## Coefficients:
## (Intercept)        trend  
##     10.2020      -0.0242

Piecewise linear trend model with knot at 1915:

t1 <- time(huron)
# define t2
t2 <- ts(pmax(0,t1-1915), start=1875)
# fit model
fit.plm7 <- tslm(huron ~ t1 + t2)
fit.plm7

## 
## Call:
## tslm(formula = huron ~ t1 + t2)
## 
## Coefficients:
## (Intercept)           t1           t2  
##   132.90870     -0.06498      0.06486

Look at fitted values plots for the 2 models:

autoplot(huron) +
  autolayer(fitted(fit.lm7), series = "Linear") +
  autolayer(fitted(fit.plm7), series = "Piecewise") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron Yearly Water Level") +
  guides(colour=guide_legend(title=" "))

As we can see from a first glance the piecewise linear model fits the data slightly better as it takes into account the change in trend occuring after 1915 (not downward after that)

(c)

Linear model forecasts:

df_forecast_lm_7 = forecast(fit.lm7, h=8)
df_forecast_lm_7

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 1973       7.806127 6.317648 9.294605 5.516501 10.095752
## 1974       7.781926 6.292536 9.271315 5.490899 10.072952
## 1975       7.757724 6.267406 9.248043 5.465269 10.050179
## 1976       7.733523 6.242259 9.224788 5.439613 10.027434
## 1977       7.709322 6.217094 9.201550 5.413929 10.004715
## 1978       7.685121 6.191912 9.178331 5.388219  9.982024
## 1979       7.660920 6.166712 9.155128 5.362481  9.959359
## 1980       7.636719 6.141494 9.131943 5.336717  9.936721

Piecewise linear model forecasts:

h <- 8
t1.new = t1[length(t1)] + seq(h)
t2.new = t2[length(t2)] + seq(h)
df_future = cbind(t1=t1.new, t2=t2.new)%>%
  as.data.frame()

df_forecast_plm_7 <- forecast(fit.plm7, newdata = df_future)
df_forecast_plm_7

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1973       8.455119 7.063583 9.846655 6.314483 10.59575
## 1974       8.454992 7.061518 9.848467 6.311374 10.59861
## 1975       8.454866 7.059398 9.850333 6.308182 10.60155
## 1976       8.454739 7.057225 9.852253 6.304906 10.60457
## 1977       8.454612 7.054998 9.854227 6.301549 10.60768
## 1978       8.454486 7.052717 9.856254 6.298109 10.61086
## 1979       8.454359 7.050384 9.858334 6.294587 10.61413
## 1980       8.454232 7.047997 9.860467 6.290984 10.61748

Plot fitted values and forecasts for both models:

autoplot(huron) +
  autolayer(fitted(fit.lm7), series = "Linear") +
  autolayer(fitted(fit.plm7), series = "Piecewise") +
  autolayer(df_forecast_lm_7, series = "Linear") +
  autolayer(df_forecast_plm_7, series = "Piecewise") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron Yearly Water Level") +
  guides(colour=guide_legend(title=" "))

We can make the following observations:

That the linear model is probably inaccurate. Predictions and prediction intervals up to 1980 are continuously decreasing despite the change in trend after about 1915 after which there is no decreasing trend.
On the other hand the piecewise linear model seems to be more appropriate as predictions and prediction intervals seem stable (no decreasing trend) as expected.

Question 8

Done on paper.

###########################################################################################################

CHAPTER 6

###########################################################################################################

Question 1

The 7 term weighted average with weights of (0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067) is the following:

T_hat_t = 0.067y_t-3 + 0.133y_t-2 + 0.200y_t-1 + 0.200y_t + 0.200y_t+1 + 0.133y_t+2 + 0.067*y_t+3

The 3x5MA can be written as following:

T_hat_t = [((y_t-3 + y_t-2 + y_t-1 + y_t + y_t+1)/5) + ((y_t-2 + y_t-1 + y_t + y_t+1 + y_t+2)/5) + ((y_t-1 + y_t + y_t+1 + y_t+2 + y_t+3)/5)] / 3

= (y_t-3)/15 + 2(y_t-2)/15 + 3(y_t-1)/15 + 3(y_t)/15 + 3(y_t+1)/15 + 2(y_t+2)/15 + (y_t+3)/15

= 0.067y_t-3 + 0.133y_t-2 + 0.200y_t-1 + 0.200y_t + 0.200y_t+1 + 0.133y_t+2 + 0.067*y_t+3

As a result from the above 7MA = 3x5MA

Question 2

Look at plastics:

plastics

##    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1  742  697  776  898 1030 1107 1165 1216 1208 1131  971  783
## 2  741  700  774  932 1099 1223 1290 1349 1341 1296 1066  901
## 3  896  793  885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4  951  861  938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013

(a)

Plot:

autoplot(plastics, main="Monthly sales of Product A", xlab="Year", ylab="Sales (thousands")

We can make the following observations:

There are seasonal fluctuations. These are yearly, meaning that at the beginning and end of every year we see troughs (low/minimum values) and at approximately in the middle of every year we see maxima (peaks). The seasonal fluctuations do not vary (increase or decrease) as Year increases from a first glance.
There is also a positive trend (yearly average increases).

(b)

decomp_1b <- plastics %>% decompose(type="multiplicative")

Get indices:

decomp_1b$seasonal

##         Jan       Feb       Mar       Apr       May       Jun       Jul
## 1 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 2 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 3 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 4 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 5 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
##         Aug       Sep       Oct       Nov       Dec
## 1 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 2 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 3 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 4 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 5 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834

decomp_1b$trend

##         Jan       Feb       Mar       Apr       May       Jun       Jul
## 1        NA        NA        NA        NA        NA        NA  976.9583
## 2 1000.4583 1011.2083 1022.2917 1034.7083 1045.5417 1054.4167 1065.7917
## 3 1117.3750 1121.5417 1130.6667 1142.7083 1153.5833 1163.0000 1170.3750
## 4 1208.7083 1221.2917 1231.7083 1243.2917 1259.1250 1276.5833 1287.6250
## 5 1374.7917 1382.2083 1381.2500 1370.5833 1351.2500 1331.2500        NA
##         Aug       Sep       Oct       Nov       Dec
## 1  977.0417  977.0833  978.4167  982.7083  990.4167
## 2 1076.1250 1084.6250 1094.3750 1103.8750 1112.5417
## 3 1175.5000 1180.5417 1185.0000 1190.1667 1197.0833
## 4 1298.0417 1313.0000 1328.1667 1343.5833 1360.6250
## 5        NA        NA        NA        NA        NA

Plot:

decomp_1b %>%
  autoplot() + xlab("Year") +
  ggtitle("Monthly sales of Product A")

(c)

The above results do support the graphical interpretation from part a. There are indeed seasonal fluctuations in the data (indices range from about 0.71 to 1.23) as well as a positive trend.

(d)

Plot with seasonally adjusted data:

autoplot(plastics, series="Data") +
  autolayer(trendcycle(decomp_1b), series="Trend") +
  autolayer(seasadj(decomp_1b), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Sales (thousands)") +
  ggtitle("Monthly sales of Product A") +
  scale_colour_manual(values=c("gray","blue","red"),
                      breaks=c("Data","Seasonally Adjusted","Trend"))

## Warning: Removed 12 row(s) containing missing values (geom_path).

Seasonally adjusted data:

seasadj(decomp_1b)

##         Jan       Feb       Mar       Apr       May       Jun       Jul
## 1  967.3468  981.2262  999.3182  986.4758  985.8925  956.7826 1001.1759
## 2  966.0431  985.4495  996.7427 1023.8257 1051.9377 1057.0417 1108.5982
## 3 1168.1168 1116.3736 1139.6864 1158.9443 1152.4413 1146.0648 1119.7701
## 4 1239.8204 1212.1029 1207.9388 1218.2647 1219.4437 1229.0378 1277.0364
## 5 1342.8129 1452.8342 1450.0416 1411.6051 1405.1361 1414.8628 1384.4587
##         Aug       Sep       Oct       Nov       Dec
## 1  992.4139  981.0263  951.4241  978.9119  939.8819
## 2 1100.9592 1089.0366 1090.2260 1074.6860 1081.5244
## 3 1171.9625 1196.2349 1222.2981 1179.5334 1227.9683
## 4 1269.0820 1302.6210 1345.9580 1414.4319 1451.2352
## 5 1312.3369 1240.9008 1194.5377 1128.1178 1215.9647

(e)

Add new outlier value in plastics data:

plastics_outlier <- plastics
plastics_outlier[25] <- plastics_outlier[15] + 500
plastics_outlier

##    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1  742  697  776  898 1030 1107 1165 1216 1208 1131  971  783
## 2  741  700  774  932 1099 1223 1290 1349 1341 1296 1066  901
## 3 1274  793  885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4  951  861  938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013

Decompose again:

decomp_1e <- plastics_outlier %>% decompose(type="multiplicative")

Plot decomposition again:

decomp_1e %>%
  autoplot() + xlab("Year") +
  ggtitle("Monthly sales of Product A (with outlier)")

Plot with seasonally adjusted data:

autoplot(plastics_outlier, series="Data") +
  autolayer(trendcycle(decomp_1e), series="Trend") +
  autolayer(seasadj(decomp_1e), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Sales (thousands)") +
  ggtitle("Monthly sales of Product A (with outlier)") +
  scale_colour_manual(values=c("gray","blue","red"),
                      breaks=c("Data","Seasonally Adjusted","Trend"))

## Warning: Removed 12 row(s) containing missing values (geom_path).

Seasonally adjusted data:

seasadj(decomp_1e)

##         Jan       Feb       Mar       Apr       May       Jun       Jul
## 1  879.5968  987.8941 1006.1412  993.1830  992.4330  962.9930 1008.1459
## 2  878.4113  992.1461 1003.5481 1030.7868 1058.9164 1063.9028 1116.3161
## 3 1510.2510 1123.9598 1147.4678 1166.8241 1160.0867 1153.5038 1127.5658
## 4 1127.3538 1220.3397 1216.1862 1226.5478 1227.5336 1237.0153 1285.9269
## 5 1221.0036 1462.7068 1459.9421 1421.2028 1414.4579 1424.0465 1394.0971
##         Aug       Sep       Oct       Nov       Dec
## 1  999.6349  987.9764  958.0526  985.5189  946.1675
## 2 1108.9700 1096.7519 1097.8215 1081.9394 1088.7572
## 3 1180.4899 1204.7096 1230.8138 1187.4945 1236.1805
## 4 1278.3160 1311.8495 1355.3352 1423.9784 1460.9406
## 5 1321.8856 1249.6920 1202.8600 1135.7319 1224.0966

As we can observe from the plots above the decomposition was robust enough and the seasonal component did not really change the outlier was viewed as a remainder (reasonable) and it can be observed in the seasonally adjusted plot (deseasonalized). Overall, the outlier did affect the seasonally adjusted data by introducing a spike at the point that we replaced with an extreme value (outlier).

(f)

No matter where the outlier would lie (middle, beginning or end) it would probably not have an effect on the seasonal component. It would, however, adjust the seasonally adjusted data as we saw above. In addition, having the outlier at one of the ends would probably cause a change in the overall level (trend) of the decomposed series.

Question 3

Load retail data:

retaildata <- readxl::read_excel("/Users/yannos/Documents/PredictiveAnalytics/retail.xlsx", skip=1)

Make into ts with chosen column:

retail <- ts(retaildata[,"A3349873A"], frequency=12, start=c(1982,4))

Plot the series:

autoplot(retail, main="Monthly Austrailian Retail Sales",
         ylab="Sales", xlab="Year")

X11 decomposition

Since we observe above that there is the seasonal fluctuation is varying, I would try a multiplicative decomposition:

x11_multiplicative <- retail %>% seas(x11="", transform.function = "log")
x11_multiplicative

## 
## Call:
## seas(x = ., transform.function = "log", x11 = "")
## 
## Coefficients:
##               Mon                Tue                Wed                Thu  
##         0.0026036         -0.0069845          0.0069365          0.0002691  
##               Fri                Sat         AO2000.Jun         LS2011.Jan  
##         0.0088366         -0.0021652          0.1558938          0.1763023  
## MA-Nonseasonal-01     MA-Seasonal-12  
##         0.3151592          0.5487838

Plot:

autoplot(x11_multiplicative)

As we can observe at the remainder plot there are some outliers that were not particularly visible before. Specifically, these would be around 1989, 1994, 2001, 2011 and 2012. In addition, we can observe the trend more clearly which is constantly increasing then to some extent levels off and then has a sudden increase to level off again during the last few months. Finally in the original time series plot it seems that the seasonal fluctuation constantly increases while after the decomposition we see that this is not necessarily the case.

Question 4

(a)

We can make the following observations from the plots showing the decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995:

There is an overall positive trend. The slope slighly decreases in 1991 but then goes back to approximately the original increasing slope after around 1993.
There are strong seasonal fluctuations (within cyclical trend). According to the scale, however, these seasonal fluctuations are not very drastic in comparison to the overall data/trend. The second seasonal fit plot shows us that the highest seasonal number of people in the labour force are observed in December and March while the lowest in January and August.
There are some very obvious strong outliers (as shown by the remainder plot) around and right after 1991 where we see sudden drops in the number of persons. These are greater than the average seasonality fluctuations (as shown by the scale) and coincide with the slight decrease in the trend slope.

(b)

Yes, the recession of 1991/1992 is visible in the estimated components. Specifically, it is observed as the outlier points (the unexpected drops) observed in the data and mainly the remainder plot in 1991 and 1992.

Question 5

Look at cangas:

head(cangas,10)

##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1960 1.4306 1.3059 1.4022 1.1699 1.1161 1.0113 0.9660 0.9773 1.0311 1.2521

(a)

Plots:

autoplot(cangas, main="Monthly Canadian Gas Production",
         ylab="Gas Production (Billions)", xlab="Year")

ggsubseriesplot(cangas, main="Monthly Canadian Gas Production",
                ylab="Gas Production (Billions)", xlab="Month")

ggseasonplot(cangas, main="Monthly Canadian Gas Production (seasonal for every year)",
             ylab="Gas Production (Billions)", xlab="Month")

As we can observe from the plots above, there is indeed changing seasonality over time. The size of seasonal fluctuations (amplitude) as well as the frequency and characteristic shape are all changing.

There could be various reasons for changing seasonality over time:

One could be changing average monthly temperatures and/or other environmental factors over time.
Monthly demand could change over time affecting needs for production such as higher use of gas for various purposes in later years that would not necessarily mean the same rise / drop in production as earlier years.
Extrenal factors that could vary over time such as gas prices, economic and plotical circumstances.
New discovered means or sources for production / storage / transportation over time for both Canada and countried that are supplied by Canada (eg US with increasing shale gas production etc).

(b)

From firstly observing the decomposition components I have decided the following:

To use an s.window = 15 because it signifies the number of consecutive years to use when estimating the seasonal component. In this case we are seeing seasonality fluctuations changing about every 15 years (there are approximately three main stages of seasonality fluctuations in the data)

Fit stl:

fit_stl <- cangas %>%
  stl(s.window=15, robust=TRUE)

Plot:

fit_stl %>% autoplot()

(c)

Try out SEATS decomposition:

fit_seats <- cangas %>% seas()

Plot:

fit_seats %>% autoplot()

Try out X11 decomposition:

fit_x11 <- cangas %>% seas(x11="")

Plot:

fit_x11 %>% autoplot()

We can now make the following observations in comparing STL with SEATS and X11 decompositions:

In all cases the trend is captured similarly.
In SEATS and X11 the seasonal component is very similar but different compared to the original STL decomposition. This is probably the case since SEATS and X11 assume multiplicative decomposition while the original STL can only handle additive decomposition.
STL allows for better handling of drastically changing seasonal fluctuations. This can be observed at the remainder plots where we can see that remainders are quite similar in the SEATS and X11 decompositions (although the remainder component in SEATS is slightly larger overall) but in general the remainder component of STL is smaller.

Question 6

bricksq quarterly data:

bricksq

##      Qtr1 Qtr2 Qtr3 Qtr4
## 1956  189  204  208  197
## 1957  187  214  227  223
## 1958  199  229  249  234
## 1959  208  253  267  255
## 1960  242  268  290  277
## 1961  241  253  265  236
## 1962  229  265  275  258
## 1963  231  263  308  313
## 1964  293  328  349  340
## 1965  309  349  366  340
## 1966  302  350  362  337
## 1967  326  358  359  357
## 1968  341  380  404  409
## 1969  383  417  454  428
## 1970  386  428  434  417
## 1971  385  433  453  436
## 1972  399  461  476  477
## 1973  452  461  534  516
## 1974  478  526  518  417
## 1975  340  437  459  449
## 1976  424  501  540  533
## 1977  457  513  522  478
## 1978  421  487  470  482
## 1979  458  526  573  563
## 1980  513  551  589  564
## 1981  519  581  581  578
## 1982  500  560  512  412
## 1983  303  409  420  413
## 1984  400  469  482  484
## 1985  447  507  533  503
## 1986  443  503  505  443
## 1987  415  485  495  458
## 1988  427  519  555  539
## 1989  511  572  570  526
## 1990  472  524  497  460
## 1991  373  436  424  430
## 1992  387  413  451  420
## 1993  394  462  476  443
## 1994  421  472  494

Plot:

bricksq %>% autoplot()

(a)

Fixed seasonality (s.window=periodic):

fit_stl_fixed <- bricksq %>%
  stl(t.window=27, s.window="periodic", robust=TRUE)

fit_stl_fixed %>% autoplot()

Changing seasonality:

fit_stl_changing <- bricksq %>%
  stl(t.window=27, s.window=7, robust=TRUE)

fit_stl_changing %>% autoplot()

(b)

Get seasonally adjusted:

stl_changing_seasadj <- seasadj(fit_stl_changing)

Plot with seasonally adjusted data:

autoplot(bricksq, series="Data") +
  autolayer(trendcycle(fit_stl_changing), series="Trend") +
  autolayer(seasadj(fit_stl_changing), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Brick Production") +
  ggtitle("Australian quarterly clay brick production. 1956–1994") +
  scale_colour_manual(values=c("gray","blue","red"),
                      breaks=c("Data","Seasonally Adjusted","Trend"))

(c)

Naive forecasts of seasonally adjusted data:

naive_seas_adj <- fit_stl_changing %>% seasadj() %>% naive()

Plot:

naive_seas_adj %>%  autoplot() + 
  ylab("Brick Production") + xlab("Year") +
  ggtitle("Naive forecasts of seasonally adjusted quarterly clay brick production")

(d)

Reseasonalized naive forecasts with stlf:

naive_reseasonalized <- stlf(bricksq, method='naive')

Plot:

naive_reseasonalized %>%  autoplot() + 
  ylab("Brick Production") + xlab("Year") +
  ggtitle("Naive forecasts of quarterly clay brick production")

(e)

Check residuals:

checkresiduals(naive_reseasonalized)

## Warning in checkresiduals(naive_reseasonalized): The fitted degrees of freedom
## is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk
## Q* = 40.829, df = 8, p-value = 2.244e-06
## 
## Model df: 0.   Total lags used: 8

It seems that residuals are still correlated. The ACF plot shows some high autocorrelation values and the Ljung-Box test has a very small p-value. Residuals are also close to being normally distributed.

(f)

Repeat with robust STL decomposition:

naive_reseasonalized_robust <- stlf(bricksq, method='naive', robust=TRUE)

Plot:

naive_reseasonalized_robust %>%  autoplot() + 
  ylab("Brick Production") + xlab("Year") +
  ggtitle("Naive forecasts of quarterly clay brick production")

Check residuals:

checkresiduals(naive_reseasonalized_robust)

## Warning in checkresiduals(naive_reseasonalized_robust): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk
## Q* = 27.961, df = 8, p-value = 0.0004817
## 
## Model df: 0.   Total lags used: 8

We observe that residuals are still somewhat corelated but there has been some slight improvement compared to the non-robust method since the p-value has increased.

(g)

Split the data into train and test sets:

bricksq_train <- window(bricksq, end=c(1992,3))
bricksq_test <- window(bricksq, start=c(1992,4), end=c(1994,3))

Train and forecast using snaive():

fit_snaive <- snaive(bricksq_train)
fc_snaive <- forecast(fit_snaive, h=8)

Train and forecast using stlf():

fit_stlf <- stlf(bricksq_train)
fc_stlf <- forecast(fit_stlf, h=8)

Plot forecasts:

autoplot(bricksq, series = "Original data") +
  geom_line() +
 autolayer(fc_snaive, PI = FALSE, series = "snaive") +
  autolayer(fc_stlf, PI = FALSE, series = "stfl") +
  scale_color_manual(values = c("gray", "blue", "red"),
                     breaks = c("Original Data", "snaive", "stfl")) +
  scale_x_continuous(limits = c(1990, 1994.5)) +
  guides(colour = guide_legend(title = "series")) +
  ggtitle("snaive and stfl forecasts") +
  ylab("Brick production") +
  xlab("Year")

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

## Warning: Removed 136 row(s) containing missing values (geom_path).

From a first glance from the above plot we could say that the stfl() fit does a better job in forecasting the last 2 years of data. However, we can evaluate the two fits by calculating RMSE on the test set:

df_test <- bricksq_test %>% as.data.frame() %>% rename(data=x)
df_snaive <- fc_snaive$mean %>% as.data.frame() %>% rename(snaive=x)
df_stfl <- fc_stlf$mean %>% as.data.frame() %>% rename(stlf=x)

Calculate out of sample rmse for snaive:

sqrt(mean(df_test$data - df_snaive$snaive))

## [1] 5.244044

Calculate out of sample rmse for stlf:

sqrt(mean(df_test$data - df_stfl$stlf))

## [1] 4.878376

As we can observe the rmse for stlf is lower and as a result stlf produces better forecasts.

Question 7

Look at data:

writing

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1968  562.674  599.000  668.516  597.798  579.889  668.233  499.232  215.187
## 1969  634.712  639.283  712.182  621.557  621.000  675.989  501.322  220.286
## 1970  646.783  658.442  712.906  687.714  723.916  707.183  629.000  237.530
## 1971  676.155  748.183  810.681  729.363  701.108  790.079  594.621  230.716
## 1972  747.636  773.392  813.788  766.713  728.875  749.197  680.954  241.424
## 1973  795.337  788.421  889.968  797.393  751.000  821.255  691.605  290.655
## 1974  843.038  847.000  941.952  804.309  840.307  871.528  656.330  370.508
## 1975  778.139  856.075  938.833  813.023  783.417  828.110  657.311  310.032
## 1976  895.217  856.075  893.268  875.000  835.088  934.595  832.500  300.000
## 1977  875.024  992.968  976.804  968.697  871.675 1006.852  832.037  345.587
##           Sep      Oct      Nov      Dec
## 1968  555.813  586.935  546.136  571.111
## 1969  560.727  602.530  626.379  605.508
## 1970  613.296  730.444  734.925  651.812
## 1971  617.189  691.389  701.067  705.777
## 1972  680.234  708.326  694.238  772.071
## 1973  727.147  868.355  812.390  799.556
## 1974  742.000  847.152  731.675  898.527
## 1975  780.000  860.000  780.000  807.993
## 1976  791.443  900.000  781.729  880.000
## 1977  849.528  913.871  868.746  993.733

Plot:

autoplot(writing)

As we can observe there is probably no need for a Box-Cox tranformation. The series has a fairly linear trend and seasonality fluctuations do not vary.

Try stlf naive:

stlf_writing_naive <- stlf(writing, robust = TRUE, method = "naive")
autoplot(stlf_writing_naive)

Try stlf rwdrif:

stlf_writing_rwdrift <- stlf(writing, robust = TRUE, method = "rwdrift")
autoplot(stlf_writing_rwdrift)

onsidering the trend of the series I would suggest that the stlf_writing_rwdrift fit is most appropriate (using the rwdrift method).

Question 8

Look at data:

fancy

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1987   1664.81   2397.53   2840.71   3547.29   3752.96   3714.74   4349.61
## 1988   2499.81   5198.24   7225.14   4806.03   5900.88   4951.34   6179.12
## 1989   4717.02   5702.63   9957.58   5304.78   6492.43   6630.80   7349.62
## 1990   5921.10   5814.58  12421.25   6369.77   7609.12   7224.75   8121.22
## 1991   4826.64   6470.23   9638.77   8821.17   8722.37  10209.48  11276.55
## 1992   7615.03   9849.69  14558.40  11587.33   9332.56  13082.09  16732.78
## 1993  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 1987   3566.34   5021.82   6423.48   7600.60  19756.21
## 1988   4752.15   5496.43   5835.10  12600.08  28541.72
## 1989   8176.62   8573.17   9690.50  15151.84  34061.01
## 1990   7979.25   8093.06   8476.70  17914.66  30114.41
## 1991  12552.22  11637.39  13606.89  21822.11  45060.69
## 1992  19888.61  23933.38  25391.35  36024.80  80721.71
## 1993  28586.52  30505.41  30821.33  46634.38 104660.67

Plot:

autoplot(fancy)

As we can observe there is probably a need for a Box-Cox tranformation. The series has an exponential trend and there varying seasonality fluctuations.

Try stlf naive:

stlf_fancy_naive <- stlf(fancy, robust = TRUE, method = "naive", lambda = BoxCox.lambda(fancy))
autoplot(stlf_fancy_naive)

Try stlf rwdrif:

stlf_fancy_rwdrift <- stlf(fancy, robust = TRUE, method = "rwdrift", lambda = BoxCox.lambda(fancy))
autoplot(stlf_fancy_rwdrift)

Again, considering the trend of the series I would suggest that the stlf_fancy_rwdrift fit is most appropriate (using the rwdrift method).