###########################################################################################################
rm(list = ls())
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(zoo)
library(data.table)
library(fpp2)
library(seasonal)
###########################################################################################################
Get data for first 20 days:
daily20 = head(elecdaily, 20)
Plot data
autoplot(daily20, facets=TRUE)
Plot demand against temperature"
daily20 %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
Regression model for demand with temperature xvar:
fit.lm1 = tslm(Demand ~ Temperature, data = daily20)
summary(fit.lm1)
##
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.060 -7.117 -1.437 17.484 27.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2117 17.9915 2.179 0.0428 *
## Temperature 6.7572 0.6114 11.052 1.88e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22 on 18 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8644
## F-statistic: 122.1 on 1 and 18 DF, p-value: 1.876e-09
There is a positive relationship (Temperature coefficient = 6.7572) because the higher the temperature the more electricity is used for air conditioning. Intuitively, in very hot days more households turn on their AC for cooling increasing electricity demand.
Look at residuals:
checkresiduals(fit.lm1)
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
Also plot residuals against predictor:
df_daily20 <- as.data.frame(daily20)
df_daily20[,"Residuals"] <- as.numeric(residuals(fit.lm1))
ggplot(df_daily20, aes(x=Temperature, y=Residuals)) +
geom_point()
As we can see from the Breusch-Godfrey test, the p-value is high and therefore the test is insignificant. As a result, we can assume that there is no significant autocorrelation remaining in the residuals. In addition, despite the residual distribution being slightly skewed to the left and seeing an ACF spike at Lag 4 (and a couple outliers) autocorrelation is not particularly large. Also we can observe in the residuals vs predictor (Temperature) plot that points seem to be randomly scattered. Therefore, we can assume that the model is adequate.
Forecast for T=15 & T=35:
fc_new <- forecast(fit.lm1, h=1, newdata=data.frame(Temperature = c(15,35)))
fc_new
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 3.857143 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.000000 275.7146 245.2278 306.2014 227.57056 323.8586
Plot forecasts:
autoplot(fc_new)
Yes, I do believe the forecasts for Temperatures of 15C and 35C. For the fist we have a predicted average demand of 140.5701 which is reasonable as it is relatively lower to other historical points where temperature was slighlty higher by a sensible amount. For the latter we have a predicted average demand of 275.7146 which is also reasonable as it is slighlty higher than demand for the historical point in time where Temperature = 34.
We can use the forecast object to get the 80% and 90% prediction intervals.
80% intervals:
# for T=15
fc_new$upper[1, 1]
## [1] 172.4591
fc_new$lower[1, 1]
## [1] 108.681
# for T=35
fc_new$upper[2, 1]
## [1] 306.2014
fc_new$lower[2, 1]
## [1] 245.2278
95% intervals:
# for T=15
fc_new$upper[1, 2]
## [1] 190.9285
fc_new$lower[1, 2]
## [1] 90.21166
# for T=35
fc_new$upper[2, 2]
## [1] 323.8586
fc_new$lower[2, 2]
## [1] 227.5706
Plot Temperature vs Demand:
elecdaily %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
As we can see, a linear model like the one trained for the first 20 days would NOT be appropriate for the total dataset. It is indeed able to fit the data fine for the initial 20 data points but it would produce wrong predictions - especially for low and high temperature values in the complete dataset. The sample that we used to trained the linear model is not representative of the total data and would therefore be inadequate.
head(mens400)
## Time Series:
## Start = 1896
## End = 1916
## Frequency = 0.25
## [1] 54.2 49.4 49.2 50.0 48.2 NA
autoplot(mens400)
There are 3 main observations in the Year vs Winning time plot above:
That there is an overall decline as the Year increases (winning times are generally faster as years pass)
That the decline (negative slope) is not continuous (in some cases there is variation / increase in winning time in a year compared to the previous one)
That there is missing data (some years do not have recorded winning times)
Regression:
fit.lm2 = tslm(mens400 ~ time(mens400), data= mens400)
Get coefficient:
fit.lm2$coefficients[2]
## time(mens400)
## -0.06457385
Plot:
autoplot(mens400) +
geom_abline(slope=fit.lm2$coefficients[2],
intercept=fit.lm2$coefficients[1], color="red")
As we can see above the fitted coefficient is approxiamtely -0.0646. As a result the winning times have been decreasing by an average rate of 0.0646 per year
df_mens400 <- as.data.frame(time(mens400))
df_mens400 <- df_mens400 %>% rename(Time=x)
df_mens400[,"Residuals"] <- as.numeric(residuals(fit.lm2))
Plot Time vs Residuals:
ggplot(df_mens400, aes(x=Time, y=Residuals)) +
geom_point()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 3 rows containing missing values (geom_point).
Check Residuals:
checkresiduals(fit.lm2)
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295
As we can observe from the two plots above and the Breusch-Godfrey test the fitted line is quite suitable for the data. Overall the test suggests that there is no significant autocorrelation left. The plots show that the line fits the data well with some exceptions (outlier in the very first) data point of year 1896.
Years with missing data are not taken into account when fitting the model.
Time = time(mens400)
Fit:
fit.lm2 <- lm(mens400 ~ Time,data = mens400,
na.action = na.exclude)
Forecast:
fc_new <- forecast(fit.lm2, newdata = data.frame(Time = 2020))
Plot:
autoplot(mens400) + autolayer(fc_new, PI = TRUE)
80% prediction intervals:
fc_new$upper[,1]
## [1] 43.63487
fc_new$lower[,1]
## [1] 40.44975
94% prediction intervals:
fc_new$upper[,2]
## [1] 44.53176
fc_new$lower[,2]
## [1] 39.55286
The above prediction intervals are calculated with the assumption that residuals follow a normal distribution (which is not exactly the case as we have seen from our check residuals analysis previously)
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
We observe a vector of quarterly data for the fraction of time easter holiday days fall within each quarter. In most cases we have 0s and 1s. When the value is 0 it means that easter did not fall within that quarter and if it is 1 that easter completely fell within that quarter. If the value is a fraction it means that a fraction of easter (eg 0.67 of easter days) fall within that quarter and the rest (eg 0.33) fall within the previous or next quarter.
As we know in a log-log model the estimated coefficients are basically elasticity values. For example in the following model:
# log y = β0 + β1 log x + e
β1 is n elasticity coefficient: the ratio of the percentage changein y to the percentage change in x. We can prove this by expressing y as a function of x:
# log y = β0 + β1 log x + e =>
# y = e^( β0 + β1 log x + e)
and now we can differentiate with respect to x:
# dy/dx = d/dx(e^( β0 + β1 log x + e)) =>
# dy/dx = (β1/x)e^( β0 + β1 log x + e) =>
# dy/dx = β1*y/x =>
# β1 = (x/dx) * (dy/y) =>
# β1 = (x/dx) / (y/dy)
# Which is the ratio of the percentage change in x over the percentage change in y aka an elasticity
autoplot(fancy)
The following patterns can be observed:
Very obvious seasonality. There are very high sales spikes every year during Christmas (and higher demand in the couple of months preceding Christmas) and smaller spikes every March when the local surg festival is held
Monthly sales and demand spikes are getting larger and larger especially in years after 1991 (when the spike did not increase). The following years the peak demand was increasing at a greater rate than regular non-seasonal sales. As a result, we can see that during these years - probably when the store is expanding its premises, range of products etc - sales are not increasing linearly but exponentially (logarithmic model probably more appropriate)
As observed previously, seasonal variations (peaks in December etc) increase exponentially (no linear trend). As a result, it is necessary that we use a log transformation before fitting the data.
yearmonth <- time(fancy)
Create surfing festival dummy (1 if month is 3 aka March after 1988):
surfing_festival <- c()
for(i in 1:length(yearmonth)){
M <- round(12*(yearmonth[i] - floor(yearmonth[i]))) + 1
Y <- floor(yearmonth[i])
if(Y >= 1988 & M == 3){
surfing_festival[i] <- 1
} else {
surfing_festival[i] <- 0
}
}
Fit log-linear model using BoxCox - log(sales) and linear trend, seasonal dummies and surfing festival dummy:
fit.log_lm5 <- tslm(BoxCox(fancy, 0) ~ trend + season + surfing_festival)
summary(fit.log_lm5)
##
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ trend + season + surfing_festival)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend 0.0220198 0.0008268 26.634 < 2e-16 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 5.88e-05 ***
## season6 0.4488283 0.0957647 4.687 1.33e-05 ***
## season7 0.6104545 0.0958039 6.372 1.71e-08 ***
## season8 0.5879644 0.0958503 6.134 4.53e-08 ***
## season9 0.6693299 0.0959037 6.979 1.36e-09 ***
## season10 0.7473919 0.0959643 7.788 4.48e-11 ***
## season11 1.2067479 0.0960319 12.566 < 2e-16 ***
## season12 1.9622412 0.0961066 20.417 < 2e-16 ***
## surfing_festival 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.2e-16
Plot residuals against time:
autoplot(fit.log_lm5$residuals)
Get df with residuals and fitted values:
df_5_res_fitted <- cbind(Residuals = fit.log_lm5$residuals,
Fitted = fit.log_lm5$fitted.values) %>%
as.data.frame()
Plot residuals against fitted values:
df_5_res_fitted %>%
ggplot(aes(x = Fitted, y = Residuals)) + geom_point()
As we can see there is a pattern in residuals as time changes. This means that the model is inadequate as residuals are correlated with the the predictor aka time.
Get df with residuals with month column:
df_5_res_month <- cbind.data.frame(
Month = factor(month.abb[round(12*(yearmonth - floor(yearmonth)) + 1)],
labels = month.abb,
ordered = TRUE),
Residuals = fit.log_lm5$residuals
)
Boxplot with residuals by month:
df_5_res_month %>% ggplot(aes(x = Month, y = Residuals)) + geom_boxplot()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
The monthly residuals’ boxplots could reveal some issues with the model. As we can see distributions for each month vary a good amount as well as their symmetry - not all of them seem symmetrical. Also median values vary and are not all close to 0 (possibly affecting normality assumption for residuals)
Get model coefficients:
fit.log_lm5$coefficients
## (Intercept) trend season2 season3
## 7.61966701 0.02201983 0.25141682 0.26608280
## season4 season5 season6 season7
## 0.38405351 0.40948697 0.44882828 0.61045453
## season8 season9 season10 season11
## 0.58796443 0.66932985 0.74739195 1.20674790
## season12 surfing_festival
## 1.96224123 0.50151509
We can make the following conclusions about our variables based on our model’s estimated coefficients:
There is a positive trend coefficient of 0.0220. That means that as time increases (years pass) sales also increase on average. For every additional year sales go up by approximately 2.2%.
All seasonal dummy coefficients are positive meaning that there is an additional effect for sales in all months relatively to the ommited dummy variable (season1 aka January). That means that on average all months are estimated to have higher sales than January (which is the lowest sales month). In addition, the highest additional effect is shown in December (as expected because of Christmas) as well as in the months prior to Christmas. The additional effect of season 3 which is when the festival takes place (March) is not particularly high because we have included the surfing_festival dummy
The surfing_festival dummy is positive (~0.50) meaning that there is an additional positive incremental effect of the the time of year when the surfing festival takes place on sales.
Check residuals:
checkresiduals(fit.log_lm5)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494
As we can see the p-value of the Breusch-Godfrey test is very small (0.00249 which is smaller than 0.05). This means that the test is is significant and that residuals are still correlated. This indicated autocorrelation means that we cannot conclude that residuals resemble white noise making the model inadequate.
Create future time series for 1994,1995,1996:
df_future <- ts(data = rep(0, 36), start = 1994,
end = c(1996, 12), frequency = 12)
Create surfing festival dummy for future df:
yearmonth <- time(df_future)
surfing_festival <- c()
for(i in 1:length(yearmonth)){
M <- round(12*(yearmonth[i] - floor(yearmonth[i]))) + 1
Y <- floor(yearmonth[i])
if(Y >= 1988 & M == 3){
surfing_festival[i] <- 1
} else {
surfing_festival[i] <- 0
}
}
Forecast:
df_future_sales <- forecast(
fit.log_lm5, newdata = data.frame(Time = yearmonth,
surfing_festival = surfing_festival)
)
Get predictions with 80% and 95% prediction intervals (note that predictions and intervals below are not raw sales but logged):
df_future_sales
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.491352 9.238522 9.744183 9.101594 9.88111
## Feb 1994 9.764789 9.511959 10.017620 9.375031 10.15455
## Mar 1994 10.302990 10.048860 10.557120 9.911228 10.69475
## Apr 1994 9.941465 9.688635 10.194296 9.551707 10.33122
## May 1994 9.988919 9.736088 10.241749 9.599161 10.37868
## Jun 1994 10.050280 9.797449 10.303110 9.660522 10.44004
## Jul 1994 10.233926 9.981095 10.486756 9.844168 10.62368
## Aug 1994 10.233456 9.980625 10.486286 9.843698 10.62321
## Sep 1994 10.336841 10.084010 10.589671 9.947083 10.72660
## Oct 1994 10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994 10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994 11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995 9.755590 9.499844 10.011336 9.361338 10.14984
## Feb 1995 10.029027 9.773281 10.284773 9.634775 10.42328
## Mar 1995 10.567228 10.310518 10.823938 10.171489 10.96297
## Apr 1995 10.205703 9.949957 10.461449 9.811451 10.59996
## May 1995 10.253157 9.997411 10.508903 9.858904 10.64741
## Jun 1995 10.314518 10.058772 10.570264 9.920265 10.70877
## Jul 1995 10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995 10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995 10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995 10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995 11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995 11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996 10.019828 9.760564 10.279093 9.620151 10.41951
## Feb 1996 10.293265 10.034000 10.552530 9.893588 10.69294
## Mar 1996 10.831466 10.571566 11.091365 10.430810 11.23212
## Apr 1996 10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996 10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996 10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996 10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996 10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996 10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996 10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996 11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996 12.224288 11.965023 12.483552 11.824611 12.62396
Plot forecast:
autoplot(df_future_sales)
Transform predictions and intervals to raw data format:
df_fc_mean = exp(df_future_sales$mean)
df_fc_upper = exp(df_future_sales$upper)
df_fc_lower = exp(df_future_sales$lower)
Print transformed mean forecasts and prediction intervals:
df_fc_mean
## Jan Feb Mar Apr May Jun Jul
## 1994 13244.70 17409.81 29821.65 20774.16 21783.73 23162.27 27831.56
## 1995 17250.40 22675.20 38840.85 27057.06 28371.96 30167.42 36248.88
## 1996 22467.57 29533.04 50587.81 35240.15 36952.72 39291.20 47211.93
## Aug Sep Oct Nov Dec
## 1994 27818.48 30848.42 34095.57 55176.84 120067.79
## 1995 36231.84 40178.16 44407.37 71864.42 156380.86
## 1996 47189.73 52329.57 57837.85 93598.96 203676.38
df_fc_upper
## [,1] [,2]
## Jan 1994 17054.73 19557.43
## Feb 1994 22418.00 25707.73
## Mar 1994 38450.24 44123.68
## Apr 1994 26750.16 30675.62
## May 1994 28050.15 32166.37
## Jun 1994 29825.24 34201.95
## Jul 1994 35837.72 41096.73
## Aug 1994 35820.87 41077.41
## Sep 1994 39722.43 45551.50
## Oct 1994 43903.67 50346.32
## Nov 1994 71049.28 81475.41
## Dec 1994 154607.08 177294.90
## Jan 1995 22277.59 25587.08
## Feb 1995 29283.31 33633.55
## Mar 1995 50208.44 57697.39
## Apr 1995 34942.16 40133.06
## May 1995 36640.25 42083.42
## Jun 1995 38958.95 44746.58
## Jul 1995 46812.70 53767.06
## Aug 1995 46790.69 53741.78
## Sep 1995 51887.06 59595.26
## Oct 1995 57348.77 65868.34
## Nov 1995 92807.48 106594.69
## Dec 1995 201954.08 231955.81
## Jan 1996 29117.46 33506.86
## Feb 1996 38274.14 44043.89
## Mar 1996 65602.25 75517.62
## Apr 1996 45670.42 52555.15
## May 1996 47889.88 55109.18
## Jun 1996 50920.48 58596.65
## Jul 1996 61185.57 70409.18
## Aug 1996 61156.80 70376.07
## Sep 1996 67817.91 78041.33
## Oct 1996 74956.52 86256.08
## Nov 1996 121302.09 139588.15
## Dec 1996 263959.89 303751.35
# first column is 80% interval and second 95% interval
df_fc_lower
## [,1] [,2]
## Jan 1994 10285.82 8969.583
## Feb 1994 13520.45 11790.284
## Mar 1994 23129.40 20155.412
## Apr 1994 16133.21 14068.696
## May 1994 16917.24 14752.395
## Jun 1994 17987.81 15685.969
## Jul 1994 21613.98 18848.111
## Aug 1994 21603.82 18839.249
## Sep 1994 23956.87 20891.193
## Oct 1994 26478.61 23090.230
## Nov 1994 42850.31 37366.903
## Dec 1994 93244.59 81312.400
## Jan 1995 13357.65 11629.938
## Feb 1995 17558.28 15287.252
## Mar 1995 30046.98 26146.972
## Apr 1995 20951.33 18241.435
## May 1995 21969.51 19127.918
## Jun 1995 23359.80 20338.387
## Jul 1995 28068.91 24438.412
## Aug 1995 28055.72 24426.922
## Sep 1995 31111.50 27087.467
## Oct 1995 34386.35 29938.733
## Nov 1995 55647.40 48449.831
## Dec 1995 121091.75 105429.448
## Jan 1996 17336.40 15065.329
## Feb 1996 22788.25 19802.984
## Mar 1996 39009.73 33887.802
## Apr 1996 27191.96 23629.808
## May 1996 28513.41 24778.151
## Jun 1996 30317.82 26346.183
## Jul 1996 36429.60 31657.322
## Aug 1996 36412.48 31642.439
## Sep 1996 40378.47 35088.887
## Oct 1996 44628.77 38782.394
## Nov 1996 72222.70 62761.521
## Dec 1996 157160.50 136572.460
# first column is 80% interval and second 95%
As we have observed so far the log-linear model is inadequate. There is still correlation within the residuals and as we can see from the last predictions plot the model does not capture accurately the exponential increase in peak sales.
As a result, we could improve predictions by fitting a log-log model to capture the exponential trend. In this case the interpretation of our estimated coefficients would be elasticity values.
head(gasoline)
## Time Series:
## Start = 1991.1
## End = 1991.19582477755
## Frequency = 52.1785714285714
## [1] 6.621 6.433 6.582 7.224 6.875 6.947
Consider only data up to the end of 2004:
df_gasoline <- window(gasoline, end = 2005)
Look at plot:
autoplot(df_gasoline, xlab = "Year")
Fit harmonic regression with trend models with different fourier terms:
fit.hr1 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=1))
fit.hr2 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=3))
fit.hr3 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=5))
fit.hr4 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=7))
fit.hr5 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=10))
fit.hr6 = tslm(df_gasoline ~ trend + fourier(df_gasoline, K=12))
Plot observed gasoline and fitted values for different models:
autoplot(df_gasoline) +
autolayer(fit.hr1$fitted.values,
series = as.character(1)) +
autolayer(fit.hr2$fitted.values,
series = as.character(3)) +
autolayer(fit.hr3$fitted.values,
series = as.character(5)) +
autolayer(fit.hr4$fitted.values,
series = as.character(7)) +
autolayer(fit.hr5$fitted.values,
series = as.character(10)) +
autolayer(fit.hr6$fitted.values,
series = as.character(12)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
We cannot tell much from plotting everything together so lets plot separately:
autoplot(df_gasoline) +
autolayer(fit.hr1$fitted.values,
series = as.character(1)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
autoplot(df_gasoline) +
autolayer(fit.hr2$fitted.values,
series = as.character(3)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
autoplot(df_gasoline) +
autolayer(fit.hr3$fitted.values,
series = as.character(5)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
autoplot(df_gasoline) +
autolayer(fit.hr4$fitted.values,
series = as.character(7)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
autoplot(df_gasoline) +
autolayer(fit.hr5$fitted.values,
series = as.character(10)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
autoplot(df_gasoline) +
autolayer(fit.hr6$fitted.values,
series = as.character(12)) +
ylab("gasoline") +
ggtitle("Weekly Gasoline Supply (mbpd)") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
Lets select the appropriate number of fourier terms by minimizing AICc and CV:
Search and print AICc value for all possible number of fourier terms (lets say max k = 26 because freq/2 = 52.18/2 ~= 26):
for (k in 1:26){
AICc <- CV(
tslm(df_gasoline ~ trend + fourier(df_gasoline, K = k)))[['AICc']]
print(paste0("AICc for ", k, " terms = ", AICc))
}
## [1] "AICc for 1 terms = -1813.20839072956"
## [1] "AICc for 2 terms = -1818.95741052303"
## [1] "AICc for 3 terms = -1862.83361533579"
## [1] "AICc for 4 terms = -1863.74227544095"
## [1] "AICc for 5 terms = -1872.72257839179"
## [1] "AICc for 6 terms = -1902.07737210877"
## [1] "AICc for 7 terms = -1912.67183911035"
## [1] "AICc for 8 terms = -1912.54204025102"
## [1] "AICc for 9 terms = -1910.97993898114"
## [1] "AICc for 10 terms = -1913.44100863917"
## [1] "AICc for 11 terms = -1913.22768793883"
## [1] "AICc for 12 terms = -1917.10993954982"
## [1] "AICc for 13 terms = -1913.17152209052"
## [1] "AICc for 14 terms = -1911.3588180895"
## [1] "AICc for 15 terms = -1906.98774556301"
## [1] "AICc for 16 terms = -1905.16529332884"
## [1] "AICc for 17 terms = -1903.78225572243"
## [1] "AICc for 18 terms = -1910.1407759961"
## [1] "AICc for 19 terms = -1906.43389513879"
## [1] "AICc for 20 terms = -1902.46881421726"
## [1] "AICc for 21 terms = -1898.39428251371"
## [1] "AICc for 22 terms = -1894.39830795924"
## [1] "AICc for 23 terms = -1890.69143456694"
## [1] "AICc for 24 terms = -1888.00086427751"
## [1] "AICc for 25 terms = -1885.12915004351"
## [1] "AICc for 26 terms = -1880.5002026393"
Search and print CV value for all possible number of fourier terms (lets say max k = 26 because freq/2 = 52.18/2 ~= 26):
for (k in 1:26){
CV <- CV(
tslm(df_gasoline ~ trend + fourier(df_gasoline, K = k)))[['CV']]
print(paste0("CV for ", k, " terms = ", CV))
}
## [1] "CV for 1 terms = 0.0820192054454044"
## [1] "CV for 2 terms = 0.0813685429593528"
## [1] "CV for 3 terms = 0.0765884641420364"
## [1] "CV for 4 terms = 0.0764860811768495"
## [1] "CV for 5 terms = 0.0755364599938532"
## [1] "CV for 6 terms = 0.0725333015193832"
## [1] "CV for 7 terms = 0.0714733954288623"
## [1] "CV for 8 terms = 0.0714743127258836"
## [1] "CV for 9 terms = 0.0716145091624263"
## [1] "CV for 10 terms = 0.0713575412169122"
## [1] "CV for 11 terms = 0.0713648805476336"
## [1] "CV for 12 terms = 0.0709694493090534"
## [1] "CV for 13 terms = 0.0713380742903736"
## [1] "CV for 14 terms = 0.0714971412874022"
## [1] "CV for 15 terms = 0.0719086208045337"
## [1] "CV for 16 terms = 0.0720689398824101"
## [1] "CV for 17 terms = 0.0721860882730974"
## [1] "CV for 18 terms = 0.0715312552744315"
## [1] "CV for 19 terms = 0.0718716529738733"
## [1] "CV for 20 terms = 0.0722383395523968"
## [1] "CV for 21 terms = 0.0726160278362386"
## [1] "CV for 22 terms = 0.0729862490372426"
## [1] "CV for 23 terms = 0.0733286296560949"
## [1] "CV for 24 terms = 0.0735684811426977"
## [1] "CV for 25 terms = 0.0738256692802845"
## [1] "CV for 26 terms = 0.0742675686386856"
As we can observe we have an obvious first low AICc and CV value at K=7. However, both AICc and CV are minimized at K=12!
Check residuals:
checkresiduals(fit.hr6)
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 154.2, df = 104, p-value = 0.001021
Forecast using fitted model for next year (52 weeks horizon):
df_forecast_2005 <- forecast(
fit.hr6, newdata=data.frame(fourier(df_gasoline, K = 12, h = 52))
)
df_forecast_2005
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005.014 8.713386 8.371087 9.055684 8.189474 9.237298
## 2005.033 8.642131 8.299720 8.984542 8.118047 9.166215
## 2005.052 8.656942 8.314540 8.999345 8.132871 9.181013
## 2005.071 8.661292 8.318899 9.003684 8.137236 9.185347
## 2005.090 8.686370 8.344109 9.028631 8.162515 9.210225
## 2005.110 8.784129 8.441982 9.126276 8.260448 9.307809
## 2005.129 8.895967 8.553823 9.238111 8.372291 9.419643
## 2005.148 8.937884 8.595756 9.280012 8.414232 9.461535
## 2005.167 8.940875 8.598761 9.282988 8.417246 9.464503
## 2005.186 8.988596 8.646477 9.330716 8.464958 9.512235
## 2005.205 9.062316 8.720191 9.404442 8.538669 9.585963
## 2005.225 9.073794 8.731673 9.415915 8.550153 9.597435
## 2005.244 9.031913 8.689800 9.374027 8.508284 9.555542
## 2005.263 9.041458 8.699342 9.383575 8.517824 9.565092
## 2005.282 9.122877 8.780756 9.464999 8.599236 9.646519
## 2005.301 9.169383 8.827263 9.511503 8.645744 9.693022
## 2005.320 9.132601 8.790487 9.474715 8.608970 9.656231
## 2005.340 9.120782 8.778667 9.462897 8.597150 9.644414
## 2005.359 9.221551 8.879431 9.563671 8.697912 9.745190
## 2005.378 9.335425 8.993306 9.677545 8.811787 9.859064
## 2005.397 9.316637 8.974522 9.658752 8.793006 9.840269
## 2005.416 9.214109 8.871994 9.556224 8.690478 9.737740
## 2005.435 9.218957 8.876838 9.561076 8.695320 9.742595
## 2005.455 9.383861 9.041741 9.725980 8.860222 9.907499
## 2005.474 9.545162 9.203046 9.887278 9.021530 10.068795
## 2005.493 9.559667 9.217553 9.901782 9.036037 10.083298
## 2005.512 9.487053 9.144935 9.829171 8.963416 10.010689
## 2005.531 9.464953 9.122834 9.807073 8.941315 9.988592
## 2005.550 9.508378 9.166261 9.850494 8.984744 10.032011
## 2005.570 9.534922 9.192808 9.877037 9.011292 10.058553
## 2005.589 9.521986 9.179869 9.864104 8.998351 10.045621
## 2005.608 9.522899 9.180779 9.865019 8.999260 10.046538
## 2005.627 9.548443 9.206326 9.890560 9.024808 10.072078
## 2005.646 9.536705 9.194591 9.878819 9.013075 10.060335
## 2005.665 9.451014 9.108897 9.793131 8.927380 9.974649
## 2005.685 9.330630 8.988509 9.672750 8.806990 9.854269
## 2005.704 9.229844 8.887726 9.571962 8.706208 9.753480
## 2005.723 9.170862 8.828748 9.512976 8.647232 9.694492
## 2005.742 9.168084 8.825968 9.510200 8.644451 9.691717
## 2005.761 9.230990 8.888869 9.573110 8.707349 9.754630
## 2005.780 9.320372 8.978252 9.662492 8.796734 9.844010
## 2005.800 9.363916 9.021801 9.706030 8.840285 9.887546
## 2005.819 9.342664 9.000548 9.684779 8.819032 9.866296
## 2005.838 9.304159 8.962036 9.646281 8.780516 9.827801
## 2005.857 9.267511 8.925389 9.609633 8.743869 9.791153
## 2005.876 9.194408 8.852292 9.536523 8.670776 9.718040
## 2005.895 9.100701 8.758587 9.442816 8.577070 9.624332
## 2005.915 9.104667 8.762540 9.446794 8.581017 9.628317
## 2005.934 9.265935 8.923806 9.608064 8.742282 9.789587
## 2005.953 9.436660 9.094538 9.778782 8.913018 9.960302
## 2005.972 9.400410 9.058293 9.742527 8.876776 9.924044
## 2005.991 9.147441 8.805233 9.489649 8.623668 9.671215
Plot all historical data with predictions up to end of 2005:
autoplot(df_forecast_2005) +
autolayer(window(gasoline, start = 2005, end = 2006))
Zoom in between beginning of 2005 and end of 2005:
autoplot(df_forecast_2005) +
autolayer(window(gasoline, start = 2005, end = 2006)) +
scale_x_continuous(limits = c(2005, 2006))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 726 row(s) containing missing values (geom_path).
As we can observe the model does a pretty good job forecasting most weeks in 2005. The actual values fall within the 80% prediction interval (dark blue area) for most weeks and the 95% prediction interval (light blue area) for very few weeks. There is only one point in time (around end of Q3 of the year) that the actual value falls out of the prediction interval in which case we have a sudden drop in supply that is not captured by our model.
head(huron)
## Time Series:
## Start = 1875
## End = 1880
## Frequency = 1
## [1] 10.38 11.86 10.97 10.80 9.79 10.39
Plot:
autoplot(huron)
There are two main observations from the yearly water level data:
That there is not a super clear pattern (data is quite cyclic - there are ups and downs but without any obvious pattern)
That there is initially a downward trend until about 1920, whereas after that there is visible trend
Linear regression:
fit.lm7 <- tslm(huron ~ trend)
fit.lm7
##
## Call:
## tslm(formula = huron ~ trend)
##
## Coefficients:
## (Intercept) trend
## 10.2020 -0.0242
Piecewise linear trend model with knot at 1915:
t1 <- time(huron)
# define t2
t2 <- ts(pmax(0,t1-1915), start=1875)
# fit model
fit.plm7 <- tslm(huron ~ t1 + t2)
fit.plm7
##
## Call:
## tslm(formula = huron ~ t1 + t2)
##
## Coefficients:
## (Intercept) t1 t2
## 132.90870 -0.06498 0.06486
Look at fitted values plots for the 2 models:
autoplot(huron) +
autolayer(fitted(fit.lm7), series = "Linear") +
autolayer(fitted(fit.plm7), series = "Piecewise") +
xlab("Year") + ylab("Water level") +
ggtitle("Lake Huron Yearly Water Level") +
guides(colour=guide_legend(title=" "))
As we can see from a first glance the piecewise linear model fits the data slightly better as it takes into account the change in trend occuring after 1915 (not downward after that)
Linear model forecasts:
df_forecast_lm_7 = forecast(fit.lm7, h=8)
df_forecast_lm_7
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1973 7.806127 6.317648 9.294605 5.516501 10.095752
## 1974 7.781926 6.292536 9.271315 5.490899 10.072952
## 1975 7.757724 6.267406 9.248043 5.465269 10.050179
## 1976 7.733523 6.242259 9.224788 5.439613 10.027434
## 1977 7.709322 6.217094 9.201550 5.413929 10.004715
## 1978 7.685121 6.191912 9.178331 5.388219 9.982024
## 1979 7.660920 6.166712 9.155128 5.362481 9.959359
## 1980 7.636719 6.141494 9.131943 5.336717 9.936721
Piecewise linear model forecasts:
h <- 8
t1.new = t1[length(t1)] + seq(h)
t2.new = t2[length(t2)] + seq(h)
df_future = cbind(t1=t1.new, t2=t2.new)%>%
as.data.frame()
df_forecast_plm_7 <- forecast(fit.plm7, newdata = df_future)
df_forecast_plm_7
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1973 8.455119 7.063583 9.846655 6.314483 10.59575
## 1974 8.454992 7.061518 9.848467 6.311374 10.59861
## 1975 8.454866 7.059398 9.850333 6.308182 10.60155
## 1976 8.454739 7.057225 9.852253 6.304906 10.60457
## 1977 8.454612 7.054998 9.854227 6.301549 10.60768
## 1978 8.454486 7.052717 9.856254 6.298109 10.61086
## 1979 8.454359 7.050384 9.858334 6.294587 10.61413
## 1980 8.454232 7.047997 9.860467 6.290984 10.61748
Plot fitted values and forecasts for both models:
autoplot(huron) +
autolayer(fitted(fit.lm7), series = "Linear") +
autolayer(fitted(fit.plm7), series = "Piecewise") +
autolayer(df_forecast_lm_7, series = "Linear") +
autolayer(df_forecast_plm_7, series = "Piecewise") +
xlab("Year") + ylab("Water level") +
ggtitle("Lake Huron Yearly Water Level") +
guides(colour=guide_legend(title=" "))
We can make the following observations:
That the linear model is probably inaccurate. Predictions and prediction intervals up to 1980 are continuously decreasing despite the change in trend after about 1915 after which there is no decreasing trend.
On the other hand the piecewise linear model seems to be more appropriate as predictions and prediction intervals seem stable (no decreasing trend) as expected.
Done on paper.
###########################################################################################################
###########################################################################################################
The 7 term weighted average with weights of (0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067) is the following:
T_hat_t = 0.067y_t-3 + 0.133y_t-2 + 0.200y_t-1 + 0.200y_t + 0.200y_t+1 + 0.133y_t+2 + 0.067*y_t+3
The 3x5MA can be written as following:
T_hat_t = [((y_t-3 + y_t-2 + y_t-1 + y_t + y_t+1)/5) + ((y_t-2 + y_t-1 + y_t + y_t+1 + y_t+2)/5) + ((y_t-1 + y_t + y_t+1 + y_t+2 + y_t+3)/5)] / 3
= (y_t-3)/15 + 2(y_t-2)/15 + 3(y_t-1)/15 + 3(y_t)/15 + 3(y_t+1)/15 + 2(y_t+2)/15 + (y_t+3)/15
= 0.067y_t-3 + 0.133y_t-2 + 0.200y_t-1 + 0.200y_t + 0.200y_t+1 + 0.133y_t+2 + 0.067*y_t+3
As a result from the above 7MA = 3x5MA
Look at plastics:
plastics
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 742 697 776 898 1030 1107 1165 1216 1208 1131 971 783
## 2 741 700 774 932 1099 1223 1290 1349 1341 1296 1066 901
## 3 896 793 885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4 951 861 938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013
Plot:
autoplot(plastics, main="Monthly sales of Product A", xlab="Year", ylab="Sales (thousands")
We can make the following observations:
There are seasonal fluctuations. These are yearly, meaning that at the beginning and end of every year we see troughs (low/minimum values) and at approximately in the middle of every year we see maxima (peaks). The seasonal fluctuations do not vary (increase or decrease) as Year increases from a first glance.
There is also a positive trend (yearly average increases).
decomp_1b <- plastics %>% decompose(type="multiplicative")
Get indices:
decomp_1b$seasonal
## Jan Feb Mar Apr May Jun Jul
## 1 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 2 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 3 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 4 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 5 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## Aug Sep Oct Nov Dec
## 1 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 2 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 3 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 4 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 5 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
decomp_1b$trend
## Jan Feb Mar Apr May Jun Jul
## 1 NA NA NA NA NA NA 976.9583
## 2 1000.4583 1011.2083 1022.2917 1034.7083 1045.5417 1054.4167 1065.7917
## 3 1117.3750 1121.5417 1130.6667 1142.7083 1153.5833 1163.0000 1170.3750
## 4 1208.7083 1221.2917 1231.7083 1243.2917 1259.1250 1276.5833 1287.6250
## 5 1374.7917 1382.2083 1381.2500 1370.5833 1351.2500 1331.2500 NA
## Aug Sep Oct Nov Dec
## 1 977.0417 977.0833 978.4167 982.7083 990.4167
## 2 1076.1250 1084.6250 1094.3750 1103.8750 1112.5417
## 3 1175.5000 1180.5417 1185.0000 1190.1667 1197.0833
## 4 1298.0417 1313.0000 1328.1667 1343.5833 1360.6250
## 5 NA NA NA NA NA
Plot:
decomp_1b %>%
autoplot() + xlab("Year") +
ggtitle("Monthly sales of Product A")
The above results do support the graphical interpretation from part a. There are indeed seasonal fluctuations in the data (indices range from about 0.71 to 1.23) as well as a positive trend.
Plot with seasonally adjusted data:
autoplot(plastics, series="Data") +
autolayer(trendcycle(decomp_1b), series="Trend") +
autolayer(seasadj(decomp_1b), series="Seasonally Adjusted") +
xlab("Year") + ylab("Sales (thousands)") +
ggtitle("Monthly sales of Product A") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))
## Warning: Removed 12 row(s) containing missing values (geom_path).
Seasonally adjusted data:
seasadj(decomp_1b)
## Jan Feb Mar Apr May Jun Jul
## 1 967.3468 981.2262 999.3182 986.4758 985.8925 956.7826 1001.1759
## 2 966.0431 985.4495 996.7427 1023.8257 1051.9377 1057.0417 1108.5982
## 3 1168.1168 1116.3736 1139.6864 1158.9443 1152.4413 1146.0648 1119.7701
## 4 1239.8204 1212.1029 1207.9388 1218.2647 1219.4437 1229.0378 1277.0364
## 5 1342.8129 1452.8342 1450.0416 1411.6051 1405.1361 1414.8628 1384.4587
## Aug Sep Oct Nov Dec
## 1 992.4139 981.0263 951.4241 978.9119 939.8819
## 2 1100.9592 1089.0366 1090.2260 1074.6860 1081.5244
## 3 1171.9625 1196.2349 1222.2981 1179.5334 1227.9683
## 4 1269.0820 1302.6210 1345.9580 1414.4319 1451.2352
## 5 1312.3369 1240.9008 1194.5377 1128.1178 1215.9647
Add new outlier value in plastics data:
plastics_outlier <- plastics
plastics_outlier[25] <- plastics_outlier[15] + 500
plastics_outlier
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 742 697 776 898 1030 1107 1165 1216 1208 1131 971 783
## 2 741 700 774 932 1099 1223 1290 1349 1341 1296 1066 901
## 3 1274 793 885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4 951 861 938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013
Decompose again:
decomp_1e <- plastics_outlier %>% decompose(type="multiplicative")
Plot decomposition again:
decomp_1e %>%
autoplot() + xlab("Year") +
ggtitle("Monthly sales of Product A (with outlier)")
Plot with seasonally adjusted data:
autoplot(plastics_outlier, series="Data") +
autolayer(trendcycle(decomp_1e), series="Trend") +
autolayer(seasadj(decomp_1e), series="Seasonally Adjusted") +
xlab("Year") + ylab("Sales (thousands)") +
ggtitle("Monthly sales of Product A (with outlier)") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))
## Warning: Removed 12 row(s) containing missing values (geom_path).
Seasonally adjusted data:
seasadj(decomp_1e)
## Jan Feb Mar Apr May Jun Jul
## 1 879.5968 987.8941 1006.1412 993.1830 992.4330 962.9930 1008.1459
## 2 878.4113 992.1461 1003.5481 1030.7868 1058.9164 1063.9028 1116.3161
## 3 1510.2510 1123.9598 1147.4678 1166.8241 1160.0867 1153.5038 1127.5658
## 4 1127.3538 1220.3397 1216.1862 1226.5478 1227.5336 1237.0153 1285.9269
## 5 1221.0036 1462.7068 1459.9421 1421.2028 1414.4579 1424.0465 1394.0971
## Aug Sep Oct Nov Dec
## 1 999.6349 987.9764 958.0526 985.5189 946.1675
## 2 1108.9700 1096.7519 1097.8215 1081.9394 1088.7572
## 3 1180.4899 1204.7096 1230.8138 1187.4945 1236.1805
## 4 1278.3160 1311.8495 1355.3352 1423.9784 1460.9406
## 5 1321.8856 1249.6920 1202.8600 1135.7319 1224.0966
As we can observe from the plots above the decomposition was robust enough and the seasonal component did not really change the outlier was viewed as a remainder (reasonable) and it can be observed in the seasonally adjusted plot (deseasonalized). Overall, the outlier did affect the seasonally adjusted data by introducing a spike at the point that we replaced with an extreme value (outlier).
No matter where the outlier would lie (middle, beginning or end) it would probably not have an effect on the seasonal component. It would, however, adjust the seasonally adjusted data as we saw above. In addition, having the outlier at one of the ends would probably cause a change in the overall level (trend) of the decomposed series.
Load retail data:
retaildata <- readxl::read_excel("/Users/yannos/Documents/PredictiveAnalytics/retail.xlsx", skip=1)
Make into ts with chosen column:
retail <- ts(retaildata[,"A3349873A"], frequency=12, start=c(1982,4))
Plot the series:
autoplot(retail, main="Monthly Austrailian Retail Sales",
ylab="Sales", xlab="Year")
X11 decomposition
Since we observe above that there is the seasonal fluctuation is varying, I would try a multiplicative decomposition:
x11_multiplicative <- retail %>% seas(x11="", transform.function = "log")
x11_multiplicative
##
## Call:
## seas(x = ., transform.function = "log", x11 = "")
##
## Coefficients:
## Mon Tue Wed Thu
## 0.0026036 -0.0069845 0.0069365 0.0002691
## Fri Sat AO2000.Jun LS2011.Jan
## 0.0088366 -0.0021652 0.1558938 0.1763023
## MA-Nonseasonal-01 MA-Seasonal-12
## 0.3151592 0.5487838
Plot:
autoplot(x11_multiplicative)
As we can observe at the remainder plot there are some outliers that were not particularly visible before. Specifically, these would be around 1989, 1994, 2001, 2011 and 2012. In addition, we can observe the trend more clearly which is constantly increasing then to some extent levels off and then has a sudden increase to level off again during the last few months. Finally in the original time series plot it seems that the seasonal fluctuation constantly increases while after the decomposition we see that this is not necessarily the case.
We can make the following observations from the plots showing the decomposition of the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995:
There is an overall positive trend. The slope slighly decreases in 1991 but then goes back to approximately the original increasing slope after around 1993.
There are strong seasonal fluctuations (within cyclical trend). According to the scale, however, these seasonal fluctuations are not very drastic in comparison to the overall data/trend. The second seasonal fit plot shows us that the highest seasonal number of people in the labour force are observed in December and March while the lowest in January and August.
There are some very obvious strong outliers (as shown by the remainder plot) around and right after 1991 where we see sudden drops in the number of persons. These are greater than the average seasonality fluctuations (as shown by the scale) and coincide with the slight decrease in the trend slope.
Yes, the recession of 1991/1992 is visible in the estimated components. Specifically, it is observed as the outlier points (the unexpected drops) observed in the data and mainly the remainder plot in 1991 and 1992.
Look at cangas:
head(cangas,10)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1960 1.4306 1.3059 1.4022 1.1699 1.1161 1.0113 0.9660 0.9773 1.0311 1.2521
Plots:
autoplot(cangas, main="Monthly Canadian Gas Production",
ylab="Gas Production (Billions)", xlab="Year")
ggsubseriesplot(cangas, main="Monthly Canadian Gas Production",
ylab="Gas Production (Billions)", xlab="Month")
ggseasonplot(cangas, main="Monthly Canadian Gas Production (seasonal for every year)",
ylab="Gas Production (Billions)", xlab="Month")
As we can observe from the plots above, there is indeed changing seasonality over time. The size of seasonal fluctuations (amplitude) as well as the frequency and characteristic shape are all changing.
There could be various reasons for changing seasonality over time:
One could be changing average monthly temperatures and/or other environmental factors over time.
Monthly demand could change over time affecting needs for production such as higher use of gas for various purposes in later years that would not necessarily mean the same rise / drop in production as earlier years.
Extrenal factors that could vary over time such as gas prices, economic and plotical circumstances.
New discovered means or sources for production / storage / transportation over time for both Canada and countried that are supplied by Canada (eg US with increasing shale gas production etc).
From firstly observing the decomposition components I have decided the following:
To use an s.window = 15 because it signifies the number of consecutive years to use when estimating the seasonal component. In this case we are seeing seasonality fluctuations changing about every 15 years (there are approximately three main stages of seasonality fluctuations in the data)
Fit stl:
fit_stl <- cangas %>%
stl(s.window=15, robust=TRUE)
Plot:
fit_stl %>% autoplot()
Try out SEATS decomposition:
fit_seats <- cangas %>% seas()
Plot:
fit_seats %>% autoplot()
Try out X11 decomposition:
fit_x11 <- cangas %>% seas(x11="")
Plot:
fit_x11 %>% autoplot()
We can now make the following observations in comparing STL with SEATS and X11 decompositions:
In all cases the trend is captured similarly.
In SEATS and X11 the seasonal component is very similar but different compared to the original STL decomposition. This is probably the case since SEATS and X11 assume multiplicative decomposition while the original STL can only handle additive decomposition.
STL allows for better handling of drastically changing seasonal fluctuations. This can be observed at the remainder plots where we can see that remainders are quite similar in the SEATS and X11 decompositions (although the remainder component in SEATS is slightly larger overall) but in general the remainder component of STL is smaller.
bricksq quarterly data:
bricksq
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 189 204 208 197
## 1957 187 214 227 223
## 1958 199 229 249 234
## 1959 208 253 267 255
## 1960 242 268 290 277
## 1961 241 253 265 236
## 1962 229 265 275 258
## 1963 231 263 308 313
## 1964 293 328 349 340
## 1965 309 349 366 340
## 1966 302 350 362 337
## 1967 326 358 359 357
## 1968 341 380 404 409
## 1969 383 417 454 428
## 1970 386 428 434 417
## 1971 385 433 453 436
## 1972 399 461 476 477
## 1973 452 461 534 516
## 1974 478 526 518 417
## 1975 340 437 459 449
## 1976 424 501 540 533
## 1977 457 513 522 478
## 1978 421 487 470 482
## 1979 458 526 573 563
## 1980 513 551 589 564
## 1981 519 581 581 578
## 1982 500 560 512 412
## 1983 303 409 420 413
## 1984 400 469 482 484
## 1985 447 507 533 503
## 1986 443 503 505 443
## 1987 415 485 495 458
## 1988 427 519 555 539
## 1989 511 572 570 526
## 1990 472 524 497 460
## 1991 373 436 424 430
## 1992 387 413 451 420
## 1993 394 462 476 443
## 1994 421 472 494
Plot:
bricksq %>% autoplot()
Fixed seasonality (s.window=periodic):
fit_stl_fixed <- bricksq %>%
stl(t.window=27, s.window="periodic", robust=TRUE)
fit_stl_fixed %>% autoplot()
Changing seasonality:
fit_stl_changing <- bricksq %>%
stl(t.window=27, s.window=7, robust=TRUE)
fit_stl_changing %>% autoplot()
Get seasonally adjusted:
stl_changing_seasadj <- seasadj(fit_stl_changing)
Plot with seasonally adjusted data:
autoplot(bricksq, series="Data") +
autolayer(trendcycle(fit_stl_changing), series="Trend") +
autolayer(seasadj(fit_stl_changing), series="Seasonally Adjusted") +
xlab("Year") + ylab("Brick Production") +
ggtitle("Australian quarterly clay brick production. 1956–1994") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))
Naive forecasts of seasonally adjusted data:
naive_seas_adj <- fit_stl_changing %>% seasadj() %>% naive()
Plot:
naive_seas_adj %>% autoplot() +
ylab("Brick Production") + xlab("Year") +
ggtitle("Naive forecasts of seasonally adjusted quarterly clay brick production")
Reseasonalized naive forecasts with stlf:
naive_reseasonalized <- stlf(bricksq, method='naive')
Plot:
naive_reseasonalized %>% autoplot() +
ylab("Brick Production") + xlab("Year") +
ggtitle("Naive forecasts of quarterly clay brick production")
Check residuals:
checkresiduals(naive_reseasonalized)
## Warning in checkresiduals(naive_reseasonalized): The fitted degrees of freedom
## is based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 40.829, df = 8, p-value = 2.244e-06
##
## Model df: 0. Total lags used: 8
It seems that residuals are still correlated. The ACF plot shows some high autocorrelation values and the Ljung-Box test has a very small p-value. Residuals are also close to being normally distributed.
Repeat with robust STL decomposition:
naive_reseasonalized_robust <- stlf(bricksq, method='naive', robust=TRUE)
Plot:
naive_reseasonalized_robust %>% autoplot() +
ylab("Brick Production") + xlab("Year") +
ggtitle("Naive forecasts of quarterly clay brick production")
Check residuals:
checkresiduals(naive_reseasonalized_robust)
## Warning in checkresiduals(naive_reseasonalized_robust): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 27.961, df = 8, p-value = 0.0004817
##
## Model df: 0. Total lags used: 8
We observe that residuals are still somewhat corelated but there has been some slight improvement compared to the non-robust method since the p-value has increased.
Split the data into train and test sets:
bricksq_train <- window(bricksq, end=c(1992,3))
bricksq_test <- window(bricksq, start=c(1992,4), end=c(1994,3))
Train and forecast using snaive():
fit_snaive <- snaive(bricksq_train)
fc_snaive <- forecast(fit_snaive, h=8)
Train and forecast using stlf():
fit_stlf <- stlf(bricksq_train)
fc_stlf <- forecast(fit_stlf, h=8)
Plot forecasts:
autoplot(bricksq, series = "Original data") +
geom_line() +
autolayer(fc_snaive, PI = FALSE, series = "snaive") +
autolayer(fc_stlf, PI = FALSE, series = "stfl") +
scale_color_manual(values = c("gray", "blue", "red"),
breaks = c("Original Data", "snaive", "stfl")) +
scale_x_continuous(limits = c(1990, 1994.5)) +
guides(colour = guide_legend(title = "series")) +
ggtitle("snaive and stfl forecasts") +
ylab("Brick production") +
xlab("Year")
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 136 row(s) containing missing values (geom_path).
From a first glance from the above plot we could say that the stfl() fit does a better job in forecasting the last 2 years of data. However, we can evaluate the two fits by calculating RMSE on the test set:
df_test <- bricksq_test %>% as.data.frame() %>% rename(data=x)
df_snaive <- fc_snaive$mean %>% as.data.frame() %>% rename(snaive=x)
df_stfl <- fc_stlf$mean %>% as.data.frame() %>% rename(stlf=x)
Calculate out of sample rmse for snaive:
sqrt(mean(df_test$data - df_snaive$snaive))
## [1] 5.244044
Calculate out of sample rmse for stlf:
sqrt(mean(df_test$data - df_stfl$stlf))
## [1] 4.878376
As we can observe the rmse for stlf is lower and as a result stlf produces better forecasts.
Look at data:
writing
## Jan Feb Mar Apr May Jun Jul Aug
## 1968 562.674 599.000 668.516 597.798 579.889 668.233 499.232 215.187
## 1969 634.712 639.283 712.182 621.557 621.000 675.989 501.322 220.286
## 1970 646.783 658.442 712.906 687.714 723.916 707.183 629.000 237.530
## 1971 676.155 748.183 810.681 729.363 701.108 790.079 594.621 230.716
## 1972 747.636 773.392 813.788 766.713 728.875 749.197 680.954 241.424
## 1973 795.337 788.421 889.968 797.393 751.000 821.255 691.605 290.655
## 1974 843.038 847.000 941.952 804.309 840.307 871.528 656.330 370.508
## 1975 778.139 856.075 938.833 813.023 783.417 828.110 657.311 310.032
## 1976 895.217 856.075 893.268 875.000 835.088 934.595 832.500 300.000
## 1977 875.024 992.968 976.804 968.697 871.675 1006.852 832.037 345.587
## Sep Oct Nov Dec
## 1968 555.813 586.935 546.136 571.111
## 1969 560.727 602.530 626.379 605.508
## 1970 613.296 730.444 734.925 651.812
## 1971 617.189 691.389 701.067 705.777
## 1972 680.234 708.326 694.238 772.071
## 1973 727.147 868.355 812.390 799.556
## 1974 742.000 847.152 731.675 898.527
## 1975 780.000 860.000 780.000 807.993
## 1976 791.443 900.000 781.729 880.000
## 1977 849.528 913.871 868.746 993.733
Plot:
autoplot(writing)
As we can observe there is probably no need for a Box-Cox tranformation. The series has a fairly linear trend and seasonality fluctuations do not vary.
Try stlf naive:
stlf_writing_naive <- stlf(writing, robust = TRUE, method = "naive")
autoplot(stlf_writing_naive)
Try stlf rwdrif:
stlf_writing_rwdrift <- stlf(writing, robust = TRUE, method = "rwdrift")
autoplot(stlf_writing_rwdrift)
onsidering the trend of the series I would suggest that the stlf_writing_rwdrift fit is most appropriate (using the rwdrift method).
Look at data:
fancy
## Jan Feb Mar Apr May Jun Jul
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1988 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1989 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1990 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1991 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 1992 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 1993 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1987 3566.34 5021.82 6423.48 7600.60 19756.21
## 1988 4752.15 5496.43 5835.10 12600.08 28541.72
## 1989 8176.62 8573.17 9690.50 15151.84 34061.01
## 1990 7979.25 8093.06 8476.70 17914.66 30114.41
## 1991 12552.22 11637.39 13606.89 21822.11 45060.69
## 1992 19888.61 23933.38 25391.35 36024.80 80721.71
## 1993 28586.52 30505.41 30821.33 46634.38 104660.67
Plot:
autoplot(fancy)
As we can observe there is probably a need for a Box-Cox tranformation. The series has an exponential trend and there varying seasonality fluctuations.
Try stlf naive:
stlf_fancy_naive <- stlf(fancy, robust = TRUE, method = "naive", lambda = BoxCox.lambda(fancy))
autoplot(stlf_fancy_naive)
Try stlf rwdrif:
stlf_fancy_rwdrift <- stlf(fancy, robust = TRUE, method = "rwdrift", lambda = BoxCox.lambda(fancy))
autoplot(stlf_fancy_rwdrift)
Again, considering the trend of the series I would suggest that the stlf_fancy_rwdrift fit is most appropriate (using the rwdrift method).