##Chapter 5 - Time Series Regression Models

Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows:

#Question 1.A

library(fpp2)
## Loading required package: ggplot2
## Loading required package: forecast
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Loading required package: fma
## Loading required package: expsmooth
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
daily20 <- head(elecdaily,20)
autoplot(daily20, main="Daily Electricity Demand for Victoria, Australia during 2014")

lm(formula = Demand~Temperature, data=daily20)
## 
## Call:
## lm(formula = Demand ~ Temperature, data = daily20)
## 
## Coefficients:
## (Intercept)  Temperature  
##      39.212        6.757
summary(lm(formula = Demand~Temperature, data=daily20))
## 
## Call:
## lm(formula = Demand ~ Temperature, data = daily20)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.060  -7.117  -1.437  17.484  27.102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2117    17.9915   2.179   0.0428 *  
## Temperature   6.7572     0.6114  11.052 1.88e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22 on 18 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8644 
## F-statistic: 122.1 on 1 and 18 DF,  p-value: 1.876e-09
# 1.A.
# As time gets close to 3, the demand in electricity reaches its breaking point before going back down once it moves past 3.
# There is a positive relationship between the two variables because as temperature rises,people turn on their air conditioning, which consumes more electricity and increase the demand of it.

#Question 1.B

plot(lm(Demand~Temperature, data=elecdaily))

# 1.B
# Most of the values are within the range, but there are a few outliers that affect negatively the residuals.
reg.LR <- tslm(Demand ~ Temperature, data=daily20)
checkresiduals(reg.LR)

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
#1.B
#In parallel, to the previous graphs, a few outliers affect negatively the residual, but most values fall in range since the residual is symmetrical(not skewed).

#Question 1.C

newtemp <- data.frame(Temperature = c(15,35))
elec.fo <- forecast(reg.LR, newdata = newtemp)
elec.fo
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 3.857143       140.5701 108.6810 172.4591  90.21166 190.9285
## 4.000000       275.7146 245.2278 306.2014 227.57056 323.8586
#1.c
# The forecast range is feasable; in the graph at the beginning, we see an upward trend meaning that we can expect future demand to be in the ~200 range.

#Question 1.D

autoplot(elec.fo, main = "Forecasting the demand in electricity")

# This forecast seems feasable because we can clearly see that the trend goes upward meaning that the demand in electricity increases as time goes by.

#Question 1.E

plot(Demand ~ Temperature, data=elecdaily, main="Demand vs. Temperature of Elecdaily dataset")

# 1.E
# It might be a coincidence, but this graph looks a lot like the graph from question 1.B (residuals vs fitted); same distribution and same outliers. This infers that daily20 is a great representation of the original dataset.

#Question 2.A

Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.

autoplot(mens400, main="Winning times for the Men's 400 meter Olympic Games Final", ylab="Time (Seconds)", xlab="Year")

#2.A
# For each new Olypmic edition, the average time to win an olympic gold medal at the Men's 400 meter Olympic Games Final has decreased. We can see it by the downward trend in the graph. Also, we have missing values in our data, which are indicated by the broken parts in the line.

#Question 2.B

#time series linear model
time400 <- time(mens400)
tslm(mens400 ~ time400, data=mens400)
## 
## Call:
## tslm(formula = mens400 ~ time400, data = mens400)
## 
## Coefficients:
## (Intercept)      time400  
##   172.48148     -0.06457
# The winning time decreases by 0.06457 second per year
## Regression line added
autoplot(mens400, main="Winning times for the Men's 400 meter Olympic Games Final", ylab="Time (Seconds)", xlab="Year") + geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).

#Question 2.C

mens400RY <- lm(mens400 ~ time400, data=mens400)
plot(mens400RY)

#2.C
# The Residuals vs. Fitted graph shows that the slope slowly decreases meaning that the winning time decreases at well year after year.
# This shows that the fitted line is a fairly accurate short term predictor, but in the long run, it lacks accuracy regarding the upcoming winning times.

#Question 2.D

#Removing missing values

TimeMens400 = time(mens400)
mens400LM = lm(mens400 ~ TimeMens400, data=mens400, na.action=na.exclude) # We remove the 3 missing values from the regression

Mens400F2022 <- forecast(mens400LM, newdata=data.frame(TimeMens400=2020))
Mens400F2022
##   Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1       42.04231 40.44975 43.63487 39.55286 44.53176
# 2.D
# The 80% C.I. ranges from 40.44975 to 43.63487
# The 95% C.I. ranges from 39.55286 to 44.53176

#Question 3

Type easter(ausbeer) and interpret what you see

easter(ausbeer)
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
#3

#Just by the name that I had to type, I can infer that we talk about Australia because "AUs" is the 3-letter code of Australia. Plus the data set must refer to beer as well because of the second part of the name, "beer".

#I see that the data set ranges from 1956 to 2010 and I see that there are 4 columns, which refers to each quarter. From persoal knowledge, rugby is very famous in Australia so I assume each quarter refers to a break in between a rubgy game. 

#At first I thought of a binary dataset because of the "0.00" and "1.00", such as,"likelihood that rugby fans will consume a beer" but then I saw the values 0.33 and 0,67 so it made me question it... (maybe a probability ? a percentage ?)

#Question 4

Express y as a function of x and show that the coefficient β1 is the elasticity coefficient

#4
# log(y) = β0 + β1log(x) + e
# β1log(x) = log(y) - β0 - e
# β1 = (log(y) - β0 - e) / log(x)

# Coefficient β1 is the elasticity coefficient because it shows the ratio of the percentage change in the forecast variable (y) to the change in the predictor variable (X).
# For each change in the predictor (x), there is a β1 ratio percentage change in the forecast variable (y).

#Question 5.A

The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

autoplot(fancy, main="Sales Volume over Time", ylab="Sales Volume")

seasonplot(fancy, main="Seasonal Sales Volume over Time", ylab="Sales Volume")

#5A
# The graph shows signs of seasonal trends within the sales. Indeed, we can see a spike in Sales right before each new year(which is confirm in the second graph), indicating that the holiday season has a positive effect on sales. 

# In addition, there is a large increase in sales from 1992 to 1993 and from 1993 to 1994.It may be due to an increase in the popularity of the products.

#Question 5.B

#5B
#Explain why it is necessary to take logarithms of these data before fitting a model.

# If a pattern exists in a data set when plotted, there may be heteroscedasticity in the residuals. This means that the variance of the residuals may not be constant and lead to issues in the regression. If this occurs, a transformation (logarithm as an example) may be required.

#Question 5.C

# Creating a "surfing festival" dummy variable
time <- time(fancy)

surfingFestival <- c()
for(i in 1:length(time)){
  month <- round(12*(time[i] - floor(time[i]))) + 1
  year <- floor(time[i])
  if(year >= 1988 & month == 3) # started March 1998
    {
    surfingFestival[i] <- 1 # Binary - 1 for if surfing festival is in town
  } else {
    surfingFestival[i] <- 0 # 0 if the surfing festival is not in town
  }
}

# Regression formula and output
# Using BoxCox power transformation prior to regression formula
fancyTrans <- (BoxCox(fancy, 0))
fancyReg <- tslm(fancyTrans ~ trend + season + surfingFestival)
summary(fancyReg)
## 
## Call:
## tslm(formula = fancyTrans ~ trend + season + surfingFestival)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.6196670  0.0742471 102.626  < 2e-16 ***
## trend           0.0220198  0.0008268  26.634  < 2e-16 ***
## season2         0.2514168  0.0956790   2.628 0.010555 *  
## season3         0.2660828  0.1934044   1.376 0.173275    
## season4         0.3840535  0.0957075   4.013 0.000148 ***
## season5         0.4094870  0.0957325   4.277 5.88e-05 ***
## season6         0.4488283  0.0957647   4.687 1.33e-05 ***
## season7         0.6104545  0.0958039   6.372 1.71e-08 ***
## season8         0.5879644  0.0958503   6.134 4.53e-08 ***
## season9         0.6693299  0.0959037   6.979 1.36e-09 ***
## season10        0.7473919  0.0959643   7.788 4.48e-11 ***
## season11        1.2067479  0.0960319  12.566  < 2e-16 ***
## season12        1.9622412  0.0961066  20.417  < 2e-16 ***
## surfingFestival 0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16

#Question 5.D

autoplot(fancyReg$residuals, main="Residuals plot of fancyReg Resgression", ylab="Residuals")

# 5D
# Based on the graph, there are still seasonal patterns that exist after the BoxCox transformation. This creates an issue with autocorrelation in the data.

#Question 5.E

boxplot(fancyReg$residuals, main="fancyReg Regression Residuals")

# 5E
# Most of the values appear to be in the middle; there don't seem to be a noticeable problem.

#Question 5.F

fancyReg$coefficients
##     (Intercept)           trend         season2         season3         season4 
##      7.61966701      0.02201983      0.25141682      0.26608280      0.38405351 
##         season5         season6         season7         season8         season9 
##      0.40948697      0.44882828      0.61045453      0.58796443      0.66932985 
##        season10        season11        season12 surfingFestival 
##      0.74739195      1.20674790      1.96224123      0.50151509
# 5F
# The surfingFestival coefficient has a value of 0.5015 showing a relative increase in sales during the month of March(each season represents a month so season 3 is the month of March). 

# The trend coefficient value of 0.0220 is low but increases gradually after each season. Until season 10 when we see a noticeable increase in sales, which corroborates the fact that sales increase toward the end of the year.

#Question 5.G

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bgtest(fancyReg)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  fancyReg
## LM test = 25.031, df = 1, p-value = 5.642e-07
# 5G
# The output of the Breusch-Godfrey test gives us insight into serial correlation in the regression model. The p-value of the test is extremely low tell us that null-hypothesis can be rejected. This means that heteroscedasticity exists in our model.

#Question 5.I

fancyTrans <- (BoxCox(fancy, 0))
forecast(fancyTrans)
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 1994       9.674641  9.464046  9.885235  9.352565  9.996717
## Feb 1994       9.958925  9.733371 10.184479  9.613970 10.303880
## Mar 1994      10.439733 10.200143 10.679323 10.073312 10.806155
## Apr 1994      10.121042  9.868185 10.373898  9.734331 10.507752
## May 1994      10.157291  9.891822 10.422759  9.751292 10.563289
## Jun 1994      10.227196  9.949682 10.504711  9.802774 10.651618
## Jul 1994      10.403289 10.114223 10.692356  9.961200 10.845378
## Aug 1994      10.382661 10.082480 10.682842  9.923574 10.841748
## Sep 1994      10.518849 10.207944 10.829754 10.043361 10.994337
## Oct 1994      10.612257 10.290980 10.933534 10.120906 11.103608
## Nov 1994      11.108401 10.777070 11.439732 10.601674 11.615129
## Dec 1994      11.898748 11.557653 12.239843 11.377088 12.420408
## Jan 1995       9.985303  9.634703 10.335903  9.449107 10.521499
## Feb 1995      10.269587  9.909735 10.629440  9.719240 10.819934
## Mar 1995      10.750396 10.381517 11.119274 10.186244 11.314547
## Apr 1995      10.431704 10.054009 10.809399  9.854070 11.009338
## May 1995      10.467953 10.081638 10.854268  9.877135 11.058770
## Jun 1995      10.537858 10.143107 10.932610  9.934137 11.141579
## Jul 1995      10.713952 10.310934 11.116969 10.097590 11.330314
## Aug 1995      10.693323 10.282201 11.104445 10.064567 11.322080
## Sep 1995      10.829511 10.410437 11.248586 10.188592 11.470430
## Oct 1995      10.922919 10.496035 11.349803 10.270057 11.575781
## Nov 1995      11.419063 10.984506 11.853621 10.754465 12.083661
## Dec 1995      12.209410 11.767308 12.651512 11.533273 12.885547
# 5J
## How could you improve these predictions by modifying the model?

# Maybe these predictions could be improved by using the lambda function.
# The BoxCox.lambda(fancy) may be good for transforming a regression model.

#Question 6.A

The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.

gasoline2014 <- window(gasoline, end=2005)
autoplot(gasoline2014, main="Quarterly Retail Trade Index in the Euro area", xlab="Year", ylab="Thousand barrels per day")

gasoline %>% tbats() %>% forecast() %>%
autoplot() + xlab("Year") + ylab("Thousand barrels per day")

#Question 6.B

gasolineCheck <- tslm(gasoline2014 ~ trend)
gasolineCheck
## 
## Call:
## tslm(formula = gasoline2014 ~ trend)
## 
## Coefficients:
## (Intercept)        trend  
##    7.089632     0.002816
#6.B
#The appropriate number is 7

#Question 6.C

checkresiduals(gasolineCheck)

## 
##  Breusch-Godfrey test for serial correlation of order up to 104
## 
## data:  Residuals from Linear regression model
## LM test = 314.46, df = 104, p-value < 2.2e-16

#Question 6.D

F.gasoline <- forecast(gasoline2014)
F.gasoline
##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## 2005.014       8.644088 8.326814  8.961361 8.158860  9.129316
## 2005.033       8.698230 8.380805  9.015655 8.212771  9.183690
## 2005.052       8.688970 8.371393  9.006548 8.203277  9.174663
## 2005.071       8.687330 8.369599  9.005060 8.201402  9.173257
## 2005.090       8.879361 8.561476  9.197246 8.393198  9.365524
## 2005.110       8.912739 8.594699  9.230779 8.426339  9.399139
## 2005.129       8.930034 8.611838  9.248231 8.443395  9.416673
## 2005.148       8.928394 8.610041  9.246747 8.441515  9.415273
## 2005.167       9.165499 8.846988  9.484010 8.678378  9.652620
## 2005.186       9.066353 8.747683  9.385023 8.578989  9.553716
## 2005.205       9.050012 8.731182  9.368841 8.562404  9.537619
## 2005.225       8.962326 8.643336  9.281317 8.474473  9.450180
## 2005.244       9.096673 8.777521  9.415825 8.608572  9.584773
## 2005.263       9.179756 8.860442  9.499071 8.691407  9.668105
## 2005.282       9.268608 8.949130  9.588085 8.780008  9.757207
## 2005.301       9.052568 8.732926  9.372211 8.563718  9.541419
## 2005.320       9.111219 8.791412  9.431027 8.622116  9.600323
## 2005.340       9.252941 8.932968  9.572915 8.763584  9.742299
## 2005.359       9.356572 9.036431  9.676713 8.866959  9.846185
## 2005.378       9.338549 9.018240  9.658859 8.848679  9.828420
## 2005.397       9.251005 8.930527  9.571483 8.760876  9.741134
## 2005.416       9.224809 8.904161  9.545458 8.734420  9.715199
## 2005.435       9.346050 9.025231  9.666869 8.855400  9.836701
## 2005.455       9.450684 9.129693  9.771675 8.959770  9.941597
## 2005.474       9.620217 9.299054  9.941381 9.129040 10.111395
## 2005.493       9.420976 9.099639  9.742314 8.929533  9.912420
## 2005.512       9.525579 9.204067  9.847092 9.033868 10.017290
## 2005.531       9.415839 9.094150  9.737527 8.923859  9.907818
## 2005.550       9.607080 9.285215  9.928945 9.114830 10.099330
## 2005.570       9.510631 9.188589  9.832674 9.018110 10.003153
## 2005.589       9.603701 9.281480  9.925922 9.110906 10.096495
## 2005.608       9.504625 9.182225  9.827026 9.011556  9.997694
## 2005.627       9.695340 9.372759 10.017921 9.201995 10.188685
## 2005.646       9.477930 9.155167  9.800692 8.984307  9.971552
## 2005.665       9.373123 9.050178  9.696068 8.879222  9.867025
## 2005.685       9.342204 9.019076  9.665332 8.848022  9.836386
## 2005.704       9.153041 8.829728  9.476354 8.658577  9.647505
## 2005.723       9.110422 8.786924  9.433920 8.615675  9.605170
## 2005.742       9.302377 8.978693  9.626061 8.807345  9.797409
## 2005.761       9.349835 9.025963  9.673706 8.854516  9.845153
## 2005.780       9.236613 8.912553  9.560672 8.741006  9.732219
## 2005.800       9.413325 9.089076  9.737574 8.917430  9.909221
## 2005.819       9.418511 9.094072  9.742950 8.922324  9.914697
## 2005.838       9.218243 8.893613  9.542873 8.721764  9.714722
## 2005.857       9.283620 8.958798  9.608442 8.786847  9.780393
## 2005.876       9.120497 8.795481  9.445512 8.623429  9.617565
## 2005.895       9.213636 8.888427  9.538846 8.716271  9.711001
## 2005.915       9.238208 8.912804  9.563613 8.740545  9.735871
## 2005.934       9.373749 9.048148  9.699349 8.875786  9.871712
## 2005.953       9.342711 9.016913  9.668508 8.844447  9.840975
## 2005.972       9.364560 9.038564  9.690555 8.865993  9.863127
## 2005.991       8.892771 8.566576  9.218965 8.393899  9.391642
## 2006.010       8.794675 8.468280  9.121069 8.295497  9.293852
## 2006.030       8.848817 8.522222  9.175413 8.349332  9.348302
## 2006.049       8.839557 8.512759  9.166354 8.339763  9.339351
## 2006.068       8.837916 8.510916  9.164917 8.337812  9.338020
## 2006.087       9.029948 8.702743  9.357152 8.529532  9.530363
## 2006.106       9.063326 8.735916  9.390735 8.562596  9.564055
## 2006.125       9.080621 8.753006  9.408236 8.579577  9.581665
## 2006.144       9.078981 8.751159  9.406803 8.577620  9.580342
## 2006.164       9.316086 8.988056  9.644116 8.814407  9.817765
## 2006.183       9.216940 8.888701  9.545179 8.714941  9.718938
## 2006.202       9.200598 8.872149  9.529047 8.698279  9.702918
## 2006.221       9.112913 8.784253  9.441573 8.610271  9.615555
## 2006.240       9.247260 8.918388  9.576131 8.744294  9.750226
## 2006.259       9.330343 9.001258  9.659428 8.827051  9.833635
## 2006.279       9.419194 9.089896  9.748493 8.915575  9.922813
## 2006.298       9.203155 8.873642  9.532669 8.699208  9.707103
## 2006.317       9.261806 8.932077  9.591536 8.757528  9.766084
## 2006.336       9.403528 9.073582  9.733475 8.898918  9.908138
## 2006.355       9.507159 9.176994  9.837323 9.002216 10.012102
## 2006.374       9.489136 9.158753  9.819520 8.983858  9.994414
## 2006.394       9.401592 9.070988  9.732195 8.895977  9.907206
## 2006.413       9.375396 9.044572  9.706221 8.869444  9.881349
## 2006.432       9.496637 9.165591  9.827684 8.990345 10.002929
## 2006.451       9.601270 9.270001  9.932540 9.094637 10.107903
## 2006.470       9.770804 9.439311 10.102298 9.263829 10.277780
## 2006.489       9.571563 9.239844  9.903282 9.064243 10.078883
## 2006.509       9.676166 9.344221 10.008111 9.168500 10.183832
## 2006.528       9.566425 9.234254  9.898597 9.058412 10.074438
## 2006.547       9.757667 9.425267 10.090067 9.249305 10.266029
## 2006.566       9.661218 9.328589  9.993847 9.152506 10.169930
## 2006.585       9.754287 9.421428 10.087147 9.245223 10.263352
## 2006.604       9.655212 9.322122  9.988303 9.145794 10.164630
## 2006.624       9.845927 9.512604 10.179250 9.336154 10.355700
## 2006.643       9.628517 9.294961  9.962072 9.118387 10.138646
## 2006.662       9.523710 9.189920  9.857500 9.013222 10.034198
## 2006.681       9.492791 9.158766  9.826816 8.981943 10.003638
## 2006.700       9.303628 8.969366  9.637889 8.792419  9.814837
## 2006.719       9.261009 8.926510  9.595508 8.749437  9.772581
## 2006.739       9.452964 9.118227  9.787701 8.941028  9.964900
## 2006.758       9.500421 9.165445  9.835398 8.988119 10.012724
## 2006.777       9.387200 9.051982  9.722417 8.874529  9.899870
## 2006.796       9.563912 9.228454  9.899371 9.050873 10.076952
## 2006.815       9.569097 9.233396  9.904798 9.055687 10.082508
## 2006.834       9.368830 9.032886  9.704775 8.855047  9.882613
## 2006.854       9.434207 9.098018  9.770396 8.920050  9.948363
## 2006.873       9.271083 8.934649  9.607518 8.756551  9.785616
## 2006.892       9.364223 9.027542  9.700904 8.849314  9.879132
## 2006.911       9.388795 9.051866  9.725724 8.873507  9.904083
## 2006.930       9.524336 9.187158  9.861513 9.008667 10.040004
## 2006.949       9.493298 9.155871  9.830725 8.977247 10.009348
## 2006.969       9.515146 9.177469  9.852824 8.998713 10.031580
## 2006.988       9.043357 8.705428  9.381287 8.526539  9.560176

#Question 6.E

autoplot(F.gasoline) + xlab("Year")

# 6E
# The forecast results make sense because not only does the range is small, but it also follows the trend of previous years.

#Question 7.A

Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.

autoplot(huron, main="Mean July Average Water Surface Elevation", xlab="Year", ylab ="Feet" )

# 7.A
# he graph shows a somehow negative trend in water surface elevation.

# The data shows some signs of seasonality with the lowest elevation around 1962 and the highest elevation around 1876. Why is there such a trend thaught ? Maybe some years it rains more than others ?

#Question 7.B + 7.C

#linear
fit.lin<- tslm(huron~trend)
#piecewise
t <- time(huron)
t.break <- 1915
tb1 <- ts(pmax(0,t-t.break), start=1875)
fit.pw <- tslm(huron ~ t+tb1)

h<-1980-t[length(t)]
t.new<- t[length(t)]+seq(h)
tb1.new<- tb1[length(tb1)]+seq(h)
newdata <- cbind(t= t.new, tb1= tb1.new) %>%
  as.data.frame()

fc_linear <- forecast(fit.lin, h=h)
fc_piecewise <- forecast(fit.pw, newdata=newdata)

autoplot(huron)+
  autolayer(fc_linear, series='Linear', PI=FALSE)+
  autolayer(fc_piecewise, series='Piecewise',PI=FALSE)

# 7C
# Both forecasts seemto be inefficient in predicting the actual results. Piecewise, however, seems to be a little closer from the truth.

##Chapter 6 - Time Series Decompositon

#Question 1

#Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.

# Centered moving averages can be smoothed by another moving average. This creates a double moving average. In the case of a 3x5 moving average, this signifies a 3 moving average of a 5 moving average. 

# Weights = c(0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067)

# 3x5 MA = [((Y1 + Y2 + Y3 + Y4 + Y5)/5) + ((Y2 + Y3 + Y4 + Y5 + Y6)/5) + ((Y3 + Y4 + Y5 + Y6 + Y7)/5)] / 3

#Plugging in these values proves that the 3x5 moving average is equal to a 7-term weighted moving average

#Question 2.A

The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

autoplot(plastics, main="Monthly Sales of Product A", xlab="Year", ylab="Sales (*1000)")

#2.A
# The plot shows seasonal fluctuations of sales at the beginning and at the end of each year. 

# Sales are at their lowest at the beginning of each year and the highest at the end of each year.

Question 2.B

plastics %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Monthly sales of Product A")

#Question 2.C

# Do the results support the graphical interpretation from part a?

# Yes they do support the graphical interpretation from part a because we can see seasonal trends that exist in the data. Plus, it shows there is an overall positive trend in sales.

#Question 2.D

autoplot(plastics, main="Monthly sales of Product A", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plastics, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plastics, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plastics, h=30), series="Drift", PI=FALSE)

#Question 2.E

plasticsAdd <- plastics
plasticsAdd[15] <- plasticsAdd[15] + 500

autoplot(plasticsAdd, main="Monthly sales of Product A (Modified Value 15)", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plasticsAdd, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plasticsAdd, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plasticsAdd, h=30), series="Drift", PI=FALSE)

#2E
# The forecast values stay relatively consistent with the addition of 500 to value 15. This shows the strength of the seasonally adjusted data prediction.

#Question 2.F

## Add 500 to end Value
plasticsAddEnd <- plastics
plasticsAddEnd[40] <- plasticsAddEnd[40] + 500

autoplot(plasticsAddEnd, main="Monthly sales of Product A (Modified Value 40)", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plasticsAddEnd, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plasticsAddEnd, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plasticsAddEnd, h=30), series="Drift", PI=FALSE)

# 2F
# Despite shifting the whole graph to the left, it does not appear to make a difference when we add an outlier to the end value.

#Question 3

library(seasonal)
## 
## Attaching package: 'seasonal'
## The following object is masked from 'package:psych':
## 
##     outlier
retail = read.csv("C:\\Users\\student\\Desktop\\Predictive Analytics - Forecasting\\HW 2 - Hyndman Ch 5 & 6\\tute1.csv")

#Creating a time series of the imported data
ts.retail  <- ts(retail[,"Sales"], frequency=12, start=c(1982,4))

#Decomposition of our time series
retailx11 <- seas(ts.retail, x11="")

autoplot(retailx11, main="X11 Decomposition on Monthly Retail Sales", xlab="Year")

# 3
#Outliers are revealed in the remainder plot. We can see some major outliers around 1987 and at the end of 1989.

#Question 4.A

#Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

# The decomposition results shows the number of persons in the civilian labor force in Australia each month from February 1978 to August 1995. Immediately, we see a strong positive trend of the number of workers during the time period. The Seasonality chart shows that the growth is cyclical and has strong seasonal trends. There are some major outliers that exist in the data around 1991. The second chart provides a very interesting look ingot the seasonal components of each month. We see a large increase in July and a large decrease in March.

#Question 4.B

# Is the recession of 1991/1992 visible in the estimated components?

# We can see the presence of outliers during that time period on the graph. A recession make absolute sense in this sudden change of values and therefore is visible on the graphs.

#Question 5.A

This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).

#5A
autoplot(cangas, main="Monthly Canadian Gas Production", ylab="Gas Production (Billions)", xlab="Year")

ggsubseriesplot(cangas, main="Monthly Canadian Gas Production", ylab="Gas Production (Billions)")

ggseasonplot(cangas, main="Seasonal Plot: Monthly Canadian Gas Production", ylab="Gas Production (Billions)")

# 5A

# In talking about oil production, the seasonality may be due to fuel prices and demand. For instance, production may need to increase in months leading up to the holiday season.The first graph almost shows an increasing trend in the seasonality. This could be the result of innovations and better technology in producing gas. 

#Question 5.B

cangas %>% stl(t.window=13, s.window="periodic", robust=TRUE) %>% autoplot() +xlab("Year")

#Question 5.C

# c. Compare the results with those obtained using SEATS and X11. How are they different?

# The results appear fairly similar but some differences can be seen. 
# The seasonality graph shows a consistent seasonality throughout the time series. The remainder graph shows some outliers that can be seen between 1980 and 1990 and between 1995 to 2000.

#Question 6.A We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.

bricksq %>% stl(t.window=26, s.window="periodic", robust=TRUE) %>% autoplot()

#Queston 6.B

autoplot(bricksq, main="Monthly sales of Product A", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(bricksq, h=30), series="Seasonal Naïve", PI=FALSE)

#Question 6.C

autoplot(bricksq, main="Australian quarterly clay brick production. 1956–1994", ylab="Brick Production", xlab="Year")+ autolayer(naive(bricksq, h=30), series="Naïve", PI=FALSE)

#Question 6.D

brick.T <- stlf(bricksq)
brick.T
##         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1994 Q4       465.7326 437.4903 493.9748 422.5398 508.9254
## 1995 Q1       424.6727 384.7119 464.6335 363.5580 485.7874
## 1995 Q2       483.9914 435.0232 532.9595 409.1010 558.8817
## 1995 Q3       493.9988 437.4242 550.5734 407.4754 580.5222
## 1995 Q4       465.7326 402.4454 529.0198 368.9431 562.5220
## 1996 Q1       424.6727 355.3067 494.0387 318.5865 530.7589
## 1996 Q2       483.9914 409.0259 558.9568 369.3416 598.6411
## 1996 Q3       493.9988 413.8128 574.1848 371.3650 616.6326
autoplot(brick.T)

#Question 6.E

checkresiduals((brick.T))
## Warning in checkresiduals((brick.T)): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 41.128, df = 6, p-value = 2.733e-07
## 
## Model df: 2.   Total lags used: 8
# 6E
# the residuals appear to be corroleted.

#Question 6.F

brick.T2 <- stlf(bricksq, robust=TRUE)
brick.T2
##         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1994 Q4       462.2355 432.9988 491.4723 417.5217 506.9493
## 1995 Q1       423.5662 382.1965 464.9358 360.2967 486.8356
## 1995 Q2       486.4525 435.7558 537.1492 408.9186 563.9864
## 1995 Q3       493.9985 435.4245 552.5726 404.4173 583.5798
## 1995 Q4       462.2355 396.7089 527.7621 362.0212 562.4498
## 1996 Q1       423.5662 351.7427 495.3896 313.7216 533.4107
## 1996 Q2       486.4525 408.8280 564.0770 367.7360 605.1689
## 1996 Q3       493.9985 410.9649 577.0322 367.0096 620.9875
autoplot(brick.T2)

# 6F
checkresiduals(brick.T2)
## Warning in checkresiduals(brick.T2): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 28.163, df = 6, p-value = 8.755e-05
## 
## Model df: 2.   Total lags used: 8
# 6F
# Everything appears to be the same, which means that there must be some autocorrelation issues due to the similarities between our time series and a its lagged version over the intervals. 

#Question 6.G

# Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better?

# Splitting bricksq into training and test sets
trainBrick <- subset(bricksq, end=length(bricksq) - 9)
testBrick <- subset(bricksq, start=length(bricksq) - 8)

sBrick <- snaive(trainBrick)
stlfBrick <- stlf(trainBrick, robust=TRUE)

# Plotting the results
autoplot(sBrick)

autoplot (stlfBrick)

# Based on the graphs, it looks like the stlf() function creates less variable forecast and thus, serves as a better predictor of brick production.

#Question 7

Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.

# Plotting the dataset to determine which method to use
autoplot(writing)

# I chose the Drift method based of of the seasonally adjusted nature of the data. A trend appears in the data.
stlf(writing, method='rwdrift')
##          Point Forecast    Lo 80     Hi 80     Lo 95     Hi 95
## Jan 1978       994.1148 934.1959 1054.0337 902.47669 1085.7529
## Feb 1978      1028.3524 943.2589 1113.4459 898.21318 1158.4917
## Mar 1978      1085.7244 981.0733 1190.3756 925.67431 1245.7745
## Apr 1978      1016.3740 895.0350 1137.7130 830.80195 1201.9460
## May 1978       985.4141 849.1980 1121.6301 777.08964 1193.7385
## Jun 1978      1053.7226 903.9001 1203.5452 824.58886 1282.8564
## Jul 1978       906.0076 743.5296 1068.4857 657.51890 1154.4963
## Aug 1978       498.0555 323.6657  672.4452 231.34939  764.7615
## Sep 1978       941.7754 756.0746 1127.4762 657.77056 1225.7803
## Oct 1978      1030.8382 834.3232 1227.3531 730.29446 1331.3819
## Nov 1978       968.4962 761.5860 1175.4064 652.05442 1284.9380
## Dec 1978      1036.1790 819.2324 1253.1257 704.38778 1367.9703
## Jan 1979      1036.5608 809.8888 1263.2329 689.89586 1383.2258
## Feb 1979      1070.7985 834.6736 1306.9233 709.67672 1431.9202
## Mar 1979      1128.1704 882.8340 1373.5069 752.96072 1503.3802
## Apr 1979      1058.8200 804.4868 1313.1532 669.85099 1447.7891
## May 1979      1027.8601 764.7231 1290.9971 625.42673 1430.2935
## Jun 1979      1096.1687 824.4019 1367.9354 680.53727 1511.8001
## Jul 1979       948.4537 668.2152 1228.6921 519.86593 1377.0414
## Aug 1979       540.5015 251.9355  829.0675  99.17787  981.8251
## Sep 1979       984.2214 687.4599 1280.9830 530.36377 1438.0791
## Oct 1979      1073.2842 768.4484 1378.1200 607.07804 1539.4903
## Nov 1979      1010.9422 698.1441 1323.7404 532.55884 1489.3257
## Dec 1979      1078.6251 757.9683 1399.2818 588.22284 1569.0273
# Box Cox lambda
writingBC <- stlf(writing, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(writing))
autoplot(writingBC)

#Question 8

Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.

# Plotting the dataset to evaluate which method to choose
autoplot(fancy)

# The fancy dataset shows a similar trend and seasonally adjusted data as writing, but in a positive trend.
# Thus, I also choose to use the drift method for this dataset
stlf(fancy, method='rwdrift')
##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## Jan 1994       62016.88 54142.09  69891.67 49973.43  74060.33
## Feb 1994       63653.72 52450.20  74857.24 46519.41  80788.02
## Mar 1994       69171.00 55368.11  82973.89 48061.31  90280.70
## Apr 1994       66186.51 50154.83  82218.19 41668.17  90704.85
## May 1994       66161.22 48133.34  84189.09 38589.96  93732.47
## Jun 1994       67578.60 47716.88  87440.32 37202.72  97954.48
## Jul 1994       70352.01 48777.36  91926.67 37356.42 103347.60
## Aug 1994       71702.28 48508.79  94895.77 36230.89 107173.66
## Sep 1994       73083.34 48346.63  97820.04 35251.81 110914.86
## Oct 1994       74310.72 48093.07 100528.37 34214.28 114407.16
## Nov 1994       83745.93 56099.59 111392.27 41464.50 126027.36
## Dec 1994      113495.30 84464.81 142525.80 69096.99 157893.61
## Jan 1995       70851.51 40475.31 101227.70 24395.13 117307.89
## Feb 1995       72488.35 40800.01 104176.69 24025.21 120951.49
## Mar 1995       78005.63 45034.69 110976.58 27580.93 128430.34
## Apr 1995       75021.14 40793.82 109248.46 22674.97 127367.31
## May 1995       74995.85 39535.58 110456.11 20764.06 129227.64
## Jun 1995       76413.23 39741.10 113085.36 20328.05 132498.42
## Jul 1995       79186.64 41321.70 117051.59 21277.20 137096.09
## Aug 1995       80536.91 41496.45 119577.37 20829.67 140244.15
## Sep 1995       81917.97 41717.77 122118.16 20437.08 143398.86
## Oct 1995       83145.35 41799.89 124490.82 19912.92 146377.79
## Nov 1995       92580.56 50103.11 135058.01 27616.91 157544.22
## Dec 1995      122329.93 78732.75 165927.12 55653.79 189006.07
fancyBC <- stlf(fancy, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(fancy))
autoplot(fancyBC)