library(fpp2)
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Loading required package: forecast
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
## Loading required package: fma
## Loading required package: expsmooth
library(forecast)
daily20 <- head(elecdaily,20)
#1)
# a) Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?
autoplot(daily20, facets= T)

fit1 <- tslm(Demand~Temperature, data = daily20)
fit1
##
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
##
## Coefficients:
## (Intercept) Temperature
## 39.212 6.757
summary(fit1)
##
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.060 -7.117 -1.437 17.484 27.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2117 17.9915 2.179 0.0428 *
## Temperature 6.7572 0.6114 11.052 1.88e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22 on 18 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8644
## F-statistic: 122.1 on 1 and 18 DF, p-value: 1.876e-09
# There is a positive relationship because higher temperatures increase the use of air conditioning, which raises electricity demand.
# b) Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(fit1$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

# It seems we have one or two outliers that may be influencing our predictions. Overall the model seems adequate because there is no significant autocorrelation in the residuals.
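# Passing the fitted model itself (rather than just its residuals) lets
# checkresiduals() also run the Breusch-Godfrey test and avoids the
# degrees-of-freedom warning above; a small alternative using the same fit1:
checkresiduals(fit1)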
# c) Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts?
fc_Dem_Temp <- forecast(fit1,
newdata=data.frame(Temperature=c(15,35)))
fc_Dem_Temp
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 3.857143 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.000000 275.7146 245.2278 306.2014 227.57056 323.8586
# The forecasts appear plausible, as the predicted demand values are in line with what we observe in the original data.
# d) Give prediction intervals for your forecasts.
# 80% intervals for 15, 35 degrees
fc_Dem_Temp$upper[,1]
## Time Series:
## Start = c(3, 7)
## End = c(4, 1)
## Frequency = 7
## [1] 172.4591 306.2014
fc_Dem_Temp$lower[,1]
## Time Series:
## Start = c(3, 7)
## End = c(4, 1)
## Frequency = 7
## [1] 108.6810 245.2278
# 95% intervals for 15, 35 degrees
fc_Dem_Temp$upper[, 2]
## Time Series:
## Start = c(3, 7)
## End = c(4, 1)
## Frequency = 7
## [1] 190.9285 323.8586
fc_Dem_Temp$lower[, 2]
## Time Series:
## Start = c(3, 7)
## End = c(4, 1)
## Frequency = 7
## [1] 90.21166 227.57056
#e) Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
elecdaily %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)

# The resulting plot shows that there is a sharp increase in demand once the temperature reaches approximately 35 degrees. This means our model would not be very accurate over the full range, as it does not capture the true non-linear relationship in the data.
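# One way to capture that shape (a sketch, not part of the original answer;
# the knot at 25 degrees is an arbitrary choice) is a piecewise-linear term:
fit_pw <- tslm(Demand ~ Temperature + I(pmax(Temperature - 25, 0)),
               data = elecdaily)
summary(fit_pw)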
#2) Data set mens400 contains the winning times (in seconds) for the men's 400 meters final in each Olympic Games from 1896 to 2016.
# a. Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400)

# Most notably, the winning times in the Olympic men's 400m final have trended downward over time.
# There are also some clearly visible missing values (the Games that were not held during the World Wars).
# b. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
time_mens400 <- time(mens400)
tslm_mens400 <- tslm(mens400 ~ time_mens400,
data = mens400)
# Show data with regression line
autoplot(mens400) +
geom_abline(slope = tslm_mens400$coefficients[2],
intercept = tslm_mens400$coefficients[1],
colour = "red")

# Winning time rate:
tslm_mens400$coefficients[2]
## time_mens400
## -0.06457385
# The winning times have been decreasing at an average rate of about 0.065 seconds per year.
# c. Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
cbind(Time = time_mens400,
Residuals = tslm_mens400$residuals) %>%
as.data.frame() %>%
ggplot(aes(x = Time, y = Residuals)) +
geom_point() +
ylab("Residuals of Regression Line (Seconds)")
## Warning: Removed 3 rows containing missing values (geom_point).

# The residual plot shows that the regression model generally fitted the data well. Checking with the checkresiduals function:
checkresiduals(tslm_mens400)

##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295
# With a p-value of 0.73, the Breusch-Godfrey test gives no evidence of serial correlation in the residuals, so by this criterion the linear fit is reasonable, though perhaps we can still get a better fit.
# d. Predict the winning time for the men's 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
#In the linear model I have removed the missing data values so that we can have a more accurate prediction.
lm_mens400 <- lm(
mens400 ~ time_mens400,
data = mens400,
na.action = na.exclude
)
fc_mens400 <- forecast(
lm_mens400,
newdata = data.frame(time_mens400 = 2020)
)
autoplot(mens400) +
autolayer(fc_mens400, PI = TRUE)

# Get 80% and 95% prediction intervals
fc_mens400$upper
## [,1] [,2]
## [1,] 43.63487 44.53176
fc_mens400$lower
## [,1] [,2]
## [1,] 40.44975 39.55286
# 80% interval is from 40.45 to 43.63
# 95% interval is from 39.55 to 44.53
# These intervals were calculated under the assumption that the model's residuals are normally distributed, which the checkresiduals output suggests may not hold. Further, 2020 is an extrapolation beyond the fitted data, and the relationship is probably non-linear: the winning times should slow their rate of decrease rather than keep falling linearly. Assuming a linear trend here will likely lead to inaccurate predictions.
#3) Type easter(ausbeer) and interpret what you see.
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
# ausbeer is the quarterly Australian beer production series.
# The easter() function returns a dummy variable: 1 if Easter falls entirely within that quarter, 0 otherwise, and fractional values when the Easter holiday period is split across two quarters. Depending on the year, Easter falls entirely in Q1 or Q2, or is split between them (e.g. 0.33/0.67).
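# For illustration (a sketch, not required by the question; easter_dummy and
# fit_beer are names introduced here): the dummy can be used as an extra
# regressor alongside trend and season.
easter_dummy <- easter(ausbeer)
fit_beer <- tslm(ausbeer ~ trend + season + easter_dummy)
summary(fit_beer)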
#4) An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx)×(x/y).
# Consider the log-log model,
#   log(y) = B0 + B1*log(x) + e.
# Express y as a function of x and show that the coefficient B1 is the elasticity coefficient.
# Exponentiating both sides gives y = e^B0 * x^B1 * e^e. Differentiating with respect to x,
#   dy/dx = B1 * e^B0 * x^(B1-1) * e^e = B1 * (y/x),
# so (dy/dx) * (x/y) = B1, which is the elasticity.
# Equivalently, B1 in a log-log model means: for every 1% change in x there is approximately a B1% change in y.
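# A quick numerical illustration (a sketch with simulated data; the true values
# B0 = 2 and B1 = 1.5 are arbitrary choices):
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(2) * x^1.5 * exp(rnorm(200, sd = 0.1))
coef(lm(log(y) ~ log(x)))  # the slope estimate should be close to 1.5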
#5) The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
# a. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy)

head(fancy, 50)
## Jan Feb Mar Apr May Jun Jul
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1988 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1989 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1990 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1991 4826.64 6470.23
## Aug Sep Oct Nov Dec
## 1987 3566.34 5021.82 6423.48 7600.60 19756.21
## 1988 4752.15 5496.43 5835.10 12600.08 28541.72
## 1989 8176.62 8573.17 9690.50 15151.84 34061.01
## 1990 7979.25 8093.06 8476.70 17914.66 30114.41
## 1991
# Sales generally increase from January to December within each year, with a very strong spike every December, and the December peaks grow larger over time. In 1991, sales dipped, suggesting a potential outlier due to an unforeseen event. In most years there is also a smaller bump every March, but it is not as dramatic as the December spike.
# b. Explain why it is necessary to take logarithms of these data before fitting a model.
# For a linear model to fit well, the size of the seasonal variation should be roughly constant across the series. In the fancy data the seasonal variation grows with the level of the series (roughly exponentially), so taking logarithms stabilises the variation and lets a linear model capture the multiplicative relationship.
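# Quick visual check (a sketch): on the log scale the seasonal swings are
# roughly constant in size across the series.
autoplot(log(fancy))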
# c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a "surfing festival" dummy variable.
# make "surfing_festival" dummy variable using time index of fancy. The value is 1 if the year is equal to or above 1988 and the month is March.
Time <- time(fancy)
surfing_festival <- c()
for(i in 1:length(Time)){
  month <- round(12*(Time[i] - floor(Time[i]))) + 1
  year <- floor(Time[i])
  if(year >= 1988 & month == 3){
    surfing_festival[i] <- 1
  } else {
    surfing_festival[i] <- 0
  }
}
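# The same dummy can be built more compactly (a sketch using cycle() and time();
# surfing_festival2 is a name introduced here for comparison):
surfing_festival2 <- as.numeric(cycle(fancy) == 3 & trunc(time(fancy)) >= 1988)
all(surfing_festival == surfing_festival2)  # should be TRUE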
fancy.tslm_log <- tslm(
BoxCox(fancy, 0) ~ trend + season + surfing_festival
)
# d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(fancy.tslm_log$residuals)

# The residuals show a clear pattern over time, which means they are correlated with time; the model has not captured all of the time-related structure.
cbind(Residuals = fancy.tslm_log$residuals,
Fitted_values = fancy.tslm_log$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(x = Fitted_values,
y = Residuals)) +
geom_point()

# The size of the residuals changes as we move along the x-axis (fitted values). This means that, despite the log transformation, there is still heteroscedasticity in the errors; that is, the variance of the residuals is not constant.
# e. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
cbind.data.frame(
Month = factor(
month.abb[round(12*(Time - floor(Time)) + 1)],
labels = month.abb,
ordered = TRUE
),
Residuals = fancy.tslm_log$residuals
) %>%
ggplot(aes(x = Month,
y = Residuals)) +
geom_boxplot()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

ggsubseriesplot(fancy.tslm_log$residuals)

# The distribution of the residuals is asymmetric for some months, and for some months the median of the residuals is not 0 (the residual mean should be close to 0 in every month for a well-specified model). Residuals with these properties are unlikely to be normally distributed, which makes it difficult to obtain reliable prediction intervals.
# f. What do the values of the coefficients tell you about each variable?
fancy.tslm_log$coefficients
## (Intercept) trend season2 season3
## 7.61966701 0.02201983 0.25141682 0.26608280
## season4 season5 season6 season7
## 0.38405351 0.40948697 0.44882828 0.61045453
## season8 season9 season10 season11
## 0.58796443 0.66932985 0.74739195 1.20674790
## season12 surfing_festival
## 1.96224123 0.50151509
# The trend coefficient is positive, so on average (log) sales increase over time.
# All of the seasonal dummy coefficients are positive, meaning sales in every other month are higher than in January (the reference month), with the largest effects in November and December.
# Finally, the surfing_festival coefficient is about 0.50, which is not small compared to the seasonal effects; sales are noticeably higher in the Marches when the surfing festival was held.
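# Because the model is fitted on the log scale, exponentiating a coefficient
# gives an approximate multiplicative effect (an interpretation note, not part
# of the original answer); for the festival dummy:
exp(fancy.tslm_log$coefficients["surfing_festival"])  # about 1.65, i.e. roughly 65% higher sales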
# g. What does the Breusch-Godfrey test tell you about your model?
checkresiduals(fancy.tslm_log)

##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494
# The p-value of the test is well below 0.05, so the residuals can be distinguished from white noise: they are significantly autocorrelated, with notable spikes in the first 3 lags and around lag 24 and beyond.
# h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
# make surfing festival data for the months of 1994 through 1996.
future_fancy <- rep(0, 36)
for(i in 1:36){
  if(i %% 12 == 3){
    future_fancy[i] <- 1
  }
}
# make future data as time series.
future_fancy <- ts(data = future_fancy,
start = 1994,
end = c(1996, 12),
frequency = 12)
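# Equivalent one-step construction (a sketch; future_fancy2 is a name introduced
# here): a 1 for March of each future year.
future_fancy2 <- ts(rep(c(0, 0, 1, rep(0, 9)), times = 3),
                    start = 1994, frequency = 12)
all(future_fancy == future_fancy2)  # should be TRUE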
# forecast
fc_tslm_log_fancy <- forecast(
fancy.tslm_log,
newdata = data.frame(Time = time(future_fancy),
surfing_festival = future_fancy)
)
# plot the forecast
autoplot(fc_tslm_log_fancy)

# show prediction interval
fc_tslm_log_fancy$upper
## [,1] [,2]
## Jan 1994 9.744183 9.88111
## Feb 1994 10.017620 10.15455
## Mar 1994 10.557120 10.69475
## Apr 1994 10.194296 10.33122
## May 1994 10.241749 10.37868
## Jun 1994 10.303110 10.44004
## Jul 1994 10.486756 10.62368
## Aug 1994 10.486286 10.62321
## Sep 1994 10.589671 10.72660
## Oct 1994 10.689753 10.82668
## Nov 1994 11.171129 11.30806
## Dec 1994 11.948642 12.08557
## Jan 1995 10.011336 10.14984
## Feb 1995 10.284773 10.42328
## Mar 1995 10.823938 10.96297
## Apr 1995 10.461449 10.59996
## May 1995 10.508903 10.64741
## Jun 1995 10.570264 10.70877
## Jul 1995 10.753910 10.89242
## Aug 1995 10.753440 10.89195
## Sep 1995 10.856825 10.99533
## Oct 1995 10.956907 11.09541
## Nov 1995 11.438282 11.57679
## Dec 1995 12.215796 12.35430
## Jan 1996 10.279093 10.41951
## Feb 1996 10.552530 10.69294
## Mar 1996 11.091365 11.23212
## Apr 1996 10.729206 10.86962
## May 1996 10.776659 10.91707
## Jun 1996 10.838021 10.97843
## Jul 1996 11.021667 11.16208
## Aug 1996 11.021196 11.16161
## Sep 1996 11.124582 11.26499
## Oct 1996 11.224664 11.36508
## Nov 1996 11.706039 11.84645
## Dec 1996 12.483552 12.62396
fc_tslm_log_fancy$lower
## [,1] [,2]
## Jan 1994 9.238522 9.101594
## Feb 1994 9.511959 9.375031
## Mar 1994 10.048860 9.911228
## Apr 1994 9.688635 9.551707
## May 1994 9.736088 9.599161
## Jun 1994 9.797449 9.660522
## Jul 1994 9.981095 9.844168
## Aug 1994 9.980625 9.843698
## Sep 1994 10.084010 9.947083
## Oct 1994 10.184092 10.047165
## Nov 1994 10.665468 10.528541
## Dec 1994 11.442981 11.306054
## Jan 1995 9.499844 9.361338
## Feb 1995 9.773281 9.634775
## Mar 1995 10.310518 10.171489
## Apr 1995 9.949957 9.811451
## May 1995 9.997411 9.858904
## Jun 1995 10.058772 9.920265
## Jul 1995 10.242418 10.103911
## Aug 1995 10.241948 10.103441
## Sep 1995 10.345333 10.206826
## Oct 1995 10.445415 10.306908
## Nov 1995 10.926791 10.788284
## Dec 1995 11.704304 11.565797
## Jan 1996 9.760564 9.620151
## Feb 1996 10.034000 9.893588
## Mar 1996 10.571566 10.430810
## Apr 1996 10.210677 10.070264
## May 1996 10.258130 10.117718
## Jun 1996 10.319491 10.179079
## Jul 1996 10.503137 10.362725
## Aug 1996 10.502667 10.362254
## Sep 1996 10.606052 10.465640
## Oct 1996 10.706134 10.565722
## Nov 1996 11.187510 11.047097
## Dec 1996 11.965023 11.824611
# These intervals are on the log scale; once back-transformed (part i), the December intervals are especially wide because the December forecasts are the largest.
# i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.
# make fc_tslm_fancy object, which are inverse log transformed version of fc_tslm_log_fancy.
fc_tslm_fancy <- fc_tslm_log_fancy
# members which should be inverse log transformed.
members_inv.log <- c('x', 'mean', 'lower', 'upper', 'residuals', 'fitted')
# apply inverse log transform to the members.
fc_tslm_fancy[members_inv.log] <- lapply(
fc_tslm_log_fancy[members_inv.log],
InvBoxCox,
lambda = 0
)
# apply inverse log transform to 'BoxCox(fancy, 0)' member in model's model.
fc_tslm_fancy[['model']][['model']][1] <- lapply(
fc_tslm_log_fancy[['model']][['model']][1],
InvBoxCox,
lambda = 0
)
autoplot(fc_tslm_fancy)

# Even after transforming the data, fitted values and forecasts back, the name of the response is still 'BoxCox(fancy, 0)'.
# It cannot easily be changed because it comes from the formula passed to tslm(); renaming it would amount to fitting a new model rather than relabelling a variable.
# It would be simpler to set lambda = 0 in the model/forecast call from the start, so the log transformation and back-transformation are handled automatically.
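# A sketch of that simpler route, reusing the surfing_festival and future_fancy
# objects from above (fancy.tslm_log0 and fc_fancy0 are names introduced here);
# with lambda = 0, tslm() fits on the log scale and forecast() returns forecasts
# and intervals back-transformed to the original sales scale:
fancy.tslm_log0 <- tslm(fancy ~ trend + season + surfing_festival, lambda = 0)
fc_fancy0 <- forecast(fancy.tslm_log0,
                      newdata = data.frame(surfing_festival = future_fancy))
autoplot(fc_fancy0)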
fc_tslm_fancy$upper
## [,1] [,2]
## Jan 1994 17054.73 19557.43
## Feb 1994 22418.00 25707.73
## Mar 1994 38450.24 44123.68
## Apr 1994 26750.16 30675.62
## May 1994 28050.15 32166.37
## Jun 1994 29825.24 34201.95
## Jul 1994 35837.72 41096.73
## Aug 1994 35820.87 41077.41
## Sep 1994 39722.43 45551.50
## Oct 1994 43903.67 50346.32
## Nov 1994 71049.28 81475.41
## Dec 1994 154607.08 177294.90
## Jan 1995 22277.59 25587.08
## Feb 1995 29283.31 33633.55
## Mar 1995 50208.44 57697.39
## Apr 1995 34942.16 40133.06
## May 1995 36640.25 42083.42
## Jun 1995 38958.95 44746.58
## Jul 1995 46812.70 53767.06
## Aug 1995 46790.69 53741.78
## Sep 1995 51887.06 59595.26
## Oct 1995 57348.77 65868.34
## Nov 1995 92807.48 106594.69
## Dec 1995 201954.08 231955.81
## Jan 1996 29117.46 33506.86
## Feb 1996 38274.14 44043.89
## Mar 1996 65602.25 75517.62
## Apr 1996 45670.42 52555.15
## May 1996 47889.88 55109.18
## Jun 1996 50920.48 58596.65
## Jul 1996 61185.57 70409.18
## Aug 1996 61156.80 70376.07
## Sep 1996 67817.91 78041.33
## Oct 1996 74956.52 86256.08
## Nov 1996 121302.09 139588.15
## Dec 1996 263959.89 303751.35
fc_tslm_fancy$lower
## [,1] [,2]
## Jan 1994 10285.82 8969.583
## Feb 1994 13520.45 11790.284
## Mar 1994 23129.40 20155.412
## Apr 1994 16133.21 14068.696
## May 1994 16917.24 14752.395
## Jun 1994 17987.81 15685.969
## Jul 1994 21613.98 18848.111
## Aug 1994 21603.82 18839.249
## Sep 1994 23956.87 20891.193
## Oct 1994 26478.61 23090.230
## Nov 1994 42850.31 37366.903
## Dec 1994 93244.59 81312.400
## Jan 1995 13357.65 11629.938
## Feb 1995 17558.28 15287.252
## Mar 1995 30046.98 26146.972
## Apr 1995 20951.33 18241.435
## May 1995 21969.51 19127.918
## Jun 1995 23359.80 20338.387
## Jul 1995 28068.91 24438.412
## Aug 1995 28055.72 24426.922
## Sep 1995 31111.50 27087.467
## Oct 1995 34386.35 29938.733
## Nov 1995 55647.40 48449.831
## Dec 1995 121091.75 105429.448
## Jan 1996 17336.40 15065.329
## Feb 1996 22788.25 19802.984
## Mar 1996 39009.73 33887.802
## Apr 1996 27191.96 23629.808
## May 1996 28513.41 24778.151
## Jun 1996 30317.82 26346.183
## Jul 1996 36429.60 31657.322
## Aug 1996 36412.48 31642.439
## Sep 1996 40378.47 35088.887
## Oct 1996 44628.77 38782.394
## Nov 1996 72222.70 62761.521
## Dec 1996 157160.50 136572.460
# The range of prediction intervals became a lot bigger after inverse log transformation.
# j. How could you improve these predictions by modifying the model?
# The predictions when I don't use log transformation.
tslm_fancy <- tslm(
fancy ~ trend + season + surfing_festival
)
fc_tslm_fancy2 <- forecast(
tslm_fancy,
newdata = data.frame(Time = time(future_fancy),
surfing_festival = future_fancy)
)
autoplot(fc_tslm_fancy2)

# The result shows that, without the transformation, the forecasts cannot reflect the exponential growth in sales.
# Using the log transformation improves the predictions because the model can then capture the multiplicative (non-linear) growth and seasonality.
library(fpp2)
library(forecast)
library(fma)
#1) Show that a 3x5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067
# A 3x5 MA is a 3-MA applied to the output of a 5-MA, which gives a symmetric set of weights.
# Step by step, with seven observations Y1, ..., Y7:
#   5-MA (centred values):
#     T3 = (Y1+Y2+Y3+Y4+Y5)/5,  T4 = (Y2+Y3+Y4+Y5+Y6)/5,  T5 = (Y3+Y4+Y5+Y6+Y7)/5
#   3-MA of those values (centred on Y4):
#     (T3 + T4 + T5)/3
#       = (1/15)Y1 + (2/15)Y2 + (3/15)Y3 + (3/15)Y4 + (3/15)Y5 + (2/15)Y6 + (1/15)Y7
#       = 0.067 Y1 + 0.133 Y2 + 0.200 Y3 + 0.200 Y4 + 0.200 Y5 + 0.133 Y6 + 0.067 Y7
#
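# A quick numerical check of the equivalence (a sketch; the series is arbitrary):
set.seed(123)
x <- ts(rnorm(30))
ma3x5 <- ma(ma(x, order = 5), order = 3)                        # 3-MA of a 5-MA
wma7 <- stats::filter(x, c(1, 2, 3, 3, 3, 2, 1)/15, sides = 2)  # 7-term weighted MA
all.equal(as.numeric(ma3x5), as.numeric(wma7))                  # should be TRUE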
#2) The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
#a) Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
autoplot(plastics, xlab = 'Year', ylab="Sales")

# There is a pattern that appears to be seasonal, peaking in the summer and fall months, along with an upward trend as the data move up over time.
#b) Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
fit1 <- decompose(plastics, type='multiplicative')
tr <- fit1$trend
seas <- fit1$seasonal
#c) Do the results support the graphical interpretation from part a?
# The decomposition supports the graphical interpretation from part a: the seasonal indices peak in the summer and fall months, and the trend-cycle rises over time.
#d) Compute and plot the seasonally adjusted data.
plot(seasadj(fit1))

#e) Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
plastics[60]=plastics[60]+500
fit2<-decompose(plastics, type="multiplicative")
tr<-fit2$trend
seas<- fit2$seasonal
plot(seasadj(fit2))

# The outlier shows up as a large spike near the end of the seasonally adjusted series and noticeably changes the decomposition.
#f) Does it make any difference if the outlier is near the end rather than in the middle of the time series?
plastics[60]=plastics[60]-500
plastics[30]=plastics[30]+500
fit3<-decompose(plastics, type="multiplicative")
tr<-fit3$trend
seas<- fit3$seasonal
plot(seasadj(fit3))

# Yes. The spike now appears in the middle of the series rather than at the end. The change is somewhat less drastic because the trend-cycle estimate smooths it using the surrounding values, but it still has a large impact on the decomposition.
#3) Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?
library(readxl)
library(seasonal)
## Warning: package 'seasonal' was built under R version 3.6.1
retail <- read_excel("C:/Users/Joseph/Downloads/retail.xlsx", skip = 1)
retail <- ts(retail[,'A3349873A'], frequency = 12,start = c(1982,4))
fit4 <- seas(retail, x11="")
autoplot(fit4) +
ggtitle("X11 decomposition of retail data")

#There appears to be an outlier in the early 2000s that has a large remainder value.
#4) Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
#a) Write about 3-5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
# There is an upward trend in the data as time progresses. Around 1991-93 there are some significant outliers, as shown by the remainder component. In comparison to the other components, the seasonal variation is rather small, as indicated by the scales of the graphs.
#b) Is the recession of 1991/1992 visible in the estimated components?
#Yes as mentioned in my previous answer there is a clear change in the remainder component that is captured in those years of recession.
#5) This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 - February 2005)
#a)Plot the data using autoplot(), ggsubseriesplot() and ggseasonplot() to look at the effect of the changing seasonality over time. What do you think is causing it to change so much?
autoplot(cangas)

ggsubseriesplot(cangas)

ggseasonplot(cangas)

# The increase in production over time is likely due to increased demand and to technology providing access to reserves that were previously unattainable. The changing seasonality likely reflects gas being demanded at different times as global demand grows beyond the northern-hemisphere countries: if most gas is consumed in the colder months, then when temperatures are low in the southern hemisphere they are high in the north, which shifts the shape of the seasonal pattern.
#b)Do an STL decomposition of the data. You will need to choose s.window to allow for the changing shape of the seasonal component.
cangas %>%
stl(t.window=13, s.window=13, robust=TRUE) %>%
autoplot()

#C)Compare the results with those obtained using SEATS and X11. How are they different?
cangas %>% seas() %>%
autoplot() +
ggtitle("SEATS decomposition of Cangas")

cangas %>% seas(x11="") %>%
autoplot() +
ggtitle("x11 decomposition of Cangas")

# There is no immediately apparent difference between the SEATS and X11 decompositions; both give very similar components.
#6) We will use the bricksq data (Australian quarterly clay brick production. 1956-1994) for this exercise.
#a) Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)
#fixed seasonality
bricksq %>%
stl(t.window=13, s.window="periodic", robust=TRUE) %>%
autoplot()

bricksq %>%
stl(t.window= 13, s.window=13, robust=T) %>%
autoplot()

#b) Compute and plot the seasonally adjusted data.
fit5 <- stl(t.window = 13,s.window=13,robust=T,x=bricksq)
autoplot(seasadj(fit5))

#c) Use a naïve method to produce forecasts of the seasonally adjusted data.
fit5 %>% seasadj() %>% naive() %>% autoplot() +
ggtitle(label = "Naive forecast of seasonally adjusted brick data",
subtitle = "after STL decomposition with changing seasonality")

fit6 <- stl(t.window = 13,s.window='periodic',robust=T,x=bricksq)
fit6 %>% seasadj() %>% naive() %>% autoplot() +
ggtitle(label = "Naive forecast of seasonally adjusted brick data",
subtitle = "after STL decomposition with fixed seasonality")

#d) Use stlf() to reseasonalise the results, giving forecasts for the original data.
stlf_brick <- stlf(bricksq)
autoplot(stlf_brick)

#e) Do the residuals look uncorrelated?
checkresiduals(stlf_brick)
## Warning in checkresiduals(stlf_brick): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.

##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 41.128, df = 6, p-value = 2.733e-07
##
## Model df: 2. Total lags used: 8
# With such a small p-value, the residuals are clearly autocorrelated; they do not look like white noise.
#f) Repeat with a robust STL decomposition. Does it make much difference?
stlf_brick_robust <- stlf(bricksq, robust = TRUE)
autoplot(stlf_brick_robust)

checkresiduals(stlf_brick_robust)
## Warning in checkresiduals(stlf_brick_robust): The fitted degrees of freedom
## is based on the model used for the seasonally adjusted data.

##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 28.163, df = 6, p-value = 8.755e-05
##
## Model df: 2. Total lags used: 8
# It looks like the autocorrelations are generally lower, but there are still some large values left at lags 5 and 8.
#g) Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better?
trainset_brick <- subset(bricksq,
end = length(bricksq) - 8)
testset_brick <- subset(bricksq,
start = length(bricksq) - 7)
snaive_brick <- snaive(trainset_brick)
stlf_brick_part <- stlf(trainset_brick, robust = TRUE)
# plot data and forecast results
autoplot(bricksq, series = "Original data") +
geom_line(size = 1) +
autolayer(stlf_brick_part, PI = FALSE, size = 1,
series = "stlf") +
autolayer(snaive_brick, PI = FALSE, size = 1,
series = "snaive") +
scale_color_manual(values = c("gray50", "blue", "red"),
breaks = c("Original data", "stlf", "snaive")) +
scale_x_continuous(limits = c(1990, 1994.5)) +
scale_y_continuous(limits = c(300, 600)) +
guides(colour = guide_legend(title = "Data")) +
ggtitle("Forecast from stlf and snaive functions") +
annotate(
"rect",
xmin=1992.75,xmax=1994.5,ymin=-Inf,ymax=Inf,
fill="lightgreen",alpha = 0.3
)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Removed 136 rows containing missing values (geom_path).

# We can see from the plot that the forecasts from stlf() track the original data more closely than the forecasts from snaive(). Unlike snaive(), stlf() can also capture the trend, and its seasonal component can change over time. The test set has both trend and seasonality, so stlf() is the better choice for predicting brick production in the near future.
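# A quantitative check (a sketch): compare accuracy measures on the test set.
accuracy(stlf_brick_part, testset_brick)
accuracy(snaive_brick, testset_brick)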
# 7. Use stlf to produce forecasts of the writing series with either method="naive" or method="rwdrift", whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
str(writing)
## Time-Series [1:120] from 1968 to 1978: 563 599 669 598 580 ...
head(writing)
## Jan Feb Mar Apr May Jun
## 1968 562.674 599.000 668.516 597.798 579.889 668.233
autoplot(writing)

# We can see that there is an increasing trend in the writing data, so it is better to use the rwdrift method to forecast the non-seasonal (seasonally adjusted) component.
# I think that it would be better to do Box-Cox transformation to make the size of the seasonal variation in the data about the same across the whole series.
stlf_writing <- stlf(writing,
s.window = 13,
robust = TRUE,
lambda = BoxCox.lambda(writing),
method = "rwdrift")
autoplot(stlf_writing)

# 8. Use stlf to produce forecasts of the fancy series with either method="naive" or method="rwdrift", whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
str(fancy)
## Time-Series [1:84] from 1987 to 1994: 1665 2398 2841 3547 3753 ...
head(fancy)
## Jan Feb Mar Apr May Jun
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74
autoplot(fancy)

# There is an increasing trend, so it is better to use the rwdrift method for forecasting the non-seasonal component.
# And it would be better to do Box-Cox transformation to make the variation in the data about the same across the whole series.
stlf_fancy <- stlf(fancy,
s.window = 13,
robust = TRUE,
lambda = BoxCox.lambda(fancy),
method = "rwdrift")
autoplot(stlf_fancy)

# The prediction intervals widen dramatically once the Box-Cox transformation is inverted, but without the transformation the forecasts themselves are unreasonable.