Chapter 5 - Time Series Regression Models
- Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.
library(fpp2)
library(psych)
daily20 <- head(elecdaily,20)
## a. Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?
autoplot(daily20, main="Electricity Demand for Victoria, Australia during 2014")

lm(Demand~Temperature, data=elecdaily)
##
## Call:
## lm(formula = Demand ~ Temperature, data = elecdaily)
##
## Coefficients:
## (Intercept) Temperature
## 212.3856 0.4182
summary(lm(Demand~Temperature, data=elecdaily))
##
## Call:
## lm(formula = Demand ~ Temperature, data = elecdaily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.474 -15.719 -0.336 15.767 117.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 212.3856 5.0080 42.409 <2e-16 ***
## Temperature 0.4182 0.2263 1.848 0.0654 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.55 on 363 degrees of freedom
## Multiple R-squared: 0.009322, Adjusted R-squared: 0.006592
## F-statistic: 3.416 on 1 and 363 DF, p-value: 0.0654
## There is a weak positive correlation between Demand and Temperature: the p-value of Temperature as an explanatory variable of Demand is 0.0654. The relationship is positive because higher temperatures drive air-conditioning use, so spikes in temperature are followed by spikes in demand; we see the biggest increase in Demand when temperature increases around "Time" 3.
## b. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
plot(lm(Demand~Temperature, data=elecdaily))




## It is noticeable that the majority of the values lie inside the Cook's distance lines, indicating that they have little influence on the fit. A few extreme values lie well outside the Cook's distance lines, indicating greatly influential cases. Observations #15, 16, and 17 are influential to the regression results and should be investigated further. Outliers can sometimes carry important information, but in this case it may be best to remove them.
## c. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15∘ and compare it with the forecast if the maximum temperature was 35∘. Do you believe these forecasts?
DemTSLM <- tslm(Demand ~ Temperature, data=daily20)
tempRange <- data.frame(Temperature=c(15,35))
forecastDemTemp <- forecast(DemTSLM, tempRange)
forecastDemTemp
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 3.857143 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.000000 275.7146 245.2278 306.2014 227.57056 323.8586
## The forecast range seems reasonable when compared to the Demand values in the training set, where a temperature of 20.3 produced a Demand of 173.8142 and a temperature of 34.0 produced a Demand of 258.5984.
## d. Give prediction intervals for your forecasts.
autoplot(daily20, facets=TRUE)

daily20 %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)

fit <- tslm(Demand ~ Temperature, data=daily20)
checkresiduals(fit)

##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
forecast(fit, newdata=data.frame(Temperature=c(15,35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 3.857143 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.000000 275.7146 245.2278 306.2014 227.57056 323.8586
## At 15 degrees: point forecast 140.5701 (80% PI: 108.6810 to 172.4591; 95% PI: 90.21166 to 190.9285)
## At 35 degrees: point forecast 275.7146 (80% PI: 245.2278 to 306.2014; 95% PI: 227.57056 to 323.8586)
## These prediction intervals give reasonable estimates based on the results of the training dataset.
## e. Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
plot(Demand ~ Temperature, data=elecdaily, main="Demand vs. Temperature of Elecdaily dataset")

## The plot of the full dataset resembles the shape of the daily20 subset. The same outliers identified in the residual diagnostics appear in the full plot. This suggests that the daily20 subset is a reasonable representation of the overall observations.
- Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
## a. Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400, main="Winning times for the Men's 400 meter Olympic Games Final", ylab="Time (Seconds)", xlab="Year")

## There is an inverse relationship between winning time in seconds and year: the average time to win the 400 metres has decreased over time. There are also missing (NA) observations in the data set, visible as breaks in the line.
## b. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
## Extracting the time index from the mens400 time series
time400 <- time(mens400)
tslm(mens400 ~ time400, data=mens400)
##
## Call:
## tslm(formula = mens400 ~ time400, data = mens400)
##
## Coefficients:
## (Intercept) time400
## 172.48148 -0.06457
## Winning time decreases by roughly 0.06457 seconds per year
## Regression line added
autoplot(mens400, main="Winning times for the Men's 400 meter Olympic Games Final", ylab="Time (Seconds)", xlab="Year") + geom_smooth(method="lm", se=FALSE)

## c. Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
mens400RG <- lm(mens400 ~ time400, data=mens400)
plot(mens400RG)




## The residuals-vs-fitted plot shows curvature: winning times decrease at a decreasing rate, so a straight line slightly misfits the ends of the series. The fitted line is a fairly accurate short-term predictor but offers limited insight into long-run winning times.
## d. Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
TimeMens400 <- time(mens400)
mens400LM <- lm(mens400 ~ TimeMens400, data=mens400, na.action=na.exclude) ## Removing the 3 NA values from the Regression
Mens400F2020 <- forecast(mens400LM, newdata=data.frame(TimeMens400=2020))
Mens400F2020
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1 42.04231 40.44975 43.63487 39.55286 44.53176
## The 80% prediction interval ranges from 40.44975 to 43.63487 seconds
## The 95% prediction interval ranges from 39.55286 to 44.53176 seconds
## These calculations assume that the linear trend continues through 2020 and that the residuals are uncorrelated with constant variance
- Type easter(ausbeer) and interpret what you see.
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
## I see a dataset containing quarterly values from 1956 to 2010, ending in Quarter 2 of 2010. Each quarterly value lies between 0 and 1, possibly signifying a percentage or probability of an event occurring within that quarter.
## After using the ?ausbeer command I can see that the full dataset shows the total quarterly beer production in Australia from Q1 1956 to Q2 2010. The easter() function returns the proportion of the Easter holiday period falling in each quarter; since Easter falls in March or April, the weight is split between Q1 and Q2 in years where the holiday straddles the quarter boundary (e.g., 0.67/0.33 in 1956).
- An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.
## log(y) = β0 + β1*log(x) + e
## Differentiating both sides with respect to x gives (1/y)(dy/dx) = β1*(1/x)
## Rearranging: β1 = (dy/dx)*(x/y) = (dy/y) / (dx/x)
## This is the ratio of the proportional (percentage) change in the forecast variable (y) to the proportional change in the predictor variable (x), which is exactly the definition of the elasticity coefficient.
## For each 1% change in the predictor (x), there is approximately a β1% change in the forecast variable (y).
- The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
## a. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy, main="Sales Volume over Time", ylab="Sales Volume")

seasonplot(fancy, main="Seasonal Sales Volume over Time", ylab="Sales Volume")

## The graph shows signs of seasonal trends within the sales. As mentioned in the question, we can see a spike in Sales right before each new year, indicating that the holiday season may have an effect on sales. We see a somewhat uncharacteristically large increase in sales from year 1992 to 1993 and the same from 1993 to 1994. This may be the result of the mentioned expansion of products and staff.
## b. Explain why it is necessary to take logarithms of these data before fitting a model.
## The seasonal fluctuations in this series grow with its level, so the variance of the residuals from an untransformed model would not be constant (heteroscedasticity), leading to issues in the regression. Taking logarithms makes the seasonal variation roughly uniform across the series, which is why a transformation (here the logarithm) is needed before fitting.
## c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
## Creating a "surfing festival" dummy variable
time <- time(fancy)
surfingFestival <- c()
for(i in 1:length(time)){
month <- round(12*(time[i] - floor(time[i]))) + 1
year <- floor(time[i])
if(year >= 1988 & month == 3) ## festival started March, 1988
{
surfingFestival[i] <- 1 ## Binary - 1 for if surfing festival is in town
} else {
surfingFestival[i] <- 0 ## 0 if the surfing festival is not in town
}
}
## Regression formula and output
## Using a Box-Cox power transformation with lambda = 0 (equivalent to taking logs) prior to the regression formula
fancyTrans <- BoxCox(fancy, 0)
fancyReg <- tslm(fancyTrans ~ trend + season + surfingFestival)
summary(fancyReg)
##
## Call:
## tslm(formula = fancyTrans ~ trend + season + surfingFestival)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend 0.0220198 0.0008268 26.634 < 2e-16 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 5.88e-05 ***
## season6 0.4488283 0.0957647 4.687 1.33e-05 ***
## season7 0.6104545 0.0958039 6.372 1.71e-08 ***
## season8 0.5879644 0.0958503 6.134 4.53e-08 ***
## season9 0.6693299 0.0959037 6.979 1.36e-09 ***
## season10 0.7473919 0.0959643 7.788 4.48e-11 ***
## season11 1.2067479 0.0960319 12.566 < 2e-16 ***
## season12 1.9622412 0.0961066 20.417 < 2e-16 ***
## surfingFestival 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.2e-16
## d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(fancyReg$residuals, main="Residuals Plot of the fancyReg Regression", ylab="Residuals")

## Based on the produced plot, a seasonal pattern still exists in the residuals after the log (Box-Cox) transformation. This points to an autocorrelation problem in the model.
## e. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
boxplot(residuals(fancyReg) ~ cycle(residuals(fancyReg)), main="fancyReg Regression Residuals by Month", xlab="Month", ylab="Residuals")

## f. What do the values of the coefficients tell you about each variable?
fancyReg$coefficients
## (Intercept) trend season2 season3
## 7.61966701 0.02201983 0.25141682 0.26608280
## season4 season5 season6 season7
## 0.38405351 0.40948697 0.44882828 0.61045453
## season8 season9 season10 season11
## 0.58796443 0.66932985 0.74739195 1.20674790
## season12 surfingFestival
## 1.96224123 0.50151509
## The coefficients are on the log scale and measured relative to season 1 (January). The surfingFestival coefficient of 0.5015 shows a substantial lift in sales during festival Marches. The trend coefficient of 0.0220 is small but compounds into steady growth over the series. The seasonal coefficients climb toward the end of the year, with season11 (November) and season12 (December) far larger than the rest, reflecting the holiday-season surge in sales, possibly amplified by the shop's expansion.
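## Since the regression is on the log scale, exponentiating a coefficient gives its multiplicative effect on sales. A quick sketch, assuming the fancyReg fit above:
exp(coef(fancyReg)["season12"]) ## roughly 7.1: December sales run about 7 times January's, holding trend constant
exp(coef(fancyReg)["surfingFestival"]) ## roughly 1.65: the festival lifts March sales by about 65%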
## g. What does the Breusch-Godfrey test tell you about your model?
library(lmtest)
bgtest(fancyReg)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: fancyReg
## LM test = 25.031, df = 1, p-value = 5.642e-07
## The Breusch-Godfrey test checks for serial correlation in the regression residuals. The p-value of the test is extremely low, telling us that the null hypothesis of no serial correlation can be rejected. This means that significant autocorrelation remains in the residuals of our model.
## h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
timeRangeFancy <- ts(data=time, start=1994, end=c(1996,12), frequency=12)
forecast(fancyReg, timeRangeFancy)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 1006.002 501.0539 1510.950 227.5849 1784.419
## Feb 1994 1006.317 501.3479 1511.286 227.8675 1784.767
## Mar 1994 1006.396 501.6231 1511.168 228.2491 1784.542
## Apr 1994 1006.577 501.5658 1511.589 228.0624 1785.092
## May 1994 1006.667 501.6339 1511.699 228.1190 1785.214
## Jun 1994 1006.770 501.7159 1511.824 228.1896 1785.350
## Jul 1994 1006.995 501.9201 1512.070 228.3823 1785.608
## Aug 1994 1007.036 501.9403 1512.133 228.3910 1785.682
## Sep 1994 1007.182 502.0643 1512.299 228.5036 1785.860
## Oct 1994 1007.324 502.1850 1512.462 228.6128 1786.034
## Nov 1994 1007.847 502.6870 1513.006 229.1033 1786.590
## Dec 1994 1008.666 503.4851 1513.847 229.8900 1787.442
## Jan 1995 1006.768 501.5678 1511.967 227.9624 1785.573
## Feb 1995 1007.083 501.8618 1512.304 228.2450 1785.921
## Mar 1995 1007.161 502.1369 1512.186 228.6266 1785.696
## Apr 1995 1007.343 502.0797 1512.606 228.4399 1786.246
## May 1995 1007.432 502.1478 1512.717 228.4965 1786.368
## Jun 1995 1007.535 502.2298 1512.841 228.5670 1786.504
## Jul 1995 1007.761 502.4340 1513.088 228.7598 1786.762
## Aug 1995 1007.802 502.4542 1513.150 228.7685 1786.836
## Sep 1995 1007.947 502.5782 1513.317 228.8810 1787.014
## Oct 1995 1008.089 502.6989 1513.480 228.9903 1787.188
## Nov 1995 1008.612 503.2009 1514.024 229.4808 1787.744
## Dec 1995 1009.432 503.9990 1514.865 230.2674 1788.596
## Jan 1996 1007.533 502.0817 1512.985 228.3399 1786.727
## Feb 1996 1007.849 502.3757 1513.321 228.6225 1787.075
## Mar 1996 1007.927 502.6508 1513.203 229.0041 1786.850
## Apr 1996 1008.109 502.5936 1513.624 228.8174 1787.400
## May 1996 1008.198 502.6617 1513.734 228.8740 1787.522
## Jun 1996 1008.301 502.7437 1513.859 228.9445 1787.658
## Jul 1996 1008.527 502.9479 1514.105 229.1373 1787.916
## Aug 1996 1008.568 502.9681 1514.168 229.1460 1787.990
## Sep 1996 1008.713 503.0921 1514.334 229.2585 1788.168
## Oct 1996 1008.855 503.2128 1514.497 229.3678 1788.342
## Nov 1996 1009.378 503.7148 1515.042 229.8583 1788.898
## Dec 1996 1010.198 504.5129 1515.882 230.6449 1789.750
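## A caveat on the call above: forecast() on a tslm fit needs a data frame with future values of any external regressor, here the surfing-festival dummy (trend and season are extended automatically). A sketch of that call, assuming fancyReg from part c:
futureSurfing <- rep(c(0, 0, 1, rep(0, 9)), 3) ## hypothetical future dummy: March = 1 in each of 1994-1996
forecast(fancyReg, newdata=data.frame(surfingFestival=futureSurfing))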
## i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.
fancyTrans <- BoxCox(fancy, 0)
forecast(fancyTrans)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.674641 9.464046 9.885235 9.352565 9.996717
## Feb 1994 9.958925 9.733371 10.184479 9.613970 10.303880
## Mar 1994 10.439733 10.200143 10.679323 10.073312 10.806155
## Apr 1994 10.121042 9.868185 10.373898 9.734331 10.507752
## May 1994 10.157291 9.891822 10.422759 9.751292 10.563289
## Jun 1994 10.227196 9.949682 10.504711 9.802774 10.651618
## Jul 1994 10.403289 10.114223 10.692356 9.961200 10.845378
## Aug 1994 10.382661 10.082480 10.682842 9.923574 10.841748
## Sep 1994 10.518849 10.207944 10.829754 10.043361 10.994337
## Oct 1994 10.612257 10.290980 10.933534 10.120906 11.103608
## Nov 1994 11.108401 10.777070 11.439732 10.601674 11.615129
## Dec 1994 11.898748 11.557653 12.239843 11.377088 12.420408
## Jan 1995 9.985303 9.634703 10.335903 9.449107 10.521499
## Feb 1995 10.269587 9.909735 10.629440 9.719240 10.819934
## Mar 1995 10.750396 10.381517 11.119274 10.186244 11.314547
## Apr 1995 10.431704 10.054009 10.809399 9.854070 11.009338
## May 1995 10.467953 10.081638 10.854268 9.877135 11.058770
## Jun 1995 10.537858 10.143107 10.932610 9.934137 11.141579
## Jul 1995 10.713952 10.310934 11.116969 10.097590 11.330314
## Aug 1995 10.693323 10.282201 11.104445 10.064567 11.322080
## Sep 1995 10.829511 10.410437 11.248586 10.188592 11.470430
## Oct 1995 10.922919 10.496035 11.349803 10.270057 11.575781
## Nov 1995 11.419063 10.984506 11.853621 10.754465 12.083661
## Dec 1995 12.209410 11.767308 12.651512 11.533273 12.885547
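## Since fancyTrans is log(fancy) (Box-Cox with lambda = 0), the reverse transformation is exp(). A sketch of back-transforming the regression forecasts onto the raw sales scale, assuming fancyReg and the futureSurfing dummy from the sketch above:
fancyFC <- forecast(fancyReg, newdata=data.frame(surfingFestival=futureSurfing))
exp(fancyFC$mean) ## point forecasts in raw sales units
exp(fancyFC$lower) ## back-transformed lower prediction bounds
exp(fancyFC$upper) ## back-transformed upper prediction bounds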
## j. How could you improve these predictions by modifying the model?
## These predictions could be improved by choosing the Box-Cox parameter with BoxCox.lambda(fancy) rather than fixing lambda at 0, and by supplying future values of the surfing-festival dummy when forecasting. A better-matched transformation could capture the exponential growth more effectively.
- The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.
## a. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.
## There was some confusion in finding the dataset. Has it been updated? According to the following documentation from Hyndman, it appears to be euretail.
## Source: https://cran.r-project.org/web/packages/fpp2/fpp2.pdf
gasoline2014 <- window(euretail, end=2005)
autoplot(gasoline2014, main="Quarterly Retail Trade Index in the Euro Area", xlab="Year", ylab="Retail trade index")

gasoline %>% tbats() %>% forecast() %>%
autoplot() + xlab("Year") + ylab("Thousand barrels per day")

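## For reference, a sketch of the harmonic regression the exercise asks for, assuming the weekly gasoline series that ships with fpp2 (frequency of roughly 52.18):
gas2004 <- window(gasoline, end=2005) ## data to the end of 2004
fitHarmonic <- tslm(gas2004 ~ trend + fourier(gas2004, K=5)) ## K = 5 is an arbitrary starting point
autoplot(gas2004) + autolayer(fitted(fitHarmonic), series="Fitted") + ylab("Million barrels per day")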
## b. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
gasolineCheck <- tslm(gasoline2014 ~ trend)
gasolineCheck
##
## Call:
## tslm(formula = gasoline2014 ~ trend)
##
## Coefficients:
## (Intercept) trend
## 88.610 0.302
## Fourier range of 1-25
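## A sketch of that search, assuming gas2004 from above and the CV() helper in the forecast package: fit K = 1 to 25 and keep the K with the smallest CV statistic.
cvValues <- numeric(25)
for (K in 1:25) {
fitK <- tslm(gas2004 ~ trend + fourier(gas2004, K=K))
cvValues[K] <- CV(fitK)["CV"]
}
which.min(cvValues) ## the K that minimises CV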
## c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)
checkresiduals(gasolineCheck)

##
## Breusch-Godfrey test for serial correlation of order up to 7
##
## data: Residuals from Linear regression model
## LM test = 21.937, df = 7, p-value = 0.002604
## d. To forecast using harmonic regression, you will need to generate the future values of the Fourier terms. Forecast the next year of data.
gasolineF <- forecast(gasoline2014)
gasolineF
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005 Q2 99.97156 99.39234 100.5508 99.08571 100.8574
## 2005 Q3 100.54817 99.74497 101.3514 99.31978 101.7766
## 2005 Q4 100.72226 99.74468 101.6998 99.22718 102.2173
## 2006 Q1 100.38522 99.26099 101.5095 98.66586 102.1046
## 2006 Q2 100.94123 99.68548 102.1970 99.02073 102.8617
## 2006 Q3 101.51783 100.14204 102.8936 99.41374 103.6219
## 2006 Q4 101.69193 100.20544 103.1784 99.41854 103.9653
## 2007 Q1 101.35488 99.76614 102.9436 98.92510 103.7847
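## A sketch of forecasting from the harmonic regression itself, assuming fitHarmonic and K = 5 from above; fourier() with an h argument generates the future values of the Fourier terms.
fcHarmonic <- forecast(fitHarmonic, newdata=data.frame(fourier(gas2004, K=5, h=52)))
autoplot(fcHarmonic)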
## e. Plot the forecasts along with the actual data for 2005. What do you find?
autoplot(gasolineF)

## The range of possible forecast values is relatively small and could provide a fairly accurate forecast. The 80% interval is naturally much tighter than the 95% interval.
- Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.
## a. Plot the data and comment on its features.
autoplot(huron, main="Mean July Average Water Surface Elevation", ylab="Feet", xlab="Year")

## The plot shows a negative trend in water surface elevation from around 1880 to the mid-1960s. The data also shows signs of cyclic behaviour, with the lowest elevation around 1965 and the highest near the start of the series in the 1870s.
## b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.
tslm(huron ~ trend)
##
## Call:
## tslm(formula = huron ~ trend)
##
## Coefficients:
## (Intercept) trend
## 10.2020 -0.0242
huronRG1 <- tslm(huron ~ trend)
tHuron <- time(huron)
tHuron15 <- 1915
tHuronSection <- ts(pmax(0,tHuron-tHuron15), start=1875)
## c. Generate forecasts from these two models for the period up to 1980 and comment on these.
## Forecast 1
huronRG1 <- tslm(huron ~ trend)
forecast(huronRG1, h=8)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1973 7.806127 6.317648 9.294605 5.516501 10.095752
## 1974 7.781926 6.292536 9.271315 5.490899 10.072952
## 1975 7.757724 6.267406 9.248043 5.465269 10.050179
## 1976 7.733523 6.242259 9.224788 5.439613 10.027434
## 1977 7.709322 6.217094 9.201550 5.413929 10.004715
## 1978 7.685121 6.191912 9.178331 5.388219 9.982024
## 1979 7.660920 6.166712 9.155128 5.362481 9.959359
## 1980 7.636719 6.141494 9.131943 5.336717 9.936721
## Forecast 2
huronRG2 <- tslm(huron ~ tHuronSection)
forecast(tHuronSection, h=8)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1973 58 57.86560 58.13440 57.79446 58.20554
## 1974 59 58.70969 59.29031 58.55602 59.44398
## 1975 60 59.51703 60.48297 59.26136 60.73864
## 1976 61 60.29422 61.70578 59.92061 62.07939
## 1977 62 61.04503 62.95497 60.53950 63.46050
## 1978 63 61.77204 64.22796 61.12199 64.87801
## 1979 64 62.47716 65.52284 61.67102 66.32898
## 1980 65 63.16194 66.83806 62.18893 67.81107
## The two outputs differ substantially. The first is a genuine forecast from the linear trend model and continues the slow decline; the second call actually forecasts the knot regressor tHuronSection itself rather than the fitted piecewise model, which is why it shows a steadily increasing sequence. A corrected piecewise forecast is sketched below.
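## A sketch of the piecewise model the exercise intends, assuming the knot series tHuronSection built above: include both the overall trend and the post-1915 slope change, then supply future knot values to forecast().
huronPW <- tslm(huron ~ trend + tHuronSection)
futureKnot <- pmax(0, (1973:1980) - 1915)
forecast(huronPW, newdata=data.frame(tHuronSection=futureKnot))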
Chapter 6 - Time Series Decomposition
- Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
## Centered moving averages can be smoothed by another moving average. This creates a double moving average. In the case of a 3x5 moving average, this signifies a 3 moving average of a 5 moving average.
## Weights = c(0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067)
## 3x5 MA = [((Y1 + Y2 + Y3 + Y4 + Y5)/5) + ((Y2 + Y3 + Y4 + Y5 + Y6)/5) + ((Y3 + Y4 + Y5 + Y6 + Y7)/5)] / 3
## Collecting terms, Y1 through Y7 appear 1, 2, 3, 3, 3, 2, and 1 times out of 15, giving weights 1/15, 2/15, 3/15, 3/15, 3/15, 2/15, 1/15 ≈ 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067. This proves that the 3x5 moving average equals the 7-term weighted moving average.
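## A quick numerical check of this equivalence, a sketch using ma() from the forecast package and stats::filter():
set.seed(1)
x <- ts(rnorm(20)) ## arbitrary test series
ma3x5 <- ma(ma(x, order=5), order=3) ## 3-MA of a 5-MA
wma <- stats::filter(x, c(1, 2, 3, 3, 3, 2, 1)/15, sides=2) ## 7-term weighted MA
all.equal(as.numeric(ma3x5), as.numeric(wma)) ## TRUE: the two agree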
- The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
## a. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
autoplot(plastics, main="Monthly sales of Product A", xlab="Year", ylab="Sales (thousands)")

## The plot shows seasonal fluctuations of sales at the beginning and ending months of each year. Sales are at their lowest at the beginning of each year and the highest at the end of each year.
## b. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
plastics %>% decompose(type="multiplicative") %>%
autoplot() + xlab("Year") +
ggtitle("Monthly sales of Product A")

## c. Do the results support the graphical interpretation from part a?
## These results do support the above statements regarding seasonal trends that exist in the data. It also shows there is an overall positive trend in sales.
## d. Compute and plot the seasonally adjusted data.
autoplot(plastics, main="Monthly sales of Product A", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plastics, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plastics, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plastics, h=30, drift=TRUE), series="Drift", PI=FALSE)

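## In addition to the benchmark-forecast overlay above, the seasonally adjusted series itself can be computed with seasadj(); a sketch assuming the multiplicative decomposition from part b:
plasticsDecomp <- decompose(plastics, type="multiplicative")
autoplot(seasadj(plasticsDecomp), main="Seasonally Adjusted Sales of Product A", ylab="Sales (Thousands)", xlab="Year")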
## e. Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
plasticsAdd <- plastics
plasticsAdd[15] <- plasticsAdd[15] + 500
autoplot(plasticsAdd, main="Monthly sales of Product A (Modified Value 15)", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plasticsAdd, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plasticsAdd, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plasticsAdd, h=30, drift=TRUE), series="Drift", PI=FALSE)

## The forecasted values stay relatively stable after adding 500 to observation 15. A single outlier in the middle of the series has little effect on the seasonally adjusted predictions.
## f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
## Add 500 to end Value
plasticsAddEnd <- plastics
plasticsAddEnd[40] <- plasticsAddEnd[40] + 500
autoplot(plasticsAddEnd, main="Monthly sales of Product A (Modified Value 40)", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plasticsAddEnd, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plasticsAddEnd, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plasticsAddEnd, h=30, drift=TRUE), series="Drift", PI=FALSE)

## Adding the outlier to a value near the end does shift the forecasts upward, likely creating larger values than intended. For this reason, outliers should be evaluated carefully when using multiplicative decomposition.
- Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?
## Importing retail data into R Markdown (converted to .csv file)
library(readxl)
retail <- read_excel("C:/Users/bryce_anderson/Desktop/Boston College/Predictive Analytics and Forecasting/Week 2 (PA&F)/retail.xlsx", skip=1)
## Creating a time series of the imported data
retailTS <- ts(retail[,"A3349873A"], frequency=12, start=c(1982, 3))
## Plotting the time series
autoplot(retailTS, main="Monthly Australian Retail Sales", ylab="Sales", xlab="Year")

## X11 Decomposition
library(seasonal)
retailx11 <- seas(retailTS, x11="")
autoplot(retailx11, main="X11 Decomposition of Monthly Australian Retail Sales", xlab="Year")

## Outliers are revealed in the remainder plot. We can see some major outliers around 1989, 1994, 2001, and 2012.
- Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
## a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
## The decomposition shows the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995. Immediately, we see a strong positive trend in the number of workers over the period. The seasonal panel shows that this growth also carries a strong, regular seasonal pattern. Some major outliers exist in the data around 1992, and more research could explain the unusual movement in the labour force during this period. The second figure provides a very interesting look into the seasonal component of each month: we see a large increase in July and a large decrease in March.
## b. Is the recession of 1991/1992 visible in the estimated components?
## I mentioned the large outliers in the examination above, and a recession during this period would explain exactly the results seen. The 1991/1992 recession is highly visible in the remainder plot.
- This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).
## a. Plot the data using autoplot(), ggsubseriesplot() and ggseasonplot() to look at the effect of the changing seasonality over time. What do you think is causing it to change so much?
autoplot(cangas, main="Monthly Canadian Gas Production", ylab="Gas Production (billions of cubic metres)", xlab="Year")

ggsubseriesplot(cangas, main="Monthly Canadian Gas Production", ylab="Gas Production (billions of cubic metres)")

ggseasonplot(cangas, main="Seasonal Plot: Monthly Canadian Gas Production", ylab="Gas Production (billions of cubic metres)")

## Since this is gas production, the seasonality may be due to fuel prices and demand varying with the season. For instance, production may need to increase in the months leading up to winter, when gas is used for heating, and demand may also rise around the holiday season. Production may also be cheaper in certain months. The first graph shows roughly three regimes with different kinds of seasonality; this could be the result of innovations in production or the discovery of new sources.
## b. Do an STL decomposition of the data. You will need to choose s.window to allow for the changing shape of the seasonal component.
cangas %>%
stl(t.window=13, s.window=13, robust=TRUE) %>% ## a finite s.window lets the seasonal shape evolve over time
autoplot()

## c. Compare the results with those obtained using SEATS and X11. How are they different?
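## For reference, a sketch of the SEATS and X11 decompositions being compared, assuming the seasonal package loaded earlier:
cangas %>% seas() %>% autoplot() + ggtitle("SEATS Decomposition of cangas")
cangas %>% seas(x11="") %>% autoplot() + ggtitle("X11 Decomposition of cangas")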
## The results appear fairly similar, but some differences can be seen. The seasonal panels show broadly consistent seasonality throughout the series. The remainder plot provides some insight into the outliers between 1980 and 1990, and smaller outliers appear consistently from 1995 to 2005.
- We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.
## a. Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)
bricksq %>%
stl(t.window=26, s.window="periodic", robust=TRUE) %>%
autoplot()

## b. Compute and plot the seasonally adjusted data.
autoplot(bricksq, main="Australian Quarterly Clay Brick Production", ylab="Brick Production", xlab="Year") + autolayer(snaive(bricksq, h=30), series="Seasonal Naïve", PI=FALSE)

## c. Use a naïve method to produce forecasts of the seasonally adjusted data.
autoplot(bricksq, main="Australian quarterly clay brick production. 1956–1994", ylab="Brick Production", xlab="Year")+ autolayer(naive(bricksq, h=30), series="Naïve", PI=FALSE)

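## A sketch of parts b and c done directly on the STL fit, assuming the fit from part a: seasadj() extracts the seasonally adjusted series and naive() forecasts it.
brickSTL <- stl(bricksq, t.window=26, s.window="periodic", robust=TRUE)
brickSA <- seasadj(brickSTL)
autoplot(brickSA, main="Seasonally Adjusted Brick Production", xlab="Year")
autoplot(naive(brickSA, h=8))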
## d. Use stlf() to reseasonalise the results, giving forecasts for the original data.
brickF1 <- stlf(bricksq)
brickF1
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1994 Q4 465.7326 437.4903 493.9748 422.5398 508.9254
## 1995 Q1 424.6727 384.7119 464.6335 363.5580 485.7874
## 1995 Q2 483.9914 435.0232 532.9595 409.1010 558.8817
## 1995 Q3 493.9988 437.4242 550.5734 407.4754 580.5222
## 1995 Q4 465.7326 402.4454 529.0198 368.9431 562.5220
## 1996 Q1 424.6727 355.3067 494.0387 318.5865 530.7589
## 1996 Q2 483.9914 409.0259 558.9568 369.3416 598.6411
## 1996 Q3 493.9988 413.8128 574.1848 371.3650 616.6326
autoplot(brickF1)

## e. Do the residuals look uncorrelated?
checkresiduals(brickF1)

##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 41.128, df = 6, p-value = 2.733e-07
##
## Model df: 2. Total lags used: 8
## The Ljung-Box test gives an extremely small p-value, so the residuals are significantly autocorrelated; they look approximately normal but are not uncorrelated.
## f. Repeat with a robust STL decomposition. Does it make much difference?
brickF1R <- stlf(bricksq, robust=TRUE)
brickF1R
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1994 Q4 462.2355 432.9988 491.4723 417.5217 506.9493
## 1995 Q1 423.5662 382.1965 464.9358 360.2967 486.8356
## 1995 Q2 486.4525 435.7558 537.1492 408.9186 563.9864
## 1995 Q3 493.9985 435.4245 552.5726 404.4173 583.5798
## 1995 Q4 462.2355 396.7089 527.7621 362.0212 562.4498
## 1996 Q1 423.5662 351.7427 495.3896 313.7216 533.4107
## 1996 Q2 486.4525 408.8280 564.0770 367.7360 605.1689
## 1996 Q3 493.9985 410.9649 577.0322 367.0096 620.9875
autoplot(brickF1R)

checkresiduals(brickF1R)

##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 28.163, df = 6, p-value = 8.755e-05
##
## Model df: 2. Total lags used: 8
## The robust decomposition appears to have reduced the autocorrelation somewhat (Q* drops from 41.1 to 28.2), but the residuals still fail the Ljung-Box test and the autocorrelation issues remain.
## g. Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better?
## Splitting bricksq into training and test sets
trainBrick <- subset(bricksq, end=length(bricksq) - 9)
testBrick <- subset(bricksq, start=length(bricksq) - 8)
sBrick <- snaive(trainBrick)
stlfBrick <- stlf(trainBrick, robust=TRUE)
## Plotting the results
autoplot(sBrick)

autoplot(stlfBrick)

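## Eyeballing the plots is suggestive, but accuracy() on the held-out set makes the comparison concrete. A sketch, assuming the train/test split above:
sBrick2 <- snaive(trainBrick, h=length(testBrick))
stlfBrick2 <- stlf(trainBrick, h=length(testBrick), robust=TRUE)
accuracy(sBrick2, testBrick) ## test-set errors for the seasonal naive benchmark
accuracy(stlfBrick2, testBrick) ## test-set errors for the STL-based forecast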
## The stlf() function produces a less variable forecast and serves as the better predictor of brick production here.
- Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
## Plotting the dataset to evaluate which method to choose
autoplot(writing)

## I chose the drift method because a clear trend appears in the seasonally adjusted data.
stlf(writing, method='rwdrift')
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1978 994.1148 933.9446 1054.2850 902.09247 1086.1371
## Feb 1978 1028.3524 942.5528 1114.1521 897.13317 1159.5717
## Mar 1978 1085.7244 979.7839 1191.6649 923.70237 1247.7465
## Apr 1978 1016.3740 893.0618 1139.6862 827.78432 1204.9637
## May 1978 985.4141 846.4570 1124.3711 772.89758 1197.9306
## Jun 1978 1053.7226 900.3182 1207.1271 819.11078 1288.3345
## Jul 1978 906.0076 739.0422 1072.9731 650.65599 1161.3592
## Aug 1978 498.0555 318.2147 677.8962 223.01282 773.0981
## Sep 1978 941.7754 749.6073 1133.9435 647.87963 1235.6712
## Oct 1978 1030.8382 826.7912 1234.8851 718.77523 1342.9011
## Nov 1978 968.4962 752.9447 1184.0477 638.83869 1298.1537
## Dec 1978 1036.1790 809.4404 1262.9176 689.41229 1382.9458
## Jan 1979 1036.5608 798.9077 1274.2140 673.10172 1400.0200
## Feb 1979 1070.7985 822.4674 1319.1295 691.00884 1450.5881
## Mar 1979 1128.1704 869.3688 1386.9721 732.36741 1523.9735
## Apr 1979 1058.8200 789.7308 1327.9092 647.28362 1470.3564
## May 1979 1027.8601 748.6463 1307.0739 600.83943 1454.8808
## Jun 1979 1096.1687 806.9760 1385.3613 653.88668 1538.4507
## Jul 1979 948.4537 649.4133 1247.4940 491.11096 1405.7963
## Aug 1979 540.5015 231.7322 849.2708 68.27953 1012.7235
## Sep 1979 984.2214 665.8308 1302.6121 497.28500 1471.1579
## Oct 1979 1073.2842 745.3706 1401.1978 571.78355 1574.7848
## Nov 1979 1010.9422 673.5955 1348.2890 495.01498 1526.8695
## Dec 1979 1078.6251 731.9279 1425.3222 548.39750 1608.8526
## Box Cox lambda
writingBC <- stlf(writing, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(writing))
autoplot(writingBC)

- Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
## Plotting the dataset to evaluate which method to choose
autoplot(fancy)

## The fancy dataset shows a seasonal pattern similar to writing's, but with a stronger upward trend.
## I chose to use the drift method for this set also
stlf(fancy, method='rwdrift')
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 62016.88 54094.79 69938.96 49901.09 74132.66
## Feb 1994 63653.72 52317.61 74989.83 46316.63 80990.80
## Mar 1994 69171.00 55126.64 83215.36 47692.01 90649.99
## Apr 1994 66186.51 49786.24 82586.79 41104.45 91268.57
## May 1994 66161.22 47622.54 84699.90 37808.75 94513.68
## Jun 1994 67578.60 47050.94 88106.26 36184.25 98972.95
## Jul 1994 70352.01 47944.97 92759.05 36083.40 104620.63
## Aug 1994 71702.28 47499.91 95904.65 34687.95 108716.61
## Sep 1994 73083.34 47152.22 99014.45 33425.12 112741.55
## Oct 1994 74310.72 46704.94 101916.50 32091.33 116530.11
## Nov 1994 83745.93 54510.27 112981.59 39033.85 128458.01
## Dec 1994 113495.30 82667.44 144323.17 66348.15 160642.45
## Jan 1995 70851.51 38463.56 103239.46 21318.42 120384.60
## Feb 1995 72488.35 38568.00 106408.69 20611.66 124365.04
## Mar 1995 78005.63 42576.99 113434.28 23822.20 132189.07
## Apr 1995 75021.14 38105.34 111936.94 18563.30 131478.99
## May 1995 74995.85 36611.57 113380.12 16292.16 133699.53
## Jun 1995 76413.23 36577.10 116249.36 15489.13 137337.34
## Jul 1995 79186.64 37913.52 120459.77 16064.85 142308.44
## Aug 1995 80536.91 37840.16 123233.66 15237.87 145835.95
## Sep 1995 81917.97 37809.66 126026.27 14460.14 149375.80
## Oct 1995 83145.35 37636.44 128654.27 13545.47 152745.24
## Nov 1995 92580.56 45681.00 139480.12 20853.87 164307.25
## Dec 1995 122329.93 74048.83 170611.04 48490.36 196169.51
fancyBC <- stlf(fancy, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(fancy))
autoplot(fancyBC)
