Thomas Laffey 11/18/2021
#5.10 Exercises
#1. Electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.
#Assign the first 20 days to daily20
daily20 <- head(elecdaily, 20)
frequency(daily20)
## [1] 7
#1.a.Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?
autoplot(daily20, facets=TRUE) + ggtitle("Time Series of Demand / Work Day / Temp")
daily20 %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
#model
daily_elec_mod <- tslm(Demand ~ Temperature, data=daily20)
summary(daily_elec_mod)
##
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.060 -7.117 -1.437 17.484 27.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2117 17.9915 2.179 0.0428 *
## Temperature 6.7572 0.6114 11.052 1.88e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22 on 18 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8644
## F-statistic: 122.1 on 1 and 18 DF, p-value: 1.876e-09
#There is a positive relationship between temperature and demand because hotter temperatures lead to increased use of air conditioning. It's also possible that at extreme temperatures more people stay indoors, adding to electricity use.
#In fact, as the maximum daily temperature increases by 1 degree, predicted demand increases by about 6.76 GW
#1.b Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(daily_elec_mod)
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
res_elec <- residuals(daily_elec_mod)
summary(res_elec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.060 -7.117 -1.437 0.000 17.484 27.102
#The good news is that the residuals appear to be white noise and uncorrelated, as no lag crosses the ACF significance threshold
#The mean of the residuals is also essentially 0, so the two big criteria (zero mean and no correlation) are met
#The residuals do, however, show inconsistent variation around the mean, with a left-skewed, bimodal distribution
#The p-value from the Breusch-Godfrey test also gives no reason to believe the residuals are correlated any differently from white noise
#1.c Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was
#15 degrees and compare with the forecast if the max was 35 degrees. Do you believe these forecasts?
forecast(daily_elec_mod, newdata = data.frame(Temperature = c(15, 35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 4.285714 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.428571 275.7146 245.2278 306.2014 227.57056 323.8586
#Both point forecasts [T(15) = 140.57 & T(35) = 275.71] trend as expected, with lower demand at lower temperatures
#Both figures would likely plot above the fitted line, which makes me wonder if some outlier influence is pulling our regression model lower
#1.d Give prediction intervals for your forecasts.
#For the 15-degree forecast, the 95% prediction interval is 90.2 to 190.9 GW (80%: 108.7 to 172.5)
#For the 35-degree forecast, the 95% prediction interval is 227.6 to 323.9 GW (80%: 245.2 to 306.2)
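#A minimal sketch (reusing daily_elec_mod from above) of pulling the interval bounds straight
#from the forecast object; in the forecast package, $lower and $upper are matrices with
#"80%" and "95%" columns. fc_elec is a name I introduce here.
fc_elec <- forecast(daily_elec_mod, newdata = data.frame(Temperature = c(15, 35)))
cbind(point = fc_elec$mean,
      lo95 = fc_elec$lower[, "95%"],
      hi95 = fc_elec$upper[, "95%"])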
#1.e Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) + ggtitle("All Data Temp vs Demand")
## `geom_smooth()` using formula 'y ~ x'
#Compared with the first-20-day subset, the plot of the entire data set indicates that my model's slope is too steep
#I think my model also underestimates energy consumption at low temperatures, when demand rises again for heating
#The forecast for 35 degrees still seems reasonable, while the 15-degree forecast is badly understated.
#2. Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
#2a. Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400) + xlab("Year") + ylab("Time (s)") + ggtitle("Olympic Winning Times - Men's 400m")
#The main features of the plot are 1) a clear negative (downward) trend from 1896 to 2016 in the winning time to complete the race,
#2) noticeable gaps in the results during WW1 & WW2, and 3) no seasonality
#2b. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
m400 = ts(mens400)
time400 = time(mens400)
m400_mod = tslm(formula = m400~time400, data = mens400)
autoplot(mens400, series = "Data") + autolayer(fitted(m400_mod), series = "Fitted") +
ggtitle("M400 Olympic Winning Times vs Regression Fit") + xlab("Year") + ylab("Time (s)")summary(m400_mod)##
## Call:
## tslm(formula = m400 ~ time400, data = mens400)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6002 -0.5747 -0.2858 0.5751 4.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172.481477 11.487522 15.02 2.52e-14 ***
## time400 -0.064574 0.005865 -11.01 2.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.136 on 26 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.8234, Adjusted R-squared: 0.8166
## F-statistic: 121.2 on 1 and 26 DF, p-value: 2.752e-11
#The intercept of 172.5 is not very meaningful here, as it assumes the linear trend extends all the way back to year 0
#(it literally predicts a 172-second winning 400m time at the start of the first millennium)
#The slope, however, tells us that each year sees a decrease in the winning 400m time of about 0.065 seconds, i.e.
#each 4-year Olympic cycle we would predict a ~0.26-second decline in the winning 400m time
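#Quick check (sketch) of the decline implied by the fitted slope, per year and per 4-year Olympiad:
coef(m400_mod)["time400"] * c(per_year = 1, per_olympiad = 4)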
#2c. Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
checkresiduals(m400_mod)
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295
summary(residuals(m400_mod))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.6001 -0.5747 -0.2858 0.0000 0.5751 4.1505 3
#The plotted residuals indicate that although our R-squared value is high, significant variation remains unexplained
#I have concerns that our fitted line leaves too much residual variance, despite passing the main criteria of no autocorrelation and mean-0 residuals
#Autocorrelation within the residuals does not seem to be an issue, as the Breusch-Godfrey test and ACF plot are consistent with white noise
#The residuals show substantial variance, notably at the start of the 20th century, with a non-normal distribution and a noticeable outlier
#2d. Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
forecast(m400_mod, newdata = data.frame(time400 = c(2020)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 42.04231 40.44975 43.63487 39.55286 44.53176
#The predicted 400m winning time at the 2020 Olympics is 42.04 seconds, with a 95% prediction interval of 39.5 to 44.5 seconds
#This assumes the historical linear downward trend continues. The actual winning time in Tokyo was 43.85 seconds, which falls outside the 80% interval (though inside the 95%)
#3. Type easter(ausbeer) and interpret what you see.
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
#The data set contains quarterly columns (Q1:Q4) for years 1956:2010
#The values in the matrix fall between 0 and 1
#These are dummy variables indicating the quarter in which Easter occurs, as an explanatory factor for Australian beer sales
#I suspect that the fractional values occur when the Easter holiday is split across a quarter boundary (late March / early April)
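#Sanity check (sketch): the four quarterly dummies within any year should sum to 1,
#so a fractional row is Easter straddling the Q1/Q2 boundary. First 4 years:
rowSums(matrix(head(easter(ausbeer), 16), ncol = 4, byrow = TRUE))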
#4. Elasticity coefficient
#The elasticity is defined as (dy/dx)*(x/y): the ratio of the percentage change in y to the percentage change in x
#Starting equation: log(y) = B0 + B1*log(x) + e
#Exponentiating to express y as a function of x: y = exp(B0) * x^B1 * exp(e)
#Differentiating with respect to x: dy/dx = B1 * exp(B0) * x^(B1-1) * exp(e) = B1*(y/x)
#Rearranging: (dy/dx)*(x/y) = B1, so B1 is the elasticity coefficient
#Interpretation: a 1% change in the predictor x produces approximately a B1% change in the forecast variable y
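#Numeric verification (sketch on simulated data, base R only): the fitted slope of
#log(y) ~ log(x) recovers the elasticity used to generate y (here 1.8).
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(0.5) * x^1.8 * exp(rnorm(200, sd = 0.1))
coef(lm(log(y) ~ log(x)))["log(x)"]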
#5. The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
#5a.
head(fancy)
## Jan Feb Mar Apr May Jun
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74
autoplot(fancy) + ylab("Sales volume") + xlab("Month, Years") + ggtitle("Monthly Sales Figure Fancy Shop")
ggsubseriesplot(fancy, main = NULL)
#There is clear seasonality in the data, with annual spikes toward the latter part of the year (Q4 months)
#The seasonal spikes are also trending up (increasing) as time advances
#March has a surprising peak relative to the rest of Q1 & Q2, which is likely explained by the local surfing festival held every March since 1988
#5b. Log transformations are useful when the variation in a series increases or decreases with its level, since such series are difficult to fit directly
#The goal of the log transformation is to level out the seasonal variation over time
transformed_fancy <- BoxCox(fancy,0)
Original_fancy <- autoplot(fancy) + ylab("Sales volume") + xlab("Month, Years") + ggtitle("Original Fancy Shop")
Trans_fancy <- autoplot(transformed_fancy) + xlab("Month, Years") + ylab("Log Sales Volume") + ggtitle("Transformed Fancy Shop")
grid.arrange(Original_fancy, Trans_fancy, ncol = 2)
#Learned about grid.arrange while scoping out how to facet multiple plots
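#Sanity check (sketch): BoxCox() with lambda = 0 is exactly the natural log,
#so the transformed series above is just log(fancy).
all.equal(as.numeric(BoxCox(fancy, lambda = 0)), as.numeric(log(fancy)))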
#5c. Fit a regression model to the logarithms of these sales data with a linear trend and seasonal variables.
#I will include the trend and seasonal composite effects in the regression model
fancy_trans_mod <- tslm(transformed_fancy~trend+season)
summary(fancy_trans_mod)
##
## Call:
## tslm(formula = transformed_fancy ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41644 -0.12619 0.00608 0.11389 0.38567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6058604 0.0768740 98.939 < 2e-16 ***
## trend 0.0223930 0.0008448 26.508 < 2e-16 ***
## season2 0.2510437 0.0993278 2.527 0.013718 *
## season3 0.6952066 0.0993386 6.998 1.18e-09 ***
## season4 0.3829341 0.0993565 3.854 0.000252 ***
## season5 0.4079944 0.0993817 4.105 0.000106 ***
## season6 0.4469625 0.0994140 4.496 2.63e-05 ***
## season7 0.6082156 0.0994534 6.116 4.69e-08 ***
## season8 0.5853524 0.0995001 5.883 1.21e-07 ***
## season9 0.6663446 0.0995538 6.693 4.27e-09 ***
## season10 0.7440336 0.0996148 7.469 1.61e-10 ***
## season11 1.2030164 0.0996828 12.068 < 2e-16 ***
## season12 1.9581366 0.0997579 19.629 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1858 on 71 degrees of freedom
## Multiple R-squared: 0.9527, Adjusted R-squared: 0.9447
## F-statistic: 119.1 on 12 and 71 DF, p-value: < 2.2e-16
#Note: I did not add a specific dummy variable for the surfing competition, which creates forecasting issues in 5h that I could not resolve
autoplot(transformed_fancy, series = "Transformed Data") + autolayer(fitted(fancy_trans_mod), series = "model") + ylab("Log Sales volume") + xlab("Month, Years") + ggtitle("Transformed Fancy Shop vs Fitted")
#5d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(residuals(fancy_trans_mod)) + ylab("residuals")
fitted_res = cbind.data.frame(Fitted_values = fancy_trans_mod$fitted.values, Residuals = fancy_trans_mod$residuals)
ggplot(data = fitted_res, aes(x = Fitted_values, y = Residuals)) + geom_point()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
#The residual analysis shows clear indications of autocorrelation - it seems that seasonality has not been entirely neutralized
#Heteroscedasticity is also in play, as the variation of the residuals against the fitted values is not consistent
#5e.Do boxplots of the residuals for each month. Does this reveal any problems with the model?
#Months repeat for 7 years
residuals_month <- data.frame(fancy_trans_mod$residuals, month=rep(1:12,7))
boxplot(fancy_trans_mod$residuals ~ month, data=residuals_month)
#A better-fitted model would show consistent variability across the monthly residual boxplots, whereas March and the fall months show clear leftover seasonal and cyclical effects
#5f. What do the values of the coefficients tell you about each variable?
summary(fancy_trans_mod)
##
## Call:
## tslm(formula = transformed_fancy ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41644 -0.12619 0.00608 0.11389 0.38567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6058604 0.0768740 98.939 < 2e-16 ***
## trend 0.0223930 0.0008448 26.508 < 2e-16 ***
## season2 0.2510437 0.0993278 2.527 0.013718 *
## season3 0.6952066 0.0993386 6.998 1.18e-09 ***
## season4 0.3829341 0.0993565 3.854 0.000252 ***
## season5 0.4079944 0.0993817 4.105 0.000106 ***
## season6 0.4469625 0.0994140 4.496 2.63e-05 ***
## season7 0.6082156 0.0994534 6.116 4.69e-08 ***
## season8 0.5853524 0.0995001 5.883 1.21e-07 ***
## season9 0.6663446 0.0995538 6.693 4.27e-09 ***
## season10 0.7440336 0.0996148 7.469 1.61e-10 ***
## season11 1.2030164 0.0996828 12.068 < 2e-16 ***
## season12 1.9581366 0.0997579 19.629 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1858 on 71 degrees of freedom
## Multiple R-squared: 0.9527, Adjusted R-squared: 0.9447
## F-statistic: 119.1 on 12 and 71 DF, p-value: < 2.2e-16
#The intercept indicates the January baseline: a log sales volume of about 7.6
#All seasonal coefficients are positive, which indicates that January (the reference month) has the lightest sales volume
#March and Sept-Dec show strong sales.
#The positive trend coefficient (~0.022 per month, in logs) means the model increases with each period
#5g.What does the Breusch-Godfrey test tell you about your model?
checkresiduals(fancy_trans_mod)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.152, df = 17, p-value = 0.003209
#The Breusch-Godfrey test (p = 0.003) indicates that we can reject the hypothesis that our residuals are uncorrelated white noise
#This means the model is not fully capturing the explainable effects
#5h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
fc_fancy <- forecast(fancy_trans_mod, h = 12*3)
autoplot(fc_fancy) + xlab("Month, Years") + ylab("Log Sales Volume") + ggtitle("Fancy Shop Transformed Forecast 1994-1996")
#Shows prediction intervals
fc_fancy
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.509264 9.246995 9.771532 9.105002 9.913525
## Feb 1994 9.782700 9.520432 10.044969 9.378439 10.186962
## Mar 1994 10.249256 9.986988 10.511525 9.844995 10.653518
## Apr 1994 9.959377 9.697108 10.221645 9.555115 10.363638
## May 1994 10.006830 9.744562 10.269098 9.602569 10.411091
## Jun 1994 10.068191 9.805923 10.330459 9.663930 10.472452
## Jul 1994 10.251837 9.989569 10.514105 9.847576 10.656099
## Aug 1994 10.251367 9.989099 10.513635 9.847106 10.655628
## Sep 1994 10.354752 10.092484 10.617020 9.950491 10.759014
## Oct 1994 10.454834 10.192566 10.717102 10.050573 10.859095
## Nov 1994 10.936210 10.673942 11.198478 10.531949 11.340471
## Dec 1994 11.713723 11.451455 11.975991 11.309462 12.117984
## Jan 1995 9.777979 9.512777 10.043182 9.369195 10.186763
## Feb 1995 10.051416 9.786214 10.316618 9.642632 10.460200
## Mar 1995 10.517972 10.252770 10.783174 10.109188 10.926756
## Apr 1995 10.228092 9.962890 10.493295 9.819308 10.636876
## May 1995 10.275546 10.010343 10.540748 9.866762 10.684330
## Jun 1995 10.336907 10.071704 10.602109 9.928123 10.745691
## Jul 1995 10.520553 10.255351 10.785755 10.111769 10.929337
## Aug 1995 10.520083 10.254880 10.785285 10.111299 10.928867
## Sep 1995 10.623468 10.358266 10.888670 10.214684 11.032252
## Oct 1995 10.723550 10.458347 10.988752 10.314766 11.132334
## Nov 1995 11.204926 10.939723 11.470128 10.796142 11.613710
## Dec 1995 11.982439 11.717236 12.247641 11.573655 12.391223
## Jan 1996 10.046695 9.777950 10.315440 9.632451 10.460940
## Feb 1996 10.320132 10.051387 10.588877 9.905887 10.734376
## Mar 1996 10.786688 10.517943 11.055433 10.372443 11.200932
## Apr 1996 10.496808 10.228063 10.765553 10.082564 10.911053
## May 1996 10.544261 10.275516 10.813007 10.130017 10.958506
## Jun 1996 10.605623 10.336878 10.874368 10.191378 11.019867
## Jul 1996 10.789269 10.520524 11.058014 10.375024 11.203513
## Aug 1996 10.788798 10.520053 11.057543 10.374554 11.203043
## Sep 1996 10.892184 10.623439 11.160929 10.477939 11.306428
## Oct 1996 10.992266 10.723521 11.261011 10.578021 11.406510
## Nov 1996 11.473641 11.204896 11.742386 11.059397 11.887886
## Dec 1996 12.251155 11.982409 12.519900 11.836910 12.665399
#5i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.
fc_raw_fancy <- exp(data.frame(fc_fancy))
ts_fc_raw_fancy <- ts(fc_raw_fancy, frequency = 12, start = 1994)
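#Equivalent back-transform (sketch): InvBoxCox() with lambda = 0 undoes the log,
#matching the exp() call above for the point forecasts.
head(InvBoxCox(fc_fancy$mean, lambda = 0))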
autoplot(ts_fc_raw_fancy) + xlab("Month, Years") + ylab("Sales Volume") + ggtitle("Fancy Shop Back-transformed Forecast 1994-1996")
#5j. How could you improve these predictions by modifying the model?
#1) The first change I would make is to address the March surfing competition by adding a dummy variable (see the sketch below)
#2) The model also seems to falter in fully reducing the seasonal effects, so I would experiment with a different Box-Cox lambda
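#A sketch of improvement 1), assuming a simple 0/1 regressor is adequate: `surf` is a
#hypothetical dummy marking every March from 1988 onward, added alongside trend and season.
surf <- ts(as.numeric(cycle(fancy) == 3 & time(fancy) >= 1988),
           start = start(fancy), frequency = 12)
fancy_mod_surf <- tslm(transformed_fancy ~ trend + season + surf)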
#6. The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day.” Consider only the data to the end of 2004.
#6a.
#Window ends at 2005 to limit the data to the end of 2004
gas_2004 <- window(gasoline, end = 2005)
head(gas_2004)
## Time Series:
## Start = 1991.1
## End = 1991.19582477755
## Frequency = 52.1785714285714
## [1] 6.621 6.433 6.582 7.224 6.875 6.947
#Weekly data, starts in 1991
str(gas_2004)
## Time-Series [1:726] from 1991 to 2005: 6.62 6.43 6.58 7.22 6.88 ...
#Time series
autoplot(gas_2004) + xlab("Time (weekly)") + ylab("Gas Volume (Million Barrels per Day)") + ggtitle("US Gas Supply on Weekly basis")
#Because the data shows annual seasonality at a weekly frequency, Fourier terms can be used to approximate the periodic pattern
#Max number of Fourier pairs is m/2 -> 26
#For the harmonic regressions, I will try K = 4, 12, & 13
f.4_gas_2004 <- tslm(gas_2004 ~ trend + fourier(gas_2004,K = 4))
f.12_gas_2004 <- tslm(gas_2004 ~ trend + fourier(gas_2004,K = 12))
f.13_gas_2004 <- tslm(gas_2004 ~ trend + fourier(gas_2004,K = 13))
autoplot(gas_2004, series = "Data") + autolayer(fitted(f.4_gas_2004)) + autolayer(fitted(f.12_gas_2004)) + autolayer(fitted(f.13_gas_2004)) + xlab("Time (weekly)") + ylab("Gas Volume (Million Barrels per Day)") + ggtitle("US Gas Supply on Weekly basis")
#The Fourier harmonic fits vary much less around the trend than the actual data
#6b. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
#The CV() function returns 5 model-selection measures, with the goal of minimizing each except adjusted R^2 (higher is better)
CV(f.4_gas_2004)
## CV AIC AICc BIC AdjR2
## 7.648608e-02 -1.864112e+03 -1.863742e+03 -1.813649e+03 8.382404e-01
CV(f.12_gas_2004)
## CV AIC AICc BIC AdjR2
## 7.096945e-02 -1.919276e+03 -1.917110e+03 -1.795412e+03 8.532618e-01
CV(f.13_gas_2004)
## CV AIC AICc BIC AdjR2
## 7.133807e-02 -1.915672e+03 -1.913172e+03 -1.782633e+03 8.529214e-01
#The harmonic model with K = 12 minimizes CV, AIC and AICc, and maximizes adjusted R^2
#I think f.12_gas_2004 is the best overall model; only if we wanted to minimize BIC would we select the lower-K model
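#Sketch: scan every admissible K and pick the CV minimiser directly rather than trying
#a handful of values (cv_k is a name I introduce here; this fits 26 regressions).
cv_k <- sapply(1:26, function(K)
  CV(tslm(gas_2004 ~ trend + fourier(gas_2004, K = K)))["CV"])
which.min(cv_k)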
#6c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.
checkresiduals(f.12_gas_2004)
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 154.2, df = 104, p-value = 0.001021
#The residuals sit nicely around mean 0 with a roughly normal distribution
#The residuals do, however, show some autocorrelation and fail the Breusch-Godfrey test, indicating they are not simply white noise.
#6d. To forecast using harmonic regression, you will need to generate the future values of the Fourier terms. This can be done as follows.
#Provided -> fc <- forecast(fit, newdata=data.frame(fourier(x,K,h)))
fc_gas_2005 <- forecast(f.12_gas_2004, newdata = data.frame(fourier(gas_2004, K=12, h = 52)))
fc_gas_2005
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005.014 8.713386 8.371087 9.055684 8.189474 9.237298
## 2005.033 8.642131 8.299720 8.984542 8.118047 9.166215
## 2005.052 8.656942 8.314540 8.999345 8.132871 9.181013
## 2005.071 8.661292 8.318899 9.003684 8.137236 9.185347
## 2005.090 8.686370 8.344109 9.028631 8.162515 9.210225
## 2005.110 8.784129 8.441982 9.126276 8.260448 9.307809
## 2005.129 8.895967 8.553823 9.238111 8.372291 9.419643
## 2005.148 8.937884 8.595756 9.280012 8.414232 9.461535
## 2005.167 8.940875 8.598761 9.282988 8.417246 9.464503
## 2005.186 8.988596 8.646477 9.330716 8.464958 9.512235
## 2005.205 9.062316 8.720191 9.404442 8.538669 9.585963
## 2005.225 9.073794 8.731673 9.415915 8.550153 9.597435
## 2005.244 9.031913 8.689800 9.374027 8.508284 9.555542
## 2005.263 9.041458 8.699342 9.383575 8.517824 9.565092
## 2005.282 9.122877 8.780756 9.464999 8.599236 9.646519
## 2005.301 9.169383 8.827263 9.511503 8.645744 9.693022
## 2005.320 9.132601 8.790487 9.474715 8.608970 9.656231
## 2005.340 9.120782 8.778667 9.462897 8.597150 9.644414
## 2005.359 9.221551 8.879431 9.563671 8.697912 9.745190
## 2005.378 9.335425 8.993306 9.677545 8.811787 9.859064
## 2005.397 9.316637 8.974522 9.658752 8.793006 9.840269
## 2005.416 9.214109 8.871994 9.556224 8.690478 9.737740
## 2005.435 9.218957 8.876838 9.561076 8.695320 9.742595
## 2005.455 9.383861 9.041741 9.725980 8.860222 9.907499
## 2005.474 9.545162 9.203046 9.887278 9.021530 10.068795
## 2005.493 9.559667 9.217553 9.901782 9.036037 10.083298
## 2005.512 9.487053 9.144935 9.829171 8.963416 10.010689
## 2005.531 9.464953 9.122834 9.807073 8.941315 9.988592
## 2005.550 9.508378 9.166261 9.850494 8.984744 10.032011
## 2005.570 9.534922 9.192808 9.877037 9.011292 10.058553
## 2005.589 9.521986 9.179869 9.864104 8.998351 10.045621
## 2005.608 9.522899 9.180779 9.865019 8.999260 10.046538
## 2005.627 9.548443 9.206326 9.890560 9.024808 10.072078
## 2005.646 9.536705 9.194591 9.878819 9.013075 10.060335
## 2005.665 9.451014 9.108897 9.793131 8.927380 9.974649
## 2005.685 9.330630 8.988509 9.672750 8.806990 9.854269
## 2005.704 9.229844 8.887726 9.571962 8.706208 9.753480
## 2005.723 9.170862 8.828748 9.512976 8.647232 9.694492
## 2005.742 9.168084 8.825968 9.510200 8.644451 9.691717
## 2005.761 9.230990 8.888869 9.573110 8.707349 9.754630
## 2005.780 9.320372 8.978252 9.662492 8.796734 9.844010
## 2005.800 9.363916 9.021801 9.706030 8.840285 9.887546
## 2005.819 9.342664 9.000548 9.684779 8.819032 9.866296
## 2005.838 9.304159 8.962036 9.646281 8.780516 9.827801
## 2005.857 9.267511 8.925389 9.609633 8.743869 9.791153
## 2005.876 9.194408 8.852292 9.536523 8.670776 9.718040
## 2005.895 9.100701 8.758587 9.442816 8.577070 9.624332
## 2005.915 9.104667 8.762540 9.446794 8.581017 9.628317
## 2005.934 9.265935 8.923806 9.608064 8.742282 9.789587
## 2005.953 9.436660 9.094538 9.778782 8.913018 9.960302
## 2005.972 9.400410 9.058293 9.742527 8.876776 9.924044
## 2005.991 9.147441 8.805233 9.489649 8.623668 9.671215
#6e. Plot the forecasts along with the actual data for 2005. What do you find?
gas_2005_actuals <- window(gasoline, start = 2005, end = 2006)
autoplot(gas_2005_actuals, series = "Actual data") + autolayer(fc_gas_2005$mean, series = "Prediction Mean") + xlab("Time (weekly)") + ylab("Gas Volume (Million Barrels per Day)") + ggtitle("US Gas Supply 2005: Actual vs Prediction")
autoplot(gas_2005_actuals, series = "Actual data") + autolayer(fc_gas_2005, series = "Prediction Range")
autoplot(fc_gas_2005, series = "Prediction Range") + autolayer(gas_2005_actuals, series = "Actual data") + autolayer(fc_gas_2005$mean, series = "Prediction Mean")
#The prediction follows the 2005 actuals but in a less extreme manner (the highs and lows are smaller)
#The drop in actual gas volume in Q3 2005 is not captured even by the 95% prediction interval
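#Rough coverage check (sketch): the share of the 52 forecast weeks whose actual value
#falls inside the 95% prediction interval (actual_52 is a name I introduce here).
actual_52 <- window(gasoline, start = 2005)[1:52]
mean(actual_52 >= fc_gas_2005$lower[, "95%"] & actual_52 <= fc_gas_2005$upper[, "95%"])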
#7. Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.
#7a. Plot the data and comment on its features.
#Huron lake level in feet by year
frequency(huron)
## [1] 1
autoplot(huron) + xlab("Year") + ylab("Huron Depth (ft)") + ggtitle("Depth of Lake Huron by Year")
#The depth of the lake peaked just before 1880 and hit a nadir around 1965
#The trend does seem to be decreasing, with periodic bouncebacks
#I would expect cyclical movement related to periods of drought or heavy precipitation
huron_45 <- window(huron, start = 1945, end = 1950)
autoplot(huron_45)
#Minimal seasonality when looking at an arbitrary 5-year window
#7b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.
#leaving out seasonality as it does not seem to play a role
#Linear model
lin_mod_huron <- tslm(huron~trend)
#Piecewise model
Huron_T <- time(huron,na.action = na.omit)
TS_PW <- ts(pmax(0,Huron_T-1915), start=1875)
Huron_pw_mod <- tslm(huron~Huron_T+TS_PW)
#Model Summaries
summary(lin_mod_huron)
##
## Call:
## tslm(formula = huron ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50997 -0.72726 0.00083 0.74402 2.53565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.202037 0.230111 44.335 < 2e-16 ***
## trend -0.024201 0.004036 -5.996 3.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared: 0.2725, Adjusted R-squared: 0.2649
## F-statistic: 35.95 on 1 and 96 DF, p-value: 3.545e-08
summary(Huron_pw_mod)
##
## Call:
## tslm(formula = huron ~ Huron_T + TS_PW)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49626 -0.66240 -0.07139 0.85163 2.39222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.90870 19.97687 6.653 1.82e-09 ***
## Huron_T -0.06498 0.01051 -6.181 1.58e-08 ***
## TS_PW 0.06486 0.01563 4.150 7.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.045 on 95 degrees of freedom
## Multiple R-squared: 0.3841, Adjusted R-squared: 0.3711
## F-statistic: 29.62 on 2 and 95 DF, p-value: 1.004e-10
checkresiduals(lin_mod_huron)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 63.979, df = 10, p-value = 6.353e-10
checkresiduals(Huron_pw_mod)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 59.437, df = 10, p-value = 4.63e-09
autoplot(huron, series = "Data") + autolayer(fitted(lin_mod_huron), series = "Linear Reg") + autolayer(fitted(Huron_pw_mod), series = "Piece-wise Mod") + xlab("Year") + ylab("Huron Depth (ft)") + ggtitle("Depth of Lake Huron by Year")
#The piecewise model improves the adjusted R^2, as it follows the downward trend to 1915 and then flattens
#7c. Generate forecasts from these two models for the period up to 1980 and comment on these.
fc_lm <- forecast(lin_mod_huron, h = 8)
fc_lm
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1973 7.806127 6.317648 9.294605 5.516501 10.095752
## 1974 7.781926 6.292536 9.271315 5.490899 10.072952
## 1975 7.757724 6.267406 9.248043 5.465269 10.050179
## 1976 7.733523 6.242259 9.224788 5.439613 10.027434
## 1977 7.709322 6.217094 9.201550 5.413929 10.004715
## 1978 7.685121 6.191912 9.178331 5.388219 9.982024
## 1979 7.660920 6.166712 9.155128 5.362481 9.959359
## 1980 7.636719 6.141494 9.131943 5.336717 9.936721
autoplot(huron, series = "Data") + autolayer(fc_lm, series = "Lin Model") + autolayer(fitted(fc_lm),series = "Lin Model") + xlab("Year") + ylab("Huron Depth (ft)") + ggtitle("Linear Forecast 1972-1980")
#The piecewise forecast requires some manipulation to construct the new data period
fc_time <- Huron_T[length(Huron_T)] + seq(8)
fc_ts <- TS_PW[length(TS_PW)] + seq(8)
fc_pw <- forecast(Huron_pw_mod, newdata = as.data.frame(cbind(Huron_T = fc_time, TS_PW = fc_ts)))
autoplot(huron, series = "Data") + autolayer(fc_pw, series = "PW Model") + autolayer(fitted(fc_pw),series = "PW Model") + xlab("Year") + ylab("Huron Depth (ft)") + ggtitle("PW Forecast 1972-1980")
#The linear model more effectively captures the most recent actual data in its prediction interval
#But the PW model more accurately reflects the levelling-off in the data
#8 Sorry- I did not have capacity to delve into the matrices at this time
#CHAPTER 6 Exercises
#1 Show that a 3 × 5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
#The 3 x 5 MA is equivalent to 1/3[1/5(y1 + y2 + y3 + y4 + y5) + 1/5(y2 + y3 + y4 + y5 + y6) + 1/5(y3 + y4 + y5 + y6 + y7)]
#When factored out -> (1/15)y1 + (2/15)y2 + (3/15 = 1/5)y3 + (3/15 = 1/5)y4 + (3/15 = 1/5)y5 + (2/15)y6 + (1/15)y7
y1 = 1/15
y2 = 2/15
y3 = 3/15
y4 = 3/15
y5 = 3/15
y6 = 2/15
y7 = 1/15
MA_6.1 <- c(y1, y2, y3, y4, y5, y6, y7)
MA_6.1
## [1] 0.06666667 0.13333333 0.20000000 0.20000000 0.20000000 0.13333333 0.06666667
#The resulting weights match the 7-term weighted moving average given in the exercise
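#Spot check (sketch on a simulated series): a 5-MA followed by a 3-MA at one interior
#point matches a single weighted pass with the seven weights above.
x <- ts(rnorm(20))
c(two_pass = ma(ma(x, 5), 3)[10], one_pass = sum(MA_6.1 * x[7:13]))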
#2 The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
#2a. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
#?plastics
#Monthly sales of product A for a plastics manufacturer
frequency(plastics)
## [1] 12
#Frequency is monthly
autoplot(plastics)
#An overall upward trend showing increasing sales volume from years 1-5
ggseasonplot(plastics)
plastics_month <- data.frame(plastics, month=rep(1:12,5))
boxplot(plastics ~ month, data=plastics_month)
#A seasonal pattern is also clearly present across all 5 years: an initial decline from January to February, followed by increasing sales that peak during the summer and decline from Aug - Dec
#2b. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
plastics_multdecomp <- decompose(plastics, type = "multiplicative")
autoplot(plastics_multdecomp)
#2c. Do the results support the graphical interpretation from part a?
#The plot confirms the analysis from 2.a.
#The trend is a clear increase from year 1->5 with some later leveling
#Seasonal trends are consistent across years with clear peaks and troughs
#2d.Compute and plot the seasonally adjusted data.
autoplot(seasadj(plastics_multdecomp)) + ylab("Plastic Sales Volume") + xlab("Months, Years")
#Removing the seasonal effect leaves just the upward trend in sales
#2e. Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
plastics_outlier <- plastics
plastics_outlier[36] = plastics_outlier[36] + 750
autoplot(plastics_outlier, series = "Outlier") + autolayer(plastics, series = "Original")
plastics_multdecomp_outlier <- decompose(plastics_outlier, type = "multiplicative")
autoplot(plastics_multdecomp_outlier)
autoplot(seasadj(plastics_multdecomp_outlier)) + ylab("Plastic Sales Volume") + xlab("Months, Years")
#The outlier continues to show in the data and is not resolved by the seasonal adjustment
#2f. Does it make any difference if the outlier is near the end rather than in the middle of the time series?
#Seasonal adjustment will not remove the outlier whether it sits in the middle or near the end of the data set, though its position can change how much it distorts the estimated seasonal indices (the sketch below probes this)
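#Sketch: the same +500 outlier placed mid-series vs at the very end, and the resulting
#seasonal indices from decompose() ($figure holds the indices; names are mine).
out_mid <- out_end <- plastics
out_mid[30] <- out_mid[30] + 500
out_end[60] <- out_end[60] + 500
round(cbind(mid = decompose(out_mid, "multiplicative")$figure,
            end = decompose(out_end, "multiplicative")$figure), 3)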
#3. Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?
getwd()
## [1] "C:/Users/tlaff/OneDrive/Documents"
retaildata <- read_xlsx("retail.xlsx", skip=1)
head(retaildata)
## # A tibble: 6 x 190
## `Series ID` A3349335T A3349627V A3349338X A3349398A A3349468W
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-04-01 00:00:00 303. 41.7 63.9 409. 65.8
## 2 1982-05-01 00:00:00 298. 43.1 64 405. 65.8
## 3 1982-06-01 00:00:00 298 40.3 62.7 401 62.3
## 4 1982-07-01 00:00:00 308. 40.9 65.6 414. 68.2
## 5 1982-08-01 00:00:00 299. 42.1 62.6 404. 66
## 6 1982-09-01 00:00:00 305. 42 64.4 412. 62.3
## # ... with 184 more variables: A3349336V <dbl>, A3349337W <dbl>,
## # A3349397X <dbl>, A3349399C <dbl>, A3349874C <dbl>, A3349871W <dbl>,
## # A3349790V <dbl>, A3349556W <dbl>, A3349791W <dbl>, A3349401C <dbl>,
## # A3349873A <dbl>, A3349872X <dbl>, A3349709X <dbl>, A3349792X <dbl>,
## # A3349789K <dbl>, A3349555V <dbl>, A3349565X <dbl>, A3349414R <dbl>,
## # A3349799R <dbl>, A3349642T <dbl>, A3349413L <dbl>, A3349564W <dbl>,
## # A3349416V <dbl>, A3349643V <dbl>, A3349483V <dbl>, A3349722T <dbl>, ...
#My name starts with a T, so I am choosing the data column with a T in it
retaildata_ts <- ts(retaildata[,"A3349335T"], frequency=12, start=c(1982,4))
#Running some of the initial analysis I would have done from section 2
autoplot(retaildata_ts)
ggseasonplot(retaildata_ts)
#Clear increasing trend in retail sales by year
#Seasonal spike in March in more recent years - could this be related to a special event?
#Seasonal increases during the holiday season - November & December
retaildata_x11 <- seas(retaildata_ts, x11 = "")
autoplot(retaildata_x11, facet = TRUE)
#Two key observations from the X11 analysis:
#1) The seasonal effect was larger before 1990
#2) Some large irregularities (possibly outliers) occurred around 1986 and 2009 that could affect modeling
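#Sketch: locate the most extreme irregular X11 flags; remainder() is the forecast-package
#accessor, and median-centering keeps this agnostic to whether seas chose an additive or
#multiplicative fit (irr is a name I introduce here).
irr <- remainder(retaildata_x11)
time(irr)[which.max(abs(irr - median(irr)))]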
#4. Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
#4a.Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
#The Australian labour force data shows a clear positive trend: the level of civilian employment is increasing over time. There is a noticeable outlier dip in the early 1990s, which could be the result of a more general business disruption, like a market correction. The trend line confirms this analysis, with steadier growth from 1980-1985 and more rapid expansion from 1985-1990. Seasonal employment increases seem to happen in June and July, then dip in August, which may be the result of school-break employment.
#4b. Is the recession of 1991/1992 visible in the estimated components?
#The remainder component during 1991-1992 shows actual employment far below what the trend and seasonal components predict, which indicates a recession did occur.
#5. This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).
#5a. Plot the data using autoplot(), ggsubseriesplot() and ggseasonplot() to look at the effect of the changing seasonality over time. What do you think is causing it to change so much?
autoplot(cangas)
ggseasonplot(cangas)
ggsubseriesplot(cangas)
#There's a general upward trend that levels off between 1975-1990, with larger increases from 1960-1975 & 1990-2005
#Regarding seasonality, the colder months (Sept-Feb) show higher gas usage, likely the result of increased home heating
#The general upward trend reflects an increasing population and increasing national wealth to afford heating
#5b. Do an STL decomposition of the data. You will need to choose s.window to allow for the changing shape of the seasonal component.
cangas_window <- stl(cangas, s.window = 12)
autoplot(cangas_window)
#The trend in the STL decomposition confirms the initial analysis that the trend flattens in the early '80s, while also revealing that the seasonal impact increased in the '80s
#5c. Compare the results with those obtained using SEATS and X11. How are they different?
cangas_x11<-seas(cangas, x11 = "")
autoplot(cangas_x11)
#The X11 trend seems less smooth than with the STL decomposition, while the seasonal component is narrower in scale but varies more across periods: the 1960s show larger effects, while the 1980s are lessened.
#6 We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.
#?bricksq -> Australian quarterly clay brick production: 1956–1994.
#6a. Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)
bricksq_stl_period <- stl(bricksq, s.window = "periodic", robust = TRUE)
bricks_period_plot <- autoplot(bricksq_stl_period) + ggtitle("Periodic STL")
bricksq_stl_fixed <- stl(bricksq, s.window = 12)
bricks_fixed_plot <- autoplot(bricksq_stl_fixed) + ggtitle("Changing-seasonality STL (s.window = 12)")
grid.arrange(bricks_period_plot, bricks_fixed_plot, ncol = 2)
#The s.window = 12 fit lets the seasonal component change over time, so it is less consistent than the fixed ("periodic") version
#6b. Compute and plot the seasonally adjusted data.
brick_mod <- seasadj(stl(bricksq, t.window = 12, s.window = "periodic", robust = TRUE))
brick_mod_plot <- autoplot(brick_mod) + ggtitle("Brick Model - Seasonal Adj")
brick_plot <- autoplot(bricksq) + ggtitle("Brick Data")
grid.arrange(brick_plot, brick_mod_plot, ncol = 2)
#The seasonally adjusted series is smoother
#6c. Use a naïve method to produce forecasts of the seasonally adjusted data.
brick_mod_naive <- naive(brick_mod)
autoplot(brick_mod_naive)
#6d. Use stlf() to reseasonalise the results, giving forecasts for the original data.
#Adding random walk to stl decomp
bricksq_stlf <- stlf(bricksq, method = "naive")
stlf_mod <- forecast(bricksq_stlf)
autoplot(stlf_mod)
#6e/f Compare residuals from STL and random walk
checkresiduals(brick_mod_naive)
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 20.048, df = 8, p-value = 0.01016
##
## Model df: 0. Total lags used: 8
checkresiduals(stlf_mod)
## Warning in checkresiduals(stlf_mod): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 44.369, df = 8, p-value = 4.845e-07
##
## Model df: 0. Total lags used: 8
#Both sets of residuals have a left-tailed distribution with inconsistent variance
#By the Ljung-Box statistic, the naive method on the adjusted data shows less residual autocorrelation (Q* = 20.0, p = 0.01) than the STL + random walk (Q* = 44.4)
#6g Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better?
train_brick <- window(bricksq, end = c(1992,3))
brick_snaive_mod <- snaive(train_brick)
brick_stlf_mod <- stlf(train_brick)
brick_snaive_fc <- forecast(brick_snaive_mod, h=8)
brick_stlf_fc <- forecast(brick_stlf_mod, h=8)
autoplot(bricksq, series = "data") + autolayer(fitted(brick_snaive_fc), series = "SNaive") + autolayer(brick_snaive_fc, series = "SNaive") + ggtitle("SNaive vs Actuals")
## Warning: Removed 4 row(s) containing missing values (geom_path).
autoplot(bricksq, series = "data") + autolayer(fitted(brick_stlf_fc), series = "STLF") + autolayer(brick_stlf_fc, series = "STLF") + ggtitle("STLF vs Actuals")
accuracy(brick_snaive_fc, bricksq)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.174825 49.71281 36.41259 1.369661 8.903098 1.0000000 0.8105927
## Test set 27.500000 35.05353 30.00000 5.933607 6.528845 0.8238909 0.2405423
## Theil's U
## Training set NA
## Test set 0.9527794
accuracy(brick_stlf_fc, bricksq)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.454487 20.05377 14.50408 0.3583404 3.565690 0.3983259 0.2150723
## Test set 23.289982 27.37352 24.47347 5.1087856 5.390569 0.6721157 0.2710260
## Theil's U
## Training set NA
## Test set 0.7148358
checkresiduals(brick_snaive_mod)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 221.45, df = 8, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 8
checkresiduals(brick_stlf_mod)
## Warning in checkresiduals(brick_stlf_mod): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 42.977, df = 6, p-value = 1.179e-07
##
## Model df: 2. Total lags used: 8
#The STL + random walk (STLF) approach has significant advantages: lower residual autocorrelation and smaller test-set errors than the seasonal naive method
#7. Use stlf() to produce forecasts of the writing series with either method="naive" or method="rwdrift", whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
head(writing)
## Jan Feb Mar Apr May Jun
## 1968 562.674 599.000 668.516 597.798 579.889 668.233
#Industry sales for printing and writing paper in thousands of Francs - 1963-1972
autoplot(writing)
ggseasonplot(writing)
#Strong seasonality, with a sharp drop every August (likely a summer holiday shutdown)
#Given the strong seasonal swings, I will apply a Box-Cox transformation via the lambda argument
writing_stlf_naive <- stlf(writing, s.window = 12, method = "naive", lambda = BoxCox.lambda(writing))
writing_stlf_drift <- stlf(writing, s.window = 12, method = "rwdrift", lambda = BoxCox.lambda(writing))
autoplot(writing_stlf_naive) + ylab("Transformed Industry Sales (thousands of francs)") + xlab("years, months")
autoplot(writing_stlf_drift) + ylab("Transformed Industry Sales (thousands of francs)") + xlab("years, months")
checkresiduals(writing_stlf_naive)
## Warning in checkresiduals(writing_stlf_naive): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 65.055, df = 24, p-value = 1.195e-05
##
## Model df: 0. Total lags used: 24
checkresiduals(writing_stlf_drift)
## Warning in checkresiduals(writing_stlf_drift): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk with drift
## Q* = 65.055, df = 23, p-value = 6.879e-06
##
## Model df: 1. Total lags used: 24
#The random walk with and without drift produce very similar forecasts
#The residuals of the drift version look slightly more normal and the upward trend is slightly accentuated, so I prefer it in this case
#8 Use stlf() to produce forecasts of the fancy series with either method="naive" or method="rwdrift", whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
#For question 7 I compared naive and drift; here I will stick with the drift method and compare with and without a Box-Cox transformation
#See chapter 5 question 5 for previous Fancy data analysis
fancy_stlf_drift <- stlf(fancy, s.window = 12, method = "rwdrift")
fancy_stlf_drift_boxcox <- stlf(fancy, s.window = 12, method = "rwdrift", lambda = BoxCox.lambda(fancy))
fsd_plot <- autoplot(fancy_stlf_drift) + ylab("Sales volume")
fsd_bc_plot <- autoplot(fancy_stlf_drift_boxcox) + ylab("Log Sales Volume")
grid.arrange(fsd_plot, fsd_bc_plot, ncol=2)
checkresiduals(fancy_stlf_drift)
## Warning in checkresiduals(fancy_stlf_drift): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk with drift
## Q* = 26.324, df = 16, p-value = 0.04963
##
## Model df: 1. Total lags used: 17
checkresiduals(fancy_stlf_drift_boxcox)
## Warning in checkresiduals(fancy_stlf_drift_boxcox): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk with drift
## Q* = 38.325, df = 16, p-value = 0.00136
##
## Model df: 1. Total lags used: 17
#The Box-Cox transformation helps stabilize the variance and dampen the effect of the various outliers