Assignment: Homework Hyndman Chapter 5 and 6 Exercises

Chapter 5 Exercises

1.Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.

#R Packages Used
library(readr)
library(readxl)
library(ggplot2)
library(fpp2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──

## ✓ forecast  8.15     ✓ expsmooth 2.3 
## ✓ fma       2.4

##

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(seasonal)

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:psych':
## 
##     outlier

library('fpp2')
library(psych)

Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?

daily20 <- head(elecdaily,20)
autoplot(daily20, main="Electricity Demand for Victoria, Australia during 2014")

lm(Demand~Temperature, data=elecdaily)

## 
## Call:
## lm(formula = Demand ~ Temperature, data = elecdaily)
## 
## Coefficients:
## (Intercept)  Temperature  
##    212.3856       0.4182

summary(lm(Demand~Temperature, data=elecdaily))

## 
## Call:
## lm(formula = Demand ~ Temperature, data = elecdaily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.474 -15.719  -0.336  15.767 117.184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 212.3856     5.0080  42.409   <2e-16 ***
## Temperature   0.4182     0.2263   1.848   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.55 on 363 degrees of freedom
## Multiple R-squared:  0.009322,   Adjusted R-squared:  0.006592 
## F-statistic: 3.416 on 1 and 363 DF,  p-value: 0.0654

-There is a positive relationship between demand and temperature, since temperature will cause electrical demand to increase. Thus, a change in temperature will result in higher demand for electrical units.

Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

plot(lm(Demand~Temperature, data=elecdaily))

-Since the majority of the values lie outside of the Cook’s distance, therefore the values have little to no influence. However, there are some values that lie far outside the Cook’s distance line indicating greater influence. For instance, observations 15-17 are influential to the regression which should be analyzed further. For this case, the outlier’s should be removed.

Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15◦ and compare it with the forecast if the with maximum temperature was 35◦. Do you believe these forecasts?

DemTSLM <- tslm(Demand ~ Temperature, data=daily20)
tempRange <- data.frame(Temperature=c(15,35))
forecastDemTemp <- forecast(DemTSLM, tempRange)
forecastDemTemp

##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 4.285714       140.5701 108.6810 172.4591  90.21166 190.9285
## 4.428571       275.7146 245.2278 306.2014 227.57056 323.8586

Yes, the forecast range is reasonable when compared to demand values.

Give prediction intervals for your forecasts. The following R code will get you started:

autoplot(daily20, facets=TRUE)

daily20 %>%  
as.data.frame() %>%  
ggplot(aes(x=Temperature, y=Demand)) +  
geom_point() +  
geom_smooth(method="lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

fit <- tslm(Demand ~ Temperature, data=daily20)  
checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774

forecast(fit, newdata=data.frame(Temperature=c(15,35)))

##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 4.285714       140.5701 108.6810 172.4591  90.21166 190.9285
## 4.428571       275.7146 245.2278 306.2014 227.57056 323.8586

The prediction intervals seem to give reasonable estimates when compared to the training dataset.

Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?

plot(Demand ~ Temperature, data=elecdaily, main="Demand vs. Temperature of Elecdaily dataset")

The plot resembles the shape and outliers in the framework, therefore the dataset represents the overall observations.

Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.

1. Plot the winning time against the year. Describe the main features of the plot.

autoplot(mens400, main="Winning times for the Men's 400 meter Olympic Games Final", ylab="Time (Seconds)", xlab="Year")

The relationship between time in seconds and years shows an inverse relationship. In the dataset, there are missing N/A data points and broken sections of the graph. Lastly, the average time for a win in the 400 meter has decreased significantly over time.

Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?

time400 <- time(mens400)
tslm(mens400 ~ time400, data=mens400)

## 
## Call:
## tslm(formula = mens400 ~ time400, data = mens400)
## 
## Coefficients:
## (Intercept)      time400  
##   172.48148     -0.06457

The winning time has decreased by -0.06457 seconds per year.

autoplot(mens400, main="Winning times for the Men's 400 meter Olympic Games Final", ylab="Time (Seconds)", xlab="Year") + geom_smooth(method="lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3 rows containing non-finite values (stat_smooth).

Plot the residuals against the year. What does this indicate about the suitability of the fitted line?

mens400RG <- lm(mens400 ~ time400, data=mens400)
plot(mens400RG)

Based on the Residuals and Fitted graphs, the slope and the winning times tend to decrease at a decreasing rate each year. Thus, the fitted line is an accurate short term predictor, but minimal insight for future winning times.

Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?

TimeMens400 <- time(mens400)
mens400LM <- lm(mens400 ~ TimeMens400, data=mens400, na.action=na.exclude) ## Removing the 3 NA values from the Regression

Mens400F2022 <- forecast(mens400LM, newdata=data.frame(TimeMens400=2020))
Mens400F2022

##   Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1       42.04231 40.44975 43.63487 39.55286 44.53176

There is 80% confidence interval that shows a range from 40.44975 to 43.63487. Also, 95% confidence interval that shows a range from 39.55286 to 44.53176.

Type easter(ausbeer) and interpret what you see.

easter(ausbeer)

##      Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00

The dataset represents the total quarterly beer producted in Austrailia. The dataset contains four quarterly values for 1956 to 2019; however, quarter one and two ends in 2010. Also, the quarterly values are between 0 and 1, so the values indicate either a percentage or probability of an event occurring.

An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is de- fined as (dy/dx) × (x/y). Consider the log-log model, logy = β0 + β1 logx + ε. Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.

The coefficient B1 is the elasticity coefficient indicating the ratio of the percentage in the forecast variable (y) to the change in the predictor variable (x). Thus, for every change in predictor (x), B1 ratio percentage change in the forecast variable (y).

The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

autoplot(fancy, main="Sales Volume over Time", ylab="Sales Volume")

seasonplot(fancy, main="Seasonal Sales Volume over Time", ylab="Sales Volume")

The graphs shows the seasonal trends of sales. For instance, we can see a spike in sales right before the new year indicating that sales increase during the holiday season. Also, there is an increase in sales between 1992 and 1994, indicating that products and employees have increased for the high demand.

Explain why it is necessary to take logarithms of these data before fitting a model.

When there is a pattern formed during graphing of the dataset, then this indicates that there is heteroscedasticity in the errors. Also, the variance of the residuals may not be constant and cause issues with the regression.

Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

#Creating a "surfing festival" dummy variable
time <- time(fancy)

surfingFestival <- c()
for(i in 1:length(time)){
  month <- round(12*(time[i] - floor(time[i]))) + 1
  year <- floor(time[i])
  if(year >= 1988 & month == 3) ## started March, 1998
    {
    surfingFestival[i] <- 1 ## Binary - 1 for if surfing festival is in town
  } else {
    surfingFestival[i] <- 0 ## 0 if the surfing festival is not in town
  }
}

## Regression formula and output
## Using BoxCox power transformation prior to regression formula
fancyTrans <- (BoxCox(fancy, 0))
fancyReg <- tslm(fancyTrans ~ trend + season + surfingFestival)
summary(fancyReg)

## 
## Call:
## tslm(formula = fancyTrans ~ trend + season + surfingFestival)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.6196670  0.0742471 102.626  < 2e-16 ***
## trend           0.0220198  0.0008268  26.634  < 2e-16 ***
## season2         0.2514168  0.0956790   2.628 0.010555 *  
## season3         0.2660828  0.1934044   1.376 0.173275    
## season4         0.3840535  0.0957075   4.013 0.000148 ***
## season5         0.4094870  0.0957325   4.277 5.88e-05 ***
## season6         0.4488283  0.0957647   4.687 1.33e-05 ***
## season7         0.6104545  0.0958039   6.372 1.71e-08 ***
## season8         0.5879644  0.0958503   6.134 4.53e-08 ***
## season9         0.6693299  0.0959037   6.979 1.36e-09 ***
## season10        0.7473919  0.0959643   7.788 4.48e-11 ***
## season11        1.2067479  0.0960319  12.566  < 2e-16 ***
## season12        1.9622412  0.0961066  20.417  < 2e-16 ***
## surfingFestival 0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16

Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

autoplot(fancyReg$residuals, main="Residuals plot of fancyReg Resgression", ylab="Residuals")

Based on the plot, there is still a seasonal pattern which is problematic for the autocorrelation in the dataset.

Do boxplots of the residuals for each month. Does this reveal any problems with the model?

boxplot(fancyReg$residuals, main="fancyReg Regression Residuals")

What do the values of the coefficients tell you about each variable?

fancyReg$coefficients

##     (Intercept)           trend         season2         season3         season4 
##      7.61966701      0.02201983      0.25141682      0.26608280      0.38405351 
##         season5         season6         season7         season8         season9 
##      0.40948697      0.44882828      0.61045453      0.58796443      0.66932985 
##        season10        season11        season12 surfingFestival 
##      0.74739195      1.20674790      1.96224123      0.50151509

The coefficient for surfing festival has a value of 0.5015 indicating a relative increase in sales during the month of March. The trend coefficient value is 0.0220 is relatively less, but indicates a slight increase in sales. Also, there is a significant increase in the coefficient slope around season 10. Thus, as product and staff increase for higher demand, the coefficient tends to increases year to year.

What does the Breusch-Godfrey test tell you about your model?

library(lmtest)
bgtest(fancyReg)

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  fancyReg
## LM test = 25.031, df = 1, p-value = 5.642e-07

Based on the Breusch-Godfrey test, the serial correlation in the regression model show low p-value indicating that the null-hypothesis can be rejected and heteroscedasticity exists.

Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

timeRangeFancy <- ts(data=time, start=1994, end=c(1996,12), frequency=12)
forecast(fancyReg, timeRangeFancy)

## Warning in forecast.lm(fancyReg, timeRangeFancy): newdata column names not
## specified, defaulting to first variable required.

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 1994       1006.002 501.0539 1510.950 227.5849 1784.419
## Feb 1994       1006.317 501.3479 1511.286 227.8675 1784.767
## Mar 1994       1006.396 501.6231 1511.168 228.2491 1784.542
## Apr 1994       1006.577 501.5658 1511.589 228.0624 1785.092
## May 1994       1006.667 501.6339 1511.699 228.1190 1785.214
## Jun 1994       1006.770 501.7159 1511.824 228.1896 1785.350
## Jul 1994       1006.995 501.9201 1512.070 228.3823 1785.608
## Aug 1994       1007.036 501.9403 1512.133 228.3910 1785.682
## Sep 1994       1007.182 502.0643 1512.299 228.5036 1785.860
## Oct 1994       1007.324 502.1850 1512.462 228.6128 1786.034
## Nov 1994       1007.847 502.6870 1513.006 229.1033 1786.590
## Dec 1994       1008.666 503.4851 1513.847 229.8900 1787.442
## Jan 1995       1006.768 501.5678 1511.967 227.9624 1785.573
## Feb 1995       1007.083 501.8618 1512.304 228.2450 1785.921
## Mar 1995       1007.161 502.1369 1512.186 228.6266 1785.696
## Apr 1995       1007.343 502.0797 1512.606 228.4399 1786.246
## May 1995       1007.432 502.1478 1512.717 228.4965 1786.368
## Jun 1995       1007.535 502.2298 1512.841 228.5670 1786.504
## Jul 1995       1007.761 502.4340 1513.088 228.7598 1786.762
## Aug 1995       1007.802 502.4542 1513.150 228.7685 1786.836
## Sep 1995       1007.947 502.5782 1513.317 228.8810 1787.014
## Oct 1995       1008.089 502.6989 1513.480 228.9903 1787.188
## Nov 1995       1008.612 503.2009 1514.024 229.4808 1787.744
## Dec 1995       1009.432 503.9990 1514.865 230.2674 1788.596
## Jan 1996       1007.533 502.0817 1512.985 228.3399 1786.727
## Feb 1996       1007.849 502.3757 1513.321 228.6225 1787.075
## Mar 1996       1007.927 502.6508 1513.203 229.0041 1786.850
## Apr 1996       1008.109 502.5936 1513.624 228.8174 1787.400
## May 1996       1008.198 502.6617 1513.734 228.8740 1787.522
## Jun 1996       1008.301 502.7437 1513.859 228.9445 1787.658
## Jul 1996       1008.527 502.9479 1514.105 229.1373 1787.916
## Aug 1996       1008.568 502.9681 1514.168 229.1460 1787.990
## Sep 1996       1008.713 503.0921 1514.334 229.2585 1788.168
## Oct 1996       1008.855 503.2128 1514.497 229.3678 1788.342
## Nov 1996       1009.378 503.7148 1515.042 229.8583 1788.898
## Dec 1996       1010.198 504.5129 1515.882 230.6449 1789.750

Transform your predictions and intervals to obtain predictions and intervals for the raw data.

fancyTrans <- (BoxCox(fancy, 0))
forecast(fancyTrans)

##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 1994       9.674641  9.464046  9.885235  9.352565  9.996717
## Feb 1994       9.958925  9.733371 10.184479  9.613970 10.303880
## Mar 1994      10.439733 10.200143 10.679323 10.073312 10.806155
## Apr 1994      10.121042  9.868185 10.373898  9.734331 10.507752
## May 1994      10.157291  9.891822 10.422759  9.751292 10.563289
## Jun 1994      10.227196  9.949682 10.504711  9.802774 10.651618
## Jul 1994      10.403289 10.114223 10.692356  9.961200 10.845378
## Aug 1994      10.382661 10.082480 10.682842  9.923574 10.841748
## Sep 1994      10.518849 10.207944 10.829754 10.043361 10.994337
## Oct 1994      10.612257 10.290980 10.933534 10.120906 11.103608
## Nov 1994      11.108401 10.777070 11.439732 10.601674 11.615129
## Dec 1994      11.898748 11.557653 12.239843 11.377088 12.420408
## Jan 1995       9.985303  9.634703 10.335903  9.449107 10.521499
## Feb 1995      10.269587  9.909735 10.629440  9.719240 10.819934
## Mar 1995      10.750396 10.381517 11.119274 10.186244 11.314547
## Apr 1995      10.431704 10.054009 10.809399  9.854070 11.009338
## May 1995      10.467953 10.081638 10.854268  9.877135 11.058770
## Jun 1995      10.537858 10.143107 10.932610  9.934137 11.141579
## Jul 1995      10.713952 10.310934 11.116969 10.097590 11.330314
## Aug 1995      10.693323 10.282201 11.104445 10.064567 11.322080
## Sep 1995      10.829511 10.410437 11.248586 10.188592 11.470430
## Oct 1995      10.922919 10.496035 11.349803 10.270057 11.575781
## Nov 1995      11.419063 10.984506 11.853621 10.754465 12.083661
## Dec 1995      12.209410 11.767308 12.651512 11.533273 12.885547

How could you improve these predictions by modifying the model?

To improve the predictions, we can use the lambda funcation to transfor the regression model. A robust transformation would be better able to show the exponential growth.

The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.

Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.

gasoline2014 <- window(euretail, end=2005)
autoplot(gasoline2014, main="Quarterly Retail Trade Index in the Euro area", xlab="Year", ylab="Thousand barrels per day")

gasoline %>% tbats() %>% forecast() %>%
autoplot() + xlab("Year") + ylab("Thousand barrels per day")

Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.

gasolineCheck <- tslm(gasoline2014 ~ trend)
gasolineCheck

## 
## Call:
## tslm(formula = gasoline2014 ~ trend)
## 
## Coefficients:
## (Intercept)        trend  
##      88.610        0.302

Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)

checkresiduals(gasolineCheck)

## 
##  Breusch-Godfrey test for serial correlation of order up to 7
## 
## data:  Residuals from Linear regression model
## LM test = 21.937, df = 7, p-value = 0.002604

To forecast using harmonic regression, you will need to generate the future values of the Fourier terms. This can be done as follows. fc <- forecast(fit, newdata=data.frame(fourier(x,k,h))) where fit is the fitted model using tslm(), k is the number of Fourier teerms used in creating fit, adn h is the forecast horizon required. Forecast the next year of data.

gasolineF <- forecast(gasoline2014)
gasolineF

##         Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
## 2005 Q2       99.97156  99.39234 100.5508 99.08571 100.8574
## 2005 Q3      100.54817  99.74497 101.3514 99.31978 101.7766
## 2005 Q4      100.72226  99.74468 101.6998 99.22718 102.2173
## 2006 Q1      100.38522  99.26099 101.5095 98.66586 102.1046
## 2006 Q2      100.94123  99.68548 102.1970 99.02073 102.8617
## 2006 Q3      101.51783 100.14204 102.8936 99.41374 103.6219
## 2006 Q4      101.69193 100.20544 103.1784 99.41854 103.9653
## 2007 Q1      101.35488  99.76614 102.9436 98.92510 103.7847

Plot the forecasts along with the actual data for 2005. What do you find?

autoplot(gasolineF)

The possible forecast values range is relatively short, so the forecast could be fairly accurate. However, the 95% internal provides a much closer prediction range.

Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.

Plot the data and comment on its features.

autoplot(huron, main="Mean July Average Water Surface Elevation", ylab="Feet", xlab="Year")

The plot shows from 1880 to 1986, the water surface elevation started to decline. In the graph, the lowest elevation was around 1965 and the highest was around 1870 due to seasonality.

Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.

tslm(huron ~ trend)

## 
## Call:
## tslm(formula = huron ~ trend)
## 
## Coefficients:
## (Intercept)        trend  
##     10.2020      -0.0242

huronRG1 <- tslm(huron ~ trend)

tHuron <- time(huron)
tHuron15 <- 1915

tHuronSection <- ts(pmax(0,tHuron-tHuron15), start=1875)

Generate forecasts from these two models for the period up to 1980 and comment on these.

#Forecast 1
huronRG1 <- tslm(huron ~ trend)
forecast(huronRG1, h=8)

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 1973       7.806127 6.317648 9.294605 5.516501 10.095752
## 1974       7.781926 6.292536 9.271315 5.490899 10.072952
## 1975       7.757724 6.267406 9.248043 5.465269 10.050179
## 1976       7.733523 6.242259 9.224788 5.439613 10.027434
## 1977       7.709322 6.217094 9.201550 5.413929 10.004715
## 1978       7.685121 6.191912 9.178331 5.388219  9.982024
## 1979       7.660920 6.166712 9.155128 5.362481  9.959359
## 1980       7.636719 6.141494 9.131943 5.336717  9.936721

## Forecast 2
huronRG2 <- tslm(huron ~ tHuronSection)
forecast(tHuronSection, h=8)

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1973             58 57.86560 58.13440 57.79446 58.20554
## 1974             59 58.70969 59.29031 58.55602 59.44398
## 1975             60 59.51703 60.48297 59.26136 60.73864
## 1976             61 60.29422 61.70578 59.92061 62.07939
## 1977             62 61.04503 62.95497 60.53950 63.46050
## 1978             63 61.77204 64.22796 61.12199 64.87801
## 1979             64 62.47716 65.52284 61.67102 66.32898
## 1980             65 63.16194 66.83806 62.18893 67.81107

The forecast indicated that there are different formats between the two, but both have an increasing trend.

Chapter 6 Exercises

Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.

3x5 MA = [((Y1 + Y2 + Y3 + Y4 + Y5)/5) + ((Y2 + Y3 + Y4 + Y5 + Y6)/5) + ((Y3 + Y4 + Y5 + Y6 + Y7)/5)] / 3
In order to create a double moving average, the centered moving average will need to be smoothed by another moving average. For instance, 3x5 moving average indicates that 3 moving averages of 5 moving averages. So, inputting in the values shows that the 3x5 moving average is equal to a 7-term weighted moving average.

The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?

autoplot(plastics, main="Monthly sales of Product A", xlab="Year", ylab="Sales (thousands")

Based on the plot, during the beginning and ending months of each year we can see the fluctuation of sales. For instance, at the beginning of the year the sales are at a low point, but towards the end of the year sales are at their highest.

Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

plastics %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Monthly sales of Product A")

Do the results support the graphical interpretation from part a?

The results do support the graphical interpretation from part a since the seasonal trends are evident and sales are showsing a positive trend.

Compute and plot the seasonally adjusted data.

autoplot(plastics, main="Monthly sales of Product A", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plastics, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plastics, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plastics, h=30), series="Drift", PI=FALSE)

Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

plasticsAdd <- plastics
plasticsAdd[15] <- plasticsAdd[15] + 500

autoplot(plasticsAdd, main="Monthly sales of Product A (Modified Value 15)", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plasticsAdd, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plasticsAdd, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plasticsAdd, h=30), series="Drift", PI=FALSE)

Based on the graph, the forecast values show stability with the added 500 to one observation to value 15. Thus, the seasonally adjusted data prediction is more prevalent.

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

## Adding 500 to End Value
plasticsAddEnd <- plastics
plasticsAddEnd[40] <- plasticsAddEnd[40] + 500

autoplot(plasticsAddEnd, main="Monthly sales of Product A (Modified Value 40)", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(plasticsAddEnd, h=30), series="Seasonal Naïve", PI=FALSE) + autolayer(naive(plasticsAddEnd, h=30), series="Naïve", PI=FALSE) + autolayer(rwf(plasticsAddEnd, h=30), series="Drift", PI=FALSE)

Adding the outlier to an end value shifted the forecast up and increased the values. Thus, outliers should be evaluated when the multiplicative decomposition is used.

Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?

library(seasonal)
retaildata <- read_csv("Downloads/retaildata.csv", skip = 1)

## New names:
## * `` -> ...191
## * `` -> ...192
## * `` -> ...193
## * `` -> ...194
## * `` -> ...195
## * ...

## Rows: 381 Columns: 200

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): Series ID
## dbl (181): A3349335T, A3349627V, A3349338X, A3349398A, A3349468W, A3349336V,...
## lgl  (18): A3349372C, A3349669T, A3349521W, A3349450X, A3349679W, A3349451A,...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(retaildata)

## # A tibble: 6 x 200
##   `Series ID` A3349335T A3349627V A3349338X A3349398A A3349468W A3349336V
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Apr-1982         303.      41.7      63.9      409.      65.8      91.8
## 2 May-1982         298.      43.1      64        405.      65.8     103. 
## 3 Jun-1982         298       40.3      62.7      401       62.3     105  
## 4 Jul-1982         308.      40.9      65.6      414.      68.2     106  
## 5 Aug-1982         299.      42.1      62.6      404.      66        96.9
## 6 Sep-1982         305.      42        64.4      412.      62.3      97.5
## # … with 193 more variables: A3349337W <dbl>, A3349397X <dbl>, A3349399C <dbl>,
## #   A3349874C <dbl>, A3349871W <dbl>, A3349790V <dbl>, A3349556W <dbl>,
## #   A3349791W <dbl>, A3349401C <dbl>, A3349873A <dbl>, A3349872X <dbl>,
## #   A3349709X <dbl>, A3349792X <dbl>, A3349789K <dbl>, A3349555V <dbl>,
## #   A3349565X <dbl>, A3349414R <dbl>, A3349799R <dbl>, A3349642T <dbl>,
## #   A3349413L <dbl>, A3349564W <dbl>, A3349416V <dbl>, A3349643V <dbl>,
## #   A3349483V <dbl>, A3349722T <dbl>, A3349727C <dbl>, A3349641R <dbl>,
## #   A3349639C <dbl>, A3349415T <dbl>, A3349349F <dbl>, A3349563V <dbl>,
## #   A3349350R <dbl>, A3349640L <dbl>, A3349566A <dbl>, A3349417W <dbl>,
## #   A3349352V <dbl>, A3349882C <dbl>, A3349561R <dbl>, A3349883F <dbl>,
## #   A3349721R <dbl>, A3349478A <dbl>, A3349637X <dbl>, A3349479C <dbl>,
## #   A3349797K <dbl>, A3349477X <dbl>, A3349719C <dbl>, A3349884J <dbl>,
## #   A3349562T <dbl>, A3349348C <dbl>, A3349480L <dbl>, A3349476W <dbl>,
## #   A3349881A <dbl>, A3349410F <dbl>, A3349481R <dbl>, A3349718A <dbl>,
## #   A3349411J <dbl>, A3349638A <dbl>, A3349654A <dbl>, A3349499L <dbl>,
## #   A3349902A <dbl>, A3349432V <dbl>, A3349656F <dbl>, A3349361W <dbl>,
## #   A3349501L <dbl>, A3349503T <dbl>, A3349360V <dbl>, A3349903C <dbl>,
## #   A3349905J <dbl>, A3349658K <dbl>, A3349575C <dbl>, A3349428C <dbl>,
## #   A3349500K <dbl>, A3349577J <dbl>, A3349433W <dbl>, A3349576F <dbl>,
## #   A3349574A <dbl>, A3349816F <dbl>, A3349815C <dbl>, A3349744F <dbl>,
## #   A3349823C <dbl>, A3349508C <dbl>, A3349742A <dbl>, A3349661X <dbl>,
## #   A3349660W <dbl>, A3349909T <dbl>, A3349824F <dbl>, A3349507A <dbl>,
## #   A3349580W <dbl>, A3349825J <dbl>, A3349434X <dbl>, A3349822A <dbl>,
## #   A3349821X <dbl>, A3349581X <dbl>, A3349908R <dbl>, A3349743C <dbl>,
## #   A3349910A <dbl>, A3349435A <dbl>, A3349365F <dbl>, A3349746K <dbl>,
## #   A3349370X <dbl>, …

#Creating Time Series
retailTS <- ts(retaildata[,"A3349873A"], frequency=12, start=c(1982, 3))

#Plotting Time Series
autoplot(retailTS, main="Monthly Austrailian Retail Sales", ylab="Sales", xlab="Year")

## X11 Decomposition
library(seasonal)
retailx11 <- seas(retailTS, x11="")

autoplot(retailx11, main="X11 Decomposition on Monthly Austrailian Retail Sales", xlab="Year")

In the plot, there are a outliers around the years 1989, 1994, 2001, and 2012.

Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The results of the decomposition shows the number of people in the civilian labour force located in Australia for each month starting from February 1978 to August 1995. Based on the graph, there is a positive trend for the number of workers during each time period. According to the seasonality chart, the cyclical growth increases and illustrates a significant seasonal tread. A historical event occurred during 1991 and 1992 which caused the increase of labour. In addition, some outliers are prevalent around 1992. Based on the second chart, the seasonal trends can be seen for each month which shows a significant increase in July and signifcant decrease in March.

Is the recession of 1991/1992 visible in the estimated components?

Yes, the 1991 and 1992 recession is visible in the estimated components and the number of outliers.

This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).

Plot the data using autoplot(), ggsubseriesplot() and ggseasonplot() to look at the effect of the changing seasonality over time. What do you think is causing it to change so much?

autoplot(cangas, main="Monthly Canadian Gas Production", ylab="Gas Production (Billions)", xlab="Year")

ggsubseriesplot(cangas, main="Monthly Canadian Gas Production", ylab="Gas Production (Billions)")

ggseasonplot(cangas, main="Seasonal Plot: Monthly Canadian Gas Production", ylab="Gas Production (Billions)")

Based on the graph of oil production, the changing seasonality over time is caused by the change in prices and demand. For example, the production will need to increase during the summer months due to consumers traveling more causing a increase in oil demand. In addition, in different months oil will be lower. Graph one shows three different eras of oil production, which could explain the increase in innovation. It will be interesting to see how electric cars will impact the years to come.

Do an STL decomposition of the data. You will need to choose s.window to allow for the changing shape of the seasonal component.

cangas %>%
stl(t.window=13, s.window="periodic", robust=TRUE) %>%
autoplot()

Compare the results with those obtained using SEATS and X11. How are they different?

The results between those obtained using SEATS and x11 shows some similarities, but some differences too. For instance, the seasonality graph shows a steady seasonality throughout each of the time series. Also, the remainder plot shows the outliers better between 1980 and 1990 along with 1995 and 2005.

We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.

Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)

bricksq %>%
stl(t.window=26, s.window="periodic", robust=TRUE) %>%
autoplot()

Compute and plot the seasonally adjusted data.

autoplot(bricksq, main="Monthly sales of Product A", ylab="Sales (Thousands)", xlab="Year") + autolayer(snaive(bricksq, h=30), series="Seasonal Naïve", PI=FALSE)

Use a naïve method to produce forecasts of the seasonally adjusted data.

autoplot(bricksq, main="Australian quarterly clay brick production. 1956–1994", ylab="Brick Production", xlab="Year")+ autolayer(naive(bricksq, h=30), series="Naïve", PI=FALSE)

Use stlf() to reseasonalise the results, giving forecasts for the original data.

brickF1 <- stlf(bricksq)
brickF1

##         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1994 Q4       467.0636 439.0835 495.0437 424.2718 509.8555
## 1995 Q1       426.0736 386.4842 465.6629 365.5268 486.6203
## 1995 Q2       484.0345 435.5220 532.5470 409.8411 558.2280
## 1995 Q3       493.9988 437.9514 550.0462 408.2817 579.7159
## 1995 Q4       467.0636 404.3669 529.7604 371.1772 562.9500
## 1996 Q1       426.0736 357.3555 494.7916 320.9784 531.1688
## 1996 Q2       484.0345 409.7703 558.2988 370.4571 597.6119
## 1996 Q3       493.9988 414.5638 573.4338 372.5134 615.4841

autoplot(brickF1)

Do the residuals look uncorrelated?

checkresiduals((brickF1))

## Warning in checkresiduals((brickF1)): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 44.704, df = 6, p-value = 5.358e-08
## 
## Model df: 2.   Total lags used: 8

Based on the results, the residuals are appoximately normal and have correlation.

Repeat with a robust STL decomposition. Does it make much difference?

brickF1R <- stlf(bricksq, robust=TRUE)
brickF1R

##         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1994 Q4       461.6172 432.2633 490.9710 416.7243 506.5100
## 1995 Q1       424.3612 382.8257 465.8967 360.8381 487.8842
## 1995 Q2       487.1456 436.2455 538.0458 409.3006 564.9907
## 1995 Q3       493.9985 435.1892 552.8078 404.0574 583.9396
## 1995 Q4       461.6172 395.8271 527.4072 360.9999 562.2344
## 1996 Q1       424.3612 352.2486 496.4738 314.0745 534.6479
## 1996 Q2       487.1456 409.2083 565.0829 367.9508 606.3404
## 1996 Q3       493.9985 410.6299 577.3671 366.4972 621.4997

autoplot(brickF1R)

checkresiduals(brickF1R)

## Warning in checkresiduals(brickF1R): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 27.278, df = 6, p-value = 0.0001284
## 
## Model df: 2.   Total lags used: 8

Yes there is a difference. The residuals have decreased some in normality and have autocorrelation issues.

Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better?

## Comparing Training and Test sets
trainBrick <- subset(bricksq, end=length(bricksq) - 9)
testBrick <- subset(bricksq, start=length(bricksq) - 8)

sBrick <- snaive(trainBrick)
stlfBrick <- stlf(trainBrick, robust=TRUE)

## Plotting
autoplot(sBrick)

autoplot(stlfBrick)

When comparing the two funcations, it seems that the stlf() provides less forecasting, but is the overall better predictor of brick production.

Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.

autoplot(writing)

stlf(writing, method='rwdrift')

##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## Jan 1978       991.5849 932.9189 1050.2509 901.8630 1081.3068
## Feb 1978      1026.5147 943.2006 1109.8288 899.0967 1153.9327
## Mar 1978      1083.0518 980.5889 1185.5146 926.3484 1239.7551
## Apr 1978      1013.4371 894.6353 1132.2389 831.7454 1195.1288
## May 1978       980.1886 846.8209 1113.5563 776.2203 1184.1569
## Jun 1978      1051.1446 904.4549 1197.8342 826.8021 1275.4871
## Jul 1978       902.0899 743.0094 1061.1705 658.7972 1145.3826
## Aug 1978       490.0146 319.2714  660.7578 228.8855  751.1438
## Sep 1978       939.0529 757.2352 1120.8706 660.9867 1217.1191
## Oct 1978      1029.1126 836.7068 1221.5184 734.8534 1323.3718
## Nov 1978       961.3387 758.7551 1163.9223 651.5138 1271.1636
## Dec 1978      1035.7402 823.3300 1248.1504 710.8868 1360.5935
## Jan 1979      1033.5921 811.6598 1255.5243 694.1760 1373.0081
## Feb 1979      1068.5219 837.3345 1299.7092 714.9513 1422.0924
## Mar 1979      1125.0589 884.8526 1365.2653 757.6950 1492.4229
## Apr 1979      1055.4443 806.4293 1304.4593 674.6087 1436.2798
## May 1979      1022.1958 764.5610 1279.8305 628.1774 1416.2141
## Jun 1979      1093.1517 827.0677 1359.2358 686.2114 1500.0921
## Jul 1979       944.0971 669.7185 1218.4756 524.4713 1363.7229
## Aug 1979       532.0218 249.4898  814.5538  99.9264  964.1172
## Sep 1979       981.0601 690.5039 1271.6163 536.6928 1425.4274
## Oct 1979      1071.1198 772.6582 1369.5814 614.6622 1527.5774
## Nov 1979      1003.3459 697.0885 1309.6032 534.9656 1471.7261
## Dec 1979      1077.7473 763.7956 1391.6991 597.5996 1557.8951

writingBC <- stlf(writing, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(writing))
autoplot(writingBC)

Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.

autoplot(fancy)

The fancy dataset has a positive trend and with seasonal adjustments.

stlf(fancy, method='rwdrift')

##          Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## Jan 1994       60219.09 52848.76  67589.41 48947.14  71491.03
## Feb 1994       61810.73 51324.91  72296.55 45774.06  77847.41
## Mar 1994       67525.88 54607.21  80444.55 47768.48  87283.28
## Apr 1994       64392.56 49387.87  79397.25 41444.87  87340.25
## May 1994       64241.86 47368.86  81114.86 38436.83  90046.89
## Jun 1994       65746.22 47156.85  84335.59 37316.23  94176.21
## Jul 1994       68624.78 48432.20  88817.36 37742.90  99506.66
## Aug 1994       70067.94 48360.23  91775.65 36868.86 103267.01
## Sep 1994       71428.85 48276.79  94580.91 36020.82 106836.87
## Oct 1994       72613.64 48075.50  97151.78 35085.79 110141.49
## Nov 1994       82383.43 56508.12 108258.74 42810.56 121956.30
## Dec 1994      113329.03 86158.23 140499.82 71774.89 154883.16
## Jan 1995       68887.44 40457.16  97317.73 25407.07 112367.81
## Feb 1995       70479.09 40820.71 100137.46 25120.52 115837.66
## Mar 1995       76194.23 45335.42 107053.05 28999.75 123388.71
## Apr 1995       73060.91 41026.21 105095.62 24068.06 122053.77
## May 1995       72910.22 39721.55 106098.88 22152.53 123667.90
## Jun 1995       74414.57 40091.67 108737.47 21922.23 126906.92
## Jul 1995       77293.14 41853.83 112732.44 23093.39 131492.88
## Aug 1995       78736.29 42196.78 115275.81 22853.92 134618.66
## Sep 1995       80097.20 42472.25 117722.16 22554.80 137639.61
## Oct 1995       81282.00 42585.14 119978.86 22100.25 140463.74
## Nov 1995       91051.79 51295.45 130808.12 30249.72 151853.85
## Dec 1995      121997.38 81193.05 162801.71 59592.54 184402.22

fancyBC <- stlf(fancy, method='rwdrift', robust=TRUE, lambda = BoxCox.lambda(fancy))
autoplot(fancyBC)

Assignment: Homework Hyndman Chapter 5 and 6 Exercises

Jacob Eye

11/1/2021