Chapter 5
Exercise 1
Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.
library(fpp2)
daily20 <- head(elecdaily,20)
autoplot(daily20)
model <- tslm(Demand ~ Temperature, data = daily20)
model # positive Temperature coefficient of 6.757
##
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
##
## Coefficients:
## (Intercept) Temperature
## 39.212 6.757
as.data.frame(daily20) %>%
ggplot(aes(x = Temperature, y = Demand)) + ggtitle("Impact of Temperature on Daily Electricity Demand in Victoria") +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The positive relationship between temperature and demand can be explained by the increased use of air conditioning at high temperatures. Assuming these temperatures are in Celsius, 40 degrees is approximately 104 degrees Fahrenheit. Such extreme heat warrants near-constant use of air conditioning in households and businesses, which considerably increases electricity demand in Victoria.
B. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(model)
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
While the Breusch-Godfrey test is not significant (p-value of 0.58), the residuals do not follow a normal distribution and appear left-skewed. For this reason, the model is somewhat inadequate.
c/d. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts? Give prediction intervals for your forecasts.
forecast(model, newdata=data.frame(Temperature = c(15,35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 4.285714 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.428571 275.7146 245.2278 306.2014 227.57056 323.8586
I am not very familiar with changes in electricity demand, but it seems improbable that demand would continue to decrease at a constant rate. 15°C is approximately 59°F and would likely require little more air conditioning than 20°C (approximately 68°F). The electricity demand attributable to air conditioning would likely flatten out near these values, and demand may even increase at lower temperatures as homes begin to use heating systems. The prediction intervals are included in the above output.
#Code modeled after example code given in question d
elecdaily %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
The above graph clearly shows a U-shaped curve, which indicates that the linear model is inadequate. This is likely a result of the increased heating demand at lower temperatures speculated about earlier. Some sort of quadratic model would be more appropriate for this relationship.
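As a quick sketch of that idea (my own addition, not part of the exercise output; model_quad is just an illustrative name), a squared temperature term could be added to capture the U-shape:
model_quad <- tslm(Demand ~ Temperature + I(Temperature^2), data = elecdaily) # quadratic in temperature
model_quad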
Exercise 2 Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
as.data.frame(mens400) %>%
ggplot(aes(x = time(mens400), y = mens400)) + ylab("Time in seconds") + xlab("Year") + ggtitle("Winning times for men's 400 by Year") +
geom_point()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 3 rows containing missing values (geom_point).
The initial time may be an outlier, but there is a seemingly linear relationship after it. The curve decreases at a greater rate in the early years (prior to the 1980s) and somewhat levels out in recent decades. A spline or some form of asymptotic model may best capture this type of relationship.
model2 = tslm(mens400 ~ time(mens400))
model2
##
## Call:
## tslm(formula = mens400 ~ time(mens400))
##
## Coefficients:
## (Intercept) time(mens400)
## 172.48148 -0.06457
as.data.frame(mens400) %>%
ggplot(aes(x = time(mens400), y = mens400)) + ylab("Time in seconds") + xlab("Year") + ggtitle("Winning times for men's 400 by Year") +
geom_point() + geom_smooth(method="lm", se=FALSE)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
The fitted trend indicates that the winning time for the men's 400 m at the Olympics has decreased by an average of 0.06457 seconds per year.
checkresiduals(model2)
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295
as.data.frame(mens400) %>%
ggplot(aes(x = time(mens400), y = residuals(model2))) + ylab("Residuals") + xlab("Year") + ggtitle("Residuals from linear model") +
geom_point() + geom_smooth(method="lm", se=FALSE)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
The fitted line is inadequate: the first point appears to be an outlier, and there is a clear pattern in the residuals, which are almost all negative until 2000, when they become positive. This upward drift toward positive residuals can be seen as far back as 1980. A spline-type model, or something more asymptotic, would be more appropriate.
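As a rough sketch of that idea (my own addition; the names t_knot and model2_pw are just illustrative), a piecewise linear trend with a knot at 1980, in the style of the huron model in Exercise 7, would let the decline slow in recent decades:
t_knot <- pmax(0, time(mens400) - 1980) # zero before 1980, years past 1980 afterwards
model2_pw <- tslm(mens400 ~ time(mens400) + t_knot)
model2_pw # post-1980 slope is the sum of the two slope coefficients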
time = time(mens400)
model2 = tslm(mens400 ~ time)
forecast(model2, newdata=data.frame(time=2020))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 42.04231 40.44975 43.63487 39.55286 44.53176
The most important assumption here is that the men's 400 metre final actually happened in 2020. As this is not true, I decided to also forecast for 2021 to see how different the prediction is for the actual year of the "2020" Olympics. Both forecasts assume that the winning time has been decreasing at a constant rate. The data indicate, however, that the time has been decreasing at a much slower rate since 1980. As the model does not account for this, the predicted time is likely considerably lower than the actual time that we will see in a couple of weeks.
forecast(model2, newdata=data.frame(time=2021))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 41.97774 40.38286 43.57262 39.48466 44.47081
Exercise 3
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
?easter()
This function gives the proportion of the Easter holiday falling in each quarter, where "Easter is defined as the days from Good Friday to Easter Sunday inclusively". Spillover between March (Q1) and April (Q2) can therefore produce values between 0 and 1, as in 1956.
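As a small illustrative sketch (my own addition, not asked for by the exercise; easter_dummy and fit_easter are made-up names), easter() is intended to be used as a regressor, for example to let a regression of quarterly beer production pick up an Easter effect:
easter_dummy <- easter(ausbeer) # proportion of Easter falling in each quarter
fit_easter <- tslm(ausbeer ~ trend + season + easter_dummy)
coef(fit_easter)["easter_dummy"] # estimated Easter effect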
Exercise 4
An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx)(x/y). Consider the log-log model, log y = β0 + β1 log x + ε. Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.
Exponentiating both sides expresses y as a function of x: y = exp(β0 + β1 log x + ε) = e^β0 · x^β1 · e^ε. Setting the error term to zero (working with the expected relationship), the model is log y = β0 + β1 log x.
Differentiating both sides with respect to x gives (1/y)(dy/dx) on the left and β1/x on the right, since the constant β0 differentiates to zero and d(log x)/dx = 1/x. So (dy/dx)/y = β1/x.
This can be rearranged to give β1 = (dy/dx)(x/y), which is the elasticity coefficient.
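A quick numerical check of this result (my own sketch, with made-up parameter values): simulating data from y = e^β0 · x^β1 · e^ε and fitting the log-log regression should recover β1 as the slope.
set.seed(123)
x <- runif(200, 1, 10)
y <- exp(2) * x^0.8 * exp(rnorm(200, sd = 0.1)) # true elasticity of 0.8
coef(lm(log(y) ~ log(x))) # slope estimate should be close to 0.8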
Exercise 5 The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
autoplot(fancy) + ggtitle("Sales in souvenir shop in Queensland")+ xlab("Year") +ylab("Sales")
As the prompt explains, the data are heavily seasonal, with large influxes of tourists around Christmas and in March. The March uptick does not occur in 1987 because the surfing festival began in 1988, and the uptick around Christmas is much larger than the one around the surfing festival. The trend grows considerably over time, although the increase in the 1990 Christmas season is notably smaller than in the surrounding years.
autoplot(log(fancy)) + ggtitle("Sales in souvenir shop in Queensland") + xlab("Year") + ylab("Log of Sales")
Taking the log of the sales helps smooth the series, as the sharp uptick in sales in 1994 does not appear as extreme on the log scale. The graph shows a more linear trend and suggests that the seasonal pattern is roughly constant in log form, with the overall upward trend accounting for a large portion of the increase in sales. A linear model could capture this upward trend in log sales well.
?cycle()
time(fancy)
## Jan Feb Mar Apr May Jun Jul Aug
## 1987 1987.000 1987.083 1987.167 1987.250 1987.333 1987.417 1987.500 1987.583
## 1988 1988.000 1988.083 1988.167 1988.250 1988.333 1988.417 1988.500 1988.583
## 1989 1989.000 1989.083 1989.167 1989.250 1989.333 1989.417 1989.500 1989.583
## 1990 1990.000 1990.083 1990.167 1990.250 1990.333 1990.417 1990.500 1990.583
## 1991 1991.000 1991.083 1991.167 1991.250 1991.333 1991.417 1991.500 1991.583
## 1992 1992.000 1992.083 1992.167 1992.250 1992.333 1992.417 1992.500 1992.583
## 1993 1993.000 1993.083 1993.167 1993.250 1993.333 1993.417 1993.500 1993.583
## Sep Oct Nov Dec
## 1987 1987.667 1987.750 1987.833 1987.917
## 1988 1988.667 1988.750 1988.833 1988.917
## 1989 1989.667 1989.750 1989.833 1989.917
## 1990 1990.667 1990.750 1990.833 1990.917
## 1991 1991.667 1991.750 1991.833 1991.917
## 1992 1992.667 1992.750 1992.833 1992.917
## 1993 1993.667 1993.750 1993.833 1993.917
surf <- cycle(fancy) == 3 & time(fancy) >=1988 #3 for March, competition started in 1988
surf #looks good
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1987 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1988 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1989 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1990 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1991 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1992 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1993 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
model3 = tslm(fancy ~ trend + season + surf, lambda = 0) # lambda = 0 applies a log (Box-Cox) transformation; the fit looks off without it
model3
##
## Call:
## tslm(formula = fancy ~ trend + season + surf, lambda = 0)
##
## Coefficients:
## (Intercept) trend season2 season3 season4 season5
## 7.61967 0.02202 0.25142 0.26608 0.38405 0.40949
## season6 season7 season8 season9 season10 season11
## 0.44883 0.61045 0.58796 0.66933 0.74739 1.20675
## season12 surfTRUE
## 1.96224 0.50152
?autolayer()
?fitted() #Can use these to add fit
autoplot(fancy) + xlab("Year") + ylab("Log of Sales") + ggtitle("Regression model compared to Data")+ autolayer(fitted(model3), series = "Fitted")
d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(residuals(model3))
ggplot(aes(x=fitted(model3), y=residuals(model3)), data=fancy) + geom_point() + ggtitle("Residuals Vs. Fitted")
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
The residuals in the first graph appear to have some serial correlation and the residuals in the second graph appear to mostly be random.
e. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
mon <- factor(cycle(residuals(model3)))
ggplot() + geom_boxplot(aes(x=mon, y=residuals(model3), group=mon)) + xlab("Month") + ggtitle("Residuals against Month")
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
These boxplots indicate that there may be an issue with heteroscedasticity, as the spread of the residuals varies considerably between months. The means for each month do appear to hover around 0, and no box is considerably off in value from the others (for instance, there is no box whose quartiles are all above or below 0).
coefficients(model3)
## (Intercept) trend season2 season3 season4 season5
## 7.61966701 0.02201983 0.25141682 0.26608280 0.38405351 0.40948697
## season6 season7 season8 season9 season10 season11
## 0.44882828 0.61045453 0.58796443 0.66932985 0.74739195 1.20674790
## season12 surfTRUE
## 1.96224123 0.50151509
The trend coefficient indicates that the log of sales increases by about 0.022 per month. Each seasonal coefficient gives the difference in log sales relative to January: February is 0.251 higher, March 0.266, April 0.384, May 0.409, June 0.449, July 0.610, August 0.588, September 0.669, October 0.747, November 1.207, and December 1.962 higher than January. surfTRUE indicates that the log of sales during the festival is 0.502 higher than when the festival does not occur.
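Since the model is on the log scale, exp() of a coefficient gives the multiplicative effect on raw sales; as a quick illustration (my own addition), December sales are estimated at roughly exp(1.962) ≈ 7.1 times January sales, and the festival multiplies March sales by roughly exp(0.502) ≈ 1.65.
exp(coefficients(model3)[c("season12", "surfTRUE")]) # multiplicative effects on raw sales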
checkresiduals(model3)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494
The residual histogram is relatively close to a normal distribution. However, the Breusch-Godfrey test is statistically significant with a p-value of 0.002494, meaning that serial correlation in the residuals is significant at the 0.05 level.
h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
surffore <- rep(FALSE, 36)
surffore[c(3, 15, 27)] <- TRUE
fancyfore = ts(data=surffore, start = 1994, end = c(1996, 12), frequency = 12)
fancyforecast = forecast(model3, newdata=fancyfore)
## Warning in forecast.lm(model3, newdata = fancyfore): newdata column names not
## specified, defaulting to first variable required.
autoplot(fancyforecast)
i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.
?tslm()
model3.5 = tslm(fancy ~ trend + season + surf, lambda = "auto")
model3.5
##
## Call:
## tslm(formula = fancy ~ trend + season + surf, lambda = "auto")
##
## Coefficients:
## (Intercept) trend season2 season3 season4 season5
## 7.67964 0.02246 0.25586 0.27214 0.39099 0.41679
## season6 season7 season8 season9 season10 season11
## 0.45698 0.62180 0.59902 0.68193 0.76149 1.23029
## season12 surfTRUE
## 2.00238 0.50970
autoplot(fancy) + xlab("Year") + ylab("Log of Sales") + ggtitle("Regression model compared to Data")+ autolayer(fitted(model3.5), series = "Fitted")
I used a lambda value of "auto" to automatically select and apply a Box-Cox transformation via BoxCox.lambda().
Exercise 6 The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are "million barrels per day." Consider only the data to the end of 2004.
a/b. Fit a harmonic regression with trend to the data. Experiment with changing the number of Fourier terms. Plot the observed gasoline series and fitted values and comment on what you see. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
gas <- window(gasoline, end = 2005)
autoplot(gas)
As the data are weekly, the maximum possible value of K is 26 (52/2).
crossval = seq(26) # vector to hold the CV statistic for each K
for(k in seq(26))
{
  model4 <- tslm(gas ~ trend + fourier(gas, K=k))
  crossval[k] <- CV(model4)['CV'] # store the cross-validation statistic for this K
}
k <- which.min(crossval)
k #12
## [1] 12
model4 <- tslm(gas ~ trend + fourier(gas, K=k))
autoplot(gas) +autolayer(fitted(model4)) +ggtitle("Optimized k value compared to data")
A value of K = 12 minimized the CV statistic. This produces a harmonic regression with trend that captures the sine-wave-like movements of the data while also capturing the general upward trend over time.
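Note (my own addition): CV() also reports AIC, AICc, BIC, and adjusted R², so the same loop could select K by minimising AICc instead, simply by storing CV(model4)['AICc'] at each step.
CV(model4) # CV, AIC, AICc, BIC and adjusted R^2 for the chosen K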
c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)
checkresiduals(model4)
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 154.2, df = 104, p-value = 0.001021
The model does fail the Breusch-Godfrey test with a p-value of 0.001021, but the residual plots look relatively good, and the residual histogram is close to a normal distribution.
fc <- forecast(fit, newdata=data.frame(fourier(x, K, h)))
where fit is the fitted model using tslm(), K is the number of Fourier terms used in creating fit, and h is the forecast horizon required.
Forecast the next year of data.
fc <- forecast(model4, newdata=data.frame(fourier(gas, 12, 52)))
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005.014 8.713386 8.371087 9.055684 8.189474 9.237298
## 2005.033 8.642131 8.299720 8.984542 8.118047 9.166215
## 2005.052 8.656942 8.314540 8.999345 8.132871 9.181013
## 2005.071 8.661292 8.318899 9.003684 8.137236 9.185347
## 2005.090 8.686370 8.344109 9.028631 8.162515 9.210225
## 2005.110 8.784129 8.441982 9.126276 8.260448 9.307809
## 2005.129 8.895967 8.553823 9.238111 8.372291 9.419643
## 2005.148 8.937884 8.595756 9.280012 8.414232 9.461535
## 2005.167 8.940875 8.598761 9.282988 8.417246 9.464503
## 2005.186 8.988596 8.646477 9.330716 8.464958 9.512235
## 2005.205 9.062316 8.720191 9.404442 8.538669 9.585963
## 2005.225 9.073794 8.731673 9.415915 8.550153 9.597435
## 2005.244 9.031913 8.689800 9.374027 8.508284 9.555542
## 2005.263 9.041458 8.699342 9.383575 8.517824 9.565092
## 2005.282 9.122877 8.780756 9.464999 8.599236 9.646519
## 2005.301 9.169383 8.827263 9.511503 8.645744 9.693022
## 2005.320 9.132601 8.790487 9.474715 8.608970 9.656231
## 2005.340 9.120782 8.778667 9.462897 8.597150 9.644414
## 2005.359 9.221551 8.879431 9.563671 8.697912 9.745190
## 2005.378 9.335425 8.993306 9.677545 8.811787 9.859064
## 2005.397 9.316637 8.974522 9.658752 8.793006 9.840269
## 2005.416 9.214109 8.871994 9.556224 8.690478 9.737740
## 2005.435 9.218957 8.876838 9.561076 8.695320 9.742595
## 2005.455 9.383861 9.041741 9.725980 8.860222 9.907499
## 2005.474 9.545162 9.203046 9.887278 9.021530 10.068795
## 2005.493 9.559667 9.217553 9.901782 9.036037 10.083298
## 2005.512 9.487053 9.144935 9.829171 8.963416 10.010689
## 2005.531 9.464953 9.122834 9.807073 8.941315 9.988592
## 2005.550 9.508378 9.166261 9.850494 8.984744 10.032011
## 2005.570 9.534922 9.192808 9.877037 9.011292 10.058553
## 2005.589 9.521986 9.179869 9.864104 8.998351 10.045621
## 2005.608 9.522899 9.180779 9.865019 8.999260 10.046538
## 2005.627 9.548443 9.206326 9.890560 9.024808 10.072078
## 2005.646 9.536705 9.194591 9.878819 9.013075 10.060335
## 2005.665 9.451014 9.108897 9.793131 8.927380 9.974649
## 2005.685 9.330630 8.988509 9.672750 8.806990 9.854269
## 2005.704 9.229844 8.887726 9.571962 8.706208 9.753480
## 2005.723 9.170862 8.828748 9.512976 8.647232 9.694492
## 2005.742 9.168084 8.825968 9.510200 8.644451 9.691717
## 2005.761 9.230990 8.888869 9.573110 8.707349 9.754630
## 2005.780 9.320372 8.978252 9.662492 8.796734 9.844010
## 2005.800 9.363916 9.021801 9.706030 8.840285 9.887546
## 2005.819 9.342664 9.000548 9.684779 8.819032 9.866296
## 2005.838 9.304159 8.962036 9.646281 8.780516 9.827801
## 2005.857 9.267511 8.925389 9.609633 8.743869 9.791153
## 2005.876 9.194408 8.852292 9.536523 8.670776 9.718040
## 2005.895 9.100701 8.758587 9.442816 8.577070 9.624332
## 2005.915 9.104667 8.762540 9.446794 8.581017 9.628317
## 2005.934 9.265935 8.923806 9.608064 8.742282 9.789587
## 2005.953 9.436660 9.094538 9.778782 8.913018 9.960302
## 2005.972 9.400410 9.058293 9.742527 8.876776 9.924044
## 2005.991 9.147441 8.805233 9.489649 8.623668 9.671215
e. Plot the forecasts along with the actual data for 2005. What do you find?
autoplot(fc, start = 2005) + autolayer(window(gasoline, start=2005, end=2006)) + xlim(2001, 2006)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 517 row(s) containing missing values (geom_path).
The forecasts are pretty close and contain the actual data fairly well, apart from one point around July or August. Even that point lies relatively close to the prediction interval.
Exercise 7 Data set huron gives the water level of Lake Huron in feet from 1875 to 1972. a. Plot the data and comment on its features.
summary(huron)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.960 8.135 9.120 9.004 9.875 11.860
autoplot(huron)
There appears to be an initial downward trend that stabilizes around 1905-1915. An overall downward trend can be seen though, and there are considerable spikes approximately every decade.
b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.
model.lin <- tslm(huron~trend)
summary(model.lin)
##
## Call:
## tslm(formula = huron ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50997 -0.72726 0.00083 0.74402 2.53565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.202037 0.230111 44.335 < 2e-16 ***
## trend -0.024201 0.004036 -5.996 3.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared: 0.2725, Adjusted R-squared: 0.2649
## F-statistic: 35.95 on 1 and 96 DF, p-value: 3.545e-08
t <- time(huron)
t.break1 <- 1915
tb1 <- ts(pmax(0, t - t.break1), start = 1875)
model.pw <- tslm(huron ~ t + tb1)
summary(model.pw)
##
## Call:
## tslm(formula = huron ~ t + tb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49626 -0.66240 -0.07139 0.85163 2.39222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.90870 19.97687 6.653 1.82e-09 ***
## t -0.06498 0.01051 -6.181 1.58e-08 ***
## tb1 0.06486 0.01563 4.150 7.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.045 on 95 degrees of freedom
## Multiple R-squared: 0.3841, Adjusted R-squared: 0.3711
## F-statistic: 29.62 on 2 and 95 DF, p-value: 1.004e-10
The linear regression model has a trend of -0.024201, while the piecewise has a trend of -0.06498 up to the knot and the added slope of 0.06486 after. These add up to -0.00012, which means that the slope is virtually 0 after the knot.
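As a quick arithmetic check (my own addition), the post-knot slope is just the sum of the two slope coefficients:
sum(coef(model.pw)[c("t", "tb1")]) # roughly -0.00012, i.e. virtually flat after 1915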
h=8 #up to 1980
forecast.lin = forecast(model.lin, h = 8)
t.new = t[length(t)] + seq(8)
tb1.new = tb1[length(tb1)] + seq(8)
pwdata <- as.data.frame(cbind(t=t.new, tb1=tb1.new))
forecast.pw = forecast(model.pw, newdata = pwdata)
autoplot(huron) + autolayer(fitted(forecast.lin)) +autolayer(fitted(forecast.pw)) + autolayer(forecast.lin, color= "red") + autolayer(forecast.pw, color = "blue")
The linear model's forecasts predict a continued downward trend, while the piecewise model's predictions are relatively flat in slope.
Exercise 8 This is my best attempt to tackle this question, but I haven’t done this type of calculus before and I last took Calculus I and II in high school. I apologize in advance for the mental anguish reading my attempts will put you through.
a. X'X means multiplying the 2×T matrix X' (one row of ones, one row of 1, 2, ..., T) by the T×2 matrix X (one column of ones, one column of 1, 2, ..., T).
This creates the matrix [ T, Σt ; Σt, Σt² ] (rows separated by semicolons).
Using Σt = T(T+1)/2 and Σt² = T(T+1)(2T+1)/6, this becomes [ T, T(T+1)/2 ; T(T+1)/2, T(T+1)(2T+1)/6 ]. Written with a common factor of 1/6 pulled out of every entry, this matches the answer shown in part a.
b. My best guess was that the inverse would simply swap the diagonal entries, giving [ T(T+1)(2T+1)/6, T(T+1)/2 ; T(T+1)/2, T ], but this does not line up with the derivation given in the question: for a 2×2 matrix the inverse also negates the off-diagonal entries and divides by the determinant, so (X'X)^(-1) = 1/(T·Σt² − (Σt)²) · [ Σt², −Σt ; −Σt, T ].
c/d. As with the point above, I am unfamiliar with this type of matrix notation and woefully lacking in the necessary skills for these questions. As they build on the answer to part b, I am unable to proceed.
Chapter 6
Exercise 1 Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
We are effectively taking three overlapping 5-term moving averages and averaging them:
[(X1+X2+X3+X4+X5)/5 + (X2+X3+X4+X5+X6)/5 + (X3+X4+X5+X6+X7)/5] / 3
= (X1 + 2X2 + 3X3 + 3X4 + 3X5 + 2X6 + X7) / 15.
These weights, 1/15 ≈ 0.067, 2/15 ≈ 0.133, and 3/15 = 0.200, give 0.067X1 + 0.133X2 + 0.200X3 + 0.200X4 + 0.200X5 + 0.133X6 + 0.067X7, as required.
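A quick numerical check (my own sketch; the series x and object names are made up): applying a 5-MA followed by a 3-MA to an arbitrary series should match the 7-term weighted MA with weights c(1,2,3,3,3,2,1)/15, up to floating-point error.
set.seed(1)
x <- ts(cumsum(rnorm(30))) # arbitrary example series
ma3x5 <- ma(ma(x, order = 5), order = 3) # 5-MA, then 3-MA of the result
wma7 <- stats::filter(x, c(1, 2, 3, 3, 3, 2, 1) / 15, sides = 2) # 7-term weighted MA
max(abs(ma3x5 - wma7), na.rm = TRUE) # effectively zero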
Exercise 2 The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
autoplot(plastics)
There is a general upward trend over the five years with a sine-wave like fluctuation each year.
model5 = decompose(plastics, type="multiplicative")
autoplot(model5)
c. Do the results support the graphical interpretation from part a?
The results do support the conclusion from part a. The seasonal component is clearly sine-wave-like (though the wave is not perfect), and the upward trend is clearly shown as well. It is important to note that the trend takes a small dip in year five, meaning it is not just a straight line upwards but has some complexity to it.
model5sa = seasadj(model5)
autoplot(model5sa)
plasticsOut = plastics
set.seed(2814) #Earth's sector in Green Lantern Corp
round(runif(1, min=1, max = 60)) #22
## [1] 22
plasticsOut[22] = plasticsOut[22] + 500
model5Out = decompose(plasticsOut, type="multiplicative")
autoplot(model5Out)
model5Outsa = seasadj(model5Out)
autoplot(model5Outsa)
The outlier creates a significant spike in the seasonally adjusted data at the point where it occurs, and it also slightly changes the values for the same month in other years. For instance, the same time of year in year 1 had a slight bump in the original data set, but now shows a significant downward dip in the dataset that includes the outlier.
plasticsOut2 = plastics
plasticsOut2[60] = plasticsOut2[60] + 500
model5End = decompose(plasticsOut2, type="multiplicative")
autoplot(model5End)
model5Endsa = seasadj(model5End)
autoplot(model5Endsa)
The data appear more in line with the original seasonally adjusted series than with the version that had an outlier in the middle of the series. The jump of 500 is not dampened at all in this model: the 60th point of plastics had a seasonally adjusted value of approximately 1200, while the outlier at the end brings this up to approximately 1800, so the change is absorbed almost entirely by the end point alone. The values for the 12th month look largely the same (besides the added outlier) compared to the original seasonally adjusted data, which also implies that this outlier did not change much of the data beyond the point where it occurs.
Exercise 3 Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?
setwd("C:\\Users\\samne\\OneDrive\\Desktop\\Master\\School\\Grad\\1st year\\Summer\\Forecasting\\Week 2\\HW 1")
retail <-read.csv("retail.csv", stringsAsFactors= T, skip=1)
retailts <- ts(retail[,"A3349873A"],frequency=12, start=c(1982,4))
autoplot(retailts)
library(seasonal)
retailx11 = seas(retailts, x11="")
autoplot(retailx11)
The decomposition shows that the upward trend is not constant and may be flattening out near the end; there are several points where the trend actually dips slightly. The seasonal component appears to have a slightly smaller effect in recent years, a pattern that may continue. The 'irregular' panel shows several notable outliers; the ones in 1989, 1994, and 2000 stand out as vastly different from those around them.
Exercise 4 Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
The trend line has a general upward movement, and the movement is relatively constant, so a linear model would be fairly appropriate for this component. The seasonal component is rather complex within each year and does not follow a smooth pattern; this jagged pattern is relatively constant but grows slightly with time. December appears to have the highest employment while August has the lowest. The impact of the dip in approximately 1991 is captured almost entirely in the remainder component.
Extending the last point of part a, the recession is almost only visible in the remainder graph, though the trend line does flatten out somewhat during this time period.
Exercise 5 This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).
autoplot(cangas)
ggsubseriesplot(cangas)
ggseasonplot(cangas)
There appear to be large jumps in the 1980s and 1990s in Canadian gas production. This could be a result of the aggressive increase in natural gas production Canada undertook from approximately 1985 to 2000 (per a quick Google search). Per my Canadian friend, natural gas became a major industry in Canada around that time and is now one of Canada's leading industries. This would explain the aggressive growth during this period, as the abundant natural resource was exploited until production hit a point of diminishing returns.
canstl = stl(cangas, s.window=5, robust = TRUE)
autoplot(canstl) + ggtitle("STL decomposition of Canadian Gas Production")
c. Compare the results with those obtained using SEATS and X11. How are they different?
library(seasonal)
canseas = seas(cangas)
autoplot(canseas) + ggtitle("SEATS decomposition of Canadian Gas Production")
library(seasonal)
canx11 = seas(cangas, x11="")
autoplot(canx11) + ggtitle("X11 decomposition of Canadian Gas Production")
The STL seasonal component balloons in the 1980s-1990s but has a smaller magnitude in the earlier years (the 1970s and before). The SEATS and X11 decompositions have a larger magnitude in this earlier range that gradually decreases before ballooning again in the 1980s-1990s. The trend line for SEATS and X11 is notably more jagged and bumpy than the smoother STL trend line. These differences are likely because STL is additive and is meant to have a "smoother" trend line whose smoothness can be controlled by the user.
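One way to make the comparison a little fairer (my own sketch, not part of the exercise): because STL is additive, decomposing the logged series roughly mimics the multiplicative behaviour of SEATS and X11.
canstl_log <- stl(log(cangas), s.window = 5, robust = TRUE)
autoplot(canstl_log) + ggtitle("STL decomposition of log Canadian Gas Production")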
Exercise 6 We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.
brickstl = stl(bricksq, s.window="periodic", robust = TRUE)
autoplot(brickstl) + ggtitle("STL decomposition of clay brick production by quarter")
brickstl = stl(bricksq, s.window=7, robust = TRUE)
autoplot(brickstl) + ggtitle("STL decomposition of clay brick production by quarter")
The trend lines for both the fixed and the changing seasonality are nearly identical, and the seasonally adjusted series shows little difference between the two. The points with large remainders are similar in both models.
bricksa = seasadj(brickstl)
autoplot(bricksa)
c. Use a naïve method to produce forecasts of the seasonally adjusted data.
autoplot(naive(bricksa))
d. Use stlf() to reseasonalise the results, giving forecasts for the original data.
?stlf()
brickfc = stlf(bricksq, s.window = 7, method="naive")
autoplot(brickfc)
e. Do the residuals look uncorrelated?
checkresiduals(brickfc)
## Warning in checkresiduals(brickfc): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 60.299, df = 8, p-value = 4.072e-10
##
## Model df: 0. Total lags used: 8
The residuals have a relatively normal distribution with a small left tail. The p-value of 4.072e-10 means that the Ljung-Box test indicates the presence of autocorrelation.
brickfc2 = stlf(bricksq, s.window = 7, method="naive", robust=TRUE)
autoplot(brickfc2)
checkresiduals(brickfc2)
## Warning in checkresiduals(brickfc2): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 24.985, df = 8, p-value = 0.001564
##
## Model df: 0. Total lags used: 8
The p-value under the robust model is not nearly as small as in the prior model, but it is still highly significant at 0.001564, so there is still strong evidence of autocorrelation.
brickend = window(bricksq, end = c(1992, 4))
fcstlf = stlf(brickend, s.window = 7, method="naive", robust=TRUE)
autoplot(fcstlf)
fcnaive = snaive(brickend)
autoplot(fcnaive)
accuracy(fcstlf, bricksq)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.454811 22.55502 15.67641 0.3272808 3.817365 0.4327014
## Test set 29.053932 31.42719 29.05393 6.4300642 6.430064 0.8019487
## ACF1 Theil's U
## Training set 0.0449449 NA
## Test set -0.1419827 0.7613135
accuracy(fcnaive, bricksq)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.06250 49.54691 36.22917 1.343615 8.857805 1.0000000 0.8090232
## Test set 34.28571 37.96615 34.28571 7.443878 7.443878 0.9463567 -0.1191536
## Theil's U
## Training set NA
## Test set 0.9975749
The STL with random walk model does a better job than the seasonal naive model in this example, as its error metrics on the test set (RMSE, MAE, MAPE, MASE) are all lower.
Exercise 7 Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
autoplot(writing)
The autoplot shows a slight upward trend.
BoxCox.lambda(writing)
## [1] 0.1557392
The small value of lambda and the upward trend in the plot indicate that a log transformation (lambda = 0) with the drift method can be used.
driftfc = stlf(writing, s.window = "periodic", method="rwdrift", robust=TRUE, lambda = 0) # lambda = 0 applies the log transformation discussed above
autoplot(driftfc)
Exercise 8 Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
autoplot(fancy)
The spikes are growing at a considerable rate, so a Box-Cox transformation is likely warranted. There is also an upward trend, which means rwdrift is warranted again.
fancydrift = stlf(fancy, s.window = "periodic", method="rwdrift", robust=TRUE, lambda = 0.1)
autoplot(fancydrift)
After experimenting with several lambda values, a lambda of 0.1 seems appropriate for the model.
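As a cross-check on that hand-picked value (my own addition), the automatic Box-Cox selection can be compared against it:
BoxCox.lambda(fancy)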