Chapter 5
Exercise 1
Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.
library(fpp2)
daily20 <- head(elecdaily,20)
autoplot(daily20)
model <- tslm(Demand ~ Temperature, data = daily20)
model # positive Temperature coefficient of 6.757
##
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
##
## Coefficients:
## (Intercept) Temperature
## 39.212 6.757
as.data.frame(daily20) %>%
ggplot(aes(x = Temperature, y = Demand)) + ggtitle("Impact of Temperature on Daily Electricity Demand in Victoria") +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The positive relationship between temperature and demand can be explained by the increased use of air conditioning at high temperatures. Assuming these temperatures are in Celsius, 40 degrees is approximately 104 degrees Fahrenheit. Such extreme heat warrants near-constant use of air conditioning in households and businesses, which considerably increases electricity demand in Victoria.
B. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(model)
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
While the Breusch-Godfrey test is not significant (p-value of 0.58), the residuals do not follow a normal distribution and appear left-skewed. For this reason, the model is somewhat inadequate.
c/d. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts? Give prediction intervals for your forecasts.
forecast(model, newdata=data.frame(Temperature = c(15,35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 4.285714 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.428571 275.7146 245.2278 306.2014 227.57056 323.8586
I am not very familiar with changes in electricity demand, but it seems improbable that demand would continue to decrease at a constant rate. 15°C is approximately 59°F and would likely require little more air conditioning than 20°C (approximately 68°F). The electricity demand attributable to air conditioning would likely flatten out near these values, and demand may even increase at lower temperatures as homes begin to use heating systems. The prediction intervals are included in the above output.
#Code modeled after example code given in question d
elecdaily %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
The above graph clearly shows a U-shaped curve, which indicates that the linear model is inadequate. This is likely a result of the increased heating demand at lower temperatures speculated about earlier. Some sort of quadratic model would be more appropriate for this relationship.
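As a quick sketch of that idea (my own addition, not part of the exercise output; model_quad is just an illustrative name), a squared temperature term could be added to capture the U-shape:
model_quad <- tslm(Demand ~ Temperature + I(Temperature^2), data = elecdaily) # quadratic in temperature
model_quad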
Exercise 2 Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
as.data.frame(mens400) %>%
ggplot(aes(x = time(mens400), y = mens400)) + ylab("Time in seconds") + xlab("Year") + ggtitle("Winning times for men's 400 by Year") +
geom_point()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 3 rows containing missing values (geom_point).
The initial time may be an outlier, but there is a seemingly linear relationship after it. The curve decreases at a greater rate in the early years (prior to the 1980s) and somewhat levels out in recent decades. A spline or some form of asymptotic model may best capture this type of relationship.
model2 = tslm(mens400 ~ time(mens400))
model2
##
## Call:
## tslm(formula = mens400 ~ time(mens400))
##
## Coefficients:
## (Intercept) time(mens400)
## 172.48148 -0.06457
as.data.frame(mens400) %>%
ggplot(aes(x = time(mens400), y = mens400)) + ylab("Time in seconds") + xlab("Year") + ggtitle("Winning times for men's 400 by Year") +
geom_point() + geom_smooth(method="lm", se=FALSE)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
The fitted trend indicates that the winning time for the men's 400 m at the Olympics has decreased by an average of 0.06457 seconds per year.
checkresiduals(model2)
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295
as.data.frame(mens400) %>%
ggplot(aes(x = time(mens400), y = residuals(model2))) + ylab("Residuals") + xlab("Year") + ggtitle("Residuals from linear model") +
geom_point() + geom_smooth(method="lm", se=FALSE)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
The fitted line is inadequate: the first point appears to be an outlier, and there is a clear pattern in the residuals, which are almost all negative until 2000, when they become positive. This upward drift toward positive residuals can be seen as far back as 1980. A spline-type model, or something more asymptotic, would be more appropriate.
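As a rough sketch of that idea (my own addition; the names t_knot and model2_pw are just illustrative), a piecewise linear trend with a knot at 1980, in the style of the huron model in Exercise 7, would let the decline slow in recent decades:
t_knot <- pmax(0, time(mens400) - 1980) # zero before 1980, years past 1980 afterwards
model2_pw <- tslm(mens400 ~ time(mens400) + t_knot)
model2_pw # post-1980 slope is the sum of the two slope coefficients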
time = time(mens400)
model2 = tslm(mens400 ~ time)
forecast(model2, newdata=data.frame(time=2020))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 42.04231 40.44975 43.63487 39.55286 44.53176
The most important assumption here is that the men's 400 metre final actually happened in 2020. As this is not true, I decided to also forecast for 2021 to see how different the prediction is for the actual year of the "2020" Olympics. Both forecasts assume that the winning time has been decreasing at a constant rate. The data indicate, however, that the time has been decreasing at a much slower rate since 1980. As the model does not account for this, the predicted time is likely considerably lower than the actual time that we will see in a couple of weeks.
forecast(model2, newdata=data.frame(time=2021))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 41.97774 40.38286 43.57262 39.48466 44.47081
Exercise 3
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
?easter()
This function gives the proportion of the Easter holiday falling in each quarter, where "Easter is defined as the days from Good Friday to Easter Sunday inclusively". Spillover between March (Q1) and April (Q2) can therefore produce values between 0 and 1, as in 1956.
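As a small illustrative sketch (my own addition, not asked for by the exercise; easter_dummy and fit_easter are made-up names), easter() is intended to be used as a regressor, for example to let a regression of quarterly beer production pick up an Easter effect:
easter_dummy <- easter(ausbeer) # proportion of Easter falling in each quarter
fit_easter <- tslm(ausbeer ~ trend + season + easter_dummy)
coef(fit_easter)["easter_dummy"] # estimated Easter effect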
Exercise 4
An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx)(x/y). Consider the log-log model, log y = β0 + β1 log x + ε. Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.
Exponentiating both sides expresses y as a function of x: y = exp(β0 + β1 log x + ε) = e^β0 · x^β1 · e^ε. Setting the error term to zero (working with the expected relationship), the model is log y = β0 + β1 log x.
Differentiating both sides with respect to x gives (1/y)(dy/dx) on the left and β1/x on the right, since the constant β0 differentiates to zero and d(log x)/dx = 1/x. So (dy/dx)/y = β1/x.
This can be rearranged to give β1 = (dy/dx)(x/y), which is the elasticity coefficient.
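A quick numerical check of this result (my own sketch, with made-up parameter values): simulating data from y = e^β0 · x^β1 · e^ε and fitting the log-log regression should recover β1 as the slope.
set.seed(123)
x <- runif(200, 1, 10)
y <- exp(2) * x^0.8 * exp(rnorm(200, sd = 0.1)) # true elasticity of 0.8
coef(lm(log(y) ~ log(x))) # slope estimate should be close to 0.8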
Exercise 5 The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
autoplot(fancy) + ggtitle("Sales in souvenir shop in Queensland")+ xlab("Year") +ylab("Sales")
As the prompt explains, the data are heavily seasonal, with large influxes of tourists around Christmas and in March. The March uptick does not occur in 1987 because the surfing festival began in 1988, and the uptick around Christmas is much larger than the one around the surfing festival. The trend grows considerably over time, although the increase in the 1990 Christmas season is notably smaller than in the surrounding years.
autoplot(log(fancy)) + ggtitle("Sales in souvenir shop in Queensland") + xlab("Year") + ylab("Log of Sales")
Taking the log of the sales helps smooth the series, as the sharp uptick in sales in 1994 does not appear as extreme on the log scale. The graph shows a more linear trend and suggests that the seasonal pattern is roughly constant in log form, with the overall upward trend accounting for a large portion of the increase in sales. A linear model could capture this upward trend in log sales well.
?cycle()
time(fancy)
## Jan Feb Mar Apr May Jun Jul Aug
## 1987 1987.000 1987.083 1987.167 1987.250 1987.333 1987.417 1987.500 1987.583
## 1988 1988.000 1988.083 1988.167 1988.250 1988.333 1988.417 1988.500 1988.583
## 1989 1989.000 1989.083 1989.167 1989.250 1989.333 1989.417 1989.500 1989.583
## 1990 1990.000 1990.083 1990.167 1990.250 1990.333 1990.417 1990.500 1990.583
## 1991 1991.000 1991.083 1991.167 1991.250 1991.333 1991.417 1991.500 1991.583
## 1992 1992.000 1992.083 1992.167 1992.250 1992.333 1992.417 1992.500 1992.583
## 1993 1993.000 1993.083 1993.167 1993.250 1993.333 1993.417 1993.500 1993.583
## Sep Oct Nov Dec
## 1987 1987.667 1987.750 1987.833 1987.917
## 1988 1988.667 1988.750 1988.833 1988.917
## 1989 1989.667 1989.750 1989.833 1989.917
## 1990 1990.667 1990.750 1990.833 1990.917
## 1991 1991.667 1991.750 1991.833 1991.917
## 1992 1992.667 1992.750 1992.833 1992.917
## 1993 1993.667 1993.750 1993.833 1993.917
surf <- cycle(fancy) == 3 & time(fancy) >=1988 #3 for March, competition started in 1988
surf #looks good
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1987 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1988 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1989 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1990 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1991 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1992 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 1993 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
model3 = tslm(fancy ~ trend + season + surf, lambda = 0) # lambda = 0 applies a log (Box-Cox) transformation; the fit looks off without it
model3
##
## Call:
## tslm(formula = fancy ~ trend + season + surf, lambda = 0)
##
## Coefficients:
## (Intercept) trend season2 season3 season4 season5
## 7.61967 0.02202 0.25142 0.26608 0.38405 0.40949
## season6 season7 season8 season9 season10 season11
## 0.44883 0.61045 0.58796 0.66933 0.74739 1.20675
## season12 surfTRUE
## 1.96224 0.50152
?autolayer()
?fitted() #Can use these to add fit
autoplot(fancy) + xlab("Year") + ylab("Log of Sales") + ggtitle("Regression model compared to Data")+ autolayer(fitted(model3), series = "Fitted")
d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(residuals(model3))
ggplot(aes(x=fitted(model3), y=residuals(model3)), data=fancy) + geom_point() + ggtitle("Residuals Vs. Fitted")
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
The residuals in the first graph appear to have some serial correlation and the residuals in the second graph appear to mostly be random.
e. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
mon <- factor(cycle(residuals(model3)))
ggplot() + geom_boxplot(aes(x=mon, y=residuals(model3), group=mon)) + xlab("Month") + ggtitle("Residuals against Month")
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
These boxplots indicate that there may be an issue with heteroscedasticity, as the spread of the residuals varies considerably between months. The means for each month do appear to hover around 0, and no box is considerably off in value from the others (for instance, there is no box whose quartiles are all above or below 0).
coefficients(model3)
## (Intercept) trend season2 season3 season4 season5
## 7.61966701 0.02201983 0.25141682 0.26608280 0.38405351 0.40948697
## season6 season7 season8 season9 season10 season11
## 0.44882828 0.61045453 0.58796443 0.66932985 0.74739195 1.20674790
## season12 surfTRUE
## 1.96224123 0.50151509
The trend coefficient indicates that the log of sales increases by about 0.022 per month. Each seasonal coefficient gives the difference in log sales relative to January: February is 0.251 higher, March 0.266, April 0.384, May 0.409, June 0.449, July 0.610, August 0.588, September 0.669, October 0.747, November 1.207, and December 1.962 higher than January. surfTRUE indicates that the log of sales during the festival is 0.502 higher than when the festival does not occur.
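Since the model is on the log scale, exp() of a coefficient gives the multiplicative effect on raw sales; as a quick illustration (my own addition), December sales are estimated at roughly exp(1.962) ≈ 7.1 times January sales, and the festival multiplies March sales by roughly exp(0.502) ≈ 1.65.
exp(coefficients(model3)[c("season12", "surfTRUE")]) # multiplicative effects on raw sales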
checkresiduals(model3)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494
The residual histogram is relatively close to a normal distribution. However, the Breusch-Godfrey test is statistically significant with a p-value of 0.002494, meaning that serial correlation in the residuals is significant at the 0.05 level.
h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
surffore <- rep(FALSE, 36)
surffore[c(3, 15, 27)] <- TRUE
fancyfore = ts(data=surffore, start = 1994, end = c(1996, 12), frequency = 12)
fancyforecast = forecast(model3, newdata=fancyfore)
## Warning in forecast.lm(model3, newdata = fancyfore): newdata column names not
## specified, defaulting to first variable required.
autoplot(fancyforecast)
i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.
?tslm()
model3.5 = tslm(fancy ~ trend + season + surf, lambda = "auto")
model3.5
##
## Call:
## tslm(formula = fancy ~ trend + season + surf, lambda = "auto")
##
## Coefficients:
## (Intercept) trend season2 season3 season4 season5
## 7.67964 0.02246 0.25586 0.27214 0.39099 0.41679
## season6 season7 season8 season9 season10 season11
## 0.45698 0.62180 0.59902 0.68193 0.76149 1.23029
## season12 surfTRUE
## 2.00238 0.50970
autoplot(fancy) + xlab("Year") + ylab("Log of Sales") + ggtitle("Regression model compared to Data")+ autolayer(fitted(model3.5), series = "Fitted")
I used a lambda value of "auto" to automatically select and apply a Box-Cox transformation via BoxCox.lambda().
Exercise 6 The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are "million barrels per day." Consider only the data to the end of 2004.
a/b. Fit a harmonic regression with trend to the data. Experiment with changing the number of Fourier terms. Plot the observed gasoline series and fitted values and comment on what you see. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
gas <- window(gasoline, end = 2005)
autoplot(gas)
As the data are weekly, the maximum possible value of K is 26 (52/2).
crossval = seq(26) # vector to hold the CV statistic for each K
for(k in seq(26))
{
  model4 <- tslm(gas ~ trend + fourier(gas, K=k))
  crossval[k] <- CV(model4)['CV'] # store the cross-validation statistic for this K
}
k <- which.min(crossval)
k #12
## [1] 12
model4 <- tslm(gas ~ trend + fourier(gas, K=k))
autoplot(gas) +autolayer(fitted(model4)) +ggtitle("Optimized k value compared to data")
A value of K = 12 minimized the CV statistic. This produces a harmonic regression with trend that captures the sine-wave-like movements of the data while also capturing the general upward trend over time.
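Note (my own addition): CV() also reports AIC, AICc, BIC, and adjusted R², so the same loop could select K by minimising AICc instead, simply by storing CV(model4)['AICc'] at each step.
CV(model4) # CV, AIC, AICc, BIC and adjusted R^2 for the chosen K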
c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)
checkresiduals(model4)
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 154.2, df = 104, p-value = 0.001021
The model does fail the Breusch-Godfrey test with a p-value of 0.001021, but the residual plots look relatively good, and the residual histogram is close to a normal distribution.
fc <- forecast(fit, newdata=data.frame(fourier(x, K, h)))
where fit is the fitted model using tslm(), K is the number of Fourier terms used in creating fit, and h is the forecast horizon required.
Forecast the next year of data.
fc <- forecast(model4, newdata=data.frame(fourier(gas, 12, 52)))
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005.014 8.713386 8.371087 9.055684 8.189474 9.237298
## 2005.033 8.642131 8.299720 8.984542 8.118047 9.166215
## 2005.052 8.656942 8.314540 8.999345 8.132871 9.181013
## 2005.071 8.661292 8.318899 9.003684 8.137236 9.185347
## 2005.090 8.686370 8.344109 9.028631 8.162515 9.210225
## 2005.110 8.784129 8.441982 9.126276 8.260448 9.307809
## 2005.129 8.895967 8.553823 9.238111 8.372291 9.419643
## 2005.148 8.937884 8.595756 9.280012 8.414232 9.461535
## 2005.167 8.940875 8.598761 9.282988 8.417246 9.464503
## 2005.186 8.988596 8.646477 9.330716 8.464958 9.512235
## 2005.205 9.062316 8.720191 9.404442 8.538669 9.585963
## 2005.225 9.073794 8.731673 9.415915 8.550153 9.597435
## 2005.244 9.031913 8.689800 9.374027 8.508284 9.555542
## 2005.263 9.041458 8.699342 9.383575 8.517824 9.565092
## 2005.282 9.122877 8.780756 9.464999 8.599236 9.646519
## 2005.301 9.169383 8.827263 9.511503 8.645744 9.693022
## 2005.320 9.132601 8.790487 9.474715 8.608970 9.656231
## 2005.340 9.120782 8.778667 9.462897 8.597150 9.644414
## 2005.359 9.221551 8.879431 9.563671 8.697912 9.745190
## 2005.378 9.335425 8.993306 9.677545 8.811787 9.859064
## 2005.397 9.316637 8.974522 9.658752 8.793006 9.840269
## 2005.416 9.214109 8.871994 9.556224 8.690478 9.737740
## 2005.435 9.218957 8.876838 9.561076 8.695320 9.742595
## 2005.455 9.383861 9.041741 9.725980 8.860222 9.907499
## 2005.474 9.545162 9.203046 9.887278 9.021530 10.068795
## 2005.493 9.559667 9.217553 9.901782 9.036037 10.083298
## 2005.512 9.487053 9.144935 9.829171 8.963416 10.010689
## 2005.531 9.464953 9.122834 9.807073 8.941315 9.988592
## 2005.550 9.508378 9.166261 9.850494 8.984744 10.032011
## 2005.570 9.534922 9.192808 9.877037 9.011292 10.058553
## 2005.589 9.521986 9.179869 9.864104 8.998351 10.045621
## 2005.608 9.522899 9.180779 9.865019 8.999260 10.046538
## 2005.627 9.548443 9.206326 9.890560 9.024808 10.072078
## 2005.646 9.536705 9.194591 9.878819 9.013075 10.060335
## 2005.665 9.451014 9.108897 9.793131 8.927380 9.974649
## 2005.685 9.330630 8.988509 9.672750 8.806990 9.854269
## 2005.704 9.229844 8.887726 9.571962 8.706208 9.753480
## 2005.723 9.170862 8.828748 9.512976 8.647232 9.694492
## 2005.742 9.168084 8.825968 9.510200 8.644451 9.691717
## 2005.761 9.230990 8.888869 9.573110 8.707349 9.754630
## 2005.780 9.320372 8.978252 9.662492 8.796734 9.844010
## 2005.800 9.363916 9.021801 9.706030 8.840285 9.887546
## 2005.819 9.342664 9.000548 9.684779 8.819032 9.866296
## 2005.838 9.304159 8.962036 9.646281 8.780516 9.827801
## 2005.857 9.267511 8.925389 9.609633 8.743869 9.791153
## 2005.876 9.194408 8.852292 9.536523 8.670776 9.718040
## 2005.895 9.100701 8.758587 9.442816 8.577070 9.624332
## 2005.915 9.104667 8.762540 9.446794 8.581017 9.628317
## 2005.934 9.265935 8.923806 9.608064 8.742282 9.789587
## 2005.953 9.436660 9.094538 9.778782 8.913018 9.960302
## 2005.972 9.400410 9.058293 9.742527 8.876776 9.924044
## 2005.991 9.147441 8.805233 9.489649 8.623668 9.671215
e. Plot the forecasts along with the actual data for 2005. What do you find?
autoplot(fc, start = 2005) + autolayer(window(gasoline, start=2005, end=2006)) + xlim(2001, 2006)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 517 row(s) containing missing values (geom_path).
The forecasts are pretty close and contain the actual data fairly well, apart from one point around July or August. Even that point lies relatively close to the prediction interval.
Exercise 7 Data set huron gives the water level of Lake Huron in feet from 1875 to 1972. a. Plot the data and comment on its features.
summary(huron)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.960 8.135 9.120 9.004 9.875 11.860
autoplot(huron)
There appears to be an initial downward trend that stabilizes around 1905-1915. An overall downward trend can be seen though, and there are considerable spikes approximately every decade.
b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.
model.lin <- tslm(huron~trend)
summary(model.lin)
##
## Call:
## tslm(formula = huron ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50997 -0.72726 0.00083 0.74402 2.53565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.202037 0.230111 44.335 < 2e-16 ***
## trend -0.024201 0.004036 -5.996 3.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared: 0.2725, Adjusted R-squared: 0.2649
## F-statistic: 35.95 on 1 and 96 DF, p-value: 3.545e-08
t <- time(huron)
t.break1 <- 1915
tb1 <- ts(pmax(0, t - t.break1), start = 1875)
model.pw <- tslm(huron ~ t + tb1)
summary(model.pw)
##
## Call:
## tslm(formula = huron ~ t + tb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49626 -0.66240 -0.07139 0.85163 2.39222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.90870 19.97687 6.653 1.82e-09 ***
## t -0.06498 0.01051 -6.181 1.58e-08 ***
## tb1 0.06486 0.01563 4.150 7.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.045 on 95 degrees of freedom
## Multiple R-squared: 0.3841, Adjusted R-squared: 0.3711
## F-statistic: 29.62 on 2 and 95 DF, p-value: 1.004e-10
The linear regression model has a trend of -0.024201, while the piecewise has a trend of -0.06498 up to the knot and the added slope of 0.06486 after. These add up to -0.00012, which means that the slope is virtually 0 after the knot.
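As a quick arithmetic check (my own addition), the post-knot slope is just the sum of the two slope coefficients:
sum(coef(model.pw)[c("t", "tb1")]) # roughly -0.00012, i.e. virtually flat after 1915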
h=8 #up to 1980
forecast.lin = forecast(model.lin, h = 8)
t.new = t[length(t)] + seq(8)
tb1.new = tb1[length(tb1)] + seq(8)
pwdata <- as.data.frame(cbind(t=t.new, tb1=tb1.new))
forecast.pw = forecast(model.pw, newdata = pwdata)
autoplot(huron) + autolayer(fitted(forecast.lin)) +autolayer(fitted(forecast.pw)) + autolayer(forecast.lin, color= "red") + autolayer(forecast.pw, color = "blue")
The linear model's forecasts predict a continued downward trend, while the piecewise model's predictions are relatively flat in slope.
Exercise 8 This is my best attempt to tackle this question, but I haven’t done this type of calculus before and I last took Calculus I and II in high school. I apologize in advance for the mental anguish reading my attempts will put you through.
a. X'X means multiplying the 2×T matrix X' (one row of ones, one row of 1, 2, ..., T) by the T×2 matrix X (one column of ones, one column of 1, 2, ..., T).
This creates the matrix [ T, Σt ; Σt, Σt² ] (rows separated by semicolons).
Using Σt = T(T+1)/2 and Σt² = T(T+1)(2T+1)/6, this becomes [ T, T(T+1)/2 ; T(T+1)/2, T(T+1)(2T+1)/6 ]. Written with a common factor of 1/6 pulled out of every entry, this matches the answer shown in part a.
b. My best guess was that the inverse would simply swap the diagonal entries, giving [ T(T+1)(2T+1)/6, T(T+1)/2 ; T(T+1)/2, T ], but this does not line up with the derivation given in the question: for a 2×2 matrix the inverse also negates the off-diagonal entries and divides by the determinant, so (X'X)^(-1) = 1/(T·Σt² − (Σt)²) · [ Σt², −Σt ; −Σt, T ].
c/d. As with the point above, I am unfamiliar with this type of matrix notation and woefully lacking in the necessary skills for these questions. As they build on the answer to part b, I am unable to proceed.
Chapter 6
Exercise 1 Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
We are effectively taking three overlapping 5-term moving averages and averaging them:
[(X1+X2+X3+X4+X5)/5 + (X2+X3+X4+X5+X6)/5 + (X3+X4+X5+X6+X7)/5] / 3
= (X1 + 2X2 + 3X3 + 3X4 + 3X5 + 2X6 + X7) / 15.
These weights, 1/15 ≈ 0.067, 2/15 ≈ 0.133, and 3/15 = 0.200, give 0.067X1 + 0.133X2 + 0.200X3 + 0.200X4 + 0.200X5 + 0.133X6 + 0.067X7, as required.
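A quick numerical check (my own sketch; the series x and object names are made up): applying a 5-MA followed by a 3-MA to an arbitrary series should match the 7-term weighted MA with weights c(1,2,3,3,3,2,1)/15, up to floating-point error.
set.seed(1)
x <- ts(cumsum(rnorm(30))) # arbitrary example series
ma3x5 <- ma(ma(x, order = 5), order = 3) # 5-MA, then 3-MA of the result
wma7 <- stats::filter(x, c(1, 2, 3, 3, 3, 2, 1) / 15, sides = 2) # 7-term weighted MA
max(abs(ma3x5 - wma7), na.rm = TRUE) # effectively zero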
Exercise 2 The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
autoplot(plastics)
There is a general upward trend over the five years with a sine-wave like fluctuation each year.
model5 = decompose(plastics, type="multiplicative")
autoplot(model5)
c. Do the results support the graphical interpretation from part a?
The results do support the conclusion from part a. The seasonal component is clearly sine-wave-like (though the wave is not perfect), and the upward trend is clearly shown as well. It is important to note that the trend takes a small dip in year five, meaning it is not just a straight line upwards but has some complexity to it.
model5sa = seasadj(model5)
autoplot(model5sa)
plasticsOut = plastics
set.seed(2814) #Earth's sector in Green Lantern Corp
round(runif(1, min=1, max = 60)) #22
## [1] 22
plasticsOut[22] = plasticsOut[22] + 500
model5Out = decompose(plasticsOut, type="multiplicative")
autoplot(model5Out)
model5Outsa = seasadj(model5Out)
autoplot(model5Outsa)
The outlier creates a significant spike in the seasonally adjusted data at the point where it occurs, and it also slightly changes the values for the same month in other years. For instance, the same time of year in year 1 had a slight bump in the original data set, but now shows a significant downward dip in the dataset that includes the outlier.
plasticsOut2 = plastics
plasticsOut2[60] = plasticsOut2[60] + 500
model5End = decompose(plasticsOut2, type="multiplicative")
autoplot(model5End)
model5Endsa = seasadj(model5End)
autoplot(model5Endsa)
The data appear more in line with the original seasonally adjusted series than with the version that had an outlier in the middle of the series. The jump of 500 is not dampened at all in this model: the 60th point of plastics had a seasonally adjusted value of approximately 1200, while the outlier at the end brings this up to approximately 1800, so the change is absorbed almost entirely by the end point alone. The values for the 12th month look largely the same (besides the added outlier) compared to the original seasonally adjusted data, which also implies that this outlier did not change much of the data beyond the point where it occurs.
Exercise 3 Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?
setwd("C:\\Users\\samne\\OneDrive\\Desktop\\Master\\School\\Grad\\1st year\\Summer\\Forecasting\\Week 2\\HW 1")
retail <-read.csv("retail.csv", stringsAsFactors= T, skip=1)
retailts <- ts(retail[,"A3349873A"],frequency=12, start=c(1982,4))
autoplot(retailts)
library(seasonal)
retailx11 = seas(retailts, x11="")
autoplot(retailx11)
The decomposition shows that the upward trend is not constant and may be flattening out near the end; there are several points where the trend actually dips slightly. The seasonal component appears to have a slightly smaller effect in recent years, a pattern that may continue. The 'irregular' panel shows several notable outliers; the ones in 1989, 1994, and 2000 stand out as vastly different from those around them.
Exercise 4 Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
The trend line has a general upward movement, and the movement is relatively constant, so a linear model would be fairly appropriate for this component. The seasonal component is rather complex within each year and does not follow a smooth pattern; this jagged pattern is relatively constant but grows slightly with time. December appears to have the highest employment while August has the lowest. The impact of the dip in approximately 1991 is captured almost entirely in the remainder component.
Extending the last point of part a, the recession is almost only visible in the remainder graph, though the trend line does flatten out somewhat during this time period.
Exercise 5 This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).
autoplot(cangas)
ggsubseriesplot(cangas)
ggseasonplot(cangas)
There appear to be large jumps in the 1980s and 1990s in Canadian gas production. This could be a result of the aggressive increase in natural gas production Canada undertook from approximately 1985 to 2000 (per a quick Google search). Per my Canadian friend, natural gas became a major industry in Canada around that time and is now one of Canada's leading industries. This would explain the aggressive growth during this period, as the abundant natural resource was exploited until production hit a point of diminishing returns.
canstl = stl(cangas, s.window=5, robust = TRUE)
autoplot(canstl) + ggtitle("STL decomposition of Canadian Gas Production")
c. Compare the results with those obtained using SEATS and X11. How are they different?
library(seasonal)
canseas = seas(cangas)
autoplot(canseas) + ggtitle("SEATS decomposition of Canadian Gas Production")
library(seasonal)
canx11 = seas(cangas, x11="")
autoplot(canx11) + ggtitle("X11 decomposition of Canadian Gas Production")
The STL seasonal component balloons in the 1980s-1990s but has a smaller magnitude in the earlier years (the 1970s and before). The SEATS and X11 decompositions have a larger magnitude in this earlier range that gradually decreases before ballooning again in the 1980s-1990s. The trend line for SEATS and X11 is notably more jagged and bumpy than the smoother STL trend line. These differences are likely because STL is additive and is meant to have a "smoother" trend line whose smoothness can be controlled by the user.
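One way to make the comparison a little fairer (my own sketch, not part of the exercise): because STL is additive, decomposing the logged series roughly mimics the multiplicative behaviour of SEATS and X11.
canstl_log <- stl(log(cangas), s.window = 5, robust = TRUE)
autoplot(canstl_log) + ggtitle("STL decomposition of log Canadian Gas Production")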
Exercise 6 We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.
brickstl = stl(bricksq, s.window="periodic", robust = TRUE)
autoplot(brickstl) + ggtitle("STL decomposition of clay brick production by quarter")
brickstl = stl(bricksq, s.window=7, robust = TRUE)
autoplot(brickstl) + ggtitle("STL decomposition of clay brick production by quarter")
The trend lines for both the fixed and the changing seasonality are nearly identical, and the seasonally adjusted series shows little difference between the two. The points with large remainders are similar in both models.
bricksa = seasadj(brickstl)
autoplot(bricksa)
c. Use a naïve method to produce forecasts of the seasonally adjusted data.
autoplot(naive(bricksa))
d. Use stlf() to reseasonalise the results, giving forecasts for the original data.
?stlf()
brickfc = stlf(bricksq, s.window = 7, method="naive")
autoplot(brickfc)
e. Do the residuals look uncorrelated?
checkresiduals(brickfc)
## Warning in checkresiduals(brickfc): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 60.299, df = 8, p-value = 4.072e-10
##
## Model df: 0. Total lags used: 8
The residuals have a relatively normal distribution with a small left tail. The p-value of 4.072e-10 means that the Ljung-Box test indicates the presence of autocorrelation.
brickfc2 = stlf(bricksq, s.window = 7, method="naive", robust=TRUE)
autoplot(brickfc2)
checkresiduals(brickfc2)
## Warning in checkresiduals(brickfc2): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 24.985, df = 8, p-value = 0.001564
##
## Model df: 0. Total lags used: 8
The p-value under the robust model is not nearly as small as in the prior model, but it is still highly significant at 0.001564, so there is still strong evidence of autocorrelation.
brickend = window(bricksq, end = c(1992, 4))
fcstlf = stlf(brickend, s.window = 7, method="naive", robust=TRUE)
autoplot(fcstlf)
fcnaive = snaive(brickend)
autoplot(fcnaive)
accuracy(fcstlf, bricksq)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.454811 22.55502 15.67641 0.3272808 3.817365 0.4327014
## Test set 29.053932 31.42719 29.05393 6.4300642 6.430064 0.8019487
## ACF1 Theil's U
## Training set 0.0449449 NA
## Test set -0.1419827 0.7613135
accuracy(fcnaive, bricksq)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.06250 49.54691 36.22917 1.343615 8.857805 1.0000000 0.8090232
## Test set 34.28571 37.96615 34.28571 7.443878 7.443878 0.9463567 -0.1191536
## Theil's U
## Training set NA
## Test set 0.9975749
The STL with random walk model does a better job than the seasonal naive model in this example, as its error metrics on the test set (RMSE, MAE, MAPE, MASE) are all lower.
Exercise 7 Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
autoplot(writing)
The autoplot shows a slight upward trend.
BoxCox.lambda(writing)
## [1] 0.1557392
The small value of lambda and the upward trend in the plot indicate that a log transformation (lambda = 0) with the drift method can be used.
driftfc = stlf(writing, s.window = "periodic", method="rwdrift", robust=TRUE, lambda = 0) # lambda = 0 applies the log transformation discussed above
autoplot(driftfc)
Exercise 8 Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
autoplot(fancy)
The spikes are growing at a considerable rate, so a Box-Cox transformation is likely warranted. There is also an upward trend, which means rwdrift is warranted again.
fancydrift = stlf(fancy, s.window = "periodic", method="rwdrift", robust=TRUE, lambda = 0.1)
autoplot(fancydrift)
After experimenting with several lambda values, a lambda of 0.1 seems appropriate for the model.
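As a cross-check on that hand-picked value (my own addition), the automatic Box-Cox selection can be compared against it:
BoxCox.lambda(fancy)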