#Chapter 5
#1. Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.
library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## v ggplot2 3.3.5 v fma 2.4
## v forecast 8.15 v expsmooth 2.3
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'forecast' was built under R version 4.0.5
##
daily20 = head(elecdaily,20)
daily20
## Time Series:
## Start = c(1, 4)
## End = c(4, 2)
## Frequency = 7
## Demand WorkDay Temperature
## 1.428571 174.8963 0 26.0
## 1.571429 188.5909 1 23.0
## 1.714286 188.9169 1 22.2
## 1.857143 173.8142 0 20.3
## 2.000000 169.5152 0 26.1
## 2.142857 195.7288 1 19.6
## 2.285714 199.9029 1 20.0
## 2.428571 205.3375 1 27.4
## 2.571429 228.0782 1 32.4
## 2.714286 258.5984 1 34.0
## 2.857143 201.7970 0 22.4
## 3.000000 187.6298 0 22.5
## 3.142857 254.6636 1 30.0
## 3.285714 322.2323 1 42.4
## 3.428571 343.9934 1 41.5
## 3.571429 347.6376 1 43.2
## 3.714286 332.9455 1 43.1
## 3.857143 219.7517 0 23.7
## 4.000000 186.9816 0 22.3
## 4.142857 228.4876 1 24.0
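#Fit the regression of Demand on Temperature (the fitting code is reconstructed from the Call output below; tsdaily20 is assumed to be the daily20 series):
tsdaily20 = daily20
tslmdaily20 = tslm(Demand ~ Temperature, data = tsdaily20)
tslmdaily20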
##
## Call:
## tslm(formula = Demand ~ Temperature, data = tsdaily20)
##
## Coefficients:
## (Intercept) Temperature
## 39.212 6.757
#b. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(tslmdaily20$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
#Based on the residual check, the residuals do not appear to be correlated, which indicates the model is adequate. There appears to be one outlier, but it does not seem to affect the model as much as it could.
#c. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts?
Demand15 = forecast(tslmdaily20, newdata = data.frame(Temperature = 15))
Demand15
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 21 140.5701 108.681 172.4591 90.21166 190.9285
Demand35 = forecast(tslmdaily20, newdata = data.frame(Temperature = 35))
Demand35
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 21 275.7146 245.2278 306.2014 227.5706 323.8586
#These forecasts seem to be right in line with the other observations on our scatterplot, so I would say they're believable. Note, though, that 15°C lies below the lowest temperature in the 20-day sample (19.6°C), so that forecast is an extrapolation beyond the observed data.
#d. Give prediction intervals for your forecasts. The following R code will get you started:
autoplot(daily20, facets=TRUE)
daily20 %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
fit <- tslm(Demand ~ Temperature, data=daily20)
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
forecast(fit, newdata=data.frame(Temperature=c(15,35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 4.285714 140.5701 108.6810 172.4591 90.21166 190.9285
## 4.428571 275.7146 245.2278 306.2014 227.57056 323.8586
#With the given output, the forecasts break down as follows: at 80%, the forecast for 15 degrees falls between 108.681 and 172.4591, and the forecast for 35 degrees between 245.2278 and 306.2014. At 95%, the forecast for 15 degrees falls between 90.21166 and 190.9285, and the forecast for 35 degrees between 227.57056 and 323.8586.
#e. Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
plot(Demand ~ Temperature, data = elecdaily, main = "Demand for Electricity vs. Temperature")
#Based on the full dataset, the model appears less accurate, and this could be for a number of reasons. The main one is that it does not account for the other uses of electricity at lower temperatures: as people spend more time indoors, they use more televisions, computers, lights, and heating. A straight line cannot capture demand rising at both temperature extremes.
#2. Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
#a. Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400, xlab="Year", ylab="Winner's Time", main="Winner's Time for 400 Meter By Year")
#It's interesting to see how drastically the winner's time has decreased over the 120 years of data; the overall trend is clearly downward.
#b. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
wintimes = time(mens400)
wintslm = tslm(mens400 ~ wintimes, data = mens400)
wintslm
##
## Call:
## tslm(formula = mens400 ~ wintimes, data = mens400)
##
## Coefficients:
## (Intercept) wintimes
## 172.48148 -0.06457
autoplot(mens400, xlab="Year", ylab="Winner's Time", main="Winner's Time for 400 Meter By Year") +
geom_abline(slope = wintslm$coefficients[2],
intercept = wintslm$coefficients[1])
#Based on the regression line fit to the data, winning times are decreasing by an average of about 0.065 seconds per year.
#c. Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
checkresiduals(wintslm$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
#The residual plot suggests that the fitted line suits the data well.
#d. Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
win2020 = forecast(wintslm, newdata = data.frame(wintimes = 2020))
win2020
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 42.04231 40.44975 43.63487 39.55286 44.53176
#Based on the forecast, we could expect a winning time of around 42.04 seconds in 2020, with a 95% prediction interval of 39.55 to 44.53 seconds. These calculations assume the linear downward trend continues through 2020 and that the residuals are uncorrelated with constant variance; clearly the trend cannot continue forever.
#3.Type easter(ausbeer) and interpret what you see.
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
#This function returns the proportion of Easter (defined as the days from Good Friday to Easter Sunday, inclusive) falling in each period: 1 when the holiday falls entirely within a quarter, 0 otherwise, and fractional values (e.g. 0.67/0.33) when it spans March and April and is therefore split between Q1 and Q2. Here it is computed over the span of the ausbeer dataset.
#4.An elasticity coefficient is the ratio of the percentage change in the forecast variable ( y ) to the percentage change in the predictor variable ( x ). Mathematically, the elasticity is defined as (dy/dx)×(x/y) . Consider the log-log model,log y = β0+β1 log x + ε. Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.
#y = e^(β0 + β1 log x + ε) = e^(β0 + ε) * x^β1
#Differentiating with respect to x: dy/dx = β1 * e^(β0 + ε) * x^(β1 - 1) = β1 * (y/x)
#Therefore (dy/dx) * (x/y) = β1, so the coefficient β1 is the elasticity coefficient.
#5. The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
#a.Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy, ylab= "Sales", xlab="Year")
head(fancy,84)
## Jan Feb Mar Apr May Jun Jul
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1988 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1989 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1990 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1991 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 1992 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 1993 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1987 3566.34 5021.82 6423.48 7600.60 19756.21
## 1988 4752.15 5496.43 5835.10 12600.08 28541.72
## 1989 8176.62 8573.17 9690.50 15151.84 34061.01
## 1990 7979.25 8093.06 8476.70 17914.66 30114.41
## 1991 12552.22 11637.39 13606.89 21822.11 45060.69
## 1992 19888.61 23933.38 25391.35 36024.80 80721.71
## 1993 28586.52 30505.41 30821.33 46634.38 104660.67
#The data show a consistent seasonal pattern with an upward trend up until the holiday season of 1992, after which the December sales spikes grow almost exponentially.
#b. Explain why it is necessary to take logarithms of these data before fitting a model.
#The seasonal fluctuations grow with the level of the series, especially from the 1992 holiday season onward. Taking logarithms stabilises this variance, so the sudden increases have less effect on the fitted model.
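#A quick look at the logged series (a sketch) shows the seasonal swings becoming roughly constant in size:
autoplot(log(fancy), ylab = "log(Sales)", xlab = "Year")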
#c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
fancytslm = tslm(BoxCox(fancy,0)~season+trend)
fancytslm
##
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ season + trend)
##
## Coefficients:
## (Intercept) season2 season3 season4 season5 season6
## 7.60586 0.25104 0.69521 0.38293 0.40799 0.44696
## season7 season8 season9 season10 season11 season12
## 0.60822 0.58535 0.66634 0.74403 1.20302 1.95814
## trend
## 0.02239
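#The question also asks for a "surfing festival" dummy (the festival has run every March since 1988). A sketch of how it could be constructed and added to the model; festival is a hypothetical name, and fancy is assumed to start in January 1987:
festival = rep(0, length(fancy))
festival[seq_along(festival) %% 12 == 3] = 1  #every March
festival[3] = 0                               #no festival in March 1987
fancytslm_fest = tslm(BoxCox(fancy, 0) ~ season + trend + festival)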
#d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(fancytslm$residuals)
#Based on the plot, there does appear to be correlation between the residuals and time: there are runs of consecutive positive and negative residuals, suggesting the model is missing some structure.
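#The question also asks for the residuals against the fitted values; a minimal sketch using base graphics:
plot(as.numeric(fitted(fancytslm)), as.numeric(residuals(fancytslm)),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted")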
#e. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
Salestime = time(fancy)
cbind.data.frame(
  Month = factor(
    month.abb[round(12*(Salestime - floor(Salestime)) + 1)],
    levels = month.abb,
    ordered = TRUE
  ),
  Residuals = as.numeric(fancytslm$residuals)
) %>%
  ggplot(aes(x = Month, y = Residuals)) +
  geom_boxplot()
#The boxplots show that the spread of the residuals varies from month to month, which may point to issues with the model in particular months, but the residuals are small relative to the scale of the data, so this should not prevent the model from being useful.
#f. What do the values of the coefficients tell you about each variable?
fancytslm$coefficients
## (Intercept) season2 season3 season4 season5 season6
## 7.60586042 0.25104367 0.69520658 0.38293405 0.40799437 0.44696253
## season7 season8 season9 season10 season11 season12
## 0.60821562 0.58535238 0.66634464 0.74403359 1.20301639 1.95813657
## trend
## 0.02239298
#These coefficients reveal a general increase in sales over time: the trend is positive (about 0.022 per month on the log scale), and every seasonal coefficient is positive, meaning each month has higher sales than January (the reference month), with December's coefficient by far the largest.
#g. What does the Breusch-Godfrey test tell you about your model?
checkresiduals(fancytslm)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.152, df = 17, p-value = 0.003209
#The small p-value tells us that the residuals are indeed serially correlated, so the model is leaving some information in the residuals.
#h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
#Dummy for 1994-1996 that is 1 each March (the festival month); note only its time index is used in the forecast below
future_sales = rep(0, 36)
for(i in 1:36){
if(i %% 12 == 3){
future_sales[i] = 1
}
}
future_sales = ts(data = future_sales,
start = 1994,
end = c(1996, 12),
frequency = 12)
future_sales_tslm = forecast(fancytslm, newdata = data.frame(Salestime = time(future_sales)))
autoplot(future_sales_tslm)
#i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.
future_sales_tslm$lower
## [,1] [,2]
## Jan 1994 9.246995 9.105002
## Feb 1994 9.520432 9.378439
## Mar 1994 9.986988 9.844995
## Apr 1994 9.697108 9.555115
## May 1994 9.744562 9.602569
## Jun 1994 9.805923 9.663930
## Jul 1994 9.989569 9.847576
## Aug 1994 9.989099 9.847106
## Sep 1994 10.092484 9.950491
## Oct 1994 10.192566 10.050573
## Nov 1994 10.673942 10.531949
## Dec 1994 11.451455 11.309462
## Jan 1995 9.512777 9.369195
## Feb 1995 9.786214 9.642632
## Mar 1995 10.252770 10.109188
## Apr 1995 9.962890 9.819308
## May 1995 10.010343 9.866762
## Jun 1995 10.071704 9.928123
## Jul 1995 10.255351 10.111769
## Aug 1995 10.254880 10.111299
## Sep 1995 10.358266 10.214684
## Oct 1995 10.458347 10.314766
## Nov 1995 10.939723 10.796142
## Dec 1995 11.717236 11.573655
## Jan 1996 9.777950 9.632451
## Feb 1996 10.051387 9.905887
## Mar 1996 10.517943 10.372443
## Apr 1996 10.228063 10.082564
## May 1996 10.275516 10.130017
## Jun 1996 10.336878 10.191378
## Jul 1996 10.520524 10.375024
## Aug 1996 10.520053 10.374554
## Sep 1996 10.623439 10.477939
## Oct 1996 10.723521 10.578021
## Nov 1996 11.204896 11.059397
## Dec 1996 11.982409 11.836910
future_sales_tslm$upper
## [,1] [,2]
## Jan 1994 9.771532 9.913525
## Feb 1994 10.044969 10.186962
## Mar 1994 10.511525 10.653518
## Apr 1994 10.221645 10.363638
## May 1994 10.269098 10.411091
## Jun 1994 10.330459 10.472452
## Jul 1994 10.514105 10.656099
## Aug 1994 10.513635 10.655628
## Sep 1994 10.617020 10.759014
## Oct 1994 10.717102 10.859095
## Nov 1994 11.198478 11.340471
## Dec 1994 11.975991 12.117984
## Jan 1995 10.043182 10.186763
## Feb 1995 10.316618 10.460200
## Mar 1995 10.783174 10.926756
## Apr 1995 10.493295 10.636876
## May 1995 10.540748 10.684330
## Jun 1995 10.602109 10.745691
## Jul 1995 10.785755 10.929337
## Aug 1995 10.785285 10.928867
## Sep 1995 10.888670 11.032252
## Oct 1995 10.988752 11.132334
## Nov 1995 11.470128 11.613710
## Dec 1995 12.247641 12.391223
## Jan 1996 10.315440 10.460940
## Feb 1996 10.588877 10.734376
## Mar 1996 11.055433 11.200932
## Apr 1996 10.765553 10.911053
## May 1996 10.813007 10.958506
## Jun 1996 10.874368 11.019867
## Jul 1996 11.058014 11.203513
## Aug 1996 11.057543 11.203043
## Sep 1996 11.160929 11.306428
## Oct 1996 11.261011 11.406510
## Nov 1996 11.742386 11.887886
## Dec 1996 12.519900 12.665399
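#Since the model was fitted on the log scale (BoxCox with lambda = 0), exponentiating the point forecasts and interval bounds returns them to raw sales dollars; a sketch:
exp(future_sales_tslm$mean)
exp(future_sales_tslm$lower)
exp(future_sales_tslm$upper)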
#j. How could you improve these predictions by modifying the model?
#The first idea that comes to mind would be to model the month-over-month percentage change in sales instead of the raw numbers; with a percentage change, a spike in sales in a given month has less of an effect. The exponential growth over time is what causes the model issues.
#6. The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day.” Consider only the data to the end of 2004.
#a. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.
gassupply2004 = window(gasoline, end = 2005)
autoplot(gassupply2004, ylab = "Weekly Gas Supply", xlab = "Year")
gas1 <- tslm(gassupply2004 ~ trend + fourier(gassupply2004, K=6))
gas2 <- tslm(gassupply2004 ~ trend + fourier(gassupply2004, K=13))
gas3 <- tslm(gassupply2004 ~ trend + fourier(gassupply2004, K=20))
gas4 <- tslm(gassupply2004 ~ trend + fourier(gassupply2004, K=25))
autoplot(gassupply2004, ylab = "Weekly Gas Supply",main= "Fourier") + autolayer(fitted(gas1))+
autolayer(fitted(gas2))+autolayer(fitted(gas3)) +autolayer(fitted(gas4))
#b. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
CV(gas1)
## CV AIC AICc BIC AdjR2
## 0.0725333 -1902.7534284 -1902.0773721 -1833.9401782 0.8474536
CV(gas2)
## CV AIC AICc BIC AdjR2
## 7.133807e-02 -1.915672e+03 -1.913172e+03 -1.782633e+03 8.529214e-01
CV(gas3)
## CV AIC AICc BIC AdjR2
## 7.223834e-02 -1.908017e+03 -1.902469e+03 -1.710753e+03 8.540588e-01
CV(gas4)
## CV AIC AICc BIC AdjR2
## 7.382567e-02 -1.893647e+03 -1.885129e+03 -1.650507e+03 8.530376e-01
#Of the models created above, the best seems to be K = 13 (gas2), as it minimises both the CV and AICc values.
#c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)
checkresiduals(gas2$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
#d. To forecast using harmonic regression, you will need to generate the future values of the Fourier terms. This can be done as follows. fc <- forecast(fit, newdata=data.frame(fourier(x,K,h))) where fit is the fitted model using tslm(), K is the number of Fourier terms used in creating fit, and h is the forecast horizon required. Forecast the next year of data.
fc = forecast(gas2, newdata = data.frame(fourier(gassupply2004, 13,52)))
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005.014 8.710727 8.367545 9.053909 8.185461 9.235993
## 2005.033 8.650436 8.307136 8.993737 8.124989 9.175884
## 2005.052 8.660015 8.316734 9.003295 8.134598 9.185431
## 2005.071 8.653126 8.309850 8.996401 8.127716 9.178535
## 2005.090 8.683014 8.339886 9.026143 8.157831 9.208198
## 2005.110 8.791940 8.448944 9.134936 8.266959 9.316921
## 2005.129 8.899217 8.556224 9.242210 8.374241 9.424194
## 2005.148 8.930126 8.587156 9.273096 8.405185 9.455067
## 2005.167 8.937656 8.594691 9.280620 8.412723 9.462589
## 2005.186 8.996410 8.653437 9.339382 8.471465 9.521354
## 2005.205 9.065673 8.722700 9.408646 8.540727 9.590618
## 2005.225 9.066081 8.723117 9.409045 8.541149 9.591013
## 2005.244 9.028558 8.685593 9.371522 8.503625 9.553490
## 2005.263 9.049214 8.706245 9.392184 8.524274 9.574155
## 2005.282 9.126381 8.783413 9.469350 8.601442 9.651321
## 2005.301 9.161733 8.818770 9.504696 8.636803 9.686664
## 2005.320 9.129093 8.786129 9.472058 8.604160 9.654026
## 2005.340 9.128470 8.785501 9.471439 8.603531 9.653409
## 2005.359 9.225209 8.882242 9.568176 8.700273 9.750146
## 2005.378 9.327847 8.984884 9.670810 8.802917 9.852777
## 2005.397 9.312974 8.970009 9.655939 8.788040 9.837908
## 2005.416 9.221724 8.878756 9.564692 8.696785 9.746662
## 2005.435 9.222772 8.879806 9.565738 8.697837 9.747707
## 2005.455 9.376358 9.033395 9.719321 8.851428 9.901288
## 2005.474 9.541343 9.198377 9.884308 9.016408 10.066277
## 2005.493 9.567204 9.224236 9.910172 9.042266 10.092142
## 2005.512 9.491023 9.148058 9.833989 8.966089 10.015957
## 2005.531 9.457530 9.114567 9.800493 8.932600 9.982461
## 2005.550 9.504403 9.161437 9.847369 8.979467 10.029338
## 2005.570 9.542378 9.199410 9.885345 9.017440 10.067315
## 2005.589 9.526111 9.183147 9.869076 9.001178 10.051045
## 2005.608 9.515559 9.172596 9.858522 8.990628 10.040490
## 2005.627 9.544314 9.201347 9.887281 9.019378 10.069251
## 2005.646 9.544076 9.201108 9.887043 9.019138 10.069013
## 2005.665 9.455292 9.112328 9.798257 8.930360 9.980225
## 2005.685 9.323376 8.980412 9.666339 8.798445 9.848307
## 2005.704 9.225564 8.882596 9.568531 8.700626 9.750501
## 2005.723 9.178145 8.835178 9.521113 8.653208 9.703083
## 2005.742 9.172512 8.829549 9.515476 8.647581 9.697443
## 2005.761 9.223824 8.880859 9.566788 8.698891 9.748756
## 2005.780 9.315944 8.972975 9.658913 8.791004 9.840883
## 2005.800 9.371111 9.028143 9.714079 8.846172 9.896049
## 2005.819 9.347237 9.004274 9.690200 8.822307 9.872168
## 2005.838 9.297081 8.954115 9.640046 8.772147 9.822015
## 2005.857 9.262941 8.919969 9.605912 8.737998 9.787884
## 2005.876 9.201516 8.858547 9.544485 8.676576 9.726456
## 2005.895 9.105412 8.762450 9.448374 8.580483 9.630341
## 2005.915 9.097672 8.754701 9.440642 8.572730 9.622613
## 2005.934 9.261236 8.918258 9.604214 8.736282 9.786190
## 2005.953 9.443697 9.100722 9.786672 8.918748 9.968646
## 2005.972 9.405229 9.062264 9.748195 8.880296 9.930163
## 2005.991 9.140470 8.797421 9.483518 8.615408 9.665531
#e. Plot the forecasts along with the actual data for 2005. What do you find?
gassupply2005 = window(gasoline, start = 2005, end = 2006)
autoplot(gassupply2005, series = "True", ylab = "Weekly Gas Supply", xlab = "Year", main = "Forecast of 2005 vs True 2005 Gas Supply") +
autolayer(fc$mean, series = "Forecast")
#Based on the plot, it seems like the model fits the true 2005 data very well, aside from the anomalous drop that happens around August or September.
#7.Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.
#a.Plot the data and comment on its features.
autoplot(huron, xlab = "Year", ylab = "Water Level (feet)", main = "Lake Water Level of Lake Huron 1875-1972")
#There appears to be a slight overall downward trend in water level in the lake as well as some large fluctuations, likely based on precipitation in a given year.
#b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915.
huron_tslm_1 = tslm(huron ~ trend)
summary(huron_tslm_1)
##
## Call:
## tslm(formula = huron ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50997 -0.72726 0.00083 0.74402 2.53565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.202037 0.230111 44.335 < 2e-16 ***
## trend -0.024201 0.004036 -5.996 3.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared: 0.2725, Adjusted R-squared: 0.2649
## F-statistic: 35.95 on 1 and 96 DF, p-value: 3.545e-08
Timehuron = time(huron)
Timespan = ts(pmax(0, Timehuron - 1915), start = 1875)
huron_tslm_2 = tslm(huron ~ Timehuron + Timespan)
summary(huron_tslm_2)
##
## Call:
## tslm(formula = huron ~ Timehuron + Timespan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49626 -0.66240 -0.07139 0.85163 2.39222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.90870 19.97687 6.653 1.82e-09 ***
## Timehuron -0.06498 0.01051 -6.181 1.58e-08 ***
## Timespan 0.06486 0.01563 4.150 7.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.045 on 95 degrees of freedom
## Multiple R-squared: 0.3841, Adjusted R-squared: 0.3711
## F-statistic: 29.62 on 2 and 95 DF, p-value: 1.004e-10
#c. Generate forecasts from these two models for the period up to 1980 and comment on these.
fc_huron_tslm_1 = forecast(huron_tslm_1, h=8)
autoplot(huron) +
autolayer(fitted(huron_tslm_1), series = "Linear") +
autolayer(fc_huron_tslm_1, series = "Linear")
Time.New = Timehuron[length(Timehuron)] + seq(8)
Timespan.New = Timespan[length(Timespan)] + seq(8)
newdata = cbind(Timehuron = Time.New,
Timespan = Timespan.New) %>%
as.data.frame()
fc_huron_tslm_2 = forecast(huron_tslm_2, newdata = newdata)
autoplot(huron) +
autolayer(fitted(huron_tslm_2), series = "Piecewise") +
autolayer(fc_huron_tslm_2, series = "Piecewise")
#The first model continues the overall downward trend we've seen in the water levels, whereas the piecewise model adjusts the forecast to reflect the plateau in levels after the 1915 knot.
#Chapter 6
#1. Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
#First take 5-term moving averages, then a 3-term moving average of those:
#((Y1+Y2+Y3+Y4+Y5)/5 + (Y2+Y3+Y4+Y5+Y6)/5 + (Y3+Y4+Y5+Y6+Y7)/5) / 3
#Collecting terms: (1/15)Y1 + (2/15)Y2 + (3/15)Y3 + (3/15)Y4 + (3/15)Y5 + (2/15)Y6 + (1/15)Y7
#This gives the weights 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
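#A quick numerical check of this equivalence (a sketch using base R's stats::filter):
set.seed(1)
x = rnorm(20)
ma5 = stats::filter(x, rep(1/5, 5))      #centred 5-term MA
ma3x5 = stats::filter(ma5, rep(1/3, 3))  #3-term MA of the 5-term MAs
wma7 = stats::filter(x, c(1, 2, 3, 3, 3, 2, 1)/15)
all.equal(as.numeric(ma3x5), as.numeric(wma7))  #should return TRUE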
#2. The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
#a. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
plot(plastics, xlab = "Year", ylab = "Total Sales", main = "Sales of Product A")
#It looks like there is an overall upward trend in Product A sales, with a large seasonal increase around the middle of each year and an almost equally large decrease right around the beginning of each year.
#b. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
dec_mult = decompose(plastics, type = "multiplicative")
autoplot(dec_mult)
#c.Do the results support the graphical interpretation from part a?
#The decomposition appears to support the graph from part a, as you can see a similar repeating seasonal pattern with an upward trend.
#d.Compute and plot the seasonally adjusted data.
seas_adj_plas = dec_mult$x/dec_mult$seasonal
autoplot(seas_adj_plas)
#e.Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
outlier_plas = plastics
outlier_plas[10] = outlier_plas[10] + 500
decomp_outlier = decompose(outlier_plas, "multiplicative")
adj_plas_outlier = decomp_outlier$x/decomp_outlier$seasonal
autoplot(adj_plas_outlier)
#The outlier is very obvious in the early part of the graph, as it alters the graph greatly from its original form
#f.Does it make any difference if the outlier is near the end rather than in the middle of the time series?
#With the adjustment for seasonal changes, it should not matter much whether the outlier is near the end or in the middle of the series.
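#One way to check, reusing the approach from part e with the outlier placed near the end (a sketch):
outlier_end = plastics
outlier_end[length(outlier_end) - 2] = outlier_end[length(outlier_end) - 2] + 500
decomp_end = decompose(outlier_end, "multiplicative")
autoplot(decomp_end$x / decomp_end$seasonal)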
#3.Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously?
library(readxl)
retail <- read_excel("C:/Users/Dan McSwain/Desktop/Class Sources + Notes/Predictive Analytics & Forecasting/Week 2/retail.xlsx", skip = 1)
head(retail)
## # A tibble: 6 x 190
## `Series ID` A3349335T A3349627V A3349338X A3349398A A3349468W
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-04-01 00:00:00 303. 41.7 63.9 409. 65.8
## 2 1982-05-01 00:00:00 298. 43.1 64 405. 65.8
## 3 1982-06-01 00:00:00 298 40.3 62.7 401 62.3
## 4 1982-07-01 00:00:00 308. 40.9 65.6 414. 68.2
## 5 1982-08-01 00:00:00 299. 42.1 62.6 404. 66
## 6 1982-09-01 00:00:00 305. 42 64.4 412. 62.3
## # ... with 184 more variables: A3349336V <dbl>, A3349337W <dbl>,
## # A3349397X <dbl>, A3349399C <dbl>, A3349874C <dbl>, A3349871W <dbl>,
## # A3349790V <dbl>, A3349556W <dbl>, A3349791W <dbl>, A3349401C <dbl>,
## # A3349873A <dbl>, A3349872X <dbl>, A3349709X <dbl>, A3349792X <dbl>,
## # A3349789K <dbl>, A3349555V <dbl>, A3349565X <dbl>, A3349414R <dbl>,
## # A3349799R <dbl>, A3349642T <dbl>, A3349413L <dbl>, A3349564W <dbl>,
## # A3349416V <dbl>, A3349643V <dbl>, A3349483V <dbl>, A3349722T <dbl>,
## # A3349727C <dbl>, A3349641R <dbl>, A3349639C <dbl>, A3349415T <dbl>,
## # A3349349F <dbl>, A3349563V <dbl>, A3349350R <dbl>, A3349640L <dbl>,
## # A3349566A <dbl>, A3349417W <dbl>, A3349352V <dbl>, A3349882C <dbl>,
## # A3349561R <dbl>, A3349883F <dbl>, A3349721R <dbl>, A3349478A <dbl>,
## # A3349637X <dbl>, A3349479C <dbl>, A3349797K <dbl>, A3349477X <dbl>,
## # A3349719C <dbl>, A3349884J <dbl>, A3349562T <dbl>, A3349348C <dbl>,
## # A3349480L <dbl>, A3349476W <dbl>, A3349881A <dbl>, A3349410F <dbl>,
## # A3349481R <dbl>, A3349718A <dbl>, A3349411J <dbl>, A3349638A <dbl>,
## # A3349654A <dbl>, A3349499L <dbl>, A3349902A <dbl>, A3349432V <dbl>,
## # A3349656F <dbl>, A3349361W <dbl>, A3349501L <dbl>, A3349503T <dbl>,
## # A3349360V <dbl>, A3349903C <dbl>, A3349905J <dbl>, A3349658K <dbl>,
## # A3349575C <dbl>, A3349428C <dbl>, A3349500K <dbl>, A3349577J <dbl>,
## # A3349433W <dbl>, A3349576F <dbl>, A3349574A <dbl>, A3349816F <dbl>,
## # A3349815C <dbl>, A3349744F <dbl>, A3349823C <dbl>, A3349508C <dbl>,
## # A3349742A <dbl>, A3349661X <dbl>, A3349660W <dbl>, A3349909T <dbl>,
## # A3349824F <dbl>, A3349507A <dbl>, A3349580W <dbl>, A3349825J <dbl>,
## # A3349434X <dbl>, A3349822A <dbl>, A3349821X <dbl>, A3349581X <dbl>,
## # A3349908R <dbl>, A3349743C <dbl>, A3349910A <dbl>, A3349435A <dbl>,
## # A3349365F <dbl>, A3349746K <dbl>, ...
retailts <- ts(retail[,"A3349627V"],
frequency=12, start=c(1982,4))
decompx11<-decompose(retailts, type = "multiplicative")
autoplot(decompx11)
#The remainder panel looks odd compared to other decompositions I've seen; this is likely due to the presence of at least some outliers.
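#Strictly, the question asks for an X11 decomposition rather than decompose(); a sketch using the seasonal package (also loaded in exercise 5 below):
library(seasonal)
retail_x11 = seas(ts(retail[["A3349627V"]], frequency = 12, start = c(1982, 4)), x11 = "")
autoplot(retail_x11)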
#4.Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
#a.Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
#The decomposition in Figure 6.16 shows an overall increase in the Australian civilian labour force. The trend line flattens out in 1991 and 1992, and the remainder plot shows a large negative spike there, suggesting a large negative outlier. These could very well be signs of the recession mentioned below.
#b. Is the recession of 1991/1992 visible in the estimated components?
#As noted in part a, the remainder component shows a clear negative outlier around 1991/1992 and the trend flattens, so yes, the recession is visible in the estimated components.
#5. This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).
#a.Plot the data using autoplot(), ggsubseriesplot() and ggseasonplot() to look at the effect of the changing seasonality over time. What do you think is causing it to change so much?
autoplot(cangas)
ggsubseriesplot(cangas)
ggseasonplot(cangas)
#For the changes we see from season to season, it's likely that outside temperature affects gas production and consumption: the colder it gets, the more gas is produced and used for heating.
#b.Do an STL decomposition of the data. You will need to choose s.window to allow for the changing shape of the seasonal component.
cangas %>%
stl(s.window = 12) %>%
autoplot()
#c.Compare the results with those obtained using SEATS and X11. How are they different?
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.0.5
cangas %>%
seas() %>%
autoplot()
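#For the X11 side of the comparison, a sketch using the same package's x11 argument:
cangas %>%
seas(x11 = "") %>%
autoplot()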
#There is a tremendous difference between the seasonal components of these decompositions. The seas() plot shows a more pronounced seasonality that decreased from 1960 to 1975, rose back to peak levels around 1984, and stabilised in the early-to-mid 1990s. The stl() plot, on the other hand, shows an almost inverse seasonal pattern from 1960 to 1975 and in the mid-1990s, while the 1980s look similar.
#6. We will use the bricksq data (Australian quarterly clay brick production. 1956–1994) for this exercise.
#a.Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)
stlbrick = stl(bricksq, s.window = "periodic", t.window = 12)
stlbrick12 = stl(bricksq, s.window = 12)
stlbrick24 = stl(bricksq, s.window = 24)
autoplot(stlbrick)
autoplot(stlbrick12)
autoplot(stlbrick24)
#b.Compute and plot the seasonally adjusted data.
fitbrick = stl(bricksq, t.window = 12, s.window = "periodic", robust = TRUE)
fitbrick %>%
seasadj() %>%
autoplot() + ylab("Seasonal adjusted Brick Production")
#c.Use a naïve method to produce forecasts of the seasonally adjusted data.
fitbrick %>%
seasadj() %>%
naive() %>%
autoplot() + ylab("Seasonal adjusted Brick Production")
#d.Use stlf() to reseasonalise the results, giving forecasts for the original data.
brick_stlf = stlf(bricksq, method = "naive")
fc_brick_stlf = forecast(brick_stlf)
autoplot(fc_brick_stlf)
#e. Do the residuals look uncorrelated?
checkresiduals(fc_brick_stlf$residuals)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
#The residuals may have a slight correlation, but it should not be an issue for the model. The distribution appears close to normal; the ACF plot shows some concerning values, particularly at lag 8, but they do not seem large enough to compromise the integrity of the model.
#f.Repeat with a robust STL decomposition. Does it make much difference?
robust_stlf_brick = stlf(bricksq, method = "naive", robust = TRUE)
autoplot(robust_stlf_brick)
checkresiduals(robust_stlf_brick)
## Warning in checkresiduals(robust_stlf_brick): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + Random walk
## Q* = 27.073, df = 8, p-value = 0.0006869
##
## Model df: 0. Total lags used: 8
#The robust decomposition does not make a large difference, though it does reduce the autocorrelation somewhat.
#g.Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better?
brick_training = subset(bricksq, end = length(bricksq)-8)
brick_test = subset(bricksq, start = length(bricksq) - 7)
brick_snaive = snaive(brick_training)
brick_stlf_robust = stlf(brick_training, method = "naive", robust = TRUE)
autoplot(bricksq, series = "bricksq Data") +
geom_line() +
autolayer(brick_snaive, PI = FALSE, series = "SNaive Model", color = "Purple") +
autolayer(brick_stlf_robust, PI = FALSE, series = "Stlf Model", color = "darkgreen") +
scale_color_manual(values = c("gray50", "blue", "red"),
breaks = c("Original data", "stlf", "snaive")) +
scale_x_continuous(limits = c(1990, 1994.5)) +
scale_y_continuous(limits = c(300, 600)) +
guides(colour = guide_legend(title = "Data")) +
ggtitle("Snaive Vs. Stlf forecasts") +
annotate("rect", xmin=1992.75,xmax=1994.5,ymin=-Inf,ymax=Inf,
fill="lightblue",alpha = 0.4
)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 136 row(s) containing missing values (geom_path).
#Based on the plot above, I would say the stlf model sticks more closely to the actual data, despite the similar shape of the snaive forecasts.
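#A numeric check of the same comparison (a sketch using accuracy() against the held-out two years):
accuracy(brick_snaive, brick_test)
accuracy(brick_stlf_robust, brick_test)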
#7. Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
autoplot(writing)
writ_stlf_rw = stlf(writing, s.window = 12, method = "rwdrift")
writ_stlf_naive = stlf(writing, s.window = 12, method = "naive")
writ_stlf_naive_box = stlf(writing, s.window = 12, method = "naive", lambda = BoxCox.lambda(writing))
writ_stlf_rw_box = stlf(writing, s.window = 12, method = "rwdrift", lambda = BoxCox.lambda(writing))
autoplot(writ_stlf_rw)
autoplot(writ_stlf_naive)
autoplot(writ_stlf_naive_box)
autoplot(writ_stlf_rw_box)
#Based on the autoplots, I think the "rwdrift" method with the Box-Cox transformation is the most appropriate model, as it continues the overall slight upward trend while maintaining the shape of the historical data.
#8. Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required.
autoplot(fancy)
fan_stlf_rw = stlf(fancy, s.window = 12, method = "rwdrift")
fan_stlf_naive = stlf(fancy, s.window = 12, method = "naive")
fan_stlf_naive_box = stlf(fancy, s.window = 12, method = "naive", lambda = BoxCox.lambda(fancy))
fan_stlf_rw_box = stlf(fancy, s.window = 12, method = "rwdrift", lambda = BoxCox.lambda(fancy))
autoplot(fan_stlf_rw)
autoplot(fan_stlf_naive)
autoplot(fan_stlf_naive_box)
autoplot(fan_stlf_rw_box)
#The Box-Cox transformation is absolutely necessary for the fancy series; without it, the forecasts look completely different from the historical data, to the point of being unusable for prediction.