\[Chapter 5\] Question 1: Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.

daily20 <- head(elecdaily,20)
daily20
## Time Series:
## Start = c(1, 4) 
## End = c(4, 2) 
## Frequency = 7 
##            Demand WorkDay Temperature
## 1.428571 174.8963       0        26.0
## 1.571429 188.5909       1        23.0
## 1.714286 188.9169       1        22.2
## 1.857143 173.8142       0        20.3
## 2.000000 169.5152       0        26.1
## 2.142857 195.7288       1        19.6
## 2.285714 199.9029       1        20.0
## 2.428571 205.3375       1        27.4
## 2.571429 228.0782       1        32.4
## 2.714286 258.5984       1        34.0
## 2.857143 201.7970       0        22.4
## 3.000000 187.6298       0        22.5
## 3.142857 254.6636       1        30.0
## 3.285714 322.2323       1        42.4
## 3.428571 343.9934       1        41.5
## 3.571429 347.6376       1        43.2
## 3.714286 332.9455       1        43.1
## 3.857143 219.7517       0        23.7
## 4.000000 186.9816       0        22.3
## 4.142857 228.4876       1        24.0

a. Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?

There is likely a positive relationship because on warmer days, air conditioners will be used more.

#Scatter plot of Demand against Temperature with a fitted regression line
daily20 %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
    ylab("Electricity Demand") +
    xlab("Temperature") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

#Regression of Demand on Temperature using tslm()
lm.fit<-tslm(Demand ~ Temperature, data=daily20)
lm.fit
## 
## Call:
## tslm(formula = Demand ~ Temperature, data = daily20)
## 
## Coefficients:
## (Intercept)  Temperature  
##      39.212        6.757

b. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations? There seem to be two groups of observations: one with both high actual and predicted demand, and one with low actual and predicted demand. Two observations at the low end deviate from the predictions, but not by enough to be concerning.

cbind(Data = daily20[,"Demand"],
      Fitted = fitted(lm.fit)) %>%
  as.data.frame() %>%
  ggplot(aes(x=Data, y=Fitted)) +
    geom_point() +
    ylab("Fitted (predicted values)") +
    xlab("Data (actual values)") +
    ggtitle("Demand Of Electricity In Victoria, Australia") +
    geom_abline(intercept=0, slope=1)
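
The plot above compares fitted to actual values; a plot of the residuals against the fitted values gives a more direct adequacy check. A minimal sketch:

#residuals against fitted values; a random scatter around zero suggests an adequate fit
cbind(Fitted = fitted(lm.fit),
      Residuals = residuals(lm.fit)) %>%
  as.data.frame() %>%
  ggplot(aes(x=Fitted, y=Residuals)) +
    geom_point() +
    geom_hline(yintercept=0, linetype="dashed")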

c. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts? The predicted demand values are within a normal range for the given temperatures, but the model is fit to only 20 observations, so I trust these forecasts with some caution.

pred_15<-predict(lm.fit, data.frame(Temperature = c(15)))
pred_35<-predict(lm.fit, data.frame(Temperature = c(35)))
pred_15
##        1 
## 140.5701
pred_35
##        1 
## 275.7146
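
As a sanity check, these point forecasts are just the fitted line evaluated at the two temperatures:

#evaluate Demand = 39.212 + 6.757 * Temperature directly; should reproduce pred_15 and pred_35
coef(lm.fit)[1] + coef(lm.fit)[2] * c(15, 35)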

d. Give prediction intervals for your forecasts. The following R code will get you started: The 95% prediction interval of demand for a 15-degree day is roughly 90 to 191; for a 35-degree day it is roughly 228 to 324.

autoplot(daily20, facets=TRUE)

daily20 %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

fit <- tslm(Demand ~ Temperature, data=daily20)
checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals from Linear regression model
## LM test = 3.8079, df = 5, p-value = 0.5774
forecast(fit, newdata=data.frame(Temperature=c(15,35)))
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 4.285714       140.5701 108.6810 172.4591  90.21166 190.9285
## 4.428571       275.7146 245.2278 306.2014 227.57056 323.8586

e. Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model? The plot clearly shows a U-shaped relationship, which makes sense because heating also requires electricity on cold days. For the forecasts, it means fitting a linear model to the 20-day subset is not a good method, as the relationship is clearly not linear over the full temperature range.

#Scatter plot of Demand against Temperature for the full series
elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
    ylab("Electricity Demand") +
    xlab("Temperature") +
    geom_point()
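
One way to capture the U-shape would be to add a squared temperature term; a sketch fit to the full data (fit.quad is a hypothetical name, not part of the original solution):

#quadratic fit allowing demand to rise at both temperature extremes
fit.quad <- tslm(Demand ~ Temperature + I(Temperature^2), data=elecdaily)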

2. Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.

a. Plot the winning time against the year. Describe the main features of the plot. There is a downward trend in the data over time, with notable gaps for both World War 1 and World War 2.

autoplot(mens400)

b. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year? The trend coefficient shows a decline of roughly a quarter of a second per Olympiad (the trend counts observations, which are four years apart), which works out to about 0.065 seconds per year.

fit.mens400<-tslm(mens400~trend)
fit.mens400
## 
## Call:
## tslm(formula = mens400 ~ trend)
## 
## Coefficients:
## (Intercept)        trend  
##     50.3078      -0.2583
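
Since the series contains one observation per four-year Olympiad, the trend coefficient can be converted to a per-year rate directly:

#average decline per year (one trend step = one Olympiad = four years)
coef(fit.mens400)["trend"] / 4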

c. Plot the residuals against the year. What does this indicate about the suitability of the fitted line? The residuals look reasonable, though they trend upwards towards the end of the series, suggesting some nonlinearity that the straight line cannot capture.

res <- residuals(fit.mens400)
autoplot(res) + xlab("Year") + ylab("") +
  ggtitle("Residuals")

d. Predict the winning time for the men's 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations? The 95% prediction interval for the winning 400m time at the 2020 Olympics is between 39.6 seconds and 44.5 seconds. The key assumption is that winning times continue to fall linearly, an assumption that was already beginning to break down in the 2000s and 2010s based on the residual plot.

#the model contains only the trend term, so newdata here just sets the horizon (one Games ahead)
forecast(fit.mens400, newdata=data.frame(Year=c(2020)))
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2020       42.04231 40.44975 43.63487 39.55286 44.53176

3. Type easter(ausbeer) and interpret what you see. The output is a combination of 0's, 1's, and fractions; within each year the values sum to one, representing the share of the Easter holiday that falls in each quarter.

easter(ausbeer)
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00

4. An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx)×(x/y). Consider the log-log model: \[\log(y)=\beta_0+\beta_1\log(x)+\varepsilon\] Express y as a function of x and show that the coefficient β1 is the elasticity coefficient. Exponentiating both sides expresses y as a function of x: \[y=e^{\beta_0+\varepsilon}x^{\beta_1}\] Differentiate both sides of the log-log model, treating β0 and ε as constants: \[\frac{dy}{y}=\beta_1\frac{dx}{x}\] Isolate β1 by moving the x and y terms to one side: \[\beta_1=\frac{dy}{y}\cdot\frac{x}{dx}\] Rearranging to match the elasticity definition: \[\beta_1=\frac{dy}{dx}\cdot\frac{x}{y}\]

5. The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

a. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series. The fancy plot has a distinct seasonal pattern, with an upward trend that becomes more pronounced in the early 90s, indicating possible nonlinearity. The data also exhibit some heteroskedasticity, with the seasonal swings growing as the level rises.

autoplot(fancy)

b. Explain why it is necessary to take logarithms of these data before fitting a model. The plot clearly shows a nonlinear trend, so a standard linear fit would not accurately forecast the data, underestimating the true values in this case. Taking logarithms gives a more reasonable forecast, even if the linear fit looks good on the training data. It also helps deal with the heteroskedasticity, since the seasonal variation is roughly proportional to the level of the series.

c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

#surfing-festival dummy: 1 in every March from 1988 onwards
dum.fest = rep(0, length(fancy))
dum.fest[seq_along(dum.fest)%%12 == 3] = 1  #every March (series starts Jan 1987)
dum.fest[3] = 0                             #no festival in March 1987
dum.fest = ts(dum.fest, freq = 12, start=c(1987,1))
fancy.new = data.frame(fancy, dum.fest)
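
A quick check that the dummy is built correctly (zero in March 1987, one in each later March):

window(dum.fest, start=c(1987,1), end=c(1989,12))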

#BoxCox with lambda = 0 is the log transformation
fit.fancy<-tslm(BoxCox(fancy.new$fancy, 0)~trend+season+dum.fest)

autoplot(BoxCox(fancy.new$fancy, 0), series="Data") +
  autolayer(fitted(fit.fancy), series="Fitted") +
  xlab("Year") + ylab("log(fancy)") +
  ggtitle("Sales")

d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model? There is a pattern in the residual plot: the model consistently overestimates the real data around 1991 and underestimates it from 1992 onwards, suggesting remaining autocorrelation.

res <- residuals(fit.fancy)
autoplot(res) + xlab("Year") + ylab("") +
  ggtitle("Residuals")

e. Do boxplots of the residuals for each month. Does this reveal any problems with the model? There is clearly more variation in the first three months of the year and in August, September, and October. This indicates a problem with the fit of our model in those months.

#boxplot of residuals grouped by month of the year
boxplot(res~cycle(res))

f. What do the values of the coefficients tell you about each variable? Since the model is fit on the log scale, the trend coefficient (0.022) says sales grow by roughly 2.2% per month, holding the month fixed. The intercept (7.62) is the level of log sales for January (the omitted base month) at the start of the series. Each seasonal coefficient gives the log premium of that month relative to January: all are positive, so January is the weakest month, and December's coefficient of 1.96 corresponds to sales roughly e^1.96 ≈ 7 times the January level. The festival dummy adds about 0.50 to log sales, a premium of roughly 65% in each March from 1988.

fit.fancy
## 
## Call:
## tslm(formula = BoxCox(fancy.new$fancy, 0) ~ trend + season + 
##     dum.fest)
## 
## Coefficients:
## (Intercept)        trend      season2      season3      season4      season5  
##     7.61967      0.02202      0.25142      0.26608      0.38405      0.40949  
##     season6      season7      season8      season9     season10     season11  
##     0.44883      0.61045      0.58796      0.66933      0.74739      1.20675  
##    season12     dum.fest  
##     1.96224      0.50152
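
Because the model is on the log scale, exponentiating a coefficient gives its multiplicative effect on sales; for example:

#multiplicative effects of December and the surfing festival relative to the January baseline
exp(coef(fit.fancy)[c("season12", "dum.fest")])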

g. What does the Breusch-Godfrey test tell you about your model? The small p-value indicates that there is significant autocorrelation remaining in the residuals of my model.

#bgtest() comes from the lmtest package
bgtest(fit.fancy)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  fit.fancy
## LM test = 25.031, df = 1, p-value = 5.642e-07

h. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

#NB: dum.fest is set to 0 for all 36 forecast months here; strictly, each forecast
#March (1994-1996) should be 1, since the festival is held every March from 1988
pred<-forecast(fit.fancy, newdata=data.frame(dum.fest = rep(0, 36)))
pred
##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
## Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
## Mar 1994       9.801475  9.461879 10.141071  9.277961 10.32499
## Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
## May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
## Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
## Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
## Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
## Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
## Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995       9.755590  9.499844 10.011336  9.361338 10.14984
## Feb 1995      10.029027  9.773281 10.284773  9.634775 10.42328
## Mar 1995      10.065713  9.722498 10.408928  9.536620 10.59481
## Apr 1995      10.205703  9.949957 10.461449  9.811451 10.59996
## May 1995      10.253157  9.997411 10.508903  9.858904 10.64741
## Jun 1995      10.314518 10.058772 10.570264  9.920265 10.70877
## Jul 1995      10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995      10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995      10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995      10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995      11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995      11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996      10.019828  9.760564 10.279093  9.620151 10.41951
## Feb 1996      10.293265 10.034000 10.552530  9.893588 10.69294
## Mar 1996      10.329951  9.982679 10.677222  9.794605 10.86530
## Apr 1996      10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996      10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996      10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996      10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996      10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996      10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996      10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996      11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996      12.224288 11.965023 12.483552 11.824611 12.62396
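
A sketch of the same forecast with the festival dummy switched on for each forecast March (future.fest is a hypothetical helper, not part of the original solution); the March point forecasts would rise by the dum.fest coefficient, so the table above would change accordingly:

future.fest <- rep(0, 36)
future.fest[seq(3, 36, by=12)] <- 1  #March 1994, 1995, 1996
pred.fest <- forecast(fit.fancy, newdata=data.frame(dum.fest = future.fest))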

i. Transform your predictions and intervals to obtain predictions and intervals for the raw data.

df<-as.data.frame(pred)
#lambda = 0 means the forecasts are on the log scale, so exp() back-transforms them
df<-exp(df)
df
##          Point Forecast     Lo 80     Hi 80      Lo 95     Hi 95
## Jan 1994       13244.70  10285.82  17054.73   8969.583  19557.43
## Feb 1994       17409.81  13520.45  22418.00  11790.284  25707.73
## Mar 1994       18060.36  12860.02  25363.61  10699.594  30484.96
## Apr 1994       20774.16  16133.21  26750.16  14068.696  30675.62
## May 1994       21783.73  16917.24  28050.15  14752.395  32166.37
## Jun 1994       23162.27  17987.81  29825.24  15685.969  34201.95
## Jul 1994       27831.56  21613.98  35837.72  18848.111  41096.73
## Aug 1994       27818.48  21603.82  35820.87  18839.249  41077.41
## Sep 1994       30848.42  23956.87  39722.43  20891.193  45551.50
## Oct 1994       34095.57  26478.61  43903.67  23090.230  50346.32
## Nov 1994       55176.84  42850.31  71049.28  37366.903  81475.41
## Dec 1994      120067.79  93244.59 154607.08  81312.400 177294.90
## Jan 1995       17250.40  13357.65  22277.59  11629.938  25587.08
## Feb 1995       22675.20  17558.28  29283.31  15287.252  33633.55
## Mar 1995       23522.50  16688.88  33154.31  13858.024  39926.92
## Apr 1995       27057.06  20951.33  34942.16  18241.435  40133.06
## May 1995       28371.96  21969.51  36640.25  19127.918  42083.42
## Jun 1995       30167.42  23359.80  38958.95  20338.387  44746.58
## Jul 1995       36248.88  28068.91  46812.70  24438.412  53767.06
## Aug 1995       36231.84  28055.72  46790.69  24426.922  53741.78
## Sep 1995       40178.16  31111.50  51887.06  27087.467  59595.26
## Oct 1995       44407.37  34386.35  57348.77  29938.733  65868.34
## Nov 1995       71864.42  55647.40  92807.48  48449.831 106594.69
## Dec 1995      156380.86 121091.75 201954.08 105429.448 231955.81
## Jan 1996       22467.57  17336.40  29117.46  15065.329  33506.86
## Feb 1996       29533.04  22788.25  38274.14  19802.984  44043.89
## Mar 1996       30636.60  21648.24  43356.94  17936.708  52328.52
## Apr 1996       35240.15  27191.96  45670.42  23629.808  52555.15
## May 1996       36952.72  28513.41  47889.88  24778.151  55109.18
## Jun 1996       39291.20  30317.82  50920.48  26346.183  58596.65
## Jul 1996       47211.93  36429.60  61185.57  31657.322  70409.18
## Aug 1996       47189.73  36412.48  61156.80  31642.439  70376.07
## Sep 1996       52329.57  40378.47  67817.91  35088.887  78041.33
## Oct 1996       57837.85  44628.77  74956.52  38782.394  86256.08
## Nov 1996       93598.96  72222.70 121302.09  62761.521 139588.15
## Dec 1996      203676.38 157160.50 263959.89 136572.460 303751.35

j. How could you improve these predictions by modifying the model? I could add other explanatory variables, such as tourism numbers or GDP growth in Australia, to capture the economic trends in the area where the shop sells.

6. The gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.

a. Fit a harmonic regression with trend to the data. Experiment with changing the number of Fourier terms. Plot the observed gasoline and fitted values and comment on what you see. As K increases, there is more detail in the waves of the harmonic fit, but the extra terms do little to improve the fit to the longer-term movements in the data.

#restrict to data up to the end of 2004
gas.2004<-window(gasoline, end=2005)

#K=1
fourier.gas.1 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K=1))
autoplot(gas.2004, series="Data") +
  autolayer(fitted(fourier.gas.1), series="Fitted") +
  xlab("Year") + ylab("Gas Supply") +
  ggtitle("Gas Supply (K=1)")

#K=6
fourier.gas.6 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K=6))
autoplot(gas.2004, series="Data") +
  autolayer(fitted(fourier.gas.6), series="Fitted") +
  xlab("Year") + ylab("Gas Supply") +
  ggtitle("Gas Supply (K=6)")

#K=12
fourier.gas.12 <- tslm(gas.2004 ~ trend + fourier(gas.2004, K=12))
autoplot(gas.2004, series="Data") +
  autolayer(fitted(fourier.gas.12), series="Fitted") +
  xlab("Year") + ylab("Gas Supply") +
  ggtitle("Gas Supply (K=12)")

b. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value. The appropriate number of Fourier terms is 12 when minimising the CV value.

#search K = 1..20, keeping the K with the smallest cross-validation statistic
for(j in 1:20){
  fit <- tslm(gas.2004~trend + fourier(gas.2004, K = j))
  if (j==1){
    min.cv<-CV(fit)["CV"]
    min.k<-j
  }else{
    if (min.cv>CV(fit)["CV"]){
      min.cv<-CV(fit)["CV"]
      min.k<-j
    }
  }
}
print("Min CV:")
## [1] "Min CV:"
print(min.cv)
##         CV 
## 0.07096945
print("Min K:")
## [1] "Min K:"
print(min.k)
## [1] 12
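
The same search can be written more compactly (a sketch; cv.values is a hypothetical name, and CV() returns a named vector, so its "CV" element is indexed by name):

cv.values <- sapply(1:20, function(k)
  CV(tslm(gas.2004 ~ trend + fourier(gas.2004, K = k)))["CV"])
which.min(cv.values)  #should agree with min.k above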

c. Check the residuals of the final model using the checkresiduals() function. Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)

checkresiduals(tslm(gas.2004~trend + fourier(gas.2004, K = min.k)))

## 
##  Breusch-Godfrey test for serial correlation of order up to 104
## 
## data:  Residuals from Linear regression model
## LM test = 154.2, df = 104, p-value = 0.001021

d. To forecast using harmonic regression, you will need to generate the future values of the Fourier terms. This can be done as follows. fc <- forecast(fit, newdata=data.frame(fourier(x,K,h))) where fit is the fitted model using tslm(), K is the number of Fourier terms used in creating fit, and h is the forecast horizon required. Forecast the next year of data.

fit<-tslm(gas.2004~trend + fourier(gas.2004, K = min.k))
fc <- forecast(fit, newdata=data.frame(fourier(gas.2004,min.k,52)))

fc
##          Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 2005.014       8.713386 8.371087 9.055684 8.189474  9.237298
## 2005.033       8.642131 8.299720 8.984542 8.118047  9.166215
## 2005.052       8.656942 8.314540 8.999345 8.132871  9.181013
## 2005.071       8.661292 8.318899 9.003684 8.137236  9.185347
## 2005.090       8.686370 8.344109 9.028631 8.162515  9.210225
## 2005.110       8.784129 8.441982 9.126276 8.260448  9.307809
## 2005.129       8.895967 8.553823 9.238111 8.372291  9.419643
## 2005.148       8.937884 8.595756 9.280012 8.414232  9.461535
## 2005.167       8.940875 8.598761 9.282988 8.417246  9.464503
## 2005.186       8.988596 8.646477 9.330716 8.464958  9.512235
## 2005.205       9.062316 8.720191 9.404442 8.538669  9.585963
## 2005.225       9.073794 8.731673 9.415915 8.550153  9.597435
## 2005.244       9.031913 8.689800 9.374027 8.508284  9.555542
## 2005.263       9.041458 8.699342 9.383575 8.517824  9.565092
## 2005.282       9.122877 8.780756 9.464999 8.599236  9.646519
## 2005.301       9.169383 8.827263 9.511503 8.645744  9.693022
## 2005.320       9.132601 8.790487 9.474715 8.608970  9.656231
## 2005.340       9.120782 8.778667 9.462897 8.597150  9.644414
## 2005.359       9.221551 8.879431 9.563671 8.697912  9.745190
## 2005.378       9.335425 8.993306 9.677545 8.811787  9.859064
## 2005.397       9.316637 8.974522 9.658752 8.793006  9.840269
## 2005.416       9.214109 8.871994 9.556224 8.690478  9.737740
## 2005.435       9.218957 8.876838 9.561076 8.695320  9.742595
## 2005.455       9.383861 9.041741 9.725980 8.860222  9.907499
## 2005.474       9.545162 9.203046 9.887278 9.021530 10.068795
## 2005.493       9.559667 9.217553 9.901782 9.036037 10.083298
## 2005.512       9.487053 9.144935 9.829171 8.963416 10.010689
## 2005.531       9.464953 9.122834 9.807073 8.941315  9.988592
## 2005.550       9.508378 9.166261 9.850494 8.984744 10.032011
## 2005.570       9.534922 9.192808 9.877037 9.011292 10.058553
## 2005.589       9.521986 9.179869 9.864104 8.998351 10.045621
## 2005.608       9.522899 9.180779 9.865019 8.999260 10.046538
## 2005.627       9.548443 9.206326 9.890560 9.024808 10.072078
## 2005.646       9.536705 9.194591 9.878819 9.013075 10.060335
## 2005.665       9.451014 9.108897 9.793131 8.927380  9.974649
## 2005.685       9.330630 8.988509 9.672750 8.806990  9.854269
## 2005.704       9.229844 8.887726 9.571962 8.706208  9.753480
## 2005.723       9.170862 8.828748 9.512976 8.647232  9.694492
## 2005.742       9.168084 8.825968 9.510200 8.644451  9.691717
## 2005.761       9.230990 8.888869 9.573110 8.707349  9.754630
## 2005.780       9.320372 8.978252 9.662492 8.796734  9.844010
## 2005.800       9.363916 9.021801 9.706030 8.840285  9.887546
## 2005.819       9.342664 9.000548 9.684779 8.819032  9.866296
## 2005.838       9.304159 8.962036 9.646281 8.780516  9.827801
## 2005.857       9.267511 8.925389 9.609633 8.743869  9.791153
## 2005.876       9.194408 8.852292 9.536523 8.670776  9.718040
## 2005.895       9.100701 8.758587 9.442816 8.577070  9.624332
## 2005.915       9.104667 8.762540 9.446794 8.581017  9.628317
## 2005.934       9.265935 8.923806 9.608064 8.742282  9.789587
## 2005.953       9.436660 9.094538 9.778782 8.913018  9.960302
## 2005.972       9.400410 9.058293 9.742527 8.876776  9.924044
## 2005.991       9.147441 8.805233 9.489649 8.623668  9.671215

e. Plot the forecasts along with the actual data for 2005. What do you find? The seasonal effects are clearly still relevant, with supply increasing through the summer driving months. The model also calls for a sharp drop at the beginning of the forecasts, in early January, which was a common pattern in previous years as well.

autoplot(gas.2004, series="Data") +
  autolayer(fc, series="Fitted") +
  xlab("Year") + ylab("Gas Supply") +
  ggtitle("Gas Supply (K=12)") +
  scale_x_continuous(limits = c(2000, 2006))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

7. Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.

a. Plot the data and comment on its features. There doesn't seem to be much of a cycle, but there is some nonlinearity: the level trends downward from 1875 to about 1915, then follows a much flatter trend after that. There is also a lot of year-to-year variation in the data.

autoplot(huron)

b. Fit a linear regression and compare this to a piecewise linear trend model with a knot at 1915. The piecewise model has an adjusted R-squared of 0.37, noticeably higher than the linear fit's 0.26.

lm.huron<-tslm(huron~trend)
#piecewise linear trend with a knot at 1915
t <- time(huron)
t.break <- 1915
t_piece <- ts(pmax(0,t-t.break), start=1875)  #zero before the knot, (t - 1915) after

fit.pw <- tslm(huron ~ t + t_piece)

autoplot(huron) +
  autolayer(fitted(lm.huron), series="Linear") +
  autolayer(fitted(fit.pw), series="Piecewise")+
  xlab("Year") + ylab("Lake Level") +
  ggtitle("Lake Huron") +
  guides(colour = guide_legend(title = " "))

summary(lm.huron)
## 
## Call:
## tslm(formula = huron ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50997 -0.72726  0.00083  0.74402  2.53565 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.202037   0.230111  44.335  < 2e-16 ***
## trend       -0.024201   0.004036  -5.996 3.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared:  0.2725, Adjusted R-squared:  0.2649 
## F-statistic: 35.95 on 1 and 96 DF,  p-value: 3.545e-08
summary(fit.pw)
## 
## Call:
## tslm(formula = huron ~ t + t_piece)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49626 -0.66240 -0.07139  0.85163  2.39222 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 132.90870   19.97687   6.653 1.82e-09 ***
## t            -0.06498    0.01051  -6.181 1.58e-08 ***
## t_piece       0.06486    0.01563   4.150 7.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.045 on 95 degrees of freedom
## Multiple R-squared:  0.3841, Adjusted R-squared:  0.3711 
## F-statistic: 29.62 on 2 and 95 DF,  p-value: 1.004e-10

c. Generate forecasts from these two models for the period up to 1980 and comment on these. The linear trend model projects a continuous downward trend. Its point forecasts are not necessarily outside the norm, but the level clearly cannot keep dropping indefinitely. The piecewise model, which flattens out after the 1915 knot, essentially projects that the lake level will stay within its recent range; this is probably more realistic than the continuous decline, but not especially informative.

h<-8
fcasts.lin <- forecast(lm.huron, h = h)

#extend the time index and knot variable h steps beyond the sample
t.new <- t[length(t)] + seq(h)
tb1.new <- t_piece[length(t_piece)] + seq(h)

newdata <- cbind(t=t.new, t_piece=tb1.new) %>%
  as.data.frame()
fcasts.pw <- forecast(fit.pw, newdata = newdata)



autoplot(huron) +
  autolayer(fitted(lm.huron), series="Linear") +
  autolayer(fcasts.lin, series="Linear") +
  xlab("Year") + ylab("Lake Level") +
  ggtitle("Lake Huron") +
  guides(colour = guide_legend(title = " "))

autoplot(huron) +
  autolayer(fitted(fit.pw), series="Piecewise")+
  autolayer(fcasts.pw, series="Piecewise") +
  xlab("Year") + ylab("Lake Level") +
  ggtitle("Lake Huron") +
  guides(colour = guide_legend(title = " "))

\[Chapter 6\] Question 1: Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067. A 3×5 MA is a centred, symmetric moving average, so we can write it as: \[T=\frac{1}{3}\left[\frac{1}{5}(y_1+y_2+y_3+y_4+y_5)+\frac{1}{5}(y_2+y_3+y_4+y_5+y_6)+\frac{1}{5}(y_3+y_4+y_5+y_6+y_7)\right]\] Collecting terms: \[T=\frac{1}{15}y_1+\frac{2}{15}y_2+\frac{1}{5}y_3+\frac{1}{5}y_4+\frac{1}{5}y_5+\frac{2}{15}y_6+\frac{1}{15}y_7\] Converting those fractions to decimals gives the weights listed above: \[T=0.067y_1+0.133y_2+0.2y_3+0.2y_4+0.2y_5+0.133y_6+0.067y_7\]
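
A quick numerical check: the weights of a composed moving average are the convolution of the component weights, which R's convolve() can compute (both kernels here are symmetric, so no reversal is needed):

#convolve the 3-term and 5-term uniform weights to get the 7 combined weights
round(convolve(rep(1/3, 3), rep(1/5, 5), type="open"), 3)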

Question 2: The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years. a. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle? There is definitely both a distinct upward trend and clear seasonal fluctuations.

autoplot(plastics)

b. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

plastics %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition
    of plastics")


c. Do the results support the graphical interpretation from part a? The results do support the graphical interpretation: there is both a clear seasonal component and a clear upward trend, though the dip in the trend at the end of the series was not easily seen in the original graph.

d. Compute and plot the seasonally adjusted data.

library(seasonal)

fit<-decompose(plastics, type="multiplicative")

autoplot(plastics, series="Data") +
  autolayer(seasadj(fit), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Monthly Sales") +
  ggtitle("Plastic Sales") +
  scale_colour_manual(values=c("gray","blue"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

e. Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier? The outlier drags the seasonally adjusted series up with it, because it lies outside the seasonal norms: the decomposition treats it as noise rather than as part of the seasonal pattern. Otherwise the line doesn't change much, if at all.

#outlier near the start of the series (observation 12 of 60)
plastics.out<-plastics
plastics.out[12]<-plastics.out[12]+500

fit<-decompose(plastics.out, type="multiplicative")

autoplot(plastics.out, series="Data") +
  autolayer(seasadj(fit), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Monthly Sales") +
  ggtitle("Plastic Sales") +
  scale_colour_manual(values=c("gray","blue"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

#outlier near the end of the series (observation 50 of 60)
plastics.out<-plastics
plastics.out[50]<-plastics.out[50]+500

fit<-decompose(plastics.out, type="multiplicative")

autoplot(plastics.out, series="Data") +
  autolayer(seasadj(fit), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Monthly Sales") +
  ggtitle("Plastic Sales") +
  scale_colour_manual(values=c("gray","blue"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

f. Does it make any difference if the outlier is near the end rather than in the middle of the time series? It does not appear to matter where the outlier is: in both cases the seasonally adjusted line spikes at the outlier and otherwise stays close to what it was without the outlier.

3. Recall your retail time series data (from Exercise 3 in Section 2.10). Decompose the series using X11. Does it reveal any outliers, or unusual features that you had not noticed previously? I did not notice the heteroskedasticity present in the seasonal component early on, though I do not think it is too concerning and it sorts itself out by the early 90s. There are also a few outliers visible in the remainder: one in the early 80s and one just after 2000.

library(seasonal)

ret.data <- readxl::read_excel("C:/Users/jfbia/OneDrive/Documents/predictive analytics/assignments/retail.xlsx", skip = 1)
ret <- ts(ret.data[, "A3349337W"], frequency = 12, start = c(1982, 4))

autoplot(ret)

ret %>% seas(x11="") -> fit
autoplot(fit) +
  ggtitle("X11 decomposition of retail")

4. Figures 6.16 and 6.17 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation. The decomposition shows a clear upward trend in the labor force, which is not surprising considering population growth. There is also a strong seasonal component, with December showing the highest average labor force and Australia's summer months not far behind, suggesting that seasonal work, for both the tourism season and the holiday season, is important. In December we would expect roughly an additional 100,000 workers to enter the labor force, only for almost the same number to leave in January. The remainder chart is also interesting: the early 90s was the last time Australia had a recession (not counting the current one), and the remainders there skew significantly negative.

b. Is the recession of 1991/1992 visible in the estimated components? Yes, it is clearly visible in the remainder chart. You can also see a change in the trend component around the recession, where the trend flattens out afterwards.

5. This exercise uses the cangas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005). a. Plot the data using autoplot(), ggsubseriesplot() and ggseasonplot() to look at the effect of the changing seasonality over time. What do you think is causing it to change so much? There is clearly an upward trend, even after accounting for seasonality, and the seasonal pattern itself changes shape over the decades. In the ggseasonplot, you can see that the increase in Canadian gas production is fairly monotonic from year to year.

autoplot(cangas)

ggsubseriesplot(cangas)

ggseasonplot(cangas)

b. Do an STL decomposition of the data. You will need to choose s.window to allow for the changing shape of the seasonal component.

#used mstl for an automated choice of s.window
cangas %>%
  mstl(robust=TRUE) %>%
  autoplot()

c. Compare the results with those obtained using SEATS and X11. How are they different? In both the SEATS and X11 decompositions, the seasonal panel shows a much larger seasonal component early and late in the series than the STL chart does. The remainder panel is also drawn as a line chart in STL and as a bar chart in the other two.

#X-11
cangas %>% seas(x11="") -> fit
autoplot(fit) +
  ggtitle("X11 decomposition of cangas")

#SEATS
cangas %>% seas() %>%
autoplot() +
  ggtitle("SEATS decomposition of cangas")

6. We will use the bricksq data (Australian quarterly clay brick production, 1956–1994) for this exercise. a. Use an STL decomposition to calculate the trend-cycle and seasonal indices. (Experiment with having fixed or changing seasonality.)

#fixed seasonal pattern
bricksq %>%
  stl(t.window=30, s.window="periodic", robust=TRUE) %>%
  autoplot()

#rapidly changing seasonal pattern
bricksq %>%
  stl(t.window=30, s.window=4, robust=TRUE) %>%
  autoplot()

#slowly changing seasonal pattern
bricksq %>%
  stl(t.window=30, s.window=20, robust=TRUE) %>%
  autoplot()

b. Compute and plot the seasonally adjusted data.

bricksq %>%
  stl(t.window=30, s.window="periodic", robust=FALSE) %>% seasadj() %>%
  autoplot()

c. Use a naïve method to produce forecasts of the seasonally adjusted data.

fit <- stl(bricksq, t.window=30, s.window="periodic", robust=FALSE)
fit %>% seasadj() %>% naive() %>%
  autoplot() + ylab("Production") +
  ggtitle("Naive forecasts of seasonally adjusted data")

d. Use stlf() to reseasonalise the results, giving forecasts for the original data.

ft_stlf <- stlf(bricksq, method='naive')
autoplot(forecast(ft_stlf,h=8)) + ylab("Production") +
  ggtitle("Naive forecasts of seasonally adjusted data")

e. Do the residuals look uncorrelated? There is some correlation at lags 4 and 8, and the Ljung-Box test rejects the white-noise hypothesis (p ≈ 2×10^-6), so the residuals are not strictly uncorrelated, although the effect on the point forecasts may be modest.

checkresiduals(forecast(ft_stlf,h=8))
## Warning in checkresiduals(forecast(ft_stlf, h = 8)): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk
## Q* = 40.829, df = 8, p-value = 2.244e-06
## 
## Model df: 0.   Total lags used: 8

f. Repeat with a robust STL decomposition. Does it make much difference? There are some slight differences, but they are small and hard to discern.

bricksq %>%
  stl(t.window=30, s.window="periodic", robust=FALSE) %>% seasadj() %>%
  autoplot()

bricksq %>%
  stl(t.window=30, s.window="periodic", robust=TRUE) %>% seasadj() %>%
  autoplot()

g. Compare forecasts from stlf() with those from snaive(), using a test set comprising the last 2 years of data. Which is better? The stlf() forecasts are better, as they track the seasonality of the test-set data more closely.

#hold out the data after 1992 Q1 as a test set
bricksq.1992<-window(bricksq, end=1992)
fcast.stlf <- stlf(bricksq.1992, method='naive')
fcast.snaive <- snaive(bricksq.1992)  #snaive() takes no method argument; default h = 8 quarters

autoplot(fcast.stlf, series="Forecast") + autolayer(bricksq, series="Data") + ylab("Production") +
  ggtitle("Naive forecasts STLF")

autoplot(fcast.snaive, series="Forecast") + autolayer(bricksq, series="Data") + ylab("Production") +
  ggtitle("Naive forecasts STLF")

7. Use stlf() to produce forecasts of the writing series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required. The trend is fairly linear based on the STL decomposition, so rwdrift is the appropriate method and no Box-Cox transformation is needed. The residuals also look acceptable.

autoplot(writing)

writing %>%
  stl(t.window=30, s.window="periodic", robust=TRUE) %>%
  autoplot()

fcast.stlf <- stlf(writing, method='rwdrift')
autoplot(fcast.stlf, series="Forecast") + ylab("Writing") +
  ggtitle("Random Walk forecasts STLF")

checkresiduals(forecast(fcast.stlf,h=8))
## Warning in checkresiduals(forecast(fcast.stlf, h = 8)): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk with drift
## Q* = 52.365, df = 23, p-value = 0.0004466
## 
## Model df: 1.   Total lags used: 24

8. Use stlf() to produce forecasts of the fancy series with either method=“naive” or method=“rwdrift”, whichever is most appropriate. Use the lambda argument if you think a Box-Cox transformation is required. The trend is nonlinear, so a Box-Cox transformation was applied first; method=“rwdrift” then gives a reasonable-looking forecast, and the residuals are acceptable. Note that the forecasts below are on the transformed scale.

fancy.bc<-BoxCox(fancy,BoxCox.lambda(fancy))

autoplot(fancy.bc)

fancy.bc %>%
  stl(t.window=30, s.window="periodic", robust=TRUE) %>%
  autoplot()

fcast.stlf <- stlf(fancy.bc, method='rwdrift')
autoplot(fcast.stlf, series="Forecast") + ylab("Writing") +
  ggtitle("Random Walk forecasts STLF")

checkresiduals(forecast(fcast.stlf,h=8))
## Warning in checkresiduals(forecast(fcast.stlf, h = 8)): The fitted degrees of
## freedom is based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk with drift
## Q* = 53.379, df = 16, p-value = 6.548e-06
## 
## Model df: 1.   Total lags used: 17