Question 1 Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand and aggregate it to daily data, with daily total demand and daily maximum temperature.
A. Plot the data and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?
#a
jan14_vic_elec <- vic_elec %>%
filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
index_by(Date = as_date(Time)) %>%
summarise(
Demand = sum(Demand),
Temperature = max(Temperature)
)
jan14_vic_elec %>%
gather("Variable", "Value", Demand, Temperature) %>%
ggplot(aes(x = Date, y = Value)) +
geom_line() +
facet_grid(vars(Variable), scales = "free_y") +
labs(title = "Electricity Demand") +
theme(plot.title = element_text(hjust = 0.5))
jan14_vic_elec %>%
ggplot(aes(x=Temperature, y=Demand)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(title = "Electricity Demand") +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
There is a positive relationship because as temperature increases, demand for electricity increases due to a higher demand for cooling. Typically this comes from air conditioners, fans, and refrigerators, all of which consume more electricity on hot days.
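Part A also asks for the regression model itself; a minimal sketch that prints the fitted coefficients (using the jan14_vic_elec tsibble built above):
jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature)) %>%
report()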
B. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
# b
fit <- jan14_vic_elec %>% model(TSLM(Demand ~ Temperature))
fit %>% gg_tsresiduals()
Yes, the model is broadly adequate. There are two outliers around 27 January, which also show up in the residual histogram as a spike in the left tail, giving the residual distribution a slight skew.
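To confirm which days those are, a small sketch using fabletools' augment() on the fit above:
# The three largest absolute residuals (expected around 27 January)
augment(fit) %>%
arrange(desc(abs(.resid))) %>%
head(3)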
C. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts? The following R code will get you started:
# c
demand15 <- jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature)) %>%
forecast(
new_data(jan14_vic_elec, 1) %>%
mutate(Temperature = 15)
)
demand15 %>% autoplot(jan14_vic_elec)
demand35 <- jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature)) %>%
forecast(
new_data(jan14_vic_elec, 1) %>%
mutate(Temperature = 35)
)
demand35 %>% autoplot(jan14_vic_elec)
demand35
I believe the 35-degree forecast but not the 15-degree one: the 15-degree forecast looks low compared with all of the historical January lows.
D. Give prediction intervals for your forecasts.
# d
fit <- lm(Demand ~ Temperature, data=jan14_vic_elec)
forecast(fit, newdata=data.frame(Temperature=c(15,35)))
##   Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 1      151398.35 117127.2 185669.5  97951.22 204845.5
## 2      274484.25 241333.0 307635.5 222783.69 326184.8
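The same intervals can be reproduced in the fable workflow used above (a sketch; hilo() reports 80% and 95% intervals by default):
jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature)) %>%
forecast(new_data(jan14_vic_elec, 2) %>% mutate(Temperature = c(15, 35))) %>%
hilo()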
E. Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?
# e
# Aggregate the full half-hourly series to daily before plotting
vic_daily <- vic_elec %>%
index_by(Date = as_date(Time)) %>%
summarise(Demand = sum(Demand), Temperature = max(Temperature))
plot(Demand ~ Temperature, data = vic_daily, main = "Demand vs. Temp (daily)")
Over the full data the relationship is not linear: demand also rises on cold days because of heating, giving a U-shape. A straight-line model fitted to January alone therefore does not generalise to the whole year.
Question 2 The olympic_running data set contains the winning times (in seconds) for each Olympic running event.
A. Plot the winning time against the year for each event. Describe the main features of the plot.
# a
autoplot(olympic_running)
## Plot variable not specified, automatically selected `.vars = Time`
The winning times for nearly all of the events are decreasing over time, meaning athletes are getting faster. There are also gaps in the series where no Games were held during the two World Wars.
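Since autoplot() draws all fourteen series in a single panel, a sketch that separates them by distance:
olympic_running %>%
ggplot(aes(x = Year, y = Time, colour = Sex)) +
geom_line() +
facet_wrap(vars(Length), scales = "free_y")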
B. Fit a regression line to the data for each event. Obviously the winning times have been decreasing, but at what average rate per year?
# b
# Pooled regression across all events
olympic_lm <- lm(Time ~ ., data = olympic_running)
olympic_lm
##
## Call:
## lm(formula = Time ~ ., data = olympic_running)
##
## Coefficients:
## (Intercept) Year Length Sexwomen
## 734.9863 -0.3905 0.1766 32.9732
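The pooled model says winning times fall by about 0.39 seconds per year on average across events, but the question asks for a per-event rate. A sketch of per-event slopes in the fable framework:
# One TSLM per Length/Sex combination; the Year coefficient is that
# event's average change in winning time (seconds) per year
olympic_running %>%
model(TSLM(Time ~ Year)) %>%
tidy() %>%
filter(term == "Year")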
C. Plot the residuals against the year. What does this indicate about the suitability of the fitted lines?
# c
checkresiduals(olympic_lm)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 179.8, df = 10, p-value < 2.2e-16
The near-zero p-value indicates strong serial correlation in the residuals, so the fitted straight lines do not capture all of the structure in the winning times.
D. Predict the winning time for each race in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
# d
oly_men_100 <- olympic_running %>%
filter(Sex == "men", Length == 100)
oly_men_100 %>%
model(TSLM(Time ~ Year)) %>%
forecast(
new_data(oly_men_100, 1) %>%
mutate(Year = 2020)
) %>%
autoplot(oly_men_100) +
labs(title = "Forecasted Winning Race Time") +
theme(plot.title = element_text(hjust = 0.5))
This forecast is plausible: it follows the trend of decreasing race times, and even the slight uptick in the last few Games is captured within the upper tail of the forecast interval. The calculations assume that the historical linear trend continues to 2020 and that the residuals are uncorrelated and approximately normally distributed.
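A sketch of extending this to all events at once, with 95% intervals (one TSLM per event, as in part B):
olympic_running %>%
model(TSLM(Time ~ Year)) %>%
forecast(new_data(olympic_running, 1)) %>%
hilo(level = 95)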
Question 3 Elasticity: for the log-log model log y = B0 + B1 log x + e, show that the coefficient B1 is the elasticity (dy/dx)(x/y).
Exponentiating both sides expresses y as a function of x: y = e^B0 * x^B1 * e^e. Differentiating log y with respect to x gives (1/y)(dy/dx) = B1/x, and multiplying through by x yields (dy/dx)(x/y) = B1, so B1 is the elasticity coefficient.
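The same derivation written as equations:
\[
\log y = \beta_0 + \beta_1 \log x + \varepsilon
\quad\Longrightarrow\quad
y = e^{\beta_0}\, x^{\beta_1}\, e^{\varepsilon},
\]
\[
\frac{d}{dx}\log y = \frac{1}{y}\frac{dy}{dx} = \frac{\beta_1}{x}
\quad\Longrightarrow\quad
\frac{dy}{dx}\cdot\frac{x}{y} = \beta_1 .
\]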
Question 4 A. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
# a
souvenirs %>%
autoplot(Sales) +
labs(title = "Monthly Souvenir Sales") +
theme(plot.title = element_text(hjust = 0.5))
There is a massive spike in sales every December. Overall, there is a slight upward trend in sales across the series. Sales dipped around 1991, when Australia was in recession.
B. Explain why it is necessary to take logarithms of these data before fitting a model.
souv_num <- souvenirs
souv_num$Sales <- as.numeric(souvenirs$Sales)
hist(souv_num$Sales)
# The December spikes skew the sales distribution heavily to the right,
# away from a normal distribution.
souv_num_log <- souv_num
souv_num_log$Sales <- log(souv_num$Sales)
hist(souv_num_log$Sales)
Taking the log transforms the data to be closer to normally distributed, and it stabilises the seasonal variation, which otherwise grows with the level of the series.
C. Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
# Monthly ts of sales from Jan 1987, truncated at Jun 1993, then logged
souv <- ts(souvenirs$Sales, start=c(1987,1), end=c(1993,6), frequency=12)
souv_log <- log(souv)
# Surfing festival dummy: 1 in March of each year, except March 1987
# (the festival only started in 1988)
dummy.fest <- rep(0, length(souv))
dummy.fest[seq_along(dummy.fest) %% 12 == 3] <- 1
dummy.fest[3] <- 0
dummy.fest <- ts(dummy.fest, freq=12, start=c(1987,1))
new.data <- data.frame(souv_log, dummy.fest)
fit <- tslm(souv_log ~ trend + season + dummy.fest, data=new.data)
# Festival dummy over the 12 forecast months; the horizon starts in
# July 1993, so row 3 here corresponds to September 1993
forecast1 <- data.frame(dummy.fest = rep(0, 12))
forecast1[3,] <- 1
forecast(fit, newdata=forecast1)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jul 1993 9.883136 9.637329 10.128943 9.503919 10.262353
## Aug 1993 9.867772 9.621965 10.113579 9.488555 10.246989
## Sep 1993 10.531942 10.191069 10.872815 10.006062 11.057822
## Oct 1993 10.092605 9.846798 10.338412 9.713388 10.471822
## Nov 1993 10.585189 10.339382 10.830995 10.205971 10.964406
## Dec 1993 11.357556 11.111749 11.603363 10.978339 11.736773
## Jan 1994 9.430933 9.186093 9.675774 9.053207 9.808659
## Feb 1994 9.704370 9.459530 9.949210 9.326644 10.082096
## Mar 1994 9.695742 9.365756 10.025727 9.186658 10.204825
## Apr 1994 9.881046 9.636206 10.125887 9.503320 10.258772
## May 1994 9.928500 9.683659 10.173340 9.550774 10.306226
## Jun 1994 9.989861 9.745020 10.234701 9.612135 10.367587
D. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
# Residuals against time, and against the fitted values
plot(fit$residuals, main = "Residuals vs. time")
plot(as.numeric(fitted(fit)), as.numeric(residuals(fit)),
xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted")
The residuals still show some pattern over time rather than pure noise, suggesting the model has not captured all of the structure in the series.
E. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
boxplot(residuals(fit)~cycle(residuals(fit)))
This shows non-constant variability in the residuals across months: the boxplot for month 6 is much tighter than those for months 3 or 10. This suggests the homoscedasticity assumption is violated.
F. What do the values of the coefficients tell you about each variable?
fit$coefficients
## (Intercept) trend season2 season3 season4 season5
## 7.6662400 0.0207611 0.2526756 0.2232860 0.3878297 0.4145219
## season6 season7 season8 season9 season10 season11
## 0.4551219 0.5767690 0.5406444 0.6296713 0.7239552 1.1957774
## season12 dummy.fest
## 1.9473841 0.5543818
These are regression coefficients on the log scale, not correlations. Each season coefficient is the expected difference in log sales relative to January, so seasons 11 and 12 (November and December) have the largest impact on souvenir sales. The trend coefficient implies sales grow roughly 2.1% per month, and the festival dummy of 0.554 corresponds to about e^0.554 - 1 ≈ 74% higher sales in festival months.
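Since the response is log sales, each coefficient maps to a percentage effect on sales; a quick sketch:
# Approximate % change in sales associated with each term: 100*(e^coef - 1)
round(100 * (exp(fit$coefficients) - 1), 1)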
G. What does the Ljung-Box test tell you about your model?
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 31.535, df = 17, p-value = 0.01718
checkresiduals() reports a Breusch-Godfrey test for this regression. Its p-value of 0.017 is significant at the 5% level, which means there is evidence of remaining autocorrelation in the residuals: the model has not captured all of the dynamics in the series.
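The question names the Ljung-Box test specifically; it can be run directly with Box.test() (the lag of 24, two seasonal cycles, is my assumption):
Box.test(residuals(fit), lag = 24, fitdf = length(coef(fit)), type = "Ljung-Box")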
H. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
forecast2 <- data.frame(dummy.fest = rep(0, 36)) # 36 months for the 3 years 1994-1996; festival dummy left at 0 throughout
predictions = forecast(fit, newdata=forecast2)
predictions
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jul 1993 9.883136 9.637329 10.128943 9.503919 10.262353
## Aug 1993 9.867772 9.621965 10.113579 9.488555 10.246989
## Sep 1993 9.977560 9.731753 10.223367 9.598343 10.356777
## Oct 1993 10.092605 9.846798 10.338412 9.713388 10.471822
## Nov 1993 10.585189 10.339382 10.830995 10.205971 10.964406
## Dec 1993 11.357556 11.111749 11.603363 10.978339 11.736773
## Jan 1994 9.430933 9.186093 9.675774 9.053207 9.808659
## Feb 1994 9.704370 9.459530 9.949210 9.326644 10.082096
## Mar 1994 9.695742 9.365756 10.025727 9.186658 10.204825
## Apr 1994 9.881046 9.636206 10.125887 9.503320 10.258772
## May 1994 9.928500 9.683659 10.173340 9.550774 10.306226
## Jun 1994 9.989861 9.745020 10.234701 9.612135 10.367587
## Jul 1994 10.132269 9.883394 10.381144 9.748319 10.516219
## Aug 1994 10.116905 9.868031 10.365780 9.732955 10.500856
## Sep 1994 10.226693 9.977819 10.475568 9.842743 10.610644
## Oct 1994 10.341738 10.092864 10.590613 9.957788 10.725689
## Nov 1994 10.834322 10.585447 11.083197 10.450372 11.218272
## Dec 1994 11.606690 11.357815 11.855564 11.222739 11.990640
## Jan 1995 9.680067 9.431764 9.928369 9.296999 10.063134
## Feb 1995 9.953503 9.705201 10.201806 9.570436 10.336570
## Mar 1995 9.944875 9.610605 10.279144 9.429182 10.460567
## Apr 1995 10.130180 9.881877 10.378482 9.747112 10.513247
## May 1995 10.177633 9.929330 10.425935 9.794566 10.560700
## Jun 1995 10.238994 9.990691 10.487296 9.855927 10.622061
## Jul 1995 10.381402 10.128745 10.634059 9.991617 10.771188
## Aug 1995 10.366039 10.113381 10.618696 9.976253 10.755824
## Sep 1995 10.475827 10.223169 10.728484 10.086041 10.865612
## Oct 1995 10.590872 10.338214 10.843529 10.201086 10.980657
## Nov 1995 11.083455 10.830798 11.336112 10.693669 11.473240
## Dec 1995 11.855823 11.603165 12.108480 11.466037 12.245608
## Jan 1996 9.929200 9.676730 10.181669 9.539704 10.318696
## Feb 1996 10.202636 9.950167 10.455106 9.813141 10.592132
## Mar 1996 10.194008 9.854949 10.533067 9.670926 10.717090
## Apr 1996 10.379313 10.126843 10.631782 9.989817 10.768809
## May 1996 10.426766 10.174296 10.679236 10.037270 10.816262
## Jun 1996 10.488127 10.235658 10.740597 10.098631 10.877623
I. How could you improve these predictions by modifying the model? I think a Box-Cox transformation could reduce the heteroscedasticity and smooth out the variance of the residuals.
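A sketch of that idea (BoxCox.lambda() and BoxCox() are from the forecast package; this reuses souv and dummy.fest from part C):
lambda <- BoxCox.lambda(souv) # automatically chosen transformation parameter
fit_bc <- tslm(BoxCox(souv, lambda) ~ trend + season + dummy.fest)
checkresiduals(fit_bc)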
Question 5 A. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.
# a
# Convert the weekly tsibble to a ts object; keep data up to the start
# of 2005 for fitting
gasoline_2004 <- ts(us_gasoline$Barrels, start = 1991, end = 2017, frequency = 52)
gas04 <- window(gasoline_2004, end = 2005)
for(num in c(1, 5, 10, 20)){
var_name <- paste("fit",
as.character(num), "_gasoline_2004" ,
sep = "")
assign(var_name,
tslm(gas04 ~ trend + fourier(
gas04,
K = num
))
)
print(
autoplot(gas04) +
autolayer(get(var_name)$fitted.values,
series = as.character(num)) +
ggtitle(var_name) +
ylab("Barrels") +
guides(colour = guide_legend(title = "Fourier Transform pairs"))
)
}
autoplot(gasoline_2004) +
autolayer(fit1_gasoline_2004$fitted.values, series = "1") +
autolayer(fit5_gasoline_2004$fitted.values, series = "5") +
autolayer(fit10_gasoline_2004$fitted.values, series = "10") +
autolayer(fit20_gasoline_2004$fitted.values, series = "20") +
guides(colour = guide_legend(title = "Fourier Transform pairs")) +
scale_color_discrete(breaks = c(1, 5, 10, 20))
With K = 1 the fitted values trace only a smooth annual cycle; increasing K adds finer seasonal detail, with little visible change beyond K = 10.
B. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
for(i in c(1, 5, 10, 20)){
fit_gasoline_2004.name <- paste(
"fit", as.character(i), "_gasoline_2004",
sep = ""
)
writeLines(
paste(
"\n", fit_gasoline_2004.name, "\n"
)
)
print(CV(get(fit_gasoline_2004.name)))
}
##
## fit1_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 8.223111e-02 -1.818916e+03 -1.818833e+03 -1.795957e+03 8.245684e-01
##
## fit5_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 7.635433e-02 -1.873126e+03 -1.872617e+03 -1.813434e+03 8.388985e-01
##
## fit10_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 7.432523e-02 -1.893252e+03 -1.891686e+03 -1.787644e+03 8.453698e-01
##
## fit20_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 0.0763989 -1875.0247805 -1869.5006929 -1677.5828100 0.8455506
Of the values tried, K = 10 gives the smallest CV and AICc, so it would be the selected number of Fourier pairs.
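Rather than checking only four values of K, one could scan them all; a sketch (assumes the gas04 training series above; refitting 26 models is slow):
aicc_by_K <- sapply(1:26, function(K) {
CV(tslm(gas04 ~ trend + fourier(gas04, K = K)))["AICc"]
})
which.min(aicc_by_K) # K with the smallest AICc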
C. Plot the residuals of the final model using the gg_tsresiduals() function and comment on these. Use a Ljung-Box test to check for residual autocorrelation.
checkresiduals(fit20_gasoline_2004)
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 127.58, df = 104, p-value = 0.05812
With p = 0.058 we narrowly fail to reject the null of no serial correlation at the 5% level, so the residuals are reasonably close to white noise.
D. Generate forecasts for the next year of data and plot these along with the actual data for 2005. Comment on the forecasts.
fc_gasoline_2005 <- forecast(
fit20_gasoline_2004,
newdata=data.frame(fourier(
gasoline_2004, K = 20, h = 52)
)
)
autoplot(fc_gasoline_2005) +
autolayer(window(
gasoline_2004,
start = 2004,
end = 2006
)
) +
scale_x_continuous(limits = c(2004, 2006)) +
guides(colour = guide_legend(title = "Fourier Transform pairs"))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 676 row(s) containing missing values (geom_path).
The forecasts seem generally accurate. One exception is the sharp drop near the end of 2005, but even that stayed within the forecasts' lower bounds, so it was still captured by the prediction intervals.
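A numeric check against the held-out 2005 data (a sketch using forecast::accuracy(), which compares the forecasts with the full series):
accuracy(fc_gasoline_2005, gasoline_2004)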