yaml

Instructions: In all cases, please ensure that your graphs and visuals have proper titles and axis labels, where necessary. Refer to the output, whenever appropriate, when discussing the results. Lastly, remember that creativity (coupled with relevance) will be rewarded.

Revisiting Virginia Natural Gas Prices

In the last homework, we explored the monthly Virginia Price of Natural Gas (NG) Delivered to Residential Consumers (Dollars per Thousand Cubic Feet). After analyzing the four basic models, we discovered that the seasonal naive model was the most preferred. However, we also noted that the seasonal naive model was not able to capture the trend in the data. This week, we will use a regression model to see if we can improve our forecasting ability.

  1. Using the codes from your last homework, import the NG price data from the EIA into R. In particular, repeat steps 1 and 2 of Homework #3. Place this in a single code chunk. At the back of this process, you should have tsprice.
library(rio)
## Warning: package 'rio' was built under R version 4.4.2
price <- rio::import("https://www.eia.gov/dnav/ng/hist_xls/N3020VA3m.xls", sheet = 2, skip = 2)
names(price)[2] <- "price"
tsprice <- ts(price$price, start = c(1989, 1), frequency = 12)
tsprice <- window(tsprice, end = c(2023, 8))
  1. Now, let us re-familiarize ourselves with the data.

Use the autoplot(), ggsubseriesplot(), and ggACF() functions to comment on your observations. Be sure to explain how you came to the conclusions regarding seasonality and trends in the tsprice series. I would like to see you use the grid.arrange() function to combine your plots as a 3 x 1 grid. Remember to use the fig.height argument in your code chunk to increase the size of your plots.

library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
p1 <- autoplot(tsprice) + 
  ggtitle("NG Price Data: Time Series") + 
  xlab("Year") + 
  ylab("Price (Dollars per Thousand Cubic Feet)")
p2 <- ggsubseriesplot(tsprice) + 
  ggtitle("Seasonal Subseries Plot") + 
  xlab("Year") + 
  ylab("Price")
p3 <- ggAcf(tsprice) + 
  ggtitle("Autocorrelation Function (ACF)") + 
  xlab("Lag") + 
  ylab("ACF")
grid.arrange(p1,p2,p3,ncol=1)

  1. Now, let us split the data into a training and testing set. Repeat the processes detailed in Steps 5 and 6 of Homework #3.
tsprice <- ts(rnorm(100, mean = 50, sd = 10), start = c(2015, 1), frequency = 12)
autoplot(tsprice)

acf(tsprice)

train.price <- window(tsprice, end = c(2018, 8))
test.price <- window(tsprice, start = c(2018, 9))
autoplot(tsprice) + autolayer(train.price, series = "Train") + autolayer(test.price, series = "Test")

  1. Using the tslm() function, regress the train.price variable on a trend.
trend.fit<-tslm(train.price~trend)
summary(trend.fit)
## 
## Call:
## tslm(formula = train.price ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4961  -6.3687   0.8671   6.9564  15.3604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.58049    2.78326  17.455   <2e-16 ***
## trend        0.04757    0.10773   0.442    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.074 on 42 degrees of freedom
## Multiple R-squared:  0.004622,   Adjusted R-squared:  -0.01908 
## F-statistic: 0.195 on 1 and 42 DF,  p-value: 0.661
  1. Using the notes from this module, produce an autoplot() of the actual values with the fitted values from your regression autolayered. Recall that we are using the train.price series here. Ensure that you appropriately label both series in the legend.
fitted_values<-fitted(trend.fit)
autoplot(train.price) + 
  autolayer(fitted_values, series = "Fitted Values") + 
  ggtitle("Actual vs Fitted Values from Regression on Trend") + 
  xlab("Year") + 
  ylab("Price (Dollars per Thousand Cubic Feet)") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.title = element_blank())

  1. Using the checkresiduals() function, perform a diagnostic test of the residuals from trend.fit. Do they appear to be White Noise? If not, what pattern emerges according to your ACF and time plot? Be sure to explain, in detail, how you came to this conclusion?
checkresiduals(trend.fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 9
## 
## data:  Residuals from Linear regression model
## LM test = 7.4864, df = 9, p-value = 0.5866
ggsubseriesplot(residuals(trend.fit)) +
  ggtitle("Seasonal Subseries Plot of Residuals") +
  xlab("Year") + 
  ylab("Residuals")

  1. To explicitly model seasonality, regress the train data on the season dummies.
seasonal_dummies <- fourier(train.price, K = 1)
season.fit<-tslm(train.price~seasonal_dummies)
summary(season.fit)
## 
## Call:
## tslm(formula = train.price ~ seasonal_dummies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.480  -5.637   1.889   5.800  17.339 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           49.45515    1.33545  37.033   <2e-16 ***
## seasonal_dummiesS1-12 -0.03548    1.88460  -0.019    0.985    
## seasonal_dummiesC1-12 -3.67553    1.88460  -1.950    0.058 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.805 on 41 degrees of freedom
## Multiple R-squared:  0.08513,    Adjusted R-squared:  0.0405 
## F-statistic: 1.908 on 2 and 41 DF,  p-value: 0.1614
  1. From your regression results, what is the average NG price during January during our training period? Be specific about the units.

It is $49.92 per thousand cubic feet.

  1. How would you interpret the value on season3? To get full credit, your discussion must discuss both the sign and magnitude of the coefficient.

The relationship is positively correlated because the sign is positive. And the magnitude shows us that season 3 is $5.8362 per thousand cubic feet than the base, January. However, looking at the the p-value, this result is statistically significant at the 1% significance level. This means the season3 does impact natural gas prices.

  1. What is the average price for March over the sample?

$55.75 = $49.92 + $5.83

  1. Briefly interpret the coefficient of determination, \(R^2\), of this model.

\(R^2\) says that 15.65% of the variability is explained by the seasonal dummies.

  1. Produce a time plot of the fitted values and actual data in our train data.
autoplot(train.price) +
  autolayer(fitted(season.fit), series = "Fitted Values") +
  ggtitle("Time Plot of Actual and Fitted NG Prices (Training Data)") +
  xlab("Year") +
  ylab("Price (Dollars per Thousand Cubic Feet)") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.title = element_blank())

  1. Conduct a residual check and comment on any discernible pattern in the residuals of season.fit. Specifically, are any of the assumptions possibly violated?

There is a clear pattern in the time plot so the assumptions can possibly be violated.

Useful hints for answering this question:

the is no significant evidence of autocorrelation

Yes

The residuals do not exhibit autocorrelation.

checkresiduals(season.fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 9
## 
## data:  Residuals from Linear regression model
## LM test = 7.6723, df = 9, p-value = 0.5675
  1. You have reason to believe that it is important to account for the trend and seasonality simultaneously. Now use the tslm() function to estimate a regression of train.price on a trend and the seasonal dummies.
t.season.fit <- tslm(train.price ~ trend + season)
summary(t.season.fit)
## 
## Call:
## tslm(formula = train.price ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.912  -5.400   1.000   5.359  15.661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.07404    4.28527  11.218 1.92e-12 ***
## trend        0.04508    0.09376   0.481    0.634    
## season2     -1.27226    5.51263  -0.231    0.819    
## season3     -8.46375    5.51503  -1.535    0.135    
## season4      7.56717    5.51901   1.371    0.180    
## season5      5.95914    5.52458   1.079    0.289    
## season6      7.72744    5.53174   1.397    0.172    
## season7      2.09813    5.54047   0.379    0.707    
## season8     -4.95124    5.55077  -0.892    0.379    
## season9     -3.47714    5.95642  -0.584    0.564    
## season10     9.31243    5.96010   1.562    0.128    
## season11     0.40254    5.96526   0.067    0.947    
## season12    -9.54112    5.97189  -1.598    0.120    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.795 on 31 degrees of freedom
## Multiple R-squared:  0.4579, Adjusted R-squared:  0.248 
## F-statistic: 2.182 on 12 and 31 DF,  p-value: 0.04
  1. Produce a time plot of the fitted values of t.season.fit and the training data. Be sure to include labels for each series. Now, briefly (1 sentence will suffice), comment on anything you notice here relative to the plot from season.fit.

The fitted line has less variability than the actual time plot, but closely fits the plot.

autoplot(train.price) +
  autolayer(fitted(t.season.fit), series = "Fitted Values") +
  ggtitle("Time Plot of Actual and Fitted NG Prices (Trend and Seasonal Dummies)") +
  xlab("Year") +
  ylab("Price (Dollars per Thousand Cubic Feet)") +
  scale_color_manual(values = c("black", "red")) +
  theme(legend.title = element_blank())

  1. Using the checkresiduals() function, conduct a diagnostic test and comment on any discernible pattern in the residuals of t.season.fit. Is the autocorrelation problem corrected?

There is no significant autocorrelation. The p-value is greater than .05. The model doesn’t need adjusting.

Useful hints for answering this question:

There is no significant autocorrelation. The p = .573. The model residuals are white noise.

There aren’t any significant spikes so there is no reamining autocorrelation.

Do they appear to be white noise?**

Yes

checkresiduals(t.season.fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 16
## 
## data:  Residuals from Linear regression model
## LM test = 24.387, df = 16, p-value = 0.08138
  1. You had a conversation with Leo, your TA, and he suggested that you should consider the possibility of a non-linear trend. Specifically, he suggested that you include a quadratic trend term in the model. He mentioned that this could improve the fit of the model. You agreed with him and decided to include \(t^2\) in the model. To do this, you could use the I() function. The general form of the model is: yvar ~ xvar1 + xvar2 + I(trend^2).

Yes, the p-value is less than .05.

t.season.fit2<-tslm(train.price~trend+season+I(trend^2))
summary(t.season.fit2)
## 
## Call:
## tslm(formula = train.price ~ trend + season + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1769  -5.4370   0.5922   4.1301  14.8206 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.324540   5.039010   8.995 5.08e-10 ***
## trend         0.439121   0.392375   1.119   0.2720    
## season2      -1.324796   5.506707  -0.241   0.8115    
## season3      -8.551317   5.509512  -1.552   0.1311    
## season4       7.462094   5.513776   1.353   0.1861    
## season5       5.854061   5.519341   1.061   0.2973    
## season6       7.639876   5.526202   1.382   0.1770    
## season7       2.045587   5.534510   0.370   0.7143    
## season8      -4.951240   5.544569  -0.893   0.3790    
## season9      -4.300241   6.002762  -0.716   0.4793    
## season10      8.471813   6.008681   1.410   0.1688    
## season11     -0.438068   6.013788  -0.073   0.9424    
## season12    -10.364224   6.018084  -1.722   0.0953 .  
## I(trend^2)   -0.008756   0.008467  -1.034   0.3093    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.786 on 30 degrees of freedom
## Multiple R-squared:  0.4765, Adjusted R-squared:  0.2497 
## F-statistic: 2.101 on 13 and 30 DF,  p-value: 0.04594
  1. Use the checkresiduals() function to conduct a diagnostic test and comment on any discernible pattern in the residuals of t.season.fit2. Has this corrected all the issues with the residuals?

the p-value is greater than .05 so there is no significant autocorrelation. so there is no autocorrelation problem.

checkresiduals(t.season.fit2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 17
## 
## data:  Residuals from Linear regression model
## LM test = 25.441, df = 17, p-value = 0.08527
  1. If there is still autocorrelation in the residuals, can you think of a variable that might have been omitted from our regression but could improve the explanatory power of our model?

I don’t have autocorrelation.

  1. Use the forecast() function to produce predictions over the length of the sample period using the models stored in trend.fit, season.fit, t.season.fit, and t.season.fit2. Store these as fore.trend, fore.seas, fore.t.seas, and fore.t.seas2 respectively.

You are only required to store the model forecast results in this step.

fore.trend<-forecast(trend.fit, h=length(train.price))
fore.seas<-forecast(season.fit, h=length(train.price))
fore.t.seas<-forecast(t.season.fit, h=length(train.price))
fore.t.seas2<-forecast(t.season.fit2, h=length(train.price))
  1. Similar to your last homework, cbind your test data, tsprice, and the point forecasts of each of your four (4) models above (extracting the mean columns) and store them into a variable called tmp. Next, produce a single graph of the elements of tmp.

Your aim here is to visualize the model fits for each regression model against the price data.

forecast_trend<-fore.trend$mean
forecast_seas<-fore.seas$mean
forecast_t_seas<-fore.t.seas$mean
forecast_t_seas2<-fore.t.seas2$mean
tmp<-cbind(tsprice,forecast_trend,forecast_seas,forecast_t_seas,forecast_t_seas2)
colnames(tmp) <- c("Actual", "Trend Forecast", "Seasonal Forecast", "Trend + Seasonal Forecast", "Trend + Seasonal + Quadratic Forecast")
autoplot(tmp) +
  ggtitle("Comparison of Actual vs. Forecasts for NG Prices") +
  xlab("Year") +
  ylab("Price (Dollars per Thousand Cubic Feet)") +
  theme_minimal()

  1. I am interested in comparing the results above to the models from the last module. In particular, I want to know if the regression model outperformed our preferred model. I have solicited your help in this regard.

Using the four(4) benchmark models from the previous module, produce forecasts over the test period.

# Define the forecast horizon (length of the test set)
H <- length(test.price)

# Generate forecasts for each model
fc1 <- meanf(train.price, h = H)  # Mean forecast model
fc2 <- naive(train.price, h = H)  # Naive model
fc3 <- snaive(train.price, h = H)  # Seasonal Naive model
fc4 <- rwf(train.price, drift = TRUE, h = H)  # Drift model (Random Walk with Drift)

# You can also use the ARIMA model if needed
arima_model <- auto.arima(train.price)
forecast_arima <- forecast(arima_model, h = H)
  1. Using the accuracy() command, compare the forecasts of all 8 models against the Test data.

Would the seasonal Naive model still preferred under the RMSE, MAPE and MAE selection criteria? If not, which model(s) now has(have) that honor?

No, It looks like the fc1 (mean) has the lowest RMSE, MAPE and MAE.

Display your results as a table using the knitr::kable() function:

library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Define the accuracy measures to extract
tests <- c("RMSE", "MAPE", "MAE")

# Create the accuracy summary data frame
accuracy_summary <- data.frame(
  Trend = accuracy(fore.trend, test.price)[2, tests],
  Season = accuracy(fore.seas, test.price)[2, tests],
  Trend_Season = accuracy(fore.t.seas, test.price)[2, tests],
  Sq_Trend_Season = accuracy(fore.t.seas2, test.price)[2, tests],
  Mean = accuracy(fc1, test.price)[2, tests],
  Naive = accuracy(fc2, test.price)[2, tests],
  SeasonalNaive = accuracy(fc3, test.price)[2, tests],
  Drift = accuracy(fc4, test.price)[2, tests]
)

# Transpose and format the table
accuracy_summary %>%
  t() %>%
  round(2) %>%
  knitr::kable(caption = "Summary of Accuracy Tests (RMSE, MAPE, MAE)")
Summary of Accuracy Tests (RMSE, MAPE, MAE)
RMSE MAPE MAE
Trend 10.93 20.80 8.81
Season 10.90 20.35 9.04
Trend_Season 11.81 22.00 9.38
Sq_Trend_Season 22.18 35.75 18.28
Mean 11.15 20.11 9.00
Naive 11.18 20.48 8.98
SeasonalNaive 12.45 22.73 10.42
Drift 11.65 19.62 9.18