yaml
Instructions: In all cases, please ensure that your graphs and visuals have proper titles and axis labels, where necessary. Refer to the output, whenever appropriate, when discussing the results. Lastly, remember that creativity (coupled with relevance) will be rewarded.
In the last homework, we explored the monthly Virginia Price of Natural Gas (NG) Delivered to Residential Consumers (Dollars per Thousand Cubic Feet). After analyzing the four basic models, we discovered that the seasonal naive model was the most preferred. However, we also noted that the seasonal naive model was not able to capture the trend in the data. This week, we will use a regression model to see if we can improve our forecasting ability.
R
. In particular, repeat steps 1 and 2 of
Homework #3. Place this in a single code chunk. At the back of
this process, you should have tsprice
.library(rio)
## Warning: package 'rio' was built under R version 4.4.2
price <- rio::import("https://www.eia.gov/dnav/ng/hist_xls/N3020VA3m.xls", sheet = 2, skip = 2)
names(price)[2] <- "price"
tsprice <- ts(price$price, start = c(1989, 1), frequency = 12)
tsprice <- window(tsprice, end = c(2023, 8))
Use the autoplot()
, ggsubseriesplot()
, and
ggACF()
functions to comment on your observations. Be sure
to explain how you came to the conclusions regarding seasonality and
trends in the tsprice
series. I would like to see
you use the grid.arrange()
function to combine your plots
as a 3 x 1 grid. Remember to use the fig.height
argument in
your code chunk to increase the size of your plots.
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
p1 <- autoplot(tsprice) +
ggtitle("NG Price Data: Time Series") +
xlab("Year") +
ylab("Price (Dollars per Thousand Cubic Feet)")
p2 <- ggsubseriesplot(tsprice) +
ggtitle("Seasonal Subseries Plot") +
xlab("Year") +
ylab("Price")
p3 <- ggAcf(tsprice) +
ggtitle("Autocorrelation Function (ACF)") +
xlab("Lag") +
ylab("ACF")
grid.arrange(p1,p2,p3,ncol=1)
tsprice <- ts(rnorm(100, mean = 50, sd = 10), start = c(2015, 1), frequency = 12)
autoplot(tsprice)
acf(tsprice)
train.price <- window(tsprice, end = c(2018, 8))
test.price <- window(tsprice, start = c(2018, 9))
autoplot(tsprice) + autolayer(train.price, series = "Train") + autolayer(test.price, series = "Test")
tslm()
function, regress the
train.price
variable on a trend
.trend.fit
.trend
.trend.fit<-tslm(train.price~trend)
summary(trend.fit)
##
## Call:
## tslm(formula = train.price ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4961 -6.3687 0.8671 6.9564 15.3604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.58049 2.78326 17.455 <2e-16 ***
## trend 0.04757 0.10773 0.442 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.074 on 42 degrees of freedom
## Multiple R-squared: 0.004622, Adjusted R-squared: -0.01908
## F-statistic: 0.195 on 1 and 42 DF, p-value: 0.661
autoplot()
of the actual values with the fitted values from your regression
autolayer
ed. Recall that we are using the
train.price
series here. Ensure that you appropriately
label both series in the legend.fitted_values<-fitted(trend.fit)
autoplot(train.price) +
autolayer(fitted_values, series = "Fitted Values") +
ggtitle("Actual vs Fitted Values from Regression on Trend") +
xlab("Year") +
ylab("Price (Dollars per Thousand Cubic Feet)") +
scale_color_manual(values = c("black", "red")) +
theme(legend.title = element_blank())
checkresiduals()
function, perform a
diagnostic test of the residuals from trend.fit
. Do they
appear to be White Noise? If not, what pattern emerges according to your
ACF
and time plot
? Be sure to explain,
in detail, how you came to this conclusion?ggsubseriesplot()
of the residuals. Doing so will help you to strengthen your conclusions
regarding the presence of seasonality in the residuals.checkresiduals(trend.fit)
##
## Breusch-Godfrey test for serial correlation of order up to 9
##
## data: Residuals from Linear regression model
## LM test = 7.4864, df = 9, p-value = 0.5866
ggsubseriesplot(residuals(trend.fit)) +
ggtitle("Seasonal Subseries Plot of Residuals") +
xlab("Year") +
ylab("Residuals")
season
dummies.season.fit
.summary()
function to report the model
results.seasonal_dummies <- fourier(train.price, K = 1)
season.fit<-tslm(train.price~seasonal_dummies)
summary(season.fit)
##
## Call:
## tslm(formula = train.price ~ seasonal_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.480 -5.637 1.889 5.800 17.339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.45515 1.33545 37.033 <2e-16 ***
## seasonal_dummiesS1-12 -0.03548 1.88460 -0.019 0.985
## seasonal_dummiesC1-12 -3.67553 1.88460 -1.950 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.805 on 41 degrees of freedom
## Multiple R-squared: 0.08513, Adjusted R-squared: 0.0405
## F-statistic: 1.908 on 2 and 41 DF, p-value: 0.1614
It is $49.92 per thousand cubic feet.
season3
? To get full credit, your discussion must
discuss both the sign and magnitude of the coefficient.The relationship is positively correlated because the sign is positive. And the magnitude shows us that season 3 is $5.8362 per thousand cubic feet than the base, January. However, looking at the the p-value, this result is statistically significant at the 1% significance level. This means the season3 does impact natural gas prices.
$55.75 = $49.92 + $5.83
\(R^2\) says that 15.65% of the variability is explained by the seasonal dummies.
autoplot(train.price) +
autolayer(fitted(season.fit), series = "Fitted Values") +
ggtitle("Time Plot of Actual and Fitted NG Prices (Training Data)") +
xlab("Year") +
ylab("Price (Dollars per Thousand Cubic Feet)") +
scale_color_manual(values = c("black", "red")) +
theme(legend.title = element_blank())
season.fit
. Specifically, are any of the
assumptions possibly violated?There is a clear pattern in the time plot so the assumptions can possibly be violated.
Useful hints for answering this question:
the is no significant evidence of autocorrelation
autoplot
of the residuals exhibit any
noticeable pattern?Yes
season.fit
?The residuals do not exhibit autocorrelation.
checkresiduals(season.fit)
##
## Breusch-Godfrey test for serial correlation of order up to 9
##
## data: Residuals from Linear regression model
## LM test = 7.6723, df = 9, p-value = 0.5675
tslm()
function to estimate a regression of train.price
on a
trend and the seasonal dummies.t.season.fit
.t.season.fit
.t.season.fit <- tslm(train.price ~ trend + season)
summary(t.season.fit)
##
## Call:
## tslm(formula = train.price ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.912 -5.400 1.000 5.359 15.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.07404 4.28527 11.218 1.92e-12 ***
## trend 0.04508 0.09376 0.481 0.634
## season2 -1.27226 5.51263 -0.231 0.819
## season3 -8.46375 5.51503 -1.535 0.135
## season4 7.56717 5.51901 1.371 0.180
## season5 5.95914 5.52458 1.079 0.289
## season6 7.72744 5.53174 1.397 0.172
## season7 2.09813 5.54047 0.379 0.707
## season8 -4.95124 5.55077 -0.892 0.379
## season9 -3.47714 5.95642 -0.584 0.564
## season10 9.31243 5.96010 1.562 0.128
## season11 0.40254 5.96526 0.067 0.947
## season12 -9.54112 5.97189 -1.598 0.120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.795 on 31 degrees of freedom
## Multiple R-squared: 0.4579, Adjusted R-squared: 0.248
## F-statistic: 2.182 on 12 and 31 DF, p-value: 0.04
t.season.fit
and the training data. Be sure to include
labels for each series. Now, briefly (1 sentence will suffice),
comment on anything you notice here relative to the plot from
season.fit
.The fitted line has less variability than the actual time plot, but closely fits the plot.
autoplot(train.price) +
autolayer(fitted(t.season.fit), series = "Fitted Values") +
ggtitle("Time Plot of Actual and Fitted NG Prices (Trend and Seasonal Dummies)") +
xlab("Year") +
ylab("Price (Dollars per Thousand Cubic Feet)") +
scale_color_manual(values = c("black", "red")) +
theme(legend.title = element_blank())
checkresiduals()
function, conduct a
diagnostic test and comment on any discernible pattern in the residuals
of t.season.fit
. Is the autocorrelation problem
corrected?There is no significant autocorrelation. The p-value is greater than .05. The model doesn’t need adjusting.
Useful hints for answering this question:
There is no significant autocorrelation. The p = .573. The model residuals are white noise.
t.season.fit
?There aren’t any significant spikes so there is no reamining autocorrelation.
Do they appear to be white noise?**
Yes
checkresiduals(t.season.fit)
##
## Breusch-Godfrey test for serial correlation of order up to 16
##
## data: Residuals from Linear regression model
## LM test = 24.387, df = 16, p-value = 0.08138
I()
function. The
general form of the model is:
yvar ~ xvar1 + xvar2 + I(trend^2)
.t.season.fit2
and use the
summary()
function to report the results. Is the quadratic
trend variable significant?Yes, the p-value is less than .05.
t.season.fit2<-tslm(train.price~trend+season+I(trend^2))
summary(t.season.fit2)
##
## Call:
## tslm(formula = train.price ~ trend + season + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1769 -5.4370 0.5922 4.1301 14.8206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.324540 5.039010 8.995 5.08e-10 ***
## trend 0.439121 0.392375 1.119 0.2720
## season2 -1.324796 5.506707 -0.241 0.8115
## season3 -8.551317 5.509512 -1.552 0.1311
## season4 7.462094 5.513776 1.353 0.1861
## season5 5.854061 5.519341 1.061 0.2973
## season6 7.639876 5.526202 1.382 0.1770
## season7 2.045587 5.534510 0.370 0.7143
## season8 -4.951240 5.544569 -0.893 0.3790
## season9 -4.300241 6.002762 -0.716 0.4793
## season10 8.471813 6.008681 1.410 0.1688
## season11 -0.438068 6.013788 -0.073 0.9424
## season12 -10.364224 6.018084 -1.722 0.0953 .
## I(trend^2) -0.008756 0.008467 -1.034 0.3093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.786 on 30 degrees of freedom
## Multiple R-squared: 0.4765, Adjusted R-squared: 0.2497
## F-statistic: 2.101 on 13 and 30 DF, p-value: 0.04594
checkresiduals()
function to conduct a
diagnostic test and comment on any discernible pattern in the residuals
of t.season.fit2
. Has this corrected all the issues
with the residuals?the p-value is greater than .05 so there is no significant autocorrelation. so there is no autocorrelation problem.
checkresiduals(t.season.fit2)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 25.441, df = 17, p-value = 0.08527
I don’t have autocorrelation.
forecast()
function to produce predictions over
the length of the sample period using the models stored in
trend.fit
, season.fit
,
t.season.fit
, and t.season.fit2
. Store these
as fore.trend
, fore.seas
,
fore.t.seas
, and fore.t.seas2
respectively.You are only required to store the model forecast results in this step.
fore.trend<-forecast(trend.fit, h=length(train.price))
fore.seas<-forecast(season.fit, h=length(train.price))
fore.t.seas<-forecast(t.season.fit, h=length(train.price))
fore.t.seas2<-forecast(t.season.fit2, h=length(train.price))
cbind
your
test data, tsprice
, and the point
forecasts of each of your four (4) models above (extracting the
mean columns) and store them into a variable called tmp
.
Next, produce a single graph of the elements of tmp
.Your aim here is to visualize the model fits for each regression model against the price data.
forecast_trend<-fore.trend$mean
forecast_seas<-fore.seas$mean
forecast_t_seas<-fore.t.seas$mean
forecast_t_seas2<-fore.t.seas2$mean
tmp<-cbind(tsprice,forecast_trend,forecast_seas,forecast_t_seas,forecast_t_seas2)
colnames(tmp) <- c("Actual", "Trend Forecast", "Seasonal Forecast", "Trend + Seasonal Forecast", "Trend + Seasonal + Quadratic Forecast")
autoplot(tmp) +
ggtitle("Comparison of Actual vs. Forecasts for NG Prices") +
xlab("Year") +
ylab("Price (Dollars per Thousand Cubic Feet)") +
theme_minimal()
Using the four(4) benchmark models from the previous module, produce forecasts over the test period.
# Define the forecast horizon (length of the test set)
H <- length(test.price)
# Generate forecasts for each model
fc1 <- meanf(train.price, h = H) # Mean forecast model
fc2 <- naive(train.price, h = H) # Naive model
fc3 <- snaive(train.price, h = H) # Seasonal Naive model
fc4 <- rwf(train.price, drift = TRUE, h = H) # Drift model (Random Walk with Drift)
# You can also use the ARIMA model if needed
arima_model <- auto.arima(train.price)
forecast_arima <- forecast(arima_model, h = H)
accuracy()
command, compare the forecasts of
all 8 models against the Test data
.Would the seasonal Naive model still preferred under the
RMSE
, MAPE
and MAE
selection
criteria? If not, which model(s) now has(have) that honor?
No, It looks like the fc1 (mean) has the lowest RMSE
,
MAPE
and MAE
.
Display your results as a table using the knitr::kable()
function:
caption = "Your title here"
argument to add an
appropriate title to your table.library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Define the accuracy measures to extract
tests <- c("RMSE", "MAPE", "MAE")
# Create the accuracy summary data frame
accuracy_summary <- data.frame(
Trend = accuracy(fore.trend, test.price)[2, tests],
Season = accuracy(fore.seas, test.price)[2, tests],
Trend_Season = accuracy(fore.t.seas, test.price)[2, tests],
Sq_Trend_Season = accuracy(fore.t.seas2, test.price)[2, tests],
Mean = accuracy(fc1, test.price)[2, tests],
Naive = accuracy(fc2, test.price)[2, tests],
SeasonalNaive = accuracy(fc3, test.price)[2, tests],
Drift = accuracy(fc4, test.price)[2, tests]
)
# Transpose and format the table
accuracy_summary %>%
t() %>%
round(2) %>%
knitr::kable(caption = "Summary of Accuracy Tests (RMSE, MAPE, MAE)")
RMSE | MAPE | MAE | |
---|---|---|---|
Trend | 10.93 | 20.80 | 8.81 |
Season | 10.90 | 20.35 | 9.04 |
Trend_Season | 11.81 | 22.00 | 9.38 |
Sq_Trend_Season | 22.18 | 35.75 | 18.28 |
Mean | 11.15 | 20.11 | 9.00 |
Naive | 11.18 | 20.48 | 8.98 |
SeasonalNaive | 12.45 | 22.73 | 10.42 |
Drift | 11.65 | 19.62 | 9.18 |