library(AER)
## Warning: package 'AER' was built under R version 4.3.2
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(vars)
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: urca
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(AER)
library(seasonal)
library(tseries)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(dynlm)
library(stats)
library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
library(Metrics)
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:forecast':
##
## accuracy
data(OrangeCounty)
head(OrangeCounty)
## Time Series:
## Start = 1965.1
## End = 1966.35
## Frequency = 4
## employment gnp
## 1965.10 288.00 906.6016
## 1965.35 298.75 919.6007
## 1965.60 302.78 934.0129
## 1965.85 308.65 956.7770
## 1966.10 312.73 975.4326
## 1966.35 331.45 979.3680
The data set OrangeCounty is found in the AER package within R. Sourced from Baltagi (2002), Econometrics, 3rd ed., it consists of a quarterly multiple time series from 1965 to 1983 with two variables: quarterly employment in Orange County and quarterly real GNP. Over this period, both employment and GNP can be expected to trend upward due to growth across various industries, as shown in our “Results” section.
OC <- ts(OrangeCounty, start = c(1965.1), end = c(1983.85), frequency = 4)
emp <- OC[,"employment"]
gnp <- OC[, "gnp"]
ggtsdisplay(emp, main = "Overview of Employment Data")
ggtsdisplay(gnp, main = "Overview of Gross National Income Data")
autoplot(stl(emp, s.window = "periodic"))
autoplot(stl(gnp, s.window = "periodic"))
For employment, there is a smooth upward trend and a seasonal pattern with one upward spike followed by two downward spikes. In the remainder, there are some noticeable spikes, especially after 1970.
For GNP, the trend also increases, with a “V-shaped” seasonal pattern. The remainder also becomes more volatile after 1970, meaning the irregular fluctuations in GNP grow larger after that point.
emp_decomposition <- stl(emp, s.window = "periodic")
gnp_decomposition <- stl(gnp, s.window = "periodic")
emp_seasonal <- emp_decomposition$time.series[, "seasonal"]
gnp_seasonal <- gnp_decomposition$time.series[, "seasonal"]
emp_trend <- emp_decomposition$time.series[, "trend"]
gnp_trend <- gnp_decomposition$time.series[, "trend"]
emp_remainder <- emp_decomposition$time.series[, "remainder"]
gnp_remainder <- gnp_decomposition$time.series[, "remainder"]
emp_model <- lm(emp ~ emp_seasonal + emp_trend + emp_remainder)
gnp_model <- lm(gnp ~ gnp_seasonal + gnp_trend + gnp_remainder)
We extracted the seasonal, trend, and remainder components of each time series and regressed the observed data on them. We are essentially analyzing the relationship between the actual values and the fitted values reconstructed from the three components.
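As a quick visual check (a minimal sketch, not part of the original analysis), the fitted values from these regressions can be overlaid on the observed series:
# Overlay the component-based fit (red, dashed) on the observed employment series
plot(emp, main = "Employment: actual vs. component-based fit")
lines(ts(fitted(emp_model), start = start(emp), frequency = frequency(emp)), col = "red", lty = 2)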
plot(emp_model, which = 1, main = "Residuals of Employment")
plot(gnp_model, which = 1, main = "Residuals of GNP")
For both employment and GNP, the fitted values built from the three components sit very close to the actual observations. This means that there is very little white noise left in the data to model and that these components describe the data very well.
acf(residuals(emp_model))
pacf(residuals(emp_model))
acf(residuals(gnp_model))
pacf(residuals(gnp_model))
Ignoring the first value in the ACF (which is the lag-0 autocorrelation), we see that a couple of significant dynamics remain in the ACF for employment, but these are reduced in the PACF. For GNP, we see very little structure in either the ACF or the PACF, meaning that the regression built from our three components fits the actual GNP data closely.
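As a complementary check (a sketch; the choice of 8 lags is an assumption for a quarterly series), a Ljung-Box test summarizes any remaining residual autocorrelation in a single statistic:
# Ljung-Box test for leftover autocorrelation in the regression residuals
Box.test(residuals(emp_model), lag = 8, type = "Ljung-Box")
Box.test(residuals(gnp_model), lag = 8, type = "Ljung-Box")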
plot(cusum(emp), type = "general")
plot(cusum(gnp), type = "general")
The CUSUM charts for employment and GNP look very similar: the cumulative sums first fall steeply below the target, hold roughly steady around groups 32-44, and then rise back toward the target near 0. The cumulative deviation from the target is largest around groups 34-49, with each group corresponding to one quarterly observation.
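Roughly speaking, the chart accumulates standardized deviations from the series mean; a minimal hand-computed sketch of that idea (not the exact qcc chart, which also draws decision boundaries) is:
# Cumulative sum of standardized deviations from the mean
cusum_emp_manual <- cumsum((emp - mean(emp)) / sd(emp))
plot(cusum_emp_manual, type = "l", main = "Hand-computed CUSUM of employment")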
summary(emp_model)
##
## Call:
## lm(formula = emp ~ emp_seasonal + emp_trend + emp_remainder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.431e-13 -6.481e-14 -1.071e-14 1.683e-14 1.136e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.766e-13 5.439e-14 3.246e+00 0.00178 **
## emp_seasonal 1.000e+00 4.009e-15 2.495e+14 < 2e-16 ***
## emp_trend 1.000e+00 8.835e-17 1.132e+16 < 2e-16 ***
## emp_remainder 1.000e+00 4.741e-15 2.109e+14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.5e-13 on 72 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.299e+31 on 3 and 72 DF, p-value: < 2.2e-16
summary(gnp_model)
##
## Call:
## lm(formula = gnp ~ gnp_seasonal + gnp_trend + gnp_remainder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.877e-13 -1.396e-13 -5.470e-14 2.980e-14 4.114e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.014e-12 3.785e-13 2.679e+00 0.00914 **
## gnp_seasonal 1.000e+00 3.576e-14 2.797e+13 < 2e-16 ***
## gnp_trend 1.000e+00 2.988e-16 3.347e+15 < 2e-16 ***
## gnp_remainder 1.000e+00 9.297e-15 1.076e+14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.02e-13 on 72 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.741e+30 on 3 and 72 DF, p-value: < 2.2e-16
Based on the summary results, the coefficients on the trend, seasonal, and remainder components are highly significant in both the employment and GNP regressions, with very small p-values. The residuals are also tiny, on the order of 1e-13. The R^2 for both models is reported as exactly 1, which is expected: by construction, the seasonal, trend, and remainder components of an STL decomposition add up to the original series, so the regression reproduces the data up to floating-point error.
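A quick numerical check of that point (a minimal sketch): the reconstruction error of the components should be at machine-precision level.
# The STL components sum back to the observed series, up to floating-point error
max(abs((emp_seasonal + emp_trend + emp_remainder) - emp))
max(abs((gnp_seasonal + gnp_trend + gnp_remainder) - gnp))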
autoplot(forecast(emp, h = 12))
autoplot(forecast(gnp, h = 12))
These two plots show the forecasts of employment and GNP 12 steps ahead. Since our time units are quarters, this corresponds to 3 years ahead.
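A quick check of the horizon in calendar time (a minimal sketch):
# Time index of the 12-step-ahead forecasts: 12 quarters = 3 years past the sample end
time(forecast(emp, h = 12)$mean)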
autoplot(forecast(auto.arima(emp), h = 12))
autoplot(forecast(auto.arima(gnp), h = 12))
For employment, the ARIMA forecast is based on an ARIMA(2,0,0)(0,1,1)[4] model with drift: an AR(2) with a seasonal MA(1) term and one order of seasonal differencing. For GNP, no seasonal component was needed; auto.arima selected an ARIMA(1,1,0) model with drift, i.e., an AR(1) on the first-differenced series.
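The selected specifications can also be read off directly (a sketch using forecast::arimaorder; the models are refit here rather than stored):
# Orders chosen by auto.arima(): (p,d,q)(P,D,Q)[m]
arimaorder(auto.arima(emp))   # expecting (2,0,0)(0,1,1)[4], per the summary below
arimaorder(auto.arima(gnp))   # expecting (1,1,0)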
summary(forecast(emp))
##
## Forecast method: ETS(M,A,M)
##
## Model Information:
## ETS(M,A,M)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.3152
## gamma = 1e-04
##
## Initial states:
## l = 282.347
## b = 7.0336
## s = 1.0018 0.9997 1.01 0.9885
##
## sigma: 0.0136
##
## AIC AICc BIC
## 644.3824 647.1097 665.3590
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0517584 7.83331 6.173592 0.009097907 1.042774 0.1811846
## ACF1
## Training set 0.2074117
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1984.10 877.2036 861.9492 892.4580 853.8740 900.5331
## 1984.35 902.1576 876.3443 927.9708 862.6796 941.6356
## 1984.60 898.8764 862.8983 934.8544 843.8527 953.9001
## 1984.85 906.6324 859.4911 953.7737 834.5360 978.7288
## 1985.10 900.3920 842.2301 958.5539 811.4410 989.3429
## 1985.35 925.8490 853.7964 997.9017 815.6540 1036.0441
## 1985.60 922.3277 837.7906 1006.8648 793.0394 1051.6161
## 1985.85 930.1328 831.4772 1028.7884 779.2521 1081.0135
summary(forecast(gnp))
##
## Forecast method: ETS(M,Ad,N)
##
## Model Information:
## ETS(M,Ad,N)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.4488
## phi = 0.8619
##
## Initial states:
## l = 893.4568
## b = 11.976
##
## sigma: 0.0107
##
## AIC AICc BIC
## 727.8444 729.0618 741.8288
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.482327 13.49993 10.18566 0.2008062 0.793955 0.2433021
## ACF1
## Training set 0.007884837
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1984.10 1587.549 1565.840 1609.257 1554.349 1620.748
## 1984.35 1601.025 1563.803 1638.248 1544.099 1657.952
## 1984.60 1612.641 1559.702 1665.579 1531.679 1693.603
## 1984.85 1622.652 1553.765 1691.539 1517.299 1728.005
## 1985.10 1631.280 1546.350 1716.210 1501.391 1761.169
## 1985.35 1638.717 1537.778 1739.655 1484.344 1793.089
## 1985.60 1645.126 1528.314 1761.938 1466.478 1823.775
## 1985.85 1650.650 1518.174 1783.127 1448.046 1853.255
summary(forecast(auto.arima(emp)))
##
## Forecast method: ARIMA(2,0,0)(0,1,1)[4] with drift
##
## Model Information:
## Series: emp
## ARIMA(2,0,0)(0,1,1)[4] with drift
##
## Coefficients:
## ar1 ar2 sma1 drift
## 1.4880 -0.5600 -0.4591 7.8204
## s.e. 0.0959 0.0972 0.1530 1.5542
##
## sigma^2 = 53.83: log likelihood = -245.37
## AIC=500.74 AICc=501.65 BIC=512.13
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0949309 6.940226 5.08561 -0.0005046571 0.8610739 0.1492542
## ACF1
## Training set 0.01301557
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1984.10 884.8387 875.4357 894.2416 870.4582 899.2192
## 1984.35 907.3690 890.5117 924.2263 881.5879 933.1500
## 1984.60 901.9953 879.0593 924.9313 866.9177 937.0729
## 1984.85 917.0940 889.5193 944.6688 874.9221 959.2660
## 1985.10 918.6214 885.0477 952.1952 867.2748 969.9681
## 1985.35 940.8712 901.8292 979.9132 881.1616 1000.5808
## 1985.60 935.1805 891.7934 978.5675 868.8257 1001.5352
## 1985.85 949.9645 903.3656 996.5634 878.6976 1021.2314
summary(forecast(auto.arima(gnp)))
##
## Forecast method: ARIMA(1,1,0) with drift
##
## Model Information:
## Series: gnp
## ARIMA(1,1,0) with drift
##
## Coefficients:
## ar1 drift
## 0.3325 8.9611
## s.e. 0.1083 2.2160
##
## sigma^2 = 170.8: log likelihood = -298.23
## AIC=602.45 AICc=602.79 BIC=609.41
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.008955021 12.8078 9.69118 0.001904781 0.7548298 0.2314906
## ACF1
## Training set -0.01283642
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1984.10 1584.045 1567.298 1600.793 1558.432 1609.659
## 1984.35 1594.061 1566.160 1621.963 1551.389 1636.733
## 1984.60 1603.373 1566.460 1640.286 1546.919 1659.827
## 1984.85 1612.451 1567.989 1656.912 1544.453 1680.449
## 1985.10 1621.451 1570.449 1672.453 1543.450 1699.452
## 1985.35 1630.425 1573.600 1687.249 1543.519 1717.331
## 1985.60 1639.390 1577.277 1701.503 1544.397 1734.384
## 1985.85 1648.353 1581.365 1715.340 1545.904 1750.801
The MAPE, the Mean Absolute Percentage Error, expresses the error in proportional form: the absolute differences between the fitted and actual values are divided by the actual values and then averaged (and reported as a percentage). The training-set MAPE for both employment and GNP is lower for the ARIMA forecast, so we should use that model to minimize the proportional error.
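To make the definition concrete, the training-set MAPE reported above can be recomputed by hand (a minimal sketch; it uses the in-sample fitted values stored in the forecast object):
# MAPE = average of |actual - fitted| / |actual|, expressed in percent
fc_emp <- forecast(emp)
mean(abs(emp - fc_emp$fitted) / abs(emp)) * 100   # should be close to the 1.04 reported above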
emp_forecast <- forecast(emp, h = 12)
emp_aa_forecast <- forecast(auto.arima(emp), h = 12)
emp_combined_forecast <- (emp_forecast$mean + emp_aa_forecast$mean)/2
gnp_forecast <- forecast(gnp, h = 12)
gnp_aa_forecast <- forecast(auto.arima(gnp), h = 12)
gnp_combined_forecast <- (gnp_forecast$mean + gnp_aa_forecast$mean)/2
mape(emp_combined_forecast, emp[1:12])
## [1] 0.6454898
mape(gnp_combined_forecast, gnp[1:12])
## [1] 0.4039313
plot(emp_combined_forecast, main = "12-step combined forecast for employment", ylab = "future employment")
plot(gnp_combined_forecast, main = "12-step combined forecast for gnp", ylab = "future gnp")
For the default forecast model, our MAPE values for employment and GNP are 1.04 and 0.79. For the ARIMA forecast model, the MAPE values are 0.86 and 0.75. For the combined model, they are 0.645 and 0.404. Therefore, we prefer the combined model, since its forecast errors for employment and GNP are the lowest.
In order to use the VARselect function, we now convert the time series object into a data frame.
OC_df <- data.frame(OrangeCounty)
VARselect(OC_df, lag.max = 10)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 5 5 5
##
## $criteria
## 1 2 3 4 5
## AIC(n) 10.01175 9.625210 9.544495 9.534345 9.164603
## HQ(n) 10.09041 9.756306 9.728030 9.770318 9.453015
## SC(n) 10.21081 9.956976 10.008967 10.131523 9.894488
## FPE(n) 22289.68599 15150.543386 13989.979579 13873.887300 9613.085148
## 6 7 8 9 10
## AIC(n) 9.240641 9.328339 9.421873 9.259202 9.275080
## HQ(n) 9.581492 9.721629 9.867600 9.757368 9.825685
## SC(n) 10.103233 10.323637 10.549876 10.519912 10.668497
## FPE(n) 10415.750215 11435.699780 12652.499943 10858.397188 11168.405038
Based on the selection criteria (AIC, HQ, SC, and FPE all agree), we should use a lag order of 5.
var_model <- VAR(OC_df, p = 5)
The ACF and PACF panels are not visible here, but that is fine since we plot them individually below.
plot(var_model)
In the VAR model plots, the fitted values (blue line) and the actual data (dotted line) are very similar. This can also be seen in the residuals, where the remaining noise appears random and mean-reverts around 0.
For employment:
acf(residuals(var_model)[,1])
pacf(residuals(var_model)[,1])
Again, for the ACF we ignore the first value, since that is the lag-0 autocorrelation. There are no significant dynamics in the ACF, and only one significant spike at lag 12 in the PACF. Thus, we can conclude that the employment equation of the VAR model does not need much correction.
For gnp:
acf(residuals(var_model)[,2])
pacf(residuals(var_model)[,2])
Ignoring the lag-0 value, both the ACF and PACF show significant dynamics remaining at lags 8 and 12 for GNP; these lags lie beyond the order of 5 used in our VAR, so they are not captured by the model. Some correction may be warranted, but these are relatively long lags that only barely cross the significance threshold.
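A portmanteau test on all the VAR residuals jointly provides one summary check of remaining serial correlation (a sketch; the choice of 12 lags is an assumption):
# Multivariate portmanteau test on the VAR residuals
serial.test(var_model, lags.pt = 12, type = "PT.asymptotic")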
plot(irf(var_model, impulse = "employment", response = NULL))
plot(irf(var_model, impulse = "gnp", response = NULL))
The impulse response function visualizes what happens to each variable when there is a shock to a specific variable. In the first plot, a shock to employment is followed by a further gradual increase in employment. GNP reacts slightly differently: there is a small increase, and then the values hold roughly steady.
When there is a shock to GNP, employment rises noticeably for about 3 time units and then holds steady. GNP itself also increases for about 3 time units, but afterwards there is a slight decrease.
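The same responses can be traced over a longer horizon with bootstrap bands (a sketch; n.ahead = 12 and 95% bands are assumed settings, not part of the original analysis):
# GNP shock, response of employment only, 12 quarters ahead with 95% bootstrap bands
plot(irf(var_model, impulse = "gnp", response = "employment", n.ahead = 12, ci = 0.95))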
Granger-Causality
grangertest(emp ~ gnp, order = 5)
## Granger causality test
##
## Model 1: emp ~ Lags(emp, 1:5) + Lags(gnp, 1:5)
## Model 2: emp ~ Lags(emp, 1:5)
## Res.Df Df F Pr(>F)
## 1 60
## 2 65 -5 4.2373 0.002299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
grangertest(gnp ~ emp, order = 5)
## Granger causality test
##
## Model 1: gnp ~ Lags(gnp, 1:5) + Lags(emp, 1:5)
## Model 2: gnp ~ Lags(gnp, 1:5)
## Res.Df Df F Pr(>F)
## 1 60
## 2 65 -5 1.0574 0.393
We use a lag order of 5 based on the VAR selection results. From these tests, we can conclude that GNP has a bigger impact on employment than employment has on GNP: the first test, where lags of GNP are added as regressors, has a much lower p-value, so we reject the null hypothesis that GNP does not Granger-cause employment. In the reverse direction, we cannot reject the null hypothesis that employment does not Granger-cause GNP.
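The vars package offers an equivalent test run directly on the fitted VAR (a sketch; output omitted):
# Granger-causality tests based on the fitted VAR
causality(var_model, cause = "gnp")$Granger          # H0: gnp does not Granger-cause employment
causality(var_model, cause = "employment")$Granger   # H0: employment does not Granger-cause gnp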
plot(predict(var_model, n.ahead=12))
The VAR forecast appears the most stable, though that may partly reflect the scale of the plotting window. What is true, though, is that VAR forecasts tend to do better for variables that influence each other, whereas ARIMA relies only on a series' own past values to drive the forecast. Since we have shown an interrelation between GNP and employment, and we have a reasonable amount of data, VAR may be the best choice here. It is also clear that the plain forecast, without ARIMA or VAR, has the widest error bands, so it appears to be the least accurate, and we should focus on the other two for consideration.
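An alternative view of the same VAR forecasts, with shaded prediction regions (a sketch using vars::fanchart, not part of the original analysis):
# Fan chart of the 12-step VAR forecasts for both variables
fanchart(predict(var_model, n.ahead = 12))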
Although the ACF and PACF showed that both GNP and employment have some seasonality, our combined forecasts from the default forecast function and from auto.arima both show the two variables continuing to rise. We learned that combining different forecast methods can help reduce errors by borrowing the strengths of each approach. When we ran the VAR model, which uses both variables (GNP and employment) jointly, we obtained a near-perfect fit and saw very little remaining structure in the ACF/PACF of the residuals.
For the future, one way to improve the analysis is simply to have more data points. Having more background knowledge about Orange County, and about other factors that could affect employment and GNP, would also help.
Access to our data set can be found by searching the R AER package data sets under the name “OrangeCounty”. The PDF listing all available AER data sets is here: https://cran.r-project.org/web/packages/AER/AER.pdf. No other data resources were used for this project.