library(forecast)
## Warning: package 'forecast' was built under R version 4.0.4
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(urca)
## Warning: package 'urca' was built under R version 4.0.4
library(ggplot2)
library(TTR)
## Warning: package 'TTR' was built under R version 4.0.4
library(stats)
library(readr)
library(fpp)
## Warning: package 'fpp' was built under R version 4.0.5
## Loading required package: fma
## Warning: package 'fma' was built under R version 4.0.5
## Loading required package: expsmooth
## Warning: package 'expsmooth' was built under R version 4.0.5
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 4.0.4
library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
##
## Attaching package: 'fpp2'
## The following objects are masked from 'package:fpp':
##
## ausair, ausbeer, austa, austourists, debitcards, departures,
## elecequip, euretail, guinearice, oil, sunspotarea, usmelec
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
##
## Attaching package: 'psych'
## The following object is masked from 'package:seasonal':
##
## outlier
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.0.5
library(vars)
## Warning: package 'vars' was built under R version 4.0.5
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following objects are masked from 'package:fma':
##
## cement, housing, petrol
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.0.5
## Loading required package: sandwich
I'm going to try VAR models, and maybe some of the model types I had more success with in previous assignments/discussions.
First, let's see what the book chose to model in Section 11.2 of fpp.
head(uschange[,1:2])
## Consumption Income
## 1970 Q1 0.6159862 0.9722610
## 1970 Q2 0.4603757 1.1690847
## 1970 Q3 0.8767914 1.5532705
## 1970 Q4 -0.2742451 -0.2552724
## 1971 Q1 1.8973708 1.9871536
## 1971 Q2 0.9119929 1.4473342
autoplot(uschange)
autoplot(uschange[,1:2])
Plotting the series, we can see that uschange already looks stationary, close to white noise. In general, we would need to difference our data to make it stationary before fitting any VAR model; the book mentioned this earlier, but seeing it in the plots reinforced the idea for me.
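As a quick check on that visual impression (my own addition, not from the book), ndiffs() can confirm whether any differencing is needed:
ndiffs(uschange[, "Consumption"])
ndiffs(uschange[, "Income"])
# If these return 0, no differencing is required before fitting the VAR.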
VARselect(uschange[,1:2])
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 1 1 5
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) -1.3802920 -1.3724878 -1.4056544 -1.4183795 -1.436743 -1.4088572
## HQ(n) -1.3366268 -1.2997125 -1.3037690 -1.2873839 -1.276637 -1.2196414
## SC(n) -1.2726259 -1.1930443 -1.1544335 -1.0953812 -1.041967 -0.9423042
## FPE(n) 0.2515067 0.2534832 0.2452268 0.2421485 0.237777 0.2445520
## 7 8 9 10
## AIC(n) -1.3878518 -1.3695542 -1.3293120 -1.2924024
## HQ(n) -1.1695259 -1.1221182 -1.0527658 -0.9867461
## SC(n) -0.8495213 -0.7594464 -0.6474268 -0.5387397
## FPE(n) 0.2498145 0.2545224 0.2650992 0.2752277
var1 <- VAR(uschange[,1:2], p=1, type="const")
serial.test(var1, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var1
## Chi-squared = 49.102, df = 36, p-value = 0.07144
var2 <- VAR(uschange[,1:2], p=2, type="const")
serial.test(var2, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var2
## Chi-squared = 47.741, df = 32, p-value = 0.03633
var3 <- VAR(uschange[,1:2], p=3, type="const")
serial.test(var3, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object var3
## Chi-squared = 33.617, df = 28, p-value = 0.2138
Following the book, we take note of the lag orders chosen by the AIC and BIC (SC) criteria (here 5 and 1, respectively). We then test each candidate VAR's residuals for serial correlation using the Portmanteau test above; the smallest order whose residuals show no significant serial correlation is the one we choose.
Here in the uschange example, the residuals of the VAR(1) and VAR(2) models still showed signs of serial correlation (p = 0.071 and p = 0.036). The VAR(3) residuals did not (p = 0.214), so the VAR(3) was the final model selected for analysis.
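To make that procedure explicit, here is a minimal sketch (my own helper, not from the book) that loops over candidate lag orders and collects the Portmanteau p-values; the smallest order with a non-significant p-value is the one we would keep:
pt.pvals <- sapply(1:5, function(p) {
  fit <- VAR(uschange[, 1:2], p = p, type = "const")
  # serial.test() stores an htest object in its $serial slot
  serial.test(fit, lags.pt = 10, type = "PT.asymptotic")$serial$p.value
})
names(pt.pvals) <- paste0("p=", 1:5)
pt.pvals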
summary(var3)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: Consumption, Income
## Deterministic variables: const
## Sample size: 184
## Log Likelihood: -380.155
## Roots of the characteristic polynomial:
## 0.7821 0.4981 0.4981 0.4441 0.2857 0.2857
## Call:
## VAR(y = uschange[, 1:2], p = 3, type = "const")
##
##
## Estimation results for equation Consumption:
## ============================================
## Consumption = Consumption.l1 + Income.l1 + Consumption.l2 + Income.l2 + Consumption.l3 + Income.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## Consumption.l1 0.19100 0.07965 2.398 0.017518 *
## Income.l1 0.07837 0.05453 1.437 0.152445
## Consumption.l2 0.15954 0.08252 1.933 0.054792 .
## Income.l2 -0.02706 0.05723 -0.473 0.636831
## Consumption.l3 0.22646 0.07970 2.841 0.005018 **
## Income.l3 -0.01454 0.05425 -0.268 0.789055
## const 0.29081 0.08300 3.504 0.000581 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.5955 on 177 degrees of freedom
## Multiple R-Squared: 0.2132, Adjusted R-squared: 0.1865
## F-statistic: 7.994 on 6 and 177 DF, p-value: 1.216e-07
##
##
## Estimation results for equation Income:
## =======================================
## Income = Consumption.l1 + Income.l1 + Consumption.l2 + Income.l2 + Consumption.l3 + Income.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## Consumption.l1 0.45349 0.11560 3.923 0.000125 ***
## Income.l1 -0.27303 0.07915 -3.450 0.000702 ***
## Consumption.l2 0.02167 0.11977 0.181 0.856663
## Income.l2 -0.09005 0.08306 -1.084 0.279788
## Consumption.l3 0.35377 0.11568 3.058 0.002572 **
## Income.l3 -0.05376 0.07875 -0.683 0.495699
## const 0.38750 0.12047 3.217 0.001543 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.8643 on 177 degrees of freedom
## Multiple R-Squared: 0.1755, Adjusted R-squared: 0.1475
## F-statistic: 6.279 on 6 and 177 DF, p-value: 5.309e-06
##
##
##
## Covariance matrix of residuals:
## Consumption Income
## Consumption 0.3546 0.1845
## Income 0.1845 0.7470
##
## Correlation matrix of residuals:
## Consumption Income
## Consumption 1.0000 0.3585
## Income 0.3585 1.0000
forecast(var3) %>%
autoplot() + xlab("Year")
Referring to the correlation matrix of residuals, we see a correlation of about 0.36 between the two equations. Adjusted R-squared is not very high for either equation, but I don't imagine that is the metric by which VAR models are judged.
For this example, since the series are already stationary, the forecasts can be read directly: they are forecasts of the quarterly changes themselves. For non-stationary data that had to be differenced first, however, the raw forecasts would be of the differenced series and would not make much sense on their own.
To recover forecasts on the original scale, we would cumulatively add each differenced point forecast onto the last known value:
Y(t+1) = Y(t) + f(1)
Y(t+2) = Y(t+1) + f(2)
where Y(t+h) is the forecast on the original scale and f(h) is the h-step point forecast of the differenced series.
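As a toy numeric sketch (values invented for illustration), cumsum() performs this undifferencing in one step:
y <- ts(c(100, 102, 105, 103))        # hypothetical original series
fc.diff <- c(1.5, 0.8)                # hypothetical forecasts of diff(y)
fc.level <- tail(y, 1) + cumsum(fc.diff)
fc.level                              # 104.5 105.3, back on the original scale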
Let's move on to our own time series. The three models I will use are a standard linear model (tslm), an ARIMA or ETS model, and a VAR model.
We need to select a time series to work with. I will use the dataset I have been working with for the past few discussions: the Census Bureau data for retail and food services sales. Matt noticed first in the discussions that his data and mine looked very similar, so I retrieved nonfarm payroll data from the BLS website covering the same time period as my retail sales data.
My initial thought is that these two series are likely closely related, since sales and employment go hand in hand (more sales -> more employees). I imagine the effect works in the other direction as well: businesses should know when their busy seasons are and hire more employees accordingly (more employees -> more sales).
census <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\DISCUSSIONS\\WEEK 2\\censusunadj.csv")
payroll <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\DISCUSSIONS\\WEEK 2\\payrollunadj.csv")
summary(census)
## Period Value
## Length:360 Length:360
## Class :character Class :character
## Mode :character Mode :character
summary(payroll)
## Period Value
## Length:360 Min. :106723
## Class :character 1st Qu.:128279
## Mode :character Median :132141
## Mean :132092
## 3rd Qu.:138437
## Max. :153095
## NA's :11
census$Value <- parse_number(census$Value)
summary(census$Value)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 146376 250783 339141 339835 418608 614863 11
tscensus <- ts(census$Value, frequency = 12, start = c(1992, 1), end = c(2021, 1))
autoplot(tscensus)
tspayroll <- ts(payroll$Value, frequency = 12, start = c(1992, 1), end = c(2021, 1))
autoplot(tspayroll)
We can see that even though these two time series are not on the same scale, they are both highly seasonal and follow almost exactly the same trend, so I expect their seasonal patterns to be very similar.
We should also create training and test sets to work with. I have done so below.
census.train <- census[1:280,]
census.test <- census[281:349, ]
summary(census.train)
## Period Value
## Length:280 Min. :146376
## Class :character 1st Qu.:231452
## Mode :character Median :305646
## Mean :302419
## 3rd Qu.:368741
## Max. :502330
tscensus.train <- ts(census.train$Value, frequency = 12, start = c(1992, 1))
autoplot(tscensus.train)
tscensus.test <- ts(census.test$Value, frequency = 12, start = c(2015, 5))
autoplot(tscensus.test)
Above are the census training and test sets; below are the payroll training and test sets.
payroll.train <- payroll[1:280,]
payroll.test <- payroll[281:349, ]
summary(payroll.train)
## Period Value
## Length:280 Min. :106723
## Class :character 1st Qu.:123927
## Mode :character Median :131376
## Mean :128647
## 3rd Qu.:134684
## Max. :141312
tspayroll.train <- ts(payroll.train$Value, frequency = 12, start = c(1992, 1))
autoplot(tspayroll.train)
tspayroll.test <- ts(payroll.test$Value, frequency = 12, start = c(2015, 5))
autoplot(tspayroll.test)
We have now created roughly an 80/20 split of training and test data, which we can use to evaluate the accuracy of our models.
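As an aside, the same split can be done directly on the ts objects with window(), which avoids rebuilding the series from data-frame rows (a sketch; the dates correspond to the row indices used above):
tscensus.train2 <- window(tscensus, end = c(2015, 4))    # rows 1:280
tscensus.test2  <- window(tscensus, start = c(2015, 5))  # rows 281 onward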
Let's start off with a simple linear model. This should give us a baseline to work with.
LMcensus <- tslm(tscensus.train ~ trend + season)
summary(LMcensus)
##
## Call:
## tslm(formula = tscensus.train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41206 -6889 1012 7248 33101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136208.37 3263.57 41.736 < 2e-16 ***
## trend 967.25 10.56 91.588 < 2e-16 ***
## season2 -1714.00 4122.15 -0.416 0.678
## season3 33303.05 4122.19 8.079 2.27e-14 ***
## season4 27727.47 4122.26 6.726 1.05e-10 ***
## season5 43262.35 4166.75 10.383 < 2e-16 ***
## season6 34802.80 4166.71 8.353 3.67e-15 ***
## season7 34282.12 4166.70 8.228 8.47e-15 ***
## season8 40716.74 4166.71 9.772 < 2e-16 ***
## season9 18709.02 4166.75 4.490 1.06e-05 ***
## season10 27298.73 4166.82 6.551 2.92e-10 ***
## season11 29640.78 4166.92 7.113 1.04e-11 ***
## season12 78406.27 4167.04 18.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14280 on 267 degrees of freedom
## Multiple R-squared: 0.9711, Adjusted R-squared: 0.9698
## F-statistic: 747.4 on 12 and 267 DF, p-value: < 2.2e-16
LMpayroll <- tslm(tspayroll.train ~ trend + season)
summary(LMpayroll)
##
## Call:
## tslm(formula = tspayroll.train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7776.9 -4081.6 546.2 3815.6 7048.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.136e+05 1.024e+03 111.023 <2e-16 ***
## trend 9.150e+01 3.312e+00 27.622 <2e-16 ***
## season2 5.763e+02 1.293e+03 0.446 0.6562
## season3 1.234e+03 1.293e+03 0.954 0.3408
## season4 2.047e+03 1.293e+03 1.583 0.1145
## season5 2.810e+03 1.307e+03 2.150 0.0325 *
## season6 3.195e+03 1.307e+03 2.445 0.0151 *
## season7 1.909e+03 1.307e+03 1.461 0.1452
## season8 1.964e+03 1.307e+03 1.503 0.1340
## season9 2.540e+03 1.307e+03 1.943 0.0530 .
## season10 3.179e+03 1.307e+03 2.432 0.0157 *
## season11 3.370e+03 1.307e+03 2.579 0.0105 *
## season12 3.123e+03 1.307e+03 2.390 0.0176 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4479 on 267 degrees of freedom
## Multiple R-squared: 0.7451, Adjusted R-squared: 0.7337
## F-statistic: 65.04 on 12 and 267 DF, p-value: < 2.2e-16
We can see immediately that nearly all of the variation in the census data is explained by trend and seasonality. The payroll data is not as good a fit based on R-squared, but it still has decent explanatory power. Since these are both univariate series, there is nothing more to add to the regression, so we will move on to the forecasts and comparison.
fc.census <- forecast(LMcensus, h = 68)
autoplot(fc.census)
fc.payroll <- forecast(LMpayroll, h = 68)
autoplot(fc.payroll)
We can immediately note that our forecasts are a bit conservative compared with the real data and tend to underestimate over the forecast horizon. Neither forecast was able to predict the huge dip caused by COVID-19 (as expected).
accuracy(fc.census, tscensus.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 2.490697e-12 13944.07 10543.19 -0.2180549 3.400158 0.6737691
## Test set 1.939199e+04 29404.08 23357.40 3.6234689 4.567135 1.4926688
## ACF1 Theil's U
## Training set 0.8104550 NA
## Test set 0.4790005 0.6400528
autoplot(tscensus.test) + autolayer(fc.census, series = "FORECAST", PI = FALSE)
accuracy(fc.payroll, tspayroll.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.090311e-12 4373.609 3910.077 -0.1290731 3.079639 1.695887
## Test set 1.492072e+03 4628.050 3697.708 0.9398905 2.549012 1.603778
## ACF1 Theil's U
## Training set 0.9937427 NA
## Test set 0.8054812 1.745012
autoplot(tspayroll.test) + autolayer(fc.payroll, series = "FORECAST", PI = FALSE)
Going by the MASE, our linear model for the census data performs well. The payroll forecast didn't do as well, but it still wasn't too bad. Examining the graphs, our forecasts capture the peaks and troughs, but the point forecasts miss the actual levels.
Interestingly, I believe the 80% and 95% prediction intervals do capture the true values within their limits. I imagine there is some way to pull the point estimates closer, but I will save that for another time.
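A quick way to check that intuition (my own addition) is to compute the empirical coverage of the 95% interval over the test period:
# Align the test values with the 68 forecast steps, then count how many
# fall inside the 95% interval (column 2 of $lower/$upper).
test.vals <- head(as.numeric(tscensus.test), length(fc.census$mean))
mean(test.vals >= fc.census$lower[, 2] & test.vals <= fc.census$upper[, 2])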
checkresiduals(LMcensus)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 236.79, df = 24, p-value < 2.2e-16
checkresiduals(LMpayroll)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 276.87, df = 24, p-value < 2.2e-16
A check on the residuals indicates that there is still significant autocorrelation in our data. What could we do? Differencing the data might be one solution; the residual patterns essentially trace out the trend. Honestly, I am a little unsure of what to do in this situation.
How much of an issue are the residuals if the forecasted values seem decent? It is odd to me that the R-squared values (especially for the census model) were good and yet the residuals turned out this way.
Next, we consider ARIMA models. Let's check the ACF and PACF plots first.
ggtsdisplay(tscensus.train)
ndiffs(tscensus.train)
## [1] 1
ggtsdisplay(tspayroll.train)
ndiffs(tspayroll.train)
## [1] 1
We could expect the ACF and PACF plots to look this way given how much apparent seasonality and trend is in each series. When ndiffs() is run on both series, it suggests differencing once; nsdiffs() suggests we should also take a seasonal difference. I will ultimately let auto.arima() choose the differencing, since I am still a bit shaky on doing it by hand, but I will difference manually below to inspect the result.
ndiffs(tscensus.train)
## [1] 1
nsdiffs(tscensus.train)
## [1] 1
tscensus.train.diff.tr <- tscensus.train %>% diff(lag=12) %>% diff(1)
tscensus.train.diff.tr %>% ggtsdisplay()
tscensus.train.diff.tr %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0281
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
There is still some structure in the ACF and PACF plots, but they look much better than the initial, undifferenced plots. The KPSS test statistic is well below the critical values, indicating we do not need to difference further.
ndiffs(tspayroll.train)
## [1] 1
nsdiffs(tspayroll.train)
## [1] 1
tspayroll.train.tr <- tspayroll.train %>% diff(lag = 12) %>% diff(1)
tspayroll.train.tr %>% ggtsdisplay()
tspayroll.train.tr %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0969
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The differenced payroll series still shows some structure in its ACF/PACF plots, so perhaps we should difference again? (The KPSS test itself does not reject stationarity, but the plots suggest trying it.)
tspayroll.train.tr.2 <- diff(tspayroll.train.tr, differences = 1)
tspayroll.train.tr.2 %>% ggtsdisplay()
The double-differenced payroll data looks much more stationary.
Once we run auto.arima(), then, we might expect d values of 1 for census and 2 for payroll.
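If we wanted to impose those expected orders rather than letting the search decide, auto.arima() accepts the differencing orders d and D directly (a sketch, not what I actually ran below):
arima.census.d1  <- auto.arima(tscensus.train, d = 1, D = 1)   # force d = 1
arima.payroll.d2 <- auto.arima(tspayroll.train, d = 2, D = 1)  # force d = 2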
arima.census <- auto.arima(tscensus.train)
summary(arima.census)
## Series: tscensus.train
## ARIMA(1,0,1)(2,1,2)[12] with drift
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1 sma2 drift
## 0.9587 -0.3631 0.8533 -0.6285 -1.2741 0.6191 996.2404
## s.e. 0.0192 0.0588 0.0901 0.0708 0.1091 0.0849 195.0460
##
## sigma^2 estimated as 35091431: log likelihood=-2711.61
## AIC=5439.22 AICc=5439.78 BIC=5467.95
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 45.13787 5719.285 4343.91 -0.03016123 1.432709 0.2776002
## ACF1
## Training set -0.00920604
arima.payroll <- auto.arima(tspayroll.train)
summary(arima.payroll)
## Series: tspayroll.train
## ARIMA(1,1,2)(0,1,2)[12]
##
## Coefficients:
## ar1 ma1 ma2 sma1 sma2
## 0.9277 -0.6363 0.1222 -0.6206 -0.1076
## s.e. 0.0305 0.0677 0.0607 0.0650 0.0587
##
## sigma^2 estimated as 26892: log likelihood=-1739.43
## AIC=3490.85 AICc=3491.18 BIC=3512.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.0984124 158.6296 122.0024 0.0003118459 0.09498862 0.05291516
## ACF1
## Training set -0.01987173
checkresiduals(arima.census)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(2,1,2)[12] with drift
## Q* = 53.028, df = 17, p-value = 1.414e-05
##
## Model df: 7. Total lags used: 24
checkresiduals(arima.payroll)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)(0,1,2)[12]
## Q* = 19.981, df = 19, p-value = 0.3957
##
## Model df: 5. Total lags used: 24
Although there is still some autocorrelation, as indicated by the ACF plot and the Ljung-Box test, the residuals for our census model look much better than at the start, or even after the manual differencing we attempted earlier. Far fewer lags fall outside the significance bounds, and the oscillatory pattern is weaker than before.
The residual check for the payroll model also shows an improvement over the previous plots. The Ljung-Box test (p = 0.40) and the ACF plot indicate that these residuals are consistent with white noise. In both residual histograms there is a long tail on the left (one large negative residual).
The auto.arima() function selected different values of d than we expected (0 for census, 1 for payroll). Looking at the error measures, the MASE for both models is well below the seasonal naive benchmark of 1, so we will work with these models and see where they take us.
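To make the MASE comparison concrete, we can fit the seasonal naive benchmark itself; its in-sample MASE is 1 by construction, so any model scoring below that is beating it (a quick sketch):
fc.snaive.census <- snaive(tscensus.train, h = 68)
accuracy(fc.snaive.census, tscensus.test)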
fc.arima.census <- forecast(arima.census, h = 68)
autoplot(fc.arima.census)
accuracy(fc.arima.census, tscensus.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 45.13787 5719.285 4343.91 -0.03016123 1.432709 0.2776002
## Test set 14782.61556 24298.401 19623.94 2.80144641 3.912327 1.2540802
## ACF1 Theil's U
## Training set -0.00920604 NA
## Test set 0.54478048 0.5377576
fc.arima.payroll <- forecast(arima.payroll, h = 68)
autoplot(fc.arima.payroll)
accuracy(fc.arima.payroll, tspayroll.test)
## ME RMSE MAE MPE MAPE
## Training set -0.0984124 158.6296 122.0024 0.0003118459 0.09498862
## Test set -1344.1529845 5085.9715 2200.8522 -0.9966139628 1.57218951
## MASE ACF1 Theil's U
## Training set 0.05291516 -0.01987173 NA
## Test set 0.95455852 0.83260599 1.975938
The plotted forecasts and accuracy measures indicate that our ARIMA models performed well in terms of predictive ability. I will plot the predicted versus actual values below.
autoplot(tscensus.test) + autolayer(fc.arima.census, series = "ARIMA forecasts census", PI = FALSE)
autoplot(tspayroll.test) + autolayer(fc.arima.payroll, series = "ARIMA forecasts payroll", PI = FALSE)
We can see that the predictions for both payroll and census trace the real test values remarkably well. The payroll predictions stay accurate further out than the census predictions, and once again the huge dip caused by COVID-19 was not predicted by our models, although I believe the lower prediction intervals on the payroll forecasts came close to capturing it.
Finally, we have reached the model that made me choose these two time series in the first place. My initial hunch was that payroll and sales are very closely related, so I will test that in this section with a VAR model.
Let us begin by differencing the data to make it stationary.
ndiffs(tscensus.train)
## [1] 1
nsdiffs(tscensus.train)
## [1] 1
tscensus.train.diff <- tscensus.train %>% diff(lag = 12) %>% diff(1)
autoplot(tscensus.train.diff)
ndiffs(tspayroll.train)
## [1] 1
nsdiffs(tspayroll.train)
## [1] 1
tspayroll.train.tr <- tspayroll.train %>% diff(lag = 12) %>% diff(1)
tspayroll.train.tr.2 <- tspayroll.train.tr %>% diff(1)
autoplot(tspayroll.train.tr)
autoplot(tspayroll.train.tr.2)
The plots above show the differenced data, with zero mean and relatively constant variance. We will now use VARselect() to choose the lag order for our model. First, however, we need to join the two series into one multivariate set.
My first thought was to bind the first-order differenced payroll data, because the double-differenced payroll series is one observation shorter and cbind() would leave an NA in the data frame, which VARselect() can't handle. I have included its plot below for reference.
But perhaps I should instead remove the first row of the combined data so the columns match; that way both series in the model are stationary. I will try that rather than using a series that isn't stationary.
autoplot(tspayroll.train.tr)
agg.census.payroll <- cbind(tscensus.train.diff, tspayroll.train.tr.2)
agg.census.payroll.tr <- agg.census.payroll[-1, ]
VARselect(agg.census.payroll.tr)[["selection"]]
## AIC(n) HQ(n) SC(n) FPE(n)
## 10 3 2 10
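An equivalent way to handle the misaligned row (a sketch): cbind() on ts objects aligns the series by time and pads the shorter one with NA, so na.omit() drops exactly the offending rows:
agg.census.payroll.tr2 <- na.omit(cbind(tscensus.train.diff, tspayroll.train.tr.2))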
We can use the Portmanteau test to determine the number of lags needed (the criteria above suggest somewhere between 2 and 10).
census.2 <- VAR(agg.census.payroll.tr, p = 2, type = "both")
serial.test(census.2, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.2
## Chi-squared = 88.788, df = 32, p-value = 3.059e-07
census.3 <- VAR(agg.census.payroll.tr, p = 3, type = "both")
serial.test(census.3, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.3
## Chi-squared = 63.748, df = 28, p-value = 0.0001324
census.4 <- VAR(agg.census.payroll.tr, p = 4, type = "both")
serial.test(census.4, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.4
## Chi-squared = 66.189, df = 24, p-value = 8.132e-06
census.5 <- VAR(agg.census.payroll.tr, p = 5, type = "both")
serial.test(census.5, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.5
## Chi-squared = 63.494, df = 20, p-value = 2.023e-06
census.6 <- VAR(agg.census.payroll.tr, p = 6, type = "both")
serial.test(census.6, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.6
## Chi-squared = 39.874, df = 16, p-value = 0.0008122
census.7 <- VAR(agg.census.payroll.tr, p = 7, type = "both")
serial.test(census.7, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.7
## Chi-squared = 35.859, df = 12, p-value = 0.0003413
census.8 <- VAR(agg.census.payroll.tr, p = 8, type = "both")
serial.test(census.8, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.8
## Chi-squared = 34.59, df = 8, p-value = 3.175e-05
census.9 <- VAR(agg.census.payroll.tr, p = 9, type = "both")
serial.test(census.9, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.9
## Chi-squared = 23.133, df = 4, p-value = 0.0001191
census.10 <- VAR(agg.census.payroll.tr, p = 10, type = "both")
serial.test(census.10, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object census.10
## Chi-squared = 11.883, df = 0, p-value < 2.2e-16
Based on the above, the null hypothesis of no residual autocorrelation is rejected at every lag order we tried. I will use the lag order whose test had the largest p-value, which was p = 6.
summary(census.6)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: tscensus.train.diff, tspayroll.train.tr.2
## Deterministic variables: both
## Sample size: 260
## Log Likelihood: -4364.04
## Roots of the characteristic polynomial:
## 0.8305 0.8305 0.7447 0.7447 0.7427 0.7427 0.738 0.738 0.7048 0.7048 0.6449 0.5185
## Call:
## VAR(y = agg.census.payroll.tr, p = 6, type = "both")
##
##
## Estimation results for equation tscensus.train.diff:
## ====================================================
## tscensus.train.diff = tscensus.train.diff.l1 + tspayroll.train.tr.2.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.2.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.2.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.2.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.2.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.2.l6 + const + trend
##
## Estimate Std. Error t value Pr(>|t|)
## tscensus.train.diff.l1 -0.45492 0.06595 -6.898 4.43e-11 ***
## tspayroll.train.tr.2.l1 2.25235 2.37049 0.950 0.34296
## tscensus.train.diff.l2 -0.24238 0.07609 -3.186 0.00163 **
## tspayroll.train.tr.2.l2 8.06399 2.94114 2.742 0.00656 **
## tscensus.train.diff.l3 0.07047 0.07909 0.891 0.37385
## tspayroll.train.tr.2.l3 4.37463 3.09929 1.411 0.15936
## tscensus.train.diff.l4 -0.04633 0.07934 -0.584 0.55985
## tspayroll.train.tr.2.l4 0.32359 3.10242 0.104 0.91701
## tscensus.train.diff.l5 0.06017 0.07500 0.802 0.42317
## tspayroll.train.tr.2.l5 3.66592 2.98125 1.230 0.22000
## tscensus.train.diff.l6 0.10250 0.06661 1.539 0.12513
## tspayroll.train.tr.2.l6 5.56962 2.38028 2.340 0.02009 *
## const 60.97198 865.25629 0.070 0.94388
## trend -0.59589 5.55446 -0.107 0.91465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 6719 on 246 degrees of freedom
## Multiple R-Squared: 0.3069, Adjusted R-squared: 0.2702
## F-statistic: 8.378 on 13 and 246 DF, p-value: 5.497e-14
##
##
## Estimation results for equation tspayroll.train.tr.2:
## =====================================================
## tspayroll.train.tr.2 = tscensus.train.diff.l1 + tspayroll.train.tr.2.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.2.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.2.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.2.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.2.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.2.l6 + const + trend
##
## Estimate Std. Error t value Pr(>|t|)
## tscensus.train.diff.l1 0.0071716 0.0018716 3.832 0.000162 ***
## tspayroll.train.tr.2.l1 -0.7295212 0.0672735 -10.844 < 2e-16 ***
## tscensus.train.diff.l2 0.0039600 0.0021593 1.834 0.067874 .
## tspayroll.train.tr.2.l2 -0.3334082 0.0834685 -3.994 8.57e-05 ***
## tscensus.train.diff.l3 0.0066551 0.0022447 2.965 0.003326 **
## tspayroll.train.tr.2.l3 -0.1959892 0.0879568 -2.228 0.026769 *
## tscensus.train.diff.l4 0.0003201 0.0022517 0.142 0.887064
## tspayroll.train.tr.2.l4 -0.1457511 0.0880455 -1.655 0.099117 .
## tscensus.train.diff.l5 -0.0006282 0.0021285 -0.295 0.768138
## tspayroll.train.tr.2.l5 -0.0873329 0.0846069 -1.032 0.302983
## tscensus.train.diff.l6 -0.0023041 0.0018903 -1.219 0.224043
## tspayroll.train.tr.2.l6 0.0609952 0.0675513 0.903 0.367438
## const -2.4984605 24.5556512 -0.102 0.919040
## trend 0.0118943 0.1576336 0.075 0.939914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 190.7 on 246 degrees of freedom
## Multiple R-Squared: 0.385, Adjusted R-squared: 0.3525
## F-statistic: 11.85 on 13 and 246 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## tscensus.train.diff tspayroll.train.tr.2
## tscensus.train.diff 45148618 434707
## tspayroll.train.tr.2 434707 36363
##
## Correlation matrix of residuals:
## tscensus.train.diff tspayroll.train.tr.2
## tscensus.train.diff 1.0000 0.3393
## tspayroll.train.tr.2 0.3393 1.0000
Let me try something else here: fitting the VAR on the untransformed series.
ag.cen.pay <- cbind(tscensus.train, tspayroll.train)
VARselect(ag.cen.pay)[["selection"]]
## AIC(n) HQ(n) SC(n) FPE(n)
## 10 10 10 10
Here every criterion tells us to use lag 10. Since 10 is the default maximum for VARselect(), I suspect the true optimum might be even higher.
ag.10 <- VAR(ag.cen.pay, p = 10, type = "both")
serial.test(ag.10, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object ag.10
## Chi-squared = 196.28, df = 0, p-value < 2.2e-16
summary(ag.10)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: tscensus.train, tspayroll.train
## Deterministic variables: both
## Sample size: 270
## Log Likelihood: -4691.814
## Roots of the characteristic polynomial:
## 0.9979 0.9979 0.9963 0.9963 0.9955 0.9955 0.9937 0.9937 0.9859 0.9674 0.9674 0.9312 0.9194 0.9194 0.9091 0.9091 0.8311 0.8311 0.6927 0.431
## Call:
## VAR(y = ag.cen.pay, p = 10, type = "both")
##
##
## Estimation results for equation tscensus.train:
## ===============================================
## tscensus.train = tscensus.train.l1 + tspayroll.train.l1 + tscensus.train.l2 + tspayroll.train.l2 + tscensus.train.l3 + tspayroll.train.l3 + tscensus.train.l4 + tspayroll.train.l4 + tscensus.train.l5 + tspayroll.train.l5 + tscensus.train.l6 + tspayroll.train.l6 + tscensus.train.l7 + tspayroll.train.l7 + tscensus.train.l8 + tspayroll.train.l8 + tscensus.train.l9 + tspayroll.train.l9 + tscensus.train.l10 + tspayroll.train.l10 + const + trend
##
## Estimate Std. Error t value Pr(>|t|)
## tscensus.train.l1 -3.091e-02 5.722e-02 -0.540 0.589547
## tspayroll.train.l1 1.215e+01 1.636e+00 7.425 1.81e-12 ***
## tscensus.train.l2 9.829e-02 6.333e-02 1.552 0.121910
## tspayroll.train.l2 -3.363e+00 3.074e+00 -1.094 0.275080
## tscensus.train.l3 5.667e-01 7.267e-02 7.799 1.73e-13 ***
## tspayroll.train.l3 -7.472e+00 3.121e+00 -2.394 0.017397 *
## tscensus.train.l4 4.300e-02 7.777e-02 0.553 0.580815
## tspayroll.train.l4 8.152e-01 2.585e+00 0.315 0.752776
## tscensus.train.l5 2.893e-01 6.744e-02 4.289 2.57e-05 ***
## tspayroll.train.l5 -1.748e+01 1.426e+00 -12.251 < 2e-16 ***
## tscensus.train.l6 -6.097e-01 6.239e-02 -9.772 < 2e-16 ***
## tspayroll.train.l6 2.842e+01 1.708e+00 16.636 < 2e-16 ***
## tscensus.train.l7 5.184e-01 7.573e-02 6.845 5.97e-11 ***
## tspayroll.train.l7 -1.910e+01 2.461e+00 -7.764 2.16e-13 ***
## tscensus.train.l8 -1.037e-01 7.286e-02 -1.424 0.155704
## tspayroll.train.l8 1.096e+01 2.814e+00 3.896 0.000126 ***
## tscensus.train.l9 5.950e-01 6.172e-02 9.639 < 2e-16 ***
## tspayroll.train.l9 -1.255e+01 2.757e+00 -4.554 8.27e-06 ***
## tscensus.train.l10 -5.359e-01 6.305e-02 -8.499 1.81e-15 ***
## tspayroll.train.l10 7.972e+00 1.734e+00 4.597 6.83e-06 ***
## const -1.121e+04 1.432e+04 -0.783 0.434400
## trend 1.394e+02 5.202e+01 2.679 0.007873 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 7980 on 248 degrees of freedom
## Multiple R-Squared: 0.9906, Adjusted R-squared: 0.9898
## F-statistic: 1246 on 21 and 248 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation tspayroll.train:
## ================================================
## tspayroll.train = tscensus.train.l1 + tspayroll.train.l1 + tscensus.train.l2 + tspayroll.train.l2 + tscensus.train.l3 + tspayroll.train.l3 + tscensus.train.l4 + tspayroll.train.l4 + tscensus.train.l5 + tspayroll.train.l5 + tscensus.train.l6 + tspayroll.train.l6 + tscensus.train.l7 + tspayroll.train.l7 + tscensus.train.l8 + tspayroll.train.l8 + tscensus.train.l9 + tspayroll.train.l9 + tscensus.train.l10 + tspayroll.train.l10 + const + trend
##
## Estimate Std. Error t value Pr(>|t|)
## tscensus.train.l1 -2.162e-02 2.194e-03 -9.857 < 2e-16 ***
## tspayroll.train.l1 1.715e+00 6.271e-02 27.343 < 2e-16 ***
## tscensus.train.l2 2.159e-02 2.428e-03 8.894 < 2e-16 ***
## tspayroll.train.l2 -4.588e-01 1.179e-01 -3.893 0.000128 ***
## tscensus.train.l3 5.894e-04 2.786e-03 0.212 0.832602
## tspayroll.train.l3 -6.284e-01 1.196e-01 -5.253 3.23e-07 ***
## tscensus.train.l4 6.077e-04 2.982e-03 0.204 0.838650
## tspayroll.train.l4 1.816e-01 9.911e-02 1.833 0.068035 .
## tscensus.train.l5 -1.017e-02 2.585e-03 -3.935 0.000108 ***
## tspayroll.train.l5 1.866e-01 5.469e-02 3.412 0.000752 ***
## tscensus.train.l6 2.954e-03 2.392e-03 1.235 0.218005
## tspayroll.train.l6 6.863e-01 6.549e-02 10.480 < 2e-16 ***
## tscensus.train.l7 -1.244e-03 2.903e-03 -0.428 0.668723
## tspayroll.train.l7 -1.164e+00 9.433e-02 -12.341 < 2e-16 ***
## tscensus.train.l8 -7.921e-03 2.793e-03 -2.836 0.004947 **
## tspayroll.train.l8 3.127e-01 1.079e-01 2.899 0.004084 **
## tscensus.train.l9 8.959e-03 2.366e-03 3.786 0.000192 ***
## tspayroll.train.l9 4.823e-01 1.057e-01 4.563 7.94e-06 ***
## tscensus.train.l10 6.703e-03 2.417e-03 2.773 0.005977 **
## tspayroll.train.l10 -3.231e-01 6.648e-02 -4.859 2.09e-06 ***
## const 1.209e+03 5.491e+02 2.202 0.028615 *
## trend 3.838e-01 1.994e+00 0.192 0.847539
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 305.9 on 248 degrees of freedom
## Multiple R-Squared: 0.9986, Adjusted R-squared: 0.9985
## F-statistic: 8533 on 21 and 248 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## tscensus.train tspayroll.train
## tscensus.train 63674942 958724
## tspayroll.train 958724 93586
##
## Correlation matrix of residuals:
## tscensus.train tspayroll.train
## tscensus.train 1.0000 0.3927
## tspayroll.train 0.3927 1.0000
d.ag.cen.pay <- cbind(tscensus.train.diff, tspayroll.train.tr)
VARselect(d.ag.cen.pay)[["selection"]]
## AIC(n) HQ(n) SC(n) FPE(n)
## 9 4 3 9
Above is the selection output for the data frame in which both series are seasonally and first-order differenced.
dag.3 <- VAR(d.ag.cen.pay, p=3, type="both")
serial.test(dag.3, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.3
## Chi-squared = 61.379, df = 28, p-value = 0.0002707
dag.4 <- VAR(d.ag.cen.pay, p=4, type="both")
serial.test(dag.4, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.4
## Chi-squared = 47.264, df = 24, p-value = 0.00311
dag.5 <- VAR(d.ag.cen.pay, p=5, type="both")
serial.test(dag.5, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.5
## Chi-squared = 39.895, df = 20, p-value = 0.005151
dag.6 <- VAR(d.ag.cen.pay, p=6, type="both")
serial.test(dag.6, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.6
## Chi-squared = 32.425, df = 16, p-value = 0.008799
dag.7 <- VAR(d.ag.cen.pay, p=7, type="both")
serial.test(dag.7, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.7
## Chi-squared = 30.534, df = 12, p-value = 0.002319
dag.8 <- VAR(d.ag.cen.pay, p=8, type="both")
serial.test(dag.8, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.8
## Chi-squared = 27.312, df = 8, p-value = 0.0006244
dag.9 <- VAR(d.ag.cen.pay, p=9, type="both")
serial.test(dag.9, lags.pt=10, type="PT.asymptotic")
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object dag.9
## Chi-squared = 18.136, df = 4, p-value = 0.001161
Once again, the null hypothesis of no residual autocorrelation is rejected in every test we ran. I will use p = 6 because it gave the largest p-value of all the tests.
summary(dag.6)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: tscensus.train.diff, tspayroll.train.tr
## Deterministic variables: both
## Sample size: 261
## Log Likelihood: -4358.2
## Roots of the characteristic polynomial:
## 0.8357 0.8357 0.7832 0.7832 0.7456 0.7456 0.7361 0.6968 0.6968 0.6421 0.5767 0.5767
## Call:
## VAR(y = d.ag.cen.pay, p = 6, type = "both")
##
##
## Estimation results for equation tscensus.train.diff:
## ====================================================
## tscensus.train.diff = tscensus.train.diff.l1 + tspayroll.train.tr.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.l6 + const + trend
##
## Estimate Std. Error t value Pr(>|t|)
## tscensus.train.diff.l1 -0.45078 0.06833 -6.597 2.54e-10 ***
## tspayroll.train.tr.l1 1.73684 2.61992 0.663 0.50799
## tscensus.train.diff.l2 -0.24297 0.08321 -2.920 0.00382 **
## tspayroll.train.tr.l2 5.85857 2.60412 2.250 0.02535 *
## tscensus.train.diff.l3 0.09252 0.08815 1.050 0.29493
## tspayroll.train.tr.l3 -4.01392 2.65596 -1.511 0.13199
## tscensus.train.diff.l4 -0.00491 0.09114 -0.054 0.95708
## tspayroll.train.tr.l4 -4.86605 2.60887 -1.865 0.06334 .
## tscensus.train.diff.l5 0.09128 0.08554 1.067 0.28696
## tspayroll.train.tr.l5 1.38382 2.48433 0.557 0.57802
## tscensus.train.diff.l6 0.15134 0.06964 2.173 0.03072 *
## tspayroll.train.tr.l6 0.50793 2.39315 0.212 0.83209
## const 38.07637 873.31820 0.044 0.96526
## trend -0.52358 5.59242 -0.094 0.92549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 6779 on 247 degrees of freedom
## Multiple R-Squared: 0.2916, Adjusted R-squared: 0.2543
## F-statistic: 7.822 on 13 and 247 DF, p-value: 5.241e-13
##
##
## Estimation results for equation tspayroll.train.tr:
## ===================================================
## tspayroll.train.tr = tscensus.train.diff.l1 + tspayroll.train.tr.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.l6 + const + trend
##
## Estimate Std. Error t value Pr(>|t|)
## tscensus.train.diff.l1 0.009965 0.001774 5.616 5.25e-08 ***
## tspayroll.train.tr.l1 0.071661 0.068034 1.053 0.293223
## tscensus.train.diff.l2 0.009511 0.002161 4.402 1.60e-05 ***
## tspayroll.train.tr.l2 0.245954 0.067623 3.637 0.000336 ***
## tscensus.train.diff.l3 0.013729 0.002289 5.997 7.08e-09 ***
## tspayroll.train.tr.l3 0.045158 0.068970 0.655 0.513237
## tscensus.train.diff.l4 0.009207 0.002367 3.890 0.000129 ***
## tspayroll.train.tr.l4 0.010807 0.067747 0.160 0.873388
## tscensus.train.diff.l5 0.007237 0.002221 3.258 0.001279 **
## tspayroll.train.tr.l5 0.028679 0.064513 0.445 0.657031
## tscensus.train.diff.l6 0.003415 0.001808 1.889 0.060126 .
## tspayroll.train.tr.l6 0.131587 0.062145 2.117 0.035224 *
## const -15.113359 22.678202 -0.666 0.505760
## trend 0.105638 0.145223 0.727 0.467657
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 176 on 247 degrees of freedom
## Multiple R-Squared: 0.5622, Adjusted R-squared: 0.5392
## F-statistic: 24.4 on 13 and 247 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## tscensus.train.diff tspayroll.train.tr
## tscensus.train.diff 45957613 450850
## tspayroll.train.tr 450850 30991
##
## Correlation matrix of residuals:
## tscensus.train.diff tspayroll.train.tr
## tscensus.train.diff 1.0000 0.3778
## tspayroll.train.tr 0.3778 1.0000
To summarize, I have tested three different data sets. The first combined the differenced census data with the double-differenced payroll data; the second used the untransformed series for census and payroll; and the third used the first-order (and seasonally) differenced data for both series.
The untransformed data gave the highest R-squared values of the three. I don't believe this is meaningful; with both equations above 0.98, it looks more like overfitting to the shared trend than genuine explanatory power. The most practical model was probably the one built on the first-order differenced payroll and census data.
All three VAR models had a residual correlation between 0.3 and 0.4 for the payroll and census equations. This is interesting, as I thought that number would be much higher given how similar the graphs look.
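One follow-up I did not run (a sketch only): the vars package can test Granger causality directly with causality(), which would address the "more employees -> more sales" question more formally than residual correlation does:
# Does payroll help predict census sales, and vice versa?
causality(dag.6, cause = "tspayroll.train.tr")$Granger
causality(dag.6, cause = "tscensus.train.diff")$Granger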
Below are the forecasts for the remaining two models, in the same order as above. The forecast for the VAR model with mixed levels of differencing did not work; I assume this was because a row of data was removed to align the series, but I am unsure of the exact reason, so I have not included its code.
Below this line is the forecast from the undifferenced model. The forecasts look reasonable, although they may be a little flat, and once again the 2020 dip was not predicted.
forecast(ag.10, h = 68) %>% autoplot()
Finally, below are the forecasts from the first-order differenced VAR model. Notice that the predictions decay toward 0 over the horizon. This is actually expected: a stationary VAR's forecasts revert to the series mean, which is near 0 for differenced data. My initial thought was that the PT.asymptotic option was responsible for this tendency, but I realized that argument only affects the serial.test() command.
forecast(dag.6, h = 68) %>% autoplot()
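If we wanted these forecasts back on the original census scale, the differencing would have to be inverted. A sketch using diffinv(), assuming the mforecast object returned by forecast() exposes the per-series forecasts under $forecast:
fc.dag <- forecast(dag.6, h = 68)
f.diff <- fc.dag$forecast$tscensus.train.diff$mean   # forecasts of the twice-differenced series
# Invert the lag-1 difference; xi is the last value of the seasonally
# differenced series ...
seas.diff <- diff(tscensus.train, lag = 12)
f.seas <- diffinv(f.diff, lag = 1, xi = tail(seas.diff, 1))[-1]
# ... then invert the lag-12 seasonal difference; xi is the last 12
# values of the original training series.
f.level <- diffinv(f.seas, lag = 12, xi = tail(tscensus.train, 12))[-(1:12)]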
Overall, we have looked at three different models. As expected, the linear model provided a solid benchmark to compare the other models against, but it had its flaws.
For this assignment specifically, the ARIMA models performed best in terms of predictions and residual diagnostics.
Finally, the VAR model proved useful for examining the relationship between the two series, but it would likely be more impactful on a genuinely multivariate series measured together (like the uschange example) than on two univariate series from different sources stitched together (like the census and payroll data).