library(forecast)
## Warning: package 'forecast' was built under R version 4.0.4
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(urca)
## Warning: package 'urca' was built under R version 4.0.4
library(ggplot2)
library(TTR)
## Warning: package 'TTR' was built under R version 4.0.4
library(stats)
library(readr)
library(fpp)
## Warning: package 'fpp' was built under R version 4.0.5
## Loading required package: fma
## Warning: package 'fma' was built under R version 4.0.5
## Loading required package: expsmooth
## Warning: package 'expsmooth' was built under R version 4.0.5
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 4.0.4
library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
## 
## Attaching package: 'fpp2'
## The following objects are masked from 'package:fpp':
## 
##     ausair, ausbeer, austa, austourists, debitcards, departures,
##     elecequip, euretail, guinearice, oil, sunspotarea, usmelec
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
## 
## Attaching package: 'psych'
## The following object is masked from 'package:seasonal':
## 
##     outlier
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.0.5
library(vars)
## Warning: package 'vars' was built under R version 4.0.5
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following objects are masked from 'package:fma':
## 
##     cement, housing, petrol
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.0.5
## Loading required package: sandwich

I'm going to try VAR models here, along with some of the models I had more success with in previous assignments and discussions.

First, let's see what the book chose to model in Section 11.2 of fpp.

head(uschange[,1:2])
##         Consumption     Income
## 1970 Q1   0.6159862  0.9722610
## 1970 Q2   0.4603757  1.1690847
## 1970 Q3   0.8767914  1.5532705
## 1970 Q4  -0.2742451 -0.2552724
## 1971 Q1   1.8973708  1.9871536
## 1971 Q2   0.9119929  1.4473342
autoplot(uschange)

autoplot(uschange[,1:2])

Plotting, we can see that the uschange series already look stationary. Moving forward, we will need to difference our own data to make it stationary before any VAR models can be fitted. The book mentioned this earlier, but seeing it in the plots reinforced the idea for me.
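To back up the visual impression with a test, we can run a quick KPSS check on each series (a sketch using ur.kpss() from the urca package loaded above; a test statistic below the critical values supports stationarity):

# Sketch: KPSS unit root tests on each column of uschange
uschange[, "Consumption"] %>% ur.kpss() %>% summary()
uschange[, "Income"] %>% ur.kpss() %>% summary()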

VARselect(uschange[,1:2])
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      5      1      1      5 
## 
## $criteria
##                 1          2          3          4         5          6
## AIC(n) -1.3802920 -1.3724878 -1.4056544 -1.4183795 -1.436743 -1.4088572
## HQ(n)  -1.3366268 -1.2997125 -1.3037690 -1.2873839 -1.276637 -1.2196414
## SC(n)  -1.2726259 -1.1930443 -1.1544335 -1.0953812 -1.041967 -0.9423042
## FPE(n)  0.2515067  0.2534832  0.2452268  0.2421485  0.237777  0.2445520
##                 7          8          9         10
## AIC(n) -1.3878518 -1.3695542 -1.3293120 -1.2924024
## HQ(n)  -1.1695259 -1.1221182 -1.0527658 -0.9867461
## SC(n)  -0.8495213 -0.7594464 -0.6474268 -0.5387397
## FPE(n)  0.2498145  0.2545224  0.2650992  0.2752277
var1 <- VAR(uschange[,1:2], p=1, type="const")
serial.test(var1, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var1
## Chi-squared = 49.102, df = 36, p-value = 0.07144
var2 <- VAR(uschange[,1:2], p=2, type="const")
serial.test(var2, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var2
## Chi-squared = 47.741, df = 32, p-value = 0.03633
var3 <- VAR(uschange[,1:2], p=3, type="const")
serial.test(var3, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object var3
## Chi-squared = 33.617, df = 28, p-value = 0.2138

Following the book, we take note of the lag orders chosen by the AIC and the BIC (here 5 and 1, respectively). We then fit VAR models at increasing lag orders and test their residuals for serial correlation with the Portmanteau test above; the first lag order whose residuals show no significant autocorrelation is the one we keep.

Here in the uschange example, the residuals of the VAR(1) and VAR(2) models still showed some serial correlation, but the VAR(3) residuals did not, so the lag-3 VAR model was selected for analysis.
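The lag testing above could be compressed into a small loop (a sketch; I am assuming serial.test() stores the underlying htest object in its serial component, which is where the p-value lives):

# Sketch: fit VAR(p) for p = 1..3 and collect the Portmanteau p-values
for (p in 1:3) {
  fit <- VAR(uschange[, 1:2], p = p, type = "const")
  pval <- serial.test(fit, lags.pt = 10, type = "PT.asymptotic")$serial$p.value
  cat("p =", p, "Portmanteau p-value =", round(pval, 4), "\n")
}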

summary(var3)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: Consumption, Income 
## Deterministic variables: const 
## Sample size: 184 
## Log Likelihood: -380.155 
## Roots of the characteristic polynomial:
## 0.7821 0.4981 0.4981 0.4441 0.2857 0.2857
## Call:
## VAR(y = uschange[, 1:2], p = 3, type = "const")
## 
## 
## Estimation results for equation Consumption: 
## ============================================ 
## Consumption = Consumption.l1 + Income.l1 + Consumption.l2 + Income.l2 + Consumption.l3 + Income.l3 + const 
## 
##                Estimate Std. Error t value Pr(>|t|)    
## Consumption.l1  0.19100    0.07965   2.398 0.017518 *  
## Income.l1       0.07837    0.05453   1.437 0.152445    
## Consumption.l2  0.15954    0.08252   1.933 0.054792 .  
## Income.l2      -0.02706    0.05723  -0.473 0.636831    
## Consumption.l3  0.22646    0.07970   2.841 0.005018 ** 
## Income.l3      -0.01454    0.05425  -0.268 0.789055    
## const           0.29081    0.08300   3.504 0.000581 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.5955 on 177 degrees of freedom
## Multiple R-Squared: 0.2132,  Adjusted R-squared: 0.1865 
## F-statistic: 7.994 on 6 and 177 DF,  p-value: 1.216e-07 
## 
## 
## Estimation results for equation Income: 
## ======================================= 
## Income = Consumption.l1 + Income.l1 + Consumption.l2 + Income.l2 + Consumption.l3 + Income.l3 + const 
## 
##                Estimate Std. Error t value Pr(>|t|)    
## Consumption.l1  0.45349    0.11560   3.923 0.000125 ***
## Income.l1      -0.27303    0.07915  -3.450 0.000702 ***
## Consumption.l2  0.02167    0.11977   0.181 0.856663    
## Income.l2      -0.09005    0.08306  -1.084 0.279788    
## Consumption.l3  0.35377    0.11568   3.058 0.002572 ** 
## Income.l3      -0.05376    0.07875  -0.683 0.495699    
## const           0.38750    0.12047   3.217 0.001543 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.8643 on 177 degrees of freedom
## Multiple R-Squared: 0.1755,  Adjusted R-squared: 0.1475 
## F-statistic: 6.279 on 6 and 177 DF,  p-value: 5.309e-06 
## 
## 
## 
## Covariance matrix of residuals:
##             Consumption Income
## Consumption      0.3546 0.1845
## Income           0.1845 0.7470
## 
## Correlation matrix of residuals:
##             Consumption Income
## Consumption      1.0000 0.3585
## Income           0.3585 1.0000
forecast(var3) %>%
  autoplot() + xlab("Year")

Referring to the correlation matrix of residuals, we can see a correlation of about 0.36 between the two series. Adjusted R-squared is not very high for either equation, but I don't imagine that is the main metric on which VAR models are judged.

For this example, since the series are already stationary, the forecasts make sense as they stand: they are forecasts of the changes themselves. For non-stationary data that we had to difference first, however, the raw forecasts would be on the differenced scale and would not mean much on their own.

To recover forecasts on the original scale, we add each point forecast of the differences onto the last known value:

Y(T+1) = Y(T) + f(1), Y(T+2) = Y(T+1) + f(2), and so on,

where Y(T) is the last observed value and f(h) is the h-step-ahead forecast of the differenced series.
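As a tiny worked example with made-up numbers (a sketch, not from the book):

# Sketch: undifferencing three hypothetical forecasts of the differences
last.obs <- 100                  # hypothetical last known value Y(T)
f <- c(2, -1, 3)                 # hypothetical forecasts of the differences
last.obs + cumsum(f)             # level forecasts: 102, 101, 104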

Let's move on to our own time series. The three models I will use are a standard linear model, an ARIMA (or ETS) model, and a VAR model.

We need to select a time series to work with. I will use the dataset I have been working with for the past few discussions: the Census retail and sales data. Matt was the first to notice in the discussions that his data and mine looked very similar, so I found nonfarm payroll data on the BLS website and retrieved it for the same time period as my retail and sales data.

My initial thought is that these two series are very likely closely related, as sales and employment go hand in hand (more sales -> more employees). I imagine the effect works in the other direction as well: businesses should know when their busy seasons are and hire more employees accordingly (more employees -> more sales).

census <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\DISCUSSIONS\\WEEK 2\\censusunadj.csv")
payroll <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\DISCUSSIONS\\WEEK 2\\payrollunadj.csv")
summary(census)
##     Period             Value          
##  Length:360         Length:360        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
summary(payroll)
##     Period              Value       
##  Length:360         Min.   :106723  
##  Class :character   1st Qu.:128279  
##  Mode  :character   Median :132141  
##                     Mean   :132092  
##                     3rd Qu.:138437  
##                     Max.   :153095  
##                     NA's   :11
census$Value <- parse_number(census$Value)
summary(census$Value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  146376  250783  339141  339835  418608  614863      11
tscensus <- ts(census$Value, frequency = 12, start = c(1992, 1), end = c(2021, 1))
autoplot(tscensus)

tspayroll <- ts(payroll$Value, frequency = 12, start = c(1992, 1), end = c(2021, 1))
autoplot(tspayroll)

We can see that even though these two time series are not on the same scale, they are both highly seasonal and follow almost exactly the same trend. My initial guess is that the seasonal patterns of the two series should be very similar.
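Since the two series sit on very different scales, a quick way to compare their co-movement on one set of axes is to standardize each one first (a sketch; std() is a small helper I am defining here, with na.rm guarding against the NA's noted in the summaries):

# Sketch: overlay both series standardized to mean 0, sd 1
std <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
autoplot(cbind(Census = std(tscensus), Payroll = std(tspayroll)))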

I think we should also create some training and test sets to work with. I have done so below.

census.train <- census[1:280,]
census.test <- census[281:349, ]
summary(census.train)
##     Period              Value       
##  Length:280         Min.   :146376  
##  Class :character   1st Qu.:231452  
##  Mode  :character   Median :305646  
##                     Mean   :302419  
##                     3rd Qu.:368741  
##                     Max.   :502330
tscensus.train <- ts(census.train$Value, frequency = 12, start = c(1992, 1))
autoplot(tscensus.train)

tscensus.test <- ts(census.test$Value, frequency = 12, start = c(2015, 5))
autoplot(tscensus.test)

Above are the census training and test sets; below are the payroll training and test sets.

payroll.train <- payroll[1:280,]
payroll.test <- payroll[281:349, ]
summary(payroll.train)
##     Period              Value       
##  Length:280         Min.   :106723  
##  Class :character   1st Qu.:123927  
##  Mode  :character   Median :131376  
##                     Mean   :128647  
##                     3rd Qu.:134684  
##                     Max.   :141312
tspayroll.train <- ts(payroll.train$Value, frequency = 12, start = c(1992, 1))
autoplot(tspayroll.train)

tspayroll.test <- ts(payroll.test$Value, frequency = 12, start = c(2015, 5))
autoplot(tspayroll.test)

We have now created an 80/20 split of training and test data, which we can use to evaluate the accuracy of our models.
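As an aside, window() gives an equivalent split directly on the ts objects (a sketch; the training set ends April 2015, matching row 280):

# Sketch: train/test split using window() instead of row indexing
tscensus.train.alt <- window(tscensus, end = c(2015, 4))
tscensus.test.alt <- window(tscensus, start = c(2015, 5))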

Let's start off with a simple linear model. This should give us a baseline to work with.

LMcensus <- tslm(tscensus.train ~ trend + season)
summary(LMcensus)
## 
## Call:
## tslm(formula = tscensus.train ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41206  -6889   1012   7248  33101 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 136208.37    3263.57  41.736  < 2e-16 ***
## trend          967.25      10.56  91.588  < 2e-16 ***
## season2      -1714.00    4122.15  -0.416    0.678    
## season3      33303.05    4122.19   8.079 2.27e-14 ***
## season4      27727.47    4122.26   6.726 1.05e-10 ***
## season5      43262.35    4166.75  10.383  < 2e-16 ***
## season6      34802.80    4166.71   8.353 3.67e-15 ***
## season7      34282.12    4166.70   8.228 8.47e-15 ***
## season8      40716.74    4166.71   9.772  < 2e-16 ***
## season9      18709.02    4166.75   4.490 1.06e-05 ***
## season10     27298.73    4166.82   6.551 2.92e-10 ***
## season11     29640.78    4166.92   7.113 1.04e-11 ***
## season12     78406.27    4167.04  18.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14280 on 267 degrees of freedom
## Multiple R-squared:  0.9711, Adjusted R-squared:  0.9698 
## F-statistic: 747.4 on 12 and 267 DF,  p-value: < 2.2e-16
LMpayroll <- tslm(tspayroll.train ~ trend + season)
summary(LMpayroll)
## 
## Call:
## tslm(formula = tspayroll.train ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7776.9 -4081.6   546.2  3815.6  7048.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.136e+05  1.024e+03 111.023   <2e-16 ***
## trend       9.150e+01  3.312e+00  27.622   <2e-16 ***
## season2     5.763e+02  1.293e+03   0.446   0.6562    
## season3     1.234e+03  1.293e+03   0.954   0.3408    
## season4     2.047e+03  1.293e+03   1.583   0.1145    
## season5     2.810e+03  1.307e+03   2.150   0.0325 *  
## season6     3.195e+03  1.307e+03   2.445   0.0151 *  
## season7     1.909e+03  1.307e+03   1.461   0.1452    
## season8     1.964e+03  1.307e+03   1.503   0.1340    
## season9     2.540e+03  1.307e+03   1.943   0.0530 .  
## season10    3.179e+03  1.307e+03   2.432   0.0157 *  
## season11    3.370e+03  1.307e+03   2.579   0.0105 *  
## season12    3.123e+03  1.307e+03   2.390   0.0176 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4479 on 267 degrees of freedom
## Multiple R-squared:  0.7451, Adjusted R-squared:  0.7337 
## F-statistic: 65.04 on 12 and 267 DF,  p-value: < 2.2e-16

We can see immediately that nearly all of the variation in the census training data is explained by trend and seasonality. The payroll data is not as good a fit based on R-squared, but the model still has decent explanatory power. Since these are both univariate time series, there is little more we can add to the regression, so we will move on to the forecasts and comparison.

fc.census <- forecast(LMcensus, h = 68)
autoplot(fc.census)

fc.payroll <- forecast(LMpayroll, h = 68)
autoplot(fc.payroll)

We can immediately note that our forecasts are conservative compared with the actual series and tend to underestimate over the forecast horizon. We also note that neither forecast was able to predict the huge dip caused by COVID-19 (as expected).

accuracy(fc.census, tscensus.test)
##                        ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 2.490697e-12 13944.07 10543.19 -0.2180549 3.400158 0.6737691
## Test set     1.939199e+04 29404.08 23357.40  3.6234689 4.567135 1.4926688
##                   ACF1 Theil's U
## Training set 0.8104550        NA
## Test set     0.4790005 0.6400528
autoplot(tscensus.test) + autolayer(fc.census, series = "FORECAST", PI = FALSE)

accuracy(fc.payroll, tspayroll.test)
##                        ME     RMSE      MAE        MPE     MAPE     MASE
## Training set 1.090311e-12 4373.609 3910.077 -0.1290731 3.079639 1.695887
## Test set     1.492072e+03 4628.050 3697.708  0.9398905 2.549012 1.603778
##                   ACF1 Theil's U
## Training set 0.9937427        NA
## Test set     0.8054812  1.745012
autoplot(tspayroll.test) + autolayer(fc.payroll, series = "FORECAST", PI = FALSE)

If we go by the MASE, our linear model for census performs well on the training set, though less well out of sample. The payroll forecast didn't do as well, but it still wasn't too bad. Examining the graphs, the forecasts capture the timing of the peaks and troughs, but the point forecasts miss the level.

Interestingly, I believe the 80% and 95% prediction intervals do capture the true values within their limits. I imagine there is some method for bringing the point estimates closer, but I will save that for another time.
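A rough way to check that intuition (a sketch; the "95%" column of a forecast object's lower/upper matrices holds the 95% interval, and na.rm guards against any NA's in the test window; the same check works for fc.payroll):

# Sketch: fraction of test observations inside the 95% interval
h <- length(fc.census$mean)
actual <- head(as.numeric(tscensus.test), h)
mean(actual >= fc.census$lower[, "95%"] & actual <= fc.census$upper[, "95%"], na.rm = TRUE)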

checkresiduals(LMcensus)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 236.79, df = 24, p-value < 2.2e-16
checkresiduals(LMpayroll)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals from Linear regression model
## LM test = 276.87, df = 24, p-value < 2.2e-16

A check on the residuals indicates that there is still significant autocorrelation in our data. What could we do? Differencing the data would be one solution, since the residual pattern essentially traces the trend. Honestly, I am a little unsure of what to do in this situation.

How much of a problem are the residuals if the forecast values seem decent? Autocorrelated residuals mean the model has left predictable information in the data and that its prediction intervals will be too narrow, even if the point forecasts look reasonable. That also explains the apparent contradiction with the R-squared values (especially for the census model): with strongly trending data, the trend term alone explains most of the variance, so R-squared can be high while the residuals are still far from white noise.

Next, we will consider an ARIMA model. First, let's check the ACF and PACF plots.

ggtsdisplay(tscensus.train)

ndiffs(tscensus.train)
## [1] 1
ggtsdisplay(tspayroll.train)

ndiffs(tspayroll.train)
## [1] 1

We could expect the ACF and PACF plots to look this way just based on how much apparent seasonality and trend there is in each series. When ndiffs() is run on both series, it suggests differencing once. Additionally, we should take a seasonal difference to deal with the seasonality. I will let auto.arima() handle the differencing in the models themselves, as I am still a bit shaky on doing it manually, but I will inspect the manually differenced series first.

ndiffs(tscensus.train)
## [1] 1
nsdiffs(tscensus.train)
## [1] 1
tscensus.train.diff.tr <- tscensus.train %>% diff(lag=12) %>% diff(1)
tscensus.train.diff.tr %>% ggtsdisplay()

tscensus.train.diff.tr %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0281 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

So there is still some structure left in the ACF and PACF plots, but it looks much better than the initial plots without differencing. The KPSS test statistic (0.0281) is well below even the 10% critical value (0.347), so we fail to reject stationarity and do not need to difference further.

ndiffs(tspayroll.train)
## [1] 1
nsdiffs(tspayroll.train)
## [1] 1
tspayroll.train.tr <- tspayroll.train %>% diff(lag = 12) %>% diff(1)
tspayroll.train.tr %>% ggtsdisplay()

tspayroll.train.tr %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0969 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The differenced payroll series' ACF/PACF plots suggest that we might need to difference again, even though the KPSS statistic itself (0.0969) sits below all the critical values.

tspayroll.train.tr.2 <- diff(tspayroll.train.tr, differences = 1)
tspayroll.train.tr.2 %>% ggtsdisplay()

The double differenced data looks much more stationary.

Moving forward, once we use auto.arima we can most likely expect the d values of census and payroll to be 1 and 2, respectively.

arima.census <- auto.arima(tscensus.train)
summary(arima.census)
## Series: tscensus.train 
## ARIMA(1,0,1)(2,1,2)[12] with drift 
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1    sma2     drift
##       0.9587  -0.3631  0.8533  -0.6285  -1.2741  0.6191  996.2404
## s.e.  0.0192   0.0588  0.0901   0.0708   0.1091  0.0849  195.0460
## 
## sigma^2 estimated as 35091431:  log likelihood=-2711.61
## AIC=5439.22   AICc=5439.78   BIC=5467.95
## 
## Training set error measures:
##                    ME     RMSE     MAE         MPE     MAPE      MASE
## Training set 45.13787 5719.285 4343.91 -0.03016123 1.432709 0.2776002
##                     ACF1
## Training set -0.00920604
arima.payroll <- auto.arima(tspayroll.train)
summary(arima.payroll)
## Series: tspayroll.train 
## ARIMA(1,1,2)(0,1,2)[12] 
## 
## Coefficients:
##          ar1      ma1     ma2     sma1     sma2
##       0.9277  -0.6363  0.1222  -0.6206  -0.1076
## s.e.  0.0305   0.0677  0.0607   0.0650   0.0587
## 
## sigma^2 estimated as 26892:  log likelihood=-1739.43
## AIC=3490.85   AICc=3491.18   BIC=3512.38
## 
## Training set error measures:
##                      ME     RMSE      MAE          MPE       MAPE       MASE
## Training set -0.0984124 158.6296 122.0024 0.0003118459 0.09498862 0.05291516
##                     ACF1
## Training set -0.01987173
checkresiduals(arima.census)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(2,1,2)[12] with drift
## Q* = 53.028, df = 17, p-value = 1.414e-05
## 
## Model df: 7.   Total lags used: 24
checkresiduals(arima.payroll)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)(0,1,2)[12]
## Q* = 19.981, df = 19, p-value = 0.3957
## 
## Model df: 5.   Total lags used: 24

Although there is still some autocorrelation, as indicated by the ACF plot and the Ljung-Box test, the residuals for our census data look much better than at the start, or even after the manual differencing we attempted earlier. Far fewer lags fall outside the significance bounds, and the oscillation in the ACF is less pronounced than before.

The residual check for payroll also shows an improvement over the previous plots. The Ljung-Box test and the supporting ACF plot indicate that the payroll residuals are consistent with white noise. In both histograms there is a long tail on the left (a few large negative residuals); since the training set ends in early 2015, this is not a COVID-19 effect.

It looks as if auto.arima() selected different values of d than we expected: 0 rather than 1 for census, and 1 rather than 2 for payroll, in both cases with one seasonal difference. Looking at the error measures, the MASE for both models is well below the seasonal naive benchmark of 1, so we will work with these models for now and see where it goes.
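For context on those MASE numbers, we can score the seasonal naive benchmark itself on the same split (a sketch using snaive() from the forecast package; fc.snaive.census is a name I am introducing here):

# Sketch: seasonal naive benchmark, the method with MASE = 1 on training
fc.snaive.census <- snaive(tscensus.train, h = 68)
accuracy(fc.snaive.census, tscensus.test)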

fc.arima.census <- forecast(arima.census, h = 68)
autoplot(fc.arima.census)

accuracy(fc.arima.census, tscensus.test)
##                       ME      RMSE      MAE         MPE     MAPE      MASE
## Training set    45.13787  5719.285  4343.91 -0.03016123 1.432709 0.2776002
## Test set     14782.61556 24298.401 19623.94  2.80144641 3.912327 1.2540802
##                     ACF1 Theil's U
## Training set -0.00920604        NA
## Test set      0.54478048 0.5377576
fc.arima.payroll <- forecast(arima.payroll,  h = 68)
autoplot(fc.arima.payroll)

accuracy(fc.arima.payroll, tspayroll.test)
##                         ME      RMSE       MAE           MPE       MAPE
## Training set    -0.0984124  158.6296  122.0024  0.0003118459 0.09498862
## Test set     -1344.1529845 5085.9715 2200.8522 -0.9966139628 1.57218951
##                    MASE        ACF1 Theil's U
## Training set 0.05291516 -0.01987173        NA
## Test set     0.95455852  0.83260599  1.975938

The plotted forecasts and accuracy measures indicate that our ARIMA models performed well in terms of predictive ability. I will plot the predicted values against the actual values below.

autoplot(tscensus.test) + autolayer(fc.arima.census, series = "ARIMA forecasts census", PI = FALSE)

autoplot(tspayroll.test) + autolayer(fc.arima.payroll, series = "ARIMA forecasts payroll", PI = FALSE)

We can see that the predictions for both payroll and census trace the real test values remarkably well. The payroll predictions stay accurate further out than the census predictions, and once again the huge dip caused by COVID-19 was not predicted by either model, although I believe the lower prediction intervals on the payroll forecasts came close to capturing it.

Finally, we have reached the model that made me choose these two time series in the first place. My initial hunch was that payroll and sales data are very closely related, so I will test that in this section with the VAR model.

Let us begin by differencing the data to make both series stationary.

ndiffs(tscensus.train)
## [1] 1
nsdiffs(tscensus.train)
## [1] 1
tscensus.train.diff <- tscensus.train %>% diff(lag = 12) %>% diff(1)
autoplot(tscensus.train.diff)

ndiffs(tspayroll.train)
## [1] 1
nsdiffs(tspayroll.train)
## [1] 1
tspayroll.train.tr <- tspayroll.train %>% diff(lag = 12) %>% diff(1)
tspayroll.train.tr.2 <- tspayroll.train.tr %>% diff(1)
autoplot(tspayroll.train.tr)

autoplot(tspayroll.train.tr.2)

The plots above show the differenced data with mean zero and relatively constant variance. We will now use VARselect() to test lag orders for our model. First, however, we need to join the two series together to get a multivariate time series to work with.

I bound the double-differenced payroll data here, since the single-differenced version did not look fully stationary. Binding series of unequal length after differencing leaves NA's in the combined data frame, which VARselect() will not accept. I have included the plot of the single-differenced payroll series below as a visual reference instead.

The fix is to drop the first row of the differenced census data so that the columns match; that way both series in the VAR are complete and both are stationary. I will try that rather than feeding in a series that isn't stationary.

autoplot(tspayroll.train.tr)

agg.census.payroll <- cbind(tscensus.train.diff, tspayroll.train.tr.2)
agg.census.payroll.tr <- agg.census.payroll[-1, ]
VARselect(agg.census.payroll.tr)[["selection"]]
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##     10      3      2     10
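As an aside, na.omit() on the bound object achieves the same row alignment without indexing by hand (a sketch; it drops any row containing an NA):

# Sketch: automatic NA-row removal after binding series of unequal length
agg.alt <- na.omit(cbind(tscensus.train.diff, tspayroll.train.tr.2))
VARselect(agg.alt)[["selection"]]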

We can use the Portmanteau test to determine the number of lags necessary (it should be between 2 and 10).

census.2 <- VAR(agg.census.payroll.tr, p = 2, type = "both")
serial.test(census.2, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.2
## Chi-squared = 88.788, df = 32, p-value = 3.059e-07
census.3 <- VAR(agg.census.payroll.tr, p = 3, type = "both")
serial.test(census.3, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.3
## Chi-squared = 63.748, df = 28, p-value = 0.0001324
census.4 <- VAR(agg.census.payroll.tr, p = 4, type = "both")
serial.test(census.4, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.4
## Chi-squared = 66.189, df = 24, p-value = 8.132e-06
census.5 <- VAR(agg.census.payroll.tr, p = 5, type = "both")
serial.test(census.5, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.5
## Chi-squared = 63.494, df = 20, p-value = 2.023e-06
census.6 <- VAR(agg.census.payroll.tr, p = 6, type = "both")
serial.test(census.6, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.6
## Chi-squared = 39.874, df = 16, p-value = 0.0008122
census.7 <- VAR(agg.census.payroll.tr, p = 7, type = "both")
serial.test(census.7, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.7
## Chi-squared = 35.859, df = 12, p-value = 0.0003413
census.8 <- VAR(agg.census.payroll.tr, p = 8, type = "both")
serial.test(census.8, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.8
## Chi-squared = 34.59, df = 8, p-value = 3.175e-05
census.9 <- VAR(agg.census.payroll.tr, p = 9, type = "both")
serial.test(census.9, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.9
## Chi-squared = 23.133, df = 4, p-value = 0.0001191
census.10 <- VAR(agg.census.payroll.tr, p = 10, type = "both")
serial.test(census.10, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object census.10
## Chi-squared = 11.883, df = 0, p-value < 2.2e-16

Based on the above, the null hypothesis of no autocorrelation is rejected at every lag order tested. I will use the lag order whose test had the largest p-value, which was lag 6.

summary(census.6)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: tscensus.train.diff, tspayroll.train.tr.2 
## Deterministic variables: both 
## Sample size: 260 
## Log Likelihood: -4364.04 
## Roots of the characteristic polynomial:
## 0.8305 0.8305 0.7447 0.7447 0.7427 0.7427 0.738 0.738 0.7048 0.7048 0.6449 0.5185
## Call:
## VAR(y = agg.census.payroll.tr, p = 6, type = "both")
## 
## 
## Estimation results for equation tscensus.train.diff: 
## ==================================================== 
## tscensus.train.diff = tscensus.train.diff.l1 + tspayroll.train.tr.2.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.2.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.2.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.2.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.2.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.2.l6 + const + trend 
## 
##                          Estimate Std. Error t value Pr(>|t|)    
## tscensus.train.diff.l1   -0.45492    0.06595  -6.898 4.43e-11 ***
## tspayroll.train.tr.2.l1   2.25235    2.37049   0.950  0.34296    
## tscensus.train.diff.l2   -0.24238    0.07609  -3.186  0.00163 ** 
## tspayroll.train.tr.2.l2   8.06399    2.94114   2.742  0.00656 ** 
## tscensus.train.diff.l3    0.07047    0.07909   0.891  0.37385    
## tspayroll.train.tr.2.l3   4.37463    3.09929   1.411  0.15936    
## tscensus.train.diff.l4   -0.04633    0.07934  -0.584  0.55985    
## tspayroll.train.tr.2.l4   0.32359    3.10242   0.104  0.91701    
## tscensus.train.diff.l5    0.06017    0.07500   0.802  0.42317    
## tspayroll.train.tr.2.l5   3.66592    2.98125   1.230  0.22000    
## tscensus.train.diff.l6    0.10250    0.06661   1.539  0.12513    
## tspayroll.train.tr.2.l6   5.56962    2.38028   2.340  0.02009 *  
## const                    60.97198  865.25629   0.070  0.94388    
## trend                    -0.59589    5.55446  -0.107  0.91465    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 6719 on 246 degrees of freedom
## Multiple R-Squared: 0.3069,  Adjusted R-squared: 0.2702 
## F-statistic: 8.378 on 13 and 246 DF,  p-value: 5.497e-14 
## 
## 
## Estimation results for equation tspayroll.train.tr.2: 
## ===================================================== 
## tspayroll.train.tr.2 = tscensus.train.diff.l1 + tspayroll.train.tr.2.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.2.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.2.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.2.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.2.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.2.l6 + const + trend 
## 
##                           Estimate Std. Error t value Pr(>|t|)    
## tscensus.train.diff.l1   0.0071716  0.0018716   3.832 0.000162 ***
## tspayroll.train.tr.2.l1 -0.7295212  0.0672735 -10.844  < 2e-16 ***
## tscensus.train.diff.l2   0.0039600  0.0021593   1.834 0.067874 .  
## tspayroll.train.tr.2.l2 -0.3334082  0.0834685  -3.994 8.57e-05 ***
## tscensus.train.diff.l3   0.0066551  0.0022447   2.965 0.003326 ** 
## tspayroll.train.tr.2.l3 -0.1959892  0.0879568  -2.228 0.026769 *  
## tscensus.train.diff.l4   0.0003201  0.0022517   0.142 0.887064    
## tspayroll.train.tr.2.l4 -0.1457511  0.0880455  -1.655 0.099117 .  
## tscensus.train.diff.l5  -0.0006282  0.0021285  -0.295 0.768138    
## tspayroll.train.tr.2.l5 -0.0873329  0.0846069  -1.032 0.302983    
## tscensus.train.diff.l6  -0.0023041  0.0018903  -1.219 0.224043    
## tspayroll.train.tr.2.l6  0.0609952  0.0675513   0.903 0.367438    
## const                   -2.4984605 24.5556512  -0.102 0.919040    
## trend                    0.0118943  0.1576336   0.075 0.939914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 190.7 on 246 degrees of freedom
## Multiple R-Squared: 0.385,   Adjusted R-squared: 0.3525 
## F-statistic: 11.85 on 13 and 246 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                      tscensus.train.diff tspayroll.train.tr.2
## tscensus.train.diff             45148618               434707
## tspayroll.train.tr.2              434707                36363
## 
## Correlation matrix of residuals:
##                      tscensus.train.diff tspayroll.train.tr.2
## tscensus.train.diff               1.0000               0.3393
## tspayroll.train.tr.2              0.3393               1.0000

Let me try something else here: a VAR on the untransformed, non-differenced series.

ag.cen.pay <- cbind(tscensus.train, tspayroll.train)

VARselect(ag.cen.pay)[["selection"]]
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##     10     10     10     10

Here every criterion tells us to use lag 10. Since 10 is the maximum lag order VARselect() considers by default, the true optimum might be even higher.
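One way to check (a sketch): raise lag.max above its default of 10 and see whether the criteria keep preferring the boundary. Either way, I will proceed with lag 10 below.

# Sketch: let VARselect() consider lag orders beyond the default maximum
VARselect(ag.cen.pay, lag.max = 15)[["selection"]]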

ag.10 <- VAR(ag.cen.pay, p = 10, type = "both")
serial.test(ag.10, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object ag.10
## Chi-squared = 196.28, df = 0, p-value < 2.2e-16
summary(ag.10)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: tscensus.train, tspayroll.train 
## Deterministic variables: both 
## Sample size: 270 
## Log Likelihood: -4691.814 
## Roots of the characteristic polynomial:
## 0.9979 0.9979 0.9963 0.9963 0.9955 0.9955 0.9937 0.9937 0.9859 0.9674 0.9674 0.9312 0.9194 0.9194 0.9091 0.9091 0.8311 0.8311 0.6927 0.431
## Call:
## VAR(y = ag.cen.pay, p = 10, type = "both")
## 
## 
## Estimation results for equation tscensus.train: 
## =============================================== 
## tscensus.train = tscensus.train.l1 + tspayroll.train.l1 + tscensus.train.l2 + tspayroll.train.l2 + tscensus.train.l3 + tspayroll.train.l3 + tscensus.train.l4 + tspayroll.train.l4 + tscensus.train.l5 + tspayroll.train.l5 + tscensus.train.l6 + tspayroll.train.l6 + tscensus.train.l7 + tspayroll.train.l7 + tscensus.train.l8 + tspayroll.train.l8 + tscensus.train.l9 + tspayroll.train.l9 + tscensus.train.l10 + tspayroll.train.l10 + const + trend 
## 
##                       Estimate Std. Error t value Pr(>|t|)    
## tscensus.train.l1   -3.091e-02  5.722e-02  -0.540 0.589547    
## tspayroll.train.l1   1.215e+01  1.636e+00   7.425 1.81e-12 ***
## tscensus.train.l2    9.829e-02  6.333e-02   1.552 0.121910    
## tspayroll.train.l2  -3.363e+00  3.074e+00  -1.094 0.275080    
## tscensus.train.l3    5.667e-01  7.267e-02   7.799 1.73e-13 ***
## tspayroll.train.l3  -7.472e+00  3.121e+00  -2.394 0.017397 *  
## tscensus.train.l4    4.300e-02  7.777e-02   0.553 0.580815    
## tspayroll.train.l4   8.152e-01  2.585e+00   0.315 0.752776    
## tscensus.train.l5    2.893e-01  6.744e-02   4.289 2.57e-05 ***
## tspayroll.train.l5  -1.748e+01  1.426e+00 -12.251  < 2e-16 ***
## tscensus.train.l6   -6.097e-01  6.239e-02  -9.772  < 2e-16 ***
## tspayroll.train.l6   2.842e+01  1.708e+00  16.636  < 2e-16 ***
## tscensus.train.l7    5.184e-01  7.573e-02   6.845 5.97e-11 ***
## tspayroll.train.l7  -1.910e+01  2.461e+00  -7.764 2.16e-13 ***
## tscensus.train.l8   -1.037e-01  7.286e-02  -1.424 0.155704    
## tspayroll.train.l8   1.096e+01  2.814e+00   3.896 0.000126 ***
## tscensus.train.l9    5.950e-01  6.172e-02   9.639  < 2e-16 ***
## tspayroll.train.l9  -1.255e+01  2.757e+00  -4.554 8.27e-06 ***
## tscensus.train.l10  -5.359e-01  6.305e-02  -8.499 1.81e-15 ***
## tspayroll.train.l10  7.972e+00  1.734e+00   4.597 6.83e-06 ***
## const               -1.121e+04  1.432e+04  -0.783 0.434400    
## trend                1.394e+02  5.202e+01   2.679 0.007873 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 7980 on 248 degrees of freedom
## Multiple R-Squared: 0.9906,  Adjusted R-squared: 0.9898 
## F-statistic:  1246 on 21 and 248 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation tspayroll.train: 
## ================================================ 
## tspayroll.train = tscensus.train.l1 + tspayroll.train.l1 + tscensus.train.l2 + tspayroll.train.l2 + tscensus.train.l3 + tspayroll.train.l3 + tscensus.train.l4 + tspayroll.train.l4 + tscensus.train.l5 + tspayroll.train.l5 + tscensus.train.l6 + tspayroll.train.l6 + tscensus.train.l7 + tspayroll.train.l7 + tscensus.train.l8 + tspayroll.train.l8 + tscensus.train.l9 + tspayroll.train.l9 + tscensus.train.l10 + tspayroll.train.l10 + const + trend 
## 
##                       Estimate Std. Error t value Pr(>|t|)    
## tscensus.train.l1   -2.162e-02  2.194e-03  -9.857  < 2e-16 ***
## tspayroll.train.l1   1.715e+00  6.271e-02  27.343  < 2e-16 ***
## tscensus.train.l2    2.159e-02  2.428e-03   8.894  < 2e-16 ***
## tspayroll.train.l2  -4.588e-01  1.179e-01  -3.893 0.000128 ***
## tscensus.train.l3    5.894e-04  2.786e-03   0.212 0.832602    
## tspayroll.train.l3  -6.284e-01  1.196e-01  -5.253 3.23e-07 ***
## tscensus.train.l4    6.077e-04  2.982e-03   0.204 0.838650    
## tspayroll.train.l4   1.816e-01  9.911e-02   1.833 0.068035 .  
## tscensus.train.l5   -1.017e-02  2.585e-03  -3.935 0.000108 ***
## tspayroll.train.l5   1.866e-01  5.469e-02   3.412 0.000752 ***
## tscensus.train.l6    2.954e-03  2.392e-03   1.235 0.218005    
## tspayroll.train.l6   6.863e-01  6.549e-02  10.480  < 2e-16 ***
## tscensus.train.l7   -1.244e-03  2.903e-03  -0.428 0.668723    
## tspayroll.train.l7  -1.164e+00  9.433e-02 -12.341  < 2e-16 ***
## tscensus.train.l8   -7.921e-03  2.793e-03  -2.836 0.004947 ** 
## tspayroll.train.l8   3.127e-01  1.079e-01   2.899 0.004084 ** 
## tscensus.train.l9    8.959e-03  2.366e-03   3.786 0.000192 ***
## tspayroll.train.l9   4.823e-01  1.057e-01   4.563 7.94e-06 ***
## tscensus.train.l10   6.703e-03  2.417e-03   2.773 0.005977 ** 
## tspayroll.train.l10 -3.231e-01  6.648e-02  -4.859 2.09e-06 ***
## const                1.209e+03  5.491e+02   2.202 0.028615 *  
## trend                3.838e-01  1.994e+00   0.192 0.847539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 305.9 on 248 degrees of freedom
## Multiple R-Squared: 0.9986,  Adjusted R-squared: 0.9985 
## F-statistic:  8533 on 21 and 248 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                 tscensus.train tspayroll.train
## tscensus.train        63674942          958724
## tspayroll.train         958724           93586
## 
## Correlation matrix of residuals:
##                 tscensus.train tspayroll.train
## tscensus.train          1.0000          0.3927
## tspayroll.train         0.3927          1.0000
d.ag.cen.pay <- cbind(tscensus.train.diff, tspayroll.train.tr)
VARselect(d.ag.cen.pay)[["selection"]]
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      9      4      3      9

Above is the lag selection for the data frame in which both series received the same differencing (one seasonal difference and one first difference).

dag.3 <- VAR(d.ag.cen.pay, p=3, type="both")
serial.test(dag.3, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.3
## Chi-squared = 61.379, df = 28, p-value = 0.0002707
dag.4 <- VAR(d.ag.cen.pay, p=4, type="both")
serial.test(dag.4, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.4
## Chi-squared = 47.264, df = 24, p-value = 0.00311
dag.5 <- VAR(d.ag.cen.pay, p=5, type="both")
serial.test(dag.5, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.5
## Chi-squared = 39.895, df = 20, p-value = 0.005151
dag.6 <- VAR(d.ag.cen.pay, p=6, type="both")
serial.test(dag.6, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.6
## Chi-squared = 32.425, df = 16, p-value = 0.008799
dag.7 <- VAR(d.ag.cen.pay, p=7, type="both")
serial.test(dag.7, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.7
## Chi-squared = 30.534, df = 12, p-value = 0.002319
dag.8 <- VAR(d.ag.cen.pay, p=8, type="both")
serial.test(dag.8, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.8
## Chi-squared = 27.312, df = 8, p-value = 0.0006244
dag.9 <- VAR(d.ag.cen.pay, p=9, type="both")
serial.test(dag.9, lags.pt=10, type="PT.asymptotic")
## 
##  Portmanteau Test (asymptotic)
## 
## data:  Residuals of VAR object dag.9
## Chi-squared = 18.136, df = 4, p-value = 0.001161

Once again, the null hypothesis of no autocorrelation is rejected in every single test we ran. I will use lag = 6 because it gave the largest p-value of all the tests.

summary(dag.6)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: tscensus.train.diff, tspayroll.train.tr 
## Deterministic variables: both 
## Sample size: 261 
## Log Likelihood: -4358.2 
## Roots of the characteristic polynomial:
## 0.8357 0.8357 0.7832 0.7832 0.7456 0.7456 0.7361 0.6968 0.6968 0.6421 0.5767 0.5767
## Call:
## VAR(y = d.ag.cen.pay, p = 6, type = "both")
## 
## 
## Estimation results for equation tscensus.train.diff: 
## ==================================================== 
## tscensus.train.diff = tscensus.train.diff.l1 + tspayroll.train.tr.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.l6 + const + trend 
## 
##                         Estimate Std. Error t value Pr(>|t|)    
## tscensus.train.diff.l1  -0.45078    0.06833  -6.597 2.54e-10 ***
## tspayroll.train.tr.l1    1.73684    2.61992   0.663  0.50799    
## tscensus.train.diff.l2  -0.24297    0.08321  -2.920  0.00382 ** 
## tspayroll.train.tr.l2    5.85857    2.60412   2.250  0.02535 *  
## tscensus.train.diff.l3   0.09252    0.08815   1.050  0.29493    
## tspayroll.train.tr.l3   -4.01392    2.65596  -1.511  0.13199    
## tscensus.train.diff.l4  -0.00491    0.09114  -0.054  0.95708    
## tspayroll.train.tr.l4   -4.86605    2.60887  -1.865  0.06334 .  
## tscensus.train.diff.l5   0.09128    0.08554   1.067  0.28696    
## tspayroll.train.tr.l5    1.38382    2.48433   0.557  0.57802    
## tscensus.train.diff.l6   0.15134    0.06964   2.173  0.03072 *  
## tspayroll.train.tr.l6    0.50793    2.39315   0.212  0.83209    
## const                   38.07637  873.31820   0.044  0.96526    
## trend                   -0.52358    5.59242  -0.094  0.92549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 6779 on 247 degrees of freedom
## Multiple R-Squared: 0.2916,  Adjusted R-squared: 0.2543 
## F-statistic: 7.822 on 13 and 247 DF,  p-value: 5.241e-13 
## 
## 
## Estimation results for equation tspayroll.train.tr: 
## =================================================== 
## tspayroll.train.tr = tscensus.train.diff.l1 + tspayroll.train.tr.l1 + tscensus.train.diff.l2 + tspayroll.train.tr.l2 + tscensus.train.diff.l3 + tspayroll.train.tr.l3 + tscensus.train.diff.l4 + tspayroll.train.tr.l4 + tscensus.train.diff.l5 + tspayroll.train.tr.l5 + tscensus.train.diff.l6 + tspayroll.train.tr.l6 + const + trend 
## 
##                          Estimate Std. Error t value Pr(>|t|)    
## tscensus.train.diff.l1   0.009965   0.001774   5.616 5.25e-08 ***
## tspayroll.train.tr.l1    0.071661   0.068034   1.053 0.293223    
## tscensus.train.diff.l2   0.009511   0.002161   4.402 1.60e-05 ***
## tspayroll.train.tr.l2    0.245954   0.067623   3.637 0.000336 ***
## tscensus.train.diff.l3   0.013729   0.002289   5.997 7.08e-09 ***
## tspayroll.train.tr.l3    0.045158   0.068970   0.655 0.513237    
## tscensus.train.diff.l4   0.009207   0.002367   3.890 0.000129 ***
## tspayroll.train.tr.l4    0.010807   0.067747   0.160 0.873388    
## tscensus.train.diff.l5   0.007237   0.002221   3.258 0.001279 ** 
## tspayroll.train.tr.l5    0.028679   0.064513   0.445 0.657031    
## tscensus.train.diff.l6   0.003415   0.001808   1.889 0.060126 .  
## tspayroll.train.tr.l6    0.131587   0.062145   2.117 0.035224 *  
## const                  -15.113359  22.678202  -0.666 0.505760    
## trend                    0.105638   0.145223   0.727 0.467657    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 176 on 247 degrees of freedom
## Multiple R-Squared: 0.5622,  Adjusted R-squared: 0.5392 
## F-statistic:  24.4 on 13 and 247 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                     tscensus.train.diff tspayroll.train.tr
## tscensus.train.diff            45957613             450850
## tspayroll.train.tr               450850              30991
## 
## Correlation matrix of residuals:
##                     tscensus.train.diff tspayroll.train.tr
## tscensus.train.diff              1.0000             0.3778
## tspayroll.train.tr               0.3778             1.0000

To summarize, I have tested three different data frames. The first combined the (seasonally and first) differenced census data with the double-differenced payroll data; the second used the untransformed census and payroll series; and the third used the same first-order (plus seasonal) differencing on both series.

The non-differenced VAR gave the highest R-squared values of the three, but I don't believe that is meaningful; if anything it points to overfitting, since the R-squared values for both equations were above 0.98. The most practical model was probably the one built on the identically differenced payroll and census data.

All three VAR models had a residual correlation between 0.3 and 0.4 for payroll and census. This is interesting, as I expected that number to be much higher based on how similar the graphs look.
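One way to probe the two-way relationship I hypothesized at the start (a sketch; causality() from the vars package runs Granger causality tests on a fitted VAR, here the identically differenced model):

# Sketch: does each differenced series help predict the other?
causality(dag.6, cause = "tscensus.train.diff")$Granger
causality(dag.6, cause = "tspayroll.train.tr")$Granger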

Below are the forecasts, in the same order. The forecast for the VAR model that mixed differencing levels did not work; I assume it is because a row of data was removed to align the series, but I am unsure of the exact reason, so I have not included its code.

Below this line is the forecast from the non-differenced data frame. The forecasts look reasonable, although they are perhaps a little flat, and once again the 2020 recession was not predicted.

forecast(ag.10, h = 68) %>% autoplot()

Finally, below are the forecasts from the identically differenced VAR model. We notice that the predictions tend toward 0 in the long run, which makes sense: forecasts from a stationary VAR revert to the series mean, and the mean of the differenced series is roughly zero. My initial thought was that the PT.asymptotic option was responsible for this behaviour, but that argument only affects the serial.test() command, not the forecasts.

forecast(dag.6, h = 68) %>% autoplot()
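If we wanted these forecasts on the original scale, we could undifference them as described back in the uschange section (a sketch; fc.diff, seas.diff, fc.seas, and fc.level are names I am introducing here, and the inversion order mirrors how the series was differenced: seasonal first, then first difference):

# Sketch: map the differenced census forecasts back to levels
fc.diff <- forecast(dag.6, h = 68)$forecast$tscensus.train.diff$mean
# Invert the lag-1 difference: cumulative sum anchored at the last value
# of the seasonally differenced training series
seas.diff <- diff(tscensus.train, lag = 12)
fc.seas <- cumsum(fc.diff) + as.numeric(tail(seas.diff, 1))
# Invert the lag-12 difference: diffinv() adds back the level from 12
# months earlier, seeded with the last 12 observed training values
fc.level <- diffinv(fc.seas, lag = 12, xi = tail(tscensus.train, 12))[-(1:12)]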

Overall, we have taken a look at three different models. As expected, the linear model provided a solid benchmark against which to compare the other models, but it had its flaws.

For this homework specifically, the ARIMA model seemed to perform the best in terms of predictions and residual errors.

Finally, the VAR model proved useful in examining the relationship between the two separate time series, but it would most likely be more useful, and have more impact, if we were modeling a naturally multivariate time series (like the initial uschange example) rather than two univariate series from separate sources (like the census and payroll data).