rm(list = ls())
setwd("/Users/noam/Desktop/")
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ ggplot2 3.3.5 ✓ fma 2.4
## ✓ forecast 8.16 ✓ expsmooth 2.3
##
library(urca)
data <- read.csv("FreightShipments.csv", header = TRUE)
Freight.ts <- ts(data$Shipments, start = c(1991,1), end = c(2003,12), frequency = 12)
autoplot(Freight.ts, xlab = "Year", ylab = "Shipments")
result = decompose(Freight.ts)
plot(result)
From the decomposition plot of the time series, we see an increasing trend until roughly the eighth year of the series (around 1998), followed by a flat-to-declining trend toward the end. There is a clear seasonal pattern, with peaks recurring roughly every quarter within the year, and the seasonal and error components appear to grow with the level of the series, suggesting they are multiplicative.
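Since the seasonal and error components appear to scale with the level of the series, it may be worth comparing a multiplicative decomposition; a minimal sketch (decompose() defaults to an additive model):
# Multiplicative decomposition, for comparison with the additive one above
result.mult <- decompose(Freight.ts, type = "multiplicative")
plot(result.mult)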
Freight.df <- ur.df(Freight.ts, type = "none", selectlags = "AIC")
summary(Freight.df)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -341.86 -50.03 -2.32 46.70 292.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 0.0004154 0.0032232 0.129 0.897615
## z.diff.lag -0.2961542 0.0778110 -3.806 0.000204 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.4 on 152 degrees of freedom
## Multiple R-squared: 0.08702, Adjusted R-squared: 0.07501
## F-statistic: 7.244 on 2 and 152 DF, p-value: 0.0009883
##
##
## Value of test-statistic is: 0.1289
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
The ADF test statistic for the y(t-1) term is 0.1289, which is above the 5% critical value of -1.95 (and the coefficient's p-value of 0.897615 is far above 0.05). Therefore we CANNOT REJECT Ho - we conclude the time series is NONSTATIONARY (as discussed earlier).
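Before rerunning the test with a trend term, a quick sanity check (a sketch, not part of the original analysis) is to ask how many differences a stationarity criterion suggests and to retest the differenced series:
# Number of differences suggested by a stationarity test criterion
ndiffs(Freight.ts)
# ADF test on the first-differenced series; the test statistic should now
# fall below the 5% critical value, indicating stationarity
summary(ur.df(diff(Freight.ts), type = "none", selectlags = "AIC"))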
Freight.df <- ur.df(Freight.ts, type = "trend", selectlags = "AIC")
summary(Freight.df)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -314.00 -43.04 -0.20 31.57 317.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 319.60283 98.88943 3.232 0.00151 **
## z.lag.1 -0.16451 0.05151 -3.194 0.00171 **
## tt 0.21190 0.16618 1.275 0.20424
## z.diff.lag -0.21553 0.07986 -2.699 0.00776 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.22 on 150 degrees of freedom
## Multiple R-squared: 0.1462, Adjusted R-squared: 0.1291
## F-statistic: 8.561 on 3 and 150 DF, p-value: 2.77e-05
##
##
## Value of test-statistic is: -3.1937 3.4877 5.1733
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -3.99 -3.43 -3.13
## phi2 6.22 4.75 4.07
## phi3 8.43 6.49 5.47
Here the tau3 test statistic for the y(t-1) term is -3.1937, which is above the 5% critical value of -3.43; note also that 0.20424 is the p-value of the deterministic trend term tt, which is itself insignificant. Therefore we CANNOT REJECT Ho - we conclude that the time series is NONSTATIONARY even after allowing for a linear trend (as discussed earlier).
A stationary time series is one whose statistical properties do not depend on the time at which it is observed. In other words, a stationary time series is assumed to have no seasonality or trend (or these are statistically insignificant and will not contribute to forecasting). Therefore, a time series with a trend is nonstationary.
For a trend-stationary time series, the trend is estimated and removed from the data so as to make the data stationary. The residual time series is then a stationary stochastic process.
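As an illustration of such trend removal (a sketch using tslm() from the forecast package):
# Estimate a linear trend and inspect the detrended (residual) series;
# if the series is trend-stationary, these residuals should look stationary
trend.fit <- tslm(Freight.ts ~ trend)
autoplot(residuals(trend.fit), ylab = "Detrended shipments")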
# Hold out the last 36 months (Jan 2001 - Dec 2003) for validation
nValid <- 36
nTrain <- length(Freight.ts) - nValid
# In window(), c(1991, k) is interpreted as the k-th month counted from January 1991
train.ts <- window(Freight.ts, start = c(1991, 1), end = c(1991, nTrain))
valid.ts <- window(Freight.ts, start = c(1991, nTrain+1), end = c(1991, nTrain+nValid))
library(forecast)
Freight.lms <- tslm(train.ts ~ trend + I(trend^2) + season)
summary(Freight.lms)
##
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2) + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -253.22 -53.08 -10.18 57.30 308.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.714e+03 3.472e+01 49.369 < 2e-16 ***
## trend 7.219e+00 9.103e-01 7.931 2.36e-12 ***
## I(trend^2) -2.816e-02 7.286e-03 -3.866 0.000191 ***
## season2 5.511e+01 3.830e+01 1.439 0.153157
## season3 5.787e+01 3.830e+01 1.511 0.133802
## season4 6.089e+01 3.831e+01 1.590 0.114918
## season5 -1.103e+01 3.831e+01 -0.288 0.773895
## season6 8.978e-01 3.832e+01 0.023 0.981351
## season7 -3.591e+01 3.832e+01 -0.937 0.350846
## season8 -6.269e+00 3.833e+01 -0.164 0.870411
## season9 5.303e+01 3.834e+01 1.383 0.169544
## season10 5.159e+01 3.835e+01 1.345 0.181472
## season11 4.620e+01 3.837e+01 1.204 0.231175
## season12 -2.573e+01 3.838e+01 -0.670 0.504142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 85.64 on 106 degrees of freedom
## Multiple R-squared: 0.7499, Adjusted R-squared: 0.7192
## F-statistic: 24.44 on 13 and 106 DF, p-value: < 2.2e-16
accuracy(Freight.lms)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.515963e-14 80.4911 63.38997 -0.1449508 3.084552 0.6589141
## ACF1
## Training set 0.5704966
library(ggthemes)
checkresiduals(Freight.lms)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals from Linear regression model
## LM test = 67.184, df = 24, p-value = 5.789e-06
The t-values are 49.369 for the intercept, 7.931 for the trend, and -3.866 for the squared trend. Both the trend and the squared trend are statistically significant, with p-values of 2.36e-12 and 0.000191 respectively. The seasonal dummy terms, on the other hand, are not statistically significant.
From the histogram of the residuals, we see that they do not follow a normal distribution exactly, but they are not far from one. From the autocorrelation function, we see significant autocorrelation at lags 1, 2, 3, 12, and 24 that the model does not account for. From the residual plot, we see that around February 1998 the model underestimates shipments by up to roughly 300 units in a month, while by the 2000s its errors still reach about 200 units per month.
From Swanson (2015:3), we know that a MAPE below 5% is considered an indication that the forecast is acceptably accurate. Judging from its performance, this seasonal quadratic model attains a MAPE of 3.084552%, and should thus be considered for forecasting purposes.
Freight.lms.pred <- forecast(Freight.lms, h = nValid, level = 0.95)
summary(Freight.lms.pred)
##
## Forecast method: Linear regression model
##
## Model Information:
##
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2) + season)
##
## Coefficients:
## (Intercept) trend I(trend^2) season2 season3 season4
## 1714.19161 7.21940 -0.02816 55.10690 57.87014 60.88970
## season5 season6 season7 season8 season9 season10
## -11.03441 0.89781 -35.91363 -6.26875 53.03246 51.59000
## season11 season12
## 46.20387 -25.72593
##
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.515963e-14 80.4911 63.38997 -0.1449508 3.084552 0.6589141
## ACF1
## Training set 0.5704966
##
## Forecasts:
## Point Forecast Lo 95 Hi 95
## Jan 2001 2175.377 1991.426 2359.327
## Feb 2001 2230.859 2046.533 2415.185
## Mar 2001 2233.941 2049.222 2418.661
## Apr 2001 2237.223 2052.091 2422.356
## May 2001 2165.506 1979.942 2351.070
## Jun 2001 2177.588 1991.574 2363.602
## Jul 2001 2140.870 1954.388 2327.353
## Aug 2001 2170.552 1983.583 2357.522
## Sep 2001 2229.835 2042.361 2417.309
## Oct 2001 2228.317 2040.320 2416.314
## Nov 2001 2222.799 2034.262 2411.337
## Dec 2001 2150.681 1961.585 2339.777
## Jan 2002 2176.163 1985.519 2366.807
## Feb 2002 2230.969 2039.624 2422.314
## Mar 2002 2233.375 2041.305 2425.446
## Apr 2002 2235.982 2043.162 2428.801
## May 2002 2163.588 1969.997 2357.180
## Jun 2002 2174.994 1980.607 2369.381
## Jul 2002 2137.601 1942.396 2332.805
## Aug 2002 2166.607 1970.561 2362.652
## Sep 2002 2225.213 2028.305 2422.121
## Oct 2002 2223.019 2025.227 2420.812
## Nov 2002 2216.826 2018.127 2415.525
## Dec 2002 2144.032 1944.405 2343.659
## Jan 2003 2168.838 1967.175 2370.500
## Feb 2003 2222.968 2020.189 2425.747
## Mar 2003 2224.698 2020.776 2428.621
## Apr 2003 2226.629 2021.536 2431.721
## May 2003 2153.559 1947.270 2359.848
## Jun 2003 2164.289 1956.778 2371.800
## Jul 2003 2126.219 1917.461 2334.978
## Aug 2003 2154.550 1944.520 2364.580
## Sep 2003 2212.480 2001.154 2423.807
## Oct 2003 2209.610 1996.964 2422.257
## Nov 2003 2202.741 1988.750 2416.731
## Dec 2003 2129.271 1913.914 2344.628
plot(Freight.lms.pred, ylab = "Shipments", xlab = "Time", bty = "l",
main = "Quadratic Seasonal Model", flty = 2)
lines(Freight.lms$fitted.values, lwd = 2, col = "blue")
lines(valid.ts)
checkresiduals(Freight.lms.pred)
##
## Ljung-Box test
##
## data: Residuals from Linear regression model
## Q* = 149.93, df = 10, p-value < 2.2e-16
##
## Model df: 14. Total lags used: 24
From the plot of this model, we see that over the test period it does not capture the multiplicative nature of the seasonality of this time series. In addition, it does not capture the downward trend the series demonstrates from the early 2000s to the beginning of 2003. The mean absolute error of this model stands at 63.39, and the RMSE at 80.49; how acceptable that accuracy is depends on the model's purpose. From the checkresiduals plot, we see that as time increases, the model's tendency to under- and overestimate the real shipment level grows, demonstrating its inability to capture the multiplicative seasonal pattern.
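To quantify performance over the validation period rather than only the training period, the held-out data can be passed to accuracy() (a sketch):
# Training- and validation-period error measures for the quadratic seasonal model
accuracy(Freight.lms.pred, valid.ts)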
Freight.arima <- auto.arima(train.ts)
summary(Freight.arima)
## Series: train.ts
## ARIMA(1,1,1) with drift
##
## Coefficients:
## ar1 ma1 drift
## 0.6009 -0.9465 3.3317
## s.e. 0.1109 0.0593 1.1417
##
## sigma^2 = 5825: log likelihood = -683.71
## AIC=1375.42 AICc=1375.77 BIC=1386.53
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001723599 75.03859 51.73157 -0.09957806 2.500182 0.5377295
## ACF1
## Training set -0.01455744
The automated ARIMA procedure selected an ARIMA(1,1,1) model with drift. This means the model's parameters are p = 1, d = 1, q = 1.
For p = 1, the autoregressive part of this ARIMA model is a first-order autoregression, AR(1): a model that regresses the series on its own value one lag back. The autoregressive part of this model is therefore: y(t) = c + φ1·y(t-1) + ε(t), where ε(t) is white noise.
An AR(1) term means this model accounts for the autocorrelation of the time series at lag 1. This is one of the main differences between this model and the quadratic seasonal model applied earlier.
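To see such a coefficient in isolation, one could fit a pure AR(1) with Arima() (a sketch; this is not the model auto.arima selected):
# Pure AR(1) fit; the ar1 coefficient is phi1 in y(t) = c + phi1*y(t-1) + e(t)
ar1.fit <- Arima(train.ts, order = c(1, 0, 0))
ar1.fit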
For d = 1, this ARIMA model applies first-order differencing; in other words, it differences the series once (one lag) to de-trend it, as visualized in the sketch below.
For q = 1, the moving average part is first order: the model includes one lagged forecast error term.
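To visualize what d = 1 does (a sketch), plot the first-differenced training series; the trend should largely disappear:
# First difference of the training series; differencing once removes the trend
autoplot(diff(train.ts), ylab = "Change in shipments")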
The drift term means a constant is included in the differenced series; when d = 1 this constant acts as the average period-to-period change, so the long-run forecasts follow a linear trend. The parameter μ is called the "drift" in the R output when d = 1.
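Since the drift is essentially the average month-to-month change, it can be checked against the reported estimate of 3.3317 (a sketch):
# With d = 1, the drift estimate approximates the mean of the differenced series
mean(diff(train.ts))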
Consistent with the quadratic seasonal model discussed in part 2, whose seasonal dummies were statistically insignificant, the automated ARIMA model does not include a seasonal component (it lacks the second, seasonal (P,D,Q) parentheses term).
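If we wanted to force a seasonal component, the automated search could be required to take one seasonal difference (a sketch; this specification is an illustrative assumption, not part of the original analysis):
# Force one seasonal difference so auto.arima considers seasonal ARIMA terms
sarima.fit <- auto.arima(train.ts, D = 1)
summary(sarima.fit)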
We immediately see that the MAPE of the ARIMA model is smaller than that of the quadratic seasonal model tested earlier: it stands at 2.5%, a decrease of more than half a percentage point. For some purposes, such a difference can be considered significant. In addition, both the RMSE and MAE of the ARIMA model are lower, demonstrating its superiority over the quadratic seasonal model on the training data.
# - FIT ARIMA MODEL TO TRAINING DATA
arima.fit <- auto.arima(train.ts)
# Check the values of p, d, q (and P, D, Q) selected by the auto.arima function
summary(arima.fit)
## Series: train.ts
## ARIMA(1,1,1) with drift
##
## Coefficients:
## ar1 ma1 drift
## 0.6009 -0.9465 3.3317
## s.e. 0.1109 0.0593 1.1417
##
## sigma^2 = 5825: log likelihood = -683.71
## AIC=1375.42 AICc=1375.77 BIC=1386.53
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001723599 75.03859 51.73157 -0.09957806 2.500182 0.5377295
## ACF1
## Training set -0.01455744
# USE THE FORECAST FUNCTION TO GENERATE FORECASTS BASED ON THE ARIMA MODEL RESULTS
arima.pred <- forecast(arima.fit, h = nValid, level = .95)
# Plot results
plot(arima.pred, xlab = "Time", ylab = "Shipments", ylim = c(1300, 2500), bty = "l",
main = "Freight forecast with ARIMA")
lines(arima.pred$fitted, lwd = 2, col = "blue")
lines(Freight.ts)
# Refit the ARIMA(1,1,1) with drift selected by the Hyndman-Khandakar algorithm (auto.arima), adding a log transformation (lambda = 0)
Freight.train <- Arima(train.ts, order=c(1,1,1), lambda=0, include.drift=TRUE)
Freight.train %>%
forecast(h=36) %>%
autoplot() + autolayer(valid.ts)
checkresiduals(arima.fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1) with drift
## Q* = 35.06, df = 21, p-value = 0.02781
##
## Model df: 3. Total lags used: 24
Again, much like the quadratic seasonal model, this ARIMA model does not capture the multiplicative seasonal nature of the time series over the validation period. Over the training period, however, it does a much better job: the residual histogram shows errors far more concentrated around zero, and the ACF plot shows only one significant spike among the early lags (though the Ljung-Box test, with p = 0.02781, indicates some residual autocorrelation remains). Overall, the ARIMA model performs better over the training period than the quadratic seasonal model, yet neither model does a good job predicting the multiplicative seasonality or the declining trend of the time series over the validation period.
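To make the comparison concrete over the validation period, both forecasts can be scored against valid.ts (a sketch); the "Test set" rows are directly comparable:
# Validation-period accuracy for the quadratic seasonal and ARIMA models
accuracy(Freight.lms.pred, valid.ts)["Test set", c("RMSE", "MAE", "MAPE")]
accuracy(arima.pred, valid.ts)["Test set", c("RMSE", "MAE", "MAPE")]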
EXTRA CREDIT (OPTIONAL) – What is the relationship between the partial autocorrelation function and the regression coefficients for an AR(p) model?
The partial autocorrelation function (PACF) gives the partial correlation of a stationary time series with its own lagged values, controlling for the values of the series at all shorter lags. It contrasts with the autocorrelation function, which does not control for the shorter lags. In fact, the partial autocorrelation at lag k equals the last regression coefficient obtained when an AR(k) model is fit: for an AR(p) process, the PACF is therefore nonzero up to lag p (with the lag-p value equal to the coefficient φp) and drops to zero beyond lag p.
When we fit an AR(p) model, the residual autocorrelation function will show whatever autocorrelation is left in the time series after the first p lags are accounted for. Consider, for example, fitting an AR(1) to a time series: the PACF of the original series will show a statistically significant spike at lag 1, followed by statistically insignificant values at the higher lags, and if the AR(1) captures most of the autoregressive structure, the residuals will show no remaining significant autocorrelation.
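A small simulation makes the relationship concrete (a sketch): for a simulated AR(2) process, the sample PACF should cut off after lag 2, and its lag-2 value should be close to the fitted coefficient φ2.
# Simulate an AR(2) process and compare its PACF with the fitted AR coefficients
set.seed(42)
sim <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)
pacf(sim, lag.max = 10)  # significant spikes at lags 1 and 2, then cutoff
ar.fit <- arima(sim, order = c(2, 0, 0))
ar.fit$coef  # the ar2 coefficient should be close to the PACF value at lag 2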