
library(GGally) # for ggpairs()
library(TSA) # season(), prewhiten() and other functions
library(tseries) # adf.test()
library(forecast) # BoxCox.lambda()
library(dLagM) # For DLM modelling
library(car) # for vif()
library(tis) # for Lag()
library(dynlm) # for Dynamic linear modeling
library(stats) # for classical decomposition
library(x12) # X-12-ARIMA decomposition
library(lmtest) # for bgtest()
library(dplyr) # for arrange()

General note:

A significance level \(\alpha=5\%\) is used.

Task 1: Four-Week Ahead Mortality Forecasting in Paris Using Multiple Predictors

Data Description

The dataset holds 6 columns and 508 weekly observations, all measured at the same time points between 2010 and 2020: an index column, the disease-specific average weekly mortality in Paris, France, the city's local climate (temperature in degrees Fahrenheit), the size of airborne pollutant particles, and two series of noxious chemical emissions (from cars and from industry).

Objective

Our aim for the mort dataset is to produce the best four-week-ahead forecasts of the average weekly mortality in Paris by determining the most accurate and suitable regression model, in terms of MASE, using multiple predictors. A descriptive analysis is conducted first. A model-building strategy is then applied to find the best-fitting model among time series regression methods (dLagM package), dynamic linear models (dynlm package), and exponential smoothing and the corresponding state-space models.

Model Selection Criteria

MASE

Of the various error measures used to assess forecast accuracy, the mean absolute scaled error (MASE) is a generally applicable one, obtained by scaling the forecast errors by the in-sample MAE of the naive forecast method. Because it is scale-free, it can be used to compare forecast accuracy across series.
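
For a series of \(n\) in-sample observations with forecast errors \(e_t\), MASE is

\(MASE = \dfrac{\frac{1}{n}\sum_{t=1}^{n}|e_t|}{\frac{1}{n-1}\sum_{t=2}^{n}|Y_t - Y_{t-1}|}\)

A MASE below 1 means the model forecasts better, on average, than the one-step naive method does in-sample.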

Information Criteria (AIC and BIC)

Information criteria (IC) penalise the maximised likelihood for model complexity: AIC adds a penalty of twice the number of parameters, while the BIC penalty grows with both the number of parameters and the sample size. In simple terms, IC balances goodness of fit against complexity, giving a better criterion for model selection than the likelihood alone.
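
For a model with \(k\) estimated parameters, maximised likelihood \(\hat{L}\) and sample size \(n\),

\(AIC = -2\ln(\hat{L}) + 2k\)

\(BIC = -2\ln(\hat{L}) + k\ln(n)\)

Lower values of either criterion indicate a better trade-off between fit and complexity.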

Adjusted R Squared

Comparing models by adjusted R-squared gives a rough, percentage-scale indication of how well each model fits the data, corrected for the number of regressors.
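
For \(n\) observations and \(k\) regressors,

\(\bar{R}^2 = 1 - (1 - R^2)\dfrac{n-1}{n-k-1}\)

so, unlike plain \(R^2\), it penalises the addition of uninformative regressors.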

Read Data

mort <- read.csv("C:/Users/admin/Downloads/mort.csv")
mort = mort[,2:6] # remove index column
head(mort)
##   mortality  temp chem1 chem2 particle.size
## 1    183.63 72.38 11.51 45.79         72.72
## 2    191.05 67.19  8.92 43.90         49.60
## 3    180.09 62.94  9.48 32.18         55.68
## 4    184.67 72.49 10.28 40.43         55.16
## 5    173.60 74.25 10.57 48.53         66.02
## 6    183.73 67.88  7.99 48.61         44.01

Identification of the response and the regressor variables

For fitting a regression model, the response is Mortality and the 4 regressor variables are temperature, pollutant particle size, and the two chemical emission series (chem1, chem2).

  • y = Mortality = disease specific averaged weekly mortality in Paris
  • x1 = temp = city’s local climate (temperature degrees Fahrenheit)
  • x2 = chem1 = levels of noxious chemical emissions from cars in air
  • x3 = chem2 = levels of noxious chemical emissions from industry in air
  • x4 = particle.size = size of pollutants

All 5 variables are continuous.

Read Regressor and Response variables

Let's first get the regressors and the response as ts objects,

Mortality = ts(mort[,1])
Temp = ts(mort[,2])
Chem1 = ts(mort[,3])
Chem2 = ts(mort[,4])
ParticleSize = ts(mort[,5])
data.ts = ts(mort) # Y and x in single dataframe

Relationship between Response and Regressor variables

Let's scale, center and plot all 5 series together,

data.scale = scale(data.ts) # center and scale each series
plot(data.scale, plot.type="s", col=c("black", "red", "blue", "green", "yellow"), main = "Mortality (Black - Response), Temperature (Red - X1),\n  Chemical 1 (Blue - X2), Chemical 2 (Green - X3), Particle size (Yellow - X4)")

From this plot it is hard to read the correlations between the regressors and the response, or among the regressors themselves, but it is fair to say the 5 variables show some correlation. Let's check the correlations statistically using ggpairs(),

ggpairs(data = mort, columns = c(1,2,3,4,5), progress = FALSE) #library(GGally)

Hence, some correlation between the 4 regressors and the response is present, and we can build a regression model on these relationships. First, let's look at the descriptive statistics.

Descriptive Analysis

Since we are building a regression model that estimates the response, \(Mortality\), let's focus on Mortality's statistics.

Summary statistics

summary(Mortality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   142.1   159.6   166.7   169.0   176.4   231.7

The mean and median of Mortality are very close, indicating a roughly symmetric distribution.

Time Series plot

The time series plot for our data is generated using the following code chunk,

plot(Mortality,ylab='Average weekly mortality in Paris',xlab='Weeks',
     type='o', main="Average weekly mortality Trend (2010-2020/week1-week508)")

Plot Inference :

From Figure 1, we can comment on the time series as follows,

  • Trend: The overall shape seems to follow a downward trend, indicating possible non-stationarity.

  • Seasonality: From the plot, some seasonal behavior appears each year. This needs to be confirmed using statistical tests.

  • Change in variance: Variation looks random and needs to be checked statistically.

  • Behavior: We notice mixed MA and AR behavior. AR behavior is dominant, as successive data points tend to follow one another; MA behavior shows in the up-and-down fluctuations between points.

  • Intervention/change points: No obvious intervention point is seen. Week 153 might be an intervention point on account of its magnitude, and will be checked later for a significant change in mean level.

ACF and PACF plots

acf(Mortality, main="ACF of Average weekly mortality")

pacf(Mortality, main ="PACF of Average weekly mortality")

  • ACF plot: Multiple autocorrelations are significant, and the slowly decaying pattern indicates a non-stationary series. We do not see any wave-like form, so no significant seasonal behavior is observed.

  • PACF plot: One high vertical spike at the first lag indicates a non-stationary series, consistent with the time series plot. The second partial autocorrelation is also significant.

Check normality

Many model estimation procedures assume normality of the residuals; if this assumption does not hold, the coefficient estimates are not optimal. Let's look at the Quantile-Quantile (QQ) plot to observe normality visually, and use the Shapiro-Wilk test to confirm the result statistically.

qqnorm(Mortality, main = "Normal Q-Q Plot of Average weekly mortality")
qqline(Mortality, col = 2)

We see deviations from normality: both tails are off the line, and much of the data in the middle is off the line as well. Let's check statistically using the Shapiro-Wilk test, stating its hypotheses first,

\(H_0\) : Time series is Normally distributed
\(H_a\) : Time series is not normal

shapiro.test(Mortality)
## 
##  Shapiro-Wilk normality test
## 
## data:  Mortality
## W = 0.94454, p-value = 7.548e-13

From the Shapiro-Wilk test, since p < 0.05 significance level, we reject the null hypothesis that states the data is normal. Thus, Mortality series is not normally distributed.

Test Stationarity

The ACF and PACF plots at the descriptive analysis stage suggested non-stationarity in our time series. Let's check with the ADF and PP tests,

Using ADF (Augmented Dickey-Fuller) test :

Let's test for non-stationarity using the Augmented Dickey-Fuller (ADF) test. The hypotheses are,

\(H_0\) : The time series has a unit root (difference non-stationary)
\(H_a\) : The time series is stationary

adf.test(Mortality) #library(tseries)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Mortality
## Dickey-Fuller = -5.4301, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Since the p-value < 0.05, we reject the null hypothesis of non-stationarity and conclude that the series is stationary at the 5% level of significance.

Using PP (Phillips-Perron) test :

The null and alternate hypothesis are same as ADF test.

PP.test(Mortality)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  Mortality
## Dickey-Fuller = -9.9724, Truncation lag parameter = 6, p-value = 0.01

According to the PP test, the Mortality series is also stationary at the 5% level.

Conclusion from descriptive analysis:

  • From the time series plot, ACF plot, and the ADF and PP tests, we found that the Mortality response is stationary. Differencing is not required.
  • The series is not normally distributed, so a Box-Cox transformation will be tried.

Let's proceed with the Box-Cox transformation,

Transformations

Box-Cox transformation to improve normality

To improve the normality of our Mortality time series, let's apply a Box-Cox transformation to the series.
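
For a positive series \(y\), the Box-Cox transformation with parameter \(\lambda\) is

\(y^{(\lambda)} = \dfrac{y^{\lambda} - 1}{\lambda}\) for \(\lambda \neq 0\), and \(y^{(\lambda)} = \ln(y)\) for \(\lambda = 0\)

with \(\lambda\) chosen here by maximising the log-likelihood (method = "loglik").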

lambda = BoxCox.lambda(Mortality, method = "loglik") # library(forecast)
BC.Mortality = BoxCox(Mortality, lambda = lambda)

Check Normality of BC transformed Mortality series

Visually comparing the time series plots before and after box-cox transformation,

par(mfrow=c(2,1))
plot(BC.Mortality,ylab='Weekly Mortality',xlab='Time',
     type='o', main="Box-Cox Transformed Mortality Time Series")
points(y=BC.Mortality,x=time(BC.Mortality))
plot(Mortality,ylab='Weekly Mortality',xlab='Time',
     type='o', main="Original Mortality Time Series")
points(y=Mortality,x=time(Mortality))

par(mfrow=c(1,1))

From the plots, almost no improvement in the variance of the time series is visible after the BC transformation. Let's check normality using the Shapiro-Wilk test,

shapiro.test(BC.Mortality)
## 
##  Shapiro-Wilk normality test
## 
## data:  BC.Mortality
## W = 0.9854, p-value = 5.59e-05

From the Shapiro-Wilk test, since p < 0.05, we reject the null hypothesis of normality. Thus, the BC-transformed Mortality series is still not normal.

Conclusion after BC transformation

The BC-transformed Mortality series is stationary but not normal; the BC transformation was not effective.

Decomposition

To observe the individual effects of the underlying components, let's decompose the Mortality time series into its trend, seasonal and remainder components. The STL decomposition method will be used.
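
Under an additive decomposition, the series is split as

\(Y_t = T_t + S_t + R_t\)

where \(T_t\) is the trend-cycle, \(S_t\) the seasonal component and \(R_t\) the remainder.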

STL decomposition

Let's set t.window to 15 and look at the STL decomposition plots.

We can adjust the series for seasonality by subtracting the seasonal component from the original series, using the following code chunk,

# Code gist - apply STL decomposition, obtain the seasonally adjusted and trend-adjusted series, and compare them visually with the original time series

MortalityX = ts(mort[,1], start = c(2010,1), frequency = 52) # set frequency
stl.Mortality <- stl(window(MortalityX, start=c(2010,1)), t.window=15, s.window="periodic", robust=TRUE)

par(mfrow=c(3,1))

plot(MortalityX,ylab='Mortality',xlab='Time',
     type='o', main="Original Mortality Time Series")

plot(seasadj(stl.Mortality), ylab='Mortality',xlab='Time', main = "Seasonally adjusted Mortality")

stl.Mortality.trend = stl.Mortality$time.series[,"trend"] # Extract the trend component from the output
stl.Mortality.trend.adjusted = MortalityX - stl.Mortality.trend

plot(stl.Mortality.trend.adjusted, ylab='Mortality',xlab='Time', main = "Trend adjusted Mortality")

par(mfrow=c(1,1))

Not much change is visible in either the trend-adjusted or the seasonally adjusted series compared to the original. This indicates that neither the trend nor the seasonal component is prominent in the Mortality time series.

Conclusion of Decomposition

Neither a significant trend nor a significant seasonal component is found through decomposition. Thus, we expect the fitted model to need neither trend nor seasonal components.

Modeling

Three classes of time series regression methods will be considered, namely,

  • A. Distributed lag models (dLagM package),
  • B. Dynamic linear models (dynlm package),
  • C. Exponential smoothing and corresponding state-space models.

A. Distributed lag models

Based on whether the lag length is finite and known (finite DLM) or infinite (infinite DLM), 4 major modelling methods will be tested, namely,

  • Basic Finite Distributed lag model,
  • Polynomial DLM,
  • Koyck transformed geometric DLM,
  • and Autoregressive DLM.

Fit Finite DLM

The response of a finite DLM model with 1 regressor is represented as shown below,

\(Y_t = \alpha + \sum_{s=0}^{q} \beta_s X_{t-s} + \epsilon_t\)

where,

  • \(\alpha\) is the intercept,
  • \(\beta_s\) is the coefficient of the \(s\)-lagged regressor \(X_{t-s}\),
  • and \(\epsilon_t\) is the error term.

In our dataset we have 4 regressors, so the model equation contains X1, X2, X3 and X4 instead of a single regressor, as written out below.
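
With a common lag length \(q\) for all regressors,

\(Y_t = \alpha + \sum_{s=0}^{q}\beta_{1,s}X_{1,t-s} + \sum_{s=0}^{q}\beta_{2,s}X_{2,t-s} + \sum_{s=0}^{q}\beta_{3,s}X_{3,t-s} + \sum_{s=0}^{q}\beta_{4,s}X_{4,t-s} + \epsilon_t\)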

Now, let's use the AIC and BIC scores to find the best lag length for the finite DLM,

finiteDLMauto(formula = mortality ~ temp + chem1 + chem2 + particle.size, data = mort, q.min = 1, q.max = 12,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE      AIC      BIC   GMRAE    MBRAE R.Adj.Sq Ljung-Box
## 12    12 0.79668 3678.921 3906.076 0.71739  0.59279  0.56683         0
## 11    11 0.81097 3690.626 3901.055 0.77977  0.34928  0.55912         0
## 10    10 0.81455 3696.208 3889.895 0.78290  0.80140  0.55711         0
## 9      9 0.81741 3704.335 3881.265 0.77674 -0.53319  0.55240         0
## 8      8 0.82157 3710.060 3870.215 0.76386  2.44211  0.55022         0
## 7      7 0.82450 3713.873 3857.237 0.76895 -0.15585  0.54954         0
## 6      6 0.83830 3726.178 3852.736 0.79115  1.21454  0.54079         0
## 5      5 0.83876 3729.364 3839.099 0.77233  0.17266  0.54125         0
## 4      4 0.85441 3741.636 3834.533 0.80750  0.26095  0.53242         0
## 3      3 0.86883 3755.807 3831.850 0.83498  0.50969  0.52269         0
## 2      2 0.88537 3769.907 3829.078 0.81427  1.28602  0.51223         0
## 1      1 0.91963 3812.934 3855.219 0.81979 -0.26483  0.47418         0

Note - we use Mortality rather than BC.Mortality (the BC-transformed series), since normality is violated in both.

q = 12 has the smallest AIC and BIC scores, so we fit models with q = 12.

Since there are 4 predictors, there are 4C1 + 4C2 + 4C3 + 4C4 = 15 possible combinations of predictors and hence 15 models to compare. Let's fit all 15 models and compare them on AIC, BIC and MASE scores; the 15 formulas could also be generated programmatically, as sketched below.
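
A minimal sketch of that enumeration (the object names preds, combos and formulas are illustrative only):

preds <- c("temp", "chem1", "chem2", "particle.size")
# all non-empty subsets of the 4 predictors, joined into formula strings
combos <- unlist(lapply(1:4, function(k) combn(preds, k, FUN = paste, collapse = " + ")))
formulas <- paste("mortality ~", combos)
length(formulas) # 15; each string can be passed to dlm() via as.formula()

The explicit calls are kept below for readability.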

DLM.model = dlm(formula = mortality ~ temp + chem1 + chem2 + particle.size, data = mort, q = 12)
DLM.model1 = dlm(formula = mortality ~ temp , data = mort, q = 12)
DLM.model2 = dlm(formula = mortality ~ chem1, data = mort, q = 12)
DLM.model3 = dlm(formula = mortality ~ chem2, data = mort, q = 12)
DLM.model4 = dlm(formula = mortality ~ particle.size, data = mort, q = 12)
DLM.model5 = dlm(formula = mortality ~ temp + chem1, data = mort, q = 12)
DLM.model6 = dlm(formula = mortality ~ temp + chem2, data = mort, q = 12)
DLM.model7 = dlm(formula = mortality ~ temp + particle.size, data = mort, q = 12)
DLM.model8 = dlm(formula = mortality ~ temp + chem1 + chem2, data = mort, q = 12)
DLM.model9 = dlm(formula = mortality ~ temp + chem1 + particle.size, data = mort, q = 12)
DLM.model10 = dlm(formula = mortality ~ temp + chem2 + particle.size, data = mort, q = 12)
DLM.model11 = dlm(formula = mortality ~ chem1 + chem2 + particle.size, data = mort, q = 12)
DLM.model12 = dlm(formula = mortality ~ chem1 + chem2 , data = mort, q = 12)
DLM.model13 = dlm(formula = mortality ~ chem1 + particle.size, data = mort, q = 12)
DLM.model14 = dlm(formula = mortality ~ chem2 + particle.size, data = mort, q = 12)

Model <- c("DLM.model", "DLM.model1", "DLM.model2", "DLM.model3", "DLM.model4", "DLM.model5", "DLM.model6", "DLM.model7", "DLM.model8", "DLM.model9", "DLM.model10", "DLM.model11", "DLM.model12", "DLM.model13", "DLM.model14")
AIC <- c(AIC(DLM.model), AIC(DLM.model1), AIC(DLM.model2), AIC(DLM.model3), AIC(DLM.model4),AIC(DLM.model5), AIC(DLM.model6), AIC(DLM.model7), AIC(DLM.model8), AIC(DLM.model9), AIC(DLM.model10), AIC(DLM.model11), AIC(DLM.model12), AIC(DLM.model13), AIC(DLM.model14))
BIC <- c(BIC(DLM.model), BIC(DLM.model1), BIC(DLM.model2), BIC(DLM.model3), BIC(DLM.model4),BIC(DLM.model5), BIC(DLM.model6), BIC(DLM.model7), BIC(DLM.model8), BIC(DLM.model9), BIC(DLM.model10), BIC(DLM.model11), BIC(DLM.model12), BIC(DLM.model13), BIC(DLM.model14))
MASE <- MASE(DLM.model, DLM.model1, DLM.model2, DLM.model3, DLM.model4, DLM.model5, DLM.model6, DLM.model7, DLM.model8, DLM.model9, DLM.model10, DLM.model11, DLM.model12, DLM.model13, DLM.model14)
data.frame(AIC, BIC, MASE) %>% arrange(MASE)
##                  AIC      BIC   n      MASE
## DLM.model   3678.921 3906.076 496 0.7966763
## DLM.model8  3680.200 3852.670 496 0.8170975
## DLM.model9  3685.181 3857.651 496 0.8216239
## DLM.model11 3681.534 3854.004 496 0.8245264
## DLM.model12 3672.715 3790.500 496 0.8370629
## DLM.model5  3681.821 3799.605 496 0.8384156
## DLM.model13 3689.321 3807.105 496 0.8411857
## DLM.model10 3741.257 3913.726 496 0.8494305
## DLM.model2  3684.963 3748.062 496 0.8573333
## DLM.model14 3741.543 3859.327 496 0.8728312
## DLM.model7  3737.286 3855.070 496 0.8826570
## DLM.model4  3750.060 3813.158 496 0.9101003
## DLM.model6  3766.032 3883.816 496 0.9122243
## DLM.model3  3798.042 3861.140 496 0.9337064
## DLM.model1  3860.683 3923.782 496 1.0667651

The best model as per MASE (best for forecasting) is the one with all 4 predictors, \(DLM.model\).

Diagnostic check for DLM.model (Residual analysis)

We can apply a diagnostic check using checkresiduals() function from the forecast package.

checkresiduals(DLM.model$model$residuals) # forecast package

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 380.43, df = 10, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 10

In this output,

  • from the time series plot and histogram of residuals, there is an obvious non-random pattern and some very large residuals, which violate the general assumptions.
  • the Ljung-Box test output is displayed. The null hypothesis that the residual series exhibits no autocorrelation up to lag 10 is rejected. Together with the ACF plot, this shows that the serial correlation left in the residuals is highly significant.

Model Summary for Finite DLM model (DLM.model) :

summary(DLM.model)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.355  -5.502  -0.107   4.850  43.608 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.328e+02  1.063e+01  12.491  < 2e-16 ***
## temp.t            4.429e-01  1.168e-01   3.793 0.000169 ***
## temp.1           -2.417e-01  1.212e-01  -1.994 0.046782 *  
## temp.2           -9.926e-02  1.245e-01  -0.797 0.425624    
## temp.3            4.283e-02  1.262e-01   0.339 0.734566    
## temp.4            1.211e-01  1.264e-01   0.958 0.338534    
## temp.5           -4.649e-02  1.264e-01  -0.368 0.713112    
## temp.6            3.780e-02  1.256e-01   0.301 0.763659    
## temp.7            2.605e-02  1.253e-01   0.208 0.835350    
## temp.8            5.145e-02  1.237e-01   0.416 0.677756    
## temp.9           -9.944e-02  1.222e-01  -0.814 0.416054    
## temp.10          -8.694e-02  1.222e-01  -0.712 0.477127    
## temp.11          -4.108e-02  1.185e-01  -0.347 0.729099    
## temp.12          -8.598e-02  1.144e-01  -0.751 0.452814    
## chem1.t          -6.862e-01  4.401e-01  -1.559 0.119674    
## chem1.1           6.670e-01  4.462e-01   1.495 0.135727    
## chem1.2           1.085e+00  4.718e-01   2.300 0.021893 *  
## chem1.3           1.058e+00  4.738e-01   2.232 0.026091 *  
## chem1.4           5.075e-01  4.816e-01   1.054 0.292529    
## chem1.5           5.447e-01  4.828e-01   1.128 0.259857    
## chem1.6           8.502e-01  4.830e-01   1.760 0.079060 .  
## chem1.7           4.882e-01  4.801e-01   1.017 0.309748    
## chem1.8           3.463e-02  4.769e-01   0.073 0.942154    
## chem1.9           2.999e-01  4.762e-01   0.630 0.529222    
## chem1.10         -3.240e-01  4.746e-01  -0.683 0.495139    
## chem1.11         -7.156e-01  4.481e-01  -1.597 0.110992    
## chem1.12         -9.069e-01  4.341e-01  -2.089 0.037243 *  
## chem2.t          -5.401e-04  8.872e-02  -0.006 0.995145    
## chem2.1          -9.989e-02  8.968e-02  -1.114 0.265973    
## chem2.2          -1.270e-01  9.249e-02  -1.373 0.170484    
## chem2.3          -1.959e-01  9.288e-02  -2.110 0.035452 *  
## chem2.4          -1.155e-02  9.352e-02  -0.124 0.901753    
## chem2.5          -1.182e-01  9.359e-02  -1.263 0.207220    
## chem2.6          -5.839e-02  9.279e-02  -0.629 0.529542    
## chem2.7           1.589e-02  9.196e-02   0.173 0.862887    
## chem2.8           1.753e-02  9.221e-02   0.190 0.849321    
## chem2.9           4.621e-02  9.123e-02   0.507 0.612710    
## chem2.10          5.557e-02  9.094e-02   0.611 0.541495    
## chem2.11          1.141e-01  8.667e-02   1.316 0.188699    
## chem2.12          2.230e-01  8.605e-02   2.592 0.009856 ** 
## particle.size.t   2.088e-01  8.054e-02   2.593 0.009838 ** 
## particle.size.1   9.288e-03  8.117e-02   0.114 0.908949    
## particle.size.2   6.814e-03  8.312e-02   0.082 0.934703    
## particle.size.3  -7.578e-02  8.382e-02  -0.904 0.366466    
## particle.size.4  -6.728e-02  8.505e-02  -0.791 0.429335    
## particle.size.5   5.676e-02  8.506e-02   0.667 0.504912    
## particle.size.6  -9.976e-02  8.542e-02  -1.168 0.243486    
## particle.size.7   7.766e-04  8.627e-02   0.009 0.992821    
## particle.size.8   7.200e-03  8.643e-02   0.083 0.933642    
## particle.size.9   2.201e-02  8.373e-02   0.263 0.792741    
## particle.size.10  1.043e-01  8.319e-02   1.254 0.210571    
## particle.size.11  1.156e-01  8.144e-02   1.420 0.156309    
## particle.size.12  1.067e-01  8.105e-02   1.317 0.188594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.368 on 443 degrees of freedom
## Multiple R-squared:  0.6123, Adjusted R-squared:  0.5668 
## F-statistic: 13.46 on 52 and 443 DF,  p-value: < 2.2e-16
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 3678.921 3906.076

  • The finite DLM model is significant.
  • R-squared is 61.23% and adjusted R-squared is 56.68%.

Let's consider the effect of collinearity on these results. To inspect this issue, we display the variance inflation factors (VIFs); if a VIF is greater than 10, the effect of multicollinearity is high.
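
For predictor \(j\),

\(VIF_j = \dfrac{1}{1 - R_j^2}\)

where \(R_j^2\) is the R-squared from regressing predictor \(j\) on all the other predictors.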

vif(DLM.model$model) # variance inflation factors #library(car)
##           temp.t           temp.1           temp.2           temp.3 
##         6.330556         6.829527         7.202565         7.416043 
##           temp.4           temp.5           temp.6           temp.7 
##         7.450336         7.450163         7.353137         7.311261 
##           temp.8           temp.9          temp.10          temp.11 
##         7.087101         6.846589         6.822736         6.425389 
##          temp.12          chem1.t          chem1.1          chem1.2 
##         5.963463        15.717728        16.168511        18.096868 
##          chem1.3          chem1.4          chem1.5          chem1.6 
##        18.249145        18.842660        18.923789        18.956352 
##          chem1.7          chem1.8          chem1.9         chem1.10 
##        18.697202        18.462704        18.421260        18.289736 
##         chem1.11         chem1.12          chem2.t          chem2.1 
##        16.263125        15.281320         8.197008         8.403944 
##          chem2.2          chem2.3          chem2.4          chem2.5 
##         8.938698         9.022675         9.177957         9.185801 
##          chem2.6          chem2.7          chem2.8          chem2.9 
##         9.043460         8.870870         8.919021         8.731734 
##         chem2.10         chem2.11         chem2.12  particle.size.t 
##         8.705336         7.888207         7.778710         8.444210 
##  particle.size.1  particle.size.2  particle.size.3  particle.size.4 
##         8.576138         8.987483         9.099668         9.408277 
##  particle.size.5  particle.size.6  particle.size.7  particle.size.8 
##         9.409970         9.449247         9.638768         9.699272 
##  particle.size.9 particle.size.10 particle.size.11 particle.size.12 
##         9.099315         8.967741         8.589072         8.536916

  • Several predictors (all the chem1 lags) have VIF > 10. Thus, multicollinearity is significant.

MASE(DLM.model)
##                MASE
## DLM.model 0.7966763

Conclusion of Finite DLM model

  • DLM.model is the best finite DLM model and is significant.
  • MASE is 0.7966763.
  • The low R-squared of 61.23% suggests a poor fit; adjusted R-squared is 56.68%.
  • There are violations in the tests of assumptions.
  • Serial autocorrelation is significant.
  • Multicollinearity is significant.

ATTENTION - From here on, for simplicity, the models are summarised briefly rather than discussed in full detail.

Fit Polynomial DLM model

A polynomial DLM helps reduce the effect of multicollinearity, which our data clearly suffers from. Let's fit a polynomial DLM of order 2 and check whether the polynomial structure reduces the multicollinearity, fitting one model for each of the 4 regressors individually.
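
In an order-2 polynomial (Almon) DLM, the lag coefficients are constrained to lie on a quadratic in the lag index \(s\),

\(\beta_s = \gamma_0 + \gamma_1 s + \gamma_2 s^2, \quad s = 0, 1, \dots, q\)

so the 13 lag coefficients (for \(q = 12\)) are reduced to the 3 parameters reported as z.t0, z.t1 and z.t2 in the summaries below.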

For Temperature regressor:

PolyDLM.Temp = polyDlm(x = as.vector(Temp), y = as.vector(Mortality), q = 12, k = 2, show.beta = FALSE)
summary(PolyDLM.Temp)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.210  -7.947  -2.086   5.734  53.184 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 222.488659   6.618124  33.618  < 2e-16 ***
## z.t0         -0.161012   0.054745  -2.941  0.00342 ** 
## z.t1         -0.019936   0.026643  -0.748  0.45465    
## z.t2          0.004504   0.002210   2.038  0.04211 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.94 on 492 degrees of freedom
## Multiple R-squared:    0.3,  Adjusted R-squared:  0.2957 
## F-statistic: 70.29 on 3 and 492 DF,  p-value: < 2.2e-16

Polynomial DLM model with Temperature as regressor variable is significant at 5% significance level.

For Chemical 1 regressor:

PolyDLM.Chem1 = polyDlm(x = as.vector(Chem1), y = as.vector(Mortality), q = 12, k = 2, show.beta = FALSE)
summary(PolyDLM.Chem1)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.817  -6.283  -0.574   4.845  48.223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 141.458130   1.386081 102.056   <2e-16 ***
## z.t0          0.304811   0.122211   2.494    0.013 *  
## z.t1          0.006495   0.060265   0.108    0.914    
## z.t2         -0.001546   0.004990  -0.310    0.757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.973 on 492 degrees of freedom
## Multiple R-squared:  0.5121, Adjusted R-squared:  0.5091 
## F-statistic: 172.1 on 3 and 492 DF,  p-value: < 2.2e-16

Polynomial DLM model with Chemical 1 as regressor variable is significant at 5% significance level.

For Chemical 2 regressor:

PolyDLM.Chem2 = polyDlm(x = as.vector(Chem2), y = as.vector(Mortality), q = 12, k = 2, show.beta = FALSE)
summary(PolyDLM.Chem2)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.756  -6.782  -1.705   5.040  58.318 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 117.413821   2.960269  39.663  < 2e-16 ***
## z.t0          0.115869   0.031277   3.705 0.000236 ***
## z.t1         -0.021075   0.014590  -1.444 0.149239    
## z.t2          0.001775   0.001199   1.481 0.139289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.15 on 492 degrees of freedom
## Multiple R-squared:  0.3902, Adjusted R-squared:  0.3865 
## F-statistic:   105 on 3 and 492 DF,  p-value: < 2.2e-16

Polynomial DLM model with Chemical 2 as regressor variable is significant at 5% significance level.

For Particle Size regressor:

PolyDLM.ParticleSize = polyDlm(x = as.vector(ParticleSize), y = as.vector(Mortality), q = 12, k = 2, show.beta = FALSE)
summary(PolyDLM.ParticleSize)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.813  -5.943  -1.099   5.128  48.100 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 125.724476   2.305488  54.533  < 2e-16 ***
## z.t0          0.119830   0.028886   4.148 3.94e-05 ***
## z.t1         -0.027004   0.013928  -1.939   0.0531 .  
## z.t2          0.002247   0.001151   1.953   0.0514 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.61 on 492 degrees of freedom
## Multiple R-squared:  0.4475, Adjusted R-squared:  0.4441 
## F-statistic: 132.8 on 3 and 492 DF,  p-value: < 2.2e-16

Polynomial DLM model with Particle Size as regressor variable is significant at 5% significance level.

Polynomial DLM models for each of the 4 regressors are significant overall. In every fit the 0th-order term (z.t0) is significant, while the higher-order terms (z.t1, z.t2) are mostly insignificant.

PolyDLM Model selection

MASE(PolyDLM.Temp, PolyDLM.Chem1, PolyDLM.Chem2, PolyDLM.ParticleSize) %>% arrange(MASE)
##                        n      MASE
## PolyDLM.Chem1        496 0.8944918
## PolyDLM.ParticleSize 496 0.9522845
## PolyDLM.Chem2        496 0.9713557
## PolyDLM.Temp         496 1.1042464

As per MASE measure, Polynomial DLM model with Chemical 1 as regressor is the best model for forecasting.

Diagnostic check for Polynomial DLM (Residual analysis)

checkresiduals(PolyDLM.Chem1$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 349.78, df = 10, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 10

Since the Ljung-Box p-value < 0.05, significant serial correlation remains in the residuals.

Conclusion of Polynomial DLM model

  • PolyDLM.Chem1, with Chemical 1 as regressor, is the best polynomial DLM model.
  • The model is significant.
  • MASE is 0.8944918.
  • Adjusted R-squared is 50.91%.
  • There are violations in the tests of assumptions.
  • Serial autocorrelation is significant.

Fit Koyck geometric DLM model

Here the lag weights are positive and decline geometrically. This model is called the infinite geometric DLM, meaning there are infinitely many lag weights:

\(Y_t = \alpha + \beta\sum_{s=0}^{\infty}\phi^s X_{t-s} + \epsilon_t\)

The Koyck transformation makes this model estimable by subtracting \(\phi\) times the first lag of the geometric DLM from it. The Koyck-transformed model is represented as,

\(Y_t = \delta_1 + \delta_2 Y_{t-1} + \delta_3 X_t + \nu_t\)

where \(\delta_1 = \alpha(1-\phi)\), \(\delta_2 = \phi\), \(\delta_3 = \beta\), and the random error after the transformation is \(\nu_t = \epsilon_t - \phi\epsilon_{t-1}\).

The koyckDlm() function implements a two-stage least squares method: it first estimates \(\hat{Y}_{t-1}\) using an instrumental variable, and then estimates \(Y_t\) through simple linear regression. Let's fit Koyck geometric DLM models for each of the 4 regressors individually.

For Temperature regressor:

Koyck.Temp = koyckDlm(x = as.vector(mort$temp) , y = as.vector(mort$mortality) )
summary(Koyck.Temp$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -35.8714  -8.4484  -0.5811   7.2446  43.9005 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 162.22228   17.58058   9.227  < 2e-16 ***
## Y.1           0.44475    0.05493   8.096 4.28e-15 ***
## X.t          -0.92085    0.12974  -7.098 4.33e-12 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 504     190.8  <2e-16 ***
## Wu-Hausman         1 503     129.1  <2e-16 ***
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.08 on 504 degrees of freedom
## Multiple R-Squared: 0.277,   Adjusted R-squared: 0.2741 
## Wald test:   210 on 2 and 504 DF,  p-value: < 2.2e-16

Koyck DLM model with Temperature as regressor variable is significant at 5% significance level.

For Chemical 1 regressor:

Koyck.Chem1 = koyckDlm(x = as.vector(mort$chem1) , y = as.vector(mort$mortality) )
summary(Koyck.Chem1$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -27.82596  -5.89508  -0.06125   6.06967  32.82722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.46578    5.34042  10.012   <2e-16 ***
## Y.1          0.65058    0.03738  17.407   <2e-16 ***
## X.t          0.70588    0.22498   3.138   0.0018 ** 
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 504    186.21  <2e-16 ***
## Wu-Hausman         1 503      5.97  0.0149 *  
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.017 on 504 degrees of freedom
## Multiple R-Squared: 0.5974,  Adjusted R-squared: 0.5958 
## Wald test: 336.8 on 2 and 504 DF,  p-value: < 2.2e-16

Koyck DLM model with Chemical 1 as regressor variable is significant at 5% significance level.

For Chemical 2 regressor:

Koyck.Chem2 = koyckDlm(x = as.vector(mort$chem2) , y = as.vector(mort$mortality) )
summary(Koyck.Chem2$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -30.80875  -6.85425   0.06398   7.01094  31.94255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.84742    5.56124   8.424 3.82e-16 ***
## Y.1          0.75420    0.04405  17.122  < 2e-16 ***
## X.t         -0.10536    0.11751  -0.897     0.37    
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   1 504     49.67 6.01e-12 ***
## Wu-Hausman         1 503     15.89 7.70e-05 ***
## Sargan             0  NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 504 degrees of freedom
## Multiple R-Squared: 0.4709,  Adjusted R-squared: 0.4688 
## Wald test: 252.9 on 2 and 504 DF,  p-value: < 2.2e-16

Koyck DLM model with Chemical 2 as regressor variable is significant at 5% significance level.

For Particle size regressor:

Koyck.ParticleSize = koyckDlm(x = as.vector(mort$particle.size) , y = as.vector(mort$mortality) )
summary(Koyck.ParticleSize$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -29.04298  -5.91345  -0.04809   6.25653  32.26785 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.19733    5.02197   9.398   <2e-16 ***
## Y.1          0.69461    0.03634  19.114   <2e-16 ***
## X.t          0.09294    0.06104   1.523    0.128    
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 504   148.898 < 2e-16 ***
## Wu-Hausman         1 503     8.471 0.00377 ** 
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.348 on 504 degrees of freedom
## Multiple R-Squared: 0.5673,  Adjusted R-squared: 0.5655 
## Wald test: 309.9 on 2 and 504 DF,  p-value: < 2.2e-16

Koyck DLM model with Particle size as regressor variable is significant at 5% significance level.

Koyck DLM models for each of the 4 regressors are significant.

Koyck Model selection

MASE(Koyck.Temp, Koyck.Chem1, Koyck.Chem2, Koyck.ParticleSize) %>% arrange(MASE)
##                      n      MASE
## Koyck.Chem1        507 0.8530742
## Koyck.ParticleSize 507 0.8837852
## Koyck.Chem2        507 0.9927387
## Koyck.Temp         507 1.1396264

As per MASE measure, Koyck DLM model with Chemical 1 as regressor is the best model for forecasting.

Diagnostic check for Koyck DLM (Residual analysis)

checkresiduals(Koyck.Chem1$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 112.71, df = 10, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 10

Since the Ljung-Box p-value < 0.05, significant serial correlation remains in the residuals.

Conclusion of Koyck DLM model

  • Koyck.Chem1, with Chemical 1 as regressor, is the best Koyck DLM model.
  • The model is significant.
  • MASE is 0.8530742.
  • Adjusted R-squared is 59.58%.
  • There are violations in the tests of assumptions.
  • Serial autocorrelation is significant.

Fit Autoregressive Distributed Lag Model

The autoregressive distributed lag (ARDL) model is a flexible and parsimonious infinite DLM. In its simplest form, the model is represented as,

\(Y_t = \mu + \beta_0 X_t + \beta_1 X_{t-1} + \gamma_1 Y_{t-1} + e_t\)

Similar to the Koyck DLM, this model can be written as an infinite DLM, with a lag distribution of any shape rather than only a polynomial or geometric one. The model is denoted ARDL(p,q) (see the general form below) and is fitted with the ardlDlm() function. Let's find the best lag lengths using the AIC and BIC scores through an iteration, with the maximum lag length set to 12.
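
In general, an ARDL(\(p\),\(q\)) model with a single regressor can be written as

\(Y_t = \mu + \sum_{i=0}^{p}\beta_i X_{t-i} + \sum_{j=1}^{q}\gamma_j Y_{t-j} + e_t\)

where \(p\) is the lag length of the regressor(s) and \(q\) the number of autoregressive lags of the response, matching the p and q arguments of ardlDlm().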

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 144 ARDL (since max lag for response and predictor of ARDL model is 12, i.e, p = q = 12 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:12){
  for(j in 1:12){
    model4.1 = ardlDlm(formula = mortality ~ temp + chem1 + chem2 + particle.size, data = mort, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),],1) # Best model as per AIC
##    p  q      AIC      BIC
## 60 5 12 3444.216 3604.066
head(df[order( df[,4] ),],1) # Best model as per BIC
##    p  q      AIC      BIC
## 12 1 12 3448.146 3540.691

ARDL(5,12) and ARDL(1,12) are the best models as per AIC and BIC scores respectively. Now, lets fit these 2 models,

1. ARDL(5,12) model (best as per AIC)

ARDL.5x12 = ardlDlm(formula = mortality ~ temp + chem1 + chem2 + particle.size, data = mort, p = 5, q = 12)
summary(ARDL.5x12)
## 
## Time series regression with "ts" data:
## Start = 13, End = 508
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.297  -4.707  -0.229   4.891  32.044 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     51.956115  11.346709   4.579 6.03e-06 ***
## temp.t           0.330981   0.086925   3.808 0.000159 ***
## temp.1          -0.391803   0.093837  -4.175 3.56e-05 ***
## temp.2          -0.176996   0.097987  -1.806 0.071522 .  
## temp.3           0.173976   0.098493   1.766 0.077996 .  
## temp.4           0.169554   0.093970   1.804 0.071834 .  
## temp.5          -0.182285   0.087057  -2.094 0.036821 *  
## chem1.t         -0.562260   0.335931  -1.674 0.094864 .  
## chem1.1          0.825981   0.347765   2.375 0.017954 *  
## chem1.2          1.035985   0.373850   2.771 0.005813 ** 
## chem1.3          0.303235   0.376345   0.806 0.420811    
## chem1.4         -0.327034   0.359288  -0.910 0.363180    
## chem1.5         -0.218576   0.349796  -0.625 0.532369    
## chem2.t          0.081020   0.068985   1.174 0.240823    
## chem2.1         -0.038998   0.070020  -0.557 0.577823    
## chem2.2         -0.088762   0.072207  -1.229 0.219601    
## chem2.3         -0.084572   0.072486  -1.167 0.243920    
## chem2.4          0.089458   0.069569   1.286 0.199127    
## chem2.5         -0.037733   0.069050  -0.546 0.585014    
## particle.size.t  0.148787   0.063509   2.343 0.019568 *  
## particle.size.1 -0.120956   0.064361  -1.879 0.060832 .  
## particle.size.2 -0.071128   0.063615  -1.118 0.264113    
## particle.size.3 -0.062485   0.063657  -0.982 0.326819    
## particle.size.4  0.010341   0.063776   0.162 0.871261    
## particle.size.5  0.160916   0.063555   2.532 0.011676 *  
## mortality.1      0.368259   0.046599   7.903 2.04e-14 ***
## mortality.2      0.384979   0.049186   7.827 3.48e-14 ***
## mortality.3     -0.001288   0.051757  -0.025 0.980155    
## mortality.4     -0.071443   0.051184  -1.396 0.163446    
## mortality.5      0.039420   0.049570   0.795 0.426885    
## mortality.6     -0.058020   0.046371  -1.251 0.211492    
## mortality.7     -0.052843   0.046106  -1.146 0.252344    
## mortality.8      0.020605   0.046402   0.444 0.657212    
## mortality.9      0.076066   0.046484   1.636 0.102439    
## mortality.10    -0.055984   0.046805  -1.196 0.232276    
## mortality.11     0.037981   0.043296   0.877 0.380817    
## mortality.12    -0.005607   0.041084  -0.136 0.891496    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.502 on 459 degrees of freedom
## Multiple R-squared:  0.7424, Adjusted R-squared:  0.7222 
## F-statistic: 36.74 on 36 and 459 DF,  p-value: < 2.2e-16

checkresiduals(ARDL.5x12$model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 40
## 
## data:  Residuals
## LM test = 35.675, df = 40, p-value = 0.6652

Since the Breusch-Godfrey p-value = 0.6652 (> 0.05), no significant serial correlation remains in the residuals.

MASE(ARDL.5x12)
##                MASE
## ARDL.5x12 0.6808706

Summary of ARDL(5x12) DLM model

  • model is significant
  • MASE is 0.6808706
  • Adjusted R-squared improved to 72.22%
  • No violations in the test of assumptions
  • Serial autocorrelations are insignificant

2. ARDL(1,12) model

ARDL.1x12 = ardlDlm(formula = mortality ~ temp + chem1 + chem2 + particle.size, data = mort, p = 1, q = 12)
summary(ARDL.1x12)
## 
## Time series regression with "ts" data:
## Start = 13, End = 508
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.125  -4.524  -0.314   4.910  30.287 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     55.669410   9.695803   5.742 1.67e-08 ***
## temp.t           0.299804   0.077392   3.874 0.000122 ***
## temp.1          -0.446996   0.076893  -5.813 1.13e-08 ***
## chem1.t         -0.266850   0.287687  -0.928 0.354102    
## chem1.1          1.063044   0.297810   3.570 0.000394 ***
## chem2.t          0.075833   0.062642   1.211 0.226664    
## chem2.1         -0.095445   0.063014  -1.515 0.130523    
## particle.size.t  0.109059   0.055098   1.979 0.048353 *  
## particle.size.1 -0.069979   0.055129  -1.269 0.204935    
## mortality.1      0.385514   0.044057   8.750  < 2e-16 ***
## mortality.2      0.343553   0.044331   7.750 5.63e-14 ***
## mortality.3     -0.025005   0.046485  -0.538 0.590890    
## mortality.4      0.003468   0.046338   0.075 0.940364    
## mortality.5      0.027723   0.046302   0.599 0.549632    
## mortality.6     -0.045346   0.046334  -0.979 0.328243    
## mortality.7     -0.050591   0.046366  -1.091 0.275778    
## mortality.8      0.015998   0.046757   0.342 0.732385    
## mortality.9      0.074476   0.046828   1.590 0.112406    
## mortality.10    -0.054654   0.047043  -1.162 0.245904    
## mortality.11     0.020304   0.043490   0.467 0.640814    
## mortality.12    -0.002706   0.040974  -0.066 0.947379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.647 on 475 degrees of freedom
## Multiple R-squared:  0.723,  Adjusted R-squared:  0.7114 
## F-statistic:    62 on 20 and 475 DF,  p-value: < 2.2e-16

checkresiduals(ARDL.1x12$model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 24
## 
## data:  Residuals
## LM test = 37.453, df = 24, p-value = 0.03941

Since the Breusch-Godfrey p-value = 0.03941 (< 0.05), significant serial correlation remains in the residuals.

MASE(ARDL.1x12)
##                MASE
## ARDL.1x12 0.7024931

Summary of ARDL(1x12) DLM model

  • ARDL(1,12) model is significant
  • MASE is 0.7024931
  • Adjusted R-squared worsens to 71.14%
  • Serial autocorrelation is significant, violating the assumption of uncorrelated residuals

ARDL Model selection

  • ARDL(5,12) has the better fit as per MASE (0.6808706 for ARDL(5,12) vs 0.7024931 for ARDL(1,12)).

Conclusion of ARDL models

ARDL(5,12) is the best of the ARDL models, with the better MASE and adjusted R-squared statistics. It also shows no significant residual serial correlation.

Most appropriate DLM model based on MASE (DLM Model Selection)

The 4 DLM models are,

  • Finite DLM model: DLM.model
  • Polynomial DLM model: PolyDLM.Chem1
  • Koyck transformed geometric DLM model: Koyck.Chem1
  • Autoregressive DLM model: ARDL(5,12)

The mean absolute scaled errors (MASE) of these models are,

MASE(DLM.model, PolyDLM.Chem1, Koyck.Chem1, ARDL.5x12) %>% arrange(MASE)
##                 n      MASE
## ARDL.5x12     496 0.6808706
## DLM.model     496 0.7966763
## Koyck.Chem1   507 0.8530742
## PolyDLM.Chem1 496 0.8944918

Conclusion of Distributed Lag models (DLM) modelling

The best DLM model for the Mortality response, giving the most accurate forecasts based on the MASE measure, is the autoregressive DLM model ARDL.5x12, with a MASE of 0.6808706.

B. Dynamic linear models (dynlm package)

Dynamic linear models are a general class of time series regression models which can account for trends, seasonality, serial correlation between the response and regressor variables, and, most importantly, the effect of intervention points.

The response of a general Dynamic linear model is,

\(Y_t = \omega_2Y_{t-1} + (\omega_0 + \omega_1)P_t - \omega_2\omega_0P_{t-1} + N_t\)

where,

  • \(Y_t\) is the response
  • \(\omega_2\) is the coefficient of 1 time unit lagged response
  • \(P_t\) is the pulse at the intervention point, with the \((\omega_0 + \omega_1)\) coefficient representing the instantaneous effect of the intervention
  • \(P_{t-1}\) is the lagged pulse, with coefficient \(\omega_2\omega_0\)
  • \(N_t\) represents the component with no intervention, referred to as the natural or unperturbed process.
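
Here the pulse function is an indicator that equals 1 only at the intervention time \(T\),

\(P_t = 1\) if \(t = T\), and \(P_t = 0\) otherwise,

which is exactly what the line P.t = 1*(seq(MortalityX) == T) constructs in the code below.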

Let's revisit the time series plot of the response, Mortality, to look for possible intervention points,

plot(Mortality)

As mentioned at the descriptive analysis stage, there is no intervention point that we can identify clearly by eye, but week 153 might be one simply because of its magnitude. Assuming this intervention point, let's fit dynamic linear models and see whether the pulse function at week 153 is significant.

Now, let's fit dynamic linear models using dynlm() as shown below (the potential intervention point was identified at week 153).

MortalityX = ts(mort[,1], start = c(2010,1), frequency = 52) # set frequency

Y.t = MortalityX
T = 153 # The time point when the intervention occurred 
P.t = 1*(seq(MortalityX) == T)
P.t.1 = Lag(P.t,+1) #library(tis) 

Dyn.model = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t) + season(Y.t)) # library(dynlm)

Dyn.model1 = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + trend(Y.t) + season(Y.t)) # library(dynlm)

Dyn.model2 = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + P.t.1 + trend(Y.t) + season(Y.t)) # library(dynlm)

Dyn.model3 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + P.t.1 + trend(Y.t) + season(Y.t)) # library(dynlm)

Dyn.model4 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t) + season(Y.t)) # library(dynlm); same specification as Dyn.model, hence the tied AIC below

AIC(Dyn.model, Dyn.model1, Dyn.model2, Dyn.model3, Dyn.model4) %>% arrange(AIC)
##            df      AIC
## Dyn.model  58 3581.297
## Dyn.model4 58 3581.297
## Dyn.model3 57 3585.558
## Dyn.model1 56 3675.135
## Dyn.model2 56 3675.135

Dyn.model is the best dynamic linear model as per the AIC score (Dyn.model4 is an identical specification, hence the tied AIC). It has 3 lagged components of the response (Mortality), a pulse component at week T = 153, and trend and seasonal components of the Mortality series with a frequency of 52 weeks. Let's look at the summary statistics and check the residuals,

summary(Dyn.model)
## 
## Time series regression with "ts" data:
## Start = 2010(4), End = 2019(40)
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + L(Y.t, k = 2) + L(Y.t, 
##     k = 3) + P.t + trend(Y.t) + season(Y.t))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.4794  -5.0652  -0.0579   4.7323  28.8326 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   60.81090    9.82971   6.186 1.39e-09 ***
## L(Y.t, k = 1)  0.24542    0.04779   5.135 4.21e-07 ***
## L(Y.t, k = 2)  0.38445    0.04514   8.518 2.50e-16 ***
## L(Y.t, k = 3)  0.02813    0.04730   0.595 0.552361    
## P.t           19.98213    8.64428   2.312 0.021252 *  
## trend(Y.t)    -0.55873    0.15121  -3.695 0.000247 ***
## season(Y.t)2   1.16618    3.75922   0.310 0.756539    
## season(Y.t)3  -3.90147    3.76534  -1.036 0.300688    
## season(Y.t)4  -0.21336    3.69209  -0.058 0.953942    
## season(Y.t)5   4.08592    3.70091   1.104 0.270172    
## season(Y.t)6  -2.43424    3.68742  -0.660 0.509498    
## season(Y.t)7  -4.26038    3.70334  -1.150 0.250587    
## season(Y.t)8  -0.69626    3.73511  -0.186 0.852208    
## season(Y.t)9  -4.06793    3.73698  -1.089 0.276933    
## season(Y.t)10 -1.16705    3.76150  -0.310 0.756506    
## season(Y.t)11 -1.32840    3.76654  -0.353 0.724491    
## season(Y.t)12 -4.89799    3.77185  -1.299 0.194761    
## season(Y.t)13  0.80463    3.79204   0.212 0.832055    
## season(Y.t)14 -6.62136    3.78599  -1.749 0.080991 .  
## season(Y.t)15  0.22392    3.82142   0.059 0.953300    
## season(Y.t)16 -4.59251    3.81011  -1.205 0.228706    
## season(Y.t)17 -4.32770    3.83136  -1.130 0.259272    
## season(Y.t)18 -3.84213    3.83334  -1.002 0.316743    
## season(Y.t)19 -1.81591    3.84827  -0.472 0.637245    
## season(Y.t)20  0.93120    3.84012   0.242 0.808510    
## season(Y.t)21 -5.58830    3.82162  -1.462 0.144364    
## season(Y.t)22 -5.39485    3.83070  -1.408 0.159730    
## season(Y.t)23 -4.11372    3.85112  -1.068 0.286011    
## season(Y.t)24 -2.38456    3.86640  -0.617 0.537720    
## season(Y.t)25  2.70699    3.86080   0.701 0.483576    
## season(Y.t)26 -5.68786    3.83237  -1.484 0.138470    
## season(Y.t)27 -7.66912    3.83656  -1.999 0.046217 *  
## season(Y.t)28 -5.62879    3.87081  -1.454 0.146601    
## season(Y.t)29 -1.84193    3.89559  -0.473 0.636568    
## season(Y.t)30 -0.98999    3.88793  -0.255 0.799125    
## season(Y.t)31 -3.58448    3.86759  -0.927 0.354530    
## season(Y.t)32 -0.69539    3.85448  -0.180 0.856912    
## season(Y.t)33  0.23723    3.83915   0.062 0.950755    
## season(Y.t)34 -1.28064    3.82116  -0.335 0.737672    
## season(Y.t)35 -7.57592    3.80111  -1.993 0.046859 *  
## season(Y.t)36  0.83909    3.83836   0.219 0.827056    
## season(Y.t)37 -0.98780    3.82847  -0.258 0.796513    
## season(Y.t)38  3.61908    3.82361   0.947 0.344399    
## season(Y.t)39  0.25202    3.77680   0.067 0.946827    
## season(Y.t)40 -0.30103    3.76586  -0.080 0.936323    
## season(Y.t)41  0.57732    3.85049   0.150 0.880884    
## season(Y.t)42  3.67096    3.84761   0.954 0.340554    
## season(Y.t)43  5.43604    3.82854   1.420 0.156340    
## season(Y.t)44  1.50648    3.80624   0.396 0.692446    
## season(Y.t)45  9.57170    3.79071   2.525 0.011912 *  
## season(Y.t)46 14.51684    3.77913   3.841 0.000140 ***
## season(Y.t)47 14.80036    3.78414   3.911 0.000106 ***
## season(Y.t)48  5.15147    3.76452   1.368 0.171864    
## season(Y.t)49  5.18546    3.87420   1.338 0.181425    
## season(Y.t)50  1.48511    3.76214   0.395 0.693214    
## season(Y.t)51  3.51171    3.75971   0.934 0.350788    
## season(Y.t)52  6.97193    3.76239   1.853 0.064531 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.94 on 448 degrees of freedom
## Multiple R-squared:  0.7208, Adjusted R-squared:  0.6859 
## F-statistic: 20.65 on 56 and 448 DF,  p-value: < 2.2e-16

checkresiduals(Dyn.model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 101
## 
## data:  Residuals
## LM test = 123.45, df = 101, p-value = 0.06399

Summary of Dynamic linear model, Dyn.model

  • The Breusch-Godfrey p-value of 0.06399 (> 0.05) indicates no significant residual serial correlation, but the adjusted R-squared of 68.59% falls short of the 72.22% achieved by ARDL(5,12).

Conclusion of Dynamic Linear model

Although the pulse component (P.t) at week 153 is significant, the dynamic linear model fits worse than the ARDL alternative. Thus, a dynamic linear model is not suitable/necessary for our Mortality time series.

C. Exponential Smoothing Method and State-Space models

Exponential smoothing methods, including the corresponding state-space models, account for the error, trend and seasonality components of the time series. Each component can be absent (N), additive (A) or multiplicative (M); these models are therefore denoted ETS(Z,Z,Z), the three letters representing the error, trend and seasonal components respectively.

The best exponential smoothing or state-space model for our Mortality time series can be identified by triggering the auto-search with the argument model = "ZZZ" in ets(), as shown below. We will also check whether a damped trend or a drift term gives a better model.

Best Exponential Smoothing model -

autofit.ETS = ets(Mortality, model="ZZZ")
summary(autofit.ETS)
## ETS(M,N,N) 
## 
## Call:
##  ets(y = Mortality, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.4818 
## 
##   Initial states:
##     l = 184.0437 
## 
##   sigma:  0.0526
## 
##      AIC     AICc      BIC 
## 5386.809 5386.857 5399.500 
## 
## Training set error measures:
##                       ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.06720288 9.061271 7.121783 -0.2494414 4.194287 0.8628215
##                     ACF1
## Training set -0.05776727

checkresiduals(autofit.ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 37.641, df = 10, p-value = 4.381e-05
## 
## Model df: 0.   Total lags used: 10

The system chooses simple exponential smoothing with multiplicative errors, ETS(M,N,N), with a MASE of 0.8628215. The Ljung-Box p-value is below 0.05, so significant autocorrelation remains in the residuals.
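
For reference, ETS(M,N,N) is simple exponential smoothing with multiplicative errors; in state-space form, with level \(\ell_t\) and smoothing parameter \(\alpha\),

\(Y_t = \ell_{t-1}(1 + \varepsilon_t)\)

\(\ell_t = \ell_{t-1}(1 + \alpha\varepsilon_t)\)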

Best Exponential Smoothing model with damping -

autofit.ETS.damped = ets(Mortality, model="ZZZ", damped = TRUE)
summary(autofit.ETS.damped)
## ETS(M,Ad,N) 
## 
## Call:
##  ets(y = Mortality, model = "ZZZ", damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.4501 
##     beta  = 0.0311 
##     phi   = 0.8 
## 
##   Initial states:
##     l = 189.1078 
##     b = -1.587 
## 
##   sigma:  0.0527
## 
##      AIC     AICc      BIC 
## 5391.739 5391.906 5417.122 
## 
## Training set error measures:
##                       ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.04653304 9.063544 7.111666 -0.2250772 4.186325 0.8615958
##                     ACF1
## Training set -0.04632118

checkresiduals(autofit.ETS.damped)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,N)
## Q* = 36.302, df = 10, p-value = 7.468e-05
## 
## Model df: 0.   Total lags used: 10

The system chooses Holt's damped trend model with multiplicative errors, ETS(M,Ad,N), with a MASE of 0.8615958.

Best Exponential Smoothing model with drift -

autofit.ETS.drift = ets(Mortality, model="ZZZ", beta = 1E-4)
summary(autofit.ETS.drift)
## ETS(M,N,N) 
## 
## Call:
##  ets(y = Mortality, model = "ZZZ", beta = 1e-04) 
## 
##   Smoothing parameters:
##     alpha = 0.4818 
## 
##   Initial states:
##     l = 184.0437 
## 
##   sigma:  0.0526
## 
##      AIC     AICc      BIC 
## 5386.809 5386.857 5399.500 
## 
## Training set error measures:
##                       ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.06720288 9.061271 7.121783 -0.2494414 4.194287 0.8628215
##                     ACF1
## Training set -0.05776727

checkresiduals(autofit.ETS.drift)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 37.641, df = 10, p-value = 4.381e-05
## 
## Model df: 0.   Total lags used: 10

Again the system chooses the ETS(M,N,N) model.

Thus, the best exponential smoothing or state-space model for our Mortality series is Holt's damped model with multiplicative errors, ETS(M,Ad,N), with a MASE of 0.8615958.

Conclusion of Exponential Smoothing Method and State-Space models

Based on the MASE measure, the state-space model giving the most accurate forecasts is ETS(M,Ad,N), which has the lowest MASE (0.8615958) of all the state-space models considered.

Overall Most Appropriate Regression model (Model Selection)

Across the modelling approaches considered, the best model per approach as judged by the MASE measure is summarized below,

  • A. Best distributed lag model: Autoregressive Distributed Lag model ARDL(5,12), with a MASE of 0.6808706, AIC of 3444.216, BIC of 3604.066, and adjusted R-squared of 72.22%.

  • B. Best dynamic linear model: none (no significant intervention points were present).

  • C. Best exponential smoothing/state-space model: Holt's damped trend model with multiplicative errors, ETS(M,Ad,N), with a MASE of 0.8615958, AIC of 5391.739, and BIC of 5417.122.

Clearly, the best model is the Autoregressive Distributed Lag model ARDL(5,12) as per the AIC, BIC, and MASE measures.

Best Time Series regression model for Forecasting

The best time series regression model is the Autoregressive Distributed Lag model ARDL(5,12), with a MASE of 0.6808706.
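For reference, a sketch of the general form of this model (with \(\mathbf{X}_t\) denoting the vector of the 4 regressors) is

\(Y_t = \mu + \sum_{i=1}^{5} \phi_i Y_{t-i} + \sum_{s=0}^{12} \boldsymbol{\beta}_s^{\top} \mathbf{X}_{t-s} + \epsilon_t\)

where 5 autoregressive lags of Mortality and lags 0 through 12 of each regressor enter the regression.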

Detailed Graphical and statistical tests of assumptions for the \(ARDL(5,12)\) model (Residual Analysis)

Let's perform a detailed residual analysis to check whether any of the model assumptions have been violated.

The estimation error (or residual) is defined by \(\hat{\epsilon}_i = Y_i - \hat{Y}_i\), i.e. the observed value minus the fitted value.

The following assumptions are to be checked,

  1. Linearity in the distribution of the error terms
  2. The mean value of the residuals is zero
  3. Absence of serial autocorrelation
  4. Normality of the distribution of the error terms
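Before the formal checks, a quick sanity check of the residual definition above (a minimal sketch; it assumes the ARDL(5,12) model frame starts at week 13, since 12 lags of the regressors are consumed):

obs = as.numeric(Mortality)[13:508] # observed values aligned with the model frame
fit = as.numeric(ARDL.5x12$model$fitted.values)
head(obs - fit) # should match head(ARDL.5x12$model$residuals)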

Let's first apply a diagnostic check using the checkresiduals() function,

checkresiduals(ARDL.5x12)
## Time Series:
## Start = 13 
## End = 508 
## Frequency = 1 
## (496 residual values for weeks 13-508 omitted for brevity)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1.9828, df = 10, p-value = 0.9965
## 
## Model df: 0.   Total lags used: 10
  1. From the residuals plot, the residuals are randomly scattered around their mean; thus, linearity in the distribution of the error terms is not violated.

  2. To test whether the mean value of the residuals is zero, let's calculate it,

mean(ARDL.5x12$model$residuals)
## [1] 1.875473e-16

As the mean value of the residuals (about 1.9e-16) is effectively zero, the zero-mean assumption is not violated.

  3. The checkresiduals() output also displays the Ljung-Box test. The hypotheses of this test are,

\(H_0\) : the residual series exhibits no serial autocorrelation of any order up to lag p
\(H_a\) : the residual series exhibits serial autocorrelation at some order up to lag p

From the Ljung-Box test output, since the p-value (0.9965) > 0.05, we do not reject the null hypothesis of no serial autocorrelation.

Thus, according to this test and the ACF plot, we can conclude that any serial correlation left in the residuals is insignificant.
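The same test can also be run standalone via stats::Box.test() (a minimal sketch, matching the 10 lags used above),

Box.test(ARDL.5x12$model$residuals, lag = 10, type = "Ljung-Box")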

  4. From the histogram shown by checkresiduals(), the residuals appear approximately normal. Let's test this statistically,

\(H_0\) : Time series is Normally distributed
\(H_a\) : Time series is not normal

shapiro.test(ARDL.5x12$model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  ARDL.5x12$model$residuals
## W = 0.99131, p-value = 0.005213

From the Shapiro-Wilk test, since the p-value (0.005213) is below the 0.05 significance level, we reject the null hypothesis of normality. Thus, the residuals of ARDL.5x12 are not normally distributed.
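As a complementary visual check (a small sketch using the same base-R Q-Q plotting used elsewhere in this report),

qqnorm(ARDL.5x12$model$residuals, main = "Normal Q-Q Plot of ARDL(5,12) residuals")
qqline(ARDL.5x12$model$residuals, col = 2)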

Summarizing the residual analysis of the \(ARDL(5,12)\) model:

Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of the errors is zero (zero-mean residuals): Not violated
Assumption 3: The error terms are independently distributed, i.e. not autocorrelated: Not violated
Assumption 4: The errors are normally distributed: Violated

Only the normality-of-residuals assumption is violated for the ARDL(5,12) model; note also that for the competing state-space candidate ETS(M,Ad,N), 'There is no normality assumption in fitting an exponential smoothing model' (Rob Hyndman 2013). With all other assumptions satisfied, the ARDL(5,12) model is suitable for accurate forecasting of Mortality. Let's forecast the next 4 weeks ahead,

Forecasting

Using the MASE measure, the Autoregressive Distributed Lag model ARDL(5,12) is the best-fitted model for forecasting Mortality. Let's estimate and plot the 4-weeks-ahead (weeks 509-512) forecasts for the Mortality series.

Observed and fitted values are plotted below. This plot indicates a good agreement between the model and the original series.

plot(Mortality,ylab='Mortality', xlab = 'week', type="l", col="black", main="Observed and fitted values using ARDL(5,12) model on Mortality")
lines(ARDL.5x12$model$fitted.values, col="red")
legend("topleft",lty=1, text.width = 12,
       col=c("black", "red"), 
       c("Mortality series", "ARDL(5,12) fit"))

Since the future values of the covariates aren't given, let's first estimate the best exponential smoothing/state-space model for each of the 4 covariates. A custom function GoFVals() will be used.

GoFVals = function(data, H, models){
  # Fit each candidate ETS model to each series and collect goodness-of-fit
  # measures. (H, the forecast horizon, is carried for interface consistency
  # but is not used inside this function.)
  M = length(models) # The number of competing models
  N = length(data) # The number of considered time series
  fit.models = list()
  series = array(NA, N*M)
  FittedModels = array(NA, N*M)
  AIC = array(NA, N*M)
  AICc = array(NA, N*M)
  BIC = array(NA, N*M)
  HQIC = array(NA, N*M)
  MASE = array(NA, N*M)
  mean.MASE = array(NA, N)
  median.MASE = array(NA, N)
  GoF = data.frame(series, FittedModels, AIC, AICc, BIC, HQIC, MASE)
  count = 0
  for ( j in 1:N){
    sum.MASE = 0
    sample.median = array(NA, M)
    for ( i in 1:M){
      count = count + 1
      fit.models[[count]] = ets(data[[j]], model = models[i])
      GoF$AIC[count] = fit.models[[count]]$aic
      GoF$AICc[count] = fit.models[[count]]$aicc
      GoF$BIC[count] = fit.models[[count]]$bic
      q = length(fit.models[[count]]$par) # number of estimated parameters
      # Hannan-Quinn information criterion
      GoF$HQIC[count] = -2*fit.models[[count]]$loglik + 2*q*log(log(length(data[[j]])))
      GoF$MASE[count] = accuracy(fit.models[[count]])[6] # column 6 of accuracy() is MASE
      sum.MASE = sum.MASE + GoF$MASE[count]
      sample.median[i] = GoF$MASE[count]
      GoF$series[count] = j
      GoF$FittedModels[count] = models[i]
    }
    mean.MASE[j] = sum.MASE / M # average MASE over the M models for series j
    median.MASE[j] = median(sample.median)
  }
  return(list(GoF = GoF, mean.MASE = mean.MASE, median.MASE = median.MASE))
}

An ETS auto-search was run for each of the 4 regressors (that part of the analysis has been hidden for simplicity); based on it, we focus on the specifications "MAN", "AAN", and "ANN" for the 4 regressors: Temperature, Chemical 1 and 2, and particle size. The fit for each regressor using the GoFVals() function is shown below.

# Series to be modelled
data = list()
data[[1]] = Temp
data[[2]] = Chem1
data[[3]] = Chem2
data[[4]] = ParticleSize

# Specify the forecast horizon
H = 4

# Specify the models we will focus on
models = c("MAN", "AAN", "ANN")

GoFVals(data = data, H = H, models = models)
## $GoF
##    series FittedModels      AIC     AICc      BIC     HQIC      MASE
## 1       1          MAN 5093.616 5093.784 5118.999 5099.910 0.8200243
## 2       1          AAN 5076.333 5076.501 5101.716 5082.628 0.8210695
## 3       1          ANN 5088.180 5088.227 5100.871 5089.498 0.8233697
## 4       2          MAN 3922.682 3922.850 3948.065 3928.977 0.7599568
## 5       2          AAN 4097.537 4097.705 4122.920 4103.832 0.7590948
## 6       2          ANN 4130.231 4130.279 4142.922 4131.549 0.7992823
## 7       3          MAN 5610.257 5610.424 5635.640 5616.551 0.7553396
## 8       3          AAN 5631.253 5631.421 5656.636 5637.547 0.7551668
## 9       3          ANN 5641.994 5642.042 5654.686 5643.312 0.7683649
## 10      4          MAN 5598.713 5598.881 5624.096 5605.008 0.7749270
## 11      4          AAN 5621.606 5621.774 5646.989 5627.901 0.7669993
## 12      4          ANN 5643.652 5643.700 5656.344 5644.970 0.8037025
## 
## $mean.MASE
## [1] 0.8214878 0.7727780 0.7596238 0.7818763
## 
## $median.MASE
## [1] 0.8210695 0.7599568 0.7553396 0.7749270

Based on MASE, the best ETS models for each regressor are,

  • For Temperature - MAN
  • For Chemical 1 - AAN
  • For Chemical 2 - AAN
  • For Particle Size - AAN

Lets fit these models and get the future covariates,

fit.MAN.Temp = ets(Temp, model="MAN")
forecast.MAN.Temp = forecast::forecast(fit.MAN.Temp, h = 4)

fit.AAN.Chem1 = ets(Chem1, model="AAN")
forecast.AAN.Chem1 = forecast::forecast(fit.AAN.Chem1, h = 4)

fit.AAN.Chem2 = ets(Chem2, model="AAN")
forecast.AAN.Chem2 = forecast::forecast(fit.AAN.Chem2, h = 4)

fit.AAN.ParticleSize = ets(ParticleSize, model="AAN")
forecast.AAN.ParticleSize = forecast::forecast(fit.AAN.ParticleSize, h = 4)

Using the Point Forecasts of these covariates, we can now forecast our Mortality response.

# New covariate values: one row per regressor (Temp, Chem1, Chem2, ParticleSize),
# one column per week of the 4-week forecast horizon
x.new = t(matrix(c(forecast.MAN.Temp$mean, forecast.AAN.Chem1$mean, forecast.AAN.Chem2$mean, forecast.AAN.ParticleSize$mean), ncol = 4,
                nrow = 4))
forecasts.ardldlm = dLagM::forecast(model = ARDL.5x12,  x = x.new, h = 4)$forecasts

Forecast using overall BEST fitting model:

The point forecasts and the forecast plot using the overall best fitting model, ARDL(5,12) is given below,

df <- data.frame(
  ARDL_forecasts = c(forecasts.ardldlm)
) 
row.names(df) <- c("week 509", "week 510", "week 511", "week 512")
df
##          ARDL_forecasts
## week 509       171.0425
## week 510       171.7601
## week 511       171.2978
## week 512       171.4611
Mortality.extended4 = c(Mortality , forecasts.ardldlm)

{
plot(ts(Mortality.extended4),type="l", col = "red", xlim= c(400, 515),
ylab = "Mortality", xlab = "Weeks", 
main="4 weeks ahead forecast for Mortality series
      using ARDL(5,12) model")          
lines(Mortality,col="black",type="l")
legend("topleft",lty=1,
       col=c("black", "red"), 
       c("Mortality series", "ARDL(5,12) forecasts"))
}

The forecasts for the best Finite DLM, Polynomial DLM, Koyck DLM, and Exponential smoothing/State-space models are printed and plotted below (note: the Dynamic Linear model was found insignificant),

For Distributed Lag models:

The 4 weeks ahead Point forecasts for the 4 DLM models are printed and plotted below,

# Forecasts using Finite DLM 
forecasts.dlm = dLagM::forecast(model = DLM.model, x = x.new, h = 4)$forecasts

# Forecasts using Polynomial DLM 
x.new2 =  c(forecast.AAN.Chem1$mean)
forecasts.polydlm = dLagM::forecast(model = PolyDLM.Chem1 , x = x.new2, h = 4)$forecasts

# Forecasts using Koyck DLM
x.new3 =  c(forecast.AAN.Chem1$mean)
forecasts.koyckdlm = dLagM::forecast(model = Koyck.Chem1 , x = x.new3, h = 4)$forecasts

# Forecasts using ARDL 
forecasts.ardldlm = dLagM::forecast(model = ARDL.5x12,  x = x.new, h = 4)$forecasts

df <- data.frame(
  Finite_DLM_forecasts = c(forecasts.dlm),
  Polynomial_DLM_forecasts = c(forecasts.polydlm),
  Koyck_DLM_forecasts = c(forecasts.koyckdlm),
  ARDL_forecasts = c(forecasts.ardldlm)
) 
row.names(df) <- c("week 509", "week 510", "week 511", "week 512")
df
##          Finite_DLM_forecasts Polynomial_DLM_forecasts Koyck_DLM_forecasts
## week 509             167.0714                 164.6609            170.3823
## week 510             167.3189                 165.2797            169.9345
## week 511             169.0378                 166.2768            169.7852
## week 512             167.4805                 166.9232            169.8031
##          ARDL_forecasts
## week 509       171.0425
## week 510       171.7601
## week 511       171.2978
## week 512       171.4611
Mortality.extended1 = c(Mortality , forecasts.dlm)
Mortality.extended2 = c(Mortality , forecasts.polydlm)
Mortality.extended3 = c(Mortality , forecasts.koyckdlm)
Mortality.extended4 = c(Mortality , forecasts.ardldlm)


{
plot(ts(Mortality.extended4),type="l", col = "Red", xlim= c(400, 515),
ylab = "Mortality", xlab = "Weeks", 
main="4 weeks ahead forecast for Mortality series
      using DLM models")          
lines(ts(Mortality.extended1),col="blue",type="l")
lines(ts(Mortality.extended2),col="green",type="l")
lines(ts(Mortality.extended3),col="orange",type="l")
lines(Mortality,col="black",type="l")
legend("topleft",lty=1,
       col=c("black", "red", "blue", "green", "orange"), 
       c("Mortality series", "ARDL(5,12) forecasts", "Finite DLM forecasts", "Polynomial DLM forecasts", "Koyck DLM forecasts"))
}

For Exponential smoothing/State-space model:

The 4 weeks ahead point forecasts and Confidence intervals are printed and plotted below,

forecasts.ETS.damped = forecast::forecast(autofit.ETS.damped, h = 4)
forecasts.ETS.damped
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 509       167.6492 156.3245 178.9738 150.3296 184.9687
## 510       167.9555 155.3966 180.5144 148.7483 187.1627
## 511       168.2005 154.4267 181.9743 147.1353 189.2657
## 512       168.3966 153.4364 183.3568 145.5169 191.2763
plot(forecasts.ETS.damped, ylab="Mortality", type="l", fcol="red", xlab="weeks", xlim= c(400, 515),
main="4 weeks ahead forecasts using ETS(M,Ad,N) model")
legend("topleft",lty=1, pch=1, col=1:2, c("Mortality series","ETS(M,Ad,N) forecasts"))

Conclusion

The best-fitting model for our Mortality series, in terms of the MASE measure of forecast accuracy, is the Autoregressive Distributed Lag model \(ARDL(5,12)\) with all 4 regressors: Temperature, Chemical 1 and 2, and particle size. The point forecasts for the 4 weeks ahead, obtained from forecast() in the dLagM package, are 171.0425, 171.7601, 171.2978, and 171.4611 respectively (confidence intervals are not output by this function).

Future Directions

Potentially better forecasting methods can be explored, compared, and diagnosed in search of a better fit.

Reference List

Rob Hyndman (2013) Does the Holt-Winters algorithm for exponential smoothing in time series modelling require the normality assumption in residuals?, Stack Exchange Website, accessed 26 September 2023. https://stats.stackexchange.com/questions/64911/does-the-holt-winters-algorithm-for-exponential-smoothing-in-time-series-modelli#:~:text=There%20is%20no%20normality%20assumption,under%20almost%20all%20residual%20distributions.

Task 2: Univariate Forecasting of First Flowering Day: Four-Year Ahead Predictions

Data Description

The dataset holds 6 columns and 31 observations: a Year column; the day of the year on which the species first flowers (first flowering day, FFD, a number between 1 and 365); and the climate factors, namely rainfall (rain), temperature (temp), radiation level (rad), and relative humidity (RH) - all focused on one species of plant and measured from 1984 to 2014.

Objective

Our aim for the FFD dataset is to give the best 4-years-ahead forecasts by determining the most accurate and suitable regression model for the yearly first flowering day, judged in terms of MASE, using a single predictor (univariate analysis). A descriptive analysis will be conducted initially. A model-building strategy will then be applied to find the best-fitting model from the time series regression methods (dLagM package), dynamic linear models (dynlm package), and exponential smoothing and corresponding state-space models.

Model Selection Criteria

MASE, Information Criteria (AIC and BIC), and Adjusted R Squared.

Read Data

FFD_dataset <- read.csv("C:/Users/admin/Downloads/FFD.csv")
head(FFD_dataset)
##   Year Temperature Rainfall Radiation RelHumidity FFD
## 1 1984    18.71038 2.489344  14.87158    54.64891 314
## 2 1985    19.26301 2.475890  14.68493    54.95781 314
## 3 1986    18.58356 2.421370  14.51507    54.96301 320
## 4 1987    19.10137 2.319726  14.67397    53.87205 306
## 5 1988    20.36066 2.465301  14.74863    53.11885 306
## 6 1989    19.59589 2.735890  14.78356    55.37671 314

Identification of the response and the regressor variables

For fitting a regression model, the response is FFD and the 4 regressor variables are Temperature, Rainfall, Radiation level, and Relative Humidity.

  • y = FFD = First flowering day, a number between 1 and 365
  • x1 = Temperature
  • x2 = Rainfall
  • x3 = Radiation
  • x4 = RelHumidity = Relative Humidity

All the 5 variables are continuous variables.

Read Regressor and Response variables

Lets first get the regressor and response as TS objects,

FFD = ts(FFD_dataset[,6], start = c(1984))
Temperature = ts(FFD_dataset[,2], start = c(1984))
Rainfall = ts(FFD_dataset[,3], start = c(1984))
Radiation = ts(FFD_dataset[,4], start = c(1984))
RelHumidity = ts(FFD_dataset[,5], start = c(1984))
data.ts = ts(FFD_dataset, start = c(1984)) # Y and x in single dataframe

Relationship between Regressor and Response variables

Let's scale, center, and plot all 5 variables together,

plot(FFD)

data.scale = scale(data.ts)
plot(data.scale[,2:6], plot.type="s", col=c("red", "blue", "green", "yellow", "black"), main = "FFD (Black - Response), Temperature (Red - X1),\n  Rainfall (Blue - X2), Radiation (Green - X3), RelHumidity (Yellow - X4)")

It is hard to read the correlations between the regressors and the response, and among the regressors themselves, from this plot alone, but it is fair to say the 5 variables show some correlation. Let's examine the correlations using ggpairs(),

ggpairs(data = FFD_dataset, columns = c(6,2,3,4,5), progress = FALSE) #library(GGally)
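For a purely numeric view of the same pairwise relationships (a small sketch using base R's cor(); column names as in FFD_dataset),

round(cor(FFD_dataset[, c("FFD", "Temperature", "Rainfall", "Radiation", "RelHumidity")]), 2)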

Hence, some correlation between the 4 regressors and the response is present, and we can build a regression model on these relationships. First, let's look at the descriptive statistics.

Descriptive Analysis

Since we are generating a regression model that estimates the response, \(FFD\), let's focus on the FFD statistics.

Summary statistics

summary(FFD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   265.0   291.0   301.0   306.4   314.0   380.0

The mean (306.4) and median (301.0) of FFD are fairly close, suggesting a roughly symmetric distribution, although the maximum of 380 hints at some right skew.

Time Series plot:

The time series plot for our data is generated using the following code chunk,

plot(FFD, ylab='Yearly average of First Flowering Day (FFD)',xlab='Year',
     type='o', main="Figure 1: Yearly Average FFD Trend (1984-2014)")

Plot Inference :

From Figure 1, we can comment on the following features of the series,

  • Trend: The overall shape seems to follow a downward trend, indicating non-stationarity.

  • Seasonality: From the plot, no seasonal behavior is seen.

  • Change in Variance: We see high variation in FFD series during the years 1997-2004 and low variation during other years.

  • Behavior: We notice mixed AR and MA behavior. AR behavior is seen as successive data points tend to follow one another; MA behavior is evident from the up-and-down fluctuations between data points.

  • Intervention/Change points: No clear intervention point is seen. The years 2002-2003 might be intervention points; we will check whether they cause a significant change in the mean level.

ACF and PACF plots:

acf(FFD, main="ACF of FFD")

pacf(FFD, main ="PACF of FFD")

  • ACF plot: We notice no significant autocorrelations, and the absence of a slowly decaying pattern indicates a stationary series. We also do not see any wave-like form, so no significant seasonal behavior is observed.

  • PACF plot: The first spike is insignificant, consistent with a stationary series.

Check normality

Many model-estimation procedures assume normality of the residuals; if this assumption doesn't hold, the coefficient estimates are not optimal. Let's look at the Quantile-Quantile (QQ) plot to assess normality visually, and use the Shapiro-Wilk test to confirm the result statistically.

qqnorm(FFD, main = "Normal Q-Q Plot of Average yearly FFD")
qqline(FFD, col = 2)

We see deviations from normality: the upper tail is clearly off the line, and much of the data in the middle is off the line as well. Let's check statistically using the Shapiro-Wilk test, whose hypotheses are,

\(H_0\) : Time series is Normally distributed
\(H_a\) : Time series is not normal

shapiro.test(FFD)
## 
##  Shapiro-Wilk normality test
## 
## data:  FFD
## W = 0.85617, p-value = 0.0006877

From the Shapiro-Wilk test, since the p-value (0.0006877) is below the 0.05 significance level, we reject the null hypothesis of normality. Thus, the FFD series is not normally distributed.

Test Stationarity

The ACF and PACF of the FFD series at the descriptive analysis stage suggested stationarity, while the downward trend in the time series plot suggested otherwise. Let's apply the ADF and PP tests,

Using ADF (Augmented Dickey-Fuller) test :

Let's test for (non-)stationarity using the Augmented Dickey-Fuller (ADF) test. The hypotheses are,

\(H_0\) : Time series is Difference non-stationary
\(H_a\) : Time series is Stationary

adf.test(FFD) #library(tseries)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  FFD
## Dickey-Fuller = -2.5139, Lag order = 3, p-value = 0.3749
## alternative hypothesis: stationary

Since the p-value > 0.05, we do not reject the null hypothesis of non-stationarity; according to this test, the series is non-stationary at the 5% level of significance.

Using PP (Phillips-Perron) test :

The null and alternative hypotheses are the same as for the ADF test.

PP.test(FFD, lshort = TRUE)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  FFD
## Dickey-Fuller = -4.0962, Truncation lag parameter = 2, p-value =
## 0.01861
PP.test(FFD, lshort = FALSE)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  FFD
## Dickey-Fuller = -3.9565, Truncation lag parameter = 8, p-value =
## 0.02368

According to the PP tests, the FFD series is stationary at the 5% level.

The two procedures give differing outcomes. Since the Phillips-Perron (PP) test is non-parametric, i.e. it does not require selecting the order of serial correlation as the ADF test does, and since our FFD series does not show significant serial autocorrelation, we go with the PP test's outcome: the FFD series is stationary.

Conclusion from descriptive analysis:

  • From the ACF/PACF plots and the PP tests, the FFD response is stationary; differencing is not required.
  • The series is not normally distributed, so a Box-Cox transformation will be attempted.

Let's proceed with the Box-Cox transformation,

Transformations

Box-Cox transformation to improve normality

To improve the normality of our FFD time series, let's apply a Box-Cox transformation to the series.
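For reference, the Box-Cox transformation with parameter \(\lambda\) (estimated below via BoxCox.lambda()) is \(y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}\) for \(\lambda \neq 0\), and \(y^{(\lambda)} = \log(y)\) for \(\lambda = 0\).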

lambda = BoxCox.lambda(FFD, method = "loglik") # library(forecast)
BC.FFD = BoxCox(FFD, lambda = lambda)

Check Normality of BC transformed FFD series

Visually comparing the time series plots before and after box-cox transformation,

par(mfrow=c(2,1))
plot(BC.FFD,ylab='Yearly FFD',xlab='Time',
     type='o', main="Box-Cox Transformed FFD Time Series")
points(y=BC.FFD,x=time(BC.FFD))
plot(FFD,ylab='Yearly FFD',xlab='Time',
     type='o', main="Original FFD Time Series")
points(y=FFD,x=time(FFD))

par(mfrow=c(1,1))

From the plots, almost no improvement in the variance of the time series is visible after the BC transformation. Let's check normality using the Shapiro-Wilk test,

shapiro.test(BC.FFD)
## 
##  Shapiro-Wilk normality test
## 
## data:  BC.FFD
## W = 0.92261, p-value = 0.0277

From the Shapiro-Wilk test, since the p-value (0.0277) is below the 0.05 significance level, we reject the null hypothesis of normality. Thus, the BC-transformed FFD is still not normal.

Conclusion after BC transformation

The BC-transformed FFD series is stationary but not normal; the BC transformation was not effective.

Decomposition

At the descriptive analysis stage, the time series plot and the ACF/PACF plots showed no seasonal pattern but a downward trend. Let's decompose the FFD series to confirm this; the STL decomposition method will be used.

STL decomposition

Let's set t.window to 15 and look at the STL-decomposed plots.

We can adjust the series for seasonality by subtracting the seasonal component from the original series using the following code chunk,

Note - Since we cannot run a decomposition on a series with frequency 1, let's artificially set the frequency to 2. Also note that the time axis now ends near 2000 rather than 2014, because doubling the frequency compresses the 31 observations into about 15.5 "years". This is acceptable since we are only interested in the decomposition.

# Code gist - Apply STL decomposition to get seasonally adjusted and trend adjusted and visually compare w.r.t to original time series

FFDX = ts(FFD_dataset[,6], start = c(1984),frequency = 2) # set frequency
stl.FFD <- stl(window(FFDX, start=c(1984)), t.window=15, s.window="periodic", robust=TRUE)

par(mfrow=c(3,1))

plot(FFDX,ylab='FFD',xlab='Time',
     type='o', main="Original FFD Time Series")

plot(seasadj(stl.FFD), ylab='FFD',xlab='Time', main = "Seasonally adjusted FFD")

stl.FFD.trend = stl.FFD$time.series[,"trend"] # Extract the trend component from the output
stl.FFD.trend.adjusted = FFDX - stl.FFD.trend

plot(stl.FFD.trend.adjusted, ylab='FFD',xlab='Time', main = "Trend adjusted FFD")

par(mfrow=c(1,1))

On close inspection of the plots above, the trend-adjusted series departs more from the original FFD series than the seasonally adjusted series does. That is, the trend component is more significant than the seasonal component in the FFD series.

Conclusion of Decomposition

Trend component is more significant than the seasonal component in the FFD series. Thus, we expect the fitted model to have no seasonal component.

Modeling

Time series regression methods, namely,

  • A. Distributed lag models (dLagM package),
  • B. Dynamic linear models (dynlm package), and
  • C. Exponential smoothing and corresponding state-space models

will be considered.

A. Distributed lag models

Based on whether the lag length is known (finite DLM) or undetermined (infinite DLM), 4 major modelling methods will be tested, namely,

  • Basic Finite Distributed lag model,
  • Polynomial DLM,
  • Koyck transformed geometric DLM,
  • and Autoregressive DLM.

Fit Finite DLM

The response of a finite DLM model with 1 regressor is represented as shown below,

\(Y_t = \alpha + \sum_{s=0}^{q} \beta_s X_{t-s} + \epsilon_t\)

where,

  • \(\alpha\) is the intercept,
  • \(\beta_s\) is the coefficient of the regressor lagged by s periods, \(X_{t-s}\),
  • and \(\epsilon_t\) is the error term.
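As a concrete example, with a single regressor and q = 2 the model expands to

\(Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \epsilon_t\)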

In our dataset, we have 4 regressors. For univariate analysis, let's fit a single-regressor model for each of the 4 regressors.

Note - We use FFD rather than BC.FFD (the BC-transformed FFD series), as normality is violated in both.

1. Temperature as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the finite DLM model,

finiteDLMauto(formula = FFD ~ Temperature, data = FFD_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE      AIC      BIC     GMRAE    MBRAE R.Adj.Sq  Ljung-Box
## 15    15 0.00000     -Inf     -Inf   0.00000  0.13333      NaN        NaN
## 16    16 0.00000     -Inf     -Inf   0.00000  0.14286      NaN        NaN
## 17    17 0.00000     -Inf     -Inf   0.00000  0.15385      NaN        NaN
## 18    18 0.00000     -Inf     -Inf   0.00000  0.16667      NaN        NaN
## 19    19 0.00000     -Inf     -Inf   0.00000  0.13636      NaN        NaN
## 20    20 0.00000     -Inf     -Inf   0.00000  0.15000      NaN        NaN
## 14    14 0.08990 108.3305 122.4952  62.05535  0.36951  0.92421 0.02999554
## 13    13 0.27470 158.8972 173.1432 118.78319 -0.00663  0.60746 0.15929080
## 12    12 0.39566 182.3394 196.5060  95.89904  1.23693  0.30941 0.29664926
## 11    11 0.45582 190.0656 204.0059  99.13968  0.86788  0.40406 0.34898253
## 10    10 0.54444 202.4974 216.0762  95.21457  0.75880  0.30833 0.86820277
## 9      9 0.69138 217.5864 230.6789  98.30030  0.78249  0.07685 0.13632287
## 8      8 0.70080 225.7275 238.2179  88.30203  1.10743  0.11344 0.45917951
## 7      7 0.69979 236.6118 248.3923  40.87507  0.55008  0.00966 0.80817276
## 6      6 0.75788 248.1290 259.0989 134.74503  0.47069 -0.13573 0.33850283
## 5      5 0.73794 255.3456 265.4104 102.43406  0.47497 -0.09659 0.23858117
## 4      4 0.84199 264.5275 273.5984 101.88888  0.38987 -0.15287 0.20121785
## 3      3 0.85614 270.8700 278.8632 204.15267  0.34556 -0.09628 0.20691076
## 2      2 0.84819 277.4305 284.2670 171.03798  0.47663 -0.04610 0.22787307
## 1      1 0.82301 284.3083 289.9131 144.59119  0.52710 -0.02241 0.19315716

q = 14 has the smallest AIC and BIC scores (the q >= 15 rows are degenerate fits with more parameters than observations, hence the -Inf values and zero MASE). Fit the model with q = 14,

DLM.Temperature = dlm(formula = FFD ~ Temperature, data = FFD_dataset, q = 14)
summary(DLM.Temperature)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.87863 -0.01357  2.16805 -2.44038  2.36726 -3.46998  1.57429 -1.69876 
##        9       10       11       12       13       14       15       16 
##  2.05320  2.82958 -1.94792 -2.01906  2.49497  2.29534 -2.48892  1.36276 
##       17 
## -2.18822 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     324.150    272.588   1.189    0.445
## Temperature.t     8.726     15.306   0.570    0.670
## Temperature.1    23.651     22.743   1.040    0.488
## Temperature.2    19.604     12.959   1.513    0.372
## Temperature.3    -5.885     15.286  -0.385    0.766
## Temperature.4    -5.930     11.873  -0.499    0.705
## Temperature.5    10.871     16.614   0.654    0.631
## Temperature.6     4.638      7.972   0.582    0.665
## Temperature.7   -43.245      9.989  -4.329    0.145
## Temperature.8   -20.122      8.696  -2.314    0.260
## Temperature.9     8.562     13.430   0.638    0.639
## Temperature.10   -7.797      7.130  -1.093    0.472
## Temperature.11  -22.479      9.770  -2.301    0.261
## Temperature.12    2.044     11.708   0.175    0.890
## Temperature.13   12.845      7.391   1.738    0.332
## Temperature.14   12.630      7.772   1.625    0.351
## 
## Residual standard error: 8.881 on 1 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9242 
## F-statistic: 14.01 on 15 and 1 DF,  p-value: 0.207
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 108.3305 122.4952

The DLM.Temperature model is insignificant (p-value = 0.207) at the 0.05 significance level.

Without intercept :

DLM.Temperature.noIntercept = dlm(formula = FFD ~ 0 + Temperature, data = FFD_dataset, q = 14)
summary(DLM.Temperature.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
##  1.5293  1.5668  1.9674 -2.0817  4.5839 -4.5171  5.1261 -0.5464  5.8975  0.3338 
##      11      12      13      14      15      16      17 
## -2.8880  0.3861 -0.2987 -2.5086 -2.5415  1.4761 -7.1409 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## Temperature.t     7.771     16.793   0.463   0.6890  
## Temperature.1    42.787     17.658   2.423   0.1363  
## Temperature.2    25.787     13.041   1.977   0.1866  
## Temperature.3   -16.455     13.663  -1.204   0.3517  
## Temperature.4    -8.112     12.888  -0.629   0.5934  
## Temperature.5    23.699     13.882   1.707   0.2299  
## Temperature.6     8.248      8.098   1.019   0.4156  
## Temperature.7   -49.947      9.062  -5.512   0.0314 *
## Temperature.8   -20.216      9.554  -2.116   0.1686  
## Temperature.9    -1.095     11.751  -0.093   0.9342  
## Temperature.10   -2.734      6.284  -0.435   0.7060  
## Temperature.11  -27.131      9.836  -2.758   0.1101  
## Temperature.12    9.938     10.596   0.938   0.4473  
## Temperature.13   11.407      8.011   1.424   0.2905  
## Temperature.14   10.338      8.272   1.250   0.3378  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.757 on 2 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.999 
## F-statistic:  1107 on 15 and 2 DF,  p-value: 0.0009028
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 121.3131 134.6445

The DLM.Temperature.noIntercept model is significant (p-value = 0.0009028).

2. Rainfall as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the finite DLM model,

finiteDLMauto(formula = FFD ~ Rainfall, data = FFD_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE      AIC      BIC     GMRAE      MBRAE R.Adj.Sq  Ljung-Box
## 15    15 0.00000     -Inf     -Inf   0.00000    0.13333      NaN        NaN
## 16    16 0.00000     -Inf     -Inf   0.00000    0.14286      NaN        NaN
## 17    17 0.00000     -Inf     -Inf   0.00000    0.15385      NaN        NaN
## 18    18 0.00000     -Inf     -Inf   0.00000    0.16667      NaN        NaN
## 19    19 0.00000     -Inf     -Inf   0.00000    0.13636      NaN        NaN
## 20    20 0.00000     -Inf     -Inf   0.00000    0.15000      NaN        NaN
## 14    14 0.06321 100.0839 114.2485  40.97864    0.34442  0.95334 0.03001328
## 13    13 0.29906 166.8307 181.0766  78.54528   -0.30469  0.39005 0.51860280
## 12    12 0.41664 181.2808 195.4474 121.22896 -953.69675  0.34683 0.20539117
## 11    11 0.44805 188.2359 202.1761 111.03396    1.08054  0.45616 0.20552675
## 10    10 0.55752 200.1260 213.7048 107.68352    0.27076  0.38219 0.24086643
## 9      9 0.59928 212.7206 225.8131  86.83974    0.90191  0.26002 0.03078702
## 8      8 0.63873 220.6896 233.1800  71.50389    0.28130  0.28784 0.01640637
## 7      7 0.72075 234.3870 246.1675  75.08693    0.37359  0.09734 0.04286781
## 6      6 0.77856 247.9410 258.9109 155.32847    0.40794 -0.12723 0.01597736
## 5      5 0.79987 254.7647 264.8295 115.44821    0.65748 -0.07236 0.01296258
## 4      4 0.77962 264.1272 273.1980  88.57121    0.51811 -0.13590 0.03640757
## 3      3 0.83069 270.9038 278.8970 224.91666    0.18089 -0.09761 0.05860571
## 2      2 0.83748 277.5444 284.3809 156.29025    0.06368 -0.05021 0.05860348
## 1      1 0.84164 284.1154 289.7202 128.87890   -4.27195 -0.01585 0.05515185

q = 14 has the smallest AIC and BIC scores. Fit model with q = 14,

DLM.Rainfall = dlm(formula = FFD ~ Rainfall, data = FFD_dataset, q = 14)
summary(DLM.Rainfall)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
## -2.0224  0.1804  1.2773  2.4063 -1.5688  0.5488 -2.3178  0.7084 -0.3974  2.6065 
##      11      12      13      14      15      16      17 
## -0.9432  0.7841 -1.2432  0.8286 -3.1891  2.7153 -0.3739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  536.968     81.521   6.587   0.0959 .
## Rainfall.t   -22.617     11.404  -1.983   0.2973  
## Rainfall.1   -19.118      7.085  -2.698   0.2259  
## Rainfall.2   -30.871      8.480  -3.640   0.1707  
## Rainfall.3   -10.860      6.858  -1.584   0.3586  
## Rainfall.4   -45.273      9.064  -4.995   0.1258  
## Rainfall.5   -61.818      6.821  -9.063   0.0700 .
## Rainfall.6   -44.444      7.509  -5.918   0.1066  
## Rainfall.7    28.972      8.629   3.358   0.1843  
## Rainfall.8    24.033      7.222   3.328   0.1858  
## Rainfall.9    42.785      7.557   5.662   0.1113  
## Rainfall.10   35.400      7.262   4.875   0.1288  
## Rainfall.11   30.458      7.855   3.878   0.1607  
## Rainfall.12  -19.904      7.011  -2.839   0.2156  
## Rainfall.13   -2.661     11.686  -0.228   0.8575  
## Rainfall.14  -14.529     10.470  -1.388   0.3975  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.968 on 1 degrees of freedom
## Multiple R-squared:  0.9971, Adjusted R-squared:  0.9533 
## F-statistic:  22.8 on 15 and 1 DF,  p-value: 0.1631
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 100.0839 114.2485

The DLM.Rainfall model is insignificant (p-value = 0.1631) at the 0.05 significance level.

Without intercept :

DLM.Rainfall.noIntercept = dlm(formula = FFD ~ 0 + Rainfall, data = FFD_dataset, q = 14)
summary(DLM.Rainfall.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
## -10.103 -22.647  -5.902  13.906  10.911   1.895  -5.248 -14.913  -7.868   7.168 
##      11      12      13      14      15      16      17 
##  15.913   9.856  10.133   9.585 -11.094  -7.647   9.978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## Rainfall.t    33.725     35.533   0.949    0.443
## Rainfall.1     1.135     30.071   0.038    0.973
## Rainfall.2   -20.010     39.188  -0.511    0.660
## Rainfall.3    -2.902     31.803  -0.091    0.936
## Rainfall.4   -26.893     40.628  -0.662    0.576
## Rainfall.5   -61.929     32.132  -1.927    0.194
## Rainfall.6   -22.233     31.609  -0.703    0.555
## Rainfall.7    48.877     38.075   1.284    0.328
## Rainfall.8    38.476     32.416   1.187    0.357
## Rainfall.9    25.108     33.279   0.754    0.529
## Rainfall.10   28.563     33.859   0.844    0.488
## Rainfall.11    9.729     33.905   0.287    0.801
## Rainfall.12  -10.485     32.334  -0.324    0.776
## Rainfall.13   41.674     45.004   0.926    0.452
## Rainfall.14   37.699     32.211   1.170    0.362
## 
## Residual standard error: 32.83 on 2 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9884 
## F-statistic: 97.69 on 15 and 2 DF,  p-value: 0.01018
## 
## AIC and BIC values for the model:
##       AIC      BIC
## 1 162.564 175.8954

The DLM.Rainfall.noIntercept model is significant (p-value = 0.01018).

3. Radiation as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the finite DLM model,

finiteDLMauto(formula = FFD ~ Radiation, data = FFD_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE       AIC      BIC     GMRAE    MBRAE R.Adj.Sq  Ljung-Box
## 15    15 0.00000      -Inf     -Inf   0.00000  0.13333      NaN        NaN
## 16    16 0.00000      -Inf     -Inf   0.00000  0.14286      NaN        NaN
## 17    17 0.00000      -Inf     -Inf   0.00000  0.15385      NaN        NaN
## 18    18 0.00000      -Inf     -Inf   0.00000  0.16667      NaN        NaN
## 19    19 0.00000      -Inf     -Inf   0.00000  0.13636      NaN        NaN
## 20    20 0.00000      -Inf     -Inf   0.00000  0.15000      NaN        NaN
## 14    14 0.03909  89.04681 103.2114  18.82913  0.37467  0.97562 0.44963843
## 13    13 0.33459 166.15857 180.4045 149.80902  1.73189  0.41240 0.10406781
## 12    12 0.44056 183.55370 197.7203 133.73569  0.87138  0.26383 0.96472470
## 11    11 0.48709 190.82115 204.7614 132.94946 -0.67933  0.38111 0.88498361
## 10    10 0.50968 200.65556 214.2344  70.58245  0.12372  0.36641 0.96837884
## 9      9 0.67299 213.95421 227.0467  91.01071  0.49096  0.21734 0.34600265
## 8      8 0.74835 227.33217 239.8226  95.08541  0.54224  0.04938 0.06793710
## 7      7 0.73121 233.54416 245.3247  73.68775  0.56382  0.12849 0.06174573
## 6      6 0.82962 250.69918 261.6691 178.94922  0.73611 -0.25871 0.30633276
## 5      5 0.82691 257.54959 267.6144  91.87223 -0.46592 -0.19360 0.18952017
## 4      4 0.84162 264.08048 273.1513 106.78227  0.60001 -0.13394 0.17174020
## 3      3 0.83712 270.41454 278.4078 167.36396  0.62192 -0.07860 0.15731113
## 2      2 0.84548 277.39923 284.2357 117.96904  0.92888 -0.04497 0.17671372
## 1      1 0.85379 284.14675 289.7515 105.58523  0.50672 -0.01691 0.16861971

q = 14 has the smallest AIC and BIC scores. Fit model with q = 14,

DLM.Radiation = dlm(formula = FFD ~ Radiation, data = FFD_dataset, q = 14)
summary(DLM.Radiation)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.13309 -1.59683 -0.19007  1.77884  0.16235 -0.02138  0.61645 -2.14129 
##        9       10       11       12       13       14       15       16 
##  0.80942 -0.25149  2.25184  0.24341 -2.67825  0.04819  0.46272 -0.44306 
##       17 
##  1.08223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -8562.308   1469.386  -5.827   0.1082  
## Radiation.t    107.716     14.721   7.317   0.0865 .
## Radiation.1     78.781     14.515   5.428   0.1160  
## Radiation.2     56.552     17.392   3.252   0.1899  
## Radiation.3     48.867     19.819   2.466   0.2453  
## Radiation.4    -87.676     20.012  -4.381   0.1429  
## Radiation.5   -101.449     25.466  -3.984   0.1566  
## Radiation.6    105.043     12.740   8.245   0.0768 .
## Radiation.7     12.211     11.781   1.037   0.4886  
## Radiation.8    115.181     22.558   5.106   0.1231  
## Radiation.9    -45.629      7.503  -6.082   0.1037  
## Radiation.10   -45.000      7.348  -6.124   0.1030  
## Radiation.11   -14.283      8.005  -1.784   0.3252  
## Radiation.12    43.601      8.722   4.999   0.1257  
## Radiation.13   168.244     24.040   6.998   0.0904 .
## Radiation.14   167.580     29.060   5.767   0.1093  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.036 on 1 degrees of freedom
## Multiple R-squared:  0.9985, Adjusted R-squared:  0.9756 
## F-statistic: 43.69 on 15 and 1 DF,  p-value: 0.1182
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 89.04681 103.2114

The DLM.Radiation model is insignificant (p-value = 0.1182) at the 0.05 significance level.

Without intercept :

DLM.Radiation.noIntercept = dlm(formula = FFD ~ 0 + Radiation, data = FFD_dataset, q = 14)
summary(DLM.Radiation.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
## -10.103 -22.647  -5.902  13.906  10.911   1.895  -5.248 -14.913  -7.868   7.168 
##      11      12      13      14      15      16      17 
##  15.913   9.856  10.133   9.585 -11.094  -7.647   9.978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## Rainfall.t    33.725     35.533   0.949    0.443
## Rainfall.1     1.135     30.071   0.038    0.973
## Rainfall.2   -20.010     39.188  -0.511    0.660
## Rainfall.3    -2.902     31.803  -0.091    0.936
## Rainfall.4   -26.893     40.628  -0.662    0.576
## Rainfall.5   -61.929     32.132  -1.927    0.194
## Rainfall.6   -22.233     31.609  -0.703    0.555
## Rainfall.7    48.877     38.075   1.284    0.328
## Rainfall.8    38.476     32.416   1.187    0.357
## Rainfall.9    25.108     33.279   0.754    0.529
## Rainfall.10   28.563     33.859   0.844    0.488
## Rainfall.11    9.729     33.905   0.287    0.801
## Rainfall.12  -10.485     32.334  -0.324    0.776
## Rainfall.13   41.674     45.004   0.926    0.452
## Rainfall.14   37.699     32.211   1.170    0.362
## 
## Residual standard error: 32.83 on 2 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9884 
## F-statistic: 97.69 on 15 and 2 DF,  p-value: 0.01018
## 
## AIC and BIC values for the model:
##       AIC      BIC
## 1 162.564 175.8954

The DLM.Radiation.noIntercept model is significant (p-value = 0.01018).

4. RelHumidity as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the finite DLM model,

finiteDLMauto(formula = FFD ~ RelHumidity, data = FFD_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE       AIC       BIC     GMRAE     MBRAE R.Adj.Sq Ljung-Box
## 15    15 0.00000      -Inf      -Inf   0.00000   0.13333      NaN       NaN
## 16    16 0.00000      -Inf      -Inf   0.00000   0.14286      NaN       NaN
## 17    17 0.00000      -Inf      -Inf   0.00000   0.15385      NaN       NaN
## 18    18 0.00000      -Inf      -Inf   0.00000   0.16667      NaN       NaN
## 19    19 0.00000      -Inf      -Inf   0.00000   0.13636      NaN       NaN
## 20    20 0.00000      -Inf      -Inf   0.00000   0.15000      NaN       NaN
## 14    14 0.01053  39.90284  54.06747   5.74566   0.23305  0.99865 0.5062922
## 13    13 0.29014 163.95746 178.20341  98.23537   0.75770  0.48004 0.5217713
## 12    12 0.34461 178.08248 192.24906  94.00451   0.21639  0.44803 0.6765633
## 11    11 0.39538 185.84189 199.78215  97.88977   0.32728  0.51751 0.3213649
## 10    10 0.51502 197.62296 211.20175 112.98141   0.44008  0.45161 0.8515080
## 9      9 0.73191 222.08303 235.17554  93.41179   0.63595 -0.13251 0.1991888
## 8      8 0.71876 228.95749 241.44792  71.60391 -86.76689 -0.02023 0.1283026
## 7      7 0.74342 237.69356 249.47409  57.30648  -0.54471 -0.03600 0.1683119
## 6      6 0.85119 252.53366 263.50354 145.90020   0.60988 -0.35454 0.1414141
## 5      5 0.85674 259.13851 269.20328 128.72173   0.70634 -0.26882 0.1438949
## 4      4 0.81748 265.84331 274.91417  91.83514   0.41658 -0.21044 0.1588539
## 3      3 0.82083 272.18292 280.17615 160.11196  -5.00196 -0.14891 0.1528627
## 2      2 0.82665 278.86224 285.69872 125.00108   0.45375 -0.09904 0.1577557
## 1      1 0.83046 285.18336 290.78815 118.55218   0.46508 -0.05267 0.1458492

Ignoring the degenerate fits for q = 15 to 20 (AIC and BIC of -Inf), q = 14 has the smallest AIC and BIC scores. Let's fit the model with q = 14,

DLM.RelHumidity = dlm(formula = FFD ~ RelHumidity, data = FFD_dataset, q = 14)
summary(DLM.RelHumidity)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.50251 -0.14706  0.32627  0.29784  0.24216 -0.54987  0.23332 -0.20803 
##        9       10       11       12       13       14       15       16 
##  0.50850  0.07915  0.09090 -0.25842 -0.01506  0.03899 -0.32785  0.01504 
##       17 
##  0.17662 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -1108.0727    66.7845 -16.592   0.0383 *
## RelHumidity.t     -6.1015     0.5084 -12.003   0.0529 .
## RelHumidity.1     -5.4993     0.3776 -14.565   0.0436 *
## RelHumidity.2     -3.4724     0.3217 -10.793   0.0588 .
## RelHumidity.3      3.8202     0.3128  12.213   0.0520 .
## RelHumidity.4     -6.4177     0.3679 -17.443   0.0365 *
## RelHumidity.5    -10.5496     0.4186 -25.205   0.0252 *
## RelHumidity.6     -3.5258     0.3440 -10.250   0.0619 .
## RelHumidity.7      9.0306     0.3014  29.961   0.0212 *
## RelHumidity.8     10.3493     0.4336  23.870   0.0267 *
## RelHumidity.9      8.7593     0.4638  18.886   0.0337 *
## RelHumidity.10    18.4883     0.4371  42.298   0.0150 *
## RelHumidity.11    10.2434     0.4535  22.586   0.0282 *
## RelHumidity.12     1.6249     0.4428   3.670   0.1694  
## RelHumidity.13     0.6677     0.6748   0.989   0.5034  
## RelHumidity.14    -1.6337     0.6342  -2.576   0.2357  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.187 on 1 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9986 
## F-statistic:   788 on 15 and 1 DF,  p-value: 0.02795
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 39.90284 54.06747

The DLM.RelHumidity model is significant (p-value = 0.02795) at the 0.05 significance level.

Without intercept :

DLM.RelHumidity.noIntercept = dlm(formula = FFD ~ 0 + RelHumidity, data = FFD_dataset, q = 14)
summary(DLM.RelHumidity.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9      10 
## -1.7316 11.9571  1.2094  1.0992  0.2971  1.0447  7.2072  0.8436  3.7545 -5.9973 
##      11      12      13      14      15      16      17 
## -5.0662 -3.3051 -2.4479 -7.1852  0.2318  3.3305 -5.5918 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## RelHumidity.t   -12.229      4.106  -2.978   0.0967 .
## RelHumidity.1    -5.435      4.438  -1.225   0.3453  
## RelHumidity.2    -2.406      3.705  -0.649   0.5827  
## RelHumidity.3     1.872      3.407   0.549   0.6380  
## RelHumidity.4    -5.812      4.303  -1.351   0.3093  
## RelHumidity.5    -8.025      4.583  -1.751   0.2220  
## RelHumidity.6    -1.757      3.844  -0.457   0.6924  
## RelHumidity.7     8.423      3.516   2.395   0.1389  
## RelHumidity.8     8.539      4.932   1.731   0.2255  
## RelHumidity.9     5.638      4.982   1.131   0.3753  
## RelHumidity.10   18.657      5.136   3.633   0.0681 .
## RelHumidity.11   11.430      5.264   2.171   0.1620  
## RelHumidity.12    1.394      5.201   0.268   0.8138  
## RelHumidity.13   -5.482      6.628  -0.827   0.4951  
## RelHumidity.14   -9.473      4.972  -1.905   0.1971  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.95 on 2 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9979 
## F-statistic: 541.6 on 15 and 2 DF,  p-value: 0.001845
## 
## AIC and BIC values for the model:
##        AIC      BIC
## 1 133.4673 146.7987

The DLM.RelHumidity.noIntercept model is significant (p-value = 0.001845) at the 0.05 significance level.

Finite DLM Model Selection

The finite DLM models without intercept are significant for all 4 predictors, as is DLM.RelHumidity with intercept. Eliminating the insignificant models, let's compare the significant finite DLM models based on adjusted R-squared, AIC, BIC and MASE,

Model <- c("DLM.Temperature.noIntercept", "DLM.Rainfall.noIntercept", "DLM.Radiation.noIntercept", "DLM.RelHumidity", "DLM.RelHumidity.noIntercept")
AIC <- c(AIC(DLM.Temperature.noIntercept), AIC(DLM.Rainfall.noIntercept), AIC(DLM.Radiation.noIntercept), AIC(DLM.RelHumidity), AIC(DLM.RelHumidity.noIntercept))
BIC <- c(BIC(DLM.Temperature.noIntercept), BIC(DLM.Rainfall.noIntercept), BIC(DLM.Radiation.noIntercept), BIC(DLM.RelHumidity), BIC(DLM.RelHumidity.noIntercept))
Adjusted_Rsquared <- c(0.999, 0.9884, 0.9884, 0.9986, 0.9979)
MASE <- MASE(DLM.Temperature.noIntercept, DLM.Rainfall.noIntercept, DLM.Radiation.noIntercept, DLM.RelHumidity, DLM.RelHumidity.noIntercept)
data.frame(AIC, BIC, Adjusted_Rsquared, MASE) %>% arrange(AIC)
##                                   AIC       BIC Adjusted_Rsquared  n       MASE
## DLM.RelHumidity              39.90284  54.06747            0.9986 17 0.01053278
## DLM.Temperature.noIntercept 121.31305 134.64447            0.9990 17 0.11899711
## DLM.RelHumidity.noIntercept 133.46732 146.79874            0.9979 17 0.16333023
## DLM.Rainfall.noIntercept    162.56397 175.89538            0.9884 17 0.45818179
## DLM.Radiation.noIntercept   162.56397 175.89538            0.9884 17 0.45818179

Thus, as per AIC, BIC and MASE, the finite distributed lag model for FFD with Relative Humidity as the regressor (DLM.RelHumidity) is the best.

Diagnostic check for DLM.RelHumidity (Residual analysis)

We can apply a diagnostic check using the checkresiduals() function from the forecast package.

checkresiduals(DLM.RelHumidity$model$residuals) # forecast package

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 5.5937, df = 3, p-value = 0.1331
## 
## Model df: 0.   Total lags used: 3

In this output,

  • from the time series plot and histogram of residuals, there is an obvious random pattern and normality in the residual distribution. Thus, no violation of the general assumptions.
  • the Ljung-Box test output is displayed. According to this test (p-value = 0.1331 > 0.05), we fail to reject the null hypothesis that the series of residuals exhibits no autocorrelation up to lag 3. Together with the ACF plot, we can conclude that the serial correlation left in the residuals is NOT significant.
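The same Ljung-Box statistic can also be computed directly with Box.test() from the stats package (already loaded); a minimal sketch, using the same number of lags as the output above:

Box.test(DLM.RelHumidity$model$residuals, lag = 3, type = "Ljung-Box")
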
Conclusion of Finite DLM model
  • Best model is with Relative Humidity as the regressor (DLM.RelHumidity).
  • DLM.RelHumidity Model is significant.
  • MASE is 0.01053278
  • Adjusted R-squared is 99.86%.
  • No violations in the test of assumptions
  • Serial autocorrelation is not significant

ATTENTION - From here on, let's summarise the models rather than go into each model's details, for simplicity.

Fit Polynomial DLM model

The polynomial DLM mitigates the effect of multicollinearity among the lagged regressors by constraining the lag weights to follow a polynomial in the lag (see the sketch below). Let's fit a polynomial DLM of order 2 for each of the 4 regressors individually.
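For reference, with lag length \(q = 14\) and polynomial order \(k = 2\), the lag weights are constrained to lie on a quadratic in the lag \(s\),

\(\beta_s = z_0 + z_1 s + z_2 s^2, \quad s = 0, 1, \dots, 14\)

so only the three parameters \(z_0\), \(z_1\) and \(z_2\) need to be estimated; they appear as the z.t0, z.t1 and z.t2 coefficients in the summaries below.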

1. Temperature as regressor
PolyDLM.Temperature = polyDlm(x = as.vector(Temperature), y = as.vector(FFD), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.Temperature)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.293 -13.817  -2.696  12.161  49.486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 357.0169   503.3239   0.709  0.49065   
## z.t0         23.0987     9.9875   2.313  0.03775 * 
## z.t1         -9.5889     3.0664  -3.127  0.00802 **
## z.t2          0.6472     0.1864   3.471  0.00413 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.38 on 13 degrees of freedom
## Multiple R-squared:  0.536,  Adjusted R-squared:  0.4289 
## F-statistic: 5.006 on 3 and 13 DF,  p-value: 0.01596

The polynomial DLM model with Temperature as the regressor variable is significant at the 5% significance level.

2. Rainfall as regressor
PolyDLM.Rainfall = polyDlm(x = as.vector(Rainfall), y = as.vector(FFD), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.Rainfall)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.557 -12.489  -8.102   5.573  62.516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 253.8534   245.0117   1.036    0.319
## z.t0        -14.3017    18.9184  -0.756    0.463
## z.t1          4.2052     5.7540   0.731    0.478
## z.t2         -0.2047     0.4175  -0.490    0.632
## 
## Residual standard error: 32.19 on 13 degrees of freedom
## Multiple R-squared:  0.1907, Adjusted R-squared:  0.003978 
## F-statistic: 1.021 on 3 and 13 DF,  p-value: 0.415

The polynomial DLM model with Rainfall as the regressor variable is insignificant at the 5% significance level.

3. Radiation as regressor
PolyDLM.Radiation = polyDlm(x = as.vector(Radiation), y = as.vector(FFD), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.Radiation)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.703 -13.245  -3.613   0.802  61.872 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1084.8086  1619.5212   0.670    0.515
## z.t0           0.9589    17.5034   0.055    0.957
## z.t1          -2.7044     5.2672  -0.513    0.616
## z.t2           0.2129     0.3893   0.547    0.594
## 
## Residual standard error: 31.76 on 13 degrees of freedom
## Multiple R-squared:  0.2124, Adjusted R-squared:  0.03059 
## F-statistic: 1.168 on 3 and 13 DF,  p-value: 0.3594

The polynomial DLM model with Radiation as the regressor variable is insignificant at the 5% significance level.

4. Relative Humidity as regressor
PolyDLM.RelHumidity = polyDlm(x = as.vector(RelHumidity), y = as.vector(FFD), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.RelHumidity)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.620 -12.644  -6.069   1.193  69.905 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -980.16153 1319.02934  -0.743    0.471
## z.t0          -0.09770    4.50105  -0.022    0.983
## z.t1           1.03154    1.65904   0.622    0.545
## z.t2          -0.08175    0.13240  -0.617    0.548
## 
## Residual standard error: 33.21 on 13 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  -0.06 
## F-statistic: 0.6981 on 3 and 13 DF,  p-value: 0.5697

The polynomial DLM model with Relative Humidity as the regressor variable is insignificant at the 5% significance level.

PolyDLM Model selection

Only the polynomial DLM model with Temperature as the regressor is significant.

MASE(PolyDLM.Temperature, PolyDLM.Rainfall, PolyDLM.Radiation, PolyDLM.RelHumidity)
##                      n      MASE
## PolyDLM.Temperature 17 0.7388406
## PolyDLM.Rainfall    17 0.9173699
## PolyDLM.Radiation   17 0.8164986
## PolyDLM.RelHumidity 17 0.8588444

Also as per MASE, the polynomial DLM model with Temperature as the regressor is the best.

Diagnostic check for Polynomial DLM (Residual analysis)
checkresiduals(PolyDLM.Temperature$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 6.5881, df = 3, p-value = 0.08625
## 
## Model df: 0.   Total lags used: 3

The serial correlation left in the residuals is insignificant as per the Ljung-Box test (p-value = 0.08625) and the ACF plot. From the time series plot and histogram of residuals, there is an obvious random pattern and normality in the residual distribution. Thus, no violation of the general assumptions.

Conclusion of Polynomial DLM model
  • The model with Temperature as the regressor is the best of all 4 regressors.
  • The model is significant.
  • MASE is 0.7388406
  • Adjusted R-squared is 42.89%
  • No violations in the test of assumptions
  • Serial autocorrelation is insignificant
  • The 0th, 1st and 2nd order terms (z.t0, z.t1 and z.t2) of the Temperature variable are significant.

Fit Koyck geometric DLM model

Here the lag weights are positive and decline geometrically. This model is called the infinite geometric DLM, since there are infinitely many lag weights. The Koyck transformation makes the model estimable by subtracting \(\phi\) times the first lag of the geometric DLM from the model itself. The Koyck transformed model is represented as,

\(Y_t = \delta_1 + \delta_2Y_{t-1} + \delta_3X_t + \nu_t\)

where \(\delta_1 = \alpha(1-\phi), \delta_2 = \phi, \delta_3 = \beta\) and the random error after the transformation is \(\nu_t = (\epsilon_t -\phi\epsilon_{t-1})\).
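For reference, the untransformed infinite geometric DLM can be written as,

\(Y_t = \alpha + \beta(X_t + \phi X_{t-1} + \phi^2 X_{t-2} + \dots) + \epsilon_t, \quad |\phi| < 1\)

and it follows from the definitions above that the original parameters can be recovered from the Koyck estimates as \(\hat{\phi} = \hat{\delta}_2\), \(\hat{\beta} = \hat{\delta}_3\) and \(\hat{\alpha} = \hat{\delta}_1/(1 - \hat{\delta}_2)\).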

The koyckDlm() function implements a two-stage least squares method: it first estimates \(\hat{Y}_{t-1}\) and then estimates \(Y_t\) through simple linear regression (a sketch of the idea is given below). Let's fit Koyck geometric DLM models for each of the 4 regressors individually.
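Below is a minimal sketch of the two-stage idea, not dLagM's exact implementation (the instrument choice and the alignment of the series are simplified for illustration):

# Stage 1: regress the lagged response Y_{t-1} on the regressor X_t;
# Stage 2: regress Y_t on the stage-1 fitted values and X_t.
y <- as.vector(FFD_dataset$FFD)
x <- as.vector(FFD_dataset$RelHumidity) # any one of the 4 regressors
n <- length(y)
stage1 <- lm(y[1:(n - 1)] ~ x[2:n])            # Y_{t-1} explained by X_t
stage2 <- lm(y[2:n] ~ fitted(stage1) + x[2:n]) # Y_t ~ Y.hat_{t-1} + X_t
summary(stage2)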

1. Temperature as regressor

With intercept :

Koyck.Temperature = koyckDlm(x = as.vector(FFD_dataset$Temperature) , y = as.vector(FFD_dataset$FFD) )
summary(Koyck.Temperature$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.451 -15.487  -2.648   6.757  75.055 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 526.5488   401.9192   1.310    0.201
## Y.1           0.1585     0.2372   0.668    0.510
## X.t         -13.7092    18.0543  -0.759    0.454
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value  
## Weak instruments   1  27     6.247  0.0188 *
## Wu-Hausman         1  26     0.306  0.5846  
## Sargan             0  NA        NA      NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.66 on 27 degrees of freedom
## Multiple R-Squared: 0.03631, Adjusted R-squared: -0.03508 
## Wald test: 1.255 on 2 and 27 DF,  p-value: 0.3011

Koyck.Temperature is insignificant at the 5% significance level (Wald test p-value = 0.3011).

Without intercept :

Koyck.Temperature.NoIntercept = koyckDlm(x = as.vector(FFD_dataset$Temperature) , y = as.vector(FFD_dataset$FFD), intercept = FALSE)
summary(Koyck.Temperature.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.778 -19.249  -1.045   9.819  74.110 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)   
## Y.1   0.3774     0.1718   2.197  0.03646 * 
## X.t   9.6891     2.6956   3.594  0.00123 **
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2  27   146.001 3.33e-15 ***
## Wu-Hausman         1  27     1.869    0.183    
## Sargan             1  NA     1.765    0.184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.2 on 28 degrees of freedom
## Multiple R-Squared: 0.9932,  Adjusted R-squared: 0.9927 
## Wald test:  2049 on 2 and 28 DF,  p-value: < 2.2e-16

Koyck.Temperature.NoIntercept is significant at the 5% significance level.

2. Rainfall as regressor

With intercept :

Koyck.Rainfall = koyckDlm(x = as.vector(FFD_dataset$Rainfall) , y = as.vector(FFD_dataset$FFD) )
summary(Koyck.Rainfall$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.776 -21.430  -3.028   6.055  93.212 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 177.6796   136.5337   1.301    0.204
## Y.1           0.1850     0.3036   0.609    0.547
## X.t          30.2749    75.4565   0.401    0.691
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value
## Weak instruments   1  27     1.152   0.293
## Wu-Hausman         1  26     0.695   0.412
## Sargan             0  NA        NA      NA
## 
## Residual standard error: 30.69 on 27 degrees of freedom
## Multiple R-Squared: -0.3779, Adjusted R-squared: -0.48 
## Wald test: 0.7567 on 2 and 27 DF,  p-value: 0.4789

The Koyck.Rainfall model is insignificant at the 5% significance level.

Without intercept :

Koyck.Rainfall.NoIntercept = koyckDlm(x = as.vector(FFD_dataset$Rainfall) , y = as.vector(FFD_dataset$FFD), intercept = FALSE)
summary(Koyck.Rainfall.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67.19 -36.77 -10.18  16.76 146.09 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)
## Y.1   0.1140     0.5448   0.209    0.836
## X.t 114.4554    70.8645   1.615    0.117
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2  27     2.172 0.133400    
## Wu-Hausman         1  27    15.604 0.000505 ***
## Sargan             1  NA     0.544 0.460588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.97 on 28 degrees of freedom
## Multiple R-Squared: 0.969,   Adjusted R-squared: 0.9668 
## Wald test: 448.7 on 2 and 28 DF,  p-value: < 2.2e-16

The Koyck.Rainfall.NoIntercept model is significant at the 5% significance level.

3. Radiation as regressor

With intercept :

Koyck.Radiation = koyckDlm(x = as.vector(FFD_dataset$Radiation) , y = as.vector(FFD_dataset$FFD) )
summary(Koyck.Radiation$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.615 -13.816  -3.619   9.597  76.084 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 633.4417   404.8050   1.565    0.129
## Y.1           0.2965     0.2085   1.422    0.166
## X.t         -28.6891    28.0566  -1.023    0.316
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value  
## Weak instruments   1  27     6.901   0.014 *
## Wu-Hausman         1  26     1.470   0.236  
## Sargan             0  NA        NA      NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.73 on 27 degrees of freedom
## Multiple R-Squared: -0.1254, Adjusted R-squared: -0.2088 
## Wald test: 1.351 on 2 and 27 DF,  p-value: 0.276

The Koyck.Radiation model is insignificant at the 5% significance level.

Without intercept :

Koyck.Radiation.NoIntercept = koyckDlm(x = as.vector(FFD_dataset$Radiation) , y = as.vector(FFD_dataset$FFD), intercept = FALSE)
summary(Koyck.Radiation.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.371 -12.801  -4.088   5.688  73.765 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)   
## Y.1   0.3000     0.1929   1.555  0.13120   
## X.t  14.6701     4.0749   3.600  0.00121 **
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2  27   140.089 5.54e-15 ***
## Wu-Hausman         1  27     0.420   0.5225    
## Sargan             1  NA     3.063   0.0801 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.66 on 28 degrees of freedom
## Multiple R-Squared: 0.9935,  Adjusted R-squared: 0.993 
## Wald test:  2134 on 2 and 28 DF,  p-value: < 2.2e-16

The Koyck.Radiation.NoIntercept model is significant at the 5% significance level.

4. Relative Humidity as regressor

With intercept :

Koyck.RelHumidity = koyckDlm(x = as.vector(FFD_dataset$RelHumidity) , y = as.vector(FFD_dataset$FFD) )
summary(Koyck.RelHumidity$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.742 -13.465  -3.038   6.024  82.953 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -181.7506   639.8825  -0.284    0.779
## Y.1            0.1695     0.2552   0.664    0.512
## X.t            8.0820    12.6627   0.638    0.529
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value
## Weak instruments   1  27     2.594   0.119
## Wu-Hausman         1  26     0.531   0.473
## Sargan             0  NA        NA      NA
## 
## Residual standard error: 27.73 on 27 degrees of freedom
## Multiple R-Squared: -0.1248, Adjusted R-squared: -0.2081 
## Wald test: 1.032 on 2 and 27 DF,  p-value: 0.3699

The Koyck.RelHumidity model is insignificant at the 5% significance level.

Without intercept :

Koyck.RelHumidity.NoIntercept = koyckDlm(x = as.vector(FFD_dataset$RelHumidity) , y = as.vector(FFD_dataset$FFD), intercept = FALSE)
summary(Koyck.RelHumidity.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.379 -12.760  -3.419   6.830  79.149 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## Y.1   0.2061     0.2028   1.016 0.318213    
## X.t   4.5031     1.1586   3.887 0.000569 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2  27   131.621 1.19e-14 ***
## Wu-Hausman         1  27     2.020    0.167    
## Sargan             1  NA     0.102    0.750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.56 on 28 degrees of freedom
## Multiple R-Squared: 0.9935,  Adjusted R-squared: 0.9931 
## Wald test:  2153 on 2 and 28 DF,  p-value: < 2.2e-16

The Koyck.RelHumidity.NoIntercept model is significant at the 5% significance level.

Koyck Model selection

Koyck DLM models for all 4 regressors without intercept are significant. Eliminating the insignificant models, let's compare the significant Koyck DLM models based on adjusted R-squared, AIC, BIC and MASE,

Model <- c("Koyck.Temperature.NoIntercept", "Koyck.Rainfall.NoIntercept", "Koyck.Radiation.NoIntercept", "Koyck.RelHumidity.NoIntercept")
AIC <- c(AIC(Koyck.Temperature.NoIntercept), AIC(Koyck.Rainfall.NoIntercept), AIC(Koyck.Radiation.NoIntercept), AIC(Koyck.RelHumidity.NoIntercept))
BIC <- c( BIC(Koyck.Temperature.NoIntercept), BIC(Koyck.Rainfall.NoIntercept), BIC(Koyck.Radiation.NoIntercept), BIC(Koyck.RelHumidity.NoIntercept))
Adjusted_Rsquared <- c(0.9927, 0.9668, 0.993, 0.9931)
MASE <- MASE(Koyck.Temperature.NoIntercept, Koyck.Rainfall.NoIntercept, Koyck.Radiation.NoIntercept, Koyck.RelHumidity.NoIntercept)
data.frame(AIC, BIC, Adjusted_Rsquared, MASE) %>% arrange(MASE)
##                                    AIC      BIC Adjusted_Rsquared  n      MASE
## Koyck.RelHumidity.NoIntercept 283.5234 287.7270            0.9931 30 0.8512851
## Koyck.Radiation.NoIntercept   283.7733 287.9769            0.9930 30 0.9259536
## Koyck.Temperature.NoIntercept 285.0008 289.2044            0.9927 30 0.9781023
## Koyck.Rainfall.NoIntercept    330.5585 334.7621            0.9668 30 2.1113619

Thus, as per AIC, BIC, MASE (best in terms of forecasting) and adjusted R-squared, the Koyck DLM for FFD with Relative Humidity as the regressor and no intercept (Koyck.RelHumidity.NoIntercept) is the best.

Diagnostic check for Koyck DLM (Residual analysis)
checkresiduals(Koyck.RelHumidity.NoIntercept$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1.2763, df = 6, p-value = 0.9729
## 
## Model df: 0.   Total lags used: 6

The serial correlation left in the residuals is insignificant as per the Ljung-Box test (p-value = 0.9729) and the ACF plot. From the time series plot and histogram of residuals, there is an obvious random pattern and normality in the residual distribution. Thus, no violation of the general assumptions.

Conclusion of Koyck DLM model
  • The model with Relative Humidity as the regressor and no intercept is the best of all 4 regressors.
  • The model is significant.
  • MASE is 0.8512851
  • Adjusted R-squared is 99.31%
  • No violations in the test of assumptions
  • Serial autocorrelation is insignificant
  • From the Weak Instruments line, the model at the first stage of the least-squares fitting is significant at the 5% level of significance.
  • \(\delta_2\) is insignificant and \(\delta_3\) is significant at the 5% level, meaning FFD is significantly dependent on the Relative Humidity regressor and not on last year's FFD values.
  • From the Wu-Hausman test, we do not reject the null hypothesis that the correlation between the explanatory variable (\(Y_{t-1}\)) and the error term is zero (no endogeneity) at the 5% level.

Fit Autoregressive Distributed Lag Model

The autoregressive distributed lag (ARDL) model is a flexible and parsimonious infinite DLM. For example, the ARDL(1,1) model is represented as,

\(Y_t = \mu + \beta_0 X_t + \beta_1 X_{t-1} + \gamma_1 Y_{t-1} + e_t\)

Similar to the Koyck DLM, it is possible to write this model as an infinite DLM with a lag distribution of any shape, rather than only a polynomial or geometric shape. The general model is denoted ARDL(p,q), as written below. To fit the model we use the ardlDlm() function. Let's find the best lag lengths using the AIC and BIC scores through an iteration, with the maximum lag length set to 14, for each regressor individually.
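For reference, under the lag convention used in this report (the ardlDlm() arguments p and q give the numbers of lags of the regressor and of the response respectively, as the coefficient names in the summaries below confirm), the general ARDL(p,q) model can be written as,

\(Y_t = \mu + \sum_{i=0}^{p}\beta_i X_{t-i} + \sum_{j=1}^{q}\gamma_j Y_{t-j} + e_t\)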

1. Temperature as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ Temperature, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##    p q      AIC      BIC
## 1 13 2 85.51361 101.5403
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##    p q      AIC      BIC
## 1 13 2 85.51361 101.5403

ARDL(13,2) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(13,2):

ARDL.Temperature.13x2 = ardlDlm(formula = FFD ~ Temperature, data = FFD_dataset, p = 13, q = 2)
summary(ARDL.Temperature.13x2)
## 
## Time series regression with "ts" data:
## Start = 14, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##       14       15       16       17       18       19       20       21 
## -0.28212  0.30905 -0.36535  0.55389 -1.11505  1.89987 -1.40635  1.09117 
##       22       23       24       25       26       27       28       29 
## -1.28712  0.33102  0.04305  0.11575  0.59499 -0.61912 -0.51206 -0.29693 
##       30       31 
##  1.98022 -1.03492 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1799.14345  242.31073   7.425   0.0852 .
## Temperature.t    57.43342    7.63646   7.521   0.0842 .
## Temperature.1   -31.77471    3.51783  -9.032   0.0702 .
## Temperature.2   -12.18393    5.82043  -2.093   0.2837  
## Temperature.3    14.68490    2.99401   4.905   0.1280  
## Temperature.4    33.07505    6.84722   4.830   0.1300  
## Temperature.5    -6.31500    2.26272  -2.791   0.2190  
## Temperature.6   -12.13070    3.45046  -3.516   0.1764  
## Temperature.7   -41.46746    2.75004 -15.079   0.0422 *
## Temperature.8   -32.66363    5.81102  -5.621   0.1121  
## Temperature.9    16.11390    3.95843   4.071   0.1534  
## Temperature.10  -51.03539    3.71147 -13.751   0.0462 *
## Temperature.11   34.16284    5.62777   6.070   0.1039  
## Temperature.12  -41.62016    5.03400  -8.268   0.0766 .
## Temperature.13    3.58385    4.45503   0.804   0.5687  
## FFD.1             0.34062    0.07992   4.262   0.1467  
## FFD.2            -0.83494    0.11440  -7.298   0.0867 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.062 on 1 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.984 
## F-statistic: 66.37 on 16 and 1 DF,  p-value: 0.09616
checkresiduals(ARDL.Temperature.13x2$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 20.759, df = 4, p-value = 0.0003535
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Temperature.13x2)
##                            MASE
## ARDL.Temperature.13x2 0.0305356

The model is insignificant at the 5% significance level (p-value = 0.09616).

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.
# Also, models with AIC or BIC scores of Inf or -Inf are removed.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ -1 + Temperature, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##    p q      AIC      BIC
## 1 13 3 109.1219 125.1485
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##    p q      AIC      BIC
## 1 13 3 109.1219 125.1485

ARDL(13,3) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(13,3):

ARDL.Temperature.NoIntercept.13x3 = ardlDlm(formula = FFD ~ -1 + Temperature, data = FFD_dataset, p = 13, q = 3)
summary(ARDL.Temperature.NoIntercept.13x3)
## 
## Time series regression with "ts" data:
## Start = 14, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##       14       15       16       17       18       19       20       21 
## -0.39467  0.51088 -0.67364  1.32736 -2.25787  3.72448 -2.86633  2.25355 
##       22       23       24       25       26       27       28       29 
## -2.38613  0.76865  0.43390  0.05015  0.65610 -0.99901 -0.60352 -0.88462 
##       30       31 
##  3.53983 -2.18261 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## Temperature.t   40.5038    13.5981   2.979   0.2062  
## Temperature.1  -55.7986    10.7210  -5.205   0.1208  
## Temperature.2   23.6979    10.0183   2.365   0.2546  
## Temperature.3   15.2497     5.7984   2.630   0.2313  
## Temperature.4    9.3513    10.7807   0.867   0.5451  
## Temperature.5  -23.7448     5.3922  -4.404   0.1422  
## Temperature.6   -1.2956     6.0064  -0.216   0.8648  
## Temperature.7  -33.5482     5.1556  -6.507   0.0971 .
## Temperature.8    2.2203     8.6703   0.256   0.8404  
## Temperature.9   36.3215     4.4742   8.118   0.0780 .
## Temperature.10 -49.3118     6.9934  -7.051   0.0897 .
## Temperature.11  55.0020    14.6504   3.754   0.1657  
## Temperature.12 -53.5290    12.2360  -4.375   0.1431  
## Temperature.13  38.1858     7.1785   5.319   0.1183  
## FFD.1            0.9802     0.1614   6.072   0.1039  
## FFD.2           -0.8526     0.2280  -3.740   0.1663  
## FFD.3            0.6458     0.1719   3.758   0.1656  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.826 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9993 
## F-statistic:  1626 on 17 and 1 DF,  p-value: 0.0195
checkresiduals(ARDL.Temperature.NoIntercept.13x3$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 24.364, df = 4, p-value = 6.753e-05
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Temperature.NoIntercept.13x3)
##                                         MASE
## ARDL.Temperature.NoIntercept.13x3 0.05850544

The model is significant at the 5% significance level (p-value = 0.0195).

2. Rainfall as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ Rainfall, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##   p  q      AIC     BIC
## 1 2 13 155.1653 171.192
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##   p  q      AIC     BIC
## 1 2 13 155.1653 171.192

ARDL(2,13) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(2,13):

ARDL.Rainfall.2x13 = ardlDlm(formula = FFD ~ Rainfall, data = FFD_dataset, p = 2, q = 13)
summary(ARDL.Rainfall.2x13)
## 
## Time series regression with "ts" data:
## Start = 14, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##       14       15       16       17       18       19       20       21 
## -12.3187  -7.4665 -10.5105  -8.2321  10.3234  14.6427   2.9609  -2.7338 
##       22       23       24       25       26       27       28       29 
##   1.6466   1.5984   1.5430   1.2328   0.6374   3.5737  -1.4717   1.2570 
##       30       31 
##  -2.5867   5.9039 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1286.54423 1078.81523   1.193    0.444
## Rainfall.t  -138.19598   50.62236  -2.730    0.224
## Rainfall.1   -24.36210   40.13714  -0.607    0.653
## Rainfall.2   -75.70590   36.74138  -2.061    0.288
## FFD.1          0.53732    0.34387   1.563    0.362
## FFD.2         -0.29943    0.35141  -0.852    0.551
## FFD.3          0.15577    0.34972   0.445    0.733
## FFD.4         -0.10101    0.36267  -0.279    0.827
## FFD.5         -0.76882    0.33789  -2.275    0.264
## FFD.6         -0.54383    0.34948  -1.556    0.364
## FFD.7         -0.52359    0.36204  -1.446    0.385
## FFD.8          0.54449    0.59484   0.915    0.528
## FFD.9         -0.06028    0.57921  -0.104    0.934
## FFD.10         1.02547    0.69118   1.484    0.378
## FFD.11         0.13521    0.57556   0.235    0.853
## FFD.12        -0.33668    0.62199  -0.541    0.684
## FFD.13        -1.21974    0.68829  -1.772    0.327
## 
## Residual standard error: 28.12 on 1 degrees of freedom
## Multiple R-squared:  0.9549, Adjusted R-squared:  0.2336 
## F-statistic: 1.324 on 16 and 1 DF,  p-value: 0.6024
checkresiduals(ARDL.Rainfall.2x13$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 6.2832, df = 4, p-value = 0.179
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Rainfall.2x13)
##                         MASE
## ARDL.Rainfall.2x13 0.2000099

The model is insignificant at the 5% significance level (p-value = 0.6024).

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.
# Also, models with AIC or BIC scores of Inf or -Inf are removed.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ -1 + Rainfall, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##    p q      AIC      BIC
## 1 12 5 126.9599 144.9042
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##    p q      AIC      BIC
## 1 12 5 126.9599 144.9042

ARDL(12,5) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(12,5):

ARDL.Rainfall.NoIntercept.12x5 = ardlDlm(formula = FFD ~ -1 + Rainfall, data = FFD_dataset, p = 12, q = 5)
summary(ARDL.Rainfall.NoIntercept.12x5)
## 
## Time series regression with "ts" data:
## Start = 13, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##      13      14      15      16      17      18      19      20      21      22 
## -7.2608  2.3455 -0.8289  3.5294 -0.3649  0.3543 -2.9993  0.6410 -0.5476 -0.1955 
##      23      24      25      26      27      28      29      30      31 
## -1.7474  1.0049 -0.8311 -0.8444  4.0869  0.1678  2.0419 -1.2396  3.2965 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## Rainfall.t  -160.5202    35.3684  -4.539    0.138
## Rainfall.1   133.6537    29.3784   4.549    0.138
## Rainfall.2  -130.7409    36.0883  -3.623    0.171
## Rainfall.3   -62.4015    13.3674  -4.668    0.134
## Rainfall.4    19.0601    13.0988   1.455    0.383
## Rainfall.5  -152.5199    28.6844  -5.317    0.118
## Rainfall.6    89.9317    24.2465   3.709    0.168
## Rainfall.7   143.6564    28.8183   4.985    0.126
## Rainfall.8    -4.3487    12.6135  -0.345    0.789
## Rainfall.9   165.0070    39.1577   4.214    0.148
## Rainfall.10  -99.6413    39.1972  -2.542    0.239
## Rainfall.11 -250.2284    66.3218  -3.773    0.165
## Rainfall.12 -228.8179    49.9715  -4.579    0.137
## FFD.1          3.9935     0.8424   4.741    0.132
## FFD.2         -0.9188     0.3293  -2.790    0.219
## FFD.3          0.3889     0.3542   1.098    0.470
## FFD.4          3.7845     0.8238   4.594    0.136
## FFD.5         -2.1429     0.5696  -3.762    0.165
## 
## Residual standard error: 10.96 on 1 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9987 
## F-statistic: 823.4 on 18 and 1 DF,  p-value: 0.02742
checkresiduals(ARDL.Rainfall.NoIntercept.12x5$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 7.2277, df = 4, p-value = 0.1243
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Rainfall.NoIntercept.12x5)
##                                      MASE
## ARDL.Rainfall.NoIntercept.12x5 0.06993808

The model is significant at the 5% significance level (p-value = 0.02742).

3. Radiation as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ Radiation, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##   p  q      AIC      BIC
## 1 2 13 134.4458 150.4725
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##   p  q      AIC      BIC
## 1 2 13 134.4458 150.4725

ARDL(2,13) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(2,13):

ARDL.Radiation.2x13 = ardlDlm(formula = FFD ~ Radiation, data = FFD_dataset, p = 2, q = 13)
summary(ARDL.Radiation.2x13)
## 
## Time series regression with "ts" data:
## Start = 14, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##       14       15       16       17       18       19       20       21 
## -3.54323  8.53636 -1.27297 -9.11223 -2.36069  7.09851  1.50385 -1.87316 
##       22       23       24       25       26       27       28       29 
## -0.77066  1.40688  0.59778 -0.79341 -1.32024  1.55674  1.93126 -2.46251 
##       30       31 
##  0.02261  0.85512 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1979.06076  582.55132   3.397    0.182
## Radiation.t   92.00812   54.45628   1.690    0.340
## Radiation.1 -182.05922   73.55176  -2.475    0.244
## Radiation.2   48.81549   25.97919   1.879    0.311
## FFD.1          0.46877    0.41858   1.120    0.464
## FFD.2         -0.11289    0.21130  -0.534    0.688
## FFD.3         -0.74065    0.52710  -1.405    0.394
## FFD.4          0.47140    0.81105   0.581    0.665
## FFD.5          0.02944    0.46803   0.063    0.960
## FFD.6         -0.81573    0.58890  -1.385    0.398
## FFD.7          0.70481    0.72208   0.976    0.508
## FFD.8         -0.31561    0.27655  -1.141    0.458
## FFD.9         -0.58602    0.22584  -2.595    0.234
## FFD.10        -0.85568    0.30542  -2.802    0.218
## FFD.11         0.32301    0.51577   0.626    0.644
## FFD.12        -1.35893    0.34040  -3.992    0.156
## FFD.13        -0.64256    0.45674  -1.407    0.393
## 
## Residual standard error: 15.81 on 1 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.7576 
## F-statistic: 4.321 on 16 and 1 DF,  p-value: 0.363
checkresiduals(ARDL.Radiation.2x13$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 14.87, df = 4, p-value = 0.004979
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Radiation.2x13)
##                          MASE
## ARDL.Radiation.2x13 0.1037525

The model is insignificant at the 5% significance level (p-value = 0.363).

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.
# Also, models with AIC or BIC scores of Inf or -Inf are removed.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ -1 + Radiation, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##   p  q      AIC      BIC
## 1 1 14 136.2045 150.3691
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##   p  q      AIC      BIC
## 1 1 14 136.2045 150.3691

ARDL(1,14) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(1,14):

ARDL.Radiation.NoIntercept.1x14 = ardlDlm(formula = FFD ~ -1 + Radiation, data = FFD_dataset, p = 1, q = 14)
summary(ARDL.Radiation.NoIntercept.1x14)
## 
## Time series regression with "ts" data:
## Start = 15, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##        15        16        17        18        19        20        21        22 
## -12.95137   0.51595  11.88609   8.62918  -1.76593  -0.79224   0.12925   0.10285 
##        23        24        25        26        27        28        29        30 
##  -1.61808  -1.19211   0.51836   1.20603  -2.35993  -2.43226   1.06106  -0.01166 
##        31 
##  -0.71991 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## Radiation.t   82.25080   71.69482   1.147    0.456
## Radiation.1 -317.59728  117.77368  -2.697    0.226
## FFD.1          1.13389    0.59516   1.905    0.308
## FFD.2          0.45463    0.33261   1.367    0.402
## FFD.3         -0.37805    0.72437  -0.522    0.694
## FFD.4          1.87443    1.25840   1.490    0.376
## FFD.5          1.59345    0.87066   1.830    0.318
## FFD.6          0.21401    0.91138   0.235    0.853
## FFD.7          2.27585    1.17398   1.939    0.303
## FFD.8          0.91302    0.44476   2.053    0.289
## FFD.9          0.12313    0.18826   0.654    0.631
## FFD.10        -0.43492    0.42465  -1.024    0.492
## FFD.11         1.19993    0.72913   1.646    0.348
## FFD.12        -0.02874    0.52396  -0.055    0.965
## FFD.13         1.59485    0.77010   2.071    0.286
## FFD.14         1.64463    0.54248   3.032    0.203
## 
## Residual standard error: 20.16 on 1 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9956 
## F-statistic: 243.1 on 16 and 1 DF,  p-value: 0.05035
checkresiduals(ARDL.Radiation.NoIntercept.1x14$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 7.5102, df = 3, p-value = 0.0573
## 
## Model df: 0.   Total lags used: 3
MASE(ARDL.Radiation.NoIntercept.1x14)
##                                      MASE
## ARDL.Radiation.NoIntercept.1x14 0.1255573

The model is insignificant at the 5% significance level (p-value = 0.05035, marginally above 0.05).

4. Relative Humidity as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ RelHumidity, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##    p q      AIC      BIC
## 1 12 4 81.53721 99.48155
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##    p q      AIC      BIC
## 1 12 4 81.53721 99.48155

ARDL(12,4) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(12,4):

ARDL.RelHumidity.12x4 = ardlDlm(formula = FFD ~ RelHumidity, data = FFD_dataset, p = 12, q = 4)
summary(ARDL.RelHumidity.12x4)
## 
## Time series regression with "ts" data:
## Start = 13, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##       13       14       15       16       17       18       19       20 
##  0.61481  0.03041  0.03346 -0.62519 -0.73767 -0.23052  1.35028 -1.21052 
##       21       22       23       24       25       26       27       28 
##  1.03746 -1.10501  1.22645 -0.47680  0.87798  0.35483 -0.72833  0.10790 
##       29       30       31 
##  0.51057 -0.72155 -0.30856 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -2.835e+03  1.800e+02 -15.748   0.0404 *
## RelHumidity.t  -1.157e+01  9.700e-01 -11.931   0.0532 .
## RelHumidity.1  -1.047e+01  1.288e+00  -8.132   0.0779 .
## RelHumidity.2  -5.998e-01  8.945e-01  -0.671   0.6240  
## RelHumidity.3  -1.532e-01  1.034e+00  -0.148   0.9063  
## RelHumidity.4  -7.623e+00  9.362e-01  -8.142   0.0778 .
## RelHumidity.5  -4.949e+00  8.016e-01  -6.174   0.1022  
## RelHumidity.6  -3.840e+00  9.853e-01  -3.897   0.1599  
## RelHumidity.7   6.395e+00  9.233e-01   6.926   0.0913 .
## RelHumidity.8   1.344e+01  1.547e+00   8.684   0.0730 .
## RelHumidity.9   3.190e+00  1.256e+00   2.539   0.2389  
## RelHumidity.10  2.936e+01  1.462e+00  20.085   0.0317 *
## RelHumidity.11  2.723e+01  1.906e+00  14.284   0.0445 *
## RelHumidity.12  2.781e+01  2.294e+00  12.124   0.0524 .
## FFD.1          -7.238e-01  7.733e-02  -9.360   0.0678 .
## FFD.2          -2.671e-01  6.259e-02  -4.267   0.1466  
## FFD.3          -2.094e-01  6.224e-02  -3.364   0.1840  
## FFD.4          -6.733e-01  6.141e-02 -10.963   0.0579 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.317 on 1 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9887 
## F-statistic: 94.04 on 17 and 1 DF,  p-value: 0.08093
checkresiduals(ARDL.RelHumidity.12x4$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 13.042, df = 4, p-value = 0.01107
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.RelHumidity.12x4)
##                             MASE
## ARDL.RelHumidity.12x4 0.02503557

The model is insignificant at the 5% significance level (p-value = 0.08093).

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL models (since the max lag for the response and predictor of the ARDL model is 14, i.e., p = q = 14 at max).
# Save each model's AIC and BIC scores through the iteration and display the model with the best AIC and BIC scores.
# Also, models with AIC or BIC scores of Inf or -Inf are removed.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = FFD ~ -1 + RelHumidity, data = FFD_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per AIC
##    p q      AIC      BIC
## 1 14 1 132.8892 147.0539
head(df[order( df[,4] ),] %>% filter(AIC >= 0 & BIC >= 0),1) # Best model as per BIC
##    p q      AIC      BIC
## 1 14 1 132.8892 147.0539

ARDL(14,1) is the best model as per both the AIC and BIC scores. Let's fit this model,

ARDL(14,1):

ARDL.RelHumidity.NoIntercept.14x1 = ardlDlm(formula = FFD ~ -1 + RelHumidity, data = FFD_dataset, p = 14, q = 1)
summary(ARDL.RelHumidity.NoIntercept.14x1)
## 
## Time series regression with "ts" data:
## Start = 15, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##      15      16      17      18      19      20      21      22      23      24 
## -4.3614  9.1725  2.9031  2.6461  1.6513 -2.3081  7.3930 -0.4973  6.0919 -4.5694 
##      25      26      27      28      29      30      31 
## -3.7208 -4.2671 -2.1394 -5.7978 -1.7040  2.8791 -3.6651 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## RelHumidity.t  -11.0600     6.1087  -1.811    0.321
## RelHumidity.1   -3.7933     7.0927  -0.535    0.687
## RelHumidity.2   -2.9773     5.0579  -0.589    0.661
## RelHumidity.3    2.2570     4.5673   0.494    0.708
## RelHumidity.4   -6.5883     5.9583  -1.106    0.468
## RelHumidity.5   -9.0079     6.4804  -1.390    0.397
## RelHumidity.6   -1.2076     5.2190  -0.231    0.855
## RelHumidity.7    8.6483     4.6433   1.863    0.314
## RelHumidity.8    8.2055     6.5177   1.259    0.427
## RelHumidity.9    4.4254     7.1858   0.616    0.649
## RelHumidity.10  18.3930     6.7644   2.719    0.224
## RelHumidity.11   6.8110    13.3378   0.511    0.699
## RelHumidity.12   0.7468     7.0037   0.107    0.932
## RelHumidity.13  -4.9650     8.7825  -0.565    0.672
## RelHumidity.14  -5.5943    11.5908  -0.483    0.714
## FFD.1            0.1829     0.4519   0.405    0.755
## 
## Residual standard error: 18.29 on 1 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9964 
## F-statistic: 295.4 on 16 and 1 DF,  p-value: 0.04567
checkresiduals(ARDL.RelHumidity.NoIntercept.14x1$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1.7001, df = 3, p-value = 0.6369
## 
## Model df: 0.   Total lags used: 3
MASE(ARDL.RelHumidity.NoIntercept.14x1)
##                                        MASE
## ARDL.RelHumidity.NoIntercept.14x1 0.1724197

The model is significant at the 5% significance level (p-value = 0.04567).

ARDL Model selection

ARDL models for the Temperature, Rainfall and Relative Humidity regressors without intercept are significant. Eliminating the insignificant models, let's compare the significant ARDL models based on adjusted R-squared, AIC, BIC and MASE,

Model <- c("ARDL.Temperature.NoIntercept.13x3", "ARDL.Rainfall.NoIntercept.12x5", "ARDL.RelHumidity.NoIntercept.14x1")
AIC <- c(AIC(ARDL.Temperature.NoIntercept.13x3), AIC(ARDL.Rainfall.NoIntercept.12x5), AIC(ARDL.RelHumidity.NoIntercept.14x1))
BIC <- c( BIC(ARDL.Temperature.NoIntercept.13x3), BIC(ARDL.Rainfall.NoIntercept.12x5), BIC(ARDL.RelHumidity.NoIntercept.14x1))
Adjusted_Rsquared <- c(0.9993, 0.9987, 0.9964)
MASE <- MASE(ARDL.Temperature.NoIntercept.13x3, ARDL.Rainfall.NoIntercept.12x5, ARDL.RelHumidity.NoIntercept.14x1)
data.frame(AIC, BIC, Adjusted_Rsquared, MASE) %>% arrange(MASE)
##                                        AIC      BIC Adjusted_Rsquared  n
## ARDL.Temperature.NoIntercept.13x3 109.1219 125.1485            0.9993 18
## ARDL.Rainfall.NoIntercept.12x5    126.9599 144.9042            0.9987 19
## ARDL.RelHumidity.NoIntercept.14x1 132.8892 147.0539            0.9964 17
##                                         MASE
## ARDL.Temperature.NoIntercept.13x3 0.05850544
## ARDL.Rainfall.NoIntercept.12x5    0.06993808
## ARDL.RelHumidity.NoIntercept.14x1 0.17241968

Thus, as per AIC, BIC, MASE (best in terms of forecasting) and adjusted R-squared, the ARDL(13,3) model for FFD with Temperature as the regressor and no intercept (ARDL.Temperature.NoIntercept.13x3) is the best.

Diagnostic check for ARDL (Residual analysis):

checkresiduals(ARDL.Temperature.NoIntercept.13x3$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 24.364, df = 4, p-value = 6.753e-05
## 
## Model df: 0.   Total lags used: 4

The serial correlation left in the residuals is significant as per the Ljung-Box test (p-value = 6.753e-05) and the ACF plot. From the time series plot and histogram of residuals, there is a random pattern and approximate normality in the residual distribution. Thus, apart from the significant serial correlation, there is no violation of the general assumptions.

Conclusion of ARDL DLM model
  • The model with Temperature as the regressor and no intercept is the best of all 4 regressors.
  • The ARDL.Temperature.NoIntercept.13x3 model is significant.
  • MASE is 0.05850544
  • Adjusted R-squared is 99.93%
  • No violations in the other assumption checks
  • Serial autocorrelation is significant (cross-checked below)
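Since significant serial correlation remains, it can be cross-checked with the Breusch-Godfrey test via bgtest() from the lmtest package (already loaded); a minimal sketch, with the order matching the lags used by checkresiduals() above:

bgtest(ARDL.Temperature.NoIntercept.13x3$model, order = 4)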

Most appropriate DLM model based on MASE (DLM Model Selection)

The 4 DLM models are,

  • Finite DLM model: DLM.RelHumidity
  • Polynomial DLM model: PolyDLM.Temperature
  • Koyck transformed geometric DLM model: Koyck.RelHumidity.NoIntercept
  • Autoregressive DLM model: ARDL.Temperature.NoIntercept.13x3

The mean absolute scaled errors (MASE) of these models are,

MASE(DLM.RelHumidity, PolyDLM.Temperature, Koyck.RelHumidity.NoIntercept, ARDL.Temperature.NoIntercept.13x3) %>% arrange(MASE)
##                                    n       MASE
## DLM.RelHumidity                   17 0.01053278
## ARDL.Temperature.NoIntercept.13x3 18 0.05850544
## PolyDLM.Temperature               17 0.73884055
## Koyck.RelHumidity.NoIntercept     30 0.85128511

Conclusion of Distributed Lag models (DLM) modelling

The best DLM model for the FFD response, i.e. the one giving the most accurate forecasts based on the MASE measure, is the finite DLM with Relative Humidity as the regressor and with intercept, DLM.RelHumidity, with a MASE of 0.01053278.
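To generate the actual forecasts from the chosen model, the dLagM package provides a forecast() function; a minimal sketch, where the future values of the regressor are hypothetical placeholders:

# Hypothetical future Relative Humidity values for an illustrative horizon of h = 4
newRelHumidity <- c(70, 72, 71, 69)
dLagM::forecast(model = DLM.RelHumidity, x = newRelHumidity, h = 4)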

B. Dynamic linear models (dynlm package)

Dynamic linear models are a general class of time series regression models which can account for trends, seasonality, serial correlation between the response and regressor variables and, most importantly, the effect of intervention points.

The response of a Dynamic linear model with an intervention pulse can be written as,

\(Y_t = \omega_2Y_{t-1} + (\omega_0 + \omega_1)P_t - \omega_2\omega_0P_{t-1} + N_t\)

where,

  • \(Y_t\) is the response
  • \(\omega_2\) is the coefficient of the response lagged by 1 time unit
  • \(P_t\) is the current pulse effect at the intervention point, with the coefficient \((\omega_0 + \omega_1)\) representing the instantaneous effect of the intervention
  • \(P_{t-1}\) is the past pulse effect, with coefficient \(\omega_2\omega_0\)
  • \(N_t\) represents the component where there is no intervention and is referred to as the natural or unperturbed process.

Let's revisit the time series plot of the response, FFD, to visualize possible intervention points,

plot(FFD)

As mentioned at the descriptive analysis stage, there is no clear intervention point that we can identify visually, but years 2002 and 2003 might be intervention points given their magnitude. Assuming this intervention, let's fit a Dynamic Linear model and see whether the pulse function at years 2002 and 2003 is significant or not.

As always, let's have a look at the ACF and PACF plots of the FFD series first.

acf(FFD, main="ACF of FFD")

pacf(FFD, main ="PACF of FFD")

In the ACF plot we see a slowly decaying pattern indicating trend in the FFD series. In the PACF plot we see one high vertical spike, also indicating trend. No significant seasonal behavior is observed. Thus, let's fit Dynamic linear models with a trend component and no seasonal component. For thoroughness, let's test all reasonable combinations of trend, multiple lags of FFD and, most importantly, the pulse at years 2002 and 2003.

Now, let's fit the Dynamic Linear models using dynlm() as shown below. (Note, the potential intervention points were identified at years 2002 and 2003.)

With intercept :

Y.t = FFD
T = c(19,20) # the time points (years 2002 and 2003) when the intervention occurred
P.t = 1*(seq(FFD) %in% T) # pulse dummy: 1 at the intervention points, 0 elsewhere
P.t.1 = Lag(P.t,+1) # 1-lagged pulse dummy; library(tis)

Dyn.model = dynlm(Y.t ~ L(Y.t , k = 1) + P.t  + trend(Y.t)) # library(dynlm)

Dyn.model1 = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + P.t.1) # library(dynlm)

Dyn.model2 = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Dyn.model3 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Dyn.model4 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model5 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t) # library(dynlm)

AIC(Dyn.model, Dyn.model1, Dyn.model2, Dyn.model3, Dyn.model4, Dyn.model5) %>% arrange(AIC)
##            df      AIC
## Dyn.model4  7 234.6498
## Dyn.model3  6 240.7430
## Dyn.model5  6 242.0378
## Dyn.model   5 245.7365
## Dyn.model2  5 245.7365
## Dyn.model1  4 255.4715
summary(Dyn.model4)
## 
## Time series regression with "ts" data:
## Start = 1987, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + L(Y.t, k = 2) + L(Y.t, 
##     k = 3) + P.t + trend(Y.t))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.864  -7.892  -2.667   5.050  29.807 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   400.29802   56.98282   7.025 4.76e-07 ***
## L(Y.t, k = 1)  -0.16899    0.12349  -1.368  0.18499    
## L(Y.t, k = 2)  -0.01011    0.11436  -0.088  0.93033    
## L(Y.t, k = 3)  -0.09248    0.11302  -0.818  0.42197    
## P.t            89.69321   11.57506   7.749 9.98e-08 ***
## trend(Y.t)     -1.02242    0.34538  -2.960  0.00723 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.04 on 22 degrees of freedom
## Multiple R-squared:  0.7615, Adjusted R-squared:  0.7073 
## F-statistic: 14.05 on 5 and 22 DF,  p-value: 3.162e-06

Without intercept :

Y.t = FFD
T = c(19,20) # the time points (years 2002 and 2003) when the intervention occurred
P.t = 1*(seq(FFD) %in% T) # pulse dummy: 1 at the intervention points, 0 elsewhere
P.t.1 = Lag(P.t,+1) # 1-lagged pulse dummy; library(tis)

Dyn.model.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model1.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + P.t + P.t.1) # library(dynlm)

Dyn.model2.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Dyn.model3.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Dyn.model4.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

AIC(Dyn.model.NoIntercept, Dyn.model1.NoIntercept, Dyn.model2.NoIntercept, Dyn.model3.NoIntercept, Dyn.model4.NoIntercept) %>% arrange(AIC)
##                        df      AIC
## Dyn.model4.NoIntercept  6 265.5930
## Dyn.model3.NoIntercept  5 278.4041
## Dyn.model1.NoIntercept  3 290.7651
## Dyn.model.NoIntercept   4 292.6898
## Dyn.model2.NoIntercept  4 292.6898
summary(Dyn.model4.NoIntercept)
## 
## Time series regression with "ts" data:
## Start = 1987, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ 0 + L(Y.t, k = 1) + L(Y.t, k = 2) + L(Y.t, 
##     k = 3) + P.t + trend(Y.t))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.987  -3.470   0.793   9.987  41.311 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## L(Y.t, k = 1)  0.37567    0.16929   2.219   0.0366 * 
## L(Y.t, k = 2)  0.22166    0.19286   1.149   0.2622   
## L(Y.t, k = 3)  0.38212    0.15958   2.395   0.0252 * 
## P.t           65.01230   19.42521   3.347   0.0028 **
## trend(Y.t)    -0.08072    0.56062  -0.144   0.8868   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.73 on 23 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9935 
## F-statistic: 855.4 on 5 and 23 DF,  p-value: < 2.2e-16

Dynamic Linear Model selection

The best Dynamic Linear models with and without intercept were Dyn.model4 and Dyn.model4.NoIntercept respectively. Eliminating all the insignificant models, we compare these two Dynamic Linear models based on AIC, BIC and Adjusted R-squared,

Model <- c("Dyn.model4", "Dyn.model4.NoIntercept")
AIC <- c(AIC(Dyn.model4), AIC(Dyn.model4.NoIntercept))
BIC <- c( BIC(Dyn.model4), BIC(Dyn.model4.NoIntercept))
Adjusted_Rsquared <- c(0.7073, 0.9935) # from the model summaries above
data.frame(Model,AIC, BIC, Adjusted_Rsquared) %>% arrange(AIC)
##                    Model      AIC      BIC Adjusted_Rsquared
## 1             Dyn.model4 234.6498 243.9753            0.7073
## 2 Dyn.model4.NoIntercept 265.5930 273.5863            0.9935

Thus, as per AIC and BIC, the Dynamic Linear model for FFD with an intercept (Dyn.model4) is the best. (Note, the higher Adjusted R-squared of the no-intercept model is inflated: when the intercept is dropped, R-squared is computed around zero rather than around the mean.)

Dyn.model4 is the best Dynamic Linear model, with 3 lagged components of the response (FFD), a pulse component at years 2002 and 2003, and a trend component. Let's look at the summary statistics and check the residuals,

summary(Dyn.model4)
## 
## Time series regression with "ts" data:
## Start = 1987, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + L(Y.t, k = 2) + L(Y.t, 
##     k = 3) + P.t + trend(Y.t))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.864  -7.892  -2.667   5.050  29.807 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   400.29802   56.98282   7.025 4.76e-07 ***
## L(Y.t, k = 1)  -0.16899    0.12349  -1.368  0.18499    
## L(Y.t, k = 2)  -0.01011    0.11436  -0.088  0.93033    
## L(Y.t, k = 3)  -0.09248    0.11302  -0.818  0.42197    
## P.t            89.69321   11.57506   7.749 9.98e-08 ***
## trend(Y.t)     -1.02242    0.34538  -2.960  0.00723 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.04 on 22 degrees of freedom
## Multiple R-squared:  0.7615, Adjusted R-squared:  0.7073 
## F-statistic: 14.05 on 5 and 22 DF,  p-value: 3.162e-06
checkresiduals(Dyn.model4)

## 
##  Breusch-Godfrey test for serial correlation of order up to 9
## 
## data:  Residuals
## LM test = 18.53, df = 9, p-value = 0.0295

Summary of the Dynamic linear model, Dyn.model4

  • The model is significant at the 5% significance level
  • Adjusted R-squared is 70.73%
  • No violations in the general assumptions
  • Serial autocorrelation in the residuals is significant (Breusch-Godfrey test p-value = 0.0295)

Conclusion of Dynamic Linear model

The dynamic linear model, Dyn.model4, is significant, and its pulse (P.t) component at years 2002 and 2003 is significant.

C. Exponential Smoothing Method and State-Space models

Exponential smoothing methods, including the corresponding state-space models, take into consideration the Error, Trend and Seasonality components of the time series. Each of these components can be absent (N), Additive (A) or Multiplicative (M). Hence, these models are represented as ETS(E,T,S), denoting the Error, Trend and Seasonal components respectively.
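
For example, the model the auto-search selects below, simple exponential smoothing with multiplicative errors ETS(M,N,N), has the standard state-space form (with level \(l_t\), smoothing parameter \(\alpha\) and relative error \(\epsilon_t\)):

\(Y_t = l_{t-1}(1 + \epsilon_t)\) and \(l_t = l_{t-1}(1 + \alpha\epsilon_t)\)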

The best Exponential Smoothing or State-Space model for our FFD time series can be identified by triggering the auto-search with the argument model = “ZZZ” in ets(), as shown below. We will also check whether a damped trend or a drift term gives us a better model.

Best Exponential Smoothing model -

autofit.ETS = ets(FFD, model="ZZZ")
summary(autofit.ETS)
## ETS(M,N,N) 
## 
## Call:
##  ets(y = FFD, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 306.3826 
## 
##   sigma:  0.0825
## 
##      AIC     AICc      BIC 
## 310.6132 311.5021 314.9152 
## 
## Training set error measures:
##                         ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.0009438502 24.43768 16.75958 -0.5805143 5.329407 0.8759362
##                   ACF1
## Training set 0.2589561
checkresiduals(autofit.ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 3.1718, df = 6, p-value = 0.787
## 
## Model df: 0.   Total lags used: 6

The system chooses simple exponential smoothing with multiplicative errors, ETS(M,N,N). MASE is 0.8759362.

Best Exponential Smoothing model with damping -

autofit.ETS.damped = ets(FFD, model="ZZZ", damped = TRUE)
summary(autofit.ETS.damped)
## ETS(A,Ad,N) 
## 
## Call:
##  ets(y = FFD, model = "ZZZ", damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 5e-04 
##     beta  = 1e-04 
##     phi   = 0.9798 
## 
##   Initial states:
##     l = 315.7065 
##     b = -0.7506 
## 
##   sigma:  25.9994
## 
##      AIC     AICc      BIC 
## 315.0015 318.5015 323.6054 
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.3745223 23.81052 15.17483 -0.4243502 4.772873 0.7931096
##                   ACF1
## Training set 0.2279019
checkresiduals(autofit.ETS.damped)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,N)
## Q* = 3.1879, df = 6, p-value = 0.7849
## 
## Model df: 0.   Total lags used: 6

The system chooses Holt’s damped trend model with additive errors, ETS(A,Ad,N). MASE is 0.7931096.

Best Exponential Smoothing model with drift -

autofit.ETS.drift = ets(FFD, model="ZZZ", beta = 1E-4)
summary(autofit.ETS.drift)
## ETS(M,N,N) 
## 
## Call:
##  ets(y = FFD, model = "ZZZ", beta = 1e-04) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 306.3826 
## 
##   sigma:  0.0839
## 
##      AIC     AICc      BIC 
## 310.6132 311.5021 314.9152 
## 
## Training set error measures:
##                         ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.0009438502 24.43768 16.75958 -0.5805143 5.329407 0.8759362
##                   ACF1
## Training set 0.2589561
checkresiduals(autofit.ETS.drift)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 3.1718, df = 6, p-value = 0.787
## 
## Model df: 0.   Total lags used: 6

Again, the system chooses the ETS(M,N,N) model.

Thus, the best Exponential smoothing or State-space model for our FFD series is Holt’s damped trend model with additive errors, ETS(A,Ad,N), with a MASE of 0.7931096.

Conclusion of Exponential Smoothing Method and State-Space models

The State-space model which gives the most accurate forecasts based on the MASE measure is ETS(A,Ad,N), with the lowest MASE (0.7931096) of all the State-space models considered.

Overall Most Appropriate Regression model (Model Selection)

Based on the 3 time series regression methods considered, the best model as per the MASE measure for each method is summarized below,

  • A. The best Distributed Lag model is the Finite DLM with Relative Humidity as regressor (DLM.RelHumidity), with a MASE of 0.01053278, AIC of 39.90284, BIC of 54.06747 and Adjusted R-squared of 99.86%.

  • B. The best Dynamic Linear model is Dyn.model4, with 3 lagged components of the response (FFD), a pulse component at years 2002 and 2003 and a trend component, with AIC of 234.6498, BIC of 243.9753 and Adjusted R-squared of 70.73%.

  • C. The best Exponential smoothing / State-space model is Holt’s damped trend model with additive errors, ETS(A,Ad,N), with a MASE of 0.7931096, AIC of 315.0015 and BIC of 323.6054.

Clearly, as per the AIC, BIC, Adjusted R-squared and MASE measures, the best model is the Finite DLM with Relative Humidity as regressor, DLM.RelHumidity.

Best Time Series regression model for Forecasting

The best time series regression model is the Finite DLM with Relative Humidity as regressor (DLM.RelHumidity), with a MASE of 0.01053278.

Detailed Graphical and statistical tests of assumptions for \(DLM.RelHumidity\) model (Residual Analysis)

Residual analysis to test model assumptions.

Let's perform a detailed residual analysis to check whether any model assumptions have been violated.

The estimated error (or residual) is defined by:

\(\hat{\epsilon}_i = Y_i - \hat{Y}_i\) (i.e. observed value minus fitted value)

The following problems are to be checked (a compact code sketch combining these checks follows the list),

  1. linearity in distribution of error terms
  2. The mean value of residuals is zero
  3. Serial autocorrelation
  4. Normality of distribution of error terms
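
As a quick reference, the four checks can be bundled into a few lines of R. This is a minimal sketch assuming a fitted dLagM model object named fit, whose residuals are stored in fit$model$residuals (as with the models above):

res <- fit$model$residuals                  # extract the residuals
plot(res, type = "o"); abline(h = 0)        # 1. linearity: points should scatter randomly about zero
mean(res)                                   # 2. mean of residuals should be close to zero
Box.test(res, lag = 3, type = "Ljung-Box")  # 3. serial autocorrelation (Ljung-Box test)
shapiro.test(res)                           # 4. normality of residuals (Shapiro-Wilk test)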

Let's first apply a diagnostic check using the checkresiduals() function,

checkresiduals(DLM.RelHumidity)
##           1           2           3           4           5           6 
## -0.50250987 -0.14705938  0.32626851  0.29784316  0.24216425 -0.54986885 
##           7           8           9          10          11          12 
##  0.23331897 -0.20803168  0.50850139  0.07914825  0.09089903 -0.25841685 
##          13          14          15          16          17 
## -0.01506289  0.03899430 -0.32784972  0.01504263  0.17661876

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 5.5937, df = 3, p-value = 0.1331
## 
## Model df: 0.   Total lags used: 3
  1. From the residuals plot, the residuals are randomly distributed around the mean. Thus, linearity in the distribution of the error terms is not violated.

  2. To test whether the mean value of the residuals is zero, let's calculate it as,

mean(DLM.RelHumidity$model$residuals)
## [1] 2.693605e-17

As mean value of residuals is close to 0, zero mean residuals is not violated.

  3. In the checkresiduals() output, the Ljung-Box test result is displayed. The hypotheses of this test are,

\(H_0\) : the series of residuals exhibits no serial autocorrelation of any order up to p
\(H_a\) : the series of residuals exhibits serial autocorrelation of some order up to p

From the Ljung-Box test output, since p (0.1331) > 0.05, we do not reject the null hypothesis of no serial autocorrelation.

Thus, according to this test and ACF plot, we can conclude that the serial correlation left in residuals is insignificant.

  4. From the histogram shown by checkresiduals(), the residuals seem to follow normality. Let's test this statistically,

\(H_0\) : Time series is Normally distributed
\(H_a\) : Time series is not normal

shapiro.test(DLM.RelHumidity$model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  DLM.RelHumidity$model$residuals
## W = 0.96952, p-value = 0.8101

From the Shapiro-Wilk test, since p > 0.05, we do not reject the null hypothesis that the data is normal. Thus, the residuals of the DLM.RelHumidity model are normally distributed.

Summarizing residual analysis on \(DLM.RelHumidity\) model:

Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of the residuals is zero (zero mean residuals): Not violated
Assumption 3: The error terms are independently distributed, i.e. they are not autocorrelated: Not violated
Assumption 4: The errors are normally distributed: Not violated

Having no violations of the residual assumptions, the Finite DLM model with Relative Humidity as regressor (DLM.RelHumidity) is suitable for accurate forecasting of FFD. Let's forecast FFD for the next 4 years,

Forecasting

Based on the MASE measure, the Finite DLM model DLM.RelHumidity is the best fitted model to forecast FFD. Let's estimate and plot the 4 years (2015-2018) ahead forecasts for the FFD series.

Observed and fitted values are plotted below. The plot indicates good agreement between the model and the original series. (Note, since the lag length is 14 (q = 14), fitted values are not available for the first 14 years.)

plot(FFD, ylab='FFD', xlab = 'Year', type="l", col="black", main="Observed and fitted values using DLM.RelHumidity model on FFD")
lines(ts(DLM.RelHumidity$model$fitted.values, start = c(1998)), col="red")
legend("topleft",lty=1, text.width = 12,
       col=c("black", "red"), 
       c("FFD series", "DLM.RelHumidity fit"))

Using the given 4 years ahead future covariate values, we can forecast our FFD response.

Future_Covariates <- read.csv("C:/Users/admin/Downloads/Covariate x-values for Task 2.csv")
head(Future_Covariates)
##   Year Temperature Rainfall Radiation RelHumidity
## 1 2015       20.74     2.27     14.60       52.16
## 2 2016       20.49     2.38     14.56       52.87
## 3 2017       20.52     2.26     14.79       52.58
## 4 2018       20.56     2.27     14.79       52.50

Our DLM.RelHumidity model uses only one covariate, Relative Humidity. The 4 years ahead point forecasts of FFD using the Relative Humidity covariate are,

DLM.RelHumidity = dlm(formula = FFD ~ RelHumidity, data = FFD_dataset, q = 14)
x.new =  c(Future_Covariates$RelHumidity)
forecasts.dlm = dLagM::forecast(model = DLM.RelHumidity, x = x.new, h = 4)$forecasts

Forecast using overall BEST fitting model:

The point forecasts and the forecast plot using the overall best fitting model, DLM.RelHumidity, are given below,

df <- data.frame(
  Finite_DLM_forecasts = c(forecasts.dlm)
) 
row.names(df) <- c("2015", "2016", "2017", "2018")
df
##      Finite_DLM_forecasts
## 2015             217.4990
## 2016             164.6623
## 2017             203.6109
## 2018             271.5180
FFD.extended1 = c(FFD, forecasts.dlm)

{
plot(ts(FFD.extended1, start = c(1984)), type="l", col = "red",
ylab = "FFD", xlab = "Year", 
main="4 years ahead forecasts for FFD series
      using DLM.RelHumidity model")          
lines(FFD,col="black",type="l")
legend("topleft",lty=1,
       col=c("black", "red"), 
       c("FFD series", "Finite DLM forecasts"))
}

The forecasts from the best Finite DLM, Polynomial DLM, Koyck, Dynamic Linear and Exponential smoothing/State-space models are printed and plotted below,

For Distributed Lag models:

The 4 years ahead point forecasts for the DLM models are printed and plotted below. (Note, since the best Koyck and ARDL models do not have an intercept, their forecasts aren't printed.)

# Forecasts using Finite DLM 
x.new = c(Future_Covariates$RelHumidity)
forecasts.dlm = dLagM::forecast(model = DLM.RelHumidity, x = x.new, h = 4)$forecasts

# Forecasts using Polynomial DLM 
x.new2 = c(Future_Covariates$Temperature)
forecasts.polydlm = dLagM::forecast(model = PolyDLM.Temperature , x = x.new2, h = 4)$forecasts

df <- data.frame(
  Finite_DLM_forecasts = c(forecasts.dlm),
  Polynomial_DLM_forecasts = c(forecasts.polydlm)
) 
row.names(df) <- c("2015", "2016", "2017", "2018")
df
##      Finite_DLM_forecasts Polynomial_DLM_forecasts
## 2015             217.4990                 306.4362
## 2016             164.6623                 301.9256
## 2017             203.6109                 297.3487
## 2018             271.5180                 295.2537
FFD.extended1 = c(FFD , forecasts.dlm)
FFD.extended2 = c(FFD , forecasts.polydlm)

{
plot(ts(FFD.extended1, start = c(1984)),type="l", col = "Red",
ylab = "FFD", xlab = "Year", 
main="4 years ahead forecast for FFD series
      using DLM models")          
lines(ts(FFD.extended2, start = c(1984)),col="blue",type="l")
lines(FFD,col="black",type="l")
legend("topleft",lty=1,
       col=c("black", "red", "blue"), 
       c("FFD series", "Finite DLM forecasts", "Polynomial DLM forecasts"))
}

For Dynamic Linear model:

The 4 years ahead point forecasts are printed and plotted below,

Dyn.model4 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

q = 4 # forecast horizon
n = nrow(Dyn.model4$model) # 28 usable observations (the first 3 are lost to the 3 lags)
FFD.frc = array(NA , (n + q))
FFD.frc[1:n] = Y.t[4:length(Y.t)] # length(1:n) = length(4:length(Y.t)) = 28
trend = array(NA,q)
trend.start = Dyn.model4$model[n,"trend(Y.t)"] # trend value at the last fitted observation
trend = seq(trend.start , trend.start + q, 1) # trend values carried forward over the horizon

for (i in 1:q){
  # regressor values in coefficient order: intercept, 3 lags of the response, pulse (0 for future years), trend
  data.new = c(1,FFD.frc[n-1+i], FFD.frc[n-2+i], FFD.frc[n-3+i], P.t[n] ,trend[i]) 
  FFD.frc[n+i] = as.vector(Dyn.model4$coefficients) %*% data.new # one-step-ahead point forecast
}

par(mfrow=c(1,1))

plot(Y.t,xlim=c(1984,2018),ylab='FFD',xlab='Year',main = "Time series plot of FFD series with 4 years ahead forecasts (in red)")
lines(ts(FFD.frc[(n+1):(n+q)],start=c(2015)),col="red")

For Exponential smoothing/State-space model:

The 4 years ahead point forecasts and Confidence intervals are printed and plotted below,

forecasts.ETS = forecast::forecast(autofit.ETS.damped, h = 4)
forecasts.ETS
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2015       298.2942 264.9747 331.6138 247.3364 349.2521
## 2016       297.9112 264.5917 331.2308 246.9534 348.8691
## 2017       297.5360 264.2164 330.8555 246.5781 348.4938
## 2018       297.1682 263.8487 330.4878 246.2104 348.1261
plot(forecasts.ETS, ylab="FFD", type="l", fcol="red", xlab="Year", ylim= c(100, 400),
main="4 years ahead forecasts using ETS(A,Ad,N) model")
legend("topleft",lty=1, pch=1, col=1:2, c("FFD series","ETS(A,Ad,N) forecasts"))

Conclusion

The best fitting model for our FFD series in terms of MASE, which assesses forecast accuracy, is the Finite DLM model with Relative Humidity as regressor, \(DLM.RelHumidity\). The point forecasts for the 4 years ahead (2015-2018), obtained using forecast() from the dLagM package, are 217.4990, 164.6623, 203.6109 and 271.5180 respectively (confidence intervals are not output by this function).

Future Directions

Potentially better forecasting methods can be explored, compared and diagnosed for a better fit.

Task 3 Part (a): Univariate Forecasting of Rank-Based Order Similarity Metric: Three-Year Ahead Predictions

Data Description

The dataset holds 6 columns and 31 observations: a Year column; the Rank-based Order similarity metric (RBO), which captures changes in the flowering order of 81 plant species by computing the similarity between each year's flowering order and the flowering order of 1983; and 4 climate factors, namely Temperature, Rainfall, Radiation and Relative Humidity (RelHumidity) - all measured from 1984 to 2014.

Objective

Our aim for the RBO dataset is to give the best 3 years ahead forecasts by determining, in terms of MASE, the most accurate and suitable regression model for the annual Rank-based Order similarity metric (RBO) using a single predictor (univariate analysis). A descriptive analysis will be conducted initially. A model-building strategy will then be applied to find the best fitting model from the time series regression methods (dLagM package) and dynamic linear models (dynlm package).

Model Selection Criteria

MASE, Information Criteria (AIC and BIC), and Adjusted R Squared.
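
For reference, the information criteria penalize the maximized log-likelihood \(\hat{L}\) by the number of estimated parameters k (and, for BIC, the sample size n):

\(AIC = -2\log\hat{L} + 2k \qquad BIC = -2\log\hat{L} + k\log(n)\)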

Read Data

RBO_dataset <- read.csv("C:/Users/admin/Downloads/RBO.csv")
head(RBO_dataset)
##   Year       RBO Temperature Rainfall Radiation RelHumidity
## 1 1984 0.7550088    18.71038 2.489344  14.87158    93.92650
## 2 1985 0.7407520    19.26301 2.475890  14.68493    94.93589
## 3 1986 0.8423860    18.58356 2.421370  14.51507    94.09507
## 4 1987 0.7484425    19.10137 2.319726  14.67397    94.49699
## 5 1988 0.7984084    20.36066 2.465301  14.74863    94.08142
## 6 1989 0.7938803    19.59589 2.735890  14.78356    96.08685

Identification of the response and the regressor variables

For fitting a regression model, the response is Rank-based flowering Order similarity metric, RBO, and the 4 regressor variables are the Temperature, Rainfall, Radiation Level and Relative Humidity.

  • y = RBO = Rank-based flowering Order similarity metric with reference year at 1983
  • x1 = Temperature
  • x2 = Rainfall
  • x3 = Radiation
  • x4 = RelHumidity = Relative Humidity

All the 5 variables are continuous variables.

Read Regressor and Response variables

Lets first get the regressor and response as TS objects,

RBO = ts(RBO_dataset[,2], start = c(1984))
Temperature = ts(RBO_dataset[,3], start = c(1984))
Rainfall = ts(RBO_dataset[,4], start = c(1984))
Radiation = ts(RBO_dataset[,5], start = c(1984))
RelHumidity = ts(RBO_dataset[,6], start = c(1984))
data.ts = ts(RBO_dataset, start = c(1984)) # Y and x in single dataframe

Relationship between Regressor and Response variables

Lets scale, center and plot all the 5 variables together

data.scale = scale(data.ts)
plot(data.scale[,2:6], plot.type="s", col=c("black", "red", "blue", "green", "yellow"), main = "RBO (Black - Response), Temperature (Red - X1),\n  Rainfall (Blue - X2), Radiation (Green - X3), RelHumidity (Yellow - X4)")

It is hard to read the correlations between the regressors and the response, or among the regressors themselves, from this plot. But it is fair to say the 5 variables show some correlation. Let's check the correlations statistically using ggpairs(),

ggpairs(data = RBO_dataset, columns = c(2,3,4,5,6), progress = FALSE) #library(GGally)

Hence, some correlation between the 4 regressors and the response is present, and we can build regression models on these correlations. First, let's look at the descriptive statistics.

Descriptive Analysis

Since we are building regression models which estimate the response, \(RBO\), let's focus on the statistics of RBO.

Summary statistics

summary(RBO)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6629  0.7043  0.7321  0.7379  0.7566  0.8424

The mean and median of the RBO are very close indicating symmetrical distribution.

Time Series plot:

The time series plot for our data is generated using the following code chunk,

plot(RBO, ylab='Yearly average of RBO',xlab='Year',
     type='o', main="Figure 1: Yearly Average RBO Trend (1984-2014)")

Plot Inference :

From Figure 1, we can comment on the time series as follows,

  • Trend: The overall shape seems to follow a downward trend, indicating non-stationarity.

  • Seasonality: From the plot, no seasonal behavior is seen.

  • Change in Variance: No systematic change in variance is apparent; the variation looks random.

  • Behavior: We notice mixed AR and MA behavior. AR behavior is seen in successive data points following each other; MA behavior is evident in the up and down fluctuations between data points.

  • Intervention/Change points: Year 1996 might be an intervention point, as the mean level of the RBO series falls notably from this point onwards.

ACF and PACF plots:

acf(RBO, main="ACF of RBO")

pacf(RBO, main ="PACF of RBO")

  • ACF plot: We notice the first 3 autocorrelations are significant. A slowly decaying pattern indicates a non-stationary series. We do not see any wave-like form, so no significant seasonal behavior is observed.

  • PACF plot: We see one high vertical spike, indicating a non-stationary series (we observed non-stationarity in the time series plot as well). The second correlation bar is also significant.

Check normality

Many model estimation procedures assume normality of the residuals; if this assumption doesn't hold, the coefficient estimates are not optimal. Let's look at the Quantile-Quantile (QQ) plot to assess normality visually, and use the Shapiro-Wilk test to confirm the result statistically.

qqnorm(RBO, main = "Normal Q-Q Plot of Average yearly RBO")
qqline(RBO, col = 2)

We see deviations from normality: the upper tail is off the line, and some of the data in the middle is off the line as well. Let's check statistically using the Shapiro-Wilk test, whose hypotheses are,

\(H_0\) : Time series is Normally distributed
\(H_a\) : Time series is not normal

shapiro.test(RBO)
## 
##  Shapiro-Wilk normality test
## 
## data:  RBO
## W = 0.96136, p-value = 0.3169

From the Shapiro-Wilk test, since p > 0.05 significance level, we do not reject the null hypothesis that states the data is normal. Thus, RBO series is normally distributed.

Test Stationarity

The time series plot, ACF and PACF of the RBO series at the descriptive analysis stage suggest non-stationarity. Let's use the ADF and PP tests,

Using ADF (Augmented Dickey-Fuller) test :

Let's confirm the non-stationarity using the Dickey-Fuller or ADF test. The hypotheses are,

\(H_0\) : Time series is Difference non-stationary
\(H_a\) : Time series is Stationary

adf.test(RBO) #library(tseries)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  RBO
## Dickey-Fuller = -2.0545, Lag order = 3, p-value = 0.5518
## alternative hypothesis: stationary

Since the p-value > 0.05, we do not reject the null hypothesis of non-stationarity. We conclude that the series is non-stationary at the 5% level of significance.

Using PP (Phillips-Perron) test :

The null and alternate hypothesis are same as ADF test.

PP.test(RBO, lshort = TRUE)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  RBO
## Dickey-Fuller = -3.5927, Truncation lag parameter = 2, p-value =
## 0.04906
PP.test(RBO, lshort = FALSE)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  RBO
## Dickey-Fuller = -3.8292, Truncation lag parameter = 8, p-value =
## 0.03167

According to the PP tests, the RBO series is stationary at the 5% level.

The two procedures give differing outcomes. Since the Phillips-Perron (PP) test is non-parametric, i.e. it does not require selecting the level of serial correlation as the ADF test does, and is robust to serial correlation, we go with the outcome of the PP test: the RBO series is stationary.

Conclusion from descriptive analysis:

  • From the time series plot, ACF/PACF plots and the PP tests, we take our RBO response to be stationary. Differencing is not required.
  • The series is normally distributed (Shapiro-Wilk test). Thus a Box-Cox transformation is NOT required.

Decomposition

At the descriptive analysis stage, the time series plot and the ACF/PACF plots showed no seasonal pattern but a downward trend. Let's decompose the RBO series to confirm this, using the STL decomposition method.

STL decomposition

Let's set t.window to 15 and look at the STL decomposition plots.

We can adjust the series for seasonality by subtracting the seasonal component from the original series, using the following code chunk,

Note - Since we cannot decompose a series with a frequency of 1, let's artificially set the frequency to 2. Also note that the time axis then ends around 2000 rather than 2014, as each "year" now holds two observations. This is okay since we are only interested in the decomposition.

# Code gist - Apply STL decomposition to get seasonally adjusted and trend adjusted and visually compare w.r.t to original time series

RBOX = ts(RBO_dataset[,2], start = c(1984),frequency = 2) # set frequency
stl.RBO <- stl(window(RBOX, start=c(1984)), t.window=15, s.window="periodic", robust=TRUE)

par(mfrow=c(3,1))

plot(RBOX,ylab='RBO',xlab='Time',
     type='o', main="Original RBO Time Series")

plot(seasadj(stl.RBO), ylab='RBO',xlab='Time', main = "Seasonally adjusted RBO")

stl.RBO.trend = stl.RBO$time.series[,"trend"] # Extract the trend component from the output
stl.RBO.trend.adjusted = RBOX - stl.RBO.trend

plot(stl.RBO.trend.adjusted, ylab='RBO',xlab='Time', main = "Trend adjusted RBO")

par(mfrow=c(1,1))

On close inspection of the plots above, the trend-adjusted series differs more from the original RBO series than the seasonally adjusted series does. This means the trend component is more significant than the seasonal component in the RBO series.

Conclusion of Decomposition

The trend component is more significant than the seasonal component in the RBO series. Thus, we expect the fitted model to have no seasonal component.

Modeling

The following time series regression methods will be applied, namely,

  • A. Distributed lag models (dLagM package),
  • B. Dynamic linear models (dynlm package)

A. Distributed lag models

Based on whether the lag length is known (Finite DLM) or undetermined (Infinite DLM), 4 major modelling methods will be tested (the Koyck transformation is recalled just after this list), namely,

  • Basic Finite Distributed lag model,
  • Polynomial DLM,
  • Koyck transformed geometric DLM,
  • and Autoregressive DLM.
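
As a reminder, the Koyck transformation handles the infinite geometric lag structure \(Y_t = \alpha + \beta\sum_{s=0}^{\infty}\phi^sX_{t-s} + \epsilon_t\) (with \(0 < \phi < 1\)) by lagging the equation one period, multiplying by \(\phi\) and subtracting, which collapses it into a finitely parameterized form:

\(Y_t = \alpha(1-\phi) + \phi Y_{t-1} + \beta X_t + (\epsilon_t - \phi\epsilon_{t-1})\)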

Fit Finite DLM

The response of a finite DLM model with 1 regressor is represented as shown below,

\(Y_t = \alpha + \sum_{s=0}^{q} \beta_s X_{t-s} + \epsilon_t\)

where,

  • \(\alpha\) is the intercept
  • \(\beta_s\) is the coefficient of the regressor lagged by s time units, \(X_{t-s}\)
  • and \(\epsilon_t\) is the error term

In our dataset, we have 4 regressors. For the univariate analysis, let's fit single-regressor models, one for each of the 4 regressors.

1. Temperature as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the Finite DLM model,

finiteDLMauto(formula = RBO ~ Temperature, data = RBO_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE        AIC       BIC   GMRAE     MBRAE R.Adj.Sq    Ljung-Box
## 15    15 0.00000       -Inf      -Inf 0.00000   0.00000      NaN          NaN
## 16    16 0.00000       -Inf      -Inf 0.00000   0.00000      NaN          NaN
## 17    17 0.00000       -Inf      -Inf 0.00000   0.00000      NaN          NaN
## 18    18 0.00000       -Inf      -Inf 0.00000   0.00000      NaN          NaN
## 19    19 0.00000       -Inf      -Inf 0.00000   0.00000      NaN          NaN
## 20    20 0.00000       -Inf      -Inf 0.00000   0.00000      NaN          NaN
## 14    14 0.14051 -104.59411 -90.42949 0.14730   0.00930  0.44459 0.0299955362
## 1      1 1.00562  -97.52442 -91.91963 0.97718  -1.50100  0.05726 0.0022526947
## 2      2 1.14742  -91.57332 -84.73684 1.18433   2.00053  0.03520 0.0019973822
## 3      3 1.18155  -88.84678 -80.85356 1.19082   1.78688 -0.09575 0.0003753525
## 13    13 0.29047  -85.71072 -71.46477 0.25411  -0.02821  0.00081 0.0067221779
## 10    10 0.52353  -85.18937 -71.61058 0.65204 -10.06486 -0.01070 0.1806431756
## 4      4 1.24918  -82.35556 -73.28471 1.17170   0.61711 -0.15735 0.0006014497
## 5      5 1.09444  -81.62553 -71.56076 1.03411   0.44496 -0.08477 0.0016802983
## 9      9 0.62982  -79.75895 -66.66644 0.52185   2.67533 -0.28008 0.7633752533
## 8      8 0.71270  -78.35040 -65.85997 0.55285  -3.45059 -0.09835 0.2689392429
## 11    11 0.48457  -77.95858 -64.01833 0.43409  -0.30578 -0.27985 0.1390715482
## 6      6 0.98697  -77.12304 -66.15316 0.75879   1.05760 -0.17893 0.0077737837
## 7      7 0.91375  -76.88467 -65.10413 0.79294   0.21697 -0.09764 0.0661515342
## 12    12 0.50627  -73.59058 -59.42400 0.41570  19.78949 -0.57819 0.6363609739

The fits for q ≥ 15 are degenerate (no residual degrees of freedom remain, hence the -Inf AIC/BIC values); ignoring these, q = 14 has the smallest AIC and BIC scores. Fit the model with q = 14,

DLM.Temperature = dlm(formula = RBO ~ Temperature, data = RBO_dataset, q = 14)
summary(DLM.Temperature)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##          1          2          3          4          5          6          7 
##  1.675e-03  2.586e-05 -4.133e-03  4.653e-03 -4.513e-03  6.615e-03 -3.001e-03 
##          8          9         10         11         12         13         14 
##  3.239e-03 -3.914e-03 -5.395e-03  3.714e-03  3.849e-03 -4.757e-03 -4.376e-03 
##         15         16         17 
##  4.745e-03 -2.598e-03  4.172e-03 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.279084   0.519687  -0.537    0.686
## Temperature.t  -0.092104   0.029180  -3.156    0.195
## Temperature.1   0.097983   0.043360   2.260    0.265
## Temperature.2   0.058018   0.024706   2.348    0.256
## Temperature.3  -0.067167   0.029143  -2.305    0.261
## Temperature.4  -0.052905   0.022636  -2.337    0.257
## Temperature.5   0.053018   0.031674   1.674    0.343
## Temperature.6   0.063678   0.015198   4.190    0.149
## Temperature.7  -0.014486   0.019045  -0.761    0.586
## Temperature.8   0.036527   0.016580   2.203    0.271
## Temperature.9  -0.067426   0.025603  -2.633    0.231
## Temperature.10  0.027969   0.013594   2.057    0.288
## Temperature.11 -0.039054   0.018627  -2.097    0.283
## Temperature.12  0.060493   0.022320   2.710    0.225
## Temperature.13  0.002056   0.014090   0.146    0.908
## Temperature.14 -0.015631   0.014817  -1.055    0.483
## 
## Residual standard error: 0.01693 on 1 degrees of freedom
## Multiple R-squared:  0.9653, Adjusted R-squared:  0.4446 
## F-statistic: 1.854 on 15 and 1 DF,  p-value: 0.526
## 
## AIC and BIC values for the model:
##         AIC       BIC
## 1 -104.5941 -90.42949

The DLM.Temperature model is insignificant (p-value = 0.526) at the 0.05 significance level.

Without intercept :

DLM.Temperature.noIntercept = dlm(formula = RBO ~ 0 + Temperature, data = RBO_dataset, q = 14)
summary(DLM.Temperature.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##          1          2          3          4          5          6          7 
## -0.0003981 -0.0013348 -0.0039606  0.0043438 -0.0064216  0.0075171 -0.0060594 
##          8          9         10         11         12         13         14 
##  0.0022465 -0.0072242 -0.0032458  0.0045231  0.0017786 -0.0023514 -0.0002400 
##         15         16         17 
##  0.0047904 -0.0026957  0.0084360 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## Temperature.t  -0.091281   0.023388  -3.903   0.0598 .
## Temperature.1   0.081508   0.024594   3.314   0.0802 .
## Temperature.2   0.052694   0.018163   2.901   0.1011  
## Temperature.3  -0.058068   0.019030  -3.051   0.0927 .
## Temperature.4  -0.051026   0.017950  -2.843   0.1047  
## Temperature.5   0.041974   0.019334   2.171   0.1621  
## Temperature.6   0.060570   0.011279   5.370   0.0330 *
## Temperature.7  -0.008716   0.012621  -0.691   0.5612  
## Temperature.8   0.036608   0.013307   2.751   0.1106  
## Temperature.9  -0.059111   0.016367  -3.612   0.0688 .
## Temperature.10  0.023610   0.008753   2.697   0.1143  
## Temperature.11 -0.035049   0.013700  -2.558   0.1248  
## Temperature.12  0.053697   0.014757   3.639   0.0679 .
## Temperature.13  0.003294   0.011157   0.295   0.7956  
## Temperature.14 -0.013658   0.011521  -1.185   0.3576  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01359 on 2 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9996 
## F-statistic:  3133 on 15 and 2 DF,  p-value: 0.0003191
## 
## AIC and BIC values for the model:
##         AIC       BIC
## 1 -102.2864 -88.95495

The DLM.Temperature.noIntercept model is significant (p-value = 0.0003191).

2. Rainfall as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the Finite DLM model,

finiteDLMauto(formula = RBO ~ Rainfall, data = RBO_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE        AIC       BIC   GMRAE    MBRAE R.Adj.Sq  Ljung-Box
## 15    15 0.00000       -Inf      -Inf 0.00000  0.00000      NaN        NaN
## 16    16 0.00000       -Inf      -Inf 0.00000  0.00000      NaN        NaN
## 17    17 0.00000       -Inf      -Inf 0.00000  0.00000      NaN        NaN
## 18    18 0.00000       -Inf      -Inf 0.00000  0.00000      NaN        NaN
## 19    19 0.00000       -Inf      -Inf 0.00000  0.00000      NaN        NaN
## 20    20 0.00000       -Inf      -Inf 0.00000  0.00000      NaN        NaN
## 14    14 0.09581 -113.88404 -99.71941 0.09433 -0.03282  0.67842 0.03001328
## 13    13 0.15241 -109.63418 -95.38823 0.12675  0.01008  0.73549 0.69144810
## 1      1 0.94180 -100.89798 -95.29319 0.94415  0.46484  0.15753 0.04687413
## 3      3 0.97969  -97.19966 -89.20643 0.86453  0.97573  0.18688 0.01249933
## 2      2 0.99937  -96.70956 -89.87308 0.83164  0.44840  0.19180 0.03708415
## 4      4 1.03883  -90.46187 -81.39101 0.86444  0.78209  0.14281 0.02073547
## 5      5 0.92568  -87.24242 -77.17765 0.76943  0.47689  0.12600 0.02871866
## 6      6 0.85440  -82.31788 -71.34800 0.51455 27.53976  0.04227 0.04784411
## 9      9 0.62059  -80.79432 -67.70181 0.54274 -0.14207 -0.22123 0.84870967
## 12    12 0.42402  -79.63373 -65.46714 0.37726  0.92378 -0.14823 0.84289084
## 7      7 0.82934  -77.98405 -66.20351 0.77764  0.94056 -0.04849 0.17026081
## 11    11 0.46218  -77.84858 -63.90833 0.29371  0.22987 -0.28691 0.55512719
## 10    10 0.58562  -76.93255 -63.35376 0.53594  0.09998 -0.49754 0.52875140
## 8      8 0.72338  -76.81922 -64.32879 0.62316  0.02713 -0.17396 0.59232039

As before, ignoring the degenerate fits for q ≥ 15, q = 14 has the smallest AIC and BIC scores. Fit the model with q = 14,

DLM.Rainfall = dlm(formula = RBO ~ Rainfall, data = RBO_dataset, q = 14)
summary(DLM.Rainfall)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##          1          2          3          4          5          6          7 
## -0.0037393  0.0003335  0.0023616  0.0044489 -0.0029005  0.0010146 -0.0042853 
##          8          9         10         11         12         13         14 
##  0.0013098 -0.0007348  0.0048191 -0.0017438  0.0014497 -0.0022985  0.0015321 
##         15         16         17 
## -0.0058962  0.0050202 -0.0006913 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.6361975  0.1507225   4.221    0.148
## Rainfall.t   0.0195284  0.0210850   0.926    0.524
## Rainfall.1   0.0204377  0.0130997   1.560    0.363
## Rainfall.2   0.0027206  0.0156791   0.174    0.891
## Rainfall.3   0.0074631  0.0126796   0.589    0.661
## Rainfall.4   0.0002523  0.0167585   0.015    0.990
## Rainfall.5   0.0271139  0.0126104   2.150    0.277
## Rainfall.6  -0.0007402  0.0138839  -0.053    0.966
## Rainfall.7   0.0008901  0.0159534   0.056    0.965
## Rainfall.8  -0.0082237  0.0133523  -0.616    0.649
## Rainfall.9  -0.0359922  0.0139713  -2.576    0.236
## Rainfall.10 -0.0076753  0.0134261  -0.572    0.669
## Rainfall.11 -0.0087286  0.0145229  -0.601    0.655
## Rainfall.12  0.0220019  0.0129624   1.697    0.339
## Rainfall.13  0.0171198  0.0216060   0.792    0.573
## Rainfall.14 -0.0189919  0.0193580  -0.981    0.506
## 
## Residual standard error: 0.01288 on 1 degrees of freedom
## Multiple R-squared:  0.9799, Adjusted R-squared:  0.6784 
## F-statistic:  3.25 on 15 and 1 DF,  p-value: 0.4127
## 
## AIC and BIC values for the model:
##        AIC       BIC
## 1 -113.884 -99.71941

The DLM.Rainfall model is insignificant (p-value = 0.4127) at the 0.05 significance level.

Without intercept :

DLM.Rainfall.noIntercept = dlm(formula = RBO ~ 0 + Rainfall, data = RBO_dataset, q = 14)
summary(DLM.Rainfall.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##         1         2         3         4         5         6         7         8 
## -0.013313 -0.026712 -0.006145  0.018074  0.011886  0.002610 -0.007758 -0.017198 
##         9        10        11        12        13        14        15        16 
## -0.009586  0.010223  0.018227  0.012199  0.011180  0.011907 -0.015261 -0.007257 
##        17 
##  0.011574 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## Rainfall.t   0.086282   0.042775   2.017    0.181
## Rainfall.1   0.044434   0.036200   1.227    0.345
## Rainfall.2   0.015588   0.047175   0.330    0.772
## Rainfall.3   0.016892   0.038284   0.441    0.702
## Rainfall.4   0.022028   0.048907   0.450    0.697
## Rainfall.5   0.026982   0.038680   0.698    0.558
## Rainfall.6   0.025575   0.038051   0.672    0.571
## Rainfall.7   0.024474   0.045835   0.534    0.647
## Rainfall.8   0.008888   0.039022   0.228    0.841
## Rainfall.9  -0.056935   0.040061  -1.421    0.291
## Rainfall.10 -0.015777   0.040759  -0.387    0.736
## Rainfall.11 -0.033288   0.040815  -0.816    0.500
## Rainfall.12  0.033160   0.038924   0.852    0.484
## Rainfall.13  0.069648   0.054175   1.286    0.327
## Rainfall.14  0.042888   0.038776   1.106    0.384
## 
## Residual standard error: 0.03952 on 2 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9969 
## F-statistic: 370.4 on 15 and 2 DF,  p-value: 0.002695
## 
## AIC and BIC values for the model:
##         AIC       BIC
## 1 -65.99336 -52.66195

The DLM.Rainfall.noIntercept model is significant (p-value = 0.002695).

3. Radiation as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the Finite DLM model,

finiteDLMauto(formula = RBO ~ Radiation, data = RBO_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE       AIC       BIC   GMRAE    MBRAE R.Adj.Sq    Ljung-Box
## 15    15 0.00000      -Inf      -Inf 0.00000  0.00000      NaN          NaN
## 16    16 0.00000      -Inf      -Inf 0.00000  0.00000      NaN          NaN
## 17    17 0.00000      -Inf      -Inf 0.00000  0.00000      NaN          NaN
## 18    18 0.00000      -Inf      -Inf 0.00000  0.00000      NaN          NaN
## 19    19 0.00000      -Inf      -Inf 0.00000  0.00000      NaN          NaN
## 20    20 0.00000      -Inf      -Inf 0.00000  0.00000      NaN          NaN
## 14    14 0.12847 -98.60918 -84.44456 0.09397 -0.18351  0.21021 0.4496384297
## 1      1 1.06303 -97.71113 -92.10634 1.18033  2.85932  0.06311 0.0015201563
## 3      3 1.12846 -92.38658 -84.39335 1.15033  0.64125  0.03438 0.0005032338
## 2      2 1.17471 -91.50708 -84.67061 1.10279  2.19244  0.03299 0.0024745598
## 13    13 0.28585 -87.57406 -73.32812 0.28875 -1.83052  0.09907 0.1349515619
## 4      4 1.19981 -86.43017 -77.35931 1.27178 -0.09837  0.00476 0.0007267546
## 5      5 1.07186 -83.32223 -73.25746 1.15984  2.21790 -0.01624 0.0036551584
## 6      6 0.94832 -81.85403 -70.88414 0.86545 -0.85749  0.02433 0.0134482398
## 8      8 0.70733 -80.65686 -68.16642 0.59509  0.78620  0.00645 0.2032244911
## 7      7 0.85847 -79.04997 -67.26943 0.67378  1.85251 -0.00294 0.0831448135
## 9      9 0.66171 -78.45240 -65.35989 0.50103  0.35495 -0.35840 0.9746014602
## 10    10 0.60373 -78.06879 -64.49000 0.63534  0.45096 -0.41866 0.4067961832
## 11    11 0.55833 -72.88766 -58.94741 0.63561  0.27746 -0.64919 0.3184508076
## 12    12 0.64771 -65.19474 -51.02816 0.67460  0.25024 -1.45510 0.4948052542

As before, ignoring the degenerate fits for q ≥ 15, q = 14 has the smallest AIC and BIC scores. Fit the model with q = 14,

DLM.Radiation = dlm(formula = RBO ~ Radiation, data = RBO_dataset, q = 14)
summary(DLM.Radiation)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##          1          2          3          4          5          6          7 
## -5.335e-04 -6.401e-03 -7.619e-04  7.131e-03  6.508e-04 -8.572e-05  2.471e-03 
##          8          9         10         11         12         13         14 
## -8.584e-03  3.245e-03 -1.008e-03  9.027e-03  9.758e-04 -1.074e-02  1.932e-04 
##         15         16         17 
##  1.855e-03 -1.776e-03  4.338e-03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -4.943432   5.890296  -0.839    0.555
## Radiation.t  -0.047737   0.059011  -0.809    0.567
## Radiation.1   0.081805   0.058184   1.406    0.394
## Radiation.2   0.090890   0.069717   1.304    0.417
## Radiation.3   0.083786   0.079447   1.055    0.483
## Radiation.4  -0.064871   0.080220  -0.809    0.567
## Radiation.5  -0.195571   0.102087  -1.916    0.306
## Radiation.6   0.078744   0.051070   1.542    0.366
## Radiation.7   0.043922   0.047225   0.930    0.523
## Radiation.8   0.133958   0.090427   1.481    0.378
## Radiation.9   0.013467   0.030075   0.448    0.732
## Radiation.10  0.002873   0.029457   0.098    0.938
## Radiation.11 -0.070320   0.032088  -2.191    0.273
## Radiation.12  0.034562   0.034965   0.988    0.504
## Radiation.13  0.033330   0.096370   0.346    0.788
## Radiation.14  0.171252   0.116490   1.470    0.380
## 
## Residual standard error: 0.02019 on 1 degrees of freedom
## Multiple R-squared:  0.9506, Adjusted R-squared:  0.2102 
## F-statistic: 1.284 on 15 and 1 DF,  p-value: 0.6086
## 
## AIC and BIC values for the model:
##         AIC       BIC
## 1 -98.60918 -84.44456

The DLM.Radiation model is insignificant (p-value = 0.6086) at the 0.05 significance level.

Without intercept :

DLM.Radiation.noIntercept = dlm(formula = RBO ~ 0 + Rainfall, data = RBO_dataset, q = 14) # NOTE: the formula uses Rainfall rather than Radiation, so this refits the Rainfall model (hence the identical output below)
summary(DLM.Radiation.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##         1         2         3         4         5         6         7         8 
## -0.013313 -0.026712 -0.006145  0.018074  0.011886  0.002610 -0.007758 -0.017198 
##         9        10        11        12        13        14        15        16 
## -0.009586  0.010223  0.018227  0.012199  0.011180  0.011907 -0.015261 -0.007257 
##        17 
##  0.011574 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## Rainfall.t   0.086282   0.042775   2.017    0.181
## Rainfall.1   0.044434   0.036200   1.227    0.345
## Rainfall.2   0.015588   0.047175   0.330    0.772
## Rainfall.3   0.016892   0.038284   0.441    0.702
## Rainfall.4   0.022028   0.048907   0.450    0.697
## Rainfall.5   0.026982   0.038680   0.698    0.558
## Rainfall.6   0.025575   0.038051   0.672    0.571
## Rainfall.7   0.024474   0.045835   0.534    0.647
## Rainfall.8   0.008888   0.039022   0.228    0.841
## Rainfall.9  -0.056935   0.040061  -1.421    0.291
## Rainfall.10 -0.015777   0.040759  -0.387    0.736
## Rainfall.11 -0.033288   0.040815  -0.816    0.500
## Rainfall.12  0.033160   0.038924   0.852    0.484
## Rainfall.13  0.069648   0.054175   1.286    0.327
## Rainfall.14  0.042888   0.038776   1.106    0.384
## 
## Residual standard error: 0.03952 on 2 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9969 
## F-statistic: 370.4 on 15 and 2 DF,  p-value: 0.002695
## 
## AIC and BIC values for the model:
##         AIC       BIC
## 1 -65.99336 -52.66195

The DLM.Radiation.noIntercept model is significant (p-value = 0.002695), though, as noted above, it duplicates the Rainfall model.

4. RelHumidity as regressor

With intercept :

Now, let's use the AIC and BIC scores to find the best lag length for the Finite DLM model,

finiteDLMauto(formula = RBO ~ RelHumidity, data = RBO_dataset, q.min = 1, q.max = 20,
              model.type = "dlm", error.type = "AIC", trace = TRUE)
##    q - k    MASE        AIC        BIC   GMRAE    MBRAE R.Adj.Sq    Ljung-Box
## 15    15 0.00000       -Inf       -Inf 0.00000  0.00000      NaN          NaN
## 16    16 0.00000       -Inf       -Inf 0.00000  0.00000      NaN          NaN
## 17    17 0.00000       -Inf       -Inf 0.00000  0.00000      NaN          NaN
## 18    18 0.00000       -Inf       -Inf 0.00000  0.00000      NaN          NaN
## 19    19 0.00000       -Inf       -Inf 0.00000  0.00000      NaN          NaN
## 20    20 0.00000       -Inf       -Inf 0.00000  0.00000      NaN          NaN
## 14    14 0.05888 -122.75432 -108.58970 0.04730 -0.06621  0.80915 0.6290754261
## 12    12 0.24538 -102.94035  -88.77377 0.22312 -1.00307  0.66326 0.9118912488
## 13    13 0.20119  -98.02833  -83.78238 0.15223 -0.12365  0.49597 0.1896695293
## 1      1 1.10110  -94.56619  -88.96140 1.22180  2.01650 -0.04044 0.0021897031
## 10    10 0.45185  -89.23348  -75.65468 0.41472 -3.34411  0.16635 0.8908388622
## 3      3 1.17112  -89.19540  -81.20218 1.12421  0.09980 -0.08219 0.0002775153
## 2      2 1.23588  -88.37920  -81.54272 1.29446  0.39157 -0.07714 0.0043469395
## 11    11 0.34014  -86.87319  -72.93294 0.18838  0.48333  0.18044 0.7234515930
## 9      9 0.58212  -83.46275  -70.37025 0.49322  0.29039 -0.08174 0.4767002700
## 4      4 1.26243  -82.80086  -73.73000 1.31804  1.18794 -0.13842 0.0002642674
## 5      5 1.14964  -79.40410  -69.33932 1.00804  0.38188 -0.18152 0.0004056958
## 6      6 1.02674  -76.22147  -65.25158 0.82281  0.36858 -0.22222 0.0061384348
## 7      7 0.92355  -74.80399  -63.02345 0.77382  0.64955 -0.19704 0.0043529149
## 8      8 0.84661  -72.65628  -60.16585 0.79021 -4.31457 -0.40689 0.0443006703

As before, ignoring the degenerate fits for q ≥ 15, q = 14 has the smallest AIC and BIC scores. Fit the model with q = 14,

DLM.RelHumidity = dlm(formula = RBO ~ RelHumidity, data = RBO_dataset, q = 14)
summary(DLM.RelHumidity)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##          1          2          3          4          5          6          7 
##  3.440e-04  1.266e-03  1.250e-03  1.127e-03 -9.612e-04 -5.170e-03 -2.620e-04 
##          8          9         10         11         12         13         14 
## -5.288e-05  8.850e-04 -5.459e-04  7.252e-04 -2.540e-03 -7.889e-04  2.681e-04 
##         15         16         17 
## -3.376e-03  1.143e-03  6.690e-03 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -1.2139297  2.8600699  -0.424    0.744
## RelHumidity.t   0.0039691  0.0116410   0.341    0.791
## RelHumidity.1  -0.0218992  0.0082984  -2.639    0.231
## RelHumidity.2  -0.0009846  0.0078243  -0.126    0.920
## RelHumidity.3  -0.0197978  0.0079714  -2.484    0.244
## RelHumidity.4   0.0009245  0.0043927   0.210    0.868
## RelHumidity.5   0.0160989  0.0053607   3.003    0.205
## RelHumidity.6  -0.0123647  0.0045655  -2.708    0.225
## RelHumidity.7   0.0163437  0.0049107   3.328    0.186
## RelHumidity.8   0.0060125  0.0048058   1.251    0.429
## RelHumidity.9  -0.0087084  0.0066835  -1.303    0.417
## RelHumidity.10  0.0011195  0.0045142   0.248    0.845
## RelHumidity.11 -0.0068266  0.0058391  -1.169    0.450
## RelHumidity.12  0.0269905  0.0068458   3.943    0.158
## RelHumidity.13  0.0128362  0.0059733   2.149    0.277
## RelHumidity.14  0.0068126  0.0054905   1.241    0.432
## 
## Residual standard error: 0.009924 on 1 degrees of freedom
## Multiple R-squared:  0.9881, Adjusted R-squared:  0.8092 
## F-statistic: 5.522 on 15 and 1 DF,  p-value: 0.3235
## 
## AIC and BIC values for the model:
##         AIC       BIC
## 1 -122.7543 -108.5897

The DLM.RelHumidity model is insignificant (p-value = 0.3235) at the 0.05 significance level.

Without intercept :

DLM.RelHumidity.noIntercept = dlm(formula = RBO ~ 0 + RelHumidity, data = RBO_dataset, q = 14)
summary(DLM.RelHumidity.noIntercept)
## 
## Call:
## lm(formula = as.formula(model.formula), data = design)
## 
## Residuals:
##          1          2          3          4          5          6          7 
## -0.0018299  0.0010538  0.0005606  0.0013108 -0.0012237 -0.0063138  0.0012040 
##          8          9         10         11         12         13         14 
## -0.0011289  0.0018053 -0.0015397  0.0013437 -0.0018318  0.0005116 -0.0001880 
##         15         16         17 
## -0.0027747  0.0027611  0.0062651 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## RelHumidity.t  -2.084e-05  5.274e-03  -0.004   0.9972  
## RelHumidity.1  -2.402e-02  5.090e-03  -4.718   0.0421 *
## RelHumidity.2  -2.646e-03  5.204e-03  -0.509   0.6616  
## RelHumidity.3  -2.185e-02  4.864e-03  -4.493   0.0461 *
## RelHumidity.4   3.987e-04  3.237e-03   0.123   0.9132  
## RelHumidity.5   1.471e-02  3.257e-03   4.515   0.0457 *
## RelHumidity.6  -1.307e-02  3.268e-03  -3.998   0.0572 .
## RelHumidity.7   1.547e-02  3.424e-03   4.518   0.0457 *
## RelHumidity.8   5.471e-03  3.559e-03   1.537   0.2641  
## RelHumidity.9  -1.090e-02  3.251e-03  -3.354   0.0786 .
## RelHumidity.10  1.194e-03  3.465e-03   0.345   0.7632  
## RelHumidity.11 -5.777e-03  4.063e-03  -1.422   0.2910  
## RelHumidity.12  2.854e-02  4.453e-03   6.409   0.0235 *
## RelHumidity.13  1.359e-02  4.380e-03   3.103   0.0901 .
## RelHumidity.14  6.632e-03  4.205e-03   1.577   0.2555  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007624 on 2 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9999 
## F-statistic:  9956 on 15 and 2 DF,  p-value: 0.0001004
## 
## AIC and BIC values for the model:
##         AIC      BIC
## 1 -121.9384 -108.607

The DLM.RelHumidity.noIntercept model is significant (F-test p-value = 0.0001004) at the 0.05 significance level.

Finite DLM Model Selection

The no-intercept finite DLM models for all 4 predictors are significant. Eliminating the insignificant models, we compare the remaining finite DLM models based on adjusted R-squared, AIC, BIC and MASE:

Model <- c("DLM.Temperature.noIntercept", "DLM.Rainfall.noIntercept", "DLM.Radiation.noIntercept", "DLM.RelHumidity.noIntercept")
AIC <- c(AIC(DLM.Temperature.noIntercept), AIC(DLM.Rainfall.noIntercept), AIC(DLM.Radiation.noIntercept), AIC(DLM.RelHumidity.noIntercept))
BIC <- c(BIC(DLM.Temperature.noIntercept), BIC(DLM.Rainfall.noIntercept), BIC(DLM.Radiation.noIntercept), BIC(DLM.RelHumidity.noIntercept))
Adjusted_Rsquared <- c(0.999, 0.9969, 0.9969, 0.9999)
MASE <- MASE(DLM.Temperature.noIntercept, DLM.Rainfall.noIntercept, DLM.Radiation.noIntercept, DLM.RelHumidity.noIntercept)
data.frame(AIC, BIC, Adjusted_Rsquared, MASE) %>% arrange(AIC)
##                                    AIC        BIC Adjusted_Rsquared  n
## DLM.RelHumidity.noIntercept -121.93842 -108.60701            0.9999 17
## DLM.Temperature.noIntercept -102.28636  -88.95495            0.9990 17
## DLM.Rainfall.noIntercept     -65.99336  -52.66195            0.9969 17
## DLM.Radiation.noIntercept    -65.99336  -52.66195            0.9969 17
##                                   MASE
## DLM.RelHumidity.noIntercept 0.07231577
## DLM.Temperature.noIntercept 0.14522072
## DLM.Rainfall.noIntercept    0.45373278
## DLM.Radiation.noIntercept   0.45373278

Thus, as per AIC, BIC and MASE, the finite distributed lag model for RBO with Relative Humidity as the regressor and no intercept (DLM.RelHumidity.noIntercept) is the best.

Diagnostic check for DLM.RelHumidity.noIntercept (Residual analysis)

We can apply a diagnostic check using the checkresiduals() function from the forecast package.

checkresiduals(DLM.RelHumidity.noIntercept$model$residuals) # forecast package

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 0.81081, df = 3, p-value = 0.8469
## 
## Model df: 0.   Total lags used: 3

In this output,

  • From the time series plot and histogram of residuals, there is a clear random pattern and approximate normality in the residual distribution. Thus, there is no violation of the general assumptions.
  • The Ljung-Box test output is displayed. According to this test, the null hypothesis that the residual series exhibits no autocorrelation up to lag 3 is not rejected (p-value = 0.8469). Together with the ACF plot, we can conclude that the serial correlation left in the residuals is NOT significant. The same test can be run directly with Box.test(), as sketched below.
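
A minimal sketch of running the same Ljung-Box test directly via stats::Box.test(), with lag = 3 chosen to match the checkresiduals() output above:

Box.test(DLM.RelHumidity.noIntercept$model$residuals, lag = 3, type = "Ljung-Box") # stats package
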
Conclusion of Finite DLM model
  • Best model is with Relative Humidity as the regressor with no intercept (DLM.RelHumidity.noIntercept).
  • DLM.RelHumidity.noIntercept Model is significant.
  • MASE is 0.07231577
  • Adjusted R-squared is 99.99%.
  • No violations in the test of assumptions
  • Serial autocorrelation is not significant

ATTENTION - From here on, we summarise the models rather than going into each model's details, for simplicity.

Fit Polynomial DLM model

The polynomial DLM helps remove the effect of multicollinearity among the lagged regressors by constraining the lag weights to follow a low-order polynomial. Let's fit a polynomial DLM of order 2 for each of the 4 regressors individually; a sketch of the underlying transformation is given below.
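
As a minimal sketch of where the z.t0, z.t1 and z.t2 terms in the summaries below come from (polyDlm() constructs these internally; the exact internal scaling is an assumption here), the order-2 transform replaces the q + 1 lagged regressors with three weighted sums:

# Almon-style transform sketch: z.ti is the sum over s of s^i * x_{t-s}, for k = 2
x <- as.vector(Temperature); q <- 14; n <- length(x)
X.lags <- sapply(0:q, function(s) x[(q + 1 - s):(n - s)]) # row t holds x_t, x_{t-1}, ..., x_{t-q}
z.t0 <- X.lags %*% (0:q)^0 # plain sum of current and lagged x
z.t1 <- X.lags %*% (0:q)^1 # linearly weighted sum
z.t2 <- X.lags %*% (0:q)^2 # quadratically weighted sum
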

1. Temperature as regressor
PolyDLM.Temperature = polyDlm(x = as.vector(Temperature), y = as.vector(RBO), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.Temperature)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032950 -0.011121  0.001367  0.011331  0.030851 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.3942960  0.4617116   0.854    0.409
## z.t0        -0.0104404  0.0091618  -1.140    0.275
## z.t1         0.0041049  0.0028129   1.459    0.168
## z.t2        -0.0002538  0.0001710  -1.484    0.162
## 
## Residual standard error: 0.02236 on 13 degrees of freedom
## Multiple R-squared:  0.2127, Adjusted R-squared:  0.03107 
## F-statistic: 1.171 on 3 and 13 DF,  p-value: 0.3585

The polynomial DLM with Temperature as the regressor is insignificant at the 5% significance level.

2. Rainfall as regressor
PolyDLM.Rainfall = polyDlm(x = as.vector(Rainfall), y = as.vector(RBO), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.Rainfall)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.036444 -0.011375 -0.002838  0.017678  0.033093 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7480934  0.1649886   4.534 0.000561 ***
## z.t0         0.0114043  0.0127395   0.895 0.386959    
## z.t1        -0.0028950  0.0038747  -0.747 0.468269    
## z.t2         0.0001186  0.0002811   0.422 0.680119    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02168 on 13 degrees of freedom
## Multiple R-squared:  0.2601, Adjusted R-squared:  0.08932 
## F-statistic: 1.523 on 3 and 13 DF,  p-value: 0.2553

The polynomial DLM with Rainfall as the regressor is insignificant at the 5% significance level.

3. Radiation as regressor
PolyDLM.Radiation = polyDlm(x = as.vector(Radiation), y = as.vector(RBO), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.Radiation)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04294 -0.01136  0.00325  0.01095  0.03331 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.3266320  1.1428764   1.161    0.267
## z.t0        -0.0130075  0.0123520  -1.053    0.311
## z.t1         0.0052006  0.0037170   1.399    0.185
## z.t2        -0.0003872  0.0002747  -1.410    0.182
## 
## Residual standard error: 0.02241 on 13 degrees of freedom
## Multiple R-squared:  0.2091, Adjusted R-squared:  0.0266 
## F-statistic: 1.146 on 3 and 13 DF,  p-value: 0.3674

The polynomial DLM with Radiation as the regressor is insignificant at the 5% significance level.

4. Relative Humidity as regressor
PolyDLM.RelHumidity = polyDlm(x = as.vector(RelHumidity), y = as.vector(RBO), q = 14, k = 2, show.beta = FALSE)
summary(PolyDLM.RelHumidity)
## 
## Call:
## "Y ~ (Intercept) + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.044911 -0.012149  0.003855  0.010324  0.031775 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -6.643e+00  3.580e+00  -1.855   0.0864 .
## z.t0         9.925e-03  6.224e-03   1.595   0.1348  
## z.t1        -2.833e-04  1.431e-03  -0.198   0.8462  
## z.t2        -4.078e-05  9.394e-05  -0.434   0.6713  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02168 on 13 degrees of freedom
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.08886 
## F-statistic:  1.52 on 3 and 13 DF,  p-value: 0.256

The polynomial DLM with Relative Humidity as the regressor is insignificant at the 5% significance level.

PolyDLM Model selection

None of the univariate polynomial DLM models, for any of the 4 predictors, is significant.

Conclusion of Polynomial DLM model

No significant Polynomial DLM model was found.

Fit Koyck geometric DLM model

Here the lag weights are positive and decline geometrically. This model is called the infinite geometric DLM, meaning there are infinitely many lag weights. The Koyck transformation makes it estimable: we subtract the first lag of the geometric DLM multiplied by \(\phi\), which collapses the infinite sum. The Koyck transformed model is represented as,

\(Y_t = \delta_1 + \delta_2Y_{t-1} + \delta_3X_t + \nu_t\)

where \(\delta_1 = \alpha(1-\phi), \delta_2 = \phi, \delta_3 = \beta\) and the random error after the transformation is \(\nu_t = (\epsilon_t -\phi\epsilon_{t-1})\).
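
The transformation is one line of algebra (a standard derivation, included here for completeness): since the geometric DLM is \(Y_t = \alpha + \beta\sum_{s=0}^{\infty}\phi^s X_{t-s} + \epsilon_t\), multiplying its first lag by \(\phi\) reproduces every term of the infinite sum except \(\beta X_t\), so

\(Y_t - \phi Y_{t-1} = \alpha(1-\phi) + \beta X_t + (\epsilon_t - \phi\epsilon_{t-1})\)

which is exactly the \(\delta\) parameterisation above.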

The koyckDlm() function implements a two-stage least squares method: first \(\hat{Y}_{t-1}\) is estimated with the help of an instrumental variable, and then \(Y_t\) is estimated through a simple linear regression on \(\hat{Y}_{t-1}\) and \(X_t\); a minimal sketch of this idea follows. Let's fit Koyck geometric DLM models for each of the 4 regressors individually.
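
A hand-rolled sketch of the two-stage idea, using \(X_t\) as the instrument for the lagged response (an illustrative assumption; koyckDlm()'s internal instrument choice may differ):

y <- as.vector(RBO); x <- as.vector(RelHumidity)
Y.t <- y[-1]; Y.1 <- y[-length(y)]; X.t <- x[-1] # align Y_t, Y_{t-1} and X_t
stage1 <- lm(Y.1 ~ X.t) # Stage 1: regress the lagged response on the instrument
Y.1.hat <- fitted(stage1)
stage2 <- lm(Y.t ~ Y.1.hat + X.t) # Stage 2: regress Y_t on the fitted lag and X_t
summary(stage2)
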

1. Temperature as regressor

With intercept :

Koyck.Temperature = koyckDlm(x = as.vector(RBO_dataset$Temperature) , y = as.vector(RBO_dataset$RBO) )
summary(Koyck.Temperature$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0741656 -0.0225173 -0.0006794  0.0240622  0.1270971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.20775    0.83741  -0.248   0.8059  
## Y.1          0.68547    0.25559   2.682   0.0123 *
## X.t          0.02235    0.03523   0.634   0.5312  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value  
## Weak instruments   1  27     4.635  0.0404 *
## Wu-Hausman         1  26     1.347  0.2563  
## Sargan             0  NA        NA      NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04319 on 27 degrees of freedom
## Multiple R-Squared: 0.1517,  Adjusted R-squared: 0.08891 
## Wald test: 5.309 on 2 and 27 DF,  p-value: 0.01136

Koyck.Temperature is significant at 5% significance level.

Without intercept :

Koyck.Temperature.NoIntercept = koyckDlm(x = as.vector(RBO_dataset$Temperature) , y = as.vector(RBO_dataset$RBO), intercept = FALSE)
summary(Koyck.Temperature.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067575 -0.019831 -0.002845  0.021777  0.118166 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## Y.1 0.633607   0.136928   4.627 7.68e-05 ***
## X.t 0.013715   0.005163   2.656   0.0129 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2  27    93.540 7.25e-13 ***
## Wu-Hausman         1  27     5.026   0.0334 *  
## Sargan             1  NA     0.076   0.7827    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04021 on 28 degrees of freedom
## Multiple R-Squared: 0.9972,  Adjusted R-squared: 0.997 
## Wald test:  5049 on 2 and 28 DF,  p-value: < 2.2e-16

Koyck.Temperature.NoIntercept is significant at 5% significance level.

2. Rainfall as regressor

With intercept :

Koyck.Rainfall = koyckDlm(x = as.vector(RBO_dataset$Rainfall) , y = as.vector(RBO_dataset$RBO) )
summary(Koyck.Rainfall$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3665 -0.4155 -0.1142  0.3241  1.6012 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.3207     2.4302   0.132    0.896
## Y.1          -6.5147   243.8216  -0.027    0.979
## X.t           2.2101    76.0635   0.029    0.977
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value
## Weak instruments   1  27     0.001   0.977
## Wu-Hausman         1  26     0.360   0.554
## Sargan             0  NA        NA      NA
## 
## Residual standard error: 0.7951 on 27 degrees of freedom
## Multiple R-Squared: -286.5,  Adjusted R-squared: -307.8 
## Wald test: 0.01549 on 2 and 27 DF,  p-value: 0.9846

Koyck.Rainfall model is insignificant at 5% significance level.

Without intercept :

Koyck.Rainfall.NoIntercept = koyckDlm(x = as.vector(RBO_dataset$Rainfall) , y = as.vector(RBO_dataset$RBO), intercept = FALSE)
summary(Koyck.Rainfall.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99247 -0.31365 -0.07781  0.23237  1.22822 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)
## Y.1   -4.287    177.876  -0.024    0.981
## X.t    1.650     55.537   0.030    0.977
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value
## Weak instruments   2  27     0.000   1.000
## Wu-Hausman         1  27     0.161   0.692
## Sargan             1  NA     0.035   0.852
## 
## Residual standard error: 0.5815 on 28 degrees of freedom
## Multiple R-Squared: 0.4216,  Adjusted R-squared: 0.3803 
## Wald test: 24.13 on 2 and 28 DF,  p-value: 8.091e-07

Koyck.Rainfall.NoIntercept model is significant at 5% significance level.

3. Radiation as regressor

With intercept :

Koyck.Radiation = koyckDlm(x = as.vector(RBO_dataset$Radiation) , y = as.vector(RBO_dataset$RBO) )
summary(Koyck.Radiation$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.082255 -0.017008 -0.001036  0.021424  0.106984 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.48011    0.94819  -0.506   0.6167   
## Y.1          0.69801    0.24502   2.849   0.0083 **
## X.t          0.04812    0.05661   0.850   0.4028   
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value  
## Weak instruments   1  27     4.942  0.0348 *
## Wu-Hausman         1  26     2.765  0.1084  
## Sargan             0  NA        NA      NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0467 on 27 degrees of freedom
## Multiple R-Squared: 0.008467,    Adjusted R-squared: -0.06498 
## Wald test: 4.731 on 2 and 27 DF,  p-value: 0.01732

Koyck.Radiation model is significant at 5% significance level.

Without intercept :

Koyck.Radiation.NoIntercept = koyckDlm(x = as.vector(RBO_dataset$Radiation) , y = as.vector(RBO_dataset$RBO), intercept = FALSE)
summary(Koyck.Radiation.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.075202 -0.018109 -0.001784  0.019029  0.105209 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## Y.1 0.607517   0.144697   4.199 0.000246 ***
## X.t 0.019783   0.007344   2.694 0.011798 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2  27   109.386 1.13e-13 ***
## Wu-Hausman         1  27     6.810   0.0146 *  
## Sargan             1  NA     0.369   0.5438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04031 on 28 degrees of freedom
## Multiple R-Squared: 0.9972,  Adjusted R-squared: 0.997 
## Wald test:  5024 on 2 and 28 DF,  p-value: < 2.2e-16

Koyck.Radiation.NoIntercept model is significant at 5% significance level.

4. Relative Humidity as regressor

With intercept :

Koyck.RelHumidity = koyckDlm(x = as.vector(RBO_dataset$RelHumidity) , y = as.vector(RBO_dataset$RBO) )
summary(Koyck.RelHumidity$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.080897 -0.021103 -0.004676  0.022673  0.111041 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.16679    8.04941  -0.145   0.8858  
## Y.1          0.62503    0.34753   1.798   0.0833 .
## X.t          0.01525    0.08274   0.184   0.8551  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value
## Weak instruments   1  27     0.393   0.536
## Wu-Hausman         1  26     0.055   0.816
## Sargan             0  NA        NA      NA
## 
## Residual standard error: 0.04127 on 27 degrees of freedom
## Multiple R-Squared: 0.2256,  Adjusted R-squared: 0.1682 
## Wald test: 5.612 on 2 and 27 DF,  p-value: 0.009161

Koyck.RelHumidity model is significant at 5% significance level.

Without intercept :

Koyck.RelHumidity.NoIntercept = koyckDlm(x = as.vector(RBO_dataset$RelHumidity) , y = as.vector(RBO_dataset$RBO), intercept = FALSE)
summary(Koyck.RelHumidity.NoIntercept$model, diagnostics = TRUE)
## 
## Call:
## "Y ~ (Intercept) + Y.1 + X.t"
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.079941 -0.018086 -0.006952  0.018421  0.105496 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## Y.1 0.580729   0.153090   3.793 0.000729 ***
## X.t 0.003260   0.001198   2.720 0.011080 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   2  27   801.908  <2e-16 ***
## Wu-Hausman         1  27     0.493   0.489    
## Sargan             1  NA     0.026   0.871    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0382 on 28 degrees of freedom
## Multiple R-Squared: 0.9975,  Adjusted R-squared: 0.9973 
## Wald test:  5595 on 2 and 28 DF,  p-value: < 2.2e-16

Koyck.RelHumidity.NoIntercept model is significant at 5% significance level.

Koyck Model selection

Koyck DLM models for all 4 regressors without intercept are significant, as are the with-intercept models for Temperature, Radiation and Relative Humidity. Eliminating the insignificant models, we compare the remaining Koyck DLM models based on adjusted R-squared, AIC, BIC and MASE:

Model <- c("Koyck.Temperature", "Koyck.Temperature.NoIntercept", "Koyck.Rainfall.NoIntercept", "Koyck.Radiation", "Koyck.Radiation.NoIntercept", "Koyck.RelHumidity", "Koyck.RelHumidity.NoIntercept")
AIC <- c(AIC(Koyck.Temperature), AIC(Koyck.Temperature.NoIntercept), AIC(Koyck.Rainfall.NoIntercept), AIC(Koyck.Radiation), AIC(Koyck.Radiation.NoIntercept), AIC(Koyck.RelHumidity), AIC(Koyck.RelHumidity.NoIntercept))
BIC <- c(BIC(Koyck.Temperature), BIC(Koyck.Temperature.NoIntercept), BIC(Koyck.Rainfall.NoIntercept), BIC(Koyck.Radiation), BIC(Koyck.Radiation.NoIntercept), BIC(Koyck.RelHumidity), BIC(Koyck.RelHumidity.NoIntercept))
Adjusted_Rsquared <- c(0.08891, 0.997, 0.3803, -0.06498, 0.997, 0.1682, 0.9973)
MASE <- MASE(Koyck.Temperature, Koyck.Temperature.NoIntercept, Koyck.Rainfall.NoIntercept, Koyck.Radiation, Koyck.Radiation.NoIntercept, Koyck.RelHumidity, Koyck.RelHumidity.NoIntercept)
data.frame(AIC, BIC, Adjusted_Rsquared, MASE) %>% arrange(MASE)
##                                      AIC        BIC Adjusted_Rsquared  n
## Koyck.RelHumidity.NoIntercept -106.83021 -102.62662           0.99730 30
## Koyck.Temperature.NoIntercept -103.74657  -99.54297           0.99700 30
## Koyck.Radiation.NoIntercept   -103.60090  -99.39730           0.99700 30
## Koyck.Temperature              -98.54907  -92.94428           0.08891 30
## Koyck.RelHumidity             -101.28049  -95.67571           0.16820 30
## Koyck.Radiation                -93.86690  -88.26211          -0.06498 30
## Koyck.Rainfall.NoIntercept      56.53470   60.73829           0.38030 30
##                                     MASE
## Koyck.RelHumidity.NoIntercept  0.8702601
## Koyck.Temperature.NoIntercept  0.9126045
## Koyck.Radiation.NoIntercept    0.9169493
## Koyck.Temperature              0.9535116
## Koyck.RelHumidity              0.9559618
## Koyck.Radiation                1.0314227
## Koyck.Rainfall.NoIntercept    14.2098845

Thus, as per AIC, BIC, MASE (the measure most relevant for forecasting) and adjusted R-squared, the Koyck DLM for RBO with Relative Humidity as the regressor and no intercept (Koyck.RelHumidity.NoIntercept) is the best.

Diagnostic check for Koyck DLM (Residual analysis)
checkresiduals(Koyck.RelHumidity.NoIntercept$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 6.3336, df = 6, p-value = 0.3869
## 
## Model df: 0.   Total lags used: 6

Serial autocorrelation left in the residuals is insignificant as per the Ljung-Box test (p-value = 0.3869) and the ACF plot. From the time series plot and histogram of residuals, there is a clear random pattern and approximate normality in the residual distribution. Thus, there is no violation of the general assumptions.

Conclusion of Koyck DLM model
  • The model with Relative Humidity as the regressor and no intercept is the best of all 4 regressors.
  • The model is significant.
  • MASE is 0.8702601.
  • Adjusted R-squared is 99.73%.
  • No violations in the tests of assumptions.
  • Serial autocorrelation is insignificant.
  • From the Weak instruments line, the first-stage model of the two-stage least squares fit is significant at the 5% level of significance.
  • Both \(\delta_2\) and \(\delta_3\) are significant at the 5% level, meaning RBO depends significantly on the previous year's RBO and on Relative Humidity.
  • From the Wu-Hausman test, we do not reject the null hypothesis that the correlation between the explanatory variable (\(Y_{t-1}\)) and the error term is zero (there is no endogeneity) at the 5% level.

Fit Autoregressive Distributed Lag Model

The autoregressive distributed lag (ARDL) model is a flexible and parsimonious infinite DLM. In its simplest ARDL(1,1) form, the model is represented as,

\(Y_t = \mu + \beta_0 X_t + \beta_1 X_{t-1} + \gamma_1 Y_{t-1} + e_t\)

Similar to the Koyck DLM, this model can be written as an infinite DLM, but with a lag distribution of any shape rather than a polynomial or geometric one. The general model, with p lags of the regressor and q lags of the response, is denoted ARDL(p,q). To fit it we use the ardlDlm() function. Let's find the best lag lengths via the AIC and BIC scores through an iteration over p and q, with the maximum lag length set to 14, for each regressor individually. (For intuition, the ARDL(1,1) form above can also be written directly with dynlm(), as sketched below.)
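
As a minimal sketch for intuition (not used in the search below), the ARDL(1,1) equation above can be written directly with dynlm(), assuming the ts objects RBO and Temperature share the same time index:

ARDL.1x1.sketch = dynlm(RBO ~ L(RBO, 1) + Temperature + L(Temperature, 1)) # library(dynlm); same form as ardlDlm(formula = RBO ~ Temperature, p = 1, q = 1)
summary(ARDL.1x1.sketch)
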

1. Temperature as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ Temperature, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##   p  q       AIC       BIC
## 1 2 13 -142.4481 -126.4214
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##   p  q       AIC       BIC
## 1 2 13 -142.4481 -126.4214

ARDL(2,13) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(2,13):

ARDL.Temperature.2x13 = ardlDlm(formula = RBO ~ Temperature, data = RBO_dataset, p = 2, q = 13)
summary(ARDL.Temperature.2x13)
## 
## Time series regression with "ts" data:
## Start = 14, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         14         15         16         17         18         19         20 
##  1.651e-03 -6.815e-05 -2.004e-03 -2.547e-04  3.351e-03 -6.355e-04 -2.342e-03 
##         21         22         23         24         25         26         27 
## -9.779e-04  1.796e-03  1.513e-03  3.106e-04  2.002e-03  2.291e-03 -1.650e-04 
##         28         29         30         31 
## -1.020e-03 -1.710e-03 -1.026e-03 -2.711e-03 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    8.280126   1.164162   7.113   0.0889 .
## Temperature.t -0.050777   0.009731  -5.218   0.1205  
## Temperature.1 -0.065776   0.014288  -4.603   0.1362  
## Temperature.2 -0.065288   0.013233  -4.934   0.1273  
## RBO.1         -0.916271   0.212320  -4.316   0.1450  
## RBO.2         -0.841775   0.246285  -3.418   0.1812  
## RBO.3         -0.650233   0.122350  -5.315   0.1184  
## RBO.4         -0.788172   0.130667  -6.032   0.1046  
## RBO.5         -0.961598   0.167927  -5.726   0.1101  
## RBO.6         -0.109752   0.142309  -0.771   0.5818  
## RBO.7         -0.244319   0.127712  -1.913   0.3066  
## RBO.8          0.508823   0.108925   4.671   0.1343  
## RBO.9          0.180396   0.154063   1.171   0.4500  
## RBO.10         0.192379   0.120400   1.598   0.3560  
## RBO.11        -0.716589   0.134857  -5.314   0.1184  
## RBO.12        -0.932688   0.162567  -5.737   0.1099  
## RBO.13        -0.202849   0.096217  -2.108   0.2820  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007222 on 1 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.8974 
## F-statistic: 10.29 on 16 and 1 DF,  p-value: 0.2407
checkresiduals(ARDL.Temperature.2x13$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 4.6877, df = 4, p-value = 0.3209
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Temperature.2x13)
##                             MASE
## ARDL.Temperature.2x13 0.05441407

The model is insignificant at the 5% significance level.

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.
# Also, models with AIC or BIC scores of inf or -inf are removed

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ -1 + Temperature, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##    p q       AIC       BIC
## 1 13 3 -138.1306 -122.1039
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##    p q       AIC       BIC
## 1 13 3 -138.1306 -122.1039

ARDL(13,3) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(13,3):

ARDL.Temperature.NoIntercept.13x3 = ardlDlm(formula = RBO ~ -1 + Temperature, data = RBO_dataset, p = 13, q = 3)
summary(ARDL.Temperature.NoIntercept.13x3)
## 
## Time series regression with "ts" data:
## Start = 14, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         14         15         16         17         18         19         20 
## -5.110e-04  4.615e-04  6.965e-05 -2.328e-03  2.081e-03 -2.395e-03  2.878e-03 
##         21         22         23         24         25         26         27 
## -2.025e-03  9.700e-04 -1.617e-03 -2.614e-03  1.435e-03  2.567e-03 -1.335e-03 
##         28         29         30         31 
## -2.176e-03  2.525e-03 -4.746e-04  2.406e-03 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## Temperature.t  -0.035375   0.023477  -1.507   0.3730  
## Temperature.1  -0.026328   0.011897  -2.213   0.2702  
## Temperature.2   0.049082   0.013655   3.594   0.1727  
## Temperature.3   0.025998   0.016135   1.611   0.3536  
## Temperature.4  -0.079590   0.015458  -5.149   0.1221  
## Temperature.5  -0.014149   0.006759  -2.093   0.2837  
## Temperature.6   0.062652   0.007525   8.326   0.0761 .
## Temperature.7   0.023190   0.007344   3.158   0.1953  
## Temperature.8   0.064374   0.009153   7.033   0.0899 .
## Temperature.9  -0.010914   0.007826  -1.395   0.3960  
## Temperature.10 -0.005591   0.006772  -0.826   0.5606  
## Temperature.11 -0.011420   0.008053  -1.418   0.3910  
## Temperature.12  0.008843   0.011495   0.769   0.5826  
## Temperature.13  0.053546   0.010262   5.218   0.1205  
## RBO.1          -0.869492   0.180138  -4.827   0.1301  
## RBO.2          -1.169731   0.345904  -3.382   0.1830  
## RBO.3           0.229279   0.179811   1.275   0.4234  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008142 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9999 
## F-statistic:  8129 on 17 and 1 DF,  p-value: 0.00872
checkresiduals(ARDL.Temperature.NoIntercept.13x3$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 5.751, df = 4, p-value = 0.2185
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Temperature.NoIntercept.13x3)
##                                         MASE
## ARDL.Temperature.NoIntercept.13x3 0.06502803

The model is significant at the 5% significance level.

2. Rainfall as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ Rainfall, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##    p q      AIC       BIC
## 1 12 4 -119.141 -101.1967
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##    p q      AIC       BIC
## 1 12 4 -119.141 -101.1967

ARDL(12,4) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(12,4):

ARDL.Rainfall.12x4 = ardlDlm(formula = RBO ~ Rainfall, data = RBO_dataset, p = 12, q = 4)
summary(ARDL.Rainfall.12x4)
## 
## Time series regression with "ts" data:
## Start = 13, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         13         14         15         16         17         18         19 
##  0.0011337 -0.0002443 -0.0008624  0.0010833 -0.0026434  0.0034920 -0.0033775 
##         20         21         22         23         24         25         26 
##  0.0028406 -0.0035397  0.0049358 -0.0020010  0.0042188 -0.0022551  0.0046327 
##         27         28         29         30         31 
## -0.0080942  0.0058096 -0.0059044  0.0037880 -0.0030127 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.760273   0.337034   5.223    0.120
## Rainfall.t   0.002116   0.016777   0.126    0.920
## Rainfall.1   0.033695   0.019234   1.752    0.330
## Rainfall.2   0.001718   0.017245   0.100    0.937
## Rainfall.3   0.013224   0.015823   0.836    0.557
## Rainfall.4  -0.008302   0.018220  -0.456    0.728
## Rainfall.5   0.015187   0.015325   0.991    0.503
## Rainfall.6  -0.009069   0.015134  -0.599    0.656
## Rainfall.7   0.005314   0.016608   0.320    0.803
## Rainfall.8  -0.007361   0.014119  -0.521    0.694
## Rainfall.9  -0.023528   0.018842  -1.249    0.430
## Rainfall.10 -0.023671   0.019226  -1.231    0.434
## Rainfall.11 -0.030104   0.013630  -2.209    0.271
## Rainfall.12  0.002512   0.018530   0.136    0.914
## RBO.1       -0.376929   0.237815  -1.585    0.358
## RBO.2       -0.490094   0.194215  -2.523    0.240
## RBO.3       -0.133310   0.183349  -0.727    0.600
## RBO.4       -0.366547   0.183579  -1.997    0.296
## 
## Residual standard error: 0.01687 on 1 degrees of freedom
## Multiple R-squared:  0.9738, Adjusted R-squared:  0.5289 
## F-statistic: 2.189 on 17 and 1 DF,  p-value: 0.4918
checkresiduals(ARDL.Rainfall.12x4$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 56.3, df = 4, p-value = 1.734e-11
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Rainfall.12x4)
##                         MASE
## ARDL.Rainfall.12x4 0.1265811

The model is insignificant at the 5% significance level.

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.
# Also, models with AIC or BIC scores of inf or -inf are removed

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ -1 + Rainfall, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##   p  q       AIC       BIC
## 1 7 11 -145.2684 -125.3537
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##   p  q       AIC       BIC
## 1 7 11 -145.2684 -125.3537

ARDL(7,11) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(7,11):

ARDL.Rainfall.NoIntercept.7x11 = ardlDlm(formula = RBO ~ -1 + Rainfall, data = RBO_dataset, p = 7, q = 11)
summary(ARDL.Rainfall.NoIntercept.7x11)
## 
## Time series regression with "ts" data:
## Start = 12, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         12         13         14         15         16         17         18 
## -0.0024899 -0.0029150 -0.0007196  0.0026196  0.0012108 -0.0008134 -0.0014173 
##         19         20         21         22         23         24         25 
##  0.0014326 -0.0011076 -0.0015172  0.0007071 -0.0021900 -0.0047420  0.0001211 
##         26         27         28         29         30         31 
##  0.0017904  0.0041946  0.0002401  0.0006746 -0.0002648  0.0054837 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## Rainfall.t  0.148398   0.021730   6.829   0.0926 .
## Rainfall.1 -0.078600   0.010894  -7.215   0.0877 .
## Rainfall.2  0.008794   0.010039   0.876   0.5420  
## Rainfall.3 -0.010323   0.010732  -0.962   0.5124  
## Rainfall.4  0.010759   0.013679   0.787   0.5757  
## Rainfall.5  0.037905   0.017317   2.189   0.2728  
## Rainfall.6  0.034012   0.018606   1.828   0.3187  
## Rainfall.7 -0.099192   0.013885  -7.144   0.0885 .
## RBO.1      -0.642074   0.150792  -4.258   0.1469  
## RBO.2       1.131750   0.126398   8.954   0.0708 .
## RBO.3       0.994445   0.127653   7.790   0.0813 .
## RBO.4      -0.153450   0.114991  -1.334   0.4094  
## RBO.5      -1.616582   0.236708  -6.829   0.0926 .
## RBO.6      -0.785741   0.234559  -3.350   0.1847  
## RBO.7       2.198336   0.225746   9.738   0.0651 .
## RBO.8       1.307629   0.244736   5.343   0.1178  
## RBO.9      -2.210155   0.274147  -8.062   0.0786 .
## RBO.10     -1.068641   0.170467  -6.269   0.1007  
## RBO.11      1.672555   0.201048   8.319   0.0762 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01054 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9998 
## F-statistic:  4818 on 19 and 1 DF,  p-value: 0.01134
checkresiduals(ARDL.Rainfall.NoIntercept.7x11$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 3.053, df = 4, p-value = 0.549
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Rainfall.NoIntercept.7x11)
##                                      MASE
## ARDL.Rainfall.NoIntercept.7x11 0.06169275

The model is significant at the 5% significance level.

3. Radiation as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ Radiation, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##    p q       AIC       BIC
## 1 12 4 -231.3149 -213.3706
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##    p q       AIC       BIC
## 1 12 4 -231.3149 -213.3706

ARDL(12,4) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(12,4):

ARDL.Radiation.12x4 = ardlDlm(formula = RBO ~ Radiation, data = RBO_dataset, p = 12, q = 4)
summary(ARDL.Radiation.12x4)
## 
## Time series regression with "ts" data:
## Start = 13, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         13         14         15         16         17         18         19 
## -1.652e-04  2.200e-04 -3.309e-04  6.602e-05 -4.037e-06  2.264e-04 -4.739e-05 
##         20         21         22         23         24         25         26 
##  6.652e-06  7.452e-05 -1.547e-04 -1.306e-04  3.505e-04 -1.289e-04 -5.799e-05 
##         27         28         29         30         31 
## -1.374e-04  3.952e-05  2.134e-04 -3.925e-04  3.528e-04 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1.8745074  0.0369600  50.717  0.01255 * 
## Radiation.t   0.0271007  0.0019713  13.748  0.04623 * 
## Radiation.1  -0.1327683  0.0027999 -47.419  0.01342 * 
## Radiation.2   0.0483279  0.0016258  29.725  0.02141 * 
## Radiation.3   0.0912607  0.0021988  41.504  0.01534 * 
## Radiation.4  -0.0449738  0.0009985 -45.041  0.01413 * 
## Radiation.5   0.0251085  0.0012836  19.561  0.03252 * 
## Radiation.6  -0.0603475  0.0016012 -37.688  0.01689 * 
## Radiation.7   0.0259175  0.0009088  28.519  0.02231 * 
## Radiation.8   0.0511526  0.0009442  54.178  0.01175 * 
## Radiation.9  -0.0268781  0.0014025 -19.164  0.03319 * 
## Radiation.10  0.1004034  0.0017366  57.815  0.01101 * 
## Radiation.11 -0.0293446  0.0012726 -23.058  0.02759 * 
## Radiation.12 -0.0514641  0.0016960 -30.344  0.02097 * 
## RBO.1        -0.8229929  0.0129258 -63.670  0.01000 **
## RBO.2        -0.4822605  0.0112129 -43.009  0.01480 * 
## RBO.3         0.0213830  0.0105782   2.021  0.29246   
## RBO.4        -0.8245905  0.0105139 -78.429  0.00812 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0008814 on 1 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9987 
## F-statistic: 823.6 on 17 and 1 DF,  p-value: 0.02739
checkresiduals(ARDL.Radiation.12x4$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 6.9142, df = 4, p-value = 0.1405
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Radiation.12x4)
##                            MASE
## ARDL.Radiation.12x4 0.006142636

The model is significant at the 5% significance level.

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.
# Also, models with AIC or BIC scores of inf or -inf are removed

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ -1 + Radiation, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##    p q       AIC       BIC
## 1 10 9 -248.9684 -227.0334
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##    p q       AIC       BIC
## 1 10 9 -248.9684 -227.0334

ARDL(10,9) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(10,9):

ARDL.Radiation.NoIntercept.10x9 = ardlDlm(formula = RBO ~ -1 + Radiation, data = RBO_dataset, p = 10, q = 9)
summary(ARDL.Radiation.NoIntercept.10x9)
## 
## Time series regression with "ts" data:
## Start = 11, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         11         12         13         14         15         16         17 
##  3.022e-04 -1.510e-04 -3.900e-04  2.135e-04 -1.426e-05 -4.160e-05  3.021e-04 
##         18         19         20         21         22         23         24 
## -2.217e-04  1.056e-04 -4.959e-04  2.594e-04  2.859e-04 -3.515e-04 -1.580e-04 
##         25         26         27         28         29         30         31 
##  9.482e-05  1.629e-04  6.342e-05 -9.028e-05 -3.198e-05 -1.598e-04  3.135e-04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## Radiation.t  -0.838255   0.014335  -58.48  0.01089 * 
## Radiation.1   1.134823   0.018196   62.37  0.01021 * 
## Radiation.2  -0.211696   0.003179  -66.59  0.00956 **
## Radiation.3  -0.049423   0.001208  -40.93  0.01555 * 
## Radiation.4  -0.062767   0.001524  -41.18  0.01546 * 
## Radiation.5  -0.395602   0.007091  -55.79  0.01141 * 
## Radiation.6   0.268634   0.003397   79.09  0.00805 **
## Radiation.7   0.322715   0.005626   57.37  0.01110 * 
## Radiation.8   0.092381   0.002269   40.71  0.01564 * 
## Radiation.9   0.479705   0.008787   54.59  0.01166 * 
## Radiation.10 -0.603782   0.009693  -62.29  0.01022 * 
## RBO.1         0.534926   0.022160   24.14  0.02636 * 
## RBO.2        -2.674589   0.046700  -57.27  0.01111 * 
## RBO.3        -2.662534   0.054448  -48.90  0.01302 * 
## RBO.4         4.640431   0.084013   55.23  0.01152 * 
## RBO.5        -2.619076   0.039175  -66.86  0.00952 **
## RBO.6        -3.901487   0.077186  -50.55  0.01259 * 
## RBO.7        -1.194310   0.046938  -25.44  0.02501 * 
## RBO.8         8.037549   0.135232   59.44  0.01071 * 
## RBO.9        -2.000497   0.026872  -74.45  0.00855 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001087 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.54e+05 on 20 and 1 DF,  p-value: 0.001169
checkresiduals(ARDL.Radiation.NoIntercept.10x9$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 4.2211, df = 4, p-value = 0.3769
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.Radiation.NoIntercept.10x9)
##                                        MASE
## ARDL.Radiation.NoIntercept.10x9 0.007057272

The model is significant at the 5% significance level.

4. Relative Humidity as regressor

With intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ RelHumidity, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##   p q       AIC       BIC
## 1 8 9 -127.5342 -105.7134
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##   p q       AIC       BIC
## 1 8 9 -127.5342 -105.7134

ARDL(8,9) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(8,9):

ARDL.RelHumidity.8x9 = ardlDlm(formula = RBO ~ RelHumidity, data = RBO_dataset, p = 8, q = 9)
summary(ARDL.RelHumidity.8x9)
## 
## Time series regression with "ts" data:
## Start = 10, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         10         11         12         13         14         15         16 
##  0.0027811 -0.0006716  0.0047485 -0.0059467 -0.0055389  0.0080587 -0.0051992 
##         17         18         19         20         21         22         23 
##  0.0077077 -0.0087389 -0.0009153  0.0091855 -0.0040882  0.0020028 -0.0065956 
##         24         25         26         27         28         29         30 
##  0.0083209 -0.0046487  0.0008040 -0.0027462  0.0024608  0.0048495 -0.0001455 
##         31 
## -0.0056849 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -19.305065   5.095171  -3.789   0.0322 *
## RelHumidity.t   0.042887   0.010073   4.258   0.0238 *
## RelHumidity.1   0.004376   0.008366   0.523   0.6371  
## RelHumidity.2   0.020412   0.007946   2.569   0.0826 .
## RelHumidity.3   0.021539   0.013008   1.656   0.1963  
## RelHumidity.4   0.002933   0.007477   0.392   0.7211  
## RelHumidity.5   0.042051   0.014006   3.002   0.0576 .
## RelHumidity.6   0.031692   0.011039   2.871   0.0640 .
## RelHumidity.7   0.001288   0.008325   0.155   0.8869  
## RelHumidity.8   0.037337   0.010751   3.473   0.0403 *
## RBO.1          -0.357498   0.203829  -1.754   0.1777  
## RBO.2           0.390037   0.197816   1.972   0.1432  
## RBO.3           0.345983   0.191987   1.802   0.1693  
## RBO.4           0.169381   0.214533   0.790   0.4874  
## RBO.5           0.074593   0.200957   0.371   0.7352  
## RBO.6          -0.524327   0.204635  -2.562   0.0831 .
## RBO.7           0.617870   0.187686   3.292   0.0460 *
## RBO.8           0.591129   0.195981   3.016   0.0569 .
## RBO.9          -0.377267   0.125934  -2.996   0.0579 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01455 on 3 degrees of freedom
## Multiple R-squared:  0.9631, Adjusted R-squared:  0.7415 
## F-statistic: 4.346 on 18 and 3 DF,  p-value: 0.1258
checkresiduals(ARDL.RelHumidity.8x9$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 8.5646, df = 4, p-value = 0.07295
## 
## Model df: 0.   Total lags used: 4
MASE(ARDL.RelHumidity.8x9)
##                           MASE
## ARDL.RelHumidity.8x9 0.1629283

The model is insignificant at the 5% significance level.

Without intercept :

## Code gist to find the best ARDL(p,q) model as per AIC and BIC scores.
# First create an empty df. Iterate over 196 ARDL (since max lag for response and predictor of ARDL model is 14, i.e, p = q = 14 at max).
# Save the model's AIC and BIC scores through iteration and display the model with best AIC and BIC scores.
# Also, models with AIC or BIC scores of inf or -inf are removed

df = data.frame(matrix(
  vector(), 0, 4, dimnames=list(c(), c("p","q","AIC","BIC"))),
  stringsAsFactors=F) # create empty dataframe
for(i in 1:14){
  for(j in 1:14){
    model4.1 = ardlDlm(formula = RBO ~ -1 + RelHumidity, data = RBO_dataset, p = i, q = j)
    new <- data.frame(i, j, AIC(model4.1$model), BIC(model4.1$model))
    df[nrow(df) + 1, ] <- new
  }
} # Iterate and save in df
head(df[order( df[,3] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per AIC
##    p q       AIC       BIC
## 1 14 1 -129.0694 -114.9047
head(df[order( df[,4] ),] %>% filter(AIC != -Inf & BIC != -Inf), 1) # Best model as per BIC
##    p q       AIC       BIC
## 1 14 1 -129.0694 -114.9047

ARDL(14,1) is the best model as per both the AIC and BIC scores. Let's fit this model:

ARDL(14,1):

ARDL.RelHumidity.NoIntercept.14x1 = ardlDlm(formula = RBO ~ -1 + RelHumidity, data = RBO_dataset, p = 14, q = 1)
summary(ARDL.RelHumidity.NoIntercept.14x1)
## 
## Time series regression with "ts" data:
## Start = 15, End = 31
## 
## Call:
## dynlm(formula = as.formula(model.text), data = data)
## 
## Residuals:
##         15         16         17         18         19         20         21 
##  1.527e-03  1.128e-03  1.390e-03  7.877e-04 -6.114e-04 -3.444e-03 -1.053e-03 
##         22         23         24         25         26         27         28 
##  5.786e-04  1.717e-04  1.398e-04  2.189e-04 -2.425e-03 -1.376e-03  4.758e-04 
##         29         30         31 
## -3.027e-03 -2.618e-05  5.554e-03 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## RelHumidity.t   0.001946   0.006161   0.316    0.805
## RelHumidity.1  -0.026871   0.006460  -4.160    0.150
## RelHumidity.2   0.001078   0.007152   0.151    0.905
## RelHumidity.3  -0.021950   0.005260  -4.173    0.150
## RelHumidity.4   0.001545   0.003755   0.411    0.752
## RelHumidity.5   0.016063   0.003871   4.149    0.151
## RelHumidity.6  -0.016269   0.005186  -3.137    0.196
## RelHumidity.7   0.018368   0.005052   3.636    0.171
## RelHumidity.8   0.001517   0.006066   0.250    0.844
## RelHumidity.9  -0.010047   0.003659  -2.746    0.222
## RelHumidity.10  0.001367   0.003752   0.364    0.777
## RelHumidity.11 -0.005870   0.004394  -1.336    0.409
## RelHumidity.12  0.030070   0.005146   5.843    0.108
## RelHumidity.13  0.009266   0.006981   1.327    0.411
## RelHumidity.14  0.006070   0.004594   1.321    0.412
## RBO.1           0.187642   0.222523   0.843    0.554
## 
## Residual standard error: 0.008242 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9999 
## F-statistic:  7985 on 16 and 1 DF,  p-value: 0.00879
checkresiduals(ARDL.RelHumidity.NoIntercept.14x1$model, test = "LB")

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 1.7383, df = 3, p-value = 0.6284
## 
## Model df: 0.   Total lags used: 3
MASE(ARDL.RelHumidity.NoIntercept.14x1)
##                                         MASE
## ARDL.RelHumidity.NoIntercept.14x1 0.05143758

The model is significant at the 5% significance level.

ARDL Model selection

The no-intercept ARDL models for all 4 regressors are significant, as is the with-intercept Radiation model. Eliminating the insignificant models, we compare the remaining ARDL models based on adjusted R-squared, AIC, BIC and MASE:

Model <- c("ARDL.Temperature.NoIntercept.13x3", "ARDL.Rainfall.NoIntercept.7x11", "ARDL.Radiation.12x4", "ARDL.Radiation.NoIntercept.10x9", "ARDL.RelHumidity.NoIntercept.14x1")
AIC <- c(AIC(ARDL.Temperature.NoIntercept.13x3), AIC(ARDL.Rainfall.NoIntercept.7x11), AIC(ARDL.Radiation.12x4), AIC(ARDL.Radiation.NoIntercept.10x9), AIC(ARDL.RelHumidity.NoIntercept.14x1))
BIC <- c( BIC(ARDL.Temperature.NoIntercept.13x3), BIC(ARDL.Rainfall.NoIntercept.7x11), BIC(ARDL.Radiation.12x4), BIC(ARDL.Radiation.NoIntercept.10x9), BIC(ARDL.RelHumidity.NoIntercept.14x1))
Adjusted_Rsquared <- c(0.9999, 0.9998, 0.9987, 1, 0.9999)
MASE <- MASE(ARDL.Temperature.NoIntercept.13x3, ARDL.Rainfall.NoIntercept.7x11, ARDL.Radiation.12x4, ARDL.Radiation.NoIntercept.10x9, ARDL.RelHumidity.NoIntercept.14x1)
data.frame(AIC, BIC, Adjusted_Rsquared, MASE) %>% arrange(MASE)
##                                         AIC       BIC Adjusted_Rsquared  n
## ARDL.Radiation.12x4               -231.3149 -213.3706            0.9987 19
## ARDL.Radiation.NoIntercept.10x9   -248.9684 -227.0334            1.0000 21
## ARDL.RelHumidity.NoIntercept.14x1 -129.0694 -114.9047            0.9999 17
## ARDL.Rainfall.NoIntercept.7x11    -145.2684 -125.3537            0.9998 20
## ARDL.Temperature.NoIntercept.13x3 -138.1306 -122.1039            0.9999 18
##                                          MASE
## ARDL.Radiation.12x4               0.006142636
## ARDL.Radiation.NoIntercept.10x9   0.007057272
## ARDL.RelHumidity.NoIntercept.14x1 0.051437583
## ARDL.Rainfall.NoIntercept.7x11    0.061692749
## ARDL.Temperature.NoIntercept.13x3 0.065028027

Thus, as per MASE (the measure we prioritise for forecasting accuracy), the ARDL(12,4) model for RBO with Radiation as the regressor and an intercept (ARDL.Radiation.12x4) is the best. Note that ARDL.Radiation.NoIntercept.10x9 has slightly lower AIC and BIC, but ARDL.Radiation.12x4 wins on MASE.

Diagnostic check for ARDL (Residual analysis)
checkresiduals(ARDL.Radiation.12x4$model$residuals)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 6.9142, df = 4, p-value = 0.1405
## 
## Model df: 0.   Total lags used: 4

Serial autocorrelation left in the residuals is insignificant as per the Ljung-Box test (p-value = 0.1405) and the ACF plot. From the time series plot and histogram of residuals, there is a random pattern and approximate normality in the residual distribution. Thus, there is no violation of the general assumptions.

Conclusion of ARDL DLM model
  • The model with Radiation as the regressor is the best of all 4 regressors.
  • ARDL.Radiation.12x4 model is significant
  • MASE is 0.006142636
  • Adjusted R-squared is 99.87%
  • No violations in the test of assumptions
  • Serial autocorrelation is insignificant

Most appropriate DLM model based on MASE (DLM Model Selection)

The 4 DLM models are,

  • Finite DLM model: DLM.RelHumidity.noIntercept
  • Polynomial DLM model: No significant model
  • Koyck transformed geometric DLM model: Koyck.RelHumidity.NoIntercept
  • Autoregressive DLM model: ARDL.Radiation.12x4

The mean absolute scaled errors (MASE) of these models are,

MASE(DLM.RelHumidity.noIntercept, Koyck.RelHumidity.NoIntercept, ARDL.Radiation.12x4) %>% arrange(MASE)
##                                n        MASE
## ARDL.Radiation.12x4           19 0.006142636
## DLM.RelHumidity.noIntercept   17 0.072315768
## Koyck.RelHumidity.NoIntercept 30 0.870260066

Conclusion of Distributed Lag models (DLM) modelling

The best DLM model for the RBO response, i.e., the one giving the most accurate forecasts based on the MASE measure, is the autoregressive DLM with Radiation as the regressor, ARDL.Radiation.12x4, with a MASE of 0.006142636.
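
As a sketch of how the chosen model would produce ahead forecasts with dLagM's forecast() function, assuming future Radiation values are available (the placeholder below simply repeats the last observed value, an assumption for illustration, not real data):

x.future <- rep(tail(as.vector(Radiation), 1), 4) # hypothetical future Radiation values
dLagM::forecast(model = ARDL.Radiation.12x4, x = x.future, h = 4) # 4-step-ahead RBO forecasts
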

B. Dynamic linear models (dynlm package)

Dynamic linear models are a general class of time series regression models which can account for trend, seasonality, serial correlation between the response and regressor variables and, most importantly, the effect of intervention points.

The response of a general Dynamic linear model is,

\(Y_t = \omega_2Y_{t-1} + (\omega_0 + \omega_1)P_t - \omega_2\omega_0P_{t-1} + N_t\)

where,

  • \(Y_t\) is the response
  • \(\omega_2\) is the coefficient of the response lagged by 1 time unit
  • \(P_t\) is the pulse effect at the intervention point, with the \((\omega_0 + \omega_1)\) coefficient representing the instantaneous effect of the intervention
  • \(P_{t-1}\) is the lagged pulse effect, with coefficient \(\omega_2\omega_0\)
  • \(N_t\) represents the component without intervention and is referred to as the natural or unperturbed process.

Let's revisit the time series plot of the response, RBO, to visualise possible intervention points.

plot(RBO, ylab='RBO', xlab='Year')

As mentioned at the descriptive analysis stage, the year 1996 might be an intervention point because the mean level of the RBO series falls notably from this point onwards. Assuming this intervention point, let's fit a Dynamic Linear model and check whether the pulse function at year 1996 is significant.

As always, we first have a look at the ACF and PACF plots of the RBO series.

acf(RBO, main="ACF of RBO")

pacf(RBO, main ="PACF of RBO")

In the ACF plot we see a slowly decaying pattern, indicating trend in the RBO series, and in the PACF plot a single high vertical spike, also consistent with trend. No significant seasonal behaviour is observed. Thus, let's fit a dynamic linear model with a trend component and no seasonal component. For thoroughness, we test all combinations of the trend, multiple lags of RBO and, most importantly, the pulse at 1996.

Now, let's fit Dynamic Linear models using dynlm() as shown below (note that the potential intervention point was identified at year 1996), both with and without the intercept, and compare them.

With intercept :

Y.t = RBO
T.int = 13 # The time point when the intervention occurred (1996 is the 13th observation); T.int avoids masking R's built-in T (TRUE)
P.t = 1*(seq(RBO) == T.int) # pulse indicator: 1 at the intervention point, 0 elsewhere
P.t.1 = Lag(P.t,+1) #library(tis) 

Dyn.model = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model1 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model2 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model3 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Model <- c("Dyn.model", "Dyn.model1", "Dyn.model2", "Dyn.model3")
AIC <- c(AIC(Dyn.model), AIC(Dyn.model1), AIC(Dyn.model2), AIC(Dyn.model3))
BIC <- c( BIC(Dyn.model), BIC(Dyn.model1), BIC(Dyn.model2), BIC(Dyn.model3))
data.frame(Model, AIC, BIC) %>% arrange(BIC)
##        Model       AIC       BIC
## 1  Dyn.model -114.7932 -107.7873
## 2 Dyn.model2 -114.9847 -105.6593
## 3 Dyn.model3 -114.9847 -105.6593
## 4 Dyn.model1 -112.2284 -104.0246
summary(Dyn.model)
## 
## Time series regression with "ts" data:
## Start = 1985, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + P.t + trend(Y.t))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067777 -0.019047  0.000599  0.013851  0.074160 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5027754  0.1273859   3.947 0.000537 ***
## L(Y.t, k = 1)  0.3665457  0.1612803   2.273 0.031545 *  
## P.t           -0.0872460  0.0331257  -2.634 0.014032 *  
## trend(Y.t)    -0.0020229  0.0008264  -2.448 0.021442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03248 on 26 degrees of freedom
## Multiple R-squared:  0.5382, Adjusted R-squared:  0.485 
## F-statistic:  10.1 on 3 and 26 DF,  p-value: 0.0001372

As per BIC, the best Dynamic Linear model with intercept for RBO is \(Dyn.model\), whose regressors are an instantaneous 1996 pulse effect, a 1-year lagged RBO response, and a trend component of RBO.

From the summary statistics, \(Dyn.model\) is significant at the 5% significance level and all 3 regressors are significant. Most importantly, the pulse at year 1996 is significant at the 5% level.

Without intercept :

Y.t = RBO
T = c(13) # The time point when the intervention occurred 
P.t = 1*(seq(RBO) == T) # pulse dummy: 1 at the intervention point, 0 elsewhere
P.t.1 = Lag(P.t,+1) #library(tis) 

Dyn.model.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model1.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + trend(Y.t))  # library(dynlm)

Dyn.model2.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model3.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Model <- c("Dyn.model.NoIntercept", "Dyn.model1.NoIntercept", "Dyn.model2.NoIntercept", "Dyn.model3.NoIntercept")
AIC <- c(AIC(Dyn.model.NoIntercept), AIC(Dyn.model1.NoIntercept), AIC(Dyn.model2.NoIntercept), AIC(Dyn.model3.NoIntercept))
BIC <- c( BIC(Dyn.model.NoIntercept), BIC(Dyn.model1.NoIntercept), BIC(Dyn.model2.NoIntercept), BIC(Dyn.model3.NoIntercept))
data.frame(Model, AIC, BIC) %>% arrange(BIC)
##                    Model       AIC        BIC
## 1 Dyn.model2.NoIntercept -114.1008 -106.10753
## 2 Dyn.model3.NoIntercept -114.1008 -106.10753
## 3 Dyn.model1.NoIntercept -107.1523 -100.31579
## 4  Dyn.model.NoIntercept -102.7092  -97.10439
summary(Dyn.model2.NoIntercept)
## 
## Time series regression with "ts" data:
## Start = 1987, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ 0 + L(Y.t, k = 1) + L(Y.t, k = 2) + L(Y.t, 
##     k = 3) + P.t + trend(Y.t))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.050359 -0.020437  0.009498  0.017263  0.049244 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## L(Y.t, k = 1)  0.3690799  0.1470329   2.510  0.01955 * 
## L(Y.t, k = 2)  0.3642948  0.1430617   2.546  0.01804 * 
## L(Y.t, k = 3)  0.2538073  0.1480625   1.714  0.09994 . 
## P.t           -0.0869779  0.0290544  -2.994  0.00649 **
## trend(Y.t)     0.0004006  0.0006111   0.656  0.51861   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02809 on 23 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9985 
## F-statistic:  3826 on 5 and 23 DF,  p-value: < 2.2e-16

As per BIC, the best Dynamic Linear model without intercept for RBO is \(Dyn.model2.NoIntercept\), whose regressors are an instantaneous 1996 pulse effect, 3 lags of the RBO response, and a trend component of RBO.

From the summary statistics, \(Dyn.model2.NoIntercept\) is significant at the 5% significance level, and 3 of its 5 regressors are significant (the third lag and the trend are not significant at the 5% level). Most importantly, the pulse at year 1996 is significant at the 5% level.

Dynamic Linear Model selection

The best Dynamic Linear models with and without intercept were Dyn.model and Dyn.model2.NoIntercept respectively. Eliminating the insignificant models, let's compare these two Dynamic Linear models based on AIC, BIC and adjusted R-squared:

Model <- c("Dyn.model", "Dyn.model1.NoIntercept")
AIC <- c(AIC(Dyn.model), AIC(Dyn.model2.NoIntercept))
BIC <- c( BIC(Dyn.model), BIC(Dyn.model2.NoIntercept))
Adjusted_Rsquared <- c(0.485, 0.9985)
data.frame(Model,AIC, BIC, Adjusted_Rsquared) %>% arrange(AIC)
##                    Model       AIC       BIC Adjusted_Rsquared
## 1              Dyn.model -114.7932 -107.7873            0.4850
## 2 Dyn.model1.NoIntercept -114.1008 -106.1075            0.9985

Thus, as per AIC and BIC, the Dynamic Linear model for RBO with intercept (Dyn.model) is the best.

Dyn.model is the best Dynamic Linear model as per AIC and BIC, with a 1-year lagged response (RBO), a significant pulse component at year 1996, and a trend component of the RBO series. Let's look at the summary statistics and check the residuals.

summary(Dyn.model)
## 
## Time series regression with "ts" data:
## Start = 1985, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + P.t + trend(Y.t))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067777 -0.019047  0.000599  0.013851  0.074160 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5027754  0.1273859   3.947 0.000537 ***
## L(Y.t, k = 1)  0.3665457  0.1612803   2.273 0.031545 *  
## P.t           -0.0872460  0.0331257  -2.634 0.014032 *  
## trend(Y.t)    -0.0020229  0.0008264  -2.448 0.021442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03248 on 26 degrees of freedom
## Multiple R-squared:  0.5382, Adjusted R-squared:  0.485 
## F-statistic:  10.1 on 3 and 26 DF,  p-value: 0.0001372
checkresiduals(Dyn.model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 7
## 
## data:  Residuals
## LM test = 6.8908, df = 7, p-value = 0.4403

Summary of the Dynamic Linear model, Dyn.model

  • The model is significant at the 5% significance level
  • Adjusted R-squared is 48.5%
  • No violations in the tests of assumptions
  • Serial autocorrelations are insignificant

Conclusion of Dynamic Linear model

The Dynamic Linear model, Dyn.model, is significant, and its pulse component (P.t) at year 1996 is significant.

Overall Most Appropriate Regression model (Model Selection)

Based on the time series regression methods considered, the best model from each family is summarized below,

  • A. The best Distributed Lag model is the Autoregressive DLM with Radiation as regressor, ARDL.Radiation.12x4, with a MASE of 0.006142636, AIC of -231.3149, BIC of -213.3706 and Adjusted R-squared of 99.87%.

  • B. The best Dynamic Linear model is Dyn.model, with a 1-year lagged response (RBO), a significant pulse component at year 1996, and a trend component, with AIC of -114.7932, BIC of -107.7873 and Adjusted R-squared of 48.5%.

Clearly, the best model is ARDL.Radiation.12x4 as per AIC, BIC and Adjusted R-squared measures.
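For a compact side-by-side view, the measures reported above can be collected into a small data frame (values copied from the two summaries):

data.frame(Model = c("ARDL.Radiation.12x4", "Dyn.model"),
           AIC = c(-231.3149, -114.7932),
           BIC = c(-213.3706, -107.7873),
           Adjusted_Rsquared = c(0.9987, 0.4850))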

Best Time Series regression model for Forecasting

The best Time Series regression model is the Autoregressive DLM with Radiation as regressor (ARDL.Radiation.12x4).

Detailed Graphical and statistical tests of assumptions for \(ARDL.Radiation.12x4\) model (Residual Analysis)

Residual analysis to test model assumptions.

Let's perform a detailed residual analysis to check whether any model assumptions have been violated.

The estimation error (or residual) is defined as:

\(\hat{\epsilon}_i = Y_i - \hat{Y}_i\) (i.e., observed value minus fitted value)

The following problems are to be checked,

  1. linearity in distribution of error terms
  2. The mean value of residuals is zero
  3. Serial autocorrelation
  4. Normality of distribution of error terms

Let's first apply a diagnostic check using the checkresiduals() function:

checkresiduals(ARDL.Radiation.12x4)
## Time Series:
## Start = 13 
## End = 31 
## Frequency = 1 
##            13            14            15            16            17 
## -1.652337e-04  2.199540e-04 -3.308864e-04  6.602329e-05 -4.037381e-06 
##            18            19            20            21            22 
##  2.263848e-04 -4.739273e-05  6.651827e-06  7.451776e-05 -1.546653e-04 
##            23            24            25            26            27 
## -1.306467e-04  3.504837e-04 -1.288934e-04 -5.798920e-05 -1.374479e-04 
##            28            29            30            31 
##  3.952405e-05  2.133918e-04 -3.925050e-04  3.527663e-04

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 6.9142, df = 4, p-value = 0.1405
## 
## Model df: 0.   Total lags used: 4

  1. From the residuals plot, the residuals are randomly scattered around their mean, so linearity in the distribution of the error terms is not violated.

  2. To test whether the mean of the residuals is zero, let's calculate it:

mean(ARDL.Radiation.12x4$model$residuals)
## [1] 1.426442e-20

As the mean of the residuals is effectively 0, the zero-mean residuals assumption is not violated.

  3. The checkresiduals() output also displays the Ljung-Box test. Its hypotheses are:

\(H_0\) : the series of residuals exhibits no serial autocorrelation of any order up to p
\(H_a\) : the series of residuals exhibits serial autocorrelation of some order up to p

From the Ljung-Box test output, since p (0.1405) > 0.05, we do not reject the null hypothesis of no serial autocorrelation.

Thus, according to this test and the ACF plot, we can conclude that the serial correlation left in the residuals is insignificant.
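The same test can also be run directly on the residual vector with Box.test() from the stats package (lag = 4 to match the checkresiduals() output above):

Box.test(ARDL.Radiation.12x4$model$residuals, lag = 4, type = "Ljung-Box")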

  4. From the histogram shown by checkresiduals(), the residuals appear to follow a normal distribution. Let's test this statistically:

\(H_0\) : the residuals are normally distributed
\(H_a\) : the residuals are not normally distributed

shapiro.test(ARDL.Radiation.12x4$model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  ARDL.Radiation.12x4$model$residuals
## W = 0.96413, p-value = 0.656

From the Shapiro-Wilk test, since the p-value (0.656) > 0.05, we do not reject the null hypothesis that the data are normally distributed. Thus, the residuals of the ARDL.Radiation.12x4 model can be considered normally distributed.

Summarizing residual analysis on \(ARDL.Radiation.12x4\) model:

Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of the errors is zero (zero-mean residuals): Not violated
Assumption 3: The error terms are independently distributed, i.e., they are not autocorrelated: Not violated
Assumption 4: The errors are normally distributed: Not violated

With no violations of the residual assumptions, the Autoregressive DLM with Radiation as regressor (ARDL.Radiation.12x4) is suitable for accurate forecasting. Let's forecast the next 3 years.

Forecasting

By the MASE measure, the ARDL model \(ARDL.Radiation.12x4\) is the best fitted model for forecasting RBO. Let's estimate and plot the 3 years ahead (2015-2017) forecasts for the RBO series.

Observed and fitted values are plotted below. This plot indicates a good agreement between the model and the original series. (Note: with p = 12 lags of the regressor in the model, fitted values are not available for the first 12 years, so the fitted series starts in 1996.)

plot(RBO, ylab='RBO', xlab = 'Year', type="l", col="black", main="Observed and fitted values using ARDL.Radiation.12x4 model on RBO")
lines(ts(ARDL.Radiation.12x4$model$fitted.values, start = c(1996)), col="red")
legend("topleft",lty=1,
       col=c("black", "red"), 
       c("RBO series", "ARDL.Radiation.12x4 fit"))

Using the supplied future covariate values, we can forecast the RBO response.

Future_Covariates_RBO <- read.csv("C:/Users/admin/Downloads/Covariate x-values for Task 3.csv")
head(Future_Covariates_RBO)
##   Year Temperature Rainfall Radiation RelHumidity
## 1 2015       20.74     2.27     14.60       94.45
## 2 2016       20.49     2.38     14.56       94.03
## 3 2017       20.52     2.26     14.79       95.04
## 4 2018       20.56     2.27     14.79       95.06

Our ARDL.Radiation.12x4 model uses only one covariate, Radiation. The 3 years ahead point forecasts of RBO using the Radiation covariate are:

ARDL.Radiation.12x4 = ardlDlm(formula = RBO ~ Radiation, data = RBO_dataset, p = 12, q = 4)
x.new = Future_Covariates_RBO$Radiation # future Radiation values
forecasts.ardldlm = dLagM::forecast(model = ARDL.Radiation.12x4, x = x.new, h = 3)$forecasts

Forecast using overall best fitting model:

The point forecasts and the forecast plot using the overall best fitting model, ARDL.Radiation.12x4, are given below,

df <- data.frame(
  ARDL_forecasts = c(forecasts.ardldlm)
) 
row.names(df) <- c("2015", "2016", "2017")
df
##      ARDL_forecasts
## 2015      0.6710744
## 2016      0.8016781
## 2017      0.7241307
RBO.extended4 = c(RBO, forecasts.ardldlm)

{
plot(ts(RBO.extended4, start = c(1984)), type="l", col = "red",
ylab = "RBO", xlab = "Year", 
main="3 years ahead forecasts for RBO series
      using ARDL.Radiation.12x4 model")          
lines(RBO,col="black",type="l")
legend("topleft",lty=1,
       col=c("black", "red"), 
       c("RBO series", "ARDL(12,4) forecasts"))
}

The forecasts from the other model families are handled as follows. (Note: no significant Polynomial DLM was found, and since the best Finite DLM and Koyck models have no intercept, their forecasts are not printed.) The only remaining distributed lag model, the ARDL, is already plotted above, so let's move on to the Dynamic Linear model.

For Dynamic Linear model:

The 3 years ahead point forecasts are computed recursively and plotted below,

Dyn.model = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

q = 3 # forecast horizon in years
n = nrow(Dyn.model$model) # number of observations used in the fit (30)
RBO.frc = array(NA , (n + q)) # holder for the fitted-sample values plus forecasts
RBO.frc[1:n] = Y.t[2:length(Y.t)] # the n response values used in the fit
trend.start = Dyn.model$model[n, "trend(Y.t)"] # last in-sample trend value
trend = seq(trend.start, trend.start + q, 1) # extend the trend q steps ahead

for (i in 1:q){
  # regressors for step i: intercept, most recent (observed or forecast) RBO,
  # pulse value at the end of the sample (0, since 1996 has passed), extended trend
  data.new = c(1, RBO.frc[n-1+i], P.t[n], trend[i])
  RBO.frc[n+i] = as.vector(Dyn.model$coefficients) %*% data.new
}

par(mfrow=c(1,1))

plot(Y.t,xlim=c(1984,2017),ylab='RBO',xlab='Year',main = "Time series plot of RBO series with 3 years ahead forecasts (in red)")
lines(ts(RBO.frc[(n+1):(n+q)],start=c(2015)),col="red")

Conclusion

The best fitting model for the RBO series in terms of MASE, which assesses forecast accuracy, is the Autoregressive DLM ARDL(12,4) with Radiation as regressor, \(ARDL.Radiation.12x4\). The 3 years ahead point forecasts reported by forecast() from the dLagM package are 0.6710744, 0.8016781, and 0.7241307 respectively (confidence intervals are not reported).

Future Directions

Potentially better forecasting methods can be explored, compared and diagnosed for better fit.

Task 3 Part (b): Intervention Analysis for Rank-Based Flowering Order Similarity Metric: Accounting for the Millennium Drought

Objective

To accommodate the effect of the Millennium Drought, which occurred during the 1996-2009 period, in the analysis of the Rank-based flowering Order similarity metric (RBO) based on the 4 climatic regressor variables, and to obtain the 3 years ahead forecasts.

Intervention Analysis

We expect the Millennium Drought of 1996-2009 to have created an intervention point that changed the mean level or trend of the RBO series. Let's revisit the time series plot of the response, RBO, to look for possible intervention points at 1996, at 2009, or between these years.

plot(RBO, ylab = 'RBO', xlab = 'Year')

From the time series plot above, year 1996 might be an intervention point because the mean level of the RBO series falls notably from this point onwards. Assuming this intervention point, let's fit a Dynamic Linear model and see whether the pulse function at year 1996 is significant.

To analyze the effect of this potential intervention point, a Dynamic Linear regression model can be used. Dynamic linear models are a general class of time series regression models that can account for trend, seasonality, serial correlation between the response and regressor variables, and, most importantly, the effect of intervention points (an alternative step-dummy sketch is given after the model form below).

The response of a general Dynamic linear model is,

\(Y_t = \omega_2Y_{t-1} + (\omega_0 + \omega_1)P_t - \omega_2\omega_0P_{t-1} + N_t\)

where,

  • \(Y_t\) is the response
  • \(\omega_2\) is the coefficient of the 1-time-unit lagged response
  • \(P_t\) is the pulse at the intervention point, with coefficient \((\omega_0 + \omega_1)\) representing the instantaneous effect of the intervention
  • \(P_{t-1}\) is the 1-time-unit lagged pulse, with coefficient \(-\omega_2\omega_0\)
  • \(N_t\) is the component of the process in the absence of intervention, referred to as the natural or unperturbed process
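Because the drought spans 1996-2009 rather than a single year, an alternative specification (a sketch only; it is not fitted in this report) is a step-type dummy covering the whole drought window. With the annual series starting in 1984, years 1996 and 2009 correspond to observations 13 and 26:

S.t = 1*(seq_along(RBO) >= 13 & seq_along(RBO) <= 26) # 1 during 1996-2009, 0 otherwise
Dyn.model.step = dynlm(RBO ~ L(RBO, k = 1) + S.t + trend(RBO)) # hypothetical variant, not fitted here

Here we proceed with a single pulse at 1996, which matches the visible drop in the series.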

As always, let's first have a look at the ACF and PACF plots of the RBO series.

acf(RBO, main="ACF of RBO")

pacf(RBO, main ="PACF of RBO")

In the ACF plot we see a slowly decaying pattern, indicating a trend in the RBO series; the single high spike in the PACF plot supports this. No significant seasonal behavior is observed. Thus, let's fit a Dynamic Linear model with a trend component and no seasonal component. For thoroughness, let's test all reasonable combinations of the trend, multiple lags of RBO, and, most importantly, the pulse at 1996.

Now, let's fit Dynamic Linear models using dynlm() as shown below (note: the potential intervention point was identified at year 1996, i.e., the 13th data point). Let's fit models with and without the intercept and compare.

With intercept :

Y.t = RBO
T = c(13) # The time point when the intervention occurred 
P.t = 1*(seq(RBO) == T) # pulse dummy: 1 at the intervention point, 0 elsewhere
P.t.1 = Lag(P.t,+1) #library(tis) 

Dyn.model = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model1 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model2 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model3 = dynlm(Y.t ~ L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Model <- c("Dyn.model", "Dyn.model1", "Dyn.model2", "Dyn.model3")
AIC <- c(AIC(Dyn.model), AIC(Dyn.model1), AIC(Dyn.model2), AIC(Dyn.model3))
BIC <- c( BIC(Dyn.model), BIC(Dyn.model1), BIC(Dyn.model2), BIC(Dyn.model3))
data.frame(Model, AIC, BIC) %>% arrange(BIC)
##        Model       AIC       BIC
## 1  Dyn.model -114.7932 -107.7873
## 2 Dyn.model2 -114.9847 -105.6593
## 3 Dyn.model3 -114.9847 -105.6593
## 4 Dyn.model1 -112.2284 -104.0246
summary(Dyn.model)
## 
## Time series regression with "ts" data:
## Start = 1985, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + P.t + trend(Y.t))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067777 -0.019047  0.000599  0.013851  0.074160 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5027754  0.1273859   3.947 0.000537 ***
## L(Y.t, k = 1)  0.3665457  0.1612803   2.273 0.031545 *  
## P.t           -0.0872460  0.0331257  -2.634 0.014032 *  
## trend(Y.t)    -0.0020229  0.0008264  -2.448 0.021442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03248 on 26 degrees of freedom
## Multiple R-squared:  0.5382, Adjusted R-squared:  0.485 
## F-statistic:  10.1 on 3 and 26 DF,  p-value: 0.0001372

As per BIC, the best Dynamic Linear model with intercept for RBO is \(Dyn.model\), whose regressors are an instantaneous 1996 pulse effect, a 1-year lagged RBO response, and a trend component of RBO.

From the summary statistics, \(Dyn.model\) is significant at the 5% significance level and all 3 regressors are significant. Most importantly, the pulse at year 1996 is significant at the 5% level.

Without intercept :

Y.t = RBO
T = c(13) # The time point when the intervention occurred 
P.t = 1*(seq(RBO) == T) # pulse dummy: 1 at the intervention point, 0 elsewhere
P.t.1 = Lag(P.t,+1) #library(tis) 

Dyn.model.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model1.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + P.t + trend(Y.t))  # library(dynlm)

Dyn.model2.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + trend(Y.t)) # library(dynlm)

Dyn.model3.NoIntercept = dynlm(Y.t ~ 0 + L(Y.t , k = 1) + L(Y.t , k = 2) + L(Y.t , k = 3) + P.t + P.t.1 + trend(Y.t)) # library(dynlm)

Model <- c("Dyn.model.NoIntercept", "Dyn.model1.NoIntercept", "Dyn.model2.NoIntercept", "Dyn.model3.NoIntercept")
AIC <- c(AIC(Dyn.model.NoIntercept), AIC(Dyn.model1.NoIntercept), AIC(Dyn.model2.NoIntercept), AIC(Dyn.model3.NoIntercept))
BIC <- c( BIC(Dyn.model.NoIntercept), BIC(Dyn.model1.NoIntercept), BIC(Dyn.model2.NoIntercept), BIC(Dyn.model3.NoIntercept))
data.frame(Model, AIC, BIC) %>% arrange(BIC)
##                    Model       AIC        BIC
## 1 Dyn.model2.NoIntercept -114.1008 -106.10753
## 2 Dyn.model3.NoIntercept -114.1008 -106.10753
## 3 Dyn.model1.NoIntercept -107.1523 -100.31579
## 4  Dyn.model.NoIntercept -102.7092  -97.10439
summary(Dyn.model2.NoIntercept)
## 
## Time series regression with "ts" data:
## Start = 1987, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ 0 + L(Y.t, k = 1) + L(Y.t, k = 2) + L(Y.t, 
##     k = 3) + P.t + trend(Y.t))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.050359 -0.020437  0.009498  0.017263  0.049244 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## L(Y.t, k = 1)  0.3690799  0.1470329   2.510  0.01955 * 
## L(Y.t, k = 2)  0.3642948  0.1430617   2.546  0.01804 * 
## L(Y.t, k = 3)  0.2538073  0.1480625   1.714  0.09994 . 
## P.t           -0.0869779  0.0290544  -2.994  0.00649 **
## trend(Y.t)     0.0004006  0.0006111   0.656  0.51861   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02809 on 23 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9985 
## F-statistic:  3826 on 5 and 23 DF,  p-value: < 2.2e-16

As per BIC, the best Dynamic Linear model without intercept for RBO is \(Dyn.model2.NoIntercept\), whose regressors are an instantaneous 1996 pulse effect, 3 lags of the RBO response, and a trend component of RBO.

From the summary statistics, \(Dyn.model2.NoIntercept\) is significant at the 5% significance level, and 3 of its 5 regressors are significant (the third lag and the trend are not significant at the 5% level). Most importantly, the pulse at year 1996 is significant at the 5% level.

Model selection

The best Dynamic Linear models with and without intercept were Dyn.model and Dyn.model2.NoIntercept respectively. Eliminating the insignificant models, let's compare these two Dynamic Linear models based on AIC, BIC and adjusted R-squared:

Model <- c("Dyn.model", "Dyn.model1.NoIntercept")
AIC <- c(AIC(Dyn.model), AIC(Dyn.model2.NoIntercept))
BIC <- c( BIC(Dyn.model), BIC(Dyn.model2.NoIntercept))
Adjusted_Rsquared <- c(0.485, 0.9985)
data.frame(Model,AIC, BIC, Adjusted_Rsquared) %>% arrange(AIC)
##                    Model       AIC       BIC Adjusted_Rsquared
## 1              Dyn.model -114.7932 -107.7873            0.4850
## 2 Dyn.model1.NoIntercept -114.1008 -106.1075            0.9985

Thus, as per AIC and BIC, the Dynamic Linear model for RBO with intercept (Dyn.model) is the best.

Dyn.model is the best Dynamic Linear model as per AIC and BIC, with a 1-year lagged response (RBO), a significant pulse component at year 1996, and a trend component of the RBO series. Let's look at the summary statistics and check the residuals.

summary(Dyn.model)
## 
## Time series regression with "ts" data:
## Start = 1985, End = 2014
## 
## Call:
## dynlm(formula = Y.t ~ L(Y.t, k = 1) + P.t + trend(Y.t))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.067777 -0.019047  0.000599  0.013851  0.074160 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.5027754  0.1273859   3.947 0.000537 ***
## L(Y.t, k = 1)  0.3665457  0.1612803   2.273 0.031545 *  
## P.t           -0.0872460  0.0331257  -2.634 0.014032 *  
## trend(Y.t)    -0.0020229  0.0008264  -2.448 0.021442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03248 on 26 degrees of freedom
## Multiple R-squared:  0.5382, Adjusted R-squared:  0.485 
## F-statistic:  10.1 on 3 and 26 DF,  p-value: 0.0001372
checkresiduals(Dyn.model)

## 
##  Breusch-Godfrey test for serial correlation of order up to 7
## 
## data:  Residuals
## LM test = 6.8908, df = 7, p-value = 0.4403

Summary of the Dynamic Linear model, Dyn.model

  • The model is significant at the 5% significance level
  • Adjusted R-squared is 48.5%
  • No violations in the tests of assumptions
  • Serial autocorrelations are insignificant

Conclusion of Dynamic Linear model

The Dynamic Linear model, Dyn.model, is significant, and its pulse component (P.t) at year 1996 is significant.

Observed and fitted values are plotted below. This plot indicates a decent agreement between the model and the original series.

plot(RBO,ylab='RBO', xlab = 'Year', type="l", col="red")
lines(Dyn.model$fitted.values)

Forecasting

Now, let's find the 3 years ahead point forecasts for the RBO series using Dyn.model.

Dyn.model = dynlm(Y.t ~ L(Y.t , k = 1) + P.t + trend(Y.t)) # library(dynlm)

q = 3 # forecast horizon in years
n = nrow(Dyn.model$model) # number of observations used in the fit (30)
RBO.frc = array(NA , (n + q)) # holder for the fitted-sample values plus forecasts
RBO.frc[1:n] = Y.t[2:length(Y.t)] # the n response values used in the fit
trend.start = Dyn.model$model[n, "trend(Y.t)"] # last in-sample trend value
trend = seq(trend.start, trend.start + q, 1) # extend the trend q steps ahead

for (i in 1:q){
  # regressors for step i: intercept, most recent (observed or forecast) RBO,
  # pulse value at the end of the sample (0, since 1996 has passed), extended trend
  data.new = c(1, RBO.frc[n-1+i], P.t[n], trend[i])
  RBO.frc[n+i] = as.vector(Dyn.model$coefficients) %*% data.new
}

par(mfrow=c(1,1))

plot(Y.t,xlim=c(1984,2017),ylab='RBO',xlab='Year',main = "Time series plot of RBO series with 3 years ahead forecasts (in red)")
lines(ts(RBO.frc[(n+1):(n+q)],start=c(2015)),col="red")
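The three point forecasts produced by the recursion above can be printed with:

round(RBO.frc[(n + 1):(n + q)], 4) # 3 years ahead point forecasts (2015-2017)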

Future Directions

Data could be collected at a monthly level, which would allow more precise forecasting.