Introduction

This analysis is meant to outline my progress toward an ARDL bounds test for cointegration between number of inpatient diagnoses related to chronic respiratory illness Southern California, and the pollution climate over the period 2003-2011. Inututively, the generic hypothesis that many of these pollution series share long term relationships with chronic respiratory conditions should be significant, but my data includes nearly a census of inpatient diagnoses cases. This gives the test much more value if the hypothesis is true, and being able to test causal relationship along with providing a model will ultimately help researchers determine the degree at which the current pollution climate is affecting those with respiratory conditions or sensitive people.

Some Methodology

To be more concise, I wish to test whether the series of diagnoses of COPD and lung cancer (most common chronic respiratory issues) is cointegrated, or shares a long term relationship with common atmospheric pollutant levels. Using an autoregressive distributed lag model, if cointegration exists, I can estimate the long and short term relationships using methods very similar to OLS regression. ARDL models have a generic form:

\(y_{t} = \beta_0 + \sum_{i=1}^{p} \beta_i y_{t-i} + \sum_{j=0}^{q} \alpha_j x_{t-j} + \epsilon_t\)

The general form of the model above begs for OLS regression to estimate coefficients however, take note of the dependent variable (\(y_t\)). There are lagged terms of the dependent variable in the model, meaning the approach to estimation becomes a much more involved task. Autoregressive distributed lag models are an iteration of distributed lag models that include lagged terms (or lagged differenced terms as I’ll explain soon), of the dependent variable. Another related model which comes into play when achieving the final model is the error correction model or ECM. The model has the general form:

\(\Delta y_t = \beta_0 + \sum_{i=1}^{p} \beta_i \Delta y_{t-i} + \sum_{j=0}^{q} \alpha_j \Delta x_{t-j} + \phi z_{t-1}+ \epsilon_t\)

for \(z_{t-1}\) being the estimate for the long run relationship of \(y\) and \(x\),

\(z_{t-1} = y_{t-1} + \delta + \theta x_{t-1}\)

Or the residuals of the OLS regression of \(z_t\),

\(y_t = \alpha_0 + \alpha_1 x_t + \nu_t\)

The ECM looks at the first differences (I(1)) of \(y\) and \(x\) at different lags to assess the short run dynamics of both series, but also includes the term \(z_{t-1}\), which attempts to account for the long run dynamics between the series in the case where they are cointegrated. Now, ECM’s assume all series are I(1) or first-difference stationary. For example, the two-step estimation approach by Engle and Granger allows for the estimation of the coefficients via OLS provided all series are I(1). This assumption is where ARDL models come in handy. Pesaran and Shin (1999) and Pesaran et al. (2001) offer solutions for mixed series, of stationary or difference stationary series, and the variables can have different lag lengths. This offers a ton of flexibilty when assessing many series at once.

My Model

My dependent variable will be the series illustrating the prevelance of chronic respiratory illnesses commonly assumed to be tied to atmospheric conditions. My independent series will be the various pollutants over the same time period (2003-2011). For my initial run, I want to consider five variables, ozone level, PM 2.5, carbon monoxide (CO), nitrogen dioxide (NO\(_2\)) and sulfur dioxide (SO_\(_2\)) concentrations. The lag lengths will be determined through stepwise model selection, using AIC (or BIC, I need to do more research…) but in my case, the most general model will be:

\[ \begin{aligned} \Delta y_t = \beta_0 +& \sum_{i=1}^{p} \beta_{i} \Delta y_{t-i} + \sum_{j=0}^{q_1} \alpha^{\text{co}}_j \Delta x^{\text{co}}_{t-j} + \sum_{j=0}^{q_2} \alpha^{\text{oz}}_j \Delta x^{\text{oz}}_{t-j} + \sum_{j=0}^{q_3} \alpha^{\text{pm}}_j \Delta x^{\text{pm}}_{t-j} + \sum_{j=0}^{q_4} \alpha^{\text{so}}_j \Delta x^{\text{so}}_{t-j} + \sum_{j=0}^{q_5} \alpha^{\text{no}}_j \Delta x^{\text{no}}_{t-j} +\dots\\[5pt] &(\theta_0 y_{t-1} + \theta_1 x^{\text{co}}_{t-1} + \theta_2 x^{\text{oz}}_{t-1} + \theta_3 x^{\text{pm}}_{t-1} + \theta_4 x^{\text{so}}_{t-1} + \theta_5 x^{\text{no}}_{t-1}) + \epsilon_t \end{aligned} \]

Now that I have an ideal model to use, I have to “map” out the process to achieve a model that offers unbiased estimates of the long and short run dynamics between the series I am testing. This modeling process is mildly involved so I will build in steps.

Getting Emperical!

Here we go - time to model. First order of business is checking model assumptions. Recall, mixed series are allowed but they must be I(0) or I(1). In general, this is fairly common among series, but I need to make sure none of my series are I(2).

First, before anything, I want to take a look at my series and get an initial look at their behaviour:

library(forecast)
library(urca)
library(dynamac)
library(dLagM)
library(ggplot2)
library(reshape2)
library(lubridate)
library(latex2exp)
library(aTSA)
library(gridExtra)
library(dplyr)
library(vars)
library(plotrix)
title_center <- theme(plot.title = element_text(size= 10, hjust = 0.5, vjust = .1),
                      axis.title.x = element_text(size=8),
                      axis.title.y = element_text(size=8))


# Import hcup_resp_rs data
hcup_resp_rs <- read.csv('/Users/seankrinik/Documents/CSULB/Thesis/Python/HCUP/hcup_agg_resp.csv')
hcup_resp_lc <- read.csv('/Users/seankrinik/Documents/CSULB/Thesis/Python/HCUP/hcup_agg_lc.csv')
epa <- read.csv('/Users/seankrinik/Documents/CSULB/Thesis/Python/EnvData.csv')
# Fix EPA
hcup_resp_rs$DATE <- as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d')
epa$date.local <- as.POSIXct(epa$date.local, format='%Y-%m-%d')

gg1 <- ggplot(hcup_resp_rs, aes(x=as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d'), y=DXTotal)) +
  geom_line(aes(colour=factor(COUNTY))) +
  guides(colour=guide_legend(title="County")) + 
  labs(title = 'Primary/Secondary Diagnosis Time Series by County',
       x='Date', y='Primary/Secondary Diagnosis Freq.') + title_center

gg2 <- ggplot(epa, aes(x=as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d'), y=Ozone)) +
  geom_line(aes(colour=factor(county.name))) +
  guides(colour=guide_legend(title="County")) + 
  labs(title = 'Ozone Levels by County',
       x='Date', y='Ozone') + title_center

gg3 <- ggplot(epa, aes(x=as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d'), y=PM2.5...Local.Conditions)) +
  geom_line(aes(colour=factor(county.name))) +
  guides(colour=guide_legend(title="County")) + 
  labs(title = TeX('PM$_{2.5}$ Concentration by County'),
       x='Date', y=TeX('PM$_{2.5}$ Concentration')) + title_center

gg4 <- ggplot(epa, aes(x=as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d'), y=Carbon.monoxide)) +
  geom_line(aes(colour=factor(county.name))) +
  guides(colour=guide_legend(title="County")) + 
  labs(title = 'Carbon Monoxide Concentration by County',
       x='Date', y='CO Concentration') + title_center

gg5 <- ggplot(epa, aes(x=as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d'), y=Sulfur.dioxide)) +
  geom_line(aes(colour=factor(county.name))) +
  guides(colour=guide_legend(title="County")) + 
  labs(title = 'Sulfur Dioxide Concentration by County',
       x='Date', y=TeX('SO$_2$ Concentration')) + title_center

gg6 <- ggplot(epa, aes(x=as.POSIXct(hcup_resp_rs$DATE, format='%Y-%m-%d'), y=Nitrogen.dioxide..NO2.)) +
  geom_line(aes(colour=factor(county.name))) +
  guides(colour=guide_legend(title="County")) + 
  labs(title = 'Nitrogen Dioxide Concentration by County',
       x='Date', y=TeX('NO$_2$ Concentration')) + title_center

grid.arrange(gg1,gg2,gg3,gg4,gg5,gg6,nrow=3,ncol=2)

Most of the series seem to have long trends either increasing or decreasing. For instance, there is a long decreasing trend in CO, NO\(_2\) and PM\(_2.5\) that can be discerned without much analysis at all. These trends give hope for I(1) series. If all are I(1) or less, ARDL modeling is ideal.

Step 1: Checking Integration Orders

The most popular methods to test integration order are by the augemented Dickey-Fuller (ADF) test, and the KPSS tests. First, let’s take a brief look at the ADF test.

Augemented Dickey-Fuller (ADF):

The ADF test allows for the testing of the presence of a unit root in any order process. In general, consider some process:

\(\Delta y_t = \alpha + \delta y_{t-1} + \epsilon_t\)

The ADF test assumes the unit root is present, i.e. \(H_0 : \delta = 0\) vs. \(H_\text{A}: \delta < 0\). Now this is a base case, more commonly referred to as the DF test. An augmented DF test allows for testing of higher orders all in one:

\(\Delta y_t = \alpha + \beta t + \delta y_{t-1} + \sum_{i=1}^{p}\gamma_{i} \Delta y_{t-i} + \epsilon_t\)

The hypotheses are the same. The coefficients on the various lagged differences can be tested via t-tests that correspond to the Dickey-Fuller distribution, so the appropriate order can be determined through the ADF test.

To begin, I will run the ADF test on the dependent variable to see the integration order of the diagnoses over time. I have to now investigate each county separately, therefore I will do my analysis on Los Angeles county, and then apply my methodology to the other series once I’m done.

The initial ADF test for the series of respiratory inpatient diagnoses:

s <- summary(ur.df(la.cnty$DXTotal,type = 'none', lags = 4, selectlags = 'BIC'))
data.frame(test.stat = s@teststat[1] , critical.vals = s@cval[1,])

##        test.stat critical.vals
## 1pct  -0.7027337         -2.58
## 5pct  -0.7027337         -1.95
## 10pct -0.7027337         -1.62

The test comes back as a failure to reject, meaning the unit root is present. Note, if I perform the test with trend (below), the null hypothesis is rejected, and can be considered stationary. However, cointegration will only be present within a combination of integrated series, so I will consider the series non-stationary instead of trend-stationary.

s <- summary(ur.df(la.cnty$DXTotal,type = 'trend', lags = 4, selectlags = 'BIC'))
data.frame(test.stat = s@teststat[1] , critical.vals = s@cval[1,])

##       test.stat critical.vals
## 1pct  -6.859718         -3.99
## 5pct  -6.859718         -3.43
## 10pct -6.859718         -3.13

If I reexamine the series as I(1), assuming no drift or trend, first differencing yeilds:

s <- summary(ur.df(diff(la.cnty$DXTotal), type = 'none', lags = 4, selectlags = 'BIC'))
data.frame(test.stat = s@teststat[1] , critical.vals = s@cval[1,])

##       test.stat critical.vals
## 1pct  -7.353127         -2.58
## 5pct  -7.353127         -1.95
## 10pct -7.353127         -1.62

First differencing gives a stronger rejection than the I(0) with drift result, and note, the dependent variable should be I(1) for ARDL to be applicable. OLS regression would suffice if the dependent variable was in fact I(0). Now to confirm that I can assume the diagnonsis series is I(1), I perform a KPSS test, where the null hypothesis assumes stationarity.

summary(ur.kpss(diff(la.cnty$DXTotal), type = 'mu', use.lag = 1))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.0284 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS test is not rejected, meaning the series is I(1) stationary. This can be confirmed also through the ACF plot of the differenced series:

grid.arrange(ggAcf(diff(la.cnty$DXTotal), lag.max = 60) + ylim(c(-1,1)),
ggPacf(diff(la.cnty$DXTotal),lag.max = 60) + ylim(c(-1,1)))

Notice however, the seasonal trend in the series. I can handle this by seasonally differencing the series at the cost of interpretability, or note that this annual trend is present in the other series as well to some degree. Every series in my analysis has a seasonal component so if all end up being integrated the same order, I will be comparing apples to apples.

I will first examine the first differenced series. I will investigate other transformations to improve interpretability later on, but for a proof of concept and introductory analyis, I will look at all I(1) series to identify any cointegration. The ADF and KPSS testing for all I(1) variables is below:

cnames <- colnames(la.cnty)[c(2:7)]
tble.adf <- data.frame(matrix(rep(0,length(cnames)*3),ncol = 3))
colnames(tble.adf) <- c('Test Stat.', 'Critical Val.', 'P-Val.')
rownames(tble.adf) <- cnames
# loop
for(i in 1:length(cnames)) {
  # print(cnames[i])
  sadf <- summary(ur.df(diff(la.cnty[,cnames[i]]), type = 'none', selectlags = 'BIC'))
  tble.adf[i,] <- c(sadf@teststat[1], sadf@cval[2], lmp(sadf@testreg))
  
  print(paste('KPSS Test Results for ',cnames[i]))
  print(summary(ur.kpss(diff(la.cnty[,cnames[i]]), type = 'mu', use.lag = 1)))
}

## [1] "KPSS Test Results for  DXTotal"
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.0284 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## [1] "KPSS Test Results for  Carbon.monoxide"
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.1161 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## [1] "KPSS Test Results for  Ozone"
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.055 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## [1] "KPSS Test Results for  PM2.5...Local.Conditions"
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.0266 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## [1] "KPSS Test Results for  Sulfur.dioxide"
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.0242 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## [1] "KPSS Test Results for  Nitrogen.dioxide..NO2."
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 1 lags. 
## 
## Value of test-statistic is: 0.0621 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

print('ADF Test results with F-test p-values')

## [1] "ADF Test results with F-test p-values"

print(tble.adf)

##                          Test Stat. Critical Val.       P-Val.
## DXTotal                   -7.394346         -1.95 3.870119e-19
## Carbon.monoxide           -4.543550         -1.95 3.237003e-09
## Ozone                     -5.638849         -1.95 2.110584e-07
## PM2.5...Local.Conditions -11.989376         -1.95 3.760656e-31
## Sulfur.dioxide            -8.229819         -1.95 1.128428e-19
## Nitrogen.dioxide..NO2.    -6.953976         -1.95 1.176936e-22

All variables are I(1), so I can proceed to formulating an intial unrestriced ECM or ARDL model.

Step 2: Formulation of an Initial Model

The initial model I examine will be the broadest case in the form of an unrestricted error correction model (ECM) with no trend (case 3).

ARDL: An Unrestricted Error Correction Model

The formulation of an initial ARDL-ECM model will allow me to check model adequacy and make changes as needed to ultimately perform the bounds test for cointegration. Recall from the ADF and KPSS testing above for Los Angeles, all variables are I(1). All variables including the dependent variable being I(1) means cointegration may exist, i.e. a long-run relationship between the dependent variable (diagnoses) and the pollutants.

Recall, my inital model statement for my ARDL-ECM is:

The above equation can be intimidating at first but it is actually very straight forward. The \(\beta\)’s above are labeled with superscripts to denote the variable the coefficient is associated with. The short run dynamics will be associated with the differenced terms and their corresponding coefficients. The long-run dynamics will be estimated using the lagged variables in the parentheses. The error correction term can be formulated using ratios of the coefficients on these terms. Recall, the above is an unrestricted ECM, so the final model iteration will include formulating a restricted ECM to achieve a more concise estimate on the long-run relationship if cointegration exists.

Using the inital model above as a template, I want to look at a benchmark first model which will include first lags of each exogenous variable, differences of each exogenous variable, first lag of the dependent variable and first difference (given since I will be testing cointegration).

par(mar=c(5,4,4,2))
# next iteration.
cnames <- colnames(la.cnty)[c(2:7)]
fx <- as.formula(paste('DXTotal ~', paste(cnames[!cnames %in% "DXTotal"], collapse = " + ")))
# ardl model
ardl1 <- dynardl(fx,data = la.cnty,
               lags=list('DXTotal'=1,
                      'Carbon.monoxide'=1,
                      'Ozone' = 1,
                      'PM2.5...Local.Conditions'=1,
                      'Sulfur.dioxide' = 1,
                      'Nitrogen.dioxide..NO2.' = 1),
               diffs = cnames[2:length(cnames)],
               lagdiffs=list('DXTotal'= 1,
                             'Carbon.monoxide'=1,
                             'Ozone'=1,
                             'PM2.5...Local.Conditions'=1,
                             'Sulfur.dioxide' = 1,
                             'Nitrogen.dioxide..NO2.' = 1),
               ec=T, simulate = F)

## [1] "Error correction (EC) specified; dependent variable to be run in differences."

dynardl.auto.correlated(ardl1)

## 
##  ------------------------------------------------------ 
##  Breusch-Godfrey LM Test 
##  Test statistic: 7.516 
##  p-value: 0.006 
##  H_0: no autocorrelation up to AR 1 
##  
##  ------------------------------------------------------ 
##  Shapiro-Wilk Test for Normality 
##  Test statistic: 0.984 
##  p-value: 0.254 
##  H_0: residuals are distributed normal 
##  
##  ------------------------------------------------------ 
##  Log-likelihood: -754.412 
##  AIC: 1546.824 
##  BIC: 1597.429 
##  Note: AIC and BIC calculated with k = 18 on T = 106 observations. 
##  
##  ------------------------------------------------------ 
## Breusch-Godfrey test indicates we reject the null hypothesis of no autocorrelation at p < 0.01. 
##  Add lags to remove autocorrelation before running dynardl simulations.

summary(ardl1)

## 
## Call:
## lm(formula = as.formula(paste(paste(dvnamelist), "~", paste(colnames(IVs), 
##     collapse = "+"), collapse = " ")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -821.65 -193.81   -2.38  137.09  956.48 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.186e+03  5.426e+02   4.029 0.000119 ***
## l.1.DXTotal                   -5.785e-01  1.176e-01  -4.921 3.99e-06 ***
## ld.1.DXTotal                  -7.932e-02  1.043e-01  -0.760 0.449190    
## d.1.Carbon.monoxide           -3.329e+02  3.839e+02  -0.867 0.388237    
## d.1.Ozone                     -8.748e+03  9.088e+03  -0.963 0.338350    
## d.1.PM2.5...Local.Conditions  -4.818e+00  1.033e+01  -0.467 0.641976    
## d.1.Sulfur.dioxide            -2.018e+01  4.554e+01  -0.443 0.658705    
## d.1.Nitrogen.dioxide..NO2.     2.617e+01  1.208e+01   2.167 0.032963 *  
## l.1.Carbon.monoxide            1.777e+02  2.961e+02   0.600 0.549977    
## l.1.Ozone                     -1.397e+04  6.797e+03  -2.055 0.042802 *  
## l.1.PM2.5...Local.Conditions   7.218e+00  1.338e+01   0.539 0.591074    
## l.1.Sulfur.dioxide             5.373e+01  3.182e+01   1.689 0.094813 .  
## l.1.Nitrogen.dioxide..NO2.     9.878e+00  1.325e+01   0.746 0.457848    
## ld.1.Carbon.monoxide           3.301e+02  3.324e+02   0.993 0.323359    
## ld.1.Ozone                     9.650e+03  8.725e+03   1.106 0.271727    
## ld.1.PM2.5...Local.Conditions -1.553e+01  1.045e+01  -1.486 0.140814    
## ld.1.Sulfur.dioxide           -2.778e+01  4.621e+01  -0.601 0.549319    
## ld.1.Nitrogen.dioxide..NO2.   -1.605e+01  1.194e+01  -1.345 0.182043    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 327.4 on 88 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5568, Adjusted R-squared:  0.4712 
## F-statistic: 6.504 on 17 and 88 DF,  p-value: 1.196e-09

# residuals plot
resids <- data.frame(resid = ardl1$model$residuals)
gg1 <- autoplot(ts(ardl1$model$residuals)) 
gg2 <- ggplot(resids, aes(x=resid)) + 
  geom_histogram(aes(y=..density.., fill=..count.. ), binwidth = 100) + 
  stat_function(fun=dnorm,
                args=list(mean=mean(as.numeric(resids$resid)),
                          sd = sd(as.numeric(resids$resid))),
                colour='red',
                alpha=.5) + theme(legend.position = "none") 
gg3 <- ggAcf(ts(ardl1$model$residuals)) + ylim(c(-1,1))
gg4 <- ggPacf(ts(ardl1$model$residuals)) + ylim(c(-1,1))
grid.arrange(gg1,gg2,gg3,gg4, ncol=2)

# dynamic stability
f <- as.formula(paste(paste('d.DXTotal'), "~", paste(colnames(ardl1$model$model)[2:length(colnames(ardl1$model$model))], collapse = "+"), collapse = " "))
par(mfrow=c(2,1))
plot(efp(f, data=ardl1$model$model,type='Rec-CUSUM'))
plot(efp(f, data=ardl1$model$model,type='Rec-MOSUM'))

par(mfrow=c(1,1))
# bounds test
dynamac::pssbounds(ardl1)

## 
##  PESARAN, SHIN AND SMITH (2001) COINTEGRATION TEST 
## 
##  Observations: 106 
##  Number of Regressors (k): 5 
##  Case: 3 
## 
##  ------------------------------------------------------ 
##  -                       F-test                       - 
##  ------------------------------------------------------ 
##                  <------- I(0) ------------ I(1) -----> 
##  10% critical value       2.26            3.35 
##  5% critical value        2.62            3.79 
##  1% critical value        3.41            4.68 
##  
## 
##  F-statistic = 5.822 
##  ------------------------------------------------------
##  -                       t-test                       -
##  ------------------------------------------------------
##                  <------- I(0) ------------ I(1) ----->
##  10% critical value       -2.57            -3.86
##  5% critical value        -2.86            -4.19
##  1% critical value        -3.43            -4.79
##  
## 
##  t statistic = -4.921 
##  ------------------------------------------------------ 
##  F-statistic note: Asymptotic critical values used. 
##  t-statistic note: Asymptotic critical values used.

The initial model seems close to adequate minus a few caveats. First, looking at the ACF and PACF plots for the residuals, it seems like the residuals may be stationary however, the LM test comes back with a rejection of the null indicating there may be serial correlation in the residuals. As an investigative measure, I want to look at the correlations between variables to see if I have any underlying issues there.

corrplot::corrplot(cor(la.cnty[cnames]))

Now I can see that NO\(_2\) and CO have nearly linear positive correlation, and CO with Ozone sharing a nearly negative relationship. In the next model iteration, I consider omitting ozone in order to include only pollutants emitted from vehicular, industrial or agircultural sources. Ozone is the result of sunlight reacting with various molecules in the air, and causing stray or lost oxygen atoms to combine with O\(_2\) in the air and create ozone, or O\(_3\).

Before moving on to the next model iteration, I also wanted to note that the above testing includes the ARDL bounds test for cointegration, using both the F an t-statistics. In both cases, cointegration is present at the \(1\%\) significance level. I do not continue to find the error correction term here, as the model is not deemed adequate and could produce biased results.

The next iteration of the model was found through hours of reserach, stepwise selection and ultimately, trial and error. Notice below, the cointegrating term will now only include the long-term relationship between the diagnoses series and carbon monoxide. Additionally, the short run dynamics will be represented by diagnoses series, carbon monoxide and nitrogen dioxide. This model can be interpreted as a purely ambient airpollutants since NO\(_2\) is emitted from vehicular sources and a byproduct of tabacco smoke, as is CO.

par(mar=c(5,4,4,2))
# next iteration.2
cnames <- colnames(la.cnty)[c(2,3,7)]
fx <- as.formula(paste('DXTotal ~', paste(cnames[!cnames %in% "DXTotal"], collapse = " + ")))

ardl2 <- dynardl(fx,data = la.cnty,
               lags=list('DXTotal'=1,
                         'Carbon.monoxide'=1
                         ),
               diffs = cnames[2:length(cnames)],
               lagdiffs=list('DXTotal'= 1,
                             'Carbon.monoxide'=1,
                             'Nitrogen.dioxide..NO2.' = 1),
               ec=T, simulate = F)

## [1] "Error correction (EC) specified; dependent variable to be run in differences."

dynardl.auto.correlated(ardl2)

## 
##  ------------------------------------------------------ 
##  Breusch-Godfrey LM Test 
##  Test statistic: 0.044 
##  p-value: 0.833 
##  H_0: no autocorrelation up to AR 1 
##  
##  ------------------------------------------------------ 
##  Shapiro-Wilk Test for Normality 
##  Test statistic: 0.977 
##  p-value: 0.06 
##  H_0: residuals are distributed normal 
##  
##  ------------------------------------------------------ 
##  Log-likelihood: -759.856 
##  AIC: 1537.712 
##  BIC: 1561.683 
##  Note: AIC and BIC calculated with k = 8 on T = 106 observations. 
##  
##  ------------------------------------------------------ 
## Shapiro-Wilk test indicates we reject the null hypothesis of normality at p < 0.10.

shapiro.test(ardl2$model$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  ardl2$model$residuals
## W = 0.97677, p-value = 0.05963

summary(ardl2)

## 
## Call:
## lm(formula = as.formula(paste(paste(dvnamelist), "~", paste(colnames(IVs), 
##     collapse = "+"), collapse = " ")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -755.16 -228.34  -30.94  147.71 1025.44 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1899.04358  360.66689   5.265 8.28e-07 ***
## l.1.DXTotal                   -0.61243    0.10074  -6.079 2.33e-08 ***
## ld.1.DXTotal                  -0.01033    0.08818  -0.117   0.9070    
## d.1.Carbon.monoxide         -193.89542  241.20365  -0.804   0.4234    
## d.1.Nitrogen.dioxide..NO2.    18.48740    8.56056   2.160   0.0332 *  
## l.1.Carbon.monoxide          717.52961  120.63147   5.948 4.19e-08 ***
## ld.1.Carbon.monoxide         144.22300  273.66929   0.527   0.5994    
## ld.1.Nitrogen.dioxide..NO2.  -15.55721    8.08240  -1.925   0.0572 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326.6 on 98 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5089, Adjusted R-squared:  0.4738 
## F-statistic: 14.51 on 7 and 98 DF,  p-value: 7.89e-13

f <- as.formula(paste(paste('d.DXTotal'), "~", paste(colnames(ardl2$model$model)[2:length(colnames(ardl2$model$model))], collapse = "+"), collapse = " "))
par(mfrow=c(2,1))
plot(efp(f, data=ardl2$model$model,type='Rec-CUSUM'))
plot(efp(f, data=ardl2$model$model,type='Rec-MOSUM'))

par(mfrow=c(1,1))

resids <- data.frame(resid = ardl2$model$residuals)
gg1 <- autoplot(ts(ardl2$model$residuals)) 
gg2 <- ggplot(resids, aes(x=resid)) + 
  geom_histogram(aes(y=..density.., fill=..count.. ), binwidth = 100) + 
  stat_function(fun=dnorm,
                args=list(mean=mean(as.numeric(resids$resid)),
                          sd = sd(as.numeric(resids$resid))),
                colour='red',
                alpha=.5) + theme(legend.position = "none") 
gg3 <- ggAcf(ts(ardl2$model$residuals)) + ylim(c(-1,1))
gg4 <- ggPacf(ts(ardl2$model$residuals)) + ylim(c(-1,1))
grid.arrange(gg1,gg2,gg3,gg4, ncol=2)

The model above looks adequate and quite good. Using only carbon monoxide in the cointegrating term will mean that the long run dynamics are entirely described by the cointegrating relationship between respiratory illness and apparent effect of CO exposure. A quick look at the recursive and moving average residuals plots for evaluation of sstability show the model is also stable over time, or not explosive when a shock to the process is present. Additionally, note that most of the residual diagnostics are all valid. The only notable issues are the assumption of serial correlation in the residuals using the Durbin-Watson test, and the Shapiro-Wilks test for normality. The Durbin-Watson test shows a p-value of .06 which is on the cusp of rejection, however, the LM test is valid, and the ACF.PACF graphs look good. I believe this could be due to the Shapiro-Wilk’s test being close to a breach in normality. Since both tests are very close to rejection at the \(5\%\) significance level, the model seems adequate but could be on the verge of yeilding biased results. Looking at the plot of residuals and taking a personal stance, I see the density of the the residuals being fairly normal and mean centered without severe skewness. There are slight indications of bimodal by seeing the second peak just off center. Despite these caveats, the model seems adequate and I will continue to use this model to find the restricted ECM.

The Restricted ECM

Next, building off of the assumption that the model above is adequate, I move on to the next step: forming the error correction model and evaluating the levels model. There are two parts to this step. First, since cointegration was confirmed, I need to now evaluate the levels model for the cointegrating relationship, or perform a linear model analysis on \(y_t = c + x^{\text{co}}_t + \nu_t\), and obtain the OLS coefficient estimates.

# form levels model and fit
fxec <- as.formula('DXTotal ~ Carbon.monoxide')
fit <- lm(fxec,la.cnty)
summary(fit)

## 
## Call:
## lm(formula = fxec, data = la.cnty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1021.05  -373.17     8.93   362.62  1098.34 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3449.59     105.67  32.644  < 2e-16 ***
## Carbon.monoxide   821.24      99.79   8.229 5.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.2 on 106 degrees of freedom
## Multiple R-squared:  0.3898, Adjusted R-squared:  0.3841 
## F-statistic: 67.72 on 1 and 106 DF,  p-value: 5.193e-13

# look at fit and residuals of levels model
autoplot(ts(fit$fitted.values), series='fitted') + 
  autolayer(ts(fit$model$DXTotal), series='DXTotal') +
  autolayer(ts(fit$residuals), series = 'residuals')

The above graph shows the overall fit of the levels model, which will show the long term dynamic relationship between the inpatient diagnoses and carbon monoxide. Notice how the various peaks in the response are also visible in the fitted series, but the fitted values do seem to revert back to the overall trend or mean of the series each time. The next step will be to use the coefficient estimates to create the \(z_{t-1}\) element, or cointegrating term in the restricted ECM which will be the final model in this case. If the error correction term is negative and signficant in the restricted ECM, then I can interpret the term as the percentage of disequilibrium that is corrected in one period (month in my case) by the long run relationship between the response and CO.

# fit cointegrating term
z.l.t1 <- ardl2$model$model$l.1.DXTotal - fit$coefficients[1] - fit$coefficients[2] * ardl2$model$model$l.1.Carbon.monoxide
# form ECM and fit
ecm.data <- data.frame(cbind(ardl2$model$model, z.l.t1))
fxecfull <- as.formula(paste('d.DXTotal ~', paste(colnames(ecm.data)[-c(1,2,8)], collapse = '+'), collapse = ' '))
fit.2 <- lm(fxecfull,ecm.data)
summary(fit.2)

## 
## Call:
## lm(formula = fxecfull, data = ecm.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -663.76 -217.84  -15.97  161.01 1139.69 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -264.53063   88.10626  -3.002  0.00339 ** 
## ld.1.DXTotal                 -0.01821    0.08928  -0.204  0.83881    
## d.1.Carbon.monoxide        -103.82956  239.83230  -0.433  0.66601    
## d.1.Nitrogen.dioxide..NO2.   19.29163    8.66638   2.226  0.02828 *  
## l.1.Carbon.monoxide         267.39521   86.55742   3.089  0.00260 ** 
## ld.1.Carbon.monoxide       -184.63505  216.68797  -0.852  0.39623    
## z.l.t1                       -0.65566    0.09954  -6.587 2.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 331 on 99 degrees of freedom
## Multiple R-squared:  0.4903, Adjusted R-squared:  0.4594 
## F-statistic: 15.87 on 6 and 99 DF,  p-value: 1.056e-12

# check dynamics
par(mfrow=c(2,1))
plot(efp(fxecfull, data=ecm.data,type='Rec-CUSUM'))
plot(efp(fxecfull, data=ecm.data,type='Rec-MOSUM'))

par(mfrow=c(1,1))
# check model fit
autoplot(ts(fit.2$fitted.values), series='fitted') + 
  autolayer(ts(fit.2$model$d.DXTotal), series='Differenced DXTotal') +
  autolayer(ts(fit.2$residuals), series='residuals', alpha=.5)

The resulting restricted ECM indicates that the cointegrating term is extremely signficant and actually corrects nearly \(66\%\) of disquilibrium over a period of one month.

To present a breif conclusion relevant to this model, it seems as though carbon monoixde plays a large role in the overall respiratory health in the long-run, while nitrogen dioxide levels can cause changes in the reponse in the short-run.

Final Model in differences

Now that I have walked through the bounds testing and formulation of an ECM model in differences of my response, I want to look at one more iteration which includes two variables in the error correction term and accounts for a bit more of the variation in the differenced diagnoses series using the final restricted ECM. The analysis below:

par(mar=c(5,4,4,2))
# next iteration.3
cnames <- colnames(la.cnty)[c(2,3,5,6,7)]
fx <- as.formula(paste('DXTotal ~', paste(cnames[!cnames %in% "DXTotal"], collapse = " + ")))

ardl3 <- dynardl(fx,data = la.cnty,
               lags=list('DXTotal'=1,
                         'Carbon.monoxide'=1,
                         'PM2.5...Local.Conditions'=1
                         ),
               diffs = cnames[2:length(cnames)],
               lagdiffs=list('DXTotal'=1,
                             'Carbon.monoxide'=c(1:2),
                             'PM2.5...Local.Conditions'=c(1:2),
                             'Sulfur.dioxide' = c(1:2),
                             'Nitrogen.dioxide..NO2.' = c(1:2)),
               ec=T, simulate = F)

## [1] "Error correction (EC) specified; dependent variable to be run in differences."

dynardl.auto.correlated(ardl3)

## 
##  ------------------------------------------------------ 
##  Breusch-Godfrey LM Test 
##  Test statistic: 0.208 
##  p-value: 0.648 
##  H_0: no autocorrelation up to AR 1 
##  
##  ------------------------------------------------------ 
##  Shapiro-Wilk Test for Normality 
##  Test statistic: 0.981 
##  p-value: 0.135 
##  H_0: residuals are distributed normal 
##  
##  ------------------------------------------------------ 
##  Log-likelihood: -748.822 
##  AIC: 1533.644 
##  BIC: 1581.415 
##  Note: AIC and BIC calculated with k = 17 on T = 105 observations. 
##  
##  ------------------------------------------------------

summary(ardl3)

## 
## Call:
## lm(formula = as.formula(paste(paste(dvnamelist), "~", paste(colnames(IVs), 
##     collapse = "+"), collapse = " ")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -660.86 -205.60  -20.32  142.75  961.34 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1803.12069  413.18407   4.364 3.47e-05 ***
## l.1.DXTotal                     -0.58709    0.11378  -5.160 1.51e-06 ***
## ld.1.DXTotal                    -0.06745    0.10401  -0.648 0.518368    
## d.1.Carbon.monoxide           -192.28275  264.23901  -0.728 0.468738    
## d.1.PM2.5...Local.Conditions    -6.27802   10.08114  -0.623 0.535060    
## d.1.Sulfur.dioxide             -38.39072   45.42256  -0.845 0.400297    
## d.1.Nitrogen.dioxide..NO2.      19.58120    9.00014   2.176 0.032261 *  
## l.1.Carbon.monoxide            547.76048  160.76149   3.407 0.000991 ***
## l.1.PM2.5...Local.Conditions     7.22743   12.56372   0.575 0.566582    
## ld.1.Carbon.monoxide            97.98222  289.71957   0.338 0.736020    
## ld.2.Carbon.monoxide            99.52297  245.63017   0.405 0.686333    
## ld.1.PM2.5...Local.Conditions  -21.90553   12.57422  -1.742 0.084985 .  
## ld.2.PM2.5...Local.Conditions  -10.17029   10.75657  -0.945 0.346995    
## ld.1.Sulfur.dioxide            -17.42057   43.72414  -0.398 0.691286    
## ld.2.Sulfur.dioxide            -51.54338   44.38305  -1.161 0.248648    
## ld.1.Nitrogen.dioxide..NO2.     -7.10506    8.87550  -0.801 0.425563    
## ld.2.Nitrogen.dioxide..NO2.      5.20353    8.82824   0.589 0.557090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 330.6 on 88 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5468, Adjusted R-squared:  0.4644 
## F-statistic: 6.636 on 16 and 88 DF,  p-value: 1.458e-09

f <- as.formula(paste(paste('d.DXTotal'), "~", paste(colnames(ardl3$model$model)[2:length(colnames(ardl3$model$model))], collapse = "+"), collapse = " "))
par(mfrow=c(2,1))
plot(efp(f, data=ardl3$model$model,type='Rec-CUSUM'))
plot(efp(f, data=ardl3$model$model,type='Rec-MOSUM'))

par(mfrow=c(1,1))

resids <- data.frame(resid = ardl3$model$residuals)
gg1 <- autoplot(ts(ardl3$model$residuals)) 
gg2 <- ggplot(resids, aes(x=resid)) + 
  geom_histogram(aes(y=..density.., fill=..count.. ), binwidth = 100) + 
  stat_function(fun=dnorm,
                args=list(mean=mean(as.numeric(resids$resid)),
                          sd = sd(as.numeric(resids$resid))),
                colour='red',
                alpha=.5) + theme(legend.position = "none") 
gg3 <- ggAcf(ts(ardl3$model$residuals)) + ylim(c(-1,1))
gg4 <- ggPacf(ts(ardl3$model$residuals)) + ylim(c(-1,1))
grid.arrange(gg1,gg2,gg3,gg4, ncol=2)

# form levels model and fit
fxec <- as.formula('DXTotal ~ Carbon.monoxide + PM2.5...Local.Conditions')
fit <- lm(fxec,la.cnty)
summary(fit)

## 
## Call:
## lm(formula = fxec, data = la.cnty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1120.87  -331.27    26.65   317.77  1019.15 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3773.92     161.84  23.319  < 2e-16 ***
## Carbon.monoxide            849.24      98.11   8.656 6.23e-14 ***
## PM2.5...Local.Conditions   -26.16      10.90  -2.398   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 459.7 on 105 degrees of freedom
## Multiple R-squared:  0.4248, Adjusted R-squared:  0.4138 
## F-statistic: 38.77 on 2 and 105 DF,  p-value: 2.464e-13

# look at fit and residuals of levels model
autoplot(ts(fit$fitted.values), series='fitted') + 
  autolayer(ts(fit$model$DXTotal), series='DXTotal') +
  autolayer(ts(fit$residuals), series = 'residuals')

# fit cointegrating term
z.l.t1 <- ardl3$model$model$l.1.DXTotal - fit$coefficients[1] - 
  fit$coefficients[2] * ardl3$model$model$l.1.Carbon.monoxide -
  fit$coefficients[3] * ardl3$model$model$l.1.PM2.5...Local.Conditions
# form ECM and fit
ecm.data <- data.frame(cbind(ardl3$model$model, z.l.t1))
fxecfull <- as.formula(paste('d.DXTotal ~', paste(colnames(ecm.data)[-c(1,2,8)], collapse = '+'), collapse = ' '))
fit.2 <- lm(fxecfull,ecm.data)
summary(fit.2)

## 
## Call:
## lm(formula = fxecfull, data = ecm.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -640.0 -192.9  -21.2  151.3  975.0 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -406.82297  172.63128  -2.357   0.0206 *  
## ld.1.DXTotal                    -0.06595    0.10345  -0.637   0.5255    
## d.1.Carbon.monoxide           -241.87830  233.41234  -1.036   0.3029    
## d.1.PM2.5...Local.Conditions    -5.87709    9.98585  -0.589   0.5577    
## d.1.Sulfur.dioxide             -37.76283   45.18317  -0.836   0.4055    
## d.1.Nitrogen.dioxide..NO2.      19.59874    8.95776   2.188   0.0313 *  
## l.1.PM2.5...Local.Conditions    25.39624   10.74416   2.364   0.0203 *  
## ld.1.Carbon.monoxide           130.65600  277.08567   0.472   0.6384    
## ld.2.Carbon.monoxide           130.71541  232.29144   0.563   0.5750    
## ld.1.PM2.5...Local.Conditions  -23.98031   11.44232  -2.096   0.0389 *  
## ld.2.PM2.5...Local.Conditions  -11.79952    9.93833  -1.187   0.2383    
## ld.1.Sulfur.dioxide            -20.27548   42.95600  -0.472   0.6381    
## ld.2.Sulfur.dioxide            -54.28999   43.66178  -1.243   0.2170    
## ld.1.Nitrogen.dioxide..NO2.     -7.36104    8.81164  -0.835   0.4057    
## ld.2.Nitrogen.dioxide..NO2.      5.45250    8.76568   0.622   0.5355    
## z.l.t1                          -0.58325    0.11286  -5.168 1.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 329.1 on 89 degrees of freedom
## Multiple R-squared:  0.5459, Adjusted R-squared:  0.4694 
## F-statistic: 7.134 on 15 and 89 DF,  p-value: 5.689e-10

par(mfrow=c(2,2))
plot(fit.2)

par(mfrow=c(1,1))
bgtest(fit.2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  fit.2
## LM test = 0.35282, df = 1, p-value = 0.5525

dwtest(fit.2)

## 
##  Durbin-Watson test
## 
## data:  fit.2
## DW = 1.935, p-value = 0.3388
## alternative hypothesis: true autocorrelation is greater than 0

shapiro.test(fit.2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit.2$residuals
## W = 0.97765, p-value = 0.07302

# check dynamics
par(mfrow=c(2,1))
plot(efp(fxecfull, data=ecm.data,type='Rec-CUSUM'))
plot(efp(fxecfull, data=ecm.data,type='Rec-MOSUM'))

par(mfrow=c(1,1))
# check model fit
autoplot(ts(fit.2$fitted.values), series='fitted') + 
  autolayer(ts(fit.2$model$d.DXTotal), series='Differenced DXTotal') +
  autolayer(ts(fit.2$residuals), series='residuals', alpha=.5)

# check Correlation
cor(data.frame(fit.2$model$d.DXTotal), data.frame(fit.2$fitted.values))

##           fit.2.fitted.values
## d.DXTotal           0.7388791

# percentage change
fitted.pc <- fit.2$fitted.values/stats::lag(la.cnty$DXTotal)[4:108]
dxtotal.pc <- fit.2$model$d.DXTotal/stats::lag(la.cnty$DXTotal)[4:108]
autoplot(ts(fitted.pc), series='fitted') + 
  autolayer(ts(dxtotal.pc), series='Differenced DXTotal')

This model incorporates CO and PM\(_{2.5}\) into the cointegrating term which may be a bit more intuitive. Particulate matter levels are one of the major components of the daily AQI reporting. Being able to show a cointegrating relationship between carbon monoxide, particulate matter and number of repspiratory related inpatient cases is the ultimate goal of my thesis.

Sifting through the extensive output, the initial unrestricted ECM or the ARDL model is adequate. The residuals can be assumed normal, no serial correlation is present, and the process is stable. From here, I move on to forming the levels model in order to achieve OLS coefficient estimates for the restricted ECM cointegrating term. The final restricted ECM is also adequate, showing no major breaches of assumptions for linear models, and the Durbin-Watson as well as the Breusch-Godfrey both show no signs of serial correlation.

The final model in levels shows a large divergence but an attempt to regain equilibrium. Recall, the model above was evaluating the diagnoses series in first differences. The assumption of cointegration does not have to apply to the series in levels, but it seems like the model in levels does show signs of cointegration, just much slower response to shocks in underlying variables.

x.1 <- la.cnty$DXTotal[3]
fitted.v <- as.vector(fit.2$fitted.values)
model.v <- as.vector(fit.2$model$d.DXTotal)
fitted.cs <- append(x.1,fitted.v)
# including the 'unscrambled' differences here just to show the lags line up
model.cs <- append(x.1,model.v)
autoplot(stats::lag(ts(cumsum(fitted.cs)),-2), series='fitted') + 
  autolayer(ts(la.cnty$DXTotal), series = 'original') +
  autolayer(stats::lag(ts(cumsum(model.cs)),-2), series = 'unscrambled', alpha=.5)

cor(model.cs,fitted.cs)

## [1] 0.8879945

Final Model

A More Interpretable Approach: Percent Change

Working with the above models poses problems in interpretablity as addressed earlier. A key transformation would be evaluating percent change in diagnoses, or taking the log transformation of the response, then utilizing the first difference of the series in the model. Using the first difference of the log transform is an approximation of percentage change, but is nearly negligible especially since the percent change is relatively small throughout the series (<\(45\%\)). The log transformation will simply scale (shrink) the response, but not drastically change the overall series. As a quick proof of concept, I will show below that the log transformed series is still I(1), and is almost identical to the original series, just smaller in variance.

First, the ADF and KPSS tests:

summary(ur.kpss(ln.la.cnty$DXTotal, type = 'mu'))

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.9757 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

sadf <- summary(ur.df(ln.la.cnty$DXTotal, type = 'none', selectlags = 'BIC'))
tble.adf <- data.frame(sadf@teststat[1], sadf@cval[2], lmp(sadf@testreg))
colnames(tble.adf) <- c('Test Stat.', 'Critical Val.', 'P-Val.')
tble.adf

##   Test Stat. Critical Val.    P-Val.
## 1 -0.4508649         -1.95 0.5396618

Notice, the levels series shows a unit root is present in both tests. Running the tests in first differences shows the transformed series is I(1).

Next, look at the plot for the log transformed series in comparison to the untransformed series, they are identical except for the y-axis range:

gg1 <- autoplot(ts(ln.la.cnty$DXTotal)) + labs(title='Log Transformed DXTotal') + title_center
gg2 <- autoplot(ts(la.cnty$DXTotal)) + labs(title='DXTotal') + title_center
grid.arrange(gg1,gg2,nrow=2)

Now, using the log transformation, I can rerun my optimal model and analyze the restricted ECM, keeping in mind the response is now monthly percentage change in diagnoses in LA County.

par(mar=c(5,4,4,2))
# next iteration.PC2
cnames <- colnames(la.cnty)[c(2,3,5,6,7)]
fx <- as.formula(paste('DXTotal ~', paste(cnames[!cnames %in% "DXTotal"], collapse = " + ")))

ardl.pc2 <- dynardl(fx,data = ln.la.cnty,
                 lags=list('DXTotal'=1,
                           'Carbon.monoxide'=1,
                           'PM2.5...Local.Conditions'=1
                 ),
                 diffs = cnames[2:length(cnames)],
                 lagdiffs=list('DXTotal'=1,
                               'Carbon.monoxide'=1,
                               'PM2.5...Local.Conditions'=1,
                               'Sulfur.dioxide' = 1,
                               'Nitrogen.dioxide..NO2.' = 1),
                 ec=T, simulate = F)

## [1] "Error correction (EC) specified; dependent variable to be run in differences."

dynardl.auto.correlated(ardl.pc2)

## 
##  ------------------------------------------------------ 
##  Breusch-Godfrey LM Test 
##  Test statistic: 1.476 
##  p-value: 0.224 
##  H_0: no autocorrelation up to AR 1 
##  
##  ------------------------------------------------------ 
##  Shapiro-Wilk Test for Normality 
##  Test statistic: 0.985 
##  p-value: 0.261 
##  H_0: residuals are distributed normal 
##  
##  ------------------------------------------------------ 
##  Log-likelihood: 128.689 
##  AIC: -229.378 
##  BIC: -192.09 
##  Note: AIC and BIC calculated with k = 13 on T = 106 observations. 
##  
##  ------------------------------------------------------

summary(ardl.pc2)

## 
## Call:
## lm(formula = as.formula(paste(paste(dvnamelist), "~", paste(colnames(IVs), 
##     collapse = "+"), collapse = " ")))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.170277 -0.043978 -0.001539  0.034867  0.225198 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    4.2658136  0.8124429   5.251 9.54e-07 ***
## l.1.DXTotal                   -0.5285818  0.0994048  -5.317 7.21e-07 ***
## ld.1.DXTotal                  -0.0478711  0.0924628  -0.518   0.6059    
## d.1.Carbon.monoxide           -0.0226149  0.0539100  -0.419   0.6758    
## d.1.PM2.5...Local.Conditions  -0.0017935  0.0022186  -0.808   0.4209    
## d.1.Sulfur.dioxide            -0.0107918  0.0098866  -1.092   0.2778    
## d.1.Nitrogen.dioxide..NO2.     0.0045434  0.0020042   2.267   0.0257 *  
## l.1.Carbon.monoxide            0.1289972  0.0290756   4.437 2.51e-05 ***
## l.1.PM2.5...Local.Conditions   0.0005553  0.0026939   0.206   0.8371    
## ld.1.Carbon.monoxide           0.0342205  0.0568235   0.602   0.5485    
## ld.1.PM2.5...Local.Conditions -0.0035904  0.0022470  -1.598   0.1135    
## ld.1.Sulfur.dioxide           -0.0038801  0.0099552  -0.390   0.6976    
## ld.1.Nitrogen.dioxide..NO2.   -0.0017187  0.0018942  -0.907   0.3666    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07672 on 93 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5083, Adjusted R-squared:  0.4448 
## F-statistic:  8.01 on 12 and 93 DF,  p-value: 4.311e-10

dynamac::pssbounds(ardl.pc2)

## 
##  PESARAN, SHIN AND SMITH (2001) COINTEGRATION TEST 
## 
##  Observations: 106 
##  Number of Regressors (k): 2 
##  Case: 3 
## 
##  ------------------------------------------------------ 
##  -                       F-test                       - 
##  ------------------------------------------------------ 
##                  <------- I(0) ------------ I(1) -----> 
##  10% critical value       3.17            4.14 
##  5% critical value        3.79            4.85 
##  1% critical value        5.15            6.36 
##  
## 
##  F-statistic = 10.823 
##  ------------------------------------------------------
##  -                       t-test                       -
##  ------------------------------------------------------
##                  <------- I(0) ------------ I(1) ----->
##  10% critical value       -2.57            -3.21
##  5% critical value        -2.86            -3.53
##  1% critical value        -3.43            -4.1
##  
## 
##  t statistic = -5.317 
##  ------------------------------------------------------ 
##  F-statistic note: Asymptotic critical values used. 
##  t-statistic note: Asymptotic critical values used.

f <- as.formula(paste(paste('d.DXTotal'), "~", paste(colnames(ardl.pc2$model$model)[2:length(colnames(ardl.pc2$model$model))], collapse = "+"), collapse = " "))
par(mfrow=c(2,1))
plot(efp(f, data=ardl.pc2$model$model,type='Rec-CUSUM'))
plot(efp(f, data=ardl.pc2$model$model,type='Rec-MOSUM'))

par(mfrow=c(1,1))

resids <- data.frame(resid = ardl.pc2$model$residuals)
gg1 <- autoplot(ts(ardl.pc2$model$residuals)) 
gg2 <- ggplot(resids, aes(x=resid)) + 
  geom_histogram(aes(y=..density.., fill=..count.. ), binwidth = .05) + 
  stat_function(fun=dnorm,
                args=list(mean=mean(as.numeric(resids$resid)),
                          sd = sd(as.numeric(resids$resid))),
                colour='red',
                alpha=.5) + theme(legend.position = "none") 
gg3 <- ggAcf(ts(ardl.pc2$model$residuals)) + ylim(c(-1,1))
gg4 <- ggPacf(ts(ardl.pc2$model$residuals)) + ylim(c(-1,1))
grid.arrange(gg1,gg2,gg3,gg4, ncol=2)

# form levels model and fit
fxec <- as.formula('DXTotal ~ Carbon.monoxide + PM2.5...Local.Conditions')
fit <- lm(fxec,ln.la.cnty)
summary(fit)

## 
## Call:
## lm(formula = fxec, data = ln.la.cnty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27007 -0.08286  0.01268  0.08089  0.22034 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8.233927   0.038245 215.292  < 2e-16 ***
## Carbon.monoxide           0.193477   0.023185   8.345 3.05e-13 ***
## PM2.5...Local.Conditions -0.005806   0.002577  -2.253   0.0263 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1086 on 105 degrees of freedom
## Multiple R-squared:  0.4077, Adjusted R-squared:  0.3964 
## F-statistic: 36.14 on 2 and 105 DF,  p-value: 1.145e-12

# look at fit and residuals of levels model
autoplot(ts(fit$fitted.values), series='fitted') + 
  autolayer(ts(fit$model$DXTotal), series='DXTotal') +
  autolayer(ts(fit$residuals), series = 'residuals')

# fit cointegrating term
z.l.t1 <- ardl.pc2$model$model$l.1.DXTotal - fit$coefficients[1] - 
  fit$coefficients[2] * ardl.pc2$model$model$l.1.Carbon.monoxide -
  fit$coefficients[3] * ardl.pc2$model$model$l.1.PM2.5...Local.Conditions
# form ECM and fit
ecm.data <- data.frame(cbind(ardl.pc2$model$model, z.l.t1))
fxecfull <- as.formula(paste('d.DXTotal ~', paste(colnames(ecm.data)[-c(1,2,9)], collapse = '+'), collapse = ' '))
fit.2 <- lm(fxecfull,ecm.data)
summary(fit.2)

## 
## Call:
## lm(formula = fxecfull, data = ecm.data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.177707 -0.049011 -0.002041  0.038756  0.213355 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -0.046533   0.021308  -2.184   0.0315 *  
## ld.1.DXTotal                  -0.071752   0.091026  -0.788   0.4325    
## d.1.Carbon.monoxide           -0.005408   0.052511  -0.103   0.9182    
## d.1.PM2.5...Local.Conditions  -0.003125   0.001983  -1.576   0.1184    
## d.1.Sulfur.dioxide            -0.010316   0.009919  -1.040   0.3010    
## d.1.Nitrogen.dioxide..NO2.     0.004719   0.002008   2.350   0.0209 *  
## l.1.Carbon.monoxide            0.043601   0.019590   2.226   0.0284 *  
## ld.1.Carbon.monoxide           0.043229   0.056632   0.763   0.4472    
## ld.1.PM2.5...Local.Conditions -0.002469   0.002088  -1.183   0.2400    
## ld.1.Sulfur.dioxide           -0.003347   0.009986  -0.335   0.7382    
## ld.1.Nitrogen.dioxide..NO2.   -0.001709   0.001902  -0.899   0.3710    
## z.l.t1                        -0.502405   0.097783  -5.138  1.5e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07702 on 94 degrees of freedom
## Multiple R-squared:  0.4991, Adjusted R-squared:  0.4405 
## F-statistic: 8.514 on 11 and 94 DF,  p-value: 3.084e-10

par(mfrow=c(2,2))
plot(fit.2, add.smooth= F)

par(mfrow=c(1,1))
lmtest::bgtest(fit.2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  fit.2
## LM test = 1.0048, df = 1, p-value = 0.3162

lmtest::dwtest(fit.2)

## 
##  Durbin-Watson test
## 
## data:  fit.2
## DW = 2.0011, p-value = 0.4619
## alternative hypothesis: true autocorrelation is greater than 0

shapiro.test(fit.2$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit.2$residuals
## W = 0.98594, p-value = 0.33

# check dynamics
par(mfrow=c(2,1))
plot(efp(fxecfull, data=ecm.data,type='Rec-CUSUM'))
plot(efp(fxecfull, data=ecm.data,type='Rec-MOSUM'))

par(mfrow=c(1,1))
# check model fit
autoplot(ts(fit.2$fitted.values), series='fitted') + 
  autolayer(ts(fit.2$model$d.DXTotal), series='Differenced DXTotal') +
  autolayer(ts(fit.2$residuals), series='residuals', alpha=.5)

# check Correlation
cor(data.frame(fit.2$model$d.DXTotal), data.frame(fit.2$fitted.values))

##           fit.2.fitted.values
## d.DXTotal             0.70645

The model diagnostics and assumptions are still all valid, and the final ECM is very similar to the result achieved with first differences.

Looking at the restricted ECM, the error correction term indicates that half of the disequilibrium is corrected per period, and that the long run relationship between percentage change in diagnoses of COPD, and the flux of PM\(_{2.5}\) paired with CO is significant.

TODO::

forecast model and/or evaluate granger causailty

Cointegration Analysis of Respiratory Health & Pollution

Sean Krinik

7/9/2019