These study notes cover the essential material from Lecture 11: Distributed Lag (DL) and Autoregressive Distributed Lag (ARDL) Models. These models are ==the workhorses of time series analysis==, used to understand how economic variables react to changes over time.
The central question of this chapter is: How do we model the dynamic relationship between variables?
In basic cross-sectional econometrics, we assume \(x\) affects \(y\) immediately. However, in the real world—especially with time series data—actions have delayed effects. For example, an increase in the money supply today might not affect inflation until several months later. (在基础的横截面计量经济学中,我们假设 \(x\) 对 \(y\) 产生即时影响。然而,在现实世界中——尤其是涉及时间序列数据时——行动往往具有滞后效应。例如,今天的货币供应量增加可能要到几个月后才会对通货膨胀产生影响。)
This chapter introduces:
Distributed Lag (DL) models: Where current \(y\) is explained by current and past values of \(x\).
Autoregressive Distributed Lag (ARDL) models: Where current \(y\) is explained by past values of \(y\) AND current/past values of \(x\).
Long-term Equilibrium: How to calculate the “steady state” where a system settles after a shock.
==分布滞后==(DL)模型: 当前的 \(y\) 由 \(x\) 的当前值和过去值共同解释。
==自回归分布滞后==(ARDL)模型: 当前的 \(y\) 由过去的 \(y\) 值以及当前/过去的 \(x\) 值共同解释。
长期均衡: 如何计算系统在经历冲击后所达到的“稳态”。
A “lag” is simply the value of a variable from a previous time period (\(t-1, t-2, \text{etc.}\)). Time series models use these to capture persistence (the tendency of a variable to stay where it is) and delayed reactions. 所谓“滞后”,即指前一时间段(\(t-1, t-2, \text{等}\))中某个变量的数值。时间序列模型利用这些滞后项来捕捉滞后性(变量保持当前水平的倾向)和延迟反应。
Before estimating these models, variables must be stationary (their mean and variance do not change over time).(在估计这些模型之前,变量必须是平稳的(其均值和方差不随时间变化))
This is a theoretical situation where nothing changes—there are no shocks, and variables remain constant (\(y_t = y_{t-1} = y^*\)). It allows economists to see the “long-run” relationship between variables once all temporary fluctuations have died out.(这是一个理论上的情况,其中没有任何变化——没有冲击,变量保持不变。它使经济学家能够看到变量之间的”长期”关系,一旦所有临时波动都消退)
Formula: \(y_t = \mu + \beta_0 x_t + \beta_1 x_{t-1} + ... + \beta_p x_{t-p} + \epsilon_t\).
Formula: \(y_t = \alpha_1 y_{t-1} + ... + \alpha_p y_{t-p} + \mu + \beta_0 x_t + ... + \beta_s x_{t-s} + \epsilon_t\).
In an exam, follow this logic to determine the appropriate model:(在考试中,按照以下逻辑来确定合适的模型)
In time series, interpreting a single lagged coefficient (like \(\beta_2\)) in isolation makes “little sense” because a change in \(x_{t-2}\) usually implies \(x\) was also changing in \(t-1\) and \(t\). Instead, focus on the multipliers.(在时间序列中,单独解释单个滞后系数(如\(\beta_2\))“意义不大”,因为\(x_{t-2}\)的变化通常意味着\(x\)在\(t-1\)和\(t\)也在变化。相反,应该关注乘数)
This is the most likely exam question.(这是最可能出现的考试题目)
Used to choose the number of lags. Usually, we choose the model that provides the lowest AIC or BIC value.(用于选择滞后项的数量。通常,我们选择提供最低AIC或BIC值的模型)
动态关系:在时间序列中,变量的影响往往具有滞后效应(例如:货币供应量对通胀的影响) 。
滞后项 (Lag):指前一时间段(\(t-1, t-2\)等)的数值,用于捕捉持久性和延迟反应 。
平稳性 (Stationarity):估计模型的前提。非平稳序列需通过“差分”处理 。
稳态 (Steady State):不存在冲击、变量保持不变的理论状态(\(y_t = y_{t-1} = y^*\)),用于观察长期均衡关系 。
分布滞后 (DL) 模型:\(y_t = \mu + \beta_0 x_t + \beta_1 x_{t-1} + \dots + \beta_p x_{t-p} + \epsilon_t\) 。
自回归分布滞后 (ARDL) 模型:\(y_t = \alpha_1 y_{t-1} + \dots + \alpha_p y_{t-p} + \mu + \beta_0 x_t + \dots + \beta_s x_{t-s} + \epsilon_t\) 。
短期(即期)乘数:当期系数 \(\beta_0\) 。
长期(累积)乘数 (LTM):
DL模型:\(LTM = \sum \beta_i\) 。
ARDL模型:\(LTM = \frac{\sum \beta_i}{1 - \sum \alpha_i}\) 。
平均反应滞后:\(\bar{\omega} = \frac{\sum (i \cdot \beta_i)}{LTM}\),衡量反应速度 。
本部分根据白板照片整理,对模型估计、诊断检验及复杂场景进行了深入补充。
根据样本量和平稳性,估计路径分为“简单路径”与“困难路径”:
当所有变量均为平稳 (Stationary) 时,使用 OLS (普通最小二乘法) 进行估计。需满足:
函数形式设定正确:通过 Ramsey RESET 检验。
同方差性 (Homoskedasticity):通过 Breusch-Pagan (BP) 检验。
无自相关 (Lack of autocorrelation):
DL模型使用 Breusch-Godfrey (BG) 检验。
ARDL模型使用 Ljung-Box (Portmanteau) 检验。
大样本:拥有足够多的观测值。
如果无法确保平稳性或处于小样本环境,估计变得更加困难,需依赖严格外生性 (Strict Exogeneity):
当期外生性 (Contemporaneous Exogeneity):\(Cov(x_t, \epsilon_t) = 0\)。
强/严格外生性 (Strong/Strict Exogeneity):\(\epsilon_t\) 与任何时期的 \(x\) 变量都不相关。
适用场景:即使没有平稳性假设或样本量较小,若满足严格外生性,模型逻辑依然成立。
在处理多个解释变量(如 \(x, z, w\))时,模型形式如下:
\[y_t = \beta_0 + (\beta_1 x_t + \beta_2 x_{t-1}) + (\alpha_1 z_t + \alpha_2 z_{t-1} + \alpha_3 z_{t-2}) + (\delta_1 w_{t-1} + \delta_2 w_{t-12}) + (\gamma_1 y_{t-1} + \dots + \gamma_s y_{t-s}) + \epsilon_t\]
稳态 (Steady State) 求解逻辑:
假设系统达到均衡,所有时间下标消失:\(y_t = y_{t-1} = y^*\),\(x_t = x_{t-1} = x^*\),以此类推。
\[y^* = \beta_0 + \beta_1 x^* + \beta_2 x^* + \alpha_1 z^* + \alpha_2 z^* + \alpha_3 z^* + \delta_1 w^* + \delta_2 w^* + \gamma_1 y^* + \dots + \gamma_s y^*\]
通过代数移项,即可解出长期均衡值 \(y^*\)。
针对永久冲击 (Permanent Shock),即变量增加并保持在水平值(如 \(x_{15}=x_{16}=\dots=7\)),各变量的 LTM 计算如下:
| 模型类型 | 变量 x 的 LTM | 变量 z 的 LTM |
|---|---|---|
| DL 模型 | \(LTM_x = \beta_1 + \beta_2\) | \(LTM_z = \alpha_1 + \alpha_2 + \alpha_3\) |
| ARDL 模型 | \(LTM_x = \frac{\beta_1 + \beta_2}{1 - (\gamma_1 + \gamma_2 + \dots + \gamma_s)}\) | \(LTM_z = \frac{\alpha_1 + \alpha_2 + \alpha_3}{1 - (\gamma_1 + \dots + \gamma_s)}\) |
| 检验项目 | 工具 | 零假设 (H0) | 失败后的补救措施 |
|---|---|---|---|
| 函数形式 | Ramsey RESET | 模型设定正确/反映现实 | 增加变量、添加滞后项、取对数 |
| 同方差性 | Breusch-Pagan | 误差项为同方差 | 使用对数形式、使用稳健标准误 |
| 自相关 | BG / Ljung-Box | 残差项无自相关 | 增加滞后项(尤其是 \(y\) 的滞后项) |
关键结论:
在 ARDL 模型中,自相关会导致内生性偏误,必须通过增加滞后项使 \(p-value > 0.05\)。
如果无法完美解决同方差或自相关问题,应使用 稳健协方差矩阵估计量 (Robust Standard Errors) 来确保推断的有效性。
###########################################################################
# Advanced Econometrics #
# Spring semester #
# dr Marcin Chlebus, dr Rafał Woźniak #
# University of Warsaw, Faculty of Economic Sciences #
# #
# #
# Lab 12: (AR)DL Linear regression #
# #
###########################################################################
if(!require(lmtest)){install.packages("lmtest")}
library(lmtest)
if(!require(dynlm)){install.packages("dynlm")}
library(dynlm)
# install.packages("devtools")
# library(devtools)
# install_github("fcbarbi/ardl")
# library(ardl)
if(!require(AER)){install.packages("AER")}
library(AER)
if(!require(zoo)){install.packages("zoo")}
library(zoo)
if(!require(xts)){install.packages("xts")}
library(xts)
if(!require(fBasics)){install.packages("fBasics")}
library(fBasics)
Sys.setenv(LANG = "en")
options(scipen=999)##########################################################
# Expolratory data analysis
#########################################################
# Quarterly US macroeconomic data from 1950(1) – 2000(4)
# provided by USMacroG, a “ts” time series. Contains
# dpi --> disposable income
# consumption --> consumption (in billion USD).
data(USMacroG)
data(package = "AER")
USA<-as.zoo(USMacroG)
USA[1:5,]## gdp consumption invest government dpi cpi m1 tbill unemp
## 1950 Q1 1610.5 1058.9 198.1 361.0 1186.1 70.6 110.20 1.12 6.4
## 1950 Q2 1658.8 1075.9 220.4 366.4 1178.1 71.4 111.75 1.17 5.6
## 1950 Q3 1723.0 1131.0 239.7 359.6 1196.5 73.2 112.95 1.23 4.6
## 1950 Q4 1753.9 1097.6 271.8 382.5 1210.0 74.9 113.93 1.35 4.2
## 1951 Q1 1773.5 1122.8 242.9 421.9 1207.9 77.3 115.08 1.40 3.5
## population inflation interest
## 1950 Q1 149.461 NA NA
## 1950 Q2 150.260 4.5071 -3.3404
## 1950 Q3 151.064 9.9590 -8.7290
## 1950 Q4 151.871 9.1834 -7.8301
## 1951 Q1 152.393 12.6160 -11.2160
#plotting DPI and consuption
colors_ <- c("black", "red")
plot(USA[,c("dpi","consumption")], plot.type = "single",
col = colors_, # colors for subsequent lines
ylab = "dpi/consumption", xlab = "time", # axes labels
main = "DPI & Consumption in the USA 1950 - 2000") # title above the plot
# separately we add a legend definition
legend("topleft", # legend position - combination of top, bottom and left, right
names(USA[,c("dpi","consumption")]), # legend elements (names of the series)
text.col = colors_) # colors (the same as in plot() function above)#Conclusions?
# 1. Those 2 variables follows probably the same trend, in regression we would find this relation, but probably nothing more
# What to do?
# differentiation - is there relation between deviance from the trend?
# we want to work on stationary TS#plotting first difference of DPI and consuption
plot(diff(USA[,c("dpi","consumption")]),
plot.type = "single",
col = colors_, # colors for subsequent lines
ylab = "dpi/consumtion", xlab = "time", # axes labels
main = "DPI & Consumption in the USA 1950 - 2000") # title above the plot
# separately we add a legend definition
legend("topleft", # legend position - combination of top, bottom and left, right
names(USA[,c("dpi","consumption")]), # legend elements (names of the series)
text.col = colors_) # colors (the same as in plot() function above)参数说明:
plot.type = "single"
核心参数。指定绘图类型为“单个窗口”。这意味着 dpi 和 consumption
的差分曲线将被绘制在同一个坐标系(同一张图)中。如果不设置该参数(默认是
“multiple”),R 会为两个变量分别绘制上下排列的两个独立图表。USA[,c("dpi","consumption")]:从名为 USA
的数据集(通常是一个时间序列矩阵或数据框)中,筛选出
dpi(可支配个人收入)和 consumption(消费)这两列数据。diff(...):对筛选出的两列数据计算一阶差分 (First
Difference)。在线性回归和时间序列分析(如讲义中提到的分布式滞后模型
DL/ARDL )中,差分常用于将非平稳数据转化为平稳数据 (Stationary Data)
。统计学上的方差(Variance)衡量的是数据偏离其均值的平方和(即波动的剧烈程度)。在这张图里,数据的波动幅度随着时间推移明显“越变越大”,在图形上呈现出向右逐渐张开的态势。这就是典型的方差随时间递增。
为什么取对数(log)能解决这个问题 这正好印证了你后面那段代码的逻辑。
#stationarity testing
source("function_testdf2.R")
options(scipen=999)
testdf2(variable = USA[,c("dpi")], # vector tested
test.type = "c", # test.type = "c",
max.augmentations = 5, # maximum number of augmentations added
max.order=5) # maximum order of residual lags for BG test## augmentations adf p_adf p_bg.p.value.1 p_bg.p.value.2 p_bg.p.value.3
## 1 0 4.520147 0.99 0.4757627 0.1466866 0.2377082
## 2 1 4.417310 0.99 0.9225736 0.1784467 0.2924392
## 3 2 3.639854 0.99 0.9354447 0.9963847 0.7998528
## 4 3 3.342271 0.99 0.9843572 0.9972283 0.9836444
## 5 4 3.069740 0.99 0.9103627 0.9851835 0.9822840
## 6 5 3.715439 0.99 0.9635154 0.9980164 0.9962116
## p_bg.p.value.4 p_bg.p.value.5
## 1 0.3438192 0.01927990
## 2 0.4280401 0.02312387
## 3 0.8876616 0.08151424
## 4 0.9893585 0.10515041
## 5 0.9961649 0.10115539
## 6 0.9919605 0.99814270
According to the results of the testdf2 function, when 2
or more augmentations are introduced (\(\text{augmentations} \ge 2\)), the p-values
of the Breusch–Godfrey (BG) test are all noticeably greater than 0.05
(for instance, when \(\text{augmentations} =
2\) and the residual lag order is 5, the p-value is 0.0815). This
indicates that the issue of residual autocorrelation has been
effectively eliminated. Under this valid premise, the p-value of the
Augmented Dickey-Fuller (ADF) test is 0.99, failing to reject the null
hypothesis of the presence of a unit root, which demonstrates that the
original time series of the US DPI is non-stationary.
根据 testdf2 的检验结果,当引入 2 阶或以上的增强滞后项(augmentations \(\ge 2\))时,BG 检验的 p 值均显著大于 0.05(例如在 augmentations = 2 且滞后 5 阶时,p 值为 0.0815),表明残差自相关得到了有效消除 。在此有效前提下,ADF 检验的 p 值为 0.99,无法拒绝存在单位根的原假设,表明美国 dpi 原始时间序列是非平稳的 。”
testdf2(variable = diff(USA[,c("dpi")]), # vector tested
test.type = "c", # test.type = "c",
max.augmentations = 5, # maximum number of augmentations added
max.order=5) # maximum order of residual lags for BG test## augmentations adf p_adf p_bg.p.value.1 p_bg.p.value.2 p_bg.p.value.3
## 1 0 -13.460411 0.01 0.8720424 0.01570519 0.0149188
## 2 1 -7.813974 0.01 0.7532990 0.89567380 0.3446841
## 3 2 -6.038376 0.01 0.8871046 0.98923951 0.9932058
## 4 3 -5.022296 0.01 0.8511516 0.97342644 0.9908236
## 5 4 -5.491999 0.01 0.9090418 0.99051048 0.9943576
## 6 5 -4.799076 0.01 0.9367024 0.98509353 0.9882262
## p_bg.p.value.4 p_bg.p.value.5
## 1 0.01496486 0.004641123
## 2 0.32977415 0.113457768
## 3 0.90527668 0.239907589
## 4 0.99687858 0.225223810
## 5 0.99900087 0.998999470
## 6 0.99643513 0.998707633
According to the results of the testdf2 function applied
to the first difference of the series, when 1 or more augmentations are
introduced (\(\text{augmentations} \ge
1\)), the p-values of the Breusch–Godfrey (BG) test across all
tested lag orders are greater than 0.05. This indicates that the issue
of residual autocorrelation has been effectively eliminated. Under this
valid premise (for instance, at \(\text{augmentations} = 1\)), the Augmented
Dickey-Fuller (ADF) test statistic is -7.814, and the
corresponding p-value (p_adf) is 0.01.
Since the p-value is less than or equal to the 1% significance level, we
strongly reject the null hypothesis (\(H_0\)) of a unit root. Consequently, we can
conclude that the first-differenced data,
diff(USA[,c("dpi")]), is stationary.
testdf2(variable = USA[,c("consumption")], # vector tested
test.type = "c", # test.type = "c",
max.augmentations = 5, # maximum number of augmentations added
max.order=5) # maximum order of residual lags for BG test## augmentations adf p_adf p_bg.p.value.1 p_bg.p.value.2 p_bg.p.value.3
## 1 0 8.296398 0.99 0.005546307 0.0001103848 0.000008381582
## 2 1 5.771917 0.99 0.485949219 0.0208831243 0.000140733984
## 3 2 4.367355 0.99 0.755707867 0.9043717071 0.006860732824
## 4 3 3.066683 0.99 0.750462543 0.9496740461 0.832868347784
## 5 4 3.252579 0.99 0.991992033 0.9729771490 0.908205859670
## 6 5 3.202110 0.99 0.854199944 0.9826726715 0.998104122232
## p_bg.p.value.4 p_bg.p.value.5
## 1 0.00001907399 0.00003700224
## 2 0.00041767804 0.00065979348
## 3 0.01362072624 0.02641457522
## 4 0.79058700070 0.83992186062
## 5 0.96783873221 0.97249863759
## 6 0.99938185774 0.99929469234
testdf2(variable = diff(USA[,c("consumption")]), # vector tested
test.type = "c", # test.type = "c",
max.augmentations = 5, # maximum number of augmentations added
max.order=5) # maximum order of residual lags for BG test## augmentations adf p_adf p_bg.p.value.1 p_bg.p.value.2
## 1 0 -9.184518 0.01000000 0.04956141 0.003167801
## 2 1 -5.237949 0.01000000 0.30523939 0.246359385
## 3 2 -3.629948 0.01000000 0.87947985 0.926376184
## 4 3 -3.554668 0.01000000 0.97024924 0.906962022
## 5 4 -3.607831 0.01000000 0.86171475 0.984715987
## 6 5 -3.126966 0.02712406 0.98008540 0.998966222
## p_bg.p.value.3 p_bg.p.value.4 p_bg.p.value.5
## 1 0.0000001672182 0.0000001920137 0.0000006372337
## 2 0.0017024608681 0.0042263851347 0.0089973989528
## 3 0.9548186200858 0.9089565420640 0.9419401977419
## 4 0.9599028563715 0.9609444552864 0.9748349856447
## 5 0.9981523313452 0.9903607166502 0.9970624226973
## 6 0.9995951693166 0.9996078091443 0.9999463124934
Log
to fix the problem with increasing variance# first-differences seem to have increasing variance
# thus, let us analyse the first difference of the log-transformed
# first differences
testdf2(variable = diff(log(USA[,c("dpi")])), # vector tested
test.type = "c", # test.type = "nc",
max.augmentations = 5, # maximum number of augmentations added
max.order=5) # maximum order of residual lags for BG test## augmentations adf p_adf p_bg.p.value.1 p_bg.p.value.2 p_bg.p.value.3
## 1 0 -13.218923 0.01 0.9662664 0.6590377 0.4960841
## 2 1 -8.916091 0.01 0.9100174 0.9923266 0.6236955
## 3 2 -6.824515 0.01 0.9450289 0.9383219 0.9883654
## 4 3 -6.336003 0.01 0.9707415 0.9714525 0.9958945
## 5 4 -7.033210 0.01 0.8707947 0.9863307 0.9962280
## 6 5 -5.947960 0.01 0.9861110 0.9986554 0.9985395
## p_bg.p.value.4 p_bg.p.value.5
## 1 0.6648265 0.04837356
## 2 0.7793006 0.07764483
## 3 0.9973459 0.14685535
## 4 0.9989003 0.17154874
## 5 0.9994993 0.99966040
## 6 0.9998475 0.99914533
testdf2(variable = diff(log(USA[,c("consumption")])), # vector tested
test.type = "c", # test.type = "nc",
max.augmentations = 5, # maximum number of augmentations added
max.order=5) # maximum order of residual lags for BG test## augmentations adf p_adf p_bg.p.value.1 p_bg.p.value.2 p_bg.p.value.3
## 1 0 -13.791807 0.01 0.71451406 0.0002341137 0.0007533067
## 2 1 -8.134460 0.01 0.07989654 0.1829103993 0.1138040669
## 3 2 -6.268350 0.01 0.71253996 0.4629639320 0.1105162682
## 4 3 -6.777003 0.01 0.88973894 0.1969371662 0.0370375602
## 5 4 -6.450499 0.01 0.56548398 0.8323252837 0.8093948218
## 6 5 -6.174957 0.01 0.91795679 0.9945301814 0.9996765731
## p_bg.p.value.4 p_bg.p.value.5
## 1 0.0004893063 0.0008670902
## 2 0.0405561599 0.0464505336
## 3 0.0493164293 0.0478830419
## 4 0.0730829520 0.0966908739
## 5 0.9059957973 0.9334284979
## 6 0.9997289343 0.9857120180
##########################################################
# Autoregressive Distributed Lagged Model
#########################################################
# Distirbuted Lagged Model
# As regressors we use lags of explanatory variables
# start = c(1960, 1) is set to make comparison between models with different lag order possible
dl_1 <- dynlm(d(consumption) ~ d(dpi) + L(d(dpi)), data = USA, start = c(1960, 1))
summary(dl_1)##
## Time series regression with "zooreg" data:
## Start = 1960 Q1, End = 2000 Q4
##
## Call:
## dynlm(formula = d(consumption) ~ d(dpi) + L(d(dpi)), data = USA,
## start = c(1960, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.413 -12.797 -1.915 12.594 53.529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.60067 2.69478 4.676 0.00000615612291 ***
## d(dpi) 0.37915 0.04942 7.673 0.00000000000152 ***
## L(d(dpi)) 0.18116 0.04957 3.655 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.06 on 161 degrees of freedom
## Multiple R-squared: 0.3074, Adjusted R-squared: 0.2988
## F-statistic: 35.73 on 2 and 161 DF, p-value: 0.0000000000001437
根据代码中的变量及其对应关系 :
d(consumption) 对应 \(\Delta
Y_t\) 或 \(\Delta
\text{consumption}_t\)d(dpi) 对应 \(\Delta
X_t\) 或 \(\Delta
\text{dpi}_t\)L(d(dpi)) 对应 \(\Delta
X_{t-1}\) 或 \(\Delta
\text{dpi}_{t-1}\)总体理论公式可以写为 :
\[\Delta \text{consumption}_t = \mu + \beta_0 \Delta \text{dpi}_t + \beta_1 \Delta \text{dpi}_{t-1} + \epsilon_t\]
其中:
\(\mu\) 为截距项(Intercept) 。
\(\beta_0\) 为当期收入变动对当期消费变动的影响(短期反应/影响乘数) 。
\(\beta_1\) 为前一期收入变动对当期消费变动的滞后影响 。
\(\epsilon_t\) 为随机误差项(Residuals / Error term) 。
dynlm(...):来自于 dynlm
包的核心函数,专门用于拟合动态线性回归模型(Dynamic Linear
Regression
Models)。它支持在时间序列公式中直接使用差分和滞后算子。formula = d(consumption) ~ d(dpi) + L(d(dpi)):定义了回归方程的结构:
d(consumption):因变量(Dependent
Variable)。d() 代表一阶差分(First
Difference),即 \(\Delta
\text{consumption}_t\),表示消费的季度变动量量。d(dpi):自变量/解释变量(Independent/Explanatory
Variable) 的当期一阶差分,即 \(\Delta
\text{dpi}_t\),表示可支配个人收入的当期变动量。L(d(dpi)):自变量差分后的一阶滞后项(First
Lag),即 \(\Delta
\text{dpi}_{t-1}\)。L() 代表滞后算子(Lag
Operator)。data = USA:指定使用名为
USA 的时间序列数据集。start = c(1960, 1):指定回归估计的起始时间(Start
Time)为 1960 年第 1
季度。通过手动固定所有竞争模型的起始时间,可以确保后续在对比不同滞后阶数的模型时,使用的样本观测值数量完全一致。##########################################################
# Autoregressive Distributed Lagged Model
#########################################################
# Distirbuted Lagged Model
# As regressors we use lags of explanatory variables
# start = c(1960, 1) is set to make comparison between models with different lag order possible# Distirbuted Lagged Model:提示此段代码正在建立一个“分布式滞后模型(DL
模型)”。# As regressors we use lags of explanatory variables:指出模型的构建特点,即“使用解释变量(自变量)的滞后项作为回归量(Regressors)”。# start = c(1960, 1) is set to make...:阐述了设置
start
参数的计量经济学目的——“使具有不同滞后阶数的模型之间进行相互比较成为可能”。因为每多增加一阶滞后,序列头部就会缺失一个观测值;统一将起始点固定在
1960 年
Q1,可以确保所有模型基于相同的样本区间进行公平对比(如通过
AIC 或 \(R^2\))。执行 summary(dl_1)
后得到的回归分析报告,其核心部分的解读如下:
Start = 1960 Q1, End = 2000 Q4:回归所使用的实际样本区间为
1960 年第 1 季度至 2000 年第 4 季度。degrees of freedom (DF):自由度为
161。这意味着有效观测值数量 \(n =
164\)(161 个自由度 + 3 个估计参数)。提供了残差的分布情况(最小值 Min、四分位数
1Q/Median/3Q、最大值
Max),用于初步判断残差是否符合正态分布且均值为 0
的假设。
表格包含四个核心列:估计值(Estimate)、标准误(Std. Error)、t 统计量(t value) 和 P 值(Pr(>|t|)。
(Intercept)(截距项 / 常数项):
12.60067。***),在 0.1%
的水平下显著(Significant)。d(dpi)(当期收入变动):
0.37915。L(d(dpi))(前一期收入变动):
0.18116。0.000348,同样极其显著。延伸结合讲义公式 : 该模型的长期反应 / 长期乘数(Long-term reaction / Long-term multiplier) 为自变量所有系数之和:
\[\beta_{LT} = \beta_0 + \beta_1 = 0.37915 + 0.18116 = 0.56031\]
这意味着,从长期来看,可支配收入每增加 1 个单位,将带来总计约 0.56 个单位的消费增长 。
Residual standard error: 21.06:残差标准误,衡量模型预测值偏离真实值的平均程度。
Multiple R-squared: 0.3074:判定系数(\(R^2\))
。表明该模型中的收入变动及其滞后项解释了消费变动中约
30.74% 的方差 。
Adjusted R-squared: 0.2988:调整后的判定系数(\(R_{adj}^2\))
。惩罚了变量个数后的拟合优度,通常用于不同变量数量的模型之间的比较。
F-statistic: 35.73 ... p-value: < 0.0000...:F
统计量整体显著性检验。P 值远小于
0.05,表明联合拒绝“所有自变量系数均为0”的原假设,模型整体非常显著。
# alternative model for log-transformed variables
USA$ln_consumption = log(USA$consumption)
USA$ln_dpi = log(USA$dpi)
dl_ln1 <- dynlm(d(ln_consumption) ~ d(ln_dpi) + L(d(ln_dpi)), data = USA, start = c(1960, 1))
summary(dl_ln1)##
## Time series regression with "zooreg" data:
## Start = 1960 Q1, End = 2000 Q4
##
## Call:
## dynlm(formula = d(ln_consumption) ~ d(ln_dpi) + L(d(ln_dpi)),
## data = USA, start = c(1960, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0202245 -0.0035871 -0.0002189 0.0040243 0.0132598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0038889 0.0007819 4.973 0.000001672323833 ***
## d(ln_dpi) 0.4165784 0.0530758 7.849 0.000000000000554 ***
## L(d(ln_dpi)) 0.1655767 0.0530807 3.119 0.00215 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005956 on 161 degrees of freedom
## Multiple R-squared: 0.3121, Adjusted R-squared: 0.3035
## F-statistic: 36.52 on 2 and 161 DF, p-value: 0.00000000000008359
\[\Delta \ln(\text{consumption}_t) = \mu + \beta_0 \Delta \ln(\text{dpi}_t) + \beta_1 \Delta \ln(\text{dpi}_{t-1}) + \epsilon_t\]
其中:
\(\mu\) 为截距项(Intercept)。
\(\beta_0\) 为当期收入增长率对当期消费增长率的即期影响(短期弹性/影响乘数) 。
\(\beta_1\) 为前一期收入增长率对当期消费增长率的滞后影响。
\(\epsilon_t\) 为随机误差项(Error term)。
acf#in order to draw a conclusions based on the model we would like to sort out a problem with autocorrelation
#has this model sorted out autocorrelation issue? - nope
resids_ <- as.data.frame(dl_1$residuals)
colnames(resids_)<-c("res")
bgtest(res~1,data=resids_, order = 1)##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: res ~ 1
## LM test = 2.8539, df = 1, p-value = 0.09116
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: res ~ 1
## LM test = 11.199, df = 2, p-value = 0.0037
##
## Breusch-Godfrey test for serial correlation of order up to 3
##
## data: res ~ 1
## LM test = 22.658, df = 3, p-value = 0.00004758
##
## Breusch-Godfrey test for serial correlation of order up to 4
##
## data: res ~ 1
## LM test = 23.717, df = 4, p-value = 0.00009101
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: res ~ 1
## LM test = 24.111, df = 5, p-value = 0.0002067
使用 c(1:5) 来显示 5 个lags.
dl_5_all <- dynlm(d(consumption) ~ d(dpi) + L(d(dpi),c(1:5)), data = USA,start = c(1960, 1))
summary(dl_5_all)##
## Time series regression with "zooreg" data:
## Start = 1960 Q1, End = 2000 Q4
##
## Call:
## dynlm(formula = d(consumption) ~ d(dpi) + L(d(dpi), c(1:5)),
## data = USA, start = c(1960, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.047 -11.994 -1.912 12.038 49.236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.64149 3.56385 1.583 0.11544
## d(dpi) 0.36923 0.05050 7.311 0.0000000000127 ***
## L(d(dpi), c(1:5))1 0.16208 0.04976 3.257 0.00138 **
## L(d(dpi), c(1:5))2 0.08255 0.05033 1.640 0.10295
## L(d(dpi), c(1:5))3 0.05801 0.05077 1.143 0.25497
## L(d(dpi), c(1:5))4 0.04099 0.05024 0.816 0.41585
## L(d(dpi), c(1:5))5 0.08489 0.05103 1.664 0.09821 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.74 on 157 degrees of freedom
## Multiple R-squared: 0.3451, Adjusted R-squared: 0.3201
## F-statistic: 13.79 on 6 and 157 DF, p-value: 0.000000000001476
令 \(t\) 代表当前季度,\(\Delta\) 代表一阶差分(即代码中的
d()),正确的理论公式应当是:
\[\Delta \text{consumption}_t = \beta_0 + \beta_1 \Delta \text{dpi}_t + \beta_2 \Delta \text{dpi}_{t-1} + \beta_3 \Delta \text{dpi}_{t-2} + \beta_4 \Delta \text{dpi}_{t-3} + \beta_5 \Delta \text{dpi}_{t-4} + \beta_6 \Delta \text{dpi}_{t-5} + \epsilon_t\]
dl_5 <- dynlm(d(consumption) ~ d(dpi) + L(d(dpi),c(1,2,5)), data = USA,start = c(1960, 1))
summary(dl_5)##
## Time series regression with "zooreg" data:
## Start = 1960 Q1, End = 2000 Q4
##
## Call:
## dynlm(formula = d(consumption) ~ d(dpi) + L(d(dpi), c(1, 2, 5)),
## data = USA, start = c(1960, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.371 -12.682 -1.721 11.804 48.567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.58732 3.25397 2.332 0.02097 *
## d(dpi) 0.37793 0.05009 7.545 0.00000000000328 ***
## L(d(dpi), c(1, 2, 5))1 0.17365 0.04902 3.543 0.00052 ***
## L(d(dpi), c(1, 2, 5))2 0.08627 0.04967 1.737 0.08436 .
## L(d(dpi), c(1, 2, 5))5 0.09302 0.05030 1.849 0.06626 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.73 on 159 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.3205
## F-statistic: 20.22 on 4 and 159 DF, p-value: 0.0000000000001754
## Analysis of Variance Table
##
## Model 1: d(consumption) ~ d(dpi) + L(d(dpi), c(1:5))
## Model 2: d(consumption) ~ d(dpi) + L(d(dpi), c(1, 2, 5))
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 157 67530
## 2 159 68350 -2 -819.71 0.9529 0.3879
我们可以用一套简单的标准来定生死:
在这段输出中,p-value = 0.3879 显著大于
0.05。 我们无法拒绝原假设。这证明第 3 阶和第 4
阶的收入滞后项在模型中是联合不显著的。因此,小模型
dl_5(Model 2)更好、更精简,我们应该放心地把第 3
阶和第 4 阶剔除掉。
#has this model sorted out autocorrelation issue? - nope
resids_ <- as.data.frame(dl_5$residuals)
colnames(resids_)<-c("res")
bgtest(res~1,data=resids_, order = 1)##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: res ~ 1
## LM test = 1.2403, df = 1, p-value = 0.2654
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: res ~ 1
## LM test = 8.7087, df = 2, p-value = 0.01285
##
## Breusch-Godfrey test for serial correlation of order up to 3
##
## data: res ~ 1
## LM test = 16.047, df = 3, p-value = 0.001109
##
## Breusch-Godfrey test for serial correlation of order up to 4
##
## data: res ~ 1
## LM test = 16.918, df = 4, p-value = 0.002005
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: res ~ 1
## LM test = 17.682, df = 5, p-value = 0.003373
==TODO==: explain coeftest
下面这段代码展示了在时间序列分析中,当模型面临残差自相关(Autocorrelation in Residuals)问题,且无法通过调整变量完全消除时,如何使用稳健标准误(Robust Standard Errors)来修正统计推断结果。
# Autocorrelation issue has beed not solved, but in DL models the problem is only in inconsistent Variance - Covariance matrix
# Newey - Davis robust estimator can be used to update the results and draw conclusions
library(lmtest)
library(sandwich)
coeftest(dl_5, vcov=vcovHAC(dl_5))##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.587319 3.566781 2.1272 0.0349445 *
## d(dpi) 0.377935 0.063942 5.9106 0.00000002012 ***
## L(d(dpi), c(1, 2, 5))1 0.173651 0.046350 3.7465 0.0002504 ***
## L(d(dpi), c(1, 2, 5))2 0.086269 0.051880 1.6629 0.0983093 .
## L(d(dpi), c(1, 2, 5))5 0.093020 0.061803 1.5051 0.1342819
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(sandwich):加载
sandwich
扩展包。该包在计量经济学中专门用于计算各种回归模型的稳健协方差矩阵(Robust
Covariance Matrix
Estimators)。由于其计算出的矩阵结构常被称为面包夹肉的“三明治”形状而得名。coeftest(dl_5, ...):来自于
lmtest
包的函数。它允许我们使用指定的方差-协方差矩阵来对已有模型(这里是小模型
dl_5)的系数进行重新检验(t
检验)。它只改变标准误和 P
值,而不改变已经估计出来的参数系数值(Estimate)。vcov=vcovHAC(dl_5)(核心参数):
vcovHAC() 是计算
HAC(异方差自相关稳健,Heteroscedasticity and Autocorrelation
Consistent) 协方差矩阵的函数。# Maybe Autoregressive Distirbuted Lagged Model would help?
# Note: Autocorrelation in residuals for ARDL lead to inconsistency of estimators!
ardl_5_all <- dynlm(d(consumption) ~ L(d(consumption),c(1:5)) + d(dpi) + L(d(dpi),c(1,2,5)), data = USA, start = c(1960, 1))
summary(ardl_5_all)##
## Time series regression with "zooreg" data:
## Start = 1960 Q1, End = 2000 Q4
##
## Call:
## dynlm(formula = d(consumption) ~ L(d(consumption), c(1:5)) +
## d(dpi) + L(d(dpi), c(1, 2, 5)), data = USA, start = c(1960,
## 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.815 -10.597 0.132 11.497 46.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.402872 3.170389 1.389 0.166915
## L(d(consumption), c(1:5))1 0.069549 0.082560 0.842 0.400871
## L(d(consumption), c(1:5))2 0.240945 0.082132 2.934 0.003863 **
## L(d(consumption), c(1:5))3 0.251578 0.074329 3.385 0.000904 ***
## L(d(consumption), c(1:5))4 -0.080380 0.075909 -1.059 0.291300
## L(d(consumption), c(1:5))5 -0.006211 0.084126 -0.074 0.941237
## d(dpi) 0.300907 0.051961 5.791 0.0000000381 ***
## L(d(dpi), c(1, 2, 5))1 0.079843 0.056471 1.414 0.159414
## L(d(dpi), c(1, 2, 5))2 -0.047063 0.055210 -0.852 0.395299
## L(d(dpi), c(1, 2, 5))5 0.041920 0.054999 0.762 0.447111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.41 on 154 degrees of freedom
## Multiple R-squared: 0.4371, Adjusted R-squared: 0.4042
## F-statistic: 13.29 on 9 and 154 DF, p-value: 0.00000000000000135
\[\Delta Y_t = \mu + \gamma_1 \Delta Y_{t-1} + \gamma_2 \Delta Y_{t-2} + \gamma_3 \Delta Y_{t-3} + \gamma_4 \Delta Y_{t-4} + \gamma_5 \Delta Y_{t-5} + \beta_0 \Delta X_t + \beta_1 \Delta X_{t-1} + \beta_2 \Delta X_{t-2} + \beta_5 \Delta X_{t-5} + \epsilon_t\]
L(d(consumption), c(1:5))(自回归部分 /
AR部分): 对应公式中的 \(\Delta
Y_{t-1}\) 到 \(\Delta
Y_{t-5}\)。 通俗解读:用过去 1 到 5
个季度消费自身的变动来作为自变量。这对应了讲义中提到通过引入因变量滞后项来彻底吸收残差“记忆力和惯性”的做法。d(dpi) + L(d(dpi), c(1,2,5))(分布式滞后部分 /
DL部分): 对应公式中的 \(\Delta
X_t\)、\(\Delta X_{t-1}\)、\(\Delta X_{t-2}\) 和 \(\Delta X_{t-5}\)。
通俗解读:依然保留了今天、昨天、前天以及大前年同季度的收入变动。ardl_5 <- dynlm(d(consumption) ~ L(d(consumption),c(2,3)) + d(dpi) + L(d(dpi),c(1)), data = USA, start = c(1960, 1))
summary(ardl_5)##
## Time series regression with "zooreg" data:
## Start = 1960 Q1, End = 2000 Q4
##
## Call:
## dynlm(formula = d(consumption) ~ L(d(consumption), c(2, 3)) +
## d(dpi) + L(d(dpi), c(1)), data = USA, start = c(1960, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.136 -11.432 -0.785 11.774 44.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.44202 2.84257 1.563 0.12012
## L(d(consumption), c(2, 3))2 0.20552 0.06809 3.018 0.00296 **
## L(d(consumption), c(2, 3))3 0.25150 0.06820 3.688 0.00031 ***
## d(dpi) 0.29335 0.04776 6.142 0.00000000627 ***
## L(d(dpi), c(1)) 0.10020 0.04790 2.092 0.03806 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.27 on 159 degrees of freedom
## Multiple R-squared: 0.4276, Adjusted R-squared: 0.4132
## F-statistic: 29.7 on 4 and 159 DF, p-value: < 0.00000000000000022
## Analysis of Variance Table
##
## Model 1: d(consumption) ~ L(d(consumption), c(1:5)) + d(dpi) + L(d(dpi),
## c(1, 2, 5))
## Model 2: d(consumption) ~ L(d(consumption), c(2, 3)) + d(dpi) + L(d(dpi),
## c(1))
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 154 58046
## 2 159 59023 -5 -976.74 0.5183 0.7622
#has this model sorted out autocorrelation issue? - Yes!!
resids_ <- as.data.frame(ardl_5$residuals)
colnames(resids_)<-c("res")
bgtest(res~1,data=resids_, order = 1)##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: res ~ 1
## LM test = 0.82863, df = 1, p-value = 0.3627
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: res ~ 1
## LM test = 1.1992, df = 2, p-value = 0.549
##
## Breusch-Godfrey test for serial correlation of order up to 3
##
## data: res ~ 1
## LM test = 1.2328, df = 3, p-value = 0.7452
##
## Breusch-Godfrey test for serial correlation of order up to 4
##
## data: res ~ 1
## LM test = 2.5794, df = 4, p-value = 0.6305
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: res ~ 1
## LM test = 2.7137, df = 5, p-value = 0.744
###################################################################################################
#Diagnostics other than Autocorrelation
###################################################################################################
# Diagnostics plots
par(mfrow=c(2,2))
plot(ardl_5)par(mfrow=c(1,1))
# 1. Residuals vs Fitted
#
# This plot shows if residuals have non-linear patterns.
# There could be a non-linear relationship between predictor variables and an outcome variable
# and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship.
# If you find equally spread residuals around a horizontal line without distinct patterns,
# that is a good indication you don’t have non-linear relationships.
# 2. Normal Q-Q
#
# This plot shows if residuals are normally distributed.
# Do residuals follow a straight line well or do they deviate severely?
# It’s good if residuals are lined well on the straight dashed line.
#residuals normality test
if(!require(interp)){install.packages("interp")}
library(interp)
jbTest(as.matrix(residuals(ardl_5)))##
## Title:
## Jarque - Bera Normality Test
##
## Test Results:
## PARAMETER:
## Sample Size: 164
## STATISTIC:
## LM: 4.04
## ALM: 4.688
## P VALUE:
## LM p-value: 0.096
## ALM p-value: 0.084
## Asymptotic: 0.133
# 3. Scale-Location
#
# It’s also called Spread-Location plot.
# This plot shows if residuals are spread equally along the ranges of predictors.
# This is how you can check the assumption of equal variance (homoscedasticity).
# It’s good if you see a horizontal line with equally (randomly) spread points.
# 4. Residuals vs Leverage
#
# This plot helps us to find influential cases (i.e., subjects) if any.
# Not all outliers are influential in linear regression analysis (whatever outliers mean).
# Even though data have extreme values, they might not be influential to determine a regression line.
# That means, the results wouldn’t be much different if we either include or exclude them from analysis.
# They follow the trend in the majority of cases and they don’t really matter; they are not influential.
# On the other hand, some cases could be very influential even if they look to be within a reasonable range of the values.
# They could be extreme cases against a regression line and can alter the results if we exclude them from analysis.
# Another way to put it is that they don’t get along with the trend in the majority of the cases.
#
# Unlike the other plots, this time patterns are not relevant.
# We watch out for outlying values at the upper right corner or at the lower right corner.
# Those spots are the places where cases can be influential against a regression line.
# Look for cases outside of a dashed line, Cook’s distance.
# When cases are outside of the Cook’s distance (meaning they have high Cook’s distance scores),
# the cases are influential to the regression results. The regression results will be altered if we exclude those cases.
#
# http://data.library.virginia.edu/diagnostic-plots/##
## Breusch-Pagan test
##
## data: ardl_5
## BP = 20.181, df = 4, p-value = 0.00046
##
## studentized Breusch-Pagan test
##
## data: ardl_5
## BP = 15.421, df = 4, p-value = 0.003902
## GVIF Df GVIF^(1/(2*Df))
## L(d(consumption), c(2, 3)) 1.235631 2 1.054319
## d(dpi) 1.116469 1 1.056631
## L(d(dpi), c(1)) 1.116288 1 1.056545
############################################################
# upload a dataset form AER package named USConsump1993
# in the dataset relation between income and expenditure
# in the USE years 1950-1993 are stored
##########################################################
data(USConsump1993)
USA<-as.zoo(USConsump1993)
# Excersie 4
# 4.1
# Asses wheter time series are stationary or not
# and prepared data to (AR)DL model for expenditure
# 4.2
# Choose the best model from (AR)DL family
# 4.3
# Perform diagnostics analysis to check wheter the chosen model fulfills assumptions