这种表达式是计量经济学和统计学中的条件期望(Conditional Expectation)公式。你可以把它想象成一个“带有过滤器的平均值”。
我们可以从最简单的形式开始,一步步增加维度。
\[E(y)\]
\[E(y|x)\]
$|$ 读作
“Given”(在……条件下)。\[E(y|y > 0)\]
\[E(y|x, y > 0)\]
, 在这里相当于逻辑上的
“AND”(且)。\[E(y | x_1, x_2, z, a < y < b)\]
你可以按照这个万能公式来翻译:
“在[条件1]、[条件2]……以及[条件N]同时成立的情况下,[目标变量]的平均值是多少?”
在你的实验里,\(E(y|x, y > 0)\) 就是在纠正了“很多人根本不出轨”这个干扰后,研究那群“真正出轨的人”受 \(x\) 影响的真实规律。
###########################################################################
# Advanced Econometrics #
# Spring semester #
# dr Marcin Chlebus, dr Rafa? Wo?niak #
# University of Warsaw, Faculty of Economic Sciences #
# #
# #
# Lab 07: Censored and Sample Selection data #
# #
###########################################################################
# setwd("C:\\Users\\MarcinChlebus\\Downloads\\AE07 - Tobit-20250401T173631Z-001\\AE07 - Tobit")
if(!require(truncreg)){install.packages("truncreg")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(censReg)){install.packages("censReg")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(AER)){install.packages("AER")}
data( "Affairs", package = "AER" )
Sys.setenv(LANG = "en")
options(scipen=100)# ------------------------------------------------------------------------------
# Tobit model
# a.k.a. censored regression
# ------------------------------------------------------------------------------
# depvar:
# number of affairs in the past year
#
# indepvars:
# age,
# number of years married,
# religiousness,
# occupation,
# own rating of the marriage.
# The dependent variable is left-censored at zero and not right-censored. Hence,
# this is a standard Tobit model.
#
# This actually an example of count data, but for the sake of this class
# let's treat it as a continuous variable.
Desc(Affairs$affairs)## ──────────────────────────────────────────────────────────────────────────────
## Affairs$affairs (numeric)
##
## length n NAs unique 0s mean meanCI'
## 601 601 0 6 451 1.46 1.19
## 100.0% 0.0% 75.0% 1.72
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 7.00 12.00
##
## range sd vcoef mad IQR skew kurt
## 12.00 3.30 2.27 0.00 0.00 2.34 4.19
##
##
## value freq perc cumfreq cumperc
## 1 0 451 75.0% 451 75.0%
## 2 1 34 5.7% 485 80.7%
## 3 2 17 2.8% 502 83.5%
## 4 3 19 3.2% 521 86.7%
## 5 7 42 7.0% 563 93.7%
## 6 12 38 6.3% 601 100.0%
##
## ' 95%-CI (classic)
##
## 1 2 3 4 5 6 7
## 113 13 47 68 204 143 13
OLS <- lm(affairs ~ age + yearsmarried + religiousness +
+ as.factor(occupation) + rating, data = Affairs)
summary(OLS)##
## Call:
## lm(formula = affairs ~ age + yearsmarried + religiousness + +as.factor(occupation) +
## rating, data = Affairs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0376 -1.6779 -0.8012 0.2413 12.8106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.59246 0.80990 6.905 0.000000000013 ***
## age -0.04874 0.02233 -2.182 0.0295 *
## yearsmarried 0.16233 0.03739 4.342 0.000016645133 ***
## religiousness -0.47305 0.11214 -4.218 0.000028476488 ***
## as.factor(occupation)2 -0.35545 0.91518 -0.388 0.6979
## as.factor(occupation)3 0.54228 0.54473 0.995 0.3199
## as.factor(occupation)4 0.63614 0.47989 1.326 0.1855
## as.factor(occupation)5 0.47770 0.36984 1.292 0.1970
## as.factor(occupation)6 0.63998 0.39989 1.600 0.1101
## as.factor(occupation)7 -0.19121 0.91933 -0.208 0.8353
## rating -0.71809 0.11989 -5.990 0.000000003656 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.094 on 590 degrees of freedom
## Multiple R-squared: 0.1348, Adjusted R-squared: 0.1201
## F-statistic: 9.191 on 10 and 590 DF, p-value: 0.00000000000003327
这段代码展示了使用 OLS (普通最小二乘法) 建立线性回归模型的过程。在 Tobit 分析中,这通常作为基准模型(Baseline),用于初步探索变量间的关系。
lm(...): 这是 R
中用于拟合线性模型的函数。它假设因变量 \(y\) 是连续分布的。affairs ~ ...:
回归方程。左侧是因变量(婚外情次数),右侧是自变量。as.factor(occupation):
关键参数。将职业变量强制转换为分类变量(Factor)。
occupation 1 作为参照基准(Reference
Group),输出结果中的 as.factor(occupation)2 到
7 分别代表其他职业等级与第一类职业的差异。data = Affairs: 指定使用加载好的
Affairs 数据集。输出结果分为四个主要部分:残差统计、系数估计、显著性标记和整体拟合指标。
展示了预测值与实际观测值之间的误差分布。
***
的变量具有极强的统计显著性(P < 0.001)。在高级计量经济学中,这份 OLS 输出通常被视为“有偏的”或“不完善的”。
截断数据的偏差: 由于数据中存在大量的零(451 个观测值为 0),回归线会被这些零值强行向下拉。这导致 OLS 估计出的系数往往比真实值更接近 0。
与 Tobit 的对比:
rating 的潜在影响系数(\(y^*\))约为 -2.28。rating 的系数仅为 -0.71。逻辑矛盾: OLS 会预测出负的出轨次数(如预测某人出轨次数为 -1.5 次),这在现实中是无意义的。而 Tobit 模型通过将预测值限制在 0 以上,解决了这一逻辑问题。
总结:
OLS serves as a baseline to ==validate the coefficient signs==, proving that age, religiousness, and marital satisfaction are robust predictors regardless of censoring. Nevertheless, the actual effect sizes require adjustment via the Tobit model to correct for the inherent bias in OLS.
OLS 在此处的意义在于确认变量的相关方向(正负号)。它告诉我们:即使不考虑截断问题,年龄、宗教和婚姻满意度依然是重要的预测因子,但具体的效应大小需要通过 Tobit 模型来修正。
censReg# Tobit model can be estimated by following command:
tobit1<-censReg(affairs~age+yearsmarried+religiousness+occupation+rating,
left=0, right=Inf, data=Affairs)
summary(tobit1)##
## Call:
## censReg(formula = affairs ~ age + yearsmarried + religiousness +
## occupation + rating, left = 0, right = Inf, data = Affairs)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 601 451 150 0
##
## Coefficients:
## Estimate Std. error t value Pr(> t)
## (Intercept) 8.17420 2.74145 2.982 0.00287 **
## age -0.17933 0.07909 -2.267 0.02337 *
## yearsmarried 0.55414 0.13452 4.119 0.0000379755 ***
## religiousness -1.68622 0.40375 -4.176 0.0000296183 ***
## occupation 0.32605 0.25442 1.282 0.20001
## rating -2.28497 0.40783 -5.603 0.0000000211 ***
## logSigma 2.10986 0.06710 31.444 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Newton-Raphson maximisation, 7 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-likelihood: -705.5762 on 7 Df
这段代码使用 censReg 包来估计 Tobit
模型(也称为截断回归模型)。该模型专门用于处理因变量在某些值处被截断(Censored)的情况,例如本例中
affairs(婚外情次数)在 0 处大量堆积。
censReg(...):
执行极大似然估计(MLE)的函数。formula = affairs ~ ...:
定义回归方程,探究年龄、婚龄、宗教信仰、职业和婚姻评分对婚外情的影响。left = 0: 关键参数,指定左侧截断点为
0。这意味着所有观测到的 0
值在模型中被视为潜在取值(Propensity)小于或等于 0 的结果。right = Inf:
指定右侧不设置截断点。data = Affairs:
指定使用的数据集。In a Tobit model, the Estimate column does ==not== represent the marginal effect on the observed number of affairs (\(y\)). Instead, it represents the effect of the explanatory variable on the latent variable (\(y^*\)), which can be understood as the underlying propensity or desire to engage in affairs.
yearsmarried (0.55414): This coefficient is positive and highly significant (***, p < 0.001). This suggests that the longer someone is married, the higher their latent propensity to have an affair. 已婚年数 (0.55414):该系数为正值且极具统计显著性(*,p < 0.001)。这表明,婚姻时间越长,发生婚外情的潜在倾向就越高。
religiousness (-1.68622): This is negative and highly significant (*). Higher levels of religiousness are associated with a lower propensity** for extramarital affairs.
rating (-2.28497): This variable (self-rating of marriage) has the largest negative coefficient and is highly significant (***). A better marriage rating significantly decreases the propensity for affairs.
age (-0.17933): This is negative and significant (*) at the 0.05 level. In this sample, older age is associated with a slightly lower propensity for affairs.
occupation (0.32605): With a p-value of 0.20, this variable is not statistically significant. In an exam, you would state that there is no evidence that occupation affects the number of affairs in this model.
A common mistake in exams is interpreting these coefficients as the “increase in the number of affairs.” You must specify that these coefficients represent the effect on the latent propensity (\(y^*\)). To find the effect on the actual number of affairs (\(y\)), one would need to calculate marginal effects (such as \(E(y|x)\) or \(E(y|x, y > 0)\)), which are typically much smaller than the Tobit coefficients themselves.
这部分描述了因变量 affairs 的数据结构,是理解 Tobit
模型必要性的基础。
这些系数代表自变量对潜在变量 \(y^*\)(即发生婚外情的倾向或强度)的影响。
*** 代表 0.001
级别的极显著,** 代表 0.01 级别的显著。这部分提供了关于模型数值计算的信息。
rating 的 -2.28)比之前 OLS 的系数(-0.71)大得多。这是因为
Tobit 系数衡量的是对潜在倾向 \(y^*\) 的影响,而不是直接对观测到的平均值
\(y\) 的影响。logSigma
的显著性验证了该分布参数估计的可靠性。 # marginal effects after censReg function
source("tobit_marginal_effects.R")
x = rbind(1, mean(Affairs$age), mean(Affairs$yearsmarried), mean(Affairs$religiousness),
mean(Affairs$occupation), mean(Affairs$rating))
me = tobit_marginal_effects(tobit1, x, dummies_indices=c())##
## Marginal effects of the tobit model
## -----------------------------------
## y* E(y|x) E(y|x,y>0) Pr(y>0|x) at X=
## age -0.1793326 -0.04192089 -0.04200050 -0.006662735 32.487521
## yearsmarried 0.5541418 0.12953651 0.12978253 0.020588003 8.177696
## religiousness -1.6862205 -0.39417189 -0.39492049 -0.062648066 3.116473
## occupation 0.3260532 0.07621840 0.07636315 0.012113840 4.194676
## rating -2.2849727 -0.53413656 -0.53515098 -0.084893477 3.931780
这里\(x\)的第一个元素,取值为 1,作用是:
在数学表达式中,线性回归方程通常写为:\[y^* = \beta_0 \cdot 1 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k\]为了让计算机能够通过矩阵乘法(\(\mathbf{x}'\boldsymbol{\beta}\))一次性计算出预测值,我们需要在自变量向量 \(\mathbf{x}\) 的最前面人为地加上一个 1。这个 1 专门用来和截距项的系数 \(\beta_0\) 相乘。
Below is the professional interpretation of the output calculated at the sample means (\(\bar{X}\)).
以下是针对样本均值点(平均水平个体)计算出的边际效应解读。
| Column | Margin | Group | Interpretation Unit |
|---|---|---|---|
| \(y^*\) | Latent | Theoretical | Index (Usually ignored in Corner Solutions) |
| \(E(y\|x)\) | Total | Entire Sample | Actual Units (e.g., Number of affairs) |
| \(E(y\|x, y>0)\) | Intensive | Active Group only | Actual Units (Add “Among those who…”) |
| \(Pr(y>0\|x)\) | Extensive | Entire Sample | Percentage Points (%) |
tobit_marginal_effects
function does not display P-values. In your analysis,
you should refer back to the original summary(tobit1)
output. Since Occupation was not significant (\(P = 0.20\)) in the model, its marginal
effects should be described as statistically
insignificant despite the numbers shown.tobit1
的输出。例如,由于 Occupation(职业)
在原模型中不显著(P =
0.20),虽然这里算出了数值,但在解读时应注明其在统计上不显著。##
## Marginal effects of the tobit model
## -----------------------------------
## y* E(y|x) E(y|x,y>0) Pr(y>0|x) at X=
## age -0.1793326 -0.0005879883 -0.01438623 -0.0002155151 20
## yearsmarried 0.5541418 0.0018168974 0.04445380 0.0006659465 3
## religiousness -1.6862205 -0.0055287106 -0.13527025 -0.0020264355 5
## occupation 0.3260532 0.0010690500 0.02615631 0.0003918384 1
## rating -2.2849727 -0.0074918748 -0.18330274 -0.0027459931 9
tobit2 <- tobit(affairs ~ age + yearsmarried + religiousness +
+ occupation + rating, left = 0, right = Inf , data = Affairs ,x=T)
summary(tobit2)##
## Call:
## tobit(formula = affairs ~ age + yearsmarried + religiousness +
## +occupation + rating, left = 0, right = Inf, data = Affairs,
## x = T)
##
## Observations:
## Total Left-censored Uncensored Right-censored
## 601 451 150 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.17420 2.74145 2.982 0.00287 **
## age -0.17933 0.07909 -2.267 0.02337 *
## yearsmarried 0.55414 0.13452 4.119 3.80e-05 ***
## religiousness -1.68622 0.40375 -4.176 2.96e-05 ***
## occupation 0.32605 0.25442 1.282 0.20001
## rating -2.28497 0.40783 -5.603 2.11e-08 ***
## Log(scale) 2.10986 0.06710 31.444 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 8.247
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4
## Log-likelihood: -705.6 on 7 Df
## Wald-statistic: 67.71 on 5 Df, p-value: 3.0718e-13
This interpretation follows the professional standards for econometric analysis, distinguishing between data structure, coefficient direction, and statistical significance.
age, yearsmarried, religiousness,
occupation, and rating on the number of
extramarital affairs.left = 0 and right = Inf define a standard
Tobit model. In this context, the value of 0 is treated
as a Corner Solution, representing a rational choice by
individuals not to engage in affairs rather than missing or “hidden”
data.The coefficients in a Tobit model represent the effect of the independent variables on the latent propensity (\(y^*\)) to have an affair. While the signs (+/-) indicate the direction of the relationship, the exact magnitude of the effect on the actual observed number of affairs (\(y\)) can only be determined through Marginal Effects analysis.
## the vector of marginal effects (at mean values and for y > 0) should be as follows.
## note the [-1] used to remove the intercept term from the final vector,
## but not from within the adjustment term.
# tobit1<-censReg(affairs~age+yearsmarried+religiousness+occupation+rating,
# left=0, right=Inf, data=Affairs)
#E(y|x,y>0)
summary(margEff(tobit1))## Marg. Eff. Std. Error t value Pr(>|t|)
## age -0.041921 0.018444 -2.2728 0.02339 *
## yearsmarried 0.129537 0.031168 4.1561 3.713e-05 ***
## religiousness -0.394172 0.093379 -4.2212 2.811e-05 ***
## occupation 0.076218 0.059472 1.2816 0.20049
## rating -0.534137 0.094896 -5.6286 2.802e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
这段代码展示了如何使用 censReg 包自带的
margEff() 函数来自动计算 Tobit
模型的边际效应。与之前手动计算不同,这个函数不仅给出了数值,还提供了标准误(Std.
Error)和 P
值,从而解决了你之前关于“边际效应显著性”的疑问。
结论:==margEff== 输出的数值与 E(y|x)
列完全一致。因此,它计算的是==全样本==(包含 0 值)的预期变化。
margEff(tobit1):
censReg 对象设计的函数。summary(...):
代码中的注释包含了非常关键的学术细节:
censReg 的语境下,作者通常指代的是模型考虑了截断点(\(y>0\) 的门槛),但计算的数值(-0.041,
0.129 等)实际上对应的是全样本期望 \(E(y|x)\)。-1),只保留了自变量的效应。这里的数值与你之前通过自定义函数 tobit_marginal_effects
计算出的 E(y|x) 列完全一致。
| 变量 | 边际效应 (Marg. Eff.) | P 值 (Pr(>|t|)) | 经济学解释 (English Interpretation) |
|---|---|---|---|
| age | -0.0419 | 0.023 * | For every 1-year increase in age, the expected number of affairs decreases by 0.04. |
| yearsmarried | +0.1295 | < 0.001 *** | Each additional year of marriage increases the expected number of affairs by 0.13. |
| religiousness | -0.3942 | < 0.001 *** | Each unit increase in religious devotion reduces the expected affairs by 0.39. |
| occupation | +0.0762 | 0.200 (Non-sig) | No statistically significant effect on the number of affairs. |
| rating | -0.5341 | < 0.001 *** | Each 1-point increase in marriage rating reduces the expected affairs by 0.53. |
这是该输出最重要的部分:
rating 对现实中出轨次数的影响,你应该引用这一表的
-0.534 及其极显著的 P 值,这比引用原始系数
-2.28 更有说服力。This table provides the Marginal Effects on the Unconditional Expected Value \(E(y|x)\).
age,
yearsmarried, religiousness, and
rating all remain statistically significant after the
marginal effect adjustment.rating) has the largest real-world impact among all
predictors, reducing the average number of affairs by over 0.5 per unit
increase.## age yearsmarried religiousness occupation rating
## -0.04192089 0.12953651 -0.39417189 0.07621840 -0.53413656
这段代码非常有技术含量,它通过矩阵运算和正态分布公式,手动复现了之前
margEff(tobit1)
自动计算出的数值。这在计量经济学考试中通常作为“原理推导”题出现。
这段代码的核心是计算 Tobit 模型全样本期望 \(E(y|x)\) 的边际效应。
根据 McDonald-Moffitt 分解,其计算公式为: \[\frac{\partial E[y|x]}{\partial x_j} = \Phi\left(\frac{X\beta}{\sigma}\right) \cdot \beta_j\]
其中:
pnorm)。apply(tobit2$x, 2, FUN=mean):
tobit2 中提取自变量矩阵(因为你之前用了
x=T)。tobit2$coef:
sum(...):
tobit2$scale:
tobit2 的 summary 中显示为 8.247)。pnorm(...):
tobit2$coef[-1]:
## age yearsmarried religiousness occupation rating
## -0.04192089 0.12953651 -0.39417189 0.07621840 -0.53413656margEff(tobit1) 的输出分毫不差。rating 的原始系数是 -2.28,计算出的概率 \(\Phi(z)\) 约为 0.2338。tobit2 (AER 包)
和 tobit1 (censReg 包)
虽然输出格式不同,但底层数学逻辑完全一致。mean
替换成特定的数值即可。margEff(tobit1), confirming that the total marginal effect
is the product of the probability of being uncensored
and the latent coefficient.# ------------------------------------------------------------------------------
# Truncated regression
# ------------------------------------------------------------------------------
# https://stats.idre.ucla.edu/stata/output/truncated-regression/
# Examples of truncated regression
# Example 1.
# A study of students in a special GATE (gifted and talented education)
# program wishes to model achievement as a function of language skills
# and the type of program in which the student is currently enrolled. A
# major concern is that students are required to have a minimum
# achievement score of 40 to enter the special program. Thus, the
# sample is truncated at an achievement score of 40.
truncdata = readRDS("truncpanel.rds")
truncdata = truncdata[which(truncdata$time==1),]
hist(truncdata$achievement)readRDS(): 读取 R 的专用数据格式
.rds。这个文件通常包含一个面板数据集(Panel Data)。time==1:
原始数据是一个面板(多年份),这里通过逻辑筛选只保留了第一个时间点(第一期)的数据。虽然这里没有显示具体的 summary,但执行完这段代码后,你的
truncdata 环境将具备以下特征:
min(truncdata$achiv),你会发现最小值一定大于或等于
40。| 特性 | Tobit (Censored) | Truncated Regression |
|---|---|---|
| 样本量 | 完整样本 (Full sample) | 缩减样本 (Reduced sample) |
| 0 点/界限点 | 观测到 0,知道他们在哪 | 完全观测不到界限以外的人 |
| 例子 | 所有已婚人士 (包含不出轨的) | 只调查进入天才班的学生 (不含落选者) |
truncreg::truncreg# Truncated regression
treg = truncreg::truncreg(achievement~lang.score+program2+program3,
data=truncdata, point=40, direction="left")
summary(treg)##
## Call:
## truncreg::truncreg(formula = achievement ~ lang.score + program2 +
## program3, data = truncdata, point = 40, direction = "left")
##
## BFGS maximization method
## 35 iterations, 0h:0m:0s
## g'(-H)^-1g = 3.97E-10
##
##
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 9.889564 0.683552 14.4679 < 2.2e-16 ***
## lang.score 0.701698 0.012003 58.4615 < 2.2e-16 ***
## program2TRUE 3.780889 0.221834 17.0438 < 2.2e-16 ***
## program3TRUE -1.224689 0.219914 -5.5689 2.563e-08 ***
## sigma 1.196063 0.064649 18.5009 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -280.54 on 5 Df
这段代码展示了如何对样本缺失(不仅仅是观测值被压低,而是低于门槛的人根本没进数据库)的数据进行建模。
truncreg::truncreg(...): 调用
==truncreg==
包中的函数。这是一种==极大似然估计==(MLE),专门用来“==补全==”因样本截断而损失的正态分布边缘。achievement ~ lang.score + ...:
回归方程。point = 40:
核心参数。明确告知模型,原始数据在分数
40 处被切断了。direction = "left":
告知模型这是左截断。即:所有分数 \(< 40\) 的学生都从样本中消失了。这个模型试图探究学生进入天才班(GATE)后的学业表现及其驱动因素。
program2、program3
与基准项目(program1)的系数,研究人员试图找出哪种教学方案对提升学业成就最有效。寻求的关系: 在排除(控制)了教学项目差异后,语言能力每提升 1 分,能为处于天才班水平的学生带来多少额外的成就加成?
| 变量 | 估计值 (Estimate) | 显著性 (P-value) | 现实含义解读 |
|---|---|---|---|
| (Intercept) | 9.889 | < 0.001 *** | 截距项,当其他变量为 0 时的理论基准值。 |
| lang.score | 0.701 | < 0.001 *** | 极显著正相关。语言分数每高 1 分,成就得分预计增加 0.7 分。 |
| program2 | 3.781 | < 0.001 *** | 与项目 1 相比,参加项目 2 的学生成就得分显著高出 3.78 分。 |
| program3 | -1.225 | < 0.001 *** | 与项目 1 相比,参加项目 3 的学生成就得分反而低了 1.22 分。 |
| sigma | 1.196 | < 0.001 *** | 误差项的标准差估计。 |
重要前提:Direct Interpretation: “A one-unit increase in language score is associated with a 0.701 unit increase in the expected achievement score, ==adjusting for the truncation at 40==.”
直接解读:==在纠正了 40 分截断偏误后==,语言分每增加 1 个单位,成就分预期增加 0.701 个单位。
由于分数低于 40 的学生被剔除了,剩余样本的分布不再是完整的正态分布,而是“残缺的正态分布”。
lang.score 的真实效应被严重低估。truncreg 通过数学公式(包含了逆米尔斯比值 Inverse Mills
Ratio 的逻辑)补偿了那些消失的低分群体的影响。Affairs 例子中,我们知道有些人出轨次数是
0,他们在样本里。achievement 例子中,低于 40
分的学生不在样本里(我们甚至不知道有多少人落选)。截断回归的任务是在“信息不全”的情况下还原真相。从结果看,该研究证明了:在天才班项目中,语言能力是成就的核心预测指标,且 Program 2 是该天才教育体系中最成功的教学分支。
# Let's compare these estiamtes with simple regression results
simpreg = lm(achievement~lang.score+program2+program3, data=truncdata)
summary(simpreg)##
## Call:
## lm(formula = achievement ~ lang.score + program2 + program3,
## data = truncdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1194 -0.8340 0.0093 0.7182 3.7928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.40932 0.64022 16.259 < 2e-16 ***
## lang.score 0.69305 0.01132 61.216 < 2e-16 ***
## program2TRUE 3.75931 0.21736 17.295 < 2e-16 ***
## program3TRUE -1.18958 0.21335 -5.576 9.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.192 on 176 degrees of freedom
## Multiple R-squared: 0.9594, Adjusted R-squared: 0.9587
## F-statistic: 1387 on 3 and 176 DF, p-value: < 2.2e-16
## [,1] [,2]
## (Intercept) 9.8895635 10.4093242
## lang.score 0.7016977 0.6930502
## program2TRUE 3.7808889 3.7593097
## program3TRUE -1.2246886 -1.1895788
这段代码的核心目的是进行对比分析。通过将截断回归(Truncated Regression)和普通最小二乘法(OLS)的系数并排排列,直观地展示“样本截断”对统计推断造成的偏误。
cbind(...) (Column Bind):
treg$coefficients[1:4]):从截断回归模型对象中提取前
4
个系数(截距项、lang.score、program2、program3)。simpreg$coefficients[1:4]):从
OLS 回归模型对象中提取对应的 4 个系数。[1:4]?:
treg 模型包含 5 个估算参数(4 个回归系数 + 1 个
sigma)。simpreg (OLS) 只包含 4 个回归系数。cbind 成功合并,必须舍弃 treg
中的 sigma 项,使两个向量长度一致。输出的表格展示了两种算法对同一组数据的不同“看法”:
[,1] (Truncated
Regression):代表了修正后的“真实”关系。它考虑了那些低于 40
分而消失的学生。[,2] (Simple
OLS):代表了“表象”关系。它只看到了班级里现有的学生,忽略了筛选机制。lang.score(Intercept)English: The purpose of this comparison is to
demonstrate Attenuation Bias. In the OLS model
(simpreg), the regression line is “pulled” toward the
horizontal by the truncated distribution. By using
truncreg, we recover the steeper, more accurate slope
(0.701) of the underlying population. Even though the difference (0.701
vs 0.693) seems small here, in many datasets, ==OLS== can lead to
completely ==misleading== conclusions if the ==truncation== is
severe.
中文解读:这个对比展示了“向零收缩”的偏误。在
OLS 模型中,由于缺失了低分数据,回归线被剩下的高分数据“拉平”了。通过
truncreg,我们找回了那个更陡峭、更准确的真实斜率(0.701)。虽然在这个例子中
0.701 和 0.693 差别不算巨大,但在截断更严重的场景下,OLS
可能会得出完全错误的结论。
truncreg results. The OLS results should
only be used as a diagnostic tool to show the existence
and direction of the truncation bias.truncreg 的结果为准。OLS
的结果在这里仅作为一个诊断工具,用来证明“样本截断”确实对数据产生了误导。## [,1] [,2]
## (Intercept) 0.68355224 0.64022325
## lang.score 0.01200273 0.01132145
## program2TRUE 0.22183370 0.21736450
## program3TRUE 0.21991401 0.21334985
输出的第二部分对比了估计的精确度:
[,2]):计算出的标准误普遍偏小(例如
lang.score 为 0.0113)。[,1]):标准误相对较大(例如
lang.score 为 0.0120)。# Analysis methods you might consider
# Below is a list of some analysis methods you may have encountered.
# Some of the methods listed are quite reasonable, while others have
# either fallen out of favor or have limitations.
#
# OLS regression ? You could analyze these data using OLS regression.
# OLS regression will not adjust the estimates of the coefficients to take into
# account the effect of truncating the sample at 40, and the coefficients may
# be severely biased.
# This can be conceptualized as a model specification error (Heckman, 1979).
#
# Truncated regression ? Truncated regression addresses the bias introduced
# when using OLS regression with truncated data.
# Note that with truncated regression, the variance of the outcome variable is
# reduced compared to the distribution that is not truncated. Also, if the lower
# part of the distribution is truncated, then the mean of the truncated variable
# will be greater than the mean from the untruncated variable; if the truncation
# is from above, the mean of the truncated variable will be less than
# the untruncated variable.
#
# https://stats.idre.ucla.edu/stata/output/truncated-regression/
# 您可以考虑的分析方法
# 以下列出了一些您可能接触过的分析方法。
# 其中一些方法相当合理,而另一些则
# 已不再流行或存在局限性。
#
# 普通最小二乘法(OLS)回归?您可以使用 OLS 回归分析这些数据。
# OLS 回归不会调整系数估计值以考虑
# 样本在 40 处截断的影响,因此系数可能
# 存在严重偏差。
# 这可以被视为模型规格错误(Heckman, 1979)。
#
# 截断回归?截断回归解决了使用 OLS 回归处理截断数据时产生的偏差。
# 请注意,在截断回归中,因变量的方差
# 相比未截断的分布会减小。此外,如果分布的
# 下半部分被截断,则截断变量的均值
# 将大于未截断变量的均值;若截断
# 来自上端,则截断变量的均值将小于
# 未截断变量的均值。
#
# https://stats.idre.ucla.edu/stata/output/truncated-regression/
# ------------------------------------------------------------------------------
# Heckman model
# a.k.a. Sample selection model
# ------------------------------------------------------------------------------
#####################################################################################
# Real data examples
#####################################################################################
#### EXAMPLE 1
# This data set was used by Mroz (1987) for analysing female labour supply.
#
# labour force participation (described by dummy lfp) is modelled by:
# a quadratic polynomial in age (age),
# family income (faminc, in 1975 dollars),
# presence of children (kids),
# and education in years (educ).
#
# The wage equation includes
# a quadratic polynomial in experience (exper),
# education in years (educ), and
# residence in a big city (city).
if(!require(sampleSelection)){install.packages("sampleSelection")}
# install.packages("sampleSelection")
# library("sampleSelection")
# getting data
data("Mroz87")
#variable for presence of children.
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)
#2-step method
greeneTS <- selection(lfp~age+I(age^2)+faminc+kids+educ,
wage~exper+I(exper^2)+educ+city,
data=Mroz87, method="2step")
summary(greeneTS) ## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 14 free parameters (df = 740)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.157e+00 1.402e+00 -2.965 0.003127 **
## age 1.854e-01 6.597e-02 2.810 0.005078 **
## I(age^2) -2.426e-03 7.735e-04 -3.136 0.001780 **
## faminc 4.580e-06 4.206e-06 1.089 0.276544
## kidsTRUE -4.490e-01 1.309e-01 -3.430 0.000638 ***
## educ 9.818e-02 2.298e-02 4.272 2.19e-05 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9712003 2.0593505 -0.472 0.637
## exper 0.0210610 0.0624646 0.337 0.736
## I(exper^2) 0.0001371 0.0018782 0.073 0.942
## educ 0.4170174 0.1002497 4.160 3.56e-05 ***
## city 0.4438379 0.3158984 1.405 0.160
## Multiple R-Squared:0.1264, Adjusted R-Squared:0.116
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio -1.098 1.266 -0.867 0.386
## sigma 3.200 NA NA NA
## rho -0.343 NA NA NA
## --------------------------------------------
#ML method
greeneTS <- selection(lfp~age+I(age^2)+faminc+kids+educ,
wage~exper+educ,
data=Mroz87)
summary(greeneTS)## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 5 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -1582.261
## 753 observations (325 censored and 428 observed)
## 11 free parameters (df = 742)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.124e+00 1.400e+00 -2.946 0.003318 **
## age 1.842e-01 6.583e-02 2.797 0.005286 **
## I(age^2) -2.410e-03 7.719e-04 -3.122 0.001867 **
## faminc 5.813e-06 4.438e-06 1.310 0.190665
## kidsTRUE -4.506e-01 1.301e-01 -3.463 0.000566 ***
## educ 9.504e-02 2.316e-02 4.104 4.52e-05 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.83348 1.13732 -1.612 0.107
## exper 0.02510 0.01864 1.347 0.178
## educ 0.47117 0.07252 6.497 1.5e-10 ***
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma 3.1171 0.1144 27.25 <2e-16 ***
## rho -0.1371 0.1613 -0.85 0.395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#### EXAMPLE 2
# The data set used in this example is based on the RAND Health Insurance
# Experiment (Newhouse 1999). It is included in sampleSelection, where it is
# called RandHIE. Cameron and Trivedi (2005, p. 553) use these data to analyse
# health expenditures.
#
# The endogenous variable of the outcome equation measures the log of the
# medical expenses of the individual (lnmeddol) and the endogenous variable of
# the selection equation indicates whether the medical expenses are positive (binexp)
#
# The regressors are:
#
# the log of the coinsurance rate plus 1 (logc = log(coins+1)),
# a dummy for individual deductible plans (idp),
# the log of the participation incentive payment (lpi),
# an artificial variable (fmde that is 0 if idp=1 and ln(max(1,mde/(0.01*coins))) otherwise (where mde is the maximum expenditure offer),
# physical limitations (physlm),
# the number of chronic diseases (disea),
# dummy variables for good (hlthg), fair (hlthf), and poor (hlthp) self-rated health (where the baseline is excellent self-rated health),
# the log of family income (linc),
# the log of family size (lfam),
# education of household head in years (educdec),
# age of the individual in years (xage),
# a dummy variable for female individuals (female),
# a dummy variable for individuals younger than 18 years (child),
# a dummy variable for female individuals younger than 18 years (fchild),
# and a dummy variable for black household heads (black)
#### 示例 2
# 本例所用的数据集基于 RAND 医疗保险
# 实验(Newhouse 1999)。该数据集包含在 sampleSelection 中,其中
# 被称为 RandHIE。Cameron 和 Trivedi(2005,第 553 页)利用这些数据分析
# 医疗支出。
#
# 结果方程中的内生变量衡量个体的医疗费用对数(lnmeddol),而
# 选择方程中的内生变量则表示医疗费用是否为正值(binexp)
#
# 自变量包括:
#
# 共付率对数加1(logc = log(coins+1)),
# 个人免赔额计划的虚拟变量(idp),
# 参与激励支付的对数 (lpi),
# 一个人工变量 (fmde,当 idp=1 时取值为 0,否则取值为 ln(max(1,mde/(0.01*coins)))(其中 mde 为最高支出限额),
# 身体限制 (physlm),
# 慢性病数量 (disea),
# 健康状况自评为良好(hlthg)、一般(hlthf)和较差(hlthp)的虚拟变量(基准为“极佳”健康状况),
# 家庭收入的对数(linc),
# 家庭规模的对数(lfam),
# 户主受教育年限(educdec),
# 个人年龄(以年为单位)(xage),
# 女性个人虚拟变量(female),
# 18岁以下个人虚拟变量(child),
# 18岁以下女性个人虚拟变量(fchild),
# 以及黑人户主虚拟变量(black)
data("RandHIE")
#as it is in the original example
subsample <- RandHIE$year == 2 & !is.na(RandHIE$educdec)
selectEq <- binexp ~ logc + idp + lpi + fmde + physlm + disea +
+ hlthg + hlthf + hlthp + linc + lfam + educdec + xage +
+ female + child + fchild + black
outcomeEq <- lnmeddol ~ logc + idp + lpi + fmde + physlm +
+ disea + hlthg + hlthf + hlthp + linc + lfam
rhieTS <- selection(selectEq, outcomeEq, data = RandHIE[subsample, ],method = "2step")
summary(rhieTS)## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 5574 observations (1293 censored and 4281 observed)
## 33 free parameters (df = 5542)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2716050 0.1877345 -1.447 0.148023
## logc -0.1187080 0.0269005 -4.413 1.04e-05 ***
## idp -0.1279483 0.0522351 -2.449 0.014337 *
## lpi 0.0283091 0.0088793 3.188 0.001439 **
## fmde 0.0075319 0.0161584 0.466 0.641142
## physlm 0.2732013 0.0743761 3.673 0.000242 ***
## disea 0.0224861 0.0035958 6.253 4.32e-10 ***
## hlthg 0.0387516 0.0438545 0.884 0.376929
## hlthf 0.1920062 0.0836688 2.295 0.021780 *
## hlthp 0.6397294 0.2126322 3.009 0.002636 **
## linc 0.0518413 0.0168128 3.083 0.002056 **
## lfam -0.0335599 0.0417280 -0.804 0.421284
## educdec 0.0363070 0.0076536 4.744 2.15e-06 ***
## xage 0.0002631 0.0021606 0.122 0.903070
## female 0.4451035 0.0542920 8.198 3.00e-16 ***
## child 0.1114890 0.0808338 1.379 0.167877
## fchild -0.4512845 0.0799219 -5.647 1.72e-08 ***
## black -0.6057367 0.0523148 -11.579 < 2e-16 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.942526 0.246164 16.016 < 2e-16 ***
## logc 0.043950 0.035114 1.252 0.210756
## idp -0.029685 0.067115 -0.442 0.658285
## lpi -0.005577 0.010776 -0.518 0.604803
## fmde -0.040913 0.019616 -2.086 0.037056 *
## physlm 0.276147 0.076504 3.610 0.000309 ***
## disea 0.015897 0.004094 3.883 0.000104 ***
## hlthg 0.256223 0.050865 5.037 4.87e-07 ***
## hlthf 0.462892 0.092478 5.005 5.75e-07 ***
## hlthp 0.839142 0.185695 4.519 6.34e-06 ***
## linc 0.068256 0.023899 2.856 0.004306 **
## lfam -0.304448 0.046788 -6.507 8.34e-11 ***
## Multiple R-Squared:0.1056, Adjusted R-Squared:0.1031
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio -1.0783 0.1787 -6.035 1.69e-09 ***
## sigma 1.5712 NA NA NA
## rho -0.6863 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
# ------------------------------------------------------------------------------
# Theoretical analyses
# ------------------------------------------------------------------------------
#################################
# Tobit-2
#################################
# install.packages("sampleSelection")
# install.packages("mvtnorm")
if(!require(sampleSelection)){install.packages("sampleSelection")}
if(!require(mvtnorm)){install.packages("mvtnorm")}
set.seed(123)
#Heckman (tobit-2) sample selection model
#with exclusion restriction (selection regression contains important regressor that is not in outcome regression)
#drawing from bivariate normal distirbution with correlation equal to -0.7
#error terms
eps <- rmvnorm(2000, c(0, 0), matrix(c(1, -0.7, -0.7, 1), 2, 2))
#selection indepvar univariate (0,1)
xs <- runif(2000)
#selection depvar ys if (xs +eps)>0 then 1 - probit
ys <- (xs + eps[, 1]) > 0
#outocme regression
xo <- runif(2000)
#latent depvar
yoX <- xo + eps[, 2]
#observable outcome if yoX is greater then 0
yo <- yoX * (ys > 0)
#Heckman (tobit-2) sample selection model
summary(selection(ys ~ xs, yo ~ xo))## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 3 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -2968.691
## 2000 observations (620 censored and 1380 observed)
## 6 free parameters (df = 1994)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01832 0.05727 0.320 0.749
## xs 1.00901 0.10569 9.547 <2e-16 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08056 0.08460 -0.952 0.341
## xo 0.92978 0.08239 11.285 <2e-16 ***
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma 0.94150 0.03986 23.621 < 2e-16 ***
## rho -0.51727 0.12925 -4.002 6.51e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 2000 observations (620 censored and 1380 observed)
## 7 free parameters (df = 1994)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01610 0.05769 0.279 0.78
## xs 1.01253 0.10659 9.499 <2e-16 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09568 0.09776 -0.979 0.328
## xo 0.92981 0.08257 11.261 <2e-16 ***
## Multiple R-Squared:0.0893, Adjusted R-Squared:0.0879
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio -0.4565 0.1711 -2.669 0.00768 **
## sigma 0.9341 NA NA NA
## rho -0.4887 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
##
## Call:
## lm(formula = yo ~ xo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.00948 -0.36259 -0.04909 0.34588 2.68772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.21721 0.03356 -6.472 1.22e-10 ***
## xo 0.63998 0.05781 11.070 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7447 on 1998 degrees of freedom
## Multiple R-squared: 0.05779, Adjusted R-squared: 0.05731
## F-statistic: 122.5 on 1 and 1998 DF, p-value: < 2.2e-16
#expected intercepts = 0 and slopes = 1
#close to the true values
#without exclusion restriction (selection regression contains important regressor that is not in outcome regression)
yoX <- xs + eps[, 2]
yo <- yoX * (ys > 0)
summary(selection(ys ~ xs, yo ~ xs))## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 13 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -2968.992
## 2000 observations (620 censored and 1380 observed)
## 6 free parameters (df = 1994)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01566 0.05760 0.272 0.786
## xs 1.01425 0.10635 9.537 <2e-16 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06452 0.14806 -0.436 0.663
## xs 0.95306 0.12790 7.452 1.37e-13 ***
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma 0.9566 0.0546 17.521 < 2e-16 ***
## rho -0.5659 0.1582 -3.577 0.000356 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#estimators still unbiased
#std. errors are larger
#the standard errors of the estimates depend on the variation in the latent selection equation
#More variation gives smaller standard errors
xs <- runif(2000, -1, 1)
ys <- (xs + eps[, 1]) > 0
yoX <- xs + eps[, 2]
yo <- yoX * (ys > 0)
summary(selection(ys ~ xs, yo ~ xs))## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 10 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -2485.797
## 2000 observations (980 censored and 1020 observed)
## 6 free parameters (df = 1994)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03981 0.02970 1.34 0.18
## xs 0.99520 0.05418 18.37 <2e-16 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2184 0.1479 -1.477 0.14
## xs 1.1358 0.1183 9.600 <2e-16 ***
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma 0.91504 0.05431 16.849 < 2e-16 ***
## rho -0.48985 0.17021 -2.878 0.00405 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#################################
# Tobit-5
#################################
set.seed(0)
# variance - covariance matrix
vc <- diag(3)
vc[lower.tri(vc)] <- c(0.9, 0.5, 0.1)
vc[upper.tri(vc)] <- vc[lower.tri(vc)]
vc## [,1] [,2] [,3]
## [1,] 1.0 0.9 0.5
## [2,] 0.9 1.0 0.1
## [3,] 0.5 0.1 1.0
# error terms from multivariate normal distributions
eps <- rmvnorm(500, c(0, 0, 0), vc)
#selection regression
xs <- runif(500)
ys <- (xs + eps[, 1]) > 0
#outcome regression 1
xo1 <- runif(500)
yo1 <- xo1 + eps[, 2]
#outcome regression 1
xo2 <- runif(500)
yo2 <- xo2 + eps[, 3]
#tobit-5
summary(selection(ys ~ xs, list(yo1 ~ xo1, yo2 ~ xo2)))## --------------------------------------------
## Tobit 5 model (switching regression model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 11 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -895.8201
## 500 observations: 172 selection 1 (FALSE) and 328 selection 2 (TRUE)
## 10 free parameters (df = 490)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1550 0.1051 -1.474 0.141
## xs 1.1408 0.1785 6.390 3.86e-10 ***
## Outcome equation 1:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02708 0.16395 0.165 0.869
## xo1 0.83959 0.14968 5.609 3.4e-08 ***
## Outcome equation 2:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1583 0.1885 0.840 0.401
## xo2 0.8375 0.1707 4.908 1.26e-06 ***
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma1 0.93191 0.09211 10.118 <2e-16 ***
## sigma2 0.90697 0.04434 20.455 <2e-16 ***
## rho1 0.88988 0.05353 16.623 <2e-16 ***
## rho2 0.17695 0.33139 0.534 0.594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
## --------------------------------------------
## Tobit 5 model (switching regression model)
## 2-step Heckman / heckit estimation
## 500 observations: 172 selection 1 (FALSE) and 328 selection 2 (TRUE)
## 12 free parameters (df = 490)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1588 0.1116 -1.423 0.155
## xs 1.1478 0.1978 5.803 1.17e-08 ***
## Outcome equation 1:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1289 NA NA NA
## xo1 0.8430 NA NA NA
## Multiple R-Squared:0.2486, Adjusted R-Squared:0.2397
## Outcome equation 2:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1731 NA NA NA
## xo2 0.8374 NA NA NA
## Multiple R-Squared:0.0694, Adjusted R-Squared:0.0636
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio1 -0.9318 NA NA NA
## invMillsRatio2 0.1326 NA NA NA
## sigma1 0.8970 NA NA NA
## sigma2 0.9065 NA NA NA
## rho1 0.9900 NA NA NA
## rho2 0.1463 NA NA NA
## --------------------------------------------