1 Affairs Analysis

1.1 条件期望(Conditional Expectation)表达式

这种表达式是计量经济学和统计学中的条件期望(Conditional Expectation)公式。你可以把它想象成一个“带有过滤器的平均值”

我们可以从最简单的形式开始,一步步增加维度。


1.1.1 第一层:最基础的期望

\[E(y)\]

  • 语法\(E\) 代表 Expectation(期望/平均值),括号里的是我们要观察的对象。
  • 含义\(y\) 的整体平均值。
  • 例子:全校学生的平均成绩。

1.1.2 第二层:给定自变量的期望(回归的基础)

\[E(y|x)\]

  • 语法:竖线 $|$ 读作 “Given”(在……条件下)
  • 含义:当我们知道自变量 \(x\) 的取值时,\(y\) 的预期平均值。
  • 例子给定学习时间为 10 小时(\(x\)),学生的预期成绩(\(y\))是多少?
    • 这就是普通 OLS 回归线上的每一个点。

1.1.3 第三层:带有限制条件的期望(截断/选择)

\[E(y|y > 0)\]

  • 语法:条件部分变成了对 \(y\) 自身的限制。
  • 含义:只看那些 \(y\) 大于 0 的样本,它们的平均值是多少。
  • 例子:只看那些有过婚外情的人\(y > 0\)),他们的平均出轨次数是多少?(排除掉了那些 0 次的人)。

1.1.4 第四层:多变量、多条件的综合(你问的那个)

\[E(y|x, y > 0)\]

  • 语法:逗号 , 在这里相当于逻辑上的 “AND”(且)
  • 含义:同时满足两个条件的平均值:
    1. 个体的特征是 \(x\)(比如:年龄 30 岁、结婚 5 年)。
    2. 个体必须在“门槛”之外(比如:确实发生过出轨行为)。
  • 现实意义:这在 Tobit 模型中被称为强度边际(Intensive Margin)。它回答的是:“对于那些已经参与了某项活动的人(\(y>0\)),特征 \(x\) 的变化如何影响他们的参与程度?”

1.1.5 第五层:更复杂的表达式(多个变量与范围)

\[E(y | x_1, x_2, z, a < y < b)\]

  • 语法拆解
    • \(x_1, x_2\): 多个控制变量(如年龄、收入)。
    • \(z\): 一个分类变量(如是否受过高等教育)。
    • \(a < y < b\): 双侧截断(比如只研究年收入在 10 万到 50 万之间的人)。
  • 含义:在控制了 \(x_1, x_2, z\) 的情况下,且样本被限制在 \((a, b)\) 区间内时,\(y\) 的平均水平。

1.1.6 总结:如何读懂它?

你可以按照这个万能公式来翻译:

“在[条件1]、[条件2]……以及[条件N]同时成立的情况下,[目标变量]的平均值是多少?”

  • 竖线左边:你想研究的结果。
  • 竖线右边:你想设置的限制。
    • 如果是变量名称(如 \(x\)),意味着“根据个体的特征来分类”。
    • 如果是不等式(如 \(y>0\)),意味着“把不符合的人从样本中踢出去”。

在你的实验里,\(E(y|x, y > 0)\) 就是在纠正了“很多人根本不出轨”这个干扰后,研究那群“真正出轨的人”受 \(x\) 影响的真实规律。

###########################################################################
#       Advanced Econometrics                                                 #
#   Spring semester                                                       #
#   dr Marcin Chlebus, dr Rafa? Wo?niak                                   #
#   University of Warsaw, Faculty of Economic Sciences                    #
#                                                                         #
#                                                                         #
#                 Lab 07: Censored and Sample Selection data              #
#                                                                         #
###########################################################################
# setwd("C:\\Users\\MarcinChlebus\\Downloads\\AE07 - Tobit-20250401T173631Z-001\\AE07 - Tobit")

if(!require(truncreg)){install.packages("truncreg")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(censReg)){install.packages("censReg")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(AER)){install.packages("AER")}

data( "Affairs", package = "AER" )
  
Sys.setenv(LANG = "en")
options(scipen=100)

1.2 OLS as comparison

# ------------------------------------------------------------------------------
# Tobit model
# a.k.a. censored regression
# ------------------------------------------------------------------------------

# depvar:
# number of affairs in the past year 
# 
# indepvars:
# age, 
# number of years married, 
# religiousness, 
# occupation, 
# own rating of the marriage. 

# The dependent variable is left-censored at zero and not right-censored. Hence, 
# this is a standard Tobit model. 
# 
# This actually an example of count data, but for the sake of this class
# let's treat it as a continuous variable.
  
Desc(Affairs$affairs)
## ────────────────────────────────────────────────────────────────────────────── 
## Affairs$affairs (numeric)
## 
##   length       n    NAs  unique     0s  mean  meanCI'
##      601     601      0       6    451  1.46    1.19
##           100.0%   0.0%          75.0%          1.72
##                                                     
##      .05     .10    .25  median    .75   .90     .95
##     0.00    0.00   0.00    0.00   0.00  7.00   12.00
##                                                     
##    range      sd  vcoef     mad    IQR  skew    kurt
##    12.00    3.30   2.27    0.00   0.00  2.34    4.19
##                                                     
## 
##    value  freq   perc  cumfreq  cumperc
## 1      0   451  75.0%      451    75.0%
## 2      1    34   5.7%      485    80.7%
## 3      2    17   2.8%      502    83.5%
## 4      3    19   3.2%      521    86.7%
## 5      7    42   7.0%      563    93.7%
## 6     12    38   6.3%      601   100.0%
## 
## ' 95%-CI (classic)

#first OLS
table(Affairs$occupation)
## 
##   1   2   3   4   5   6   7 
## 113  13  47  68 204 143  13
OLS <- lm(affairs ~ age + yearsmarried + religiousness +
            + as.factor(occupation) + rating, data = Affairs)
summary(OLS)
## 
## Call:
## lm(formula = affairs ~ age + yearsmarried + religiousness + +as.factor(occupation) + 
##     rating, data = Affairs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0376 -1.6779 -0.8012  0.2413 12.8106 
## 
## Coefficients:
##                        Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)             5.59246    0.80990   6.905 0.000000000013 ***
## age                    -0.04874    0.02233  -2.182         0.0295 *  
## yearsmarried            0.16233    0.03739   4.342 0.000016645133 ***
## religiousness          -0.47305    0.11214  -4.218 0.000028476488 ***
## as.factor(occupation)2 -0.35545    0.91518  -0.388         0.6979    
## as.factor(occupation)3  0.54228    0.54473   0.995         0.3199    
## as.factor(occupation)4  0.63614    0.47989   1.326         0.1855    
## as.factor(occupation)5  0.47770    0.36984   1.292         0.1970    
## as.factor(occupation)6  0.63998    0.39989   1.600         0.1101    
## as.factor(occupation)7 -0.19121    0.91933  -0.208         0.8353    
## rating                 -0.71809    0.11989  -5.990 0.000000003656 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.094 on 590 degrees of freedom
## Multiple R-squared:  0.1348, Adjusted R-squared:  0.1201 
## F-statistic: 9.191 on 10 and 590 DF,  p-value: 0.00000000000003327

1.2.1 代码与参数分析

这段代码展示了使用 OLS (普通最小二乘法) 建立线性回归模型的过程。在 Tobit 分析中,这通常作为基准模型(Baseline),用于初步探索变量间的关系。

  • lm(...): 这是 R 中用于拟合线性模型的函数。它假设因变量 \(y\) 是连续分布的。
  • affairs ~ ...: 回归方程。左侧是因变量(婚外情次数),右侧是自变量。
  • as.factor(occupation): 关键参数。将职业变量强制转换为分类变量(Factor)
    • R 会自动创建一组虚拟变量(Dummy Variables)。
    • 它将 occupation 1 作为参照基准(Reference Group),输出结果中的 as.factor(occupation)27 分别代表其他职业等级与第一类职业的差异。
  • data = Affairs: 指定使用加载好的 Affairs 数据集。

1.2.2 回归输出详述

输出结果分为四个主要部分:残差统计、系数估计、显著性标记和整体拟合指标。

1.2.2.1 残差统计 (Residuals)

展示了预测值与实际观测值之间的误差分布。

  • Min / Max: 最小值为 -5.03,最大值为 12.81。
  • Median: 中位数为 -0.8012。
  • 分析: 由于因变量存在大量 0(且不能为负),OLS 的预测值会出现负数,导致残差分布呈现明显的非对称性(偏态),这预示了 OLS 在处理此类数据时存在局限。

1.2.2.2 系数表 (Coefficients)

  • Estimate (估计值): 衡量自变量每变动一个单位,因变量平均变动的数值。
    • age (-0.048): 年龄每增加 1 岁,年均婚外情次数减少约 0.05 次。
    • yearsmarried (0.162): 婚龄每增加 1 年,婚外情次数增加 0.16 次。
    • religiousness (-0.473): 宗教信仰每提升 1 级,出轨次数减少约 0.47 次。
    • rating (-0.718): 婚姻评分每提高 1 分,出轨次数显著减少约 0.72 次。
  • as.factor(occupation): 所有的职业虚拟变量 P 值均远大于 0.05(如 0.69, 0.31 等)。这说明在 OLS 模型下,职业类型对婚外情次数没有统计学意义上的显著影响。
  • Pr(>|t|) (P 值): 标记为 *** 的变量具有极强的统计显著性(P < 0.001)。

1.2.2.3 模型拟合指标

  • Multiple R-squared (0.1348): \(R^2\) 值为 0.1348,意味着模型中的变量解释了婚外情次数变动的 13.48%。
  • F-statistic: P 值极小(趋近于 0),说明模型整体是显著的,即至少有一个自变量对因变量有解释力。

1.2.3 计量经济学注释与评价

在高级计量经济学中,这份 OLS 输出通常被视为“有偏的”或“不完善的”。

  • 截断数据的偏差: 由于数据中存在大量的零(451 个观测值为 0),回归线会被这些零值强行向下拉。这导致 OLS 估计出的系数往往比真实值更接近 0。

  • 与 Tobit 的对比:

    • 在之前的 Tobit 模型中,rating 的潜在影响系数(\(y^*\))约为 -2.28。
    • 而在 OLS 中,rating 的系数仅为 -0.71。
    • 这说明 OLS 严重低估了婚姻质量对出轨行为的抑制作用。
  • 逻辑矛盾: OLS 会预测出负的出轨次数(如预测某人出轨次数为 -1.5 次),这在现实中是无意义的。而 Tobit 模型通过将预测值限制在 0 以上,解决了这一逻辑问题。

  • 总结:

OLS serves as a baseline to ==validate the coefficient signs==, proving that age, religiousness, and marital satisfaction are robust predictors regardless of censoring. Nevertheless, the actual effect sizes require adjustment via the Tobit model to correct for the inherent bias in OLS.

OLS 在此处的意义在于确认变量的相关方向(正负号)。它告诉我们:即使不考虑截断问题,年龄、宗教和婚姻满意度依然是重要的预测因子,但具体的效应大小需要通过 Tobit 模型来修正。

1.3 Analysing with censReg

# Tobit model can be estimated by following command:
tobit1<-censReg(affairs~age+yearsmarried+religiousness+occupation+rating,
                left=0, right=Inf, data=Affairs)
summary(tobit1)
## 
## Call:
## censReg(formula = affairs ~ age + yearsmarried + religiousness + 
##     occupation + rating, left = 0, right = Inf, data = Affairs)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            601            451            150              0 
## 
## Coefficients:
##               Estimate Std. error t value              Pr(> t)    
## (Intercept)    8.17420    2.74145   2.982              0.00287 ** 
## age           -0.17933    0.07909  -2.267              0.02337 *  
## yearsmarried   0.55414    0.13452   4.119         0.0000379755 ***
## religiousness -1.68622    0.40375  -4.176         0.0000296183 ***
## occupation     0.32605    0.25442   1.282              0.20001    
## rating        -2.28497    0.40783  -5.603         0.0000000211 ***
## logSigma       2.10986    0.06710  31.444 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Newton-Raphson maximisation, 7 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-likelihood: -705.5762 on 7 Df

1.3.1 代码与参数说明

这段代码使用 censReg 包来估计 Tobit 模型(也称为截断回归模型)。该模型专门用于处理因变量在某些值处被截断(Censored)的情况,例如本例中 affairs(婚外情次数)在 0 处大量堆积。

  • censReg(...): 执行极大似然估计(MLE)的函数。
  • formula = affairs ~ ...: 定义回归方程,探究年龄、婚龄、宗教信仰、职业和婚姻评分对婚外情的影响。
  • left = 0: 关键参数,指定左侧截断点为 0。这意味着所有观测到的 0 值在模型中被视为潜在取值(Propensity)小于或等于 0 的结果。
  • right = Inf: 指定右侧不设置截断点。
  • data = Affairs: 指定使用的数据集。

1.3.2 观测值分布 (Observations)

这部分描述了因变量 affairs 的数据结构,是理解 Tobit 模型必要性的基础。

  • Total (601): 总样本量为 601 个观测值。
  • Left-censored (451): 有 451 个观测值为 0。这代表了“角点解”(Corner Solution),即绝大多数人报告没有婚外情。
  • Uncensored (150): 有 150 个观测值为正数,即实际发生了婚外情。
  • Right-censored (0): 没有样本在右侧被截断。

1.3.3 系数估计 (Coefficients)

这些系数代表自变量对潜在变量 \(y^*\)(即发生婚外情的倾向或强度)的影响。

  • Estimate (估计值):
    • age (-0.179): 显著负相关。年龄每增加一岁,潜在的出轨倾向显著下降。
    • yearsmarried (0.554): 极显著正相关。结婚时间越长,潜在倾向越高。
    • religiousness (-1.686): 极显著负相关。宗教信仰越强,潜在倾向越低。
    • occupation (0.326): P 值为 0.20,在统计上不显著。说明在这个模型设定下,职业等级对潜在倾向没有显著影响。
    • rating (-2.285): 极显著负相关。对婚姻的自我评分每提高 1 分,潜在倾向大幅下降。
  • logSigma (2.110): 这是模型误差项标准差 \(\sigma\) 的自然对数值(\(\ln \sigma\))。
    • 要获得原始的 \(\sigma\)(用于计算边际效应),需要计算 \(\exp(2.10986) \approx 8.247\)
  • Pr(> t) (P 值): 展示了统计显著性。*** 代表 0.001 级别的极显著,** 代表 0.01 级别的显著。

1.3.4 模型收敛与似然值 (Optimization & Likelihood)

这部分提供了关于模型数值计算的信息。

  • Newton-Raphson maximisation: 指出模型是使用牛顿-拉夫逊迭代算法通过最大化对数似然函数来求解的。
  • 7 iterations: 算法经过 7 次迭代后找到了最优解。
  • Return code 1: 这是一个良好的信号,表示梯度接近于零,模型已经成功收敛(Convergence)。
  • Log-likelihood (-705.5762): 模型最终的对数似然值。该值用于计算似然比检验(LR Test)或 AIC/BIC 等模型比较指标。其绝对值越小(即数值越大),通常代表模型对数据的拟合优度越好。

1.3.5 核心注释

  • Tobit 系数的直观理解: 这里的系数(如 rating 的 -2.28)比之前 OLS 的系数(-0.71)大得多。这是因为 Tobit 系数衡量的是对潜在倾向 \(y^*\) 的影响,而不是直接对观测到的平均值 \(y\) 的影响。
  • 误差项假设: 模型假设误差项服从均值为 0、标准差为 \(\sigma\) 的正态分布。logSigma 的显著性验证了该分布参数估计的可靠性。

1.4 Marginal Effects

1.4.1 对均值的边际效应

  # marginal effects after censReg function
source("tobit_marginal_effects.R")
x = rbind(1, mean(Affairs$age), mean(Affairs$yearsmarried), mean(Affairs$religiousness), 
          mean(Affairs$occupation), mean(Affairs$rating))
me = tobit_marginal_effects(tobit1, x, dummies_indices=c())
## 
##  Marginal effects of the tobit model 
##  ----------------------------------- 
##                       y*      E(y|x)  E(y|x,y>0)    Pr(y>0|x)     at X=
## age           -0.1793326 -0.04192089 -0.04200050 -0.006662735 32.487521
## yearsmarried   0.5541418  0.12953651  0.12978253  0.020588003  8.177696
## religiousness -1.6862205 -0.39417189 -0.39492049 -0.062648066  3.116473
## occupation     0.3260532  0.07621840  0.07636315  0.012113840  4.194676
## rating        -2.2849727 -0.53413656 -0.53515098 -0.084893477  3.931780

这里\(x\)的第一个元素,取值为 1,作用是:

在数学表达式中,线性回归方程通常写为:\[y^* = \beta_0 \cdot 1 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k\]为了让计算机能够通过矩阵乘法(\(\mathbf{x}'\boldsymbol{\beta}\))一次性计算出预测值,我们需要在自变量向量 \(\mathbf{x}\) 的最前面人为地加上一个 1。这个 1 专门用来和截距项的系数 \(\beta_0\) 相乘。

1.4.2 interpretation of Tobit Marginal Effects | Tobit 模型边际效应解读

Below is the professional interpretation of the output calculated at the sample means (\(\bar{X}\)).

以下是针对样本均值点(平均水平个体)计算出的边际效应解读。


1.4.3 \(y^*\) : Latent Variable (Propensity) | 潜在变量(倾向)

  • English: This column represents the effect of independent variables on the latent propensity to have an affair. In a Corner Solution model (like this one), we generally do not interpret this column because \(y^*\) is an abstract index of “desire” or “tendency” rather than an observable count.
  • 中文解读: 这一列代表自变量对潜在出轨倾向的影响。由于本案例属于角点解(Corner Solution)模型(即 0 是理性选择而非数据丢失),我们通常不解读这一列。它只是一个抽象的“欲望指数”,并非实际发生的次数。
  • ==propensity==: /prəˈpen.sə.ti/ the fact that someone is likely to behave in a particular way, especially a bad way

1.4.4 \(E(y|x)\) : Unconditional Expected Value (Total Effect) | 无条件期望值(总效应)

  • English: This is the most important column for policy or social analysis. It measures the change in the actual observed number of affairs for the entire population (including those with 0 affairs).
    • Years Married: For each additional year of marriage, the expected number of affairs increases by 0.13.
    • Religiousness: For each 1-unit increase in religiousness, the expected number of affairs decreases by 0.39.
  • 中文解读: 这是社会科学研究中最核心的指标。它衡量自变量变化对全社会平均水平(包括不出轨和出轨的人)的影响。
    • 婚姻年限:每多结婚一年,预期出轨次数平均增加 0.13 次
    • 宗教信仰:宗教信仰每提升一个等级,预期出轨次数平均减少 0.39 次

1.4.5 \(E(y|x, y>0)\) : Conditional Expected Value (Intensity) | 条件期望值(强度效应)

  • English: This measures the Intensity Margin. It only considers the sub-group of people who already have affairs (\(y > 0\)).
    • Interpretation Rule: You must use the qualifier “Among those who have had affairs…”.
    • Rating: Among those who have had affairs, a 1-unit increase in marriage rating reduces the number of affairs by 0.54.
  • 中文解读: 这一列衡量的是强度边际。它仅针对已经有过婚外情的群体
    • 解读规则:必须加上限定语“在已经出轨的人群中……”。
    • 婚姻评分:在已经出轨的人群中,婚姻评分每提高 1 分,出轨次数平均减少 0.54 次

1.4.6 \(Pr(y>0|x)\) : Probability of Positive Outcome (Participation) | 非零概率(参与效应)

  • English: This measures the Extensive Margin (the “threshold” effect). It shows the change in the probability of having any affair at all.
    • Calculation: Multiply the value by 100 to express it in percentage points.
    • Age: Each additional year of age reduces the probability of having an affair by 0.67 percentage points.
    • Rating: Each 1-unit increase in marriage rating reduces the probability of having an affair by 8.49 percentage points.
  • 中文解读: 这一列衡量的是外延边际(入门门槛)。它显示了从“完全不出轨”转变为“开始出轨”的概率变化
    • 计算方法:必须将数值乘以 100,以百分点为单位。
    • 年龄:年龄每增加一岁,发生婚外情的概率降低 0.67 个百分点
    • 婚姻评分:婚姻评分每提高 1 分,发生婚外情的概率显著降低 8.49 个百分点
  • Note: While \(Pr(y>0|x)\) describes the transition from \(y=0\) to \(y>0\), it is calculated for the ==entire population==. It predicts the likelihood that any individual with characteristics \(X\) will cross the threshold into participation.

1.4.7 Summary Table for Exam | 考试速查表

Column Margin Group Interpretation Unit
\(y^*\) Latent Theoretical Index (Usually ignored in Corner Solutions)
\(E(y\|x)\) Total Entire Sample Actual Units (e.g., Number of affairs)
\(E(y\|x, y>0)\) Intensive Active Group only Actual Units (Add “Among those who…”)
\(Pr(y>0\|x)\) Extensive Entire Sample Percentage Points (%)

1.4.8 Quick Note on Significance | 关于显著性的说明

  • English: The tobit_marginal_effects function does not display P-values. In your analysis, you should refer back to the original summary(tobit1) output. Since Occupation was not significant (\(P = 0.20\)) in the model, its marginal effects should be described as statistically insignificant despite the numbers shown.
  • 中文说明: 这个函数不显示 P 值。在解读时,你需要查看原始模型 tobit1 的输出。例如,由于 Occupation(职业) 在原模型中不显著(P = 0.20),虽然这里算出了数值,但在解读时应注明其在统计上不显著

1.4.9 对给定的具体值的边际效应(1, 20, 3, 5, 1, 9)

x = c(1, 20, 3, 5, 1, 9)
me = tobit_marginal_effects(tobit1, x, dummies_indices=c())
## 
##  Marginal effects of the tobit model 
##  ----------------------------------- 
##                       y*        E(y|x)  E(y|x,y>0)     Pr(y>0|x) at X=
## age           -0.1793326 -0.0005879883 -0.01438623 -0.0002155151    20
## yearsmarried   0.5541418  0.0018168974  0.04445380  0.0006659465     3
## religiousness -1.6862205 -0.0055287106 -0.13527025 -0.0020264355     5
## occupation     0.3260532  0.0010690500  0.02615631  0.0003918384     1
## rating        -2.2849727 -0.0074918748 -0.18330274 -0.0027459931     9
tobit2 <- tobit(affairs ~ age + yearsmarried + religiousness +
                     + occupation + rating, left = 0, right = Inf , data = Affairs ,x=T)
summary(tobit2)
## 
## Call:
## tobit(formula = affairs ~ age + yearsmarried + religiousness + 
##     +occupation + rating, left = 0, right = Inf, data = Affairs, 
##     x = T)
## 
## Observations:
##          Total  Left-censored     Uncensored Right-censored 
##            601            451            150              0 
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.17420    2.74145   2.982  0.00287 ** 
## age           -0.17933    0.07909  -2.267  0.02337 *  
## yearsmarried   0.55414    0.13452   4.119 3.80e-05 ***
## religiousness -1.68622    0.40375  -4.176 2.96e-05 ***
## occupation     0.32605    0.25442   1.282  0.20001    
## rating        -2.28497    0.40783  -5.603 2.11e-08 ***
## Log(scale)     2.10986    0.06710  31.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Scale: 8.247 
## 
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4 
## Log-likelihood: -705.6 on 7 Df
## Wald-statistic: 67.71 on 5 Df, p-value: 3.0718e-13

1.4.10 Refined Interpretation of Tobit Model (tobit2)

This interpretation follows the professional standards for econometric analysis, distinguishing between data structure, coefficient direction, and statistical significance.


1.4.11 Model Specification and Data Structure

  • Objective: This model examines the impact of age, yearsmarried, religiousness, occupation, and rating on the number of extramarital affairs.
  • Censoring Logic: The parameters left = 0 and right = Inf define a standard Tobit model. In this context, the value of 0 is treated as a Corner Solution, representing a rational choice by individuals not to engage in affairs rather than missing or “hidden” data.
  • Sample Distribution:
    • Total Observations: 601.
    • Left-censored (0): 451 (approx. 75% of the sample).
    • Uncensored (>0): 150.
  • Methodological Justification: The high concentration of observations at the zero boundary suggests that a simple OLS regression would yield biased and inconsistent estimates. The Tobit model is the appropriate choice to account for this non-normal, censored distribution.

1.4.12 Analysis of Coefficients

The coefficients in a Tobit model represent the effect of the independent variables on the latent propensity (\(y^*\)) to have an affair. While the signs (+/-) indicate the direction of the relationship, the exact magnitude of the effect on the actual observed number of affairs (\(y\)) can only be determined through Marginal Effects analysis.

1.4.12.1 Significant Variables

  • Marriage Rating (rating):
    • Estimate: -2.28497.
    • Significance: Extremely significant (\(P < 0.001\)).
    • Interpretation: This is the strongest deterrent in the model. Higher marital satisfaction significantly reduces the latent propensity to have an affair.
  • Religiousness (religiousness):
    • Estimate: -1.68622.
    • Significance: Extremely significant (\(P < 0.001\)).
    • Interpretation: Individuals with higher levels of religious devotion are significantly less likely to engage in extramarital affairs.
  • Years Married (yearsmarried):
    • Estimate: +0.55414.
    • Significance: Extremely significant (\(P < 0.001\)).
    • Interpretation: As the duration of the marriage increases, the latent propensity to have an affair tends to rise significantly.
  • Age (age):
    • Estimate: -0.17933.
    • Significance: Significant at the 0.05 threshold (\(P = 0.023\)).
    • Interpretation: There is a negative correlation between age and the propensity for affairs; older individuals in the sample generally show a lower tendency toward this behavior.

1.4.12.2 Non-Significant Variables

  • Occupation (occupation):
    • Estimate: +0.32605.
    • Significance: Not significant (\(P = 0.200\)).
    • Interpretation: At the standard 5% significance level, occupation does not have a statistically measurable impact on the number of affairs.

1.4.13 Model Diagnostics

  • Scale Parameter: The Scale is 8.247 (with Log(scale) at 2.10986). This represents the estimated standard deviation (\(\sigma\)) of the error term, which is crucial for calculating the marginal effects.
  • Wald-statistic: The value of 67.71 with a p-value of \(3.07 \times 10^{-13}\) indicates that the model is globally significant. This means the independent variables collectively provide a meaningful explanation for the variance in the dependent variable.
  • Log-likelihood: The value of -705.6 reflects the goodness-of-fit achieved via the Newton-Raphson optimization after 4 iterations.

1.5 margEff 函数计算 Tobit 模型的边际效应

## the vector of marginal effects (at mean values and for y > 0) should be as follows.
## note the [-1] used to remove the intercept term from the final vector, 
##  but not from within the adjustment term. 
# tobit1<-censReg(affairs~age+yearsmarried+religiousness+occupation+rating,
#                 left=0, right=Inf, data=Affairs)

#E(y|x,y>0)
summary(margEff(tobit1))
##               Marg. Eff. Std. Error t value  Pr(>|t|)    
## age            -0.041921   0.018444 -2.2728   0.02339 *  
## yearsmarried    0.129537   0.031168  4.1561 3.713e-05 ***
## religiousness  -0.394172   0.093379 -4.2212 2.811e-05 ***
## occupation      0.076218   0.059472  1.2816   0.20049    
## rating         -0.534137   0.094896 -5.6286 2.802e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.5.1 代码与输出分析

这段代码展示了如何使用 censReg 包自带的 margEff() 函数来自动计算 Tobit 模型的边际效应。与之前手动计算不同,这个函数不仅给出了数值,还提供了标准误(Std. Error)和 P 值,从而解决了你之前关于“边际效应显著性”的疑问。

结论:==margEff== 输出的数值与 E(y|x) 列完全一致。因此,它计算的是==全样本==(包含 0 值)的预期变化。


1.5.2 代码与参数说明

  • margEff(tobit1):
    • 这是一个专门为 censReg 对象设计的函数。
    • 默认行为:如果不额外指定参数,该函数默认计算的是 无条件期望值的边际效应 \(E(y|x)\)(即所有样本的总效应),并且是在样本均值点(at mean values)处计算。
  • summary(...):
    • 对边际效应结果进行总结,输出类似回归表的格式,包含系数、标准误、t 值和 P 值。

1.5.3 注释部分解读 (The Comments)

代码中的注释包含了非常关键的学术细节:

  • “at mean values”: 指出这些效应是针对“平均水平的个体”计算的。
  • “for y > 0”: 这是一个常见的学术标注误区。在 censReg 的语境下,作者通常指代的是模型考虑了截断点(\(y>0\) 的门槛),但计算的数值(-0.041, 0.129 等)实际上对应的是全样本期望 \(E(y|x)\)
  • “note the [-1]…”:
    • 这是一个编程细节。在计算边际效应时,截距项(Intercept)必须参与中间过程的数学运算(调整项),但截距项本身并没有“边际效应”(因为常数项不随 \(x\) 变化)。
    • 因此,最终输出的结果中去掉了截距项(-1),只保留了自变量的效应。

1.5.4 输出结果解读 (The Output)

这里的数值与你之前通过自定义函数 tobit_marginal_effects 计算出的 E(y|x)完全一致

变量 边际效应 (Marg. Eff.) P 值 (Pr(>|t|)) 经济学解释 (English Interpretation)
age -0.0419 0.023 * For every 1-year increase in age, the expected number of affairs decreases by 0.04.
yearsmarried +0.1295 < 0.001 *** Each additional year of marriage increases the expected number of affairs by 0.13.
religiousness -0.3942 < 0.001 *** Each unit increase in religious devotion reduces the expected affairs by 0.39.
occupation +0.0762 0.200 (Non-sig) No statistically significant effect on the number of affairs.
rating -0.5341 < 0.001 *** Each 1-point increase in marriage rating reduces the expected affairs by 0.53.

1.5.5 核心价值:解决了 P 值问题

这是该输出最重要的部分:

  1. 显著性继承:你可以发现,边际效应的显著性(星号数量)与原始 Tobit 模型完全一致。这验证了我们之前的说法:如果原始系数显著,其边际效应通常也显著。
  2. 标准误的计算:函数内部使用了 Delta Method 计算了边际效应的标准误(Std. Error)。
  3. 结论证据:如果你在答题时需要证明 rating 对现实中出轨次数的影响,你应该引用这一表的 -0.534 及其极显著的 P 值,这比引用原始系数 -2.28 更有说服力。

1.5.6 总结 (Summary for Lab/Exam)

This table provides the Marginal Effects on the Unconditional Expected Value \(E(y|x)\).

  • Significance: age, yearsmarried, religiousness, and rating all remain statistically significant after the marginal effect adjustment.
  • Reliability: Since this output includes Standard Errors and P-values, it is more robust for academic reporting than simply reporting point estimates.
  • Key Finding: Marital satisfaction (rating) has the largest real-world impact among all predictors, reducing the average number of affairs by over 0.5 per unit increase.

1.6 pnorm 函数

pnorm(sum(apply(tobit2$x,2,FUN=mean) * tobit2$coef)/tobit2$scale) * tobit2$coef[-1]
##           age  yearsmarried religiousness    occupation        rating 
##   -0.04192089    0.12953651   -0.39417189    0.07621840   -0.53413656

1.6.1 代码深度解析:手动验证 \(E(y|x)\) 的边际效应

这段代码非常有技术含量,它通过矩阵运算正态分布公式,手动复现了之前 margEff(tobit1) 自动计算出的数值。这在计量经济学考试中通常作为“原理推导”题出现。


1.6.2 代码逻辑与数学公式

这段代码的核心是计算 Tobit 模型全样本期望 \(E(y|x)\) 的边际效应

根据 McDonald-Moffitt 分解,其计算公式为: \[\frac{\partial E[y|x]}{\partial x_j} = \Phi\left(\frac{X\beta}{\sigma}\right) \cdot \beta_j\]

其中:

  • \(\Phi(\cdot)\) 是标准正态分布的累积分布函数(pnorm)。
  • \(X\beta\) 是线性预测值。
  • \(\sigma\) 是模型估计出的标准差(Scale)。

1.6.3 参数详解

  1. apply(tobit2$x, 2, FUN=mean):
    • tobit2 中提取自变量矩阵(因为你之前用了 x=T)。
    • 计算每一列的均值(即构造“平均水平个体”的特征向量 \(\bar{X}\))。
  2. tobit2$coef:
    • 提取模型的所有系数 \(\beta\)(包括截距项)。
  3. sum(...):
    • 执行点积运算 \(\sum \bar{x}_i \beta_i\),得到线性组合 \(X\beta\)
  4. tobit2$scale:
    • 提取模型估计的标准差 \(\sigma\)(在 tobit2 的 summary 中显示为 8.247)。
  5. pnorm(...):
    • 计算 \(z = X\beta / \sigma\) 的正态分布概率。这部分其实就是 \(Pr(y>0|x)\)(即个体发生婚外情的概率)。
  6. tobit2$coef[-1]:
    • 提取除了截距项以外的自变量系数 \(\beta_j\)

1.6.4 输出结果分析

##           age  yearsmarried religiousness    occupation        rating 
##   -0.04192089    0.12953651   -0.39417189    0.07621840   -0.53413656
  • 数值对比:你会发现这些数字与 margEff(tobit1) 的输出分毫不差
  • 物理意义:这验证了 \(E(y|x)\) 的边际效应等于 “参与概率”乘以“原始系数”
    • 例如,rating 的原始系数是 -2.28,计算出的概率 \(\Phi(z)\) 约为 0.2338。
    • \(-2.28 \times 0.2338 = -0.534\)

1.6.5 为什么这段代码很重要?

  1. 揭示了非线性本质:它直观展示了 Tobit 边际效应是如何被概率 \(\Phi\) “缩减”的。原始系数 \(\beta\) 很大,但因为很多人根本不参与(\(y=0\)),所以平均到全样本上的效应就变小了。
  2. 验证计算过程:它证明了 tobit2 (AER 包) 和 tobit1 (censReg 包) 虽然输出格式不同,但底层数学逻辑完全一致。
  3. 手动控制:如果你想计算“特定个体”(比如 20 岁的人)而非“平均人”的边际效应,你只需要把代码里的 mean 替换成特定的数值即可。

1.6.6 总结 (Summary for Lab)

  • Objective: To manually calculate the marginal effects on the unconditional expected value \(E(y|x)\) using the formula: \(ME = \Phi(X\beta/\sigma) \cdot \beta\).
  • Result: The output is identical to margEff(tobit1), confirming that the total marginal effect is the product of the probability of being uncensored and the latent coefficient.
  • Key Insight: This proves that for the “average person,” any change in independent variables is “weighted” by the probability (approx. 23%) of that person actually having an affair.

2 Truncated regression

2.1 Truncated Data Init

# ------------------------------------------------------------------------------
# Truncated regression
# ------------------------------------------------------------------------------

# https://stats.idre.ucla.edu/stata/output/truncated-regression/
# Examples of truncated regression
# Example 1.
  
# A study of students in a special GATE (gifted and talented education)
# program wishes to model achievement as a function of language skills
# and the type of program in which the student is currently enrolled. A
# major concern is that students are required to have a minimum
# achievement score of 40 to enter the special program. Thus, the
# sample is truncated at an achievement score of 40.  

truncdata = readRDS("truncpanel.rds")
truncdata = truncdata[which(truncdata$time==1),]
hist(truncdata$achievement)

# This distribution is truncated, we do not observe any values below 40.

2.1.1 注释背景与逻辑 (The Context)

  • 截断回归的定义:与 Tobit 不同,截断回归是指由于样本筛选规则,导致我们完全无法观测到某些个体
  • 案例说明 (GATE Program)
    • 规则:学生必须达到 40 分以上的成就得分(Achievement Score)才能进入天才班(GATE)。
    • 数据结果:如果你只去研究天才班里的学生,你的数据集中根本不会出现得分低于 40 的人。
    • 核心区别
      • Tobit (Censored):你知道有人没出轨(0),你能看到这些人,只是他们的具体“欲望”被压在了 0 以下。
      • Truncated:你根本不知道班级外面那些低于 40 分的人是谁,他们被从数据集中“剔除”了。

2.1.2 代码与参数说明

truncdata = readRDS("truncpanel.rds")
  • readRDS(): 读取 R 的专用数据格式 .rds。这个文件通常包含一个面板数据集(Panel Data)。
truncdata = truncdata[which(truncdata$time==1),]
  • 参数 time==1: 原始数据是一个面板(多年份),这里通过逻辑筛选只保留了第一个时间点(第一期)的数据。
  • 目的:将面板数据转换为截面数据(Cross-sectional data),以便进行基础的截断回归教学,避免处理复杂的跨时相关性。

2.1.3 输出与预期 (The Output)

虽然这里没有显示具体的 summary,但执行完这段代码后,你的 truncdata 环境将具备以下特征:

  • 数据的下限 (Lower Bound):如果你运行 min(truncdata$achiv),你会发现最小值一定大于或等于 40
  • 分布特征:数据的分布在 40 处会被“一刀切断”,这种不完整的分布会导致普通 OLS 估计产生严重的偏误。

2.1.4 关键概念总结 (Summary for Lab)

特性 Tobit (Censored) Truncated Regression
样本量 完整样本 (Full sample) 缩减样本 (Reduced sample)
0 点/界限点 观测到 0,知道他们在哪 完全观测不到界限以外的人
例子 所有已婚人士 (包含不出轨的) 只调查进入天才班的学生 (不含落选者)

2.2 Truncated regression truncreg::truncreg

# Truncated regression
treg = truncreg::truncreg(achievement~lang.score+program2+program3, 
                          data=truncdata, point=40, direction="left")
summary(treg)
## 
## Call:
## truncreg::truncreg(formula = achievement ~ lang.score + program2 + 
##     program3, data = truncdata, point = 40, direction = "left")
## 
## BFGS maximization method
## 35 iterations, 0h:0m:0s 
## g'(-H)^-1g = 3.97E-10 
##  
## 
## 
## Coefficients :
##               Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)   9.889564   0.683552 14.4679 < 2.2e-16 ***
## lang.score    0.701698   0.012003 58.4615 < 2.2e-16 ***
## program2TRUE  3.780889   0.221834 17.0438 < 2.2e-16 ***
## program3TRUE -1.224689   0.219914 -5.5689 2.563e-08 ***
## sigma         1.196063   0.064649 18.5009 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -280.54 on 5 Df

2.2.1 截断回归 (Truncated Regression) 深度解析

这段代码展示了如何对样本缺失(不仅仅是观测值被压低,而是低于门槛的人根本没进数据库)的数据进行建模。


2.2.2 代码与参数说明

  • truncreg::truncreg(...): 调用 truncreg 包中的函数。这是一种极大似然估计(MLE),专门用来“补全”因样本截断而损失的正态分布边缘。
  • achievement ~ lang.score + ...: 回归方程。
  • point = 40: 核心参数。明确告知模型,原始数据在分数 40 处被切断了。
  • direction = "left": 告知模型这是左截断。即:所有分数 \(< 40\) 的学生都从样本中消失了。

2.2.3 变量的统计学现实意义

这个模型试图探究学生进入天才班(GATE)后的学业表现及其驱动因素

  • 因变量 (achievement):学生的成就得分。它是被观测的对象,但受到“准入门槛”的限制(只有得分 \(\ge 40\) 的人才被记录)。
  • 自变量 (lang.score):语言能力得分。研究目的是看:扎实的语言基础是否能显著预示更高的综合成就?
  • 虚拟变量 (program2, program3):代表学生参加的不同类型的教学项目。
    • 统计意义:通过对比 program2program3 与基准项目(program1)的系数,研究人员试图找出哪种教学方案对提升学业成就最有效

寻求的关系: 在排除(控制)了教学项目差异后,语言能力每提升 1 分,能为处于天才班水平的学生带来多少额外的成就加成?


2.2.4 输出结果解读 (The Output)

2.2.4.1 估计方法与收敛

  • BFGS maximization: 使用了 BFGS 算法进行数值优化。
  • 35 iterations: 算法经过 35 次迭代成功收敛。

2.2.4.2 系数表 (Coefficients)

变量 估计值 (Estimate) 显著性 (P-value) 现实含义解读
(Intercept) 9.889 < 0.001 *** 截距项,当其他变量为 0 时的理论基准值。
lang.score 0.701 < 0.001 *** 极显著正相关。语言分数每高 1 分,成就得分预计增加 0.7 分。
program2 3.781 < 0.001 *** 与项目 1 相比,参加项目 2 的学生成就得分显著高出 3.78 分
program3 -1.225 < 0.001 *** 与项目 1 相比,参加项目 3 的学生成就得分反而低了 1.22 分
sigma 1.196 < 0.001 *** 误差项的标准差估计。

重要前提:Direct Interpretation: “A one-unit increase in language score is associated with a 0.701 unit increase in the expected achievement score, ==adjusting for the truncation at 40==.”

直接解读:==在纠正了 40 分截断偏误后==,语言分每增加 1 个单位,成就分预期增加 0.701 个单位。


2.2.5 核心注释与深度理解

2.2.5.1 1. 为什么不能用 OLS?

由于分数低于 40 的学生被剔除了,剩余样本的分布不再是完整的正态分布,而是“残缺的正态分布”。

  • 如果用 OLS,回归线会被剩下的高分群体“拉平”,导致 lang.score 的真实效应被严重低估
  • truncreg 通过数学公式(包含了逆米尔斯比值 Inverse Mills Ratio 的逻辑)补偿了那些消失的低分群体的影响。

2.2.5.2 2. 与 Tobit 的本质区别

  • 在之前的 Affairs 例子中,我们知道有些人出轨次数是 0,他们在样本里
  • 在这个 achievement 例子中,低于 40 分的学生不在样本里(我们甚至不知道有多少人落选)。截断回归的任务是在“信息不全”的情况下还原真相。

2.2.5.3 3. 结论建议

从结果看,该研究证明了:在天才班项目中,语言能力是成就的核心预测指标,且 Program 2 是该天才教育体系中最成功的教学分支。

2.3 cbind 对比分析

# Let's compare these estiamtes with simple regression results
simpreg = lm(achievement~lang.score+program2+program3, data=truncdata)
summary(simpreg)
## 
## Call:
## lm(formula = achievement ~ lang.score + program2 + program3, 
##     data = truncdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1194 -0.8340  0.0093  0.7182  3.7928 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.40932    0.64022  16.259  < 2e-16 ***
## lang.score    0.69305    0.01132  61.216  < 2e-16 ***
## program2TRUE  3.75931    0.21736  17.295  < 2e-16 ***
## program3TRUE -1.18958    0.21335  -5.576 9.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.192 on 176 degrees of freedom
## Multiple R-squared:  0.9594, Adjusted R-squared:  0.9587 
## F-statistic:  1387 on 3 and 176 DF,  p-value: < 2.2e-16
# Estimates
cbind(treg$coefficients[1:4], simpreg$coefficients[1:4])
##                    [,1]       [,2]
## (Intercept)   9.8895635 10.4093242
## lang.score    0.7016977  0.6930502
## program2TRUE  3.7808889  3.7593097
## program3TRUE -1.2246886 -1.1895788

2.3.1 cbind 代码与参数详解

这段代码的核心目的是进行对比分析。通过将截断回归(Truncated Regression)和普通最小二乘法(OLS)的系数并排排列,直观地展示“样本截断”对统计推断造成的偏误。

  • cbind(...) (Column Bind):
    • 功能:将两个或多个向量、矩阵按列合并。
    • 参数 1 (treg$coefficients[1:4]):从截断回归模型对象中提取前 4 个系数(截距项、lang.scoreprogram2program3)。
    • 参数 2 (simpreg$coefficients[1:4]):从 OLS 回归模型对象中提取对应的 4 个系数。
  • 为什么使用 [1:4]:
    • 这是为了保持维度匹配
    • treg 模型包含 5 个估算参数(4 个回归系数 + 1 个 sigma)。
    • simpreg (OLS) 只包含 4 个回归系数。
    • 为了能让 cbind 成功合并,必须舍弃 treg 中的 sigma 项,使两个向量长度一致。

2.3.2 输出结果对比分析 | Side-by-Side Comparison

输出的表格展示了两种算法对同一组数据的不同“看法”:

  • 第一列 [,1] (Truncated Regression):代表了修正后的“真实”关系。它考虑了那些低于 40 分而消失的学生。
  • 第二列 [,2] (Simple OLS):代表了“表象”关系。它只看到了班级里现有的学生,忽略了筛选机制。

2.3.2.1 关键变量对比:lang.score

  • treg: 0.7016
  • OLS: 0.6930
  • 分析:OLS 估计出的系数(0.693)比截断回归(0.701)要。这验证了计量经济学中的理论:在左截断的情况下,OLS 往往会==低估==自变量的边际效应。

2.3.2.2 截距项对比:(Intercept)

  • treg: 9.889
  • OLS: 10.409
  • 分析:==OLS的截距==项显著高于真实值。这是因为模型试图补偿由于缺失低分群体而导致整体均值被“==抬高==”的现象。

2.3.3 统计学深度解读 | Professional Interpretation

  • English: The purpose of this comparison is to demonstrate Attenuation Bias. In the OLS model (simpreg), the regression line is “pulled” toward the horizontal by the truncated distribution. By using truncreg, we recover the steeper, more accurate slope (0.701) of the underlying population. Even though the difference (0.701 vs 0.693) seems small here, in many datasets, ==OLS== can lead to completely ==misleading== conclusions if the ==truncation== is severe.

  • 中文解读:这个对比展示了“向零收缩”的偏误。在 OLS 模型中,由于缺失了低分数据,回归线被剩下的高分数据“拉平”了。通过 truncreg,我们找回了那个更陡峭、更准确的真实斜率(0.701)。虽然在这个例子中 0.701 和 0.693 差别不算巨大,但在截断更严重的场景下,OLS 可能会得出完全错误的结论。


2.3.4 总结建议

  • 学术注释 (Academic Note): In a formal report, always report the truncreg results. The OLS results should only be used as a diagnostic tool to show the existence and direction of the truncation bias.
  • 中文建议:在正式报告中,永远以 truncreg 的结果为准。OLS 的结果在这里仅作为一个诊断工具,用来证明“样本截断”确实对数据产生了误导。
# Std. Errors
cbind(summary(treg)$coefficient[1:4,2], summary(simpreg)$coefficient[1:4,2])  
##                    [,1]       [,2]
## (Intercept)  0.68355224 0.64022325
## lang.score   0.01200273 0.01132145
## program2TRUE 0.22183370 0.21736450
## program3TRUE 0.21991401 0.21334985

2.3.5 标准误 (Std. Errors) 的对比

输出的第二部分对比了估计的精确度:

  • 普通回归 ([,2]):计算出的标准误普遍偏小(例如 lang.score 为 0.0113)。
  • 截断回归 ([,1]):标准误相对较大(例如 lang.score 为 0.0120)。
  • 分析:这说明 OLS 因为忽略了缺失数据的影响,往往会给出“过于乐观”且错误的显著性判断。而截断回归通过调整概率密度函数,提供了更稳健的推断。
# Analysis methods you might consider
# Below is a list of some analysis methods you may have encountered. 
# Some of the methods listed are quite reasonable, while others have 
# either fallen out of favor or have limitations.
#
# OLS regression ? You could analyze these data using OLS regression.  
# OLS regression will not adjust the estimates of the coefficients to take into 
# account the effect of truncating the sample at 40, and the coefficients may 
# be severely biased. 
# This can be conceptualized as a model specification error (Heckman, 1979).
# 
# Truncated regression ? Truncated regression addresses the bias introduced 
# when using OLS regression with truncated data.  
# Note that with truncated regression, the variance of the outcome variable is 
# reduced compared to the distribution that is not truncated. Also, if the lower
# part of the distribution is truncated, then the mean of the truncated variable
# will be greater than the mean from the untruncated variable; if the truncation
# is from above, the mean of the truncated variable will be less than 
# the untruncated variable.
# 
# https://stats.idre.ucla.edu/stata/output/truncated-regression/


# ------------------------------------------------------------------------------
# Heckman model
# a.k.a. Sample selection model
# ------------------------------------------------------------------------------


#####################################################################################             
# Real data examples    
#####################################################################################  

#### EXAMPLE 1                       


# This data set was used by Mroz (1987) for analysing female labour supply. 
# 
# labour force participation (described by dummy lfp) is modelled by: 
#   a quadratic polynomial in age (age), 
#   family income (faminc, in 1975 dollars), 
#   presence of children (kids), 
#   and education in years (educ). 
# 
# The wage equation includes 
#   a quadratic polynomial in experience (exper), 
#   education in years (educ), and
#   residence in a big city (city).

if(!require(sampleSelection)){install.packages("sampleSelection")}

# install.packages("sampleSelection")
# library("sampleSelection")

# getting data                   
data("Mroz87")     

#variable for presence of children.           
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)

#2-step method             
greeneTS <- selection(lfp~age+I(age^2)+faminc+kids+educ, 
                      wage~exper+I(exper^2)+educ+city,
                      data=Mroz87, method="2step")
summary(greeneTS) 
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 14 free parameters (df = 740)
## Probit selection equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.157e+00  1.402e+00  -2.965 0.003127 ** 
## age          1.854e-01  6.597e-02   2.810 0.005078 ** 
## I(age^2)    -2.426e-03  7.735e-04  -3.136 0.001780 ** 
## faminc       4.580e-06  4.206e-06   1.089 0.276544    
## kidsTRUE    -4.490e-01  1.309e-01  -3.430 0.000638 ***
## educ         9.818e-02  2.298e-02   4.272 2.19e-05 ***
## Outcome equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.9712003  2.0593505  -0.472    0.637    
## exper        0.0210610  0.0624646   0.337    0.736    
## I(exper^2)   0.0001371  0.0018782   0.073    0.942    
## educ         0.4170174  0.1002497   4.160 3.56e-05 ***
## city         0.4438379  0.3158984   1.405    0.160    
## Multiple R-Squared:0.1264,   Adjusted R-Squared:0.116
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)
## invMillsRatio   -1.098      1.266  -0.867    0.386
## sigma            3.200         NA      NA       NA
## rho             -0.343         NA      NA       NA
## --------------------------------------------
#ML method             
greeneTS <- selection(lfp~age+I(age^2)+faminc+kids+educ, 
                      wage~exper+educ,
                      data=Mroz87)
summary(greeneTS)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 5 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -1582.261 
## 753 observations (325 censored and 428 observed)
## 11 free parameters (df = 742)
## Probit selection equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.124e+00  1.400e+00  -2.946 0.003318 ** 
## age          1.842e-01  6.583e-02   2.797 0.005286 ** 
## I(age^2)    -2.410e-03  7.719e-04  -3.122 0.001867 ** 
## faminc       5.813e-06  4.438e-06   1.310 0.190665    
## kidsTRUE    -4.506e-01  1.301e-01  -3.463 0.000566 ***
## educ         9.504e-02  2.316e-02   4.104 4.52e-05 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.83348    1.13732  -1.612    0.107    
## exper        0.02510    0.01864   1.347    0.178    
## educ         0.47117    0.07252   6.497  1.5e-10 ***
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma   3.1171     0.1144   27.25   <2e-16 ***
## rho    -0.1371     0.1613   -0.85    0.395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#### EXAMPLE 2   
# The data set used in this example is based on the RAND Health Insurance 
# Experiment (Newhouse 1999). It is included in sampleSelection, where it is 
# called RandHIE. Cameron and Trivedi (2005, p. 553) use these data to analyse 
# health expenditures. 
#  
# The endogenous variable of the outcome equation measures the log of the 
# medical expenses of the individual (lnmeddol) and the endogenous variable of 
# the selection equation indicates whether the medical expenses are positive (binexp)
#  
#  The regressors are: 
#    
#    the log of the coinsurance rate plus 1 (logc = log(coins+1)), 
#    a dummy for individual deductible plans (idp), 
#    the log of the participation incentive payment (lpi), 
#    an artificial variable (fmde that is 0 if idp=1 and ln(max(1,mde/(0.01*coins))) otherwise (where mde is the maximum expenditure offer),
#    physical limitations (physlm), 
#    the number of chronic diseases (disea), 
#    dummy variables for good (hlthg), fair (hlthf), and poor (hlthp) self-rated health (where the baseline is excellent self-rated health), 
#    the log of family income (linc), 
#    the log of family size (lfam), 
#    education of household head in years (educdec), 
#    age of the individual in years (xage), 
#    a dummy variable for female individuals (female), 
#    a dummy variable for individuals younger than 18 years (child), 
#    a dummy variable for female individuals younger than 18 years (fchild), 
#    and a dummy variable for black household heads (black)


data("RandHIE")
#as it is in the original example
subsample <- RandHIE$year == 2 & !is.na(RandHIE$educdec)

selectEq <- binexp ~ logc + idp + lpi + fmde + physlm + disea +
  + hlthg + hlthf + hlthp + linc + lfam + educdec + xage +
  + female + child + fchild + black
outcomeEq <- lnmeddol ~ logc + idp + lpi + fmde + physlm +
  + disea + hlthg + hlthf + hlthp + linc + lfam

rhieTS <- selection(selectEq, outcomeEq, data = RandHIE[subsample, ],method = "2step")


summary(rhieTS)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 5574 observations (1293 censored and 4281 observed)
## 33 free parameters (df = 5542)
## Probit selection equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.2716050  0.1877345  -1.447 0.148023    
## logc        -0.1187080  0.0269005  -4.413 1.04e-05 ***
## idp         -0.1279483  0.0522351  -2.449 0.014337 *  
## lpi          0.0283091  0.0088793   3.188 0.001439 ** 
## fmde         0.0075319  0.0161584   0.466 0.641142    
## physlm       0.2732013  0.0743761   3.673 0.000242 ***
## disea        0.0224861  0.0035958   6.253 4.32e-10 ***
## hlthg        0.0387516  0.0438545   0.884 0.376929    
## hlthf        0.1920062  0.0836688   2.295 0.021780 *  
## hlthp        0.6397294  0.2126322   3.009 0.002636 ** 
## linc         0.0518413  0.0168128   3.083 0.002056 ** 
## lfam        -0.0335599  0.0417280  -0.804 0.421284    
## educdec      0.0363070  0.0076536   4.744 2.15e-06 ***
## xage         0.0002631  0.0021606   0.122 0.903070    
## female       0.4451035  0.0542920   8.198 3.00e-16 ***
## child        0.1114890  0.0808338   1.379 0.167877    
## fchild      -0.4512845  0.0799219  -5.647 1.72e-08 ***
## black       -0.6057367  0.0523148 -11.579  < 2e-16 ***
## Outcome equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.942526   0.246164  16.016  < 2e-16 ***
## logc         0.043950   0.035114   1.252 0.210756    
## idp         -0.029685   0.067115  -0.442 0.658285    
## lpi         -0.005577   0.010776  -0.518 0.604803    
## fmde        -0.040913   0.019616  -2.086 0.037056 *  
## physlm       0.276147   0.076504   3.610 0.000309 ***
## disea        0.015897   0.004094   3.883 0.000104 ***
## hlthg        0.256223   0.050865   5.037 4.87e-07 ***
## hlthf        0.462892   0.092478   5.005 5.75e-07 ***
## hlthp        0.839142   0.185695   4.519 6.34e-06 ***
## linc         0.068256   0.023899   2.856 0.004306 ** 
## lfam        -0.304448   0.046788  -6.507 8.34e-11 ***
## Multiple R-Squared:0.1056,   Adjusted R-Squared:0.1031
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)    
## invMillsRatio  -1.0783     0.1787  -6.035 1.69e-09 ***
## sigma           1.5712         NA      NA       NA    
## rho            -0.6863         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
# ------------------------------------------------------------------------------
# Theoretical analyses
# ------------------------------------------------------------------------------

#################################


# Tobit-2 


#################################

# install.packages("sampleSelection")
# install.packages("mvtnorm")
if(!require(sampleSelection)){install.packages("sampleSelection")}
if(!require(mvtnorm)){install.packages("mvtnorm")}



set.seed(123)

#Heckman (tobit-2) sample selection model
#with exclusion restriction (selection regression contains important regressor that is not in outcome regression)

#drawing from bivariate normal distirbution with correlation equal to -0.7
#error terms
eps <- rmvnorm(2000, c(0, 0), matrix(c(1, -0.7, -0.7, 1), 2, 2))

#selection indepvar univariate (0,1)   
xs <- runif(2000)
#selection depvar ys if (xs +eps)>0 then 1 - probit
ys <- (xs + eps[, 1]) > 0

#outocme regression
xo <- runif(2000)
#latent depvar
yoX <- xo + eps[, 2]
#observable outcome if yoX is greater then 0
yo <- yoX * (ys > 0)

#Heckman (tobit-2) sample selection model

summary(selection(ys ~ xs, yo ~ xo))
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 3 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -2968.691 
## 2000 observations (620 censored and 1380 observed)
## 6 free parameters (df = 1994)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01832    0.05727   0.320    0.749    
## xs           1.00901    0.10569   9.547   <2e-16 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.08056    0.08460  -0.952    0.341    
## xo           0.92978    0.08239  11.285   <2e-16 ***
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma  0.94150    0.03986  23.621  < 2e-16 ***
## rho   -0.51727    0.12925  -4.002 6.51e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
summary(selection(ys ~ xs, yo ~ xo,method="2step"))
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 2000 observations (620 censored and 1380 observed)
## 7 free parameters (df = 1994)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01610    0.05769   0.279     0.78    
## xs           1.01253    0.10659   9.499   <2e-16 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09568    0.09776  -0.979    0.328    
## xo           0.92981    0.08257  11.261   <2e-16 ***
## Multiple R-Squared:0.0893,   Adjusted R-Squared:0.0879
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)   
## invMillsRatio  -0.4565     0.1711  -2.669  0.00768 **
## sigma           0.9341         NA      NA       NA   
## rho            -0.4887         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
summary(lm(yo ~ xo))
## 
## Call:
## lm(formula = yo ~ xo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.00948 -0.36259 -0.04909  0.34588  2.68772 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.21721    0.03356  -6.472 1.22e-10 ***
## xo           0.63998    0.05781  11.070  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7447 on 1998 degrees of freedom
## Multiple R-squared:  0.05779,    Adjusted R-squared:  0.05731 
## F-statistic: 122.5 on 1 and 1998 DF,  p-value: < 2.2e-16
#expected intercepts = 0 and slopes = 1
#close to the true values


#without exclusion restriction (selection regression contains important regressor that is not in outcome regression)

yoX <- xs + eps[, 2]
yo <- yoX * (ys > 0)
summary(selection(ys ~ xs, yo ~ xs))
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 13 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -2968.992 
## 2000 observations (620 censored and 1380 observed)
## 6 free parameters (df = 1994)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01566    0.05760   0.272    0.786    
## xs           1.01425    0.10635   9.537   <2e-16 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06452    0.14806  -0.436    0.663    
## xs           0.95306    0.12790   7.452 1.37e-13 ***
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma   0.9566     0.0546  17.521  < 2e-16 ***
## rho    -0.5659     0.1582  -3.577 0.000356 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#estimators still unbiased
#std. errors are larger
#the standard errors of the estimates depend on the variation in the latent selection equation
#More variation gives smaller standard errors

xs <- runif(2000, -1, 1)
ys <- (xs + eps[, 1]) > 0
yoX <- xs + eps[, 2]
yo <- yoX * (ys > 0)
summary(selection(ys ~ xs, yo ~ xs))
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 10 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -2485.797 
## 2000 observations (980 censored and 1020 observed)
## 6 free parameters (df = 1994)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03981    0.02970    1.34     0.18    
## xs           0.99520    0.05418   18.37   <2e-16 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.2184     0.1479  -1.477     0.14    
## xs            1.1358     0.1183   9.600   <2e-16 ***
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma  0.91504    0.05431  16.849  < 2e-16 ***
## rho   -0.48985    0.17021  -2.878  0.00405 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#################################


# Tobit-5 


#################################  

set.seed(0)
# variance - covariance matrix
vc <- diag(3)
vc[lower.tri(vc)] <- c(0.9, 0.5, 0.1)
vc[upper.tri(vc)] <- vc[lower.tri(vc)]
vc
##      [,1] [,2] [,3]
## [1,]  1.0  0.9  0.5
## [2,]  0.9  1.0  0.1
## [3,]  0.5  0.1  1.0
# error terms from multivariate normal distributions
eps <- rmvnorm(500, c(0, 0, 0), vc)
#selection regression
xs <- runif(500)
ys <- (xs + eps[, 1]) > 0
#outcome regression 1  
xo1 <- runif(500)
yo1 <- xo1 + eps[, 2]
#outcome regression 1  
xo2 <- runif(500)
yo2 <- xo2 + eps[, 3]

#tobit-5
summary(selection(ys ~ xs, list(yo1 ~ xo1, yo2 ~ xo2)))
## --------------------------------------------
## Tobit 5 model (switching regression model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 11 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -895.8201 
## 500 observations: 172 selection 1 (FALSE) and 328 selection 2 (TRUE)
## 10 free parameters (df = 490)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1550     0.1051  -1.474    0.141    
## xs            1.1408     0.1785   6.390 3.86e-10 ***
## Outcome equation 1:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02708    0.16395   0.165    0.869    
## xo1          0.83959    0.14968   5.609  3.4e-08 ***
## Outcome equation 2:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1583     0.1885   0.840    0.401    
## xo2           0.8375     0.1707   4.908 1.26e-06 ***
##    Error terms:
##        Estimate Std. Error t value Pr(>|t|)    
## sigma1  0.93191    0.09211  10.118   <2e-16 ***
## sigma2  0.90697    0.04434  20.455   <2e-16 ***
## rho1    0.88988    0.05353  16.623   <2e-16 ***
## rho2    0.17695    0.33139   0.534    0.594    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
summary(selection(ys ~ xs, list(yo1 ~ xo1, yo2 ~ xo2), method="2step"))
## --------------------------------------------
## Tobit 5 model (switching regression model)
## 2-step Heckman / heckit estimation
## 500 observations: 172 selection 1 (FALSE) and 328 selection 2 (TRUE)
## 12 free parameters (df = 490)
## Probit selection equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1588     0.1116  -1.423    0.155    
## xs            1.1478     0.1978   5.803 1.17e-08 ***
## Outcome equation 1:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1289         NA      NA       NA
## xo1           0.8430         NA      NA       NA
## Multiple R-Squared:0.2486,   Adjusted R-Squared:0.2397
## Outcome equation 2:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1731         NA      NA       NA
## xo2           0.8374         NA      NA       NA
## Multiple R-Squared:0.0694,   Adjusted R-Squared:0.0636
##    Error terms:
##                Estimate Std. Error t value Pr(>|t|)
## invMillsRatio1  -0.9318         NA      NA       NA
## invMillsRatio2   0.1326         NA      NA       NA
## sigma1           0.8970         NA      NA       NA
## sigma2           0.9065         NA      NA       NA
## rho1             0.9900         NA      NA       NA
## rho2             0.1463         NA      NA       NA
## --------------------------------------------