Unmeasured confounding

Instrument-free methods

Soumik Purkayastha and Peter X.K. Song

Department of Biostatistics, University of Michigan

5/24/23

Overview

  • Challenge: a main obstacle to drawing valid causal inferences from data is the presence of endogenous regressors, which are correlated with the structural error in the population regression model representing the causal relationship of interest.

  • Instrumental variables (IV): classical method to deal with endogeneity.

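    • The ideal IV satisfies the relevance restriction (it is correlated with the endogenous regressor) and the exclusion restriction (it is uncorrelated with the structural error).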
    • Challenge: finding good IVs satisfying these two requirements.
    • Potential IVs often suffer from either weak relevance or an exclusion restriction that is hard to justify, which hampers their use in correcting for the underlying endogeneity concerns (Rossi 2014).
  • Park and Gupta (2012) present an instrument-free method using a Gaussian copula approach.

    • Does not require the regressor to be exogenous, as it directly models the association between the structural error and the endogenous regressor via a copula.
    • Can handle a wider variety of data types, since the Gaussian copula can model a large variety of multivariate discrete and continuous structures (Song 2000).
    • There are still some restrictive assumptions made for model identifiability.

Summary of Park and Gupta (2012)

Introduction

  • Park and Gupta (2012) develop a copula-based, instrument-free method to handle endogenous regressors.
  • The method needs the endogenous regressor to be sufficiently non-normal, and it implicitly assumes no correlation between the exogenous regressors and the copula transformations of the endogenous regressors that are used to control for endogeneity.

We consider the following linear structural regression model: \[Y_i = \mu + \alpha P_i + \beta^T W_i + \epsilon_i\] where \(P_i\) is an endogenous regressor, \(W_i\) is a vector of exogenous regressors and \(\epsilon_i\) is the structural error term.

  • \(P_i\) and \(\epsilon_i\) are associated, and this generates the endogeneity problem.
  • \(W_i\) may be associated with \(P_i\) but not with \(\epsilon_i\).

Key idea: use a copula to jointly model the correlation between \(P_i\) and \(\epsilon_i\). The copula places no restriction on the marginals. Using information contained in the observed data, the marginals of the endogenous regressor and of the error term are obtained first.
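The copula transformation \(\Phi^{-1}\{F_P(P_i)\}\) can be sketched as follows. This is a minimal illustration that assumes \(F_P\) is estimated by the rescaled empirical CDF, rank/(n + 1); the slides do not say which estimator is used. The helper name `normal_scores` and the data values are hypothetical.

```python
from statistics import NormalDist

def normal_scores(x):
    """Map observations to Phi^{-1}(F_hat(x)) using the rescaled empirical CDF."""
    n = len(x)
    # Rank each observation (1 = smallest); ties are broken by position.
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    std_normal = NormalDist()
    # Dividing by n + 1 keeps the CDF values strictly inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]

prices = [2.0, 9.5, 3.1, 40.0, 5.2]  # hypothetical, right-skewed values
p_star = normal_scores(prices)
```

The transformation depends on the data only through ranks, so the skewness of the original variable does not carry over to the transformed scores.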

Model proposed

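  • Assume that \((P_i^\star, \epsilon_i^\star) = [\Phi^{-1}\{F_P(P_i)\}, \Phi^{-1}\{F_{\epsilon}(\epsilon_i)\}]\) is bivariate normal with zero means, unit variances and correlation \(\rho\).
  • Further, assume \(\epsilon_i \sim N(0, \sigma_\epsilon^2)\), so that \(\epsilon_i^\star = \epsilon_i / \sigma_\epsilon\).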

The structural error can be expressed as:

\[\epsilon_i = \sigma_\epsilon \epsilon_i^\star = \sigma_\epsilon (\rho P_i^\star + \sqrt{1-\rho^2}w_i),\] where \(w_i \sim N(0, 1)\) is independent of \(P_i^\star\), and the structural regression model can be re-written as

\[\begin{equation} \begin{aligned} Y_i &= \mu + \alpha P_i + \beta^T W_i + \sigma_\epsilon (\rho P_i^\star + \sqrt{1-\rho^2}w_i)\\ Y_i &= \mu + \alpha P_i + \sigma_\epsilon\rho P_i^\star + \beta^T W_i + \sigma_\epsilon \sqrt{1 - \rho^2} w_i. \end{aligned} \end{equation}\]

  • Can’t allow for \(P_i \sim N\)! If \(P_i\) were normal, \(P_i^\star\) would be a linear function of \(P_i\), and \(\alpha\) could not be separately identified from \(\sigma_\epsilon\rho\).
  • Since \(W_i\) is exogenous, we have \[\begin{equation} \begin{aligned} 0 = cov(W_i, \epsilon_i) &= cov(W_i, \sigma_\epsilon (\rho P_i^\star + \sqrt{1-\rho^2}w_i)) \\ &= \sigma_\epsilon \rho\, cov(W_i, P_i^\star) + \sigma_\epsilon \sqrt{1-\rho^2}\, cov(W_i, w_i). \end{aligned} \end{equation}\] So when \(\rho \neq 0\), any correlation between \(W_i\) and \(P_i^\star\) induces correlation between \(W_i\) and \(w_i\), yielding inconsistent estimates.
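The rewritten model, in which \(P_i^\star\) enters as an extra regressor, can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: there are no exogenous regressors, \(P_i\) is lognormal, \(F_P\) is estimated by the rescaled empirical CDF, and all parameter values and helper names (`normal_scores`, `ols`) are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(x):
    """Phi^{-1} of the rescaled empirical CDF, rank / (n + 1)."""
    ranks = np.argsort(np.argsort(x)) + 1
    inv = NormalDist().inv_cdf
    return np.array([inv(r / (len(x) + 1)) for r in ranks])

def ols(y, *cols):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n, mu, alpha, sigma, rho = 20_000, 1.0, 1.0, 1.0, 0.7  # hypothetical values

# Gaussian copula draw: (P*, eps*) is standard bivariate normal with correlation rho.
p_latent = rng.standard_normal(n)
w = rng.standard_normal(n)
eps = sigma * (rho * p_latent + np.sqrt(1 - rho**2) * w)

P = np.exp(p_latent)        # non-normal (lognormal) endogenous regressor
Y = mu + alpha * P + eps

alpha_ols = ols(Y, P)[1]                       # ignores endogeneity
alpha_copula = ols(Y, P, normal_scores(P))[1]  # adds P* as a generated regressor
print(f"OLS: {alpha_ols:.3f}, copula-corrected: {alpha_copula:.3f}")
```

With these settings the plain OLS slope should sit visibly above the true \(\alpha\), while the slope from the regression that includes \(P_i^\star\) should be close to it.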

Proposed modification

A two-step regression approach

  • Want to relax the uncorrelatedness assumption between \(W_i\) and \(P_i^\star\).
  • Want to relax the non-normality assumption on \(P_i\).

Jointly model endogenous regressor \(P_i\), the correlated exogenous variable, \(W_i\), and the structural error term, \(\epsilon_i\), using the Gaussian copula model: \[\begin{equation} \left(\begin{array}{c} P_i^* \\ W_i^* \\ \epsilon_i^* \end{array}\right) = \left(\begin{array}{c} \Phi^{-1}\{F_P(P_i) \} \\ \Phi^{-1}\{F_W(W_i) \} \\ \Phi^{-1}\{F_\epsilon(\epsilon_i) \} \end{array}\right) \sim N\left(\left[\begin{array}{l} 0 \\ 0 \\ 0 \end{array}\right],\left[\begin{array}{ccc} 1 & \rho_{p w} & \rho_{p \epsilon} \\ \rho_{p w} & 1 & 0 \\ \rho_{p \epsilon} & 0 & 1 \end{array}\right]\right) \end{equation}\]

The model above can be re-written as

\[\begin{equation} \left(\begin{array}{c} P_i^* \\ W_i^* \\ \epsilon_i^* \end{array}\right)=\left(\begin{array}{ccc} 1 & 0 & 0 \\ \rho_{p w} & \sqrt{1-\rho_{p w}^2} & 0 \\ \rho_{p \epsilon} & \frac{-\rho_{p w} \rho_{p \epsilon} }{\sqrt{1-\rho_{p w}^2}} & \sqrt{1-\rho_{p \epsilon}^2-\frac{\rho_{p w}^2 \rho_{p \epsilon}^2}{1-\rho_{p w}^2}} \end{array}\right) \cdot\left(\begin{array}{c} w_{1, i} \\ w_{2, i} \\ w_{3, i} \end{array}\right) \end{equation}\] where \(w_{1, i}, w_{2, i}, w_{3, i} \overset{i.i.d}{\sim} N(0, 1)\).
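The lower-triangular matrix above can be checked numerically against the correlation matrix of the copula model. This is a minimal sketch; the two correlation values are hypothetical and only need to satisfy \(\rho_{pw}^2 + \rho_{p\epsilon}^2 < 1\), so that the correlation matrix is positive definite.

```python
import numpy as np

rho_pw, rho_pe = 0.5, 0.4  # hypothetical correlation values

# Correlation matrix of (P*, W*, eps*) in the Gaussian copula model.
corr = np.array([[1.0, rho_pw, rho_pe],
                 [rho_pw, 1.0, 0.0],
                 [rho_pe, 0.0, 1.0]])

# Lower-triangular factor, entered entry by entry from the slide.
s = np.sqrt(1 - rho_pw**2)
lower = np.array([
    [1.0, 0.0, 0.0],
    [rho_pw, s, 0.0],
    [rho_pe, -rho_pw * rho_pe / s,
     np.sqrt(1 - rho_pe**2 - rho_pw**2 * rho_pe**2 / (1 - rho_pw**2))],
])

# The factor reproduces the correlation matrix and matches numpy's Cholesky factor.
assert np.allclose(lower @ lower.T, corr)
assert np.allclose(lower, np.linalg.cholesky(corr))
```

Both checks pass because the matrix on the slide is the Cholesky factor of the correlation matrix.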

Rewriting the structural regression model

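\[\begin{equation} \begin{aligned} P_i^\star &= \rho_{pw}W_i^\star + \eta_i \quad(1) \\ Y_i &= \mu+ \alpha P_i + \beta W_i + \frac{\sigma_{\epsilon} \rho_{p \epsilon}}{1-\rho_{p w}^2} \eta_i+\sigma_{\epsilon} \sqrt{1-\rho_{p \epsilon}^2-\frac{\rho_{p w}^2 \rho_{p \epsilon}^2}{1-\rho_{p w}^2}} \cdot w_{3, i} \quad(2) \end{aligned} \end{equation}\]

Here \(\eta_i = P_i^\star - \rho_{pw}W_i^\star \sim N(0, 1-\rho_{pw}^2)\) is the first-stage error, which is independent of \(W_i^\star\) and distinct from the structural error \(\epsilon_i\). Equation (2) follows by substituting \(w_{1, i} = P_i^\star\) and \(w_{2, i} = (W_i^\star - \rho_{pw}P_i^\star)/\sqrt{1-\rho_{pw}^2}\) into the third row of the decomposition above, which gives \(\epsilon_i^\star = \frac{\rho_{p \epsilon}}{1-\rho_{p w}^2} \eta_i + \sqrt{1-\rho_{p \epsilon}^2-\frac{\rho_{p w}^2 \rho_{p \epsilon}^2}{1-\rho_{p w}^2}} \cdot w_{3, i}\).

Two-step regression routine

  • Step 1: estimate regression (1) on the copula-transformed variables and keep the residual \(\hat\eta_i\), the estimate of the first-stage error.
  • Step 2: add \(\hat\eta_i\) as a generated regressor to the outcome regression (2), instead of using \(P_i^\star\) and \(W_i^\star\).
  • The new error term \(w_{3, i}\) is uncorrelated with all regressors in (2). The generated regressor is not collinear with \(P_i\) and \(W_i\) so long as at least one non-normal exogenous covariate is supplied. Non-normality of the exogenous covariates is a less strict requirement than non-normality of \(P_i\). If needed, we can always add a small perturbation.
  • We can then get consistent estimates of the model parameters from step (2).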
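The two-step routine can be sketched in a small simulation with a normally distributed endogenous regressor and a single non-normal exogenous covariate. This is an illustration under stated assumptions, not the authors' implementation: the marginal CDFs are estimated by the rescaled empirical CDF, the first stage is fitted without an intercept, and all parameter values and names (`normal_scores`, `ols`, `resid_hat`) are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(x):
    """Phi^{-1} of the rescaled empirical CDF, rank / (n + 1)."""
    ranks = np.argsort(np.argsort(x)) + 1
    inv = NormalDist().inv_cdf
    return np.array([inv(r / (len(x) + 1)) for r in ranks])

def ols(y, *cols):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 50_000
mu, alpha, beta, sigma = 1.0, 1.0, -1.0, 1.0   # hypothetical true values
rho_pw, rho_pe = 0.5, 0.5

# Draw (P*, W*, eps*) from the Gaussian copula via its Cholesky factor.
corr = np.array([[1.0, rho_pw, rho_pe],
                 [rho_pw, 1.0, 0.0],
                 [rho_pe, 0.0, 1.0]])
latent = rng.standard_normal((n, 3)) @ np.linalg.cholesky(corr).T
P = latent[:, 0]            # normally distributed endogenous regressor
W = np.exp(latent[:, 1])    # non-normal (lognormal) exogenous regressor
Y = mu + alpha * P + beta * W + sigma * latent[:, 2]

# Step 1: regress P* on W* (both scores are centred) and keep the residual.
p_star, w_star = normal_scores(P), normal_scores(W)
rho_hat = (p_star @ w_star) / (w_star @ w_star)
resid_hat = p_star - rho_hat * w_star

# Step 2: add the first-stage residual to the outcome regression.
alpha_ols = ols(Y, P, W)[1]
alpha_2s = ols(Y, P, W, resid_hat)[1]
print(f"OLS: {alpha_ols:.3f}, two-step: {alpha_2s:.3f}")
```

In this design the plain OLS estimate of \(\alpha\) should be clearly biased upward, while the two-step estimate should be close to the true value even though \(P_i\) is normal, because the non-normal \(W_i\) keeps the generated regressor from being collinear with \(P_i\) and \(W_i\).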

Key improvements

  • Jointly model endogenous \(P_i\), exogenous \(W_i\) and structural error \(\epsilon_i\) as a Gaussian copula.
  • Can allow for normally distributed \(P_i\) as long as one non-normal exogenous \(W_i\) exists.
  • Exogenous \(W_i\) can be correlated with linear functions of copula-transformed endogenous \(P_i\).

Summary and next steps

  • Extend to more general copula structures? Nice decomposition won’t work any more.
  • Is there a non-parametric approach to this method?
  • Simulation studies for robustness

References

Park, Sungho, and Sachin Gupta. 2012. “Handling Endogenous Regressors by Joint Estimation Using Copulas.” Marketing Science 31 (4): 567–86. https://www.jstor.org/stable/41687947.
Rossi, Peter E. 2014. “Even the Rich Can Make Themselves Poor: A Critical Examination of IV Methods in Marketing Applications.” Marketing Science 33 (5): 655–72.
Song, Peter Xue-Kun. 2000. “Multivariate Dispersion Models Generated from Gaussian Copula.” Scandinavian Journal of Statistics 27 (2): 305–20. https://www.jstor.org/stable/4616605.