Censoring, Truncation, and Panels

Christopher Weber

2024-11-25

Introduction

  • These notes follow the second half of Long (1997), Chapter 8

  • Truncation means we only observe data that fall above (or below) a specific value; cases beyond the threshold never enter the sample

  • E.g., the impact of ideology on dollars spent during an election cycle, among general election candidates

  • Truncation at zero, for instance

  • We should not estimate a standard PRM or negative binomial model: both will predict zero counts, but zero counts cannot be observed

For instance

  • A zero count: \(p(y_i=0|x_i)=\exp(-\mu_i)\).
  • A nonzero count: \(p(y_i>0|x_i)=1-\exp(-\mu_i)\).
  • And a Poisson distribution truncated at zero, \(p(y_i \mid y_i>0)\): \[p(y_i|x_i, y_i>0)=\frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!\,(1-\exp(-\mu_i))}\]
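
  • A minimal numeric sketch of this adjustment in Python, assuming a hypothetical rate \(\mu=1.5\) and using scipy's Poisson pmf:

    import numpy as np
    from scipy.stats import poisson

    mu = 1.5                         # hypothetical rate for one observation

    p_zero = np.exp(-mu)             # p(y = 0 | x)
    p_nonzero = 1 - p_zero           # p(y > 0 | x)

    # Zero-truncated Poisson pmf: rescale the ordinary pmf by p(y > 0)
    y = np.arange(1, 20)
    pmf_truncated = poisson.pmf(y, mu) / p_nonzero

    print(pmf_truncated.sum())       # approaches 1 as y ranges over 1, 2, ...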

Summary

  • The Poisson Regression Model (PRM)
  • A strong assumption: \(E(y)=var(y)=\mu\)
  • Overdispersion
  • The negative binomial
  • Zero counts
  • Truncated regression
  • Zero inflation and hurdle models

When Zeroes are Observed

  • Imagine you are completing a project on casualties in military conflict
  • Your data are panel data, which include a large sample of countries over many years
  • You have a dataset with a lot of zeros
  • What is a zero and why do we observe it?
  • Superior defenses, nature of war, no boots on the ground, and/or no conflict

Zero Generating

  • Zero stage. \(\theta_i\) is the probability that \(y_i=0\) and \(1-\theta_i\) is the probability that \(y_i>0\)
  • Model 0/1 using a logit or probit regression \[\theta_i=F(z_i\gamma)\]

Zero Generating

  • Count stage. Here, we may estimate a Poisson or a negative binomial count process

\[pr(y_i=0|x_i)=\theta_i+(1-\theta_i)\exp(-\mu_i)\]

Zero Generating

  • Note. The observed zero count is a composite: \(\theta_i\), a structural zero (e.g., lack of conflict), plus zeros generated by the count process itself, \((1-\theta_i)\exp(-\mu_i)\).

  • Non-zero values:

\[pr(y_i|x_i)=(1-\theta_i)\frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}\]

Zero Generating

  • The count process weighted by the probability of a non-zero.

\[ \tiny L(\theta, \mu \mid y) = \prod_{i=1}^{n} \left[ \theta_i \mathbb{I}(y_i = 0) + (1 - \theta_i) \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \right] \]

  • \(\theta_i\) is the probability of an excess zero for the ith observation.

\[\tiny \log L(\theta, \mu \mid y) = \sum_{i=1}^{n} \log \left[ \theta_i \mathbb{I}(y_i = 0) + (1 - \theta_i)\frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \right]\]

Zero Generating

\[\tiny \log L(\theta, \mu \mid y) = \sum_{i=1}^{n} \log \left[ \theta_i \mathbb{I}(y_i = 0) + (1 - \theta_i)\frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!} \right]\]

  • \(\mu_i\) is the rate parameter of the Poisson distribution for the ith observation.

  • The rate function can be written as \(\mu_i = \exp(\alpha + \sum_{k=1}^{K} \beta_k x_{k,i})\)

  • \(\mathbb{I}(y_i = 0)\) is an indicator function that is 1 if ( \(y_i = 0\) ) and 0 otherwise.
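
  • As a concrete sketch, this zero-inflated Poisson log-likelihood can be maximized directly with scipy; the simulated covariate, coefficient values, and starting values below are all hypothetical:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit, gammaln

    rng = np.random.default_rng(42)
    n = 2000
    x = rng.normal(size=n)
    theta = expit(-1.0 + 0.8 * x)          # zero-stage probability
    mu = np.exp(0.5 + 0.6 * x)             # count-stage rate
    y = np.where(rng.uniform(size=n) < theta, 0, rng.poisson(mu))

    def zip_negloglik(par):
        g0, g1, b0, b1 = par
        th = expit(g0 + g1 * x)
        m = np.exp(b0 + b1 * x)
        log_pois = -m + y * np.log(m) - gammaln(y + 1)   # log Poisson pmf
        ll_zero = np.log(th + (1 - th) * np.exp(-m))     # composite zero
        ll_pos = np.log(1 - th) + log_pois               # weighted count
        return -np.sum(np.where(y == 0, ll_zero, ll_pos))

    fit = minimize(zip_negloglik, np.zeros(4), method="BFGS")
    print(fit.x)    # estimates of (gamma_0, gamma_1, beta_0, beta_1)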

The Hurdle Model

  • The “zero altered” regression model

  • Predict a zero count

\[\tiny \theta_i=F(z_i\gamma)\]

  • Model the non-zero equation with a truncated Poisson (or negative binomial)

    \[\tiny pr(y_i|x_i)=(1-\theta_i)\frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!\,(1-\exp(-\mu_i))}\]
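
  • A sketch of the corresponding hurdle log-likelihood (all variable names hypothetical; the two stages could equally be fit as separate logit and truncated-Poisson models):

    import numpy as np
    from scipy.special import expit, gammaln

    def hurdle_negloglik(par, y, x):
        """Hurdle model: logit for y = 0 vs. y > 0, then a zero-truncated
        Poisson for the positive counts."""
        g0, g1, b0, b1 = par
        theta = expit(g0 + g1 * x)           # pr(y = 0)
        mu = np.exp(b0 + b1 * x)
        # Binary stage: did the observation clear the hurdle?
        ll_binary = np.where(y == 0, np.log(theta), np.log(1 - theta))
        # Truncated count stage; contributes only when y > 0
        log_trunc = (-mu + y * np.log(mu) - gammaln(y + 1)
                     - np.log(1 - np.exp(-mu)))
        return -(ll_binary + np.where(y > 0, log_trunc, 0.0)).sum()

    # e.g., scipy.optimize.minimize(hurdle_negloglik, np.zeros(4), args=(y, x))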

Zero Counts

  • Zero counts are common in count data
  • They may arise from a Poisson or negative binomial process
  • Or they may be observed for entirely separate reasons
  • Theory should guide the decision to model zero counts

Censoring and Truncation

  • Censoring is a different process
  • With truncation, the sample itself is fundamentally changed: cases beyond the threshold are excluded entirely
  • Censoring involves missing data on the dependent variable, but complete data on the covariates
  • Often, scores are censored at a particular value, usually the min and/or max of a scale

Censoring and Truncation

  • For instance, say we only observe the dependent variable when it is greater than \(\tau\)

\[y_{observed} = \begin{cases} NA, & y_{latent}\leq\tau\\ y_{latent}, & y_{latent}>\tau \end{cases}\]

Censoring and Bias

Censoring

  • Assume everyone has an income, and a maximum dollar amount they are willing to spend on a car
  • If the car is priced above this value, they cannot buy the car (even though they would like to)
  • We only observe a purchase if the cost of the car is less than the amount the person is willing to spend
  • In short, for all people whose willingness to pay falls below this threshold, we observe missing data
  • The missing data are “non-ignorable”

Censoring

  • With censoring from below, \[y_{observed} = \begin{cases} \tau, & y_{latent}\leq\tau\\ y_{latent}, & y_{latent}>\tau \end{cases}\]

  • Assume \(y_{latent}\sim N(\mu, \sigma^2)\)

  • The pdf for \(y_{latent}\) is simply the normal density

\[f(y_{latent}|\mu, \sigma)=\frac{1}{\sigma}\,\phi\left(\frac{y_{latent}-\mu}{\sigma}\right)\]

Censoring

  • The upper-tail probability (one minus the CDF): \(\tiny pr(Y_{latent}>y_{latent})=\Phi\left(\frac{\mu-y_{latent}}{\sigma}\right)\)

  • If we observe data greater than \(\tau\),

\[ \tiny pr(y|y>\tau, \mu, \sigma)=\frac{f(y_{latent}|\mu, \sigma)}{pr(y_{latent}>\tau)}\]

  • If we only observe data less than \(\tau\) then,

\[\tiny pr(y|y<\tau, \mu, \sigma)=\frac{f(y_{latent}|\mu, \sigma)}{pr(y_{latent}<\tau)}\]

\[\tiny f(y|y>\tau, \mu, \sigma)= \frac{\frac{1}{\sigma}\,\phi\left(\frac{y_{latent}-\mu}{\sigma}\right)}{\Phi\left(\frac{\mu-\tau}{\sigma}\right)}\]

The Inverse Mills Ratio

\[\tiny f(y|y>\tau, \mu, \sigma)= \frac{\frac{1}{\sigma}\,\phi\left(\frac{y_{latent}-\mu}{\sigma}\right)}{\Phi\left(\frac{\mu-\tau}{\sigma}\right)}\]

  • The numerator is simply the normal pdf, and we divide by the probability of exceeding \(\tau\), \(\Phi((\mu-\tau)/\sigma)\).

[Figure: the normal PDF and CDF]

  • Let’s take the expectation of this pdf, as it yields an important statistic.

The Inverse Mills Ratio

\[\tiny E(y|y>\tau)=\mu+\sigma\, \frac{\phi\left(\frac{\mu-\tau}{\sigma}\right)}{\Phi\left(\frac{\mu-\tau}{\sigma}\right)}\]

  • Or, more simply, \(\mu+\sigma\,\kappa\left(\frac{\mu-\tau}{\sigma}\right)\), with \(\kappa(\cdot)\) representing \(\phi(\cdot)/\Phi(\cdot)\)

  • In this case, \(\kappa\) (kappa) is a statistic called the inverse Mills ratio

The Inverse Mills Ratio

\[E(y|y>\tau)=\mu+\sigma\, \frac{\phi\left(\frac{\mu-\tau}{\sigma}\right)}{\Phi\left(\frac{\mu-\tau}{\sigma}\right)}\]

  • If \(\tau\) is greater than \(\mu\), meaning we have serious censoring, this ratio will be large

  • But if \(\mu\) is much greater than \(\tau\), so there is not much censoring, the ratio goes to zero and the truncated distribution is effectively just the normal density

The Inverse Mills Ratio

  • If \(\mu-\tau\) is positive, the inverse Mills ratio is smaller than when \(\mu-\tau\) is negative: \(\kappa\) is decreasing in its argument
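
  • A quick numerical check of this behavior, using scipy's normal pdf and cdf (the specific values of \(\mu\), \(\tau\), and \(\sigma\) are just illustrative):

    import numpy as np
    from scipy.stats import norm

    def inverse_mills(mu, tau, sigma):
        """kappa = phi(z) / Phi(z) with z = (mu - tau) / sigma."""
        z = (mu - tau) / sigma
        return norm.pdf(z) / norm.cdf(z)

    # Little censoring: mu far above tau -> kappa near 0
    print(inverse_mills(mu=3.0, tau=0.0, sigma=1.0))   # ~0.004
    # Heavy censoring: tau above mu -> kappa large
    print(inverse_mills(mu=0.0, tau=2.0, sigma=1.0))   # ~2.37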

Censored Regression

  • Let’s start by assuming a censored dependent variable (Long 1997, pp. 195-196 for the full derivation).

  • We observe all of \(x\), but we do not observe \(y\) below \(\tau\)

\[y_{observed} = \begin{cases} \tau, & y_{latent}\leq\tau\\ y_{latent}, & y_{latent}>\tau \end{cases}\]

  • This is censoring from below

Censored Regression

  • We can also have censoring from above

\[y_{observed} = \begin{cases} \tau, & y_{latent}\geq\tau\\ y_{latent}, & y_{latent}<\tau \end{cases}\]

  • Assume:

\[y_{observed} = \begin{cases} \tau, & y_{latent}\leq\tau\\ \alpha+\sum_{k=1}^{K} \beta_k x_{k,i}+\epsilon_i, & y_{latent}>\tau \end{cases}\]

Censored Regression

  • The probability of censoring is \(\tiny pr(censored|x_i) = pr(y_{latent} \leq \tau |x_i) = pr(\epsilon_i<\tau - (\alpha+\sum_k \beta_k x_{k,i})|x_i)=\Phi((\tau - (\alpha+\sum_k \beta_k x_{k,i}))/\sigma)\)

  • The probability of not being censored given \(x\): \(\tiny pr(uncensored|x_i) = 1 -\Phi((\tau - (\alpha+\sum_k \beta_k x_{k,i}))/\sigma)= \Phi(((\alpha+\sum_k \beta_k x_{k,i}) - \tau)/\sigma)\)

  • Simplify by defining \(\tiny \delta_i = ((\alpha+\sum_k \beta_k x_{k,i}) - \tau)/\sigma\)

Censored Regression

  • The probability of being censored is \(\Phi(-\delta_i)\) and the probability of not being censored is \(\Phi(\delta_i)\)

  • Censoring (like truncation) can occur from the “left,” “right,” or really anywhere in an observed distribution
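
  • Putting the pieces together, a minimal sketch of the censored-regression (tobit) log-likelihood, fit by maximum likelihood on simulated data; all names and values are hypothetical:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def tobit_negloglik(par, y, x, tau=0.0):
        """Negative log-likelihood for a model censored from below at tau."""
        b0, b1, log_sigma = par
        sigma = np.exp(log_sigma)                # keep sigma positive
        xb = b0 + b1 * x
        delta = (xb - tau) / sigma
        ll = np.where(
            y <= tau,
            norm.logcdf(-delta),                 # pr(censored) = Phi(-delta)
            norm.logpdf((y - xb) / sigma) - np.log(sigma),
        )
        return -ll.sum()

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y_latent = 0.5 + 1.2 * x + rng.normal(size=1000)
    y = np.maximum(y_latent, 0.0)                # censoring from below at 0
    fit = minimize(tobit_negloglik, np.zeros(3), args=(y, x), method="BFGS")
    print(fit.x[:2], np.exp(fit.x[2]))           # intercept, slope, sigma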

The Panel Design

  • Panel data are common in political science

  • Unlike cross-sectional data, units are repeatedly observed

  • The Time-Series Cross-Section (TSCS) design

The Panel Design

  • Autocorrelated errors

  • The AR/Markov Process

    \[ Pr(C_{t+1}|C_{t},....C_{1})=Pr(C_{t+1}|C_{t}) \]

  • The Drunkard’s Walk

The Drunkard’s Walk

[Figure: random-walk illustration]

The Markov State Model

  • E.g., to model the probability of a voter being a Republican today, we would condition on whether they were a Republican yesterday.

  • Or, vote turnout in Arizona

The Markov State Model

  • Movements between states of \(C\) are governed by “transition probabilities.”

  • We can represent the transitions between the \(m\) realizations of \(C\) as a “transition matrix.”

  • The rows represent the realization of a state at time \(t\) and the columns represent \(t+1\). Each row must sum to 1, of course, in order to form a proper probability distribution. A small numerical example follows.
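
  • For instance, a sketch with a hypothetical two-state (Republican/Democrat) transition matrix:

    import numpy as np

    # Hypothetical 2-state transition matrix: rows are the state at t,
    # columns the state at t+1; each row sums to 1.
    P = np.array([[0.9, 0.1],    # Republican today -> {R, D} tomorrow
                  [0.2, 0.8]])   # Democrat today  -> {R, D} tomorrow

    state = np.array([1.0, 0.0])         # start as Republican
    for _ in range(3):
        state = state @ P                # propagate one period forward
    print(state)                          # distribution after 3 periods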

Multilevel Structures

  • Multilevel data structures are incredibly common in political science
  • Typically, a unit corresponds to the level-1 observation (e.g., country-year, person-wave, person-region, etc.)
  • OLS:

\[y_{j,i}=\beta_0+\beta_1 x_{j,i}+e_{j,i}\]

Multilevel Structures: Fixed Effects

\[\tiny y_{j,i}=\beta_0+\beta_1 x_{j,i}+e_{j,i}\]

  • \(y\) is an observation nested within a geographical unit, time, etc.

  • Perhaps the intercepts in this equation vary across regions

\[\tiny y_{j,i}=\beta_0+\beta_1 x_{j,i}+\sum_{j=1}^{J-1} \gamma_{j} d_j+ e_{j,i}\]

  • \(d_j\) denotes a dummy variable, specified for \(J-1\) geographic units
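
  • A brief sketch of this dummy-variable (fixed effects) specification using statsmodels' formula interface; the data are simulated and the names are hypothetical:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "region": np.repeat(list("ABCD"), 50),
        "x": rng.normal(size=200),
    })
    df["y"] = (1.0 + 0.5 * df["x"]
               + df["region"].map({"A": 0.0, "B": 1.0, "C": -0.5, "D": 0.3})
               + rng.normal(size=200))

    # C(region) expands into J-1 dummies (one region is the reference)
    fe = smf.ols("y ~ x + C(region)", data=df).fit()
    print(fe.params)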

Fixed versus Random Effects

\[\tiny y_{j,i}=\beta_{0,j}+\beta_1 x_{j,i}+ e_{1,j,i}\]

  • Now, instead of \(J-1\) dummies, we model the intercept as drawn from a probability density; a common one, of course, is the normal

\[\tiny \beta_{0,j}=\gamma_0+e_{2,j}\]

\[\tiny e_{2,j} \sim N(0, \sigma^2)\]

\[\tiny \beta_{0,j}\sim N(\gamma_0, \sigma^2)\]

Fixed versus Random Effects

  • A two level model

  • At level 1,

    \[ y_{j,i}=\beta_{0}+\beta_1 x_{j,i}+ e_{1,j,i}\]

  • At level 2,

    \[ y_{j}=\gamma_{0}+\gamma_1 x_{j}+ e_{2,j}\]

  • The ecological fallacy

  • For instance, Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do (Gelman 2008)

Different Parameterizations

\[\tiny p(y_{j,i}=1)=logit^{-1}(\beta_{0}+\beta_1 x_{j,i})\]

\[\tiny \bar{y}_{j}=\gamma_{0}+\gamma_1 x_{j}+e_{2,j}\]

Building the Random Effects Model

  • Limitations of the fixed effects model. We have to add \(J-1\) dummy variables to the model

  • An equivalent approach is to just remove the \(j\) level means from \(y\)

\[(y_{j,i}-\bar{y}_j)=\beta_{0}+\beta_1 x_{j,i}+ e_{i}\]

  • Why?

Building the Random Effects Model

\[(y_{j,i}-\bar{y}_j)=\beta_{0}+\beta_1 (x_{j,i}-\bar{x}_j)+ e_{i}\]

  • The within effects estimator; it is the linear effect of \(x\) on \(y\)
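
  • A sketch of the within (demeaning) estimator on simulated data; variable names are hypothetical, and the slope should match the dummy-variable specification:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    g = np.repeat(np.arange(10), 40)
    x = rng.normal(size=400)
    y = 0.5 * x + rng.normal(size=10)[g] + rng.normal(size=400)
    df = pd.DataFrame({"g": g, "x": x, "y": y})

    # Subtract the group means from y and x, then run OLS on the residuals
    df["y_w"] = df["y"] - df.groupby("g")["y"].transform("mean")
    df["x_w"] = df["x"] - df.groupby("g")["x"].transform("mean")

    within = smf.ols("y_w ~ x_w", data=df).fit()
    print(within.params["x_w"])   # ~0.5, matching the dummy-variable model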

Random Effects

  • Two regression models, within and between

  • Recall the assumption that \(cov(e_i, e_j)=0, \forall i\neq j\)? Or the assumption that \(e_{j,i}\) are independent and identically distributed?

  • In the two-stage formulation, we never correct for this error structure.

Model the Complex Errors

\[ \begin{eqnarray} y_{i}=b_{0,j[i]}+e_{1,i}\\ b_{0,j}=\omega_0+e_{2,j}\\ e_{1,i} \sim N(0, \sigma_1^2)\\ e_{2,j} \sim N(0, \sigma_2^2) \end{eqnarray} \]

  • \(i\) nested within \(j\) and variation across \(j\)
  • The Random Intercept Model
  • Or just, the ANOVA model and between versus within effects
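
  • A minimal random-intercept sketch using statsmodels' MixedLM (simulated data; all names hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    J, n_j = 20, 30
    g = np.repeat(np.arange(J), n_j)
    b0 = rng.normal(scale=0.7, size=J)           # random intercepts, e_{2,j}
    x = rng.normal(size=J * n_j)
    y = 1.0 + b0[g] + 0.5 * x + rng.normal(size=J * n_j)
    df = pd.DataFrame({"y": y, "x": x, "g": g})

    ri = smf.mixedlm("y ~ x", data=df, groups=df["g"]).fit()
    print(ri.cov_re)   # estimated between-group variance (sigma_2^2)
    print(ri.scale)    # estimated within-group variance (sigma_1^2)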

Reduced Form

\[\begin{eqnarray} y_{i}=\omega_0+e_{1,i}+e_{2,j[i]}\\ \end{eqnarray}\]

\[var(y_{j[i]})=var(e_{1,i})+var(e_{2,j})\]

\[\sigma^2_{y}=\sigma^2_{1}+\sigma^2_{2}\]

  • Note the similarity to variance decomposition, and the F-test, \(SS_T=SS_B+SS_W\)?

Adding Predictors

\[\tiny \begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1} x_{j[i]}+e_{1,i}\\ b_{0,j}=\omega_0+\omega_1 x_{j}+e_{2,j}\\ e_{1,i} \sim N(0, \sigma_1^2)\\ e_{2,j} \sim N(0, \sigma_2^2) \end{eqnarray}\]

  • \(x_{j[i]}\) consists of variables that vary within the \(J\) level-two units; \(x_{j}\) consists of variables that vary only between level-two units

Adding Predictors

\[x_{within}=x_{j[i]}-\bar{x}_{j}\]

\[x_{between}=\bar{x}_{j}\]

  • These variables are orthogonal, and they capture different things: the variation within \(j\) levels and the variation between \(j\) levels

Adding Predictors

\[\tiny \begin{eqnarray} y_{j[i]}=b_{0,j[i]}+b_{1} x_{within}+e_{1,i}\\ b_{0,j}=\omega_0+\omega_1 x_{between}+e_{2,j}\\ e_{1,i} \sim N(0, \sigma_1^2)\\ e_{2,j} \sim N(0, \sigma_2^2) \end{eqnarray} \]
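
  • A sketch of this within/between specification, again on hypothetical simulated data: group-mean-center \(x\) and enter both pieces:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    J, n_j = 25, 20
    g = np.repeat(np.arange(J), n_j)
    x = rng.normal(size=J * n_j) + np.repeat(rng.normal(size=J), n_j)
    y = 0.4 * x + np.repeat(rng.normal(size=J), n_j) + rng.normal(size=J * n_j)
    df = pd.DataFrame({"y": y, "x": x, "g": g})

    df["x_between"] = df.groupby("g")["x"].transform("mean")
    df["x_within"] = df["x"] - df["x_between"]

    wb = smf.mixedlm("y ~ x_within + x_between", data=df, groups=df["g"]).fit()
    print(wb.params)   # separate within and between slopes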

The Random Coefficients Model

\[\begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1,j[i]}x_{i}+e_{1,i}\\ b_{0,j[i]}=\omega_0+e_{2,j[i]}\\ b_{1,j[i]}=\omega_1+e_{3,j[i]} \end{eqnarray}\]

\[\begin{eqnarray} y_{i}=\omega_0+e_{2,j[i]}+(\omega_1+e_{3,j[i]})x_{i}+e_{1,i}\\ \end{eqnarray}\]

\[cov(e_{2,j[i]}, e_{3,j[i]}) \neq 0\]

The Random Coefficients Model, Correlated Errors

\[ \tiny \begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1,j[i]}x_{i}+e_{1,i}\\ b_{0,j[i]}=\omega_0+\omega_1 x_{j[i]} +e_{2,j[i]}\\ b_{1,j[i]}=\phi_0+\phi_1 x_{j[i]}+e_{3,j[i]}\\ \end{eqnarray}\]

  • The model captures the extent to which covariates change the \(j\)th value of \(y\) (the intercept equation) and how covariates change the relationship between \(x\) and \(y\) (the slope equation).

The Random Coefficients Model, Correlated Errors

\[\begin{eqnarray} y_{i}=\omega_0+\omega_1 x_{j[i]} +e_{2,j[i]}+(\phi_0+\phi_1 x_{j[i]}+e_{3,j[i]})x_{i}+e_{1,i}\\ \end{eqnarray}\]

Pooling: A Continuum

  • Let’s situate the random intercepts/coefficients in a broader structure.
  • No pooling model. This is the fixed effects model above, in which each level-2 unit has a unique mean value.
  • Complete pooling. This is the regression model with no level 2 estimated means. Instead, we assume the level-2 units completely pool around a common intercept (and perhaps slope). Formally, compare

\[y_{j,i}=\beta_0+\sum_{j=1}^{J-1} \gamma_{j} d_j+ e_{j,i}\]

\[y_{j,i}=\beta_0+ e_{j,i}\]

Partial Pooling

\[\tiny \begin{eqnarray} y_{j[i]}=b_{0,j[i]}+e_{1,i}\\ \end{eqnarray}\]

\[\tiny \begin{eqnarray} b_{0,j}=\frac{\bar{y}_j\times n_j/\sigma^2_y+\bar{y}_{all}\times 1/\sigma^2_{b_0}}{n_j/\sigma^2_y+ 1/\sigma^2_{b_0}}\end{eqnarray}\]

  • The first term in the numerator pulls the estimate toward the group’s own mean. Note that as \(n_j\) (the group size) increases, the estimate is pulled further from the common mean (the second term in the numerator).

  • As \(n_j\) increases, the estimated group mean is influenced more by the group than by the common mean.

  • As \(n_j\) decreases, i.e., for small groups, the formula shrinks the estimate more strongly toward the single pooled value.

Partial Pooling

\[ \begin{eqnarray} b_{0,j}=\frac{\bar{y}_j\times n_j/\sigma^2_y+\bar{y}_{all}\times 1/\sigma^2_{b_0}}{n_j/\sigma^2_y+ 1/\sigma^2_{b_0}}\end{eqnarray}\]

  • As the within-group variance increases, the group mean is pulled towards the pooled mean

  • As the between-group variance increases, the common mean exerts a smaller impact

  • The two terms in the numerator are thus weighted by the variation within and between level-2 units
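
  • The shrinkage formula is easy to verify numerically; here is a small sketch (all input values hypothetical):

    import numpy as np

    def partial_pool(y_bar_j, n_j, y_bar_all, sigma2_y, sigma2_b0):
        """Precision-weighted compromise between group and grand means."""
        w_group = n_j / sigma2_y
        w_pool = 1.0 / sigma2_b0
        return (y_bar_j * w_group + y_bar_all * w_pool) / (w_group + w_pool)

    # Large group: the estimate stays near its own mean (~1.99)
    print(partial_pool(2.0, n_j=500, y_bar_all=0.0, sigma2_y=1.0, sigma2_b0=0.5))
    # Small group: the estimate shrinks toward the grand mean (1.0)
    print(partial_pool(2.0, n_j=2, y_bar_all=0.0, sigma2_y=1.0, sigma2_b0=0.5))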

Partial Pooling

  • The intra-class correlation (ICC)

\[ICC=\sigma^2_{b_0}/[\sigma^2_{b_0}+\sigma^2_{y}]\]

Recall,

\[\sigma^2_{all}=\sigma^2_{b_0}+\sigma^2_{y}\]

  • Thus, the ICC estimates how much of the total variation in \(y\) is a function of variation between level-2 units, relative to variation within them
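
  • A trivial helper for the ICC (the variance values are hypothetical):

    def icc(sigma2_b0, sigma2_y):
        """Share of total variance attributable to between-unit variation."""
        return sigma2_b0 / (sigma2_b0 + sigma2_y)

    print(icc(0.5, 1.5))   # 0.25: a quarter of the variation is between units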

Some Practical Advice

  • The ICC should decrease as you include level-2 predictors; compare to a model without predictors

  • Interpretation of the level-2 expected values (i.e., the group means) is based on a compromise between the pooled and no pooling models

  • If we estimate a regression model with a dummy for every level-2 unit and predictors, the model is not identified because the variables will be collinear (Gelman and Hill 2009, 269)