2024-11-25
These notes follow the second half of Long (1997), Chapter 8
Truncation means we only observe data that fall above (or below) a specific value; cases beyond that threshold are missing entirely.
E.g., the impact of ideology on dollars spent during an election cycle, among general election candidates.
Truncation at zero, for instance
We should not estimate a standard PRM or negative binomial model: both will predict zero counts, but zero counts cannot appear in the truncated data.
\[pr(y_i=0|x_i)=\theta_i+(1-\theta_i)\exp(-\mu_i)\]
Note: the zero count is a composite of being a structural zero with probability \(\theta_i\) (e.g., lack of conflict), or a zero generated by the count process itself, \((1-\theta_i)\exp(-\mu_i)\).
Non-zero values:
\[pr(y_i|x_i)=(1-\theta_i){{\exp(-\mu_i)\mu_i^{y_i}}\over{y_i!}}\]
\[ \tiny L(\theta, \mu \mid y) = \prod_{i=1}^{n} \left[ \theta_i \mathbb{I}(y_i = 0) + (1 - \theta_i) {{\exp(-\mu_i)\mu_i^{y_i}}\over{y_i!}} \right] \]
\[\tiny \log L(\theta, \mu \mid y) = \sum_{i=1}^{n} \log \left[ \theta_i \mathbb{I}(y_i = 0) + (1 - \theta_i){{\exp(-\mu_i)\mu_i^{y_i}}\over{y_i!}} \right]\]
\(\mu_i\) is the rate parameter of the Poisson distribution for the ith observation.
The rate function can be written as \(\mu_i = \exp(\alpha + \sum_K \beta_k x_{k,i})\)
\(\mathbb{I}(y_i = 0)\) is an indicator function that is 1 if ( \(y_i = 0\) ) and 0 otherwise.
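To make this concrete, here is a minimal sketch of the zero-inflated Poisson log-likelihood in Python (an illustration, not Long's code); the simulated data, variable names, and the logit link for \(\theta_i\) are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def zip_loglik(params, y, x, z):
    """Zero-inflated Poisson log-likelihood (illustrative sketch).
    params stacks gamma (inflation equation, logit link) and beta (count equation, log link)."""
    k = z.shape[1]
    gamma, beta = params[:k], params[k:]
    theta = expit(z @ gamma)                              # pr(structural zero)
    mu = np.exp(x @ beta)                                 # Poisson rate
    pois = np.exp(-mu + y * np.log(mu) - gammaln(y + 1))  # Poisson pmf
    lik = theta * (y == 0) + (1 - theta) * pois           # mixture, as in the likelihood above
    return np.sum(np.log(lik))

# simulated data (hypothetical): one covariate in both equations
rng = np.random.default_rng(1)
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
z = x.copy()
y = np.where(rng.uniform(size=n) < expit(-1 + 0.5 * x[:, 1]),
             0, rng.poisson(np.exp(0.2 + 0.8 * x[:, 1])))

fit = minimize(lambda p: -zip_loglik(p, y, x, z), x0=np.zeros(4), method="BFGS")
print(fit.x)  # [gamma_0, gamma_1, beta_0, beta_1]
```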
The “zero altered” (hurdle) regression model
Predict a zero count
\[\tiny \theta_i=F(z_i\gamma)\]
Model the non-zero equation with a truncated Poisson (or negative binomial)
\[\tiny pr(y_i|x_i)=(1-\theta_i){{\exp(-\mu_i)\mu_i^{y_i}}\over{y_i!\,(1-\exp(-\mu_i))}}\]
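A quick sanity check on the truncated count part: the zero-truncated Poisson is just the Poisson pmf renormalized by \(1-\exp(-\mu)\) so that the probabilities over \(y>0\) sum to one (the value of \(\mu\) below is arbitrary).

```python
import numpy as np
from scipy.stats import poisson

mu = 1.5
y = np.arange(1, 30)
# zero-truncated Poisson: Poisson pmf divided by pr(y > 0) = 1 - exp(-mu)
p_trunc = poisson.pmf(y, mu) / (1 - np.exp(-mu))
print(p_trunc.sum())  # ~1.0, a proper distribution over y = 1, 2, ...
```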
With truncation,
\[y_{observed} = \begin{cases} \text{NA}, & y_{latent}\leq\tau\\ y_{latent}, & y_{latent}>\tau \end{cases}\]
With censoring,
\[y_{observed} = \begin{cases} \tau, & y_{latent}\leq\tau\\ y_{latent}, & y_{latent}>\tau \end{cases}\]
Assume \(y_{latent}\sim N(\mu, \sigma^2)\)
The pdf for \(y_{latent}\) is simply the normal density
\[f(y_{latent}|\mu, \sigma)={{1}\over{\sigma}}\phi({{\mu-y_{latent}}\over{\sigma}})\]
Survival function (one minus the CDF): \(\tiny pr(Y_{latent}>y_{latent})=\Phi({{\mu-y_{latent}}\over{\sigma}})\)
If we observe data greater than \(\tau\),
\[ \tiny pr(y|y>\tau, \mu, \sigma)={{f(y_{latent}|\mu, \sigma)}\over{pr(y_{latent}>\tau)}}\]
\[\tiny pr(y|y<\tau, \mu, \sigma)={{f(y_{latent}|\mu, \sigma)}\over{pr(y_{latent}<\tau)}}\]
\[\tiny {f(y|y>\tau, \mu, \sigma)}= [{{{1}\over{\sigma}}{\phi({{\mu-y_{latent}}\over{\sigma}})}}]/[{{{\Phi({{\mu-\tau}\over{\sigma}})}}}]\]
The numerator is simply the normal pdf; we divide by the probability that the latent variable exceeds \(\tau\), i.e., the normal CDF evaluated at \((\mu-\tau)/\sigma\).
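A small sketch checking this density against scipy's built-in truncated normal (the values of \(\mu\), \(\sigma\), \(\tau\), and \(y\) are arbitrary).

```python
import numpy as np
from scipy.stats import norm, truncnorm

mu, sigma, tau = 0.0, 1.0, 0.5
y = 1.2  # an observed value above tau

# the formula above: normal pdf divided by pr(y_latent > tau)
manual = (1 / sigma) * norm.pdf((y - mu) / sigma) / norm.cdf((mu - tau) / sigma)

# scipy's truncated normal, truncated below at tau (bounds in standard units)
a, b = (tau - mu) / sigma, np.inf
print(manual, truncnorm.pdf(y, a, b, loc=mu, scale=sigma))  # the two should match
```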
The PDF and the CDF
Let’s take the expectation of this pdf, as it yields an important statistic.
\[\tiny {E(y|y>\tau)}=\mu+\sigma {{\phi\left({{\mu-\tau}\over{\sigma}}\right)}\over{\Phi\left({{\mu-\tau}\over{\sigma}}\right)}}\]
Or, just simply, \(\mu+\sigma\, \kappa\!\left({{\mu-\tau}\over{\sigma}}\right)\), with \(\kappa(.)\) representing \(\phi(.)/\Phi(.)\)
In this case, \(\kappa\) (kappa) is a statistic called the inverse Mills ratio
\[{E(y|y>\tau)}=\mu+\sigma {{\phi\left({{\mu-\tau}\over{\sigma}}\right)}\over{\Phi\left({{\mu-\tau}\over{\sigma}}\right)}}\]
If \(\tau\) is greater than \(\mu\), meaning we would have serious truncation or censoring, then this ratio will be large.
But, if \(\mu\) is much greater than \(\tau\) – we don’t have much censoring – this ratio goes to zero and the distribution is just the normal density.
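A small numerical illustration of this behavior, using scipy for \(\phi\) and \(\Phi\) (the values of \(\mu\), \(\sigma\), and \(\tau\) are arbitrary).

```python
from scipy.stats import norm

def trunc_mean(mu, sigma, tau):
    """E(y | y > tau) = mu + sigma * kappa, with kappa the inverse Mills ratio."""
    kappa = norm.pdf((mu - tau) / sigma) / norm.cdf((mu - tau) / sigma)
    return mu + sigma * kappa

# tau well below mu: little truncation, the mean stays close to mu
print(trunc_mean(0, 1, -3))  # ~0.004
# tau above mu: heavy truncation, the mean is pulled well above mu
print(trunc_mean(0, 1, 2))   # ~2.37
```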
Let’s start by assuming a censored dependent variable (Long 1997, pp. 195-196 for the full derivation).
We observe \(x\) for all cases, but we do not fully observe \(y\).
Censoring from below:
\[y_{observed} = \begin{cases} \tau, & y_{latent}\leq\tau\\ y_{latent}, & y_{latent}>\tau \end{cases}\]
Censoring from above:
\[y_{observed} = \begin{cases} \tau, & y_{latent}\geq\tau\\ y_{latent}, & y_{latent}<\tau \end{cases}\]
Substituting the regression model for the latent variable (censoring from below):
\[y_{observed} = \begin{cases} \tau, & y_{latent}\leq\tau\\ \alpha+\sum_K \beta_k x_{k}+\epsilon, & y_{latent}>\tau \end{cases}\]
The probability of censoring is \(\tiny pr(censored|x_i) = pr(y_{latent} < \tau |x_i) = pr(\epsilon_i<\tau - (\alpha+\sum_K \beta_k x_{k})|x_i)\)
The probability of not being censored given \(x\): \(\tiny Pr(Uncensored|x) = 1 -\Phi((\tau - (\alpha+\sum_K \beta_k x_{k}))/\sigma)= \Phi(((\alpha+\sum_K \beta_k x_{k}) - \tau)/\sigma)\)
Just simplify by defining \(\tiny \delta_i = ((\alpha+\sum_K \beta_k x_{k}) - \tau)/\sigma\)
The probability of being censored is \(\Phi(-\delta_i)\) and the probability of not being censored is \(\Phi(\delta_i)\)
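These two probabilities are the building blocks of the censored-normal (tobit) log-likelihood: censored cases contribute \(\Phi(-\delta_i)\) and uncensored cases contribute the normal density. A minimal sketch in Python with simulated data; the names and the left-censoring point \(\tau=0\) are assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def tobit_negloglik(params, y, x, tau=0.0):
    """Left-censored (tobit) negative log-likelihood, illustrative sketch."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)            # keep sigma positive
    xb = x @ beta
    delta = (xb - tau) / sigma
    censored = y <= tau
    ll = np.where(censored,
                  norm.logcdf(-delta),                            # pr(censored)
                  norm.logpdf((y - xb) / sigma) - np.log(sigma))  # normal density
    return -np.sum(ll)

# simulated data, censored from below at tau = 0
rng = np.random.default_rng(2)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y_latent = x @ np.array([0.5, 1.0]) + rng.normal(size=n)
y = np.maximum(y_latent, 0.0)

fit = minimize(tobit_negloglik, x0=np.zeros(3), args=(y, x), method="BFGS")
print(fit.x)  # [beta_0, beta_1, log(sigma)]
```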
Censoring (like truncation) can occur from the “left,” “right,” or really anywhere in an observed distribution
Panel data are common in political science
Unlike cross sectional data, units are repeatedly observed
The Time-Series Cross-Section (TSCS) design
Autocorrelated errors
The AR/Markov Process
\[ Pr(C_{t+1}|C_{t},\ldots,C_{1})=Pr(C_{t+1}|C_{t}) \]
The Drunkard’s Walk
E.g., to model the probability of a voter being a Republican today, we would condition on whether they were a Republican yesterday.
Or, vote turnout in Arizona
The probabilities of moving between the \(C\) states are called “transition probabilities.”
We can represent the transitions between \(m\) realizations of \(C\) as a “transition matrix.”
The rows represent the state at time \(t\) and the columns the state at \(t+1\). Each row must sum to 1, of course, in order to be a proper probability distribution.
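A small numpy illustration of a two-state transition matrix (the probabilities are made up), checking that each row sums to 1 and simulating one chain.

```python
import numpy as np

# rows: state at time t; columns: state at t+1
# states: 0 = not Republican, 1 = Republican (illustrative probabilities)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(P.sum(axis=1))  # [1., 1.] -- each row is a proper probability distribution

# simulate a single voter's party identification over 20 periods
rng = np.random.default_rng(0)
state, path = 1, [1]
for _ in range(20):
    state = rng.choice(2, p=P[state])  # next state depends only on the current state
    path.append(state)
print(path)
```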
\[y_{j,i}=\beta_0+\beta_1 x_{j,i}+e_{j,i}\]
\(y\) is an observation nested within a geographical unit, time, etc.
Perhaps the intercept in this equation varies across regions
\[\tiny y_{j,i}=\beta_0+\beta_1 x_{j,i}+\sum_j^{J-1} \gamma_{j} d_j+ e_{j,i}\]
\[\tiny y_{j,i}=\beta_{0,j}+\beta_1 x_{j,i}+ e_{1,j,i}\]
\[\tiny \beta_{0,j}=\gamma_0+e_{2,j}\]
\[\tiny e_{2,j} \sim N(0, \sigma^2)\]
\[\tiny \beta_{0,j}\sim N(\gamma_0, \sigma^2)\]
A two level model
At level 1,
\[ y_{j,i}=\beta_{0}+\beta_1 x_{j,i}+ e_{1,j,i}\]
At level 2,
\[ y_{j}=\gamma_{0}+\gamma x_{j}+ e_{2,j}\]
The ecological fallacy
For instance, Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do (Gelman 2008)
\[\tiny p(y_{j,i}=1)=logit^{-1}(\beta_{0}+\beta_1 x_{j,i})\]
\[\tiny \bar{y}_{j}=\gamma_{0}+\gamma_1 x_{j}+e_{2,j}\]
Limitations of the fixed effects model. We have to add \(J-1\) dummy variables to the model
An equivalent approach is to just remove the \(j\)-level means from \(y\)
\[(y_{j,i}-\bar{y}_j)=\beta_{0}+\beta_1 x_{j,i}+ e_{i}\]
\[(y_{j,i}-\bar{y}_j)=\beta_{0}+\beta_1 (x_{j,i}-\bar{x}_j)+ e_{i}\]
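A short sketch of this within (de-meaned) estimator in Python; the simulated data frame and the column names y, x, g are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# simulated grouped data (hypothetical): 20 groups of 30 observations
rng = np.random.default_rng(3)
df = pd.DataFrame({"g": np.repeat(np.arange(20), 30)})
df["x"] = rng.normal(size=len(df)) + 0.1 * df["g"]
df["y"] = 2 + 0.5 * df["x"] + 0.3 * df["g"] + rng.normal(size=len(df))

# the within estimator: subtract the group means from y and x, then run OLS
y_w = df["y"] - df.groupby("g")["y"].transform("mean")
x_w = df["x"] - df.groupby("g")["x"].transform("mean")
print(sm.OLS(y_w, sm.add_constant(x_w)).fit().params)
```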
Two regression models, within and between
Recall the assumption that \(cov(e_i, e_j)=0, \forall i\neq j\)? Or the assumption that \(e_{j,i}\) are independent and identically distributed?
In the two stage formulation, we don’t ever correct for this process.
\[ \begin{eqnarray} y_{i}=b_{0,j[i]}+e_{1,i}\\ b_{0,j}=\omega_0+e_{2,j}\\ e_{1,i} \sim N(0, \sigma_1^2)\\ e_{2,j} \sim N(0, \sigma_2^2) \end{eqnarray} \]
\[\begin{eqnarray} y_{i}=\omega_0+e_{1,i}+e_{2,j[i]}\\ \end{eqnarray}\]
\[var(y_{i})=var(e_{1,i})+var(e_{2,j[i]})\]
\[\sigma^2_{y}=\sigma^2_{1}+\sigma^2_{2}\]
\[\tiny \begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1} x_{i}+e_{1,i}\\ b_{0,j}=\omega_0+\omega_1 x_{j}+e_{2,j}\\ e_{1,i} \sim N(0, \sigma_1^2)\\ e_{2,j} \sim N(0, \sigma_2^2) \end{eqnarray}\]
\[x_{within}=x_{i}-\bar{x}_{j[i]}\]
\[x_{between}=\bar{x}_{j[i]}\]
\[\tiny \begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1} x_{within}+e_{1,i}\\ b_{0,j}=\omega_0+\omega_1 x_{between}+e_{2,j}\\ e_{1,i} \sim N(0, \sigma_1^2)\\ e_{2,j} \sim N(0, \sigma_2^2) \end{eqnarray} \]
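A sketch of this within/between specification with statsmodels' MixedLM; the simulated data frame and column names are the same assumptions as in the earlier sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# the same simulated grouped data as above (hypothetical names)
rng = np.random.default_rng(3)
df = pd.DataFrame({"g": np.repeat(np.arange(20), 30)})
df["x"] = rng.normal(size=len(df)) + 0.1 * df["g"]
df["y"] = 2 + 0.5 * df["x"] + 0.3 * df["g"] + rng.normal(size=len(df))

# split x into its between (group-mean) and within (deviation) components
df["x_between"] = df.groupby("g")["x"].transform("mean")
df["x_within"] = df["x"] - df["x_between"]

# random-intercept model with the within/between decomposition
m = smf.mixedlm("y ~ x_within + x_between", df, groups=df["g"]).fit()
print(m.params)
```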
\[\begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1,j[i]}x_{i}+e_{1,i}\\ b_{0,j}=\omega_0+e_{2,j}\\ b_{1,j}=\omega_1+e_{3,j} \end{eqnarray}\]
\[\begin{eqnarray} y_{i}=\omega_0+e_{2,j[i]}+(\omega_1+e_{3,j[i]})x_{i}+e_{1,i}\\ \end{eqnarray}\]
\[cov(e_{2,j[i]}, e_{3,j[i]}) \neq 0\]
\[ \tiny \begin{eqnarray} y_{i}=b_{0,j[i]}+b_{1,j[i]}x_{i}+e_{1,i}\\ b_{0,j}=\omega_0+\omega_1 x_{j} +e_{2,j}\\ b_{1,j}=\phi_0+\phi_1 x_{j}+e_{3,j}\\ \end{eqnarray}\]

- The model captures the extent to which covariates change the \(j\)th value of \(y\) (the intercept equation) and how covariates change the relationship between \(x\) and \(y\) (the slope equation).
\[\begin{eqnarray} y_{i}=\omega_0+\omega_1 x_{j[i]} +e_{2,j[i]}+(\phi_0+\phi_1 x_{j[i]}+e_{3,j[i]})x_{i}+e_{1,i}\\ \end{eqnarray}\]
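The random-slope specification can be estimated the same way by adding a re_formula, which also yields the estimated \(cov(e_{2,j[i]}, e_{3,j[i]})\); this sketch continues with the simulated data frame df from the previous sketch.

```python
import statsmodels.formula.api as smf

# random intercept and random slope on x; cov_re reports their 2x2 covariance
m = smf.mixedlm("y ~ x", df, groups=df["g"], re_formula="~x").fit()
print(m.cov_re)
```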
\[y_{j,i}=\beta_0+\sum_j^{J-1} \gamma_{j} d_j+ e_{j,i}\]
\[y_{j,i}=\beta_0+ e_{j,i}\]
\[\tiny \begin{eqnarray} y_{j[i]}=b_{0,j[i]}+e_{1,i}\\ \end{eqnarray}\]
\[\tiny \begin{eqnarray} b_{0,j}={{\bar{y}_j\times n_j/\sigma^2_y+\bar{y}_{all}\times 1/\sigma^2_{b_0}}\over{n_j/\sigma^2_y+ 1/\sigma^2_{b_0}}}\end{eqnarray}\]
The first part of the numerator represents movement away from the common mean: as \(n_j\) (the group size) increases, the estimate is pulled further from the common mean (the second term in the numerator).
As \(n_j\) increases, the estimated group mean is influenced more by the group's own data than by the common mean.
As \(n_j\) decreases (small groups), the estimate is pooled more strongly toward a single common value.
\[ \begin{eqnarray} b_{0,j}={{\bar{y}_j\times n_j/\sigma^2_y+\bar{y}_{all}\times 1/\sigma^2_{b_0}}\over{n_j/\sigma^2_y+ 1/\sigma^2_{b_0}}}\end{eqnarray}\]
As the within group variance increases, the group mean is pulled towards the pooled mean
As the between group variance increases, the common mean exerts a smaller impact
The values in the numerator are then weighted by the variation between and within level-2 units
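A tiny numeric illustration of this precision-weighted compromise (all values are made up).

```python
# hypothetical values: a small group whose mean sits far from the pooled mean
ybar_j, ybar_all = 4.0, 1.0      # group mean and pooled (common) mean
n_j = 5                          # group size
sigma2_y, sigma2_b0 = 2.0, 0.5   # within- and between-group variances

# precision-weighted compromise between the group mean and the pooled mean
b0_j = (ybar_j * n_j / sigma2_y + ybar_all / sigma2_b0) / (n_j / sigma2_y + 1 / sigma2_b0)
print(b0_j)  # falls between 1.0 and 4.0; moves toward 4.0 as n_j grows
```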
\[ICC=\sigma^2_{b_0}/[\sigma^2_{b_0}+\sigma^2_{y}]\]
Recall,
\[\sigma^2_{all}=\sigma^2_{b_0}+\sigma^2_{y}\]
The ICC should decrease as you include level-2 predictors; compare to a model without predictors
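One way to compute the ICC from an intercept-only MixedLM fit, continuing with the simulated df from the earlier sketches; cov_re holds the level-2 (between-group) variance and scale the residual variance.

```python
import statsmodels.formula.api as smf

# null model: no predictors, random intercepts only
m0 = smf.mixedlm("y ~ 1", df, groups=df["g"]).fit()
sigma2_b0 = float(m0.cov_re.iloc[0, 0])  # between-group variance
sigma2_y = m0.scale                      # within-group (residual) variance
print(sigma2_b0 / (sigma2_b0 + sigma2_y))  # the ICC
```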
Interpretation of the level-2 expected values (i.e., the group means) is based on a compromise between the pooled and no pooling models
If we estimate a regression model with a dummy for every level-2 unit and predictors, the model is not identified because the variables will be collinear (Gelman and Hill 2009, 269)