Extending the Binary Model: Ordered Logit and Probit
Christopher Weber
2024-10-13
Comments On Model Fit and Interpretation
These notes track the assigned reading in Long (1997) on model fit, interpretation, and the ordered regression model
The problem with the categorical models we’ve explored is that they are not directly interpretable, due to the non-linearity of the effect of \(x\) on \(y\)
The partial derivative with respect to \(x\) follows from the chain rule, since \({{\partial F(z)} \over {\partial z}}=f(z)\):
\[{{\partial F(xb)}\over{\partial x}}=b \times f(xb)\]
Interpreting the raw coefficients directly is ill-advised: they depend on the scale of the error term
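As a sketch in R (not from the reading), the average of these derivatives can be computed from a fitted logit; here \(\texttt{fit}\) and the predictor name \(\texttt{x}\) are hypothetical:

```r
# Sketch: average marginal effect of x, the mean of b * f(x_i b)
xb  <- predict(fit, type = "link")          # linear predictor, x_i * b
ame <- mean(coef(fit)["x"] * dlogis(xb))    # dlogis() is the logistic density f
ame
```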
Standardize?
We could partially standardize the slopes by dividing each slope by the standard deviation of \(y_{latent}\) (\(b/s_y\))
Now the interpretation is, “for a one unit change in x we anticipate a b standard deviation change in \(y_{latent}\)”
What does \(y_{latent}\) mean in practice?
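A sketch of this standardization in R, assuming the same hypothetical \(\texttt{fit}\); for the logit, the latent error variance is \(\pi^2/3\) (Long 1997, ch. 3):

```r
# Sketch: y_latent-standardized slopes for a logit
xb       <- predict(fit, type = "link")
s_latent <- sqrt(var(xb) + pi^2 / 3)  # Var(y_latent) = Var(xb) + pi^2/3 for logit errors
coef(fit) / s_latent                  # b/s_y: SD change in y_latent per unit change in x
```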
Is it a Linear Model?
Recall the logit model is:
\[\log{{\theta_i}\over{1-\theta_i}}=x_i\beta\]
Every coefficient in the model represents the expected change in the log odds for a unit change in \(x\).
Unlike an OLS model, the coefficients in the logit (or probit) are not directly interpretable.
The problem boils down to something relatively simple: the model is non-additive and non-linear.
Model Fit
Model fit is an important, though difficult, topic when we are dealing with non-linear models
We derive scalar measures of model fit from the linear model, like, \(R^2\)
It’s hard to find a comparably reliable statistic for non-linear models
We never observe \(y_{latent}\) directly
What is “percent variance explained” on an unobserved scale?
Model Fit and Maximum Likelihood
Counts correctly predicted
A matrix that cross-tabulates the predicted value of \(y_{obs}\) against the actual value of \(y\)
In this 2x2 matrix, the 1,1 and 0,0 entries represent accurate predictions
The off-diagonals are inaccurate predictions
Counts correctly predicted
Generate predictions for when the predicted latent variable is positive (\(y=1\)) or negative (\(y=0\)).
Then, calculate the percent correctly predicted
If we convert the predictions to probabilities using the normal or logistic CDF (\(\texttt{pnorm}\) or \(\texttt{plogis}\) in R), then define the \(ePCP\), the expected percent correctly predicted, as:
\[ePCP={1\over n}\left(\sum_{y_i=1}\hat{p}_i+\sum_{y_i=0}(1-\hat{p}_i)\right)\]
Percent Correctly Predicted is a good starting point
\(pr(Y=1) = 0.55\)
versus, \(pr(Y=1) = 0.91\)
We may be more confident in the latter
Weight each prediction by its constituent probability
We are accounting for our uncertainty. Typically, this number is somewhat lower than PCP
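A minimal R sketch of both quantities, assuming a fitted binary logit \(\texttt{fit}\) (a hypothetical object):

```r
# Sketch: PCP and ePCP from a fitted binary logit `fit`
p <- predict(fit, type = "response")    # predicted probabilities
y <- fit$y                              # observed 0/1 outcome
pcp  <- mean((p > 0.5) == y)            # percent correctly predicted, 0.5 threshold
epcp <- mean(ifelse(y == 1, p, 1 - p))  # weight each case by its predicted probability
```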
Reduction in Error
Assume we estimate two models
A naive model and a model with the predictor(s) of interest
The naive model predicts the outcome based on the modal category
If 51% voted for the Republican, the naive model predicts a Republican vote for everyone and is correct 51% of the time. Our model should never really do worse than 0.51 – if it does, the naive model is superior.
Classification Threshold = 0.5
          y=0    y=1
yhat=0   3095    505
yhat=1      0      0
Percent Correctly Predicted = 85.97%
Percent Correctly Predicted = 100%, for y = 0
Percent Correctly Predicted = 0% for y = 1
Null Model Correctly Predicts 85.97%
[1] 85.97222 100.00000 0.00000
Interpretation
If we were to just estimate \(\theta\), that value would be the same as \(\texttt{plogis}\) applied to the intercept of a regression model with no predictors
The naive model is one that just assumes a single underlying \(\theta\), instead of \(\theta\) being some linear composite of predictors. Then, we may construct a comparison
\[PRE={{PCP-PMC}\over {1-PMC}}\]
PRE is simply the proportional reduction in error, where \(PMC\) is the proportion in the modal category.
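Continuing the sketch above:

```r
# Sketch: proportional reduction in error, using `y` and `pcp` from the PCP code
pmc <- max(mean(y), 1 - mean(y))   # naive model: percent in the modal category
pre <- (pcp - pmc) / (1 - pmc)
```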
The Likelihood Ratio Test
How do we test \(H_0: \beta_{x_1}=\beta_{x_2}=0\)?
Similar to the joint F-test in the linear model, we can define a likelihood ratio test when we use MLE. Define two models:
\(M_0:a\)
\(M_1: a+b_1x_1+b_2 x_2\)
Let's estimate \(M_1\) by maximizing the log likelihood function (\(loglik_{m1}\)), and then estimate a second model where we constrain the two slopes to be 0. Then,
\[G^2(M_0|M_1)=2 loglik_{m1}-2 loglik_{m0}\]
An Example
```r
library(lmtest)
library(dplyr)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat <- dataActive %>%
  mutate(pid = recode(pid3,
                      "Democrat" = 1, "Independent" = 2, "Republican" = 3,
                      "Other" = 2, "Not sure" = 2),
         protest = ifelse(violent > 4, 1, 0))
a <- glm(protest ~ 1, family = binomial(link = "logit"), data = dat)
b <- glm(protest ~ as.factor(pid), family = binomial(link = "logit"), data = dat)
lrtest(a, b)
```
The \(G^2\) statistic is distributed \(\chi^2\) with \(df=\)number of constraints (here 2). Clearly, we can reject the null of no influence, see Long (1997, page 94)
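The same statistic by hand, using \(\texttt{a}\) and \(\texttt{b}\) from the code above:

```r
# G^2 = 2*loglik_m1 - 2*loglik_m0, compared to chi-square with df = constraints
g2 <- as.numeric(2 * (logLik(b) - logLik(a)))
pchisq(g2, df = 2, lower.tail = FALSE)
```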
We could also flip things: instead of comparing our model to one with no predictors, we could compare it to a fully saturated model, with as many parameters as data points.
\(G^2\) and Deviance
\[G^2=2 loglik_{Full}-2 loglik_{M_1}\]
\[Deviance=-2 loglik_{M_1}\]
This is the deviance: just \(-2\) times the log likelihood. With individual-level data the saturated model's log likelihood is 0, so \(G^2\) reduces to the deviance
The Wald Test is asymptotically equivalent to the LR test
\[Qb=r\]
Let’s assume one variable, so we estimate a slope and an intercept. Our \(b\) vector is length 2
Let's test the constraint that the slope is 0: declare \(Q=(0,1)\) and \(r=0\). The Wald statistic is
\[W=(Qb-r)^T(Qvar(b)Q^T)^{-1}(Qb-r)\]
Deconstructing the Wald Test
\[W=(Qb-r)^T(Qvar(b)Q^T)^{-1}(Qb-r)\]
The leftmost and rightmost portions measure the distance between the estimated \(b\) and its hypothesized value \(r\) (here, 0), however complex the constraint
That is, the freed model relative to the constrained model
The middle portion represents the uncertainty around the estimates. Again, we multiply by \(Q\) because we are only concerned with \(b\)
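A sketch of the computation in R for model \(\texttt{b}\) above (an intercept plus two party dummies):

```r
# Sketch: Wald test by hand, W = (Qb - r)' (Q V Q')^{-1} (Qb - r)
bvec <- coef(b)                  # intercept and two pid dummies
Q <- rbind(c(0, 1, 0),           # each row picks out one slope
           c(0, 0, 1))
r <- c(0, 0)                     # hypothesized values
W <- t(Q %*% bvec - r) %*% solve(Q %*% vcov(b) %*% t(Q)) %*% (Q %*% bvec - r)
pchisq(as.numeric(W), df = nrow(Q), lower.tail = FALSE)
```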
Warning
The Wald and LR tests are reasonable approaches, but
Their small sample properties are not well understood
They should only be used if the \(\textit{null}\) model is fit to the same data.
These methods can only be used for nested models. In the case above, \(b=0\) is a constraint, so the restricted/constrained model is nested in the unrestricted model.
Scalar Estimates of Fit
Scalar estimates of model fit are not incredibly meaningful in the logit/probit framework. Recall,
\[R^2={RegSS\over TSS}=1-{RSS\over TSS}\]
The problem is that in the logit/probit model, we cannot directly compare \(Y_{obs}\) to the prediction we make with respect to \(Y_{latent}\)
Scalar Estimates of Fit
Nonetheless, let's define the Efron (1978) measure of pseudo-\(R^2\) as,
\[R^2_{Efron}=1-{{\sum_{i}(y_i-\hat{p}_i)^2}\over{\sum_{i}(y_i-\bar{y})^2}}\]
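In R, for model \(\texttt{b}\) above:

```r
# Efron's pseudo-R^2: compare squared residuals to variation around the mean
p <- predict(b, type = "response")
y <- b$y
1 - sum((y - p)^2) / sum((y - mean(y))^2)
```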
The Akaike Information Criterion (AIC) is defined as:
\[AIC={{-2loglik(\theta)+2P}\over N}\]
Information Measures
Calculate \(-2loglik\) and add \(2 \times P\), the number of parameters, where \(P=K+1\) (Long 1997, 109)
Finally, divide by the number of observations. Notice what happens with this function: as the number of parameters increases while the log likelihood stays the same, the AIC will increase
We should prefer a smaller AIC. The statistic penalizes added parameters that do not improve fit
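For model \(\texttt{b}\) above, Long's version of the statistic is:

```r
# AIC as defined above: (-2*loglik + 2P) / N, with P = K + 1 parameters
(-2 * as.numeric(logLik(b)) + 2 * length(coef(b))) / nobs(b)
```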
BIC
The Bayesian Information Criterion (BIC) is based on a comparison between a fully saturated model and the proposed model. The BIC is:
\[BIC=D(M)-df \ln N\]
\(D(M)\) is simply the deviance for the model
The degrees of freedom calculation is \(N-k-1\), where \(k\) is the number of predictors.
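For model \(\texttt{b}\) above:

```r
# BIC as defined above: deviance minus residual df (N - k - 1) times log(N)
deviance(b) - df.residual(b) * log(nobs(b))
```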
The Ordered Logit
This summary follows your assigned reading in Long (1997)
Only use an ordered parameterization when we have ordered data.
Some data can be ordered, even if they are theoretically multidimensional; others should be modeled differently
Examples: PID, Ideology (social and economic dimensions)
“How much do you agree or disagree with the following item?” from “1” Strongly Disagree to “5” Strongly Agree.
Why not OLS?
Ordered, non-interval level data may violate the assumptions of the classical linear regression model.
Non-constant variance.
Predictions may be nonsensical (i.e., we predict values outside the observed bounds).
The distances between categories may be theoretically quite different, yet OLS treats them as equal.
Probit or Logit
Choosing the ordered logit or probit is a matter of personal preference.
The ordered logit carries the proportional odds assumption.
We model \(y_{latent}\), where \(y_{obs} \in \{1,2,3,\dots,k\}\).
The Latent Variable Approach
Instead of the variable being 0/1, it now takes more than two ordered categories. Assume we knew \(y_{latent}\) and would like to map it onto the observed categories.
Using the same logic from the binary regression model, assume that we observe the category based on its orientation to a series of cutpoints, where
\[y_i=m \;\;\text{if}\;\; \tau_{m-1}\leq y_{latent} < \tau_{m}\]
The Measurement Model
The \(\tau\) parameters represent a series of thresholds that map the latent variable onto the categorical variable.
In \(\texttt{MASS}::\texttt{polr}()\) these are “zeta”
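A sketch of estimation in R; the outcome \(\texttt{support}\) (a 5-category item) and predictors \(\texttt{x1}\), \(\texttt{x2}\) are hypothetical:

```r
# Sketch: an ordered logit via MASS::polr()
library(MASS)
om <- polr(as.factor(support) ~ x1 + x2, data = dat, method = "logistic")
summary(om)   # the "zeta" rows are the estimated cutpoints, the tau's
```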
If \(F\) denotes the CDF, then for the ordered probit:
\[\begin{aligned}
Pr(y_i=1)&=\Phi(\tau_1-x_i\beta)\\
Pr(y_i=2)&=\Phi(\tau_2-x_i\beta)-\Phi(\tau_1-x_i\beta)\\
&\;\;\vdots\\
Pr(y_i=k)&=\Phi(\infty-x_i\beta)-\Phi(\tau_{k-1}-x_i\beta)=1-\Phi(\tau_{k-1}-x_i\beta)
\end{aligned}\]
The last row is simplified because a CDF evaluated at \(\infty\) is 1, so the first term becomes 1. Any CDF is plausible, such as the logistic, in which case we have,
\[Pr(y_i=m)=\Lambda(\tau_m-x_i\beta)-\Lambda(\tau_{m-1}-x_i\beta)\]
This only refers to the probability space for a single subject. Since the likelihood is \(\prod_{i=1}^N p_i\), we need the joint probability across all subjects, which is,
\[L(\beta,\tau|y)=\prod_{i=1}^N \prod_{m=1}^{k}\left[F(\tau_m-x_i\beta)-F(\tau_{m-1}-x_i\beta)\right]^{I(y_i=m)}\]
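Continuing the \(\texttt{polr}\) sketch above, each subject's contribution can be inspected directly:

```r
# Each row is a subject's predicted probability of each category,
# F(tau_m - x_i*b) - F(tau_{m-1} - x_i*b)
head(predict(om, type = "probs"))
```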