These notes track the assigned reading in Long (1997) on model fit, interpretation, and the ordered regression model.
These notes also follow McElreath (2016) in Statistical Rethinking, Chapters 10 and 11.
The problem with the categorical models we’ve explored is that they are not directly interpretable, due to the non-linearity of the effect of \(x\) on \(y\).
Recall the non-linearity and non-additivity of the binomial model.
The model is linear with respect to the log odds, but not the probabilities.
We cannot directly compare logit coefficients to probit coefficients.
Every coefficient in the model represents the expected change in the log odds for a unit change in \(x\) (logit).
Unlike an OLS model, the coefficients in the logit (or probit) are not directly interpretable.
The model is non-additive and non-linear.
How do we interpret the coefficients?
Interpretation
Standardize. Divide the coefficient by the standard deviation of the latent variable. Now the interpretation is, “for a one unit change in \(x\) we anticipate a \(b\) standard deviation change in \(y_{latent}\)”
Odds Ratio. Exponentiate the coefficient and interpret the odds ratio, \(\exp(b)\). This only applies to the logit model.
Predictive Effects. Calculate the predicted probability of \(y=1\) for a discrete change in \(x\) from one value to another, holding all other variables constant at fixed values.
Marginal Effects. Calculate the instantaneous rate of change (the partial derivative) of the predicted probability of \(y=1\) with respect to \(x\), holding all other variables constant.
Tools. predict (from base R's stats package, applied to glm objects), ggplot2, MASS
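To make these strategies concrete, here is a minimal sketch with simulated data; the variables \(\texttt{x}\), \(\texttt{z}\), and \(\texttt{y}\) and all coefficient values are hypothetical:

```r
## Minimal sketch: fit a logit on simulated data, then interpret it
set.seed(1234)
n  <- 1000
x  <- rnorm(n)
z  <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x + 0.3 * z))
m1 <- glm(y ~ x + z, family = binomial(link = "logit"))

exp(coef(m1))  # odds ratios, exp(b)

# Predicted probability of y = 1 as x moves from -1 to 1,
# holding z at its mean
nd <- data.frame(x = c(-1, 1), z = mean(z))
predict(m1, newdata = nd, type = "response")
```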
Model Fit and Maximum Likelihood
Model fit is an important, though difficult, topic when we are dealing with non-linear models.
We may derive scalar measures of model fit from the linear model, like \(R^2\).
It’s hard to find a comparably reliable statistic for non-linear models.
We never observe \(y_{latent}\) directly.
What is “percent variance explained” on an unobserved scale?
Counts correctly predicted
A matrix that cross-tabulates the predicted value of \(y\) against the actual value of \(y\).
In this 2x2 matrix, the 1,1 and 0,0 entries represent accurate predictions.
The off-diagonals are inaccurate predictions.
A confusion matrix.
Generate predictions for when the predicted latent variable is positive (\(y=1\)) or negative (\(y=0\)).
Then, calculate the percent correctly predicted.
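Continuing the simulated example above, a minimal sketch of this calculation:

```r
## Classify at 0.5 and tabulate predictions against observed outcomes
p_hat <- predict(m1, type = "response")
y_hat <- as.numeric(p_hat > 0.5)
table(y_hat, y)              # the confusion matrix
pcp <- mean(y_hat == y)      # percent correctly predicted
pcp
```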
Correcting for Chance
Percent Correctly Predicted is a good starting point, but it treats all correct classifications alike.
Compare \(pr(Y=1) = 0.55\)
versus \(pr(Y=1) = 0.91\).
Both count as correct predictions of \(y=1\) at a 0.5 threshold, but we may be more confident in the latter.
Weight each prediction by its constituent probability.
We are accounting for our uncertainty. Typically, this number is somewhat lower than PCP.
If we convert these to probabilities by using the inverse normal or inverse logit (\(\texttt{pnorm}\) or \(\texttt{plogis}\)), then define the \(ePCP\), expected proportion correctly predicted, as:
\[ePCP = \frac{1}{n}\left(\sum_{y_i = 1}\hat{p}_i + \sum_{y_i = 0}(1 - \hat{p}_i)\right)\]
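Continuing the running example, a one-line sketch of this quantity:

```r
## ePCP: weight each case by the predicted probability of its observed outcome
epcp <- mean(ifelse(y == 1, p_hat, 1 - p_hat))
epcp
```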
Compare two models: a model with the expected predictor (Model 1) and a naive model (Model 2).
The naive model (Model 2) predicts the outcome based on the modal category.
If 51% voted for the Republican, the model would predict a Republican vote with probability of 0.51. We should never really get less than 0.51 – if we do, then the naive model would be a superior model.
If 78% voted for the Liberal Party, the model would predict a Liberal vote with probability of 0.78. We should never really get less than 0.78 – if we do, then the naive model would be a superior model. Likewise, if our complicated, proposed model yields 0.80, the new model doesn't seem to reduce much error.
Classification Threshold = 0.5

```
          y=0   y=1
yhat=0   1716   641
yhat=1    616   627
```

Percent Correctly Predicted = 65.08%
Percent Correctly Predicted = 73.58% for y = 0
Percent Correctly Predicted = 49.45% for y = 1
Null Model Correctly Predicts 64.78%
Interpretation
If we were to just estimate \(\theta\), that value would be the same as \(\texttt{plogis(a)}\) from a regression model with no predictors.
The naive model is one that just assumes a single underlying \(\theta\), instead of \(\theta\) being some linear composite of predictors. Then, we may construct a comparison.
\[PRE={{PCP-PMC}\over {1-PMC}}\]
PRE is simply the proportional reduction in error, where PMC is the percent in the modal category (the naive model's PCP).
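Continuing the running example, a sketch of the calculation:

```r
## PRE: improvement over always predicting the modal category
pmc <- max(mean(y), 1 - mean(y))   # share in the modal category
pre <- (pcp - pmc) / (1 - pmc)
pre
```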
The Likelihood Ratio Test
How do we test whether multiple predictors jointly improve model fit?
Example: Test \(H_0: \beta_{1}=\beta_{2}=0\) (both slopes are zero).
Set up two nested models:
Null Model (\(M_0\)): \(y = \beta_0\) (intercept only)
Full Model (\(M_1\)): \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\) (includes predictors)
The \(G^2\) statistic is distributed \(\chi^2\) with \(df =\) the number of constraints (here 2). Clearly, we can reject the null of no influence, see Long (1997, page 94).
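Continuing the simulated example, a minimal sketch of the test:

```r
## LR test of H0: beta_1 = beta_2 = 0
m0 <- glm(y ~ 1, family = binomial)
G2 <- as.numeric(2 * (logLik(m1) - logLik(m0)))
pchisq(G2, df = 2, lower.tail = FALSE)
# equivalently: anova(m0, m1, test = "Chisq")
```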
We could flip things: instead of comparing our model to one with no predictors, we could compare our model to a fully saturated model, one with as many parameters as data points.
\[G^2=2 loglik_{Full}-2 loglik_{M_1}\]
\[Deviance=-2 loglik_{M_1}\]
This is the deviance. It is just negative two times the log likelihood.
Deviance
Deviance = measure of lack of fit (lower is better).
Deconstructing the Wald Test
\[W = (Qb - r)^{T}\left[Q\,\mathrm{var}(b)\,Q^{T}\right]^{-1}(Qb - r)\]
Under \(H_0\): \(W \sim \chi^2_{df}\), where \(df\) = number of constraints.
The leftmost and rightmost terms, \((Qb - r)\), measure the distance between the estimated value of \(b\) and its hypothesized value under the null (here, 0).
This compares the freed (unconstrained) model to the constrained model.
Because there is uncertainty around the estimates, the middle term weights that distance by the variance; we pre- and post-multiply by \(Q\) because we are only concerned with the coefficients under test.
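A sketch of the same hypothesis tested by hand, continuing the simulated example; here \(Q\) selects the two slope coefficients and \(r = 0\):

```r
## Wald test of H0: beta_1 = beta_2 = 0
b <- coef(m1)                 # (intercept, b1, b2)
V <- vcov(m1)
Q <- rbind(c(0, 1, 0),
           c(0, 0, 1))
r <- c(0, 0)
W <- t(Q %*% b - r) %*% solve(Q %*% V %*% t(Q)) %*% (Q %*% b - r)
pchisq(as.numeric(W), df = 2, lower.tail = FALSE)
```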
When to Use
Compares nested models.
The Wald and LR tests are reasonable approaches, but…
Their small sample properties are not always well defined.
They should only be used if the \(\textit{null}\) model is estimated on the same data.
These methods can really only be used for nested models. In the case above, \(b=0\) is a constraint, so the restricted/constrained model is nested in the unrestricted model.
Scalar Estimates of Fit
Scalar estimates of model fit are less intuitive in the logit/probit framework.
\[R^2 = \frac{RegSS}{TSS} = 1 - \frac{RSS}{TSS}\]
The problem is that in the logit/probit model, we cannot directly compare \(Y_{obs}\) to the prediction we make for \(Y_{latent}\).
Pseudo-\(R^2\) measures substitute quantities we can compute: McKelvey and Zavoina's version uses the latent prediction, while Efron's (1978) uses the predicted probabilities.
AIC
Calculate \(-2loglik\) and add \(2 \times p\), where \(p = K + 1\) is the number of parameters (Long 1997, 109).
Finally, divide by the number of observations. Notice what happens: as the number of parameters increases while the log likelihood stays the same, the AIC increases.
We should prefer a smaller AIC. The statistic penalizes added parameters that do not improve fit.
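A sketch, continuing the example; note that Long's version divides by \(N\), while R's built-in \(\texttt{AIC()}\) does not:

```r
## AIC by hand, following Long's formula
p_par <- length(coef(m1))                        # K + 1 parameters
(-2 * as.numeric(logLik(m1)) + 2 * p_par) / n    # Long's AIC
AIC(m1)                                          # R's AIC (not divided by N)
```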
BIC
The Bayesian Information Criterion (BIC) is based on a comparison between a fully saturated model and the proposed model. The BIC is:
\[BIC=D(M)-df \times log(N)\]
\(D(M)\) is simply the deviance for the model.
The degrees of freedom calculation is \(N-k-1\), where \(k\) is the number of predictors.
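A sketch following this formula, continuing the example; note that R's built-in \(\texttt{BIC()}\) computes a different (penalized likelihood) quantity:

```r
## BIC following Long's formula: deviance minus df * log(N)
dev_m <- -2 * as.numeric(logLik(m1))
df_m  <- n - length(coef(m1))        # N - k - 1
dev_m - df_m * log(n)
BIC(m1)                              # R's BIC = -2 loglik + p * log(N)
```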
The Ordered Logit
This summary follows your assigned reading in Long (1997) and McElreath, Chapter 11.
Only use an ordered parameterization when we have ordered data.
Some data can be ordered, even if they are theoretically multidimensional; others should be modeled differently.
Examples: PID, Ideology (social and economic dimensions)
“How much do you agree or disagree with the following item?” from “1” Strongly Disagree to “5” Strongly Agree.
Why not OLS?
Ordered, non-interval level data may violate the assumptions of the classical linear regression model.
Non-constant variance.
Predictions may be nonsensical (i.e., we predict values outside the observed bounds).
The distances between categories may be theoretically quite different, but OLS treats them as equal intervals.
Probit or Logit
Whether you choose ordered logit or probit is often just a matter of personal preference. I'll use the logit for now; specifying a probit is just a matter of changing the link function.
The ordered logit is also called the proportional odds regression model.
It is a generalization of the binary logit, using the logic of accumulating probabilities.
Conceptually just think about it as a number of binary logits, where the cutpoints slice the latent distribution into discrete categories.
In the ordered logit or probit parameterization, we do not estimate the intercept, \(\beta_0\), because it is not separately identified from the cutpoints, which serve as intercepts for the cumulative logits.
Estimating an ordered logit model on binary data reduces to the binary logit model.
Instead of one cutpoint – the intercept in logit – now we estimate \(k-1\) cutpoints.
The cutpoints map \(y_{latent}\) to \(y_{obs} \in \{1, 2, 3, \ldots, k\}\).
Accumulated Comparisons: With a four category outcome, we can think of three models. Compare category 1 to category 2, 3, 4; then compare categories 1, 2 to categories 3, 4; and finally compare categories 1, 2, 3 to category 4.
It’s useful to envision this as three individual logit models, but where the parameters, \(\beta\), are constrained to be the same across the three models. This is called the parallel regression assumption or the parallel lines assumption.
\[\Pr(y \le j) = F_{logit}(\tau_j - X\beta)\]
where \(F_{logit}(z) = \frac{1}{1 + e^{-z}}\) is the logistic CDF.
\(p\) is a vector of probabilities for each category, summing to 1.
The \(j\)th element of \(p\) is the probability of observing category \(j\).
The cumulative log odds is a function of the covariate matrix \(X\), the cutpoints \(\tau\), and the coefficients \(\beta\), with an error term \(e\) that follows the logistic distribution.
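A sketch of how cutpoints and a linear predictor map onto category probabilities; the \(\tau\) values and linear predictor below are hypothetical:

```r
## Four categories: k - 1 = 3 cutpoints slice the latent scale
tau <- c(-1.5, 0, 1.5)           # hypothetical cutpoints
xb  <- 0.4                       # hypothetical linear predictor, X %*% beta
cum_p <- plogis(tau - xb)        # Pr(y <= j) for j = 1, 2, 3
p <- diff(c(0, cum_p, 1))        # category probabilities; sum(p) == 1
p
```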
Parallel Lines
Each line corresponds to the cumulative log odds of combining categories. The lines are parallel: the odds ratios are constant, i.e., proportional. The distance between the lines is constant, which means that the effect of \(X\) is the same across all cumulative comparisons.
There is no intercept term; that role is played by the cutpoints. It's useful to think of the cutpoints as intercepts for each of the cumulative logits.
The Measurement Model
Instead of the variable being 0/1, it now has more than two categories that are ordered. Assume we knew \(y_{latent}\) and would like to map that to observing a particular category.
Using the same logic from the binary regression model, assume that we observe the category based on the orientation to a series of cutpoints, where
\[y_{obs} = j \quad \text{if} \quad \tau_{j-1} \le y_{latent} < \tau_{j},\]
with \(\tau_0 = -\infty\) and \(\tau_k = \infty\).
The likelihood of the ordered logit or probit model is the joint probability of being in each category, so we need to calculate the likelihood (\(L(y|\theta)\)) from
\[\Pr(y_i = j) = F(\tau_j - x_i\beta) - F(\tau_{j-1} - x_i\beta)\]
This only references the probability space for a single subject. Since the likelihood is \(\prod_{i=1}^N p_i\), we take the product of these probabilities across all subjects.
Like the binary case: \(x \rightarrow y_{latent} \rightarrow y_{obs}\).
The only thing that is different is that instead of a single cutpoint – at 0 – we have a series of cutpoints, corresponding to the number of categories minus 1.
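A sketch of this likelihood as an R function; the inputs (design matrix \(X\), coefficients \(\beta\), cutpoints \(\tau\), observed categories) are hypothetical:

```r
## Ordered logit log likelihood: sum of log category probabilities
ologit_loglik <- function(beta, tau, X, y_cat) {
  xb   <- as.numeric(X %*% beta)
  cuts <- c(-Inf, tau, Inf)            # tau_0 = -Inf, tau_k = Inf
  p_i  <- plogis(cuts[y_cat + 1] - xb) - plogis(cuts[y_cat] - xb)
  sum(log(p_i))
}
```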
The Western States Data (2020)
Let's estimate an ordered logit model in R, using \(\texttt{polr}\) from the MASS package. Data were collected pre- or post-election, and we want to see whether support for electoral contestation behavior (here, burning the flag) changes over this period for Trump voters versus Biden voters. The model is specified to examine whether support for contestation varies depending on the electoral outcome: a winner-loser effect.
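A sketch of the call that produces the output below; \(\texttt{Hess = TRUE}\) stores the Hessian so \(\texttt{summary()}\) can compute standard errors:

```r
library(MASS)
m_ord <- polr(as.factor(burn_flag) ~ prepost * vote_trump,
              data = sample_df, Hess = TRUE)
summary(m_ord)
```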
```
Call:
polr(formula = as.factor(burn_flag) ~ prepost * vote_trump, data = sample_df)

Coefficients:
                     Value Std. Error t value
prepost            -0.2686    0.09701  -2.769
vote_trump         -0.6393    0.07787  -8.210
prepost:vote_trump  1.4364    0.17112   8.394

Intercepts:
    Value    Std. Error  t value
1|2 -1.9499  0.0663     -29.4176
2|3 -0.9642  0.0564     -17.0865
3|4  0.5589  0.0544      10.2718
4|5  1.9233  0.0696      27.6224

Residual Deviance: 8700.134
AIC: 8714.134
(719 observations deleted due to missingness)
```
There is clearly an interaction effect: support varies depending on whether the observation was before or after the election and whether the respondent voted for Trump or Biden. The signs of the lower-order and interaction terms indicate that Trump voters are more supportive post-election, while Biden voters are more supportive pre-election. But how should we interpret this?
The predicted probabilities indicate that Trump voters are more supportive of burning the flag, post-election. Biden voters are less supportive, post-election.
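A sketch of how such predicted probabilities could be generated, assuming \(\texttt{prepost}\) and \(\texttt{vote\_trump}\) are 0/1 indicators:

```r
## Predicted category probabilities for the four pre/post x vote cells
nd <- expand.grid(prepost = c(0, 1), vote_trump = c(0, 1))
cbind(nd, predict(m_ord, newdata = nd, type = "probs"))
```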