Multicategory Models

The Ordered Logit and Ordered Probit

Christopher Weber

2025-10-14

Overview

  • These notes track the assigned reading in Long (1997) on model fit, interpretation, and the ordered regression model.

  • These notes also follow McElreath (2016) in Statistical Rethinking, Chapters 10 and 11.

  • The problem with the categorical models we’ve explored is that they are not directly interpretable, due to the non-linearity of the effect of \(x\) on \(y\).

  • Recall the non-linearity and non-additivity of the binomial model.

\[{{\partial F(x)}\over{\partial x}}=b \times f(x)\]

Non Linearity

  • For the logit:

\[{{\partial F_{logit}(x)}\over{\partial x}}=b \times {{\exp(a+bx)}\over{[1+\exp(a+bx)]^2}}\]

  • For the probit, just replace \(f(x)\) with the normal PDF. A quick numerical check of this derivative appears below.
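A quick numerical check that \({\partial F(x)}/{\partial x} = b \times f(x)\) for the logit, using hypothetical values \(a = -1\) and \(b = 0.5\):

# Numerical vs. analytic derivative of the logistic CDF at x = 2,
# with hypothetical coefficients a and b
a <- -1; b <- 0.5
x <- 2; h <- 1e-6
(plogis(a + b * (x + h)) - plogis(a + b * x)) / h  # numerical slope
b * dlogis(a + b * x)                              # b * f(x); matches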

Non Additivity

  • It is even more challenging to interpret the results when there are other covariates, since:

\[{{\partial F_{logit}(x)}\over{\partial x_k}}=b_k \times {{\exp(a+\sum_j b_j x_j)}\over{[1+\exp(a+\sum_j b_j x_j)]^2}}\]

  • The model is linear with respect to the log odds, but not probabilities.

  • We cannot directly compare the logit to probit coefficients.

  • In the logit model, every coefficient represents the expected change in the log odds for a unit change in \(x\).

  • Unlike an OLS model, the coefficients in the logit (or probit) are not directly interpretable.

  • The model is non-additive and non-linear.

  • How do we interpret the coefficients?

Interpretation

  • Standardize. Divide the coefficient by the standard deviation of \(y_{latent}\). Now the interpretation is, “for a one unit change in \(x\), we anticipate a \(b\) standard deviation change in \(y_{latent}\).”

  • Odds Ratio. Exponentiate the coefficient and interpret the odds ratio, \(exp(b)\). This only applies to the logit model.

  • Predicted Probabilities. Calculate the predicted probability of \(y=1\) for a change in \(x\) from some value to another specified value, holding all other variables constant at fixed values.

  • Marginal Effects. Calculate the instantaneous rate of change in the predicted probability of \(y=1\) with respect to \(x\) – the partial derivative above – holding all other variables constant.

  • Tools. stats::predict(), ggplot2, MASS; an example follows.
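A minimal sketch of the predicted-probability approach, assuming a hypothetical fitted binary logit fit with outcome y and a single predictor x in data dat:

fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)
# Predicted probability of y = 1 across the observed range of x
newdat <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 100))
newdat$p_hat <- predict(fit, newdata = newdat, type = "response")
# Discrete change: the change in probability as x moves from 0 to 1
diff(predict(fit, newdata = data.frame(x = c(0, 1)), type = "response"))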

Model Fit and Maximum Likelihood

  • Model fit is an important, though difficult, topic when we are dealing with non-linear models.

  • We may derive scalar measures of model fit from the linear model, such as \(R^2\).

  • It’s hard to find a comparably reliable statistic for non-linear models.

  • We never observe \(y_{latent}\) directly.

  • What is “percent variance explained” on an unobserved scale?

Counts correctly predicted

  • Cross-tabulate the predicted value of \(y\) against the actual value of \(y\).

  • In this 2x2 matrix, the 1,1 and 0,0 entries represent accurate predictions.

  • The off-diagonals are inaccurate predictions.

  • A confusion matrix.

  • Generate predictions: \(\hat{y}=1\) when the predicted latent variable is positive, \(\hat{y}=0\) when it is negative.

  • Then, calculate the percent correctly predicted.
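A minimal sketch of the confusion matrix by hand, assuming a hypothetical fitted binary logit fit and observed outcome dat$y:

p_hat <- predict(fit, type = "response")   # predicted probabilities
y_hat <- ifelse(p_hat > 0.5, 1, 0)         # latent positive <=> p > 0.5
conf <- table(predicted = y_hat, observed = dat$y)
sum(diag(conf)) / sum(conf)                # percent correctly predicted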

Correcting for Chance

  • Percent Correctly Predicted is a good starting point.

  • Suppose one model predicts \(pr(Y=1) = 0.55\) for a case,

  • versus another that predicts \(pr(Y=1) = 0.91\).

  • Both count as “correct” predictions of \(y=1\) at a 0.5 threshold, but we may be more confident in the latter.

  • Weight each prediction by its constituent probability.

  • We are accounting for our uncertainty. Typically, this number is somewhat lower than PCP.

  • If we convert the linear predictions to probabilities using the normal or logistic CDF (\(\texttt{pnorm}\) or \(\texttt{plogis}\)), then we can define the \(ePCP\), the expected proportion correctly predicted, as:

\[ePCP={1\over n}({\sum_{y=1} P_i+\sum_{y=0}(1-P_i)})\]
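A minimal sketch of the ePCP calculation, again assuming a hypothetical fitted binary logit fit and a 0/1 outcome y:

p_hat <- predict(fit, type = "response")
ePCP <- (sum(p_hat[y == 1]) + sum(1 - p_hat[y == 0])) / length(y)
ePCP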

Reduction in Error

  • Assume we estimate two models.
  • Model 1 is our proposed model with predictors; Model 2 is a naive model that predicts the outcome from the modal category alone.
  • If 51% voted for the Republican, the naive model predicts a Republican vote for every respondent and is correct with probability 0.51. Our proposed model should never do worse than 0.51 – if it does, the naive model is superior.
  • Likewise, if 78% voted for the Liberal Party, the naive model is correct with probability 0.78. If our more complicated, proposed model predicts correctly with probability 0.80, it barely reduces the error.

Chance

## Load required packages
library(dplyr)
library(MASS)
library(pscl)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive %>% 
  mutate(
    pid = recode(pid3, "Democrat" = 1, "Independent" = 2, "Republican" = 3, "Other" = 2, "Not sure" = 2),
    protest = ifelse(violent > 3, 1, 0)
  ) 
# Estimate logit
my_model =  glm(protest ~ as.factor(pid),
       family=binomial(link="logit"), data = dat)
# Produce confusion matrix.
hitmiss(my_model)
Classification Threshold = 0.5 
        y=0 y=1
yhat=0 1716 641
yhat=1  616 627
Percent Correctly Predicted = 65.08%
Percent Correctly Predicted = 73.58%, for y = 0
Percent Correctly Predicted = 49.45%  for y = 1
Null Model Correctly Predicts 64.78%
[1] 65.08333 73.58491 49.44795

Interpretation

  • If we were to just estimate \(\theta\), that value would be the same as \(\texttt{plogis(a)}\) from a regression model with no predictors.

  • The naive model is one that just assumes a single underlying \(\theta\), instead of \(\theta\) being some linear composite of predictors. Then, we may construct a comparison.

\[PRE={{PCP-PMC}\over {1-PMC}}\]

  • PRE is simply the proportional reduction in error, where \(PMC\) is the percent correctly predicted by the modal-category (naive) model.
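With the hitmiss() output above, the arithmetic is simple: the fitted model predicts 65.08% correctly and the null (modal category) model predicts 64.78%.

PCP <- 0.6508
PMC <- 0.6478
(PCP - PMC) / (1 - PMC)   # about 0.009: a very small reduction in error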

The Likelihood Ratio Test

  • How do we test whether multiple predictors jointly improve model fit?

Example: Test \(H_0: \beta_{1}=\beta_{2}=0\) (both slopes are zero).

  • Set up two nested models

  • Null Model (\(M_0\)): \(y = \beta_0\) (intercept only)

  • Full Model (\(M_1\)): \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\) (includes predictors)


The Likelihood Ratio Test Statistic:

\[ G^2 = 2(\log L_{M_1} - \log L_{M_0}) = 2\log L_{M_1} - 2\log L_{M_0} \]

where:

  • \(\log L_{M_1}\) = log-likelihood of the full model.
  • \(\log L_{M_0}\) = log-likelihood of the null model.

Interpretation:

  • Large \(G^2\) values suggest the full model fits significantly better.
  • Under \(H_0\), \(G^2 \sim \chi^2\) with degrees of freedom = difference in number of parameters.
  • Similar to the F-test for linear regression, but uses likelihood instead of sum of squares.

Key insight: This tests whether adding \(x_1\) and \(x_2\) jointly improves the model fit beyond just having an intercept.

An Example

library(lmtest)
load("~/Dropbox/github_repos/teaching/POL683_Fall24/advancedRegression/vignettes/dataActive.rda")
dat = dataActive %>% 
  mutate(
    pid = recode(pid3, "Democrat" = 1, "Independent" = 2, "Republican" = 3, "Other" = 2, "Not sure" = 2),
    protest = ifelse(violent > 4, 1, 0)
  ) 
a =  glm(protest ~ 1,
       family=binomial(link="logit"), data = dat)

b =  glm(protest ~ as.factor(pid),
       family=binomial(link="logit"), data = dat)

lrtest(a, b)
Likelihood ratio test

Model 1: protest ~ 1
Model 2: protest ~ as.factor(pid)
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -1459.7                         
2   3 -1412.5  2 94.323  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(G^2\)

  • The \(G^2\) statistic is distributed \(\chi^2\) with \(df =\) the number of constraints (here, 2). Clearly, we can reject the null of no influence; see Long (1997, page 94).

  • We could flip things: instead of comparing our model to one with no predictors, we could compare it to the saturated model, which has as many parameters as data points and reproduces the data exactly.

\[G^2=2\log L_{saturated}-2\log L_{M_1}\]

\[Deviance=-2\log L_{M_1}\]

  • This is the deviance. With ungrouped binary data, the saturated model’s log-likelihood is zero, so the deviance is just \(-2\) times the model’s log-likelihood.

Deviance

Deviance = measure of lack of fit (lower is better).

Null Deviance: \[D_{null} = 2\log L_{saturated} - 2\log L_{null}\]

  • Compares saturated model vs. intercept-only. Shows total variability to explain.

Residual Deviance: \[D_{residual} = 2\log L_{saturated} - 2\log L_{model}\]

  • Compares saturated model vs. proposed model. Unexplained variability.

Deviance

\[D_{null} - D_{residual} = G^2 = 2(\log L_{model} - \log L_{null})\]

  • Difference = likelihood ratio test statistic.
  • Tests if predictors improve fit over null.
  • \(1 - D_{residual}/D_{null}\) gives the proportion of deviance explained, analogous to \(R^2\).
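A quick check of this arithmetic with the two models estimated above (a is intercept-only, b adds party identification):

b$null.deviance - b$deviance                         # D_null - D_residual
2 * (as.numeric(logLik(b)) - as.numeric(logLik(a)))  # G^2: the same value
1 - b$deviance / b$null.deviance                     # share of deviance explained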

The Wald Test

  • The Wald Test is asymptotically equivalent to the LR test.

  • Test hypotheses about parameters using only the unrestricted model.

The Logic:

  1. Estimate your model: \(\hat{\boldsymbol{\beta}}\) and \(Var(\widehat{\boldsymbol{\beta}})\).
  2. Measure how far estimates are from null hypothesis.
  3. Scale by standard errors to get test statistic.

Single Parameter

Test \(H_0: \beta_1 = 0\)

Wald statistic:

\[W = \frac{(\hat{\beta}_1 - 0)^2}{Var(\hat{\beta}_1)} = \frac{\hat{\beta}_1^2}{SE(\hat{\beta}_1)^2}\]

  • Notice the similarity to a z-test

\[z = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}, \quad W = z^2\]

  • Under \(H_0\): \(z \sim N(0,1)\)
  • \(W \sim \chi^2_1\)

For a single parameter, the Wald test is simply the square of the standard z-test.

Multiple Parameters

  • We can test more complex hypotheses involving multiple parameters.

  • The test is based on the distance between the estimated parameters and the null hypothesis values, scaled by their covariance matrix.

\(H_0: Q\boldsymbol{\beta} = \boldsymbol{r}\)

\[W = (Q\hat{\boldsymbol{\beta}} - \boldsymbol{r})^T [Q Var(\hat{\boldsymbol{\beta}}) Q^T]^{-1} (Q\hat{\boldsymbol{\beta}} - \boldsymbol{r})\]

  • \(\widehat{\boldsymbol{\beta}}\): vector of estimated coefficients.
  • \(Q\): matrix selecting which parameters to test.
  • \(\boldsymbol{r}\): hypothesized values (typically, but not always, \(\mathbf{0}\)).
  • \(Var(\widehat{\boldsymbol{\beta}})\): variance-covariance matrix.
  • Under \(H_0\): \(W \sim \chi^2_{df}\) where \(df\) = number of constraints.

Deconstructing the Wald Test

\[W=(Qb-r)^T(Qvar(b)Q^T)^{-1}(Qb-r)\]

  • The left and rightmost portions measure the distance between the estimated \(b\) and its hypothesized value (typically 0) – regardless of the complexity of the hypothesis.

  • This compares the freed (unrestricted) model to the constrained model.

  • Because there is uncertainty around the estimates, the middle portion scales that distance by the variance-covariance matrix. We pre- and post-multiply by \(Q\) because we are only concerned with the tested elements of \(b\).
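A minimal sketch of the Wald test by hand, using the fitted model b from the example above (coefficients: intercept plus the two pid dummies):

beta_hat <- coef(b)              # estimated coefficients
V <- vcov(b)                     # variance-covariance matrix
Q <- rbind(c(0, 1, 0),           # selects the two pid coefficients
           c(0, 0, 1))
r <- c(0, 0)                     # hypothesized values under H0
d <- Q %*% beta_hat - r          # distance from the null
W <- t(d) %*% solve(Q %*% V %*% t(Q)) %*% d
pchisq(as.numeric(W), df = nrow(Q), lower.tail = FALSE)  # p-value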

When to Use

  • Compares nested models.

  • The Wald and LR tests are reasonable approaches, but…

  • Their small sample properties are not always well defined.

  • They should only be used when the \(\textit{null}\) model is estimated on the same data.

  • These methods can really only be used for nested models. In the case above, \(b=0\) is a constraint, so the restricted/constrained model is nested in the unrestricted model.

Scalar Estimates Fit

  • Scalar estimates of model fit are less intuitive in the logit/probit framework.

\[R^2={RegSS\over TSS}=1-{RSS/TSS}\]

  • The problem is that in the logit/probit model, we cannot directly compare \(Y_{obs}\) to the prediction we make for \(Y_{latent}\).

  • One pseudo-\(R^2\) replaces the OLS fitted values with the model’s predicted probabilities, \(\hat{\pi}_i\) (Efron 1978):

\[\text{Pseudo-}R^2=1-{{\sum_i (y_i-\hat{\pi}_i)^2}\over {\sum_i (y_i-\bar{y})^2}}\]
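A minimal sketch of this calculation for the fitted logit b from the running example:

y <- b$y                        # observed 0/1 outcome stored by glm()
p_hat <- fitted(b)              # predicted probabilities
1 - sum((y - p_hat)^2) / sum((y - mean(y))^2)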

Information Measures

  • The Akaike Information Criterion (AIC)

\[AIC={{-2\log L(\hat{\theta})+2P}\over N}\]

  • Calculate \(-2\log L\) and add 2 \(\times\) the number of parameters, \(P=K+1\) (Long 1997, 109).

  • Finally, divide by the number of observations. Notice what happens with this function: as the number of parameters increases while the log-likelihood stays the same, the AIC increases.

  • We should prefer a smaller AIC; the statistic penalizes added parameters that do not improve fit.

BIC

  • The Bayesian Information Criterion (BIC) is based on a comparison between a fully saturated model and the proposed model. The BIC is:

\[BIC=D(M)-df \times \log(N)\]

  • \(D(M)\) is simply the deviance for the model.

  • The degrees of freedom calculation is \(N-k-1\), where \(k\) is the number of predictors.
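A sketch of both statistics by hand for the model b above. Note these follow Long’s formulas; R’s AIC() does not divide by N, so the values will differ.

N <- nobs(b)
P <- length(coef(b))                        # parameters: K predictors + 1
(-2 * as.numeric(logLik(b)) + 2 * P) / N    # Long's AIC
deviance(b) - (N - P) * log(N)              # Long's BIC, df = N - k - 1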

The Ordered Logit

  • This summary follows your assigned reading in Long (1997) and McElreath, Chapter 11.

  • Only use an ordered parameterization when we have ordered data.

  • Some data can be ordered, even if they are theoretically multidimensional; others should be modeled differently.

  • Examples: PID, Ideology (social and economic dimensions)

  • “How much do you agree or disagree with the following item?” from “1” Strongly Disagree to “5” Strongly Agree.

Why not OLS?

  • Ordered, non-interval level data may violate the assumptions of the classical linear regression model:

  1. Non-constant variance.

  2. Predictions may be nonsensical (i.e., we predict values outside the observed bounds).

  3. OLS treats the distances between categories as equal, even when they are theoretically quite different.

Probit or Logit

  • Whether you choose the ordered logit or probit is often just a matter of personal preference. I’ll use the logit for now; specifying a probit is just a matter of changing the link function.

  • The ordered logit is also called the proportional odds regression model.

  • It is a generalization of the binary logit, using the logic of accumulating probabilities.

  • Conceptually just think about it as a number of binary logits, where the cutpoints slice the latent distribution into discrete categories.

  • In the ordered logit or probit parameterization, we do not estimate the intercept, \(\beta_0\), because it is not uniquely identified from the cutpoints, which serve as intercepts for cumulative logits.

  • Estimating an ordered logit model on binary data reduces to the binary logit model.

  • Instead of one cutpoint – the intercept in logit – now we estimate \(k-1\) cutpoints.

Ordered Logit

\[y_{latent} = Xb + e\] \[ \begin{eqnarray*} P(y=1) & = & P(\tau_0 \leq y_{latent} < \tau_1) \\ & = & P(\tau_0 \leq \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} < \tau_1) \\ & = & P(\tau_0 - \mathbf{X}\boldsymbol{\beta} \leq \boldsymbol{\varepsilon} < \tau_1 - \mathbf{X}\boldsymbol{\beta}) \\ & = & P(\boldsymbol{\varepsilon} < \tau_1 - \mathbf{X}\boldsymbol{\beta}) - P(\boldsymbol{\varepsilon} < \tau_0 - \mathbf{X}\boldsymbol{\beta}) \\ & = & F(\tau_1 - \mathbf{X}\boldsymbol{\beta}) - F(\tau_0 - \mathbf{X}\boldsymbol{\beta}) \end{eqnarray*} \]

Ordered Logit

  • \(k = 4\), so length(\(\tau\)) = 3

\[ \begin{eqnarray*} P(y=1) & = & F(\tau_1 - \mathbf{X}\boldsymbol{\beta}) \quad \text{where } \tau_0 = -\infty \\ P(y=2) & = & F(\tau_2 - \mathbf{X}\boldsymbol{\beta}) - F(\tau_1 - \mathbf{X}\boldsymbol{\beta}) \\ P(y=3) & = & F(\tau_3 - \mathbf{X}\boldsymbol{\beta}) - F(\tau_2 - \mathbf{X}\boldsymbol{\beta}) \\ P(y=4) & = & 1 - F(\tau_3 - \mathbf{X}\boldsymbol{\beta}) \quad \text{where } \tau_4 = \infty \end{eqnarray*} \]
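A sketch of these four probabilities for a single observation, with hypothetical cutpoints tau and linear predictor xb:

tau <- c(-1.5, 0, 1.5)    # k - 1 = 3 cutpoints
xb <- 0.4                 # X %*% beta for one case
Fcdf <- plogis            # logistic CDF; swap in pnorm for a probit
p <- c(Fcdf(tau[1] - xb),
       Fcdf(tau[2] - xb) - Fcdf(tau[1] - xb),
       Fcdf(tau[3] - xb) - Fcdf(tau[2] - xb),
       1 - Fcdf(tau[3] - xb))
p
sum(p)                    # the four probabilities sum to 1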

The Ordered Logit

  • An underlying \(y_{latent}\) generates \(y_{obs} \in \{1,2,3,\dots,k\}\).

  • Accumulated Comparisons: With a four category outcome, we can think of three models. Compare category 1 to categories 2, 3, 4; then compare categories 1, 2 to categories 3, 4; and finally compare categories 1, 2, 3 to category 4.

  • It’s useful to envision this as three individual logit models, where the parameters \(\beta\) are constrained to be the same across the three models. This is called the parallel regression assumption, or the parallel lines assumption.

  • The name proportional odds comes from these cumulative odds: the effect of \(X\) on the cumulative odds is the same at every split.

The Ordered Logit

\[ \begin{aligned} \boldsymbol{y}_{observed} &\sim \text{Ordered}(\mathbf{p}) \\ p_j &= P(y_{obs} = j) = F_{logit}(\tau_{j} - \mathbf{X}\boldsymbol{\beta}) - F_{logit}(\tau_{j-1} - \mathbf{X}\boldsymbol{\beta}) \\ \boldsymbol{y}_{latent} &= \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{aligned} \]

  • where \(F_{logit}(z) = \frac{1}{1 + e^{-z}}\) is the logistic CDF.
  • \(\mathbf{p}\) is a vector of probabilities for each category, summing to 1.
  • The \(j\)th element of \(\mathbf{p}\) is the probability of observing category \(j\).
  • The cumulative log odds is a function of the variable matrix \(X\), the cutpoints \(\tau\), and the coefficients \(\beta\), with an error term \(e\) that follows the logistic distribution.

Parallel Lines

  • Each line corresponds to the cumulative log odds of combining categories. The lines are parallel: the distance between them is constant, so the effect of \(X\) is the same across all cumulative comparisons, and the odds ratios are proportional.

The Structural Model

\[y_{latent,i}=\beta_1 x_{i1}+\beta_2 x_{i2}+\dots+\beta_J x_{iJ}+e_i=\sum^{J}_{j=1} \beta_j x_{ij}+e_i\]

\[y=X\beta+e\]

  • There is no separately estimated intercept term; its role is played by the cutpoints. But it’s useful to think of the cutpoints as intercepts for each of the cumulative logits.

The Measurement Model

  • Instead of the variable being 0/1, it now takes more than two categories that are ordered. Assume we knew \(y_{latent}\) and would like to map it to observing a particular category.

  • Using the same logic from the binary regression model, assume that we observe the category based on the orientation to a series of cutpoints, where

\[y_i=m \quad \text{if} \quad \tau_{m-1}\leq y_{latent} < \tau_{m}\]

  • In \(\texttt{MASS::polr()}\), these cutpoints are reported as \(\texttt{zeta}\).

The Likelihood

\[ \begin{eqnarray*} pr(y_{i}=k|X_i) & = & F(\tau_k-\alpha-X_i\beta)-F(\tau_{k-1}-\alpha-X_i\beta) \\ \end{eqnarray*} \]

  • The likelihood of the ordered logit or probit model is the joint probability of the observed category memberships. For subject \(i\), only the observed category contributes; writing \(I(y_i=k)\) as an indicator equal to 1 when subject \(i\) falls in category \(k\) and 0 otherwise:

\[ Pr(y_{i}|X_i) = \prod_{k=1}^K \left[F(\tau_k-\alpha-X_i\beta)-F(\tau_{k-1}-\alpha-X_i\beta)\right]^{I(y_i=k)} \]

  • The indicator exponents zero out every category except the one actually observed, leaving that category’s probability.

The Likelihood, Continued

\[ Pr(y_{i}|X_i) = \prod_{k=1}^K \left[F(\tau_k-\alpha-X_i\beta)-F(\tau_{k-1}-\alpha-X_i\beta)\right]^{I(y_i=k)} \]

  • This only references the probability space for a single subject. Since the likelihood multiplies over subjects, we take the joint probability across all \(N\) observations.

\[ \begin{eqnarray*} pr(y|X) & = & \prod_{i=1}^N \prod_{k=1}^K \left[F(\tau_k-\alpha-X_i\beta)-F(\tau_{k-1}-\alpha-X_i\beta)\right]^{I(y_i=k)} \\ L(\beta, \tau | y, X)& = & \prod_{i=1}^N \prod_{k=1}^K \left[F(\tau_k-\alpha-X_i\beta)-F(\tau_{k-1}-\alpha-X_i\beta)\right]^{I(y_i=k)} \\ \end{eqnarray*} \]

The Log Likelihood

\[\begin{eqnarray*} \log L(\beta, \tau | y, X)& = & \sum_{i=1}^N \sum_{k=1}^K I(y_i=k)\log\left[F(\tau_k-\alpha-X_i\beta)-F(\tau_{k-1}-\alpha-X_i\beta)\right] \\ \end{eqnarray*}\]

  • Like the binary case: \(x \rightarrow y_{latent} \rightarrow y_{obs}\).

  • The only thing that is different is that instead of a single cutpoint – at 0 – we have a series of cutpoints, corresponding to the number of categories minus 1.
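A minimal sketch of this log-likelihood in R, assuming a hypothetical integer outcome y coded 1..K and a design matrix X without an intercept column; theta stacks the betas and the K - 1 cutpoints:

ordlogit_loglik <- function(theta, y, X) {
  p <- ncol(X)
  K <- length(unique(y))
  beta <- theta[1:p]
  tau <- c(-Inf, theta[(p + 1):(p + K - 1)], Inf)  # bracketing cutpoints
  xb <- as.vector(X %*% beta)
  # probability of the observed category for each subject
  pr <- plogis(tau[y + 1] - xb) - plogis(tau[y] - xb)
  sum(log(pr))
}

Maximizing this function (e.g., with optim()) recovers the same estimates as MASS::polr(), up to optimization details.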

The Western States Data (2020)

  • Let’s estimate an ordered logit model in R, using \(\texttt{polr()}\) from the MASS package. The data were collected before and after the election, and we want to see whether support for electoral contestation behavior (here, burning the flag in protest) changes over this period for Trump voters versus Biden voters. The model is specified to examine whether support for contestation varies depending upon electoral fortunes: a winner-loser effect.
# A tibble: 6 × 70
  black white latino asian american_indian   age married college faminc
  <dbl> <dbl>  <dbl> <dbl>           <dbl> <dbl>   <dbl>   <dbl>  <dbl>
1     0     1      0     0               0    43       0       1      0
2     0     0      1     0               0    53       0       1      0
3     0     1      0     0               0    73       1       1      1
4     0     1      0     0               0    65       0       0      0
5     0     1      0     0               0    83       0       0      0
6     0     0      1     0               0    35       1       1      0
# ℹ 61 more variables: survey_weight <dbl>, caseidID <dbl>, state <chr>,
#   year <dbl>, survey <chr>, DATE <date>, post_election <dbl>,
#   post_call <dbl>, uncertainty <dbl>, prepost <dbl>, attend_violent <dbl>,
#   criticize_social_media <dbl>, most_important_problem <chr>,
#   polMeeting <dbl>, polSign <dbl>, polVolunteer <dbl>, polProtest <dbl>,
#   polOfficial <dbl>, polDonate <dbl>, polSocial <dbl>, polPersuade <dbl>,
#   polNone <dbl>, trustCongress <dbl>, trustPresident <dbl>, trustSC <dbl>, …

Estimating an Ordered Logit

Call:
polr(formula = as.factor(burn_flag) ~ prepost * vote_trump, data = sample_df)

Coefficients:
                     Value Std. Error t value
prepost            -0.2686    0.09701  -2.769
vote_trump         -0.6393    0.07787  -8.210
prepost:vote_trump  1.4364    0.17112   8.394

Intercepts:
    Value    Std. Error t value 
1|2  -1.9499   0.0663   -29.4176
2|3  -0.9642   0.0564   -17.0865
3|4   0.5589   0.0544    10.2718
4|5   1.9233   0.0696    27.6224

Residual Deviance: 8700.134 
AIC: 8714.134 
(719 observations deleted due to missingness)
  • There is clearly an interaction effect – support varies depending on whether the observation was before or after the election and whether the respondent voted for Trump or Biden. The signs of the lower-order and interaction terms suggest that Trump voters are more supportive post-election, while Biden voters are more supportive pre-election. But how should we interpret this?
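The point estimates in the table on the next slide can be generated with predict(); a minimal sketch, assuming the fitted polr object is named ologit_fit (a hypothetical name; the printed call above was not assigned). The simulation-based intervals in the table would additionally require sampling coefficient draws (e.g., with MASS::mvrnorm), not shown here.

# Predicted probabilities for the four prepost x vote_trump combinations
newdat <- expand.grid(prepost = c(0, 1), vote_trump = c(0, 1))
cbind(newdat, predict(ologit_fit, newdata = newdat, type = "probs"))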

Predicted Probabilities

# A tibble: 20 × 8
   prepost vote_trump category mean_prob median_prob sd_prob lower_ci upper_ci
     <dbl>      <dbl> <chr>        <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
 1       0          0 1           0.124       0.124  0.00721   0.110    0.139 
 2       0          0 2           0.151       0.151  0.00707   0.137    0.165 
 3       0          0 3           0.360       0.360  0.00885   0.344    0.378 
 4       0          0 4           0.236       0.236  0.00920   0.220    0.255 
 5       0          0 5           0.128       0.127  0.00776   0.112    0.143 
 6       0          1 1           0.213       0.212  0.0110    0.193    0.235 
 7       0          1 2           0.207       0.207  0.00899   0.189    0.225 
 8       0          1 3           0.349       0.349  0.00890   0.332    0.366 
 9       0          1 4           0.160       0.160  0.00808   0.144    0.176 
10       0          1 5           0.0714      0.0712 0.00539   0.0612   0.0829
11       1          0 1           0.157       0.157  0.0122    0.135    0.181 
12       1          0 2           0.176       0.176  0.00956   0.157    0.195 
13       1          0 3           0.363       0.363  0.00884   0.346    0.381 
14       1          0 4           0.203       0.204  0.0116    0.181    0.226 
15       1          0 5           0.101       0.100  0.00878   0.0839   0.119 
16       1          1 1           0.0777      0.0773 0.00939   0.0613   0.0977
17       1          1 2           0.106       0.106  0.0105    0.0871   0.128 
18       1          1 3           0.324       0.324  0.0144    0.294    0.350 
19       1          1 4           0.293       0.294  0.0143    0.265    0.321 
20       1          1 5           0.199       0.199  0.0208    0.159    0.240 
  • The predicted probabilities indicate that Trump voters are more supportive of burning the flag, post-election. Biden voters are less supportive, post-election.

Predicted Probabilities

[Figure: predicted probabilities by category, pre/post election, for Trump and Biden voters]

Changes in Probabilities

[Figure: changes in predicted probabilities from the pre- to post-election period]