This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter (Win/Linux) or Cmd+Shift+Return (Mac).

# trees is a data set that comes with R
pairs(trees)

To hide the output, click the Expand/Collapse output button. To clear results (or an error), click the “x”.

You can also press Ctrl+Enter (Win/Linux) or Cmd+Return (Mac) to run one line of code at a time (instead of the entire chunk).

Add a new R code chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Win/Linux) or Cmd+Option+I (Mac).

CODE ALONG 0

Try the cumsum() function with the vector 1:5
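One possible answer (a quick check; cumsum() returns the running total of a vector):

cumsum(1:5)
## [1]  1  3  6 10 15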

Load Packages

We’ll use the following packages in today’s workshop.

library(vcd)
## Loading required package: grid
library(MASS)
library(car)
## Loading required package: carData
library(effects)
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(ggplot2)
library(ggeffects)
library(emmeans)

Review: Probabilities and Cumulative Probabilities

Let’s say I randomly sample 100 UVA students and ask them to rate how spicy they like their food on a scale of 1 - 5, where 1 is no spice and 5 is maximum spiciness. Below are made-up responses entered manually using a named vector.

resp <- c("1 none" = 17,
          "2 mild" = 23,
          "3 medium" = 31,
          "4 hot" = 21,
          "5 maximum" = 8)
resp
##    1 none    2 mild  3 medium     4 hot 5 maximum 
##        17        23        31        21         8

I can calculate proportions from this data and use these as estimates of probabilities. Using the proportions() function on a vector of numbers returns the proportion of each value relative to the total of the vector.

proportions(resp)
##    1 none    2 mild  3 medium     4 hot 5 maximum 
##      0.17      0.23      0.31      0.21      0.08

Based on this random sample I might estimate that a randomly selected UVA student has a probability of about 0.31 of liking medium spicy food. I’m using my observed proportion as an estimate of probability.

Another way to summarize the results is with cumulative probabilities. For example, what’s the probability a randomly selected student can handle medium heat or less? Using the cumsum() function we can calculate cumulative sums of proportions and answer this question. Below we take the proportions and “pipe” them into the cumsum() function using the base R pipe operator.

proportions(resp) |> cumsum()
##    1 none    2 mild  3 medium     4 hot 5 maximum 
##      0.17      0.40      0.71      0.92      1.00

Based on this calculation I might estimate that a randomly selected student has a probability of about 0.71 of liking medium or lower spiciness. We might state this mathematically as follows (using LaTeX):

\[P[X \le \text{medium}] = 0.71 \]

In this case we’re using the ordering of the responses to make additional inferences about student spiciness preference.

What if we also collected additional information such as the sex of the respondents? Does being male or female affect the cumulative probability of 0.71?

What about age or weight? Does being older and heavier affect the cumulative probability of spiciness preference?

Ordinal Logistic Regression is a method that allows us to investigate questions such as these. It allows us to model the cumulative probability of an ordered category given multiple predictors.

Performing Ordinal Logistic Regression is like fitting any other statistical model. We have to…

  1. propose and fit a model
  2. assess if the model is good
  3. use the model to make predictions and/or quantify relationships

Load Data for the workshop

Today we’ll work with data from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis (Koch & Edwards, 1988). The data is available in the vcd package (Meyer et al., 2022).

data("Arthritis", package = "vcd")

If for some reason you can’t install the vcd package, you can also download the data from GitHub:

URL <- "https://github.com/clayford/OLR_in_R/raw/main/data/arthritis.rds"
Arthritis <- readRDS(file = url(URL))

List of variables

names(Arthritis)
## [1] "ID"        "Treatment" "Sex"       "Age"       "Improved"
  • ID: patient ID
  • Treatment: factor indicating treatment (Placebo, Treated).
  • Sex: factor indicating sex (Female, Male).
  • Age: age of patient in years
  • Improved: ordered factor indicating treatment outcome (None, Some, Marked).

The goal today is to teach the basics of Ordinal Logistic Regression by developing a model to predict the cumulative probability of pain improvement after treatment adjusting for age and sex.

Ordered Factors in R

An ordered factor is a categorical variable with ordered levels. Examples of ordered categorical variables include:

  • survey responses on a rating scale, such as the spiciness preferences above (1 none < 2 mild < 3 medium < 4 hot < 5 maximum)
  • pain severity (none < mild < moderate < severe)
  • experience level (none < some < proficient < expert)

We can format a categorical variable as an ordered factor using the factor() function with the argument ordered = TRUE. We are responsible for creating the order! Otherwise R will set the order alphabetically.

Quick example using fake data. Sample from a vector of ordered levels 50 times with replacement using the sample() function. The result is a character vector.

experience <- sample(c("none", "some", "proficient", "expert"), 
                     size = 50, replace = TRUE)
head(experience)
## [1] "proficient" "expert"     "some"       "proficient" "none"      
## [6] "none"

Let’s convert to factor using the factor() function. Notice there is no ordering of the levels. They are simply listed alphabetically.

experience <- factor(experience)
head(experience)
## [1] proficient expert     some       proficient none       none      
## Levels: expert none proficient some

Now convert to ordered factor using the argument ordered = TRUE. Notice the levels are ordered but still in alphabetical order. The notation expert < none < proficient < some says “expert” is less than “none”, which is less than “proficient”, which is less than “some”. According to this ordering “some” is the highest, which doesn’t make sense.

experience <- factor(experience, ordered = TRUE)
head(experience)
## [1] proficient expert     some       proficient none       none      
## Levels: expert < none < proficient < some

Now let’s do it the right way: specify order of levels using the levels argument.

experience <- factor(experience, ordered = TRUE, 
                     levels = c("none", "some", "proficient", "expert"))
head(experience)
## [1] proficient expert     some       proficient none       none      
## Levels: none < some < proficient < expert

This is an important data wrangling step when preparing for Ordinal Logistic Regression.

Explore the Arthritis data

Our response is the Improved variable. Notice it’s already an ordered categorical variable:

head(Arthritis$Improved)
## [1] Some   None   None   Marked Marked Marked
## Levels: None < Some < Marked

Let’s tabulate the counts using xtabs(). The notation ~ Improved means “tabulate by Improved”.

xtabs(~ Improved, data = Arthritis)
## Improved
##   None   Some Marked 
##     42     14     28

We can get cumulative proportions using proportions() and cumsum().

xtabs(~ Improved, data = Arthritis) |>
  proportions() |>
  cumsum()
##      None      Some    Marked 
## 0.5000000 0.6666667 1.0000000

About 67% of subjects experienced None or Some improvement. Therefore about 33% of subjects experienced Marked improvement.

How does the cumulative probability of Improved break down between the Treatments? The notation ~ Treatment + Improved means “tabulate by Treatment and Improved, with Treatment in the rows and Improved in the columns.” The margin = 1 argument in proportions() says to calculate row-wise proportions. Then we “apply” the cumsum() function to the rows of the table. Finally we use t() to rotate or “transpose” the table.

xtabs(~ Treatment + Improved, data = Arthritis) |>
  proportions(margin = 1) |>
  apply(MARGIN = 1, cumsum) |> 
  t()
##          Improved
## Treatment      None      Some Marked
##   Placebo 0.6744186 0.8372093      1
##   Treated 0.3170732 0.4878049      1

About 84% of subjects experienced None or Some improvement in the Placebo group versus 49% in the Treated group. That implies only 16% experienced Marked improvement in the Placebo group versus 51% in the Treated group.

CODE ALONG 1

Add a new R code chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Win/Linux) or Cmd+Option+I (Mac).
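This exercise is open-ended, but one possibility (a sketch, not the only answer) is to revisit the question we asked earlier: does the cumulative probability of Improved differ by Sex? We can reuse the same xtabs(), proportions(), and cumsum() pattern:

xtabs(~ Sex + Improved, data = Arthritis) |>
  proportions(margin = 1) |>
  apply(MARGIN = 1, cumsum) |> 
  t()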

Ordinal Logistic Regression basics

Ordinal logistic regression produces a model that we can use to make cumulative probability predictions or quantify how predictors are associated with cumulative probabilities. Like binary logistic regression, the results are on the log odds scale.

Let’s fit a simple model for the Arthritis data. Below we model Improved as a function of Treatment. In other words, how does Treatment affect the cumulative probability of Improved?

One way to do this is with the polr() function from the MASS package. “polr” stands for “proportional odds logistic regression”. We’ll explain proportional odds in a moment. As with lm() and glm(), we define our model using R’s formula syntax. We also typically want to save the model object and inspect the results using summary().

m1 <- polr(Improved ~ Treatment, data = Arthritis)
summary(m1)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = Improved ~ Treatment, data = Arthritis)
## 
## Coefficients:
##                  Value Std. Error t value
## TreatmentTreated 1.568     0.4441    3.53
## 
## Intercepts:
##             Value  Std. Error t value
## None|Some   0.7499 0.3210     2.3363 
## Some|Marked 1.5499 0.3559     4.3553 
## 
## Residual Deviance: 156.606 
## AIC: 162.606

The message “Re-fitting to get Hessian” means the model was refit when we called summary() to get a matrix called the “Hessian” in order to compute standard errors. To suppress this message, use Hess = TRUE in the call to polr().

Notice we have multiple intercepts which are listed separately from the coefficients. The separate intercepts are for the levels below which we calculate cumulative probabilities. Recall the levels of Improved:

None < Some < Marked

There are two cumulative probabilities we can calculate:

  1. None
  2. Some or lower

(Marked or lower will always be 1.)

So we have two models depending on the cumulative probability we want to model:

  1. log odds(None) = 0.7499 - 1.568*Treatment
  2. log odds(Some or lower) = 1.5499 - 1.568*Treatment

Where Treatment = 1 if treated, 0 if on Placebo.

Notice the coefficient is subtracted from the intercept. That’s the model that polr() fits. Also notice the model is on the log odds, or logit, scale. Recall that log odds is probability rescaled from [0,1] to (-Inf, +Inf). So the model predicts log odds values, which we then transform back to probability using the inverse logit.
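In general, for an ordered response Y with levels 1 < 2 < … < K, the model that polr() fits can be written as

\[\text{logit}\, P[Y \le j] = \zeta_j - \beta x, \quad j = 1, \dots, K - 1\]

where each cumulative probability gets its own intercept (the zeta values), but all share the same coefficient(s). This is why the intercepts and coefficients are listed separately in the summary output.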

Now let’s use our model and compute cumulative probabilities “by hand”. Before we do that, it’s good to know how to extract the intercepts and coefficients from the model object. The model object is a list. We use the $ operator to extract elements from it. The intercepts are in the zeta element and the coefficients are in the coefficients element. (Can also use coef() to extract coefficients.)

m1$zeta
##   None|Some Some|Marked 
##   0.7499107   1.5498756
m1$coefficients
## TreatmentTreated 
##         1.567644

The log odds of None given a subject was on Placebo (Treatment = 0):

log odds(None) = 0.7499 - 1.568*0 = 0.7499

That’s simply the None|Some intercept: 0.7499107. To convert to probability, we can use the plogis() function. Below we pipe the None|Some intercept into plogis() and save as p1.

(p1 <- m1$zeta["None|Some"] |> plogis())
## None|Some 
## 0.6791592

The log odds of None given a subject was on Treatment (Treatment = 1):

log odds(None) = 0.7499 - 1.568*1

(p2 <- (m1$zeta["None|Some"] - m1$coefficients) |> plogis())
## None|Some 
##  0.306245

So we predict about 68% of subjects on Placebo will see no improvement versus only about 32% of subjects on Treatment.

The log odds of Some or lower given a subject was on Placebo (Treatment = 0):

log odds(Some or lower) = 1.5499 - 1.568*Treatment

That’s simply the Some|Marked intercept: 1.5498756.

(p3 <- m1$zeta["Some|Marked"] |> plogis())
## Some|Marked 
##   0.8248958

The log odds of Some or lower given a subject was on Treatment (Treatment = 1):

log odds(Some or lower) = 1.5498756 - 1.567644.

(p4 <- (m1$zeta["Some|Marked"] - m1$coefficients) |> plogis())
## Some|Marked 
##    0.495558

This implies about 1 - 0.50 = 50% of Treated subjects can expect Marked improvement versus only 1 - 0.82 = 18% of subjects on Placebo.

Let’s compare the predicted cumulative probabilities to the observed cumulative proportions.

# Observed cumulative proportions
xtabs(~ Treatment + Improved, data = Arthritis) |>
  proportions(margin = 1) |>
  apply(MARGIN = 1, cumsum) |> 
  t()
##          Improved
## Treatment      None      Some Marked
##   Placebo 0.6744186 0.8372093      1
##   Treated 0.3170732 0.4878049      1

Now compare to our model-based predicted cumulative probabilities. I’m laying out the previously calculated probabilities as a matrix to match the layout of xtabs().

# Modeled cumulative probabilities
matrix(c(p1, p2, p3, p4, 1, 1), nrow = 2, 
       dimnames = list("Treatment" = c("Placebo", "Treated"),
                       "Improved" = c("None", "Some", "Marked")))
##          Improved
## Treatment      None      Some Marked
##   Placebo 0.6791592 0.8248958      1
##   Treated 0.3062450 0.4955580      1

Our model-based predictions are quite close to the observed proportions, but not identical. This is because of the proportional odds assumption.

Proportional Odds

Recall: odds = p/(1 - p)

Predicted probability of Improved = None on Treatment is 0.3062450, which is stored in p2. The odds are calculated as 0.3062450/(1 - 0.3062450)

p2/(1 - p2)
## None|Some 
##  0.441431

Predicted probability of Improved = None on Placebo is 0.6791592, which is stored in p1. The odds are calculated as 0.6791592/(1 - 0.6791592)

p1/(1 - p1)
## None|Some 
##  2.116811

The odds ratio for those two odds are

(p2/(1 - p2))/
(p1/(1 - p1))
## None|Some 
## 0.2085359

If we take the log of that we get the model coefficient for Treatment with a change in sign. (We explain why that is shortly.)

# compare to coef(m1)
log((p2/(1 - p2))/
(p1/(1 - p1)))
## None|Some 
## -1.567644

Now let’s do the same calculation for the Some or lower cumulative probabilities:

# p4 = P(Some or lower) if Treated
# p3 = P(Some or lower) if Placebo
log((p4/(1 - p4))/
(p3/(1 - p3)))
## Some|Marked 
##   -1.567644

Notice we get the same answer. That’s because according to our model, the Treatment effect is proportional to the ordered category levels. That’s the proportional odds assumption. The coefficients are independent of the levels of the ordered response. It doesn’t matter whether we go from “None” to “Some”, or “Some” to “Marked”, the effect of Treatment is the same.

Let’s look at the coefficient from the model again.

coef(m1)
## TreatmentTreated 
##         1.567644

This is identical to what we derived above, but with a change in sign. polr() multiplies the coefficients in the summary output by -1 so that when you exponentiate and take the odds ratio, higher levels of predictors correspond to the response falling in the higher end of the ordinal scale.

exp(coef(m1))
## TreatmentTreated 
##         4.795338

Interpretation: the estimated odds that a Treated subject’s response has higher improvement is about 4.8 times the odds that a Placebo subject’s response has higher improvement.

Notice if we “undo” the multiplication by -1 we get the odds ratio we calculated above.

exp(-coef(m1))  # -coef(m1) is equivalent to -1*coef(m1)
## TreatmentTreated 
##        0.2085359

Compare to:

(p2/(1 - p2))/
(p1/(1 - p1))
## None|Some 
## 0.2085359

And to:

(p4/(1 - p4))/
(p3/(1 - p3))
## Some|Marked 
##   0.2085359

The substance of the result remains the same, but the interpretation is reversed: the estimated odds that a Treated subject’s response has lower improvement is about 80% less than the odds that a Placebo subject’s response has lower improvement.

Assessing the proportional odds assumption

Earlier we mentioned that proportional odds logistic regression assumes the coefficients are independent of the category levels. This means we assume the estimated odds ratio of Treatment, 4.795, is the same whether we’re talking about comparing None versus Some or higher, or None or Some versus Marked on the Improvement scale.

One way to assess whether this assumption is satisfied is the poTest() function in the car package. Simply call it on the fitted model. The null hypothesis is that the proportional odds assumption holds; rejecting the null is evidence against the assumption. The poTest() function provides an overall test of the PO assumption as well as separate tests for each predictor. We only have one predictor, so both tests are the same. The result below suggests we’re safe in our assumption of proportional odds.

poTest(m1)
## 
## Tests for Proportional Odds
## polr(formula = Improved ~ Treatment, data = Arthritis)
## 
##                  b[polr] b[>None] b[>Some] Chisquare df Pr(>Chisq)
## Overall                                         0.21  1       0.64
## TreatmentTreated   1.568    1.495    1.686      0.21  1       0.64

The authors of the car package warn, “It’s our experience that the proportional-odds assumption is rarely supported by a hypothesis test.” (Fox and Weisberg, 2019, p. 322)

A graphical method for assessing the proportional odds assumption is available in the rms package. The plot.xmean.ordinaly() function plots both the observed mean of each predictor stratified by the levels of the ordinal response (solid lines) and the expected means under the proportional odds assumption (dashed lines).

When the solid and dashed lines roughly follow the same trajectory, we have good evidence that the PO assumption is safe. This looks great!

# rms:: allows us to use the function without loading the rms package
rms::plot.xmean.ordinaly(Improved ~ Treatment, data = Arthritis)

According to Frank Harrell, Violation of Proportional Odds is Not Fatal. See https://www.fharrell.com/post/po/

If you think the proportional odds assumption is badly violated, two options to consider are a multinomial logistic regression model, which ignores the ordering of the response, and a partial proportional odds model, which relaxes the assumption for selected predictors.

Summary output and confidence intervals

Let’s look again at the output summary of our model. First we’ll refit the model with Hess = TRUE to suppress the “re-fitting” message.

m1 <- polr(formula = Improved ~ Treatment, data = Arthritis, 
           Hess = TRUE)
summary(m1)
## Call:
## polr(formula = Improved ~ Treatment, data = Arthritis, Hess = TRUE)
## 
## Coefficients:
##                  Value Std. Error t value
## TreatmentTreated 1.568     0.4441    3.53
## 
## Intercepts:
##             Value  Std. Error t value
## None|Some   0.7499 0.3210     2.3363 
## Some|Marked 1.5499 0.3559     4.3553 
## 
## Residual Deviance: 156.606 
## AIC: 162.606

The authors of the polr() function elected not to output p-values. P-values do not measure the magnitude of an effect; they simply provide evidence about whether the sign of a coefficient is reliably positive or negative. A confidence interval on the coefficient is more informative, as we demonstrate next. However, if you absolutely need a p-value, you can calculate one using the pt() function:

  1. Set q to t value
  2. Set df to number of observations minus number of coefficients estimated
  3. Set lower.tail = FALSE to get the upper tail (i.e., > t)
  4. Multiply by 2 to get a two-sided test

pt(q = 3.53, df = 84 - 3, lower.tail = FALSE) * 2
## [1] 0.0006884609

The coeftest() function in the lmtest package will also return p-values.

To understand the uncertainty of the magnitude of the coefficient, we can calculate 95% confidence intervals using confint(). Notice we use exp() to get the odds ratio. Also notice that only a CI for the predictor is calculated, not the intercepts.

exp(confint(m1))
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
##  2.044824 11.739936

The estimated odds that a Treated subject’s response has higher improvement is at least 2 times the odds that a Placebo subject’s response has higher improvement, perhaps as high as 11 times.

Using predict() with fitted models

Earlier we made predictions “by hand” (for teaching purposes) using intercept and coefficient values. In practice we would likely use predict() to get fitted probabilities. Using the newdata argument we can specify what we want to make predictions for. The default, type = "class", is to predict the class, or level, for the given predictors.

The Placebo group is predicted to experience no improvement, while the Treated group is expected to experience Marked improvement.

predict(m1, newdata = data.frame(Treatment = c("Placebo", "Treated")), 
        type = "class")
## [1] None   Marked
## Levels: None Some Marked

Changing to type = "probs" returns expected probabilities for each class.

predict(m1, newdata = data.frame(Treatment = c("Placebo", "Treated")), 
        type = "probs")
##        None      Some    Marked
## 1 0.6791592 0.1457365 0.1751042
## 2 0.3062450 0.1893130 0.5044420

Notice these are NOT cumulative probabilities. These are expected probabilities for each level, and they sum to 1. We see that the type = "class" setting simply picks the category with the highest probability.
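If we want cumulative probabilities instead, we can reuse the apply()/cumsum()/t() pattern from earlier on the predicted probabilities:

predict(m1, newdata = data.frame(Treatment = c("Placebo", "Treated")), 
        type = "probs") |>
  apply(MARGIN = 1, cumsum) |> 
  t()
##        None      Some Marked
## 1 0.6791592 0.8248958      1
## 2 0.3062450 0.4955580      1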

The predict() method unfortunately does not provide standard errors or confidence intervals for the predictions. However, we can use the emmeans package to calculate them. Below we obtain estimated probabilities of each level of Improved conditional on Treatment. The mode = "prob" argument ensures predictions are the probabilities of each class of the ordinal response.

emmeans(m1, ~ Improved|Treatment, mode = "prob")
## Treatment = Placebo:
##  Improved  prob     SE  df asymp.LCL asymp.UCL
##  None     0.679 0.0699 Inf    0.5421     0.816
##  Some     0.146 0.0396 Inf    0.0682     0.223
##  Marked   0.175 0.0514 Inf    0.0744     0.276
## 
## Treatment = Treated:
##  Improved  prob     SE  df asymp.LCL asymp.UCL
##  None     0.306 0.0686 Inf    0.1717     0.441
##  Some     0.189 0.0467 Inf    0.0978     0.281
##  Marked   0.504 0.0769 Inf    0.3537     0.655
## 
## Confidence level used: 0.95

We can pipe this result into plot() for a basic visualization.

emmeans(m1, ~ Improved|Treatment, mode = "prob") |>
  plot()

Which leads us to the next topic.

Visualizing the model

Between multiple intercepts, log odds, and cumulative probabilities, ordinal logistic regression output can be hard to interpret and communicate. Fortunately we have some methods for visualizing models.

The effects package provides some good out-of-the-box lattice plots for visualizing ordinal logistic regression models. Use the Effect() function to specify which focal predictor(s) you want to plot. In our model we only have one choice: Treatment. Then pipe into the plot() function. This results in a plot for each outcome. We can see the probability of Marked improvement is much higher for the Treated group than the Placebo group. Likewise, the probability of no improvement is much lower in the Treated group than in the Placebo group.

Effect(m1, focal.predictors = "Treatment") |> plot()

We can combine the lines into one plot as follows:

Effect(m1, focal.predictors = "Treatment") |> 
  plot(lines=list(multiline=TRUE),
       confint=list(style="bars"))

Another visualization option is the “stacked” plot, though there is no indication of uncertainty. This plot is probably better suited for a numeric predictor.

Effect(m1, focal.predictors = "Treatment") |> 
  plot(axes=list(y=list(style="stacked")))

If you want something created in ggplot2, the ggeffects package provides the ggeffect() function and associated plotting method. I think the connect.lines = TRUE argument is good to use.

ggeffect(m1, terms = "Treatment") |> plot(connect.lines = TRUE)

CODE ALONG 2

Now that we’ve laid out the basics of fitting, assessing and using an Ordinal Logistic Regression model, let’s work through a more sophisticated example with additional predictors.

Add a new R code chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Win/Linux) or Cmd+Option+I (Mac).
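If you want a head start, here is one possible sketch (try it yourself first). It fits the model the workshop has been building toward, Improved as a function of Treatment, Sex, and Age, and then repeats the steps we used above:

m2 <- polr(Improved ~ Treatment + Sex + Age, data = Arthritis, Hess = TRUE)
summary(m2)
# assess the proportional odds assumption
poTest(m2)
# odds ratios with 95% confidence intervals
exp(confint(m2))
# visualize the modeled effect of Age
Effect(m2, focal.predictors = "Age") |> plot()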

We’re done!

Thanks for coming! For free statistical consulting and training, contact us: statlab@virginia.edu

Cut for time

Assessing predictive accuracy

How will our model perform for new data that was not used to create the model? Model validation can help us assess this question.

A common approach to model validation is the train/test data-splitting approach, where we hold out a random subset of data (the test set), fit the model to the remaining data (the training set), and then see how well the model performs with the test data. However, this reduces the sample size used for model development and only validates a model fit to a subset of the data. There’s also a chance a lucky split could produce unusually low or high predictive accuracy.
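For reference, a minimal sketch of the train/test approach for our model might look like the following. (The 70/30 split, the seed, and classification accuracy as the performance measure are all arbitrary choices for illustration.)

# hold out a random 30% of rows as a test set
set.seed(1)
train_idx <- sample(nrow(Arthritis), size = floor(0.7 * nrow(Arthritis)))
train <- Arthritis[train_idx, ]
test <- Arthritis[-train_idx, ]
# fit to the training data, then predict the test data
m_train <- polr(Improved ~ Treatment + Sex + Age, data = train, Hess = TRUE)
pred <- predict(m_train, newdata = test, type = "class")
# proportion of test cases classified correctly
mean(as.character(pred) == as.character(test$Improved))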

A more efficient approach is bootstrap validation. The basic idea is to resample your data with replacement, refit the model, then use the bootstrap model to calculate some statistic using the original data.

The rms package provides the validate() function for this purpose. However, it only works for models fit using rms functions, so instead of polr() we need to use orm(). We also need to specify x = TRUE, y = TRUE so the original data is stored in the model object. To see a summary of the fit, we simply print the model object. Notice it returns more output than a model fit with polr(), including several discrimination indexes. Also notice the cumulative probabilities run in the reverse direction (i.e., Prob(None or higher) = 1, as opposed to Prob(Marked or lower) = 1 with polr()).

library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## Attaching package: 'rms'
## The following object is masked from 'package:emmeans':
## 
##     contrast
## The following objects are masked from 'package:car':
## 
##     Predict, vif
m_rms <- orm(Improved ~ Treatment + Sex + Age, 
                  data = Arthritis, x = TRUE, y = TRUE)
m_rms
## Logistic (Proportional Odds) Ordinal Regression Model
##  
##  orm(formula = Improved ~ Treatment + Sex + Age, data = Arthritis, 
##      x = TRUE, y = TRUE)
##  
##                         Model Likelihood               Discrimination    Rank Discrim.    
##                               Ratio Test                      Indexes          Indexes    
##  Obs            84    LR chi2      24.46    R2                  0.291    rho     0.504    
##   None          42    d.f.             3    R2(3,84)            0.225                     
##   Some          14    Pr(> chi2) <0.0001    R2(3,70)            0.264                     
##   Marked        28    Score chi2   23.32    |Pr(Y>=median)-0.5| 0.208                     
##  Distinct Y      3    Pr(> chi2) <0.0001                                                  
##  Median Y        1                                                                        
##  max |deriv| 0.008                                                                        
##  
##                    Coef    S.E.   Wald Z Pr(>|Z|)
##  y>=Some           -2.5319 1.0570 -2.40  0.0166  
##  y>=Marked         -3.4309 1.0911 -3.14  0.0017  
##  Treatment=Treated  1.7453 0.4759  3.67  0.0002  
##  Sex=Male          -1.2516 0.5464 -2.29  0.0220  
##  Age                0.0382 0.0184  2.07  0.0382  
## 

Now we can validate this model using the validate function. We set B=200 to specify we want to perform 200 bootstrap resamples. Five statistics are calculated.

We expect to see index.corrected values smaller than the index.orig values, since the original values were calculated using the same data used to fit the model.

For example, the first row is Spearman’s rho, a measure of rank correlation that takes values between -1 and 1. When rho = 1, the predictions are perfectly discriminating. The original estimate was 0.5044. Bootstrap validation lowers it to about 0.4633 (at least when I ran it; bootstrapping is based on random resampling, so you’ll likely see a slightly different corrected value).

Likewise, the second row is R2, which measures the proportion of variation in outcome values explained by the model. We hope to see something close to 1. Originally it was 0.2911, but bootstrap validation lowers it (to about 0.2269 in my run).

pdm is the mean absolute difference between 0.5 (i.e., random guessing) and the predicted probability that the ordinal response is greater than or equal to the marginal median of the ordinal response. In other words, a higher value indicates that the model is discriminating between the ordered levels. Originally it was 0.2083, but it is lowered (to about 0.1803 in my run) after validation.

v <- validate(m_rms, B = 200)
v
##       index.orig training   test optimism index.corrected   n
## rho       0.5044   0.5255 0.4906   0.0349          0.4695 200
## R2        0.2911   0.3199 0.2683   0.0516          0.2395 200
## Slope     1.0000   1.0000 0.8921   0.1079          0.8921 200
## g         1.3353   1.4615 1.2560   0.2055          1.1298 200
## pdm       0.2083   0.2229 0.2008   0.0222          0.1861 200

See this blog post for definitions of g and Slope: https://randomeffect.net/post/2021/05/02/the-rms-validate-function/

See also a blog post I wrote on this topic: https://data.library.virginia.edu/getting-started-with-bootstrap-model-validation/

Model assessment via simulation

A good model should generate data that looks similar to our observed data. We can use the simulate() function to rapidly simulate many new sets of our response variable. Below we use our model to generate 200 new sets of the Improved variable. The result is a data frame with 200 columns, where each column is a simulated outcome from our model. The number of rows equals the sample size of our original data frame: 84. We’re taking our 84 observed predictors and using our model to simulate the outcome, Improved.

sims <- simulate(m1, nsim = 200)
dim(sims)
## [1]  84 200

Here’s a glimpse of the first 5 simulations:

sims[1:5,1:5]

Let’s calculate the proportion of each response within each simulation using xtabs() and proportions(). The last function, as.data.frame(), converts the table to a data frame and names the column containing the proportions “P” (the default is “Freq”).

xtabs(~ sims$sim_1) |>
  proportions() |>
  as.data.frame(responseName = "P")

To calculate proportions for all columns, we can convert the last chunk of code into a function and apply it to each column. Let’s name it p_df. Notice it’s the same code as above with x in place of sims$sim_1.

p_df <- function(x){
  xtabs(~ x) |> 
  proportions() |>
  as.data.frame(responseName = "P")
  }

Let’s test it.

p_df(sims$sim_1)

Now let’s apply the function to every column of the sims data frame using lapply() and save the result as sims_p. The “l” in lapply means the result will be a list.

sims_p <- lapply(sims, p_df)

The result is a list of 200 data frames. Let’s look at the first two, using the $ operator to extract each data frame.

sims_p$sim_1
sims_p$sim_2

We can combine all the data frames into one data frame using rbind(). The arguments to rbind are the data frames we want to combine. Instead of typing out all 200 data frames, we can use the do.call() function to “feed” all the data frames to the rbind function as follows. Notice I assign to an object called sims_pdf.

sims_pdf <- do.call(rbind, sims_p)

Now we have a data frame that we can use to create a plot using ggplot2.

ggplot() +
  geom_jitter(mapping = aes(x = x, y = P), 
             data = sims_pdf, alpha = 0.25,
             width = 0.1, height = 0)

How does this compare to the original data? Let’s add that to the plot. We need to create a data frame of the observed proportions. We can do that using the same code from above and name the result obs_pdf:

obs_pdf <- xtabs(~ Improved, data = Arthritis) |> 
  proportions() |>
  as.data.frame(responseName = "P")
obs_pdf

Now update the plot to include the original data as big blue dots.

ggplot() +
  # simulated data - gray dots
  geom_jitter(mapping = aes(x = x, y = P), 
             data = sims_pdf, alpha = 0.25,
             width = 0.1, height = 0) +
  # observed data - blue dots
  geom_point(mapping = aes(x = Improved, y = P), 
             data = obs_pdf, size = 4, color = "blue") 

The model doesn’t seem biased in any way: it doesn’t consistently over- or under-predict, and we want the model-simulated data to hover around the observed data. However, there is a good deal of variability. For example, the model simulates proportions for “None” ranging from about 0.4 to 0.6. Then again, this model only uses one binary predictor: Treatment.

Simulating data for ordinal logistic regression

It can be useful to know how to simulate data that will be analyzed via ordinal logistic regression. In addition it can help solidify our understanding of how ordinal logistic regression works.

Let’s say we want to simulate data (n = 300) for a pain scale with levels “none”, “mild”, “moderate”, and “severe”. Our predictor will be a binary treatment that reduces pain. Let’s assume our treatment reduces the odds of higher pain by about 80% compared to those who are not on treatment. This implies a log odds coefficient of about -1.6. The intercepts are set to -1.75, -0.5, and 1.25, which correspond to cumulative probabilities of about 0.15 (none), 0.38 (mild or lower), and 0.78 (moderate or lower) for subjects not on the treatment.

plogis(c(-1.75, -0.5, 1.25))
## [1] 0.1480472 0.3775407 0.7772999
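We can also check that a log odds coefficient of about -1.6 corresponds to roughly an 80% reduction in odds:

exp(-1.6)
## [1] 0.2018965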

So our models are as follows:

  1. log odds(none) = -1.75 - -1.6*trt
  2. log odds(mild or lower) = -0.5 - -1.6*trt
  3. log odds(moderate or lower) = 1.25 - -1.6*trt

Recall the formulas are on the log-odds scale. We can use plogis() to convert to probabilities.

trt <- sample(0:1, size = 300, replace = TRUE)
none <- plogis(-1.75 - -1.6*trt)
mild <- plogis(-0.5 - -1.6*trt)
moderate <- plogis(1.25 - -1.6*trt)
head(cbind(trt, none, mild, moderate))
##      trt      none      mild  moderate
## [1,]   1 0.4625702 0.7502601 0.9453187
## [2,]   1 0.4625702 0.7502601 0.9453187
## [3,]   0 0.1480472 0.3775407 0.7772999
## [4,]   0 0.1480472 0.3775407 0.7772999
## [5,]   1 0.4625702 0.7502601 0.9453187
## [6,]   1 0.4625702 0.7502601 0.9453187

Above we have cumulative probabilities. To simulate data we need the probabilities for each level. First we add a column for the cumulative probability of severe pain, the highest level. Since it’s the highest level, we set it to 1.

probs <- cbind(none, mild, moderate, severe = 1)

Next we take the differences between the cumulative probabilities to get probabilities for each level. The diff() function is useful for this. Recall the “none” column is already specific to the “none” level since it’s the lowest level. The t() function below transposes the result of the apply() function so we have 300 rows instead of 3.

probs2 <- cbind(none, t(apply(probs, MARGIN = 1, diff)))
head(probs2)
##           none      mild  moderate     severe
## [1,] 0.4625702 0.2876900 0.1950586 0.05468132
## [2,] 0.4625702 0.2876900 0.1950586 0.05468132
## [3,] 0.1480472 0.2294935 0.3997592 0.22270014
## [4,] 0.1480472 0.2294935 0.3997592 0.22270014
## [5,] 0.4625702 0.2876900 0.1950586 0.05468132
## [6,] 0.4625702 0.2876900 0.1950586 0.05468132

If we like we can confirm all rows sum to 1.

all(apply(probs2, 1, sum) == 1)
## [1] TRUE

Now we can use those probabilities to sample from a vector of pain values to create our response variable, which we’ll name “y”. We use the apply() function to apply the sample function to each row using the probabilities in that row.

pain <- c("none", "mild", "moderate", "severe")
y <- apply(probs2, MARGIN = 1, 
           FUN = function(x)sample(pain, size = 1, prob = x))

Next we combine our “trt” and “y” variables into a data frame and set up “y” as an ordered factor. Recall we use the levels argument to set the ordering.

d <- data.frame(y, trt)
d$y <- factor(d$y, levels = pain, ordered = TRUE)

Now we can “work backwards” and use polr() to see how close we get to recovering the true values we used to generate the data. With a sample size of 300 we do quite well.

mod <- polr(y ~ trt, data = d, Hess = TRUE)
summary(mod)
## Call:
## polr(formula = y ~ trt, data = d, Hess = TRUE)
## 
## Coefficients:
##      Value Std. Error t value
## trt -1.559     0.2265  -6.886
## 
## Intercepts:
##                 Value   Std. Error t value
## none|mild       -1.7092  0.1845    -9.2622
## mild|moderate   -0.2815  0.1525    -1.8463
## moderate|severe  1.2194  0.1771     6.8851
## 
## Residual Deviance: 762.2255 
## AIC: 770.2255

References