This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter (Win/Linux) or Cmd+Shift+Return (Mac).
# trees is a data set that comes with R
pairs(trees)
To hide the output, click the Expand/Collapse output button. To clear results (or an error), click the “x”.
You can also press Ctrl+Enter (Win/Linux) or Cmd+Return (Mac) to run one line of code at a time (instead of the entire chunk).
Add a new R code chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Win/Linux) or Cmd+Option+I (Mac).
Try the cumsum() function with the vector 1:5.
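One possible answer:

cumsum(1:5)
## [1]  1  3  6 10 15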
We’ll use the following packages in today’s workshop.
library(vcd)
## Loading required package: grid
library(MASS)
library(car)
## Loading required package: carData
library(effects)
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(ggplot2)
library(ggeffects)
library(emmeans)
Let’s say I randomly sample 100 UVA students and ask them to rate how spicy they like their food on a scale of 1 - 5, where 1 is no spice and 5 is maximum spiciness. Below are made-up responses entered manually using a named vector.
resp <- c("1 none" = 17,
          "2 mild" = 23,
          "3 medium" = 31,
          "4 hot" = 21,
          "5 maximum" = 8)
resp
## 1 none 2 mild 3 medium 4 hot 5 maximum
## 17 23 31 21 8
I can calculate proportions from this data and use these as estimates of probabilities. Calling the proportions() function on a vector of numbers returns the proportion of each value relative to the total of the vector.
proportions(resp)
## 1 none 2 mild 3 medium 4 hot 5 maximum
## 0.17 0.23 0.31 0.21 0.08
Based on this random sample I might estimate that a randomly selected UVA student has a probability of about 0.31 of liking medium spicy food. I’m using my observed proportion as an estimate of probability.
Another way to summarize the results is with cumulative probabilities. For example, what’s the probability a randomly selected student can handle medium heat or less? Using cumsum() we can calculate cumulative sums of proportions and answer this question. Below we take the proportions and “pipe” them into the cumsum() function using the base R pipe operator.
proportions(resp) |> cumsum()
## 1 none 2 mild 3 medium 4 hot 5 maximum
## 0.17 0.40 0.71 0.92 1.00
Based on this calculation I might estimate that a randomly selected student has a probability of about 0.71 of liking medium or lower spiciness. We might state this mathematically as follows (using LaTeX):
\[P[X \le \text{medium}] = 0.71 \]
In this case we’re using the ordering of the responses to make additional inferences about student spiciness preference.
What if we also collected additional information such as the sex of the respondents? Does being male or female affect the cumulative probability of 0.71?
What about age or weight? Does being older and heavier affect the cumulative probability of spiciness preference?
Ordinal Logistic Regression is a method that allows us to investigate questions such as this. It allows us to model the cumulative probability of an ordered category given multiple predictors.
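In general, the model has the form below: each level j of the response (except the highest) gets its own intercept, while the coefficients are shared across levels. This is written with the sign convention that polr() uses, which we’ll see later:

\[\text{logit}\, P[Y \le j] = \zeta_j - \beta_1 x_1 - \beta_2 x_2 - \dots \]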
Performing Ordinal Logistic Regression is like fitting any other statistical model. We have to…
Today we’ll work with data from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis (Koch & Edwards, 1988). The data is available in the vcd package (Meyer et al., 2022).
data("Arthritis", package = "vcd")
If for some reason you can’t install the vcd package, you can also download the data from GitHub:
URL <- "https://github.com/clayford/OLR_in_R/raw/main/data/arthritis.rds"
Arthritis <- readRDS(file = url(URL))
names(Arthritis)
## [1] "ID" "Treatment" "Sex" "Age" "Improved"
The goal today is to teach the basics of Ordinal Logistic Regression by developing a model to predict the cumulative probability of pain improvement after treatment adjusting for age and sex.
An ordered factor is a categorical variable with ordered levels. Examples of ordered categorical variables include survey responses like the spiciness ratings above, pain severity scales (none < mild < moderate < severe), and levels of improvement (None < Some < Marked).
We can format a categorical variable as an ordered factor using the factor() function with the argument ordered = TRUE. We are responsible for creating the order! Otherwise R will set the order alphabetically.
Quick example using fake data: sample from a vector of ordered levels 50 times with replacement using the sample() function. The result is a character vector.
experience <- sample(c("none", "some", "proficient", "expert"),
                     size = 50, replace = TRUE)
head(experience)
## [1] "proficient" "expert" "some" "proficient" "none"
## [6] "none"
Let’s convert to factor using the factor() function. Notice there is no ordering of the levels. They are simply listed alphabetically.
experience <- factor(experience)
head(experience)
## [1] proficient expert some proficient none none
## Levels: expert none proficient some
Now convert to an ordered factor using the argument ordered = TRUE. Notice the levels are ordered but still in alphabetical order. The notation expert < none < proficient < some says “expert” is less than “none”, which is less than “proficient”, which is less than “some”. According to this ordering “some” is the highest, which doesn’t make sense.
experience <- factor(experience, ordered = TRUE)
head(experience)
## [1] proficient expert some proficient none none
## Levels: expert < none < proficient < some
Now let’s do it the right way: specify the order of levels using the levels argument.
experience <- factor(experience, ordered = TRUE,
                     levels = c("none", "some", "proficient", "expert"))
head(experience)
## [1] proficient expert some proficient none none
## Levels: none < some < proficient < expert
This is an important data wrangling step when preparing for Ordinal Logistic Regression.
Our response is the Improved variable. Notice it’s already an ordered categorical variable:
head(Arthritis$Improved)
## [1] Some None None Marked Marked Marked
## Levels: None < Some < Marked
Let’s tabulate the counts using xtabs(). The notation ~ Improved means “tabulate by Improved”.
xtabs(~ Improved, data = Arthritis)
## Improved
## None Some Marked
## 42 14 28
We can get cumulative proportions using proportions() and cumsum().
xtabs(~ Improved, data = Arthritis) |>
proportions() |>
cumsum()
## None Some Marked
## 0.5000000 0.6666667 1.0000000
About 67% of subjects experienced None or Some improvement. Therefore about 33% of subjects experienced marked improvement.
How does the cumulative probability of Improved break down between the Treatments? The notation ~ Treatment + Improved means “tabulate by Treatment and Improved, with Treatment in the rows and Improved in the columns.” The margin = 1 argument in proportions() says to calculate row-wise proportions. Then we “apply” the cumsum() function to the rows of the table. Finally we use t() to rotate or “transpose” the table.
xtabs(~ Treatment + Improved, data = Arthritis) |>
proportions(margin = 1) |>
apply(MARGIN = 1, cumsum) |>
t()
## Improved
## Treatment None Some Marked
## Placebo 0.6744186 0.8372093 1
## Treated 0.3170732 0.4878049 1
About 84% of subjects experienced None or Some improvement in the Placebo group versus 49% in the Treatment group. That implies only 16% experienced Marked improvement in the Placebo group versus 51% in the Treatment group.
Add a new R code chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Win/Linux) or Cmd+Option+I (Mac).
Ordinal logistic regression produces a model that we can use to make cumulative probability predictions or quantify how predictors are associated with cumulative probabilities. Like binary logistic regression, the results are on the log odds scale.
Let’s fit a simple model for the Arthritis data. Below we model Improved as a function of Treatment. In other words, how does Treatment affect the cumulative probability of Improved?
One way to do this is with the polr() function from the MASS package. “polr” stands for “proportional odds logistic regression”. We’ll explain proportional odds in a moment. As with lm() and glm(), we define our model using R’s formula syntax. We also typically want to save the model result and inspect the results using summary().
m1 <- polr(Improved ~ Treatment, data = Arthritis)
summary(m1)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = Improved ~ Treatment, data = Arthritis)
##
## Coefficients:
## Value Std. Error t value
## TreatmentTreated 1.568 0.4441 3.53
##
## Intercepts:
## Value Std. Error t value
## None|Some 0.7499 0.3210 2.3363
## Some|Marked 1.5499 0.3559 4.3553
##
## Residual Deviance: 156.606
## AIC: 162.606
The message “Re-fitting to get Hessian” means the model was refit when we called summary() to get a matrix called the “Hessian” in order to compute standard errors. To suppress this message, use Hess = TRUE in the call to polr().
Notice we have multiple intercepts which are listed separately from the coefficients. The separate intercepts are for the levels below which we calculate cumulative probabilities. Recall the levels of Improved:
None < Some < Marked
There are two cumulative probabilities we can calculate:

- P[Improved ≤ None]
- P[Improved ≤ Some]

(Marked and lower will always be 1.)
So we have two models depending on the cumulative probability we want to model:

\[\text{logit}\, P[\text{Improved} \le \text{None}] = 0.7499 - 1.568 \times \text{Treatment} \]

\[\text{logit}\, P[\text{Improved} \le \text{Some}] = 1.5499 - 1.568 \times \text{Treatment} \]

where Treatment = 1 if treated, 0 if on Placebo.

Notice the coefficient is subtracted from the intercept.
That’s the model that polr() fits. Also notice the model is on the log odds, or logit, scale. Recall that the logit, log(p/(1 - p)), rescales probability from [0,1] to (-Inf, +Inf). So the model predicts log odds values, which we then transform back to probability using the inverse logit.
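As a quick sanity check of that relationship, qlogis() computes the logit (log odds) and plogis() computes its inverse:

p <- 0.71
(lo <- qlogis(p)) # log odds: log(0.71/0.29), about 0.895
plogis(lo) # back to the original probability, 0.71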
Now let’s use our model and compute cumulative probabilities “by hand”. Before we do that, it’s good to know how to extract the intercepts and coefficients from the model object. The model object is a list. We use the $ operator to extract elements from it. The intercepts are in the zeta element and the coefficients are in the coefficients element. (We can also use coef() to extract coefficients.)
m1$zeta
## None|Some Some|Marked
## 0.7499107 1.5498756
m1$coefficients
## TreatmentTreated
## 1.567644
The log odds of None given a subject was on Placebo (Treatment = 0):
log odds(None) = 0.7499 - 1.568*0 = 0.7499
That’s simply the None|Some intercept: 0.7499107. To convert to probability, we can use the plogis() function. Below we pipe the None|Some intercept into plogis() and save as p1.
(p1 <- m1$zeta["None|Some"] |> plogis())
## None|Some
## 0.6791592
The log odds of None given a subject was on Treatment (Treatment = 1):
log odds(None) = 0.7499 - 1.568*1
(p2 <- (m1$zeta["None|Some"] - m1$coefficients) |> plogis())
## None|Some
## 0.306245
So we predict about 68% of subjects on Placebo will see no improvement versus only about 32% of subjects on Treatment.
The log odds of Some or lower given a subject was on Placebo (Treatment = 0):
log odds(Some or lower) = 1.5499 - 1.568*0 = 1.5499
That’s simply the Some|Marked intercept: 1.5498756.
(p3 <- m1$zeta["Some|Marked"] |> plogis())
## Some|Marked
## 0.8248958
The log odds of Some or lower given a subject was on Treatment (Treatment = 1):
log odds(Some or lower) = 1.5498756 - 1.567644
(p4 <- (m1$zeta["Some|Marked"] - m1$coefficients) |> plogis())
## Some|Marked
## 0.495558
This implies about 1 - 0.50 = 50% of Treated subjects can expect Marked improvement versus only 1 - 0.82 = 18% of subjects on Placebo.
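We can get those implied probabilities of Marked improvement directly from the saved values:

1 - p3 # P(Marked) for Placebo
## Some|Marked
## 0.1751042
1 - p4 # P(Marked) for Treated
## Some|Marked
## 0.504442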
Let’s compare the predicted cumulative probabilities to the observed cumulative proportions.
# Observed cumulative proportions
xtabs(~ Treatment + Improved, data = Arthritis) |>
proportions(margin = 1) |>
apply(MARGIN = 1, cumsum) |>
t()
## Improved
## Treatment None Some Marked
## Placebo 0.6744186 0.8372093 1
## Treated 0.3170732 0.4878049 1
Now compare to our model-based predicted cumulative probabilities. I’m laying out the previously calculated probabilities as a matrix to match the layout of xtabs().
# Modeled cumulative probabilities
matrix(c(p1, p2, p3, p4, 1, 1), nrow = 2,
       dimnames = list("Treatment" = c("Placebo", "Treated"),
                       "Improved" = c("None", "Some", "Marked")))
## Improved
## Treatment None Some Marked
## Placebo 0.6791592 0.8248958 1
## Treated 0.3062450 0.4955580 1
Our model-based predictions are quite close to the observed proportions, but not identical. This is because of the proportional odds assumption.
Recall: odds = p/(1 - p)
Predicted probability of Improved = None on Treatment is 0.3062450, which is stored in p2. The odds are calculated as 0.3062450/(1 - 0.3062450).
p2/(1 - p2)
## None|Some
## 0.441431
Predicted probability of Improved = None on Placebo is 0.6791592, which is stored in p1. The odds are calculated as 0.6791592/(1 - 0.6791592).
p1/(1 - p1)
## None|Some
## 2.116811
The odds ratio for those two odds is
(p2/(1 - p2))/
(p1/(1 - p1))
## None|Some
## 0.2085359
If we take the log of that we get the model coefficient for Treatment with a change in sign. (We explain why that is shortly.)
# compare to coef(m1)
log((p2/(1 - p2))/
(p1/(1 - p1)))
## None|Some
## -1.567644
Now let’s do the same calculation for the Some or lower cumulative probabilities:
# p4 = P(Some or lower) if Treated
# p3 = P(Some or lower) if Placebo
log((p4/(1 - p4))/
(p3/(1 - p3)))
## Some|Marked
## -1.567644
Notice we get the same answer. That’s because according to our model, the Treatment effect is proportional to the ordered category levels. That’s the proportional odds assumption. The coefficients are independent of the levels of the ordered response. It doesn’t matter whether we go from “None” to “Some”, or “Some” to “Marked”, the effect of Treatment is the same.
Let’s look at the coefficient from the model again.
coef(m1)
## TreatmentTreated
## 1.567644
This is identical to what we derived above, but with a change in sign. polr() multiplies the coefficients in the summary output by -1 so when you exponentiate and take the odds ratio, higher levels of predictors correspond to the response falling in the higher end of the ordinal scale.
exp(coef(m1))
## TreatmentTreated
## 4.795338
Interpretation: the estimated odds that a Treated subject’s response has higher improvement is about 4.8 times the odds that a Placebo subject’s response has higher improvement.
Notice if we “undo” the multiplication by -1 we get the odds ratio we calculated above.
exp(-coef(m1)) # -coef(m1) is equivalent to -1*coef(m1)
## TreatmentTreated
## 0.2085359
Compare to:
(p2/(1 - p2))/
(p1/(1 - p1))
## None|Some
## 0.2085359
And to:
(p4/(1 - p4))/
(p3/(1 - p3))
## Some|Marked
## 0.2085359
The substance of the result remains the same, but the interpretation is reversed: the estimated odds that a Treated subject’s response has lower improvement is about 80% less than the odds that a Placebo subject’s response has lower improvement.
Earlier we mentioned that proportional odds logistic regression assumes the coefficients are independent of the category levels. This means we assume the estimated odds ratio of Treatment, 4.795, is the same whether we’re talking about comparing None versus Some or higher, or None or Some versus Marked on the Improvement scale.
One way to assess whether this assumption is satisfied is the poTest() function in the car package. Simply call it on the fitted model. The null hypothesis is that the proportional odds assumption is true; rejecting the null is evidence against the assumption. The poTest() function provides an overall test of the PO assumption as well as separate tests for each predictor. We only have one predictor, so both tests are the same. The result below implies we’re safe in our assumption of proportional odds.
poTest(m1)
##
## Tests for Proportional Odds
## polr(formula = Improved ~ Treatment, data = Arthritis)
##
## b[polr] b[>None] b[>Some] Chisquare df Pr(>Chisq)
## Overall 0.21 1 0.64
## TreatmentTreated 1.568 1.495 1.686 0.21 1 0.64
The authors of the car package warn, “It’s our experience that the proportional-odds assumption is rarely supported by a hypothesis test.” (Fox and Weisberg, 2019, p. 322)
A graphical method for assessing the proportional odds assumption is available in the rms package. The function plot.xmean.ordinaly plots both…
When the solid and dashed lines roughly follow the same trajectory, we have good evidence that the PO assumption is safe. This looks great!
# rms:: allows us to use the function without loading the rms package
rms::plot.xmean.ordinaly(Improved ~ Treatment, data = Arthritis)
According to Frank Harrell, Violation of Proportional Odds is Not Fatal. See https://www.fharrell.com/post/po/
If you think the proportional odds assumption is badly violated, two options to consider:

- the vglm() function in the VGAM package
- the multinom() function in the nnet package

Let’s look again at the output summary of our model. First we’ll refit the model with Hess = TRUE to suppress the “re-fitting” message.
m1 <- polr(formula = Improved ~ Treatment, data = Arthritis,
           Hess = TRUE)
summary(m1)
## Call:
## polr(formula = Improved ~ Treatment, data = Arthritis, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## TreatmentTreated 1.568 0.4441 3.53
##
## Intercepts:
## Value Std. Error t value
## None|Some 0.7499 0.3210 2.3363
## Some|Marked 1.5499 0.3559 4.3553
##
## Residual Deviance: 156.606
## AIC: 162.606
The authors of the polr() function elected not to output p-values. P-values do not measure the magnitude of an effect. They simply provide evidence about whether the sign of a coefficient is reliably positive or negative. A confidence interval on the coefficient is more informative, which we demonstrate next. However, if you absolutely need a p-value, you can calculate one using the pt() function, setting:
- q to the t value
- df to the number of observations minus the number of coefficients estimated
- lower.tail = FALSE to get the upper tail (ie, > t)

pt(q = 3.53, df = 84 - 3, lower.tail = FALSE) * 2
## [1] 0.0006884609
The coeftest() function in the lmtest package will also return p-values.
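For example, assuming the lmtest package is installed (we don’t load it above):

lmtest::coeftest(m1)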
To understand the uncertainty of the magnitude of the coefficient, we can calculate 95% confidence intervals using confint(). Notice we use exp() to get the odds ratio. Also notice that only a CI for the predictor is calculated, not the intercepts.
exp(confint(m1))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 2.044824 11.739936
The estimated odds that a Treated subject’s response has higher improvement is at least 2 times the odds that a Placebo subject’s response has higher improvement, perhaps as high as nearly 12 times.
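If you need intervals for the intercepts as well, one option is a quick Wald approximation built from the summary output. This is a sketch using the normal approximation, not part of the original workshop code:

# Wald 95% CIs for the intercepts (normal approximation)
s <- summary(m1)$coefficients
z <- s[names(m1$zeta), ] # the intercept rows
cbind(lower = z[, "Value"] - 1.96 * z[, "Std. Error"],
      upper = z[, "Value"] + 1.96 * z[, "Std. Error"])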
Earlier we made predictions “by hand” (for teaching purposes) with intercept and coefficient values. In practice we would likely use predict() to get fitted probabilities. Using the newdata argument we can specify what we want to make predictions for. The default is to predict the class, or level, for the given predictors (type = "class").

The Placebo group is predicted to experience No improvement while the Treated group is predicted to experience Marked improvement.
predict(m1, newdata = data.frame(Treatment = c("Placebo", "Treated")),
        type = "class")
## [1] None Marked
## Levels: None Some Marked
Changing to type = "probs" returns expected probabilities for each class.
predict(m1, newdata = data.frame(Treatment = c("Placebo", "Treated")),
        type = "probs")
## None Some Marked
## 1 0.6791592 0.1457365 0.1751042
## 2 0.3062450 0.1893130 0.5044420
Notice these are NOT cumulative probabilities. These are expected probabilities for each level, and they sum to 1. We see that the type = "class" setting is simply picking the category with the highest probability.
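We can verify that directly by picking the column with the largest probability in each row using base R’s max.col():

p <- predict(m1, newdata = data.frame(Treatment = c("Placebo", "Treated")),
             type = "probs")
colnames(p)[max.col(p)]
## [1] "None" "Marked"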
The predict() method unfortunately does not provide standard errors and confidence intervals for the predictions. However, we can use the emmeans package to calculate them. Below we obtain estimated probabilities of each level of Improved conditional on Treatment. The mode = "prob" argument ensures predictions are of the probabilities of each class of the ordinal response.
emmeans(m1, ~ Improved|Treatment, mode = "prob")
## Treatment = Placebo:
## Improved prob SE df asymp.LCL asymp.UCL
## None 0.679 0.0699 Inf 0.5421 0.816
## Some 0.146 0.0396 Inf 0.0682 0.223
## Marked 0.175 0.0514 Inf 0.0744 0.276
##
## Treatment = Treated:
## Improved prob SE df asymp.LCL asymp.UCL
## None 0.306 0.0686 Inf 0.1717 0.441
## Some 0.189 0.0467 Inf 0.0978 0.281
## Marked 0.504 0.0769 Inf 0.3537 0.655
##
## Confidence level used: 0.95
We can pipe this result into plot() for a basic visualization.
emmeans(m1, ~ Improved|Treatment, mode = "prob") |>
plot()
Which leads us to the next topic.
Between multiple intercepts, log odds, and cumulative probabilities, ordinal logistic regression output can be hard to interpret and communicate. Fortunately we have some methods for visualizing models.
The effects package provides some good out-of-the-box lattice plots for visualizing ordinal logistic regression models. Use the Effect() function to specify which predictor(s) you want to plot. In our model we only have one choice: Treatment. Then pipe into the plot() function. This results in a plot for each outcome. We can see the probability of Marked improvement is much higher for the Treated group than the Placebo group. Likewise the probability of no improvement is much lower in the Treated group than in the Placebo group.
Effect(m1, focal.predictors = "Treatment") |> plot()
We can combine the lines into one plot as follows:
Effect(m1, focal.predictors = "Treatment") |>
  plot(lines = list(multiline = TRUE),
       confint = list(style = "bars"))
Another visualization option is the “stacked” plot, though there is no indication of uncertainty. This plot is probably better suited for a numeric predictor.
Effect(m1, focal.predictors = "Treatment") |>
plot(axes=list(y=list(style="stacked")))
If you want something created in ggplot2, the ggeffects package provides the ggeffect() function and associated plotting method. I think the connect.lines = TRUE argument is good to use.
ggeffect(m1, terms = "Treatment") |> plot(connect.lines = TRUE)
Now that we’ve laid out the basics of fitting, assessing and using an Ordinal Logistic Regression model, let’s work through a more sophisticated example with additional predictors.
- Fit a model of Improved with additional predictors and save it as m2. Include Hess = TRUE in polr().
- Is m2 better than m1? Use anova(m1, m2) to test.

Add a new R code chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I (Win/Linux) or Cmd+Option+I (Mac).
Thanks for coming! For free statistical consulting and training, contact us: statlab@virginia.edu
How will our model perform for new data that was not used to create the model? Model validation can help us assess this question.
A common approach to model validation is the train/test data-splitting approach, where we hold out a random subset of data (the test set), fit the model to the remaining data (the training set), and then see how well the model performs with the test data. However, this reduces the sample size used for model development and only validates a model fit to a subset of the data. There’s also a chance a lucky split could produce unusually low or high predictive accuracy.
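Here is a minimal sketch of the train/test approach for our data. The 70/30 split, the seed, and accuracy as the performance measure are arbitrary choices for illustration:

set.seed(42)
idx <- sample(nrow(Arthritis), size = floor(0.7 * nrow(Arthritis)))
train <- Arthritis[idx, ]
test <- Arthritis[-idx, ]
m_train <- polr(Improved ~ Treatment, data = train, Hess = TRUE)
# proportion of held-out cases whose class is predicted correctly
mean(predict(m_train, newdata = test, type = "class") == test$Improved)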
A more efficient approach is bootstrap validation. The basic idea is to resample your data with replacement, refit the model, then use the bootstrap model to calculate some statistic using the original data.
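Before we turn to the rms package, here is a sketch of that idea “by hand” for the simple model m1, using Spearman’s rho between each observation’s expected level and its observed level as the statistic. The rho_stat() helper is hypothetical; validate() in rms does all of this (and more) for us:

set.seed(1)
rho_stat <- function(fit, data) {
  # expected level: predicted probabilities weighted by level index 1:3
  ev <- predict(fit, newdata = data, type = "probs") %*% 1:3
  cor(as.numeric(ev), as.numeric(data$Improved), method = "spearman")
}
rho_orig <- rho_stat(m1, Arthritis)
optimism <- replicate(200, {
  b <- Arthritis[sample(nrow(Arthritis), replace = TRUE), ]
  mb <- polr(Improved ~ Treatment, data = b)
  rho_stat(mb, b) - rho_stat(mb, Arthritis) # training minus test
})
rho_orig - mean(optimism) # optimism-corrected estimate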
The rms package provides the validate function for this purpose. However it only works for models fit using rms functions. Instead of polr(), we need to use orm(). We also need to specify x = TRUE, y = TRUE so the original data is stored in the model object. To see a summary of the fit, we simply need to print the model object. Notice it returns more output than a model fit with polr(), including several discrimination indices. Also notice the cumulative probabilities are in the reverse direction (ie, Prob(None or higher) = 1, as opposed to Prob(Marked or lower) = 1 with polr()).
library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'rms'
## The following object is masked from 'package:emmeans':
##
## contrast
## The following objects are masked from 'package:car':
##
## Predict, vif
m_rms <- orm(Improved ~ Treatment + Sex + Age,
             data = Arthritis, x = TRUE, y = TRUE)
m_rms
## Logistic (Proportional Odds) Ordinal Regression Model
##
## orm(formula = Improved ~ Treatment + Sex + Age, data = Arthritis,
## x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 84 LR chi2 24.46 R2 0.291 rho 0.504
## None 42 d.f. 3 R2(3,84) 0.225
## Some 14 Pr(> chi2) <0.0001 R2(3,70) 0.264
## Marked 28 Score chi2 23.32 |Pr(Y>=median)-0.5| 0.208
## Distinct Y 3 Pr(> chi2) <0.0001
## Median Y 1
## max |deriv| 0.008
##
## Coef S.E. Wald Z Pr(>|Z|)
## y>=Some -2.5319 1.0570 -2.40 0.0166
## y>=Marked -3.4309 1.0911 -3.14 0.0017
## Treatment=Treated 1.7453 0.4759 3.67 0.0002
## Sex=Male -1.2516 0.5464 -2.29 0.0220
## Age 0.0382 0.0184 2.07 0.0382
##
Now we can validate this model using the validate function. We set B = 200 to specify we want to perform 200 bootstrap resamples. Five statistics are calculated.

- index.orig are the original statistics, calculated using the fitted model on the original data.
- index.corrected are the “corrected” statistics based on the bootstrap validation.
- The optimism column summarizes how “optimistic” the original statistic was. The index.corrected value is simply the difference between index.orig and optimism.

We expect to see index.corrected values smaller than the index.orig values, since the original values have been calculated using the same data used to fit the model.
For example, the first row is Spearman’s rho, a measure of correlation that takes values between -1 and 1. When rho = 1, the predictions are perfectly discriminating. The original estimate was 0.5044. Bootstrap validation lowers it to about 0.4633 (at least when I ran it; bootstrapping is based on random resampling, so you’ll likely see a slightly different corrected value).

Likewise, the second row is R2, which measures the proportion of variation in the outcome explained by the model. We hope to see something close to 1. Originally it was 0.2911, but the bootstrap validation lowers it to about 0.2269.

pdm is the mean absolute difference between 0.5 (ie, random guessing) and the predicted probability that the ordinal response is greater than or equal to the marginal median of the ordinal response. In other words, a higher value indicates that the model is better at discriminating between the ordered levels. Originally it was 0.2083, but it is lowered to about 0.1803 after validation.
v <- validate(m_rms, B = 200)
v
## index.orig training test optimism index.corrected n
## rho 0.5044 0.5255 0.4906 0.0349 0.4695 200
## R2 0.2911 0.3199 0.2683 0.0516 0.2395 200
## Slope 1.0000 1.0000 0.8921 0.1079 0.8921 200
## g 1.3353 1.4615 1.2560 0.2055 1.1298 200
## pdm 0.2083 0.2229 0.2008 0.0222 0.1861 200
See this blog post for definitions of g and Slope: https://randomeffect.net/post/2021/05/02/the-rms-validate-function/
See also a blog post I wrote on this topic: https://data.library.virginia.edu/getting-started-with-bootstrap-model-validation/
A good model should generate data that looks similar to our observed data. We can use the simulate() function to rapidly simulate many new sets of our response variable. Below we use our model to generate 200 new sets of the Improved variable. The result is a data frame with 200 columns, where each column is a simulated outcome from our model. The number of rows equals the sample size of our original data frame: 84. We’re taking our 84 observed predictors and using our model to simulate the outcome, Improved.
sims <- simulate(m1, nsim = 200)
dim(sims)
## [1] 84 200
Here’s a glimpse of the first 5 simulations:
sims[1:5,1:5]
Let’s calculate the proportion of each response within each simulation using xtabs() and proportions(). The last function, as.data.frame(), converts the table to a data frame; the responseName argument names the column containing the proportions “P” (the default is “Freq”).
xtabs(~ sims$sim_1) |>
proportions() |>
as.data.frame(responseName = "P")
To calculate proportions for all columns, we can convert the last chunk of code into a function and apply it to each column. Let’s name it p_df. Notice it’s the same code above with x in place of sims$sim_1.
p_df <- function(x){
  xtabs(~ x) |>
    proportions() |>
    as.data.frame(responseName = "P")
}
Let’s test it.
p_df(sims$sim_1)
Now let’s apply the function to every column of the sims data frame using lapply() and save the result as sims_p. The “L” in lapply means the result will be a list.
sims_p <- lapply(sims, p_df)
The result is a list with 200 data frames. Let’s look at the first two, using the $ operator to extract each data frame:
sims_p$sim_1
sims_p$sim_2
We can combine all the data frames into one data frame using rbind(). The arguments to rbind are the data frames we want to combine. Instead of typing out all 200 data frames, we can use the do.call() function to “feed” all the data frames to the rbind function as follows. Notice I assign to an object called sims_pdf.
sims_pdf <- do.call(rbind, sims_p)
Now we have a data frame that we can use to create a plot using ggplot2.
ggplot() +
  geom_jitter(mapping = aes(x = x, y = P),
              data = sims_pdf, alpha = 0.25,
              width = 0.1, height = 0)
How does this compare to the original data? Let’s add that to the plot. We need to create a data frame of the observed proportions. We can do that using the same code from above and name the result obs_pdf:
obs_pdf <- xtabs(~ Improved, data = Arthritis) |>
proportions() |>
as.data.frame(responseName = "P")
obs_pdf
Now update the plot to include the original data as big blue dots.
ggplot() +
  # simulated data - gray dots
  geom_jitter(mapping = aes(x = x, y = P),
              data = sims_pdf, alpha = 0.25,
              width = 0.1, height = 0) +
  # observed data - blue dots
  geom_point(mapping = aes(x = Improved, y = P),
             data = obs_pdf, size = 4, color = "blue")
The model doesn’t seem biased in any way: it doesn’t consistently over- or under-predict, and the model-simulated data hover around the observed data, which is what we want. However, there is a good deal of variability. For example, the simulated proportion of “None” ranges from about 0.4 to 0.6. Then again, this model only uses one binary predictor: Treatment.
It can be useful to know how to simulate data that will be analyzed via ordinal logistic regression. In addition it can help solidify our understanding of how ordinal logistic regression works.
Let’s say we want to simulate data (n = 300) for a pain scale with levels “none”, “mild”, “moderate”, and “severe”. Our predictor will be a binary treatment that reduces pain. Let’s assume our treatment will reduce the odds of pain by about 80% compared to those who are not on treatment. This implies a log odds coefficient of about -1.6. The intercepts are set to -1.75, -0.5, and 1.25, which correspond to probabilities of about 0.15 (none), 0.38 (mild and below), and 0.78 (moderate and below) for subjects not on the treatment.
plogis(c(-1.75, -0.5, 1.25))
## [1] 0.1480472 0.3775407 0.7772999
So our models are as follows:

\[\text{logit}\, P[\text{pain} \le \text{none}] = -1.75 - (-1.6 \times \text{trt}) \]

\[\text{logit}\, P[\text{pain} \le \text{mild}] = -0.50 - (-1.6 \times \text{trt}) \]

\[\text{logit}\, P[\text{pain} \le \text{moderate}] = 1.25 - (-1.6 \times \text{trt}) \]
Recall the formulas are on the log-odds scale. We can use plogis() to convert to probabilities.
trt <- sample(0:1, size = 300, replace = TRUE)
none <- plogis(-1.75 - -1.6*trt)
mild <- plogis(-0.5 - -1.6*trt)
moderate <- plogis(1.25 - -1.6*trt)
head(cbind(trt, none, mild, moderate))
## trt none mild moderate
## [1,] 1 0.4625702 0.7502601 0.9453187
## [2,] 1 0.4625702 0.7502601 0.9453187
## [3,] 0 0.1480472 0.3775407 0.7772999
## [4,] 0 0.1480472 0.3775407 0.7772999
## [5,] 1 0.4625702 0.7502601 0.9453187
## [6,] 1 0.4625702 0.7502601 0.9453187
Above we have cumulative probabilities. To simulate data we need the probabilities for each level. First we add a column for the cumulative probability of severe pain, the highest level. Since it’s the highest level, we set it to 1.
probs <- cbind(none, mild, moderate, severe = 1)
Next we take the differences between the cumulative probabilities to get probabilities for each level. The diff() function is useful for this. Recall the “none” column is already specific to the “none” level since it’s the lowest level. The t() function below transposes the result of the apply() function so we have 300 rows instead of 3.
probs2 <- cbind(none, t(apply(probs, MARGIN = 1, diff)))
head(probs2)
## none mild moderate severe
## [1,] 0.4625702 0.2876900 0.1950586 0.05468132
## [2,] 0.4625702 0.2876900 0.1950586 0.05468132
## [3,] 0.1480472 0.2294935 0.3997592 0.22270014
## [4,] 0.1480472 0.2294935 0.3997592 0.22270014
## [5,] 0.4625702 0.2876900 0.1950586 0.05468132
## [6,] 0.4625702 0.2876900 0.1950586 0.05468132
If we like we can confirm all rows sum to 1.
all(apply(probs2, 1, sum) == 1)
## [1] TRUE
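Exact equality comparisons on floating-point sums can occasionally fail even when the math is right; a tolerance-based check is a safer alternative (not in the original code):

all(abs(rowSums(probs2) - 1) < 1e-8)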
Now we can use those probabilities to sample from a vector of pain values to create our response variable, which we’ll name “y”. We use the apply() function to apply the sample function to each row, using the probabilities in that row.
pain <- c("none", "mild", "moderate", "severe")
y <- apply(probs2, MARGIN = 1,
           FUN = function(x) sample(pain, size = 1, prob = x))
Next we combine our “trt” and “y” variables into a data frame and set up “y” as an ordered factor. Recall we use the levels argument to set the ordering.
d <- data.frame(y, trt)
d$y <- factor(d$y, levels = pain, ordered = TRUE)
Now we can “work backwards” and use polr() to see how close we get to recovering the true values we used to generate the data. With a sample size of 300 we do quite well.
mod <- polr(y ~ trt, data = d, Hess = TRUE)
summary(mod)
## Call:
## polr(formula = y ~ trt, data = d, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## trt -1.559 0.2265 -6.886
##
## Intercepts:
## Value Std. Error t value
## none|mild -1.7092 0.1845 -9.2622
## mild|moderate -0.2815 0.1525 -1.8463
## moderate|severe 1.2194 0.1771 6.8851
##
## Residual Deviance: 762.2255
## AIC: 770.2255