Ordinal Logistic Regression

Student Data Set

A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories.

Data on parental educational status, whether the undergraduate institution is public or private, and current GPA is also collected. The researchers have reason to believe that the “distances” between these three points are not equal. For example, the “distance” between “unlikely” and “somewhat likely” may be shorter than the distance between “somewhat likely” and “very likely”.

dat <- read.csv("https://raw.githubusercontent.com/RWorkshop/workshopdatasets/master/ologit.csv")


dat <- dat %>% mutate(apply = factor(apply))

## one at a time, table apply, pared, and public
lapply(dat[, c("apply", "pared", "public")], table)

## $apply
## 
## somewhat likely        unlikely     very likely 
##             140             220              40 
## 
## $pared
## 
##   0   1 
## 337  63 
## 
## $public
## 
##   0   1 
## 343  57

## three way cross tabs (xtabs) and flatten the table
ftable(xtabs(~ public + apply + pared, data = dat))

##                        pared   0   1
## public apply                        
## 0      somewhat likely        98  26
##        unlikely              175  14
##        very likely            20  10
## 1      somewhat likely        12   4
##        unlikely               25   6
##        very likely             7   3

summary(dat$gpa)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   2.720   2.990   2.999   3.270   4.000

ggplot(dat, aes(x = apply, y = gpa)) +
  geom_boxplot(size = .75) +
  geom_jitter(alpha = .5) +
  facet_grid(pared ~ public, margins = TRUE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

Below we use the polr command from the MASS package to estimate an ordered logistic regression model. The command name comes from proportional odds logistic regression, highlighting the proportional odds assumption in our model. polr uses the standard formula interface in R for specifying a regression model with outcome followed by predictors. We also specify Hess=TRUE to have the model return the observed information matrix from optimization (called the Hessian) which is used to get standard errors.

## fit ordered logit model and store results 'm'
m <- polr(apply ~ pared + public + gpa, data = dat, Hess=TRUE)

## view a summary of the model
summary(m)

## Call:
## polr(formula = apply ~ pared + public + gpa, data = dat, Hess = TRUE)
## 
## Coefficients:
##          Value Std. Error t value
## pared  -0.2635     0.2876 -0.9162
## public  0.5600     0.2955  1.8953
## gpa    -0.0708     0.2533 -0.2795
## 
## Intercepts:
##                          Value   Std. Error t value
## somewhat likely|unlikely -0.7959  0.7516    -1.0589
## unlikely|very likely      2.0445  0.7606     2.6882
## 
## Residual Deviance: 736.9261 
## AIC: 746.9261

## store table
(ctable <- coef(summary(m)))

##                                Value Std. Error    t value
## pared                    -0.26352826  0.2876425 -0.9161658
## public                    0.55999663  0.2954713  1.8952655
## gpa                      -0.07080247  0.2533390 -0.2794772
## somewhat likely|unlikely -0.79587821  0.7516351 -1.0588625
## unlikely|very likely      2.04454674  0.7605593  2.6882148

## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2

## combined table
(ctable <- cbind(ctable, "p value" = p))

##                                Value Std. Error    t value     p value
## pared                    -0.26352826  0.2876425 -0.9161658 0.359579940
## public                    0.55999663  0.2954713  1.8952655 0.058057240
## gpa                      -0.07080247  0.2533390 -0.2794772 0.779878661
## somewhat likely|unlikely -0.79587821  0.7516351 -1.0588625 0.289662386
## unlikely|very likely      2.04454674  0.7605593  2.6882148 0.007183518

(ci <- confint(m)) # default method gives profiled CIs

## Waiting for profiling to be done...

##              2.5 %    97.5 %
## pared  -0.82849916 0.3019309
## public -0.01549255 1.1442197
## gpa    -0.56786483 0.4264723

confint.default(m) # CIs assuming normality

##              2.5 %    97.5 %
## pared  -0.82729727 0.3002408
## public -0.01911656 1.1391098
## gpa    -0.56733779 0.4257329

## odds ratios
exp(coef(m))

##     pared    public       gpa 
## 0.7683359 1.7506666 0.9316459

##  pared public    gpa 
## 2.8511 0.9429 1.8514
 
## OR and CI
exp(cbind(OR = coef(m), ci))

##               OR     2.5 %   97.5 %
## pared  0.7683359 0.4367042 1.352468
## public 1.7506666 0.9846268 3.139990
## gpa    0.9316459 0.5667342 1.531844

sf <- function(y) {
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)))
}

(s <- with(dat, summary(as.numeric(apply) ~ pared + public + gpa, fun=sf)))

##  Length   Class    Mode 
##       3 formula    call

glm(I(as.numeric(apply) >= 2) ~ pared, family="binomial", data = dat)

## 
## Call:  glm(formula = I(as.numeric(apply) >= 2) ~ pared, family = "binomial", 
##     data = dat)
## 
## Coefficients:
## (Intercept)        pared  
##      0.7245      -0.6292  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       518 
## Residual Deviance: 512.9     AIC: 516.9

glm(I(as.numeric(apply) >= 3) ~ pared, family="binomial", data = dat)

## 
## Call:  glm(formula = I(as.numeric(apply) >= 3) ~ pared, family = "binomial", 
##     data = dat)
## 
## Coefficients:
## (Intercept)        pared  
##      -2.441        1.094  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       260.1 
## Residual Deviance: 252.2     AIC: 256.2

newdat <- data.frame(
  pared = rep(0:1, 200),
  public = rep(0:1, each = 200),
  gpa = rep(seq(from = 1.9, to = 4, length.out = 100), 4))

newdat <- cbind(newdat, predict(m, newdat, type = "probs"))

##show first few rows
head(newdat)

##   pared public      gpa somewhat likely  unlikely very likely
## 1     0      0 1.900000       0.3404356 0.5579187  0.10164569
## 2     1      0 1.921212       0.4021960 0.5179330  0.07987096
## 3     0      0 1.942424       0.3411104 0.5575179  0.10137173
## 4     1      0 1.963636       0.4029184 0.5174311  0.07965049
## 5     0      0 1.984848       0.3417858 0.5571157  0.10109843
## 6     1      0 2.006061       0.4036413 0.5169282  0.07943058

lnewdat <- melt(newdat, id.vars = c("pared", "public", "gpa"),
  variable.name = "Level", value.name="Probability")
## view first few rows
head(lnewdat)

##   pared public      gpa           Level Probability
## 1     0      0 1.900000 somewhat likely   0.3404356
## 2     1      0 1.921212 somewhat likely   0.4021960
## 3     0      0 1.942424 somewhat likely   0.3411104
## 4     1      0 1.963636 somewhat likely   0.4029184
## 5     0      0 1.984848 somewhat likely   0.3417858
## 6     1      0 2.006061 somewhat likely   0.4036413

ggplot(lnewdat, aes(x = gpa, y = Probability, colour = Level)) +
  geom_line() + facet_grid(pared ~ public, labeller="label_both")